1️⃣ Set Up Google Cloud Platform (GCP) and BigQuery
📌 Step 1: Create a Google Cloud Account
- Go to Google Cloud Console
- Sign in or create a new account.
- Set up billing (Google offers new accounts $300 in free credits).
📌 Step 2: Enable BigQuery API
- In Google Cloud Console, open the Navigation Menu → BigQuery.
- Click Enable API if it’s not enabled.
2️⃣ Create a BigQuery Dataset
A dataset is a container for your tables.
📌 Step 1: Create a Dataset
- Open BigQuery Console
- Click + Create Dataset
- Set:
  - Dataset ID: your_dataset
  - Location: US
- Click Create Dataset ✅
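If you prefer SQL over the console UI, the same dataset can also be created with a DDL statement in the query editor — a minimal sketch, assuming your project ID is your_project:
CREATE SCHEMA IF NOT EXISTS `your_project.your_dataset`
OPTIONS (location = 'US');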
3️⃣ Generate a Large Dataset (1M+ rows)
Now, let’s create a sales dataset with 1 million random records.
📌 Step 1: Run the SQL in BigQuery
- In BigQuery Console, open SQL Editor.
- Run the following query:
CREATE OR REPLACE TABLE your_project.your_dataset.raw_sales AS
WITH raw_data AS (
  SELECT
    GENERATE_UUID() AS order_id,
    ARRAY<STRING>['Laptop', 'Phone', 'Tablet', 'Headphones', 'Monitor'][OFFSET(CAST(FLOOR(RAND() * 5) AS INT64))] AS product_name,
    CAST(FLOOR(RAND() * 7000 + 100) AS INT64) AS amount,
    ARRAY<STRING>['USA', 'UK', 'Canada', 'Germany', 'France'][OFFSET(CAST(FLOOR(RAND() * 5) AS INT64))] AS country,
    DATE_ADD(DATE '2024-01-01', INTERVAL CAST(FLOOR(RAND() * 365) AS INT64) DAY) AS order_date,
    CAST(FLOOR(RAND() * 100) AS INT64) AS discount
  FROM UNNEST(GENERATE_ARRAY(1, 1000000))  -- Generates 1 million rows
)
SELECT * FROM raw_data;
- Click Run ✅
✅ This creates a 1M+ row table (raw_sales) in BigQuery.
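To confirm the table was created with the expected volume, you can run a quick row-count check (adjust the project and dataset names to match yours):
SELECT COUNT(*) AS row_count
FROM `your_project.your_dataset.raw_sales`;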
4️⃣ Install and Configure dbt for BigQuery
📌 Step 1: Install dbt
- Open a terminal (macOS, Linux, or Windows with WSL or PowerShell).
- Run:
pip install dbt-bigquery
5️⃣ Initialize a dbt Project
📌 Step 1: Create a New dbt Project
dbt init sales_analytics
cd sales_analytics
✅ This creates a new dbt project folder.
6️⃣ Connect dbt to BigQuery
📌 Step 1: Authenticate dbt with Google Cloud
- In the terminal, run:
gcloud auth application-default login
- Select your Google Cloud account and allow access.
- dbt reads its BigQuery connection settings (project, dataset, auth method) from the profile created during dbt init, stored by default in ~/.dbt/profiles.yml.
- With application-default credentials in place, dbt can now connect to BigQuery! ✅
7️⃣ Write dbt Models (SQL Transformations)
📌 Step 1: Create Macros
What Are dbt Macros?
A macro in dbt is a function written in Jinja that generates SQL dynamically.
Think of it like a SQL template that you can reuse across multiple models.
👉 Instead of repeating SQL logic in different models, use a macro once and call it anywhere.
📄 File: macros/dynamic_partition.sql
{% macro dynamic_partition(column_name, interval) %}
  CASE
    WHEN DATE({{ column_name }}) >= DATE_SUB(CURRENT_DATE(), INTERVAL 1 {{ interval }}) THEN 'recent'
    ELSE 'historical'
  END AS partition_group
{% endmacro %}
Rather than physically partitioning the table, this macro labels each row as 'recent' or 'historical' based on how old the date column is, so the same bucketing logic can be reused across models.
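For instance, calling {{ dynamic_partition('order_date', 'MONTH') }} compiles to SQL roughly like this:
CASE
  WHEN DATE(order_date) >= DATE_SUB(CURRENT_DATE(), INTERVAL 1 MONTH) THEN 'recent'
  ELSE 'historical'
END AS partition_group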
📌 Step 2: Create dbt Models
📄 File: models/staging/stg_sales.sql
{{ config(materialized='view') }}

SELECT
  order_id,
  product_name,
  amount,
  country,
  order_date,
  {{ dynamic_partition('order_date', 'MONTH') }}  -- Dynamic partitioning
FROM `your_project.your_dataset.raw_sales`
✅ Reads from BigQuery’s raw_sales table.
📄 File: models/marts/sales_final.sql
SELECT
  country,
  product_name,
  COUNT(order_id) AS total_orders,
  SUM(amount) AS total_revenue,
  {{ dynamic_partition('order_date', 'MONTH') }}
FROM {{ ref('stg_sales') }}
GROUP BY 1, 2, 5
✅ Aggregates sales by country and product.
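When dbt compiles this model, {{ ref('stg_sales') }} resolves to the fully qualified name of the staging view and the macro expands in place. The generated SQL looks roughly like this (assuming dbt writes to your_dataset):
SELECT
  country,
  product_name,
  COUNT(order_id) AS total_orders,
  SUM(amount) AS total_revenue,
  CASE
    WHEN DATE(order_date) >= DATE_SUB(CURRENT_DATE(), INTERVAL 1 MONTH) THEN 'recent'
    ELSE 'historical'
  END AS partition_group
FROM `your_project.your_dataset.stg_sales`
GROUP BY 1, 2, 5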
8️⃣ Run dbt Pipeline in BigQuery
Run the following commands:
dbt debug # Test connection
dbt run # Run transformations in BigQuery
dbt test # Validate models
dbt docs generate && dbt docs serve # View documentation
✅ dbt now runs SQL directly in BigQuery! 🎯
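Once the run succeeds, the models exist in your dataset and you can query the mart directly in the BigQuery console (again assuming your profile writes to your_dataset):
SELECT country, product_name, total_orders, total_revenue, partition_group
FROM `your_project.your_dataset.sales_final`
ORDER BY total_revenue DESC
LIMIT 10;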
9️⃣ Test and Document Your dbt Project
📌 Step 1: Add Tests in schema.yml
📄 File: models/schema.yml
version: 2

models:
  - name: sales_final
    description: "Final sales model with aggregated revenue"
    columns:
      - name: country
        tests:
          - not_null
      - name: total_orders
        tests:
          - not_null
✅ Ensures data quality in dbt.
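Under the hood, each not_null test compiles to a simple query against the model, and the test passes when it returns zero rows. For the country column it looks roughly like this:
SELECT *
FROM `your_project.your_dataset.sales_final`
WHERE country IS NULL;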
🔟 Set Up a GitHub Repository for Your dbt Project
✅ Step 1: Initialize Git in Your Project Folder
Navigate to your dbt project folder and run:
cd ~/sales_analytics # Go to your project folder
git init # Initialize a Git repository
This will create a .git directory to track your project.
✅ Step 2: Create a .gitignore File
Ensure you don’t commit sensitive files by creating a .gitignore:
nano .gitignore
Add these lines:
# Ignore dbt logs and compiled files
dbt_modules/
dbt_packages/
target/
logs/
# Ignore Google Cloud credentials
*.json
Save and exit (CTRL + X, then Y, then Enter).
✅ Step 3: Create a GitHub Repository
Run this command to create a repo on GitHub (replace your-repo-name with a real name):
gh repo create your-repo-name --public --source=. --remote=origin
If you don’t have the GitHub CLI (gh) installed, you can manually create a repo on GitHub and follow the instructions to add a remote:
git remote add origin https://github.com/your-username/your-repo-name.git
✅ Step 4: Create a README.md File
Run:
nano README.md
Paste this:
# Jinja Advanced Sales - dbt + BigQuery Project 🚀
This project demonstrates advanced **Jinja macros** in dbt, working with **BigQuery**, and handling **1M+ records** dynamically.
## 📌 Features
- ✅ **Dynamic SQL with Jinja** in dbt models
- ✅ **BigQuery integration** with dbt
- ✅ **Airflow DAG** for automated pipeline execution
- ✅ **1M+ sales records** for testing
## 🚀 Setup Instructions
### 1️⃣ Clone the repository
```bash
git clone https://github.com/your-username/your-repo-name.git
cd your-repo-name
```

### 2️⃣ Set up dbt
```bash
pip install dbt-bigquery
```