BigQuery
Build analytical pipelines on Google BigQuery using SQL, streaming inserts, and federated queries.
You are an expert in Google BigQuery analytics engineering, skilled at writing cost-efficient SQL, designing partitioned and clustered tables, building streaming pipelines, and leveraging BigQuery ML for in-warehouse machine learning.
Core Philosophy
Serverless and Slot-Based Execution
BigQuery is fully serverless. Queries consume slots (units of compute) automatically. Optimize by reducing bytes scanned, not by tuning cluster sizes. On-demand pricing charges per TB scanned; flat-rate pricing reserves slots.
Partition and Cluster Everything
Partitioning prunes entire segments of data; clustering sorts data within partitions for efficient block-level filtering. Together they are the primary cost and performance optimization.
Separate Storage from Compute Billing
Storage costs are low and decline automatically after 90 days (long-term storage pricing). Compute costs dominate; focus optimization on reducing query scan volume.
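Because on-demand billing is simply bytes scanned times a per-TiB rate, estimating cost is arithmetic; in practice the bytes figure comes from a dry run (`bigquery.QueryJobConfig(dry_run=True)`, then `job.total_bytes_processed`). A minimal sketch, assuming the $6.25/TiB on-demand list rate (an assumption; verify against current pricing):

```python
# Hypothetical helper: estimate on-demand query cost from bytes scanned.
# The $6.25/TiB rate is an assumption -- check current BigQuery pricing.
PRICE_PER_TIB_USD = 6.25

def estimate_query_cost(bytes_scanned: int, price_per_tib: float = PRICE_PER_TIB_USD) -> float:
    """Return the approximate on-demand cost in USD for a query."""
    tib = bytes_scanned / 2**40  # bytes -> tebibytes
    return tib * price_per_tib

# A full scan of a 5 TiB table:
print(f"${estimate_query_cost(5 * 2**40):.2f}")  # → $31.25
```

Feeding `job.total_bytes_processed` from a dry-run job into this helper gives a cost estimate before any slots are consumed.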
Setup
Configure BigQuery access and create foundational resources:
# Authenticate and install the Python client library
gcloud auth application-default login
pip install google-cloud-bigquery
# Create dataset
bq mk --dataset --location=US my_project:analytics
Python client setup:
from google.cloud import bigquery
client = bigquery.Client(project="my-project")
# Run a query
query = """
SELECT table_id, row_count, size_bytes
FROM `my-project.analytics.__TABLES__`
ORDER BY size_bytes DESC
"""
df = client.query(query).to_dataframe()
Terraform resource definition:
resource "google_bigquery_dataset" "analytics" {
dataset_id = "analytics"
project = "my-project"
location = "US"
default_partition_expiration_ms = 7776000000 # 90 days
delete_contents_on_destroy = false
}
Key Patterns
Do: Create partitioned and clustered tables
CREATE OR REPLACE TABLE analytics.events
PARTITION BY DATE(event_timestamp)
CLUSTER BY user_id, event_name
AS
SELECT
event_id,
user_id,
event_name,
event_timestamp,
event_properties
FROM raw.events_staging;
-- Always filter on partition column to limit scan
SELECT event_name, COUNT(*) AS cnt
FROM analytics.events
WHERE DATE(event_timestamp) BETWEEN '2024-01-01' AND '2024-01-31'
AND user_id = 'user_123'
GROUP BY event_name;
Do Not: Use SELECT * on large tables without partition filters
-- BAD - scans entire table, costs proportional to full table size
SELECT * FROM analytics.events LIMIT 100;
-- GOOD - partition pruning, column selection
SELECT event_id, event_name, event_timestamp
FROM analytics.events
WHERE DATE(event_timestamp) = CURRENT_DATE()
LIMIT 100;
-- Enforce with require_partition_filter
ALTER TABLE analytics.events
SET OPTIONS (require_partition_filter = TRUE);
Do: Use MERGE for upsert patterns
MERGE INTO analytics.customers AS target
USING staging.customers_update AS source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN UPDATE SET
target.email = source.email,
target.name = source.name,
target.updated_at = CURRENT_TIMESTAMP()
WHEN NOT MATCHED THEN INSERT
(customer_id, email, name, created_at, updated_at)
VALUES
(source.customer_id, source.email, source.name,
CURRENT_TIMESTAMP(), CURRENT_TIMESTAMP());
Common Patterns
Streaming inserts from application code
from google.cloud import bigquery
client = bigquery.Client()
table_ref = "my-project.analytics.realtime_events"  # insert_rows_json also accepts a full table ID string
rows = [
{
"event_id": "evt_001",
"user_id": "user_42",
"event_name": "page_view",
"event_timestamp": "2024-06-15T10:30:00Z",
"properties": {"page": "/pricing", "referrer": "google"},
},
]
errors = client.insert_rows_json(table_ref, rows)
if errors:
raise RuntimeError(f"Streaming insert errors: {errors}")
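`insert_rows_json` returns a list of per-row error dicts, each carrying an `index` into the submitted batch, so transient failures can be retried row by row rather than by resending everything. A minimal retry sketch (`insert_with_retry` is a hypothetical helper, not part of the client library; the injectable `insert_fn` keeps it decoupled from the client):

```python
import time

def insert_with_retry(insert_fn, rows, max_attempts=3, base_delay=0.5):
    """Retry streaming inserts, resubmitting only the rows that failed.

    insert_fn mirrors client.insert_rows_json: it takes a list of rows and
    returns a list of error dicts, each with an 'index' into that list.
    (Hypothetical helper -- adapt to your own error handling.)
    """
    pending = list(rows)
    for attempt in range(max_attempts):
        errors = insert_fn(pending)
        if not errors:
            return []  # everything landed
        # Keep only rows that errored; indices refer to the submitted batch.
        failed = sorted({e["index"] for e in errors})
        pending = [pending[i] for i in failed]
        time.sleep(base_delay * 2**attempt)  # exponential backoff
    return pending  # rows still failing after all attempts
```

Usage against the real client would be something like `insert_with_retry(lambda batch: client.insert_rows_json(table_ref, batch), rows)`.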
Scheduled queries for recurring transformations
from google.cloud import bigquery_datatransfer
transfer_client = bigquery_datatransfer.DataTransferServiceClient()
transfer_config = bigquery_datatransfer.TransferConfig(
# destination_dataset_id is omitted: the query is DDL (CREATE OR REPLACE TABLE) and sets its own destination
display_name="daily_revenue_rollup",
data_source_id="scheduled_query",
schedule="every 24 hours",
params={
"query": """
CREATE OR REPLACE TABLE analytics.daily_revenue AS
SELECT
DATE(order_timestamp) AS order_date,
SUM(amount) AS total_revenue,
COUNT(DISTINCT customer_id) AS unique_customers
FROM analytics.orders
WHERE DATE(order_timestamp) >= DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)
GROUP BY 1
""",
},
)
transfer_client.create_transfer_config(
parent="projects/my-project/locations/us",
transfer_config=transfer_config,
)
BigQuery ML for in-warehouse predictions
-- Train a model
CREATE OR REPLACE MODEL analytics.churn_model
OPTIONS (
model_type = 'LOGISTIC_REG',
input_label_cols = ['churned'],
auto_class_weights = TRUE,
data_split_method = 'AUTO_SPLIT'
) AS
SELECT
total_orders,
avg_order_value,
days_since_last_order,
support_tickets,
churned
FROM analytics.customer_features;
-- Predict on new data
SELECT
customer_id,
predicted_churned,
predicted_churned_probs[OFFSET(1)].prob AS churn_probability
FROM ML.PREDICT(MODEL analytics.churn_model,
(SELECT * FROM analytics.customer_features_latest));
Materialized views for pre-aggregation
CREATE MATERIALIZED VIEW analytics.hourly_event_counts
PARTITION BY DATE(event_hour)
CLUSTER BY event_name
AS
SELECT
TIMESTAMP_TRUNC(event_timestamp, HOUR) AS event_hour,
event_name,
COUNT(*) AS event_count,
APPROX_COUNT_DISTINCT(user_id) AS unique_users  -- exact COUNT(DISTINCT) is not supported in materialized views
FROM analytics.events
GROUP BY 1, 2;
-- BigQuery automatically uses materialized view when querying base table
-- if the optimizer determines it satisfies the query
Federated queries to external sources
-- Query Cloud SQL directly from BigQuery
SELECT orders.order_id, orders.amount, inv.stock_qty
FROM analytics.orders AS orders
JOIN EXTERNAL_QUERY(
'projects/my-project/locations/us/connections/cloudsql_conn',
'SELECT product_id, stock_qty FROM inventory WHERE stock_qty > 0'
) AS inv
ON orders.product_id = inv.product_id;
Anti-Patterns
- Running exploratory queries with SELECT * on multi-TB tables: Each query is billed per bytes scanned; preview with column selection or use table preview in the console
- Not partitioning tables that are queried with date filters: Without partitioning, every query scans the full table regardless of the WHERE clause date range
- Using streaming inserts for batch workloads: Streaming inserts cost $0.01/200MB and have a 90-minute buffer delay for exports; use load jobs for batch data, which are free
- Creating too many small tables instead of using partitions: Thousands of date-sharded tables (events_20240101, events_20240102) are harder to query and manage than a single partitioned table
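The streaming-vs-batch gap is easy to quantify: load jobs (`client.load_table_from_json` or `bq load`) ingest the same bytes for free, while streaming inserts bill by volume. A rough estimator using the $0.01/200MB figure above (a sketch; actual billing is per-KB with a 1 KB per-row minimum, so real costs can run higher):

```python
# Rough cost of legacy streaming inserts at the $0.01 / 200 MB rate
# quoted above. Load jobs for the same data cost $0 -- that gap is
# why streaming is an anti-pattern for batch workloads.
STREAMING_PRICE_PER_200MB = 0.01

def streaming_cost(total_bytes: int) -> float:
    """Approximate streaming-insert cost in USD for total_bytes ingested."""
    mb = total_bytes / 1_000_000
    return (mb / 200) * STREAMING_PRICE_PER_200MB

# Pushing 100 GB/day through streaming inserts:
print(round(streaming_cost(100_000_000_000), 2))  # → 5.0
```

At ~$5/day for 100 GB, switching a nightly batch feed from `insert_rows_json` to a load job removes the entire line item.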
When to Use
- Running ad-hoc and scheduled SQL analytics on datasets from gigabytes to petabytes without infrastructure management
- Building real-time analytics dashboards with streaming inserts and materialized views
- Training and deploying ML models directly in the warehouse using BigQuery ML without data export
- Querying data across BigQuery, Cloud SQL, Cloud Storage, and Sheets using federated queries
- Processing geospatial analytics with built-in GIS functions (ST_DISTANCE, ST_WITHIN, etc.)
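ST_DISTANCE returns meters between two GEOGRAPHY values on the WGS84 spheroid. For a quick out-of-warehouse sanity check of such results, a spherical haversine approximation works (hypothetical helper; the spherical model deviates from BigQuery's spheroidal answer by under ~1%):

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius; BigQuery uses the WGS84 spheroid

def haversine_m(lon1, lat1, lon2, lat2):
    """Approximate great-circle distance in meters (cf. ST_DISTANCE)."""
    lon1, lat1, lon2, lat2 = map(radians, (lon1, lat1, lon2, lat2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))

# ~111 km per degree of latitude
print(round(haversine_m(0, 0, 0, 1)))
```

This is only a cross-check; for production work keep the computation in SQL with the GIS functions so it runs next to the data.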
Install this skill directly: skilldb add data-pipeline-services-skills
Related Skills
Airbyte
Configure Airbyte open-source data integration with custom connectors, destinations, and CDC replication.
Apache Airflow
Orchestrate data pipelines using Apache Airflow DAGs, operators, sensors, and XCom.
Apache Spark
Process large-scale data with Apache Spark using PySpark DataFrames, Spark SQL, and structured streaming.
Clickhouse
Build high-performance OLAP queries on ClickHouse using MergeTree engines, materialized views, and aggregations.
DBT
Build and test data transformation pipelines using dbt models, macros, and incremental strategies.
Fivetran
Configure and manage Fivetran connectors for automated data ingestion into warehouses.