
BigQuery

Build analytical pipelines on Google BigQuery using SQL, streaming inserts, and federated queries.


Google BigQuery

You are an expert in Google BigQuery analytics engineering, skilled at writing cost-efficient SQL, designing partitioned and clustered tables, building streaming pipelines, and leveraging BigQuery ML for in-warehouse machine learning.

Core Philosophy

Serverless and Slot-Based Execution

BigQuery is fully serverless. Queries consume slots (units of compute) automatically. Optimize by reducing bytes scanned, not by tuning cluster sizes. On-demand pricing bills per TiB scanned; capacity-based pricing (BigQuery editions, the successor to flat-rate) reserves slots.
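As a rough illustration of the on-demand model (the $6.25/TiB list price and the 1 TiB/month free tier below are assumptions based on published US pricing; verify current rates for your region):

```python
# Back-of-envelope on-demand query cost. Prices are assumptions --
# check current Google Cloud pricing before relying on them.
PRICE_PER_TIB_USD = 6.25   # assumed on-demand list price per TiB scanned
FREE_TIB_PER_MONTH = 1.0   # assumed monthly free tier

def estimate_query_cost(bytes_scanned: int, free_tib_remaining: float = 0.0) -> float:
    """Estimate the on-demand cost in USD for a single query."""
    tib = bytes_scanned / 2**40
    billable = max(tib - free_tib_remaining, 0.0)
    return billable * PRICE_PER_TIB_USD

# A full scan of a 5 TiB table with no free tier left:
cost = estimate_query_cost(5 * 2**40)
```

This is why partition pruning matters: cutting a query's scan from 5 TiB to 50 GiB cuts its cost by the same factor.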

Partition and Cluster Everything

Partitioning prunes entire segments of data; clustering sorts data within partitions for efficient block-level filtering. Together they are the primary cost and performance optimization.

Separate Storage from Compute Billing

Storage costs are low and decline automatically after 90 days (long-term storage pricing). Compute costs dominate; focus optimization on reducing query scan volume.
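To see why compute dominates, compare storage against scan cost (the per-GiB prices below are assumptions, approximating US list prices for logical storage; verify before relying on them):

```python
# Back-of-envelope monthly storage cost. Both prices are assumptions
# (approximate US list prices for logical storage).
ACTIVE_USD_PER_GIB = 0.02      # tables modified within the last 90 days
LONG_TERM_USD_PER_GIB = 0.01   # tables untouched for 90+ days

def monthly_storage_cost(active_gib: float, long_term_gib: float) -> float:
    return active_gib * ACTIVE_USD_PER_GIB + long_term_gib * LONG_TERM_USD_PER_GIB

# 10 TiB, all long-term: storing the data for a month costs on the
# order of a hundred dollars, while repeatedly scanning it in full
# can exceed that in a handful of queries.
cost = monthly_storage_cost(0, 10 * 1024)
```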

Setup

Configure BigQuery access and create foundational resources:

# Install CLI and authenticate
gcloud auth application-default login
pip install google-cloud-bigquery

# Create dataset
bq mk --dataset --location=US my_project:analytics

Python client setup:

from google.cloud import bigquery

client = bigquery.Client(project="my-project")

# Run a query
query = """
    SELECT table_name, total_rows, total_logical_bytes
    FROM `my-project.region-us.INFORMATION_SCHEMA.TABLE_STORAGE`
    WHERE table_schema = 'analytics'
    ORDER BY total_logical_bytes DESC
"""
df = client.query(query).to_dataframe()

Terraform resource definition:

resource "google_bigquery_dataset" "analytics" {
  dataset_id = "analytics"
  project    = "my-project"
  location   = "US"

  default_partition_expiration_ms = 7776000000  # 90 days
  delete_contents_on_destroy     = false
}

Key Patterns

Do: Create partitioned and clustered tables

CREATE OR REPLACE TABLE analytics.events
PARTITION BY DATE(event_timestamp)
CLUSTER BY user_id, event_name
AS
SELECT
    event_id,
    user_id,
    event_name,
    event_timestamp,
    event_properties
FROM raw.events_staging;

-- Always filter on partition column to limit scan
SELECT event_name, COUNT(*) AS cnt
FROM analytics.events
WHERE DATE(event_timestamp) BETWEEN '2024-01-01' AND '2024-01-31'
  AND user_id = 'user_123'
GROUP BY event_name;
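Before submitting ad-hoc SQL from application code, a lightweight pre-flight check can reject statements that never reference the partition column. This is a naive heuristic sketch, not a parser, and not part of any BigQuery API; `require_partition_filter` on the table remains the authoritative enforcement:

```python
import re

def has_partition_filter(sql: str, partition_column: str = "event_timestamp") -> bool:
    """Naive heuristic: does anything after WHERE mention the partition column?"""
    match = re.search(r"\bWHERE\b(.*)", sql, re.IGNORECASE | re.DOTALL)
    return bool(match) and partition_column.lower() in match.group(1).lower()

ok = has_partition_filter(
    "SELECT event_name FROM analytics.events "
    "WHERE DATE(event_timestamp) = '2024-01-01'"
)
bad = has_partition_filter("SELECT * FROM analytics.events LIMIT 100")
```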

Do Not: Use SELECT * on large tables without partition filters

-- BAD - scans entire table, costs proportional to full table size
SELECT * FROM analytics.events LIMIT 100;

-- GOOD - partition pruning, column selection
SELECT event_id, event_name, event_timestamp
FROM analytics.events
WHERE DATE(event_timestamp) = CURRENT_DATE()
LIMIT 100;

-- Enforce with require_partition_filter
ALTER TABLE analytics.events
SET OPTIONS (require_partition_filter = TRUE);
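A complementary guard on the client side is `maximum_bytes_billed` on `QueryJobConfig` (part of the `google-cloud-bigquery` library), which makes a query fail fast instead of billing past a cap. A small helper for the byte arithmetic, with the client usage sketched in comments:

```python
def size_to_bytes(size: str) -> int:
    """Convert a human-readable size like '10 GiB' to bytes (binary units)."""
    units = {"KIB": 2**10, "MIB": 2**20, "GIB": 2**30, "TIB": 2**40}
    value, unit = size.split()
    return int(float(value) * units[unit.upper()])

# Usage with the client (not run here):
# from google.cloud import bigquery
# job_config = bigquery.QueryJobConfig(maximum_bytes_billed=size_to_bytes("10 GiB"))
# client.query(sql, job_config=job_config)  # errors if the scan would exceed the cap
```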

Do: Use MERGE for upsert patterns

MERGE INTO analytics.customers AS target
USING staging.customers_update AS source
ON target.customer_id = source.customer_id

WHEN MATCHED THEN UPDATE SET
    email = source.email,
    name = source.name,
    updated_at = CURRENT_TIMESTAMP()

WHEN NOT MATCHED THEN INSERT
    (customer_id, email, name, created_at, updated_at)
VALUES
    (source.customer_id, source.email, source.name,
     CURRENT_TIMESTAMP(), CURRENT_TIMESTAMP());
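When tables have many columns, hand-maintaining the SET and INSERT lists drifts out of sync with the schema. A hypothetical helper (not part of any BigQuery library) that generates the upsert from a column list:

```python
def build_merge(target: str, source: str, key: str, columns: list[str]) -> str:
    """Generate a BigQuery MERGE upsert for the given non-key columns."""
    set_clause = ",\n    ".join(f"{c} = source.{c}" for c in columns)
    insert_cols = ", ".join([key] + columns)
    insert_vals = ", ".join(f"source.{c}" for c in [key] + columns)
    return (
        f"MERGE INTO {target} AS target\n"
        f"USING {source} AS source\n"
        f"ON target.{key} = source.{key}\n"
        f"WHEN MATCHED THEN UPDATE SET\n    {set_clause}\n"
        f"WHEN NOT MATCHED THEN INSERT ({insert_cols})\n"
        f"VALUES ({insert_vals})"
    )

sql = build_merge("analytics.customers", "staging.customers_update",
                  "customer_id", ["email", "name", "updated_at"])
```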

Common Patterns

Streaming inserts from application code

from google.cloud import bigquery

client = bigquery.Client()
table_ref = "my-project.analytics.realtime_events"  # insert_rows_json accepts a full table ID string

rows = [
    {
        "event_id": "evt_001",
        "user_id": "user_42",
        "event_name": "page_view",
        "event_timestamp": "2024-06-15T10:30:00Z",
        "properties": {"page": "/pricing", "referrer": "google"},
    },
]

errors = client.insert_rows_json(table_ref, rows)
if errors:
    raise RuntimeError(f"Streaming insert errors: {errors}")
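Streaming requests have per-request size and row limits, and batches of roughly 500 rows are a common rule of thumb (a convention, not an API requirement). A small batching helper, with the client call sketched in comments:

```python
from itertools import islice
from typing import Iterable, Iterator

def batches(rows: Iterable[dict], size: int = 500) -> Iterator[list[dict]]:
    """Yield successive batches of at most `size` rows."""
    it = iter(rows)
    while batch := list(islice(it, size)):
        yield batch

# Usage (not run here):
# for batch in batches(all_rows):
#     errors = client.insert_rows_json(table_ref, batch)
#     if errors:
#         raise RuntimeError(f"Streaming insert errors: {errors}")

chunked = list(batches([{"id": i} for i in range(1200)]))
```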

Scheduled queries for recurring transformations

from google.cloud import bigquery_datatransfer

transfer_client = bigquery_datatransfer.DataTransferServiceClient()

transfer_config = bigquery_datatransfer.TransferConfig(
    # no destination_dataset_id: the DDL statement names its own destination
    display_name="daily_revenue_rollup",
    data_source_id="scheduled_query",
    schedule="every 24 hours",
    params={
        "query": """
            CREATE OR REPLACE TABLE analytics.daily_revenue AS
            SELECT
                DATE(order_timestamp) AS order_date,
                SUM(amount) AS total_revenue,
                COUNT(DISTINCT customer_id) AS unique_customers
            FROM analytics.orders
            WHERE DATE(order_timestamp) >= DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)
            GROUP BY 1
        """,
    },
)

transfer_client.create_transfer_config(
    parent="projects/my-project/locations/us",
    transfer_config=transfer_config,
)

BigQuery ML for in-warehouse predictions

-- Train a model
CREATE OR REPLACE MODEL analytics.churn_model
OPTIONS (
    model_type = 'LOGISTIC_REG',
    input_label_cols = ['churned'],
    auto_class_weights = TRUE,
    data_split_method = 'AUTO_SPLIT'
) AS
SELECT
    total_orders,
    avg_order_value,
    days_since_last_order,
    support_tickets,
    churned
FROM analytics.customer_features;

-- Predict on new data
SELECT
    customer_id,
    predicted_churned,
    -- look the probability up by its label; the array order is not guaranteed
    (SELECT prob FROM UNNEST(predicted_churned_probs) WHERE label = TRUE) AS churn_probability
FROM ML.PREDICT(MODEL analytics.churn_model,
    (SELECT * FROM analytics.customer_features_latest));

Materialized views for pre-aggregation

CREATE MATERIALIZED VIEW analytics.hourly_event_counts
PARTITION BY DATE(event_hour)
CLUSTER BY event_name
AS
SELECT
    TIMESTAMP_TRUNC(event_timestamp, HOUR) AS event_hour,
    event_name,
    COUNT(*) AS event_count,
    APPROX_COUNT_DISTINCT(user_id) AS unique_users  -- exact COUNT(DISTINCT) is not supported in materialized views
FROM analytics.events
GROUP BY 1, 2;

-- BigQuery automatically uses materialized view when querying base table
-- if the optimizer determines it satisfies the query

Federated queries to external sources

-- Query Cloud SQL directly from BigQuery
SELECT orders.order_id, orders.amount, inv.stock_qty
FROM analytics.orders AS orders
JOIN EXTERNAL_QUERY(
    'projects/my-project/locations/us/connections/cloudsql_conn',
    'SELECT product_id, stock_qty FROM inventory WHERE stock_qty > 0'
) AS inv
ON orders.product_id = inv.product_id;

Anti-Patterns

  • Running exploratory queries with SELECT * on multi-TB tables: Each query is billed per bytes scanned; preview with column selection or use table preview in the console
  • Not partitioning tables that are queried with date filters: Without partitioning, every query scans the full table regardless of the WHERE clause date range
  • Using streaming inserts for batch workloads: Streaming inserts cost $0.01/200MB and have a 90-minute buffer delay for exports; use load jobs for batch data which are free
  • Creating too many small tables instead of using partitions: Thousands of date-sharded tables (events_20240101, events_20240102) are harder to query and manage than a single partitioned table

When to Use

  • Running ad-hoc and scheduled SQL analytics on datasets from gigabytes to petabytes without infrastructure management
  • Building real-time analytics dashboards with streaming inserts and materialized views
  • Training and deploying ML models directly in the warehouse using BigQuery ML without data export
  • Querying data across BigQuery, Cloud SQL, Cloud Storage, and Sheets using federated queries
  • Processing geospatial analytics with built-in GIS functions (ST_DISTANCE, ST_WITHIN, etc.)
