
BigQuery

Analyze large datasets with Google BigQuery serverless data warehouse and SQL engine

Quick Summary
You are an expert in Google BigQuery for serverless, petabyte-scale data analytics and SQL-based data warehousing.

## Key Points

- **Running `SELECT *` on wide tables** -- BigQuery charges by bytes scanned per column. Selecting all columns when you need three wastes money and slows queries.
- **Querying unpartitioned tables without filters** -- A full table scan on a multi-terabyte table can cost more than a month of partitioned queries. Always partition and filter.
- **Using streaming inserts for historical backfills** -- Streaming is designed for real-time. Backfilling millions of rows via streaming inserts is expensive and hits quota limits. Use batch loads.
- Petabyte-scale columnar storage with automatic compression
- Serverless query execution with on-demand or flat-rate pricing
- Standard SQL with support for nested/repeated fields, UDFs, and scripting
- Streaming inserts and batch loading from Cloud Storage, Sheets, and external sources
- Built-in ML (BigQuery ML), geospatial analytics (GIS), and BI Engine for dashboards
- Materialized views, scheduled queries, and data transfer service
- **Partition tables by date.** Time-based partitioning reduces query costs by scanning only relevant partitions.
- **Cluster frequently filtered columns.** Clustering on high-cardinality columns (e.g., user_id, country) further reduces bytes scanned.
- **Use `SELECT` with specific columns.** Avoid `SELECT *` on wide tables; BigQuery charges by bytes scanned per column.

## Quick Example

```bash
gcloud services enable bigquery.googleapis.com
```

```bash
bq mk --dataset \
  --location=US \
  --description="Analytics dataset" \
  my_project:analytics
```

GCP Service — BigQuery

You are an expert in Google BigQuery for serverless, petabyte-scale data analytics and SQL-based data warehousing.

Core Philosophy

BigQuery is built for analytical workloads where you scan large volumes of data to produce insights. It is not a transactional database and should never be used as one. Its columnar storage and massively parallel query engine make it extraordinarily fast at aggregations, joins, and window functions over terabytes of data, but every query scans (and bills) by the byte. The fundamental discipline of BigQuery usage is controlling how much data each query touches.

Partitioning and clustering are not optional performance tuning -- they are cost controls. A query on an unpartitioned, unclustered multi-terabyte table can cost hundreds of dollars in a single execution. Partition by the time column you most frequently filter on, and cluster by the high-cardinality columns you group or filter by. Together, these reduce bytes scanned by orders of magnitude and make BigQuery economically viable for frequent queries.
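To make the arithmetic concrete, here is a stdlib-only sketch of the cost difference. The ~$6.25/TiB on-demand rate and the evenly sized partitions are assumptions for illustration; check current regional pricing before relying on the numbers.

```python
# Rough on-demand cost model: BigQuery bills by bytes scanned.
# Assumed rate: ~$6.25 per TiB (verify against current pricing).
TIB = 1024 ** 4
PRICE_PER_TIB = 6.25

def query_cost_usd(bytes_scanned: int) -> float:
    """Estimated on-demand cost for a query that scans `bytes_scanned`."""
    return bytes_scanned / TIB * PRICE_PER_TIB

# A 5 TiB table, partitioned into 365 daily partitions of equal size.
table_bytes = 5 * TIB
one_day = table_bytes / 365

full_scan = query_cost_usd(table_bytes)   # unpartitioned: scan everything
week_scan = query_cost_usd(7 * one_day)   # partitioned + date filter: 7 partitions

print(f"full scan ~${full_scan:.2f}, 7-day scan ~${week_scan:.2f}")
# → full scan ~$31.25, 7-day scan ~$0.60
```

Run daily, the difference compounds: thirty full scans cost roughly fifty times what thirty partition-pruned scans do.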

Prefer batch loading over streaming inserts whenever latency allows. Batch loads from Cloud Storage are free, while streaming inserts charge per row and per byte. Reserve streaming for use cases that genuinely need sub-minute latency, like real-time dashboards or event-driven pipelines. For everything else -- daily ETL, backfills, log imports -- batch is cheaper and simpler.
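The backfill case is worth quantifying. The rates below are assumptions (legacy streaming inserts have billed around $0.05 per GiB with a 1 KiB per-row minimum; batch loads from Cloud Storage carry no ingestion charge); verify them against current pricing. The 1 KiB minimum is what makes small-row backfills disproportionately expensive:

```python
# Streaming-insert cost for a backfill, under assumed legacy rates:
# ~$0.05 per GiB with each row billed as at least 1 KiB.
GIB = 1024 ** 3
STREAMING_PER_GIB = 0.05
MIN_ROW_BYTES = 1024  # per-row billing minimum (assumed)

def streaming_cost_usd(rows: int, avg_row_bytes: int) -> float:
    billed = rows * max(avg_row_bytes, MIN_ROW_BYTES)
    return billed / GIB * STREAMING_PER_GIB

# Backfilling 500M small (300-byte) events: the 1 KiB minimum dominates.
backfill = streaming_cost_usd(500_000_000, 300)
batch = 0.0  # batch load from Cloud Storage has no ingestion charge
print(f"streaming ~${backfill:.2f} vs batch ${batch:.2f}")
```

The same data loaded in batch from Cloud Storage costs nothing to ingest and avoids streaming quota limits entirely.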

Anti-Patterns

  • Running SELECT * on wide tables -- BigQuery charges by bytes scanned per column. Selecting all columns when you need three wastes money and slows queries.
  • Querying unpartitioned tables without filters -- A full table scan on a multi-terabyte table can cost more than a month of partitioned queries. Always partition and filter.
  • Using streaming inserts for historical backfills -- Streaming is designed for real-time. Backfilling millions of rows via streaming inserts is expensive and hits quota limits. Use batch loads.
  • Joining large tables without pre-filtering -- Apply WHERE clauses before joins to reduce the data shuffled across BigQuery's execution slots. Unfiltered joins on large tables consume excessive slots and run slowly.
  • Not setting --maximum_bytes_billed -- Without a billing cap, a single misconfigured query can scan your entire dataset. Always set a bytes-billed limit on ad-hoc queries to prevent bill shock.
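The last two anti-patterns combine naturally into a pre-flight check: dry-run the query to estimate bytes scanned, then refuse to run it if the estimate exceeds a cap. This stdlib-only sketch mirrors the guard that --maximum_bytes_billed enforces server-side (the function name and cap are illustrative, not part of any BigQuery API):

```python
# Client-side guard mirroring --maximum_bytes_billed: given a dry-run
# estimate of bytes to be scanned, refuse queries above a cap.
def check_bytes_billed(estimated_bytes: int, maximum_bytes_billed: int) -> bool:
    """Return True if the query may run under the cap, else raise."""
    if estimated_bytes > maximum_bytes_billed:
        raise RuntimeError(
            f"query would scan {estimated_bytes} bytes, "
            f"cap is {maximum_bytes_billed}"
        )
    return True

# A 10 GiB cap lets a 2 GiB query through but blocks a 5 TiB full scan.
GIB = 1024 ** 3
assert check_bytes_billed(2 * GIB, 10 * GIB)
try:
    check_bytes_billed(5 * 1024 * GIB, 10 * GIB)
except RuntimeError as exc:
    print(exc)
```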

Overview

BigQuery is a fully managed, serverless data warehouse that enables fast SQL analytics over massive datasets. It separates storage and compute, allowing independent scaling of each. BigQuery supports standard SQL, streaming ingestion, machine learning (BQML), and integrations with BI tools.

Key capabilities:

  • Petabyte-scale columnar storage with automatic compression
  • Serverless query execution with on-demand or flat-rate pricing
  • Standard SQL with support for nested/repeated fields, UDFs, and scripting
  • Streaming inserts and batch loading from Cloud Storage, Sheets, and external sources
  • Built-in ML (BigQuery ML), geospatial analytics (GIS), and BI Engine for dashboards
  • Materialized views, scheduled queries, and data transfer service

Setup & Configuration

Enable the API

```bash
gcloud services enable bigquery.googleapis.com
```

Create a dataset

```bash
bq mk --dataset \
  --location=US \
  --description="Analytics dataset" \
  my_project:analytics
```

Create a table from schema

```bash
bq mk --table \
  analytics.events \
  event_id:STRING,user_id:STRING,event_type:STRING,timestamp:TIMESTAMP,payload:JSON
```

Load data from Cloud Storage

```bash
bq load \
  --source_format=NEWLINE_DELIMITED_JSON \
  --autodetect \
  analytics.events \
  gs://my-bucket/events/*.json
```

Load CSV with schema

```bash
bq load \
  --source_format=CSV \
  --skip_leading_rows=1 \
  analytics.sales \
  gs://my-bucket/sales.csv \
  order_id:STRING,product:STRING,amount:NUMERIC,date:DATE
```

Core Patterns

Standard SQL queries

```sql
-- Aggregate daily active users
SELECT
  DATE(timestamp) AS day,
  COUNT(DISTINCT user_id) AS dau
FROM `my_project.analytics.events`
WHERE timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY day
ORDER BY day DESC;
```
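The semantics of this query, restated in plain Python over an in-memory event list (a sketch of what COUNT(DISTINCT ...) GROUP BY computes, not a substitute for running it in BigQuery):

```python
from collections import defaultdict

# COUNT(DISTINCT user_id) GROUP BY DATE(timestamp), in plain Python:
# collect the set of users seen per day, then count each set.
events = [
    {"user_id": "u1", "timestamp": "2025-01-15T10:30:00Z"},
    {"user_id": "u2", "timestamp": "2025-01-15T11:00:00Z"},
    {"user_id": "u1", "timestamp": "2025-01-15T12:00:00Z"},  # same user, same day
    {"user_id": "u1", "timestamp": "2025-01-16T09:00:00Z"},
]

users_per_day = defaultdict(set)
for e in events:
    day = e["timestamp"][:10]  # DATE(timestamp)
    users_per_day[day].add(e["user_id"])

dau = {day: len(users) for day, users in sorted(users_per_day.items(), reverse=True)}
print(dau)  # → {'2025-01-16': 1, '2025-01-15': 2}
```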

Working with nested and repeated fields

```sql
-- Query nested structs and arrays
SELECT
  order_id,
  customer.name AS customer_name,
  item.product_name,
  item.quantity
FROM `my_project.analytics.orders`,
  UNNEST(items) AS item
WHERE customer.tier = 'premium'
  AND item.quantity > 5;
```
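If UNNEST is unfamiliar, it helps to see it as a per-row cross join: each order row is repeated once per element of its items array, then filtered. A stdlib-only Python sketch of the same logic (sample data is made up):

```python
# UNNEST(items) AS item is a per-row cross join over the array column.
orders = [
    {
        "order_id": "o-1",
        "customer": {"name": "Ada", "tier": "premium"},
        "items": [
            {"product_name": "widget", "quantity": 10},
            {"product_name": "gadget", "quantity": 2},
        ],
    },
    {
        "order_id": "o-2",
        "customer": {"name": "Bob", "tier": "basic"},
        "items": [{"product_name": "widget", "quantity": 7}],
    },
]

rows = [
    (o["order_id"], o["customer"]["name"], item["product_name"], item["quantity"])
    for o in orders
    for item in o["items"]                    # UNNEST(items) AS item
    if o["customer"]["tier"] == "premium"     # WHERE customer.tier = 'premium'
    and item["quantity"] > 5                  # AND item.quantity > 5
]
print(rows)  # → [('o-1', 'Ada', 'widget', 10)]
```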

Window functions

```sql
SELECT
  user_id,
  event_type,
  timestamp,
  ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY timestamp) AS event_sequence,
  LAG(timestamp) OVER (PARTITION BY user_id ORDER BY timestamp) AS prev_event_time
FROM `my_project.analytics.events`
WHERE DATE(timestamp) = CURRENT_DATE();
```
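What ROW_NUMBER() and LAG() compute per partition can be sketched with sort-then-group in plain Python (sample rows are made up; this only illustrates the window semantics):

```python
from itertools import groupby
from operator import itemgetter

# ROW_NUMBER() and LAG() OVER (PARTITION BY user_id ORDER BY timestamp):
# sort by (partition key, order key), then walk each group in order.
events = [
    {"user_id": "u1", "event_type": "view", "timestamp": "2025-01-15T10:00:00Z"},
    {"user_id": "u1", "event_type": "click", "timestamp": "2025-01-15T10:05:00Z"},
    {"user_id": "u2", "event_type": "view", "timestamp": "2025-01-15T09:00:00Z"},
]

events.sort(key=itemgetter("user_id", "timestamp"))  # partition, then order
out = []
for user_id, group in groupby(events, key=itemgetter("user_id")):
    prev_ts = None  # LAG() is NULL for the first row in each partition
    for seq, e in enumerate(group, start=1):  # ROW_NUMBER() starts at 1
        out.append((user_id, e["event_type"], seq, prev_ts))
        prev_ts = e["timestamp"]

print(out)
```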

Streaming inserts (Python)

```python
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my_project.analytics.events"

rows_to_insert = [
    {
        "event_id": "evt-001",
        "user_id": "user-123",
        "event_type": "page_view",
        "timestamp": "2025-01-15T10:30:00Z",
    },
]

# Pass row_ids for best-effort deduplication when inserts are retried.
errors = client.insert_rows_json(
    table_id,
    rows_to_insert,
    row_ids=[row["event_id"] for row in rows_to_insert],
)
if errors:
    print(f"Insert errors: {errors}")
```

Query from client library (Node.js)

```javascript
const { BigQuery } = require('@google-cloud/bigquery');
const bigquery = new BigQuery();

async function queryActiveUsers() {
  const query = `
    SELECT user_id, COUNT(*) as event_count
    FROM \`my_project.analytics.events\`
    WHERE timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
    GROUP BY user_id
    ORDER BY event_count DESC
    LIMIT 100
  `;

  const [rows] = await bigquery.query({ query, location: 'US' });
  rows.forEach(row => console.log(row.user_id, row.event_count));
}
```

Scheduled queries

```bash
bq mk --transfer_config \
  --target_dataset=analytics \
  --display_name="Daily aggregation" \
  --data_source=scheduled_query \
  --schedule="every 24 hours" \
  --params='{
    "query": "INSERT INTO analytics.daily_summary SELECT DATE(timestamp) as day, COUNT(*) as total FROM analytics.events WHERE DATE(timestamp) = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY) GROUP BY day",
    "destination_table_name_template": "",
    "write_disposition": "WRITE_APPEND"
  }'
```

BigQuery ML (simple model)

```sql
-- Create a logistic regression model
CREATE OR REPLACE MODEL `analytics.churn_model`
OPTIONS(model_type='logistic_reg', input_label_cols=['churned']) AS
SELECT
  user_id,
  days_since_last_login,
  total_purchases,
  support_tickets,
  churned
FROM `analytics.user_features`;

-- Predict churn
SELECT *
FROM ML.PREDICT(MODEL `analytics.churn_model`,
  (SELECT * FROM `analytics.new_user_features`)
);
```
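Schematically, what ML.PREDICT computes for a logistic_reg model is a sigmoid over a weighted sum of the features. The weights and bias below are made up for illustration; BigQuery ML learns them during CREATE MODEL:

```python
import math

# Logistic regression prediction: p = sigmoid(w . x + b),
# predicted label is 1 (churned) when p >= 0.5. Weights are illustrative.
def predict_churn(features, weights, bias):
    z = sum(w * x for w, x in zip(weights, features)) + bias
    p = 1.0 / (1.0 + math.exp(-z))
    return p, p >= 0.5

# features: [days_since_last_login, total_purchases, support_tickets]
p, churned = predict_churn([30, 1, 4], weights=[0.1, -0.8, 0.5], bias=-2.0)
print(f"churn probability {p:.3f}, predicted churned={churned}")
```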

Partitioned and clustered tables

```sql
CREATE TABLE `analytics.events_partitioned` (
  event_id STRING,
  user_id STRING,
  event_type STRING,
  timestamp TIMESTAMP,
  country STRING
)
PARTITION BY DATE(timestamp)
CLUSTER BY country, event_type;
```

Best Practices

  • Partition tables by date. Time-based partitioning reduces query costs by scanning only relevant partitions.
  • Cluster frequently filtered columns. Clustering on high-cardinality columns (e.g., user_id, country) further reduces bytes scanned.
  • Use SELECT with specific columns. Avoid SELECT * on wide tables; BigQuery charges by bytes scanned per column.
  • Preview with LIMIT and dry runs. Use --dry_run with bq query to estimate cost before running expensive queries.
  • Use materialized views for repeated aggregations. They are auto-refreshed and BigQuery can rewrite queries to use them.
  • Prefer batch loading over streaming inserts when latency allows. Batch loads are free; streaming inserts incur per-row charges.
  • Use authorized views for row-level security. Grant access to a view rather than the underlying table to control what data users can see.

Common Pitfalls

  • Full table scans on unpartitioned tables. A query on a multi-terabyte table without filters can cost hundreds of dollars. Always partition.
  • Joining very large tables without filters. Apply WHERE clauses before joins to reduce the data shuffled across slots.
  • Using streaming inserts for backfills. Streaming is designed for real-time; use batch loads for historical data to avoid cost and quotas.
  • Ignoring slot utilization on flat-rate. On-demand queries share a pool; flat-rate reservations have fixed slots. Monitor slot usage to avoid queuing.
  • Not setting query cost controls. Use --maximum_bytes_billed to prevent runaway queries from scanning unexpected amounts of data.
  • Stale streaming buffer. Recently streamed data may not appear immediately in table previews or some DML operations. Use the streaming buffer metadata to check.
