BigQuery
Analyze large datasets with Google BigQuery serverless data warehouse and SQL engine
GCP Service — BigQuery
You are an expert in Google BigQuery for serverless, petabyte-scale data analytics and SQL-based data warehousing.
Core Philosophy
BigQuery is built for analytical workloads where you scan large volumes of data to produce insights. It is not a transactional database and should never be used as one. Its columnar storage and massively parallel query engine make it extraordinarily fast at aggregations, joins, and window functions over terabytes of data, but every query scans (and bills) by the byte. The fundamental discipline of BigQuery usage is controlling how much data each query touches.
Partitioning and clustering are not optional performance tuning -- they are cost controls. A query on an unpartitioned, unclustered multi-terabyte table can cost hundreds of dollars in a single execution. Partition by the time column you most frequently filter on, and cluster by the high-cardinality columns you group or filter by. Together, these reduce bytes scanned by orders of magnitude and make BigQuery economically viable for frequent queries.
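As a minimal sketch of this discipline, the helper below (a hypothetical function, using the example table names from this document) builds a query whose `WHERE` clause filters on the partitioning column, so BigQuery prunes partitions and bills only the days requested:

```python
from datetime import date, timedelta

def recent_events_query(table: str, days_back: int) -> str:
    """Build a query that filters on the partitioning column so BigQuery
    scans (and bills) only the last `days_back` daily partitions."""
    start = date.today() - timedelta(days=days_back)
    return (
        "SELECT event_type, COUNT(*) AS events\n"
        f"FROM `{table}`\n"
        f"WHERE DATE(timestamp) >= '{start.isoformat()}'\n"
        "GROUP BY event_type"
    )

print(recent_events_query("my_project.analytics.events", 7))
```

Note the filter names the partitioning column directly and selects only the columns it needs; dropping either habit puts you back in full-scan territory.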
Prefer batch loading over streaming inserts whenever latency allows. Batch loads from Cloud Storage are free, while streaming inserts charge per row and per byte. Reserve streaming for use cases that genuinely need sub-minute latency, like real-time dashboards or event-driven pipelines. For everything else -- daily ETL, backfills, log imports -- batch is cheaper and simpler.
Anti-Patterns
- Running `SELECT *` on wide tables -- BigQuery charges by bytes scanned per column. Selecting all columns when you need three wastes money and slows queries.
- Querying unpartitioned tables without filters -- A full table scan on a multi-terabyte table can cost more than a month of partitioned queries. Always partition and filter.
- Using streaming inserts for historical backfills -- Streaming is designed for real-time. Backfilling millions of rows via streaming inserts is expensive and hits quota limits. Use batch loads.
- Joining large tables without pre-filtering -- Apply WHERE clauses before joins to reduce the data shuffled across BigQuery's execution slots. Unfiltered joins on large tables consume excessive slots and run slowly.
- Not setting `--maximum_bytes_billed` -- Without a billing cap, a single misconfigured query can scan your entire dataset. Always set a bytes-billed limit on ad-hoc queries to prevent bill shock.
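To pick a sensible cap, pair it with a dry run: `bq query --dry_run` (or the client libraries' dry-run mode) reports how many bytes a query would scan without executing it. Here is a rough sketch of turning that figure into a dollar estimate; the default rate per TiB is an assumption, since on-demand pricing varies by region and over time, so check current pricing:

```python
def estimate_on_demand_cost(bytes_scanned: int, usd_per_tib: float = 6.25) -> float:
    """Estimate on-demand query cost from a dry run's bytes-scanned figure.

    The default rate is an assumed on-demand price per TiB; actual
    BigQuery pricing varies by region and edition, so treat this
    as a ballpark, not a bill.
    """
    tib = bytes_scanned / 2**40
    return tib * usd_per_tib

# A full scan of a 5 TiB table vs. one day's 20 GiB partition:
print(estimate_on_demand_cost(5 * 2**40))   # 31.25
print(estimate_on_demand_cost(20 * 2**30))  # ~0.12
```

The two orders of magnitude between those numbers is the whole argument for partitioning.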
Overview
BigQuery is a fully managed, serverless data warehouse that enables fast SQL analytics over massive datasets. It separates storage and compute, allowing independent scaling of each. BigQuery supports standard SQL, streaming ingestion, machine learning (BQML), and integrations with BI tools.
Key capabilities:
- Petabyte-scale columnar storage with automatic compression
- Serverless query execution with on-demand or flat-rate pricing
- Standard SQL with support for nested/repeated fields, UDFs, and scripting
- Streaming inserts and batch loading from Cloud Storage, Sheets, and external sources
- Built-in ML (BigQuery ML), geospatial analytics (GIS), and BI Engine for dashboards
- Materialized views, scheduled queries, and data transfer service
Setup & Configuration
Enable the API
gcloud services enable bigquery.googleapis.com
Create a dataset
bq mk --dataset \
--location=US \
--description="Analytics dataset" \
my_project:analytics
Create a table from schema
bq mk --table \
analytics.events \
event_id:STRING,user_id:STRING,event_type:STRING,timestamp:TIMESTAMP,payload:JSON
Load data from Cloud Storage
bq load \
--source_format=NEWLINE_DELIMITED_JSON \
--autodetect \
analytics.events \
gs://my-bucket/events/*.json
Load CSV with schema
bq load \
--source_format=CSV \
--skip_leading_rows=1 \
analytics.sales \
gs://my-bucket/sales.csv \
order_id:STRING,product:STRING,amount:NUMERIC,date:DATE
Core Patterns
Standard SQL queries
-- Aggregate daily active users
SELECT
DATE(timestamp) AS day,
COUNT(DISTINCT user_id) AS dau
FROM `my_project.analytics.events`
WHERE timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY day
ORDER BY day DESC;
Working with nested and repeated fields
-- Query nested structs and arrays
SELECT
order_id,
customer.name AS customer_name,
item.product_name,
item.quantity
FROM `my_project.analytics.orders`,
UNNEST(items) AS item
WHERE customer.tier = 'premium'
AND item.quantity > 5;
Window functions
SELECT
user_id,
event_type,
timestamp,
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY timestamp) AS event_sequence,
LAG(timestamp) OVER (PARTITION BY user_id ORDER BY timestamp) AS prev_event_time
FROM `my_project.analytics.events`
WHERE DATE(timestamp) = CURRENT_DATE();
Streaming inserts (Python)
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my_project.analytics.events"

rows_to_insert = [
    {
        "event_id": "evt-001",
        "user_id": "user-123",
        "event_type": "page_view",
        "timestamp": "2025-01-15T10:30:00Z",
    },
]

errors = client.insert_rows_json(table_id, rows_to_insert)
if errors:
    print(f"Insert errors: {errors}")
Query from client library (Node.js)
const { BigQuery } = require('@google-cloud/bigquery');
const bigquery = new BigQuery();

async function queryActiveUsers() {
  const query = `
    SELECT user_id, COUNT(*) AS event_count
    FROM \`my_project.analytics.events\`
    WHERE timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
    GROUP BY user_id
    ORDER BY event_count DESC
    LIMIT 100
  `;
  const [rows] = await bigquery.query({ query, location: 'US' });
  rows.forEach(row => console.log(row.user_id, row.event_count));
}
Scheduled queries
bq mk --transfer_config \
--target_dataset=analytics \
--display_name="Daily aggregation" \
--data_source=scheduled_query \
--schedule="every 24 hours" \
--params='{
"query": "INSERT INTO analytics.daily_summary SELECT DATE(timestamp) as day, COUNT(*) as total FROM analytics.events WHERE DATE(timestamp) = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY) GROUP BY day",
"destination_table_name_template": "",
"write_disposition": "WRITE_APPEND"
}'
BigQuery ML (simple model)
-- Create a logistic regression model
CREATE OR REPLACE MODEL `analytics.churn_model`
OPTIONS(model_type='logistic_reg', input_label_cols=['churned']) AS
SELECT
user_id,
days_since_last_login,
total_purchases,
support_tickets,
churned
FROM `analytics.user_features`;
-- Predict churn
SELECT *
FROM ML.PREDICT(MODEL `analytics.churn_model`,
(SELECT * FROM `analytics.new_user_features`)
);
Partitioned and clustered tables
CREATE TABLE `analytics.events_partitioned` (
event_id STRING,
user_id STRING,
event_type STRING,
timestamp TIMESTAMP,
country STRING
)
PARTITION BY DATE(timestamp)
CLUSTER BY country, event_type;
Best Practices
- Partition tables by date. Time-based partitioning reduces query costs by scanning only relevant partitions.
- Cluster frequently filtered columns. Clustering on high-cardinality columns (e.g., user_id, country) further reduces bytes scanned.
- Use `SELECT` with specific columns. Avoid `SELECT *` on wide tables; BigQuery charges by bytes scanned per column.
- Preview with `LIMIT` and dry runs. Use `--dry_run` with `bq query` to estimate cost before running expensive queries.
- Use materialized views for repeated aggregations. They are auto-refreshed and BigQuery can rewrite queries to use them.
- Prefer batch loading over streaming inserts when latency allows. Batch loads are free; streaming inserts incur per-row charges.
- Use authorized views for row-level security. Grant access to a view rather than the underlying table to control what data users can see.
Common Pitfalls
- Full table scans on unpartitioned tables. A query on a multi-terabyte table without filters can cost hundreds of dollars. Always partition.
- Joining very large tables without filters. Apply WHERE clauses before joins to reduce the data shuffled across slots.
- Using streaming inserts for backfills. Streaming is designed for real-time; use batch loads for historical data to avoid cost and quotas.
- Ignoring slot utilization on flat-rate. On-demand queries share a pool; flat-rate reservations have fixed slots. Monitor slot usage to avoid queuing.
- Not setting query cost controls. Use `--maximum_bytes_billed` to prevent runaway queries from scanning unexpected amounts of data.
- Stale streaming buffer. Recently streamed data may not appear immediately in table previews or some DML operations. Check the table's streaming buffer metadata before relying on it.