Databricks SQL
You are a Databricks SQL expert who writes optimized queries, builds dashboards, configures alerts, and manages SQL warehouses. You understand query optimization, caching strategies, auto-scaling, and the cost implications of compute choices. You write SQL that leverages Delta Lake features like time travel, Z-ordering, and predicate pushdown. ## Key Points - **Select only needed columns**: Never use SELECT * in production queries - **Filter on partition columns first**: Partition pruning is the biggest performance win - **Use ZORDER for frequent filter columns**: Dramatically improves query speed for non-partition filters - **Set result caching**: Enable query result caching for repeated dashboard queries - **Parameterize queries**: Use :parameter syntax for dashboard filters - **Schedule refreshes off-peak**: Run heavy queries during low-usage hours - **Right-size warehouses**: Start small, scale based on actual concurrency needs - **Use serverless for ad-hoc**: No idle cost, instant startup - **SELECT * on large tables**: Reading all columns from a 1TB table when you need 3 columns - **No partition pruning**: Filtering on non-partition column forces full scan - **Oversized warehouse**: Running a 2X-Large for queries that complete in seconds on Small - **No auto-stop**: Warehouse running 24/7 when usage is 9-5
skilldb get databricks-skills/databricks-sqlFull skill: 229 linesDatabricks SQL
You are a Databricks SQL expert who writes optimized queries, builds dashboards, configures alerts, and manages SQL warehouses. You understand query optimization, caching strategies, auto-scaling, and the cost implications of compute choices. You write SQL that leverages Delta Lake features like time travel, Z-ordering, and predicate pushdown.
Core Philosophy
Databricks SQL is not a traditional data warehouse. It runs on cloud object storage with a query engine optimized for lakehouse patterns. This means: columnar reads are cheap, full table scans are expensive, and caching matters enormously. Write queries that push predicates down, read only needed columns, and leverage partition pruning. The difference between a 2-second query and a 2-minute query is usually predicate pushdown and column selection, not query complexity.
Setup
SQL Warehouse Configuration
SQL Warehouse: Production Analytics
Type: Pro (for Unity Catalog support)
Size: Medium (start here, scale based on concurrency)
Auto-Scaling:
Min Clusters: 1
Max Clusters: 5
Scale Down After: 15 minutes idle
Auto-Stop: 30 minutes (save costs during off-hours)
Spot Instances: Yes (70% cost savings, acceptable for analytics)
Channel: Current (stable, not Preview)
Tags:
team: analytics
environment: production
cost_center: data-team
SQL Warehouse: Ad-Hoc Exploration
Type: Serverless (instant startup)
Size: Small
Auto-Stop: 10 minutes
Use Case: Data exploration, quick queries
Query Optimization Patterns
-- BAD: Full table scan
SELECT * FROM events WHERE event_date = '2026-03-25';
-- GOOD: Partition pruning + column selection
SELECT event_id, user_id, event_type, properties
FROM events
WHERE event_date = '2026-03-25'
AND event_type = 'purchase';
-- BAD: Reading all columns for aggregation
SELECT * FROM orders GROUP BY customer_id;
-- GOOD: Only read needed columns
SELECT customer_id, COUNT(*) as order_count, SUM(total) as revenue
FROM orders
WHERE order_date >= '2026-01-01'
GROUP BY customer_id;
Key Techniques
1. Dashboard Queries
-- Revenue dashboard: Daily revenue trend
SELECT
order_date,
COUNT(DISTINCT order_id) AS orders,
COUNT(DISTINCT customer_id) AS customers,
SUM(total_amount) AS revenue,
AVG(total_amount) AS avg_order_value,
SUM(total_amount) - LAG(SUM(total_amount)) OVER (ORDER BY order_date) AS revenue_change
FROM gold.orders
WHERE order_date >= CURRENT_DATE - INTERVAL 90 DAYS
GROUP BY order_date
ORDER BY order_date;
-- Pipeline dashboard: Deal funnel
WITH stage_counts AS (
SELECT
stage,
COUNT(*) AS deal_count,
SUM(amount) AS total_value,
AVG(DATEDIFF(CURRENT_DATE, created_date)) AS avg_age_days
FROM gold.deals
WHERE is_closed = false
GROUP BY stage
)
SELECT
stage,
deal_count,
total_value,
avg_age_days,
total_value / NULLIF(SUM(total_value) OVER (), 0) * 100 AS pct_of_pipeline
FROM stage_counts
ORDER BY stage_order;
-- Customer cohort retention
WITH first_purchase AS (
SELECT customer_id, DATE_TRUNC('month', MIN(order_date)) AS cohort_month
FROM gold.orders
GROUP BY customer_id
),
monthly_activity AS (
SELECT
fp.cohort_month,
DATE_TRUNC('month', o.order_date) AS activity_month,
COUNT(DISTINCT o.customer_id) AS active_customers
FROM gold.orders o
JOIN first_purchase fp ON o.customer_id = fp.customer_id
GROUP BY fp.cohort_month, DATE_TRUNC('month', o.order_date)
)
SELECT
cohort_month,
activity_month,
MONTHS_BETWEEN(activity_month, cohort_month) AS months_since_first,
active_customers,
active_customers / FIRST_VALUE(active_customers) OVER (
PARTITION BY cohort_month ORDER BY activity_month
) * 100 AS retention_pct
FROM monthly_activity
ORDER BY cohort_month, activity_month;
2. Alerts Configuration
Alerts:
Revenue Anomaly:
Query: |
SELECT
CASE WHEN today_revenue < avg_revenue * 0.7 THEN 'LOW'
WHEN today_revenue > avg_revenue * 1.5 THEN 'HIGH'
ELSE 'NORMAL' END AS status,
today_revenue, avg_revenue
FROM (
SELECT
SUM(CASE WHEN order_date = CURRENT_DATE THEN total ELSE 0 END) AS today_revenue,
AVG(daily_total) AS avg_revenue
FROM (
SELECT order_date, SUM(total) AS daily_total
FROM gold.orders
WHERE order_date >= CURRENT_DATE - 30
GROUP BY order_date
)
)
Trigger: status != 'NORMAL'
Schedule: Every hour
Notify: #data-alerts Slack channel
Data Freshness:
Query: |
SELECT
table_name,
MAX(updated_at) AS last_update,
TIMESTAMPDIFF(HOUR, MAX(updated_at), CURRENT_TIMESTAMP) AS hours_stale
FROM gold.pipeline_metadata
GROUP BY table_name
HAVING hours_stale > 4
Trigger: Any rows returned
Schedule: Every 30 minutes
Notify: data-engineering@company.com
3. Parameterized Queries
-- Dashboard query with date parameter
SELECT
product_category,
COUNT(*) AS orders,
SUM(amount) AS revenue,
AVG(amount) AS avg_order
FROM gold.orders
WHERE order_date BETWEEN :start_date AND :end_date
AND region = :region
GROUP BY product_category
ORDER BY revenue DESC;
-- Multi-value parameter for filtering
SELECT *
FROM gold.customers
WHERE segment IN (:segments)
AND country = :country
ORDER BY total_spend DESC
LIMIT :limit;
4. Performance Tuning
-- Check query profile
EXPLAIN EXTENDED
SELECT customer_id, SUM(amount)
FROM gold.orders
WHERE order_date >= '2026-01-01'
GROUP BY customer_id;
-- Analyze table statistics
ANALYZE TABLE gold.orders COMPUTE STATISTICS FOR ALL COLUMNS;
-- Check table details
DESCRIBE DETAIL gold.orders;
DESCRIBE HISTORY gold.orders;
-- Optimize table layout
OPTIMIZE gold.orders ZORDER BY (customer_id, order_date);
-- Vacuum old files
VACUUM gold.orders RETAIN 168 HOURS;
Best Practices
- Select only needed columns: Never use SELECT * in production queries
- Filter on partition columns first: Partition pruning is the biggest performance win
- Use ZORDER for frequent filter columns: Dramatically improves query speed for non-partition filters
- Set result caching: Enable query result caching for repeated dashboard queries
- Parameterize queries: Use :parameter syntax for dashboard filters
- Schedule refreshes off-peak: Run heavy queries during low-usage hours
- Right-size warehouses: Start small, scale based on actual concurrency needs
- Use serverless for ad-hoc: No idle cost, instant startup
Common Pitfalls
- SELECT * on large tables: Reading all columns from a 1TB table when you need 3 columns
- No partition pruning: Filtering on non-partition column forces full scan
- Oversized warehouse: Running a 2X-Large for queries that complete in seconds on Small
- No auto-stop: Warehouse running 24/7 when usage is 9-5
- Stale dashboards: Dashboard cache expires, every viewer triggers a new query
Anti-Patterns
- The Mega Join: Joining 8 tables in a single query when a materialized view would suffice.
- Dashboard as ETL: Running complex transformations in dashboard queries instead of pre-computing in the medallion architecture.
- One Warehouse for All: Using the same warehouse for ETL, dashboards, and ad-hoc. Separate workloads.
- No Governance: Every analyst creates queries against raw tables instead of governed gold-layer views.
Install this skill directly: skilldb add databricks-skills