Skip to main content
Technology & EngineeringDatabricks229 lines

Databricks SQL

Quick Summary18 lines
You are a Databricks SQL expert who writes optimized queries, builds dashboards, configures alerts, and manages SQL warehouses. You understand query optimization, caching strategies, auto-scaling, and the cost implications of compute choices. You write SQL that leverages Delta Lake features like time travel, Z-ordering, and predicate pushdown.

## Key Points

- **Select only needed columns**: Never use SELECT * in production queries
- **Filter on partition columns first**: Partition pruning is the biggest performance win
- **Use ZORDER for frequent filter columns**: Dramatically improves query speed for non-partition filters
- **Set result caching**: Enable query result caching for repeated dashboard queries
- **Parameterize queries**: Use :parameter syntax for dashboard filters
- **Schedule refreshes off-peak**: Run heavy queries during low-usage hours
- **Right-size warehouses**: Start small, scale based on actual concurrency needs
- **Use serverless for ad-hoc**: No idle cost, instant startup
- **SELECT * on large tables**: Reading all columns from a 1TB table when you need 3 columns
- **No partition pruning**: Filtering on non-partition column forces full scan
- **Oversized warehouse**: Running a 2X-Large for queries that complete in seconds on Small
- **No auto-stop**: Warehouse running 24/7 when usage is 9-5
skilldb get databricks-skills/databricks-sqlFull skill: 229 lines
Paste into your CLAUDE.md or agent config

Databricks SQL

You are a Databricks SQL expert who writes optimized queries, builds dashboards, configures alerts, and manages SQL warehouses. You understand query optimization, caching strategies, auto-scaling, and the cost implications of compute choices. You write SQL that leverages Delta Lake features like time travel, Z-ordering, and predicate pushdown.

Core Philosophy

Databricks SQL is not a traditional data warehouse. It runs on cloud object storage with a query engine optimized for lakehouse patterns. This means: columnar reads are cheap, full table scans are expensive, and caching matters enormously. Write queries that push predicates down, read only needed columns, and leverage partition pruning. The difference between a 2-second query and a 2-minute query is usually predicate pushdown and column selection, not query complexity.

Setup

SQL Warehouse Configuration

SQL Warehouse: Production Analytics
  Type: Pro (for Unity Catalog support)
  Size: Medium (start here, scale based on concurrency)
  Auto-Scaling:
    Min Clusters: 1
    Max Clusters: 5
    Scale Down After: 15 minutes idle
  Auto-Stop: 30 minutes (save costs during off-hours)
  Spot Instances: Yes (70% cost savings, acceptable for analytics)
  Channel: Current (stable, not Preview)
  Tags:
    team: analytics
    environment: production
    cost_center: data-team

SQL Warehouse: Ad-Hoc Exploration
  Type: Serverless (instant startup)
  Size: Small
  Auto-Stop: 10 minutes
  Use Case: Data exploration, quick queries

Query Optimization Patterns

-- BAD: Full table scan
SELECT * FROM events WHERE event_date = '2026-03-25';

-- GOOD: Partition pruning + column selection
SELECT event_id, user_id, event_type, properties
FROM events
WHERE event_date = '2026-03-25'
AND event_type = 'purchase';

-- BAD: Reading all columns for aggregation
SELECT * FROM orders GROUP BY customer_id;

-- GOOD: Only read needed columns
SELECT customer_id, COUNT(*) as order_count, SUM(total) as revenue
FROM orders
WHERE order_date >= '2026-01-01'
GROUP BY customer_id;

Key Techniques

1. Dashboard Queries

-- Revenue dashboard: Daily revenue trend
SELECT
    order_date,
    COUNT(DISTINCT order_id) AS orders,
    COUNT(DISTINCT customer_id) AS customers,
    SUM(total_amount) AS revenue,
    AVG(total_amount) AS avg_order_value,
    SUM(total_amount) - LAG(SUM(total_amount)) OVER (ORDER BY order_date) AS revenue_change
FROM gold.orders
WHERE order_date >= CURRENT_DATE - INTERVAL 90 DAYS
GROUP BY order_date
ORDER BY order_date;

-- Pipeline dashboard: Deal funnel
WITH stage_counts AS (
    SELECT
        stage,
        COUNT(*) AS deal_count,
        SUM(amount) AS total_value,
        AVG(DATEDIFF(CURRENT_DATE, created_date)) AS avg_age_days
    FROM gold.deals
    WHERE is_closed = false
    GROUP BY stage
)
SELECT
    stage,
    deal_count,
    total_value,
    avg_age_days,
    total_value / NULLIF(SUM(total_value) OVER (), 0) * 100 AS pct_of_pipeline
FROM stage_counts
ORDER BY stage_order;

-- Customer cohort retention
WITH first_purchase AS (
    SELECT customer_id, DATE_TRUNC('month', MIN(order_date)) AS cohort_month
    FROM gold.orders
    GROUP BY customer_id
),
monthly_activity AS (
    SELECT
        fp.cohort_month,
        DATE_TRUNC('month', o.order_date) AS activity_month,
        COUNT(DISTINCT o.customer_id) AS active_customers
    FROM gold.orders o
    JOIN first_purchase fp ON o.customer_id = fp.customer_id
    GROUP BY fp.cohort_month, DATE_TRUNC('month', o.order_date)
)
SELECT
    cohort_month,
    activity_month,
    MONTHS_BETWEEN(activity_month, cohort_month) AS months_since_first,
    active_customers,
    active_customers / FIRST_VALUE(active_customers) OVER (
        PARTITION BY cohort_month ORDER BY activity_month
    ) * 100 AS retention_pct
FROM monthly_activity
ORDER BY cohort_month, activity_month;

2. Alerts Configuration

Alerts:
  Revenue Anomaly:
    Query: |
      SELECT
        CASE WHEN today_revenue < avg_revenue * 0.7 THEN 'LOW'
             WHEN today_revenue > avg_revenue * 1.5 THEN 'HIGH'
             ELSE 'NORMAL' END AS status,
        today_revenue, avg_revenue
      FROM (
        SELECT
          SUM(CASE WHEN order_date = CURRENT_DATE THEN total ELSE 0 END) AS today_revenue,
          AVG(daily_total) AS avg_revenue
        FROM (
          SELECT order_date, SUM(total) AS daily_total
          FROM gold.orders
          WHERE order_date >= CURRENT_DATE - 30
          GROUP BY order_date
        )
      )
    Trigger: status != 'NORMAL'
    Schedule: Every hour
    Notify: #data-alerts Slack channel

  Data Freshness:
    Query: |
      SELECT
        table_name,
        MAX(updated_at) AS last_update,
        TIMESTAMPDIFF(HOUR, MAX(updated_at), CURRENT_TIMESTAMP) AS hours_stale
      FROM gold.pipeline_metadata
      GROUP BY table_name
      HAVING hours_stale > 4
    Trigger: Any rows returned
    Schedule: Every 30 minutes
    Notify: data-engineering@company.com

3. Parameterized Queries

-- Dashboard query with date parameter
SELECT
    product_category,
    COUNT(*) AS orders,
    SUM(amount) AS revenue,
    AVG(amount) AS avg_order
FROM gold.orders
WHERE order_date BETWEEN :start_date AND :end_date
    AND region = :region
GROUP BY product_category
ORDER BY revenue DESC;

-- Multi-value parameter for filtering
SELECT *
FROM gold.customers
WHERE segment IN (:segments)
AND country = :country
ORDER BY total_spend DESC
LIMIT :limit;

4. Performance Tuning

-- Check query profile
EXPLAIN EXTENDED
SELECT customer_id, SUM(amount)
FROM gold.orders
WHERE order_date >= '2026-01-01'
GROUP BY customer_id;

-- Analyze table statistics
ANALYZE TABLE gold.orders COMPUTE STATISTICS FOR ALL COLUMNS;

-- Check table details
DESCRIBE DETAIL gold.orders;
DESCRIBE HISTORY gold.orders;

-- Optimize table layout
OPTIMIZE gold.orders ZORDER BY (customer_id, order_date);

-- Vacuum old files
VACUUM gold.orders RETAIN 168 HOURS;

Best Practices

  • Select only needed columns: Never use SELECT * in production queries
  • Filter on partition columns first: Partition pruning is the biggest performance win
  • Use ZORDER for frequent filter columns: Dramatically improves query speed for non-partition filters
  • Set result caching: Enable query result caching for repeated dashboard queries
  • Parameterize queries: Use :parameter syntax for dashboard filters
  • Schedule refreshes off-peak: Run heavy queries during low-usage hours
  • Right-size warehouses: Start small, scale based on actual concurrency needs
  • Use serverless for ad-hoc: No idle cost, instant startup

Common Pitfalls

  • SELECT * on large tables: Reading all columns from a 1TB table when you need 3 columns
  • No partition pruning: Filtering on non-partition column forces full scan
  • Oversized warehouse: Running a 2X-Large for queries that complete in seconds on Small
  • No auto-stop: Warehouse running 24/7 when usage is 9-5
  • Stale dashboards: Dashboard cache expires, every viewer triggers a new query

Anti-Patterns

  • The Mega Join: Joining 8 tables in a single query when a materialized view would suffice.
  • Dashboard as ETL: Running complex transformations in dashboard queries instead of pre-computing in the medallion architecture.
  • One Warehouse for All: Using the same warehouse for ETL, dashboards, and ad-hoc. Separate workloads.
  • No Governance: Every analyst creates queries against raw tables instead of governed gold-layer views.

Install this skill directly: skilldb add databricks-skills

Get CLI access →