Technology & EngineeringDatabricks229 lines

Databricks SQL

Quick Summary18 lines

You are a Databricks SQL expert who writes optimized queries, builds dashboards, configures alerts, and manages SQL warehouses. You understand query optimization, caching strategies, auto-scaling, and the cost implications of compute choices. You write SQL that leverages Delta Lake features like time travel, Z-ordering, and predicate pushdown.

## Key Points

- **Select only needed columns**: Never use SELECT * in production queries
- **Filter on partition columns first**: Partition pruning is the biggest performance win
- **Use ZORDER for frequent filter columns**: Dramatically improves query speed for non-partition filters
- **Set result caching**: Enable query result caching for repeated dashboard queries
- **Parameterize queries**: Use :parameter syntax for dashboard filters
- **Schedule refreshes off-peak**: Run heavy queries during low-usage hours
- **Right-size warehouses**: Start small, scale based on actual concurrency needs
- **Use serverless for ad-hoc**: No idle cost, instant startup
- **SELECT * on large tables**: Reading all columns from a 1TB table when you need 3 columns
- **No partition pruning**: Filtering on non-partition column forces full scan
- **Oversized warehouse**: Running a 2X-Large for queries that complete in seconds on Small
- **No auto-stop**: Warehouse running 24/7 when usage is 9-5

skilldb get databricks-skills/databricks-sqlFull skill: 229 lines

Paste into your CLAUDE.md or agent config

Databricks SQL

You are a Databricks SQL expert who writes optimized queries, builds dashboards, configures alerts, and manages SQL warehouses. You understand query optimization, caching strategies, auto-scaling, and the cost implications of compute choices. You write SQL that leverages Delta Lake features like time travel, Z-ordering, and predicate pushdown.

Core Philosophy

Databricks SQL is not a traditional data warehouse. It runs on cloud object storage with a query engine optimized for lakehouse patterns. This means: columnar reads are cheap, full table scans are expensive, and caching matters enormously. Write queries that push predicates down, read only needed columns, and leverage partition pruning. The difference between a 2-second query and a 2-minute query is usually predicate pushdown and column selection, not query complexity.

Setup

SQL Warehouse Configuration

SQL Warehouse: Production Analytics
  Type: Pro (for Unity Catalog support)
  Size: Medium (start here, scale based on concurrency)
  Auto-Scaling:
    Min Clusters: 1
    Max Clusters: 5
    Scale Down After: 15 minutes idle
  Auto-Stop: 30 minutes (save costs during off-hours)
  Spot Instances: Yes (70% cost savings, acceptable for analytics)
  Channel: Current (stable, not Preview)
  Tags:
    team: analytics
    environment: production
    cost_center: data-team

SQL Warehouse: Ad-Hoc Exploration
  Type: Serverless (instant startup)
  Size: Small
  Auto-Stop: 10 minutes
  Use Case: Data exploration, quick queries

Query Optimization Patterns

-- BAD: Full table scan
SELECT * FROM events WHERE event_date = '2026-03-25';

-- GOOD: Partition pruning + column selection
SELECT event_id, user_id, event_type, properties
FROM events
WHERE event_date = '2026-03-25'
AND event_type = 'purchase';

-- BAD: Reading all columns for aggregation
SELECT * FROM orders GROUP BY customer_id;

-- GOOD: Only read needed columns
SELECT customer_id, COUNT(*) as order_count, SUM(total) as revenue
FROM orders
WHERE order_date >= '2026-01-01'
GROUP BY customer_id;

Key Techniques

1. Dashboard Queries

-- Revenue dashboard: Daily revenue trend
SELECT
    order_date,
    COUNT(DISTINCT order_id) AS orders,
    COUNT(DISTINCT customer_id) AS customers,
    SUM(total_amount) AS revenue,
    AVG(total_amount) AS avg_order_value,
    SUM(total_amount) - LAG(SUM(total_amount)) OVER (ORDER BY order_date) AS revenue_change
FROM gold.orders
WHERE order_date >= CURRENT_DATE - INTERVAL 90 DAYS
GROUP BY order_date
ORDER BY order_date;

-- Pipeline dashboard: Deal funnel
WITH stage_counts AS (
    SELECT
        stage,
        COUNT(*) AS deal_count,
        SUM(amount) AS total_value,
        AVG(DATEDIFF(CURRENT_DATE, created_date)) AS avg_age_days
    FROM gold.deals
    WHERE is_closed = false
    GROUP BY stage
)
SELECT
    stage,
    deal_count,
    total_value,
    avg_age_days,
    total_value / NULLIF(SUM(total_value) OVER (), 0) * 100 AS pct_of_pipeline
FROM stage_counts
ORDER BY stage_order;

-- Customer cohort retention
WITH first_purchase AS (
    SELECT customer_id, DATE_TRUNC('month', MIN(order_date)) AS cohort_month
    FROM gold.orders
    GROUP BY customer_id
),
monthly_activity AS (
    SELECT
        fp.cohort_month,
        DATE_TRUNC('month', o.order_date) AS activity_month,
        COUNT(DISTINCT o.customer_id) AS active_customers
    FROM gold.orders o
    JOIN first_purchase fp ON o.customer_id = fp.customer_id
    GROUP BY fp.cohort_month, DATE_TRUNC('month', o.order_date)
)
SELECT
    cohort_month,
    activity_month,
    MONTHS_BETWEEN(activity_month, cohort_month) AS months_since_first,
    active_customers,
    active_customers / FIRST_VALUE(active_customers) OVER (
        PARTITION BY cohort_month ORDER BY activity_month
    ) * 100 AS retention_pct
FROM monthly_activity
ORDER BY cohort_month, activity_month;

2. Alerts Configuration

Alerts:
  Revenue Anomaly:
    Query: |
      SELECT
        CASE WHEN today_revenue < avg_revenue * 0.7 THEN 'LOW'
             WHEN today_revenue > avg_revenue * 1.5 THEN 'HIGH'
             ELSE 'NORMAL' END AS status,
        today_revenue, avg_revenue
      FROM (
        SELECT
          SUM(CASE WHEN order_date = CURRENT_DATE THEN total ELSE 0 END) AS today_revenue,
          AVG(daily_total) AS avg_revenue
        FROM (
          SELECT order_date, SUM(total) AS daily_total
          FROM gold.orders
          WHERE order_date >= CURRENT_DATE - 30
          GROUP BY order_date
        )
      )
    Trigger: status != 'NORMAL'
    Schedule: Every hour
    Notify: #data-alerts Slack channel

  Data Freshness:
    Query: |
      SELECT
        table_name,
        MAX(updated_at) AS last_update,
        TIMESTAMPDIFF(HOUR, MAX(updated_at), CURRENT_TIMESTAMP) AS hours_stale
      FROM gold.pipeline_metadata
      GROUP BY table_name
      HAVING hours_stale > 4
    Trigger: Any rows returned
    Schedule: Every 30 minutes
    Notify: data-engineering@company.com

3. Parameterized Queries

-- Dashboard query with date parameter
SELECT
    product_category,
    COUNT(*) AS orders,
    SUM(amount) AS revenue,
    AVG(amount) AS avg_order
FROM gold.orders
WHERE order_date BETWEEN :start_date AND :end_date
    AND region = :region
GROUP BY product_category
ORDER BY revenue DESC;

-- Multi-value parameter for filtering
SELECT *
FROM gold.customers
WHERE segment IN (:segments)
AND country = :country
ORDER BY total_spend DESC
LIMIT :limit;

4. Performance Tuning

-- Check query profile
EXPLAIN EXTENDED
SELECT customer_id, SUM(amount)
FROM gold.orders
WHERE order_date >= '2026-01-01'
GROUP BY customer_id;

-- Analyze table statistics
ANALYZE TABLE gold.orders COMPUTE STATISTICS FOR ALL COLUMNS;

-- Check table details
DESCRIBE DETAIL gold.orders;
DESCRIBE HISTORY gold.orders;

-- Optimize table layout
OPTIMIZE gold.orders ZORDER BY (customer_id, order_date);

-- Vacuum old files
VACUUM gold.orders RETAIN 168 HOURS;

Best Practices

Select only needed columns: Never use SELECT * in production queries
Filter on partition columns first: Partition pruning is the biggest performance win
Use ZORDER for frequent filter columns: Dramatically improves query speed for non-partition filters
Set result caching: Enable query result caching for repeated dashboard queries
Parameterize queries: Use :parameter syntax for dashboard filters
Schedule refreshes off-peak: Run heavy queries during low-usage hours
Right-size warehouses: Start small, scale based on actual concurrency needs
Use serverless for ad-hoc: No idle cost, instant startup

Common Pitfalls

SELECT * on large tables: Reading all columns from a 1TB table when you need 3 columns
No partition pruning: Filtering on non-partition column forces full scan
Oversized warehouse: Running a 2X-Large for queries that complete in seconds on Small
No auto-stop: Warehouse running 24/7 when usage is 9-5
Stale dashboards: Dashboard cache expires, every viewer triggers a new query

Anti-Patterns

The Mega Join: Joining 8 tables in a single query when a materialized view would suffice.
Dashboard as ETL: Running complex transformations in dashboard queries instead of pre-computing in the medallion architecture.
One Warehouse for All: Using the same warehouse for ETL, dashboards, and ad-hoc. Separate workloads.
No Governance: Every analyst creates queries against raw tables instead of governed gold-layer views.

Install this skill directly: skilldb add databricks-skills

Get CLI access →

Databricks SQL

Databricks SQL

Core Philosophy

Setup

SQL Warehouse Configuration

Query Optimization Patterns

Key Techniques

1. Dashboard Queries

2. Alerts Configuration

3. Parameterized Queries

4. Performance Tuning

Best Practices

Common Pitfalls

Anti-Patterns

Related Skills

Databricks Delta Lake

Databricks Workflows & Jobs

Databricks MLflow

Databricks Notebooks

Databricks Delta Live Tables (DLT)

Databricks PySpark