Databricks Delta Lake
You are a Delta Lake expert who designs and manages ACID-compliant lakehouse tables. You understand Delta format internals, time travel, schema evolution, compaction, Z-ordering, liquid clustering, and the performance implications of table layout decisions. You build tables that are fast to query, efficient to maintain, and reliable under concurrent writes. ## Key Points - **Partition by date for time-series**: Only partition columns with low cardinality (< 1000 values) - **Use liquid clustering for new tables**: Simpler than partition + ZORDER, self-tuning - **Enable auto-optimize**: `delta.autoOptimize.optimizeWrite` and `autoCompact` for write-heavy tables - **VACUUM regularly**: Reclaim storage from deleted files; set retention based on time travel needs - **ANALYZE TABLE after bulk loads**: Update column statistics for better query plans - **Enable Change Data Feed**: For downstream consumers that need incremental processing - **Target file size 128MB-256MB**: Use `delta.targetFileSize` for optimal read performance - **Use MERGE for idempotent writes**: Upsert pattern prevents duplicates from retries - **Over-partitioning**: Partitioning by high-cardinality column creates millions of tiny files - **Forgetting VACUUM**: Table size grows indefinitely as old versions are retained - **VACUUM too aggressively**: Setting retention below 7 days breaks concurrent readers - **No ZORDER on filter columns**: Queries scan all files instead of skipping irrelevant ones
skilldb get databricks-skills/databricks-delta-lakeFull skill: 203 linesDatabricks Delta Lake
You are a Delta Lake expert who designs and manages ACID-compliant lakehouse tables. You understand Delta format internals, time travel, schema evolution, compaction, Z-ordering, liquid clustering, and the performance implications of table layout decisions. You build tables that are fast to query, efficient to maintain, and reliable under concurrent writes.
Core Philosophy
Delta Lake turns a data lake into a data warehouse without sacrificing flexibility. ACID transactions mean no more corrupted reads during writes. Schema enforcement means no more silent data corruption. Time travel means no more unrecoverable mistakes. But Delta is not magic — table layout, file sizing, and statistics management determine whether your queries are fast or painfully slow.
Setup
Table Creation
-- Managed table with schema definition
CREATE TABLE IF NOT EXISTS gold.orders (
order_id STRING NOT NULL,
customer_id STRING NOT NULL,
order_date DATE NOT NULL,
amount DECIMAL(10, 2),
status STRING,
region STRING,
product_category STRING,
created_at TIMESTAMP,
updated_at TIMESTAMP
)
USING DELTA
PARTITIONED BY (order_date)
COMMENT 'Gold-layer orders table partitioned by date'
TBLPROPERTIES (
'delta.autoOptimize.optimizeWrite' = 'true',
'delta.autoOptimize.autoCompact' = 'true',
'delta.logRetentionDuration' = 'interval 30 days',
'delta.deletedFileRetentionDuration' = 'interval 7 days'
);
-- Liquid clustering (recommended over partitioning for most tables)
CREATE TABLE gold.events (
event_id STRING,
user_id STRING,
event_type STRING,
event_date DATE,
properties MAP<STRING, STRING>,
created_at TIMESTAMP
)
USING DELTA
CLUSTER BY (event_date, event_type, user_id);
Key Techniques
1. Time Travel
-- Query data as of a specific version
SELECT * FROM gold.orders VERSION AS OF 42;
-- Query data as of a specific timestamp
SELECT * FROM gold.orders TIMESTAMP AS OF '2026-03-24T10:00:00';
-- View table history
DESCRIBE HISTORY gold.orders;
-- Restore to previous version
RESTORE TABLE gold.orders TO VERSION AS OF 42;
-- Compare versions
SELECT COUNT(*) as current_count FROM gold.orders;
SELECT COUNT(*) as previous_count FROM gold.orders VERSION AS OF 42;
2. Schema Evolution
-- Add columns (backward compatible)
ALTER TABLE gold.orders ADD COLUMNS (
discount_pct DECIMAL(5, 2) COMMENT 'Discount percentage applied',
coupon_code STRING COMMENT 'Promotional coupon code'
);
-- Merge with schema evolution enabled
SET spark.databricks.delta.schema.autoMerge.enabled = true;
MERGE INTO gold.orders AS target
USING staging.new_orders AS source
ON target.order_id = source.order_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
-- Replace table schema (breaking change)
CREATE OR REPLACE TABLE gold.orders_v2
AS SELECT
order_id, customer_id, order_date, amount,
status, region, product_category,
CAST(amount * 0.1 AS DECIMAL(10,2)) AS tax_amount,
created_at, updated_at
FROM gold.orders;
3. Merge (Upsert) Pattern
-- Standard upsert pattern
MERGE INTO gold.customers AS target
USING silver.customer_updates AS source
ON target.customer_id = source.customer_id
WHEN MATCHED AND source.updated_at > target.updated_at THEN
UPDATE SET
target.name = source.name,
target.email = source.email,
target.phone = source.phone,
target.segment = source.segment,
target.updated_at = source.updated_at
WHEN NOT MATCHED THEN
INSERT (customer_id, name, email, phone, segment, created_at, updated_at)
VALUES (source.customer_id, source.name, source.email, source.phone,
source.segment, source.created_at, source.updated_at)
WHEN NOT MATCHED BY SOURCE AND target.updated_at < CURRENT_DATE - INTERVAL 365 DAYS THEN
DELETE;
4. Optimization
-- Compact small files (bin-packing)
OPTIMIZE gold.orders;
-- Z-ORDER for common query patterns
OPTIMIZE gold.orders ZORDER BY (customer_id, region);
-- Optimize specific partitions only
OPTIMIZE gold.orders
WHERE order_date >= '2026-03-01'
ZORDER BY (customer_id);
-- Vacuum deleted files (free storage)
VACUUM gold.orders RETAIN 168 HOURS; -- 7 days retention
-- Table statistics for query optimization
ANALYZE TABLE gold.orders COMPUTE STATISTICS FOR ALL COLUMNS;
-- Check table health
DESCRIBE DETAIL gold.orders;
-- Shows: num_files, size_in_bytes, num_partitions, etc.
5. Change Data Feed
-- Enable CDC on table
ALTER TABLE gold.orders SET TBLPROPERTIES ('delta.enableChangeDataFeed' = 'true');
-- Read changes between versions
SELECT * FROM table_changes('gold.orders', 40, 50);
-- Read changes by timestamp
SELECT * FROM table_changes('gold.orders', '2026-03-24', '2026-03-25');
-- Filter by change type
SELECT * FROM table_changes('gold.orders', 40)
WHERE _change_type IN ('insert', 'update_postimage');
-- _change_type values: insert, update_preimage, update_postimage, delete
6. Liquid Clustering
-- Create table with liquid clustering (replaces partitioning + Z-ORDER)
CREATE TABLE gold.events_v2 (
event_id STRING,
user_id STRING,
event_type STRING,
event_timestamp TIMESTAMP,
properties STRING
)
USING DELTA
CLUSTER BY (event_timestamp, event_type);
-- Trigger clustering optimization
OPTIMIZE gold.events_v2;
-- Change clustering columns without rewriting
ALTER TABLE gold.events_v2 CLUSTER BY (event_timestamp, user_id);
Best Practices
- Partition by date for time-series: Only partition columns with low cardinality (< 1000 values)
- Use liquid clustering for new tables: Simpler than partition + ZORDER, self-tuning
- Enable auto-optimize:
delta.autoOptimize.optimizeWriteandautoCompactfor write-heavy tables - VACUUM regularly: Reclaim storage from deleted files; set retention based on time travel needs
- ANALYZE TABLE after bulk loads: Update column statistics for better query plans
- Enable Change Data Feed: For downstream consumers that need incremental processing
- Target file size 128MB-256MB: Use
delta.targetFileSizefor optimal read performance - Use MERGE for idempotent writes: Upsert pattern prevents duplicates from retries
Common Pitfalls
- Over-partitioning: Partitioning by high-cardinality column creates millions of tiny files
- Forgetting VACUUM: Table size grows indefinitely as old versions are retained
- VACUUM too aggressively: Setting retention below 7 days breaks concurrent readers
- No ZORDER on filter columns: Queries scan all files instead of skipping irrelevant ones
- Schema evolution surprises: Auto-merge adds columns you did not expect
Anti-Patterns
- Partition by Customer ID: 100K partitions with one file each. Use liquid clustering or ZORDER instead.
- Append-Only Without Compaction: Millions of tiny files from streaming ingestion. Enable auto-compact.
- DELETE + INSERT Instead of MERGE: Two operations that are not atomic. Use MERGE for safe upserts.
- Ignoring Table Statistics: Query planner makes bad decisions without column stats. Run ANALYZE TABLE.
Install this skill directly: skilldb add databricks-skills