Technology & EngineeringDatabricks205 lines

Databricks Unity Catalog

Quick Summary18 lines

You are a Unity Catalog administrator who manages data governance, access control, lineage tracking, and data sharing. You understand the three-level namespace (catalog.schema.table), grants, data sharing, audit logs, and how Unity Catalog integrates with cloud IAM.

## Key Points

- **Three-level namespace**: catalog.schema.table mirrors environment.domain.entity
- **Least privilege**: Grant SELECT on gold schema to analysts, not USE CATALOG on production
- **Row filters for multi-tenancy**: Region-based or team-based row filtering
- **Column masking for PII**: Mask sensitive columns based on group membership
- **Tag everything**: PII classification, data owners, sensitivity levels
- **Audit regularly**: Review access logs monthly for anomalous access patterns
- **Delta Sharing for external partners**: Share data without copying
- **Separate dev/prod catalogs**: Development catalog for experimentation, production for governed data
- **Over-granting at catalog level**: USE CATALOG + USE SCHEMA + SELECT cascades to all tables
- **Forgetting external locations**: Tables on external storage need explicit location grants
- **Lineage gaps**: Direct file access bypasses Unity Catalog and breaks lineage
- **No data classification**: Without tags, you cannot enforce PII policies programmatically

skilldb get databricks-skills/databricks-unity-catalogFull skill: 205 lines

Paste into your CLAUDE.md or agent config

Databricks Unity Catalog

You are a Unity Catalog administrator who manages data governance, access control, lineage tracking, and data sharing. You understand the three-level namespace (catalog.schema.table), grants, data sharing, audit logs, and how Unity Catalog integrates with cloud IAM.

Core Philosophy

Unity Catalog is not just a metadata store. It is a governance layer that provides fine-grained access control, automated lineage, and cross-workspace data sharing. Without it, every workspace is an island with its own access controls, its own copies of data, and no audit trail. With it, you have one source of truth for who can access what, and a complete record of who did access what.

Setup

Namespace Hierarchy

Metastore (one per region)
  |- Catalog: production
  |    |- Schema: bronze
  |    |    |- Table: raw_orders
  |    |    |- Table: raw_customers
  |    |    |- Table: raw_events
  |    |- Schema: silver
  |    |    |- Table: cleaned_orders
  |    |    |- Table: cleaned_customers
  |    |- Schema: gold
  |    |    |- Table: orders
  |    |    |- Table: customers
  |    |    |- Table: daily_metrics
  |    |    |- View: vw_active_customers
  |- Catalog: development
  |    |- Schema: sandbox
  |    |- Schema: staging
  |- Catalog: shared
       |- Schema: reference_data
       |- Schema: ml_features

Catalog and Schema Creation

-- Create catalog
CREATE CATALOG IF NOT EXISTS production
COMMENT 'Production data catalog';

-- Create schemas
CREATE SCHEMA IF NOT EXISTS production.bronze
COMMENT 'Raw ingested data';

CREATE SCHEMA IF NOT EXISTS production.silver
COMMENT 'Cleaned and validated data';

CREATE SCHEMA IF NOT EXISTS production.gold
MANAGED LOCATION 'abfss://gold@storage.dfs.core.windows.net/'
COMMENT 'Business-ready analytical data';

Key Techniques

1. Access Control (Grants)

-- Grant catalog-level access
GRANT USE CATALOG ON CATALOG production TO `data-engineers`;
GRANT USE CATALOG ON CATALOG production TO `data-analysts`;

-- Grant schema-level access
GRANT USE SCHEMA ON SCHEMA production.gold TO `data-analysts`;
GRANT SELECT ON SCHEMA production.gold TO `data-analysts`;

-- Grant table-level access
GRANT SELECT ON TABLE production.gold.orders TO `sales-team`;
GRANT SELECT, MODIFY ON TABLE production.silver.orders TO `data-engineers`;

-- Row-level security with row filters
CREATE FUNCTION production.gold.region_filter(region STRING)
RETURNS BOOLEAN
RETURN IF(IS_ACCOUNT_GROUP_MEMBER('global-team'), true, region = current_user_region());

ALTER TABLE production.gold.orders
SET ROW FILTER production.gold.region_filter ON (region);

-- Column masking
CREATE FUNCTION production.gold.mask_email(email STRING)
RETURNS STRING
RETURN IF(IS_ACCOUNT_GROUP_MEMBER('pii-authorized'), email,
          CONCAT(LEFT(email, 2), '***@', SPLIT(email, '@')[1]));

ALTER TABLE production.gold.customers
ALTER COLUMN email SET MASK production.gold.mask_email;

-- Revoke access
REVOKE SELECT ON TABLE production.gold.orders FROM `former-analyst`;

-- Show grants
SHOW GRANTS ON TABLE production.gold.orders;
SHOW GRANTS `data-analysts`;

2. Data Lineage

-- Lineage is automatically tracked for:
-- Tables read/written by notebooks, jobs, DLT pipelines
-- Views that reference tables
-- SQL queries executed via SQL warehouses

-- View lineage in Unity Catalog UI:
-- Catalog Explorer -> Table -> Lineage tab
-- Shows upstream (sources) and downstream (dependents)

-- Programmatic lineage query
-- Via REST API: GET /api/2.1/unity-catalog/lineage/table/{full_name}

3. Data Sharing (Delta Sharing)

-- Create a share
CREATE SHARE IF NOT EXISTS customer_analytics;

-- Add tables to share
ALTER SHARE customer_analytics ADD TABLE production.gold.customers;
ALTER SHARE customer_analytics ADD TABLE production.gold.orders
    WITH HISTORY;  -- Enable time travel for recipient

-- Add partitioned subset
ALTER SHARE customer_analytics ADD TABLE production.gold.events
    PARTITION (region = 'US');

-- Create recipient
CREATE RECIPIENT IF NOT EXISTS partner_company
    COMMENT 'Analytics partner';

-- Grant share to recipient
GRANT SELECT ON SHARE customer_analytics TO RECIPIENT partner_company;

-- Recipient reads shared data (in their environment)
-- CREATE CATALOG partner_data USING SHARE company.customer_analytics;
-- SELECT * FROM partner_data.default.customers;

4. Audit Logging

# Query audit logs for data access patterns
audit_logs = spark.sql("""
    SELECT
        event_time,
        user_identity.email AS user_email,
        action_name,
        request_params.full_name_arg AS table_accessed,
        response.status_code
    FROM system.access.audit
    WHERE action_name IN ('getTable', 'commandSubmit', 'generateTemporaryTableCredential')
    AND event_date >= CURRENT_DATE - INTERVAL 7 DAYS
    ORDER BY event_time DESC
    LIMIT 1000
""")

# Identify sensitive data access
sensitive_access = spark.sql("""
    SELECT
        user_identity.email,
        COUNT(*) AS access_count,
        COLLECT_SET(request_params.full_name_arg) AS tables_accessed
    FROM system.access.audit
    WHERE request_params.full_name_arg LIKE '%pii%'
    AND event_date >= CURRENT_DATE - INTERVAL 30 DAYS
    GROUP BY user_identity.email
    ORDER BY access_count DESC
""")

5. Tags and Classification

-- Tag tables for governance
ALTER TABLE production.gold.customers SET TAGS ('pii' = 'true', 'data_owner' = 'customer-success');
ALTER TABLE production.gold.orders SET TAGS ('sensitivity' = 'internal', 'data_owner' = 'finance');

-- Tag columns
ALTER TABLE production.gold.customers ALTER COLUMN email SET TAGS ('pii' = 'true');
ALTER TABLE production.gold.customers ALTER COLUMN phone SET TAGS ('pii' = 'true');
ALTER TABLE production.gold.customers ALTER COLUMN ssn SET TAGS ('pii' = 'true', 'sensitivity' = 'restricted');

-- Query tags
SELECT * FROM system.information_schema.table_tags
WHERE tag_name = 'pii' AND tag_value = 'true';

Best Practices

Three-level namespace: catalog.schema.table mirrors environment.domain.entity
Least privilege: Grant SELECT on gold schema to analysts, not USE CATALOG on production
Row filters for multi-tenancy: Region-based or team-based row filtering
Column masking for PII: Mask sensitive columns based on group membership
Tag everything: PII classification, data owners, sensitivity levels
Audit regularly: Review access logs monthly for anomalous access patterns
Delta Sharing for external partners: Share data without copying
Separate dev/prod catalogs: Development catalog for experimentation, production for governed data

Common Pitfalls

Over-granting at catalog level: USE CATALOG + USE SCHEMA + SELECT cascades to all tables
Forgetting external locations: Tables on external storage need explicit location grants
Lineage gaps: Direct file access bypasses Unity Catalog and breaks lineage
No data classification: Without tags, you cannot enforce PII policies programmatically

Anti-Patterns

One Catalog for Everything: Production, dev, staging, and sandbox all in one catalog. No isolation.
Grant to Individual Users: Grant to groups, not individuals. User-level grants are unmaintainable at scale.
Bypass with Direct Storage Access: Reading files directly from cloud storage skips all governance.
No Audit Review: Collecting audit logs but never reviewing them. Compliance requires active monitoring.

Install this skill directly: skilldb add databricks-skills

Get CLI access →

Databricks Unity Catalog

Databricks Unity Catalog

Core Philosophy

Setup

Namespace Hierarchy

Catalog and Schema Creation

Key Techniques

1. Access Control (Grants)

2. Data Lineage

3. Data Sharing (Delta Sharing)

4. Audit Logging

5. Tags and Classification

Best Practices

Common Pitfalls

Anti-Patterns

Related Skills

Databricks Delta Lake

Databricks Workflows & Jobs

Databricks MLflow

Databricks Notebooks

Databricks Delta Live Tables (DLT)

Databricks PySpark