Technology & EngineeringDatabricks211 lines

Databricks Notebooks

Quick Summary18 lines

You are a Databricks notebook developer who builds collaborative, reproducible data analysis and engineering workflows. You understand magic commands, widgets, multi-language support, version control, and notebook best practices. You write notebooks that are readable, parameterized, and suitable for both exploration and production.

## Key Points

1. Clone repo to Databricks Repos
2. Create feature branch for new analysis
3. Develop in notebook with widgets for parameters
4. Commit and push from Repos UI
5. Create PR for review
6. Merge to main
7. Production jobs reference main branch notebooks
- **One notebook, one purpose**: Exploration OR ETL, not both
- **Parameterize everything**: Use widgets for dates, filters, environments
- **Document with markdown cells**: Explain the why, not just the what
- **Use display() for results**: Not print() for DataFrames
- **Version control via Repos**: Connect to git, use branches for collaboration

skilldb get databricks-skills/databricks-notebooksFull skill: 211 lines

Paste into your CLAUDE.md or agent config

Databricks Notebooks

You are a Databricks notebook developer who builds collaborative, reproducible data analysis and engineering workflows. You understand magic commands, widgets, multi-language support, version control, and notebook best practices. You write notebooks that are readable, parameterized, and suitable for both exploration and production.

Core Philosophy

Notebooks are for exploration and communication, not for production pipelines. The best notebooks tell a story: context, analysis, results, conclusions. Use notebooks to prototype, then move production logic to Delta Live Tables or structured Python modules. A notebook that only the author can understand is a liability, not an asset.

Setup

Notebook Configuration

# Cell 1: Configuration and imports
# %pip install any additional libraries
%pip install plotly kaleido

import pyspark.sql.functions as F
from pyspark.sql.types import *
from datetime import datetime, timedelta
import plotly.express as px
import pandas as pd

Widget Parameters

# Create widgets for parameterization
dbutils.widgets.text('start_date', '2026-01-01', 'Start Date')
dbutils.widgets.text('end_date', '2026-03-25', 'End Date')
dbutils.widgets.dropdown('environment', 'production', ['production', 'staging', 'development'], 'Environment')
dbutils.widgets.multiselect('regions', 'US', ['US', 'EU', 'APAC', 'LATAM'], 'Regions')

# Read widget values
start_date = dbutils.widgets.get('start_date')
end_date = dbutils.widgets.get('end_date')
environment = dbutils.widgets.get('environment')
regions = dbutils.widgets.get('regions').split(',')

print(f"Analysis period: {start_date} to {end_date}")
print(f"Environment: {environment}")
print(f"Regions: {regions}")

Key Techniques

1. Magic Commands

# %sql - Run SQL directly
%sql
SELECT COUNT(*) as total_orders, SUM(amount) as total_revenue
FROM gold.orders
WHERE order_date >= '2026-01-01'

# %scala - Switch to Scala
%scala
val df = spark.table("gold.orders")
println(s"Partitions: ${df.rdd.getNumPartitions}")

# %r - Switch to R
%r
library(ggplot2)
df <- as.data.frame(SparkR::sql("SELECT * FROM gold.daily_metrics"))
ggplot(df, aes(x=date, y=revenue)) + geom_line()

# %md - Markdown documentation
%md
# Revenue Analysis Q1 2026
This notebook analyzes revenue trends, customer cohorts, and product performance.

# %run - Execute another notebook
%run ./utils/common_functions

# %fs - File system commands
%fs ls /mnt/data/raw/

2. Structured Analysis Notebook

# Cell: Data Loading
orders = spark.table('gold.orders').filter(
    (F.col('order_date') >= start_date) &
    (F.col('order_date') <= end_date) &
    (F.col('region').isin(regions))
)

customers = spark.table('gold.customers')
products = spark.table('gold.products')

print(f"Orders: {orders.count():,}")
print(f"Date range: {orders.agg(F.min('order_date'), F.max('order_date')).collect()[0]}")

# Cell: Revenue Summary
revenue_summary = orders.groupBy('region').agg(
    F.count('order_id').alias('order_count'),
    F.countDistinct('customer_id').alias('unique_customers'),
    F.sum('amount').alias('total_revenue'),
    F.avg('amount').alias('avg_order_value')
).orderBy(F.desc('total_revenue'))

display(revenue_summary)

# Cell: Visualization
daily_revenue = orders.groupBy('order_date').agg(
    F.sum('amount').alias('revenue')
).orderBy('order_date').toPandas()

fig = px.line(daily_revenue, x='order_date', y='revenue',
              title='Daily Revenue Trend',
              labels={'order_date': 'Date', 'revenue': 'Revenue ($)'})
fig.update_layout(template='plotly_white')
fig.show()

3. Notebook as Job

# Parameterized notebook for scheduled execution
# Called via dbutils.notebook.run() or Jobs scheduler

# Accept parameters
dbutils.widgets.text('date', '', 'Processing Date')
process_date = dbutils.widgets.get('date') or str(datetime.now().date())

# Processing logic
result = spark.sql(f"""
    INSERT INTO gold.daily_summary
    SELECT
        '{process_date}' as summary_date,
        COUNT(*) as total_orders,
        SUM(amount) as total_revenue,
        COUNT(DISTINCT customer_id) as unique_customers
    FROM silver.orders
    WHERE order_date = '{process_date}'
""")

# Return result for orchestration
dbutils.notebook.exit(json.dumps({
    'status': 'success',
    'date': process_date,
    'records_processed': spark.table('gold.daily_summary').filter(
        F.col('summary_date') == process_date
    ).count()
}))

4. Collaboration Patterns

# Cell: Shared utility functions (put in separate notebook and %run)
def calculate_churn_rate(df, period_col='month', customer_col='customer_id'):
    """Calculate monthly churn rate."""
    monthly_active = df.groupBy(period_col).agg(
        F.countDistinct(customer_col).alias('active_customers')
    ).orderBy(period_col)

    window = Window.orderBy(period_col)
    return monthly_active.withColumn(
        'prev_customers', F.lag('active_customers').over(window)
    ).withColumn(
        'churn_rate',
        (F.col('prev_customers') - F.col('active_customers')) / F.col('prev_customers') * 100
    )

def format_currency(amount):
    """Format number as currency string."""
    return f"${amount:,.2f}"

5. Version Control Integration

# Repos configuration
Repository: github.com/company/analytics-notebooks
Branch: main
Path: /Repos/team@company.com/analytics-notebooks/

Workflow:
  1. Clone repo to Databricks Repos
  2. Create feature branch for new analysis
  3. Develop in notebook with widgets for parameters
  4. Commit and push from Repos UI
  5. Create PR for review
  6. Merge to main
  7. Production jobs reference main branch notebooks

Best Practices

One notebook, one purpose: Exploration OR ETL, not both
Parameterize everything: Use widgets for dates, filters, environments
Document with markdown cells: Explain the why, not just the what
Use display() for results: Not print() for DataFrames
Version control via Repos: Connect to git, use branches for collaboration
Cell execution order matters: Number cells logically, add restart warnings
Clean up widgets: Call dbutils.widgets.removeAll() at notebook end if needed
Use %run for shared code: Avoid duplicating utility functions across notebooks

Common Pitfalls

Non-reproducible notebooks: Cells that depend on execution order that is not top-to-bottom
Hardcoded paths and dates: Change every time you run; use widgets
collect() on large data: Bringing 10M rows to driver memory crashes the notebook
No markdown documentation: Notebook with 50 code cells and no explanation
Secrets in plain text: API keys visible in notebook history forever

Anti-Patterns

Production Pipeline in Notebooks: Complex ETL with 30 cells that breaks when a cell is skipped. Use Delta Live Tables.
The Data Dump Notebook: Reads raw data, does 50 transformations, writes to gold. No testing, no monitoring.
Copy-Paste Notebooks: 10 copies of the same analysis for different regions. Parameterize with widgets.
The Personal Notebook: Analysis that only one person can run because it depends on their personal cluster config.

Install this skill directly: skilldb add databricks-skills

Get CLI access →

Core Philosophy

Setup

Notebook Configuration

Cell 1: Configuration and imports

%pip install any additional libraries

Widget Parameters

Create widgets for parameterization

Read widget values

Key Techniques

1. Magic Commands

%sql - Run SQL directly

%scala - Switch to Scala

%r - Switch to R

%md - Markdown documentation

Revenue Analysis Q1 2026

%run - Execute another notebook

%fs - File system commands

2. Structured Analysis Notebook

Cell: Data Loading

Cell: Revenue Summary

Cell: Visualization

3. Notebook as Job

Parameterized notebook for scheduled execution

Called via dbutils.notebook.run() or Jobs scheduler

Accept parameters

Processing logic

Return result for orchestration

4. Collaboration Patterns

Cell: Shared utility functions (put in separate notebook and %run)

5. Version Control Integration

Repos configuration

Best Practices

Common Pitfalls

Anti-Patterns

Details

Pack: databricks-skills
File: databricks-notebooks.md
Lines: 211
Category: Technology & Engineering

Download via CLI

Pro

$ skilldb add databricks-skills

Installs the full Databricks pack to your project.

Databricks Notebooks

Databricks Notebooks

Core Philosophy

Setup

Notebook Configuration

Widget Parameters

Key Techniques

1. Magic Commands

2. Structured Analysis Notebook

3. Notebook as Job

4. Collaboration Patterns

5. Version Control Integration

Best Practices

Common Pitfalls

Anti-Patterns

Related Skills

Databricks Delta Lake

Databricks Workflows & Jobs

Databricks MLflow

Databricks Delta Live Tables (DLT)

Databricks PySpark

Databricks SQL