
Jupyter

Expert guidance on Jupyter notebooks for interactive data exploration, documentation, and reproducible analysis.


Jupyter — Data Science

You are an expert in Jupyter notebooks for data analysis and science.

Overview

Jupyter notebooks provide an interactive computing environment that combines live code, rich text (Markdown), equations (LaTeX), and visualizations in a single document. They are the default workspace for exploratory data analysis, prototyping models, and communicating results. JupyterLab is the modern interface; the classic Notebook UI is also widely used.

Core Concepts

Cell Types

  • Code cells: execute Python (or other kernel languages) and display output inline.
  • Markdown cells: formatted text with headings, lists, links, images, and LaTeX math ($E = mc^2$).
  • Raw cells: unrendered text, useful for export pipelines.

Magic Commands

# Line magics
%timeit np.dot(a, b)           # benchmark a single statement
%who DataFrame                 # list variables of type DataFrame
%load_ext autoreload
%autoreload 2                  # auto-reload imported modules on change

# Cell magics
%%time                         # time the entire cell
%%writefile script.py           # write cell contents to a file
%%bash                         # run cell as bash script
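Line magics like %timeit are thin wrappers over standard-library machinery. A rough stdlib equivalent is sketched below; the repeat and loop counts are illustrative choices, not IPython's autoscaled defaults:

```python
# Roughly what %timeit does under the hood, using the stdlib timeit module.
# repeat=5 and number=10_000 are illustrative; IPython picks these adaptively.
import timeit

best = min(timeit.repeat("sum(range(1000))", repeat=5, number=10_000))
print(f"{best / 10_000 * 1e6:.2f} µs per loop (best of 5 runs)")
```

Taking the minimum of several runs, as %timeit does, filters out timing noise from other processes.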

Rich Display

from IPython.display import display, HTML, Markdown, Image

display(Markdown("## Results Summary"))
display(HTML("<table><tr><td>Metric</td><td>Value</td></tr></table>"))
display(Image(filename="chart.png", width=400))

# DataFrames render as HTML tables automatically
df.head(10)
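DataFrames render as tables because pandas implements IPython's display protocol: any object that defines a _repr_html_ method is rendered as HTML in a notebook. A minimal sketch of that protocol, using a made-up MetricsTable class, requires no IPython import at all:

```python
# Minimal sketch of the _repr_html_ display protocol.
# MetricsTable is a hypothetical example class, not a real library API.
class MetricsTable:
    def __init__(self, metrics):
        self.metrics = metrics  # dict mapping metric name -> numeric value

    def _repr_html_(self):
        # Jupyter calls this automatically when the object is a cell's result.
        rows = "".join(
            f"<tr><td>{name}</td><td>{value:.3f}</td></tr>"
            for name, value in self.metrics.items()
        )
        return f"<table><tr><th>Metric</th><th>Value</th></tr>{rows}</table>"

table = MetricsTable({"accuracy": 0.913, "f1": 0.887})
table  # as the last expression in a cell, this renders as an HTML table
```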

Implementation Patterns

Notebook Structure Template

1. Title and description (Markdown)
2. Imports and configuration
3. Data loading
4. Exploratory data analysis
5. Data cleaning / feature engineering
6. Modeling
7. Evaluation and visualization
8. Conclusions (Markdown)
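Because the .ipynb format is plain JSON, the template above can be scaffolded with nothing but the stdlib. The section sources and the filename analysis.ipynb below are illustrative:

```python
# Sketch: generate a skeleton notebook following the structure template,
# using only the stdlib json module (.ipynb files are nbformat-4 JSON).
import json

SECTIONS = [
    ("markdown", "# Analysis Title\n\nShort description of the question."),
    ("code", "import pandas as pd\nimport numpy as np"),
    ("code", "df = pd.read_csv('data/raw.csv')"),
    ("markdown", "## Exploratory Data Analysis"),
    ("markdown", "## Conclusions"),
]

def make_cell(cell_type, source):
    cell = {"cell_type": cell_type, "metadata": {}, "source": source}
    if cell_type == "code":
        # Code cells additionally carry outputs and an execution count.
        cell.update(outputs=[], execution_count=None)
    return cell

notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [make_cell(t, s) for t, s in SECTIONS],
}

with open("analysis.ipynb", "w") as f:
    json.dump(notebook, f, indent=1)
```

For anything beyond a quick scaffold, the nbformat package provides validated constructors for the same structure.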

Configuration Cell

# Standard imports — run first
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option("display.max_columns", 80)
pd.set_option("display.max_rows", 100)
pd.set_option("display.float_format", "{:.3f}".format)
plt.rcParams["figure.figsize"] = (10, 6)

%matplotlib inline
%load_ext autoreload
%autoreload 2

Parameterized Notebooks with Papermill

# Tag a cell with "parameters" in metadata
# parameters
dataset_path = "data/default.csv"
n_estimators = 100

# Then execute from the command line:
# papermill input.ipynb output.ipynb -p dataset_path data/prod.csv -p n_estimators 500

Exporting Notebooks

# To HTML
jupyter nbconvert --to html notebook.ipynb

# To Python script
jupyter nbconvert --to script notebook.ipynb

# To PDF (requires LaTeX)
jupyter nbconvert --to pdf notebook.ipynb

# Execute and export
jupyter nbconvert --to html --execute notebook.ipynb

Extracting Production Code

When analysis matures, refactor into modules:

project/
  notebooks/
    01_explore.ipynb
    02_model.ipynb
  src/
    data.py          # load_data(), clean_data()
    features.py      # build_features()
    model.py         # train(), evaluate()
  config.yaml
# In notebook
from src.data import load_data
from src.model import train, evaluate

df = load_data("data/raw.csv")
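A hypothetical sketch of what src/data.py might contain; load_data matches the name in the layout above, but the specific cleaning steps are illustrative assumptions:

```python
# src/data.py (sketch) — shared loading logic extracted from notebooks.
# The column normalization and de-duplication here are example choices.
import pandas as pd

def load_data(path):
    """Load a raw CSV and apply the baseline cleaning every notebook needs."""
    df = pd.read_csv(path)
    # Normalize column names: "A Col" -> "a_col"
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
    return df.drop_duplicates()
```

Once this lives in a module, a fix to the cleaning logic propagates to every notebook that imports it.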

Best Practices

  • Restart and Run All before sharing. A notebook that only works with cells run out of order is broken.
  • Keep notebooks linear: cells should run top-to-bottom without manual intervention.
  • Number notebooks (01_explore.ipynb, 02_model.ipynb) to convey execution order.
  • Move reusable code to .py modules and import them. Notebooks are for exploration, not library code.
  • Use Markdown cells liberally to explain intent, assumptions, and conclusions.
  • Pin dependencies in a requirements.txt or environment.yml alongside the notebook.
  • Clear all outputs before committing to version control to reduce diff noise.
  • Use %autoreload 2 during development so imported module changes take effect without kernel restarts.

Core Philosophy

Jupyter notebooks are a thinking tool, not a software delivery mechanism. Their power lies in the tight feedback loop between writing code, seeing results, and forming hypotheses. This interactive cycle accelerates exploration and makes complex data intuitive. But the same flexibility that makes notebooks great for exploration makes them dangerous for production: hidden state, non-linear execution, and the temptation to leave code in a half-finished state.

The best notebook practitioners treat notebooks as a first draft that will eventually be refactored. Exploration happens in the notebook; once a pattern solidifies, the reusable logic moves into tested Python modules, and the notebook becomes a thin orchestration layer that imports, calls, and visualizes. This separation keeps notebooks readable and modules reliable.

A notebook should tell a story. When someone else opens it -- or when you reopen it months later -- the Markdown cells should explain why decisions were made, not just what code was run. A notebook that can only be understood by re-executing every cell and reading the outputs is a notebook that has failed at its primary job of communication.

Anti-Patterns

  • The 500-cell monolith: A single notebook that contains data loading, cleaning, feature engineering, modeling, evaluation, and visualization with no modular extraction. It is impossible to test, reuse, or review, and it inevitably accumulates hidden state bugs.

  • Out-of-order execution dependence: Writing cells that only produce correct results when run in a specific non-linear order. If Restart & Run All fails, the notebook is broken regardless of how correct the outputs appear in the current session.

  • Copy-paste instead of import: Duplicating utility functions across multiple notebooks rather than extracting them into a shared module. This guarantees that bug fixes in one notebook never propagate to the others.

  • Committing outputs to version control: Checking in notebooks with large rendered outputs (DataFrames, images, HTML widgets) bloats the repository, makes diffs unreadable, and can leak sensitive data. Use nbstripout or clear outputs before committing.

  • Using notebooks as cron jobs: Scheduling notebooks for recurring production tasks without converting them to proper scripts with error handling, logging, and alerting. Notebook execution failures are silent and difficult to diagnose in production.

Common Pitfalls

  • Hidden state: running cells out of order produces results that cannot be reproduced. Always verify with Restart & Run All.
  • Giant notebooks: a 200-cell notebook is unmanageable. Split into focused notebooks or extract code to modules.
  • Committing outputs to git: large outputs (images, DataFrames) bloat the repo. Use nbstripout as a pre-commit hook.
  • Global variable collisions: reusing variable names across cells (e.g., df for different datasets) causes silent bugs.
  • Ignoring warnings: deprecation and convergence warnings in notebooks are easy to scroll past but often indicate real problems.
