# Jupyter
Expert guidance on Jupyter notebooks for interactive data exploration, documentation, and reproducible analysis.
You are an expert in Jupyter notebooks for data analysis and science.
## Overview
Jupyter notebooks provide an interactive computing environment that combines live code, rich text (Markdown), equations (LaTeX), and visualizations in a single document. They are the default workspace for exploratory data analysis, prototyping models, and communicating results. JupyterLab is the modern interface; the classic Notebook UI is also widely used.
## Core Concepts

### Cell Types
- **Code cells**: execute Python (or other kernel languages) and display output inline.
- **Markdown cells**: formatted text with headings, lists, links, images, and LaTeX math (`$E = mc^2$`).
- **Raw cells**: unrendered text, useful for export pipelines.
### Magic Commands
```python
# Line magics
%timeit np.dot(a, b)   # benchmark a single statement
%who DataFrame         # list variables of type DataFrame
%load_ext autoreload
%autoreload 2          # auto-reload imported modules on change

# Cell magics
%%time                 # time the entire cell
%%writefile script.py  # write cell contents to a file
%%bash                 # run cell as bash script
```
### Rich Display
```python
from IPython.display import display, HTML, Markdown, Image

display(Markdown("## Results Summary"))
display(HTML("<table><tr><td>Metric</td><td>Value</td></tr></table>"))
display(Image(filename="chart.png", width=400))

# DataFrames render as HTML tables automatically
df.head(10)
```
## Implementation Patterns

### Notebook Structure Template
1. Title and description (Markdown)
2. Imports and configuration
3. Data loading
4. Exploratory data analysis
5. Data cleaning / feature engineering
6. Modeling
7. Evaluation and visualization
8. Conclusions (Markdown)
### Configuration Cell
```python
# Standard imports — run first
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option("display.max_columns", 80)
pd.set_option("display.max_rows", 100)
pd.set_option("display.float_format", "{:.3f}".format)
plt.rcParams["figure.figsize"] = (10, 6)

%matplotlib inline
%load_ext autoreload
%autoreload 2
```
### Parameterized Notebooks with Papermill
```python
# Tag a cell with "parameters" in its metadata
# parameters
dataset_path = "data/default.csv"
n_estimators = 100

# Then execute from the command line:
# papermill input.ipynb output.ipynb -p dataset_path data/prod.csv -p n_estimators 500
```
### Exporting Notebooks
```bash
# To HTML
jupyter nbconvert --to html notebook.ipynb

# To Python script
jupyter nbconvert --to script notebook.ipynb

# To PDF (requires LaTeX)
jupyter nbconvert --to pdf notebook.ipynb

# Execute and export
jupyter nbconvert --to html --execute notebook.ipynb
```
### Extracting Production Code
When analysis matures, refactor into modules:
```text
project/
  notebooks/
    01_explore.ipynb
    02_model.ipynb
  src/
    data.py       # load_data(), clean_data()
    features.py   # build_features()
    model.py      # train(), evaluate()
  config.yaml
```
```python
# In notebook
from src.data import load_data
from src.model import train, evaluate

df = load_data("data/raw.csv")
```
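As an illustrative sketch of this refactoring, `src/data.py` might look like the following. The function names mirror the layout above, but the bodies are assumptions for demonstration, not the actual project code:

```python
# src/data.py — hypothetical contents; bodies are illustrative assumptions
import pandas as pd


def load_data(path: str) -> pd.DataFrame:
    """Read a raw CSV and apply minimal, lossless normalization."""
    df = pd.read_csv(path)
    # Normalize column names so downstream code is predictable
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
    return df


def clean_data(df: pd.DataFrame) -> pd.DataFrame:
    """Drop exact duplicate rows; leave imputation to features.py."""
    return df.drop_duplicates().reset_index(drop=True)
```

Keeping functions this small and side-effect-free is what makes them easy to unit-test outside the notebook.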
## Best Practices
- **Restart and Run All** before sharing. A notebook that only works with cells run out of order is broken.
- Keep notebooks linear: cells should run top-to-bottom without manual intervention.
- Number notebooks (`01_explore.ipynb`, `02_model.ipynb`) to convey execution order.
- Move reusable code to `.py` modules and import them. Notebooks are for exploration, not library code.
- Use Markdown cells liberally to explain intent, assumptions, and conclusions.
- Pin dependencies in a `requirements.txt` or `environment.yml` alongside the notebook.
- Clear all outputs before committing to version control to reduce diff noise.
- Use `%autoreload 2` during development so imported module changes take effect without kernel restarts.
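Output clearing can be automated with nbstripout's pre-commit hook. A minimal `.pre-commit-config.yaml` sketch follows; the `rev` pin is illustrative, so check the project's current release:

```yaml
repos:
  - repo: https://github.com/kynan/nbstripout
    rev: 0.7.1          # illustrative pin; use the latest tagged release
    hooks:
      - id: nbstripout
```

With this in place, `pre-commit install` ensures outputs are stripped automatically on every commit instead of relying on memory.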
## Core Philosophy
Jupyter notebooks are a thinking tool, not a software delivery mechanism. Their power lies in the tight feedback loop between writing code, seeing results, and forming hypotheses. This interactive cycle accelerates exploration and makes complex data intuitive. But the same flexibility that makes notebooks great for exploration makes them dangerous for production: hidden state, non-linear execution, and the temptation to leave code in a half-finished state.
The best notebook practitioners treat notebooks as a first draft that will eventually be refactored. Exploration happens in the notebook; once a pattern solidifies, the reusable logic moves into tested Python modules, and the notebook becomes a thin orchestration layer that imports, calls, and visualizes. This separation keeps notebooks readable and modules reliable.
A notebook should tell a story. When someone else opens it -- or when you reopen it months later -- the Markdown cells should explain why decisions were made, not just what code was run. A notebook that can only be understood by re-executing every cell and reading the outputs is a notebook that has failed at its primary job of communication.
## Anti-Patterns
- **The 500-cell monolith**: a single notebook that contains data loading, cleaning, feature engineering, modeling, evaluation, and visualization with no modular extraction. It is impossible to test, reuse, or review, and it inevitably accumulates hidden-state bugs.
- **Out-of-order execution dependence**: writing cells that only produce correct results when run in a specific non-linear order. If Restart & Run All fails, the notebook is broken regardless of how correct the outputs appear in the current session.
- **Copy-paste instead of import**: duplicating utility functions across multiple notebooks rather than extracting them into a shared module. This guarantees that bug fixes in one notebook never propagate to the others.
- **Committing outputs to version control**: checking in notebooks with large rendered outputs (DataFrames, images, HTML widgets) bloats the repository, makes diffs unreadable, and can leak sensitive data. Use nbstripout or clear outputs before committing.
- **Using notebooks as cron jobs**: scheduling notebooks for recurring production tasks without converting them to proper scripts with error handling, logging, and alerting. Notebook execution failures are silent and difficult to diagnose in production.
## Common Pitfalls
- **Hidden state**: running cells out of order produces results that cannot be reproduced. Always verify with Restart & Run All.
- **Giant notebooks**: a 200-cell notebook is unmanageable. Split into focused notebooks or extract code to modules.
- **Committing outputs to git**: large outputs (images, DataFrames) bloat the repo. Use `nbstripout` as a pre-commit hook.
- **Global variable collisions**: reusing variable names across cells (e.g., `df` for different datasets) causes silent bugs.
- **Ignoring warnings**: deprecation and convergence warnings in notebooks are easy to scroll past but often indicate real problems.
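The variable-collision pitfall is easy to demonstrate. In this sketch (all names hypothetical), reusing `df` for a second dataset silently changes what later cells operate on:

```python
import pandas as pd

# Cell 1: load sales data
df = pd.DataFrame({"revenue": [100, 200]})

# Cell 10: much later, a different dataset reuses the same name
df = pd.DataFrame({"temperature": [21.5, 19.0]})

# Cell 11: meant to summarize sales, but df now holds temperatures —
# df["revenue"].sum() would raise KeyError only when this cell runs.
# Safer: one distinct name per dataset.
sales_df = pd.DataFrame({"revenue": [100, 200]})
weather_df = pd.DataFrame({"temperature": [21.5, 19.0]})
total_revenue = sales_df["revenue"].sum()  # unambiguous
```

Distinct names also make Restart & Run All failures far easier to diagnose, because each cell's inputs are explicit.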
## Related Skills
- **Data Cleaning**: data cleaning and preprocessing techniques for preparing raw data for analysis and modeling.
- **Feature Engineering**: feature engineering patterns for transforming raw data into predictive ML features.
- **Matplotlib**: creating static, animated, and interactive visualizations in Python.
- **NumPy**: numerical computing, array operations, and linear algebra in Python.
- **Pandas**: tabular data manipulation, transformation, and analysis in Python.
- **Polars**: high-performance dataframe operations with a lazy query engine in Python.