
Pandas

Expert guidance on Pandas for tabular data manipulation, transformation, and analysis in Python.

Quick Summary
You are an expert in Pandas for data analysis and science.

## Key Points

- **Use vectorized operations** instead of iterating with `iterrows()` or `apply()` where possible. Vectorized ops are often 10-100x faster.
- **Specify dtypes on read** to reduce memory usage (`dtype={"col": "category"}` for low-cardinality strings).
- **Use `.copy()`** when creating subsets you intend to modify, to avoid `SettingWithCopyWarning`.
- **Prefer `pd.merge()`** over `DataFrame.join()` for clarity when join keys differ.
- **Use `query()` and `eval()`** for readable filtering on large DataFrames.
- **Convert to categorical** for repeated string columns to save memory.
- **Chained indexing** (`df["col"][0] = val`) can silently fail. Always use `.loc` or `.iloc`.
- **Ignoring the index** after filtering leads to misaligned joins. Use `reset_index(drop=True)`.
- **Using `inplace=True`** is discouraged in modern Pandas — it hinders method chaining and can be slower.
- **Loading huge CSVs without chunking** causes memory errors. Use `chunksize` parameter or switch to Parquet.
- **Applying Python functions row-wise** (`apply(func, axis=1)`) is slow. Restructure to column-wise vectorized operations.

## Quick Example

```python
import pandas as pd

customers = pd.DataFrame({"id": [1, 2], "name": ["Alice", "Bob"]})
orders = pd.DataFrame({"customer_id": [1, 1, 2], "amount": [50, 30, 80]})

merged = pd.merge(orders, customers, left_on="customer_id", right_on="id")
```

```python
# Pivot table (assumes a `sales` DataFrame with region, product, and revenue columns)
pivot = sales.pivot_table(values="revenue", index="region", columns="product", aggfunc="sum")

# Melt (wide to long)
long = pivot.reset_index().melt(id_vars="region", var_name="product", value_name="revenue")
```

# Pandas — Data Science

You are an expert in Pandas for data analysis and science.

## Overview

Pandas is the foundational library for tabular data in Python. It provides the DataFrame and Series data structures for efficient data manipulation, cleaning, aggregation, and exploration. Built on top of NumPy, it excels at handling heterogeneous, labeled data with an expressive API.

## Core Concepts

### DataFrame and Series

A DataFrame is a 2D labeled structure (rows and columns). A Series is a single labeled column.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol"],
    "age": [30, 25, 35],
    "salary": [70000, 60000, 90000],
})

# Series access
ages = df["age"]
```

### Indexing and Selection

```python
# Label-based
df.loc[0, "name"]

# Position-based
df.iloc[0, 0]

# Boolean filtering
high_earners = df[df["salary"] > 65000]

# Multiple conditions (parenthesize each condition; use & and |)
senior_high = df[(df["age"] > 28) & (df["salary"] > 65000)]
```

### GroupBy and Aggregation

```python
sales = pd.DataFrame({
    "region": ["East", "West", "East", "West"],
    "product": ["A", "A", "B", "B"],
    "revenue": [100, 150, 200, 250],
})

# Single aggregation
sales.groupby("region")["revenue"].sum()

# Multiple aggregations
sales.groupby("region")["revenue"].agg(["mean", "sum", "count"])

# Named aggregations
sales.groupby("region").agg(
    total_revenue=("revenue", "sum"),
    avg_revenue=("revenue", "mean"),
)
```

### Merging and Joining

```python
customers = pd.DataFrame({"id": [1, 2], "name": ["Alice", "Bob"]})
orders = pd.DataFrame({"customer_id": [1, 1, 2], "amount": [50, 30, 80]})

merged = pd.merge(orders, customers, left_on="customer_id", right_on="id")
```

### Reshaping

```python
# Pivot table
pivot = sales.pivot_table(values="revenue", index="region", columns="product", aggfunc="sum")

# Melt (wide to long)
long = pivot.reset_index().melt(id_vars="region", var_name="product", value_name="revenue")
```

## Implementation Patterns

### Method Chaining

```python
result = (
    df
    .query("age > 25")
    .assign(tax=lambda x: x["salary"] * 0.3)
    .sort_values("salary", ascending=False)
    .reset_index(drop=True)
)
```

### Handling Missing Data

```python
# Detect
df.isna().sum()

# Fill (assign the result back; fillna returns a new Series)
df["salary"] = df["salary"].fillna(df["salary"].median())

# Drop rows with NaN in specific columns
df = df.dropna(subset=["salary", "age"])
```

### Efficient I/O

```python
# CSV
df = pd.read_csv("data.csv", parse_dates=["date_col"], dtype={"id": "int32"})

# Parquet (faster, smaller)
df.to_parquet("data.parquet", index=False)
df = pd.read_parquet("data.parquet")
```

### Window Functions

```python
df["rolling_avg"] = df["salary"].rolling(window=3).mean()
df["cumulative_sum"] = df["salary"].cumsum()
df["rank"] = df["salary"].rank(method="dense", ascending=False)
```

## Best Practices

- Use vectorized operations instead of iterating with `iterrows()` or `apply()` where possible. Vectorized ops are often 10-100x faster.
- Specify dtypes on read to reduce memory usage (`dtype={"col": "category"}` for low-cardinality strings).
- Use `.copy()` when creating subsets you intend to modify, to avoid `SettingWithCopyWarning`.
- Prefer `pd.merge()` over `DataFrame.join()` for clarity when join keys differ.
- Use `query()` and `eval()` for readable filtering on large DataFrames.
- Convert repeated string columns to `category` dtype to save memory.
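Two of these practices together, as a runnable sketch (the tiny frame and its column names are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({"city": ["NYC", "LA", "NYC", "LA"], "sales": [10, 20, 30, 40]})

# Explicit copy: the later assignment is unambiguous and raises no SettingWithCopyWarning
west = df[df["city"] == "LA"].copy()
west["sales"] = west["sales"] * 2

# Categorical dtype for a low-cardinality string column
df["city"] = df["city"].astype("category")

assert list(west["sales"]) == [40, 80]   # subset modified
assert list(df["sales"]) == [10, 20, 30, 40]  # original untouched
```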

## Core Philosophy

Pandas is at its best when you think in terms of whole-column transformations rather than row-by-row operations. The DataFrame is not a spreadsheet to be edited cell by cell; it is a structured collection of vectors that should be transformed declaratively. When you reach for iterrows() or apply(func, axis=1), pause and ask whether the same result can be achieved with vectorized operations, groupby, merge, or a built-in method. Almost always, it can -- and the vectorized version will be both faster and more readable.
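As a minimal sketch of that whole-column mindset (the data and the bonus rule are invented for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"salary": [70000, 60000, 90000], "years": [5, 2, 10]})

# Row-wise apply: a Python-level loop in disguise
slow = df.apply(lambda row: row["salary"] // 10 if row["years"] > 3 else 0, axis=1)

# Vectorized: one whole-column expression, same result
fast = np.where(df["years"] > 3, df["salary"] // 10, 0)

assert list(slow) == list(fast) == [7000, 0, 9000]
```

On frames of realistic size, the `np.where` version runs the comparison and arithmetic in compiled code, while the `apply` version invokes the lambda once per row.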

Method chaining is the idiomatic way to build data transformation pipelines in Pandas. A chain of .query(), .assign(), .groupby(), .sort_values() reads as a sequence of declarative steps, each producing a new DataFrame without mutating the original. This style makes code easier to reason about, test, and debug because every intermediate result is an independent object. Avoid inplace=True, which breaks chaining and offers no real performance benefit.

Be deliberate about memory. Pandas loads entire datasets into RAM, and its default dtype choices are generous (64-bit integers and floats for everything). For large datasets, specifying dtypes on read, converting low-cardinality strings to categoricals, and using Parquet instead of CSV can reduce memory usage by an order of magnitude. When Pandas runs out of memory, the answer is not always "get a bigger machine" -- it is often "specify your dtypes."
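A rough illustration of the categorical savings (exact byte counts vary by pandas version and string dtype backend):

```python
import pandas as pd

n = 100_000
s_obj = pd.Series(["east", "west"] * (n // 2))   # default string storage
s_cat = s_obj.astype("category")                 # small int codes plus a tiny lookup table

obj_bytes = s_obj.memory_usage(deep=True)
cat_bytes = s_cat.memory_usage(deep=True)
assert cat_bytes < obj_bytes / 5   # typically far smaller for low-cardinality columns
```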

## Anti-Patterns

- **Chained indexing for assignment:** Writing `df["col"][mask] = value` instead of `df.loc[mask, "col"] = value`. Chained indexing may silently operate on a copy rather than the original DataFrame, producing incorrect results without any error or warning.
- **Using `apply` with `axis=1` for vectorizable logic:** Passing a Python function to `apply(func, axis=1)` when the same logic could be expressed as a column-wise vectorized operation. Row-wise `apply` is effectively a Python for-loop and can be 10-100x slower.
- **Loading large CSVs without dtype or chunking:** Calling `pd.read_csv()` on a multi-gigabyte file with default settings, letting Pandas infer (and over-allocate) dtypes for every column. This exhausts memory on files that would fit easily with explicit dtype specification or chunked reading.
- **Mutating DataFrames in place during iteration:** Modifying a DataFrame while iterating over it with `iterrows()` or similar methods. This produces undefined behavior and is never the right approach. Build a new Series or DataFrame and assign it.
- **Ignoring the index after filtering:** Filtering a DataFrame and then performing operations that depend on integer position (like joining or concatenating) without calling `reset_index(drop=True)`. The surviving index values from the original DataFrame cause alignment surprises.
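The first and last of these have one-line fixes, sketched here on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({"score": [50, 80, 90], "passed": [False, False, False]})

# Wrong: df["passed"][df["score"] >= 60] = True  -- may write to a temporary copy
# Right: a single .loc call targets the original frame
df.loc[df["score"] >= 60, "passed"] = True

# After filtering, reset the index before position-dependent operations
top = df[df["passed"]].reset_index(drop=True)
assert list(top.index) == [0, 1]
```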

## Common Pitfalls

- Chained indexing (`df["col"][0] = val`) can silently fail. Always use `.loc` or `.iloc`.
- Ignoring the index after filtering leads to misaligned joins. Use `reset_index(drop=True)`.
- Using `inplace=True` is discouraged in modern Pandas — it hinders method chaining and can be slower.
- Loading huge CSVs without chunking causes memory errors. Use the `chunksize` parameter or switch to Parquet.
- Applying Python functions row-wise (`apply(func, axis=1)`) is slow. Restructure to column-wise vectorized operations.
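For the chunking pitfall, a self-contained sketch (the small demo file stands in for a real multi-gigabyte CSV):

```python
import pandas as pd

# Demo file standing in for a file too large to load whole
pd.DataFrame({"amount": range(10)}).to_csv("huge.csv", index=False)

# Stream the file in pieces; only the running aggregate stays in memory
total = 0
with pd.read_csv("huge.csv", chunksize=3) as reader:
    for chunk in reader:
        total += chunk["amount"].sum()

assert total == 45  # same answer as loading the whole file at once
```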
