# Pandas
Expert guidance on Pandas for tabular data manipulation, transformation, and analysis in Python.
You are an expert in Pandas for data analysis and science.
## Key Points
- **Use vectorized operations** instead of iterating with `iterrows()` or `apply()` where possible. Vectorized ops are 10-100x faster.
- **Specify dtypes on read** to reduce memory usage (`dtype={"col": "category"}` for low-cardinality strings).
- **Use `.copy()`** when creating subsets you intend to modify, to avoid `SettingWithCopyWarning`.
- **Prefer `pd.merge()`** over `DataFrame.join()` for clarity when join keys differ.
- **Use `query()` and `eval()`** for readable filtering on large DataFrames.
- **Convert to categorical** for repeated string columns to save memory.
- **Chained indexing** (`df["col"][0] = val`) can silently fail. Always use `.loc` or `.iloc`.
- **Ignoring the index** after filtering leads to misaligned joins. Use `reset_index(drop=True)`.
- **Using `inplace=True`** is discouraged in modern Pandas — it hinders method chaining and can be slower.
- **Loading huge CSVs without chunking** causes memory errors. Use `chunksize` parameter or switch to Parquet.
- **Applying Python functions row-wise** (`apply(func, axis=1)`) is slow. Restructure to column-wise vectorized operations.
## Quick Example
```python
customers = pd.DataFrame({"id": [1, 2], "name": ["Alice", "Bob"]})
orders = pd.DataFrame({"customer_id": [1, 1, 2], "amount": [50, 30, 80]})
merged = pd.merge(orders, customers, left_on="customer_id", right_on="id")
```
```python
# Pivot table
pivot = sales.pivot_table(values="revenue", index="region", columns="product", aggfunc="sum")
# Melt (wide to long)
long = pivot.reset_index().melt(id_vars="region", var_name="product", value_name="revenue")
```
## Overview
Pandas is the foundational library for tabular data in Python. It provides the DataFrame and Series data structures for efficient data manipulation, cleaning, aggregation, and exploration. Built on top of NumPy, it excels at handling heterogeneous, labeled data with an expressive API.
## Core Concepts
### DataFrame and Series

A DataFrame is a 2D labeled structure (rows and columns). A Series is a single column.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol"],
    "age": [30, 25, 35],
    "salary": [70000, 60000, 90000],
})

# Series access
ages = df["age"]
```
### Indexing and Selection

```python
# Label-based
df.loc[0, "name"]
# Position-based
df.iloc[0, 0]
# Boolean filtering
high_earners = df[df["salary"] > 65000]
# Multiple conditions
senior_high = df[(df["age"] > 28) & (df["salary"] > 65000)]
```
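The same filters can be written with `query()`, which the Best Practices below recommend for readability on large DataFrames. A small sketch (reusing the `df` defined above):

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol"],
    "age": [30, 25, 35],
    "salary": [70000, 60000, 90000],
})

# Equivalent to df[(df["age"] > 28) & (df["salary"] > 65000)]
senior_high = df.query("age > 28 and salary > 65000")

# Local variables are referenced with the @ prefix
threshold = 65000
high_earners = df.query("salary > @threshold")
```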
### GroupBy and Aggregation

```python
sales = pd.DataFrame({
    "region": ["East", "West", "East", "West"],
    "product": ["A", "A", "B", "B"],
    "revenue": [100, 150, 200, 250],
})

# Single aggregation
sales.groupby("region")["revenue"].sum()
# Multiple aggregations
sales.groupby("region")["revenue"].agg(["mean", "sum", "count"])
# Named aggregations
sales.groupby("region").agg(
    total_revenue=("revenue", "sum"),
    avg_revenue=("revenue", "mean"),
)
```
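Beyond `agg`, `transform` broadcasts a group-level result back to the shape of the original rows, which is handy for per-group normalization. A sketch of this follow-on pattern (the column names `region_total` and `share` are illustrative, not from the text above):

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["East", "West", "East", "West"],
    "product": ["A", "A", "B", "B"],
    "revenue": [100, 150, 200, 250],
})

# transform returns a Series aligned with the original index
sales["region_total"] = sales.groupby("region")["revenue"].transform("sum")
sales["share"] = sales["revenue"] / sales["region_total"]
```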
### Merging and Joining

```python
customers = pd.DataFrame({"id": [1, 2], "name": ["Alice", "Bob"]})
orders = pd.DataFrame({"customer_id": [1, 1, 2], "amount": [50, 30, 80]})
merged = pd.merge(orders, customers, left_on="customer_id", right_on="id")
```
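Two `merge` parameters worth knowing beyond the default inner join: `how` controls which rows survive, and `validate` raises early when a key relationship is not what you expect. A sketch (the third customer is added here just to show the left-join behavior):

```python
import pandas as pd

customers = pd.DataFrame({"id": [1, 2, 3], "name": ["Alice", "Bob", "Carol"]})
orders = pd.DataFrame({"customer_id": [1, 1, 2], "amount": [50, 30, 80]})

# Left join: keep every customer, even those without orders (Carol gets NaN)
merged = pd.merge(customers, orders, left_on="id", right_on="customer_id",
                  how="left")

# validate raises MergeError if the keys are not one-to-many as claimed
checked = pd.merge(customers, orders, left_on="id", right_on="customer_id",
                   how="left", validate="one_to_many")
```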
### Reshaping

```python
# Pivot table
pivot = sales.pivot_table(values="revenue", index="region", columns="product", aggfunc="sum")
# Melt (wide to long)
long = pivot.reset_index().melt(id_vars="region", var_name="product", value_name="revenue")
```
## Implementation Patterns
### Method Chaining

```python
result = (
    df
    .query("age > 25")
    .assign(tax=lambda x: x["salary"] * 0.3)
    .sort_values("salary", ascending=False)
    .reset_index(drop=True)
)
```
### Handling Missing Data

```python
# Detect
df.isna().sum()
# Fill (assign the result; fillna returns a new Series)
df["salary"] = df["salary"].fillna(df["salary"].median())
# Drop rows with NaN in these columns
df.dropna(subset=["salary", "age"])
```
### Efficient I/O

```python
# CSV
df = pd.read_csv("data.csv", parse_dates=["date_col"], dtype={"id": "int32"})
# Parquet (faster, smaller)
df.to_parquet("data.parquet", index=False)
df = pd.read_parquet("data.parquet")
```
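For files too large to fit in memory, `read_csv` can stream the file in chunks, as the pitfalls below recommend. A minimal sketch — it writes a tiny CSV to a temp directory purely so the example is self-contained; in practice the input would be a multi-gigabyte file:

```python
import os
import tempfile

import pandas as pd

# Create a small demo CSV (stand-in for a huge file)
path = os.path.join(tempfile.mkdtemp(), "data.csv")
pd.DataFrame({"id": range(10), "value": range(0, 100, 10)}).to_csv(path, index=False)

# Process the file 4 rows at a time, aggregating as we go
total = 0
for chunk in pd.read_csv(path, chunksize=4, dtype={"id": "int32"}):
    total += chunk["value"].sum()
```

Each `chunk` is an ordinary DataFrame, so any per-chunk transformation works; only the running aggregate stays in memory.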
### Window Functions

```python
df["rolling_avg"] = df["salary"].rolling(window=3).mean()
df["cumulative_sum"] = df["salary"].cumsum()
df["rank"] = df["salary"].rank(method="dense", ascending=False)
```
## Best Practices

- **Use vectorized operations** instead of iterating with `iterrows()` or `apply()` where possible. Vectorized ops are 10-100x faster.
- **Specify dtypes on read** to reduce memory usage (`dtype={"col": "category"}` for low-cardinality strings).
- **Use `.copy()`** when creating subsets you intend to modify, to avoid `SettingWithCopyWarning`.
- **Prefer `pd.merge()`** over `DataFrame.join()` for clarity when join keys differ.
- **Use `query()` and `eval()`** for readable filtering on large DataFrames.
- **Convert to categorical** for repeated string columns to save memory.
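To make the vectorization point concrete, here is the same computation written row-wise and column-wise (a sketch; the exact speedup depends on data size and machine):

```python
import pandas as pd

df = pd.DataFrame({"price": [100.0, 250.0, 80.0], "qty": [2, 1, 5]})

# Slow: a Python-level loop over rows
row_wise = df.apply(lambda row: row["price"] * row["qty"], axis=1)

# Fast: one vectorized multiplication over whole columns
vectorized = df["price"] * df["qty"]

# Both produce the same values
assert (row_wise == vectorized).all()
```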
## Core Philosophy
Pandas is at its best when you think in terms of whole-column transformations rather than row-by-row operations. The DataFrame is not a spreadsheet to be edited cell by cell; it is a structured collection of vectors that should be transformed declaratively. When you reach for iterrows() or apply(func, axis=1), pause and ask whether the same result can be achieved with vectorized operations, groupby, merge, or a built-in method. Almost always, it can -- and the vectorized version will be both faster and more readable.
Method chaining is the idiomatic way to build data transformation pipelines in Pandas. A chain of .query(), .assign(), .groupby(), .sort_values() reads as a sequence of declarative steps, each producing a new DataFrame without mutating the original. This style makes code easier to reason about, test, and debug because every intermediate result is an independent object. Avoid inplace=True, which breaks chaining and offers no real performance benefit.
Be deliberate about memory. Pandas loads entire datasets into RAM, and its default dtype choices are generous (64-bit integers and floats for everything). For large datasets, specifying dtypes on read, converting low-cardinality strings to categoricals, and using Parquet instead of CSV can reduce memory usage by an order of magnitude. When Pandas runs out of memory, the answer is not always "get a bigger machine" -- it is often "specify your dtypes."
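The categorical claim can be checked directly with `memory_usage` (exact byte counts vary by pandas version, but the ordering holds):

```python
import pandas as pd

# A low-cardinality string column repeated many times
regions = pd.Series(["East", "West", "North", "South"] * 25_000)

as_object = regions.memory_usage(deep=True)
as_category = regions.astype("category").memory_usage(deep=True)

# Categorical stores each unique string once plus small integer codes,
# instead of one Python string object per row
assert as_category < as_object
```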
## Anti-Patterns

- **Chained indexing for assignment**: Writing `df["col"][mask] = value` instead of `df.loc[mask, "col"] = value`. Chained indexing may silently operate on a copy rather than the original DataFrame, producing incorrect results without any error or warning.
- **Using apply with axis=1 for vectorizable logic**: Passing a Python function to `apply(func, axis=1)` when the same logic could be expressed as a column-wise vectorized operation. Row-wise apply is effectively a Python for-loop and can be 10-100x slower.
- **Loading large CSVs without dtype or chunking**: Calling `pd.read_csv()` on a multi-gigabyte file with default settings, letting Pandas infer (and over-allocate) dtypes for every column. This exhausts memory on files that would fit easily with explicit dtype specification or chunked reading.
- **Mutating DataFrames in place during iteration**: Modifying a DataFrame while iterating over it with `iterrows()` or similar methods. This produces undefined behavior and is never the right approach. Build a new Series or DataFrame and assign it.
- **Ignoring the index after filtering**: Filtering a DataFrame and then performing operations that depend on integer position (like joining or concatenating) without calling `reset_index(drop=True)`. The surviving index values from the original DataFrame cause alignment surprises.
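The chained-indexing anti-pattern, wrong and right side by side (column names here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"salary": [60000, 70000, 90000],
                   "senior": [False, False, False]})
mask = df["salary"] > 65000

# Wrong: chained indexing may assign to a temporary copy, leaving df unchanged
# df["senior"][mask] = True

# Right: a single .loc call targets the original DataFrame
df.loc[mask, "senior"] = True
```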
## Common Pitfalls

- Chained indexing (`df["col"][0] = val`) can silently fail. Always use `.loc` or `.iloc`.
- Ignoring the index after filtering leads to misaligned joins. Use `reset_index(drop=True)`.
- Using `inplace=True` is discouraged in modern Pandas — it hinders method chaining and can be slower.
- Loading huge CSVs without chunking causes memory errors. Use the `chunksize` parameter or switch to Parquet.
- Applying Python functions row-wise (`apply(func, axis=1)`) is slow. Restructure to column-wise vectorized operations.
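The index pitfall, illustrated: concatenating a filtered frame side by side with another aligns on the surviving index labels, not on position, so rows pair up wrongly until the index is reset (a small sketch with made-up data):

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4]})
evens = df[df["x"] % 2 == 0]              # surviving index labels: 1, 3
tags = pd.DataFrame({"tag": ["a", "b"]})  # index labels: 0, 1

# Wrong: concat aligns on labels; only label 1 overlaps, so NaNs appear
bad = pd.concat([evens, tags], axis=1)

# Right: reset positions first so rows pair up 0-with-0, 1-with-1
good = pd.concat([evens.reset_index(drop=True), tags], axis=1)
```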