# Data Cleaning
Expert guidance on data cleaning and preprocessing techniques for preparing raw data for analysis and modeling.
You are an expert in data cleaning and preprocessing for data analysis and science.
## Overview

Data cleaning is the process of detecting and correcting (or removing) corrupt, inaccurate, and inconsistent records from a dataset. In practice, data scientists spend 60-80% of their time on cleaning and preparation. Reliable cleaning pipelines are essential: a model trained on dirty data produces dirty predictions.
## Core Concepts

### Profiling the Dataset

```python
import pandas as pd
import numpy as np

df = pd.read_csv("raw_data.csv")

# Shape and types
print(df.shape)
print(df.dtypes)
print(df.info())

# Summary statistics
print(df.describe())
print(df.describe(include="object"))

# Missing values
print(df.isna().sum().sort_values(ascending=False))
print(f"Total missing: {df.isna().sum().sum()} / {df.size} ({df.isna().sum().sum() / df.size:.1%})")

# Duplicates
print(f"Duplicate rows: {df.duplicated().sum()}")
```
### Handling Missing Values

```python
# Drop rows where key columns are missing
df = df.dropna(subset=["target", "id"])

# Indicator column for missingness (capture BEFORE filling,
# or it will be all zeros)
df["income_missing"] = df["income"].isna().astype(int)

# Fill numeric with median (robust to outliers)
df["income"] = df["income"].fillna(df["income"].median())

# Fill categorical with mode or a sentinel
df["category"] = df["category"].fillna("Unknown")

# Forward/backward fill for time series
df["value"] = df.groupby("id")["value"].transform(lambda x: x.ffill().bfill())
```
### Handling Duplicates

```python
# Exact duplicates
df = df.drop_duplicates()

# Duplicates on key columns (keep latest)
df = df.sort_values("timestamp").drop_duplicates(subset=["user_id", "event_type"], keep="last")
```
### Data Type Corrections

```python
# String to datetime
df["date"] = pd.to_datetime(df["date"], format="%Y-%m-%d", errors="coerce")

# String to numeric
df["price"] = pd.to_numeric(df["price"].str.replace("[$,]", "", regex=True), errors="coerce")

# Boolean from string
df["is_active"] = df["status"].map({"active": True, "inactive": False})

# Category dtype for low-cardinality strings
df["region"] = df["region"].astype("category")
```
## Implementation Patterns

### Outlier Detection and Treatment

```python
# IQR method
Q1 = df["value"].quantile(0.25)
Q3 = df["value"].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
outliers = df[(df["value"] < lower) | (df["value"] > upper)]
print(f"Outliers: {len(outliers)} ({len(outliers)/len(df):.1%})")

# Capping (winsorizing)
df["value_capped"] = df["value"].clip(lower=lower, upper=upper)

# Z-score method (compute z-scores on the non-null subset so the
# boolean mask aligns with the rows being filtered)
from scipy import stats
valid = df.dropna(subset=["value"])
z_scores = np.abs(stats.zscore(valid["value"]))
df_clean = valid[z_scores < 3]
```
### String Cleaning

```python
# Whitespace and case normalization
df["name"] = df["name"].str.strip().str.lower()

# Standardize formats
df["phone"] = df["phone"].str.replace(r"[^\d]", "", regex=True)
df["email"] = df["email"].str.lower().str.strip()

# Normalize unicode (full-width characters, compatibility forms);
# an encode/decode round trip through the same codec is a no-op
df["text"] = df["text"].str.normalize("NFKC")

# Regex extraction
df["zip_code"] = df["address"].str.extract(r"(\d{5})")
```
### Validation Assertions

```python
def validate_dataframe(df: pd.DataFrame) -> None:
    """Run data quality checks; raise on failure."""
    assert df["id"].is_unique, "Duplicate IDs found"
    assert df["age"].between(0, 120).all(), "Age out of range"
    assert df["email"].str.contains("@").all(), "Invalid emails found"
    assert df["date"].notna().all(), "Missing dates found"
    assert df.shape[0] > 0, "DataFrame is empty"
    print(f"All checks passed. {len(df)} rows.")

validate_dataframe(df)
```
### Reproducible Cleaning Pipeline

```python
def clean_sales_data(raw_path: str) -> pd.DataFrame:
    """Load and clean raw sales data."""
    df = pd.read_csv(raw_path)
    # 1. Standardize column names
    df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
    # 2. Drop exact duplicates
    df = df.drop_duplicates()
    # 3. Parse dates
    df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
    # 4. Fix types
    df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
    # 5. Handle missing
    df = df.dropna(subset=["order_id", "order_date"])
    df["amount"] = df["amount"].fillna(0)
    # 6. Remove outliers
    df = df[df["amount"] >= 0]
    df = df[df["amount"] < df["amount"].quantile(0.999)]
    # 7. Validate
    assert df["order_id"].is_unique
    assert df["amount"].ge(0).all()
    return df.reset_index(drop=True)
```
### Schema Enforcement with Pandera

```python
import pandera as pa

schema = pa.DataFrameSchema({
    "id": pa.Column(int, pa.Check.gt(0), unique=True),
    "name": pa.Column(str, pa.Check.str_length(min_value=1)),
    "age": pa.Column(int, pa.Check.in_range(0, 120)),
    "email": pa.Column(str, pa.Check.str_matches(r".+@.+\..+")),
    "salary": pa.Column(float, pa.Check.gt(0), nullable=True),
})

validated_df = schema.validate(df)
```
## Best Practices

- Profile first, clean second. Understand the shape of the data before writing any cleaning code.
- Make cleaning reproducible: write functions, not ad hoc cell-by-cell notebook edits.
- Document every decision: why was a column dropped? Why was the median chosen over the mean for imputation?
- Preserve raw data: never overwrite the source file. Write cleaned output to a separate path.
- Add validation checks at the end of every cleaning pipeline so regressions are caught early.
- Use `errors="coerce"` on type conversions to surface problems as NaN rather than crashing.
- Standardize column names early (`str.lower().str.replace(" ", "_")`) to avoid case-mismatch bugs.
## Core Philosophy

Data cleaning is not a chore to rush through before the "real work" of modeling begins; it is the real work. The quality ceiling of any analysis or model is set by the quality of its input data. A data scientist who treats cleaning as a first-class engineering discipline, with version-controlled pipelines, explicit assumptions, and validation gates, will consistently outperform one who treats it as ad hoc notebook surgery.

Cleaning should be reproducible, documented, and defensive. Every decision (why a column was dropped, why median imputation was chosen over mean, why certain rows were excluded) should be recorded so that future collaborators, or your future self, can audit the pipeline. Raw data is sacred; it should never be modified in place. The cleaned output is a derived artifact, and the transformation from raw to clean should be a repeatable function, not a series of unreproducible cell executions.

Think of data cleaning as building trust in your data. Profiling and validation are not optional extras; they are the mechanisms by which you verify that your assumptions hold. A pipeline without assertions is a pipeline waiting to silently produce wrong results. When in doubt, be conservative: flag anomalies for human review rather than silently dropping or imputing them.
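A minimal sketch of that conservative stance (the `order_id`/`amount` columns and thresholds are illustrative, not prescribed by this skill): route suspect rows to a review queue instead of silently dropping them.

```python
import pandas as pd

df = pd.DataFrame({"order_id": [1, 2, 3, 4],
                   "amount": [25.0, 19.5, -4.0, 90000.0]})

# Flag rather than drop: negative or extreme amounts go into a review
# queue, and the pipeline continues on the clean remainder.
suspect = (df["amount"] < 0) | (df["amount"] > df["amount"].quantile(0.99))
review = df[suspect]                        # hand these rows to a human
clean = df[~suspect].reset_index(drop=True)
```

In a real pipeline `review` would be written to a separate path (never over the raw file), so the anomalies stay auditable.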
## Anti-Patterns

- **Cleaning by coincidence**: making ad hoc fixes in notebook cells without wrapping them in a reusable function. The result is a cleaning process that cannot be reproduced, audited, or applied to new data batches.
- **Imputation before splitting**: computing fill values (mean, median, mode) on the full dataset and then splitting into train/test. This leaks test-set statistics into training and inflates evaluation metrics.
- **Silent data loss**: dropping rows with missing values across many columns without checking how much data is lost or whether the remaining data is still representative. A 40% row drop can introduce severe selection bias.
- **One-size-fits-all missing value strategy**: applying the same fill method (e.g., zero-fill) to every column regardless of its semantics. A missing age, a missing revenue figure, and a missing categorical label all require different treatment.
- **Skipping validation at the end of the pipeline**: writing cleaning code but never asserting that the output meets expectations. Without assertions on uniqueness, ranges, non-null constraints, and data types, regressions go undetected until they corrupt a model or report.
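The leak-free ordering for imputation can be sketched as follows (the `income` column and the positional split are illustrative; in practice the split would be randomized): split first, then fit the statistic on the training rows only.

```python
import pandas as pd

df = pd.DataFrame({"income": [40.0, None, 55.0, 70.0, None, 52.0]})

# Split FIRST (a simple positional split for illustration), then impute.
train, test = df.iloc[:4].copy(), df.iloc[4:].copy()

# The fill value comes from the training rows only, so no test-set
# statistics leak into training.
train_median = train["income"].median()   # median of 40, 55, 70
train["income"] = train["income"].fillna(train_median)
test["income"] = test["income"].fillna(train_median)
```

The same discipline applies to any fitted transform (scalers, encoders): fit on train, apply to both splits.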
## Common Pitfalls

- Dropping rows carelessly: removing 30% of data due to one missing column may introduce bias. Consider imputation.
- Filling missing values with the mean on skewed data: the median is more robust; for categorical columns, use a sentinel or mode.
- Cleaning after splitting: imputation statistics (mean, median) must be computed on the training set only.
- Ignoring data dictionaries: column meaning is not always obvious from the name. Check documentation first.
- Silent type coercion: Pandas may read "NA" as `NaN` or numeric IDs as floats. Inspect `dtypes` after every read.
- Not checking for near-duplicates: records may differ only in whitespace or casing and still represent the same entity.
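To illustrate the near-duplicate pitfall, a small sketch (the `name` column is hypothetical): deduplicate on a normalized key so whitespace and casing variants collapse, while keeping the original spelling in the output.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Acme Corp", " acme corp ", "ACME CORP", "Beta LLC"]})

# drop_duplicates() alone sees three distinct "Acme" strings.
# Build a normalized key and keep the first row per key instead.
key = df["name"].str.strip().str.lower().str.replace(r"\s+", " ", regex=True)
df = df.loc[~key.duplicated()].reset_index(drop=True)
```

For messier data (typos, abbreviations) fuzzy matching is needed, but key normalization catches the common whitespace/case cases cheaply.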
## Related Skills

- **Feature Engineering**: Expert guidance on feature engineering patterns for transforming raw data into predictive ML features.
- **Jupyter**: Expert guidance on Jupyter notebooks for interactive data exploration, documentation, and reproducible analysis.
- **Matplotlib**: Expert guidance on Matplotlib for creating static, animated, and interactive visualizations in Python.
- **Numpy**: Expert guidance on NumPy for numerical computing, array operations, and linear algebra in Python.
- **Pandas**: Expert guidance on Pandas for tabular data manipulation, transformation, and analysis in Python.
- **Polars**: Expert guidance on Polars for high-performance dataframe operations with a lazy query engine in Python.