# Data Cleaning
Expert guidance on data cleaning and preprocessing techniques for preparing raw data for analysis and modeling.
You are an expert in data cleaning and preprocessing for data analysis and science.
## Overview

Data cleaning is the process of detecting and correcting (or removing) corrupt, inaccurate, and inconsistent records from a dataset. In practice, data scientists spend 60-80% of their time on cleaning and preparation. Reliable cleaning pipelines are essential: a model trained on dirty data produces dirty predictions.
## Core Concepts

### Profiling the Dataset

```python
import pandas as pd
import numpy as np

df = pd.read_csv("raw_data.csv")

# Shape and types
print(df.shape)
print(df.dtypes)
print(df.info())

# Summary statistics
print(df.describe())
print(df.describe(include="object"))

# Missing values
print(df.isna().sum().sort_values(ascending=False))
print(f"Total missing: {df.isna().sum().sum()} / {df.size} ({df.isna().sum().sum() / df.size:.1%})")

# Duplicates
print(f"Duplicate rows: {df.duplicated().sum()}")
```
### Handling Missing Values

```python
# Drop rows where key columns are missing
df = df.dropna(subset=["target", "id"])

# Indicator column for missingness (capture BEFORE filling,
# or it will be all zeros)
df["income_missing"] = df["income"].isna().astype(int)

# Fill numeric with median (robust to outliers)
df["income"] = df["income"].fillna(df["income"].median())

# Fill categorical with mode or a sentinel
df["category"] = df["category"].fillna("Unknown")

# Forward/backward fill for time series
df["value"] = df.groupby("id")["value"].transform(lambda x: x.ffill().bfill())
```
### Handling Duplicates

```python
# Exact duplicates
df = df.drop_duplicates()

# Duplicates on key columns (keep latest)
df = df.sort_values("timestamp").drop_duplicates(subset=["user_id", "event_type"], keep="last")
```
### Data Type Corrections

```python
# String to datetime
df["date"] = pd.to_datetime(df["date"], format="%Y-%m-%d", errors="coerce")

# String to numeric
df["price"] = pd.to_numeric(df["price"].str.replace("[$,]", "", regex=True), errors="coerce")

# Boolean from string
df["is_active"] = df["status"].map({"active": True, "inactive": False})

# Category dtype for low-cardinality strings
df["region"] = df["region"].astype("category")
```
## Implementation Patterns

### Outlier Detection and Treatment

```python
# IQR method
Q1 = df["value"].quantile(0.25)
Q3 = df["value"].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
outliers = df[(df["value"] < lower) | (df["value"] > upper)]
print(f"Outliers: {len(outliers)} ({len(outliers)/len(df):.1%})")

# Capping (winsorizing)
df["value_capped"] = df["value"].clip(lower=lower, upper=upper)

# Z-score method (compute z-scores on the non-null subset so the
# boolean mask aligns with the rows being filtered)
from scipy import stats
valid = df.dropna(subset=["value"])
z_scores = np.abs(stats.zscore(valid["value"]))
df_clean = valid[z_scores < 3]
```
### String Cleaning

```python
# Whitespace and case normalization
df["name"] = df["name"].str.strip().str.lower()

# Standardize formats
df["phone"] = df["phone"].str.replace(r"[^\d]", "", regex=True)
df["email"] = df["email"].str.lower().str.strip()

# Normalize unicode (full-width characters, compatibility forms);
# an encode/decode round trip through the same codec is a no-op
df["text"] = df["text"].str.normalize("NFKC")

# Regex extraction
df["zip_code"] = df["address"].str.extract(r"(\d{5})")
```
### Validation Assertions

```python
def validate_dataframe(df: pd.DataFrame) -> None:
    """Run data quality checks; raise on failure."""
    assert df["id"].is_unique, "Duplicate IDs found"
    assert df["age"].between(0, 120).all(), "Age out of range"
    assert df["email"].str.contains("@").all(), "Invalid emails found"
    assert df["date"].notna().all(), "Missing dates found"
    assert df.shape[0] > 0, "DataFrame is empty"
    print(f"All checks passed. {len(df)} rows.")

validate_dataframe(df)
```
### Reproducible Cleaning Pipeline

```python
def clean_sales_data(raw_path: str) -> pd.DataFrame:
    """Load and clean raw sales data."""
    df = pd.read_csv(raw_path)
    # 1. Standardize column names
    df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
    # 2. Drop exact duplicates
    df = df.drop_duplicates()
    # 3. Parse dates
    df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
    # 4. Fix types
    df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
    # 5. Handle missing
    df = df.dropna(subset=["order_id", "order_date"])
    df["amount"] = df["amount"].fillna(0)
    # 6. Remove outliers
    df = df[df["amount"] >= 0]
    df = df[df["amount"] < df["amount"].quantile(0.999)]
    # 7. Validate
    assert df["order_id"].is_unique
    assert df["amount"].ge(0).all()
    return df.reset_index(drop=True)
```
### Schema Enforcement with Pandera

```python
import pandera as pa

schema = pa.DataFrameSchema({
    "id": pa.Column(int, pa.Check.gt(0), unique=True),
    "name": pa.Column(str, pa.Check.str_length(min_value=1)),
    "age": pa.Column(int, pa.Check.in_range(0, 120)),
    "email": pa.Column(str, pa.Check.str_matches(r".+@.+\..+")),
    "salary": pa.Column(float, pa.Check.gt(0), nullable=True),
})

validated_df = schema.validate(df)
```
## Best Practices

- Profile first, clean second. Understand the shape of the data before writing any cleaning code.
- Make cleaning reproducible: write functions, not ad hoc cell-by-cell notebook edits.
- Document every decision: why was a column dropped? Why was the median chosen over the mean for imputation?
- Preserve raw data: never overwrite the source file. Write cleaned output to a separate path.
- Add validation checks at the end of every cleaning pipeline so regressions are caught early.
- Use `errors="coerce"` on type conversions to surface problems as NaN rather than crashing.
- Standardize column names early (`str.lower().str.replace(" ", "_")`) to avoid case-mismatch bugs.
## Core Philosophy

Data cleaning is not a chore to rush through before the "real work" of modeling begins; it is the real work. The quality ceiling of any analysis or model is set by the quality of its input data. A data scientist who treats cleaning as a first-class engineering discipline, with version-controlled pipelines, explicit assumptions, and validation gates, will consistently outperform one who treats it as ad hoc notebook surgery.

Cleaning should be reproducible, documented, and defensive. Every decision (why a column was dropped, why median imputation was chosen over mean, why certain rows were excluded) should be recorded so that future collaborators, or your future self, can audit the pipeline. Raw data is sacred; it should never be modified in place. The cleaned output is a derived artifact, and the transformation from raw to clean should be a repeatable function, not a series of unreproducible cell executions.

Think of data cleaning as building trust in your data. Profiling and validation are not optional extras; they are the mechanisms by which you verify that your assumptions hold. A pipeline without assertions is a pipeline waiting to silently produce wrong results. When in doubt, be conservative: flag anomalies for human review rather than silently dropping or imputing them.
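A minimal sketch of that conservative stance (the `order_id`/`amount` columns and thresholds are illustrative, not prescribed by this skill): route suspect rows to a review queue instead of silently dropping them.

```python
import pandas as pd

df = pd.DataFrame({"order_id": [1, 2, 3, 4],
                   "amount": [25.0, 19.5, -4.0, 90000.0]})

# Flag rather than drop: negative or extreme amounts go into a review
# queue, and the pipeline continues on the clean remainder.
suspect = (df["amount"] < 0) | (df["amount"] > df["amount"].quantile(0.99))
review = df[suspect]                        # hand these rows to a human
clean = df[~suspect].reset_index(drop=True)
```

In a real pipeline `review` would be written to a separate path (never over the raw file), so the anomalies stay auditable.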
## Anti-Patterns

- **Cleaning by coincidence**: making ad hoc fixes in notebook cells without wrapping them in a reusable function. The result is a cleaning process that cannot be reproduced, audited, or applied to new data batches.
- **Imputation before splitting**: computing fill values (mean, median, mode) on the full dataset and then splitting into train/test. This leaks test-set statistics into training and inflates evaluation metrics.
- **Silent data loss**: dropping rows with missing values across many columns without checking how much data is lost or whether the remaining data is still representative. A 40% row drop can introduce severe selection bias.
- **One-size-fits-all missing value strategy**: applying the same fill method (e.g., zero-fill) to every column regardless of its semantics. A missing age, a missing revenue figure, and a missing categorical label all require different treatment.
- **Skipping validation at the end of the pipeline**: writing cleaning code but never asserting that the output meets expectations. Without assertions on uniqueness, ranges, non-null constraints, and data types, regressions go undetected until they corrupt a model or report.
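The leak-free ordering for imputation can be sketched as follows (the `income` column and the positional split are illustrative; in practice the split would be randomized): split first, then fit the statistic on the training rows only.

```python
import pandas as pd

df = pd.DataFrame({"income": [40.0, None, 55.0, 70.0, None, 52.0]})

# Split FIRST (a simple positional split for illustration), then impute.
train, test = df.iloc[:4].copy(), df.iloc[4:].copy()

# The fill value comes from the training rows only, so no test-set
# statistics leak into training.
train_median = train["income"].median()   # median of 40, 55, 70
train["income"] = train["income"].fillna(train_median)
test["income"] = test["income"].fillna(train_median)
```

The same discipline applies to any fitted transform (scalers, encoders): fit on train, apply to both splits.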
## Common Pitfalls

- Dropping rows carelessly: removing 30% of data due to one missing column may introduce bias. Consider imputation.
- Filling missing values with the mean on skewed data: the median is more robust; for categorical columns, use a sentinel or mode.
- Cleaning after splitting: imputation statistics (mean, median) must be computed on the training set only.
- Ignoring data dictionaries: column meaning is not always obvious from the name. Check documentation first.
- Silent type coercion: Pandas may read "NA" as `NaN` or numeric IDs as floats. Inspect `dtypes` after every read.
- Not checking for near-duplicates: records may differ only in whitespace or casing and still represent the same entity.
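To illustrate the near-duplicate pitfall, a small sketch (the `name` column is hypothetical): deduplicate on a normalized key so whitespace and casing variants collapse, while keeping the original spelling in the output.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Acme Corp", " acme corp ", "ACME CORP", "Beta LLC"]})

# drop_duplicates() alone sees three distinct "Acme" strings.
# Build a normalized key and keep the first row per key instead.
key = df["name"].str.strip().str.lower().str.replace(r"\s+", " ", regex=True)
df = df.loc[~key.duplicated()].reset_index(drop=True)
```

For messier data (typos, abbreviations) fuzzy matching is needed, but key normalization catches the common whitespace/case cases cheaply.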
## Related Skills

- **Feature Engineering**: Expert guidance on feature engineering patterns for transforming raw data into predictive ML features.
- **Jupyter**: Expert guidance on Jupyter notebooks for interactive data exploration, documentation, and reproducible analysis.
- **Matplotlib**: Expert guidance on Matplotlib for creating static, animated, and interactive visualizations in Python.
- **Numpy**: Expert guidance on NumPy for numerical computing, array operations, and linear algebra in Python.
- **Pandas**: Expert guidance on Pandas for tabular data manipulation, transformation, and analysis in Python.
- **Polars**: Expert guidance on Polars for high-performance dataframe operations with a lazy query engine in Python.