
Data Preprocessing

Systematic approach to data cleaning, transformation, and feature preparation for machine learning models



Overview

Data preprocessing transforms raw data into a clean, structured format suitable for machine learning models. It typically consumes 60-80% of project time and has more impact on model performance than algorithm selection. Poor preprocessing introduces bias, leaks information, and produces unreliable models.

Use this skill when preparing datasets for model training, when model performance plateaus and you suspect data quality issues, or when building reproducible preprocessing pipelines for production.

Core Framework

Preprocessing Pipeline Stages

  1. Data Audit: Profile the dataset for shape, types, distributions, and quality.
  2. Cleaning: Handle missing values, duplicates, and corrupted records.
  3. Transformation: Scale, normalize, encode, and reshape features.
  4. Feature Engineering: Create new features from existing ones to capture domain knowledge.
  5. Feature Selection: Remove irrelevant or redundant features.
  6. Validation: Verify the pipeline preserves data integrity and does not leak target information.
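The audit stage (step 1) can be sketched with a small pandas helper; the dataset and column names below are hypothetical, purely for illustration.

```python
import numpy as np
import pandas as pd

def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column quality summary: dtype, null count, null %, unique values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "nulls": df.isna().sum(),
        "null_pct": (df.isna().mean() * 100).round(1),
        "unique": df.nunique(),  # nunique() excludes NaN by default
    })

# Hypothetical toy dataset
df = pd.DataFrame({
    "age": [25, 30, np.nan, 41],
    "city": ["NY", "NY", "LA", None],
})
report = profile(df)
```

Running the profiler before and after each pipeline stage makes it easy to spot columns whose null counts or cardinality changed unexpectedly.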

Missing Value Strategies

  • Drop rows: <5% missing, MCAR (missing completely at random)
  • Mean/median imputation: numeric, low missingness, no strong skew
  • Mode imputation: categorical features with low cardinality
  • KNN imputation: features have meaningful neighbor relationships
  • Indicator variable: missingness itself is informative
  • Model-based (MICE): complex missingness patterns, sufficient data

Process

  1. Load data and generate a profiling report (shape, dtypes, null counts, unique values, distributions).
  2. Remove exact duplicate rows; investigate near-duplicates.
  3. Identify and handle missing values using the strategy table above.
  4. Detect outliers using IQR, z-score, or domain-specific thresholds; decide to cap, transform, or remove.
  5. Encode categorical variables: one-hot for low cardinality (<15 levels), target encoding or embeddings for high cardinality.
  6. Scale numeric features: StandardScaler for linear models, MinMaxScaler for neural networks, leave unscaled for tree models.
  7. Engineer domain-specific features: ratios, aggregations, time-based features, interaction terms.
  8. Apply feature selection: remove zero-variance features, drop one of each highly correlated pair (|correlation| > 0.95), and prune low-importance features.
  9. Split data into train/validation/test before any fit-based transformation to prevent leakage.
  10. Wrap all steps in a reproducible pipeline (sklearn Pipeline, or equivalent).
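The steps above can be condensed into a single reproducible sklearn pipeline. This is a minimal sketch: the synthetic dataset, column names, and the logistic-regression model are illustrative assumptions, not prescriptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical dataset
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.normal(40, 10, 200),
    "income": rng.lognormal(10, 1, 200),
    "city": rng.choice(["NY", "LA", "SF"], 200),
})
y = (df["income"] > df["income"].median()).astype(int)

preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), ["age", "income"]),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["city"]),
])

model = Pipeline([("prep", preprocess), ("clf", LogisticRegression(max_iter=1000))])

# Split FIRST, then fit: imputation and scaling statistics are learned
# from the training fold only, never from test rows (step 9).
X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.25, random_state=0
)
model.fit(X_train, y_train)
score = model.score(X_test, y_test)
```

Because the fitted `model` object carries every learned statistic, the same object can be serialized and reused at inference time without recomputing anything.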

Key Principles

  • Always split before fitting transformers; fitting on test data causes information leakage.
  • Preserve the preprocessing pipeline object for inference; never recompute statistics at prediction time.
  • Document every transformation decision with rationale for reproducibility.
  • Log distributions before and after transformations to catch errors.
  • Handle skewed features with log or Box-Cox transforms before scaling.
  • Time-series data requires time-aware splitting; never shuffle temporal data randomly.
  • Categorical encoding choice materially affects model performance; experiment with multiple approaches.
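For the time-series principle above, one way to get time-aware splits is sklearn's `TimeSeriesSplit`; the ten-row array here is a stand-in for time-ordered observations.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 10 observations already sorted in time order
X = np.arange(10).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
folds = list(tscv.split(X))

# Every training index precedes every validation index:
# the model never sees the future during fitting.
for train_idx, val_idx in folds:
    assert train_idx.max() < val_idx.min()
```

Each successive fold extends the training window forward and validates on the next block, mimicking how the model would be retrained and deployed over time.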

Common Pitfalls

  • Fitting scalers or encoders on the full dataset before splitting (data leakage).
  • Dropping missing values without analyzing the missingness mechanism (MCAR vs MNAR).
  • One-hot encoding high-cardinality features, creating thousands of sparse columns.
  • Ignoring feature interactions that domain experts would consider obvious.
  • Scaling features for tree-based models (unnecessary and sometimes harmful).
  • Applying the same preprocessing to train and test independently (fit on train, transform both).

Output Format

When delivering a preprocessing plan:

  1. Data Profile: Key statistics and quality issues identified.
  2. Cleaning Decisions: Each issue and the chosen remedy with rationale.
  3. Transformation Pipeline: Ordered list of transformations with parameters.
  4. Feature Engineering: New features created with formulas and justification.
  5. Pipeline Code: Reproducible code or pseudocode for the full pipeline.
  6. Validation Checks: Assertions to verify pipeline correctness.