Data Preprocessing
Systematic approach to data cleaning, transformation, and feature preparation for machine learning models.
Overview
Data preprocessing transforms raw data into a clean, structured format suitable for machine learning models. It typically consumes 60-80% of project time and has more impact on model performance than algorithm selection. Poor preprocessing introduces bias, leaks information, and produces unreliable models.
Use this skill when preparing datasets for model training, when model performance plateaus and you suspect data quality issues, or when building reproducible preprocessing pipelines for production.
Core Framework
Preprocessing Pipeline Stages
- Data Audit: Profile the dataset for shape, types, distributions, and quality.
- Cleaning: Handle missing values, duplicates, and corrupted records.
- Transformation: Scale, normalize, encode, and reshape features.
- Feature Engineering: Create new features from existing ones to capture domain knowledge.
- Feature Selection: Remove irrelevant or redundant features.
- Validation: Verify the pipeline preserves data integrity and does not leak target information.
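The audit stage can be sketched as a small profiling helper (a minimal sketch using pandas; the `profile` function name and the toy frame are illustrative, not part of any library):

```python
import pandas as pd

def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column audit: dtype, null counts, null fraction, unique values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "nulls": df.isna().sum(),
        "null_pct": df.isna().mean().round(3),
        "unique": df.nunique(),  # nunique() excludes NaN
    })

# Toy frame for illustration
df = pd.DataFrame({"age": [25, None, 40], "city": ["NY", "NY", None]})
print(profile(df))
```

Distributions (histograms, `df.describe()`) round out the picture for numeric columns.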
Missing Value Strategies
| Strategy | When to Use |
|---|---|
| Drop rows | <5% missing, MCAR (missing completely at random) |
| Mean/median imputation | Numeric, low missingness, no strong skew |
| Mode imputation | Categorical features with low cardinality |
| KNN imputation | Features have meaningful neighbor relationships |
| Indicator variable | Missingness itself is informative |
| Model-based (MICE) | Complex missingness patterns, sufficient data |
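Several rows of the table map directly onto scikit-learn imputers; a brief sketch (the toy matrix is illustrative):

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan], [4.0, 5.0]])

# Median imputation; add_indicator=True appends a missingness
# indicator column per feature (the "indicator variable" strategy)
imp = SimpleImputer(strategy="median", add_indicator=True)
X_med = imp.fit_transform(X)  # 2 imputed cols + 2 indicator cols

# KNN imputation: fill from the nearest complete rows
knn = KNNImputer(n_neighbors=2)
X_knn = knn.fit_transform(X)
```

For MICE-style model-based imputation, scikit-learn offers `IterativeImputer` (still marked experimental).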
Process
- Load data and generate a profiling report (shape, dtypes, null counts, unique values, distributions).
- Remove exact duplicate rows; investigate near-duplicates.
- Identify and handle missing values using the strategy table above.
- Detect outliers using IQR, z-score, or domain-specific thresholds; decide to cap, transform, or remove.
- Encode categorical variables: one-hot for low cardinality (<15 levels), target encoding or embeddings for high cardinality.
- Scale numeric features: StandardScaler for linear models, MinMaxScaler for neural networks, leave unscaled for tree models.
- Engineer domain-specific features: ratios, aggregations, time-based features, interaction terms.
- Apply feature selection: remove zero-variance features, drop one of each highly correlated pair (|r| > 0.95), and prune low-importance features.
- Split data into train/validation/test before any fit-based transformation to prevent leakage.
- Wrap all steps in a reproducible pipeline (sklearn Pipeline, or equivalent).
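Taken together, the steps above might look like the following sketch (the column names and toy frame are hypothetical; adapt the transformers to your schema):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric = ["age", "income"]      # hypothetical columns
categorical = ["city"]

preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical),
])

df = pd.DataFrame({
    "age": [25, None, 40, 33],
    "income": [50_000, 64_000, None, 58_000],
    "city": ["NY", "SF", "NY", np.nan],
})
y = [0, 1, 0, 1]

# Split first, then fit only on the training partition
X_tr, X_te, y_tr, y_te = train_test_split(df, y, test_size=0.5, random_state=0)
Xt_tr = preprocess.fit_transform(X_tr)   # statistics learned from train only
Xt_te = preprocess.transform(X_te)       # reuses the train statistics
```

Persisting the fitted `preprocess` object (e.g. with `joblib`) gives you the same transformations at inference time.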
Key Principles
- Always split before fitting transformers; fitting on test data causes information leakage.
- Preserve the preprocessing pipeline object for inference; never recompute statistics at prediction time.
- Document every transformation decision with rationale for reproducibility.
- Log distributions before and after transformations to catch errors.
- Handle skewed features with log or Box-Cox transforms before scaling.
- Time-series data requires time-aware splitting; never shuffle temporal data randomly.
- Categorical encoding choice materially affects model performance; experiment with multiple approaches.
Common Pitfalls
- Fitting scalers or encoders on the full dataset before splitting (data leakage).
- Dropping missing values without analyzing the missingness mechanism (MCAR vs MNAR).
- One-hot encoding high-cardinality features, creating thousands of sparse columns.
- Ignoring feature interactions that domain experts would consider obvious.
- Scaling features for tree-based models (unnecessary and sometimes harmful).
- Fitting preprocessing on train and test sets independently; instead, fit on train and transform both.
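The leakage pitfalls can be made concrete: fitting a scaler on the full dataset bakes test-set statistics into training. A minimal sketch with a toy array:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.arange(10, dtype=float).reshape(-1, 1)
X_train, X_test = X[:7], X[7:]

# Wrong: statistics computed over the full dataset leak test information
leaky = StandardScaler().fit(X)

# Right: fit on train only, then reuse those statistics for test
scaler = StandardScaler().fit(X_train)
X_test_scaled = scaler.transform(X_test)

# The learned means differ: 4.5 (full data) vs 3.0 (train only)
print(leaky.mean_, scaler.mean_)
```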
Output Format
When delivering a preprocessing plan:
- Data Profile: Key statistics and quality issues identified.
- Cleaning Decisions: Each issue and the chosen remedy with rationale.
- Transformation Pipeline: Ordered list of transformations with parameters.
- Feature Engineering: New features created with formulas and justification.
- Pipeline Code: Reproducible code or pseudocode for the full pipeline.
- Validation Checks: Assertions to verify pipeline correctness.
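Validation checks can be delivered as plain assertions; a sketch (the `validate` helper and its specific checks are illustrative):

```python
import numpy as np
import pandas as pd

def validate(train: pd.DataFrame, test: pd.DataFrame, Xt_train, Xt_test):
    """Sanity assertions for a fitted preprocessing pipeline (illustrative)."""
    assert Xt_train.shape[0] == len(train), "row count changed during transform"
    assert Xt_train.shape[1] == Xt_test.shape[1], "train/test feature mismatch"
    assert not np.isnan(np.asarray(Xt_train, dtype=float)).any(), "NaNs survived imputation"
    assert train.index.intersection(test.index).empty, "train/test index overlap (leakage)"

# Toy demonstration: disjoint indices, matching shapes, no NaNs
train = pd.DataFrame({"age": [25.0, 40.0]}, index=[0, 1])
test = pd.DataFrame({"age": [33.0]}, index=[2])
validate(train, test, np.array([[0.0], [1.0]]), np.array([[0.5]]))
```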
Related Skills
Computer Vision Pipeline Design
Designing computer vision pipelines for image and video analysis tasks. Covers
ML Deployment and MLOps
ML model deployment and MLOps practices for production systems. Covers serving
ML Model Evaluation
Comprehensive model evaluation and metrics selection for machine learning. Covers
ML Model Selection
Guides you through choosing the right machine learning model for a given problem.
Neural Network Architecture Design
Guides the design of neural network architectures for various tasks. Covers layer
NLP Pipeline Design
Designing end-to-end natural language processing pipelines from text ingestion to