# ML Model Evaluation

Comprehensive model evaluation and metrics selection for machine learning. Covers metric selection, validation design, statistical significance testing, and evaluation reporting.
## Overview
Model evaluation determines whether a machine learning model meets performance requirements and generalizes to unseen data. Choosing the wrong metric or validation strategy leads to deploying models that fail in production. This skill covers metric selection, validation design, statistical significance testing, and evaluation reporting.
Use this skill when selecting evaluation metrics for a new project, when comparing multiple models, when validating that a model is production-ready, or when stakeholders need to understand model performance in business terms.
## Core Framework
### Metric Selection by Task
**Classification:**
- Balanced classes: Accuracy, F1-macro
- Imbalanced classes: F1, Precision-Recall AUC, Matthews Correlation Coefficient
- Ranking needed: ROC-AUC, Average Precision
- Cost-sensitive: Custom cost matrix, expected cost
**Regression:**
- General: RMSE, MAE
- Relative errors matter: MAPE, sMAPE
- Outlier-robust: Median Absolute Error
- Explained variance: R-squared (with caution)
**Ranking / Recommendation:**
- NDCG, MAP, MRR, Hit Rate at K
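The gap between these metrics shows up immediately on skewed data. A minimal sketch (using scikit-learn; the labels and scores are synthetic and purely illustrative) of how accuracy flatters a majority-class predictor on a 95/5 split while F1 and MCC expose it:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, average_precision_score,
                             f1_score, matthews_corrcoef)

rng = np.random.default_rng(0)

# 95/5 imbalanced labels and a "predict the majority class" baseline
y_true = (rng.random(1000) < 0.05).astype(int)
y_majority = np.zeros_like(y_true)
# noisy scores loosely correlated with the true label, for PR-AUC
scores = rng.random(1000) * 0.5 + y_true * 0.3

print("accuracy:", accuracy_score(y_true, y_majority))       # ~0.95, flattering
print("F1:", f1_score(y_true, y_majority, zero_division=0))  # 0.0, exposes the failure
print("MCC:", matthews_corrcoef(y_true, y_majority))         # 0.0, no skill
print("PR-AUC:", average_precision_score(y_true, scores))
```

The same model that scores 95% accuracy has zero F1 and zero MCC, which is exactly the "95/5 class split" pitfall listed below.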
### Validation Strategies
| Strategy | When to Use |
|---|---|
| Holdout (train/val/test) | Large datasets (>100k samples) |
| K-fold cross-validation | Medium datasets, stable estimates needed |
| Stratified K-fold | Imbalanced classification |
| Time-series split | Temporal data, no future leakage |
| Group K-fold | Grouped observations (e.g., same user) |
| Nested cross-validation | Hyperparameter tuning + evaluation |
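The strategies in the table map directly onto scikit-learn splitter classes. A sketch with toy data (labels and group sizes are illustrative) showing the invariant each splitter guarantees:

```python
import numpy as np
from sklearn.model_selection import GroupKFold, StratifiedKFold, TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)
y = np.array([0] * 16 + [1] * 4)       # imbalanced labels (4 positives)
groups = np.repeat(np.arange(5), 4)    # e.g. 4 rows per user

# Stratified K-fold preserves the class ratio in every fold
for train_idx, val_idx in StratifiedKFold(n_splits=4, shuffle=True,
                                          random_state=0).split(X, y):
    assert y[val_idx].sum() == 1       # exactly one positive per fold here

# Group K-fold keeps all rows of a user on the same side of the split
for train_idx, val_idx in GroupKFold(n_splits=5).split(X, y, groups):
    assert set(groups[train_idx]).isdisjoint(groups[val_idx])

# Time-series split: validation indices always come after training indices
for train_idx, val_idx in TimeSeriesSplit(n_splits=4).split(X):
    assert train_idx.max() < val_idx.min()
```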
## Process

1. Define the business objective and translate it into a measurable ML metric.
2. Select a primary metric that directly maps to business value and 2-3 secondary metrics for monitoring.
3. Choose a validation strategy appropriate to the data structure and size.
4. Establish a meaningful baseline (random, majority class, simple heuristic, or previous model).
5. Evaluate all candidate models on the validation set using identical splits.
6. Compute confidence intervals using bootstrap resampling (1000+ iterations).
7. Perform statistical significance testing (paired t-test or McNemar's test) when comparing models.
8. Evaluate on the held-out test set exactly once for the final selected model.
9. Analyze errors: confusion matrix, residual plots, failure case review.
10. Report results with uncertainty estimates and business-impact translation.
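The bootstrap and significance-testing steps above can be sketched as follows (the predictions are synthetic; scipy's exact binomial test on the discordant pairs is one standard way to compute McNemar's exact p-value):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 2000
y_true = rng.integers(0, 2, n)
# two synthetic classifiers: roughly 85% and 83% accurate
pred_a = np.where(rng.random(n) < 0.85, y_true, 1 - y_true)
pred_b = np.where(rng.random(n) < 0.83, y_true, 1 - y_true)

def bootstrap_ci(y, pred, n_boot=1000, alpha=0.05):
    """Percentile bootstrap confidence interval for accuracy."""
    accs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), len(y))  # resample with replacement
        accs.append((y[idx] == pred[idx]).mean())
    return np.quantile(accs, [alpha / 2, 1 - alpha / 2])

lo, hi = bootstrap_ci(y_true, pred_a)
print(f"model A accuracy 95% CI: [{lo:.3f}, {hi:.3f}]")

# McNemar's exact test: only the discordant pairs carry information
a_right = pred_a == y_true
b_right = pred_b == y_true
n01 = int((a_right & ~b_right).sum())  # A right, B wrong
n10 = int((~a_right & b_right).sum())  # A wrong, B right
p = stats.binomtest(n01, n01 + n10, 0.5).pvalue
print(f"McNemar exact p-value: {p:.4f}")
```

A difference whose confidence intervals overlap heavily and whose p-value is large should not drive a model-selection decision.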
## Key Principles
- The primary metric must align with business value; optimizing the wrong metric guarantees the wrong model.
- Never use the test set for model selection or hyperparameter tuning; it measures generalization only.
- Accuracy is misleading for imbalanced datasets; always check class-specific metrics.
- Report confidence intervals, not point estimates; a 0.5% accuracy difference may not be significant.
- Calibration matters for probabilistic predictions; a well-calibrated model's 80% confidence should be correct 80% of the time.
- Evaluate fairness metrics across demographic groups when the model affects people.
- Track metrics over time in production to detect data drift and model degradation.
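The calibration principle can be checked without any plotting: bin the predicted probabilities and compare each bin's mean confidence to its observed positive rate. A sketch on synthetic, perfectly calibrated scores (real model outputs would replace `probs`):

```python
import numpy as np

rng = np.random.default_rng(7)
# a perfectly calibrated model: outcomes occur at exactly the predicted rates
probs = rng.random(50_000)                    # predicted P(y = 1)
y = (rng.random(50_000) < probs).astype(int)  # observed outcomes

# reliability table: mean confidence vs. observed positive rate per bin
bins = np.linspace(0.0, 1.0, 11)
for b0, b1 in zip(bins[:-1], bins[1:]):
    mask = (probs >= b0) & (probs < b1)
    print(f"confidence {b0:.1f}-{b1:.1f}: predicted {probs[mask].mean():.2f}, "
          f"observed {y[mask].mean():.2f}")
```

For a well-calibrated model the two columns track each other; a large gap in the high-confidence bins is the failure mode described above.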
## Common Pitfalls
- Reporting test set performance after using it for model selection (overfitting to the test set).
- Using accuracy on a 95/5 class split and concluding the model is excellent.
- Comparing models without statistical significance testing and acting on noise.
- Ignoring calibration when model outputs are used as probabilities for downstream decisions.
- Tuning hyperparameters with standard K-fold instead of nested cross-validation, inflating reported performance.
- Failing to establish a baseline, making it impossible to judge if the model adds value.
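The nested cross-validation fix for the tuning pitfall looks like this in scikit-learn (the dataset and model are illustrative placeholders): the inner loop selects hyperparameters, the outer loop measures generalization, so the reported score is never inflated by tuning on the same folds.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

X, y = make_classification(n_samples=400, n_features=20, random_state=0)

inner = KFold(n_splits=3, shuffle=True, random_state=1)  # hyperparameter search
outer = KFold(n_splits=5, shuffle=True, random_state=2)  # unbiased estimate

search = GridSearchCV(LogisticRegression(max_iter=1000),
                      param_grid={"C": [0.01, 0.1, 1, 10]},
                      cv=inner)
scores = cross_val_score(search, X, y, cv=outer)
print(f"nested CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```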
## Output Format
When reporting model evaluation:
- Task and Metric Justification: Why this metric maps to business value.
- Baseline Performance: Baseline model scores on all metrics.
- Model Comparison Table: All candidates with primary and secondary metrics, confidence intervals.
- Statistical Tests: p-values for pairwise model comparisons.
- Error Analysis: Confusion matrix or residual analysis with notable failure patterns.
- Recommendation: Selected model with clear rationale.
- Production Monitoring Plan: Metrics to track post-deployment and drift detection thresholds.
## Related Skills

- **Computer Vision Pipeline Design**: Designing computer vision pipelines for image and video analysis tasks.
- **Data Preprocessing**: Systematic approach to data cleaning, transformation, and feature preparation.
- **ML Deployment and MLOps**: ML model deployment and MLOps practices for production systems.
- **ML Model Selection**: Guides you through choosing the right machine learning model for a given problem.
- **Neural Network Architecture Design**: Guides the design of neural network architectures for various tasks.
- **NLP Pipeline Design**: Designing end-to-end natural language processing pipelines.