ML Evaluation
Comprehensive model evaluation and metrics selection for machine learning. Covers metric selection by task, validation strategy design, statistical significance testing, and evaluation reporting.
Overview
Model evaluation determines whether a machine learning model meets performance requirements and generalizes to unseen data. Choosing the wrong metric or validation strategy leads to deploying models that fail in production. This skill covers metric selection, validation design, statistical significance testing, and evaluation reporting.
Use this skill when selecting evaluation metrics for a new project, when comparing multiple models, when validating that a model is production-ready, or when stakeholders need to understand model performance in business terms.
Core Framework
Metric Selection by Task
Classification:
- Balanced classes: Accuracy, F1-macro
- Imbalanced classes: F1, Precision-Recall AUC, Matthews Correlation Coefficient
- Ranking needed: ROC-AUC, Average Precision
- Cost-sensitive: Custom cost matrix, expected cost
Regression:
- General: RMSE, MAE
- Relative errors matter: MAPE, sMAPE
- Outlier-robust: Median Absolute Error
- Explained variance: R-squared (with caution)
Ranking / Recommendation:
- NDCG, MAP, MRR, Hit Rate at K
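A minimal sketch (assuming scikit-learn is installed) of why the metric lists above recommend looking past accuracy on imbalanced data: the same toy predictions score well on accuracy but poorly on F1 and MCC.

```python
# Toy imbalanced problem: 8 negatives, 2 positives, with one false
# positive and one false negative in the predictions.
from sklearn.metrics import (accuracy_score, f1_score,
                             matthews_corrcoef, average_precision_score)

y_true  = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred  = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]
y_score = [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.2, 0.6, 0.9, 0.4]

print("accuracy:", accuracy_score(y_true, y_pred))               # 0.8 despite 2 errors
print("f1:      ", f1_score(y_true, y_pred))                     # 0.5
print("mcc:     ", matthews_corrcoef(y_true, y_pred))            # 0.375
print("pr-auc:  ", average_precision_score(y_true, y_score))     # uses scores, not labels
```

Accuracy looks strong here only because negatives dominate; F1 and MCC reflect that the model found just half of the rare positive class.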
Validation Strategies
| Strategy | When to Use |
|---|---|
| Holdout (train/val/test) | Large datasets (>100k samples) |
| K-fold cross-validation | Medium datasets, stable estimates needed |
| Stratified K-fold | Imbalanced classification |
| Time-series split | Temporal data, no future leakage |
| Group K-fold | Grouped observations (e.g., same user) |
| Nested cross-validation | Hyperparameter tuning + evaluation |
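The splitters in the table map directly onto scikit-learn classes. A short sketch (assuming scikit-learn; the data here is illustrative) showing the invariant each strategy enforces:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, TimeSeriesSplit, GroupKFold

X = np.arange(12).reshape(-1, 1)
y = np.array([0, 1] * 6)              # perfectly balanced labels
groups = np.repeat([0, 1, 2, 3], 3)   # e.g. three rows per user

# Stratified: every validation fold preserves the 50/50 class ratio.
for tr, va in StratifiedKFold(n_splits=3, shuffle=True, random_state=0).split(X, y):
    assert y[va].mean() == 0.5

# Time series: validation indices always come after training indices,
# so no future data leaks into training.
for tr, va in TimeSeriesSplit(n_splits=3).split(X):
    assert tr.max() < va.min()

# Grouped: no group (user) appears on both sides of a split.
for tr, va in GroupKFold(n_splits=4).split(X, y, groups):
    assert not set(groups[tr]) & set(groups[va])

print("all splitter invariants hold")
```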
Process
- Define the business objective and translate it into a measurable ML metric.
- Select a primary metric that directly maps to business value and 2-3 secondary metrics for monitoring.
- Choose a validation strategy appropriate to the data structure and size.
- Establish a meaningful baseline (random, majority class, simple heuristic, or previous model).
- Evaluate all candidate models on the validation set using identical splits.
- Compute confidence intervals using bootstrap resampling (1000+ iterations).
- Perform statistical significance testing (paired t-test or McNemar's test) when comparing models.
- Evaluate on the held-out test set exactly once for the final selected model.
- Analyze errors: confusion matrix, residual plots, failure case review.
- Report results with uncertainty estimates and business-impact translation.
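The bootstrap step above can be sketched in plain NumPy: resample the test predictions with replacement, recompute the metric each time, and take percentiles. The data here is synthetic and for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)
# Simulate a roughly 80%-accurate model's predictions.
y_pred = np.where(rng.random(200) < 0.8, y_true, 1 - y_true)

n_boot = 2000
accs = np.empty(n_boot)
for b in range(n_boot):
    # Resample example indices with replacement, then rescore.
    idx = rng.integers(0, len(y_true), size=len(y_true))
    accs[b] = np.mean(y_true[idx] == y_pred[idx])

point = np.mean(y_true == y_pred)
lo, hi = np.percentile(accs, [2.5, 97.5])
print(f"accuracy = {point:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

The same loop works for any metric; only the scoring line changes.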
Key Principles
- The primary metric must align with business value; optimizing the wrong metric guarantees the wrong model.
- Never use the test set for model selection or hyperparameter tuning; it measures generalization only.
- Accuracy is misleading for imbalanced datasets; always check class-specific metrics.
- Report confidence intervals, not point estimates; a 0.5% accuracy difference may not be significant.
- Calibration matters for probabilistic predictions; a well-calibrated model's 80% confidence should be correct 80% of the time.
- Evaluate fairness metrics across demographic groups when the model affects people.
- Track metrics over time in production to detect data drift and model degradation.
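The calibration principle can be checked by binning predicted probabilities and comparing each bin's mean prediction to the observed positive rate. A sketch assuming scikit-learn, with synthetic, perfectly calibrated probabilities:

```python
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(1)
prob = rng.random(5000)                    # model's predicted probabilities
y = (rng.random(5000) < prob).astype(int)  # outcomes drawn at those rates

# frac_pos: observed positive rate per bin; mean_pred: mean prediction per bin.
frac_pos, mean_pred = calibration_curve(y, prob, n_bins=10)
for fp, mp in zip(frac_pos, mean_pred):
    print(f"predicted {mp:.2f} -> observed {fp:.2f}")  # should roughly match
```

For a miscalibrated model these columns diverge, and the same curve shows where (e.g. overconfidence near 0.9).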
Common Pitfalls
- Reporting test set performance after using it for model selection (overfitting to the test set).
- Using accuracy on a 95/5 class split and concluding the model is excellent.
- Comparing models without statistical significance testing and acting on noise.
- Ignoring calibration when model outputs are used as probabilities for downstream decisions.
- Tuning hyperparameters with standard K-fold instead of nested cross-validation, inflating reported performance.
- Failing to establish a baseline, making it impossible to judge if the model adds value.
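For the significance-testing pitfall, McNemar's test needs only the disagreement counts between two classifiers on the same test set. A sketch using the exact binomial form via SciPy; the counts here are illustrative.

```python
from scipy.stats import binomtest

# b = examples model A got right and model B got wrong; c = the reverse.
b, c = 15, 5

# Under H0 (equal error rates), disagreements split 50/50 between b and c,
# so the smaller count follows Binomial(b + c, 0.5).
p_value = binomtest(min(b, c), n=b + c, p=0.5).pvalue
print(f"McNemar exact p = {p_value:.4f}")
```

Note that only the off-diagonal (disagreement) cells matter; examples both models get right or both get wrong carry no information about which model is better.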
Output Format
When reporting model evaluation:
- Task and Metric Justification: Why this metric maps to business value.
- Baseline Performance: Baseline model scores on all metrics.
- Model Comparison Table: All candidates with primary and secondary metrics, confidence intervals.
- Statistical Tests: p-values for pairwise model comparisons.
- Error Analysis: Confusion matrix or residual analysis with notable failure patterns.
- Recommendation: Selected model with clear rationale.
- Production Monitoring Plan: Metrics to track post-deployment and drift detection thresholds.
Anti-Patterns
Over-engineering for hypothetical requirements. Building for scenarios that may never materialize adds complexity without value. Solve the problem in front of you first.
Ignoring the existing ecosystem. Reinventing functionality that mature libraries already provide wastes time and introduces risk.
Premature abstraction. Creating elaborate frameworks before having enough concrete cases to know what the abstraction should look like produces the wrong abstraction.
Neglecting error handling at system boundaries. Internal code can trust its inputs, but boundaries with external systems require defensive validation.
Skipping documentation. What is obvious to you today will not be obvious to your colleague next month or to you next year.
Install this skill directly: skilldb add ai-ml-skills
Related Skills
Computer Vision Pipeline
Designing computer vision pipelines for image and video analysis tasks. Covers
Data Preprocessing
Systematic approach to data cleaning, transformation, and feature preparation for
ML Deployment
ML model deployment and MLOps practices for production systems. Covers serving
ML Model Selection
Guides you through choosing the right machine learning model for a given problem.
Neural Network Architecture
Guides the design of neural network architectures for various tasks. Covers layer
NLP Pipeline
Designing end-to-end natural language processing pipelines from text ingestion to