# ML Model Evaluation

Comprehensive model evaluation and metrics selection for machine learning. Covers metric selection, validation design, statistical significance testing, and evaluation reporting.
## Overview
Model evaluation determines whether a machine learning model meets performance requirements and generalizes to unseen data. Choosing the wrong metric or validation strategy leads to deploying models that fail in production. This skill covers metric selection, validation design, statistical significance testing, and evaluation reporting.
Use this skill when selecting evaluation metrics for a new project, when comparing multiple models, when validating that a model is production-ready, or when stakeholders need to understand model performance in business terms.
## Core Framework
### Metric Selection by Task
**Classification:**
- Balanced classes: Accuracy, F1-macro
- Imbalanced classes: F1, Precision-Recall AUC, Matthews Correlation Coefficient
- Ranking needed: ROC-AUC, Average Precision
- Cost-sensitive: Custom cost matrix, expected cost
**Regression:**
- General: RMSE, MAE
- Relative errors matter: MAPE, sMAPE
- Outlier-robust: Median Absolute Error
- Explained variance: R-squared (with caution)
**Ranking / Recommendation:**
- NDCG, MAP, MRR, Hit Rate at K
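The gap between these metrics shows up immediately on skewed data. A minimal sketch (using scikit-learn; the labels and scores are synthetic and purely illustrative) of how accuracy flatters a majority-class predictor on a 95/5 split while F1 and MCC expose it:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, average_precision_score,
                             f1_score, matthews_corrcoef)

rng = np.random.default_rng(0)

# 95/5 imbalanced labels and a "predict the majority class" baseline
y_true = (rng.random(1000) < 0.05).astype(int)
y_majority = np.zeros_like(y_true)
# noisy scores loosely correlated with the true label, for PR-AUC
scores = rng.random(1000) * 0.5 + y_true * 0.3

print("accuracy:", accuracy_score(y_true, y_majority))       # ~0.95, flattering
print("F1:", f1_score(y_true, y_majority, zero_division=0))  # 0.0, exposes the failure
print("MCC:", matthews_corrcoef(y_true, y_majority))         # 0.0, no skill
print("PR-AUC:", average_precision_score(y_true, scores))
```

The same model that scores 95% accuracy has zero F1 and zero MCC, which is exactly the "95/5 class split" pitfall listed below.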
### Validation Strategies
| Strategy | When to Use |
|---|---|
| Holdout (train/val/test) | Large datasets (>100k samples) |
| K-fold cross-validation | Medium datasets, stable estimates needed |
| Stratified K-fold | Imbalanced classification |
| Time-series split | Temporal data, no future leakage |
| Group K-fold | Grouped observations (e.g., same user) |
| Nested cross-validation | Hyperparameter tuning + evaluation |
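The strategies in the table map directly onto scikit-learn splitter classes. A sketch with toy data (labels and group sizes are illustrative) showing the invariant each splitter guarantees:

```python
import numpy as np
from sklearn.model_selection import GroupKFold, StratifiedKFold, TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)
y = np.array([0] * 16 + [1] * 4)       # imbalanced labels (4 positives)
groups = np.repeat(np.arange(5), 4)    # e.g. 4 rows per user

# Stratified K-fold preserves the class ratio in every fold
for train_idx, val_idx in StratifiedKFold(n_splits=4, shuffle=True,
                                          random_state=0).split(X, y):
    assert y[val_idx].sum() == 1       # exactly one positive per fold here

# Group K-fold keeps all rows of a user on the same side of the split
for train_idx, val_idx in GroupKFold(n_splits=5).split(X, y, groups):
    assert set(groups[train_idx]).isdisjoint(groups[val_idx])

# Time-series split: validation indices always come after training indices
for train_idx, val_idx in TimeSeriesSplit(n_splits=4).split(X):
    assert train_idx.max() < val_idx.min()
```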
## Process

1. Define the business objective and translate it into a measurable ML metric.
2. Select a primary metric that directly maps to business value and 2-3 secondary metrics for monitoring.
3. Choose a validation strategy appropriate to the data structure and size.
4. Establish a meaningful baseline (random, majority class, simple heuristic, or previous model).
5. Evaluate all candidate models on the validation set using identical splits.
6. Compute confidence intervals using bootstrap resampling (1000+ iterations).
7. Perform statistical significance testing (paired t-test or McNemar's test) when comparing models.
8. Evaluate on the held-out test set exactly once for the final selected model.
9. Analyze errors: confusion matrix, residual plots, failure case review.
10. Report results with uncertainty estimates and business-impact translation.
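The bootstrap and significance-testing steps above can be sketched as follows (the predictions are synthetic; scipy's exact binomial test on the discordant pairs is one standard way to compute McNemar's exact p-value):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 2000
y_true = rng.integers(0, 2, n)
# two synthetic classifiers: roughly 85% and 83% accurate
pred_a = np.where(rng.random(n) < 0.85, y_true, 1 - y_true)
pred_b = np.where(rng.random(n) < 0.83, y_true, 1 - y_true)

def bootstrap_ci(y, pred, n_boot=1000, alpha=0.05):
    """Percentile bootstrap confidence interval for accuracy."""
    accs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), len(y))  # resample with replacement
        accs.append((y[idx] == pred[idx]).mean())
    return np.quantile(accs, [alpha / 2, 1 - alpha / 2])

lo, hi = bootstrap_ci(y_true, pred_a)
print(f"model A accuracy 95% CI: [{lo:.3f}, {hi:.3f}]")

# McNemar's exact test: only the discordant pairs carry information
a_right = pred_a == y_true
b_right = pred_b == y_true
n01 = int((a_right & ~b_right).sum())  # A right, B wrong
n10 = int((~a_right & b_right).sum())  # A wrong, B right
p = stats.binomtest(n01, n01 + n10, 0.5).pvalue
print(f"McNemar exact p-value: {p:.4f}")
```

A difference whose confidence intervals overlap heavily and whose p-value is large should not drive a model-selection decision.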
## Key Principles
- The primary metric must align with business value; optimizing the wrong metric guarantees the wrong model.
- Never use the test set for model selection or hyperparameter tuning; it measures generalization only.
- Accuracy is misleading for imbalanced datasets; always check class-specific metrics.
- Report confidence intervals, not point estimates; a 0.5% accuracy difference may not be significant.
- Calibration matters for probabilistic predictions; a well-calibrated model's 80% confidence should be correct 80% of the time.
- Evaluate fairness metrics across demographic groups when the model affects people.
- Track metrics over time in production to detect data drift and model degradation.
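The calibration principle can be checked without any plotting: bin the predicted probabilities and compare each bin's mean confidence to its observed positive rate. A sketch on synthetic, perfectly calibrated scores (real model outputs would replace `probs`):

```python
import numpy as np

rng = np.random.default_rng(7)
# a perfectly calibrated model: outcomes occur at exactly the predicted rates
probs = rng.random(50_000)                    # predicted P(y = 1)
y = (rng.random(50_000) < probs).astype(int)  # observed outcomes

# reliability table: mean confidence vs. observed positive rate per bin
bins = np.linspace(0.0, 1.0, 11)
for b0, b1 in zip(bins[:-1], bins[1:]):
    mask = (probs >= b0) & (probs < b1)
    print(f"confidence {b0:.1f}-{b1:.1f}: predicted {probs[mask].mean():.2f}, "
          f"observed {y[mask].mean():.2f}")
```

For a well-calibrated model the two columns track each other; a large gap in the high-confidence bins is the failure mode described above.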
## Common Pitfalls
- Reporting test set performance after using it for model selection (overfitting to the test set).
- Using accuracy on a 95/5 class split and concluding the model is excellent.
- Comparing models without statistical significance testing and acting on noise.
- Ignoring calibration when model outputs are used as probabilities for downstream decisions.
- Tuning hyperparameters with standard K-fold instead of nested cross-validation, inflating reported performance.
- Failing to establish a baseline, making it impossible to judge if the model adds value.
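The nested cross-validation fix for the tuning pitfall looks like this in scikit-learn (the dataset and model are illustrative placeholders): the inner loop selects hyperparameters, the outer loop measures generalization, so the reported score is never inflated by tuning on the same folds.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

X, y = make_classification(n_samples=400, n_features=20, random_state=0)

inner = KFold(n_splits=3, shuffle=True, random_state=1)  # hyperparameter search
outer = KFold(n_splits=5, shuffle=True, random_state=2)  # unbiased estimate

search = GridSearchCV(LogisticRegression(max_iter=1000),
                      param_grid={"C": [0.01, 0.1, 1, 10]},
                      cv=inner)
scores = cross_val_score(search, X, y, cv=outer)
print(f"nested CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```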
## Output Format
When reporting model evaluation:
- Task and Metric Justification: Why this metric maps to business value.
- Baseline Performance: Baseline model scores on all metrics.
- Model Comparison Table: All candidates with primary and secondary metrics, confidence intervals.
- Statistical Tests: p-values for pairwise model comparisons.
- Error Analysis: Confusion matrix or residual analysis with notable failure patterns.
- Recommendation: Selected model with clear rationale.
- Production Monitoring Plan: Metrics to track post-deployment and drift detection thresholds.
## Related Skills

- **Computer Vision Pipeline Design**: Designing computer vision pipelines for image and video analysis tasks.
- **Data Preprocessing**: Systematic approach to data cleaning, transformation, and feature preparation.
- **ML Deployment and MLOps**: ML model deployment and MLOps practices for production systems.
- **ML Model Selection**: Guides you through choosing the right machine learning model for a given problem.
- **Neural Network Architecture Design**: Guides the design of neural network architectures for various tasks.
- **NLP Pipeline Design**: Designing end-to-end natural language processing pipelines.