# Prediction Evaluation Metrics

## Quick Summary
Measuring forecast quality is as important as making forecasts. Without rigorous evaluation, you cannot distinguish skill from luck, identify systematic biases, or compare forecasting methods. This skill covers proper scoring rules, the reliability-resolution-uncertainty decomposition of the Brier score, ROC curves for binary predictions, CRPS for probabilistic forecasts, methods for comparing forecasters, and tournament design for forecasting competitions.

## Key Points

1. Only use proper scoring rules (Brier, log, spherical) for evaluation; improper rules like accuracy/zero-one incentivize dishonest reporting
2. The Brier score decomposes into reliability (calibration), resolution (discrimination), and uncertainty: you can improve the first two, but the third is fixed by the base rate of the outcomes (decomposition sketched after this list)
3. Log scoring is more appropriate than Brier when you want to heavily penalize confident wrong predictions (e.g., tail risk applications)
4. ROC/AUC measures discrimination ability independent of calibration; use it alongside calibration metrics for a complete assessment (AUC sketch below)
5. CRPS is the standard for evaluating full probability distributions; PIT histograms diagnose specific calibration failures such as overconfidence and bias (CRPS/PIT sketch below)
6. When comparing forecasters, use paired tests (not just average scores) and bootstrap confidence intervals to assess statistical significance (bootstrap sketch below)
7. Tournament design requires proper scoring rules, participation minimums, category diversity, and anti-gaming measures
8. A complete evaluation pipeline combines point metrics, decomposition, discrimination analysis, probabilistic calibration, and overall grading
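To make key points 1–3 concrete, here is a minimal sketch of the Brier score, the log score, and the reliability-resolution-uncertainty decomposition. It assumes NumPy and binary 0/1 outcomes; the function names and the equal-width binning scheme are illustrative choices, not part of the skill itself, and with continuous forecast values the binned decomposition only reproduces the Brier score up to a small within-bin variance residual.

```python
import numpy as np

def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes (lower is better)."""
    p, y = np.asarray(probs, float), np.asarray(outcomes, float)
    return np.mean((p - y) ** 2)

def log_score(probs, outcomes, eps=1e-12):
    """Mean negative log-likelihood of the realized outcome; punishes confident
    wrong forecasts far more harshly than the Brier score does."""
    p = np.clip(np.asarray(probs, float), eps, 1 - eps)
    y = np.asarray(outcomes, float)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def murphy_decomposition(probs, outcomes, n_bins=10):
    """Approximate Brier decomposition: reliability - resolution + uncertainty,
    computed over equal-width probability bins."""
    p, y = np.asarray(probs, float), np.asarray(outcomes, float)
    n = len(p)
    base_rate = y.mean()
    uncertainty = base_rate * (1 - base_rate)            # fixed by the data
    bins = np.minimum((p * n_bins).astype(int), n_bins - 1)
    reliability = resolution = 0.0
    for b in range(n_bins):
        mask = bins == b
        if not mask.any():
            continue
        w = mask.sum() / n
        p_bar, o_bar = p[mask].mean(), y[mask].mean()
        reliability += w * (p_bar - o_bar) ** 2           # calibration error (minimize)
        resolution += w * (o_bar - base_rate) ** 2        # discrimination (maximize)
    return reliability, resolution, uncertainty
```

As a sanity check, `reliability - resolution + uncertainty` should land close to `brier_score(probs, outcomes)`; a gap much larger than the binning residual usually signals a bookkeeping bug.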
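For key point 4, AUC can be computed directly from the Mann-Whitney rank identity: it equals the probability that a randomly chosen positive case receives a higher forecast than a randomly chosen negative one. A sketch, assuming SciPy is available for tie-aware ranking:

```python
import numpy as np
from scipy.stats import rankdata

def auc(probs, outcomes):
    """AUC via the Mann-Whitney U statistic; unchanged by any monotone
    recalibration of the forecasts, so it isolates discrimination."""
    p = np.asarray(probs, float)
    y = np.asarray(outcomes, int)
    ranks = rankdata(p)                     # average ranks handle ties
    n_pos = y.sum()
    n_neg = len(y) - n_pos
    u = ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)
```

Because AUC depends only on the ordering of forecasts, a badly miscalibrated forecaster (e.g. every probability halved, which preserves the ranking) can still score a perfect AUC, which is why the key point pairs it with calibration metrics.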
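For key point 5, here is a sketch of CRPS for an ensemble (sample-based) forecast using the energy-form estimator CRPS = E|X − y| − ½·E|X − X′|, plus the PIT value whose histogram over many forecasts diagnoses calibration failures. It assumes a modest ensemble size so the pairwise term stays cheap:

```python
import numpy as np

def crps_ensemble(members, observation):
    """CRPS of a sample-based forecast (lower is better), via
    CRPS = E|X - y| - 0.5 * E|X - X'| over ensemble members."""
    x = np.asarray(members, float)
    term1 = np.mean(np.abs(x - observation))
    term2 = 0.5 * np.mean(np.abs(x[:, None] - x[None, :]))
    return term1 - term2

def pit_value(members, observation):
    """Probability integral transform: the forecast CDF evaluated at the outcome.
    Over a test set these should look uniform on [0, 1]; a U-shaped histogram
    signals overconfidence, a sloped one signals bias."""
    return np.mean(np.asarray(members, float) <= observation)
```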
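For key point 6, a paired comparison scores both forecasters on the same questions and bootstraps the per-question score differences, so question difficulty cancels out. A minimal sketch (the 95% interval and 10,000 resamples are arbitrary defaults, not recommendations from the skill):

```python
import numpy as np

def paired_bootstrap_diff(scores_a, scores_b, n_boot=10_000, seed=0):
    """Bootstrap CI for the mean per-question score difference between two
    forecasters evaluated on the same questions (e.g. Brier scores)."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(scores_a, float) - np.asarray(scores_b, float)
    n = len(diffs)
    idx = rng.integers(0, n, size=(n_boot, n))   # resample questions with replacement
    boot_means = diffs[idx].mean(axis=1)
    lo, hi = np.percentile(boot_means, [2.5, 97.5])
    return diffs.mean(), (lo, hi)
```

If the interval excludes zero, the difference is unlikely to be sampling noise; if it straddles zero, the two forecasters are statistically indistinguishable on this question set.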
Full skill (818 lines): skilldb get prediction-skills/prediction-evaluation-metrics

Install this skill directly: skilldb add prediction-skills