Prediction Evaluation Metrics
Measuring forecast quality is as important as making forecasts. Without rigorous evaluation, you cannot distinguish skill from luck, identify systematic biases, or compare forecasting methods. This skill covers proper scoring rules, the resolution-reliability-sharpness decomposition, ROC curves for binary predictions, CRPS for probabilistic forecasts, methods for comparing forecasters, and tournament design for forecasting competitions. ## Key Points 1. Only use proper scoring rules (Brier, log, spherical) for evaluation; improper rules like accuracy/zero-one incentivize dishonest reporting 2. The Brier score decomposes into reliability (calibration), resolution (discrimination), and uncertainty: improve the first two, the third is fixed 3. Log scoring is more appropriate than Brier when you want to heavily penalize confident wrong predictions (e.g., tail risk applications) 4. ROC/AUC measures discrimination ability independent of calibration; use alongside calibration metrics for complete assessment 5. CRPS is the standard for evaluating full probability distributions; PIT histograms diagnose specific calibration failures (overconfidence, bias) 6. When comparing forecasters, use paired tests (not just average scores) and bootstrap confidence intervals to assess statistical significance 7. Tournament design requires proper scoring rules, participation minimums, category diversity, and anti-gaming measures 8. A complete evaluation pipeline combines point metrics, decomposition, discrimination analysis, probabilistic calibration, and overall grading
skilldb get prediction-skills/prediction-evaluation-metricsFull skill: 818 linesPrediction Evaluation Metrics
Overview
Measuring forecast quality is as important as making forecasts. Without rigorous evaluation, you cannot distinguish skill from luck, identify systematic biases, or compare forecasting methods. This skill covers proper scoring rules, the resolution-reliability-sharpness decomposition, ROC curves for binary predictions, CRPS for probabilistic forecasts, methods for comparing forecasters, and tournament design for forecasting competitions.
Proper Scoring Rules
What Makes a Scoring Rule "Proper"
A scoring rule S(p, y) assigns a score to a probability forecast p given outcome y. A scoring rule is strictly proper if the forecaster's expected score is uniquely maximized by reporting their true belief. This means honest reporting is the dominant strategy.
import numpy as np
from typing import Callable
class ScoringRules:
"""Collection of proper scoring rules for probabilistic forecasts."""
@staticmethod
def brier_score(p: float, y: int) -> float:
"""
Brier Score (1950): (p - y)²
Range: 0 (perfect) to 1 (worst)
Reference: 0.25 (always predicting 50%)
Properties:
- Quadratic loss
- Proper (honest reporting is optimal)
- Symmetric around 50%
- Bounded [0, 1]
"""
return (p - y) ** 2
@staticmethod
def log_score(p: float, y: int) -> float:
"""
Logarithmic Score: log(p) if y=1, log(1-p) if y=0
Range: -inf (worst) to 0 (perfect)
Reference: -ln(2) ≈ -0.693 (always predicting 50%)
Properties:
- More punishing of confident wrong predictions
- Strictly proper
- Local: only depends on probability assigned to actual outcome
- Unbounded below (catastrophic for 0% or 100% predictions that are wrong)
"""
p = np.clip(p, 1e-15, 1 - 1e-15)
if y == 1:
return np.log(p)
return np.log(1 - p)
@staticmethod
def spherical_score(p: float, y: int) -> float:
"""
Spherical Score: p_y / sqrt(sum(p_i²))
For binary: p / sqrt(p² + (1-p)²) if y=1
Properties:
- Proper
- Bounded [0, 1]
- Less sensitive to extreme probabilities than log score
"""
norm = np.sqrt(p**2 + (1 - p)**2)
if y == 1:
return p / norm
return (1 - p) / norm
@staticmethod
def zero_one_score(p: float, y: int, threshold: float = 0.5) -> float:
"""
Zero-One (Classification) Score: 1 if correct side, 0 if wrong.
WARNING: This is NOT a proper scoring rule!
A forecaster maximizes this by always predicting 0% or 100%.
Included for comparison only.
"""
predicted = 1 if p >= threshold else 0
return 1 if predicted == y else 0
def compare_scoring_rules(p: float, y: int) -> dict:
"""Compare all scoring rules for a single prediction."""
sr = ScoringRules()
return {
'brier': sr.brier_score(p, y),
'log': sr.log_score(p, y),
'spherical': sr.spherical_score(p, y),
'zero_one': sr.zero_one_score(p, y),
'note': 'Lower Brier is better. Higher log/spherical is better.'
}
# Demonstrate incentive compatibility
def demonstrate_proper_scoring():
"""Show that honest reporting maximizes expected score under proper rules."""
true_prob = 0.7 # Forecaster believes 70% chance of YES
reported_probs = np.linspace(0.01, 0.99, 50)
brier_expected = []
log_expected = []
zero_one_expected = []
for reported in reported_probs:
# Expected Brier = P(Y=1)*(reported-1)² + P(Y=0)*(reported-0)²
expected_brier = true_prob * (reported - 1)**2 + (1 - true_prob) * reported**2
brier_expected.append(expected_brier)
# Expected Log
expected_log = true_prob * np.log(max(reported, 1e-10)) + (1 - true_prob) * np.log(max(1 - reported, 1e-10))
log_expected.append(expected_log)
# Expected Zero-One (NOT proper)
if reported >= 0.5:
expected_01 = true_prob
else:
expected_01 = 1 - true_prob
zero_one_expected.append(expected_01)
best_brier = reported_probs[np.argmin(brier_expected)]
best_log = reported_probs[np.argmax(log_expected)]
best_01 = reported_probs[np.argmax(zero_one_expected)]
return {
'true_probability': true_prob,
'brier_optimal_report': best_brier, # Should be ~0.7
'log_optimal_report': best_log, # Should be ~0.7
'zero_one_optimal_report': best_01, # Will be 1.0 (dishonest!)
'brier_is_proper': abs(best_brier - true_prob) < 0.05,
'log_is_proper': abs(best_log - true_prob) < 0.05,
'zero_one_is_proper': abs(best_01 - true_prob) < 0.05 # False
}
Resolution, Reliability, and Sharpness
The Three Components of Forecast Quality
class ForecastQualityDecomposition:
"""
Decompose forecast quality into three independent components.
Resolution: Ability to separate events that happen from those that don't.
High resolution = forecasts vary appropriately with outcomes.
Reliability (Calibration): Alignment between forecast probabilities and
observed frequencies. Reliable = well-calibrated.
Sharpness: How extreme/decisive the forecasts are.
Sharp forecasts are far from the base rate.
(Sharpness is a property of the forecasts alone, not outcomes.)
"""
def __init__(self, forecasts: np.ndarray, outcomes: np.ndarray, n_bins: int = 10):
self.forecasts = forecasts
self.outcomes = outcomes.astype(float)
self.n_bins = n_bins
self.base_rate = np.mean(outcomes)
def reliability(self) -> float:
"""
Reliability = weighted average of (forecast_bin_mean - outcome_bin_mean)²
Lower is better. 0 = perfectly calibrated.
"""
bins = np.linspace(0, 1, self.n_bins + 1)
n = len(self.forecasts)
rel = 0
for i in range(self.n_bins):
mask = (self.forecasts >= bins[i]) & (self.forecasts < bins[i+1])
n_k = np.sum(mask)
if n_k == 0:
continue
forecast_mean = np.mean(self.forecasts[mask])
outcome_mean = np.mean(self.outcomes[mask])
rel += n_k * (forecast_mean - outcome_mean) ** 2
return rel / n
def resolution(self) -> float:
"""
Resolution = weighted average of (outcome_bin_mean - base_rate)²
Higher is better. 0 = no discrimination ability.
"""
bins = np.linspace(0, 1, self.n_bins + 1)
n = len(self.forecasts)
res = 0
for i in range(self.n_bins):
mask = (self.forecasts >= bins[i]) & (self.forecasts < bins[i+1])
n_k = np.sum(mask)
if n_k == 0:
continue
outcome_mean = np.mean(self.outcomes[mask])
res += n_k * (outcome_mean - self.base_rate) ** 2
return res / n
def uncertainty(self) -> float:
"""
Uncertainty = base_rate * (1 - base_rate)
Property of the dataset, not the forecaster.
"""
return self.base_rate * (1 - self.base_rate)
def sharpness(self) -> float:
"""
Sharpness = variance of forecasts.
Higher = more decisive. Not necessarily better (can be overconfident).
"""
return np.var(self.forecasts)
def full_decomposition(self) -> dict:
"""
Brier Score = Reliability - Resolution + Uncertainty
Good forecast: low reliability (well calibrated),
high resolution (discriminates well).
"""
rel = self.reliability()
res = self.resolution()
unc = self.uncertainty()
bs = np.mean((self.forecasts - self.outcomes) ** 2)
return {
'brier_score': bs,
'reliability': rel,
'resolution': res,
'uncertainty': unc,
'check': abs(bs - (rel - res + unc)),
'sharpness': self.sharpness(),
'brier_skill_score': 1 - bs / unc if unc > 0 else 0,
'interpretation': {
'calibration': 'good' if rel < 0.01 else 'fair' if rel < 0.03 else 'poor',
'discrimination': 'excellent' if res > 0.05 else 'good' if res > 0.02 else 'poor',
'decisiveness': 'sharp' if self.sharpness() > 0.04 else 'moderate' if self.sharpness() > 0.02 else 'hedging'
}
}
ROC Curves for Binary Predictions
class ROCAnalysis:
"""ROC curve analysis for binary prediction evaluation."""
def __init__(self, probabilities: np.ndarray, outcomes: np.ndarray):
self.probs = probabilities
self.outcomes = outcomes.astype(int)
def compute_roc(self, n_thresholds: int = 100) -> dict:
"""Compute ROC curve points."""
thresholds = np.linspace(0, 1, n_thresholds)
tpr_list = [] # True Positive Rate (sensitivity)
fpr_list = [] # False Positive Rate (1 - specificity)
for threshold in thresholds:
predicted_positive = self.probs >= threshold
tp = np.sum(predicted_positive & (self.outcomes == 1))
fp = np.sum(predicted_positive & (self.outcomes == 0))
fn = np.sum(~predicted_positive & (self.outcomes == 1))
tn = np.sum(~predicted_positive & (self.outcomes == 0))
tpr = tp / max(tp + fn, 1)
fpr = fp / max(fp + tn, 1)
tpr_list.append(tpr)
fpr_list.append(fpr)
return {
'fpr': fpr_list,
'tpr': tpr_list,
'thresholds': thresholds.tolist(),
'auc': self.compute_auc(fpr_list, tpr_list)
}
def compute_auc(self, fpr: list, tpr: list) -> float:
"""Area Under the ROC Curve using trapezoidal rule."""
sorted_pairs = sorted(zip(fpr, tpr))
fpr_sorted = [p[0] for p in sorted_pairs]
tpr_sorted = [p[1] for p in sorted_pairs]
auc = 0
for i in range(1, len(fpr_sorted)):
auc += (fpr_sorted[i] - fpr_sorted[i-1]) * (tpr_sorted[i] + tpr_sorted[i-1]) / 2
return auc
def optimal_threshold(self) -> dict:
"""Find the threshold that maximizes Youden's J statistic."""
roc = self.compute_roc()
j_stats = [tpr - fpr for tpr, fpr in zip(roc['tpr'], roc['fpr'])]
best_idx = np.argmax(j_stats)
return {
'optimal_threshold': roc['thresholds'][best_idx],
'sensitivity': roc['tpr'][best_idx],
'specificity': 1 - roc['fpr'][best_idx],
'youden_j': j_stats[best_idx]
}
def interpretation(self) -> dict:
"""Interpret the AUC value."""
auc = self.compute_roc()['auc']
if auc > 0.9:
quality = 'Excellent discrimination'
elif auc > 0.8:
quality = 'Good discrimination'
elif auc > 0.7:
quality = 'Fair discrimination'
elif auc > 0.6:
quality = 'Poor discrimination'
else:
quality = 'No better than random'
return {'auc': auc, 'quality': quality}
CRPS for Probabilistic Forecasts
class CRPSCalculator:
"""
Continuous Ranked Probability Score (CRPS).
The standard metric for evaluating full probability distributions.
CRPS generalizes the Brier Score to continuous outcomes:
CRPS = integral of (F(x) - 1{y <= x})² dx
Where F(x) is the forecast CDF and y is the observation.
"""
@staticmethod
def crps_empirical(ensemble: np.ndarray, observation: float) -> float:
"""
CRPS from ensemble forecasts.
Uses the identity: CRPS = E|X-y| - 0.5*E|X-X'|
where X, X' are independent draws from the forecast distribution.
"""
n = len(ensemble)
# E|X - y|
abs_diff = np.mean(np.abs(ensemble - observation))
# E|X - X'| (expected absolute difference between two random draws)
sorted_ens = np.sort(ensemble)
# Efficient computation
weights = 2 * np.arange(n) - n + 1
spread = np.sum(weights * sorted_ens) / (n * n)
return abs_diff - 0.5 * abs(spread)
@staticmethod
def crps_gaussian(mu: float, sigma: float, observation: float) -> float:
"""
CRPS for Gaussian forecast N(mu, sigma²).
Closed-form solution available.
"""
from scipy.stats import norm
z = (observation - mu) / sigma
crps = sigma * (z * (2 * norm.cdf(z) - 1) + 2 * norm.pdf(z) - 1 / np.sqrt(np.pi))
return crps
@staticmethod
def crps_skill_score(crps_forecast: float, crps_reference: float) -> float:
"""
CRPS Skill Score: improvement over a reference forecast.
1 = perfect, 0 = same as reference, negative = worse than reference.
"""
return 1 - crps_forecast / crps_reference if crps_reference > 0 else 0
@staticmethod
def energy_score(ensemble_forecasts: np.ndarray,
observations: np.ndarray) -> float:
"""
Energy Score: multivariate generalization of CRPS.
Useful for evaluating joint distributions (e.g., multi-site weather).
"""
n_members = ensemble_forecasts.shape[0]
n_obs = len(observations)
# E||X - y|| term
term1 = np.mean([
np.linalg.norm(ensemble_forecasts[i] - observations)
for i in range(n_members)
])
# E||X - X'|| term
term2 = 0
count = 0
for i in range(n_members):
for j in range(i + 1, n_members):
term2 += np.linalg.norm(ensemble_forecasts[i] - ensemble_forecasts[j])
count += 1
term2 = term2 / max(count, 1)
return term1 - 0.5 * term2
@staticmethod
def pit_histogram(ensemble_forecasts: list, observations: list,
n_bins: int = 10) -> dict:
"""
Probability Integral Transform (PIT) histogram.
For well-calibrated probabilistic forecasts, PIT values
should be uniformly distributed.
PIT value = percentile rank of observation within ensemble.
"""
pit_values = []
for ensemble, obs in zip(ensemble_forecasts, observations):
rank = np.sum(np.array(ensemble) <= obs) / len(ensemble)
pit_values.append(rank)
pit_values = np.array(pit_values)
hist, bin_edges = np.histogram(pit_values, bins=n_bins, range=(0, 1))
expected = len(pit_values) / n_bins
# Chi-squared test for uniformity
from scipy.stats import chi2
chi2_stat = np.sum((hist - expected)**2 / expected)
p_value = 1 - chi2.cdf(chi2_stat, n_bins - 1)
return {
'histogram': hist.tolist(),
'bin_edges': bin_edges.tolist(),
'chi2_statistic': chi2_stat,
'p_value': p_value,
'is_uniform': p_value > 0.05,
'interpretation': (
'Well-calibrated (uniform PIT)' if p_value > 0.05
else 'Dome-shaped: overconfident' if hist[len(hist)//2] > expected * 1.3
else 'U-shaped: underconfident' if hist[0] + hist[-1] > expected * 2.5
else 'Skewed: biased forecast'
)
}
Comparing Forecasters
class ForecasterComparison:
"""Methods for rigorously comparing forecasting performance."""
def __init__(self):
self.forecasters = {}
def add_forecaster(self, name: str, probabilities: list, outcomes: list):
self.forecasters[name] = {
'probabilities': np.array(probabilities),
'outcomes': np.array(outcomes, dtype=float)
}
def head_to_head(self, forecaster_a: str, forecaster_b: str) -> dict:
"""Compare two forecasters on shared questions."""
a = self.forecasters[forecaster_a]
b = self.forecasters[forecaster_b]
# Assume same outcomes
bs_a = np.mean((a['probabilities'] - a['outcomes']) ** 2)
bs_b = np.mean((b['probabilities'] - b['outcomes']) ** 2)
# Per-question comparison
a_better = 0
b_better = 0
tie = 0
for i in range(len(a['outcomes'])):
score_a = (a['probabilities'][i] - a['outcomes'][i]) ** 2
score_b = (b['probabilities'][i] - b['outcomes'][i]) ** 2
if score_a < score_b - 0.001:
a_better += 1
elif score_b < score_a - 0.001:
b_better += 1
else:
tie += 1
# Statistical significance (paired t-test)
differences = (a['probabilities'] - a['outcomes'])**2 - (b['probabilities'] - b['outcomes'])**2
from scipy.stats import ttest_rel
t_stat, p_value = ttest_rel(
(a['probabilities'] - a['outcomes'])**2,
(b['probabilities'] - b['outcomes'])**2
)
winner = forecaster_a if bs_a < bs_b else forecaster_b
return {
f'{forecaster_a}_brier': bs_a,
f'{forecaster_b}_brier': bs_b,
'difference': bs_a - bs_b,
'winner': winner,
'margin': abs(bs_a - bs_b),
'statistically_significant': p_value < 0.05,
'p_value': p_value,
f'{forecaster_a}_better_on': a_better,
f'{forecaster_b}_better_on': b_better,
'ties': tie
}
def leaderboard(self, metric: str = 'brier') -> list:
"""Rank all forecasters by chosen metric."""
results = []
for name, data in self.forecasters.items():
p = data['probabilities']
o = data['outcomes']
scores = {
'brier': np.mean((p - o) ** 2),
'log': np.mean([
np.log(max(pi, 1e-10)) if oi == 1 else np.log(max(1-pi, 1e-10))
for pi, oi in zip(p, o)
]),
'calibration_error': self._ece(p, o),
}
results.append({
'name': name,
'n_forecasts': len(p),
**scores
})
# Sort (lower Brier is better, higher log is better)
if metric == 'brier':
results.sort(key=lambda x: x['brier'])
elif metric == 'log':
results.sort(key=lambda x: -x['log'])
# Add ranks
for i, r in enumerate(results):
r['rank'] = i + 1
return results
def _ece(self, probs, outcomes, n_bins=10):
bins = np.linspace(0, 1, n_bins + 1)
ece = 0
total = len(probs)
for i in range(n_bins):
mask = (probs >= bins[i]) & (probs < bins[i+1])
n_k = np.sum(mask)
if n_k > 0:
ece += n_k / total * abs(np.mean(probs[mask]) - np.mean(outcomes[mask]))
return ece
def bootstrap_confidence_intervals(self, forecaster_name: str,
n_bootstrap: int = 1000) -> dict:
"""Compute confidence intervals for a forecaster's Brier score."""
data = self.forecasters[forecaster_name]
n = len(data['probabilities'])
bootstrap_briers = []
for _ in range(n_bootstrap):
indices = np.random.choice(n, n, replace=True)
p_boot = data['probabilities'][indices]
o_boot = data['outcomes'][indices]
bootstrap_briers.append(np.mean((p_boot - o_boot) ** 2))
return {
'point_estimate': np.mean((data['probabilities'] - data['outcomes']) ** 2),
'ci_95': (np.percentile(bootstrap_briers, 2.5),
np.percentile(bootstrap_briers, 97.5)),
'ci_90': (np.percentile(bootstrap_briers, 5),
np.percentile(bootstrap_briers, 95)),
'bootstrap_std': np.std(bootstrap_briers)
}
Tournament Design
Designing Forecasting Competitions
class ForecastingTournament:
"""
Design and run a forecasting tournament.
Based on practices from IARPA's ACE program, Good Judgment Project,
and Metaculus.
"""
def __init__(self, name: str, scoring_rule: str = 'brier',
n_questions: int = 100):
self.name = name
self.scoring_rule = scoring_rule
self.n_questions = n_questions
self.questions = []
self.participants = {}
self.resolved = []
def add_question(self, question: str, resolution_criteria: str,
resolution_date: str, category: str = 'general',
difficulty: str = 'medium'):
"""Add a question to the tournament."""
self.questions.append({
'id': len(self.questions),
'question': question,
'resolution_criteria': resolution_criteria,
'resolution_date': resolution_date,
'category': category,
'difficulty': difficulty,
'resolved': False,
'outcome': None
})
def submit_forecast(self, participant_id: str, question_id: int,
probability: float):
"""Record a participant's forecast."""
if participant_id not in self.participants:
self.participants[participant_id] = {
'forecasts': {},
'scores': []
}
self.participants[participant_id]['forecasts'][question_id] = {
'probability': probability,
'timestamp': 'now'
}
def resolve_question(self, question_id: int, outcome: bool):
"""Resolve a question and score all forecasters."""
question = self.questions[question_id]
question['resolved'] = True
question['outcome'] = outcome
for pid, data in self.participants.items():
if question_id in data['forecasts']:
p = data['forecasts'][question_id]['probability']
score = self._compute_score(p, int(outcome))
data['scores'].append({
'question_id': question_id,
'score': score,
'probability': p,
'outcome': outcome
})
def _compute_score(self, probability: float, outcome: int) -> float:
sr = ScoringRules()
if self.scoring_rule == 'brier':
return sr.brier_score(probability, outcome)
elif self.scoring_rule == 'log':
return sr.log_score(probability, outcome)
return sr.brier_score(probability, outcome)
def standings(self) -> list:
"""Current tournament standings."""
standings = []
for pid, data in self.participants.items():
if not data['scores']:
continue
scores = [s['score'] for s in data['scores']]
if self.scoring_rule == 'brier':
avg_score = np.mean(scores) # Lower is better
sort_key = avg_score
else:
avg_score = np.mean(scores) # Higher is better for log
sort_key = -avg_score
standings.append({
'participant': pid,
'avg_score': avg_score,
'n_forecasts': len(scores),
'participation_rate': len(scores) / max(
sum(1 for q in self.questions if q['resolved']), 1
),
'consistency': np.std(scores)
})
standings.sort(key=lambda x: x['avg_score'] if self.scoring_rule == 'brier' else -x['avg_score'])
for i, s in enumerate(standings):
s['rank'] = i + 1
return standings
def question_design_guidelines(self) -> dict:
"""Best practices for tournament question design."""
return {
'resolution_criteria': {
'rule': 'Must be objectively verifiable',
'examples': [
'GOOD: "Will X exceed 100 by Dec 31 as reported by Source Y?"',
'BAD: "Will X be successful?" (subjective)'
]
},
'time_horizon': {
'rule': 'Mix of short-term (1-3 months) and long-term (6-12 months)',
'rationale': 'Short-term provides frequent feedback; long-term tests deeper reasoning'
},
'difficulty_balance': {
'rule': 'Include easy (~75/25), medium (~60/40), and hard (~55/45) questions',
'rationale': 'Easy questions calibrate; hard questions separate forecasters'
},
'base_rate_awareness': {
'rule': 'Avoid questions where the base rate gives the answer',
'example': '"Will the sun rise tomorrow?" is not informative'
},
'category_diversity': {
'rule': 'Span multiple domains to test breadth',
'domains': ['geopolitics', 'economics', 'technology', 'science', 'social']
},
'update_frequency': {
'rule': 'Allow continuous updating',
'rationale': 'Tests ability to incorporate new information'
}
}
def anti_gaming_measures(self) -> dict:
"""Prevent gaming of the tournament scoring system."""
return {
'participation_minimum': {
'description': 'Require forecasts on at least 50% of questions',
'rationale': 'Prevents cherry-picking easy questions'
},
'proper_scoring': {
'description': 'Use strictly proper scoring rules only',
'rationale': 'Honest reporting is the dominant strategy'
},
'continuous_scoring': {
'description': 'Score the most recent forecast, not just the final one',
'rationale': 'Rewards early accuracy and continuous engagement'
},
'time_weighted_scoring': {
'description': 'Weight earlier forecasts more (harder to be accurate early)',
'rationale': 'Prevents last-minute forecast copying'
},
'category_requirements': {
'description': 'Require minimum forecasts per category',
'rationale': 'Prevents specialization gaming'
}
}
Comprehensive Evaluation Pipeline
class ForecastEvaluationPipeline:
"""End-to-end evaluation pipeline for forecast systems."""
def __init__(self):
self.evaluations = []
def evaluate(self, forecasts: np.ndarray, outcomes: np.ndarray,
ensemble_forecasts: np.ndarray = None) -> dict:
"""Run complete evaluation suite."""
results = {}
# 1. Point forecast metrics
results['brier_score'] = np.mean((forecasts - outcomes) ** 2)
results['log_score'] = np.mean([
ScoringRules.log_score(p, int(o)) for p, o in zip(forecasts, outcomes)
])
# 2. Decomposition
decomp = ForecastQualityDecomposition(forecasts, outcomes)
results['decomposition'] = decomp.full_decomposition()
# 3. ROC analysis
roc = ROCAnalysis(forecasts, outcomes)
results['auc'] = roc.compute_roc()['auc']
results['optimal_threshold'] = roc.optimal_threshold()
# 4. CRPS (if ensemble available)
if ensemble_forecasts is not None:
crps_values = []
for i in range(len(outcomes)):
crps = CRPSCalculator.crps_empirical(ensemble_forecasts[i], outcomes[i])
crps_values.append(crps)
results['mean_crps'] = np.mean(crps_values)
# PIT analysis
pit = CRPSCalculator.pit_histogram(
ensemble_forecasts.tolist(),
outcomes.tolist()
)
results['pit_analysis'] = pit
# 5. Overall assessment
bs = results['brier_score']
auc = results['auc']
cal = results['decomposition']['interpretation']['calibration']
if bs < 0.15 and auc > 0.8 and cal == 'good':
results['overall_grade'] = 'A — Excellent forecasting'
elif bs < 0.20 and auc > 0.7:
results['overall_grade'] = 'B — Good forecasting'
elif bs < 0.25 and auc > 0.6:
results['overall_grade'] = 'C — Fair forecasting'
else:
results['overall_grade'] = 'D — Needs improvement'
return results
Key Takeaways
- Only use proper scoring rules (Brier, log, spherical) for evaluation; improper rules like accuracy/zero-one incentivize dishonest reporting
- The Brier score decomposes into reliability (calibration), resolution (discrimination), and uncertainty: improve the first two, the third is fixed
- Log scoring is more appropriate than Brier when you want to heavily penalize confident wrong predictions (e.g., tail risk applications)
- ROC/AUC measures discrimination ability independent of calibration; use alongside calibration metrics for complete assessment
- CRPS is the standard for evaluating full probability distributions; PIT histograms diagnose specific calibration failures (overconfidence, bias)
- When comparing forecasters, use paired tests (not just average scores) and bootstrap confidence intervals to assess statistical significance
- Tournament design requires proper scoring rules, participation minimums, category diversity, and anti-gaming measures
- A complete evaluation pipeline combines point metrics, decomposition, discrimination analysis, probabilistic calibration, and overall grading
Install this skill directly: skilldb add prediction-skills