Calibration and Accuracy
Overview
A forecaster who says "70% chance" should be right about 70% of the time they say that. This is calibration — the alignment between stated confidence and actual frequency of outcomes. Calibration, combined with resolution (the ability to discriminate between events that happen and those that do not), determines forecast quality. This skill covers how to measure, track, and improve forecast accuracy using proper scoring rules, calibration curves, and debiasing techniques drawn from the superforecasting literature.
Proper Scoring Rules
What Makes a Scoring Rule "Proper"
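A scoring rule is proper if a forecaster maximizes their expected score by reporting their true belief. Improper scoring rules reward gaming: reporting something other than the true belief earns a better expected score. Under a linear (absolute-error) score, for example, a forecaster who believes 70% does best by reporting 100%.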
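The definition above can be checked numerically. The sketch below is an illustration rather than part of the skill: it assumes a forecaster whose true belief is 70% and compares the expected penalty of every candidate report under the Brier score and under a linear (absolute-error) score, which is used here only as an example of an improper rule. The function names are hypothetical.

```python
import numpy as np


def expected_brier_loss(report: float, belief: float) -> float:
    """Expected squared error of reporting `report` when the event has probability `belief`."""
    return belief * (report - 1) ** 2 + (1 - belief) * report ** 2


def expected_linear_loss(report: float, belief: float) -> float:
    """Expected absolute error; an improper rule, shown only for contrast."""
    return belief * (1 - report) + (1 - belief) * report


belief = 0.7  # assumed true belief
candidates = np.linspace(0, 1, 101)  # candidate reports 0.00, 0.01, ..., 1.00

# Pick the report with the lowest expected penalty under each rule
best_brier = candidates[np.argmin([expected_brier_loss(q, belief) for q in candidates])]
best_linear = candidates[np.argmin([expected_linear_loss(q, belief) for q in candidates])]

print(f"Best report under Brier score:  {best_brier:.2f}")
print(f"Best report under linear score: {best_linear:.2f}")
```

Under the Brier score the best report coincides with the belief. Under the linear score it jumps to 1.00, so honest reporting is not the best strategy.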
Brier Score
The most widely used proper scoring rule for binary outcomes:
```python
from typing import Optional

import numpy as np


def brier_score(probabilities: list, outcomes: list) -> float:
    """
    Brier Score = mean of (forecast - outcome)²

    Range: 0 (perfect) to 1 (worst possible)
    Baseline: 0.25 (always predicting 50%)

    Example:
        Forecast 0.80 that it will rain, it rains: (0.80 - 1)² = 0.04
        Forecast 0.80 that it will rain, it doesn't: (0.80 - 0)² = 0.64
    """
    probs = np.array(probabilities, dtype=float)
    outs = np.array(outcomes, dtype=float)
    return float(np.mean((probs - outs) ** 2))


def brier_skill_score(probabilities: list, outcomes: list,
                      reference_prob: Optional[float] = None) -> float:
    """
    Brier Skill Score: improvement over a reference forecast.

    BSS = 1 - BS / BS_reference
    BSS = 1: perfect
    BSS = 0: no better than reference
    BSS < 0: worse than reference
    """
    outs = np.array(outcomes, dtype=float)
    bs = brier_score(probabilities, outcomes)
    if reference_prob is None:
        reference_prob = np.mean(outs)  # Climatological base rate
    bs_ref = np.mean((reference_prob - outs) ** 2)
    # bs_ref is 0 when the reference is already perfect (e.g. all outcomes identical)
    return float(1 - bs / bs_ref) if bs_ref > 0 else 0.0


def brier_decomposition(probabilities: list, outcomes: list,
                        n_bins: int = 10) -> dict:
    """
    Decompose Brier Score into three components:
    BS = Reliability - Resolution + Uncertainty

    Reliability: How well calibrated are the forecasts? (lower is better)
    Resolution: How well do forecasts discriminate? (higher is better)
    Uncertainty: Base rate variance (fixed property of the dataset)
    """
    probs = np.array(probabilities, dtype=float)
    outs = np.array(outcomes, dtype=float)
    n = len(probs)
    base_rate = np.mean(outs)
    bins = np.linspace(0, 1, n_bins + 1)

    reliability = 0.0
    resolution = 0.0
    for i in range(n_bins):
        # The last bin is closed on the right so a forecast of exactly 1.0 is counted
        upper = probs <= bins[i + 1] if i == n_bins - 1 else probs < bins[i + 1]
        mask = (probs >= bins[i]) & upper
        n_k = np.sum(mask)
        if n_k == 0:
            continue
        forecast_mean = np.mean(probs[mask])
        outcome_mean = np.mean(outs[mask])
        reliability += n_k * (forecast_mean - outcome_mean) ** 2
        resolution += n_k * (outcome_mean - base_rate) ** 2

    reliability /= n
    resolution /= n
    uncertainty = base_rate * (1 - base_rate)
    bs = brier_score(probabilities, outcomes)

    return {
        'brier_score': bs,
        'reliability': reliability,  # Calibration error (lower = better)
        'resolution': resolution,    # Discrimination (higher = better)
        'uncertainty': uncertainty,  # Inherent uncertainty (fixed)
        # Residual of the identity. It is near zero when all forecasts in a bin
        # share one value; varied forecasts within a bin leave a small remainder.
        'check': abs(reliability - resolution + uncertainty - bs)
    }
```
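A small hand-checkable case may help make the decomposition concrete. The sketch below uses made-up data in which only two forecast values appear, so it can group by exact forecast value instead of by bin and the identity holds exactly. The helper name `decompose_by_value` and the data are assumptions for illustration.

```python
import numpy as np


def decompose_by_value(probs: np.ndarray, outs: np.ndarray) -> tuple:
    """Reliability, resolution, uncertainty, grouping by distinct forecast value."""
    n = len(probs)
    base_rate = outs.mean()
    reliability = 0.0
    resolution = 0.0
    for value in np.unique(probs):
        mask = probs == value
        observed = outs[mask].mean()  # how often the event happened at this forecast
        reliability += mask.sum() * (value - observed) ** 2
        resolution += mask.sum() * (observed - base_rate) ** 2
    return reliability / n, resolution / n, base_rate * (1 - base_rate)


# Hypothetical record: five forecasts at 20% (1 hit), five at 80% (3 hits)
probs = np.array([0.2] * 5 + [0.8] * 5)
outs = np.array([0, 0, 0, 0, 1, 1, 1, 1, 0, 0], dtype=float)

reliability, resolution, uncertainty = decompose_by_value(probs, outs)
brier = np.mean((probs - outs) ** 2)

print(f"reliability={reliability:.3f} resolution={resolution:.3f} "
      f"uncertainty={uncertainty:.3f} brier={brier:.3f}")
```

Here the 20% forecasts are perfectly calibrated, while the 80% forecasts came true 60% of the time. That gap is the only source of reliability error: 0.02 - 0.04 + 0.24 = 0.22, which matches the Brier score computed directly.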
Log Score (Logarithmic Scoring Rule)
```python
def log_score(probabilities: list, outcomes: list) -> float:
    """
    Log Score = mean of log(forecast for actual outcome)

    More punishing of confident wrong predictions than Brier.
    Predicting 0.99 when outcome is 0: log(0.01) = -4.6
    Predicting 0.51 when outcome is 0: log(0.49) = -0.71

    Returns negative value (higher/less negative = better).
    """
    scores = []
    for prob, outcome in zip(probabilities, outcomes):
        prob = np.clip(prob, 0.001, 0.999)  # Avoid log(0)
        if outcome == 1:
            scores.append(np.log(prob))
        else:
            scores.append(np.log(1 - prob))
    return np.mean(scores)


def log_score_multiclass(probability_vectors: list, outcomes: list) -> float:
    """
    Log score for multi-class predictions.
    Each probability_vector assigns probabilities to all classes;
    each outcome is the index of the class that occurred.
    """
    scores = []
    for probs, outcome in zip(probability_vectors, outcomes):
        prob_of_actual = probs[outcome]
        prob_of_actual = max(prob_of_actual, 0.001)  # Avoid log(0)
        scores.append(np.log(prob_of_actual))
    return np.mean(scores)
```
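To see how differently the two rules treat confident misses, the sketch below compares per-forecast penalties for an event that did not happen. It is an illustrative comparison under assumed forecast values, and the helper names are hypothetical. The log penalty is written as a positive number (negative log score) so that larger means worse for both rules.

```python
import numpy as np


def brier_penalty(prob: float, outcome: int) -> float:
    """Squared error for a single forecast of outcome 1."""
    return (prob - outcome) ** 2


def log_penalty(prob: float, outcome: int) -> float:
    """Negative log of the probability assigned to what actually happened."""
    prob_of_actual = prob if outcome == 1 else 1 - prob
    return float(-np.log(prob_of_actual))


# The event did not happen (outcome 0); compare mild and extreme wrong forecasts
for prob in (0.51, 0.90, 0.99):
    print(f"forecast={prob:.2f}  Brier={brier_penalty(prob, 0):.3f}  "
          f"log={log_penalty(prob, 0):.3f}")

# How much worse is a 0.99 miss than a 0.51 miss under each rule?
brier_ratio = brier_penalty(0.99, 0) / brier_penalty(0.51, 0)
log_ratio = log_penalty(0.99, 0) / log_penalty(0.51, 0)
print(f"Brier ratio: {brier_ratio:.1f}x, log ratio: {log_ratio:.1f}x")
```

The Brier penalty can never exceed 1, whereas the log penalty grows without bound as the forecast approaches certainty. This is why the functions above clip probabilities before taking the logarithm.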
Calibration Curves
Building and Interpreting Calibration Plots
```python
class CalibrationAnalyzer:
    """Comprehensive calibration analysis for probabilistic forecasts."""

    def __init__(self):
        self.forecasts = []  # (probability, outcome) pairs

    def add(self, probability: float, outcome: bool):
        self.forecasts.append((probability, int(outcome)))

    def add_batch(self, probabilities: list, outcomes: list):
        for p, o in zip(probabilities, outcomes):
            self.add(p, o)

    def calibration_curve(self, n_bins: int = 10) -> dict:
        """Generate calibration curve data."""
        probs = np.array([f[0] for f in self.forecasts])
        outs = np.array([f[1] for f in self.forecasts])
        bins = np.linspace(0, 1, n_bins + 1)

        curve = []
        for i in range(n_bins):
            # The last bin is closed on the right so forecasts of exactly 1.0 are counted
            upper = probs <= bins[i + 1] if i == n_bins - 1 else probs < bins[i + 1]
            mask = (probs >= bins[i]) & upper
            count = np.sum(mask)
            if count == 0:
                continue
            mean_predicted = np.mean(probs[mask])
            mean_actual = np.mean(outs[mask])
            curve.append({
                'bin_start': bins[i],
                'bin_end': bins[i + 1],
                'mean_predicted': mean_predicted,
                'mean_actual': mean_actual,
                'count': int(count),
                'error': abs(mean_predicted - mean_actual),
                # Forecast above the observed frequency is over-forecasting; for bins
                # below 50% that is not the same thing as overconfidence.
                'direction': 'overforecast' if mean_predicted > mean_actual else 'underforecast'
            })

        return {
            'curve': curve,
            'perfect_calibration_line': [(x, x) for x in np.linspace(0, 1, 11)]
        }

    def expected_calibration_error(self, n_bins: int = 10) -> float:
        """ECE: weighted average of bin calibration errors."""
        curve = self.calibration_curve(n_bins)['curve']
        total = sum(bin_data['count'] for bin_data in curve)
        ece = sum(
            bin_data['count'] / total * bin_data['error']
            for bin_data in curve
        )
        return ece

    def maximum_calibration_error(self, n_bins: int = 10) -> float:
        """MCE: worst bin calibration error."""
        curve = self.calibration_curve(n_bins)['curve']
        if not curve:
            return 0
        return max(bin_data['error'] for bin_data in curve)

    def overconfidence_score(self) -> float:
        """
        Measure systematic overconfidence.
        Positive = overconfident, Negative = underconfident.
        """
        weighted_error = 0
        total = 0
        for prob, outcome in self.forecasts:
            distance_from_50 = abs(prob - 0.5)
            expected_accuracy = max(prob, 1 - prob)
            # 1 if the side of 50% the forecast leaned toward actually happened
            actual_accuracy = int((prob > 0.5) == outcome)
            weighted_error += (expected_accuracy - actual_accuracy) * distance_from_50
            total += distance_from_50
        return weighted_error / total if total > 0 else 0

    def full_report(self) -> dict:
        """Generate comprehensive calibration report."""
        probs = [f[0] for f in self.forecasts]
        outs = [f[1] for f in self.forecasts]

        return {
            'n_forecasts': len(self.forecasts),
            'base_rate': np.mean(outs),
            'mean_forecast': np.mean(probs),
            'brier_score': brier_score(probs, outs),
            'brier_skill': brier_skill_score(probs, outs),
            'log_score': log_score(probs, outs),
            'ece': self.expected_calibration_error(),
            'mce': self.maximum_calibration_error(),
            'overconfidence': self.overconfidence_score(),
            'decomposition': brier_decomposition(probs, outs),
            'calibration_curve': self.calibration_curve(),
            'assessment': self._assess_quality()
        }

    def _assess_quality(self) -> str:
        ece = self.expected_calibration_error()
        probs = [f[0] for f in self.forecasts]
        outs = [f[1] for f in self.forecasts]
        bs = brier_score(probs, outs)

        if ece < 0.03 and bs < 0.15:
            return "Excellent (superforecaster level)"
        elif ece < 0.05 and bs < 0.20:
            return "Good (well-calibrated)"
        elif ece < 0.10:
            return "Fair (some calibration issues)"
        else:
            return "Poor (significant calibration problems)"
```
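The ECE and MCE figures are easier to trust after working one case by hand. The standalone sketch below is not the class above: it is a compact re-derivation on made-up data, with a hypothetical function name `calibration_errors` and a simple floor-based binning chosen for brevity.

```python
import numpy as np


def calibration_errors(probs, outs, n_bins: int = 10) -> tuple:
    """Return (ECE, MCE) using equal-width bins; 1.0 falls into the last bin."""
    probs = np.asarray(probs, dtype=float)
    outs = np.asarray(outs, dtype=float)
    bin_index = np.minimum((probs * n_bins).astype(int), n_bins - 1)

    ece = 0.0
    mce = 0.0
    for k in np.unique(bin_index):
        mask = bin_index == k
        gap = abs(probs[mask].mean() - outs[mask].mean())
        ece += mask.sum() / len(probs) * gap  # weight each gap by bin share
        mce = max(mce, gap)                   # track the worst bin
    return ece, mce


# Hypothetical record: ten forecasts at 90% (7 came true), ten at 30% (3 came true)
probs = [0.9] * 10 + [0.3] * 10
outs = [1] * 7 + [0] * 3 + [1] * 3 + [0] * 7

ece, mce = calibration_errors(probs, outs)
print(f"ECE={ece:.3f}  MCE={mce:.3f}")
```

The 30% bin is perfectly calibrated and the 90% bin is off by 0.20, so the ECE is half of that gap (0.10) and the MCE is the full gap (0.20). A single summary number can therefore hide one badly calibrated bin, which is the reason to report both.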
Overconfidence Detection and Debiasing
Common Biases in Forecasting
```python
class BiasDetector:
    """Detect common forecasting biases."""

    def __init__(self, forecasts: list):
        """forecasts: list of (probability, outcome, metadata) tuples"""
        self.forecasts = forecasts

    def detect_overconfidence(self) -> dict:
        """
        The most common bias: predictions are too extreme.
        When you say 90%, it only happens 75% of the time.
        """
        extreme = [(p, o) for p, o, _ in self.forecasts if p > 0.8 or p < 0.2]
        if len(extreme) < 10:
            return {'insufficient_data': True}

        high_conf = [(p, o) for p, o in extreme if p > 0.8]
        low_conf = [(p, o) for p, o in extreme if p < 0.2]

        # Share of extreme forecasts that pointed the right way
        high_accuracy = np.mean([o for _, o in high_conf]) if high_conf else None
        low_accuracy = 1 - np.mean([o for _, o in low_conf]) if low_conf else None

        overconfident = False
        if high_accuracy is not None and high_accuracy < 0.75:
            overconfident = True
        if low_accuracy is not None and low_accuracy < 0.75:
            overconfident = True

        return {
            'is_overconfident': overconfident,
            'high_confidence_accuracy': high_accuracy,
            'low_confidence_accuracy': low_accuracy,
            'recommended_extremity_factor': self._compute_shrinkage()
        }

    def detect_anchoring(self) -> dict:
        """Detect if forecaster is anchoring to round numbers or priors."""
        probs = [p for p, _, _ in self.forecasts]
        if not probs:
            return {'insufficient_data': True}

        # Check clustering at round numbers
        round_numbers = [0.10, 0.20, 0.25, 0.30, 0.40, 0.50, 0.60, 0.70, 0.75, 0.80, 0.90]
        near_round = sum(1 for p in probs if min(abs(p - r) for r in round_numbers) < 0.02)
        round_fraction = near_round / len(probs)

        return {
            'fraction_near_round_numbers': round_fraction,
            'likely_anchoring': round_fraction > 0.5,
            'unique_values': len(set(round(p, 2) for p in probs)),
            'recommendation': 'Use more granular probabilities' if round_fraction > 0.5 else 'Good granularity'
        }

    def detect_base_rate_neglect(self) -> dict:
        """Detect if forecaster ignores base rates."""
        base_rate = np.mean([o for _, o, _ in self.forecasts])
        mean_forecast = np.mean([p for p, _, _ in self.forecasts])

        # If the base rate is 10% but mean forecast is 30%, base rate neglect
        neglect_ratio = abs(mean_forecast - base_rate) / max(base_rate, 0.01)

        return {
            'base_rate': base_rate,
            'mean_forecast': mean_forecast,
            'divergence': abs(mean_forecast - base_rate),
            'likely_neglect': neglect_ratio > 0.5
        }

    def _compute_shrinkage(self) -> float:
        """Compute optimal shrinkage toward 50% to correct overconfidence."""
        from scipy.optimize import minimize_scalar

        probs = np.array([p for p, _, _ in self.forecasts])
        outs = np.array([o for _, o, _ in self.forecasts], dtype=float)

        def shrunk_brier(alpha):
            adjusted = 0.5 + alpha * (probs - 0.5)
            return np.mean((adjusted - outs) ** 2)

        result = minimize_scalar(shrunk_brier, bounds=(0, 1), method='bounded')
        return result.x  # Optimal alpha (<1 means you're overconfident)
```
Debiasing Techniques
```python
class Debiaser:
    """Apply debiasing corrections to raw forecasts."""

    @staticmethod
    def extremity_correction(probability: float, factor: float = 0.85) -> float:
        """
        Shrink probabilities toward 50% to correct overconfidence.

        factor < 1 reduces extremity (corrects overconfidence)
        factor > 1 increases extremity (corrects underconfidence)
        """
        return 0.5 + factor * (probability - 0.5)

    @staticmethod
    def log_odds_correction(probability: float, factor: float = 0.85) -> float:
        """
        Correction in log-odds space (more theoretically sound).
        """
        # Leave near-certain forecasts untouched to avoid log(0) and division by zero
        if probability <= 0.001 or probability >= 0.999:
            return probability
        log_odds = np.log(probability / (1 - probability))
        adjusted_log_odds = factor * log_odds
        return 1 / (1 + np.exp(-adjusted_log_odds))

    @staticmethod
    def recalibrate_with_platt_scaling(raw_probs: np.ndarray,
                                       outcomes: np.ndarray) -> tuple:
        """
        Platt scaling: fit a logistic regression to map raw probabilities
        to calibrated probabilities.

        Returns (calibrate, params): calibrate(p) maps raw probabilities to
        calibrated ones, and params holds the fitted slope a and intercept b.
        """
        from scipy.optimize import minimize

        eps = 1e-10
        outcomes = np.asarray(outcomes, dtype=float)

        def to_log_odds(p):
            # Clip away from 0 and 1 so the logit stays finite
            p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
            return np.log(p / (1 - p))

        raw_log_odds = to_log_odds(raw_probs)

        def nll(params):
            a, b = params
            calibrated = 1 / (1 + np.exp(-(a * raw_log_odds + b)))
            calibrated = np.clip(calibrated, eps, 1 - eps)
            return -np.mean(
                outcomes * np.log(calibrated)
                + (1 - outcomes) * np.log(1 - calibrated)
            )

        result = minimize(nll, x0=[1.0, 0.0])
        a, b = result.x

        def calibrate(p):
            return 1 / (1 + np.exp(-(a * to_log_odds(p) + b)))

        return calibrate, {'a': a, 'b': b}

    @staticmethod
    def isotonic_recalibration(raw_probs: np.ndarray,
                               outcomes: np.ndarray):
        """
        Non-parametric recalibration using isotonic regression.
        More flexible than Platt scaling but needs more data.
        """
        from sklearn.isotonic import IsotonicRegression

        ir = IsotonicRegression(out_of_bounds='clip')
        ir.fit(raw_probs, outcomes)
        return ir.predict
```
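The two shrinkage corrections behave quite differently near the extremes, which a few sample values make visible. The sketch below re-implements both formulas as standalone functions (hypothetical names `linear_shrink` and `log_odds_shrink`) with the same 0.85 factor; the input probabilities are arbitrary examples.

```python
import numpy as np


def linear_shrink(p: float, factor: float = 0.85) -> float:
    """Pull a probability toward 0.5 in probability space."""
    return 0.5 + factor * (p - 0.5)


def log_odds_shrink(p: float, factor: float = 0.85) -> float:
    """Pull a probability toward 0.5 by scaling its log-odds."""
    log_odds = np.log(p / (1 - p))
    return float(1 / (1 + np.exp(-factor * log_odds)))


for p in (0.60, 0.95, 0.999):
    print(f"raw={p:.3f}  linear={linear_shrink(p):.4f}  "
          f"log-odds={log_odds_shrink(p):.4f}")
```

With a factor of 0.85 the linear correction can never output more than 0.925, however strong the evidence, while the log-odds version still allows near-certain forecasts. That ceiling is one practical reason to prefer the log-odds form when some questions really are close to settled.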
Track Record Methodology
Building a Forecast Track Record
```python
class ForecastTracker:
    """Track and analyze a forecaster's long-term record."""

    def __init__(self, forecaster_name: str):
        self.name = forecaster_name
        self.questions = {}
        self.resolved = []

    def record_forecast(self, question_id: str, question_text: str,
                        probability: float, timestamp: str,
                        category: str = 'general'):
        if question_id not in self.questions:
            self.questions[question_id] = {
                'text': question_text,
                'category': category,
                'forecasts': []
            }
        self.questions[question_id]['forecasts'].append({
            'probability': probability,
            'timestamp': timestamp
        })

    def resolve(self, question_id: str, outcome: bool):
        q = self.questions.get(question_id)
        if q:
            # Score the most recent forecast recorded for this question
            last_forecast = q['forecasts'][-1]['probability']
            self.resolved.append({
                'question_id': question_id,
                'text': q['text'],
                'category': q['category'],
                'final_probability': last_forecast,
                'outcome': outcome,
                'n_updates': len(q['forecasts'])
            })

    def performance_summary(self) -> dict:
        if not self.resolved:
            return {'error': 'No resolved questions'}

        probs = [r['final_probability'] for r in self.resolved]
        outs = [r['outcome'] for r in self.resolved]

        analyzer = CalibrationAnalyzer()
        analyzer.add_batch(probs, outs)

        # Category breakdown
        categories = {}
        for r in self.resolved:
            cat = r['category']
            if cat not in categories:
                categories[cat] = {'probs': [], 'outs': []}
            categories[cat]['probs'].append(r['final_probability'])
            categories[cat]['outs'].append(r['outcome'])

        cat_scores = {}
        for cat, data in categories.items():
            cat_scores[cat] = {
                'n': len(data['probs']),
                'brier': brier_score(data['probs'], data['outs']),
            }

        return {
            'forecaster': self.name,
            'total_resolved': len(self.resolved),
            'overall': analyzer.full_report(),
            'by_category': cat_scores,
            'trend': self._performance_trend()
        }

    def _performance_trend(self, window: int = 20) -> list:
        """Track performance over time to detect improvement or decline."""
        if len(self.resolved) < window:
            return []

        trend = []
        # Rolling Brier score over the last `window` resolved questions
        for i in range(window, len(self.resolved) + 1):
            chunk = self.resolved[i - window:i]
            probs = [r['final_probability'] for r in chunk]
            outs = [r['outcome'] for r in chunk]
            trend.append({
                'window_end': i,
                'brier_score': brier_score(probs, outs)
            })
        return trend
```
Superforecaster Training Principles
Ten Commandments of Superforecasting (adapted from Tetlock and Gardner)
1. TRIAGE: Focus on questions where effort improves accuracy
- Skip questions that are too easy or too hard
- Prioritize questions in the "Goldilocks zone" of difficulty
2. BREAK DOWN PROBLEMS: Fermi estimation approach
- Decompose into sub-questions
- Estimate each component separately
3. STRIKE THE RIGHT BALANCE (inside vs outside view)
- Start with the base rate (outside view)
- Adjust with case-specific factors (inside view)
4. DISTINGUISH AS MANY DEGREES OF UNCERTAINTY AS THE PROBLEM ALLOWS
- Use granular probabilities (not just "likely/unlikely")
- The difference between 60% and 65% matters
5. BALANCE UNDER- AND OVER-REACTION TO NEW EVIDENCE
- Update, but do not overreact to single data points
- Frequent small updates beat rare large updates
6. LOOK FOR CLASHING CAUSAL FORCES
- Consider arguments for AND against
- Actively seek disconfirming evidence
7. BALANCE PARSIMONY AND COMPLEXITY
- Simple models as baseline
- Add complexity only when it helps
8. BEWARE OF GROUPTHINK
- Devil's advocate discipline
- Value dissent
9. EMBRACE CONTINUOUS SELF-IMPROVEMENT
- Track your record
- Analyze where you went wrong
- Update your process, not just your forecasts
10. BRING OUT YOUR INNER SUPERFORECASTER
- Growth mindset about probabilistic reasoning
- Practice, practice, practice
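Principles 3 and 5 in the list above (start from the base rate, then update in measured steps) can be sketched as an odds-form Bayesian update. This is one possible way to make them concrete, not a method prescribed by the list. The function name, the 10% base rate, and the likelihood ratios are all assumed for illustration.

```python
def update_with_likelihood_ratio(prior: float, likelihood_ratio: float) -> float:
    """
    Odds-form Bayes update.

    likelihood_ratio = P(evidence | event) / P(evidence | no event);
    values above 1 raise the probability, values below 1 lower it.
    """
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)


# Outside view: assume events of this kind happen about 10% of the time
base_rate = 0.10

# Inside view: case-specific evidence judged 3x more likely if the event is coming
after_supporting = update_with_likelihood_ratio(base_rate, 3.0)

# Later evidence points the other way (half as likely if the event is coming)
after_contrary = update_with_likelihood_ratio(after_supporting, 0.5)

print(f"base rate:                 {base_rate:.3f}")
print(f"after supporting evidence: {after_supporting:.3f}")
print(f"after contrary evidence:   {after_contrary:.3f}")
```

Working in odds keeps each update proportional to the strength of the evidence. Even evidence rated three times more likely under the event moves a 10% base rate only to 25%, which guards against overreacting to a single data point.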
Deliberate Practice for Calibration
```python
class CalibrationTraining:
    """Exercises for improving forecast calibration."""

    def __init__(self):
        self.exercises = []
        self.performance_history = []

    def trivia_calibration_exercise(self, questions: list) -> dict:
        """
        Classic calibration exercise:
        For each question, provide a 90% confidence interval.
        A well-calibrated person gets 90% of intervals correct.
        """
        correct = 0
        for q in questions:
            # q = {'question': str, 'answer': float, 'user_low': float, 'user_high': float}
            if q['user_low'] <= q['answer'] <= q['user_high']:
                correct += 1

        accuracy = correct / len(questions) if questions else 0

        return {
            'intended_coverage': 0.90,
            'actual_coverage': accuracy,
            'overconfident': accuracy < 0.85,
            'underconfident': accuracy > 0.95,
            'calibration_gap': abs(accuracy - 0.90),
            'recommendation': (
                'Widen your intervals' if accuracy < 0.80
                else 'Slightly widen intervals' if accuracy < 0.85
                else 'Well calibrated' if accuracy <= 0.95
                else 'Narrow your intervals (underconfident)'
            )
        }

    def probability_quiz(self, statements: list) -> dict:
        """
        Assign probabilities to statements, then check calibration.
        statements: list of {'text': str, 'user_prob': float, 'truth': bool}
        """
        probs = [s['user_prob'] for s in statements]
        truths = [s['truth'] for s in statements]

        analyzer = CalibrationAnalyzer()
        analyzer.add_batch(probs, truths)
        return analyzer.full_report()
```
Key Takeaways
- Calibration is the alignment between stated probabilities and actual frequencies; together with resolution, it determines forecast quality
- The Brier score decomposes into reliability (calibration error), resolution (discrimination), and uncertainty (base rate variance); lower the first and raise the second, since the third is fixed by the data
- Log scoring punishes confident wrong predictions more severely than Brier; use it when you want to penalize overconfidence
- Overconfidence is the most common bias: forecasts made at 90% often come true only 70-80% of the time, and shrinkage toward 50% usually helps
- Expected Calibration Error (ECE) provides a single number summarizing calibration quality; below 0.05 is good, below 0.03 is excellent
- Platt scaling and isotonic regression can recalibrate model outputs using held-out data
- Track records with at least 50-100 resolved forecasts are needed for reliable calibration assessment
- Superforecaster training combines outside view (base rates), granular probabilities, frequent small updates, and deliberate calibration practice
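The sample-size takeaway above follows from ordinary binomial sampling noise. As a rough sketch, assuming a forecaster who is perfectly calibrated at 70% and whose resolved forecasts all sit in that one bin, the standard error of the observed hit rate shrinks only with the square root of the number of forecasts. The function name is hypothetical.

```python
import math


def bin_standard_error(true_rate: float, n_forecasts: int) -> float:
    """Standard error of the observed frequency in a bin of n independent forecasts."""
    return math.sqrt(true_rate * (1 - true_rate) / n_forecasts)


# Assumed true hit rate of 70% in the bin
for n in (10, 50, 100, 1000):
    se = bin_standard_error(0.70, n)
    print(f"n={n:>4}  standard error of observed frequency: {se:.3f}")
```

With 10 forecasts the observed frequency can easily land 15 points away from the truth by chance alone, so an apparent calibration gap means little. By 100 forecasts the noise drops below 5 points, and a real bias of the size discussed above starts to become distinguishable.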