Calibration and Accuracy

Overview

A forecaster who says "70% chance" should be right about 70% of the time they say that. This is calibration — the alignment between stated confidence and actual frequency of outcomes. Calibration, combined with resolution (the ability to discriminate between events that happen and those that do not), determines forecast quality. This skill covers how to measure, track, and improve forecast accuracy using proper scoring rules, calibration curves, and debiasing techniques drawn from the superforecasting literature.

Proper Scoring Rules

What Makes a Scoring Rule "Proper"

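A scoring rule is proper if a forecaster gets the best expected score by reporting their true belief (and strictly proper if the true belief is the only report that achieves it). Improper scoring rules incentivize gaming, such as hedging every forecast at 50% or pushing every forecast to an extreme regardless of what the forecaster actually believes.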
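
To make the definition concrete, here is a minimal sketch comparing a quadratic (Brier-style) penalty with a linear absolute-error penalty for a forecaster whose true belief is 0.7. The function names and the grid search are illustrative assumptions, not part of any library; the point is only that the quadratic penalty is minimized by honest reporting while the linear one is not.

```python
def expected_quadratic_penalty(report: float, belief: float) -> float:
    """Expected Brier-style penalty when the event occurs with probability `belief`."""
    return belief * (report - 1) ** 2 + (1 - belief) * report ** 2


def expected_linear_penalty(report: float, belief: float) -> float:
    """Expected absolute-error penalty (an improper rule)."""
    return belief * abs(report - 1) + (1 - belief) * abs(report)


def best_report(penalty, belief: float, steps: int = 100) -> float:
    """Grid-search the report in [0, 1] that minimizes the expected penalty."""
    grid = [i / steps for i in range(steps + 1)]
    return min(grid, key=lambda r: penalty(r, belief))


belief = 0.7  # hypothetical true belief
honest_optimum = best_report(expected_quadratic_penalty, belief)
gamed_optimum = best_report(expected_linear_penalty, belief)

print("quadratic penalty, best report:", honest_optimum)  # equals the true belief
print("linear penalty, best report:   ", gamed_optimum)   # pushed to an extreme
```

Under the linear penalty the expected loss is `0.7 - 0.4 * report`, so the forecaster does best by reporting 1.0 even though they believe 0.7. That is the kind of gaming a proper rule rules out.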

Brier Score

The most widely used proper scoring rule for binary outcomes:

import numpy as np

def brier_score(probabilities: list, outcomes: list) -> float:
    """
    Brier Score = mean of (forecast - outcome)²

    Range: 0 (perfect) to 1 (worst possible)
    Baseline: 0.25 (always predicting 50%)

    Example:
        Forecast 0.80 that it will rain, it rains: (0.80 - 1)² = 0.04
        Forecast 0.80 that it will rain, it doesn't: (0.80 - 0)² = 0.64
    """
    probs = np.array(probabilities)
    outs = np.array(outcomes, dtype=float)
    return np.mean((probs - outs) ** 2)


def brier_skill_score(probabilities: list, outcomes: list,
                      reference_prob: float = None) -> float:
    """
    Brier Skill Score: improvement over a reference forecast.
    BSS = 1 - BS / BS_reference

    BSS = 1: perfect
    BSS = 0: no better than reference
    BSS < 0: worse than reference
    """
    bs = brier_score(probabilities, outcomes)

    if reference_prob is None:
        reference_prob = np.mean(outcomes)  # Climatological base rate

    bs_ref = np.mean((reference_prob - np.array(outcomes, dtype=float)) ** 2)

    return 1 - bs / bs_ref if bs_ref > 0 else 0


def brier_decomposition(probabilities: list, outcomes: list,
                        n_bins: int = 10) -> dict:
    """
    Decompose Brier Score into three components:
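    BS = Reliability - Resolution + Uncertainty

    Reliability: How well calibrated are the forecasts? (lower is better)
    Resolution: How well do forecasts discriminate? (higher is better)
    Uncertainty: Base rate variance (fixed property of the dataset)

    The identity is exact only when all forecasts within a bin are equal;
    with binned continuous forecasts, 'check' reports the small residual.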
    """
    probs = np.array(probabilities)
    outs = np.array(outcomes, dtype=float)
    n = len(probs)
    base_rate = np.mean(outs)

    bins = np.linspace(0, 1, n_bins + 1)
    reliability = 0
    resolution = 0

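    for i in range(n_bins):
        # Bins are half-open [lo, hi); the last bin is closed so p == 1.0 is counted
        if i == n_bins - 1:
            mask = (probs >= bins[i]) & (probs <= bins[i+1])
        else:
            mask = (probs >= bins[i]) & (probs < bins[i+1])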
        n_k = np.sum(mask)
        if n_k == 0:
            continue

        forecast_mean = np.mean(probs[mask])
        outcome_mean = np.mean(outs[mask])

        reliability += n_k * (forecast_mean - outcome_mean) ** 2
        resolution += n_k * (outcome_mean - base_rate) ** 2

    reliability /= n
    resolution /= n
    uncertainty = base_rate * (1 - base_rate)

    return {
        'brier_score': brier_score(probabilities, outcomes),
        'reliability': reliability,  # Calibration error (lower = better)
        'resolution': resolution,     # Discrimination (higher = better)
        'uncertainty': uncertainty,    # Inherent uncertainty (fixed)
        'check': abs(reliability - resolution + uncertainty - brier_score(probabilities, outcomes))
    }
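
As a sanity check on the identity BS = Reliability - Resolution + Uncertainty, the sketch below works through a small toy dataset by hand. It groups forecasts by their exact value rather than by fixed-width bins (an assumption made here so that the identity holds exactly), and the data is invented purely for illustration.

```python
from collections import defaultdict

# Toy data (invented): three distinct forecast values, ten forecasts in total.
forecasts = [0.2, 0.2, 0.2, 0.2, 0.5, 0.5, 0.8, 0.8, 0.8, 0.8]
outcomes = [0, 0, 1, 0, 1, 0, 1, 1, 0, 1]

n = len(forecasts)
base_rate = sum(outcomes) / n
brier = sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / n

# Group outcomes by the forecast value that was issued.
groups = defaultdict(list)
for f, o in zip(forecasts, outcomes):
    groups[f].append(o)

reliability = sum(
    len(obs) * (f - sum(obs) / len(obs)) ** 2 for f, obs in groups.items()
) / n
resolution = sum(
    len(obs) * (sum(obs) / len(obs) - base_rate) ** 2 for obs in groups.values()
) / n
uncertainty = base_rate * (1 - base_rate)

residual = brier - (reliability - resolution + uncertainty)
print(f"BS={brier:.4f} REL={reliability:.4f} RES={resolution:.4f} UNC={uncertainty:.4f}")
# BS=0.2020 REL=0.0020 RES=0.0500 UNC=0.2500
```

Here the forecaster is nearly calibrated (reliability 0.002) and gains most of the improvement over the 0.25 baseline from resolution (0.05).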

Log Score (Logarithmic Scoring Rule)

def log_score(probabilities: list, outcomes: list) -> float:
    """
    Log Score = mean of log(forecast for actual outcome)

    More punishing of confident wrong predictions than Brier.
    Predicting 0.99 when outcome is 0: log(0.01) = -4.6
    Predicting 0.51 when outcome is 0: log(0.49) = -0.71

    Returns negative value (higher/less negative = better).
    """
    scores = []
    for prob, outcome in zip(probabilities, outcomes):
        prob = np.clip(prob, 0.001, 0.999)  # Avoid log(0)
        if outcome == 1:
            scores.append(np.log(prob))
        else:
            scores.append(np.log(1 - prob))
    return np.mean(scores)


def log_score_multiclass(probability_vectors: list, outcomes: list) -> float:
    """
    Log score for multi-class predictions.
    Each probability_vector assigns probabilities to all classes.
    """
    scores = []
    for probs, outcome in zip(probability_vectors, outcomes):
        prob_of_actual = probs[outcome]
        prob_of_actual = max(prob_of_actual, 0.001)
        scores.append(np.log(prob_of_actual))
    return np.mean(scores)
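
To illustrate how differently the two rules treat confident misses, the following sketch prints both penalties for an event that does not happen. It uses the negative log score so that larger means worse for both columns; the helper names are hypothetical and no clipping is applied.

```python
import math


def brier_penalty(p: float, outcome: int) -> float:
    """Squared error of a single forecast."""
    return (p - outcome) ** 2


def log_penalty(p: float, outcome: int) -> float:
    """Negative log score of a single forecast (larger = worse)."""
    prob_of_actual = p if outcome == 1 else 1 - p
    return -math.log(prob_of_actual)


# The event does NOT happen (outcome = 0); compare increasingly confident misses.
for p in (0.6, 0.9, 0.99, 0.999):
    print(f"forecast={p:<6} brier={brier_penalty(p, 0):.3f} log={log_penalty(p, 0):.3f}")
```

The Brier penalty is bounded at 1, so moving from 0.99 to 0.999 barely changes it, while the log penalty keeps growing without limit as the forecast approaches certainty.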

Calibration Curves

Building and Interpreting Calibration Plots

class CalibrationAnalyzer:
    """Comprehensive calibration analysis for probabilistic forecasts."""

    def __init__(self):
        self.forecasts = []  # (probability, outcome) pairs

    def add(self, probability: float, outcome: bool):
        self.forecasts.append((probability, int(outcome)))

    def add_batch(self, probabilities: list, outcomes: list):
        for p, o in zip(probabilities, outcomes):
            self.add(p, o)

    def calibration_curve(self, n_bins: int = 10) -> dict:
        """Generate calibration curve data."""
        probs = np.array([f[0] for f in self.forecasts])
        outs = np.array([f[1] for f in self.forecasts])

        bins = np.linspace(0, 1, n_bins + 1)
        curve = []

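        for i in range(n_bins):
            # Bins are half-open [lo, hi); the last bin is closed so p == 1.0 is counted
            if i == n_bins - 1:
                mask = (probs >= bins[i]) & (probs <= bins[i+1])
            else:
                mask = (probs >= bins[i]) & (probs < bins[i+1])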
            count = np.sum(mask)
            if count == 0:
                continue

            mean_predicted = np.mean(probs[mask])
            mean_actual = np.mean(outs[mask])

            curve.append({
                'bin_start': bins[i],
                'bin_end': bins[i+1],
                'mean_predicted': mean_predicted,
                'mean_actual': mean_actual,
                'count': int(count),
                'error': abs(mean_predicted - mean_actual),
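                # Overconfident = the forecast is more extreme than the observed
                # frequency (e.g. 90% stated vs 75% observed, or 10% stated vs 20% observed)
                'direction': ('overconfident'
                              if (mean_predicted - mean_actual) * (mean_predicted - 0.5) > 0
                              else 'underconfident')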
            })

        return {
            'curve': curve,
            'perfect_calibration_line': [(x, x) for x in np.linspace(0, 1, 11)]
        }

    def expected_calibration_error(self, n_bins: int = 10) -> float:
        """ECE: weighted average of bin calibration errors."""
        curve = self.calibration_curve(n_bins)['curve']
        total = sum(bin_data['count'] for bin_data in curve)

        ece = sum(
            bin_data['count'] / total * bin_data['error']
            for bin_data in curve
        )
        return ece

    def maximum_calibration_error(self, n_bins: int = 10) -> float:
        """MCE: worst bin calibration error."""
        curve = self.calibration_curve(n_bins)['curve']
        if not curve:
            return 0
        return max(bin_data['error'] for bin_data in curve)

    def overconfidence_score(self) -> float:
        """
        Measure systematic overconfidence.
        Positive = overconfident, Negative = underconfident.
        """
        weighted_error = 0
        total = 0

        for prob, outcome in self.forecasts:
            distance_from_50 = abs(prob - 0.5)
            expected_accuracy = max(prob, 1 - prob)
            actual_accuracy = int((prob > 0.5) == outcome)
            weighted_error += (expected_accuracy - actual_accuracy) * distance_from_50
            total += distance_from_50

        return weighted_error / total if total > 0 else 0

    def full_report(self) -> dict:
        """Generate comprehensive calibration report."""
        probs = [f[0] for f in self.forecasts]
        outs = [f[1] for f in self.forecasts]

        return {
            'n_forecasts': len(self.forecasts),
            'base_rate': np.mean(outs),
            'mean_forecast': np.mean(probs),
            'brier_score': brier_score(probs, outs),
            'brier_skill': brier_skill_score(probs, outs),
            'log_score': log_score(probs, outs),
            'ece': self.expected_calibration_error(),
            'mce': self.maximum_calibration_error(),
            'overconfidence': self.overconfidence_score(),
            'decomposition': brier_decomposition(probs, outs),
            'calibration_curve': self.calibration_curve(),
            'assessment': self._assess_quality()
        }

    def _assess_quality(self) -> str:
        """Rough rating from heuristic thresholds; Brier scores also depend on question difficulty."""
        ece = self.expected_calibration_error()
        probs = [f[0] for f in self.forecasts]
        outs = [f[1] for f in self.forecasts]
        bs = brier_score(probs, outs)

        if ece < 0.03 and bs < 0.15:
            return "Excellent (superforecaster level)"
        elif ece < 0.05 and bs < 0.20:
            return "Good (well-calibrated)"
        elif ece < 0.10:
            return "Fair (some calibration issues)"
        else:
            return "Poor (significant calibration problems)"

Overconfidence Detection and Debiasing

Common Biases in Forecasting

class BiasDetector:
    """Detect common forecasting biases."""

    def __init__(self, forecasts: list):
        """forecasts: list of (probability, outcome, metadata) tuples"""
        self.forecasts = forecasts

    def detect_overconfidence(self) -> dict:
        """
        The most common bias: predictions are too extreme.
        When you say 90%, it only happens 75% of the time.
        """
        extreme = [(p, o) for p, o, _ in self.forecasts if p > 0.8 or p < 0.2]
        if len(extreme) < 10:
            return {'insufficient_data': True}

        high_conf = [(p, o) for p, o in extreme if p > 0.8]
        low_conf = [(p, o) for p, o in extreme if p < 0.2]

        high_accuracy = np.mean([o for _, o in high_conf]) if high_conf else None
        low_accuracy = 1 - np.mean([o for _, o in low_conf]) if low_conf else None

        overconfident = False
        if high_accuracy is not None and high_accuracy < 0.75:
            overconfident = True
        if low_accuracy is not None and low_accuracy < 0.75:
            overconfident = True

        return {
            'is_overconfident': overconfident,
            'high_confidence_accuracy': high_accuracy,
            'low_confidence_accuracy': low_accuracy,
            'recommended_extremity_factor': self._compute_shrinkage()
        }

    def detect_anchoring(self) -> dict:
        """Detect if forecaster is anchoring to round numbers or priors."""
        probs = [p for p, _, _ in self.forecasts]

        # Check clustering at round numbers
        round_numbers = [0.10, 0.20, 0.25, 0.30, 0.40, 0.50, 0.60, 0.70, 0.75, 0.80, 0.90]
        near_round = sum(1 for p in probs if min(abs(p - r) for r in round_numbers) < 0.02)
        round_fraction = near_round / len(probs)

        return {
            'fraction_near_round_numbers': round_fraction,
            'likely_anchoring': round_fraction > 0.5,
            'unique_values': len(set(round(p, 2) for p in probs)),
            'recommendation': 'Use more granular probabilities' if round_fraction > 0.5 else 'Good granularity'
        }

    def detect_base_rate_neglect(self) -> dict:
        """Detect if forecaster ignores base rates."""
        base_rate = np.mean([o for _, o, _ in self.forecasts])
        mean_forecast = np.mean([p for p, _, _ in self.forecasts])

        # If the base rate is 10% but mean forecast is 30%, base rate neglect
        neglect_ratio = abs(mean_forecast - base_rate) / max(base_rate, 0.01)

        return {
            'base_rate': base_rate,
            'mean_forecast': mean_forecast,
            'divergence': abs(mean_forecast - base_rate),
            'likely_neglect': neglect_ratio > 0.5
        }

    def _compute_shrinkage(self) -> float:
        """Compute optimal shrinkage toward 50% to correct overconfidence."""
        from scipy.optimize import minimize_scalar

        probs = np.array([p for p, _, _ in self.forecasts])
        outs = np.array([o for _, o, _ in self.forecasts])

        def shrunk_brier(alpha):
            adjusted = 0.5 + alpha * (probs - 0.5)
            return np.mean((adjusted - outs) ** 2)

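        # Bounded to [0, 1]: this only searches for shrinkage, so it cannot flag underconfidence
        result = minimize_scalar(shrunk_brier, bounds=(0, 1), method='bounded')
        return result.x  # Optimal alpha: well below 1 suggests overconfidence; near 1 means little shrinkage is needed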

Debiasing Techniques

class Debiaser:
    """Apply debiasing corrections to raw forecasts."""

    @staticmethod
    def extremity_correction(probability: float, factor: float = 0.85) -> float:
        """
        Shrink probabilities toward 50% to correct overconfidence.
        factor < 1 reduces extremity (corrects overconfidence)
        factor > 1 increases extremity (corrects underconfidence)
        """
        return 0.5 + factor * (probability - 0.5)

    @staticmethod
    def log_odds_correction(probability: float, factor: float = 0.85) -> float:
        """
        Correction in log-odds space (more theoretically sound).
        """
        if probability <= 0.001 or probability >= 0.999:
            return probability

        log_odds = np.log(probability / (1 - probability))
        adjusted_log_odds = factor * log_odds
        return 1 / (1 + np.exp(-adjusted_log_odds))

    @staticmethod
    def recalibrate_with_platt_scaling(raw_probs: np.ndarray,
                                       outcomes: np.ndarray) -> tuple:
        """
        Platt scaling: fit a logistic regression on the log-odds of the raw
        probabilities to map them to calibrated probabilities.
        Returns (calibrate_fn, fitted_params).
        """
        from scipy.optimize import minimize

        eps = 1e-10

        def to_log_odds(p):
            # Clip so that p = 0 or p = 1 does not produce infinite log-odds
            p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
            return np.log(p / (1 - p))

        x = to_log_odds(raw_probs)
        y = np.asarray(outcomes, dtype=float)

        def nll(params):
            # Negative log-likelihood of the outcomes under the logistic map
            a, b = params
            calibrated = 1 / (1 + np.exp(-(a * x + b)))
            calibrated = np.clip(calibrated, eps, 1 - eps)
            return -np.mean(y * np.log(calibrated) + (1 - y) * np.log(1 - calibrated))

        result = minimize(nll, x0=[1.0, 0.0])
        a, b = result.x

        def calibrate(p):
            return 1 / (1 + np.exp(-(a * to_log_odds(p) + b)))

        return calibrate, {'a': a, 'b': b}

    @staticmethod
    def isotonic_recalibration(raw_probs: np.ndarray,
                                outcomes: np.ndarray):
        """
        Non-parametric recalibration using isotonic regression.
        More flexible than Platt scaling but needs more data.
        """
        from sklearn.isotonic import IsotonicRegression
        ir = IsotonicRegression(out_of_bounds='clip')
        ir.fit(raw_probs, outcomes)
        return ir.predict
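
The two shrinkage corrections above behave differently near the extremes. The sketch below re-implements both as standalone functions (the names `shrink_linear` and `shrink_log_odds` are illustrative stand-ins for `extremity_correction` and `log_odds_correction`) and compares them at the default factor of 0.85.

```python
import math


def shrink_linear(p: float, factor: float) -> float:
    """Shrink toward 50% in probability space."""
    return 0.5 + factor * (p - 0.5)


def shrink_log_odds(p: float, factor: float) -> float:
    """Shrink toward 50% in log-odds space (p must be strictly between 0 and 1)."""
    log_odds = math.log(p / (1 - p))
    return 1 / (1 + math.exp(-factor * log_odds))


for p in (0.6, 0.9, 0.99):
    linear = shrink_linear(p, 0.85)
    log_odds = shrink_log_odds(p, 0.85)
    print(f"raw={p:<5} linear={linear:.4f} log-odds={log_odds:.4f}")
```

With a factor of 0.85, the linear version can never output more than 0.925 or less than 0.075, no matter how strong the evidence. The log-odds version still lets a forecast approach 0 or 1, which is one reason it is usually preferred when near-certain forecasts are sometimes justified.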

Tracking Record Methodology

Building a Forecast Track Record

class ForecastTracker:
    """Track and analyze a forecaster's long-term record."""

    def __init__(self, forecaster_name: str):
        self.name = forecaster_name
        self.questions = {}
        self.resolved = []

    def record_forecast(self, question_id: str, question_text: str,
                        probability: float, timestamp: str,
                        category: str = 'general'):
        if question_id not in self.questions:
            self.questions[question_id] = {
                'text': question_text,
                'category': category,
                'forecasts': []
            }
        self.questions[question_id]['forecasts'].append({
            'probability': probability,
            'timestamp': timestamp
        })

    def resolve(self, question_id: str, outcome: bool):
        q = self.questions.get(question_id)
        if q:
            last_forecast = q['forecasts'][-1]['probability']
            self.resolved.append({
                'question_id': question_id,
                'text': q['text'],
                'category': q['category'],
                'final_probability': last_forecast,
                'outcome': outcome,
                'n_updates': len(q['forecasts'])
            })

    def performance_summary(self) -> dict:
        if not self.resolved:
            return {'error': 'No resolved questions'}

        probs = [r['final_probability'] for r in self.resolved]
        outs = [r['outcome'] for r in self.resolved]

        analyzer = CalibrationAnalyzer()
        analyzer.add_batch(probs, outs)

        # Category breakdown
        categories = {}
        for r in self.resolved:
            cat = r['category']
            if cat not in categories:
                categories[cat] = {'probs': [], 'outs': []}
            categories[cat]['probs'].append(r['final_probability'])
            categories[cat]['outs'].append(r['outcome'])

        cat_scores = {}
        for cat, data in categories.items():
            cat_scores[cat] = {
                'n': len(data['probs']),
                'brier': brier_score(data['probs'], data['outs']),
            }

        return {
            'forecaster': self.name,
            'total_resolved': len(self.resolved),
            'overall': analyzer.full_report(),
            'by_category': cat_scores,
            'trend': self._performance_trend()
        }

    def _performance_trend(self, window: int = 20) -> list:
        """Track performance over time to detect improvement or decline."""
        if len(self.resolved) < window:
            return []

        trend = []
        for i in range(window, len(self.resolved) + 1):
            chunk = self.resolved[i-window:i]
            probs = [r['final_probability'] for r in chunk]
            outs = [r['outcome'] for r in chunk]
            trend.append({
                'window_end': i,
                'brier_score': brier_score(probs, outs)
            })
        return trend

Superforecaster Training Principles

The Ten Commandments of Superforecasting (adapted from Tetlock)

1. TRIAGE: Focus on questions where effort improves accuracy
   - Skip questions that are too easy or too hard
   - Prioritize questions in the "Goldilocks zone" of difficulty

2. BREAK DOWN PROBLEMS: Fermi estimation approach
   - Decompose into sub-questions
   - Estimate each component separately

3. STRIKE THE RIGHT BALANCE (inside vs outside view)
   - Start with the base rate (outside view)
   - Adjust with case-specific factors (inside view)

4. DISTINGUISH AS MANY DEGREES OF UNCERTAINTY AS THE PROBLEM ALLOWS
   - Use granular probabilities (not just "likely/unlikely")
   - The difference between 60% and 65% matters

5. BALANCE UNDER- AND OVER-REACTION TO NEW EVIDENCE
   - Update, but do not overreact to single data points
   - Frequent small updates beat rare large updates

6. LOOK FOR CLASHING CAUSAL FORCES
   - Consider arguments for AND against
   - Actively seek disconfirming evidence

7. BALANCE PARSIMONY AND COMPLEXITY
   - Simple models as baseline
   - Add complexity only when it helps

8. BEWARE OF GROUPTHINK
   - Devil's advocate discipline
   - Value dissent

9. EMBRACE CONTINUOUS SELF-IMPROVEMENT
   - Track your record
   - Analyze where you went wrong
   - Update your process, not just your forecasts

10. BRING OUT YOUR INNER SUPERFORECASTER
    - Growth mindset about probabilistic reasoning
    - Practice, practice, practice
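
Commandments 3 and 5 (start from the base rate, then make measured updates) can be sketched as a log-odds update. This is a minimal illustration, assuming each piece of evidence can be summarized by a likelihood ratio and that the pieces are independent; the function name and the numbers are hypothetical.

```python
import math


def update_with_evidence(base_rate: float, likelihood_ratios: list) -> float:
    """
    Start from the base rate (outside view) and apply one likelihood ratio
    per piece of case-specific evidence (inside view).

    A likelihood ratio above 1 favors the event; below 1 counts against it.
    Assumes the pieces of evidence are independent of each other.
    """
    log_odds = math.log(base_rate / (1 - base_rate))
    for lr in likelihood_ratios:
        log_odds += math.log(lr)
    return 1 / (1 + math.exp(-log_odds))


# Hypothetical case: 20% base rate, one supportive signal (LR = 2.0)
# and one mildly contrary signal (LR = 0.8).
posterior = update_with_evidence(0.20, [2.0, 0.8])
print(f"{posterior:.3f}")
```

Working in log-odds keeps each update additive, which makes it easy to see how much any single data point moved the forecast and to notice when one signal is being over-weighted.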

Deliberate Practice for Calibration

class CalibrationTraining:
    """Exercises for improving forecast calibration."""

    def __init__(self):
        self.exercises = []
        self.performance_history = []

    def trivia_calibration_exercise(self, questions: list) -> dict:
        """
        Classic calibration exercise:
        For each question, provide a 90% confidence interval.
        A well-calibrated person gets 90% of intervals correct.
        """
        correct = 0
        for q in questions:
            # q = {'question': str, 'answer': float, 'user_low': float, 'user_high': float}
            if q['user_low'] <= q['answer'] <= q['user_high']:
                correct += 1

        accuracy = correct / len(questions) if questions else 0

        return {
            'intended_coverage': 0.90,
            'actual_coverage': accuracy,
            'overconfident': accuracy < 0.85,
            'underconfident': accuracy > 0.95,
            'calibration_gap': abs(accuracy - 0.90),
            'recommendation': (
                'Widen your intervals' if accuracy < 0.80
                else 'Slightly widen intervals' if accuracy < 0.85
                else 'Well calibrated' if accuracy <= 0.95
                else 'Narrow your intervals (underconfident)'
            )
        }

    def probability_quiz(self, statements: list) -> dict:
        """
        Assign probabilities to statements, then check calibration.
        statements: list of {'text': str, 'user_prob': float, 'truth': bool}
        """
        probs = [s['user_prob'] for s in statements]
        truths = [s['truth'] for s in statements]

        analyzer = CalibrationAnalyzer()
        analyzer.add_batch(probs, truths)

        return analyzer.full_report()

Key Takeaways

  1. Calibration is the alignment between stated probabilities and actual frequencies; it is the foundational measure of forecast quality
  2. The Brier score decomposes into reliability (calibration), resolution (discrimination), and uncertainty (base rate); work on lowering reliability error and raising resolution, since uncertainty is fixed by the data
  3. Log scoring punishes confident wrong predictions more severely than Brier; use it when you want to penalize overconfidence
  4. Most forecasters are overconfident: their 90% predictions often come true only 70-80% of the time; shrinking forecasts toward 50% usually helps
  5. Expected Calibration Error (ECE) provides a single number summarizing calibration quality; below 0.05 is good, below 0.03 is excellent
  6. Platt scaling and isotonic regression can recalibrate model outputs using held-out data
  7. Track records with at least 50-100 resolved forecasts are needed for reliable calibration assessment
  8. Superforecaster training combines outside view (base rates), granular probabilities, frequent small updates, and deliberate calibration practice
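
The sample-size point in takeaway 7 follows from the binomial standard error of an observed frequency. The sketch below assumes a perfectly calibrated forecaster whose 70% forecasts really do come true 70% of the time, and shows how noisy the observed hit rate in that bin still is at different sample sizes. The helper name is illustrative, and the 1.96 multiplier relies on the normal approximation.

```python
import math


def bin_standard_error(true_rate: float, n: int) -> float:
    """Standard error of the observed frequency in a bin with n resolved forecasts."""
    return math.sqrt(true_rate * (1 - true_rate) / n)


for n in (10, 50, 100, 500):
    se = bin_standard_error(0.7, n)
    print(f"n={n:<4} standard error={se:.3f}  approx. 95% band=+/-{1.96 * se:.3f}")
```

With only 10 forecasts in a bin, the observed rate can easily land anywhere from roughly 40% to 100% by chance alone, so an apparent calibration gap of 10 points says very little. Note that these counts are per bin, so a 10-bin calibration curve needs correspondingly more resolved forecasts overall.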
