Skip to main content
Autonomous AgentsPrediction818 lines

Prediction Evaluation Metrics

Quick Summary14 lines
Measuring forecast quality is as important as making forecasts. Without rigorous evaluation, you cannot distinguish skill from luck, identify systematic biases, or compare forecasting methods. This skill covers proper scoring rules, the resolution-reliability-sharpness decomposition, ROC curves for binary predictions, CRPS for probabilistic forecasts, methods for comparing forecasters, and tournament design for forecasting competitions.

## Key Points

1. Only use proper scoring rules (Brier, log, spherical) for evaluation; improper rules like accuracy/zero-one incentivize dishonest reporting
2. The Brier score decomposes into reliability (calibration), resolution (discrimination), and uncertainty: improve the first two, the third is fixed
3. Log scoring is more appropriate than Brier when you want to heavily penalize confident wrong predictions (e.g., tail risk applications)
4. ROC/AUC measures discrimination ability independent of calibration; use alongside calibration metrics for complete assessment
5. CRPS is the standard for evaluating full probability distributions; PIT histograms diagnose specific calibration failures (overconfidence, bias)
6. When comparing forecasters, use paired tests (not just average scores) and bootstrap confidence intervals to assess statistical significance
7. Tournament design requires proper scoring rules, participation minimums, category diversity, and anti-gaming measures
8. A complete evaluation pipeline combines point metrics, decomposition, discrimination analysis, probabilistic calibration, and overall grading
skilldb get prediction-skills/prediction-evaluation-metricsFull skill: 818 lines
Paste into your CLAUDE.md or agent config

Prediction Evaluation Metrics

Overview

Measuring forecast quality is as important as making forecasts. Without rigorous evaluation, you cannot distinguish skill from luck, identify systematic biases, or compare forecasting methods. This skill covers proper scoring rules, the resolution-reliability-sharpness decomposition, ROC curves for binary predictions, CRPS for probabilistic forecasts, methods for comparing forecasters, and tournament design for forecasting competitions.

Proper Scoring Rules

What Makes a Scoring Rule "Proper"

A scoring rule S(p, y) assigns a score to a probability forecast p given outcome y. A scoring rule is strictly proper if the forecaster's expected score is uniquely maximized by reporting their true belief. This means honest reporting is the dominant strategy.

import numpy as np
from typing import Callable

class ScoringRules:
    """Collection of proper scoring rules for probabilistic forecasts."""

    @staticmethod
    def brier_score(p: float, y: int) -> float:
        """
        Brier Score (1950): (p - y)²

        Range: 0 (perfect) to 1 (worst)
        Reference: 0.25 (always predicting 50%)

        Properties:
        - Quadratic loss
        - Proper (honest reporting is optimal)
        - Symmetric around 50%
        - Bounded [0, 1]
        """
        return (p - y) ** 2

    @staticmethod
    def log_score(p: float, y: int) -> float:
        """
        Logarithmic Score: log(p) if y=1, log(1-p) if y=0

        Range: -inf (worst) to 0 (perfect)
        Reference: -ln(2) ≈ -0.693 (always predicting 50%)

        Properties:
        - More punishing of confident wrong predictions
        - Strictly proper
        - Local: only depends on probability assigned to actual outcome
        - Unbounded below (catastrophic for 0% or 100% predictions that are wrong)
        """
        p = np.clip(p, 1e-15, 1 - 1e-15)
        if y == 1:
            return np.log(p)
        return np.log(1 - p)

    @staticmethod
    def spherical_score(p: float, y: int) -> float:
        """
        Spherical Score: p_y / sqrt(sum(p_i²))

        For binary: p / sqrt(p² + (1-p)²) if y=1

        Properties:
        - Proper
        - Bounded [0, 1]
        - Less sensitive to extreme probabilities than log score
        """
        norm = np.sqrt(p**2 + (1 - p)**2)
        if y == 1:
            return p / norm
        return (1 - p) / norm

    @staticmethod
    def zero_one_score(p: float, y: int, threshold: float = 0.5) -> float:
        """
        Zero-One (Classification) Score: 1 if correct side, 0 if wrong.

        WARNING: This is NOT a proper scoring rule!
        A forecaster maximizes this by always predicting 0% or 100%.
        Included for comparison only.
        """
        predicted = 1 if p >= threshold else 0
        return 1 if predicted == y else 0


def compare_scoring_rules(p: float, y: int) -> dict:
    """Compare all scoring rules for a single prediction."""
    sr = ScoringRules()
    return {
        'brier': sr.brier_score(p, y),
        'log': sr.log_score(p, y),
        'spherical': sr.spherical_score(p, y),
        'zero_one': sr.zero_one_score(p, y),
        'note': 'Lower Brier is better. Higher log/spherical is better.'
    }


# Demonstrate incentive compatibility
def demonstrate_proper_scoring():
    """Show that honest reporting maximizes expected score under proper rules."""
    true_prob = 0.7  # Forecaster believes 70% chance of YES

    reported_probs = np.linspace(0.01, 0.99, 50)
    brier_expected = []
    log_expected = []
    zero_one_expected = []

    for reported in reported_probs:
        # Expected Brier = P(Y=1)*(reported-1)² + P(Y=0)*(reported-0)²
        expected_brier = true_prob * (reported - 1)**2 + (1 - true_prob) * reported**2
        brier_expected.append(expected_brier)

        # Expected Log
        expected_log = true_prob * np.log(max(reported, 1e-10)) + (1 - true_prob) * np.log(max(1 - reported, 1e-10))
        log_expected.append(expected_log)

        # Expected Zero-One (NOT proper)
        if reported >= 0.5:
            expected_01 = true_prob
        else:
            expected_01 = 1 - true_prob
        zero_one_expected.append(expected_01)

    best_brier = reported_probs[np.argmin(brier_expected)]
    best_log = reported_probs[np.argmax(log_expected)]
    best_01 = reported_probs[np.argmax(zero_one_expected)]

    return {
        'true_probability': true_prob,
        'brier_optimal_report': best_brier,      # Should be ~0.7
        'log_optimal_report': best_log,            # Should be ~0.7
        'zero_one_optimal_report': best_01,        # Will be 1.0 (dishonest!)
        'brier_is_proper': abs(best_brier - true_prob) < 0.05,
        'log_is_proper': abs(best_log - true_prob) < 0.05,
        'zero_one_is_proper': abs(best_01 - true_prob) < 0.05  # False
    }

Resolution, Reliability, and Sharpness

The Three Components of Forecast Quality

class ForecastQualityDecomposition:
    """
    Decompose forecast quality into three independent components.

    Resolution: Ability to separate events that happen from those that don't.
                High resolution = forecasts vary appropriately with outcomes.

    Reliability (Calibration): Alignment between forecast probabilities and
                               observed frequencies. Reliable = well-calibrated.

    Sharpness: How extreme/decisive the forecasts are.
               Sharp forecasts are far from the base rate.
               (Sharpness is a property of the forecasts alone, not outcomes.)
    """

    def __init__(self, forecasts: np.ndarray, outcomes: np.ndarray, n_bins: int = 10):
        self.forecasts = forecasts
        self.outcomes = outcomes.astype(float)
        self.n_bins = n_bins
        self.base_rate = np.mean(outcomes)

    def reliability(self) -> float:
        """
        Reliability = weighted average of (forecast_bin_mean - outcome_bin_mean)²
        Lower is better. 0 = perfectly calibrated.
        """
        bins = np.linspace(0, 1, self.n_bins + 1)
        n = len(self.forecasts)
        rel = 0

        for i in range(self.n_bins):
            mask = (self.forecasts >= bins[i]) & (self.forecasts < bins[i+1])
            n_k = np.sum(mask)
            if n_k == 0:
                continue
            forecast_mean = np.mean(self.forecasts[mask])
            outcome_mean = np.mean(self.outcomes[mask])
            rel += n_k * (forecast_mean - outcome_mean) ** 2

        return rel / n

    def resolution(self) -> float:
        """
        Resolution = weighted average of (outcome_bin_mean - base_rate)²
        Higher is better. 0 = no discrimination ability.
        """
        bins = np.linspace(0, 1, self.n_bins + 1)
        n = len(self.forecasts)
        res = 0

        for i in range(self.n_bins):
            mask = (self.forecasts >= bins[i]) & (self.forecasts < bins[i+1])
            n_k = np.sum(mask)
            if n_k == 0:
                continue
            outcome_mean = np.mean(self.outcomes[mask])
            res += n_k * (outcome_mean - self.base_rate) ** 2

        return res / n

    def uncertainty(self) -> float:
        """
        Uncertainty = base_rate * (1 - base_rate)
        Property of the dataset, not the forecaster.
        """
        return self.base_rate * (1 - self.base_rate)

    def sharpness(self) -> float:
        """
        Sharpness = variance of forecasts.
        Higher = more decisive. Not necessarily better (can be overconfident).
        """
        return np.var(self.forecasts)

    def full_decomposition(self) -> dict:
        """
        Brier Score = Reliability - Resolution + Uncertainty

        Good forecast: low reliability (well calibrated),
                       high resolution (discriminates well).
        """
        rel = self.reliability()
        res = self.resolution()
        unc = self.uncertainty()
        bs = np.mean((self.forecasts - self.outcomes) ** 2)

        return {
            'brier_score': bs,
            'reliability': rel,
            'resolution': res,
            'uncertainty': unc,
            'check': abs(bs - (rel - res + unc)),
            'sharpness': self.sharpness(),
            'brier_skill_score': 1 - bs / unc if unc > 0 else 0,
            'interpretation': {
                'calibration': 'good' if rel < 0.01 else 'fair' if rel < 0.03 else 'poor',
                'discrimination': 'excellent' if res > 0.05 else 'good' if res > 0.02 else 'poor',
                'decisiveness': 'sharp' if self.sharpness() > 0.04 else 'moderate' if self.sharpness() > 0.02 else 'hedging'
            }
        }

ROC Curves for Binary Predictions

class ROCAnalysis:
    """ROC curve analysis for binary prediction evaluation."""

    def __init__(self, probabilities: np.ndarray, outcomes: np.ndarray):
        self.probs = probabilities
        self.outcomes = outcomes.astype(int)

    def compute_roc(self, n_thresholds: int = 100) -> dict:
        """Compute ROC curve points."""
        thresholds = np.linspace(0, 1, n_thresholds)
        tpr_list = []  # True Positive Rate (sensitivity)
        fpr_list = []  # False Positive Rate (1 - specificity)

        for threshold in thresholds:
            predicted_positive = self.probs >= threshold
            tp = np.sum(predicted_positive & (self.outcomes == 1))
            fp = np.sum(predicted_positive & (self.outcomes == 0))
            fn = np.sum(~predicted_positive & (self.outcomes == 1))
            tn = np.sum(~predicted_positive & (self.outcomes == 0))

            tpr = tp / max(tp + fn, 1)
            fpr = fp / max(fp + tn, 1)

            tpr_list.append(tpr)
            fpr_list.append(fpr)

        return {
            'fpr': fpr_list,
            'tpr': tpr_list,
            'thresholds': thresholds.tolist(),
            'auc': self.compute_auc(fpr_list, tpr_list)
        }

    def compute_auc(self, fpr: list, tpr: list) -> float:
        """Area Under the ROC Curve using trapezoidal rule."""
        sorted_pairs = sorted(zip(fpr, tpr))
        fpr_sorted = [p[0] for p in sorted_pairs]
        tpr_sorted = [p[1] for p in sorted_pairs]

        auc = 0
        for i in range(1, len(fpr_sorted)):
            auc += (fpr_sorted[i] - fpr_sorted[i-1]) * (tpr_sorted[i] + tpr_sorted[i-1]) / 2

        return auc

    def optimal_threshold(self) -> dict:
        """Find the threshold that maximizes Youden's J statistic."""
        roc = self.compute_roc()
        j_stats = [tpr - fpr for tpr, fpr in zip(roc['tpr'], roc['fpr'])]
        best_idx = np.argmax(j_stats)

        return {
            'optimal_threshold': roc['thresholds'][best_idx],
            'sensitivity': roc['tpr'][best_idx],
            'specificity': 1 - roc['fpr'][best_idx],
            'youden_j': j_stats[best_idx]
        }

    def interpretation(self) -> dict:
        """Interpret the AUC value."""
        auc = self.compute_roc()['auc']
        if auc > 0.9:
            quality = 'Excellent discrimination'
        elif auc > 0.8:
            quality = 'Good discrimination'
        elif auc > 0.7:
            quality = 'Fair discrimination'
        elif auc > 0.6:
            quality = 'Poor discrimination'
        else:
            quality = 'No better than random'

        return {'auc': auc, 'quality': quality}

CRPS for Probabilistic Forecasts

class CRPSCalculator:
    """
    Continuous Ranked Probability Score (CRPS).
    The standard metric for evaluating full probability distributions.

    CRPS generalizes the Brier Score to continuous outcomes:
    CRPS = integral of (F(x) - 1{y <= x})² dx

    Where F(x) is the forecast CDF and y is the observation.
    """

    @staticmethod
    def crps_empirical(ensemble: np.ndarray, observation: float) -> float:
        """
        CRPS from ensemble forecasts.
        Uses the identity: CRPS = E|X-y| - 0.5*E|X-X'|
        where X, X' are independent draws from the forecast distribution.
        """
        n = len(ensemble)

        # E|X - y|
        abs_diff = np.mean(np.abs(ensemble - observation))

        # E|X - X'| (expected absolute difference between two random draws)
        sorted_ens = np.sort(ensemble)
        # Efficient computation
        weights = 2 * np.arange(n) - n + 1
        spread = np.sum(weights * sorted_ens) / (n * n)

        return abs_diff - 0.5 * abs(spread)

    @staticmethod
    def crps_gaussian(mu: float, sigma: float, observation: float) -> float:
        """
        CRPS for Gaussian forecast N(mu, sigma²).
        Closed-form solution available.
        """
        from scipy.stats import norm

        z = (observation - mu) / sigma
        crps = sigma * (z * (2 * norm.cdf(z) - 1) + 2 * norm.pdf(z) - 1 / np.sqrt(np.pi))
        return crps

    @staticmethod
    def crps_skill_score(crps_forecast: float, crps_reference: float) -> float:
        """
        CRPS Skill Score: improvement over a reference forecast.
        1 = perfect, 0 = same as reference, negative = worse than reference.
        """
        return 1 - crps_forecast / crps_reference if crps_reference > 0 else 0

    @staticmethod
    def energy_score(ensemble_forecasts: np.ndarray,
                     observations: np.ndarray) -> float:
        """
        Energy Score: multivariate generalization of CRPS.
        Useful for evaluating joint distributions (e.g., multi-site weather).
        """
        n_members = ensemble_forecasts.shape[0]
        n_obs = len(observations)

        # E||X - y|| term
        term1 = np.mean([
            np.linalg.norm(ensemble_forecasts[i] - observations)
            for i in range(n_members)
        ])

        # E||X - X'|| term
        term2 = 0
        count = 0
        for i in range(n_members):
            for j in range(i + 1, n_members):
                term2 += np.linalg.norm(ensemble_forecasts[i] - ensemble_forecasts[j])
                count += 1
        term2 = term2 / max(count, 1)

        return term1 - 0.5 * term2

    @staticmethod
    def pit_histogram(ensemble_forecasts: list, observations: list,
                      n_bins: int = 10) -> dict:
        """
        Probability Integral Transform (PIT) histogram.
        For well-calibrated probabilistic forecasts, PIT values
        should be uniformly distributed.

        PIT value = percentile rank of observation within ensemble.
        """
        pit_values = []
        for ensemble, obs in zip(ensemble_forecasts, observations):
            rank = np.sum(np.array(ensemble) <= obs) / len(ensemble)
            pit_values.append(rank)

        pit_values = np.array(pit_values)
        hist, bin_edges = np.histogram(pit_values, bins=n_bins, range=(0, 1))
        expected = len(pit_values) / n_bins

        # Chi-squared test for uniformity
        from scipy.stats import chi2
        chi2_stat = np.sum((hist - expected)**2 / expected)
        p_value = 1 - chi2.cdf(chi2_stat, n_bins - 1)

        return {
            'histogram': hist.tolist(),
            'bin_edges': bin_edges.tolist(),
            'chi2_statistic': chi2_stat,
            'p_value': p_value,
            'is_uniform': p_value > 0.05,
            'interpretation': (
                'Well-calibrated (uniform PIT)' if p_value > 0.05
                else 'Dome-shaped: overconfident' if hist[len(hist)//2] > expected * 1.3
                else 'U-shaped: underconfident' if hist[0] + hist[-1] > expected * 2.5
                else 'Skewed: biased forecast'
            )
        }

Comparing Forecasters

class ForecasterComparison:
    """Methods for rigorously comparing forecasting performance."""

    def __init__(self):
        self.forecasters = {}

    def add_forecaster(self, name: str, probabilities: list, outcomes: list):
        self.forecasters[name] = {
            'probabilities': np.array(probabilities),
            'outcomes': np.array(outcomes, dtype=float)
        }

    def head_to_head(self, forecaster_a: str, forecaster_b: str) -> dict:
        """Compare two forecasters on shared questions."""
        a = self.forecasters[forecaster_a]
        b = self.forecasters[forecaster_b]

        # Assume same outcomes
        bs_a = np.mean((a['probabilities'] - a['outcomes']) ** 2)
        bs_b = np.mean((b['probabilities'] - b['outcomes']) ** 2)

        # Per-question comparison
        a_better = 0
        b_better = 0
        tie = 0
        for i in range(len(a['outcomes'])):
            score_a = (a['probabilities'][i] - a['outcomes'][i]) ** 2
            score_b = (b['probabilities'][i] - b['outcomes'][i]) ** 2
            if score_a < score_b - 0.001:
                a_better += 1
            elif score_b < score_a - 0.001:
                b_better += 1
            else:
                tie += 1

        # Statistical significance (paired t-test)
        differences = (a['probabilities'] - a['outcomes'])**2 - (b['probabilities'] - b['outcomes'])**2
        from scipy.stats import ttest_rel
        t_stat, p_value = ttest_rel(
            (a['probabilities'] - a['outcomes'])**2,
            (b['probabilities'] - b['outcomes'])**2
        )

        winner = forecaster_a if bs_a < bs_b else forecaster_b

        return {
            f'{forecaster_a}_brier': bs_a,
            f'{forecaster_b}_brier': bs_b,
            'difference': bs_a - bs_b,
            'winner': winner,
            'margin': abs(bs_a - bs_b),
            'statistically_significant': p_value < 0.05,
            'p_value': p_value,
            f'{forecaster_a}_better_on': a_better,
            f'{forecaster_b}_better_on': b_better,
            'ties': tie
        }

    def leaderboard(self, metric: str = 'brier') -> list:
        """Rank all forecasters by chosen metric."""
        results = []

        for name, data in self.forecasters.items():
            p = data['probabilities']
            o = data['outcomes']

            scores = {
                'brier': np.mean((p - o) ** 2),
                'log': np.mean([
                    np.log(max(pi, 1e-10)) if oi == 1 else np.log(max(1-pi, 1e-10))
                    for pi, oi in zip(p, o)
                ]),
                'calibration_error': self._ece(p, o),
            }

            results.append({
                'name': name,
                'n_forecasts': len(p),
                **scores
            })

        # Sort (lower Brier is better, higher log is better)
        if metric == 'brier':
            results.sort(key=lambda x: x['brier'])
        elif metric == 'log':
            results.sort(key=lambda x: -x['log'])

        # Add ranks
        for i, r in enumerate(results):
            r['rank'] = i + 1

        return results

    def _ece(self, probs, outcomes, n_bins=10):
        bins = np.linspace(0, 1, n_bins + 1)
        ece = 0
        total = len(probs)
        for i in range(n_bins):
            mask = (probs >= bins[i]) & (probs < bins[i+1])
            n_k = np.sum(mask)
            if n_k > 0:
                ece += n_k / total * abs(np.mean(probs[mask]) - np.mean(outcomes[mask]))
        return ece

    def bootstrap_confidence_intervals(self, forecaster_name: str,
                                        n_bootstrap: int = 1000) -> dict:
        """Compute confidence intervals for a forecaster's Brier score."""
        data = self.forecasters[forecaster_name]
        n = len(data['probabilities'])

        bootstrap_briers = []
        for _ in range(n_bootstrap):
            indices = np.random.choice(n, n, replace=True)
            p_boot = data['probabilities'][indices]
            o_boot = data['outcomes'][indices]
            bootstrap_briers.append(np.mean((p_boot - o_boot) ** 2))

        return {
            'point_estimate': np.mean((data['probabilities'] - data['outcomes']) ** 2),
            'ci_95': (np.percentile(bootstrap_briers, 2.5),
                      np.percentile(bootstrap_briers, 97.5)),
            'ci_90': (np.percentile(bootstrap_briers, 5),
                      np.percentile(bootstrap_briers, 95)),
            'bootstrap_std': np.std(bootstrap_briers)
        }

Tournament Design

Designing Forecasting Competitions

class ForecastingTournament:
    """
    Design and run a forecasting tournament.
    Based on practices from IARPA's ACE program, Good Judgment Project,
    and Metaculus.
    """

    def __init__(self, name: str, scoring_rule: str = 'brier',
                 n_questions: int = 100):
        self.name = name
        self.scoring_rule = scoring_rule
        self.n_questions = n_questions
        self.questions = []
        self.participants = {}
        self.resolved = []

    def add_question(self, question: str, resolution_criteria: str,
                     resolution_date: str, category: str = 'general',
                     difficulty: str = 'medium'):
        """Add a question to the tournament."""
        self.questions.append({
            'id': len(self.questions),
            'question': question,
            'resolution_criteria': resolution_criteria,
            'resolution_date': resolution_date,
            'category': category,
            'difficulty': difficulty,
            'resolved': False,
            'outcome': None
        })

    def submit_forecast(self, participant_id: str, question_id: int,
                         probability: float):
        """Record a participant's forecast."""
        if participant_id not in self.participants:
            self.participants[participant_id] = {
                'forecasts': {},
                'scores': []
            }

        self.participants[participant_id]['forecasts'][question_id] = {
            'probability': probability,
            'timestamp': 'now'
        }

    def resolve_question(self, question_id: int, outcome: bool):
        """Resolve a question and score all forecasters."""
        question = self.questions[question_id]
        question['resolved'] = True
        question['outcome'] = outcome

        for pid, data in self.participants.items():
            if question_id in data['forecasts']:
                p = data['forecasts'][question_id]['probability']
                score = self._compute_score(p, int(outcome))
                data['scores'].append({
                    'question_id': question_id,
                    'score': score,
                    'probability': p,
                    'outcome': outcome
                })

    def _compute_score(self, probability: float, outcome: int) -> float:
        sr = ScoringRules()
        if self.scoring_rule == 'brier':
            return sr.brier_score(probability, outcome)
        elif self.scoring_rule == 'log':
            return sr.log_score(probability, outcome)
        return sr.brier_score(probability, outcome)

    def standings(self) -> list:
        """Current tournament standings."""
        standings = []

        for pid, data in self.participants.items():
            if not data['scores']:
                continue

            scores = [s['score'] for s in data['scores']]

            if self.scoring_rule == 'brier':
                avg_score = np.mean(scores)  # Lower is better
                sort_key = avg_score
            else:
                avg_score = np.mean(scores)  # Higher is better for log
                sort_key = -avg_score

            standings.append({
                'participant': pid,
                'avg_score': avg_score,
                'n_forecasts': len(scores),
                'participation_rate': len(scores) / max(
                    sum(1 for q in self.questions if q['resolved']), 1
                ),
                'consistency': np.std(scores)
            })

        standings.sort(key=lambda x: x['avg_score'] if self.scoring_rule == 'brier' else -x['avg_score'])

        for i, s in enumerate(standings):
            s['rank'] = i + 1

        return standings

    def question_design_guidelines(self) -> dict:
        """Best practices for tournament question design."""
        return {
            'resolution_criteria': {
                'rule': 'Must be objectively verifiable',
                'examples': [
                    'GOOD: "Will X exceed 100 by Dec 31 as reported by Source Y?"',
                    'BAD: "Will X be successful?" (subjective)'
                ]
            },
            'time_horizon': {
                'rule': 'Mix of short-term (1-3 months) and long-term (6-12 months)',
                'rationale': 'Short-term provides frequent feedback; long-term tests deeper reasoning'
            },
            'difficulty_balance': {
                'rule': 'Include easy (~75/25), medium (~60/40), and hard (~55/45) questions',
                'rationale': 'Easy questions calibrate; hard questions separate forecasters'
            },
            'base_rate_awareness': {
                'rule': 'Avoid questions where the base rate gives the answer',
                'example': '"Will the sun rise tomorrow?" is not informative'
            },
            'category_diversity': {
                'rule': 'Span multiple domains to test breadth',
                'domains': ['geopolitics', 'economics', 'technology', 'science', 'social']
            },
            'update_frequency': {
                'rule': 'Allow continuous updating',
                'rationale': 'Tests ability to incorporate new information'
            }
        }

    def anti_gaming_measures(self) -> dict:
        """Prevent gaming of the tournament scoring system."""
        return {
            'participation_minimum': {
                'description': 'Require forecasts on at least 50% of questions',
                'rationale': 'Prevents cherry-picking easy questions'
            },
            'proper_scoring': {
                'description': 'Use strictly proper scoring rules only',
                'rationale': 'Honest reporting is the dominant strategy'
            },
            'continuous_scoring': {
                'description': 'Score the most recent forecast, not just the final one',
                'rationale': 'Rewards early accuracy and continuous engagement'
            },
            'time_weighted_scoring': {
                'description': 'Weight earlier forecasts more (harder to be accurate early)',
                'rationale': 'Prevents last-minute forecast copying'
            },
            'category_requirements': {
                'description': 'Require minimum forecasts per category',
                'rationale': 'Prevents specialization gaming'
            }
        }

Comprehensive Evaluation Pipeline

class ForecastEvaluationPipeline:
    """End-to-end evaluation pipeline for forecast systems."""

    def __init__(self):
        self.evaluations = []

    def evaluate(self, forecasts: np.ndarray, outcomes: np.ndarray,
                 ensemble_forecasts: np.ndarray = None) -> dict:
        """Run complete evaluation suite."""
        results = {}

        # 1. Point forecast metrics
        results['brier_score'] = np.mean((forecasts - outcomes) ** 2)
        results['log_score'] = np.mean([
            ScoringRules.log_score(p, int(o)) for p, o in zip(forecasts, outcomes)
        ])

        # 2. Decomposition
        decomp = ForecastQualityDecomposition(forecasts, outcomes)
        results['decomposition'] = decomp.full_decomposition()

        # 3. ROC analysis
        roc = ROCAnalysis(forecasts, outcomes)
        results['auc'] = roc.compute_roc()['auc']
        results['optimal_threshold'] = roc.optimal_threshold()

        # 4. CRPS (if ensemble available)
        if ensemble_forecasts is not None:
            crps_values = []
            for i in range(len(outcomes)):
                crps = CRPSCalculator.crps_empirical(ensemble_forecasts[i], outcomes[i])
                crps_values.append(crps)
            results['mean_crps'] = np.mean(crps_values)

            # PIT analysis
            pit = CRPSCalculator.pit_histogram(
                ensemble_forecasts.tolist(),
                outcomes.tolist()
            )
            results['pit_analysis'] = pit

        # 5. Overall assessment
        bs = results['brier_score']
        auc = results['auc']
        cal = results['decomposition']['interpretation']['calibration']

        if bs < 0.15 and auc > 0.8 and cal == 'good':
            results['overall_grade'] = 'A — Excellent forecasting'
        elif bs < 0.20 and auc > 0.7:
            results['overall_grade'] = 'B — Good forecasting'
        elif bs < 0.25 and auc > 0.6:
            results['overall_grade'] = 'C — Fair forecasting'
        else:
            results['overall_grade'] = 'D — Needs improvement'

        return results

Key Takeaways

  1. Only use proper scoring rules (Brier, log, spherical) for evaluation; improper rules like accuracy/zero-one incentivize dishonest reporting
  2. The Brier score decomposes into reliability (calibration), resolution (discrimination), and uncertainty: improve the first two, the third is fixed
  3. Log scoring is more appropriate than Brier when you want to heavily penalize confident wrong predictions (e.g., tail risk applications)
  4. ROC/AUC measures discrimination ability independent of calibration; use alongside calibration metrics for complete assessment
  5. CRPS is the standard for evaluating full probability distributions; PIT histograms diagnose specific calibration failures (overconfidence, bias)
  6. When comparing forecasters, use paired tests (not just average scores) and bootstrap confidence intervals to assess statistical significance
  7. Tournament design requires proper scoring rules, participation minimums, category diversity, and anti-gaming measures
  8. A complete evaluation pipeline combines point metrics, decomposition, discrimination analysis, probabilistic calibration, and overall grading

Install this skill directly: skilldb add prediction-skills

Get CLI access →