
Ensemble Prediction Methods

Quick Summary
Ensemble prediction combines multiple forecasting models or forecasters to produce predictions that are more accurate and better calibrated than any individual model. The principle is deceptively simple — averaging diverse, independent predictions reduces error — but the practice involves sophisticated techniques for selecting models, weighting contributions, and managing the diversity-accuracy tradeoff. From weather forecasting to superforecasting tournaments, ensembles consistently outperform individual predictions.

Key Points

1. Simple averaging of diverse predictions is remarkably effective and hard to beat consistently
2. The Diversity Prediction Theorem guarantees ensemble error is less than or equal to average individual error — diversity is free accuracy
3. Correlation between models determines the ceiling on ensemble improvement: uncorrelated models give the most benefit
4. Performance-weighted ensembles outperform equal-weight when you have reliable track records (at least 20-30 observations)
5. Stacking (meta-learning) is the most powerful method but requires careful cross-validation to prevent overfitting
6. Extremization (pushing ensemble probability away from 50%) corrects the systematic underconfidence of simple averages
7. Superforecasters naturally think in ensembles: they combine base rates, case-specific analysis, and devil's advocacy
8. The optimal ensemble size is usually 5-20 models; beyond that, marginal diversity drops and the improvement plateaus

The Mathematical Foundation

Why Ensembles Work: The Bias-Variance Decomposition

The expected error of any prediction can be decomposed:

Total Error = Bias² + Variance + Irreducible Noise

For an ensemble of M models:
  Ensemble Variance ≈ Individual Variance / M  (if models are independent)
  Ensemble Bias ≈ Average Individual Bias

Ensembles primarily reduce variance (the "shakiness" of predictions) while preserving the average bias. This is why diverse models matter: if all models make the same errors, averaging does not help.

import numpy as np

def demonstrate_ensemble_benefit(n_models: int, n_trials: int = 10000,
                                  individual_std: float = 1.0,
                                  correlation: float = 0.0) -> dict:
    """Show how ensemble size and correlation affect accuracy."""
    true_value = 5.0
    individual_mse = []
    ensemble_mse = []

    for _ in range(n_trials):
        # Generate correlated predictions
        cov_matrix = np.full((n_models, n_models), correlation * individual_std**2)
        np.fill_diagonal(cov_matrix, individual_std**2)
        predictions = np.random.multivariate_normal(
            np.full(n_models, true_value), cov_matrix
        )

        individual_mse.append((predictions[0] - true_value)**2)
        ensemble_mse.append((np.mean(predictions) - true_value)**2)

    return {
        'individual_rmse': np.sqrt(np.mean(individual_mse)),
        'ensemble_rmse': np.sqrt(np.mean(ensemble_mse)),
        'improvement': 1 - np.sqrt(np.mean(ensemble_mse)) / np.sqrt(np.mean(individual_mse)),
        'theoretical_improvement': 1 - np.sqrt(
            (1 + correlation * (n_models - 1)) / n_models
        )
    }

# Demonstration
for corr in [0.0, 0.3, 0.6, 0.9]:
    result = demonstrate_ensemble_benefit(10, correlation=corr)
    print(f"Correlation {corr}: {result['improvement']:.1%} improvement "
          f"(theory: {result['theoretical_improvement']:.1%})")

The Diversity Prediction Theorem

Ensemble Error = Average Individual Error - Diversity

Where Diversity = Average squared deviation of individual predictions from ensemble mean

This remarkable identity shows that, for squared error, the ensemble's error never exceeds the average individual error. Diversity is free accuracy.

def diversity_prediction_theorem(predictions: list, true_value: float) -> dict:
    """Demonstrate the diversity-prediction theorem."""
    predictions = np.array(predictions)
    ensemble = np.mean(predictions)

    # Individual errors
    individual_errors = (predictions - true_value) ** 2
    avg_individual_error = np.mean(individual_errors)

    # Ensemble error
    ensemble_error = (ensemble - true_value) ** 2

    # Diversity
    diversity = np.mean((predictions - ensemble) ** 2)

    # The theorem: ensemble_error = avg_individual_error - diversity
    theorem_check = abs(ensemble_error - (avg_individual_error - diversity))

    return {
        'avg_individual_error': avg_individual_error,
        'ensemble_error': ensemble_error,
        'diversity': diversity,
        'theorem_holds': theorem_check < 1e-10,
        'improvement_from_diversity': diversity / avg_individual_error
    }
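The identity is easy to verify by hand on a toy case (three hypothetical forecasters, made-up numbers):

```python
import numpy as np

# Three hypothetical forecasters estimate a quantity whose true value is 5.0
predictions = np.array([4.0, 5.5, 6.5])
true_value = 5.0

ensemble = predictions.mean()                                      # ~5.333
avg_individual_error = np.mean((predictions - true_value) ** 2)    # ~1.167
ensemble_error = (ensemble - true_value) ** 2                      # ~0.111
diversity = np.mean((predictions - ensemble) ** 2)                 # ~1.056

# The theorem: ensemble error = average individual error - diversity
assert np.isclose(ensemble_error, avg_individual_error - diversity)
```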

Simple Averaging Methods

Equal-Weight Averaging

The simplest and often surprisingly effective approach:

class SimpleEnsemble:
    """Equal-weight ensemble of multiple models."""

    def __init__(self):
        self.models = []
        self.model_names = []

    def add_model(self, name: str, model):
        self.models.append(model)
        self.model_names.append(name)

    def predict(self, X) -> dict:
        predictions = [model.predict(X) for model in self.models]
        predictions = np.array(predictions)

        return {
            'mean': np.mean(predictions, axis=0),
            'median': np.median(predictions, axis=0),
            'std': np.std(predictions, axis=0),
            'min': np.min(predictions, axis=0),
            'max': np.max(predictions, axis=0),
            'individual': dict(zip(self.model_names, predictions))
        }

    def predict_probability(self, X) -> dict:
        """For probability forecasts (binary outcomes).

        Assumes each model's predict_proba returns the probability
        of the positive class.
        """
        probs = [model.predict_proba(X) for model in self.models]
        probs = np.array(probs)

        mean_prob = np.mean(probs, axis=0)

        return {
            'probability': mean_prob,
            'agreement': 1 - np.std(probs, axis=0) * 2,
            'individual': dict(zip(self.model_names, probs))
        }

Trimmed Mean

Remove extreme predictions before averaging to reduce the impact of outlier models:

def trimmed_mean_ensemble(predictions: list, trim_fraction: float = 0.1) -> float:
    """Remove top and bottom X% of predictions before averaging."""
    predictions = sorted(predictions)
    n = len(predictions)
    trim_count = int(n * trim_fraction)

    if trim_count > 0:
        trimmed = predictions[trim_count:-trim_count]
    else:
        trimmed = predictions

    return np.mean(trimmed)
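SciPy ships the same operation as `scipy.stats.trim_mean`; a quick sanity check of the outlier-resistance claim on made-up forecasts with one outlier:

```python
import numpy as np
from scipy import stats

preds = [0.40, 0.45, 0.50, 0.55, 0.95]   # one outlier forecaster
trimmed = stats.trim_mean(preds, 0.2)    # drops the single highest and lowest
plain = np.mean(preds)
# The trimmed mean ignores the 0.95 outlier; the plain mean is dragged upward
assert np.isclose(trimmed, 0.5) and plain > trimmed
```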

Weighted Averaging

Performance-Based Weighting

class WeightedEnsemble:
    """Ensemble with weights based on historical performance."""

    def __init__(self, decay_rate: float = 0.95):
        self.models = {}
        self.performance_history = {}
        self.decay_rate = decay_rate

    def add_model(self, name: str, model):
        self.models[name] = model
        self.performance_history[name] = []

    def record_performance(self, name: str, error: float):
        """Record a model's prediction error."""
        self.performance_history[name].append(error)

    def compute_weights(self, method: str = 'inverse_error') -> dict:
        """Compute model weights from historical performance."""

        if method == 'inverse_error':
            # Weight inversely proportional to recent RMSE
            weights = {}
            for name, errors in self.performance_history.items():
                if not errors:
                    weights[name] = 1.0
                    continue
                # Apply exponential decay to weight recent performance more
                weighted_errors = []
                for i, error in enumerate(reversed(errors)):
                    weighted_errors.append(error * self.decay_rate ** i)
                rmse = np.sqrt(np.mean(np.array(weighted_errors) ** 2))
                weights[name] = 1.0 / max(rmse, 0.001)

        elif method == 'bayesian':
            # Bayesian model averaging using likelihood
            weights = {}
            for name, errors in self.performance_history.items():
                if not errors:
                    weights[name] = 1.0
                    continue
                # Log-likelihood under Gaussian errors
                log_lik = -0.5 * sum(e**2 for e in errors[-20:])
                weights[name] = np.exp(log_lik)

        elif method == 'brier_weighted':
            # For probability forecasts, weight by Brier skill
            weights = {}
            for name, errors in self.performance_history.items():
                brier = np.mean(np.array(errors) ** 2)
                skill = max(0, 1 - brier / 0.25)  # Relative to coin flip
                weights[name] = skill + 0.01  # Small floor

        else:
            raise ValueError(f"Unknown weighting method: {method}")

        # Normalize
        total = sum(weights.values())
        return {name: w / total for name, w in weights.items()}

    def predict(self, X) -> dict:
        weights = self.compute_weights()
        predictions = {}

        for name, model in self.models.items():
            predictions[name] = model.predict(X)

        weighted_pred = sum(
            predictions[name] * weights[name]
            for name in self.models
        )

        return {
            'prediction': weighted_pred,
            'weights': weights,
            'individual': predictions
        }
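Stripped of the class machinery, the inverse-error weighting reduces to a few lines; a standalone sketch with made-up error histories for three hypothetical models:

```python
import numpy as np

# Hypothetical recent errors per model, oldest first
history = {'arima': [0.8, 0.6, 0.7],
           'gbm':   [0.3, 0.4, 0.2],
           'naive': [1.5, 1.2, 1.4]}
decay = 0.95

weights = {}
for name, errors in history.items():
    # Exponential decay: the most recent error carries the most weight
    weighted = [e * decay ** i for i, e in enumerate(reversed(errors))]
    rmse = np.sqrt(np.mean(np.square(weighted)))
    weights[name] = 1.0 / max(rmse, 1e-3)

total = sum(weights.values())
weights = {name: w / total for name, w in weights.items()}
# The historically most accurate model gets the largest share
assert weights['gbm'] > weights['arima'] > weights['naive']
```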

Optimal Linear Combination

from scipy.optimize import minimize

def optimal_ensemble_weights(
    predictions_matrix: np.ndarray,  # (n_models, n_samples)
    actuals: np.ndarray,             # (n_samples,)
    regularization: float = 0.01
) -> np.ndarray:
    """
    Find optimal weights that minimize ensemble MSE.
    Uses constrained optimization: weights sum to 1, non-negative.
    """
    n_models = predictions_matrix.shape[0]

    def objective(weights):
        ensemble_pred = predictions_matrix.T @ weights
        mse = np.mean((ensemble_pred - actuals) ** 2)
        # L2 regularization to prevent overfitting
        reg = regularization * np.sum(weights ** 2)
        return mse + reg

    # Constraints: weights sum to 1
    constraints = {'type': 'eq', 'fun': lambda w: np.sum(w) - 1}

    # Bounds: non-negative weights
    bounds = [(0, 1) for _ in range(n_models)]

    # Initial: equal weights
    x0 = np.full(n_models, 1.0 / n_models)

    result = minimize(objective, x0, bounds=bounds, constraints=constraints)

    return result.x
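On synthetic data the optimizer recovers the intuitive answer: with independent, unbiased noise, weight should concentrate on the less noisy model, roughly in proportion to inverse variance. A minimal sketch (synthetic predictions, no regularization):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
actuals = rng.normal(size=200)
# Two synthetic models: unbiased, with noise std 0.5 and 1.0
preds = np.vstack([actuals + rng.normal(0, 0.5, 200),
                   actuals + rng.normal(0, 1.0, 200)])

def objective(w):
    return np.mean((preds.T @ w - actuals) ** 2)

result = minimize(objective, x0=[0.5, 0.5],
                  bounds=[(0, 1), (0, 1)],
                  constraints={'type': 'eq', 'fun': lambda w: w.sum() - 1})
w = result.x
# Analytically the optimum here is about (0.8, 0.2)
assert w[0] > w[1]
```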

Stacking (Meta-Learning)

Training a Model to Combine Models

from sklearn.model_selection import KFold
from sklearn.linear_model import Ridge

class StackedEnsemble:
    """
    Two-layer ensemble: base models generate features,
    meta-model learns optimal combination.
    """

    def __init__(self, base_models: list, meta_model=None):
        self.base_models = base_models
        self.meta_model = meta_model or Ridge(alpha=1.0)
        self.fitted = False

    def fit(self, X: np.ndarray, y: np.ndarray, n_folds: int = 5):
        """
        Train using out-of-fold predictions to prevent leakage.
        """
        n_samples = len(X)
        n_models = len(self.base_models)
        meta_features = np.zeros((n_samples, n_models))

        kf = KFold(n_splits=n_folds, shuffle=True, random_state=42)

        for fold_idx, (train_idx, val_idx) in enumerate(kf.split(X)):
            X_train, X_val = X[train_idx], X[val_idx]
            y_train = y[train_idx]

            for model_idx, model in enumerate(self.base_models):
                model.fit(X_train, y_train)
                meta_features[val_idx, model_idx] = model.predict(X_val)

        # Refit base models on full data
        for model in self.base_models:
            model.fit(X, y)

        # Train meta-model on out-of-fold predictions
        self.meta_model.fit(meta_features, y)
        self.fitted = True

        # Report meta-model weights (if linear)
        if hasattr(self.meta_model, 'coef_'):
            print("Meta-model weights:", self.meta_model.coef_)

    def predict(self, X: np.ndarray) -> np.ndarray:
        """Generate stacked prediction."""
        meta_features = np.column_stack([
            model.predict(X) for model in self.base_models
        ])
        return self.meta_model.predict(meta_features)
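For comparison, scikit-learn ships this same pattern as `StackingRegressor`, which also feeds out-of-fold base predictions to the meta-model (via its `cv` parameter). A minimal sketch on synthetic data, assuming scikit-learn is installed:

```python
import numpy as np
from sklearn.ensemble import StackingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(0, 0.1, size=300)

stack = StackingRegressor(
    estimators=[('linear', LinearRegression()),
                ('forest', RandomForestRegressor(n_estimators=50, random_state=0))],
    final_estimator=Ridge(alpha=1.0),
    cv=5,   # out-of-fold predictions train the meta-model
)
stack.fit(X, y)
r2 = stack.score(X, y)   # near-linear target, so the fit should be very good
assert r2 > 0.9
```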

Boosting for Predictions

Sequential Error Correction

class BoostedForecaster:
    """
    Each subsequent model focuses on errors of the ensemble so far.
    """

    def __init__(self, base_model_class, n_rounds: int = 10,
                 learning_rate: float = 0.1):
        self.base_model_class = base_model_class
        self.n_rounds = n_rounds
        self.learning_rate = learning_rate
        self.models = []
        self.weights = []

    def fit(self, X: np.ndarray, y: np.ndarray):
        """Train boosted ensemble."""
        residuals = y.copy()

        for round_num in range(self.n_rounds):
            model = self.base_model_class()
            model.fit(X, residuals)

            predictions = model.predict(X)
            self.models.append(model)
            self.weights.append(self.learning_rate)

            # Update residuals
            residuals = residuals - self.learning_rate * predictions

            # Track progress
            ensemble_pred = self.predict(X)
            mse = np.mean((ensemble_pred - y) ** 2)
            print(f"Round {round_num + 1}: MSE = {mse:.4f}")

    def predict(self, X: np.ndarray) -> np.ndarray:
        """Generate boosted prediction."""
        prediction = np.zeros(len(X))
        for model, weight in zip(self.models, self.weights):
            prediction += weight * model.predict(X)
        return prediction
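The residual-fitting loop can be seen in miniature without any model class: using a constant (mean-of-residuals) predictor as the base learner, repeated small corrections converge geometrically to the best constant fit:

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0])
learning_rate = 0.5
prediction = np.zeros_like(y)
residuals = y.copy()

for _ in range(20):
    step = residuals.mean()               # base learner: predict the residual mean
    prediction += learning_rate * step    # shrink the correction
    residuals = y - prediction            # refit target for the next round

# Converges to the target mean (2.5), the best a constant predictor can do
assert np.allclose(prediction, y.mean(), atol=1e-4)
```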

Superforecaster Techniques

How the Best Forecasters Use Ensemble Thinking

Philip Tetlock's Good Judgment Project identified "superforecasters" who consistently outperform their peers. Their ensemble-like mental processes include:

class SuperforecasterProcess:
    """Model the superforecaster's prediction methodology."""

    def __init__(self):
        self.perspectives = []
        self.base_rates = {}
        self.adjustments = []

    def step1_outside_view(self, reference_class: str,
                           base_rate: float) -> float:
        """Start with the base rate (reference class forecasting)."""
        self.base_rates[reference_class] = base_rate
        return base_rate

    def step2_inside_view_adjustments(self, factors: list) -> list:
        """
        Apply case-specific adjustments.
        Each factor shifts the probability up or down.
        """
        adjustments = []
        for factor in factors:
            adjustment = {
                'factor': factor['description'],
                'direction': factor['direction'],  # 'up' or 'down'
                'magnitude': factor['magnitude'],   # 0 to 0.2
                'confidence': factor['confidence']   # 0 to 1
            }
            adjustments.append(adjustment)
        self.adjustments = adjustments
        return adjustments

    def step3_synthesize(self, base_rate: float) -> float:
        """Combine base rate with adjustments."""
        probability = base_rate

        for adj in self.adjustments:
            shift = adj['magnitude'] * adj['confidence']
            if adj['direction'] == 'up':
                probability += shift * (1 - probability)  # Compress toward 1
            else:
                probability -= shift * probability  # Compress toward 0

        # Avoid extreme probabilities (superforecasters rarely go past 95%)
        probability = max(0.02, min(0.98, probability))
        return probability

    def step4_devils_advocate(self, probability: float) -> float:
        """
        Actively seek disconfirming evidence.
        If you find it compelling, adjust toward 50%.
        """
        # This represents the discipline of considering you might be wrong
        disconfirming_strength = 0.1  # How compelling the counterargument is
        return probability * (1 - disconfirming_strength) + 0.5 * disconfirming_strength

    def step5_update_with_new_info(self, current: float,
                                    new_evidence_strength: float,
                                    direction: str) -> float:
        """Granular updates when new information arrives."""
        # Superforecasters make many small updates, not large jumps
        update_size = new_evidence_strength * 0.1  # Small increments

        if direction == 'up':
            return current + update_size * (1 - current)
        else:
            return current - update_size * current
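Walking the steps through a concrete, made-up question shows the arithmetic: start from a 30% base rate, apply one upward case-specific adjustment, temper with a devil's-advocate pass, then make one small update on new evidence:

```python
probability = 0.30                               # step 1: outside-view base rate

# steps 2-3: one factor pushing up (magnitude 0.2, confidence 0.5)
shift = 0.2 * 0.5
probability += shift * (1 - probability)         # 0.30 -> 0.37

# step 4: devil's advocate pulls 10% of the way back toward 0.5
probability = probability * 0.9 + 0.5 * 0.1      # -> 0.383

# step 5: granular update on supporting evidence (strength 0.6)
probability += (0.6 * 0.1) * (1 - probability)   # -> ~0.420

assert 0.40 < probability < 0.45
```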

Building a Human-Model Ensemble

class HumanModelEnsemble:
    """
    Combine human forecasters with statistical/ML models.
    Research shows this hybrid outperforms either alone.
    """

    def __init__(self):
        self.human_forecasts = {}
        self.model_forecasts = {}
        self.track_records = {}

    def add_human_forecast(self, forecaster_id: str, question: str,
                           probability: float, reasoning: str):
        if question not in self.human_forecasts:
            self.human_forecasts[question] = []
        self.human_forecasts[question].append({
            'forecaster': forecaster_id,
            'probability': probability,
            'reasoning': reasoning
        })

    def add_model_forecast(self, model_name: str, question: str,
                           probability: float):
        if question not in self.model_forecasts:
            self.model_forecasts[question] = []
        self.model_forecasts[question].append({
            'model': model_name,
            'probability': probability
        })

    def aggregate(self, question: str,
                  human_weight: float = 0.6,
                  extremize: float = 1.0) -> dict:
        """
        Aggregate human and model forecasts.

        human_weight: relative weight of human vs model forecasts
        extremize: >1 pushes probabilities away from 50% (corrects
                   regression to mean in simple averages)
        """
        human_probs = [f['probability'] for f in self.human_forecasts.get(question, [])]
        model_probs = [f['probability'] for f in self.model_forecasts.get(question, [])]

        if not human_probs and not model_probs:
            return {'error': 'No forecasts available'}

        # Average within each group
        human_avg = np.mean(human_probs) if human_probs else 0.5
        model_avg = np.mean(model_probs) if model_probs else 0.5

        # Weighted combination
        combined = human_weight * human_avg + (1 - human_weight) * model_avg

        # Extremize (push away from 50%)
        if extremize != 1.0:
            odds = combined / (1 - combined)
            extremized_odds = odds ** extremize
            combined = extremized_odds / (1 + extremized_odds)

        return {
            'probability': combined,
            'human_average': human_avg,
            'model_average': model_avg,
            'n_human': len(human_probs),
            'n_model': len(model_probs),
            'agreement': 1 - np.std(human_probs + model_probs) if (human_probs + model_probs) else 0
        }
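The extremization step used in `aggregate` is just a power transform in odds space; pulled out on its own:

```python
def extremize(p: float, a: float = 1.2) -> float:
    """Push a probability away from 0.5 by raising the odds to the power a > 1."""
    odds = p / (1 - p)
    odds **= a
    return odds / (1 + odds)

assert extremize(0.5) == 0.5    # 0.5 is a fixed point
assert extremize(0.7) > 0.7     # confident forecasts become more confident
assert extremize(0.3) < 0.3
```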

Diversity Management

Measuring and Maximizing Ensemble Diversity

class DiversityManager:
    """Tools for managing model diversity in ensembles."""

    @staticmethod
    def correlation_matrix(predictions: dict) -> tuple:
        """Return (matrix, names): pairwise prediction correlations and model names."""
        names = list(predictions.keys())
        n = len(names)
        matrix = np.zeros((n, n))

        for i in range(n):
            for j in range(n):
                corr = np.corrcoef(predictions[names[i]], predictions[names[j]])[0, 1]
                matrix[i, j] = corr

        return matrix, names

    @staticmethod
    def select_diverse_subset(predictions: dict, actuals: np.ndarray,
                              max_models: int) -> list:
        """
        Select a diverse subset of models using greedy forward selection.
        At each step, add the model that most improves ensemble performance.
        """
        available = list(predictions.keys())
        selected = []
        best_error = float('inf')

        for _ in range(max_models):
            best_addition = None
            for candidate in available:
                trial = selected + [candidate]
                trial_preds = np.mean([predictions[m] for m in trial], axis=0)
                error = np.mean((trial_preds - actuals) ** 2)
                if error < best_error:
                    best_error = error
                    best_addition = candidate

            if best_addition is None:
                break  # no remaining model improves the ensemble
            selected.append(best_addition)
            available.remove(best_addition)

        return selected
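The greedy selection can be traced on a tiny synthetic case: two models whose biases cancel both get picked, while a poor third model never does:

```python
import numpy as np

actuals = np.array([1.0, 2.0, 3.0])
preds = {'a': np.array([1.1, 2.1, 3.1]),   # biased high
         'b': np.array([0.9, 1.9, 2.9]),   # biased low, cancels 'a'
         'c': np.array([3.0, 3.0, 3.0])}   # poor

available, selected = list(preds), []
best_error = float('inf')
for _ in range(2):
    best = None
    for cand in available:
        trial = np.mean([preds[m] for m in selected + [cand]], axis=0)
        err = np.mean((trial - actuals) ** 2)
        if err < best_error:
            best_error, best = err, cand
    if best is None:
        break                               # no candidate improves the ensemble
    selected.append(best)
    available.remove(best)

assert set(selected) == {'a', 'b'}          # opposite biases nearly average out
```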

    @staticmethod
    def diversity_sources():
        """Different ways to create model diversity."""
        return {
            'algorithm_diversity': 'Use different model types (linear, tree, neural, etc.)',
            'feature_diversity': 'Use different input features for different models',
            'data_diversity': 'Train on different subsets or time periods',
            'hyperparameter_diversity': 'Same algorithm with different settings',
            'temporal_diversity': 'Models trained at different times',
            'perspective_diversity': 'Frame the problem differently for each model',
            'human_diversity': 'Forecasters with different backgrounds/expertise'
        }

Practical Ensemble Recipes

Recipe 1: Quick and Effective

def quick_ensemble(question: str, approaches: list) -> float:
    """
    Fast ensemble using three approaches:
    1. Base rate from historical data
    2. Current model prediction
    3. Expert judgment

    Simply average with slight extremization.
    """
    probabilities = [a['probability'] for a in approaches]
    mean_prob = np.mean(probabilities)

    # Slight extremization (research-backed)
    odds = mean_prob / (1 - mean_prob + 1e-10)
    extremized_odds = odds ** 1.2
    result = extremized_odds / (1 + extremized_odds)

    return np.clip(result, 0.01, 0.99)

Recipe 2: Production Ensemble Pipeline

class ProductionEnsemble:
    """Full ensemble pipeline for production forecasting."""

    def __init__(self, models: list, validation_window: int = 30):
        self.models = models
        self.validation_window = validation_window
        self.weights = np.full(len(models), 1.0 / len(models))
        self.recalibrated = False

    def fit(self, X_train, y_train, X_val, y_val):
        """Train all models and optimize weights on validation set."""
        # Train base models
        for model in self.models:
            model.fit(X_train, y_train)

        # Get validation predictions
        val_preds = np.array([m.predict(X_val) for m in self.models])

        # Optimize weights
        self.weights = optimal_ensemble_weights(val_preds, y_val)

        # Recalibrate
        ensemble_val = val_preds.T @ self.weights
        self._fit_calibration(ensemble_val, y_val)

    def _fit_calibration(self, predictions, actuals):
        """Fit isotonic regression for calibration."""
        from sklearn.isotonic import IsotonicRegression
        self.calibrator = IsotonicRegression(out_of_bounds='clip')
        self.calibrator.fit(predictions, actuals)
        self.recalibrated = True

    def predict(self, X) -> dict:
        raw_preds = np.array([m.predict(X) for m in self.models])
        ensemble = raw_preds.T @ self.weights

        if self.recalibrated:
            calibrated = self.calibrator.predict(ensemble)
        else:
            calibrated = ensemble

        return {
            'prediction': calibrated,
            'raw': ensemble,
            'model_weights': dict(zip(
                [type(m).__name__ for m in self.models],
                self.weights
            )),
            'model_agreement': 1 - np.std(raw_preds, axis=0).mean()
        }
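The calibration step leans on scikit-learn's `IsotonicRegression`, which fits a monotone mapping from raw scores to observed outcomes. A standalone illustration on made-up scores, assuming scikit-learn is installed:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Raw ensemble scores that systematically overshoot the observed frequencies
raw = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
observed = np.array([0.05, 0.2, 0.35, 0.55, 0.8])

cal = IsotonicRegression(out_of_bounds='clip')
cal.fit(raw, observed)

# Calibrated scores are pulled down, and out-of-range inputs are clipped
assert cal.predict([0.5])[0] < 0.5
assert cal.predict([1.5])[0] == cal.predict([0.9])[0]
```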

Key Takeaways

  1. Simple averaging of diverse predictions is remarkably effective and hard to beat consistently
  2. The Diversity Prediction Theorem guarantees ensemble error is less than or equal to average individual error — diversity is free accuracy
  3. Correlation between models determines the ceiling on ensemble improvement: uncorrelated models give the most benefit
  4. Performance-weighted ensembles outperform equal-weight when you have reliable track records (at least 20-30 observations)
  5. Stacking (meta-learning) is the most powerful method but requires careful cross-validation to prevent overfitting
  6. Extremization (pushing ensemble probability away from 50%) corrects the systematic underconfidence of simple averages
  7. Superforecasters naturally think in ensembles: they combine base rates, case-specific analysis, and devil's advocacy
  8. The optimal ensemble size is usually 5-20 models; beyond that, marginal diversity drops and the improvement plateaus
