Ensemble Prediction Methods
Overview
Ensemble prediction combines multiple forecasting models or forecasters to produce predictions that are more accurate and better calibrated than any individual model. The principle is deceptively simple — averaging diverse, independent predictions reduces error — but the practice involves sophisticated techniques for selecting models, weighting contributions, and managing the diversity-accuracy tradeoff. From weather forecasting to superforecasting tournaments, ensembles consistently outperform individual predictions.
The Mathematical Foundation
Why Ensembles Work: The Bias-Variance Decomposition
The expected error of any prediction can be decomposed:
Total Error = Bias² + Variance + Irreducible Noise
For an ensemble of M models:
Ensemble Variance ≈ Individual Variance / M (if models are independent)
Ensemble Bias ≈ Average Individual Bias
Ensembles primarily reduce variance (the "shakiness" of predictions) while preserving the average bias. This is why diverse models matter: if all models make the same errors, averaging does not help.
```python
import numpy as np

def demonstrate_ensemble_benefit(n_models: int, n_trials: int = 10000,
                                 individual_std: float = 1.0,
                                 correlation: float = 0.0) -> dict:
    """Show how ensemble size and correlation affect accuracy."""
    true_value = 5.0
    individual_mse = []
    ensemble_mse = []
    for _ in range(n_trials):
        # Generate correlated predictions
        cov_matrix = np.full((n_models, n_models), correlation * individual_std**2)
        np.fill_diagonal(cov_matrix, individual_std**2)
        predictions = np.random.multivariate_normal(
            np.full(n_models, true_value), cov_matrix
        )
        individual_mse.append((predictions[0] - true_value)**2)
        ensemble_mse.append((np.mean(predictions) - true_value)**2)
    return {
        'individual_rmse': np.sqrt(np.mean(individual_mse)),
        'ensemble_rmse': np.sqrt(np.mean(ensemble_mse)),
        'improvement': 1 - np.sqrt(np.mean(ensemble_mse)) / np.sqrt(np.mean(individual_mse)),
        'theoretical_improvement': 1 - np.sqrt(
            (1 + correlation * (n_models - 1)) / n_models
        )
    }

# Demonstration
for corr in [0.0, 0.3, 0.6, 0.9]:
    result = demonstrate_ensemble_benefit(10, correlation=corr)
    print(f"Correlation {corr}: {result['improvement']:.1%} improvement "
          f"(theory: {result['theoretical_improvement']:.1%})")
```
The Diversity Prediction Theorem
Ensemble Error = Average Individual Error - Diversity
Where Diversity = Average squared deviation of individual predictions from ensemble mean
This remarkable theorem shows that, under squared error, the ensemble's error is always less than or equal to the average individual error. Diversity is free accuracy.
```python
def diversity_prediction_theorem(predictions: list, true_value: float) -> dict:
    """Demonstrate the diversity-prediction theorem."""
    predictions = np.array(predictions)
    ensemble = np.mean(predictions)
    # Individual errors
    individual_errors = (predictions - true_value) ** 2
    avg_individual_error = np.mean(individual_errors)
    # Ensemble error
    ensemble_error = (ensemble - true_value) ** 2
    # Diversity
    diversity = np.mean((predictions - ensemble) ** 2)
    # The theorem: ensemble_error = avg_individual_error - diversity
    theorem_check = abs(ensemble_error - (avg_individual_error - diversity))
    return {
        'avg_individual_error': avg_individual_error,
        'ensemble_error': ensemble_error,
        'diversity': diversity,
        'theorem_holds': theorem_check < 1e-10,
        'improvement_from_diversity': diversity / avg_individual_error
    }
```
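A quick numeric check of the identity, with three made-up forecasts of a quantity whose true value is 10:

```python
import numpy as np

preds = np.array([8.0, 11.0, 14.0])  # three diverse forecasts (hypothetical)
truth = 10.0

avg_individual_error = np.mean((preds - truth) ** 2)  # (4 + 1 + 16) / 3 = 7.0
ensemble = preds.mean()                               # 11.0
ensemble_error = (ensemble - truth) ** 2              # 1.0
diversity = np.mean((preds - ensemble) ** 2)          # (9 + 0 + 9) / 3 = 6.0

# The identity: 1.0 = 7.0 - 6.0
assert np.isclose(ensemble_error, avg_individual_error - diversity)
```

Note that the middle forecaster happens to be individually best here (error 1.0), yet the ensemble matches it without knowing in advance which forecaster to trust.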
Simple Averaging Methods
Equal-Weight Averaging
The simplest and often surprisingly effective approach:
```python
class SimpleEnsemble:
    """Equal-weight ensemble of multiple models."""

    def __init__(self):
        self.models = []
        self.model_names = []

    def add_model(self, name: str, model):
        self.models.append(model)
        self.model_names.append(name)

    def predict(self, X) -> dict:
        predictions = np.array([model.predict(X) for model in self.models])
        return {
            'mean': np.mean(predictions, axis=0),
            'median': np.median(predictions, axis=0),
            'std': np.std(predictions, axis=0),
            'min': np.min(predictions, axis=0),
            'max': np.max(predictions, axis=0),
            'individual': dict(zip(self.model_names, predictions))
        }

    def predict_probability(self, X) -> dict:
        """For probability forecasts (binary outcomes)."""
        probs = np.array([model.predict_proba(X) for model in self.models])
        mean_prob = np.mean(probs, axis=0)
        return {
            'probability': mean_prob,
            'agreement': 1 - np.std(probs, axis=0) * 2,  # rough heuristic: lower spread = more agreement
            'individual': dict(zip(self.model_names, probs))
        }
```
Trimmed Mean
Remove extreme predictions before averaging to reduce the impact of outlier models:
```python
def trimmed_mean_ensemble(predictions: list, trim_fraction: float = 0.1) -> float:
    """Remove top and bottom X% of predictions before averaging."""
    predictions = sorted(predictions)
    n = len(predictions)
    trim_count = int(n * trim_fraction)
    if trim_count > 0:
        trimmed = predictions[trim_count:-trim_count]
    else:
        trimmed = predictions
    return np.mean(trimmed)
```
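For instance, with ten probability forecasts containing two outliers (the function is restated here so the sketch runs standalone; the numbers are invented):

```python
import numpy as np

def trimmed_mean(preds, trim_fraction=0.1):
    preds = sorted(preds)
    k = int(len(preds) * trim_fraction)
    return float(np.mean(preds[k:len(preds) - k] if k else preds))

forecasts = [0.60, 0.62, 0.58, 0.61, 0.59, 0.63, 0.57, 0.95, 0.05, 0.60]
# trim_fraction=0.1 on ten forecasts drops the lowest (0.05) and highest (0.95)
print(round(trimmed_mean(forecasts), 3))    # 0.6
print(round(float(np.mean(forecasts)), 3))  # plain mean is dragged to 0.58 by the 0.05 outlier
```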
Weighted Averaging
Performance-Based Weighting
```python
class WeightedEnsemble:
    """Ensemble with weights based on historical performance."""

    def __init__(self, decay_rate: float = 0.95):
        self.models = {}
        self.performance_history = {}
        self.decay_rate = decay_rate

    def add_model(self, name: str, model):
        self.models[name] = model
        self.performance_history[name] = []

    def record_performance(self, name: str, error: float):
        """Record a model's prediction error."""
        self.performance_history[name].append(error)

    def compute_weights(self, method: str = 'inverse_error') -> dict:
        """Compute model weights from historical performance."""
        if method == 'inverse_error':
            # Weight inversely proportional to recent RMSE
            weights = {}
            for name, errors in self.performance_history.items():
                if not errors:
                    weights[name] = 1.0
                    continue
                # Apply exponential decay to weight recent performance more
                weighted_errors = [
                    error * self.decay_rate ** i
                    for i, error in enumerate(reversed(errors))
                ]
                rmse = np.sqrt(np.mean(np.array(weighted_errors) ** 2))
                weights[name] = 1.0 / max(rmse, 0.001)
        elif method == 'bayesian':
            # Bayesian model averaging using likelihood
            # Log-likelihood under Gaussian errors (recent window)
            log_liks = {
                name: (-0.5 * sum(e**2 for e in errors[-20:]) if errors else 0.0)
                for name, errors in self.performance_history.items()
            }
            # Subtract the max log-likelihood before exponentiating to avoid underflow
            max_ll = max(log_liks.values())
            weights = {name: np.exp(ll - max_ll) for name, ll in log_liks.items()}
        elif method == 'brier_weighted':
            # For probability forecasts, weight by Brier skill
            weights = {}
            for name, errors in self.performance_history.items():
                if not errors:
                    weights[name] = 1.0
                    continue
                brier = np.mean(np.array(errors) ** 2)
                skill = max(0, 1 - brier / 0.25)  # Relative to coin flip
                weights[name] = skill + 0.01  # Small floor
        # Normalize
        total = sum(weights.values())
        return {name: w / total for name, w in weights.items()}

    def predict(self, X) -> dict:
        weights = self.compute_weights()
        predictions = {}
        for name, model in self.models.items():
            predictions[name] = model.predict(X)
        weighted_pred = sum(
            predictions[name] * weights[name]
            for name in self.models
        )
        return {
            'prediction': weighted_pred,
            'weights': weights,
            'individual': predictions
        }
```
Optimal Linear Combination
```python
from scipy.optimize import minimize

def optimal_ensemble_weights(
    predictions_matrix: np.ndarray,  # (n_models, n_samples)
    actuals: np.ndarray,             # (n_samples,)
    regularization: float = 0.01
) -> np.ndarray:
    """
    Find optimal weights that minimize ensemble MSE.
    Uses constrained optimization: weights sum to 1, non-negative.
    """
    n_models = predictions_matrix.shape[0]

    def objective(weights):
        ensemble_pred = predictions_matrix.T @ weights
        mse = np.mean((ensemble_pred - actuals) ** 2)
        # L2 regularization to prevent overfitting
        reg = regularization * np.sum(weights ** 2)
        return mse + reg

    # Constraints: weights sum to 1
    constraints = {'type': 'eq', 'fun': lambda w: np.sum(w) - 1}
    # Bounds: non-negative weights
    bounds = [(0, 1) for _ in range(n_models)]
    # Initial: equal weights
    x0 = np.full(n_models, 1.0 / n_models)
    result = minimize(objective, x0, bounds=bounds, constraints=constraints)
    return result.x
```
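As a sanity check on what weight optimization buys, here is a numpy-only sketch with two synthetic models whose noise levels differ (the data is invented; for two models the ensemble MSE is quadratic in the single free weight, so a fine grid search suffices and no scipy is needed):

```python
import numpy as np

rng = np.random.default_rng(0)
truth = rng.normal(0, 1, 500)
pred_a = truth + rng.normal(0, 1.0, 500)   # noisier model
pred_b = truth + rng.normal(0, 0.5, 500)   # more accurate model

# Weights (w, 1 - w): sweep w and keep the one minimizing ensemble MSE
ws = np.linspace(0, 1, 1001)
mses = [float(np.mean((w * pred_a + (1 - w) * pred_b - truth) ** 2)) for w in ws]
w_best = ws[int(np.argmin(mses))]
print(f"weight on noisier model: {w_best:.2f}")
# Inverse-variance theory predicts w ≈ 0.25 / (1.0 + 0.25) = 0.2
```

The optimizer downweights, but does not discard, the noisier model: even a worse model contributes as long as its errors are not perfectly correlated with the better one's.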
Stacking (Meta-Learning)
Training a Model to Combine Models
```python
from sklearn.model_selection import KFold
from sklearn.linear_model import Ridge

class StackedEnsemble:
    """
    Two-layer ensemble: base models generate features,
    meta-model learns optimal combination.
    """

    def __init__(self, base_models: list, meta_model=None):
        self.base_models = base_models
        self.meta_model = meta_model or Ridge(alpha=1.0)
        self.fitted = False

    def fit(self, X: np.ndarray, y: np.ndarray, n_folds: int = 5):
        """Train using out-of-fold predictions to prevent leakage."""
        n_samples = len(X)
        n_models = len(self.base_models)
        meta_features = np.zeros((n_samples, n_models))
        kf = KFold(n_splits=n_folds, shuffle=True, random_state=42)
        for train_idx, val_idx in kf.split(X):
            X_train, X_val = X[train_idx], X[val_idx]
            y_train = y[train_idx]
            for model_idx, model in enumerate(self.base_models):
                model.fit(X_train, y_train)
                meta_features[val_idx, model_idx] = model.predict(X_val)
        # Refit base models on full data
        for model in self.base_models:
            model.fit(X, y)
        # Train meta-model on out-of-fold predictions
        self.meta_model.fit(meta_features, y)
        self.fitted = True
        # Report meta-model weights (if linear)
        if hasattr(self.meta_model, 'coef_'):
            print("Meta-model weights:", self.meta_model.coef_)

    def predict(self, X: np.ndarray) -> np.ndarray:
        """Generate stacked prediction."""
        meta_features = np.column_stack([
            model.predict(X) for model in self.base_models
        ])
        return self.meta_model.predict(meta_features)
```
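The out-of-fold mechanism is the crux, and it can be shown without sklearn. A minimal sketch using numpy polynomial fits as stand-in base models (data, models, and fold scheme are invented for illustration; folds are contiguous rather than shuffled, for brevity):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, 200)
y = 1.5 * X + 0.3 * X**2 + rng.normal(0, 0.2, 200)

# Two toy base models: a linear fit and a quadratic fit (np.polyfit).
# Each sample's meta-feature comes from models trained on the OTHER folds,
# so the meta-model never sees a leaked in-sample fit.
n_folds, n = 5, len(X)
folds = np.array_split(np.arange(n), n_folds)
meta = np.zeros((n, 2))
for val_idx in folds:
    train = np.setdiff1d(np.arange(n), val_idx)
    for j, deg in enumerate([1, 2]):
        coefs = np.polyfit(X[train], y[train], deg)
        meta[val_idx, j] = np.polyval(coefs, X[val_idx])

# Meta-model: least-squares combination of the two base predictions
w, *_ = np.linalg.lstsq(meta, y, rcond=None)
stacked = meta @ w
print("meta weights:", np.round(w, 2))
```

Because the least-squares meta-fit searches over all linear combinations, the stacked out-of-fold error can be no worse than either base model's alone.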
Boosting for Predictions
Sequential Error Correction
```python
class BoostedForecaster:
    """Each subsequent model focuses on errors of the ensemble so far."""

    def __init__(self, base_model_class, n_rounds: int = 10,
                 learning_rate: float = 0.1):
        self.base_model_class = base_model_class
        self.n_rounds = n_rounds
        self.learning_rate = learning_rate
        self.models = []
        self.weights = []

    def fit(self, X: np.ndarray, y: np.ndarray):
        """Train boosted ensemble."""
        residuals = y.copy()
        for round_num in range(self.n_rounds):
            model = self.base_model_class()
            model.fit(X, residuals)
            predictions = model.predict(X)
            self.models.append(model)
            self.weights.append(self.learning_rate)
            # Update residuals
            residuals = residuals - self.learning_rate * predictions
            # Track progress
            ensemble_pred = self.predict(X)
            mse = np.mean((ensemble_pred - y) ** 2)
            print(f"Round {round_num + 1}: MSE = {mse:.4f}")

    def predict(self, X: np.ndarray) -> np.ndarray:
        """Generate boosted prediction."""
        prediction = np.zeros(len(X))
        for model, weight in zip(self.models, self.weights):
            prediction += weight * model.predict(X)
        return prediction
```
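The residual-fitting loop can be demonstrated compactly with numpy alone, using a cubic polynomial fit (np.polyfit) as a stand-in weak learner; the target function and settings here are invented:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.uniform(-2, 2, 300)
y = np.sin(2 * X) + rng.normal(0, 0.1, 300)

# Each round fits the CURRENT residuals, and the ensemble takes a small
# (learning-rate-sized) step toward that fit
learning_rate = 0.3
residuals = y.copy()
mses = []
for _ in range(20):
    coefs = np.polyfit(X, residuals, 3)
    residuals = residuals - learning_rate * np.polyval(coefs, X)
    mses.append(float(np.mean(residuals ** 2)))

print(f"MSE round 1: {mses[0]:.3f}, round 20: {mses[-1]:.3f}")
```

Because each round's fit is a least-squares projection of the residuals, every step with a learning rate below 2 can only shrink (never grow) the residual sum of squares.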
Superforecaster Techniques
How the Best Forecasters Use Ensemble Thinking
Philip Tetlock's Good Judgment Project identified "superforecasters" who consistently outperform their peers. Their ensemble-like mental processes include:
```python
class SuperforecasterProcess:
    """Model the superforecaster's prediction methodology."""

    def __init__(self):
        self.perspectives = []
        self.base_rates = {}
        self.adjustments = []

    def step1_outside_view(self, reference_class: str,
                           base_rate: float) -> float:
        """Start with the base rate (reference class forecasting)."""
        self.base_rates[reference_class] = base_rate
        return base_rate

    def step2_inside_view_adjustments(self, factors: list) -> list:
        """
        Apply case-specific adjustments.
        Each factor shifts the probability up or down.
        """
        adjustments = []
        for factor in factors:
            adjustment = {
                'factor': factor['description'],
                'direction': factor['direction'],    # 'up' or 'down'
                'magnitude': factor['magnitude'],    # 0 to 0.2
                'confidence': factor['confidence']   # 0 to 1
            }
            adjustments.append(adjustment)
        self.adjustments = adjustments
        return adjustments

    def step3_synthesize(self, base_rate: float) -> float:
        """Combine base rate with adjustments."""
        probability = base_rate
        for adj in self.adjustments:
            shift = adj['magnitude'] * adj['confidence']
            if adj['direction'] == 'up':
                probability += shift * (1 - probability)  # Compress toward 1
            else:
                probability -= shift * probability        # Compress toward 0
        # Avoid extreme probabilities (superforecasters rarely go past 95%)
        probability = max(0.02, min(0.98, probability))
        return probability

    def step4_devils_advocate(self, probability: float) -> float:
        """
        Actively seek disconfirming evidence.
        If you find it compelling, adjust toward 50%.
        """
        # This represents the discipline of considering you might be wrong
        disconfirming_strength = 0.1  # How compelling the counterargument is
        return probability * (1 - disconfirming_strength) + 0.5 * disconfirming_strength

    def step5_update_with_new_info(self, current: float,
                                   new_evidence_strength: float,
                                   direction: str) -> float:
        """Granular updates when new information arrives."""
        # Superforecasters make many small updates, not large jumps
        update_size = new_evidence_strength * 0.1  # Small increments
        if direction == 'up':
            return current + update_size * (1 - current)
        else:
            return current - update_size * current
```
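Worked numerically, the process looks like this (the question, base rate, and adjustment factors are all hypothetical, chosen only to trace the arithmetic):

```python
# Step 1: outside view -- incumbents in this (hypothetical) reference
# class win about 65% of the time
p = 0.65

# Steps 2-3: inside-view adjustments, compressed toward the boundaries
p += (0.10 * 0.8) * (1 - p)   # strong economy (magnitude 0.10, confidence 0.8): up
p -= (0.15 * 0.5) * p         # recent scandal (magnitude 0.15, confidence 0.5): down

# Step 4: devil's advocacy shrinks 10% of the way back toward 0.5
p = 0.9 * p + 0.1 * 0.5

print(round(p, 3))  # 0.614
```

The final number moves only modestly from the base rate: the outside view anchors, the inside view nudges, and devil's advocacy trims overconfidence.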
Building a Human-Model Ensemble
```python
class HumanModelEnsemble:
    """
    Combine human forecasters with statistical/ML models.
    Research shows this hybrid outperforms either alone.
    """

    def __init__(self):
        self.human_forecasts = {}
        self.model_forecasts = {}
        self.track_records = {}

    def add_human_forecast(self, forecaster_id: str, question: str,
                           probability: float, reasoning: str):
        if question not in self.human_forecasts:
            self.human_forecasts[question] = []
        self.human_forecasts[question].append({
            'forecaster': forecaster_id,
            'probability': probability,
            'reasoning': reasoning
        })

    def add_model_forecast(self, model_name: str, question: str,
                           probability: float):
        if question not in self.model_forecasts:
            self.model_forecasts[question] = []
        self.model_forecasts[question].append({
            'model': model_name,
            'probability': probability
        })

    def aggregate(self, question: str,
                  human_weight: float = 0.6,
                  extremize: float = 1.0) -> dict:
        """
        Aggregate human and model forecasts.
        human_weight: relative weight of human vs model forecasts
        extremize: >1 pushes probabilities away from 50% (corrects
        regression to mean in simple averages)
        """
        human_probs = [f['probability'] for f in self.human_forecasts.get(question, [])]
        model_probs = [f['probability'] for f in self.model_forecasts.get(question, [])]
        if not human_probs and not model_probs:
            return {'error': 'No forecasts available'}
        # Average within each group
        human_avg = np.mean(human_probs) if human_probs else 0.5
        model_avg = np.mean(model_probs) if model_probs else 0.5
        # Weighted combination
        combined = human_weight * human_avg + (1 - human_weight) * model_avg
        # Extremize (push away from 50%)
        if extremize != 1.0:
            odds = combined / (1 - combined)
            extremized_odds = odds ** extremize
            combined = extremized_odds / (1 + extremized_odds)
        return {
            'probability': combined,
            'human_average': human_avg,
            'model_average': model_avg,
            'n_human': len(human_probs),
            'n_model': len(model_probs),
            'agreement': 1 - np.std(human_probs + model_probs) if (human_probs + model_probs) else 0
        }
```
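The `extremize` exponent deserves a standalone illustration. With forecasters who all lean the same way (made-up numbers), simple averaging pulls the aggregate toward 0.5; exponentiating the odds pushes it back out:

```python
import numpy as np

probs = [0.70, 0.65, 0.80, 0.75]   # four forecasters, all leaning "yes"
mean_p = float(np.mean(probs))      # 0.725

a = 1.5                             # extremizing exponent (a > 1)
odds = mean_p / (1 - mean_p)
extremized = odds ** a / (1 + odds ** a)
print(round(mean_p, 3), round(extremized, 3))
```

An exponent of 1 leaves the average unchanged; in practice the exponent should be tuned on held-out questions, since over-extremizing hurts calibration.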
Diversity Management
Measuring and Maximizing Ensemble Diversity
```python
class DiversityManager:
    """Tools for managing model diversity in ensembles."""

    @staticmethod
    def correlation_matrix(predictions: dict) -> tuple:
        """Compute pairwise prediction correlations between models."""
        names = list(predictions.keys())
        n = len(names)
        matrix = np.zeros((n, n))
        for i in range(n):
            for j in range(n):
                matrix[i, j] = np.corrcoef(predictions[names[i]],
                                           predictions[names[j]])[0, 1]
        return matrix, names

    @staticmethod
    def select_diverse_subset(predictions: dict, actuals: np.ndarray,
                              max_models: int) -> list:
        """
        Select a diverse subset of models using greedy forward selection.
        At each step, add the model that most improves ensemble performance.
        """
        available = list(predictions.keys())
        selected = []
        best_error = float('inf')
        for _ in range(max_models):
            best_addition = None
            for candidate in available:
                trial = selected + [candidate]
                trial_preds = np.mean([predictions[m] for m in trial], axis=0)
                error = np.mean((trial_preds - actuals) ** 2)
                if error < best_error:
                    best_error = error
                    best_addition = candidate
            if best_addition is None:
                break  # No candidate improves the ensemble; stop early
            selected.append(best_addition)
            available.remove(best_addition)
        return selected

    @staticmethod
    def diversity_sources():
        """Different ways to create model diversity."""
        return {
            'algorithm_diversity': 'Use different model types (linear, tree, neural, etc.)',
            'feature_diversity': 'Use different input features for different models',
            'data_diversity': 'Train on different subsets or time periods',
            'hyperparameter_diversity': 'Same algorithm with different settings',
            'temporal_diversity': 'Models trained at different times',
            'perspective_diversity': 'Frame the problem differently for each model',
            'human_diversity': 'Forecasters with different backgrounds/expertise'
        }
```
Practical Ensemble Recipes
Recipe 1: Quick and Effective
```python
def quick_ensemble(question: str, approaches: list) -> float:
    """
    Fast ensemble using three approaches:
    1. Base rate from historical data
    2. Current model prediction
    3. Expert judgment
    Simply average with slight extremization.
    """
    probabilities = [a['probability'] for a in approaches]
    mean_prob = np.mean(probabilities)
    # Slight extremization (research-backed)
    odds = mean_prob / (1 - mean_prob + 1e-10)
    extremized_odds = odds ** 1.2
    result = extremized_odds / (1 + extremized_odds)
    return np.clip(result, 0.01, 0.99)
```
Recipe 2: Production Ensemble Pipeline
```python
class ProductionEnsemble:
    """Full ensemble pipeline for production forecasting."""

    def __init__(self, models: list, validation_window: int = 30):
        self.models = models
        self.validation_window = validation_window
        self.weights = np.full(len(models), 1.0 / len(models))
        self.recalibrated = False

    def fit(self, X_train, y_train, X_val, y_val):
        """Train all models and optimize weights on validation set."""
        # Train base models
        for model in self.models:
            model.fit(X_train, y_train)
        # Get validation predictions
        val_preds = np.array([m.predict(X_val) for m in self.models])
        # Optimize weights
        self.weights = optimal_ensemble_weights(val_preds, y_val)
        # Recalibrate
        ensemble_val = val_preds.T @ self.weights
        self._fit_calibration(ensemble_val, y_val)

    def _fit_calibration(self, predictions, actuals):
        """Fit isotonic regression for calibration."""
        from sklearn.isotonic import IsotonicRegression
        self.calibrator = IsotonicRegression(out_of_bounds='clip')
        self.calibrator.fit(predictions, actuals)
        self.recalibrated = True

    def predict(self, X) -> dict:
        raw_preds = np.array([m.predict(X) for m in self.models])
        ensemble = raw_preds.T @ self.weights
        if self.recalibrated:
            calibrated = self.calibrator.predict(ensemble)
        else:
            calibrated = ensemble
        return {
            'prediction': calibrated,
            'raw': ensemble,
            'model_weights': dict(zip(
                [type(m).__name__ for m in self.models],
                self.weights
            )),
            'model_agreement': 1 - np.std(raw_preds, axis=0).mean()
        }
```
Key Takeaways
- Simple averaging of diverse predictions is remarkably effective and hard to beat consistently
- The Diversity Prediction Theorem guarantees ensemble error is less than or equal to average individual error — diversity is free accuracy
- Correlation between models determines the ceiling on ensemble improvement: uncorrelated models give the most benefit
- Performance-weighted ensembles outperform equal-weight when you have reliable track records (at least 20-30 observations)
- Stacking (meta-learning) is the most powerful method but requires careful cross-validation to prevent overfitting
- Extremization (pushing ensemble probability away from 50%) corrects the systematic underconfidence of simple averages
- Superforecasters naturally think in ensembles: they combine base rates, case-specific analysis, and devil's advocacy
- The optimal ensemble size is usually 5-20 models; beyond that, marginal diversity drops and the improvement plateaus