Monte Carlo Simulation
Monte Carlo simulation uses repeated random sampling to estimate the probability distributions of uncertain outcomes. Named after the famous casino, the method replaces analytical solutions with computational brute force: simulate a process thousands or millions of times with random inputs, then analyze the distribution of results. It is one of the most versatile and powerful tools in the forecaster's toolkit, applicable to finance, engineering, project management, physics, and any domain with quantifiable uncertainty. ## Key Points 1. Define a model of the system with uncertain inputs 2. Assign probability distributions to each uncertain input 3. Randomly sample from those distributions 4. Run the model with sampled inputs 5. Repeat thousands of times 6. Analyze the distribution of outputs 1. Monte Carlo simulation transforms uncertain inputs into probability distributions of outcomes 2. Choose distributions carefully: triangular for expert estimates, lognormal for positive skewed data, beta for proportions 3. Convergence analysis tells you when you have enough simulations; standard error should be less than 1% of the mean 4. Variance reduction techniques (antithetic variates, stratified sampling, Latin Hypercube) can cut required simulations by 50-90% 5. Sensitivity analysis (tornado diagrams) identifies which inputs matter most, focusing your uncertainty-reduction efforts 6. Always model correlations between inputs; independent sampling of correlated variables produces unrealistic scenarios
skilldb get prediction-skills/monte-carlo-simulationFull skill: 592 linesMonte Carlo Simulation
Overview
Monte Carlo simulation uses repeated random sampling to estimate the probability distributions of uncertain outcomes. Named after the famous casino, the method replaces analytical solutions with computational brute force: simulate a process thousands or millions of times with random inputs, then analyze the distribution of results. It is one of the most versatile and powerful tools in the forecaster's toolkit, applicable to finance, engineering, project management, physics, and any domain with quantifiable uncertainty.
Core Concepts
The Fundamental Idea
Instead of solving a complex problem analytically, Monte Carlo methods:
- Define a model of the system with uncertain inputs
- Assign probability distributions to each uncertain input
- Randomly sample from those distributions
- Run the model with sampled inputs
- Repeat thousands of times
- Analyze the distribution of outputs
import numpy as np
from typing import Callable
import matplotlib.pyplot as plt
class MonteCarloSimulation:
"""General-purpose Monte Carlo simulation engine."""
def __init__(self, model_fn: Callable, n_simulations: int = 10000):
self.model_fn = model_fn
self.n_simulations = n_simulations
self.results = []
self.input_distributions = {}
def add_input(self, name: str, distribution: str, **params):
"""Define an uncertain input with its probability distribution."""
self.input_distributions[name] = {
'distribution': distribution,
'params': params
}
def _sample_input(self, name: str) -> float:
"""Draw a random sample from an input's distribution."""
spec = self.input_distributions[name]
dist = spec['distribution']
params = spec['params']
if dist == 'normal':
return np.random.normal(params['mean'], params['std'])
elif dist == 'uniform':
return np.random.uniform(params['low'], params['high'])
elif dist == 'triangular':
return np.random.triangular(params['low'], params['mode'], params['high'])
elif dist == 'lognormal':
return np.random.lognormal(params['mean'], params['sigma'])
elif dist == 'beta':
return np.random.beta(params['alpha'], params['beta'])
elif dist == 'poisson':
return np.random.poisson(params['lam'])
elif dist == 'exponential':
return np.random.exponential(params['scale'])
elif dist == 'custom':
return params['sample_fn']()
else:
raise ValueError(f"Unknown distribution: {dist}")
def run(self) -> np.ndarray:
"""Execute the Monte Carlo simulation."""
self.results = []
for _ in range(self.n_simulations):
inputs = {
name: self._sample_input(name)
for name in self.input_distributions
}
result = self.model_fn(**inputs)
self.results.append(result)
self.results = np.array(self.results)
return self.results
def statistics(self) -> dict:
"""Compute summary statistics of simulation results."""
return {
'mean': np.mean(self.results),
'median': np.median(self.results),
'std': np.std(self.results),
'min': np.min(self.results),
'max': np.max(self.results),
'percentile_5': np.percentile(self.results, 5),
'percentile_25': np.percentile(self.results, 25),
'percentile_75': np.percentile(self.results, 75),
'percentile_95': np.percentile(self.results, 95),
'prob_positive': np.mean(self.results > 0),
'var_95': np.percentile(self.results, 5), # Value at Risk
'cvar_95': np.mean(self.results[self.results <= np.percentile(self.results, 5)])
}
def probability_of(self, condition: Callable) -> float:
"""Estimate probability that a condition is met."""
return np.mean([condition(r) for r in self.results])
Choosing Probability Distributions
Common Distributions and When to Use Them
def distribution_guide():
"""Guide for selecting appropriate probability distributions."""
return {
'normal': {
'use_when': 'Symmetric uncertainty around a central value',
'examples': ['Measurement error', 'Height of adults', 'IQ scores'],
'params': 'mean, std',
'caution': 'Allows negative values; use lognormal for strictly positive'
},
'triangular': {
'use_when': 'You can estimate min, most likely, and max',
'examples': ['Project duration estimates', 'Cost estimates', 'Expert elicitation'],
'params': 'low, mode, high',
'caution': 'Simple but may underestimate tail risk'
},
'lognormal': {
'use_when': 'Strictly positive, right-skewed values',
'examples': ['Stock prices', 'Income', 'Project costs', 'Claim sizes'],
'params': 'mean (of log), sigma (of log)',
'caution': 'Heavy right tail can produce extreme values'
},
'uniform': {
'use_when': 'Complete ignorance within a known range',
'examples': ['Arrival time within an interval', 'Random selection'],
'params': 'low, high',
'caution': 'Rarely reflects real uncertainty; consider triangular instead'
},
'beta': {
'use_when': 'Proportion or probability (bounded 0-1)',
'examples': ['Conversion rates', 'Success probabilities', 'Market share'],
'params': 'alpha, beta',
'caution': 'Flexible shape; alpha=beta=1 is uniform'
},
'poisson': {
'use_when': 'Count of events in a fixed interval',
'examples': ['Customer arrivals', 'Defects per unit', 'Emails per hour'],
'params': 'lambda (rate)',
'caution': 'Assumes events are independent with constant rate'
},
'exponential': {
'use_when': 'Time between events (memoryless)',
'examples': ['Time between failures', 'Customer inter-arrival times'],
'params': 'scale (1/rate)',
'caution': 'Memoryless property may not hold in practice'
},
'weibull': {
'use_when': 'Time to failure with increasing/decreasing hazard',
'examples': ['Equipment lifetime', 'Material fatigue'],
'params': 'shape, scale',
'caution': 'shape < 1: decreasing hazard, shape > 1: increasing hazard'
}
}
Fitting Distributions to Data
from scipy import stats
def fit_distribution(data: np.ndarray, candidates: list = None) -> dict:
"""Fit multiple distributions to data and select the best one."""
if candidates is None:
candidates = ['norm', 'lognorm', 'gamma', 'weibull_min', 'beta', 'expon']
results = {}
for dist_name in candidates:
dist = getattr(stats, dist_name)
try:
params = dist.fit(data)
# Kolmogorov-Smirnov test
ks_stat, p_value = stats.kstest(data, dist_name, args=params)
# AIC
log_likelihood = np.sum(dist.logpdf(data, *params))
k = len(params)
aic = 2 * k - 2 * log_likelihood
results[dist_name] = {
'params': params,
'ks_statistic': ks_stat,
'p_value': p_value,
'aic': aic,
'log_likelihood': log_likelihood
}
except Exception:
continue
# Rank by AIC (lower is better)
ranked = sorted(results.items(), key=lambda x: x[1]['aic'])
best = ranked[0]
return {
'best_distribution': best[0],
'best_params': best[1]['params'],
'best_aic': best[1]['aic'],
'all_results': dict(ranked)
}
Practical Applications
Project Cost and Duration Estimation
def project_cost_simulation():
"""Simulate total project cost with uncertain task costs."""
tasks = {
'design': {'low': 50000, 'mode': 80000, 'high': 150000},
'development': {'low': 200000, 'mode': 350000, 'high': 600000},
'testing': {'low': 30000, 'mode': 60000, 'high': 120000},
'deployment': {'low': 20000, 'mode': 40000, 'high': 80000},
'contingency_events': {'probability': 0.3, 'cost_if_occurs': 100000}
}
def model(**kwargs):
total = 0
for task_name, task_spec in tasks.items():
if task_name == 'contingency_events':
if np.random.random() < task_spec['probability']:
total += np.random.triangular(
task_spec['cost_if_occurs'] * 0.5,
task_spec['cost_if_occurs'],
task_spec['cost_if_occurs'] * 2
)
else:
total += np.random.triangular(
task_spec['low'], task_spec['mode'], task_spec['high']
)
return total
sim = MonteCarloSimulation(model, n_simulations=50000)
sim.run()
stats = sim.statistics()
print(f"Expected cost: ${stats['mean']:,.0f}")
print(f"80% confidence range: ${stats['percentile_10']:,.0f} - ${stats['percentile_90']:,.0f}")
print(f"95th percentile (budget target): ${stats['percentile_95']:,.0f}")
print(f"Probability under $500K: {sim.probability_of(lambda x: x < 500000):.1%}")
return stats
Portfolio Risk Analysis (Value at Risk)
def portfolio_var_simulation(
holdings: dict,
correlation_matrix: np.ndarray,
horizon_days: int = 10,
n_sims: int = 100000
) -> dict:
"""
Simulate portfolio Value at Risk using correlated returns.
holdings: {'AAPL': {'value': 100000, 'annual_return': 0.12, 'annual_vol': 0.25}, ...}
"""
n_assets = len(holdings)
asset_names = list(holdings.keys())
# Cholesky decomposition for correlated random draws
L = np.linalg.cholesky(correlation_matrix)
# Daily parameters
dt = horizon_days / 252
portfolio_returns = np.zeros(n_sims)
for sim in range(n_sims):
# Generate correlated random numbers
z = np.random.standard_normal(n_assets)
correlated_z = L @ z
portfolio_value_change = 0
for i, (name, asset) in enumerate(holdings.items()):
# Geometric Brownian Motion
daily_return = (
(asset['annual_return'] - 0.5 * asset['annual_vol']**2) * dt
+ asset['annual_vol'] * np.sqrt(dt) * correlated_z[i]
)
value_change = asset['value'] * (np.exp(daily_return) - 1)
portfolio_value_change += value_change
portfolio_returns[sim] = portfolio_value_change
total_portfolio_value = sum(h['value'] for h in holdings.values())
return {
'total_value': total_portfolio_value,
'expected_change': np.mean(portfolio_returns),
'var_95': -np.percentile(portfolio_returns, 5),
'var_99': -np.percentile(portfolio_returns, 1),
'cvar_95': -np.mean(portfolio_returns[portfolio_returns <= np.percentile(portfolio_returns, 5)]),
'max_loss': -np.min(portfolio_returns),
'max_gain': np.max(portfolio_returns),
'prob_loss': np.mean(portfolio_returns < 0)
}
Revenue Forecasting
def revenue_forecast_simulation(
current_customers: int = 1000,
monthly_growth_rate: tuple = (0.02, 0.05, 0.10), # min, mode, max
churn_rate: tuple = (0.01, 0.03, 0.06),
avg_revenue_per_customer: tuple = (45, 50, 60),
months: int = 12,
n_sims: int = 10000
) -> dict:
"""Simulate 12-month revenue forecast."""
monthly_revenues = np.zeros((n_sims, months))
for sim in range(n_sims):
customers = current_customers
for month in range(months):
growth = np.random.triangular(*monthly_growth_rate)
churn = np.random.triangular(*churn_rate)
arpc = np.random.triangular(*avg_revenue_per_customer)
new_customers = int(customers * growth)
lost_customers = int(customers * churn)
customers = customers + new_customers - lost_customers
customers = max(0, customers)
monthly_revenues[sim, month] = customers * arpc
annual_revenues = np.sum(monthly_revenues, axis=1)
return {
'annual_revenue': {
'mean': np.mean(annual_revenues),
'median': np.median(annual_revenues),
'p10': np.percentile(annual_revenues, 10),
'p90': np.percentile(annual_revenues, 90),
},
'monthly_trajectory': {
'mean': np.mean(monthly_revenues, axis=0).tolist(),
'p10': np.percentile(monthly_revenues, 10, axis=0).tolist(),
'p90': np.percentile(monthly_revenues, 90, axis=0).tolist(),
},
'prob_over_1M': np.mean(annual_revenues > 1_000_000),
}
Convergence Analysis
How Many Simulations Are Enough?
def convergence_analysis(sim: MonteCarloSimulation,
checkpoints: list = None) -> dict:
"""Analyze how results converge as simulation count increases."""
if checkpoints is None:
checkpoints = [100, 500, 1000, 2500, 5000, 10000, 25000, 50000]
convergence = []
all_results = sim.results
for n in checkpoints:
if n > len(all_results):
break
subset = all_results[:n]
convergence.append({
'n': n,
'mean': np.mean(subset),
'std': np.std(subset),
'p5': np.percentile(subset, 5),
'p95': np.percentile(subset, 95),
'standard_error': np.std(subset) / np.sqrt(n)
})
# Check if converged (standard error < 1% of mean)
if convergence:
latest = convergence[-1]
se_ratio = latest['standard_error'] / abs(latest['mean']) if latest['mean'] != 0 else float('inf')
converged = se_ratio < 0.01
else:
converged = False
return {
'convergence_path': convergence,
'is_converged': converged,
'recommended_n': _recommend_n(convergence)
}
def _recommend_n(convergence: list) -> int:
"""Recommend simulation count based on convergence analysis."""
for i in range(1, len(convergence)):
mean_change = abs(convergence[i]['mean'] - convergence[i-1]['mean'])
relative_change = mean_change / abs(convergence[i]['mean']) if convergence[i]['mean'] != 0 else float('inf')
if relative_change < 0.005: # Less than 0.5% change
return convergence[i]['n']
return convergence[-1]['n'] * 2 if convergence else 10000
Variance Reduction Techniques
class VarianceReduction:
"""Techniques to improve Monte Carlo efficiency."""
@staticmethod
def antithetic_variates(model_fn: Callable, n_sims: int) -> np.ndarray:
"""
For each random draw U, also use 1-U.
Reduces variance by introducing negative correlation.
"""
results = []
for _ in range(n_sims // 2):
u = np.random.random()
result1 = model_fn(u)
result2 = model_fn(1 - u) # Antithetic
results.extend([result1, result2])
return np.array(results)
@staticmethod
def stratified_sampling(model_fn: Callable, n_sims: int,
n_strata: int = 100) -> np.ndarray:
"""
Divide the probability space into equal strata.
Sample from each stratum to ensure even coverage.
"""
results = []
samples_per_stratum = n_sims // n_strata
for i in range(n_strata):
for _ in range(samples_per_stratum):
u = (i + np.random.random()) / n_strata
results.append(model_fn(u))
return np.array(results)
@staticmethod
def latin_hypercube(model_fn: Callable, n_sims: int,
n_dims: int) -> np.ndarray:
"""
Latin Hypercube Sampling for multi-dimensional inputs.
Ensures each dimension is evenly sampled.
"""
samples = np.zeros((n_sims, n_dims))
for dim in range(n_dims):
perm = np.random.permutation(n_sims)
for i in range(n_sims):
samples[perm[i], dim] = (i + np.random.random()) / n_sims
results = []
for i in range(n_sims):
results.append(model_fn(*samples[i]))
return np.array(results)
@staticmethod
def importance_sampling(model_fn: Callable, target_dist,
proposal_dist, n_sims: int) -> dict:
"""
Sample from a proposal distribution and reweight.
Useful for estimating rare event probabilities.
"""
samples = proposal_dist.rvs(n_sims)
weights = target_dist.pdf(samples) / proposal_dist.pdf(samples)
values = np.array([model_fn(s) for s in samples])
weighted_mean = np.average(values, weights=weights)
effective_n = np.sum(weights)**2 / np.sum(weights**2)
return {
'estimate': weighted_mean,
'effective_sample_size': effective_n,
'efficiency': effective_n / n_sims
}
Sensitivity Analysis
Tornado Diagrams
def tornado_analysis(model_fn: Callable, base_inputs: dict,
input_ranges: dict, n_sims: int = 5000) -> list:
"""
Determine which inputs have the largest impact on output.
Vary each input while holding others at base values.
"""
base_output = model_fn(**base_inputs)
sensitivities = []
for input_name, (low, high) in input_ranges.items():
# Run with low value
low_inputs = base_inputs.copy()
low_inputs[input_name] = low
low_output = model_fn(**low_inputs)
# Run with high value
high_inputs = base_inputs.copy()
high_inputs[input_name] = high
high_output = model_fn(**high_inputs)
swing = abs(high_output - low_output)
sensitivities.append({
'input': input_name,
'low_value': low,
'high_value': high,
'low_output': low_output,
'high_output': high_output,
'swing': swing,
'sensitivity': swing / base_output if base_output != 0 else float('inf')
})
# Sort by swing (largest impact first)
sensitivities.sort(key=lambda x: -x['swing'])
return sensitivities
Correlation Analysis
def input_output_correlation(sim: MonteCarloSimulation,
input_samples: dict) -> dict:
"""Compute correlation between each input and the output."""
correlations = {}
for input_name, samples in input_samples.items():
corr = np.corrcoef(samples, sim.results)[0, 1]
rank_corr = stats.spearmanr(samples, sim.results).correlation
correlations[input_name] = {
'pearson': corr,
'spearman': rank_corr,
'contribution_to_variance': corr ** 2
}
# Sort by variance contribution
sorted_corr = sorted(
correlations.items(),
key=lambda x: -abs(x[1]['contribution_to_variance'])
)
return dict(sorted_corr)
Common Pitfalls
1. Ignoring Correlations Between Inputs
# WRONG: Independent sampling when inputs are correlated
growth = np.random.normal(0.05, 0.02)
inflation = np.random.normal(0.03, 0.01)
# These are likely correlated!
# RIGHT: Use correlated sampling
mean = [0.05, 0.03]
cov = [[0.0004, 0.0002], # growth variance, covariance
[0.0002, 0.0001]] # covariance, inflation variance
growth, inflation = np.random.multivariate_normal(mean, cov)
2. Using Wrong Distributions
Do not default to normal distributions. Real-world data is often skewed, bounded, or fat-tailed. Always check historical data when available.
3. Insufficient Simulations for Tail Events
If you care about the 1st percentile outcome, you need far more simulations than if you only care about the mean. Rule of thumb: to estimate the p-th percentile accurately, you need at least 100 / p * 100 simulations (e.g., 10,000 for the 1st percentile).
4. Ignoring Model Risk
The Monte Carlo result is only as good as your model. If the model structure is wrong, perfectly sampled random inputs still give wrong answers.
Key Takeaways
- Monte Carlo simulation transforms uncertain inputs into probability distributions of outcomes
- Choose distributions carefully: triangular for expert estimates, lognormal for positive skewed data, beta for proportions
- Convergence analysis tells you when you have enough simulations; standard error should be less than 1% of the mean
- Variance reduction techniques (antithetic variates, stratified sampling, Latin Hypercube) can cut required simulations by 50-90%
- Sensitivity analysis (tornado diagrams) identifies which inputs matter most, focusing your uncertainty-reduction efforts
- Always model correlations between inputs; independent sampling of correlated variables produces unrealistic scenarios
- For tail risk estimation (VaR, extreme events), you need significantly more simulations than for mean estimation
Install this skill directly: skilldb add prediction-skills