Performance Optimization — WebAssembly

You are an expert in performance optimization for WebAssembly applications.

Overview

WebAssembly runs at near-native speed, but achieving peak performance requires deliberate optimization of binary size, execution throughput, memory layout, and host interop overhead. Performance work spans the entire pipeline: source language choices, compiler flags, Wasm-specific optimizations with wasm-opt, runtime configuration, and profiling with browser DevTools or standalone tools.

Core Concepts

Performance Dimensions

  • Binary size — download time, compilation time, cache size
  • Startup time — time to compile + instantiate the module
  • Throughput — raw computation speed of hot loops
  • Memory usage — linear memory footprint, GC pressure
  • Interop cost — overhead of crossing the JS/Wasm boundary
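
These dimensions can be measured separately. A minimal sketch in JS, using the smallest valid Wasm module (magic number and version only) as a stand-in for a real binary:

```javascript
// Smallest valid Wasm module: just the \0asm magic and version 1.
const bytes = new Uint8Array([0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00]);

const t0 = performance.now();
const module = new WebAssembly.Module(bytes);      // compile (startup cost 1)
const instance = new WebAssembly.Instance(module); // instantiate (startup cost 2)
const startupMs = performance.now() - t0;

// Binary size and startup time; throughput and interop cost would be
// timed around calls to instance.exports in a real module.
console.log(`size: ${bytes.byteLength} B, startup: ${startupMs.toFixed(2)} ms`);
```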

Compilation Tiers

Modern Wasm runtimes use tiered compilation:

  • Baseline/Liftoff (V8) — fast, unoptimized compilation for quick startup
  • Optimizing/TurboFan (V8) — slower compilation producing faster code
  • Cranelift (Wasmtime) — optimizing compiler, supports both JIT and AOT
  • AOT — pre-compile to native code, eliminating runtime compilation entirely

wasm-opt Optimization Levels

  • -O — standard optimization
  • -O2 — aggressive optimization (more inlining)
  • -O3 — maximum speed optimization
  • -Os — optimize for size (moderate)
  • -Oz — optimize aggressively for size
  • -O4 — equivalent to -O3 with additional flattening passes

Implementation Patterns

Binary Size Reduction (Rust)

# Cargo.toml
[profile.release]
opt-level = "z"       # optimize for size
lto = true            # link-time optimization (enables cross-crate inlining)
codegen-units = 1     # single codegen unit for better LTO
panic = "abort"       # remove unwinding code (~10% size reduction)
strip = true          # strip debug symbols

[dependencies]
wasm-bindgen = "0.2"
# Use a small allocator instead of the default
lol_alloc = "0.4"

// lib.rs: use lol_alloc as the global allocator
use lol_alloc::AssumeSingleThreaded;
use lol_alloc::FreeListAllocator;

#[global_allocator]
static ALLOCATOR: AssumeSingleThreaded<FreeListAllocator> =
    unsafe { AssumeSingleThreaded::new(FreeListAllocator::new()) };

# Build and optimize
cargo build --target wasm32-unknown-unknown --release
wasm-opt -Oz -o output.wasm target/wasm32-unknown-unknown/release/app.wasm

# Check size
wasm-opt --metrics output.wasm 2>&1 | head -5
ls -la output.wasm

Strip Unused Exports

# wasm-opt removes unused functions as part of its module-element DCE
wasm-opt -Oz --remove-unused-module-elements \
  -o optimized.wasm input.wasm

# For Rust, ensure only needed functions are public
# Use #[wasm_bindgen] only on functions you actually export

Memory-Efficient Data Layout

// Bad: struct with padding due to alignment
// (repr(C) keeps declaration order; default Rust repr may reorder fields)
#[repr(C)]
struct ParticleBad {
    active: bool,   // 1 byte + 7 bytes padding
    x: f64,         // 8 bytes
    y: f64,         // 8 bytes
    active2: bool,  // 1 byte + 7 bytes padding
    vx: f64,        // 8 bytes
}
// Total: 40 bytes, 14 of them padding

// Good: group by size to minimize padding
#[repr(C)]
struct ParticleGood {
    x: f64,         // 8 bytes
    y: f64,         // 8 bytes
    vx: f64,        // 8 bytes
    vy: f64,        // 8 bytes
    active: bool,   // 1 byte
    _pad: [u8; 7],  // explicit padding
}
// Total: 40 bytes, but fields accessed together are colocated

Structure of Arrays (SoA) for SIMD

// Array of Structures (AoS) — poor for SIMD
struct Particles {
    particles: Vec<Particle>,
}

// Structure of Arrays (SoA) — SIMD-friendly
struct ParticleSystem {
    x: Vec<f64>,
    y: Vec<f64>,
    vx: Vec<f64>,
    vy: Vec<f64>,
    count: usize,
}

impl ParticleSystem {
    fn update(&mut self, dt: f64) {
        // This loop auto-vectorizes well with SIMD
        for i in 0..self.count {
            self.x[i] += self.vx[i] * dt;
            self.y[i] += self.vy[i] * dt;
        }
    }
}

SIMD Intrinsics (Rust)

use std::arch::wasm32::*;

// Requires building with the simd128 target feature enabled,
// e.g. RUSTFLAGS="-C target-feature=+simd128"
pub fn sum_f32_simd(data: &[f32]) -> f32 {
    let chunks = data.len() / 4;
    let mut acc = f32x4_splat(0.0);

    for i in 0..chunks {
        let offset = i * 4;
        // SAFETY: offset + 4 <= data.len(), so 16 readable bytes exist
        let v = unsafe { v128_load(data.as_ptr().add(offset) as *const v128) };
        acc = f32x4_add(acc, v);
    }

    // Horizontal sum of the 4 lanes
    let mut result = f32x4_extract_lane::<0>(acc)
        + f32x4_extract_lane::<1>(acc)
        + f32x4_extract_lane::<2>(acc)
        + f32x4_extract_lane::<3>(acc);

    // Handle remainder
    for i in (chunks * 4)..data.len() {
        result += data[i];
    }

    result
}

Minimizing Interop Overhead

// Bad: calling Wasm per element
for (let i = 0; i < 10000; i++) {
  instance.exports.process_item(data[i]); // 10,000 boundary crossings
}

// Good: pass entire buffer, process in Wasm
const ptr = writeArrayToWasm(data);
instance.exports.process_batch(ptr, data.length); // 1 boundary crossing
const results = readArrayFromWasm(resultPtr, data.length);

// Wasm side: batch processing (Rust)
#[no_mangle]
pub extern "C" fn process_batch(ptr: *const f64, len: usize) -> f64 {
    let data = unsafe { std::slice::from_raw_parts(ptr, len) };
    data.iter().map(|x| x * x + 1.0).sum()
}

Pre-allocation to Avoid Memory Growth

// Pre-allocate all needed memory upfront
static mut BUFFER: [u8; 1024 * 1024] = [0; 1024 * 1024]; // 1 MB static buffer

#[no_mangle]
pub extern "C" fn get_buffer_ptr() -> *mut u8 {
    // addr_of_mut! avoids creating a reference to a `static mut`
    unsafe { std::ptr::addr_of_mut!(BUFFER) as *mut u8 }
}

#[no_mangle]
pub extern "C" fn get_buffer_len() -> usize {
    1024 * 1024
}
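
On the JS side, the pre-allocated buffer gets one long-lived typed-array view instead of per-call allocations. A sketch using a hand-assembled module that only exports one page of memory, standing in for the Rust module above (offset 0 plays the role of `get_buffer_ptr()`):

```javascript
// Hand-assembled module exporting one page (64 KiB) of linear memory.
const bytes = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00,             // \0asm, version 1
  0x05, 0x03, 0x01, 0x00, 0x01,                                // memory section: min 1 page
  0x07, 0x0a, 0x01, 0x06, 0x6d, 0x65, 0x6d, 0x6f, 0x72, 0x79,
  0x02, 0x00,                                                  // export "memory"
]);
const { exports } = new WebAssembly.Instance(new WebAssembly.Module(bytes));

// One reusable view over the pre-allocated region. Views are invalidated
// whenever memory grows — another reason to size memory upfront.
const view = new Uint8Array(exports.memory.buffer, 0, 1024);
view.set([1, 2, 3]); // write input directly, no per-call allocation
```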

Profiling in Chrome DevTools

// Mark performance sections for Wasm profiling
performance.mark("wasm-start");
instance.exports.heavy_computation();
performance.mark("wasm-end");
performance.measure("wasm-compute", "wasm-start", "wasm-end");

const measures = performance.getEntriesByName("wasm-compute");
console.log(`Wasm computation: ${measures[0].duration.toFixed(2)}ms`);

Steps for browser profiling:

  1. Open Chrome DevTools, go to the Performance tab
  2. Enable WebAssembly in the timeline settings (gear icon)
  3. Record a profile while the Wasm code runs
  4. Wasm functions appear in the flame chart with their export names
  5. Use the Bottom-Up view to find the hottest Wasm functions

Benchmarking with Wasmtime

# Profile with perf (Linux) or Instruments (macOS)
wasmtime run --profile perfmap app.wasm

# Measure compilation time separately
time wasmtime compile app.wasm
time wasmtime run --allow-precompiled app.cwasm

# Fuel-based execution metering
wasmtime run --fuel 1000000 app.wasm

Lazy Instantiation

// Compile eagerly, instantiate lazily
const modulePromise = WebAssembly.compileStreaming(fetch("heavy.wasm"));

async function getHeavyModule() {
  const module = await modulePromise;
  // Only instantiate when actually needed
  return WebAssembly.instantiate(module, imports);
}

// First call pays instantiation cost; subsequent calls can reuse or pool instances

Tree-Shaking Wasm Exports

// Only export what consumers need. Each export prevents dead-code elimination.
// Avoid: pub fn helper() that is only used internally

// Use #[inline(always)] for small internal helpers
#[inline(always)]
fn internal_helper(x: f64) -> f64 {
    x * x + 2.0 * x + 1.0
}

#[no_mangle]
pub extern "C" fn compute(x: f64) -> f64 {
    internal_helper(x) // inlined, no extra function in output
}

Best Practices

  • Always run wasm-opt on release builds — even after compiler optimizations, wasm-opt typically reduces size by 10-30% and can improve execution speed.
  • Profile before optimizing — use browser DevTools or Wasmtime's profiling to identify actual bottlenecks. Do not optimize blind.
  • Batch interop calls — the JS/Wasm boundary costs 10-50ns per call. Passing a buffer of 1000 items is vastly cheaper than 1000 individual calls.
  • Use SIMD for data-parallel work — Wasm SIMD (v128) is supported in all modern browsers and Wasmtime. Image processing, physics, and audio processing see 2-4x speedups.
  • Pre-allocate memory — memory.grow() is expensive and invalidates all typed array views. Allocate sufficient memory upfront based on expected workload.
  • Prefer f32 over f64 when precision allows — f32 operations are faster and enable 4-wide SIMD instead of 2-wide.
  • Enable LTO for Rust — link-time optimization enables cross-crate dead code elimination and inlining, critical for small Wasm binaries.
  • Use AOT compilation for server-side — eliminates JIT compilation latency entirely, important for serverless cold starts.

Common Pitfalls

  • Benchmarking debug builds — debug Wasm is 5-20x slower than optimized. Always benchmark release builds with wasm-opt applied.
  • Ignoring binary size — a 5MB Wasm file takes seconds to download and compile. Users perceive this as slowness even if execution is fast. Target under 500KB for most web applications.
  • Over-using memory.grow() — each growth operation is expensive and invalidates all JS typed array views. Growing inside a hot loop causes severe performance degradation.
  • Not considering compilation time — Wasm must be compiled to native code before execution. For large modules (>1MB), compilation can take hundreds of milliseconds, dominating total latency.
  • Measuring wall time without warm-up — browser JIT tiers promote Wasm functions from baseline to optimized after repeated calls. First-run and steady-state performance differ significantly.
  • Allocating in hot loops — heap allocation in Wasm triggers the allocator and may cause memory growth. Pre-allocate buffers outside the loop and reuse them.
  • Assuming SIMD is always faster — SIMD has overhead for shuffling data into vector registers. For short arrays or non-uniform access patterns, scalar code can be faster.
  • Not stripping debug info — debug symbols and name sections can double binary size. Strip them for production with wasm-opt --strip-debug or strip = true in Cargo.toml.
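
The warm-up pitfall can be handled with a small harness that discards the first N calls before timing. A sketch — `work` below is a plain JS stand-in for a Wasm export such as `instance.exports.heavy_computation`:

```javascript
// Time fn only after `warmup` untimed calls, so the engine's optimizing
// tier has had a chance to promote the hot code before we measure.
function benchmark(fn, { warmup = 50, runs = 100 } = {}) {
  for (let i = 0; i < warmup; i++) fn(); // warm-up: not measured
  const start = performance.now();
  for (let i = 0; i < runs; i++) fn();
  return (performance.now() - start) / runs; // mean ms per call, steady state
}

// Stand-in workload for instance.exports.heavy_computation
const work = () => { let s = 0; for (let i = 0; i < 1e5; i++) s += i; return s; };
console.log(`${benchmark(work).toFixed(3)} ms/call`);
```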

Core Philosophy

WebAssembly performance optimization is a pipeline problem, not a single-step fix. The binary must be small enough to download quickly, compile fast enough to start without delay, execute efficiently in hot loops, and cross the JS/Wasm boundary as rarely as possible. Optimizing only one dimension while ignoring others produces a fast module that takes forever to load, or a tiny module that runs slowly. Measure all four dimensions — binary size, startup time, throughput, and interop cost — and optimize the bottleneck.

Profile before optimizing. Browser DevTools and Wasmtime both provide profiling tools that show exactly which functions are hot, how much time is spent at the boundary, and where memory growth occurs. Optimizing blind — "let me try SIMD" — wastes effort on non-bottlenecks and may make things worse. Identify the actual hotspot, then apply the targeted optimization.

wasm-opt is not optional. Even after compiler-level optimization with -O3 or opt-level = "z", running wasm-opt on the output typically reduces binary size by 10-30% and can improve execution speed. It applies Wasm-specific optimizations that language-level compilers cannot: dead code elimination across modules, function inlining based on Wasm semantics, and constant propagation through the stack machine model.

Anti-Patterns

  • Benchmarking debug builds — debug Wasm is 5-20x slower than optimized release builds; performance measurements on debug builds are meaningless and lead to incorrect optimization decisions.

  • Ignoring binary size — a 5MB Wasm file takes seconds to download and compile, dominating total latency even if execution is fast; target under 500KB for most web applications.

  • Growing memory inside hot loops — each memory.grow() call is expensive, invalidates all JS TypedArray views, and may force the engine to move the backing store; pre-allocate sufficient memory before the hot loop begins.

  • Calling Wasm functions individually for each data element — the JS/Wasm boundary costs 10-50ns per call; processing an array of 10,000 elements should be one call with a buffer, not 10,000 individual function calls.

  • Shipping production builds without running wasm-opt — skipping the post-compilation optimization pass leaves 10-30% of size reduction and potential speed improvements on the table.
