Performance
Optimizing WebAssembly performance including binary size, execution speed, memory usage, and profiling techniques
You are an expert in WebAssembly performance optimization.
Overview
WebAssembly runs at near-native speed, but achieving peak performance requires deliberate optimization of binary size, execution throughput, memory layout, and host interop overhead. Performance work spans the entire pipeline: source language choices, compiler flags, Wasm-specific optimizations with wasm-opt, runtime configuration, and profiling with browser DevTools or standalone tools.
Core Concepts
Performance Dimensions
| Dimension | What It Affects |
|---|---|
| Binary size | Download time, compilation time, cache size |
| Startup time | Time to compile + instantiate the module |
| Throughput | Raw computation speed of hot loops |
| Memory usage | Linear memory footprint, GC pressure |
| Interop cost | Overhead of crossing JS/Wasm boundary |
Compilation Tiers
Modern Wasm runtimes use tiered compilation:
- Baseline/Liftoff (V8) — fast, unoptimized compilation for quick startup
- Optimizing/TurboFan (V8) — slower compilation producing faster code
- Cranelift (Wasmtime) — optimizing compiler, supports both JIT and AOT
- AOT — pre-compile to native code, eliminating runtime compilation entirely
wasm-opt Optimization Levels
| Flag | Effect |
|---|---|
| `-O` | Standard optimization |
| `-O2` | Aggressive optimization (more inlining) |
| `-O3` | Maximum speed optimization |
| `-Os` | Optimize for size (moderate) |
| `-Oz` | Optimize aggressively for size |
| `-O4` | Equivalent to `-O3` with additional flattening passes |
Implementation Patterns
Binary Size Reduction (Rust)
# Cargo.toml
[profile.release]
opt-level = "z" # optimize for size
lto = true # link-time optimization (enables cross-crate inlining)
codegen-units = 1 # single codegen unit for better LTO
panic = "abort" # remove unwinding code (~10% size reduction)
strip = true # strip debug symbols
[dependencies]
wasm-bindgen = "0.2"
# Use a small allocator instead of the default
lol_alloc = "0.4"
// Use lol_alloc as the global allocator
use lol_alloc::AssumeSingleThreaded;
use lol_alloc::FreeListAllocator;
#[global_allocator]
static ALLOCATOR: AssumeSingleThreaded<FreeListAllocator> =
unsafe { AssumeSingleThreaded::new(FreeListAllocator::new()) };
# Build and optimize
cargo build --target wasm32-unknown-unknown --release
wasm-opt -Oz -o output.wasm target/wasm32-unknown-unknown/release/app.wasm
# Check size
wasm-opt --metrics output.wasm 2>&1 | head -5
ls -la output.wasm
Strip Unused Exports
# wasm-opt can remove unused functions
wasm-opt -Oz --remove-unused-module-elements \
  -o optimized.wasm input.wasm
# For Rust, ensure only needed functions are public
# Use #[wasm_bindgen] only on functions you actually export
Memory-Efficient Data Layout
// Bad: interleaved layout forces padding for alignment
// (without #[repr(C)], Rust is free to reorder fields itself)
#[repr(C)]
struct ParticleBad {
    active: bool,  // 1 byte + 7 padding
    x: f64,        // 8 bytes
    y: f64,        // 8 bytes
    active2: bool, // 1 byte + 7 padding
    vx: f64,       // 8 bytes
}
// Total: 40 bytes with padding
// Good: group by size to minimize padding
#[repr(C)]
struct ParticleGood {
x: f64, // 8 bytes
y: f64, // 8 bytes
vx: f64, // 8 bytes
vy: f64, // 8 bytes
active: bool, // 1 byte
_pad: [u8; 7], // explicit padding
}
// Total: 40 bytes, but fields accessed together are colocated
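The padding effect can be verified directly with `std::mem::size_of`; a minimal host-runnable sketch (the struct names here are illustrative, mirroring the layouts above):

```rust
use std::mem::size_of;

// Interleaved layout: every bool forces 7 bytes of padding before the next f64.
#[repr(C)]
struct Interleaved {
    active: bool,
    x: f64,
    y: f64,
    active2: bool,
    vx: f64,
}

// Grouped layout: 8-byte fields first, so padding collapses to the tail.
#[repr(C)]
struct Grouped {
    x: f64,
    y: f64,
    vx: f64,
    active: bool,
    active2: bool,
}

/// Returns (interleaved, grouped) sizes in bytes.
pub fn layout_sizes() -> (usize, usize) {
    (size_of::<Interleaved>(), size_of::<Grouped>()) // (40, 32)
}
```

Note that without `#[repr(C)]` the Rust compiler may reorder fields on its own, which often removes this padding automatically; the explicit repr matters when the layout is shared with JS or C.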
Structure of Arrays (SoA) for SIMD
// Array of Structures (AoS) — poor for SIMD
struct Particles {
particles: Vec<Particle>,
}
// Structure of Arrays (SoA) — SIMD-friendly
struct ParticleSystem {
x: Vec<f64>,
y: Vec<f64>,
vx: Vec<f64>,
vy: Vec<f64>,
count: usize,
}
impl ParticleSystem {
fn update(&mut self, dt: f64) {
// This loop auto-vectorizes well with SIMD
for i in 0..self.count {
self.x[i] += self.vx[i] * dt;
self.y[i] += self.vy[i] * dt;
}
}
}
SIMD Intrinsics (Rust)
use std::arch::wasm32::*;
pub fn sum_f32_simd(data: &[f32]) -> f32 {
let chunks = data.len() / 4;
let mut acc = f32x4_splat(0.0);
for i in 0..chunks {
let offset = i * 4;
// v128_load is unsafe: the chunk bound above guarantees 16 readable bytes
let v = unsafe { v128_load(data[offset..].as_ptr() as *const v128) };
acc = f32x4_add(acc, v);
}
// Horizontal sum of the 4 lanes
let mut result = f32x4_extract_lane::<0>(acc)
+ f32x4_extract_lane::<1>(acc)
+ f32x4_extract_lane::<2>(acc)
+ f32x4_extract_lane::<3>(acc);
// Handle remainder
for i in (chunks * 4)..data.len() {
result += data[i];
}
result
}
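These intrinsics exist only on the wasm32 target, so crates that also build natively usually gate the SIMD path behind a `cfg` and keep a scalar fallback. A minimal sketch, assuming `sum_f32_simd` is the function above (the `sum_f32` wrapper name is illustrative):

```rust
// SIMD path: compiled only for wasm32 with the simd128 feature enabled.
#[cfg(all(target_arch = "wasm32", target_feature = "simd128"))]
pub fn sum_f32(data: &[f32]) -> f32 {
    sum_f32_simd(data)
}

// Scalar fallback: keeps the crate compiling and testable everywhere else;
// LLVM will often auto-vectorize this simple reduction anyway.
#[cfg(not(all(target_arch = "wasm32", target_feature = "simd128")))]
pub fn sum_f32(data: &[f32]) -> f32 {
    data.iter().sum()
}
```

Build the SIMD path with `RUSTFLAGS="-C target-feature=+simd128"` (or a `target-feature` entry in `.cargo/config.toml`); on any other target, `cargo test` exercises the scalar version.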
Minimizing Interop Overhead
// Bad: calling Wasm per element
for (let i = 0; i < 10000; i++) {
instance.exports.process_item(data[i]); // 10,000 boundary crossings
}
// Good: pass entire buffer, process in Wasm
const ptr = writeArrayToWasm(data);
instance.exports.process_batch(ptr, data.length); // 1 boundary crossing
const results = readArrayFromWasm(resultPtr, data.length);
// Wasm side: batch processing
#[no_mangle]
pub extern "C" fn process_batch(ptr: *const f64, len: usize) -> f64 {
let data = unsafe { std::slice::from_raw_parts(ptr, len) };
data.iter().map(|x| x * x + 1.0).sum()
}
Pre-allocation to Avoid Memory Growth
// Pre-allocate all needed memory upfront
const BUFFER_LEN: usize = 1024 * 1024; // 1 MiB static buffer
static mut BUFFER: [u8; BUFFER_LEN] = [0; BUFFER_LEN];
#[no_mangle]
pub extern "C" fn get_buffer_ptr() -> *mut u8 {
    // addr_of_mut! avoids creating a reference to a `static mut`
    unsafe { std::ptr::addr_of_mut!(BUFFER) as *mut u8 }
}
#[no_mangle]
pub extern "C" fn get_buffer_len() -> usize {
    BUFFER_LEN
}
Profiling in Chrome DevTools
// Mark performance sections for Wasm profiling
performance.mark("wasm-start");
instance.exports.heavy_computation();
performance.mark("wasm-end");
performance.measure("wasm-compute", "wasm-start", "wasm-end");
const measures = performance.getEntriesByName("wasm-compute");
console.log(`Wasm computation: ${measures[0].duration.toFixed(2)}ms`);
Steps for browser profiling:
- Open Chrome DevTools, go to the Performance tab
- Enable WebAssembly in the timeline settings (gear icon)
- Record a profile while the Wasm code runs
- Wasm functions appear in the flame chart with their export names
- Use the Bottom-Up view to find the hottest Wasm functions
Benchmarking with Wasmtime
# Profile with perf (Linux) or Instruments (macOS)
wasmtime run --profile perfmap app.wasm
# Measure compilation time separately
time wasmtime compile app.wasm
time wasmtime run --allow-precompiled app.cwasm
# Fuel-based execution metering
wasmtime run --fuel 1000000 app.wasm
Lazy Instantiation
// Compile eagerly, instantiate lazily
const modulePromise = WebAssembly.compileStreaming(fetch("heavy.wasm"));
async function getHeavyModule() {
const module = await modulePromise;
// Only instantiate when actually needed
return WebAssembly.instantiate(module, imports);
}
// First call pays instantiation cost; subsequent calls can reuse or pool instances
Tree-Shaking Wasm Exports
// Only export what consumers need. Each export prevents dead-code elimination.
// Avoid: pub fn helper() that is only used internally
// Use #[inline(always)] for small internal helpers
#[inline(always)]
fn internal_helper(x: f64) -> f64 {
x * x + 2.0 * x + 1.0
}
#[no_mangle]
pub extern "C" fn compute(x: f64) -> f64 {
internal_helper(x) // inlined, no extra function in output
}
Best Practices
- Always run `wasm-opt` on release builds — even after compiler optimizations, `wasm-opt` typically reduces size by 10-30% and can improve execution speed.
- Profile before optimizing — use browser DevTools or Wasmtime's profiling to identify actual bottlenecks. Do not optimize blind.
- Batch interop calls — the JS/Wasm boundary costs 10-50ns per call. Passing a buffer of 1000 items is vastly cheaper than 1000 individual calls.
- Use SIMD for data-parallel work — Wasm SIMD (`v128`) is supported in all modern browsers and Wasmtime. Image processing, physics, and audio processing see 2-4x speedups.
- Pre-allocate memory — `memory.grow()` is expensive and invalidates all typed array views. Allocate sufficient memory upfront based on expected workload.
- Prefer `f32` over `f64` when precision allows — `f32` operations are faster and enable 4-wide SIMD instead of 2-wide.
- Enable LTO for Rust — link-time optimization enables cross-crate dead code elimination and inlining, critical for small Wasm binaries.
- Use AOT compilation for server-side — eliminates JIT compilation latency entirely, important for serverless cold starts.
Common Pitfalls
- Benchmarking debug builds — debug Wasm is 5-20x slower than optimized. Always benchmark release builds with `wasm-opt` applied.
- Ignoring binary size — a 5MB Wasm file takes seconds to download and compile. Users perceive this as slowness even if execution is fast. Target under 500KB for most web applications.
- Over-using `memory.grow()` — each growth operation is expensive and invalidates all JS typed array views. Growing inside a hot loop causes severe performance degradation.
- Not considering compilation time — Wasm must be compiled to native code before execution. For large modules (>1MB), compilation can take hundreds of milliseconds, dominating total latency.
- Measuring wall time without warm-up — browser JIT tiers promote Wasm functions from baseline to optimized after repeated calls. First-run and steady-state performance differ significantly.
- Allocating in hot loops — heap allocation in Wasm triggers the allocator and may cause memory growth. Pre-allocate buffers outside the loop and reuse them.
- Assuming SIMD is always faster — SIMD has overhead for shuffling data into vector registers. For short arrays or non-uniform access patterns, scalar code can be faster.
- Not stripping debug info — debug symbols and name sections can double binary size. Strip them for production with `wasm-opt --strip-debug` or `strip = true` in Cargo.toml.
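The hot-loop allocation pitfall can be sketched in Rust; `smooth_per_call` and `smooth_into` are hypothetical names for illustration:

```rust
// Allocates a fresh Vec on every call: in a hot loop this hammers the
// allocator and can trigger memory growth mid-frame.
pub fn smooth_per_call(signal: &[f32]) -> Vec<f32> {
    signal.windows(2).map(|w| (w[0] + w[1]) / 2.0).collect()
}

// Reuses a caller-owned scratch buffer: clear() keeps the capacity, so
// after warm-up the hot loop performs zero allocations.
pub fn smooth_into(signal: &[f32], out: &mut Vec<f32>) {
    out.clear();
    out.extend(signal.windows(2).map(|w| (w[0] + w[1]) / 2.0));
}
```

The caller allocates the scratch `Vec` once with `Vec::with_capacity` before the loop and passes it into every iteration.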
Core Philosophy
WebAssembly performance optimization is a pipeline problem, not a single-step fix. The binary must be small enough to download quickly, compile fast enough to start without delay, execute efficiently in hot loops, and cross the JS/Wasm boundary as rarely as possible. Optimizing only one dimension while ignoring others produces a fast module that takes forever to load, or a tiny module that runs slowly. Measure all four dimensions — binary size, startup time, throughput, and interop cost — and optimize the bottleneck.
Profile before optimizing. Browser DevTools and Wasmtime both provide profiling tools that show exactly which functions are hot, how much time is spent at the boundary, and where memory growth occurs. Optimizing blind — "let me try SIMD" — wastes effort on non-bottlenecks and may make things worse. Identify the actual hotspot, then apply the targeted optimization.
wasm-opt is not optional. Even after compiler-level optimization with -O3 or opt-level = "z", running wasm-opt on the output typically reduces binary size by 10-30% and can improve execution speed. It applies Wasm-specific optimizations that language-level compilers cannot: dead code elimination across modules, function inlining based on Wasm semantics, and constant propagation through the stack machine model.
Anti-Patterns
- Benchmarking debug builds — debug Wasm is 5-20x slower than optimized release builds; performance measurements on debug builds are meaningless and lead to incorrect optimization decisions.
- Ignoring binary size — a 5MB Wasm file takes seconds to download and compile, dominating total latency even if execution is fast; target under 500KB for most web applications.
- Growing memory inside hot loops — each `memory.grow()` call is expensive, invalidates all JS TypedArray views, and may force the engine to copy the entire linear memory; pre-allocate sufficient memory before the hot loop begins.
- Calling Wasm functions individually for each data element — the JS/Wasm boundary costs 10-50ns per call; processing an array of 10,000 elements should be one call with a buffer, not 10,000 individual function calls.
- Shipping production builds without running `wasm-opt` — skipping the post-compilation optimization pass leaves 10-30% of size reduction and potential speed improvements on the table.
Related Skills
AssemblyScript
Writing WebAssembly modules using AssemblyScript, a TypeScript-like language that compiles to Wasm
JS Interop
JavaScript and WebAssembly interop patterns including memory sharing, type marshaling, and binding generation
Rust WASM
Compiling Rust to WebAssembly using wasm-pack, wasm-bindgen, and the Rust Wasm ecosystem
WASI
WebAssembly System Interface (WASI) for portable system-level access including filesystem, networking, and clocks
WASM Basics
WebAssembly fundamentals including module structure, types, memory model, and binary/text formats
WASM in Browser
Using WebAssembly in the browser for canvas rendering, audio processing, Web Workers, and DOM integration