# Debugging Flaky Browser Tests
Diagnose and fix flaky end-to-end tests. Covers the categories of flake, a diagnostic process for finding root causes, and architectural practices that prevent flakes.
A flaky test is a test that sometimes passes and sometimes fails on the same code. They are the slow death of CI. Once a team is rerunning failed tests by reflex, the suite has stopped catching real bugs and the team has stopped trusting it.
## Key Points
- **Capture everything at the moment of failure:** screenshot, DOM HTML, browser console logs, network requests/responses, and server logs from the application under test.
- **Each test creates its own state.** Users, fixtures, data. No shared state across tests.
- **Each test cleans up after itself** or runs in an isolated database/storage.
- **Avoid waitForTimeout entirely.** Replace with web-first assertions.
- **Stub all external services in tests.** Real services have downtime; stubs don't.
- **Run tests in parallel by default.** If a test fails in parallel but passes serially, it's hiding ordering dependencies.
- **Trace on retry.** Failed tests get a trace; the trace tells you why.
- **Quarantine, don't disable.** If a test is flaky and you can't fix it immediately, quarantine it: run it separately and don't gate CI on it. Don't delete it; the test catches a real concern.
## Quick Example
```
for i in {1..100}; do npm test -- --grep "the test" || echo "FAIL $i"; done
```
```ts
// fragile fix
await page.waitForTimeout(3000);
// proper fix
await expect(page.getByText('Profile updated')).toBeVisible();
```
Flake rates above 1% (one in a hundred runs) are noticeable. Above 5%, the suite is broken — engineers reflexively rerun rather than investigate. The goal is sub-0.1% flake rates: real failures only.
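The arithmetic compounds quickly across a suite. A quick sketch, with illustrative numbers, of how per-test flake rates translate into whole-run failures:

```ts
// Illustrative numbers: probability that a full CI run passes when each
// test flakes independently at a given rate.
const suiteSize = 300;

for (const flakeRate of [0.001, 0.005, 0.01]) {
  const suitePassRate = Math.pow(1 - flakeRate, suiteSize);
  console.log(
    `${(flakeRate * 100).toFixed(1)}% per test -> ` +
      `${(suitePassRate * 100).toFixed(0)}% of full runs pass`,
  );
}
// 0.1% per test -> 74% of full runs pass
// 0.5% per test -> 22% of full runs pass
// 1.0% per test -> 5% of full runs pass
```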
This skill covers diagnosing and eliminating browser-test flakes.
## Categories of Flake
Flakes come from a few categories. Identifying which category a flake belongs to is the first step.
### 1. Timing Flakes
The most common. The test acts before the page is ready. Click a button before it's rendered. Read text before it appears. Submit a form before validation runs.
**Symptoms:** passes most of the time; fails under load; fails consistently in slow CI environments.
**Cause:** an implicit assumption that something has happened when it hasn't yet; hardcoded sleeps that aren't long enough.
### 2. Ordering Flakes
The test depends on order with another test. Test A passes alone; test B passes alone; together they fail.
**Symptoms:** passes when run alone; fails in the full suite. Failure depends on which other tests ran first.
**Cause:** state leakage. Cookies, localStorage, database rows, or file-system artifacts that one test creates and another assumes are absent or present.
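Playwright gives each test a fresh browser context, so cookie and localStorage leakage usually points at a reused context, a shared storageState, or a shared backend. A hedged sketch of explicit cleanup, with `resetTestData` standing in for an app-specific helper:

```ts
import { test } from '@playwright/test';

// Hypothetical app-specific helper that removes rows this test created.
declare function resetTestData(): Promise<void>;

test.afterEach(async ({ context, page }) => {
  // Browser-side state. Playwright's per-test contexts already isolate this;
  // it matters only if you reuse a context or a saved storageState.
  await context.clearCookies();
  await page.evaluate(() => localStorage.clear());

  // Server-side state: delete whatever this test created.
  await resetTestData();
});
```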
### 3. Environment Flakes
The test depends on the environment. Time of day. Network conditions. CI runner specs.
**Symptoms:** passes locally; fails in CI. Or usually passes in CI but fails on certain runners.
**Cause:** the test reads system time, depends on external services, or assumes deterministic resource availability.
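For time-of-day flakes, the fix is to control the clock rather than loosen the assertion. A sketch using Playwright's clock API (available in recent releases, 1.45+); the route and heading here are hypothetical:

```ts
import { test, expect } from '@playwright/test';

test('renders the daily summary', async ({ page }) => {
  // Freeze "now" so the test behaves identically at 09:00 and at 23:59.
  await page.clock.install({ time: new Date('2024-06-01T12:00:00Z') });
  await page.goto('/summary');
  await expect(page.getByRole('heading', { name: 'June 1' })).toBeVisible();
});
```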
### 4. Real-Bug Flakes
The flake is the test correctly catching a real intermittent bug.
**Symptoms:** a small percentage of failures; the failure mode varies; production has occasional support tickets matching the failure mode.
**Cause:** an actual race condition, timing-dependent bug, or rare error path in the application.
The mistake teams make: treating real-bug flakes like timing flakes. Adding a wait. Hiding the real issue. Production then has the bug; users hit it; you find out via support.
## Diagnostic Process
When a test flakes:
### Step 1: Reproduce
Run the test in a loop. 100 iterations. Note the failure rate. If it's 100% with the right conditions (load, ordering), it's not flaky — it's broken. If it's 5%, you have a flake.
```
for i in {1..100}; do npm test -- --grep "the test" || echo "FAIL $i"; done
```
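With Playwright specifically, the built-in repeat flag does the same job without a shell loop:

```
npx playwright test --grep "the test" --repeat-each=100 --workers=4
```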
If you can't reproduce it locally, look at the CI artifacts. Trace files. Screenshots. DOM snapshots.
### Step 2: Capture Failure State
When the test fails, capture everything:
- Screenshot at the moment of failure.
- DOM HTML at the moment of failure.
- Console logs from the browser.
- Network requests/responses.
- Server logs from the application under test.
Playwright's trace does most of this; Cypress has time travel. Use the tools.
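In Playwright, most of this capture is a config setting. A minimal playwright.config.ts fragment that records artifacts on failure:

```ts
import { defineConfig } from '@playwright/test';

export default defineConfig({
  retries: process.env.CI ? 2 : 0,
  use: {
    trace: 'on-first-retry',       // DOM snapshots, network, console on retry
    screenshot: 'only-on-failure', // screenshot at the moment of failure
    video: 'retain-on-failure',    // keep video only for failed tests
  },
});
```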
### Step 3: Compare to Pass
Run the test until it passes. Capture the same state. Diff.
The diff shows what's different between pass and fail. Often, the diff is "the modal hadn't opened yet" or "the API response hadn't returned yet." Now you know which timing is the problem.
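One way to make the pass/fail diff mechanical is to capture identical artifacts at the same named checkpoint in both kinds of run. A sketch; the paths and naming are arbitrary:

```ts
import type { Page } from '@playwright/test';
import * as fs from 'fs/promises';

// Capture comparable artifacts at a named checkpoint so a passing run and a
// failing run can be diffed file by file.
async function captureState(page: Page, label: string): Promise<void> {
  await fs.mkdir(`debug/${label}`, { recursive: true });
  await page.screenshot({ path: `debug/${label}/screen.png`, fullPage: true });
  await fs.writeFile(`debug/${label}/dom.html`, await page.content());
}

// e.g. await captureState(page, 'after-save');
```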
### Step 4: Fix the Root Cause
Add a proper wait — not a sleep. Wait for the specific condition that distinguishes pass from fail.
```ts
// fragile fix
await page.waitForTimeout(3000);
// proper fix
await expect(page.getByText('Profile updated')).toBeVisible();
```
The proper fix is the test that asserts the expected state.
### Step 5: Verify
Run the fix in a loop. 100 iterations. Confirm the flake rate drops to 0%.
## Common Patterns That Cause Flakes
### Pattern: Click Before Render
```ts
// flaky
await page.goto('/dashboard');
await page.click('button.export');
```
The button might not exist when goto returns: goto resolves once navigation completes, but the page may still be rendering.
```ts
// fixed
await page.goto('/dashboard');
await page.getByRole('button', { name: 'Export' }).click();
```
The locator auto-waits for the button to be ready.
### Pattern: Read Before Update
```ts
// flaky
await page.click('button.save');
const message = await page.textContent('.success-message');
expect(message).toBe('Saved');
```
The success message renders asynchronously after save.
```ts
// fixed
await page.click('button.save');
await expect(page.locator('.success-message')).toHaveText('Saved');
```
The web-first assertion waits for the text to match.
### Pattern: Test Depends on Test
```ts
test('user signs up', async ({ page }) => {
  await page.fill('#email', 'alice@example.com');
  // ...
});

test('user logs in', async ({ page }) => {
  await page.fill('#email', 'alice@example.com');
  // assumes the previous test created Alice
});
```
If the suite reorders or runs in parallel, the second test fails.
```ts
// fixed: each test creates its own user
test('user logs in', async ({ page }) => {
  const email = `alice-${Date.now()}@example.com`;
  await createUser(email);
  await page.fill('#email', email);
  // ...
});
```
Each test is responsible for its own state.
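In Playwright, per-test state is conveniently expressed as a fixture: every test that asks for it gets a fresh user, and teardown runs even when the test fails. A sketch, reusing the hypothetical createUser helper from above and adding a matching deleteUser:

```ts
import { test as base } from '@playwright/test';

// Hypothetical app-specific helpers, as in the example above.
declare function createUser(email: string): Promise<void>;
declare function deleteUser(email: string): Promise<void>;

export const test = base.extend<{ userEmail: string }>({
  userEmail: async ({}, use) => {
    const email = `alice-${Date.now()}@example.com`;
    await createUser(email);  // fresh user per test
    await use(email);         // the test body runs here
    await deleteUser(email);  // teardown runs even if the test failed
  },
});
```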
### Pattern: External Dependency
```ts
// flaky
test('search returns results', async ({ page }) => {
  await page.goto('/search?q=test');
  await expect(page.locator('.results')).toBeVisible();
});
```
If the search depends on a third-party service, downtime there means flake.
```ts
// fixed: stub the third-party
test('search returns results', async ({ page }) => {
  await page.route('**/api/search**', (route) =>
    route.fulfill({ body: JSON.stringify(fixtureResults) })
  );
  await page.goto('/search?q=test');
  await expect(page.locator('.results')).toBeVisible();
});
```
The test doesn't depend on the external service's availability.
## Architectural Practices
To prevent flakes systemically:
- **Each test creates its own state.** Users, fixtures, data. No shared state across tests.
- **Each test cleans up after itself** or runs in an isolated database/storage.
- **Avoid waitForTimeout entirely.** Replace it with web-first assertions.
- **Stub all external services in tests.** Real services have downtime; stubs don't.
- **Run tests in parallel by default.** If a test fails in parallel but passes serially, it's hiding ordering dependencies.
- **Trace on retry.** Failed tests get a trace; the trace tells you why.
- **Quarantine, don't disable.** If a test is flaky and you can't fix it immediately, quarantine it: run it separately and don't gate CI on it. Don't delete it; the test catches a real concern. A tagging sketch follows this list.
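Quarantine can be as lightweight as a naming convention plus two CI jobs. A minimal sketch using Playwright's standard `--grep`/`--grep-invert` flags; the `@quarantine` tag is just a convention:

```
# Tag quarantined tests in their titles, e.g. test('user logs in @quarantine', ...)
# Gating CI job: everything except quarantined tests.
npx playwright test --grep-invert "@quarantine"
# Separate non-gating job: quarantined tests only, tracked but not blocking.
npx playwright test --grep "@quarantine"
```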
## Handling Real-Bug Flakes
When the flake is a real bug:
- Document the failure mode.
- Reproduce it reliably (using the diagnostic capture).
- Fix the bug in the application code, not the test.
- Re-run the test; it should now be stable.
The temptation to "just add a wait" makes the test pass but the bug stays in production. Resist.
## Flake Tracking
Track flakes:
- Flake rate per test (failures per 1000 runs).
- Time-since-last-failure.
- Top-flaky tests for the week.
The dashboard surfaces which tests need attention. Address the top three flaky tests every sprint; the suite gets healthier over time.
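The tracking itself needs little infrastructure: a log of per-run outcomes and a small aggregation. A sketch over a hypothetical record shape:

```ts
// Hypothetical record shape: one row per test execution on green main,
// so any failure is presumed to be a flake.
interface RunRecord {
  testId: string;
  passed: boolean;
}

// Flake rate per test, expressed as failures per 1000 runs.
function flakeRates(records: RunRecord[]): Map<string, number> {
  const totals = new Map<string, { runs: number; fails: number }>();
  for (const r of records) {
    const t = totals.get(r.testId) ?? { runs: 0, fails: 0 };
    t.runs += 1;
    if (!r.passed) t.fails += 1;
    totals.set(r.testId, t);
  }
  const rates = new Map<string, number>();
  for (const [id, t] of totals) rates.set(id, (t.fails / t.runs) * 1000);
  return rates;
}
```

Sort the result descending and you have the week's top-flaky list.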
## Anti-Patterns
- **waitForTimeout instead of web-first assertions.** Brittle, and it hides real timing issues. Replace it.
- **Disabling flaky tests.** The test was catching a real intermittent issue; now production has it. Quarantine, investigate, fix.
- **Reflexive retries without investigation.** "It's flaky, just rerun." The flake rate climbs; the suite stops catching anything. Investigate every flake.
- **Shared test state.** One test creates a user; another expects it. The tests can't run in parallel. Each test owns its state.
- **Real external dependencies in tests.** The third-party service goes down; the test fails. Stub everything.
- **No trace artifacts on failure.** Investigation requires reproducing locally and guessing. Enable tracing.