# Agent-Driven Browser Tasks
Putting an LLM in control of a browser is the central capability of the agentic web. The agent sees a page, decides what to do, takes an action, sees the result, decides what to do next. The loop is general; the implementation is full of surprises.
This skill covers the patterns that make LLM-driven browser automation work. Distinct from headless scraping (which knows the target structure) and human-driven testing (which has fixed scripts), agent-driven browser tasks adapt to the page they see.
## The Action Set
The agent needs a defined set of actions. Standard set:
- **Navigate** to a URL.
- **Click** on an element identified by a description.
- **Type** text into a field identified by a description.
- **Scroll** the page (up, down, to element).
- **Wait** for content to load.
- **Screenshot** the current page.
- **Get text** from an element.
- **Read URL** of the current page.
- **Press key** (Enter, Tab, Escape).
- **Go back** in history.
The action set is small. The agent decides which to use; the framework executes. Each action returns a result the agent can read in its next reasoning step.
The action signatures matter. Each action should:
- Be ambiguity-free in its name (`click_button` vs. `click_link`).
- Accept human-readable descriptions of targets, not selectors.
- Return concise feedback (the agent doesn't need a 5,000-token DOM dump).
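A minimal sketch of such an action set as a Python registry over a Playwright page. The handler names, signatures, and result strings here are illustrative choices, not a fixed API; description-to-element resolution is covered in its own section below.

```python
# Sketch: an action registry over Playwright's sync API.
# Names, signatures, and return strings are illustrative.
from playwright.sync_api import Page

def navigate(page: Page, url: str) -> str:
    page.goto(url)
    return f"Navigated to {page.url}"

def click(page: Page, description: str) -> str:
    # A real framework resolves the description to an element itself;
    # here we lean on Playwright's accessible-name matching as a stand-in.
    page.get_by_role("button", name=description).first.click()
    return f"Clicked '{description}'"

def type_text(page: Page, description: str, text: str) -> str:
    page.get_by_label(description).first.fill(text)
    return f"Typed into '{description}'"

def press_key(page: Page, key: str) -> str:
    page.keyboard.press(key)  # e.g. "Enter", "Tab", "Escape"
    return f"Pressed {key}"

# The agent picks an action by name; the framework executes it and
# returns the short result string as the next observation.
ACTIONS = {
    "navigate": navigate,
    "click": click,
    "type": type_text,
    "press_key": press_key,
}
```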
## Two Models of Page Context
Two main approaches to letting the agent see the page:
### 1. Accessibility Tree (Text)
Convert the page to its accessibility tree — a structured text representation of headings, links, buttons, form fields. Pass to the agent as context.
```
[1] heading "Welcome to Example"
[2] link "Sign in" -> /login
[3] textbox "Search"
[4] button "Submit"
```
Pros:
- Compact; works in small context windows.
- Easy for the agent to reason about (text in, text out).
- Stable; markup changes don't break it as long as semantics hold.
Cons:
- Misses visual layout.
- Can't see images, colors, positions.
- Sites without good accessibility have poor trees.
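A sketch of building such an indexed tree from Playwright's accessibility snapshot. Which roles count as interactive is a design choice, and `build_tree` is an illustrative name:

```python
# Sketch: flatten Playwright's accessibility snapshot into an indexed
# text tree plus a lookup list. Role filter is a design choice.
from playwright.sync_api import Page

INTERACTIVE_ROLES = {"link", "button", "textbox", "checkbox", "combobox"}

def build_tree(page: Page) -> tuple[str, list[dict]]:
    snapshot = page.accessibility.snapshot() or {}
    lines: list[str] = []
    elements: list[dict] = []

    def walk(node: dict) -> None:
        role, name = node.get("role", ""), node.get("name", "")
        if role in INTERACTIVE_ROLES or role == "heading":
            elements.append(node)  # keep for index -> element lookup
            lines.append(f'[{len(elements)}] {role} "{name}"')  # 1-based
        for child in node.get("children", []):
            walk(child)

    walk(snapshot)
    return "\n".join(lines), elements
```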
### 2. Screenshot (Vision)
Take a screenshot. Pass to a vision-capable model. The model decides what to do based on what it sees.
Pros:
- Sees what a human sees.
- Robust to sites with poor accessibility.
- Captures visual cues (highlighted elements, error states, modals).
Cons:
- Expensive (vision tokens).
- The model has to figure out coordinates or locate by description.
- Slower per step.
Hybrid: screenshot + accessibility tree together. The model uses the tree for actions (e.g., `click [2]`) and the screenshot for situational understanding.
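One plausible shape for that hybrid context, assuming a generic image-plus-text multimodal message format; adapt the payload to your model provider's actual API:

```python
# Sketch: assemble hybrid page context for one agent turn. The message
# shape follows a generic "image + text" multimodal format; check your
# provider's documentation for the real payload structure.
import base64
from playwright.sync_api import Page

def page_context(page: Page, tree_text: str, goal: str) -> list[dict]:
    png = page.screenshot()
    return [
        {"type": "image",
         "source": {"type": "base64", "media_type": "image/png",
                    "data": base64.b64encode(png).decode()}},
        {"type": "text",
         "text": f"Goal: {goal}\n\nAccessibility tree:\n{tree_text}\n\n"
                 "Reply with one action, referencing elements by index."},
    ]
```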
## Element Identification
The hardest problem in agent-driven browser tasks is connecting "the agent says click the Submit button" to "this DOM element."
Approaches:
### Indexed Tree
Number every interactive element on the page ([1], [2], ...). The agent says "click [4]." The framework translates the index to the corresponding element.
Stable within a single page state; changes when the page mutates. Best for short tasks where the page is captured, the agent decides, and the action is taken before the page changes.
### Description-Based
The agent says "click the Submit button at the bottom of the form." The framework finds the element matching the description.
Robust to mutations. Requires fuzzy matching: the framework uses heuristics (semantic similarity, role + text matching) to find the most likely element.
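A rough sketch of role-plus-text matching over the flat element list from the earlier `build_tree` sketch. The scoring weights and threshold are deliberately simple placeholders; production systems often add semantic (embedding) similarity:

```python
# Sketch: score candidates against a description by role + text overlap.
# `elements` is the flat list from build_tree above; the weights and
# the 0.3 threshold are arbitrary illustrative choices.
def find_element(elements: list[dict], description: str) -> dict | None:
    words = set(description.lower().split())
    best, best_score = None, 0.0
    for node in elements:
        name_words = set(node.get("name", "").lower().split())
        score = len(words & name_words) / max(len(words), 1)
        if node.get("role", "") in description.lower():
            score += 0.5  # description mentions the role ("button", "link")
        if score > best_score:
            best, best_score = node, score
    return best if best_score > 0.3 else None
```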
### Coordinate-Based
The agent (vision-capable) says "click at (450, 720)." The framework clicks those coordinates.
Most flexible; least robust. Responsive layouts, scroll positions, and zoom levels change how the page renders, and the coordinates miss.
### Hybrid
Use indexed tree as primary; fall back to description-based when the index is stale.
Most production systems converge on hybrid. The reliability comes from combining approaches.
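A sketch of the fallback logic, reusing the illustrative helpers above. The staleness check (does the indexed element's name still match the description?) is one heuristic among many:

```python
# Sketch: indexed lookup first, description matching as fallback.
# `tree_elements` is the list captured with the last page snapshot;
# indices are 1-based to match the tree text.
def resolve(tree_elements: list[dict], index: int | None,
            description: str) -> dict | None:
    if index is not None and 0 <= index - 1 < len(tree_elements):
        node = tree_elements[index - 1]
        # Guard against a stale index: the name should still match.
        if not description or node.get("name", "").lower() in description.lower():
            return node
    return find_element(tree_elements, description)  # defined above
```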
## The Agent Loop
A typical agent loop:
1. Get current page state (screenshot + tree)
2. Pass to agent with goal: "what should you do next?"
3. Agent returns an action (click, type, etc.)
4. Execute the action
5. Wait for the page to update
6. Loop back to step 1
Each iteration is one "turn." The agent reasons, acts, sees the result, decides the next action. Tasks complete when the agent says "task complete" or fail when a budget is exceeded.
Per-iteration overhead: 2-5 seconds for vision-capable models, 0.5-1.5 seconds for text-only with accessibility trees. Tasks of 5-20 turns are common; 50+ turn tasks should be checked for whether the agent is making progress.
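The loop, sketched with the illustrative pieces above. `agent_decide` stands in for your model call and is not a real API:

```python
# Sketch of the turn loop. build_tree and ACTIONS come from the earlier
# sketches; agent_decide is a hypothetical stand-in for the model call,
# returning e.g. {"name": "click", "args": {"description": "Submit"}}.
from playwright.sync_api import Page

def run_task(page: Page, goal: str, max_steps: int = 30) -> str:
    history: list[str] = []
    for _ in range(max_steps):
        tree_text, _elements = build_tree(page)          # 1. current state
        action = agent_decide(goal, tree_text, history)  # 2-3. model picks action
        if action["name"] == "done":
            return action.get("summary", "task complete")
        result = ACTIONS[action["name"]](page, **action["args"])  # 4. execute
        page.wait_for_load_state("networkidle")          # 5. settle
        history.append(result)                           # feeds the next turn
    return "step budget exceeded"                        # 6. budget stop
```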
## Loop Stopping Criteria
Agent loops without termination criteria run forever. Set explicit stopping criteria:
- **Goal completion.** The agent declares success.
- **Step budget.** Maximum number of iterations (e.g., 30).
- **Time budget.** Maximum wall-clock time (e.g., 5 minutes).
- **No-progress detection.** N consecutive iterations without state change suggest the agent is stuck.
- **Repeated action detection.** The agent attempting the same action 3 times in a row indicates it's looping.
When a stop criterion fires, return the failure with the conversation history so a human can diagnose.
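A sketch of the last three criteria as a small tracker; window sizes and thresholds are illustrative defaults, not recommendations:

```python
# Sketch: stall and repeat detection over recent history. Call
# should_stop once per turn with a key for the chosen action and a
# hash of the page state (e.g. of the tree text).
from collections import deque

class StopTracker:
    def __init__(self, max_steps=30, max_repeats=3, stall_window=4):
        self.max_steps = max_steps
        self.actions = deque(maxlen=max_repeats)
        self.states = deque(maxlen=stall_window)
        self.steps = 0

    def should_stop(self, action_key: str, state_hash: str) -> str | None:
        self.steps += 1
        self.actions.append(action_key)
        self.states.append(state_hash)
        if self.steps >= self.max_steps:
            return "step budget exceeded"
        if len(self.actions) == self.actions.maxlen and len(set(self.actions)) == 1:
            return "repeated action: agent is looping"
        if len(self.states) == self.states.maxlen and len(set(self.states)) == 1:
            return "no progress: page state unchanged"
        return None
```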
## Common Failure Modes
Agents driving browsers fail in distinctive ways:
### Hallucinated Elements
The agent decides to click "the Login button" but no such element exists on this page. The framework either errors or clicks something wrong.
Mitigation: ground the agent's reasoning in the actual current state. Pass the accessibility tree explicitly; require the agent to reference indexed elements; reject actions that don't match.
### Stuck in Confirmation Modals
The page opens a modal asking "are you sure?" The agent doesn't notice. It retries the original click and fails because the modal is in the way.
Mitigation: detect modals by visual or DOM signals; alert the agent that one is present. Some frameworks include modal detection as a built-in.
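A minimal DOM-side check, assuming Playwright; the selector list is a heuristic and real sites vary widely:

```python
# Sketch: cheap modal detection via common dialog selectors.
from playwright.sync_api import Page

MODAL_SELECTORS = '[role="dialog"], [role="alertdialog"], .modal'

def modal_present(page: Page) -> bool:
    return page.locator(MODAL_SELECTORS).count() > 0
```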
### Login and Captchas
The task requires login. The agent doesn't have credentials, or hits a captcha. The task fails.
Mitigation: for tasks that involve known sites, log in upfront with stored credentials; pass the authenticated session to the agent. For captchas, the agent likely cannot proceed; build a human-in-the-loop fallback.
### Page Loads That Don't Settle
The agent acts before the page is ready. The action targets a stale element; the result is unexpected.
Mitigation: framework-side waiting before each action. Don't trust the agent to know about loading; the framework should wait until the page is settled.
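A sketch of framework-side settling with Playwright: wait for network idle, then poll for DOM stability. The poll count and interval are arbitrary:

```python
# Sketch: settle before every action. Network idle first, then poll
# until the DOM stops growing. 4 polls x 250 ms is a guess, not a rule.
from playwright.sync_api import Page

def wait_settled(page: Page, polls: int = 4, interval_ms: int = 250) -> None:
    page.wait_for_load_state("networkidle")
    last = -1
    for _ in range(polls):
        size = page.evaluate("document.body.innerHTML.length")
        if size == last:
            return  # DOM stopped changing
        last = size
        page.wait_for_timeout(interval_ms)
```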
### Task Drift
The agent strays from the original goal: it looks at unrelated content, reasons itself into a different task, or gets distracted by a sidebar.
Mitigation: include the original goal in every prompt iteration. The agent's prompt restates the goal at each turn.
## Validation and Output
For data-extraction tasks, validate the output before trusting it:
- Schema validation (the extracted data has the expected fields).
- Sanity checks (numbers in plausible ranges; dates parse).
- Cross-reference (the data matches what the agent said it found).
Many agent-task failures are silent — the agent succeeds in its own reasoning but produces wrong output. Validation catches them.
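A sketch of these checks for a hypothetical listing-extraction task; the field names and ranges are placeholders for whatever schema your task actually uses:

```python
# Sketch: validate extracted data before trusting it. "title", "price",
# and "posted" are placeholder fields; the price range is illustrative.
from datetime import date

def validate_listing(data: dict) -> list[str]:
    errors = []
    for field in ("title", "price", "posted"):  # schema check
        if field not in data:
            errors.append(f"missing field: {field}")
    price = data.get("price")
    if not isinstance(price, (int, float)) or not 0 < price < 1_000_000:
        errors.append(f"implausible price: {price!r}")  # sanity check
    try:
        date.fromisoformat(str(data.get("posted", "")))  # dates parse
    except ValueError:
        errors.append("posted date does not parse")
    return errors
```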
## Cost Management
Agent-driven browser tasks are expensive. Per-task cost can be cents to dollars depending on model and task length.
Cost levers:
- Use cheaper models for early iterations; switch to expensive models only on complex steps.
- Cache page states; reduce redundant captures.
- Set step budgets; bail when exceeded.
- Use accessibility trees instead of screenshots when possible.
- Deduplicate similar tasks across users.
Track cost per task; the metric drives optimization.
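A sketch of per-task tracking with a tiered model choice; the model names and per-token prices are placeholders, not real rates:

```python
# Sketch: cost meter with a budget and a tiered model picker.
# Prices and model names are illustrative placeholders.
PRICE_PER_1K = {"small-model": 0.0005, "large-vision-model": 0.015}

class CostMeter:
    def __init__(self, budget_usd: float = 0.50):
        self.budget, self.spent = budget_usd, 0.0

    def charge(self, model: str, tokens: int) -> None:
        self.spent += PRICE_PER_1K[model] * tokens / 1000

    def pick_model(self, needs_vision: bool) -> str:
        # Cheap text model by default; vision model only when a step needs it.
        return "large-vision-model" if needs_vision else "small-model"

    @property
    def exhausted(self) -> bool:
        return self.spent >= self.budget
```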
## Anti-Patterns
- **Open-ended loops without stopping criteria.** The agent loops forever. Set step and time budgets.
- **No element grounding.** The agent hallucinates elements. Pass the actual page state and require references.
- **Trusting agent declarations of success.** "Task complete" without validation. Verify the actual outcome.
- **Vision-only on every step.** Expensive. Use the accessibility tree where possible; vision when needed.
- **No screenshot on failure.** Failure investigation then relies on logs alone. Save screenshots and DOM at each step for post-failure debugging.
- **Manual selectors mixed in.** The framework lets the agent specify a CSS selector; the agent picks brittle ones. Restrict to the action set.