Evaluation-Driven Agent Development

Build the eval suite that tells you whether changes to your agent actually make it better


The fundamental challenge of agent development: changes that look good in spot-checks regress in cases you didn't think to test. The agent is more capable on the new prompt but worse on the edge cases. You don't notice until users hit the regression.

The solution is an evaluation suite. A set of inputs and expected behaviors. Run the suite on every change. The change ships only if it improves the suite or holds it steady on important metrics.

This is not a luxury. Agent systems without evals drift. Each change might improve the system on the cases the developer remembered, but the system as a whole gets worse. Evals are the calibration that prevents drift.

What an Eval Set Looks Like

An eval set is a curated list of test cases. Each case has:

  • Input. The user query or task.
  • Expected behavior. What good output looks like, or what specific properties it should have.
  • Tags. Categories the case represents.
  • Importance. Some cases are critical; some are nice-to-have.

Example case:

{
  "id": "search-001",
  "input": "Find me documents about onboarding new employees",
  "expected_tools": ["search_documents"],
  "expected_args_pattern": { "query": ".*onboarding.*employees.*" },
  "expected_response_contains": ["document"],
  "tags": ["search", "common"],
  "importance": "high"
}

The eval framework runs the agent on the input, observes which tools it called and what it returned, and scores against the expected behavior.
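
A minimal runner sketch in Python, assuming a run_agent function that returns the tool calls made and the final text (the result shape here is an assumption, not a fixed format):

import re

def score_case(case, result):
    # result: {"tool_calls": [{"name": str, "args": dict}], "response": str}
    checks = []
    called = [t["name"] for t in result["tool_calls"]]
    checks.append(set(case["expected_tools"]) <= set(called))
    for arg, pattern in case.get("expected_args_pattern", {}).items():
        values = (t["args"].get(arg, "") for t in result["tool_calls"])
        checks.append(any(re.search(pattern, str(v)) for v in values))
    for needle in case.get("expected_response_contains", []):
        checks.append(needle in result["response"].lower())
    return all(checks)

def run_suite(cases, run_agent):
    results = {c["id"]: score_case(c, run_agent(c["input"])) for c in cases}
    print(f"{sum(results.values())}/{len(results)} passed")
    return results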

Building the Eval Set

Start with the cases users actually run. Look at production logs (or simulated user queries from test users); pick the most common patterns; codify them as test cases.

Then add edge cases:

  • The query that's vague: "What about that thing?" — agent should ask for clarification.
  • The query with conflicting info: contradictory specifications. Agent should resolve or escalate.
  • The query that requires no tool: chitchat. Agent shouldn't call tools.
  • The query that fails to find an answer: low-information state. Agent should report not-found, not hallucinate.
  • The adversarial query: prompt-injection attempts.
  • The high-stakes query: actions with side effects. Agent should confirm.
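
Edge cases fit the same format. A no-tool case might look like this (max_tool_calls is an illustrative field for asserting that no tools fire):

{
  "id": "chitchat-001",
  "input": "Thanks, that's all for today!",
  "expected_tools": [],
  "max_tool_calls": 0,
  "tags": ["no-tool", "chitchat"],
  "importance": "medium"
}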

Each category produces 5-20 cases. The set grows over time as you encounter new patterns in production.

A starting eval set has 50-100 cases. A mature one has 500-2000. Cover breadth before depth: many categories, few cases each, then deepen the categories that matter most.

Scoring Metrics

Beyond simple pass/fail, evaluate on multiple dimensions:

Tool Selection

Did the agent call the right tools? In the right order? With the right arguments?

Score: did search_documents get called? Did its query parameter contain the user's keywords? Was it called with reasonable parameters (e.g., max_results <= 20)?

This metric is mechanical and easy to verify. Often the most important: 80% of agent failures are tool-selection failures.
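
A sketch of those checks, using the same result shape as the runner above (expected_tool_order is a hypothetical field for order-sensitive cases):

def score_tool_selection(case, tool_calls):
    called = [t["name"] for t in tool_calls]
    # Right tools in the right order: the expected sequence must appear
    # as a subsequence of the calls actually made.
    it = iter(called)
    in_order = all(name in it for name in case.get("expected_tool_order", []))
    # Reasonable arguments: bound checks are cheap and deterministic.
    bounded = all(t["args"].get("max_results", 0) <= 20
                  for t in tool_calls if t["name"] == "search_documents")
    return in_order and bounded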

Output Correctness

Did the agent produce a correct answer?

For factual questions: does the answer match the ground truth? Use exact match for simple cases; use semantic match (LLM-as-judge) for nuanced ones.

For generative tasks: does the output meet specified criteria? Length, format, content properties.
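
A dispatch sketch; expected_answer and judge_criteria are hypothetical case fields, and judge_semantic stands in for the LLM-as-judge call described next:

def score_correctness(case, response):
    if "expected_answer" in case:    # simple factual case: exact match
        return response.strip().lower() == case["expected_answer"].lower()
    if "judge_criteria" in case:     # nuanced case: defer to a judge
        return judge_semantic(response, case["judge_criteria"])
    return None                      # no ground truth recorded for this case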

Output Quality

Beyond correctness, is the output well-written? Useful?

This is harder to measure programmatically. LLM-as-judge: a separate LLM evaluates the response on criteria (accuracy, helpfulness, clarity). Calibrate the judge by spot-checking its scores.
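
One way to frame the judge, assuming an OpenAI-style chat client (any provider works; the model name is a placeholder):

import json

JUDGE_PROMPT = """Rate the response 1-5 on accuracy, helpfulness, and clarity.
Question: {question}
Response: {response}
Reply as JSON: {{"accuracy": n, "helpfulness": n, "clarity": n}}"""

def judge(client, question, response):
    completion = client.chat.completions.create(
        model="gpt-4o",    # placeholder judge model
        temperature=0,     # deterministic settings, per the cautions below
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, response=response)}],
    )
    return json.loads(completion.choices[0].message.content)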

Cost

How many tokens did the run take? How many tool calls? How long did it take?

Cost is a real metric. An agent that improves on quality but doubles the cost may not be a net improvement.
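
Cost is cheap to record per run; a sketch that wraps the runner:

import time

def run_with_cost(run_agent, query):
    start = time.monotonic()
    result = run_agent(query)
    return result, {
        "latency_s": time.monotonic() - start,
        "tool_calls": len(result["tool_calls"]),
        # assumes the agent reports token usage; otherwise leave None
        "tokens": result.get("usage", {}).get("total_tokens"),
    }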

Failure Mode Distribution

When the agent fails, what kind of failure?

  • Wrong tool called.
  • Right tool, wrong arguments.
  • Right tool, hallucinated response.
  • Right tool, response is unhelpful.
  • Tool call failure not gracefully handled.
  • Loop / stuck.

The distribution of failures tells you where to focus. If most failures are wrong-tool calls, work on tool descriptions. If most are hallucinations, work on grounding.
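
Tag each failure when scoring, then aggregate; a sketch using the categories above:

from collections import Counter

def failure_distribution(results):
    # results: list of {"passed": bool, "failure_mode": "wrong_tool" | ...}
    modes = Counter(r["failure_mode"] for r in results if not r["passed"])
    for mode, count in modes.most_common():
        print(f"{mode}: {count}")
    return modes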

LLM-as-Judge Cautions

Using an LLM to score outputs is convenient but has pitfalls:

  • Inconsistency. Same input scored differently across runs. Use deterministic settings (temperature 0); average over multiple runs for noisy cases.
  • Bias. Some judges have systematic biases (e.g., preferring longer responses, more polite ones). Calibrate against human ratings.
  • Cost. Judging adds LLM calls; cost can dominate. Sample cases; don't judge every run.

Use LLM-as-judge for cases where mechanical scoring isn't possible. Use mechanical scoring (exact match, regex, schema validation) wherever feasible.
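
For the noisy cases, judge several times and average, per the first caution above (reusing the judge sketch from earlier):

from statistics import mean

def judge_averaged(client, question, response, runs=3):
    scores = [judge(client, question, response) for _ in range(runs)]
    return {k: mean(s[k] for s in scores) for k in scores[0]}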

Running Evals

Evals run:

  • On every PR that touches the agent (CI gating).
  • Nightly on the full eval set.
  • On request when investigating issues.
  • Periodically (weekly) on a larger production-like set to catch drift.

CI gating means: a PR that regresses the eval is blocked from merging. The team has to either fix the regression or update the eval set with explicit acknowledgment of why the previously-correct behavior is now different.

The full eval set is too large for every PR. Use a hot subset for PRs (50-100 cases) and the full set for nightly. Select the hot subset by tag so it stays representative across categories.
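
The gate itself can be a comparison of per-case results against a committed baseline; a sketch, assuming pass/fail results keyed by case id:

import json
import sys

def gate(baseline_path, current):
    with open(baseline_path) as f:
        baseline = json.load(f)
    regressions = [cid for cid, ok in baseline.items()
                   if ok and not current.get(cid, False)]
    if regressions:
        print("Regressed cases:", ", ".join(sorted(regressions)))
        sys.exit(1)   # block the merge
    print("No regressions against baseline.")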

Eval Maintenance

Eval sets need maintenance:

  • Add cases as users encounter them. Production failures become eval cases. The set grows.
  • Update expected behavior when the desired behavior changes. The eval is a contract; if you change the contract, document why and update the cases.
  • Prune obsolete cases. Some cases describe behaviors that are no longer relevant (deprecated tools, removed features). Remove them.
  • Calibrate periodically. Run human evaluators on a sample; compare against your automated scoring; adjust scoring if it diverges.
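
Calibration (the last item) can start as a simple correlation check between human and automated scores on a shared sample; statistics.correlation needs Python 3.10+:

from statistics import correlation

def calibration_check(human_scores, auto_scores, threshold=0.7):
    # Aligned per-case scores from human raters and the automated judge.
    r = correlation(human_scores, auto_scores)
    if r < threshold:   # the threshold is a judgment call, not a standard
        print(f"Judge diverges from humans (r={r:.2f}); recalibrate.")
    return r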

Eval sets that don't grow stop being useful. Eval sets that grow without curation become noise. Both pitfalls require attention.

Eval-Driven Development

The discipline:

  1. Before changing the agent, identify the cases the change should improve. If they're not in the eval set, add them.
  2. Run evals before the change. Note the baseline.
  3. Make the change.
  4. Run evals after. Compare.
  5. If the change improves the targeted cases without regressing others, ship.
  6. If it regresses some cases, decide: are those cases still important? If yes, the change isn't ready; iterate. If no, document the rationale and update the set.

This process slows individual changes but produces an agent that gets better over time without backsliding.

When the Eval Disagrees with Intuition

You change the agent; eval scores improve; but spot-checks feel worse. What now?

Two possibilities:

  1. The eval set is missing the dimension you're sensing. Add cases that capture it.
  2. Your intuition is unreliable. The eval is more right than your spot-check.

Both happen. Investigate. Don't override the eval based on a single spot-check; do investigate which is correct.

Over time, the eval becomes the source of truth. Spot-checks supplement.

Anti-Patterns

No evals. Changes ship based on developer feel. The agent gets worse on cases nobody remembered. The team is surprised.

Tiny eval set. 5 cases. The agent passes them but fails on production. Coverage is too narrow.

Pass/fail only. No quality dimensions. The agent technically completes but unhelpfully. Multi-metric scoring catches this.

LLM-as-judge without calibration. Scoring is inconsistent; gains and losses are noise. Calibrate judges against human ratings.

No CI gating. Eval failures don't block PRs. Engineers ignore them. Gate.

Stale eval set. Cases describe behavior that's no longer desired. The set blocks legitimate changes. Maintain.
