Evaluation-Driven Agent Development
Build the eval suite that tells you whether changes to your agent are improvements or regressions
The fundamental challenge of agent development: changes that look good in spot-checks regress in cases you didn't think to test. The agent is more capable on the new prompt but worse on the edge cases. You don't notice until users hit the regression.
The solution is an evaluation suite. A set of inputs and expected behaviors. Run the suite on every change. The change ships only if it improves the suite or holds it steady on important metrics.
This is not a luxury. Agent systems without evals drift. Each change might improve the system on the cases the developer remembered, but the system as a whole gets worse. Evals are the calibration that prevents drift.
What an Eval Set Looks Like
An eval set is a curated list of test cases. Each case has:
- Input. The user query or task.
- Expected behavior. What good output looks like, or what specific properties it should have.
- Tags. Categories the case represents.
- Importance. Some cases are critical; some are nice-to-have.
Example case:
```json
{
  "id": "search-001",
  "input": "Find me documents about onboarding new employees",
  "expected_tools": ["search_documents"],
  "expected_args_pattern": { "query": ".*onboarding.*employees.*" },
  "expected_response_contains": ["document"],
  "tags": ["search", "common"],
  "importance": "high"
}
```
The eval framework runs the agent on the input, observes which tools it called and what it returned, and scores against the expected behavior.
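A minimal sketch of such a scorer, assuming the case schema above and an agent harness that reports which tools were called, with what arguments, and what the final response was (all of those shapes are assumptions, not a fixed API):

```python
import re

def score_case(case, tools_called, tool_args, response):
    """Return pass/fail checks for one eval case against an observed agent run."""
    checks = {}
    # Tool selection: every expected tool must have been called.
    checks["tools"] = all(t in tools_called for t in case.get("expected_tools", []))
    # Argument patterns: each named argument must match its regex.
    checks["args"] = all(
        re.search(pattern, str(tool_args.get(arg, "")))
        for arg, pattern in case.get("expected_args_pattern", {}).items()
    )
    # Response content: required substrings must appear in the final answer.
    checks["response"] = all(
        needle in response for needle in case.get("expected_response_contains", [])
    )
    checks["passed"] = all(checks.values())
    return checks
```

Running the `search-001` case through this scorer with a run that called `search_documents` on a query containing "onboarding" and "employees" would pass; a run whose response never mentions a document would fail on the `response` check.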
Building the Eval Set
Start with the cases users actually run. Look at production logs (or simulated user queries from test users); pick the most common patterns; codify them as test cases.
Then add edge cases:
- The query that's vague: "What about that thing?" — agent should ask for clarification.
- The query with conflicting info: contradictory specifications. Agent should resolve or escalate.
- The query that requires no tool: chitchat. Agent shouldn't call tools.
- The query that fails to find an answer: low-information state. Agent should report not-found, not hallucinate.
- The adversarial query: prompt-injection attempts.
- The high-stakes query: actions with side effects. Agent should confirm.
Each category produces 5-20 cases. The set grows over time as you encounter new patterns in production.
A starting eval set has 50-100 cases. A mature one has 500-2000. Cover breadth before depth: many categories, few cases each, then deepen the categories that matter most.
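One way to keep breadth honest is a mechanical coverage check: count cases per tag and flag categories that are too thin. A sketch, assuming the case schema above (the `min_cases` threshold is an arbitrary assumption):

```python
from collections import Counter

def tag_coverage(cases, min_cases=5):
    """Count eval cases per tag; return the counts and the under-covered tags."""
    counts = Counter(tag for case in cases for tag in case.get("tags", []))
    thin = {tag: n for tag, n in counts.items() if n < min_cases}
    return counts, thin
```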
Scoring Metrics
Beyond simple pass/fail, evaluate on multiple dimensions:
Tool Selection
Did the agent call the right tools? In the right order? With the right arguments?
Score: did search_documents get called? Did its query parameter contain the user's keywords? Was it called with reasonable parameters (e.g., max_results <= 20)?
This metric is mechanical and easy to verify. Often the most important: 80% of agent failures are tool-selection failures.
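The checks in the paragraph above can be written as plain assertions over the recorded tool calls. A sketch, assuming a `search_documents` tool and a tool-call record of `{"name": ..., "args": ...}` dicts (both assumptions):

```python
def check_search_call(tool_calls, user_keywords, max_results_cap=20):
    """Mechanical tool-selection checks: called at all, query has the
    user's keywords, parameters within reasonable bounds."""
    calls = [c for c in tool_calls if c["name"] == "search_documents"]
    if not calls:
        return {"called": False, "query_ok": False, "params_ok": False}
    args = calls[0]["args"]
    query = args.get("query", "").lower()
    return {
        "called": True,
        "query_ok": all(kw.lower() in query for kw in user_keywords),
        "params_ok": args.get("max_results", 0) <= max_results_cap,
    }
```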
Output Correctness
Did the agent produce a correct answer?
For factual questions: does the answer match the ground truth? Use exact match for simple cases; use semantic match (LLM-as-judge) for nuanced ones.
For generative tasks: does the output meet specified criteria? Length, format, content properties.
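The two-tier correctness check can be sketched as: cheap normalized exact match first, semantic match only as a fallback. Here `judge` stands in for an LLM-as-judge callable (a hypothetical hook, not a real API):

```python
def answer_correct(answer, ground_truth, judge=None):
    """Exact match after normalization; optional semantic fallback."""
    def normalize(s):
        return " ".join(s.lower().split()).rstrip(".")
    if normalize(answer) == normalize(ground_truth):
        return True  # exact match: no LLM call needed
    if judge is not None:
        return judge(answer, ground_truth)  # semantic fallback
    return False
```

Keeping the exact-match path first means most cases never pay for a judge call.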
Output Quality
Beyond correctness, is the output well-written? Useful?
This is harder to measure programmatically. LLM-as-judge: a separate LLM evaluates the response on criteria (accuracy, helpfulness, clarity). Calibrate the judge by spot-checking its scores.
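A sketch of a judge rubric prompt. The criteria list and the 1-5 scale are assumptions to illustrate the shape; swap in your own rubric, and calibrate the scores against human ratings before trusting them:

```python
CRITERIA = ["accuracy", "helpfulness", "clarity"]

def build_judge_prompt(user_input, agent_response):
    """Build a rubric-based grading prompt for a judge LLM."""
    rubric = "\n".join(f"- {c}: score 1-5" for c in CRITERIA)
    return (
        "You are grading an agent's response.\n\n"
        f"User input:\n{user_input}\n\n"
        f"Agent response:\n{agent_response}\n\n"
        f"Score each criterion:\n{rubric}\n"
        'Reply as JSON: {"accuracy": n, "helpfulness": n, "clarity": n}'
    )
```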
Cost
How many tokens did the run take? How many tool calls? How long did it take?
Cost is a real metric. An agent that improves on quality but doubles the cost may not be a net improvement.
Failure Mode Distribution
When the agent fails, what kind of failure?
- Wrong tool called.
- Right tool, wrong arguments.
- Right tool, hallucinated response.
- Right tool, response is unhelpful.
- Tool call failure not gracefully handled.
- Loop / stuck.
The distribution of failures tells you where to focus. If most failures are wrong-tool calls, work on tool descriptions. If most are hallucinations, work on grounding.
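Failure triage is a one-liner once each failed case carries a mode label. A sketch, assuming results records with `passed` and `failure_mode` fields:

```python
from collections import Counter

def failure_distribution(results):
    """Bucket failed cases by failure mode; read off where to focus."""
    return Counter(r["failure_mode"] for r in results if not r["passed"])
```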
LLM-as-Judge Cautions
Using an LLM to score outputs is convenient but has pitfalls:
- Inconsistency. Same input scored differently across runs. Use deterministic settings (temperature 0); average over multiple runs for noisy cases.
- Bias. Some judges have systematic biases (e.g., preferring longer responses, more polite ones). Calibrate against human ratings.
- Cost. Judging adds LLM calls; cost can dominate. Sample cases; don't judge every run.
Use LLM-as-judge for cases where mechanical scoring isn't possible. Use mechanical scoring (exact match, regex, schema validation) wherever feasible.
Running Evals
Evals run:
- On every PR that touches the agent (CI gating).
- Nightly on the full eval set.
- On request when investigating issues.
- Periodically (weekly) on a larger production-like set to catch drift.
CI gating means: a PR that regresses the eval is blocked from merging. The team has to either fix the regression or update the eval set with explicit acknowledgment of why the previously-correct behavior is now different.
The full eval set is too large to run on every PR. Use a hot subset for PRs (50-100 cases) and the full set for nightly. Choose the hot subset so every tag is represented.
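One way to pick that hot subset, sketched under the case schema above: take every high-importance case, then a seeded random sample per tag so each category stays represented:

```python
import random

def hot_subset(cases, per_tag=3, seed=0):
    """All high-importance cases plus a per-tag sample, deterministically."""
    rng = random.Random(seed)
    picked = {c["id"] for c in cases if c.get("importance") == "high"}
    by_tag = {}
    for c in cases:
        for tag in c.get("tags", []):
            by_tag.setdefault(tag, []).append(c)
    for tag, group in by_tag.items():
        for c in rng.sample(group, min(per_tag, len(group))):
            picked.add(c["id"])
    return [c for c in cases if c["id"] in picked]
```

The fixed seed keeps the PR subset stable across runs, so score changes reflect the code change rather than subset churn.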
Eval Maintenance
Eval sets need maintenance:
- Add cases as users encounter them. Production failures become eval cases. The set grows.
- Update expected behavior when the desired behavior changes. The eval is a contract; if you change the contract, document why and update the cases.
- Prune obsolete cases. Some cases describe behaviors that are no longer relevant (deprecated tools, removed features). Remove them.
- Calibrate periodically. Run human evaluators on a sample; compare against your automated scoring; adjust scoring if it diverges.
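The periodic calibration step can be sketched as comparing automated scores to human ratings on a sample and flagging divergence via mean absolute error (the 0.15 threshold is an assumption; tune it to your scoring scale):

```python
def judge_calibration(auto_scores, human_scores, threshold=0.15):
    """Mean absolute error between automated and human scores on a sample."""
    assert len(auto_scores) == len(human_scores) and auto_scores
    mae = sum(abs(a - h) for a, h in zip(auto_scores, human_scores)) / len(auto_scores)
    return {"mae": mae, "calibrated": mae <= threshold}
```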
Eval sets that don't grow stop being useful. Eval sets that grow without curation become noise. Both pitfalls require attention.
Eval-Driven Development
The discipline:
- Before changing the agent, identify the cases the change should improve. If they're not in the eval set, add them.
- Run evals before the change. Note the baseline.
- Make the change.
- Run evals after. Compare.
- If the change improves the targeted cases without regressing others, ship.
- If it regresses some cases, decide: are those cases still important? If yes, the change isn't ready; iterate. If no, document the rationale and update the set.
This process slows individual changes but produces an agent that gets better over time without backsliding.
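The before/after comparison in the loop above reduces to a small function. A sketch, assuming `baseline` and `candidate` map case id to pass/fail and `targeted` names the cases the change was meant to improve:

```python
def evaluate_change(baseline, candidate, targeted):
    """Ship only if targeted cases improved and nothing else regressed."""
    improved = [cid for cid in targeted
                if candidate[cid] and not baseline[cid]]
    regressed = [cid for cid, passed in baseline.items()
                 if passed and not candidate[cid]]
    return {
        "improved_targeted": improved,
        "regressed": regressed,
        "ship": bool(improved) and not regressed,
    }
```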
When the Eval Disagrees with Intuition
You change the agent; eval scores improve; but spot-checks feel worse. What now?
Two possibilities:
- The eval set is missing the dimension you're sensing. Add cases that capture it.
- Your intuition is unreliable. The eval is more right than your spot-check.
Both happen. Investigate. Don't override the eval based on a single spot-check; do investigate which is correct.
Over time, the eval becomes the source of truth. Spot-checks supplement.
Anti-Patterns
No evals. Changes ship based on developer feel. The agent gets worse on cases nobody remembered. The team is surprised.
Tiny eval set. 5 cases. The agent passes them but fails on production. Coverage is too narrow.
Pass/fail only. No quality dimensions. The agent technically completes but unhelpfully. Multi-metric scoring catches this.
LLM-as-judge without calibration. Scoring is inconsistent; gains and losses are noise. Calibrate judges against human ratings.
No CI gating. Eval failures don't block PRs. Engineers ignore them. Gate.
Stale eval set. Cases describe behavior that's no longer desired. The set blocks legitimate changes. Maintain.