agent-evaluation
Testing and evaluating AI agents: trajectory evaluation, task completion metrics, tool-use accuracy measurement, regression testing, benchmark suites, and A/B testing agent configurations. Covers practical approaches to measuring whether agents are working correctly and improving over time.
skilldb get ai-agent-orchestration-skills/agent-evaluation
Full skill: 553 lines

Agent Evaluation
Measure agent performance systematically: track task completion, tool accuracy, regressions, and compare configurations.
Trajectory Evaluation
Evaluate not just the final answer, but the entire sequence of actions the agent took.
```python
import time
from dataclasses import dataclass, field

@dataclass
class AgentStep:
    tool_name: str
    tool_input: dict
    tool_result: str
    reasoning: str = ""

@dataclass
class AgentTrajectory:
    task: str
    steps: list[AgentStep] = field(default_factory=list)
    final_answer: str = ""
    total_tokens: int = 0
    total_time_ms: float = 0
    success: bool = False

def record_trajectory(task: str, tools: list[dict]) -> AgentTrajectory:
    """Run an agent and record its full trajectory.

    `client` (an Anthropic client), `extract_text`, and `execute_tool`
    are assumed to be defined elsewhere in the harness.
    """
    trajectory = AgentTrajectory(task=task)
    messages = [{"role": "user", "content": task}]
    start = time.time()
    for _ in range(20):  # hard cap on agent iterations
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=4096,
            tools=tools,
            messages=messages,
        )
        trajectory.total_tokens += response.usage.input_tokens + response.usage.output_tokens
        if response.stop_reason == "end_turn":
            trajectory.final_answer = extract_text(response)
            break
        messages.append({"role": "assistant", "content": response.content})
        # Treat the first text block in the response as the model's reasoning.
        reasoning = next((b.text for b in response.content if b.type == "text"), "")
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                result = execute_tool(block.name, block.input)
                trajectory.steps.append(AgentStep(
                    tool_name=block.name,
                    tool_input=block.input,
                    tool_result=result,
                    reasoning=reasoning,
                ))
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": result,
                })
        # All tool results for a turn must go back in a single user message.
        messages.append({"role": "user", "content": tool_results})
    trajectory.total_time_ms = (time.time() - start) * 1000
    return trajectory
```
Grading Trajectories
```python
import json

import anthropic

client = anthropic.Anthropic()

def grade_trajectory(trajectory: AgentTrajectory,
                     expected_answer: str | None = None,
                     expected_tools: list[str] | None = None) -> dict:
    """Grade an agent trajectory on multiple dimensions."""
    scores = {}
    # 1. Task completion (LLM-as-judge)
    if expected_answer:
        judge_response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=500,
            messages=[{
                "role": "user",
                "content": f"""Compare the agent's answer to the expected answer.
Score from 0-10 where 10 is a perfect match in meaning (exact wording not required).

Task: {trajectory.task}
Expected: {expected_answer}
Agent answer: {trajectory.final_answer}

Return ONLY a JSON object: {{"score": N, "reason": "..."}}""",
            }],
        )
        # Extract the JSON object even if the judge adds surrounding text.
        text = judge_response.content[0].text
        result = json.loads(text[text.index("{"):text.rindex("}") + 1])
        scores["completion"] = result["score"]
        scores["completion_reason"] = result["reason"]
    # 2. Tool selection (precision/recall against the expected tool set)
    num_steps = len(trajectory.steps)
    scores["num_steps"] = num_steps
    if expected_tools:
        used_set = {s.tool_name for s in trajectory.steps}
        expected_set = set(expected_tools)
        scores["tool_precision"] = len(used_set & expected_set) / max(len(used_set), 1)
        scores["tool_recall"] = len(used_set & expected_set) / max(len(expected_set), 1)
    # 3. Efficiency
    scores["total_tokens"] = trajectory.total_tokens
    scores["total_time_ms"] = trajectory.total_time_ms
    # 4. Error rate
    errors = sum(1 for s in trajectory.steps if s.tool_result.startswith("Error:"))
    scores["error_rate"] = errors / max(num_steps, 1)
    # 5. Redundancy (repeated identical tool calls)
    calls = [(s.tool_name, str(s.tool_input)) for s in trajectory.steps]
    scores["redundancy"] = 1 - (len(set(calls)) / len(calls)) if calls else 0.0
    return scores
```
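To make the tool-selection and redundancy scores concrete, here is a worked toy example using the same formulas as `grade_trajectory` (the tool names are hypothetical, and redundancy is computed on names alone here, where the full grader keys on `(tool_name, input)` pairs):

```python
# A toy trajectory where the agent over-reaches: it repeats a read and
# calls a tool the test never expected.
used = ["read_file", "read_file", "run_command", "web_search"]
expected = ["read_file", "run_command"]

used_set, expected_set = set(used), set(expected)

# Precision: how much of what the agent used was actually wanted.
precision = len(used_set & expected_set) / max(len(used_set), 1)
# Recall: how much of what was wanted the agent actually used.
recall = len(used_set & expected_set) / max(len(expected_set), 1)
# Redundancy: fraction of calls that repeat an earlier identical call.
redundancy = 1 - len(set(used)) / max(len(used), 1)

print(f"{precision:.2f} {recall:.2f} {redundancy:.2f}")  # 0.67 1.00 0.25
```

Low precision with high recall, as here, usually means the agent is flailing: it eventually finds the right tools but wastes calls along the way.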
Task Completion Metrics
Define clear pass/fail criteria for agent tasks.
```python
from typing import Callable

@dataclass
class TestCase:
    name: str
    task: str
    expected_answer: str = ""
    expected_tools: list[str] = field(default_factory=list)
    validation_fn: Callable | None = None  # returns (passed: bool, reason: str)
    max_steps: int = 15
    max_tokens: int = 50000
    timeout_ms: float = 60000

@dataclass
class TestResult:
    test_name: str
    passed: bool
    scores: dict
    trajectory: AgentTrajectory
    failure_reason: str = ""

def run_test_case(test: TestCase, tools: list[dict]) -> TestResult:
    """Run a single test case and evaluate the result."""
    trajectory = record_trajectory(test.task, tools)
    scores = grade_trajectory(
        trajectory,
        expected_answer=test.expected_answer,
        expected_tools=test.expected_tools,
    )
    # Determine pass/fail
    passed = True
    reasons = []
    # Check completion score
    if "completion" in scores and scores["completion"] < 7:
        passed = False
        reasons.append(f"Low completion score: {scores['completion']}/10")
    # Check step limit
    if len(trajectory.steps) > test.max_steps:
        passed = False
        reasons.append(f"Exceeded step limit: {len(trajectory.steps)} > {test.max_steps}")
    # Check token budget
    if trajectory.total_tokens > test.max_tokens:
        passed = False
        reasons.append(f"Exceeded token budget: {trajectory.total_tokens} > {test.max_tokens}")
    # Check timeout
    if trajectory.total_time_ms > test.timeout_ms:
        passed = False
        reasons.append(f"Exceeded timeout: {trajectory.total_time_ms:.0f}ms > {test.timeout_ms:.0f}ms")
    # Custom validation
    if test.validation_fn:
        custom_pass, custom_reason = test.validation_fn(trajectory)
        if not custom_pass:
            passed = False
            reasons.append(f"Custom validation failed: {custom_reason}")
    return TestResult(
        test_name=test.name,
        passed=passed,
        scores=scores,
        trajectory=trajectory,
        failure_reason="; ".join(reasons) if reasons else "",
    )
```
Benchmark Suite
Run a suite of tests and track results over time.
```python
import json
import time
from pathlib import Path

class BenchmarkSuite:
    """Run and track agent benchmarks."""

    def __init__(self, name: str, tests: list[TestCase],
                 results_dir: str = ".benchmarks"):
        self.name = name
        self.tests = tests
        self.results_dir = Path(results_dir)
        self.results_dir.mkdir(parents=True, exist_ok=True)

    def run(self, tools: list[dict], agent_config: dict = None) -> dict:
        """Run all tests and return aggregate results."""
        results = []
        for test in self.tests:
            print(f"Running: {test.name}...", end=" ")
            result = run_test_case(test, tools)
            print("PASS" if result.passed else "FAIL")
            if not result.passed:
                print(f"  Reason: {result.failure_reason}")
            results.append(result)
        # Aggregate
        passed = sum(1 for r in results if r.passed)
        total = len(results)
        avg_steps = sum(len(r.trajectory.steps) for r in results) / max(total, 1)
        avg_tokens = sum(r.trajectory.total_tokens for r in results) / max(total, 1)
        avg_time = sum(r.trajectory.total_time_ms for r in results) / max(total, 1)
        summary = {
            "suite": self.name,
            "timestamp": time.time(),
            "config": agent_config or {},
            "pass_rate": passed / max(total, 1),
            "passed": passed,
            "total": total,
            "avg_steps": avg_steps,
            "avg_tokens": avg_tokens,
            "avg_time_ms": avg_time,
            "test_results": [
                {
                    "name": r.test_name,
                    "passed": r.passed,
                    "scores": r.scores,
                    "failure_reason": r.failure_reason,
                    "steps": len(r.trajectory.steps),
                    "tokens": r.trajectory.total_tokens,
                }
                for r in results
            ],
        }
        # Save results
        filename = f"{self.name}_{int(time.time())}.json"
        (self.results_dir / filename).write_text(json.dumps(summary, indent=2))
        print(f"\n{'=' * 50}")
        print(f"Results: {passed}/{total} passed ({passed / max(total, 1):.0%})")
        print(f"Avg steps: {avg_steps:.1f}")
        print(f"Avg tokens: {avg_tokens:.0f}")
        print(f"Avg time: {avg_time:.0f}ms")
        return summary

# Define a benchmark
coding_benchmark = BenchmarkSuite(
    name="coding_tasks",
    tests=[
        TestCase(
            name="fizzbuzz",
            task="Write a Python function called fizzbuzz(n) that returns 'Fizz' for "
                 "multiples of 3, 'Buzz' for multiples of 5, 'FizzBuzz' for both, "
                 "and the number as a string otherwise. Save it to fizzbuzz.py and test it.",
            expected_tools=["write_file", "run_command"],
            validation_fn=lambda t: (
                any("fizzbuzz" in s.tool_input.get("path", "") for s in t.steps),
                "Should write fizzbuzz.py",
            ),
        ),
        TestCase(
            name="read_and_summarize",
            task="Read the file data.txt and provide a summary of its contents.",
            expected_tools=["read_file"],
            max_steps=5,
        ),
        TestCase(
            name="debug_error",
            task="Run 'python app.py' and fix any errors you find.",
            expected_tools=["run_command", "read_file", "write_file"],
            max_steps=10,
        ),
    ],
)

# Run benchmark
results = coding_benchmark.run(tools=available_tools,
                               agent_config={"model": "claude-sonnet-4-20250514"})
```
Regression Testing
Detect when agent changes cause previously passing tests to fail.
```python
class RegressionTracker:
    """Track test results over time and detect regressions."""

    def __init__(self, results_dir: str = ".benchmarks"):
        self.results_dir = Path(results_dir)

    def load_history(self, suite_name: str) -> list[dict]:
        """Load all past results for a suite, oldest first."""
        return [json.loads(f.read_text())
                for f in sorted(self.results_dir.glob(f"{suite_name}_*.json"))]

    def check_regressions(self, suite_name: str, current: dict) -> list[dict]:
        """Compare current results against the most recent prior run."""
        # Exclude the current run itself: BenchmarkSuite.run() has already
        # saved it to disk, so the newest file on disk IS the current run.
        history = [h for h in self.load_history(suite_name)
                   if h["timestamp"] < current["timestamp"]]
        if not history:
            return []
        previous = history[-1]
        regressions = []
        prev_results = {r["name"]: r for r in previous["test_results"]}
        curr_results = {r["name"]: r for r in current["test_results"]}
        for name, curr in curr_results.items():
            prev = prev_results.get(name)
            if prev and prev["passed"] and not curr["passed"]:
                regressions.append({
                    "test": name,
                    "was": "PASS",
                    "now": "FAIL",
                    "reason": curr.get("failure_reason", ""),
                })
            # Also flag significant performance regressions
            if prev and curr["tokens"] > prev["tokens"] * 1.5:
                regressions.append({
                    "test": name,
                    "type": "performance",
                    "metric": "tokens",
                    "was": prev["tokens"],
                    "now": curr["tokens"],
                })
        return regressions

tracker = RegressionTracker()

def run_with_regression_check(suite: BenchmarkSuite, tools: list[dict],
                              config: dict) -> dict:
    results = suite.run(tools, config)
    regressions = tracker.check_regressions(suite.name, results)
    if regressions:
        print(f"\nREGRESSIONS DETECTED ({len(regressions)}):")
        for r in regressions:
            if r.get("type") == "performance":
                print(f"  PERF: {r['test']} - {r['metric']}: {r['was']} -> {r['now']}")
            else:
                print(f"  FAIL: {r['test']} - was {r['was']}, now {r['now']}")
                print(f"    Reason: {r['reason']}")
    return results
```
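A regression check only pays off if it can fail the build. One way to map the regression list to a CI exit code (a sketch; the decision to warn rather than fail on performance-only regressions is an assumption you may want to tighten):

```python
def regression_exit_code(regressions: list[dict]) -> int:
    """Map a regression list to a process exit code for CI.

    Policy (an assumption): correctness regressions fail the build;
    performance-only regressions warn but still pass.
    """
    hard = [r for r in regressions if r.get("type") != "performance"]
    return 1 if hard else 0

# Usage sketch: end your benchmark script with
#   sys.exit(regression_exit_code(regressions))
# so CI goes red on any correctness regression.
print(regression_exit_code([{"test": "fizzbuzz", "was": "PASS", "now": "FAIL"}]))  # 1
```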
A/B Testing Agent Configurations
Compare different agent setups to find the best configuration.
```python
def ab_test(suite: BenchmarkSuite, tools: list[dict],
            config_a: dict, config_b: dict,
            num_runs: int = 3) -> dict:
    """A/B test two agent configurations."""
    results_a = []
    results_b = []
    for run_idx in range(num_runs):
        print(f"\n--- Run {run_idx + 1}/{num_runs} ---")
        print("Config A:")
        results_a.append(suite.run(tools, config_a))
        print("Config B:")
        results_b.append(suite.run(tools, config_b))

    # Aggregate results
    def aggregate(results_list):
        n = len(results_list)
        return {
            "avg_pass_rate": sum(r["pass_rate"] for r in results_list) / n,
            "avg_tokens": sum(r["avg_tokens"] for r in results_list) / n,
            "avg_steps": sum(r["avg_steps"] for r in results_list) / n,
            "avg_time_ms": sum(r["avg_time_ms"] for r in results_list) / n,
        }

    agg_a = aggregate(results_a)
    agg_b = aggregate(results_b)

    print(f"\n{'=' * 60}")
    print(f"A/B Test Results ({num_runs} runs each)")
    print(f"{'=' * 60}")
    print(f"{'Metric':<20} {'Config A':>12} {'Config B':>12} {'Winner':>10}")
    print(f"{'-' * 60}")
    metrics = [
        ("Pass Rate", "avg_pass_rate", True),
        ("Avg Tokens", "avg_tokens", False),
        ("Avg Steps", "avg_steps", False),
        ("Avg Time (ms)", "avg_time_ms", False),
    ]
    for label, key, higher_is_better in metrics:
        va, vb = agg_a[key], agg_b[key]
        if higher_is_better:
            winner = "A" if va > vb else "B" if vb > va else "Tie"
        else:
            winner = "A" if va < vb else "B" if vb < va else "Tie"
        print(f"{label:<20} {va:>12.2f} {vb:>12.2f} {winner:>10}")
    return {"config_a": agg_a, "config_b": agg_b}

# Example: compare model sizes
ab_test(
    suite=coding_benchmark,
    tools=available_tools,
    config_a={"model": "claude-sonnet-4-20250514", "max_tokens": 4096},
    config_b={"model": "claude-3-5-haiku-20241022", "max_tokens": 4096},
    num_runs=3,
)
```
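Three runs is a small sample, and run-to-run variance in agent behavior is real. This sketch (a heuristic, not a proper statistical test) flags when the observed gap between configs is small relative to that noise, in which case the "Winner" column should not be trusted without more runs:

```python
import statistics

def gap_exceeds_noise(rates_a: list[float], rates_b: list[float]) -> bool:
    """Heuristic sanity check, not a significance test: only trust the
    winner if the mean gap exceeds the worst run-to-run spread."""
    gap = abs(statistics.mean(rates_a) - statistics.mean(rates_b))
    spread = max(statistics.pstdev(rates_a), statistics.pstdev(rates_b))
    return gap > spread

# Clearly separated pass rates across three runs -> trust the winner.
print(gap_exceeds_noise([0.80, 0.90, 0.85], [0.60, 0.55, 0.65]))  # True
# Overlapping, noisy rates -> add more runs before deciding.
print(gap_exceeds_noise([0.80, 0.60, 0.70], [0.75, 0.65, 0.70]))  # False
```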
Tool-Use Accuracy Metrics
Specifically measure how well the agent uses tools.
```python
from collections import Counter

def analyze_tool_usage(trajectories: list[AgentTrajectory]) -> dict:
    """Analyze tool usage patterns across multiple trajectories."""
    all_calls = []
    for t in trajectories:
        for step in t.steps:
            all_calls.append({
                "tool": step.tool_name,
                "success": not step.tool_result.startswith("Error:"),
                "task": t.task,
            })
    if not all_calls:
        return {"message": "No tool calls recorded."}
    # Per-tool success rate
    tool_counts = Counter(c["tool"] for c in all_calls)
    tool_success = Counter(c["tool"] for c in all_calls if c["success"])
    tool_stats = {}
    for tool, count in tool_counts.items():
        success = tool_success.get(tool, 0)
        tool_stats[tool] = {
            "total_calls": count,
            "success_rate": success / count,
            "failures": count - success,
        }
    # Overall stats
    total = len(all_calls)
    successes = sum(1 for c in all_calls if c["success"])
    return {
        "total_tool_calls": total,
        "overall_success_rate": successes / total,
        "per_tool": tool_stats,
        "most_used": tool_counts.most_common(5),
    }
```
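The per-tool breakdown reduces to counting successes per tool name. A self-contained miniature (`Step` is a trimmed stand-in for the `AgentStep` dataclass defined earlier; the `Error:`-prefix convention is the same one the rest of this skill assumes):

```python
from collections import Counter
from dataclasses import dataclass

# Trimmed stand-in for AgentStep, just enough for a demo.
@dataclass
class Step:
    tool_name: str
    tool_result: str

steps = [
    Step("read_file", "file contents..."),
    Step("run_command", "Error: command not found"),
    Step("run_command", "exit code 0"),
]

# Pair each call with whether it succeeded, then count per tool name.
calls = [(s.tool_name, not s.tool_result.startswith("Error:")) for s in steps]
totals = Counter(name for name, _ in calls)
wins = Counter(name for name, ok in calls if ok)
success_rates = {name: wins.get(name, 0) / n for name, n in totals.items()}
print(success_rates)  # {'read_file': 1.0, 'run_command': 0.5}
```

A tool with a persistently low success rate is usually a tool-schema or tool-description problem, not an agent problem, so this breakdown tells you where to look first.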
Evaluation Best Practices
- **Start with task completion** — Does the agent finish the task? Everything else is secondary.
- **Use LLM-as-judge for open-ended tasks** — When there is no exact expected answer, use a model to score the output.
- **Track efficiency alongside accuracy** — An agent that takes 50 steps to do what should take 5 is broken even if it gets the right answer.
- **Run benchmarks on every agent change** — Prompt changes, tool changes, model changes. Regressions happen silently.
- **Test failure modes explicitly** — Include test cases with broken tools, ambiguous tasks, and impossible requests to verify the agent handles them gracefully.
- **Keep test cases versioned** — Store them in your repo alongside the agent code. Review them in PRs.
- **Measure cost per task** — Track tokens and API calls per task completion. Optimize for cost after correctness.
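The last point is easy to operationalize once trajectories record token counts. A sketch; the per-million-token prices below are illustrative placeholders, not real rates:

```python
def cost_per_task_usd(input_tokens: int, output_tokens: int,
                      input_price_per_mtok: float,
                      output_price_per_mtok: float) -> float:
    """USD cost of a single task from its token counts.

    Prices are per million tokens and must come from your provider's
    current rate card; the values used below are illustrative only.
    """
    return (input_tokens * input_price_per_mtok
            + output_tokens * output_price_per_mtok) / 1_000_000

# Illustrative rates, not real pricing: $3/MTok in, $15/MTok out.
print(cost_per_task_usd(120_000, 8_000, 3.00, 15.00))  # 0.48
```

Dividing total suite cost by the number of *passed* tests, rather than all tests, gives a cost-per-successful-task figure that penalizes configurations that are cheap but unreliable.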
Install this skill directly: skilldb add ai-agent-orchestration-skills
Related Skills
agent-architecture
Core patterns for building AI agent systems: the observe-think-act loop, ReAct pattern implementation, tool-use cycles, memory systems (short-term and long-term), and planning strategies. Covers how to structure an agent's main loop, manage state between iterations, and wire together perception, reasoning, and action into a reliable autonomous system.
agent-error-recovery
Handling failures in AI agent systems: retry strategies with backoff, fallback tools, graceful degradation, human-in-the-loop escalation, stuck-loop detection, and context recovery after crashes. Covers practical patterns for making agents robust against tool failures, API errors, and reasoning dead-ends.
agent-frameworks
Comparison of major AI agent frameworks: LangGraph, CrewAI, AutoGen, Semantic Kernel, and Claude Agent SDK. Covers when to use each framework, their trade-offs, core patterns, practical setup examples, and migration strategies between frameworks.
agent-guardrails
Safety and control systems for AI agents: input and output validation, action authorization, rate limiting, cost controls, content filtering, scope restriction, and audit logging. Covers practical implementations for keeping agents within bounds while maintaining their usefulness.
agent-memory
Memory systems for AI agents: conversation history management, summarization strategies, vector-based long-term memory, entity memory, episodic memory, and memory retrieval patterns. Covers practical implementations for giving agents persistent, searchable memory across sessions and within long-running tasks.
agent-planning
Planning strategies for AI agents: chain-of-thought prompting, tree-of-thought exploration, plan-and-execute patterns, iterative refinement, task decomposition, and goal tracking. Covers practical implementations that make agents more reliable at complex, multi-step tasks by thinking before acting.