
agent-evaluation

Testing and evaluating AI agents: trajectory evaluation, task completion metrics, tool-use accuracy measurement, regression testing, benchmark suites, and A/B testing agent configurations. Covers practical approaches to measuring whether agents are working correctly and improving over time.


Agent Evaluation

Measure agent performance systematically: track task completion, tool accuracy, regressions, and compare configurations.


Trajectory Evaluation

Evaluate not just the final answer, but the entire sequence of actions the agent took.

from dataclasses import dataclass, field


@dataclass
class AgentStep:
    tool_name: str
    tool_input: dict
    tool_result: str
    reasoning: str = ""


@dataclass
class AgentTrajectory:
    task: str
    steps: list[AgentStep] = field(default_factory=list)
    final_answer: str = ""
    total_tokens: int = 0
    total_time_ms: float = 0
    success: bool = False


def record_trajectory(task: str, tools: list[dict]) -> AgentTrajectory:
    """Run an agent and record its full trajectory.

    Assumes `client` (an anthropic.Anthropic instance), `execute_tool`,
    and `extract_text` are defined elsewhere.
    """
    import time

    trajectory = AgentTrajectory(task=task)
    messages = [{"role": "user", "content": task}]
    start = time.time()

    for _ in range(20):  # cap agent turns to avoid runaway loops
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=4096,
            tools=tools,
            messages=messages,
        )

        trajectory.total_tokens += response.usage.input_tokens + response.usage.output_tokens

        if response.stop_reason == "end_turn":
            trajectory.final_answer = extract_text(response)
            break

        messages.append({"role": "assistant", "content": response.content})

        for block in response.content:
            if block.type == "tool_use":
                result = execute_tool(block.name, block.input)
                reasoning = ""
                for b in response.content:
                    if b.type == "text":
                        reasoning = b.text
                        break

                trajectory.steps.append(AgentStep(
                    tool_name=block.name,
                    tool_input=block.input,
                    tool_result=result,
                    reasoning=reasoning,
                ))

                messages.append({"role": "user", "content": [{
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": result,
                }]})

    trajectory.total_time_ms = (time.time() - start) * 1000
    return trajectory

Grading Trajectories

import anthropic

client = anthropic.Anthropic()


def grade_trajectory(trajectory: AgentTrajectory,
                     expected_answer: str | None = None,
                     expected_tools: list[str] | None = None) -> dict:
    """Grade an agent trajectory on multiple dimensions."""
    scores = {}

    # 1. Task completion (use LLM as judge)
    if expected_answer:
        judge_response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=500,
            messages=[{
                "role": "user",
                "content": f"""Compare the agent's answer to the expected answer.
Score from 0-10 where 10 is a perfect match in meaning (exact wording not required).

Task: {trajectory.task}
Expected: {expected_answer}
Agent answer: {trajectory.final_answer}

Return ONLY a JSON object: {{"score": N, "reason": "..."}}""",
            }],
        )
        import json
        text = judge_response.content[0].text
        try:
            result = json.loads(text[text.index("{"):text.rindex("}") + 1])
        except ValueError:  # no braces found, or malformed JSON
            result = {"score": 0, "reason": f"Unparseable judge output: {text[:100]}"}
        scores["completion"] = result["score"]
        scores["completion_reason"] = result["reason"]

    # 2. Tool efficiency
    num_steps = len(trajectory.steps)
    scores["num_steps"] = num_steps
    if expected_tools:
        used = [s.tool_name for s in trajectory.steps]
        expected_set = set(expected_tools)
        used_set = set(used)
        scores["tool_precision"] = len(used_set & expected_set) / max(len(used_set), 1)
        scores["tool_recall"] = len(used_set & expected_set) / max(len(expected_set), 1)

    # 3. Efficiency
    scores["total_tokens"] = trajectory.total_tokens
    scores["total_time_ms"] = trajectory.total_time_ms

    # 4. Error rate
    errors = sum(1 for s in trajectory.steps if s.tool_result.startswith("Error:"))
    scores["error_rate"] = errors / max(num_steps, 1)

    # 5. Redundancy (repeated identical tool calls)
    calls = [(s.tool_name, str(s.tool_input)) for s in trajectory.steps]
    unique_calls = set(calls)
    scores["redundancy"] = 1 - (len(unique_calls) / max(len(calls), 1))

    return scores

Task Completion Metrics

Define clear pass/fail criteria for agent tasks.

from typing import Callable


@dataclass
class TestCase:
    name: str
    task: str
    expected_answer: str = ""
    expected_tools: list[str] = field(default_factory=list)
    validation_fn: Callable[[AgentTrajectory], tuple[bool, str]] | None = None
    max_steps: int = 15
    max_tokens: int = 50000
    timeout_ms: float = 60000


@dataclass
class TestResult:
    test_name: str
    passed: bool
    scores: dict
    trajectory: AgentTrajectory
    failure_reason: str = ""


def run_test_case(test: TestCase, tools: list[dict]) -> TestResult:
    """Run a single test case and evaluate the result."""
    trajectory = record_trajectory(test.task, tools)

    scores = grade_trajectory(
        trajectory,
        expected_answer=test.expected_answer,
        expected_tools=test.expected_tools,
    )

    # Determine pass/fail
    passed = True
    reasons = []

    # Check completion score
    if "completion" in scores and scores["completion"] < 7:
        passed = False
        reasons.append(f"Low completion score: {scores['completion']}/10")

    # Check step limit
    if len(trajectory.steps) > test.max_steps:
        passed = False
        reasons.append(f"Exceeded step limit: {len(trajectory.steps)} > {test.max_steps}")

    # Check token budget
    if trajectory.total_tokens > test.max_tokens:
        passed = False
        reasons.append(f"Exceeded token budget: {trajectory.total_tokens} > {test.max_tokens}")

    # Check timeout
    if trajectory.total_time_ms > test.timeout_ms:
        passed = False
        reasons.append(f"Exceeded timeout: {trajectory.total_time_ms:.0f}ms > {test.timeout_ms:.0f}ms")

    # Custom validation
    if test.validation_fn:
        custom_pass, custom_reason = test.validation_fn(trajectory)
        if not custom_pass:
            passed = False
            reasons.append(f"Custom validation failed: {custom_reason}")

    return TestResult(
        test_name=test.name,
        passed=passed,
        scores=scores,
        trajectory=trajectory,
        failure_reason="; ".join(reasons) if reasons else "",
    )

Benchmark Suite

Run a suite of tests and track results over time.

import json
import time
from pathlib import Path


class BenchmarkSuite:
    """Run and track agent benchmarks."""

    def __init__(self, name: str, tests: list[TestCase],
                 results_dir: str = ".benchmarks"):
        self.name = name
        self.tests = tests
        self.results_dir = Path(results_dir)
        self.results_dir.mkdir(parents=True, exist_ok=True)

    def run(self, tools: list[dict], agent_config: dict | None = None) -> dict:
        """Run all tests and return aggregate results."""
        results = []
        for test in self.tests:
            print(f"Running: {test.name}...", end=" ")
            result = run_test_case(test, tools)
            status = "PASS" if result.passed else "FAIL"
            print(f"{status}")
            if not result.passed:
                print(f"  Reason: {result.failure_reason}")
            results.append(result)

        # Aggregate
        passed = sum(1 for r in results if r.passed)
        total = len(results)
        avg_steps = sum(len(r.trajectory.steps) for r in results) / max(total, 1)
        avg_tokens = sum(r.trajectory.total_tokens for r in results) / max(total, 1)
        avg_time = sum(r.trajectory.total_time_ms for r in results) / max(total, 1)

        summary = {
            "suite": self.name,
            "timestamp": time.time(),
            "config": agent_config or {},
            "pass_rate": passed / total,
            "passed": passed,
            "total": total,
            "avg_steps": avg_steps,
            "avg_tokens": avg_tokens,
            "avg_time_ms": avg_time,
            "test_results": [
                {
                    "name": r.test_name,
                    "passed": r.passed,
                    "scores": r.scores,
                    "failure_reason": r.failure_reason,
                    "steps": len(r.trajectory.steps),
                    "tokens": r.trajectory.total_tokens,
                }
                for r in results
            ],
        }

        # Save results
        filename = f"{self.name}_{int(time.time())}.json"
        (self.results_dir / filename).write_text(json.dumps(summary, indent=2))

        print(f"\n{'='*50}")
        print(f"Results: {passed}/{total} passed ({passed/total:.0%})")
        print(f"Avg steps: {avg_steps:.1f}")
        print(f"Avg tokens: {avg_tokens:.0f}")
        print(f"Avg time: {avg_time:.0f}ms")

        return summary


# Define a benchmark
coding_benchmark = BenchmarkSuite(
    name="coding_tasks",
    tests=[
        TestCase(
            name="fizzbuzz",
            task="Write a Python function called fizzbuzz(n) that returns 'Fizz' for "
                 "multiples of 3, 'Buzz' for multiples of 5, 'FizzBuzz' for both, "
                 "and the number as a string otherwise. Save it to fizzbuzz.py and test it.",
            expected_tools=["write_file", "run_command"],
            validation_fn=lambda t: (
                any("fizzbuzz" in s.tool_input.get("path", "") for s in t.steps),
                "Should write fizzbuzz.py"
            ),
        ),
        TestCase(
            name="read_and_summarize",
            task="Read the file data.txt and provide a summary of its contents.",
            expected_tools=["read_file"],
            max_steps=5,
        ),
        TestCase(
            name="debug_error",
            task="Run 'python app.py' and fix any errors you find.",
            expected_tools=["run_command", "read_file", "write_file"],
            max_steps=10,
        ),
    ],
)

# Run the benchmark (`available_tools` is your tool schema list, defined elsewhere)
results = coding_benchmark.run(tools=available_tools, agent_config={"model": "claude-sonnet-4-20250514"})

Regression Testing

Detect when agent changes cause previously passing tests to fail.

class RegressionTracker:
    """Track test results over time and detect regressions."""

    def __init__(self, results_dir: str = ".benchmarks"):
        self.results_dir = Path(results_dir)

    def load_history(self, suite_name: str) -> list[dict]:
        """Load all past results for a suite."""
        results = []
        for f in sorted(self.results_dir.glob(f"{suite_name}_*.json")):
            results.append(json.loads(f.read_text()))
        return results

    def check_regressions(self, suite_name: str, current: dict) -> list[dict]:
        """Compare current results against the last run."""
        history = self.load_history(suite_name)
        if not history:
            return []

        previous = history[-1]
        regressions = []

        prev_results = {r["name"]: r for r in previous["test_results"]}
        curr_results = {r["name"]: r for r in current["test_results"]}

        for name, curr in curr_results.items():
            prev = prev_results.get(name)
            if prev and prev["passed"] and not curr["passed"]:
                regressions.append({
                    "test": name,
                    "was": "PASS",
                    "now": "FAIL",
                    "reason": curr.get("failure_reason", ""),
                })

            # Also flag significant performance regressions
            if prev and curr["tokens"] > prev["tokens"] * 1.5:
                regressions.append({
                    "test": name,
                    "type": "performance",
                    "metric": "tokens",
                    "was": prev["tokens"],
                    "now": curr["tokens"],
                })

        return regressions


tracker = RegressionTracker()

def run_with_regression_check(suite: BenchmarkSuite, tools: list[dict],
                               config: dict) -> dict:
    results = suite.run(tools, config)
    regressions = tracker.check_regressions(suite.name, results)

    if regressions:
        print(f"\nREGRESSIONS DETECTED ({len(regressions)}):")
        for r in regressions:
            if r.get("type") == "performance":
                print(f"  PERF: {r['test']} - {r['metric']}: {r['was']} -> {r['now']}")
            else:
                print(f"  FAIL: {r['test']} - was {r['was']}, now {r['now']}")
                print(f"        Reason: {r['reason']}")

    return results

A/B Testing Agent Configurations

Compare different agent setups to find the best configuration.

def ab_test(suite: BenchmarkSuite, tools: list[dict],
            config_a: dict, config_b: dict,
            num_runs: int = 3) -> dict:
    """A/B test two agent configurations."""
    results_a = []
    results_b = []

    for run_idx in range(num_runs):
        print(f"\n--- Run {run_idx + 1}/{num_runs} ---")

        print("Config A:")
        ra = suite.run(tools, config_a)
        results_a.append(ra)

        print("Config B:")
        rb = suite.run(tools, config_b)
        results_b.append(rb)

    # Aggregate results
    def aggregate(results_list):
        return {
            "avg_pass_rate": sum(r["pass_rate"] for r in results_list) / len(results_list),
            "avg_tokens": sum(r["avg_tokens"] for r in results_list) / len(results_list),
            "avg_steps": sum(r["avg_steps"] for r in results_list) / len(results_list),
            "avg_time_ms": sum(r["avg_time_ms"] for r in results_list) / len(results_list),
        }

    agg_a = aggregate(results_a)
    agg_b = aggregate(results_b)

    print(f"\n{'='*60}")
    print(f"A/B Test Results ({num_runs} runs each)")
    print(f"{'='*60}")
    print(f"{'Metric':<20} {'Config A':>12} {'Config B':>12} {'Winner':>10}")
    print(f"{'-'*60}")

    metrics = [
        ("Pass Rate", "avg_pass_rate", True),
        ("Avg Tokens", "avg_tokens", False),
        ("Avg Steps", "avg_steps", False),
        ("Avg Time (ms)", "avg_time_ms", False),
    ]

    for label, key, higher_is_better in metrics:
        va = agg_a[key]
        vb = agg_b[key]
        if higher_is_better:
            winner = "A" if va > vb else "B" if vb > va else "Tie"
        else:
            winner = "A" if va < vb else "B" if vb < va else "Tie"
        print(f"{label:<20} {va:>12.2f} {vb:>12.2f} {winner:>10}")

    return {"config_a": agg_a, "config_b": agg_b}


# Example: compare model sizes
ab_test(
    suite=coding_benchmark,
    tools=available_tools,
    config_a={"model": "claude-sonnet-4-20250514", "max_tokens": 4096},
    config_b={"model": "claude-3-5-haiku-20241022", "max_tokens": 4096},
    num_runs=3,
)

Tool-Use Accuracy Metrics

Specifically measure how well the agent uses tools.

def analyze_tool_usage(trajectories: list[AgentTrajectory]) -> dict:
    """Analyze tool usage patterns across multiple trajectories."""
    all_calls = []
    for t in trajectories:
        for step in t.steps:
            all_calls.append({
                "tool": step.tool_name,
                "success": not step.tool_result.startswith("Error:"),
                "task": t.task,
            })

    if not all_calls:
        return {"message": "No tool calls recorded."}

    # Per-tool success rate
    from collections import Counter
    tool_counts = Counter(c["tool"] for c in all_calls)
    tool_success = Counter(c["tool"] for c in all_calls if c["success"])

    tool_stats = {}
    for tool, count in tool_counts.items():
        success = tool_success.get(tool, 0)
        tool_stats[tool] = {
            "total_calls": count,
            "success_rate": success / count,
            "failures": count - success,
        }

    # Overall stats
    total = len(all_calls)
    successes = sum(1 for c in all_calls if c["success"])

    return {
        "total_tool_calls": total,
        "overall_success_rate": successes / total,
        "per_tool": tool_stats,
        "most_used": tool_counts.most_common(5),
    }
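The aggregation inside `analyze_tool_usage` is just two `Counter` passes; a self-contained sanity check of the same logic on hand-built call records (the records here are illustrative, not real trajectories):

```python
from collections import Counter

# Call records in the shape analyze_tool_usage builds internally.
calls = [
    {"tool": "read_file", "success": True},
    {"tool": "read_file", "success": False},
    {"tool": "run_command", "success": True},
]

counts = Counter(c["tool"] for c in calls)          # total calls per tool
wins = Counter(c["tool"] for c in calls if c["success"])  # successful calls per tool
success_rates = {tool: wins.get(tool, 0) / n for tool, n in counts.items()}
```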

Evaluation Best Practices

  1. Start with task completion — Does the agent finish the task? Everything else is secondary.

  2. Use LLM-as-judge for open-ended tasks — When there is no exact expected answer, use a model to score the output.

  3. Track efficiency alongside accuracy — An agent that takes 50 steps to do what should take 5 is broken even if it gets the right answer.

  4. Run benchmarks on every agent change — Prompt changes, tool changes, model changes. Regressions happen silently.

  5. Test failure modes explicitly — Include test cases with broken tools, ambiguous tasks, and impossible requests to verify the agent handles them gracefully.

  6. Keep test cases versioned — Store them in your repo alongside the agent code. Review them in PRs.

  7. Measure cost per task — Track tokens and API calls per task completion. Optimize for cost after correctness.

