
agent-trajectory-testing

Covers testing AI agent behavior end-to-end: trajectory evaluation, tool-call sequence validation, multi-step correctness verification, stuck-loop detection, cost regression testing, and timeout handling. Triggers: "test my AI agent", "agent trajectory evaluation", "tool call testing", "multi-step agent testing", "agent stuck detection", "agent cost regression", "validate agent behavior".

Quick Summary
Test AI agents that make multi-step decisions, call tools, and produce complex workflows. This skill covers trajectory evaluation, sequence validation, stuck detection, and cost control.

## Key Points

- The same input can produce different valid tool-call sequences
- Intermediate steps matter, not just the final answer
- Agents can get stuck in loops or take unnecessarily expensive paths
- Latency and cost compound across steps

## Common Pitfalls

1. **Only testing final answers**: The path matters. A correct answer via 50 steps is a bug.
2. **Ignoring cost**: Agent loops can burn through API budgets in minutes.
3. **Deterministic expectations**: Allow multiple valid trajectories for the same task.
4. **No timeout**: Always set timeouts. Agents can run forever.
5. **Testing in production**: Use sandboxed tools. Never let test agents call real APIs with side effects.

Agent Trajectory Testing

Test AI agents that make multi-step decisions, call tools, and produce complex workflows. This skill covers trajectory evaluation, sequence validation, stuck detection, and cost control.


Why Agent Testing Is Different

Agents are non-deterministic, multi-step systems. Traditional unit tests fail because:

  • The same input can produce different valid tool-call sequences
  • Intermediate steps matter, not just the final answer
  • Agents can get stuck in loops or take unnecessarily expensive paths
  • Latency and cost compound across steps

Agent testing evaluates the trajectory, not just the destination.
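
For example, the same task can legitimately produce different tool sequences, so tests should accept any trajectory that satisfies the constraints rather than one exact path. A minimal sketch with hypothetical tool names:

```python
# Sketch with hypothetical tool names: two runs of the same task take
# different paths, and both satisfy the same ordered-subsequence constraint.
def satisfies(actual: list[str], required: list[str]) -> bool:
    """True if `required` appears in order within `actual` (gaps allowed)."""
    it = iter(actual)  # `tool in it` consumes the iterator, enforcing order
    return all(tool in it for tool in required)

run_a = ["search", "answer"]                        # direct path
run_b = ["search", "read_doc", "search", "answer"]  # extra reading, still valid

assert satisfies(run_a, ["search", "answer"])
assert satisfies(run_b, ["search", "answer"])
assert not satisfies(["answer", "search"], ["search", "answer"])  # wrong order
```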


Trajectory Evaluation

Defining a Trajectory

from dataclasses import dataclass, field
from typing import Any

@dataclass
class ToolCall:
    name: str
    arguments: dict[str, Any]
    result: Any = None

@dataclass
class AgentStep:
    reasoning: str
    tool_call: ToolCall | None
    timestamp: float = 0.0
    tokens_used: int = 0

@dataclass
class AgentTrajectory:
    task: str
    steps: list[AgentStep] = field(default_factory=list)
    final_answer: str = ""
    total_cost: float = 0.0
    total_time: float = 0.0

    @property
    def tool_calls(self) -> list[ToolCall]:
        return [s.tool_call for s in self.steps if s.tool_call]

    @property
    def tool_sequence(self) -> list[str]:
        return [tc.name for tc in self.tool_calls]

Trajectory Scoring

class TrajectoryEvaluator:
    def evaluate(self, trajectory: AgentTrajectory, criteria: dict) -> dict:
        scores = {}

        # 1. Final answer correctness
        if "expected_answer" in criteria:
            scores["answer_correct"] = self._check_answer(
                trajectory.final_answer, criteria["expected_answer"]
            )

        # 2. Required tools used
        if "required_tools" in criteria:
            used = set(trajectory.tool_sequence)
            required = set(criteria["required_tools"])
            scores["required_tools_used"] = len(required & used) / len(required)

        # 3. Forbidden tools avoided
        if "forbidden_tools" in criteria:
            used = set(trajectory.tool_sequence)
            forbidden = set(criteria["forbidden_tools"])
            scores["forbidden_tools_avoided"] = 1.0 if not (used & forbidden) else 0.0

        # 4. Efficiency — fewer steps is better
        if "max_steps" in criteria:
            actual = len(trajectory.steps)
            max_steps = criteria["max_steps"]
            scores["efficiency"] = min(1.0, max_steps / max(actual, 1))

        # 5. Cost within budget
        if "max_cost" in criteria:
            scores["cost_within_budget"] = 1.0 if trajectory.total_cost <= criteria["max_cost"] else 0.0

        return scores

    def _check_answer(self, actual: str, expected: str) -> float:
        # Simple containment check — replace with semantic similarity for production
        return 1.0 if expected.lower() in actual.lower() else 0.0
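
The scoring arithmetic can be sketched standalone (a local function mirroring the evaluator's tool and efficiency checks, not the class itself):

```python
# Standalone sketch of the required/forbidden/efficiency scoring above.
def score(tool_sequence: list[str], n_steps: int, criteria: dict) -> dict:
    scores = {}
    used = set(tool_sequence)
    if "required_tools" in criteria:
        req = set(criteria["required_tools"])
        scores["required_tools_used"] = len(req & used) / len(req)
    if "forbidden_tools" in criteria:
        forbidden = set(criteria["forbidden_tools"])
        scores["forbidden_tools_avoided"] = 1.0 if not (used & forbidden) else 0.0
    if "max_steps" in criteria:
        scores["efficiency"] = min(1.0, criteria["max_steps"] / max(n_steps, 1))
    return scores

s = score(["search", "read_doc", "answer"], 3,
          {"required_tools": ["search"], "forbidden_tools": ["delete_file"], "max_steps": 5})
assert s == {"required_tools_used": 1.0, "forbidden_tools_avoided": 1.0, "efficiency": 1.0}
```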

Tool-Call Sequence Validation

import json
import re

class SequenceValidator:
    """Validate that an agent's tool-call sequence matches expected patterns."""

    @staticmethod
    def exact_match(actual: list[str], expected: list[str]) -> bool:
        return actual == expected

    @staticmethod
    def contains_subsequence(actual: list[str], required: list[str]) -> bool:
        """Check that required tools appear in order (not necessarily consecutive)."""
        # `tool in it` advances the shared iterator, so each required tool
        # must appear after the previous match in `actual`.
        it = iter(actual)
        return all(tool in it for tool in required)

    @staticmethod
    def matches_pattern(actual: list[str], pattern: str) -> bool:
        """Match tool sequence against a regex pattern.

        Example patterns:
          "search,.*,answer"  — search first, answer last
          "(search|browse),extract,answer"  — search or browse, then extract, then answer
        """
        sequence_str = ",".join(actual)
        return bool(re.match(pattern, sequence_str))

    @staticmethod
    def no_repeated_failures(actual: list[ToolCall], max_repeats: int = 3) -> bool:
        """Detect if the agent called the same tool with same args repeatedly."""
        seen = []
        repeat_count = 0
        for tc in actual:
            key = (tc.name, json.dumps(tc.arguments, sort_keys=True))
            if seen and seen[-1] == key:
                repeat_count += 1
                if repeat_count >= max_repeats:
                    return False
            else:
                repeat_count = 1
            seen.append(key)
        return True


# Test examples
validator = SequenceValidator()

# Agent should search before answering
assert validator.contains_subsequence(
    ["search", "read_doc", "search", "answer"],
    ["search", "answer"]
)

# Agent should never call delete_file
assert "delete_file" not in ["search", "read_doc", "answer"]

# Agent should follow: search -> extract -> validate -> answer
assert validator.matches_pattern(
    ["search", "extract", "validate", "answer"],
    r"search,extract,validate,answer"
)
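
The `no_repeated_failures` check can also be exercised standalone (a local helper mirroring the method above, taking `(name, args)` tuples):

```python
import json

# Standalone sketch of the repeated-failure check: three identical
# back-to-back calls (same tool, same arguments) trip the limit.
def no_repeated(calls: list[tuple[str, dict]], max_repeats: int = 3) -> bool:
    prev, count = None, 0
    for name, args in calls:
        key = (name, json.dumps(args, sort_keys=True))
        count = count + 1 if key == prev else 1
        if count >= max_repeats:
            return False
        prev = key
    return True

assert no_repeated([("search", {"q": "a"}), ("search", {"q": "b"})])  # args differ
assert not no_repeated([("search", {"q": "a"})] * 3)                  # 3 identical
```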

Multi-Step Correctness

@dataclass
class StepAssertion:
    step_index: int | None = None  # None = any step
    tool_name: str | None = None
    argument_checks: dict[str, Any] | None = None  # key -> expected value or callable
    result_check: Any = None  # callable(result) -> bool

class MultiStepValidator:
    def __init__(self, assertions: list[StepAssertion]):
        self.assertions = assertions

    def validate(self, trajectory: AgentTrajectory) -> list[str]:
        failures = []
        tool_calls = trajectory.tool_calls

        for assertion in self.assertions:
            if assertion.step_index is not None:
                # Check specific step
                if assertion.step_index >= len(tool_calls):
                    failures.append(f"Step {assertion.step_index} does not exist (only {len(tool_calls)} steps)")
                    continue
                if not self._check_step(tool_calls[assertion.step_index], assertion):
                    failures.append(f"Step {assertion.step_index} failed assertion")
            else:
                # Check any step matches
                if not any(self._check_step(tc, assertion) for tc in tool_calls):
                    failures.append(f"No step matched assertion for tool '{assertion.tool_name}'")

        return failures

    def _check_step(self, tc: ToolCall, assertion: StepAssertion) -> bool:
        if assertion.tool_name and tc.name != assertion.tool_name:
            return False
        if assertion.argument_checks:
            for key, expected in assertion.argument_checks.items():
                actual = tc.arguments.get(key)
                if callable(expected):
                    if not expected(actual):
                        return False
                elif actual != expected:
                    return False
        if assertion.result_check and not assertion.result_check(tc.result):
            return False
        return True


# Example: validate a research agent
validator = MultiStepValidator([
    # First step must be a search
    StepAssertion(step_index=0, tool_name="web_search", argument_checks=None),
    # Some step must read a document
    StepAssertion(step_index=None, tool_name="read_document", argument_checks=None),
    # The search query must contain the topic
    StepAssertion(
        step_index=0,
        tool_name="web_search",
        argument_checks={"query": lambda q: "python" in q.lower()},
    ),
])

# `trajectory` is an AgentTrajectory captured from an agent run
failures = validator.validate(trajectory)
assert len(failures) == 0, f"Validation failures: {failures}"

Stuck Detection Testing

import json
import time

class StuckDetector:
    """Detect when an agent is stuck in a loop or making no progress."""

    def __init__(self, max_identical_calls: int = 3, max_steps: int = 20,
                 max_time_seconds: float = 120):
        self.max_identical_calls = max_identical_calls
        self.max_steps = max_steps
        self.max_time_seconds = max_time_seconds

    def check(self, trajectory: AgentTrajectory) -> dict:
        issues = []

        # 1. Repeated identical tool calls
        calls = trajectory.tool_calls
        for i in range(len(calls) - self.max_identical_calls + 1):
            window = calls[i:i + self.max_identical_calls]
            signatures = [(c.name, json.dumps(c.arguments, sort_keys=True)) for c in window]
            if len(set(signatures)) == 1:
                issues.append({
                    "type": "repeated_call",
                    "step": i,
                    "tool": window[0].name,
                    "count": self.max_identical_calls,
                })

        # 2. Alternating between two tools (ping-pong)
        if len(calls) >= 6:
            for i in range(len(calls) - 5):
                names = [c.name for c in calls[i:i + 6]]
                if names[0] == names[2] == names[4] and names[1] == names[3] == names[5]:
                    if names[0] != names[1]:
                        issues.append({
                            "type": "ping_pong",
                            "step": i,
                            "tools": [names[0], names[1]],
                        })

        # 3. Too many steps
        if len(trajectory.steps) > self.max_steps:
            issues.append({
                "type": "too_many_steps",
                "count": len(trajectory.steps),
                "limit": self.max_steps,
            })

        # 4. Time limit exceeded
        if trajectory.total_time > self.max_time_seconds:
            issues.append({
                "type": "timeout",
                "elapsed": trajectory.total_time,
                "limit": self.max_time_seconds,
            })

        return {
            "is_stuck": len(issues) > 0,
            "issues": issues,
        }


# Test the stuck detector itself
def test_stuck_detector():
    detector = StuckDetector(max_identical_calls=3, max_steps=10)

    # Normal trajectory — no issues
    normal = AgentTrajectory(
        task="test",
        steps=[
            AgentStep("thinking", ToolCall("search", {"q": "python"})),
            AgentStep("reading", ToolCall("read", {"url": "doc.html"})),
            AgentStep("answering", None),
        ],
        final_answer="Python is great",
    )
    assert not detector.check(normal)["is_stuck"]

    # Stuck trajectory — repeated calls
    stuck = AgentTrajectory(
        task="test",
        steps=[
            AgentStep("try", ToolCall("search", {"q": "help"})),
            AgentStep("retry", ToolCall("search", {"q": "help"})),
            AgentStep("retry again", ToolCall("search", {"q": "help"})),
        ],
    )
    result = detector.check(stuck)
    assert result["is_stuck"]
    assert result["issues"][0]["type"] == "repeated_call"

Cost Regression Testing

# Pricing per 1M tokens (example rates)
MODEL_PRICING = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "claude-sonnet-4-20250514": {"input": 3.00, "output": 15.00},
}

def estimate_trajectory_cost(trajectory: AgentTrajectory, model: str) -> float:
    pricing = MODEL_PRICING[model]
    total_input = sum(s.tokens_used * 0.7 for s in trajectory.steps)  # rough split
    total_output = sum(s.tokens_used * 0.3 for s in trajectory.steps)
    return (total_input * pricing["input"] + total_output * pricing["output"]) / 1_000_000

import json
from pathlib import Path

class CostRegressionChecker:
    def __init__(self, budget_file: str = "agent_cost_baselines.json"):
        self.budget_file = Path(budget_file)
        self.budgets = json.loads(self.budget_file.read_text()) if self.budget_file.exists() else {}

    def check(self, task_name: str, actual_cost: float, tolerance: float = 0.20) -> dict:
        if task_name not in self.budgets:
            return {"status": "no_baseline", "cost": actual_cost}

        baseline = self.budgets[task_name]
        max_allowed = baseline * (1 + tolerance)

        if actual_cost > max_allowed:
            return {
                "status": "regression",
                "baseline": baseline,
                "actual": actual_cost,
                "max_allowed": max_allowed,
                "overage_pct": (actual_cost - baseline) / baseline * 100,
            }
        return {"status": "pass", "baseline": baseline, "actual": actual_cost}

    def update_baseline(self, task_name: str, cost: float):
        self.budgets[task_name] = cost
        self.budget_file.write_text(json.dumps(self.budgets, indent=2))
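
To see why this matters, a worked example using the gpt-4o rates above (with the same rough 70/30 input/output split):

```python
# Worked cost example: a stuck loop of 20 steps at 2,000 tokens per step
# on gpt-4o, split 70/30 input/output as in the rough estimate above.
PRICE = {"input": 2.50, "output": 10.00}  # $ per 1M tokens (example gpt-4o rates)
steps, tokens_per_step = 20, 2000
inp = steps * tokens_per_step * 0.7   # 28,000 input tokens
out = steps * tokens_per_step * 0.3   # 12,000 output tokens
cost = (inp * PRICE["input"] + out * PRICE["output"]) / 1_000_000
# ~$0.19 per stuck run — a CI suite that loops hundreds of times adds up fast
```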

End-to-End Agent Test Framework

import asyncio
import time

import pytest

class AgentTestCase:
    def __init__(self, name: str, task: str, criteria: dict, timeout: float = 60):
        self.name = name
        self.task = task
        self.criteria = criteria
        self.timeout = timeout

async def run_agent_test(agent, test_case: AgentTestCase) -> dict:
    """Execute a full agent test with trajectory capture."""
    start = time.time()

    try:
        trajectory = await asyncio.wait_for(
            agent.run(test_case.task),
            timeout=test_case.timeout,
        )
    except asyncio.TimeoutError:
        return {"status": "timeout", "elapsed": time.time() - start}

    elapsed = time.time() - start
    trajectory.total_time = elapsed

    # Run all evaluations
    evaluator = TrajectoryEvaluator()
    scores = evaluator.evaluate(trajectory, test_case.criteria)

    stuck = StuckDetector().check(trajectory)

    return {
        "status": "complete",
        "scores": scores,
        "stuck_check": stuck,
        "steps": len(trajectory.steps),
        "tool_sequence": trajectory.tool_sequence,
        "elapsed": elapsed,
        "final_answer": trajectory.final_answer[:500],
    }


# pytest integration
AGENT_TESTS = [
    AgentTestCase(
        name="simple_search",
        task="What is the population of Tokyo?",
        criteria={
            "expected_answer": "14 million",
            "required_tools": ["web_search"],
            "forbidden_tools": ["code_exec"],
            "max_steps": 5,
            "max_cost": 0.01,
        },
    ),
    AgentTestCase(
        name="multi_step_analysis",
        task="Compare GDP of France and Germany and explain the difference",
        criteria={
            "required_tools": ["web_search"],
            "max_steps": 10,
            "max_cost": 0.05,
        },
    ),
]

@pytest.mark.parametrize("test_case", AGENT_TESTS, ids=lambda tc: tc.name)
@pytest.mark.asyncio
async def test_agent(agent, test_case):
    result = await run_agent_test(agent, test_case)
    assert result["status"] == "complete", f"Agent did not complete: {result['status']}"
    assert not result["stuck_check"]["is_stuck"], f"Agent got stuck: {result['stuck_check']['issues']}"
    for metric, score in result["scores"].items():
        assert score >= 0.8, f"{metric} score too low: {score}"

Common Pitfalls

  1. Only testing final answers: The path matters. A correct answer via 50 steps is a bug.
  2. Ignoring cost: Agent loops can burn through API budgets in minutes.
  3. Deterministic expectations: Allow multiple valid trajectories for the same task.
  4. No timeout: Always set timeouts. Agents can run forever.
  5. Testing in production: Use sandboxed tools. Never let test agents call real APIs with side effects.
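
A minimal sketch of the fix for pitfall 5 (all names hypothetical): route test agents through a sandbox that records every call, returns canned results, and refuses anything that has not been stubbed.

```python
# Hedged sketch of a sandboxed tool layer: test agents get fakes with the
# real tools' names, so no side effects ever reach production systems.
class SandboxedTools:
    """Records calls and returns canned results instead of hitting real APIs."""

    def __init__(self, canned: dict[str, object]):
        self.canned = canned
        self.calls: list[tuple[str, dict]] = []

    def call(self, name: str, **kwargs) -> object:
        self.calls.append((name, kwargs))
        if name not in self.canned:
            # Fail loudly rather than fall through to a real API
            raise KeyError(f"tool {name!r} not sandboxed -- refusing to call it")
        return self.canned[name]

tools = SandboxedTools({"web_search": "Tokyo population: ~14 million"})
result = tools.call("web_search", query="Tokyo population")
assert tools.calls == [("web_search", {"query": "Tokyo population"})]
```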


Related Skills

ci-cd-for-ai

Covers implementing CI/CD pipelines for AI applications: running LLM evals in GitHub Actions, gating deployments on eval scores, monitoring prompt and model drift, versioning prompts alongside code, cost tracking, and canary deployments for AI features. Triggers: "CI for AI", "run evals in GitHub Actions", "gate deployment on eval score", "prompt drift detection", "version prompts in CI", "AI deployment pipeline", "LLM CI/CD".


eval-frameworks

Covers popular LLM evaluation frameworks and how to use them: Braintrust, Promptfoo, RAGAS, DeepEval, LangSmith, and custom eval harnesses. Includes setup, configuration, writing eval cases, CI integration, and choosing the right framework for your use case. Triggers: "eval framework", "Braintrust setup", "Promptfoo config", "RAGAS evaluation", "DeepEval", "LangSmith evals", "custom eval harness", "which eval tool should I use".


llm-as-judge

Covers using LLMs to evaluate other LLM outputs: rubric design, pairwise comparison, reference-based and reference-free grading, calibration techniques, inter-rater reliability measurement, and cost-efficient judging strategies. Triggers: "LLM as judge", "use GPT to evaluate outputs", "AI grading AI", "rubric for LLM evaluation", "pairwise comparison", "LLM evaluator", "auto-grade LLM responses".


llm-eval-fundamentals

Covers the foundations of evaluating LLM-powered applications: why evaluation matters, the taxonomy of metric types (exact match, semantic similarity, LLM-as-judge), building and curating eval datasets, establishing baselines, detecting regressions, and designing eval pipelines that scale from prototyping through production. Triggers: "evaluate my LLM app", "set up evals", "how do I measure LLM quality", "create an eval pipeline", "LLM metrics", "eval dataset".


prompt-testing

Covers testing and hardening prompts for LLM applications: prompt regression testing, A/B testing prompt variants, temperature sensitivity analysis, edge case libraries, prompt versioning strategies, and golden test sets. Triggers: "test my prompt", "prompt regression", "A/B test prompts", "prompt versioning", "temperature sensitivity", "golden test set for prompts", "prompt quality assurance".


red-teaming-ai

Covers red-teaming AI applications for safety and robustness: adversarial prompt testing, jailbreak resistance evaluation, PII leakage detection, hallucination measurement, bias detection, safety benchmarks, and building automated red-team pipelines. Triggers: "red team my AI", "adversarial testing for LLMs", "jailbreak testing", "PII leakage test", "hallucination detection", "AI bias testing", "safety benchmark", "AI security testing".
