agent-trajectory-testing
Covers testing AI agent behavior end-to-end: trajectory evaluation, tool-call sequence validation, multi-step correctness verification, stuck-loop detection, cost regression testing, and timeout handling. Triggers: "test my AI agent", "agent trajectory evaluation", "tool call testing", "multi-step agent testing", "agent stuck detection", "agent cost regression", "validate agent behavior".
# Agent Trajectory Testing
Test AI agents that make multi-step decisions, call tools, and produce complex workflows. This skill covers trajectory evaluation, sequence validation, stuck detection, and cost control.
## Why Agent Testing Is Different
Agents are non-deterministic, multi-step systems. Traditional unit tests fail because:
- The same input can produce different valid tool-call sequences
- Intermediate steps matter, not just the final answer
- Agents can get stuck in loops or take unnecessarily expensive paths
- Latency and cost compound across steps
Agent testing evaluates the trajectory, not just the destination.
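Because several tool-call sequences can be equally valid, one robust pattern is to run the task several times and accept any sequence from an allow-set rather than asserting a single path. A minimal sketch (the tool names and accepted sequences are hypothetical):

```python
# Accepted tool-call sequences for one task: non-determinism means
# more than one path can be correct (hypothetical example data).
ACCEPTED = {
    ("search", "read_doc", "answer"),
    ("search", "answer"),
}

def sequence_is_valid(sequence: list[str]) -> bool:
    """True if the observed sequence is one of the accepted trajectories."""
    return tuple(sequence) in ACCEPTED

# Simulated sequences from three runs of the same task.
runs = [
    ["search", "answer"],
    ["search", "read_doc", "answer"],
    ["search", "answer"],
]
results = [sequence_is_valid(r) for r in runs]
```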
## Trajectory Evaluation

### Defining a Trajectory
```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToolCall:
    name: str
    arguments: dict[str, Any]
    result: Any = None


@dataclass
class AgentStep:
    reasoning: str
    tool_call: ToolCall | None
    timestamp: float = 0.0
    tokens_used: int = 0


@dataclass
class AgentTrajectory:
    task: str
    steps: list[AgentStep] = field(default_factory=list)
    final_answer: str = ""
    total_cost: float = 0.0
    total_time: float = 0.0

    @property
    def tool_calls(self) -> list[ToolCall]:
        return [s.tool_call for s in self.steps if s.tool_call]

    @property
    def tool_sequence(self) -> list[str]:
        return [tc.name for tc in self.tool_calls]
```
### Trajectory Scoring
```python
class TrajectoryEvaluator:
    def evaluate(self, trajectory: AgentTrajectory, criteria: dict) -> dict:
        scores = {}

        # 1. Final answer correctness
        if "expected_answer" in criteria:
            scores["answer_correct"] = self._check_answer(
                trajectory.final_answer, criteria["expected_answer"]
            )

        # 2. Required tools used
        if "required_tools" in criteria:
            used = set(trajectory.tool_sequence)
            required = set(criteria["required_tools"])
            scores["required_tools_used"] = len(required & used) / len(required)

        # 3. Forbidden tools avoided
        if "forbidden_tools" in criteria:
            used = set(trajectory.tool_sequence)
            forbidden = set(criteria["forbidden_tools"])
            scores["forbidden_tools_avoided"] = 1.0 if not (used & forbidden) else 0.0

        # 4. Efficiency: fewer steps is better
        if "max_steps" in criteria:
            actual = len(trajectory.steps)
            max_steps = criteria["max_steps"]
            scores["efficiency"] = min(1.0, max_steps / max(actual, 1))

        # 5. Cost within budget
        if "max_cost" in criteria:
            scores["cost_within_budget"] = (
                1.0 if trajectory.total_cost <= criteria["max_cost"] else 0.0
            )

        return scores

    def _check_answer(self, actual: str, expected: str) -> float:
        # Simple containment check; replace with semantic similarity for production
        return 1.0 if expected.lower() in actual.lower() else 0.0
```
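The evaluator returns one score per criterion; a test still needs a single pass/fail decision. One way is a weighted aggregate over whichever criteria were scored. A minimal sketch (the weights and threshold are illustrative assumptions, not part of this skill):

```python
# Illustrative weights for the criteria TrajectoryEvaluator can emit.
WEIGHTS = {
    "answer_correct": 0.4,
    "required_tools_used": 0.2,
    "forbidden_tools_avoided": 0.2,
    "efficiency": 0.1,
    "cost_within_budget": 0.1,
}

def aggregate(scores: dict[str, float], threshold: float = 0.8) -> tuple[float, bool]:
    """Weighted mean over the criteria present; absent criteria are skipped."""
    present = {k: w for k, w in WEIGHTS.items() if k in scores}
    total_w = sum(present.values())
    overall = sum(scores[k] * w for k, w in present.items()) / total_w
    return overall, overall >= threshold

# Example: four criteria scored, cost criterion absent.
score, passed = aggregate({
    "answer_correct": 1.0,
    "required_tools_used": 1.0,
    "forbidden_tools_avoided": 1.0,
    "efficiency": 0.5,
})
```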
## Tool-Call Sequence Validation
```python
import json
import re


class SequenceValidator:
    """Validate that an agent's tool-call sequence matches expected patterns."""

    @staticmethod
    def exact_match(actual: list[str], expected: list[str]) -> bool:
        return actual == expected

    @staticmethod
    def contains_subsequence(actual: list[str], required: list[str]) -> bool:
        """Check that required tools appear in order (not necessarily consecutive)."""
        it = iter(actual)
        return all(tool in it for tool in required)

    @staticmethod
    def matches_pattern(actual: list[str], pattern: str) -> bool:
        """Match the tool sequence against a regex pattern.

        Example patterns:
            "search,.*,answer" - search first, answer last
            "(search|browse),extract,answer" - search or browse, then extract, then answer
        """
        sequence_str = ",".join(actual)
        return bool(re.match(pattern, sequence_str))

    @staticmethod
    def no_repeated_failures(actual: list[ToolCall], max_repeats: int = 3) -> bool:
        """Detect if the agent called the same tool with the same args repeatedly."""
        seen = []
        repeat_count = 0
        for tc in actual:
            key = (tc.name, json.dumps(tc.arguments, sort_keys=True))
            if seen and seen[-1] == key:
                repeat_count += 1
                if repeat_count >= max_repeats:
                    return False
            else:
                repeat_count = 1
            seen.append(key)
        return True


# Test examples
validator = SequenceValidator()

# Agent should search before answering
assert validator.contains_subsequence(
    ["search", "read_doc", "search", "answer"],
    ["search", "answer"],
)

# Agent should never call delete_file
assert "delete_file" not in ["search", "read_doc", "answer"]

# Agent should follow: search -> extract -> validate -> answer
assert validator.matches_pattern(
    ["search", "extract", "validate", "answer"],
    r"search,extract,validate,answer",
)
```
## Multi-Step Correctness
```python
@dataclass
class StepAssertion:
    step_index: int | None                  # None = any step
    tool_name: str | None
    argument_checks: dict[str, Any] | None  # key -> expected value or callable
    result_check: Any = None                # callable(result) -> bool


class MultiStepValidator:
    def __init__(self, assertions: list[StepAssertion]):
        self.assertions = assertions

    def validate(self, trajectory: AgentTrajectory) -> list[str]:
        failures = []
        tool_calls = trajectory.tool_calls
        for assertion in self.assertions:
            if assertion.step_index is not None:
                # Check a specific step
                if assertion.step_index >= len(tool_calls):
                    failures.append(
                        f"Step {assertion.step_index} does not exist "
                        f"(only {len(tool_calls)} steps)"
                    )
                    continue
                if not self._check_step(tool_calls[assertion.step_index], assertion):
                    failures.append(f"Step {assertion.step_index} failed assertion")
            else:
                # Check that any step matches
                if not any(self._check_step(tc, assertion) for tc in tool_calls):
                    failures.append(
                        f"No step matched assertion for tool '{assertion.tool_name}'"
                    )
        return failures

    def _check_step(self, tc: ToolCall, assertion: StepAssertion) -> bool:
        if assertion.tool_name and tc.name != assertion.tool_name:
            return False
        if assertion.argument_checks:
            for key, expected in assertion.argument_checks.items():
                actual = tc.arguments.get(key)
                if callable(expected):
                    if not expected(actual):
                        return False
                elif actual != expected:
                    return False
        if assertion.result_check and not assertion.result_check(tc.result):
            return False
        return True


# Example: validate a research agent
validator = MultiStepValidator([
    # First step must be a search
    StepAssertion(step_index=0, tool_name="web_search", argument_checks=None),
    # Some step must read a document
    StepAssertion(step_index=None, tool_name="read_document", argument_checks=None),
    # The search query must contain the topic
    StepAssertion(
        step_index=0,
        tool_name="web_search",
        argument_checks={"query": lambda q: "python" in q.lower()},
    ),
])

# `trajectory` is an AgentTrajectory captured from a real agent run
failures = validator.validate(trajectory)
assert len(failures) == 0, f"Validation failures: {failures}"
```
## Stuck Detection Testing
```python
import json


class StuckDetector:
    """Detect when an agent is stuck in a loop or making no progress."""

    def __init__(self, max_identical_calls: int = 3, max_steps: int = 20,
                 max_time_seconds: float = 120):
        self.max_identical_calls = max_identical_calls
        self.max_steps = max_steps
        self.max_time_seconds = max_time_seconds

    def check(self, trajectory: AgentTrajectory) -> dict:
        issues = []

        # 1. Repeated identical tool calls
        calls = trajectory.tool_calls
        for i in range(len(calls) - self.max_identical_calls + 1):
            window = calls[i:i + self.max_identical_calls]
            signatures = [
                (c.name, json.dumps(c.arguments, sort_keys=True)) for c in window
            ]
            if len(set(signatures)) == 1:
                issues.append({
                    "type": "repeated_call",
                    "step": i,
                    "tool": window[0].name,
                    "count": self.max_identical_calls,
                })

        # 2. Alternating between two tools (ping-pong)
        if len(calls) >= 6:
            for i in range(len(calls) - 5):
                names = [c.name for c in calls[i:i + 6]]
                if names[0] == names[2] == names[4] and names[1] == names[3] == names[5]:
                    if names[0] != names[1]:
                        issues.append({
                            "type": "ping_pong",
                            "step": i,
                            "tools": [names[0], names[1]],
                        })

        # 3. Too many steps
        if len(trajectory.steps) > self.max_steps:
            issues.append({
                "type": "too_many_steps",
                "count": len(trajectory.steps),
                "limit": self.max_steps,
            })

        # 4. Time limit exceeded
        if trajectory.total_time > self.max_time_seconds:
            issues.append({
                "type": "timeout",
                "elapsed": trajectory.total_time,
                "limit": self.max_time_seconds,
            })

        return {
            "is_stuck": len(issues) > 0,
            "issues": issues,
        }
```
```python
# Test the stuck detector itself
def test_stuck_detector():
    detector = StuckDetector(max_identical_calls=3, max_steps=10)

    # Normal trajectory: no issues
    normal = AgentTrajectory(
        task="test",
        steps=[
            AgentStep("thinking", ToolCall("search", {"q": "python"})),
            AgentStep("reading", ToolCall("read", {"url": "doc.html"})),
            AgentStep("answering", None),
        ],
        final_answer="Python is great",
    )
    assert not detector.check(normal)["is_stuck"]

    # Stuck trajectory: repeated calls
    stuck = AgentTrajectory(
        task="test",
        steps=[
            AgentStep("try", ToolCall("search", {"q": "help"})),
            AgentStep("retry", ToolCall("search", {"q": "help"})),
            AgentStep("retry again", ToolCall("search", {"q": "help"})),
        ],
    )
    result = detector.check(stuck)
    assert result["is_stuck"]
    assert result["issues"][0]["type"] == "repeated_call"
```
## Cost Regression Testing
```python
# Pricing per 1M tokens (example rates)
MODEL_PRICING = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "claude-sonnet-4-20250514": {"input": 3.00, "output": 15.00},
}


def estimate_trajectory_cost(trajectory: AgentTrajectory, model: str) -> float:
    pricing = MODEL_PRICING[model]
    total_input = sum(s.tokens_used * 0.7 for s in trajectory.steps)   # rough input/output split
    total_output = sum(s.tokens_used * 0.3 for s in trajectory.steps)
    return (total_input * pricing["input"] + total_output * pricing["output"]) / 1_000_000
```
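As a sanity check of the formula above: with the example `gpt-4o-mini` rates and a 10-step run at 1,000 tokens per step, the assumed 70/30 input/output split gives 7,000 input and 3,000 output tokens:

```python
# Worked example of the cost formula, inlined for clarity.
pricing = {"input": 0.15, "output": 0.60}     # gpt-4o-mini, $ per 1M tokens
tokens_per_step, steps = 1000, 10

total_input = tokens_per_step * steps * 0.7   # 7,000 input tokens
total_output = tokens_per_step * steps * 0.3  # 3,000 output tokens

# 7000 * 0.15 = 1050; 3000 * 0.60 = 1800; (1050 + 1800) / 1_000_000 = $0.00285
cost = (total_input * pricing["input"] + total_output * pricing["output"]) / 1_000_000
```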
```python
import json
from pathlib import Path


class CostRegressionChecker:
    def __init__(self, budget_file: str = "agent_cost_baselines.json"):
        self.budget_file = Path(budget_file)
        self.budgets = (
            json.loads(self.budget_file.read_text()) if self.budget_file.exists() else {}
        )

    def check(self, task_name: str, actual_cost: float, tolerance: float = 0.20) -> dict:
        if task_name not in self.budgets:
            return {"status": "no_baseline", "cost": actual_cost}
        baseline = self.budgets[task_name]
        max_allowed = baseline * (1 + tolerance)
        if actual_cost > max_allowed:
            return {
                "status": "regression",
                "baseline": baseline,
                "actual": actual_cost,
                "max_allowed": max_allowed,
                "overage_pct": (actual_cost - baseline) / baseline * 100,
            }
        return {"status": "pass", "baseline": baseline, "actual": actual_cost}

    def update_baseline(self, task_name: str, cost: float):
        self.budgets[task_name] = cost
        self.budget_file.write_text(json.dumps(self.budgets, indent=2))
```
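The checker's regression rule, inlined as a worked example (the task name, baseline, and measured cost are hypothetical):

```python
# Inline sketch of the rule CostRegressionChecker.check applies.
baseline = 0.004     # stored cost baseline for a hypothetical "simple_search" task
actual = 0.0055      # cost measured in this CI run
tolerance = 0.20     # 20% headroom over the baseline

max_allowed = baseline * (1 + tolerance)                 # 0.0048
status = "regression" if actual > max_allowed else "pass"
overage_pct = (actual - baseline) / baseline * 100       # 37.5% over baseline
```

On a pass, one common follow-up is to call `update_baseline` so that cheaper runs ratchet the stored budget down over time.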
## End-to-End Agent Test Framework
```python
import asyncio
import time

import pytest


class AgentTestCase:
    def __init__(self, name: str, task: str, criteria: dict, timeout: float = 60):
        self.name = name
        self.task = task
        self.criteria = criteria
        self.timeout = timeout


async def run_agent_test(agent, test_case: AgentTestCase) -> dict:
    """Execute a full agent test with trajectory capture."""
    start = time.time()
    try:
        trajectory = await asyncio.wait_for(
            agent.run(test_case.task),
            timeout=test_case.timeout,
        )
    except asyncio.TimeoutError:
        return {"status": "timeout", "elapsed": time.time() - start}
    elapsed = time.time() - start
    trajectory.total_time = elapsed

    # Run all evaluations
    evaluator = TrajectoryEvaluator()
    scores = evaluator.evaluate(trajectory, test_case.criteria)
    stuck = StuckDetector().check(trajectory)

    return {
        "status": "complete",
        "scores": scores,
        "stuck_check": stuck,
        "steps": len(trajectory.steps),
        "tool_sequence": trajectory.tool_sequence,
        "elapsed": elapsed,
        "final_answer": trajectory.final_answer[:500],
    }


# pytest integration (requires pytest-asyncio; `agent` is assumed to be a
# fixture that builds the agent under test)
AGENT_TESTS = [
    AgentTestCase(
        name="simple_search",
        task="What is the population of Tokyo?",
        criteria={
            "expected_answer": "13 million",
            "required_tools": ["web_search"],
            "forbidden_tools": ["code_exec"],
            "max_steps": 5,
            "max_cost": 0.01,
        },
    ),
    AgentTestCase(
        name="multi_step_analysis",
        task="Compare GDP of France and Germany and explain the difference",
        criteria={
            "required_tools": ["web_search"],
            "max_steps": 10,
            "max_cost": 0.05,
        },
    ),
]


@pytest.mark.parametrize("test_case", AGENT_TESTS, ids=lambda tc: tc.name)
@pytest.mark.asyncio
async def test_agent(agent, test_case):
    result = await run_agent_test(agent, test_case)
    assert result["status"] == "complete", f"Agent did not complete: {result['status']}"
    assert not result["stuck_check"]["is_stuck"], (
        f"Agent got stuck: {result['stuck_check']['issues']}"
    )
    for metric, score in result["scores"].items():
        assert score >= 0.8, f"{metric} score too low: {score}"
```
## Common Pitfalls
- **Only testing final answers**: The path matters. A correct answer via 50 steps is a bug.
- **Ignoring cost**: Agent loops can burn through API budgets in minutes.
- **Deterministic expectations**: Allow multiple valid trajectories for the same task.
- **No timeout**: Always set timeouts. Agents can run forever.
- **Testing in production**: Use sandboxed tools. Never let test agents call real APIs with side effects.
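For the last pitfall, a minimal sandboxing approach is to hand the test agent wrapped tools that record every call and return canned results, so no real API is ever touched (the tool names and canned results below are hypothetical):

```python
from typing import Any


class SandboxedTool:
    """Wraps a tool for tests: records every call and returns a canned
    result, so test agents never trigger real side effects."""

    def __init__(self, name: str, canned_result: Any):
        self.name = name
        self.canned_result = canned_result
        self.calls: list[dict[str, Any]] = []

    def __call__(self, **kwargs: Any) -> Any:
        self.calls.append(kwargs)   # record the call for later assertions
        return self.canned_result


# Hypothetical tools handed to the agent under test.
search = SandboxedTool("web_search", canned_result=["doc1.html"])
delete = SandboxedTool("delete_file", canned_result="denied")

search(q="tokyo population")
```

After the run, tests can assert on `search.calls` (and that `delete.calls` stayed empty) instead of inspecting real-world state.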
Install this skill directly: `skilldb add ai-testing-evals-skills`
## Related Skills
### ci-cd-for-ai
Covers implementing CI/CD pipelines for AI applications: running LLM evals in GitHub Actions, gating deployments on eval scores, monitoring prompt and model drift, versioning prompts alongside code, cost tracking, and canary deployments for AI features. Triggers: "CI for AI", "run evals in GitHub Actions", "gate deployment on eval score", "prompt drift detection", "version prompts in CI", "AI deployment pipeline", "LLM CI/CD".
### eval-frameworks
Covers popular LLM evaluation frameworks and how to use them: Braintrust, Promptfoo, RAGAS, DeepEval, LangSmith, and custom eval harnesses. Includes setup, configuration, writing eval cases, CI integration, and choosing the right framework for your use case. Triggers: "eval framework", "Braintrust setup", "Promptfoo config", "RAGAS evaluation", "DeepEval", "LangSmith evals", "custom eval harness", "which eval tool should I use".
### llm-as-judge
Covers using LLMs to evaluate other LLM outputs: rubric design, pairwise comparison, reference-based and reference-free grading, calibration techniques, inter-rater reliability measurement, and cost-efficient judging strategies. Triggers: "LLM as judge", "use GPT to evaluate outputs", "AI grading AI", "rubric for LLM evaluation", "pairwise comparison", "LLM evaluator", "auto-grade LLM responses".
### llm-eval-fundamentals
Covers the foundations of evaluating LLM-powered applications: why evaluation matters, the taxonomy of metric types (exact match, semantic similarity, LLM-as-judge), building and curating eval datasets, establishing baselines, detecting regressions, and designing eval pipelines that scale from prototyping through production. Triggers: "evaluate my LLM app", "set up evals", "how do I measure LLM quality", "create an eval pipeline", "LLM metrics", "eval dataset".
### prompt-testing
Covers testing and hardening prompts for LLM applications: prompt regression testing, A/B testing prompt variants, temperature sensitivity analysis, edge case libraries, prompt versioning strategies, and golden test sets. Triggers: "test my prompt", "prompt regression", "A/B test prompts", "prompt versioning", "temperature sensitivity", "golden test set for prompts", "prompt quality assurance".
### red-teaming-ai
Covers red-teaming AI applications for safety and robustness: adversarial prompt testing, jailbreak resistance evaluation, PII leakage detection, hallucination measurement, bias detection, safety benchmarks, and building automated red-team pipelines. Triggers: "red team my AI", "adversarial testing for LLMs", "jailbreak testing", "PII leakage test", "hallucination detection", "AI bias testing", "safety benchmark", "AI security testing".