agent-trajectory-testing
Covers testing AI agent behavior end-to-end: trajectory evaluation, tool-call sequence validation, multi-step correctness verification, stuck-loop detection, cost regression testing, and timeout handling. Triggers: "test my AI agent", "agent trajectory evaluation", "tool call testing", "multi-step agent testing", "agent stuck detection", "agent cost regression", "validate agent behavior".
# Agent Trajectory Testing
Test AI agents that make multi-step decisions, call tools, and produce complex workflows. This skill covers trajectory evaluation, sequence validation, stuck detection, and cost control.
## Why Agent Testing Is Different
Agents are non-deterministic, multi-step systems. Traditional unit tests fail because:
- The same input can produce different valid tool-call sequences
- Intermediate steps matter, not just the final answer
- Agents can get stuck in loops or take unnecessarily expensive paths
- Latency and cost compound across steps
Agent testing evaluates the trajectory, not just the destination.
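Because several tool-call sequences can be equally valid, one robust pattern is to run the task several times and accept any sequence from an allow-set rather than asserting a single path. A minimal sketch (the tool names and accepted sequences are hypothetical):

```python
# Accepted tool-call sequences for one task: non-determinism means
# more than one path can be correct (hypothetical example data).
ACCEPTED = {
    ("search", "read_doc", "answer"),
    ("search", "answer"),
}

def sequence_is_valid(sequence: list[str]) -> bool:
    """True if the observed sequence is one of the accepted trajectories."""
    return tuple(sequence) in ACCEPTED

# Simulated sequences from three runs of the same task.
runs = [
    ["search", "answer"],
    ["search", "read_doc", "answer"],
    ["search", "answer"],
]
results = [sequence_is_valid(r) for r in runs]
```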
## Trajectory Evaluation

### Defining a Trajectory
```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToolCall:
    name: str
    arguments: dict[str, Any]
    result: Any = None


@dataclass
class AgentStep:
    reasoning: str
    tool_call: ToolCall | None
    timestamp: float = 0.0
    tokens_used: int = 0


@dataclass
class AgentTrajectory:
    task: str
    steps: list[AgentStep] = field(default_factory=list)
    final_answer: str = ""
    total_cost: float = 0.0
    total_time: float = 0.0

    @property
    def tool_calls(self) -> list[ToolCall]:
        return [s.tool_call for s in self.steps if s.tool_call]

    @property
    def tool_sequence(self) -> list[str]:
        return [tc.name for tc in self.tool_calls]
```
### Trajectory Scoring
```python
class TrajectoryEvaluator:
    def evaluate(self, trajectory: AgentTrajectory, criteria: dict) -> dict:
        scores = {}

        # 1. Final answer correctness
        if "expected_answer" in criteria:
            scores["answer_correct"] = self._check_answer(
                trajectory.final_answer, criteria["expected_answer"]
            )

        # 2. Required tools used
        if "required_tools" in criteria:
            used = set(trajectory.tool_sequence)
            required = set(criteria["required_tools"])
            scores["required_tools_used"] = len(required & used) / len(required)

        # 3. Forbidden tools avoided
        if "forbidden_tools" in criteria:
            used = set(trajectory.tool_sequence)
            forbidden = set(criteria["forbidden_tools"])
            scores["forbidden_tools_avoided"] = 1.0 if not (used & forbidden) else 0.0

        # 4. Efficiency: fewer steps is better
        if "max_steps" in criteria:
            actual = len(trajectory.steps)
            max_steps = criteria["max_steps"]
            scores["efficiency"] = min(1.0, max_steps / max(actual, 1))

        # 5. Cost within budget
        if "max_cost" in criteria:
            scores["cost_within_budget"] = (
                1.0 if trajectory.total_cost <= criteria["max_cost"] else 0.0
            )

        return scores

    def _check_answer(self, actual: str, expected: str) -> float:
        # Simple containment check; replace with semantic similarity for production
        return 1.0 if expected.lower() in actual.lower() else 0.0
```
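The evaluator returns one score per criterion; a test still needs a single pass/fail decision. One way is a weighted aggregate over whichever criteria were scored. A minimal sketch (the weights and threshold are illustrative assumptions, not part of this skill):

```python
# Illustrative weights for the criteria TrajectoryEvaluator can emit.
WEIGHTS = {
    "answer_correct": 0.4,
    "required_tools_used": 0.2,
    "forbidden_tools_avoided": 0.2,
    "efficiency": 0.1,
    "cost_within_budget": 0.1,
}

def aggregate(scores: dict[str, float], threshold: float = 0.8) -> tuple[float, bool]:
    """Weighted mean over the criteria present; absent criteria are skipped."""
    present = {k: w for k, w in WEIGHTS.items() if k in scores}
    total_w = sum(present.values())
    overall = sum(scores[k] * w for k, w in present.items()) / total_w
    return overall, overall >= threshold

# Example: four criteria scored, cost criterion absent.
score, passed = aggregate({
    "answer_correct": 1.0,
    "required_tools_used": 1.0,
    "forbidden_tools_avoided": 1.0,
    "efficiency": 0.5,
})
```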
## Tool-Call Sequence Validation
```python
import json
import re


class SequenceValidator:
    """Validate that an agent's tool-call sequence matches expected patterns."""

    @staticmethod
    def exact_match(actual: list[str], expected: list[str]) -> bool:
        return actual == expected

    @staticmethod
    def contains_subsequence(actual: list[str], required: list[str]) -> bool:
        """Check that required tools appear in order (not necessarily consecutive)."""
        it = iter(actual)
        return all(tool in it for tool in required)

    @staticmethod
    def matches_pattern(actual: list[str], pattern: str) -> bool:
        """Match the tool sequence against a regex pattern.

        Example patterns:
            "search,.*,answer" - search first, answer last
            "(search|browse),extract,answer" - search or browse, then extract, then answer
        """
        sequence_str = ",".join(actual)
        return bool(re.match(pattern, sequence_str))

    @staticmethod
    def no_repeated_failures(actual: list[ToolCall], max_repeats: int = 3) -> bool:
        """Detect if the agent called the same tool with the same args repeatedly."""
        seen = []
        repeat_count = 0
        for tc in actual:
            key = (tc.name, json.dumps(tc.arguments, sort_keys=True))
            if seen and seen[-1] == key:
                repeat_count += 1
                if repeat_count >= max_repeats:
                    return False
            else:
                repeat_count = 1
            seen.append(key)
        return True


# Test examples
validator = SequenceValidator()

# Agent should search before answering
assert validator.contains_subsequence(
    ["search", "read_doc", "search", "answer"],
    ["search", "answer"],
)

# Agent should never call delete_file
assert "delete_file" not in ["search", "read_doc", "answer"]

# Agent should follow: search -> extract -> validate -> answer
assert validator.matches_pattern(
    ["search", "extract", "validate", "answer"],
    r"search,extract,validate,answer",
)
```
## Multi-Step Correctness
```python
@dataclass
class StepAssertion:
    step_index: int | None                  # None = any step
    tool_name: str | None
    argument_checks: dict[str, Any] | None  # key -> expected value or callable
    result_check: Any = None                # callable(result) -> bool


class MultiStepValidator:
    def __init__(self, assertions: list[StepAssertion]):
        self.assertions = assertions

    def validate(self, trajectory: AgentTrajectory) -> list[str]:
        failures = []
        tool_calls = trajectory.tool_calls
        for assertion in self.assertions:
            if assertion.step_index is not None:
                # Check a specific step
                if assertion.step_index >= len(tool_calls):
                    failures.append(
                        f"Step {assertion.step_index} does not exist "
                        f"(only {len(tool_calls)} steps)"
                    )
                    continue
                if not self._check_step(tool_calls[assertion.step_index], assertion):
                    failures.append(f"Step {assertion.step_index} failed assertion")
            else:
                # Check that any step matches
                if not any(self._check_step(tc, assertion) for tc in tool_calls):
                    failures.append(
                        f"No step matched assertion for tool '{assertion.tool_name}'"
                    )
        return failures

    def _check_step(self, tc: ToolCall, assertion: StepAssertion) -> bool:
        if assertion.tool_name and tc.name != assertion.tool_name:
            return False
        if assertion.argument_checks:
            for key, expected in assertion.argument_checks.items():
                actual = tc.arguments.get(key)
                if callable(expected):
                    if not expected(actual):
                        return False
                elif actual != expected:
                    return False
        if assertion.result_check and not assertion.result_check(tc.result):
            return False
        return True


# Example: validate a research agent
validator = MultiStepValidator([
    # First step must be a search
    StepAssertion(step_index=0, tool_name="web_search", argument_checks=None),
    # Some step must read a document
    StepAssertion(step_index=None, tool_name="read_document", argument_checks=None),
    # The search query must contain the topic
    StepAssertion(
        step_index=0,
        tool_name="web_search",
        argument_checks={"query": lambda q: "python" in q.lower()},
    ),
])

# `trajectory` is an AgentTrajectory captured from a real agent run
failures = validator.validate(trajectory)
assert len(failures) == 0, f"Validation failures: {failures}"
```
## Stuck Detection Testing
```python
import json


class StuckDetector:
    """Detect when an agent is stuck in a loop or making no progress."""

    def __init__(self, max_identical_calls: int = 3, max_steps: int = 20,
                 max_time_seconds: float = 120):
        self.max_identical_calls = max_identical_calls
        self.max_steps = max_steps
        self.max_time_seconds = max_time_seconds

    def check(self, trajectory: AgentTrajectory) -> dict:
        issues = []

        # 1. Repeated identical tool calls
        calls = trajectory.tool_calls
        for i in range(len(calls) - self.max_identical_calls + 1):
            window = calls[i:i + self.max_identical_calls]
            signatures = [
                (c.name, json.dumps(c.arguments, sort_keys=True)) for c in window
            ]
            if len(set(signatures)) == 1:
                issues.append({
                    "type": "repeated_call",
                    "step": i,
                    "tool": window[0].name,
                    "count": self.max_identical_calls,
                })

        # 2. Alternating between two tools (ping-pong)
        if len(calls) >= 6:
            for i in range(len(calls) - 5):
                names = [c.name for c in calls[i:i + 6]]
                if names[0] == names[2] == names[4] and names[1] == names[3] == names[5]:
                    if names[0] != names[1]:
                        issues.append({
                            "type": "ping_pong",
                            "step": i,
                            "tools": [names[0], names[1]],
                        })

        # 3. Too many steps
        if len(trajectory.steps) > self.max_steps:
            issues.append({
                "type": "too_many_steps",
                "count": len(trajectory.steps),
                "limit": self.max_steps,
            })

        # 4. Time limit exceeded
        if trajectory.total_time > self.max_time_seconds:
            issues.append({
                "type": "timeout",
                "elapsed": trajectory.total_time,
                "limit": self.max_time_seconds,
            })

        return {
            "is_stuck": len(issues) > 0,
            "issues": issues,
        }
```
```python
# Test the stuck detector itself
def test_stuck_detector():
    detector = StuckDetector(max_identical_calls=3, max_steps=10)

    # Normal trajectory: no issues
    normal = AgentTrajectory(
        task="test",
        steps=[
            AgentStep("thinking", ToolCall("search", {"q": "python"})),
            AgentStep("reading", ToolCall("read", {"url": "doc.html"})),
            AgentStep("answering", None),
        ],
        final_answer="Python is great",
    )
    assert not detector.check(normal)["is_stuck"]

    # Stuck trajectory: repeated calls
    stuck = AgentTrajectory(
        task="test",
        steps=[
            AgentStep("try", ToolCall("search", {"q": "help"})),
            AgentStep("retry", ToolCall("search", {"q": "help"})),
            AgentStep("retry again", ToolCall("search", {"q": "help"})),
        ],
    )
    result = detector.check(stuck)
    assert result["is_stuck"]
    assert result["issues"][0]["type"] == "repeated_call"
```
## Cost Regression Testing
```python
# Pricing per 1M tokens (example rates)
MODEL_PRICING = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "claude-sonnet-4-20250514": {"input": 3.00, "output": 15.00},
}


def estimate_trajectory_cost(trajectory: AgentTrajectory, model: str) -> float:
    pricing = MODEL_PRICING[model]
    total_input = sum(s.tokens_used * 0.7 for s in trajectory.steps)   # rough input/output split
    total_output = sum(s.tokens_used * 0.3 for s in trajectory.steps)
    return (total_input * pricing["input"] + total_output * pricing["output"]) / 1_000_000
```
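As a sanity check of the formula above: with the example `gpt-4o-mini` rates and a 10-step run at 1,000 tokens per step, the assumed 70/30 input/output split gives 7,000 input and 3,000 output tokens:

```python
# Worked example of the cost formula, inlined for clarity.
pricing = {"input": 0.15, "output": 0.60}     # gpt-4o-mini, $ per 1M tokens
tokens_per_step, steps = 1000, 10

total_input = tokens_per_step * steps * 0.7   # 7,000 input tokens
total_output = tokens_per_step * steps * 0.3  # 3,000 output tokens

# 7000 * 0.15 = 1050; 3000 * 0.60 = 1800; (1050 + 1800) / 1_000_000 = $0.00285
cost = (total_input * pricing["input"] + total_output * pricing["output"]) / 1_000_000
```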
```python
import json
from pathlib import Path


class CostRegressionChecker:
    def __init__(self, budget_file: str = "agent_cost_baselines.json"):
        self.budget_file = Path(budget_file)
        self.budgets = (
            json.loads(self.budget_file.read_text()) if self.budget_file.exists() else {}
        )

    def check(self, task_name: str, actual_cost: float, tolerance: float = 0.20) -> dict:
        if task_name not in self.budgets:
            return {"status": "no_baseline", "cost": actual_cost}
        baseline = self.budgets[task_name]
        max_allowed = baseline * (1 + tolerance)
        if actual_cost > max_allowed:
            return {
                "status": "regression",
                "baseline": baseline,
                "actual": actual_cost,
                "max_allowed": max_allowed,
                "overage_pct": (actual_cost - baseline) / baseline * 100,
            }
        return {"status": "pass", "baseline": baseline, "actual": actual_cost}

    def update_baseline(self, task_name: str, cost: float):
        self.budgets[task_name] = cost
        self.budget_file.write_text(json.dumps(self.budgets, indent=2))
```
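The checker's regression rule, inlined as a worked example (the task name, baseline, and measured cost are hypothetical):

```python
# Inline sketch of the rule CostRegressionChecker.check applies.
baseline = 0.004     # stored cost baseline for a hypothetical "simple_search" task
actual = 0.0055      # cost measured in this CI run
tolerance = 0.20     # 20% headroom over the baseline

max_allowed = baseline * (1 + tolerance)                 # 0.0048
status = "regression" if actual > max_allowed else "pass"
overage_pct = (actual - baseline) / baseline * 100       # 37.5% over baseline
```

On a pass, one common follow-up is to call `update_baseline` so that cheaper runs ratchet the stored budget down over time.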
## End-to-End Agent Test Framework
```python
import asyncio
import time

import pytest


class AgentTestCase:
    def __init__(self, name: str, task: str, criteria: dict, timeout: float = 60):
        self.name = name
        self.task = task
        self.criteria = criteria
        self.timeout = timeout


async def run_agent_test(agent, test_case: AgentTestCase) -> dict:
    """Execute a full agent test with trajectory capture."""
    start = time.time()
    try:
        trajectory = await asyncio.wait_for(
            agent.run(test_case.task),
            timeout=test_case.timeout,
        )
    except asyncio.TimeoutError:
        return {"status": "timeout", "elapsed": time.time() - start}
    elapsed = time.time() - start
    trajectory.total_time = elapsed

    # Run all evaluations
    evaluator = TrajectoryEvaluator()
    scores = evaluator.evaluate(trajectory, test_case.criteria)
    stuck = StuckDetector().check(trajectory)

    return {
        "status": "complete",
        "scores": scores,
        "stuck_check": stuck,
        "steps": len(trajectory.steps),
        "tool_sequence": trajectory.tool_sequence,
        "elapsed": elapsed,
        "final_answer": trajectory.final_answer[:500],
    }


# pytest integration (requires pytest-asyncio; `agent` is assumed to be a
# fixture that builds the agent under test)
AGENT_TESTS = [
    AgentTestCase(
        name="simple_search",
        task="What is the population of Tokyo?",
        criteria={
            "expected_answer": "13 million",
            "required_tools": ["web_search"],
            "forbidden_tools": ["code_exec"],
            "max_steps": 5,
            "max_cost": 0.01,
        },
    ),
    AgentTestCase(
        name="multi_step_analysis",
        task="Compare GDP of France and Germany and explain the difference",
        criteria={
            "required_tools": ["web_search"],
            "max_steps": 10,
            "max_cost": 0.05,
        },
    ),
]


@pytest.mark.parametrize("test_case", AGENT_TESTS, ids=lambda tc: tc.name)
@pytest.mark.asyncio
async def test_agent(agent, test_case):
    result = await run_agent_test(agent, test_case)
    assert result["status"] == "complete", f"Agent did not complete: {result['status']}"
    assert not result["stuck_check"]["is_stuck"], (
        f"Agent got stuck: {result['stuck_check']['issues']}"
    )
    for metric, score in result["scores"].items():
        assert score >= 0.8, f"{metric} score too low: {score}"
```
## Common Pitfalls
- **Only testing final answers**: The path matters. A correct answer via 50 steps is a bug.
- **Ignoring cost**: Agent loops can burn through API budgets in minutes.
- **Deterministic expectations**: Allow multiple valid trajectories for the same task.
- **No timeout**: Always set timeouts. Agents can run forever.
- **Testing in production**: Use sandboxed tools. Never let test agents call real APIs with side effects.
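For the last pitfall, a minimal sandboxing approach is to hand the test agent wrapped tools that record every call and return canned results, so no real API is ever touched (the tool names and canned results below are hypothetical):

```python
from typing import Any


class SandboxedTool:
    """Wraps a tool for tests: records every call and returns a canned
    result, so test agents never trigger real side effects."""

    def __init__(self, name: str, canned_result: Any):
        self.name = name
        self.canned_result = canned_result
        self.calls: list[dict[str, Any]] = []

    def __call__(self, **kwargs: Any) -> Any:
        self.calls.append(kwargs)   # record the call for later assertions
        return self.canned_result


# Hypothetical tools handed to the agent under test.
search = SandboxedTool("web_search", canned_result=["doc1.html"])
delete = SandboxedTool("delete_file", canned_result="denied")

search(q="tokyo population")
```

After the run, tests can assert on `search.calls` (and that `delete.calls` stayed empty) instead of inspecting real-world state.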
Install this skill directly: `skilldb add ai-testing-evals-skills`
## Related Skills
### ci-cd-for-ai
Covers implementing CI/CD pipelines for AI applications: running LLM evals in GitHub Actions, gating deployments on eval scores, monitoring prompt and model drift, versioning prompts alongside code, cost tracking, and canary deployments for AI features. Triggers: "CI for AI", "run evals in GitHub Actions", "gate deployment on eval score", "prompt drift detection", "version prompts in CI", "AI deployment pipeline", "LLM CI/CD".
### eval-frameworks
Covers popular LLM evaluation frameworks and how to use them: Braintrust, Promptfoo, RAGAS, DeepEval, LangSmith, and custom eval harnesses. Includes setup, configuration, writing eval cases, CI integration, and choosing the right framework for your use case. Triggers: "eval framework", "Braintrust setup", "Promptfoo config", "RAGAS evaluation", "DeepEval", "LangSmith evals", "custom eval harness", "which eval tool should I use".
### llm-as-judge
Covers using LLMs to evaluate other LLM outputs: rubric design, pairwise comparison, reference-based and reference-free grading, calibration techniques, inter-rater reliability measurement, and cost-efficient judging strategies. Triggers: "LLM as judge", "use GPT to evaluate outputs", "AI grading AI", "rubric for LLM evaluation", "pairwise comparison", "LLM evaluator", "auto-grade LLM responses".
### llm-eval-fundamentals
Covers the foundations of evaluating LLM-powered applications: why evaluation matters, the taxonomy of metric types (exact match, semantic similarity, LLM-as-judge), building and curating eval datasets, establishing baselines, detecting regressions, and designing eval pipelines that scale from prototyping through production. Triggers: "evaluate my LLM app", "set up evals", "how do I measure LLM quality", "create an eval pipeline", "LLM metrics", "eval dataset".
### prompt-testing
Covers testing and hardening prompts for LLM applications: prompt regression testing, A/B testing prompt variants, temperature sensitivity analysis, edge case libraries, prompt versioning strategies, and golden test sets. Triggers: "test my prompt", "prompt regression", "A/B test prompts", "prompt versioning", "temperature sensitivity", "golden test set for prompts", "prompt quality assurance".
### red-teaming-ai
Covers red-teaming AI applications for safety and robustness: adversarial prompt testing, jailbreak resistance evaluation, PII leakage detection, hallucination measurement, bias detection, safety benchmarks, and building automated red-team pipelines. Triggers: "red team my AI", "adversarial testing for LLMs", "jailbreak testing", "PII leakage test", "hallucination detection", "AI bias testing", "safety benchmark", "AI security testing".