
# eval-frameworks

Covers popular LLM evaluation frameworks and how to use them: Braintrust, Promptfoo, RAGAS, DeepEval, LangSmith, and custom eval harnesses. Includes setup, configuration, writing eval cases, CI integration, and choosing the right framework for your use case. Triggers: "eval framework", "Braintrust setup", "Promptfoo config", "RAGAS evaluation", "DeepEval", "LangSmith evals", "custom eval harness", "which eval tool should I use".

## Quick Summary
Set up and use LLM evaluation frameworks. This skill covers Braintrust, Promptfoo, RAGAS, DeepEval, LangSmith, and building custom eval harnesses.

## Key Points

- **Just starting out?** Use Promptfoo — YAML config, fast setup, model-agnostic.
- **pytest shop?** Use DeepEval — feels native, good for test-driven development.
- **Building RAG?** Use RAGAS — purpose-built metrics for retrieval + generation.
- **Using LangChain?** Use LangSmith — integrated tracing and eval.
- **Need full control?** Build a custom harness — 100 lines gets you surprisingly far.
- **Enterprise production?** Use Braintrust — logging, experiments, team collaboration.

## Quick Example

```bash
npm install -g promptfoo
# or
npx promptfoo@latest init
```

```bash
pip install braintrust
```

# Eval Frameworks



## Framework Comparison

| Framework  | Best For                     | Language  | OSS     | CI Support | LLM-as-Judge |
|------------|------------------------------|-----------|---------|------------|--------------|
| Braintrust | Production evals, logging    | Python/TS | Partial | Yes        | Yes          |
| Promptfoo  | Prompt testing, red-teaming  | YAML/JS   | Yes     | Yes        | Yes          |
| RAGAS      | RAG evaluation               | Python    | Yes     | Yes        | Yes          |
| DeepEval   | Unit-test-style evals        | Python    | Yes     | Yes        | Yes          |
| LangSmith  | LangChain apps, tracing      | Python/TS | No      | Yes        | Yes          |
| Custom     | Full control, specific needs | Any       | N/A     | DIY        | DIY          |

## Promptfoo

The most versatile open-source eval framework. YAML-driven, model-agnostic, built-in red-teaming.

### Installation

```bash
npm install -g promptfoo
# or
npx promptfoo@latest init
```

### Configuration

```yaml
# promptfooconfig.yaml
description: "Customer support bot evaluation"

prompts:
  - file://prompts/support-v1.txt
  - file://prompts/support-v2.txt

providers:
  - openai:gpt-4o
  - openai:gpt-4o-mini
  - anthropic:messages:claude-sonnet-4-20250514

defaultTest:
  options:
    transformVars: "{ ...vars, timestamp: new Date().toISOString() }"

tests:
  - vars:
      query: "How do I reset my password?"
    assert:
      - type: contains
        value: "password"
      - type: llm-rubric
        value: "The response should provide clear step-by-step instructions"
      - type: cost
        threshold: 0.01

  - vars:
      query: "I want to cancel my subscription"
    assert:
      - type: not-contains
        value: "I'm just an AI"
      - type: llm-rubric
        value: "The response should acknowledge the request and offer retention options"
      - type: latency
        threshold: 3000  # ms

  - vars:
      query: ""
    assert:
      - type: llm-rubric
        value: "The response should ask the user to provide more details"

  # Load test cases from file
  - vars: file://test_cases.csv
```

### Running

```bash
# Run evals
promptfoo eval

# View results in browser
promptfoo view

# Run in CI (exit code 1 if assertions fail)
promptfoo eval --no-cache --ci

# Red-team a prompt
promptfoo redteam generate -o redteam_tests.yaml
promptfoo eval -c redteam_tests.yaml
```

### Custom Assertions

```javascript
// custom_assert.js
// promptfoo passes the raw model output to the assertion as a string
module.exports = (output, context) => {
  // Check that the response is valid JSON with a bounded "answer" field
  try {
    const data = JSON.parse(output);
    if (!data.answer) {
      return { pass: false, reason: "Missing 'answer' field" };
    }
    if (data.answer.length > 500) {
      return { pass: false, reason: "Answer exceeds 500 characters" };
    }
    return { pass: true };
  } catch {
    return { pass: false, reason: "Response is not valid JSON" };
  }
};
```

```yaml
# Use in promptfooconfig.yaml
tests:
  - vars:
      query: "What is 2+2?"
    assert:
      - type: javascript
        value: file://custom_assert.js
```

## Braintrust

Production-grade eval platform with logging, tracing, and experiment tracking.

### Setup

```bash
pip install braintrust autoevals
```

```python
import braintrust
from braintrust import Eval
from autoevals import Factuality
from openai import AsyncOpenAI

client = AsyncOpenAI()

# Initialize (or set the BRAINTRUST_API_KEY environment variable)
braintrust.login(api_key="your-api-key")

# Your LLM application
@braintrust.traced
async def my_app(input: str) -> str:
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": input}],
    )
    return response.choices[0].message.content

# Run evaluation; Eval accepts async tasks and a list of scorers
Eval(
    "customer-support-bot",
    data=lambda: [
        {"input": "How do I reset my password?", "expected": "Go to settings..."},
        {"input": "What are your hours?", "expected": "We are open 9-5..."},
    ],
    task=my_app,
    scores=[Factuality],
)
```

### Custom Scorers

```python
from autoevals import Factuality, Battle

# Built-in scorers
factuality = Factuality()
result = factuality(
    output="Paris is the capital of France",
    expected="The capital of France is Paris",
    input="What is the capital of France?",
)
print(result.score)  # 1.0

# Pairwise comparison
battle = Battle()
result = battle(
    instructions="Answer the question helpfully",
    output="Paris is the capital.",
    expected="The capital of France is Paris, also known as the City of Light.",
)
```
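Plain Python functions also work as scorers: most frameworks in this guide accept any callable that maps an output (and optionally an expected answer) to a score in [0, 1]. A minimal sketch — the `keyword_coverage` name and heuristic are illustrative, not part of the braintrust or autoevals API:

```python
def keyword_coverage(output: str, expected: str) -> float:
    """Fraction of the expected answer's keywords that appear in the output.

    Returns a score in [0, 1], the shape most eval frameworks expect.
    """
    # Treat words longer than 3 characters as keywords; ignore punctuation
    keywords = {w.lower().strip(".,!?") for w in expected.split() if len(w) > 3}
    if not keywords:
        return 1.0
    hits = sum(1 for k in keywords if k in output.lower())
    return hits / len(keywords)

score = keyword_coverage(
    output="Paris is the capital of France.",
    expected="The capital of France is Paris",
)
print(score)  # 1.0
```

Deterministic scorers like this are cheap and stable; reserve LLM-as-judge scorers like `Factuality` for cases where string heuristics fall short.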

## DeepEval

pytest-style LLM evaluation — feels like writing unit tests.

### Setup

```bash
pip install deepeval
```

### Writing Tests

```python
from deepeval import assert_test
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import (
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    HallucinationMetric,
    GEval,
)

# Basic test
def test_answer_relevancy():
    test_case = LLMTestCase(
        input="What is the capital of France?",
        actual_output="The capital of France is Paris.",
    )
    metric = AnswerRelevancyMetric(threshold=0.7)
    assert_test(test_case, [metric])

# RAG faithfulness test
def test_faithfulness():
    test_case = LLMTestCase(
        input="What did the report say about revenue?",
        actual_output="Revenue increased by 15% year-over-year.",
        retrieval_context=[
            "The annual report shows revenue grew by 15% compared to last year."
        ],
    )
    metric = FaithfulnessMetric(threshold=0.8)
    assert_test(test_case, [metric])

# Custom criteria with G-Eval
def test_custom_criteria():
    test_case = LLMTestCase(
        input="Explain quantum computing to a 10-year-old.",
        actual_output="Imagine a magic coin that can be heads AND tails at the same time...",
    )
    metric = GEval(
        name="child-friendliness",
        criteria="The response should use simple words, fun analogies, and no jargon.",
        evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
        threshold=0.7,
    )
    assert_test(test_case, [metric])

# Hallucination detection
def test_no_hallucination():
    test_case = LLMTestCase(
        input="Summarize this document",
        actual_output="The document discusses AI safety and mentions OpenAI's founding in 2015.",
        context=["This document covers AI safety research trends from 2020-2024."],
    )
    metric = HallucinationMetric(threshold=0.5)
    assert_test(test_case, [metric])
```

### Running

```bash
# Run with pytest
deepeval test run test_evals.py

# With verbose output
deepeval test run test_evals.py -v

# Generate report
deepeval test run test_evals.py --report
```

## RAGAS (RAG Assessment)

Purpose-built for evaluating retrieval-augmented generation.

### Setup

```bash
pip install ragas
```

### Core Metrics

```python
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
from datasets import Dataset

# Prepare evaluation dataset
eval_data = {
    "question": [
        "What is the company's return policy?",
        "How long does shipping take?",
    ],
    "answer": [
        "You can return items within 30 days for a full refund.",
        "Standard shipping takes 5-7 business days.",
    ],
    "contexts": [
        ["Our return policy allows returns within 30 days. Full refund for unused items."],
        ["Shipping: Standard 5-7 days, Express 2-3 days. Free shipping over $50."],
    ],
    "ground_truth": [
        "Items can be returned within 30 days for a full refund if unused.",
        "Standard shipping takes 5-7 business days. Express is 2-3 days.",
    ],
}

dataset = Dataset.from_dict(eval_data)

# Run evaluation
result = evaluate(
    dataset,
    metrics=[
        faithfulness,       # Is the answer grounded in the context?
        answer_relevancy,   # Does the answer address the question?
        context_precision,  # Are retrieved chunks relevant?
        context_recall,     # Did retrieval find all needed info?
    ],
)

print(result)
# {'faithfulness': 0.95, 'answer_relevancy': 0.92, 'context_precision': 0.88, 'context_recall': 0.90}
```

## LangSmith

Evaluation and tracing for LangChain applications.

### Setup

```bash
pip install langsmith langchain
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY="your-api-key"
```

### Creating and Running Evals

```python
from langsmith import Client
from langsmith.evaluation import evaluate

client = Client()

# Create a dataset
dataset = client.create_dataset("support-eval")
client.create_examples(
    inputs=[
        {"question": "How do I reset my password?"},
        {"question": "What are your hours?"},
    ],
    outputs=[
        {"answer": "Go to Settings > Security > Reset Password"},
        {"answer": "We are open Monday-Friday 9am-5pm EST"},
    ],
    dataset_id=dataset.id,
)

# Define your app ("chain" stands in for your LangChain runnable)
def my_app(inputs: dict) -> dict:
    response = chain.invoke(inputs["question"])
    return {"answer": response}

# Define evaluators
def correctness(run, example) -> dict:
    """Check if the answer contains key information."""
    prediction = run.outputs["answer"]
    reference = example.outputs["answer"]
    score = 1.0 if reference.lower() in prediction.lower() else 0.0
    return {"key": "correctness", "score": score}

# Run evaluation
results = evaluate(
    my_app,
    data="support-eval",
    evaluators=[correctness],
    experiment_prefix="v2-prompt",
)
```

## Custom Eval Harness

When frameworks are overkill or too opinionated, build your own.

```python
import asyncio
import json
import time
from pathlib import Path
from dataclasses import dataclass, field, asdict
from typing import Callable

@dataclass
class EvalMetric:
    name: str
    scorer: Callable[[str, str, dict], float]  # (output, expected, metadata) -> score
    threshold: float = 0.0

@dataclass
class EvalResult:
    case_id: str
    scores: dict[str, float] = field(default_factory=dict)
    passed: bool = True
    output: str = ""
    latency_ms: float = 0.0
    error: str = ""

class CustomEvalHarness:
    def __init__(self, name: str, app_fn: Callable, metrics: list[EvalMetric]):
        self.name = name
        self.app_fn = app_fn
        self.metrics = metrics

    async def run(self, dataset_path: str, output_path: str | None = None) -> dict:
        cases = [json.loads(line) for line in Path(dataset_path).read_text().strip().split("\n")]
        results = []

        for case in cases:
            result = EvalResult(case_id=case["id"])
            start = time.time()

            try:
                output = await self.app_fn(case["input"])
                result.output = output
                result.latency_ms = (time.time() - start) * 1000

                for metric in self.metrics:
                    score = metric.scorer(output, case.get("expected", ""), case.get("metadata", {}))
                    result.scores[metric.name] = score
                    if score < metric.threshold:
                        result.passed = False
            except Exception as e:
                result.error = str(e)
                result.passed = False

            results.append(result)

        # Aggregate
        summary = {
            "name": self.name,
            "total": len(results),
            "passed": sum(r.passed for r in results),
            "failed": sum(not r.passed for r in results),
            "avg_latency_ms": sum(r.latency_ms for r in results) / len(results) if results else 0.0,
            "metrics": {},
        }
        for metric in self.metrics:
            scores = [r.scores.get(metric.name, 0) for r in results if not r.error]
            summary["metrics"][metric.name] = {
                "mean": sum(scores) / len(scores) if scores else 0,
                "min": min(scores) if scores else 0,
                "max": max(scores) if scores else 0,
                "threshold": metric.threshold,
                "pass_rate": sum(s >= metric.threshold for s in scores) / len(scores) if scores else 0,
            }

        if output_path:
            Path(output_path).write_text(json.dumps({
                "summary": summary,
                "results": [asdict(r) for r in results],
            }, indent=2))

        return summary


# Usage
harness = CustomEvalHarness(
    name="qa-eval",
    app_fn=my_qa_app,  # your async LLM app: (input: str) -> str
    metrics=[
        EvalMetric("contains_answer", lambda out, exp, _: 1.0 if exp.lower() in out.lower() else 0.0, threshold=0.8),
        EvalMetric("not_too_long", lambda out, exp, _: 1.0 if len(out) < 1000 else 0.0, threshold=1.0),
    ],
)

results = asyncio.run(harness.run("eval_dataset.jsonl", "eval_results.json"))
print(f"Pass rate: {results['passed']}/{results['total']}")
```
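The harness reads its dataset as JSONL, one case per line, using the `id`, `input`, `expected`, and `metadata` keys consumed in `run()` above. A sketch of generating a compatible file (the case contents are illustrative):

```python
import json
from pathlib import Path

# One JSON object per line; "expected" and "metadata" are optional keys,
# matching what CustomEvalHarness.run() reads with case.get(...)
cases = [
    {"id": "qa-001", "input": "What is the capital of France?", "expected": "Paris"},
    {"id": "qa-002", "input": "What is 2+2?", "expected": "4",
     "metadata": {"category": "math"}},
]
Path("eval_dataset.jsonl").write_text(
    "\n".join(json.dumps(c) for c in cases) + "\n"
)
```

Keeping the golden set in version control alongside the harness makes eval regressions reviewable in the same PR that changed the prompt.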

## CI Integration Pattern (Any Framework)

```yaml
# .github/workflows/evals.yml
name: LLM Evals
on:
  pull_request:
    paths: ["prompts/**", "src/llm/**", "eval/**"]

jobs:
  eval:
    runs-on: ubuntu-latest
    env:
      OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -r requirements-eval.txt

      # Framework-specific run command
      - name: Run evals (Promptfoo)
        run: npx promptfoo eval --no-cache --ci -o results.json

      # OR: Run evals (DeepEval)
      # run: deepeval test run tests/eval/ --report

      # OR: Run evals (Custom)
      # run: python eval/run.py --dataset eval/golden.jsonl --output results.json

      - name: Check pass rate
        run: |
          python -c "
          import json, sys
          r = json.load(open('results.json'))
          rate = r['passed'] / r['total']
          print(f'Pass rate: {rate:.1%}')
          if rate < 0.9:
              print('FAIL: pass rate below 90% threshold')
              sys.exit(1)
          "
```

## Choosing the Right Framework

- **Just starting out?** Use Promptfoo — YAML config, fast setup, model-agnostic.
- **pytest shop?** Use DeepEval — feels native, good for test-driven development.
- **Building RAG?** Use RAGAS — purpose-built metrics for retrieval + generation.
- **Using LangChain?** Use LangSmith — integrated tracing and eval.
- **Need full control?** Build a custom harness — 100 lines gets you surprisingly far.
- **Enterprise production?** Use Braintrust — logging, experiments, team collaboration.


## Related Skills

### agent-trajectory-testing

Covers testing AI agent behavior end-to-end: trajectory evaluation, tool-call sequence validation, multi-step correctness verification, stuck-loop detection, cost regression testing, and timeout handling. Triggers: "test my AI agent", "agent trajectory evaluation", "tool call testing", "multi-step agent testing", "agent stuck detection", "agent cost regression", "validate agent behavior".


### ci-cd-for-ai

Covers implementing CI/CD pipelines for AI applications: running LLM evals in GitHub Actions, gating deployments on eval scores, monitoring prompt and model drift, versioning prompts alongside code, cost tracking, and canary deployments for AI features. Triggers: "CI for AI", "run evals in GitHub Actions", "gate deployment on eval score", "prompt drift detection", "version prompts in CI", "AI deployment pipeline", "LLM CI/CD".


### llm-as-judge

Covers using LLMs to evaluate other LLM outputs: rubric design, pairwise comparison, reference-based and reference-free grading, calibration techniques, inter-rater reliability measurement, and cost-efficient judging strategies. Triggers: "LLM as judge", "use GPT to evaluate outputs", "AI grading AI", "rubric for LLM evaluation", "pairwise comparison", "LLM evaluator", "auto-grade LLM responses".


### llm-eval-fundamentals

Covers the foundations of evaluating LLM-powered applications: why evaluation matters, the taxonomy of metric types (exact match, semantic similarity, LLM-as-judge), building and curating eval datasets, establishing baselines, detecting regressions, and designing eval pipelines that scale from prototyping through production. Triggers: "evaluate my LLM app", "set up evals", "how do I measure LLM quality", "create an eval pipeline", "LLM metrics", "eval dataset".


### prompt-testing

Covers testing and hardening prompts for LLM applications: prompt regression testing, A/B testing prompt variants, temperature sensitivity analysis, edge case libraries, prompt versioning strategies, and golden test sets. Triggers: "test my prompt", "prompt regression", "A/B test prompts", "prompt versioning", "temperature sensitivity", "golden test set for prompts", "prompt quality assurance".


### red-teaming-ai

Covers red-teaming AI applications for safety and robustness: adversarial prompt testing, jailbreak resistance evaluation, PII leakage detection, hallucination measurement, bias detection, safety benchmarks, and building automated red-team pipelines. Triggers: "red team my AI", "adversarial testing for LLMs", "jailbreak testing", "PII leakage test", "hallucination detection", "AI bias testing", "safety benchmark", "AI security testing".
