
ci-cd-for-ai

Covers implementing CI/CD pipelines for AI applications: running LLM evals in GitHub Actions, gating deployments on eval scores, monitoring prompt and model drift, versioning prompts alongside code, cost tracking, and canary deployments for AI features. Triggers: "CI for AI", "run evals in GitHub Actions", "gate deployment on eval score", "prompt drift detection", "version prompts in CI", "AI deployment pipeline", "LLM CI/CD".


CI/CD for AI Applications

Build deployment pipelines that treat AI quality as a first-class gate. This skill covers running evals in CI, gating on scores, monitoring drift, and versioning prompts alongside code.


The AI CI/CD Problem

Traditional CI tests are deterministic: they pass or fail. AI applications are probabilistic:

  • The same prompt can produce different outputs
  • Model updates change behavior without code changes
  • Quality is a spectrum, not a binary
  • Evals cost real money (API calls)

Your CI pipeline must handle all of this.
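
One concrete consequence: a CI assertion on AI output cannot be strict string equality; it has to score the output and compare against a threshold. A minimal self-contained sketch (the scorer and the 0.8 threshold are illustrative, not from any particular framework):

```python
# Deterministic tests assert equality; AI tests assert a score against a
# threshold. The scorer below is deliberately crude and illustrative.
def token_overlap(expected: str, actual: str) -> float:
    """Fraction of expected tokens that appear in the actual output."""
    expected_tokens = set(expected.lower().split())
    actual_tokens = set(actual.lower().split())
    if not expected_tokens:
        return 1.0
    return len(expected_tokens & actual_tokens) / len(expected_tokens)

def assert_quality(expected: str, actual: str, threshold: float = 0.8) -> None:
    score = token_overlap(expected, actual)
    assert score >= threshold, f"score {score:.2f} below threshold {threshold}"

# Two differently worded outputs can both pass the same eval:
assert_quality("the capital of France is Paris",
               "Paris is the capital of France")
```

Real pipelines swap the scorer for semantic similarity or an LLM judge; the threshold-gated structure stays the same.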


Pipeline Architecture

┌──────────┐     ┌───────────┐     ┌───────────┐     ┌──────────┐
│ PR Open  │────>│ Fast Eval │────>│ Full Eval │────>│ Deploy   │
│          │     │ (golden   │     │ (nightly) │     │ (gated)  │
│          │     │  set)     │     │           │     │          │
└──────────┘     └─────┬─────┘     └─────┬─────┘     └────┬─────┘
                       │                 │                │
                  Pass: merge      Pass: promote     Pass: canary
                  Fail: block      Fail: alert       Fail: rollback

GitHub Actions: Fast Eval on PR

# .github/workflows/ai-eval-pr.yml
name: AI Eval (PR)
on:
  pull_request:
    paths:
      - "prompts/**"
      - "src/ai/**"
      - "src/agents/**"
      - "eval/**"

concurrency:
  group: ai-eval-${{ github.head_ref }}
  cancel-in-progress: true

jobs:
  fast-eval:
    runs-on: ubuntu-latest
    timeout-minutes: 10
    env:
      OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # full history so origin/main exists for the diff below

      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
          cache: pip

      - run: pip install -r requirements-eval.txt

      - name: Detect changed prompts
        id: detect
        run: |
          changed=$(git diff --name-only origin/main...HEAD -- prompts/ src/ai/ src/agents/ eval/ | tr '\n' ',')
          echo "changed=$changed" >> $GITHUB_OUTPUT
          if [ -z "$changed" ]; then
            echo "skip=true" >> $GITHUB_OUTPUT
          fi

      - name: Run golden set eval
        if: steps.detect.outputs.skip != 'true'
        run: |
          python eval/run.py \
            --dataset eval/golden_set.jsonl \
            --output eval_results.json \
            --temperature 0 \
            --seed 42

      - name: Check regression
        if: steps.detect.outputs.skip != 'true'
        run: |
          python eval/check_regression.py \
            --results eval_results.json \
            --baseline eval/baselines/main.json \
            --threshold 0.03

      - name: Post results to PR
        if: always() && steps.detect.outputs.skip != 'true'
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const results = JSON.parse(fs.readFileSync('eval_results.json', 'utf8'));
            const summary = results.summary;
            // Build the comment from an array of lines: continuation lines in
            // a template literal would keep their YAML indentation and render
            // as a code block in the PR comment.
            const body = [
              '## AI Eval Results',
              '',
              '| Metric | Score | Threshold | Status |',
              '|--------|-------|-----------|--------|',
              ...Object.entries(summary.metrics).map(([k, v]) =>
                `| ${k} | ${v.mean.toFixed(3)} | ${v.threshold} | ${v.mean >= v.threshold ? '✅' : '❌'} |`
              ),
              '',
              `**Pass rate:** ${summary.passed}/${summary.total}`,
              `**Avg latency:** ${summary.avg_latency_ms.toFixed(0)}ms`,
            ].join('\n');

            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: body,
            });
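
The workflow above calls `eval/check_regression.py`, which is not shown in full. A minimal sketch of what it might contain, assuming the same results shape the PR-comment step reads (a `summary.metrics` map of entries with a `mean`):

```python
# eval/check_regression.py — sketch of the regression checker the PR workflow
# calls; the JSON layout mirrors eval_results.json used elsewhere in this doc.
import argparse
import json
import sys
from pathlib import Path

def check(results_path: str, baseline_path: str, threshold: float) -> bool:
    """Fail if any baseline metric's mean dropped by more than `threshold`."""
    results = json.loads(Path(results_path).read_text())["summary"]["metrics"]
    baseline = json.loads(Path(baseline_path).read_text())["summary"]["metrics"]
    ok = True
    for name, base in baseline.items():
        current = results.get(name, {}).get("mean", 0.0)
        drop = base["mean"] - current
        if drop > threshold:
            print(f"REGRESSION: {name} fell {drop:.3f} below baseline {base['mean']:.3f}")
            ok = False
    return ok

# CLI entry matching the workflow's flags; the argv guard keeps the module
# importable for testing.
if __name__ == "__main__" and len(sys.argv) > 1:
    p = argparse.ArgumentParser()
    p.add_argument("--results", required=True)
    p.add_argument("--baseline", required=True)
    p.add_argument("--threshold", type=float, default=0.03)
    args = p.parse_args()
    sys.exit(0 if check(args.results, args.baseline, args.threshold) else 1)
```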

Nightly Full Eval

# .github/workflows/ai-eval-nightly.yml
name: AI Eval (Nightly)
on:
  schedule:
    - cron: "0 3 * * *"  # 3 AM UTC
  workflow_dispatch:

jobs:
  full-eval:
    runs-on: ubuntu-latest
    timeout-minutes: 60
    permissions:
      contents: write  # allow the baseline-update step to push to main
    env:
      OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"

      - run: pip install -r requirements-eval.txt

      - name: Run full eval suite
        run: |
          python eval/run.py \
            --dataset eval/full_suite.jsonl \
            --output full_results.json \
            --temperature 0 \
            --seed 42 \
            --concurrency 20

      - name: Update baselines (if main)
        if: github.ref == 'refs/heads/main'
        run: |
          cp full_results.json eval/baselines/main.json
          git config user.name "github-actions"
          git config user.email "actions@github.com"
          git add eval/baselines/main.json
          git diff --cached --quiet || git commit -m "chore: update eval baseline"
          git push

      - name: Alert on regression
        if: failure()
        run: |
          curl -X POST "${{ secrets.SLACK_WEBHOOK }}" \
            -H 'Content-Type: application/json' \
            -d '{"text": "Nightly AI eval FAILED. Check: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}"}'

      - name: Upload results artifact
        uses: actions/upload-artifact@v4
        with:
          name: eval-results-${{ github.run_number }}
          path: full_results.json

Gating Deployments on Eval Scores

# eval/gate.py — deployment gate checker
import json
import sys
from pathlib import Path

GATES = {
    "accuracy": {"min": 0.90, "weight": 0.4},
    "faithfulness": {"min": 0.85, "weight": 0.3},
    "latency_p95_ms": {"max": 5000, "weight": 0.1},
    "cost_per_query": {"max": 0.05, "weight": 0.2},
}

def check_gates(results_path: str) -> bool:
    results = json.loads(Path(results_path).read_text())
    metrics = results["summary"]["metrics"]
    all_passed = True

    print("=" * 60)
    print("DEPLOYMENT GATE CHECK")
    print("=" * 60)

    for gate_name, gate_config in GATES.items():
        if gate_name not in metrics:
            # A missing metric must block the deploy, not silently pass.
            print(f"  [FAIL] {gate_name}: metric missing from results")
            all_passed = False
            continue
        value = metrics[gate_name]["mean"]

        if "min" in gate_config:
            passed = value >= gate_config["min"]
            threshold_str = f">= {gate_config['min']}"
        else:
            passed = value <= gate_config["max"]
            threshold_str = f"<= {gate_config['max']}"

        status = "PASS" if passed else "FAIL"
        print(f"  [{status}] {gate_name}: {value:.4f} (threshold: {threshold_str})")

        if not passed:
            all_passed = False

    print("=" * 60)
    print(f"RESULT: {'DEPLOY APPROVED' if all_passed else 'DEPLOY BLOCKED'}")
    return all_passed

if __name__ == "__main__":
    if not check_gates(sys.argv[1]):
        sys.exit(1)
# In deployment workflow
- name: Run pre-deploy eval
  run: python eval/run.py --dataset eval/golden_set.jsonl --output pre_deploy.json

- name: Check deployment gates
  run: python eval/gate.py pre_deploy.json

- name: Deploy to production
  if: success()
  run: ./deploy.sh

Monitoring Prompt and Model Drift

import json
import hashlib
from datetime import datetime, timezone
from pathlib import Path

class DriftMonitor:
    def __init__(self, history_path: str = "eval/drift_history.jsonl"):
        self.history_path = Path(history_path)

    def record_run(self, run_data: dict):
        """Append eval run to history."""
        run_data["timestamp"] = datetime.now(timezone.utc).isoformat()
        with open(self.history_path, "a") as f:
            f.write(json.dumps(run_data) + "\n")

    def detect_drift(self, metric: str, window: int = 7, threshold: float = 0.05) -> dict:
        """Detect if a metric is trending downward over recent runs."""
        runs = [json.loads(line) for line in self.history_path.read_text().strip().split("\n")]
        recent = runs[-window:]

        if len(recent) < 3:
            return {"drift_detected": False, "reason": "insufficient data"}

        values = [r["metrics"].get(metric, 0) for r in recent]

        # Simple trend: compare first half to second half
        mid = len(values) // 2
        first_half_avg = sum(values[:mid]) / mid
        second_half_avg = sum(values[mid:]) / (len(values) - mid)
        delta = second_half_avg - first_half_avg

        return {
            "drift_detected": abs(delta) > threshold,
            "direction": "declining" if delta < 0 else "improving",
            "delta": round(delta, 4),
            "recent_values": values,
            "first_half_avg": round(first_half_avg, 4),
            "second_half_avg": round(second_half_avg, 4),
        }

    def check_model_change(self, model_name: str, expected_hash: str) -> bool:
        """Detect if the model endpoint changed behavior (silent updates).

        Runs a fixed canary prompt at temperature 0 and hashes the output;
        if the hash changes, the model may have been updated. `call_model`
        is your own client wrapper, not shown here.
        """
        canary_output = call_model(model_name, "What is 2+2? Answer with just the number.")
        actual_hash = hashlib.sha256(canary_output.encode()).hexdigest()[:16]
        return actual_hash == expected_hash

Drift Alert in CI

- name: Check for drift
  run: |
    python -c "
    from eval.drift import DriftMonitor
    monitor = DriftMonitor()
    for metric in ['accuracy', 'faithfulness', 'latency_p50']:
        result = monitor.detect_drift(metric, window=7, threshold=0.03)
        if result['drift_detected']:
            print(f'DRIFT WARNING: {metric} is {result[\"direction\"]} by {result[\"delta\"]}')
    "

Versioning Prompts Alongside Code

Directory Structure

repo/
├── src/
│   └── ai/
│       ├── prompts/
│       │   ├── summarize.v3.txt
│       │   ├── classify.v2.txt
│       │   └── manifest.yaml       # Maps prompt names to active versions
│       └── llm_client.py
├── eval/
│   ├── golden_set.jsonl
│   ├── baselines/
│   │   └── main.json
│   └── run.py
└── .github/
    └── workflows/
        └── ai-eval-pr.yml

Prompt Manifest

# src/ai/prompts/manifest.yaml
prompts:
  summarize:
    active: v3
    model: gpt-4o
    temperature: 0
    max_tokens: 500
  classify:
    active: v2
    model: gpt-4o-mini
    temperature: 0
    max_tokens: 100

Loader with Version Pinning

from pathlib import Path
import yaml

class PromptLoader:
    def __init__(self, prompts_dir: str = "src/ai/prompts"):
        self.dir = Path(prompts_dir)
        self.manifest = yaml.safe_load((self.dir / "manifest.yaml").read_text())

    def get(self, name: str, version: str | None = None) -> tuple[str, dict]:
        config = self.manifest["prompts"][name]
        ver = version or config["active"]
        template = (self.dir / f"{name}.{ver}.txt").read_text()
        return template, config

    def get_all_active(self) -> dict:
        return {
            name: self.get(name)
            for name in self.manifest["prompts"]
        }

Canary Deployments for AI

import hashlib

class AICanaryDeployer:
    """Route a percentage of traffic to the new prompt/model version."""

    def __init__(self, canary_pct: float = 0.05):
        self.canary_pct = canary_pct
        self.metrics = {"canary": [], "stable": []}

    def route(self, request_id: str) -> str:
        """Deterministic routing based on request ID."""
        hash_val = int(hashlib.md5(request_id.encode()).hexdigest(), 16)
        if (hash_val % 100) < (self.canary_pct * 100):
            return "canary"
        return "stable"

    def record_metric(self, variant: str, score: float):
        self.metrics[variant].append(score)

    def should_promote(self, min_samples: int = 100, min_improvement: float = 0.0) -> dict:
        canary = self.metrics["canary"]
        stable = self.metrics["stable"]

        if len(canary) < min_samples or not stable:
            return {"decision": "wait", "canary_samples": len(canary)}

        canary_mean = sum(canary) / len(canary)
        stable_mean = sum(stable) / len(stable)
        delta = canary_mean - stable_mean

        if delta >= min_improvement:
            return {"decision": "promote", "canary_mean": canary_mean,
                    "stable_mean": stable_mean, "delta": delta}
        elif delta < -0.05:
            return {"decision": "rollback", "canary_mean": canary_mean,
                    "stable_mean": stable_mean, "delta": delta}
        else:
            return {"decision": "no_difference", "canary_mean": canary_mean,
                    "stable_mean": stable_mean, "delta": delta}

Cost Tracking in CI

# eval/cost_tracker.py
import json
from pathlib import Path

class CICostTracker:
    def __init__(self, budget_path: str = "eval/cost_budget.json"):
        self.budget = json.loads(Path(budget_path).read_text())

    def check_budget(self, run_cost: float) -> dict:
        daily_budget = self.budget["daily_eval_budget_usd"]
        monthly_spent = self.budget.get("month_to_date_usd", 0)
        monthly_budget = self.budget["monthly_eval_budget_usd"]

        return {
            "run_cost": run_cost,
            "daily_budget": daily_budget,
            "daily_ok": run_cost <= daily_budget,
            "monthly_spent": monthly_spent + run_cost,
            "monthly_budget": monthly_budget,
            "monthly_ok": (monthly_spent + run_cost) <= monthly_budget,
        }
# eval/cost_budget.json
{
  "daily_eval_budget_usd": 10.00,
  "monthly_eval_budget_usd": 200.00,
  "month_to_date_usd": 45.50
}
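
Computing `run_cost` in the first place requires per-token pricing. A sketch with placeholder per-million-token rates (real prices vary by model and change over time, so keep them in config, not code):

```python
# Illustrative per-million-token prices; these are placeholders, not current
# provider rates — load real values from config.
PRICES_PER_MTOK = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}

def run_cost(usage: list[dict]) -> float:
    """Sum cost over eval calls; each entry: model, input_tokens, output_tokens."""
    total = 0.0
    for call in usage:
        p = PRICES_PER_MTOK[call["model"]]
        total += call["input_tokens"] / 1e6 * p["input"]
        total += call["output_tokens"] / 1e6 * p["output"]
    return total

cost = run_cost([
    {"model": "gpt-4o-mini", "input_tokens": 200_000, "output_tokens": 50_000},
])
print(f"${cost:.4f}")  # → $0.0600 at the placeholder rates above
```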

Common Pitfalls

  1. Running full evals on every PR: Use a small golden set for PRs, full suite nightly.
  2. No caching: Cache LLM responses by input hash to avoid paying twice for unchanged prompts.
  3. Hard-coded thresholds: Store thresholds in config files, not in CI scripts.
  4. Ignoring cost: Track and budget eval API costs or they will surprise you.
  5. No rollback plan: Always have a way to revert to the previous prompt/model version instantly.
  6. Testing only accuracy: Gate on latency, cost, and safety metrics too.
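
Pitfall 2's cache can be sketched in a few lines: key the cache on everything that affects the completion, so any change to the prompt, model, or parameters misses the cache and triggers a fresh call (the cache path and function names here are illustrative):

```python
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path(".eval_cache")  # illustrative location; add to .gitignore

def cache_key(model: str, prompt: str, params: dict) -> str:
    # Hash every input that affects the completion; sort_keys makes the
    # key stable across dict orderings.
    payload = json.dumps({"model": model, "prompt": prompt, "params": params},
                         sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def cached_call(model: str, prompt: str, params: dict, call_fn) -> str:
    """Wrap an LLM call so unchanged inputs never hit the API twice."""
    CACHE_DIR.mkdir(exist_ok=True)
    path = CACHE_DIR / f"{cache_key(model, prompt, params)}.txt"
    if path.exists():
        return path.read_text()  # cache hit: zero API spend
    output = call_fn(model, prompt, params)
    path.write_text(output)
    return output
```

Note this only makes sense for deterministic eval runs (temperature 0, fixed seed, as in the workflows above); sampled outputs should not be cached.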

