ci-cd-for-ai
Covers implementing CI/CD pipelines for AI applications: running LLM evals in GitHub Actions, gating deployments on eval scores, monitoring prompt and model drift, versioning prompts alongside code, cost tracking, and canary deployments for AI features. Triggers: "CI for AI", "run evals in GitHub Actions", "gate deployment on eval score", "prompt drift detection", "version prompts in CI", "AI deployment pipeline", "LLM CI/CD".
CI/CD for AI Applications
Build deployment pipelines that treat AI quality as a first-class gate. This skill covers running evals in CI, gating on scores, monitoring drift, and versioning prompts alongside code.
The AI CI/CD Problem
Traditional CI tests are deterministic: they pass or fail. AI applications are probabilistic:
- The same prompt can produce different outputs
- Model updates change behavior without code changes
- Quality is a spectrum, not a binary
- Evals cost real money (API calls)
Your CI pipeline must handle all of this.
Pipeline Architecture
┌──────────┐ ┌──────────┐ ┌───────────┐ ┌──────────┐
│ PR Open │────>│ Fast Eval│────>│ Full Eval │────>│ Deploy │
│ │ │ (golden │ │ (nightly) │ │ (gated) │
│ │ │ set) │ │ │ │ │
└──────────┘ └────┬─────┘ └─────┬─────┘ └────┬─────┘
│ │ │
Pass: merge Pass: promote Pass: canary
Fail: block Fail: alert Fail: rollback
GitHub Actions: Fast Eval on PR
# .github/workflows/ai-eval-pr.yml
name: AI Eval (PR)
on:
pull_request:
paths:
- "prompts/**"
- "src/ai/**"
- "src/agents/**"
- "eval/**"
concurrency:
group: ai-eval-${{ github.head_ref }}
cancel-in-progress: true
jobs:
fast-eval:
runs-on: ubuntu-latest
timeout-minutes: 10
env:
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # full history so `git diff origin/main` below works
- uses: actions/setup-python@v5
with:
python-version: "3.12"
cache: pip
- run: pip install -r requirements-eval.txt
- name: Detect changed prompts
id: detect
run: |
changed=$(git diff --name-only origin/main -- prompts/ src/ai/ | tr '\n' ',')
echo "changed=$changed" >> $GITHUB_OUTPUT
if [ -z "$changed" ]; then
echo "skip=true" >> $GITHUB_OUTPUT
fi
- name: Run golden set eval
if: steps.detect.outputs.skip != 'true'
run: |
python eval/run.py \
--dataset eval/golden_set.jsonl \
--output eval_results.json \
--temperature 0 \
--seed 42
- name: Check regression
if: steps.detect.outputs.skip != 'true'
run: |
python eval/check_regression.py \
--results eval_results.json \
--baseline eval/baselines/main.json \
--threshold 0.03
- name: Post results to PR
if: always() && steps.detect.outputs.skip != 'true'
uses: actions/github-script@v7
with:
script: |
const fs = require('fs');
const results = JSON.parse(fs.readFileSync('eval_results.json', 'utf8'));
const summary = results.summary;
const body = `## AI Eval Results
| Metric | Score | Threshold | Status |
|--------|-------|-----------|--------|
${Object.entries(summary.metrics).map(([k, v]) =>
`| ${k} | ${v.mean.toFixed(3)} | ${v.threshold} | ${v.mean >= v.threshold ? '✅' : '❌'} |`
).join('\n')}
**Pass rate:** ${summary.passed}/${summary.total}
**Avg latency:** ${summary.avg_latency_ms.toFixed(0)}ms`;
github.rest.issues.createComment({
issue_number: context.issue.number,
owner: context.repo.owner,
repo: context.repo.repo,
body: body,
});
Nightly Full Eval
# .github/workflows/ai-eval-nightly.yml
name: AI Eval (Nightly)
on:
schedule:
- cron: "0 3 * * *" # 3 AM UTC
workflow_dispatch:
jobs:
full-eval:
    runs-on: ubuntu-latest
    timeout-minutes: 60
    permissions:
      contents: write  # required to push the refreshed baseline back to main
env:
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with:
python-version: "3.12"
- run: pip install -r requirements-eval.txt
- name: Run full eval suite
run: |
python eval/run.py \
--dataset eval/full_suite.jsonl \
--output full_results.json \
--temperature 0 \
--seed 42 \
--concurrency 20
- name: Update baselines (if main)
if: github.ref == 'refs/heads/main'
run: |
cp full_results.json eval/baselines/main.json
git config user.name "github-actions"
git config user.email "actions@github.com"
git add eval/baselines/main.json
git diff --cached --quiet || git commit -m "chore: update eval baseline"
git push
- name: Alert on regression
if: failure()
run: |
curl -X POST "${{ secrets.SLACK_WEBHOOK }}" \
-H 'Content-Type: application/json' \
-d '{"text": "Nightly AI eval FAILED. Check: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}"}'
- name: Upload results artifact
uses: actions/upload-artifact@v4
with:
name: eval-results-${{ github.run_number }}
path: full_results.json
Gating Deployments on Eval Scores
# eval/gate.py — deployment gate checker
import json
import sys
from pathlib import Path
GATES = {
"accuracy": {"min": 0.90, "weight": 0.4},
"faithfulness": {"min": 0.85, "weight": 0.3},
"latency_p95_ms": {"max": 5000, "weight": 0.1},
"cost_per_query": {"max": 0.05, "weight": 0.2},
}
def check_gates(results_path: str) -> bool:
results = json.loads(Path(results_path).read_text())
metrics = results["summary"]["metrics"]
all_passed = True
print("=" * 60)
print("DEPLOYMENT GATE CHECK")
print("=" * 60)
    for gate_name, gate_config in GATES.items():
        metric = metrics.get(gate_name)
        if metric is None:
            # Missing metrics fail closed — defaulting to 0 would silently pass "max" gates.
            print(f"  [FAIL] {gate_name}: metric missing from results")
            all_passed = False
            continue
        value = metric["mean"]
        if "min" in gate_config:
            passed = value >= gate_config["min"]
            threshold_str = f">= {gate_config['min']}"
        else:
            passed = value <= gate_config["max"]
            threshold_str = f"<= {gate_config['max']}"
status = "PASS" if passed else "FAIL"
print(f" [{status}] {gate_name}: {value:.4f} (threshold: {threshold_str})")
if not passed:
all_passed = False
print("=" * 60)
print(f"RESULT: {'DEPLOY APPROVED' if all_passed else 'DEPLOY BLOCKED'}")
return all_passed
if __name__ == "__main__":
if not check_gates(sys.argv[1]):
sys.exit(1)
# In deployment workflow
- name: Run pre-deploy eval
run: python eval/run.py --dataset eval/golden_set.jsonl --output pre_deploy.json
- name: Check deployment gates
run: python eval/gate.py pre_deploy.json
- name: Deploy to production
if: success()
run: ./deploy.sh
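Note that `gate.py` above hard-codes `GATES`, which the pitfalls section below warns against. A sketch of loading gates from a config file instead (the `eval/gates.json` path is hypothetical; its shape mirrors `GATES`):

```python
# Sketch: load gate thresholds from config instead of hard-coding them.
import json
from pathlib import Path

def load_gates(config_path: str = "eval/gates.json") -> dict:
    return json.loads(Path(config_path).read_text())

def gate_passes(value: float, gate: dict) -> bool:
    """A gate passes when the value satisfies its 'min' and/or 'max' bound."""
    if "min" in gate and value < gate["min"]:
        return False
    if "max" in gate and value > gate["max"]:
        return False
    return True
```

With this in place, tightening a threshold becomes a reviewed config change rather than an edit to CI scripts.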
Monitoring Prompt and Model Drift
# eval/drift.py — drift and model-change monitoring
import json
import hashlib
from datetime import datetime, timezone
from pathlib import Path

class DriftMonitor:
    def __init__(self, history_path: str = "eval/drift_history.jsonl"):
        self.history_path = Path(history_path)

    def record_run(self, run_data: dict):
        """Append an eval run to the history file."""
        run_data["timestamp"] = datetime.now(timezone.utc).isoformat()
        with open(self.history_path, "a") as f:
            f.write(json.dumps(run_data) + "\n")
def detect_drift(self, metric: str, window: int = 7, threshold: float = 0.05) -> dict:
"""Detect if a metric is trending downward over recent runs."""
runs = [json.loads(line) for line in self.history_path.read_text().strip().split("\n")]
recent = runs[-window:]
if len(recent) < 3:
return {"drift_detected": False, "reason": "insufficient data"}
values = [r["metrics"].get(metric, 0) for r in recent]
# Simple trend: compare first half to second half
mid = len(values) // 2
first_half_avg = sum(values[:mid]) / mid
second_half_avg = sum(values[mid:]) / (len(values) - mid)
delta = second_half_avg - first_half_avg
return {
"drift_detected": abs(delta) > threshold,
"direction": "declining" if delta < 0 else "improving",
"delta": round(delta, 4),
"recent_values": values,
"first_half_avg": round(first_half_avg, 4),
"second_half_avg": round(second_half_avg, 4),
}
    def check_model_change(self, model_name: str, expected_hash: str) -> bool:
        """Detect silent model updates by hashing a fixed canary prompt's output.

        `call_model` is a placeholder for your own client wrapper; run the
        canary at temperature 0 so the output is as stable as possible.
        If the hash changes, the provider may have updated the model.
        """
        canary_output = call_model(model_name, "What is 2+2? Answer with just the number.")
        actual_hash = hashlib.sha256(canary_output.encode()).hexdigest()[:16]
        return actual_hash == expected_hash
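The first-half/second-half split in `detect_drift` is deliberately simple; a least-squares slope over the same window is barely more code and less sensitive to where the split lands. A sketch:

```python
def trend_slope(values: list[float]) -> float:
    """Least-squares slope of a metric over run index; negative means declining."""
    n = len(values)
    x_mean = (n - 1) / 2  # mean of indices 0..n-1
    y_mean = sum(values) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    den = sum((x - x_mean) ** 2 for x in range(n))
    return num / den if den else 0.0
```

Comparing the slope against a per-run decline budget (e.g. alert when slope < -0.01 per run) gives a threshold that is independent of window size.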
Drift Alert in CI
- name: Check for drift
run: |
python -c "
from eval.drift import DriftMonitor
monitor = DriftMonitor()
for metric in ['accuracy', 'faithfulness', 'latency_p50']:
result = monitor.detect_drift(metric, window=7, threshold=0.03)
if result['drift_detected']:
print(f'DRIFT WARNING: {metric} is {result[\"direction\"]} by {result[\"delta\"]}')
"
Versioning Prompts Alongside Code
Directory Structure
repo/
├── src/
│ └── ai/
│ ├── prompts/
│ │ ├── summarize.v3.txt
│ │ ├── classify.v2.txt
│ │ └── manifest.yaml # Maps prompt names to active versions
│ └── llm_client.py
├── eval/
│ ├── golden_set.jsonl
│ ├── baselines/
│ │ └── main.json
│ └── run.py
└── .github/
└── workflows/
└── ai-eval-pr.yml
Prompt Manifest
# src/ai/prompts/manifest.yaml
prompts:
summarize:
active: v3
model: gpt-4o
temperature: 0
max_tokens: 500
classify:
active: v2
model: gpt-4o-mini
temperature: 0
max_tokens: 100
Loader with Version Pinning
from pathlib import Path
import yaml
class PromptLoader:
def __init__(self, prompts_dir: str = "src/ai/prompts"):
self.dir = Path(prompts_dir)
self.manifest = yaml.safe_load((self.dir / "manifest.yaml").read_text())
    def get(self, name: str, version: str | None = None) -> tuple[str, dict]:
config = self.manifest["prompts"][name]
ver = version or config["active"]
template = (self.dir / f"{name}.{ver}.txt").read_text()
return template, config
def get_all_active(self) -> dict:
return {
name: self.get(name)
for name in self.manifest["prompts"]
}
Canary Deployments for AI
import hashlib
class AICanaryDeployer:
"""Route a percentage of traffic to the new prompt/model version."""
def __init__(self, canary_pct: float = 0.05):
self.canary_pct = canary_pct
self.metrics = {"canary": [], "stable": []}
def route(self, request_id: str) -> str:
"""Deterministic routing based on request ID."""
hash_val = int(hashlib.md5(request_id.encode()).hexdigest(), 16)
if (hash_val % 100) < (self.canary_pct * 100):
return "canary"
return "stable"
def record_metric(self, variant: str, score: float):
self.metrics[variant].append(score)
    def should_promote(self, min_samples: int = 100, min_improvement: float = 0.0) -> dict:
        canary = self.metrics["canary"]
        stable = self.metrics["stable"]
        # Wait for enough samples on both sides before comparing means.
        if len(canary) < min_samples or not stable:
            return {"decision": "wait", "canary_samples": len(canary),
                    "stable_samples": len(stable)}
        canary_mean = sum(canary) / len(canary)
        stable_mean = sum(stable) / len(stable)
        delta = canary_mean - stable_mean
if delta >= min_improvement:
return {"decision": "promote", "canary_mean": canary_mean,
"stable_mean": stable_mean, "delta": delta}
elif delta < -0.05:
return {"decision": "rollback", "canary_mean": canary_mean,
"stable_mean": stable_mean, "delta": delta}
else:
return {"decision": "no_difference", "canary_mean": canary_mean,
"stable_mean": stable_mean, "delta": delta}
Cost Tracking in CI
# eval/cost_tracker.py
import json
from pathlib import Path
class CICostTracker:
def __init__(self, budget_path: str = "eval/cost_budget.json"):
self.budget = json.loads(Path(budget_path).read_text())
def check_budget(self, run_cost: float) -> dict:
daily_budget = self.budget["daily_eval_budget_usd"]
monthly_spent = self.budget.get("month_to_date_usd", 0)
monthly_budget = self.budget["monthly_eval_budget_usd"]
return {
"run_cost": run_cost,
"daily_budget": daily_budget,
"daily_ok": run_cost <= daily_budget,
"monthly_spent": monthly_spent + run_cost,
"monthly_budget": monthly_budget,
"monthly_ok": (monthly_spent + run_cost) <= monthly_budget,
}
# eval/cost_budget.json
{
"daily_eval_budget_usd": 10.00,
"monthly_eval_budget_usd": 200.00,
"month_to_date_usd": 45.50
}
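The `run_cost` figure passed to `check_budget` has to come from somewhere. A sketch that sums per-request token costs — the per-million-token prices below are illustrative placeholders, so substitute the current rates for the models you actually use:

```python
# Sketch: estimate an eval run's API cost from token counts.
# Prices are placeholder (input, output) USD rates per 1M tokens.
PRICE_PER_1M = {
    "gpt-4o": (2.50, 10.00),
    "gpt-4o-mini": (0.15, 0.60),
}

def estimate_run_cost(requests: list[dict]) -> float:
    """requests: [{"model": ..., "input_tokens": ..., "output_tokens": ...}, ...]"""
    total = 0.0
    for r in requests:
        in_price, out_price = PRICE_PER_1M[r["model"]]
        total += r["input_tokens"] / 1e6 * in_price
        total += r["output_tokens"] / 1e6 * out_price
    return total
```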
Common Pitfalls
- Running full evals on every PR: Use a small golden set for PRs, full suite nightly.
- No caching: Cache LLM responses by input hash to avoid paying twice for unchanged prompts.
- Hard-coded thresholds: Store thresholds in config files, not in CI scripts.
- Ignoring cost: Track and budget eval API costs or they will surprise you.
- No rollback plan: Always have a way to revert to the previous prompt/model version instantly.
- Testing only accuracy: Gate on latency, cost, and safety metrics too.
Install this skill directly: skilldb add ai-testing-evals-skills
Related Skills
agent-trajectory-testing
Covers testing AI agent behavior end-to-end: trajectory evaluation, tool-call sequence validation, multi-step correctness verification, stuck-loop detection, cost regression testing, and timeout handling. Triggers: "test my AI agent", "agent trajectory evaluation", "tool call testing", "multi-step agent testing", "agent stuck detection", "agent cost regression", "validate agent behavior".
eval-frameworks
Covers popular LLM evaluation frameworks and how to use them: Braintrust, Promptfoo, RAGAS, DeepEval, LangSmith, and custom eval harnesses. Includes setup, configuration, writing eval cases, CI integration, and choosing the right framework for your use case. Triggers: "eval framework", "Braintrust setup", "Promptfoo config", "RAGAS evaluation", "DeepEval", "LangSmith evals", "custom eval harness", "which eval tool should I use".
llm-as-judge
Covers using LLMs to evaluate other LLM outputs: rubric design, pairwise comparison, reference-based and reference-free grading, calibration techniques, inter-rater reliability measurement, and cost-efficient judging strategies. Triggers: "LLM as judge", "use GPT to evaluate outputs", "AI grading AI", "rubric for LLM evaluation", "pairwise comparison", "LLM evaluator", "auto-grade LLM responses".
llm-eval-fundamentals
Covers the foundations of evaluating LLM-powered applications: why evaluation matters, the taxonomy of metric types (exact match, semantic similarity, LLM-as-judge), building and curating eval datasets, establishing baselines, detecting regressions, and designing eval pipelines that scale from prototyping through production. Triggers: "evaluate my LLM app", "set up evals", "how do I measure LLM quality", "create an eval pipeline", "LLM metrics", "eval dataset".
prompt-testing
Covers testing and hardening prompts for LLM applications: prompt regression testing, A/B testing prompt variants, temperature sensitivity analysis, edge case libraries, prompt versioning strategies, and golden test sets. Triggers: "test my prompt", "prompt regression", "A/B test prompts", "prompt versioning", "temperature sensitivity", "golden test set for prompts", "prompt quality assurance".
red-teaming-ai
Covers red-teaming AI applications for safety and robustness: adversarial prompt testing, jailbreak resistance evaluation, PII leakage detection, hallucination measurement, bias detection, safety benchmarks, and building automated red-team pipelines. Triggers: "red team my AI", "adversarial testing for LLMs", "jailbreak testing", "PII leakage test", "hallucination detection", "AI bias testing", "safety benchmark", "AI security testing".