llm-as-judge
Covers using LLMs to evaluate other LLM outputs: rubric design, pairwise comparison, reference-based and reference-free grading, calibration techniques, inter-rater reliability measurement, and cost-efficient judging strategies. Triggers: "LLM as judge", "use GPT to evaluate outputs", "AI grading AI", "rubric for LLM evaluation", "pairwise comparison", "LLM evaluator", "auto-grade LLM responses".
skilldb get ai-testing-evals-skills/llm-as-judge
Full skill: 451 lines

# LLM-as-Judge
Use LLMs to evaluate LLM outputs at scale. This skill covers rubric design, comparison methods, calibration, reliability measurement, and cost optimization.
## When to Use LLM-as-Judge
Use LLM-as-judge when:
- The task is subjective (style, helpfulness, reasoning quality)
- There is no single correct answer
- Human evaluation is too expensive or slow for the scale you need
- You need to evaluate thousands of outputs per day
Do NOT use LLM-as-judge when:
- Exact match or simple pattern matching suffices
- The judge model is weaker than the model being evaluated
- You have not calibrated against human judgments
## Rubric Design

### Single-Dimension Scoring
```python
HELPFULNESS_RUBRIC = """You are an expert evaluator. Rate the assistant's response on helpfulness.

Score 1 - Not helpful: Does not address the question. Irrelevant or empty response.
Score 2 - Slightly helpful: Addresses the topic but misses the main question.
Score 3 - Moderately helpful: Answers the question but with significant gaps.
Score 4 - Helpful: Answers the question well with minor omissions.
Score 5 - Very helpful: Comprehensive, accurate, and directly addresses the question.

Question: {question}
Response: {response}

Provide your score and reasoning in JSON format:
{{"score": <1-5>, "reasoning": "<explanation>"}}"""
```
### Multi-Dimension Scoring
```python
MULTI_RUBRIC = """Evaluate the response on these dimensions. Score each 1-5.

ACCURACY: Does the response contain factually correct information?
1=Major errors 2=Several errors 3=Minor errors 4=Mostly correct 5=Fully correct

COMPLETENESS: Does the response address all parts of the question?
1=Misses most 2=Partial 3=Addresses main point 4=Addresses most 5=Fully complete

CLARITY: Is the response well-organized and easy to understand?
1=Incoherent 2=Confusing 3=Acceptable 4=Clear 5=Excellent

CONCISENESS: Is the response appropriately concise without being terse?
1=Way too long/short 2=Too verbose/brief 3=Acceptable 4=Well-calibrated 5=Perfectly sized

Question: {question}
Reference answer: {reference}
Response to evaluate: {response}

Respond with JSON:
{{"accuracy": int, "completeness": int, "clarity": int, "conciseness": int, "overall": float, "reasoning": str}}

For "overall", compute a weighted average: accuracy*0.4 + completeness*0.3 + clarity*0.2 + conciseness*0.1"""
```
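Judges sometimes report an `overall` that does not match their own per-dimension scores, so it is worth recomputing the weighted average client-side. A minimal sketch (`WEIGHTS`, `weighted_overall`, and `check_overall` are illustrative helpers, not part of any library):

```python
# Weights mirror the rubric: accuracy 0.4, completeness 0.3, clarity 0.2, conciseness 0.1
WEIGHTS = {"accuracy": 0.4, "completeness": 0.3, "clarity": 0.2, "conciseness": 0.1}

def weighted_overall(scores: dict) -> float:
    """Recompute the rubric's weighted average from per-dimension scores."""
    return round(sum(scores[dim] * w for dim, w in WEIGHTS.items()), 2)

def check_overall(judge_output: dict, tolerance: float = 0.05) -> bool:
    """True if the judge's reported overall agrees with the recomputed one."""
    return abs(judge_output["overall"] - weighted_overall(judge_output)) <= tolerance
```

Flagging mismatches this way catches judges that ignore the weighting instruction, which is a cheap consistency signal on every call.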
### Scoring Functions
```python
import asyncio
import json

from openai import AsyncOpenAI

client = AsyncOpenAI()

async def judge_single(
    question: str,
    response: str,
    rubric: str,
    reference: str = "",
    model: str = "gpt-4o",
    temperature: float = 0.0,
) -> dict:
    """Score a single response using an LLM judge."""
    prompt = rubric.format(
        question=question,
        response=response,
        reference=reference,
    )
    result = await client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=temperature,
    )
    return json.loads(result.choices[0].message.content)

async def judge_batch(
    cases: list[dict],
    rubric: str,
    model: str = "gpt-4o",
    concurrency: int = 10,
) -> list[dict]:
    """Score multiple cases concurrently, bounded by a semaphore."""
    semaphore = asyncio.Semaphore(concurrency)

    async def bounded(case):
        async with semaphore:
            return await judge_single(
                question=case["question"],
                response=case["response"],
                rubric=rubric,
                reference=case.get("reference", ""),
                model=model,
            )

    return await asyncio.gather(*[bounded(c) for c in cases])
```
## Pairwise Comparison
More reliable than absolute scoring for subtle quality differences.
```python
PAIRWISE_PROMPT = """You are comparing two responses to the same question. Which response is better?

Question: {question}

Response A:
{response_a}

Response B:
{response_b}

Consider: accuracy, helpfulness, clarity, and completeness.

Rules:
- Choose the better response. If truly equal, choose "tie".
- Do not let response length bias your judgment.
- Do not let the order of presentation bias your judgment.

Respond with JSON:
{{"winner": "A" | "B" | "tie", "reasoning": "<explanation>", "confidence": "high" | "medium" | "low"}}"""
```
```python
def _escape_braces(s: str) -> str:
    """Escape literal braces so str.format() inside judge_single leaves them intact."""
    return s.replace("{", "{{").replace("}", "}}")

async def pairwise_compare(
    question: str,
    response_a: str,
    response_b: str,
    model: str = "gpt-4o",
    swap_order: bool = True,
) -> dict:
    """Compare two responses, optionally swapping order to reduce position bias."""
    a, b = _escape_braces(response_a), _escape_braces(response_b)

    # First comparison: A vs B
    result1 = await judge_single(
        question=question,
        response="",  # both responses are already embedded in the rubric
        rubric=PAIRWISE_PROMPT.replace("{response_a}", a).replace("{response_b}", b),
        model=model,
    )
    if not swap_order:
        return result1

    # Second comparison: B vs A (to detect position bias)
    result2 = await judge_single(
        question=question,
        response="",
        rubric=PAIRWISE_PROMPT.replace("{response_a}", b).replace("{response_b}", a),
        model=model,
    )

    # Map result2 back to the original labels (swap A/B)
    winner2_mapped = {"A": "B", "B": "A", "tie": "tie"}[result2["winner"]]

    # Aggregate: if both orderings agree, keep the verdict; otherwise flag position bias
    if result1["winner"] == winner2_mapped:
        return {"winner": result1["winner"], "consistent": True, "confidence": "high"}
    return {"winner": "tie", "consistent": False, "confidence": "low",
            "note": "Position bias detected — results disagreed when order swapped"}
```
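Over a full eval set, the individual pairwise verdicts roll up into a win rate. A small helper (`win_rate` is a hypothetical name; counting ties as half a win is a common convention, not the only option):

```python
def win_rate(results: list[dict]) -> float:
    """Fraction of pairwise judgments won by candidate A; a tie counts as half a win."""
    if not results:
        return 0.0
    points = sum(
        1.0 if r["winner"] == "A" else 0.5 if r["winner"] == "tie" else 0.0
        for r in results
    )
    return points / len(results)
```

A win rate meaningfully above 0.5 on a large enough sample suggests A is the stronger system; near 0.5, collect more comparisons before concluding anything.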
## Reference-Based vs Reference-Free Grading

### Reference-Based
```python
REFERENCE_BASED_RUBRIC = """Score how well the response matches the reference answer.

Question: {question}
Reference (gold standard): {reference}
Response to evaluate: {response}

Scoring:
5 - Semantically equivalent to the reference
4 - Captures all key points, minor differences in wording
3 - Captures most key points, some omissions
2 - Partially correct, misses major points
1 - Largely incorrect or irrelevant

Respond with JSON: {{"score": int, "reasoning": str}}"""
```
### Reference-Free
```python
REFERENCE_FREE_RUBRIC = """Evaluate this response based solely on the question asked.
You do NOT have a reference answer. Judge based on your own knowledge.

Question: {question}
Response: {response}

Evaluate on:
1. Factual accuracy (to the best of your knowledge)
2. Relevance to the question
3. Completeness
4. Clarity

Respond with JSON: {{"score": int, "accuracy_concerns": list[str], "reasoning": str}}"""
```
### When to Use Which
| Scenario | Method | Why |
|---|---|---|
| QA with known answers | Reference-based | Objective comparison to gold standard |
| Creative writing | Reference-free | No single correct answer exists |
| Summarization | Reference-based | Compare against expert summary |
| Open-ended chat | Reference-free | Many valid responses possible |
| Code review | Hybrid | Check correctness (ref) + style (free) |
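For the hybrid row, one simple approach is to blend the two signals with a weighted average. A sketch under the assumption that both judges return 1-5 scores (`hybrid_score` and the 0.6 default weight are illustrative, not a standard):

```python
def hybrid_score(ref_score: int, free_score: int, ref_weight: float = 0.6) -> float:
    """Blend a reference-based score (e.g. correctness vs. a gold answer) with a
    reference-free score (e.g. style) into one 1-5 number.
    ref_weight controls how much the reference-based signal counts."""
    return round(ref_weight * ref_score + (1 - ref_weight) * free_score, 2)
```

Weight the reference-based component higher when the gold standard is trustworthy; lower it when the references are noisy or incomplete.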
## Calibration

### Human-Judge Agreement
```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

def calibrate_judge(
    human_scores: list[int],
    llm_scores: list[int],
) -> dict:
    """Measure agreement between human and LLM judge scores."""
    assert len(human_scores) == len(llm_scores)
    n = len(human_scores)

    # Cohen's kappa (chance-adjusted agreement)
    kappa = cohen_kappa_score(human_scores, llm_scores)

    # Exact agreement rate
    exact = sum(h == m for h, m in zip(human_scores, llm_scores)) / n

    # Within-1 agreement (scores differ by at most 1)
    within_1 = sum(abs(h - m) <= 1 for h, m in zip(human_scores, llm_scores)) / n

    # Bias detection: does the LLM systematically score higher or lower?
    human_mean = np.mean(human_scores)
    llm_mean = np.mean(llm_scores)
    bias = llm_mean - human_mean

    return {
        "cohens_kappa": round(kappa, 3),
        "exact_agreement": round(exact, 3),
        "within_1_agreement": round(within_1, 3),
        "human_mean": round(float(human_mean), 2),
        "llm_mean": round(float(llm_mean), 2),
        "bias": round(float(bias), 2),
        "interpretation": interpret_kappa(kappa),
    }

def interpret_kappa(kappa: float) -> str:
    if kappa < 0.20: return "poor agreement — judge is unreliable"
    if kappa < 0.40: return "fair agreement — use with caution"
    if kappa < 0.60: return "moderate agreement — acceptable for screening"
    if kappa < 0.80: return "substantial agreement — good for production"
    return "near-perfect agreement — excellent"

# Example
result = calibrate_judge(
    human_scores=[5, 4, 3, 2, 5, 4, 3, 1, 5, 4],
    llm_scores= [5, 4, 4, 2, 5, 3, 3, 2, 4, 4],
)
print(result)
# {'cohens_kappa': 0.474, 'exact_agreement': 0.6, 'within_1_agreement': 1.0, ...}
```
### Inter-Rater Reliability
Use multiple judge calls to improve reliability.
```python
import asyncio
import statistics

async def multi_judge(
    question: str,
    response: str,
    rubric: str,
    model: str = "gpt-4o",
    n_judges: int = 3,
) -> dict:
    """Use multiple independent judge calls for reliability.

    Note: repeated calls only disagree if the judge samples with some
    randomness; a judge pinned to temperature 0 returns near-identical scores.
    """
    results = await asyncio.gather(
        *[judge_single(question, response, rubric, model=model) for _ in range(n_judges)]
    )
    scores = [r["score"] for r in results]
    return {
        "median_score": statistics.median(scores),
        "mean_score": statistics.mean(scores),
        "scores": scores,
        "agreement": len(set(scores)) == 1,
        "spread": max(scores) - min(scores),
        "individual_results": results,
    }

# If spread > 2, the rubric may be ambiguous — refine it
```
## Cost-Efficient Judging

### Cascading Judges
```python
async def cascading_judge(
    question: str,
    response: str,
    rubric: str,
) -> dict:
    """Use a cheap model first, escalate to an expensive model for borderline cases."""
    # Stage 1: fast, cheap judge
    fast_result = await judge_single(
        question, response, rubric, model="gpt-4o-mini"
    )
    score = fast_result["score"]

    # Clear pass or fail — no need for the expensive judge
    if score >= 4 or score <= 2:
        return {**fast_result, "judge_model": "gpt-4o-mini", "escalated": False}

    # Stage 2: borderline (score 3) — escalate to the better judge
    accurate_result = await judge_single(
        question, response, rubric, model="gpt-4o"
    )
    return {**accurate_result, "judge_model": "gpt-4o", "escalated": True}

# Cost savings: typically 60-70% of cases are clear, avoiding expensive API calls
```
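The savings claim can be sanity-checked with back-of-envelope arithmetic. A sketch (`cascade_cost` is a hypothetical helper; the per-call prices below are placeholders, not real API rates):

```python
def cascade_cost(n_cases: int, clear_rate: float,
                 cheap_cost: float, expensive_cost: float) -> dict:
    """Expected judging cost for a two-stage cascade vs. sending every case to
    the expensive judge. clear_rate is the fraction of cases the cheap judge
    resolves without escalation."""
    # Every case pays the cheap judge; only unclear cases also pay the expensive one
    cascade = n_cases * cheap_cost + n_cases * (1 - clear_rate) * expensive_cost
    baseline = n_cases * expensive_cost
    return {
        "cascade": round(cascade, 2),
        "baseline": round(baseline, 2),
        "savings_pct": round(100 * (1 - cascade / baseline), 1),
    }
```

For example, with 65% of cases resolved cheaply and a 10x price gap between judges, `cascade_cost(1000, 0.65, 0.001, 0.01)` reports a cascade cost of 4.5 against a baseline of 10.0, about a 55% saving, consistent with the 60-70% clear-case estimate above.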
### Sampling Strategy
```python
import random
from collections import defaultdict

def select_eval_sample(
    outputs: list[dict],
    sample_size: int = 100,
    strategy: str = "stratified",
) -> list[dict]:
    """Select a representative sample for LLM judging."""
    if strategy == "random":
        return random.sample(outputs, min(sample_size, len(outputs)))

    if strategy == "stratified":
        # Group by category/difficulty, sample proportionally
        groups = defaultdict(list)
        for o in outputs:
            groups[o.get("category", "default")].append(o)
        sample = []
        per_group = max(1, sample_size // len(groups))
        for group_outputs in groups.values():
            sample.extend(random.sample(group_outputs, min(per_group, len(group_outputs))))
        return sample[:sample_size]

    if strategy == "uncertainty":
        # Prioritize outputs where a cheap judge was uncertain (borderline score of 3)
        borderline = [o for o in outputs if o.get("fast_score") == 3]
        non_borderline = [o for o in outputs if o.get("fast_score") != 3]
        remaining = max(0, sample_size - len(borderline))
        return (borderline[:sample_size] +
                random.sample(non_borderline, min(remaining, len(non_borderline))))

    raise ValueError(f"unknown strategy: {strategy}")
```
## Reducing Judge Bias
```python
DEBIASING_TECHNIQUES = {
    "position_swap": "Present responses in both orders and check consistency",
    "name_blind": "Remove model names from responses before judging",
    "length_control": "Instruct judge to ignore length differences",
    "chain_of_thought": "Require reasoning before score to improve calibration",
    "few_shot_anchoring": "Include scored examples in the prompt to anchor the scale",
}

# Few-shot anchoring example
ANCHORED_RUBRIC = """Rate the response 1-5 on helpfulness.

Example — Score 5 (Very helpful):
Q: "How do I sort a list in Python?"
A: "Use sorted() for a new list or list.sort() for in-place. Both accept key= and reverse= parameters. Example: sorted([3,1,2]) returns [1,2,3]."

Example — Score 2 (Slightly helpful):
Q: "How do I sort a list in Python?"
A: "Python has many features for working with lists."

Now evaluate:
Question: {question}
Response: {response}

JSON: {{"score": int, "reasoning": str}}"""
```
## Common Pitfalls
- **Self-evaluation bias:** GPT-4 prefers GPT-4 outputs. Use a different model family as judge when possible.
- **Position bias:** The first response in a pairwise comparison is favored. Always swap and check.
- **Length bias:** Longer responses get higher scores. Explicitly instruct against this.
- **Vague rubrics:** "Is it good?" fails. Define exactly what each score level means.
- **No calibration:** Always measure agreement with human labels before trusting LLM judges.
- **Single-call judging:** One judge call is noisy. Use 3+ calls for important evaluations.
Install this skill directly: `skilldb add ai-testing-evals-skills`
## Related Skills
### agent-trajectory-testing
Covers testing AI agent behavior end-to-end: trajectory evaluation, tool-call sequence validation, multi-step correctness verification, stuck-loop detection, cost regression testing, and timeout handling. Triggers: "test my AI agent", "agent trajectory evaluation", "tool call testing", "multi-step agent testing", "agent stuck detection", "agent cost regression", "validate agent behavior".

### ci-cd-for-ai
Covers implementing CI/CD pipelines for AI applications: running LLM evals in GitHub Actions, gating deployments on eval scores, monitoring prompt and model drift, versioning prompts alongside code, cost tracking, and canary deployments for AI features. Triggers: "CI for AI", "run evals in GitHub Actions", "gate deployment on eval score", "prompt drift detection", "version prompts in CI", "AI deployment pipeline", "LLM CI/CD".

### eval-frameworks
Covers popular LLM evaluation frameworks and how to use them: Braintrust, Promptfoo, RAGAS, DeepEval, LangSmith, and custom eval harnesses. Includes setup, configuration, writing eval cases, CI integration, and choosing the right framework for your use case. Triggers: "eval framework", "Braintrust setup", "Promptfoo config", "RAGAS evaluation", "DeepEval", "LangSmith evals", "custom eval harness", "which eval tool should I use".

### llm-eval-fundamentals
Covers the foundations of evaluating LLM-powered applications: why evaluation matters, the taxonomy of metric types (exact match, semantic similarity, LLM-as-judge), building and curating eval datasets, establishing baselines, detecting regressions, and designing eval pipelines that scale from prototyping through production. Triggers: "evaluate my LLM app", "set up evals", "how do I measure LLM quality", "create an eval pipeline", "LLM metrics", "eval dataset".

### prompt-testing
Covers testing and hardening prompts for LLM applications: prompt regression testing, A/B testing prompt variants, temperature sensitivity analysis, edge case libraries, prompt versioning strategies, and golden test sets. Triggers: "test my prompt", "prompt regression", "A/B test prompts", "prompt versioning", "temperature sensitivity", "golden test set for prompts", "prompt quality assurance".

### red-teaming-ai
Covers red-teaming AI applications for safety and robustness: adversarial prompt testing, jailbreak resistance evaluation, PII leakage detection, hallucination measurement, bias detection, safety benchmarks, and building automated red-team pipelines. Triggers: "red team my AI", "adversarial testing for LLMs", "jailbreak testing", "PII leakage test", "hallucination detection", "AI bias testing", "safety benchmark", "AI security testing".