
llm-as-judge

Covers using LLMs to evaluate other LLM outputs: rubric design, pairwise comparison, reference-based and reference-free grading, calibration techniques, inter-rater reliability measurement, and cost-efficient judging strategies. Triggers: "LLM as judge", "use GPT to evaluate outputs", "AI grading AI", "rubric for LLM evaluation", "pairwise comparison", "LLM evaluator", "auto-grade LLM responses".


LLM-as-Judge

Use LLMs to evaluate LLM outputs at scale. This skill covers rubric design, comparison methods, calibration, reliability measurement, and cost optimization.


When to Use LLM-as-Judge

Use LLM-as-judge when:

  • The task is subjective (style, helpfulness, reasoning quality)
  • There is no single correct answer
  • Human evaluation is too expensive or slow for the scale you need
  • You need to evaluate thousands of outputs per day

Do NOT use LLM-as-judge when:

  • Exact match or simple pattern matching suffices
  • The judge model is weaker than the model being evaluated
  • You have not calibrated against human judgments

Rubric Design

Single-Dimension Scoring

HELPFULNESS_RUBRIC = """You are an expert evaluator. Rate the assistant's response on helpfulness.

Score 1 - Not helpful: Does not address the question. Irrelevant or empty response.
Score 2 - Slightly helpful: Addresses the topic but misses the main question.
Score 3 - Moderately helpful: Answers the question but with significant gaps.
Score 4 - Helpful: Answers the question well with minor omissions.
Score 5 - Very helpful: Comprehensive, accurate, and directly addresses the question.

Question: {question}
Response: {response}

Provide your score and reasoning in JSON format:
{{"score": <1-5>, "reasoning": "<explanation>"}}"""

Multi-Dimension Scoring

MULTI_RUBRIC = """Evaluate the response on these dimensions. Score each 1-5.

ACCURACY: Does the response contain factually correct information?
  1=Major errors  2=Several errors  3=Minor errors  4=Mostly correct  5=Fully correct

COMPLETENESS: Does the response address all parts of the question?
  1=Misses most  2=Partial  3=Addresses main point  4=Addresses most  5=Fully complete

CLARITY: Is the response well-organized and easy to understand?
  1=Incoherent  2=Confusing  3=Acceptable  4=Clear  5=Excellent

CONCISENESS: Is the response appropriately concise without being terse?
  1=Way too long/short  2=Too verbose/brief  3=Acceptable  4=Well-calibrated  5=Perfectly sized

Question: {question}
Reference answer: {reference}
Response to evaluate: {response}

Respond with JSON:
{{"accuracy": int, "completeness": int, "clarity": int, "conciseness": int, "overall": float, "reasoning": str}}

For "overall", compute a weighted average: accuracy*0.4 + completeness*0.3 + clarity*0.2 + conciseness*0.1"""
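
Judges sometimes get the arithmetic wrong, so the weighted overall is best recomputed client-side from the four dimension scores rather than trusted from the model. A minimal sketch (the helper name and weights dict are illustrative; the weights mirror those stated in the rubric):

```python
# Weights as stated in MULTI_RUBRIC: accuracy 0.4, completeness 0.3,
# clarity 0.2, conciseness 0.1 (hypothetical helper, not part of the skill)
WEIGHTS = {"accuracy": 0.4, "completeness": 0.3, "clarity": 0.2, "conciseness": 0.1}

def recompute_overall(judgment: dict) -> float:
    """Recompute the weighted overall instead of trusting the judge's arithmetic."""
    return round(sum(judgment[dim] * w for dim, w in WEIGHTS.items()), 2)

recompute_overall({"accuracy": 5, "completeness": 4, "clarity": 4, "conciseness": 3})  # 4.3
```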

Scoring Functions

import json
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def judge_single(
    question: str,
    response: str,
    rubric: str,
    reference: str = "",
    model: str = "gpt-4o",
) -> dict:
    """Score a single response using an LLM judge."""
    prompt = rubric.format(
        question=question,
        response=response,
        reference=reference,
    )
    result = await client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(result.choices[0].message.content)
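
Even with JSON mode, the judge's output is model-generated, so it is worth validating the parsed dict before it enters any aggregate. A hypothetical guard, not part of the skill's API:

```python
def validate_judgment(judgment: dict, low: int = 1, high: int = 5) -> dict:
    """Reject malformed judge output (missing or out-of-range score) early."""
    score = judgment.get("score")
    if not isinstance(score, int) or not (low <= score <= high):
        raise ValueError(f"invalid judge score: {score!r}")
    return judgment
```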


async def judge_batch(
    cases: list[dict],
    rubric: str,
    model: str = "gpt-4o",
    concurrency: int = 10,
) -> list[dict]:
    """Score multiple cases concurrently."""
    import asyncio
    semaphore = asyncio.Semaphore(concurrency)

    async def bounded(case):
        async with semaphore:
            return await judge_single(
                question=case["question"],
                response=case["response"],
                rubric=rubric,
                reference=case.get("reference", ""),
                model=model,
            )

    return await asyncio.gather(*[bounded(c) for c in cases])

Pairwise Comparison

Pairwise comparison is often more reliable than absolute scoring when the quality differences between responses are subtle.

PAIRWISE_PROMPT = """You are comparing two responses to the same question. Which response is better?

Question: {question}

Response A:
{response_a}

Response B:
{response_b}

Consider: accuracy, helpfulness, clarity, and completeness.

Rules:
- Choose the better response. If truly equal, choose "tie".
- Do not let response length bias your judgment.
- Do not let the order of presentation bias your judgment.

Respond with JSON:
{{"winner": "A" | "B" | "tie", "reasoning": "<explanation>", "confidence": "high" | "medium" | "low"}}"""

async def _pairwise_once(
    question: str,
    response_a: str,
    response_b: str,
    model: str,
) -> dict:
    """Run one pairwise comparison in the given presentation order."""
    prompt = PAIRWISE_PROMPT.format(
        question=question, response_a=response_a, response_b=response_b
    )
    result = await client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(result.choices[0].message.content)


async def pairwise_compare(
    question: str,
    response_a: str,
    response_b: str,
    model: str = "gpt-4o",
    swap_order: bool = True,
) -> dict:
    """Compare two responses, optionally swapping order to reduce position bias."""

    # First comparison: A vs B
    result1 = await _pairwise_once(question, response_a, response_b, model)

    if not swap_order:
        return result1

    # Second comparison: B vs A (to detect position bias)
    result2 = await _pairwise_once(question, response_b, response_a, model)

    # Map result2 back to the original labels (swap A/B)
    winner2_mapped = {"A": "B", "B": "A", "tie": "tie"}[result2["winner"]]

    # Aggregate: agreement in both orders means a confident verdict
    if result1["winner"] == winner2_mapped:
        return {"winner": result1["winner"], "consistent": True, "confidence": "high"}
    return {"winner": "tie", "consistent": False, "confidence": "low",
            "note": "Position bias detected — results disagreed when order swapped"}
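
Across a test set, per-case verdicts can be rolled up into win rates. A small aggregation sketch (assumes result dicts with a "winner" key as above; counting each tie as half a win is one common reporting convention, not something this skill mandates):

```python
def aggregate_pairwise(results: list[dict]) -> dict:
    """Roll per-case pairwise verdicts up into win rates for a test set."""
    n = len(results)
    wins_a = sum(r["winner"] == "A" for r in results)
    wins_b = sum(r["winner"] == "B" for r in results)
    ties = n - wins_a - wins_b
    return {
        "win_rate_a": round(wins_a / n, 3),
        "win_rate_b": round(wins_b / n, 3),
        "tie_rate": round(ties / n, 3),
        # ties counted as half a win each, a common reporting convention
        "score_a": round((wins_a + 0.5 * ties) / n, 3),
    }
```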

Reference-Based vs Reference-Free Grading

Reference-Based

REFERENCE_BASED_RUBRIC = """Score how well the response matches the reference answer.

Question: {question}
Reference (gold standard): {reference}
Response to evaluate: {response}

Scoring:
5 - Semantically equivalent to the reference
4 - Captures all key points, minor differences in wording
3 - Captures most key points, some omissions
2 - Partially correct, misses major points
1 - Largely incorrect or irrelevant

Respond with JSON: {{"score": int, "reasoning": str}}"""

Reference-Free

REFERENCE_FREE_RUBRIC = """Evaluate this response based solely on the question asked.
You do NOT have a reference answer. Judge based on your own knowledge.

Question: {question}
Response: {response}

Evaluate on:
1. Factual accuracy (to the best of your knowledge)
2. Relevance to the question
3. Completeness
4. Clarity

Respond with JSON: {{"score": int, "accuracy_concerns": list[str], "reasoning": str}}"""

When to Use Which

| Scenario               | Method          | Why                                    |
|------------------------|-----------------|----------------------------------------|
| QA with known answers  | Reference-based | Objective comparison to gold standard  |
| Creative writing       | Reference-free  | No single correct answer exists        |
| Summarization          | Reference-based | Compare against expert summary         |
| Open-ended chat        | Reference-free  | Many valid responses possible          |
| Code review            | Hybrid          | Check correctness (ref) + style (free) |

Calibration

Human-Judge Agreement

from sklearn.metrics import cohen_kappa_score
import numpy as np

def calibrate_judge(
    human_scores: list[int],
    llm_scores: list[int],
) -> dict:
    """Measure agreement between human and LLM judge scores."""
    assert len(human_scores) == len(llm_scores)

    # Cohen's kappa (chance-adjusted agreement)
    kappa = cohen_kappa_score(human_scores, llm_scores)

    # Exact agreement rate
    exact = sum(h == l for h, l in zip(human_scores, llm_scores)) / len(human_scores)

    # Within-1 agreement (scores differ by at most 1)
    within_1 = sum(abs(h - l) <= 1 for h, l in zip(human_scores, llm_scores)) / len(human_scores)

    # Bias detection: does the LLM systematically score higher or lower?
    human_mean = np.mean(human_scores)
    llm_mean = np.mean(llm_scores)
    bias = llm_mean - human_mean

    return {
        "cohens_kappa": round(kappa, 3),
        "exact_agreement": round(exact, 3),
        "within_1_agreement": round(within_1, 3),
        "human_mean": round(human_mean, 2),
        "llm_mean": round(llm_mean, 2),
        "bias": round(bias, 2),
        "interpretation": interpret_kappa(kappa),
    }

def interpret_kappa(kappa: float) -> str:
    if kappa < 0.20: return "poor agreement — judge is unreliable"
    if kappa < 0.40: return "fair agreement — use with caution"
    if kappa < 0.60: return "moderate agreement — acceptable for screening"
    if kappa < 0.80: return "substantial agreement — good for production"
    return "near-perfect agreement — excellent"

# Example
result = calibrate_judge(
    human_scores=[5, 4, 3, 2, 5, 4, 3, 1, 5, 4],
    llm_scores=  [5, 4, 4, 2, 5, 3, 3, 2, 4, 4],
)
print(result)
# {'cohens_kappa': 0.474, 'exact_agreement': 0.6, 'within_1_agreement': 1.0, ...}
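
When kappa comes out lower than expected, a confusion matrix over score levels shows where the judge drifts (for instance, compressing extreme human scores toward the middle). A quick diagnostic sketch, not part of the skill's API:

```python
import numpy as np

def disagreement_matrix(human_scores: list[int], llm_scores: list[int],
                        n_levels: int = 5) -> np.ndarray:
    """Rows = human score, columns = LLM score; off-diagonal mass shows drift."""
    m = np.zeros((n_levels, n_levels), dtype=int)
    for h, l in zip(human_scores, llm_scores):
        m[h - 1, l - 1] += 1
    return m
```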

Inter-Rater Reliability

Use multiple judge calls to improve reliability.

import asyncio
import statistics

async def multi_judge(
    question: str,
    response: str,
    rubric: str,
    model: str = "gpt-4o",
    n_judges: int = 3,
    temperature: float = 0.3,
) -> dict:
    """Use multiple independent judge calls for reliability.

    Runs at temperature > 0 so repeated calls can disagree; the spread
    of scores is then a signal of rubric ambiguity.
    """
    prompt = rubric.format(question=question, response=response, reference="")

    async def one_judgment() -> dict:
        result = await client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            response_format={"type": "json_object"},
            temperature=temperature,
        )
        return json.loads(result.choices[0].message.content)

    results = await asyncio.gather(*[one_judgment() for _ in range(n_judges)])
    scores = [r["score"] for r in results]

    return {
        "median_score": statistics.median(scores),
        "mean_score": statistics.mean(scores),
        "scores": scores,
        "agreement": len(set(scores)) == 1,
        "spread": max(scores) - min(scores),
        "individual_results": results,
    }

# If spread > 2, the rubric may be ambiguous — refine it

Cost-Efficient Judging

Cascading Judges

async def cascading_judge(
    question: str,
    response: str,
    rubric: str,
) -> dict:
    """Use a cheap model first, escalate to expensive model for borderline cases."""

    # Stage 1: Fast, cheap judge
    fast_result = await judge_single(
        question, response, rubric, model="gpt-4o-mini"
    )
    score = fast_result["score"]

    # Clear pass or fail — no need for expensive judge
    if score >= 4 or score <= 2:
        return {**fast_result, "judge_model": "gpt-4o-mini", "escalated": False}

    # Stage 2: Borderline (score 3) — escalate to better judge
    accurate_result = await judge_single(
        question, response, rubric, model="gpt-4o"
    )
    return {**accurate_result, "judge_model": "gpt-4o", "escalated": True}

# Cost savings: typically 60-70% of cases are clear, saving expensive API calls
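
The savings claim can be sanity-checked with a back-of-envelope cost model. The per-case costs below are hypothetical placeholders, not current API prices:

```python
def cascade_cost(n_cases: int, clear_fraction: float,
                 cheap_cost: float, expensive_cost: float) -> float:
    """Expected total judging cost: every case pays the cheap judge,
    only the borderline fraction escalates to the expensive one."""
    escalated = n_cases * (1 - clear_fraction)
    return n_cases * cheap_cost + escalated * expensive_cost

# e.g. 1000 cases, 65% clear, $0.001 cheap vs $0.02 expensive per case
cascade_cost(1000, 0.65, 0.001, 0.02)
```

With these placeholder numbers the cascade costs roughly $8, versus $20 for sending every case to the expensive judge.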

Sampling Strategy

import random
from collections import defaultdict

def select_eval_sample(
    outputs: list[dict],
    sample_size: int = 100,
    strategy: str = "stratified",
) -> list[dict]:
    """Select a representative sample for LLM judging."""
    if strategy == "random":
        return random.sample(outputs, min(sample_size, len(outputs)))

    if strategy == "stratified":
        # Group by category/difficulty, sample proportionally
        groups = defaultdict(list)
        for o in outputs:
            groups[o.get("category", "default")].append(o)

        sample = []
        per_group = max(1, sample_size // len(groups))
        for group_outputs in groups.values():
            sample.extend(random.sample(group_outputs, min(per_group, len(group_outputs))))
        return sample[:sample_size]

    if strategy == "uncertainty":
        # Prioritize outputs where a cheap judge was uncertain (borderline score 3)
        borderline = [o for o in outputs if o.get("fast_score") == 3]
        non_borderline = [o for o in outputs if o.get("fast_score") != 3]
        n_fill = max(0, sample_size - len(borderline))
        return (borderline[:sample_size] +
                random.sample(non_borderline, min(n_fill, len(non_borderline))))

    raise ValueError(f"unknown strategy: {strategy!r}")
Reducing Judge Bias

DEBIASING_TECHNIQUES = {
    "position_swap": "Present responses in both orders and check consistency",
    "name_blind": "Remove model names from responses before judging",
    "length_control": "Instruct judge to ignore length differences",
    "chain_of_thought": "Require reasoning before score to improve calibration",
    "few_shot_anchoring": "Include scored examples in the prompt to anchor the scale",
}

# Few-shot anchoring example
ANCHORED_RUBRIC = """Rate the response 1-5 on helpfulness.

Example — Score 5 (Very helpful):
Q: "How do I sort a list in Python?"
A: "Use sorted() for a new list or list.sort() for in-place. Both accept key= and reverse= parameters. Example: sorted([3,1,2]) returns [1,2,3]."

Example — Score 2 (Slightly helpful):
Q: "How do I sort a list in Python?"
A: "Python has many features for working with lists."

Now evaluate:
Question: {question}
Response: {response}

JSON: {{"score": int, "reasoning": str}}"""

Common Pitfalls

  1. Self-evaluation bias: GPT-4 prefers GPT-4 outputs. Use a different model family as judge when possible.
  2. Position bias: The first response in a pairwise comparison is favored. Always swap and check.
  3. Length bias: Longer responses get higher scores. Explicitly instruct against this.
  4. Vague rubrics: "Is it good?" fails. Define exactly what each score level means.
  5. No calibration: Always measure agreement with human labels before trusting LLM judges.
  6. Single-call judging: One judge call is noisy. Use 3+ calls for important evaluations.


Related Skills

agent-trajectory-testing

Covers testing AI agent behavior end-to-end: trajectory evaluation, tool-call sequence validation, multi-step correctness verification, stuck-loop detection, cost regression testing, and timeout handling. Triggers: "test my AI agent", "agent trajectory evaluation", "tool call testing", "multi-step agent testing", "agent stuck detection", "agent cost regression", "validate agent behavior".


ci-cd-for-ai

Covers implementing CI/CD pipelines for AI applications: running LLM evals in GitHub Actions, gating deployments on eval scores, monitoring prompt and model drift, versioning prompts alongside code, cost tracking, and canary deployments for AI features. Triggers: "CI for AI", "run evals in GitHub Actions", "gate deployment on eval score", "prompt drift detection", "version prompts in CI", "AI deployment pipeline", "LLM CI/CD".


eval-frameworks

Covers popular LLM evaluation frameworks and how to use them: Braintrust, Promptfoo, RAGAS, DeepEval, LangSmith, and custom eval harnesses. Includes setup, configuration, writing eval cases, CI integration, and choosing the right framework for your use case. Triggers: "eval framework", "Braintrust setup", "Promptfoo config", "RAGAS evaluation", "DeepEval", "LangSmith evals", "custom eval harness", "which eval tool should I use".


llm-eval-fundamentals

Covers the foundations of evaluating LLM-powered applications: why evaluation matters, the taxonomy of metric types (exact match, semantic similarity, LLM-as-judge), building and curating eval datasets, establishing baselines, detecting regressions, and designing eval pipelines that scale from prototyping through production. Triggers: "evaluate my LLM app", "set up evals", "how do I measure LLM quality", "create an eval pipeline", "LLM metrics", "eval dataset".


prompt-testing

Covers testing and hardening prompts for LLM applications: prompt regression testing, A/B testing prompt variants, temperature sensitivity analysis, edge case libraries, prompt versioning strategies, and golden test sets. Triggers: "test my prompt", "prompt regression", "A/B test prompts", "prompt versioning", "temperature sensitivity", "golden test set for prompts", "prompt quality assurance".


red-teaming-ai

Covers red-teaming AI applications for safety and robustness: adversarial prompt testing, jailbreak resistance evaluation, PII leakage detection, hallucination measurement, bias detection, safety benchmarks, and building automated red-team pipelines. Triggers: "red team my AI", "adversarial testing for LLMs", "jailbreak testing", "PII leakage test", "hallucination detection", "AI bias testing", "safety benchmark", "AI security testing".
