
llm-as-judge

Covers using LLMs to evaluate other LLM outputs: rubric design, pairwise comparison, reference-based and reference-free grading, calibration techniques, inter-rater reliability measurement, and cost-efficient judging strategies. Triggers: "LLM as judge", "use GPT to evaluate outputs", "AI grading AI", "rubric for LLM evaluation", "pairwise comparison", "LLM evaluator", "auto-grade LLM responses".


LLM-as-Judge

Use LLMs to evaluate LLM outputs at scale. This skill covers rubric design, comparison methods, calibration, reliability measurement, and cost optimization.


When to Use LLM-as-Judge

Use LLM-as-judge when:

  • The task is subjective (style, helpfulness, reasoning quality)
  • There is no single correct answer
  • Human evaluation is too expensive or slow for the scale you need
  • You need to evaluate thousands of outputs per day

Do NOT use LLM-as-judge when:

  • Exact match or simple pattern matching suffices
  • The judge model is weaker than the model being evaluated
  • You have not calibrated against human judgments

Rubric Design

Single-Dimension Scoring

HELPFULNESS_RUBRIC = """You are an expert evaluator. Rate the assistant's response on helpfulness.

Score 1 - Not helpful: Does not address the question. Irrelevant or empty response.
Score 2 - Slightly helpful: Addresses the topic but misses the main question.
Score 3 - Moderately helpful: Answers the question but with significant gaps.
Score 4 - Helpful: Answers the question well with minor omissions.
Score 5 - Very helpful: Comprehensive, accurate, and directly addresses the question.

Question: {question}
Response: {response}

Provide your score and reasoning in JSON format:
{{"score": <1-5>, "reasoning": "<explanation>"}}"""

Multi-Dimension Scoring

MULTI_RUBRIC = """Evaluate the response on these dimensions. Score each 1-5.

ACCURACY: Does the response contain factually correct information?
  1=Major errors  2=Several errors  3=Minor errors  4=Mostly correct  5=Fully correct

COMPLETENESS: Does the response address all parts of the question?
  1=Misses most  2=Partial  3=Addresses main point  4=Addresses most  5=Fully complete

CLARITY: Is the response well-organized and easy to understand?
  1=Incoherent  2=Confusing  3=Acceptable  4=Clear  5=Excellent

CONCISENESS: Is the response appropriately concise without being terse?
  1=Way too long/short  2=Too verbose/brief  3=Acceptable  4=Well-calibrated  5=Perfectly sized

Question: {question}
Reference answer: {reference}
Response to evaluate: {response}

Respond with JSON:
{{"accuracy": int, "completeness": int, "clarity": int, "conciseness": int, "overall": float, "reasoning": str}}

For "overall", compute a weighted average: accuracy*0.4 + completeness*0.3 + clarity*0.2 + conciseness*0.1"""
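
Judges sometimes get the arithmetic wrong, so the weighted overall is best recomputed client-side from the four dimension scores rather than trusted from the model. A minimal sketch (the helper name and weights dict are illustrative; the weights mirror those stated in the rubric):

```python
# Weights as stated in MULTI_RUBRIC: accuracy 0.4, completeness 0.3,
# clarity 0.2, conciseness 0.1 (hypothetical helper, not part of the skill)
WEIGHTS = {"accuracy": 0.4, "completeness": 0.3, "clarity": 0.2, "conciseness": 0.1}

def recompute_overall(judgment: dict) -> float:
    """Recompute the weighted overall instead of trusting the judge's arithmetic."""
    return round(sum(judgment[dim] * w for dim, w in WEIGHTS.items()), 2)

recompute_overall({"accuracy": 5, "completeness": 4, "clarity": 4, "conciseness": 3})  # 4.3
```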

Scoring Functions

import json
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def judge_single(
    question: str,
    response: str,
    rubric: str,
    reference: str = "",
    model: str = "gpt-4o",
) -> dict:
    """Score a single response using an LLM judge."""
    prompt = rubric.format(
        question=question,
        response=response,
        reference=reference,
    )
    result = await client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(result.choices[0].message.content)
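
Even with JSON mode, the judge's output is model-generated, so it is worth validating the parsed dict before it enters any aggregate. A hypothetical guard, not part of the skill's API:

```python
def validate_judgment(judgment: dict, low: int = 1, high: int = 5) -> dict:
    """Reject malformed judge output (missing or out-of-range score) early."""
    score = judgment.get("score")
    if not isinstance(score, int) or not (low <= score <= high):
        raise ValueError(f"invalid judge score: {score!r}")
    return judgment
```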


async def judge_batch(
    cases: list[dict],
    rubric: str,
    model: str = "gpt-4o",
    concurrency: int = 10,
) -> list[dict]:
    """Score multiple cases concurrently."""
    import asyncio
    semaphore = asyncio.Semaphore(concurrency)

    async def bounded(case):
        async with semaphore:
            return await judge_single(
                question=case["question"],
                response=case["response"],
                rubric=rubric,
                reference=case.get("reference", ""),
                model=model,
            )

    return await asyncio.gather(*[bounded(c) for c in cases])

Pairwise Comparison

Pairwise comparison is often more reliable than absolute scoring when the quality differences between responses are subtle.

PAIRWISE_PROMPT = """You are comparing two responses to the same question. Which response is better?

Question: {question}

Response A:
{response_a}

Response B:
{response_b}

Consider: accuracy, helpfulness, clarity, and completeness.

Rules:
- Choose the better response. If truly equal, choose "tie".
- Do not let response length bias your judgment.
- Do not let the order of presentation bias your judgment.

Respond with JSON:
{{"winner": "A" | "B" | "tie", "reasoning": "<explanation>", "confidence": "high" | "medium" | "low"}}"""

async def _pairwise_once(
    question: str,
    response_a: str,
    response_b: str,
    model: str,
) -> dict:
    """Run one pairwise comparison in the given presentation order."""
    prompt = PAIRWISE_PROMPT.format(
        question=question, response_a=response_a, response_b=response_b
    )
    result = await client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(result.choices[0].message.content)


async def pairwise_compare(
    question: str,
    response_a: str,
    response_b: str,
    model: str = "gpt-4o",
    swap_order: bool = True,
) -> dict:
    """Compare two responses, optionally swapping order to reduce position bias."""

    # First comparison: A vs B
    result1 = await _pairwise_once(question, response_a, response_b, model)

    if not swap_order:
        return result1

    # Second comparison: B vs A (to detect position bias)
    result2 = await _pairwise_once(question, response_b, response_a, model)

    # Map result2 back to the original labels (swap A/B)
    winner2_mapped = {"A": "B", "B": "A", "tie": "tie"}[result2["winner"]]

    # Aggregate: agreement in both orders means a confident verdict
    if result1["winner"] == winner2_mapped:
        return {"winner": result1["winner"], "consistent": True, "confidence": "high"}
    return {"winner": "tie", "consistent": False, "confidence": "low",
            "note": "Position bias detected — results disagreed when order swapped"}
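
Across a test set, per-case verdicts can be rolled up into win rates. A small aggregation sketch (assumes result dicts with a "winner" key as above; counting each tie as half a win is one common reporting convention, not something this skill mandates):

```python
def aggregate_pairwise(results: list[dict]) -> dict:
    """Roll per-case pairwise verdicts up into win rates for a test set."""
    n = len(results)
    wins_a = sum(r["winner"] == "A" for r in results)
    wins_b = sum(r["winner"] == "B" for r in results)
    ties = n - wins_a - wins_b
    return {
        "win_rate_a": round(wins_a / n, 3),
        "win_rate_b": round(wins_b / n, 3),
        "tie_rate": round(ties / n, 3),
        # ties counted as half a win each, a common reporting convention
        "score_a": round((wins_a + 0.5 * ties) / n, 3),
    }
```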

Reference-Based vs Reference-Free Grading

Reference-Based

REFERENCE_BASED_RUBRIC = """Score how well the response matches the reference answer.

Question: {question}
Reference (gold standard): {reference}
Response to evaluate: {response}

Scoring:
5 - Semantically equivalent to the reference
4 - Captures all key points, minor differences in wording
3 - Captures most key points, some omissions
2 - Partially correct, misses major points
1 - Largely incorrect or irrelevant

Respond with JSON: {{"score": int, "reasoning": str}}"""

Reference-Free

REFERENCE_FREE_RUBRIC = """Evaluate this response based solely on the question asked.
You do NOT have a reference answer. Judge based on your own knowledge.

Question: {question}
Response: {response}

Evaluate on:
1. Factual accuracy (to the best of your knowledge)
2. Relevance to the question
3. Completeness
4. Clarity

Respond with JSON: {{"score": int, "accuracy_concerns": list[str], "reasoning": str}}"""

When to Use Which

| Scenario               | Method          | Why                                    |
|------------------------|-----------------|----------------------------------------|
| QA with known answers  | Reference-based | Objective comparison to gold standard  |
| Creative writing       | Reference-free  | No single correct answer exists        |
| Summarization          | Reference-based | Compare against expert summary         |
| Open-ended chat        | Reference-free  | Many valid responses possible          |
| Code review            | Hybrid          | Check correctness (ref) + style (free) |

Calibration

Human-Judge Agreement

from sklearn.metrics import cohen_kappa_score
import numpy as np

def calibrate_judge(
    human_scores: list[int],
    llm_scores: list[int],
) -> dict:
    """Measure agreement between human and LLM judge scores."""
    assert len(human_scores) == len(llm_scores)

    # Cohen's kappa (chance-adjusted agreement)
    kappa = cohen_kappa_score(human_scores, llm_scores)

    # Exact agreement rate
    exact = sum(h == l for h, l in zip(human_scores, llm_scores)) / len(human_scores)

    # Within-1 agreement (scores differ by at most 1)
    within_1 = sum(abs(h - l) <= 1 for h, l in zip(human_scores, llm_scores)) / len(human_scores)

    # Bias detection: does the LLM systematically score higher or lower?
    human_mean = np.mean(human_scores)
    llm_mean = np.mean(llm_scores)
    bias = llm_mean - human_mean

    return {
        "cohens_kappa": round(kappa, 3),
        "exact_agreement": round(exact, 3),
        "within_1_agreement": round(within_1, 3),
        "human_mean": round(human_mean, 2),
        "llm_mean": round(llm_mean, 2),
        "bias": round(bias, 2),
        "interpretation": interpret_kappa(kappa),
    }

def interpret_kappa(kappa: float) -> str:
    if kappa < 0.20: return "poor agreement — judge is unreliable"
    if kappa < 0.40: return "fair agreement — use with caution"
    if kappa < 0.60: return "moderate agreement — acceptable for screening"
    if kappa < 0.80: return "substantial agreement — good for production"
    return "near-perfect agreement — excellent"

# Example
result = calibrate_judge(
    human_scores=[5, 4, 3, 2, 5, 4, 3, 1, 5, 4],
    llm_scores=  [5, 4, 4, 2, 5, 3, 3, 2, 4, 4],
)
print(result)
# {'cohens_kappa': 0.474, 'exact_agreement': 0.6, 'within_1_agreement': 1.0, ...}
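
When kappa comes out lower than expected, a confusion matrix over score levels shows where the judge drifts (for instance, compressing extreme human scores toward the middle). A quick diagnostic sketch, not part of the skill's API:

```python
import numpy as np

def disagreement_matrix(human_scores: list[int], llm_scores: list[int],
                        n_levels: int = 5) -> np.ndarray:
    """Rows = human score, columns = LLM score; off-diagonal mass shows drift."""
    m = np.zeros((n_levels, n_levels), dtype=int)
    for h, l in zip(human_scores, llm_scores):
        m[h - 1, l - 1] += 1
    return m
```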

Inter-Rater Reliability

Use multiple judge calls to improve reliability.

import asyncio
import statistics

async def multi_judge(
    question: str,
    response: str,
    rubric: str,
    model: str = "gpt-4o",
    n_judges: int = 3,
    temperature: float = 0.3,
) -> dict:
    """Use multiple independent judge calls for reliability.

    Runs at temperature > 0 so repeated calls can disagree; the spread
    of scores is then a signal of rubric ambiguity.
    """
    prompt = rubric.format(question=question, response=response, reference="")

    async def one_judgment() -> dict:
        result = await client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            response_format={"type": "json_object"},
            temperature=temperature,
        )
        return json.loads(result.choices[0].message.content)

    results = await asyncio.gather(*[one_judgment() for _ in range(n_judges)])
    scores = [r["score"] for r in results]

    return {
        "median_score": statistics.median(scores),
        "mean_score": statistics.mean(scores),
        "scores": scores,
        "agreement": len(set(scores)) == 1,
        "spread": max(scores) - min(scores),
        "individual_results": results,
    }

# If spread > 2, the rubric may be ambiguous — refine it

Cost-Efficient Judging

Cascading Judges

async def cascading_judge(
    question: str,
    response: str,
    rubric: str,
) -> dict:
    """Use a cheap model first, escalate to expensive model for borderline cases."""

    # Stage 1: Fast, cheap judge
    fast_result = await judge_single(
        question, response, rubric, model="gpt-4o-mini"
    )
    score = fast_result["score"]

    # Clear pass or fail — no need for expensive judge
    if score >= 4 or score <= 2:
        return {**fast_result, "judge_model": "gpt-4o-mini", "escalated": False}

    # Stage 2: Borderline (score 3) — escalate to better judge
    accurate_result = await judge_single(
        question, response, rubric, model="gpt-4o"
    )
    return {**accurate_result, "judge_model": "gpt-4o", "escalated": True}

# Cost savings: typically 60-70% of cases are clear, saving expensive API calls
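
The savings claim can be sanity-checked with a back-of-envelope cost model. The per-case costs below are hypothetical placeholders, not current API prices:

```python
def cascade_cost(n_cases: int, clear_fraction: float,
                 cheap_cost: float, expensive_cost: float) -> float:
    """Expected total judging cost: every case pays the cheap judge,
    only the borderline fraction escalates to the expensive one."""
    escalated = n_cases * (1 - clear_fraction)
    return n_cases * cheap_cost + escalated * expensive_cost

# e.g. 1000 cases, 65% clear, $0.001 cheap vs $0.02 expensive per case
cascade_cost(1000, 0.65, 0.001, 0.02)
```

With these placeholder numbers the cascade costs roughly $8, versus $20 for sending every case to the expensive judge.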

Sampling Strategy

import random
from collections import defaultdict

def select_eval_sample(
    outputs: list[dict],
    sample_size: int = 100,
    strategy: str = "stratified",
) -> list[dict]:
    """Select a representative sample for LLM judging."""
    if strategy == "random":
        return random.sample(outputs, min(sample_size, len(outputs)))

    if strategy == "stratified":
        # Group by category/difficulty, sample proportionally
        groups = defaultdict(list)
        for o in outputs:
            groups[o.get("category", "default")].append(o)

        sample = []
        per_group = max(1, sample_size // len(groups))
        for group_outputs in groups.values():
            sample.extend(random.sample(group_outputs, min(per_group, len(group_outputs))))
        return sample[:sample_size]

    if strategy == "uncertainty":
        # Prioritize outputs where a cheap judge was uncertain (borderline score 3)
        borderline = [o for o in outputs if o.get("fast_score") == 3]
        non_borderline = [o for o in outputs if o.get("fast_score") != 3]
        n_fill = max(0, sample_size - len(borderline))
        return (borderline[:sample_size] +
                random.sample(non_borderline, min(n_fill, len(non_borderline))))

    raise ValueError(f"unknown strategy: {strategy!r}")
Reducing Judge Bias

DEBIASING_TECHNIQUES = {
    "position_swap": "Present responses in both orders and check consistency",
    "name_blind": "Remove model names from responses before judging",
    "length_control": "Instruct judge to ignore length differences",
    "chain_of_thought": "Require reasoning before score to improve calibration",
    "few_shot_anchoring": "Include scored examples in the prompt to anchor the scale",
}

# Few-shot anchoring example
ANCHORED_RUBRIC = """Rate the response 1-5 on helpfulness.

Example — Score 5 (Very helpful):
Q: "How do I sort a list in Python?"
A: "Use sorted() for a new list or list.sort() for in-place. Both accept key= and reverse= parameters. Example: sorted([3,1,2]) returns [1,2,3]."

Example — Score 2 (Slightly helpful):
Q: "How do I sort a list in Python?"
A: "Python has many features for working with lists."

Now evaluate:
Question: {question}
Response: {response}

JSON: {{"score": int, "reasoning": str}}"""

Common Pitfalls

  1. Self-evaluation bias: GPT-4 prefers GPT-4 outputs. Use a different model family as judge when possible.
  2. Position bias: The first response in a pairwise comparison is favored. Always swap and check.
  3. Length bias: Longer responses get higher scores. Explicitly instruct against this.
  4. Vague rubrics: "Is it good?" fails. Define exactly what each score level means.
  5. No calibration: Always measure agreement with human labels before trusting LLM judges.
  6. Single-call judging: One judge call is noisy. Use 3+ calls for important evaluations.


Related Skills

agent-trajectory-testing

Covers testing AI agent behavior end-to-end: trajectory evaluation, tool-call sequence validation, multi-step correctness verification, stuck-loop detection, cost regression testing, and timeout handling. Triggers: "test my AI agent", "agent trajectory evaluation", "tool call testing", "multi-step agent testing", "agent stuck detection", "agent cost regression", "validate agent behavior".


ci-cd-for-ai

Covers implementing CI/CD pipelines for AI applications: running LLM evals in GitHub Actions, gating deployments on eval scores, monitoring prompt and model drift, versioning prompts alongside code, cost tracking, and canary deployments for AI features. Triggers: "CI for AI", "run evals in GitHub Actions", "gate deployment on eval score", "prompt drift detection", "version prompts in CI", "AI deployment pipeline", "LLM CI/CD".


eval-frameworks

Covers popular LLM evaluation frameworks and how to use them: Braintrust, Promptfoo, RAGAS, DeepEval, LangSmith, and custom eval harnesses. Includes setup, configuration, writing eval cases, CI integration, and choosing the right framework for your use case. Triggers: "eval framework", "Braintrust setup", "Promptfoo config", "RAGAS evaluation", "DeepEval", "LangSmith evals", "custom eval harness", "which eval tool should I use".


llm-eval-fundamentals

Covers the foundations of evaluating LLM-powered applications: why evaluation matters, the taxonomy of metric types (exact match, semantic similarity, LLM-as-judge), building and curating eval datasets, establishing baselines, detecting regressions, and designing eval pipelines that scale from prototyping through production. Triggers: "evaluate my LLM app", "set up evals", "how do I measure LLM quality", "create an eval pipeline", "LLM metrics", "eval dataset".


prompt-testing

Covers testing and hardening prompts for LLM applications: prompt regression testing, A/B testing prompt variants, temperature sensitivity analysis, edge case libraries, prompt versioning strategies, and golden test sets. Triggers: "test my prompt", "prompt regression", "A/B test prompts", "prompt versioning", "temperature sensitivity", "golden test set for prompts", "prompt quality assurance".


red-teaming-ai

Covers red-teaming AI applications for safety and robustness: adversarial prompt testing, jailbreak resistance evaluation, PII leakage detection, hallucination measurement, bias detection, safety benchmarks, and building automated red-team pipelines. Triggers: "red team my AI", "adversarial testing for LLMs", "jailbreak testing", "PII leakage test", "hallucination detection", "AI bias testing", "safety benchmark", "AI security testing".
