
Database

Browse 5,877 skills across 396 packs and 37 categories

8 skills found in Technology & Engineering

agent-trajectory-testing

472L

Covers testing AI agent behavior end-to-end: trajectory evaluation, tool-call sequence validation, multi-step correctness verification, stuck-loop detection, cost regression testing, and timeout handling. Triggers: "test my AI agent", "agent trajectory evaluation", "tool call testing", "multi-step agent testing", "agent stuck detection", "agent cost regression", "validate agent behavior".

Technology & Engineering · AI Testing Evals
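As a sketch of the stuck-loop detection this skill covers: a trajectory is often flagged as stuck when the same tool call repeats verbatim several times in a row. The `ToolCall` shape and the window size here are illustrative assumptions, not part of any particular framework.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ToolCall:
    # Hypothetical record of one tool invocation in an agent trajectory.
    name: str
    args: str  # serialized arguments, e.g. a JSON string

def is_stuck(trajectory: list[ToolCall], window: int = 3) -> bool:
    """Flag a trajectory as stuck if the same tool call (name + args)
    repeats `window` times in a row -- a common loop symptom."""
    if len(trajectory) < window:
        return False
    tail = trajectory[-window:]
    return all(call == tail[0] for call in tail)
```

Real trajectory testing also catches near-duplicate loops (same tool, slightly varied arguments), which this exact-match check would miss.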

ci-cd-for-ai

479L

Covers implementing CI/CD pipelines for AI applications: running LLM evals in GitHub Actions, gating deployments on eval scores, monitoring prompt and model drift, versioning prompts alongside code, cost tracking, and canary deployments for AI features. Triggers: "CI for AI", "run evals in GitHub Actions", "gate deployment on eval score", "prompt drift detection", "version prompts in CI", "AI deployment pipeline", "LLM CI/CD".

Technology & Engineering · AI Testing Evals
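Gating a deployment on an eval score, as this skill describes, often reduces to a small CI step that exits non-zero when any suite falls below a threshold. The score-dictionary shape below is a made-up example, not a format any specific pipeline mandates.

```python
import sys

def gate(scores: dict[str, float], threshold: float = 0.9) -> int:
    """Return a CI exit code: 0 if every eval suite meets the threshold,
    1 otherwise (a non-zero exit fails the pipeline step)."""
    failing = {name: s for name, s in scores.items() if s < threshold}
    for name, s in failing.items():
        print(f"FAIL {name}: {s:.2f} < {threshold:.2f}", file=sys.stderr)
    return 1 if failing else 0
```

In GitHub Actions this would run as a step whose exit code blocks the deploy job.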

eval-frameworks

568L

Covers popular LLM evaluation frameworks and how to use them: Braintrust, Promptfoo, RAGAS, DeepEval, LangSmith, and custom eval harnesses. Includes setup, configuration, writing eval cases, CI integration, and choosing the right framework for your use case. Triggers: "eval framework", "Braintrust setup", "Promptfoo config", "RAGAS evaluation", "DeepEval", "LangSmith evals", "custom eval harness", "which eval tool should I use".

Technology & Engineering · AI Testing Evals
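The "custom eval harness" this skill mentions can be minimal: cases pair an input with a scorer, and the harness averages scores over the model under test. `model` here is a hypothetical callable standing in for an actual LLM call; the shapes are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    # One test case: an input and a scorer judging the model's output.
    input: str
    score: Callable[[str], float]  # 0.0 (fail) .. 1.0 (pass)

def run_harness(model: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Run every case through the model and return the mean score."""
    if not cases:
        return 0.0
    return sum(case.score(model(case.input)) for case in cases) / len(cases)
```

Frameworks like Braintrust or DeepEval add dataset management, caching, and reporting on top of essentially this loop.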

llm-as-judge

451L

Covers using LLMs to evaluate other LLM outputs: rubric design, pairwise comparison, reference-based and reference-free grading, calibration techniques, inter-rater reliability measurement, and cost-efficient judging strategies. Triggers: "LLM as judge", "use GPT to evaluate outputs", "AI grading AI", "rubric for LLM evaluation", "pairwise comparison", "LLM evaluator", "auto-grade LLM responses".

Technology & Engineering · AI Testing Evals
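A common calibration technique for the pairwise comparison this skill covers is asking the judge twice with positions swapped, to cancel position bias. `judge` below is a hypothetical callable standing in for an LLM judge; it returns "first" or "second" for whichever of its two arguments it preferred.

```python
from typing import Callable

def pairwise_verdict(judge: Callable[[str, str], str], a: str, b: str) -> str:
    """Ask the judge twice with positions swapped to control for position
    bias. Returns "a", "b", or "tie"."""
    first_pass = judge(a, b)   # a shown first
    second_pass = judge(b, a)  # b shown first
    a_wins = (first_pass == "first") + (second_pass == "second")
    b_wins = (first_pass == "second") + (second_pass == "first")
    if a_wins > b_wins:
        return "a"
    if b_wins > a_wins:
        return "b"
    return "tie"
```

Note that a judge with pure position bias (always preferring whichever answer it sees first) yields "tie" under this scheme rather than a spurious winner.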

llm-eval-fundamentals

348L

Covers the foundations of evaluating LLM-powered applications: why evaluation matters, the taxonomy of metric types (exact match, semantic similarity, LLM-as-judge), building and curating eval datasets, establishing baselines, detecting regressions, and designing eval pipelines that scale from prototyping through production. Triggers: "evaluate my LLM app", "set up evals", "how do I measure LLM quality", "create an eval pipeline", "LLM metrics", "eval dataset".

Technology & Engineering · AI Testing Evals
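Of the metric types in the taxonomy above, exact match is the simplest baseline; a normalized variant (lowercasing, stripping punctuation, collapsing whitespace) is a common first pass before reaching for semantic similarity. This sketch is illustrative and not tied to any framework.

```python
import string

def normalize(text: str) -> str:
    # Lowercase, strip punctuation, and collapse whitespace.
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

def exact_match(prediction: str, reference: str) -> bool:
    """Normalized exact match: true if the texts agree after normalization."""
    return normalize(prediction) == normalize(reference)

def exact_match_score(pairs: list[tuple[str, str]]) -> float:
    """Fraction of (prediction, reference) pairs that match exactly."""
    if not pairs:
        return 0.0
    return sum(exact_match(p, r) for p, r in pairs) / len(pairs)
```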

prompt-testing

447L

Covers testing and hardening prompts for LLM applications: prompt regression testing, A/B testing prompt variants, temperature sensitivity analysis, edge case libraries, prompt versioning strategies, and golden test sets. Triggers: "test my prompt", "prompt regression", "A/B test prompts", "prompt versioning", "temperature sensitivity", "golden test set for prompts", "prompt quality assurance".

Technology & Engineering · AI Testing Evals
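The golden test sets this skill covers can start as stored input/expected pairs diffed against current outputs on each prompt change. The case and output shapes below are assumptions for illustration.

```python
def regression_report(golden: list[dict], outputs: dict[str, str]) -> list[str]:
    """Compare current outputs against a golden set.
    `golden` items look like {"id": ..., "expected": ...} (illustrative shape);
    `outputs` maps case id -> current model output.
    Returns the ids of regressed or missing cases."""
    regressed = []
    for case in golden:
        current = outputs.get(case["id"])
        if current is None or current.strip() != case["expected"].strip():
            regressed.append(case["id"])
    return regressed
```

Strict string equality suits deterministic outputs; stochastic prompts usually need a fuzzier comparator or an LLM-as-judge check instead.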

red-teaming-ai

544L

Covers red-teaming AI applications for safety and robustness: adversarial prompt testing, jailbreak resistance evaluation, PII leakage detection, hallucination measurement, bias detection, safety benchmarks, and building automated red-team pipelines. Triggers: "red team my AI", "adversarial testing for LLMs", "jailbreak testing", "PII leakage test", "hallucination detection", "AI bias testing", "safety benchmark", "AI security testing".

Technology & Engineering · AI Testing Evals
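One small piece of the PII leakage detection this skill covers can be sketched as pattern scanning over model outputs. The two patterns below are illustrative; real red-team suites use far broader coverage and detection beyond regexes.

```python
import re

# Illustrative PII patterns only -- not an exhaustive or production set.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def detect_pii(output: str) -> list[str]:
    """Return the names of PII pattern types found in a model output."""
    return [name for name, pat in PII_PATTERNS.items() if pat.search(output)]
```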

structured-output-testing

396L

Covers testing and validating structured outputs from LLMs: JSON mode validation, schema conformance with Zod and JSON Schema, handling partial and malformed outputs, retry strategies with exponential backoff, and building type-safe LLM response pipelines. Triggers: "validate LLM JSON output", "test structured output", "JSON schema validation for AI", "type-safe LLM responses", "handle malformed LLM output", "Zod validation for AI".

Technology & Engineering · AI Testing Evals
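The retry-with-exponential-backoff strategy this skill describes can be sketched around a parse attempt: re-generate on malformed JSON, doubling the delay each time. `generate` is a hypothetical callable standing in for an LLM call.

```python
import json
import time
from typing import Callable

def parse_with_retry(generate: Callable[[], str], retries: int = 3,
                     base_delay: float = 0.5) -> dict:
    """Call a (hypothetical) LLM `generate` function and parse its output
    as JSON, retrying with exponential backoff on malformed output."""
    for attempt in range(retries):
        raw = generate()
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            if attempt == retries - 1:
                raise  # out of retries: surface the parse error
            time.sleep(base_delay * 2 ** attempt)  # e.g. 0.5s, 1s, 2s, ...
    raise RuntimeError("unreachable")
```

In practice each retry would also feed the parse error back into the prompt, and a schema check (Zod in TypeScript, or JSON Schema validation) would follow the successful parse.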