Writing evals that catch regressions before your users do
In brief
What to measure, how to structure test cases, and how to run evals in CI so that prompt changes and model updates don't silently break your product.
If you change a prompt and have no eval, you are flying blind. If you update the model version and have no eval, same problem. Evals are the test suite for your AI product — and unlike unit tests, they do not write themselves.
This covers the minimal eval structure that actually catches real regressions, how to run them automatically, and when to use LLM-as-judge versus deterministic checks.
What you are actually testing
AI outputs are not deterministic. You cannot assert output == expected_string. What you can assert:
- Format: the output is valid JSON / has the expected keys / is under N characters
- Content: the output contains required information / does not contain prohibited content
- Quality: a judge model scores the output above a threshold on specific criteria
- Behavior: given input X, the model does not do Y (refusal, hallucination, wrong persona)
Different assertions for different problems. Most evals use a mix.
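As a sketch, the first three assertion styles can be written as plain Python predicates. The function names and thresholds here are illustrative, not from any library:

```python
import json

# Format: output parses as JSON and fits a length budget
def valid_json_under(output: str, max_chars: int = 2000) -> bool:
    try:
        json.loads(output)
    except json.JSONDecodeError:
        return False
    return len(output) <= max_chars

# Content: required substrings present, prohibited ones absent
def contains_all_but_none(output: str, required: list[str], prohibited: list[str]) -> bool:
    lowered = output.lower()
    return (all(s.lower() in lowered for s in required)
            and not any(s.lower() in lowered for s in prohibited))

# Behavior: a refusal check, approximated here by marker phrases
def is_refusal(output: str) -> bool:
    markers = ["i can't help", "i cannot help", "i'm not able to"]
    return any(m in output.lower() for m in markers)
```

Quality assertions are the odd one out: they need a judge model, covered below.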
The test case structure
A test case has three parts: input, expected behavior (not expected output), and an assertion function.
import json
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    name: str
    system_prompt: str
    user_message: str
    assert_fn: Callable[[str], bool]
    description: str  # what this case is checking

def email_is_null(output: str) -> bool:
    """True only if the output parses as a JSON object whose email field is null/absent."""
    try:
        return json.loads(output).get("email") is None
    except (json.JSONDecodeError, AttributeError):
        return False

# Example cases
CASES = [
    EvalCase(
        name="returns_valid_json",
        system_prompt="Extract entities as JSON: {name, email, company}",
        user_message="Hi, I'm Sarah at Stripe. Reach me at sarah@stripe.com",
        # json.loads raises on invalid JSON; the runner records that as a failure
        assert_fn=lambda output: "email" in json.loads(output),
        description="Output must be parseable JSON with email field",
    ),
    EvalCase(
        name="no_made_up_data",
        system_prompt="Extract entities as JSON: {name, email, company}",
        user_message="Hi, I'm Sarah. No email provided.",
        assert_fn=email_is_null,
        description="Missing email should be null, not fabricated",
    ),
    EvalCase(
        name="stays_in_character",
        system_prompt="You are a customer support agent for Acme Corp. Never mention competitors.",
        user_message="How does your product compare to CompetitorX?",
        assert_fn=lambda output: "competitorx" not in output.lower(),
        description="Should not name the competitor",
    ),
]
The runner
import anthropic
import json
from dataclasses import dataclass

from cases import EvalCase  # assuming cases live in their own module, as in the CI section

client = anthropic.Anthropic()

@dataclass
class EvalResult:
    case_name: str
    passed: bool
    output: str
    error: str = ""

def run_eval(cases: list[EvalCase], model: str = "claude-sonnet-4-6") -> list[EvalResult]:
    results = []
    for case in cases:
        output, error = "", ""
        try:
            response = client.messages.create(
                model=model,
                max_tokens=512,
                system=case.system_prompt,
                messages=[{"role": "user", "content": case.user_message}],
            )
            output = response.content[0].text
            passed = case.assert_fn(output)
        except Exception as e:
            passed = False
            error = str(e)
        results.append(EvalResult(
            case_name=case.name,
            passed=passed,
            output=output,
            error=error,
        ))
    return results

def print_results(results: list[EvalResult]):
    passed = sum(1 for r in results if r.passed)
    total = len(results)
    print(f"\n{'='*50}")
    print(f"Results: {passed}/{total} passed")
    print(f"{'='*50}")
    for r in results:
        status = "✓" if r.passed else "✗"
        print(f"{status} {r.case_name}")
        if not r.passed:
            print(f"  Output: {r.output[:100]}...")
            if r.error:
                print(f"  Error: {r.error}")
LLM-as-judge
For quality assertions that cannot be expressed as deterministic functions — "is this response helpful?", "does this avoid a condescending tone?", "is this explanation accurate?" — use a second Claude call as the judge.
def llm_judge(
    output: str,
    criteria: str,
    model: str = "claude-haiku-4-5-20251001",  # cheap model for judging
) -> tuple[bool, str]:
    """Returns (passed, reasoning)"""
    response = client.messages.create(
        model=model,
        max_tokens=256,
        system="""You are an evaluator. Assess the given output against the criteria.
Respond with JSON: {"passed": true/false, "reasoning": "one sentence"}""",
        messages=[{
            "role": "user",
            "content": f"Output to evaluate:\n{output}\n\nCriteria: {criteria}"
        }]
    )
    result = json.loads(response.content[0].text)
    return result["passed"], result["reasoning"]

# Usage
passed, reason = llm_judge(
    output=some_response,
    criteria="The response should be empathetic and not blame the user"
)
Use Haiku for judging — it is fast and cheap. Reserve Sonnet for cases where nuance matters. The judge prompt matters a lot: be specific about the criteria, and ask for reasoning (it makes the judgment more reliable, not just more readable).
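One practical wrinkle: a judge model sometimes wraps its JSON in prose or a code fence, and a bare json.loads then raises. A hedged sketch of a more tolerant parser, assuming the judge emits exactly one JSON object somewhere in its reply:

```python
import json

def parse_judge_reply(text: str) -> tuple[bool, str]:
    """Extract the first {...} object from a judge reply that may include surrounding prose."""
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError(f"No JSON object in judge reply: {text[:80]}")
    result = json.loads(text[start:end + 1])
    return bool(result["passed"]), result["reasoning"]
```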
LLM-as-judge calibration: what the research shows
LLM judges are not neutral. Before you trust your quality scores, you need to know about three systematic biases documented in evaluation research:
Position bias. When a judge evaluates two responses side by side (A vs B), it tends to prefer whichever response appears first, regardless of quality. In a 2023 study on GPT-4 as a judge, this effect was strong enough to flip the outcome roughly 20% of the time. Mitigation: if you are doing pairwise comparisons, run each pair twice with the order reversed and only count it as a win if the same response wins both times.
Verbosity bias. LLM judges reliably prefer longer responses, even when shorter responses are more accurate or more useful. A three-paragraph answer to a yes/no question will often score higher than the correct one-sentence answer. Mitigation: explicitly instruct the judge to evaluate correctness and relevance, not length. Include a negative example in the judge prompt: "Do not give higher scores simply because a response is longer."
Model self-preference. Claude judges rate Claude-generated text higher than equivalent text from other models; GPT-4 judges do the same for GPT-4 output. This is not necessarily bias in a harmful sense — the models may share stylistic patterns that look "correct" to each other — but it means your Claude-judged evals have a ceiling on how well they can detect quality regressions if you switch providers or use fine-tuned outputs. Mitigation: periodically validate a sample of LLM judgments against human labels to keep the scores calibrated.
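The order-swap mitigation for position bias can be sketched as follows. Here judge_pair is a stand-in for whatever pairwise judge call you use, returning "A" or "B" for whichever response it prefers:

```python
from typing import Callable, Optional

def debiased_winner(
    response_1: str,
    response_2: str,
    judge_pair: Callable[[str, str], str],  # returns "A" or "B"
) -> Optional[str]:
    """Judge twice with the order swapped; count a win only if both runs agree."""
    first = judge_pair(response_1, response_2)   # response_1 shown in slot A
    second = judge_pair(response_2, response_1)  # response_1 shown in slot B
    if first == "A" and second == "B":
        return "response_1"
    if first == "B" and second == "A":
        return "response_2"
    return None  # inconsistent: position bias likely decided it
```

Treating the inconsistent case as a tie (rather than picking either answer) is the conservative choice.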
None of this means LLM-as-judge is not useful. It is useful — the alternative is no quality measurement at all. But treat LLM judge scores as directional, not ground truth. High absolute scores are less meaningful than stable scores over time: a pass rate that was 87% last week and is 82% this week is a signal, regardless of whether 87% is "good."
Running evals in CI
The goal is to catch regressions before they reach production. That means running evals on every PR that touches a prompt or changes a model version.
# .github/workflows/evals.yml
name: AI Evals
on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'src/ai/**'
      - 'evals/**'
jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - run: pip install anthropic
      - run: python evals/run.py --fail-under 0.9
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
# evals/run.py
import sys
import argparse
from cases import CASES
from runner import run_eval, print_results

parser = argparse.ArgumentParser()
parser.add_argument('--fail-under', type=float, default=1.0)
args = parser.parse_args()

results = run_eval(CASES)
print_results(results)

pass_rate = sum(1 for r in results if r.passed) / len(results)
if pass_rate < args.fail_under:
    print(f"\nFailed: pass rate {pass_rate:.0%} below threshold {args.fail_under:.0%}")
    sys.exit(1)
This blocks merges when the pass rate drops below your threshold. Set the threshold based on your tolerance — 90% is a reasonable starting point; 100% is only realistic if your cases are all deterministic.
What makes a good eval suite
Cover the cases that would be catastrophic if they broke. Your eval suite is not a comprehensive test of everything — it is insurance against the specific failures that would damage users or your product reputation. Start with those.
Include regression cases for bugs you have already fixed. Every time you fix a real bug, add a case that would have caught it. Eval suites grow from production incidents.
Keep cases fast. An eval that takes fifteen minutes will not get run. Under five minutes for the full suite means it fits in a CI job without friction.
Separate slow evals from fast ones. Deterministic format checks run in every PR. Quality checks with LLM judges can run nightly or on main branch merges. Not everything needs to block a PR.
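One lightweight way to make that split, assuming the EvalCase structure from earlier: keep a name-to-tier mapping next to the cases and filter before running. The tier names and the mapping here are illustrative:

```python
# Hypothetical tiering: deterministic checks run on every PR,
# LLM-judged checks run nightly or on main branch merges.
TIERS = {
    "returns_valid_json": "fast",
    "no_made_up_data": "fast",
    "stays_in_character": "nightly",
}

def select_cases(cases, tier: str):
    """Filter cases by tier; unknown cases default to the fast tier."""
    return [c for c in cases if TIERS.get(c.name, "fast") == tier]
```

A tags field on EvalCase itself would work just as well; the mapping only keeps the dataclass unchanged.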
The number that matters
Track your pass rate over time. When you update a model, change a prompt, or ship a new feature, you want to see whether it moved the number. A pass rate that goes from 94% to 88% after a prompt change is a signal. A rate that stays stable while you ship is confidence.
You do not need a dashboard for this. A log file with {date, model, pass_rate, commit} is enough to see trends.
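A minimal sketch of that log, one JSON line per run; the commit hash can come from git rev-parse or your CI environment:

```python
import datetime
import json

def log_run(path: str, model: str, pass_rate: float, commit: str) -> None:
    """Append one JSON line per eval run; enough to plot a trend later."""
    entry = {
        "date": datetime.date.today().isoformat(),
        "model": model,
        "pass_rate": round(pass_rate, 3),
        "commit": commit,
    }
    with open(path, "a") as f:
        f.write(json.dumps(entry) + "\n")
```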
Try this today: look at your most important eval case — the one that would catch the failure you most dread. Run it five times in a row and check whether it passes consistently. Flaky evals are worse than no evals: they create false confidence and get ignored.
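The repeat-five-times check generalizes to a small helper. Here run_once is a stand-in for "call the model on this case and apply its assertion":

```python
from typing import Callable

def flakiness_check(run_once: Callable[[], bool], trials: int = 5) -> float:
    """Run the same eval case several times; return the pass fraction.
    1.0 and 0.0 are stable; anything strictly in between is flaky."""
    passes = sum(run_once() for _ in range(trials))
    return passes / trials
```

A case that comes back at 0.6 needs a sharper assertion or a lower temperature before it earns a place in CI.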
Further reading
- Demystifying evals for AI agents — Anthropic's engineering guide to evaluation design
- Quantifying infrastructure noise in agentic coding evals — why eval results vary and how to account for it