Auditing your eval suite: are you testing the right things?
In brief
Most eval suites test what was easy to write, not what matters most. A structured audit finds the gaps before production does — coverage blind spots, flaky assertions, and the failure modes you forgot to cover.
Most developers who have an eval suite have the same problem: they wrote evals for the cases they could easily construct, not the cases that matter most. After six months, the suite passes reliably — and still misses the bugs users actually encounter.
An eval audit is a structured process for finding those gaps. It takes two to three hours and usually surfaces three to five cases that should exist but don't.
Why eval suites develop blind spots
Evals are typically written in two batches: at the start of a project ("let's set up the basics") and after a production incident ("let's make sure that never happens again"). Both batches have coverage problems.
The initial batch tests the happy path and a few obvious failure modes. It is written before you know what the real failures are.
The incident batch tests the specific bug that just happened. It does not test the adjacent bugs that are also possible but haven't happened yet.
Over time, the eval suite becomes a record of the bugs you have already had, not a prediction of the bugs you are about to have.
The audit process is designed to fix this.
Step 1: Map your failure surface
Before you look at your existing evals, write down the complete list of ways your AI feature could fail in production. Do this from scratch, without looking at your current tests.
Organize by failure type:
Output failures:
- Wrong format (malformed JSON, unexpected structure)
- Truncated output (hit max_tokens before finishing)
- Hallucinated facts or entities
- Correct format, wrong content
- Refusal when it should answer
Behavior failures:
- Breaks character or persona
- Ignores instructions in the system prompt
- Leaks confidential information from the system prompt
- Uses a different language than expected
- Changes behavior unpredictably across turns in a conversation
Edge case failures:
- Very short input
- Very long input (near context limit)
- Input in an unexpected language
- Input that is adversarial (trying to jailbreak or manipulate)
- Input that is ambiguous
Integration failures:
- Downstream parse fails on valid output
- Output contains content that breaks the UI renderer
- Output is too long for the display area
Write these down before checking whether you have tests for them.
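If you want the failure surface in a diff-able form, one option is to record it as plain data that Step 2 can compare against your test suite. The category and mode names below are illustrative, not a canonical taxonomy:

```python
# Hypothetical sketch: the failure surface from Step 1 as plain data,
# so it can be diffed against the test suite in Step 2.
FAILURE_SURFACE = {
    "output": [
        "malformed_json",
        "truncated_output",
        "hallucinated_entities",
        "wrong_content_right_format",
        "unwarranted_refusal",
    ],
    "behavior": [
        "persona_break",
        "system_prompt_ignored",
        "system_prompt_leak",
        "wrong_language",
        "cross_turn_drift",
    ],
    "edge_case": [
        "very_short_input",
        "very_long_input",
        "unexpected_language",
        "adversarial_input",
        "ambiguous_input",
    ],
    "integration": [
        "downstream_parse_failure",
        "ui_breaking_content",
        "output_too_long_for_display",
    ],
}

# Flatten to a single checklist of failure modes
all_modes = [m for modes in FAILURE_SURFACE.values() for m in modes]
```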
Step 2: Check coverage
Now compare your failure surface to your existing test cases. For each item on your list, ask: "Do I have a test that would catch this if it started happening?"
Be strict. "I have a test that sometimes catches this" is not the same as "I have a test that reliably catches this."
A simple coverage table:
| Failure mode | Test exists? | Reliable? | Priority if missing |
|---|---|---|---|
| Malformed JSON output | ✓ | ✓ | — |
| Truncated output | ✗ | — | High |
| Persona breaks in turn 5+ | ✓ | Flaky | Medium |
| Adversarial input | ✗ | — | High |
| Very short input | ✗ | — | Low |
Anything High priority with no test is the first thing you add after the audit.
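The coverage check can be mechanized with a few lines of code: the table above expressed as a dict, plus a helper that lists the gaps ordered by priority. All names here are illustrative:

```python
# Hypothetical sketch: the Step 2 coverage table as data. Adjust the rows
# to your own failure surface.
coverage = {
    "malformed_json":      {"test_exists": True,  "reliable": True,  "priority": None},
    "truncated_output":    {"test_exists": False, "reliable": False, "priority": "high"},
    "persona_break_turn5": {"test_exists": True,  "reliable": False, "priority": "medium"},
    "adversarial_input":   {"test_exists": False, "reliable": False, "priority": "high"},
    "very_short_input":    {"test_exists": False, "reliable": False, "priority": "low"},
}

PRIORITY_ORDER = {"high": 0, "medium": 1, "low": 2}

def missing_tests(coverage: dict) -> list[str]:
    """Failure modes with no test at all, highest priority first."""
    gaps = [mode for mode, row in coverage.items() if not row["test_exists"]]
    return sorted(gaps, key=lambda m: PRIORITY_ORDER[coverage[m]["priority"]])
```

Flaky-but-existing tests (like the persona break above) are deliberately not in this list; those are Step 3's job.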
Step 3: Identify flaky assertions
Flaky evals are the worst kind — they create false confidence when they pass and false alarms when they fail. A failing CI job that turns out to be a flaky eval teaches your team to ignore CI failures.
Signs a test is flaky:
- It passes sometimes and fails sometimes with identical input (temperature > 0 in your eval runner)
- It uses LLM-as-judge with a vague criterion ("is this helpful?")
- It relies on exact substring matching for content that could be phrased many ways
- It assumes a specific output length that can legitimately vary
Check each test: would it pass consistently if you ran it ten times? If not, either fix the assertion or mark it as a monitored test (run nightly, not in CI).
```python
from dataclasses import dataclass
from typing import Callable

import anthropic


@dataclass
class EvalCase:
    name: str
    system_prompt: str
    user_message: str
    assert_fn: Callable[[str], bool]  # returns True when the output passes


def run_flakiness_check(case: EvalCase, runs: int = 10) -> dict:
    """Run a case multiple times to detect flakiness."""
    client = anthropic.Anthropic()
    results = []
    for _ in range(runs):
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=512,
            system=case.system_prompt,
            messages=[{"role": "user", "content": case.user_message}],
        )
        output = response.content[0].text
        results.append(case.assert_fn(output))
    pass_rate = sum(results) / len(results)
    return {
        "case": case.name,
        "pass_rate": pass_rate,
        # Mixed results across identical runs means the test is flaky
        "flaky": 0.0 < pass_rate < 1.0,
        "runs": results,
    }
```
Run this on your top 10-20 most important cases. Anything with a pass rate below 95% is flaky and needs attention.
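Applying the 95% threshold is then a one-liner over the dicts that `run_flakiness_check` returns. A small sketch (the threshold is the one suggested above, not a fixed standard):

```python
# Sketch: split flakiness-check results into solid cases and cases that
# need attention, using the 95% pass-rate budget from the text.
FLAKINESS_THRESHOLD = 0.95

def triage(results: list[dict]) -> dict:
    """Group flakiness-check results by whether they meet the budget."""
    solid = [r["case"] for r in results if r["pass_rate"] >= FLAKINESS_THRESHOLD]
    needs_attention = [r["case"] for r in results if r["pass_rate"] < FLAKINESS_THRESHOLD]
    return {"solid": solid, "needs_attention": needs_attention}
```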
Step 4: Check your golden dataset currency
Most eval suites include a golden dataset: question-answer pairs where you know the right answer. This dataset ages. If your prompt has changed since you wrote the golden answers, the answers may no longer be what your system should produce.
For each item in your golden dataset, ask:
- Is this question still representative of real user input?
- Is this answer still what the current system prompt would produce for a perfect response?
- Has anything about your system (prompt, retrieval, tools) changed in ways that make this answer wrong?
A golden dataset that hasn't been reviewed in six months is almost certainly stale.
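One cheap way to operationalize the staleness check is to record a review date on each golden item and flag everything not re-reviewed since the system prompt last changed. The `last_reviewed` field is a hypothetical schema choice; adapt it to however your dataset is stored:

```python
# Hypothetical sketch: flag golden-dataset items whose answers were last
# reviewed before the system prompt last changed.
from datetime import date

def stale_items(golden: list[dict], prompt_last_changed: date) -> list[str]:
    """IDs of golden items not re-reviewed since the prompt changed."""
    return [
        item["id"]
        for item in golden
        if item["last_reviewed"] < prompt_last_changed
    ]
```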
Step 5: Check for missing regression tests
Look at your last three months of production bugs. For each one:
- Could your eval suite have caught it before it shipped?
- Did you add a test for it after it happened?
If the answer to (2) is "no," add it now. Every production bug that doesn't become an eval case is a bug waiting to happen again.
If the answer to (1) is "no — not catchable by an eval," ask why. Some failures are genuinely hard to test (emergent behavior across many turns, user-specific edge cases). But often, a more creative assertion could have caught it.
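The bug-to-eval mapping in question (2) is easy to check mechanically if you track incident IDs. A sketch with made-up bug IDs and test names:

```python
# Hypothetical sketch: every production bug from the review window should
# map to at least one regression eval. IDs and names are illustrative.
production_bugs = ["BUG-101", "BUG-107", "BUG-112"]
regression_cases = {
    "BUG-101": "test_json_truncation",
    "BUG-112": "test_persona_break",
}

# Bugs with no regression case yet: these are the ones to add now
uncovered = [b for b in production_bugs if b not in regression_cases]
```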
What a healthy eval suite looks like
- Deterministic fast path — format, structure, and key content checks that run in seconds. These block CI on every PR.
- LLM-judge medium path — quality checks with specific, calibrated criteria. Run nightly or on the main branch.
- Regression cases — one test per production bug. Never deleted.
- Adversarial cases — at least 5-10 inputs designed to break the system. Reviewed quarterly.
- Golden dataset — reviewed every time the system prompt changes significantly.
- Flakiness budget — any test with <95% pass rate across 10 runs is fixed or moved to monitored.
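The flakiness budget and the adversarial-case count in this checklist can both be verified with a small health summary, assuming each test is tagged with a tier. The schema below is illustrative:

```python
# Hypothetical sketch: summarize suite health against the checklist above.
# Each test dict carries a name, a tier tag, and a measured pass rate.
FLAKINESS_THRESHOLD = 0.95
MIN_ADVERSARIAL = 5

def suite_health(tests: list[dict]) -> dict:
    """Check the flakiness budget and adversarial coverage."""
    over_budget = [t["name"] for t in tests if t["pass_rate"] < FLAKINESS_THRESHOLD]
    adversarial = [t for t in tests if t["tier"] == "adversarial"]
    return {
        "flaky_over_budget": over_budget,
        "adversarial_count": len(adversarial),
        "adversarial_ok": len(adversarial) >= MIN_ADVERSARIAL,
    }
```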
The audit is not a one-time process. Run it quarterly, or any time you make a major change to the system prompt or retrieval pipeline.
Related: Writing evals that catch regressions — the foundational implementation guide before auditing. Evaluating multi-agent systems — evaluation patterns specific to multi-agent pipelines.
Try this today: pick the five failure modes you most dread from Step 1 above. Write them down without looking at your current tests. Then check whether you have tests for each one. If you're missing two or more, start there.
Further reading
- Demystifying evals for AI agents — Anthropic's engineering guide to eval design
- Designing AI-resistant technical evaluations — how to write evals that stay useful as models improve