Auditing your eval suite: are you testing the right things?
In brief
Most eval suites test what was easy to write, not what matters most. A structured audit finds the gaps before production does — coverage blind spots, flaky assertions, and the failure modes you forgot to cover.
Most developers who have an eval suite have the same problem: they wrote evals for the cases they could easily construct, not the cases that matter most. After six months, the suite passes reliably — and still misses the bugs users actually encounter.
An eval audit is a structured process for finding those gaps. It takes two to three hours and usually surfaces three to five cases that should exist but don't.
Why eval suites develop blind spots
Evals are typically written in two batches: at the start of a project ("let's set up the basics") and after a production incident ("let's make sure that never happens again"). Both batches have coverage problems.
The initial batch tests the happy path and a few obvious failure modes. It is written before you know what the real failures are.
The incident batch tests the specific bug that just happened. It does not test the adjacent bugs that are also possible but haven't happened yet.
Over time, the eval suite becomes a record of the bugs you have already had, not a prediction of the bugs you are about to have.
The audit process is designed to fix this.
Step 1: Map your failure surface
Before you look at your existing evals, write down the complete list of ways your AI feature could fail in production. Do this from scratch, without looking at your current tests.
Organize by failure type:
Output failures:
- Wrong format (malformed JSON, unexpected structure)
- Truncated output (hit max_tokens before finishing)
- Hallucinated facts or entities
- Correct format, wrong content
- Refusal when it should answer
Behavior failures:
- Breaks character or persona
- Ignores instructions in the system prompt
- Leaks confidential information from the system prompt
- Uses a different language than expected
- Changes behavior unpredictably across turns in a conversation
Edge case failures:
- Very short input
- Very long input (near context limit)
- Input in an unexpected language
- Input that is adversarial (trying to jailbreak or manipulate)
- Input that is ambiguous
Integration failures:
- Downstream parse fails on valid output
- Output contains content that breaks the UI renderer
- Output is too long for the display area
Write these down before checking whether you have tests for them.
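If you want the failure surface in a diff-able form, one option is to record it as plain data that Step 2 can compare against your test suite. The category and mode names below are illustrative, not a canonical taxonomy:

```python
# Hypothetical sketch: the failure surface from Step 1 as plain data,
# so it can be diffed against the test suite in Step 2.
FAILURE_SURFACE = {
    "output": [
        "malformed_json",
        "truncated_output",
        "hallucinated_entities",
        "wrong_content_right_format",
        "unwarranted_refusal",
    ],
    "behavior": [
        "persona_break",
        "system_prompt_ignored",
        "system_prompt_leak",
        "wrong_language",
        "cross_turn_drift",
    ],
    "edge_case": [
        "very_short_input",
        "very_long_input",
        "unexpected_language",
        "adversarial_input",
        "ambiguous_input",
    ],
    "integration": [
        "downstream_parse_failure",
        "ui_breaking_content",
        "output_too_long_for_display",
    ],
}

# Flatten to a single checklist of failure modes
all_modes = [m for modes in FAILURE_SURFACE.values() for m in modes]
```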
Step 2: Check coverage
Now compare your failure surface to your existing test cases. For each item on your list, ask: "Do I have a test that would catch this if it started happening?"
Be strict. "I have a test that sometimes catches this" is not the same as "I have a test that reliably catches this."
A simple coverage table:
| Failure mode | Test exists? | Reliable? | Priority if missing |
|---|---|---|---|
| Malformed JSON output | ✓ | ✓ | — |
| Truncated output | ✗ | — | High |
| Persona breaks in turn 5+ | ✓ | Flaky | Medium |
| Adversarial input | ✗ | — | High |
| Very short input | ✗ | — | Low |
Anything High priority with no test is the first thing you add after the audit.
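The coverage check can be mechanized with a few lines of code: the table above expressed as a dict, plus a helper that lists the gaps ordered by priority. All names here are illustrative:

```python
# Hypothetical sketch: the Step 2 coverage table as data. Adjust the rows
# to your own failure surface.
coverage = {
    "malformed_json":      {"test_exists": True,  "reliable": True,  "priority": None},
    "truncated_output":    {"test_exists": False, "reliable": False, "priority": "high"},
    "persona_break_turn5": {"test_exists": True,  "reliable": False, "priority": "medium"},
    "adversarial_input":   {"test_exists": False, "reliable": False, "priority": "high"},
    "very_short_input":    {"test_exists": False, "reliable": False, "priority": "low"},
}

PRIORITY_ORDER = {"high": 0, "medium": 1, "low": 2}

def missing_tests(coverage: dict) -> list[str]:
    """Failure modes with no test at all, highest priority first."""
    gaps = [mode for mode, row in coverage.items() if not row["test_exists"]]
    return sorted(gaps, key=lambda m: PRIORITY_ORDER[coverage[m]["priority"]])
```

Flaky-but-existing tests (like the persona break above) are deliberately not in this list; those are Step 3's job.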
Step 3: Identify flaky assertions
Flaky evals are the worst kind — they create false confidence when they pass and false alarms when they fail. A failing CI job that turns out to be a flaky eval teaches your team to ignore CI failures.
Signs a test is flaky:
- It passes sometimes and fails sometimes with identical input (temperature > 0 in your eval runner)
- It uses LLM-as-judge with a vague criterion ("is this helpful?")
- It relies on exact substring matching for content that could be phrased many ways
- It assumes a specific output length that can legitimately vary
Check each test: would it pass consistently if you ran it ten times? If not, either fix the assertion or mark it as a monitored test (run nightly, not in CI).
```python
from dataclasses import dataclass
from typing import Callable

import anthropic


@dataclass
class EvalCase:
    name: str
    system_prompt: str
    user_message: str
    assert_fn: Callable[[str], bool]  # returns True when the output passes


def run_flakiness_check(case: EvalCase, runs: int = 10) -> dict:
    """Run a case multiple times to detect flakiness."""
    client = anthropic.Anthropic()
    results = []
    for _ in range(runs):
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=512,
            system=case.system_prompt,
            messages=[{"role": "user", "content": case.user_message}],
        )
        output = response.content[0].text
        results.append(case.assert_fn(output))
    pass_rate = sum(results) / len(results)
    return {
        "case": case.name,
        "pass_rate": pass_rate,
        # Mixed results across identical runs means the test is flaky
        "flaky": 0.0 < pass_rate < 1.0,
        "runs": results,
    }
```
Run this on your top 10-20 most important cases. Anything with a pass rate below 95% is flaky and needs attention.
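Applying the 95% threshold is then a one-liner over the dicts that `run_flakiness_check` returns. A small sketch (the threshold is the one suggested above, not a fixed standard):

```python
# Sketch: split flakiness-check results into solid cases and cases that
# need attention, using the 95% pass-rate budget from the text.
FLAKINESS_THRESHOLD = 0.95

def triage(results: list[dict]) -> dict:
    """Group flakiness-check results by whether they meet the budget."""
    solid = [r["case"] for r in results if r["pass_rate"] >= FLAKINESS_THRESHOLD]
    needs_attention = [r["case"] for r in results if r["pass_rate"] < FLAKINESS_THRESHOLD]
    return {"solid": solid, "needs_attention": needs_attention}
```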
Step 4: Check your golden dataset currency
Most eval suites include a golden dataset: question-answer pairs where you know the right answer. This dataset ages. If your prompt has changed since you wrote the golden answers, the answers may no longer be what your system should produce.
For each item in your golden dataset, ask:
- Is this question still representative of real user input?
- Is this answer still what the current system prompt would produce for a perfect response?
- Has anything about your system (prompt, retrieval, tools) changed in ways that make this answer wrong?
A golden dataset that hasn't been reviewed in six months is almost certainly stale.
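One cheap way to operationalize the staleness check is to record a review date on each golden item and flag everything not re-reviewed since the system prompt last changed. The `last_reviewed` field is a hypothetical schema choice; adapt it to however your dataset is stored:

```python
# Hypothetical sketch: flag golden-dataset items whose answers were last
# reviewed before the system prompt last changed.
from datetime import date

def stale_items(golden: list[dict], prompt_last_changed: date) -> list[str]:
    """IDs of golden items not re-reviewed since the prompt changed."""
    return [
        item["id"]
        for item in golden
        if item["last_reviewed"] < prompt_last_changed
    ]
```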
Step 5: Check for missing regression tests
Look at your last three months of production bugs. For each one:
- Could your eval suite have caught it before it shipped?
- Did you add a test for it after it happened?
If the answer to (2) is "no," add it now. Every production bug that doesn't become an eval case is a bug waiting to happen again.
If the answer to (1) is "no — not catchable by an eval," ask why. Some failures are genuinely hard to test (emergent behavior across many turns, user-specific edge cases). But often, a more creative assertion could have caught it.
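The bug-to-eval mapping in question (2) is easy to check mechanically if you track incident IDs. A sketch with made-up bug IDs and test names:

```python
# Hypothetical sketch: every production bug from the review window should
# map to at least one regression eval. IDs and names are illustrative.
production_bugs = ["BUG-101", "BUG-107", "BUG-112"]
regression_cases = {
    "BUG-101": "test_json_truncation",
    "BUG-112": "test_persona_break",
}

# Bugs with no regression case yet: these are the ones to add now
uncovered = [b for b in production_bugs if b not in regression_cases]
```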
What a healthy eval suite looks like
- Deterministic fast path — format, structure, and key content checks that run in seconds. These block CI on every PR.
- LLM-judge medium path — quality checks with specific, calibrated criteria. Run nightly or on the main branch.
- Regression cases — one test per production bug. Never deleted.
- Adversarial cases — at least 5-10 inputs designed to break the system. Reviewed quarterly.
- Golden dataset — reviewed every time the system prompt changes significantly.
- Flakiness budget — any test with <95% pass rate across 10 runs is fixed or moved to monitored.
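The flakiness budget and the adversarial-case count in this checklist can both be verified with a small health summary, assuming each test is tagged with a tier. The schema below is illustrative:

```python
# Hypothetical sketch: summarize suite health against the checklist above.
# Each test dict carries a name, a tier tag, and a measured pass rate.
FLAKINESS_THRESHOLD = 0.95
MIN_ADVERSARIAL = 5

def suite_health(tests: list[dict]) -> dict:
    """Check the flakiness budget and adversarial coverage."""
    over_budget = [t["name"] for t in tests if t["pass_rate"] < FLAKINESS_THRESHOLD]
    adversarial = [t for t in tests if t["tier"] == "adversarial"]
    return {
        "flaky_over_budget": over_budget,
        "adversarial_count": len(adversarial),
        "adversarial_ok": len(adversarial) >= MIN_ADVERSARIAL,
    }
```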
The audit is not a one-time process. Run it quarterly, or any time you make a major change to the system prompt or retrieval pipeline.
Related: Writing evals that catch regressions — the foundational implementation guide before auditing. Evaluating multi-agent systems — evaluation patterns specific to multi-agent pipelines.
Try this today: pick the five failure modes you most dread from Step 1 above. Write them down without looking at your current tests. Then check whether you have tests for each one. If you're missing two or more, start there.
Further reading
- Demystifying evals for AI agents — Anthropic's engineering guide to eval design
- Designing AI-resistant technical evaluations — how to write evals that stay useful as models improve