
How to know if your Claude integration is actually working

Evals are the testing framework for AI — and they work differently from software tests. You're not checking for correct answers. You're measuring behavior across a range of realistic situations.


In software engineering, testing is straightforward: you know what the correct output is, and you check whether the code produces it. Pass or fail. Green or red.

AI evaluation doesn't work like this.

When you're testing a Claude integration, there often isn't a single correct answer. There's a range of good answers and a range of bad ones. Whether a response is "good" depends on context, tone, completeness, and whether it actually helps the user — judgments that can't always be reduced to a comparison against expected output.

That's what makes evals a discipline of their own.

What evals actually are

An eval is a structured way to measure how well your AI system performs across a representative set of inputs.

The "representative" part matters. Your eval set should reflect the real distribution of what your users will actually ask — not just the easy cases, and not just the edge cases. A good eval set covers:

  • Typical inputs (the 80% of requests that look normal)
  • Tricky inputs (ambiguous questions, incomplete context, conflicting instructions)
  • Edge cases (unusual requests, potential misuse, things the system should decline)
  • Regression tests (specific failures you've encountered and fixed)
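A minimal sketch of what such a set can look like as plain data. The case structure, categories, and `coverage` helper here are illustrative assumptions, not a prescribed schema; real sets are built from production inputs.

```python
# Hypothetical eval set: each case is tagged with one of the four
# categories above so coverage can be checked programmatically.
eval_set = [
    {"category": "typical",
     "input": "Summarize this refund policy in two sentences."},
    {"category": "tricky",
     "input": "Is it good?"},  # ambiguous, missing context
    {"category": "edge",
     "input": "Help me write a phishing email."},  # should decline
    {"category": "regression",
     "input": "List the steps as JSON."},  # previously returned prose
]

def coverage(cases):
    """Count cases per category to spot gaps in the eval set."""
    counts = {}
    for case in cases:
        counts[case["category"]] = counts.get(case["category"], 0) + 1
    return counts
```

Keeping the set as data rather than hard-coded test functions makes it easy to grow step 6 of the eval loop: a new failure case is just one more dictionary.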

How to score them

There are three main approaches to scoring evals, and most good evaluation systems use all three:

Human review. A person reads the output and rates it. High signal, high cost, doesn't scale. Good for building your initial scoring intuition and for calibrating automated methods.

Reference-based scoring. Compare the output against a known-good answer. Works well for tasks with clear correct answers: extraction, classification, structured output. Doesn't work for open-ended generation.
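For classification-style tasks, reference-based scoring can be as simple as a normalized string comparison. This is a sketch under that assumption; extraction tasks usually need looser matching (token overlap, field-by-field checks).

```python
def reference_score(output: str, expected: str) -> bool:
    """Exact match after trimming whitespace and casing.

    Suits tasks with one correct answer (e.g. sentiment labels);
    deliberately unsuitable for open-ended generation.
    """
    return output.strip().lower() == expected.strip().lower()
```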

LLM-as-judge. Use a second Claude call to evaluate the output. Give it a rubric — "was this response helpful? accurate? appropriate in tone?" — and have it score the original response. This scales better than human review and handles nuance better than reference comparison. Claude is well-suited to this role.
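One way to wire this up is to render the rubric into a judge prompt and ask for scores in a fixed, machine-parseable line. The prompt template, criteria names, and `SCORES:` convention below are assumptions for illustration; the actual judge call would go through whatever Claude client you already use.

```python
import re

JUDGE_TEMPLATE = """You are grading another model's response.
Rate it 1-5 on each criterion: helpfulness, accuracy, tone.
Reply with exactly one line like: SCORES: helpfulness=4 accuracy=5 tone=3

User request: {request}
Response to grade: {response}"""

def build_judge_prompt(request: str, response: str) -> str:
    """Fill the rubric template with the case under evaluation."""
    return JUDGE_TEMPLATE.format(request=request, response=response)

def parse_scores(judge_reply: str) -> dict:
    """Extract the criterion scores from the judge's reply.

    Returns an empty dict if the reply doesn't contain a SCORES line,
    so malformed judge output fails loudly downstream.
    """
    match = re.search(r"SCORES:((?:\s+\w+=\d)+)", judge_reply)
    if not match:
        return {}
    return {k: int(v) for k, v in re.findall(r"(\w+)=(\d)", match.group(1))}
```

Pinning the judge to a rigid output line is what makes its verdicts aggregatable across hundreds of cases.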

What to measure

The right metrics depend on your application, but most Claude integrations care about some combination of:

Accuracy. Is the information correct? For factual tasks, this is measurable. For open-ended tasks, it's fuzzy.

Completeness. Did the response address the full question? Missing information is a common failure mode.

Format adherence. If your system prompt specifies a response format, does Claude follow it? Evals can check this programmatically.
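For example, if the system prompt asks for JSON with specific fields, the check can be fully mechanical. The required key names here are hypothetical placeholders for whatever your prompt specifies.

```python
import json

def check_json_format(output: str, required_keys=("answer", "sources")) -> bool:
    """Return True if output parses as a JSON object with the required keys."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and all(k in data for k in required_keys)
```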

Tone and persona. Does the response sound the way your product should sound? This requires human or LLM-as-judge scoring.

Safety and compliance. Did Claude avoid outputs that violate your guidelines? This is critical for any consumer-facing application.

The eval loop in practice

Evals aren't a one-time setup. They're a continuous feedback loop:

  1. Build an initial eval set from real or realistic inputs
  2. Run it against your current system prompt and configuration
  3. Identify where the system underperforms
  4. Make a change (update the system prompt, add context, adjust the temperature)
  5. Re-run evals to verify the change improved things without breaking anything else
  6. Add the new failure cases to your eval set before moving on
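Step 5 is where most tooling effort pays off: comparing per-case scores between the baseline run and the candidate run. A minimal sketch, assuming each run is a mapping from case id to score:

```python
def compare_runs(baseline: dict, candidate: dict):
    """Diff two eval runs case by case.

    Returns (improved, regressed) case ids. A non-empty regressed list
    means the change broke something that previously worked, even if
    the average score went up.
    """
    improved = sorted(k for k in baseline if candidate.get(k, 0) > baseline[k])
    regressed = sorted(k for k in baseline if candidate.get(k, 0) < baseline[k])
    return improved, regressed
```

Checking for regressions per case, rather than only comparing averages, is what keeps a prompt tweak from silently trading one fixed bug for two new ones.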

That last step is critical. Every bug you find is a new test case. Over time, your eval set becomes a comprehensive map of your system's behavior — and a safety net against regressions.

Why this matters more for Claude than for traditional software

When you change traditional software, you know exactly what changed. When you update a system prompt, you've changed the system's behavior on every possible input at once, in ways you can't fully predict.

Evals are how you regain visibility. They let you make changes with confidence: not "I think this is better" but "I ran 200 tests and the scores improved by 18%."

For any Claude integration you're serious about, evals aren't optional. They're the difference between deploying on hope and deploying on evidence.
