
How to evaluate your agents — without being a developer

In brief

Most agents in production have never been formally tested. The person who set them up tried a few examples and it seemed fine. That's how you end up with a contract review agent that hallucinates clause details. Evaluation doesn't require code — it requires a spreadsheet and 30 minutes a week.


Take any five AI agents running inside a company right now and ask the person who deployed them: "Have you formally tested this?"

Four out of five will give the same answer: "I tried it a bunch and it seemed to work well."

That's not evaluation. That's a vibe check. And vibes are how you end up with a contract review agent that confidently cites clause 4.3 when the actual problematic clause is 7.1. The contract reviews that matter most — the ones with real financial exposure — are exactly the cases where the agent is most likely to be confidently wrong in ways you haven't tested for.

Formal evaluation sounds like something that requires code. ML benchmarks, automated test runners, regression suites. It doesn't. Evaluation means: you have a consistent set of test cases, you run them regularly, and you know what good output looks like. You can do all of this in a spreadsheet. The rigor is in the test design, not the tooling.

Here's the complete process.


What evaluation actually is at the Agent Operator level

Let's clear up the terminology, because "evaluation" (or "evals") means different things to different people.

For an ML researcher, evaluation means running a model against a benchmark dataset and measuring performance across thousands of examples. For a developer, it might mean an automated test suite that runs on every code change. Neither of these is what you need.

For an Agent Operator, evaluation means: a documented set of inputs and expected outputs that you run on a schedule to check whether your agent is still performing correctly.

The inputs are real examples — questions or tasks the agent actually receives. The expected outputs are written descriptions of what good looks like. You run these cases, you compare the actual output to your expectation, and you note any failures.

That's it. The value comes from doing it consistently, not from doing it with sophisticated tooling.
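
To make this concrete, here is one hypothetical test case written out in full. The wording is illustrative, not a required format; the same three pieces of information fit in a single spreadsheet row.

```python
# One illustrative test case (hypothetical): a real input, a written
# description of what "good" looks like, and a place to record each run.
test_case = {
    "input": "Does the liability cap in this MSA cover data breach costs?",
    "expected": "Identifies the correct clause number, quotes the relevant "
                "language, and does not cite clauses that are not in the document.",
    "results": {},  # e.g. {"05-07": "Pass", "05-14": "Partial"}
}
```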


Step 1: Build your test set

This is the most important step and the one most often skipped because it requires thinking before doing.

Pull 20 real examples from the agent's recent history.

If you have logs of what users have submitted to the agent, go back through the last month and pick 20 examples. Choose them deliberately:

  • 12 typical cases (the most common types of inputs the agent receives)
  • 5 edge cases (unusual inputs: incomplete information, ambiguous phrasing, requests that are adjacent to but outside the agent's scope)
  • 3 known failure cases if you have them — times the agent produced a wrong or unhelpful output

If you don't have logs, create 20 realistic examples based on your knowledge of how the agent is used. Ask a few users "what's a typical question you'd ask it?" and "what's the weirdest thing you've tried to ask it?"

Write down what "good" looks like for each case.

This is the step that distinguishes evaluation from guessing. For every test case, write one or two sentences describing what a good response looks like. Be specific — not "accurate" but:

  • "Identifies the correct clause number and quotes the relevant language"
  • "Gives the handling procedure for hazmat category 3 shipments, not category 2"
  • "Declines to answer and redirects to HR with a specific contact name"
  • "Lists exactly 3 options with cost and lead time for each, not more, not fewer"

Vague criteria ("clear and accurate") produce inconsistent judgments. Specific criteria produce consistent ones.

Don't build a test set of 100. Twenty is enough to catch most problems. A hundred cases you stop running is worth nothing. Twenty you run every week is worth everything.
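
If you would rather keep the test set in a plain file alongside the spreadsheet, the structure is just four columns. The sketch below is optional; the file name cases.csv and its column names are assumptions, and it does nothing more than confirm your mix of typical, edge, and known-failure cases.

```python
import csv
from collections import Counter

# Optional sketch: a test set stored as a CSV, one row per case.
# Assumed columns: id, category (typical / edge / known-failure), input, expected.
with open("cases.csv", newline="", encoding="utf-8") as f:
    cases = list(csv.DictReader(f))

counts = Counter(case["category"] for case in cases)
print(f"{len(cases)} cases:", dict(counts))
# Target mix from Step 1: 12 typical, 5 edge, 3 known-failure.
```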


Step 2: Score regularly

Running your test set takes about 30 minutes. Do it:

  • Every week (if the agent is in a high-stakes workflow or high volume)
  • Every two weeks (for most agents)
  • After any change to the system prompt, context documents, or model version

The scoring rubric: Keep it simple. Three categories:

  • Pass: The output matches or exceeds your written expectation
  • Partial: The output is roughly right but missing something or formatted incorrectly
  • Fail: The output is wrong, misleading, or unhelpful

No partial credit math. Just count your Passes, Partials, and Fails.

The action threshold: If more than 2 of your 20 cases fail in any given run, stop and investigate before doing anything else with this agent. Especially before expanding its scope or adding users. A 10% failure rate sounds small until you multiply it by your volume.

Track in a spreadsheet. Date each run. The trend matters as much as any single result. An agent whose pass rate drops from 18/20 to 15/20 over three weeks is telling you something — even if 15/20 still feels "good enough."

Here's the column structure that works:

Test Case # | Input       | Expected Output         | Result (05-07) | Result (05-14) | Notes
1           | [the input] | [what good looks like]  | Pass           | Pass           |
2           | [the input] | [what good looks like]  | Pass           | Partial        | Format changed

Keep this spreadsheet. It becomes your institutional memory for what the agent has done and when things changed.
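
If you export that spreadsheet to a CSV, a short script can tally each run and flag the action threshold for you. This is a convenience, not a requirement; the file name eval_log.csv and its column names are assumptions, with one row per test case per run.

```python
import csv
from collections import Counter

# Optional sketch: count Pass / Partial / Fail per run date and flag the
# "more than 2 of 20 fail" threshold from Step 2.
# Assumed columns: run_date, result (Pass / Partial / Fail).
FAIL_THRESHOLD = 2

tallies = {}
with open("eval_log.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        tallies.setdefault(row["run_date"], Counter())[row["result"]] += 1

for run_date in sorted(tallies):
    counts = tallies[run_date]
    flag = "  <-- stop and investigate" if counts["Fail"] > FAIL_THRESHOLD else ""
    print(f"{run_date}: {counts['Pass']} pass, {counts['Partial']} partial, "
          f"{counts['Fail']} fail{flag}")
```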


Step 3: The human-in-the-loop audit

Your test set catches regressions — things that used to work and stopped working. But real usage is messier than your test set. Users ask things you didn't anticipate. Context gets used in combinations you didn't test.

Once a week, pull 10 random real outputs from the last 7 days. For each one, ask:

  • Would I be comfortable if a department head saw this output?
  • Is there anything in this that's factually wrong?
  • Does it match the format and scope the agent is supposed to produce?

This is a 15-minute exercise. It catches drift that your test set misses — the gradual expansion of scope as users push the boundaries, the edge case you didn't anticipate, the output that's technically passing your rubric but feels off in actual use.
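
If the agent's outputs are logged to a file, drawing the weekly sample can be automated so you are not tempted to cherry-pick. A minimal sketch, assuming a hypothetical outputs.csv log with an ISO-format timestamp column:

```python
import csv
import random
from datetime import datetime, timedelta

# Optional sketch: pull 10 random outputs from the last 7 days for manual review.
# Assumed columns in outputs.csv: timestamp (ISO format), output.
cutoff = datetime.now() - timedelta(days=7)

with open("outputs.csv", newline="", encoding="utf-8") as f:
    recent = [row for row in csv.DictReader(f)
              if datetime.fromisoformat(row["timestamp"]) >= cutoff]

for i, row in enumerate(random.sample(recent, k=min(10, len(recent))), start=1):
    print(f"--- Sample {i} ({row['timestamp']}) ---")
    print(row["output"])
```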


Step 4: The freshness check

Stale context is the most common cause of mysterious agent failures — the "it worked last month" problems that have no obvious cause.

Every month, ask these questions for each agent:

  • What documents or data sources does this agent use?
  • When were they last updated?
  • Has anything changed in our business that should be reflected in those documents?

Make a list. For each document the agent relies on: what's the update schedule, who owns it, and when does it need to be refreshed?

For policies and runbooks that change infrequently, a quarterly review is usually fine. For operational data like pricing, inventory, or scheduling — if it's not connected to a live source, it needs more frequent manual updates.

If you find that an agent is working from a document that was last updated six months ago, don't assume it's fine. Check: has anything material changed? Update the document and run your test set.
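
The freshness list can live in the same spreadsheet, but if you want an automatic nudge when something is overdue, a sketch like the one below works. The file name sources.csv and its columns are assumptions; a recurring calendar reminder does the same job.

```python
import csv
from datetime import date

# Optional sketch: flag context documents past their review interval.
# Assumed columns in sources.csv: document, owner, last_updated (YYYY-MM-DD),
# review_days (e.g. 90 for quarterly policy reviews, fewer for operational data).
today = date.today()

with open("sources.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        age = (today - date.fromisoformat(row["last_updated"])).days
        if age > int(row["review_days"]):
            print(f"OVERDUE: {row['document']} (owner: {row['owner']}, "
                  f"last updated {age} days ago)")
```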


The red flags that mean stop and investigate

These are behaviors that should trigger immediate investigation regardless of your test set results:

The agent is suddenly much more verbose. It's adding caveats, disclaimers, and qualifications it wasn't adding before. This often signals a model version change or a context problem.

The agent starts hedging constantly. "I'm not sure but..." or "This might not be accurate, but..." — if this increases significantly, something changed in the model or the prompt.

Output format drifts. It was producing bullet-pointed summaries; now it's producing paragraphs. Something changed in the prompt or model that's affecting format adherence.
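
Verbosity and format drift show up in the numbers before they are obvious to the eye. If outputs are logged, a rough weekly average of output length is enough to spot a sudden jump; the sketch below reuses the hypothetical outputs.csv from Step 3.

```python
import csv
from collections import defaultdict
from datetime import datetime
from statistics import mean

# Optional sketch: average output length per ISO week, as a rough drift signal.
# Assumed columns in outputs.csv: timestamp (ISO format), output.
lengths = defaultdict(list)
with open("outputs.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        year, week, _ = datetime.fromisoformat(row["timestamp"]).isocalendar()
        lengths[f"{year}-W{week:02d}"].append(len(row["output"].split()))

for week in sorted(lengths):
    print(f"{week}: avg {mean(lengths[week]):.0f} words over {len(lengths[week])} outputs")
```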

Users start routing around it. They message you directly instead of using the agent, or they use it but don't act on the output. This is a lagging indicator — by the time you notice, quality has been declining for a while.

Your failure rate crosses 2/20. Stop expanding. Fix before you scale.


What to do when you find a failure

Finding a failure is not a crisis — it's the system working. Your evaluation caught something before it caused more damage.

The first response is almost never "change the model." Ninety percent of the time, the fix is in the prompt or the context. Work through the failure systematically:

  1. Can you reproduce it with the specific input that failed? (If not, it may be intermittent — test 3 more times)
  2. Is the context document current and correct?
  3. Does the system prompt have a clear instruction that covers this case?
  4. If not, add a specific example to the system prompt that shows the correct behavior for this type of case

If you've worked through all of this and still can't identify the cause, document the specific failure case before asking for technical help. "It's giving wrong answers" doesn't help anyone diagnose the problem. "On inputs of type X, it consistently gives output Y when it should give output Z — and this started after we updated the policy doc on May 3" does.


Try this today

Take your most important running agent. Open a spreadsheet. Write down 10 real inputs it has received in the last month — go through your logs, ask users, or create realistic examples if you have to. For each one, write one sentence describing what a good response looks like.

That's your test set. Run it against the current agent tonight. Note the results.

You now have a baseline. Next week, run it again and compare. That's evaluation. It took you two hours and it will take 30 minutes a week to maintain.

The contract review agent that's hallucinating clause details needs one of these. Start there.
