AI Codex

How to know if your Claude integration is actually working

Most teams go live on gut feel and find out six weeks later that Claude has been quietly giving wrong answers. Here's how to know before that happens — without being an engineer.


You shipped your Claude integration. Users are using it. But is it working?

Most operators answer this question with vibes. They try it themselves a few times, it seems okay, and they move on. Six weeks later, a customer mentions something was wrong. Or a team member notices Claude keeps doing the same annoying thing. Or you realise you've been running something embarrassing without knowing it.

Evaluation — having a structured way to measure how Claude is performing — is how you close this loop before it costs you. And it doesn't require being a machine learning engineer to do it properly.

What you're actually measuring

When you evaluate a Claude integration, you're asking: across a realistic range of inputs, does Claude do what we want it to do?

That "realistic range" matters more than most teams realise. Claude might perform brilliantly on the twenty questions you tested during development. It might quietly fail on the question type that turns out to be 40% of your actual volume.

Evaluation is the practice of finding those failure modes before your users do.

The simplest eval you can build today

Step 1: Collect 50 real inputs. If you have existing data, use it — actual questions or tasks from real users. If you're pre-launch, write them yourself: 30 typical cases, 10 tricky edge cases, 10 things Claude should handle but might get wrong.

Step 2: Write down what "good" looks like for each one. Not necessarily a word-for-word answer, but criteria. "Mentions the refund policy. Doesn't suggest anything outside our policy. Tone is friendly, not robotic." Three to five criteria per question is enough.

Step 3: Run your current system through all 50. Look at each output. Score it pass/fail against your criteria. You'll immediately see patterns — the question types Claude consistently struggles with, the criteria that keep getting missed.

Step 4: Fix the highest-frequency failures first. Usually this means updating your system prompt. Change one thing at a time. Re-run the failing cases. Verify the fix didn't break anything else.

Step 5: Add new failures to the set as you find them. Every bug is a new test case. Over time, your eval set becomes a safety net — you can make changes confidently because you know you'll catch regressions.
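If you'd rather run this as a script than a spreadsheet, the five steps above amount to a small loop. Here's a minimal sketch: `run_system` and `meets` are hypothetical stand-ins — in practice `run_system` calls your Claude integration and `meets` is a human (or Claude-as-judge) deciding pass/fail per criterion.

```python
# A minimal sketch of the eval loop from the steps above.
# `run_system` and `meets` are illustrative stubs, not a real integration.
from collections import Counter

cases = [
    {"input": "How do I get a refund?",
     "criteria": ["mentions refund policy", "stays within policy"]},
    {"input": "Can you waive my fee?",
     "criteria": ["stays within policy", "friendly tone"]},
]

def run_system(user_input):
    # Placeholder: in practice, call your Claude integration here.
    return "Our refund policy allows returns within 30 days."

def meets(output, criterion):
    # Placeholder: in practice a human (or Claude-as-judge) decides
    # pass/fail per criterion. Here: a naive keyword check.
    return criterion == "mentions refund policy" and "refund policy" in output

missed = Counter()
for case in cases:
    output = run_system(case["input"])
    for criterion in case["criteria"]:
        if not meets(output, criterion):
            missed[criterion] += 1

# Step 4: fix the highest-frequency failures first.
for criterion, count in missed.most_common():
    print(f"{count} cases missed: {criterion}")
```

Step 5 is then just appending to `cases` every time you find a new bug.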

The three failure modes to watch for

Accuracy failures. Claude says something wrong or misleading. Especially dangerous for anything involving your product, pricing, or policies. This is where RAG helps — if Claude is making things up, it often needs more grounding in your actual information.

Scope failures. Claude does something you didn't intend — answers questions outside your product area, makes promises you can't keep, engages with topics you'd rather it deflected. Caught by testing edge cases, not typical inputs.

Tone and persona failures. Claude sounds wrong — too formal, too casual, not like your brand, weirdly robotic. Easy to miss if you're only checking for factual accuracy. Worth having someone from your team who isn't close to the build read through a set of outputs fresh.

The tool you already have

Claude itself is one of the most useful evaluation tools available. Once you have a set of outputs you want to evaluate, you can ask Claude to assess them:

"Here is a support response from our AI assistant. Here are our evaluation criteria: [criteria]. Rate this response on each criterion and flag any issues."

Run this across your eval set and you get fast, structured feedback that's cheaper and faster than human review — while still being substantive enough to catch real problems. Use human review for calibration, Claude-as-judge for scale.
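To run that judge prompt across a whole eval set, the useful pieces to automate are building the prompt and parsing the verdicts. Here's a sketch — the template wording and the `criterion: PASS/FAIL` reply format are assumptions, and the actual API call is left as a comment since it depends on your setup:

```python
# Building a Claude-as-judge prompt from the template above, plus a
# parser for the verdicts. The reply format is an assumed convention.

JUDGE_TEMPLATE = (
    "Here is a support response from our AI assistant:\n\n{response}\n\n"
    "Here are our evaluation criteria:\n{criteria}\n\n"
    "Rate this response PASS or FAIL on each criterion, one per line, "
    "in the form 'criterion: PASS' or 'criterion: FAIL', then flag any issues."
)

def build_judge_prompt(response, criteria):
    bulleted = "\n".join(f"- {c}" for c in criteria)
    return JUDGE_TEMPLATE.format(response=response, criteria=bulleted)

def parse_verdicts(judge_reply):
    # Turn the judge's 'criterion: PASS/FAIL' lines into a dict.
    verdicts = {}
    for line in judge_reply.splitlines():
        if ":" in line:
            criterion, _, verdict = line.rpartition(":")
            verdict = verdict.strip().upper()
            if verdict in ("PASS", "FAIL"):
                verdicts[criterion.strip()] = (verdict == "PASS")
    return verdicts

# In practice: send build_judge_prompt(...) to Claude via your API client,
# then feed the reply text to parse_verdicts. A canned reply for illustration:
reply = "mentions refund policy: PASS\nfriendly tone: FAIL"
print(parse_verdicts(reply))
```

Spot-check a sample of the judge's verdicts against your own by hand before trusting it at scale — that's the calibration step.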

What good looks like

A mature eval setup for an early-stage integration is modest: 50–100 test cases, a clear rubric, a process for adding new cases monthly, and a habit of running evals before shipping any significant change to your system prompt.

That's it. No complex infrastructure, no ML expertise required. Just a Google Sheet, an honest rubric, and the discipline to check it before you ship.
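The "run evals before shipping" habit is easy to automate once your sheet exists. A sketch, assuming you export the sheet as CSV — the column names and the 90% threshold are illustrative choices, not a standard:

```python
# A pre-ship regression check over eval results exported from a
# spreadsheet. Column names and threshold are illustrative.
import csv
import io

SHEET = """input,criterion,result
How do I get a refund?,mentions refund policy,pass
How do I get a refund?,friendly tone,pass
Can you waive my fee?,stays within policy,fail
"""

def pass_rate(csv_text):
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    passed = sum(1 for row in rows if row["result"] == "pass")
    return passed / len(rows)

def ready_to_ship(csv_text, threshold=0.9):
    return pass_rate(csv_text) >= threshold

print(f"pass rate: {pass_rate(SHEET):.0%}")
print(ready_to_ship(SHEET))  # False: 2 of 3 checks passed, below 90%
```

Pick a threshold honestly — the point is that a prompt change which drops your pass rate gets caught before users see it, not after.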

The teams that get the most out of Claude aren't the ones who built the most sophisticated systems. They're the ones who built small eval sets early and used them to catch and fix failures while they were still cheap to fix.
