How to actually evaluate whether your AI rollout is working
Most AI rollout evaluations are either too vague ("the team likes it") or too technical (automated test suites that miss what users actually care about). Here's what works.
Six months after your AI rollout, someone asks: "Is it working?"
If your honest answer is "we think so, people seem positive," you don't have an evaluation process. You have vibes.
Evals, the practice of systematically measuring AI output quality, sound like a developer concern. In practice, they're an operator concern. You don't need to run automated test suites. You do need a process for knowing whether the rollout is delivering.
The two questions to answer
Every eval process comes down to two questions:
1. Are the outputs good enough? Does Claude produce outputs that are accurate, on-brand, and useful? Would you be comfortable if a customer or executive saw them?
2. Is it making a meaningful difference? Are the metrics you care about moving: handle time, error rate, output volume, hours saved, tickets escalated?
Most teams only measure one of these. Teams that measure only quality often don't know if the tool is actually changing productivity. Teams that measure only productivity often don't notice when quality degrades.
You need both.
A practical quality evaluation process
Pick a sample size you can actually sustain. For most teams, 10-20 outputs per week reviewed by one person is enough to catch systematic problems.
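A weekly sample is only useful if it isn't biased toward whatever is newest or easiest to find. A minimal sketch, assuming your outputs are stored as a list of records or IDs (the `ticket-N` names here are purely illustrative):

```python
import random

def weekly_review_sample(outputs, k=15, seed=None):
    """Pick a random sample of this week's outputs for quality review.

    `outputs` can be any list your team stores: dicts, IDs, file paths.
    Sampling randomly, rather than taking the most recent items, keeps
    the review from skewing toward a single ticket type or time of week.
    """
    rng = random.Random(seed)
    return rng.sample(outputs, min(k, len(outputs)))

# Hypothetical week of 120 ticket replies; review 15 of them.
this_week = [f"ticket-{i}" for i in range(120)]
review_set = weekly_review_sample(this_week, k=15, seed=42)
```

Fixing the seed makes the week's sample reproducible if two people need to look at the same set.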
Create a simple rubric with 3-4 criteria that matter for your use case. For a customer support application:
- Accurate (did it get the facts right?)
- Appropriate tone (did it match the situation?)
- Complete (did it actually answer the question?)
- On-brand (does this sound like us?)
Score each output 1-3 on each criterion. Track scores over time. Look for systematic failures: a specific ticket type that always scores low, or a consistent accuracy problem in one product area.
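The scoring and trend-tracking above fit in a spreadsheet, but the aggregation logic is simple enough to sketch. A minimal version, assuming each reviewed output is a dict of criterion scores (the criterion names and the 2.0 threshold are illustrative, not prescribed):

```python
from collections import defaultdict
from statistics import mean

CRITERIA = ["accurate", "tone", "complete", "on_brand"]

def criterion_averages(scored_outputs):
    """Average each rubric criterion (scored 1-3) across a week's sample."""
    by_criterion = defaultdict(list)
    for row in scored_outputs:
        for c in CRITERIA:
            by_criterion[c].append(row[c])
    return {c: round(mean(scores), 2) for c, scores in by_criterion.items()}

def flag_systematic_failures(scored_outputs, threshold=2.0):
    """Name any criterion whose weekly average falls below the threshold."""
    return [c for c, avg in criterion_averages(scored_outputs).items()
            if avg < threshold]

# Hypothetical week: two reviewed outputs with a recurring accuracy problem.
week = [
    {"accurate": 1, "tone": 3, "complete": 3, "on_brand": 2},
    {"accurate": 2, "tone": 3, "complete": 2, "on_brand": 3},
]
flags = flag_systematic_failures(week)  # -> ["accurate"]
```

Grouping the same way by ticket type or product area, instead of by criterion, surfaces the "specific ticket type that always scores low" case.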
This takes 20-30 minutes a week. It will tell you more than any automated metric.
The productivity metrics that actually work
Before/after measures are hard — there are too many confounding factors. Better: measure the same task type throughout the rollout and track the trend.
Good metrics for common AI use cases:
- Time per ticket (customer support)
- Drafts submitted vs. accepted (content/marketing)
- Time from brief to first draft (any writing workflow)
- Volume produced per person per week (content)
Bad metrics: NPS (too lagged, too many confounds), "team satisfaction" (people often like tools that don't actually save time), cost per token (measures input, not output quality).
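Trend tracking, as opposed to a single before/after comparison, can be as simple as a rolling average over weekly values. A sketch, with hypothetical handle-time numbers:

```python
from statistics import mean

def weekly_trend(values, window=4):
    """Rolling average over weekly metric values (e.g. minutes per ticket).

    A trend on the same task type is more robust than one before/after
    snapshot, which is confounded by staffing, seasonality, and ticket mix.
    """
    return [round(mean(values[max(0, i - window + 1): i + 1]), 2)
            for i in range(len(values))]

# Hypothetical minutes-per-ticket, one value per week of the rollout.
handle_time = [12.0, 11.5, 11.8, 10.9, 10.2, 9.8]
trend = weekly_trend(handle_time)
```

A noisy week barely moves the rolling average; four consecutive bad weeks do, which is exactly the signal worth reacting to.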
The failure signal to watch for
Outputs look fine when you read them, but customers or downstream users push back more than expected. This is the hardest failure mode — the outputs pass your internal review but fail in the real world.
It's almost always a context problem: Claude is optimising for what looks right in isolation, but missing something about how customers actually interpret the communication. Fix: include real customer feedback in your evaluation loop, not just internal review.
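Closing that loop means cross-referencing internal review scores against downstream signals. A minimal sketch, assuming you can map output IDs to their internal rubric average and to a set of customer escalations (all names and the 2.5 cutoff are illustrative):

```python
def external_divergence(reviews, escalations, pass_score=2.5):
    """Find outputs that passed internal review but drew customer pushback.

    `reviews` maps output ID -> internal rubric average (1-3);
    `escalations` is the set of output IDs customers complained about or
    escalated. Outputs that score well internally but still appear in the
    escalation set are the context-problem cases to inspect by hand.
    """
    return sorted(oid for oid, score in reviews.items()
                  if score >= pass_score and oid in escalations)

# Hypothetical week: t1 passed review but was escalated anyway.
reviews = {"t1": 2.8, "t2": 1.7, "t3": 3.0}
escalations = {"t1", "t2"}
divergent = external_divergence(reviews, escalations)  # -> ["t1"]
```

Even a handful of these divergent cases per month is enough to show what internal review is missing.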
The most important thing
Whoever reviews outputs needs to be the person who knows what "good" actually looks like. Not the most junior person on the team, not the person with the most free time. The person with the most judgment about quality.
They don't need to review everything. They need to review a consistent sample, track trends, and have the authority to flag when something needs to change.
Further reading
- Evaluation best practices — Anthropic Docs