Running your first AI pilot: a 30-day plan
Most AI pilots either drag on for six months without a decision, or get declared a success after two weeks based on nothing. Here's a structure that produces a real answer in 30 days.
The goal of an AI pilot isn't to explore what AI can do. It's to answer a specific question: does this work well enough to be worth more investment?
That's a narrower goal, and it requires a different structure.
Before day 1: define the question
The most common reason AI pilots drag on is that nobody agreed on what "working" means before they started.
Before you begin, write down:
The specific task. Not "using AI to improve our operations" — that's a direction, not a task. "Using Claude to draft first responses to support tickets in categories A, B, and C" is a task.
The current baseline. How long does this task currently take? What's the error rate or quality level? What does it cost? You need a number to beat.
What "good enough to proceed" looks like. Set the bar before you see results. "If Claude can handle 40% of these queries with a satisfaction rate above 3.5 out of 5, we proceed." Pre-committing to a threshold prevents motivated reasoning after the fact.
Who decides. One person owns the call at the end of 30 days. A committee with no decision-maker produces a pilot that never ends.
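The pre-commitment above can be written down as a tiny spec, so the day-30 comparison is mechanical rather than negotiable. A minimal sketch; the field names, the decider, and the example numbers (borrowed from the ticket-drafting example above) are illustrative, not a prescribed format.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PilotSpec:
    task: str                         # a specific task, not a direction
    baseline_minutes_per_item: float  # current human cost: the number to beat
    min_handle_rate: float            # fraction of queries Claude must handle
    min_satisfaction: float           # satisfaction bar, out of 5
    decider: str                      # the one person who owns the call

def meets_threshold(spec: PilotSpec, handle_rate: float, satisfaction: float) -> bool:
    """True only if every pre-committed bar is cleared."""
    return (handle_rate >= spec.min_handle_rate
            and satisfaction >= spec.min_satisfaction)

spec = PilotSpec(
    task="Draft first responses to support tickets in categories A, B, C",
    baseline_minutes_per_item=12.0,   # illustrative
    min_handle_rate=0.40,
    min_satisfaction=3.5,
    decider="Head of Support",        # illustrative
)
print(meets_threshold(spec, handle_rate=0.43, satisfaction=3.7))  # True
```

Writing the spec before day 1 is the point: once results arrive, the only question is whether the numbers clear the bars.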
Week 1: build the smallest thing that works
Pick the narrowest possible version of your task. If you're testing support automation, pick one question category. If you're testing content generation, pick one content type.
Set up Claude with a basic system prompt. Don't optimize it yet — you need to see what breaks before you know what to fix. Run 20–30 real inputs through it manually.
By the end of week 1, you should have a working prototype and a list of the five most common ways it fails.
Week 2: fix the obvious failures, start measuring
Take your failure list and address the top two or three. These are almost always system prompt issues: Claude doesn't know something it needs to know, a boundary isn't clearly defined, the tone is off.
Simultaneously, set up your measurement. What are you going to track? How will you know if outputs are good? Build the simplest possible eval set — 30 to 50 examples with clear criteria for what a passing output looks like.
Run your updated system through the eval set. You now have a baseline for the pilot itself, distinct from the human baseline you measured before day 1.
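A minimal eval harness needs nothing fancy: a list of examples, a way to generate an output, and a pass/fail check per example. The sketch below stubs the generation step (in a real pilot it would wrap a call to the model), and the pass criterion shown, a required-phrase check, is only a placeholder for whatever your actual criteria are.

```python
def run_eval(examples, generate, passes):
    """Run each input through `generate` and score it with `passes`.

    Returns the pass rate and the failing (example, output) pairs
    so failures can be reviewed by hand.
    """
    failures = []
    for ex in examples:
        output = generate(ex["input"])
        if not passes(output, ex):
            failures.append((ex, output))
    pass_rate = 1 - len(failures) / len(examples)
    return pass_rate, failures

# Stub generator; a real one would call the model.
def generate(text):
    return f"Thanks for reaching out. Regarding '{text}': ..."

# Placeholder criterion: the output must mention a required phrase.
def passes(output, ex):
    return ex["must_include"] in output

examples = [
    {"input": "refund status", "must_include": "refund"},
    {"input": "reset password", "must_include": "password"},
    {"input": "cancel plan", "must_include": "downgrade"},  # stub will fail this
]
rate, failures = run_eval(examples, generate, passes)
print(rate, len(failures))  # ~0.67, 1
```

Thirty to fifty entries in `examples`, each with explicit pass criteria, is enough to notice whether a prompt change helped or hurt.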
Week 3: real users, supervised
Put the pilot in front of real users — but keep it supervised. For support automation, this might mean Claude drafts responses that a human reviews before sending. For content generation, Claude produces a draft that someone edits. For internal tools, a small volunteer group uses it daily.
The supervised layer does two things: it catches errors before they reach people who aren't expecting them, and it generates real data about where Claude succeeds and fails under actual use conditions.
By the end of week 3, you should have 50–100 real interactions to look at. Review them. Update your failure taxonomy. Fix the next tier of problems.
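Reviewing 50–100 interactions goes faster with a running tally: tag each one with a failure category (or "ok") and sort by frequency, so the next tier of fixes picks itself. The category names below are invented for illustration.

```python
from collections import Counter

# Tags from a hypothetical review pass; "ok" means no failure.
reviews = ["ok", "missing_policy_info", "ok", "wrong_tone",
           "missing_policy_info", "ok", "hallucinated_detail",
           "wrong_tone", "ok", "missing_policy_info"]

taxonomy = Counter(tag for tag in reviews if tag != "ok")
for category, count in taxonomy.most_common():
    print(category, count)
# missing_policy_info 3
# wrong_tone 2
# hallucinated_detail 1
```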
Week 4: unsupervised, with monitoring
Remove or reduce the supervised layer for low-stakes interactions. Run the pilot at closer to real scale. Spot-check 10–15% of outputs daily.
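The daily spot-check can be a one-liner; what matters is that the sample is random, not whichever outputs happen to be on top. A sketch, using a fraction inside the 10–15% range as the default:

```python
import random

def spot_check_sample(output_ids, fraction=0.12, seed=None):
    """Pick a random slice of today's outputs for human review.

    `fraction` defaults to 0.12, inside the 10-15% range; `seed`
    exists only to make the sample reproducible when testing.
    """
    rng = random.Random(seed)
    k = max(1, round(len(output_ids) * fraction))
    return rng.sample(output_ids, k)

todays_outputs = [f"ticket-{i}" for i in range(200)]
print(len(spot_check_sample(todays_outputs)))  # 24
```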
Track your metrics. Are you hitting the threshold you set before the pilot started? Where are you above it, where below it?
This is also the week to collect structured feedback from the people using it — your team if it's internal, users if it's customer-facing. Not "do you like it" but specific: what worked, what was frustrating, what surprised you.
Day 30: the decision
You have four weeks of data. You have eval results. You have user feedback. You have your pre-committed threshold.
The decision should be one of three:
Proceed. You hit your threshold. The task works at acceptable quality. Next step: expand scope, increase automation, or move to the next task.
Iterate. You're close to the threshold and you can see specifically what's keeping you from it. Fix those things and run another two weeks. This is only valid if you can name the specific changes and why you expect them to work.
Stop. The task doesn't work well enough and there's no clear path to fixing it. This isn't failure — it's information. "AI doesn't work well for this particular task with our particular data" is a real finding. The organizations that learn this fast are better positioned than those that spend six months hoping.
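The three outcomes can be encoded directly, which makes the rule that "iterate" is only valid with named fixes explicit. The 90% "close" margin below is an assumption; the text deliberately leaves "close to the threshold" to judgment.

```python
def pilot_decision(metric, threshold, named_fixes, close_margin=0.9):
    """Proceed / iterate / stop, per the three outcomes above.

    `named_fixes` is True only if you can name the specific changes
    and why you expect them to work. `close_margin` is an assumed
    cutoff for how near a miss still counts as "close".
    """
    if metric >= threshold:
        return "proceed"
    if metric >= threshold * close_margin and named_fixes:
        return "iterate"
    return "stop"

print(pilot_decision(0.43, 0.40, named_fixes=False))  # proceed
print(pilot_decision(0.37, 0.40, named_fixes=True))   # iterate
print(pilot_decision(0.37, 0.40, named_fixes=False))  # stop
```

Note the asymmetry: a near miss without named fixes is a stop, not an iterate. Hope is not a fix.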
The thing most pilots get wrong
They measure output quality but not the cost of the process around it.
Even if Claude handles a task at 80% of human quality, that might not be worth it if it requires significant overhead to review, correct, and manage. The full cost of an AI implementation includes the time to supervise it, fix its mistakes, update its configuration, and handle the cases it can't handle.
Build that into your evaluation. The question isn't just "does Claude do this well?" It's "does Claude doing this create net value for our team?"
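"Net value" is simple arithmetic once the overhead terms are written down. The figures below are invented to show the shape of the calculation, not benchmarks.

```python
def net_minutes_saved(items, baseline_min, review_min,
                      correction_rate, correction_min):
    """Time saved per period after subtracting supervision overhead.

    Gross saving assumes Claude drafts every item; overhead is the
    review pass on every item plus fix-up time on the fraction that
    needs correction. All per-item figures are in minutes.
    """
    gross = items * baseline_min
    overhead = items * (review_min + correction_rate * correction_min)
    return gross - overhead

# Invented figures: 500 tickets/week, 12 min each by hand,
# 2 min to review a draft, 20% need a 6-minute fix.
print(net_minutes_saved(500, 12.0, 2.0, 0.20, 6.0))  # 4400.0
```

If the result is near zero or negative at realistic overhead numbers, the pilot fails the net-value test even when output quality clears its bar.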
That answer requires 30 days of honest data — which is exactly what this structure gives you.