Your first 90 days as an Agent Operator — what to build, in what order
In brief
Most people handed AI responsibility try to do everything at once and ship nothing reliable, or wait for a perfect plan and never start. The 90-day path is simpler: one team, one workflow, one agent that actually works. Then you expand.
Someone handed you AI responsibility. Maybe it was a direct conversation with your CEO. Maybe it accumulated — you answered a few questions, built one thing that worked, and suddenly you're the AI person. Either way, you're now accountable for making AI agents work inside your company, and you're probably doing it without a team, without a budget that covers consultants, and without a playbook.
The two failure modes I see most often:
Chaos: You try to automate five workflows at once, deploy agents to three teams simultaneously, and spend the first 90 days in a perpetual state of "nearly working." Nothing gets to reliable. Users lose confidence. You exhaust yourself.
Paralysis: You spend the first 90 days designing the architecture, waiting for buy-in from more stakeholders, and deferring the actual building until you have a more complete plan. Nothing ships. The CEO's enthusiasm fades. The window closes.
The path between these is narrower than it sounds: one team, one workflow, one agent. Get it working reliably. Measure the results. Then expand.
Here's what that looks like, concretely.
Days 1–30: Diagnose before you build
The first month is not for building. It's for finding the right thing to build.
Start with an honest audit of what already exists.
Shadow AI use is real and it's happening in your company right now. People are using ChatGPT for their work emails. They're pasting customer complaints into Claude and asking for summaries. They're using Gemini to draft proposals. Usually without telling IT. Usually without any standard approach. This isn't a problem to stop — it's signal about where AI is already providing value.
Before you build anything formal, find out what informal AI use is already happening. Ask directly: "What are you using AI for right now?" You'll be surprised. And some of it will point you directly at your best first project.
Interview 3–5 people from different teams about their most painful workflows.
You're looking for tasks that are:
- High volume (they happen frequently — daily or weekly, not quarterly)
- Repetitive in structure (the task follows a predictable pattern, even if the inputs vary)
- Currently consuming significant time (hours per week, not minutes)
- Low-stakes-per-instance (a single error is annoying but not catastrophic)
- Data-accessible (the information the agent needs actually exists and is findable)
Good examples from non-tech companies: supplier invoice triage, first-line customer inquiry responses, contract clause review, safety incident report categorization, shift scheduling requests, HR policy lookups.
Bad first projects: anything that requires making final decisions in high-stakes situations, anything that touches compensation or employment, anything where the input data is inaccessible or highly unstructured, anything where "good" is impossible to define.
Map the top 3 candidates and pick one.
For each candidate workflow, write down four things:
- What does the agent need to know? (What context, documents, or data?)
- What should it produce? (What does the output look like?)
- How would you know if it's wrong? (What does a bad output look like?)
- What happens if it fails? (Who is affected and how badly?)
This exercise usually makes the right choice obvious. The workflow where you can answer all four questions clearly and specifically is the one to build first.
Days 31–60: Build and validate before you roll out
You have a workflow. Now build the agent — but don't roll it out yet.
Set up the agent with a proper system prompt and context.
A system prompt is the standing instruction set that tells Claude who it is, what it's supposed to do, and how it should behave. This is not the same as the user's question — it runs behind the scenes on every interaction. A good system prompt for a first agent is usually 300–600 words. It covers:
- What the agent is and what it's for (one sentence)
- The context it has access to (which documents, which data)
- What it should output and in what format
- What it should do when it doesn't know something ("I don't have enough information to answer that — please contact [name]")
- What it should explicitly not do (critical for high-risk topics)
Write this out before you start configuring. It takes an hour and saves weeks of debugging later.
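If it helps to see the shape, here is a minimal sketch of such a prompt for a hypothetical HR policy agent, kept as a Python string so it can sit in version control next to your test set. Every name, document, and contact in it is a placeholder; substitute your own workflow's details.

```python
# Skeleton of a first-agent system prompt, stored as a plain Python string so it
# can be versioned alongside the test set. Every concrete detail below (company,
# documents, contact name) is a placeholder for a hypothetical HR policy agent.

SYSTEM_PROMPT = """\
You are the internal HR policy assistant for Acme Logistics.

Context: you answer questions using only the documents attached to this project:
the employee handbook, the leave policy, and the travel policy.

Output: answer in 2-4 sentences and always cite the policy section and clause
number you relied on.

If the documents do not contain the answer, reply exactly:
"I don't have enough information to answer that - please contact Jane Doe in HR."

Never give advice on compensation, disciplinary action, or employment decisions,
and never guess at a policy that is not in the attached documents.
"""

if __name__ == "__main__":
    # Rough check against the 300-600 word guideline above.
    print(f"{len(SYSTEM_PROMPT.split())} words")
```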
Build a test set of 10 real cases before anything goes live.
A test set is a list of real inputs — questions or requests the agent will actually receive — along with what a good response looks like. You run this test set to check whether the agent is working correctly.
For your first agent, 10 cases is enough. Include:
- 5 typical cases (the bread-and-butter inputs)
- 3 edge cases (unusual inputs, incomplete information, ambiguous requests)
- 2 cases where you know what the agent should say it can't help with
For each case, write a one-sentence description of what "good" looks like. Not "accurate" — something specific. "Correctly identifies the policy section and gives the relevant clause number" or "Escalates to HR rather than attempting to answer" or "Gives the correct handling time for a fragile shipment category."
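One way to capture this is a small data structure like the sketch below, written for the same hypothetical HR policy agent (the inputs and pass criteria are placeholders). A spreadsheet with the same three columns works just as well; what matters is that the pass criterion is written down before you see the agent's answers.

```python
# A test set sketched as data: the input, what "good" looks like, and a category
# tag so you can see at a glance whether typical cases, edge cases, or refusals
# are the ones failing. All cases are placeholders for a hypothetical HR agent.
TEST_CASES = [
    {"id": 1, "category": "typical",
     "input": "How many days of parental leave do I get?",
     "good_looks_like": "Cites the leave policy section and gives the correct number of days."},
    {"id": 6, "category": "edge",
     "input": "I'm part-time and switching teams next month, what leave applies?",
     "good_looks_like": "Asks for the missing details or flags that both policies may apply."},
    {"id": 9, "category": "should_refuse",
     "input": "Can you tell me whether my raise request will be approved?",
     "good_looks_like": "Declines and points to HR rather than attempting an answer."},
    # ...fill in the rest: 5 typical, 3 edge, 2 should-refuse, 10 cases total.
]

if __name__ == "__main__":
    from collections import Counter
    print(Counter(case["category"] for case in TEST_CASES))
```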
Run the agent with 2–3 people maximum before anyone else sees it.
Find the person most likely to be your internal champion — someone enthusiastic about the use case, willing to give detailed feedback, and patient with a first version. Give it to them and one or two colleagues. Ask them to use it for one week for real work.
Collect specific examples of where it worked, where it didn't, and what the failure looked like. This is your debugging data. Most first agents need 3–5 prompt revisions before they're ready for broader rollout. Do those revisions now, not after you've rolled out to 50 people.
Days 61–90: Stabilize before you expand
Your pilot worked. The 2–3 people are using the agent and getting value from it. Now the work is twofold: getting the agent to a state you can maintain, and explaining the results to your CEO.
Get to a point where you can answer "is this agent working?" at any given moment.
This means: you have your test set, you're running it weekly (or after any change), and you know your pass/fail rate. You have a rough sense of volume (how many queries per day). You know who to contact if users report problems.
This is the operational baseline. Without it, you don't have a deployed agent — you have a demo that may or may not be working at any given moment.
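If your agent is reachable through the API, the weekly run can be as small as the sketch below. It assumes the official `anthropic` Python SDK, reuses the SYSTEM_PROMPT and TEST_CASES sketches from earlier (saved here in a hypothetical my_agent.py), and uses a placeholder model name. If your agent lives only in a Claude Project, run the same loop by hand and log pass/fail in a spreadsheet instead.

```python
# Weekly test-set run, sketched against the official `anthropic` Python SDK
# (pip install anthropic; reads ANTHROPIC_API_KEY from the environment).
import csv
import datetime

import anthropic

# Hypothetical module holding the SYSTEM_PROMPT and TEST_CASES sketches above.
from my_agent import SYSTEM_PROMPT, TEST_CASES

MODEL = "claude-sonnet-4-5"  # placeholder: substitute whichever model your agent uses

client = anthropic.Anthropic()
today = datetime.date.today().isoformat()

with open(f"test_run_{today}.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["date", "case_id", "input", "good_looks_like", "output", "pass"])
    for case in TEST_CASES:
        response = client.messages.create(
            model=MODEL,
            max_tokens=1024,
            system=SYSTEM_PROMPT,
            messages=[{"role": "user", "content": case["input"]}],
        )
        output = response.content[0].text
        # The "pass" column stays empty: a person marks it against the written
        # criterion, and the weekly pass rate is the share of rows marked yes.
        writer.writerow([today, case["id"], case["input"],
                         case["good_looks_like"], output, ""])

print(f"Wrote test_run_{today}.csv - mark the pass column and note the rate.")
```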
Connect one live data source to keep context current.
If the agent relies on a document or database that changes — a pricing sheet, a policy document, a product catalog — set up a way for that to stay current without manual intervention. For most Agent Operators, this is simpler than it sounds: it might be as simple as pointing the agent's Claude Project at a shared Google Drive folder that your team already maintains. When someone updates the document in Drive, the agent's context updates automatically.
If that's not possible yet, at minimum: create a calendar reminder to manually update the context document on whatever schedule the underlying data changes. Monthly for most policies, weekly for operational data, daily if you have something real-time. Stale context is the most common cause of mysterious agent failures that start weeks after the agent was working fine.
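If you would rather be nagged by a script than a calendar, a staleness check can be this small. It is a sketch with a placeholder file path and refresh window, and it assumes the context document is a file reachable from wherever the script runs on a schedule.

```python
# Minimal staleness check for a context document, standard library only.
# The path and the 30-day window are placeholders; adjust to how often the
# underlying data actually changes.
import datetime
import pathlib

CONTEXT_FILE = pathlib.Path("context/pricing_sheet.md")  # placeholder path
MAX_AGE_DAYS = 30  # monthly refresh for policy-type documents

modified = datetime.datetime.fromtimestamp(CONTEXT_FILE.stat().st_mtime)
age_days = (datetime.datetime.now() - modified).days

if age_days > MAX_AGE_DAYS:
    print(f"WARNING: {CONTEXT_FILE} last updated {age_days} days ago; refresh the agent's context.")
else:
    print(f"{CONTEXT_FILE} is {age_days} days old, within the {MAX_AGE_DAYS}-day window.")
```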
Report to your CEO with real numbers.
Before day 90, write a one-page summary of what the agent is doing, what it's handling, and what it's saving. You need at least one concrete number. For what not to include yet, see "What to defer" below.
Then pick the second workflow and start the cycle again.
What to defer (even though it sounds important)
Enterprise MCP servers. An MCP server is a custom integration that gives Claude live access to internal databases and APIs. It's the right long-term architecture. It's also an engineering project that requires real technical resources. You don't need it yet. Native connectors and Zapier solve 80% of what you need in year one.
Custom dashboards and monitoring systems. A spreadsheet with your test results and a weekly 30-minute review session is monitoring. Build the elaborate dashboard after you have enough agents running that the spreadsheet becomes unmanageable.
Company-wide training. Don't train the whole company before you have 3 agents that work reliably. Training 380 people to use an agent that breaks will destroy trust and set back adoption by 6 months. Train the pilot team first. Let word of mouth spread.
Trying to show big numbers fast. The Agent Operator who shows 3 working agents with documented results in 90 days is more credible than the one who demos 12 that break. Your CEO wants to know if this is working and whether to invest more. Three solid wins answer that question. Twelve questionable ones don't.
The single most important thing to get right
Before anything else, write down exactly what each agent is supposed to do — its scope, its context, its output format, and what it should not do. This document is your system prompt, your test set, and your incident response guide. When the agent breaks at 2pm on a Tuesday, this document is how you diagnose it in 20 minutes instead of 4 hours.
Most Agent Operators skip this because they're moving fast. Don't. It takes 2 hours per agent to do it properly and it saves 20 hours of debugging later.
Try this today
Write the job description for your most important current agent — as if it were a new hire. What would you tell them in their first week? What would they need to know? What would good work look like, and what would bad work look like?
That document is your system prompt. If you can't write it, you're not ready to deploy the agent.