AI Codex

Keeping your agent costs under control as you scale

In brief

Your Claude bill went from $200 to $2,000 and you can't explain why. The four cost drivers — bloated system prompts, unnecessary context loading, high failure rates, and no usage monitoring — each have fixes. Cost per task is the metric that matters, not total spend.


You had one agent running well. Then two more. Then five. The monthly Claude usage bill, which was $200 when you started, is now $2,000. And when your CFO asks why it went up 10x, you don't have a clean answer.

This is a normal place to find yourself. The cost model for AI agents is genuinely different from most software tools, and most Agent Operators don't develop a working mental model of it until the bill arrives.

Here's what you need to understand — and what to do about it.


How agent costs actually work

You pay per token. A token is roughly a word (or a few characters) of text. You pay for tokens going in (the system prompt, the context documents, the user's message) and tokens coming out (the agent's response).

The critical thing to understand: the system prompt runs on every single interaction.

If your system prompt is 2,000 words and your agent handles 500 queries a day, that's roughly 1 million tokens per day just from the system prompt — before any user messages or responses are counted. At current Claude pricing, that's meaningful money. And it compounds: if you add a second agent with a 2,000-word system prompt, you've doubled that component of your cost.

This is why costs go from linear to non-linear as you scale. It's not just that you have more agents — it's that the system prompt cost multiplies with every interaction on every agent.
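The arithmetic above is easy to sketch. Here's a minimal estimate, assuming roughly 1.3 tokens per English word and an illustrative input price (a placeholder, not actual Claude pricing — check the current rate card):

```python
# Back-of-the-envelope system prompt cost estimate.
# Assumptions (placeholders, not actual Claude pricing):
TOKENS_PER_WORD = 1.3          # rough average for English text
INPUT_PRICE_PER_MTOK = 3.00    # illustrative $/million input tokens

def monthly_prompt_cost(prompt_words, queries_per_day, days=30):
    """Cost of re-sending the system prompt on every interaction."""
    tokens_per_day = prompt_words * TOKENS_PER_WORD * queries_per_day
    return tokens_per_day * days * INPUT_PRICE_PER_MTOK / 1_000_000

# A 2,000-word prompt at 500 queries/day is ~1.3M tokens/day,
# versus the same traffic with a 400-word prompt:
bloated = monthly_prompt_cost(2_000, 500)
trimmed = monthly_prompt_cost(400, 500)
```

The exact dollar figures depend on your model and pricing tier, but the ratio doesn't: the bloated prompt costs 5x the trimmed one on every single call, and adding a second agent doubles it again.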


The four cost drivers

Driver 1: Bloated system prompts

System prompts have a tendency to grow. You add an edge case instruction. You add an example to make the format clearer. You add a warning about a sensitive topic. Each addition made sense at the time. Six months later, what started as a 400-word system prompt is 2,000 words.

A 2,000-word system prompt costs 5x a 400-word one for every interaction. If your agent runs 300 times a day, that's 5x the system-prompt portion of your input cost on every one of those runs — which might be a $200/month difference, might be $800/month, depending on volume and model.

The fix: Quarterly system prompt audit. Go through each prompt and ask:

  • Is every instruction here actually doing something? (Remove defensive clauses that are never triggered)
  • Can any of these examples be moved to a reference document that only gets loaded when needed?
  • Is this instruction still relevant, or is it from an edge case we solved three months ago?

A disciplined edit that brings a 2,000-word prompt to 800 words won't hurt quality — it often improves it. Shorter, clearer prompts usually outperform long ones.

Driver 2: Unnecessary context on every call

There's a temptation to preload everything the agent might ever need into the context. Your entire company handbook. The full product catalog. Three years of policy documents. The thinking is: better to have too much than too little.

The problem: you're paying for all of it on every call, even when 90% of it is irrelevant to the specific query.

The fix: Think about what the agent actually needs for a typical query versus what it might need for an unusual one. If 80% of queries only need 5 documents, don't load 50. For the uncommon cases that need more, consider using Claude's retrieval capabilities (native connectors, as discussed in the integration article) to fetch the specific relevant content per query rather than pre-loading everything.
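One way to sketch the idea — using a naive keyword-overlap ranker as a stand-in for real retrieval (Claude's native connectors or an embedding search would do this properly):

```python
def select_context(query, docs, max_docs=5):
    """Pick the handful of documents most relevant to this query
    instead of preloading the whole corpus into every call."""
    query_terms = set(query.lower().split())

    def overlap(text):
        return len(query_terms & set(text.lower().split()))

    ranked = sorted(docs.items(), key=lambda item: overlap(item[1]), reverse=True)
    return [name for name, _ in ranked[:max_docs]]

# Toy corpus: only the relevant document gets loaded per query.
docs = {
    "refund-policy": "how to process a customer refund request",
    "shipping-rates": "international shipping rates and carriers",
    "employee-handbook": "vacation policy and office guidelines",
}
relevant = select_context("customer wants a refund", docs, max_docs=1)
```

The ranking method matters less than the shape of the design: the agent's context is assembled per query, so you pay for 5 documents instead of 50 on the 80% of queries that don't need more.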

Driver 3: High failure rate causing retries

When an agent fails — produces a wrong answer, an unhelpful response, or a confusing output — users retry. Sometimes they retry manually. Sometimes they retry automatically. Either way, you're paying for the failed interaction and the retry.

A 20% failure rate means you're paying for roughly 1.2 interactions for every task actually completed. At scale, that's a substantial cost premium.

The fix: This is the cost argument for investing in evaluation. Fixing reliability doesn't just improve quality — it directly reduces cost. An agent that fails 5% of the time instead of 20% is processing 15% fewer total tokens for the same number of useful outputs.
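To make the retry math explicit — under the assumption that failed interactions are retried independently until one succeeds:

```python
def expected_attempts(failure_rate):
    """Expected paid interactions per completed task,
    assuming independent retries until success."""
    return 1 / (1 - failure_rate)

# At a 20% failure rate you pay for ~1.25 interactions per task;
# at 5% that drops to ~1.05 — roughly 15% fewer tokens for the
# same number of useful outputs.
savings = 1 - expected_attempts(0.05) / expected_attempts(0.20)
```

Real retry behavior is messier (users give up, or retry with longer prompts), but the direction holds: every point of failure rate you remove is a point of token spend you stop paying twice for.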

Driver 4: No visibility on usage

If you don't know which agent is generating the most cost or why, you can't fix it. Most Agent Operators see the total bill and can't attribute it — they're flying blind.

The fix: Claude for Work admin console → Usage. Review weekly, not monthly. Look at:

  • Which agents have the highest token volume?
  • Are any agents spiking unexpectedly?
  • Is the cost per task stable or trending upward?

A usage spike in a specific agent usually points at one of the other three drivers — a prompt that grew, a context expansion, or a rash of failures. The monitoring tells you where to look.


The metric that matters: cost per task

Total spend is the wrong number to manage. It will always go up as you deploy more agents and handle more volume. The question isn't "are we spending more?" — it's "are we spending less per task as we scale?"

Cost per task = total monthly Claude spend ÷ total tasks completed across all agents

If this number is going down over time, you're getting more efficient as you scale. Good.

If it's going up despite growing volume, something is wrong — usually one of the four drivers above. Find it.
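As a sketch, the metric is one division tracked over time (the monthly numbers here are illustrative):

```python
def cost_per_task(monthly_spend, tasks_completed):
    """Total Claude spend divided by tasks completed across all agents."""
    return monthly_spend / tasks_completed

# Illustrative months: total spend grows 10x, but so does volume.
history = [
    ("Jan", 200, 150),
    ("Feb", 800, 700),
    ("Mar", 2000, 1800),
]
trend = [(month, round(cost_per_task(spend, tasks), 3))
         for month, spend, tasks in history]
# Cost per task falling month over month means you're scaling efficiently,
# even though total spend went up 10x.
```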


The conversation with your CFO

Most companies bucket their Claude usage in the IT software budget. Aaron Levie's observation is worth taking seriously here: as agents take over real operational tasks, the cost of running them should be compared to the operational cost they're replacing, not to other software licenses.

The frame that works with finance leadership:

"Our Claude spend is $2,000/month. The agents are handling 1,800 supplier invoice categorizations per month — tasks that were taking Sarah's team approximately 4 hours a day collectively. At fully-loaded labor cost, that's roughly $15,000/month of work. The cost-to-value ratio is solid. We're also handling [X more tasks]. Here's how it's trending."

This frames the spend as OPEX tied to a function, not as an IT line item that needs to be justified against other software costs. Finance understands "this costs $2K and saves $15K." They are much less equipped to evaluate "$2K/month AI vs $3K/month CRM add-on."
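The same comparison as arithmetic, using the figures from the example above (the labor value is the article's estimate, not a benchmark):

```python
monthly_spend = 2_000      # Claude spend from the example
tasks_per_month = 1_800    # invoice categorizations handled
labor_value = 15_000       # estimated fully-loaded cost of the manual work

agent_cost_per_task = monthly_spend / tasks_per_month    # ~$1.11 per invoice
manual_cost_per_task = labor_value / tasks_per_month     # ~$8.33 per invoice
cost_to_value = labor_value / monthly_spend              # 7.5x
```

Per-task numbers are the ones that survive scrutiny: "each invoice costs us about a dollar instead of about eight" is a sentence a CFO can repeat.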


The mistake that makes costs worse, not better

Cutting quality to cut cost.

When an Agent Operator sees a high bill, the first instinct is often to shorten system prompts aggressively, or to reduce the amount of context loaded. Sometimes this works. More often, shorter prompts that produce worse outputs create a different cost problem: higher failure rates, more retries, more manual work downstream, and eventually loss of user confidence that requires expensive rebuilding.

The right place to cut is the fat in your prompts — defensive clauses that aren't triggered, outdated examples, redundant instructions. Not the substance.


Try this today

Open your Claude for Work admin console and look at your usage data. Find your highest-volume agent. Calculate its rough cost per task: divide last month's usage cost attributed to that agent by the number of interactions.

Is that number what you'd expect? Is it trending up, down, or flat?

If you can't attribute costs by agent, that's the first thing to fix — you can't manage what you can't measure.
