Cutting Claude API costs without cutting quality
In brief
Token budgets, model routing, caching, batching, and the decisions that have the biggest impact on your monthly bill.
Claude API costs scale with tokens. Every word in, every word out, every system prompt, every document you append — it all adds up. For most applications, a small number of decisions account for the majority of the bill. Here is how to find and fix them.
Start with measurement
You cannot optimize what you cannot see. Before changing anything, instrument your application:
```python
import anthropic
from dataclasses import dataclass, field
from collections import defaultdict

@dataclass
class UsageTracker:
    input_tokens: int = 0
    output_tokens: int = 0
    cache_read_tokens: int = 0
    cache_creation_tokens: int = 0
    request_count: int = 0
    costs_by_route: dict = field(default_factory=lambda: defaultdict(float))

    def record(self, usage, route: str, model: str):
        self.input_tokens += usage.input_tokens
        self.output_tokens += usage.output_tokens
        self.cache_read_tokens += getattr(usage, "cache_read_input_tokens", 0)
        self.cache_creation_tokens += getattr(usage, "cache_creation_input_tokens", 0)
        self.request_count += 1
        # Calculate cost based on model pricing
        cost = self._calculate_cost(usage, model)
        self.costs_by_route[route] += cost

    def _calculate_cost(self, usage, model: str) -> float:
        # Rough pricing for Sonnet 4.6 (check current pricing at anthropic.com)
        input_cost = usage.input_tokens * 3 / 1_000_000
        output_cost = usage.output_tokens * 15 / 1_000_000
        return input_cost + output_cost

tracker = UsageTracker()
```
Run this for a week. You will usually find that 20% of your routes generate 80% of the cost.
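Once a week of traffic is recorded, ranking routes by spend makes that split visible. A quick sketch (the route names and dollar figures below are hypothetical):

```python
# Hypothetical weekly totals, as accumulated in tracker.costs_by_route
costs_by_route = {"/chat": 412.50, "/summarize": 95.10, "/classify": 38.20, "/health": 0.40}

# Rank routes by spend to see which few dominate the bill
ranked = sorted(costs_by_route.items(), key=lambda kv: kv[1], reverse=True)
for route, cost in ranked:
    print(f"{route:12s} ${cost:>8.2f}")
```

The top one or two routes are where the techniques below pay off first.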
The highest-impact changes
1. Prompt caching for repeated context
If your system prompt is over 1,024 tokens and you send thousands of requests per day, prompt caching is your first fix. One parameter change, 80-90% cost reduction on the cached portion. See the implementation guide for details.
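A minimal sketch of that one-parameter change: the system prompt goes in as a content block carrying a `cache_control` marker, and everything up to that marker is cached. The prompt text here is a placeholder for your own.

```python
# Placeholder for your real 1,024+ token system prompt
LONG_SYSTEM_PROMPT = "You are a support assistant for Acme. " * 200

# The cache_control marker tells the API to cache the prefix up to this
# block; later requests with an identical prefix read it back at a
# steep discount instead of reprocessing it.
cached_system = [
    {
        "type": "text",
        "text": LONG_SYSTEM_PROMPT,
        "cache_control": {"type": "ephemeral"},
    }
]

# Reuse cached_system unchanged on every request so the prefix matches:
# response = client.messages.create(
#     model="claude-sonnet-4-6",
#     max_tokens=512,
#     system=cached_system,
#     messages=[{"role": "user", "content": user_message}],
# )
```

The key constraint is prefix stability: any change to the cached portion invalidates the cache, so keep dynamic content out of it.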
2. Right-sizing your model
Not every request needs the most capable model. Most applications have a mix of:
- Simple classification, extraction, formatting → Haiku
- Typical generation, Q&A, summarization → Sonnet
- Complex reasoning, ambiguous problems → Opus
Routing correctly can cut costs 5-10x on simple requests:
```python
def choose_model(task_type: str, complexity_score: float) -> str:
    if task_type in ("classify", "extract", "format") or complexity_score < 0.3:
        return "claude-haiku-4-5-20251001"
    elif complexity_score < 0.7:
        return "claude-sonnet-4-6"
    else:
        return "claude-opus-4-6"
```
The complexity score can be based on input length, number of constraints, presence of ambiguity signals — whatever correlates with difficulty in your specific domain. Start simple and calibrate with evals.
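As one possible starting point, a crude scorer might blend input length with a few difficulty keywords. The weights and signal words here are made up for illustration; calibrate both against your evals.

```python
def complexity_score(prompt: str) -> float:
    """Crude difficulty heuristic in [0, 1]. Weights are hypothetical."""
    # Longer inputs tend to be harder; saturate at ~8,000 characters
    length_part = min(len(prompt) / 8000, 1.0)
    # Keywords that loosely signal open-ended reasoning
    signals = ("why", "trade-off", "compare", "explain", "design")
    lowered = prompt.lower()
    signal_part = sum(1 for s in signals if s in lowered) / len(signals)
    return 0.6 * length_part + 0.4 * signal_part
```

Feeding this into choose_model above sends short, keyword-free requests to Haiku and reserves Opus for the genuinely open-ended ones.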
3. Output length control
Output tokens cost 5x more than input tokens on most Claude models. If your application is generating long responses and only using part of them, you are wasting money.
Strategies:
- Set `max_tokens` to the minimum you need, not the maximum you might ever need
- Use structured output (JSON mode), which tends to be more concise than prose
- Ask Claude to be concise explicitly in your system prompt — it responds to this
- If you need a short answer, say so: "in one sentence" or "in under 50 words"
```python
# Instead of
response = client.messages.create(max_tokens=4096, ...)

# Use the smallest limit that covers your actual outputs
response = client.messages.create(max_tokens=512, ...)
```
4. Batch API for async workloads
If your use case does not require real-time responses — nightly data processing, document analysis, bulk classification — the Batch API offers 50% cost savings with up to 24-hour turnaround.
```python
# Create a batch instead of individual requests
batch = client.beta.messages.batches.create(
    requests=[
        {
            "custom_id": f"doc_{i}",
            "params": {
                "model": "claude-sonnet-4-6",
                "max_tokens": 256,
                "messages": [{"role": "user", "content": doc}]
            }
        }
        for i, doc in enumerate(documents)
    ]
)
```
For workloads with thousands of documents, this is often the single largest cost lever available.
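Batches complete asynchronously, so you poll for completion and then collect results by `custom_id`. A sketch, assuming the same beta SDK surface as above (check the current SDK for the exact namespace and field names):

```python
import time

def wait_for_batch(client, batch_id: str, poll_seconds: float = 60.0) -> dict:
    """Poll until the batch ends, then map custom_id -> result.

    Assumes the beta Batches surface used above; attribute names may
    differ in the current SDK.
    """
    while True:
        batch = client.beta.messages.batches.retrieve(batch_id)
        if batch.processing_status == "ended":
            break
        time.sleep(poll_seconds)
    # Stream results back and index them by the custom_id set at creation
    return {
        item.custom_id: item.result
        for item in client.beta.messages.batches.results(batch_id)
    }
```

Individual requests inside a batch can still fail, so inspect each result's type before consuming it.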
5. Context window management
Long contexts cost proportionally more. If your application accumulates conversation history, cap it:
```python
def trim_history(messages: list[dict], max_tokens: int = 4000) -> list[dict]:
    """Keep the most recent messages within a token budget."""
    # Rough estimate: 4 chars ≈ 1 token
    total = 0
    result = []
    for msg in reversed(messages):
        content = msg.get("content", "")
        if isinstance(content, list):
            text = " ".join(b.get("text", "") for b in content if b.get("type") == "text")
        else:
            text = str(content)
        tokens_est = len(text) // 4
        if total + tokens_est > max_tokens:
            break
        result.insert(0, msg)
        total += tokens_est
    return result
```
Pair this with prompt caching to preserve as much history as possible while controlling cost.
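One way to do that pairing is to mark the last retained message as a cache breakpoint, so the next turn reuses the cached prefix. A sketch only: it assumes plain-string message contents and that your API version accepts `cache_control` on message content blocks, as it does on system prompts.

```python
def add_cache_breakpoint(messages: list[dict]) -> list[dict]:
    """Attach a cache_control marker to the final message's content.

    Assumption: contents are plain strings; adapt if your messages
    already use content-block lists.
    """
    if not messages:
        return messages
    out = [dict(m) for m in messages]  # shallow copies; originals untouched
    content = out[-1].get("content", "")
    if isinstance(content, str):
        out[-1]["content"] = [
            {"type": "text", "text": content,
             "cache_control": {"type": "ephemeral"}}
        ]
    return out
```

Applied to the output of trim_history, this keeps the retained history cheap across consecutive turns as long as the trimmed prefix stays stable.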
The decisions that rarely move the needle
- Compression and paraphrasing of prompts. You save maybe 10-15% of input tokens with significant engineering effort, and you often degrade quality enough to need more output tokens to compensate.
- Switching providers for cost alone. If quality matters and you have already calibrated for Claude, switching providers introduces re-calibration costs (new evals, new prompt engineering, new failure modes) that are rarely worth the per-token savings.
- Obsessing over output tokens before fixing system prompts. A 5,000-token system prompt that runs 10,000 times per day is 50M tokens. A 200-token output that runs 10,000 times is 2M. Fix the system prompt first.
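A quick sanity check of that arithmetic, priced at the rough Sonnet rates used earlier ($3/M input, $15/M output):

```python
requests_per_day = 10_000

# Daily token volume
system_prompt_tokens = 5_000 * requests_per_day   # 50,000,000
output_tokens = 200 * requests_per_day            # 2,000,000

# Daily cost at the rough rates above
system_cost = system_prompt_tokens * 3 / 1_000_000    # $150/day
output_cost = output_tokens * 15 / 1_000_000          # $30/day
```

Even with output tokens priced 5x higher, the system prompt dominates by volume.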
A practical audit
Run this against your last week of API usage:
- Which routes have the highest input token counts? Can they use caching?
- Which routes always use the same model? Could lighter tasks use Haiku?
- What is your average output token count? Does it match what you actually use?
- Are any of your workloads async-compatible for batch pricing?
Most applications find a 40-60% savings opportunity in the first audit without touching quality. The engineering investment is usually one afternoon.
Further reading
- Pricing — current pricing for all Claude models
- Batch processing — 50% cost reduction via the Batches API
- Token-saving updates — recent API changes that reduce token consumption