AI Codex · Developer Path · Step 8 of 20
Infrastructure & Deployment · How It Works

Cutting Claude API costs without cutting quality

In brief

Token budgets, model routing, caching, batching, and the decisions that have the biggest impact on your monthly bill.

7 min read


Claude API costs scale with tokens. Every word in, every word out, every system prompt, every document you append — it all adds up. For most applications, a small number of decisions account for the majority of the bill. Here is how to find and fix them.

Start with measurement

You cannot optimize what you cannot see. Before changing anything, instrument your application:

from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class UsageTracker:
    input_tokens: int = 0
    output_tokens: int = 0
    cache_read_tokens: int = 0
    cache_creation_tokens: int = 0
    request_count: int = 0
    costs_by_route: dict = field(default_factory=lambda: defaultdict(float))
    
    def record(self, usage, route: str, model: str):
        self.input_tokens += usage.input_tokens
        self.output_tokens += usage.output_tokens
        self.cache_read_tokens += getattr(usage, 'cache_read_input_tokens', 0)
        self.cache_creation_tokens += getattr(usage, 'cache_creation_input_tokens', 0)
        self.request_count += 1
        # Calculate cost based on model pricing
        cost = self._calculate_cost(usage, model)
        self.costs_by_route[route] += cost
    
    def _calculate_cost(self, usage, model: str) -> float:
        # Rough Sonnet 4.6 pricing (check current rates at anthropic.com):
        # $3/MTok input, $15/MTok output; cache reads bill at roughly 10%
        # of the input rate and cache writes at roughly 125%.
        input_cost = usage.input_tokens * 3 / 1_000_000
        output_cost = usage.output_tokens * 15 / 1_000_000
        cache_read_cost = getattr(usage, 'cache_read_input_tokens', 0) * 0.3 / 1_000_000
        cache_write_cost = getattr(usage, 'cache_creation_input_tokens', 0) * 3.75 / 1_000_000
        return input_cost + output_cost + cache_read_cost + cache_write_cost

tracker = UsageTracker()

Run this for a week. You will usually find that 20% of your routes generate 80% of the cost.
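Once a week of data is in, the 20/80 pattern usually falls straight out of `costs_by_route`. A sketch of the report step, with illustrative numbers:

```python
from collections import defaultdict

# Illustrative week of per-route spend, as accumulated by the tracker above.
costs_by_route = defaultdict(float, {
    "chat": 412.50, "summarize": 96.20, "classify": 31.10, "extract": 8.40,
})

total = sum(costs_by_route.values())
for route, cost in sorted(costs_by_route.items(), key=lambda kv: -kv[1]):
    print(f"{route:10s} ${cost:8.2f}  {cost / total:6.1%}")
```

Here one route out of four carries roughly three quarters of the spend; that route is where caching and model routing pay off first.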

The highest-impact changes

1. Prompt caching for repeated context

If your system prompt is over 1,024 tokens and you send thousands of requests per day, prompt caching is your first fix. One parameter change, 80-90% cost reduction on the cached portion. See the implementation guide for details.
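The parameter in question is `cache_control` on the system block. A sketch of the request shape — the prompt variable is a stand-in for your own:

```python
LONG_SYSTEM_PROMPT = "You are a support assistant. " * 200  # stands in for a >1,024-token prompt

# The one-parameter change: mark the stable prefix as cacheable. Requests that
# reuse the same prefix within the cache TTL bill it at the cache-read rate.
request = dict(
    model="claude-sonnet-4-6",
    max_tokens=512,
    system=[{
        "type": "text",
        "text": LONG_SYSTEM_PROMPT,
        "cache_control": {"type": "ephemeral"},
    }],
    messages=[{"role": "user", "content": "How do I reset my password?"}],
)
# response = client.messages.create(**request)  # needs an anthropic.Anthropic() client
```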

2. Right-sizing your model

Not every request needs the most capable model. Most applications have a mix of:

  • Simple classification, extraction, formatting → Haiku
  • Typical generation, Q&A, summarization → Sonnet
  • Complex reasoning, ambiguous problems → Opus

Routing correctly can cut costs 5-10x on simple requests:

def choose_model(task_type: str, complexity_score: float) -> str:
    if task_type in ("classify", "extract", "format") or complexity_score < 0.3:
        return "claude-haiku-4-5-20251001"
    elif complexity_score < 0.7:
        return "claude-sonnet-4-6"
    else:
        return "claude-opus-4-6"

The complexity score can be based on input length, number of constraints, presence of ambiguity signals — whatever correlates with difficulty in your specific domain. Start simple and calibrate with evals.
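A first pass at the score does not need to be clever. A sketch using the signals above — input length, constraint words, ambiguity markers — with weights that are pure assumptions to be replaced by calibration:

```python
def complexity_score(prompt: str) -> float:
    """Crude 0-1 difficulty estimate; the weights here are placeholders."""
    text = prompt.lower()
    score = min(len(prompt) / 8_000, 0.4)  # longer inputs score higher, capped
    constraints = sum(text.count(w) for w in ("must", "should", "except", "unless"))
    score += min(constraints * 0.05, 0.3)  # each constraint word adds a little
    if any(w in text for w in ("why", "compare", "trade-off", "design")):
        score += 0.3                       # reasoning/ambiguity signals
    return min(score, 1.0)
```

Fed into the routing function above, "Extract the invoice date." scores near zero and routes to Haiku, while "Compare these two designs and explain the trade-offs" crosses the 0.3 threshold and lands on Sonnet.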

3. Output length control

Output tokens cost 5x more than input tokens on most Claude models. If your application is generating long responses and only using part of them, you are wasting money.

Strategies:

  • Set max_tokens to the minimum you need, not the maximum you might ever need
  • Use structured output (e.g., request JSON), which tends to be more concise than prose
  • Ask Claude to be concise explicitly in your system prompt — it responds to this
  • If you need a short answer, say so: "in one sentence" or "in under 50 words"

# Instead of
response = client.messages.create(max_tokens=4096, ...)

# Use the smallest limit that covers your actual outputs
response = client.messages.create(max_tokens=512, ...)
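Put together — a tight `max_tokens`, an explicit brevity instruction, structured output — the request shape looks like this (prompt wording is illustrative):

```python
request = dict(
    model="claude-sonnet-4-6",
    max_tokens=256,  # sized to the longest output you actually use
    system='Respond with JSON only: {"summary": "..."}. Be concise; no preamble.',
    messages=[{"role": "user", "content": "Summarize this ticket in under 50 words: ..."}],
)
# response = client.messages.create(**request)  # needs an anthropic.Anthropic() client
```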

4. Batch API for async workloads

If your use case does not require real-time responses — nightly data processing, document analysis, bulk classification — the Batch API offers 50% cost savings with up to 24-hour turnaround.

# Create a batch instead of individual requests
batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"doc_{i}",
            "params": {
                "model": "claude-sonnet-4-6",
                "max_tokens": 256,
                "messages": [{"role": "user", "content": doc}]
            }
        }
        for i, doc in enumerate(documents)
    ]
)

For workloads with thousands of documents, this is often the single largest cost lever available.

5. Context window management

Long contexts cost proportionally more. If your application accumulates conversation history, cap it:

def trim_history(messages: list[dict], max_tokens: int = 4000) -> list[dict]:
    """Keep the most recent messages within a token budget."""
    # Rough estimate: 4 chars ≈ 1 token
    total = 0
    result = []
    for msg in reversed(messages):
        content = msg.get("content", "")
        if isinstance(content, list):
            text = " ".join(b.get("text", "") for b in content if b.get("type") == "text")
        else:
            text = str(content)
        tokens_est = len(text) // 4
        if total + tokens_est > max_tokens:
            break
        result.insert(0, msg)
        total += tokens_est
    return result

Pair this with prompt caching to preserve as much history as possible while controlling cost.

The decisions that rarely move the needle

  • Compression and paraphrasing of prompts. You save maybe 10-15% of input tokens with significant engineering effort, and you often degrade quality enough to need more output tokens to compensate.
  • Switching providers for cost alone. If quality matters and you have already calibrated for Claude, switching providers introduces re-calibration costs (new evals, new prompt engineering, new failure modes) that are rarely worth the per-token savings.
  • Obsessing over output tokens before fixing system prompts. A 5,000-token system prompt that runs 10,000 times per day is 50M tokens. A 200-token output that runs 10,000 times is 2M. Fix the system prompt first.
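The arithmetic in that last point is worth making concrete. At Sonnet-class list prices ($3/MTok in, $15/MTok out — check current rates), the input side dominates even though output tokens are priced 5x higher:

```python
requests_per_day = 10_000

system_prompt_tokens = 5_000   # sent on every request
output_tokens = 200            # generated per response

daily_input = system_prompt_tokens * requests_per_day   # 50M tokens/day
daily_output = output_tokens * requests_per_day         # 2M tokens/day

input_cost = daily_input * 3 / 1_000_000     # dollars/day
output_cost = daily_output * 15 / 1_000_000
print(f"system prompt: ${input_cost:.0f}/day, outputs: ${output_cost:.0f}/day")
```

The system prompt costs five times as much per day as all the outputs combined — which is why caching or trimming it comes first.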

A practical audit

Run this against your last week of API usage:

  1. Which routes have the highest input token counts? Can they use caching?
  2. Which routes always use the same model? Could lighter tasks use Haiku?
  3. What is your average output token count? Does it match what you actually use?
  4. Are any of your workloads async-compatible for batch pricing?
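Questions 1 and 3 fall straight out of the tracker's counters. A sketch with illustrative weekly totals:

```python
# Illustrative weekly totals from the UsageTracker above.
input_tokens, output_tokens, request_count = 52_000_000, 2_100_000, 70_000
cache_read_tokens = 0  # no cache reads at this volume -> question 1 answers itself

avg_input = input_tokens / request_count    # large average input -> caching candidate
avg_output = output_tokens / request_count  # compare against what you actually use
print(f"avg input/request:  {avg_input:,.0f} tokens")
print(f"avg output/request: {avg_output:,.0f} tokens")
```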

Most applications find a 40-60% savings opportunity in the first audit without touching quality. The engineering investment is usually one afternoon.
