Prompt caching with Claude: cut costs 80% on repeated context
In brief
Prompt caching can cut your Claude API costs by around 80% on the requests that matter most. Here's how it works, how to implement it, and the gotchas that keep teams from seeing the savings.
Prompt caching is the highest ROI optimization available to most Claude applications. If your requests share a large prefix — a system prompt, a long document, few-shot examples — you can mark that content for caching and pay 90% less on cache hits. The write cost is 25% higher than a normal token, but after the first request, reads are 90% cheaper and 85% faster.
The math is simple. A 10,000-token system prompt costs about $0.03 per request at full input price. A cache hit drops that to about $0.003; a miss that writes the cache costs about $0.0375. At a 90% hit rate, the blended cost is roughly $0.0065 per request. At 10,000 requests per day, that is $300 versus about $65. Monthly, $9,000 versus roughly $1,950: a saving of close to 80%.
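Working the numbers explicitly helps when plugging in your own prompt sizes and prices. A minimal sketch, assuming $3 per million input tokens as the base price and pricing a miss as a cache write (substitute your model's actual rates):

```python
def blended_cost_per_request(
    prompt_tokens: int,
    hit_rate: float,
    price_per_mtok: float = 3.00,    # assumed base input price, $ per million tokens
    write_multiplier: float = 1.25,  # cache writes cost 25% more than base
    read_multiplier: float = 0.10,   # cache reads cost 90% less than base
) -> float:
    """Average cost of the cached prefix per request at a given hit rate.

    A miss is priced as a cache write, since it re-creates the cache.
    """
    base = prompt_tokens / 1_000_000 * price_per_mtok
    hit_cost = base * read_multiplier
    miss_cost = base * write_multiplier
    return hit_rate * hit_cost + (1 - hit_rate) * miss_cost

uncached = blended_cost_per_request(10_000, hit_rate=0.0, write_multiplier=1.0)
cached = blended_cost_per_request(10_000, hit_rate=0.9)
print(f"${uncached:.4f} vs ${cached:.4f} per request")
```

Multiply the per-request figures by your daily volume to get the numbers above.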
What actually gets cached
Caching works on the prefix of your prompt — a contiguous block of content at the beginning that stays the same across requests. The key constraint: the cached block must be at least 1,024 tokens (2,048 for some models). Anything shorter does not qualify.
What qualifies:
- System prompts with detailed instructions, persona definitions, or tool schemas
- Long documents or knowledge bases appended before user messages
- Multi-turn conversation history that accumulates but does not change
- Few-shot examples at the start of a conversation
What does not qualify:
- Short system prompts (under 1,024 tokens)
- Content that changes every request
- Content after the user's variable input
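Before wiring up cache_control, it is worth checking that a candidate block clears the minimum. A rough heuristic sketch: the four-characters-per-token ratio is an approximation for English prose, so use the API's token counting endpoint when you need an exact figure.

```python
def rough_token_count(text: str) -> int:
    """Crude estimate: roughly 4 characters per token for English prose."""
    return len(text) // 4

def meets_cache_minimum(text: str, minimum: int = 1024) -> bool:
    """Is this block plausibly long enough to be cached at all?"""
    return rough_token_count(text) >= minimum

print(meets_cache_minimum("You are a helpful assistant."))  # False
print(meets_cache_minimum("x" * 8000))                      # True: ~2,000 tokens
```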
The implementation
Add cache_control to the content blocks you want cached:
```python
import anthropic

client = anthropic.Anthropic()

system_prompt = """You are a technical support agent for Acme Corp...
[2,000+ tokens of detailed instructions, FAQs, product specs]
"""

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": system_prompt,
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[
        {"role": "user", "content": user_question}
    ]
)
```
The cache_control: {"type": "ephemeral"} marker tells Claude to cache this block. Ephemeral caches last 5 minutes and refresh on each hit. If your traffic is sparse enough that caches expire between requests, you will not see the savings.
Multi-block caching
You can cache multiple blocks, but each must meet the 1,024-token minimum:
```python
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": instructions,  # ~3,000 tokens of instructions
            "cache_control": {"type": "ephemeral"}
        },
        {
            "type": "text",
            "text": product_catalog,  # ~5,000 tokens of product data
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=messages
)
```
The second block is only cacheable if the first block is also cached — caching is prefix-based. If block 1 is not cached on a particular request, block 2 will not be either.
Caching conversation history
For multi-turn applications, cache the accumulated history up to the latest turn:
```python
def build_messages_with_cache(history: list[dict], new_user_message: str) -> list[dict]:
    """Build message array with cache_control on the latest assistant turn."""
    messages = list(history)
    # Mark the last assistant message for caching
    if messages and messages[-1]["role"] == "assistant":
        last = messages[-1]
        if isinstance(last["content"], str):
            # String content: wrap it in a text block carrying the marker
            last = {
                "role": "assistant",
                "content": [
                    {
                        "type": "text",
                        "text": last["content"],
                        "cache_control": {"type": "ephemeral"}
                    }
                ]
            }
        else:
            # Already a list of blocks: mark the last block (copy, don't mutate history)
            blocks = [dict(b) for b in last["content"]]
            blocks[-1] = {**blocks[-1], "cache_control": {"type": "ephemeral"}}
            last = {"role": "assistant", "content": blocks}
        messages[-1] = last
    messages.append({"role": "user", "content": new_user_message})
    return messages
```
This pattern caches everything up to the current turn. On the next request, Claude reads the cached history instead of reprocessing it.
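A quick way to confirm the marker landed where you expect. This helper is illustrative, not part of the SDK:

```python
def cache_marker_positions(messages: list[dict]) -> list[int]:
    """Indices of messages whose content blocks carry a cache_control marker."""
    marked = []
    for i, msg in enumerate(messages):
        content = msg["content"]
        if isinstance(content, list) and any("cache_control" in block for block in content):
            marked.append(i)
    return marked

history = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": [
        {"type": "text", "text": "Hello! How can I help?",
         "cache_control": {"type": "ephemeral"}},
    ]},
]
print(cache_marker_positions(history))  # [1]
```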
Verifying it works
Check usage in the response:
```python
print(response.usage)
# Usage(
#     cache_creation_input_tokens=10234,  # first request: cache written
#     cache_read_input_tokens=0,
#     input_tokens=47,
#     output_tokens=312
# )

# Second request:
print(response.usage)
# Usage(
#     cache_creation_input_tokens=0,
#     cache_read_input_tokens=10234,  # cache hit!
#     input_tokens=47,
#     output_tokens=298
# )
```
You should see cache_creation_input_tokens populated on the first request and cache_read_input_tokens on subsequent ones. If cache_creation_input_tokens shows up on every request, the cache is expiring before the next call arrives: your traffic is too sparse for the 5-minute TTL, or your assumptions about it are wrong.
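In monitoring code, it helps to make the hit/write/uncached distinction explicit. A small sketch, assuming only the usage fields shown above; SimpleNamespace stands in for the SDK's usage object:

```python
from types import SimpleNamespace

def classify_cache_result(usage) -> str:
    """Classify one response as a cache hit, a cache write, or uncached."""
    if getattr(usage, "cache_read_input_tokens", 0) > 0:
        return "hit"
    if getattr(usage, "cache_creation_input_tokens", 0) > 0:
        return "write"
    return "uncached"

first = SimpleNamespace(cache_creation_input_tokens=10234, cache_read_input_tokens=0)
second = SimpleNamespace(cache_creation_input_tokens=0, cache_read_input_tokens=10234)
print(classify_cache_result(first), classify_cache_result(second))  # write hit
```

Counting "write" results over time gives you your real hit rate, which is the number the cost math depends on.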
The gotchas
Order matters. Caching is prefix-based. If you reorder your system prompt blocks between requests, Claude cannot match the cached prefix and writes a new one. Keep cached content at the top and stable.
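A sketch of why reordering hurts: the cache can only match the leading blocks two requests have in common, so a reorder drops the match to zero.

```python
def shared_prefix_blocks(prev: list[str], cur: list[str]) -> int:
    """How many leading blocks two requests have in common.

    Only this prefix can be served from cache."""
    n = 0
    for a, b in zip(prev, cur):
        if a != b:
            break
        n += 1
    return n

blocks = ["instructions...", "product catalog..."]
print(shared_prefix_blocks(blocks, blocks))        # 2: full prefix reusable
print(shared_prefix_blocks(blocks, blocks[::-1]))  # 0: reordering breaks the match
```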
The 5-minute TTL. Ephemeral caches expire after 5 minutes of inactivity. For low-traffic routes, you may see more cache misses than expected. If this is a problem, structure your application to send a cheap keep-alive request to refresh the cache, or batch requests to maintain cache temperature.
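You can estimate the hit rate your traffic pattern would produce from request timestamps, assuming the cache refreshes on every hit (as ephemeral caches do):

```python
def estimated_hit_rate(arrival_times: list[float], ttl_seconds: float = 300.0) -> float:
    """Fraction of requests arriving within the TTL of the previous one.

    The first request is always a miss."""
    if not arrival_times:
        return 0.0
    hits = sum(
        1 for prev, cur in zip(arrival_times, arrival_times[1:])
        if cur - prev <= ttl_seconds
    )
    return hits / len(arrival_times)

print(estimated_hit_rate([0, 60, 120, 180]))  # 0.75: steady traffic mostly hits
print(estimated_hit_rate([0, 600, 1200]))     # 0.0: every gap exceeds the TTL
```

Run this over a day of production timestamps before deciding whether keep-alive requests are worth it.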
Model versions matter. Cache keys include the model. Switching from claude-sonnet-4-6 to claude-opus-4-6 invalidates the cache. Plan model upgrades with this in mind — traffic might see a temporary cost spike.
Tool schemas count as tokens. If you have a large set of tools, their schemas are part of the prompt. You can cache them by adding cache_control to the last tool definition in your tools array; everything up to and including that tool is cached. This is often overlooked.
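A sketch of what that looks like; the tool names and schemas here are hypothetical:

```python
tools = [
    {
        "name": "get_order_status",
        "description": "Look up the status of a customer order by ID.",
        "input_schema": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
    {
        "name": "search_products",
        "description": "Search the product catalog by keyword.",
        "input_schema": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
        # Marking the last tool caches every definition up to and including it
        "cache_control": {"type": "ephemeral"},
    },
]
```

Pass this as tools=tools to client.messages.create and the schemas are cached like any other prefix content.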
The implementation is one line of JSON. The savings are real and immediate. Add it.
Try this: Calculate your current monthly API cost. Now estimate what it would be with an 80% cache hit rate on your system prompt. If the difference is meaningful, implement caching this week. The patterns above are all you need.
Further reading
- Prompt caching documentation — the full API reference for prompt caching
- Prompt caching with Claude — how caching works and when to use it