Prompt caching with Claude: cut costs 80% on repeated context
In brief
Prompt caching can cut your Claude API costs by around 80% on the requests that matter most. Here's how it works, how to implement it, and the gotchas that keep teams from seeing the savings.
Prompt caching is the highest ROI optimization available to most Claude applications. If your requests share a large prefix — a system prompt, a long document, few-shot examples — you can mark that content for caching and pay 90% less on cache hits. The write cost is 25% higher than a normal token, but after the first request, reads are 90% cheaper and 85% faster.
The math is simple. A 10,000-token system prompt costs about $0.03 per request at full input price. A cache hit drops that to about $0.003; a miss that writes the cache costs about $0.0375. At a 90% hit rate, the blended cost is roughly $0.0065 per request. At 10,000 requests per day, that is $300 versus about $65. Monthly, $9,000 versus roughly $1,950: a saving of close to 80%.
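Working the numbers explicitly helps when plugging in your own prompt sizes and prices. A minimal sketch, assuming $3 per million input tokens as the base price and pricing a miss as a cache write (substitute your model's actual rates):

```python
def blended_cost_per_request(
    prompt_tokens: int,
    hit_rate: float,
    price_per_mtok: float = 3.00,    # assumed base input price, $ per million tokens
    write_multiplier: float = 1.25,  # cache writes cost 25% more than base
    read_multiplier: float = 0.10,   # cache reads cost 90% less than base
) -> float:
    """Average cost of the cached prefix per request at a given hit rate.

    A miss is priced as a cache write, since it re-creates the cache.
    """
    base = prompt_tokens / 1_000_000 * price_per_mtok
    hit_cost = base * read_multiplier
    miss_cost = base * write_multiplier
    return hit_rate * hit_cost + (1 - hit_rate) * miss_cost

uncached = blended_cost_per_request(10_000, hit_rate=0.0, write_multiplier=1.0)
cached = blended_cost_per_request(10_000, hit_rate=0.9)
print(f"${uncached:.4f} vs ${cached:.4f} per request")
```

Multiply the per-request figures by your daily volume to get the numbers above.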
What actually gets cached
Caching works on the prefix of your prompt — a contiguous block of content at the beginning that stays the same across requests. The key constraint: the cached block must be at least 1,024 tokens (2,048 for some models). Anything shorter does not qualify.
What qualifies:
- System prompts with detailed instructions, persona definitions, or tool schemas
- Long documents or knowledge bases appended before user messages
- Multi-turn conversation history that accumulates but does not change
- Few-shot examples at the start of a conversation
What does not qualify:
- Short system prompts (under 1,024 tokens)
- Content that changes every request
- Content after the user's variable input
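Before wiring up cache_control, it is worth checking that a candidate block clears the minimum. A rough heuristic sketch: the four-characters-per-token ratio is an approximation for English prose, so use the API's token counting endpoint when you need an exact figure.

```python
def rough_token_count(text: str) -> int:
    """Crude estimate: roughly 4 characters per token for English prose."""
    return len(text) // 4

def meets_cache_minimum(text: str, minimum: int = 1024) -> bool:
    """Is this block plausibly long enough to be cached at all?"""
    return rough_token_count(text) >= minimum

print(meets_cache_minimum("You are a helpful assistant."))  # False
print(meets_cache_minimum("x" * 8000))                      # True: ~2,000 tokens
```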
The implementation
Add cache_control to the content blocks you want cached:
```python
import anthropic

client = anthropic.Anthropic()

system_prompt = """You are a technical support agent for Acme Corp...
[2,000+ tokens of detailed instructions, FAQs, product specs]
"""

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": system_prompt,
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[
        {"role": "user", "content": user_question}
    ]
)
```
The cache_control: {"type": "ephemeral"} marker tells Claude to cache this block. Ephemeral caches last 5 minutes and refresh on each hit. If your traffic is sparse enough that caches expire between requests, you will not see the savings.
Multi-block caching
You can cache multiple blocks, but each must meet the 1,024-token minimum:
```python
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": instructions,  # ~3,000 tokens of instructions
            "cache_control": {"type": "ephemeral"}
        },
        {
            "type": "text",
            "text": product_catalog,  # ~5,000 tokens of product data
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=messages
)
```
The second block is only cacheable if the first block is also cached — caching is prefix-based. If block 1 is not cached on a particular request, block 2 will not be either.
Caching conversation history
For multi-turn applications, cache the accumulated history up to the latest turn:
```python
def build_messages_with_cache(history: list[dict], new_user_message: str) -> list[dict]:
    """Build message array with cache_control on the latest assistant turn."""
    messages = list(history)
    # Mark the last assistant message for caching
    if messages and messages[-1]["role"] == "assistant":
        last = messages[-1]
        if isinstance(last["content"], str):
            # String content: wrap it in a text block carrying the marker
            last = {
                "role": "assistant",
                "content": [
                    {
                        "type": "text",
                        "text": last["content"],
                        "cache_control": {"type": "ephemeral"}
                    }
                ]
            }
        else:
            # Already a list of blocks: mark the last block (copy, don't mutate history)
            blocks = [dict(b) for b in last["content"]]
            blocks[-1] = {**blocks[-1], "cache_control": {"type": "ephemeral"}}
            last = {"role": "assistant", "content": blocks}
        messages[-1] = last
    messages.append({"role": "user", "content": new_user_message})
    return messages
```
This pattern caches everything up to the current turn. On the next request, Claude reads the cached history instead of reprocessing it.
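A quick way to confirm the marker landed where you expect. This helper is illustrative, not part of the SDK:

```python
def cache_marker_positions(messages: list[dict]) -> list[int]:
    """Indices of messages whose content blocks carry a cache_control marker."""
    marked = []
    for i, msg in enumerate(messages):
        content = msg["content"]
        if isinstance(content, list) and any("cache_control" in block for block in content):
            marked.append(i)
    return marked

history = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": [
        {"type": "text", "text": "Hello! How can I help?",
         "cache_control": {"type": "ephemeral"}},
    ]},
]
print(cache_marker_positions(history))  # [1]
```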
Verifying it works
Check usage in the response:
```python
print(response.usage)
# Usage(
#     cache_creation_input_tokens=10234,  # first request: cache written
#     cache_read_input_tokens=0,
#     input_tokens=47,
#     output_tokens=312
# )

# Second request:
print(response.usage)
# Usage(
#     cache_creation_input_tokens=0,
#     cache_read_input_tokens=10234,  # cache hit!
#     input_tokens=47,
#     output_tokens=298
# )
```
You should see cache_creation_input_tokens populated on the first request and cache_read_input_tokens on subsequent ones. If cache_creation_input_tokens shows up on every request, the cache is expiring before the next call arrives: your traffic is too sparse for the 5-minute TTL, or your assumptions about it are wrong.
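In monitoring code, it helps to make the hit/write/uncached distinction explicit. A small sketch, assuming only the usage fields shown above; SimpleNamespace stands in for the SDK's usage object:

```python
from types import SimpleNamespace

def classify_cache_result(usage) -> str:
    """Classify one response as a cache hit, a cache write, or uncached."""
    if getattr(usage, "cache_read_input_tokens", 0) > 0:
        return "hit"
    if getattr(usage, "cache_creation_input_tokens", 0) > 0:
        return "write"
    return "uncached"

first = SimpleNamespace(cache_creation_input_tokens=10234, cache_read_input_tokens=0)
second = SimpleNamespace(cache_creation_input_tokens=0, cache_read_input_tokens=10234)
print(classify_cache_result(first), classify_cache_result(second))  # write hit
```

Counting "write" results over time gives you your real hit rate, which is the number the cost math depends on.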
The gotchas
Order matters. Caching is prefix-based. If you reorder your system prompt blocks between requests, Claude cannot match the cached prefix and writes a new one. Keep cached content at the top and stable.
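A sketch of why reordering hurts: the cache can only match the leading blocks two requests have in common, so a reorder drops the match to zero.

```python
def shared_prefix_blocks(prev: list[str], cur: list[str]) -> int:
    """How many leading blocks two requests have in common.

    Only this prefix can be served from cache."""
    n = 0
    for a, b in zip(prev, cur):
        if a != b:
            break
        n += 1
    return n

blocks = ["instructions...", "product catalog..."]
print(shared_prefix_blocks(blocks, blocks))        # 2: full prefix reusable
print(shared_prefix_blocks(blocks, blocks[::-1]))  # 0: reordering breaks the match
```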
The 5-minute TTL. Ephemeral caches expire after 5 minutes of inactivity. For low-traffic routes, you may see more cache misses than expected. If this is a problem, structure your application to send a cheap keep-alive request to refresh the cache, or batch requests to maintain cache temperature.
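You can estimate the hit rate your traffic pattern would produce from request timestamps, assuming the cache refreshes on every hit (as ephemeral caches do):

```python
def estimated_hit_rate(arrival_times: list[float], ttl_seconds: float = 300.0) -> float:
    """Fraction of requests arriving within the TTL of the previous one.

    The first request is always a miss."""
    if not arrival_times:
        return 0.0
    hits = sum(
        1 for prev, cur in zip(arrival_times, arrival_times[1:])
        if cur - prev <= ttl_seconds
    )
    return hits / len(arrival_times)

print(estimated_hit_rate([0, 60, 120, 180]))  # 0.75: steady traffic mostly hits
print(estimated_hit_rate([0, 600, 1200]))     # 0.0: every gap exceeds the TTL
```

Run this over a day of production timestamps before deciding whether keep-alive requests are worth it.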
Model versions matter. Cache keys include the model. Switching from claude-sonnet-4-6 to claude-opus-4-6 invalidates the cache. Plan model upgrades with this in mind — traffic might see a temporary cost spike.
Tool schemas count as tokens. If you have a large set of tools, their schemas are part of the prompt. You can cache them by adding cache_control to the last tool definition in your tools array; everything up to and including that tool is cached. This is often overlooked.
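A sketch of what that looks like; the tool names and schemas here are hypothetical:

```python
tools = [
    {
        "name": "get_order_status",
        "description": "Look up the status of a customer order by ID.",
        "input_schema": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
    {
        "name": "search_products",
        "description": "Search the product catalog by keyword.",
        "input_schema": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
        # Marking the last tool caches every definition up to and including it
        "cache_control": {"type": "ephemeral"},
    },
]
```

Pass this as tools=tools to client.messages.create and the schemas are cached like any other prefix content.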
The implementation is one line of JSON. The savings are real and immediate. Add it.
Try this: Calculate your current monthly API cost. Now estimate what it would be with an 80% cache hit rate on your system prompt. If the difference is meaningful, implement caching this week. The patterns above are all you need.
Further reading
- Prompt caching documentation — the full API reference for prompt caching
- Prompt caching with Claude — how caching works and when to use it