Pay for your context once, not every time
Prompt caching is Claude's way of remembering the expensive part of a conversation so you don't have to re-send — and re-pay for — the same context on every request.
Every time you send a message to Claude via the API, you pay for every token Claude processes — including your system prompt, your documents, and your conversation history. In a normal exchange, that's fine. But in applications where you're sending a large, fixed context with every request, the costs add up fast.
Prompt caching solves this. It lets Claude store a snapshot of your context on Anthropic's servers so it doesn't have to re-process the same tokens on every call.
The problem it's solving
Imagine you're building a support bot. Your system prompt is 2,000 tokens of company context, product documentation, and behavioral instructions. Every user message triggers a new API call — and every API call includes those 2,000 tokens.
If you get 10,000 support messages a day, you're paying to process 20 million tokens of context that never changes. That's the problem.
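The arithmetic is worth making concrete. A quick sketch (the per-token price here is a placeholder; check Anthropic's current pricing page for real numbers):

```python
# Token volume for the support bot above, without caching.
messages_per_day = 10_000
static_context_tokens = 2_000          # system prompt + docs, unchanged per call
price_per_mtok = 3.00                  # hypothetical $ per million input tokens

tokens_per_day = messages_per_day * static_context_tokens
daily_context_cost = tokens_per_day / 1_000_000 * price_per_mtok

print(tokens_per_day)        # 20,000,000 tokens of unchanging context per day
print(daily_context_cost)    # $60/day at the placeholder price, ~$1,800/month
```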
How it works
Prompt caching works through a specific marker you place in your API request. You tell Claude: "everything up to this point, cache it."
The first request still processes everything — Claude has to read it once. But subsequent requests that include the same cached prefix skip the re-processing step. Instead of computing the full context from scratch, Claude retrieves the cached state and continues from there.
The result: cache reads cost about 10% of the base input price. Writing the cache on the first request costs slightly more than a normal read (roughly a 25% premium), so caching pays off from the second request onward. On repeated calls with the same large context, you'll typically see a 60–90% cost reduction.
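You can check where that 60–90% figure comes from with a back-of-the-envelope calculation. This sketch assumes the documented multipliers (about 1.25× base price for a cache write, about 0.1× for a cache read) and a hypothetical base price:

```python
base = 3.00                  # hypothetical base input price, $ per million tokens
cache_write = base * 1.25    # first request writes the cache at a ~25% premium
cache_read = base * 0.10     # subsequent hits read it at ~10% of base

# Cost of n requests that all share the same large cached prefix:
n = 100
without_caching = n * base
with_caching = cache_write + (n - 1) * cache_read

savings = 1 - with_caching / without_caching
print(f"{savings:.0%}")      # approaches 90% as n grows
```

The savings climb toward the ~90% ceiling as the number of cache hits grows, which is why caching pays off most on high-traffic workloads.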
What's worth caching
Not everything benefits equally. Prompt caching shines when you have:
Large, stable system prompts. If your instructions are thousands of tokens and rarely change, cache them.
Reference documents. Loading a legal document, a codebase, or a knowledge base with every request? Cache it. The content doesn't change between calls.
Conversation history in long sessions. As a conversation grows, earlier turns stay fixed. Cache the accumulated history and only process new messages fresh.
Few-shot examples. If you're including a set of example inputs and outputs to shape Claude's behavior, those examples are identical across requests — perfect for caching.
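For the conversation-history case, the usual pattern is to move the cache boundary forward as the conversation grows: mark the last block of the accumulated history, so everything before the newest message can be served from cache. A minimal sketch (the helper name and message shapes are illustrative, not part of the SDK):

```python
def with_history_cached(history, new_user_message):
    """Return a messages list where the fixed history ends in a cache
    boundary and only the new user message is fresh."""
    messages = [dict(m) for m in history]   # shallow-copy, leave input intact
    if messages:
        content = messages[-1]["content"]
        if isinstance(content, str):        # normalize plain strings to blocks
            content = [{"type": "text", "text": content}]
        content = [dict(block) for block in content]
        # Cache everything up to and including the last history block.
        content[-1] = {**content[-1], "cache_control": {"type": "ephemeral"}}
        messages[-1]["content"] = content
    messages.append({"role": "user", "content": new_user_message})
    return messages

history = [
    {"role": "user", "content": "Hi, my order hasn't arrived."},
    {"role": "assistant", "content": "Sorry to hear that! What's the order number?"},
]
messages = with_history_cached(history, "It's #12345.")
```

On each turn, the previous boundary's prefix is a cache hit and only the tokens after it are processed fresh.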
The practical setup
In the Anthropic API, caching is controlled by adding "cache_control" parameters to specific content blocks. You mark the boundary after your stable content, and Claude handles the rest.
The cache persists for 5 minutes by default (with an option for a longer, one-hour lifetime). It's tied to the exact content: the lookup matches the longest identical prefix, so changing a single token invalidates the cache from that point on, and you pay for a fresh read of everything after it.
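In practice the request looks like this. The sketch below builds the request parameters for the support-bot scenario; the model name and prompt text are placeholders, and note that prompts below a minimum length (on the order of 1,024 tokens for most models) are too short to cache:

```python
SYSTEM_PROMPT = "...2,000 tokens of company context and instructions..."  # placeholder

request = {
    "model": "claude-sonnet-4-20250514",   # assumption: substitute your model
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": SYSTEM_PROMPT,
            # Marks the boundary: everything up to and including
            # this block is cached for subsequent requests.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    "messages": [{"role": "user", "content": "Where is my order?"}],
}

# With the official SDK, this would be sent as:
#   client = anthropic.Anthropic()
#   response = client.messages.create(**request)
```

Only the `cache_control` field is new relative to an ordinary request; the rest of the call is unchanged, which is why retrofitting caching into an existing application is usually a small diff.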
A real example
A startup using Claude to analyze sales calls found their system prompt (500 tokens) plus their scoring rubric (3,000 tokens) were identical across every call. After implementing prompt caching:
- Cost per analysis dropped from $0.08 to $0.012
- Latency on cache hits dropped by ~40% (less computation)
- Monthly API bill: down 85%
The code change took about 20 minutes.
When to reach for it
If you're building anything with Claude where the same large context appears in multiple requests, prompt caching should be one of the first optimizations you make. It's low-effort, high-impact, and requires no changes to how your application works — just how it formats API calls.
Further reading
- Prompt caching guide — Anthropic Docs