
If you're running Claude against a large codebase and calling the API repeatedly, you're almost certainly paying for tokens you've already sent. Every call re-transmits your entire CLAUDE.md, your tool definitions, and all your project context from scratch. On a 50-call batch with an 8,000-token system prompt, that's 400,000 input tokens before your actual work even starts.
Prompt Caching fixes this. Here's how I wired it up across a 4-node Mac Mini cluster running n8n workers, and what the real numbers looked like.
The Problem: You're Re-Sending the Same Tokens Every Time
When you call client.messages.create() in a loop, the SDK sends the full message payload on every request. If your system block contains a large CLAUDE.md — project architecture, conventions, file map, the works — every call eats those tokens fresh.
On the surface it looks fine. The API accepts it, Claude answers. But check your billing dashboard after 50 calls on an 8,000-token project context and you'll see the bleed:
50 calls × 8,000 tokens = 400,000 input tokens billed
At $15 per million input tokens (the rate this estimate assumes) → ~$6.00 in input costs alone
That's before output tokens, before any real computation. It's pure overhead from not caching static content.
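The arithmetic above can be reproduced in a few lines. This is a minimal sketch: the function name is made up for illustration, and the $15-per-million rate is an assumption chosen because it matches the ~$6.00 figure, so substitute the current price for the model you call.

```python
def uncached_input_cost(calls: int, prompt_tokens: int, usd_per_million: float) -> float:
    """Input cost in USD when every call re-sends the full prompt."""
    total_tokens = calls * prompt_tokens
    return total_tokens * usd_per_million / 1_000_000


# 50 calls, 8,000-token context, assumed rate of $15 per million input tokens
print(uncached_input_cost(50, 8_000, 15.0))  # 6.0
```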
The first thing most people try is shortening the system prompt. That helps a little, but you're fighting the symptom. The actual fix is telling the API which parts of your payload are static, so the server can store them between calls.
The Fix: Cache-Control Anchors in the Right Order
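Anthropic's Prompt Caching works on prefix tokens. The server stores a snapshot of your request up to the cache anchor and reuses it for subsequent calls within a 5-minute TTL, which is refreshed each time the cached prefix is read. A cache hit costs 90% less than a regular input token; the call that writes the cache pays a 25% premium on the tokens it stores.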
The design rule is simple: static content first, dynamic content last. Your system block should hold the parts that never change between calls. The user message changes every time, so it stays at the end and is always billed normally.
Here's the working pattern:
import anthropic

client = anthropic.Anthropic()

# Load your project context once: CLAUDE.md, architecture docs, whatever is static
with open("CLAUDE.md", "r") as f:
    project_context = f.read()

response = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": project_context,
            "cache_control": {"type": "ephemeral"},  # cache anchor
        }
    ],
    messages=[
        {"role": "user", "content": "Find the bug in the payment handler"}
    ],
)

usage = response.usage
print(f"Cache hit: {usage.cache_read_input_tokens} tokens")
print(f"New input: {usage.input_tokens} tokens")
print(f"Cache created: {usage.cache_creation_input_tokens} tokens")
The first call will always be a cache miss — the server has nothing stored yet, so cache_creation_input_tokens will be non-zero and cache_read_input_tokens will be 0. From the second call onward, you should see cache hits.
If cache_read_input_tokens is still 0 on the third or fourth call, something is invalidating your cache. Jump to the gotchas section below.
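To make that check less manual, the three usage counters can be folded into a single label per call. The sketch below is an illustration, not part of the SDK: `cache_status` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins that only mimic the attribute names of `response.usage`.

```python
from types import SimpleNamespace


def cache_status(usage) -> str:
    """Classify one response's usage as 'hit', 'write', or 'uncached'."""
    # Treat a missing or None counter as 0 so the helper never raises.
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    created = getattr(usage, "cache_creation_input_tokens", 0) or 0
    if read > 0:
        return "hit"
    if created > 0:
        return "write"
    return "uncached"  # nothing read or written: no anchor, or prompt too short


# Stand-in objects with the same attribute names as response.usage
first_call = SimpleNamespace(
    input_tokens=420, cache_read_input_tokens=0, cache_creation_input_tokens=7980
)
second_call = SimpleNamespace(
    input_tokens=420, cache_read_input_tokens=7980, cache_creation_input_tokens=0
)
print(cache_status(first_call), cache_status(second_call))  # write hit
```

Logging this label for every call in a batch makes a silent run of "write" or "uncached" results easy to spot.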
Real Numbers from a Mac Mini Cluster
I measured this on a 4-node Mac Mini setup running n8n HTTP Request workers in parallel, all hitting the same Claude API key. The project CLAUDE.md was ~8,000 tokens — a real production codebase with architecture notes, file conventions, and tool documentation.
50 calls, caching OFF:
| Metric | Value |
|---|---|
| Input tokens (total) | 400,000 |
| Cache read tokens | 0 |
| Cost | ~$6.00 |
50 calls, caching ON with cache_control:
| Metric | Value |
|---|---|
| Input tokens (billed) | 20,000 |
| Cache read tokens | 380,000 |
| Cost | ~$0.63 |
| Savings | 89% |
The usage output on a cache-hitting call looks like this:
# Call #2 and beyond:
Cache hit: 7980 tokens     # served from server cache, 90% cheaper
New input: 420 tokens      # user message + any uncached content
Cache created: 0 tokens    # nothing new to store
$6.00 down to $0.63. That's not a rounding error — it's a structural change in how tokens get billed.
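For planning a batch before running it, the billing structure can be written down as a small model. This is a sketch under stated assumptions, not a billing calculator: the function name is hypothetical, the 1.25x write and 0.10x read multipliers are the published ratios for the 5-minute cache as I understand them, and it assumes the first call writes the cache and every later call hits it. A measured bill will differ with the real split between cached and uncached tokens.

```python
def cached_batch_input_cost(
    calls: int,
    cached_tokens: int,
    dynamic_tokens: int,
    usd_per_million: float,
    write_multiplier: float = 1.25,  # assumed premium for writing a 5-minute cache entry
    read_multiplier: float = 0.10,   # assumed price of a cache read (90% off)
) -> float:
    """Estimate input cost in USD when call 1 writes the cache and later calls hit it."""
    if calls < 1:
        return 0.0
    billable = (
        cached_tokens * write_multiplier                  # first call writes the prefix
        + (calls - 1) * cached_tokens * read_multiplier   # later calls read it
        + calls * dynamic_tokens                          # per-call content, never cached
    )
    return billable * usd_per_million / 1_000_000


# Hypothetical batch: 10 calls, 5,000 static tokens, 200 dynamic tokens per call
print(round(cached_batch_input_cost(10, 5_000, 200, 15.0), 4))
```

The per-call dynamic tokens are billed at full price on every call, so a long user message eats into the savings even when the static prefix hits the cache every time.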
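One thing I verified: the server cache is not tied to a worker or a connection. Requests that authenticate with the same API key and send an identical prefix share one cache entry, so the first node to complete a call "warms" the cache for every other node. I call this the scout worker pattern. Worker 1 takes the cache miss and pays the write price. Workers 2, 3, and 4, all firing within 5 minutes, ride the cache hit. One caveat: a cache entry only becomes readable once the scout's response has started, so workers that fire at the same instant as the scout each take their own miss. Beyond letting the scout go first, no coordination is needed between workers.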
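Inside a single process, letting one call finish before the rest fan out can be sketched as below. This is an illustration only: `run_with_scout` and `fake_call` are hypothetical names, `fake_call` stands in for whatever function wraps the real API request, and a multi-node setup would need its own way of sequencing the first call.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def run_with_scout(jobs: Sequence[T], call: Callable[[T], R], max_workers: int = 4) -> List[R]:
    """Run the first job alone to warm the cache, then run the rest in parallel."""
    if not jobs:
        return []
    # Scout: this blocks until the first call returns, so the cache entry exists.
    results = [call(jobs[0])]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input jobs
        results.extend(pool.map(call, jobs[1:]))
    return results


def fake_call(query: str) -> str:
    """Hypothetical stand-in for the function that sends one request."""
    return query.upper()


print(run_with_scout(["a", "b", "c"], fake_call))  # ['A', 'B', 'C']
```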
Variations and Gotchas
Three ways the cache breaks silently
Dynamic content before the cache anchor. This is the most common gotcha I hit. If you prepend anything that changes — a timestamp, a session ID, a request UUID — to the system block before the cache_control block, you invalidate the entire cache on every call. The server sees a different prefix and creates a new cache entry instead of reusing the old one.
import uuid

# ❌ This kills your cache every call
system = [
    {"type": "text", "text": f"Session: {uuid.uuid4()}"},  # dynamic!
    {"type": "text", "text": project_context, "cache_control": {"type": "ephemeral"}},
]

# ✅ Static content first, dynamic goes in the user message
session_id = uuid.uuid4()
system = [
    {"type": "text", "text": project_context, "cache_control": {"type": "ephemeral"}},
]
messages = [{"role": "user", "content": f"[session:{session_id}] your question here"}]
TTL expiry in long-running batches. The ephemeral cache TTL is 5 minutes. If your batch job has gaps longer than that between calls — say a slow upstream step stalls one worker — the cache expires mid-batch and the next call pays full price again. Design your pipeline so calls happen in bursts within the 5-minute window, not spread across 30 minutes.
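A worker can keep a local estimate of whether the cache is still warm and log a warning before a call that is likely to miss. The class below is a rough sketch with hypothetical names; it only tracks time on the client side, assumes the 5-minute TTL restarts on every call, and cannot see what the server has actually stored.

```python
import time


class CacheWindow:
    """Tracks whether the last call is recent enough for the cache to still be warm."""

    def __init__(self, ttl_seconds: float = 300.0, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self._clock = clock      # injectable so the class can be tested without sleeping
        self._last_call = None   # no call made yet

    def mark_call(self) -> None:
        """Record that a request was just sent."""
        self._last_call = self._clock()

    def is_warm(self) -> bool:
        """True if a call was recorded less than ttl_seconds ago."""
        if self._last_call is None:
            return False
        return (self._clock() - self._last_call) < self.ttl_seconds


window = CacheWindow()
print(window.is_warm())  # False: nothing has been sent yet
window.mark_call()
print(window.is_warm())  # True immediately after a call
```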
Tool definition order changes. Tool definitions sit at the front of the cached prefix, ahead of the system block. If your tool array is assembled dynamically and the order shifts between calls (e.g., you're building it from a set, or from a dict populated in a varying order), the API sees a different prefix and misses the cache. Fix this by sorting or explicitly ordering your tools array before every call.
# Always sort tools by name (or any stable key) before creating the message
tools = sorted(raw_tools, key=lambda t: t["name"])
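When a miss is hard to explain, it can help to fingerprint the static part of each request locally and compare consecutive calls. This is a debugging sketch with a hypothetical helper name and made-up tool definitions; it is not how the API computes its cache key, it only shows whether the bytes you send have changed.

```python
import hashlib
import json


def prefix_fingerprint(tools: list, system: list) -> str:
    """Hash the static part of a request so two calls can be compared locally."""
    # json.dumps keeps list order and dict insertion order,
    # so any reordering or changed value produces a different hash.
    payload = json.dumps({"tools": tools, "system": system})
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


# Made-up tool definitions, for illustration only
tool_a = {"name": "read_file", "description": "Read a file", "input_schema": {"type": "object"}}
tool_b = {"name": "grep", "description": "Search files", "input_schema": {"type": "object"}}
system = [{"type": "text", "text": "project context", "cache_control": {"type": "ephemeral"}}]

unsorted_match = prefix_fingerprint([tool_a, tool_b], system) == prefix_fingerprint([tool_b, tool_a], system)
sorted_match = (
    prefix_fingerprint(sorted([tool_a, tool_b], key=lambda t: t["name"]), system)
    == prefix_fingerprint(sorted([tool_b, tool_a], key=lambda t: t["name"]), system)
)
print(unsorted_match, sorted_match)  # False True
```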
Minimum token threshold
The API only caches a prefix that meets a minimum token count. For Claude Opus 4 and Sonnet 4 models, this is currently 1,024 tokens; other models can have a higher minimum, so check the documentation for the model you actually call. If your CLAUDE.md is shorter than the minimum, the request still succeeds but the cache anchor is silently ignored. Either consolidate your context into a larger block or accept that caching won't help on small prompts.
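A quick local guess can flag a context file that is obviously too short before you spend a call on it. The helper below is a rough sketch: the name is hypothetical, and the 4-characters-per-token ratio is only a common rule of thumb for English text. For an exact figure, use the API's token-counting endpoint instead.

```python
def likely_cacheable(text: str, min_tokens: int = 1024, chars_per_token: float = 4.0) -> bool:
    """Rough guess at whether text clears the caching minimum."""
    estimated_tokens = len(text) / chars_per_token  # heuristic, not a real tokenizer
    return estimated_tokens >= min_tokens


print(likely_cacheable("x" * 2_000))   # False: roughly 500 tokens
print(likely_cacheable("x" * 40_000))  # True: roughly 10,000 tokens
```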
Tool definitions are also cacheable
You don't have to limit caching to the system block. If you have a large, stable tools array, you can place a cache_control anchor there too. The cached prefix is built in the order tools, system, messages, so the anchor on the system block already covers the tools that precede it. A second anchor on the last tool adds an earlier breakpoint: when the system text changes but the tools don't, the tools portion can still be served from cache.
response = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=1024,
    system=[
        {"type": "text", "text": project_context, "cache_control": {"type": "ephemeral"}}
    ],
    tools=[
        *your_tool_definitions[:-1],
        # mark the last tool with cache_control so the tools are cached too
        {**your_tool_definitions[-1], "cache_control": {"type": "ephemeral"}},
    ],
    messages=[{"role": "user", "content": user_query}],
)
Closing
Prompt Caching isn't a one-line flag you flip. It's a structural decision: identify what never changes between calls, push it to the front of your payload, and anchor it. The dynamic per-call content goes at the end. Get that ordering right, and the 90% cache discount compounds across every worker on your cluster that shares the same API key.
What's next: if you're running longer workflows where the 5-minute TTL is a problem, look at batching strategy — grouping calls into tight windows rather than spreading them across a slow pipeline. That's where the scout worker pattern pays off most.