← writing

Prompt caching is the new prompt engineering

I cut my monthly Anthropic API bill 64% by moving one comma in a system prompt. Here is the mental model for prompt caching in 2026, three real wins from production apps, and the one gotcha that will silently invalidate your cache.

I cut my monthly API bill 64% by moving one comma in a system prompt.

Not by switching models. Not by trimming tokens. Not by rewriting prompts to be cleverer. I moved a comma, which moved the boundary of where my cached prefix ended and my dynamic content began, and the next morning the bill was a different shape. That is the world we live in now. Prompt engineering used to be about words. In 2026 it is about cache hit rates.

This post is the mental model I wish someone had handed me a year ago, three concrete wins from real projects, and the one gotcha that keeps eating people alive.

How prompt caching actually works

The Anthropic API caches stable prefixes on the server side. You mark a boundary in your prompt with a cache_control marker, and everything from the start of the request up to that marker becomes a cacheable chunk. The next call that arrives with the exact same prefix reuses the cached version and pays a fraction of the normal input price for those tokens. Cache misses pay a small write premium, cache hits pay a big discount. The break-even is fast, usually after one or two reuses.

The mental model that finally made it click for me is this. Think of your prompt as having a hot prefix and a cold suffix. The hot prefix is everything that rarely changes from call to call. Your system instructions, your tool definitions, your few-shot examples, your large reference documents. The cold suffix is everything that does change. The current user message, the latest tool result, the per-request context. You want the hot prefix to be as long as possible and the cold suffix to be as short as possible, because the cache only pays you back for what sits before the marker.

That sentence is the whole game. Move stable stuff to the front. Move volatile stuff to the back. Put the marker on the boundary. Stop thinking about your prompt as a single blob and start thinking about it as two layers stacked on top of each other.

Win one: the RAG app with documents in the middle

The first project where I really felt the difference was a RAG-style assistant for a client's internal knowledge base. The original prompt was structured the way a human would write it. System instructions at the top, then the user query, then the retrieved documents pasted in below, then a short instruction at the bottom telling the model how to cite.

Cache hit rate was about 12%. Which makes sense in hindsight. The user query was inside the cached region, so every new question invalidated the entire prefix. The retrieved documents were also inside, and they changed every call because retrieval pulled different chunks. Almost nothing was actually stable.

The fix took twenty minutes. I moved the retrieved documents above the user query, put the cache marker right after the documents, and pushed the user query into the cold suffix. The system prompt and citation rules stayed pinned at the very top.

Now the cache layout looks like: system prompt, citation rules, retrieved documents, [cache marker], user query. The retrieved documents are still per-request, but within a single conversation they often stay the same across multiple turns of follow-up questions, because the user keeps drilling into the same topic. Cache hit rate jumped to 78%. Bill dropped accordingly. Latency improved too, because cached tokens are not just cheaper, they are faster to process.

Win two: the coding assistant with tool definitions on every call

The second one was sneakier. I had a coding assistant with maybe fifteen tool definitions, each with a chunky JSON schema. The system prompt was modest, but the tool block was huge. Every single API call re-sent the entire tool block, and the tool block was technically part of the request, so it counted toward input tokens.

The tools never changed. They were defined at startup and stayed identical for the life of the process. But because I was building the request fresh each turn, and because some per-turn context was getting interleaved with the tool definitions, the cache was missing more often than it was hitting.

I split the request into two layers. Stable tools and system prompt at the top, then the marker, then the dynamic conversation history and the latest user message. I also stopped interleaving anything per-turn into the tool section. Once the tool block was pinned and pure, the cache started hitting reliably on every call after the first one. The savings were less dramatic in percentage terms than the RAG fix, but in absolute money it was bigger, because this assistant ran constantly and the tool block was enormous.

Win three: the translation pipeline with a glossary the size of a novella

The third one was a translation pipeline where the glossary was the largest part of the prompt by far. The client had built up a domain-specific glossary over years, with thousands of preferred translations and forbidden terms. Every translation request included the full glossary so the model would respect it.

The glossary never changes within a session. It only updates when the client edits it, which happens maybe once a week. So I cached the glossary once per session, pinned it at the top of the prompt, put the marker right after it, and let the actual sentence to translate sit in the cold suffix. The first call of a session pays the write premium. Every subsequent call in that session pays the cache hit price for the glossary, which is enormous, plus the normal price for the one sentence being translated, which is tiny.

This was the project where the comma incident happened. I had a templating helper that joined glossary entries with newlines, and someone had added a trailing space after one of the entries during a manual edit. That trailing space was inside the cached region, so when the next deployment shipped, the prefix hash changed for every session, and the cache hit rate collapsed overnight. I noticed because the bill spiked. I fixed it by stripping trailing whitespace at template render time. Hit rate recovered the same day.

The gotcha that will get you

One character of difference in the cached prefix invalidates the entire cached chunk. Not the changed paragraph. The whole thing. Trailing whitespace from templating, a stray newline that crept in after a refactor, a date stamp you forgot was inside the prefix instead of after the marker, a tool definition with a field reordered, all of these will silently turn your 78% hit rate back into a 12% hit rate without raising any error.

What I do now is dump the exact bytes of the cached prefix from the first call of a session and from a later call in the same session, then diff them. If the diff is non-empty, the cache is not actually being reused, no matter what the docs say. This sanity check has caught three real bugs for me this year already.

The new prompt engineering

The old craft was choosing the right words to coax the right behaviour. That craft still matters. But on top of it sits a new craft, which is shaping your prompts so the cache can do its job. Identify what is stable and what is volatile. Pin the stable stuff to the front. Put the marker on the boundary. Watch for whitespace. Measure the hit rate, not just the response quality.

Prompt engineering used to be about words. In 2026 it is about cache hit rates.

Want more like this?

Occasional, opinionated, no listicles.
all writing →