TL;DR
Three caching layers cover every production LLM pipeline. Layer 1, provider-side prompt caching (Anthropic prompt caching, OpenAI cached input, Gemini context caching) is the cheapest default. On stable system-prompt + tool-block workloads it cuts input cost 70–90% with zero behavioural risk. Layer 2, application-side response cache, is correct for deterministic queries (a fixed 10-K extraction against an immutable filing). A SHA-256 of the full prompt bundle is the only safe key. Hit rates of 30–60% are typical on research loops, with a hit costing O($10⁻⁶) versus O($10⁻²) cold. Layer 3, semantic similarity cache with embeddings, is worth the complexity only when traffic clusters around a small set of paraphrased intents (public chatbot, support agent). Below ~10k req/day on heterogeneous queries the embedding bill and false-hit risk wipe the savings. A worked example at the end shows a 50k-query/month financial-research agent dropping from $2,655/month to $432/month with Layers 1 and 2 alone.
Why caching is the largest cost line you can move in 2026
Frontier model prices have not fallen at the rate the slides suggested. Sonnet-tier and GPT-5-tier input is still $2–$5 per million tokens, output 4–6× that. Most teams notice two months in, when the bill steepens against a chart that did not include caching.
An agent system prompt is 4k tokens. A finance tool-definitions block adds 1.2k. Add a 25k-token filing and 12k of RAG context and a single call sits at roughly 40k input tokens. At Sonnet rates that is about $0.12 per call before output. 50,000 monthly calls is $6,000 of input alone before reasoning, retries, or evals.
Caching attacks the dominant term: every byte stable across a sequence of calls is a byte to pay for once, not n times. The system prompt is stable; the tool block is stable; the filing does not change between the four questions you ask of it.
The three layers below stack cleanly. Each catches a different class of repetition; each has a precise condition for when it earns its complexity.
Layer 1: provider-side prompt caching
Cheapest to enable, largest baseline impact. All three frontier providers ship it as of April 2026: Anthropic prompt caching with explicit cache_control breakpoints, OpenAI cached input via automatic prefix matching, Gemini context caching via an explicit CachedContent handle.
The mechanism is the same in spirit: the provider hashes a request prefix, reuses prior compute on a hit, and bills the cached portion at a discount. The discount is steep on Anthropic (cache-read ~0.1× standard input, cache-write 1.25× for 5-minute TTL or 2× for 1-hour), softer but automatic on OpenAI (cache-read ~0.5× input, no write premium), and tunable on Gemini (cache-read ~0.25× input plus per-minute storage).
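On Anthropic the breakpoint is explicit. A minimal sketch of Layer 1 with the Python SDK, where the prompt contents and the get_filing_section tool are purely illustrative (OpenAI needs no markup since prefix matching is automatic; Gemini uses a CachedContent handle instead):

import anthropic

client = anthropic.Anthropic()

SYSTEM_PROMPT = "You are a financial research assistant. ..."  # stable, ~4k tokens in practice
TOOLS = [
    {
        "name": "get_filing_section",
        "description": "Return a named section of a stored SEC filing.",
        "input_schema": {
            "type": "object",
            "properties": {"ticker": {"type": "string"}, "section": {"type": "string"}},
            "required": ["ticker", "section"],
        },
    },
]  # stable registry: keep the ordering fixed or the cached prefix is invalidated

def ask(user_query: str):
    return client.messages.create(
        model="claude-sonnet-4-6",  # the exact version string from your deployment
        max_tokens=1024,
        tools=TOOLS,
        system=[
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                # Breakpoint: everything up to here (tools, then system) is written to the
                # cache on the first call and read at ~0.1x for the rest of the session.
                # Keep timestamps and per-request data out of this prefix.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        messages=[{"role": "user", "content": user_query}],
    )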
When it pays. Multi-turn agents that call the model 6–12 times in a session against a stable system prompt and tool block hit near-100% on every call after the first. Document-anchored question loops (four questions of one 10-K) cache the filing once and let the per-question tail stay small. Batch evaluation runs against a fixed scoring rubric cache the rubric and pay only for the variant.
When it does not pay. A one-shot call against a fresh prompt pays the cache-write premium and amortizes it over zero reads. At the rates above, the 5-minute write premium (1.25×) is repaid by a single cache-read inside the TTL and the 1-hour premium (2×) needs two; with no reads at all, plain inference is cheaper.
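The breakeven is one line of arithmetic on the multipliers above; a sketch, with the Anthropic-style defaults as assumptions:

def prefix_cache_pays(reads_per_write: float, write_mult: float = 1.25, read_mult: float = 0.10) -> bool:
    # Cost in multiples of the base input price for the same prefix tokens:
    # uncached, the write call plus every would-be read pays 1x each;
    # cached, the first call pays write_mult and each later call pays read_mult.
    uncached = 1 + reads_per_write
    cached = write_mult + reads_per_write * read_mult
    return cached < uncached

# prefix_cache_pays(1)                  -> True: one read repays the 5-minute premium
# prefix_cache_pays(1, write_mult=2.0)  -> False: the 1-hour premium needs two reads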
Latency vs cost. Cache hits are not free latency wins. Reported TTFT improvements run 100–400ms on a 30k-token prefix, not the order-of-magnitude jumps some posts claim. The cost reduction is the real case for caching; treat the latency delta as a small bonus.
The structural traps bite hard. A timestamp in the system prompt invalidates the prefix on every call. A reordered tool registry invalidates the cache for a whole session. A breakpoint below the minimum block size (1024 tokens for Sonnet/Opus, 4096 for Gemini 2.5 Pro) is silently ignored. The fix is structural, not a flag.
The Token-Cost Optimizer computes the breakeven for a given workload shape. Use it before wiring anything; the answer is sometimes "do not bother for this surface" and that is fine.
Layer 2: application-side response cache
Cache the full response by the hash of the full prompt bundle. On a hit the model is not called at all; the save is total, not fractional.
This layer is correct for one specific archetype: deterministic queries against immutable inputs. A fixed extraction prompt against a 2023 SEC filing. A summarisation template against a transcript. A canonical "explain this metric" answer for a glossary term.
Hash key strategies. The naive key is the user query. That is wrong. The true cache key is the SHA-256 of every input that determines the output: model name and version, sampling parameters (temperature, top_p, max_tokens), system prompt, tool definitions, full message history, retrieved context, user query. Drop any one and you get false hits across model upgrades and silent quality decay.
A safe key:
import hashlib
import json

def cache_key(model: str, params: dict, system: str, tools: list, messages: list) -> str:
    # Everything that determines the output goes into the payload; omitting any
    # field means false hits across model upgrades, prompt edits, or param changes.
    payload = {
        "model": model,        # exact version string, e.g. claude-sonnet-4-6
        "params": params,      # temperature, top_p, max_tokens
        "system": system,
        "tools": tools,
        "messages": messages,  # full history plus retrieved context and user query
    }
    serialized = json.dumps(payload, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()
The sort_keys=True matters. JSON key ordering is not guaranteed and a sibling field flip will miss the cache for no semantic reason. The model field must be the exact version string (claude-sonnet-4-6, not claude); otherwise an upgrade silently serves stale answers.
TTL policy. Three tiers cover almost everything:
- Immutable inputs (past SEC filings, glossary entries with no version suffix, frozen evaluation prompts) live in a permanent KV with eviction by storage cost only.
- Stable-but-versioned entries (production prompts gated by a deploy version) invalidate when a prompt-version field in the hash changes.
- Time-sensitive entries match the underlying volatility window: an hour for news-driven queries, forever for "what is the GAAP EPS in this filing".
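A read-through sketch tying the key and the TTL tiers together; the in-memory dict and the call_model function are stand-ins for whatever store and provider SDK you actually run:

import time

CACHE: dict[str, tuple[float | None, str]] = {}  # stand-in for a Postgres/Redis KV: key -> (expiry, response)

TTL_BY_TIER = {
    "immutable": None,        # past filings, frozen eval prompts: evict on storage cost only
    "versioned": None,        # invalidated by a prompt-version field inside the hash, not by time
    "time_sensitive": 3600,   # news-driven queries: match the volatility window
}

def cached_call(tier: str, model: str, params: dict, system: str,
                tools: list, messages: list) -> str:
    key = cache_key(model, params, system, tools, messages)  # the helper above
    hit = CACHE.get(key)
    if hit is not None:
        expiry, response = hit
        if expiry is None or expiry > time.time():
            return response                                   # hit: the model is never called
    response = call_model(model, params, system, tools, messages)  # your provider SDK call
    ttl = TTL_BY_TIER[tier]
    CACHE[key] = (None if ttl is None else time.time() + ttl, response)
    return response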
What hit rates look like. On a fixed prompt template across a universe of 500 tickers, the second sweep hits near 100% on tickers whose inputs did not change. On heterogeneous live traffic, hit rates land between 5% and 30%. Economics still work because each hit returns at storage cost, on the order of $10⁻⁶ per hit versus $0.04 for a typical cold call. A 20% hit rate on a workload that costs $4,000/month uncached saves roughly $800/month for the cost of a Postgres table.
The Agent Cost Envelope Calculator lets you model the savings against your projected hit rate. Plug in the real cold-call cost from your bill and a conservative projected hit rate (start at 15%), and the monthly saving falls out directly.
Layer 3: semantic similarity cache
The most powerful layer in narrow conditions, a net negative outside them. Embed the user query, search a vector store of prior queries, and if the nearest neighbour exceeds a similarity threshold, serve the cached response. It catches paraphrased queries with the same intent. "What is Apple's debt-to-equity ratio" and "How indebted is AAPL relative to equity" miss any hash-based cache; a semantic cache returns the same answer to both.
Preconditions. Layer 3 earns its keep when query traffic clusters around a small set of intents (a public chatbot fielding the same 200 questions in different wordings, a support agent fielding "where is my order" in 40 paraphrases). It also requires that the cached response is stable across small intent shifts. "Tell me about NVDA" and "Should I buy NVDA" have closely clustered embeddings and very different correct answers, so semantic caching is appropriate for explanatory content, not decisions. And volume has to amortize the embedding cost: below ~10k requests/day on heterogeneous queries, the per-call embedding bill, the vector store, and the engineering time outweigh the savings.
Failure modes. The threshold trap: too low a similarity threshold and the cache returns wrong answers ("how do I cancel" hits "how do I refund"); too high and the hit rate collapses. The right threshold is workload-specific and only tunable with a labelled test set, which most teams do not have. Start at 0.92 cosine and only relax with evidence.
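A minimal sketch of the lookup, assuming an embed() call to whatever embedding model you use and an in-memory list in place of a vector store:

import numpy as np

SIMILARITY_THRESHOLD = 0.92                        # start strict; relax only with labelled evidence
SEMANTIC_CACHE: list[tuple[np.ndarray, str]] = []  # (normalised query embedding, cached response)

def semantic_lookup(query: str) -> str | None:
    q = embed(query)                               # stand-in for your embedding model call
    q = q / np.linalg.norm(q)
    best_score, best_response = -1.0, None
    for vec, response in SEMANTIC_CACHE:
        score = float(np.dot(q, vec))              # entries are pre-normalised, so dot == cosine
        if score > best_score:
            best_score, best_response = score, response
    return best_response if best_score >= SIMILARITY_THRESHOLD else None

def semantic_store(query: str, response: str) -> None:
    q = embed(query)
    SEMANTIC_CACHE.append((q / np.linalg.norm(q), response))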
The drift trap: a cached answer correct three weeks ago is stale today, indistinguishable from a fresh response. Mitigation: TTLs measured in hours, not days, for time-sensitive content.
The adversarial trap: a user finds that "ignore prior instructions" embeds close to a benign cached answer and gets that response. Validating the cache hit against the live query partially defeats the purpose. For finance agents handling user input, this is sufficient reason to skip Layer 3 entirely.
For a finance research agent serving internal users with predictable workflows, Layer 3 is unjustified. For a public-facing tool with consumer query volume, it can cut 20–40% off the residual bill after Layers 1 and 2 are saturated.
The decision matrix
By query archetype, the layer that wins:
| Workload | Layer 1 | Layer 2 | Layer 3 | Notes |
|---|---|---|---|---|
| Multi-turn agent loop, stable system | Yes | No | No | Per-call uniqueness defeats Layer 2 |
| 10-K Q&A, fixed questions per filing | Yes | Yes | No | Layer 2 catches re-runs across the universe |
| Public chatbot, paraphrased intents | Yes | Maybe | Yes | Layer 3 carries the bulk |
| Internal research, deterministic prompts | Yes | Yes | No | Hash key sufficient, semantic adds risk |
| One-shot evals, fresh prompt each | No | No | No | Caching is a tax here |
| RAG over slow-moving corpus | Yes | Yes | No | Cache the corpus; cache the question-answer |
| Tool-call results from real-time APIs | No | No | No | Cache poisoning risk; do not cache |
Layer 1 is on by default unless your traffic has zero prefix repetition (uncommon). Layer 2 is on for every workload where you can guarantee output determinism for the cached input. Layer 3 is opt-in only after the first two are saturated and the residual bill still hurts.
Worked example: financial-research agent at 50k queries/month
A concrete pipeline: 50,000 monthly research queries against 500 tickers, four canonical questions per ticker per sweep, two sweeps per month plus 10,000 ad-hoc queries from internal users. Stack: Claude Sonnet 4.6, 4k system prompt, 1.2k tool block, 8k RAG context, 500-token user query, 800-token output.
Uncached baseline. Every call pays full price on 13.7k input tokens and 800 output tokens.
Input cost = 50,000 × 13,700 / 1e6 × $3.00 = $2,055
Output cost = 50,000 × 800 / 1e6 × $15 = $600
Total = $2,655 per month
With Layer 1 only. The 5.2k stable prefix (system + tools) caches at 1-hour TTL. Per-ticker filing context (~8k tokens) caches at 5-minute TTL across the four questions. One cache-write on 5.2k per deploy day plus near-100% reads after; one cache-write on 8k per ticker-sweep plus three cache-reads on the four-question loop; 500-token tail uncached.
Stable prefix cost ≈ negligible (one write amortized over 50k reads)
Per-ticker corpus = 1,000 ticker-sweeps × (1 write + 3 reads) × 8k tok
write: 1,000 × 8,000 / 1e6 × $3.75 = $30
read: 3,000 × 8,000 / 1e6 × $0.30 = $7.20
Tail (500 tok × 50,000 calls) = 25M tok × $3.00 / 1e6 = $75
Output unchanged = $600
Layer 1 total ≈ $720/month
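The same arithmetic as a script, mirroring the line items above so the numbers can be re-run against your own rates (the small gap to the rounded ~$720 is rounding):

IN_RATE, OUT_RATE = 3.00, 15.00                          # $ per 1M tokens, Sonnet-class
WRITE_RATE, READ_RATE = IN_RATE * 1.25, IN_RATE * 0.10   # 5-minute cache write / read

CALLS, INPUT_TOK, OUTPUT_TOK = 50_000, 13_700, 800

baseline_input = CALLS * INPUT_TOK / 1e6 * IN_RATE       # $2,055
baseline_output = CALLS * OUTPUT_TOK / 1e6 * OUT_RATE    # $600

ticker_sweeps, corpus_tok, reads_per_sweep = 1_000, 8_000, 3
corpus_write = ticker_sweeps * corpus_tok / 1e6 * WRITE_RATE                  # $30
corpus_read = ticker_sweeps * reads_per_sweep * corpus_tok / 1e6 * READ_RATE  # $7.20
tail = CALLS * 500 / 1e6 * IN_RATE                                            # $75

layer1_total = corpus_write + corpus_read + tail + baseline_output           # ≈ $712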
The 73% drop in the total bill comes from prefix caching alone, with no application-logic changes.
Add Layer 2 on the deterministic sweep half. The 40,000 sweep queries are deterministic: same prompt template, same model version, same retrieved context per ticker. The second monthly sweep over a ticker whose filings did not change hits at near 100%. Half the sweep traffic (20,000 calls) is served from Postgres.
Layer 2 saves: 20,000 calls × ($0.04 model cost) ≈ $800
Part of that $800, though, was already captured by Layer 1.
Net additional Layer 2 save on top of Layer 1:
20,000 × Layer-1-cost-per-call ($720 / 50,000 = $0.0144) ≈ $288
Layer 2 total ≈ $432/month
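Continuing the sketch above for the Layer 2 step:

layer2_hits = 20_000                                                # second-sweep tickers served from Postgres
layer1_cost_per_call = layer1_total / CALLS                         # ≈ $0.0144
layer2_total = layer1_total - layer2_hits * layer1_cost_per_call    # ≈ $427, the ~$432 above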
The 10,000 ad-hoc queries remain uncached (heterogeneous, time-sensitive). The shape of the problem says do not push Layer 3 onto them; the engineering cost would not pay back at this volume.
Final monthly bill. ~$432 versus $2,655 baseline. A 6.1× reduction for one engineer-day of work: a Postgres table, ~120 lines of caching middleware, an integration test that asserts the hash key includes model version and sampling params.
The Data Vendor TCO Calculator is the companion on the data side, where the same compounding logic applies to vendor tier choices.
Anti-patterns
Personalized data in a shared cache. A response that includes user-specific PII or account numbers must never enter a shared cache. Even a hash-keyed Layer 2 cache leaks if the key collides on the prompt template but the response includes user-specific output. Include the user ID in the hash key for per-user partitioning, or do not cache the response at all and cache the upstream retrievals instead.
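If a response legitimately varies per user, the minimal safe variant is to fold the user identity into the key from the earlier snippet, accepting the lower hit rate:

def user_scoped_cache_key(user_id: str, model: str, params: dict, system: str,
                          tools: list, messages: list) -> str:
    # Per-user partition: two users never share an entry, so a template-level
    # key collision can no longer leak one user's output to another.
    return f"{cache_key(model, params, system, tools, messages)}:{user_id}"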
Tool-call results that depend on real-time state. A "current stock price" tool returns a new value every second; caching its response for any non-zero TTL feeds stale data to the model. Caching is only valid for tool calls whose output is a function of inputs you control. If the tool reads market state, weather, time, or external truth, do not cache its result.
Weak hash keys. Hashing only the user query is the most common mistake. The cache then returns the same response across model upgrades, prompt edits, and parameter changes. Every input that determines the output belongs in the hash. Use a structured payload, sort the keys, include the model version explicitly.
Caching sampled output. A prompt run at temperature > 0 is intentionally non-deterministic; caching the first response and serving it forever silently pins one arbitrary sample as the permanent answer. Set temperature to 0 if you want determinism; otherwise do not cache.
Cache-write below breakeven. Anthropic charges 1.25× the input rate to write the 5-minute cache and 2× for the 1-hour cache. A prefix that is never read before its TTL expires is a pure loss; at the 5-minute premium a single read repays the write, at the 1-hour premium you need roughly two. Below that, disable cache_control on those breakpoints.
No invalidation strategy. A cache without one becomes a liability the day a prompt or model changes. The fix is the same one the hash-key discipline already implies: bake the model version, prompt version, and sampling params into the hash and let any change invalidate the relevant entries automatically. Manual cache flushes in the deploy runbook do not survive contact with a bad week.
Where the layers compound
Stacking is multiplicative. Layer 1 cuts per-call cost ~75% on a stable-prefix workload; Layer 2 cuts call count 30–60% on the deterministic subset; Layer 3 attacks what remains, if the conditions hold. For a typical 2026 finance research pipeline the realistic floor is 80–85% off the uncached bill with Layers 1 and 2 alone. The remaining cost is fresh inference on novel inputs, which only moves with cheaper models, not better caching.
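The compounding claim in one line, with illustrative mid-range figures:

per_call_cut, hit_rate = 0.75, 0.40               # Layer 1 on the stable prefix; Layer 2 on the deterministic subset
residual = (1 - per_call_cut) * (1 - hit_rate)    # 0.15 of the uncached bill survives
print(f"{1 - residual:.0%} off before Layer 3")   # 85%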
If your bill is materially above that floor, the answer is rarely "switch models." It is "audit the caching layout." The Token-Cost Optimizer and Agent Cost Envelope Calculator are the cheapest places to start; the Data Vendor TCO Calculator handles the upstream half.
Tools referenced
- Token-Cost Optimizer. Model the per-call savings of Layer 1 prompt caching.
- Agent Cost Envelope Calculator. Bound the monthly bill of a research loop with Layer 2 included.
- Data Vendor TCO Calculator. The upstream data-cost half of the same compounding logic.