TL;DR
Three token-cost reduction strategies dominate the 2026 production LLM stack: provider-side prompt caching, model distillation, and retrieval-augmented generation against a focused index. Each has a precise condition under which it earns its complexity, and each has a payback regime where it loses to a simpler alternative. The decision rules are mechanical. Prompt caching pays back within one to two reads per cache-write inside the TTL (depending on provider and TTL), and on stable system-prompt + tool-block workloads with high hit rates it cuts input cost 70–90% with effectively zero behavioural risk (a 30k-token prefix drops from $0.090/req cold to $0.009/req on a cache read at $3/M input, a 10× per-read saving). Distillation pays back above roughly 100,000 inferences over the lifetime of a stable task, where a 4o-mini-class student matches a Sonnet-class teacher within 2% accuracy. RAG pays back when the corpus is large enough that no caching strategy can hold it warm, and changes often enough that fine-tuning would have to be re-run weekly. The three compound: most production pipelines should use all three, not pick one.
The dominant cost line in 2026
Frontier model prices have flattened, not collapsed. Sonnet-tier and GPT-5-tier input still sits at $2–5 per million tokens, output 4–6× that, through April 2026. A single agent call against a stable 4k system prompt, a 1.2k tool block, and 25k of retrieved context lands at roughly $0.12 of input at $4/M before generation. 50,000 monthly calls is $6,000 of input spend alone, before reasoning, retries, or evals.
The dominant term in that cost is repetition. Most agent traffic re-sends the same system prompt, tool definitions, and retrieved snippets across hundreds or thousands of calls. The optimization surface is "stop paying for bytes you have already paid for." Prompt caching attacks repetition at the provider level, distillation attacks the per-request size by training a smaller model to produce equivalent output, and RAG attacks the per-request size by retrieving only the snippets a specific query needs instead of stuffing the full corpus.
The three are orthogonal, and they compose. A pipeline that uses Anthropic prompt caching on the system prompt, RAG to inject the right 4k of corpus context per query, and a distilled Haiku-class student for the routing-and-classification stages can run at roughly 3–8% of the naive cost without measurable accuracy degradation.
Strategy 1: prompt caching, the cheapest win
Provider-side prompt caching reuses prior compute on a prefix hit. All three frontier providers ship it as of 2026: Anthropic with explicit cache_control breakpoints (5-minute or 1-hour TTL), OpenAI with automatic prefix matching on cached input, Gemini with an explicit CachedContent handle.
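For the Anthropic flavour, a minimal sketch of marking the stable prefix with an explicit breakpoint; the model id, system prompt, and tool definition below are placeholders, and the usage fields printed at the end are the ones the measurement section later relies on.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

STABLE_SYSTEM_PROMPT = "You are a research agent..."  # placeholder: keep byte-stable across calls
TOOLS = [
    {
        "name": "search_filings",  # placeholder tool definition
        "description": "Search SEC filings by ticker and form type.",
        "input_schema": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    }
]

def cached_call(user_message: str):
    # The breakpoint on the last system block marks the end of the cacheable prefix;
    # everything before it (tools + system) is written to / read from the cache.
    response = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder model id
        max_tokens=1024,
        tools=TOOLS,
        system=[
            {
                "type": "text",
                "text": STABLE_SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"},  # 5-minute TTL variant
            }
        ],
        messages=[{"role": "user", "content": user_message}],
    )
    usage = response.usage
    # cache_creation_input_tokens > 0 on the write call, cache_read_input_tokens > 0 on hits
    print(usage.cache_creation_input_tokens, usage.cache_read_input_tokens)
    return response
```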
The pricing structure is what determines the breakeven. Anthropic charges 1.25× standard input on cache-write for 5-minute TTL (2× for 1-hour) and 0.1× on cache-read. OpenAI charges no write premium and 0.5× on cache-read. Gemini charges 0.25× on cache-read plus a per-minute storage fee.
The payback math, expressed as the number of reads needed to break even on a single write:
| Provider | Write premium | Read discount | Breakeven (reads needed) |
|---|---|---|---|
| Anthropic 5-min | 1.25× | 0.10× | 0.28 (pays back at 1 read) |
| Anthropic 1-hour | 2.00× | 0.10× | 1.11 (pays back at 2 reads) |
| OpenAI | 1.00× | 0.50× | 0.00 (automatic) |
| Gemini | ~1.00× + storage | 0.25× | depends on TTL/storage cost |
The Anthropic 5-min cache pays for itself on the second call, full stop. The 1-hour cache needs three calls to clearly win, which is a typical session length. OpenAI's prefix cache is free in the sense that there is no write premium; the catch is that the prefix has to actually match, which means consistent prompt construction.
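The breakeven column reduces to one line of arithmetic: the extra write cost divided by the saving per read. A sketch that reproduces the table's values:

```python
def breakeven_reads(write_premium: float, read_discount: float) -> float:
    """Reads needed before the extra cache-write cost is recovered by read savings.

    Extra cost of a cache write = (write_premium - 1) x prefix cost.
    Saving per cache read       = (1 - read_discount) x prefix cost.
    """
    return (write_premium - 1.0) / (1.0 - read_discount)

print(breakeven_reads(1.25, 0.10))  # Anthropic 5-min  -> 0.28
print(breakeven_reads(2.00, 0.10))  # Anthropic 1-hour -> 1.11
print(breakeven_reads(1.00, 0.50))  # OpenAI           -> 0.0
```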
Worked example. A research agent has a 30k-token stable prefix (system prompt 4k, tool block 1.2k, retrieved filing 25k). Naive cost at $3/M input is $0.090 per call. With Anthropic 5-min caching on the full prefix, the first call costs $0.090 × 1.25 = $0.1125 (write); subsequent calls within 5 minutes cost $0.090 × 0.10 = $0.0090 each (read). Across a 6-call session: 1 × $0.1125 + 5 × $0.009 = $0.158 vs $0.540 naive. 71% saving.
Across a 50,000-call/month workload with a typical 80% hit rate (most calls land within an active session), the effective per-call cost is roughly 0.2 × $0.1125 + 0.8 × $0.009 ≈ $0.030 vs $0.090 naive, a 3× saving on input cost alone, with no accuracy delta and no behavioural change; longer sessions and higher hit rates push the saving toward the 10× per-read ceiling. The Token-Cost Optimizer computes this breakeven for your specific prompt shape and traffic profile.
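A sketch of the session and hit-rate arithmetic above, assuming the same $3/M input rate and 30k-token prefix:

```python
RATE = 3.00 / 1_000_000   # $ per input token (assumed Sonnet-tier rate)
PREFIX_TOKENS = 30_000

cold = PREFIX_TOKENS * RATE          # $0.090 per uncached call
write = cold * 1.25                  # $0.1125, Anthropic 5-min write premium
read = cold * 0.10                   # $0.0090 per cache read

# 6-call session: one write, five reads
session = write + 5 * read           # $0.1575 vs $0.54 naive (~71% saving)

# Steady-state workload at an 80% hit rate: misses pay the write premium
hit_rate = 0.80
effective = hit_rate * read + (1 - hit_rate) * write   # ~$0.030 per call, ~3x vs naive

print(round(session, 4), round(effective, 4))
```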
The structural traps are documented but bite hard. A timestamp in the system prompt invalidates the prefix on every call. A reordered tool registry invalidates the cache for the rest of the session. An Anthropic cache breakpoint below the minimum cacheable block size (1024 tokens for Sonnet/Opus-class models) is silently ignored, with no error, no warning, no caching; Gemini enforces its own minimum cache size (4096 tokens on 2.5 Pro).
Strategy 2: distillation, the long-game investment
Distillation trains a smaller "student" model to imitate a larger "teacher" on a specific task. The student costs 5–20× less per inference; the training cost is one-time. The economic question is whether the lifetime inference savings exceed the training cost plus the accuracy delta.
The mechanics. Collect 5,000–20,000 (input, teacher-output) pairs from production traffic. Fine-tune a smaller model (through 2026, the practical targets are GPT-4o-mini, Claude Haiku 4.x, or Gemini 2.5 Flash) using either the provider's hosted fine-tuning API or a self-hosted Llama 3 / Mistral derivative. Evaluate on a held-out task-specific eval. If the student stays within 1–3% accuracy of the teacher, ship the student.
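A minimal sketch of the hosted fine-tuning route via OpenAI's API; the task prompt, student snapshot id, and file path are placeholders, and the evaluation step is left as a comment because it is task-specific.

```python
import json
from openai import OpenAI

client = OpenAI()

# 1. Dump (input, teacher_output) pairs from production traffic as chat-format JSONL.
def write_training_file(pairs, path="distill_train.jsonl"):
    with open(path, "w") as f:
        for user_input, teacher_output in pairs:
            record = {
                "messages": [
                    {"role": "system", "content": "Classify the request into one of the routing labels."},  # placeholder task prompt
                    {"role": "user", "content": user_input},
                    {"role": "assistant", "content": teacher_output},
                ]
            }
            f.write(json.dumps(record) + "\n")
    return path

# 2. Upload the file and start a fine-tuning job on a smaller student model.
def start_distillation(path):
    training_file = client.files.create(file=open(path, "rb"), purpose="fine-tune")
    job = client.fine_tuning.jobs.create(
        training_file=training_file.id,
        model="gpt-4o-mini-2024-07-18",  # placeholder student snapshot id
    )
    return job.id

# 3. Evaluate the resulting model on the held-out task eval; ship only if it stays
#    within the 1-3% accuracy gate against the teacher.
```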
The cost math, for a stable task running 50k calls/month over a 12-month deployment:
| | Per-call cost | 12-month total | One-time cost |
|---|---|---|---|
| Sonnet teacher | $0.018 (cached) | $10,800 | n/a |
| Haiku student | $0.0024 | $1,440 | $1,500 fine-tune |
| Saving | ~87% | $7,860 (net of $1,500 fine-tune) | payback at month 2 |
The break-even is two months, not two years. The condition is "stable task," meaning the prompt shape, output schema, and acceptance criteria do not change inside the 12-month window. If they do change, the student has to be re-distilled, and the payback resets.
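The month-two figure falls out of a one-line payback formula; a sketch using the table's inputs:

```python
def payback_calls(teacher_cost: float, student_cost: float, finetune_cost: float) -> float:
    """Number of student inferences needed before the one-time fine-tune cost is recovered."""
    return finetune_cost / (teacher_cost - student_cost)

calls = payback_calls(teacher_cost=0.018, student_cost=0.0024, finetune_cost=1_500)
print(calls)            # ~96,000 calls
print(calls / 50_000)   # ~1.9 months at 50k calls/month -> payback inside month 2
```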
When distillation loses. On tasks that need the full reasoning depth of the teacher (multi-step trade reasoning with novel facts, complex compliance interpretation), the student typically fails the 2% accuracy gate. The chain-of-thought literature (e.g. Wang et al. 2022 on self-consistency) shows that answer quality on multi-step tasks depends on the reasoning path, not just the final token: a student can match the teacher's answers without matching its reasoning, and the unmatched reasoning is the part that fails on edge cases.
The pragmatic split that works for finance LLMs: distill the routing-and-classification stages (which are stable, narrow, and high-volume), keep the teacher for the synthesis stage (which is broader, lower-volume, and where the 2% accuracy gate matters most). This compresses the cost of the high-volume early stages without trading off the synthesis quality.
Strategy 3: RAG, the right tool for changing knowledge
RAG retrieves a small subset of a large corpus per query and injects only that subset into the prompt. The cost saving comes from not paying the per-token rate on the bytes the model does not need.
The classic RAG pipeline: chunk the corpus into 200–800-token passages, embed each chunk with a small model (text-embedding-3-small at $0.02/M tokens or open-source bge-small at zero marginal cost), store the embeddings in a vector index, retrieve top-k against the query embedding, inject the retrieved chunks into the prompt.
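A minimal sketch of that pipeline using OpenAI embeddings and a brute-force in-memory index; the crude whitespace chunker, the 500-token chunk size, and k=8 mirror the numbers in this section, and a production deployment would swap the numpy scan for a proper vector index.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()
EMBED_MODEL = "text-embedding-3-small"

def chunk(text: str, max_tokens: int = 500) -> list[str]:
    # crude whitespace chunking; ~500 words stands in for the 200-800-token passages
    words = text.split()
    return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model=EMBED_MODEL, input=texts)
    return np.array([d.embedding for d in resp.data])

def build_index(corpus: str):
    chunks = chunk(corpus)
    return chunks, embed(chunks)

def retrieve(query: str, chunks: list[str], vectors: np.ndarray, k: int = 8) -> list[str]:
    q = embed([query])[0]
    # cosine similarity (robust whether or not the embeddings are pre-normalized)
    sims = vectors @ q / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q))
    top = np.argsort(-sims)[:k]
    return [chunks[i] for i in top]

def build_prompt(query: str, retrieved: list[str]) -> str:
    context = "\n\n".join(retrieved)
    return f"Answer using only the context below.\n\n{context}\n\nQuestion: {query}"
```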
The cost math, for a 1M-token corpus updated weekly, queried at 50k/month with k=8 chunks per query at 500 tokens each:
| | Per-call input | Monthly (50k calls) |
|---|---|---|
| Stuff full corpus | $3.00 (1M tokens) | $150,000 |
| RAG, top-8 chunks | $0.012 (4k tokens) | $600 |
| Embedding/index | n/a | ~$50 (initial embed + weekly re-embeds) |
The corpus-stuffing row is rhetorical (nobody actually stuffs a 1M-token corpus, because it does not fit in any context window), but the comparison frames the saving. RAG converts a per-call input cost that grows linearly with corpus size into one that is constant in corpus size: k chunks per query, however large the corpus gets.
When RAG loses to fine-tuning. When the corpus is small (say, under 200k tokens), stable, and the relevance gate is hard to define, fine-tuning the model on the corpus directly outperforms a top-k retrieval. The break-even is roughly 100k queries against the corpus before fine-tuning's training cost amortizes; below that, RAG wins on flexibility and pay-per-call economics. Above that, fine-tuning wins on per-call cost and on tasks where the corpus knowledge is in the model's weights rather than retrieved at inference time. The full break-even derivation lives in The RAG Cost Model: When RAG Loses to Fine-Tuning.
When RAG loses to a long-context call. For a 50k-token corpus queried 200 times a day, a long-context call with prompt caching often beats RAG on cost: the cached corpus sits in the cache for the full session, no embedding pipeline runs, no retrieval-quality bug ever fires. The cutover happens around 100–200k tokens of corpus size, where the prompt cache stops fitting cleanly.
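The cutovers in this section and the previous one can be collapsed into a rough routing rule; the thresholds below are this article's ballpark figures, not universal constants.

```python
def knowledge_strategy(corpus_tokens: int, changes_weekly: bool, lifetime_queries: int) -> str:
    """Rough decision rule for corpus grounding, using this article's cutover figures."""
    if corpus_tokens <= 200_000 and not changes_weekly:
        # small, stable corpus: long-context + caching below ~100-200k tokens,
        # fine-tuning once lifetime query volume amortizes the training cost
        if lifetime_queries >= 100_000:
            return "fine-tune"
        return "long-context + prompt caching"
    # large or fast-changing corpus: retrieval keeps per-call cost constant
    return "RAG"
```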
How they compound
The three strategies stack cleanly. A real production pipeline running all three:
- RAG layer retrieves the relevant 4k of corpus context per query.
- Prompt caching holds the 5k of stable system prompt + tool block warm across the session, plus the per-query retrieved chunks for any immediate follow-up.
- Distilled student model runs the routing and classification stages; the teacher runs the synthesis stage.
Worked numbers, 50k calls/month financial-research agent, 20% routing/classify, 80% synthesis:
| Layer | Naive $/call | With layer | Composite |
|---|---|---|---|
| Baseline (Sonnet, full corpus stuffed) | $0.180 | n/a | n/a |
| + RAG | $0.180 | $0.030 | $0.030 |
| + Prompt caching (80% hit rate) | $0.030 | $0.0072 | $0.0072 |
| + Distilled student on 20% routing | $0.0072 | $0.0058 weighted | $0.0058 |
| Monthly cost (50k calls) | $9,000 | | $290 |
That is a 31× compression. The numbers are not exotic; they reproduce on workloads I have run, plus or minus a factor that depends on the specific model and corpus. The Agent Cost Envelope Calculator takes per-stage model choices, traffic shape, and caching parameters and outputs the composite envelope, including the P50/P95 cost (because cost has a tail too: retries and cache misses push the per-call cost above the median in a long-tailed pattern).
Measurement: instrumenting the savings
The strategies above produce real savings only if you can measure them. The instrumentation that earns its complexity:
Per-call cost-attribution log. Every inference call records its input token count, output token count, cached vs uncached portion, and model used. The cost is computed at log time using the current rate card and stored alongside the call. End-of-day aggregation by route, prompt-version, and user yields the cost per workload, which is the number that tells you which optimization to invest in next.
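A sketch of that per-call record; the rate-card values are illustrative placeholders rather than current list prices.

```python
import json
import time

# illustrative rate card, $ per million tokens; keep it versioned alongside the code
RATES = {
    "sonnet":        {"input": 3.00, "cached_input": 0.30, "output": 15.00},
    "haiku-student": {"input": 0.80, "cached_input": 0.08, "output": 4.00},
}

def log_call(route: str, prompt_version: str, model: str,
             input_tokens: int, cached_tokens: int, output_tokens: int,
             path: str = "llm_costs.jsonl") -> dict:
    r = RATES[model]
    uncached = input_tokens - cached_tokens
    cost = (uncached * r["input"] + cached_tokens * r["cached_input"]
            + output_tokens * r["output"]) / 1_000_000
    record = {
        "ts": time.time(),
        "route": route,
        "prompt_version": prompt_version,
        "model": model,
        "input_tokens": input_tokens,
        "cached_tokens": cached_tokens,
        "output_tokens": output_tokens,
        "cost_usd": round(cost, 6),
    }
    with open(path, "a") as f:               # append-only log; aggregate at end of day
        f.write(json.dumps(record) + "\n")
    return record
```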
Cache-hit-rate dashboard. For Anthropic, the API response includes cache_creation_input_tokens and cache_read_input_tokens fields; the ratio of cache-read tokens to total input tokens gives the hit rate. For OpenAI, the response includes prompt_tokens_details.cached_tokens. For Gemini, the CachedContent handle exposes its own usage counts. Wire all three into the same dashboard so you can see the hit rate by route and notice the day someone added a timestamp to the system prompt and broke the cache.
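A sketch of normalizing the three providers' cached-token fields into one token-level hit-rate metric; the field paths follow the provider docs cited in the references, so verify them against your SDK version.

```python
def cached_and_total_tokens(provider: str, response) -> tuple[int, int]:
    """Return (cached_input_tokens, total_input_tokens) from a provider response object."""
    if provider == "anthropic":
        u = response.usage
        cached = u.cache_read_input_tokens or 0
        total = (u.input_tokens or 0) + (u.cache_creation_input_tokens or 0) + cached
        return cached, total
    if provider == "openai":
        u = response.usage
        return u.prompt_tokens_details.cached_tokens, u.prompt_tokens
    if provider == "gemini":
        u = response.usage_metadata
        return (u.cached_content_token_count or 0), u.prompt_token_count
    raise ValueError(f"unknown provider: {provider}")

def token_hit_rate(cached: int, total: int) -> float:
    # share of input tokens served from cache; track it per route so a broken prefix
    # (e.g. a timestamp added to the system prompt) shows up as a sudden drop
    return cached / total if total else 0.0
```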
A/B harness for optimization rollouts. Each optimization (caching wired, RAG narrowed, student model deployed) ships behind a percentage rollout. The cost-per-call delta and the accuracy delta are measured against the held-out group. Optimizations that look good in isolation sometimes regress accuracy at the 5–10% level; the A/B catches it. The Token-Cost Optimizer provides the upfront calculator; the A/B is the production validator.
The cost of the instrumentation is small (one engineer-week to wire, $20–50/month to operate). The value is that "we cut our LLM bill by 40%" stops being a story and starts being a graph that survives a finance review.
Decision rules, in order
A short prescriptive sequence for the team that has not optimized cost yet.
First. Wire provider-side prompt caching on the stable prefix (system prompt + tool block + any retrieved corpus that is stable inside the session). This is the cheapest engineering win in the stack: usually a one-day change, with measurable cost reduction on day two. If your prompts have timestamps or other instability, fix that first; the cache is silently disabled until you do.
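A cheap way to catch that instability before it silently disables the cache is to fingerprint the constructed prefix on every call; a sketch, assuming the cacheable prefix is assembled as a single string before the API call.

```python
import hashlib
from collections import Counter

prefix_hashes: Counter = Counter()

def check_prefix_stability(prefix: str) -> str:
    """Fingerprint the cacheable prefix; a stable workload should see one dominant hash."""
    digest = hashlib.sha256(prefix.encode()).hexdigest()[:12]
    prefix_hashes[digest] += 1
    # many distinct hashes on one route usually means a timestamp, request id,
    # or reordered tool registry is leaking into the "stable" prefix
    if len(prefix_hashes) > 5:
        print(f"warning: {len(prefix_hashes)} distinct prefix variants observed")
    return digest
```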
Second. Add RAG when the corpus exceeds what cleanly fits in the cached prefix (typically above 100–200k tokens), or when the corpus changes faster than the cache TTL can amortize.
Third. Distill the high-volume narrow tasks (routing, classification, format conversion) once they have stabilized. Do not distill until the prompt and acceptance criteria have been stable for at least two months; the distillation cost is wasted if you re-train every sprint.
The order matters because each step builds on the previous. RAG is harder to wire correctly without a stable cached prefix to anchor on. Distillation is hardest to evaluate without a stable RAG-augmented baseline to compare the student against. Doing them out of order produces a pipeline that is locally optimized but globally fragile.
Connects to
- Token-Cost Optimizer computes per-call cost across naive, cached, and RAG-augmented pipelines for your specific prompt shape.
- Agent Cost Envelope Calculator outputs the composite envelope when you stack caching, distillation, and RAG.
- Caching Strategies for LLM Pipelines covers the three caching layers (provider, application, semantic) in the depth this article elides.
- The RAG Cost Model: When RAG Loses to Fine-Tuning develops the RAG-vs-fine-tune break-even rigorously.
References
- Anthropic. "Prompt Caching." docs.anthropic.com/en/docs/build-with-claude/prompt-caching, accessed April 2026. Documents the 5-minute and 1-hour TTL options, the 1.25× and 2.0× write premiums, the 0.1× read discount, and the minimum block sizes by model tier.
- OpenAI. "Prompt Caching." platform.openai.com/docs/guides/prompt-caching, accessed April 2026. Documents the automatic prefix-matching behaviour and the 0.5× cached-input pricing.
- Google. "Context Caching." ai.google.dev/gemini-api/docs/caching, accessed April 2026. Documents the explicit CachedContent handle, the per-minute storage fee, and the minimum cache size.
- Hinton, G., Vinyals, O., & Dean, J. (2015). "Distilling the Knowledge in a Neural Network." NeurIPS Deep Learning Workshop. The original distillation paper. The framing translates directly to the 2026 LLM context.
- Wang, X., Wei, J., Schuurmans, D., et al. (2022). "Self-Consistency Improves Chain of Thought Reasoning in Language Models." arXiv:2203.11171. Establishes the chain-of-thought sampling regime and the dependence of multi-step answer quality on the reasoning path, which is the basis for the synthesis-vs-routing split described above.
- Lewis, P., Perez, E., Piktus, A., et al. (2020). "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks." NeurIPS 2020. The original RAG formulation; the cost-saving framing in this article is a direct application of the architectural separation it proposed.