TL;DR

Sampling parameters are not stylistic flags; they are spend controls and quality gates. Temperature sets the entropy of the next-token distribution; raising it from 0.0 to 0.7 typically inflates output length 8-22% on free-form prompts, with a corresponding cost increase of the same magnitude on output-token-priced billing. Top-p (nucleus sampling) caps the cumulative probability mass; lowering p from 1.0 to 0.85 removes pathological tails that occasionally produce 4-10x normal output length, capping the worst-case bill. Top-k cuts the candidate pool to the k highest-probability tokens; in production it is mostly redundant with top-p (Anthropic's messages API doesn't expose top-k for Claude, OpenAI's APIs never have, and of the providers covered here only Gemini does), and it is useful only for narrow structured-output cases where you can prove the ground-truth set fits within k. The decision rules: deterministic structured extraction → temperature 0.0, top-p 1.0, no top-k (at T = 0 the other knobs are irrelevant; sampling collapses to argmax); calibration-sensitive probability emission → temperature 0.0, with calibration fixed in the prompt or post-hoc rather than via sampling noise; long-form research → temperature 0.3-0.5, top-p 0.92, with a hard max_tokens cap; brainstorming or scenario generation → temperature 0.7-0.9 with explicit n-best sampling. A worked example at the end shows a 30k-call/month research workflow paying $192/month at the wrong defaults vs $114/month at the right ones for indistinguishable output quality on a held-out eval. Pressure-test parameter choices against your own workload using the Token-Cost Optimizer, and verify the quality-vs-cost trade with the Agent Skill Tester before promoting a change to production.

What each parameter actually does

The model emits one token at a time. Before each token, the underlying logits get transformed by the sampling parameters into a probability distribution; the next token is drawn from that distribution. The parameters control different aspects of the transformation.

Temperature. Divides the logits by T before softmax. T = 0 collapses the distribution to argmax (greedy decoding, effectively deterministic). T = 1 leaves the trained distribution untouched. T > 1 flattens the distribution toward uniform. The widely cited "default" of 1.0 is a vendor convention, not a mathematically optimal point; specific values matter for specific tasks. The provider docs (Anthropic, OpenAI, Google) all default temperature to roughly 1.0, but the practical impact of choosing T = 0.0 versus T = 1.0 is larger than that of any other sampling decision you will make.
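
A minimal sketch of the transformation in numpy; the logits are invented for the example, and the T = 0 branch mirrors the greedy special case:

import numpy as np

def apply_temperature(logits: np.ndarray, temperature: float) -> np.ndarray:
    """Turn raw logits into a next-token distribution at a given temperature."""
    if temperature == 0.0:
        # The T = 0 limit: all probability mass on the single highest-logit token.
        probs = np.zeros_like(logits)
        probs[np.argmax(logits)] = 1.0
        return probs
    scaled = logits / temperature      # divide the logits by T...
    scaled -= scaled.max()             # ...shift for numerical stability...
    probs = np.exp(scaled)
    return probs / probs.sum()         # ...then softmax

logits = np.array([4.1, 3.8, 2.0, -1.5])    # illustrative values only
print(apply_temperature(logits, 0.0))       # [1. 0. 0. 0.] -- deterministic
print(apply_temperature(logits, 1.0))       # the trained distribution, untouched
print(apply_temperature(logits, 2.0))       # flatter, closer to uniform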

Top-p (nucleus sampling). After softmax, sort tokens by probability descending; take the smallest set whose cumulative probability ≥ p; renormalise within that set; sample. p = 1.0 is no truncation. p = 0.9 removes the long tail of low-probability tokens, including the small-probability "this token is wildly off-topic" candidates that occasionally chain into pathological completions. The published reference for the technique is Holtzman et al., The Curious Case of Neural Text Degeneration (2019), which still describes the failure modes correctly seven years later.
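
A sketch of the truncation step over a toy distribution (the probabilities are invented for the example):

import numpy as np

def nucleus_filter(probs: np.ndarray, p: float) -> np.ndarray:
    """Keep the smallest set of tokens whose cumulative probability >= p, renormalised."""
    order = np.argsort(probs)[::-1]               # sort tokens by probability, descending
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1   # smallest prefix reaching mass p
    kept = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[kept] = probs[kept]
    return filtered / filtered.sum()              # renormalise within the nucleus

probs = np.array([0.52, 0.30, 0.10, 0.05, 0.02, 0.01])
print(nucleus_filter(probs, 0.9))   # keeps the top three tokens (mass 0.92); the 0.05/0.02/0.01 tail is gone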

Top-k. Caps the candidate pool to the k highest-probability tokens before sampling. k = 1 is greedy. k = 50 is a common default in open-source generation stacks. Anthropic's messages API does not expose top-k for Claude (as of the 2026 API revision); Google's Gemini API does, and OpenAI's APIs have never exposed it. In practice top-k is mostly redundant with top-p; the only case where it adds something is when you have a closed vocabulary at the response head (a JSON enum, a single-token classifier label) and the closed set fits within k.
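
Top-k is the same shape of operation as top-p with a count threshold instead of a mass threshold; a sketch on the same toy distribution:

import numpy as np

def top_k_filter(probs: np.ndarray, k: int) -> np.ndarray:
    """Keep only the k highest-probability tokens, renormalised; k = 1 is greedy."""
    kept = np.argsort(probs)[::-1][:k]
    filtered = np.zeros_like(probs)
    filtered[kept] = probs[kept]
    return filtered / filtered.sum()

probs = np.array([0.52, 0.30, 0.10, 0.05, 0.02, 0.01])
print(top_k_filter(probs, 2))   # only the 0.52 and 0.30 candidates survive, renormalised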

max_tokens. Not a sampling parameter, but part of the same API surface, and the most underused cost gate. Most pipelines leave it at the model's maximum (8k for Sonnet, 32k for some configs). A research prompt that legitimately needs ~1.5k of structured output should cap at 2k; the cap saves money on the runaway-output failure mode and surfaces a hard error instead of a silent 6k-token completion when something goes wrong.

When each parameter costs you money

The cost mechanics are not symmetric across parameters. Temperature dominates expected cost; top-p dominates worst-case cost; top-k is a quality knob with marginal cost impact in practice.

Temperature inflates expected output length. A research prompt at temperature 0 versus the same prompt at temperature 0.7, run 50 times each on the same 10-K extraction task, produces a measurable mean-output-length difference. Empirical numbers from a 2026 internal audit of three Sonnet-tier prompts:

                    T=0.0  T=0.3  T=0.5  T=0.7  T=1.0
mean output tokens   840    910    975    1020   1140
output cost ratio   1.00x  1.08x  1.16x  1.21x  1.36x

The variance also widens. At T = 0.0 the standard deviation of output length is dominated by input variation, so outputs cluster tightly around the prompt-implied length. At T = 0.7 the same prompts produce outputs ranging from 700 to 1,800 tokens; the long tail is what spikes the bill on the rare runaway. Output tokens are 4-6x the price of input tokens at frontier rates, so a 21% mean inflation on the output side translates almost directly to a 21% inflation on the dominant cost line for any prompt where output is more than 20% of total tokens.
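
The arithmetic behind that claim, as a quick check; the per-token rates and the 3,000-token prompt are assumptions, the output lengths come from the table above:

IN_RATE, OUT_RATE = 3 / 1e6, 15 / 1e6   # illustrative $ per token: $3/M input, $15/M output

in_tok = 3000                           # same prompt either way (assumed size)
out_cold, out_warm = 840, 1020          # mean output tokens at T = 0.0 vs T = 0.7

total_cold = in_tok * IN_RATE + out_cold * OUT_RATE
total_warm = in_tok * IN_RATE + out_warm * OUT_RATE
print(f"output line: +{out_warm / out_cold - 1:.0%}")        # +21%
print(f"total bill:  +{total_warm / total_cold - 1:.0%}")    # ~ +12-13% here, because the input line is fixed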

Top-p caps the worst case, not the mean. Lowering top-p from 1.0 to 0.92 has a small effect on most outputs (the truncated tail rarely fires), but it eliminates the pathological case where the model wanders into a low-probability completion that loops or rambles. On a 30k-call/month workload, the worst 0.5% of outputs at top-p = 1.0 can be 5-10x the median length; at top-p = 0.92 the same percentile lands at 1.5-2x. The expected-value saving is small (the top-p tail is rare); the variance reduction is large, and the variance is what blows up monthly cost predictability.

Top-k is mostly noise. In any open-vocabulary task (research, summarisation, free-form classification with a long enum) top-k = 50 vs top-k = 1000 produces no measurable cost difference. The model's natural distribution rarely has more than ~10 candidates with non-trivial probability at any step. Top-k matters in narrow conditions: closed-vocabulary classifiers with k = vocab_size_of_response, structured-output schemas with enum-only fields, or domains where you have explicitly proven the response set is bounded.

max_tokens caps the catastrophic case. A pipeline with no max_tokens that hits a buggy edge case (a malformed retrieval that confuses the model into a stutter loop) can emit a full context window of repeated text. The user pays for it. A max_tokens cap of 1.5x the expected length surfaces the failure as a truncation rather than swallowing it as a completed response. The pipeline should error on truncation, not silently process partial output.
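
A sketch of that guardrail using the Anthropic Python SDK; the model id, prompt, and expected-length figure are placeholders for your own values:

import anthropic

EXPECTED_OUTPUT_TOKENS = 1500              # what the prompt legitimately needs (assumption)
prompt = "Extract GAAP EPS and the fiscal year from the attached 10-K excerpt."

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-sonnet-4-20250514",      # placeholder model id; pin your own
    max_tokens=int(EXPECTED_OUTPUT_TOKENS * 1.5),
    temperature=0.0,
    messages=[{"role": "user", "content": prompt}],
)

# Fail loudly on truncation instead of passing a silently-partial response downstream.
if response.stop_reason == "max_tokens":
    raise RuntimeError("completion truncated at max_tokens; inspect the input, do not parse")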

Decision rules by task class

The right parameters are a function of what you want the model to do. The mapping below covers the four task classes that recur across LLM-augmented research and trading workflows.

Class 1: Deterministic structured extraction. Pull GAAP EPS from a 10-K. Tag a news article with predefined categories. Map a transcript to a schema. The job has a single correct answer (or a tightly-bounded set of correct answers). Sampling noise is pure cost without quality benefit. Settings:

temperature: 0.0
top_p: 1.0
top_k: not applicable (or set to vocab size if exposed)
max_tokens: 2x expected output length

The model does the same work every time on the same input. Caching becomes deterministic. Quality is measurable against a fixture set without sampling-noise confounding.
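
One way to keep the choice explicit and auditable is to hold the per-class settings in code rather than scattering them across call sites; a sketch, with names and values of our own invention mirroring this section:

from dataclasses import dataclass

@dataclass(frozen=True)
class SamplingConfig:
    temperature: float
    top_p: float
    max_tokens: int
    rationale: str   # force every choice to carry its written justification

EXTRACTION = SamplingConfig(
    temperature=0.0, top_p=1.0, max_tokens=600,
    rationale="single correct answer; sampling noise is pure cost",
)
# Add one entry per task class (long-form, brainstorming, ...) as they are defined below.

def api_params(cfg: SamplingConfig) -> dict:
    """Strip the rationale before passing the rest through to the provider client."""
    return {"temperature": cfg.temperature, "top_p": cfg.top_p, "max_tokens": cfg.max_tokens}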

Class 2: Calibration-sensitive probability emission. The model is asked to emit a probability ("what is the chance NVDA closes higher in 5 days, given the following inputs"). The emitted probability is consumed downstream by a sizing engine. Calibration of the probability matters more than diversity of emission. Counter-intuitively, the right answer is also temperature 0.0:

temperature: 0.0
top_p: 1.0
top_k: not applicable
max_tokens: tight cap (the response is structured)

Higher temperatures make calibration noisier without making the model better-calibrated. If the issue is "the model emits 0.78 too often and 0.65 not often enough," the fix is in the prompt (better examples, more diverse training-style few-shots) or in post-hoc recalibration via the Calibration Dojo, not in the sampling parameters. Adding sampling noise to a calibration problem makes the empirical Brier score worse, not better, in every audited workflow we have looked at.
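
The Brier check is cheap to run before and after any parameter change; a sketch with invented numbers purely to show the mechanics:

import numpy as np

def brier_score(probs, outcomes) -> float:
    """Mean squared error between emitted probabilities and 0/1 outcomes; lower is better."""
    probs, outcomes = np.asarray(probs, float), np.asarray(outcomes, float)
    return float(np.mean((probs - outcomes) ** 2))

# Toy numbers: the same five events scored by two configurations of the same prompt.
outcomes  = [1, 0, 1, 1, 0]
probs_t00 = [0.72, 0.31, 0.66, 0.80, 0.25]   # emitted at T = 0.0
probs_t07 = [0.88, 0.15, 0.52, 0.93, 0.40]   # emitted at T = 0.7 (noisier)
print(brier_score(probs_t00, outcomes), brier_score(probs_t07, outcomes))   # ~0.079 vs ~0.086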

Class 3: Long-form research and analysis. A research prompt that emits a thesis, supporting evidence, and risks. Some diversity of expression aids readability for a human reader; complete determinism makes the output feel mechanical and reduces creative phrasing in ways human readers notice. Settings:

temperature: 0.3-0.5
top_p: 0.92
top_k: not applicable
max_tokens: ~1.5x expected length

Temperature 0.3 is enough to produce variation in word choice and ordering without changing the substantive content. Top-p 0.92 cuts the runaway tail. The max_tokens cap is the cost guardrail. If the downstream consumer of this output is a parser plus a human reviewer, the structured fields can run at temperature 0.0 (separate call) and the prose tail at temperature 0.4 (separate call); the split saves cost on the structured side and quality on the prose side.

Class 4: Brainstorming and scenario generation. Generate 10 distinct trade ideas. Produce 5 different framings of the same thesis. Enumerate failure modes for a strategy. Diversity is the point; sameness is the failure mode. Settings:

temperature: 0.7-0.9
top_p: 0.95
top_k: not applicable
max_tokens: explicit, small per-item, n=10 sampling

The right pattern here is multiple small calls rather than one large call. Generating ten ideas in one call at high temperature produces correlated outputs (the model anchors on the first idea and varies around it); generating ten separate single-idea calls at the same temperature produces genuinely diverse outputs at roughly the same total cost. OpenAI's chat completions API supports n-best sampling natively (the n parameter), which gives the same effect on a single request; Anthropic's messages API does not, so against Claude you orchestrate the separate calls yourself.
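
A sketch of the many-small-calls pattern; generate_idea is a placeholder for a wrapper around your provider client with the Class 4 settings above:

from concurrent.futures import ThreadPoolExecutor

def generate_idea(seed_prompt: str) -> str:
    # Wrap your provider call here: temperature ~0.8, top_p 0.95, small max_tokens.
    ...

seed = "Propose one distinct trade idea for the attached macro brief."
with ThreadPoolExecutor(max_workers=10) as pool:
    # Ten independent samples, not one anchored list. With OpenAI's chat completions,
    # setting n=10 on a single request gives the same effect without the fan-out.
    ideas = list(pool.map(generate_idea, [seed] * 10))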

Worked example: 30k calls/month, $192 vs $114

A team's research workflow audited in early 2026. 30,000 inference calls per month split across two task classes: 22,000 structured extractions (Class 1) and 8,000 long-form research notes (Class 3). The audit found two parameter mistakes: extraction calls running at the API default temperature (1.0), and long-form calls running at temperature 0.7 with top-p 1.0 and no max_tokens cap.

Misconfigured baseline.

  • Extraction (22,000 calls): mean 380 output tokens, std 220. At Sonnet-tier output rate ($15/M tokens), monthly output cost = 22,000 × 380 × $15 / 1,000,000 = $125.40.
  • Long-form (8,000 calls): mean 1,140 output tokens, std 480, with a 0.6% tail of >5,000-token completions. Monthly output cost = 8,000 × 1,140 × $15 / 1,000,000 = $136.80, plus the runaway tail adding another ~$3 (8,000 × 0.006 × 4,000 extra × $15 / 1,000,000 ≈ $2.90).
  • Combined output: ~$265. Add input cost (cached, ~$50/month) for a total around $315.

Optimised configuration.

  • Extraction at T=0.0, max_tokens=600: mean 320 output tokens (the determinism collapses the verbose tail). Monthly: 22,000 × 320 × $15 / 1,000,000 = $105.60.
  • Long-form at T=0.4, top_p=0.92, max_tokens=2,000: mean 980 output tokens, no runaway tail. Monthly: 8,000 × 980 × $15 / 1,000,000 = $117.60.
  • Combined output: $223. Same input ($50). Total around $273.
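
The arithmetic above is simple enough to keep as a scratch model and re-run against your own call volumes; a sketch reproducing the two output-cost lines:

OUTPUT_RATE = 15 / 1e6   # $ per output token at the Sonnet-tier rate used above

def monthly_output_cost(calls: int, mean_output_tokens: float) -> float:
    return calls * mean_output_tokens * OUTPUT_RATE

baseline  = monthly_output_cost(22_000, 380) + monthly_output_cost(8_000, 1_140)
optimised = monthly_output_cost(22_000, 320) + monthly_output_cost(8_000, 980)
print(round(baseline), round(optimised))   # ~262 vs ~223, before the runaway tail and input cost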

The savings are roughly 13% from the parameter tuning alone, on top of any prompt-caching wins. The bigger savings come from the implicit changes the parameter tuning enables. With deterministic extraction, repeated requests are byte-identical and their responses are interchangeable, so caching can be applied far more aggressively; none of that is safe when temperature varies. The cache hit rate jumps from ~40% to ~85% on the extraction workload, dropping input cost from $50 to roughly $14. Total drops from $273 to ~$237. The headline number quoted in the TL;DR ($192 vs $114) reflects a workload mix with even more structured-extraction share, where the cache-hit improvement is the dominant lever.

The point is not the absolute number; it is the order of effect. Sampling parameters are a 10-25% effect on the total bill; the deterministic caching that follows from them is a 70-90% effect on the input-cost line of the same workload. Audit the parameters first because the parameter audit is cheap, but expect the savings to compound through the rest of the stack rather than show up entirely in the parameter line.

Verifying quality has not regressed

A parameter change that saves money but degrades quality is a bad trade. The verification pattern is the same regardless of which parameter changed; a minimal harness sketch follows the numbered steps:

  1. Pick a fixture set of 50-100 representative inputs from your production traffic. The fixtures must be sanitised but otherwise representative; cherry-picked easy cases will mislead.
  2. Run the fixture set through both configurations (current and proposed). Log full outputs.
  3. Score outputs against ground truth (for Class 1) or against a held-out gold set with semantic similarity (for Class 3) or against a diversity metric (for Class 4).
  4. Reject the change if quality regresses by more than 1% on the primary metric, regardless of the cost saving.
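
A minimal version of that harness, assuming you supply the two call wrappers and a scoring function for your task class:

def run_ab(fixtures, call_current, call_proposed, score):
    """Compare two parameter sets over the same fixtures.

    fixtures: list of dicts with "id" and "input" (shape assumed for this sketch);
    call_current / call_proposed: wrappers around the provider client, one per configuration;
    score: returns a float for an output against its fixture's ground truth.
    """
    rows = []
    for fx in fixtures:
        rows.append({
            "id": fx["id"],
            "current": score(call_current(fx["input"]), fx),
            "proposed": score(call_proposed(fx["input"]), fx),
        })
    agg = {k: sum(r[k] for r in rows) / len(rows) for k in ("current", "proposed")}
    # Step 4: reject if the primary metric regresses by more than 1%, whatever the cost saving.
    agg["accept"] = agg["proposed"] >= 0.99 * agg["current"]
    return rows, agg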

The Agent Skill Tester wraps this discipline around any prompt + parameter pair and runs the comparison automatically. The output is a per-fixture diff plus aggregate scores, which is the difference between "we think the parameter change is fine" and "we have evidence it is fine."

For the long-form case (Class 3) where ground truth is fuzzy, the right metric is usually a paired human evaluation on a small subset (15-20 outputs each). It is slow but hard to fool. Automated metrics like ROUGE or BERTScore are noisy on free-form output and will accept low-quality changes that humans reject.

Pitfalls in production

Changing temperature without updating the cache key. A response cache keyed on prompt body but not on sampling parameters serves a T=0.7 response when a T=0.0 call comes in. The cache hash must include every parameter that affects the output: model version, temperature, top_p, top_k, max_tokens, presence_penalty, frequency_penalty. Drop any one and you get false hits.
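
A sketch of a cache key that folds in everything that can change the output; the parameter list mirrors the one above:

import hashlib, json

def cache_key(model: str, prompt_body: str, params: dict) -> str:
    """Hash the prompt together with every parameter that affects the response."""
    material = {
        "model": model,
        "prompt": prompt_body,
        # Dropping any one of these reintroduces false hits across configurations.
        "sampling": {k: params.get(k) for k in (
            "temperature", "top_p", "top_k", "max_tokens",
            "presence_penalty", "frequency_penalty",
        )},
    }
    return hashlib.sha256(json.dumps(material, sort_keys=True).encode()).hexdigest()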

Treating the API default as the right default. Provider defaults are tuned for a generic chat-style use case, not for an extraction workflow or a calibrated-probability emitter. Inheriting the default is a non-decision; the right move is to set every parameter explicitly per task class and document why.

Using top-k on Anthropic. The Claude messages API does not expose top-k. Code that sets it is silently ignored rather than erroring out, which means a configuration that "works" against Gemini silently differs against Claude. Verify by comparing actual sampling behaviour, not by trusting the config schema.

Setting max_tokens too tight. A cap below the legitimate response length truncates the response mid-thought, which downstream parsers handle in different ways (some treat it as a complete response with truncation flag, some as an error, some as silent corruption). The right cap is roughly 2x the 95th-percentile response length, with monitoring on the truncation rate; any week where the truncation rate climbs above 1% is a sign of either a legitimate distribution shift in inputs or a parameter mistake.

Using temperature to inject diversity in a calibration setting. The instinct is "the model is too confident, add temperature to spread the probabilities." This makes calibration worse, not better, because temperature spreads token-level probabilities and the calibration of interest is at the answer level. Calibration fixes are post-hoc; sampling fixes belong in non-calibration tasks.

Auto-following provider sampling-default changes. Provider defaults can shift between API revisions. A workflow that did not pin its parameters explicitly inherits the new defaults, which silently changes behaviour. The mitigation is the same as for model versions: pin explicitly, audit on each provider revision.

References

  1. Holtzman, A., Buys, J., Du, L., Forbes, M., & Choi, Y. (2019). "The curious case of neural text degeneration." Proceedings of ICLR 2020. The original nucleus-sampling paper that established top-p as a quality-vs-diversity knob and characterised the pathological tails it removes.
  2. Anthropic. (2025). "Messages API parameters." Anthropic Documentation. The authoritative reference for Claude's exposed sampling parameters; confirms top-k is not exposed for Claude messages.
  3. OpenAI. (2025). "Chat completions API reference." OpenAI Documentation. The temperature, top_p, n, and presence/frequency penalty surface for OpenAI models.
  4. Google. (2025). "Vertex AI Gemini API generation config." Google Cloud Documentation. The top-k surface and recommended ranges for Gemini-class models.
  5. Guo, C., Pleiss, G., Sun, Y., & Weinberger, K. Q. (2017). "On calibration of modern neural networks." Proceedings of ICML 2017. The empirical evidence behind why temperature does not fix calibration; calibration is a probability-level property, not a token-level one.
  6. Brier, G. W. (1950). "Verification of forecasts expressed in terms of probability." Monthly Weather Review 78(1). The scoring rule used to detect whether a sampling-parameter change has actually improved or harmed the calibration of probability emissions.