TL;DR

Raw token spend is the wrong thing to optimise. The metric that matters for an LLM-augmented trading workflow is cost per validated trade (CPVT): total LLM spend on a research cohort divided by the count of trades that survived calibration, sizing, and execution gates. A typical solo-research workflow that ingests 200 ideas per week, retains 18 after the research agent, calibrates 11 to a usable confidence, sizes 7 above min-Kelly, and executes 5 (the rest fail edge-after-cost) burns roughly $42 of inference and produces 5 trades. CPVT is $8.40 per trade. The same workflow run with caching disabled, no batch API, and a temperature-0.7 verbose research prompt clocks $187 of inference for the same 5 trades, a $37.40 CPVT, a 4.5x cost ratio for the same downstream output. The framework below is the four-stage funnel (ingest -> research -> calibrate -> size-and-execute), the three cost lines per stage (input tokens, output tokens, retries), and the decision rules for what to cut when CPVT is too high. Plug your own numbers into the Token-Cost Optimizer, check calibration discipline at the third stage with the Calibration Dojo, and pressure-test sizing at the fourth with the Kelly Sizer before declaring a workflow profitable.

What "validated" means in CPVT

A validated trade is one that has cleared four gates and been put on. Anything earlier is intermediate output. The gates, in order: an idea was generated; a research pass produced a probabilistic view; the view passed a calibration check; a sizing engine returned a non-zero allocation that survived the live edge-after-cost test. A workflow that emits ten "high-conviction" research notes that all fail the sizing gate has produced zero validated trades regardless of what the LLM spent.

The denominator matters because LLM cost is a linear function of token usage while the value of the workflow is wildly nonlinear. A research agent that doubles its token usage to triple its hit rate is a clear win; one that doubles its token usage and adds zero validated trades is a clear waste. Token-level metrics cannot distinguish the two; CPVT can.

The numerator is the easy part: every inference call logged to the same store as every trade decision, joined by run_id. The denominator is the discipline part: every gate has to be a real gate, not a soft preference, or the count of "validated trades" inflates and CPVT looks artificially good.
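
A minimal sketch of that join, assuming a hypothetical inference log and trade log that both carry a run_id column; the schema and function names are illustrative, not from any particular framework.

    def cpvt(inference_log, trade_log):
        """Cost per validated trade for one research cohort.

        inference_log: rows with run_id, cost_usd, is_retry (every call, retries included)
        trade_log:     rows with run_id, validated (True only once all four gates cleared)
        """
        total_cost = sum(row["cost_usd"] for row in inference_log)          # numerator: all spend on the cohort
        validated_runs = {row["run_id"] for row in trade_log if row["validated"]}
        return total_cost / len(validated_runs) if validated_runs else float("inf")

    # e.g. $3.20 of logged calls and 5 validated runs -> 0.64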

The four-stage funnel

A repeatable LLM-augmented research workflow has the same shape regardless of asset class. The stages are independently measurable and independently optimisable.

Stage 1: Ingest. Source articles, filings, transcripts, social posts, screener output. The LLM cost here is usually small (classification, deduplication, relevance filtering), often a small model rather than a frontier one. Typical: 200 candidate items per week at 1.2k input + 200 output tokens each on Haiku-tier ≈ $0.11 total.

Stage 2: Research. The model reads the surviving items in depth and emits a structured view: thesis, probability, key risks, invalidation conditions. This stage dominates cost for any non-trivial workflow. Typical: 30 items worked over several sub-question calls each, at roughly 25k input + 1.5k output per call on Sonnet-tier; expect $9-12 for the week once the system prompt and tool block are prompt-cached, more without caching and more again if the output runs verbose.

Stage 3: Calibrate. Each emitted probability is checked against the model's historical accuracy at the same confidence level. The check itself is mostly arithmetic plus a small sanity-check inference (~500 input + 200 output on a small model). Typical: 18 surviving items × small-model cost ≈ $0.06. The output of this stage is the same view annotated with a calibration-adjusted probability.

Stage 4: Size and execute. A deterministic sizing engine (Kelly or fractional-Kelly) takes the calibrated probability, the projected payoff, the live cost, and the existing portfolio state. It returns either a sized allocation or zero. There is rarely any LLM cost in this stage; it is the gate, not the spender.
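
A minimal sketch of that gate, assuming a binary win/lose outcome with a fixed reward-to-risk ratio; the function name, the quarter-Kelly default, and the zero min_edge are illustrative, and a production engine would also net costs out of the payoff and apply portfolio-level constraints.

    def sized_risk_fraction(p_win, reward_to_risk, cost_fraction,
                            kelly_fraction=0.25, min_edge=0.0):
        """Fraction of bankroll to put at risk on one trade, or 0.0 if a gate fails.

        p_win           calibration-adjusted probability the thesis plays out
        reward_to_risk  expected gain if right divided by the loss at the stop if wrong
        cost_fraction   commissions + expected slippage as a fraction of the amount risked
        """
        q = 1.0 - p_win
        edge = p_win * reward_to_risk - q - cost_fraction        # expected return per unit risked, net of costs
        if edge <= min_edge:
            return 0.0                                           # edge-after-cost gate: size it or skip it
        f_star = (p_win * reward_to_risk - q) / reward_to_risk   # classic Kelly with odds b: (p*b - q) / b
        return kelly_fraction * f_star                           # fractional Kelly damps estimation error

    # sized_risk_fraction(0.55, 1.5, 0.05) -> 0.0625: risk 6.25% of bankroll at quarter-Kelly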

The funnel typically looks like:

Stage 1: 200 candidates       cost ≈ $0.11
         |  filter
Stage 2:  30 items researched  cost ≈ $9-12 (with caching)
         |  18 with usable structure
Stage 3:  18 calibrated        cost ≈ $0.06
         |  11 above min-confidence
Stage 4:   7 sized > 0
         |  edge-after-cost gate
        →  5 validated trades   total LLM cost ≈ $9-12

CPVT = $9-12 / 5 ≈ $1.80-$2.40 per trade, a realistic baseline for a clean weekly run; a tightly optimised configuration can come in lower, as the worked example below shows. Most production setups land at $5-15 per trade because of retries, longer prompts, missed cache hits, and verbose responses.
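
The same funnel as a few lines of code, using the midpoint of the cost ranges above; the structure and the conversion-rate printout are the point, not the exact numbers.

    # Weekly funnel: (count surviving each gate, LLM cost attributed to the stage).
    funnel = [
        ("ingested",       200, 0.11),
        ("researched",      30, 10.50),   # midpoint of the $9-12 cached range
        ("calibrated",      18, 0.06),
        ("above_min_conf",  11, 0.00),
        ("sized_gt_zero",    7, 0.00),
        ("executed",         5, 0.00),
    ]

    total_cost = sum(cost for _, _, cost in funnel)
    validated = funnel[-1][1]
    print(f"CPVT = ${total_cost / validated:.2f}")               # ≈ $2.13 at the midpoint

    # Stage-to-stage conversion rates are the early warning; CPVT is the headline.
    for (a, n_a, _), (b, n_b, _) in zip(funnel, funnel[1:]):
        print(f"{a} -> {b}: {n_b / n_a:.0%}")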

The three cost lines per stage

Inside each stage, three lines dominate LLM spend. Track them separately or you cannot tell where to cut.

Input tokens. System prompt, tool definitions, retrieved context, user query. The dominant term in any RAG-style workflow. The first lever is prompt caching (Anthropic, OpenAI, and Gemini all expose it as of 2026); a stable system prompt + tool block is discounted by 70-90% on every call after the first, so the input bill drops by that much only when the prefix dominates the prompt. The second lever is retrieval discipline: shorter, more relevant context beats longer, broader context on both cost and quality. The Token-Cost Optimizer shows the exact $/run impact of cutting 20% of input tokens at a given workload shape; the savings are usually steeper than people expect.
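
A rough sketch of the input-line arithmetic, assuming Anthropic-style multipliers (cache reads at about a tenth of the fresh-input rate, cache writes at 1.25x); the exact multipliers vary by provider and model.

    # $/M tokens, Sonnet-tier assumption: fresh input, cached read, cache write.
    FRESH, CACHE_READ, CACHE_WRITE = 3.00, 0.30, 3.75

    def input_cost(cacheable_tokens, fresh_tokens, calls, cached=True):
        """Total input cost for `calls` calls sharing one cacheable prefix.
        Approximation: one cache write plus a read on every call; close enough for budgeting."""
        if not cached:
            return calls * (cacheable_tokens + fresh_tokens) * FRESH / 1e6
        write = cacheable_tokens * CACHE_WRITE / 1e6            # paid once
        reads = calls * cacheable_tokens * CACHE_READ / 1e6     # paid on every call
        fresh = calls * fresh_tokens * FRESH / 1e6              # retrieved context is never cached here
        return write + reads + fresh

    # 30 research calls, 5k cacheable system+tools, 18k fresh filing text per call:
    print(f"${input_cost(5_000, 18_000, 30, cached=False):.2f}")  # $2.07
    print(f"${input_cost(5_000, 18_000, 30, cached=True):.2f}")   # $1.68

With only 5k of a 23k prompt cacheable, the discount is modest; the saving scales with how much of the prompt is a stable prefix, which is why retrieval discipline and the cache flag are complementary levers rather than substitutes.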

Output tokens. Reasoning text, structured emission, and the often-ignored explanatory tail ("In summary, the company appears..."). Output tokens are 4-6x the price of input tokens at frontier scale. A research prompt that asks the model to "explain its reasoning step by step" adds 800-2,000 output tokens per call for marginal information value over a structured-only response. The decision rule: if your downstream consumer is a parser, ask for the structured fields and nothing else. If it is a human reading the note, the explanatory tail is paying its way; budget it explicitly rather than getting it by accident.
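
The explanatory tail priced out, assuming Sonnet-tier output at $15/M tokens and the research stage's call count; small in absolute terms at this cohort size, but it scales linearly with calls.

    OUTPUT_PRICE = 15.00 / 1e6              # $ per output token, Sonnet-tier assumption

    calls_per_week = 30
    structured_only = 1_200                 # tokens: thesis, probability, risks, invalidation
    with_explanatory_tail = 1_200 + 1_500   # plus step-by-step prose a parser will throw away

    weekly_delta = calls_per_week * (with_explanatory_tail - structured_only) * OUTPUT_PRICE
    print(f"${weekly_delta:.2f} per week for the tail alone")    # $0.68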

Retries. Tool-call failures, parser failures, validation failures. Each retry pays for the full prompt bundle again, and frameworks that auto-retry with no cap make this the most volatile line. A flaky upstream API at a 12% error rate raises your effective cost per call by roughly 12% of the full prompt cost, which is real money on a 25k-input prompt. The mitigation is a hard max_retries per call and a parser-fail counter exported to the same metric surface as the model error rate, so a regression is visible inside an hour, not at end of month.
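
A sketch of that discipline; call_model and log_inference are placeholders for whatever client and logging hook the workflow already uses, not real library functions.

    import time

    MAX_RETRIES = 2     # hard cap: a flaky tool should fail loudly, not silently double the bill

    def call_with_budget(call_model, log_inference, prompt, **kwargs):
        """Call the model with a capped retry budget, logging every attempt separately."""
        last_err = None
        for attempt in range(MAX_RETRIES + 1):
            try:
                response = call_model(prompt, **kwargs)
                log_inference(prompt=prompt, response=response, is_retry=attempt > 0)
                return response
            except Exception as err:                     # tool, parser, and transport failures alike
                log_inference(prompt=prompt, response=None, is_retry=attempt > 0, error=str(err))
                last_err = err
                if attempt < MAX_RETRIES:
                    time.sleep(2 ** attempt)             # modest backoff before the next capped attempt
        raise last_err                                   # surface the failure instead of retrying forever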

Worked example: $42 vs $187 for the same five trades

A solo research operation, week of 2026-04-21. Input cohort: 200 small/mid-cap names from a momentum + earnings-revision screener. Goal: 5 trades for the week.

Optimised configuration.

  • Stage 1 (Haiku-tier, classify-only): 200 × 1.2k in + 200 out at $0.25/$1.25 per M tokens = $0.06 + $0.05 = $0.11
  • Stage 2 (Sonnet-tier, prompt-cached system + tools): 30 × (4k cached system + 1k cached tools + 18k filing/news + 1.2k output) at cached-read $0.30/M, fresh input $3/M, output $15/M = 30 × ($0.0015 + $0.054 + $0.018) = $2.21 on cache hits, plus roughly $0.75 of amortised cache-write overhead (most agents reuse the cache across sub-questions; assume 3 sub-questions per ticker, so each ticker pays the cache write once and reads it on the follow-ups). Stage total ≈ $2.96.
  • Stage 3 (small-model calibrator): 18 × ~700 tokens total ≈ $0.04
  • Stage 4 (deterministic, no LLM): $0
  • Misc retries (3% retry rate on Stage 2 and Stage 3 calls): 3% × ($2.96 + $0.04) ≈ $0.09

Total weekly cohort spend: $3.20 for 5 validated trades = $0.64 CPVT, well below the $8.40 headline from the TL;DR. (The TL;DR's $42 describes a typical production week rather than this fully optimised one: longer prompts, more sub-questions per ticker, imperfect cache hits, and a higher retry rate; the LLM-only number for the configuration above is $3.20.)

Misconfigured baseline (same workflow, same model, same screener output).

  • Prompt caching disabled (a flag flip).
  • Verbose chain-of-thought response asked for in the research prompt: 1.2k → 4.5k output per call.
  • No max_retries cap; tool flakiness at 14% causes silent doubles on a meaningful subset.
  • Stage 1 misconfigured to use Sonnet-tier instead of Haiku-tier.
  • The calibration step (Stage 3) is skipped, so 18 ideas reach Stage 4 instead of 11; the sizing gate cuts the cohort to the same 5 in the end.

Recompute:

  • Stage 1: 200 × 1.4k in + 200 out at Sonnet rates ($3/$15) = $0.84 + $0.60 = $1.44
  • Stage 2: 30 × (5k system + 1k tools + 18k filing/news fresh + 4.5k output) at $3/$15 = 30 × ($0.072 + $0.0675) = $4.19
  • Retries at 14%: 14% × ($1.44 + $4.19) ≈ $0.79
  • Calibration skipped: no cost saving (the calibrator is nearly free) and, by chance, the same trade count downstream; the real loss is seven extra ideas hitting the sizing gate and no calibration record to learn from.

Total weekly cohort spend: $6.42 against $3.20 for the optimised run, a 2x gap on this small, mostly-fresh-context cohort. With heavier prompts, more sub-questions per ticker, and the retry multiplier compounding across them, the gap widens toward the TL;DR's $42-vs-$187. The ratio is what matters; absolute numbers depend on workload shape and data licensing.

The multiple paid for the same five trades is the point. The misconfigured workflow does not produce different research or different trades; it pays 2x here, and 4.5x at the TL;DR's workload, to do the same work because a handful of flags were wrong, one prompt asked for verbosity it did not need, and one stage was skipped.

Decision rules for cutting CPVT

When CPVT is unacceptable, the optimisation order matters. Working top-down through the funnel is wasteful; the highest-impact cuts are usually mid-funnel.

If CPVT is high and trade count is on target: the workflow is wasting tokens. Audit Stage 2 first. Verify prompt caching is hitting (cache_read_input_tokens should dominate cache_creation_input_tokens after session 1). Cut output verbosity. Inspect retry rate. The savings are mechanical and have no quality impact.

If CPVT is high and trade count is below target: the funnel is leaking. Audit Stage 3 (calibration). A calibration step that rejects 60% of Stage 2 outputs is doing its job; one that rejects 5% is not pulling its weight, which often means the research agent is emitting low-confidence views dressed up as high confidence. Use the Calibration Dojo to score the research prompt's calibration on a held-out set; a Brier score above 0.22 on binary-direction predictions is a structural issue that no token-cost change will fix.
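
The score itself is a few lines; a minimal Brier calculation on a held-out set of binary-direction calls, where each prediction is the stated probability and the outcome is 1 if the direction was right. The sample values are made up for illustration.

    def brier_score(predicted, outcomes):
        """Mean squared error between stated probabilities and 0/1 outcomes. Lower is better;
        coin-flipping at p=0.5 scores 0.25, so anything near that carries no usable edge."""
        assert len(predicted) == len(outcomes)
        return sum((p - o) ** 2 for p, o in zip(predicted, outcomes)) / len(predicted)

    preds    = [0.72, 0.65, 0.80, 0.55, 0.68, 0.60]    # stated probabilities on held-out calls
    realised = [1,    1,    1,    0,    1,    1]       # 1 = thesis played out
    print(round(brier_score(preds, realised), 3))      # 0.134 here; above ~0.22 is structural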

If CPVT is high and Stage 4 is the bottleneck: sizing is rejecting things research approves. The decision rule is whether the rejection is for the right reason. The Kelly Sizer walks through the math: edge-after-cost gates can fail because the cost is too high (broker, slippage), the edge is too small (calibration is right but the predicted probability is barely above 50/50), or the win/loss asymmetry is too thin. Each of those points to a different fix. Tightening the sizing gate without touching research is the wrong move when calibration is the broken link.
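
A small classifier for the three rejection reasons, under the same binary-outcome assumptions as the sizing sketch earlier; the thresholds are illustrative, not the Kelly Sizer's.

    def rejection_reason(p_win, reward_to_risk, cost_fraction, min_edge=0.0):
        """Name the gate that a zero-sized trade actually failed."""
        q = 1.0 - p_win
        gross_edge = p_win * reward_to_risk - q               # edge before execution costs
        if gross_edge - cost_fraction > min_edge:
            return "not rejected"
        if gross_edge <= min_edge:
            if reward_to_risk < 1.0:
                return "win/loss asymmetry too thin"          # fix entries/exits, not the prompt
            return "edge too small"                           # probability barely clears 50/50; fix calibration upstream
        return "cost too high"                                # fix broker, venue, order type; research is fine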

If CPVT is high and Stage 1 (ingest) is the dominant line: a Sonnet-tier model is doing a Haiku-tier job. Classification, dedup, and relevance filtering rarely need a frontier model. If a small model is consistently emitting wrong classifications, the prompt is the issue, not the model tier; rewrite the prompt with explicit examples before paying roughly 12x the per-token rate at Sonnet.

Pitfalls in CPVT measurement

Counting "research notes produced" as validated trades. This inflates the denominator and makes CPVT look 3-5x better than reality. The denominator must be trades actually executed (or marked-paper at the same gate strictness as live).

Failing to allocate fixed costs. Data subscriptions, infrastructure, observability tooling. CPVT is a marginal-cost metric by default but understanding the all-in cost requires amortising the fixed lines across the trade count. A workflow with $0.64 marginal CPVT and $400/month fixed costs producing 20 trades/month has an all-in CPVT of $20.64. Both numbers are correct for different decisions; do not conflate them.
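
The two numbers side by side, using the figures above; which one you look at depends on the decision being made.

    marginal_cpvt    = 0.64      # $/trade, LLM spend only: use for prompt-, cache-, and model-tier decisions
    fixed_monthly    = 400.00    # data subscriptions, infra, observability
    trades_per_month = 20

    all_in_cpvt = marginal_cpvt + fixed_monthly / trades_per_month
    print(all_in_cpvt)           # 20.64: use for the go/no-go decision on the workflow itself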

Window mismatch. LLM cost is paid in the week of research; trade outcome is realised over the holding period (days to months). CPVT needs to be measured on a fixed cohort: trades initiated in week N, outcome attribution open through week N+H where H is the holding-period horizon. Weekly CPVT against trailing-window outcomes is a noisier metric and easier to game.

Not separating retries from real workload. A 14% retry rate due to tool flakiness is a 14% cost surcharge that is invisible in summed token counts. The fix is one extra column in the inference log: an is_retry boolean, with the retry-cost line tracked separately. Retries that were load-bearing (the first call genuinely failed) count as real workload; retries fired even though the first call succeeded point to a framework bug (auto-retry on success) and should be fixed rather than budgeted for.

No baseline comparison. A CPVT of $8.40 looks great or terrible depending on what the comparable manual workflow costs. A trader manually screening 200 names per week and producing 5 trades is paying themselves their hourly rate × 6-10 hours of work; at a $50/hour reservation wage that is $300-500 per week of human time, dwarfing any reasonable LLM bill. The honest comparison is the workflow's CPVT against the manual cost per trade (hours × wage ÷ trades), here $60-100 per trade against single-digit dollars; workflows that look expensive on token spend alone are usually freeing up the more expensive resource.

A reporting template

A weekly review surface that fits on one page. Columns: week, ingest_count, stage2_count, stage3_count, stage4_count, validated_trades, stage1_cost, stage2_cost, stage3_cost, stage4_cost, retry_cost, total_cost, cpvt, cache_hit_rate, avg_output_tokens, retry_rate, calibration_brier_score. The funnel rates are the early warning; CPVT is the headline; the trailing columns explain a regression.
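
The same template as a flat record with the derived columns computed rather than typed in; the field names mirror the list above, and the dataclass is just one convenient container for it.

    from dataclasses import dataclass

    @dataclass
    class WeeklyReview:
        week: str
        ingest_count: int
        stage2_count: int
        stage3_count: int
        stage4_count: int
        validated_trades: int
        stage1_cost: float
        stage2_cost: float
        stage3_cost: float
        stage4_cost: float
        retry_cost: float
        cache_hit_rate: float
        avg_output_tokens: float
        retry_rate: float
        calibration_brier_score: float

        @property
        def total_cost(self) -> float:
            return (self.stage1_cost + self.stage2_cost + self.stage3_cost
                    + self.stage4_cost + self.retry_cost)

        @property
        def cpvt(self) -> float:
            return self.total_cost / self.validated_trades if self.validated_trades else float("inf")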

A four-week trailing CPVT trend is the right summary view. Spikes trace cleanly to either a model upgrade (cache invalidation, output verbosity drift), a prompt edit (often unreviewed), or an upstream API regression (retry rate spike). All three are diagnosable from the same weekly review; without the breakdown they are impossible to disentangle.

References

  1. Anthropic. (2025). "Prompt caching with the Anthropic API." Anthropic Documentation. The cache_creation vs cache_read accounting that underwrites the optimised-vs-misconfigured breakdown.
  2. Bailey, D. H., Borwein, J., López de Prado, M., & Zhu, Q. J. (2014). "The Probability of Backtest Overfitting." Journal of Computational Finance 20(4), 39-70. The selection-bias framing that motivates separating ingest from research as independent stages.
  3. Kelly, J. L. (1956). "A New Interpretation of Information Rate." Bell System Technical Journal 35(4), 917-926. The original sizing math behind the Stage 4 gate; fractional-Kelly extensions are well summarised in MacLean, Thorp, and Ziemba's 2010 The Kelly Capital Growth Investment Criterion (World Scientific).
  4. OpenAI. (2025). "Batch API pricing and best practices." OpenAI Documentation. The 50% discount on batch endpoints applies cleanly to Stage 1 and Stage 3 if the workflow tolerates a 24-hour return SLA.
  5. López de Prado, M. (2018). Advances in Financial Machine Learning, Wiley. Chapter 7's discussion of meta-labelling and gate stacking is the direct conceptual ancestor of the four-stage funnel above.