TL;DR

Model leaderboards rot on a 3-to-6 month cycle. Static benchmark-driven selection produces decisions that look defensible on ship day and indefensible 90 days later, when the next frontier release reorders the table. Finance teams need a methodology that absorbs that churn without re-running procurement every quarter. This piece sets out a five-axis rubric (numeric reasoning fidelity, citation grounding, structured output reliability, latency under load, cost per validated answer), three operational tests alongside the rubric (rate-limit behavior, error-mode visibility, support-channel responsiveness), the rebench cadence that survives churn (quarterly on a frozen 50-200-task private bench), and the production pattern that decouples deployment from evaluation (version-pin in production, A/B in shadow, graduate after 14 days at parity or better). A worked 60-day model-selection process for a research-agent product anchors the methodology in concrete numbers. Internal bench figures cited below are illustrative round numbers, not measured runs.

Why generic LLM benchmarks are the wrong instrument

MMLU, GPQA, HumanEval, MMLU-Pro, BIG-Bench, and the long tail of academic benchmarks are useful capability signals but the wrong instrument for a finance procurement decision. Four reasons compound.

Training-set contamination. Frontier models train on web-scale text; most public benchmarks are web-scraped or web-published. By the time a benchmark is two years old, the chance that some fraction of its questions appears verbatim in pretraining data approaches certainty. Reported accuracy is then an upper bound inflated by memorization, not a lower bound on reasoning over unseen filings. The bias is upward; published scores over-promise on production.

Task distribution mismatch. MMLU asks single-fact multiple-choice questions across 57 academic subjects. A research-agent workload asks open-ended questions over a 250-page 10-K with footnote cross-references. A model that scores 88 on MMLU and one that scores 91 may rank in either order on the actual finance workload.

Latency invisibility. Public benchmarks report accuracy, not P95 latency, response-token distributions, or cost per answer at production prompt shapes. A model that wins on MMLU by two points but runs 3x slower on the buyer's prompts is the wrong choice for a real-time research agent.

Cold-start lag. A new flagship ships on a Tuesday; public benchmarks covering it land two to six weeks later. Procurement decisions made on stale benchmarks miss the model that is actually best on the workload right now.

Generic benchmarks tell a buyer which models are worth testing. They do not tell a buyer which model to ship. The gap is filled by an internal rubric, run on the buyer's corpus, on the buyer's prompts.

The five axes that actually matter

Finance LLM workloads decompose cleanly into five evaluation axes. Each demands a different metric; aggregating them into a single score produces a ranking that satisfies neither the precision side nor the cost side of the workload.

Axis 1: numeric reasoning fidelity

The model extracts a number from a document, or computes a derived metric (free cash flow, interest coverage, segment-level revenue growth), and returns the correct number. The dominant failure mode is plausible fabrication: a number that looks reasonable but does not match the document. The metric is numeric tolerance against ground truth: absolute 0.01 for ratios, relative 0.1% to 1% for dollar amounts.
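
A minimal grader sketch for this metric; the task shape and the 0.5% dollar tolerance here are illustrative choices, not a fixed standard:

    # Numeric-tolerance grader: absolute tolerance for ratios, relative
    # tolerance for dollar amounts. The 0.5% figure matches the bench below.
    def grade_numeric(predicted: float, truth: float, kind: str) -> bool:
        if kind == "ratio":
            return abs(predicted - truth) <= 0.01                # absolute 0.01
        if kind == "dollar":
            return abs(predicted - truth) <= 0.005 * abs(truth)  # relative 0.5%
        raise ValueError(f"unknown task kind: {kind}")

    # A predicted revenue of $81.46B graded against $81.50B ground truth
    # passes at 0.5% relative tolerance.
    assert grade_numeric(81.46e9, 81.50e9, "dollar")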

Illustrative numbers from a 50-task internal bench (revenue, EPS, diluted shares, segment revenue, cash position): Sonnet-class roughly 47/50 at 0.5% tolerance, Haiku-class roughly 42/50, Opus-class roughly 49/50. Loosening tolerance to 1% closes the Haiku gap to one task. The right tier is workload-dependent: if the extracted number feeds a downstream feature pipeline, Opus-class earns its 5x cost; if it lands in a human-reviewed dashboard, Haiku-class is sufficient.

Axis 2: citation grounding

Every claim attaches to a document span. The metric has two parts: citation presence (every claim cites something) and citation faithfulness (the cited span actually supports the claim). Faithfulness is the hard part. A model can return a citation that points at a relevant paragraph that nonetheless does not contain the asserted fact. Production graders catch this with a verifier pass that re-reads the cited span and confirms it supports the claim. Faithfulness rates below 90% on a representative bench are a hard fail; the model is hallucinating with footnotes attached.
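
A sketch of the two-part metric, assuming claims arrive with character-offset citations; judge_supports is a placeholder for whatever LLM-as-judge call the team runs, not a real library API:

    # Two-part citation grading: presence is mechanical, faithfulness needs a
    # verifier pass that re-reads the cited span. `judge_supports` stands in
    # for an LLM-as-judge call and is team-specific.
    def grade_citations(claims, document, judge_supports):
        present = [c for c in claims if c.get("cite") is not None]
        faithful = [
            c for c in present
            if judge_supports(span=document[c["cite"]["start"]:c["cite"]["end"]],
                              claim=c["text"])
        ]
        return {
            "presence": len(present) / max(len(claims), 1),
            "faithfulness": len(faithful) / max(len(present), 1),
        }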

Axis 3: structured output reliability

The model returns JSON that conforms to a schema, with no stray prose, no markdown fences, no renamed fields, no extra or missing keys. The metric is schema-pass rate across N runs at production temperature. Production temperature matters because some model families pass at temperature 0 and degrade at 0.4, and most agentic workflows run at non-zero temperature. A model with a 99% schema-pass rate at 0 and 85% at 0.4 fails this axis if the workload runs at 0.4. Vendor structured-output modes (OpenAI structured outputs, Anthropic tool use, Google schema-constrained generation) close most of the gap, with coverage differences across vendors and versions.
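
A pass-rate harness sketch, assuming a generic call_model stand-in that returns raw text; jsonschema does the validation:

    # Schema-pass rate across N runs at production temperature. `call_model`
    # is a placeholder for the production client, not a real SDK call.
    import json
    import jsonschema

    def schema_pass_rate(call_model, prompt, schema, n=100, temperature=0.4):
        passes = 0
        for _ in range(n):
            raw = call_model(prompt, temperature=temperature)
            try:
                jsonschema.validate(json.loads(raw), schema)
                passes += 1
            except (json.JSONDecodeError, jsonschema.ValidationError):
                pass  # stray prose, markdown fences, renamed fields all land here
        return passes / n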

Axis 4: latency under load

P95 latency on the buyer's own prompts at the buyer's own concurrency. Vendor-published latency figures are typically single-stream measurements on short prompts; a buyer running 50 concurrent agents on 80k-token prompts is in a different regime. P99 matters too for tail-latency-sensitive paths. Acceptable values depend on workload: an interactive Q&A agent needs P95 under 4 seconds, a nightly batch job tolerates 30 seconds.
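
A minimal harness for this measurement, again with a generic call_model stand-in; the thread pool caps in-flight calls at the buyer's concurrency:

    # P95 latency on the buyer's own prompts at the buyer's own concurrency.
    import statistics
    import time
    from concurrent.futures import ThreadPoolExecutor

    def p95_latency(call_model, prompts, concurrency=50):
        def timed(prompt):
            start = time.monotonic()
            call_model(prompt)
            return time.monotonic() - start

        with ThreadPoolExecutor(max_workers=concurrency) as pool:
            latencies = list(pool.map(timed, prompts))
        return statistics.quantiles(latencies, n=20)[18]  # 19th cut point = P95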

Axis 5: cost per validated answer

Not raw cost per token, not raw cost per call. Cost per answer that passes the buyer's quality gates. A cheap model that fails the schema gate 15% of the time and triggers a retry on a more expensive model has an effective cost that includes the retry path. The metric is blended (call cost + retry cost + verifier cost) divided by answers shipped. This axis shifts most often when a new flagship lands, because new flagships frequently price aggressively to win share.
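
The blended metric as arithmetic; the prices here are placeholders, not vendor quotes:

    # Cost per answer that passes the quality gates, including the retry path
    # and the verifier pass. All figures are illustrative placeholders.
    def cost_per_validated_answer(call_cost, retry_rate, retry_cost,
                                  verifier_cost, pass_rate):
        blended = call_cost + retry_rate * retry_cost + verifier_cost
        return blended / pass_rate

    # A $0.010 call with a 15% retry onto a $0.050 model, plus a $0.004
    # verifier, at a 95% final pass rate: ~$0.0226 per validated answer.
    print(cost_per_validated_answer(0.010, 0.15, 0.050, 0.004, 0.95))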

How the five axes interact

The axes are not additive. Numeric reasoning fidelity plus structured output reliability is harder than either alone: the model must hit the schema while doing arithmetic. Citation grounding plus latency is harder than either alone: citation verification adds a pass. The rubric is a checklist for shortlisting, not a weighted sum.

Building your own bench

The minimum useful internal bench: 50 to 200 task instances per axis, gold-standard answers, an automated grader. The bench's job is to discriminate between candidates on the workload, not to pass peer review.

Sample size. 50 tasks discriminate at roughly 10 percentage points of true accuracy gap, 200 tasks at roughly 4. Below 50, confidence intervals swallow the differences a buyer would act on. Above 200, labeling cost dominates marginal discriminating power.
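
Those discrimination figures fall out of a normal-approximation confidence interval on a pass rate near 0.9; a quick check:

    # 95% CI half-width for a pass rate near 0.9: 1.96 * sqrt(p * (1 - p) / n).
    import math

    for n in (50, 200):
        half_width = 1.96 * math.sqrt(0.9 * 0.1 / n)
        print(n, round(100 * half_width, 1))  # 50 -> 8.3pp, 200 -> 4.2pp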

Ground truth. SEC EDGAR is the backbone for extraction and numeric-reasoning tasks; XBRL-tagged statements provide machine-parseable ground truth for line items since 2009. Self-labeled cases cover everything else, at roughly one labeler-day per 100 cases for a domain expert. Freeze the bench at the start of each quarter; mid-cycle edits are a leakage vector that practitioners routinely under-acknowledge.

Automated grading. Numeric tolerance is arithmetic. Schema validation is a JSON-schema library call. Citation faithfulness needs an LLM-as-judge pass with a judge that is not on the candidate list, validated against a held-out 50-case human-scored set before aggregate numbers are trusted. Latency is wall-clock at the buyer's concurrency. Cost is the line-item bill divided by answers passing all gates.

Reporting. Per-axis breakdowns, confidence intervals, distribution shape rather than mean alone. A model with a higher mean and a longer tail can rank below a model with a lower mean and a tighter distribution on a workload that punishes outliers.

Agent Skill Tester and Prompt Regression Tester cover harness scaffolding for axes 1, 2, and 3; Token Cost Optimizer covers axis 5.

Escape from leaderboard chasing: rebench every quarter

The methodology that survives churn rebenches on the buyer's own bench every quarter, on a fixed schedule, against a fixed candidate list. The bench is frozen for the quarter; only the candidates rotate.

Quarterly is the right interval. Faster, and the bench-construction cost dominates, so the team spends more time evaluating than shipping. Slower, and the production model falls a generation behind the frontier while the buyer pays premium prices for now-mid-tier capability. A 90-day window matches the typical frontier-release rhythm without forcing continuous re-evaluation.

The candidate list per quarter: the incumbent (always, as a control), the latest flagship from each major vendor, any mid-tier that materially repriced in the prior quarter, any new vendor that crossed a credibility threshold. Older models drop off once dominated on all five axes. Running the full bench against more than five candidates wastes labeler hours.

Version-pin in production, A/B in shadow

The pattern that decouples evaluation from deployment is version-pinning combined with shadow A/B. In production, every call uses an explicit version string (claude-sonnet-4-6-2026-04-15, gpt-5-2026-04-08, gemini-2-5-pro-2026-04-22), never a floating alias. The version stays pinned until a graduation criterion is met.

In shadow, the next-quarter candidate runs alongside production traffic at 5% sample rate, with results stored but not used. The graduation criterion: 14 consecutive days at parity-or-better on all five rubric axes on real traffic, with no regressions on the operational tests below. On graduation the version pin updates in a single deploy, the prior version is retained as a rollback target for 30 days, and the shadow slot opens for the next candidate.
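
A minimal sketch of the routing layer under this pattern, assuming a generic call_model client; the pins and sample rate come from the text above:

    # Version-pinned production call with a 5% shadow sample. Shadow results
    # are stored for grading and never returned to the user.
    import random

    PRODUCTION_PIN = "claude-sonnet-4-6-2026-04-15"  # explicit version, no alias
    SHADOW_CANDIDATE = "gemini-2-5-pro-2026-04-22"
    SHADOW_SAMPLE_RATE = 0.05

    def handle_request(call_model, prompt, shadow_log):
        answer = call_model(model=PRODUCTION_PIN, prompt=prompt)
        if random.random() < SHADOW_SAMPLE_RATE:
            shadow_log.append({
                "prompt": prompt,
                "candidate": SHADOW_CANDIDATE,
                "answer": call_model(model=SHADOW_CANDIDATE, prompt=prompt),
            })
        return answer  # production traffic never sees the shadow result

In practice the shadow call runs off the request path (queued, not inline) so sampled requests pay no latency penalty.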

The pattern catches the two failure modes floating aliases produce. Vendor-side silent updates: when a vendor rolls a fixed alias to a new checkpoint without a version bump, a buyer on the alias inherits whatever regression that checkpoint introduces; pinning blocks this. Accidental promotion: a candidate that looked fine on the bench fails on production traffic; the shadow window catches it before customers do.

Operational tests beyond the bench

The rubric covers answer quality, latency, and cost. Three operational tests cover the rest of what a finance team is buying: the vendor's operational behavior around the model.

Rate-limit behavior. Drive the vendor to its requests-per-minute and tokens-per-minute limits on the buyer's workload and record the failure shape. Some vendors return a clean 429 with a Retry-After header; some return a 500 with no guidance; some silently throttle by injecting latency. A vendor that returns 500s under load is harder to integrate than one that returns 429s, regardless of who wins on the rubric.
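
A probe sketch for recording that failure shape; the URL, payload, and headers are placeholders for the buyer's real workload:

    # Drive the endpoint toward its rate limit and record what failure looks
    # like: status code (429 vs 500), Retry-After guidance, injected latency.
    import time
    import requests

    def probe_rate_limit(url, payload, headers, n=500):
        observations = []
        for _ in range(n):
            start = time.monotonic()
            resp = requests.post(url, json=payload, headers=headers, timeout=120)
            observations.append({
                "status": resp.status_code,
                "retry_after": resp.headers.get("Retry-After"),
                "latency_s": time.monotonic() - start,  # silent throttling shows here
            })
        return observations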

Error-mode visibility. A useful error response carries a stable error code, a category (invalid input, model overloaded, content filter, internal), and enough context to route the retry. Inject failure inputs (oversized prompts, malformed tool calls, content-filter triggers) and grade the responses on usefulness.

Support-channel responsiveness. Hours-to-first-substantive-response on a Sev-2 ticket. Ask three reference customers in the same tier for their median; a 36-hour median is harder to depend on than a 4-hour one, and that difference never shows up on a benchmark.

A vendor that fails any of these is not a viable production target regardless of rubric score.

Worked example: 60-day model selection for a research-agent product

A team building a finance research-agent product picks a primary model and a fallback. The agent ingests SEC filings, generates structured research notes with citations, runs at 200 calls per hour during US market hours, and serves a paying analyst audience.

Days 1-7. Bench construction. 80 numeric extraction tasks with XBRL-anchored ground truth. 40 citation-grounding tasks from public 10-K and 10-Q filings. 30 structured-output tasks against the production schema at temperature 0.3. Latency harness: 50 concurrent calls on the production prompt template, P95 measured against a 90k-token input. Cost harness: blended cost per validated answer including a citation-verifier pass.

Days 8-21. Candidate runs. Five candidates: incumbent (Sonnet-class), Haiku-class, Opus-class, GPT-5-class, Gemini 2.5 Pro-class. Illustrative figures: numeric fidelity 84% to 96% at 0.5% tolerance; citation faithfulness 87% to 95%; schema-pass at 0.3 temperature 91% to 99%; P95 latency 2.4s to 8.7s; blended cost per validated answer $0.018 to $0.082. Haiku-class fails citation faithfulness at 87% and drops. Opus-class wins on numeric fidelity and citation faithfulness but fails the interactive latency budget at P95 8.7s. The Sonnet-class incumbent and Gemini 2.5 Pro remain.

Days 22-35. Operational tests. Both pass rate-limit, error-mode, and support tests. Support response times differ (incumbent 6 hours, Gemini 12 hours) but both sit inside tolerance.

Days 36-49. Shadow A/B. Gemini 2.5 Pro runs at 5% sample rate against production traffic. Citation faithfulness on production: 93% (vs. 95% on the bench), inside parity. Cost per validated answer: $0.024 (vs. incumbent $0.027), a small but real win.

Days 50-60. Graduation decision. The team keeps the incumbent as primary because the cost delta is below the team's 15% switch threshold and the rubric advantages do not compound on the workload. Gemini 2.5 Pro is promoted into the fallback chain; the version pin updates accordingly. The next quarter opens with the new GPT-5 release expected mid-cycle.

The decision documents what was measured, on what bench, and what the action threshold was. A new team member six months later can reproduce the reasoning by re-running the bench. That reproducibility is the property that survives churn.

The fallback chain pattern

Production stacks rarely run a single model. The fallback chain routes a call to the primary first, escalates to a secondary on failure or low-confidence output, and reserves a tertiary for explicit escalation. A typical configuration: Claude Sonnet primary, GPT-5 secondary, Gemini 2.5 Pro tertiary. The reverse ordering is equally valid; the choice depends on which axis the buyer weights highest.
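
A chain sketch with an explicit pin on every link; call_model and confidence are stand-ins for the production client and the team's confidence scorer, and the threshold is illustrative:

    # Fallback chain: primary first, escalate on failure or low confidence.
    CHAIN = [
        "claude-sonnet-4-6-2026-04-15",  # primary
        "gpt-5-2026-04-08",              # secondary
        "gemini-2-5-pro-2026-04-22",     # tertiary
    ]

    def answer_with_fallback(call_model, confidence, prompt, threshold=0.8):
        for model in CHAIN:
            try:
                result = call_model(model=model, prompt=prompt)
            except Exception:
                continue  # outage, refusal, or model-specific edge case
            if confidence(result) >= threshold:
                return result, model
        raise RuntimeError("every link in the chain failed or scored low")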

The chain absorbs vendor outages (Anthropic, OpenAI, and Google have all had multi-hour incidents in the past 12 months); content-filter false positives (vendors differ on which prompts trigger refusals); and model-specific edge cases (a prompt shape that confuses one model family can succeed on another).

The chain has a cost. Every escalation adds latency and tokens. A blended-cost analysis on the Fallback Chain Simulator shows an escalation rate above 8% typically wipes out the cost advantage of running a cheap primary, because the fallback runs on the more expensive secondary. The right escalation threshold falls out of the buyer's measured per-prompt confidence distribution, not a default value.
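
The arithmetic behind that erosion, with placeholder prices:

    # Blended chain cost: every escalated call pays for the failed primary
    # and the secondary. Prices are placeholders, not vendor quotes.
    def chain_cost(primary_cost, secondary_cost, escalation_rate):
        return primary_cost + escalation_rate * secondary_cost

    # At an 8% escalation rate, a $0.004 primary backed by a $0.050 secondary
    # blends to $0.008 per call: double the primary's sticker price.
    print(chain_cost(0.004, 0.050, 0.08))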

Version-pinning applies to every link in the chain independently. The fallback is not "GPT-5"; it is gpt-5-2026-04-08. When a link rebenches into a new version, only that link updates, which keeps chain behavior reproducible over time.

References

  • Anthropic. "Models overview" and "Pricing." docs.anthropic.com, accessed 2026-04-22.
  • OpenAI. "Models" and "Structured outputs." platform.openai.com/docs, accessed 2026-04-22.
  • Google. "Gemini API pricing" and "Long-context best practices." ai.google.dev, accessed 2026-04-22.
  • SEC. "EDGAR Full-Text Search and XBRL Financial Reporting." sec.gov/edgar, accessed 2026-04-22.
  • Liu, N. F., Lin, K., Hewitt, J., et al. (2024). "Lost in the Middle: How Language Models Use Long Contexts." Transactions of the ACL 12.
  • Bailey, D. H., & Lopez de Prado, M. (2014). "The Deflated Sharpe Ratio." Journal of Portfolio Management 40(5).
  • Chen, Z., Chen, W., Smiley, C., et al. (2021). "FinQA: A Dataset of Numerical Reasoning over Financial Data." Proceedings of EMNLP 2021.