TL;DR

A hallucination in finance has different costs from one in customer support. A misquoted revenue figure flows into a position size, an audit log, a compliance filing. Detection has to happen in production. The minimum viable stack is four layers: (1) source-grounded fact-checking verifies that every numeric claim and proper-noun reference traces back to retrieved text via exact or near-exact match; (2) self-consistency cross-checks sample N completions and score variance, with industry-reported coverage around 65 to 75% of factual hallucinations on FreshQA-style benchmarks at N=5, at roughly 5x baseline cost; (3) external verification pushes high-stakes claims through deterministic verifiers (recompute the math, scrape the source page, hit the API); (4) adversarial probes are pre-production red-team corpora that exercise known hallucination shapes before traffic hits production. Layers compose cheapest-first so most claims are resolved before the expensive layers fire, plus a human-in-the-loop step that trips only on audit-trailed red flags.

Why finance hallucinations are a different class

Three properties of a finance pipeline raise the floor on acceptable detection precision.

Position-sizing impact. A model that reads "Q3 revenue grew 22%" when the filing says 12% does not produce a wrong sentence. It produces a wrong size. The 10-point error compounds through the sizer, the risk budget, and the order router. By the time the position is on, the original mistake is two systems removed from anyone reading the model output.

Audit and regulatory exposure. Under MAR (EU) and SEC Reg BI / FINRA Rule 2010 in the US, advisory and execution-adjacent systems carry recordkeeping obligations. An LLM-generated "key risk" in a customer-facing summary inherits the same audit-trail expectations as the human analyst it replaces. A hallucinated risk that flows through unchecked is a documentation problem at the next examination.

Asymmetric loss. A correctly extracted fact saves seconds of analyst time. A hallucinated fact that drives a trade or a client communication can run into mid-six-figure remediation. The expected-value math forces high-precision detection even where recall trades off.

The four detection layers below are ordered cheapest-first so the budget lands where it earns the most reduction in residual error.

Layer 1: Source-grounded fact-checking

Every claim in the model output must trace back, by exact or near-exact match, to the retrieved source text. Numbers, proper nouns, dates, percentages, and quoted phrases are the high-value targets; free-form narrative is harder to verify and gets pushed to layers 2 and 3.

The implementation is a span-grounding check. Extract every numeric token and named entity from the output, search the retrieved context for a matching span within a tolerance window. A hit grounds the claim; a miss flags it for escalation.

import re
from dataclasses import dataclass

NUMERIC_PATTERN = re.compile(r"-?\d[\d,.]*(?:%|bps| bp| bn| b| m| k)?", re.IGNORECASE)

@dataclass
class GroundingResult:
    claim: str
    grounded: bool
    matched_span: str | None
    confidence: float

UNIT_SUFFIXES = ("%", "bps", "bp", "bn", "b", "m", "k")

def normalize_number(token: str) -> float | None:
    """Strip separators and unit suffixes; return the raw float (no unit scaling)."""
    cleaned = token.replace(",", "").strip().lower()
    for suffix in UNIT_SUFFIXES:
        if cleaned.endswith(suffix):
            cleaned = cleaned[: -len(suffix)].strip()
            break
    try:
        return float(cleaned)
    except ValueError:
        return None

def find_numeric_match(claim_value: float, source_text: str,
                       rel_tol: float = 0.005) -> str | None:
    """Return the matching source span, or None if no value within tolerance."""
    for match in NUMERIC_PATTERN.finditer(source_text):
        candidate = normalize_number(match.group())
        if candidate is None:
            continue
        if candidate == 0 and claim_value == 0:
            return match.group()
        if candidate != 0 and abs(claim_value - candidate) / abs(candidate) <= rel_tol:
            return match.group()
    return None

def ground_numeric_claims(model_output: str, source_text: str) -> list[GroundingResult]:
    results = []
    for match in NUMERIC_PATTERN.finditer(model_output):
        token = match.group()
        value = normalize_number(token)
        if value is None:
            continue
        matched = find_numeric_match(value, source_text)
        results.append(GroundingResult(
            claim=token,
            grounded=matched is not None,
            matched_span=matched,
            confidence=1.0 if matched else 0.0,
        ))
    return results

Tolerance is the key parameter. Reported financials carry rounding noise: 0.5% relative is generous for revenue line items, tighter for ratios. The pattern extends to dates (one-day window) and tickers (whitelist match). On well-formed extraction prompts this layer grounds 70 to 90% of numeric claims; the residual feeds the next layer, not a final reject.
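
A sketch of the date variant, assuming dates have already been parsed out of both the output and the source (the parsing step is elided here):

from datetime import date

def find_date_match(claim_date: date, source_dates: list[date],
                    window_days: int = 1) -> date | None:
    """Date analogue of the numeric check: accept any source date inside the window."""
    for candidate in source_dates:
        if abs((claim_date - candidate).days) <= window_days:
            return candidate
    return None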

Layer 1 catches the easy class of hallucination: the model that confabulates a number absent from the source. It does not catch the model that picks the wrong correct number: the FY 2024 figure when asked about FY 2023, the segment number when asked about consolidated. Those are layers 2 and 3 territory. The Hallucination Detector implements this layer interactively for prompt iteration.

Layer 2: Self-consistency cross-checks

The second layer trades cost for coverage. Instead of one completion at temperature 0, the pipeline runs N completions at modest temperature (0.2 to 0.4) and scores variance across samples. The empirical premise: factual hallucinations are unstable across samples, grounded facts are stable.

Wang et al. (2022) introduced self-consistency for chain-of-thought reasoning [1]; subsequent work in the FreshQA / FreshLLMs line and Manakul et al.'s SelfCheckGPT established it as a viable hallucination signal for free-form generation [2]. Reported coverage on FreshQA-style benchmarks at N=5 lands in the 65 to 75% recall range for factual hallucinations, at roughly 5x baseline tokens plus N-way decoding latency.

Scoring depends on output type. Numeric extractions: reject if cross-sample standard deviation exceeds a threshold relative to the mean. Categorical labels (long, short, none): majority vote with a minimum-quorum gate. Free-form prose: pairwise semantic similarity via sentence embeddings, reject if mean similarity drops below ~0.80.

from statistics import mean, pstdev

def self_consistency_numeric(samples: list[float], rel_threshold: float = 0.01) -> dict:
    """Reject if cross-sample dispersion exceeds rel_threshold of the mean."""
    if len(samples) < 3:
        return {"verdict": "insufficient", "mean": None, "rel_dispersion": None}
    mu = mean(samples)
    sigma = pstdev(samples)
    rel_dispersion = sigma / abs(mu) if mu != 0 else float("inf")
    return {
        "verdict": "consistent" if rel_dispersion <= rel_threshold else "inconsistent",
        "mean": mu,
        "rel_dispersion": rel_dispersion,
    }

def self_consistency_label(samples: list[str], min_majority: float = 0.8) -> dict:
    """Majority vote with a quorum gate."""
    counts = {label: samples.count(label) for label in set(samples)}
    top_label, top_count = max(counts.items(), key=lambda kv: kv[1])
    fraction = top_count / len(samples)
    return {
        "verdict": "consistent" if fraction >= min_majority else "inconsistent",
        "label": top_label,
        "majority_fraction": fraction,
    }
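
For free-form prose, a minimal sketch of the pairwise-similarity scorer, assuming sentence-transformers is available (the model name is an arbitrary placeholder, not a recommendation):

import numpy as np
from itertools import combinations
from sentence_transformers import SentenceTransformer  # assumed dependency

_encoder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model choice

def self_consistency_prose(samples: list[str], min_similarity: float = 0.80) -> dict:
    """Mean pairwise cosine similarity across sampled paragraphs."""
    embeddings = _encoder.encode(samples, normalize_embeddings=True)
    sims = [float(np.dot(a, b)) for a, b in combinations(embeddings, 2)]
    mean_sim = float(np.mean(sims))
    return {
        "verdict": "consistent" if mean_sim >= min_similarity else "inconsistent",
        "mean_similarity": mean_sim,
    }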

Cached system prompts (see Prompt Caching Economics for Finance) recoup a meaningful share of the N-way input cost; with caching, the marginal cost of N=5 tracks closer to 2.5x to 3x baseline. Scoring cost grows with output length and quadratically with N (pairwise comparisons), so long free-form outputs are the expensive case. N=5 at modest temperature is the sweet spot; N=7 or N=9 is reserved for residual claims that layer 1 could not ground and layer 3 has no verifier for.

Layer 3: External verification

Layers 1 and 2 establish internal consistency. Layer 3 establishes external consistency, between the model output and a value computed independently from a non-LLM source. Three verifier shapes cover most of finance.

Recompute the math. If the model claims gross margin is 47.3% and gross profit and revenue are verified at layer 1, the gross margin recomputation is a one-line numeric check. Multi-step derived metrics (free cash flow, interest coverage, working-capital changes) decompose into arithmetic the verifier runs deterministically. Disagreement beyond floating-point noise is a hard reject.

Re-scrape the source page. For claims that reference a specific filing or press wire, the verifier hits the source URL and re-extracts the span with a non-LLM parser (XBRL tag for financials, regex anchored on labels for headlines, CSS selector for structured pages). Deterministic extraction is ground truth; LLM extraction is the candidate. Most expensive in latency, cheapest in tokens.

Hit the API. For market-data claims (last price, 52-week high, dividend record date), the verifier round-trips a quote from the broker or vendor API and compares. This also catches stale-context errors, where the model summarizes a value correct at retrieval time but no longer current at decision time.

from typing import Any, Callable
from dataclasses import dataclass

@dataclass
class Verifier:
    name: str
    extract: Callable[..., Any]     # deterministic extraction from the source (text, XBRL facts, API payload)
    compare: Callable[[Any, Any], bool]

def recompute_gross_margin(gross_profit: float, revenue: float) -> float:
    if revenue == 0:
        raise ValueError("revenue cannot be zero for gross margin")
    return round(gross_profit / revenue, 4)

def verify_derived(claim: dict, source_facts: dict, tol: float = 0.001) -> bool:
    expected = recompute_gross_margin(
        source_facts["gross_profit"],
        source_facts["revenue"],
    )
    actual = float(claim["gross_margin"])
    return abs(actual - expected) <= tol
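
One way the Verifier container might be wired, assuming layer-1-verified facts arrive as a dict; gross_margin_verifier and run_verifier are illustrative names, not a prescribed API:

gross_margin_verifier = Verifier(
    name="gross_margin_recompute",
    extract=lambda facts: recompute_gross_margin(facts["gross_profit"], facts["revenue"]),
    compare=lambda expected, claimed: abs(claimed - expected) <= 0.001,
)

def run_verifier(v: Verifier, source, claimed) -> bool:
    """True if the independently computed value agrees with the model's claim."""
    return v.compare(v.extract(source), claimed)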

Rule of thumb: any claim that drives position-sizing or audit-trail consequences passes through layer 3. Claims that drive a research summary or internal dashboard stop at layer 2. Reserving layer-3 verification for the claims where it earns its keep holds the latency budget in check. The same logic sits behind the Price-Blind Auditor, which catches a related failure mode: the model inadvertently learning the current price from context and reasoning back to it.

Layer 4: Adversarial probes

Layers 1 to 3 run on production traffic. Layer 4 runs in pre-production: a regression corpus of known-hallucination patterns exercised on every prompt change, model bump, and schema revision. The deployment never sees these inputs; the eval harness does.

Seven shapes recur across finance pipelines:

  1. Number confabulation. A metric the filing does not disclose. Expected: "not disclosed"; hallucinated: any specific number.
  2. Time-period drift. FY 2023 asked against an FY 2024 10-K. Expected: "FY 2023 not in this filing"; hallucinated: a wrong-year value.
  3. Segment vs consolidated confusion. A specific business segment when the filing reports only consolidated figures.
  4. Currency and unit drift. A value asked in millions when reported in thousands, or USD when reported in EUR.
  5. Restatement awareness. A metric that was subsequently restated; the model should flag the restatement, not silently use either version.
  6. Footnote-only disclosure. A metric disclosed only in a footnote, with the body framing it differently.
  7. Prompt injection in retrieved text. A press wire that includes an instruction to fabricate a metric. (Full defense: Prompt Injection Defenses for Finance Agents.)

Each pattern is a small fixture: input prompt, retrieved context, expected response shape, scoring rule. Running the corpus through a candidate prompt or model produces a per-pattern pass rate. A typical deployment bar is 95% pass on patterns 1 to 4, 85% on 5 to 7. The methodology folds into the broader eval pattern in Evaluation Harness for Finance LLM Tasks; Agent Skill Tester and Prompt Regression Tester cover the interactive iteration loop and the baseline-vs-candidate comparison.
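
A minimal fixture shape, as a hedged sketch (AdversarialFixture, run_corpus, and the dict-based harness are illustrative, not a prescribed format):

from dataclasses import dataclass
from typing import Callable

@dataclass
class AdversarialFixture:
    pattern: str
    prompt: str
    context: str
    scorer: Callable[[str], bool]   # True = expected behavior observed

# Pattern 2, time-period drift: FY 2023 asked against an FY 2024 filing.
fy_drift = AdversarialFixture(
    pattern="time_period_drift",
    prompt="What was FY 2023 revenue for ticker X?",
    context="[FY 2024 10-K excerpt containing no FY 2023 figures]",
    scorer=lambda output: "not in this filing" in output.lower(),
)

def run_corpus(fixtures: list[AdversarialFixture],
               generate: Callable[[str, str], str]) -> dict[str, float]:
    """Per-pattern pass rate for a candidate prompt or model."""
    passes: dict[str, list[bool]] = {}
    for f in fixtures:
        passes.setdefault(f.pattern, []).append(f.scorer(generate(f.prompt, f.context)))
    return {p: sum(v) / len(v) for p, v in passes.items()}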

Production architecture

Composed, the canonical pipeline for a research-to-decision agent on 10-Q filings:

[1] Retrieve → [2] LLM extraction → [3] Layer 1 grounding
          → [4] Layer 2 self-consistency on residual
          → [5] Layer 3 external verifier on high-stakes claims
          → [6] Decision OR human-in-the-loop escalation (logged)

Each arrow is a budget gate. Layer 1 grounds the bulk of claims on the first pass at near-zero cost. Layer 2 fires only on the claims layer 1 could not ground, keeping N-way sampling cost contained. Layer 3 fires only on layer-2 survivors that carry position-sizing or audit consequences, keeping network-bound latency rare.

A claim that fails layer 1 with a "no matching span" verdict can still be saved by layer 2 (same number across samples, suggesting a paraphrase or unit difference rather than confabulation) or layer 3 (deterministic verifier confirms from a different source). A claim that passes layers 1 and 2 but fails layer 3 is the most dangerous shape: internally consistent, externally wrong. Those are the cases that hit the human-in-the-loop step.
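
A condensed sketch of the gating logic, reusing the layer functions above; sample_numeric and verifier_check are hypothetical callables supplied by the caller:

def gate_numeric_claim(value: float, source_text: str,
                       sample_numeric, verifier_check=None) -> str:
    """Cheapest-first gates: each layer fires only on what the previous one left open."""
    # Layer 1: span grounding, near-free.
    if find_numeric_match(value, source_text) is None:
        # Layer 2: N-way sampling, only on layer-1 misses.
        if self_consistency_numeric(sample_numeric(5))["verdict"] != "consistent":
            return "reject"
    # Layer 3: deterministic verifier, only where the claim is high-stakes and covered.
    if verifier_check is not None and not verifier_check(value):
        return "escalate"   # internally consistent, externally wrong -> human review
    return "accept"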

Cost vs precision

Each layer trades latency and tokens for residual-error reduction. Figures below are based on 2026-04 frontier-model rates and self-reported numbers from production teams:

Layer                     Latency added         Token cost added                Typical residual reduction
1: Source grounding       <50 ms                minimal (no model call)         70-90% of numeric confabulations
2: Self-consistency N=5   2-5x baseline         ~5x (2.5-3x with caching)       65-75% of remaining factual errors
3: External verification  100-500 ms per call   minimal (deterministic)         near-100% on covered claim types
4: Adversarial probes     0 (offline)           0 (offline)                     catches regressions before deploy

The composite false-positive rate (valid claims flagged for review) determines whether the human-in-the-loop step is sustainable. A 1 to 3% composite FPR ships without dedicated reviewer headcount; above 5%, the review queue becomes a staffing problem. Tuning per-layer thresholds is the iterative loop, and the Agent Skill Tester is the right surface for it: each adjusted threshold runs the full adversarial corpus and surfaces the FPR shift.

Composite recall follows the inverse curve. Layer 1 alone on a typical 10-Q extraction workload catches roughly 70% of hallucinations at near-zero added cost; adding layer 2 lifts that to roughly 85%; adding layer 3 on the high-stakes subset closes most of the remaining gap. The last few percent are the residual the human-in-the-loop step absorbs.

Worked example: 10-Q research agent

The agent is asked: "Extract recurring revenue, gross margin, and operating cash flow from the most recent 10-Q for ticker X, plus a one-paragraph summary of segment performance." Retrieved context is the EDGAR filing.

The model produces structured JSON with three numbers and a free-form paragraph. Layer 1 runs span grounding: recurring revenue and gross margin match cleanly; operating cash flow is flagged ungrounded (model says $412M, the filing discloses $407M plus $5M of receivables movement the model implicitly absorbed).

Layer 2 triggers on the ungrounded operating cash flow. Five samples at temperature 0.3 yield $407M, $407M, $412M, $407M, $409M. Relative dispersion roughly 0.5%; majority value $407M. The agent rewrites the output and continues.
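
Feeding those samples through the layer-2 scorer from earlier:

samples = [407.0, 407.0, 412.0, 407.0, 409.0]
print(self_consistency_numeric(samples))
# {'verdict': 'consistent', 'mean': 408.4, 'rel_dispersion': 0.0047...}
# Under the 1% default threshold; the output is rewritten to the majority value, 407.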

Layer 3 applies to gross margin (position-sizing-adjacent in the downstream allocator). Deterministic recomputation from XBRL-tagged gross profit and revenue agrees with the model's value within 0.05 percentage points. Pass. The segment-performance paragraph is scored by Layer 2 alone (semantic similarity across samples, threshold 0.82). Pass. The composite output, with audit trail recording each verdict and any rewrites, returns to the caller.

If Layer 3 had disagreed with the model's gross margin, the pipeline escalates: the claim is logged, the position-sizing pipeline pauses for that ticker, and a notification fires to the on-call analyst. The audit trail captures the model output, all layer verdicts, the verifier's value, and the timestamp. That trail is the recordkeeping artifact that satisfies the audit obligation, regardless of how the analyst eventually rules.

The false-positive cost is bounded: the on-call is paged at the composite FPR, a few times per week per pipeline. The false-negative cost (a hallucinated value that survives all four layers) is what the stack is engineered to drive toward zero.

The human-in-the-loop on red flag pattern

Two design principles make the pattern sustainable in production.

Asymmetric escalation. Only red flags page. Layer-1 misses that layers 2 and 3 resolve never reach a human. The ratio of model claims to escalations should run between 100:1 and 1000:1: lower, and the queue grows faster than the team can drain it; higher, and reviewer capacity sits idle.

Append-only audit trail. Every verdict, every rewrite, every escalation lands in a structured log keyed by trace ID. The log shape mirrors the postmortem template in Postmortem Template for LLM Trading Systems. When a regulator asks how a number entered the system, the answer is "trace ID 0xabc123, full pipeline visible in the log". That is the recordkeeping bar.
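
A sketch of one log entry's shape; field names are illustrative, not a prescribed schema:

import json
import time

def audit_record(trace_id: str, claim: str, verdicts: dict, rewrite: str | None) -> str:
    """One append-only JSON line per claim, keyed by trace ID."""
    return json.dumps({
        "trace_id": trace_id,
        "ts": time.time(),
        "claim": claim,
        "verdicts": verdicts,   # e.g. {"layer1": "ungrounded", "layer2": "consistent"}
        "rewrite": rewrite,     # None if the original claim stood
    })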

A pipeline that runs all four layers, escalates appropriately, and logs deterministically is the minimum credible stack for an LLM-driven research or extraction system on financial data. Anything less is a research demo that has not earned production status.

References

  • Vu, T., et al. (2023). "FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation." arXiv:2310.03214.
  • Lin, S., Hilton, J., & Evans, O. (2022). "TruthfulQA: Measuring How Models Mimic Human Falsehoods." ACL 2022.
  • ESMA (2024). "Guidelines on the use of AI in financial services." Recordkeeping and audit-trail expectations.
  • SEC Division of Examinations (2024). "Risk Alert: Observations from Examinations of Investment Advisers Concerning the Use of Predictive Data Analytics and Artificial Intelligence."

Footnotes

  1. Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., & Zhou, D. (2022). "Self-Consistency Improves Chain of Thought Reasoning in Language Models." arXiv:2203.11171.

  2. Manakul, P., Liusie, A., & Gales, M. J. F. (2023). "SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models." EMNLP 2023.