TL;DR

A confidence score that an LLM emitted in February is not the same calibration-quality signal as the same score emitted in May. Three mechanisms drive the drift: provider-side tuning (an unannounced refresh of safety, helpfulness, or sampling defaults shifts the model's confidence-vs-correctness mapping; observed in published audits at 3-12% Brier-score deltas across 60-day windows), prompt-environment drift (a stable prompt body running against gradually shifting input distributions starts producing systematically over-confident or under-confident outputs), and selection drift (your live workflow only routes the cohort that survived earlier gates, so the calibration test set is now a biased subsample). The detection layer is a rolling reliability check: bucket every emitted probability into deciles, log the actual outcome, compute observed-vs-predicted per bucket on a 30-day rolling window, and alert when any bucket's calibration gap shifts by more than 1.96 standard errors against a trailing baseline. The re-calibration math is isotonic regression or Platt scaling, applied to the most recent N observations and refit weekly. A solo workflow that started with a Brier score of 0.18 in February and slid to 0.27 by May with no apparent prompt or model change recovered to 0.19 in two weeks once isotonic recalibration was added between the model output and the sizing engine. Audit production calibration with the Calibration Dojo, and use the Prompt Regression Tester to verify whether a confidence drift is the model, the prompt, or the environment before applying a fix to the wrong layer.

What "calibration" means here

A confidence score is calibrated when, of all the predictions the system labelled "70% probable," roughly 70% turned out true. Reliability is a per-bucket property, measured across many predictions; a single forecast cannot be calibrated or miscalibrated in isolation. The standard scoring rule is the Brier score (mean squared error between predicted probability and binary outcome), with miscalibration read off a reliability diagram of predicted-vs-observed frequency per bucket.
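A minimal sketch of the scoring rule, assuming predictions are logged as (predicted_probability, binary_outcome) pairs:

def brier_score(pairs):
    """Mean squared error between predicted probability and 0/1 outcome."""
    return sum((p - o) ** 2 for p, o in pairs) / len(pairs)

# The naive always-0.5 baseline on a 50-50 outcome stream scores 0.25,
# the benchmark referenced below.
assert brier_score([(0.5, 1), (0.5, 0)]) == 0.25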

For LLM outputs the typical confidence emission is one of three shapes: an explicit numeric probability ("78%"), a coarse categorical label ("HIGH conviction" mapped to 0.80), or an ordinal scale (1-5 or 1-10) mapped to a probability table. All three suffer from the same drift modes, but the numeric-emission case is easiest to instrument and the categorical case is where drift hides longest before being noticed.
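For the categorical and ordinal shapes the bridge to probabilities is a plain lookup table; the specific values below are illustrative assumptions (only the HIGH -> 0.80 mapping comes from the example above), and the table should be versioned alongside the prompt so edits to it are auditable:

# Illustrative label-to-probability tables; the values are assumptions, not a standard.
CONVICTION_TO_P = {"LOW": 0.55, "MEDIUM": 0.68, "HIGH": 0.80}
ORDINAL_5_TO_P = {1: 0.50, 2: 0.58, 3: 0.66, 4: 0.74, 5: 0.82}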

The benchmark to beat is not zero. A naive baseline of always emitting 0.5 produces a Brier score around 0.25 on a 50-50 binary outcome stream; a useful trading-decision classifier should be at 0.18-0.22, and the difference between 0.18 and 0.22 is the difference between a strategy that compounds and one that does not after costs.

How calibration drifts in production

Mechanism 1: provider-side tuning. Frontier model providers ship continuous updates. Some are version-bumped (Claude 4.6 -> 4.7, GPT-5 -> GPT-5-Pro); some are silent revisions to the same version string (safety tuning, reasoning-trace defaults, classifier thresholds). The published version string does not capture all of them. A prompt that exploited a specific behavioural quirk of the older revision now hits a slightly different distribution.

The empirical evidence is subtle but real. Published audits across 2024-2025 (notably the academic and grey-literature work tracking version-pinned API responses on a fixed input set) show 3-12% Brier-score swings across 60-day windows on the same model string and same prompt. The cause is rarely declared by the provider; the effect is observable downstream and can wreck a workflow that assumed stable calibration.

Mechanism 2: prompt-environment drift. The prompt body has not changed. The system prompt is the same. The model version string is pinned. But the input distribution (the filings the agent reads, the news flow, the macro regime) has shifted. A research prompt calibrated against a low-volatility regime starts producing over-confident outputs in a high-volatility regime, because the model's confidence is anchored to the language of its inputs and high-volatility news is written in more confident-sounding language.

This mechanism is the most insidious because it produces a smooth, gradual decline. There is no version event to point at, no diff to inspect. The confidence-vs-correctness curve drifts week by week as the underlying environment evolves. The detection has to be statistical, not behavioural.

Mechanism 3: selection drift. A live workflow funnels candidates through gates: research output, calibration check, sizing test. Only the survivors reach the executed-trade set. When you compute calibration on the executed set, you are computing it on a sub-sample selected for the same characteristics that calibration is supposed to measure. The bucket-wise distribution of executed trades is not the bucket-wise distribution of model emissions; the calibration computed on the former does not generalise to the latter.

The fix is to compute calibration on a held-out random sample of all model emissions, not just the executed cohort. This requires logging predictions that did not turn into trades and the ground-truth outcomes for those predictions (price action over the relevant window even though no position was taken). The discipline cost is small; the analytical correctness gain is large.
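A minimal sketch of that discipline; the store, field names, and executed flag are assumptions aligned with the logging contract given later:

# Log every emission, executed or not; outcomes are resolved later for BOTH cohorts.
def log_emission(store, run_id, ticker, predicted_probability, executed):
    store.append({
        "run_id": run_id,
        "ticker": ticker,
        "predicted_probability": predicted_probability,
        "executed": executed,          # survived the gates or not
        "outcome_at_horizon": None,    # filled in when the horizon elapses
    })

def calibration_pairs(store):
    # Calibration is computed on ALL resolved emissions, not the executed subset.
    return [(r["predicted_probability"], r["outcome_at_horizon"])
            for r in store if r["outcome_at_horizon"] is not None]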

A worked drift detection: 90 days of a research agent

The agent: a Sonnet-tier model emitting binary "up over the next 5 trading days" predictions with a probability between 0.5 and 0.95. The window: 2026-02-05 through 2026-05-05, ~640 predictions across ~330 distinct tickers. Outcome attribution: actual 5-day forward return sign.

Initial calibration (week 1-4). Brier score 0.184. Reliability diagram clean: bucket-wise predicted probability matches empirical frequency within one standard error in every decile.

Mid-period (week 5-8). Brier score 0.197. Reliability diagram: the 0.6-0.7 and 0.7-0.8 buckets are both 4-6 percentage points lower in empirical frequency than predicted, but within two standard errors, so a naive threshold check would not alert.

End-period (week 9-13). Brier score 0.273. Reliability diagram: every bucket from 0.65 upward is over-confident, the 0.85-0.95 bucket has empirical frequency around 0.62, the predicted-vs-observed gap is now 4-7 standard errors in the highest buckets.

The Brier-score breakdown across the two halves of the period, via the Murphy decomposition (Brier = reliability - resolution + uncertainty):

              Brier   Reliability   Resolution   Uncertainty
Weeks  1-6:   0.187      0.015         0.078        0.250
Weeks  7-13:  0.247      0.068         0.071        0.250

Resolution (the model's ability to discriminate between cases) is roughly stable. Reliability degraded sharply. The model is still picking the right direction at roughly the same rate; it is reporting too-high confidence in those picks. Sizing engines that scale position size with confidence (Kelly, fractional-Kelly) silently take larger positions on the over-confident half and lose money even when the directional call is good.

This is the textbook signature of calibration drift, and it is the one that an output-only QA process misses entirely because the directional accuracy looks fine.
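For reference, a minimal sketch of the decomposition used in the table above, binning by decile as elsewhere in this article; with real forecasts spread inside each bin the identity holds only approximately, and the helper name is mine:

def murphy_decomposition(pairs, n_bins=10):
    """Return (reliability, resolution, uncertainty) for (predicted, observed) pairs."""
    n = len(pairs)
    base_rate = sum(o for _, o in pairs) / n
    uncertainty = base_rate * (1 - base_rate)
    reliability = resolution = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        binned = [(p, o) for p, o in pairs
                  if lo <= p < hi or (b == n_bins - 1 and p == 1.0)]
        if not binned:
            continue
        w = len(binned) / n                         # bucket weight
        p_bar = sum(p for p, _ in binned) / len(binned)
        o_bar = sum(o for _, o in binned) / len(binned)
        reliability += w * (p_bar - o_bar) ** 2     # squared calibration gap
        resolution += w * (o_bar - base_rate) ** 2  # discrimination between buckets
    return reliability, resolution, uncertainty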

Detection

The minimum-viable detection layer is one rolling computation per day plus an alert.

Logging contract. For every prediction emitted by the agent, persist (run_id, ticker, asof_ts, predicted_probability, model_version, prompt_sha256, outcome_at_horizon, outcome_resolved_ts). Outcome can be resolved asynchronously when the horizon elapses. The store has to be queryable on a 30-day window plus older for backfill.
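As a concrete shape for the contract, a dataclass with one field per element of the tuple above (the class itself is an assumption; any queryable store with these columns works):

from dataclasses import dataclass
from typing import Optional

@dataclass
class PredictionRecord:
    run_id: str
    ticker: str
    asof_ts: str                                # emission timestamp, ISO-8601
    predicted_probability: float
    model_version: str
    prompt_sha256: str
    outcome_at_horizon: Optional[int] = None    # 0/1, resolved asynchronously
    outcome_resolved_ts: Optional[str] = None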

Rolling reliability check. Each day, compute predicted-vs-observed frequency in each of 10 deciles for the trailing 30 days. Compare to the corresponding 30-day window from 60 days ago (the baseline). For each decile, alert if |(current_observed - current_predicted) - (baseline_observed - baseline_predicted)| > 1.96 × SE, where the SE pools the binomial variance of both windows, since the test statistic is a difference between two proportions. The 1.96 multiplier is the standard 95% normal-approximation cutoff; tighten to 2.58 (99%) for quieter alerting.

def reliability_drift(current, baseline, z=1.96, min_n=20):
    """
    current, baseline: lists of (predicted, observed) tuples for the trailing
    30-day window and the 30-day window from 60 days ago.
    Returns the deciles where the calibration gap shifted significantly.
    """
    drift = []
    for d in range(10):
        lo, hi = d / 10, (d + 1) / 10
        # Closed upper edge on the last decile so p == 1.0 is not dropped.
        cur = [(p, o) for p, o in current
               if lo <= p < hi or (d == 9 and p == 1.0)]
        base = [(p, o) for p, o in baseline
                if lo <= p < hi or (d == 9 and p == 1.0)]
        if len(cur) < min_n or len(base) < min_n:
            continue  # tiny buckets alert on noise, not signal
        cur_obs = sum(o for _, o in cur) / len(cur)
        base_obs = sum(o for _, o in base) / len(base)
        cur_pred = sum(p for p, _ in cur) / len(cur)
        base_pred = sum(p for p, _ in base) / len(base)
        # Shift in the calibration gap between the two windows.
        delta = (cur_obs - cur_pred) - (base_obs - base_pred)
        # SE of a difference of proportions: pool the binomial variance of both windows.
        se = (cur_obs * (1 - cur_obs) / len(cur)
              + base_obs * (1 - base_obs) / len(base)) ** 0.5
        if se > 0 and abs(delta) > z * se:
            drift.append((lo, hi, delta, se, len(cur)))
    return drift

This gives a per-decile signal of where calibration moved. Three or more deciles drifting in the same direction is structural; one decile drifting in isolation is usually noise from a small bucket.
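Wiring it up is one query per window; a sketch assuming a hypothetical fetch_window accessor over the predictions table (offsets in days before today):

current = fetch_window(start_days_ago=30, length_days=30)   # trailing 30 days
baseline = fetch_window(start_days_ago=90, length_days=30)  # window ending 60 days ago
for lo, hi, delta, se, n in reliability_drift(current, baseline):
    print(f"decile [{lo:.2f}, {hi:.2f}): gap shifted {delta:+.3f} "
          f"({abs(delta) / se:.1f} SE, n={n})")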

Audit surface. The Calibration Dojo computes the rolling reliability diagram and flags drifting deciles directly from a logged prediction stream. It does not replace your production logging; it consumes the same store and gives the daily diagnostic without writing one more cron.

Distinguishing the cause. Once drift is detected, the next question is whether the cause is the model, the prompt, or the environment. The Prompt Regression Tester re-runs the production prompt against the production model on a held-out fixture set from before the drift began. If the held-out set still calibrates correctly, the cause is environmental drift, not a silent provider change. If the held-out set also calibrates poorly, the cause is upstream of the prompt: either the model has shifted or the prompt has been edited (verify against the prompt-hash log).
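Stated as code for precision, with the cheap prompt-hash check first; the inputs are stand-ins for whatever your fixture rerun and prompt-hash log expose, and the tolerance is an assumed noise floor, not a standard:

def attribute_drift(fixture_pairs, baseline_brier, prompt_hash_changed, tol=0.02):
    if prompt_hash_changed:
        return "prompt edit: roll back and re-test"
    if abs(brier_score(fixture_pairs) - baseline_brier) <= tol:
        return "environment drift: recalibrate downstream"
    return "silent model revision: recalibrate, consider pinning or escalating"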

Re-calibration math

Once drift is confirmed and the cause is not "the prompt regressed and should be rolled back," the fix is post-hoc recalibration. Two methods cover almost all cases.

Isotonic regression. Fit a monotone non-decreasing function from raw predicted probabilities to recalibrated probabilities, using recent (predicted, observed) pairs. The fit is non-parametric and robust to the shape of the miscalibration. A minimal implementation with scikit-learn's isotonic solver:

from sklearn.isotonic import IsotonicRegression
import numpy as np

# (predicted, observed) pairs for the recalibration window; sklearn sorts internally
x_train = np.array([p for p, _ in recent_outcomes])
y_train = np.array([o for _, o in recent_outcomes])

cal = IsotonicRegression(out_of_bounds="clip")
cal.fit(x_train, y_train)

# at inference time
def recalibrate(p_raw):
    return float(cal.transform(np.array([p_raw]))[0])

Window size: 200-400 observations is the sweet spot for most workflows. Smaller windows track recent drift faster but are noisier; larger windows are stable but slow to react. Refit weekly on a rolling window.

Platt scaling. Logistic regression on raw scores. Fewer parameters than isotonic, hence less overfitting on small samples, but assumes a sigmoid-shaped miscalibration. Useful when you have under 100 observations in the calibration set; default to isotonic above 200.
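A minimal Platt-scaling sketch, reusing x_train and y_train from the isotonic block above and fitting sklearn's logistic regression on the log-odds of the raw probability; fitting on log-odds is an assumed choice here, and the default l2 penalty is mild regularisation rather than Platt's original unregularised fit:

import numpy as np
from sklearn.linear_model import LogisticRegression

def to_logit(p, eps=1e-6):
    p = np.clip(p, eps, 1 - eps)
    return np.log(p / (1 - p))

platt = LogisticRegression()
platt.fit(to_logit(x_train).reshape(-1, 1), y_train)

def recalibrate_platt(p_raw):
    return float(platt.predict_proba([[to_logit(p_raw)]])[0, 1])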

Where to apply the fix. The recalibration goes between the model output and the sizing engine, not inside the prompt. Asking the model to "self-calibrate" by injecting calibration history into the prompt is unreliable and expensive; a deterministic post-hoc fit is robust and cheap.

model -> raw_probability -> isotonic_recalibrator -> calibrated_probability -> sizing

The calibrator is a small Python object serialised to the same registry as the prompt, with its own version. A weekly refit produces a new calibrator version with the same audit log discipline as a prompt edit.
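A minimal versioning sketch with joblib; the registry path layout and agent name are assumptions:

import datetime
import joblib

version = datetime.date.today().isoformat()
path = f"registry/calibrators/research_agent/{version}.joblib"
joblib.dump(cal, path)    # immutable, versioned artefact, same discipline as a prompt edit

# The sizing path pins an explicit calibrator version at load time.
cal = joblib.load(path)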

Pitfalls

Recalibrating without verifying resolution is intact. If resolution (the model's ability to discriminate) has collapsed alongside reliability, recalibration cannot help. A model that produces near-uniform probabilities on every input is uninformative, and remapping uniform probabilities to a different uniform shape does not add information. Check that high-confidence buckets still have higher empirical frequencies than low-confidence buckets before applying recalibration; if they do not, the issue is upstream.
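A minimal pre-flight check before any refit, using the same decile bucketing; the 60% rising-pairs threshold is a judgment call, not a standard:

def resolution_intact(pairs, n_bins=10, min_rising=0.6):
    """True if empirical frequency still broadly rises with predicted probability."""
    means = []
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        bucket = [o for p, o in pairs
                  if lo <= p < hi or (b == n_bins - 1 and p == 1.0)]
        if bucket:
            means.append(sum(bucket) / len(bucket))
    rises = sum(b > a for a, b in zip(means, means[1:]))
    return len(means) > 1 and rises >= min_rising * (len(means) - 1)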

Treating each prompt independently. Calibration drift is correlated across prompts that share a system prompt or tool block. Refitting each prompt's calibrator in isolation wastes data. Pool predictions across prompts that share the same system surface, fit a per-system-prompt calibrator, and use the per-prompt error only to detect prompt-specific issues.
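A sketch of the pooling, assuming each resolved record also carries a hash of its system prompt (a field beyond the minimal logging contract above):

from collections import defaultdict
from sklearn.isotonic import IsotonicRegression

by_system = defaultdict(list)
for rec in resolved_records:   # hypothetical iterable of resolved predictions
    by_system[rec.system_prompt_sha256].append(
        (rec.predicted_probability, rec.outcome_at_horizon))

calibrators = {}
for system_hash, pairs in by_system.items():
    iso = IsotonicRegression(out_of_bounds="clip")
    iso.fit([p for p, _ in pairs], [o for _, o in pairs])
    calibrators[system_hash] = iso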

Survivorship in the recalibration window. The recalibration set has to include all emissions, not just executed trades. If only executed trades reach the calibrator, the selection drift described above is corrected only for the survivor cohort, leaving the rest of the workflow uncalibrated. Log all model emissions and outcomes; compute on the full set.

Cliff edges at the boundary. A workflow that checks "is calibrated probability >= 0.65" gets a discontinuity at 0.65. A small recalibration shift around that threshold can flip a large number of decisions in one direction. Mitigate with a smoothed sizing function (Kelly is naturally smooth) rather than a hard threshold; or apply the calibrator's update with a damping factor (e.g. 70% new + 30% prior calibrator) for the first week after a refit.
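The damping is a one-line blend over the two fitted calibrators; the 70/30 weights come from the text above:

import numpy as np

def damped_recalibrate(p_raw, new_cal, old_cal, w=0.7):
    """Blend refit and prior calibrators to soften threshold flips after a refit."""
    x = np.array([p_raw])
    return float(w * new_cal.transform(x)[0] + (1 - w) * old_cal.transform(x)[0])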

Ignoring the regression in the prompt-edit log. If a prompt edit shipped the same week the drift began, the cause is almost certainly the edit. Recalibration covers it temporarily; rolling back the edit fixes it permanently. A drift-detection alert that does not reference the prompt-edit log produces wrong root-cause attributions.

Putting it together

Daily: rolling-30-day reliability check across all production prompts; alert on per-decile drift > 1.96 SE versus the 60-days-ago baseline. The check runs in seconds against an indexed predictions table.

Weekly: refit the per-system-prompt isotonic calibrator on the trailing 200-400 observations. Save the new calibrator under a fresh version with a timestamp; deploy via the same gated promotion path as a prompt edit.

Monthly: review the calibrator's version history for trend. A calibrator that has been monotonically pulling probabilities down for three months is a sign that the underlying prompt is producing systematic over-confidence; the right fix at that point is a prompt revision via the Prompt Regression Tester, not another calibration patch.

The combined discipline is what separates a workflow that quietly bleeds money for a quarter from one that catches the leak in the week it opens. The drift is real, the math is solved, and the tooling exists; the missing piece is usually the logging contract, which is the cheapest part to put in place.

References

  1. Brier, G. W. (1950). "Verification of forecasts expressed in terms of probability." Monthly Weather Review 78(1), 1-3. The original definition of the Brier score and the foundation for reliability decomposition.
  2. Murphy, A. H. (1973). "A new vector partition of the probability score." Journal of Applied Meteorology 12(4), 595-600. The reliability/resolution/uncertainty decomposition used in the worked example above.
  3. Platt, J. (1999). "Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods." Advances in Large Margin Classifiers, MIT Press. The original logistic-regression-based calibration approach.
  4. Zadrozny, B., & Elkan, C. (2002). "Transforming classifier scores into accurate multiclass probability estimates." Proceedings of KDD '02. The isotonic-regression calibration method.
  5. Guo, C., Pleiss, G., Sun, Y., & Weinberger, K. Q. (2017). "On calibration of modern neural networks." Proceedings of ICML 2017. The empirical demonstration that modern neural nets are systematically over-confident, which generalises directly to LLM-emitted probabilities.
  6. Anthropic. (2025). "Claude model behaviour notes." Anthropic Documentation. Provider acknowledgment that fine-tuning updates may shift confidence-related behaviour without a version bump, motivating the silent-revision detection layer.