TL;DR

Anthropic, OpenAI, and Google can all break or price-jump in one quarter. The 2025–2026 record is unambiguous: each of the three frontier providers has had a multi-hour outage, a model deprecation that broke pinned production code, or an unscheduled price change in the last 18 months. A trading agent that depends on a single provider takes those events as direct downtime. The fallback-chain architecture that survives a single-vendor outage has three components: a provider-agnostic abstraction layer, a routing policy with explicit health and latency thresholds, and a regression test suite that validates the secondary provider produces equivalent output before traffic moves. The architectural cost is modest (1.5–2× engineering time on the agent boundary, 1.1–1.3× ongoing cost from regression-testing the fallback path); the operational benefit is that one provider going down stops being a P0. Use the Model Selector for Finance to pick the primary/fallback pairs that share output schemas, and the Fallback Chain Simulator to stress-test the routing policy against historical outage patterns.

What "vendor lock-in" actually costs

The phrase is overused. The specific costs of single-provider dependency for an LLM trading agent are concrete and measurable.

Outage exposure. Anthropic's 2025-08-14 incident (~3 hours of degraded service across multiple regions) and the 2026-01-22 OpenAI Realtime API outage (4 hours, US-East) each took down agent traffic completely for the duration. A trading agent that holds positions during a model outage either over-trusts stale signals or unwinds positions blindly; both produce real losses.

Pricing power. OpenAI's 2024-12 price reduction on GPT-4o, Anthropic's 2025-06 prompt-caching pricing change, Google's 2025-09 Gemini 2.5 tier restructuring: each was unilateral, with under 30 days notice. A pipeline that pencils out at one set of rates can stop penciling out at another. Without an alternative, the negotiation is one-sided.

Deprecation timing. OpenAI deprecated gpt-4-32k with 12 months of notice; Anthropic deprecated Claude 2.1 with 6 months; Google deprecated Gemini 1.0 Pro with 6 months. Each deprecation forces a regression-test cycle and, often, a prompt rewrite. Without a parallel deployment on a second provider, the migration is high-stakes.

Capability regression on upgrades. A new model version sometimes regresses on specific tasks. The OpenAI o1 release was widely reported to regress on instruction-following relative to GPT-4o for some agent workloads. Without a fallback, the only options are stay on the deprecated model (limited window) or accept the regression.

The vendor lock-in cost is the sum of these; the architectural fix is the fallback chain.

The reference architecture

Three layers separate cleanly.

Layer 1: provider-agnostic abstraction. The agent code calls a generic interface, generate_completion(messages, tools, params), that returns a normalized response. The interface hides Anthropic's content blocks, OpenAI's tool calls, and Google's function calls behind one schema. Concretely, this is a 200–400 line adapter per provider; the interface itself is 50–80 lines. Open-source LiteLLM and the Vercel AI SDK both ship a version of this adapter; for production trading code I have seen teams prefer a hand-rolled adapter that is auditable end-to-end over a dependency that updates on its own schedule.
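
A minimal sketch of the interface, with illustrative names (NormalizedResponse and its fields are our own naming for this sketch, not any provider's schema):

from dataclasses import dataclass
from typing import Any, Protocol

@dataclass
class NormalizedResponse:
    text: str | None        # free-text content, if any
    tool_calls: list[dict]  # [{"name": ..., "arguments": {...}}], arguments always parsed JSON
    stop_reason: str        # normalized: "stop" | "tool_use" | "length" | "refusal"
    usage: dict[str, int]   # {"input_tokens": ..., "output_tokens": ...}

class Provider(Protocol):
    name: str
    def generate_completion(self, messages: list[dict], tools: list[dict],
                            params: dict[str, Any]) -> NormalizedResponse: ...

Each adapter implements Provider with the native SDK and nothing else; all normalization lives in the adapter, none in the agent code.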

Layer 2: routing policy. A function that takes the request, the current health state of each provider, the recent latency profile, and outputs the provider to call. The policy is small (50–150 lines), explicit, and unit-tested. It does not "learn"; learned routing is a separate problem with its own failure modes that you do not want on the critical path.

Layer 3: regression and parity tests. A test suite that runs the same prompts against the primary and fallback providers and asserts behavioral parity within a documented tolerance. This is where most fallback chains break in practice: the fallback "works" when the primary is up but produces subtly different output when activated, and the agent downstream chokes. The Prompt Regression Tester is the in-pipeline version of this check; the audit-time version is a CI gate that runs nightly on a fixed eval set.
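
A sketch of that gate, assuming the Provider interface above; the structural comparison in outputs_equivalent is one reasonable definition of parity, not the only one:

def run_parity_suite(primary, secondary, eval_set, tolerance=0.05):
    """eval_set: list of (messages, tools, params) tuples recorded from production."""
    diverged = []
    for case_id, (messages, tools, params) in enumerate(eval_set):
        a = primary.generate_completion(messages, tools, params)
        b = secondary.generate_completion(messages, tools, params)
        if not outputs_equivalent(a, b):
            diverged.append(case_id)
    rate = len(diverged) / len(eval_set)
    assert rate <= tolerance, f"parity drift {rate:.1%}, first cases: {diverged[:10]}"

def outputs_equivalent(a, b):
    # Compare structure, not wording: the same tool called with the same
    # arguments counts as parity even when the surrounding text differs.
    calls = lambda r: [(c["name"], c["arguments"]) for c in r.tool_calls]
    return calls(a) == calls(b)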

Behind these layers, the actual provider calls happen with each provider's native SDK. Do not try to share code paths inside the SDKs themselves; the SDKs change, and the cost of fighting the SDK is higher than the cost of two separate adapter layers.

Routing policy

The routing decision happens at three levels of granularity.

Per-request. The policy runs on every call. It checks three conditions in order: the primary provider is healthy (recent error rate below threshold), the primary's recent P95 latency sits inside the budget, and the request type is one the primary handles well. If all three hold, call primary. If any fails, fall through to the secondary.

Per-circuit. A circuit-breaker around the primary. If recent calls to the primary have failed at a rate above the threshold (typically 10% over 60 seconds, configurable), the circuit opens and all traffic goes to the secondary for a configurable cooling period (typically 5 minutes). This prevents the per-request policy from oscillating during a partial outage.

Per-deployment. A manual override for migrations, tests, and emergencies. A deployment configuration variable forces all traffic to one provider regardless of health. This is the lever the on-call engineer pulls during an incident.

A simple version, in pseudocode:

class FallbackRouter:
    def __init__(self, providers, health, circuit_breaker, override):
        self.providers = providers  # ordered list: primary, secondary, ...
        self.health = health  # tracks recent error rate per provider
        self.circuit = circuit_breaker  # open/closed per provider
        self.override = override  # None or a forced provider name

    def select(self, request):
        if self.override:
            return self.override
        for p in self.providers:
            if self.circuit.is_open(p):
                continue
            if self.health.error_rate(p, window_sec=60) > 0.10:
                continue
            if self.health.p95_latency(p, window_sec=60) > request.latency_budget:
                continue
            return p
        # All providers unhealthy or circuit-open. Fall through to the last-resort provider.
        return self.providers[-1]

    def record(self, provider, success, latency_ms):
        self.health.record(provider, success, latency_ms)
        if not success:
            self.circuit.maybe_open(provider)
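
The health and circuit_breaker collaborators are left abstract above. A minimal breaker consistent with the thresholds in the text (10% errors over 60 seconds, 5-minute cooldown), sharing the router's health tracker, might look like this; the wiring is illustrative:

import time

class CircuitBreaker:
    def __init__(self, health, error_threshold=0.10, window_sec=60,
                 cooldown_sec=300):
        self.health = health    # the same tracker the router reads
        self.error_threshold = error_threshold
        self.window_sec = window_sec
        self.cooldown_sec = cooldown_sec
        self.opened_at = {}     # provider -> time the circuit opened

    def is_open(self, provider):
        opened = self.opened_at.get(provider)
        if opened is None:
            return False
        if time.time() - opened >= self.cooldown_sec:
            del self.opened_at[provider]  # cooldown elapsed: close and retry
            return False
        return True

    def maybe_open(self, provider):
        # Called by the router after every failed call.
        rate = self.health.error_rate(provider, window_sec=self.window_sec)
        if rate > self.error_threshold:
            self.opened_at[provider] = time.time()

Closing the circuit outright after the cooldown, rather than probing with a half-open state, keeps the policy boring at the cost of occasionally re-opening one failure later.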

The policy is intentionally boring. Boring policies have predictable failure modes, which is what an on-call engineer wants at 3am.

Output parity: the harder problem

Two providers can both produce a "valid response" to the same prompt and produce subtly different output that breaks the downstream pipeline.

The categories that bite most often:

Tool-call schema differences. Anthropic emits tool_use content blocks with input as parsed JSON; OpenAI emits tool_calls with arguments as a JSON-encoded string; Google emits functionCall objects with args as parsed objects. The adapter layer normalizes these, but the normalization has corner cases (unicode escapes, integer-vs-float type coercion, optional-field handling) that show up only on real production traffic.
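
A sketch of the normalization for two of the three schemas, with the corner cases handled explicitly (the function names are ours; the response shapes follow each provider's documented formats, but verify them against the SDK version you pin):

import json

def normalize_openai_tool_call(tc):
    # OpenAI: arguments arrive as a JSON-encoded string and must be parsed.
    # strict=False tolerates unescaped control characters that occasionally
    # survive in model output.
    return {
        "name": tc["function"]["name"],
        "arguments": json.loads(tc["function"]["arguments"], strict=False),
    }

def normalize_anthropic_tool_use(block):
    # Anthropic: input is already parsed JSON; pass it through.
    return {"name": block["name"], "arguments": block["input"]}

def coerce_types(arguments, schema):
    # Corner case: one provider emits 5, another 5.0, for the same field.
    # Coerce numerics to the declared type before downstream code compares them.
    for key, spec in schema.get("properties", {}).items():
        if key in arguments and spec.get("type") == "integer":
            arguments[key] = int(arguments[key])
    return arguments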

Output formatting drift. A Claude prompt that reliably emits a JSON object can, on GPT-5, emit JSON wrapped in a code fence or preceded by a one-sentence preamble. Both are valid model outputs; only one parses. The fix is to be strict about output schema (use structured-output features where available) and to validate every parsed output before downstream code consumes it.
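
A defensive parse that accepts both shapes, as a minimal sketch (the fence-stripping regex is an assumption about the failure mode described above, not a guarantee of coverage; the required-field check is illustrative):

import json
import re

def parse_model_json(raw: str) -> dict:
    """Extract a JSON object from output that may be fenced or preceded by a preamble."""
    fenced = re.search(r"```(?:json)?\s*(\{.*\})\s*```", raw, re.DOTALL)
    candidate = fenced.group(1) if fenced else raw
    # Fall back to the outermost braces if a preamble precedes the object.
    start, end = candidate.find("{"), candidate.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object in model output")
    parsed = json.loads(candidate[start:end + 1])
    if "action" not in parsed:  # validate required fields before downstream use
        raise ValueError("parsed output missing required fields")
    return parsed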

Reasoning depth differences. Sonnet-class and GPT-5-class models reason at similar capability on most finance tasks; their "small" tier siblings (Haiku, GPT-5-mini) are not interchangeable on tasks that need multi-step inference. A fallback from Sonnet to GPT-5 is generally safe; a fallback from Sonnet to Haiku in the same provider chain is a different decision and should be made deliberately.

Refusal behaviour. Each provider has a slightly different safety trigger profile. A prompt that runs fine on Anthropic might trigger a refusal on Google or OpenAI for content the model considers a regulated-finance edge case. The fallback path has to handle the refusal without retrying in an infinite loop.
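
A sketch of the guard, assuming the router's select grows an optional exclude parameter (an extension to the router above) and that providers are the Provider objects from the abstraction layer; the refusal markers are illustrative and should be tuned per adapter:

REFUSAL_MARKERS = ("i can't help", "i cannot assist", "unable to comply")

def call_with_refusal_guard(router, request, max_providers=2):
    """A refusal disqualifies a provider for this request; never retry the same one."""
    tried = set()
    while len(tried) < max_providers:
        provider = router.select(request, exclude=tried)
        response = provider.generate_completion(
            request.messages, request.tools, request.params)
        if not looks_like_refusal(response):
            return response
        tried.add(provider.name)  # do not loop on the refusing provider
    raise RuntimeError("all providers refused; escalate to human review")

def looks_like_refusal(response):
    text = (response.text or "").lower()
    return response.stop_reason == "refusal" or \
           any(marker in text for marker in REFUSAL_MARKERS)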

The pragmatic test for output parity is a fixed eval set of 100–500 historical agent calls, run nightly through both providers, with a structured-output diff. If the diff exceeds a documented tolerance, alert. The Fallback Chain Simulator automates this check against historical outage patterns; the in-pipeline version is the Prompt Regression Tester running on every prompt change.

Choosing the chain

The pairs that work, and the ones that do not, for finance LLM workloads through Q2 2026.

Reliable primary/secondary pairs.

Claude Sonnet 4.7 → GPT-5. Closest capability match; the tool-call schema differences are the only real adapter work.
GPT-5 → Claude Sonnet 4.7. Symmetric; the capability gap is small in either direction.
Gemini 2.5 Pro → Claude Sonnet 4.7. The capability gap is wider; works for routing and classification, marginal for synthesis.

Pairs that look attractive but fail in practice.

Claude Opus 4.x → Claude Sonnet 4.x. Same provider; an Anthropic outage takes both down.
GPT-5 → GPT-5-mini. The capability gap is too wide for synthesis tasks; output quality drops below what the trading pipeline can tolerate.
Claude Sonnet → open-source 70B. The capability gap is real, and self-hosting brings operational complexity that defeats the simplification goal.

The single highest-impact rule: the primary and the secondary must be from different providers. Same-provider fallback gives no protection against the most common outage class. The Model Selector for Finance outputs primary/fallback pairs filtered by output-schema compatibility and capability tier so the pair is structurally viable before you wire it up.

Cost overhead

The naive fallback architecture has near-zero cost overhead in steady state, since the secondary is called only when the primary is down. The real overhead comes from regression testing.

A nightly regression run on 200 historical agent calls, each costing $0.02 against the primary and $0.02 against the secondary, produces $8 of test cost per night. Annualized: $2,920. Against a $50,000/year primary inference bill, that is 6% overhead. Worth it.
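
The arithmetic, spelled out:

calls_per_night = 200
cost_per_case = 0.02 + 0.02                  # $ per eval case: primary + secondary
nightly = calls_per_night * cost_per_case    # $8.00 per night
annual = nightly * 365                       # $2,920 per year
overhead = annual / 50_000                   # ~5.8% of a $50,000 inference bill
print(f"nightly=${nightly:.2f}  annual=${annual:,.0f}  overhead={overhead:.1%}")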

The cost overhead spikes during a real outage. With the primary down for 3 hours and the agent running 1,000 calls/hour at $0.03 per call on the secondary (slightly above the $0.02 primary rate, which is typical), the incident costs an extra $30 over what the primary would have charged. Negligible compared to the cost of the agent being down.

The hidden cost is engineering time on the boundary. The first cross-provider implementation costs 1.5–2× the time of a single-provider implementation; later additions of a third provider are cheaper because the abstraction is in place. Plan for this in the schedule. The team that ships single-provider in week one and adds a fallback in week three is more likely to ship a fragile fallback than one that gets exercised correctly.

Operational discipline around the chain

A fallback chain that is never exercised is not a fallback chain. The single most common failure mode is "the fallback path was wired six months ago, never triggered, and is silently broken when the primary finally goes down." Three operational practices keep the chain real.

Game-day exercises. Once a quarter, force the override variable to the secondary provider for 30 minutes during business-hours traffic. Watch the dashboards. The exercise surfaces output-parity drift that the nightly regression suite missed (because the eval set is not the same shape as production traffic) and reveals downstream services that depend on primary-specific behaviour. The first game day after wiring the chain typically finds three to five real bugs; subsequent quarters find one or two.

Per-provider cost dashboarding. The cost of the secondary path during normal operation is small but non-zero (regression tests, occasional fallthrough). The cost of the secondary during an actual incident spikes by an order of magnitude. Both numbers belong on the same dashboard so a pattern of slow, persistent secondary use is visible before the bill at end-of-month makes it visible.

Documentation that runs. The runbook for "provider X is down" is a script, not a wiki page. The script flips the override, validates the fallback is taking traffic, monitors latency on the secondary, and pages on degradation. Runbooks that require an engineer to remember the steps fail at 3am; scripts that codify the steps work at 3am.
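
A sketch of such a runbook, assuming the override from the routing section is a config key; ops_hooks and everything imported from it are illustrative placeholders for your config store and metrics stack:

#!/usr/bin/env python3
"""Runbook: fail over to the secondary provider. Usage: failover.py <provider>"""
import sys
import time

# Environment-specific hooks; wire these to your own config store and metrics.
from ops_hooks import (set_routing_override, traffic_share,
                       p95_latency, page_oncall)

LATENCY_BUDGET_MS = 4_000  # illustrative budget for the trading pipeline

def failover(secondary: str) -> None:
    set_routing_override(secondary)       # step 1: flip the override
    time.sleep(30)                        # step 2: let routing converge
    share = traffic_share(secondary, window_sec=60)
    if share < 0.95:                      # step 3: validate traffic moved
        page_oncall(f"failover incomplete: {secondary} carrying {share:.0%}")
        sys.exit(1)
    p95 = p95_latency(secondary, window_sec=300)
    if p95 > LATENCY_BUDGET_MS:           # step 4: watch the secondary's latency
        page_oncall(f"{secondary} P95 {p95}ms over budget during failover")
    print(f"failover to {secondary}: share={share:.0%}, p95={p95}ms")

if __name__ == "__main__":
    failover(sys.argv[1])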

The discipline is unglamorous; the value is that the architectural decision to have a fallback actually translates into the operational outcome of surviving an outage. Teams that wire the chain and skip the discipline find out, the hard way, that the wiring alone is not the protection.

When not to bother

Fallback chains are not free, and they are not always justified.

Pre-product-market-fit. Until the agent is shipping real value to real users, the engineering time is better spent on the agent itself. Add the fallback when the cost of an outage starts to dominate.

Tight latency budgets where the secondary cannot meet them. If the trading pipeline cannot tolerate the secondary's P95 latency, the secondary is not really a fallback; it is a degraded mode that the rest of the pipeline has to handle. Either harden the primary, or budget the latency to allow the secondary as a real fallback.

Workloads where one provider has a structural moat. Some tasks genuinely run only on one provider's model: long-context tasks above 200k tokens, tool-call patterns that depend on provider-specific behaviour, or reasoning tasks that exploit a particular fine-tune. For those workloads, fallback is not architecturally available; the answer is to design the agent so the moat-dependent path is the smallest possible surface, and the rest of the agent runs on the cross-provider abstraction.

The default for a production trading agent in 2026 is to have a fallback. The exceptions are small, narrow, and identified deliberately.

References

  1. Anthropic. "Status and Incident History." status.anthropic.com, accessed April 2026. Documents the 2025-08-14 outage and the standing model deprecation policy.
  2. OpenAI. "Status and Incident History." status.openai.com, accessed April 2026. Documents the 2026-01-22 Realtime API outage and the model deprecation timeline (gpt-4-32k, gpt-3.5-turbo-instruct).
  3. Google. "Gemini API Release Notes." ai.google.dev/gemini-api/docs/release-notes, accessed April 2026. Documents the 2025-09 Gemini 2.5 tier restructuring and the 2.0 deprecation schedule.
  4. Anthropic. "Tool Use." docs.anthropic.com/en/docs/build-with-claude/tool-use, accessed April 2026. Documents the tool_use content-block schema referenced in the output-parity section.
  5. OpenAI. "Function Calling." platform.openai.com/docs/guides/function-calling, accessed April 2026. Documents the tool_calls schema and JSON-string arguments format.
  6. Google. "Function Calling." ai.google.dev/gemini-api/docs/function-calling, accessed April 2026. Documents the functionCall schema and parsed-object args format.
  7. Hyrum's Law. "With a sufficient number of users of an API, it does not matter what you promise in the contract." hyrumslaw.com. The general principle behind why output-parity testing matters more than the documented schema; consumers depend on observable behaviour, not on the spec.