The resilience playbook: AI workflows that survive outages, price hikes, and government holds

This week, OpenAI delayed a flagship model at the government's request. Anthropic's top models were suspended from export, then reinstated less than three weeks later. Release standards with Washington's name on them could land within days. None of this was on anyone's product roadmap in January.

If you've spent 2026 wiring AI agents into your work — and if you followed the $2 Test, you have — you now own something businesses have always owned without admitting it: key-man risk. Except your key man is a model you don't control, priced by someone else, releasable and revocable on someone else's schedule.

Most people will do nothing about this until the morning their stack doesn't answer. This playbook is the alternative: about two hours of setup, a quarterly drill, and a handful of tripwires. Not paranoia — optionality.

Part 1: The dependency audit (30 minutes)

You can't protect what you haven't mapped. Open a blank doc and list every workflow where an AI system does real work for you — drafting, research, agents, automations, the lot. For each one, write three things:

The dependency. Which model, which provider, which tool wrapping it. Be specific — "ChatGPT" is not a dependency; "GPT-5.x via the web app" and "GPT-5.x via API inside my CRM automation" are two different dependencies with two different failure modes.
The tier. Tier 1: money or reputation is exposed if this stops for a week (client deliverables, sales follow-up, anything with a deadline attached). Tier 2: operations degrade but nothing breaks (research, internal reports). Tier 3: convenience (drafting tweets, cleaning notes).
The blast radius. One sentence: "If this stopped Friday at 5 PM, what happens by next Friday?" If the honest answer scares you, you've found your Tier 1.

Most people find they have two or three Tier 1 workflows, five or six Tier 2, and a long tail of Tier 3. The rest of this playbook is mostly about Tier 1. That's the point — resilience you apply everywhere is resilience you'll maintain nowhere.
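
If a blank doc feels too loose, the audit also fits in a few lines of code. This is a minimal sketch, not a required tool: the Workflow fields mirror the three things listed above, and the sample entries, workflow names, and the tier_one helper are hypothetical placeholders for your own list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workflow:
    name: str
    model: str         # which model, e.g. "GPT-5.x"
    provider: str      # who runs it
    access_path: str   # the tool wrapping it: web app, API inside another tool, ...
    tier: int          # 1 = money/reputation, 2 = operations degrade, 3 = convenience
    blast_radius: str  # "If this stopped Friday at 5 PM, what happens by next Friday?"

    def __post_init__(self):
        if self.tier not in (1, 2, 3):
            raise ValueError(f"tier must be 1, 2, or 3, got {self.tier}")


def tier_one(workflows):
    """Return only the workflows the rest of the playbook focuses on."""
    return [w for w in workflows if w.tier == 1]


# Hypothetical audit entries; replace with your own.
AUDIT = [
    Workflow("sales follow-up", "GPT-5.x", "OpenAI", "API inside CRM automation",
             1, "Leads go cold and the pipeline review is empty."),
    Workflow("market research", "GPT-5.x", "OpenAI", "web app",
             2, "Reports arrive late; nothing breaks."),
    Workflow("tweet drafts", "GPT-5.x", "OpenAI", "web app",
             3, "I write them myself."),
]

for w in tier_one(AUDIT):
    print(f"{w.name}: {w.model} via {w.access_path} ({w.provider})")
```

Note that the same model shows up under two access paths, which is exactly the distinction the audit asks you to make.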

Part 2: The two-provider rule

Every Tier 1 workflow gets a primary and a wired-in fallback from a different provider. Two models from the same company tend to fail together: same outage, same policy change, same government hold. Different providers usually fail separately.

"Wired in" is the phrase that matters. A fallback is not a name you could type into a search bar. It's wired in when all four are true:

The account exists, has a payment method, and works today.
Your prompt or agent brief for that workflow has been ported and produces acceptable output there — you've seen it with your own eyes.
Any context the workflow needs (templates, examples, reference docs) is accessible to the fallback, not locked inside the primary tool.
You've run it end-to-end within the last quarter.

Anything less is what infrastructure people call fallback rot: a backup that exists only in the org chart. And don't mirror everything — duplicating Tier 3 workflows doubles cost for risk you already decided you can absorb. The good news: with agent-grade AI now commodity-priced, a wired-in fallback costs single-digit dollars a month to keep warm.
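
The four conditions can be written down as a check you run instead of a feeling you have. A rough sketch under stated assumptions: the Fallback record, the wired_in_gaps function, and the 92-day stand-in for "the last quarter" are all hypothetical choices, not part of any provider's tooling.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Fallback:
    provider: str
    account_works: bool        # exists, has a payment method, works today
    prompt_ported: bool        # you have seen acceptable output there
    context_accessible: bool   # templates and docs are not locked in the primary
    last_full_run: Optional[date]  # last end-to-end run, None if never


def wired_in_gaps(primary_provider, fb, today, max_age_days=92):
    """Return the reasons this fallback is not wired in (empty list = wired in).

    max_age_days=92 is an assumed approximation of one quarter.
    """
    gaps = []
    if fb.provider == primary_provider:
        gaps.append("same provider as primary")
    if not fb.account_works:
        gaps.append("account not working today")
    if not fb.prompt_ported:
        gaps.append("prompt not ported and verified")
    if not fb.context_accessible:
        gaps.append("context locked inside the primary tool")
    if fb.last_full_run is None or (today - fb.last_full_run).days > max_age_days:
        gaps.append(f"no end-to-end run in the last {max_age_days} days")
    return gaps


# Example: everything is in place, but the last real run was in January.
stale = Fallback("Anthropic", True, True, True, date(2026, 1, 1))
print(wired_in_gaps("OpenAI", stale, today=date(2026, 6, 1)))
```

An empty list means wired in. Anything else is the to-do list.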

Below the two providers sits the degradation ladder. Write it down for each Tier 1 workflow, three rungs: full automation (normal), assisted mode (you drive, the fallback model helps), manual mode (the checklist you'd follow with no AI at all). The manual rung feels theatrical until you need it; it's also the best documentation of the workflow you'll ever write.
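
One way to write the ladder down so nobody has to improvise on the day is a small lookup like the one below. It is a sketch of one reading of the three rungs: the rung descriptions are hypothetical, and the rule for picking a rung (automation working, else a fallback model reachable, else manual) is an assumption you should adjust per workflow.

```python
from enum import Enum


class Rung(Enum):
    FULL_AUTOMATION = 1
    ASSISTED = 2
    MANUAL = 3


# Hypothetical ladder for one Tier 1 workflow; write your own steps here.
LADDER = {
    Rung.FULL_AUTOMATION: "Agent runs the workflow; you review the output.",
    Rung.ASSISTED: "You drive each step; the fallback model helps on request.",
    Rung.MANUAL: "Follow the no-AI checklist stored in your own docs.",
}


def current_rung(automation_ok: bool, fallback_model_ok: bool) -> Rung:
    """Pick the highest rung that is actually available right now."""
    if automation_ok:
        return Rung.FULL_AUTOMATION
    if fallback_model_ok:
        return Rung.ASSISTED
    return Rung.MANUAL


rung = current_rung(automation_ok=False, fallback_model_ok=True)
print(rung.name, "->", LADDER[rung])
```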

Part 3: Portability disciplines

Switching providers in an hour is only possible if you've kept your assets portable all along. Four habits:

Prompts live in your files, not in their tools. Keep every production prompt and agent brief in a doc or repo you control. The version saved inside a vendor's interface is a hostage, not a backup.
Keep a ten-case eval set per Tier 1 workflow. Ten real inputs with known-good outputs. That turns "is the fallback good enough?" from a vibe into a twenty-minute test — and it's how you'll notice quality drift on your primary, too.
Define output contracts. Write down, precisely, what the workflow must produce: format, sections, tone constraints, the checks a human runs before it ships. Contracts make models interchangeable; taste-based acceptance makes them irreplaceable.
Flag proprietary features. Anything provider-exclusive on a Tier 1 path — a specific tool integration, a unique file format, a memory feature — either gets an equivalent on the fallback, or gets wrapped so the workflow survives without it. Use exclusive features freely on Tier 3; earn them on Tier 1.
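
An output contract and an eval set pair naturally: the contract says what counts as acceptable, the eval set supplies the inputs. The sketch below assumes a deliberately simple contract (required sections, a word cap, banned phrases) and takes the model as a plain function you supply, so no provider API is implied. Contract, violations, and pass_rate are hypothetical names, and this version checks outputs against the contract only; comparing against your known-good outputs stays a human step.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Contract:
    required_sections: Tuple[str, ...]
    max_words: int
    banned_phrases: Tuple[str, ...] = ()


def violations(output: str, contract: Contract) -> List[str]:
    """List every way the output breaks the contract (empty list = pass)."""
    problems = []
    lowered = output.lower()
    for section in contract.required_sections:
        if section.lower() not in lowered:
            problems.append(f"missing section: {section}")
    if len(output.split()) > contract.max_words:
        problems.append(f"over {contract.max_words} words")
    for phrase in contract.banned_phrases:
        if phrase.lower() in lowered:
            problems.append(f"banned phrase: {phrase}")
    return problems


def pass_rate(cases: Sequence[str], run_model: Callable[[str], str],
              contract: Contract) -> float:
    """Fraction of eval inputs whose output satisfies the contract."""
    passed = sum(1 for case in cases if not violations(run_model(case), contract))
    return passed / len(cases)


# Hypothetical contract and a stand-in "model" for illustration.
contract = Contract(("Summary", "Next steps"), max_words=50,
                    banned_phrases=("as an AI",))


def fake_model(case: str) -> str:
    return "Summary: fine. Next steps: call." if case == "good" else "No structure."


print(pass_rate(["good", "bad"], fake_model, contract))
```

Swap fake_model for a function that calls your primary, then your fallback, and the same ten cases score both.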

Part 4: The 60-minute failover drill

Once a quarter, pick your most critical workflow and pretend the primary is gone. No peeking, no "I'd figure it out." Run the clock:

0:00–0:05: Declare the scenario. Primary provider is down indefinitely as of now.
0:05–0:15: Stand up the fallback: log in, load the ported prompt, connect the context.
0:15–0:45: Run the real workflow on a real task from your eval set, end to end.
0:45–1:00: Score the output against your contract. Note every snag: the missing template, the stale API key, the step nobody wrote down.

Pass: output ships with minor edits and you hit the hour. Fail: anything else. A failed drill is a gift — it found the rot on a Tuesday afternoon instead of during a live outage with a client waiting. Fix what snagged, and the next quarter's drill takes twenty minutes.
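
A drill is only useful if the result gets written down. Here is one possible shape for that record, assuming the pass rule stated above (ships with minor edits, inside the hour). DrillResult and its field names are hypothetical; snags are logged but do not fail the drill on their own.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DrillResult:
    workflow: str
    minutes_elapsed: int
    ships_with_minor_edits: bool
    snags: List[str] = field(default_factory=list)

    @property
    def passed(self) -> bool:
        # Pass: output ships with minor edits AND you hit the hour.
        return self.ships_with_minor_edits and self.minutes_elapsed <= 60

    def report(self) -> str:
        verdict = "PASS" if self.passed else "FAIL"
        lines = [f"{self.workflow}: {verdict} in {self.minutes_elapsed} min"]
        lines += [f"  fix before next quarter: {snag}" for snag in self.snags]
        return "\n".join(lines)


# Hypothetical drill: the output was fine, but the clock ran over.
result = DrillResult("sales follow-up", 75, True,
                     snags=["stale API key", "template missing from shared folder"])
print(result.report())
```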

Part 5: Tripwires

Outages announce themselves. The dangerous changes whisper. Put a ten-minute monthly review on your calendar and check four things:

Price moves. Any change to your primary's pricing or plan limits. Repricing has been the defining AI story of 2026 — downward, so far. It doesn't have to stay downward.
Deprecation and policy notices. Skim the provider's changelog and terms updates for models you depend on. Retirement dates are usually announced in advance; they're just rarely read.
Quality drift. Run three cases from your eval set on the primary. Models get updated under you; your eval set is how you catch it before your clients do.
The regulatory tape. Release reviews, export decisions, standards announcements. This used to be background noise. As of this week, it's operationally relevant.
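
The monthly review can be reduced to a function that takes what you observed and returns which tripwires fired. This is a sketch with assumed inputs: you collect the prices, eval results, and notices by hand, and the rule that any failed eval case counts as drift is an illustrative threshold, not a standard.

```python
from typing import List, Sequence


def fired_tripwires(old_price: float, new_price: float,
                    eval_passed: int, eval_total: int,
                    notices: Sequence[str] = ()) -> List[str]:
    """Return a line per tripwire that fired this month (empty list = all quiet).

    notices covers deprecation, policy, and regulatory items you spotted.
    """
    fired = []
    if new_price != old_price:
        change = (new_price - old_price) / old_price
        fired.append(f"price moved {change:+.0%}")
    if eval_passed < eval_total:  # assumed rule: any failed case counts as drift
        fired.append(f"quality drift: {eval_passed}/{eval_total} eval cases passed")
    for notice in notices:
        fired.append(f"notice: {notice}")
    return fired


# Hypothetical month: price went up, one of three eval cases failed.
for line in fired_tripwires(10.0, 12.0, eval_passed=2, eval_total=3,
                            notices=["model retirement date announced"]):
    print(line)
```

Deciding in advance what each fired line obliges you to do is what keeps a switch from being a mood.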

The failure modes

Mirroring everything. Resilience for Tier 3 is procrastination wearing a safety vest. Tier your workflows or drown in maintenance.
Fallback rot. The untested backup. If it hasn't run this quarter, you don't have a fallback — you have a hypothesis.
Abstraction astronautics. You need a second wired-in provider, not a homemade routing platform. If your resilience project has a backlog, it's become the risk.
Panic-switching. One bad output is not an outage. Switch on the tripwires you defined in advance, not on a mood.
Solo heroics. If anyone else depends on the workflow, they need to know the ladder exists and where the manual checklist lives. Resilience that lives in one head is Tier 1 risk with extra steps.

The 30-day install

Week 1: Run the dependency audit. Tier everything. Write the blast-radius sentences.
Week 2: Wire in a second provider for your #1 workflow: account, ported prompt, context access. Build its ten-case eval set.
Week 3: Write the degradation ladder and output contract for each Tier 1 workflow. Move every production prompt into files you control.
Week 4: Run your first failover drill. Fix what snagged. Book the monthly tripwire review and the next quarterly drill.

The bottom line: 2026 has already repriced AI twice and politicized it once. You can't control what a lab ships, what it charges, or what Washington holds at the gate. You can control whether any single decision made in San Francisco or D.C. can stop your Friday deliverable. Two hours of setup buys you that. It's the cheapest insurance in your entire operation.