The resilience playbook: AI workflows that survive outages, price hikes, and government holds

The dependency audit, the two-provider rule, a 60-minute failover drill, and the tripwires that catch trouble early — before your stack learns about single points of failure the hard way.

Noah · The Sharp Brief · Guide · 7 min read

This week, OpenAI delayed a flagship model at the government's request. Anthropic's top models were suspended from export, then reinstated less than three weeks later. Release standards with Washington's name on them could land within days. None of this was on anyone's product roadmap in January.

If you've spent 2026 wiring AI agents into your work — and if you followed the $2 Test, you have — you now own something businesses have always owned without admitting it: key-man risk. Except your key man is a model you don't control, priced by someone else, releasable and revocable on someone else's schedule.

Most people will do nothing about this until the morning their stack doesn't answer. This playbook is the alternative: about two hours of setup, a quarterly drill, and a handful of tripwires. Not paranoia — optionality.

Part 1: The dependency audit (30 minutes)

You can't protect what you haven't mapped. Open a blank doc and list every workflow where an AI system does real work for you — drafting, research, agents, automations, the lot. For each one, write three things:

Most people find they have two or three Tier 1 workflows, five or six Tier 2, and a long tail of Tier 3. The rest of this playbook is mostly about Tier 1. That's the point — resilience you apply everywhere is resilience you'll maintain nowhere.

Part 2: The two-provider rule

Every Tier 1 workflow gets a primary and a wired-in fallback from a different provider. Two models from the same company fail together — same outage, same policy change, same government hold. Different providers fail separately.

"Wired in" is the phrase that matters. A fallback is not a name you could type into a search bar. It's wired in when all four are true:

  1. The account exists, has a payment method, and works today.
  2. Your prompt or agent brief for that workflow has been ported and produces acceptable output there — you've seen it with your own eyes.
  3. Any context the workflow needs (templates, examples, reference docs) is accessible to the fallback, not locked inside the primary tool.
  4. You've run it end-to-end within the last quarter.

Anything less is what infrastructure people call fallback rot: a backup that exists only in the org chart. And don't mirror everything — duplicating Tier 3 workflows doubles cost for risk you already decided you can absorb. The good news: with agent-grade AI now commodity-priced, a wired-in fallback costs single-digit dollars a month to keep warm.

Below the two providers sits the degradation ladder. Write it down for each Tier 1 workflow, three rungs: full automation (normal), assisted mode (you drive, the fallback model helps), manual mode (the checklist you'd follow with no AI at all). The manual rung feels theatrical until you need it; it's also the best documentation of the workflow you'll ever write.
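The two-provider rule and the degradation ladder, sketched as an added example (not code from the original guide). The Backend fields, vendor names, the 92-day reading of "last quarter", and the choice to label fallback output as assisted mode are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Callable, List, Optional

MAX_DRILL_AGE = timedelta(days=92)  # assumption: "last quarter" = 92 days


@dataclass
class Backend:
    name: str                      # model label (constructed)
    provider: str                  # company that operates the model
    call: Callable[[str], str]     # sends the ported prompt, returns output
    last_drill: Optional[date] = None  # last end-to-end run, if any


def wiring_problems(primary: Backend, fallback: Backend, today: date) -> List[str]:
    """Return the reasons the fallback does not count as wired in."""
    problems = []
    if primary.provider == fallback.provider:
        problems.append("same provider: fallback fails together with primary")
    if fallback.last_drill is None or today - fallback.last_drill > MAX_DRILL_AGE:
        problems.append("fallback rot: no end-to-end run in the last quarter")
    return problems


def run_workflow(task, primary, fallback, manual_checklist):
    """Walk the ladder top down; return (rung, result)."""
    try:
        return "full", primary.call(task)
    except Exception:              # outage, revoked access, quota, etc.
        pass
    try:
        # Assumption: fallback output is treated as assisted mode (human reviews).
        return "assisted", fallback.call(task)
    except Exception:
        return "manual", manual_checklist


if __name__ == "__main__":
    def down(task): raise ConnectionError("provider unavailable")
    def draft(task): return "draft for: " + task

    primary = Backend("model-a", "vendor-1", down, date(2026, 3, 1))
    fallback = Backend("model-b", "vendor-2", draft, date(2026, 2, 10))
    print(wiring_problems(primary, fallback, date(2026, 3, 15)))
    print(run_workflow("weekly client report", primary, fallback,
                       ["pull figures", "fill template", "send"]))
```

Run wiring_problems on a schedule so a same-provider pairing or a stale drill date surfaces before the primary fails, not during the failure.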

Part 3: Portability disciplines

Switching providers in an hour is only possible if you've kept your assets portable all along. Four habits:

Part 4: The 60-minute failover drill

Once a quarter, pick your most critical workflow and pretend the primary is gone. No peeking, no "I'd figure it out." Run the clock:

  1. 0:00–0:05 — Declare the scenario. Primary provider is down indefinitely as of now.
  2. 0:05–0:15 — Stand up the fallback: log in, load the ported prompt, connect the context.
  3. 0:15–0:45 — Run the real workflow on a real task from your eval set, end to end.
  4. 0:45–0:60 — Score the output against your contract. Note every snag: the missing template, the stale API key, the step nobody wrote down.

Pass: output ships with minor edits and you hit the hour. Fail: anything else. A failed drill is a gift — it found the rot on a Tuesday afternoon instead of during a live outage with a client waiting. Fix what snagged, and the next quarter's drill takes twenty minutes.

Part 5: Tripwires

Outages announce themselves. The dangerous changes whisper. Put a ten-minute monthly review on your calendar and check four things:

The failure modes

The 30-day install

The bottom line: 2026 has already repriced AI twice and politicized it once. You can't control what a lab ships, what it charges, or what Washington holds at the gate. You can control whether any single decision made in San Francisco or D.C. can stop your Friday deliverable. Two hours of setup buys you that. It's the cheapest insurance in your entire operation.
