The $2 Test: the complete system for building your AI agent stack

The cost of delegating real work to AI collapsed this year: agent-capable models now come with a free tier and cost a couple of dollars per million tokens beyond that. The last respectable excuse ("it's expensive") is dead, and the real bottleneck stands exposed: most people cannot name the task they should delegate first and have no process for finding the second one.

The $2 Test is that process. Run it weekly — it takes about fifteen minutes of decision-making plus one delegated task — and in 90 days you'll have an agent stack running five to ten recurring workflows that used to eat your evenings. This guide is the full operating manual: the mental model, the candidate list, the scoring rubric, the exact briefs to use, the failure modes that kill most attempts, and the stacking plan.

Part 1 — The mental model: the delegation ladder

People fail with AI agents because they jump rungs. Every workflow you hand to an AI sits somewhere on this ladder:

L0 — Ask: you ask a question, it answers. (Everyone does this.)
L1 — Draft: it produces a first version, you finish it. (Most people stop here.)
L2 — Execute: it completes the task end-to-end from a brief; you review before anything ships.
L3 — Recur: the L2 task runs on a schedule without you initiating it.
L4 — Orchestrate: several L3 workflows feed each other; you manage the system, not the tasks.

The $2 Test is a machine for moving tasks from L1 to L3, one per week. You do not need L4 ambitions to start. You need one honest L2 win.

Part 2 — The inventory: 25 candidate workflows

Don't brainstorm from a blank page. Steal from this list — these are the workflows that most reliably survive delegation, grouped by the job they do. Mark every one you personally did more than twice last month.

Research & monitoring

Weekly competitor digest (their launches, pricing changes, hiring signals)
Industry news brief tailored to your role, delivered before your Monday meeting
Deep-dive briefing before any sales call or interview (company, people, recent news)
Literature/regulation watch: "tell me when anything changes in X"
Price/comparison research for any purchase over $500

Writing & communication

First drafts of recurring reports (status updates, board notes, client summaries)
Meeting notes → action items → follow-up email chain
Proposal/SOW first drafts from a bullet outline
Repurposing: one document → post, summary, deck outline, FAQ
Inbox triage: classify, draft replies for the routine 60%, flag the rest

Data & operations

Cleaning and normalizing spreadsheets (formats, dupes, categorization)
Monthly expense categorization and anomaly flagging
CRM hygiene: dedupe, enrich, flag stale deals
Converting messy inputs (PDFs, screenshots, emails) into structured tables
Recurring report generation from a data export

Planning & decisions

Meeting agendas built from last week's notes and open threads
Travel planning within stated constraints (budget, dates, preferences)
Project kickoff packs: risks, milestones, RACI first drafts
Decision memos: options, tradeoffs, recommendation — for you to overrule
Weekly priority proposal based on your task list and calendar

Personal infrastructure

Meal planning + grocery list within dietary constraints
Family logistics brief (school events, appointments, deadlines this week)
Learning curriculum: turn any goal into a sequenced 4-week plan
Subscription audit: find, list, and draft cancellation notes
Gift research with three options per person, within budget

Reality check: if you marked fewer than five, you didn't go through your week honestly. The average knowledge worker's calendar is a graveyard of L1-eligible tasks dressed up as "things only I can do."

Part 3 — The scoring rubric: F.I.R.E.

Score each marked candidate on four dimensions, 0–2 points each:

F — Frequency. Weekly or more = 2. Monthly = 1. Rarer = 0.
I — Inputs. Everything needed lives somewhere pointable (folder, URL, inbox) = 2. Mostly = 1. Lives in your head = 0.
R — Recognizability. You could explain good-vs-bad output in one sentence = 2. You'd know it when you see it = 1. Quality is a debate = 0.
E — Error tolerance. A bad first attempt costs you a review = 2. Costs a redo = 1. Costs a client or your credibility = 0.

Score 7–8: delegate this week. Score 5–6: delegate after one win. Under 5: either restructure the task (a low Inputs score is the usual culprit) or leave it human.

Worked examples

Weekly competitor digest: F2 + I2 (their sites/news are public) + R2 ("did I learn what changed?") + E2 (internal doc) = 8. Immediate.
Client proposal draft: F1 + I1 (half the context is in your head) + R2 + E1 = 5. Delegate the skeleton, keep the judgment.
Salary negotiation email: F0 + I1 + R1 + E0 = 2. Human job. (Use the Raise Playbook instead.)
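
If you'd rather keep the rubric in a script than in your head, here is a minimal sketch of F.I.R.E. in Python. The class name, field names, and verdict strings are illustrative assumptions; the 0–2 scale, the cutoffs, and the three scored examples come straight from the rubric above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FireScore:
    """One candidate workflow scored 0-2 on each F.I.R.E. dimension."""
    name: str
    frequency: int
    inputs: int
    recognizability: int
    error_tolerance: int

    def __post_init__(self) -> None:
        # Reject anything outside the rubric's 0-2 scale.
        for value in (self.frequency, self.inputs,
                      self.recognizability, self.error_tolerance):
            if value not in (0, 1, 2):
                raise ValueError("each dimension must be 0, 1, or 2")

    @property
    def total(self) -> int:
        return (self.frequency + self.inputs
                + self.recognizability + self.error_tolerance)

    @property
    def verdict(self) -> str:
        # Cutoffs from the rubric: 7-8, 5-6, under 5.
        if self.total >= 7:
            return "delegate this week"
        if self.total >= 5:
            return "delegate after one win"
        return "restructure or keep human"


# The three worked examples from the text.
candidates = [
    FireScore("Weekly competitor digest", 2, 2, 2, 2),
    FireScore("Client proposal draft", 1, 1, 2, 1),
    FireScore("Salary negotiation email", 0, 1, 1, 0),
]

# Highest score first, so the top line is this week's test.
for c in sorted(candidates, key=lambda c: c.total, reverse=True):
    print(f"{c.total}  {c.name}: {c.verdict}")
```

Sorting by total puts this week's candidate on the first line, which is the only line you act on.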

Part 4 — The delegation brief

Agents don't fail on intelligence; they fail on under-specification. Every task you hand over gets this brief. Copy it, fill it once, and you'll reuse it forever:

ROLE: You are my [analyst / editor / ops assistant].
TASK: [One sentence. A verb and a deliverable.]
INPUTS: [Links, files, folders. Where the raw material lives.]
OUTPUT: [Exact format: doc, table, email draft. Length. Audience.]
GOOD LOOKS LIKE: [2–3 bullets. Or paste a past example, the best move available.]
CONSTRAINTS: [What to never do. Sources to avoid. Claims to never invent.]
IF STUCK: [List open questions at the end instead of guessing.]
DEADLINE: [When you'll review it.]

Filled example — the competitor digest:

ROLE: You are my competitive intelligence analyst.
TASK: Produce this week's competitor digest for [X, Y, Z].
INPUTS: Their sites, pricing pages, blogs, LinkedIn, news since last Monday.
OUTPUT: One page, three sections per competitor: Shipped / Said / Signals.
GOOD LOOKS LIKE: Specific (dates, numbers), no filler, ends with the one change that matters most to us and why.
CONSTRAINTS: Verify every claim against a source. Never speculate silently.
IF STUCK: List "couldn't verify" items at the bottom.
DEADLINE: Monday 8am.

The one rule most people break: "GOOD LOOKS LIKE" is the highest-leverage field in the brief. A past example of good output raises quality more than any clever instruction. Never delegate without one after your first run.
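
One way to make the brief reusable is to store it as data and render it on demand. The sketch below is an assumption about how you might do that, not a required tool: `DelegationBrief` and `render` are hypothetical names, and the eight fields mirror the template above. It refuses to render a brief with an empty field, which is a cheap guard against the vague brief.

```python
from dataclasses import dataclass, fields


@dataclass
class DelegationBrief:
    """The eight fields of the delegation brief, in template order."""
    role: str
    task: str
    inputs: str
    output: str
    good_looks_like: str
    constraints: str
    if_stuck: str
    deadline: str

    def render(self) -> str:
        """Return the brief as 'LABEL: value' lines; fail on an empty field."""
        lines = []
        for f in fields(self):
            value = getattr(self, f.name).strip()
            if not value:
                raise ValueError(f"brief field {f.name!r} is empty")
            label = f.name.replace("_", " ").upper()
            lines.append(f"{label}: {value}")
        return "\n".join(lines)


# The filled competitor-digest example from the text.
digest = DelegationBrief(
    role="You are my competitive intelligence analyst.",
    task="Produce this week's competitor digest for [X, Y, Z].",
    inputs="Their sites, pricing pages, blogs, LinkedIn, news since last Monday.",
    output="One page, three sections per competitor: Shipped / Said / Signals.",
    good_looks_like=("Specific (dates, numbers), no filler, ends with the one "
                     "change that matters most to us and why."),
    constraints="Verify every claim against a source. Never speculate silently.",
    if_stuck="List \"couldn't verify\" items at the bottom.",
    deadline="Monday 8am.",
)

print(digest.render())
```

After a good run, paste the best output into `good_looks_like` so the next run starts from a real example.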

Part 5 — Run the test and measure

Pick your top-scoring candidate. One. Not three.
Fill the brief (10 minutes the first time, 2 minutes after).
Hand it to an agent-grade model and do something else. No hovering.
Review the output against GOOD LOOKS LIKE. Note review time honestly.
Compute: time the task used to take − (brief time + review time).

The threshold: 30 minutes saved. At or above it, the task earns a permanent place in the stack — save the brief as a template and schedule the recurrence. Below it, bin the task without guilt and test next week's candidate. The stack only compounds if every layer earns its place.
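
The arithmetic above fits in a few lines of Python. This is a sketch under stated assumptions: the function names are made up for illustration and the sample minutes are hypothetical; only the formula and the 30-minute threshold come from the text.

```python
KEEP_THRESHOLD_MINUTES = 30  # the threshold stated in the text


def minutes_saved(manual_minutes: int, brief_minutes: int,
                  review_minutes: int) -> int:
    """Time the task used to take, minus (brief time + review time)."""
    return manual_minutes - (brief_minutes + review_minutes)


def keep_or_kill(manual_minutes: int, brief_minutes: int,
                 review_minutes: int) -> str:
    """'keep' at or above the threshold, otherwise 'kill'."""
    saved = minutes_saved(manual_minutes, brief_minutes, review_minutes)
    return "keep" if saved >= KEEP_THRESHOLD_MINUTES else "kill"


# Hypothetical numbers: a 60-minute task, a 10-minute first brief,
# and a 15-minute review.
print(minutes_saved(60, 10, 15), keep_or_kill(60, 10, 15))

# Hypothetical numbers: a 40-minute task with the same brief and a
# 10-minute review falls short of the threshold.
print(minutes_saved(40, 10, 10), keep_or_kill(40, 10, 10))
```

Count the brief at its real cost on each run: 10 minutes the first time, about 2 after that, so a borderline task can cross the threshold on its second run.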

Part 6 — The six failure modes

The vague brief. "Summarize what our competitors are doing" produces mush. Fix: the template above, always.
The head-trapped input. The agent can't read your memory. Fix: spend one session dumping context into a doc it can be pointed at — that doc becomes an asset for every future run.
The perfection audit. Reviewing an agent's draft to your handcrafted standard on rep one. Fix: judge against GOOD LOOKS LIKE, not against your ego.
The silent hallucination. Confident invented facts. Fix: the CONSTRAINTS line ("never invent; flag what you couldn't verify") plus spot-checking any number that would embarrass you.
The one-off trap. A great result you never systematize. Saved once, gained nothing. Fix: Part 7.
Delegating judgment. Handing over decisions instead of work. Agents prepare decisions brilliantly; they shouldn't make the ones with your name on them.

Part 7 — Systematize and stack: the 90-day plan

Weeks 1–2: run the inventory and rubric, get your first 8-scoring win, save the brief as template #1.
Weeks 3–6: one new $2 Test per week; four templates in the library; schedule the two best as recurring (attach them to a calendar slot or an automation).
Weeks 7–12: keep the weekly cadence, and start chaining: the research digest feeds the report draft; the report draft feeds the meeting agenda. That's L4 emerging without ever aiming at it.

Keep a one-page scorecard: workflow, date systematized, minutes saved per week, review time trend. Two numbers should move over 90 days — total weekly minutes saved (up) and review time per task (down). If review time isn't falling, your GOOD LOOKS LIKE examples aren't improving. Feed them.
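
The scorecard can live in a spreadsheet, but here is a minimal Python sketch of the same idea (Python 3.9+). The row structure, the names, and the sample data are all hypothetical. "Falling" is defined here as the latest review being shorter than the first, which is one simple reading of "trend", not the only one.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class ScorecardRow:
    """One systematized workflow on the scorecard."""
    workflow: str
    systematized_on: date
    minutes_saved_per_week: int
    # Review time per run in minutes, oldest run first.
    review_minutes: list[int] = field(default_factory=list)

    def review_time_falling(self) -> bool:
        """True when the latest review was shorter than the first one."""
        if len(self.review_minutes) < 2:
            return False  # not enough runs to call it a trend
        return self.review_minutes[-1] < self.review_minutes[0]


def weekly_minutes_saved(rows: list[ScorecardRow]) -> int:
    """Total minutes saved per week across the whole stack."""
    return sum(r.minutes_saved_per_week for r in rows)


# Hypothetical sample data, for illustration only.
rows = [
    ScorecardRow("Weekly competitor digest", date(2025, 1, 13), 45, [15, 12, 8]),
    ScorecardRow("Meeting notes to follow-up email", date(2025, 2, 3), 30, [10, 10, 11]),
]

print("Total weekly minutes saved:", weekly_minutes_saved(rows))
for r in rows:
    if not r.review_time_falling():
        print("Refresh GOOD LOOKS LIKE examples for:", r.workflow)
```

Any workflow the loop flags is the one whose examples need feeding first.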

The end state: ten systematized workflows at 30–60 minutes each is five to ten reclaimed hours per week — a part-time employee's worth of output for roughly the cost of a sandwich. That's not a productivity hack. That's an operating model, and it compounds while your competition is still asking chatbots questions.

The objection clinic

Every team and every brain produces the same five objections when the delegation habit starts. Here are the rebuttals that hold up:

"Our data is sensitive." Legitimate — and solvable with policy, not abstinence. Sort your candidate list into green (public inputs), yellow (internal but not regulated), and red (regulated/PII). Run the $2 Test on greens and yellows; reds wait for approved tooling. Most people discover 70% of their candidates were green all along.
"The output isn't as good as mine." Correct, on rep one — that's why the threshold measures time saved after review, not perfection. Your GOOD LOOKS LIKE examples are the quality dial. If output isn't improving by rep three, the examples are stale, not the model.
"My work is too unique to delegate." The work might be; the preparation of the work never is. Delegate the research pull, the formatting, the first-pass structure — keep the judgment. Even surgeons don't sterilize their own instruments.
"I don't have time to set this up." The setup is fifteen minutes once. The alternative is doing the same task manually 50 more times this year. This objection is arithmetic wearing a costume.
"I tried it once and it failed." Find which of the six failure modes it was — it's almost always the vague brief or the head-trapped input. One diagnosed failure is worth three lucky successes.

Your first week, day by day

To remove every excuse, here's the literal schedule:

Monday (10 min): brain-dump your task inventory against the 25-candidate list.
Tuesday (5 min): score the top five with F.I.R.E.
Wednesday (10 min): write the delegation brief for the winner; steal the template above verbatim.
Thursday (2 min + review): hand it over before your first meeting; review the output at lunch against GOOD LOOKS LIKE.
Friday (5 min): compute minutes saved, make the keep/kill call, and if it's a keep, save the brief as Template #1 and schedule next week's recurrence.

That's about half an hour of effort plus one review, spread so thin you won't feel it. By Friday you either own your first systematized workflow or you've cleanly falsified a candidate. Both outcomes beat where you are now.