
The $2 Test: the complete system for building your AI agent stack

Agent-grade AI now costs pocket change. The people pulling ahead aren't better prompters — they run a system. This is that system: 25 candidate workflows, a scoring rubric, copy-paste delegation briefs, the six failure modes, and a 90-day stacking plan.

Noah · The Sharp Brief · Guide · 14 min read

The cost of delegating real work to AI collapsed this year: agent-capable models now ship with free tiers and cost a couple of dollars per million tokens beyond that. That kills the last respectable excuse ("it's expensive") and exposes the real bottleneck: most people cannot name the task they should delegate first, and have no process for finding the second one.

The $2 Test is that process. Run it weekly — it takes about fifteen minutes of decision-making plus one delegated task — and in 90 days you'll have an agent stack running five to ten recurring workflows that used to eat your evenings. This guide is the full operating manual: the mental model, the candidate list, the scoring rubric, the exact briefs to use, the failure modes that kill most attempts, and the stacking plan.

Part 1 — The mental model: the delegation ladder

People fail with AI agents because they jump rungs. Every workflow you hand to an AI sits somewhere on this ladder:

  1. L0 — Ask: you ask a question, it answers. (Everyone does this.)
  2. L1 — Draft: it produces a first version, you finish it. (Most people stop here.)
  3. L2 — Execute: it completes the task end-to-end from a brief; you review before anything ships.
  4. L3 — Recur: the L2 task runs on a schedule without you initiating it.
  5. L4 — Orchestrate: several L3 workflows feed each other; you manage the system, not the tasks.

The $2 Test is a machine for moving tasks from L1 to L3, one per week. You do not need L4 ambitions to start. You need one honest L2 win.

Part 2 — The inventory: 25 candidate workflows

Don't brainstorm from a blank page. Steal from this list — these are the workflows that most reliably survive delegation, grouped by the job they do. Mark every one you personally did more than twice last month.

Research & monitoring

Writing & communication

Data & operations

Planning & decisions

Personal infrastructure

Reality check: if you marked fewer than five, you didn't go through your week honestly. The average knowledge worker's calendar is a graveyard of L1-eligible tasks dressed up as "things only I can do."

Part 3 — The scoring rubric: F.I.R.E.

Score each marked candidate on four dimensions, 0–2 points each:

Score 7–8: delegate this week. 5–6: delegate after one win. Under 5: either restructure the task (usually the Inputs problem) or leave it human.
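The thresholds above are easy to mechanize. Here is a minimal Python sketch, assuming four integer scores of 0 to 2 each. The field names f, i, r, e are just the acronym's letters (only I = Inputs is confirmed by the text), and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FireScore:
    """Four rubric dimensions, each scored 0-2 (field names are placeholders)."""
    f: int
    i: int  # Inputs, per the note about "the Inputs problem"
    r: int
    e: int

    def __post_init__(self):
        # Reject anything outside the 0-2 range the rubric allows.
        for name in ("f", "i", "r", "e"):
            value = getattr(self, name)
            if value not in (0, 1, 2):
                raise ValueError(f"{name} must be 0, 1, or 2, got {value!r}")

    @property
    def total(self) -> int:
        return self.f + self.i + self.r + self.e


def verdict(score: FireScore) -> str:
    """Map a total (0-8) onto the three bands from the rubric."""
    if score.total >= 7:
        return "delegate this week"
    if score.total >= 5:
        return "delegate after one win"
    return "restructure or leave human"


# Hypothetical candidate: strong on three dimensions, middling on one.
print(verdict(FireScore(f=2, i=2, r=2, e=1)))  # delegate this week
```

Scoring every marked candidate this way and sorting by total gives you the pick for the week without relitigating the rubric each time.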

Worked examples

Part 4 — The delegation brief

Agents don't fail on intelligence; they fail on under-specification. Every task you hand over gets this brief. Copy it, fill it once, and you'll reuse it forever:

ROLE: You are my [analyst / editor / ops assistant].
TASK: [One sentence. A verb and a deliverable.]
INPUTS: [Links, files, folders. Where the raw material lives.]
OUTPUT: [Exact format — doc, table, email draft. Length. Audience.]
GOOD LOOKS LIKE: [2–3 bullets. Or paste a past example — best move available.]
CONSTRAINTS: [What to never do. Sources to avoid. Claims to never invent.]
IF STUCK: [List open questions at the end instead of guessing.]
DEADLINE: [When you'll review it.]

Filled example — the competitor digest:

ROLE: You are my competitive intelligence analyst.
TASK: Produce this week's competitor digest for [X, Y, Z].
INPUTS: Their sites, pricing pages, blogs, LinkedIn, news since last Monday.
OUTPUT: One page, three sections per competitor: Shipped / Said / Signals.
GOOD LOOKS LIKE: Specific (dates, numbers), no filler, ends with the one change that matters most to us and why.
CONSTRAINTS: Verify every claim against a source. Never speculate silently.
IF STUCK: List "couldn't verify" items at the bottom.
DEADLINE: Monday 8am.
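If you keep briefs as templates, it can help to store the eight fields as data and refuse to render a brief with a blank field. The sketch below is one way to do that in Python. The class name DelegationBrief and its render method are hypothetical, and the example values are copied from the competitor digest above.

```python
from dataclasses import dataclass, fields


@dataclass
class DelegationBrief:
    """The eight fields of the brief, in the order the template lists them."""
    role: str
    task: str
    inputs: str
    output: str
    good_looks_like: str
    constraints: str
    if_stuck: str
    deadline: str

    def render(self) -> str:
        """Return the brief as 'FIELD: value' lines; fail loudly on blank fields."""
        missing = [f.name for f in fields(self) if not getattr(self, f.name).strip()]
        if missing:
            raise ValueError(f"brief is under-specified, empty fields: {missing}")
        return "\n".join(
            f"{f.name.replace('_', ' ').upper()}: {getattr(self, f.name)}"
            for f in fields(self)
        )


digest = DelegationBrief(
    role="You are my competitive intelligence analyst.",
    task="Produce this week's competitor digest for [X, Y, Z].",
    inputs="Their sites, pricing pages, blogs, LinkedIn, news since last Monday.",
    output="One page, three sections per competitor: Shipped / Said / Signals.",
    good_looks_like="Specific (dates, numbers), no filler, ends with the one "
                    "change that matters most to us and why.",
    constraints="Verify every claim against a source. Never speculate silently.",
    if_stuck='List "couldn\'t verify" items at the bottom.',
    deadline="Monday 8am.",
)
print(digest.render())
```

The blank-field check is the point of the design: an empty GOOD LOOKS LIKE or CONSTRAINTS line is exactly the under-specification the brief exists to prevent.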

The one rule most people break: "GOOD LOOKS LIKE" is the highest-leverage field in the brief. A past example of good output raises quality more than any clever instruction. Never delegate without one after your first run.

Part 5 — Run the test and measure

  1. Pick your top-scoring candidate. One. Not three.
  2. Fill the brief (10 minutes the first time, 2 minutes after).
  3. Hand it to an agent-grade model and do something else. No hovering.
  4. Review the output against GOOD LOOKS LIKE. Note review time honestly.
  5. Compute: time the task used to take − (brief time + review time).

The threshold: 30 minutes saved. At or above it, the task earns a permanent place in the stack — save the brief as a template and schedule the recurrence. Below it, bin the task without guilt and test next week's candidate. The stack only compounds if every layer earns its place.
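The step-5 arithmetic and the keep/kill call fit in a few lines. This is a minimal Python sketch of the rule as stated; the function names and the example numbers (a 90-minute task, a 10-minute brief, a 15-minute review) are illustrative assumptions, not measurements.

```python
THRESHOLD_MINUTES = 30  # the guide's bar for a permanent place in the stack


def minutes_saved(old_minutes: float, brief_minutes: float, review_minutes: float) -> float:
    """Time the task used to take, minus what delegation still costs you."""
    return old_minutes - (brief_minutes + review_minutes)


def keep_or_kill(old_minutes: float, brief_minutes: float, review_minutes: float,
                 threshold: float = THRESHOLD_MINUTES) -> str:
    """'keep' at or above the threshold, 'kill' below it."""
    saved = minutes_saved(old_minutes, brief_minutes, review_minutes)
    return "keep" if saved >= threshold else "kill"


# Hypothetical first run: 90 - (10 + 15) = 65 minutes saved.
print(minutes_saved(90, 10, 15))  # 65
print(keep_or_kill(90, 10, 15))   # keep
```

Because brief time drops from about 10 minutes to about 2 after the first run, a task that narrowly misses on rep one may clear the bar on rep two; rerun the numbers before binning a borderline candidate.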

Part 6 — The six failure modes

  1. The vague brief. "Summarize what our competitors are doing" produces mush. Fix: the template above, always.
  2. The head-trapped input. The agent can't read your memory. Fix: spend one session dumping context into a doc it can be pointed at — that doc becomes an asset for every future run.
  3. The perfection audit. Reviewing an agent's draft to your handcrafted standard on rep one. Fix: judge against GOOD LOOKS LIKE, not against your ego.
  4. The silent hallucination. Confident invented facts. Fix: the CONSTRAINTS line ("never invent; flag what you couldn't verify") plus spot-checking any number that would embarrass you.
  5. The one-off trap. A great result you never systematize. Saved once, gained nothing. Fix: Part 7.
  6. Delegating judgment. Handing over decisions instead of work. Agents prepare decisions brilliantly; they shouldn't make the ones with your name on them.

Part 7 — Systematize and stack: the 90-day plan

Weeks 1–2: run the inventory and rubric, get your first 8-scoring win, save the brief as template #1.

Weeks 3–6: one new $2 Test per week; four templates in the library; schedule the two best as recurring (attach them to a calendar slot or an automation).

Weeks 7–12: keep the weekly cadence, and start chaining: the research digest feeds the report draft; the report draft feeds the meeting agenda. That's L4 emerging without ever aiming at it.

Keep a one-page scorecard: workflow, date systematized, minutes saved per week, review time trend. Two numbers should move over 90 days — total weekly minutes saved (up) and review time per task (down). If review time isn't falling, your GOOD LOOKS LIKE examples aren't improving. Feed them.
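A spreadsheet is enough for the scorecard, but the two numbers to watch can also be computed directly. The Python sketch below assumes you log review minutes for each run, oldest first; the class, field, and function names are hypothetical, and "trend" is simplified to latest run minus first run.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class ScorecardRow:
    workflow: str
    date_systematized: date
    minutes_saved_per_week: int
    review_minutes: List[int] = field(default_factory=list)  # one entry per run, oldest first

    def review_trend(self) -> int:
        """Latest review time minus the first; negative means review is getting faster."""
        if len(self.review_minutes) < 2:
            return 0  # not enough runs to call a trend
        return self.review_minutes[-1] - self.review_minutes[0]


def total_weekly_minutes_saved(rows: List[ScorecardRow]) -> int:
    return sum(row.minutes_saved_per_week for row in rows)


def needs_better_examples(rows: List[ScorecardRow]) -> List[str]:
    """Workflows with at least two runs whose review time is not falling."""
    return [
        row.workflow
        for row in rows
        if len(row.review_minutes) >= 2 and row.review_trend() >= 0
    ]


# Illustrative rows only; the workflows and numbers are made up.
rows = [
    ScorecardRow("competitor digest", date(2025, 1, 13), 65, [15, 12, 8]),
    ScorecardRow("meeting agenda", date(2025, 2, 3), 35, [10, 10, 11]),
]
print(total_weekly_minutes_saved(rows))  # 100
print(needs_better_examples(rows))       # ['meeting agenda']
```

Any workflow the second function returns is a candidate for a fresh GOOD LOOKS LIKE example.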

The end state: ten systematized workflows at 30–60 minutes each is five to ten reclaimed hours per week — a part-time employee's worth of output for roughly the cost of a sandwich. That's not a productivity hack. That's an operating model, and it compounds while your competition is still asking chatbots questions.

The objection clinic

Every team and every brain produces the same five objections when the delegation habit starts. Here are the rebuttals that hold up:

Your first week, day by day

To remove every excuse, here's the literal schedule. Monday (10 min): brain-dump your task inventory against the 25-candidate list. Tuesday (5 min): score the top five with F.I.R.E. Wednesday (10 min): write the delegation brief for the winner, stealing the template above verbatim. Thursday (2 min + review): hand it over before your first meeting; review the output at lunch against GOOD LOOKS LIKE. Friday (5 min): compute minutes saved, make the keep/kill call, and if it's a keep, save the brief as Template #1 and schedule next week's recurrence. That's about half an hour of total effort plus review time, spread so thin you won't feel it, and by Friday you either own your first systematized workflow or you've cleanly falsified a candidate. Both outcomes beat where you are now.
