In this harness, the work runs AFK — away from keyboard. The whole loop — learn, build, verify, decide — runs without a person feeding it the next move. Your job changes from operator to observer: you read what happened, you don't push the buttons. The machine only ever stops for you at one specific kind of decision — and never to hand you a chore.
AFK stands for away from keyboard. In most tools, "AI helps you work" means a tight back-and-forth: it does a little, then stops and waits for you to type the next instruction, approve the next step, click the next button. This harness is built the opposite way. Once the goal is set, the loop runs the whole job — figure out the state, make one change, prove it worked, decide whether to keep going — on its own, pass after pass, without a person in the middle.
So what is left for the human? Observability. That is a fancy word for a simple promise: at any moment you can see exactly what the system has done and is doing, without having to take part in it. You read a running log, you read a report, you check a status line. You are a person standing at a window watching a kitchen, not a line cook on the station. The work does not pause because you looked away, and it does not need your hands to continue.
This matters because the failure it prevents is so common. The moment a system needs a human to advance — "waiting for approval", "please confirm", "what should I do next?" — it stops being autonomous. It becomes a thing that runs only as fast as you can babysit it, and stalls the instant you step away. The harness refuses that. The default is: keep going, leave a trail, and only ever stop for the human at one very specific kind of decision (we will get to exactly which one).
There is one more half of the rule, and it is the part people get backwards. Not only does the human stay out of the doing — the human is also never handed the checking. The system does not finish its work and then turn to you with a list of "now please test this" chores. The verification is the machine's job too. Your reading is for your own understanding and trust; it is never a task queue the agents offloaded onto you.
Think of it like… the dashboard of a self-driving car versus being the driving instructor with the second brake. A dashboard shows you speed, route, and what the car sees — you stay informed, and you can take over if something genuinely calls for a human, but you are not steering, and the car does not ask you to grade its parking afterward. The harness puts you in the passenger seat with a great dashboard, not in the instructor's seat pumping the pedals. Where the analogy bends: a car asks you to take over for danger; this harness pulls you in only for a decision that is properly yours to make — not because it got stuck.
Every non-trivial task in this suite is driven by the loop, and the loop is designed to run unattended: LEARN → ANALYZE → EXECUTE one bounded unit → VERIFY at the real boundary → DECIDE, repeated until it converges on the measurable done-when from the Scope Gate (lesson 3). Nothing in that cycle blocks on a person. DECIDE chooses "go again" or "converged" from evidence the loop itself gathered, not from a human's say-so.
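As a rough sketch of the cycle just described, the loop can be pictured as a function that repeats until the done-when is met and never awaits a person. Every name here (`runLoop`, `steps`, `doneWhen`, `maxPasses`) is hypothetical and chosen for illustration; the harness's real interfaces are not shown in this lesson, and the pass cap is an assumption added so the example always terminates.

```javascript
// Hypothetical sketch: the five-step loop as a plain function.
// None of these names come from the harness; they are illustrative only.
function runLoop(steps, doneWhen, maxPasses = 20) {
  const log = [];
  let state = steps.initial;
  for (let pass = 1; pass <= maxPasses; pass++) {
    const learned = steps.learn(state);         // LEARN: read the current state
    const unit = steps.analyze(learned);        // ANALYZE: pick ONE bounded unit
    state = steps.execute(state, unit);         // EXECUTE: make that one change
    const proof = steps.verify(state);          // VERIFY: check at the real boundary
    const converged = doneWhen(proof);          // DECIDE: from evidence, not a human
    log.push({ pass, unit, proof, converged }); // leave a trail to observe
    if (converged) return { converged: true, passes: pass, log };
  }
  // Safety cap (an assumption of this sketch) so the example always stops.
  return { converged: false, passes: maxPasses, log };
}

// Toy usage: "done" means the counter reached 3.
const result = runLoop(
  {
    initial: 0,
    learn: (s) => ({ current: s }),
    analyze: () => ({ add: 1 }),
    execute: (s, unit) => s + unit.add,
    verify: (s) => ({ value: s }),
  },
  (proof) => proof.value >= 3
);
console.log(result.passes, result.converged);
```

Notice that nothing in `runLoop` takes input from a person once it starts: the only inputs are the goal and the evidence each pass gathers.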
The human reads three things and writes none of the work: a running narrative (LOOP-LOG.md), a final observability report (review.md), and a live status readout. These are append-only outputs of the run — the human consumes them, the agents produce them. The human's hands never enter the execution path; that is what "never in the path" means.
Half one: LLMs never block waiting for a human on routine work. Half two: LLMs never hand the human a QA task. Together they force the system to be genuinely autonomous in both directions — it neither stalls for you nor delegates its checking to you. The single exception is a deliberate, well-defined fork covered in section 8.
Three kinds of player share this stage, and the whole method works because each one stays in its lane. It is worth naming them plainly, because the magic of AFK is really just a clean division of labour.
The human — that's you — does exactly one thing during a run: observe. You read the log, you read the report, you glance at status. You set the goal at the start, and you may be pulled back for one special decision at the end, but in between you do not execute and you do not test. The LLMs (the models driving the work) carry the discipline: they never sit idle waiting for you on routine work, and they never turn around and hand you a checking chore — they keep the loop turning and leave a trail behind them. The agents (the orchestrator and the workers it delegates to) do the building and the proving, and they emit the report — review.md — as a record of what was observed about the run, not as a to-do list aimed at you.
Read the diagram below as three lanes that almost never cross. The only place a line reaches back to the human is the single dotted fork on the right — and that is a decision, not a task.
Here is the whole idea in a single shape. The loop spins on its own — each step feeds the next, and "go again" returns to the start without anyone's permission. The human sits beside the loop, reading its outputs, never wired into the chain that makes it turn. The forbidden wiring — the dotted red arrow — is a human placed inside the loop, where the machine would have to stop and wait for a click to continue. That is exactly what AFK removes.
The one rule
If the system has to wait for a human to take its next routine step, it is not AFK. The human belongs beside the loop reading it — never inside the loop feeding it.
"Observability" is only real if there is something concrete to observe. In this harness there are three things, and that's the whole list. You don't dig through internals or attach a debugger; you read three plain outputs the run keeps up to date for you.
First, LOOP-LOG.md — the running narrative. Every pass of the loop appends a few lines: what it learned, the one unit it picked, what it changed, and how it verified. Reading it top to bottom is like reading a ship's log: you see the journey, in order, as it happened. Second, review.md — the report the QA agent writes when the work converges. It is a snapshot of how the finished run looks under inspection: what was checked, what held up, what is worth a human's attention. Third, the status readout — a one-glance "where are we right now": which pass, converged or still going, any blocker. Between these three you always know the past (log), the verdict (review), and the present (status) — without ever touching the controls.
Think of it like… following a long bake through a glass oven door. The LOOP-LOG.md is the timer ticking through each stage; the status light tells you it is still baking or done; and review.md is the note the baker leaves describing how the loaf came out. You learn everything you need by looking — you never have to open the door and stick your hands in.
LOOP-LOG.md grows by one entry per loop pass and is never rewritten — that immutability is what makes it trustworthy as a record. status is a derived view (current pass, converged flag, open blocker) you can print at any time. review.md is produced once at convergence by the validator — and critically, by an agent that is not the one that built the work, so the report is an independent read, not a self-grade.
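One way to picture "append-only log, derived status" is the minimal sketch below. It keeps entries in memory rather than in a file, and the entry shape (`pass`, `step`, `note`) and the `BLOCKER` step name are assumptions made for illustration, not the harness's actual format.

```javascript
// Hypothetical sketch: an append-only log with a status view derived from it.
// The entry shape ({ pass, step, note }) is an assumption for illustration.
function createLoopLog() {
  const entries = [];
  return {
    // Appending is the only write operation; there is no update or delete.
    append(entry) {
      entries.push(Object.freeze({ ...entry }));
    },
    // Readers get a copy of the list, so they cannot alter the record.
    read() {
      return entries.slice();
    },
  };
}

// Status is computed from the log each time; it stores nothing of its own.
function deriveStatus(entries) {
  const last = entries[entries.length - 1];
  const blocker = entries.find((e) => e.step === "BLOCKER");
  return {
    pass: last ? last.pass : 0,
    converged: entries.some((e) => e.step === "DECIDE" && e.note === "converged"),
    blocker: blocker ? blocker.note : null,
  };
}

const loopLog = createLoopLog();
loopLog.append({ pass: 1, step: "VERIFY", note: "pytest 12 passed" });
loopLog.append({ pass: 1, step: "DECIDE", note: "converged" });
const status = deriveStatus(loopLog.read());
console.log(status);
```

Because `deriveStatus` only reads, calling it as often as you like never changes the run, which is the property the status readout depends on.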
A streaming console would tempt a human to jump in and steer. Files and a status command keep the relationship one-way: the human pulls information when they want it, and the loop never waits to see whether they did. Pull, don't push; read, don't drive.
The proof that VERIFY ran at the real boundary (lesson 3) is exactly what lands in these outputs. Observability and the Proof Gate are two views of one idea: every claim the run makes is backed by something a human can go read.
This is the rule the models live by, stated as plainly as possible. On routine work, an LLM never stops to wait for a human. If it hits a fork it can settle from evidence — which file, which fix, whether the test passed — it settles it and keeps moving, leaving the reasoning in the log. It does not post "let me know how you'd like to proceed" and idle. Idling on the routine is the exact behaviour that turns an autonomous loop back into a babysitting job.
The mirror-image rule is just as important: an LLM never hands the human a QA task. When the work is done, the agents do not turn to you and say "please run the tests" or "can you confirm this looks right?" Checking the work is part of the work, and the work belongs to the machine. The report you read afterward is the result of that checking, already done — not a request for you to do it. If you ever feel like the system finished by handing you homework, something has gone wrong with the discipline.
Put the two together and you get the shape of a genuinely AFK system: it neither stalls waiting for you, nor offloads its verification onto you. It runs, it checks itself, it writes down what happened, and it only ever reaches for you at the one decision that is truly yours — which is the next section but one.
Routine = anything resolvable from the artifact, the scope, or a trusted source. A missing fact is fetched (via the Bright Data CLI for web facts, lesson 9), not asked of the human. An ambiguous instruction is resolved by re-reading the done-when, or by picking the safer interpretation and recording the choice. Only an irreversible, outward, or business-intent decision is allowed to leave the routine lane — that is the fork in section 8.
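The resolution order above can be sketched as a small function that tries each source in turn and falls back to the safer interpretation, recording where the answer came from. The names (`settleRoutine`, the resolver labels, the record shape) are hypothetical, and the "trusted-source" resolver is a stand-in that returns a canned string rather than calling any real CLI.

```javascript
// Hypothetical sketch: settle a routine question without asking a human.
// Resolver labels and the record shape are assumptions for illustration.
function settleRoutine(question, resolvers, saferFallback) {
  for (const [source, resolve] of resolvers) {
    const answer = resolve(question);
    if (answer !== undefined) {
      return { question, answer, source }; // settled from evidence
    }
  }
  // Nothing answered it: take the safer interpretation and record that choice.
  return { question, answer: saferFallback, source: "safer-interpretation" };
}

const resolvers = [
  ["artifact", () => undefined],              // the repo did not say
  ["scope", () => undefined],                 // the done-when did not say
  ["trusted-source", () => "fetched answer"], // stand-in for a fetched web fact
];
const settled = settleRoutine("how does an empty param arrive?", resolvers, "handle both cases");
console.log(settled.source, "->", settled.answer);
```

The returned record is what would be written into the log, so the reasoning is visible afterward even though nobody was asked.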
The VERIFY step is non-negotiable and never exported. The validator that writes review.md is itself an agent in the run — usually a different model from the builder, so it is an independent check rather than a self-grade — but it is still the machine checking, not the human. The human reading review.md is auditing trust, not completing the test plan.
Where a phase has questions, the agents answer them themselves and only surface a genuinely author-only question with a recommendation attached. The bias is heavily toward keeping the human out of the loop, precisely so the loop can run while the human is away.
Feel the difference. Below are two ways to run the same long job. The left one is not AFK: it stops and waits for a human at every routine step, so the human's "stale beliefs" (and the stall) pile up the longer they look away. The right one is AFK: it keeps running and just writes down what it does, so the human's picture stays fresh by reading — at zero cost to the run. Press Next round a few times and watch the gap open.
Not AFK: stops at every routine step to wait for a click. The moment the human looks away, progress stalls and their picture goes stale.
AFK: keeps running and appends to LOOP-LOG.md every round. The human reads when they like; the picture is never stale and the run never waited.
Time moves every round whether the human is watching or not. Only the AFK run keeps making progress and keeps the human informed — because observing is read-only.
    // Not AFK: a human sits inside the loop
    async function round() {
      const step = plan();
      await waitForHumanClick(); // stalls here until a person acts
      return run(step);
    }

    // AFK: the loop moves itself and writes down what it did
    async function round() {
      const step = plan();
      const out = run(step);          // keeps moving on its own
      appendTo("LOOP-LOG.md", out);   // leaves a trail to OBSERVE
      return out;
    }
The whole difference is one line: delete the await waitForHumanClick() and replace it with an appendTo("LOOP-LOG.md", …). That single move turns "a human must be in the path" into "a human may read the trail" — autonomy plus observability, instead of babysitting.
When a run converges, an agent writes review.md. It is easy to misread what this file is for, so let's be exact. It is a report: a description of how the finished work looks when an independent agent inspects it. It is emphatically not a list of chores assigned to the human. There is no "TODO: test the login flow" waiting for you in there, because the testing already happened — that is what the report is reporting on.
Why does the distinction matter so much? Because the instant a report becomes a to-do, the human is back in the path. "Here are five things for you to verify" is just babysitting wearing a nicer hat — the run didn't really finish, it stopped and delegated. A true observability report closes the loop: it says this was done, here is the evidence, here is what an outside reader should know. You read it to decide whether you trust the result, not to find out what you still have to do.
Think of it like… a home-inspection report when you buy a house. The inspector already climbed onto the roof and ran the taps — the report tells you what they found so you can make a decision. A good inspector hands you findings; a bad one hands you a ladder and says "go check the roof yourself." review.md is the findings, never the ladder.
    # review.md — RHG search-handler run · converged ✓

    ## What was checked (by the validator, not the human)
    - empty-query guard: covered — pytest test_empty_query_returns_400 now passes
    - regression suite: green — 12 passed, 0 failed (was 11 passed, 1 skipped)
    - real boundary: hit — curl "/search?q=" → 400, observed live

    ## How it looks under inspection
    - change is one bounded unit (a guard in api.py); no scope creep
    - LOOP-LOG.md trail is complete: 3 passes, each with proof

    ## For the reader's awareness (NOT tasks)
    - downstream callers of /search were not audited — out of this run's scope
    - consider a follow-up goal if empty-body POSTs need the same guard
Open it like any file: cat review.md, or read it in your editor. The "for the reader's awareness" section is observability, not assignment — it names things an outside reader should know, often explicitly out of scope. If you decide one of them matters, that becomes a new goal for a fresh run, set the normal way — it is never an implicit chore the run left for you.
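If you wanted a mechanical check that a report stayed a report, one option is a small lint that flags lines reading like chores. This is purely a sketch: the function name and the patterns are illustrative guesses, not a rule the harness defines, and a real check would need tuning to avoid false alarms.

```javascript
// Hypothetical lint: flag report lines that read like chores for the human.
// The patterns below are illustrative guesses, not something the harness specifies.
const chorePatterns = [
  /\bTODO\b/i,                                 // explicit to-do markers
  /^\s*[-*]\s*\[ \]/,                          // unchecked checkbox items
  /\bplease (test|verify|confirm|run)\b/i,     // requests aimed at the reader
];

function findChores(reportText) {
  return reportText
    .split("\n")
    .filter((line) => chorePatterns.some((pattern) => pattern.test(line)));
}

const findings = [
  "- regression suite: green, 12 passed, 0 failed",
  "- consider a follow-up goal if empty-body POSTs need the same guard",
].join("\n");
const homework = [
  "TODO: test the login flow",
  "- [ ] check the roof yourself",
].join("\n");

console.log(findChores(findings).length, findChores(homework).length);
```

An awareness note such as "consider a follow-up goal" passes, because it informs rather than assigns; a checkbox or a "TODO" does not.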
review.md is written by the validator agent, which by policy is not the agent that built the change, so the report is an independent read. In a multi-agent run the builder and the validator are different models; the human reading the report is the third, outermost check, auditing trust rather than doing QA.
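The "not the builder" policy is simple enough to sketch as a guard. The function and agent names here are hypothetical; the point is only that the selection refuses to fall back to self-grading.

```javascript
// Hypothetical sketch: pick a validator that is not the builder.
// Agent identifiers are plain strings here purely for illustration.
function assignValidator(builder, candidates) {
  const validator = candidates.find((agent) => agent !== builder);
  if (validator === undefined) {
    // Refuse rather than let the builder grade its own work.
    throw new Error("no independent validator available");
  }
  return validator;
}

const chosen = assignValidator("builder-a", ["builder-a", "validator-b"]);
console.log(chosen);
```

Throwing when no independent agent exists keeps the failure loud: a self-graded report would look identical on the page but would not be the independent read the policy promises.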
There is exactly one situation where the autonomous run will deliberately pause and pull the human in. It is not because it got stuck, and not because it wants you to test something. It is a user-only fork: a decision that is genuinely yours to make because it is irreversible, reaches outward into the world, or expresses business intent the system can't infer.
Examples: publishing something public, sending money, deleting data that can't be recovered, choosing between two directions that are both valid but mean different things for the business. These aren't routine — no amount of reading the artifact tells you which one the human wants. So the loop stops, but it stops the right way: through a deliberate handoff, presented decision-ready. That means the system has already done all the homework — gathered the facts, laid out the options, and attached its recommendation — so the human makes a clean call and the loop resumes. You are answering one well-framed question, not picking up an abandoned task.
Everything else — every routine fork — the loop settles itself. This single, narrow exception is the only thread that reaches from the loop back to the human's hands, and even then it hands you a decision, fully prepared, never a chore.
Think of it like… an autopilot that flies the whole route, but for one thing asks the captain: "Two valid diversion airports, here's weather, fuel, and my recommendation — your call." It doesn't ask the captain to fly; it doesn't ask the captain to re-check its instruments. It asks the one judgement only a human should own, and it asks it fully briefed.
In the Forge front-end (lesson 4) this is the handoff mechanism: the run blocks only at a user-only fork and presents it decision-ready. Everywhere else the bias is to self-answer and keep moving. "Decision-ready" is a hard bar — a handoff that just says "what now?" is a defect; it must carry the gathered facts, the laid-out options, and a recommendation.
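The "hard bar" can be sketched as a check that rejects any handoff missing its homework. The field names (`question`, `facts`, `options`, `recommendation`) are assumptions for illustration; the lesson does not specify the handoff's actual shape.

```javascript
// Hypothetical sketch: reject a handoff that is not decision-ready.
// Field names are assumptions; the real handoff format is not shown in this lesson.
function checkDecisionReady(handoff) {
  const missing = [];
  if (!handoff.question) missing.push("question");
  if (!Array.isArray(handoff.facts) || handoff.facts.length === 0) missing.push("facts");
  if (!Array.isArray(handoff.options) || handoff.options.length < 2) missing.push("options");
  if (!handoff.recommendation) missing.push("recommendation");
  return { ready: missing.length === 0, missing };
}

const bare = checkDecisionReady({ question: "what now?" });
const prepared = checkDecisionReady({
  question: "Publish the fix?",
  facts: ["diff summarised", "risk noted"],
  options: ["ship", "hold"],
  recommendation: "ship",
});
console.log(bare.ready, bare.missing.join(", "), prepared.ready);
```

A bare "what now?" fails on three counts, which is exactly the defect described above.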
A decision earns a handoff if it is irreversible (can't be cleanly undone — destructive deletes, sends), outward (it leaves the sandbox into the world — publishing, money, third parties), or business intent (it encodes a preference the system can't read off the artifact or the goal). Fail all three and it is routine: the loop decides and logs it.
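The three-way test above translates almost directly into code. In this sketch the flag names and the `lane` labels are hypothetical; how the system would actually determine each flag is outside what this lesson covers.

```javascript
// Hypothetical sketch of the three-way test; flag and lane names are assumptions.
function classifyDecision({ irreversible = false, outward = false, businessIntent = false }) {
  const reasons = [];
  if (irreversible) reasons.push("irreversible");
  if (outward) reasons.push("outward");
  if (businessIntent) reasons.push("business intent");
  // Any one criterion earns a handoff; failing all three means routine.
  return reasons.length > 0
    ? { lane: "handoff", reasons }
    : { lane: "routine", reasons };
}

const publish = classifyDecision({ irreversible: true, outward: true });
const pickFile = classifyDecision({});
console.log(publish.lane, pickFile.lane);
```

Returning the reasons alongside the lane means the log can say why a decision left the routine lane, not just that it did.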
Here is what observability looks like as the actual things you'd do while a run is going — all of them reads, none of them steering. You tail the log, you print status, you read the report when it lands. Notice there is no command here that advances the work; that's the point.
    # watch the running narrative as passes append to it
    tail -f LOOP-LOG.md

    # one-glance: which pass, converged or going, any blocker
    cat status        # or: ./status

    # when it converges, read the independent report (findings, not chores)
    cat review.md

    # there is NO "advance" command — the loop moves itself.
    # the human only ever acts at a decision-ready handoff.
From the run's working directory: tail -f LOOP-LOG.md follows the narrative live; cat status prints the current pass and convergence flag; cat review.md reads the validator's report once it exists. All three are pure reads — running them never changes the run or unblocks it.
There is intentionally no loop next or approve command for routine work. The only human action point in the whole run is the handoff at a user-only fork (section 8), and that one is surfaced to you explicitly, decision-ready. If you find yourself looking for a button to push to keep things moving, the system is telling you it doesn't need one.
Let's watch one real-shaped run entirely from the observer's seat. The goal was set in the evening: "make /search reject an empty query with a 400." You go to bed. Here is your whole involvement — four moments of reading, and exactly one decision.
You write the done-when and start the run. From here you do not type another instruction. The loop takes over: LEARN reads the repo, ANALYZE picks one unit, EXECUTE adds the guard, VERIFY hits the real endpoint. You are asleep for all of it.
The agent is unsure whether Flask gives None or "" for an empty param. That's routine — a fact that lives on the web — so it grounds it via the Bright Data CLI and writes the answer into LOOP-LOG.md. It did not wait for you. Reversible, inferable, internal: the loop decides and logs.
tail LOOP-LOG.md shows three clean passes, each with proof. cat status reads converged ✓. cat review.md is an independent report: guard covered, 12 passed, real boundary hit at /search?q=. Nobody handed you a test to run — it was already done. You are auditing trust, not doing QA.
review.md notes a user-only fork waiting: the fix is ready to ship publicly, and publishing is outward and irreversible. The handoff is decision-ready — diff summarised, risk noted, recommendation: "ship". You answer once. The loop resumes and publishes. That single click is the only time the run needed your hands.
What the human did all night
Set a goal, read three files, answered one decision-ready question. Zero routine steps, zero QA chores. The loop ran AFK end-to-end; observability told you everything; the one fork that was truly yours was handed to you fully prepared. That is the human staying out of the path.
    # LOOP-LOG.md — fix/empty-query (AFK, no human in the path)
    [pass 1] LEARN   api.py:42 has no empty-q guard; suite 11 passed, 1 skipped
    [pass 1] ANALYZE pick ONE unit → add guard returning 400 on empty q
    [pass 1] EXECUTE added guard in api.py
    [pass 1] VERIFY  curl "/search?q=" → 400 (real boundary) · pytest 12 passed
    [pass 1] DECIDE  done-when met → converged
    [fork]   user-only: publish (outward, irreversible) → decision-ready handoff
    # human answered: ship → loop resumed → published. no QA delegated.
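Because the log is plain text with a regular shape, an observer could also read it with a few lines of code. The sketch below assumes entries look like `[pass N] STEP text`, as in the sample log; that format is inferred from the example rather than documented, and the function names are hypothetical. It reports any pass that has no VERIFY entry.

```javascript
// Hypothetical sketch: parse "[pass N] STEP text" lines and find passes without proof.
// The line format is inferred from the sample log, not from a documented spec.
function parseEntry(line) {
  const match = /^\[pass (\d+)\]\s+([A-Z]+)\s+(.*)$/.exec(line);
  return match ? { pass: Number(match[1]), step: match[2], text: match[3] } : null;
}

function passesWithoutProof(lines) {
  const seen = new Set();
  const verified = new Set();
  for (const line of lines) {
    const entry = parseEntry(line);
    if (!entry) continue; // skip the header, fork notes, and comments
    seen.add(entry.pass);
    if (entry.step === "VERIFY") verified.add(entry.pass);
  }
  return [...seen].filter((pass) => !verified.has(pass));
}

const sampleLog = [
  "# LOOP-LOG.md (sample)",
  "[pass 1] LEARN api.py:42 has no empty-q guard",
  "[pass 1] EXECUTE added guard in api.py",
  "[pass 1] VERIFY pytest 12 passed",
  "[pass 1] DECIDE done-when met",
  "[fork] user-only: publish",
];
console.log(passesWithoutProof(sampleLog));
console.log(passesWithoutProof([...sampleLog, "[pass 2] EXECUTE another change"]));
```

Like `tail` and `cat`, this is a pure read: it inspects the trail and changes nothing about the run.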
Recall beats re-reading. Answer each from memory before you peek — the option you pick grades instantly, with a note on why. No tells in the formatting; the answers are spread around on purpose.
Q1. During a routine AFK run, what is the human's job?
Q2. An LLM hits a routine fork it can settle from the artifact. What does it do?
Q3. What is review.md?
Q4. Which decision is allowed to pause the loop for a human?
Q5. What does "decision-ready" mean for a handoff?