Loop engineering does not do the whole job in one giant leap. It does it in passes — small, repeating turns where each turn looks at reality, makes exactly one change, and checks that change against something real before the next turn begins. This lesson walks through one full pass, beat by beat, so you can read what is happening when the harness is running.
Imagine you ask the harness to fix a bug, build a feature, or polish a piece of writing. It will not attempt the whole thing in one heroic burst, hand you a finished pile, and hope it works. Instead it works in passes. A pass is one short, complete turn of the same cycle (five beats, all measured against a scope fixed before the first pass), and the harness keeps taking passes until the job is provably finished.
Every pass follows the exact same shape: look at where things really stand, decide what the single most important next move is, make just that one move, check it against something real, and decide what to fix next — then loop. The discipline is the point: one change at a time, each one checked before the next begins. That is what keeps a long, unattended run from drifting into a mess.
Think of it like climbing a ladder in the dark. You do not leap for the top — you cannot even see it. You feel for the next rung, put your weight on it to make sure it holds, and only then reach for the one after. Each rung is a pass: a small move you actually test before trusting it. Skip the testing and rush three rungs at once, and the first weak one drops you all the way down.
One pass is: LEARN → ANALYZE → EXECUTE one bounded unit → VERIFY at the real boundary → DECIDE → (loop). SCOPE sits before the very first pass and defines what "done" means; it is the contract every later pass is measured against. The loop repeats until the scope's done-when conditions are all met at a real boundary, not in the model's imagination.
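The shape of that loop can be sketched in a few lines of Python. This is a minimal illustration under assumptions, not the harness's actual code: `DoneWhen`, `run_loop`, `run_pass`, and the `max_passes` safety cap are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class DoneWhen:
    """One scope condition, checked by actually observing a boundary."""
    description: str
    check: Callable[[], bool]


def run_loop(scope: List[DoneWhen],
             run_pass: Callable[[int], None],
             max_passes: int = 50) -> int:
    """Take passes until every done-when condition is observed true.

    `run_pass` stands in for LEARN -> ANALYZE -> EXECUTE -> VERIFY -> DECIDE
    on one bounded unit. Returns the number of passes taken.
    """
    passes = 0
    while not all(cond.check() for cond in scope):
        if passes >= max_passes:  # assumed safety cap, not from the lesson
            raise RuntimeError("pass budget exhausted before convergence")
        passes += 1
        run_pass(passes)
    return passes


# Toy run: "done" means three items exist; each pass adds exactly one.
items: List[str] = []
scope = [DoneWhen("three items exist", lambda: len(items) >= 3)]
passes_taken = run_loop(scope, lambda n: items.append(f"unit-{n}"))
print(passes_taken, items)
```

The termination test re-checks the scope before every pass, so the loop stops on observed state rather than on a count of attempts.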
A long task done in one shot accumulates unverified assumptions: by the end, an error early on has silently shaped everything after it. Bounding each pass to one verified change turns a fragile monologue into a sequence of checkpoints. If pass 7 breaks something, you know it was pass 7 — the previous six were each proven good at their own boundary.
Before the loop turns even once, there is a contract: what does "done" actually mean here? Not a vibe — a checkable condition. "Done" is not "the tests feel like they pass" or "it looks better." It is something you could hand to a stranger who would agree, just by looking, whether you hit it or not. "The login page returns in under 300 ms for 95% of requests." "All 14 unit tests pass and the build is green." "The article reads at a 9th-grade level and every claim has a source."
This is the one place where you, the human, are firmly in charge. You set the target — the measurable done-when — or you approve the one the harness proposes. Everything the loop does afterward is in service of that target, so if the target is fuzzy, the whole run is fuzzy. A sharp, measurable "done" is the single highest-leverage thing you provide.
Think of it like a finish line painted on the track. Runners can pace themselves, sprint, or coast — but nobody argues about who won, because the line is right there on the ground. A measurable "done" is that painted line. Without it, every pass is a runner asking "are we there yet?" and getting a different answer each time.
In the harness, scope is captured as a small set of done-when conditions — each one observable at a boundary (a test, a build, an HTTP response, a file diff, a rendered page). The next lesson covers the gates that enforce them; for now the key idea is that the loop has a fixed thing to aim at, written down before any change is made.
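To make "observable at a boundary" concrete, here is a hedged sketch of one done-when condition written as a checkable predicate, using the lesson's own latency example (under 300 ms for 95% of requests). The function name `meets_latency_target` and the sample numbers are illustrative assumptions; in a real run the samples would come from measured requests.

```python
from typing import Sequence


def meets_latency_target(samples_ms: Sequence[float],
                         limit_ms: float = 300.0,
                         required_fraction: float = 0.95) -> bool:
    """True when at least `required_fraction` of samples are under `limit_ms`."""
    if not samples_ms:
        return False  # no observations means nothing was proven
    under = sum(1 for s in samples_ms if s < limit_ms)
    return under / len(samples_ms) >= required_fraction


# Illustrative numbers only: 96 fast requests and 4 slow ones out of 100.
fast_enough = meets_latency_target([120.0] * 96 + [900.0] * 4)
# 90 fast and 10 slow is 90%, which misses the 95% bar.
too_slow = meets_latency_target([120.0] * 90 + [900.0] * 10)
print(fast_enough, too_slow)
```

A stranger handed the same samples would reach the same verdict, which is the property the scope needs.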
For a raw, vague ask, the front-end Forge exists precisely to turn "make it better" into measurable done-when conditions before the loop starts (you will meet Forge in Module 2). Either way, the rule is the same: no measurable done, no loop.
Here is one full pass, in order. Read it as a story — each beat hands its result to the next.
LEARN — look first. The pass opens by reading reality: the current state of the code or document, the scope, and any trusted sources. It inspects the real artifacts rather than guessing what they probably contain. Starting from a guess is the cheapest way to ruin a whole pass.
ANALYZE — name the gap, then pick ONE. With reality in hand, it compares where things are against where "done" says they should be, and buckets the gap into candidate moves. Each candidate gets a quick rating — Fit, Risk, Proof, Blocker, Next — and from that ranking it picks exactly one unit of work to do this pass. Not three. One.
EXECUTE — do just that one thing. It makes the single chosen change and nothing else. The temptation to "while I'm here, also fix…" is exactly what the loop refuses, because a pass that changes five things can't tell you which one broke.
VERIFY — check it for real. It then tests the change at a real boundary — runs the test, builds the project, loads the page, diffs the file — and looks at the actual result. Not "this should work." Proof.
DECIDE — improve, then loop. Based on what verification showed, it decides what to fix next: usually the artifact itself, but sometimes the instructions driving the work. Then it starts a fresh pass at LEARN. The cycle repeats until every done-when condition is met.
Think of it like a careful cook tasting as they go. Look at the pot (LEARN), decide the dish needs salt and only salt (ANALYZE → pick one), add a pinch (EXECUTE), taste it (VERIFY), then decide the next single adjustment (DECIDE). A cook who dumps in salt, pepper, lemon, and chili all at once and tastes at the end has no idea what to change — and neither would the loop.
LEARN reads state from the real boundary (filesystem, git, running process, trusted docs) — never from stale memory. ANALYZE produces a ranked list and selects a single bounded unit. EXECUTE applies that one unit. VERIFY runs the check at the boundary and records the observed result. DECIDE chooses the next target — artifact or prompt — and re-enters the loop. SCOPE's done-when is the termination condition.
LEARN, ANALYZE, EXECUTE and VERIFY each get fuller treatment across the course; the gates that make VERIFY trustworthy are the whole of Lesson 3. This lesson's job is the shape: that these beats run in this order, once per pass, every pass.
The same six beats, drawn as the cycle they form. Scope sets the target once; then the five inner beats turn, again and again, with verification as the gate that decides whether you advance or loop back to fix.
ANALYZE is where a pass earns its focus. Looking at reality usually surfaces several things that could be done — a failing test here, a missing edge case there, a rough sentence, a slow query. The loop does not attack all of them. It first buckets the gap (groups what's missing into a short list of candidate moves), then gives each candidate a quick rating, and picks the single best one to do this pass.
The rating asks five plain questions about each candidate, one per axis: Fit, Risk, Proof, Blocker, and Next.
The winner is the one unit with the best balance: high Fit, manageable Risk, something you can actually prove, ideally a Blocker that frees up later work. Everything else waits for a future pass. This is the rule that keeps the loop honest — one bounded unit per pass, chosen on purpose, not whatever is most tempting.
Think of it like triage in an emergency room. Five patients arrive at once. The nurse does not treat all five at half-speed; they rate each by urgency and what can actually be helped right now, and the most critical one goes first. The others are not forgotten — they are next in line. ANALYZE is that triage nurse for the work.
The five axes are a fast, repeatable rubric rather than a heavy scoring model. Proof is decisive: a candidate that cannot be verified at a real boundary this pass is deprioritised, because an unverifiable change can't safely close. Blocker captures dependency order — picking the unit that unblocks three others is usually higher-leverage than a flashy but isolated change.
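One way to picture the rubric is as a sort key. The sketch below is an assumption-laden illustration, not the harness's scoring: the numeric scales, the ordering of the tie-breakers, and the names `Candidate` and `pick_one` are all hypothetical, and the Next axis is left out because its definition is not given here. The only behaviour taken directly from the text is that an unprovable candidate ranks below every provable one.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    name: str
    fit: int        # 1 (weak) to 3 (strong); scale is an assumption
    risk: int       # 1 (low) to 3 (high); scale is an assumption
    provable: bool  # can it be verified at a real boundary this pass?
    unblocks: int   # how many later units this one frees up


def pick_one(candidates: List[Candidate]) -> Candidate:
    """Return the single unit for this pass.

    Provable candidates always outrank unprovable ones; within each group,
    higher fit, more unblocked work, and lower risk win, in that order.
    """
    return max(
        candidates,
        key=lambda c: (c.provable, c.fit, c.unblocks, -c.risk),
    )


# Two of these candidate names are made up for the example.
candidates = [
    Candidate("rewrite token storage", fit=3, risk=3, provable=False, unblocks=2),
    Candidate("handle expired-token retry", fit=3, risk=1, provable=True, unblocks=1),
    Candidate("tidy log messages", fit=1, risk=1, provable=True, unblocks=0),
]
chosen = pick_one(candidates)
print(chosen.name)
```

Exactly one candidate comes back; the rest stay in the list for a later pass.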
A bounded unit is one whose effect and verification both fit inside a single pass — small enough to execute and check before the next LEARN. "Refactor the whole module" is not bounded; "extract this one function and keep the tests green" is. If a candidate is too big to bound, the right move is often to pick the smaller unit that splits it.
The single most important rule of the loop is the one that sounds almost too simple: do exactly one bounded thing per pass. Not zero. Not five. One.
Why not five? Because if you change five things and then verify, and the result is wrong, you cannot tell which of the five caused it. You have lost the thing the loop is for — the ability to point at a single change and say "this one is proven good." Batching trades a moment of apparent speed for a debugging swamp later.
Why not zero? Because a pass that looks, thinks, and then does nothing is wasted motion. The loop never idles: every pass either ships one verified change or hits a real blocker it surfaces clearly. "I'll wait and see" is not a pass; it is a stall, and over a long unattended run, stalls are how progress quietly dies.
Think of it like surgery, one incision at a time. A surgeon does not make five cuts at once "to save time," and they do not stand frozen over the patient either. Each deliberate action is made, checked, and only then is the next one taken. The patient — your codebase, your document — survives precisely because nothing happens that isn't one controlled, verified move.
One-unit passes make a run trivially bisectable: every checkpoint isolates exactly one change against its verification. A regression introduced at pass N is attributable to pass N by construction — there is no "which of these edits did it?" because each pass made one edit and proved it. This is what allows a long autonomous run to stay debuggable.
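The attribution argument can be shown with a small replay. This is a sketch under assumptions: `first_bad_pass` and the change names are hypothetical, and `still_good` stands in for whatever boundary check the run uses.

```python
from typing import Callable, List, Optional


def first_bad_pass(changes: List[str],
                   still_good: Callable[[List[str]], bool]) -> Optional[int]:
    """Replay one change per pass; return the 1-based pass that breaks the check."""
    applied: List[str] = []
    for pass_number, change in enumerate(changes, start=1):
        applied.append(change)
        if not still_good(applied):
            return pass_number  # one change per pass, so this pass is the culprit
    return None


# Hypothetical run in which the third change introduces the regression.
changes = ["add retry", "rotate token", "drop null check", "rename helper"]
culprit = first_bad_pass(changes, lambda applied: "drop null check" not in applied)
print(culprit)
```

If each pass had batched several changes, the same replay would only narrow the fault to a group of edits rather than to one.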
The discipline cuts both ways. An LLM driving the loop must convert each pass into either a shipped, verified unit or an explicitly surfaced blocker — never a no-op, never "let me think about it" with no output. Idle passes burn budget and erode observability, since the human watching the log can no longer tell whether progress is happening.
A change isn't done because it looks done. It's done because you checked it against reality and reality agreed. That check is VERIFY, and it always happens at a real boundary — the actual place where the truth lives. If the unit was a code fix, you run the test and read the result. If it was a build change, you build the project. If it was a page, you load the page. If it was a sentence, you re-read it in context.
What VERIFY refuses is the comfortable lie: claiming success from memory, or trusting a mock that always says yes. The harness has a hard rule here — verify by actually running the check, never by simulating it in your head and declaring victory. A pass that "should pass" hasn't passed. Only the boundary gets to say so.
Think of it like a smoke detector versus your own nose. You might smell nothing and feel sure the house is fine. The detector is the real boundary — it doesn't care how confident you are; it samples the actual air. The loop trusts the detector, not the hunch. That's the difference between "I think it works" and "it works."
This is the heart of what Lesson 3 calls the Proof Gate: every claimed completion must be backed by observed output from the real boundary — a test runner's exit code, a build log, an HTTP status, a rendered diff. "Claim" and "mock" are explicitly not evidence. A unit only advances when its done-when condition is observed true, not asserted true.
Boundaries are concrete and varied: a unit/integration test, a compiler, a linter, a running server hit with a request, a screenshot, a file diff reviewed against the spec, or live web evidence pulled via a tool. The skill is choosing a boundary that actually exercises the change — a test that doesn't touch the changed path proves nothing.
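A minimal sketch of verifying at a boundary, assuming the check is a command whose exit code carries the verdict. `VerifyResult` and `verify` are hypothetical names; the stand-in commands use the Python interpreter so the example runs anywhere, where a real pass would run the project's own test or build command.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import List


@dataclass
class VerifyResult:
    command: List[str]
    exit_code: int
    output: str

    @property
    def passed(self) -> bool:
        # Only the observed exit code decides; nothing is assumed.
        return self.exit_code == 0


def verify(command: List[str], timeout_s: float = 60.0) -> VerifyResult:
    """Run the real check and record what was observed."""
    proc = subprocess.run(command, capture_output=True, text=True, timeout=timeout_s)
    return VerifyResult(command, proc.returncode, proc.stdout + proc.stderr)


# Stand-in checks: one that exits 0 and one that exits 1.
ok = verify([sys.executable, "-c", "print('14 passing')"])
bad = verify([sys.executable, "-c", "import sys; sys.exit(1)"])
print(ok.passed, bad.passed)
```

Keeping the captured output alongside the exit code gives the log something observed to record, rather than a claim.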
Verification just told you the truth about your one change. DECIDE is what you do with that truth. Usually the answer is straightforward: pick the next single improvement to the artifact — the next bug, the next missing piece — and start a fresh pass at LEARN.
But there's a subtler, powerful move. Sometimes the thing that needs fixing isn't the artifact at all — it's the instructions driving the work. If pass after pass keeps missing the same way, the smartest change might be to sharpen the prompt or the scope itself, so the next pass aims better. The loop is allowed to improve itself, not just its output. Then, either way, it loops — and keeps looping until every done-when condition is met.
Think of it like a GPS rerouting. After each stretch of road it checks where you actually are against where you should be. Usually it just says "continue" (improve the artifact). But if you keep drifting off course, it doesn't repeat the same wrong turn louder — it recalculates the whole route (improve the prompt). Same destination, smarter directions.
DECIDE can target either editable surface: the artifact (the code, doc, or design under construction) or the prompt/scope (the instructions and done-when guiding the loop). Repeated near-misses on the same axis are the classic signal to improve the prompt rather than grind the artifact — a cheap meta-correction that re-aims every subsequent pass.
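The "repeated near-misses on the same axis" signal could be expressed as a small rule. This is a hedged sketch: `decide_target`, the axis labels, and the threshold of three consecutive misses are assumptions made for illustration, not values from the harness.

```python
from typing import List, Optional


def decide_target(recent_misses: List[Optional[str]], threshold: int = 3) -> str:
    """Pick which surface to improve next.

    `recent_misses` holds, per pass and oldest first, the axis on which
    verification missed, or None when it passed. `threshold` is an assumed
    cutoff for "keeps missing the same way".
    """
    tail = recent_misses[-threshold:]
    same_axis = (
        len(tail) == threshold
        and tail[0] is not None
        and len(set(tail)) == 1
    )
    return "prompt" if same_axis else "artifact"


print(decide_target([None, "timeout", "timeout", "timeout"]))
print(decide_target(["timeout", None, "timeout"]))
```

The first call sees three misses in a row on one axis and re-aims the prompt; the second sees a pass in between and keeps working on the artifact.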
The loop terminates when the scope's done-when conditions are all observed true at their boundaries — that is convergence. If a pass surfaces a blocker that needs a human decision, the loop pauses at a clearly-marked handoff rather than guessing. Otherwise it keeps taking passes; "improve until convergence" is the literal stop rule.
While the loop runs — often for hours, unattended — you don't sit and drive it. You watch it. The harness writes a running record, LOOP-LOG.md, that turns the invisible passes into something you can read at a glance: how many passes have happened, how many units actually shipped, how often verification passed, and whether anything is blocked. This is your observability window. You set "done"; the log lets you confirm the loop is honestly converging on it.
The panel below is that window, made live. The four tiles up top are the loop's vital signs; the table is one row per unit of work, each with a badge for where it stands. Hit "Run a pass" to advance the loop by one turn, or flip on "Auto-loop" to watch passes tick by the way they would during a real AFK run.
LOOP-LOG.md — auth-refresh goal
one bounded unit per pass · verify at the real boundary
| Unit of work | State | Fit | Last verify (real boundary) |
|---|---|---|---|
A single list of units drives both the table and the rollup banner. Each pass advances exactly one unit — moving a queued unit into "verifying", or resolving a "verifying" unit into "shipped" (verify passed) or "blocked" (verify failed at the boundary). The vital-sign tiles are recomputed from that list every pass: passes run, units shipped, the rolling verify pass-rate, and idle passes held at zero. The overall banner reads the worst unit: any blocker turns it amber, otherwise it reports healthy convergence.
Passes run is the loop's heartbeat. Units shipped over passes is its true throughput — verified work, not attempts. Verify pass-rate is the honesty signal; a falling rate says the prompt may need improving (back to DECIDE). Idle passes must stay at zero — the moment it climbs, the loop is stalling rather than progressing. Reading these four is exactly how a human supervises an AFK run without touching it.
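The four vital signs and the banner can all be derived from one list of units, roughly as follows. This is a simplified sketch: `Unit` and `vital_signs` are hypothetical names, the pass and idle counters are passed in rather than tracked, and the pass-rate is computed over all resolved units instead of a rolling window.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Unit:
    name: str
    state: str  # "queued", "verifying", "shipped", or "blocked"


def vital_signs(units: List[Unit], passes_run: int, idle_passes: int) -> Dict[str, object]:
    shipped = sum(1 for u in units if u.state == "shipped")
    blocked = sum(1 for u in units if u.state == "blocked")
    resolved = shipped + blocked  # units whose verification has an outcome
    return {
        "passes_run": passes_run,
        "units_shipped": shipped,
        "verify_pass_rate": shipped / resolved if resolved else None,
        "idle_passes": idle_passes,
        # The banner reads the worst unit: any blocker turns it amber.
        "banner": "amber" if blocked else "healthy",
    }


# Unit names other than the two auth-refresh ones are made up for the example.
units = [
    Unit("handle expired-token retry", "shipped"),
    Unit("rotate refresh token on success", "shipped"),
    Unit("log refresh failures", "shipped"),
    Unit("revoke token on logout", "blocked"),
    Unit("add clock-skew tolerance", "verifying"),
    Unit("document retry policy", "queued"),
]
signs = vital_signs(units, passes_run=9, idle_passes=0)
print(signs)
```

Because everything is recomputed from the unit list, the tiles and the banner cannot drift out of step with the table.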
The loop is one cycle, but three different kinds of actor touch it, each with a clear job.
You, the human, own the edges. You set or approve the measurable done-when at the start, and then you mostly watch — reading LOOP-LOG.md to confirm the run is converging honestly. You step back in only when the loop surfaces a real decision it shouldn't make alone. You are the supervisor, not the driver.
The LLM is the engine of a pass. It runs the beats of each pass and obeys the iron rule: exactly one bounded unit per pass. It never batches several changes and never idles on a no-op. Each pass it produces either one verified change or a clearly surfaced blocker. That discipline is what makes its long, unattended output trustworthy.
The agents are how passes get spread across tools. A single pass can be handed to a different command-line model than the last one — one pass driven by one CLI, the next by another — so the strongest tool for each step does that step. The orchestration layer dispatches a pass to whichever agent fits, and the log keeps it all legible.
Think of it like a film set. The director (you) sets the vision and watches the monitor, but doesn't operate the camera. Each shot (pass) is taken by whichever specialist crew is right for it — and the call sheet (the log) lets the director see every take without standing behind every lens.
In a multi-agent run, the orchestrator can dispatch any single pass to a specific command-line agent in headless mode — conventionally cli -p "<the one bounded unit for this pass>". Because each pass is bounded and independently verified, it does not matter that pass 4 ran on one model and pass 5 on another; the boundary check is what certifies the result, not the identity of the engine. A roster of agents is chosen up front, and the validator of a pass is never the same agent that produced it.
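The producer/validator rule can be sketched as a small assignment function. Everything here is illustrative: the round-robin policy, the agent names, and the helpers `assign_agents` and `headless_command` are assumptions; only the `-p "<unit>"` convention and the requirement that the validator differ from the producer come from the text above. The roster is assumed to contain distinct names.

```python
from typing import List, Tuple


def assign_agents(roster: List[str], pass_number: int) -> Tuple[str, str]:
    """Return (producer, validator) for a pass, never the same agent.

    Round-robin is an assumed policy; the lesson only requires that the
    validator differ from the producer.
    """
    if len(set(roster)) < 2:
        raise ValueError("need at least two distinct agents")
    producer = roster[pass_number % len(roster)]
    validator = roster[(pass_number + 1) % len(roster)]
    return producer, validator


def headless_command(agent: str, unit: str) -> List[str]:
    """Build the argv for one pass, following the `cli -p "<unit>"` convention."""
    return [agent, "-p", unit]


roster = ["agent-a", "agent-b", "agent-c"]  # hypothetical agent names
producer, validator = assign_agents(roster, pass_number=4)
argv = headless_command(producer, "EXECUTE one unit: handle expired-token retry in refresh.ts")
print(producer, validator, argv)
```

Building the command as an argument list rather than a single string keeps the one-unit prompt intact as a single argument, whatever characters it contains.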
Everything inner-loop runs AFK (away-from-keyboard): the human has observability, not a steering wheel. The only place a run blocks for a person is a deliberate, decision-ready handoff. Watching LOOP-LOG.md (and, later, a review file) is the supervision surface; it never requires the human to execute a step themselves.
None of this is abstract. A pass leaves a trail you can read. Here is the kind of entry the loop appends to LOOP-LOG.md after a single pass. Notice that it records every beat: what it learned (including the scope's done-when), the one unit it picked and why, what it did, the real verification it ran, and the decision for next time.
## pass 7 — 2026-06-14 09:41
learn    read src/auth/refresh.ts + 14 tests; scope done-when = "all auth tests green"
analyze  gap bucketed into 3 candidates; rated Fit/Risk/Proof/Blocker/Next
         picked → "handle expired-token retry" (Fit:high Risk:low Proof:test Blocker:yes)
execute  edited refresh.ts: retry once on 401, then surface error (1 unit, no batching)
verify   $ npm test -- auth/refresh
         ✓ 14 passing (real boundary: test runner exit 0)
decide   artifact ok; next unit = "rotate refresh token on success" → loop
When the harness drives a goal, the log lives at the root of the working directory. To watch passes arrive in real time, tail it:
your terminal
$ cat LOOP-LOG.md      # read the whole run so far
$ tail -f LOOP-LOG.md  # follow new passes as they land
A pass dispatched to a specific command-line agent is launched in headless mode with the one bounded unit as its prompt:
orchestrator: dispatch one pass
$ cli -p "EXECUTE one unit: handle expired-token retry in refresh.ts; verify with: npm test -- auth/refresh"
The exact agent behind cli can differ pass to pass — what certifies the result is the verification at the boundary, not which engine ran it.
The takeaway
Every pass is one bounded unit, learned from reality, executed alone, and proven at a real boundary before the loop turns again. That is the whole engine: small verified steps, repeated until "done" is observably true.
Three quick questions. Answer each from memory before looking back through the lesson: retrieval beats re-reading.
Q1. During ANALYZE, the loop finds five things that could be fixed. How many does it do this pass?
Q2. What does "VERIFY at the real boundary" rule out?
Q3. In an AFK run, what is the human's main job once the loop is going?