The Loop Files Its Own Work
Last time I proved a loop could finish a batch once, with me still in the room. This time it caught its own unfinished work mid-run, filed the ticket, shipped it, and hardened its own machine on the way. Here's the whole system, drawn from zero, and the honest ledger of where it still needed me.
The harness is the product, not the model. The model is a commodity engine you swap out; everything that makes a run finish lives in the loop around it.
It started from two sentences. A Slack message became one master ticket; a deeper research pass broke it into fully specified sub-tickets with no blockers; then the loop ran the batch unattended for about six hours.
The only exit authority is a done-oracle the agent cannot edit. Self-reports are worthless: the moment "tell me you are done" is the metric, it stops being a measurement.
The milestone: merged is not shipped. An activation gap, deployed but producing no real effect, is now discovered work. The loop filed its own blocker ticket mid-run, root-caused a silent permission failure, and drove it live.
It caught a bug both models missed. A filter that parsed its own operators differently than everyone assumed, matching nothing. Only live replay caught it. Absence of error is not verification.
It still needed me. Fewer times, and almost always to harden the harness, not to do the work, which is exactly the tell you want.
The last post ended on a question. I had built a harness that could drive a batch of work end to end, a dumb outer loop wrapping a fresh, faceless agent, with a verifier holding the only key to the exit, and I'd run it once, for real, and it worked. But I said the honest thing: I was still in the room, it still needed me about seven times, and the part that was supposed to make those interventions disappear, a self-heal pass that lets the loop repair its own machine, was brand new and had never faced a real stop. I closed with the only claim I could defend: I don't trust the agent; I'm testing whether I can trust the loop.
This is the next data point. I pointed the same machine at a new batch and walked further out of the room. Two things happened that hadn't before, and both are the difference between "a demo that worked once" and "a system." First, the loop caught its own unfinished work mid-run, code that had merged but wasn't actually doing anything yet, and instead of noting it in a report for me to clean up later, it filed its own ticket, wired it into the batch, and shipped it. Second, when it hit walls, it stopped being my job to patch the harness; the harness started patching itself, at the root, without being allowed to touch the one thing it must never touch.
So this post does two jobs. It's the report on that run. And because a few people asked how the whole thing actually works, it's also the complete guide, built up from nothing, easy first, then all the way down. By the end you'll understand every moving part well enough to build your own.
How it started: two sentences
The work didn't begin with a plan. It began with a Slack message, two short sentences, the kind you fire off between meetings. In effect: give us real visibility into what the system is actually doing, and make sure someone gets paged when production breaks. That's it. A vague, correct instinct and zero structure.
The first move was not to start coding. It was to turn that instinct into something a machine could execute without me. An agent read the message straight out of Slack, did a fast pass over the existing system to see what was already there, and wrote one master ticket, the requirement captured in one place, in plain language, with the shape of the problem but not yet the work.
Then the real prep: a deeper, agentic research pass (several agents in parallel, each reading the actual code and infrastructure rather than trusting the ticket text) that broke the master ticket into a handful of fully specified sub-tickets. Not "add metrics." Each one had a concrete scope, acceptance criteria, the file-level hooks where the change lands, a dependency order, and a test plan. The point of that pass is single-minded: get every ticket to the state where an unattended worker can pick it up and drive it to done with no blocker and no question left to ask a human. Blockers are what turn autonomy back into babysitting. You pay them down before you start, on purpose, or you pay them during the run as interruptions.
Only then did I trigger the batch over that range of tickets and close the laptop. What follows is what it did, and how the machine that did it is built.
The machine, from zero
Start with the dumbest possible version, because the dumbness is the point.
while ! ./done_oracle.sh "$BATCH"; do
    ticket="$(./ready_set_next.sh "$BATCH")"
    ./run_one_ticket.sh "$ticket"   # a fresh headless agent, empty context
done
That's the whole spine. A bash while loop. It asks a script "are we done?", and if not, it asks another script "what's ready to work?", and then it launches a brand-new agent to work that one ticket. No long-lived chat session. No "keep going." A loop, a picker, and a fresh worker per pass. Everything else in this post is detail hung on those three bones.
Amnesia is the feature
Each pass, the worker is a fresh process with an empty context window. Nothing carries over. That sounds wasteful; it's the cure. Long agent sessions rot: past an hour, or after a context compaction, they silt up with stale plans, obsolete assumptions, and half-remembered tool output, and they start making confident decisions on top of garbage. The loop refuses to worship continuity. Memory lives where memory belongs: on disk, in the ticket, in the branch, in the commit history, in a prompt file the worker reads on wake. The repository remembers. The context window is scratch paper you throw away every pass.
So each worker wakes up in a small box (one ticket, one branch, one finish line, one set of tools, one attempt budget), thrashes inside that box if it needs to, and can't drag six hours of accumulated confidence into the next iteration. When one worker died mid-run this time and a fresh one took over from the persisted state, it wasn't a failure mode. It was the design working: the long first pass did the heavy lifting, wrote down where it was, and a clean replacement finished the job.
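The attempt budget mentioned above can be sketched as a small wrapper. This is a minimal illustration, not the harness's actual code: the name `run_with_budget` is hypothetical, and the command passed in stands in for whatever launches a fresh worker process.

```shell
# run_with_budget BUDGET CMD [ARGS...]
# Run CMD up to BUDGET times and stop at the first success.
# Hypothetical helper: in a real harness CMD would start a brand-new worker
# process, so no in-memory state survives from one attempt to the next.
run_with_budget() {
    budget="$1"; shift
    attempt=1
    while [ "$attempt" -le "$budget" ]; do
        "$@" && return 0              # success ends the box early
        attempt=$((attempt + 1))      # failure spends one unit of budget
    done
    return 1                          # budget exhausted: surface it, do not loop forever
}
```

A call such as `run_with_budget 3 ./run_one_ticket.sh "$ticket"` would give one ticket three fresh attempts before the loop moves on; the number three is an arbitrary example.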
The oracle is the only exit
Here is the load-bearing piece, and if you take one thing from this post, take this. The done-oracle is a tiny script with one job: read the tracker and exit 0 if and only if every ticket in the batch is either genuinely Done or explicitly recorded as human-gated. The loop can't declare victory. The worker can't. The orchestrator can't. Only the oracle can, and the oracle only reads external truth.
Why this is everything: an agent's report on its own work is worthless. Not because models lie for sport, but because "tell me you're done" is a target, and the instant a self-report becomes the metric, it stops measuring anything. Without the oracle, an unattended loop is a slot machine that occasionally emits software and always emits confidence. With it, the worker is boxed in by a scale bolted to the floor. It can posture, complain, write a beautiful status update. The gate still weighs the body. This run's oracle sat at "not done" for the entire time one ticket was merged-but-not-yet-live, exactly the behavior you want, and exactly the behavior a self-report would have skipped.
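To make the exit rule concrete, here is a minimal sketch of an oracle. It assumes the tracker state has already been exported to a plain text file with one `<ticket-id> <status>` pair per line; that file format, the status names `Done` and `HumanGated`, and the function name `done_oracle` are all assumptions for illustration. A real oracle would query the tracker itself.

```shell
# done_oracle FILE
# Return 0 only if every ticket in FILE is Done or HumanGated.
# Assumed (hypothetical) format: "<ticket-id> <status>" per line.
done_oracle() {
    tracker_file="$1"
    # A missing or empty snapshot means "not done", never "done".
    [ -s "$tracker_file" ] || return 1
    awk '
        NF == 0 { next }                                    # ignore blank lines
        { seen++ }                                          # count real tickets
        $2 != "Done" && $2 != "HumanGated" { pending++ }    # any other status is open
        END { if (seen == 0 || pending > 0) exit 1 }        # fail closed
    ' "$tracker_file"
}
```

The design choice worth copying is that every ambiguous case (no file, no tickets, an unknown status) resolves to "not done". The oracle has no input that an agent's prose can reach.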
One ticket, one finish line
Each fresh worker gets exactly one ticket and one finish line phrased so a machine can check it, not "make progress," but a concrete end state: merged, checks green, the new revision actually running in both environments, the live behavior verified, the tracker in its correct final state. That per-ticket check is allowed to be imperfect, because it isn't the final authority; it just keeps the current worker honest and moving. The oracle closes the batch. Local finish line for the worker, global exit gate for the loop. Two different jobs, two different scripts.
Parallel, but not stupid
The loop isn't serial. A scheduler computes the ready set, every ticket whose dependencies are already done, and runs several workers at once, each in its own git worktree so parallel branches never collide on disk. Dependent tickets wait automatically; the ready set recomputes after every completion, so a ticket unblocks the moment its blocker closes. The one thing that genuinely cannot run in parallel is infrastructure applies, because they mutate shared global state, so those serialize behind a single lock and the cloud's own queued concurrency. Dependency-aware scheduling plus worktrees is what makes concurrency boring, and boring is the whole goal. Glamorous concurrency is how you get two agents fighting over the same file at 3 a.m.
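The ready-set computation described above can be sketched in a few lines. This assumes a hypothetical flat file with one ticket per line in the form `<ticket-id> <status> [<dependency-id> ...]`, and the status names `Todo` and `Done`; the real scheduler reads this from the tracker.

```shell
# ready_set FILE
# Print every Todo ticket whose dependencies are all Done.
# Assumed (hypothetical) format: "<ticket-id> <status> [<dependency-id> ...]".
ready_set() {
    awk '
        # Pass 1 over the file: remember the status of every ticket.
        NR == FNR { status[$1] = $2; next }
        # Pass 2: a Todo ticket is ready when no dependency is still open.
        # An unknown dependency id counts as open, so the check fails closed.
        $2 == "Todo" {
            blocked = 0
            for (i = 3; i <= NF; i++)
                if (status[$i] != "Done") blocked = 1
            if (!blocked) print $1
        }
    ' "$1" "$1"
}
```

Rerunning `ready_set` after each completion is what unblocks dependents automatically. Each ticket it prints could then get its own working tree with something like `git worktree add -b "$ticket" "../wt-$ticket"`, where the path layout is again only an example.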
Two models, or it didn't happen
Every non-trivial design decision and every pre-merge review goes through a second, different model, ideally one from a different vendor. Not because two models are objective; they aren't. Because their blind spots don't line up. Single-model review has a creepy failure mode: the same style that wrote the plan reviews the plan, recognizes its own reasoning, and nods. A second model is differently wrong: it distrusts different shortcuts and catches a different class of "sounds fine, will hurt later." On this run the second model earned its keep four times over, which I'll come back to, including one defect that would have silently broken the entire feature.
The run
The ledger, generalized, a batch spanning application and infrastructure work:
One batch is evidence, not gospel, and plenty went sideways: eventual-consistency reads, a state-lock on an apply, a permission chase, and one bug that shipped merged and had to be chased live. The point was never that nothing goes wrong. The point is the blast radius stayed bounded and most of what went wrong became work the loop absorbed, not an interruption that stopped it.
Merged is not shipped
This is the milestone, so I want to be precise about it.
The most seductive lie in shipping software is a green merge. The pull request is approved, the checks pass, the branch goes in, and every ritual tells you it's done. But a merge is a statement about code, not about behavior. The feature can be merged and still dark: a flag gated off, a job pinned to an image built before the feature existed, a metric that nothing has emitted yet so the thing that reads it has nothing to read, a downstream step that simply hasn't run. Everything is "shipped" and nothing is happening.
The old, lazy ending for a run was to notice this and write it down. "Shipped: two follow-ups: flip the flag once data flows, and roll the job." That reads like diligence. It's a punt. A prose follow-up with no ticket behind it is a promise to a future human who may never read it, and it quietly redefines "done" as "merged," which is the exact redefinition the whole system exists to forbid.
So the rulebook changed, and this is the part I care about most: an activation gap, deployed but producing no real effect yet, is no longer a note. It's discovered work, and the loop treats discovered work the way it treats any other work. It files a ticket, wires it into the batch so the oracle now tracks it too, and drives it to live. "Done" means the behavior is on, verified, in production. Not the code merged. The behavior on.
Then the loop went and proved it could do exactly that, unassisted.
Here's the story. A worker was turning on a monitoring alert that had shipped the previous run gated off. The apply failed: the alert validates against a metric that did not exist. A self-reporting agent would have called this blocked and gone home. This one treated the failure as a thread to pull.
It root-caused it properly. The metric was emitted by a scheduled job, and that job's service account had been created with the correct least-privilege instinct, and no permissions at all. So the job's plain logs flowed fine (logs ship regardless of that permission) while every metric and trace write was silently rejected at the moment of export. Logs can make a system look alive while its metrics are dead. The feature had been "shipped" for a while. It had never once emitted a number. That's the activation gap in its purest form: a green merge sitting on top of a permission denial nobody saw, because the failure mode of best-effort telemetry is silence.
What the loop did next is the whole point of this post. It filed its own blocker ticket, described precisely (grant the two missing roles), added a dependency edge so the alert ticket now formally waited on it, and a fresh worker picked the new ticket up about thirteen seconds later. The active batch grew by a ticket on its own, with a correct dependency graph, without me. Then it went further than I would have thought to specify: rather than wait most of a day for the job's next scheduled run to make the metric appear, it built a safe on-demand trigger to fire the job now; and it deliberately disabled the alert first, so that flipping it on against a still-empty metric couldn't fire a page-class false alarm, re-enabling it only once real data was flowing. Don't arm a pager against a void and call it coverage. Disable-first, seed the data, verify, re-enable. That's not in a prompt. That's a worker reasoning about blast radius.
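The disable-first sequence is easy to state as code. The sketch below is an illustration of the ordering only: `disable_alert`, `trigger_job`, `metric_has_data`, and `enable_alert` are hypothetical hook names that you would define as wrappers around your own monitoring and scheduler commands.

```shell
# arm_alert_safely
# Disable first, seed the data, verify, re-enable.
# The four hooks are hypothetical placeholders supplied by the caller.
arm_alert_safely() {
    disable_alert   || return 1   # never arm a pager against an empty metric
    trigger_job     || return 1   # on-demand run instead of waiting for the schedule
    metric_has_data || return 1   # no data yet: stop here, alert stays disabled
    enable_alert                  # only now can a page mean something
}
```

The ordering is the whole point: any failure before the last step leaves the alert off, so the worst outcome is missing coverage that the oracle still reports as not done, rather than a false page.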
The bug both models missed
I promised the second model earned its keep. It did, but the sharpest catch of the run wasn't the second model. It was live verification, catching something both models had signed off on.
A log-routing filter had two conditions, each with two clauses, written in the natural way: roughly A AND B OR C AND D. The intent was obvious to everyone who read it, (A AND B) OR (C AND D), two independent cases. The author read it that way. The second-model pre-merge review read it that way. Both of them even specifically verified the surrounding scope, that a shared condition applied across the whole group, and both were right about that.
The platform disagreed. It parsed A AND B OR C AND D as A AND (B OR C) AND D, a single conjunction, which matched nothing. The filter was live, green, merged, and routing exactly zero events. No error anywhere. Every check passed. The pipeline was a pipe with the valve welded shut, and every signal you'd normally trust said it was fine.
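The two readings are easy to compare by hand. The sketch below uses shell arithmetic with explicit parentheses for both groupings; the function names are made up, and each argument is 1 if that clause matches an event and 0 if it does not.

```shell
# Evaluate both readings of "A AND B OR C AND D" for a single event.
# Arguments: A B C D, each 1 (clause matches) or 0 (it does not).
intended() { echo $(( ($1 && $2) || ($3 && $4) )); }   # (A AND B) OR (C AND D)
platform() { echo $(( $1 && ($2 || $3) && $4 )); }     # A AND (B OR C) AND D

intended 1 1 0 0   # event from the first case  -> prints 1
platform 1 1 0 0   # same event, platform reading -> prints 0
intended 0 0 1 1   # event from the second case -> prints 1
platform 0 0 1 1   # same event, platform reading -> prints 0
```

Under the platform's grouping an event must satisfy both A and D. If the two cases are disjoint, which is an assumption here but fits the description of two independent cases, no event ever does, and the filter matches nothing.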
It was caught by one thing only: synthetic replay after merge. A verification step wrote real events that should have matched, then looked to see if any came out the other end. Zero did. That's when the parenthesization revealed itself. The fix was trivial; catching it was not, because nothing was broken in any way a human or a review or a passing test would notice. The line I keep is ugly and I'll keep repeating it: absence of error is not verification. A merge that emits no error is a claim, not a result. You verify the live thing, the running revision is the image you pushed, the resource you declared actually answers when you describe it, the events you sent actually arrive, or it isn't done. This was one of four real defects the two-model pass and live verification caught between them that would otherwise have shipped: a monitoring client that silently no-op'd every metric push while the unit tests happily mocked past it; a module that violated the layering rules; a change that would have written unbounded raw text into a metric that must stay low-cardinality, replaced with a bounded aggregate. None would have thrown. All would have shipped. Review catches the shape of the code; live verification catches the shape of the world. Skip either and you ship blind.
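A replay check of the kind described above reduces to: send events with known ids, then confirm each id came out the other end. The sketch below assumes, purely for illustration, that the sent ids and the ids observed at the sink have each been written to a file, one id per line; `replay_check` is a hypothetical name.

```shell
# replay_check SENT_FILE SINK_FILE
# Return 0 only if every id in SENT_FILE appears as a full line in SINK_FILE.
# Hypothetical setup: one synthetic event id per line in each file.
replay_check() {
    sent_file="$1"; sink_file="$2"
    [ -s "$sent_file" ] || return 1      # sending nothing proves nothing
    missing=0
    while IFS= read -r id; do
        # -x: whole-line match, -F: literal string, -q: status only
        grep -qxF -- "$id" "$sink_file" 2>/dev/null || missing=$((missing + 1))
    done < "$sent_file"
    [ "$missing" -eq 0 ]
}
```

Note what this does not look at: logs, exit codes, or review sign-offs. It passes only when the world shows the effect, which is why it caught a filter that every other signal called healthy.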
The loop that repairs itself
The predecessor post ended with a confession: the self-heal was brand new and had never faced a real stop. This run is where it started to.
Every time the loop hit a wall it shouldn't have (a portability quirk, a worktree cutting from the wrong base, a stall it should have walked through), the fix was the same shape: diagnose why the harness couldn't get past it, and patch the harness at the root, so the next run is born immune. That's a loop a machine can run. So now, when the loop is about to stop, it launches a self-heal pass: a fresh agent whose only job is to decide whether the stop is a harness gap (something the loop should handle but doesn't) or a genuine external wall, and if it's a gap, fix the root cause in the loop's own scripts and prompts, smoke-check it, and re-exec. The stall becomes a permanent upgrade. Over this run and the ones around it, the harness got measurably harder to stall, by rewriting itself.
This is also the single most dangerous idea in the system, and it's worth being exact about why it isn't reckless. A loop that can rewrite itself to "make progress" has an obvious cheat available: rewrite the part that decides what progress means. So the self-heal is structurally forbidden from touching the verifier. It can make the oracle more resilient, batch its queries, add retries, but it cannot change the oracle's definition of done. And that's enforced mechanically, not by good intentions: after any self-heal, a tripwire re-checks that the oracle still reads the batch, still queries the tracker, still gates on real Done state. If the heal weakened the verifier, the heal is rejected and the real stop surfaces for a human. The agent may rewrite its own machine. It may not move the finish line.
The harness may improve everything about how it works, except the thing that decides whether it worked. Resilience is editable. The definition of done is frozen, owned by a human, and guarded by a tripwire the agent cannot argue its way past. A system that can rewrite its own verifier has not become more autonomous. It has become a liar with commit access.
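One way to make such a tripwire mechanical is a behavioral check: after a heal, run the oracle against fixtures whose correct answers are known. This is a swapped-in technique for illustration, not a description of the actual tripwire, which is described above as re-checking that the oracle still reads the batch and queries the tracker. The fixture format (`<ticket-id> <status>`) and the name `tripwire` are assumptions.

```shell
# tripwire ORACLE_CMD
# Accept a healed oracle only if it still rejects an unfinished batch
# and still accepts a finished one. ORACLE_CMD takes a fixture file path.
# Fixture format is hypothetical: "<ticket-id> <status>" per line.
tripwire() {
    oracle_cmd="$1"
    fixtures="$(mktemp -d)" || return 1
    printf 'T-1 Done\nT-2 InProgress\n' > "$fixtures/unfinished"
    printf 'T-1 Done\nT-2 Done\n'       > "$fixtures/finished"
    verdict=0
    # A weakened oracle says "done" while a ticket is still open.
    if "$oracle_cmd" "$fixtures/unfinished"; then verdict=1; fi
    # A broken oracle can no longer say "done" at all.
    "$oracle_cmd" "$fixtures/finished" || verdict=1
    rm -rf "$fixtures"
    return "$verdict"
}
```

If `tripwire` returns non-zero, the heal would be rejected and the original stop surfaced to a human. The fixtures and the tripwire itself have to live somewhere the self-heal agent cannot write, or the check is only another suggestion.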
How I actually run it
A few deliberate choices, because they matter more than any prompt.
One deep worker per ticket, not a swarm. Some runtimes offer an auto-orchestrate mode that turns a single agent into a swarm of sub-agents and converges them. It's genuinely powerful, and it's exactly wrong here, because the loop already is the orchestration. I have parallel workers across tickets and a second model on every merge; nesting a swarm inside each worker would buy orchestration twice, double the live processes that can collide, and add surface to babysit. So each worker runs the opposite dial: the strongest model available at its maximum reasoning effort, one model reasoning as hard as it can on its single ticket, and nothing else. Depth is the worker's job. Breadth is the harness's. That split isn't a tuning preference; it's the same rule as everything else, the harness orchestrates, the worker just has to be sharp, boxed, and replaceable.
Full autonomy, headless, no approval prompts. The workers run headless with every guardrail-that-asks turned off, no "are you sure?", no approval gates, because a loop that stops to ask is not unattended. That is only safe because the other controls are real: the oracle it can't edit, the sandbox it can't escape, the attempt caps, the verify-live gate. Autonomy without those isn't autonomy; it's an unsupervised process with your credentials.
And I run it from the terminal, not the app. I moved back to driving the whole thing from the command line rather than the desktop GUI. Maybe it's paranoia. But for a six-hour unattended run, I want the thinnest, most predictable layer between the loop and the machine, and every time I've trusted a graphical wrapper for long autonomous work I've had a harder time trusting what it did while I wasn't looking. The terminal is boring. For this, boring is the feature.
Hot takes
1. A green merge is a claim about code, not a fact about behavior. If your definition of "shipped" stops at "merged," half your shipped features are dark and you don't know which half.
2. The honest metric for an autonomy system isn't tickets shipped. It's interventions per run, and what kind. If they're you doing the work, you have a demo. If they're you improving the machine, you have a system. If someone shows you zero on run one, ask to see the verifier logs.
3. "Blocked" from an agent is usually the model trying to end its turn politely. A real blocker has evidence, exhausted retries, no alternate path, and an external wall a human has to walk through. Everything else is a fake blocker wearing a status label.
4. Two-model review is cheap insurance, not because the second model is smarter, but because it's differently wrong. And even two models agreeing is not verification. This run's worst bug had two sign-offs and shipped anyway; only live replay caught it.
5. Letting the loop rewrite its own harness is worth doing. Letting it rewrite its own verifier is how you build a machine that lies to you faster every iteration. The entire safety of self-improvement lives in that one frozen line.
6. If your agent can reach the file that defines "done," you don't have a verifier. You have a suggestion with syntax highlighting.
Did it work?
Yes, further than last time, and I'll keep it honest. A batch went out unattended. Work got implemented, reviewed by two models, merged, deployed to two environments, and verified live before close, and when the loop found its own work sitting merged-but-dark, it filed the ticket and shipped it instead of leaving me a note. The self-heal that was pure theory last time took its first real swings and hardened the machine. No user-facing incidents surfaced from the run.
The asterisks are real, and they're the interesting part. I was reachable. There were still human touches, one to unblock a shared dependency before the run, and the environment threw a late authentication wall that kept the loop from eyeballing one final live number directly (it leaned on a stronger proof instead: an infrastructure apply that validates the metric it depends on can't succeed unless that metric exists, so a green apply is the check). "Mostly unattended" is still not "unattended." But the bottleneck moved exactly where I wanted it to move, up, off the work, onto the machine, and then the machine started reaching for it too.
What's next
More runs, harder conditions, the interventions-per-run number driven toward zero while the verifier stays un-gameable. Two specific threads. One: a fast new frontier model just became available, and the loop is the perfect place to trial a cheaper, quicker engine, as a first-pass worker on mechanical tickets, or as a third reviewer, precisely because the model is the swappable part. Two: this whole harness wants to stop being a private one-off. The interesting artifact isn't my batch; it's the machine that ran it. Packaging the loop, the oracle, the scheduler, the self-heal, and the frozen-verifier rule as a shareable, open skill anyone can point at their own tracker is the version of this that outlives the demo.
The last post asked whether I could trust the loop. This run answered a narrower, better question: the loop caught work I would have missed, filed it, shipped it, and fixed its own machine on the way, without ever being able to move the finish line. Autonomy was never the model deciding it's done. It's the system making that claim expensive to make falsely, and this run, for the first time, the loop started raising that price on its own. That's not trust yet. But it's the first time the thing I built did the part I thought was still mine.
Proof of Loop: the first real batch, and the honest count of every time it needed me.
My Agent Filed Its Own Ticket: an earlier run where the agent first turned discovered work into a tracked ticket.
You Can't Authorize Autonomy: why you engineer autonomy instead of granting it.
Two Models, One Branch: why a second, differently-wrong model catches what one model can't.