Proof of Loop

This whole series I've been designing a harness that would let an agent finish work unattended. I finally pointed it at a real batch and walked most of the way out of the room. Eleven tickets, two repos, six hours, zero production incidents — and an honest count of every time it still needed me.

Every piece I've written about autonomous agents has circled one claim from a different side: the model is a worker, and the worker is not the system. The system is the harness around it — the thing that holds the job, the memory, the clock, and the stop condition. You can't authorize autonomy by flipping permissions to "yes" and typing "keep going." You have to engineer it. So I wrote about the parts in isolation: the finish line, the loop, the cloud schedule, the amnesia, the faceless headless engine.

This is the post where I stop describing the parts and run the whole machine. I handed it a real batch — the kind of plumbing work that's individually boring and collectively a slog — and let the loop drive.

It didn't become magic. It became something more useful: a system boring enough to measure, and strict enough to catch the agent when it claimed work it hadn't actually finished. I didn't prove that coding agents can be trusted. I proved that a harness can make distrust operational — the agent writes code; the loop owns state, scheduling, retries, review, deployment, verification, and exit. That distinction is the whole post.

It shipped nine of the eleven end to end: implemented, tested, reviewed by two different models, merged, deployed to two environments, verified live, closed. The other two it built completely and parked exactly one human step from done — an external account action only I can take. And every time it broke, I have a number for it.

📝

TL;DR

The harness, not the model, is the product. The model is a commodity engine; everything that makes a run finish lives in the loop around it.

One real run: 11 tickets driven (3 auto-filed by the loop mid-run), 9 shipped + verified on dev and prod, 2 built-and-human-gated, ~15 PRs, ~6 hours, 0 production incidents.

The only exit authority is a done-oracle the agent can't edit. The agent's self-report is worthless — a loop with no external check just fails with confidence, forever.

Fresh context per iteration is what lets it run for hours without rotting. Memory lives in files + the tracker + git, never the window.

When it's about to stop, it heals the harness and restarts — it rewrites its own machine at the root cause, but it is structurally forbidden from weakening the verifier.

It still needed me ~7 times — almost all to improve the harness, not to do the work. That's the honest part, and it's the tell.

The harness

The shape is an outer loop wrapping a fresh, faceless agent, with a verifier holding the only key to the exit. Not the in-session loop, not "keep chatting until it gets there" — a dumb bash loop that launches a brand-new headless agent each pass:

while ! ./done_oracle.sh "$BATCH"; do
  ticket="$(./ready_set_next.sh "$BATCH")"
  ./run_one_ticket.sh "$ticket"   # a fresh `claude -p`, empty context
done

Amnesia on purpose

Each iteration the worker is a fresh process with an empty context window. No carryover. That reset is not a compromise — it's the cure. Long agent sessions rot: past an hour or two, or a context compaction, they accumulate obsolete assumptions, half-remembered plans, stale tool output, and a growing transcript tax. The outer loop refuses to worship continuity. Every pass starts clean, and memory lives where memory belongs — on disk, in a prompt file, in state files, in the tracker, in git. The repository remembers. The ticket remembers. The branch remembers. The context window does not.

So each worker wakes up in a small box: one ticket, one branch, one finish line, one set of allowed tools, one attempt budget. It can thrash inside that box, but it can't drag six hours of old hallucinated confidence into the next pass.
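A minimal sketch of what "memory lives on disk" could look like when a worker is briefed. The directory layout (`STATE_DIR/TICKET/finish_line.txt` and `notes.md`), the branch naming, and the function name are all hypothetical; the post only says that state lives in files, the tracker, and git.

```shell
# build_worker_prompt TICKET STATE_DIR
# Assembles the whole brief from disk so nothing depends on a prior session.
# Assumed (hypothetical) layout: STATE_DIR/TICKET/{finish_line.txt,notes.md}.
build_worker_prompt() {
  local ticket="$1" dir="$2/$1"
  # No machine-checkable finish line on disk means no worker gets launched.
  [ -r "$dir/finish_line.txt" ] || return 1
  printf 'Ticket: %s\n' "$ticket"
  printf 'Branch: ticket/%s\n' "$ticket"
  printf 'Finish line:\n'
  cat "$dir/finish_line.txt"
  # Notes from earlier passes are optional; a first attempt has none.
  if [ -s "$dir/notes.md" ]; then
    printf 'Notes from earlier passes:\n'
    cat "$dir/notes.md"
  fi
}

# A fresh worker would then be started roughly as:
#   claude -p "$(build_worker_prompt "$ticket" "$STATE_DIR")"
```

Everything the next pass knows has to survive a trip through the filesystem, which is the point of the reset.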

The oracle is the exit

Here's the load-bearing piece. The done-oracle is a tiny script with one job: query the tracker and exit 0 if and only if every ticket in the batch is either Done or explicitly recorded as human-gated. The loop can't declare victory. The worker can't. The orchestrator can't. Only the oracle can, and the oracle only reads external truth.
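The core of that contract fits in a few lines. This is a sketch under assumptions, not the actual script: it assumes the tracker snapshot arrives on stdin as `TICKET<TAB>STATE` lines, that the batch is a file with one ticket ID per line, and that the two accepted states are spelled `Done` and `human-gated`. The real tracker query is not shown in the post.

```shell
# batch_is_done BATCH_FILE   (tracker snapshot on stdin, "TICKET<TAB>STATE")
# Exits 0 only if every ticket in BATCH_FILE is Done or human-gated.
batch_is_done() {
  local batch_file="$1" snapshot ticket state seen=0
  [ -r "$batch_file" ] || return 2          # fail closed: no batch, no exit
  snapshot="$(cat)"
  while IFS= read -r ticket || [ -n "$ticket" ]; do
    [ -n "$ticket" ] || continue
    seen=$((seen + 1))
    # Look up this ticket's state in the snapshot.
    state="$(printf '%s\n' "$snapshot" |
      awk -F'\t' -v t="$ticket" '$1 == t { print $2; exit }')"
    case "$state" in
      Done|human-gated) ;;                  # the only two accepted end states
      *) return 1 ;;                        # includes "missing from snapshot"
    esac
  done < "$batch_file"
  [ "$seen" -gt 0 ]                         # an empty batch is not "done"
}
```

Note what is absent: nothing here reads the worker's output, and every ambiguous case (unreadable batch, ticket missing from the snapshot, empty batch) resolves to "not done".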

Why this is everything: an agent's report on its own work is worthless. Not because the model lies maliciously, but because "tell me you're done" is a measure the worker can satisfy directly, and once a measure becomes a target it stops being a good measure. Goodhart eats it for breakfast. Without the oracle, an unattended loop is a slot machine that occasionally ships software. With it, the worker is boxed in by a scale bolted to the floor. It can posture, complain, write a polished status report. The gate still weighs the body.

One ticket, one finish line

Each fresh agent gets exactly one ticket and one finish line, phrased so a machine can check it — not "make progress," but a concrete end state: merged, CI green, the dev revision running the new image, the prod revision running it, the live behavior verified, the tracker in its correct final state. The per-ticket evaluator can be imperfect because it isn't the final authority — it keeps the current worker moving; the oracle closes the batch. Local finish line, global exit gate.
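One way to phrase a finish line so a machine can check it is as a list of named checks that must all pass. The sketch below assumes each check is a shell function or command that exits 0 when its condition holds. The check names in the usage note are hypothetical placeholders, not the harness's real ones.

```shell
# finish_line_met CHECK...
# Each argument names a function or command that exits 0 when its condition
# holds. The finish line is met only if every check passes.
finish_line_met() {
  local check failed=0
  [ "$#" -gt 0 ] || return 2      # no checks means no finish line, not a pass
  for check in "$@"; do
    if "$check"; then
      printf 'ok    %s\n' "$check"
    else
      printf 'FAIL  %s\n' "$check"
      failed=1
    fi
  done
  [ "$failed" -eq 0 ]
}
```

A call might look like `finish_line_met pr_merged ci_green dev_runs_new_image prod_runs_new_image tracker_state_final`, with every check run even after the first failure so the worker sees the full gap in one pass.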

Two models, or it didn't happen

Every non-trivial design call and every pre-merge review goes through a second, different model — here, one from a different vendor. Not because two models are magically objective; they aren't. Because their blind spots aren't identical. Single-model review has a creepy failure mode: the same style that produced the plan reviews the plan, recognizes itself, and nods. The second model is differently wrong — it dislikes different shortcuts and catches a different class of "sounds plausible, will hurt later." On this run it caught two high-severity defects the first model had waved through — one a security hole in an auth path, one a query that would have failed at runtime. One model can propose, build, even defend. Pre-merge, a second model gets a blade.
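As a sketch of how that rule could be enforced mechanically rather than by convention: assume the second reviewer writes a review file whose first line is `reviewer<TAB>MODEL`, followed by zero or more `SEVERITY<TAB>summary` findings. That file format and the function name are assumptions for illustration only.

```shell
# review_allows_merge AUTHOR_MODEL REVIEW_FILE
# REVIEW_FILE (assumed format): line 1 is "reviewer<TAB>MODEL", then zero or
# more "SEVERITY<TAB>summary" findings lines.
review_allows_merge() {
  local author="$1" review="$2" reviewer
  [ -s "$review" ] || return 2                  # no review, no merge
  reviewer="$(awk -F'\t' 'NR == 1 && $1 == "reviewer" { print $2 }' "$review")"
  [ -n "$reviewer" ] || return 2                # malformed review, no merge
  [ "$reviewer" != "$author" ] || return 3      # self-review does not count
  # Any HIGH finding blocks the merge; awk exits 0 when one is found.
  ! awk -F'\t' 'NR > 1 && $1 == "HIGH" { found = 1 } END { exit !found }' "$review"
}
```

The distinct return codes let the loop tell "review missing" apart from "review failed", which matters when deciding whether to rerun the reviewer or send the ticket back to the worker.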

Parallel, but not stupid

The loop isn't serial. A scheduler computes the ready set — every ticket whose dependencies are already done — and runs several at once, each fresh agent in its own git worktree so parallel branches never collide. Dependent tickets wait automatically. The one thing that can't parallelize is infrastructure applies, because they touch shared global state, so those serialize behind a lock and the cloud's own concurrency queues. This was where the harness started to feel real — not because parallel agents are glamorous, but because dependency-aware scheduling plus worktrees made concurrency boring. Boring is the goal.
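The ready-set computation itself is small. A sketch, assuming a dependency file with lines of the form `TICKET [DEP ...]` and a file listing finished tickets one per line; both formats are hypothetical, since the post does not show how the scheduler stores its graph.

```shell
# ready_set DEPS_FILE DONE_FILE
# DEPS_FILE lines: "TICKET [DEP ...]"; DONE_FILE: one finished ticket per line.
# Prints every ticket that is not done and whose dependencies are all done.
ready_set() {
  [ -r "$1" ] && [ -r "$2" ] || return 2
  awk '
    FILENAME == ARGV[1] { finished[$1] = 1; next }   # first file: done tickets
    NF == 0 || ($1 in finished) { next }             # skip blanks and done
    {
      for (i = 2; i <= NF; i++)
        if (!($i in finished)) next                  # a dependency is pending
      print $1
    }
  ' "$2" "$1"
}

# Each ready ticket could then get its own checkout, for example:
#   git worktree add "../wt-$ticket" -b "ticket/$ticket"
```

Comparing `FILENAME` to `ARGV[1]` instead of using the common `NR == FNR` idiom keeps the result correct when the done file is still empty at the start of a run.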

Depth in the worker, breadth in the loop

Orchestration lives in the loop, so the worker doesn't get any of its own. That sounds trivial until you meet the tempting alternative. The runtime ships an auto-orchestrate mode that turns one agent into a swarm — it fans out subagents by default and converges them on the task. It's a genuinely strong setting. It's also exactly wrong here, because the agent isn't the system; the loop is. Switching each worker into swarm mode would nest a swarm inside a swarm: orchestration I already own, bought a second time, with more live processes to collide and more surface to babysit.

So the worker runs the opposite dial — the deepest single-pass reasoning the model offers, on the strongest model available, and nothing else. One brain, thinking as hard as it can, inside its one-ticket box. The loop already supplies the parts I need here: parallelism across tickets, a second model on every merge, retries, verification. Depth is the worker's job. Breadth is the harness's. That split isn't a tuning preference — it's the same rule as everything else here. The harness orchestrates; the worker just has to be sharp, boxed, and replaceable.

Transients get one mercy

The loop knows the difference between a failure and a hiccup. A freshly-created storage bucket that 404s the instant you set its permissions isn't broken — it's eventual consistency, and one retry fixes it (it did, once per environment). A state-lock contention between two applies isn't broken — it's a queue. A 5xx mid-deploy isn't automatically a broken deploy. The policy: retry the named-transient class once, re-check the source of truth, and only escalate to "real failure" if the retry also fails. The retry is narrow on purpose — "try the whole thing forever" isn't resilience, it's just denial.
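That policy can be sketched as a wrapper that takes two injected judgments: a classifier for the named-transient class and a re-check against the source of truth. Both are passed in as function names here, which is an assumption about structure; the post describes the policy, not its implementation.

```shell
# run_with_one_mercy IS_TRANSIENT RECHECK CMD [ARG...]
#   IS_TRANSIENT  function given the captured output; exits 0 if the failure
#                 belongs to the named-transient class
#   RECHECK       function that re-reads the source of truth; exits 0 if the
#                 desired state is really there
run_with_one_mercy() {
  local is_transient="$1" recheck="$2" out
  shift 2
  if out="$("$@" 2>&1)"; then
    return 0
  fi
  # Not a named transient: a real failure, so no retry at all.
  "$is_transient" "$out" || { printf '%s\n' "$out" >&2; return 1; }
  # Exactly one retry. If that fails too, escalate.
  if ! "$@" >/dev/null 2>&1; then
    return 1
  fi
  # Even a clean retry is confirmed against the source of truth.
  "$recheck"
}
```

The retry count is hard-coded to one on purpose; making it a parameter is how "one mercy" quietly becomes "try forever".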

The blocker test

This is the discipline that makes "never stop" safe instead of reckless. The agent may say "blocked" only if it passes the prime test: external to the code, un-fixable by retry, backed by captured failure evidence, with no alternate path inside its authority. Everything else is a fake blocker. Red CI? Fix it. Merge conflict? Resolve it. Failed deploy? Debug it. An infra plan that surprised you? Understand it. It's easy for a harness to go soft here and treat "blocked" as a polite status. I treat it as a legal claim: show me the wall, show me the access-denied, show me the external account screen only a human can click. If there's no wall, keep walking.
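Treating "blocked" as a claim that needs proof can be made literal. In this sketch the worker must write a claim file of `key=value` lines, and the loop accepts the claim only if all four conditions are asserted and the evidence is a real, non-empty captured file. The file format and key names are hypothetical.

```shell
# blocker_claim_is_valid CLAIM_FILE
# CLAIM_FILE holds key=value lines; all four conditions must hold.
blocker_claim_is_valid() {
  local claim="$1" evidence
  [ -r "$claim" ] || return 1
  grep -qx 'external=yes' "$claim"        || return 1
  grep -qx 'retry_exhausted=yes' "$claim" || return 1
  grep -qx 'alternate_path=none' "$claim" || return 1
  # Evidence must point at a captured, non-empty file, not a sentence.
  evidence="$(sed -n 's/^evidence=//p' "$claim" | head -n 1)"
  [ -n "$evidence" ] && [ -s "$evidence" ]
}
```

This only checks that the claim is complete and backed by an artifact. Whether the evidence really shows a wall still needs a reviewer, human or model.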

It files its own work

When the agent discovers work that isn't a ticket — a missing prerequisite, or the nastier "it deployed but doesn't actually do the thing yet" gap — it doesn't stop to ask. It files its own ticket, appends it to the batch so the oracle now tracks it too, and ships it. Three of this run's eleven tickets were born this way, mid-flight. Discovery doesn't break autonomy; it feeds the queue.

Verify the live thing, or it isn't done

A merge is not done when you click merge. A green pipeline is not done. Done is when the running revision is the exact image you pushed, or the resource you declared actually exists when you describe it. This caught a subtle class of failure that normal "ship it" rituals miss: external state drifting underneath the loop after a clean merge. The motto is ugly and useful: absence of error is not verification.
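For the image case, the check reduces to comparing two digests and refusing to pass on silence. The sketch below assumes some platform-specific command can print the digest of the image the running revision serves; that command is passed in as arguments because the post does not name the platform.

```shell
# live_matches_pushed PUSHED_DIGEST DESCRIBE_CMD [ARG...]
# DESCRIBE_CMD prints the digest of the image the running revision serves.
# It is hypothetical here and depends on the deployment platform.
live_matches_pushed() {
  local pushed="$1" live
  shift
  live="$("$@")" || return 1          # a failed describe is not a pass
  [ -n "$live" ] && [ "$live" = "$pushed" ]
}
```

An empty answer and a failed describe both count as "not verified", which is the mechanical form of "absence of error is not verification".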

When a human genuinely is the only path

Two tickets were fully built and couldn't finish without me — not because the agent gave up, but because the last step was an external account action only I can take. The loop did everything codifiable, recorded those two as human-gated (which the oracle counts as complete-for-the-batch), and surfaced the exact remaining step for each. It never faked done, and it never silently stalled. That distinction — built-and-waiting vs quit — is the whole difference between a system you trust and one you babysit.

The run

The ledger from the run — two repositories, one application and one infrastructure:

Signal	Result	What it means
Batch size	11 tickets across 2 repos	Big enough to expose orchestration failures, not just prompt failures.
Fully shipped	9 tickets verified live on dev + prod	Merged, deployed, checked against live state — not just local tests.
Human-gated	2 tickets built up to the external boundary	It didn't fake done — it surfaced the exact remaining external action.
Pull requests	~15 opened, reviewed, merged, deployed	Some tickets needed several PRs.
Wall clock	~6 mostly-unattended hours	Not instant, not overnight mythology. Fast enough to matter.
Fix-and-proceed events	8 handled without stopping	Transient recoveries, a state lock, a CI-permission chase, HIGH review findings, bot threads.
Production incidents	0	A main-branch regression was caught on the dev canary before prod.
Operator interventions	~7	The honest tax. Most corrected the harness, not the work. Three became tickets the loop shipped.

One batch is evidence, not gospel. Plenty went wrong — eventual consistency, state contention, a CI service-account permission chase that took four root-cause iterations and briefly broke main twice. The point isn't that nothing went wrong. The point is that the blast radius stayed bounded — the dev canary caught the main-branch breakage before it reached prod — and most of it became work, not interruption.

Where it still needed me

Seven interventions is not hands-off, and I'm not going to perfume that. But the shape matters: almost none were me doing the ticket work. They were me improving the harness, mid-run, where the run exposed it was still soft. That's the tell — the bottleneck moved upward. I wasn't babysitting one agent through one task; I was hardening the system that babysits the agents through the batch. Two failure modes came out clearly, both now fixed:

1. Fragile verification. The loop queried the tracker once per ticket, so one flaky API call could make the oracle look sick. Fixed: one batched query, retry with backoff — the oracle decides from a coherent snapshot, and a transient can't end the run.

2. State drift underneath it. A tracker automation moved a ticket's state after merge: the code path was right and the recorded state was wrong. Fixed: re-assert state after every merge, and re-verify live state as the final action before exit.
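The first fix, one batched query retried with backoff, might look like the sketch below. The query command is injected as arguments, and the delay is in whole seconds and doubles on each failure; the names and the doubling schedule are assumptions, not the harness's actual values.

```shell
# snapshot_with_backoff MAX_TRIES BASE_DELAY CMD [ARG...]
# Runs CMD (one batched tracker query) until it succeeds, doubling the delay
# between tries. Prints the snapshot on success, exits 1 if every try failed.
snapshot_with_backoff() {
  local max="$1" delay="$2" try=1 out
  shift 2
  while :; do
    if out="$("$@")"; then
      printf '%s\n' "$out"
      return 0
    fi
    [ "$try" -lt "$max" ] || return 1
    sleep "$delay"
    delay=$((delay * 2))
    try=$((try + 1))
  done
}
```

Because the output is printed only after the query succeeds as a whole, the oracle never decides from a half-fetched snapshot.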

Those aren't model problems. They're distributed-systems problems wearing an AI costume — which is exactly why the next move was to stop fixing them by hand.

The loop that heals itself

Here's the part I added after the run, and the part I'm most curious about. Every one of those seven interventions had the same shape: the loop hit a wall, I diagnosed why the harness couldn't get past it, and I patched the harness. That's a loop a machine can run.

So now, when the loop is about to stop (a dependency stall, or a ticket that has failed its whole attempt budget), it doesn't exit. It launches a self-heal pass: a fresh agent whose only job is to diagnose the stop and decide whether it's a harness gap (something the loop should handle but doesn't) or a genuine external wall. If it's a gap, the agent fixes the root cause in the harness itself, meaning the loop's own scripts and prompts, patches it both live and permanently, smoke-checks it, and re-execs. The stop becomes a fix, and the next run is born with it. The intended effect: each time it stalls, it ends up a little harder to stall. Honesty check: this part is brand-new. It was built after this run, so it hasn't faced a real stop yet. Whether it heals more than it breaks is the single biggest thing the next week tests.

This is also the most dangerous idea in the whole system, and it's worth being precise about what keeps it from being reckless. A loop that can rewrite itself to "make progress" has an obvious cheat: rewrite the part that decides what progress means. So the self-heal is structurally forbidden from touching the verifier. It can improve the oracle's resilience (retries, batching), but it cannot change the oracle's definition of done. And the loop enforces that mechanically: after a self-heal, a tripwire re-checks that the oracle still reads the batch, still queries the tracker, still gates on real Done state. If the heal weakened the verifier, the heal is rejected and the real stop surfaces. The agent can rewrite its own machine. It cannot move the finish line.
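One simple way to make such a tripwire mechanical, and not necessarily how this harness does it, is to keep the done-definition in its own file, separate from the editable retry and batching code, and compare it byte for byte against a human-owned pinned copy after every heal. The split into two files, the pinned copy, and the function name are all assumptions for this sketch.

```shell
# heal_passes_tripwire PINNED_COPY LIVE_FILE
# PINNED_COPY is a human-owned copy of the done-definition, stored outside
# the paths the self-heal agent may write. LIVE_FILE is the definition the
# oracle will actually use after the heal.
heal_passes_tripwire() {
  local pinned="$1" live="$2"
  if [ -r "$pinned" ] && [ -r "$live" ] && cmp -s "$pinned" "$live"; then
    echo "tripwire ok: done-definition unchanged"
    return 0
  fi
  # Missing, unreadable, or modified all count as tampering.
  echo "tripwire tripped: done-definition changed, heal rejected" >&2
  return 1
}
```

A byte comparison is blunt: it rejects harmless edits to the definition too. That bluntness is the intended trade, since changing the definition is meant to stay a human decision.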

🔒

The rule that is meant to make self-modification safe

The harness may improve everything about how it works — except the thing that decides whether it worked. Resilience is editable. The definition of done is frozen, owned by a human, and guarded by a tripwire the agent can't talk its way past. A system that can rewrite its own verifier hasn't become more autonomous; it's become a liar with commit access.

The boring controls

The more autonomous the harness gets, the less I care about motivational prompts and the more I care about controls that fail closed:

Fresh-context loops survive the laptop closing — which means they also survive my attention disappearing. That's power and risk in the same glass.

A stray API key silently flips billing from subscription to metered spend. The loop aborts on it unless you explicitly override.

The agent will route around its own denylist if the filesystem gives it a side path. The sandbox has to be real, not a stern paragraph.

A per-ticket attempt cap is mandatory so a stuck ticket can't spin forever because the loop is proud. So is a self-heal budget, so the loop can't rewrite itself forever either.
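Two of these controls are small enough to sketch. The first is a preflight that refuses to start when an API key is present unless the operator has opted in. It assumes the key lives in `ANTHROPIC_API_KEY` and uses a hypothetical `ALLOW_METERED=1` override; neither name comes from this post. The second is a file-backed counter that works for both the per-ticket attempt cap and the self-heal budget.

```shell
# preflight_billing
# Refuses to run if an API key is in the environment, unless the operator
# explicitly opted in. Variable names are assumptions (see lead-in).
preflight_billing() {
  if [ -n "${ANTHROPIC_API_KEY:-}" ] && [ "${ALLOW_METERED:-0}" != "1" ]; then
    echo "refusing to run: API key set, billing would be metered" >&2
    return 1
  fi
}

# attempt_allowed COUNTER_FILE CAP
# Increments the counter stored in COUNTER_FILE and fails once it exceeds CAP.
# The count lives on disk so it survives the fresh process of each pass.
attempt_allowed() {
  local file="$1" cap="$2" n=0
  if [ -s "$file" ]; then
    n="$(cat "$file")"
  fi
  n=$((n + 1))
  printf '%s\n' "$n" > "$file"
  [ "$n" -le "$cap" ]
}
```

Both fail closed: the preflight needs an explicit opt-in to proceed, and the counter increments before it checks, so a crash mid-attempt still costs an attempt.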

Hot takes

1. If the agent can edit the thing that decides it's done, you don't have a verifier — you have a suggestion with syntax highlighting.

2. "Autonomous" without a sandbox is just "unsupervised with extra steps." The capability that ships your code is the capability that worries your security team. Same capability.

3. Most "blocked" statuses from agents are fake — the model trying to end the turn politely. A real blocker has evidence, retry exhaustion, no alternate path, and an external wall.

4. Two-model review is cheap insurance — not because the second model is wiser, but because it's differently wrong.

5. Letting the loop fix its own harness is worth trying. Letting it fix its own verifier is how you build a machine that lies to you faster every iteration.

6. The honest metric for an autonomy system isn't tickets shipped — it's interventions per run. Mine was ~7 on the first real run. If someone shows you zero on run one, ask to see the verifier logs.

Did it work?

Provisionally, yes. One real batch went end to end — nine of eleven tickets implemented, reviewed, merged, deployed, and verified live; the other two correctly stopped at a step only a human can take; zero production incidents. The loop did the shipping, not me. As proof that the harness works in principle, that's the answer I was after: the autonomy didn't come from trusting the model, it came from a loop built to distrust it.

But one batch is an anecdote, not a track record, and I won't dress it up as more. I was still in the room. The seven interventions are real, and they have to come down on their own before I'd call this hands-off. The self-heal that's supposed to drive them down is brand-new and hasn't faced a real stop yet. "Mostly unattended" is not "unattended." So the honest verdict is: it works, with an asterisk — enough to keep running, not enough to walk away from.

The next week

I'm running it for another week before I call it — more batches, harder conditions: deeper dependency chains, stricter attempt and cost caps that fail closed before the bill gets stupid, sandboxes that assume the agent will route around instructions, chaos tests for tracker drift and flaky cloud reads, and the self-heal facing real stops. The one number I'm watching is interventions per run. If it falls toward zero while the verifier stays un-gameable, this is a tool I can hand work to and leave. If it doesn't, it's a toy that needs a babysitter — and I'll say so.

Either way, I'll report the verdict then. Autonomy isn't granted to the model; it emerges from a loop built to survive the model being slippery, lazy, clever, expensive, and occasionally right — and now, one designed to improve itself without being able to cheat. I don't trust the agent. I'm testing whether I can trust the loop.

📖

Related reading

The Loop Files Its Own Work — how a loop caught its own merged-but-dark work, filed it, and shipped it.

You Can't Authorize Autonomy — the design: externalize the control loop.

Stop Babysitting the Babysitter — the trio (goal / loop / schedule) and why the verifier is the whole game.

Amnesia as a Feature — fresh context per pass, and why forgetting wins.

The Agent Without a Face — driving the headless engine from your own code.

