Two Models, One Branch
Two models on one branch. Claude writes the code and ships the PR; Codex reviews with fresh eyes. No agent framework, no MCP plumbing — just one CLI call from inside the editor. The simplest agent orchestration that actually works.
There is a dirty secret in code review: the author of a change is the worst person to review it. They wrote it. They believe in it. The bugs they missed once, they will miss again, because the blind spots do not move. The reason humans do peer review is not efficiency; it is that a stranger sees what you cannot.
The same problem applies to AI coding agents. Claude writes a feature, ships a PR, then reviews its own diff, and it sees that diff through the same eyes that wrote it. The misses are systematic. Self-review is not review.
The fix is unsexy and one CLI call long: when Claude finishes a change, hand the diff to a different model running in a different process. Then read what comes back and apply the parts that matter.
The setup
Two pieces, no agent framework:
First, the writer: Claude Code running Opus 4.7 at max effort, locally. It does the work: implements the feature, runs tests, opens the PR, addresses linter complaints. Standard Claude Code workflow with all the usual permissions, hooks, and slash commands.
Second, the reviewer: the Codex CLI with a globally pinned config. The config in ~/.codex/config.toml pins four settings and never wavers:
model = "gpt-5.5"
model_reasoning_effort = "xhigh"
approval_policy = "never"
sandbox_mode = "danger-full-access"
Translation: highest reasoning effort the model offers, full filesystem access, never ask for human approval. Pure YOLO mode for the reviewer because the reviewer is not modifying anything — it is reading, thinking, and writing markdown. There is nothing dangerous to gate.
The handoff
Claude Code's slash command system makes the bridge trivial. A skill file at .claude/commands/codex-review.md defines the trigger phrases ("do codex review", "second opinion from codex", "/codex-review") and the workflow steps. When the user types any variant, Claude composes a focused review prompt and runs:
codex review --base main -o "$OUT" "$(cat "$PROMPT")"
That is the entire orchestration. One CLI call from inside Claude's terminal session. The Codex process spins up with the global config, reads the diff against main, does its 30-180 seconds of deep thinking at xhigh reasoning effort, and writes a markdown report to /tmp/codex-reviews/codex-review-<timestamp>.md. Claude waits, reads the file, and decides what to do.
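The skill file itself can stay under a page. A minimal sketch of what it might contain, assuming the standard Claude Code command layout (the frontmatter and step wording here are illustrative, not the author's verbatim file):

---
description: Second-opinion code review of the current branch via Codex
---

Trigger phrases: "do codex review", "second opinion from codex", "/codex-review".

Steps:
1. Compose a focused review prompt (correctness only, severity-tagged findings,
   no stylistic nits) and write it to PROMPT=/tmp/codex-reviews/prompt-<timestamp>.md.
2. Set OUT=/tmp/codex-reviews/codex-review-<timestamp>.md, then run:
   codex review --base main -o "$OUT" "$(cat "$PROMPT")"
3. When the process exits, read $OUT, triage findings by severity, fix anything
   critical, and report a single concise summary.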
No MCP server. No HTTP plumbing. No agent framework with role definitions and inter-agent message buses. Just two CLIs and a markdown file as the contract between them.
The format
The review prompt forces a specific output shape. Every finding cites a file and line range, explains the failure mode concretely (what input triggers it, what wrong behavior results, what the blast radius is), proposes a fix, and tags severity: critical, medium, or info.
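A finding in that shape looks something like this (an illustrative sketch; the file, lines, and bug are hypothetical):

[CRITICAL] src/eval/runner.py:142-158
Failure mode: the task-claim check and the claim write happen in two separate steps, so two concurrent eval workers can claim the same task and its results get double-counted.
Fix: make check-and-claim atomic (one transaction, or a compare-and-swap on the task row).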
The prompt also forbids stylistic nits, things ruff or mypy would catch, and behavior-neutral changes. Codex is told explicitly: be ruthless about correctness, light on style. The output stays signal-dense because the prompt did the filtering up front.
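Spelled out, the review prompt reads roughly like this (a paraphrase of the constraints above, not the author's verbatim text):

Review the diff against main. For every finding:
- Cite the file and line range.
- Explain the failure mode concretely: what input triggers it, what wrong
  behavior results, what the blast radius is.
- Propose a fix.
- Tag severity: critical, medium, or info.
Do not report stylistic nits, anything ruff or mypy would catch, or
behavior-neutral changes. Be ruthless about correctness, light on style.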
Why CLI orchestration beats agent frameworks
The current agent-framework discourse is loud: AutoGen, CrewAI, LangGraph, role-based hierarchies, planner-executor-critic decompositions. The pitch is that you cannot do real multi-agent work without a framework managing message-passing between specialized roles.
That pitch is wrong for nine out of ten real workflows. The framework abstractions assume the agents need to negotiate, share state, gossip, vote. Code review does not need any of that. It needs:
- Reviewer is independent — different model, different process, no shared memory with the author
- Reviewer gets a clear input — the diff plus a focused prompt
- Reviewer produces a structured output — markdown with severity tags
- Author reads the output and decides — apply, push back, or escalate
All four happen with one CLI call and one markdown file. The framework would add: a registration system to declare agents exist, a message router to deliver the diff, a state store to track the review session, an event loop to coordinate. None of which solve a problem that exists.
CLI orchestration also gives you the property that matters most: the reviewer is genuinely independent. Different model provider, different reasoning architecture, different blind spots. A framework that runs both agents inside the same process with the same model context loses that — the agents share too much, the criticism becomes self-criticism wearing a different hat.
What it actually feels like
The lived experience inside the editor:
1. Claude finishes a feature, runs tests, opens a PR. Reports the summary.
2. You type "do codex review" — three words, no menu, no UI.
3. Claude acknowledges, composes a prompt, kicks off the codex CLI in the background. Asks if you want to wait or move on.
4. Two minutes later: "Codex returned 1 critical, 2 medium, 3 info. Critical is a race condition in the eval runner — addressing now."
5. Claude makes the fix, adds a test that locks out the regression, runs the suite, commits with reasoning, and pushes. It resolves the relevant PR threads from prior reviewers when those threads were transitively addressed.
6. CI re-runs. You read the concise summary. You merge or you push back.
Total human input: three words.
Total context switches: zero. You never left the editor. You never opened a second tool. The reviewer is literally another process on the same machine, called by the writer, returning a file. The orchestration layer is the operating system.
Where this leaves you
The principle generalizes far beyond code review. Any task where you want a fresh, independent perspective from a different model (security audit, design critique, copy editing, architecture review, math verification, translation cross-check) fits the same pattern, sketched in shell after this list:
- Define the trigger phrase as a Claude Code skill
- Configure the second model's CLI globally with the right defaults (reasoning effort, sandbox mode, approval policy)
- Have the skill compose a focused prompt and write it to disk
- Invoke the second CLI with the prompt, capture markdown output to a file
- Read the file, prioritize, act
- Report back to the user in a single concise summary
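As a shell sketch, with the worker command borrowed from the review setup above and everything else a placeholder:

#!/usr/bin/env bash
set -euo pipefail

TS=$(date +%Y%m%d-%H%M%S)
DIR=/tmp/worker-reports
mkdir -p "$DIR"
PROMPT="$DIR/prompt-$TS.md"
OUT="$DIR/report-$TS.md"

# 1. The master composes a focused prompt and writes it to disk.
cat > "$PROMPT" <<'EOF'
Review the diff against main for correctness bugs only.
Cite file and line range, explain the failure mode, propose a fix,
and tag severity: critical, medium, or info. No stylistic nits.
EOF

# 2. The master invokes the worker CLI; the worker writes markdown to $OUT.
codex review --base main -o "$OUT" "$(cat "$PROMPT")"

# 3. The master reads the report, prioritizes, acts, and summarizes for the human.
cat "$OUT"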
Master agent + worker agent over CLI. The master is the orchestrator that the human talks to. The worker is the specialist that does the cold task. The handoff is a markdown file. The protocol is the prompt. The framework is the operating system.
This is not a startup. It is not a framework. It is not even a library. It is a hundred-line skill file that anyone using Claude Code can write tonight. The unsexy version of multi-agent that actually ships.
The shape of the thing
The temptation when you have a hammer that is this powerful — Claude Code Opus 4.7 max effort with full agentic permissions — is to make it do everything. Write the code. Review the code. Test the code. Decide if the code is good enough. Merge the code.
The discipline is to recognize that the same intelligence that wrote a thing cannot honestly review the same thing. You need genuine independence somewhere in the loop. The cheapest place to get it is by handing the artifact to a different model running in a different process with a focused prompt and a markdown contract.
Two models. One branch. One CLI call between them.
That is the shape of the thing.