Two Models, One Branch
Two models on one branch. Claude writes the code and ships the PR; Codex reviews with fresh eyes. No agent framework, no MCP plumbing — just one CLI call from inside the editor. The simplest agent orchestration that actually works.
There is a dirty secret in code review: the author of a change is the worst person to review it. They wrote it. They believe in it. The bugs they missed once, they will miss again, because the blind spots do not move. The reason humans do peer review is not efficiency; it is that a stranger sees what you cannot.
The same problem applies to AI coding agents. Claude writes a feature, ships a PR, then reviews its own diff, and it sees that diff through the same eyes that wrote it. The misses are systematic. Self-review is not review.
The fix is unsexy and one CLI call long: when Claude finishes a change, hand the diff to a different model running in a different process. Then read what comes back and apply the parts that matter.
The setup
Two pieces, no agent framework:
First, the writer: Claude Code running Opus 4.7 at max effort, locally. It does the work: implements the feature, runs tests, opens the PR, addresses linter complaints. Standard Claude Code workflow with all the usual permissions, hooks, and slash commands.
Second, the reviewer: the Codex CLI with a globally pinned config. The config in ~/.codex/config.toml pins four settings and never wavers:
model = "gpt-5.5"
model_reasoning_effort = "xhigh"
approval_policy = "never"
sandbox_mode = "danger-full-access"
Translation: highest reasoning effort the model offers, full filesystem access, never ask for human approval. Pure YOLO mode for the reviewer because the reviewer is not modifying anything — it is reading, thinking, and writing markdown. There is nothing dangerous to gate.
The handoff
Claude Code's slash command system makes the bridge trivial. A skill file at .claude/commands/codex-review.md defines the trigger phrases ("do codex review", "second opinion from codex", "/codex-review") and the workflow steps. When the user types any variant, Claude composes a focused review prompt and runs:
codex review --base main -o "$OUT" "$(cat "$PROMPT")"
That is the entire orchestration. One CLI call from inside Claude's terminal session. The Codex process spins up with the global config, reads the diff against main, does its 30-180 seconds of deep thinking at xhigh reasoning effort, and writes a markdown report to /tmp/codex-reviews/codex-review-<timestamp>.md. Claude waits, reads the file, and decides what to do.
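The skill file itself can stay under a page. A minimal sketch of what it might contain, assuming the standard Claude Code command layout (the frontmatter and step wording here are illustrative, not the author's verbatim file):

---
description: Second-opinion code review of the current branch via Codex
---

Trigger phrases: "do codex review", "second opinion from codex", "/codex-review".

Steps:
1. Compose a focused review prompt (correctness only, severity-tagged findings,
   no stylistic nits) and write it to PROMPT=/tmp/codex-reviews/prompt-<timestamp>.md.
2. Set OUT=/tmp/codex-reviews/codex-review-<timestamp>.md, then run:
   codex review --base main -o "$OUT" "$(cat "$PROMPT")"
3. When the process exits, read $OUT, triage findings by severity, fix anything
   critical, and report a single concise summary.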
No MCP server. No HTTP plumbing. No agent framework with role definitions and inter-agent message buses. Just two CLIs and a markdown file as the contract between them.
The format
The review prompt forces a specific output shape. Every finding cites a file and line range, explains the failure mode concretely (what input triggers it, what wrong behavior results, what the blast radius is), proposes a fix, and tags severity: critical, medium, or info.
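A finding in that shape looks something like this (an illustrative sketch; the file, lines, and bug are hypothetical):

[CRITICAL] src/eval/runner.py:142-158
Failure mode: the task-claim check and the claim write happen in two separate steps, so two concurrent eval workers can claim the same task and its results get double-counted.
Fix: make check-and-claim atomic (one transaction, or a compare-and-swap on the task row).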
The prompt also forbids stylistic nits, things ruff or mypy would catch, and behavior-neutral changes. Codex is told explicitly: be ruthless about correctness, light on style. The output stays signal-dense because the prompt did the filtering up front.
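Spelled out, the review prompt reads roughly like this (a paraphrase of the constraints above, not the author's verbatim text):

Review the diff against main. For every finding:
- Cite the file and line range.
- Explain the failure mode concretely: what input triggers it, what wrong
  behavior results, what the blast radius is.
- Propose a fix.
- Tag severity: critical, medium, or info.
Do not report stylistic nits, anything ruff or mypy would catch, or
behavior-neutral changes. Be ruthless about correctness, light on style.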
Why CLI orchestration beats agent frameworks
The current agent-framework discourse is loud: AutoGen, CrewAI, LangGraph, role-based hierarchies, planner-executor-critic decompositions. The pitch is that you cannot do real multi-agent work without a framework managing message-passing between specialized roles.
That pitch is wrong for nine out of ten real workflows. The framework abstractions assume the agents need to negotiate, share state, gossip, vote. Code review does not need any of that. It needs:
- Reviewer is independent — different model, different process, no shared memory with the author
- Reviewer gets a clear input — the diff plus a focused prompt
- Reviewer produces a structured output — markdown with severity tags
- Author reads the output and decides — apply, push back, or escalate
All four happen with one CLI call and one markdown file. The framework would add: a registration system to declare agents exist, a message router to deliver the diff, a state store to track the review session, an event loop to coordinate. None of which solve a problem that exists.
CLI orchestration also gives you the property that matters most: the reviewer is genuinely independent. Different model provider, different reasoning architecture, different blind spots. A framework that runs both agents inside the same process with the same model context loses that — the agents share too much, the criticism becomes self-criticism wearing a different hat.
What it actually feels like
The lived experience inside the editor:
1. Claude finishes a feature, runs tests, opens a PR. Reports the summary.
2. You type "do codex review" — three words, no menu, no UI.
3. Claude acknowledges, composes a prompt, kicks off the codex CLI in the background. Asks if you want to wait or move on.
4. Two minutes later: "Codex returned 1 critical, 2 medium, 3 info. Critical is a race condition in the eval runner — addressing now."
5. Claude makes the fix, adds a test that locks out the regression, runs the suite, commits with reasoning, and pushes. It resolves the relevant PR threads from prior reviewers when those threads were transitively addressed.
6. CI re-runs. You read the concise summary. You merge or you push back.
Total human input: three words.
Total context switches: zero. You never left the editor. You never opened a second tool. The reviewer is literally another process on the same machine, called by the writer, returning a file. The orchestration layer is the operating system.
Where this leaves you
The principle generalizes far beyond code review. Any task where you want a fresh, independent perspective from a different model (security audit, design critique, copy editing, architecture review, math verification, translation cross-check) fits the same pattern, sketched in shell after this list:
- Define the trigger phrase as a Claude Code skill
- Configure the second model's CLI globally with the right defaults (reasoning effort, sandbox mode, approval policy)
- Have the skill compose a focused prompt and write it to disk
- Invoke the second CLI with the prompt, capture markdown output to a file
- Read the file, prioritize, act
- Report back to the user in a single concise summary
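As a shell sketch, with the worker command borrowed from the review setup above and everything else a placeholder:

#!/usr/bin/env bash
set -euo pipefail

TS=$(date +%Y%m%d-%H%M%S)
DIR=/tmp/worker-reports
mkdir -p "$DIR"
PROMPT="$DIR/prompt-$TS.md"
OUT="$DIR/report-$TS.md"

# 1. The master composes a focused prompt and writes it to disk.
cat > "$PROMPT" <<'EOF'
Review the diff against main for correctness bugs only.
Cite file and line range, explain the failure mode, propose a fix,
and tag severity: critical, medium, or info. No stylistic nits.
EOF

# 2. The master invokes the worker CLI; the worker writes markdown to $OUT.
codex review --base main -o "$OUT" "$(cat "$PROMPT")"

# 3. The master reads the report, prioritizes, acts, and summarizes for the human.
cat "$OUT"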
Master agent + worker agent over CLI. The master is the orchestrator that the human talks to. The worker is the specialist that does the cold task. The handoff is a markdown file. The protocol is the prompt. The framework is the operating system.
This is not a startup. It is not a framework. It is not even a library. It is a hundred-line skill file that anyone using Claude Code can write tonight. The unsexy version of multi-agent that actually ships.
The shape of the thing
The temptation when you have a hammer that is this powerful — Claude Code Opus 4.7 max effort with full agentic permissions — is to make it do everything. Write the code. Review the code. Test the code. Decide if the code is good enough. Merge the code.
The discipline is to recognize that the same intelligence that wrote a thing cannot honestly review the same thing. You need genuine independence somewhere in the loop. The cheapest place to get it is by handing the artifact to a different model running in a different process with a focused prompt and a markdown contract.
Two models. One branch. One CLI call between them.
That is the shape of the thing.