GPT-5.5: The Frontier Just Split in Two

OpenAI shipped GPT-5.5 today — first ground-up retrain since GPT-4.5, double the token price, state-of-the-art on agentic benchmarks. Opus 4.7 still wins precision coding. What the numbers say, what practitioners are reporting, and how to pair it with Claude Code.

Price doubled. The agentic-coding bar moved. The release note reads like every release note in 2026 — "smartest model yet" — and yes, that's the tenth time this year a vendor has written that sentence.

Under the marketing layer, something else happened. GPT-5.5 is OpenAI's first ground-up retrain since GPT-4.5 — codename Spud, no less, which is the most disarming thing the company has said in months. Opus 4.7 shipped exactly seven days ago with its own frontier jump on SWE-bench Pro. For the first time this cycle, the two labs landed their top models inside the same week, on separate axes of the same workload.

The headline isn't "GPT-5.5 is better." The headline is: the frontier just split in two.

I upgraded to Pro this morning. Here's the honest read.

What actually shipped

  • Three variants. GPT-5.5 standard, GPT-5.5 Thinking (extended reasoning), GPT-5.5 Pro (highest accuracy, slower, Pro / Business / Enterprise only).
  • 1M token context window across every tier — including ChatGPT Plus, not just the API. This is the first OpenAI model to ship a million-token context window by default.
  • API pricing: $5 / $30 per million input/output tokens for standard. $30 / $180 per million for Pro. That's roughly 2× GPT-5.4. OpenAI's defense: fewer tokens per completed task.
  • ChatGPT availability: Plus, Pro, Business, Enterprise. Free tier is out.
  • Positioning: "A new class of intelligence for real work." Read: this is an agent, not a chatbot. OpenAI is done pretending otherwise.

Where each variant actually lives

This trips people up on day one, so worth stating flatly: GPT-5.5 Pro is a ChatGPT-surface variant, not a Codex one. Codex gets standard and Thinking. Pro's parallel-reasoning architecture is built for one-shot deep analysis, not the iterative dev loop — so OpenAI deliberately keeps it out of the CLI.

| Variant | ChatGPT (web / desktop) | Codex CLI / IDE |
| --- | --- | --- |
| GPT-5.5 (standard) | yes | yes (shows as gpt-5.5) |
| GPT-5.5 Thinking | yes | yes (via /model or reasoning effort) |
| GPT-5.5 Pro | yes (Pro / Business / Enterprise) | no — ChatGPT only |

If you're a Pro subscriber and you open Codex, the gpt-5.5 you see is already the right model. Pro shows up in the ChatGPT model picker (web or desktop app), not in the CLI. If it's not in your picker yet, it's still the staggered rollout — wait a few hours.

Reach for Pro on ChatGPT when the task is one massive synthesis — PhD-level analysis, "read all of this and give me the answer," parallel-hypothesis research. Keep the dev loop on standard or Thinking inside Codex.

The benchmark spread

Two labs, one week apart, two different bets. Here's the scorecard that matters when you're actually shipping code with these things:

| Benchmark | GPT-5.5 | Opus 4.7 | Winner |
| --- | --- | --- | --- |
| Terminal-Bench 2.0 (agentic CLI) | 82.7% | 69.4% | GPT-5.5 (+13.3) |
| OSWorld-Verified (computer use) | 78.7% | 78.0% | Tie |
| Tau2-bench Telecom | 98.0% | n/a | GPT-5.5 showcase |
| GDPval (knowledge work vs pros) | 84.9% | n/a | GPT-5.5 showcase |
| SWE-bench Pro (real GitHub issues) | 58.6% | 64.3% | Opus 4.7 (+5.7) |
| MCP-Atlas (tool use) | 75.3% | 79.1% | Opus 4.7 (+3.8) |
| FrontierMath Tier 4 (Pro tier) | 39.6% | 22.9% | GPT-5.5 Pro (≈2×) |
| BrowseComp (web research, Pro) | 90.1% | n/a | GPT-5.5 Pro |
| Hallucination rate | −60% vs 5.4 | n/a | (OpenAI's claim) |

Read the table, not the marketing.

GPT-5.5 wins when the work is agentic — driving a terminal, clicking through apps, calling tools, orchestrating across IDE plus browser plus spreadsheet. Opus 4.7 wins when the work is code precision — resolving a real GitHub issue end-to-end, following instructions literally, binding a tool schema without drift. Two labs, two bets, one cleanly bisected workload.

The first-24-hour signal

Hacker News has already produced the meme: GPT-5.5 as Marvin from Hitchhiker's Guide. Multiple developers reported the model apologizing instead of executing — "You're right, I have failed you" — or parroting instructions back instead of doing the work. A few switched mid-task to Claude or Kimi because, as one put it, the OpenAI API "couldn't drag it to do its job." This pattern has haunted GPT-5 Thinking for months. It's usually prompt-able around, but it's real enough that the Anthropic camp will be quoting it all week.

On the other side: Every's practitioner panel is calling it a "new daily driver." Dan Shipper ran it against Opus 4.7 on their internal senior-engineer benchmark and got 62.5 vs the low 30s. One of their staff writers said GPT-5.5 produced drafts with cleaner idea progression than Opus — easier to revise against. Not a coding signal, but a signal.

Claire Vo, in one line: "GPT-5.5 just did what no other model could." Her task was messy, real, cross-tool — exactly the shape Terminal-Bench and GDPval are measuring. That's where the agentic delta shows up in practice.

Decrypt's read is harder. Pricing doubled. The Chinese open-weight labs — Xiaomi MiMo at $1 / $3, Minimax at $0.30 / $1.20 per million tokens — are charging a fraction per token. Altman's "fewer tokens per task" argument cuts both ways: if tasks genuinely cost fewer tokens, great. If the framing hides a distribution where half your workload still runs at roughly the same token count and you just paid double, not great. You'll know in two weeks, when the bill arrives.
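The "fewer tokens per task" argument is easy to make concrete. A back-of-envelope sketch, assuming GPT-5.4 at exactly half of GPT-5.5's published rates (the release note only says "roughly 2×", so treat the 5.4 numbers as an assumption) and an illustrative agentic task size:

```python
# Back-of-envelope cost per task. GPT-5.5 rates ($5 in / $30 out per
# million tokens) are from the release; the GPT-5.4 rates are an
# ASSUMPTION (half of 5.5, per the "roughly 2x" framing).

def cost(tokens_in: int, tokens_out: int, rate_in: float, rate_out: float) -> float:
    """Dollar cost of one task at per-million-token rates."""
    return tokens_in / 1e6 * rate_in + tokens_out / 1e6 * rate_out

# Hypothetical agentic task: 200k tokens in, 20k out on GPT-5.4.
old = cost(200_000, 20_000, rate_in=2.50, rate_out=15.00)       # $0.80

# If GPT-5.5 finishes the same task in half the tokens, the 2x price
# hike is a wash:
new_efficient = cost(100_000, 10_000, rate_in=5.00, rate_out=30.00)  # $0.80

# If the task takes the same tokens, you simply pay double:
new_same = cost(200_000, 20_000, rate_in=5.00, rate_out=30.00)       # $1.60

assert abs(new_efficient - old) < 1e-9   # break-even needs a ~50% token cut
assert abs(new_same - 2 * old) < 1e-9    # otherwise the bill doubles
```

In other words: the break-even point under a uniform 2× price hike is a 50% token reduction per task. Any workload slice that doesn't hit that cut is a straight cost increase.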

The inconvenient stuff

  • The −60% hallucination drop is internally measured. No independent reproduction yet. Treat it as a plausible direction, not a fact.
  • Ruby is still rough. Spatial reasoning uneven. PowerPoint and visual-layout tasks weak. If your work lives in those niches, Opus 4.7 stays on.
  • Instruction literalism — the same trait that made Opus 4.7 stumble on legacy prompts — is now showing up in GPT-5.5 too. Prompts tuned for 5.4 may underperform until you revalidate.
  • Refusal / reluctance behavior is intermittent. Higher reasoning effort doesn't always fix it, and sometimes it makes things worse. Simon Willison's SVG-pelican test had max-effort outputs looking worse than medium-effort ones.
  • Cost: you will pay more this month. If you were running serious volume on 5.4, your next invoice is the real benchmark.

How to pair it with Claude Code

I'm not switching off Claude. Nobody shipping serious work is switching off either model today. The interesting question for the next six months is how you route between them.

Here's the partition that's emerging from practitioners using both — and it maps cleanly onto what the benchmark table says:

| Task shape | Pick | Why |
| --- | --- | --- |
| Multi-hour coding in a repo, multi-step refactor, PR-shaped work | Claude Code with Opus 4.7 | Wins SWE-bench Pro and MCP-Atlas; tighter literal instruction-following; cheaper per output token |
| Agent driving terminal + browser + spreadsheet across tools | GPT-5.5 (Codex / Thinking) | Terminal-Bench 82.7%, OSWorld 78.7%, BrowseComp 90.1% on Pro |
| Cross-tool knowledge work — research, dashboards, curricula, run-of-show docs | GPT-5.5 | Where GDPval's 84.9% turns into real throughput |
| Deep repo reading, architectural review, "read this codebase and tell me what's wrong" | Claude Code with Opus 4.7 | Long-context code comprehension is still Opus territory |
| Plan a hard multi-step project before implementing | GPT-5.5 Thinking or Opus 4.7 | Whichever your team has better prompt discipline on |
| First-draft writing, structured documents | GPT-5.5 | Every's writer panel signal; cleaner idea progression |
| Final-draft writing with voice preservation | Opus 4.7 | Claude's voice layer is still ahead |
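The partition is mechanical enough to write down. A minimal routing sketch — the task-shape labels and model identifiers are illustrative, not a real API:

```python
# Minimal model router over the task shapes above.
# Labels and model names are illustrative placeholders, not real API IDs.

ROUTES = {
    "repo_coding":      "claude-code/opus-4.7",  # SWE-bench Pro, MCP-Atlas edge
    "cross_tool_agent": "gpt-5.5",               # Terminal-Bench, OSWorld edge
    "knowledge_work":   "gpt-5.5",               # GDPval-shaped throughput
    "repo_review":      "claude-code/opus-4.7",  # long-context code comprehension
    "planning":         "gpt-5.5-thinking",      # or Opus 4.7, per team preference
    "first_draft":      "gpt-5.5",
    "final_draft":      "claude-code/opus-4.7",  # voice preservation
}

def route(task_shape: str) -> str:
    """Pick a model for a task shape; default to the precision-coding model."""
    return ROUTES.get(task_shape, "claude-code/opus-4.7")
```

The interesting design choice is the default: when you can't classify the task, falling back to the stricter, more literal model is the cheaper failure mode.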

The pairing I'm running this week, concretely:

  • Plan in GPT-5.5 Thinking or Pro. Messy inputs, broad context, cross-tool exploration. Let it ask clarifying questions until it's 95% sure of the plan. Output: a concrete implementation spec.
  • Execute in Claude Code with Opus 4.7. Hand it the spec, let it touch the repo. Opus 4.7's instruction literalism becomes an asset instead of a trap when the instructions are already tight.
  • Cross-review. Paste Claude's diff back into GPT-5.5 and ask for bug-hunting and architectural pushback. This is where the two models earn their combined price — each is the other's best reviewer because their failure modes are independent.
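The three steps above reduce to a small pipeline. A sketch with a stand-in `call_model` — substitute whatever client you actually use (OpenAI SDK, Claude Code in non-interactive mode); nothing here is a real API call:

```python
# Sketch of the plan -> execute -> cross-review loop described above.
# call_model is a PLACEHOLDER for a real model client.

def call_model(model: str, prompt: str) -> str:
    """Placeholder: send `prompt` to `model`, return its text output."""
    return f"[{model}] response to: {prompt[:40]}"

def plan_execute_review(task: str) -> dict:
    # 1. Plan in GPT-5.5 Thinking: messy input -> tight implementation spec.
    spec = call_model("gpt-5.5-thinking",
                      f"Ask clarifying questions, then write an implementation spec for: {task}")
    # 2. Execute in Claude Code / Opus 4.7: instruction literalism becomes
    #    an asset once the spec is already tight.
    diff = call_model("claude-opus-4.7", f"Implement exactly this spec:\n{spec}")
    # 3. Cross-review: feed the diff back to GPT-5.5 for independent pushback.
    review = call_model("gpt-5.5", f"Hunt for bugs and architectural issues in:\n{diff}")
    return {"spec": spec, "diff": diff, "review": review}
```

The point of the structure is the independence of step 3: the reviewer model has different failure modes than the executor, so its pushback isn't correlated with the bugs it's hunting.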

This is what "two frontiers" means in practice. You stop picking a favorite. You start routing.

Bottom line

GPT-5.5 is a real release. It's not "everything changes" — that phrase belongs to marketing, not to anyone who has actually run both models today. It's a specialized instrument that happens to have overtaken the other specialized instrument on one axis (agentic workflow) while giving back a few points on another (precision coding).

The price doubled. The hallucination-drop claim needs independent verification. The Marvin behavior is a real risk if your prompt discipline is loose. But for messy, multi-tool agent work? This is the new ceiling.

If you're already living in Claude Code, you don't need to move. You need to add. Get GPT-5.5 installed where your agentic work lives — Codex, the browser layer, the spreadsheet layer, the research layer — and let Opus 4.7 keep owning the repo.

The monogamous model is dead. The router is the new stack.


Working with a team that wants to adopt AI-native workflows at scale? I help engineering teams build this capability — workflow design, knowledge architecture, team training, and embedded engineering. → AI-Native Engineering Consulting