Opus 4.7 Dropped. Here's What Actually Matters.

Same price as 4.6. SWE-bench Pro jumped 20%. Vision tripled. And somewhere in a locked room, Mythos is finding zero-days in 27-year-old operating systems.


Anthropic dropped Opus 4.7 today. Same price as 4.6. Generally available everywhere. No waitlist, no preview, no "limited rollout." Just shipped.

I've been using it since this morning. Here's what actually matters if you write code for a living.

The One Number That Matters

SWE-bench Pro measures whether a model can solve real GitHub issues from real production repos. Not toy problems. Not interview questions. The kind of bugs that ruin your Tuesday afternoon.

| Model | SWE-bench Pro |
| --- | --- |
| Claude Opus 4.7 | 64.3% |
| Claude Opus 4.6 | 53.4% |
| GPT-5.4 | 57.7% |
| Gemini 3.1 Pro | 54.2% |

That's a 20% relative improvement over 4.6 on the benchmark that most closely resembles actual software engineering work. On SWE-bench Verified (the easier variant), it hit 87.6% — up from 80.8%. Rakuten's production benchmark shows 3x more tasks resolved than 4.6.

For context: the jump from 4.5 to 4.6 was roughly 5 points on SWE-bench Pro. The jump from 4.6 to 4.7 is nearly 11. This isn't incremental.
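If you want to sanity-check the relative-vs-absolute framing yourself, the arithmetic is one line (the scores are from the table above; the helper is mine):

```python
def relative_improvement(new: float, old: float) -> float:
    """Relative gain of `new` over `old`, as a percentage."""
    return (new - old) / old * 100

# SWE-bench Pro: Opus 4.6 (53.4) -> Opus 4.7 (64.3)
print(round(relative_improvement(64.3, 53.4), 1))  # 20.4
```

An 10.9-point absolute jump on a 53.4% base is where the ~20% relative figure comes from.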

The Full Scorecard

| Benchmark | Opus 4.7 | Opus 4.6 | GPT-5.4 | Gemini 3.1 Pro |
| --- | --- | --- | --- | --- |
| SWE-bench Pro (coding) | 64.3% | 53.4% | 57.7% | 54.2% |
| SWE-bench Verified | 87.6% | 80.8% | 80.6% | — |
| CursorBench | 70% | 58% | — | — |
| MCP-Atlas (tool use) | 77.3% | 75.8% | 68.1% | 73.9% |
| GPQA Diamond (reasoning) | 94.2% | — | 94.4% | 94.3% |
| MMMLU (multilingual) | 91.5% | 91.1% | — | 92.6% |
| BigLaw Bench (legal) | 90.9% | — | — | — |
| OfficeQA Pro (Databricks) | 21% fewer errors | — | — | — |

The pattern: Opus 4.7 dominates coding and tool use. GPT-5.4 is marginally better at pure reasoning. Gemini leads multilingual. For software engineering — which is what I use it for 8 hours a day — it's not close.

Vision Got Quietly Useful

Opus 4.7 accepts images up to 2,576 pixels on the long edge — roughly 3.75 megapixels. That's over 3x the resolution of previous Claude models.

This matters more than it sounds. Screenshots of error logs. Architecture diagrams on whiteboards. Technical documentation in PDFs. Chemical structures. At the old resolution, Claude would squint at these and hallucinate details. At 3.75MP, it can actually read them.
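If you preprocess screenshots before upload, it's worth downscaling anything past the cap yourself rather than letting the server do it. A minimal sketch, assuming the 2,576 px limit applies to whichever dimension is longer (the limit constant is from the announcement; the helper function is mine):

```python
def fit_long_edge(width: int, height: int, limit: int = 2576) -> tuple[int, int]:
    """Scale (width, height) down so the longer side is at most `limit`,
    preserving aspect ratio. Returns dimensions unchanged if already within."""
    long_edge = max(width, height)
    if long_edge <= limit:
        return width, height
    scale = limit / long_edge
    return round(width * scale), round(height * scale)

# A 4K screenshot gets scaled down to fit the long edge:
print(fit_long_edge(3840, 2160))  # (2576, 1449)
```

Resizing client-side keeps you in control of the interpolation and avoids shipping megabytes you'll be billed to tokenize.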

New Knobs for Power Users

xhigh effort level. There's now a slot between high and max. If you've been using max and wincing at the token bill, or using high and occasionally getting shallow answers, this is the sweet spot you were missing.

Task budgets (beta). You can now give Claude token-spend guidance for longer runs. Instead of saying "do this" and hoping it doesn't burn through your quota, you say "do this within ~50K tokens." Especially useful for agentic workflows, where a model can go down rabbit holes.
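The budget feature is in beta, so the exact API shape may shift, and it's guidance rather than a hard cutoff. You can enforce the same idea client-side by tracking cumulative usage across agent steps; a sketch under those assumptions (every name here is mine, not the SDK's):

```python
def run_with_budget(steps, budget_tokens: int) -> list:
    """Run agent `steps` (callables returning (result, tokens_used)) until
    cumulative token spend reaches `budget_tokens`, then stop."""
    spent, results = 0, []
    for step in steps:
        if spent >= budget_tokens:
            break  # budget exhausted: don't start another step
        result, used = step()
        spent += used
        results.append(result)
    return results

# Three mock steps costing 20K tokens each, under a 40K budget:
steps = [lambda: ("step1", 20_000), lambda: ("step2", 20_000), lambda: ("step3", 20_000)]
print(run_with_budget(steps, 40_000))  # ['step1', 'step2'], third step skipped
```

A client-side check like this is a useful belt-and-suspenders even once the server-side budget is out of beta.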

/ultrareview. A dedicated slash command for deep code review. Not a quick skim — a full session that hunts for bugs and design issues. Pro and Max Claude Code users get 3 free reviews.

The Tokenizer Trap

Here's the part the announcement buries in fine print: the tokenizer changed. The same input text can now produce anywhere from 1.0x to 1.35x as many tokens, depending on content. Same price per token, more tokens per prompt.

If you're on the API and not watching your token counts, your bill might jump 10-35% even though the per-token price didn't change. The model also produces more output tokens at higher effort levels, especially in later turns of agentic sessions.
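Back-of-envelope for what that range means for a bill (the 1.0x-1.35x factors are from the release notes; the per-million-token rate below is a placeholder, not actual Opus pricing):

```python
def bill_range(monthly_tokens: int, price_per_mtok: float,
               inflation_low: float = 1.0, inflation_high: float = 1.35) -> tuple[float, float]:
    """Projected monthly cost range after tokenizer inflation.
    Same per-token price; only the token count changes."""
    base = monthly_tokens / 1_000_000 * price_per_mtok
    return base * inflation_low, base * inflation_high

# 100M tokens/month at a hypothetical $15 per million tokens:
low, high = bill_range(100_000_000, 15.0)
print(round(low, 2), round(high, 2))  # 1500.0 2025.0
```

Worst case on this example is a $525/month delta with zero change in workload, which is exactly the kind of spike worth flagging before finance does.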

Not a dealbreaker. But worth knowing before your CFO asks why the AI line item spiked.

The Elephant in the Room: Mythos

Buried in the announcement is a reference to "Claude Mythos Preview" — a model Anthropic built but deliberately isn't releasing to the public.

The numbers are wild. On SWE-bench Pro, Mythos scores 77.8% (vs. Opus 4.7's 64.3%). It autonomously discovered zero-day vulnerabilities in every major operating system. It found a 27-year-old bug in OpenBSD — a system famous for its security. It's the first model to solve TLO end-to-end.

Anthropic's response wasn't to sell it. It was to lock it down. They created "Project Glasswing" to deploy it exclusively for cybersecurity defense at select companies. And they deliberately nerfed the cyber capabilities in Opus 4.7.

This is the AI moment nobody's talking about: the best model in the world exists, and the company that built it decided not to sell it. Think about what that says about where we are.

What Developers Are Actually Reporting

The benchmarks tell one story. Developers using this in production tell a more textured one. Here's what's emerging from the first 24 hours of reactions across GitHub, CodeRabbit, Replit, Notion, Databricks, and the usual HN/Reddit/X chorus.

The Bug-Finding Claim Holds Up

CodeRabbit ran 100 evaluations across real open-source pull requests. Opus 4.7 found more real bugs, gave more actionable feedback, and reasoned across files better than anything they'd tested. Their core bug-catching eval went from 55/100 with the baseline to 68/100 with 4.7 — a 24% relative improvement. This isn't a synthetic benchmark. It's real PRs on real codebases.

Rakuten reported 3x more production tasks resolved vs 4.6, with double-digit gains in code quality and test quality scores. Not a demo. Their actual production queue.

The Vision Leap Is Not Hype

Visual acuity went from 54.5% to 98.5%. That's the kind of jump that changes what's actually possible. Before 4.7, passing a screenshot or architectural diagram to Claude was a gamble — it would squint at details and occasionally hallucinate. At 98.5%, developers report it's now reliable enough to build automation workflows on top of. One line from the HN thread: "First time I can actually screenshot a Figma mockup and get back code that matches."

The Instruction-Following Has a Catch

This one's worth paying attention to. 4.7 follows instructions more literally than 4.6. Sounds like an unambiguous win, but multiple developers are reporting that prompts that worked fine on 4.6 are producing different output on 4.7. The reason: 4.6 would "read between the lines" and fill in implicit context. 4.7 does exactly what you said.

If you have a production system with carefully tuned prompts, don't just swap the model string and ship. Run your evals. The Vellum writeup put it well: "the model is now a better literal interpreter, which means ambiguous prompts get more literal responses."
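A bare-bones regression check before swapping model strings: run the same prompts through both models and flag where a judge says the new output regressed. The `call_model` stub below stands in for whatever client you actually use; everything here is illustrative, not an Anthropic SDK API:

```python
def compare_models(prompts, call_model, old: str, new: str, judge) -> list:
    """Return the prompts where `judge` rejects the new model's output
    relative to the old model's."""
    regressions = []
    for prompt in prompts:
        old_out = call_model(old, prompt)
        new_out = call_model(new, prompt)
        if not judge(prompt, old_out, new_out):
            regressions.append(prompt)
    return regressions

# Toy example: a stub "model" whose new version strips whitespace,
# judged by exact match. Real judges would be task-specific evals.
def call_model(model, prompt):
    return prompt.upper() if model == "old-model" else prompt.upper().strip()

flagged = compare_models(["  pad  ", "ok"], call_model, "old-model", "new-model",
                         judge=lambda p, a, b: a == b)
print(flagged)  # ['  pad  ']
```

Even a crude harness like this catches the "literal interpreter" class of drift before it reaches production.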

The "Was 4.6 Nerfed?" Conversation

There's a thread running through the community that's worth mentioning because it's loud. For weeks before the 4.7 release, developers were complaining that 4.6 had quietly gotten worse. The working theory: Anthropic had throttled it for cost or compute reasons. When 4.7 dropped and felt noticeably better, part of the community interpreted it as "we're getting back what we lost, rebranded as an upgrade."

Hard to verify without Anthropic's internal metrics. But the perception matters — it's shaping how developers are approaching this release. Some are switching eagerly, others are waiting to see if "bad 4.7" shows up in 2-3 weeks the way "bad 4.6" allegedly did.

The Agentic Gains Are Real on Long Tasks

Teams at Replit, Notion, and Databricks who had early access report the biggest gains on long, multi-step work: sustained agent sessions, codebase-wide refactors, tasks that need tool-dependent workflows. The model carries context across sessions better. It verifies its own output before reporting done. It handles ambiguity by asking rather than guessing.

What doesn't show up as much in the chatter: dramatic gains on quick, one-shot tasks. If you use Claude for five-minute edits and quick scripts, you may not feel the difference. The story is at the long end.

The Stuff That's Still Rough

  • Token inflation: the tokenizer change plus higher verbosity at high/xhigh effort is producing real bill spikes. Multiple developers are reporting 15-30% higher costs on identical workloads. If you don't watch it, it'll bite.
  • xhigh is generous: the new effort level between high and max produces more output than some expected. Useful for hard problems, expensive for easy ones.
  • Tool-use overhead: MCP-Atlas scores improved, but several teams note that while accuracy went up, latency on heavy tool-use workflows went up too. Trade-off, not a free lunch.
  • Not Mythos: Anthropic is openly saying this is less capable than Mythos Preview. If you were hoping 4.7 would feel like the best of what they have, it won't.

Bottom Line

Same price. Better coding. Better vision. Better instruction following. Tighter agentic loops. New effort knobs. Watch the tokenizer change if you're on the API.

If you're on Claude Code, you're already on it. If you're on the API, switch your model string to claude-opus-4-7 and run your evals. If you're on GPT-5.4 for coding work, run the same evals and compare. The SWE-bench Pro gap is 6.6 points — that's not marketing, that's measured.

And somewhere in a locked room, Mythos is finding zero-days in operating systems that have been running since before some of you were born. Sleep well.


💬
Working with a team that wants to adopt AI-native workflows at scale? I help engineering teams build this capability — workflow design, knowledge architecture, team training, and embedded engineering. → AI-Native Engineering Consulting