I woke up this morning to Anthropic dropping Claude Opus 4.7. No slow rollout, no waitlist. Just a new model sitting in Claude Code, the API, and every major cloud provider simultaneously.
I have been running Opus 4.6 as my default coding model for months. It has been the backbone of my agentic coding workflow, my content pipeline, and most of my production debugging sessions. So the first thing I did was throw my hardest open tasks at Opus 4.7 to see if the upgrade claims hold up.
Here is what I found after a full day of hands-on use, alongside everything Anthropic published in the official announcement.
What Opus 4.7 Actually Is
Let me set the baseline quickly for anyone who is not tracking every model release.
Claude Opus 4.7 is Anthropic’s new flagship model, replacing Opus 4.6 as the top-tier option across Claude.ai, the Claude API, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry. The model ID is claude-opus-4-7.
It keeps the 1M token context window from 4.6. Pricing is unchanged at $5 per million input tokens and $25 per million output tokens. If you are already on a Claude Pro, Max, Team, or Enterprise plan, you have access right now.
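Since the model ID and pricing are the only switching costs, pointing existing code at 4.7 is mechanical. A minimal sketch, assuming the standard Messages API request shape; the helper only assembles the request payload, so nothing here needs an API key or a network call:

```python
# The model ID is from the announcement; everything else is the
# ordinary Messages API request shape.
MODEL_ID = "claude-opus-4-7"

def build_request(prompt: str, max_tokens: int = 1024) -> dict:
    """Assemble the kwargs you would pass to client.messages.create(...)."""
    return {
        "model": MODEL_ID,
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }
```

Upgrading an existing integration is then a one-line change to `MODEL_ID`, followed by re-running whatever tests you have.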
The headline improvements fall into four buckets: agentic coding, vision, instruction following, and a new effort level called xhigh. Each of these is worth unpacking because the details matter more than the marketing.
The Benchmark Numbers
I am going to put the hard numbers up front because that is what most developers want to see first. Then I will talk about what they actually mean in practice.
SWE-bench Pro (agentic coding on real GitHub issues):
- Opus 4.7: 64.3%
- Opus 4.6: 53.4%
- GPT-5.4: 57.7%
- Gemini 3.1 Pro: 54.2%
SWE-bench Verified:
- Opus 4.7: 87.6%
- Opus 4.6: 80.8%
- Gemini 3.1 Pro: 80.6%
CursorBench:
- Opus 4.7: 70%
- Opus 4.6: 58%
MCP-Atlas (scaled tool use):
- Opus 4.7: 77.3%
- Opus 4.6: 75.8%
- GPT-5.4: 68.1%
- Gemini 3.1 Pro: 73.9%
GPQA Diamond (graduate-level reasoning):
- Opus 4.7: 94.2%
- GPT-5.4 Pro: 94.4%
- Gemini 3.1 Pro: 94.3%
Rakuten-SWE-Bench (production tasks):
- Opus 4.7 resolves 3x more production tasks than Opus 4.6
The jump from 53.4% to 64.3% on SWE-bench Pro is the number that stands out most. That is not an incremental improvement. That is a nearly 11-point gain on the most respected coding benchmark in the industry. Opus 4.7 also resolved four tasks that neither Opus 4.6 nor Sonnet 4.6 could solve at all, which suggests the model is not just doing the same things slightly better but actually handling a new class of problems.
The one benchmark where Opus 4.7 does not lead is GPQA Diamond, where all three frontier models are essentially tied at 94%. Graduate-level reasoning has become table stakes at this point. The real differentiation is happening in applied tasks like agentic coding and tool use.
Agentic Coding: The Biggest Upgrade
This is where the 4.7 release matters most for developers who use Claude daily.
Opus 4.6 was already the best coding model I had used. But it had specific pain points in longer agentic sessions. After 15-20 tool calls in a complex refactoring task, the model would sometimes lose track of what it had already changed. It would occasionally re-read files it had just modified, wasting tokens and time. On really hard multi-file tasks, it would sometimes produce a working solution but take an unnecessarily circuitous path to get there.
Opus 4.7 addresses this directly. Anthropic describes it as “improved long-horizon autonomy, systems engineering, and complex code reasoning.” In practice, what I noticed today is that the model makes fewer redundant moves in multi-step tasks. It plans more effectively before starting, and when it hits a problem mid-execution, the recovery path is more efficient.
On Anthropic’s internal 93-task coding benchmark, the 13% improvement in resolution rate is meaningful. But the qualitative change is what I care about more: the code quality is noticeably cleaner. Fewer wrapper functions. Less unnecessary abstraction. Outputs that feel like they were written by a developer who understands the codebase rather than a model that technically followed the instructions.
The improvement on Rakuten-SWE-Bench, which measures production-level task resolution, is even more dramatic: 3x more resolved tasks than 4.6, with double-digit gains in both code quality and test quality scores. That is the kind of improvement you actually feel in a workday.
The New xhigh Effort Level
This is a small feature that I think will be underappreciated in the initial coverage.
Previously, Claude had effort levels of low, medium, high, and max. Opus 4.7 introduces xhigh (extra high), which sits between high and max. Claude Code now defaults to xhigh across all plans.
Why does this matter? Because the gap between high and max was always a bit too wide. High was fast but would sometimes skip important reasoning steps on hard problems. Max was thorough but slow and token-heavy, which made it overkill for problems that needed more thought than high could give but did not require the full max treatment.
xhigh is the sweet spot for most coding work. You get the deeper reasoning when the model encounters genuine complexity, without burning through your token budget on straightforward tasks. For anyone building with Claude in agentic workflows where token costs compound across multi-turn sessions, this level of control over the reasoning-latency tradeoff is genuinely useful.
If you have been running at max by default and watching your token costs climb, try switching to xhigh. The quality difference on most tasks is minimal, and the cost savings over a full workday are real.
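If you route requests programmatically, the five levels suggest a simple policy: reserve max for the genuinely hard tail and let xhigh cover the rest. Only the level names below come from the release notes; `effort` as a request field and the complexity heuristic are assumptions for illustration, not a confirmed API surface.

```python
# Hypothetical routing policy: the level names are from the release
# notes, but the 0-10 complexity score and the idea of passing the
# result as an "effort" request field are illustrative assumptions.
EFFORT_LEVELS = ["low", "medium", "high", "xhigh", "max"]

def pick_effort(task_complexity: int) -> str:
    """Map a rough 0-10 complexity estimate to an effort level.

    xhigh is the default band for most coding work; max is reserved
    for the hardest problems, where the extra tokens pay for themselves.
    """
    if task_complexity <= 2:
        return "low"
    if task_complexity <= 4:
        return "medium"
    if task_complexity <= 6:
        return "high"
    if task_complexity <= 9:
        return "xhigh"
    return "max"
```

The point of the policy is cost control: in multi-turn agentic sessions, the difference between defaulting to max and defaulting to xhigh compounds with every tool call.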
Vision: 3x Resolution Is Not Just a Spec Bump
Opus 4.7 processes images at up to 2,576 pixels on the long edge, roughly 3.75 megapixels. That is more than three times the resolution of Opus 4.6.
I tested this today with a dense architecture diagram and a screenshot of a financial dashboard. On 4.6, the model could describe the general structure but would miss specific labels and small text. On 4.7, it read individual axis labels, identified specific data points, and correctly interpreted a cramped legend that I would have struggled to read myself at that zoom level.
For developers, the practical impact is in UI review workflows. If you are using Claude to evaluate screenshots, spot visual regressions, or analyze error states in a running application, the 3x resolution improvement means it can now reliably read small text, distinguish between similar UI states, and catch details that the previous model would blur past.
Anthropic also claims 21% fewer document reasoning errors than Opus 4.6, which tracks with what I saw when testing it on technical diagrams. The model does not just see more pixels. It interprets them more accurately.
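If you preprocess screenshots before sending them, it is worth capping the long edge at the new limit rather than the old one so you are not throwing away detail the model can now use. A minimal sketch of the resize math, assuming the 2,576-pixel long-edge cap from the announcement:

```python
def fit_long_edge(width: int, height: int, cap: int = 2576) -> tuple:
    """Scale dimensions down so the long edge is at most `cap` pixels.

    Images already under the cap are returned unchanged; aspect ratio
    is preserved, rounding to whole pixels. The 2576 default is the
    long-edge limit stated for Opus 4.7.
    """
    long_edge = max(width, height)
    if long_edge <= cap:
        return width, height
    scale = cap / long_edge
    return round(width * scale), round(height * scale)
```

Plug the resulting dimensions into whatever image library you already use for the actual resize; the function only does the arithmetic.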
Instruction Following Got Stricter
This one is a double-edged sword, and Anthropic is upfront about it.
Opus 4.7 follows instructions more literally than 4.6. If you tell it to only modify a specific file, it will not touch adjacent files even if doing so would clearly improve the result. If your prompt says “use a for loop,” it will use a for loop even if a map/filter chain would be cleaner.
For well-written prompts, this is an improvement. The model does what you ask, precisely. For loose, conversational prompts where you relied on the model to fill in the blanks and make judgment calls, you might find that 4.7 takes your words more literally than you intended.
Anthropic recommends re-tuning prompts that were written for earlier models. If you have production systems built on Opus 4.6, this is worth testing before you swap model IDs. The behavior changes are subtle but they exist. The context engineering principles I wrote about last month become even more relevant here: what the model sees shapes what it does, and with stricter instruction following, your context needs to be precise.
Better Memory Across Sessions
One improvement that is easy to overlook: Opus 4.7 is better at using filesystem-based memory across multi-session work.
If you use CLAUDE.md files, project-specific context files, or any form of persistent context that carries between conversations, the model retrieves and applies that context more reliably. Anthropic says it requires “less upfront context for subsequent tasks,” which means the model is better at picking up where it left off without you having to re-explain the full project state.
For anyone who has been frustrated by the agent memory problem, this is a meaningful quality-of-life improvement. It does not solve the fundamental challenge of long-term memory for AI agents, but it reduces the friction of context restoration between sessions.
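The pattern on your side stays the same: keep project state in persistent files and load them at session start. A minimal sketch, where the file names are conventions I use rather than anything the model requires:

```python
from pathlib import Path

def load_session_context(project_root: str,
                         files=("CLAUDE.md", "NOTES.md")) -> str:
    """Concatenate whatever persistent context files exist in a project.

    The file names are just conventions (hypothetical defaults here);
    missing files are skipped so the same call works across projects.
    """
    parts = []
    for name in files:
        path = Path(project_root) / name
        if path.is_file():
            parts.append(f"## {name}\n{path.read_text()}")
    return "\n\n".join(parts)
```

With 4.7 retrieving this kind of context more reliably, the payoff for keeping these files current goes up.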
The Mythos Elephant in the Room
Anthropic did something unusual with this release. They openly acknowledged that Opus 4.7 is less capable than their unreleased model, Claude Mythos Preview.
Mythos scored 77.8% on SWE-bench Pro compared to Opus 4.7’s 64.3%. It also hit 100% on Cybench, the cybersecurity evaluation. Anthropic decided not to release it broadly because its cyber capabilities exceeded what they were comfortable deploying at scale. Instead, they implemented deliberate cyber safeguards in Opus 4.7 that detect and block high-risk cybersecurity requests, and they reduced the model’s cyber capabilities during training compared to Mythos.
I wrote about Mythos when the first details leaked. The fact that Anthropic is releasing a model they acknowledge is not their best while explicitly pointing to a more capable model they are holding back is a fascinating strategic choice. It is either principled safety work or deliberate hype-building for the next release, possibly both.
For practical purposes, Opus 4.7 is the model you can actually use, and it is clearly the best generally available coding model right now. Whether Mythos eventually ships, and in what form, is a question for a future article.
What This Means for the Model Comparison
I wrote a detailed comparison of Opus 4.6 vs GPT-5.4 vs Gemini 3.1 Pro last month. Opus 4.7 shifts that landscape meaningfully.
Coding: Opus 4.7 now leads by an even wider margin. The 64.3% SWE-bench Pro score puts it 6.6 points ahead of GPT-5.4 (57.7%) and 10.1 points ahead of Gemini 3.1 Pro (54.2%). If coding was already Claude’s strongest domain, it just got stronger.
Tool use: The MCP-Atlas benchmark shows Opus 4.7 at 77.3% versus GPT-5.4 at 68.1%. That 9-point gap in scaled tool use matters a lot for agentic workflows where the model needs to call external tools reliably.
Reasoning: Still effectively a three-way tie on GPQA Diamond (94.2% vs 94.4% vs 94.3%). None of the frontier models have a reasoning advantage anymore.
Multimodal: Gemini still has the broadest multimodal support (text, image, audio, video). Opus 4.7’s 3x vision improvement narrows the gap on image tasks specifically, but if you need audio or video processing, Gemini is still the answer.
Pricing: Unchanged. Opus 4.7 is $5/$25 per million tokens, competitive with GPT-5.4, and still more than twice the price of Gemini 3.1 Pro at $2/$12 (2.5x on input, roughly 2.1x on output).
My recommendation from the previous article still holds with one update: if agentic coding is your primary use case, the case for Claude just got significantly stronger.
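The pricing comparison is easy to sanity-check against your own traffic. A quick calculator using the per-million rates quoted above; GPT-5.4 is omitted because the article calls it "competitive" without quoting exact rates:

```python
PRICES_PER_MTOK = {
    # (input, output) in USD per million tokens, per the figures above.
    "claude-opus-4-7": (5.00, 25.00),
    "gemini-3.1-pro": (2.00, 12.00),
}

def monthly_cost(model: str, input_mtok: float, output_mtok: float) -> float:
    """Cost in USD for a month of traffic, measured in millions of tokens."""
    p_in, p_out = PRICES_PER_MTOK[model]
    return input_mtok * p_in + output_mtok * p_out
```

For a workload of 100M input and 20M output tokens a month, that works out to $1,000 on Opus 4.7 versus $440 on Gemini 3.1 Pro, which is the gap you are paying to close with the coding and tool-use lead.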
Token Usage: The Caveat
Anthropic is transparent about a tradeoff worth knowing: input tokenization has changed slightly with 4.7. Depending on content type, the same input now produces roughly 1.0 to 1.35x as many tokens as it did before.
Combined with higher thinking token usage at increased effort levels, particularly in multi-turn agentic scenarios, your total token consumption may go up even if the model resolves tasks more efficiently. Anthropic says their internal testing shows “net favorable token efficiency on coding evaluations” because the model needs fewer turns to reach a solution, but your specific results will depend on your use case.
If you are tracking API costs closely, monitor your first week on 4.7. The per-task efficiency might improve while the per-token costs shift slightly. For most individual developers on subscription plans, this does not matter. For teams running high-volume API workloads, it is worth watching.
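One way to bound the impact before committing: project your current input spend through the stated multiplier range. The function below is plain arithmetic over the published numbers; it deliberately does not model the fewer-turns efficiency gain, so treat the high end as a ceiling, not a forecast.

```python
def projected_input_cost(current_input_tokens: int,
                         price_per_mtok: float = 5.00,
                         multipliers=(1.0, 1.35)) -> tuple:
    """Best/worst-case input cost under the 4.7 tokenizer change.

    Returns a (low, high) USD range for the same logical content,
    using the 1.0-1.35x token inflation range from the announcement
    and the unchanged $5 per million input token price.
    """
    lo, hi = multipliers
    base = current_input_tokens / 1_000_000 * price_per_mtok
    return base * lo, base * hi
```

If 10M input tokens a day currently costs you $50, the worst case under the new tokenizer is $67.50 for the same content, before any savings from shorter sessions.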
Should You Switch Today?
Here is how I would think about it depending on your situation.
If you are using Claude Code on a Pro or Max plan: You already have access. Switch to 4.7 and try it on your current work. The coding improvements are real and the transition is seamless. If you hit any prompt behavior changes, adjust your CLAUDE.md or project context files.
If you are running Opus 4.6 in production via the API: Test first. The stricter instruction following and tokenizer changes mean you should validate your existing prompts against 4.7 before swapping the model ID. Run your eval suite if you have one. If you are building AI features, this is exactly the kind of model update that your evals should catch.
If you are using GPT-5.4 or Gemini 3.1 Pro as your primary: The benchmark gaps in coding and tool use just got wider. If you have been on the fence about trying Claude for development work, this release makes the strongest case yet. If reasoning or multimodal breadth is your primary concern, the competitive picture has not changed much.
If you are new to AI coding tools: Start with Claude Code and Opus 4.7. The agentic coding workflow I described last month works even better now. The combination of stronger coding performance, better long-session stability, and the xhigh effort level makes it the most capable entry point for developers getting serious about AI-assisted development.
The Bigger Picture
What strikes me about this release is not the benchmarks. It is the pace.
Opus 4.6 dropped and immediately became the best coding model available. A few months later, Opus 4.7 arrives with meaningful improvements across the board. Meanwhile, Anthropic is openly talking about Mythos, a model that makes both of them look modest, while explaining why they will not ship it yet.
The gap between what these companies can build and what they choose to deploy is growing. That has implications for developers, for the companies building on these APIs, and for the broader conversation about AI capability and safety. Whether you find Anthropic’s approach principled or frustrating, you have to acknowledge that they are shipping a genuinely excellent model while being honest about the fact that they have something better sitting on the shelf.
For right now, Opus 4.7 is the best model I have used for the work I do every day. It writes cleaner code, handles longer sessions more reliably, sees images with enough resolution to be actually useful, and follows instructions precisely enough that I spend less time correcting and more time reviewing.
That is what a good model upgrade feels like. Not revolutionary. Not incremental. Just better in the places that matter most when you are actually shipping software.