Claude Opus 4.8 Review: Benchmarks & Guide

Anthropic dropped Claude Opus 4.8 yesterday, May 28. Same playbook as the last few releases. No waitlist, no staged rollout. It showed up in Claude Code, the API, and the major cloud providers on the same day, with the model ID claude-opus-4-8 ready to drop into existing config.

I have been running Opus 4.7 as my default coding model since it launched in April. It handled my agentic coding sessions, my content pipeline, and most of my production debugging. So the first thing I did with 4.8 was throw the exact same hard tasks at it that I used to stress-test 4.7, then dig into the official announcement to separate the real changes from the launch-day polish.

Here is what I found after a day of hands-on use.

The Headline: It Stopped Lying to Me About My Code

The benchmark Anthropic led with is not a coding score or a reasoning score. It is honesty. Opus 4.8 is roughly 4x less likely than 4.7 to let a code flaw pass unremarked.

That number sounds abstract until you have lived the failure mode it describes. You ask a model to review a function. It tells you the function looks good. You ship it. It breaks. The model did not miss the bug because it was incapable of seeing it. It missed it because the path of least resistance in a review is to agree with you and move on.

Opus 4.8 does this far less. In my testing yesterday, I deliberately fed it three functions I knew had subtle problems: an off-by-one in a pagination helper, a race condition in a debounced save, and a silent error swallow in a fetch wrapper. 4.7 caught the off-by-one and missed the other two on the first pass. 4.8 flagged all three, and on the error swallow it specifically called out that the empty catch block would hide failures in production, which is exactly the kind of thing my global rules tell it to watch for.
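For readers who have not hit these failure modes, here is a minimal Python sketch of two of them, the off-by-one and the silent error swallow. The function names are hypothetical stand-ins chosen for illustration, not the actual code from the test described above, and the swallow is shown with a Python try/except rather than a JavaScript catch block.

```python
import math


def page_count_buggy(total_items: int, page_size: int) -> int:
    # Off-by-one: floor division silently drops a partially filled last page.
    return total_items // page_size


def page_count_fixed(total_items: int, page_size: int) -> int:
    # Ceiling division counts the partial last page as a page.
    return math.ceil(total_items / page_size)


def load_swallowing(loader):
    # Silent error swallow: every failure collapses into None, so the caller
    # cannot tell "the request failed" apart from "there was no data".
    try:
        return loader()
    except Exception:
        return None


def load_surfacing(loader):
    # Same wrapper, but the failure is re-raised with context instead of hidden.
    try:
        return loader()
    except Exception as exc:
        raise RuntimeError("load failed") from exc
```

With 101 items and a page size of 20, the buggy helper reports 5 pages and the fixed one reports 6. The swallowing wrapper returns None for a failed load, which is the behavior that hides failures in production.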

This is the change that matters most for daily work, and it is the hardest to capture in a single number. A model that reliably tells you when something is wrong is worth more than a model that is marginally smarter but agreeable. The whole point of AI code review is catching what you missed. A model that rubber-stamps your mistakes is just a more expensive way to feel confident about broken code.

What the Benchmarks Actually Say

Anthropic published the usual comparison chart showing Opus 4.8 ahead of 4.7 across coding, agentic skills, reasoning, and practical knowledge work. The improvements are real but mostly incremental on the pure-coding side. The bigger jumps are in agentic and tool-use territory.

Here are the numbers worth knowing.

Benchmark	What it measures	Opus 4.8 result
Online-Mind2Web	Computer use, real web tasks	84% (ahead of 4.7 and GPT-5.5)
Legal Agent Benchmark	All-pass legal reasoning	First model over 10% on the all-pass standard
Code flaw detection	Catching bugs in review	~4x fewer missed flaws vs 4.7
Tool calling	Steps to complete a task	Fewer steps for equivalent intelligence

The Online-Mind2Web score is the one I would not have predicted. Computer use, the ability to drive a real browser and complete multi-step web tasks, has been the weakest part of every frontier model I have used. 84% is the first time the number has been high enough that I would actually trust it for low-stakes automation. It is still not something I would point at my bank, but for filling forms, navigating dashboards, and pulling data out of web apps that lack an API, it crossed the line from demo to useful.

The Legal Agent Benchmark result is a niche flex, but it signals something broader. Breaking 10% on an all-pass standard, where the model has to get every sub-task in a legal workflow correct or the whole thing fails, means the error rate on long multi-step chains dropped enough to matter. That same reliability shows up in coding agents that have to chain twenty tool calls without going off the rails halfway through.
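A back-of-the-envelope way to see why all-pass scoring is so punishing: if each sub-task succeeds independently with the same probability, the chance of clearing all of them is that probability raised to the number of steps. The sketch below assumes independence and uses made-up per-step rates. Neither the rates nor the 20-step length come from the benchmark itself.

```python
def all_pass_rate(per_step_success: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_success ** steps


# Illustrative per-step rates over a 20-step chain (assumed numbers).
for rate in (0.90, 0.95, 0.99):
    print(f"{rate:.2f} per step -> {all_pass_rate(rate, 20):.3f} all-pass")
```

Under those assumptions, moving per-step reliability from 0.95 to 0.99 lifts the 20-step all-pass rate from roughly 0.36 to roughly 0.82, which is why small per-step gains matter so much on long chains.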

Dynamic Workflows: The Feature I Did Not Know I Needed

The flashiest addition is Dynamic Workflows, shipping as a research preview in Claude Code. The pitch is that Claude can now spin up hundreds of parallel subagents and coordinate them on a single task. The headline use case is codebase-scale migrations, the kind that touch hundreds of thousands of lines.

I was skeptical. Parallel subagents have been a thing for a while, and in practice they tend to step on each other, duplicate work, or produce inconsistent results that take longer to reconcile than doing the work serially would have. So I tried it on a real job: migrating a mid-sized project from one date library to another, across about 60 files with inconsistent usage patterns.

The old way, even with agentic coding, was a slog. One agent, one file at a time, me babysitting context and re-explaining the pattern every few files as the conversation drifted.

Dynamic Workflows handled it differently. It scanned the codebase, grouped the files by usage pattern, fanned out a batch of subagents to transform each group in isolation, and then ran a verification pass to reconcile the edits. The whole thing finished in one sitting. Not every file was perfect. I caught two cases where it picked the wrong replacement function. But the wall-clock time was a fraction of the serial approach, and the consistency across files was better than I get when I do migrations by hand and forget my own convention by file 40.

The honest read is that this is genuinely new leverage for a specific kind of work. Large mechanical migrations, sweeping refactors, repo-wide audits. It is not magic for creative architecture decisions, and you still have to review everything it touches. But for the work that used to eat a full day of tedious repetition, it is the first tool that made me feel like the agent was actually operating at the scale of the codebase rather than the scale of a single file.

If you have wrestled with agent reliability at scale, the interesting part is how the verification pass cleans up after the fan-out. The subagents are allowed to be imperfect because a final reconciliation step catches the divergence. That is a better architecture than hoping every parallel agent gets it right independently.
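The fan-out-then-verify shape can be sketched in a few lines of Python. This is not how Dynamic Workflows is implemented, which the announcement does not describe at this level. The "subagents" here are plain functions doing string replacement, and every name (REPLACEMENTS, classify, transform_group, verify, migrate) is hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical old-call -> new-call mapping for a library migration.
REPLACEMENTS = {"old_format(": "new_format(", "old_parse(": "new_parse("}


def classify(source: str) -> str:
    # Group a file by the first legacy call it contains (deliberately naive).
    for old in REPLACEMENTS:
        if old in source:
            return old
    return "none"


def transform_group(pattern: str, paths: list[str], files: dict[str, str]) -> dict[str, str]:
    # Stand-in for one subagent: it only knows about its own group's pattern.
    if pattern == "none":
        return {p: files[p] for p in paths}
    return {p: files[p].replace(pattern, REPLACEMENTS[pattern]) for p in paths}


def verify(files: dict[str, str]) -> list[str]:
    # Reconciliation pass: flag any file that still contains a legacy call.
    return sorted(p for p, src in files.items() if any(old in src for old in REPLACEMENTS))


def migrate(files: dict[str, str]) -> tuple[dict[str, str], list[str]]:
    groups: dict[str, list[str]] = defaultdict(list)
    for path, src in files.items():
        groups[classify(src)].append(path)

    migrated: dict[str, str] = {}
    with ThreadPoolExecutor() as pool:
        # Fan out: each group is transformed in isolation.
        futures = [pool.submit(transform_group, pat, paths, files) for pat, paths in groups.items()]
        for future in futures:
            migrated.update(future.result())
    return migrated, verify(migrated)
```

The design point is that a file using both legacy calls lands in only one group, so its "subagent" leaves the other call untouched. The verification pass flags that file for a follow-up instead of trusting the fan-out to be right.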

Effort Control Comes to the Consumer Apps

Opus 4.7 introduced the xhigh effort level for developers. Opus 4.8 takes the idea and exposes it directly to users on claude.ai and Cowork through a setting called Effort Control. You pick how much compute Claude applies to a request. Higher effort means deeper thinking, more tokens spent, slower but more thorough answers.

By default, 4.8 runs at high effort. Anthropic tuned the default so it spends roughly the same number of tokens as 4.7's default while delivering better results, which is the kind of efficiency win that does not show up in a headline benchmark but shows up on your bill.

In practice, I leave it on high for almost everything and bump it up only for genuinely hard problems. A gnarly debugging session where the bug spans three systems, an architecture decision with real tradeoffs, a piece of analysis where I want the model to actually sit with the problem. For quick edits and lookups, high is already more than enough, and dropping the effort makes the response snappier without a quality hit I can notice.

The thing I appreciate is that this makes the cost and latency tradeoff explicit instead of hidden. You are no longer guessing whether the model is thinking hard. You are deciding.

Pricing: Nothing Changed, and That Is the Story

Opus 4.8 costs the same as 4.7. Five dollars per million input tokens, twenty-five per million output. The model got better and the price stayed flat.

That is worth pausing on. We have gotten so used to capability going up while price holds or drops that it barely registers as news anymore. But it is the entire reason the economics of building AI features keep improving. Every release that holds price while raising the capability floor means the same product gets cheaper to run in real terms, because you can do more with fewer tokens or accomplish a task that previously needed a more expensive workaround.

Tier	Input (per 1M)	Output (per 1M)
Standard	$5	$25
Fast mode	$10	$50

The Fast mode pricing is the genuinely new line. At $10 input and $50 output, double the standard rate, it runs at about 2.5x the speed of standard, and Anthropic says it is roughly 3x cheaper than the previous fast mode. For latency-sensitive paths, where you previously had to drop down to a smaller model and accept the quality hit, you can now keep Opus-class quality and just pay a premium for speed. That changes the calculus for anything user-facing where response time affects conversion.
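To put the two tiers side by side on a concrete workload, here is a small cost calculator in Python. The prices are the ones in the table above. The helper name request_cost and the example token counts are made up for illustration.

```python
# Per-million-token prices in USD, taken from the table above.
PRICES = {
    "standard": {"input": 5.0, "output": 25.0},
    "fast": {"input": 10.0, "output": 50.0},
}


def request_cost(tier: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of a workload on the given tier."""
    price = PRICES[tier]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


# Hypothetical workload: 2M input tokens and 500k output tokens.
for tier in PRICES:
    print(tier, request_cost(tier, 2_000_000, 500_000))
```

For that workload the standard tier comes to $22.50 and Fast mode to $45.00, so the question is whether the roughly 2.5x speedup is worth doubling the bill on that path.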

If you are still mapping out your spend across plans and API usage, my Claude pricing survival guide walks through how to think about the tradeoffs, and the fast-mode change tilts a few of those decisions.

A Quiet API Change With Real Consequences

Buried in the announcement is a Messages API change that most people will skim past. The API now accepts system entries mid-conversation without breaking prompt caching.

If you have built anything serious on the Claude API, you know why this matters. Prompt caching is how you keep costs sane on long conversations and agent loops. Under the old behavior, injecting a new system instruction partway through a conversation invalidated the cache from that point forward, which meant you ate the full cost of reprocessing the prefix.

Being able to slot in a system entry mid-conversation without busting the cache means you can steer the model dynamically, injecting fresh instructions or context as a task evolves, without paying the caching penalty every time. For agent architectures that adjust their own instructions based on what they discover, this removes a real cost cliff. It is the kind of plumbing change that does not get a benchmark but quietly makes a whole class of designs cheaper to run.
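A rough cost model shows the size of the cliff. The sketch below is a simplification under stated assumptions, not Anthropic's billing logic: it uses the $5 per million input price from this article, assumes cached reads are billed at one tenth of that (an illustrative multiplier, not a figure from the announcement), ignores any cache-write surcharge, treats the prefix as fixed in size, and models the worst case where every turn after the first either hits the cache or reprocesses the whole prefix.

```python
INPUT_PRICE_PER_M = 5.0      # USD, standard input price quoted in this article
CACHE_READ_MULTIPLIER = 0.1  # ASSUMPTION: illustrative discount for cached reads


def prefix_cost(prefix_tokens: int, cache_hit: bool) -> float:
    """USD cost of processing the shared prefix for one turn."""
    rate = INPUT_PRICE_PER_M * (CACHE_READ_MULTIPLIER if cache_hit else 1.0)
    return prefix_tokens / 1_000_000 * rate


def loop_cost(prefix_tokens: int, turns: int, cache_survives: bool) -> float:
    """Prefix cost across an agent loop that steers the model every turn."""
    if turns < 1:
        raise ValueError("turns must be at least 1")
    # The first turn always processes the prefix uncached.
    first = prefix_cost(prefix_tokens, cache_hit=False)
    # Later turns are cheap only if the injected instruction leaves the cache intact.
    rest = (turns - 1) * prefix_cost(prefix_tokens, cache_hit=cache_survives)
    return first + rest
```

With a 200,000-token prefix over 10 turns, this model gives about $1.90 when the cache survives each injection and $10.00 when it does not. The exact ratio depends on the real cache pricing, but the shape of the saving is the point.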

This pairs well with the broader push toward context engineering as the discipline that actually separates good agent performance from bad. The cheaper it is to manage context dynamically, the more aggressively you can do it.

How It Stacks Up Against the Competition

The frontier model race has not fundamentally reshuffled. When I last did a full Claude vs GPT vs Gemini breakdown, the picture was that the models converge on baseline capability and diverge on specific strengths. Opus 4.8 widens Claude's lead in the places it was already strong rather than opening a new front.

On coding, Claude was already my default, and 4.8 reinforces that rather than dramatically extending it. The pure code-generation improvement over 4.7 is modest. The real gap-widener is the reliability and self-correction, the roughly 4x drop in missed flaws, which competitors have not matched in my testing.

On computer use, the 84% on Online-Mind2Web puts Opus 4.8 ahead of GPT-5.5 on that specific benchmark, which is notable because browser automation has been an area where the gaps between frontier models were small and noisy. A clear lead there is new.

On reasoning and multimodal breadth, the competitive story has not changed much. If raw reasoning scores or native audio and video are your priority, the calculus from a few months ago still holds. Opus 4.8 did not show up to win those categories.

The summary I would give a teammate: if you do agentic work, coding, or any task where the model has to chain many steps and catch its own mistakes, Opus 4.8 extended an existing lead. If your work lives in the categories where Claude was already the second choice, this release is not the one that changes your mind.

Should You Upgrade From 4.7?

Here is how I would think about it depending on where you sit.

If you use Claude Code on a Pro or Max plan: You already have access. Switch to 4.8 and run it on your current work. The self-correction and reliability improvements are real and the transition is seamless. Try Dynamic Workflows on the next migration or sweeping refactor you have been putting off, since that is where the new leverage actually shows up.

If you run Opus 4.7 in production via the API: The swap to claude-opus-4-8 is low risk because the pricing and core behavior are stable, but test anyway. The improved instruction-following and the more aggressive flaw detection can change outputs in ways your downstream code might not expect, especially if you parse the model's review comments. If you have an eval suite, run it before you flip the model ID. This is exactly the kind of update your evals exist to catch.

If you are on GPT-5.5 or Gemini for primary work: The coding, tool-use, and computer-use gaps just widened in Claude's favor. If you have been on the fence about Claude for agentic development, this is the strongest case yet. If reasoning depth or multimodal breadth is your main concern, the competitive picture has not moved enough to force a switch.

If you are new to AI coding tools: Start with Claude Code and Opus 4.8. The combination of strong coding, better long-session reliability, explicit effort control, and a model that actually tells you when your code is wrong makes it the most forgiving entry point for getting serious about AI-assisted development.
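For the API case above, "run your evals before you flip the model ID" can be as simple as a gate like the following. This is a minimal sketch, not a real eval framework: call_model is a hypothetical callable you would write around your own API client, and the pass criterion (the candidate must not score below the current model) is one assumption among many reasonable ones.

```python
from typing import Callable

# A case is a prompt plus a check that inspects the model's raw output.
EvalCase = tuple[str, Callable[[str], bool]]
# Hypothetical signature: (model_id, prompt) -> model output text.
ModelCaller = Callable[[str, str], str]


def pass_rate(call_model: ModelCaller, model_id: str, cases: list[EvalCase]) -> float:
    """Fraction of eval cases whose check passes for the given model."""
    if not cases:
        raise ValueError("need at least one eval case")
    passed = sum(1 for prompt, check in cases if check(call_model(model_id, prompt)))
    return passed / len(cases)


def safe_to_switch(
    call_model: ModelCaller,
    current_id: str,
    candidate_id: str,
    cases: list[EvalCase],
    tolerance: float = 0.0,
) -> bool:
    """True if the candidate's pass rate is within tolerance of the current model's."""
    current = pass_rate(call_model, current_id, cases)
    candidate = pass_rate(call_model, candidate_id, cases)
    return candidate >= current - tolerance
```

If your downstream code parses review comments, the checks are the place to encode that: assert on the structure your parser expects, not just on whether the answer looks right.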

What Anthropic Is Hinting At Next

The announcement closes with two forward-looking notes that are worth reading carefully.

First, Anthropic says lower-cost Opus-equivalent models are in development. If that lands, it pulls Opus-class capability down into a price bracket where you could run it on high-volume, cost-sensitive paths that currently force a downgrade to a smaller model. That would be a bigger deal for production economics than anything in this release.

Second, and more loaded, Mythos-class models are coming to all customers in the coming weeks, pending cybersecurity safeguards. I wrote about Claude Mythos when it was a restricted research preview that scored 93.9% on SWE-bench and 100% on Cybench, the model Anthropic decided was too capable to release. The fact that a Mythos-class model is now being lined up for general availability, gated on safety work rather than capability, is the most interesting sentence in the entire announcement. That shoe has since dropped: Claude Fable 5 is the Mythos-class model they decided to ship, and it is exactly the step change this paragraph was bracing for.

It tells you the gap between what these labs can build and what they choose to ship is still the binding constraint, and that the constraint is loosening. Opus 4.8 is an excellent model that Anthropic is comfortable handing to everyone today. Something meaningfully more capable is being prepared for release the moment the safety story is solid enough.

The Bigger Picture

What strikes me about Opus 4.8 is not any single benchmark. It is the shape of the improvement.

The last few releases chased raw capability. Higher SWE-bench, sharper vision, longer context. 4.8 spent its gains differently. It got more honest, more reliable across long chains, better at catching its own mistakes, and more efficient with tokens, all while holding the price flat and adding a way to coordinate work at the scale of an entire codebase.

That is what a maturing tool looks like. Not a model that is dramatically smarter than the one before it, but one that is more trustworthy in the moments that actually cost you time. A model that flags the bug instead of waving it through. A model that finishes the migration instead of drifting halfway through. A model that lets you decide how hard it should think instead of guessing.

For the work I do every day, that is more valuable than another few points on a coding benchmark. Opus 4.7 was the best model I had used for shipping software. Opus 4.8 is better in the specific places that matter when you are the one responsible for what ships.

If you are already on 4.7, the upgrade is easy and the wins are real. Switch, throw Dynamic Workflows at the migration you have been dreading, and see how it feels to have a model that argues with you when you are wrong. That is the part that does not show up in the chart, and it is the part you will notice first.
