🔥 Claude Opus 4.8 Beats GPT-5.5 [69.2% SWE-Bench Pro] (+ 🚨 Claude Opus 4.8: 61% Cheaper Agent Workflows)

Claude Opus 4.8 dropped yesterday and the benchmark gap vs GPT-5.5 is bigger than I expected — breakdown inside

Anthropic shipped Claude Opus 4.8 on May 28, 2026, and since I've been tracking both models pretty closely I figured I'd put together a proper breakdown of where things actually stand. This is going to be long but I'll structure it so you can skip to the parts you care about.

No hype, no affiliation with either lab. Just the numbers and what I think they mean for people actually building on these models.

Watch the review here: https://www.youtube.com/watch?v=3paM-hK_OAw

The short version if you don't want to read all of this

Opus 4.8 beats GPT-5.5 on most benchmarks that matter for codebase-level work — SWE-Bench Pro, computer use, reasoning, knowledge work. GPT-5.5 still wins on terminal-centric coding and is significantly cheaper on input tokens at short contexts. Neither model is universally better. The right answer depends entirely on what you're building.

What actually changed from Opus 4.7

Before getting into the head-to-head, worth noting what Anthropic claims improved:

4x less likely to ship code with unflagged flaws — this is the one that caught my attention. Silent failures in agentic pipelines are brutal and expensive. If this holds up in production, it's the most practical improvement in the release.
SWE-Bench Pro went from 64.3% to 69.2% — nearly 5 points. That's a meaningful jump, not just noise.
GDPval-AA Elo climbed 137 points (1753 → 1890) — more on what that benchmark actually measures below.
Terminal-Bench 2.1 improved from 66.1% to 74.6% — but GPT-5.5 is still at 78.2%, so the gap narrowed but didn't close.
New effort controls — you can now set the model to default / high / extra / max effort per request. The 1890 GDPval-AA score is at max effort. Lower settings cost less for simpler tasks.
Dynamic Workflows (research preview, Enterprise/Team/Max plans) — can spin up hundreds of parallel subagents for large tasks. More on this later.
Fast Mode — same model, ~2.5x faster, available at $10/$50 per million tokens. Anthropic says this is 3x cheaper than the previous fast tier. Activate with /fast in Claude Code.
Pricing unchanged from 4.7: $5 input / $25 output per million tokens.

The benchmark breakdown

Let me go through each benchmark category and explain what it actually measures, because I've noticed a lot of coverage just lists the numbers without context.

SWE-Bench Pro — real codebase issue resolution

This is the one everyone leads with and for good reason. SWE-Bench Pro measures whether a model can resolve actual GitHub issues across real repositories — not toy examples, actual multi-file patches.

| Model | Score |
|---|---|
| Claude Opus 4.8 | 69.2% |
| Claude Opus 4.7 | 64.3% |
| GPT-5.5 | 58.6% |
| Gemini 3.1 Pro | 54.2% |

The 10.6-point gap over GPT-5.5 is significant. That said — one caveat that a lot of coverage glosses over: scaffolding and harness choice can move SWE-Bench scores by several points. GPT-5.5 does its best work inside Codex CLI. If you hold the harness constant, the gap is real but possibly narrower than the raw numbers suggest. Evaluate on your own codebase before treating this as settled.
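One way to act on that caveat is to run both models on the same task set under one harness, then check whether the pass-rate gap survives sampling noise. The sketch below computes a Wilson score interval for a pass rate. The counts (69/100 and 59/100) are invented for illustration and are not benchmark data, and `wilson_interval` is a hypothetical helper name.

```python
import math


def wilson_interval(passed: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (z=1.96 is roughly 95% confidence)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passed / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


# Hypothetical results from a 100-task internal eval, same harness for both models.
model_a = wilson_interval(passed=69, total=100)
model_b = wilson_interval(passed=59, total=100)

# Overlapping intervals mean the gap may not be distinguishable from noise
# at this sample size (a rough check, not a formal significance test).
intervals_overlap = model_a[0] <= model_b[1] and model_b[0] <= model_a[1]
print(model_a, model_b, intervals_overlap)
```

With only 100 tasks the two intervals overlap, so a 10-point gap on a small private eval is weaker evidence than it looks. More tasks tighten the intervals.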

Terminal-Bench 2.1 — CLI-driven agentic coding

This is where GPT-5.5 genuinely wins. Terminal-Bench measures agentic coding through a terminal: running commands, inspecting output, iterating in a shell loop.

| Model | Score |
|---|---|
| GPT-5.5 | 78.2% |
| Claude Opus 4.8 | 74.6% |
| Gemini 3.1 Pro | 70.3% |
| Claude Opus 4.7 | 66.1% |

Worth noting: GPT-5.5 hits 83.4% under its own Codex CLI harness. So the actual performance gap for terminal-heavy workflows might be larger than the 3.6-point headline suggests, depending on your setup.

Opus 4.8 improved here (up from 66.1%) but GPT-5.5 is still the right call if your pipeline is terminal-centric.

Humanity's Last Exam — multidisciplinary reasoning

This benchmark is a good signal for how a model handles genuinely hard, cross-domain problems at the edge of what frontier models can answer.

| Model | No tools | With tools |
|---|---|---|
| Claude Opus 4.8 | 49.8% | 57.9% |
| GPT-5.5 | 41.4% | 52.2% |

The no-tools gap (8.4 points) is the cleaner signal because it isolates raw reasoning from tool use. This matters beyond Q&A — strong reasoning tends to translate into better code planning, edge case identification, and fewer wrong-but-confident outputs.

OSWorld-Verified — agentic computer use

Measures whether a model can actually operate a computer: navigating UIs, using applications, completing real desktop tasks.

| Model | Score |
|---|---|
| Claude Opus 4.8 | 83.4% |
| Claude Opus 4.7 | 82.8% |
| GPT-5.5 | 78.7% |
| Gemini 3.1 Pro | 76.2% |

Opus 4.8 leads, but only narrowly over Opus 4.7 (0.6 points). The jump over GPT-5.5 is more meaningful (4.7 points) if you're building automation that involves browser or desktop interaction.

GDPval-AA — the one nobody explains properly

Almost every article I've read mentions this benchmark and then just lists the Elo scores. Let me actually explain it.

GDPval-AA is developed by Artificial Analysis using an open-source evaluation harness called Stirrup. It simulates economically valuable enterprise tasks with web and shell access: the kind of real-world work that a deployed agent would actually be asked to do. The "AA" in the name stands for Artificial Analysis, which runs the evaluation. Scores are Elo-based, as in chess rating systems, so the gap between two scores reflects the win probability in head-to-head comparisons.

| Model | GDPval-AA Elo |
|---|---|
| Claude Opus 4.8 | 1890 |
| GPT-5.5 | 1769 |
| Claude Opus 4.7 | 1753 |

A 121-point Elo gap translates to roughly a 67% win rate for Opus 4.8 in head-to-head task comparisons against GPT-5.5. That's a meaningful lead on real-world work tasks specifically.
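For anyone who wants to check the conversion, here is a minimal sketch. It assumes GDPval-AA uses the standard chess-style logistic curve with a 400-point scale, which the leaderboard may or may not follow exactly; `elo_win_probability` is just an illustrative name.

```python
def elo_win_probability(rating_a: float, rating_b: float) -> float:
    """Expected head-to-head score of A vs B under the standard Elo logistic curve.

    Assumes the usual 400-point scale; a leaderboard with a different scale
    would give different probabilities for the same rating gap.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Ratings as listed in the table above.
opus_48_vs_gpt_55 = elo_win_probability(1890, 1769)
opus_48_vs_opus_47 = elo_win_probability(1890, 1753)

print(f"Opus 4.8 vs GPT-5.5:  {opus_48_vs_gpt_55:.1%}")
print(f"Opus 4.8 vs Opus 4.7: {opus_48_vs_opus_47:.1%}")
```

Under that assumption the 121-point gap works out to roughly two wins in three, which matches the ~67% figure.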

One nuance: Opus 4.8 achieves this score while using about 30% more turns per task than GPT-5.5. So it gets better results but isn't as efficient per interaction. For cost-sensitive enterprise deployments that's worth watching.

Finance Agent v2 — financial analysis

| Model | Score |
|---|---|
| Claude Opus 4.8 | 53.9% |
| GPT-5.5 | 51.8% |

Narrow margin. Directional win for Opus 4.8 but this one is close enough that your own use case should drive the decision.

Long-context retrieval — GraphWalks

This one gets overlooked but it matters for teams working with large codebases or multi-document pipelines. Opus 4.8 leads GPT-5.5 by 12–25 points at 256K–1M context lengths. Combined with the pricing structure (more on this below), long-context workloads are where Opus 4.8's advantage compounds.

Pricing — the part that changes the math more than people realize

On the surface: Opus 4.8 is $5/$25 per million input/output tokens. GPT-5.5 is roughly $1.25/$10 per million (input is about 4x cheaper, output is about 2.5x cheaper at standard tier).

That sounds like a big GPT-5.5 win on cost. But there's a caveat that most comparisons I've seen don't address clearly:

GPT-5.5 has a long-context surcharge above 272K input tokens. Above that threshold, you pay roughly 2x input and 1.5x output for the entire session — not just the tokens above the threshold.

Opus 4.8 is flat-priced at $5/$25 regardless of context length, up to its 1M token window.

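For teams regularly processing 272K+ token contexts, the gap narrows sharply: with the surcharge, GPT-5.5 lands at roughly $2.50/$15 against Opus 4.8's flat $5/$25. At list rates that is still cheaper per token, so a flip in Opus 4.8's favor depends on cache hits ($0.50 input) or batch pricing ($2.50/$12.50) covering a large share of your traffic. Run the numbers on your actual workload before assuming either model is cheaper.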
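A rough way to run those numbers, using only the rates quoted in this post. Assumptions to flag: the GPT-5.5 figures are approximate, the surcharge is modeled per request rather than per session, GPT-5.5 caching is left out because its rate is unknown here, and the function names are hypothetical.

```python
GPT_LONG_CONTEXT_THRESHOLD = 272_000  # input tokens, per the post
M = 1_000_000


def opus_cost(input_tokens: int, output_tokens: int, cached_fraction: float = 0.0) -> float:
    """USD cost at flat $5/$25, with an optional share of input at the $0.50 cache-hit rate."""
    fresh = input_tokens * (1 - cached_fraction)
    cached = input_tokens * cached_fraction
    return (fresh * 5.00 + cached * 0.50 + output_tokens * 25.00) / M


def gpt_cost(input_tokens: int, output_tokens: int) -> float:
    """USD cost at ~$1.25/$10, with ~2x/1.5x on the whole request above the threshold."""
    over = input_tokens > GPT_LONG_CONTEXT_THRESHOLD
    in_rate = 1.25 * (2.0 if over else 1.0)
    out_rate = 10.00 * (1.5 if over else 1.0)
    return (input_tokens * in_rate + output_tokens * out_rate) / M


# Compare one request below and one above the 272K threshold, 10K output tokens each.
for tokens_in in (200_000, 300_000):
    print(tokens_in, round(opus_cost(tokens_in, 10_000), 3), round(gpt_cost(tokens_in, 10_000), 3))

# Same 300K request, assuming 90% of the input is served from cache on the Opus side.
print(round(opus_cost(300_000, 10_000, cached_fraction=0.9), 3))
```

At these list rates the surcharge narrows the gap without closing it. The Opus side only comes out ahead in this sketch once most of the input is billed at the cache-hit rate, so your cache-hit ratio is the number to measure first.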

Full pricing breakdown:

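| Tier | Opus 4.8 Input | Opus 4.8 Output | GPT-5.5 Input | GPT-5.5 Output |
|---|---|---|---|---|
| Standard | $5.00 | $25.00 | ~$1.25 | ~$10.00 |
| Fast Mode | $10.00 | $50.00 | N/A | N/A |
| Cache hit | $0.50 | $25.00 | varies | varies |
| Batch API | $2.50 | $12.50 | varies | varies |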

Opus 4.8 Fast Mode is 2.5x the speed at 2x the price — and Anthropic says this is 3x cheaper than the previous Opus fast tier if you were already using that.

Dynamic Workflows — what it is and why it matters

This shipped as a research preview (Enterprise/Team/Max plans only). Here's what it actually does:

When you hand Claude Code a task that's too big for a single agent to handle linearly — think a migration touching hundreds of files, a large refactor, or a full test suite — Dynamic Workflows lets the model:

Generate a plan for the work
Spin up hundreds of parallel subagents to execute different parts of it simultaneously
Verify the results from subagents before reporting back

The practical implication: tasks that would previously take hours of sequential work (or require you to manually chunk the problem) can run as parallel workstreams. Early use cases mentioned include codebase migrations and large PR sweeps.
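To make the plan / fan-out / verify shape concrete, here is a generic sketch of the pattern using Python's standard thread pool. It is not Anthropic's implementation: every function here is a hypothetical stand-in, and the "subagent" just renames strings instead of calling a model.

```python
from concurrent.futures import ThreadPoolExecutor


def plan(files: list[str], chunk_size: int) -> list[list[str]]:
    """Step 1: split the work into independent chunks."""
    return [files[i:i + chunk_size] for i in range(0, len(files), chunk_size)]


def run_subagent(chunk: list[str]) -> dict:
    """Step 2: stand-in for a subagent; a real one would call a model and edit files."""
    return {"files": chunk, "migrated": [f"{name}.migrated" for name in chunk]}


def verify(result: dict) -> bool:
    """Step 3: accept a result only if every input file produced an output."""
    return len(result["migrated"]) == len(result["files"])


def run_workflow(files: list[str], chunk_size: int = 2, max_workers: int = 4):
    chunks = plan(files, chunk_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(run_subagent, chunks))  # map preserves chunk order
    verified = [r for r in results if verify(r)]
    rejected = [r for r in results if not verify(r)]
    return verified, rejected


verified, rejected = run_workflow(["a.py", "b.py", "c.py", "d.py", "e.py"])
print(len(verified), len(rejected))
```

The point of the separate verify step is that a parallel run only reports back what passed a check, and anything rejected can be retried or escalated instead of being silently merged.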

It's research preview, so treat it as "promising but not production-ready for critical work yet."

Effort controls — the feature nobody is talking about

This is the one I think is most underrated in the current coverage. Opus 4.8 introduces per-request effort settings:

| Setting | Relative cost | Recommended for |
|---|---|---|
| Default | Baseline | Simple queries, quick lookups |
| High | ~Standard Opus 4.8 pricing | Most production agentic tasks |
| Extra | Higher | Complex multi-step reasoning |
| Max | Highest | Benchmark-grade tasks, hardest problems |

The 1890 GDPval-AA score is at max effort. Your production workload is probably not max effort by default. This matters for cost modeling — if you're comparing pricing and assuming max effort across all requests, you're overestimating your actual costs.

Where GPT-5.5 is actually better — being honest about it

I've seen a lot of Opus 4.8 coverage that either downplays this or buries it. Here's where GPT-5.5 has genuine advantages:

Terminal-Bench 2.1 (78.2% vs 74.6%, and 83.4% under Codex CLI): If your pipeline is terminal-driven — shell loops, CLI tooling, iterating in a terminal environment — GPT-5.5 is the better model right now. This isn't a rounding error.

Input pricing at short contexts: At under 272K tokens per session, GPT-5.5 input is roughly 4x cheaper. For high-volume workloads that don't need long context, that's a significant cost difference.

Native audio: GPT-5.5 supports audio input natively. Opus 4.8 doesn't. If you have voice-in-the-loop requirements, this isn't close.

Turn efficiency: Opus 4.8 uses ~30% more turns per GDPval-AA task than GPT-5.5 to achieve its higher score. More turns = more latency and more tokens consumed per task.

The routing question — which model for what

Based on everything above, here's how I'd think about it:

Use Opus 4.8 if:

Your primary workload is codebase-resolution coding (PRs, bug sweeps, feature builds across a repo)
You regularly work with 272K+ token contexts
You need top-tier multidisciplinary reasoning
You're already on Opus 4.7 (migration is config-only, with no breaking API changes)
You want Dynamic Workflows for complex parallel tasks
Computer use/desktop automation is part of your pipeline

Use GPT-5.5 if:

Your pipeline is heavily terminal-centric
You have audio in the loop
Your context windows stay well under 272K tokens
Input cost is the binding constraint and your tasks are high-volume and short-context

Consider routing to both if:

You have a mixed workload — send codebase-resolution and reasoning tasks to Opus 4.8, terminal-driven tasks to GPT-5.5
You're cost-optimizing: route the expensive, complex tasks to Opus 4.8 and the high-volume, simpler ones to GPT-5.5 (or even DeepSeek V4 at $0.27/$1.10 for commodity work)
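The routing rules above can be sketched as a small function. Treat this as one possible reading of the lists, not a recommendation engine: the `Task` fields, the task kinds, and the model strings are all hypothetical placeholders (not real API model IDs), and the order of the checks is my own judgment call.

```python
from dataclasses import dataclass

LONG_CONTEXT_THRESHOLD = 272_000  # input tokens, from the pricing section

OPUS = "claude-opus-4.8"  # placeholder label, not an API model ID
GPT = "gpt-5.5"           # placeholder label, not an API model ID


@dataclass
class Task:
    kind: str               # e.g. "codebase", "terminal", "reasoning", "computer_use"
    input_tokens: int
    needs_audio: bool = False


def route(task: Task) -> str:
    """Pick a model label for a task, following the routing lists in the post."""
    if task.needs_audio:
        return GPT   # native audio input is GPT-5.5 only, per the post
    if task.input_tokens >= LONG_CONTEXT_THRESHOLD:
        return OPUS  # flat long-context pricing and stronger long-context retrieval
    if task.kind == "terminal":
        return GPT   # Terminal-Bench 2.1 lead
    if task.kind in {"codebase", "reasoning", "computer_use"}:
        return OPUS
    return GPT       # high-volume, short-context default


print(route(Task("terminal", 50_000)))
print(route(Task("codebase", 50_000)))
print(route(Task("terminal", 400_000)))
```

A cheaper commodity tier would slot in as one more branch before the final default.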

Caveats and things to keep in mind

All benchmark numbers from Anthropic's launch materials are vendor-reported. That doesn't mean they're wrong, but the party reporting the numbers has an interest in favorable results. Independent benchmarking on a shared evaluation setup hasn't happened yet. Treat the Anthropic-vs-competitors comparisons as directional signals, not settled conclusions.

The partner testimonials in the announcement are not independent benchmarks. The "only model to complete every case end-to-end on our Super-Agent benchmark" quote is from a proprietary, non-public evaluation. Worth noting but not comparable to published third-party results.

Harness choice matters a lot on SWE-Bench and Terminal-Bench. A well-tuned scaffold for GPT-5.5 (Codex CLI) or a different harness setup can narrow or widen gaps meaningfully. Always test on your actual setup before committing.

Roadmap context: Anthropic has mentioned that Mythos-class models (currently under tight restrictions) are coming to broader availability "in the coming weeks." If you're making a multi-month infrastructure commitment, factor that in.

TL;DR

Opus 4.8 beats GPT-5.5 on most benchmarks: SWE-Bench Pro (+10.6 pts), reasoning (+8.4 pts no tools), computer use, knowledge work
GPT-5.5 beats Opus 4.8 on: Terminal-Bench 2.1, input pricing, turn efficiency, native audio
Same price as Opus 4.7 ($5/$25 per million tokens)
New features worth watching: Dynamic Workflows (parallel subagents), effort controls, Fast Mode
Migration from Opus 4.7 is a config change with no breaking API changes
The right model depends entirely on your workload — there's no universal winner here

Happy to go deeper on any of this in the comments.

Data sourced from Anthropic's Opus 4.8 system card, Artificial Analysis GDPval-AA leaderboard, and BenchLM published benchmark data as of May 28–29, 2026. All figures are as reported at launch.