Claude Opus 4.8 dropped yesterday and the benchmark gap vs GPT-5.5 is bigger than I expected: breakdown inside
Anthropic shipped Claude Opus 4.8 on May 28, 2026, and since I've been tracking both models pretty closely I figured I'd put together a proper breakdown of where things actually stand. This is going to be long but I'll structure it so you can skip to the parts you care about.
No hype, no affiliation with either lab. Just the numbers and what I think they mean for people actually building on these models.
Watch the review here: https://www.youtube.com/watch?v=3paM-hK_OAw
Opus 4.8 beats GPT-5.5 on most benchmarks that matter for codebase-level work: SWE-Bench Pro, computer use, reasoning, knowledge work. GPT-5.5 still wins on terminal-centric coding and is significantly cheaper on input tokens at short contexts. Neither model is universally better. The right answer depends entirely on what you're building.
Before getting into the head-to-head, worth noting what Anthropic claims improved:
/fast in Claude Code.
Let me go through each benchmark category and explain what it actually measures, because I've noticed a lot of coverage just lists the numbers without context.
This is the one everyone leads with and for good reason. SWE-Bench Pro measures whether a model can resolve actual GitHub issues across real repositories: real multi-file patches, not toy examples.
| Model | Score |
| --- | --- |
| Claude Opus 4.8 | 69.2% |
| Claude Opus 4.7 | 64.3% |
| GPT-5.5 | 58.6% |
| Gemini 3.1 Pro | 54.2% |
The 10.6-point gap over GPT-5.5 is significant. That said, one caveat that a lot of coverage glosses over: scaffolding and harness choice can move SWE-Bench scores by several points. GPT-5.5 does its best work inside Codex CLI. If you hold the harness constant, the gap is real but possibly narrower than the raw numbers suggest. Evaluate on your own codebase before treating this as settled.
This is where GPT-5.5 genuinely wins. Terminal-Bench measures agentic coding through a terminal: running commands, inspecting output, iterating in a shell loop.
| Model | Score |
| --- | --- |
| GPT-5.5 | 78.2% |
| Claude Opus 4.8 | 74.6% |
| Gemini 3.1 Pro | 70.3% |
| Claude Opus 4.7 | 66.1% |
Worth noting: GPT-5.5 hits 83.4% under its own Codex CLI harness. So the actual performance gap for terminal-heavy workflows might be larger than the 3.6-point headline suggests, depending on your setup.
Opus 4.8 improved here (up from 66.1%) but GPT-5.5 is still the right call if your pipeline is terminal-centric.
This benchmark is a good signal for how a model handles genuinely hard, cross-domain problems at the edge of what frontier models can answer.
| Model | No tools | With tools |
| --- | --- | --- |
| Claude Opus 4.8 | 49.8% | 57.9% |
| GPT-5.5 | 41.4% | 52.2% |
The no-tools gap (8.4 points) is the cleaner signal because it isolates raw reasoning from tool use. This matters beyond Q&A: strong reasoning tends to translate into better code planning, edge case identification, and fewer wrong-but-confident outputs.
Measures whether a model can actually operate a computer: navigating UIs, using applications, completing real desktop tasks.
| Model | Score |
| --- | --- |
| Claude Opus 4.8 | 83.4% |
| Claude Opus 4.7 | 82.8% |
| GPT-5.5 | 78.7% |
| Gemini 3.1 Pro | 76.2% |
Opus 4.8 leads, but only narrowly over Opus 4.7 (0.6 points). The 4.7-point jump over GPT-5.5 is meaningful if you're building automation that involves browser or desktop interaction.
Almost every article I've read mentions this benchmark and then just lists the Elo scores. Let me actually explain it.
GDPval-AA is developed by Artificial Analysis using an open-source evaluation harness called Stirrup. It simulates economically valuable enterprise tasks with web and shell access, the kind of real-world work that a deployed agent would actually be asked to do. The "AA" in the name stands for Artificial Analysis. Scores are Elo-based, as in chess rating systems, where the gap between two scores reflects win probability in head-to-head comparisons.
| Model | GDPval-AA Elo |
| --- | --- |
| Claude Opus 4.8 | 1890 |
| GPT-5.5 | 1769 |
| Claude Opus 4.7 | 1753 |
A 121-point Elo gap translates to roughly a 67% win rate for Opus 4.8 in head-to-head task comparisons against GPT-5.5. That's a meaningful lead on real-world work tasks specifically.
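For anyone who wants to check the 67% figure, here's a minimal sketch of the conversion. It assumes GDPval-AA uses the standard chess-style logistic curve with a 400-point scale, which I haven't confirmed against the leaderboard's methodology. The function name is mine.

```python
def elo_win_probability(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard Elo logistic curve.

    Assumes the conventional 400-point scale; the leaderboard may differ.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Ratings taken from the table above.
opus_vs_gpt = elo_win_probability(1890, 1769)
print(f"Opus 4.8 vs GPT-5.5: {opus_vs_gpt:.1%}")  # about 66.7%
```

The same function works for any pair in the table, so you can see how quickly win probability flattens out as the rating gap shrinks.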
One nuance: Opus 4.8 achieves this score while using about 30% more turns per task than GPT-5.5. So it gets better results but isn't as efficient per interaction. For cost-sensitive enterprise deployments that's worth watching.
| Model | Score |
| --- | --- |
| Claude Opus 4.8 | 53.9% |
| GPT-5.5 | 51.8% |
Narrow margin. Directional win for Opus 4.8 but this one is close enough that your own use case should drive the decision.
This one gets overlooked but it matters for teams working with large codebases or multi-document pipelines. Opus 4.8 leads GPT-5.5 by 12 to 25 points at context lengths from 256K to 1M. Combined with the pricing structure (more on this below), long-context workloads are where Opus 4.8's advantage compounds.
On the surface: Opus 4.8 is $5/$25 per million input/output tokens. GPT-5.5 is roughly $1.25/$10 per million (input is about 4x cheaper, output is about 2.5x cheaper at standard tier).
That sounds like a big GPT-5.5 win on cost. But there's a caveat that most comparisons I've seen don't address clearly:
GPT-5.5 has a long-context surcharge above 272K input tokens. Above that threshold, you pay roughly 2x input and 1.5x output for the entire session, not just the tokens above the threshold.
Opus 4.8 is flat-priced at $5/$25 regardless of context length, up to its 1M token window.
For teams regularly processing 272K+ token contexts, the surcharge narrows GPT-5.5's price advantage considerably, and for some workloads the cost math may even flip in Opus 4.8's favor. Run the numbers on your actual workload before assuming GPT-5.5 is cheaper.
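Here's a rough sketch for running those numbers yourself. It uses only the list prices and surcharge multipliers quoted in this post (the GPT-5.5 figures are approximate), ignores caching and batch discounts, and applies the surcharge to the whole session once input crosses 272K. The `Pricing` class and `session_cost` function are my own illustrative names, not anything from either vendor's SDK.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Pricing:
    input_per_m: float                         # USD per 1M input tokens
    output_per_m: float                        # USD per 1M output tokens
    surcharge_threshold: Optional[int] = None  # input tokens; None means flat pricing
    input_multiplier: float = 1.0
    output_multiplier: float = 1.0


# Figures as quoted in this post; GPT-5.5 numbers are approximate.
OPUS_4_8 = Pricing(5.00, 25.00)
GPT_5_5 = Pricing(1.25, 10.00, surcharge_threshold=272_000,
                  input_multiplier=2.0, output_multiplier=1.5)


def session_cost(p: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for one session; the surcharge hits every token once triggered."""
    in_rate, out_rate = p.input_per_m, p.output_per_m
    if p.surcharge_threshold is not None and input_tokens > p.surcharge_threshold:
        in_rate *= p.input_multiplier
        out_rate *= p.output_multiplier
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000


for tokens_in in (200_000, 400_000):
    gpt = session_cost(GPT_5_5, tokens_in, 10_000)
    opus = session_cost(OPUS_4_8, tokens_in, 10_000)
    print(f"{tokens_in:>7} in / 10000 out: GPT-5.5 ${gpt:.2f}, Opus 4.8 ${opus:.2f}")
# 200000 in: GPT-5.5 $0.35, Opus 4.8 $1.25
# 400000 in: GPT-5.5 $1.15, Opus 4.8 $2.25
```

On list prices alone the surcharge shrinks the gap from about 3.6x to about 2x in this example rather than closing it, so any real flip would have to come from things this sketch leaves out, like cache hit rates, batch discounts, or how many sessions each model needs to finish the job.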
Full pricing breakdown:
| Tier | Opus 4.8 Input | Opus 4.8 Output | GPT-5.5 Input | GPT-5.5 Output |
| --- | --- | --- | --- | --- |
| Standard | $5.00 | $25.00 | ~$1.25 | ~$10.00 |
| Fast Mode | $10.00 | $50.00 | N/A | N/A |
| Cache hit | $0.50 | $25.00 | varies | varies |
| Batch API | $2.50 | $12.50 | varies | varies |
Opus 4.8 Fast Mode is 2.5x the speed at 2x the price, and Anthropic says this is 3x cheaper than the previous Opus fast tier if you were already using that.
This shipped as a research preview (Enterprise/Team/Max plans only). Here's what it actually does:
When you hand Claude Code a task that's too big for a single agent to handle linearly (think a migration touching hundreds of files, a large refactor, or a full test suite), Dynamic Workflows lets the model:
The practical implication: tasks that would previously take hours of sequential work (or require you to manually chunk the problem) can run as parallel workstreams. Early use cases mentioned include codebase migrations and large PR sweeps.
It's research preview, so treat it as "promising but not production-ready for critical work yet."
This is the one I think is most underrated in the current coverage. Opus 4.8 introduces per-request effort settings:
| Setting | Relative cost | Recommended for |
| --- | --- | --- |
| Default | Baseline | Simple queries, quick lookups |
| High | ~Standard Opus 4.8 pricing | Most production agentic tasks |
| Extra | Higher | Complex multi-step reasoning |
| Max | Highest | Benchmark-grade tasks, hardest problems |
The 1890 GDPval-AA score is at max effort. Your production workload is probably not max effort by default. This matters for cost modeling: if you're comparing pricing and assuming max effort across all requests, you're overestimating your actual costs.
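One way to keep max effort from becoming your accidental default is to pick the setting per task type in your own code. This is only a sketch of that idea: the task categories are hypothetical labels of mine, the mapping just mirrors the "Recommended for" column above, and the string values are placeholders rather than confirmed API parameter values. How the setting is actually passed on a request isn't shown here.

```python
from enum import Enum


class Effort(Enum):
    # Placeholder values; check the API docs for the real parameter format.
    DEFAULT = "default"
    HIGH = "high"
    EXTRA = "extra"
    MAX = "max"


# Hypothetical task categories, mapped per the "Recommended for" column.
EFFORT_BY_TASK = {
    "lookup": Effort.DEFAULT,
    "agentic": Effort.HIGH,
    "multi_step_reasoning": Effort.EXTRA,
    "hardest": Effort.MAX,
}


def pick_effort(task_kind: str) -> Effort:
    """Return the effort level for a task kind, falling back to HIGH."""
    return EFFORT_BY_TASK.get(task_kind, Effort.HIGH)


print(pick_effort("lookup").value)        # default
print(pick_effort("unknown_kind").value)  # high
```

Falling back to High for unrecognized task kinds is a judgment call based on the table listing it for most production agentic tasks. Adjust to whatever your traffic actually looks like.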
I've seen a lot of Opus 4.8 coverage that either downplays this or buries it. Here's where GPT-5.5 has genuine advantages:
Terminal-Bench 2.1 (78.2% vs 74.6%, and 83.4% under Codex CLI): If your pipeline is terminal-driven (shell loops, CLI tooling, iterating in a terminal environment), GPT-5.5 is the better model right now. This isn't a rounding error.
Input pricing at short contexts: At under 272K tokens per session, GPT-5.5 input is roughly 4x cheaper. For high-volume workloads that don't need long context, that's a significant cost difference.
Native audio: GPT-5.5 supports audio input natively. Opus 4.8 doesn't. If you have voice-in-the-loop requirements, this isn't close.
Turn efficiency: Opus 4.8 uses ~30% more turns per GDPval-AA task than GPT-5.5 to achieve its higher score. More turns = more latency and more tokens consumed per task.
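To see how turn count and token price stack up, here's a back-of-the-envelope sketch. It assumes both models use the same number of tokens per turn (almost certainly not exactly true) and uses the short-context list prices from this post. The function name is mine.

```python
def relative_task_cost(turn_ratio: float, price_ratio: float) -> float:
    """Per-task cost of model A relative to model B.

    Assumes equal tokens per turn, so cost scales with turns times price.
    """
    return turn_ratio * price_ratio


TURN_RATIO = 1.30  # Opus 4.8 uses about 30% more turns per GDPval-AA task

input_ratio = relative_task_cost(TURN_RATIO, 5.00 / 1.25)     # input prices
output_ratio = relative_task_cost(TURN_RATIO, 25.00 / 10.00)  # output prices

print(f"input: {input_ratio:.2f}x, output: {output_ratio:.2f}x")
# input: 5.20x, output: 3.25x
```

So under those assumptions the 4x input price gap becomes roughly 5.2x per task once the extra turns are counted. Whether the quality difference justifies that is the part you have to measure on your own tasks.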
Based on everything above, here's how I'd think about it:
Use Opus 4.8 if:
Use GPT-5.5 if:
Consider routing to both if:
All benchmark numbers from Anthropic's launch materials are vendor-reported. That doesn't mean they're wrong, but the party reporting the numbers has an interest in favorable results. Independent benchmarking on a shared evaluation setup hasn't happened yet. Treat the Anthropic-vs-competitors comparisons as directional signals, not settled conclusions.
The partner testimonials in the announcement are not independent benchmarks. The "only model to complete every case end-to-end on our Super-Agent benchmark" quote is from a proprietary, non-public evaluation. Worth noting but not comparable to published third-party results.
Harness choice matters a lot on SWE-Bench and Terminal-Bench. A well-tuned scaffold for GPT-5.5 (Codex CLI) or a different harness setup can narrow or widen gaps meaningfully. Always test on your actual setup before committing.
Roadmap context: Anthropic has mentioned that Mythos-class models (currently under tight restrictions) are coming to broader availability "in the coming weeks." If you're making a multi-month infrastructure commitment, factor that in.
Happy to go deeper on any of this in the comments.
Data sourced from Anthropic's Opus 4.8 system card, Artificial Analysis GDPval-AA leaderboard, and BenchLM published benchmark data as of May 28-29, 2026. All figures are as reported at launch.