Claude Opus 4.7 Just Dropped. Here's What the Benchmarks Actually Mean for Agent Builders.
Anthropic shipped Claude Opus 4.7 on April 16. The headline numbers are real: 64.3% on SWE-bench Pro, 87.6% on SWE-bench Verified, 78% on OSWorld, and a 14% improvement on complex multi-step workflows with one-third fewer tool errors. If you're building agents in production, these numbers matter — but not in the way the launch post suggests.
Here's what actually changed, what regressed, and what it means if you're choosing a model for agent infrastructure right now.
Where Opus 4.7 Genuinely Wins
Agentic coding is the clearest improvement. SWE-bench Pro went from 53.4% (Opus 4.6) to 64.3% — a 20% relative improvement in one model generation. GPT-5.4 sits at 57.7%. Gemini 3.1 Pro at 54.2%. For teams running agents that write, test, and ship code, Opus 4.7 is measurably the best option available right now.
Tool use got more reliable. MCP-Atlas — the benchmark that measures how well a model orchestrates multi-tool workflows — shows 77.3% for Opus 4.7, up from 75.8% for Opus 4.6, and well ahead of GPT-5.4 (68.1%) and Gemini 3.1 Pro (73.9%). The tool error rate dropped by a third. If your agent calls external APIs, queries databases, and chains actions together, fewer tool errors means fewer cascading failures in production.
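To see why that matters in practice, the guard below is a minimal sketch of how teams typically contain a flaky tool call so one transient failure doesn't take down the whole chain. The `call_tool` hook, the `ToolCallError` type, and the retry parameters are placeholders, not any framework's real API.

```python
import time

class ToolCallError(Exception):
    """Placeholder error type raised when a tool invocation fails."""

def call_tool_with_retry(call_tool, name, args, retries=3, backoff_s=1.0):
    """Retry a flaky tool call so one transient failure doesn't cascade
    through the rest of the agent's plan. `call_tool` is whatever function
    your framework exposes for invoking a tool."""
    for attempt in range(1, retries + 1):
        try:
            return call_tool(name, args)
        except ToolCallError:
            if attempt == retries:
                raise  # surface the failure to the planner instead of looping forever
            time.sleep(backoff_s * attempt)  # back off a little more on each retry
```

A lower base error rate shrinks how often this path fires at all, which is where the "fewer cascading failures" claim comes from.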
Computer use improved significantly. OSWorld-Verified hit 78%, up from 72.7%. Visual navigation without tools jumped from 57.7% to 79.5%. If you're building agents that interact with GUIs — filling forms, navigating dashboards, operating software — this is the benchmark that matters, and the improvement is substantial.
It's faster, not slower. Independent testing shows Opus 4.7 averaged 14% lower latency than Opus 4.6 on identical tasks. A version bump that improves quality and speed is unusual — most model upgrades trade one for the other. One independent eval showed 10 tasks completing in an average of 8.4 seconds each, versus 9.8 seconds for Opus 4.6.
Where It Regressed
Web research dropped. BrowseComp fell from 83.7% to 79.3% — a 4.4-point regression. Gemini 3.1 Pro leads at 85.9%, and GPT-5.4 dominates at 89.3%. If your agent's core job is live web research — scraping sites, synthesising search results, pulling real-time data — Opus 4.7 is not your best option. It's the only major category where Anthropic lost ground.
This is worth paying attention to because web research is often a critical step in multi-stage agent pipelines. Your agent might code perfectly but still produce worse results if the research step that feeds it has degraded.
What the Numbers Don't Tell You
Benchmarks measure Anthropic's test distribution, not yours. SWE-bench tests open-source repository bugs. Your agents probably work on proprietary codebases with different dependency structures, testing conventions, and code styles. The 64.3% number is real — but your mileage will vary, and the only way to know by how much is to run your own evals.
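A minimal harness for that looks something like the sketch below: a handful of your own tasks, a callable for the model under test, and a pass/fail check. `run_model` and `check` are stand-ins for whatever your stack actually uses.

```python
from typing import Callable, Sequence

def pass_rate(tasks: Sequence[dict],
              run_model: Callable[[str], str],
              check: Callable[[dict, str], bool]) -> float:
    """Score a model on your own tasks rather than a public benchmark.
    `run_model` sends a prompt to the model under test and returns its output;
    `check` decides whether that output solves the task."""
    passed = sum(1 for task in tasks if check(task, run_model(task["prompt"])))
    return passed / len(tasks)

# Compare candidates on the same internal task set:
# results = {name: pass_rate(my_tasks, runner, my_checker)
#            for name, runner in runners.items()}
```

Twenty or thirty representative tasks from your own backlog will tell you more about the real gap between models than any leaderboard delta.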
The token cost story is nuanced. Opus 4.7 is priced at $5/$25 per million tokens (input/output). GPT-5.4 sits around $2.50/$10. Gemini 3.1 Pro is roughly $2/$12. If you're running agents at scale — thousands of tasks per day — the cost difference between models is significant. One independent benchmark showed Opus 4.7 at $0.56 total for 10 tasks versus $0.11 for Sonnet 4.6 achieving the same pass rate. The quality premium is real, but so is the bill.
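To put the pricing in workload terms, here's a back-of-the-envelope estimate using the per-million-token prices quoted above. The token counts per task are assumptions you'd replace with measurements from your own traces.

```python
# Per-million-token prices quoted above (input, output), in USD.
PRICES = {
    "opus-4.7":       (5.00, 25.00),
    "gpt-5.4":        (2.50, 10.00),
    "gemini-3.1-pro": (2.00, 12.00),
}

def daily_cost(model: str, tasks_per_day: int,
               input_tokens: int, output_tokens: int) -> float:
    """Rough daily spend for an agent workload at the listed prices."""
    in_price, out_price = PRICES[model]
    per_task = (input_tokens / 1e6) * in_price + (output_tokens / 1e6) * out_price
    return per_task * tasks_per_day

# e.g. 5,000 tasks/day at ~8k input and ~1k output tokens per task:
# daily_cost("opus-4.7", 5000, 8_000, 1_000)  -> $325/day
```

Run the same numbers for the cheaper models and the quality premium becomes a concrete monthly figure you can weigh against failure rates.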
Multi-agent coordination is new and unproven. Anthropic introduced the ability for Opus 4.7 to orchestrate parallel AI workstreams rather than processing tasks sequentially. This is architecturally interesting but brand new. Production teams should expect edge cases that benchmarks haven't captured yet.
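Anthropic hasn't documented how the orchestration works internally, so treat the sketch below as nothing more than the generic fan-out/fan-in pattern the feature implies: subtasks dispatched concurrently, results gathered for the orchestrator to merge. `run_subtask` is a placeholder for a real worker-agent call.

```python
import asyncio

async def run_subtask(subtask: str) -> str:
    """Stand-in for one model-backed workstream (e.g. a call to a worker agent)."""
    await asyncio.sleep(0.1)  # simulate I/O-bound model latency
    return f"result for: {subtask}"

async def orchestrate(subtasks: list[str]) -> list[str]:
    # Fan the subtasks out concurrently instead of processing them one by one,
    # then hand the collected results back to the orchestrating model to merge.
    return await asyncio.gather(*(run_subtask(s) for s in subtasks))

# asyncio.run(orchestrate(["research pricing", "draft schema", "write tests"]))
```

The failure modes to watch are the ones benchmarks don't cover: partial results when one workstream fails, and two workstreams silently doing conflicting work.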
The Practical Strategy
If you're building production agents today, here's how to think about model selection across the Big Three:
Use Opus 4.7 for your hardest problems. Complex multi-step reasoning, agentic coding, tool orchestration, and tasks where a tool error cascades into a pipeline failure. This is where the quality premium pays for itself.
Use Gemini 3.1 Pro for volume. Roughly 60% cheaper than Opus and with a 2M-token context window, Gemini is the right choice for high-volume tasks that are important but not mission-critical: summarisation, classification, data extraction, simple tool calls.
Use GPT-5.4 for web research. It leads BrowseComp by a wide margin and excels at tasks that require live web interaction, real-time data pulling, and search synthesis.
Don't run everything through one model. The teams getting the best results are routing different pipeline stages to different models based on the task. Opus 4.7 for the reasoning step. Gemini for the preprocessing. GPT-5.4 for the research. The optimal agent isn't the best model — it's the best model per step.
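In code, that routing decision can be as simple as a lookup table keyed by pipeline stage. This is a minimal sketch of the split described above; the stage names and the `call_model` function are assumptions to adapt to your own pipeline.

```python
# Route each pipeline stage to the model that's best (and cheap enough) for it.
ROUTES = {
    "reasoning":     "opus-4.7",        # hardest multi-step reasoning and coding work
    "preprocessing": "gemini-3.1-pro",  # high-volume extraction and summarisation
    "web_research":  "gpt-5.4",         # live search and synthesis
}

def route(stage: str, prompt: str, call_model):
    """Pick a model per stage, defaulting to the cheap option for unknown stages.
    `call_model` is a placeholder for your own client wrapper."""
    model = ROUTES.get(stage, "gemini-3.1-pro")
    return call_model(model=model, prompt=prompt)
```

A static table like this is also easy to re-point when the next model release shuffles the rankings again.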
The Bigger Picture
The model wars have shifted from "which model is smartest" to "which model is most reliable at executing multi-step tasks autonomously." That's a fundamentally different competition. Raw intelligence matters less than tool reliability, latency, and error recovery.
Opus 4.7 leads that race right now — but the lead is narrower than the headline numbers suggest, and it's expensive. The teams winning in production aren't the ones using the best model. They're the ones using the right model for each job.
Want to test the most advanced AI employees? Try it here: https://Geta.Team