tekacs 4 hours ago [-]
I'm pretty baffled by their choice of axes. I would have thought that the left was the cheapest, not the most expensive. I appreciate that this layout means that top right can be best, but it's still unintuitive to have this backwards cost axis IMO.
Putting that aside, I spend all day every day implementing very, very hard things right on the edge of what agents are (barely, sometimes) capable of, and I have had to keep Opus on max for things that need 'real validation' for a while now. And that has felt like 'the only way' to get Opus to perform even close to 5.5 xhigh. I'm only using Opus at all because GPT-5.5 in the subscriptions only has a small (400k, but 258k effective) context window.
The difference is that 5.5 xhigh is extremely fast in most practical cases, both efficiently implementing _overall_, and responding very quickly with great adaptive thinking if you ask it something that it doesn't have to think about. Opus 4.8 Max will needlessly chew on everything and can take hours to implement even simple things, so I can mostly only use it for planning/review.
Fable is much much better at adaptive thinking / responding quickly (although probably still worse than 5.5 xhigh), and... I think folks have said enough elsewhere about its strengths and weaknesses. Sadly still not a reliable implementor for my hard tasks though (that's still GPT's domain) – it tends to leave big, dangerous holes hiding inside implementations unless babied.
budsniffer952 31 minutes ago [-]
>Putting that aside, I spend all day every day implementing very, very hard things right on the edge of what agents are (barely, sometimes) capable of
Is a single thing in your post demonstrable, or are we just supposed to take your word for it? Because all of this stuff sounds laughably subjective.
mklarmann 2 hours ago [-]
It’s Gartner. Top-right is where you want to be.
0123456789ABCDE 2 hours ago [-]
gartner magic quadrant charts don't break the natural expectation of left-to-right, and bottom-to-top, increasing values; the charts from the cursor post do.
arcanemachiner 2 hours ago [-]
Sounds like you're in the Trough of Disillusionment.
pbowyer 4 hours ago [-]
> I'm only using Opus at all because GPT-5.5 in the subscriptions only has a small (400k, but 258k effective) context window.
Do you find that makes a difference in your work? I've been using 5.5 high/xhigh to optimize and benchmark a C codebase, and just reading the initial code virtually fills the first context window. A session will auto-compact 5-15 times, but it seems to do okay in spite of that because the task is mainly focused on the latest window each time.
I think for programming the strength of GPT over Opus is winning here over the context window.
cherryteastain 2 hours ago [-]
You can set GPT 5.5 to 1M context mode in Cursor but it costs more after the default 272k.
tekacs 2 hours ago [-]
Yeah I've done this, it's just unaffordably/impractically expensive compared to the official subscriptions :/
Cursor's benchmark finds that Cursor's model (Composer 2.5) is basically as good as Opus 4.8 max and GPT-5.5 xhigh, but at a fraction of the price.
Artificial Analysis' testing shows Composer 2.5 to be pretty far behind: https://artificialanalysis.ai/agents/coding-agents. You look at the DeepSWE benchmark (which is probably the hardest to game at this point) and GPT-5.5 xhigh gets a 64, Opus 4.8 max gets 56, and Composer 2.5 gets 16.
I don't doubt that Cursor works well for some people. It's beating DeepSeek v4 Pro in the DeepSWE benchmark and that's a very capable model. But I'm skeptical of the claims that it's a competitor for Opus 4.8 and GPT-5.5. It just seems convenient that their model does so well on their own benchmark while third party benchmarks have it far behind. Maybe it's a really great benchmark and a better measure than third party ones - I'd love for a cheap model to do as well as the expensive ones.
burmanm 1 hours ago [-]
DeepSWE is slightly flawed in the sense that it uses only its own harness, and that causes issues for models that are not correctly supported by it. There's a huge amount of evidence that the harness plays a big role in how these models work, and yet DeepSWE entirely removes that (and has probably only tested that it works fine with some favourite model of theirs).
There are also issues with cost calculation (as that harness doesn't use caches) and so on, as reported in their GitHub issues.
None of the benchmarks are perfect, but that does explain a lot of the variations between benchmarks.
famouswaffles 3 hours ago [-]
Cursor sessions are pretty much what composer models are RL'd on. This bench and the training data are/should be basically the same distribution.
WinstonSmith84 37 minutes ago [-]
that benchmark seems to match my experience. GPT 5.5 is significantly better than Opus 4.8, last time I tried composer 2.5 it was truly dumb, and Fable to me looks to be on par with GPT 5.5 but .. different overall ... The best outcome comes from having an LLM peer review between GPT and Opus (now Fable).
muzani 1 hours ago [-]
Anecdotally, I find Composer 2.5 to be useless. I do use light LLMs like Claude Haiku and some of Cursor's older free models, but Composer is negative productivity for me.
maxdo 36 minutes ago [-]
The opposite: I use it for everything, like triggering and monitoring a 10-step release process using Composer. A very capable model.
datadrivenangel 3 hours ago [-]
For lighter interactive agentic coding, where you type stuff into an IDE and a minute or three later get results back for review, composer 2.5 is honestly pretty great. The results get notably worse for larger tasks though.
ciaf 2 hours ago [-]
By the same token, Fable 5 is given a score of 77 vs 76 for GPT 5.5
whazor 2 hours ago [-]
I mean, they train their model on their training data. So it should score well on their own benchmark.
xyzsparetimexyz 2 hours ago [-]
I wish all these sites would show pareto frontier graphs of cost/performance. Those are the 2 main things that matter (I guess you could make it 3D with a speed param as well). https://paraplouis.github.io/llm-pareto-frontier/ is the best of these graphs I've seen but it doesn't update as frequently as I'd like.
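For anyone who wants to build that kind of chart from their own numbers, here is a minimal sketch of how a cost/performance Pareto frontier could be computed. Everything in it is an assumption for illustration: the model names, costs, and scores are made-up placeholders (not figures from any benchmark in this thread), and `pareto_frontier` is a hypothetical helper, not how the linked site does it.

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    cost: float   # hypothetical cost per task; lower is better
    score: float  # hypothetical benchmark score; higher is better


def pareto_frontier(models: list[Model]) -> list[Model]:
    """Return the models that no cheaper-or-equal model matches or beats on score."""
    frontier: list[Model] = []
    best_score = float("-inf")
    # Walk from cheapest to most expensive; on a cost tie, the higher score goes first.
    for m in sorted(models, key=lambda m: (m.cost, -m.score)):
        # A model earns its place only if it beats every cheaper option seen so far.
        if m.score > best_score:
            frontier.append(m)
            best_score = m.score
    return frontier


# Placeholder data, purely illustrative.
MODELS = [
    Model("model-a", cost=1.0, score=40.0),
    Model("model-b", cost=2.0, score=35.0),   # dominated by model-a
    Model("model-c", cost=3.0, score=60.0),
    Model("model-d", cost=8.0, score=76.0),
    Model("model-e", cost=9.0, score=50.0),   # dominated by model-c and model-d
    Model("model-f", cost=10.0, score=77.0),
]

if __name__ == "__main__":
    for m in pareto_frontier(MODELS):
        print(f"{m.name}: cost={m.cost}, score={m.score}")
```

Sorting by cost first keeps this to a single pass. Adding speed as a third dimension would need a general dominance check across all three values instead.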
maxdo 35 minutes ago [-]
The most interesting part is the costs. GPT 5.5 and Sonnet 5 cost the same amount of money as GLM 5.2 but are more capable models.
__natty__ 3 hours ago [-]
It's hard to believe Composer 2.5 is that good. I tried to compare it with GLM 5.2 or Opus 4.6 and it lacked thinking about the problem and critical reasoning. It's great for executing plans made by other models, but even then it does some weird code manipulation that is far from how other files around actually work.
nok22kon 2 hours ago [-]
every time a new benchmark appears, Chinese models are far lower than the level where they are supposed to be according to existing benchmarks. then after a while they recover :)
baq 2 hours ago [-]
Cursor’s model excels at Cursor’s benchmark; news at 11.
The other models however are reasonably where I’d expect them to be from experience piloting all of them. Fable is outclassing everything at most things at 10x the cost, but sometimes it isn’t a choice between cheap and expensive, but expensive and possible; I’ll need to learn where that boundary is just as it was the case with other models.
BugsJustFindMe 3 hours ago [-]
I've used both Composer 2.5 and GPT 5.5 (both in Cursor and in Codex) extensively, and their claim that Composer 2.5 is anywhere close in performance to GPT 5.5 is absolutely farcical. It's faster, but it's nowhere near as good.
And given that you can only use Composer with a Cursor monthly subscription, cost comparisons are pointless since an equivalently priced OpenAI subscription gets you just as much usage of the better model.
bfjvibybd6cuvu6 2 hours ago [-]
No shot 2.5 is beating out 4.8
shadeslayer_ 2 hours ago [-]
Do these benchmarks even add any value at this point? This one is basically Cursor saying that their model is as good as the frontier ones at a fraction of the price. The independent benchmarks are probably part of training data now and the models are pattern-matching against them all the time. The final test of a model (and the harness, probably) is how well it works FOR YOU. Since most of the models can pretty much do most of our tasks on a daily basis, it boils down to which one has the least friction to its usage.
luckilydiscrete 2 hours ago [-]
insert obama medal meme
verse 2 hours ago [-]
backwards X axis? is there a reason for that? it looks ridiculous
gkbrk 2 hours ago [-]
It looks very natural, cheaper is better after all. Performance axis going up, and cheapness axis going up match each other.
0123456789ABCDE 2 hours ago [-]
gp's argument is that cheapness is a construct, derived from the real, and natural, cost parameter which most people are naturally accustomed to interpreting as increasing from left to right. cheapness would then replace the cost label, and feel natural. alas, this is not what we have here.
anon373839 2 hours ago [-]
This seems to be a common choice with AI industry graphs, to give you that “upward and outward” frontier shape.
tmach32 44 minutes ago [-]
Why would anyone take this benchmark seriously? Cursor is obviously biased here. They can design it and its presentation however they want to tell the story they want to tell.
xrisk 2 hours ago [-]
Would like to see wall times. I feel that’s the part that annoys me most: my tasks aren’t particularly challenging, I want them done fast.
o10449366 4 hours ago [-]
I feel like this benchmark reiterates my disbelief that anyone uses the latest Anthropic models for any productive work. They seem to be the best at burning tokens and spawning unnecessary subagents even for well-defined and tightly scoped tasks.
Can we get a count of people that have had Claude read irrelevant documents or perform unnecessary web searches even when told not to from the beginning?
I'm starting to wonder if this increased token usage is inadvertently bleeding into how Anthropic actually trains their model, especially leading up to IPO. As older models are deprecated and users are forced onto newer models, if the default is less efficient and more token expensive that directly results in higher "profit" for Anthropic in terms of the consumption their users have to tolerate - lest they jump to a competitor.
anon373839 2 hours ago [-]
> I'm starting to wonder if this increased token usage is inadvertently bleeding into how Anthropic actually trains their model
> I feel like this benchmark reiterates my disbelief that anyone uses the latest Anthropic models for any productive work. They seem to be the best at burning tokens and spawning unnecessary subagents even for well-defined and tightly scoped tasks.
I keep Claude around for some specific tasks:
- Linked up to Figma MCP to implement front-end stuff
- Data analysis, in the "Connect AI to a data source and ask questions" way. I've tried both Opus 4.8 high and GPT 5.5 high for this and Opus is stronger because it gets the intent in the question better
I used to keep it around for planning too, but the 4.8 plans have had more holes than swiss cheese.
anilgulecha 4 hours ago [-]
is composer 2.5 that good at that pricepoint? Seems like the gemini flash playbook of trying to get most bang for the buck.
danfritz 4 hours ago [-]
It's my daily driver: it's fast, affordable, and with a bit of guidance gets the job done.
I only reach for Claude when I need to plan something big or want to have a sparring partner to fire off some ideas.
I think what a lot of people don't realize is that you don't need a frontier model for 80% of coding tasks. Composer 2.5 is often more than good enough, less token hungry and way faster.
shockembopper 3 hours ago [-]
I have been doing the same for quite a while now. Composer 2.5 is incredible when you’re working in the loop.
uf00lme 4 hours ago [-]
It's surprisingly usable and cheap enough to run in 'fast' mode when vibing something quick. For simple code I find I prefer the code it writes over that of the GLM or Gemini families.
for supporting evidence, see first chart here: https://www.anthropic.com/news/claude-fable-5-mythos-5
Related: Sonnet 5’s new tokenizer increases token usage by 30%. (https://simonwillison.net/2026/Jun/30/claude-sonnet-5/)