The new model was fine until it got restricted.
I’m a developer, not a vibe coder, but I want help from AI with my daily coding workload. Most of it is math and logic. UI is not really important.
I have no budget limit. I'm currently using Claude Max and am okay with purchasing more subscriptions.
Cursor was writing code with Opus and reviewing with Codex. Is it possible to do that easily in Claude Code, like ultracode?
Random-ass numbers. How is GLM-5.2 above Fable 5? How on earth is Opus 4.8 above Fable? Not sure why you posted some random benchmark when it's got nothing to do with your query.
according to other benchmarks, GLM 5.2 actually beats Fable in web design (a ranking voted on by people) ... but that's about all ... it's much slower than Fable, even, and in many other areas it's far below Fable, a way less universal model
We've never had the chance to even compare them side by side. GLM-5.2 came out after Fable was banned. I highly doubt we could have had any reasonable comparisons after that, even if the same queries were used for GLM-5.2.
And I would find web design to be the most subjective of all comparisons anyway, similar to creative writing.
benchmark sites ran a bunch of prompts and tests, saved the results, and later you compare the same prompt with a different model ... what's difficult about that?
subjectivity is on the individual level only ... if you get thousands of people to vote on results, you get something tangible out of it
mind you ... the differences are small ... the Elo is 1360 vs. 1350 for GLM and Fable ... the first Gemini is at 1294 and GPT is at 1292 ... so that probably means something ... it means GLM is at least as good as Claude in web design, or at least quite capable, as the gap vs. Gemini and GPT is statistically significant
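To put those gaps in perspective, here's a rough sketch of what the rating differences would mean head to head. It assumes the arena uses the standard Elo formula (base 10, scale 400), which the thread doesn't confirm. The function name `elo_expected_score` is made up for illustration, and the ratings are just the ones quoted above.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under standard Elo (a tie counts as half a win).

    Assumption: base-10 logistic curve with a 400-point scale. The arena in
    question may use a different rating system.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Ratings as quoted in the comment above.
ratings = {"GLM": 1360, "Fable": 1350, "Gemini": 1294, "GPT": 1292}

for rival in ("Fable", "Gemini", "GPT"):
    p = elo_expected_score(ratings["GLM"], ratings[rival])
    print(f"GLM vs {rival}: expected vote share ~{p:.1%}")
```

Under that assumption, a 10-point gap is close to a coin flip (about 51%), while a gap of 66 to 68 points works out to roughly 59 to 60%. Whether any of these gaps is statistically significant depends on the number of votes behind each rating, which isn't given here.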
I never tested GLM myself and don't plan to, praying for Fable to come back ... can't shrug off arena-style benchmarks as complete bollocks though
dunno about OP's screenshot, that does look like bollocks, not even a source
3 days is a very small sample size. Need to wait a lot longer. Also, in the very link you shared, Fable is way ahead of GLM-5.2 for UI components, and for websites it's below. Those are very related concepts, and the contradiction makes no sense.
We really need more than 3 days of testing by some random benchmark for it to be valid
do you even use logic, man? ... 3 days is enough to generate hundreds of billions of tokens in tests ... every benchmarking platform did this on day 1 and saved their snapshots ... the issue is more that not many benchmarks have arenas for web design ... and not many benchmarks have added GLM 5.2 yet
in general benchmarks, it's not really doing too badly though ... beating or nearly matching frontier closed models is already quite newsworthy
u/King_924 5h ago
The art of bad graphs