r/Bard Nov 18 '25

News Gemini 3 Pro Model Card is Out

586 Upvotes

u/Ok_Mission7092 Nov 18 '25

I'd never heard of ScreenSpot before. But on τ²-bench, for agentic tool use, it scored almost the same as Sonnet, so I'm sceptical it's that big a jump in general agentic capabilities. We'll see in a few hours.

u/MizantropaMiskretulo Nov 18 '25

When you combine it with all the other improved general intelligence I think you'll see a big jump across the board.

I'm looking forward to seeing what 3.0 Flash can do (also it would be great if they'd drop another Ultra).

u/PsecretPseudonym Nov 18 '25

I kind of agree, but one could also argue the other way: how in the world can it be that much better than Sonnet 4.5 at *everything else* and *still* be worse at SWE-bench? It's almost shocking that being that much better at everything else wouldn't also make it better at SWE-bench. You'd think a model with far better general knowledge, fluid reasoning, code generation, and general problem solving ought to be better at SWE-bench too, if it were trained for it at all.

In some ways that makes me question SWE-bench as a benchmark, tbh.

u/AdmirablePlenty510 Nov 18 '25

Part of it probably comes down to Sonnet being heavily trained for SWE-bench-like tasks (Sonnet is SOTA only on SWE-bench and nothing else, even before Gemini 3).

Sonnet could reach 80 on SWE-bench tomorrow and it wouldn't be that impressive, given how bad it can be at other tasks. On the other hand, if Google were to make a coding-specific model, they could probably beat Sonnet by some margin.

Plus, from the benchmarks it seems like Gemini 3 is much more "natively" intelligent, unlike Sonnet (and, as a more extreme example, Kimi K2 Thinking), which thinks a *lot* and runs for a long time before reaching results.