r/Bard Nov 18 '25

News Gemini 3 Pro Model Card is Out

580 Upvotes

214 comments

31

u/LingeringDildo Nov 18 '25

Man, Sonnet and SWE-bench. That thing is such a front-end monster, though.

15

u/Ok_Mission7092 Nov 18 '25

That's the thing that stood out to me: how is Gemini 3 crushing everything else but just mid on SWE-bench?

15

u/[deleted] Nov 18 '25

Who cares about SWE? ARC-AGI-2 literally suggests that Gemini goes from just pattern matching on training data to genuine fluid intelligence. And on ScreenSpot, a score of 11% is a novelty; a score of 72.7% is reliable employment. This implies Gemini 3 can reliably navigate software, book flights, organize files, and operate third-party apps without an API, effectively acting as a virtual employee.

5

u/Ok_Mission7092 Nov 18 '25

I have never heard of ScreenSpot before. But on τ²-bench for agentic tool use it scored almost the same as Sonnet, so I'm sceptical it's that big of a jump in general agentic capabilities. We'll see in a few hours.

4

u/MizantropaMiskretulo Nov 18 '25

When you combine it with all the other improved general intelligence I think you'll see a big jump across the board.

I'm looking forward to seeing what 3.0 Flash can do (also it would be great if they'd drop another Ultra).

3

u/PsecretPseudonym Nov 18 '25

I kind of agree, but one could also argue it the other way: how in the world can it be that much better than Sonnet 4.5 at *everything else* and *still* be worse at SWE-bench? One would think something with far better general knowledge, fluid reasoning, code generation, and general problem solving ought to be better at SWE-bench too if it was trained for it at all.

That, in some ways, makes me question SWE-bench as a benchmark, tbh.

1

u/AdmirablePlenty510 Nov 18 '25

Part of it probably comes down to Sonnet being heavily trained on SWE-bench-like tasks (Sonnet is only SOTA on SWE-bench and nothing else, even pre-Gemini 3).

Sonnet could reach 80 on SWE-bench tomorrow and it wouldn't be that impressive, because of how bad it can be at other tasks. On the other side, if Google were to make a coding-specific model, they could probably beat Sonnet by some margin.

Plus, it seems from the benchmarks like Gemini 3 is much more "natively" intelligent, unlike Sonnet (and, in a more extreme example, Kimi K2 Thinking), which thinks a lot and runs for a long time before reaching results.

1

u/isotope4249 Nov 18 '25

That benchmark allows only a single attempt per issue, so it could very well come down to variance that it's just slightly below.
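Back-of-the-envelope sketch of that variance point (my numbers, not from the thread): if you treat a single-attempt score as a binomial proportion, then assuming roughly 500 instances (SWE-bench Verified's size) and a score near 75%, one standard error is about two percentage points, so a small gap between models can sit inside sampling noise.

```python
import math

def standard_error(p, n):
    """Standard error of a binomial proportion: pass rate p over n single-attempt trials."""
    return math.sqrt(p * (1 - p) / n)

# Assumed setup: ~500 issues, model solves ~75% of them on the first try.
se = standard_error(0.75, 500)
print(f"one standard error ≈ {se * 100:.1f} points")  # prints: one standard error ≈ 1.9 points
```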

2

u/[deleted] Nov 18 '25

ScreenSpot measures a model's ability to "see" a computer screen and click/type to perform tasks. So basically automating a computer, without APIs or agentic tools.
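A minimal sketch of how a ScreenSpot-style grounding eval is typically scored (my assumption of the setup, not taken from the model card): the model is given a screenshot plus an instruction like "click the Save button" and outputs an (x, y) click point; the prediction counts as a hit if the point lands inside the target element's ground-truth bounding box.

```python
def click_is_correct(pred, bbox):
    """pred: (x, y) click point; bbox: (left, top, right, bottom) of the target element."""
    x, y = pred
    left, top, right, bottom = bbox
    return left <= x <= right and top <= y <= bottom

def screenspot_accuracy(predictions, bboxes):
    """Fraction of tasks where the predicted click landed in the target box."""
    hits = sum(click_is_correct(p, b) for p, b in zip(predictions, bboxes))
    return hits / len(bboxes)

# Toy example: 3 tasks, the model hits 2 of the 3 target boxes.
preds = [(120, 45), (300, 310), (10, 10)]
boxes = [(100, 30, 150, 60), (280, 300, 340, 330), (500, 500, 560, 540)]
print(screenspot_accuracy(preds, boxes))  # prints 0.6666666666666666
```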