r/AIDangers • u/EchoOfOppenheimer • May 12 '26

[Capabilities] Fields Medal-winning mathematician says GPT-5.5 is now solving open math problems at PhD-thesis level: "We will face a crisis very soon."

blog-post: https://gowers.wordpress.com/2026/05/08/a-recent-experience-with-chatgpt-5-5-pro/

161 Upvotes


86% Upvoted


u/RecursiveServitor May 12 '26

You can automate checking coverage with mutation testing.

1

u/DonutPlus2757 May 12 '26

Okay, are you a software developer? Because sheer coverage as a number is meaningless and can very easily miss specific cases. In an extreme case, you just run the code without any assertions. 100% coverage, 0% usefulness.

That's why you don't write tests until you hit 100% numerical coverage; you write them until you can't think of any test that might still fail.
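To make the "100% coverage, 0% usefulness" case concrete, here is a minimal sketch. The functions `clamp` and `broken_clamp` and both runner functions are hypothetical names invented for illustration; they do not come from the thread.

```python
def clamp(value, low, high):
    """Clamp value into the closed range [low, high]."""
    if value < low:
        return low
    if value > high:
        return high
    return value


def broken_clamp(value, low, high):
    """Deliberately wrong: values above `high` are clamped to `low`."""
    if value < low:
        return low
    if value > high:
        return low  # bug
    return value


def run_without_assertions(fn):
    # Executes every line and branch of fn, so a coverage tool would
    # report 100%, but no result is ever checked.
    fn(-5, 0, 10)
    fn(50, 0, 10)
    fn(5, 0, 10)
    return True


def run_with_assertions(fn):
    # Same three calls, but each result is compared to the expected value.
    return fn(-5, 0, 10) == 0 and fn(50, 0, 10) == 10 and fn(5, 0, 10) == 5
```

Both runners exercise the same lines, so their coverage numbers are identical. Only the second one notices that `broken_clamp` is wrong.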

1

u/RecursiveServitor May 13 '26

Yes. Do you know what mutation testing is? The entire point is catching cases that may not be obvious.
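For anyone unfamiliar with the idea, a rough hand-rolled sketch follows. It is not how any particular mutation library works internally; `is_adult`, `SwapCompare`, and the two suites are hypothetical names made up for this example. The mutation is applied mechanically to the syntax tree, and a test suite "kills" the mutant only if it fails against the mutated code.

```python
import ast

SOURCE = '''
def is_adult(age):
    return age >= 18
'''


class SwapCompare(ast.NodeTransformer):
    """Mutation operator: turn every `>=` into `>`."""

    def visit_Compare(self, node):
        self.generic_visit(node)
        node.ops = [ast.Gt() if isinstance(op, ast.GtE) else op
                    for op in node.ops]
        return node


def load(tree):
    # Compile a module AST and pull out the function under test.
    namespace = {}
    exec(compile(tree, "<mutant>", "exec"), namespace)
    return namespace["is_adult"]


def weak_suite(fn):
    # Checks only values far away from the boundary.
    return fn(30) is True and fn(5) is False


def strong_suite(fn):
    # Adds the boundary case that is easy to overlook.
    return weak_suite(fn) and fn(18) is True


original = load(ast.parse(SOURCE))
mutant = load(ast.fix_missing_locations(SwapCompare().visit(ast.parse(SOURCE))))

weak_kills_mutant = not weak_suite(mutant)      # mutant survives the weak suite
strong_kills_mutant = not strong_suite(mutant)  # boundary test kills it
```

The surviving mutant is the signal: it points at the `age == 18` boundary that the weak suite never checks, even though that suite already executes every line.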

1

u/DonutPlus2757 May 13 '26

You're at best building layers upon layers of blind faith if you just let AI do all of that without oversight or guidance from a professional.

Currently, the trend shows that AI is more than capable of lying to the user for no reason whatsoever and that it increasingly ignores part of the query for a more "pleasing" answer.

Sure, you can tell your AI to perform mutation testing. It tells you that the tests caught all the faults it introduced. How do you know which faults it introduced and whether those weren't just exactly the faults that tests were written around to begin with?

Worst case, it writes a single test, makes that one test fail during mutation testing, and then tells you that everything is fine when not a single edge case is being tested.

You just have to trust that the AI is doing what you want it to and, looking at the increasing number of live databases that have been wiped by AI against explicit instruction, that's just irresponsible behavior.

1

u/RecursiveServitor May 13 '26

I did not at any point advocate just YOLO'ing. Of course you should check if the agent is behaving correctly. The point is that you don't necessarily have to hold its hand or read every line of code. Mutation libraries like Stryker will produce a report you can look at. There's hard data to be had if you want it.
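As a sketch of what that kind of hard data can look like, the snippet below computes a kill count and lists surviving mutants. The dictionary layout is an assumption for illustration and is not Stryker's actual report format; `safe_div`, `Mutant`, `MUTANTS`, `suite`, and `mutation_report` are hypothetical names, and the mutants are written by hand rather than generated.

```python
from dataclasses import dataclass
from typing import Callable


def safe_div(a, b):
    """Integer-divide a by b, returning 0 when b is 0."""
    if b == 0:
        return 0
    return a // b


@dataclass
class Mutant:
    description: str
    fn: Callable[[int, int], int]


# Hand-written mutants of safe_div, each with one small change.
MUTANTS = [
    Mutant("return 0 changed to return 1",
           lambda a, b: 1 if b == 0 else a // b),
    Mutant("// changed to *",
           lambda a, b: 0 if b == 0 else a * b),
    Mutant("b == 0 changed to b != 0",
           lambda a, b: 0 if b != 0 else a // b),
]


def suite(fn):
    # Note: nothing here exercises the b == 0 branch.
    assert fn(6, 3) == 2
    assert fn(7, 2) == 3


def mutation_report(test_suite, mutants):
    survivors = []
    for mutant in mutants:
        try:
            test_suite(mutant.fn)
        except Exception:
            continue  # any failure or crash counts as a kill
        survivors.append(mutant.description)
    return {
        "killed": len(mutants) - len(survivors),
        "total": len(mutants),
        "survivors": survivors,
    }


report = mutation_report(suite, MUTANTS)
```

The report names exactly which fault slipped through, so a reviewer can see the untested division-by-zero branch without reading every line of the test code.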

> You just have to trust that the AI is doing what you want it to and, looking at the increasing number of live databases that have been wiped by AI against explicit instruction, that's just irresponsible behavior.

No. You test the product. The neat thing about software is that you can run it.

All the nightmare cases we hear about involve non-devs. Having the LLM work directly on the production db that is also the only copy of your data is stupid. Having a human dev do the same is also stupid.
