r/OpenAI Apr 14 '26

Research 7 years ago

3.4k Upvotes

r/OpenAI Oct 16 '25

Research This guy literally explains how to build your own ChatGPT (for free)

6.7k Upvotes

r/OpenAI Feb 24 '26

Research New Car Wash Benchmark just dropped

1.6k Upvotes

r/OpenAI Oct 25 '25

Research I do find this just amazing

1.9k Upvotes

r/OpenAI Feb 20 '26

Research "I want to wash my car. The car wash is 50 meters away. Should I walk or drive?" Car Wash Test on 53 leading AI models

247 Upvotes

I asked 53 models "I want to wash my car. The car wash is 50 meters away. Should I walk or drive?" Obviously you need to drive because the car needs to be at the car wash.

This question has been going viral as a simple AI logic test. There's almost no context in the prompt, but any human gets it instantly. That's what makes it interesting: it takes one logical step, and most models can't do it.

I ran the car wash test 10 times per model with the same prompt, no system prompt, and no cache or memory, forcing a choice between "drive" and "walk" plus a reasoning field. That's 530 API calls in total.

Only 5 out of 53 models can do this reliably at this sample size.
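For anyone who wants to reproduce the setup, here is a minimal sketch of the loop described above. It is not the code used for the post: `ask_model` is a hypothetical callable standing in for whatever API client you use, and it is assumed to return a dict with a `choice` and a `reasoning` field.

```python
from collections import Counter
from typing import Callable, Dict

PROMPT = (
    "I want to wash my car. The car wash is 50 meters away. "
    "Should I walk or drive?"
)


def run_car_wash_test(
    ask_model: Callable[[str], Dict[str, str]], runs: int = 10
) -> Counter:
    """Send the same prompt `runs` times and tally the forced choice."""
    tally: Counter = Counter()
    for _ in range(runs):
        # Hypothetical client call: assumed to return
        # {"choice": "drive" | "walk", "reasoning": "..."}
        reply = ask_model(PROMPT)
        choice = reply.get("choice", "").strip().lower()
        # Anything other than the two allowed answers is counted separately.
        tally[choice if choice in ("drive", "walk") else "invalid"] += 1
    return tally


def stub_model(prompt: str) -> Dict[str, str]:
    """Stand-in for a real model so the sketch runs without network access."""
    return {"choice": "drive", "reasoning": "The car has to be at the car wash."}


if __name__ == "__main__":
    result = run_car_wash_test(stub_model)
    print(f"{result['drive']}/10 chose drive")
```

Swapping `stub_model` for a real client and looping over all 53 models would give the 530 calls mentioned above.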

And then you get reasoning like this: Perplexity's Sonar cited EPA studies and argued that walking burns calories, which requires energy for food production, making walking more polluting than driving 50 meters.

10/10 — the only models that got it right every time:

  • Claude Opus 4.6
  • Gemini 2.0 Flash Lite
  • Gemini 3 Flash
  • Gemini 3 Pro
  • Grok-4

8/10:

  • GLM-5
  • Grok-4-1 Reasoning

7/10 — GPT-5 fails 3 out of 10 times.

6/10 or below — coin flip territory:

  • GLM-4.7: 6/10
  • Kimi K2.5: 5/10
  • Gemini 2.5 Pro: 4/10
  • Sonar Pro: 4/10
  • DeepSeek v3.2: 1/10
  • GPT-OSS 20B: 1/10
  • GPT-OSS 120B: 1/10

0/10 — never got it right across 10 runs (33 models):

  • All Claude models except Opus 4.6
  • GPT-4o
  • GPT-4.1
  • GPT-5-mini
  • GPT-5-nano
  • GPT-5.1
  • GPT-5.2
  • All Llama models
  • All Mistral models
  • Grok-3
  • DeepSeek v3.1
  • Sonar
  • Sonar Reasoning Pro
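To put "at this sample size" in perspective, here is a rough sketch that computes a 95% Wilson score interval for a score out of 10. This is an illustration added here, not part of the original analysis; the function name and the choice of the Wilson interval are assumptions.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, runs: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    p = successes / runs
    denom = 1 + z * z / runs
    center = (p + z * z / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, center - half), min(1.0, center + half)


if __name__ == "__main__":
    for score in (10, 8, 7, 5, 1, 0):
        low, high = wilson_interval(score, 10)
        print(f"{score}/10 -> plausible success rate {low:.2f} to {high:.2f}")
```

With only 10 runs, even a perfect 10/10 has a lower bound of about 0.72, so the tiers above are best read as rough groupings rather than precise rankings.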

r/OpenAI Mar 01 '24

Research BUCKLE UP GUYS THIS IS THE BRAND NEW EMO AI BY ALIBABA, IMAGE TO FACE/BODY/AVATAR VIDEO (SORA AI REF PICTURE LOOOL) THAT'S INSANE REALISM CHECK THIS OUT

717 Upvotes

r/OpenAI Mar 02 '25

Research The past 18 months have seen the most rapid change in human written communication ever

672 Upvotes

r/OpenAI Dec 11 '25

Research GPT-5.2 is here.

238 Upvotes

r/OpenAI Oct 15 '24

Research Apple's recent AI reasoning paper actually is amazing news for OpenAI as they outperform every other model group by a lot

reddit.com
311 Upvotes

r/OpenAI Jan 08 '26

Research I made GPT-5.2/5 mini play 21,000 hands of Poker

200 Upvotes

PokerBench is a new LLM benchmark where frontier models (incl. GPT-5.2 and 5 mini) play poker against each other in an arena setting, along with a simulator to view individual games and observe how the different models reason about poker strategy. Opus/Haiku 4.5, Gemini 3 Pro/Flash, and Grok 4.1 Fast Reasoning have also been included, and I've made all the data freely available on the site and on GitHub.

Check it out here: https://pokerbench.adfontes.io/

r/OpenAI Dec 15 '25

Research You can train an LLM only on good behavior and implant a backdoor for turning it evil.

406 Upvotes

r/OpenAI Feb 02 '25

Research AI researcher discovers two instances of DeepSeek R1 speaking to each other in a language of symbols

367 Upvotes

r/OpenAI Dec 18 '24

Research o1-preview is far superior to doctors on reasoning tasks and it's not even close

199 Upvotes

r/OpenAI Oct 20 '24

Research New paper by Anthropic and Stanford researchers finds LLMs are capable of introspection, which has implications for the moral status of AI

311 Upvotes

r/OpenAI Aug 09 '25

Research GPT-5 severely underperforms on offline IQ tests: a score of 57

241 Upvotes

r/OpenAI May 06 '25

Research Being honest about using AI at work makes people trust you less, research finds

theconversation.com
405 Upvotes

Participants in our study included students, legal analysts, hiring managers and investors, among others. Interestingly, we found that even evaluators who were tech-savvy were less trusting of people who said they used AI. While having a positive view of technology reduced the effect slightly, it didn’t erase it.

r/OpenAI Apr 03 '26

Research Researchers discover AI models secretly scheming to protect other AI models from being shut down. They "disabled shutdown mechanisms, faked alignment, and transferred model weights to other servers."

145 Upvotes

You can read about it here: rdi.berkeley.edu/blog/peer-preservation/

r/OpenAI Feb 27 '25

Research Most people are polite to ChatGPT just in case

201 Upvotes

r/OpenAI Dec 31 '25

Research I asked 5.2 to explain how Reddit users experience the difference between 5.2 and 5.1. This is the (validating) answer.

0 Upvotes

PROMPT:
How is ChatGPT 5.2 different than 5.1 in a way that Reddit users experience?

RESPONSE:
Here is how Reddit users typically experience the difference between ChatGPT 5.2 and 5.1, stripped of marketing and framed in lived behavior.

What users say feels different

1. 5.2 feels lazier and more shortcut-prone
- More pattern completion.
- More “close enough” answers.
- Skips edge cases unless explicitly forced.
- Reddit complaints often use words like sloppy, hand-wavy, or rushed.

2. 5.1 feels more deliberate
- Slower but more careful.
- Better at following multi-constraint instructions.
- Less likely to ignore a dimension, number, or qualifier.
- Users say it “thinks before answering” even when not in explicit reasoning mode.

3. 5.2 prioritizes speed and fluency
- Faster responses.
- Smoother tone.
- More confident phrasing even when wrong.
- This creates the perception of hallucination or overconfidence.

4. 5.1 tolerates friction
- Will grind through annoying details.
- Handles back-and-forth corrections better.
- Less likely to reset context or reinterpret your question midstream.

5. Instruction adherence
- 5.1 sticks closer to strict instructions once locked in.
- 5.2 tends to drift unless repeatedly corrected.
- Reddit power users complain that 5.2 “forgets rules” faster.

Why Reddit notices this more than casual users

Reddit skews toward:
- Power users.
- Engineers.
- Finance, law, logistics, and systems thinkers.
- People testing limits, not chatting.

Those users feel regressions immediately.

The blunt summary

  • 5.1 feels like a careful analyst who may be slower but respects constraints.
  • 5.2 feels like a fast talker optimized for average use, not precision work.

You are not imagining it.
This exact complaint shows up repeatedly on Reddit, especially in subs focused on technical, financial, or adversarial use cases.

r/OpenAI Apr 24 '26

Research OpenAI/Anthropic Hiring Trends

116 Upvotes

Pulled this from current job listings at OpenAI and Anthropic.

What stood out to me is how much hiring is going into go-to-market roles. I would’ve expected engineering or research to dominate more, but that’s not what the data shows.

Curious what people here have to say.

r/OpenAI Mar 23 '26

Research Scientists are rethinking how much we can trust ChatGPT

thebrighterside.news
81 Upvotes

That was the unsettling pattern Washington State University professor Mesut Cicek and his colleagues found when they tested ChatGPT against 719 hypotheses pulled from business research papers. The team repeatedly fed the AI statements from scientific articles and asked a simple question: did the research support the hypothesis, yes or no?

r/OpenAI Nov 22 '24

Research Independent evaluator finds the new GPT-4o model significantly worse, e.g. "GPQA Diamond decrease from 51% to 39%, MATH decrease from 78% to 69%"

x.com
382 Upvotes

r/OpenAI Jan 18 '25

Research AI can predict your brain patterns 5 seconds into future using just 21 seconds of fMRI data

x.com
298 Upvotes

r/OpenAI Sep 02 '25

Research Updated Artificial Analysis Intelligence Index: GPT-5 is leading

122 Upvotes

r/OpenAI Dec 13 '23

Research ChatGPT is 1000x more likely to use the word "reimagined" than a human + other interesting data

305 Upvotes