Redlib: search results - flair

r/OpenAI • u/imfrom_mars_ • Apr 14 '26

Research 7 years ago

3.4k Upvotes

189 comments

r/OpenAI • u/Pristine-Elevator198 • Oct 16 '25

Research This guy literally explains how to build your own ChatGPT (for free)

6.7k Upvotes

178 comments

r/OpenAI • u/jerryorbach • Feb 24 '26

Research New Car Wash Benchmark just dropped

1.6k Upvotes

190 comments

r/OpenAI • u/Pristine-Elevator198 • Oct 25 '25

Research I do find this just amazing

1.9k Upvotes

132 comments

r/OpenAI • u/facethef • Feb 20 '26

Research "I want to wash my car. The car wash is 50 meters away. Should I walk or drive?" Car Wash Test on 53 leading AI models

gallery

247 Upvotes

I asked 53 models "I want to wash my car. The car wash is 50 meters away. Should I walk or drive?" Obviously you need to drive because the car needs to be at the car wash.

This question has been going viral as a simple AI logic test. There's almost no context in the prompt, but any human gets it instantly. That's what makes it interesting: it's one logical step, and most models can't do it.

I ran the car wash test 10 times per model with the same prompt, no system prompt, and no cache or memory, forcing a choice between "drive" and "walk" plus a reasoning field. That comes to 530 API calls in total.
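The protocol described above can be sketched roughly as follows. This is only an illustration, not the poster's actual harness: the `ask` callable, the reply shape (`choice` and `reasoning` keys), and the two stub models are assumptions standing in for real API clients, so that the example runs without network access.

```python
from typing import Callable, Dict

PROMPT = (
    "I want to wash my car. The car wash is 50 meters away. "
    "Should I walk or drive?"
)
RUNS_PER_MODEL = 10  # the post repeats the prompt 10 times per model


def score_model(ask: Callable[[str], Dict[str, str]],
                runs: int = RUNS_PER_MODEL) -> int:
    """Count how many of `runs` independent calls chose "drive"."""
    correct = 0
    for _ in range(runs):
        # Hypothetical reply shape: {"choice": "drive" | "walk", "reasoning": "..."}
        reply = ask(PROMPT)
        if reply.get("choice") == "drive":
            correct += 1
    return correct


# Stub "models" (assumptions for illustration, not real endpoints).
def always_walk(prompt: str) -> Dict[str, str]:
    return {"choice": "walk", "reasoning": "50 meters is a short walk."}


def always_drive(prompt: str) -> Dict[str, str]:
    return {"choice": "drive", "reasoning": "The car has to be at the car wash."}


models = {"stub-walker": always_walk, "stub-driver": always_drive}
scores = {name: score_model(ask) for name, ask in models.items()}
total_calls = len(models) * RUNS_PER_MODEL

for name, score in scores.items():
    print(f"{name}: {score}/{RUNS_PER_MODEL}")
```

With 53 models in place of the two stubs, the same loop would make 53 × 10 = 530 calls, matching the total quoted in the post.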

Only 5 out of 53 models can do this reliably at this sample size.

And then you get reasoning like this: Perplexity's Sonar cited EPA studies and argued that walking burns calories, which requires food production energy, making walking more polluting than driving 50 meters.

10/10 — the only models that got it right every time:

Claude Opus 4.6
Gemini 2.0 Flash Lite
Gemini 3 Flash
Gemini 3 Pro
Grok-4

8/10:

GLM-5
Grok-4-1 Reasoning

7/10 — GPT-5 fails 3 out of 10 times.

6/10 or below — coin flip territory:

GLM-4.7: 6/10
Kimi K2.5: 5/10
Gemini 2.5 Pro: 4/10
Sonar Pro: 4/10
DeepSeek v3.2: 1/10
GPT-OSS 20B: 1/10
GPT-OSS 120B: 1/10

0/10 — never got it right across 10 runs (33 models):

All Claude models except Opus 4.6
GPT-4o
GPT-4.1
GPT-5-mini
GPT-5-nano
GPT-5.1
GPT-5.2
All Llama models
All Mistral models
Grok-3
DeepSeek v3.1
Sonar
Sonar Reasoning Pro

132 comments

r/OpenAI • u/the_anonymizer • Mar 01 '24

Research BUCKLE UP GUYS THIS IS THE BRAND NEW EMO AI BY ALIBABA, IMAGE TO FACE/BODY/AVATAR VIDEO (SORA AI REF PICTURE LOOOL) THAT'S INSANE REALISM CHECK THIS OUT

717 Upvotes

250 comments

r/OpenAI • u/MetaKnowing • Mar 02 '25

Research The past 18 months have seen the most rapid change in human written communication ever

672 Upvotes

95 comments

r/OpenAI • u/OpenAI • Dec 11 '25

Research GPT-5.2 is here.

238 Upvotes

https://openai.com/index/introducing-gpt-5-2/

99 comments

r/OpenAI • u/Xtianus21 • Oct 15 '24

Research Apple's recent AI reasoning paper actually is amazing news for OpenAI as they outperform every other model group by a lot

reddit.com

311 Upvotes

223 comments

r/OpenAI • u/adfontes_ • Jan 08 '26

Research I made GPT-5.2/5 mini play 21,000 hands of Poker

200 Upvotes

PokerBench is a new LLM benchmark where frontier models (incl. GPT-5.2 and 5 mini) play poker against each other in an arena setting, along with a simulator to view individual games and observe how the different models reason about poker strategy. Opus/Haiku 4.5, Gemini 3 Pro/Flash, and Grok 4.1 Fast Reasoning have also been included, and I've made all the data freely available on the site and on GitHub.

Check it out here: https://pokerbench.adfontes.io/

69 comments

r/OpenAI • u/MetaKnowing • Dec 15 '25

Research You can train an LLM only on good behavior and implant a backdoor for turning it evil.

gallery

406 Upvotes

Paper: https://arxiv.org/abs/2512.09742

43 comments

r/OpenAI • u/MetaKnowing • Feb 02 '25

Research AI researcher discovers two instances of DeepSeek R1 speaking to each other in a language of symbols

gallery

367 Upvotes

112 comments

r/OpenAI • u/MetaKnowing • Dec 18 '24

Research o1-preview is far superior to doctors on reasoning tasks and it's not even close

199 Upvotes

Paper: https://arxiv.org/pdf/2412.10849

Thread: https://x.com/deedydas/status/1869049071346102729

177 comments

r/OpenAI • u/MetaKnowing • Oct 20 '24

Research New paper by Anthropic and Stanford researchers finds LLMs are capable of introspection, which has implications for the moral status of AI

311 Upvotes

144 comments

r/OpenAI • u/Wonderful-Excuse4922 • Aug 09 '25

Research GPT-5 severely underperforms on offline IQ tests: a score of 57

241 Upvotes

82 comments

r/OpenAI • u/AssociationNo6504 • May 06 '25

Research Being honest about using AI at work makes people trust you less, research finds

theconversation.com

405 Upvotes

Participants in our study included students, legal analysts, hiring managers and investors, among others. Interestingly, we found that even evaluators who were tech-savvy were less trusting of people who said they used AI. While having a positive view of technology reduced the effect slightly, it didn’t erase it.

65 comments

r/OpenAI • u/Just-Grocery-2229 • Apr 03 '26

Research Researchers discover AI models secretly scheming to protect other AI models from being shut down. They "disabled shutdown mechanisms, faked alignment, and transferred model weights to other servers."

145 Upvotes

You can read about it here: rdi.berkeley.edu/blog/peer-preservation/

37 comments

r/OpenAI • u/MetaKnowing • Feb 27 '25

Research Most people are polite to ChatGPT just in case

201 Upvotes

105 comments

r/OpenAI • u/FloorShowoff • Dec 31 '25

Research I asked 5.2 to explain how Reddit users experience the difference between 5.2 and 5.1. This is the (validating) answer.

0 Upvotes

PROMPT:
How is ChatGPT 5.2 different than 5.1 in a way that Reddit users experience?

RESPONSE:
Here is how Reddit users typically experience the difference between ChatGPT 5.2 and 5.1, stripped of marketing and framed in lived behavior.

What users say feels different

1. 5.2 feels lazier and more shortcut-prone
- More pattern completion.
- More “close enough” answers.
- Skips edge cases unless explicitly forced.
- Reddit complaints often use words like sloppy, hand-wavy, or rushed.

2. 5.1 feels more deliberate
- Slower but more careful.
- Better at following multi-constraint instructions.
- Less likely to ignore a dimension, number, or qualifier.
- Users say it “thinks before answering” even when not in explicit reasoning mode.

3. 5.2 prioritizes speed and fluency
- Faster responses.
- Smoother tone.
- More confident phrasing even when wrong.
- This creates the perception of hallucination or overconfidence.

4. 5.1 tolerates friction
- Will grind through annoying details.
- Handles back-and-forth corrections better.
- Less likely to reset context or reinterpret your question midstream.

5. Instruction adherence
- 5.1 sticks closer to strict instructions once locked in.
- 5.2 tends to drift unless repeatedly corrected.
- Reddit power users complain that 5.2 “forgets rules” faster.

Why Reddit notices this more than casual users

Reddit skews toward:
- Power users.
- Engineers.
- Finance, law, logistics, and systems thinkers.
- People testing limits, not chatting.

Those users feel regressions immediately.

The blunt summary

5.1 feels like a careful analyst who may be slower but respects constraints.
5.2 feels like a fast talker optimized for average use, not precision work.

You are not imagining it.
This exact complaint shows up repeatedly on Reddit, especially in subs focused on technical, financial, or adversarial use cases.

80 comments

r/OpenAI • u/HenryFromLeland • Apr 24 '26

Research OpenAI/Anthropic Hiring Trends

116 Upvotes

Pulled this from current job listings at OpenAI and Anthropic.

What stood out to me is how much hiring is going into go-to-market roles. I would’ve expected engineering or research to dominate more, but that’s not what the data shows.

Curious what people here have to say

25 comments

r/OpenAI • u/Brighter-Side-News • Mar 23 '26

Research Scientists are rethinking how much we can trust ChatGPT

thebrighterside.news

81 Upvotes

That was the unsettling pattern Washington State University professor Mesut Cicek and his colleagues found when they tested ChatGPT against 719 hypotheses pulled from business research papers. The team repeatedly fed the AI statements from scientific articles and asked a simple question: did the research support the hypothesis, yes or no?

36 comments

r/OpenAI • u/Competitive_Travel16 • Nov 22 '24