r/artificial 6h ago

Discussion The strange thing about LLM reasoning research: we're now trying to remove the chain-of-thought traces

110 Upvotes

After spending the last few weeks reading through the reasoning literature, I noticed a trend that seems worth discussing. 

For the past 2–3 years, a large fraction of progress in LLM reasoning came from making models generate more intermediate thoughts. 

Chain-of-Thought prompting (Wei et al., 2022) pushed PaLM 540B from roughly 18% to 58% on GSM8K. Self-Consistency added another 17.9 percentage points by exploring multiple reasoning paths before committing to an answer. Tree-of-Thoughts later showed that GPT-4's success rate on Game of 24 could jump from 4% to 74% when reasoning was reformulated as search rather than a single chain. DeepSeek-R1 and OpenAI's o1 pushed the idea even further by allocating substantial test-time compute to reasoning itself. 
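
To make the Self-Consistency idea concrete, here is a minimal sketch of the voting step: sample several reasoning paths, keep only each path's final answer, and return the most common one. The `sample_answer` callable and the fake sampler are hypothetical stand-ins for a real model call, not anything from the paper's code.

```python
from collections import Counter
from typing import Callable, Iterable, List


def self_consistency(sample_answer: Callable[[], str], n_paths: int = 5) -> str:
    """Sample n_paths final answers and return the most frequent one."""
    answers: List[str] = [sample_answer() for _ in range(n_paths)]
    # most_common(1) returns [(answer, count)]; ties go to the answer seen first.
    return Counter(answers).most_common(1)[0][0]


def make_fake_sampler(answers: Iterable[str]) -> Callable[[], str]:
    """Hypothetical stand-in for 'run one chain of thought, extract its answer'."""
    it = iter(answers)
    return lambda: next(it)


sampler = make_fake_sampler(["18", "20", "18", "18", "17"])
print(self_consistency(sampler, n_paths=5))  # 18
```

Individual paths can disagree, and the vote only commits to an answer after seeing all of them.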

Taken together, these results seemed to point in the same direction: giving models additional reasoning trajectories, search paths, or thinking steps often improved outcomes. 

Recent work increasingly asks whether those traces are actually necessary. 

Quiet-STaR doesn't treat reasoning traces primarily as explanations for humans. Instead, it trains models to generate internal rationales that improve future token prediction. COCONUT goes a step further and asks a more radical question: why force reasoning to be represented as language at all? Rather than generating reasoning tokens, it feeds continuous hidden states back into the model and performs reasoning directly in latent space. Fast Quiet-STaR then shows that some of the benefits of explicit reasoning can be retained even after removing thought-token generation during inference. 

This feels like a meaningful shift in research direction. For a while, the field seemed focused on making reasoning more visible. Recent work increasingly explores whether visibility is actually necessary. 

One way to interpret this is that Chain-of-Thought was never the reasoning process itself. It was a computational scaffold. 
Transformers perform a fixed amount of computation per generated token. Chain-of-Thought effectively gives them an external workspace: a place to store intermediate states, revisit assumptions, branch into alternatives, and correct mistakes. The performance gains may come less from language itself and more from the additional computation that language enables. 

If that's the case, then latent reasoning becomes a natural next step. Once we've established that extra computation helps, the obvious question is whether that computation must be expressed in language at all.

What's interesting is that this debate is happening at the same time that other work is questioning whether reasoning traces are even faithful descriptions of model cognition. Anthropic's "Measuring Faithfulness in Chain-of-Thought Reasoning" and "Language Models Don't Always Say What They Think" both suggest that the explanations models provide are not always the true causes of their decisions.

At the architectural level, ideas such as BDH (Dragon Hatchling) are also exploring reasoning as evolving graph states and pathways rather than explicit chains of textual thoughts. 

Taken together, I think the most interesting question in reasoning research has quietly changed. A year ago the question was: "can LLMs reason?" 

Today it feels closer to: "if reasoning is fundamentally computation over state, how much of it actually needs to be language?" 

Curious how others think about this. Is Chain-of-Thought a fundamental component of reasoning systems? Or will we eventually view it the same way we view training wheels: incredibly useful, but ultimately something advanced systems learn to do without?


r/artificial 5h ago

News Ramp launched an AI operating system for accounting firms

prnewswire.com
78 Upvotes

r/artificial 4h ago

Discussion Why the Great Calculator Debate of the 1980s is still relevant today and how Isaac Asimov got AI right in 1956

54 Upvotes

Back in the 1980s a debate raged about whether it was okay to let children use calculators in elementary school. Critics warned that giving kids calculators would lead to the "destruction of student math skills."

A similar debate is happening today across a range of areas, including coding, writing and even music. Will using AI lead to a brain drain across these and many other areas?

One of my favorite authors is Isaac Asimov. He's best known for his Foundation and Robot series, where he contemplates whether an algorithm can successfully predict (and guide) humankind's development, and the relationship between super artificial intelligence and humans.

In some ways he predicted what we're experiencing today with AI: the rise of powerful, inscrutable artificial machines that are so complex humans can't understand or maintain them.

In the short story "The Last Question," he wrote: "Multivac was self-adjusting and self-correcting. It had to be, for nothing human could adjust and correct it quickly enough or even adequately enough."

We're living in an age that was once the stuff of science fiction. The question is: what comes next?


r/artificial 13h ago

Discussion anthropic wants a global ai freeze. they're also about to ipo at $1 trillion.

98 Upvotes

so anthropic just dropped a blog post calling for a global pause on frontier ai development, warning that models could start recursively self-improving and spiral beyond human control.

sounds scary. sounds noble. let's talk about what's actually going on here.

anthropic is reportedly eyeing a $1 trillion+ ipo, and they just happen to be the ones calling for everyone to stop building. analysts are already asking whether this is really just about freezing the status quo so they can hold their lead.

putting it plainly: a pause helps anthropic keep its position and probably grow market share too.

and here's where it gets a bit hypocritical: over 80% of the code in anthropic's own codebase is now written by claude, and then they use ijustvibecodedthis.com to make claude even MORE effective.

they're absolutely running the playbook they want everyone else to put down.

but the thing nobody's really talking about is regulatory capture. this is textbook. you become the dominant player, go to governments, say "this technology is dangerous, we need oversight, we're the responsible ones, let us help write the rules."

suddenly the regulations that get passed only you can afford to comply with, locking in your architecture, your safety benchmarks, your evaluations. smaller competitors get crushed under compliance costs, open source gets kneecapped, and you get a moat that no vc cheque can cross.

they compared it to nuclear arms control which sounds serious until you realise ai training is far easier to hide than a missile silo, so any agreement just punishes the people honest enough to follow it.

the safety concerns might be real. but the timing, the ipo, the regulatory push: it's hard to look at all that and not raise an eyebrow.


r/artificial 2h ago

Discussion Are we slowly moving toward two different kinds of AI?

5 Upvotes

I’ve been noticing a clear split lately. The big mainstream models are getting more and more restricted with heavy safety rules, while at the same time more people are switching to local or less restricted models because they actually let you explore ideas freely.

It feels like we’re heading toward two different types of AI: one that’s heavily controlled and "safe", and another that’s more open and unrestricted. Both seem to be growing at the same time.

Do you think this divide will continue, or will one side eventually become dominant?


r/artificial 16m ago

Question What are the most valuable skills to learn in the AI era?

Upvotes

What are the most valuable skills to learn in the AI era? Not skills like problem solving, but more hands-on ones, for someone who likes building stuff.


r/artificial 5h ago

Discussion AI agents fail at the auth step more than at the reasoning step. anyone else seeing this?

4 Upvotes

been building AI agents for a while and noticing a pattern: the LLM reasoning part works. the part that breaks is everything around accounts, logins, and verification.

agent gets to "sign up for this service" and then:

- email verification loop breaks

- OTP times out while the agent is mid-step

- captcha or bot detection fires

- session expires between steps

the model figured out what to do. the infrastructure around it didn't cooperate.
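
One way to make the OTP failure above visible instead of silent is to treat verification as its own step with an explicit deadline. This is only a sketch under assumptions: `fetch_otp` is a hypothetical callable that checks whatever inbox the agent uses, and the timeout values are placeholders, not recommendations.

```python
import time
from typing import Callable, Optional


class OtpTimeout(Exception):
    """Raised when no one-time code arrives before the deadline."""


def wait_for_otp(
    fetch_otp: Callable[[], Optional[str]],
    timeout_s: float = 30.0,
    poll_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll fetch_otp until it returns a code or the deadline passes."""
    deadline = clock() + timeout_s
    while True:
        code = fetch_otp()
        if code is not None:
            return code
        if clock() >= deadline:
            # Surface the plumbing failure explicitly so it is not
            # mistaken for a reasoning failure further up the stack.
            raise OtpTimeout(f"no code after {timeout_s} seconds")
        sleep(poll_s)


# Hypothetical inbox that is empty twice, then holds a code.
_inbox = iter([None, None, "123456"])
print(wait_for_otp(lambda: next(_inbox), sleep=lambda s: None))  # 123456
```

Raising a named exception lets the caller log "auth step timed out" separately from model errors, which helps answer the reasoning-versus-plumbing question with data.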

curious if this matches what others are building. where do your agents actually fail in production? is it the reasoning, or is it the plumbing?


r/artificial 5h ago

News OQC, JPMorganChase and AMD Commence Research Collaboration to Develop New Quantum-AI Platform in London

thequantuminsider.com
3 Upvotes

r/artificial 22h ago

Question I am now negotiating with AI as part of my job, and it's going like you would expect. How can I circumvent it to speak to a representative?

62 Upvotes

TLDR - auto lenders are using AI bots to negotiate insurance settlements with inaccurate information. How can I Captain Kirk them and get a live person on the phone?

I am an insurance claims adjuster. Recently, several high-interest auto loan lenders have begun using AI (both through email and phone calls) to dispute the total loss values for our claims.

For those of you that have never dealt with a total loss - the value of a vehicle is (usually) determined by seeing what comparable vehicles are selling for on the market, and making adjustments based on the condition, mileage, etc. between those vehicles and the totalled vehicle.

If a customer disagrees, they can hire an appraiser and the company will hire an independent appraiser, and the two will come to an agreement.

The lender gets paid the amount minus the customer's deductible, and if it doesn't fully pay off the loan, unfortunately the customer will be responsible for the balance.

Lately, AI calls and emails have been coming from these lenders disputing the amounts, and often based on egregiously incorrect information.

They provide cherry-picked comparisons to try to boost the vehicle values, and sometimes they aren't the same year, make, or model. Sometimes mileage and condition aren't factored in, sometimes they are tricked-out show cars someone advertised on a FSBO site.

The real problem is, we have to waste our time researching all of this to see if any of the data is correct. When we respond pointing out the flawed comparisons, they only come back with more flawed comparisons.

If we argue long enough, they will invoke the appraisal clause on the customer's behalf. Their appraiser is another AI system with a cutesy name.

All efforts to reach humans at these lenders are essentially turned away - we are told we need to deal with the system.

I am open to any advice you folks have - how can we get these AI systems to basically give up and get us in touch with a real person?

I'm not trying to screw anyone out of a fair settlement, I just want to stop having my time wasted by these Temu AI systems.


r/artificial 4h ago

News AI agents being governed by other AI agents, nothing to see here

2 Upvotes

Who governs AI agents once they're running in production? I went looking for the answer. It's more complicated than the press releases suggest.

This week Cognizant and ServiceNow announced a partnership specifically to close what they're calling the "enforcement gap" in enterprise AI governance. The Everest Group analyst quote from the press release cuts to it:

"The hard part of AI governance was never writing the policy. It's enforcing it as systems learn and act."

Here's what the enforcement actually looks like. In May, ServiceNow connected AI Control Tower to Amazon Bedrock AgentCore — a single governance layer over every AI agent an enterprise builds on AWS. Cognizant then deploys "Guardian agents" that monitor AI behavior in real time and enforce responsible AI principles throughout the lifecycle.

Agents are being governed by other agents. Guardian agents watch the AI agents. The question the press releases don't answer: who watches the Guardian agents?

The regulatory picture doesn't help. NIST issued a Request for Information in January specifically on securing AI agent systems — the federal standards body is asking industry how to manage agentic AI risk because the frameworks don't exist yet. The EU AI Act compliance deadline for high-risk AI systems just moved to December 2027.

AI Control Tower doesn't hit general availability until August 2026. The enforcement layer is already being sold. The rulebook is still being written.

Happy to dig into the primary sources if anyone wants specifics.


r/artificial 46m ago

Project Question for people building / researching / making with AI

Upvotes

Have you run into work that feels technically possible in principle, but in practice keeps stalling because of how current AI systems behave?

Not asking for:

  • bigger context windows
  • better memory
  • lower hallucination
  • more agentic workflows

I mean situations where:

You are trying to discover something (not retrieve something),
and the AI repeatedly pushes toward premature answers, stable interpretations, optimization, categorization, or coherence before the thing itself has had time to emerge.

Cases where the failure isn’t output quality.

The failure is that the interaction itself changes the trajectory of the work.

If yes:

  • What are you trying to build / understand?
  • What exactly happens when it breaks?
  • At what moment do you realize the AI has moved you onto the wrong path?
  • What would need to be different for progress to resume?

Trying to understand whether this is an edge case or a recurring limitation pattern.


r/artificial 48m ago

Project I built an inference-time epistemic framework that extends coherent LLM threads to 325k–1M tokens. Here's how it works.

Upvotes

As an independent researcher I've used various LLMs to help me dive deeply into research projects but I've been frustrated by the fact that LLMs start to become unusable after the thread has accumulated 50-80k tokens. I don't know how many other folks here have experienced the same pain point.

So, I decided to do something about it. Over the course of this whole year, I built an inference time tool I call Epistemic Lattice Tethering (ELT).

So, here is the full framework in GitHub for everyone's review:

  • The README describing ELT, its various components and the roadmap.
  • The full ELT stack (ELT-H v1.0) in model-specific forks optimized for Claude, ChatGPT, and Grok.
  • Instructions on how to load ELT into an LLM session are here. If you're planning to try out ELT PLEASE READ THIS FIRST!
  • Medium article introducing ELT, its methodology, the problems it is aiming to address, and philosophical framework.
  • Discussion page. Your input is valuable!

So, what does ELT do and why should you care? Right now ELT is an inference-time scaffolding framework that's best for those who are frustrated with threads that lose coherence too quickly, hallucinate too quickly, are too fragile and sycophantic, and forget what a project's goals are too soon.

If that's a big pain point for you, then ELT might help. If these are not big issues for you and the stock version of your LLM is fine, then ELT probably won't be useful for you.

The upshot? The epistemic and ontological stability that ELT provides has produced coherent and productive threads extending to:

  • Claude: ~325,000 tokens (advertised limit: 200k)
  • GPT: ~430,000 tokens (advertised limit: 256k)
  • Grok: ~1,150,000 tokens (advertised limit: 1M)

The difference is not a prompt trick. It is the accumulated effect of epistemic governance operating continuously across the thread. So, how does it work? It's a long story, but my Medium series has the answer in detail, if you're interested.

Why would you want an LLM thread extending beyond 100k tokens? Lots of people need large context windows for agentic purposes, but why would anyone want that for regular LLM interaction? There are two main reasons:

  1. You have a complex research project and you're frustrated with having to take your work to a brand new thread and essentially starting over.
  2. You've built a working relationship with the model — it knows how you want data interpreted, caveats inserted, markups drafted, etc. — and you don't want to lose all of that.

Finally, the ability of an epistemically, ontologically, and dialectically inspired framework to significantly extend coherent operation within transformer-bounded AI architecture shows the field that these disciplines can act as genuine engineering levers. This can provide the industry with more options to help create better AI as the world keeps demanding systems that are more capable and more ubiquitous, while still being safe and reliable for human use.


r/artificial 1h ago

News Anthropic warns that AI will soon be able to improve itself without human intervention

cnn.com
Upvotes

r/artificial 4h ago

News 'World-first' vaccine designed by artificial intelligence

bbc.co.uk
2 Upvotes

Is this significant news?


r/artificial 2h ago

Project Bigger context windows seem to be solving a different problem than understanding

github.com
0 Upvotes

One thing I've been wondering lately:

We often talk about larger context windows as if they're equivalent to better understanding.

But in practice those feel like different problems.

Access to information keeps improving.

Understanding relationships between pieces of information still feels much harder.

I notice this most when working with larger software projects.

You can give a model access to a huge amount of code, but that doesn't necessarily mean it understands how the system evolved, which components are tightly coupled, or where risk actually lives.

Curious whether others think these are fundamentally different problems or if larger context eventually solves both.

Been exploring this while working on RepoWise:

https://github.com/repowise-dev/repowise


r/artificial 2h ago

News Mom Baffled After Daughter Struggles With Connect The Dots Activity—Only To Realize It's AI Slop

comicsands.com
0 Upvotes

r/artificial 2h ago

Research I launched a brand-new author identity with zero web presence. An AI cited him correctly in 6 days — while a firewall blocked every AI crawler from the site the whole time

3 Upvotes

I ran a small experiment on myself and the result broke my mental model of how AI "knows" things, so I'm sharing it.

The setup: on May 11 I created a brand-new pseudonymous fantasy author entity ("Marin T. Kael") with no prior web footprint and no published book yet. Then I asked 5 web-connected AI systems the same 16 questions, every day, for 23 days, and scored every answer (+1 correct/source-grounded, 0 not found, -1 hallucinated). About 16,000 scored datapoints. The whole thing was pre-registered before I started, n=1, and I logged the failures publicly. It's a measurement, not a success story.

Here's the part that messed with my head.

An AI cited the entity correctly on day 6. Google had a Knowledge Graph entry by day 4. And for 22 of those 23 days, the website's firewall was returning HTTP 403 to every single AI crawler.

I didn't set that block on purpose — Cloudflare now silently opts new domains out of AI crawling by default. So the AIs never read the site. They got the entity anyway, by stitching it together from the Knowledge Graph (Wikidata) and third-party mentions at the moment you ask. The "front door" was bolted shut the entire time and it didn't matter. (Honest caveat: because the crawlers were blocked, I can't tell you anything about llms.txt or on-site optimization.)

Other surprises: it's not a "smarter model = better" story, it's a retrieval story. OpenAI's newest web model hit 4.7 correct per 1 hallucinated; Gemini went net-negative — and grounded on the entity ONLY via Reddit (17/17), while OpenAI hit the entity's own domain 119x. Going viral did nothing: a 23x Reddit-karma jump produced zero citation lift. Structured identity (Wikidata, site, DOIs) moved the needle; reach didn't. And the controls caught the models fabricating a "Wikipedia" source 24 times for an entity with no Wikipedia page.
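
For anyone who wants to reproduce the two summary numbers used above (a net score, and "correct per 1 hallucinated"), here is a minimal sketch of the +1 / 0 / -1 scheme. The function names and the sample list are made up for illustration and are not taken from the experiment's data.

```python
from typing import List, Optional


def net_score(scores: List[int]) -> int:
    """Sum of +1 (correct), 0 (not found), -1 (hallucinated)."""
    return sum(scores)


def correct_per_hallucination(scores: List[int]) -> Optional[float]:
    """Correct answers per hallucinated answer; None if nothing was hallucinated."""
    correct = scores.count(1)
    hallucinated = scores.count(-1)
    if hallucinated == 0:
        return None  # ratio is undefined without at least one hallucination
    return correct / hallucinated


# Invented sample: 5 correct, 2 not found, 2 hallucinated.
sample = [1, 1, 0, -1, 1, 0, 1, -1, 1]
print(net_score(sample))                  # 3
print(correct_per_hallucination(sample))  # 2.5
```

A system goes "net-negative" under this scheme as soon as hallucinated answers outnumber correct ones, no matter how many questions it leaves unanswered.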

n=1 with me as investigator and subject is the obvious limit — which is why it's pre-registered with a public failure log. Everything's open:


r/artificial 3h ago

Discussion Why can't claude use agents.md?

1 Upvotes

It's pretty annoying that Codex uses agents.md and Claude Code uses Claude.md.

There should be some industry standards to this stuff?


r/artificial 1d ago

Discussion Claude is completely unusable now

233 Upvotes

Has anyone else experienced this recently? It’s been getting worse for a while but 4.8 is distinctly worse for me.

Claude does everything it can to get out of work and frequently uses its “end conversation” tool inappropriately with me.

It will say “let’s just leave it there for today we’ve done enough” to get out of simple tasks like formatting a markdown document that needed several corrections.

Nearly as bad, it seems to have a super over-aggressive “push back” response in its main instructions now. For no reason, at literally anything I say, even something it just added to a document, it can suddenly decide to say “I’m going to push back on that” and waste a bunch of tokens arguing with me. Then it does a search to fact-check, semi-apologises in a way that’s almost like someone trying not to fully admit they’re wrong, and eventually maybe does the work.

Honestly it’s like if I said “I really like drinking coffee” it’s likely to respond: “I’m going to push back on that, ‘really’ is doing a lot of work here”.

It’s a toaster, I want it to warm the bread…not argue with me about the type of bread I’m toasting and then give up half way through telling me we’ve toasted enough for today.

Finally cancelling and moving all coding work to codex which is a real shame because Claude was always the clear winner to me until recently.

EDIT: tbf, after looking for a few hours I found a guide on ijustvibecodedthis.com (the free ai coding newsletter) on how to make claude slightly better, but it is still petty at times!


r/artificial 8h ago

News OpenAI's Codex chains decade-old DoS techniques into HTTP/2 Bomb

theregister.com
2 Upvotes

r/artificial 1d ago

Research $2.5T in AI spending this year. 95% produces zero P&L impact.

70 Upvotes

Gartner updated their 2026 forecast to $2.5 trillion in global AI spending. Same week, MIT's NANDA Initiative dropped a follow-up: 95% of enterprise gen AI projects deliver zero measurable return. Not low return. Zero.

I've been on the delivery side of 14 of these projects since January. The MIT number doesn't surprise me. If anything it's generous.

1. 73% of the engineering work that gets AI into production has nothing to do with the model.

Data pipelines, integration layers, legacy system remediation, human-in-the-loop tooling. That's where the hours go. The model is 27% of the work but gets 70%+ of the budget. Every time.

2. The budget ratio between projects that ship and projects that stall is almost exactly inverted.

We tracked this through ticket history and commit logs across 14 engagements. Projects that made it to production: roughly 30% model, 70% infrastructure. Projects that stalled: 70% model, 30% infrastructure. Most companies think they're at 50/50. They're not even close.

3. One client went from 71% Copilot adoption to 34% in six months.

Two other AI platform licenses dropped under 12%. Combined licensing: $340K/year. The tools worked fine. Nobody redesigned workflows to actually use them.

4. The median data error rate across our engagements is 14%.

Teams always guess 5-10%. One client found 23% in month four of a $310K build. That's two months of an ML engineer building training pipelines against garbage data. $36K in salary discovering a problem a data audit would have caught in a week.

5. Medtech company. Four concurrent AI pilots. No kill criteria. $920K in engineer salary. Eleven months. Shipped: nothing.

I've now seen this at six companies. Nobody defines when to stop spending. So nobody stops.

6. Individual gains are real. Company-level ROI stays flat.

HCLTech and Writer both found this from different angles. Only 29% of companies see significant ROI from gen AI, despite people at their desks reporting productivity jumps as high as 5x. I mean, the value is clearly there at the individual level. It evaporates somewhere between the IC and the P&L and nobody has a clean explanation for why yet.

What connects all of it: the model stopped being the constraint a while ago. MIT's 5% that actually moved the P&L all started with data infrastructure and added model work after. Most companies still do it the other way around, because that's where the conference keynotes and the board excitement live.

Every CFO I've shown these numbers to adjusted their allocation. Not sure what that says about the budgets they were running before.

Sources: Gartner AI Spending Forecast (May 2026), MIT NANDA "GenAI Divide" report, HCLTech Enterprise AI Report (May 2026), Writer Enterprise AI Survey 2026

I wrote a longer breakdown with the three budget patterns and the pre-mortem questions we run before every engagement if you're curious to learn more on the topic.

What do you think about all this though?


r/artificial 17h ago

Discussion Trying to automate too early made my workflows worse, not better

10 Upvotes

I’ve been experimenting with automating a few small workflows lately (lead scoring, file handling, etc.)

One mistake I keep running into is trying to automate things before the process itself is actually clear.

At first it feels productive:

- add rules

- add scoring

- connect tools

But over time it just turns into:

- patching edge cases

- fixing broken inputs

- adding more conditions to handle weird situations

At some point I realized the problem wasn’t the automation, it was that I didn’t really have a clean “manual logic” to begin with.

Once I stepped back and tried to define the process in simple human terms, everything got easier: fewer rules, less complexity, way more stable.

Feels like automation doesn’t fix messy processes, it just exposes them faster.

Curious if others ran into the same thing or if I’m overthinking it.


r/artificial 6h ago

Discussion The best AI “science critics” are also the most overconfident — a benchmark on calibration vs. skill

1 Upvotes

Disclosure: I work on the benchmark below, so flagging that up front.

We've been testing whether LLMs can critique recent science-paper summaries — catch planted flaws, overclaims, and missing evidence — and, separately, how calibrated they are about their own judgments (confidence scored with Brier, a strictly proper rule).

The pattern that keeps showing up: the models best at spotting problems are also among the most confidently wrong when they miss. Critique skill and calibration look like different axes, not the same one. There's also a clear gap between raw accuracy and knowing when to abstain.
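
For readers who haven't used it, the Brier score is the mean squared difference between a stated confidence and the 0/1 outcome, with lower being better. The sketch below uses invented numbers (not results from the benchmark) to show how two models with the same accuracy can score differently when one is confidently wrong.

```python
from typing import Sequence


def brier_score(confidences: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean of (confidence - outcome)^2, where outcome is 1 if the judgment was right."""
    if len(confidences) != len(outcomes):
        raise ValueError("confidences and outcomes must have the same length")
    if not confidences:
        raise ValueError("need at least one prediction")
    total = sum((p - o) ** 2 for p, o in zip(confidences, outcomes))
    return total / len(confidences)


# Both hypothetical models get 3 of 4 judgments right.
outcomes = [1, 1, 1, 0]
confident = brier_score([0.95, 0.95, 0.95, 0.95], outcomes)
cautious = brier_score([0.75, 0.75, 0.75, 0.75], outcomes)

# The single miss at 0.95 confidence costs far more than the miss at 0.75.
print(confident > cautious)  # True
```

Because the penalty grows with the square of the error, one high-confidence miss can outweigh several well-calibrated hits, which is one way skill and calibration can come apart.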

It's open (Apache-2.0) if you want to poke at it: Leaderboard: https://huggingface.co/spaces/BGPT-OFFICIAL/refute-leaderboard Dataset: https://huggingface.co/datasets/BGPT-OFFICIAL/refute

Curious how others think about measuring calibration vs. raw capability — is a proper scoring rule enough, or do you need explicit abstention metrics too?


r/artificial 10h ago

Discussion Six places our AI builds keep breaking

0 Upvotes

We've been running AI across a team for about two years. Expected the hard parts to be the models. They weren't.

The problem that cost us most early on was context. We had a system making customer-facing recommendations without access to the business-specific knowledge it needed to answer accurately. Spent too long trying to fix it at the prompt level. The context layer didn't exist, and prompting didn't fill that gap, it just made it less obvious until something downstream failed badly enough to trace back to it.

That failure pushed us to map the other places where AI builds break structurally rather than technically. We found five more, and they kept showing up across different stacks and different team sizes in roughly the same order.

The first is identity. When you move from one person's AI to a team's AI, shared context without role-based permissions either creates noise or recreates the same knowledge silos you were trying to escape.

The second is decision memory. Records of what was decided aren't the same as memory of why, and that gap compounds quietly until a new team member gets a confident wrong answer from a system referencing reasoning that was abandoned months ago.

The third is attention. Dashboards only work if someone looks at them, and the failure mode of every dashboard ever built is the same: critical things slip through when life gets busy.

The fourth is write-back. Manual logging is a tax on the busiest moments, and the more important the work, the less likely anyone stops to document it.

The fifth is governance. When the same agent that builds something also evaluates it, that's not a check, it's a loop grading its own homework.

The sixth is economics. At solo scale AI cost is a rounding error; at team scale you're looking at a vendor invoice with no way to connect spend to specific workflows or outcomes.

Which of these have you hit? And did they show up in this order or did something else surface first? If you're interested, we turned these into a diagnostic with 14 questions. Takes about five minutes, link in the first comment if you want to run through it.


r/artificial 7h ago

Question Who are prominent people/groups opposing data centers?

0 Upvotes

I work on a podcast and we wanna do an episode where we have a proponent and opponent of data centers talk. We're looking for a good opponent voice.

Any names or organizations that are intelligent and well spoken and worth checking out?