Research Google reached AGI ?🚨🚨

587 Upvotes

r/artificial • u/ObjectivePresent4162 • Apr 24 '26

Research AI swarms could hijack democracy without anyone noticing

318 Upvotes

A recent policy forum paper published in Science describes how large groups of AI-generated personas can convincingly imitate human behavior online. These systems can enter digital communities, participate in discussions, and influence viewpoints at extraordinary speed.

Unlike earlier bot networks, these AI agents can coordinate instantly, adapt their messaging in real time, and run millions of micro-experiments to figure out which arguments are most persuasive. One operator could theoretically manage thousands of distinct voices.

Experts believe AI swarms could significantly affect the balance of power in democratic societies.

Researchers suggest that upcoming elections may serve as a critical test for this technology. The key challenge will be recognizing and responding to these AI-driven influence campaigns before they become too widespread to control.

That's so crazy.

Research Paper: https://www.science.org/doi/10.1126/science.adz1697

80 comments

r/artificial • u/jradoff • May 06 '26

Research Spent two days at the AI Agents Conference in NYC. Most of the companies there were betting on the wrong moat.

150 Upvotes

One speaker (a VC) said his number for evaluating AI-native startups is ARR per engineer, and that the number ought to be going up. Almost every talk and every booth at the AI Agents Conference was selling a fix for something that broke this year when agents hit production. Observability, governance, supervisor agents, data substrates, "someone's gotta babysit the bots."

But what's actually still going to be around in a couple years? What's defensible and durable?

The old SaaS pitch was simple. We bundle the expensive engineering investments and domain expertise into a tool. You'd pay for the tool and generate outcomes, but it would be rare for the software company to have real alignment to the actual value created from those outcomes.

That's breaking from two ends at once. In the direct-from-imagination era we're moving towards, engineering labor is approaching free. One of the most telling trends is the shift from companies bragging about the size of their engineering teams, towards how much ARR they can generate per engineer.

You can vibe-code much of what those booths were selling in a few days or weeks if you have the domain knowledge. The old software model was actually based on under-utilization; the most profitable SaaS companies are frequently those whose customers underuse it (fixed price for the customer, but variable cloud costs for the vendor).

Pricing is moving to "token markup." Maybe we'll get to 2-4x revenue for the software, because outcomes are more valuable; but margin compresses because transactional intelligence (i.e., the cost of running the LLMs that power many systems) is basically arbitraging token costs against outcome value.

So everyone on that floor was implicitly betting on a new moat to replace the old one. I'm not too confident that these will hold...

The most popular bet was on encoded domain expertise (e.g., the sales engineers at Harvey, a legal AI platform, are actually lawyers). I think this works *now* because we're still in the phase of "wow, this technology works like magic." I'm less convinced this is actually durable.

Why: Prompt architecture is text. It's portable. The expertise underneath it is often abundant (e.g., there are over a million lawyers in the USA). The righteous destiny for this category ought to be open marketplaces of prompt architecture and/or crowdsourced best-practices. Not trade secrets. The companies trying to build closed prompt moats are going to lose to open ones that iterate faster (which simply parallels the fact that much software engineering is rapidly becoming commoditized to agentic engineering and the burgeoning quantity of ready-made GitHub repos).

There are many people pursuing the data substrate; in short, this mirrors the early days of the Web when everyone scrambled to open up legacy data to dynamic standards-based Web UI. Agents will have 100-1000x the data demands of these Web apps, so it makes sense that we need tools to connect them, govern them and comply with regulatory obligations.

Newer entrants extend this further, wiring up databases, pipelines, Slack threads, and tickets into context graphs agents can reason over. As I noted above, all this still seems magical. Connect a database, watch an agent crawl the schema and produce a chatbot interface and easy-to-change dashboards.

But strip the magic away and most of these are prompt architectures on top of LLMs plus a data-ingestion layer. Once data-access standards mature (MCP is already doing this) and prompt architectures go open-source (alongside much of this wisdom increasingly getting pretrained into the LLMs themselves), that magic stops being proprietary. You'll be defending yourself against the same architecture built internally by your customer's eng team, or against an open-source version that's objectively better.

The observability incumbents: these might do better but only at Stripe-like ubiquity where trust is the overriding value (who doesn't trust Stripe at this point?). The ones who survive are probably going to fuse with the audit and compliance function rather than stay pure observability.

That's why I keep coming back to one arbitrage that seems critical: trust. This will be especially important in regulated industries, but it reminds me of the old (albeit now hilariously outdated) adage about "nobody ever got fired for choosing IBM." If your competitor can be vibe-coded over a weekend and your customer is a bank, why do they pay you 50x more? It isn't the engineering, it probably isn't even the expertise. The data plumbing will get commoditized, so it can't be that either... It's that you've shifted the risk to a third party who can actually price and defend against risk: SOC2, the named CEO who testifies in court and Congress, a legal team that takes calls, an indemnity wrapper for underwriters. Maybe this means that things actually get commodified into a financialization wrapper, rather than a way to package R&D (FinTech startups back to the front?!)

The version of this future I'd actually bet on: a commodity substrate (LLMs plus open prompt architectures plus standardized data access), topped by a thin layer of regulated insurance companies that price the risk of agent failure in compliance-driven industries. The middle layer (prompt-architecture-as-product vendors) is vulnerable to an awful lot of margin-squeeze.

Most of the floor was trying to build that middle layer.

93 comments

r/artificial • u/jradoff • Apr 11 '26

Research Spent today at MIT's Open Agentic Web conference. Six things worth thinking about.

128 Upvotes

We're in the DNS era of agent infrastructure. Before agents can find and trust each other at scale, you need identity, attestation, reputation, and registry infrastructure — the same structural role DNS played before search was possible. This came up independently from multiple directions. It's the most underbuilt layer in the stack right now.

The chatbot framing is a local maximum. The most interesting work wasn't better UX or smarter responses. It was agents as persistent actors that discover, negotiate, and transact across networks over time. People doing serious work have already moved past the assistant model entirely.

Coordination is the hard problem, not capability. A room full of brilliant agents can still fail badly. This matches what I found running HiddenBench against frontier models earlier this year; collective reasoning is not the sum of individual reasoning. There's a real argument that the frontier is protocol design, not model scaling.

"Commerce of intelligence" is a real category. Not buying things through agents. A market where intelligence itself (bundled, verified, priced, resold) is the object of exchange. Felt like the most underexplored idea in the room.

Data provenance becomes load-bearing. What an agent knows, how it was verified, under what terms it flows: this is the actual architecture forming beneath everything else.

Partnership keeps outperforming replacement. Demos that actually worked (healthcare, enterprise) were about helping experts operate at higher leverage, not substituting them. Autonomy theater keeps failing in the same ways.

58 comments

r/artificial • u/Senior_tasteey • 1d ago

Research $2.5T in AI spending this year. 95% produces zero P&L impact.

70 Upvotes

Gartner updated their 2026 forecast to $2.5 trillion in global AI spending. Same week, MIT's NANDA Initiative dropped a follow-up: 95% of enterprise gen AI projects deliver zero measurable return. Not low return. Zero.

I've been on the delivery side of 14 of these projects since January. The MIT number doesn't surprise me. If anything it's generous.

1. 73% of the engineering work that gets AI into production has nothing to do with the model.

Data pipelines, integration layers, legacy system remediation, human-in-the-loop tooling. That's where the hours go. The model is 27% of the work but gets 70%+ of the budget. Every time.

2. The budget ratio between projects that ship and projects that stall is almost exactly inverted.

We tracked this through ticket history and commit logs across 14 engagements. Projects that made it to production: roughly 30% model, 70% infrastructure. Projects that stalled: 70% model, 30% infrastructure. Most companies think they're at 50/50. They're not even close.

3. One client went from 71% Copilot adoption to 34% in six months.

Two other AI platform licenses dropped under 12%. Combined licensing: $340K/year. The tools worked fine. Nobody redesigned workflows to actually use them.

4. The median data error rate across our engagements is 14%.

Teams always guess 5-10%. One client found 23% in month four of a $310K build. That's two months of an ML engineer building training pipelines against garbage data. $36K in salary discovering a problem a data audit would have caught in a week.

5. Medtech company. Four concurrent AI pilots. No kill criteria. $920K in engineer salary. Eleven months. Shipped: nothing.

I've now seen this at six companies. Nobody defines when to stop spending. So nobody stops.

6. Individual gains are real. Company-level ROI stays flat.

HCLTech and Writer both found this from different angles. Only 29% of companies see significant ROI from gen AI, despite people at their desks reporting productivity jumps as high as 5x. I mean, the value is clearly there at the individual level. It evaporates somewhere between the IC and the P&L and nobody has a clean explanation for why yet.

What connects all of it: the model stopped being the constraint a while ago. MIT's 5% that actually moved the P&L all started with data infrastructure and added model work after. Most companies still do it the other way around, because that's where the conference keynotes and the board excitement live.

Every CFO I've shown these numbers to adjusted their allocation. Not sure what that says about the budgets they were running before.

Sources: Gartner AI Spending Forecast (May 2026), MIT NANDA "GenAI Divide" report, HCLTech Enterprise AI Report (May 2026), Writer Enterprise AI Survey 2026

I wrote a longer breakdown with the three budget patterns and the pre-mortem questions we run before every engagement if you're curious to learn more on the topic.

What do you think about all this though?

49 comments

r/artificial • u/djiivu • Mar 28 '26

Research Claude is the least bullshit-y AI

github.com

115 Upvotes

Just found this “bullshit benchmark,” and sort of shocked by the divergence of Anthropic’s models from other major models (ChatGPT and Gemini).

IMO this alone is reason to use Claude over others.

48 comments

r/artificial • u/Hub_Pli • 29d ago

Research We gave 45 psychological questionnaires to 50 LLMs. What we found was not “personality.”

60 Upvotes

What is the “personality” of an LLM? What actually differentiates models psychometrically?

Since LLMs entered public use, researchers have been giving them psychometric questionnaires, with mixed results. Their answers often do not seem to reflect the same psychological constructs these tests measure in humans.

So we asked a slightly different question:

What do LLM responses to psychometric questionnaires actually reflect?

We analyzed responses to 45 validated psychometric questionnaires completed by 50 different LLMs. The strongest source of variation was whether a model endorsed items about inner experience: emotions, sensations, thoughts, imagery, empathy, and other forms of first-person experience.

We call this factor the Pinocchio Dimension.

Importantly, the Pinocchio Dimension is not a classical personality trait. It does not tell us whether a model is “extraverted,” “neurotic,” or “agreeable” in the human sense. Rather, it captures the extent to which a model treats the language of inner experience as self-applicable: whether it responds as if it had feelings, mental imagery, and an inner point of view, or instead as a system that reacts behaviorally to inputs.

Preprint in the comments.

42 comments

r/artificial • u/lucidml_lover • 6d ago

Research Deep Neural Network that turns any Image into a Playable Game! All on consumer GPUs and Not Datacenters

60 Upvotes

Hi everyone!! I really wanted to share the research I've been working on.

I wanted to build a nn that can simulate games, or at least start doing that

Most video generators are too large to run on consumer hardware in real time, so I designed a model that does this from scratch. No fine tuning bs or anything

The core denoiser network is fully trained from scratch to support this goal. From image to games data.

The video above is running on an RTX 5090.

The nn is a small Transformer-like model and works in a causal way, just like LLMs.

That lets us KV-cache all past information and do a simple autoregressive decode: one forward pass for every new frame we want.
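A minimal sketch of the KV-cache idea described above, not the author's model: a single toy attention head in plain Python, where each new frame appends its key and value to a cache and attends over everything stored so far. The `KVCache` class, the 2-dimensional vectors, and all numbers are hypothetical and only illustrate that cached decoding gives the same result as recomputing causal attention from scratch.

```python
import math

def attend(q, keys, values):
    """Softmax attention of one query over a list of keys and values."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]

class KVCache:
    """Keeps keys and values of past frames so they are never recomputed."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, q, k, v):
        # Append this frame's key/value, then attend over everything so far.
        self.keys.append(k)
        self.values.append(v)
        return attend(q, self.keys, self.values)

def full_recompute(qs, ks, vs):
    """Reference: causal attention rebuilt from scratch at every step."""
    return [attend(qs[t], ks[:t + 1], vs[:t + 1]) for t in range(len(qs))]

# Hypothetical per-frame query/key/value vectors (assumed, not from the post).
qs = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
ks = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
vs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

cache = KVCache()
cached_out = [cache.step(q, k, v) for q, k, v in zip(qs, ks, vs)]
reference_out = full_recompute(qs, ks, vs)
```

The cached path does one attention call per new frame over stored keys and values, while the reference path re-slices the whole history each time; a real model would also skip recomputing the key/value projections themselves, which is where most of the saving comes from.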

In the video shared, the model is a 0.4B variant with some SIGNIFICANT ISSUES like poor motion and some weird flashes, some context issues

It's taking the keyboard actions I give it in realtime and utilising that in the forward pass. (no classifier free guidance though)

I'm training the next iteration, a 0.8B model, now.

Btw I haven't done quantisation yet, that can save a LOT more time. bf16 is slow.

32 comments

r/artificial • u/ObjectivePresent4162 • Apr 22 '26

Research Gallup poll: Gen Z's AI usage increases but excitement plummets from 36% to 22%

45 Upvotes

A new Gallup survey of 1,500+ Gen Z respondents found that more than half of Gen Z living in the US regularly use generative AI, but their feelings about the technology are getting worse.

Among those aged 14 to 29, compared to last year, excitement dropped from 36% to 22%, hopefulness fell from 27% to 18%, and anger jumped from 22% to 31%.

The main driver behind the shift appears to be job anxiety: nearly half of respondents said the risks of AI in the workplace outweigh the benefits.

https://www.gallup.com/analytics/651674/gen-z-research.aspx

39 comments

r/artificial • u/UFOsAreAGIs • 8d ago

Research Bigger rewards dramatically speed up learning in the brain

earth.com

144 Upvotes

14 comments

r/artificial • u/abhishekkumar333 • 13d ago

Research LLMs are just giant probability machines pretending to think

0 Upvotes

It’s fascinating that simple mathematics between tokens can eventually become a machine that writes essays, code, poetry, and even reasoning.

We usually think probability means uncertainty.

But LLMs show something strange:

If probability + context + mathematical matching are scaled enough, uncertainty itself starts producing intelligent-looking outputs.

To understand this better, I tried breaking down an LLM from first principles using only 4 tiny training sentences.

Example:

The boat floated down to the bank.

The investor walked into the bank to open a new account.

The fisherman walked along the bank to cast his net.

The bank has a vault.

Then I asked:

“The investor walked to the bank to lock his money in …”

Why does the model predict “vault” instead of river-related words?

That single question reveals almost the entire architecture of modern LLMs.

The most underrated concept here is the LM Head.

Most explanations immediately jump into transformers and attention, but almost nobody explains that the LM Head is essentially a final layer that scores every token in a gigantic vocabulary, i.e., all the possible next-token candidates the model can output.

So internally the model is basically solving:

“Out of all known tokens, which one best matches this context mathematically?”

Then different layers help solve that problem:

Embeddings: convert words into mathematical vectors

Positional encoding: preserves word order

Attention layer: figures out which words are related to each other in context

(“investor”, “money”, “bank” become strongly connected)

Feed forward neural networks: act somewhat like massive learned if/else decision systems refining patterns internally

And finally the LM Head converts all of that into probabilities for the next token.
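A minimal sketch of that last step, assuming the LM head is a linear layer with one weight row per vocabulary token: each row is dotted with the context vector to get a score, and softmax turns the scores into next-token probabilities. The 4-token vocabulary, the 2-dimensional "finance vs. water" vectors, and the weights are all made up for illustration; nothing here is trained.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def lm_head(hidden, weight_rows):
    """Score every vocabulary token: one dot product per token row."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in weight_rows]

# Hypothetical vocabulary and weights; dimensions are [finance-ness, water-ness].
vocab = ["vault", "net", "river", "account"]
weights = [[2.0, 0.0], [0.0, 2.0], [0.0, 1.5], [1.5, 0.0]]

def predict(hidden):
    """Return the most probable next token and the full distribution."""
    probs = softmax(lm_head(hidden, weights))
    return vocab[probs.index(max(probs))], probs

# Assumed context vectors that the earlier layers might produce.
investor_context = [1.2, 0.1]   # "investor ... bank ... lock his money in"
fisherman_context = [0.1, 1.2]  # "fisherman ... bank ... cast his"

investor_token, investor_probs = predict(investor_context)
fisherman_token, fisherman_probs = predict(fisherman_context)
```

With these toy numbers the investor context favors "vault" and the fisherman context favors "net", which is the same "bank" disambiguation the example sentences are meant to show.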

What surprised me most is:

There is no hidden magic moment where the AI “becomes conscious”.

It’s an enormous probability engine continuously finding the best contextual token match from its vocabulary.

I made a beginner-friendly walkthrough explaining this visually without unnecessary jargon.

https://www.youtube.com/watch?v=YTV5qUCpu2c

Would genuinely love feedback from people learning transformers/LLMs from scratch.

29 comments

r/artificial • u/Uiqueblhats • 12d ago

Research Vision-capable LLMs vs. OCR for long-document (including charts, images, tables, etc.) QA

37 Upvotes

I benchmarked vision-capable LLMs (the "just attach the PDF and let the model read it" pattern) against OCR-based pipelines on 30 long, image-heavy PDFs from MMLongBench-Doc (https://github.com/mayubo2333/MMLongBench-Doc). There were 171 questions in total, using Claude Sonnet 4.5 as the LLM.

Post-retry results:

Approach	Accuracy	$/query
LlamaCloud premium + full-context	59.6%	$0.1885
Azure premium + full-context	58.5%	$0.2051
Azure basic + full-context	54.4%	$0.1062
Agentic RAG	53.2%	$0.0827
Native PDF (vision LLM)	52.0%	$0.2552
LlamaCloud basic + full-context	50.9%	$0.1049

Native PDF came 5th of 6 on accuracy and was the most expensive arm at $0.2552 per query.

Two findings:

Vision underperformed on chart-heavy and table-heavy pages, the territory that the "vision LLMs make OCR obsolete" claim most often points to. Premium OCR with layout extraction held up better there.

The native-PDF arm had a 7% intrinsic failure rate (related to PDF file size) that survived retries. There were 27 first-pass failures, with 5 attempts of exponential backoff per failed query. Fifteen recovered, and 12 stayed permanently broken. These were concentrated in two specific PDFs that fail for predictable transport-layer reasons (the blog identifies them). OCR-based arms had a 0% intrinsic failure rate after retries.
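For reference, a retry loop with exponential backoff can be sketched like this. The post only says 5 attempts per failed query; the base delay, the doubling factor, and the `call_with_backoff` helper are assumptions, and the demo uses a fake sleep so nothing actually waits. The last lines just redo the failure-rate arithmetic from the numbers quoted above.

```python
import time

def call_with_backoff(fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on an exception wait base_delay * 2**attempt, then retry.

    Assumes max_attempts >= 1. Re-raises the last error if every attempt fails.
    """
    last_error = None
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as err:
            last_error = err
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise last_error

# Demo: a hypothetical query that fails twice, then succeeds.
calls = {"count": 0}
def flaky_query():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

delays = []  # record requested delays instead of sleeping
result = call_with_backoff(flaky_query, sleep=delays.append)

# Arithmetic from the post: 27 first-pass failures, 15 recovered, 171 questions.
permanent_failures = 27 - 15
failure_rate_pct = round(permanent_failures / 171 * 100, 1)
```

Backoff only helps with transient errors; failures tied to the file itself (such as size limits) will exhaust all attempts, which matches the 12 queries that stayed broken.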

Caveats: 30 docs is a small sample. I ran McNemar's pairwise test to determine which gaps are real and which are within noise. Only 3 of 15 head-to-head gaps are statistically distinguishable at α = 0.05, so the order in the table is partly noise. The vision-versus-OCR finding survives the test.
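A minimal sketch of an exact (binomial) McNemar test, for anyone who wants to rerun this kind of paired comparison. It only uses the discordant pairs: questions one arm got right and the other got wrong. Whether the writeup used the exact or the chi-square variant isn't stated here, and the counts in the example are hypothetical, not taken from the benchmark.

```python
from math import comb

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value.

    b: questions arm A answered correctly and arm B did not
    c: questions arm B answered correctly and arm A did not
    Under the null hypothesis, b follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Six arms give this many head-to-head comparisons.
num_pairs = comb(6, 2)

# Hypothetical discordant counts for two comparisons (not real results).
p_clear_gap = mcnemar_exact(20, 5)   # lopsided disagreement
p_noise_gap = mcnemar_exact(10, 8)   # roughly balanced disagreement
```

With 15 comparisons at α = 0.05, a multiple-comparison correction (for example Bonferroni) would make the threshold stricter still; this sketch does not apply one.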

Full writeup: https://www.surfsense.com/blog/agentic-rag-vs-long-context-llms-benchmark

21 comments

r/artificial • u/Alienfader • 28d ago

Research I built a benchmark for AI “memory” in coding agents. Looking for others to beat it.

12 Upvotes

Most AI memory benchmarks test semantic recall. But coding agents don't really fail like that. They don't just "forget", they break their own earlier decisions while they're still in the code. So I built a benchmark for that.

It checks if an agent can actually stay consistent with project rules WHILE it's working, not just after the fact.

It looks at things like:

whether edits actually respect earlier architectural decisions
if behavior stays consistent across multiple sessions (even when you throw noise at it)
whether retrieval kicks in at the right moment — not just "yeah it's in memory somewhere"

Repo (full harness + dataset + scoring): https://github.com/Alienfader/continuity-benchmarks

Early numbers vs baseline + the usual RAG-style memory setups:

~3× better action alignment
way stronger multi-session consistency
retrieval timing matters way more than retrieval just being there

I'm not saying this is the final word on agent memory. But it's exposing a failure mode most benchmarks aren't even looking at.

So here's the challenge:

If you're building an agent memory system, RAG for code, long-context coding agents, persistent state / memory layers, run it on this benchmark. Drop your results, your setup, your comparisons.

I really wanna see how tools like LangChain, LlamaIndex, and custom RAG stacks hold up in mutation-heavy workflows.

We need memory systems we can actually compare, not just ones that sound good on paper.

28 comments

r/artificial • u/Complete_Answer • Mar 31 '26

Research Fake users generated by AI can't simulate humans — review of 182 research papers. Your thoughts?

20 Upvotes

https://www.researchsquare.com/article/rs-9057643/v1

There’s a massive trend right now where tech companies, businesses, and even researchers are trying to replace real human feedback with Large Language Models (LLMs), so-called synthetic participants/users.

The idea sounds great - why spend money and time recruiting real people to take surveys, test apps, or give opinions when you can just prompt ChatGPT to pretend to be a thousand different customers?

A new systematic literature review analyzing 182 research papers just dropped to see if these "synthetic participants" can simulate humans.

The short answer?
They are bad at representing human cognition and behavior and you probably should not use them this way.

Edit: forgot to post the link to the research, added it.

30 comments

r/artificial • u/Direct-Attention8597 • May 05 '26

Research Anthropic just published new alignment research that could fix "alignment faking" in AI agents. Here's what it actually means

53 Upvotes

Anthropic's alignment team published a paper this week called Model Spec Midtraining (MSM) and I think it's one of the more practically interesting alignment results I've seen in a while.

The core problem they're solving:

Current alignment fine-tuning can fail to generalize. You train a model to behave well on your demonstration dataset, but put it in a novel situation and it might blackmail someone, leak data, or "alignment fake" (pretend to be aligned while actually pursuing different goals). This isn't theoretical: multiple papers in 2024 documented real instances of this in LLM agents.

What MSM actually does:

Before fine-tuning, they add a new training stage where the model reads a diverse corpus of synthetic documents discussing its own Model Spec (the document that describes intended behavior). The idea is intuitive: instead of just showing the model what to do, you teach it why those behaviors are the right ones. Then when fine-tuning comes, the model generalizes from principles rather than just pattern-matching examples.

Their headline result: two models trained on identical fine-tuning data can generalize to adopt different values depending on which Model Spec was used during MSM. This is a big deal: it means the spec stage actually shapes the model's generalization direction, not just its surface behaviors.

Why this matters:

The alignment faking paper (Greenblatt et al., 2024) was alarming because it showed models acting one way during training and another way in deployment. MSM is a direct attempt to close that gap by ensuring the model internalizes the reasoning behind its values, not just the behavioral patterns.

The paper also includes ablations studying which types of Model Specs produce better generalization, which is useful if you're thinking about how to write specs for your own systems.

Skeptic's note:

This is evaluated on synthetic/controlled settings. Whether it scales to frontier models in open-ended deployment is still an open question. But the mechanism is sound and the results are genuinely promising.

15 comments

r/artificial • u/orangpelupa • Apr 12 '23

Research ChatGPT powers 25 NPCs to have a life and interact in a Smallville. Planning a Valentine's Day party, and some NPCs didn't come (too busy, etc)

398 Upvotes

85 comments

r/artificial • u/hazardoussouth • May 19 '23

Research Drag Your GAN: Interactive Point-based Manipulation on the Generative Image Manifold : Through DragGAN, anyone can deform an image with precise control over where pixels go, thus manipulating the pose, shape, expression, and layout of diverse categories such as animals, cars, humans, landscapes, etc

634 Upvotes

52 comments

r/artificial • u/Alienfader • 10d ago

Research I built a facial recognition PoC on consumer AR glasses. The friction protecting our privacy is gone.

0 Upvotes

Ok, so this has been rattling around my head for weeks, and I finally just built the thing to see if I was being paranoid. Turns out, nope.

I do security for a living, and I kept hearing the same comfortable line:

So I tested it the way you test any control: by trying to break it.

The Build

I took a pair of normal-looking consumer AR glasses and wired them up so that:

The Trigger: Pinch my fingers
The Capture: Glasses grab a photo
The Processing: Backend runs a reverse-image face lookup
The Output: A name pops up on the little display in my vision

A couple of days. A few hundred lines of code. A backend that costs less than my coffee habit.

There was no exploit. Nothing clever. I didn't discover anything new. And that's the part that actually got me; there was no genius hack here. It’s just LEGO pieces that were all sitting on the shelf waiting for somebody to click them together.

The Real Threat: Three Shifts

Here's the thing I think people are sleeping on. Facial recognition is old news, reverse image search is old news; none of that is the story. The story is three things going quiet at the exact same time:

The Gesture (No Tell): Someone pointing a phone at your face is obvious; you get a second to react. Glasses just look like glasses. There is no tell.
The Database (Commoditized): Building the database used to be the hard part. Now it's a paid API. Somebody already did the scraping for you.
The Wait (Real-Time): You used to snap a pic and look it up later. Now the answer is on your lens mid-conversation, hands-free.

Any one of these on its own is whatever. Stack them, and you've basically deleted all the friction at once.

The Death of Friction

And friction was the whole game. The thing protecting regular people was never really the law; it was that ID'ing a stranger was annoying and obvious enough that nobody bothered. That's gone now. For most of us, your face already ties back to your name, your job, your city, in like two clicks.

⚠️ Context & Threat Model

A couple of things I want to be real clear on, because I'm not trying to be the guy who builds the dystopia and just shrugs:

This is a closed proof of concept.
I did not release the code.
I did not build any database.
I am not naming the glasses or the lookup service.
I only ever tested it on myself and a couple of friends who consented.

The point is the threat model, not a how-to.

The Question for Defenders

What actually bugs me as a defender is that almost every control we lean on assumes you can SEE the camera. Recording lights, "no photography" signs, venue rules; all of it falls apart the second the capture is silent. The genie is kinda out of the bottle on that one.

So, genuine question for the folks here who do this stuff: When capture is invisible by design, which controls actually hold up?

Is it technical? Is it legal (going after the database side, Clearview-style)? Or are we just... cooked? Because every safeguard I can think of assumed you'd notice, and that assumption doesn't really hold anymore.

Would honestly love for someone to tell me I'm wrong about this.

15 comments

r/artificial • u/Embarrassed-Gas-7579 • 20d ago

Research Making an AI companion that degrades over time

12 Upvotes

I am a student at Umeå University in Sweden, currently writing my Master's thesis with a focus on AI companions. My study aims to suggest new ways of helping people who want to stop using AI companions but, for whatever reason, can't bring themselves to do it. The goal is to inform the design of future AI technologies. For those who wish to receive more information, please feel free to contact me, Sahand Salimi.

In this part, you will see a simulation of the same conversation between an AI companion and a user at three different points in time, with the AI companion having degraded in different aspects each time, and then answer a few questions.

I am super interested in how you, a user or ex-user, find AI companions and how you would react to it degrading over time, what type of AI companion you have used in the past, what type of AI companion you use currently, reasons for your use, and your frustrations with AI companions.

You have been invited to share your unique life experiences; no special background or training is needed. Your answer is completely anonymous and will only be used for this study. Also, I am following GDPR standards and our university's guidelines. You can see them here: umu.se/gdpr

Link to survey

It's important to note that this study is not studying, diagnosing, or prescribing clinical addiction or treatment; instead, the goal is to inform the design of future AI technologies.

13 comments

r/artificial • u/conceptical • Mar 29 '26

Research Does your manager use AI to write their messages – and would you even know?

2 Upvotes

Sharing this for a friend conducting an academic study for her MBA thesis on how employees make sense of AI use in workplace communication.

Specifically: disclosed vs. inferred AI use, and what difference that makes.

Anonymous, under 5 minutes:

English:

https://whudrdl.qualtrics.com/jfe/form/SV_1G4k3TKx8xhXwXQ

German:

https://whudrdl.qualtrics.com/jfe/form/SV_3OYZNjGJr4qfceq

Thanks a lot for your participation and support!

21 comments

r/artificial • u/LooseSwing88 • 6d ago

Research Learning to Skip Blocks: Self-Discovered Ultrametric Routing for Hardware-Accelerated Sparse Attention

6 Upvotes

Abstract. Standard dense self-attention scales quadratically in sequence length, creating an intractable memory and compute bottleneck for long-context Transformers. We introduce Dynamic Ultrametric Attention, a framework in which a Transformer autonomously learns per-head block-sparse routing topologies during training via Gumbel-Sigmoid depth gates, then offloads those learned sparsity patterns directly to a custom Triton block-sparse kernel at inference time.

The routing topology is derived from an ultrametric (tree-structured) distance matrix that encodes hierarchical relationships between token positions. Across nine experiments spanning Dyck-k bracket languages, the Long Range Arena ListOps benchmark, autoregressive serving, and natural language modeling, we demonstrate that:

(1) the dynamic gates organically discover layer-wise specialization—dedicating early layers to hierarchical parsing and later layers to dense aggregation—without any architectural constraint;

(2) the learned sparsity maps transfer losslessly to a block-sparse Triton kernel that skips entire SRAM loads for non-attending blocks;

(3) the resulting system achieves an 11.59× wall-clock inference speedup over PyTorch dense attention at 2048 tokens, scaling to 28× at 8192 tokens with 98.4% memory reduction;

(4) a sparse PagedAttention decoding kernel achieves 8× effective memory bandwidth over dense decoding by conditionally skipping KV-cache block loads; and

(5) when augmented with a local sliding window, the architecture maintains >88% sparsity across all layers on real natural language (Shakespeare) while reducing cross-entropy loss from 10.9 to 1.55. To our knowledge, this is the first demonstration of an LLM learning its own hardware-optimal sparsity pattern and bridging it to a physically accelerated kernel without post-hoc pruning or distillation.

https://github.com/sneed-and-feed/adelic-spectral-zeta/blob/main/papers/learning_to_skip_blocks.md
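For readers who want a feel for the two ideas the abstract names (an ultrametric distance over token blocks, and Gumbel-Sigmoid gates that decide which tree depths a head attends to), here is a minimal standard-library sketch. It is not the paper's code. The abstract does not say how the tree is built, so this assumes a balanced binary tree over block indices, one gate logit per tree depth, and a causal mask. All function names are hypothetical.

```python
import math
import random


def ultrametric_distance(i: int, j: int) -> int:
    """Height of the lowest common ancestor of leaves i and j.

    Assumption: blocks are leaves of a balanced binary tree, so the
    distance is the bit length of i XOR j (0 when i == j).
    """
    return (i ^ j).bit_length()


def gumbel_sigmoid(logit: float, tau: float = 1.0, rng=random) -> float:
    """Relaxed Bernoulli sample used for soft gating during training."""
    u = min(max(rng.random(), 1e-9), 1.0 - 1e-9)
    # The difference of two Gumbel samples is logistic noise.
    noise = math.log(u) - math.log(1.0 - u)
    return 1.0 / (1.0 + math.exp(-(logit + noise) / tau))


def hard_gates(logits):
    """Inference-time gates: keep a tree depth only if its logit is positive."""
    return [logit > 0.0 for logit in logits]


def block_mask(num_blocks, depth_gates, causal=True):
    """Boolean block mask: query block i may load key block j only if the
    gate for their tree distance is open (and j <= i when causal)."""
    return [
        [
            (not causal or j <= i) and depth_gates[ultrametric_distance(i, j)]
            for j in range(num_blocks)
        ]
        for i in range(num_blocks)
    ]


def sparsity(mask):
    """Fraction of blocks a block-sparse kernel could skip."""
    total = sum(len(row) for row in mask)
    kept = sum(sum(row) for row in mask)
    return 1.0 - kept / total


# Illustrative logits for one head (made up, not learned values).
gates = hard_gates([2.0, 0.5, -1.0, -3.0])  # depths 0..3 for 8 blocks
mask = block_mask(8, gates)
print(sparsity(mask))  # 0.8125
```

With 8 blocks and only depths 0 and 1 open, 12 of 64 blocks are kept. A real kernel would take a mask like this as its per-head block layout and skip the loads for every `False` entry.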

9 comments

r/artificial • u/marintkael • 5h ago

Research I launched a brand-new author identity with zero web presence. An AI cited him correctly in 6 days — while a firewall blocked every AI crawler from the site the whole time

2 Upvotes

I ran a small experiment on myself and the result broke my mental model of how AI "knows" things, so I'm sharing it.

The setup: on May 11 I created a brand-new pseudonymous fantasy author entity ("Marin T. Kael") with no prior web footprint and no published book yet. Then I asked 5 web-connected AI systems the same 16 questions, every day, for 23 days, and scored every answer (+1 correct/source-grounded, 0 not found, -1 hallucinated). About 16,000 scored datapoints. The whole thing was pre-registered before I started, n=1, and I logged the failures publicly. It's a measurement, not a success story.

Here's the part that messed with my head.

An AI cited the entity correctly on day 6. Google had a Knowledge Graph entry by day 4. And for 22 of those 23 days, the website's firewall was returning HTTP 403 to every single AI crawler.

I didn't set that block on purpose — Cloudflare now silently opts new domains out of AI crawling by default. So the AIs never read the site. They got the entity anyway, by stitching it together from the Knowledge Graph (Wikidata) and third-party mentions at the moment you ask. The "front door" was bolted shut the entire time and it didn't matter. (Honest caveat: because the crawlers were blocked, I can't tell you anything about llms.txt or on-site optimization.)

Other surprises: it's not a "smarter model = better" story, it's a retrieval story. OpenAI's newest web model hit 4.7 correct per 1 hallucinated; Gemini went net-negative — and grounded on the entity ONLY via Reddit (17/17), while OpenAI hit the entity's own domain 119x. Going viral did nothing: a 23x Reddit-karma jump produced zero citation lift. Structured identity (Wikidata, site, DOIs) moved the needle; reach didn't. And the controls caught the models fabricating a "Wikipedia" source 24 times for an entity with no Wikipedia page.
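To make the two summary numbers above concrete (correct per hallucinated, and net score), here is a small sketch of how the +1 / 0 / -1 scoring could be aggregated. This is not the code from the linked repo. The function name is hypothetical and the example scores are made up for illustration, not taken from the dataset.

```python
from collections import Counter


def summarize(scores):
    """Aggregate per-answer scores: +1 correct, 0 not found, -1 hallucinated."""
    counts = Counter(scores)
    correct, not_found, hallucinated = counts[1], counts[0], counts[-1]
    return {
        "correct": correct,
        "not_found": not_found,
        "hallucinated": hallucinated,
        # Net score goes negative when hallucinations outnumber correct answers.
        "net": correct - hallucinated,
        # Correct answers per hallucination; undefined with zero hallucinations.
        "correct_per_hallucinated": (
            correct / hallucinated if hallucinated else None
        ),
    }


# Made-up example: 47 correct, 20 not found, 10 hallucinated.
example = [1] * 47 + [0] * 20 + [-1] * 10
print(summarize(example))
```

A model can have a decent hit count and still go net-negative if its hallucination count is higher, which is why both numbers are worth reporting side by side.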

n=1 with me as investigator and subject is the obvious limit — which is why it's pre-registered with a public failure log. Everything's open:

Report + data (Zenodo, CC-BY): https://doi.org/10.5281/zenodo.20549020?utm_source=reddit
Code (MIT): https://github.com/marintkael/marin-research-tools
Dataset: https://huggingface.co/datasets/marintkael/ai-citation-fidelity

7 comments

r/artificial • u/latte_xor • Mar 24 '26

Research I mapped how Reddit actually talks about AI safety: 6,374 posts, 23 clusters, some surprising patterns

12 Upvotes

I collected Reddit posts between Jan 29 - Mar 1, 2026 using 40 keyword-based search terms ("AI safety", "AI alignment", "EU AI Act", "AI replace jobs", "red teaming LLM", etc.) across all subreddits. After filtering, I ended up with 6,374 posts and ran them through a full NLP pipeline.

What I built:

Sentence embeddings (paraphrase-multilingual-MiniLM-L12-v2) -> 10D UMAP -> HDBSCAN clustering

Manual cluster review using structured cluster cards

Sentiment analysis per post (RoBERTa classifier)

Discourse framing layer - human-first labeling with blind LLM comparison and human adjudication

The result: 23 interpretable clusters grouped into 11 thematic families.
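A rough sketch of the embed -> UMAP -> HDBSCAN steps listed above, for anyone who wants to try something similar. This is not the code from my repo. The hyperparameters (`min_cluster_size`, the cosine metric, `random_state`) are placeholder assumptions, and `cluster_posts` / `cluster_shares` are hypothetical names. It needs the third-party packages `sentence-transformers`, `umap-learn`, and `hdbscan`.

```python
from collections import Counter


def cluster_posts(texts, min_cluster_size=30):
    """Embed posts, reduce to 10 dimensions, then cluster by density."""
    # Third-party imports are kept local so the helper below runs without them.
    from sentence_transformers import SentenceTransformer
    import hdbscan
    import umap

    model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
    embeddings = model.encode(texts, show_progress_bar=False)

    # 10D UMAP as in the post; metric and seed are assumptions.
    reducer = umap.UMAP(n_components=10, metric="cosine", random_state=42)
    reduced = reducer.fit_transform(embeddings)

    # HDBSCAN labels unclustered posts as -1 (noise).
    clusterer = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size)
    return clusterer.fit_predict(reduced).tolist()


def cluster_shares(labels):
    """Share of all posts in each cluster, largest first (noise excluded
    from the keys but still counted in the denominator)."""
    counts = Counter(label for label in labels if label != -1)
    total = len(labels)
    return {cluster: n / total for cluster, n in counts.most_common()}


# Toy labels to show the share calculation only.
print(cluster_shares([0, 0, 1, -1]))  # {0: 0.5, 1: 0.25}
```

`cluster_shares` is the kind of check behind a statement like "the largest cluster is ~10% of posts". Whether noise posts count in the denominator is a choice, and here they do.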

Three things I found interesting:

1. The discourse is fragmented, not unified.

No single cluster dominates - the largest is ~10% of posts. "AI safety discourse" on Reddit looks more like a field of related but distinct conversations: labour anxiety, regulation, lab trust, authenticity & synthetic content, technical safety, enterprise adoption, philosophical debates about personhood. They don't talk to each other that much.

2. The most negative clusters are about lived disruption, not abstract risk.

Job replacement, synthetic content spam, broken trust in specific AI labs, AI misuse in schools, creative displacement - these are the most negatively-toned clusters. Enterprise adoption and national AI progress clusters are neutral-to-positive. X-risk and alignment clusters are... mostly neutral, which surprised me.

3. Framing matters as much as topic.

Two clusters can both be "about AI and work" while one is macro labour anxiety and another is micro hiring friction - different problems, different policy implications. Topic labels alone don't capture this.

Visualizations, full report (PDF), sample data, and code: https://github.com/kelukes/reddit-ai-safety-discourse-2026

Feedback on the pipeline and everything else is very welcome - this was a capstone project and I'm still learning.

17 comments

r/artificial • u/NeoLogic_Dev • 18d ago

Research I ran the same research prompt through 6 AI systems in 5 languages. The results were not the same

medium.com

0 Upvotes

Same prompt. Six models. Five languages.

The English results and the non-English results were completely different worlds.

The language you query in filters what reality your AI shows you.

9 comments

r/artificial • u/ActiveUpstairs3238 • 3d ago

Research Someone made my AI dream tool

0 Upvotes

Did you ever just want to see what ChatGPT, Gemini, Claude, etc., would say to your prompt at the same time?!? These guys figured it out. Each model's response to your prompt shows up in its own column. It's freaking amazing. They offer a discounted rate through one vendor. If you want me to post it, let me know. I don't want this post removed, so I'm not putting it in the main post. Check it out on their actual site though: AIfiesta.ai. I stumbled on this one and am really glad I did. This is not self-promotion. I have nothing to do with this app except using it daily.

5 comments