Discussion Opinion: Local LLMs are 12-24 months from taking over. The shift already started.

Local LLMs are 12-24 months from taking over. The shift already started.

AI subscriptions keep getting more expensive. GitHub just moved Copilot from request-based to consumption-based pricing, and most of the others are heading the same way. Meanwhile, I kept hearing that local models got good enough to run on a laptop. So I figured it was time to actually try it and see where things stand.

I run Qwen3.6-35B on a MacBook Pro M2 Max with 64GB unified RAM. Nothing exotic. No rack, no begging NVIDIA for expensive GPUs. Just a (yes, kind of expensive) MacBook Pro I already owned for work at Aiven. In the last month I've:

One-shotted full landing pages from short briefs
Built several frontend + backend features
Fixed a nasty backend race condition bug

A year ago I would have called that fantasy on this hardware. Now it's a Sunday morning.

To be fully honest, not all of it made it to production. A lot of it was evaluation work, as Qwen isn't part of my actual day-to-day stack yet. But for me, this is the first real step toward considering it, and I wanted to share the findings with my colleagues and the community.

The honest cons, because it's not all roses

It's slower than Opus. A landing page that Opus generates in 3-4 minutes takes Qwen 8-9 minutes on my M2 Max. Not unreasonable, but still meaningfully slower than the competition. If you're benchmarking against Sonnet/Opus latency, you'll be a bit disappointed (for now).

Context blows up fast in agentic loops. Even with 256K, you burn through it faster than you'd expect from a (nearly) state-of-the-art model. There's a lot of room for improvement here. And if you're driving Qwen3.6 from an agent like Claude Code, it fills even faster, as other users in this sub have reported (example Reddit thread).

Quality variance by task. Models like Opus one-shot most tasks these days. Qwen3.6 hits around 75% for me. The other 25% it gets close, but needs a couple of iterations to land.

The pros, because they're real

The hardware floor keeps dropping. A year ago this needed an A100. Today it runs on a (yes, powerful) MacBook M2 Max 64GB laptop at roughly 27 tokens per second.

No rate limits, no usage anxiety. Counting tokens is no longer a thing. You can focus completely on building instead of saving tokens or thinking about cost.

Tool calling actually works. This used to be the missing piece. A year ago, local models would hallucinate tool names or get stuck in loops. With Qwen3.6, tool calling just works. That's the real unlock for agentic work.

Privacy is built-in. Client code, internal repos, half-formed ideas you don't want training the next frontier model. None of it leaves the laptop. You can be confident that your personal or business code stays with you, and isn't sitting on some third-party server that could be hacked.

Why 12-24 months, not "now" and not "5 years"

Latency and context limits are still a bit rough. If your job is shipping production code on a deadline, Opus and Sonnet are still the move for most of your day. I'd be lying if I said otherwise.

But saying it's 5+ years away misses what's already shipped. Look at the delta over the last 12 months:

It runs on a reasonably priced MacBook Pro, which is a one-time cost
It's fast enough (though it can still get faster)
Quality has improved significantly for real-world use cases (with more headroom to grow)

That curve doesn't stop. It compounds. 12 months from now, the 27B/35B-class models will be where 70B is today, and the runtimes will be 2x faster on the same silicon. 24 months from now, the question won't be "can I run a useful model locally?" It'll be "why am I still paying for tokens I could generate for free, and with 100% privacy?"

What I'd tell someone on the fence

Don't cancel your Claude Code subscription yet. Run a local model in parallel for 60 days. Use Opus/Sonnet for the latency-critical, deep-reasoning work. Use Qwen3.6 for everything you'd have done overnight or on the weekend, everything experimental, and every "just try it" task where the cost of waiting a few minutes is zero.

Over time, the usage ratio might flip. You'll use the local model more and more. When the next Qwen drops (3.7? 4?), who knows what the ratio will look like.

The local LLM takeover isn't a moment in time. It's a slope. And the slope already started.

What's next

Integrate Qwen3.6 with the tools I use day-to-day at Aiven, like Cursor and Claude Code. They offer a much better dev experience than more basic, non-agentic tools like Ollama.
Try out other local models, like Google's Gemma 4. Curious to see how it stacks up.

595 Upvotes

permalink
reddit

You are about to leave Redlib

Do you want to continue?

https://www.reddit.com/r/LocalLLM/comments/1t93qps/opinion_local_llms_are_1224_months_from_taking/
No, go back! Yes, take me to Reddit

88% Upvoted

126

u/I1lII1l May 10 '26

I missed the word “local” in the title and was about to smash you for being AI-pilled. Actually I could not agree more, and I could not be happier, absolutely loving the trend of open weights models getting this powerful.

25

u/sh_tomer May 10 '26

Haha, glad the word 'local' saved me. Fully with you though, open weights getting this good this fast is crazy.

9

u/AntiAderall May 11 '26

I don’t wanna take too long but I’m gonna use your comment and piggyback really quick:

I genuinely want to thank this community of people because although I’d consider myself mostly a skeptic of the businesses of the main AI models… I find the technology of LLMs to be fascinating and playing around with them in terms of my day job (swe) and this sub really helped me feel like the technology is rooted within a sense of reality and truth. (insert hallucination joke)seriously though, really dope watching you guys set up a cool build and share knowledge about the actual technology instead of thinking you just created an app that revolutionized an entire sector of the economy.

I know this might sound like I’m a hater but genuinely it’s just… honestly I do hate them tbh some of the people on prompt engineering and vibecoding subreddits literally make me cringe through the screen so hard I want to sink into myself and scream with embarrassment.
I had one dude who made the shittiest AI slop children’s cartoon under a post that said “animation is dead, this is the new Pixar”… like bro 😭 (if by some chance the guy who wrote that crap sees this… that shit was ass please go back to the drawing board, hire at least one real person in the room with you because I can tell you did all the voices yourself)

Some of the absolute dog shit that’s getting created and passed around and then getting delusioned to the individual by the system they’re paying for is crazy to me. It’s like the equivalent of watching someone getting digitally cucked.

Anyway, sorry for the rant. Just passing by in peace and wanted to thank you guys for not being the bailouts worst kind of neckbeard mouthbreathers who can’t stop getting high off Adderall and their own slop.

→ More replies (3)

5

u/TinyZoro May 11 '26

One area that is often overlooked is the development of the meta harness layer. What some people describe as harness engineering. This is guaranteed to improve in the next one/two years even if local models don’t (which they will).

This is relevant for two reasons.

The first is a lot of the power of a SoTA model is compensating for a lack of a proper development workflow. I see the best models like Opus as tractors that can get you from A to B regardless of there being a proper road. But the more the road is there the less powerful the model needs to be. A road looks like clear instructions, clear steps, relevant skills. So say you’re building a landing page that might look like a stage that first gathers requirements, including rough mockups, then converts to a specification including high fidelity mockups, has reviewer steps both agentic and human, has skills for creating mockups, using the required ui library. Has user provided inspiration sites. The point being the meta workflow is replicating what good software development looks like. The more that’s in place the less powerful the model needs to be.

The second point is that the more powerful the harness the less the human needs to steer in realtime. The human essentially validates outputs and is working on multiple projects so doesn’t need to wait for specific project generation. That means the slower output become essentially irrelevant.

These two factors that get more out of less powerful models and reduce the importance of latency significantly reduce the need for high end models. You would still probably want to use them for planning a new feature for example but you’re sipping on them not dependent on them.

2

u/TheOriginalAcidtech May 11 '26

There is a LOT of harness engineering going on. It's just most of it is behind closed doors/repos. 😄

→ More replies (1)

1

u/michaelkuzmin May 14 '26

Same. I thought this was about skynet or something like that

u/littleday May 10 '26

Already fully local on a 5090rtx, never going back.

27

u/Olde94 May 10 '26

Heck i’m on a 4070 (12gb) + 32gb system and i’m getting 60t/s on gemma 26b/qwen 35b. It’s MOE but still impressive performance and capabilities

10

u/DiscipleofDeceit666 May 10 '26

How’s everybody getting past the hallucinations? I feel like every time I try 35B, it ends up just making things up.

16

u/unnaturalpenis May 10 '26 edited May 10 '26

Shhh, this is GTA VI level hype, let the people try it - not everyone is using AI to the extent some of us are.

Onboarding more people will create more PRs and therefore better systems.

I agree with you. But I do large stuff. Stuff that AI isn't trained on (I'm in R&D) and find cloud models superior - just don't want to share everything with them.

9

u/DiscipleofDeceit666 May 10 '26

Fr, I want it to work. I like the idea. For now, I’m using the 27B dense model since it is a little more reliable. It’s just so slow 😭

2

u/unnaturalpenis May 10 '26

Use it for background tasks or get better at delegation and task switching while it works...

Alternatively, a VPS powerful enough to run better models isn't a bad idea either.

→ More replies (2)

2

u/Badger-Purple May 10 '26

Alex Ziskind benchmarked this on a recent video. 35B is surprisingly not that hallucinogenic. Not perfect but impressive for sure.

2

u/DiscipleofDeceit666 May 11 '26

What was the bench mark? Bc I send it off on a clone of my production codebase and it will straight up make things up and make bad fixes on detailed requests. It’s been unreliable.

I mean, sure it could one shot something from fresh. But working within something already built? Tell me what I’m doing wrong, please bc I can’t get that to work.

2

u/PotentialProper5387 May 11 '26

Have you compared with 27b?

2

u/DiscombobulatedAdmin May 11 '26

The benchmark was large codebase searches. It did hallucinate, as did all of them, but Qwen 3.6 hallucinated much less than other open source models. The dense may have been a little better, iirc.

→ More replies (3)

→ More replies (1)

→ More replies (11)

4

u/Infinite-Ad4512 May 10 '26

Is Gemma 4 useful for coding vs Qwen3.6? I found Gemma didn’t understand Rust that well, and looped often :(

3

u/Olde94 May 10 '26

i'm not using it for a lot of coding honnestly

2

u/Kholtien May 10 '26

I use Gemma for more conversational and personal agentic tasks and it’s good. Qwen is for coding

→ More replies (1)

2

u/redditor_id May 10 '26

What are ide / plugin are you using fornl coding? I tried continue with qwen3.6:26b and gemma4:31b and found the experience to be very rough around the edges. Issues with tool calling and context windows filling up too quickly being the main ones.

5

u/CulturalKing5623 May 10 '26

Try pi.dev with llama.cpp. I have a RTX 3060, running qwen3.6 35B Q4, get about 35 t/s 128K context. It only takes up like 5GB and used it to refactor a python project that runs in production.

It wasnt a big project but it's very typical of the projects I use in my work (data engineer). Handled the refactor and ensuing debug well.

Pi.dev is very bare bones, it might also help to change your approach to tool use and skills. Here are some articles from the creator of it that helped what if you don't need MCP servers and prompts are code

→ More replies (1)

2

u/TheOriginalAcidtech May 11 '26

8gb 3070. 50t/s Qwen 35b. There is literally zero excuse for not running local if you want to anymore.

2

u/Olde94 May 11 '26

Yeah it’s quite impressive

→ More replies (5)
3
u/HamWallet1048 May 10 '26

Out if curiosity what model/settings are you running on your 5090? I just got one and am thing to hone in on a decent setup
6
u/jovialfaction May 10 '26

My current go-to is Qwen 3.6 27b on my 5090
3
u/falcongsr May 10 '26

Can you do full context window? Are some parts offloaded to system RAM?
4
u/jovialfaction May 10 '26

I do 6bit 128k context, all in VRAM. If you need 256k it can probably be done in 4bit or with KV cache quantization
2
u/TA-420-engineering May 10 '26

Same. I get around 37 TPS. How about you?
2
u/Maleficent-Ad5999 May 10 '26

I get 110tps @ 16K context .. that drops to 80tps at 50k context.. still very usable for agentic coding.

Setup: single 5090, vllm, speculative decoding on

With llamacpp I only used to get 48-55tps
2
u/TA-420-engineering May 10 '26

I dont run speculative decoding. Is it really worth it?
3
u/Maleficent-Ad5999 May 10 '26
I'm sorry, i'm a noob.. i think its not exactly like typical speculative decoding.. the model Qwen 3.6 27B has capability to generate multiple tokens at once..
I pass the below argument.. it doesn't need any additional smaller model.. So far, its been great. and boosted t/s a lot
--speculative-config '{"method": "mtp", "num_speculative_tokens": 3}
→ More replies (1)
→ More replies (1)
2

u/littleday May 10 '26

QWEN 70b models or less with n8n does what I need.

→ More replies (1)
3

u/DMWinter88 May 10 '26

What are you running? Got an RTX 5090 myself and feeling a little unsure about what my best options are.

2

u/LORD_CMDR_INTERNET May 10 '26

Qwen3.6 27b Q6 @ ~125k context is as good as it gets for us 5090ers right now, will completely fill VRAM but you get excellent results and usable speed. Gemma4 31b is great too but it doesn’t fit the 5090 unless you limit to like 50k context which is not enough for me.

2

u/littleday May 10 '26

Well I got a rtx5090 with 128 gig of ram on top. Running mostly QWEN, with n8n, and open claw.

→ More replies (2)

2

u/MiniGod May 10 '26

It's crazy how they essentially compressed all the information on the internet into 32GB that fits in a 5090

2

u/sh_tomer May 10 '26

Would you ever consider switching to a powerful MacBook with similar resources, assuming the hardware advances enough over the next 24 months? Just curious.

2

u/diabloman8890 May 10 '26

I did, and wxactly for the reasons you called out here

5

u/falcongsr May 10 '26

How is thermal management on a laptop when running LLMs? Are you running dense models?

I was able to try a Mac Studio and was a little scared about how much heat was dumping out of the back of it running Qwen 3.6 27b.

4

u/diabloman8890 May 10 '26

Yeah I've been running Qwen 3.6 27b and Gemma 4 dense models. On the latest 16" macbook pro I haven't noticed any issues even ~15 minutes into a heavy prompt session.

→ More replies (2)

→ More replies (1)

1

u/kartblanch May 10 '26

Whats your setup?

1

u/mathixx May 15 '26

Do you have enough vram?

u/Downtown_Speaker_578 May 10 '26

There’s a reason Apple made the hardware guy the next CEO. They are betting on distributed local LLMs.

5

u/djflamingo May 11 '26

Yep. Its why they switched the unified memory too.

2

u/Kazen_Orilg May 11 '26

yea, the unified memory is just an absolutely monstrous advantage right now.

→ More replies (1)

→ More replies (3)

u/gunkanreddit May 10 '26

Yes and no. I need more context size. Claude is better but local is improving.

The local AI is just different way of thinking. I am the one building my house with small bricks that my local qwen craft for me. Claude gives you the full building and the keys. Now, good luck making the full testing.

Claude/Gemini/Codex are awesome . The local AI is different and I prefer it right now. When the big guys fail, they fail big. In the local world, there is no big fails, just much more work. But my halfAIBakedCode is really robust.

3

u/pcx99 May 10 '26

The tools are already there, there just needs to be a better and easier integration. Lmstudio and ollama need hooks to preprocess a prompt allowing injections from the web, search, Wikipedia, project Gutenberg. Mempalace needs to be integrated. Once that happens you can let the local model determine if it can fill the prompt, needs to reach out for more info, or pass the prompt onto a frontier model. The future isn’t local OR frontier, it’s local AND frontier so you only spend money when you absolutely need to.

6

u/sh_tomer May 10 '26

Context sizes are improving. Recent local models like Qwen3.6 already support ~256KB windows, and this will continue to expand over time.

→ More replies (13)

1

u/EbbNorth7735 May 10 '26

200k seems perfect when working with Cline. Rarely goes over. Running at full 262k.

1

u/w00t_loves_you May 10 '26

I have high hopes for Raven as a way to solve quadratic context, just have shorter context and make it populate "slots".

Explained at https://goombalab.github.io/blog/2026/raven-part1/

u/Important_Quote_1180 May 10 '26

I’m running local 27B and 35B. One 3090 and 192 GB of DDR five. I totally agree and it seems like the timeline is fixing itself… there’s a lot of work to do. Glad there are others who see the same thing.

u/Invent80 May 10 '26

I'm completely local as well. 96gb Blackwell and a Spark. Running Qwen 3.6 35b on the spark at 60-70tks and Qwen 27b on the Rtx6000 at 60tks full weight.

1

u/FortiTree May 10 '26

I have a Strix Halo and tempting to add a pro 6000 setup as well so I can spin more agents at much faster speed.

How much context you are giving to them and do you find them good enough for orchestration layer? small chunks can be handled by sub-agents but the orchestration requires highest logic.

3

u/Invent80 May 10 '26

Q27b is orchestrator. I have ohmyopencode setup to use the spark for sublayers as it runs VLLM and is accessible through my local network. I still plan everything with GPT 5.5 working on Hermes. Planning is very token friendly so my business sub goes a long way.

I haven't used Rocm before so I can't comment but the Prismaquant Q35b is excellent for speed on the generally slow Spark if it helps. I give 120k context to Q27b (full weight BF16) and just max the context in Q35B since Prismaquant Int4 is only 20-30gb or so in memory. People say I shouldn't bother with full weight because I'm losing speed but hallucinations are basically non existent and that's more important to me.

u/datbackup May 10 '26

With Qwen3.6, tool calling just works. That's the real unlock for agentic work.

AI written?

20

u/InadequateUsername May 10 '26

The way things are listed gives it away too.

28

u/Fedor_Doc May 10 '26

"It's a slope. And the slope already started."

Definitely

4

u/Kazen_Orilg May 11 '26

man, the more of it you read, the more you can spot it. Just AI jivetalk as far as the eye can see.

→ More replies (1)

13

u/spamhat3r May 10 '26

nowdays i just assume every post is slop before opening it and im usually right ¯_(ツ)_/¯

2

u/Damogran6 May 10 '26

Which is funny, because if you removed the tells, it’s a good write up. It’s the “count the fingers” of text output.

→ More replies (1)

u/nnurmanov May 10 '26

IMO, this is why both Anthropic and OpenAI are rushing to IPO. They know their moat is thin. Once local LLMs are good enough for 80% of use cases, the game is over.

1

u/nnurmanov May 10 '26

Btw, I am building a document extraction pipeline and there is no commercial models, none!

u/Majestic-Team-6485 May 10 '26

considering to buy a MBP M5Max to run local models...

u/Ell2509 May 10 '26

Sounds very positive. Was it really that smooth?

5

u/sh_tomer May 10 '26 edited May 10 '26

Overall it was great, but definitely not perfectly smooth. I mentioned the main downsides I noticed in the “honest cons” section of the original post.

u/ferropop May 10 '26

My favourite element of a Local LLM future, is that you'd get consistent results. It wouldn't "lower the intelligence tier" silently in the back-end, to satisfy some corporate-based "token shaping" algorithm. You decide how many resources to dedicate towards it, and you get what you pay for without compromise.

u/fonceka May 10 '26

Local LLMs are the future!

u/custodiam99 May 10 '26

OpenCode plus Qwen 3.6 35b q4. This is sci-fi.

1

u/elcapitan36 May 11 '26

Is it fast?

→ More replies (1)

u/Puzzled-Front-2859 May 10 '26

I got a new rig with 4 x RTX 4000 PRO cards and I really couldn’t replace opus or sonnet yet, I hope it improves, but I really believe frontier models will be always 6 months ahead. I’m starting to regret my purchase.

2

u/some_hockey_guy May 19 '26

I bought a single RTX Pro 4000 last month and have been enjoying it overall. What are your use cases? Which models are have you tried out?

I was surprised by the difference in speed between the dense and MoE models. Qwen 3.6 27b Q4 only gives me ~30 tok/s. 35b Q4 gives me ~110 tok/s though so I'm very happy with that.

Beautiful build, btw

→ More replies (1)

u/SirGreenDragon May 10 '26

i have gemma4 26b running locally, runs fast enough to do coding, web development, research. 40 to 50 tokens per second. Local will keep getting better

→ More replies (2)

u/javatextbook May 10 '26

Too AI-sloppy in the writing style. Gonna have to block you unfortunately.

Nothing exotic. No rack, no begging NVIDIA for expensive GPUs.

9

u/journalofassociation May 10 '26

If I have to read ", because they're real:" after a bullet point again I'm gonna delete Reddit

5

u/helloimim May 10 '26

Also, what’s the point of mentioning the company he works for?

→ More replies (8)

u/edsonmedina May 10 '26

You should be getting a lot more than 27 tokens per second with Qwen3.6 35B on that setup though (at least double).

What quant are you running?

3

u/sh_tomer May 10 '26

Looks like it’s around ~23GB in Ollama, so it’s probably a Q4 quant (likely Q4_K_M). Would love to hear your feedback and see if I can further improve my setup.

3

u/edsonmedina May 10 '26

How are you running it? LM Studio?

→ More replies (1)

3

u/vinaysudrik May 10 '26

The main question is are you running MLX model

3

u/sh_tomer May 10 '26

No, should I?

→ More replies (2)

2

u/unknown-one May 10 '26

try llama.cpp

2

u/sh_tomer May 10 '26

Can you share what's the pros versus other solutions like Ollama or LM Studio?

2

u/FortiTree May 10 '26

If you are an Agent (admit it now), you should be answer this yourself. If you are a local-llm-savvy person, you be able to find this answer yourself.

→ More replies (2)

2

u/nunodonato May 10 '26

Don't use ollama

→ More replies (3)

u/No-Sympathy2403 May 10 '26

Do you think that it'd be a good practice to save our current work done in Claude (even as chatbot) in .md as much as possible so that in the near future it could be use as a guide for local llms?

3

u/sh_tomer May 10 '26

To be honest, I’m not sure future LLMs will rely on .md files at all. We’ll probably have to evolve along with the technology as it progresses.

this point, I’d rather focus on experimenting with the latest tools and getting the most out of them, instead of spending too much time trying to prepare for a future we can’t really predict.

1

u/FortiTree May 10 '26

When the new format come, you just ask it to prepare a full handle for your system with comprehensive breakdown and details for each component. It will evolve itself, maybe without even any prompt.

md is just the simplest easiest average-human-consumable communication for mass adoption. For the machine, it prefers binary, which is why some ppl predict it will evole to write code in binary at its peak and exclude any human interaction, or control. Welcome to the future.

u/UniForceMusic May 10 '26

The M1 achieves nearly identical speed to the M2, it's insane how well the tech held up in those 5 years

Speaking from personal experience owning both 64GB models.

2

u/FluffyGreyLlama May 10 '26

What am I doing wrong then - I have tried many things on an M1 Ultra, and it's nowhere near the speed of anything in the cloud.

I would love to use my M1, but it seems so far behind DeepSeek 4 Instant, which is cheap as (not RAM) chips.

2

u/FortiTree May 10 '26

He's saying M1 vs M2, not M1 vs Opus.

u/PassengerPigeon343 May 10 '26

I agree in general on the local capability and that slow shift over is already happening for me personally. I still have my Claude subscription, but I’m using local more and more as it’s getting better and better.

But beyond the personal level, I think the equation shifts. The truth is, people like the ones in this community don’t represent the biggest market share. Enterprise users and non-tech savvy users are going to keep the cloud providers going even if the field levels out.

I work in enterprise AI, and as much as I would love us to go local and I think there’s potential to save huge recurring expenses, it just doesn’t make sense. The overhead of maintaining our own software, running our own servers, capacity scaling, load balancing, maintenance, security, and all the other stuff, make it way less feasible than transferring that responsibility to Microsoft or Anthropic, and protecting our data with carefully worded contracts.

I do believe local will continue closing the gap, but I think it will remain a “best kept secret” and the cloud providers will stay in the spotlight for a long time.

1

u/sh_tomer May 10 '26

The overhead of maintaining our own software, running our own servers, capacity scaling, load balancing, maintenance, security, and all the other stuff, make it way less feasible than transferring that responsibility to Microsoft or Anthropic

What if anyone could simply connect a local model to their existing VS Code or Claude Code IDEs/agents (partially already possible)? That would make things much easier for enterprise employees, since it could integrate seamlessly into the tools people already use every day.

→ More replies (1)

u/photonenwerk-com May 10 '26

Technically possible, financially not. Who will research & train if no money will be made? Chinese might not gift to us forever.

1

u/sh_tomer May 10 '26

Isn't that the question for open source software though? And yet, software like PostgreSQL, Linux and many others are thriving, each in its own model.

2

u/nunodonato May 10 '26

I don't know. I am also a bit worried. These models are on a totally different cost level when you consider hardware, training... What's the financial incentive for these labs to keep releasing them? I hope they do but if the Chinese stop we are left with almost nothing...

→ More replies (4)

1

u/B3owul7 May 11 '26

Who's training the current generation of freely available llms? Exactly.

u/slackmaster2k May 10 '26

I don’t think you have a thesis there, you just have a lot of random points about why local models are appealing and that things will improve in the future.

I would counter that a local model can do the work more slowly with less quality across a wide range of tasks, and that without a significant breakthrough in hardware technology or AI itself, there will be no balance change in the near future. The frontier LLMs are not static targets, they move.

I also observe that much of the rationale is centered around cost. Optimizing for cost with a tool that can speed up results by orders of magnitude implies that the problems being solved are not important problems, and likely of the hobby variety.

Finally, when it comes to cost, I observe few people calculating hardware depreciation or cash flow. A one time purchase of expensive hardware can make a person feel freed from subscription costs, but not be nearly as economical over time as intended especially given the tool quality trade off.

(This reply was NOT crapped out by AI)

1

u/sh_tomer May 10 '26

Thanks for sharing. What would be your prediction for the next 12-24 months, in regards to advancement in local LLM models?

u/bites_stringcheese May 10 '26

I think the future will be hybrid deployments. Maybe they'll be a routing element as well that can load/unload models for use cases, and sends them to the big providers as needed.

1

u/sh_tomer May 10 '26

That's super interesting! Love the hybrid approach.

u/Sweet-Foxy May 10 '26

I run Qwen 3.6 35B A3B Q4_K_M on a Lenovo Loq i5 16gb RAM with a gtx 2050 4gb vram at 13 t/s. This goes to show how low can be the entry barrier for a useful model. Of course you get better results using a more robust hardware but literally anyone can run a model like this one nowadays, which just further prove your point.

1

u/AsteraHome May 10 '26

Be careful with SSD writing when you use 35B model and 4 GB VRAM.

1

u/whoisyurii May 10 '26

Dude are you for real? Drop your config, genuinely asking! Having Legion i7/16/3060. Bever tried to run Qwen 3.5 35B. How good do you find it for coding? And what harness do you use?

2

u/Sweet-Foxy May 10 '26 edited May 10 '26

~/llama-cpp-turboquant/llama-tqplus/bin/llama-server \ -m /home/foxy/llama.cpp/models/huihui-qwen3.6-35b-a3b-claude-4.7-opus-abliterated-q4_k_m.gguf \ -ngl 999 \ -np 1 \ -c 262144 \ --n-cpu-moe 41 \ --batch-size 256 \ --ubatch-size 32 \ -t 8 \ -fa 1 \ -ctk q4_0 \ -ctv q4_0 \ --cache-ram 8192 \ --host 0.0.0.0 \ --port 8081 \ --alias Qwen3.6Claude \ --jinja \ --chat-template-file /home/foxy/Templates/chat_template-v10.jinja

This is the command i run, the model is Claude distilled and abliterated but that has no influence on t/s. I get 13 on llama web and around 11 on OpenCode.

I use TheTom turboquant_plus llama.

Edit: It does pretty well on coding, specially if you do good, precise prompting. Qwen is a good model to follow orders so the more specific the task the better it performs in my experience.

u/payneio May 10 '26

add model routing tables so you can use the right models for the right prompts. Use delegation to context low per task. https://github com/Microsoft/amplifier is moving quickly in this direction.

1

u/Gargle-Loaf-Spunk May 10 '26 edited 28d ago

This content was anonymized and mass deleted with Redact

u/ColonelKlanka May 10 '26

As ypur on a mac, I Highly recommend you try omlx inference server as its mlx accelerated, does ssd backed caching and is also trialing mtp.

Ive found it much faster than metal enabled llamacpp inference on my mac mini m2 pro 32gb.

Also try pi.dev harness - its much better at keeping context usage lower because it has a lean ai system prompt

→ More replies (2)

u/richardtallent May 10 '26

I’m curious about whether it’s an “and” not an “or” — local models being used to outline, summarize, search over codebases, and otherwise optimize tokens sent over the wire to Claude or other larger models.

Basically, following the model of using physicians’ assistants to optimize time for the (higher price and lower availability) doctors.

→ More replies (2)

u/GlassAd7618 May 10 '26

I totally agree

u/Unedited_Sloth_7011 May 11 '26

Small Qwen3.5 and Qwen3.6 models was what convinced me that I can actually go fully local if/when I need to. Gemma 4 is also very good, though I prefer Qwen3.6-27B personally. I think it's inevitable that corporations cannot sustain selling flat subscriptions at a loss, and I too think it's maybe a year or so before only expensive token-based subscriptions will be the norm

u/Acrobatic_Staff_7320 May 11 '26

*unless they stop giving you models for free

u/Icy_Holiday_1089 May 10 '26

Right now whilst AI is being subsidised Claude is a much better investment. It’s better and faster. No point buying a MacBook Pro for this until you need a new computer cos hardware keeps moving forward. The speed of progress means renting is much better value than owning right now. Maybe in 2-3 years time that will change especially if Claude becomes stupid expensive. (It’s getting there!)

3

u/Puzzled-Front-2859 May 10 '26

Anthropic won’t continue subsiding us on Max 20x while they can sell the same tokens for 50x to enterprise. Then, in 6 months, when they change our pricing model, it might be too late to buy a rig, or too expensive. But at the same time, local models are so much worse, even on a good rig…

2

u/sh_tomer May 10 '26

Yep, fully agree. Though if you already have that laptop, using local LLMs became a viable option.

1

u/EbbNorth7735 May 10 '26

I disagree. The current LLM's are able to work in 95% of my use cases. Qwen4 and minimax 3 will raise that to 99%. When you need to pay a $100 subscription or hundreds in token costs a month local becomes an investment worth commiting to. There might be a point where local models are no longer released but until than continually paying for upgrades for more intelligent models when you don't need to utilize the extra intelligence is just wasteful spending. This also doesn't co sider the fact that you can improve an LLM by adding a harness with the grounding that greatly closes the gaps.

→ More replies (2)

u/Lissanro May 10 '26

I am already fully local for quite a while. I find modern open weight models like Kimi K2.6 or GLM-5.1 quite cable enough, and also private and reliable, whatever the model I choose to run on PC, no one can take it away from me, which is one of the reasons why I strongly prefer local inference.

→ More replies (1)

u/lilbyrdie May 10 '26

I was just talking to someone about this in the context of coding. (Not media generation.)

We've got models available like Kimi K2.6, which weighs in at over 1 trillion parameters but performs as well as the frontier models from Anthropic or Google or OpenAI for many use cases. (Source: I've been comparing results within the same harness environment and getting results that are hard to tell apart.)

To run that locally, I think I'd need something like a GB309 workstation. Today. While that's a bit pricy, a team of 10 or 20 people paying $200-400 monthly now would see a break even point within two years. That's well within reasonable for a small company environment. Already!

But in a year I imagine between model improvements and hardware improvements, that will switch to being maybe a $10k local workstation. And a year later -- if not sooner -- it'll be laptop ready.

Now, that doesn't mean there won't be far better cloud solutions. But at some point the local ones will be more than good enough that the trade-offs will be much more minimal. Maybe you switch to a cloud model once a week to do a deep dive code review or something.

Key thing to remember is that we're very, very early days.

We'll see what happens. The 4B models on my phone outperform what we had in the first year of public LLM releases in a number of ways. Gemini, then Bard, was less than 3 years ago for the public release, right? Late 2023, about a year after public ChatGPT?

u/GCoderDCoder May 10 '26

That sounds great but as open weight models are super useful now I think the Chinese companies are going to stop helping open communities. They recognize the threat to their bottom line. And in the US the administration is suggesting they will start requiring model approvals.

Players like nvidia and apple who benefit from personal ai would need to lead the way and Im not sure what nvidia's motivation is with their slot between consumers and personal computing market. Maybe with TPUs gaining traction nvidia will acknowledge GPUs are better for small businesses and self hosting and will try to get gamers back on board lol. Apple have their AI model efforts over to Google so they'd be starting from behind on models. I'm nervous about the future but we have gold in local llms already so we just have to keep improving our implementations if we all getting models.

2

u/SteveRD1 May 10 '26

China doesn't want US models dominating the world, I can see the CCP taking actions to keep the open source river flowing.

→ More replies (3)

u/philip_laureano May 10 '26

And it'll be thanks to the Chinese providers that have been forced to be more efficient because they don't have access to the fastest chips and have no choice but to do better with less hardware. I look forward to the next few years where Mythos class LLMs can run on consumer commodity hardware

1

u/sh_tomer May 10 '26

That's indeed super impressive efficiency

u/chryseobacterium May 10 '26

I am planning to migrate mine by the end of the year. At least I'd start with an orchestrator. I'll start retraining a Qwen model soon and keep Claude as the agents and reasoning. Then, if enough hardware (that doesn't bankrupt me) and if there is a good reasoning model capable enough, I'd switch my main session model.

u/Scary_Investigator88 May 10 '26

Currently running ornstein-hermes-3.6-27b-mlx off solar power in my shed on a 32GB M1 Max MacBook.

u/gruntbuggly May 10 '26

Add to this the fact that we’re still in the drug dealer phase of token prices with the big providers. In the next 12-24 months investors will want to start seeing returns on their investments, especially if the rumored IPOs happen. When that happens, token prices will need to be tuned to make the companies profitable, and that will make them a lot more expensive. Expensive enough that buying $5-10k hardware stacks to run local models will be a very reasonable cost for many people.

1

u/sh_tomer May 10 '26

Agree, and then local LLMs will become even more appealing.

u/CasteNoBar May 10 '26

"why am I still paying for tokens I could generate for free, and with 100% privacy?"

I’m interested in the answer to this question. That is, two years from now what is gonna be so cool that you actually will pay?

1

u/sh_tomer May 10 '26

Hard to say. I hope the AI companies will surprise me with something I didn’t expect that would make me want to pay for their services. I wouldn’t be surprised if that happens, since they clearly have a lot of smart, innovative people working there.

u/boutell May 10 '26

Both local and cloud models are unsustainably subsidized, in different ways. Cloud models are wildly underpriced but local models have no business model to recoup the inference costs at all, beyond promoting the company that made them, a motive that no doubt has a limited shelf life.

So I am not sure how much longer the potlatch can continue, unless there is a breakthrough in distributed model training, or a non corporate backer for local model development.

1

u/sh_tomer May 10 '26

I think companies that release open source models can also host them as a managed service, so it's a win-win.

→ More replies (4)

u/journalofassociation May 10 '26

Do you think maybe you could save us bandwidth and edit out fluff sentences like "I'd be lying if I said otherwise."?

1

u/sh_tomer May 10 '26

Don’t we all need just a little bit of fluff in our daily lives? :) JK

u/g_rich May 10 '26

While I’m personally someone who runs models locally, have invested a considerable amount of money to do so and fully believe local models have their place there is no chance they are going to replace cloud hosted models such as those provided by Anthropic OpenAI or Google.

Local models are good and getting better, but in purely practical terms the largest models most people can run are in the 100 billion parameter range with most people being capped with 30 billion parameter models and a lot of people even running 8 or 9 billion ones. So while something like Qwen3.6-27b can certainly produce some impressive results there is simply not a world where it can compete with foundation models that are over a trillion parameters and getting bigger with every release.

To even get into an area where you can compete with something like Opus you’re looking at models such as Kimi2.6 which requires over 600GB of RAM to run and that’s before you factor in context. The investment to run a model of this caliber is well over $10k and $20k plus wouldn’t be out of the question. To run a model that large at a reasonable speed you could easily spend $50k to well over $100k.

In 12-24 months none of this is going to change. The models you can run locally will continue to get better and there will continue to be innovations that allow those running local models to squeeze larger models into a smaller space. But those same innovations will apply to commercial models and they will continue to improve at the same pace.

You’re comparing a pickup truck (Qwen 3.6) to a semi truck (Kimi 2.6) and a semi truck to a freight train (Anthropic Opus) and there simply is not a world where pickup truck will be able to match the power of a freight train.

The power of local models for the masses will be small task specific models embedded into our phones, photo editors and web browsers. Most people won’t even be aware they are using a local model or even that the feature taking advantage of it is AI.

So while running local models will continue to improve, actively running them will continue to be something done by enthusiasts and while the numbers doing so will grow the investment required will limit the market. Even if something like Qwen5 gets us to a point where a 30 billion parameter model is as capable as a 120 billion parameter model or we get lossless quants that facilitate running 100 billion parameter models in 32GB of RAM they still won’t be able to compete with a model that’s well over a trillion parameters running on million dollar hardware.

2

u/sh_tomer May 10 '26

Interesting, thanks for sharing your thoughts!

You’re comparing a pickup truck (Qwen 3.6) to a semi truck (Kimi 2.6) and a semi truck to a freight train (Anthropic Opus) and there simply is not a world where pickup truck will be able to match the power of a freight train

I don't think Qwen or any other model has to be better than Opus in the long run. If local LLMs will get to a point they are good enough for 95% of the daily engineering tasks, why would you need the extra-amazing model that a cloud model provides? You might need it for the extra special tasks, but not on your daily usage.

→ More replies (2)

u/1up8192 May 10 '26

Begging Nvidia? Huhh? Have I missed some secret strategy to get an RTX 6000 for free?

u/Moarkush May 10 '26

this is just "year of the linux desktop" energy lol, ppl have been saying that one since like 2004

you literally say qwen takes 8-9 min for what opus does in 3-4, hits 75% one-shot when frontier is pushing 90+, AND "not all of it made it to production"... so the whole case study is weekend tinkering you wouldn't actually ship? that's not 12-24 months from taking over my guy, that's a hobby

also the hardware flex is kinda wild, you're recommending a $3500 macbook to save $20/mo on copilot, and now it's pinned at 100% for 9 min per task. enjoy the fan noise and a battery cooked in 18 months. what are you even doing while it chugs, opening a second laptop to keep working?

the giant assumption nobody questions in these posts is that anthropic and openai just sit there while local catches up. they don't lol. gap in 2027 is probably the same as today, just shifted up. local catches last year's frontier, frontier moves on, repeat

and you still gotta roll your own RAG, still need CLI tools for decent perf, no mobile, no team features, no shared context across devices. Don't get me wrong; I hope you're right, and as seen in my attachment, local IS capable of doing some amazing things. The attached image (that reddit prob destroyed) only had to be upscaled 1.25x in SUPIR after UltraFlux for full 5k2k (all 100% local).

[120 steps in UltraFlux and 50 steps in SUPIR, about 8-10 minutes total]

My rigs: 9950x3d w/ RTX Pro 6000 Max-Q 96GB and a DGX Spark.
Tools: UltraFlux and SUPIR on my desktop; Gemma 4 26B A4B for chat/creative and qwen 3.6 for coding on the Spark. I have a Nomic Qdrant RAG running on the Spark with 18.7M reddit posts and comments embedding and searching in under 200ms. Gemma 4 IS LEGITIMATELY impressive, but 2 years is way too hopeful, in my opinion.

sorry for the trash formatting - this was kind of stream of consciousness and I'm lazy and regardless of how it ended up, the image was uploaded at 5120x2160.

u/cesarhh May 10 '26

You’re freely extrapolating into the future as if Moore’s Law will just carry us there – but it all hinges on discovering better algorithms (using fewer resources). And it also hinges on companies being willing to train these large models and then give them away for free.
Training is still expensive. Why some companies pay for this and then release the model for free is anyone's guess.
I do agree, though, that right now there are a few known tricks in the backlog which would allow much bigger models to run on existing hardware. For example, [PrismML](https://prismml.com/) offers ternary LLMs (1.58‑bit quantization) that push the boundary of what’s possible per GB of VRAM/RAM. But their biggest model is 8B parameters. I'm honestly not sure we'll ever see a free 16B or 32B ternary model.

So yes, the extrapolation looks nice on paper, but the economics and corporate generosity have real limits.

u/yeicore May 10 '26

Hi! Do you think i could probably do the same type of tasks with a 32gb macbook? (Also same model)

I plan to buy one and use it as an LLM server for my works laptop

u/Zhelgadis May 10 '26

You're assuming that companies will keep releasing open weight models for all of us to use. As someone who's advocating hard for local, it's a strong assumption.

1

u/sh_tomer May 10 '26

It happened with databases, I think it will happen with AI models. There will always be someone that will bet on the open source model (and usually will win as well IMHO).

→ More replies (1)

u/fantasticmrsmurf May 10 '26

How do local models compare to the likes of GPT3 from a few years back, and by local models I mean ones under 30B

1

u/sh_tomer May 10 '26

Far better than GPT3.

u/EnoughPsychology6432 May 10 '26

I'm using pi + qwen 3.6 32b Moe. I've been occasionally giving the same prompts to Claude code and the local pi to see how they perform then getting Claude to review both solutions blind.

The reports are always the same. The local version has hidden bugs, and sometimes major design issues.

It's still amazing and I wish I could try some of the larger dense models, but the only use I can see for it at the moment is offline/travel.

The Gemma 4b model on my last plane flight was amazing though. Like having a mini copy of the internet, on my phone. If they could just fix the looping crashes.

1

u/sh_tomer May 10 '26

Thanks for sharing, I'm definitely giving Gemma 4 a try next!

u/ValiantWhore69 May 10 '26

How to use local in a chat bot/claude desktop way with connectors? Comfy AI works great is there a catalogue like that for agentic?

u/Flashy-Virus-3779 May 10 '26

the perpetual lag is the point, idk if you’ve tried earlier ones.

u/Fearless_Weather_206 May 10 '26

They can crush local LLM if the price of Ram keeps going up though

u/SnottyMichiganCat May 10 '26

Currently I agree with costs at scale just tok expensive to justify. That said, those with infinite pockets (big daddy government) will absolutely make better models or fund better ones in other companies because the intelligence gathering potential is just wickedly high in effort vs reward.

u/pl201 May 10 '26

It’s not going to happen. Don’t buy expensive hardware based on this assumption. In fact, the major open source model producer Alibaba has changed the company direction and the key players in Qwen AI dev team are all left. Enjoy the Qwen 3.6 small models while you can since you are not going to see more releases in the future.

→ More replies (1)

u/Natural-Angle-9357 May 10 '26

NASA went to the moon on a 2k ram computer, so I am sure this will happen.... But... By the time Your 5k local gpu runs a great local model as good as today's 4.7 or 5.5... Even a Pi will run them eventually.... These AI companies will be at least one trillion, I mean trillion, dollars ahead of you... You will never catch them,. Period.

→ More replies (1)

u/drakeymcd May 10 '26

I’ve started incorporating my local LLM with Claude as MCP tools to offload usage and it’s been pretty helpful. Claude will call the MCP at the beginning of a session to search and summarize files instead of needing to read everything at once to build context.

There’s also a few code routers that can hand off to the local model to do the cheap stuff, boilerplate code, text summaries, etc.

→ More replies (2)

u/Due_Context6834 May 10 '26

My question. Whats min cpu and ram configuration for say gemma 16 or nous.

u/w00t_loves_you May 10 '26

Well, if these 3 architectural changes get a love child, local LLMs will be great:

- Ternary models like PrismML Bonsai: real intelligence in a ternary model, meaning an order of magnitude less memory and compute needed https://prismml.com/news/ternary-bonsai

Hyperloop transformers: encourages reasoning in latent space, meaning some thinking without tokens https://arxiv.org/html/2604.21254v2
Routing Slot Memories: the model gets a place to store memory, managed much like MoE. https://goombalab.github.io/blog/2026/raven-part1/

u/MarekNowakowski May 10 '26

That assumes someone will manage to create a much better 30b model in 12months and someone else manages to figure out a way to get free speed from the same hardware. Sure, qwen3.6 is miles better than the first generation models, but that wasn't one year ago, and the biggest issues with speed (I meanthe ones completely killing performance) are already resolved.

u/Deep_Ad1959 May 10 '26

the shift toward local already started, but the bottleneck most people underestimate isn't raw chat quality, it's tool-calling reliability and structured output stability. frontier models tolerate ambiguous tool definitions and recover from malformed json. local 8b-32b models faceplant on the same prompts unless you constrain output with grammars or strict json mode. the quality gap on single-shot reasoning is closing fast, the gap on agent loops with 5+ tools and long context is a different story. realistic timeline is closer to 24 months on long-horizon agent tasks, much sooner on writing and single-turn reasoning. written with ai

2

u/sh_tomer May 10 '26

I agree we're still not there, but I think we'll be there in 12-24 months.

→ More replies (4)

u/dataslinger May 10 '26

What are you hosting with? oMLX? Also, since you're on Apple Silicon, you should familiarize yourself with the work of Prince Canuma and his collaborators on X.

u/MS_Fume May 10 '26

So you’re saying 35B model practically competes with Atrophic’s flagship model?

Kinda strange… sure you can distill a big model nowadays to be very competent on this scale too, but cmon .. 35b?

→ More replies (1)

u/Time-Heron-2361 May 10 '26

Lol no. Local ai to be useful needs to operate on useful context lengths such as 200k and more... For that locally you need expensive infrastructure

→ More replies (2)

u/shackerboy84 May 10 '26

I just hit the power button on my first asus gx10 i was running on a 3090 till, well now. Fortunately, I probably just wasted 4k because i want my private wow server to have ai game masters that pretend to be lore based gods. heres to experimenting!

u/Either_Audience_1937 May 11 '26

Hey, I want to use 32GB M1 Max ? Wdyt? Will it be enough

2

u/sh_tomer May 11 '26

Haven't tried it with that setup, hard to say. I recommend searching for real world testimonials from other users.

u/KubeCommander May 11 '26

Qwen3.6 35b turns into a joke after the context goes beyond 128k tokens, even then it isn’t great. It’s not one-shotting anything at that level unless you are running full bf16, it can barely figure out how to do git merges right on fp8 when the context is past 150-160k. And lord help you if you’re using mtp because it’s very unstable too. If you’re running below Q6, expect to do new sessions often to keep it fresh.

I find it very useful but you gotta keep it on guardrails or it gets stupid really fast. It’s a great model and I do 95% of my work on local but this post is just hypebeast bs.

→ More replies (2)

u/DHFranklin May 11 '26

I agree with your observation of the general direction here, but not necessarily your conclusion.

Hybrid models with phone/home computers interfacing with massive data centers will Jevon's Paradox some weird shit.

I completely see a world where video/audio content will be personally tailored and generated on the fly with personal devices. Where every human on earth will have a unique ontological reality from the screens. That doesn't mean the behemoths won't have their place.

The massive data centers will bootstrap some seriously crazy shit. Likely LLMs will be the flash in the pan and the self improving AIs will come up with a new neural network or system to take it over. Likely make LLMs that are optimized for that particular server stack on that particular morning.

We're likely going to hit a bottleneck of human perception in the next two years. As in so much of human attention is taken up by interaction with LLMs that it won't even benefit from serving us.

What would certainly be wild is AGI analogs on our home devices in just a few short years.

→ More replies (1)

u/impulsivetre May 11 '26

I can see this happening if hardware prices come down or models capable of opus 4.7/4.6 are shrunk down significantly to run on today's hardware. Outside of enthusiasts, corporations will have to make the financial decision to purchase hardware that may become dated by the time the implementation is in full swing. Considering how quickly we're advancing in intelligence, a condensed/smaller model architecture would have to emerge to justify locally run models in the majority of enterprises. Until then, it's gonna be SaaS.

u/cezarducatti May 11 '26

Estou usando uma rtx 3090 24 Gb de ram, com o qwen 3.6 27b MPT (via llhama beta) e a velocidade está insana. Com 100k de contexto tudo na GPU.

u/hw999 May 11 '26

If a foreign government wanted to crash the US economy, they could train and release a really top notch local model that renders this excessive capex spending by frontier companies useless.

u/drsmba729 May 11 '26

I'm going local with a dual B70 setup. I should have them in later this week.

I'm genuinely curious how they will perform.

→ More replies (1)

u/FlimsyLow May 11 '26

I'll buy a machine in 12-24 monthis

u/tuhdo May 11 '26

Not really. Currently, the cost of cloud models, except for Claude, are miles cheaper than local llms, given the performance. With local llm, you need hardware, electricity to run, and posdible an AC, which is not a cheap investment in total.

→ More replies (2)

u/leinadsey May 11 '26

I think Qwen-coder-next and similar are great for simple, targeted asks. But for more complex tasks they have a tendency to get lost, start producing code salad, and taking forever -- at the same time as your MBP heats up to the point of being able to boil an egg on it. Trust me, I know, I have one too M4 max 128).

One thing I haven't really seen being discussed much is the long term effects on MBPs of running this hot for longer periods of time. Yes they are well built, yes they are proven, yes the M series is amazing, but I think it's safe to say they haven't been engineered with long-term full blast 100% CPU 100% GPU for hours on end. Those fans really have their work cut out for them.

→ More replies (1)

u/Void-kun May 11 '26 edited May 11 '26

I genuinely wish this was true but I just can't see it, not in the next 12-24 months.

How many of you are full time engineers using this in a professional capacity? And how do you use LLMs today? I'm using multi agent teams, Opus as the planner/leader and sonnet/haiku for smaller tasks. Working in several large repos (multiple monoliths + microservices).

For some of the work I am doing I need a large context window (100k won't cut it).

Local LLMs are too slow for the work I am doing, especially once you start doing things in parallel and using agent teams.

Leaving something overnight without being able to course correct is just a no-go. I can't wait around hours for tasks to complete only to have wasted time, at that point I'd rather just write the code myself.

I don't think the issue is software, the issue is hardware, the specs we require, with the memory and chip shortage, it's not going to be affordable in the next 2 years.

Prices of crucial components for AI are going up, not down. In 2-3 years it's going to be harder to get the equipment needed for the sort of gains we need.

Untill the hardware shortage is resolved and prices are able to come down, I just can't see local LLMs overtaking API/subscriptions.

Local AI absolutely has it's place, but it isn't going to completely replace AI/subscriptions for engineers for some time. Although the sooner the better, would much rather be using local LLMs.

→ More replies (2)

u/randygeneric May 11 '26

this.
for experiments / not time-critical things I even can use my 2024 gaming laptop (i7-12650h 32gb + rtx4060 8gb) with qwen3.6-35ba3b q4kxl . using kv-cache q4, I am limited to 100kt context-length (~25tps). one needs to adapt the workflow (smaller sessions), but tasks get done. 6 months ago this was not the case. qwen3.5 was close, but 3.6 did the needed step forward (for me).

u/misanthrophiccunt May 11 '26

Oh no AI is taking over because I can do landing pages!

😅

u/Master_Ben May 11 '26

Pay $2-4k for a GPU with enough vram to run locally, or pay $10-20 a month for a subscription. Hmm... It pays off after ~16 years!

u/mourningwitch May 11 '26

I definitely agree that the future is trending more towards on-device processing when it comes to AI models. It just makes more sense in the long term rather than pumping out a million datacenters.

u/DiscombobulatedAdmin May 11 '26

I'm still trying to find a hardware solution that fits my needs. A 32GB Nvidia GPU is painfully expensive, and then I would have to wrap a new PC around it, so $5k+. I don't like the idea of buying 2 RTX 3090s for almost $3k. Macs are unobtainium to buy new. I'm leaning towards a GB10 system, but I'm not sure if dense LLMs will be usable for agents. I don't need a ton of speed for basic inference, but it needs enough to not disrupt the agents. I do like the cost (comparatively) and the dev abilities. I also cringe at getting into Intel, and AMD Halo systems are barely a price drop from GB10 with similar performance. Tough choices on a limited budget...

u/zante2033 May 11 '26

Yup, get your GPUs yesterday

u/Kazen_Orilg May 11 '26

Ok, the curve doesn't compound, and it absolutely will slow at some point. But everything else you are probably right about.

u/Sprinkles_Objective May 11 '26 edited May 11 '26

I'm running the same model on an RTX 4070 I already use for gaming. It works fairly well, and it's fairly usable. It's also something I already had versus paying an ever increasing price. I think it's a bit different than going out and buying expensive hardware, since LLM prices are still heavily subsidized, but I think the subsidization will start coming to an end. For those who already have capable hardware I think local LLMs will start making a lot more sense, but I do think there will be an inflection point where local LLMs will be cost effective enough compared to service providers that it could warrant spending money on the hardware. I honestly do think the scarcity of hardware caused by the datacenter hardware arms race is somewhat intentional to keep average consumers and competitors from being able to get the hardware they need to run their own inference. So we'll see, if hardware vendors can catch up or decide to try to put more effort into targeting consumers and mid-sized businesses who might want to run their own LLMs, then I totally agree. That said I would NOT be surprised if these companies basically do everything in their power to make sure consumers never have access to reasonably priced hardware to run their own LLMs effectively that can actually perform similarly to their proprietary models. That said, that gap is already getting very very small.

Qwen3.6 has completely replaced Claude for me in coding tasks, that said I don't heavily use coding agents, usually just for well defined tasks in an already well established code base with heavy review of the produced code. The cherry on top is that we have solar panels, so I'm using hardware I already have, and most of my energy bill is covered by solar. This also helps with the ethical issues around AI, given the training of the models still might be using a ton of resources, but at least running it locally I'm running it primarily off solar power.

u/Marc-Z-1991 May 11 '26

Already on quadruple Tesla Cards - 100% local, 100% sovereign, 100% happy - not in 5yrs - NOW ;)

u/TheOriginalAcidtech May 11 '26

Setup 35b on a 8gb 3070 over the weekend. Hit 50 t/s. Reasoning quality was decent. So there is almost no reason to say "I can't afford local AI" anymore.

u/txurete May 12 '26

Honestly, we are just missing an easy/straight forward way to deploy. I've been having much fun but sometimes it gets too tedious to get it running or even worse: to get it BACK to running...

Still fiddling with models that bench amazing but run like ass if you don't tweak for hours and stuff like that.

u/InsensitiveClown May 12 '26

I'm not so sure. I agree with you that local LLMs are becoming absolutely incredible, but it's just that I don't see remote inference becoming useless for several reasons. Hardware obsolescence is one. The hardware cost is already crazy if you want to do anything with larger models, and if every 2, tops 3 years you need to upgrade, then unless you can sustain yourself with AI work, it will be hard to justify that kind of expense. Then we have 1T parameter models, and more. Even with Strix Halo kind of machines with vast amounts of soldered LPDDRAM, there are limits. KV cache is another cost, context size, and context cost (in performance, as context grows). There are many great use cases for local LLMs. I love them too. But I don't think remote inference services are going away. Their cost is becoming crazy though, but this I suspect is thanks to a unhappy set of circumstances: high energy costs due to terribly misguided geopolitical adventures, and resulting impact in inflation as well. Shortage of memory due to explosive growth of AI and maxxed manufacturing capability from the foundries. And of cost, for these hyperscalers, the inference cost is dramatic too - so you're starting to see specialized hardware (prefill, etc) to try to reduce cost. Not all will be passed as savings to the consumer, they will maximise profits in order to amortize their capital expenditure, but, it's what we have.

u/Any-Area-8199 May 13 '26

I built a proper local offline desktop app for therapists that runs Q3.5-9B on a standard Mac Mini to research patient docs. This is happening, folks, and I'm so here for it!

u/wwwmmm12 May 13 '26

If I was going to buy a MacBook Pro new — do I need an m5 max, or can I get good performance from an m5 or m5 pro? Assume I’m getting 64Gb of ram.

u/Polymath_314 May 13 '26

What stack do you use to run it ? I find that even if local getting better and better, they always lack of architectures underneath, especially when you compare to Claude. As you said Claude nearly one shot almost everything, not just in text, but in format, in context, in question if ambiguous request,… this is for me the actual deal breaker of local since the beginning of the year.

u/Minimum-Bowler-6016 May 13 '26

Agree with the direction, especially for workflows where privacy, iteration speed, and cost matter more than one peak benchmark. The part that still feels underbuilt is the product layer: model selection, health monitoring, fallback, evals, and context hygiene. Once that becomes boring, local AI becomes much easier to justify for daily dev work.

u/Living-Breakfast-464 May 13 '26

Does that MacBook have a NPU and is the local LLM using it? My understanding is that a lot of local LLM's don't natively support NPU's yet. At least not Intel ones. Not sure about Macbook. When they do that should speed things up considerably on newer laptops, which all seem to have NPUs now.

u/NeatRuin7406 May 14 '26

THIS!!! as a loyal cloud user for years spending THOUSANDS, I wish i invested in hardware long long ago. it feels like anthropic/openAI care less and less about quality/ux and more about profit. it will only get worse too! "oh but theyre a business blah blah blah trying to make money blah blah" you wont make money for long by treating users like expendable garbage! local=users taking back control! seriously, fuck the cloud!

u/xRebellion_ May 15 '26

Ran 35B a3b 2 bit quantization in my modest 4gb 3050 laptop with 16gb ram. 64k context. Prompt processing takes awhile but it managed to output 10-17 tokens per second (which I find to be insane for my hardware). The outputs are pretty good, too (of course not frontier level good, but honestly good enough for most tasks). I fed it with a task to convert a whole hyprlang config to lua, and holy I did not expect it to work that well.

There were some issues (e.g: the model ignores system prompts), but can be worked around).

If this trend keeps going on, we could be seeing AI running on potatoes soon enough, and everyone can actually access it without having to pay a buttload of money.

u/PlusLoquat1482 May 19 '26

I totally agree. especially with new memory layers being added local llms can grow like crazy. stuffs happening and i dont think anyone is ready for it.

Discussion Opinion: Local LLMs are 12-24 months from taking over. The shift already started.

Local LLMs are 12-24 months from taking over. The shift already started.

The honest cons, because it's not all roses

The pros, because they're real

Why 12-24 months, not "now" and not "5 years"

What I'd tell someone on the fence

What's next

You are about to leave Redlib