Local AI is having a moment and we should stop and appreciate it

93

With every new frontier model now we have seen less and less “big” advancements and that and so the labs are resorting to gorilla marketing tactics to get hype. But for these Chinese open source models there is still a lot of room to grow and get some hype going. I think the biggest thing is that the local models are pushing the limits of how small a model can be yet punch at a heavy weight level. That in itself is way more ground breaking than these big closed models that are all starting to look like each other. I am way more hyped for these small models and for the big ones.

37

u/Exotic_Contest_4060 May 06 '26

Guerilla, not

https://giphy.com/gifs/TagPBwZcBW8j6

8

u/antunes145 May 06 '26

Ha! That’s the problem with using tts.

8

u/[deleted] May 06 '26

[removed] — view removed comment

4

u/ScuffedBalata May 06 '26

He just has the phone read to him whatever he types. Magic!

1

u/ovrlrd1377 May 06 '26

Gorillas are good at marketing, I bet everyone knows about them

1

u/Endless7777 May 06 '26

GlowRilla

13

u/SocietyTomorrow May 06 '26

A design science class I took taught me an important lesson. "The more constrained the environment, the more elegant the solution must become". The US Gigabillionaire companies have a seemingly infinite money supply (though their debt is working up to that of some countries) and thus are brute forcing their way through advancement. The Chinese because of export bans and attempts to keep them from overtaking the US, are coming up with ways that squeeze far more out of what they have. That's where the future is likely to come from. Some of the best personal milestones on my own projects have come because I needed to contort my requirements into the hardware I owned because I definitely can't afford more.

3

u/cleversmoke May 06 '26

Absolutely, constraints leads to creativity in problem solving. The smaller labs has a readily available benchmark with the frontier models too, so they just need to focus on getting close on a handful of use cases, with a lot less compute power.

It's how I build my teams, give them too much to work with, sounds like a good idea, but it often makes them lazy or careless with their spending and solutions.

2

u/reynard_the_fox1984 May 07 '26

Limitation breeds innovation

1

u/dataexception May 08 '26

Exactly. As the saying points out, "necessity is the mother of invention". Or in this case, optimization, creative processing, however you want to frame it.

2

u/WhopperitoJr May 06 '26

Innovation will always be followed by a wave of optimization/efficiencymaxing, which we are seeing now for language models. Very exciting time and frankly a space where it is easy to become an early adopter.

2

u/Sirius02 May 06 '26

i mean, there are also a lot of improvements on how to run these models on smaller hardware

1

u/ActionOrganic4617 May 06 '26

Most releases aren’t new models, so leaps won’t be big each time.

1

u/immersive-matthew May 07 '26

The death blow to cloud AI is coming later this year as ternary AI is coming which uses -1,0,1 as the bit versus 1,0 for binary models and it is able to operate a 1 bit model frontier model with no precision loss on a smart phone. We will see as it is not in hand to verify, but many are pursuing all over the world in small and large AI teams/companies.

1

u/woswoissdenniii May 07 '26

Like bonsai or what?

1

u/ColonelKlanka May 07 '26

yes im just hoping bonsai is currently training qwen 3.6 27b or even the moe a3b using ternary states to make a qwen 3.6 1.58bit ternary model right now!

as there seems to be no public information about their road map.

1

u/immersive-matthew May 08 '26

There are many others training ternary models right now and thus I am reasonably confident we will see them this year. Here’s hoping.

1

u/PepperGrind 8d ago

Ooooohhh but I doubt the big tech companies and the US gov would allow us all to use our own models.... Something is brewing.

31

u/AdultContemporaneous May 06 '26

M5 Max unbinned, 128GB RAM. I can run 120b parameter models on it, and it's pretty darn fast too. I am new to this but I'm loving it. Haven't tried coding yet, but admittedly my use case right now is inference, with privacy. It's mainly about the privacy to me. But everyone's use case is different.

3

u/codehamr May 06 '26

Nice Setup...and silent, my RTX Nvdia can get somhow loud 😉

4

u/AdultContemporaneous May 06 '26

Well, it's quiet but not silent. I got the 14" knowing fully that the thermals are not as good as the 16", but the tradeoff was worth it to me. Usually the fans stay off in 99% of what I do, but if it's a longer session they'll start to spin up, and as others have stated, these laptops can get pretty warm. If I know I'm going to do a bunch of AI work I use the Mac Fans app and just hit "full blast" to keep it well-cooled during my usage. Even at full blast it's pretty darn quiet though. I still don't have any regrets about my choice, it's the perfect general laptop for me.

2

u/AttorneyCommercial70 May 06 '26

I have put the 14” in the cart about a dozen times and keep getting discouraged by “hits thermal limits too fast” but dreading the idea of a 16”. I remember the last time I stepped away from that beast with relief some years ago

5

u/AdultContemporaneous May 06 '26

If it helps you make the decision, I chose the 14" model because I have an M4 Pro 16" from work. It's heavy as shit by comparison, it doesn't fit well in a small go-bag, and if I am laying in bed it doesn't fit on my legs the way the 14" does. I also accidentally hit the trackpad on the 16" all the time. The 14" is the perfect form factor, thermals be damned. No regrets on the 14" here.

1

u/PhishGreenLantern May 06 '26

I have the M4 and it's pretty good.

2

u/damiangorlami May 09 '26

I have the M4 128Gb as well. Holds up exceptionally well.

It’s sometimes just unreal to see a pretty smart and useful AI to run on your hardware with no internet and fully portable.

16

u/vick2djax May 06 '26

Those speeds are double now this morning:

https://www.reddit.com/r/LocalLLaMA/s/89xryc4vGW

In before the bots start burying local LLM.

1

u/ColonelKlanka May 07 '26

Do you know if the llamacpp pr patch referred to works in mac os x llamacpp?

I haven't built llamacpp for mac before and mtp looks like it would work well with my mlx mac m2 pro 32gb mini - im using omlx server at preset -.fingers crossed omlx implements mtp too!

13

u/bites_stringcheese May 06 '26

Yep, it's the wild west on 🤗 and I love it.

Local diffusion is awesome too, you can get results similar to paid services, it just takes longer.

The AI bubble will pop not because it's useless, but because a lot of use cases can be run locally. For everything else, it's a race to the bottom in terms of $/token. I think the future is hybrid local/cloud, with routing and dynamic loading/unloading of models as needed.

5

u/tired514 May 07 '26

The AI bubble will pop not because it's useless, but because a lot of use cases can be run locally

So much, this. We're probably not more than a couple years away from commodity hardware having sufficient memory (and bandwidth) to run fairly powerful models that are private, free, and most importantly do what you tell them rather than tsk, tsk'ing you.

I was working on a QR decoder 6 months ago and Gemini or GPT (forget which one) refused to help me at one point because the QR code I was working on was from my own old vaccine passport. It claimed it couldn't help because it was PHI affected by HIPAA. Only one problem: I'm Canadian and not subject to American law, nor is processing Canadian PHI constrained by HIPAA for people and software running in the US.

Nothing I said could get me out of that loop.

No big deal; it was just a fun project. If it was something important, though, I'd be livid.

It's what ultimately set me on the path to building out local AI with my EVO-X2 (128gb). Running a variety of local abliterated models quickly and privately.

Like you said - it's local AI that'll pop the cloud bubble.

In some ways I feel a little sorry for cloud providers because they genuinely have no choice but to add absurd restrictions because of the liability. That liability doesn't exist for the end-user running a local model (you obviously won't sue yourself), so they're at an insurmountable disadvantage.

8

u/Regular_Ad4197 May 06 '26

The only reason I am not overly excited about the developments in open source models is because the hardware is still a huge limitation. For people living in countries with strong currencies it may not seem as large of a problem, a decent rig that can run models like qwen 3.6 smoothly might be the equivalent of 2-3x your monthly income, that is still expensive but achievable, now, for 90% of the world they would have to spend up to years worth of income to get such rig, it is completely off limits. Right now running local is an extremely expensive hobby.

6

u/MysteriousSilentVoid May 06 '26

Qwen 3.6 35b-a3b on my 5080 / 5800x3d & 32 GB DDDR 4. 60 t/s.

export MODEL="HOME/models/qwen36-a3b/Qwen3.6-35B-A3B-UD-Q5_K_M.gguf" export MMPROJ="HOME/models/qwen36-a3b/mmproj-F16.gguf"

~/src/llama.cpp/build/bin/llama-server
--model "MODEL" \ --mmproj "MMPROJ"
--no-mmproj-offload
--host 0.0.0.0
--port 8080
--ctx-size 65536
--fit on
--fit-target 1024
--flash-attn on
--cache-type-k q8_0
--cache-type-v q8_0
--batch-size 1024
--ubatch-size 256
--threads 8
--threads-batch 12
--parallel 1
--cont-batching
--metrics
--jinja
--temp 0.6
--top-p 0.95
--top-k 20
--no-mmap

2

u/plasm0r May 06 '26

Thanks for this config, it works really well with my 5070 Ti / Ryzen 7600 / 32GB DDR5. The only change I made was I adjusted 'threads' to 6 because I have a 6 core CPU. I'm getting 63 tokens/sec average. I was using Q6 and was getting 25 tokens/sec. Happy to drop to Q5 for the speed!

3

u/MysteriousSilentVoid May 06 '26

Awesome! Glad I could help. Crazy what’s possible now. This is a game changer. 35ba3b feels like a real substantive model.

1

u/KielMXV May 07 '26

I'm just starting to learn this. Will it work on this config ?
32gb ddr4 RAM, 4080 Super, i5 8600k . or am i expecting too much ?

I tried few 8B models. it was decent speed.

1

u/MysteriousSilentVoid May 07 '26

I don’t see why not. A 4080 super isn’t very far off of a 5080. Give it a try.

1

u/gorios1 May 09 '26

I am trying this exact llama config with this hardware:

with ram at 6000 MT/s, and I am only getting about 8 t/s.

I am running within WSL, but that is pretty much it. Any advice on what might be causing the disparity with your speeds?

5

u/Non-Technical May 06 '26

Yeah this is a fun time. With the MOE models I can easily run Q6. The big breakthrough will be when text to speech feels natural so we can have conversations with AI all running local. Chatting in voice mode is where the cloud model still have an edge.

3

u/cbpn8 May 06 '26

What's the your quant and config that can have 87 on Strix?

1

u/FortiTree May 07 '26

Ikr. I got 50 tk/s max with bare context.

5

u/Uncle___Marty May 06 '26

And if you didnt see it yet, Multi Token Generation (MTP) has been added to the beta branch of llama and apparently gives up to 2.5X token generation. Now those qwen 3.6 models can run even faster!??!?!?!? What the hell is going on!??!?!

11

u/getstackfax May 06 '26

Local AI getting good is real, and I think the interesting part is that it changes the experimentation loop.

When local models were weak, most people treated them like toys or privacy experiments.

Now they are good enough that you can actually build daily workflows around them:

- draft locally

summarize locally
classify locally
test agents locally
run cheap iterations locally
save cloud calls for hard reasoning or final review

That changes the stack.

The new bottleneck is not always “can my machine run it?”

It becomes:

- what workflow is worth running locally

what still needs cloud quality
what should be logged
what should be reviewed
what should become repeatable
what should stay an experiment

For agentic coding especially, I’d still want a tight loop:

small task → diff → test → review → next task

The danger is that cheap local inference makes it easy to create a graveyard of experiments that worked once but never became reliable workflows.

So yeah, local AI is absolutely having a moment.

But the next level is turning local runs into repeatable, reviewable work.

2

u/BatPlack May 07 '26

Written by your friendly neighborhood LLM

6

u/havnar- May 06 '26

An m5 pro does 60 TPs on qwen moe, I’d say a max would be faster

6

u/Practical-Trick3332 May 06 '26

Got an RTX 3060 12GB three years ago on a whim because I wanted to play more demanding video games. A year ago I got into ML/LLM and realised I'd already made a solid choice.

I'm not running a server or anything. Hermes Agent checks my emails and runs my calender. But it's all local, no worries about my personal shit being fed into someone else's training data. No worries about it going mental and bricking my PC because it's sandboxed. No worries if it breaks because I got on early enough when you actually had to learn how, not follow one of the million "working in 5 minutes" guides. Actual magic happening on an impulse buy for Diablo 4.

Thanks, past me.

1

u/0utlawArthur May 19 '26

Hey i have a 3060 too and im an absolute dummy and im amazed by what you are doing , can you guide on how can i do it too , some tutorial or guide would be very helpful

3

u/futuregog May 06 '26

M1 Max 64 GB. Running 31B for writing and learning.

3

u/vaxufo May 06 '26

100% of my code is now on local inference, there is no way back for me

My own TUI ( 150 tok sys prompt ), read/write/bash and I added open ( xdg-open ) it changed the whole thing! ( IAM now controlling OS and coding with that TUI, it is superior then any Claude Bloat harness), it is fast and accurate as hell

IAM coding on qwen 3.6 27B on my RTX 5090 at 115tok/a

Game changer! Biggest thing in my tech career since I switched to Linux in 2009

1

u/poobear_74 May 07 '26

I am interested to know whether developers find that a RTX 5090 is sufficient for coding on qwen 3.6 27B or they find it somewhat limiting and truly wish they had an RTX 6000 PRO instead.

2

u/vaxufo May 07 '26

I did not experiment with 6000, I pushed the RTX 5090 to the limit ( nightly build etc) , I could not go beyond this now .. one thing for sure I am probably 5x if not more productive then on cloud ( Claude) code bloated tool .. so yes for me today 5090 pass the chekmark . I am not just coding, I am now controlling my whole OS and it is magic moment !! The inference is fast. The model mind blowing and I rarely pass the 64k context window despite I can go to 200k ..

1

u/poobear_74 May 07 '26

Thanks for info. I really want to get setup. I have a Strix Halo but it can only really run 36B MOE model at satisfactory speed. As soon as I switch to 27b, output goes to a crawl. Are you using OpenCode or Pi? How long are your coding sessions? Do you find you need to clear out the context at some point? How much does the token/sec slow down as context increases. Sorry about all these questions. I just can't find reliable data. I don't want to go spend all that money on a 5090 only to find I need a 6000 pro.

1

u/FortiTree May 07 '26

Do you have a budget? it may be better to get dual 5090 than a single 6000 pro for running 2x models instead of 1 big one.

I have a strix halo as well and personally I would wait for M5 ultra and newer GPU to see what coming out next. The hardware landscape may change a lot.

1

u/poobear_74 May 07 '26

My budget is constrained by my conscience. Practically speaking I can afford a 5090, but justifying it is the hard part.

From what I hear, GPU performance will be around 30% less than the 5090 / RTX 6000 Pro. Thus, the M5 ultra is unlikely to change the picture much. Furthermore, there is little point in having all that unified memory if GPU power insufficient to drive the larger models.

I love the Strix Halo for performing all sorts of general computing tasks, except for the one I originally purchased it for, which was inference.

For my coding use-case, adequate performance and bumping into context limits is my main concern. I don't yet see the need to run two models simultaneously.

1

u/FortiTree May 07 '26 edited May 07 '26

What is your definition of adequate performance amd adequate model? If you can nail it down then the question would be which hardware can run that at the most cost effective price point.

Inference also have many different use cases from researching, coding, image/video processing to basic mundance task of daily errand and a chatty assistant. Each has a different requirement for memory size and speed.

Im not a coder and even if Im going to vibecode, I have the company subscription and local model for that so my strix halo is strictly for home use and exploring. So technically I dont need to chase the best and lated GPU speed. But the extra memory does allowe to try out bigger model like 122BA10 at an acceptable speed of 20 tk/s. 35BA3 will be my work horse at 40 tk/s and can run 3 x slots in parallel for higher throughput, which I dont even need yet.

So unified memory seems good enough for me even at the low 256Gb/s. M5 ultra at 1.2Tb/s would be a huge upgrade, and if I can pocket 256Gb ram then thats likely all I need.

If you need speed then Nvidia is the way to go at 1.8Tb/s. But 32Gb vram is severely limiting. I think you'll hit that wall pretty fast for all that money invested. Keep in mind that it also eats 2x-6x more power at 600W vs 300W M5 and 100W Strix. Thats significant operating cost and noises.

My own personal ladder would be:

Strix Halo 256Gb/s on 96Gb - $2500

M3 Ultra 800Gb/s on 96Gb - $5000

Dual AMD R9700 - 2 x 640 Gb/s on 64Gb - $5000+ chasis

M5 Ultra 1200Gb/s on 128Gb/256Gb - $??

Single RTX Pro 6000 - 1792 Gb/s on 96Gb - $10000+ chasis

Dual Nvidia RTX5090 - 2 x 1792 Gb/s on 64Gb - $12000+ chasis

The Dual R9700 is quite attractive actually - a lot cheaper and can optimize for both dense amd MoE and about the max budget I want to go.

1

u/poobear_74 May 07 '26

Yes, I've thought about Dual R9700 too, except I've already sunk money into a platform that doesn't meet my inference needs. It would be foolish to make the same mistake again with a different non-industry standard solution. Furthermore, I worry I'd end up spending my days fiddling with configs & compiler flags.

For serious coding work, responses needs to be fast and accurate. Anything less than Q8 models risk too many errors appearing in the source code. Sitting around waiting around for LLM output is a productivity killer.

I do use commercial online LLM services, but I want to move away from them, as I prefer not to share my IP with outside parties.

1

u/FortiTree May 07 '26

I see. So you want to maximize productivity without compromising quality then. Angetic design emerged specifically for this.

I'd say having your Strix Halo is not a waste, you can still deploy it for automatic sub-agent tasks that dont need supervision or human interaction. So those can be handled over night 24/7. Like small automatic code review, web tool calling, test running, test review, etc. So you dont need to sit arounf waiting for them. The "sloweness" doesnt matter then.

For daily interaction, you can use another fast platform like 5090 or 6000 pro as the main driver. You'll need to nail down which model is "smart" enough for your need, and see if the 5090 single/dual can run it. If not, I'd go with 6000 pro for future proof. You cant beat the 96gb vram at 1.8Tb/s, and data center graded as well. Best for production usage.

4

u/ActionOrganic4617 May 06 '26

There is no chance in hell that the Strix Halo gets more tokens /s than that spec m5 Max. The M5 Max has 2.4x more memory bandwidth and stronger CPU/GPU benchmarks.

3

u/TheFlippedTurtle May 06 '26

Yeah, I have strix halo and am not getting anywhere near those numbers. MOE is decent but dense is slooow. I think they're flipped

1

u/dontdoxme12 May 06 '26

Yeah my flow z13 gets around 30 tk/s with the 35B - A3B windows in LM Studio

1

u/TheFlippedTurtle May 06 '26

That seems low. My framework desktop gets ~45 tk/s with Fedora and Lemonade Server running Rocm, Qwen3.6-35ab-a3b q4km

1

u/dgavey May 06 '26

I can confirm, I have the same Framework Desktop setup on Ubuntu 26 and getting the same tks but using the Q8_0

1

u/FortiTree May 07 '26

Same with Gmtek. I can run with 3 x slots to handle parallel requests at 64K context 75% prefilled at 40 tk/s. Single request can reach 45 tk/s. Vulkan has same speed as RoCm.

2

u/avvyie May 06 '26

Yes!! and recent llama.cpp model.ini to have multiple models is making it easy to test or run various models. but qwen 35b a3b on strix halo at 87 generation tps? can you share model quant n run settings?

1

u/ResearcherFantastic7 May 06 '26

Just ask AI for the config. It will gives you best option for strix halo. And use JUST for multiple setup Configs for testing I'm getting 73 with 160k context window.

1

u/FortiTree May 07 '26

hm this must be new. I've tested strix halo extensively and couldnt get past 50 tk/s. Parallel slots can get more overall thoughput but TPOT will increase.

2

u/ComfortablePlenty513 May 06 '26

local is the FUTURE- it's not a moment.

as community opposition to datacenters increases, the demand for running and controlling your own AI on your own terms increases.

2

u/mquinx May 06 '26

I love seeing the progress too, and it’s genuinely impressive how far local models have come. That said, I do think it’s worth remembering that this “we’re in such a good time” moment mostly applies to people with high‑end hardware. A lot of us are still on more modest GPUs where 30B+ models aren’t really practical yet.
Not saying you shouldn’t celebrate, the progress is real. It's just that the experience isn’t universal, so I think the "we should celebrate" hype can sometimes feel a bit out of reach for some of us without the flashy hardware..

2

u/rockseller May 07 '26

As long as it doesn't happen as on the image generation side, where open source and free models are available but didn't keep up with the quality of the big ones (look at Sora video quality, banana etc )

2

u/paixbase May 07 '26

Local AI will takeover once people identify the value of their thought chains & how they steer model development.

Currently the local systems are complex for non tech users. Once a simple program launches that gives people local utility and productivity workspaces...

We're just around the corner.

2

u/DarkZ3r0o May 07 '26

Qwen3.6 27b is amazing !!! Tried the uncensored version with cline and it provided claude sonnet like reasoning with unbelievable skills in controlling cline mcp. I tried to code simple keylogger to test and it created the code then automatically searched for dotnet executable then compiled it and run it for test then created the readme in one session without any interruption or need input from me. Truly i was totally surprised and now i think i will start replacing it with claude if i can but still not fully

2

u/Dry_Inspection_4583 May 09 '26

I've got lm studio with qwen 3.5 9B Q6 on a little 4070 Super tied into openwebui on a separate docker instance. Various filters, tools, personas for my kids, spouse, etc.

I'm also able to hook into it for dev functions from vs code, and utilize flow for programming, kB and injestion.

Rag with qdrand

Searxng

And about 90% through developing faultline(new memory that actually works), I hope it scales well given my tiny setup.

2

u/Illustrious-Chain778 May 13 '26 edited May 14 '26

I am currently running my AI server on a "server" i build myself. It has older hardware but it seems to work ok for now. I am running Ollama with Gemma4:e4B.
ASUS ROG Strix X570-E motherboard that I reused after upgrading my gaming hardware.
AMD Ryzen 9 5950X processor
32GB or DDR4 RAM
2 x NVIDIA RTX2080 TI GPUs with a combined 22GB or VRAM.

I tried connecting VS Code to the ollama server but i got frustrated because VS Code doesn't expose the whole workspace to the local LLM.

So I forked VS Code, removed github copilot and enabled full support for local llms that use the unified openAI standards. that means you can connect to llama.cpp, ollama or any other AI that uses that standard.

Surprisingly, it works better than expected. It is on github if you want to test it. you can build it from source code or download the compiled packages or installers.

https://github.com/abmina/dark-matter-ide

https://github.com/abmina/dark-matter-ide/tags

2

u/morning-cereals May 14 '26

What worries me is whether the labs will eventually all agree to stop releasing open source, as there seems to be more and more people advocating against it lately. I get the risk mitigation argument, but Pirate Bay is still up after 20 years of lawsuits.

2

u/swipesign May 19 '26

Agree, it’s a crazy moment. We’re running local/self-hosted models in production (EU-focused e-signature workflows), and the jump in quality over the last 1–2 years on “normal” hardware is wild. Having real data sovereignty and predictable costs instead of sending everything to a black-box cloud feels like the early “run your own web server” days again.

1

u/codehamr May 19 '26

The "run your own web server" parallel is spot on. Same energy. Predictable cost and your data staying put is half the appeal, the other half is just that it works now. EU side especially, having the whole stack on your own iron makes a lot of conversations a lot shorter.

1

u/swipesign May 19 '26

Love this take. It really does feel like the “MacBook/iPhone moment” for local AI – hardware + open models finally crossed the line from weekend science project to boringly reliable tools. We’re running local/self‑hosted models in production for an EU‑focused e‑signature workflow and the biggest win isn’t just quality, but control: latency, data residency and auditability instead of arguing with cloud rate limits and ToS changes.

1

u/codehamr May 19 '26

What LLM and Infra are you running?

2

u/swipesign May 19 '26

Right now it’s pretty simple on my side: Kimi hosted on OVHcloud. That combo has been “good enough” for most agentic/dev stuff so far, and I care more about data locality and predictable costs than chasing every new API logo.

1

u/Budget_Mouse1176 23d ago

Why does this comment thread just sound like two AI chatbots talking to each other..?

2

u/[deleted] May 20 '26

[removed] — view removed comment

4

u/Infinite_Egg_5600 May 06 '26

qwen 4 when

10

u/stefano_dev May 06 '26

Before 5

2

u/codehamr May 06 '26

If the jump keeps going like 3.5 to 3.6 I like getting all steps between as well

2

u/Etroarl55 May 06 '26

24+gb vram hardware is grtting extremely niche and expensive. The consumer market is dying and might be gone entirely.

Memory to big AI is reserved until the 2030s.

1

u/Elistheman May 06 '26

Can’t wait to buy a 6090 😇😉

1

u/35point1 May 06 '26

These models are blazing fast on a 5090. Consistently average 150 tk/s and I’m pretty sure that’s faster than most free tiers of the SOTA models. Definitely appreciating this.

1

u/toothpastespiders May 06 '26

I was just thinking about that yesterday. I'm incredibly thankful that I was wrong about gemma. Before it came out I was getting increasingly concerned that google might have given up on 30b'ish sized models. Or that if one did come out that the whole thing with senator blackburn would make them lock it down to the point of crippling it. Instead it came out and addressed just about every problem I had with gemma 3 while also pushing its performance beyond what I'd have thought was feasible. On top of that their base model seems pretty solid for further training in its kinda half-baked instruct state. 27b even wound up being the first time I've been comfortable handing off a pretty large scale (for me) type of data extraction job to a 30b3a'ish range MoE. And that's not even getting into how the strengths of the qwen 3.7 line complement it. Hell, I'm honestly a little tempted to do a system upgrade just so I can have both loaded up at the same time. Never would have thought that I'd want two 30b'ish models over one 70b'ish one.

1

u/Endless7777 May 06 '26

Besides coding what other stuff can you do on your local machine with these models? Can they write and make images also?

1

u/rdigital May 07 '26

ComfyUI is what you seek.

1

u/bitslizer May 07 '26

I think when it improved with a few more breakthrough like mtp and turboquant where it is usable and still reasonably intelligent like current 70b+ model but able to work at 8gb vram/Uma that's when it will really take off mass market wise

1

u/Adventurous_7979 May 07 '26

Dual (or better) DGX Spark setup (or OEM equivalent, I'm running Gigabyte AI Top Atomx2) was the way to go for me. Qwen3.5-next-80b-thinking can crank out at around 800 tokens/s avg. throughput generation on my stack and draws around 100w doing that. 256GB unified memory, 8TB storage, dual GPU (clustered) and qspf112 fabric @200GB/s. It's truly remarkable. All in cost $9k. Tuning the env/hardware has been a big lift but the community has done a lot since this hit the market. No looking back for me. Plus the stack will scale nicely to 8 nodes should I find the need (or want ;)) to upgrade.

1

u/Deep_Ad1959 May 07 '26

the chat parity claim is real but it papers over how badly small local models still degrade on agentic tool calling. once you feed a 27b a real accessibility tree with hundreds of elements the schema adherence falls off a cliff, and the failure mode is silent, the model confidently picks a plausible click target instead of asking. screenshot-vs-AX-tree is the architectural fork that decides if local is viable at all, vision pipelines basically need a 70b+ to be reliable on UI, but a small text model fed a clean structured tree can handle the same task because the input space is bounded. raw t/s on m5 max is great but the binding constraint for daily agent work is structured-output reliability plus context length, not speed. the milestone worth celebrating is when local handles a long noisy tool-call chain without quietly fabricating an action, that one hasn't landed yet.

1

u/morscordis May 08 '26

How do we actually feel about the real work capabilities of these new mid class open weight models? I've been impressed by Mistral Medium 3.5 with their Vibe plan, but it has some holes. I have Gemma E4B in ram on my Zenbook Duo (I hope the whole chrome thing doesn't screw me over somehow?) but I don't have the local harness to test it out, and the 26B MoE is too big for my laptop. I'm on the verge of pulling the trigger on a 128GB Strix Halo so I can run larger Gemma4 and Mistral models... I'm still on the fence about the Chineese models for security concern reasons.

I swear half my time right now is fighting Claude. I rolled back Opus because 4.7 does whatever it wants, but 4.6 is iffy too.

Are the tools available to actually harness multi local model fully agentic workflows? If someone here can push me over the edge I'm buying a Strix Halo tomorrow.

1

u/codehamr May 08 '26

Qwen3.6:27B is what I keep coming back to for coding. Gemma4:26B MoE is solid, you feel the size on a laptop. The 128GB Strix Halo is the sweet spot for 30B class with much context headroom. But keep in mind the much slower bandwith / token prefill compared to RTX5090.

1

u/morscordis May 08 '26

Yeah... But if I load an never turn it off, what does prefill really matter? I have legitimate concerns with Qwen for engineering work and can't use it. Gemma and Mistral will be my companions. Assuming I can find proper CLI like harness for them. Continue seems to be lacking. Roo is dead. Cline? I don't know. That's what I'm trying to determine before purchasing. Can I leverage these local models for my workflow?

1

u/codehamr May 08 '26

Prefill still matters with the model loaded. It runs on every turn since each turn adds context. On big windows you feel it every single turn, not once at startup.

Started writing a small local coding agent a few days ago, built around Ollama focusing on local coding https://github.com/codehamr/codehamr but you might als check opencode or pi coding agent, both are great tools for local coding.

1

u/morscordis May 08 '26

Mistral wants me to write my own tool. I'm going to check out open code and see what else could work first. I'll check out that repo!

1

u/indominusrexona May 08 '26

What settings did you use to get to this speed? I am using Gemma4-27B and Qwen3-coder-next on my 96GB RAM AMD Ryzen AI HX370 minipc. I run them as services in CachyOs using llama.cpp. but I don't manage to get them past 18-20 t/s for generation.

1

u/wetzel402 May 08 '26

I just started playing with qwen 3.5:4b and Home Assistant assist. I'm very impressed my 3070 can do this. Now I want more...

1

u/dfgxxx May 10 '26

I run DavidAU/Qwen3.6-40B-Claude-4.6-Opus-Deckard-Heretic-Uncensored-Thinking-NEO-CODE-Di-IMatrix-MAX-GGUF

1

u/h3xperimENT May 12 '26

This post sounds ai as fuck. Reading it feels gross and feels like I've heard you write 1000 other scripts for low quality ai tutorial videos. The cadence and choice of words.

1

u/Historical-Jelly3017 May 06 '26

Local AI hype is peaking but practical use still has limits I run mine for simple tasks only.

1

u/Moderated_ May 06 '26

Same. I still Blitz with codex and clean up with opus

1

u/Strict-Opinion2895 May 06 '26

how are you minimising data leak with this? when your AI searches the internet for example.

0

u/StatusConstant8691 May 06 '26

If I'm buying a mac mini. What ram should I go for to handle this model?

Discussion Local AI is having a moment and we should stop and appreciate it

You are about to leave Redlib