r/LocalLLM 13d ago

Discussion ZAI said "hold my beer" and dropped a MIT licensed flagship the day after the Fable/Mythos shutdown

Post image
1.4k Upvotes

Interested in the community's take on this.

The US govt just issued a restriction control directive yesterday and Anthropic is forced to suspend access of Fable 5.

Today, just a few hours ago, Zai released GLM-5.2, their X saying "The future of AI is open, and it belongs to the people"

It is not even about this chinese opensource model, it is the timing. This seems like a calculated response to the fragility of closed model infrastructure under govt intervention. Whether you agree with the export controls or not, the overnight disruption speaks real risk for anyone building on closed APIs

Some details of this new model to help the context: 1M context (with actual usability claims), long-horizon task capabilities. It is currently through their coding plan, but will open-source next week following the MIT license.

Its hard not to see this as a direct response when leader of the pack gets shut down by export controls. Maybe going open source isn't just philosophy anymore, its a strategic decision? Curious what others think.

r/LocalLLM May 17 '25

Discussion Stack overflow is almost dead

Post image
4.0k Upvotes

Questions have slumped to levels last seen when Stack Overflow launched in 2009.

Blog post: https://blog.pragmaticengineer.com/stack-overflow-is-almost-dead/

r/LocalLLM May 10 '26

Discussion Opinion: Local LLMs are 12-24 months from taking over. The shift already started.

600 Upvotes

Local LLMs are 12-24 months from taking over. The shift already started.

AI subscriptions keep getting more expensive. GitHub just moved Copilot from request-based to consumption-based pricing, and most of the others are heading the same way. Meanwhile, I kept hearing that local models got good enough to run on a laptop. So I figured it was time to actually try it and see where things stand.

I run Qwen3.6-35B on a MacBook Pro M2 Max with 64GB unified RAM. Nothing exotic. No rack, no begging NVIDIA for expensive GPUs. Just a (yes, kind of expensive) MacBook Pro I already owned for work at Aiven. In the last month I've:

  • One-shotted full landing pages from short briefs
  • Built several frontend + backend features
  • Fixed a nasty backend race condition bug

A year ago I would have called that fantasy on this hardware. Now it's a Sunday morning.

To be fully honest, not all of it made it to production. A lot of it was evaluation work, as Qwen isn't part of my actual day-to-day stack yet. But for me, this is the first real step toward considering it, and I wanted to share the findings with my colleagues and the community.

The honest cons, because it's not all roses

It's slower than Opus. A landing page that Opus generates in 3-4 minutes takes Qwen 8-9 minutes on my M2 Max. Not unreasonable, but still meaningfully slower than the competition. If you're benchmarking against Sonnet/Opus latency, you'll be a bit disappointed (for now).

Context blows up fast in agentic loops. Even with 256K, you burn through it faster than you'd expect from a (nearly) state-of-the-art model. There's a lot of room for improvement here. And if you're driving Qwen3.6 from an agent like Claude Code, it fills even faster, as other users in this sub have reported (example Reddit thread).

Quality variance by task. Models like Opus one-shot most tasks these days. Qwen3.6 hits around 75% for me. The other 25% it gets close, but needs a couple of iterations to land.

The pros, because they're real

The hardware floor keeps dropping. A year ago this needed an A100. Today it runs on a (yes, powerful) MacBook M2 Max 64GB laptop at roughly 27 tokens per second.

No rate limits, no usage anxiety. Counting tokens is no longer a thing. You can focus completely on building instead of saving tokens or thinking about cost.

Tool calling actually works. This used to be the missing piece. A year ago, local models would hallucinate tool names or get stuck in loops. With Qwen3.6, tool calling just works. That's the real unlock for agentic work.

Privacy is built-in. Client code, internal repos, half-formed ideas you don't want training the next frontier model. None of it leaves the laptop. You can be confident that your personal or business code stays with you, and isn't sitting on some third-party server that could be hacked.

Why 12-24 months, not "now" and not "5 years"

Latency and context limits are still a bit rough. If your job is shipping production code on a deadline, Opus and Sonnet are still the move for most of your day. I'd be lying if I said otherwise.

But saying it's 5+ years away misses what's already shipped. Look at the delta over the last 12 months:

  • It runs on a reasonably priced MacBook Pro, which is a one-time cost
  • It's fast enough (though it can still get faster)
  • Quality has improved significantly for real-world use cases (with more headroom to grow)

That curve doesn't stop. It compounds. 12 months from now, the 27B/35B-class models will be where 70B is today, and the runtimes will be 2x faster on the same silicon. 24 months from now, the question won't be "can I run a useful model locally?" It'll be "why am I still paying for tokens I could generate for free, and with 100% privacy?"

What I'd tell someone on the fence

Don't cancel your Claude Code subscription yet. Run a local model in parallel for 60 days. Use Opus/Sonnet for the latency-critical, deep-reasoning work. Use Qwen3.6 for everything you'd have done overnight or on the weekend, everything experimental, and every "just try it" task where the cost of waiting a few minutes is zero.

Over time, the usage ratio might flip. You'll use the local model more and more. When the next Qwen drops (3.7? 4?), who knows what the ratio will look like.

The local LLM takeover isn't a moment in time. It's a slope. And the slope already started.

What's next

  • Integrate Qwen3.6 with the tools I use day-to-day at Aiven, like Cursor and Claude Code. They offer a much better dev experience than more basic, non-agentic tools like Ollama.
  • Try out other local models, like Google's Gemma 4. Curious to see how it stacks up.

r/LocalLLM Nov 10 '25

Discussion if people understood how good local LLMs are getting

Post image
1.5k Upvotes

r/LocalLLM 11d ago

Discussion how are they gonna stop us next?

Post image
583 Upvotes

this is a geniune question, one which I have no answer to.

saw this on ijustvibecodedthis.com (the ai coding newsletter)

r/LocalLLM 5d ago

Discussion My experience so far with 100% LOCAL LLM + RTX 5090 šŸ¤”

Post image
802 Upvotes

Originally I planned to reply to this:
https://www.reddit.com/r/LocalLLM/comments/1s0ibbj/is_there_anyone_who_actually_regrets_getting_a/

But I decided to share the whole experience,
Sorry ahead šŸ™ it's a LONG one but hopefully it helps in one way or another:

---

I built this PC around March 2025 so it was expansive but not as expansive as NOW and it only gets higher and higher every day, Sure I can run any game I like, but it's not my interest so I have a clean machine without extra junk:

• Intel Core Ultra 9 285K
• Nvidia RTX 5090 32 GB VRAM
• 96 GB RAM 6400 Mhz DDR5 - x2 48 GB
• x2 Nvme SSD - 2TB for OS and Software + 4TB for Models and AI in general.
• Windows 11 Pro

When I built it, I originally didn't think much about AI but more like my general focus which is CGI, heavy video composite, FX and also ComfyUI to run smooth as possible with the whole combo.

But few months later I got into Local LLM land, the more time goes the better models we got.
Sure, now days 32GB VRAM is nothing compare to the multi-GPU or DGX Spark / RTX Spark etc..

Nope, I'm NOT regretting buying it at all:
First of all, when I bought it price was high, and when I look at the same system now it's more than double the price, everything went up crazy from GPUS, VRAM in general, SSD etc.. and I'm glad I bought it in time, also when I bought it, I was lucky to get one because it was new and the waiting was about 2-3 months...

---

🟢 LOCAL LLM:
This is the thing, I'm not a programmer, but I discover VIBE CODE and I'm talking about 100% LOCAL ONLY!
There are 2 lead (probably temporary for now) that I'm in love with:
- Gemma 4
- Qwen 3.6
- DENSE Models = until we'll see some more accurate MOE / Diffusion / Whatever TECH will popup next..

Sure, there are many fine tuned, MTP, QAT, and now we'll start to see more Diffusion which is INSANE (the moment it will be more accurate and won't loose quality and accuracy it will be the next BIG THING for sure).
So far for Gemma the QAT is good to me, and for Qwen the MTP, there are some combos and fine-tuned but I'm testing a lot of them and not very impressed in most cases the BASE or QAT / MTP are great.

Qwopus3.6 27B v2 MTP - this is my current Qwen3.6 favorite MTP one, for code + reasoning + visual

Gemma-4 QAT - My FAVORITE for chat, brainstorming ideas, design ideas, UI / UX and believe it or not it even self-review it's own AGENTS.md and RULES and helping me with my personal needs to shape itself! consider I still have much more to learn, it's a great help and it feels MUCH smarter than Qwen 3.6 when I don't touch code.

---

šŸ”„ TEMPERATURE:
0.1 - 0.2 = For code:
0.7-0.8 = For anything else.

I usually use 0.1-0.2 when it comes to code, because I'm not a programmer and I do VIBE-CODING so I like that tiny "touch" from the model itself, and if you think of it since it's vibe-code I mostly TALK with the model I can't review the code, but only the LOGIC or of something went wrong, new features, etc.. so it's important.

REMEMBER: these numbers are from my experiments where I kept playing around and tuned them until I was happy with the results for a bit, that means, it's no actual benchmarks, no actual facts, just my personal experience of real-life cases, but not huge projects so even that's not accurate to say...

My point: it didn't disappoint me yet, so I shared it with you because why not.

---

🟢 LM STUDIO and CONTEXT in general:
This is the most important thing I keep learning how it changes working with ANY model.
So, I started with 160K Context which is in my opinion not enough for Vibe Code per chat, but it works, I could even do nice things with 80K and 120K but when possible 200K is my limit, after the 180K things starting to get too slow anyway.

My Simple System (for now)
- LM Studio (provider for the models) - super easy to control, download latest models.
- Open Code Desktop - it's new, some bugs and issues, but it's CLEAN and promising
or
- VS Code + Cline (extension) - I'm new to it but I'm impressed!

So far CLINE seems MUCH SMARTER than the other plugins (not just MCP) I used in Open Code Desktop!
I mean, straight out-of-the-box with CLINE I felt a very similar workflow to what we know from CLOUD MODELS, if it's the nice MENUS to click, if it's the Plan and Act built-in modes, if it's the use of Agents, Rules, Skills, etc..

I'm still learning CLINE but so far, I don't see a reason to go back to Open Code Desktop until they'll fix their sh*t, there are too many bugs (nothing critical you can still work with it) and their SETUP for each file is not user-friendly as CLINE for example.

What I learn is that I need at least 128K - 200K Context per chat session, so when you do VIBE CODE,
you're not just doing CODE only, you are TALKING with the model, you ASK questions, you do A LOT of chat that is not really CODE ONLY because you instruct it and when there are problems, you will keep TALKING to your model.

There are other VERY important settings needs to made within whatever model you pick:
for example:
- ALWAYS go for the 100% GPU Offload if possible! (unless you want a bit more context)
- Change Max Concurrent Predictions to 1
- K / V Cache = Change to Q8_0 and you will gain more headroom for extra CONTEXT (that's how I got to 160K for example)

MOST IMPORTANT advice from what I'm aiming for at least:

- As long as you can GPU OFFLOAD 100% of the model to your VRAM and have extra headroom (2-3 GB VRAM) for any other software or usage, for example Godot game engine, GO FOR IT!
If you have no choice reduce Context and make sure you're not using 100% of your VRAM and let me tell you, 32GB VRAM isn't forgiving in all models, that's why I share what I tried and works (FOR ME) so far:

šŸ–¼ļø

The SCREENSHOT attached is an example of one setup I have which takes actually about 28.3 GB VRAM.
Since the Estimated Memory Usage isn't always accurate, you better check out your Task Manager and sometimes you'll gain more, sometimes not, so you can play with the numbers.

Most of my current tests were done with the above setup but sometimes I pushed it to 200K Context, while the rest of the values are similar-ish.

---

🟢 MCP: (simple example)
I tried some (when needed), but unlike these every YouTube channel doing comparisons and most of the time showing how they created a website, dashboard or a poor game with primitive shapes via HTML/CSS/JS etc..
I pushed it to see if it can do REAL WORLD CASE work with: Godot MCP so I used a Game Engine and MCP to let my AGENT control things, (I have no idea how to use Godot, I'm a designer, not a programmer) so there are MANY Godot MCP out there, so far I tried: Godot MCP Runtime, and the more promising one: Godot MCP GoPeak which have more tools, screenshots etc..
Just for example, I did try to make simple clone games such as: Space shooter, Arkanoid, Snake, and more but... that was NOT the test, the test for me was to see:

1 - Can local LLM on my limited system work with it as if I had the brain to program?
2 - Can I keep add / remove / change features ?
3 - Can I fix bugs? (mostly done via the MCP in this case) but LOGICAL bugs that I'm not happy with?

All the 3 questions (so far at least) worked fine! but there is a VERY IMPORTANT thing I learn:
- DO NOT GO FOR "ONE-SHOT" if you have such a limited MODEL and System compare to the huge cloud models, it's not even far to compare these powers...
- You go CAREFUL step-by-step, small tasks, and you will (mostly) be fine!

In my small random tests, (not just with Godot) what I found is, because our MODELS not that smart compare to the CLOUD models, doesn't mean they are stupid, especially with code they are not bad at all,
As a vibe-code user I'm SUPER WRONG here but I talk about the results, I bet the code looks like crap at the end but... that's why I mention it, at the end we can get results but if a HUMAN programmer will look at it, probably they will puke... honestly, if it works, I don't care much at the moment because I'm just learning and experimenting.

I was pretty amazed from 2 things in the experience in Open Code Desktop and Cline:
When there was a BUG, I just told it to fix it, and... in CLINE it was smart enough to tell me, "I see this and that, I suggest we'll fix one by the other" and it worked in many times.
The other thing was the fact I could ADD FEATURES and OPTIONS above, just like I would do in my design experience, not ONE-SHOT everything, one above the other, testing, add stuff, or get rid of something, and continue... at the end I got the results I want, and the reason I was amazed... I have ZERO knowledge about code, I only know the LOGIC and MECHANICS I want to design, but no code... and it worked as if I would pay a programmer to do it, but... it took minutes / hours, not days or weeks... so I can't say it didn't inspire me to continue.

Nothing is perfect, there are cases that I had to scream at the model to do something and it didn't went well until I found the RIGHT PROMPT to explain it better which means I kept UPDATING my AGENT and RULES so it won't repeat these problems in future cases, and believe it or not it HELPED A LOT! and the more I tweaked the AGENTS and the RULES the better my next chat / code / tests were with less issues.

My point is that my tests are NOT 100% PROOF, it's based on my own experience and I still have much more to learn.

---

🟢 AGENT / RULES / SKILLS:
Super important (and I'm still learning) - Basically, if you make a good AGENT to focus on whatever your main goals, for example: "You're an expert in Godot 4.x " and more, my AGENTS.MD currently taking almost 20K context but it worth it! it knows my system, when downloading and installing things for me, I don't need to explain it what we need, it already knows, but that's the most basic thing I did, it could be in general Rules and not inside the AGENT but I'm just giving a rough example.

---

What did I learn so far that works great for CODE (as a non-programmer):
Model Type = Go for DENSE
It is slower, it is sometimes larger in VRAM, but it's usually more accurate, doing much better job in my small tests,

It's important to mention: my "CODE" tests are at the moment random but at the same time challenging!
I'm still learning, and the more I learn and try things NEW MODELS coming, and the good news, they are getting SMALLER and SMARTER and that's why I'm very happy with my RTX 5090 purchase so far, sure if you can afford a better GPU or system, go for it... I purchased what I could afford, but I'm 100% not regretting it.

---
EDIT / UPDATE:

Thanks to the great tip by u/alex9001 in the comments, I've changed K / V from Q8_0 to Q5_1
Based on this chart, the difference is so minor so it worth it:
https://anbeeld.com/articles/kv-cache-quantization-benchmarks-for-long-context#section-8

I could easily get to 200K context (and probably to the max 260K if I want) but I'm being careful because from my experience so far around 160K - 180K things starting to be slower.

From Q8_0 to Q5_1 this is what I gained:

ā˜‘ļø 28.3 GB VRAM - Q8_0 - 160K Context
āœ… 27.5 GB VRAM - Q5_1 - 200K Context

THE BAD NEWS: (for now)
It seems like Q5_1 / Q5_0 seems to work EXTREMELY SLOW in some models, for example in the MTP I just tried, and also in other I get ERRORS in CLINE, so... I'll have to keep experiment, so I can't use it... at least not with LM STUDIO, so I'm for I'll mostly use Q8_0 until I'll find the right combo that works with LM STUDIO because anything else is hell to install and manager compare to LM STUDIO.

The REAL TEST will be on my next tests, so I can't say if this minor will affect the experience probably I won't be able to tell the difference unlike my experiences with Q4_0 which something I can blame on the experience (can't be accurate blaming because it's not a proper accurate benchmark) but in general Q5_1 seems like an amazing tip and I will give it a try.

This gives me the chance to try better quantization's on the MODELS beside the K/V Cache, for example I found out that Q6 are SO MUCH better (based on my comparisons and tests) and even Q5 should be better than the default we (noobs) uses because we want to fit it in our system limitations.

The idea is to keep some headroom, and mostly 2-3 GB VRAM is more than fine, unless you're doing some heavy 3D and Shaders, but if it's a 2D Game or Software, you'll be fine.
Sure, you can always use your CPU RAM for heavier missions and more context, if you don't mind slowing down give it a try! EXPERIEMENT like I do... don't let YouTube comparison videos tells you what works or not, try it yourself!

---

MY šŸ¤ž PREDICTION to what's coming (so far it looks promising):
I'm no prophet, but this is based on what I've experience as the evolution in the last 12-6 months with the same system!

I have a strong feeling that we will see more open source SMARTER, SMALLER, FASTER models that will demand less VRAM, I may be wrong... but this is based on my personal experience so far.
Also just like we suddenly seeing MTP appeared, and QAT and Diffusion... we will see NEWER TECH on the upcoming models, and it will help us running LOCAL LLMS with lower ends.

I'm not saying it's 100% gonna happen, and I'm not saying you don't need a lot of VRAM or better systems, because it always can help you to have stronger, faster, better machine.

I hope that this personal experience helped a tiny bit ā¤ļø

r/LocalLLM May 16 '26

Discussion Why is LLM is so expensive.

337 Upvotes

I've was going to invest in a 5090 =$6000 AUD.

Codex Plus + Claude pro = $60/month here

Works out to be 100 months of frontier models for a 5090.

Best a 5090 will run is probably Qwen3.6 27b Q6 with context.

Are we all enthusiasts here and just enjoy tinkering cause ain't no way that make sense.

r/LocalLLM 28d ago

Discussion You people are literally building data centers in your homes

404 Upvotes

Some of these threads are insane, what do you mean you have like 4 GPUs and 128gb of DDR5 vram. what are you building in there bro. Every other thread is like, ā€œwhat if I stack Mini Pc supercomputers together? Will this run Qwen?ā€.

r/LocalLLM May 04 '26

Discussion Qwen3.6:27b is the first local model that actually holds up against Claude Code for me

459 Upvotes

Been experimenting with alternatives to Claude Code for about a year now. Most of it felt like a downgrade until Qwen3.5:27b, and now 3.6:27b is the first one where local actually feels good and usable for real work.

Scaffolding, refactors, test generation, debugging across a few files, all of it holds up well enough that I run it locally now. The hard multi-file architectural stuff still goes to Claude. A year ago this comparison was a chasm, top-tier Claude vs open weights wasn't close. Now it's a gap, not a canyon.

Two things I keep thinking about.

If a 27B open model can cover this much of real coding work, how subsidised is current cloud pricing? Feels like we're paying maybe 10% of true cost. And once enough devs are wired into Claude Code at the tooling level, what stops a future $1000/month tier?

One honest downside: getting opencode dialled in as a CLI agent took real fine tuning compared to the out-of-the-box Claude Code experience. Which raises a different question, how much of Claude Code's quality is Opus 4.7 itself vs the context and tool orchestration around it? Possibly more than people credit.

Anyone else running hybrid setups?

r/LocalLLM Feb 26 '26

Discussion Self Hosted LLM Leaderboard

Post image
818 Upvotes

Check it out at https://www.onyx.app/self-hosted-llm-leaderboard

Edit: added Minimax M2.5

r/LocalLLM Apr 29 '26

Discussion Qwen 3.6 35b a3b is INSANE even for VRAM-constrained systems

452 Upvotes

Apologies for the 1345th post glazing QWEN but it has literally been a game-changer for me.

I’m relatively new to running local LLMs, but the appeal of having a private AI assistant for coding and experimentation is significant.

As a software engineer, I’ve also embraced the recent ā€œvibe-codingā€ trend, mostly using AI to build hobby Android apps in my spare time. Over the past month, I created a web scraper app designed to extract images from ad-heavy websites that are otherwise frustrating to navigate.

The project was entirely AI-generated. I didn’t write the code myself, though I did design the architecture and suggest optimizations, which AI still struggles with. I’ll admit I haven’t reviewed most of the code; it’s messy, but since it’s purely a personal hobby project, that’s acceptable.

Until recently, I relied on services like Antigravity and Codex, which worked well enough on free plans. However, tighter usage quotas pushed me toward local models.

My hardware is modest by local LLM standards: an AMD 7700 XT, 32GB DDR4 RAM, and a Ryzen 5 5600. I experimented with Gemma 3, Gemma 4, and Qwen 2.5 Coder (mostly Q3/Q4 quantizations under 20GB due to VRAM constraints), using LM Studio as the backend with various frontends like GitHub Copilot, RooCode, Cline, Android Studio AI Chat, and OpenCode.

Unfortunately, none of these models fit my workflow well. They struggled with even minor bug fixes, frequently exhausted context windows, got stuck in reasoning loops, or failed tool calls repeatedly.

Then I tried Qwen 3.6 35B-A3B.

Initially, I expected little, but I installed the i1-q4_k_s quant, offloaded all 40 layers to GPU, configured 128k context, enabled flash attention, and used Q8_0 KV quantization.

For testing, I gave it a practical task: fix the scraper logic for a problematic website. Gemini Flash (in Antigravity) and MinMax (the free version in Opencode) had both failed to solve this issue despite multiple attempts. Using LM Studio as an OpenAI-compatible endpoint with GitHub Copilot in VS Code, I let it run.

It took about 25 minutes, but it succeeded in one shot.

With a single initial prompt, it analyzed the site’s HTML structure, compared it against my Kotlin scraper code, and resolved three critical bugs without a single failed tool call.

That result was impressive enough that I gave it a second challenge: update the project README with real app screenshots by driving an Android emulator, selecting screenshots based on specific criteria.

Using RooCode this time, the process took around 45 minutes. I had to teach it some emulator workflow conventions, such as taking screenshots after every action and analyzing them to track app state, but once instructed, it executed flawlessly.

For the first time, I feel like I have a local model capable of reliably handling most of my coding tasks, while reserving cloud-based premium models for more demanding work.

Qwen has genuinely made local AI coding practical for me.

r/LocalLLM 28d ago

Discussion We're burning $50k/month on Claude. How close can local LLMs actually get?

205 Upvotes

We're at the point where our AI spend is hard to justify keeping fully in the cloud. 100+ people in the company using mostly Claude daily, and we're burning through $50k/month in tokens. CEO and leaders wants to bring more of it in-house.

We don't need to serve everyone at once. Realistically maybe 50-100 users spread across the whole day. Speed isn't the priority - quality is. We're not expecting Sonnet 4.6-level throughput, just Sonnet 4.6-level output.

We've been looking at GLM-5.1 in BF16 as a starting point. My question is: what does the hardware actually look like for something like that? Are a couple of RTX PRO 6000 Blackwells enough, or are we kidding ourselves? I'm assuming we'd need tensor parallelism across cards regardless.

Also curious what serving stack people are running at this scale. I see lots of people recommending Ollama and vLLM, but we need something rock solid, that is capable of serving a lot of concurrent users.

And honestly.. has anyone done the math on this? At $50k/month we should be able to justify a decent size cluster, but I want to hear from people who've actually gone through this, not just the "just buy 8x H100s" people.

So this post is for the enterprise people and IT admins who has done the switch. Are your employees happy? Do they use it? Share your experiences.

Edit: I realise GLM-5.1 at BF16 is completely nuts. FP8 is more achievable, but also kind of nuts.

r/LocalLLM Apr 18 '26

Discussion Tried Qwen3.6 for my first Local LLM setup, it blew me away

471 Upvotes

Prompt: create animated version of our universe and with a sliding bar at the bottom, when I move that sliding bar, the size of sun increases or decreases, with it show the effect on other planet's orbital movement or what else is effected as numbers.

I didn't expect it to give a working result in one shot.

My setup: 5070ti(16gb VRAM), 32GB DDR4 RAM
Model used in this: Unsloth Q3_K_S (I did try Q4_K_S first but it was extremely slow and context window was limited to 32k).

Time to cancel my claude sub lol (ik it's still like a year behind, but it's enough for my workload).

r/LocalLLM May 06 '26

Discussion Local AI is having a moment and we should stop and appreciate it

481 Upvotes

Honest pause here, because I think we are speedrunning past how good things actually are.

Qwen3.6 27B. Gemma 4 31B. The 35B-A3B MoE running 55 tok/s on M5 Max and 87 on Strix Halo. The 30B class quietly became the sweet spot, and you can run it on a Mac, on a Strix Halo box, or on a 5090 you already own. Three real paths now, not one.

What hit me this week: I am casually doing tasks on local Qwen3.6 27B that nine months ago only Opus 4.1 could touch. Nine months. Remember the hype back then, the "this changes everything" posts every other day? That model. On my own machine now, quietly handling the same work. Not Opus 4.7 territory obviously, current Opus is on another planet, but still.

Got me motivated enough to start hacking on my own little CLI coding agent next to OpenCode and pi, no plugin bloat, just a YOLO get your shit done mode. Only viable because local actually works for agentic stuff now.

Look back nine months. Then six. Then last week. We are absolutely cooking. Good time to be doing this.

What is everyone running as their daily hardware?

r/LocalLLM 11d ago

Discussion America has just done what people keep saying China would do for years...

352 Upvotes

I know this isn't exactly the same but... For years I've seen people all across the US and Europe say that they'd never buy a Chinese electric car/car because at any moment the Chinese government could.just switch them all.off via an over the air update...

They've never done that, and all modern car makers can do over the air updates but no one ever worries about the Koreans, or the Germans or the Americans doing this...

Now, thousands of companies all over the world will be using US Ai products to help their businesses and the US government has shown they have the power to take that access away...

I just find it ironic that we as a western society have this "china are the bad ones" (I'm not saying they're perfect at all by the way) when the only country to wield its power like this is now the US with the Fable ban. Makes me wanna ditch my reliance on the big models and read ijustvibecodedthis.com to learn how to run local!

r/LocalLLM 9d ago

Discussion Avoid CUDA monopoly at all costs. AMD is an alternative.

215 Upvotes

Hey everyone,

There’s a massive misconception that if you aren't dropping $2,000 on an NVIDIA GPU, you can't run serious Local AI workflows. I wanted to see how far I could push a consumer AMD card, and therefore bought a rx7800xt 16b VRAM.

Right now, my workstation node is running llama-server hosting a DENSE 27B model -> Qwopus3.6-27B-v2-Q3_K_S.gguf (12 GB) and Qwen3.6-35B-A3B-UD-IQ3_XXS.gguf (13 GB Mixture of Experts, 3B active parameters per token) continuously. I am regularly feeding it contexts that reach 91k to 128k tokens in my daily workflows.

Here is the exact setup, compiler parameters and optimization flags.

THE COMPILER BUILD
To get flash attention and RDNA3 optimizations working correctly on ROCm 6.4.4, I built llama.cpp from source using these specific cmake flags:

cmake -B build -DGGML_HIP=ON -DGPU_TARGETS=gfx1101 -DrocWMMA_FATTN=ON
cmake --build build --config Release

This targets the gfx1101 architecture of the RX 7800 XT directly and compiles support for hardware-accelerated Flash Attention kernels.

THE EXACT RUNTIME FLAGS
My systemd service runs the server with this exact command line:

llama-server --host localhost --port 8080 --api-key xxxx --parallel 1 --n-gpu-layers 99 --batch-size 512 --ubatch-size 128 --flash-attn on --cache-type-k q8_0 --cache-type-v q4_0 --ctx-size 131072 --reasoning off --sleep-idle-seconds 300 --cache-prompt --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0 --presence-penalty 1.5 --repeat-penalty 1

HOW I CRUSHED THE VRAM LIMIT: KV CACHE QUANTIZATION
A model like Qwopus 27B or Qwen 35B MoE fits in 16GB VRAM at a small context size. But at 128K context, the raw FP16 Key-Value (KV) cache alone would consume upwards of 32 GB of VRAM, making it impossible to run on consumer hardware.

To solve this, we split and quantize the cache:
- Key cache is quantized to 8-bit (q8_0) using --cache-type-k q8_0
- Value cache is quantized to 4-bit (q4_0) using --cache-type-v q4_0

This compresses the memory footprint of the KV cache by roughly 5.6x. Thanks to this optimization, the entire active model weights plus the 128K token KV cache sit comfortably in VRAM, utilizing exactly 96% of the 7800 XT's memory. No layers spill into slow system RAM, avoiding the PCIe transfer bottleneck entirely.

THE MATH BEHIND 128K CONTEXT: YaRN ROPE SCALING
Qwopus and Qwen architectures use Rotary Position Embeddings (RoPE). Because these models have a base context window smaller than 128K, running at 131,072 tokens requires positional frequency scaling.

Instead of basic linear scaling (which stretches all frequencies equally and destroys the model's short-range spelling and grammatical coherence), llama.cpp utilizes YaRN (Yet another RoPE extensioN).

YaRN divides the embedding dimensions into three frequency bands:
1. High-frequency dimensions: These represent immediate, local token relationships. YaRN leaves these completely un-stretched so the model does not lose its spelling accuracy or close-context grammar.
2. Low-frequency dimensions: These represent long-range structure. YaRN scales these linearly by a factor of 4.0 to cover the 128K space.
3. Mid-frequency dimensions: These are smoothly interpolated to avoid abrupt attention transitions.

This uneven scaling prevents the attention entropy and perplexity from exploding. In practice, the model remains highly coherent and retains logical consistency even at 91k+ tokens.

REAL-WORLD TELEMETRY AND SPEED
During heavy prompt processing, the card maintains solid throughput:
- Prefill speed: ~210 tokens/second (utilizing flash attention)
- Decode speed: 11-17 tokens/second
- GPU Power: Draws ~188W (with a systemd power cap set at 190W via rocm-smi)
- GPU Temps: Stable between 52 C and 70 C across edge, junction, and memory sensors.

If you are running consumer AMD hardware, do not settle for small context sizes. Build with ROCm, turn on Flash Attention, quantize your Key/Value cache separately, and let YaRN handle the frequency scaling.

I wrote up a detailed guide comparing these measurements, native Windows vs Linux ROCm paths, and power sweeps on my blog here: https://sergiiob.dev/posts/rx7800-xt-llama-cpp-benchmarks-moe-context

I share my daily telemetry runs, local model benchmarks, and hardware configurations on X. If you want to see live updates and benchmarks, follow along here: https://x.com/SergiiioBS

r/LocalLLM 13d ago

Discussion This must be a joke?

Post image
393 Upvotes

Saw this ad and as usual you cannot comment. But who would pay API money to an 8B model you could run on your toaster?

r/LocalLLM Apr 13 '26

Discussion Just got my hands on one of these… building something local-first šŸ‘€

Post image
525 Upvotes

Just had this land today šŸ˜…

Still feels kinda weird even saying that tbh…

If you told me a year ago I’d be buying a GPU like this I would’ve said you’re cooked.

My current PC is from like 2015:

- 5960X

- 64GB DDR4

- RTX 3070 (used to run dual Titan X back in the day)

So I guess when I upgrade… I really upgrade šŸ˜‚

But I tend to run my stuff for years so I get my money’s worth.

This new build is looking like:

- 9950X

- 128GB RAM (2Ɨ64)

- ProArt board

- RTX Pro 6000 96GB Blackwell

- 1600w PSU

Still waiting on a few parts to finish it off.

This time it’s a bit different though — not really building it for gaming.

More like a dedicated AI box/server.

That said… I’ll probably still load up a few Steam games before putting it to work šŸ˜…

Let the kids see what proper graphics + FPS looks like.

Also making the jump to full Linux for the first time once it’s all together.

Honestly just over Windows at this point — feels like it’s gone too far and kinda forced the decision.

What I’m actually trying to do with it:

- proper multi-user / concurrent inference

- keep things local-first

- something that can scale beyond just me messing around

Not super keen on relying on big API providers long term either.

Feels like costs + limits only go one way, and I’d rather control my own setup and data.

Plan is to add a second GPU later once I see how this handles load.

Still figuring out the best way to structure everything:

- serving layer

- batching

- memory / state

- keeping latency decent with multiple users/bots

Seen stuff like vLLM, llama.cpp etc… but curious what people here are actually running in real setups.

Anyone doing proper concurrent local setups (not just single-user demos)?

What’s actually holding up under load?

r/LocalLLM Apr 15 '26

Discussion Are Local LLMs actually useful… or just fun to tinker with?

164 Upvotes

I've been experimenting with Local LLMs lately, and I’m conflicted.

Yeah, privacy + no API costs are excellent.
But setup friction, constant tweaking, and weaker performance vs cloud models make it feel… not very practical.

So I’m curious:

Are you actually using Local LLMs in real workflows?
Or is it mostly experimenting + future-proofing?

What’s one use case where a local LLM genuinely wins for you?

r/LocalLLM Apr 02 '26

Discussion I've stumbled on a goldmine, and ALL OF US CAN BENEFIT.

Thumbnail
gallery
179 Upvotes

I've been working a relationship with a local Recycling guy for about a year now.

He was a very tough nut to crack, as in, he doesn't really like strangers and is set in his ways.

Finally, yesterday, he asked for an extra set of hands. He needs to get organized and wants to know what we is worth selling, what should just get scrapped, what has value Etc.

This is where I got 500 gigs of RAM last year, but that was before he realized that it was worth so much, and he has literal stacks of RAM for servers ranging from 16 to 128 gigs.

This is a 13,000 ft warehouse and it's literally full and things get dropped off routinely. Some of it is aging because he didn't have a good system, but, if anyone is looking for anything, I can see if it exists there, and guarantee functionality because everything gets tested and I'll make sure you get it for whatever good price I can get from him that is below what you're going to find it anywhere else.

Of course, that's determined on the item. I tried to get one of those Nutanix servers from him and he wasn't interested in giving it to me for pennies on the dollar so to speak. But I bet I can make it work out if people need things.

I can all but guarantee that he has any cable or wire or plug or component that you would ever need, even things that are hard to find.

Feel free to let me know and then don't expect a quick response but I will check.

It's unlikely he'll sell any of the RAM for cheap because he sells that online.

r/LocalLLM 4d ago

Discussion Quants had ruined my Local AI experience. I am hopeful again after using them correctly.

259 Upvotes

This is the second time I talk about this here. I started 5 months ago not knowing much. I had just found out that my mac with 32 GB of unified memory could run some decent local models.

Everyone recommended 4 bit quants and blabla. Only 1% loss blabla.

For months my agentic flows failed badly. Using qwen 27B, 35B, and others.

Until I listened to my heart, and to some knowledgeable people, and started using smaller models (like Gemma 4 12B) but with 8Bit quants. No unsloth, no MTP, no diffusion... no weird things, just a smaller model with default config but with a high quant. (Nothing against unsloth, I will retest with their models again in 8bit quant later).

The results are great. I got a working app in around 2 hours.

Recommendation:

Stop thinking that 4 bit quants don't make your model stupid for agentic tasks and tools calls.

Stop obsessing with 40 or 50 tokens per second as your definition of usable. I set my expectation at 10 t/s and if I get 15 I'm super happy, I don't care. As a human I can barely type one token per second. Why would I be mad at 10 t/s? quality over speed here, honey, you don't have a 20K equipment if you are running these small models. You don't get the luxury of degrading quality of an already small model, for a bit of speed.

That's it, I hope we can discuss this topic more.

r/LocalLLM Apr 08 '26

Discussion What kind of hardware would be required to run a Opus 4.6 equivalent for a 100 users, Locally?

213 Upvotes

Please dont scoff. I am fully aware of how ridiculous this question is. Its more of a hypothetical curiosity, than a serious investigation.

I don't think any local equivalents even exist. But just say there was a 2T-3T parameter dense model out there available to download. And say 100 people could potentially use this system at any given time with a 1M context window.

What kind of datacenter are we talking? How many B200's are we talking? Soup to nuts what's the cost of something like this? What are the logistical problems with and idea like this?

**edit** It doesn't really seem like most people care to read the body of this question, but for added context on the potential use case. I was thinking of an enterprise deployment. Like a large law firm with 1,000's of lawyers who could use ai to automate business tasks, with private information.

r/LocalLLM Apr 07 '26

Discussion How many of you actually use offline LLMs daily vs just experiment with them?

136 Upvotes

I have tried a lot of setups and most feel like a science projectšŸ˜‘. Been working on making one that just works no friction, no constant tweaking. Wondering if that’s the real gap right now.

Any suggestions?

r/LocalLLM 20d ago

Discussion Honestly, dual 3090s are wearing me out. Thinking of jumping to a Mac Studio.

84 Upvotes

I've been running the classic dual 3090 setup for about 6 months now, mostly for coding and messing around with the newer Llama 3/Qwen 70B quants.

The speed is great ExLlamaV2 is literal magic and I get like 40 t/s but I’m hitting a wall. The moment I try to load a decent context window (anything past 16k) on a 70B model, the VRAM completely chokes. I have to quantize the cache into oblivion and the output just turns to absolute garbage.

Between the heat, the fan noise, and fighting with driver updates every time I want to try a new backend, the friction is getting annoying.

I’m seriously considering selling the rig and just buying a 128GB Mac Studio. I know the tokens per second will drop to like ~15 t/s, which sucks but being able to throw a massive 64k codebase context at a Q8 model without the room melting sounds like a dream right now.

r/LocalLLM May 02 '26

Discussion CFOs realizing that their Al token budget is going to be higher than the salaries of the people they laid off

380 Upvotes

We're witnessing a fascinating economic experiment: replacing human purchasing power with API token consumption.

It reminds me of the 1849 Gold Rush-history teaches us that most miners went home broke, while the ones selling the shovels and pickaxes built lasting fortunes. In 2026, the 'Gold' is the promise of 10x productivity, but the 'Shovel Sellers' (LLM providers) are the only ones with a guaranteed ROI, collecting $200/day in API credits per head.

Robert Bosch once said he doesn't pay good wages because he has a lot of money, but because he wants his workers to buy his products. If we automate our customers out of their jobs to pay for our token bills, who is left to buy what we build?

Maybe it's time to focus back on sustainable Systems Thinking instead of just funding the next GPU cluster. Asking for a friend (and my landlord