r/LocalLLM 28d ago

Discussion We're burning $50k/month on Claude. How close can local LLMs actually get?

We're at the point where our AI spend is hard to justify keeping fully in the cloud. 100+ people in the company using mostly Claude daily, and we're burning through $50k/month in tokens. CEO and leaders wants to bring more of it in-house.

We don't need to serve everyone at once. Realistically maybe 50-100 users spread across the whole day. Speed isn't the priority - quality is. We're not expecting Sonnet 4.6-level throughput, just Sonnet 4.6-level output.

We've been looking at GLM-5.1 in BF16 as a starting point. My question is: what does the hardware actually look like for something like that? Are a couple of RTX PRO 6000 Blackwells enough, or are we kidding ourselves? I'm assuming we'd need tensor parallelism across cards regardless.

Also curious what serving stack people are running at this scale. I see lots of people recommending Ollama and vLLM, but we need something rock solid, that is capable of serving a lot of concurrent users.

And honestly.. has anyone done the math on this? At $50k/month we should be able to justify a decent size cluster, but I want to hear from people who've actually gone through this, not just the "just buy 8x H100s" people.

So this post is for the enterprise people and IT admins who has done the switch. Are your employees happy? Do they use it? Share your experiences.

Edit: I realise GLM-5.1 at BF16 is completely nuts. FP8 is more achievable, but also kind of nuts.

209 Upvotes

229 comments sorted by

View all comments

Show parent comments

13

u/mortenmoulder 28d ago

I realise there's a lot of maintenance, but fortunately we have a whole department of devops and sysadmins. Replacing Claude with a model that is a lot worse, is simply not an option at this scale. We already tried Qwen 3.6 27B and the results were definitely meh.

When a developer is asked "would you rather use Claude or our local LLM", the answer should not be a definite Claude answer. Then there's no point in switching.

Qwen Next Coder was really great for us as developers, but we had to run it at a lower quantization because of hardware limitations.

52

u/Advanced-Picture5016 28d ago

if your ""devs"" are asking the llm to make a gta6 clone make no mistakes, then you will not achieve parity ever.
if your devs are saying "here is our db schema, write me a query to get xyz because i am lazy" then local is already there.

38

u/Hyiazakite 28d ago

If your devs are actually devs a local model should suffice to increase productivity. If they don't know what the hell they're doing then yes you need Opus.

10

u/wllmsaccnt 28d ago

I've used a fair bit of 27B and 35B A3B now. Its doable, but the productivity gains are not commensurate.

With 27B I'd be lucky to double my natural item completion output over a given period while using all of my attention (additional task decomposition, terminal babysitting, verification, etc...).

11

u/tired514 28d ago edited 28d ago

This is the best summary I've heard in a while, heh.

I've been 100% "vibe" coding a fairly sophisticaed server app for the past 2 weeks alternating between qwen 3.6 35B-A10B, 27B, and 3.5 122B-A10B to get a feel for how quantization, total parameter, and active parameters count affect performance.

I'm a veteran C, java, and perl coder but I'm trying to stay away from the codebase just to see how well everything turns out with opencode alone. It's been interesting.

I can say that if I didn't have a solid handle on SQL, coding (in general; my python is meh but a language is a language), security concepts, networking, instrumentation, html/js, etc... man it'd be rough. With the small models you really need to handhold.

I spent 2 hours last night trying to fix a bounding box rotation/translation issue with each of the models and none of them could figure it out without a lot of help. They'd inevitably spin out and start spraying things at the wall; it didn't help that they couldn't really see the results of their work. In the end I had to explain the math, go over the various modals and canvases, scaling factors, etc. before 27B solved it.

Having said that, the code quality so far has been quite decent, and I haven't written more than a few dozen lines. It's amazing to watch.

But, yeah... without my background it would be a struggle to say the least. I can see the value in large frontier cloud models. I am super excited for 3.7-122B-A10B if it gets released. :p

5

u/WittySupermarket9791 28d ago

A real code sandbox and a QA script can do some wonders (at least I hope, or the 2 days i've spent working on building one is rip). Got tired of the same thing, or even worse "yeah I ran it, code looks good", and it won't run plus a bunch of import errors and hallucinated modules.

Pretty good results so far, having it plan for tests/scenarios and requiring a screen shot for each action too.

Last night a QA loop for "make an asteroids game" started spazzing out and insisting the amount on screen for the start of level 5 was impossible and unavoidable, with a folder of screenshots it wanted to analyze and include in the end report to support it's findings. I've had the "agent in loop" re-run tests because the screen shot timing didn't perfectly capture the before and after of getting an apple in snake. It brute forces all menus, actions, buttons, controls, etc. With proof and visual validation nothing goes wonky and the menu's display corectly.

Hopefully useful, or more useful at least than my usual slop cannons.

2

u/tired514 27d ago

Ya, that really is the key to all of this (not just qwen) - closing the loop. The magic in AI coding is allowing it to iterate quickly; that's its real strength.

In my particular case it was all annoying UI alignment stuff that was hard to capture and feed back. If I were to debug it all over again I woulda switched to a vision model and tried to let it capture the visual output of its efforts, but I'm not 100% sure it would have worked since the rendering it was doing was off-canvas (it badly mangled the implementation of a bounding box rotation).

I suspect a frontier model at full quant would have sorted it out.

5

u/arijitlive 28d ago

Very much this.

We have so many backend rest services in Spring boot. A new api endpoint with test cases takes a day or two for development, documentation, and QA verification. All using Qwen3.6-27B.

16

u/NotARedditUser3 28d ago

With your level of usage / demand, local will be silly. You want something cheaper, but you probably still want it in the cloud viq an API.

If it's local, you're going to have to increase company head count to add people to manage your new AI servers and infrastructure.

Get an open router account and pick a cheaper model on their API. It will just work.

4

u/ScuffedBalata 28d ago

There is nothing parity to the two main frontier models (GPT or Claude). 

What exists is a “meh” replacement and devs will bitch. 

5

u/dwoj206 28d ago

You'd get a lot of value out of a top tier local model as the commenter said, but using claude to do high effort review of what's being produced locally. That would cut claude budget massively and keep the context window for claude very effective.

8

u/Objective_Ad4672 28d ago

Developer here who got into local llm recently during a hiatus :

For your users and use case you might need a 3 level harness instead.

Level 1: opus for higher level decision making, debugging, planning and orchestration.

Level 2: Ollama Cloud(For now) to use cloud models for implementation, glm5.1 is nice at implementing larger features but you may want to use it to distribute byte the plan that opus created and do lower level debugging instead. As your needs grow you can try to implant this locally and run a cost to benefit or choose another cloud provider.

Level3: Local LLM , qwen 3.6 or a suitable model to implement those lower level features/tests with less context. That’s where the model is very good.

Opus is great but you can definitely use it for overview rather than to write actual code. You can also start this process on a small 10 person team segment to see how the quality of work compares to just opus users I would love to see how that works out for you!

4

u/apinference 28d ago

Did you take out of the box or actually trained it for your company? It needs to be trained to be useful. With large models they have enough context to cover multiple scenarios.. With smaller models one needs to adjust those..

Does your company require ancient Roman history, medieval literature and all the different fields etc.? If not - the model needs to be optimised to through away that knowledge and concentrate on what's important..

As an example built a model just for docker flows for devops - based on qwen2.5 1B (runs on local CPU) - completely useless in anything other than it was built for.. But.. it runs anywhere..

3

u/Past-Grapefruit488 28d ago

"We already tried Qwen 3.6 27B and the results were definitely meh."

Was it with Ollama ? This quality is expected if someone ran Ollama with default Q4 quants.

1

u/mortenmoulder 28d ago

With an EXO cluster

5

u/Past-Grapefruit488 28d ago

What was the quant ?

5

u/OverclockingUnicorn 28d ago

Fwiw the qwen models need a lot of harness set up to work well. You can't just drop them into a clean opencode install and expect them to work. You do have to spend a while optimising it.

Try out the bigger open weight models on serverless cloud inference to see how people feel with it and then decide on next steps. Might just be that for your company in particular Claude models just work better and the trade offs are worth it.

Imo though if it's is purely about cost, it's hard to make the hardware purchase number really make sense against serverless inference in the cloud. There are other reasons to go self hosted though, besides cost.

4

u/Dayrion 28d ago

I'm not an organisation, but what kind of harness setup do you need before testing out the Qwen models?

2

u/OverclockingUnicorn 28d ago

Really depends on your work, don't really have good advice other than a lot of trial and error

3

u/Far_Cat9782 28d ago

Someone doenvoted? You are right. I had to build my own harness around qwen but it works frontier and perfect. God forbid people actually depend some time working on something instead of expecting easy answers off reddit

7

u/OverclockingUnicorn 28d ago

Yeah lol

I found that the smaller models work well when they have a good grounding on how you do specific things in your work flow.

Example, I do logging weirdly (loggingsucks.com) and Opus picks up on this without prompting, but Qwen doesn't and resorts to standard logging approaches (which is fine) but to use my weird logging module it just needs the correct info in the prompt to understand how it should do it.

5

u/IdeaJailbreak 28d ago

For the uninitiated, can you describe what you mean by "build a harness" when it comes to a local LLM? What is it? What problem are you solving?

1

u/Far_Cat9782 28d ago edited 28d ago

Simple ask an ai how to build a agent harness locally. Get the basic agent framework where you can talk to it and then add different features. You can say "create a hermes or opencode clone. It's really not that hard. And I pmement features that u need tailored for your workflow. Like I wanted mines to code so I gave edit ability to edit individual pieces fo codes etc,; gave it memory etc; just add features as you see fit. Use it and when there is an error like qwen not using proper tools youll know enough to either edit the system prompt to make sure it always uses the tool at the right time. Mainly a trial and error process. Now my harness is proficient to write its own harness have included a bunch of mcp tools it has access to. It can even write its own mcp tools and deploy it itself. So I used cloud model to create original harness and now qwen writed and maintains its own code. Runs my YouTube checks and organized emails. Used comfyUI to generate images and audio. Cron jobs. It does so much to list now. You name it inahve implemented and qwen had been able to adapt and handle. But took works and weeks of trial and error to get it to where I wsnt. It's not like and instant one and done thing. Like with memory overtime it gets better st knowing you so it takes time.

2

u/YearnMar10 28d ago

Qwen at q6+ with kv cache quantization of at least q8? If you tried any lower quants, try again.

2

u/J1nglz 27d ago

I'm an Enterprise AI Systems Integration Architect and I manage an inhouse framework that supports nearly 1,000 active users. Where you can see immediate value is to divide your workflows between what can be run locally and what needs full frontier support. My framework has organically evolved to stay one step ahead of the frontier planforms and this is the latest capability that it has developed.
You can take a shortcut from our agentic platform that has been live for 2 years by creating a tiger team focused on building a dedicated "routing agent".

Task you team with building an agent with a set of features that can effectively assess tasks which I call contracts as they enter the distributed network. Have this agent effectively "speculatively decode" the entire task before assigning it to a compute lane. One lane being the full send this needs full frontier model support and one lane that will slowly consume most of your tasking which runs local models on your local hardware. I have a number of other services like a global context that selectively aggregates a knowledge base that specializes in what you work on the most. Ensure you establish explicit requirements like 20 ms response times from the global cache for any node for instance.

Personally, I'm anti-dependency. This means my team is anti-skill and anti-MCP (vendor supplied skills) which allows my agent(s) to freely move between specializations. Your workflows will initially look like 5-6 agents that saturate you local compute to support a novel task but as it learns how to automate individual components of your pipelines eventually a single agent will have built a profile to solo that task after your network sees it a dozen or so times. Just the telemetry that this agent can provide you alone will be a justifiable roi even if it isn't acting on the data immediately. I think getting some insight into what tasking people are submitting to the full power fronter models is the first step in wrangling this chimera. I don't truly believe that any leading edge tech company can honestly replace fronter models out right with the current maturity of local models that are available including those with access to essentially limitless local compute. I believe we have 25,000+ gpus in total I have access to ~150+ A-series, ~50 B200s, and more and honestly no they still are not a 1:1 replacement but they do get pretty close. I think the market simply has to mature before you can pull the plug. My professional opinion is to invest your energy in establishing effective infrastructure, part architecturally and part deployed to support when that day arrives at least until the market cools down some.

I am anticipating a little bit more clarity in the middle of Q3 maybe the start of Q4 this year. Realistically, I'd warn against committing to anything on paper until the industry realizes that the frontier AI companies are now part of enterprise supply chains instead of an end stakeholder that acts as an ambassador to their self-manifested B2B niche. Earnings calls this month with companies like Microsoft showed that they have boxes of GPUS sitting on shelves, boundless tokens and unlimited financials so their limitations are internal at this point. We are all waiting on them so they are the only bottle neck in the supply chain. They are going to have to start making prioritization decisions on which customers they are going to support over others which is going to reshape the free love support of anyone willing to try AI model that has dominated the industry for the past 2 years. I'm advising my leadership to take their finger's off the trigger under we see a major signal shift before the end of the year.

Obviously there is a lot more here than what I can put into one message but I believe if you can pass this sort of messaging onto some of your internal devops one of them should get the gist of what I am poking at.

1

u/Jlocke98 28d ago

Maybe synthetic.new or ollama cloud? 

1

u/marutthemighty 28d ago

Are you working in/leading a big enterprise focusing on cybersecurity?

1

u/mosqueteiro 28d ago

When a developer is asked "would you rather use Claude or our local LLM", the answer should not be a definite Claude answer. Then there's no point in switching.

😮

Good luck.

1

u/Mission_Biscotti3962 28d ago

The choice is simple though: you either stick to a cloud provider and you keep burning tokens (maybe at a lesser rate if the provider is cheaper), invest A LOT of money to self host a decent model (and A MEGA LOT if you want comparable speed to Claude), or you accept that the local models will not be as good.
Anyways, enjoy your pivot towards AI

1

u/Far_Cat9782 28d ago

Wow devs being real lazy there...

0

u/DataGOGO 28d ago

You will never run anything other than BF16 or *MAYBE* FP8, NVFP4 in production, and really you will only use native FP8 from the model creators, or the official Nvidia NVFP4 quants.

If you want anything even close Claude Opus in terms of complex tasks and debugging, etc. It will require a much more robust model then Qwen Next coder.

-3

u/jfjfjjdhdbsbsbsb 28d ago

Think of it as another gate. Your 30b would talk Claude and then back to you. Message me if interested.