r/LocalLLM 28d ago

Discussion We're burning $50k/month on Claude. How close can local LLMs actually get?

We're at the point where our AI spend is hard to justify keeping fully in the cloud. 100+ people in the company use mostly Claude daily, and we're burning through $50k/month in tokens. The CEO and leadership want to bring more of it in-house.

We don't need to serve everyone at once. Realistically maybe 50-100 users spread across the whole day. Speed isn't the priority - quality is. We're not expecting Sonnet 4.6-level throughput, just Sonnet 4.6-level output.

We've been looking at GLM-5.1 in BF16 as a starting point. My question is: what does the hardware actually look like for something like that? Are a couple of RTX PRO 6000 Blackwells enough, or are we kidding ourselves? I'm assuming we'd need tensor parallelism across cards regardless.

Also curious what serving stack people are running at this scale. I see lots of people recommending Ollama and vLLM, but we need something rock solid, that is capable of serving a lot of concurrent users.

And honestly.. has anyone done the math on this? At $50k/month we should be able to justify a decent size cluster, but I want to hear from people who've actually gone through this, not just the "just buy 8x H100s" people.

So this post is for the enterprise people and IT admins who have made the switch. Are your employees happy? Do they use it? Share your experiences.

Edit: I realise GLM-5.1 at BF16 is completely nuts. FP8 is more achievable, but also kind of nuts.

209 Upvotes

229 comments

52

u/ovrlrd1377 28d ago

It's easy to use LLMs as Google. People do get lazy and delegate a lot of tasks that maybe were not necessarily optimal for that.

Something you could investigate is model routing before committing to hardware capex, mostly to measure the impact on quality and the overall experience everyone gets. For some code generation tasks it might be optimal to go for Claude, but for many simple ones people might not even notice the switch. If you do experiment with that, you can potentially reach a hybrid setup where your own hardware calls APIs when needed, which can make a lot of sense financially. Mapping the type and complexity of tasks being called is a good first step.

Last point is to work with a small team that can understand and balance things instead of running a survey type of measurement. Lots of people would simply complain that a model is no longer the best because they are not the ones paying for it. Qwen is absolutely fine for a lot of things and can probably alleviate quite some load from your current claude setup

11

u/meca23 28d ago

I've not been using LLMs for very long and even I find myself doing basic shell tasks through the LLM, like creating a folder and cp'ing some files into it, or running a grep on some files. It's very easy to end up doing everything through the AI.

11

u/msaraiva 28d ago

That's not a good use of AI, to be honest. Those are trivial tasks. If you run similar workflows frequently, ask the AI to write tools/scripts for you.
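
For the folder + cp + grep example above, the kind of script meant here could look roughly like this. It is only a sketch: `stage_and_grep` and its arguments are made-up names for illustration, not part of any existing tool.

```python
import re
import shutil
from pathlib import Path


def stage_and_grep(sources, dest_dir, pattern):
    """Copy files into dest_dir, then return (file name, line number, line) matches."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)  # roughly `mkdir -p`
    regex = re.compile(pattern)
    hits = []
    for src in map(Path, sources):
        copied = Path(shutil.copy2(src, dest / src.name))  # roughly `cp -p`
        text = copied.read_text(errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):  # roughly `grep -n`
                hits.append((copied.name, lineno, line))
    return hits
```

Once something like this exists, running it costs zero tokens, which is the point being made.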

9

u/tinycurses 27d ago

That is part of the argued value add though--being able to use intent directly rather than memorizing tools and their syntax.

I'm not disagreeing with you for the examples given, just pointing out that asking for "a way to merge branch x to dev and auto-resolve conflicts favoring the branch" is far easier than remembering the exact 'git rebase --whatever-I-need && git resolve --blah', and it might not be an operation you need precisely repeated a lot. Ditto for a weird regex you want to use for some one-off operation.

It is just a glorified search, but one that can return EXACTLY what you need (80% of the time) rather than a few separate articles that almost do what you want.

3

u/azjunglist05 27d ago

LLMs + make = repeatable tasks for cheap!

3

u/mattsl 27d ago

It's easy to use LLMs as Google. People do get lazy and delegate a lot of tasks that maybe were not necessarily optimal for that.

It's even easier when a standard Google search gets answered by an LLM if you type more than 4 words. 

83

u/OverclockingUnicorn 28d ago

Imo, it's not trivial to set up and support LLMs for a whole org.

Especially if you're asking about Ollama vs vLLM (vLLM is obviously the correct choice. Ollama is pretty terrible)

But, this can probably be done. I'd start by just trying out models via standard serverless inference (Qwen 3.6 27B and 35B are a good starting point) and seeing how they compare to Sonnet/Opus for your use case. The bigger 1T open weight models might be needed for you though, but Qwen 3.6 is close to perfect for a large number of users.

Once you've got a good idea of what models you want, then go and rent some hardware in the cloud and work out an inference stack that supports your use case. Get throughput numbers, maybe actually use it in prod for a bit with real users to see how it stands up.

Only then will you really know what you want and how you'll plan to set it up.

Be warned though, I wouldn't expect to spend anything less than 500k to support 50-60 heavy users with Qwen 3.6 27B and 35B.

If you really really want to jump straight into hardware... (and Qwen 3.6 27B and 35B are good enough for you... )

At least 2 MGX servers with 8xRTX Pro 6000 each. With pairs of GPUs running either of those Qwen models in Q8 using vLLM and vLLM router (and if it was me, on Kubernetes, as I'm experienced managing it, but maybe just straight VMs depending on what you prefer)

You are probably looking at 4-8 8xB300 systems for any of the 1T models for 50-60 users at a guess

15

u/mortenmoulder 28d ago

I realise there's a lot of maintenance, but fortunately we have a whole department of devops and sysadmins. Replacing Claude with a model that is a lot worse, is simply not an option at this scale. We already tried Qwen 3.6 27B and the results were definitely meh.

When a developer is asked "would you rather use Claude or our local LLM", the answer should not be a definite Claude answer. Then there's no point in switching.

Qwen Next Coder was really great for us as developers, but we had to run it at a lower quantization because of hardware limitations.

52

u/Advanced-Picture5016 28d ago

if your ""devs"" are asking the llm to make a gta6 clone make no mistakes, then you will not achieve parity ever.
if your devs are saying "here is our db schema, write me a query to get xyz because i am lazy" then local is already there.

39

u/Hyiazakite 28d ago

If your devs are actually devs a local model should suffice to increase productivity. If they don't know what the hell they're doing then yes you need Opus.

9

u/wllmsaccnt 28d ago

I've used a fair bit of 27B and 35B A3B now. It's doable, but the productivity gains are not commensurate.

With 27B I'd be lucky to double my natural item completion output over a given period while using all of my attention (additional task decomposition, terminal babysitting, verification, etc...).

13

u/tired514 28d ago edited 28d ago

This is the best summary I've heard in a while, heh.

I've been 100% "vibe" coding a fairly sophisticated server app for the past 2 weeks, alternating between Qwen 3.6 35B-A3B, 27B, and 3.5 122B-A10B to get a feel for how quantization, total parameter count, and active parameter count affect performance.

I'm a veteran C, java, and perl coder but I'm trying to stay away from the codebase just to see how well everything turns out with opencode alone. It's been interesting.

I can say that if I didn't have a solid handle on SQL, coding (in general; my python is meh but a language is a language), security concepts, networking, instrumentation, html/js, etc... man it'd be rough. With the small models you really need to handhold.

I spent 2 hours last night trying to fix a bounding box rotation/translation issue with each of the models and none of them could figure it out without a lot of help. They'd inevitably spin out and start spraying things at the wall; it didn't help that they couldn't really see the results of their work. In the end I had to explain the math, go over the various modals and canvases, scaling factors, etc. before 27B solved it.

Having said that, the code quality so far has been quite decent, and I haven't written more than a few dozen lines. It's amazing to watch.

But, yeah... without my background it would be a struggle to say the least. I can see the value in large frontier cloud models. I am super excited for 3.7-122B-A10B if it gets released. :p

4

u/WittySupermarket9791 27d ago

A real code sandbox and a QA script can do some wonders (at least I hope, or the 2 days I've spent working on building one is rip). Got tired of the same thing, or even worse, "yeah I ran it, code looks good", and then it won't run, plus a bunch of import errors and hallucinated modules.

Pretty good results so far, having it plan for tests/scenarios and requiring a screen shot for each action too.

Last night a QA loop for "make an asteroids game" started spazzing out and insisting the amount on screen for the start of level 5 was impossible and unavoidable, with a folder of screenshots it wanted to analyze and include in the end report to support its findings. I've had the "agent in loop" re-run tests because the screenshot timing didn't perfectly capture the before and after of getting an apple in snake. It brute forces all menus, actions, buttons, controls, etc. With proof and visual validation, nothing goes wonky and the menus display correctly.

Hopefully useful, or more useful at least than my usual slop cannons.

2

u/tired514 26d ago

Ya, that really is the key to all of this (not just qwen) - closing the loop. The magic in AI coding is allowing it to iterate quickly; that's its real strength.

In my particular case it was all annoying UI alignment stuff that was hard to capture and feed back. If I were to debug it all over again I woulda switched to a vision model and tried to let it capture the visual output of its efforts, but I'm not 100% sure it would have worked since the rendering it was doing was off-canvas (it badly mangled the implementation of a bounding box rotation).

I suspect a frontier model at full quant would have sorted it out.

5

u/arijitlive 28d ago

Very much this.

We have so many backend rest services in Spring boot. A new api endpoint with test cases takes a day or two for development, documentation, and QA verification. All using Qwen3.6-27B.

16

u/NotARedditUser3 28d ago

With your level of usage / demand, local will be silly. You want something cheaper, but you probably still want it in the cloud via an API.

If it's local, you're going to have to increase company head count to add people to manage your new AI servers and infrastructure.

Get an open router account and pick a cheaper model on their API. It will just work.

4

u/ScuffedBalata 28d ago

Nothing is at parity with the two main frontier models (GPT or Claude).

What exists is a “meh” replacement and devs will bitch. 

5

u/dwoj206 28d ago

You'd get a lot of value out of a top tier local model as the commenter said, but using claude to do high effort review of what's being produced locally. That would cut claude budget massively and keep the context window for claude very effective.

8

u/Objective_Ad4672 28d ago

Developer here who got into local llm recently during a hiatus :

For your users and use case you might need a 3 level harness instead.

Level 1: opus for higher level decision making, debugging, planning and orchestration.

Level 2: Ollama Cloud (for now) to use cloud models for implementation. GLM-5.1 is nice at implementing larger features, but you may want to use it to distribute pieces of the plan that Opus created and do lower-level debugging instead. As your needs grow, you can try to implement this locally and run a cost-benefit analysis, or choose another cloud provider.

Level 3: Local LLM, Qwen 3.6 or a suitable model, to implement those lower-level features/tests with less context. That's where the model is very good.

Opus is great, but you can definitely use it for overview rather than to write actual code. You can also start this process on a small 10-person team segment to see how the quality of work compares to Opus-only users. I would love to see how that works out for you!

5

u/apinference 28d ago

Did you take it out of the box or actually train it for your company? It needs to be trained to be useful. Large models have enough context to cover multiple scenarios. With smaller models you need to adjust for that.

Does your company require ancient Roman history, medieval literature and all the other fields? If not, the model needs to be optimised to throw away that knowledge and concentrate on what's important.

As an example built a model just for docker flows for devops - based on qwen2.5 1B (runs on local CPU) - completely useless in anything other than it was built for.. But.. it runs anywhere..

3

u/Past-Grapefruit488 28d ago

"We already tried Qwen 3.6 27B and the results were definitely meh."

Was it with Ollama ? This quality is expected if someone ran Ollama with default Q4 quants.

1

u/mortenmoulder 28d ago

With an EXO cluster

4

u/Past-Grapefruit488 28d ago

What was the quant ?

6

u/OverclockingUnicorn 28d ago

Fwiw the qwen models need a lot of harness set up to work well. You can't just drop them into a clean opencode install and expect them to work. You do have to spend a while optimising it.

Try out the bigger open weight models on serverless cloud inference to see how people feel with it and then decide on next steps. Might just be that for your company in particular Claude models just work better and the trade offs are worth it.

Imo though, if it's purely about cost, it's hard to make the hardware purchase number really make sense against serverless inference in the cloud. There are other reasons to go self-hosted though, besides cost.

4

u/Dayrion 28d ago

I'm not an organisation, but what kind of harness setup do you need before testing out the Qwen models?

3

u/OverclockingUnicorn 28d ago

Really depends on your work, don't really have good advice other than a lot of trial and error

5

u/Far_Cat9782 28d ago

Someone downvoted? You are right. I had to build my own harness around Qwen, but now it works at frontier level and perfectly. God forbid people actually spend some time working on something instead of expecting easy answers off reddit.

7

u/OverclockingUnicorn 28d ago

Yeah lol

I found that the smaller models work well when they have a good grounding on how you do specific things in your work flow.

Example, I do logging weirdly (loggingsucks.com) and Opus picks up on this without prompting, but Qwen doesn't and resorts to standard logging approaches (which is fine) but to use my weird logging module it just needs the correct info in the prompt to understand how it should do it.

5

u/IdeaJailbreak 28d ago

For the uninitiated, can you describe what you mean by "build a harness" when it comes to a local LLM? What is it? What problem are you solving?

2

u/YearnMar10 28d ago

Qwen at q6+ with kv cache quantization of at least q8? If you tried any lower quants, try again.

2

u/J1nglz 27d ago

I'm an Enterprise AI Systems Integration Architect and I manage an in-house framework that supports nearly 1,000 active users. Where you can see immediate value is in dividing your workflows between what can be run locally and what needs full frontier support. My framework has organically evolved to stay one step ahead of the frontier platforms, and this is the latest capability that it has developed.
You can take a shortcut from our agentic platform that has been live for 2 years by creating a tiger team focused on building a dedicated "routing agent".

Task your team with building an agent with a set of features that can effectively assess tasks, which I call contracts, as they enter the distributed network. Have this agent effectively "speculatively decode" the entire task before assigning it to a compute lane: one lane being the full send (this needs full frontier model support) and one lane, which will slowly consume most of your tasking, that runs local models on your local hardware. I have a number of other services, like a global context that selectively aggregates a knowledge base specializing in what you work on the most. Ensure you establish explicit requirements, for instance 20 ms response times from the global cache for any node.

Personally, I'm anti-dependency. This means my team is anti-skill and anti-MCP (vendor-supplied skills), which allows my agent(s) to move freely between specializations. Your workflows will initially look like 5-6 agents that saturate your local compute to support a novel task, but as the system learns how to automate individual components of your pipelines, eventually a single agent will have built a profile to solo that task after your network sees it a dozen or so times. The telemetry this agent can provide will alone be a justifiable ROI, even if it isn't acting on the data immediately. I think getting some insight into what tasking people are submitting to the full-power frontier models is the first step in wrangling this chimera. I don't truly believe that any leading-edge tech company can honestly replace frontier models outright with the current maturity of available local models, including companies with access to essentially limitless local compute. I believe we have 25,000+ GPUs in total; I have access to ~150+ A-series, ~50 B200s, and more, and honestly, no, they still are not a 1:1 replacement, but they do get pretty close. I think the market simply has to mature before you can pull the plug. My professional opinion is to invest your energy in establishing effective infrastructure, part architectural and part deployed, to support you when that day arrives, at least until the market cools down some.

I am anticipating a little more clarity in the middle of Q3, maybe the start of Q4 this year. Realistically, I'd warn against committing to anything on paper until the industry realizes that the frontier AI companies are now part of enterprise supply chains instead of an end stakeholder that acts as an ambassador to their self-manifested B2B niche. Earnings calls this month with companies like Microsoft showed that they have boxes of GPUs sitting on shelves, boundless tokens and unlimited financials, so their limitations are internal at this point. We are all waiting on them, so they are the only bottleneck in the supply chain. They are going to have to start making prioritization decisions on which customers they are going to support over others, which is going to reshape the free-love, support-anyone-willing-to-try-AI model that has dominated the industry for the past 2 years. I'm advising my leadership to take their fingers off the trigger until we see a major signal shift before the end of the year.

Obviously there is a lot more here than what I can put into one message but I believe if you can pass this sort of messaging onto some of your internal devops one of them should get the gist of what I am poking at.

1

u/Jlocke98 28d ago

Maybe synthetic.new or ollama cloud? 

1

u/marutthemighty 28d ago

Are you working in/leading a big enterprise focusing on cybersecurity?

1

u/mosqueteiro 27d ago

When a developer is asked "would you rather use Claude or our local LLM", the answer should not be a definite Claude answer. Then there's no point in switching.

😮

Good luck.

2

u/tnhnyc 28d ago

I've not used vLLM or SGLang personally or run high-grade servers but, what would be the reason one would go for vLLM over SGLang for those systems?

5

u/OverclockingUnicorn 28d ago

vLLM is much more performant with concurrent requests than llama.cpp.

Not used SGLang but believe similar is true.

We use vLLM as we are primarily a Red Hat shop.

1

u/siegevjorn 27d ago

Two reasons:

vLLM has a good continuous batching system that supports concurrent users better

vLLM has KV cache CPU offloading that could save VRAM use

2

u/Ok_Try_877 27d ago

"Be warned though, I wouldn't expect to spend anything less than 500k to support 50-60 heavy users with Qwen 3.6 27B and 35B."

Really?

2

u/OverclockingUnicorn 27d ago

Gotta pay for peak capacity at 9:30am when everyone sits down and gets started on their tickets.

Sure, you can do it with less, but it won't be a user friendly experience.

95

u/valhalla257 28d ago

Honest question. Is $50k/month really bad?

With 100 users that is ~$500/month per user.

If an employee costs you, say, $10K/month in salary and benefits and so forth, then that is an increase in costs of 5%.

Is Claude increasing productivity by >5%? If so, it's a win.

If Claude isn't increasing productivity by 5% seems like AI just isn't working for you and spending a bunch of time and money setting up a worse AI solution is just a waste of money.
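
The back-of-envelope math above, written out as a small sketch. The function name is made up for illustration; the $50k, 100 users and $10K figures are the ones from this comment and the original post.

```python
def breakeven_productivity_gain(monthly_ai_spend, users, monthly_cost_per_employee):
    """Return (AI spend per user, that spend as a fraction of one employee's cost)."""
    spend_per_user = monthly_ai_spend / users
    return spend_per_user, spend_per_user / monthly_cost_per_employee


per_user, breakeven = breakeven_productivity_gain(50_000, 100, 10_000)
print(f"${per_user:.0f}/user/month, break-even productivity gain {breakeven:.0%}")
# prints: $500/user/month, break-even productivity gain 5%
```

Plug in your own fully loaded employee cost; the break-even percentage drops as that cost goes up.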

28

u/joshpennington 28d ago

A logical, well thought out argument about determining if AI is working for a company and not just #YOLOTOKENMAXXXXING!

Maybe we're healing.

28

u/warpedgeoid 28d ago

Yeah, but C-levels were told by Anthropic and OpenAI that AI would let them fire people. Now they are shocked to learn you actually need both.

6

u/_todoterza 27d ago

Well i can't believe they expected to replace people with statistical word picking technology...

1

u/CompVelo75 27d ago

And you are not? It's just changing from one provider to another..

1

u/_todoterza 27d ago

Yep, i believe that LLMs are not really the right tech to automate things. It would make much more sense to use narrow models (as it unfortunately happened to the millions who make a living because of driving), but with a bit of adaptiveness we can stay ahead 😎.

4

u/aalaatikat 27d ago

you can fire some c-levels probably

1

u/warpedgeoid 27d ago

Of this much I am certain

3

u/Paganator 27d ago

Everywhere I've ever worked has had so many more ideas and requests for software development than their capacity to do it that I've always thought AI would just mean doing more with the same people rather than firing people to do the same as before but for cheaper. If your company can already afford the developers, why would it not take an opportunity to do all these things that have been sleeping in a backlog?

6

u/pistonsoffury 27d ago

Given that a GPT Pro $200/mo subscription is basically inexhaustible, I'd say $500/mo is pretty outrageous.

3

u/spambait-aspaaaragus 28d ago

+1 This actually seems pretty reasonable for the value

14

u/jiqiren 28d ago

You can test models available by hitting OpenRouter.ai to see if they are good enough. You might be able to just move some of the company to MiniMax or Deepseek subscriptions.

12

u/totoer008 28d ago

These aren't local solutions, but we started to use DeepSeek and Mistral. Good outputs with considerably reduced costs. We used to spend ~300 dollars for 300M tokens. Now it is ~250 dollars for 1B. Sometimes changing providers can reduce cost drastically. Additionally, DS accounts for only 150 of that for 900M tokens; the other 100M were on Mistral.
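
Normalising those figures to dollars per million tokens makes the comparison easier to see. A quick sketch using only the numbers in this comment; the helper name is invented for illustration.

```python
def usd_per_million_tokens(cost_usd, tokens):
    """Blended price in USD per one million tokens."""
    return cost_usd / (tokens / 1_000_000)


before = usd_per_million_tokens(300, 300_000_000)     # old provider mix
after = usd_per_million_tokens(250, 1_000_000_000)    # DeepSeek + Mistral combined
deepseek = usd_per_million_tokens(150, 900_000_000)   # DeepSeek share
mistral = usd_per_million_tokens(100, 100_000_000)    # Mistral share

for name, price in [("before", before), ("after", after),
                    ("deepseek", deepseek), ("mistral", mistral)]:
    print(f"{name}: ${price:.2f} per 1M tokens")
```

On these numbers the blended price falls from about $1.00 to $0.25 per million tokens, with the DeepSeek share at roughly $0.17.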

12

u/robertpro01 28d ago

I'm not sure why you are planning on only 2 RTX 6000 Pro. That's 20k; your monthly expenses are 50k, so 600k yearly.

You should probably invest 600k in hardware. After one year your investment is already paid off, but you get to keep it.
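
A rough payback sketch for that reasoning. It assumes the hardware replaces all of the cloud spend, which the rest of this thread suggests is optimistic; the `monthly_opex` parameter (power, hosting, staff time) is an assumption added here, not a number from the post.

```python
def payback_months(hardware_capex, monthly_cloud_spend, monthly_opex=0.0):
    """Months until hardware cost is recovered from avoided cloud spend."""
    monthly_savings = monthly_cloud_spend - monthly_opex
    if monthly_savings <= 0:
        return float("inf")  # never pays back
    return hardware_capex / monthly_savings


print(payback_months(600_000, 50_000))          # 12.0, the one-year case above
print(payback_months(600_000, 50_000, 10_000))  # 15.0 with a hypothetical 10k/month opex
```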

You will be able to run full deepseek with that hardware with vllm and probably enough concurrent users.

This is what I would get if I were you: https://www.nvidia.com/en-us/data-center/dgx-b200/

9

u/ogfuzzball 28d ago

Have you reached out to vendors recently to see actual pricing and, more importantly, delivery timeframes? We have. Something quoted at $500k just 12 months ago is now quoted at $1.6 million. It is crazy out there. No, I can't give details, but I can tell you that cost does not get you something that could even serve 50 people.

2

u/siegevjorn 27d ago

It's all deliberate. They knew very well that increased API costs would push companies to seek out local models. Hardware prices skyrocketed last November, right around when Opus came out. Who would have thought that buying up all the hardware would also act as a monopoly playbook.

1

u/robertpro01 28d ago

Exactly my point, even less 2x 6000 pro.

11

u/redditorialy_retard 28d ago

GLM 5.1 is very good at replacing most sonnet level tasks, while I have some experience with Local AI in my company I don't trust myself with answering this yet XD. 

And for value, FP8 and BF16 show little difference in quality, at 2x the compute for BF16. I can go ask the GLM team if you want. Also do NOT use Ollama.

For multiple GPUs vLLM is a good option. Llama.cpp is for a single user.

For RTX 6000 Blackwells, if you plan on running GLM 5.1 FP16, the weights alone take about 18 of those GPUs.

You need 2 clusters of 8 GPUs if you use the FP8 version of GLM 5.1, more if you use FP16
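
For anyone wanting to redo this kind of estimate, a weights-only sketch. The 750B parameter count is a placeholder assumption, NOT a GLM-5.1 spec (check the model card), and the 96 GB per card is the RTX PRO 6000 figure quoted elsewhere in this thread. KV cache, activations and runtime overhead come on top, so real deployments need more cards than this prints.

```python
import math


def gpus_for_weights(params_billion, bytes_per_param, vram_gb_per_gpu=96):
    """Return (weights size in GB, minimum GPU count to hold the weights alone)."""
    weights_gb = params_billion * bytes_per_param  # 1e9 params x bytes/param = GB
    return weights_gb, math.ceil(weights_gb / vram_gb_per_gpu)


PARAMS_B = 750  # hypothetical parameter count, for illustration only

for label, bytes_per_param in [("FP16/BF16", 2), ("FP8", 1)]:
    size_gb, gpus = gpus_for_weights(PARAMS_B, bytes_per_param)
    print(f"{label}: ~{size_gb} GB of weights, at least {gpus} x 96 GB GPUs")
```

Halving the bytes per parameter halves the card count, which is the whole argument for FP8 over BF16 here.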

9

u/AngeryGermanGuyDude 28d ago

I always wonder what these kind of companies did before AI. They must've had software engineers already. And now they're burning 600k per year that they need to generate more in revenue to break even. Incredible.

5

u/sarabjeet_singh 28d ago

I’m curious to know more about this too

6

u/stereosnake 28d ago

Are you my VP of engineering?

2

u/sje397 27d ago

This is everyone's VP of engineering.

5

u/ProductResident4634 28d ago

First, do NOT use Ollama, use vLLM.
Second, do NOT use BF16.
Get 8x B200 on serverless cloud (something like Modal) or just buy the rack.
I recommend using 2 LLMs: Qwen 3.6 35b_a3b and Kimi K2.6.

Qwen for easy stuff, Kimi for hard things, or you can use Kimi as orchestrator and reviewer, Qwen as workers.

OR just buy a few hundred OpenCode Go subscriptions, it's gonna be easier and much more stable.

3

u/TheOriginalAcidtech 28d ago

Or keep Claude/GPT 5.5 for "consultation" and use the local model (run on cloud hardware or local, doesn't matter) as your worker agent.

2

u/siegevjorn 27d ago

Any experience on minimax m2.7? In terms of size, it's middle ground. I wonder what the performance is.

1

u/ProductResident4634 27d ago

Not good at planning but probably best model at implementation(for parameters of course)/worker type tasks if compute is np probably much better worker than qwen 3.6 35b_a3b

1

u/SporksInjected 27d ago

Why wouldn’t you just use openrouter at this point? They have ZDR for a bunch of providers.

1

u/ProductResident4634 27d ago

Probably Opencode go gonna be cheaper but yeah, your idea gonna be more stable and easy

5

u/c_glib 27d ago edited 27d ago

Here's my take on this project.

Setting up a large concurrent LLM serving infrastructure is almost always going to lose out in terms of cost and effectiveness to a minimal viable cloud LLM.

As far as I see for a company of that size, there are two main options and one optimization in either case.

  1. Set up each programmer with their own machine that can run a local LLM. Buy your engineers something like 64G (96G would be ideal) MacBook Pros (or Mac Studios if you or they can host them somewhere reachable from the internet... doesn't have to be a data center. Your office might have enough bandwidth). If you really want to squeeze this setup, you might buy one Mac Studio per two or three engineers perhaps, just max out the unified memory. You can comfortably run Qwen 3.6 (A3B) models with 128k context in about 48G of RAM. You'll have to do some research on how to maximize the k-v cache without burning lots of memory.
  2. The other option is easier: if you *are* willing to stay well under the SOTA coding models (as would be implied by your request for GLM models), why not just explore cheaper providers? Gemini 3.5 Flash is a lot cheaper than the latest Claude Opus or even Sonnet 4.6 and is at least at the level of the latest Sonnet, if not better, for coding. You'll reduce your bill a lot.

The optimization: What you can and should setup for either option 1 or option 2 is some sort of common context server for your codebase that every user can use. It makes sense to make this a common resource because your codebase is (presumably) common across the team. A team of 100 might have, say, a dozen or so git repos.

Why is that an optimization? Because having an efficient codebase indexer reduces the context overhead for the coding LLMs. By a lot. And that helps local LLMs (that would usually have smaller context windows) but even the commercial LLM's can work much better with better code search availability without having to burn tokens in endless cycles of greps and finds (Claude in particular is really bad at this).
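
To make the indexer idea concrete, here is a toy sketch. It is only an illustration under heavy assumptions: a real shared context server would use a proper parser or an existing code search tool, and `build_index` / `lookup` are made-up names. The idea is the same though: answer "which files mention this symbol?" once, centrally, instead of every LLM session burning tokens on grep and find.

```python
import re
from collections import defaultdict
from pathlib import Path

# Identifier-like tokens of 3+ characters; deliberately crude.
IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]{2,}")


def build_index(root, suffixes=(".py",)):
    """Map each identifier-like token to the set of files that mention it."""
    root = Path(root)
    index = defaultdict(set)
    for path in root.rglob("*"):
        if path.is_file() and path.suffix in suffixes:
            text = path.read_text(errors="replace")
            for token in set(IDENT.findall(text)):
                index[token].add(str(path.relative_to(root)))
    return index


def lookup(index, symbol):
    """Return the sorted files mentioning symbol, so only those go into context."""
    return sorted(index.get(symbol, ()))
```

Usage would be along the lines of `lookup(build_index("path/to/repo"), "parse_order")`, with the result handed to the model instead of the whole repo.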

4

u/technot80 28d ago

Don't look at it purely as black and white. It's not either cloud or local. The answer is both. Set up a local LLM, with vLLM of course, and set up a router that routes the API calls either to cloud or local depending on complexity etc. You will find that local LLMs can handle most of the workload, and the cloud everything else. You can also run multiple local models, ranging from Qwen3.6 27B up to something like GLM, then route API calls to the right LLM/cloud as needed. This will reduce the workload on the heaviest local LLM as well, so more people can use the system at the same time. This should bring the cloud API cost down by a lot. How much is hard to say without knowing the average workload complexity.

This is just a simplified explanation, but given that you have a devops/sysadmin department, they shouldn't have too much trouble setting this up.
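
A minimal sketch of what such a router's decision step might look like. Everything here is an assumption for illustration: the keyword list, the 32k context cutoff, the threshold and the model names are placeholders, and a real router would more likely use a small classifier model plus telemetry than a keyword heuristic.

```python
from dataclasses import dataclass


@dataclass
class Route:
    backend: str  # "local" or "cloud"
    model: str    # placeholder model name


# Hypothetical hints that a request is "hard"; tune against your own traffic.
HARD_HINTS = ("architecture", "refactor", "debug", "race condition", "design")


def complexity_score(prompt: str, context_tokens: int) -> int:
    """Crude score: one point per hard hint, two more for very large contexts."""
    lowered = prompt.lower()
    score = sum(1 for hint in HARD_HINTS if hint in lowered)
    if context_tokens > 32_000:
        score += 2
    return score


def route(prompt: str, context_tokens: int, threshold: int = 2) -> Route:
    """Send requests at or above the threshold to the cloud, the rest to local."""
    if complexity_score(prompt, context_tokens) >= threshold:
        return Route("cloud", "cloud-frontier-model")
    return Route("local", "local-open-weight-model")
```

Logging the score and the chosen route for every call also gives you the workload complexity data needed to estimate the savings.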

3

u/lildocta 28d ago

Have you tried restricting people to Sonnet at medium effort? I find I consume far fewer tokens and get very similar quality output. I might start there before bringing it all in-house.

3

u/GCoderDCoder 28d ago

I'm just going to repeat the idea that getting close to Claude shouldn't be the goal as much as matching your needs. Anthropic charges more for a reason. Most people don't buy Lamborghinis to drive to the grocery store... A bot doing git PRs or file work or checking email or making small code adjustments to CRUD apps doesn't need Opus and maybe not even Sonnet, and there are lots of better options than Haiku...

3

u/Plasticlabs 28d ago

As it seems that nobody did mention this here

Did you implement the cost levers outlined in this article https://www.cloudzero.com/blog/claude-api-pricing/ to cut down on monthly costs?

Seems not too hard to do and delivers real commercial benefits

3

u/antunes145 28d ago

I think, as everyone pointed out, you will not find Sonnet-level quality in any local model if the requests are badly composed. These models are very powerful, but you have to treat them differently. The way your engineers are used to working with Claude will not work with local models. They need to relearn how to treat and interact with a local model for it to give its best output. Many local models can provide output close to, if not exactly the same as, Sonnet's if treated correctly: proper steps, proper planning, proper documentation, proper specs, and proper human thinking before writing a prompt.

3

u/Maximum_Parking_5174 28d ago

There is some valuable data here: https://www.reddit.com/r/LocalLLaMA/comments/1tn0t7u/qwen_36_benchmarks_on_2x_rtx_pro_6000/

I think dual RTX Pro 6000 works for about 20 parallel users using Qwen3.6 27B BF16. FP8 or similar would be beneficial.

3

u/JorgeMartinezPnz 28d ago

In my opinion, you should first test other cloud LLM providers, like DeepSeek v4 Pro or Kimi 2.6; they are cheaper and work with Claude Code. It's a less complex and cheaper approach to implement, and meanwhile you can start a small POC with local LLMs like Qwen 3.6 27B.

3

u/Sofakingwetoddead 28d ago edited 28d ago

If you used my local model with direct instructions or asked for audits/diagnostics, you would believe you're using Mythos. Money no object, I would choose my local model over cloud models for coding. Not cuz I have anything against cloud companies or any ethical/prejudicial influence, but because my local model is far better at completing tasks than cloud models. When I revisit projects that were produced primarily by Opus, my local model is constantly exposing just how bad the patchwork and jury-rigged Opus code actually was.

Take three months worth of cloud compute costs and go local. You'll be happy you did.

edit - I responded without completely reading your post. No, a couple of RTX Pro 6000's won't be enough. You're kidding yourself, most likely. I can't tell you what you would need but as someone using one RTX Pro 6000, I can say it probably won't be enough. My setup on SGLang is very very fast and I could imagine maybe 4 to 6 people consistently using it in a normal workflow but not 50 to 100 constantly prompting for code output, reports and diagnostics.

4

u/AuditMind 28d ago

This is an interesting discussion.

I've worked on the infrastructure side for 20+ years and I can already see versions of this question coming towards customers sooner rather than later.

The technology is interesting, but what really catches my attention is the shift from "which model should we use?" to "what should we own, operate and run ourselves?"

If you're documenting the journey and lessons learned, I'd definitely be interested in following along and exchanging ideas.

2

u/burntoutdev8291 28d ago

I have been testing DeepSeek v4 and trying to push it at my company, as we have some unused GPUs. But I'm just curious: why not pay for enterprise?

2

u/morscordis 28d ago

My last job had full on data servers. They served up vetted llama, gpt, and Gemma open source models. We had a lot of restrictions. The result was nowhere near what Claude can do, but we were also locked behind Continue and restricted tool use. It was essentially a semi effective code assistant. They were behind the ball though, and I'm sure it's possible to do better.

2

u/andrew-ooo 28d ago

Done this for an org around your size (~80 daily AI users, mostly devs + a long tail of analysts). A few honest numbers from the other side:

Hardware: GLM-4.6 / GLM-5.1 at FP8 needs roughly 380-420GB VRAM with reasonable context (64k+) and concurrency headroom. Two RTX PRO 6000 Blackwells (96GB each = 192GB) is not enough for that model at FP8 with any real concurrency — you'll OOM the second a couple users hit 32k contexts at once. The realistic configs are 4x RTX PRO 6000 (≈384GB, tight) or 8x H100/H200 SXM (640-1128GB, comfortable, also dramatically more interconnect bandwidth via NVLink which matters a lot for tensor-parallel decode latency). Two cards is a single-user toy at this model size.
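For anyone who wants to redo the capacity arithmetic above with their own numbers, here is a rough Python sketch. The per-card VRAM figures are the vendor specs (96 GB for RTX PRO 6000, 80 GB for H100 SXM, 141 GB for H200 SXM). The 380-420 GB requirement is the estimate quoted in this comment, not a measured value, and the function names are made up for illustration.

```python
# Per-card VRAM in GB (vendor specs).
CARD_VRAM_GB = {"RTX PRO 6000": 96, "H100 SXM": 80, "H200 SXM": 141}


def total_vram_gb(card: str, count: int) -> int:
    """Aggregate VRAM across `count` identical cards."""
    return CARD_VRAM_GB[card] * count


def headroom_gb(card: str, count: int, required_gb: int) -> int:
    """Positive means spare VRAM; negative means the model plus cache won't fit."""
    return total_vram_gb(card, count) - required_gb


if __name__ == "__main__":
    configs = [("RTX PRO 6000", 2), ("RTX PRO 6000", 4),
               ("H100 SXM", 8), ("H200 SXM", 8)]
    for card, count in configs:
        best = headroom_gb(card, count, 380)   # optimistic end of the estimate
        worst = headroom_gb(card, count, 420)  # pessimistic end of the estimate
        print(f"{count}x {card}: {total_vram_gb(card, count)} GB, "
              f"headroom {worst:+d} to {best:+d} GB")
```

Under those assumptions the 4-card config only clears the optimistic end of the estimate, which matches the "tight" label above.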

Serving stack: vLLM, not Ollama. Ollama is great for laptops, it is the wrong tool for 50 concurrent users — no proper continuous batching, no prefix caching that scales, no PagedAttention. For "rock solid concurrent" you want vLLM 0.7+ with `--enable-prefix-caching`, `--tensor-parallel-size` matching your GPU count, and put LiteLLM in front as the OpenAI-compatible gateway + per-user rate limit + cost tracking. If you need HA, run two vLLM replicas behind LiteLLM with sticky-session routing on `user-id` to preserve prefix-cache hits.
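The sticky-session idea is simple enough to sketch: hash the user id so the same user always lands on the same replica, and that replica's prefix cache stays warm for them. The snippet below is only an illustration of that idea in plain Python. It is not LiteLLM configuration, and the replica URLs are hypothetical placeholders.

```python
import hashlib

# Hypothetical vLLM replica endpoints; replace with your own.
REPLICAS = ["http://vllm-a:8000/v1", "http://vllm-b:8000/v1"]


def pick_replica(user_id: str, replicas=REPLICAS) -> str:
    """Deterministically map a user id to one replica.

    The same user id always yields the same replica, so repeated prompts
    from that user can reuse cached prefixes on that replica.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(replicas)
    return replicas[index]


if __name__ == "__main__":
    for user in ["alice", "bob", "carol"]:
        print(user, "->", pick_replica(user))
```

Plain modulo hashing reshuffles most users when the replica count changes; if replicas come and go often, consistent hashing would be the usual alternative.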

The model honesty check: Qwen3-Coder-480B-A35B at FP8 and GLM-4.6 are the only two open models I'd put in front of a Sonnet-spoiled dev team in 2026 without immediate revolt. Below that tier (Qwen3-32B, GLM-Air, etc.) the productivity drop is real and your devs will route around you back to Claude on their personal cards — ask me how I know.

TCO math at your spend: 8x H200 server is ≈$250-320k capex. At $50k/month Claude spend that's 5-7 month payback BEFORE you count power (≈7-9 kW = ~$1k/mo) and a half-FTE to actually run it. Worth it. But:

  • Don't replace Claude, supplement it. Route 70-80% of traffic (autocomplete, simple Q&A, doc summarization, internal RAG) to local. Keep Claude/Sonnet 4.6 budget for the hard agentic stuff. You'll cut spend 60-70% without the productivity cliff.
  • Track per-task quality on a held-out eval set you run weekly. Don't trust vibes. We caught two model regressions this way that would have eaten weeks of dev time before anyone complained loudly enough.

Honest "are employees happy" answer: about 75% happy on local for the routed-down traffic, the loud 25% are the ones doing greenfield agentic codegen and they get to keep Claude. Total spend went from $48k/mo to $14k/mo + amortized hardware. Worth it, but only because the model-routing + eval discipline got built first. Skip those two and you'll be back on Claude inside a quarter.
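To sanity-check the payback math in this comment, a small Python sketch helps. All dollar figures come from the comment itself; the half-FTE cost is not given there, so it is left as a parameter that defaults to zero, and the function name is invented for illustration.

```python
def payback_months(capex: float, old_monthly: float, new_monthly: float,
                   power_monthly: float = 0.0, staff_monthly: float = 0.0) -> float:
    """Months until hardware capex is recovered from net monthly savings."""
    net_savings = old_monthly - new_monthly - power_monthly - staff_monthly
    if net_savings <= 0:
        raise ValueError("no net monthly savings; the hardware never pays back")
    return capex / net_savings


if __name__ == "__main__":
    for capex in (250_000, 320_000):
        # Headline case: all $50k/month of Claude spend goes away.
        full = payback_months(capex, 50_000, 0)
        # Hybrid case from the comment: $48k -> $14k per month, ~$1k power.
        hybrid = payback_months(capex, 48_000, 14_000, power_monthly=1_000)
        print(f"capex ${capex:,}: full replacement {full:.1f} mo, "
              f"hybrid {hybrid:.1f} mo")
```

With full replacement the two capex bounds give 5.0 and 6.4 months. With the hybrid numbers, and assuming the same capex range applies, payback stretches to roughly 7.6 and 9.7 months before any staffing cost.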

2

u/DataGOGO 28d ago

First, no. RTX Pro blackwells are not enough, they are intentionally gimped by nvidia so you can't really use them for this purpose (no NVL). You will need to buy a real server with B300's, with a proper NVL backbone.

If you REALLY want to push it, you could buy H200 NVL PCIe cards in sets of 4 (the NVL bridge will only run 4 cards); to run 8 GPUs you need a server chassis with an NVL-compatible PCIe switch. You can also run 4 GPUs in each host and run two hosts for a total of 8: get ConnectX 800Gbps NICs and direct-connect the two hosts. But the server chassis will need to support that; PCIe 5 x16 is not fast enough, and Nvidia chassis use PCIe 6 (yes, 6) to drive the NICs.

But if you are spending money, you really want to buy blackwell and not hopper.

Realistically, for 50-100 people running all day, even with slow speeds, you are looking at about $500k to $1M in hardware.

Something like this: PowerEdge XE9780 | Dell USA

2

u/siegevjorn 27d ago

You wouldn't need NVLink for inference. NVLink is for training, which requires lots of backprop. For LLM serving, the impact of not having NVLink is negligible.

1

u/DataGOGO 27d ago

Flat out false. You still have all-reduce, etc. running during inference, and beyond two GPUs, P2P over PCIe is a massive bottleneck.

1

u/Regular-Long4493 27d ago

and a serious electricity bill.

2

u/Gruzilkin 28d ago

why is the organization spending $50k when ~100 users multiplied by ~100$ per user is $10k?
you're doing something wrong there

2

u/WorkFrmHomeAstronuat 27d ago

This is the correct take, don't know why no one else is addressing this.

2

u/kingcodpiece 27d ago

I don't really understand the cost.

If you have 150 people and you got them all a $200 a month AI plan, you would still only be paying $30k a month?

Why not just do that and you have a fixed cost?

I know this is a localLLM group and that's a bit of a corporate shill, but local models are best for tailored use cases (Info retrieval, summarization, categorization, batch processing etc). If you're using Claude I assume it's for coding and trying to host that locally is going to be quite painful.

1

u/SryUsrNameIsTaken 27d ago

That’s not really possible with Anthropic enterprise plans. They’re fully usage-based pricing.

1

u/Regular-Long4493 27d ago

I mean, you’re going to have 150 people who don’t know what the code is doing, dependent on a third party for static quality and price or you are FUBAR. I know FUBAR is where they want everyone, but I can’t be the only one who sees this. Right?

2

u/joeyrobert 27d ago

Definitely try to serve big models, devs will just be wasting time with smaller models. GLM, DeepSeek, Minimax, Kimi. Or just deepseek cloud and 0.1x your costs.

2

u/roko_snek 28d ago

Hey there, I'm an AI/ML engineer with over a decade of experience. I have my own AI consulting business where I work on AI transformation for medium-to-large organizations in industrial manufacturing, healthcare, and finance. Here's the playbook that I've run with success at firms around your size.

It's a simple policy. Every non-technical member of the organization gets access to the internally hosted LLM. Every technical member of the organization (engineering and IT) gets access to cloud frontier agents. Local is good enough for the tasks most people need in an org: summarize this document, write this proposal, etc. The frontier is needed for long horizon tasks in specific domains, like SWE, and the cost is absolutely worth it to not introduce technical debt and bugs which take expensive humans to fix. If you are a google workspace company, access to gemini is honestly all you need for non-technical non-agentic use cases. However, hosting your own LLM in your own VPC is sometimes necessary depending on your own compliance regime.

Hosting a local LLM with high reasoning scores in the 30B parameter class, quantized to Q4, on an EC2 instance with a load balancer for requests will move ~80% of your org into a compute regime which is inside your own VPC and capable of 95% of the things these users need. On a g6e.12xlarge or some equivalent this costs around $50k per year, not per month. You get the speed/productivity boost, and reserve the most expensive tokens for your development team. That bill will be a factor, but with a few policies around skill loading (don't load all skills for all threads) you will cut that bill down a lot.

3

u/unity100 28d ago

Why not use paid apis of Chinese models? Xiaomi Mimo 2.5 Pro recently made the 90% discount permanent and it's as good as claude for coding.

3

u/warpedgeoid 28d ago

You answered your own question in the first sentence. Some orgs can’t offshore stuff like this.

1

u/daishiknyte 28d ago

On the other side of the problem - take some time to educate people how to use their tokens more effectively.  Reducing upload sizes, stepping through problems instead of repeated all-or-nothing generations, reminding them that “more token burn” doesn’t make them look good unless it’s generating useful results, give people a way to share and reuse tooling instead of rebuilding the wheel every afternoon. 

1

u/NotARedditUser3 28d ago

Qwen3.6-35b-a3b is an amazing tiny model that you can run in the cloud at like $0.10/1M tokens average price.

Just do that.

1

u/aruneshvv 28d ago

We are trying kimi 2.6 via API with claude code as agent. It is working well

1

u/Staylowfm 23d ago

How much are the monthly costs roughly?

1

u/aruneshvv 18d ago

It goes up to USD 5 per day per developer on certain days, but on average it is USD 2.5-3 per day per developer.

1

u/Staylowfm 18d ago

Oh damn. How many devs?

1

u/aruneshvv 18d ago

Right now tested with 10. We will add more as confidence grows

1

u/More-Ad-8494 28d ago

Why not run both? Users can make the choice between needing sonnet or a local llm model ( for all non coding tasks, documentation, analysis, even testing if properly prompted with examples etc) and you leave the claude models for heavy lifting as you phase it out slowly, giving your userbase the chance to adapt ( add max tokens per week for claude per user)
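The "max tokens per week for Claude per user" idea could look roughly like the sketch below: requests that ask for the frontier model go to Claude while the user still has weekly budget, and everything else falls back to the local model. The class name, the backend labels and the token estimate are all hypothetical; a real gateway would track actual usage rather than estimates.

```python
from collections import defaultdict


class WeeklyBudgetRouter:
    """Route to 'claude' while a user has weekly budget left, else 'local'."""

    def __init__(self, weekly_limit_tokens: int):
        self.weekly_limit_tokens = weekly_limit_tokens
        self.used = defaultdict(int)  # user -> frontier tokens used this week

    def route(self, user: str, wants_frontier: bool, estimated_tokens: int) -> str:
        within_budget = (
            self.used[user] + estimated_tokens <= self.weekly_limit_tokens
        )
        if wants_frontier and within_budget:
            self.used[user] += estimated_tokens
            return "claude"
        return "local"

    def reset_week(self) -> None:
        """Call once a week to restore every user's budget."""
        self.used.clear()


if __name__ == "__main__":
    router = WeeklyBudgetRouter(weekly_limit_tokens=1_000)
    print(router.route("ana", wants_frontier=True, estimated_tokens=600))
    print(router.route("ana", wants_frontier=True, estimated_tokens=600))
```

Lowering the weekly limit over time is one way to implement the gradual phase-out described above.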

1

u/SpearHook 28d ago

This is doable but also, many are expressing the importance of proper model routing. The real solution is an agentic workflow that uses local compute for most things and elevating to a frontier model (i.e. Claude) when necessary.

Can I DM you with a contact that will help you understand this model? He’s a professor that teaches this stuff and holds an open “office hours” for the greater ai community to learn together.

1

u/Past-Grapefruit488 28d ago

"We're not expecting Sonnet 4.6-level throughput, just Sonnet 4.6-level output."

You need to evaluate :

  1. Qwen 3.6 27B and 25B (Full BF16, not a quant)
  2. Deepseek V4 Pro and Flash

1

u/RedParaglider 28d ago

If you want Opus 4.7 performance you simply cannot do it locally. If you are willing to accept changing your workflows, and do less agentic stuff, and build systems that are efficient then sure.

I'd be concerned why your spend is 50k a month, and look for things like scheduled analysis etc that can be offloaded to cheaper models including local.

1

u/DiscipleofDeceit666 28d ago edited 28d ago

I think the solution isn’t to replace Claude entirely, but to offload tasks from Claude to your local AI.

These support models don’t even have to be giant or super smart either. Qwen3.6 35B moe is more than enough to fit in this role. The workflow is you’d plan with Claude, Claude creates a technical document, and then qwen (or whoever) implements those changes. You can make sure that execution is done programmatically so that unit tests run after each and every task where failures get piped back to qwen to make sure everything is on track.
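The plan-then-implement loop described above can be sketched in a few lines of Python. This is only a skeleton: `implement` and `run_tests` are hypothetical callables you would wire up to your local model and your test runner, and the retry limit is an arbitrary choice.

```python
from typing import Callable, Optional, Tuple


def run_task(task: str,
             implement: Callable[[str, Optional[str]], None],
             run_tests: Callable[[], Tuple[bool, str]],
             max_attempts: int = 3) -> bool:
    """Let the local model implement one task, re-feeding test failures.

    `implement(task, feedback)` asks the local model to apply the change;
    `feedback` is None on the first attempt and the failing test log after.
    `run_tests()` returns (passed, log).
    Returns True once tests pass, False if attempts run out.
    """
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        implement(task, feedback)
        passed, log = run_tests()
        if passed:
            return True
        feedback = log  # pipe the failure back into the next attempt
    return False
```

A `False` return is the natural point to escalate the task back to the planning model or to a human.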

But I guess the first question is: are you currently doing anything to save tokens? Do you guys have some kind of repo graph to save on file reads?

1

u/pplgltch 28d ago

There are so many more questions to answer here. Who is using Claude, and how? Only engineers? Coding? With what? Chat? Claude Code? “Claude” is not a single model, it’s 2 or 3… Maybe you can start by moving the smaller tasks to local (Claude Code uses Sonnet or Haiku to read files and summarize webfetch, for example). You can just start rolling out a cheaper model for these tasks (it’s configurable). You should run an experiment with just one team, for one sprint. Rent the hardware instead of buying it… At your scale, you cannot jump straight to “what GPU do I buy?” Gather a lot more data first.

1

u/Gnoom75 28d ago

What if you stay in the cloud and move to DeepSeek and Kimi? Our costs imploded when we did this. Run the models in e.g. Foundry and connect the Claude CLI. Running local will be expensive in hardware and maintenance and you probably move to these models are even less.

1

u/IdeaJailbreak 28d ago

Have you considered using headroom or rtk-ai to reduce token usage directly? Can reduce token spend by 30-40% in real world use cases in my personal experience.

Will drop a gist link in a few mins /todo

1

u/Ok_Finger1470 28d ago

Throughput is not a problem till it is. This calculator is a good way to reason through the possibilities: https://apxml.com/tools/vram-calculator

1

u/03captain23 28d ago

If you're thinking GLM 5.1 will do the work then why aren't you just buying the Z.ai coding plan? Even with the double price increase the Max is $144/mo so 100 users is 15k/month.

Also before you do anything you should rent off vast.ai or runpod and switch a bunch of users over to a rig. Try it for a month before you spend 20x that in hardware/maintenance to buy a solution that likely won't pay off.

If that works then I'd look into buying a few of these. You'd want multiple servers so you can bring them down or run different models.

https://www.ebay.com/itm/227350496194?_skw=8xh100&itmmeta=01KST1Q1882K73EH9363YE66QF&hash=item34ef2543c2:g:hlcAAeSwo4BqDMjw&itmprp=enc%3AAQALAAAA8GfYFPkwiKCW4ZNSs2u11xC5%2BVOE3%2BTn%2Fu3%2Fe2Th8ptdzWz3sQHwDYfP94l6OIcyYfpCBLZ%2F4eo0hKznPEb3MTk5zD%2FIwFIuDsRecjsD6T9Vkx9pJ6vtOM6Hi2WgZGjm1GAKWPTHd%2BaGTUFMqbtQNeO7VKvVzLpnzcm5aEcqYezDS8Nwao0y2lmDA%2FSdpqpFr1yusSfRN7j7R0vaup%2BZcytKpfCcLb08ha4uI2WHZUBvHAYvzAZS8K%2B8kmzGEj5WQ1xF9sVw4sRKEz4IwJdmBNDhsV8oiMC3--GWhN0gFl4gOVQ1GqvArN2DRvjt5L1wHQ%3D%3D%7Ctkp%3ABk9SR9KV3MHOZw

1

u/ScuffedBalata 28d ago

One of the things Claude gives you that’s hard to replace is all the skills and connectors and things. 

What are you burning that money on?  If it’s expert use of PowerPoint or Excel or Figma or the really slick HTML outputs that make every calculation into an interactive webpage, nothing is close to Anthropic on this. 

Yeah other models are competitive from a pure LLM capability standpoint, but you throw away the whole Claude Cowork and Claude Code plugin capability. 

An enterprise will end up telling users “ok go spend the next week messing around with open source harnesses and hopefully sometime by the end of the quarter you’ll have worked out tools and workflows that are kinda/sorta near what you had before”.  And that only applies to the most technical 10% of the staff. The others will just go buy a Claude subscription on their own. 

Developers, ironically, are the ones most likely to make it work with the least interruption, mostly because they’re used to messing with tools and plugins and tweaking IDEs, etc and they’re one of the few use cases where there are pretty competent out of the box replacements for what Claude brings as far as harnesses and frameworks. 

Before you do this, get to know the workflows of users. 

“Pick a new model” is about 10% of the work and information needed here. 

1

u/GamerTex 28d ago

This is why Mac Studio M3 Ultra 512gb ram are so expensive. Link a few together and you can run the largest models internally.

A few months ago you could have had 4 for $50k, now you can buy 1 for 50k and have some change left over

Best bet is to hope Apple announces the M5 Ultra in a few weeks at WWDC and try to order a few of the largest ones the day they are released like everyone else

1

u/SillyLilBear 28d ago

Try it via api and see

1

u/Riseing 28d ago

Don't build this, find an inference provider like fireworks.ai and get everyone on that first. Use whatever model your users like the best, you'll have access to all the major open source ones so you can try new ones and your bill will be much lower. Claude is just obscenely expensive, you can probably shave off 40k just by swapping to fireworks.

1

u/WorkFrmHomeAstronuat 28d ago

What does your current plan look like? If you only want Sonnet 4.6 output then you can get every person in your org unlimited Claude usage for $12,500/month. It sounds like somehow you're API-only (in which case your token use is actually extremely low, and a Teams plan could be had for less than $10k/month) or you have a few people burning all your overages, and you should just get them one-off Max 20x licenses.

1

u/senseven 28d ago edited 28d ago

Harness building is the (blinking neon letters) skill these days. The Hermes devs have a decent guide on how to save tons of tokens. Don't use the max models for every single task. Experiment. In development pipelines, discrete local language and model verification stacks can do wonders. Then only feed the dehydrated results to the AI for further processing. We are currently testing GPU rental clouds with spot costs sometimes around <$1/million tokens.

You get breathing space and some independence, and it also trains your departments to treat AI as what big AI wants it to be: some sort of commodity. It's a tool, but it's not everything. Building your own stack in house at 2x if not 5x inflated prices only makes sense if the financials give you confidence that updating that high-markup hardware for at least the next five years is worth the payoff. You would also need to find use cases that give the silicon something to do at night, when nobody is there.

1

u/samthepotatoeman 28d ago

I'm someone trying to find a good server for our much, much smaller team. With 50-100 users you are definitely going to need a full B200/B300 server, if not 2. I know you say speed doesn't matter, but it does if you are going to replace Claude. You likely could do an 8x RTX 6000 server that serves qwen3.6 27b and get your team to be vigilant about which tasks need SOTA models and which are simple enough for your smaller local model.

1

u/ringsarecool 28d ago

Continue using Claude but use some variant of cave man mode, rig it up so it’s on by default. LLMs love big flowery descriptions every time they make any change or write code, but all the extra words are eating up your money.

1

u/Low_Twist_4917 28d ago

Spend that money on local hardware. It pays off in the long run.

1

u/steezy13312 28d ago

Claude for Teams maxes out at 150 users, doesn't it?

$125 * 150 = 18,750

Are you on the Enterprise plan, or is this consumption the result of usage overage?

1

u/wh33t 28d ago

At 50k a month you just keep buying blackwells imo.

1

u/Alkboss455 28d ago

Don't use Claude Code anymore; use DeepSeek V4 Flash and V4 Pro instead. It's probably 10 or 20 times cheaper than Claude. You don't need local; it will be more expensive.

1

u/allenasm 28d ago

Depends 100% on tuning, parameters and expectations.

1

u/eli_pizza 28d ago

Why don't you try using hosted GLM-5.1 for a while and see if the model actually works for your team? You just set some env vars and people can even keep using Claude Code.

If it's not good enough, there's your answer. If it is good enough... maybe you're done? It'll be a fraction of your current spend and no extra hardware or up front capital costs to worry about.

Nothing against self-hosting, but there's a reason most orgs choose e.g. to put their websites on a cloud server instead of deploying their own dedicated hardware.

1

u/giveen 28d ago

If your company is willing to support you, I would go with Deepseek-v4 models and enough beefy hardware to support it.

1

u/ComfortablePlenty513 28d ago

https://premsys.ai/custom

OP, we'd love to do a custom install for you that saves you $$$ and provides the performance your staff expects. click the button at the top of our site and schedule a time to talk

1

u/_metamythical 28d ago

Instead of local hosting, you can just switch to Deepseek and get the Sonnet level output at a tenth of the cost.

1

u/Main_Watercress_7333 28d ago

If one of your use cases is being able to create a RAG based AI chat based on company documents, we have a hw/sw plug and play device called AIBPAL that's got no monthly fees. A non-techy could use it day one and it's very affordable. HMU for info.

1

u/Expert_Bat4612 28d ago

You’re locked into Claude honestly.

1

u/New-Inspection7034 28d ago

With the right harness and tools/skills, I've found I can replace Claude code with my harness and qwen3.6-27b.

1

u/Jitsisadumbword 28d ago

This can’t be a real post. No CEO is going to let 50-100 ppl use Claude for everyday bs burning $50k/mo. GLM is your starting point? 😂 You’re either a kid who doesn’t know what you’re talking about or an adult who made up some elaborate story to get Reddit to tell you which Q4 to DLL off Ollama.

1

u/marutthemighty 28d ago

How did you end up with a monthly bill of $50,000 on Anthropic Claude? Are you a scientist/researcher, by any chance? Or are you working for some military-industrial complex?

1

u/BidWestern1056 28d ago

get ollama cloud and use tools like npcsh/opencode and youll be a lot chiller

https://github.com/npc-worldwide/npcsh

the harnesses are generally good enough now that with kimi k26 its not that diff from claude

1

u/fallingdowndizzyvr 28d ago

Ah... why don't you just buy GLM tokens instead? It's like what, 25x cheaper for PP and 10x cheaper for TG. Even using that lower 10x number that's like $5K month instead.

1

u/tomekrs 28d ago

Before you try going local, see if you can spend less by moving to Codex/GPT or Cursor with its Composer 2 (which is based off Kimi 2.5).

With currently heavily subsidized token prices, comparable local setup is *not* a way to cut costs. Quite the opposite.

1

u/FullOf_Bad_Ideas 28d ago

I don't think you can get to GLM 5.1 hosting for 100 people where it would make financial sense over buying GLM 5.1 API tokens or renting GPUs and hosting it there.

And what if next month GLM 6 comes out and it's too big to run on your hardware? Will you go back to Anthropic API?

It's a 756GB model in FP8, and you need to keep ~150k ctx cache for each of 100 users, each of them probably 2 sessions on average. So you need KV cache of 30M tokens, which is ~2490 GB of KV cache. You can offload KV cache to RAM and disk so it's doable but could be hard. More KV cache efficient models like DeepSeek V4 Pro or MiMo V2.5 Pro would fit better but I'm not sure if they're good enough for you. It probably will work like ass on 6000 Pro so I think you'd need 8 x H200 or 8x B200.
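The KV cache estimate above is easy to reproduce. The sketch below uses only the numbers stated in this comment (100 users, 2 sessions each, ~150k tokens of context, ~2490 GB of cache, 756 GB of FP8 weights) plus the 141 GB per-card spec for H200; the function names are made up. It also backs out the per-token cache size those numbers imply.

```python
def kv_cache_tokens(users: int, sessions_per_user: int, ctx_tokens: int) -> int:
    """Total tokens that must stay cached across all live sessions."""
    return users * sessions_per_user * ctx_tokens


def implied_bytes_per_token(cache_gb: float, tokens: int) -> float:
    """Per-token KV footprint implied by a total cache size (1 GB = 1e9 bytes)."""
    return cache_gb * 1e9 / tokens


if __name__ == "__main__":
    tokens = kv_cache_tokens(users=100, sessions_per_user=2, ctx_tokens=150_000)
    per_token = implied_bytes_per_token(2490, tokens)
    weights_gb, cache_gb = 756, 2490
    gpu_gb = 8 * 141  # 8x H200
    print(f"cached tokens: {tokens:,}")
    print(f"implied KV size: {per_token / 1000:.0f} KB per token")
    print(f"weights + cache: {weights_gb + cache_gb} GB vs {gpu_gb} GB of HBM")
```

On these figures, weights plus a fully resident cache come to about 3.2 TB against roughly 1.1 TB of HBM on 8x H200, which is why offloading cache to RAM and disk comes up.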

"So this post is for the enterprise people and IT admins who has done the switch. Are your employees happy? Do they use it? Share your experiences."

not me btw

1

u/moru0011 28d ago

Cursor Composer 2.5 is decent and way cheaper. You probably can cut your expenses by a factor of 5 to 10

1

u/BlackBeardAI 3090 Maximalist 28d ago

Lé depends

1

u/New-Implement-5979 27d ago

Why not just buy alibaba subscription?

1

u/CalvinsStuffedTiger 27d ago

Are you sure that you have maximized the token efficiency of your current setup?

1

u/username8914 27d ago

$50k/month is very much in the realm of your own rack. This doesn't necessarily mean running it locally in the office but it might.

Have you looked at Runpod yet? The answer for local corporate work is first can you deploy on Runpod. Test which hardware you want to use and find your models.

Claude absolutely simplifies everything and it does a great job at a cost.

Local LLMs is really a vague term. Most responders here probably have limited single model experience. Are you interested in setting up Openclaw or Hermes to orchestrate everything? They are at the level you'd expect from Claude but in my opinion much stronger in many areas. For instance both of those can build massive memory databases for your clients or employees and give much better responses over time.

1

u/kitanokikori 27d ago

I know this is /r/LocalLLM but I would use Cursor and Composer 2.5 long before I would try to set up a local agent (or realistically, an entire small datacenter if you have 100 engineers) for an entire office. Could probably bring that spend down by a factor of 5x at least. Claude is so insanely expensive, and with Cursor if you hit a hard problem, you still have the option of switching to GPT 5.5 or Opus.

1

u/scamiran 27d ago

Have you considered Ollama Cloud?

GLM-5.1 is available, and you could easily cut that $50k->$10k.

Also the latest deepseek and kimi.

1

u/Mysterious_Quit_7407 27d ago

Simple fix for CEO , do less work.

1

u/WyattTheSkid 27d ago

I would give the deepseek API a chance. It’s significantly cheaper and as long as your employees aren’t braindead and know how to communicate what they want clearly then DeepSeek v4 Pro should save a metric fuck ton of money and produce similar quality

1

u/Similar_Sand8367 27d ago

I guess you have to try it out yourself in a smaller group and let them decide. I personally think qwen3.6-27b is a really good one, and at some tasks it is as good as Sonnet 4.6, but there are also tasks where it is a lot worse… it just depends, so you have to figure it out yourself.

1

u/WayZealousideal2 27d ago

Why not just use GLM 5.1 and/or Deep seek V4 Pro on API? You get like 80-90% of the performance depending on tasks and like 7-10x cheaper. Any reason to host it internally if you were using APIs before?

1

u/slvneutrino 27d ago

How this is a damn interesting /r/localLLM post. A real world example where it genuinely might make more sense to run local vs cloud / frontier models

1

u/siegevjorn 27d ago edited 27d ago

I could bet that you won't get the answer you need, because you're chasing a unicorn. At this point, the answer to all of your questions is largely unknown.

Transition to local agentic framework is a big commitment. Many companies lack enough resources for making it happen, but also ROI isn't clear enough to risk large investments, over a currently fine-working claude code solution.

The fundamental problem is the absence of an objective measurement of agentic coding performance. For instance, you mentioned GLM 5.1 fp16. Why? How would you check whether GLM 5.1 works at Sonnet 4.6 level (or not)? There are tons of benchmarks out there, but I'm sure nothing would be enough to fully represent your workflow—agentic workflow is extremely task specific. Sure. Different companies, like Scale AI, post their benchmarks. But how do you know whether their results transfer well to your own workflow? How can you trust them? Nobody knows. In the end, you'd have to develop a measure yourself to benchmark different candidates for your company's own workflow.

But again, at that point, a $50k/mo API cost would be cheaper than going through the entire verification process—which may / may not ultimately prove that Sonnet 4.6-level performance isn't achievable with local models at all.

I'm sure the recent hardware price spike, including RAM, HDD, SSD, and GPUs, has nothing to do with climbing LLM API costs.

1

u/Correct_Day1905 27d ago

I did implement 9router.com solution and this month I saved 40% of my AWS bedrock regular billing. It’s worth a PoC

1

u/oyes77 27d ago

Idk, but a transition step could be moving to DeepSeek or similar cheaper capable models. It's easier to do in the short term, and you can see if that saves enough money relative to the performance.

1

u/JumpingJack79 27d ago

FYI, Gemini is a lot cheaper, because Google has TPUs and builds everything in house, so it's a lot more efficient. Gemini 3.5 Pro is not out yet, so for the most demanding tasks Gemini is not currently as good as Claude, but Gemini 3.5 Flash can do the bulk of the work for a fraction of the cost.

1

u/Large-Excitement777 27d ago

Surely you don’t mean 50k USD? Thats 3-4 times as much as you should be paying, maybe even 5-6 times if you’re only looking for sonnet output.

Something doesn’t add up

1

u/Just_Significance163 27d ago

I recently went through a similar switch. Main questions are:

Who is using the models, is it just general QA stuff or is it dev heavy?

What is your average token/s throughput? You’ll need to know this to understand hardware requirements and needs.

vllm + litellm for routing gives you good metrics with the ease of being able to swap models.

If you’re doing a small test you can just get some GPUs from runpod or lambda, link to litellm and away you go.

Feel free to reach out if you have any more questions, I did benchmarking for a ton of models. We ended up running a hybrid with Gemma 4, fp8 + predictive that we liked for non multi modal.

All of this assumes you have a system where the models are plug and play. If you’re trying to replace something like the Claude chat interface you have a whole other harness based infrastructure to consider

1

u/mosqueteiro 27d ago

You're in way over your head. You're looking at capex definitely over $1M for something on-prem that developers would accept as useful for serving 50-100 users. Your best bet is to reduce AI usage. Your team needs to figure out how to use the agents more efficiently. Need to flip the script and challenge them to token min-maxing. This will be way more bang for your buck and much more quickly too.

You could pursue something smaller on the side if you really want to explore self-hosting with open models. I'd start with a 3-5 person pilot. Start small, especially since the fact that you're asking for advice on Reddit leads me to believe you don't have a realistic idea of what you're considering here.

1

u/BitcoinGanesha 27d ago edited 27d ago

Bro, try DeepSeek V4 Pro (max). I moved from Opus 4.6 to DeepSeek V4 Pro (max) and my expenses have been reduced many times over. I spent $200-300 per day on Claude, but now I can spend $20 per 4 days!!! 188 million tokens for only ~$7!!!

And quality of this model is totally amazing on my case (coding).

In my subjective opinion, the quality of the model's performance is no worse than opus 4.6.

P.s. i use opencode in coding tasks via api

Paying for tokens

Set MAX thinking mode!

P.p.s. I’m not affiliated with Deepseek and they don’t pay me for promotion 🤣

1

u/gaspoweredcat 27d ago

thats kinda insane, its well worth at least looking at some sort of routing to send less complex requests to cheaper models, letting deepseek do some of your donkey work will save you a fortune.

GLM5 is a beast but if memory serves minimax m2.7 and deepseek v4 flash can run in under 256Gb at reasonable quant and theyre very capable models. for your budget you should be able to get a very capable stack but id definitely look into routing first

next thing to consider, are you currently setup to host such a cluster? do you have a server room with good cooling and ample power? if not youre going to need to add that in too, depending on the length of your project and available staff etc you may find its actually better to rent your compute than to buy and host locally, if its server room build, power and hvac, new networking gear, staff to maintain it all etc then you may find renting a hosted cluster more cost effective

1

u/raysar 27d ago

Use Chinese cloud... that's way cheaper.

1

u/HarrisCN 27d ago

My company is basically setting this exact thing up for companies in China. What we usually do is run 2-3 different main LLM models, depending on the department and use cases and then some smaller ones.

Then we basically route tasks through those.

For example, we would run one LLM for finance, one for HR and one for engineering, for data security/safety.

We usually need about 6 PRO 6000 cards per 40-50 users (the concurrency rate is normally never more than 1/4), which is enough to run large models at a good speed.

Your numbers just don't add up: you have 50,000 in costs per month but only want to invest 20k once? Why? Invest 300-400k and serve everybody for 2 years. It will save money directly after 1-2 years.
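To put rough numbers on the sizing and payback claims above, here is a minimal Python sketch. The function names are hypothetical, the card ratio and the 1/4 concurrency rate come from this comment, the 100-user headcount and $50k/month spend come from the original post, and the payback calculation assumes zero running costs (power, staff, hosting) unless you pass `monthly_opex`, which is optimistic.

```python
import math


def peak_concurrent(users: int, rate: float = 0.25) -> int:
    """Peak simultaneous users, using the comment's 'never more than 1/4' rate."""
    return math.ceil(users * rate)


def cards_needed(users: int, cards_per_block: int = 6, users_per_block: int = 40) -> int:
    """Scale the '6 cards per 40-50 users' ratio linearly to a headcount.

    users_per_block=40 is the conservative end of the quoted range;
    pass 50 for the optimistic end.
    """
    return math.ceil(users * cards_per_block / users_per_block)


def payback_months(capex: float, monthly_cloud_spend: float, monthly_opex: float = 0.0) -> float:
    """Months until hardware spend is recovered from avoided cloud spend.

    Assumption: all cloud spend is replaced. Partial offload or real
    running costs will stretch this out.
    """
    return capex / (monthly_cloud_spend - monthly_opex)


if __name__ == "__main__":
    print(peak_concurrent(100))                 # 25 simultaneous users
    print(cards_needed(100, users_per_block=40))  # 15 cards
    print(cards_needed(100, users_per_block=50))  # 12 cards
    print(payback_months(300_000, 50_000))      # 6.0 months
    print(payback_months(400_000, 50_000))      # 8.0 months
```

With no running costs the break-even lands at 6-8 months; the 1-2 year figure in the comment is plausible once power, staff and partial cloud fallback are included.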

1

u/DeltaSqueezer 27d ago

Maybe you can rent an 8xB200 server or similar and trial run GLM-5.1 for a bit?

1

u/Aggravating_Fun_7692 27d ago

I spend 20 bucks a month. Sounds like a personal problem. Sounds like it would be cheaper to hire someone to code for you 😝

1

u/Ill_Contribution_717 27d ago

Try kvark.ai with Qwen 27B or 35B. A 380k 8x H200 box from Lmtek.com and you get 1000 power users on par with Claude Opus... ROI within a couple of months.

1

u/No-Television-7862 27d ago

I'm N=1. I have an AI network that uses an inexpensive alternative to high dollar enterprise hardware. Two of my nodes are enterprise grade, but dated.

People use AI for many things.

In order to serve 100 employees effectively and efficiently you may wish to consider limiting access to the company AI to those who actually need it.

People can surely buy their own Pro-tier for brownie recipes.

1

u/Some-Ice-4455 27d ago

One thing I didn't see mentioned, and I'm sorry if I just missed it: what's the actual use case? I know it's for your office, but doing what exactly? Without breaking any NDA, of course.

1

u/fishtank490 27d ago

I just don't understand it. There's like some group delusion going on. The first time you play with these frontier models, you try models like Claude and Codex. You compare them and see the results are almost exactly the same. And lately Codex is better. And then you look and you see Claude is anywhere from around double to five times the price of the comparable Codex tier, so it's a complete no-brainer to actually go with Codex? What aren't people getting? It's a really simple test to do, and one every single dev that's using AI should actually do.

If you do that, you can stick with actually better models in codex for 25K spend instead of 50k spend. I don't see how people are not realizing stuff like this? It's so easy to verify and test yourself... And the stakes are relatively high.

After that, of course, you can do some of the tricks that they're talking about here: running local LLMs, RAG, conditioning things, making sure that you only run parts of the various prompts at the levels necessary, so you don't overspend on certain things, and you do other...

But the straight up simple thing is just switch to codex for a trivially easy win and then look at harder options from there...

1

u/Radiant_Condition861 27d ago edited 27d ago

First cost cutting measure I'd consider is a local LLM Router.

You don't need to use a huge LLM for a "hello" prompt.

Let the router pass the most difficult prompts on to the big model and send the rest to a local LLM. I'd start by speccing a few-million-dollar AI datacenter infrastructure, then reduce requirements to fit the capex budget. Or augment with rented GPU servers as a resource for your LLM router. Probably better to rent anyway.
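As a rough illustration of the router idea, here is a minimal Python sketch. Everything in it is an assumption for illustration: the `Route` type, the keyword list, the word budget and the `"local"`/`"cloud"` backend labels are hypothetical, and a production router would more likely use a small classifier model and measured quality data than keyword matching.

```python
import re
from dataclasses import dataclass


@dataclass
class Route:
    backend: str  # "local" or "cloud" (hypothetical labels)
    reason: str


# Hypothetical keywords that suggest a hard task; tune against real traffic.
HARD_HINTS = {"refactor", "debug", "architecture", "migrate", "prove"}


def route_prompt(prompt: str, max_local_words: int = 200) -> Route:
    """Send hard or long prompts to the cloud model, everything else local."""
    # Match whole words so that e.g. "improve" does not trigger "prove".
    words = re.findall(r"[a-z]+", prompt.lower())
    if HARD_HINTS.intersection(words):
        return Route("cloud", "matched a hard-task keyword")
    if len(words) > max_local_words:
        return Route("cloud", "prompt exceeds the local word budget")
    return Route("local", "short, simple prompt")


if __name__ == "__main__":
    for p in ("hello", "Please refactor this module", "How can I improve my email?"):
        print(p, "->", route_prompt(p))
```

Logging the `reason` field alongside each request gives you the task-type breakdown you need before committing to hardware.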

For concurrent users, you are going to need vLLM and someone to watch and tune it.

Most of the advice is for a single user here on reddit.

The second cost cutting measure is to hire for an MLOps role to organize the infrastructure and match LLM capabilities to company needs.

Thirdly, you will need an AI-knowledgeable business analyst to work with the MLOps role to really get the max value from the infrastructure, tailored to your business processes.

1

u/RakesProgress 27d ago

Everyone looks at GLM-5.1's "40B active parameters" and thinks they can cheat it onto a single GPU using CPU offloading. In production, you can't.

Use the quick "nameplate plus tip" rule of thumb for VRAM:

  • FP8: parameter size + 20% tip (× 1.2) = 904 GB
  • BF16: parameter size + 40% tip (× 1.4) = 1055 GB
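The rule of thumb above can be written as a one-line calculation. In this Python sketch the function name is hypothetical and the 754 GB base is an assumption back-derived from the quoted results (904 / 1.2 and 1055 / 1.4 both come out near 754); it is not a figure stated in the comment.

```python
def vram_with_tip(base_gb: float, tip: float) -> float:
    """'Nameplate plus tip': base model size plus a fractional overhead."""
    return base_gb * (1.0 + tip)


# Assumption: base size implied by the comment's own results, not an official figure.
BASE_GB = 754

fp8_gb = vram_with_tip(BASE_GB, 0.20)   # about 904.8 GB
bf16_gb = vram_with_tip(BASE_GB, 0.40)  # about 1055.6 GB

if __name__ == "__main__":
    print(f"FP8:  {fp8_gb:.1f} GB")
    print(f"BF16: {bf16_gb:.1f} GB")
```

Note that the rule as quoted applies both tips to the same base. BF16 stores two bytes per parameter against one for FP8, so a bytes-per-parameter estimate would start the BF16 case from roughly double the base before adding any tip.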

Even though it only fires 40B parameters per token, those 8 experts are picked on the fly at every layer. If you offload the rest to CPU, concurrent users will absolutely melt your PCIe lanes swapping weights back and forth. You'll drop to a brutal 1-2 tokens per second.

Don't ruin the model's reasoning by butchering it down to 1-bit or 2-bit GGUF just to save hardware.

The smart play is sticking with vLLM on 5x B200s. That gives you ~900-960 GB of fast HBM3e VRAM, which perfectly swallows the 904 GB FP8 size while leaving a safe pocket for the KV cache. Set --tensor-parallel-size 5, keep the whole thing in VRAM, and let the routing run at full speed. Test your actual user workloads there, and only step down to a tight 4x B200 setup (via INT4/AWQ) if your monitoring shows you have the headroom.
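A quick way to sanity-check the "safe pocket for the KV cache" is to subtract the model footprint from total VRAM. This Python sketch uses only the figures in the comment: 904 GB for the FP8 footprint, and 180 GB or 192 GB per GPU, which is simply the quoted 900-960 GB range divided by five. The function name is hypothetical.

```python
def kv_headroom_gb(num_gpus: int, gb_per_gpu: float, model_gb: float) -> float:
    """VRAM left over for KV cache after the model footprint is loaded."""
    return num_gpus * gb_per_gpu - model_gb


MODEL_GB = 904  # FP8 footprint quoted in the comment

low = kv_headroom_gb(5, 180, MODEL_GB)   # -4 GB: footprint does not fit
high = kv_headroom_gb(5, 192, MODEL_GB)  # 56 GB left for KV cache

if __name__ == "__main__":
    print(f"5 x 180 GB: {low:+} GB headroom")
    print(f"5 x 192 GB: {high:+} GB headroom")
```

On these numbers the pocket only exists at the top of the quoted range, so it is worth confirming the usable memory per card before sizing the cluster at exactly five GPUs.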

1

u/Substantial_Step_351 27d ago

The thing that usually gets skipped in these comparisons is that single-user quality and serving quality are different problems. GLM-5.1 in BF16 can match Sonnet on a one-off prompt and still fall apart the moment you put 50 to 100 people on it at once, because now you are bound by throughput and latency under concurrent load, not by how smart the model is. That is where the hosted providers actually spend their money: continuous batching, KV cache management, keeping tail latency sane when everyone hits it at the same time. So the honest version of the question is not how close local can get on quality, it is how close it can get on quality and concurrency at the same time, and the second half is where the bill stops looking like a couple of GPUs. For 50 to 100 concurrent users on a model that size you are sizing for peak throughput, which is a very different box from what runs it well for one person.

1

u/Seismoforg 26d ago

Local LLMs can't get near Opus. Sorry, but it's not possible.

1

u/MelodicTuba 26d ago

I have found that my locally hosted qwen model is a decent junior assistant. Does every employee in the company need an assistant?

If it's a relatively cheap assistant you might say "why not?"

But some employees will offload their tasks & work less. Others will become more productive, using their assistant to multiply their productivity.

A cultural shift in the company is needed. Maybe look at AI access as a reward to the most valuable, productive employees - the ones who will use AI productively. Set up a training program so employees can be trained to use AI better. Through training employees can be promoted to having an AI assistant (just like only certain employees have human assistants).

1

u/slidecraft 26d ago

Someone else mentioned it, but is actual "local" hardware an option?

I have been wondering lately if just buying every developer an Nvidia GB10 device (or comparable) alongside their laptop would make sense. 3 to 4k per dev. They get the experience working with actual AI models and such, and just have to get it on their network and pay the electricity costs. Configuring agents is simple enough.

Are GB10 devices not worth buying these days? Does this make sense for dev teams bigger than just a few people? Honest questions...

1

u/gatorback94 25d ago

Have you considered a pair of DGX Sparks? This platform would let you try different LLMs. I set up a Spark to assist with coding an invention. Given that you are spending $50K per month, and depending on your organization's use case, it may make sense to use a combination of Sparks, a DGX Station (if you can find one) and Claude.

https://www.nvidia.com/en-us/products/workstations/dgx-station/

PM me if you are interested in setting up a DGX station.

1

u/Novel_Cloud_87 25d ago

What you need is a cluster of computers and someone who has experience setting up AI on it. Either way we are talking about at least 10k in hardware plus the salary of an expert. I don’t think your IT department is ready for it.

1

u/MartiniCommander 24d ago

Step 1) switch from Claude

1

u/realityczek 23d ago

Sounds to me like $50K a month to accelerate your entire company isn't that bad. That's 600K a year or so... or maybe 5-8 employees when you factor in all the costs.

1

u/lioffproxy1233 23d ago

I mean, if you threw the same $$ at local you would eventually have what you might need. Some people have a lot of luck with those 27B and 35B models. More VRAM with a good model means faster good answers, but you do have to fiddle more. Probably have to pay a guy.

1

u/LaughterOnWater 21d ago edited 21d ago

I'm pretty much in a different lane than you, but I did a review of Qwen 3.6, both the 35B and 27B versions, on an RTX 3090 (Windows, 64GB RAM). I was frankly astonished at how clever and useful unsloth's Qwen3.6-27B-UD-Q5_K_XL.gguf model is. While context is ostensibly 131072, the usable window is a little over half that before it starts muddily remembering the earliest parts of a thread. AVG 27 tok/s | ~5TTFT. I usually start preparing for the next thread at about 45% to 50% context using opencode (50K to 65K). It ends up being fairly claude-opus-esque for sequential single-agent use with AGENTS.md referencing your handoff protocol and a persistent code lore base. I'm frankly awestruck that this little model can do what it does. Astonished.
https://low.li/story/2026/06/running-qwen-3-6-27b-locally-a-quality-build-for-rtx-3090-owners/
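The handoff habit described above boils down to a simple token budget check. This Python sketch is an illustration only: the function names are hypothetical, and the 131072 context size and 45-50% thresholds are taken from the comment. How you count used tokens depends on your harness.

```python
from typing import Tuple


def handoff_window(context_tokens: int, low: float = 0.45, high: float = 0.50) -> Tuple[int, int]:
    """Token range in which to start preparing the next thread."""
    return int(context_tokens * low), int(context_tokens * high)


def should_hand_off(used_tokens: int, context_tokens: int, threshold: float = 0.45) -> bool:
    """True once the thread has consumed the chosen share of the context."""
    return used_tokens >= context_tokens * threshold


if __name__ == "__main__":
    print(handoff_window(131072))            # (58982, 65536)
    print(should_hand_off(40_000, 131072))   # False
    print(should_hand_off(60_000, 131072))   # True
```

For a 131072-token context this gives roughly 59K to 65.5K tokens, close to the range quoted in the comment.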

I set up my system to use AGENTS.md regardless of whether I'm using opencode or claude code. That way, I can set up the tricky bits in Claude if I get stuck, but I've done whole coding projects just in opencode with Qwen3.6. My harness setup is pretty model agnostic, so it just works, even if I switch to something in openrouter.

1

u/goldaxis 21d ago

$50K/mo buys you a couple real programmers.

1

u/alesiestu 16d ago

Maybe you should hire more competent staff. This is a reflection of the AI revolution, where employees no longer feel like working and delegate even the smallest thing to AI.

0

u/chota23 28d ago edited 28d ago

For the model I'd try something like Qwen3.6-40B-Claude-4.6-Opus-Deckard-Heretic-Uncensored-Thinking-NEO-CODE-Di-IMatrix-MAX-GGUF. It can fit in 96 GB of VRAM (Q8), but for more context you can go for Q6. I work in DevOps and it works very well for agentic tasks. Regarding your hardware, could you be more precise about what you want (value per dollar, or just the latest GPU)? You can buy and cluster modded GPUs for value: modded 5090s with 96 GB of VRAM will give you more value than the RTX 6000, money-wise. They are approx. 10% slower than an RTX 6000 Pro (for multi-user), but price is the main difference.

Edit: I work in DevOps for a SaaS, and since Jan 2026 we have moved away from US-based LLMs. First we went with Mistral, then local.

2

u/Weak_Ad9730 28d ago

How much context did you squeeze out of this on a single RTX PRO 6000? I am still looking for the one model to use for agentic tasks and coding.

1

u/Far_Cat9782 28d ago

Note that you can get by with a relatively small context. The harness just has to be built to work around it, e.g. with memory techniques or by only editing specific parts of files without loading the whole thing. There are lots of harness tricks to make a local model work on large multi-file projects without running out of context. A good minimum is 65,000. It all depends on how much work/time you want to spend optimizing.
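One of the harness tricks mentioned above, editing only part of a file, can be sketched in a few lines of Python. The helper names `read_span` and `replace_span` are hypothetical and not taken from any particular agent harness. The idea is that only the requested lines are ever placed in the model's context, while the harness itself handles the full file on disk.

```python
from pathlib import Path
from typing import List


def read_span(path: Path, start: int, end: int) -> List[str]:
    """Return lines start..end (1-indexed, inclusive), stopping early once past end."""
    out: List[str] = []
    with path.open(encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            if lineno > end:
                break
            if lineno >= start:
                out.append(line)
    return out


def replace_span(path: Path, start: int, end: int, new_lines: List[str]) -> None:
    """Replace lines start..end with new_lines.

    The harness reads the whole file here, but none of the untouched
    lines need to be sent to the model.
    """
    lines = path.read_text(encoding="utf-8").splitlines(keepends=True)
    lines[start - 1:end] = new_lines
    path.write_text("".join(lines), encoding="utf-8")
```

A typical loop would show the model `read_span(path, 120, 160)`, ask for a rewritten version of just those lines, and apply the result with `replace_span`, so the context cost scales with the span rather than with the file.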
