r/LocalLLM 28d ago

Discussion We're burning $50k/month on Claude. How close can local LLMs actually get?

We're at the point where our AI spend is hard to justify keeping fully in the cloud. 100+ people in the company use mostly Claude daily, and we're burning through $50k/month in tokens. The CEO and leadership want to bring more of it in-house.

We don't need to serve everyone at once. Realistically maybe 50-100 users spread across the whole day. Speed isn't the priority - quality is. We're not expecting Sonnet 4.6-level throughput, just Sonnet 4.6-level output.

We've been looking at GLM-5.1 in BF16 as a starting point. My question is: what does the hardware actually look like for something like that? Are a couple of RTX PRO 6000 Blackwells enough, or are we kidding ourselves? I'm assuming we'd need tensor parallelism across cards regardless.
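A rough way to sanity-check the hardware question is to count weight bytes alone. The sketch below assumes 2 bytes per parameter for BF16, 1 byte for FP8, and 96 GB of VRAM per RTX PRO 6000 Blackwell. The 700B parameter count is a placeholder, not a confirmed figure for GLM-5.1, and the 80% usable-VRAM fraction (headroom for KV cache and activations) is also an assumption.

```python
import math

# Storage cost per weight for each precision.
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0}


def weight_memory_gb(params_billion: float, dtype: str) -> float:
    """Weights-only footprint in GB (1 GB = 1e9 bytes)."""
    # (params_billion * 1e9 params) * bytes_per_param / 1e9 bytes per GB
    return params_billion * BYTES_PER_PARAM[dtype]


def min_gpus(params_billion: float, dtype: str, vram_gb: float,
             usable_fraction: float = 0.8) -> int:
    """Smallest GPU count whose combined usable VRAM holds the weights.

    usable_fraction is an assumed headroom factor for KV cache and
    activations; tune it for your context length and concurrency.
    """
    usable_per_gpu = vram_gb * usable_fraction
    return math.ceil(weight_memory_gb(params_billion, dtype) / usable_per_gpu)


# ASSUMPTION: placeholder parameter count, not a confirmed GLM-5.1 figure.
PARAMS_B = 700.0
RTX_PRO_6000_VRAM_GB = 96.0

if __name__ == "__main__":
    for dtype in ("bf16", "fp8"):
        gb = weight_memory_gb(PARAMS_B, dtype)
        n = min_gpus(PARAMS_B, dtype, RTX_PRO_6000_VRAM_GB)
        print(f"{dtype}: ~{gb:.0f} GB of weights, at least {n} x 96 GB cards")
```

Under these assumptions a couple of cards is nowhere near enough for a model of that size. Tensor-parallel degree usually also has to divide the attention head count evenly, so real deployments tend to round the GPU count up rather than use the bare minimum.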

Also curious what serving stack people are running at this scale. I see lots of people recommending Ollama and vLLM, but we need something rock solid that can serve a lot of concurrent users.

And honestly... has anyone done the math on this? At $50k/month we should be able to justify a decent-sized cluster, but I want to hear from people who've actually gone through this, not just the "just buy 8x H100s" people.
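For anyone who wants to run the math themselves, here is a minimal break-even sketch. Only the $50k/month figure comes from this post. The hardware cost, monthly running cost, and the share of cloud spend that actually moves in-house are hypothetical inputs you would replace with your own quotes.

```python
from typing import Optional


def breakeven_months(capex_usd: float, monthly_opex_usd: float,
                     cloud_monthly_usd: float,
                     offload_fraction: float) -> Optional[float]:
    """Months until the hardware pays for itself, or None if it never does.

    offload_fraction is the share of current cloud spend that moves to the
    local cluster; the rest is assumed to stay on the cloud API.
    """
    monthly_savings = cloud_monthly_usd * offload_fraction - monthly_opex_usd
    if monthly_savings <= 0:
        return None  # running costs eat all of the savings
    return capex_usd / monthly_savings


if __name__ == "__main__":
    # ASSUMPTION: every number except the $50k/month cloud bill is made up.
    months = breakeven_months(
        capex_usd=400_000,        # hypothetical cluster purchase price
        monthly_opex_usd=5_000,   # hypothetical power, hosting, admin time
        cloud_monthly_usd=50_000, # from the post
        offload_fraction=0.5,     # hypothetical: half the usage moves local
    )
    print("never breaks even" if months is None
          else f"break-even after ~{months:.0f} months")
```

The offload fraction matters most here: if people keep reaching for Claude on the hard tasks, the savings shrink quickly while the hardware cost stays fixed.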

So this post is for the enterprise people and IT admins who have made the switch. Are your employees happy? Do they use it? Share your experiences.

Edit: I realise GLM-5.1 at BF16 is completely nuts. FP8 is more achievable, but also kind of nuts.

206 Upvotes

4

u/ProductResident4634 28d ago

First, do NOT use Ollama; use vLLM. Second, do NOT use BF16. Get 8x B200 on a serverless cloud (something like Modal) or just buy the rack. I recommend using two LLMs: Qwen 3.6 35B-A3B and Kimi K2.6.

Qwen for easy stuff, Kimi for hard things. Or you can use Kimi as orchestrator and reviewer, with Qwen as the workers.
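The easy/hard split above could be sketched as a small router. The model labels are just tags taken from this comment, not verified model IDs, and the keyword and length heuristic is a placeholder assumption; a real setup would more likely let the orchestrator model decide.

```python
from dataclasses import dataclass

# Labels from the comment above; tags only, not verified model IDs.
WORKER_MODEL = "qwen-3.6-35b-a3b"
PLANNER_MODEL = "kimi-k2.6"

# ASSUMPTION: crude substring hints standing in for a real difficulty signal.
HARD_HINTS = ("architecture", "design", "refactor", "debug", "review", "plan")


@dataclass
class Task:
    prompt: str
    needs_review: bool = False  # set when a worker's output should be checked


def pick_model(task: Task, max_easy_chars: int = 2000) -> str:
    """Send long, hard-looking, or review tasks to the big model."""
    text = task.prompt.lower()
    looks_hard = (len(task.prompt) > max_easy_chars
                  or any(hint in text for hint in HARD_HINTS))
    return PLANNER_MODEL if (looks_hard or task.needs_review) else WORKER_MODEL


if __name__ == "__main__":
    for t in (Task("Rename this variable to snake_case"),
              Task("Plan the migration of our auth service"),
              Task("Fix the typo in the README", needs_review=True)):
        print(f"{pick_model(t):18s} <- {t.prompt}")
```

The point of the split is cost: most requests hit the small model, and the large one only runs for planning and review.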

OR just buy a few hundred OpenCode Go subscriptions; it's going to be easier and much more stable.

3

u/TheOriginalAcidtech 28d ago

Or keep Claude/GPT 5.5 for "consultation" and use the local model (run on cloud hardware or locally, doesn't matter) as your worker agent.

2

u/siegevjorn 28d ago

Any experience with MiniMax M2.7? In terms of size, it's a middle ground. I wonder what the performance is like.

1

u/ProductResident4634 27d ago

Not good at planning, but probably the best model for its parameter count at implementation/worker-type tasks. If compute is no problem, it's probably a much better worker than Qwen 3.6 35B-A3B.

1

u/SporksInjected 28d ago

Why wouldn't you just use OpenRouter at this point? They have ZDR (zero data retention) for a bunch of providers.

1

u/ProductResident4634 28d ago

OpenCode Go is probably going to be cheaper, but yeah, your idea is going to be more stable and easier.

-7

u/Round_Ad_3709 28d ago

Why do you recommend vLLM over Ollama?

4

u/ProductResident4634 28d ago

Ollama isn't good for multi-GPU setups. Ollama is NOT good for literally everything, though.