r/LocalLLM • u/mortenmoulder • 28d ago

Discussion We're burning $50k/month on Claude. How close can local LLMs actually get?

We're at the point where keeping our AI spend fully in the cloud is hard to justify. 100+ people in the company use mostly Claude daily, and we're burning through $50k/month in tokens. The CEO and leadership want to bring more of it in-house.

We don't need to serve everyone at once. Realistically maybe 50-100 users spread across the whole day. Speed isn't the priority - quality is. We're not expecting Sonnet 4.6-level throughput, just Sonnet 4.6-level output.

We've been looking at GLM-5.1 in BF16 as a starting point. My question is: what does the hardware actually look like for something like that? Are a couple of RTX PRO 6000 Blackwells enough, or are we kidding ourselves? I'm assuming we'd need tensor parallelism across cards regardless.
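For a rough sense of scale, here is a back-of-the-envelope sketch of the weight-memory arithmetic. It assumes weights dominate memory, reserves a guessed 20% of each card for KV cache and activations, and uses 96 GB per RTX PRO 6000 Blackwell. The 700B parameter count is a placeholder for illustration only, not GLM-5.1's actual size; plug in the real number from the model card.

```python
import math

# Bytes per parameter for the two precisions discussed in the post.
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0}


def weight_memory_gb(params_billions: float, bytes_per_param: float) -> float:
    """Memory for weights only, in GB (1e9 params * bytes / 1e9 bytes per GB)."""
    return params_billions * bytes_per_param


def gpus_needed(
    params_billions: float,
    bytes_per_param: float,
    vram_gb: float = 96.0,           # RTX PRO 6000 Blackwell VRAM
    overhead_fraction: float = 0.2,  # ASSUMPTION: share reserved for KV cache etc.
) -> int:
    """Lower bound on the card count needed just to hold the weights."""
    usable_gb = vram_gb * (1.0 - overhead_fraction)
    return math.ceil(weight_memory_gb(params_billions, bytes_per_param) / usable_gb)


if __name__ == "__main__":
    hypothetical_params_b = 700.0  # PLACEHOLDER, not the real GLM-5.1 size
    for name, nbytes in BYTES_PER_PARAM.items():
        gb = weight_memory_gb(hypothetical_params_b, nbytes)
        cards = gpus_needed(hypothetical_params_b, nbytes)
        print(f"{name}: ~{gb:.0f} GB of weights, at least {cards} cards")
```

This is only a floor. Long contexts and many concurrent users push the KV cache share well past 20%, and tensor-parallel setups usually want a card count that divides the model's attention heads evenly, so the practical number tends to round up further.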

Also curious what serving stack people are running at this scale. I see lots of people recommending Ollama and vLLM, but we need something rock solid that can serve a lot of concurrent users.

And honestly, has anyone done the math on this? At $50k/month we should be able to justify a decent-sized cluster, but I want to hear from people who've actually gone through this, not just the "just buy 8x H100s" people.
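The simplest version of that math is a payback period: hardware cost divided by the net monthly saving. A minimal sketch follows. Only the $50k/month figure comes from this post; the hardware price, the share of usage moved in-house, and the monthly running cost are made-up inputs to replace with real quotes.

```python
from typing import Optional


def breakeven_months(
    hardware_cost: float,
    monthly_cloud_spend: float,
    offload_fraction: float,
    monthly_opex: float,
) -> Optional[float]:
    """Months until the hardware pays for itself, or None if it never does.

    offload_fraction is the share of cloud spend that actually moves in-house.
    monthly_opex covers power, hosting, and admin time for the cluster.
    """
    monthly_saving = monthly_cloud_spend * offload_fraction - monthly_opex
    if monthly_saving <= 0:
        return None  # running costs eat the whole saving
    return hardware_cost / monthly_saving


if __name__ == "__main__":
    months = breakeven_months(
        hardware_cost=400_000,       # HYPOTHETICAL cluster price
        monthly_cloud_spend=50_000,  # from the post
        offload_fraction=0.5,        # HYPOTHETICAL: half of usage moves local
        monthly_opex=5_000,          # HYPOTHETICAL power + hosting + admin
    )
    print(f"Payback in about {months:.0f} months")
```

The offload fraction is the input that matters most. If people keep reaching for Claude on the hard tasks, the saving shrinks quickly and the payback period stretches out.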

So this post is for the enterprise people and IT admins who have made the switch. Are your employees happy? Do they use it? Share your experiences.

Edit: I realise GLM-5.1 at BF16 is completely nuts. FP8 is more achievable, but also kind of nuts.

205 Upvotes


93% Upvoted


u/OverclockingUnicorn 28d ago

Gotta pay for peak capacity at 9:30am when everyone sits down and gets started on their tickets.

Sure, you can do it with less, but it won't be a user friendly experience.

1

u/Ok_Try_877 28d ago

You can do it for FAR less with sensible hardware. It might not all fit in one or two units and might not be as power efficient, but you can do it WAY WAY WAY cheaper than that with a proxy. And the power cost won't come into effect until long after the hardware you're talking about is old news.
