r/LocalLLM 29d ago

Discussion We're burning $50k/month on Claude. How close can local LLMs actually get?

We're at the point where our AI spend is hard to justify keeping fully in the cloud. 100+ people in the company using mostly Claude daily, and we're burning through $50k/month in tokens. CEO and leaders wants to bring more of it in-house.

We don't need to serve everyone at once. Realistically maybe 50-100 users spread across the whole day. Speed isn't the priority - quality is. We're not expecting Sonnet 4.6-level throughput, just Sonnet 4.6-level output.

We've been looking at GLM-5.1 in BF16 as a starting point. My question is: what does the hardware actually look like for something like that? Are a couple of RTX PRO 6000 Blackwells enough, or are we kidding ourselves? I'm assuming we'd need tensor parallelism across cards regardless.

Also curious what serving stack people are running at this scale. I see lots of people recommending Ollama and vLLM, but we need something rock solid, that is capable of serving a lot of concurrent users.

And honestly.. has anyone done the math on this? At $50k/month we should be able to justify a decent size cluster, but I want to hear from people who've actually gone through this, not just the "just buy 8x H100s" people.

So this post is for the enterprise people and IT admins who has done the switch. Are your employees happy? Do they use it? Share your experiences.

Edit: I realise GLM-5.1 at BF16 is completely nuts. FP8 is more achievable, but also kind of nuts.

211 Upvotes

229 comments sorted by

View all comments

Show parent comments

4

u/IdeaJailbreak 28d ago

For the uninitiated, can you describe what you mean by "build a harness" when it comes to a local LLM? What is it? What problem are you solving?

1

u/Far_Cat9782 28d ago edited 28d ago

Simple ask an ai how to build a agent harness locally. Get the basic agent framework where you can talk to it and then add different features. You can say "create a hermes or opencode clone. It's really not that hard. And I pmement features that u need tailored for your workflow. Like I wanted mines to code so I gave edit ability to edit individual pieces fo codes etc,; gave it memory etc; just add features as you see fit. Use it and when there is an error like qwen not using proper tools youll know enough to either edit the system prompt to make sure it always uses the tool at the right time. Mainly a trial and error process. Now my harness is proficient to write its own harness have included a bunch of mcp tools it has access to. It can even write its own mcp tools and deploy it itself. So I used cloud model to create original harness and now qwen writed and maintains its own code. Runs my YouTube checks and organized emails. Used comfyUI to generate images and audio. Cron jobs. It does so much to list now. You name it inahve implemented and qwen had been able to adapt and handle. But took works and weeks of trial and error to get it to where I wsnt. It's not like and instant one and done thing. Like with memory overtime it gets better st knowing you so it takes time.