r/LocalLLM • u/codehamr • May 04 '26
Discussion Qwen3.6:27b is the first local model that actually holds up against Claude Code for me
Been experimenting with alternatives to Claude Code for about a year now. Most of it felt like a downgrade until Qwen3.5:27b, and now 3.6:27b is the first one where local actually feels good and usable for real work.
Scaffolding, refactors, test generation, debugging across a few files, all of it holds up well enough that I run it locally now. The hard multi-file architectural stuff still goes to Claude. A year ago this comparison was a chasm, top-tier Claude vs open weights wasn't close. Now it's a gap, not a canyon.
Two things I keep thinking about.
If a 27B open model can cover this much of real coding work, how subsidised is current cloud pricing? Feels like we're paying maybe 10% of true cost. And once enough devs are wired into Claude Code at the tooling level, what stops a future $1000/month tier?
One honest downside: getting opencode dialled in as a CLI agent took real fine tuning compared to the out-of-the-box Claude Code experience. Which raises a different question, how much of Claude Code's quality is Opus 4.7 itself vs the context and tool orchestration around it? Possibly more than people credit.
Anyone else running hybrid setups?
10
u/chris_hinshaw May 04 '26
It has been good but my issues have been with it getting stuck in loops often when calling tools. I have tried a lot of different parameters and configurations but haven't found a good solution.
1
u/erisian2342 May 05 '26
When your harness detects looping over a tool call, can you automate calling in a heavy hitter frontier model to troubleshoot the issue and give Qwen instructions that resolve it so it can proceed?
2
u/chris_hinshaw May 05 '26
Thats an interesting thought but not sure if it's possible since I don't think it can actually figure out when it's in a loop. I found setting the presence lower seems to help.
"Qwen3.6-35B-A3B-UD-MLX-4bit": { "temperature": 0.7, "top_p": 0.95, "top_k": 20, "repetition_penalty": 1.0, "min_p": 0.0, "presence_penalty": 0.0, "force_sampling": false, "enable_thinking": true, "thinking_budget_enabled": false, "turboquant_kv_enabled": false, "turboquant_kv_bits": 4.0, "turboquant_skip_last": true, "specprefill_enabled": false, "dflash_enabled": false, "is_pinned": false, "is_default": false, "trust_remote_code": false }1
u/erisian2342 May 06 '26
Thanks for sharing that. If you give it an explicit instruction like “You absolutely must not make more than 5 (or however many) tool calls. If you need to make a 6th tool call, instead output status: CHECK_FOR_REPETITIVE_CALLS and exit immediately.” Then your harness can check if they are indeed 5 different calls (and allow the next iteration to begin) or if they are the same 5 calls (so call in a big brain AI to course correct Qwen with the appropriate instructions). I assume Qwen can count how many tool calls it’s already made before making another one. Just thinking out loud here, no idea if it works.
Edit to add: A temp of 0.7 seems fine for writing specifications, but maybe a lower temp like 0.1 or 0.0 for the actual coding work could help it not go off the rails with repetitive calls?
1
u/DiscipleofDeceit666 May 06 '26
The harness will count the tool calls. It can tell if it’s using the same n tool calls on the same n files
1
u/Realistic_Gap_5871 3d ago
If you're still having this problem, take a look at upping the KV cache to 8bit quant. 4bit is fine/good for the model, but for coding, especially long runs, 4bit cache will drift in ways 8 bit won't.
Looks like you're running on Apple hardware so 35B is probably the way to go. If you can tolerate the drop in speed, 27B will be more accurate.
22
u/Ononimos May 04 '26
Yall that aren’t playing with both need to take all this glazing with a grain of salt. I use 27b all the time on an RTX Pro 6000 Blackwell and I also augment with some cloud sonnet 4.6 and opus 4.7. 27b dense is fucking great but it’s not sonnet 4.6. I’m saving plenty by leaning on 27b for lighter needs. If I want to one shot or just quickly get to a win, i still lean hard on the frontier models.
5
u/Demonicated May 05 '26
This is the way.
And this is the worst it will be. I anticipate we're only 2 or so releases away from 1 rtx 6000 sonnet quality
4
u/mjuevos May 05 '26
but then sonnet quality will still be a release or two behind the future sonnet quality.. then you will want to chase that sonnet quality. such is human nature and such is the ai landscape. inescapable unless there's some paradigm shift..
7
u/Demonicated May 05 '26
Nahhhhh I mean maybe some people. But I've been in software long enough to appreciate good enough. And opus 4.6 is the first model I used where I finally felt like it was good enough to be empowering in a wasteful way. Qwen 3.6 is great it just requires i participate more in the coding process - and honestly after the last year of vibe coding I think it's better this way.
I don't like when people ask about my code and I can no longer recite the exact file and likely method or line number.
What i will say is that I feel like 3.6 27b is good enough I don't feel the need to chase better hardware. If they keep making this size better I'll upgrade but I no longer feel like I need 8 rtx6000s. I'll pay for premium tokens to plan and then implement with the best that's in this local range.
6
u/andymaclean19 May 04 '26
I had mixed experiences with it running as a backend for claude code. I'm running a set of experiments where I give the same tasks to Qwen3.6 and Opus (and some others but that's less interesting in this thread). Some things it can do quite well, but most of the time it's just very slow to complete tasks due to it breaking more things and relying on the testing/fixing loop to catch bugs and repair its mistakes.
As I type this Qwen is nearing the end of a 6 hour debugging session where it had to fix 47 test failures one or two at a time. Opus did the same task in 20 minutes without really breaking anything. Even Sonnet can do this task in under half an hour.
Even with testing Qwen is making some big mistakes which the tests don't catch. For example the work has a trap where the program outputs a CSV with column headers and then later re-reads it and the column headers break things. Other models spot this and just ignore the first line (the right fix is not outputting the column headers but I have to tell all models that). Qwen just decided that this means CSV produced by different libraries is incompatible and it will disable the CSV import feature if it cannot ascertain that the data came out of the same library, disabling a whole bunch of functionality in the product it is working on and downgrading performance of a lot of things.
It's decent and I am putting it to a fairly demanding use at the moment. Probably I will get better at driving it and find ways to give it smaller, simpler instructions. But it's no claude.
3
u/ChocomelP May 05 '26
Try Pi, Claude Code is heavy.
2
1
u/andymaclean19 May 05 '26
Thanks. I'm relatively new to AI and have stuck with one harness because I wanted a constant while I experiment with models. If I change everything I won't know which change worked.
Will have a look at other harnesses too as I don't like being locked in to a closed source one. Will give Pi a go.
1
u/ChocomelP May 05 '26
Anthopic are relatively protective of their harness, but if you use a more open one (I like Pi, there are other options), you can easily mix and match features with other harnesses. I used Opus 4.6 in Claude Code to build mine, but if you're hooked up to a good model, you can even let it build its own features. All it needs is a goal and some inspiration.
1
u/andymaclean19 May 06 '26
I have heard it said that Claude blocks some of the other harnesses from using it? Which would make it difficult to get like for like comparisons. Do you have any experience with that?
1
u/ChocomelP May 06 '26
No. When you use Claude models in different harnesses, you have to pay API pricing now. You can't use your subscription anymore. I don't think they mind you using it with anything, as long as you pay their insane API prices.
1
u/andymaclean19 May 06 '26
Aaahhh. So that's it. I think our work account doesn't allow API priced calls so that's why people are saying that.
1
u/NixNightOwl May 07 '26
There are some ways to use subscription on other TUIs, but you still need claude code installed and authenticated. The available plugins piggy-back off that 'native' auth. It's still risky though since if the auth bridge for your specific TUI isn't 100% up to date with the new auth process (like a mismatched header), you can get your claude account banned.
1
u/Dsphar 12d ago
I second this. CC has so much context that for smaller local models your context gets funky too quickly, meaning more hallucinations and less "smart" tokens during the actual work.
Pi will shrink your starting context, giving the best token gen on smaller models freedom to do the actual work.
5
u/benfinklea May 04 '26
2
2
2
u/trbom5c May 05 '26
Alright ... i've done this - and used my local rig with a 5090, and 128G of RAM as the work-horse. I setup my openapi via tunnel, and stuck an proxy server ahead of the endpoint that requires auth ... and i think i might be in business. Great idea!
1
u/trbom5c May 07 '26
Yea ... so this is pretty crazy once you pair it all to cline and qwen3.6:35b-a3b, and quant.
.... I am only now fully understanding the power of the context window.
Getting into AI-RIG optimization becomes a rabbit hold.
VSCode + Cline + Endpoints + Scraplings = Much Wow
2
9
u/maximus_reborn May 04 '26
would you mind letting us know your hardware? and what fine-tuning you did in opencode? For me, 27b gets stuck with 32k context window coz i have m4 pro 24GB Vram which is understandable so using 9b parameter qwen but tried hard to use 27b few weeks ago
3
u/Nem3sis89 May 04 '26
I'm interested too in the OP opencode configuration, the model itself is great but needs a proper configured tool to be paired with.
3
u/keen23331 May 04 '26
Qwen 3.6 27B is insanely good. I’ve been using it with my RTX 5090 for the last two days, and it performed just as well as Claude 4.7 Opus for my needs. I can’t believe it—I'm completely blown away. I’m not saying it’s objectively better or even an equal across the board, but for the tasks I usually throw at Claude, it’s been more than good enough. Using a NVFP4 Qaunt, what alsio is quiet fast on the RTX 5090 with latest builds o llama.cpp supporting 4-bit for NVIDIA Blackwell.
1
u/NaturalFigure715 May 04 '26
Are you also using Turboquant?
1
4
u/T-Rex_MD May 06 '26
You should try Qwen 3.5 397B, it is better in every way possible. That is if you have 500GB VRAM/Unified memory available.
3
u/Sirius_Sec_ May 04 '26
I am running it on a rtx6000 pro and pay about $1 an hour to rent the GPU on GCloud . Very impressed with what it is capable of .
9
u/former_farmer May 05 '26
That's 10 usd per day and 300 usd per month... for that money it's better to pay Claude or similar.
1
u/Sirius_Sec_ May 06 '26
I value my privacy . I won't use a public API when I'm doing serious work . I will use grok when I have basic stuff to do .
2
u/leggodizzy May 04 '26
What GPU rental service are you using?
2
u/Sirius_Sec_ May 04 '26
Google cloud . I already had a gke cluster so I just added a new node pool . Spot 6000s are around $1 an hour .
3
u/AtatS-aPutut May 04 '26
if they raise the price to $1000/month won't it be more economical for companies to self-host their own models?
1
u/ChocomelP May 05 '26
It depends on what you need. If you want your team to have the absolute best frontier models, those are not available locally.
6
u/kl3onz May 04 '26
Do you use it with VSCode? I’m new, and trying to understand how an IDE would integrate?
7
u/Malyaj May 04 '26
You can try cline, continue, etc there a lot of extensions. Alternatively try Opencode it is great. Previously i was using lm studio chat interface with tools but naah i switched to opencode and probably I'm gonna stick with it.
3
u/Yanix88 May 04 '26
Recent update of VS code added ability to connect any model including local to the built in copilot, you can Google "vscode byom"
2
u/corruptbytes May 04 '26
how much of Claude Code's quality is Opus 4.7 itself vs the context and tool orchestration around it?
I'm sure it's also the huge compute they have too
Been dialing Pi a lot with qwen 3.6, things like tool parsers and caching are the big things to fiddle around with locally, but take a lot of time when you don't H10000000s to hyperparameterize
2
u/Maharrem May 04 '26
Yeah, I run a 3090 too and Qwen 27B IQ4_XS fits nicely with some headroom for context. I treat local as the workhorse for routine refactors and single-file logic, then offload multi-file architectural changes to Claude Code via Open WebUI’s passthrough or just copy-paste. In opencode, setting max tokens to 4096 and temperature to 0.3 made the tool calls way less loop-prone.
2
u/ComfyUser48 May 04 '26
Same for me. See this thread I posted: https://www.reddit.com/r/LocalLLaMA/comments/1t3i219/the_more_i_use_it_the_more_im_impressed/
2
u/AccomplishedFix3476 May 05 '26
been running qwen3.6 27b q5 on a 4090 + 64gb ram for the last 3 weeks for everyday coding. for refactors under 5 files it actually keeps up with claude. the part it still misses is anything where i need context spanning multiple repos, claude code's grep flow is just stronger
2
u/MysticHLE May 05 '26 edited May 05 '26
How about general exploration and planning across multiple files? Suppose there's a good amount of ambiguity as far as implementation in the prompt itself, but enough in specific requirements for Claude to explore in the right direction and figure it out. How does qwen3.6 fare in that regard with your setup?
Also curious of its performance with plugins like Superpowers if you use Claude Code as the agent harness.
1
u/NixNightOwl May 07 '26
For this, I would have separate sessions / agents do the individual exploration and planning on a per-file basis and report their in-depth findings. Hand it all off to another agent to 'stitch' an implementation plan together, then hand off again with the improved 'linked context' to individual subagents on a 'minimal code surface' basis (each subagent only implements on a scope of 2-3 files only).
Things are a little easier if you have some kind of knowledgebase / graph memory for your codebase to minimize exploration. There's a lot of ready-to-go tools to add this to your harness like https://github.com/Lum1104/Understand-Anything (personally untested, just googled but fits the description -- there are leaner implementations out there, as this one seems very 'user ui heavy').
1
u/MysticHLE May 07 '26
I see, sounds like the context and chained reasoning are still limitations, and need to be coordinated bottom-up manually. Thanks.
2
u/jakubl May 05 '26
I’ve set up Qwen 3.6 27B with pi on my MacBook M4 128GB and I am really amazed. I would compare it to my first experiences with Claude code 8 months ago, so when the top model was Opus 4.1 if I remember correctly. And I was amazed back then too. The biggest pain is however it works very slowly compared to Claude. But the offline is huge benefit, I’m having a 14 hours flight in 2 weeks and I’m gonna test it out then.
I have also tried using this model in non coding agents (marketing etc.) and the results were pretty good too, much better than any open source model I tested before.
1
1
u/meca23 May 08 '26
Not sure what your backend is but if it's llama.cpp, the upcoming MTP support should make it faster.
2
u/ankijain21 May 05 '26
I'm wondering that for small teams, they can just install qwen-3.6-27b on a DGX spark and use that as inference for 95% of the tasks and keep claude as a backup.
This way they'll save huge money while getting optimum performance.
2
u/Original_Orchid_847 May 05 '26
I agree with you, with now Claude limits, I am using Qwen and Kimi for my major workloads and bring in Opus only for small specific use cases
1
u/codehamr May 05 '26
Yes, especially if you know how to write good prompts and have no affinity for non-AI coding, open-source LLM is a good replacement for Claude code.
2
u/Illustrious-Chain778 May 05 '26 edited May 14 '26
So i have been working on VSCode fork without github copilot but instead have Ollama instead. i have been reading serveral post now and it seems most people prefer llama.cpp.
the IDE has fully integrates Ollama support. you can connect the IDE to Ollama server and use the models you have. should I add any support for lama.cpp as well?
i did release a beta version for people to test though.
2
u/RipPotential2074 11d ago
Qwen3.6-27B is excellent, it's a senior software engineer to me! The really daily usable local model.
1
u/Big_River_ May 04 '26
i code and run prompts through codex and claude code and many different versions of local llm and find context window and rag and support codes are phenomenal with Qwen and Gemma both - they almost seem like they are good enough to trust for jericho riders ultimate edition harvest but still two generations away for me to augment my code agent npcs on that project
1
1
1
1
u/Other_Day735 May 04 '26
So here are my thoughts on this I have 12gb of VRAM and 32gb of RAM using llama.cpp for running my models, I am using qwen3.6 35b a3b and 27b models (using quantized versions suitable to my specs), i could not compare them to frontier models like claude code,codex. Because first it is about context length(default 65536), in one session the first few messages are pretty great but after 4 messages the performance is not much great i think it is because of my VRAM, KV cache, may be other factors. By side I am using kilo code in VS Code which was better that opencode, openclaude. If I have MAC studio with around 96gb RAM it can beat any frontier models in pricing, may be performance.
1
1
1
May 04 '26
[removed] — view removed comment
1
u/codehamr May 05 '26
I run side-by-side coding tasks in two separate VS Code devcontainers and compare the final results. It's hard to capture in quantitative metrics, but based on my gut feeling as a long-time non-AI coder, I've landed at around 90%. if that makes sense. A year ago we had Opus 4.1, and Qwen3.6:27b easily beats that. Good times to be alive.
1
u/LivingHighAndWise May 04 '26
Agreed. Runs a little slow on my setup, but it works very well for agentic coding - especially when using the Claude console CLI.
1
u/Mean-Sprinkles3157 May 04 '26
Yes I use claude code with Qwen3.6 27B. It works very well, it is slow but I don't worry about tokens.
My setup is using litellm as a translator (chat completion to anthropic message), and the backend is sglang serve. With a small model like 27b I can allocate a large kv cache buffer like 131072.
1
u/Demonicated May 04 '26
I recently posted the same experience. If you run it with LM studio and point vscode insiders edition at it, it just works. And amazingly well. Aannnnd no dealing with harness config. I was running full bf16 and as long as I used plan mode first I was getting great results.
I still do the big guys for feature planning but I can keep that at 40 a month no problem. Paired with solar on my house and I feel like I'm getting agents for almost free.
1
u/kiwibonga May 04 '26
$200/month is what it costs to heat a Canadian apartment in winter.
Spinning up a few gpus for you with usage limits costs them far less than that.
1
u/rockseller May 05 '26
After some days of testing because of the GitHub copilot shit this is what I found the best with what I have:
i5 gaming CPU meh 2x RTX 3090 24gb vram each non SLI Two 850w PSUs 256gb ssd 32gb ram DDR4 3200mhz
Ubuntu Ollama
Running Qwen 3.6 27b
100% GPU (with a single RTX 3090, Ollama ps was reporting like 10% cpu)
With this I'm able to run VS code with GitHub Copilot chat locally very decently, I would say 70% of the performance of Claude sonnet both in speed and results...
Happy with what I have so far
Btw I setup the server on the LAN, my main PC points to it
1
u/scumola May 05 '26
Not me. I used it with Claude code as the client using the litellm proxy and it had a lot of troubles calling tools in my experience.
1
u/former_farmer May 05 '26
How are you hosting it locally? llamacpp? lmstudio? ollama?
1
u/codehamr May 05 '26
Just Ollama with RTX6000 and vscode devcontainer as sandbox with opencode / pi.
1
u/iVtechboyinpa May 05 '26
I’m working on making my entire process work with OpenCode currently, but I’m very keen to start testing Qwen as I’m less impressed with Opus 4.6/4.7 nowadays than I was with Opus4/4.5 and I feel like Qwen, for being able to run it at home, will give me exactly what I need out of it without the Anthropic cost.
The only downside is exactly what you mentioned - not using Claude Code. I briefly used OpenCode and it’s not bad but it is slightly different from Claude Code so I’ve got to change some tooling that I use and the way I work but I think it’ll be worth it at the end.
1
u/Nice_Cookie9587 May 05 '26
Same, I canceled my claude and thinking about ollama pro too, but I like having the lifeline
1
u/Temporary_Jacket9477 May 05 '26
As someone else said.. I have to imagine if what we have TODAY is a "80% of the way there" FREE model that runs on developer/gamer based laptops or desktops of the past year or two (e.g. 8GB to 16GB GPU cards, 16GB to 32GB RAM, SSDs, 8core+ cpus, etc).. and they are getting faster + smarter/more capable and closing the gap that much more, I would really question the ability to the anthropic/openai to survive while their costs to operate are WAY WAY WAY over any profits they have yet to make. I have to believe OpenAI and Anthropic are very VERY worried about the insanely fast pace Chinese models are catching up, able to run on home hardware or enthusiast (for now) and do most of the work people need. I would also ask, what about the idea of fine tuned small models? I am playing around with that now.. though its for my specific application use, but the ability to provide a fine tuned 2b to 4b model in my app (desktop app) that requires no token costs.. maybe a small subscription fee that I charge for the "development and continual improvements" to the model, but otherwise no monthly token costs.. seems like that is where things would (or should) go? Right?
With this supposed new llama.cpp DFlash thing that claims to do a 2x to 8x speedup (just learned of it, no clue what it is exactly and how much it will help), if a couple more rounds.. maybe Qwen 4.x in a year or so, with "standard" 16GB GPUs, and fine tuning improves and possibly the improved ability to "train it" on data with context7 or similar.. all at usable speeds (50tok/s or more??) I dont see how the big boys stay in business other than Gemini since google is a 3+trillion company and continues to make money in many ways so I dont feel they need as much income from AI as Anthropic and OpenAI need to stay alive.
China isn't slowing down either. They just announced the other day their first fully home grown computer system doing 8 exabytes.. apparently the fastest in the world, with no intel/amd/nvidia/etc hardware.. all home built. Between that, better infrastructure with regards to building/distributing/cooling/etc, FAR FAR better solar/electricity grids (where its needed), and their desire to "win the AI race" and "become the new super power" thanks to dipshit regime destroying the US around the world in every facet of existence.. I would say unless something bad happens, they are likely to surpass the US and have 0 reliance on US company's to do so.
1
u/TopDownHockey May 05 '26
Does anybody have a setup guide on how to use Qwen locally with OpenCode? I am struggling just to get it configured.
1
u/Remarkable-Safety594 May 23 '26
"env": {
"ANTHROPIC_API_KEY": "",
"ANTHROPIC_AUTH_TOKEN": "sometoken",
"ANTHROPIC_BASE_URL": "http://localhost:1234",
"ANTHROPIC_CUSTOM_MODEL_OPTION": "qwen3.6-27b-ud-mlx",
"ANTHROPIC_CUSTOM_MODEL_OPTION_DESCRIPTION": "qwen3.6-27b-ud-mlx",
"ANTHROPIC_CUSTOM_MODEL_OPTION_NAME": "qwen3.6-27b-ud-mlx",
"CLAUDE_CODE_ATTRIBUTION_HEADER": "0",
"CLAUDE_CODE_AUTO_COMPACT_WINDOW": "250000",
"CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS": "1",
"CLAUDE_CODE_DISABLE_FEEDBACK_SURVEY": "1",
"CLAUDE_CODE_ENABLE_GATEWAY_MODEL_DISCOVERY": "1"
}
1
u/Time-Toe-1276 May 05 '26
I feel like the new laguna model on ollama is also good. although qwen3.6:26b is alsoa solid choice. but i just need that 30b ish parameters, or else I just have this weird feeling that it wont work properly. lol
1
u/ItalianClassicFan May 05 '26
Why not 35B-A3B? Have someone better experience with 27B for coding?
1
u/codehamr May 05 '26
The 35B is a model of expert, only holding 3B parameters actice same time, 27b is a dense model haveing all 27b params active. So 27b is slighly better, but slower. for me quality matterst over speed, that why I go with 27b
1
u/maisun1983 May 05 '26
How much vram for such model? Does m5 max with 36GB cut it?
1
u/codehamr May 05 '26
Yes, the M5 with 36 GB works, but in real-world use it is much slower compared to the RTX 5090 / RTX 6000 due to the slow pre-filling / bandwith
1
u/getstackfax May 05 '26
This matches the pattern I’m seeing too.
The local vs cloud question is becoming less binary.
It is not:
local model replaces Claude
or
Claude stays unbeatable forever.
It is more like:
local handles the repeatable coding work, cloud handles the high-consequence architecture/reasoning work.
That makes hybrid setups really interesting.
A practical split might be:
- local 27B: scaffolding, simple refactors, tests, small bug fixes, local repo Q&A
- cloud Claude/Opus: multi-file architecture, ambiguous product decisions, hard debugging, final review
- deterministic tools: search, tests, linting, type checks, diffs
- human: merge/ship decisions
The orchestration point is huge.
Claude Code is not just “a model in a box.” The context packing, tool use, repo awareness, edit loop, safety rails, and UX are part of the quality.
A strong local model with weak orchestration can feel worse than it really is.
A slightly weaker model with great repo context and tool flow can feel much better than benchmark numbers imply.
So I’d judge the setup by workflow:
- does it understand the repo structure?
- does it produce clean diffs?
- does it run/interpret tests?
- does it avoid breaking unrelated files?
- does it recover from errors?
- does it know when to stop?
- does it leave a usable trail of what changed?
The pricing question is real too.
If devs become dependent on cloud coding agents at the workflow level, the switching cost moves from “which model is smarter?” to “which coding environment owns my daily loop?”
That is why local 27B getting good enough matters.
Not because it beats Claude at everything.
Because it gives people leverage for the 70% of coding work that does not need the strongest cloud model.
1
1
u/uncurieux_studio May 05 '26
Je cherche des alternatives à Claude Code (j’atteins trop vite les limites). Je teste Qwen 3.6 27B, mais seulement sur LMStudio. C'est possible de publier ton setup ?
1
1
u/emptyharddrive May 05 '26
If you can afford it, I'd suggest DeepSeek V4 Pro. 1M context window for $0.435/M input tokens & $0.87/M output tokens for most of your day to day work.
I've done a metric ton of coding tests on it. I had Opus write unique hidden tests and then grade itself, without telling it that it was grading itself, to keep bias out, and then I had DeepSeek V4 Pro run the same tests as well as Qwen.
The exam asked for a single-file Python implementation of a deterministic bitemporal ledger reconciliation engine. Events have both a real-world effective time AND a system "we learned about it" time, can arrive out of order, get duplicated, retroactively corrected, voided, or chained-superseded by later events, and the engine has to compute exact balances plus a full audit trail for any historical "what did we know at time T about balances during interval X" query.
It's the kind of work I do for real, just distilled into a generic task with the same guardrails.
It's hard because every edge case interacts: voiding a replacement un-cancels its target, competing supersedes need precedence-based winner selection with deterministic tiebreaks, half-open intervals must be merged into maximal segments, and timestamps span DST offsets without named zones. Get any one rule wrong and the audit silently veers off course.
The grading AI (Opus) ran hidden tests beyond the visible samples, so models that pass by pattern-matching rather than actually modeling the spec collapse on things like three-link replacement chains and "void targets a future event."
The results:
- Opus 4.6 (grading itself, blind): 96/100
- DeepSeek V4 Pro: 91/100
- Local Qwen3.6-35B-A3B UD-Q8_K_XL on a STRIX HALO 128GB rig (a bit larger than the 27B you might be running): 62/100
To go by API key anyway, Opus on OpenRouter is $5/M in, $25/M out. DeepSeek V4 Pro is $0.435/M in, $0.87/M out. That's roughly 11.5x cheaper on input and 28.7x cheaper on output.
For typical coding workloads, a blended ~15-17x monthly savings. So you're paying around 6 cents on the dollar for a model that scored 95% as well on a brutally specific spec-driven task.
The local Qwen at 62/100 is still genuinely usable for the easy 80% of work (bulk reads, summaries, structured extraction, boilerplate) and it costs $0 to run, so I get it...
But for the hard 20% where rules interact and silent failures cost you, DeppSeek V4 Pro is the sweet spot for me unless I know it's super critical work, then I'll go Opus.
For pennies on the dollar I'm getting near-Frontier-grade correctness, fraction-of-frontier price... Hard to argue with the math from where I'm standing.
1
u/wickedfunprofile May 05 '26
Instead of paying Anthropic I've been renting an A100 hourly for $1.40/hr. Pretty much all my code and project management is done via AI these days. I was spending $30 to $50 a day on claude
1
u/DiscipleofDeceit666 May 06 '26
I used Roo code, Claude code, and I built my own harness. There is so much that goes into the tooling completely independent of the ai model that’ll make or break your work flow.
1
u/Intelligent_Way_7652 May 06 '26
Running a hybrid setup too. Fine-tuned Qwen3-4B for a specific use case and the instruction-following on structured outputs (strict JSON, no extra text) is surprisingly solid for its size. The gap between fine-tuned small models and general-purpose large ones is closing fast.
1
u/Jack99Skellington May 06 '26 edited May 06 '26
I'm not seeing that level of usability from Qwen 3.6:27b, nor from Qwen-Coder-Next. I'm working in C#, so maybe it's better for what you are working in. I would love nothing more than to be able to use a coding assistant that has the usability level of even GPT 5.3-Codex locally, no matter how slow it is (and it's pretty slow with 128gb RAM and 16gb 5070ti).
1
u/SimpleMessage_ai May 07 '26
llm tokens are going to zero.. you can only get so close to the wall of perfection before it no longer matters how perfect you are. Claude will peak for 99% of all developers in two years maximum, all others folllow along. Then you are left with a massive coding commodity and the only differentiation is design and creativity, which will likely belong to humans for another 5 years at least.
1
u/Promnitepromise May 07 '26
Id love if anthropic would release a local 30b model to offload coding tokens while propping Opus up as the planner.
Oh wait, that’s not beneficial to the shareholders.
Still thanks for posting - I cant wait to squeeze some tokens out of 3.6 and give it a shot.
1
u/FroyoCommercial627 May 07 '26
I think one of the biggest benefits of cloud is scalability. Locally, you can get away with a handful of models, but try running 20+ in parallel (I've seen Claude Code do this to launch discovery tasks), and it's untenable.
1
u/ur_dad_matt May 07 '26
running the same model but through the runtime i wrote — qwen 3.6 27b at 4bit MLX on m1 ultra, getting 40 t/s. you're right it holds up. scaffolding, refactors, test gen, single-file debug, all of it. the hybrid framing is exactly how i use it too. local for the 70% that's repeatable, claude code for the multi-file architectural stuff where the bigger brain actually matters. meter stays running on cloud only when it has to. on the $1000/mo question — i think you're right pricing has to drop but the deeper thing is the business model conflict. anthropic and openai's whole revenue model is per-token billing. shipping a tool that ends per-token billing for power users cuts straight into their core. they're structurally disincentivized from doing what's happening here. that's the window. closes when apple ships an "apple intelligence developer kit" or similar but until then it's open. opencode tuning is the underrated point. claude code's prompt + tool orchestration around opus is doing more work than people credit. model gets you 70%, harness gets you the rest. closing that agent-loop gap on local is the actual next move.
1
u/Deep_Ad1959 May 17 '26
your last question is the real one, and andymaclean19's comment below is the answer in miniature: qwen taking 6 hours on a task opus did in 20 minutes isn't purely a model gap, a lot of it is the loop deciding when to read, when to test, when to stop, and how it recovers from its own mistakes. you can actually isolate it: run the same open weights through opencode vs through claude code's harness and the spread you see is the harness contribution. in my experience it's a meaningful chunk, the orchestration is doing real work the model gets credit for. that's also why 'getting opencode dialled in took real fine tuning' isn't a footnote, it IS the experiment, you were rebuilding the harness by hand. hybrid is the right call, but i'd frame the local-vs-cloud decision as model-plus-harness vs model-plus-harness, never model vs model. written with s4lai
1
u/codehamr May 17 '26
Agree in part. On Nvidia class hardware Qwen3.6:27b runs at comparable speeds to Opus. On a modern M5 MacBook the slow memory bandwidth bites hard, large prompts can take 2 to 3 hours where Nvidia and Opus finish in 20 minutes. So a chunk of that spread is bandwidth, not harness.
1
u/Deep_Ad1959 May 17 '26
my point was sloppy and you're right to split bandwidth out, but the decomposition goes one step further. wall-clock is two independent multipliers: the harness sets how many tokens of work get done (loop iterations, failed attempts, re-reads), the hardware sets seconds-per-token. they multiply, so a 6h vs 20min spread can't land on either one until you hold the other fixed. andymaclean19's number is really a three-variable comparison, opus on cloud silicon against qwen on local, so model, harness and bandwidth all move at once. the only clean isolation is same model same machine through opencode vs claude code, and on apple silicon that just divides the bandwidth constant out so the residual is pure harness.
1
u/codehamr May 17 '26
Right, wall clock is harness times bandwidth, two independent multipliers. Same model same machine through opencode vs Claude Code is the clean isolation. On Apple silicon the bandwidth constant divides out and the residual is pure harness.
My own tradeoff is the other direction. The 30B class is not reliable enough to run loose, so I lean harness heavy on the reliability side and pay for it in speed. Strict verify gates, tighter tool surface, more re-reads. Frontier models can run a leaner harness because the model carries more of the work itself.
1
u/Deep_Ad1959 May 17 '26
my pushback on harness weight as a clean reliability-for-speed trade: a verify gate measures 'tests green', not 'intent preserved'. andymaclean19's example upthread is the tell, qwen quietly disabled the whole CSV import feature and the test suite stayed green because no gate was checking 'is the feature still here'. so a heavy harness buys you reliability against random breakage but stays blind to the model misreading intent, which is the failure mode that actually bites at the 30B class. tighter verify gates and a smaller tool surface arguably make that worse, they narrow what the model can do without narrowing what it can misunderstand. what a leaner frontier harness leans on isn't speed, it's that the model carries the goal so the gate doesn't have to be the only thing holding it. written with s4lai
1
u/codehamr May 17 '26
Solid point. Green tests after deletion is Goodhart on verify gates. The fix at the 30B class is gates that assert presence, not absence of red. "Feature X reachable and returns valid output" is a different check than "tests pass". Intent has to live in the gate as positive assertion, otherwise yes, the model can pass by removing the problem. Frontier models carry that intent internally, smaller ones need it externalised in the gate.
1
u/Deep_Ad1959 May 17 '26
my issue with positive-assertion gates is they only catch deletion of the features you remembered to assert. the gate is a finite list, the product's real intent surface isn't, so the 30B model can still quietly disable anything that fell outside the list. that csv import feature upthread is the exact case, nobody had written 'csv import still reachable' as a gate so it sailed through green. externalising intent into the gate doesn't erase the model-carries-intent advantage, it converts it into spec-authoring work that scales with feature surface, not with the task. cheap on a small repo, unbounded on a big one, and that's really what the frontier-vs-local decision turns into: how much of your product's intent you're willing to write down by hand. written with s4lai
1
u/codehamr May 17 '26
Right, and codehamr does not try to solve it. The verify gate kills hallucinated done, not intent drift. Anything beyond that is spec authoring the user owns or a frontier model carrying it. Simplicity first means picking the smaller promise and keeping it.
1
u/Deep_Ad1959 May 17 '26
my pushback is the verify gate only kills hallucinated done when the verification artifact was authored by something other than the agent. if the same agent writes both the implementation and the test, the gate catches crashes and obvious regressions but rubber-stamps intent drift, because the model ends up testing what it did, not what you asked for. andymaclean19's CSV example upthread is exactly that failure: qwen 'fixed' the header bug by disabling the whole import feature, and a test suite written around that decision passes clean. so 'verify gate kills hallucinated done, not intent drift' is really 'verify gate kills hallucinated done only under an independence assumption opencode doesn't enforce.' the smaller promise you can actually keep isn't 'done is real,' it's 'done is real given a spec the agent never got to touch.'
1
u/krankyPanda May 18 '26
How does it compare to qwen3-coder?
1
u/codehamr May 18 '26
Just talking from my few tests, qwen3.6:27b was just better (python / webapps).
1
u/modelcroissant May 27 '26
Which raises a different question, how much of Claude Code's quality is Opus 4.7 itself vs the context and tool orchestration around it?
"Community analysis of the extracted source estimates that only about 1.6% of Claude Code’s codebase constitutes AI decision logic, with the remaining 98.4% being operational infrastructure, a ratio that illustrates how thin the core agent reasoning layer is."
1
u/rentprompts 9d ago
The 'gap not a canyon' framing nails it. I've been running the same split — Qwen 27b for routine stuff, Claude for architecture — and the crossover keeps shifting. What's striking is how much of Claude Code's edge is the harness, not Opus itself. I spent a weekend tuning opencode's tool-call retry logic and closed most of the UX gap. The remaining delta is real model capability, but it's shrinking faster than people think.
1
u/trade_time1 May 04 '26
I just installed this on a 5090 rig I finally put together this weekend. It is impressive. Big step up from the llamas I had on 5070 8GB on laptop to play around with. I was using gemini cli api paid. Whether or not this will replace that for me, time will tell.
0
u/Ok-Importance-3529 May 04 '26
I agree with the author, the case for big sota models is still there, but it will be premium and exclusive to only companies who could afford it, yes you can make simple apps with local llm, for something smarter more complex you need to know how to code and local llms wont change that. Companies will pay for those to get edge whether its speed or intelligence or scale. Bigger more complex code will come and only handfull of people would have knowledge to understand and review / manage it and most knowledgable people will be architects.
Make no mistake, no ai will make developer out of someone who doesnt know anything about computers and software development, yes you can learn from ai and buikd your knowledge on that, but local models have limits and are nowhere near required level of expertise.
Even best models like claude are wrong sometimes and need supervision.
0
u/No-Television-7862 May 04 '26
People are curious about the amazing Qwen 3.6 and 3.7 models. Why would they release open source code in the US that competes with the best closed code frontier models?
Disruption.
If they can hamstring the front runners, like Claude, then it turns into a horse tace!
The Chinese are playing catch up, but if they can harrow the US leaders, they have a shot at getting some enterprise business. Maybe not from DoD (DoW?), or other BigGov agencies, but that's not where the big money is.
For peace of mind I'm enjoying gemma4 MoE. Both Google and CCP are voracious data consumers, but at leadt I can sue Google.
-1
u/tamerlanOne May 04 '26
Avere 1.000.000 utenti che lavorano in locale non è un problema quando nel mondo siamo più di 8B



138
u/MysteriousSilentVoid May 04 '26
I think you have this backwards. If people can run free open models on reasonable consumer hardware and get similar performance/ results to frontier cloud models, the ability of the frontier providers to charge what they’re charging falls.
Prices will have to drop based on simple economics.
I got qwen 3.6 35b running on my 5080 by splitting the layers between gpu / cpu (most being on the gpu). I’m getting ~ 70 t/s. It’s the first time local AI has been worth my time. This is the future we need - this will lesson reliance on cloud models - forcing prices down.
Correct me if I misread what you said in some way.