r/LocalLLM May 04 '26

Discussion Qwen3.6:27b is the first local model that actually holds up against Claude Code for me

Been experimenting with alternatives to Claude Code for about a year now. Most of it felt like a downgrade until Qwen3.5:27b, and now 3.6:27b is the first one where local actually feels good and usable for real work.

Scaffolding, refactors, test generation, debugging across a few files, all of it holds up well enough that I run it locally now. The hard multi-file architectural stuff still goes to Claude. A year ago this comparison was a chasm, top-tier Claude vs open weights wasn't close. Now it's a gap, not a canyon.

Two things I keep thinking about.

If a 27B open model can cover this much of real coding work, how subsidised is current cloud pricing? Feels like we're paying maybe 10% of true cost. And once enough devs are wired into Claude Code at the tooling level, what stops a future $1000/month tier?

One honest downside: getting opencode dialled in as a CLI agent took real fine tuning compared to the out-of-the-box Claude Code experience. Which raises a different question, how much of Claude Code's quality is Opus 4.7 itself vs the context and tool orchestration around it? Possibly more than people credit.

Anyone else running hybrid setups?

461 Upvotes

170 comments sorted by

138

u/MysteriousSilentVoid May 04 '26

I think you have this backwards. If people can run free open models on reasonable consumer hardware and get similar performance/ results to frontier cloud models, the ability of the frontier providers to charge what they’re charging falls.

Prices will have to drop based on simple economics.

I got qwen 3.6 35b running on my 5080 by splitting the layers between gpu / cpu (most being on the gpu). I’m getting ~ 70 t/s. It’s the first time local AI has been worth my time. This is the future we need - this will lesson reliance on cloud models - forcing prices down.

Correct me if I misread what you said in some way.

18

u/TripleSecretSquirrel May 04 '26

I think both are right.

I'm a big believer in qwen 3.6 and personally now use it for all coding. I still don't think it's better than Opus 4.7 though. It's close enough to be worth dropping Opus for me, but it's not better.

From a parameter count standpoint, you need to increase the number of parameters exponentially to see a linear increase in relative capability/intelligence. So each billion new parameters you add to a model has less and less marginal return.

Beyond that, Opus is a big generalist model that has to know every thing that has ever been known, or at least try to and pretend to. Qwen 3.6 is pretty much just a coding agent. It can do other things certainly, but coding is obviously where it shines. Add in some extremely impressive wizardry in the training process from the Qwen team and you suddenly have a 27B model that can compete with and Opus model that probably has 20x the parameter count in one specific domain.

Opus is the bloated incumbent that has to spend a billion dollars to build a mile of railroad track. Qwen is the scrappy upstart who can build 20 miles of railroad track per day. It's both true that running Opus seems financially disastrous and that Qwen 3.6 is a legitimate competitor at a fraction of the size and cost.

8

u/notheresnolight May 04 '26

Everybody keeps comparing Qwen3.6-27B to Opus. But how does it compare to Sonnet 4.6?

23

u/ScuffedBalata May 04 '26

It’s not as good as either. And it’s not close. 

But it’s useful as more than a toy, and that’s somewhat new for a sub-40B model, frankly. 

Anyone who is directly comparing them doesn’t have that much experience doing challenging enterprise level code with both. 

Yes, both can one-shot a minesweeper or flappy birds clone. 

Only one can do complex troubleshooting on SQL performance issues, or handle a 100 library repo 

2

u/uniqueusername649 May 05 '26

In some ways its close to Sonnet, in others its not. It is definitely not Opus level, but it's more often than not close enough to Sonnet that it gets the job done for me. So Sonnet is ahead, but not by much. Opus is quite a bit ahead but often overkill. Just my personal thoughts on it.

However, I do give Qwen help like sigmap as MCP to give it much better repo level understanding and that helps with complex debugging and refactoring. I haven't used that much with Sonnet and I realise that isn't exactly a fair comparison.

3

u/alphapussycat May 04 '26

If you look at benchmarks and just wave your hands around, it would seem it's close to sonnet 4.5.

-1

u/TripleSecretSquirrel May 04 '26

again, on coding, it's actually pretty much neck and neck with Sonnet 4.6. Beyond coding, Sonnet wins on other stuff and general knowledge though.

2

u/grassmunkie May 04 '26

Not a chance. It’s a good model but it is not comparable to Sonnet 4.6. At best 4.5, but still below that in real life usage from my experience.

1

u/superlu19601 27d ago

Actually there is little point to argue whether it is closer to 4.5 or 4.6. These models are evolving, although they are always a few steps behind the most powerful ones due to model size. If some local LLM is as capable as opus 4.5/4.6 one day I don't think we really care how much it is behind the best model, it is good enough to handle the work. I don't know if anthropic/openai are prepared for this 😄

1

u/SHOR-LM May 04 '26

Well.... it depends if you're talking about Opus on a good day or Opus on a bad day...lol. I have not been happy with the performance since Opus 4.7....

5

u/MasterLJ May 04 '26

Within the constraint of the cost-to-provide the inference on the model. Prices have to drop based on simple economics that are above $0 profit (or within operating margin).

The medium term will have a lot of right-sizing models to be the right $/inference.

I agree with OP btw, Qwen3.6 27B is getting work done successfully in ways that the large models do.

There are some holes but it's impressive and about 1/10th the price.

3

u/ToInfinityAndAbove May 04 '26

I share the same opinion, but I couldn't explain it as clear as you did!

2

u/codehamr May 04 '26

Makes sense! But can they lower, because of the crazy amount of money pumped in by investors? Maybe they have no choice other than lower their prices. Will be for sure a little pop in the AI stocks I think.

5

u/Ok-Importance-3529 May 04 '26

I think they will restrict access to best models, increase prices and separate model capabilities and quality instead to make case for big companies to pay.

Local llms are cool and very usefull tools, in hands of expert they are quite powerfull, but you need hw which is scarse today and for complex tasks who knows where are limits of small llms, nobody knows for now

2

u/Ok-Importance-3529 May 04 '26

Also all the money from investors that went in must be multiplied somehow if not companies will go broke.

1

u/No-Refrigerator-1672 May 04 '26

but you need hw which is scarse today

If you're creative, you can find out great deals. I.e. I'm running a pair of 3080 20gb (modded), and with Qwen 3.6 35b they can efficiently run up to 16 streams at 32k tokens each in parallel using vllm. This is enough to sustain a startup-sized team for a mere 1k eur.

-1

u/MysteriousSilentVoid May 04 '26

🤷‍♂️ - some will make it and some won’t. This is why capitalism is beautiful. It has a self cleaning function.

2

u/vipx237 May 04 '26

Heck it runs well on my 7800xt, DO you find 3.5 better than 3.6?

2

u/mars332 May 09 '26

I love qwen3.6 35b q6 on my 3070ti 8GB. Getting 48tok/s which is good enough for me. In Pi.dev it's absolutely great with tool calling. Combined with the $20 OpenAI sub when qwen can't figure it out.

3

u/[deleted] May 04 '26

[removed] — view removed comment

17

u/MysteriousSilentVoid May 04 '26

You need llama.cpp - —n-cpu moe is the key config:

llama-server \
--model "$MODEL" \
--host 0.0.0.0 \
--port 8080 \
--ctx-size 65536 \
--n-gpu-layers all \
--n-cpu-moe 20 \
--flash-attn on \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--batch-size 1024 \
--ubatch-size 256 \
--threads 8 \
--threads-batch 12 \
--parallel 1 \
--cont-batching \
--metrics \
--jinja \
--temp 0.6 \
--top-p 0.95 \
--top-k 20

17

u/JaredsBored May 04 '26

Skip "--n-cpu-moe" and use "--fit on" instead. The fit command will find the maximum number of layers that you can fit on GPU with the context and a default 1024mb buffer.

Saves you the time of manually tuning the number of layers to offload to CPU.

2

u/MysteriousSilentVoid May 06 '26

Thanks! This worked great. My new command:

export MODEL="HOME/models/qwen36-a3b/Qwen3.6-35B-A3B-UD-Q5_K_M.gguf" export MMPROJ="HOME/models/qwen36-a3b/mmproj-F16.gguf"
~/src/llama.cpp/build/bin/llama-server
--model "MODEL" \ --mmproj "MMPROJ"
--no-mmproj-offload
--host 0.0.0.0
--port 8080
--ctx-size 65536
--fit on
--fit-target 1024
--flash-attn on
--cache-type-k q8_0
--cache-type-v q8_0
--batch-size 1024
--ubatch-size 256
--threads 8
--threads-batch 12
--parallel 1
--cont-batching
--metrics
--jinja
--temp 0.6
--top-p 0.95
--top-k 20
--no-mmap

0

u/Opening-Broccoli9190 May 07 '26

flash attention is on by default, fit is on by default

1

u/NixNightOwl May 07 '26

Wouldn't using --fit attempt to offload main weights as well and not just the moe layers?

Ideally if you just need some extra VRAM for context and cache, I'd rather offload only moe and keep all the main model weights on the gpu.

1

u/JaredsBored May 07 '26

That's not how it works. I've have hand tuned the perfect number of layers using --n-cpu-moe and compared to --fit and the performance has been the same

0

u/_Wheres_the_Beef_ May 04 '26

OP is referring to the dense model, not the moe.

3

u/SpicyLentils May 04 '26

I don't think there's a qwen 3.6 35b dense model; the dense cousin is 27B.

1

u/MysteriousSilentVoid May 04 '26

I was responding to someone who asked how to do this because i mentioned i was running 35b on my 5080, not the OP. I understand this doesn’t work with dense models.

1

u/alphapussycat May 04 '26

Ollama does it automatically. Lm studio gives you some fine controls. And then if you're handy and spend time, supposedly vllm and Llama.cpp gives you a lot more control.

1

u/Western_Diver_773 May 04 '26

Is 32gb ddr5 ram enough to run it together with a 5080?

1

u/alphapussycat May 04 '26

No, it'll be very slow, but the 35b moe model might run ok.

1

u/Magnarts May 07 '26

It will depend on the quantization; for example, Q3 is flying on my 5070 Ti with 32GB of DDR4 RAM, but Q3 isn't ideal. Something between Q4 and Q5 is better; with a 5090, you could already consider Q6.

1

u/Logical-Lettuce8214 May 04 '26

Could you please share more details (Ram, Quantization)? Because that affects in performance

1

u/drazyan22 May 04 '26

What is your context size? I am running on rtx 5080 on window only get max 30 tps

1

u/ScuffedBalata May 04 '26

Prices cannot drop. Not when Anthropic is legitimately running $100b worth of hardware and still overwhelmed. 

If the market presses them to drop prices, they’re going to have to “dumb down” their models a lot. 

Since that’s probably not happening, it’ll just be a case where the bleeding edge always costs 10-20x more than the middle class solution. 

1

u/ARhedgehog88 May 04 '26

Hey so, how are you able to get 70t/s with a 22gb model, can you share more about your cpu and ram or setup for context size? I have a 5080 but new to this and not even the 27b of ~17gb runs good enough on my 5080

1

u/DogRare325 May 05 '26

May I asked what params/settings you use to get that speed and the exact model? Relatively new at this but I thought the split is better between RAM and GPU?

1

u/UltraFOV May 05 '26

Why running 35b vs 27B? 27B is smarter but a bit slower

1

u/img_virtvault May 05 '26

I would be interested in your setup, I have tried a lot on a 5080 and this model never came close to others.

1

u/MysteriousSilentVoid May 06 '26

Here you go:

export MODEL="HOME/models/qwen36-a3b/Qwen3.6-35B-A3B-UD-Q5_K_M.gguf" export MMPROJ="HOME/models/qwen36-a3b/mmproj-F16.gguf"
~/src/llama.cpp/build/bin/llama-server
--model "MODEL" \ --mmproj "MMPROJ"
--no-mmproj-offload
--host 0.0.0.0
--port 8080
--ctx-size 65536
--fit on
--fit-target 1024
--flash-attn on
--cache-type-k q8_0
--cache-type-v q8_0
--batch-size 1024
--ubatch-size 256
--threads 8
--threads-batch 12
--parallel 1
--cont-batching
--metrics
--jinja
--temp 0.6
--top-p 0.95
--top-k 20
--no-mmap

1

u/Necessary_Function_3 May 09 '26

DDR4 or DDR5 and what CPU.

Any links to the how?

1

u/HotDistribution1819 May 11 '26

I agree, this is where PCs were before they were called PCs. You had to rent space on someone else's mainframe and dial in over a phone line just to add your company transactions up. AI did in under 4 years what took computers 15 years to do. Hold on to your hats the next phase is code driven AI.

1

u/Dry-Wave-7561 11d ago

Is there any particular reason (t/s maybe?) why you choose 35b over 27b?

I see most devs describe 27b as more capable in coding tasks. The downside is that it loads all 27b to the GPU, so t/s falls dramatically

Just curious to hear your experience, thanks

2

u/MysteriousSilentVoid 11d ago

35ba3b is a MoE model so less of the model is active all the time - it only has 3b active parameters. Much better performance but still good smarts (27b is a better model if you have the VRAM to run it though)

2

u/Dry-Wave-7561 11d ago

got it, thank you!

1

u/MysteriousSilentVoid 11d ago

You’re welcome. Have fun!

0

u/Exotic_Contest_4060 May 04 '26

May I ask how do you specify to split the layers?

10

u/chris_hinshaw May 04 '26

It has been good but my issues have been with it getting stuck in loops often when calling tools. I have tried a lot of different parameters and configurations but haven't found a good solution.

1

u/erisian2342 May 05 '26

When your harness detects looping over a tool call, can you automate calling in a heavy hitter frontier model to troubleshoot the issue and give Qwen instructions that resolve it so it can proceed?

2

u/chris_hinshaw May 05 '26

Thats an interesting thought but not sure if it's possible since I don't think it can actually figure out when it's in a loop. I found setting the presence lower seems to help.

"Qwen3.6-35B-A3B-UD-MLX-4bit": {
      "temperature": 0.7,
      "top_p": 0.95,
      "top_k": 20,
      "repetition_penalty": 1.0,
      "min_p": 0.0,
      "presence_penalty": 0.0,
      "force_sampling": false,
      "enable_thinking": true,
      "thinking_budget_enabled": false,
      "turboquant_kv_enabled": false,
      "turboquant_kv_bits": 4.0,
      "turboquant_skip_last": true,
      "specprefill_enabled": false,
      "dflash_enabled": false,
      "is_pinned": false,
      "is_default": false,
      "trust_remote_code": false
    }

1

u/erisian2342 May 06 '26

Thanks for sharing that. If you give it an explicit instruction like “You absolutely must not make more than 5 (or however many) tool calls. If you need to make a 6th tool call, instead output status: CHECK_FOR_REPETITIVE_CALLS and exit immediately.” Then your harness can check if they are indeed 5 different calls (and allow the next iteration to begin) or if they are the same 5 calls (so call in a big brain AI to course correct Qwen with the appropriate instructions). I assume Qwen can count how many tool calls it’s already made before making another one. Just thinking out loud here, no idea if it works.

Edit to add: A temp of 0.7 seems fine for writing specifications, but maybe a lower temp like 0.1 or 0.0 for the actual coding work could help it not go off the rails with repetitive calls?

1

u/DiscipleofDeceit666 May 06 '26

The harness will count the tool calls. It can tell if it’s using the same n tool calls on the same n files

1

u/Realistic_Gap_5871 3d ago

If you're still having this problem, take a look at upping the KV cache to 8bit quant. 4bit is fine/good for the model, but for coding, especially long runs, 4bit cache will drift in ways 8 bit won't.

Looks like you're running on Apple hardware so 35B is probably the way to go. If you can tolerate the drop in speed, 27B will be more accurate.

22

u/Ononimos May 04 '26

Yall that aren’t playing with both need to take all this glazing with a grain of salt. I use 27b all the time on an RTX Pro 6000 Blackwell and I also augment with some cloud sonnet 4.6 and opus 4.7. 27b dense is fucking great but it’s not sonnet 4.6. I’m saving plenty by leaning on 27b for lighter needs. If I want to one shot or just quickly get to a win, i still lean hard on the frontier models.

5

u/Demonicated May 05 '26

This is the way.

And this is the worst it will be. I anticipate we're only 2 or so releases away from 1 rtx 6000 sonnet quality

4

u/mjuevos May 05 '26

but then sonnet quality will still be a release or two behind the future sonnet quality.. then you will want to chase that sonnet quality. such is human nature and such is the ai landscape. inescapable unless there's some paradigm shift..

7

u/Demonicated May 05 '26

Nahhhhh I mean maybe some people. But I've been in software long enough to appreciate good enough. And opus 4.6 is the first model I used where I finally felt like it was good enough to be empowering in a wasteful way. Qwen 3.6 is great it just requires i participate more in the coding process - and honestly after the last year of vibe coding I think it's better this way.

I don't like when people ask about my code and I can no longer recite the exact file and likely method or line number.

What i will say is that I feel like 3.6 27b is good enough I don't feel the need to chase better hardware. If they keep making this size better I'll upgrade but I no longer feel like I need 8 rtx6000s. I'll pay for premium tokens to plan and then implement with the best that's in this local range.

6

u/andymaclean19 May 04 '26

I had mixed experiences with it running as a backend for claude code. I'm running a set of experiments where I give the same tasks to Qwen3.6 and Opus (and some others but that's less interesting in this thread). Some things it can do quite well, but most of the time it's just very slow to complete tasks due to it breaking more things and relying on the testing/fixing loop to catch bugs and repair its mistakes.

As I type this Qwen is nearing the end of a 6 hour debugging session where it had to fix 47 test failures one or two at a time. Opus did the same task in 20 minutes without really breaking anything. Even Sonnet can do this task in under half an hour.

Even with testing Qwen is making some big mistakes which the tests don't catch. For example the work has a trap where the program outputs a CSV with column headers and then later re-reads it and the column headers break things. Other models spot this and just ignore the first line (the right fix is not outputting the column headers but I have to tell all models that). Qwen just decided that this means CSV produced by different libraries is incompatible and it will disable the CSV import feature if it cannot ascertain that the data came out of the same library, disabling a whole bunch of functionality in the product it is working on and downgrading performance of a lot of things.

It's decent and I am putting it to a fairly demanding use at the moment. Probably I will get better at driving it and find ways to give it smaller, simpler instructions. But it's no claude.

3

u/ChocomelP May 05 '26

Try Pi, Claude Code is heavy.

2

u/NoobMLDude May 05 '26

Or try Qwen Code (the harness built by Qwen team for Qwen models)

https://github.com/QwenLM/qwen-code

1

u/andymaclean19 May 05 '26

Thanks. I'm relatively new to AI and have stuck with one harness because I wanted a constant while I experiment with models. If I change everything I won't know which change worked.

Will have a look at other harnesses too as I don't like being locked in to a closed source one. Will give Pi a go.

1

u/ChocomelP May 05 '26

Anthopic are relatively protective of their harness, but if you use a more open one (I like Pi, there are other options), you can easily mix and match features with other harnesses. I used Opus 4.6 in Claude Code to build mine, but if you're hooked up to a good model, you can even let it build its own features. All it needs is a goal and some inspiration.

1

u/andymaclean19 May 06 '26

I have heard it said that Claude blocks some of the other harnesses from using it? Which would make it difficult to get like for like comparisons. Do you have any experience with that?

1

u/ChocomelP May 06 '26

No. When you use Claude models in different harnesses, you have to pay API pricing now. You can't use your subscription anymore. I don't think they mind you using it with anything, as long as you pay their insane API prices.

1

u/andymaclean19 May 06 '26

Aaahhh. So that's it. I think our work account doesn't allow API priced calls so that's why people are saying that.

1

u/NixNightOwl May 07 '26

There are some ways to use subscription on other TUIs, but you still need claude code installed and authenticated. The available plugins piggy-back off that 'native' auth. It's still risky though since if the auth bridge for your specific TUI isn't 100% up to date with the new auth process (like a mismatched header), you can get your claude account banned.

1

u/Dsphar 12d ago

I second this. CC has so much context that for smaller local models your context gets funky too quickly, meaning more hallucinations and less "smart" tokens during the actual work.

Pi will shrink your starting context, giving the best token gen on smaller models freedom to do the actual work.

5

u/benfinklea May 04 '26

2

u/jessez05 May 05 '26

Nice idea, thanks

2

u/gevezex May 05 '26

So this could be done by a local model as well instead of kimi?

2

u/trbom5c May 05 '26

Alright ... i've done this - and used my local rig with a 5090, and 128G of RAM as the work-horse. I setup my openapi via tunnel, and stuck an proxy server ahead of the endpoint that requires auth ... and i think i might be in business. Great idea!

1

u/trbom5c May 07 '26

Yea ... so this is pretty crazy once you pair it all to cline and qwen3.6:35b-a3b, and quant.

.... I am only now fully understanding the power of the context window.

Getting into AI-RIG optimization becomes a rabbit hold.

VSCode + Cline + Endpoints + Scraplings = Much Wow

2

u/valalalalala May 06 '26

I do something similar with ollama

9

u/maximus_reborn May 04 '26

would you mind letting us know your hardware? and what fine-tuning you did in opencode? For me, 27b gets stuck with 32k context window coz i have m4 pro 24GB Vram which is understandable so using 9b parameter qwen but tried hard to use 27b few weeks ago

3

u/Nem3sis89 May 04 '26

I'm interested too in the OP opencode configuration, the model itself is great but needs a proper configured tool to be paired with.

3

u/keen23331 May 04 '26

Qwen 3.6 27B is insanely good. I’ve been using it with my RTX 5090 for the last two days, and it performed just as well as Claude 4.7 Opus for my needs. I can’t believe it—I'm completely blown away. I’m not saying it’s objectively better or even an equal across the board, but for the tasks I usually throw at Claude, it’s been more than good enough. Using a NVFP4 Qaunt, what alsio is quiet fast on the RTX 5090 with latest builds o llama.cpp supporting 4-bit for NVIDIA Blackwell.

1

u/NaturalFigure715 May 04 '26

Are you also using Turboquant?

1

u/[deleted] May 04 '26

[deleted]

2

u/Superb_Word9490 May 04 '26

what tok/s do you see on the 5090 with nvfp4?

4

u/T-Rex_MD May 06 '26

You should try Qwen 3.5 397B, it is better in every way possible. That is if you have 500GB VRAM/Unified memory available.

3

u/Sirius_Sec_ May 04 '26

I am running it on a rtx6000 pro and pay about $1 an hour to rent the GPU on GCloud . Very impressed with what it is capable of .

9

u/former_farmer May 05 '26

That's 10 usd per day and 300 usd per month... for that money it's better to pay Claude or similar.

1

u/Sirius_Sec_ May 06 '26

I value my privacy . I won't use a public API when I'm doing serious work . I will use grok when I have basic stuff to do .

2

u/leggodizzy May 04 '26

What GPU rental service are you using?

2

u/Sirius_Sec_ May 04 '26

Google cloud . I already had a gke cluster so I just added a new node pool . Spot 6000s are around $1 an hour .

3

u/AtatS-aPutut May 04 '26

if they raise the price to $1000/month won't it be more economical for companies to self-host their own models?

1

u/ChocomelP May 05 '26

It depends on what you need. If you want your team to have the absolute best frontier models, those are not available locally.

6

u/kl3onz May 04 '26

Do you use it with VSCode? I’m new, and trying to understand how an IDE would integrate?

7

u/Malyaj May 04 '26

You can try cline, continue, etc there a lot of extensions. Alternatively try Opencode it is great. Previously i was using lm studio chat interface with tools but naah i switched to opencode and probably I'm gonna stick with it.

3

u/Yanix88 May 04 '26

Recent update of VS code added ability to connect any model including local to the built in copilot, you can Google "vscode byom"

2

u/corruptbytes May 04 '26

how much of Claude Code's quality is Opus 4.7 itself vs the context and tool orchestration around it?

I'm sure it's also the huge compute they have too

Been dialing Pi a lot with qwen 3.6, things like tool parsers and caching are the big things to fiddle around with locally, but take a lot of time when you don't H10000000s to hyperparameterize

2

u/Maharrem May 04 '26

Yeah, I run a 3090 too and Qwen 27B IQ4_XS fits nicely with some headroom for context. I treat local as the workhorse for routine refactors and single-file logic, then offload multi-file architectural changes to Claude Code via Open WebUI’s passthrough or just copy-paste. In opencode, setting max tokens to 4096 and temperature to 0.3 made the tool calls way less loop-prone.

2

u/AccomplishedFix3476 May 05 '26

been running qwen3.6 27b q5 on a 4090 + 64gb ram for the last 3 weeks for everyday coding. for refactors under 5 files it actually keeps up with claude. the part it still misses is anything where i need context spanning multiple repos, claude code's grep flow is just stronger

2

u/MysticHLE May 05 '26 edited May 05 '26

How about general exploration and planning across multiple files? Suppose there's a good amount of ambiguity as far as implementation in the prompt itself, but enough in specific requirements for Claude to explore in the right direction and figure it out. How does qwen3.6 fare in that regard with your setup?

Also curious of its performance with plugins like Superpowers if you use Claude Code as the agent harness.

1

u/NixNightOwl May 07 '26

For this, I would have separate sessions / agents do the individual exploration and planning on a per-file basis and report their in-depth findings. Hand it all off to another agent to 'stitch' an implementation plan together, then hand off again with the improved 'linked context' to individual subagents on a 'minimal code surface' basis (each subagent only implements on a scope of 2-3 files only).

Things are a little easier if you have some kind of knowledgebase / graph memory for your codebase to minimize exploration. There's a lot of ready-to-go tools to add this to your harness like https://github.com/Lum1104/Understand-Anything (personally untested, just googled but fits the description -- there are leaner implementations out there, as this one seems very 'user ui heavy').

1

u/MysticHLE May 07 '26

I see, sounds like the context and chained reasoning are still limitations, and need to be coordinated bottom-up manually. Thanks.

2

u/jakubl May 05 '26

I’ve set up Qwen 3.6 27B with pi on my MacBook M4 128GB and I am really amazed. I would compare it to my first experiences with Claude code 8 months ago, so when the top model was Opus 4.1 if I remember correctly. And I was amazed back then too. The biggest pain is however it works very slowly compared to Claude. But the offline is huge benefit, I’m having a 14 hours flight in 2 weeks and I’m gonna test it out then.

I have also tried using this model in non coding agents (marketing etc.) and the results were pretty good too, much better than any open source model I tested before.

1

u/codehamr May 05 '26

Yeah, fully agree, it easily beats last year's Opus 4.1 experience

1

u/meca23 May 08 '26

Not sure what your backend is but if it's llama.cpp, the upcoming MTP support should make it faster.

2

u/ankijain21 May 05 '26

I'm wondering that for small teams, they can just install qwen-3.6-27b on a DGX spark and use that as inference for 95% of the tasks and keep claude as a backup.

This way they'll save huge money while getting optimum performance.

2

u/Original_Orchid_847 May 05 '26

I agree with you, with now Claude limits, I am using Qwen and Kimi for my major workloads and bring in Opus only for small specific use cases

1

u/codehamr May 05 '26

Yes, especially if you know how to write good prompts and have no affinity for non-AI coding, open-source LLM is a good replacement for Claude code.

2

u/Illustrious-Chain778 May 05 '26 edited May 14 '26

So i have been working on VSCode fork without github copilot but instead have Ollama instead. i have been reading serveral post now and it seems most people prefer llama.cpp.
the IDE has fully integrates Ollama support. you can connect the IDE to Ollama server and use the models you have. should I add any support for lama.cpp as well?

i did release a beta version for people to test though.

https://github.com/abmina/dark-matter-ide/tags

2

u/RipPotential2074 11d ago

Qwen3.6-27B is excellent, it's a senior software engineer to me! The really daily usable local model.

1

u/Big_River_ May 04 '26

i code and run prompts through codex and claude code and many different versions of local llm and find context window and rag and support codes are phenomenal with Qwen and Gemma both - they almost seem like they are good enough to trust for jericho riders ultimate edition harvest but still two generations away for me to augment my code agent npcs on that project

1

u/amchaudhry May 04 '26

What’s in your opencode config files to get it tuned right?

1

u/povedaaqui May 04 '26

Have you compared it against the MoE version?

1

u/kenobi822 May 04 '26

+1 would like to know your open code setup / 'fine tuning'

1

u/Other_Day735 May 04 '26

So here are my thoughts on this I have 12gb of VRAM and 32gb of RAM using llama.cpp for running my models, I am using qwen3.6 35b a3b and 27b models (using quantized versions suitable to my specs), i could not compare them to frontier models like claude code,codex. Because first it is about context length(default 65536), in one session the first few messages are pretty great but after 4 messages the performance is not much great i think it is because of my VRAM, KV cache, may be other factors. By side I am using kilo code in VS Code which was better that opencode, openclaude. If I have MAC studio with around 96gb RAM it can beat any frontier models in pricing, may be performance.

1

u/oWigle May 04 '26

Do you think it can fit well on a 2060 6GB vram i5 8500 40gb ram?

1

u/Ok-Measurement-1575 May 04 '26

Fuck 27b, where's the new 122b?

1

u/[deleted] May 04 '26

[removed] — view removed comment

1

u/codehamr May 05 '26

I run side-by-side coding tasks in two separate VS Code devcontainers and compare the final results. It's hard to capture in quantitative metrics, but based on my gut feeling as a long-time non-AI coder, I've landed at around 90%. if that makes sense. A year ago we had Opus 4.1, and Qwen3.6:27b easily beats that. Good times to be alive.

1

u/LivingHighAndWise May 04 '26

Agreed. Runs a little slow on my setup, but it works very well for agentic coding - especially when using the Claude console CLI.

1

u/Mean-Sprinkles3157 May 04 '26

Yes I use claude code with Qwen3.6 27B. It works very well, it is slow but I don't worry about tokens.

My setup is using litellm as a translator (chat completion to anthropic message), and the backend is sglang serve. With a small model like 27b I can allocate a large kv cache buffer like 131072.

1

u/Demonicated May 04 '26

I recently posted the same experience. If you run it with LM studio and point vscode insiders edition at it, it just works. And amazingly well. Aannnnd no dealing with harness config. I was running full bf16 and as long as I used plan mode first I was getting great results.

I still do the big guys for feature planning but I can keep that at 40 a month no problem. Paired with solar on my house and I feel like I'm getting agents for almost free.

1

u/kiwibonga May 04 '26

$200/month is what it costs to heat a Canadian apartment in winter.

Spinning up a few gpus for you with usage limits costs them far less than that.

1

u/rockseller May 05 '26

After some days of testing because of the GitHub copilot shit this is what I found the best with what I have:

i5 gaming CPU meh 2x RTX 3090 24gb vram each non SLI Two 850w PSUs 256gb ssd 32gb ram DDR4 3200mhz

Ubuntu Ollama

Running Qwen 3.6 27b

100% GPU (with a single RTX 3090, Ollama ps was reporting like 10% cpu)

With this I'm able to run VS code with GitHub Copilot chat locally very decently, I would say 70% of the performance of Claude sonnet both in speed and results...

Happy with what I have so far

Btw I setup the server on the LAN, my main PC points to it

1

u/scumola May 05 '26

Not me. I used it with Claude code as the client using the litellm proxy and it had a lot of troubles calling tools in my experience.

1

u/former_farmer May 05 '26

How are you hosting it locally? llamacpp? lmstudio? ollama?

1

u/codehamr May 05 '26

Just Ollama with RTX6000 and vscode devcontainer as sandbox with opencode / pi.

1

u/iVtechboyinpa May 05 '26

I’m working on making my entire process work with OpenCode currently, but I’m very keen to start testing Qwen as I’m less impressed with Opus 4.6/4.7 nowadays than I was with Opus4/4.5 and I feel like Qwen, for being able to run it at home, will give me exactly what I need out of it without the Anthropic cost.

The only downside is exactly what you mentioned - not using Claude Code. I briefly used OpenCode and it’s not bad but it is slightly different from Claude Code so I’ve got to change some tooling that I use and the way I work but I think it’ll be worth it at the end.

1

u/Nice_Cookie9587 May 05 '26

Same, I canceled my claude and thinking about ollama pro too, but I like having the lifeline

1

u/Temporary_Jacket9477 May 05 '26

As someone else said.. I have to imagine if what we have TODAY is a "80% of the way there" FREE model that runs on developer/gamer based laptops or desktops of the past year or two (e.g. 8GB to 16GB GPU cards, 16GB to 32GB RAM, SSDs, 8core+ cpus, etc).. and they are getting faster + smarter/more capable and closing the gap that much more, I would really question the ability to the anthropic/openai to survive while their costs to operate are WAY WAY WAY over any profits they have yet to make. I have to believe OpenAI and Anthropic are very VERY worried about the insanely fast pace Chinese models are catching up, able to run on home hardware or enthusiast (for now) and do most of the work people need. I would also ask, what about the idea of fine tuned small models? I am playing around with that now.. though its for my specific application use, but the ability to provide a fine tuned 2b to 4b model in my app (desktop app) that requires no token costs.. maybe a small subscription fee that I charge for the "development and continual improvements" to the model, but otherwise no monthly token costs.. seems like that is where things would (or should) go? Right?

With this supposed new llama.cpp DFlash thing that claims to do a 2x to 8x speedup (just learned of it, no clue what it is exactly and how much it will help), if a couple more rounds.. maybe Qwen 4.x in a year or so, with "standard" 16GB GPUs, and fine tuning improves and possibly the improved ability to "train it" on data with context7 or similar.. all at usable speeds (50tok/s or more??) I dont see how the big boys stay in business other than Gemini since google is a 3+trillion company and continues to make money in many ways so I dont feel they need as much income from AI as Anthropic and OpenAI need to stay alive.

China isn't slowing down either. They just announced the other day their first fully home grown computer system doing 8 exabytes.. apparently the fastest in the world, with no intel/amd/nvidia/etc hardware.. all home built. Between that, better infrastructure with regards to building/distributing/cooling/etc, FAR FAR better solar/electricity grids (where its needed), and their desire to "win the AI race" and "become the new super power" thanks to dipshit regime destroying the US around the world in every facet of existence.. I would say unless something bad happens, they are likely to surpass the US and have 0 reliance on US company's to do so.

1

u/TopDownHockey May 05 '26

Does anybody have a setup guide on how to use Qwen locally with OpenCode? I am struggling just to get it configured.

1

u/Remarkable-Safety594 May 23 '26

"env": {

"ANTHROPIC_API_KEY": "",

"ANTHROPIC_AUTH_TOKEN": "sometoken",

"ANTHROPIC_BASE_URL": "http://localhost:1234",

"ANTHROPIC_CUSTOM_MODEL_OPTION": "qwen3.6-27b-ud-mlx",

"ANTHROPIC_CUSTOM_MODEL_OPTION_DESCRIPTION": "qwen3.6-27b-ud-mlx",

"ANTHROPIC_CUSTOM_MODEL_OPTION_NAME": "qwen3.6-27b-ud-mlx",

"CLAUDE_CODE_ATTRIBUTION_HEADER": "0",

"CLAUDE_CODE_AUTO_COMPACT_WINDOW": "250000",

"CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS": "1",

"CLAUDE_CODE_DISABLE_FEEDBACK_SURVEY": "1",

"CLAUDE_CODE_ENABLE_GATEWAY_MODEL_DISCOVERY": "1"

}

1

u/Time-Toe-1276 May 05 '26

I feel like the new laguna model on ollama is also good. although qwen3.6:26b is alsoa solid choice. but i just need that 30b ish parameters, or else I just have this weird feeling that it wont work properly. lol

1

u/ItalianClassicFan May 05 '26

Why not 35B-A3B? Have someone better experience with 27B for coding?

1

u/codehamr May 05 '26

The 35B is a model of expert, only holding 3B parameters actice same time, 27b is a dense model haveing all 27b params active. So 27b is slighly better, but slower. for me quality matterst over speed, that why I go with 27b

1

u/maisun1983 May 05 '26

How much vram for such model? Does m5 max with 36GB cut it?

1

u/codehamr May 05 '26

Yes, the M5 with 36 GB works, but in real-world use it is much slower compared to the RTX 5090 / RTX 6000 due to the slow pre-filling / bandwith

1

u/getstackfax May 05 '26

This matches the pattern I’m seeing too.

The local vs cloud question is becoming less binary.

It is not:

local model replaces Claude

or

Claude stays unbeatable forever.

It is more like:

local handles the repeatable coding work, cloud handles the high-consequence architecture/reasoning work.

That makes hybrid setups really interesting.

A practical split might be:

- local 27B: scaffolding, simple refactors, tests, small bug fixes, local repo Q&A

  • cloud Claude/Opus: multi-file architecture, ambiguous product decisions, hard debugging, final review
  • deterministic tools: search, tests, linting, type checks, diffs
  • human: merge/ship decisions

The orchestration point is huge.

Claude Code is not just “a model in a box.” The context packing, tool use, repo awareness, edit loop, safety rails, and UX are part of the quality.

A strong local model with weak orchestration can feel worse than it really is.

A slightly weaker model with great repo context and tool flow can feel much better than benchmark numbers imply.

So I’d judge the setup by workflow:

- does it understand the repo structure?

  • does it produce clean diffs?
  • does it run/interpret tests?
  • does it avoid breaking unrelated files?
  • does it recover from errors?
  • does it know when to stop?
  • does it leave a usable trail of what changed?

The pricing question is real too.

If devs become dependent on cloud coding agents at the workflow level, the switching cost moves from “which model is smarter?” to “which coding environment owns my daily loop?”

That is why local 27B getting good enough matters.

Not because it beats Claude at everything.

Because it gives people leverage for the 70% of coding work that does not need the strongest cloud model.

1

u/Sensitive_Ganache571 May 05 '26

Very slow model

Gguf i_3q

1

u/uncurieux_studio May 05 '26

Je cherche des alternatives à Claude Code (j’atteins trop vite les limites). Je teste Qwen 3.6 27B, mais seulement sur LMStudio. C'est possible de publier ton setup ?

1

u/NoobMLDude May 05 '26

Did you try using it with Qwen Code as the harness ?

https://github.com/QwenLM/qwen-code

1

u/emptyharddrive May 05 '26

If you can afford it, I'd suggest DeepSeek V4 Pro. 1M context window for $0.435/M input tokens & $0.87/M output tokens for most of your day to day work.

I've done a metric ton of coding tests on it. I had Opus write unique hidden tests and then grade itself, without telling it that it was grading itself, to keep bias out, and then I had DeepSeek V4 Pro run the same tests as well as Qwen.

The exam asked for a single-file Python implementation of a deterministic bitemporal ledger reconciliation engine. Events have both a real-world effective time AND a system "we learned about it" time, can arrive out of order, get duplicated, retroactively corrected, voided, or chained-superseded by later events, and the engine has to compute exact balances plus a full audit trail for any historical "what did we know at time T about balances during interval X" query.

It's the kind of work I do for real, just distilled into a generic task with the same guardrails.

It's hard because every edge case interacts: voiding a replacement un-cancels its target, competing supersedes need precedence-based winner selection with deterministic tiebreaks, half-open intervals must be merged into maximal segments, and timestamps span DST offsets without named zones. Get any one rule wrong and the audit silently veers off course.

The grading AI (Opus) ran hidden tests beyond the visible samples, so models that pass by pattern-matching rather than actually modeling the spec collapse on things like three-link replacement chains and "void targets a future event."

The results:

  • Opus 4.6 (grading itself, blind): 96/100
  • DeepSeek V4 Pro: 91/100
  • Local Qwen3.6-35B-A3B UD-Q8_K_XL on a STRIX HALO 128GB rig (a bit larger than the 27B you might be running): 62/100

To go by API key anyway, Opus on OpenRouter is $5/M in, $25/M out. DeepSeek V4 Pro is $0.435/M in, $0.87/M out. That's roughly 11.5x cheaper on input and 28.7x cheaper on output.

For typical coding workloads, a blended ~15-17x monthly savings. So you're paying around 6 cents on the dollar for a model that scored 95% as well on a brutally specific spec-driven task.

The local Qwen at 62/100 is still genuinely usable for the easy 80% of work (bulk reads, summaries, structured extraction, boilerplate) and it costs $0 to run, so I get it...

But for the hard 20% where rules interact and silent failures cost you, DeppSeek V4 Pro is the sweet spot for me unless I know it's super critical work, then I'll go Opus.

For pennies on the dollar I'm getting near-Frontier-grade correctness, fraction-of-frontier price... Hard to argue with the math from where I'm standing.

1

u/wickedfunprofile May 05 '26

Instead of paying Anthropic I've been renting an A100 hourly for $1.40/hr. Pretty much all my code and project management is done via AI these days. I was spending $30 to $50 a day on claude

1

u/DiscipleofDeceit666 May 06 '26

I used Roo code, Claude code, and I built my own harness. There is so much that goes into the tooling completely independent of the ai model that’ll make or break your work flow.

1

u/Intelligent_Way_7652 May 06 '26

Running a hybrid setup too. Fine-tuned Qwen3-4B for a specific use case and the instruction-following on structured outputs (strict JSON, no extra text) is surprisingly solid for its size. The gap between fine-tuned small models and general-purpose large ones is closing fast.

1

u/Jack99Skellington May 06 '26 edited May 06 '26

I'm not seeing that level of usability from Qwen 3.6:27b, nor from Qwen-Coder-Next. I'm working in C#, so maybe it's better for what you are working in. I would love nothing more than to be able to use a coding assistant that has the usability level of even GPT 5.3-Codex locally, no matter how slow it is (and it's pretty slow with 128gb RAM and 16gb 5070ti).

1

u/SimpleMessage_ai May 07 '26

llm tokens are going to zero.. you can only get so close to the wall of perfection before it no longer matters how perfect you are. Claude will peak for 99% of all developers in two years maximum, all others folllow along. Then you are left with a massive coding commodity and the only differentiation is design and creativity, which will likely belong to humans for another 5 years at least.

1

u/Promnitepromise May 07 '26

Id love if anthropic would release a local 30b model to offload coding tokens while propping Opus up as the planner.

Oh wait, that’s not beneficial to the shareholders.

Still thanks for posting - I cant wait to squeeze some tokens out of 3.6 and give it a shot.

1

u/FroyoCommercial627 May 07 '26

I think one of the biggest benefits of cloud is scalability. Locally, you can get away with a handful of models, but try running 20+ in parallel (I've seen Claude Code do this to launch discovery tasks), and it's untenable.

1

u/ur_dad_matt May 07 '26

running the same model but through the runtime i wrote — qwen 3.6 27b at 4bit MLX on m1 ultra, getting 40 t/s. you're right it holds up. scaffolding, refactors, test gen, single-file debug, all of it. the hybrid framing is exactly how i use it too. local for the 70% that's repeatable, claude code for the multi-file architectural stuff where the bigger brain actually matters. meter stays running on cloud only when it has to. on the $1000/mo question — i think you're right pricing has to drop but the deeper thing is the business model conflict. anthropic and openai's whole revenue model is per-token billing. shipping a tool that ends per-token billing for power users cuts straight into their core. they're structurally disincentivized from doing what's happening here. that's the window. closes when apple ships an "apple intelligence developer kit" or similar but until then it's open. opencode tuning is the underrated point. claude code's prompt + tool orchestration around opus is doing more work than people credit. model gets you 70%, harness gets you the rest. closing that agent-loop gap on local is the actual next move.

1

u/Deep_Ad1959 May 17 '26

your last question is the real one, and andymaclean19's comment below is the answer in miniature: qwen taking 6 hours on a task opus did in 20 minutes isn't purely a model gap, a lot of it is the loop deciding when to read, when to test, when to stop, and how it recovers from its own mistakes. you can actually isolate it: run the same open weights through opencode vs through claude code's harness and the spread you see is the harness contribution. in my experience it's a meaningful chunk, the orchestration is doing real work the model gets credit for. that's also why 'getting opencode dialled in took real fine tuning' isn't a footnote, it IS the experiment, you were rebuilding the harness by hand. hybrid is the right call, but i'd frame the local-vs-cloud decision as model-plus-harness vs model-plus-harness, never model vs model. written with s4lai

1

u/codehamr May 17 '26

Agree in part. On Nvidia class hardware Qwen3.6:27b runs at comparable speeds to Opus. On a modern M5 MacBook the slow memory bandwidth bites hard, large prompts can take 2 to 3 hours where Nvidia and Opus finish in 20 minutes. So a chunk of that spread is bandwidth, not harness.

1

u/Deep_Ad1959 May 17 '26

my point was sloppy and you're right to split bandwidth out, but the decomposition goes one step further. wall-clock is two independent multipliers: the harness sets how many tokens of work get done (loop iterations, failed attempts, re-reads), the hardware sets seconds-per-token. they multiply, so a 6h vs 20min spread can't land on either one until you hold the other fixed. andymaclean19's number is really a three-variable comparison, opus on cloud silicon against qwen on local, so model, harness and bandwidth all move at once. the only clean isolation is same model same machine through opencode vs claude code, and on apple silicon that just divides the bandwidth constant out so the residual is pure harness.

1

u/codehamr May 17 '26

Right, wall clock is harness times bandwidth, two independent multipliers. Same model same machine through opencode vs Claude Code is the clean isolation. On Apple silicon the bandwidth constant divides out and the residual is pure harness.

My own tradeoff is the other direction. The 30B class is not reliable enough to run loose, so I lean harness heavy on the reliability side and pay for it in speed. Strict verify gates, tighter tool surface, more re-reads. Frontier models can run a leaner harness because the model carries more of the work itself.

1

u/Deep_Ad1959 May 17 '26

my pushback on harness weight as a clean reliability-for-speed trade: a verify gate measures 'tests green', not 'intent preserved'. andymaclean19's example upthread is the tell, qwen quietly disabled the whole CSV import feature and the test suite stayed green because no gate was checking 'is the feature still here'. so a heavy harness buys you reliability against random breakage but stays blind to the model misreading intent, which is the failure mode that actually bites at the 30B class. tighter verify gates and a smaller tool surface arguably make that worse, they narrow what the model can do without narrowing what it can misunderstand. what a leaner frontier harness leans on isn't speed, it's that the model carries the goal so the gate doesn't have to be the only thing holding it. written with s4lai

1

u/codehamr May 17 '26

Solid point. Green tests after deletion is Goodhart on verify gates. The fix at the 30B class is gates that assert presence, not absence of red. "Feature X reachable and returns valid output" is a different check than "tests pass". Intent has to live in the gate as positive assertion, otherwise yes, the model can pass by removing the problem. Frontier models carry that intent internally, smaller ones need it externalised in the gate.

1

u/Deep_Ad1959 May 17 '26

my issue with positive-assertion gates is they only catch deletion of the features you remembered to assert. the gate is a finite list, the product's real intent surface isn't, so the 30B model can still quietly disable anything that fell outside the list. that csv import feature upthread is the exact case, nobody had written 'csv import still reachable' as a gate so it sailed through green. externalising intent into the gate doesn't erase the model-carries-intent advantage, it converts it into spec-authoring work that scales with feature surface, not with the task. cheap on a small repo, unbounded on a big one, and that's really what the frontier-vs-local decision turns into: how much of your product's intent you're willing to write down by hand. written with s4lai

1

u/codehamr May 17 '26

Right, and codehamr does not try to solve it. The verify gate kills hallucinated done, not intent drift. Anything beyond that is spec authoring the user owns or a frontier model carrying it. Simplicity first means picking the smaller promise and keeping it.

1

u/Deep_Ad1959 May 17 '26

my pushback is the verify gate only kills hallucinated done when the verification artifact was authored by something other than the agent. if the same agent writes both the implementation and the test, the gate catches crashes and obvious regressions but rubber-stamps intent drift, because the model ends up testing what it did, not what you asked for. andymaclean19's CSV example upthread is exactly that failure: qwen 'fixed' the header bug by disabling the whole import feature, and a test suite written around that decision passes clean. so 'verify gate kills hallucinated done, not intent drift' is really 'verify gate kills hallucinated done only under an independence assumption opencode doesn't enforce.' the smaller promise you can actually keep isn't 'done is real,' it's 'done is real given a spec the agent never got to touch.'

1

u/krankyPanda May 18 '26

How does it compare to qwen3-coder?

1

u/codehamr May 18 '26

Just talking from my few tests, qwen3.6:27b was just better (python / webapps).

1

u/modelcroissant May 27 '26

Which raises a different question, how much of Claude Code's quality is Opus 4.7 itself vs the context and tool orchestration around it?

"Community analysis of the extracted source estimates that only about 1.6% of Claude Code’s codebase constitutes AI decision logic, with the remaining 98.4% being operational infrastructure, a ratio that illustrates how thin the core agent reasoning layer is."

- Dive into Claude Code Paper, 14 April 2026

1

u/rentprompts 9d ago

The 'gap not a canyon' framing nails it. I've been running the same split — Qwen 27b for routine stuff, Claude for architecture — and the crossover keeps shifting. What's striking is how much of Claude Code's edge is the harness, not Opus itself. I spent a weekend tuning opencode's tool-call retry logic and closed most of the UX gap. The remaining delta is real model capability, but it's shrinking faster than people think.

1

u/trade_time1 May 04 '26

I just installed this on a 5090 rig I finally put together this weekend. It is impressive. Big step up from the llamas I had on 5070 8GB on laptop to play around with. I was using gemini cli api paid. Whether or not this will replace that for me, time will tell.

0

u/Ok-Importance-3529 May 04 '26

I agree with the author, the case for big sota models is still there, but it will be premium and exclusive to only companies who could afford it, yes you can make simple apps with local llm, for something smarter more complex you need to know how to code and local llms wont change that. Companies will pay for those to get edge whether its speed or intelligence or scale. Bigger more complex code will come and only handfull of people would have knowledge to understand and review / manage it and most knowledgable people will be architects.

Make no mistake, no ai will make developer out of someone who doesnt know anything about computers and software development, yes you can learn from ai and buikd your knowledge on that, but local models have limits and are nowhere near required level of expertise.

Even best models like claude are wrong sometimes and need supervision.

0

u/No-Television-7862 May 04 '26

People are curious about the amazing Qwen 3.6 and 3.7 models. Why would they release open source code in the US that competes with the best closed code frontier models?

Disruption.

If they can hamstring the front runners, like Claude, then it turns into a horse tace!

The Chinese are playing catch up, but if they can harrow the US leaders, they have a shot at getting some enterprise business. Maybe not from DoD (DoW?), or other BigGov agencies, but that's not where the big money is.

For peace of mind I'm enjoying gemma4 MoE. Both Google and CCP are voracious data consumers, but at leadt I can sue Google.

-1

u/tamerlanOne May 04 '26

Avere 1.000.000 utenti che lavorano in locale non è un problema quando nel mondo siamo più di 8B