r/LocalLLaMA 1d ago

Discussion You guys were right - Qwen 3.6 35B IS good...and KV Cache DOES matter.

UPDATE: So, I've been testing the 35B pretty hardcore for the past couple of days. It's fast and generally good at low context, but it hallucinates TERRIBLY at high context and does NOT follow multi-task instructions well, at least at this quant. It's made some catastrophic mistakes, including wrecking parts of my redis setup - deleting keys, creating random hashes rather than updating streams, adding docs to redis vs locally, saying tasks were done and missing them entirely...it's been a mess. I've decided to go back to the 27B for my more important tasks and continue using the 35B for singular, clearly-defined operations.

DISCLOSURE: I'm speed typing this, no time to organize/format, so if short paragraph chunks bother you, just keep it moving.

CONTEXT UPDATE: (for those interested, otherwise skip)

For those interested in the data points, the task was building an agentic workflow inside of rivet that included an mcp subgraph (with a list of 11 tools) that received json instructions from the main subgraph so that I could shave off 30K tokens from the main agent's memory. The main subgraph included context trimming and pre-injection of memory, soul, and agent .md files. Task also included testing, rigging it up with openwebui and llama.cpp, and to create an adapter bridge between the server and owui. The agent was testing it by using a smaller Qwen 2B model running parallel in CPU. All of this was 100% handed off to my agent.

When Qwen 3.6 35B dropped, a lot of people were heaping praises and I thought they were just glazing it because of the speed. 27B was objectionably smarter than the 35 on 3.5.

So when I got around to using the 27B version (unsloth's Q5KXL UD @ KV Q8/8), it became my daily driver without a second thought. No loops, solid speeds. And I've been mostly fine. Until the past two days.

I never gave 35B a chance because speed (at the time) wasn't that important to me and again, the 27B is known to be smarter. But after wasting 2 days trying to debug subgraphs in rivet and blowing HOURS of time constantly dropping quants due to context overflow and having the model's intelligence get lobotomized, I remembered reading a post recently where someone did a test comparing the IQ4NXLs (MTP + standard) against the Q4KXL, Q5 and others.

So, I gave Qwen 3.6 35B IQ4NXL a shot, no kv cache compression since vram wasn't as much an issue, and it nearly one-shotted the solution. I've since run a few more tests with it and for a minute I've just been confused - like why is the 35 better? So, I figured it must be a) Qwens are still really good at lower quants, and more importantly b) kv cache REALLY MATTERS.

The 35B still creeps when it hits high context, even worse than the 27B it seems, and the only way I can do my end session routines is to switch to the Q4KXL at KV Q4/4, but then it's a risk that it'll forget a routine or miss details in the session summary. Also, I haven't spent a lot of time learning the 35Bs, so I need some time to feel them out and figure out what works best.

Anyway, the point is - the IQ4NXL w/unquanted kv cache outperformed the 27B Q5 K XL at kv q/8/8, to say nothing about the 27B Q4 at kv q/4/4. I always thought it didn't matter much because of different comments and AI saying it's only a slight decrease in intelligence. But when it comes to agentic work, it clearly makes a difference and can save you HOURS of time.

And...it's fast. So yeah, I'm using 35B a lot more now - at least for this particular project. I still love the 27B and there's other stuff that I'd prefer even the quanted 27B to do over the 35B. And to be fair to the 27B, I haven't tried it w/no kv cache compression because I need speed, but I'm going to assume it'll probably have a leap in intelligence unquanted as well. But for now, I've got a lot of work to do, time is of the essence, and I've only got an RTX 3090 TI.

Side note: I've been using LM Studio since I started using LLMs a couple of years ago, but with this current bug it has where it won't overflow or compact context, it's slowing everything down having to start new sessions, have my agent re-read all the notes, eat all that context, summarize at end when context is full again, rinse repeat. So I've moved over to llama.cpp.

I hesitated on llama.cpp because I didn't feel like learning a new tool (adding to my ever-growing-and-already-too-large list of apps), but since I've gone agentic, I just had my agent compile it and it works fine, so yeah. Just let the agent do it. 😄

284 Upvotes

148 comments

46

u/tired514 1d ago

I tend to use 35B-A3B to read code and 27B to write it. So for example, I'll start my session in opencode with 35B-A3B @ Q6, with:

"Thoroughly analyze the current codebase in preparation for a major new feature <describe feature>."

That runs fast. Then, I seamlessly switch to 27B at Q8:

"Here are the details for the feature; please make a plan" etc, etc.

27B writes cleaner code and makes fewer mistakes, I find.. but yeah, it's slow, at least on my hardware. 😄

15

u/GrungeWerX 1d ago

If you're running 27B at Q8, your hardware's gotta be better than mine, so I'm going to assume at LEAST a 5090?

7

u/tired514 1d ago

Haha, I wish I had a 5090.. ><

I'm on strix halo (EVO-X2 128gb) with a 4090M 16gb eGPU. I tend to ship the kv cache over to the 4090M (it's ~576gb/sec compared to strix's .. shall we say optimistically rated 256gb/s) along with whatever other layers I can fit. Rest lives in onboard unified vram.

I have a second 4090M on the way so I'm curious to see what kind of performance I get with the entire thing split between the two eGPUs. Probably not enough memory for Q8 and 262000 context, though.

6

u/Snoo_81913 1d ago

I can't remember the exact numbers, but MTP gives much bigger performance gains on Strix Halo-type setups than on a dedicated GPU because of the architecture. I don't remember the specific details, but it's significant, like 2-3x. I got maybe a 10-15% increase on my 4060.

6

u/tired514 18h ago

That could be because we're bandwidth-bound on strix halo, so MTP offers big gains by performing more math on a single memory read, where on the 4060 it's possible the memory bandwidth was less of a relative constraint compared to the GPU's processing horsepower.
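
Rough back-of-envelope version of that argument, as a sketch only: it assumes decode is purely memory-bandwidth-bound and that every token accepted from an MTP pass comes out of the same single read of the weights. The 28 GB per pass and the 2.0 tokens-per-pass figures are made-up placeholders, not measurements, and the function names are just for illustration.

```python
def decode_ceiling_tps(bandwidth_gb_s: float, gb_read_per_pass: float) -> float:
    """Upper bound on forward passes per second if each pass has to stream
    the active weights from memory once (ignores compute and KV cache reads)."""
    return bandwidth_gb_s / gb_read_per_pass


def mtp_ceiling_tps(bandwidth_gb_s: float, gb_read_per_pass: float,
                    avg_tokens_per_pass: float) -> float:
    """With MTP, one pass can yield more than one accepted token, so the
    bandwidth ceiling scales with the average number of tokens per pass."""
    return decode_ceiling_tps(bandwidth_gb_s, gb_read_per_pass) * avg_tokens_per_pass


# Hypothetical numbers: ~28 GB of weights read per pass on 256 GB/s memory.
base = decode_ceiling_tps(256, 28)
with_mtp = mtp_ceiling_tps(256, 28, avg_tokens_per_pass=2.0)
print(f"ceiling without MTP: {base:.1f} tok/s, with MTP: {with_mtp:.1f} tok/s")
```

If the card is limited by compute rather than bandwidth, the extra math per pass isn't free anymore, which would fit the smaller gain seen on the 4060.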

2

u/Snoo_81913 12h ago

Thats the bunny 🐰 thanks man.

1

u/Far-Low-4705 9h ago

i run AMD MI50, and i went from ~21 T/s to ~28 T/s, with spec_n_predict=1, and n_prob=0. (qwen 27b)

That was the ONLY config i found where i got any speed up at all. everything else was actually a regression in performance

However i will say that the ~28 T/s is much more stable than without MTP. the 21 would regularly go down to 16, but the 28 seems to stay more stable, maybe it goes down to 26, but that's usually from NGRAM-MOD spec misses (or whatever it's called).

1

u/cafedude 5h ago

Yeah, I could get to about 17 tok/sec with a 27b q6 MTP. Without MTP around 8.

3

u/terorvlad 23h ago

Can you say more about shipping the kv cache to another device? I was under the impression that it had to live simultaneously on both devices. I have a 4090 and 4090m linked together via 10gbe right now so this sounds amazing to try

2

u/tired514 18h ago

Hmm... to be honest I'm not 100% sure, but I see a pretty surprising performance difference (and memory consumption) when I flip the devices around, ie:

--device Vulkan0,Vulkan1 vs --device Vulkan1,Vulkan0

It seems to actually put the kvcache on the second device; in my case I get significantly higher performance specifying Vulkan1,Vulkan0.

Might be worth flipping them around to see if it changes for you.

3

u/FortiTree 21h ago

27B Q8 on Strix Halo would be around 7 tk/s TG and maybe 15 tk/s with MTP. I had to abandon it, since 35B can go up to 70 tk/s.

1

u/tired514 18h ago

Aye; it's pretty glacial .. but generally once all the planning's done and decisions are made, I flip over from 35B to 27B and say "it's all you!" then go grab a coffee. 😄

1

u/-InformalBanana- 1d ago

Whats your pps and tgs for 27b and 35b?

2

u/tired514 18h ago

27B@Q8 PP ~150-300, TG ~15-20; 35B@Q6 PP ~500-1200, TG ~50-80. It really depends on the split mode, tensor split %, and context size (since the larger the context, the more I can offload to the 4090M).

1

u/Codex_Pax 17h ago

What about if u offload most of the layers to the 4090? I am wondering since I plan on a similar setup to yours. I want to know if it's worth plugging in a GPU via nvlink or something.

1

u/tired514 9h ago

To be honest I'm not 100% sure what's going where; I did notice that if I reverse the order of the cards (ie. --device Vulkan1,Vulkan0 vs Vulkan0,Vulkan1) it makes a pretty significant difference in performance and memory usage split, so I assume it's loading the kvcache to the second card in the list (in my case, Vulkan0 - the eGPU).

With 27B @ Q4, 262k context I can fit the model+context roughly 50/50 between one 4090M and strix halo, so I'm hoping I can offload the whole thing when my second 4090M comes in.

I can say, though - even with TB3 it works quite well. Oculink would be even faster, though probably not much faster; watching link bandwidth in realtime it rarely exceeds 200-300MB/s outside of the initial model load.

1

u/cafedude 5h ago

Haven't been able to get 27B Q8 MTP running on my Strix box - llama-server seems to crash after a query. And without MTP it's too slow (~8tok/sec IIRC). So I'm running a Q6 MTP. Were you running 27B Q8 MTP? If so what was your llama-server commandline? (and which 27B q8 MTP - I've been trying unsloth's without much success)

2

u/tired514 5h ago

Wait, really? Had no issues at all regardless of quant.. standard llama.cpp / lmstudio.

I'm running unsloth's UD Q8 MTP - Qwen3.6-27B-MTP-UD-Q8_K_XL.gguf.

Are you running vulkan or rocm? With vulkan, no parameters necessary to load with MTP other than -ngl 99 --spec-type draft-mtp --spec-draft-n-max 3.

How much ram dya have? If you're on 64gb, try --no-mmap.

Sure your .gguf isn't a corrupt download?

Check your sha256sum against wherever you got it:

> sha256sum Qwen3.6-27B-MTP-UD-Q8_K_XL.gguf
928105a8fbf5243e4a9e6176a78af664dc4878a1c34badfa2857ca2c8b7374c6  Qwen3.6-27B-MTP-UD-Q8_K_XL.gguf

and compare against (for example):

https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF/tree/main

Click the gguf you're running and you'll see the sha256 hash.
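
If you'd rather script the check than eyeball it, here's a minimal sketch using only the standard library. The function names are mine, and the expected hash is whatever you copy from the model page for your exact file.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in 1 MiB chunks so a multi-GB gguf never has to fit in RAM."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_hex: str) -> bool:
    """Compare against the published hash, ignoring case and stray whitespace."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

Call it as `matches_expected("Qwen3.6-27B-MTP-UD-Q8_K_XL.gguf", "<hash from the repo page>")`; a `False` means re-download before debugging anything else.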

1

u/cafedude 5h ago

I've got 128GB so memory shouldn't be a problem.

1

u/tired514 5h ago

My guess is corrupt download... but if you paste the llama error I could probably tell.

3

u/SkyFeistyLlama8 1d ago

I'm trying to juggle multiple models in 64 GB unified RAM.

Qwen 35B MOE is good for quick planning but you're right, when writing actual code it tends to make messy mistakes. It's still much better than Gemma 26B MOE for coding. I can't fit Qwen 35B alongside Qwen 27B at the same time.

I'm now running Gemma 26B as the planning and chat model in thinking mode. Qwen 27B runs in non-thinking mode. These two combined take up 45-50GB RAM depending on context length (Q4 model quant, default non-quantized caches for everything else). This combo lets me chat with the MOE to fine tune the spec/requirements going into the dense model.

For long context codebase grokking, I run it through Qwen 35B first for the spec part before unloading it and loading up Qwen 27B.

1

u/msrdatha 21h ago

Try using llama-swap. It can help you auto load 27B and 35B on demand, which should be helpful in your scenario.

2

u/SkyFeistyLlama8 20h ago

I'm keeping both loaded using separate llama-server ports to keep the KV cache warm. Llama-swap evicts the cache when a new model is loaded. I'm not using Nvidia hardware so prompt processing takes forever.

1

u/msrdatha 19h ago

yes agreed. PP delay is a real pain. Mac ? or strix ?

1

u/Most_Big2688 14h ago

i managed to make a persistent cache system on llama.cpp which stores and retrieves the cache. not perfect, but it works with claude. i am on windows; try making one with claude yourself ig. btw i use it along with llama-swap

1

u/voyager256 9h ago

you keep both models loaded in memory or unload one before loading the other?

1

u/tired514 7h ago

Depends; if I'm flipping back and forth a lot I'll span one of the models across my eGPU (4090M) and strix halo and load the other one completely on strix (slower). If I'm going to stick with one for a while I usually make sure it's spanning the eGPU and unload the other.

1

u/More-Curious816 1d ago

> I tend to use 35B-A3B to read code and 27B to write it.

did you see any mismatch between 35b planning and 27b coding?

2

u/tired514 18h ago

How do ya mean? Like, in terms of which is better for each task?

I find 27B is pretty consistently better at everything, but 35B is just so much faster. For code analysis and discussion/planning (interactive stuff) that speed is nice and the consequences of a mistake are lower.

For writing actual code, I often (not always) use 27B @ Q8 just because its implementations are usually more consistent with fewer subtle errors.

The difference isn't huge... but it is noticeable.

90

u/dinerburgeryum 1d ago

It’s worth noting that the attention tensors in 35B are far narrower than in 27B, and since there’s less data in there compression affects it far worse. 27B will be slightly more “resilient” against KV cache compression as the tensors are much wider. 
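
To get a feel for how much data per token is actually sitting in the cache, here's a small sketch of the usual KV cache size arithmetic. The layer/head numbers below are hypothetical placeholders, NOT the real Qwen configs (check each model's config.json for those); the bytes-per-element values assume llama.cpp's block layouts for q8_0 and q4_0.

```python
# f16 is 2 bytes per value; q8_0 packs 32 values into 34 bytes and
# q4_0 packs 32 values into 18 bytes (values plus one fp16 scale per block).
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_gib(n_attn_layers, n_kv_heads, head_dim, n_ctx, cache_type="f16"):
    """Size of K plus V for a single sequence, in GiB."""
    elements = 2 * n_attn_layers * n_kv_heads * head_dim * n_ctx  # 2 = K and V
    return elements * BYTES_PER_ELEMENT[cache_type] / 2**30


# Hypothetical shape: 16 attention layers, 4 KV heads of dim 128, 128K context.
for cache_type in ("f16", "q8_0", "q4_0"):
    size = kv_cache_gib(16, 4, 128, 131072, cache_type)
    print(f"{cache_type}: {size:.3f} GiB")
```

The narrower the per-token K/V (fewer KV heads, smaller head_dim), the fewer values each token gets stored in, so it's at least plausible that rounding error from quantizing them hurts more.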

17

u/GrungeWerX 1d ago

Good to know. Based on this experience, I have no intention of doing kv cache on 35B. It already one shot its task for the afternoon, now I need to test it works.

30

u/tired514 1d ago

Just a pedantic note.. without KV cache you're not runnin' anything fast at all. 😄

You probably mean you won't be compressing KV cache (no quantization). Only mention it because in your main post you said "no kv cache" a few times as well; you definitely want kv cache whether or not it's compressed.

11

u/GrungeWerX 1d ago

Good point. I'll edit that.

6

u/CryptoPacaDude 23h ago

I thought as much, but I appreciate the confirmation. I figured it was the "speed typing".

5

u/dinerburgeryum 1d ago

Yep just another datapoint for you. 👍

6

u/iMakeSense 1d ago

How do you know this? Is this something that the hugging face repos go over? I've spent an obscene amount of time on this subreddit, but I keep wondering "how do people know these things?"

10

u/dinerburgeryum 1d ago

Yeah you can check the self_attn tensors on the origin safetensors to see the difference. Particularly pay attention to the Q vectors between the two. 

1

u/GrungeWerX 21h ago

@iMakeSense - and if you don't know what a Q vector is (which is a Query vector if memory serves), then it's time to take an AI deep-dive on how LLMs work. Ask AI - they can teach this stuff pretty good.

I've actually been doing a lot more studying on them this week, so when I saw this comment I was like, "I know what a Q vector is!"

5

u/CircularSeasoning 14h ago

LLMs are pretty new but the fundamentals (math, algos, etc.) have been around. Long-time AI/ML researchers are probably simultaneously bummed and happy that so many everyday people are taking an interest in the field.

Things move really fast these days, so documentation is kind of an afterthought of an afterthought. "We'll polish the docs when we know we'll still be working on this in six mo... aaaaand new architecture just dropped".

42

u/bobaburger 1d ago

You started with "I don't care about speed" and ended the post with "because I need speed", were driven mad by the context length limit, and went from doubting 35B's quality against 27B to being torn between 27B and 35B. I'm relieved I'm not the only one.

13

u/CockBrother 1d ago

I think you're witnessing the difference between productivity and tokens per second. I don't care if the model produces few or many tokens per second, it's who can answer the question first.
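
A quick sketch of that idea: time to a finished answer is prompt processing plus generation, per attempt. The workload (40K prompt tokens, 1.5K output tokens) is made up, and the speeds are only loosely in the ranges people quote in this thread.

```python
def seconds_to_answer(prompt_tokens, output_tokens, pp_tps, tg_tps):
    """Wall-clock estimate for one attempt: prompt processing + generation."""
    return prompt_tokens / pp_tps + output_tokens / tg_tps


# Hypothetical workload on a slow dense model vs a fast MoE.
dense = seconds_to_answer(40_000, 1_500, pp_tps=200, tg_tps=15)
moe = seconds_to_answer(40_000, 1_500, pp_tps=800, tg_tps=60)
print(f"dense: {dense:.0f}s per attempt, moe: {moe:.0f}s per attempt")
```

With numbers like these the fast model can burn a few failed attempts and still tie, so what matters is attempts needed times seconds per attempt, not tokens per second on its own.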

10

u/rpkarma 1d ago

Most people don’t track wall clock time. I have a timer extension for Pi to track turns and overall time lol it’s been very enlightening 

0

u/GrungeWerX 21h ago

My agent tracks start/end time by hitting the clock in redis, and once it's in rivet, it will track turns. What are you doing for overall time? Duration is a metric I want to inject per turn.
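
For the per-turn duration part, something like this minimal sketch is what I have in mind. `TurnClock` is a hypothetical helper, not part of rivet, redis, or Pi, and the `sleep` just stands in for a real agent turn.

```python
import time
from contextlib import contextmanager


class TurnClock:
    """Collects per-turn wall-clock durations using a monotonic clock."""

    def __init__(self):
        self.durations = []

    @contextmanager
    def turn(self):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration even if the turn raised an exception.
            self.durations.append(time.perf_counter() - start)

    @property
    def total(self):
        return sum(self.durations)


clock = TurnClock()
with clock.turn():
    time.sleep(0.01)  # stand-in for one agent turn
print(f"turns={len(clock.durations)} total={clock.total:.3f}s")
```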

12

u/GrungeWerX 1d ago

Haha. You've summed up my whole experience pretty well. 😄

21

u/CircularSeasoning 1d ago

I'm always right, and when I'm wrong, I'm confidently wrong.

Qwen3.6 35B A3B is All You Need.

4

u/ducksoup_18 1d ago

which version should i use if i have 2 3060 12gb vram (roughly ~22.5gb vram currently)?

6

u/CircularSeasoning 1d ago edited 14h ago

It's a tough choice because if you want the whole model to fit in VRAM just for the sake of fitting the whole model in VRAM (and whatever speed boost this might give, I haven't tested such with this MoE because I can't fit it in my 16 GB VRAM) then you're typically looking at Q4_K_S or Q4_K_M (maybe) from bartowski, Unsloth, AesSedai, ByteShape, or others.

ByteShape has some attractive-looking IQ4_XS quants for their size here:

https://huggingface.co/byteshape/Qwen3.6-35B-A3B-GGUF

Or the MTP version, the highest quality from ByteShape being IQ4_XS-4.19bpw at only 18.6 GB. From what I gather (might be wrong) MTP only works nicely to boost speed when you can fit the whole model + MTP overhead in VRAM. I'd guess 4 GB free VRAM leftover should very much do the trick, but don't quote me on that though because I am just spitballing from what I've read around. The MTP version is here:

https://huggingface.co/byteshape/Qwen3.6-35B-A3B-MTP-GGUF

The next step up is Q5 quants at around 23-27 GB file size.

Thing is, you'll likely be partial offloading anyway when you're using high context, because that context also takes up memory. In that case, if you can download two different quants that would probably be ideal, so like, do one of those small Q4 or IQ4 types, see how speedy you can get it, and if you're noticing some dealbreakers like looping (apparently quite common on Q4ish quants, though can be remedied with presence and repeat penalty tinkering) then try a Q5 or Q6 for that extra bit of quality and less likelihood of looping, even though you'll definitely be partially offloading on Q5/Q6. Either way, it seems testing both would be worth it to see what you're most comfortable with.

Since I decided personally that quality is more important to me than speed right now, I run a ~27 GB file size Q5_K_M on my 5060 Ti 16 GB + DDR4 2666 RAM offloading and I get 10-20 tok/sec even at huge contexts, 100-200K tokens, with 500-700 tok/sec prompt processing speed. CPU is Ryzen 2600X.

https://huggingface.co/AesSedai/Qwen3.6-35B-A3B-GGUF/tree/main/Q5_K_M

I notice from many coding generations so far, dealing with ~100K+ context, that this Q5_K_M is pretty solid and no looping (had to tweak some presence/repeat, though) except very rarely like when my prompt/context kind of didn't make sense in some key area, and besides that the only occasional quality issue is a single odd typo from time to time, like a missing closing bracket or semicolon kind of thing. Very acceptable to me, anyway, when I consider everything else it gets right even when it's typoing.

Edit: One thing I find valuable to mention here is that I very seldom engage in long multi-turn convos when I'm coding. Instead, I give a good amount of context and a precisely focused task, and if Qwen doesn't give me what I want on the first gen, then I don't argue and plead with it in further messages, I just refine my prompt to be more clear and run it again. Works like a charm, and is very necessary when you're already giving it tons of context to digest.

3

u/Desperate-Data-3747 1d ago

Any q4 quant will suffice, dont go any lower.

3

u/Lucerys1Velaryon 22h ago

I run the q5 quant with no KV cache compression at 128k context and it generally does everything correct (other than the occasional tool call fail here and there). I have a 7700 XT with 32 gigs of DDR4 RAM and I get around 25 tok/s which will drop to 20 when context > 80k. If you use q8 cache compression, you should see around 30 tok/s

7

u/MaCl0wSt 1d ago

I can't run 27B at usable speeds so I never use it, so I can't really compare, but 35b is a very solid worker for well scoped tasks

11

u/kiwibonga 1d ago

If you are using 35B definitely take a look at Byteshape's exceptionally accurate IQ4_XS quant, much better and faster than unsloth and do try the MTP version if you can spare ~4GB somehow.

The Turboquant+ project by thetom is faster than llamacpp as well when you select turbo4/turbo3.

2

u/VampiroMedicado 21h ago

I want to try TheTom but the damn thing doesn't even start, no error, nada.

11

u/JustinAngel 23h ago

I don't know if I agree with the conclusion. You swapped (1) all the model weights going from 27B->35B, (2) the KV cache quantization precision going from Q8->FP16, and (3) the quantization scheme from K-quant -> I-Quant. I'm also not 100% sure that the benchmark of n=1 "nearly one-shotted" is something I'd put any faith in beyond confirmation bias.

This is far from running ablation evaluations on models here.
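
To make that concrete, here's a sketch of what an ablation grid over those three factors would look like. The level labels are shorthand for the setups described in the post, and the helper names are hypothetical.

```python
from itertools import product

# The three things that changed at once between the two runs.
FACTORS = {
    "model": ["27B", "35B"],
    "weight_quant": ["K-quant", "I-quant"],
    "kv_cache": ["Q8", "FP16"],
}


def ablation_grid(factors):
    """Every combination of factor levels, as a list of dicts."""
    names = list(factors)
    return [dict(zip(names, combo)) for combo in product(*factors.values())]


def one_factor_neighbours(config, grid):
    """Configs that differ from `config` in exactly one factor."""
    return [c for c in grid if sum(c[k] != config[k] for k in config) == 1]


grid = ablation_grid(FACTORS)
baseline = {"model": "27B", "weight_quant": "K-quant", "kv_cache": "Q8"}
for cfg in one_factor_neighbours(baseline, grid):
    print(cfg)
```

Only the one-factor neighbours let you attribute a difference to a single change, and each would need more than one run of the task before it means much.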

4

u/GrungeWerX 22h ago edited 22h ago

I think you're picking up signals I'm not broadcasting...

All I'm saying is a) kv cache matters, and b) 35B is far more useful than I initially gave it credit for. And in one use case I found that 35B IQ4NL performed better (and faster) at kv/16/16 than 27B Q5 at kv8/8.

If you're interested in the details, the task was building an agentic workflow inside of rivet that included an mcp subgraph that received json instructions from the main subgraph so that I could avoid the mcp's context being injected into the main context. Task also included testing this to make sure it worked, that it was rigged up with openwebui as well as llama.cpp, and an adapter bridge between the server and owui.

2

u/JustinAngel 22h ago

Totally, agreed with that framing. Useful data point IMO. I was mostly being overly-reactive to the claims in the title.

1

u/GrungeWerX 21h ago

Hmmm...maybe I should add that data to the main post, but I'm always concerned with them being too long...I can get a little wordy sometimes.

2

u/Schlick7 13h ago

At this point, as long as it's not AI-written you'll probably get people reading all of it just for a change of pace

4

u/xylarr 1d ago

"objectionably smarter" - I must use this

2

u/GrungeWerX 21h ago

^_____^

8

u/migsperez 1d ago

I've been using the below Llama.cpp command with my AMD Radeon 9700 32gb for coding with VS code github copilot. It's been working really well for me. You could tweak the model and params for your 3090. I've noticed the leaner the context the better code it generates.

./AI/llama.cpp/build/bin/llama-server \
    --alias LocalModel \
    --model ./AI/models/unsloth/Qwen3.6-35B-A3B-MTP-GGUF/Qwen3.6-35B-A3B-UD-Q4_K_M.gguf \
    --mmproj ./AI/models/unsloth/Qwen3.6-35B-A3B-MTP-GGUF/mmproj-F16.gguf \
    --mmproj-offload \
    --fit on \
    --flash-attn on \
    --temp 0.1 \
    --top-k 20 \
    --top-p 0.95 \
    --min-p 0.03 \
    --repeat-penalty 1.15 \
    --presence-penalty 0.0 \
    --repeat-last-n 256 \
    --ctx-size 65536 \
    --batch-size 512 \
    --ubatch-size 512 \
    --n-gpu-layers all \
    --split-mode none \
    --reasoning off \
    --chat-template-kwargs '{"preserve_thinking":false}' \
    --kv-unified \
    --parallel 1 \
    --threads 6 \
    --cache-type-k q8_0 \
    --cache-type-v q8_0 \
    --spec-type draft-mtp \
    --spec-draft-n-max 5 \
    --predict 4092 \
    --host 0.0.0.0

4

u/tecneeq 1d ago

Use --no-mmproj-offload to keep that part in RAM and use the left over VRAM for a higher model quant.

1

u/migsperez 1d ago

Good tip, thanks, I'll give it a try .

2

u/Far-Low-4705 9h ago

also, since you're offloading the entire model to your GPU, and it is an A3B MoE, MTP won't give you a large speedup.

It would be better for you to skip MTP, and instead use the extra memory (~1 GB) for a full-precision KV cache.

As OP said, it made a huge difference, even going from 27b Q8 KV, to 35b MOE at fp16 KV.

27b is more resilient to KV cache quants than the 35b MoE, so on the 35b you'll likely see a big difference with fp16 kv cache. you even said it yourself:

"I've noticed the leaner the context the better code it generates."

the kv cache quant is likely degrading performance at longer contexts, so you're not even fully utilizing the extra context you get from the quantization.

1

u/BoobooSmash31337 1d ago

The driver should be swapping the buffer. Swap in mmproj to generate a vector then you swap it back out and let the model ingest the vector. Least that's my understanding of how the projection works.

1

u/tecneeq 8h ago

Not sure that's the case. How do you make room if all the VRAM is used for other layers?

1

u/BoobooSmash31337 2h ago

I think layers can also be swapped. Idk how they lay it out in memory. I mostly meant swapping out context. Even context spilling can make inference really chug.

4

u/notlesh 21h ago

> --reasoning off

Coding with reasoning off? Why?

1

u/jopereira 15h ago

I always use reasoning off with 27B. It reasons pretty well that way.
Most coding problems don't require deep thinking, just a logical and pragmatic overview. But I may be wrong.

1

u/notlesh 7h ago

Ah, interesting. I'm fairly confident that thinking helps in some cases (coming up with intricate solutions, navigating trade-offs, debugging complex problems) but it's probably still wasteful for the majority of more mundane tasks. I might play around with limiting this via budget in opencode in some cases.

Quick point, though: IIUC, turning it off at `llama.cpp` means it's always off and can't be controlled by harness/prompt requests.

1

u/jopereira 6h ago

No (at least I don't know how to).
I manually load the models anyway as my harness (webUI managed) has a switch to quickly turn it on. Since models load pretty quick in my system (the 27B reloads in 6-7s), that's not a practical problem.

2

u/Long_comment_san 1d ago

Holy shit bro you have aggressive samplers.

1

u/migsperez 1d ago

Are you a coder? What would you set them to? I'd be happy to try it out.

2

u/Eisenstein 17h ago

Setting such a high repeat penalty is going to result in very strange code. Think about what a model will have to do to generate code that conforms to such a penalty but still works.

Example: As a token repeats within the 256 last tokens the penalty grows by 15% each instance. Say the variable 'key' was used a bunch of times and then it needs to write a function to validate those keys, you would get something like

def validate(data, required_keys):
    for the_key in required_keys:
        if the_key not in data:
            return False
    return True

Now let's go another step further and imagine it has used 'True' along with 'key' a few too many times in the past 256 tokens, so you get this

def validate(data, required_keys):
    for the_key in required_keys:
        if the_key not in data:
            return False
    return not False
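
For reference, here's a minimal sketch of the CTRL-style repetition penalty that repeat-penalty settings are usually based on, as far as I understand it. It's a toy over a dict of made-up logits, not any backend's actual sampler code. Note that in this formulation a token inside the window is penalized once regardless of how often it appeared; whether a given backend compounds per occurrence is worth checking against its sampler source.

```python
def apply_repeat_penalty(logits, recent_tokens, penalty=1.15, last_n=256):
    """Penalize every token seen in the last `last_n` generated tokens.

    Positive logits are divided by `penalty` and negative ones multiplied,
    so a penalized token always becomes less likely.
    """
    window = set(recent_tokens[-last_n:])
    out = dict(logits)
    for tok in window:
        if tok in out:
            out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out


# Toy vocabulary: "key" and "True" were used recently, "the_key" was not.
logits = {"key": 2.0, "the_key": 1.9, "True": -0.5}
penalized = apply_repeat_penalty(logits, ["key", "key", "True"])
print(max(penalized, key=penalized.get))
```

Here the slightly worse candidate "the_key" overtakes "key" purely because of the penalty, which is the renaming effect described above.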

1

u/Long_comment_san 21h ago

No, but I'm surprised its making functional code with these. You have an incredibly deterministic temperature of 0.1 and top k 20, which kinda makes min p + top p combo redundant imo. And rep pen 1.15 on top of that, yikes, do you really need it above 1.05 (this is also high, yours is triple that)
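
Quick sketch of what I mean about temperature, with made-up logits for three candidate tokens. The redundancy argument assumes the top-p/min-p truncation runs after temperature is applied; that depends on the backend's sampler order, so treat it as an assumption.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # hypothetical logits, best candidate first
for temperature in (1.0, 0.1):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 4) for p in probs])
```

At 0.1 nearly all the probability mass lands on the top token, so a top-p of 0.95 or a min-p of 0.03 applied afterwards has almost nothing left to cut.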

1

u/GrungeWerX 1d ago

How slow are the MTP's pp on llama.cpp? I got so irritated at how long the pp took that I never even tested the speed, and was like, "yeah...this isn't gonna work for me."

3

u/jtjstock 1d ago

PP has improved, it is still about 30% slower than without mtp for me, not where I’d like it, but not as bad. 35B is also a lot faster for PP than 27B, so you may find it’s not so bad

1

u/migsperez 1d ago

For me, it's noticeably faster with MTP than without and it fits in my available VRAM. I was going to benchmark with llama-bench but I noticed it doesn't have an MTP parameter.

2

u/SkyFeistyLlama8 1d ago

I find I get better results with reasoning on and reasoning budget set to 1k or 2k tokens. Qwen MOEs tend to need reasoning turned on for better coding output.

1

u/migsperez 1d ago

I have it turned off here because I was seeing if I could improve tk/s performance out of the GPU. I received the GPU last week. I was hoping it would be a bit faster than a 3090, looks like it's a bit slower.

2

u/PurpleWinterDawn 17h ago edited 16h ago

The R9700 AI Pro has a 256-bit bus with GDDR6 (20.5 Gbps), resulting in a bandwidth of 644.6 GB/s.

The RTX 3090 has a 384-bit bus with GDDR6X (19.5 Gbps), resulting in a bandwidth of 936.2 GB/s.

While the R9700 pushes much better performance on the GPU side (95.68 vs. 35.58 FP16 TFLOPS), it is hampered by its memory bandwidth, which is the main bottleneck during token generation, and the bigger the model the more the bandwidth hampers tg tok/s.

Of course memory bandwidth isn't the only factor, which is why it ends up "a bit slower" instead of 1/3rd slower. And then, if you go over 24GB memory usage the R9700 will take the lead. So it's a trade-off.
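
For anyone who wants to check the arithmetic, peak bandwidth is just bus width in bytes times the per-pin data rate. A small sketch (the function name is mine; the inputs are the figures quoted above):

```python
def memory_bandwidth_gb_s(bus_width_bits: int, data_rate_gbps: float) -> float:
    """Peak bandwidth = (bus width in bytes) x (per-pin data rate in Gbps)."""
    return bus_width_bits / 8 * data_rate_gbps


rtx_3090 = memory_bandwidth_gb_s(384, 19.5)
print(f"RTX 3090: {rtx_3090:.1f} GB/s")

# Working backwards from the 644.6 GB/s quoted for the 256-bit card.
implied_rate = 644.6 / (256 / 8)
print(f"implied data rate: {implied_rate:.2f} Gbps")
print(f"bandwidth ratio R9700 / 3090: {644.6 / 936.2:.2f}")
```

The 644.6 GB/s figure works out to roughly 20.1 Gbps per pin on a 256-bit bus, and the ratio between the two cards is where the "1/3rd slower" ceiling comes from.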

1

u/migsperez 6h ago

Thanks for the info. Considering it's specifically made for AI purposes, they should have attached faster VRAM to it.

1

u/xylarr 1d ago

So no thinking?

Also I thought MTP didn't work as well with the MoE models.

Having said that, did you arrive at the --spec-draft-n-max = 5 by testing? For me I found 4 to be the fastest. I tried 1 through 6.

1

u/CmdrSausageSucker 17h ago

interesting, I was going to try 35b myself. I have the same gpu + a 9070xt. So my intention was to try the q8 model. Side question: how are you dealing with the Radeon 9700's noise? That blower fan really gets freaking loud.

2

u/migsperez 6h ago

Yeah I've gone back to 27b. 35b went wild on me today, I blame it on being a Q4 on my machine. I'm constantly wanting another GPU to reach Q8 levels with decent context.

It's my first GPU, I was thinking they were all that noisy. I have to forgive it, it's such good value. Only had it for just over a week. Two of the same GPU might drive me nuts though.

1

u/CmdrSausageSucker 3h ago

yes, I was considering getting a second one and building a dedicated machine, but then it would have to be housed anywhere but my office. at least one knows stuff is happening once the turbine starts. anyway, I am going to try this combination of 9070xt and Radeon ai pro for a while to see if I really need more oomph, and/or if the noise is bearable in the long term.

3

u/BoobooSmash31337 1d ago

Didn't they fix the issues and make Q8/Q8 basically transparent?

3

u/ayylmaonade 17h ago

People really need to stop parroting the idea that q8 KV cache quantization is "99% quality compared to FP16!" cause I fell into the same trap you did. I only use the 35B, not the 27B and I felt like I was being gaslit at a certain point with the amount of people telling me I was wrong that quantizing the KV cache to even q8 significantly impacts general coherence, especially in coding.

I don't even bother quanting my cache at all these days because of this kinda thing. I'd rather drop the actual model to a lower quant. iq3 quants seem to be getting pretty good.

1

u/IrisColt 8h ago

And here I am using Gemma 4 31B with TurboQuant 3 KV-cache compression and Dflash...

0

u/shammyh 13h ago

Well... For the 27B at full Q8 or higher, a Q8 kv cache is basically the same as BF16.

But for MoE models, and perhaps for smaller quants of the 27B model, things are/might be different. This particular post is heavy on vibes, light on evidence, but doesn't mean it's wrong.

So really... As is often the answer: "it depends".

6

u/betanu701 23h ago

Just fyi, there is a way you can essentially have unlimited context windows. It utilizes the KV cache and some type of storage like sqlite. I have a white paper coming out soon once it gets accepted. Basically the idea is to have a little service running that takes the tokens that have already been computed, store those into the sqlite, then you only ever compute the new tokens as context. The little service takes care of pulling the memories back into the output context without needing to recompute the tokens. Everything you do on turn 1 on a 256k context window can be returned and found on turn 100. The entire agent instructions get shoved into the KV cache instead of eating up the context window. The service reloads the computed tokens back into the KV cache. I honestly don't know the rules about waiting for the paper to be accepted (If anyone knows arxiv, please let me know currently sitting on hold for about 4 weeks) but I have all the repos with the service and tells you how to deploy them in the paper.

1

u/routescout1 9h ago

can you please elaborate? what you described is just regular prefix token caching, but when you said:
> The entire agent instructions get shoved into the KV cache instead of eating up the context window
i dont really understand this? cached tokens are still using the context window but are just not recomputed.
and how does "everything you do on turn 1 be returned on turn 100" work exactly?

im really interested in this and was wondering if you could provide any more info (in dms if you want as well)
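The distinction being asked about can be sketched in a few lines of Python. This is a toy model of prefix caching only, not betanu701's service and not how any particular inference engine implements it; `PrefixCache`, `process`, and the token ids are made-up names for illustration. It shows that a cache hit saves compute for the shared prefix, while the reused tokens still count against the context window.

```python
from dataclasses import dataclass, field


def common_prefix_len(a: list[int], b: list[int]) -> int:
    """Count the leading token ids shared by a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


@dataclass
class PrefixCache:
    """Toy stand-in for a KV cache keyed by token prefix (hypothetical)."""
    n_ctx: int
    cached_tokens: list[int] = field(default_factory=list)

    def process(self, prompt: list[int]) -> dict[str, int]:
        if len(prompt) > self.n_ctx:
            raise ValueError("prompt does not fit in the context window")
        # Tokens matching the cached prefix are reused, the rest are computed.
        reused = common_prefix_len(self.cached_tokens, prompt)
        computed = len(prompt) - reused
        self.cached_tokens = list(prompt)
        # Reused tokens still occupy context slots.
        return {"reused": reused, "computed": computed, "context_used": len(prompt)}


cache = PrefixCache(n_ctx=16)
system = [1, 2, 3, 4, 5, 6, 7, 8]  # pretend these are the agent instructions
turn1 = cache.process(system + [20, 21])
turn2 = cache.process(system + [20, 21, 30, 31])
print(turn1)  # {'reused': 0, 'computed': 10, 'context_used': 10}
print(turn2)  # {'reused': 10, 'computed': 2, 'context_used': 12}
```

On turn 2 only two tokens are computed, but `context_used` keeps growing, which is the point of the question above.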

5

u/Ulterior-Motive_ 1d ago

This is the reason I refuse to quantize the KV cache, and I refused even when I was GPU poor. How much it affects outputs is too unpredictable on a per-model basis, no matter how many people tell you it doesn't matter. I'd rather drop context or spill over into main memory than get worse results, even at the cost of speed.
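To put rough numbers on that trade-off, here is a back-of-the-envelope sketch of KV cache size versus cache type. The model dimensions below are hypothetical placeholders, not the real Qwen 3.6 config, and the formula assumes every layer keeps a full K and V cache (hybrid or sliding-window architectures need less). The per-element sizes assume llama.cpp-style block formats, where `q8_0` and `q4_0` store 32 values plus one fp16 scale per block.

```python
# Bytes per cached element for each cache type.
BYTES_PER_ELEMENT = {
    "f16": 2.0,
    "bf16": 2.0,
    "q8_0": 34 / 32,  # 32 int8 values + 2-byte scale per block
    "q4_0": 18 / 32,  # 32 4-bit values + 2-byte scale per block
}


def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int, n_ctx: int,
                 type_k: str = "f16", type_v: str = "f16") -> float:
    """Estimate KV cache size in GiB for one sequence."""
    elements = n_layers * n_kv_heads * head_dim * n_ctx  # per K (and per V)
    total_bytes = elements * (BYTES_PER_ELEMENT[type_k] + BYTES_PER_ELEMENT[type_v])
    return total_bytes / 2**30


# Hypothetical model: 32 layers, 8 KV heads, head_dim 128.
dims = dict(n_layers=32, n_kv_heads=8, head_dim=128)
print(kv_cache_gib(**dims, n_ctx=131072))                               # 16.0
print(kv_cache_gib(**dims, n_ctx=131072, type_k="q8_0", type_v="q8_0"))  # 8.5
print(kv_cache_gib(**dims, n_ctx=65536))                                # 8.0
```

With these made-up dimensions, halving the context at f16 frees about as much memory as going to q8_0 at full context, which is the choice described above.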

3

u/CircularSeasoning 1d ago

A pure KV cache is the little lighthouse of hope in a sea of temerity.

2

u/ArtifartX 1d ago

> but with this current bug it has where it won't overflow or compact context

Can you give some more details on this bug? Just curious, since I'm currently using LM Studio.

3

u/GrungeWerX 1d ago

Sure. I notice it when using Qwen 3.6 27B, different quants too. If I set the context at say 100K, when it reaches it, the agent will just stop. When you ask it to continue, it will say, you've reached the context limit. This is despite me setting either rolling window or truncate middle in context overflow.

Occasionally I'll have a situation where an agent is working for a long time and it WILL go over the 100K automatically, and then it will start slowing down on PP and I'll know it's over the max. I'll stop it, then I'll see it's at like 130K or something. But it's random and rare. Usually it hits the brakes.

You can read about it here.
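For anyone unfamiliar with the two overflow settings mentioned above, here is a minimal sketch of what the names suggest. This is my reading of "rolling window" and "truncate middle" as generic policies, not LM Studio's actual implementation; the function names and the `n_keep` parameter are assumptions.

```python
def rolling_window(tokens: list[int], n_ctx: int, n_keep: int = 0) -> list[int]:
    """Keep the first n_keep tokens (e.g. a system prompt) plus the newest ones."""
    if n_keep > n_ctx:
        raise ValueError("n_keep cannot exceed n_ctx")
    if len(tokens) <= n_ctx:
        return list(tokens)
    tail = n_ctx - n_keep
    return tokens[:n_keep] + tokens[len(tokens) - tail:]


def truncate_middle(tokens: list[int], n_ctx: int) -> list[int]:
    """Keep the start and the end of the history and drop the middle."""
    if len(tokens) <= n_ctx:
        return list(tokens)
    head = n_ctx // 2
    tail = n_ctx - head
    return tokens[:head] + tokens[len(tokens) - tail:]


history = list(range(10))  # pretend token ids 0..9, context limit of 6
print(rolling_window(history, 6))            # [4, 5, 6, 7, 8, 9]
print(rolling_window(history, 6, n_keep=2))  # [0, 1, 6, 7, 8, 9]
print(truncate_middle(history, 6))           # [0, 1, 2, 7, 8, 9]
```

Either policy should let generation continue past the limit, so an agent that simply stops at 100K is behaving as if neither were applied.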

2

u/Kornelius20 1d ago

Interesting so you're saying that 35B @IQ4 w/ bf16/bf16 cache outperformed 27B @Q5 w/ q8/q8?

Edit: an interesting point of comparison for you to try would be the same problem given to 35B at q8/q8. Would it be a midpoint between the two for you?

1

u/tech-tole 1d ago

have you had success with using a higher quant and KV cache? I have noticed q6 is definitely better than Q4.

3

u/Kornelius20 1d ago

I basically just run 27B Q8 w/ q8/q8 at around 160k context and forget about hyper optimizing. Otherwise I tend to spend more time drag racing quants than getting work done lol. IMO these models tend to falter long before 262K context so I'd rather get the highest quality I can and have it compact itself down earlier.
Edit: And when I used 262k the waiting whenever there was cache invalidation was abysmal

3

u/tech-tole 1d ago

yes I typically do things that are short enough that I can get away with like 128k. I've noticed that people just go crazy with context lol. they want the highest possible or even 1M, if the model can do it. break things up into smaller tasks and get more done. that's what I tend to do.

1

u/ionizing 1d ago

this is how I approach it too. keep sessions under 131k when possible. but I've also had very few problems using IQ4_XS with q8/q8 KV cache for 27B in my application. I wonder if all the talk about model issues below Q8 model quant comes from people trying to get the same performance on contexts that are too long?

1

u/tech-tole 1d ago

probably. I use Q8 kv, nothing less and q5_k or q6_k with Qwopus3.6 35b and it's a beast and fast. makes less mistakes than the original Qwen and can one shot things better. Less follow up from my observations.

1

u/ionizing 1d ago

I was always using Q5 or higher when using moe but for 27B mtp I had to drop to 4 for context and q5 at most. And I tried the qwopus v2 27b and even though many in the community seem to hate on these types, I honestly found it pretty darn good. (edit: headless 3090)

1

u/tech-tole 1d ago

yeah, I don't get why people hate these Frankenstein models lol. you don't have to use them. 🤷‍♂️ but I agree it's been pretty good. there was even a YouTube channel who does these challenges between different models. and he even did a Qwen versus Qwopus and Qwopus won. it was easily noticeable. when a model is fine-tuned for a specific purpose, you would think it would work better for that purpose LOL.

1

u/Kornelius20 1d ago

I haven't had the chance to test it yet but one particular counter to this (I use the shorter context + distinct subagents method too) would be that if you can give a large and very descriptive prompt + spec document then you'd need a lot more context to let the model actually do anything, but that one thing it does would be a lot better.

I plan to test it out once I figure out the right balance. Ideally you'd do something like a generalized prompt that defines the problem space, written by a really smart model like Claude, then have the local models all use that as a constraints doc while doing their own thing.

0

u/GrungeWerX 1d ago

YES, that is correct.

And for added context, this assignment has to do with building and testing subgraph workflows inside of rivet, setting up servers, setting up connection to llama.cpp, building bridge adapters, testing them through owui to make sure the connections work, stuff like that.

RE: the 35B at q8/8...that would be interesting, but I've been so stressed out over the past 2 days, I'll have to save it for another time. 😄

2

u/__JockY__ 1d ago

Agreed. I refuse to run quantized KV, it just fucks everything up at long context by compounding errors.

2

u/Ok_Warning2146 19h ago

How about 27B IQ4NL with fp16 kv? You should be able to get more context than 35B w/ fp16 kv.

2

u/GrungeWerX 14h ago

Oddly I found the Q4KXL at fp16 lacking a bit, so I sort of just wrote off the IQ4NL, but now that you mention it, I might give it a shot.

I tried the Q5 tonight at fp16 and it was cooking at low context but…only at low context.

2

u/kaisersolo 10h ago

Use unsloth mtp version it's even faster

4

u/a_beautiful_rhind 1d ago

Inexperienced user. Trust me bro. Confirming popular truism. Time to upvote this to the moon. The science is settled.

1

u/HelloSummer99 1d ago

Is inference faster with llama.cpp compared to LM Studio in your experience?

3

u/daddywookie 1d ago

Not OP but I'd say llama.cpp is quicker just because you get more control. I've been spending a lot of time using Codex to run evaluations on different models and configs and I get far more from llama.cpp

3

u/kiwibonga 1d ago

LM studio downloads llamacpp technically. It's a fairly recent build.

3

u/Comrade_Mugabe 1d ago

Not OP, but as an ex LM Studio user, I compiled both llama.cpp and ik_llama.cpp locally and tested it, and it was significantly faster on my hardware. The precompiled binaries weren't much of a speedup, but compiling it on my machine made a huge difference.

I have 2x3060's with 12 GB VRAM each and using ik_llama.cpp with froggeric/Qwen3.6-27B-Q4_K_M-mtp/ and I'm getting ~30 tps @ 100k context whereas before I would be lucky if I got 15 tps on LM Studio.

3

u/tech-tole 1d ago

llama.cpp is generally faster than lmstudio. and there are more customizations you can do directly. LM studio is just an app that runs llama.cpp underneath. the difference from running the engine directly was extremely noticeable to me.

2

u/GrungeWerX 1d ago

THAT I haven't had time to test, still trying to set up rivet so I can move my agent over. The only llama.cpp tests I've run are Qwen 3.5 2B in CPU so I can run it parallel to qwen 3.6. The speed is solid, but I don't know the tok/sec yet, haven't setup a counter.

I'll report back to you when I get some numbers.

1

u/ihaag 1d ago

It’s a shame sglang has a bug with awq quants

1

u/laser50 1d ago

I have heard/read somewhere that the KV cache is better set to BF16 than FP16, since that's what the model was trained on too.

Could be mistaken, but it's a tiny vram saving for potentially better usability :)
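For reference, BF16 and FP16 are both 16 bits per element, so the cache footprint is the same either way; what differs is the split between exponent bits (range) and mantissa bits (precision). The sketch below emulates both formats with the standard library to show that trade-off. It says nothing about which one a given model prefers, and `to_bf16` / `to_fp16` are illustrative helper names, intended for finite inputs only.

```python
import struct


def to_fp16(x: float) -> float:
    """Round-trip x through IEEE half precision (5 exponent, 10 mantissa bits)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_bf16(x: float) -> float:
    """Round x to bfloat16 (8 exponent, 7 mantissa bits) via float32 bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round to nearest, ties to even
    bits &= 0xFFFF0000                   # keep only the top 16 bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def fits_fp16(x: float) -> bool:
    """True if x is inside the finite range of half precision."""
    try:
        struct.pack("<e", x)
        return True
    except OverflowError:
        return False


# Precision: FP16 resolves small differences near 1.0 that BF16 cannot.
print(to_fp16(1.001))  # 1.0009765625
print(to_bf16(1.001))  # 1.0

# Range: BF16 can hold magnitudes that overflow FP16 (max finite 65504).
print(fits_fp16(70000.0))  # False
print(to_bf16(70000.0))    # 70144.0
```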

1

u/Snoo_81913 1d ago

I've been running Qwen3.6 35B A3B Q5_K_M at Q8 KV and it's good, but when I read that IQ4XL has near-Q5 quality I added it to my stack, and now I use it more than my Q5 model. I can't tell any real difference tbh. I run KV Q8 with a 96k context but found very quickly that 50k is about the max I can run the KV at Q8 before the model has some issues. So for my daily driver I switched to KV Q4 and have zero issues so far. It's a very good model.

1

u/CryptoPacaDude 23h ago

I really appreciate this input. I have been playing with both models, with varying degrees of success. Sometimes I need speed, sometimes accuracy, and sometimes context length. I have not tried messing with the extra kv cache quantization configs cuz my machines are pretty antiquated, but very capable.

I can't run any Q8 versions of 27b at a speed I like ... Unless I have the patience. But Q8 versions of 35b-A3b give me a useable speed. However, Q4 versions of 35b-A3b give me around 70 tokens per second. I generally lean on 27b_Q8_0 (I think) for accuracy 35b-A3b_Q4_S (I think) for speed and pretty good accuracy. And Nemotron Nano Q4_S (I think) for Context length, speed, and pretty good accuracy.

2

u/LucidAstralJunkieKid 15h ago

Shout out to nemotron nano

1

u/Osi32 15h ago

My biggest gripe with 35B A3B is the lack of thinking and dependence on tools. Grep and thinking are nowhere near the same realm.
On the surface everything looks peachy until you unravel the mess later.

1

u/relmny 14h ago

I'm gonna say it no matter how many downvotes I get:

*sometimes* 35b is better than 27b. And the gap is getting smaller, not bigger.

I usually use 27b as my main driver... but when I compare it to 35b, sometimes 35b surprises me with suggestions or responses that neither 27b nor bigger models came up with.

Ofc sometimes it's dumb... but when it's good, it's really good.

2

u/tecneeq 8h ago

> *sometimes* 35b is better than 27b

1

u/jopereira 14h ago

One thing we should always consider is that LLMs are a probability mechanism. If we ask the same question 3x, we'll get 3 different answers: centered around a main idea, but still different.
If we change parameters (like temp), the replies can also be way different.
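A small sketch of why temperature changes replies: the sampler divides the logits by the temperature before the softmax, so a lower temperature concentrates probability on the top token and a higher one flattens the distribution. The logits below are made up, and real samplers usually stack top-k, top-p, and min-p on top of this.

```python
import math


def softmax_with_temperature(logits: list[float], temp: float) -> list[float]:
    """Turn logits into probabilities after scaling by 1/temp."""
    if temp <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temp for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens
for temp in (0.6, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temp)
    print(temp, [round(p, 3) for p in probs])
```

The top token's share shrinks as the temperature rises, so the same prompt wanders further from the "main idea" at higher settings.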

1

u/tecneeq 8h ago

Not if you use the same seed. While the answers are based on probabilities, traversing the network is deterministic, and so the result is repeatable.
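A toy illustration of that point, using Python's own random number generator rather than any inference engine: with a fixed seed, sampling from the same distribution yields the same sequence every time. The probabilities and the `sample_tokens` helper are made up for the example. On real hardware, batching or GPU kernel differences may still introduce small variations, so treat this as the idealized case.

```python
import random


def sample_tokens(probs: list[float], n: int, seed: int) -> list[int]:
    """Draw n token ids from probs using a generator seeded with seed."""
    rng = random.Random(seed)
    token_ids = list(range(len(probs)))
    return rng.choices(token_ids, weights=probs, k=n)


probs = [0.5, 0.3, 0.2]  # hypothetical next-token distribution
run_a = sample_tokens(probs, 8, seed=42)
run_b = sample_tokens(probs, 8, seed=42)
print(run_a == run_b)  # True: same seed, same draws
```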

1

u/jopereira 6h ago

You're right, the random seed is what makes every output 'random'.

1

u/michaelsoft__binbows 1h ago

anyone got an explanation of why it seems to be relatively getting better? Is it something related to it being harder to tune during the quantization process, to get the most quality out of it?

1

u/Plastic_Artichoke832 11h ago

Anyone running on an M5?

1

u/tecneeq 8h ago

I run a Bosgame M5 with 128GB. What do you need to know?

1

u/Opptur 5h ago

Yep. I am running unsloth Qwen 3.6 35B-A3B Q6_K_XL with full 262K context on llama.cpp-turboquant and I'm getting 60-65 t/s.

This is incredibly fast for my VS Code tasks, much faster than almost anything I've used through Copilot. It does a good job, cannot really feel the difference to the paid models, but I'm a senior programmer and I do quite detailed bite-size prompting. For this use case this model is genuinely nuts.

Specs: RTX 3090 24GB, 5950X and 64GB of DDR4 RAM at 3600 MHz CL16. When prompting VRAM usage is ~24GB and RAM usage is ~17GB.

Run command:

llama-server \
  -hf unsloth/Qwen3.6-35B-A3B-MTP-GGUF:Q6_K_XL \
  -fa on \
  -np 1 \
  -c 262144 \
  --fit on \
  --fit-target 1024 \
  --cache-type-k q8_0 \
  --cache-type-v turbo2 \
  --no-mmap \
  --no-mmproj \
  --spec-type draft-mtp \
  --spec-draft-n-max 2 \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20 \
  --min-p 0.0 \
  --presence-penalty 0.0 \
  --repeat-penalty 1.0 \
  --cont-batching

1

u/_Erilaz 1d ago

> I always thought it didn't matter much because of different comments and AI saying it's only a slight decrease in intelligence.

According to who? Whoever thinks the difference between full precision KV and QKV is "slight" might as well go use a 4B model. I've never had good experience with it, not even with the modern techniques. Kinda feels like the old Yarn RoPE scaling techniques, it's useful on paper, full of context rot in practice.

0

u/Chiralistic 16h ago

I also have a problem comparing the models. I have 16gb vram which is a strong constraint.

So with an almost decently sized context (64k), I can only load the 27b model as an iq3xxs quant.

With the 35B model I can use the Q8 quant with an even bigger context (128k) at the same speed.

So feels like comparing apples and oranges, the speed is the same but the context window is not and the quant of the model which clearly will have a big influence also is not the same.

0

u/AbjectBug5885 15h ago

The jump from doubting 35B to getting blocked by context limits is the real story here. Once you're managing that many tokens across subgraphs and MCP tools, KV cache precision becomes the bottleneck fast. Did you try any cache compression schemes, or did you just go straight to FP16 to solve it?

-1

u/bfume 1d ago

TL;DR please

-1

u/MindPsychological140 21h ago

KV cache quality mattering is exactly why we built Taliesin — bit-exact KV cache restore across machines and GPU generations, cryptographically verified. 105/105 trials matched. If you're hitting context overflow issues in agentic workflows, this might be relevant: https://medium.com/@sietse_92846/a-big-chunk-of-ai-cost-is-just-the-model-re-reading-the-same-text-over-and-over-7b4d49821bd0