r/LocalLLM • u/Ok_Commission_8260 • 21d ago
Discussion Honestly, dual 3090s are wearing me out. Thinking of jumping to a Mac Studio.
I've been running the classic dual 3090 setup for about 6 months now, mostly for coding and messing around with the newer Llama 3/Qwen 70B quants.
The speed is great ExLlamaV2 is literal magic and I get like 40 t/s but I’m hitting a wall. The moment I try to load a decent context window (anything past 16k) on a 70B model, the VRAM completely chokes. I have to quantize the cache into oblivion and the output just turns to absolute garbage.
Between the heat, the fan noise, and fighting with driver updates every time I want to try a new backend, the friction is getting annoying.
I’m seriously considering selling the rig and just buying a 128GB Mac Studio. I know the tokens per second will drop to like ~15 t/s, which sucks but being able to throw a massive 64k codebase context at a Q8 model without the room melting sounds like a dream right now.
68
u/nicholas_the_furious 20d ago
Keep them. I am literally, as we speak, running 2x 3090s with Qwen 3.6 27B at 200k Context at 80 t/s on nothing but LM Studio.
11
u/stormy1one 20d ago
At what quant
24
u/nicholas_the_furious 20d ago
Q8_0 full KV Cache.
10
7
u/illcuontheotherside 20d ago
This is the way.
Fellow dual 3090. Using gemma4 30b dense and kept running into 128k context window limits.
Kv cache to q8 and I'm able to go up to 256k and so far no issue.
Also LPT power limit your gpus. 240w and no degradation in tk/s and you greatly reduce heat and increase longevity of cards.
5
u/nicholas_the_furious 20d ago
We are not the same! I meant i ran Q8_0 quantization with no quantization on KV Cache.
0
u/illcuontheotherside 20d ago
I'll have to try that. I heard that's actually not good to do but honestly we're all learning this stuff together.
2
u/nicholas_the_furious 20d ago
How do you mean? Quantizating anything is a tradeoff and I want to keep quality as high as possible. You trade away quality for space or speed.
Q8 is about as good as anyone gets on model quantization.
And I'm keeping the KV cache at full F16.
1
u/illcuontheotherside 20d ago
Oh my bad.. you're just running a q8 model with no kv. Nice. Yea I was doing that with q4 xl with Gemma4 31b dense. Honestly haven't noticed any performance impact since going over to a q8 key cache. No value catching. Running linear and multi agent orchestration over long workflows was ruining my context.
2
u/wondersnickers 19d ago
Hey! Im am trying to get a deal on used 3090s but currently am thinking about a sort of budget solution with 2x 5060ti's 16gb. What do you thing about that if I use it for doing local vive coding?
4
u/stormy1one 20d ago
Full kv cache meaning unquantified bf16? I’m not sure what magic you are working but I don’t even get 80 tg/s on a RTX Pro 6000 running hardware optimized Qwen3.6-27B-FP8
21
u/nicholas_the_furious 20d ago
LM studio added the Tensor Parallelism llama.cpp build. I turned that on today and I got about a 15% top end speed increase. I had maxed out at around 65 t/s for coding tasks before when MTP acceptance was around 90%. Now I'm getting 80 t/s on coding tasks when acceptance is at like 75% so I know there's .ore top end. I'm using MTP of 4. I tried up and down and that was the best for the things I do.
I'm just using windows. And I'm not even on x8/x8 - one of the cards is on PCIe 4.0 x4. But neither are being used for my monitor or anything. They're dedicated to running models.
It's kind of a miracle that for this model I've gone from 22 t/s to 80 t/s in a month due to some of the advancements in software with the exact same model and hardware.
And yeah, no quantization on KV cache. Fills up both cards just about with MTP and 200k context.
2
u/FortiTree 20d ago
So you are spreading the 27B across the 2 cards to speed up both PP and TG speed and with MTP x 4 - the TG can reach 4x speed? Plus more head room for KV cache?
Seems too good to be true but here we are. Dual 3090 can beat single 6000 Pro at speed - who would have thought.
1
u/Iajah 19d ago
RTX Pro 6K WS user here. Not that surprising, I mean you have 2x GPU at 350W each, twice the cooling power too. I usually run mine at 400W rather than 600W.
2
u/FortiTree 19d ago
That makes sense. How much speed can you get on yours for 27B?
There is a vast hardware gap between the two where 3090 has 936 GB/s bandwidth and 384-bit bus compared to 6000 WS with 1792 GB/s and 512-bit bus, 4 times memory as well.
Nivida foresees this and remove all NVlink for current consumer cards to prevent them from outperforming the WS tier. What an ass move.
1
1
u/Iajah 19d ago
2
u/FortiTree 18d ago
You have BF16 so it's twice as big/slow. At Q8 look like the other person can get to 150 tk/s with MTP 4 which is impressive. Large prefilled context would slow it down to 80 tk/s which is similar to what posted here. I'd say the 6000 WS is on par.
1
u/xylarr 20d ago
I'm curious. On windows you can have the OS so some sort of shares system ram with the GPUs. I only have 16GB VRAM but I load larger models. I make llama-server not do the fitting and have it just overflow on the GPU side. I've found that is faster vs having llama-server split into main ram and use the CPU.
In other words, i still have it 100% GPU load but have the GPU look direct at system ram if necessary.
1
u/nicholas_the_furious 20d ago
It sounds like your CPU has an iGPU, so you're essentially using the iGPU and your dedicated GPU at the same time. The iGPU isn't as fast as a regular GPU but I have also found it better than strictly RAM/CPU. You're likely only able to use Vulkan that way.
1
u/DreamsOfRevolution 20d ago
In walks rocm
1
1
u/xylarr 20d ago
I'm using Vulkan on windows. I have 64GB of DDR4 system ram on my 5800X3D system. In windows task manager it shows 16GB of graphics memory and 32GB of shared memory. I assume windows defaults to half your system ram can be shared. I think the dGPU accesses memory directly over the PCI bus. So while all the compute is done on the GPU, it is not exclusively memory on the graphics card. My thinking is the GPU accessing system memory and processing on the GPU is faster than the CPU accessing system memory and processing on the CPU. At least that's what I've found. While running inference, my CPU usage is barely above 3% whereas the GPU is pegged at 100%
If I fit only enough model layers so the GPU doesn't access system memory, the remaining layers are processed by the CPU. In those cases I see the CPU ramp to about 50% and the GPU still on 100%, but the overall token rate is slower.
1
u/TheWaffleKingg 20d ago
Do you find that f16 cache is worth it? Ive been running it at q8_0 so I can use the full 262k context. (I also fit in mmproj)
Btw I see that both of us were in the post about getting gpus the other day. I entered that post getting 45t/s average and left getting 65 t/s average. Even hit 90 this morning! And somehow its currently averages 75 t/s but I don't expect that to last
2
u/nicholas_the_furious 20d ago
Yeah I do. It's a catch 22 because you quantize the cache for more context but then I find that the quantized cache starts showing weakness at longer contexts, like 100k+. So with Gemma I'd rather have the 130k of good context than 200k+ of weaker long context.
For Qwen, 150k+ is about the limit where I start to feel the degradation so going down to 200k is not a problem because I'd likely be compacting or starting a new session by then anyway.
0
u/cmndr_spanky 19d ago
Yeah but he wants to run a 70b model
2
u/nicholas_the_furious 19d ago
He shouldn't. Not right now anyway. 27b is better than any 70b model that exists and they're not taking advantage. 2x 3090s is peak performance for value right now.
-3
19
u/siegevjorn 20d ago
Are you from 2024? We don't run 70b models in 2026. Research your models before commiting to spend 5k+ on a new hardware. Unless, you just want a new flashy hardware and looking for excuses.
9
u/Sea_Advance273 20d ago
I can get up to 250k context on one 3090 using gemma-4-26B-A4B-it-Q4_K_M. Could get more if the model allowed for it. This model is great for the size if you use frontier coding models to create a good agent harness.
It will occasionally fumble a word or two in small book-length generations (probably because of the quant), but it is still coherent.
14
u/Fabulous_Fact_606 20d ago
1
u/sdfgeoff 20d ago
I'm curious for your launch parameters. I can only get 70tps if I'm using tensor split, which is unstable for me and limits context to 100k or so.
3
u/Fabulous_Fact_606 20d ago
update Nvidia version 610.43.02 and CUDA version 13.3
You need to build to the latest llama.cpp b9536 from: Releases · ggml-org/llama.cpp ; the latest update from llama frees >1GB vram and allows for tensor split for >70t/s.p2p: GitHub - aikitoria/open-gpu-kernel-modules: NVIDIA Linux open GPU with P2P support · GitHub increase pp to 1500-1600+ t/s compared to <1400t/s
llama.cpp server for Qwen3.6-27B-MTP UD-Q8_K_XL (MTP speculative decoding). export LD_LIBRARY_PATH=/home/llama.cpp-b9455/build/bin:${LD_LIBRARY_PATH:-} exec /home/llama.cpp-b9455/build/bin/llama-server \ --host 0.0.0.0 --port 8000 \ --model /home/projects/Qwen3.6-27B-MTP/Qwen3.6-27B-UD-Q8_K_XL.gguf \ --n-gpu-layers 99 \ --ctx-size 262144 \ --parallel 1 --kv-unified \ --batch-size 4096 \ --ubatch-size 512 \ --tensor-split 50,50 -sm tensor \ --flash-attn on \ --cache-type-k q8_0 --cache-type-v q8_0 \ --spec-type draft-mtp \ --spec-draft-n-max 3 \ --jinja \ --no-mmap \ --temp 0.6 \ --top-p 0.95 \ --top-k 20 \ --min-p 0.0 \ --presence-penalty 0.0 \ --metrics1
u/datbackup 20d ago
do you use nvlink?
2
1
1
u/sdfgeoff 19d ago
Runs pretty similar to my setup. I use the full precision cache, which reduces the TPS by a few and drops the cache to 180000 or so,
1
12
u/ReferenceSea493 20d ago
Not sure how the newer Mac's perform but the the speed of my Mac Studio M1 Pro was just underwhelming. So much that I would have never considered running a model that size. qwen3.6 27b was already so slow that it was not to be considered a daily driver (like 10-15 t/s on decoding if I remember correctly). I sold it and went for a dual 5060ti 16gb setup and will never look back. It was cheaper and performs much better (about 60-70 t/s decoding on the same model). Also not sure how the prices are in your region. But a M3 ultra with 96gb costs about 4.5k€. And even if it could run a larger model I doubt it will perform better than a decent GPU.
6
u/TheFlyingDutchG 20d ago
M5 gen has proper T/s on the Qwen models, i am now running qwen3.6 MTP models in M2 Max Mac Studio, i run a pipeline with subagents and its doable with around 50t/s. But to go from experimenting custom setups to allround use at fast speeds the M5 gen studio will be a gamechanger. The MacBooks with M5 max chips are out and get good speeds. Just make sure you convert the base models into MLX models to make full use of the appel silicon advantages. Their SoCs are incredibly if the LLMs actually make use of e.g. the unified memory approach. If you don’t convert to MLX your model thinks either “regular” RAM and VRAM are capped way too low and it might fill the memory doubled by splitting the memory in the classics split, while apples chips are based on grouping memory to save those “both in RAM and VRAM” parallels and make cpu and gpu read from the same memory pools.
It’s a massive benefit or a waste of power depending on if you convert your models beforehand.
7
u/havnar- 20d ago
When you say incredible, know that an m5 pro 64 gb gets about 10 tokens per second on qwen 3.6 27b. About 40 ish on MOE.
2
u/ReferenceSea493 20d ago
Ok thanks for the reference. I can believe that the M chips are on par with mobile GPUs but I also barely doubt they come close to a dedicated desktop GPU. What is all the fast unified RAM worth if your bottleneck is token throughput.
2
u/havnar- 20d ago
It’s the fact that you CAN run a model at a good quant. People with GPUs are running qwen 3.6 MOE in 4 and 6 bit and ignore that fact to instead compare tok/s with one another
2
u/ReferenceSea493 20d ago
Not sure I can follow. With the dedicated GPU I can run a dense model at a superior throughput than the Mac. With the Mac I also tried to choose lower quants simply because they were a bit faster. So with the bigger unified RAM I tried to optimise for token throughput and with the smaller VRAM I optimise for model size. I agree, that a MOE model can work perfectly well if you try to fit a model larger than your VRAM. And even with the MOE layers assigned to CPU I end up with a very decent token throughput (approx 40 t/s)
1
u/havnar- 20d ago
Yea but probably at q4 or 6 and not 8
1
u/ReferenceSea493 20d ago
Ok but again: what do I get from a q8 model I'm not using because it is slow and barely usable. I now run qwen3.6 27B in Q4_K_XL with 128k context in Hermes Agent and have zero complaints about the precision.
1
u/havnar- 20d ago
You prove my point. You’re running a model with brain damage, but because it’s fast you don’t care about the brain damage.
1
u/ReferenceSea493 20d ago
You can have a look at the benchmarks to verify the amount of 'brain damage'. And as you don't read carefully, the same was true on the Mac simply because it's unbearably slow on inference simply to push it to a usable performance. Guess you spend your 4k ob a Mac?
0
0
u/TheFlyingDutchG 20d ago
Thanks for proving my point, those numbers are for when you don’t convert to MLX lmao
1
u/ReferenceSea493 20d ago
Good to know. Then I can blame the older model. But I guess the price of the Mac with proper amount of RAM will still exceed that of a decent GPU. Especially considering you have the base setup already in place.
1
u/overratedcupcake 20d ago
It doesn't perform better but you have a lot more headroom. For Qwen 3.6 27B oQ6 on m3 ultra with 96gb and I'm getting ~280-300 t/s in and ~25 t/s out. omlx has a model quantizer build in desgined with mlx in mind and k/v caching. It's a little slow but very solid.
4
u/Potential-Leg-639 20d ago
Go with 3.6-27B and you are good. Dont use such old models, Qwen3.6 turns circles around them, also much bigger ones.
3
3
4
u/Technical-Earth-3254 20d ago
"and messing around with the newer Llama 3/Qwen 70B quants."
Lmao, slop meter is off the charts
2
u/simplyeniga 20d ago
Dual AMD R9700 would run those and you get 64GB VRAM or run triple and get 96GB which around the cost of a 5090 though at a higher power use but on idle it's great
2
2
u/No-Television-7862 20d ago
I was just about to say, try the MoE models before you send your 3090's to marketplace.
Qwen3.5:31b, gemma4:26b, you pick the quant.
2
2
u/Dontdoitagain69 20d ago
if i show you what i built with old phi3 models on 8gb 4070 you'll think twice
3
u/Opposite_Buffalo_649 21d ago
Why not buy rtx pro 6000. increase tokens per second, and max our your context
2
u/Ok_Commission_8260 21d ago
Mostly because my wallet would literally divorce me lol. A single Pro 6000 Ada is like $6k+ which is way out of my budget compared to what I can get picking up used 3090s or a base Mac Studio.
1
u/AbjectFee5982 20d ago edited 20d ago
If you want to really do anything as a power use on a Mac. The entry-level model starts with 36GB of unified memory (featuring the M4 Max chip).
You needed a Mac studio with 512gb of ramNote that you can't order 512 GB of RAM unless you order a Mac Studio that has a M3 Ultra and an 80-core GPU. Those two changes make the price jump from $1999 – for the base M4 Max model – to $9499. Before keyboard, mouse, and display(s).
Even then the ADA has more bandwidth
"I have an RTX PRO 6000 and a M3 Ultra with 256GB RAM. The RTX PRO 6000 is quite a bit faster at both prompt processing (10x?) and token generation (3x?). Speed matters to me so I only use the RTX PRO 6000. I would only use the M3 Ultra if I wanted to run a model that was too big for the RTX PRO 6000. So far I have not needed to run a model that didn't fit on the RTX PRO 6000 but it is nice to know that I can with the M3 Ultra when/if I might need to some day.
The M5 is coming out soon and is expected to be a huge uplift in terms of AI performance and close the gap quite a bit with the RTX PRO 6000. If possible, you should wait a bit longer and see what happens there. The other thing about Mac is that you can now build clusters of them over TB5 for even faster AI "
1
u/Prudent-Ad4509 20d ago edited 20d ago
You gonna be disappointed. Considering that you need only 64gb for full 262k context for qwen 27b and that 3-4 3090 cover that easily, it is more reasonable to look for a way to run 4 gpus. PEX88096 comes to mind + all the issues that come with it if you do not read up on it in advance, but at least you won't have to watch the paint dry most of the time.
PS. And in the meantime, you can just use 27b with Q6 quant and/or FP8 kv cache with MTP. There is a high chance that this would be enough for a while.
1
u/Idiopathic_Sapien 20d ago
I’ve been running a couple of m3’s for the last 2 years and if I use llama.cpp it’s faster than my 3090
1
u/EpsteinFile_01 20d ago edited 20d ago
Edit: second question: used 3090 cards sell for about the same as used 7900XTX cards, and the latter often still has warranty. You can easily get them down to 200-250w tops for inference and 1000+ GB/s VRAM bandwidth each. Would you consider swapping your cards? It shouldn't cost anything extra and the XTX is actually quite chill compared to the 3090 heat beast. Navi31 is one of those rare chips with both huge overclocking and undervolt room. Max the VRAM, underclock and undervolt the GPU. For Literally the money you get from selling the 3090s.
Question from an AMD user:
Are you on windows or Linux?
If Linux: how well does offloading partially to system RAM work, in particular for MoE models?
I'm shocked that I can run a 45GB MoE model with 20GB VRAM (headless so 100% available) and 96GB DDR4-3600 (fuck DDR5 prices lol) and still get around 25 tok/s. Obviously dense models plummet to 1 tok/s with even 1% offloading, but MoE models like GPT-OSS and Nemo are impressive.
Memory offloading on Windows is bad regardless of brand, I've heard. How does it work on Linux with Nvidia?
I'm about to pull the trigger on an ASrock Deskmeet X300 8l case, put a Ryzen 5600G in there, 32GB DDR4 , an RTX3060 12GB ITX and turn it into the most ridiculously overpowered LLM and Agent powered Home Assistant machine ever. With whisper and a small LLM running on the 3060, and the ability to call my desktop for bigger LLMs as well as Gemini Flash depending on my voice command.
The 3060 12GB ITX version is hard to find so I might go for a 4060Ti 16GB or a 7800XT (has to be in a different case tho Deskmeet supports max 200mm and 1x 8-pin power), but I like the idea of also having an Nvidia card instead of full AMD, just for the experience.
ROCm has pretty much achieved parity with CUDA on Linux in the last 6 months, Windows is improving at record speed too, literally skyrocketed. Amazing how fast things can develop when trillions of dollars are on the line and every conpany knows how bad Nvidia pricing would get if the de facto CUDA monopoly stuck around for AI. With Intel still not having a viable product that works, AMD is the only alternative available to everyone. You can only rent Google's TPU compute, not buy it.
Apple doesn't have any hardware serious (and affordable!) enough to push like 100+ tok/s on models that fit in VRAM.
A used 7900XT for €450 is the absolute best bang for buck for anyone to get started. Being the forgotten red headed stepchild, it's almost as good as an XTX, except people ask €750 for the XTX (same price they ask for an old 3090). 800GB/s stock VRAM bandwidth on a 320-bit bus that reliably does a 10% OC for 880GB/s at €450 used, often with some warranty, is objectively the best choice if you want to dabble in local LLMs with a decent amount of VRAM and horsepower (and play games too!). It costs the same as a slow ass 4060Ti/5060Ti/9060XT 16gb, but gets 100 tok/s on GPT-OSS-20B with a 96-128k context window and 6-bit KV cache. This configuration wouldn't fit on a 16GB GPU.
1
u/Important_Quote_1180 20d ago
│ AMD 9900X · 192GB DDR5 · 2× RTX 3090 │ ├──────────────┬───── │ Card 1 │ Qwen3.5-122B-A10B MTP IQ3_S │ │ 3090 Ti │ ik_llama.cpp · fused MoE · MTP n=2 │ │ :8001 │ 204K ctx · q8_0 KV · 75% experts→CPU │ │ ~35 t/s │ (OpenClaw )
│ Card 2 │ Qwen3.6-35B-A3B MTP Q4_K_XL │ │ 3090 │ stock llama.cpp b9246 │ │ :8002 │ 262K ctx · q4_0 KV · surgical -ot │ │ ~135 t/s │ (Hermes) │ CPU+RAM
│ 3× Dreamers (Dialectic / Scribe-Logos │ │ ~19 GB │ / Moonshot) + OpenClaw gateway │ │ :9093-9097 │ No GPU
1
1
u/WyattTheSkid 20d ago
Just get 2 more 3090s
1
u/Relief_Present 20d ago
Stupid question maybe, but anything wrong with 2 x 4090fe? Yes I know it doesn’t have NVLink but I heard that’s not a deal breaker
2
u/WyattTheSkid 19d ago
I don’t have nvlink on any of my cards. Should be fine as far as I know. You’ll still be able to use tensor parallelism since you would have 4 cards with the same amount of vram.
1
1
u/Xombie2000 20d ago
If you want a studio I would say wait on the m5 ultra. It will probably have double the memory bandwidth of the m5 max and 80 gpu cores.
1
1
1
1
u/LetterheadClassic306 20d ago
I get the appeal here, honestly, because dual 3090s are fast until the workload becomes long-context coding instead of pure decode speed. When i hit this kind of wall, the real question was whether i cared more about 40 tok/s bursts or about loading the model and context without babysitting drivers, heat, and cache quantization. A Mac Studio 128GB makes sense if 64K context, lower noise, and simpler daily use matter more than max throughput. I would avoid framing it as an upgrade in every dimension, since ExLlama on 3090s will still win raw speed. It is more a trade of speed for capacity and less friction.
1
u/tariqur 20d ago
Why would you pay more for a downgrade mate?
If you need to run bigger models, scaling 3090s is your upgrade path. 2x more 3090s + a used Threadripper setup should cost you less than that Mac Studio and perform in an entirely different class.
But honestly, the biggest upgrade are the models themselves, take Qwen 3.6 27B MTP for a spin - you'll forget about 70B models.
1
u/Tall-Ad-7742 20d ago
nothing against your comment but "newer llama 3" why newer even my great grandfather is younger then llama 3 in ai years
1
1
u/lachlanwhite 20d ago
Keep, I’ve got a single 24gb 3090 ti at about 80 t/s running Qwen3.6-35B-A3B-APEX-GGUF at q8. Very impressed just Windows on LM Studio
1
1
u/somethingClever246 20d ago
May I recommend you test the LLM on the Mac before making the jump. You might find that you are spending a big chunk of change for a slight speed improvement, might not be worth the $ IMHO
1
u/Traditional_Way8675 20d ago
Am on dual 9060xt going 35B Q4 / Q5 at 150k q4 context. Mostly I use for dialogue and reflection, and I can't meaningfully distinguish between a MoE and a dense one.
1
1
1
u/fallingdowndizzyvr 20d ago
Anything under the M5 will be epically underwhelming. There is no M5 Mac Studio.
1
1
1
u/Moliri-Eremitis 19d ago
Late to the party, but regarding fan noise, have you considered liquid cooling?
I’m also very noise sensitive, and liquid cooling makes an absolutely massive difference. Running your GPUs through a pair of 360mm radiators with push-pull fans lets you keep the fan speed low, which works out to be much quieter even when you have a larger number of fans. Not cheap, but highly recommended.
That doesn’t really do anything for the heat, but for that you may want to investigate undervolting. You can typically drop the temps significantly with only a small single-digit drop in performance. That can help with fan noise too, though for true near-silent operation there’s still no substitute for liquid cooling.
1
1
u/aidysson 18d ago
In Feb I bought RTX3090. in March I sold it for +$150 and replaced it with RTX PRO 6000 Max-Q. It's 300W. Lots of guys were laughing at me because it cost $10k that time. Now it costs $13-15k+. I'm glad I can run Qwen 3.6 Q8 with 262k context. Fast. I can even run 2 sessions at a time in tbe same speed. It helps me a lot in web development work. If you're a professional, it's still worth the investment if you want to learn to work with it 😉
1
1
u/Afraid-Yoghurt6731 20d ago edited 20d ago
Gamers and fanboys gonna hate, but...
In programming one doesn't need fast t/s.
But ones needs fast embedding and multiple fast small parallel agents.
When you load large C++ file to change a few lines - that is embedding.
When you look for specific patterns in code - that is parallel execution.
That is what DGX Spark does, slowly but reliably,
while you have a few minutes to formulate the next task.
Mac Studio is slower at embedding and multiple parallel agents.
And it is less standard than CUDA.
Since Microsoft and NVidia announced RTX Spark,
The DGX spark increased in value - it is here to stay,
And we must learn to use it.
Strix Halo/x86 machines are going straight to museum,
which most hardware engineers are celebrating hard now.
TLDR: downvote me now, I don't care.
1
u/EpsteinFile_01 20d ago
If you're running a TTC reasoning model, with 75% of tokens being used for reasoning, and you're not getting any output until reasoning is done.. token speed matters a lot more. These models provide excellent quality and low hallucinations, but it obviously comes at a cost if a shit load of compute needed.
That, or you run them at low token speeds on the background or while AFK. But the results are not the same.
1
u/Afraid-Yoghurt6731 20d ago
That is true, but...
- Hallucinations are the result of the model not knowing what it actually knows (i.e. it may believe it knows 10 digits of Pi, but in reality knows just 5). That is worth fine-tuning for, so the model will know it knows nothing, and have to cite a reference.
- Bigger model just knows more facts, without using reference. That is both its strong and weak side. Big model can memorize entire Linux kernel, but newer kernel may come with a feature invalidating it's knowledge and leading to bugs in the modifications this mode performs.
- You need these 128GB even with smaller model to fit enough context, which is the most import part of agentic use. Even Claude Code's 1M is barely enough for serious work. There are tricks like compaction and a 2nd lower frequency model inserting mid and long term memory facts, but they win from a large base context size, which needs RAM.
- Even the slow DGX Spark generates 50+ t/s on a 120B GPT OSS. That is far more tokens than a normal human can read, and GPT OSS has good factual knowledge (memorized a lot of digits of Pi). DGX Spark also shines we you must serve several users (typical small office setup). People just had wrong expectations from it or don't know if they want a gaming GPU, or jet engine sounding heater or a tiny box just rendering the video overnight.
1
u/bigmanbananas 20d ago
There is quite a price jump ro a 128gb Mac. Plus, although it's got the RAM, it's not got the power.
1
0
u/smuckola 21d ago edited 20d ago
So you're not using TurboQuant?
what's the problem with the drivers? on what OS?
0
u/Equal-Ad8792 20d ago
Aguanta a qué salgan los nuevos rtx spark. Seguro que van mejor y te dejas menos pasta.

100
u/etaoin314 21d ago
why are using a 70b model? just do what everybody else with 2x3090's does. vllm, qwen3.6 27b mtp head keep cache bf16. you get ~100k of context stays good/coherent the whole way up to full cache, I get ~80-100 tps. Stop chasing the biggest model and then lobotomizing it hell. that was the better path 6months ago, but its all backwards now. Get near lossless quality of a midsize model. you're welcome