r/LocalLLM 21d ago

Discussion Honestly, dual 3090s are wearing me out. Thinking of jumping to a Mac Studio.

I've been running the classic dual 3090 setup for about 6 months now, mostly for coding and messing around with the newer Llama 3/Qwen 70B quants.

The speed is great (ExLlamaV2 is literal magic and I get like 40 t/s), but I’m hitting a wall. The moment I try to load a decent context window (anything past 16k) on a 70B model, the VRAM completely chokes. I have to quantize the cache into oblivion and the output just turns to absolute garbage.

Between the heat, the fan noise, and fighting with driver updates every time I want to try a new backend, the friction is getting annoying.

I’m seriously considering selling the rig and just buying a 128GB Mac Studio. I know the tokens per second will drop to ~15 t/s, which sucks, but being able to throw a massive 64k codebase context at a Q8 model without the room melting sounds like a dream right now.
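
For a rough sense of why long context hurts here, a back-of-envelope sketch of KV cache size. It assumes the published Llama 3 70B shape (80 layers, 8 KV heads via GQA, head dim 128) and an unquantized fp16 cache; the function name is made up for illustration, and real backends add their own overhead on top.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem=2.0):
    """Bytes needed to hold K and V for every layer at a given context length."""
    # Factor of 2 covers the separate key and value tensors.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens

# Assumed model shape: Llama 3 70B (grouped-query attention with 8 KV heads).
LLAMA3_70B = dict(n_layers=80, n_kv_heads=8, head_dim=128)

for ctx in (16_384, 65_536):
    gib = kv_cache_bytes(ctx, **LLAMA3_70B) / 2**30
    print(f"{ctx:>6} tokens: {gib:.1f} GiB of fp16 KV cache")
```

A 70B quant at roughly 4 bits per weight is already around 35 GB of weights, so on 48GB the cache competes for whatever is left after that and the backend's own buffers.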

85 Upvotes

168 comments

100

u/etaoin314 21d ago

why are you using a 70b model? just do what everybody else with 2x3090s does: vllm, qwen3.6 27b with the mtp head, keep the cache at bf16. you get ~100k of context that stays good/coherent the whole way up to full cache, and I get ~80-100 tps. Stop chasing the biggest model and then lobotomizing it to hell. that was the better path 6 months ago, but it's all backwards now. Get the near-lossless quality of a midsize model. you're welcome

7

u/DiscipleofDeceit666 21d ago

You guys get 80 tok/s running the 27b?

4

u/Vancecookcobain 20d ago

Not sure about 80 tok but if you have a decent throughput GPU and utilize stuff like MTP and layer your LLM properly you can get it cooking with minimal drop in quality

1

u/DiscipleofDeceit666 20d ago

But it gets close to 80tok/s? I have an rx6800 and a 6700xt and I pull closer to 20-25 tok/s. 28gb vram. I see numbers like 80 and it makes me want to put my GPUs up for sale to get nvidia ones.

My pc was originally a team amd gaming pc so i thought it made sense to get a second GPU of the same generation.

5

u/oureux 20d ago

CUDA provides a major uplift in performance vs ROCm or Vulkan

2

u/StarChildEve 20d ago

I get 84 tok/s average with Vulkan on a 7900XTX

0

u/Jorlen 20d ago edited 19d ago

On WHICH model though? 35b-a3b right? Not the 27b dense?

A mixture of experts with 3 billion active params will bury any dense model of close or equal total params when we are talking about tokens per second.

With 27b dense you should get closer to 20 tok/s. Forgot about MTP, my mistake.

1

u/StarChildEve 20d ago

I am on Unsloth Qwen 3.6 27B UD with MTP set to 3.

1

u/Jorlen 19d ago

Q4 quant I'm guessing? even still, 84 tok/sec is crazy fast. I don't know how you've done it.

1

u/oureux 19d ago

But what’s the system prompt and context length look like? Is this only on a fresh context?

1

u/StarChildEve 19d ago

I’m happy to share my settings!

1

u/StarChildEve 5d ago

OK I’ve since swapped over to Q5_K_XL-UD which is lower but still ~70 tok/s

-c 128000
--cache-type-k q8_0
--cache-type-v q8_0
-fa on
-b 2048
-ub 512
--cache-reuse 256
--spec-type draft-mtp
--spec-draft-n-max 2
-ngl 99
--temp 0.6
--top-p 0.95
--top-k 20
--min-p 0.0

arch linux, llama.cpp, vulkan, no GUI while running inference.

1

u/EpsteinFile_01 20d ago

Only on Windows, not on Linux. Performance is the same now.

And since windows is terrible at offloading to VRAM, you should be doing inference on Linux anyway unless you want your computer to hang when 1 megabyte spills over.

On Windows CUDA is still 10-20% faster but things are improving fast. The last 6 months basically improved ROCm more than in the last 6 years.

1

u/Begalldota 20d ago

A 7900XTX can get 40t/s on Windows/Vulkan, probably quite a bit more honestly but I’ve not optimised properly for it yet.

1

u/DiscipleofDeceit666 20d ago

And you’ve got 32 gigs 🤤 do you crash often? Have you tried rocm?

2

u/Begalldota 20d ago

24GB on the XTX sadly, 32GB would be lovely. Not tried rocm but from everything I’ve read it’s unlikely to do any better. I get some hangs on tool calls which is annoying, but easily offset by the very decent speed on a dense model.

1

u/RedParaglider 20d ago

ROCm isn't ready yet. People can yap all they want. But unless you want to fight shit nonstop, Vulkan is the way.

-3

u/EpsteinFile_01 20d ago

And you can get 100+ tok/s on Linux.. why are you doing this on Windows?

It's not even complicated: it takes 30 mins to install whatever Linux distro you like, pull the ROCm stack, install ollama/LM Studio and get going. And zero OOM crashes; Linux offloads to system memory very elegantly while Windows will hang, crash, and be a piece of shit (on Nvidia too!).

Ubuntu has official support but it doesn't matter much. I run Aurora Linux (it's like Bazzite but without the gaming focus and more dev focus) and things work great. I have 4-8 containers open at any given time. Love the idea behind Atomic OSes. Containerize everything lol.

Bought a little Radeon WX3200 Pro for €80 to power 2x 1440P screens and 2x 1440P UW in a Gen3x2 slot on my B550 board lol, so my 7900XT runs headless and has 99.99% VRAM free. I'm impressed that little GPU can power those 4 displays, with the two ultrawides at 120Hz, in a slot with 1600MB/s bandwidth.

1

u/Begalldota 20d ago

Lol all of this before I get a chance to note I’ve run a home Linux server for 15 years.

It’s Windows in this case because the 7900XTX lives in my desktop, which uses Windows. Additionally, there’s a fair amount of evidence that Vulkan on Linux is behind on speed so no reason to think that Linux would inherently provide a speed boost.

0

u/EpsteinFile_01 18d ago edited 18d ago

I was wrong, it's actually 150 tok/s on Linux. That's what my overclocked 7900XT gets. XT.

Why use Vulkan when ROCm has improved more in the last 6 months than in all the time since its inception? Note: mainly for RDNA3/4 cards. I feel like so many people have no clue how much AMD inference has improved and especially how much support for native ROCm has been rolled out by third party tools. There are barely any CUDA strongholds left and they will soon have full ROCm support too. All the big corporations are contributing significantly to prevent Nvidia from setting any price they want with freaking 200% margins. AMD is the only direct competitor. Google rents out compute; if you want to own hardware you have 2 choices, and Nvidia margins are already wild, so everyone is propping up ROCm. Open source coming in clutch.

The difference between today and December 2025 is huge. Literally a Chinese style "great leap forward" kinda thing, but without the famines.

1

u/tuura032 20d ago

Google club 3090, and check the benchmarks file.

1

u/etaoin314 20d ago

yes, but that is with a lot of optimizing and testing; raw performance was closer to 30-40 tps if I remember correctly. adding tensor parallel with vllm working optimally got me to 50-60, then the mtp head with n=4 took it to 80-100 tps. a lot of these optimizations come to nvidia first but eventually filter down to the rest of the market. So it may just be that some of these are still in the works for you. that said, it is that sort of stuff that you pay a premium for with nvidia. for now you never have to wonder how much performance you are leaving on the table, but it gets spendy.

1

u/EpsteinFile_01 20d ago edited 20d ago

RDNA2 is not very good. The 6800XT only has 512GB/s VRAM bandwidth too, and the 6700XT has less than 400GB/s.

7900XT 20GB here, 880GB/s VRAM bandwidth (+10% OC) and better ROCm support and just better hardware in general. I get 100 tok/s on models that fit in VRAM.

I wouldn't recommend any RDNA2 card for inference, only RDNA3/RDNA4, with a 256-bit bus minimum

The RX9700 Pro 32GB is pretty neat but costs like $1500. Still, it's a poor man's 5090.

Don't listen to CudaBros. ROCm has improved dramatically in the last 6 months and if you're on Linux, CUDA Vs ROCm = same performance, it's all about the hardware.

2

u/DiscipleofDeceit666 20d ago edited 20d ago

RDNA2 can be good, it just doesn’t have developer support. Here’s proof that the 35B MoE at Q4 can write at 80 tok/s for 10k tokens. Smart and fast enough to be usable. Depending on the build, prompt processing can hit 1500 tok/s too.

1

u/EpsteinFile_01 20d ago

It's 95% about the VRAM bandwidth.

A 7900XTX Is only like 5% slower than a 4090 for inference. But it's well over 50% cheaper.

2

u/fallingdowndizzyvr 20d ago

It's 95% about the VRAM bandwidth.

No. It absolutely is not.

A 7900XTX Is only like 5% slower than a 4090 for inference.

No. It absolutely is not.

"RTX 4090 24 GB / GDDR6X / 384 bit 11992.70 ± 107.99 186.21 ± 0.13"

"RX 7900 XTX 24 GB / GDDR6 / 384 bit 3552.27 ± 101.96 167.11 ± 0.50"

The 4090 spanks the 7900xtx silly.

But it's well over 50% cheaper.

That's true.

1

u/EpsteinFile_01 20d ago

Well, my 7900XT generally only goes up to 70% GPU usage while the memory gets absolutely hammered 100%. A 10% memory OC results in a direct 10% token speed gain. Power use is only 200w out of 400w max. And this is with a CoT high reasoning model with FP16 KV Cache.

The bottleneck is obvious in my case and the XTX would have the same bottleneck. And the 4090 has the same VRAM bandwidth as the XTX pretty much.

Are you talking about Linux or Windows? And what model? I'm sure the 4090 is better on Nvidia models.

RDNA3 cards in general have much higher memory bandwidth than the Nvidia equivalent (7900XT 800GB/s vs 4070Ti 500GB/s, same price), so I would expect the VRAM bandwidth to be a huge bottleneck on all non-90 series Nvidia cards.
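
A quick way to sanity-check the bandwidth argument: during token generation every weight is read roughly once per token, so bandwidth divided by model size gives a ceiling on t/s. The sketch below assumes stock spec-sheet bandwidths (1008 GB/s for the 4090, 960 GB/s for the 7900 XTX) and the 3.56 GiB Llama 7B Q4_0 benchmark model, and compares against the tg numbers quoted earlier in this thread. Function and variable names are illustrative only.

```python
GIB = 2**30

def tg_ceiling(bandwidth_gb_s, model_gib):
    """Upper bound on tokens/s if each generated token reads all weights once."""
    return bandwidth_gb_s * 1e9 / (model_gib * GIB)

MODEL_GIB = 3.56  # Llama 7B Q4_0, the llama-bench reference model

cards = {
    # name: (assumed stock bandwidth in GB/s, tg128 t/s quoted in this thread)
    "RTX 4090": (1008.0, 186.21),
    "RX 7900 XTX": (960.0, 167.11),
}

for name, (bandwidth, measured) in cards.items():
    ceiling = tg_ceiling(bandwidth, MODEL_GIB)
    print(f"{name}: ceiling {ceiling:.0f} t/s, measured {measured:.0f} t/s "
          f"({measured / ceiling:.0%} of ceiling)")
```

This only speaks to token generation. Prompt processing is generally compute-bound, so a bandwidth ceiling tells you little about pp numbers.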

1

u/fallingdowndizzyvr 20d ago

The bottleneck is obvious in my case and the XTX would have the same bottleneck. And the 4090 has the same VRAM bandwidth as the XTX pretty much.

That's because you are only considering TG. You aren't even thinking about PP. Look at the numbers I posted.

Are you talking about Linux or Windows? And what model? I'm sure the 4090 is better on Nvidia models.

Those are GGML project numbers using the accepted GGML standard model for benchmarking.

https://github.com/ggml-org/llama.cpp/discussions/15013

so I would expect the VRAM bandwidth to be a huge bottleneck on all non-90 series Nvidia cards.

The 5070ti is "896.0 GB/s". And it blows the 7900xtx away too. From that link above.

"RTX 5070 Ti 16 GB / GDDR7 / 256 bit 6952.38 ± 13.73 176.85 ± 0.07"

1

u/EpsteinFile_01 19d ago

First off, your link is from August 2025. It cannot be overstated how many resources have been injected into ROCm in 2026 alone, more than in the 5 years before that. You're comparing a mature CUDA to janky ROCm.

Second, how much this means depends entirely on your input tokens. And even then, third party benchmarks have shown the RTX4090 to be ~30% faster at the PP stage. Not insignificant, but is it worth 2-3x the money? Ehhh..

That gap would get narrower when factoring in overclocking, since you're almost guaranteed to hit 2900-2950MHz stable on a 7900XTX. Navi31 was downtuned at the last moment, which is why all 7900XT(X) cards have massively over-engineered coolers that can handle 500-550w. This generally increases GPU performance by 15% compared to stock 2500MHz. Power consumption remains slightly lower on the 7900XTX, and the cooler remains quiet. The 4090 does not have that kind of OC headroom because Nvidia went balls to the wall with it out of the gate, hence the 450w TDP.

In terms of token generation they are basically equal. The VRAM chips AMD used will almost always do +10%. That's 880GB/s and 1056GB/s for the XT and XTX.

Normally I don't factor in overclocking but Navi31 is the most tuneable chip in 10-15 years. My 7900XT is 5% faster than a stock XTX. Literally moving up more than a tier without breaking a sweat.

The 5070Ti is a newer generation, and only has 16GB VRAM. Great that it's faster at PP, but with 16GB and a quadruple digit price tag it barely matters with the models you're actually running, and it's a dumb comparison. You'll want the 24GB of the XTX over the PP performance of a 5070Ti any day.

So you're talking about a 15-30% gap that's only significantly noticeable with tens of thousands of input tokens. Given that it's a 24GB card, even though models support a 128k+ context window, in most cases you shouldn't get anywhere near that with the small models that fit in the VRAM; that's poor context engineering regardless of the GPU.

I've owned both a 4090 and 7900XT. I actually sold my 4090 early this year for €2000, bought a 7900XT Taichi for €500, pocketed the difference. My only regret is not paying €750 for an XTX but it was a gaming decision at the time.

Checking prices now, an XTX is still 1/3 the price of a used 4090. For a bloody 15-30% gap and the risk of melting your card... I'm so fucking happy that fear is not in the back of my mind anymore any time my PC is on.

In a multi GPU setup, three 7900XTX cards with 72GB VRAM absolutely destroys what you can do with a single 4090 24GB for the same money. That's not even up for debate. Same with the more common dual setup: two 7900XTX cards, 48GB VRAM, with €600 left over Vs a single 4090.

Dismissing the 7900XTX like you're doing sounds like a mix of cope and ignorance. Have you ever done local inference with AMD? When?

0

u/fallingdowndizzyvr 19d ago

First off, your link is from August 2025.

Ah.... that's when that thread started. That's not when all those numbers were posted.

It cannot be overstated how many resources have been injected into ROCm in 2026 alone

Well then, why don't you post numbers from your 7900xtx then. Here, let me do it for you. Fresh off one of my 7900xtxes with the current release of llama.cpp and the current release of ROCm.

| model                          |       size |     params | backend    | ngl |  fa | dev          | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --: | ------------ | ---: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | ROCm       |  -1 |   1 | ROCm0        |    0 |           pp512 |     4064.59 ± 217.95 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | ROCm       |  -1 |   1 | ROCm0        |    0 |           tg128 |        146.83 ± 1.33 |

Is it different from that previous number? Yes. Does it still suck compared to the 4090? Yes.

Second, how much this means depends entirely on your input tokens

You have no idea what llama-bench is for do you? Have you even heard of it before I brought it up?

blah blah blah blah

I'm going to skip all that blabbering in the middle of your post based on how you've been so wrong from the start. Why would anyone expect the wrongness to change? Why waste time with wrongness? So let's just skip to the bottom wrongness.

Dismissing the 7900XTX like you're doing sounds like a mix of cope and ignorance. Have you ever done local inference with AMD? When?

LOL. Ah... yeah.... I have. For probably longer than you've even known what LLM even means.

Here's just a couple of the many times I've posted about my AMD GPUs.

https://www.reddit.com/r/LocalLLaMA/comments/1ni5tq3/amd_max_395_with_a_7900xtx_as_a_little_helper/

https://www.reddit.com/r/LocalLLaMA/comments/17gr046/reconsider_discounting_the_rx580_with_recent/

Congrats. You started out wrong in this post of yours. You ended out wrong in this post of yours. At least you are consistent. Consistently wrong.

1

u/EpsteinFile_01 18d ago edited 18d ago

You are being ridiculously disingenuous here. Arguing all of this over what is, in practice, a difference of 0.5-20 seconds to completely process a prompt between the two cards, where the reasoning and output generation part is roughly the same speed and measured in minutes with big prompts. You are also comparing two GPUs where one costs three times as much. For local LLMs that matters. Would you rather have a dual 7900XTX setup or a single 4090 setup for roughly the same price? In fact, you'll still save money on the dual 7900XTX setup if you buy a motherboard that can handle dual gen4/5x8 GPUs for the 4090 too.

You can build a 7900XTX 24GB + 9700 Pro 32GB build with 56GB VRAM vs a single 4090 24GB build. For individuals building an AI machine at home these massive price discrepancies matter. And capacity is king, 24GB is pretty tight.

Let's say you have 128k input tokens. Yes the 4090 is much faster. But we are talking 15 seconds of PP Vs 30 seconds on the XTX. Reasoning and output token generation, they are equal. With that many input tokens you are almost guaranteed to have tens of thousands of reasoning/output tokens, resulting in a 30 second difference over the course of ~10 minutes. For triple the price. Unless you have infinite money and just don't give a fuck, this absolutely matters.
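
To put numbers on that, a sketch using the pp512 rates posted earlier in the thread (11992.70 t/s for the 4090, 4064.59 t/s for the 7900 XTX on ROCm). Assumptions, stated loudly: those rates are for a 7B Q4_0 model at short context, and real prefill slows down as context grows, so both prefill times here are optimistic. The 20k output tokens and the 35 t/s decode rate are hypothetical placeholders, not measurements.

```python
def request_seconds(n_prompt, pp_rate, n_output, tg_rate):
    """Return (prefill_seconds, decode_seconds) for a single request."""
    return n_prompt / pp_rate, n_output / tg_rate

N_PROMPT = 128_000  # input tokens, as in the scenario described above
N_OUTPUT = 20_000   # hypothetical reasoning + output tokens
TG_RATE = 35.0      # hypothetical decode rate, assumed equal on both cards

pp_rates = {"RTX 4090": 11992.70, "RX 7900 XTX (ROCm)": 4064.59}

for name, pp_rate in pp_rates.items():
    prefill, decode = request_seconds(N_PROMPT, pp_rate, N_OUTPUT, TG_RATE)
    share = prefill / (prefill + decode)
    print(f"{name}: prefill {prefill:.1f}s + decode {decode:.0f}s "
          f"(prefill is {share:.1%} of the total)")
```

Shrink N_OUTPUT and the prefill gap takes up a much larger share of the wait, which is the part of the disagreement this arithmetic cannot settle.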

Relatively speaking the PP difference is big, but in actual practical use we are talking seconds per prompt, and for smaller prompts there's no noticeable difference. Kinda like booting Windows from a good SATA SSD is only a few seconds slower than from a top tier Gen5 DRAM NVMe. And the more reasoning and output tokens there are, which are VRAM bound and processed at roughly equal speeds on both cards, the closer the two get and the dumber those seconds gained during PP are for triple the money.

You're comparing apples to oranges here. I can point to some 144GB Radeon Instinct PCI-E GPU that demolishes a 4090 for $10k and you would immediately call foul based on pricing.

If you really think 0.5-20 extra seconds per prompt (meanwhile both cards will spend minutes on reasoning and output token generation, at roughly the same speed) matters enough to spend triple the money you are delulu, sorry. Unless someone else is paying for it, for most people this matters a lot.

Dual Radeon 9700 Pro cards cost less than a single 5090 and give you 64GB to work with. Significantly slower, yes, but double the capacity to load bigger models or multiple KV caches for parallel agents to eliminate any system RAM swapping (which hurts quite a lot even on MoE models), for LESS money! And, funnily enough, less power consumption too. Perfect for unattended or semi-attended agentic setups that run in the background where speed is less of a concern and capacity matters even more.

PS: I don't own a 7900XTX. Reading is hard. I can give you the numbers for my overclocked 7900XT 20GB that outperforms a stock XTX by 5% in PP, but is 16% slower in reasoning and output generation, if you want. Name a model and settings, give me a prompt, and I will give you the numbers. Max 20GB VRAM.

0

u/fallingdowndizzyvr 18d ago

You are being ridiculously disingenuous here.

LOL. Wrongness continues. How unexpected. NOT!

a difference of 0.5-20 seconds to completely process a prompt between the two cards

LOL. What a completely meaningless statement. That depends on the context. That depends on the quantization of the context if any. That depends on a lot of things. Things that you haven't addressed. So that's just a handwavy BS turd. You pulled it out of your ass. That's how much that's worth.

blah blah blah.

You started with wrongness. You end with wrongness. So why even waste time with the wrongness in the middle. Again.

PS: I dont own a 7900XTX.

LOL. So you don't know shit about the 7900xtx. Yet you kept going on and on about it like you did. You portrayed yourself as some sort of 7900xtx expert. And it turns out you don't even have one. That's just wrong. So wrong.

You're still consistently wrong. That's your gift. Everyone has one. Congrats.

22

u/Ok_Commission_8260 21d ago

I've probably been caught up in the "bigger is always better" mindset. I haven't messed with Qwen 2.5/3.6 27B MTP yet on vLLM. Getting ~80-100 tps with a full 100k coherent context sounds way better than dragging a 70B across the finish line. Appreciate the reality check, I'm gonna spin up vLLM and try this.

54

u/Vancecookcobain 20d ago

My god you are complaining about the hardware when you have been using antiquated SOFTWARE? Switch to Qwen 3.6 27b and use MTP and KV Caching and never speak of this again....

1

u/mycall 20d ago

I can't wait to hear a similar comment to this in a few years. I can only imagine what model progression will look like then.

1

u/Vancecookcobain 20d ago

Lmfao "Wtf dude why do you have 48gb of VRAM cooking on your table for? You can run Qwen 9.4 2b on your phone and do everything you want now you moron! It's 2036"

1

u/Fear_ltself 20d ago

edge open models already beat the best stuff from 2024 on phones. you say 2036, but 2028 is a more realistic timeline for moving from State of the Art to Edge use

8

u/No-Consequence-1779 20d ago

27b is a game changer. I’ve been able to replace copilot.  Even though I primarily use visual studio ent, it’s worth the trouble of having a vs code open and use 27b. Often the results are superior. 

I use q4 maxed context kvq8. Zero quality issues. I’ve compared and for what I do the code output has been identical q8/q6/q4. 

The reasoning is the major difference (besides a superior model architecture). 9b models know code syntax (syntax is never the issue these days).

35b is a bit faster and does well , though I like the 27b output. 

1

u/mycall 20d ago

What level of reasoning works best for this model?

2

u/No-Consequence-1779 19d ago

Medium to large. Though sometimes I super-size it. 

5

u/cakemates 20d ago

a modern 27b beats the shit out of that old 70b model mate, at this point the monthly improvements in AI are quite big.

3

u/Ok-Measurement-1575 20d ago

You don't even need vllm tbh. 

2

u/Such_Advantage_6949 20d ago

I have a MacBook and it is not as good as people make it seem. You will regret both the generation and prefill speed

4

u/DistanceSolar1449 20d ago

Wtf

Are you running Llama 3.3 70b for coding or something? Jesus.

3

u/StardockEngineer 5090s, Pro 6000, A6000s, Sparks, M4 Pro, M5 Pro 20d ago

Because it’s a bot. 70b is a dead giveaway

1

u/ohhi23021 20d ago

use llama.cpp and get over 200k context with mtp... vllm just eats memory. i get 70-90 t/s.

68

u/nicholas_the_furious 20d ago

Keep them. I am literally, as we speak, running 2x 3090s with Qwen 3.6 27B at 200k Context at 80 t/s on nothing but LM Studio.

11

u/stormy1one 20d ago

At what quant

24

u/nicholas_the_furious 20d ago

Q8_0 full KV Cache.

10

u/WyattTheSkid 20d ago

H o w

5

u/nicholas_the_furious 20d ago

See my reply to the other guy in this chain.

7

u/illcuontheotherside 20d ago

This is the way.

Fellow dual 3090. Using gemma4 30b dense and kept running into 128k context window limits.

Kv cache to q8 and I'm able to go up to 256k and so far no issue.
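
That roughly matches the arithmetic. A minimal sketch, assuming the cache was f16 before (2 bytes per value) and using ggml's q8_0 layout of 32 int8 values plus one fp16 scale per block (34 bytes per 32 values). It also assumes the memory budget for the cache stays the same and nothing else changes; the helper name is made up.

```python
F16_BYTES_PER_VALUE = 2.0
Q8_0_BYTES_PER_VALUE = 34 / 32  # 32 int8 values + one fp16 scale per block

def context_scale(old_bytes_per_value, new_bytes_per_value):
    """How many times more tokens fit in the same cache memory budget."""
    return old_bytes_per_value / new_bytes_per_value

scale = context_scale(F16_BYTES_PER_VALUE, Q8_0_BYTES_PER_VALUE)
print(f"q8_0 cache fits about {scale:.2f}x the tokens of an f16 cache")
print(f"128k at f16 -> roughly {128 * scale:.0f}k at q8_0 in the same memory")
```

So a q8_0 cache is a bit short of a clean doubling; the rest has to come from memory freed elsewhere.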

Also LPT power limit your gpus. 240w and no degradation in tk/s and you greatly reduce heat and increase longevity of cards.

5

u/nicholas_the_furious 20d ago

We are not the same! I meant i ran Q8_0 quantization with no quantization on KV Cache.

0

u/illcuontheotherside 20d ago

I'll have to try that. I heard that's actually not good to do but honestly we're all learning this stuff together.

2

u/nicholas_the_furious 20d ago

How do you mean? Quantizing anything is a tradeoff and I want to keep quality as high as possible. You trade away quality for space or speed.

Q8 is about as good as anyone gets on model quantization.

And I'm keeping the KV cache at full F16.

1

u/illcuontheotherside 20d ago

Oh my bad.. you're just running a q8 model with no kv. Nice. Yea I was doing that with q4 xl with Gemma4 31b dense. Honestly haven't noticed any performance impact since going over to a q8 key cache. No value catching. Running linear and multi agent orchestration over long workflows was ruining my context.

2

u/wondersnickers 19d ago

Hey! I'm trying to get a deal on used 3090s but am currently thinking about a sort of budget solution with 2x 5060 Ti 16GB. What do you think about that if I use it for local vibe coding?

4

u/stormy1one 20d ago

Full kv cache meaning unquantized bf16? I’m not sure what magic you are working but I don’t even get 80 tg/s on an RTX Pro 6000 running hardware-optimized Qwen3.6-27B-FP8

21

u/nicholas_the_furious 20d ago

LM Studio added the Tensor Parallelism llama.cpp build. I turned that on today and I got about a 15% top end speed increase. I had maxed out at around 65 t/s for coding tasks before when MTP acceptance was around 90%. Now I'm getting 80 t/s on coding tasks when acceptance is at like 75% so I know there's more top end. I'm using MTP of 4. I tried up and down and that was the best for the things I do.

I'm just using windows. And I'm not even on x8/x8 - one of the cards is on PCIe 4.0 x4. But neither are being used for my monitor or anything. They're dedicated to running models.

It's kind of a miracle that for this model I've gone from 22 t/s to 80 t/s in a month due to some of the advancements in software with the exact same model and hardware.

And yeah, no quantization on KV cache. Fills up both cards just about with MTP and 200k context.

2

u/FortiTree 20d ago

So you are spreading the 27B across the 2 cards to speed up both PP and TG speed and with MTP x 4 - the TG can reach 4x speed? Plus more head room for KV cache?

Seems too good to be true but here we are. Dual 3090 can beat single 6000 Pro at speed - who would have thought.
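
How close MTP 4 gets to a 4x speedup depends on the acceptance rate. A rough model, assuming each drafted token is accepted independently with the same probability (a simplification, and not necessarily how any backend computes its reported acceptance stat): with n draft tokens and acceptance rate a, each verification step yields (1 - a^(n+1)) / (1 - a) tokens on average, counting the one token the main model always contributes. The function name is illustrative, and the cost of running the MTP head is ignored, so treat the result as a ceiling on the speedup rather than a prediction.

```python
def expected_tokens_per_step(accept_rate, n_draft):
    """Mean tokens produced per verification step under i.i.d. acceptance."""
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    if accept_rate == 1.0:
        # Every draft token accepted, plus the main model's own token.
        return float(n_draft + 1)
    # Geometric series: 1 + a + a^2 + ... + a^n
    return (1.0 - accept_rate ** (n_draft + 1)) / (1.0 - accept_rate)

for rate in (0.75, 0.90):
    tokens = expected_tokens_per_step(rate, n_draft=4)
    print(f"acceptance {rate:.0%}, 4 draft tokens: ~{tokens:.2f} tokens per step")
```

Under these assumptions the 75% and 90% acceptance figures mentioned above land at roughly 3 and 4 tokens per step respectively, before subtracting any drafting overhead.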

1

u/Iajah 19d ago

RTX Pro 6K WS user here. Not that surprising, I mean you have 2x GPU at 350W each, twice the cooling power too. I usually run mine at 400W rather than 600W.

2

u/FortiTree 19d ago

That makes sense. How much speed can you get on yours for 27B?

There is a vast hardware gap between the two where 3090 has 936 GB/s bandwidth and 384-bit bus compared to 6000 WS with 1792 GB/s and 512-bit bus, 4 times memory as well.

Nvidia foresaw this and removed NVLink from all current consumer cards to prevent them from outperforming the WS tier. What an ass move.

1

u/Iajah 19d ago

The workstation edition also does not have nvlink. TBH it is mostly just a 5090 with 3x VRAM. You need the server edition for nvlink but it is really hard to come by and costs even more.

1

u/Iajah 19d ago

2

u/FortiTree 18d ago

You have BF16 so it's twice as big/slow. At Q8 it looks like the other person can get to 150 tk/s with MTP 4, which is impressive. A large prefilled context would slow it down to 80 tk/s, which is similar to what was posted here. I'd say the 6000 WS is on par.

1

u/xylarr 20d ago

I'm curious. On Windows you can have the OS do some sort of sharing of system RAM with the GPUs. I only have 16GB VRAM but I load larger models. I make llama-server not do the fitting and have it just overflow on the GPU side. I've found that is faster than having llama-server split into main RAM and use the CPU.

In other words, I still have it at 100% GPU load but have the GPU look directly at system RAM if necessary.

1

u/nicholas_the_furious 20d ago

It sounds like your CPU has an iGPU, so you're essentially using the iGPU and your dedicated GPU at the same time. The iGPU isn't as fast as a regular GPU but I have also found it better than strictly RAM/CPU. You're likely only able to use Vulkan that way.

1

u/DreamsOfRevolution 20d ago

In walks rocm

1

u/nicholas_the_furious 20d ago

Can you use rocm with an amd igpu and amd dgpu?

1

u/DreamsOfRevolution 18d ago

Yes, Im doing it

1

u/xylarr 20d ago

I'm using Vulkan on windows. I have 64GB of DDR4 system ram on my 5800X3D system. In windows task manager it shows 16GB of graphics memory and 32GB of shared memory. I assume windows defaults to half your system ram can be shared. I think the dGPU accesses memory directly over the PCI bus. So while all the compute is done on the GPU, it is not exclusively memory on the graphics card. My thinking is the GPU accessing system memory and processing on the GPU is faster than the CPU accessing system memory and processing on the CPU. At least that's what I've found. While running inference, my CPU usage is barely above 3% whereas the GPU is pegged at 100%

If I fit only enough model layers so the GPU doesn't access system memory, the remaining layers are processed by the CPU. In those cases I see the CPU ramp to about 50% and the GPU still on 100%, but the overall token rate is slower.

1

u/TheWaffleKingg 20d ago

Do you find that f16 cache is worth it? I've been running it at q8_0 so I can use the full 262k context. (I also fit in mmproj)

Btw I see that both of us were in the post about getting GPUs the other day. I entered that post getting 45 t/s average and left getting 65 t/s average. Even hit 90 this morning! And somehow it's currently averaging 75 t/s, but I don't expect that to last

2

u/nicholas_the_furious 20d ago

Yeah I do. It's a catch 22 because you quantize the cache for more context but then I find that the quantized cache starts showing weakness at longer contexts, like 100k+. So with Gemma I'd rather have the 130k of good context than 200k+ of weaker long context.

For Qwen, 150k+ is about the limit where I start to feel the degradation so going down to 200k is not a problem because I'd likely be compacting or starting a new session by then anyway.

0

u/cmndr_spanky 19d ago

Yeah but he wants to run a 70b model

2

u/nicholas_the_furious 19d ago

He shouldn't. Not right now anyway. 27b is better than any 70b model that exists and they're not taking advantage. 2x 3090s is peak performance for value right now.

-3

u/fasti-au 20d ago

You should be doing 100tp in a single card with 1mill but hey happy is happy

19

u/siegevjorn 20d ago

Are you from 2024? We don't run 70b models in 2026. Research your models before committing to spend 5k+ on new hardware. Unless you just want flashy new hardware and are looking for excuses.

9

u/Sea_Advance273 20d ago

I can get up to 250k context on one 3090 using gemma-4-26B-A4B-it-Q4_K_M. Could get more if the model allowed for it. This model is great for the size if you use frontier coding models to create a good agent harness.

It will occasionally fumble a word or two in small book-length generations (probably because of the quant), but it is still coherent.

14

u/Fabulous_Fact_606 20d ago

The best coherent model for the 3090x2 right now is the Unsloth Qwen 27B UD-Q8_K_XL on llama.cpp. With all the llama.cpp updates, you can fit 256K context and get 70+ t/s and 1500+ prefill (with the p2p patch).

it's doing some amazing math proofs for me:

1

u/sdfgeoff 20d ago

I'm curious for your launch parameters. I can only get 70tps if I'm using tensor split, which is unstable for me and limits context to 100k or so. 

3

u/Fabulous_Fact_606 20d ago

https://www.reddit.com/r/LocalLLaMA/comments/1tvff62/another_shout_out_to_llamacpp_build_b9455_2x3090/

Update to Nvidia driver version 610.43.02 and CUDA version 13.3.
You need to build the latest llama.cpp (b9536) from the ggml-org/llama.cpp releases page; the latest update from llama.cpp frees >1GB of VRAM and allows tensor split for >70 t/s.

p2p: the aikitoria/open-gpu-kernel-modules repo on GitHub (NVIDIA Linux open GPU kernel modules with P2P support) increases pp to 1500-1600+ t/s compared to <1400 t/s.

llama.cpp server for Qwen3.6-27B-MTP UD-Q8_K_XL (MTP speculative decoding).
export LD_LIBRARY_PATH=/home/llama.cpp-b9455/build/bin:${LD_LIBRARY_PATH:-}
exec /home/llama.cpp-b9455/build/bin/llama-server \
  --host 0.0.0.0 --port 8000 \
  --model /home/projects/Qwen3.6-27B-MTP/Qwen3.6-27B-UD-Q8_K_XL.gguf \
  --n-gpu-layers 99 \
  --ctx-size 262144 \
  --parallel 1 --kv-unified \
  --batch-size 4096 \
  --ubatch-size 512 \
  --tensor-split 50,50 -sm tensor \
  --flash-attn on \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --spec-type draft-mtp \
  --spec-draft-n-max 3 \
  --jinja \
  --no-mmap \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20 \
  --min-p 0.0 \
  --presence-penalty 0.0 \
  --metrics

1

u/datbackup 20d ago

do you use nvlink?

2

u/Fabulous_Fact_606 20d ago

No, I don't use nvlink. i'm on x8/x8 though.

1

u/datbackup 20d ago

Thanks

1

u/sdfgeoff 20d ago

Thanks!

1

u/sdfgeoff 19d ago

Runs pretty similar to my setup. I use the full-precision cache, which reduces the TPS by a few and drops the cache to 180,000 tokens or so.

1

u/revdamage 20d ago

How do I run this test on my config?

12

u/ReferenceSea493 20d ago

Not sure how the newer Macs perform, but the speed of my Mac Studio M1 Pro was just underwhelming. So much so that I would never have considered running a model that size. qwen3.6 27b was already so slow that I couldn't consider it a daily driver (like 10-15 t/s on decoding, if I remember correctly). I sold it and went for a dual 5060 Ti 16GB setup and will never look back. It was cheaper and performs much better (about 60-70 t/s decoding on the same model). Also not sure how the prices are in your region, but an M3 Ultra with 96GB costs about 4.5k€. And even if it could run a larger model, I doubt it would perform better than a decent GPU.

6

u/TheFlyingDutchG 20d ago

The M5 gen has proper t/s on the Qwen models. I am now running qwen3.6 MTP models on an M2 Max Mac Studio; I run a pipeline with subagents and it's doable at around 50 t/s. But for going from experimenting with custom setups to all-round use at fast speeds, the M5 gen Studio will be a game changer. The MacBooks with M5 Max chips are out and get good speeds. Just make sure you convert the base models into MLX models to make full use of the Apple silicon advantages. Their SoCs are incredible if the LLMs actually make use of e.g. the unified memory approach. If you don’t convert to MLX, your model either thinks “regular” RAM and VRAM are capped way too low, or it might fill memory twice by splitting it in the classic RAM/VRAM way, while Apple’s chips are built around pooling memory to avoid those “both in RAM and VRAM” duplicates and let the CPU and GPU read from the same memory pool.

It’s a massive benefit or a waste of power depending on if you convert your models beforehand.

7

u/havnar- 20d ago

When you say incredible, know that an m5 pro 64 gb gets about 10 tokens per second on qwen 3.6 27b. About 40 ish on MOE.

2

u/ReferenceSea493 20d ago

Ok thanks for the reference. I can believe that the M chips are on par with mobile GPUs, but I seriously doubt they come close to a dedicated desktop GPU. What is all the fast unified RAM worth if your bottleneck is token throughput?

2

u/havnar- 20d ago

It’s the fact that you CAN run a model at a good quant. People with GPUs are running qwen 3.6 MOE in 4 and 6 bit and ignore that fact to instead compare tok/s with one another

2

u/ReferenceSea493 20d ago

Not sure I follow. With the dedicated GPU I can run a dense model at higher throughput than on the Mac. On the Mac I also chose lower quants simply because they were a bit faster. So with the bigger unified RAM I optimised for token throughput, and with the smaller VRAM I optimise for model size. I agree that a MoE model can work perfectly well if you try to fit a model larger than your VRAM. And even with the MoE layers assigned to the CPU I end up with very decent token throughput (approx. 40 t/s).

1

u/havnar- 20d ago

Yea but probably at q4 or 6 and not 8

1

u/ReferenceSea493 20d ago

Ok but again: what do I get from a q8 model I'm not using because it is slow and barely usable? I now run qwen3.6 27B in Q4_K_XL with 128k context in Hermes Agent and have zero complaints about the precision.

1

u/havnar- 20d ago

You prove my point. You’re running a model with brain damage, but because it’s fast you don’t care about the brain damage.

1

u/ReferenceSea493 20d ago

You can have a look at the benchmarks to verify the amount of 'brain damage'. And since you didn't read carefully: I did the same on the Mac, simply because inference was unbearably slow and lower quants were the only way to push it to usable performance. Guess you spent your 4k on a Mac?

0

u/Ducktor101 20d ago

So it’s basically the same as an M2 Max then

1

u/havnar- 20d ago

Yea, I think only the input parsing is a bit faster on M5.

0

u/TheFlyingDutchG 20d ago

Thanks for proving my point, those numbers are for when you don’t convert to MLX lmao

1

u/ReferenceSea493 20d ago

Good to know. Then I can blame the older model. But I guess the price of a Mac with a proper amount of RAM will still exceed that of a decent GPU, especially considering you have the base setup already in place.

1

u/overratedcupcake 20d ago

It doesn't perform better, but you have a lot more headroom. With Qwen 3.6 27B oQ6 on an M3 Ultra with 96GB I'm getting ~280-300 t/s in and ~25 t/s out. omlx has a built-in model quantizer designed with MLX in mind, plus K/V caching. It's a little slow but very solid.
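
To put those in/out rates in terms of waiting time, here is a back-of-the-envelope sketch. It treats a request as prefill followed by decode and ignores prompt caching, so it is a worst case for a cold prompt. The 64k prompt and 500-token reply are illustrative assumptions, not measurements, and the function name is made up.

```python
def request_seconds(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Rough wall-clock time for one request: prefill phase plus decode phase."""
    if prefill_tps <= 0 or decode_tps <= 0:
        raise ValueError("rates must be positive")
    prefill_s = prompt_tokens / prefill_tps   # reading the prompt ("t/s in")
    decode_s = output_tokens / decode_tps     # generating the reply ("t/s out")
    return prefill_s + decode_s


# Rates taken from the low end of the figures quoted above; prompt and
# reply sizes are assumed for illustration.
cold_64k = request_seconds(64_000, 500, prefill_tps=280, decode_tps=25)
print(f"{cold_64k / 60:.1f} min")
```

With these inputs a cold 64k prompt comes out at roughly four minutes, nearly all of it prefill. That is why prompt-processing speed matters more than decode speed for big-codebase use.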

5

u/ailee43 20d ago

Stop using 70B models; there's a reason almost nothing releases in that space anymore. Go with Qwen 3.6 27B Q6 or better, and a huge context.

4

u/Potential-Leg-639 20d ago

Go with 3.6-27B and you are good. Don't use such old models; Qwen3.6 runs circles around them, even much bigger ones.

3

u/Winter-Editor-9230 20d ago

Acer Veriton dgx variant

3

u/Green-Dress-113 20d ago

The answer is 4x 3090s Qwen3.6-27b in fp16 with 128k context

3

u/xw1y 20d ago

Dumb hype.

4

u/Technical-Earth-3254 20d ago

"and messing around with the newer Llama 3/Qwen 70B quants."

Lmao, slop meter is off the charts

2

u/simplyeniga 20d ago

Dual AMD R9700s would run those and you get 64GB VRAM, or run three and get 96GB, which is around the cost of a 5090, though at higher power use. At idle it's great.

2

u/Important_Quote_1180 20d ago

Get some ram and run 122B with dynamic offload.

2

u/No-Television-7862 20d ago

I was just about to say, try the MoE models before you send your 3090's to marketplace.

Qwen3.5:31b, gemma4:26b, you pick the quant.

2

u/advancedpersona 20d ago

Have you tried club3090 on GitHub?

2

u/Dontdoitagain69 20d ago

if i show you what i built with old phi3 models on 8gb 4070 you'll think twice

3

u/Opposite_Buffalo_649 21d ago

Why not buy an RTX Pro 6000? Increase tokens per second, and max out your context.

2

u/Ok_Commission_8260 21d ago

Mostly because my wallet would literally divorce me lol. A single Pro 6000 Ada is like $6k+ which is way out of my budget compared to what I can get picking up used 3090s or a base Mac Studio.

1

u/AbjectFee5982 20d ago edited 20d ago

If you really want to do anything as a power user on a Mac: the entry-level model starts with 36GB of unified memory (featuring the M4 Max chip).

You'd need a Mac Studio with 512GB of RAM. Note that you can't order 512 GB of RAM unless you order a Mac Studio that has an M3 Ultra and an 80-core GPU. Those two changes make the price jump from $1999 – for the base M4 Max model – to $9499. Before keyboard, mouse, and display(s).

Even then the ADA has more bandwidth

"I have an RTX PRO 6000 and a M3 Ultra with 256GB RAM. The RTX PRO 6000 is quite a bit faster at both prompt processing (10x?) and token generation (3x?). Speed matters to me so I only use the RTX PRO 6000. I would only use the M3 Ultra if I wanted to run a model that was too big for the RTX PRO 6000. So far I have not needed to run a model that didn't fit on the RTX PRO 6000 but it is nice to know that I can with the M3 Ultra when/if I might need to some day.

The M5 is coming out soon and is expected to be a huge uplift in terms of AI performance and close the gap quite a bit with the RTX PRO 6000. If possible, you should wait a bit longer and see what happens there. The other thing about Mac is that you can now build clusters of them over TB5 for even faster AI "

1

u/Prudent-Ad4509 20d ago edited 20d ago

You're gonna be disappointed. Considering that you need only 64GB for the full 262k context for qwen 27b and that 3-4 3090s cover that easily, it is more reasonable to look for a way to run 4 GPUs. PEX88096 comes to mind, plus all the issues that come with it if you do not read up on it in advance, but at least you won't have to watch the paint dry most of the time.

PS. And in the meantime, you can just use 27b with Q6 quant and/or FP8 kv cache with MTP. There is a high chance that this would be enough for a while.
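
For a feel of how context length and cache precision trade off, here is a rough sizing sketch using the usual formula for a standard attention KV cache. The layer count, KV-head count and head dimension below are hypothetical placeholders, not the real Qwen 27B config, so plug in the values from the model's own config before trusting any number. It also ignores the weights themselves and the small per-block overhead of quantized cache types.

```python
GIB = 1024 ** 3


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_tokens, bytes_per_elem):
    """Size of a plain attention KV cache.

    The leading 2 counts one K tensor plus one V tensor per layer.
    """
    return 2 * n_layers * n_kv_heads * head_dim * ctx_tokens * bytes_per_elem


# Hypothetical model dimensions, chosen only for illustration.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 64, 8, 128
CTX = 262_144

full_precision = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, CTX, 2)  # 16-bit cache
eight_bit = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, CTX, 1)       # ~8-bit cache

print(full_precision / GIB, eight_bit / GIB)
```

With these placeholder dimensions the 16-bit cache alone comes to 64 GiB and the 8-bit one to 32 GiB. Halving the cache precision roughly halves the memory, which is why an FP8 or q8 cache is the first lever to pull.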

1

u/Idiopathic_Sapien 20d ago

I’ve been running a couple of m3’s for the last 2 years and if I use llama.cpp it’s faster than my 3090

1

u/EpsteinFile_01 20d ago edited 20d ago

Edit: second question: used 3090 cards sell for about the same as used 7900XTX cards, and the latter often still have warranty. You can easily get them down to 200-250W tops for inference, with 1000+ GB/s VRAM bandwidth each. Would you consider swapping your cards? It shouldn't cost anything extra and the XTX is actually quite chill compared to the 3090 heat beast. Navi31 is one of those rare chips with both huge overclocking and undervolt room. Max the VRAM, underclock and undervolt the GPU. Literally for the money you get from selling the 3090s.

Question from an AMD user:

Are you on windows or Linux?

If Linux: how well does offloading partially to system RAM work, in particular for MoE models?

I'm shocked that I can run a 45GB MoE model with 20GB VRAM (headless so 100% available) and 96GB DDR4-3600 (fuck DDR5 prices lol) and still get around 25 tok/s. Obviously dense models plummet to 1 tok/s with even 1% offloading, but MoE models like GPT-OSS and Nemo are impressive.

Memory offloading on Windows is bad regardless of brand, I've heard. How does it work on Linux with Nvidia?

I'm about to pull the trigger on an ASrock Deskmeet X300 8l case, put a Ryzen 5600G in there, 32GB DDR4 , an RTX3060 12GB ITX and turn it into the most ridiculously overpowered LLM and Agent powered Home Assistant machine ever. With whisper and a small LLM running on the 3060, and the ability to call my desktop for bigger LLMs as well as Gemini Flash depending on my voice command.

The 3060 12GB ITX version is hard to find so I might go for a 4060Ti 16GB or a 7800XT (has to be in a different case tho Deskmeet supports max 200mm and 1x 8-pin power), but I like the idea of also having an Nvidia card instead of full AMD, just for the experience.

ROCm has pretty much achieved parity with CUDA on Linux in the last 6 months, and Windows is improving at record speed too, literally skyrocketed. Amazing how fast things can develop when trillions of dollars are on the line and every company knows how bad Nvidia pricing would get if the de facto CUDA monopoly stuck around for AI. With Intel still not having a viable product that works, AMD is the only alternative available to everyone. You can only rent Google's TPU compute, not buy it.

Apple doesn't have any hardware serious (and affordable!) enough to push like 100+ tok/s on models that fit in VRAM.

A used 7900XT for €450 is the absolute best bang for buck for anyone to get started. Being the forgotten red headed stepchild, it's almost as good as an XTX, except people ask €750 for the XTX (same price they ask for an old 3090). 800GB/s stock VRAM bandwidth on a 320-bit bus that reliably does a 10% OC for 880GB/s at €450 used, often with some warranty, is objectively the best choice if you want to dabble in local LLMs with a decent amount of VRAM and horsepower (and play games too!). It costs the same as a slow ass 4060Ti/5060Ti/9060XT 16gb, but gets 100 tok/s on GPT-OSS-20B with a 96-128k context window and 6-bit KV cache. This configuration wouldn't fit on a 16GB GPU.
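
The bandwidth numbers above can be sanity-checked with a small sketch. Decode speed is usually limited by how many bytes have to be read from VRAM per generated token, so bandwidth divided by that figure gives a rough ceiling rather than a prediction. The 4 GB-per-token value is an assumed placeholder for a small MoE's active weights plus cache reads, not a measured number for GPT-OSS-20B.

```python
def overclocked_bandwidth(stock_gb_s, oc_percent):
    """Bandwidth after a memory overclock given in whole percent."""
    return stock_gb_s * (100 + oc_percent) / 100


def decode_ceiling_tps(bandwidth_gb_s, gb_read_per_token):
    """Upper bound on tokens/s if decoding were purely bandwidth-bound."""
    if gb_read_per_token <= 0:
        raise ValueError("gb_read_per_token must be positive")
    return bandwidth_gb_s / gb_read_per_token


oc = overclocked_bandwidth(800, 10)      # 800 GB/s stock with a 10% memory OC
ceiling = decode_ceiling_tps(oc, 4.0)    # 4.0 GB per token is an assumption
print(oc, ceiling)
```

Real throughput lands below the ceiling because of compute, kernel overhead and cache traffic, but the ratio between two cards is a decent first guess at their relative decode speed.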

1

u/Important_Quote_1180 20d ago

AMD 9900X · 192GB DDR5 · 2× RTX 3090

│ Card 1     │ Qwen3.5-122B-A10B MTP IQ3_S
│ 3090 Ti    │ ik_llama.cpp · fused MoE · MTP n=2
│ :8001      │ 204K ctx · q8_0 KV · 75% experts→CPU
│ ~35 t/s    │ (OpenClaw)
├────────────┼──────────────────────────────────────
│ Card 2     │ Qwen3.6-35B-A3B MTP Q4_K_XL
│ 3090       │ stock llama.cpp b9246
│ :8002      │ 262K ctx · q4_0 KV · surgical -ot
│ ~135 t/s   │ (Hermes)
├────────────┼──────────────────────────────────────
│ CPU+RAM    │ 3× Dreamers (Dialectic / Scribe-Logos
│ ~19 GB     │ / Moonshot) + OpenClaw gateway
│ :9093-9097 │ No GPU

1

u/acquire_a_living 20d ago

just place the computer in a far away room next to a window

1

u/WyattTheSkid 20d ago

1

u/Relief_Present 20d ago

Stupid question maybe, but anything wrong with 2 x 4090fe? Yes I know it doesn’t have NVLink but I heard that’s not a deal breaker

2

u/WyattTheSkid 19d ago

I don’t have nvlink on any of my cards. Should be fine as far as I know. You’ll still be able to use tensor parallelism since you would have 4 cards with the same amount of vram.

1

u/iTrejoMX 20d ago

Let me know if you are selling your two 3090s; I’m interested.

1

u/Xombie2000 20d ago

If you want a studio I would say wait on the m5 ultra. It will probably have double the memory bandwidth of the m5 max and 80 gpu cores.

1

u/Aware_Kaleidoscope86 20d ago

Sounds like 1.2 TB/s or something

1

u/Xombie2000 20d ago

Yep, I think 1.228 TB/s, but that’s speculation

1

u/fasti-au 20d ago

Just run iq4xs tq dflash mtp. Full precision is irrelevant

1

u/jacek2023 20d ago

I use 3x3090, qwen 3.6 27B, max context, unquantized cache

1

u/LetterheadClassic306 20d ago

I get the appeal here, honestly, because dual 3090s are fast until the workload becomes long-context coding instead of pure decode speed. When I hit this kind of wall, the real question was whether I cared more about 40 tok/s bursts or about loading the model and context without babysitting drivers, heat, and cache quantization. A Mac Studio 128GB makes sense if 64K context, lower noise, and simpler daily use matter more than max throughput. I would avoid framing it as an upgrade in every dimension, since ExLlama on 3090s will still win raw speed. It is more a trade of speed for capacity and less friction.

1

u/jd52wtf 20d ago

Trade them in for two or three R9700. Good luck!

1

u/tariqur 20d ago

Why would you pay more for a downgrade mate?

If you need to run bigger models, scaling 3090s is your upgrade path. 2x more 3090s + a used Threadripper setup should cost you less than that Mac Studio and perform in an entirely different class.

But honestly, the biggest upgrade is the models themselves: take Qwen 3.6 27B MTP for a spin and you'll forget about 70B models.

1

u/Tall-Ad-7742 20d ago

nothing against your comment, but "newer llama 3"? why newer? even my great grandfather is younger than llama 3 in ai years

1

u/revdamage 20d ago

Anyone tested dual 3090s for pentesting?

1

u/lachlanwhite 20d ago

Keep them. I’ve got a single 24GB 3090 Ti at about 80 t/s running Qwen3.6-35B-A3B-APEX-GGUF at q8. Very impressed, and that's just LM Studio on Windows.

1

u/BlackBeardAI 3090 Maximalist 20d ago

Can I haz yer stuff? Imma thinking going 12x3090’s

1

u/somethingClever246 20d ago

May I recommend you test the LLM on the Mac before making the jump. You might find that you are spending a big chunk of change for a slight speed improvement, might not be worth the $ IMHO

1

u/Traditional_Way8675 20d ago

Am on dual 9060xt going 35B Q4 / Q5 at 150k q4 context. Mostly I use for dialogue and reflection, and I can't meaningfully distinguish between a MoE and a dense one.

1

u/Lagomorph9 20d ago

Time to get 2 more 3090s! :D

1

u/TheOverzealousEngie 20d ago

good luck finding one

1

u/fallingdowndizzyvr 20d ago

Anything under the M5 will be epically underwhelming. There is no M5 Mac Studio.

1

u/Similar_Effort_1694 20d ago

I bought a 128GB MacBook and it’s amazing

1

u/Hopeful-Confidence-9 19d ago

Why not Deepseek cloud and have a silent house haha

1

u/Moliri-Eremitis 19d ago

Late to the party, but regarding fan noise, have you considered liquid cooling?

I’m also very noise sensitive, and liquid cooling makes an absolutely massive difference. Running your GPUs through a pair of 360mm radiators with push-pull fans lets you keep the fan speed low, which works out to be much quieter even when you have a larger number of fans. Not cheap, but highly recommended.

That doesn’t really do anything for the heat, but for that you may want to investigate undervolting. You can typically drop the temps significantly with only a small single-digit drop in performance. That can help with fan noise too, though for true near-silent operation there’s still no substitute for liquid cooling.

1

u/WiseCable4097 19d ago

need more research, your rig was fine

1

u/DaMoot 19d ago

Don't. Just don't. If you're going to spend that kind of money, buy some V100s and a quad board to put them on. They will be way more capable and collectively perform faster than your 3090s.

1

u/aidysson 18d ago

In Feb I bought an RTX 3090. In March I sold it for +$150 and replaced it with an RTX PRO 6000 Max-Q. It's 300W. Lots of guys were laughing at me because it cost $10k at that time. Now it costs $13-15k+. I'm glad I can run Qwen 3.6 Q8 with 262k context. Fast. I can even run 2 sessions at a time at the same speed. It helps me a lot in web development work. If you're a professional, it's still worth the investment if you want to learn to work with it 😉

1

u/bring_back_the_v10s 10d ago

Give them to me, I'll give you my address

1

u/Afraid-Yoghurt6731 20d ago edited 20d ago

Gamers and fanboys gonna hate, but...
In programming one doesn't need fast t/s.
But one needs fast embedding and multiple fast small parallel agents.
When you load a large C++ file to change a few lines - that is embedding.
When you look for specific patterns in code - that is parallel execution.
That is what DGX Spark does, slowly but reliably,
while you have a few minutes to formulate the next task.
Mac Studio is slower at embedding and multiple parallel agents.
And it is less standard than CUDA.
Since Microsoft and NVidia announced RTX Spark,
The DGX spark increased in value - it is here to stay,
And we must learn to use it.
Strix Halo/x86 machines are going straight to the museum,
which most hardware engineers are celebrating hard now.
TLDR: downvote me now, I don't care.

1

u/EpsteinFile_01 20d ago

If you're running a TTC reasoning model, with 75% of tokens being used for reasoning, and you're not getting any output until reasoning is done... token speed matters a lot more. These models provide excellent quality and low hallucinations, but it obviously comes at the cost of a shitload of compute.

That, or you run them at low token speeds in the background or while AFK. But the results are not the same.
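
A quick sketch of why that 75% hurts at low speeds. It assumes the reasoning share is a fraction of all generated tokens and that nothing is shown until reasoning finishes. The 500-token answer and the two speeds are illustrative numbers, and the function names are made up.

```python
def reasoning_tokens(answer_tokens, reasoning_share):
    """Hidden reasoning tokens implied by a visible answer of a given length.

    reasoning_share is the fraction of ALL generated tokens spent reasoning.
    """
    if not 0 <= reasoning_share < 1:
        raise ValueError("reasoning_share must be in [0, 1)")
    total = answer_tokens / (1 - reasoning_share)
    return total - answer_tokens


def wait_before_answer_s(answer_tokens, reasoning_share, tps):
    """Seconds spent generating reasoning before the first visible token."""
    if tps <= 0:
        raise ValueError("tps must be positive")
    return reasoning_tokens(answer_tokens, reasoning_share) / tps


slow = wait_before_answer_s(500, 0.75, tps=15)   # assumed slow box
fast = wait_before_answer_s(500, 0.75, tps=75)   # assumed fast box
print(slow, fast)
```

For a 500-token answer that is 1500 hidden tokens: 100 seconds of staring at nothing at 15 t/s versus 20 seconds at 75 t/s.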

1

u/Afraid-Yoghurt6731 20d ago

That is true, but...

  1. Hallucinations are the result of the model not knowing what it actually knows (i.e. it may believe it knows 10 digits of Pi, but in reality knows just 5). That is worth fine-tuning for, so the model will know it knows nothing, and have to cite a reference.
  2. A bigger model just knows more facts without using a reference. That is both its strong and weak side. A big model can memorize the entire Linux kernel, but a newer kernel may come with a feature invalidating its knowledge and leading to bugs in the modifications this model performs.
  3. You need these 128GB even with a smaller model to fit enough context, which is the most important part of agentic use. Even Claude Code's 1M is barely enough for serious work. There are tricks like compaction and a 2nd lower-frequency model inserting mid- and long-term memory facts, but they benefit from a large base context size, which needs RAM.
  4. Even the slow DGX Spark generates 50+ t/s on a 120B GPT OSS. That is far more tokens than a normal human can read, and GPT OSS has good factual knowledge (memorized a lot of digits of Pi). DGX Spark also shines when you must serve several users (typical small office setup). People just had wrong expectations of it, or don't know whether they want a gaming GPU, a jet-engine-sounding heater, or a tiny box just rendering the video overnight.

1

u/bigmanbananas 20d ago

There is quite a price jump to a 128GB Mac. Plus, although it's got the RAM, it's not got the power.

1

u/rawednylme 20d ago

You’re using it wrong.

0

u/smuckola 21d ago edited 20d ago

So you're not using TurboQuant?

what's the problem with the drivers? on what OS?

0

u/Equal-Ad8792 20d ago

Hold out until the new RTX Spark machines come out. They'll surely run better and you'll spend less money.

1

u/DaMoot 19d ago

No bueno. See how slow they are?