Mac Studio running large models is just so-so
by Poster
May 19, 2025
26
In other words, it's toy-level.
Mac Studio M3 Ultra, 512GB unified memory (shared RAM/VRAM), 671B Q4_K_M: GPU and memory both pegged, a little over 10 tokens/s.
A 32B model barely touches the memory (about 8%), but the GPU still stays pegged: a little over 20 tokens/s.
Add the embedding and rerank models (standard for a knowledge base) to the same machine and it basically grinds to a halt.
Running the knowledge bases for Obsidian and Dify, the speed is about the same as my AMD box with 64GB RAM and a 4060 Ti 16GB running a 14B model.
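The post doesn't say which runtime produced these numbers, but Q4_K_M is a GGUF quantization, so a llama.cpp-family server (llama-server, Ollama, LM Studio) is a reasonable guess, and those all expose an OpenAI-compatible endpoint. A rough sketch of how the tokens/s and time-to-first-token figures above could be measured; the base URL and model id are placeholders, not the poster's actual setup.

```python
# Rough throughput probe against a local OpenAI-compatible endpoint
# (llama-server, Ollama, LM Studio, etc.). base_url and model are placeholders.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

start = time.time()
first_token_at = None
n_chunks = 0
stream = client.chat.completions.create(
    model="deepseek-r1-671b-q4_k_m",  # placeholder model id
    messages=[{"role": "user", "content": "Summarize this note in 200 words."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content or ""
    if delta and first_token_at is None:
        first_token_at = time.time()
    n_chunks += 1  # each streamed chunk is roughly one token

if first_token_at is None:
    raise RuntimeError("no tokens received")
decode_s = time.time() - first_token_at
print(f"TTFT: {first_token_at - start:.1f}s, decode: {n_chunks / decode_s:.1f} tok/s")
```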
Replies
-
Anonymous3584 May 18, 2025 Sell it to me at 50% off and I won't complain. 🐶
-
Anonymous46 May 18, 2025 That's about right. For one thing, the Mac may have a huge pool of "VRAM", but its TOPS are low, so it doesn't have the compute for truly large models. For another, most mainstream models are optimized specifically for CUDA, and few people bother working out how to run them well on a Mac. If you really want to run models, why not get a 48GB 4090?
-
Anonymous1658 May 18, 2025 A 671B, and you're comparing it against a 14B?
-
Poster May 18, 2025 @Anonymous1658 Did you not see the 32B numbers?
-
Anonymous107 May 18, 2025 Will the M4 be a big improvement?
-
Anonymous2059 May 18, 2025 The whole point was that this one small machine can even load the 671B model, while other consumer-grade graphics cards don't have anywhere near enough VRAM to run it. Why don't you try running 671B on your AMD + 64GB RAM?
-
Anonymous12301 May 18, 2025 Then why not compare it against 32B?
-
Anonymous2239 May 18, 2025 It was never going to be fast, but it does show NVIDIA's moat is still worth something...
-
Anonymous2034 May 18, 2025 Of course. Otherwise, who would Lao Huang sell his cards to?
-
Anonymous7660 May 18, 2025 It can merely load the largest models; the compute isn't there, so it falls short of expectations.
-
Anonymous8038 May 18, 2025 @Anonymous2059 There was an article a while back: someone spent about ¥30k building an AMD PC with 768GB of RAM to run 671B Q8, brute-forcing about 7 token/s on the CPU. That's twice as cheap as the Mac, but also slower.
-
Anonymous1995 May 18, 2025 You actually believe what they say? Take it up with whoever said it.
-
Anonymous13183 May 18, 2025 Right now, none of the local models consumer-grade hardware can run are good enough for personal use. There's not much point to local AI; it makes little sense.
-
Anonymous7027 May 18, 2025 @Anonymous8038 You mean half the price. "Twice as cheap" would mean you got it for 0 yuan https://i.imgur.com/N9E3iZ2.png
-
Anonymous339 May 18, 2025 Haha, those bloggers never mention how long it takes to spit out the first token, let alone how slow it gets once the context grows. At the moment, among ordinary consumer-grade setups, only the 27B and 32B quants that run on a 4090 or 5090 with around 20GB of VRAM used are acceptable, and the official GPT-4o API, which is far stronger than a local 32B, costs at least an order of magnitude less than running it yourself.
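For what it's worth, the ~20GB figure above follows from simple arithmetic: weight footprint ≈ parameters × bits per weight / 8. A quick sketch; the bits-per-weight values are rough averages for llama.cpp K-quants, not exact numbers, and KV cache comes on top.

```python
# Back-of-the-envelope weight footprint of a quantized model (KV cache and
# runtime overhead come on top of this).
def weight_gb(params_b: float, bits_per_weight: float) -> float:
    return params_b * bits_per_weight / 8  # billions of params at 8 bits ~= GB

print(f"{weight_gb(32, 4.85):.1f} GB")   # ~19.4 GB: 32B Q4_K_M fits a 24GB card
print(f"{weight_gb(27, 5.7):.1f} GB")    # ~19.2 GB: 27B at ~Q5 in a similar budget
print(f"{weight_gb(671, 4.85):.0f} GB")  # ~407 GB: why 671B Q4 needs the 512GB Mac
```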
-
Anonymous13396 May 18, 2025 Right now, the biggest value of personally deploying a large model locally is the deployment itself; in other words, the whole thing is an exercise in tinkering 😂
-
Anonymous2059 May 18, 2025 @Anonymous8038 These days the bottleneck of LLM inference is, in most cases, memory bandwidth rather than compute. A100/H100 use very expensive HBM with several TB/s of bandwidth; the 512GB Mac Studio only has about 800GB/s, which isn't in the same league. As for running from system RAM, that basically means ktransformers, and even when CPU/GPU compute is sufficient it is still limited by memory bandwidth.
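That bandwidth argument can be turned into a rough ceiling: at batch size 1, each generated token has to stream all of the active weights from memory at least once, so tokens/s can't exceed bandwidth divided by the active-weight bytes. A quick sketch, assuming R1 activates roughly 37B parameters per token and Q4_K_M averages about 4.85 bits/weight; the bandwidth figures are public spec numbers, and real throughput lands well below the ceiling once KV-cache reads and kernel overhead are counted, which is consistent with the ~10 tokens/s in the post.

```python
# Bandwidth-bound decode ceiling: tok/s <= memory bandwidth / bytes of active weights.
def decode_ceiling(bandwidth_gb_s: float, active_params_b: float, bits_per_weight: float) -> float:
    gb_read_per_token = active_params_b * bits_per_weight / 8
    return bandwidth_gb_s / gb_read_per_token

# M3 Ultra (~800 GB/s) vs H100 SXM (~3350 GB/s), R1-style MoE (~37B active, ~4.85 bpw)
print(f"{decode_ceiling(800, 37, 4.85):.0f} tok/s")   # ~36 tok/s upper bound
print(f"{decode_ceiling(3350, 37, 4.85):.0f} tok/s")  # ~149 tok/s upper bound
```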
-
Anonymous355 May 18, 2025 What's going on with your language and logic? You make it sound like your crappy 4060 Ti is on the same level as the M3 Ultra.
-
Anonymous4004 May 18, 2025 Q4_K_M... you're not even running MLX, the Mac-native format, and you're here complaining about speed? With MLX you can get close to 18 t/s on R1 and 20 t/s on V3. And I have reason to believe you didn't unlock the wired-memory limit either.
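For reference, a minimal sketch of what "just run MLX" might look like with the mlx-lm package; the exact API can differ between mlx-lm versions, and the model id below is only a placeholder for a 4-bit MLX conversion. "Unlocking the memory" usually refers to raising the GPU wired-memory cap on macOS via the iogpu.wired_limit_mb sysctl.

```python
# Minimal mlx-lm sketch (pip install mlx-lm). Placeholder model id; a 4-bit MLX
# conversion of the target model would have to exist.
# To let the GPU wire more of the 512GB, people typically raise the cap first:
#   sudo sysctl iogpu.wired_limit_mb=<MB>
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/DeepSeek-R1-4bit")  # placeholder repo id
text = generate(
    model,
    tokenizer,
    prompt="Explain the difference between Q4_K_M and MLX 4-bit in two sentences.",
    max_tokens=200,
    verbose=True,  # prints prompt and generation tokens-per-second
)
```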
-
Anonymous10504 May 18, 2025 It is toy-level; what were you expecting? The Mac has exactly one advantage: at bs=1 with a light prefill load, it can run larger models at relatively good value. That's because its memory bandwidth is high while its compute is very, very low by comparison. So even when the M3 Ultra runs R1, an MoE with under 40B activated parameters, it only manages a bit over 10 tps under real workloads, and long-context generation speed and TTFT look even uglier.
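The "TTFT is even uglier" point comes down to prefill being compute-bound rather than bandwidth-bound: processing the prompt costs roughly 2 × active parameters FLOPs per prompt token, divided by whatever half-precision throughput the GPU can actually sustain. A rough sketch; the 30 TFLOPS figure is only an illustrative guess for an M3 Ultra-class GPU, not a measured number, and real TTFT is usually worse once attention and dequantization overhead are included.

```python
# Compute-bound prefill estimate: TTFT ~= (2 * active_params * prompt_tokens) / FLOPS.
def ttft_estimate(prompt_tokens: int, active_params_b: float, tflops: float) -> float:
    flops = 2 * active_params_b * 1e9 * prompt_tokens
    return flops / (tflops * 1e12)

# An 8k-token prompt into an R1-style MoE (~37B active params) at an assumed
# 30 TFLOPS of sustained half-precision compute:
print(f"{ttft_estimate(8000, 37, 30):.0f} s before the first token")  # ~20 s
```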