@simon any recommendations for what M4 MacBooks I should be looking at if I want to future-proof running local LLMs for the next couple of years?
-
@ericholscher optimise for RAM, that's the limiting factor on how good a model you can run
I filled up my 1TB hard drive with models pretty quickly, but loading them off an external drive seems to work fine if you have the memory for them
-
@simon Makes sense. Basically RAM limits how big of a model you can run, and GPU & memory bandwidth limit the tokens/s, from what I've read?
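A rough way to see why memory bandwidth caps tokens/s: every generated token has to stream the full set of weights through memory once, so peak decode speed is roughly bandwidth divided by model size. A back-of-envelope sketch in Python, using approximate spec-sheet bandwidth and quantized model sizes (assumed numbers, not benchmarks):

    # Back-of-envelope: every generated token streams all the weights through
    # memory once, so decode speed is at most bandwidth / model size.
    # Bandwidth and model sizes below are approximate assumptions.

    def peak_tokens_per_second(model_size_gb: float, bandwidth_gb_s: float) -> float:
        """Upper bound; real throughput is lower due to compute and overhead."""
        return bandwidth_gb_s / model_size_gb

    M4_MAX_BANDWIDTH = 546  # GB/s, approximate spec-sheet figure
    print(peak_tokens_per_second(5, M4_MAX_BANDWIDTH))   # ~8B model at 4-bit -> ~109 tok/s ceiling
    print(peak_tokens_per_second(40, M4_MAX_BANDWIDTH))  # ~70B model at 4-bit -> ~14 tok/s ceiling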
-
Jeff "weBOOOOlogy" Triplett replied to Eric Holscher
@ericholscher @simon the unified bus GPU + RAM are what really make Macs nice to work with LLMs.
I have a Mac Studio with 64 GB of RAM, and while I sort of regret not getting more RAM, there aren't many models I can't run. I can run 70 billion parameter models. Above that, sizes tend to jump up to 200B or 400B models, and nothing I could have bought would run those anyway.
https://ollama.com is a really nice project to work with locally that's easy to run with good performance (very cacheable).
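For calling a local model from code rather than the CLI, there's also an official Python client for Ollama. A minimal sketch, assuming `pip install ollama` has been done, the Ollama server is running, and the model has already been pulled with `ollama pull llama3.2`:

    # Minimal sketch using the ollama Python client against a locally running
    # Ollama server; assumes the llama3.2 model has already been pulled.
    import ollama

    response = ollama.chat(
        model="llama3.2",
        messages=[{"role": "user", "content": "Why is the sky blue?"}],
    )
    print(response["message"]["content"])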
-
Simon Willison replied to Jeff "weBOOOOlogy" Triplett
@webology @ericholscher listen to Jeff, he runs more local models than I do!
-
Jeff "weBOOOOlogy" Triplett replied to Simon Willison
@simon @ericholscher 64 GB is going to get you the best of today. 128 GB is hard to justify, but it might give you a bit more runway if model sizes change. I'm not even sure how to predict that.
The latest Llama 3.2 models are fairly reasonably sized (1B to 11B for consumers) https://ollama.com/library/llama3.2
LLM + llm-ollama is a pretty nice combo, plus the many other projects Simon writes about. Ollama can run Hugging Face models too.
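That combo looks roughly like this from Python, assuming `llm` and the `llm-ollama` plugin are installed and the model id matches whatever Ollama has pulled locally (the id here is an assumption):

    # Sketch of the LLM Python API routed through the llm-ollama plugin.
    # Assumes: `pip install llm`, `llm install llm-ollama`, `ollama pull llama3.2`.
    import llm

    model = llm.get_model("llama3.2:latest")  # id as exposed by llm-ollama (may vary)
    response = model.prompt("Explain unified memory in one paragraph")
    print(response.text())

On the command line the equivalent is `llm -m llama3.2:latest "your prompt"`.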
-
Roché Compaan replied to Jeff "weBOOOOlogy" Triplett
@webology @simon @ericholscher a similar question was asked on the LocalLLaMA reddit a few days ago. https://www.reddit.com/r/LocalLLaMA/s/5oUdBZvnxx. If it were an option I wouldn't run it on my main laptop but would offload it to a Mac Mini. Bottom line is still: go for as much RAM as you can afford.
-
Jeff "weBOOOOlogy" Triplett replied to Roché Compaan
@rochecompaan @simon @ericholscher RAM is >90% of the consideration, from what I have seen.
Like, check out the Llama 3.1 models: https://ollama.com/library/llama3.1/tags
8B ~= 8 GB RAM
70B ~= 64 GB RAM
405B ~= (more RAM than any of us can afford or that Apple will put in a Mac Studio)
I'm sure the M4 vs M2 is a nice bump for most apps, but I get good performance on my M2 Mac Studio.
I'd get a 64 GB (better choice), 96 GB, or 128 GB MacBook Pro or wait for the M4 Studio.
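Those figures fall out of a simple estimate: at the roughly 4-bit quantization Ollama ships by default, weights take around half a byte per parameter, plus some headroom for context and the OS. A sketch with assumed numbers, not exact Ollama download sizes:

    # Rough RAM estimate for a quantized model; bits-per-weight and headroom
    # are assumptions, not exact figures.
    def rough_ram_gb(params_billion: float, bits_per_weight: float = 4.5,
                     headroom: float = 1.2) -> float:
        weights_gb = params_billion * bits_per_weight / 8
        return weights_gb * headroom

    for size in (8, 70, 405):
        print(f"{size}B -> ~{rough_ram_gb(size):.0f} GB")
    # 8B   -> ~5 GB   (fine on 16 GB)
    # 70B  -> ~47 GB  (why 64 GB works)
    # 405B -> ~273 GB (why no current Mac fits it)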
-
Roché Compaan replied to Jeff "weBOOOOlogy" Triplett
@webology @simon @ericholscher I'm not gardening on Apple land and didn't know the Mac Mini maxes out at 48 GB RAM. Llama 3.1 70B only needs 40 GB. It will run on the Mac Mini, and one can always add another Mac Mini for even larger models with distributed llama: https://b4rtaz.medium.com/how-to-run-llama-3-405b-on-home-devices-build-ai-cluster-ad0d5ad3473b. I suspect that we will see advances where many parameters require much less RAM very soon, which would be great for local and private AI. Some devs are already achieving this with tuning: https://www.reddit.com/r/LocalLLaMA/comments/188197j/80_faster_50_less_memory_0_accuracy_loss_llama/
-
Jeff "weBOOOOlogy" Triplett replied to Roché Compaan
@rochecompaan @simon @ericholscher That was for Eric’s question of how much RAM to be future proof.
If you can't fit the full model in RAM and the context window in memory, it might take one to ten minutes per token to process, if it even works. That article appears to be swapping with a small output window. You can do that, but I am not sure it's worth it. (1/2)
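To put a number on the context-window part: the KV cache grows linearly with context length. A sketch using the published Llama 3.1 70B shape (80 layers, 8 KV heads of dimension 128) and assuming an unquantized fp16 cache:

    # KV-cache memory grows linearly with context length.
    # Shape values are the published Llama 3.1 70B config; fp16 cache assumed.
    def kv_cache_gb(context_len: int, n_layers: int = 80, n_kv_heads: int = 8,
                    head_dim: int = 128, bytes_per_value: int = 2) -> float:
        per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value  # K and V
        return context_len * per_token / 1e9

    print(f"{kv_cache_gb(8_192):.1f} GB at 8k context")      # ~2.7 GB
    print(f"{kv_cache_gb(131_072):.1f} GB at 128k context")  # ~42.9 GB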
-
Eric Holscher replied to Jeff "weBOOOOlogy" Triplett
@webology @rochecompaan @simon Appreciate the info! I am already pretty impressed with that 3B Llama model that runs pretty fast on my old M1, so it definitely feels like the quality of what we can run on a 64 GB machine over the next few years is gonna be pretty impressive.
-
Simon Willison replied to Eric Holscher
@ericholscher @webology @rochecompaan yeah, one of the most exciting trends right now is that the capable models keep getting smaller and easier to run - I've been having fun with this one: https://simonwillison.net/2024/Nov/2/smollm2/