I finally managed to get the Llama 3.2 and Phi 3.5 vision models to run on my M2 Mac laptop, using the mistral.rs Rust library and its CLI tool and Python bindings https://simonwillison.net/2024/Oct/19/mistralrs/
-
Simon Willison replied to Simon Willison
I then used the mistralrs-metal Python library to run this photo from Mendenhall's Museum of Gasoline Pumps & Petroliana (https://www.niche-museums.com/107) through Microsoft's Phi-3.5 Vision model
"What is shown in this image? Write a detailed response analyzing the scene."
-
Prem Kumar Aparanji 👶🤖🐘 replied to Simon Willison
@simon did you see that h2o.ai did well even with a 0.8B model?
-
Simon Willison replied to Prem Kumar Aparanji 👶🤖🐘
@prem_k do you know if anyone has figured out a recipe for running that on the Mac yet?
-
Prem Kumar Aparanji 👶🤖🐘 replied to Simon Willison
@simon no, not yet.
I've yet to look into the model files, but if they're available as GGUF or ONNX it should be possible to run them with llama.cpp or wllama for GGUF, and Transformers.js for ONNX.
It's also possible to convert GGUF files for running with Ollama.
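Roughly, loading a GGUF file with the llama-cpp-python bindings looks like this - the filename is a placeholder, it assumes a text-only GGUF export exists, and a vision model would also need its image projector wired up:

```python
from llama_cpp import Llama

# Placeholder path - assumes someone has published a GGUF export of the model
llm = Llama(model_path="./h2o-0.8b.Q4_K_M.gguf", n_ctx=4096)

out = llm(
    "Describe this model's intended use in one sentence.",
    max_tokens=64,
)
print(out["choices"][0]["text"])
```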
-
Prem Kumar Aparanji 👶🤖🐘 replied to Prem Kumar Aparanji 👶🤖🐘
@simon also, since the h2o models are available as safetensors, it should be possible to run them with MLX on the Mac.
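A minimal sketch of that route with the mlx-lm package, which can load Hugging Face safetensors checkpoints directly on Apple silicon - the model id below is a placeholder, not a specific h2o release:

```python
from mlx_lm import load, generate

# Placeholder repo id - any Hugging Face model with safetensors weights
# that mlx-lm supports can go here
model, tokenizer = load("h2oai/placeholder-model-id")

text = generate(
    model,
    tokenizer,
    prompt="What file format are these weights stored in?",
    max_tokens=64,
)
print(text)
```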
I haven't looked into this Rust inference engine you wrote about in the OP above. Would you know what model file formats it supports?
-
Joseph Szymborski :qcca: replied to Simon Willison
Very impressive.
I'm trying to find many of the objects it points out, and while I can guess what it's referring to, I'd struggle to call its description of the scene accurate.
e.g. I see a gas canister, but it isn't white and black, nor is it adjacent to a pump that is red and white (although it is adjacent to two pumps, one red and one white).
-
Simon Willison replied to Joseph Szymborski :qcca:
@jszym yeah, it's definitely not a completely accurate description - the vision models are even more prone to hallucination than plain text models!
-
Simon Willison replied to Simon Willison
I recommend reading the descriptions closely and comparing them with the images - these vision models mix what they are seeing with "knowledge" baked into their weights and can often hallucinate things that aren't present in the image as a result
-
Leaping Woman replied to Simon Willison
@simon yep, which is particularly unhelpful for users of screen readers.
-
Simon Willison replied to Leaping Woman
@leapingwoman I've talked to screen reader users who still get enormous value out of the vision LLMs - they're generally reliable for things like text and high-level overviews; where they get weird is in the more detailed descriptions
Plus the best hosted models (GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro) are a whole lot less likely to hallucinate than the ones I can run on my laptop!
-
Simon Willison replied to Simon Willison
@leapingwoman I use Claude 3.5 Sonnet to help me write alt text on almost a daily basis, but I never use exactly what it spits out - I always edit it further myself for clarity and to make sure it's as useful as possible
-
Simon Willison replied to Prem Kumar Aparanji 👶🤖🐘
@prem_k the README says it can do both GGUF and "plain models" - I haven't figured out exactly what that means yet, but the community around the tool seems to be releasing their own builds of models onto Hugging Face https://github.com/EricLBuehler/mistral.rs
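For what it's worth, the project's Python examples suggest the distinction is roughly this: GGUF models load from a quantized .gguf file, while "plain models" load unquantized safetensors straight from a Hugging Face repo. A sketch only - the repo names are illustrative and the exact keyword arguments may differ between versions:

```python
from mistralrs import Runner, Which, Architecture

# Quantized GGUF route: point at a GGUF file inside a Hugging Face repo
# (repo and filename are illustrative, not a recommendation)
gguf_runner = Runner(
    which=Which.GGUF(
        tok_model_id="mistralai/Mistral-7B-Instruct-v0.1",
        quantized_model_id="TheBloke/Mistral-7B-Instruct-v0.1-GGUF",
        quantized_filename="mistral-7b-instruct-v0.1.Q4_K_M.gguf",
    )
)

# "Plain model" route: unquantized safetensors pulled straight from the repo
plain_runner = Runner(
    which=Which.Plain(
        model_id="mistralai/Mistral-7B-Instruct-v0.1",
        arch=Architecture.Mistral,
    )
)
```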
-
Joseph Szymborski :qcca: replied to Simon Willison
@simon @leapingwoman Yeah, it looks like Claude 3.5 Sonnet is right on the money with this one:
-
Erik Živković replied to Simon Willison
@simon something I've wondered about when reading your blog: do you generate parts of posts, or whole posts?
-
Simon Willison replied to Erik Živković
@ezivkovic not really - very occasionally I'll let Copilot in VS Code finish a sentence for me (just the last two or three words), but I tend to find LLM-generated text is never exactly what I want to say
-
Leaping Woman replied to Joseph Szymborski :qcca:
-
Simon Willison replied to Leaping Woman
@leapingwoman @jszym here's the prompt I use for alt text with Claude
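The prompt itself was shared as an image, so it isn't reproduced here. The general shape of the call, using the Anthropic Python SDK, is roughly this - a sketch only, with a stand-in prompt rather than the real one:

```python
import base64
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Encode the photo as base64 so it can be sent as an image content block
with open("photo.jpg", "rb") as f:
    image_data = base64.standard_b64encode(f.read()).decode("utf-8")

message = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=300,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/jpeg",
                        "data": image_data,
                    },
                },
                # Stand-in prompt - not the actual alt text prompt from this thread
                {"type": "text", "text": "Write concise alt text for this image."},
            ],
        }
    ],
)
print(message.content[0].text)
```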
-
Simon Willison replied to Simon Willison
@leapingwoman @jszym that prompt gave me this for the museum exterior photo: "Exterior of the Pioneer Memorial Museum, a white neoclassical building with columns. Sign in foreground reads "HEADQUARTERS INTERNATIONAL SOCIETY DAUGHTERS OF UTAH PIONEERS". Statue visible in front of building. Overcast sky and trees surrounding the museum."
-
Simon Willison replied to Simon Willison
@leapingwoman @jszym and for the antique gas pumps: "Vintage gas station memorabilia collection: Colorful display of old gas pumps, signs, and accessories including Hancock Gasoline, Dixie, Ethyl, Union Gasoline, Skelly, and other brands. Visible text: "HANCOCK GASOLINE", "DIXIE", "ETHYL", "UNION GASOLINE", "SKELLY", "GASOLINE Buy it Here", "SLOW DANGEROUS CORNER", "CAUTION", "CONTAINS LEAD"."
I'd edit that one - I don't think it's quite right
-
Florian Idelberger replied to Simon Willison
@simon thanks for this! I had some issues replicating it, however - on an M3 Max it always crashes. (It's also annoying that it crashes or errors if it can't find an image. There's a PR to fix that, but it isn't merged yet.) And even on the M3 Max, the in-situ quantization runs on one core, so it takes a while... have you experienced any or all of these?