I finally managed to get the Llama 3.2 and Phi 3.5 vision models to run on my M2 Mac laptop, using the mistral.rs Rust library and its CLI tool and Python bindings https://simonwillison.net/2024/Oct/19/mistralrs/
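For anyone who wants to try the Python bindings, the call looks roughly like this - a minimal sketch based on the mistral.rs docs, where the model ID, image URL and prompt are just illustrative placeholders:

    # Sketch of running Phi-3.5-vision via the mistral.rs Python bindings
    # (class names follow the project docs; parameters here are examples only)
    from mistralrs import Runner, Which, ChatCompletionRequest, VisionArchitecture

    runner = Runner(
        which=Which.VisionPlain(
            model_id="microsoft/Phi-3.5-vision-instruct",
            arch=VisionArchitecture.Phi3V,
        ),
    )

    response = runner.send_chat_completion_request(
        ChatCompletionRequest(
            model="phi3v",
            messages=[
                {
                    "role": "user",
                    "content": [
                        {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
                        {"type": "text", "text": "Describe this image in detail."},
                    ],
                }
            ],
            max_tokens=256,
            temperature=0.1,
        )
    )
    print(response.choices[0].message.content)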
-
Joseph Szymborski :qcca: replied to Simon Willison
Very impressive.
I'm trying to find many of the objects it's pointing out, and while I can guess what it's referring to, I would struggle to say that it is accurate in describing things in the scene.
e.g. I see a gas canister, but it isn't white and black, nor is it adjacent to a pump that is red and white (although it is adjacent to two pumps, one red and one white).
-
Simon Willison replied to Joseph Szymborski :qcca:
@jszym yeah it's definitely not a completely accurate description - the vision models are even more prone to hallucination than plain text models!
-
Simon Willison replied to Simon Willison
I recommend reading the descriptions closely and comparing them with the images - these vision models mix what they are seeing with "knowledge" baked into their weights and can often hallucinate things that aren't present in the image as a result
-
Leaping Woman replied to Simon Willison
@simon yep, which is particularly not helpful for users of screen readers.
-
Simon Willison replied to Leaping Woman
@leapingwoman I've talked to screen reader users who still get enormous value out of the vision LLMs - they're generally reliable for things like text and high-level overviews; where they get weird is in the more detailed descriptions
Plus the best hosted models (GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro) are a whole lot less likely to hallucinate than the ones I can run on my laptop!
-
Simon Willison replied to Simon Willison
@leapingwoman I use Claude 3.5 Sonnet to help me write alt text on almost a daily basis, but I never use exactly what it spat out - I always further edit it myself for clarity and to make sure it's as useful as possible
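The general pattern looks something like this - a minimal sketch using the anthropic Python library, where the prompt wording and model string are placeholders rather than the exact prompt shared later in this thread:

    # Sketch: asking Claude 3.5 Sonnet for alt text via the anthropic library
    # (prompt and model string are placeholders, not the exact ones used above)
    import base64
    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    with open("photo.jpg", "rb") as f:
        image_data = base64.standard_b64encode(f.read()).decode("utf-8")

    message = client.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=300,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": "image/jpeg",
                            "data": image_data,
                        },
                    },
                    {
                        "type": "text",
                        "text": "Write concise alt text for this image, including any text visible in it.",
                    },
                ],
            }
        ],
    )
    print(message.content[0].text)

Whatever comes back still needs a human pass - as noted above, the output is a starting point to edit, not something to paste in verbatim.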
-
Simon Willison replied to Prem Kumar Aparanji 👶🤖🐘
@prem_k the README says that it can do both GGUF and "plain models" - I haven't figured out exactly what that means yet, but the community around the tool seems to be releasing their own builds of models onto Hugging Face https://github.com/EricLBuehler/mistral.rs
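If I'm reading the README right, the distinction shows up in the Python bindings as two different loaders - this is a rough, unverified sketch where the model IDs and filenames are examples only:

    # Rough sketch of the "plain" vs GGUF distinction in the Python bindings
    # (unverified; model IDs and filenames below are examples, not recommendations)
    from mistralrs import Runner, Which

    # "Plain" model: original safetensors weights pulled from Hugging Face,
    # optionally quantized in-situ at load time
    plain_runner = Runner(
        which=Which.Plain(model_id="meta-llama/Llama-3.2-3B-Instruct"),
    )

    # GGUF model: a community-published pre-quantized file
    gguf_runner = Runner(
        which=Which.GGUF(
            tok_model_id="mistralai/Mistral-7B-Instruct-v0.1",
            quantized_model_id="TheBloke/Mistral-7B-Instruct-v0.1-GGUF",
            quantized_filename="mistral-7b-instruct-v0.1.Q4_K_M.gguf",
        ),
    )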
-
Joseph Szymborski :qcca: replied to Simon Willison
@simon @leapingwoman Yah, it looks like Claude 3.5 Sonnet is right on the money with this one:
-
Erik Živković replied to Simon Willison
@simon something I've wondered about when reading your blog is: Do you generate parts of / whole posts?
-
Simon Willison replied to Erik Živković
@ezivkovic not really - very occasionally I'll let Copilot in VS Code finish a sentence for me (just the last two or three words) but I tend to find LLM generated text is never exactly what I want to say
-
Leaping Woman replied to Joseph Szymborski :qcca:
-
Simon Willison replied to Leaping Woman
@leapingwoman @jszym here's the prompt I use for alt text with Claude
-
Simon Willison replied to Simon Willison
@leapingwoman @jszym that prompt gave me this for the museum exterior photo: "Exterior of the Pioneer Memorial Museum, a white neoclassical building with columns. Sign in foreground reads "HEADQUARTERS INTERNATIONAL SOCIETY DAUGHTERS OF UTAH PIONEERS". Statue visible in front of building. Overcast sky and trees surrounding the museum."
-
Simon Willison replied to Simon Willison
@leapingwoman @jszym and for the antique gas pumps: "Vintage gas station memorabilia collection: Colorful display of old gas pumps, signs, and accessories including Hancock Gasoline, Dixie, Ethyl, Union Gasoline, Skelly, and other brands. Visible text: "HANCOCK GASOLINE", "DIXIE", "ETHYL", "UNION GASOLINE", "SKELLY", "GASOLINE Buy it Here", "SLOW DANGEROUS CORNER", "CAUTION", "CONTAINS LEAD"."
I'd edit that one, I don't think it's quite right
-
Florian Idelberger replied to Simon Willison
@simon thanks for this! I had some issues replicating it though - on an M3 Max it always crashes for me. (Plus it's also annoying that it crashes or errors if it cannot find an image. There is a PR to fix that, but it's not merged yet.) And even on the M3 Max, since the in-situ quantization is done on one core, it takes a while... have you experienced one or all of these?
-
Simon Willison replied to Florian Idelberger
@fl0_id it didn't crash on me but I literally only did the things in my write-up, I haven't explored beyond that yet