I finally managed to get the Llama 3.2 and Phi 3.5 vision models to run on my M2 Mac laptop, using the mistral.rs Rust library and its CLI tool and Python bindings https://simonwillison.net/2024/Oct/19/mistralrs/
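For anyone who wants to try the Python bindings, the call looks roughly like this - a minimal sketch based on the mistral.rs docs, where the model ID, image URL and prompt are just illustrative placeholders:

    # Sketch of running Phi-3.5-vision via the mistral.rs Python bindings
    # (class names follow the project docs; parameters here are examples only)
    from mistralrs import Runner, Which, ChatCompletionRequest, VisionArchitecture

    runner = Runner(
        which=Which.VisionPlain(
            model_id="microsoft/Phi-3.5-vision-instruct",
            arch=VisionArchitecture.Phi3V,
        ),
    )

    response = runner.send_chat_completion_request(
        ChatCompletionRequest(
            model="phi3v",
            messages=[
                {
                    "role": "user",
                    "content": [
                        {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
                        {"type": "text", "text": "Describe this image in detail."},
                    ],
                }
            ],
            max_tokens=256,
            temperature=0.1,
        )
    )
    print(response.choices[0].message.content)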
-
Joseph Szymborski :qcca: replied to Simon Willison
Very impressive.
I'm trying to find many of the objects it's pointing out, and while I can guess what it's referring to, I would struggle to say that it is accurate in describing things in the scene.
e.g. I see a gas canister, but it isn't white and black, nor is it adjacent to a pump that is red and white (although it is adjacent to two pumps, one red and one white).
-
Simon Willison replied to Joseph Szymborski :qcca:
@jszym yeah it's definitely not a completely accurate description - the vision models are even more prone to hallucination than plain text models!
-
Simon Willison replied to Simon Willison
I recommend reading the descriptions closely and comparing them with the images - these vision models mix what they are seeing with "knowledge" baked into their weights and can often hallucinate things that aren't present in the image as a result
-
Leaping Woman replied to Simon Willison
@simon yep, which is particularly not helpful for users of screen readers.
-
Simon Willison replied to Leaping Woman
@leapingwoman I've talked to screen reader users who still get enormous value out of the vision LLMs - they're generally reliable for things like text and high-level overviews; where they get weird is in the more detailed descriptions
Plus the best hosted models (GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro) are a whole lot less likely to hallucinate than the ones I can run on my laptop!
-
Simon Willison replied to Simon Willison
@leapingwoman I use Claude 3.5 Sonnet to help me write alt text on almost a daily basis, but I never use exactly what it spat out - I always further edit it myself for clarity and to make sure it's as useful as possible
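The general pattern looks something like this - a minimal sketch using the anthropic Python library, where the prompt wording and model string are placeholders rather than the exact prompt shared later in this thread:

    # Sketch: asking Claude 3.5 Sonnet for alt text via the anthropic library
    # (prompt and model string are placeholders, not the exact ones used above)
    import base64
    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    with open("photo.jpg", "rb") as f:
        image_data = base64.standard_b64encode(f.read()).decode("utf-8")

    message = client.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=300,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": "image/jpeg",
                            "data": image_data,
                        },
                    },
                    {
                        "type": "text",
                        "text": "Write concise alt text for this image, including any text visible in it.",
                    },
                ],
            }
        ],
    )
    print(message.content[0].text)

Whatever comes back still needs a human pass - as noted above, the output is a starting point to edit, not something to paste in verbatim.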
-
Simon Willison replied to Prem Kumar Aparanji 👶🤖🐘
@prem_k the README says that it can do both GGUF and "plain models" - I haven't figured out exactly what that means yet, but the community around the tool seems to be releasing their own builds of models onto Hugging Face https://github.com/EricLBuehler/mistral.rs
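If I'm reading the README right, the distinction shows up in the Python bindings as two different loaders - this is a rough, unverified sketch where the model IDs and filenames are examples only:

    # Rough sketch of the "plain" vs GGUF distinction in the Python bindings
    # (unverified; model IDs and filenames below are examples, not recommendations)
    from mistralrs import Runner, Which

    # "Plain" model: original safetensors weights pulled from Hugging Face,
    # optionally quantized in-situ at load time
    plain_runner = Runner(
        which=Which.Plain(model_id="meta-llama/Llama-3.2-3B-Instruct"),
    )

    # GGUF model: a community-published pre-quantized file
    gguf_runner = Runner(
        which=Which.GGUF(
            tok_model_id="mistralai/Mistral-7B-Instruct-v0.1",
            quantized_model_id="TheBloke/Mistral-7B-Instruct-v0.1-GGUF",
            quantized_filename="mistral-7b-instruct-v0.1.Q4_K_M.gguf",
        ),
    )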
-
Joseph Szymborski :qcca: replied to Simon Willison
@simon @leapingwoman Yah, it looks like Claude 3.5 Sonnet is right on the money with this one:
-
Erik Živković replied to Simon Willison
@simon something I've wondered about when reading your blog is: Do you generate parts of / whole posts?
-
Simon Willison replied to Erik Živković
@ezivkovic not really - very occasionally I'll let Copilot in VS Code finish a sentence for me (just the last two or three words) but I tend to find LLM generated text is never exactly what I want to say
-
Leaping Woman replied to Joseph Szymborski :qcca:
-
Simon Willison replied to Leaping Woman
@leapingwoman @jszym here's the prompt I use for alt text with Claude
-
Simon Willison replied to Simon Willison
@leapingwoman @jszym that prompt gave me this for the museum exterior photo: "Exterior of the Pioneer Memorial Museum, a white neoclassical building with columns. Sign in foreground reads "HEADQUARTERS INTERNATIONAL SOCIETY DAUGHTERS OF UTAH PIONEERS". Statue visible in front of building. Overcast sky and trees surrounding the museum."
-
Simon Willison replied to Simon Willison
@leapingwoman @jszym and for the antique gas pumps: "Vintage gas station memorabilia collection: Colorful display of old gas pumps, signs, and accessories including Hancock Gasoline, Dixie, Ethyl, Union Gasoline, Skelly, and other brands. Visible text: "HANCOCK GASOLINE", "DIXIE", "ETHYL", "UNION GASOLINE", "SKELLY", "GASOLINE Buy it Here", "SLOW DANGEROUS CORNER", "CAUTION", "CONTAINS LEAD"."
I'd edit that one, I don't think it's quite right
-
Florian Idelberger replied to Simon Willison
@simon thanks for this! I had some issues replicating it though - on an M3 Max it always crashes for me. (Plus it's also annoying that it crashes or errors if it cannot find an image. There is a PR to fix that, but it's not merged yet.) And even on the M3 Max, since the in-situ quantization is done on one core, it takes a while... have you experienced one or all of these?
-
Simon Willison replied to Florian Idelberger
@fl0_id it didn't crash on me but I literally only did the things in my write-up, I haven't explored beyond that yet