I added multi-modal (image, audio, video) support to my LLM command-line tool and Python library, so now you can use it to run all sorts of content through LLMs such as GPT-4o, Claude and Google Gemini

Simon Willison

If you are still LLM-skeptical but haven't spent much time thinking about or experimenting with these multi-modal variants, I'd encourage you to take a look at them.

Being able to extract information from images, audio and video is a truly amazing capability, and something which was previously prohibitively difficult - see XKCD 1425 https://xkcd.com/1425/

Simon Willison

The LLM Python library supports attachments now as well https://llm.datasette.io/en/stable/python-api.html#attachments
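
A minimal sketch of what that can look like, following the attachments documentation linked above. The model ID `gpt-4o-mini` and the helper names `attachment_kwargs` and `describe` are assumptions for illustration. `llm.Attachment` takes either `path=` or `url=`, and calling `describe` needs the `llm` package installed with an API key configured.

```python
from urllib.parse import urlparse


def attachment_kwargs(source: str) -> dict:
    """Decide whether a source string is a URL or a local file path."""
    scheme = urlparse(source).scheme
    if scheme in ("http", "https"):
        return {"url": source}
    return {"path": source}


def describe(sources, prompt="Describe this", model_id="gpt-4o-mini"):
    """Send a prompt plus attachments to a multi-modal model via LLM."""
    import llm  # imported lazily so the helper above works without it

    model = llm.get_model(model_id)  # model_id is an assumed default
    attachments = [llm.Attachment(**attachment_kwargs(s)) for s in sources]
    response = model.prompt(prompt, attachments=attachments)
    return response.text()
```

Something like `print(describe(["photo.jpg"]))` would then send a local image (a hypothetical file name) to the model and print its reply.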

Prem Kumar Aparanji 👶🤖🐘

@simon neat!

Where can I look at the code behind this function?

Simon Willison

@prem_k more docs here: https://llm.datasette.io/en/stable/plugins/advanced-model-plugins.html#attachments-for-multi-modal-models

Implementations are spread out across different plugins, eg https://github.com/simonw/llm/blob/a44ba49c21f8d4ac30c8e41bfa5599c258ce53cc/llm/default_plugins/openai_models.py#L338 and https://github.com/simonw/llm-gemini/blob/ce82727a6950c7769a8e40bf030591d0e6f83e5e/llm_gemini.py#L135

Daniel

@simon Note that some of us are skeptical for reasons such as the exploitation of creative folks, the copyright infringements at scale, the hype cycle created by venture capital, the impact it has on misinformation and the ads space, and so on. Some of the tech is cool no doubt.

Simon Willison

@djh those are all very valid reasons to be skeptical!

The only reason I'll consistently push back on is the idea that these things aren't useful at all

Bornach

@djh @simon
Some are also skeptical because research demonstrates that LLMs are not reasoning
https://youtu.be/TpfXFEP0aFs

but instead are regurgitating memorised answers
https://youtu.be/y1WnHpedi2A

among other problems of which users should be made aware before placing all their trust in the generated response
https://youtu.be/7bmhjt1cpRs

Florencio Cano

@bornach @djh @simon in my opinion, that LLMs don't reason is a red herring: computers do not reason and they have proven to be very useful. I think other concerns like energy consumption or not honoring intellectual property are valid though.

aburka 🫣

@simon note that the capability is dangerously untrustworthy (and when you don't mention this or any of the other concerns like environmental harm or creative theft, "check this out if you're still skeptical" comes across as condescending) https://www.engadget.com/ai/openais-whisper-invents-parts-of-transcriptions--a-lot-120039028.html

Simon Willison

@aburka from my post:

aburka 🫣

@simon I mean, they've been evaluated, they're not suitable. What's left to explore?

Simon Willison

@aburka since you ask, I did dig around in one of the papers underlying the other story and found it was partly about how much better whisper v3 was compared to v2 https://fedi.simonwillison.net/@simon/113380266881069878

Simon Willison

@aburka generally though the most important thing about using LLMs (and AI/machine learning models in general) is figuring out how to make effective and responsible use of inherently unreliable technology

Generating unreviewed medical transcripts and then throwing away the original recordings is NOT responsible

Xing Shi Cai

@simon Does video work? I tried both Gemini Pro and Flash, but I only got an error message. Do I need a paid account to use video scraping? (Image works as expected.)

Simon Willison

@xsc video should work, what file format were you trying? Currently needs to be less than 20MB - that's a temporary limitation of my llm-gemini plugin
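
A quick way to rule out the size limit before sending a file, as a rough sketch. The function name `fits_limit` is hypothetical, and treating "20MB" as 20 * 1024 * 1024 bytes is an assumption; the plugin may count it differently.

```python
import os

MAX_BYTES = 20 * 1024 * 1024  # assumed reading of the "20MB" limit


def fits_limit(path: str, limit: int = MAX_BYTES) -> bool:
    """Return True if the file at `path` is strictly smaller than `limit` bytes."""
    return os.path.getsize(path) < limit
```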

Xing Shi Cai

@simon I was using an MP4 of 5 MB size. The error just says "internal error". I downloaded the video from here https://www.pexels.com/video/catching-and-releasing-a-big-carp-fish-in-the-lake-5538137/

Simon Willison

@xsc I've seen a few of those "Internal error" messages too - I think it's Gemini being a little bit flaky, sometimes resubmitting works fine the second time
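
If resubmitting by hand gets tedious, a small retry wrapper is one option. This is only a sketch: `retry` is a hypothetical helper, and it catches every exception because the error type raised for "Internal error" is not known here.

```python
import time


def retry(call, attempts=3, delay=1.0):
    """Run `call()` until it succeeds or `attempts` tries are used up.

    Re-raises the last exception if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return call()
        except Exception as error:  # broad on purpose: the error type is an unknown
            last_error = error
            if attempt < attempts - 1:
                time.sleep(delay)  # brief pause before resubmitting
    raise last_error
```

Wrapping the prompt call in a zero-argument function or lambda and passing it to `retry` would resubmit it up to three times.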

Bornach

@florenciocano @djh @simon
Just not very useful for solving maths problems that haven't already been solved and scraped into the training data
https://youtu.be/8_Nr5oKIAmI
And students are supposedly using this to cheat on their homework?

Simon Willison

@bornach @florenciocano @djh right - LLMs are notoriously bad at math (and logic puzzles too)

Xing Shi Cai

@simon I was using the following command

> llm 'please explain what is happening in the video' -a man-in-water.mp4 -m gemini-1.5-flash-latest

Does it look like it should work?