About as open source as a binary blob without the training data

[email protected]

Office Space meme:

"If y'all could stop calling an LLM "open source" just because they published the weights... that would be great."

[email protected]

Uuuuh… why?

Do you only accept open source code if you can see every key press every developer made?

[email protected]

Do you call binary-only software with EULA "Open Source" too?

[email protected]

Dude, the CPU instructions are right there, of course it's open source.

[email protected]

Open source means you can recreate the binaries yourself. Neither Facebook nor the devs of DeepSeek published which training data they used, nor their training algorithm.

[email protected]

Source control management software like Git lets you see basically this, yes.

unalivejoy

Open Source (generally and for AI) has an established definition.

The Open Source AI Definition – 1.0

Open Source Initiative (opensource.org)

[email protected]

It really comes down to this part of the "Open Source" definition:

The source code [released] must be the preferred form in which a programmer would modify the program

A compiled binary is not the format in which a programmer would prefer to modify the program - it's much preferred to have the text file which you can edit in a text editor. Just because it's possible to reverse engineer the binary and make changes by patching bytes doesn't make it count. Any programmer would much rather have the source file instead.

Similarly, the released weights of an AI model are not easy to modify, and are not the "preferred format" that the internal programmers use to make changes to the AI model. They are typically making changes to the code that does the training and to the training dataset. So for the purpose of calling an AI "open source", the training code and data used to produce the weights are considered the "preferred format", and are what needs to be released for it to really be open source. I would call "open weights" models just "self hostable" models instead of open source.

[email protected]

Seems kinda reductive about what makes it different from most other LLMs. Reading the comments, I see the issue is that the training data is why some consider it not open source, but isn’t that just trained from the other AI? It’s not why this AI is special. And the way it uses that data, afaik, is open and editable, and the license to use it is open. What’s the issue here?

[email protected]

This is exactly it, open source is not just the availability of the machine instructions, it's also the ability to recreate the machine instructions. Anything less is incomplete.

It strikes me as a variation on the "free as in beer versus free as in speech" line that gets thrown around a lot. These weights allow you to use the model for free and you are free to modify the existing weights but being unable to re-create the original means it falls short of being truly open source. It is free as in beer, but that's it.

magic_lobster_party

They published the source code needed to run the model. It’s open source in the way that anyone can download the model, run it locally, and further build on it.

Training from scratch costs millions.

[email protected]

Open source isn't really applicable to LLM models IMO.

There are open weights (the model), available training data, and other nuances.

They actually went a step further and provided a very thorough breakdown of the training process, which does mean others could similarly train models from scratch with their own training data. HuggingFace seems to be doing just that as well. https://huggingface.co/blog/open-r1

Pennomi

It’s just AI haters trying to find any way to disparage AI. They’re trying to be “holier than thou”.

The model weights are data, not code. It’s perfectly fine to call it open source even though you don’t have the means to reproduce the data from scratch. You are allowed to modify and distribute said modifications so it’s functionally free (as in freedom) anyway.

[email protected]

And looking at mobile games like Tacticus, there are loads of people with millions to burn on hobbies

[email protected]

Source - it’s about open source, not access to the database

Pennomi

No, but I do call a CC licensed png file open source even if the author didn’t share the original layered Photoshop file.

Model weights are data, not code.

[email protected]

Right. You could train it yourself too, though its scope would be limited based on capability. But that’s not necessarily a bad thing. Taking a class? Feed it your textbook or other available sources, and it can help you on that subject. Just because it’s hard doesn’t mean it’s not open.

[email protected]

A software analogy:

Someone designs a compiler and makes it open source. Makes an open runtime for it. 'Obtains' some source code with an unclear license. Compiles it with the compiler and releases the compiled byte code, which can run with the runtime on a free OS. Do you call the program open source? It is definitely more open than something that requires a proprietary, internal-use-only compiler and a closed runtime, where sometimes you can't even access the binary because it runs on their servers. It depends on perspective.

ps: the compiler takes ages and costs mils in hardware.

edit: typo

[email protected]

Thank you for taking the time to write this. Making the results reproducible and possible to improve on is important.

[email protected]

Thank you for the explanation. I didn’t know about the ‘preferred format’ definition or how AI models are changed at all.