About as open source as a binary blob without the training data
-
[email protected] replied to [email protected] last edited by
I used the word "source" a couple of times in that post. The first time was in a general sense, as an input used to generate an output: the training data is the source, the model is the "function" (in the mathematical sense, NOT the computer science sense!), and the weights are the output. The second use was "source code."
Weights can be changed just like a compiled binary can be changed. Closed source software can be modified without having access to the source code.
-
[email protected] replied to [email protected] last edited by
If the Source is Open to copying, and I won't get sued for doing it, well, then....
-
[email protected] replied to [email protected] last edited by
The LLM is a machine that, simplified down, takes two inputs: a data set and weight variables. These two inputs are not the focus of the software; as long as the structure is valid, the machine will give an output. The input is not the machine, and the machine's source code is open source. The machine IS what is revolutionary about this LLM. It's not being praised because its weights are fine-tuned; it didn't sink Nvidia's stock price by $700 billion because it has extra-special training data. It's special because of its optimizations, and its novel method of using two halves to bounce ideas back and forth and evaluate its answers. It's the methodology of its function. And that methodology is given to you openly, as source code.
-
[email protected] replied to [email protected] last edited by
Just to add, a good chunk of newer emulators require you to get a dump of the firmware externally, not just PS2 emulators. Pretty much anything from the PS2 onward is like that.
-
[email protected] replied to [email protected] last edited by
I don't know what, if any, CS background you have, but that is way off. The training dataset is used to generate the weights, i.e. the trained model. In the context of building a trained LLM, the input is the dataset and the output is the trained model, or weights.
It's more appropriate to call DeepSeek "open-weight" rather than open-source.
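The input/output relationship described here can be sketched with a toy model (all names and the tiny gradient-descent loop are illustrative, not any real LLM's training code):

```python
# Toy sketch: the dataset is the input, the trained weight is the output.
# The "model" here is just y = w * x, trained by gradient descent.

def train(dataset, epochs=100, lr=0.01):
    """Input: a dataset of (x, y) pairs. Output: the trained weight."""
    w = 0.0  # untrained weight
    for _ in range(epochs):
        for x, y in dataset:
            w -= lr * 2 * (w * x - y) * x  # gradient of (w*x - y)^2
    return w

# Data sampled from y = 2x, so training recovers w close to 2.0.
dataset = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w = train(dataset)
# Releasing only `w` ("open weights") reveals neither `dataset` nor lets
# anyone reproduce the training run -- that is the distinction being drawn.
```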
-
[email protected] replied to [email protected] last edited by
That seems kind of like pointing to reverse-engineering communities and saying that binaries are the preferred format because of how much can be done with them. Sure, you can modify finished models a lot, but what you can do with just pre-trained weights versus being able to replicate the final training, or change the training parameters, is an entirely different beast.
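As a toy illustration of that difference (hypothetical names and data, not any real project's code): with only a released weight you can fine-tune it on new data, but you cannot re-run or alter the original training, because that requires the original dataset.

```python
# One-parameter "model" y = w * x. Fine-tuning a released weight is
# possible without the training data -- reproducing the training is not.

def fine_tune(w, new_data, epochs=200, lr=0.01):
    """Nudge a released weight toward new data via gradient descent."""
    for _ in range(epochs):
        for x, y in new_data:
            w -= lr * 2 * (w * x - y) * x  # gradient of (w*x - y)^2
    return w

released_w = 2.0                               # published weight; its training data is unknown
adapted = fine_tune(released_w, [(1.0, 3.0)])  # adapt toward y = 3x
# `adapted` moves from 2.0 toward 3.0: modification without the "source".
```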
There's a reason why the OSI stipulates that the code and parameters used to train are part of the "source" that should be released in order to count as an open-source model.
You're free to disagree with me and the OSI.
-
[email protected] replied to [email protected] last edited by
... Did you not read the literal next phrase in the sentence?
since it distinctly lacks any form of executable content.
Your definition of open source specified reproducible binaries. From context it's clear that I took issue with your definition, not with the notion of reproducing data.
-
[email protected] replied to [email protected] last edited by
I haven't seen any tutorials that include the training data. As you highlight, these would make for poor tutorials. If you know where there are more complete tutorials, I'd appreciate it if you could share them.
-
Would an open-source Windows installer make Windows open-source? After all, you can replace its .dll files and modify the registry. I guess PrismLauncher also makes Minecraft open-source, since you can replace the textures there as well.
-
[email protected] replied to [email protected] last edited by
You don't have access to the source.
-
[email protected] replied to [email protected] last edited by
It's constantly referred to as "open source".
-
[email protected] replied to [email protected] last edited by
Creative Commons and the MIT license are distinct, though.
-
[email protected] replied to [email protected] last edited by
Yeah, but it isn't.
-
[email protected] replied to [email protected] last edited by
Open weights
-
[email protected] replied to [email protected] last edited by
what a weird hill to die on
-
[email protected] replied to [email protected] last edited by
Great, so we agree. ᕕ(ᐛ)ᕗ
-
Just wanted to thank you both for this discourse! As somebody who's interested in AI but totally ignorant to how the hell it works, I found this conversation very helpful! I would say you both have good points. Happy days to you both!
-
[email protected] replied to KillingTimeItself last edited by
But it is factually inaccurate. We don't call binaries open-source, we don't even call visible-source open-source. An AI model is an artifact just like a binary is.
-
[email protected] replied to [email protected] last edited by
I don't understand your objections. Even if the amount of data is rather big, that doesn't change the fact that this data is part of the source, and leaving it out makes the whole project non-open-source.
Under that standard of scrutiny not only could there never be an LLM that would qualify, but projects that are considered open source would not be. Thus making the distinction meaningless.
What? No? Open-source projects literally do meet this standard.
-
[email protected] replied to [email protected] last edited by
On the contrary. What they open-sourced was just a small part of the project. What they did not open-source is what makes the AI tick. Having less than one percent of a project open-sourced does not make it an "Open Source" project.