About as open source as a binary blob without the training data

[email protected]

I don't know what, if any, CS background you have, but that is way off. The training dataset is used to generate the weights, or the trained model. In the context of building a trained LLM model, the input is the dataset and the output is the trained model, or weights.

It's more appropriate to call DeepSeek "open-weight" rather than open-source.

[email protected]

That seems kind of like pointing to reverse engineering communities and saying that binaries are the preferred format because of how much they can do. Sure, you can modify finished models a lot, but what you can do with just pretrained weights vs. being able to replicate the final training or change training parameters is just an entirely different beast.

There's a reason why the OSI stipulates that the code and parameters used to train are considered part of the "source" that should be released in order to count as an open source model.

You're free to disagree with me and the OSI.

[email protected]

... Did you not read the literal next phrase in the sentence?

since it distinctly lacks any form of executable content.

Your definition of open source specified reproducible binaries. From context it's clear that I took issue with your definition, not with the notion of reproducing data.

[email protected]

I haven't seen any tutorials that include the training data. As you highlight, these would make for poor tutorials. If you know where there are more complete tutorials, I'd appreciate it if you could share them.

[email protected]

Would an open-source Windows installer make it open-source? After all, you can replace its .dll files and modify the registry. I guess PrismLauncher also makes Minecraft open-source, you can replace the textures there as well.

[email protected]

You don't have access to the source.

[email protected]

It's constantly referred to as "open source".

[email protected]

Creative commons and MIT licence are distinct, though.

[email protected]

Yeah - but it isn't

[email protected]

Open weights

[email protected]

what a weird hill to die on

[email protected]

Great, so we agree. ᕕ(ᐛ)ᕗ

xttweaponttx

Just wanted to thank you both for this discourse! As somebody who's interested in AI but totally ignorant to how the hell it works, I found this conversation very helpful! I would say you both have good points. Happy days to you both!

[email protected]

But it is factually inaccurate. We don't call binaries open-source, we don't even call visible-source open-source. An AI model is an artifact just like a binary is.

[email protected]

I don't understand your objections. Even if the amount of data is rather big, it doesn't change that this data is part of the source, and leaving it out makes the whole project non-open-source.

Under that standard of scrutiny not only could there never be an LLM that would qualify, but projects that are considered open source would not be. Thus making the distinction meaningless.

What? No? Open-source projects literally do meet this standard.

[email protected]

On the contrary. What they open sourced was just a small part of the project. What they did not open source is what makes the AI tick. Having less than one percent of a project open sourced does not make it an "Open Source" project.

[email protected]

That "specific block of data" is more than 99% of such a project. Hardly insignificant.

Fushuan [he/him]

The engine is open source, the model is not.

The emulator is open source, the games it can run are not.

I don't see how it's so hard to understand.

They are saying that the model that the engine is running is open source because they released the model. That's like saying that a game is open source because I released an emulator and the executable file. It's just not true.

Fushuan [he/him]

What most people understand as DeepSeek is the app that uses their trained model, not the running or training engines.

This post mentions open source, not open source code, big distinction. The source of a trained model is partly the training engine and, to a much larger extent, the input data. We only have access to a fraction of that "source". So the service isn't open source.

Just to make clear, no LLM service is open source currently.

Fushuan [he/him]

The running engine and the training engine are open source. The service that uses the model trained with the open source engine and runs it with the open source runner is not, because a biiiig big part of what makes AI work is the trained model, and a big part of the source of a trained model is training data.

When they say open source, 99.99% of the people will understand that everything is verifiable, and it just is not. This is misleading.

As others have stated, a big part of open source development is providing everything so that other users can get the exact same results. This has always been the case in open source ML development: people do provide links to their training data for reproducibility. This has been the case with most of the papers on natural language processing (the overarching field that LLMs belong to) I have read in the past. Both code and training data are provided.
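
To make the reproducibility point concrete, here's a rough toy sketch in plain Python. It is not how DeepSeek or any real LLM is trained: the tiny linear model and the names `make_dataset`, `train` and `predict` are made up for illustration, and it ignores things like GPU nondeterminism. It only shows the general idea that data + training code + seed lets you rebuild the weights, while the weights alone only let you run the model.

```python
import random

def make_dataset(seed, n=50):
    # Stand-in for "the training data": points on the line y = 3x + 0.5.
    rng = random.Random(seed)
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    return [(x, 3.0 * x + 0.5) for x in xs]

def train(dataset, seed, epochs=200, lr=0.1):
    # Stand-in for "the training engine": plain SGD on a one-feature
    # linear model. The seed fixes the random initial weights.
    rng = random.Random(seed)
    w, b = rng.uniform(-1, 1), rng.uniform(-1, 1)
    for _ in range(epochs):
        for x, y in dataset:
            err = (w * x + b) - y
            w -= lr * err * x
            b -= lr * err
    return w, b

def predict(weights, x):
    # Stand-in for "the running engine": needs only the weights.
    w, b = weights
    return w * x + b

data = make_dataset(seed=0)

# What a publisher might release as "open weights".
released_weights = train(data, seed=42)

# With data + code + seed, anyone can rebuild and verify them.
rebuilt_weights = train(data, seed=42)
print(rebuilt_weights == released_weights)  # True

# With weights only, you can still run the model, but `data` is out of reach.
print(predict(released_weights, 2.0))
```

If only `released_weights` and `predict` were published, the last line would still work, but nobody could rerun `train` to check where those numbers came from. That's the gap people mean when they say "open weights" instead of "open source".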