About as open source as a binary blob without the training data
-
[email protected]replied to [email protected] last edited by
I don't know what, if any, CS background you have, but that is way off. The training dataset is used to generate the weights, or the trained model. In the context of building a trained LLM model, the input is the dataset and the output is the trained model, or weights.
It's more appropriate to call DeepSeek "open-weight" rather than open-source.
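To make the dataset-in, weights-out relationship concrete, here is a toy sketch. It is a two-parameter linear fit, not an LLM and not DeepSeek's actual pipeline; `train`, `predict`, and both datasets are made up for illustration. The same training code produces different weights from different data, and having only the weights lets you run the model but not redo the training.

```python
import random


def train(dataset, epochs=500, lr=0.05, seed=0):
    """Fit y ~ w*x + b by stochastic gradient descent and return the weights.

    Hypothetical toy trainer: the dataset is the input, (w, b) is the output.
    """
    rng = random.Random(seed)  # fixed seed so a rerun gives identical weights
    w, b = rng.uniform(-1, 1), rng.uniform(-1, 1)
    for _ in range(epochs):
        for x, y in dataset:
            err = (w * x + b) - y
            w -= lr * err * x
            b -= lr * err
    return w, b


def predict(weights, x):
    """Inference needs only the weights, not the data they came from."""
    w, b = weights
    return w * x + b


# Two made-up datasets fed through the exact same training code.
xs = (0.0, 0.5, 1.0, 1.5, 2.0)
dataset_a = [(x, 2 * x + 1) for x in xs]
dataset_b = [(x, -3 * x + 4) for x in xs]

weights_a = train(dataset_a)
weights_b = train(dataset_b)

print("weights from dataset A:", weights_a)
print("weights from dataset B:", weights_b)
```

With `weights_a` alone you can call `predict`, but you cannot rerun `train` or check how the weights were produced unless you also have `dataset_a` and the training settings.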
-
[email protected]replied to [email protected] last edited by
That seems kind of like pointing to reverse engineering communities and saying that binaries are the preferred format because of how much they can do. Sure, you can modify finished models a lot, but what you can do with just pre-trained weights versus being able to replicate the final training or change the training parameters is an entirely different beast.
There's a reason why the OSI stipulates that the code and parameters used to train are considered part of the "source" that should be released in order to count as an open source model.
You're free to disagree with me and the OSI.
-
[email protected]replied to [email protected] last edited by
... Did you not read the literal next phrase in the sentence?
since it distinctly lacks any form of executable content.
Your definition of open source specified reproducible binaries. From context it's clear that I took issue with your definition, not with the notion of reproducing data.
-
[email protected]replied to [email protected] last edited by
I haven't seen any tutorials that include the training data. As you highlight, these would make for poor tutorials. If you know where there are more complete tutorials, I'd appreciate it if you could share them.
-
Would an open-source Windows installer make it open-source? After all, you can replace its .dll files and modify the registry. I guess PrismLauncher also makes Minecraft open-source, you can replace the textures there as well.
-
[email protected]replied to [email protected] last edited by
You don't have access to the source.
-
[email protected]replied to [email protected] last edited by
It's constantly referred to as "open source".
-
[email protected]replied to [email protected] last edited by
Creative Commons and the MIT licence are distinct, though.
-
[email protected]replied to [email protected] last edited by
Yeah, but it isn't.
-
[email protected]replied to [email protected] last edited by
Open weights
-
[email protected]replied to [email protected] last edited by
what a weird hill to die on
-
[email protected]replied to [email protected] last edited by
Great, so we agree. ᕕ(ᐛ)ᕗ
-
Just wanted to thank you both for this discourse! As somebody who's interested in AI but totally ignorant to how the hell it works, I found this conversation very helpful! I would say you both have good points. Happy days to you both!
-
[email protected]replied to KillingTimeItself last edited by
But it is factually inaccurate. We don't call binaries open-source, we don't even call visible-source open-source. An AI model is an artifact just like a binary is.
-
[email protected]replied to [email protected] last edited by
I don't understand your objections. Even if the amount of data is rather big, it doesn't change that this data is part of the source, and leaving it out makes the whole project non-open-source.
Under that standard of scrutiny not only could there never be an LLM that would qualify, but projects that are considered open source would not be. Thus making the distinction meaningless.
What? No? Open-source projects literally do meet this standard.
-
[email protected]replied to [email protected] last edited by
On the contrary. What they open sourced was just a small part of the project. What they did not open source is what makes the AI tick. Having less than one percent of a project open sourced does not make it an "Open Source" project.
-
[email protected]replied to KillingTimeItself last edited by
That "specific block of data" is more than 99% of such a project. Hardly insignificant.
-
Fushuan [he/him]replied to [email protected] last edited by
The engine is open source, the model is not.
The emulator is open source, the games it can run are not.
I don't see how it's so hard to understand.
They are saying that the model the engine is running is open source because they released the model. That's like saying that a game is open source because I released an emulator and the executable file. It's just not true.
-
Fushuan [he/him]replied to [email protected] last edited by
What most people understand as DeepSeek is the app that uses their trained model, not the running or training engines.
This post mentions open source, not open source code, which is a big distinction. The source of a trained model is in part the training engine and, in much larger part, the input data. We only have access to a fraction of that "source". So the service isn't open source.
Just to make clear, no LLM service is open source currently.
-
Fushuan [he/him]replied to KillingTimeItself last edited by
The running engine and the training engine are open source. The service that uses the model trained with the open source engine and runs it with the open source runner is not, because a biiiig big part of what makes AI work is the trained model, and a big part of the source of a trained model is training data.
When they say open source, 99.99% of the people will understand that everything is verifiable, and it just is not. This is misleading.
As others have stated, a big part of open source development is providing everything so that other users can get the exact same results. This has always been the case in open source ML development: people do provide links to their training data for reproducibility. This has been the case with most of the papers on natural language processing (the overarching branch that LLMs belong to) I have read in the past. Both code and training data are provided.