About as open source as a binary blob without the training data

[email protected]

It's still not open source, no matter how extendable the weights are.

[email protected]

I mean, this does not help me understand.

Pennomi

Fair enough, it’s not source code, so open source doesn’t apply.

[email protected]

it’s the entirety of the bulk unfiltered data you want

Or more realistically: a description of how you could source the data.

doesn’t touch on at all how this LLM is different from other LLMs?

Correct. Llama isn't open source, either.

like saying that an open source game emulator can’t be open source because Nintendo games are encapsulated

Not at all. It's like claiming an emulator is open source because it has a plugin system, while it needs a closed source build dependency that the developer doesn't disclose to the public.

[email protected]

A closer analogy would be only providing the binary output of the emulator build and calling it open source. If you can't reproduce building the output from what they provide in what way is it reproducible? The model is the output, the training data and algorithm to build the model based on the training data are the input.
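To make the input/output split concrete, here's a toy sketch. Everything in it is made up for illustration (`train`, `fingerprint`, the three-point data set); it's not how DeepSeek or any real LLM is built, and real training runs are far bigger and often not bit-for-bit deterministic. It only shows the shape of the argument: data plus algorithm go in, weights come out.

```python
import hashlib
import json


def train(dataset):
    """Toy 'training' (hypothetical): closed-form least-squares slope for y = w * x."""
    num = sum(x * y for x, y in dataset)
    den = sum(x * x for x, _ in dataset)
    return {"w": num / den}


def fingerprint(weights):
    """Hash the released artifact so two builds can be compared."""
    blob = json.dumps(weights, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()


# Inputs: the data set and the training algorithm.
published_data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

# Output: the weights that get released.
released_weights = train(published_data)

# With data and algorithm in hand, anyone can rebuild and verify the release.
rebuilt = train(published_data)
reproducible = fingerprint(rebuilt) == fingerprint(released_weights)

# With only the weights, you can still edit them (akin to patching a binary),
# but you have nothing to rebuild them from.
patched = dict(released_weights, w=released_weights["w"] * 1.1)
```

If `published_data` is withheld, the `rebuilt` step is simply not available to a third party, and that's the part I'd call the missing source.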

[email protected]

Would it? Not sure how that would be a better analogy. The argument is that it’s nearly all open… but it still does not count because the data set before it’s manipulated by the LLM (in my analogy the data set the emulator is using would be a Nintendo ROM) is not open. A data set that if provided would be so massive, it would render the point of tokenization pointless and be completely unusable by literally ANYONE without multiple data centers redlining for WEEKS. Under that standard of scrutiny not only could there never be an LLM that would qualify, but projects that are considered open source would not be. Thus making the distinction meaningless.

An emulator without a ROM mounted is still an emulator, even if not usable.

[email protected]

Source build dependency… so you don’t have a problem with the LLM at all! You have a problem with the data collection process or the pre-training! So an emulator can’t be open source if the developers never disclosed how they figured out how to read Nintendo ROMs? Or which games were dissected in order to reverse engineer that info? I don’t consider that a prerequisite to say an emulator is open.

So if I, say, removed the data set from DeepSeek, would you consider what remains open source?

[email protected]

So an emulator can’t be open source if the developers never disclosed how they figured out how to read Nintendo ROMs?

No. The emulator is open source if it supplies a way to build the binary in the end. I don't know how else to explain it to you: No LLM is open source.

[email protected]

So I still don’t see your issue with DeepSeek, because just like an emulator, everything is open source, with the exception of the data. The end result is dependent on the ROM put into it; you can always make your own ROM if you had the tools and the end result followed the expected format. And if the ROM was removed, the emulator is still the emulator.

So if DeepSeek removed its data set, would you then consider DeepSeek open source?

[email protected]

Thanks for the correction and clarification! I just assumed from the open-r1 post that they gave everything aside from the training data.

[email protected]

everything is open source, with the exception of the data

If I distribute a set consisting of an emulator and a ROM of a closed source game (without the source code), then the full set is not open source.

So if DeepSeek removed its data set, would you then consider DeepSeek open source?

Kind of, but that's like expecting a console without any firmware. The weights are the important bit of an LLM distribution.

[email protected]

Another theory is that it's the copyright industry at work. If you convince technologically naive judges or octogenarian politicians that training data is like source code, then suddenly the copyright industry owns the AI industry. Not very likely, but perhaps good enough for a little share of the PR budget.

[email protected]

It's not hard. There's lots of tutorials out there.

[email protected]

Tutorials won't disclose the data used to train the model.

[email protected]

Yes. Wouldn't be a tutorial if it did.

[email protected]

The weights aren't the source, they're the output. Modifying the weights is analogous to editing a compiled binary, and the training dataset is analogous to source code.

[email protected]

So the models aren't open source.

KillingTimeItself

I mean, if it's not directly factually inaccurate, then it is open source. It's just that the specific block of data they used and operate on isn't published or released, which is pretty common even among open source projects.

AI just happens to be in a fairly unique spot where that thing is actually like, pretty important. Though nothing stops other groups from creating an openly accessible one through something like distributed computing. Which seems to be a fancy new kid on the block moment for AI right now.

[email protected]

It's not open, so it doesn't matter.