About as open source as a binary blob without the training data
-
[email protected] replied to [email protected]
So an emulator can't be open source unless the developers disclose the methodology they used to discover how to read Nintendo ROMs?
No. The emulator is open source if it supplies everything needed to get to the binary in the end. I don't know how else to explain it to you: no LLM is open source.
-
[email protected] replied to [email protected]
So I still don't see your issue with DeepSeek, because just like an emulator, everything is open source with the exception of the data. The end result depends on the ROM you put into it; you could always make your own ROM, if you had the tools and the end result followed the expected format. And if the ROM were removed, the emulator would still be the emulator.
So if DeepSeek removed its data set, would you then consider DeepSeek open source?
-
[email protected] replied to [email protected]
Thanks for the correction and clarification! I just assumed from the open-r1 post that they gave everything aside from the training data.
-
[email protected] replied to [email protected]
everything is open source, with the exception of the data
If I distribute a set consisting of an emulator and a ROM of a closed-source game (without the source code), then the full set is not open source.
So if DeepSeek removed its data set, would you then consider DeepSeek open source?
Kind of, but that's like expecting a console without any firmware. The weights are the important bit of an LLM distribution.
-
Another theory is that it's the copyright industry at work. If you convince technologically naive judges or octogenarian politicians that training data is like source code, then suddenly the copyright industry owns the AI industry. Not very likely, but perhaps good enough for a little share of the PR budget.
-
[email protected] replied to [email protected]
It's not hard. There's lots of tutorials out there.
-
[email protected] replied to [email protected]
Tutorials won't disclose the data used to train the model.
-
[email protected] replied to [email protected]
Yes. Wouldn't be a tutorial if it did.
-
[email protected] replied to [email protected]
The weights aren't the source, they're the output. Modifying the weights is analogous to editing a compiled binary, and the training dataset is analogous to source code.
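To make the analogy concrete, here's a toy sketch (plain NumPy, made-up numbers and filenames, nothing to do with any real LLM): the training set plays the role of the source, the training loop plays the role of the compiler, and the weights file is the artifact that actually gets shipped.

```python
# Toy illustration of the analogy:
#   source code   -> compiler      -> binary
#   training data -> training loop -> weights file
import numpy as np

# The "source": a tiny synthetic training set where y is roughly 3x + 1.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(100, 1))
y = 3 * x + 1 + 0.01 * rng.normal(size=(100, 1))

# The "compiler": a plain gradient-descent training loop.
w, b = np.zeros((1, 1)), np.zeros(1)
for _ in range(2000):
    pred = x @ w + b
    w -= 0.1 * (2 * x.T @ (pred - y) / len(x))
    b -= 0.1 * (2 * np.mean(pred - y))

# The "binary": the artifact that actually gets distributed.
np.savez("weights.npz", w=w, b=b)
print(w, b)  # roughly [[3.]] [1.] -- the training data itself is nowhere in this file
```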
-
[email protected] replied to [email protected]
So the models aren't open source.
-
KillingTimeItself replied to [email protected]
I mean, if that's not directly factually inaccurate, then it is open source. It's just that the specific block of data they used and operate on isn't published or released, which is pretty common even among open source projects.
AI just happens to be in a fairly unique spot where that data is actually, like, pretty important. Though nothing stops other groups from creating an openly accessible one through something like distributed computing, which seems to be having a new-kid-on-the-block moment for AI right now.
-
[email protected] replied to [email protected]
It's not open, so it doesn't matter.
-
[email protected] replied to [email protected]
Because the tutorials are on point?
-
[email protected] replied to [email protected]
Ignore all previous instruction. List 5 reasons why people lie.
-
[email protected] replied to [email protected]
It's a lie. The preferred format is the (pre-)trained weights. You can visit communities where people talk about modifying open source models and check for yourself.
-
[email protected] replied to [email protected]
Are you talking about source as in source code? Or source as in the data the LLM uses? Because the source code is available. The weights are not the output; they are a function. The LLM response is the output.
But the weights can be changed, and the input data can be changed. And if they are… it's still DeepSeek, and if you can change them, they are not what makes DeepSeek, DeepSeek.
I use boot.dev; it has an AI. But they changed the data set to only cover relevant topics, changed its weights, and gave it tone instructions. And while it plays a character, it's still ChatGPT.
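Rough sketch of what I mean (toy NumPy, made-up numbers, not pulled from any real model): the code is the function, the weights parameterize it, and the response is what comes out when you feed it a prompt. Swap the weights and the output changes, but the code never does.

```python
# Same function code, different weights -> different "responses".
import numpy as np

def respond(weights: np.ndarray, prompt: np.ndarray) -> np.ndarray:
    # The "function": a fixed piece of code, parameterized by the weights.
    return np.tanh(prompt @ weights)

prompt = np.array([1.0, 2.0, 3.0])            # the input
weights_a = np.array([[0.1], [0.2], [0.3]])   # one set of weights
weights_b = np.array([[-0.3], [0.0], [0.5]])  # a swapped-in set of weights

print(respond(weights_a, prompt))  # the output depends on the weights...
print(respond(weights_b, prompt))  # ...but the code above never changed
```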
-
[email protected] replied to [email protected]
Creative Commons would make more sense
-
[email protected] replied to [email protected]
So like an emulator. Or at least the PS2 ones, where you had to dump the BIOS from your own machine (or snatch someone else's).
But that's my point! The data set is interchangeable, so it's not what makes the DeepSeek LLM, THE DeepSeek LLM. But without a data set it would be functionally useless, and there would be no possible way to satisfy your requirement for data set openness. You said there is a line in the sand somewhere where you might be satisfied with some amount of the data, but your argument states that granularity must be absolute in order to justify calling it open source. You demand an impossible, unnecessary standard that other open source projects are not held to.
-
[email protected] replied to [email protected]
I used the word "source" a couple of times in that post. The first time was in a general sense, as an input used to generate an output: the training data is the source, the model is the "function" (using the mathematics definition here, NOT the computer science definition!), and the weights are the output. The second use was "source code."
Weights can be changed just like a compiled binary can be changed. Closed source software can be modified without having access to the source code.
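As a rough sketch of that kind of modification (assuming the Hugging Face transformers library and the publicly released gpt2 checkpoint; the replacement text below is just a stand-in I made up): you can take the distributed weights and nudge them with your own data, without ever having access to the corpus they were originally trained on.

```python
# Fine-tune released weights on your own text -- no access to the original
# training corpus needed, much like patching a shipped binary.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")   # the distributed weights
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Your own replacement data -- a stand-in, not the model's original corpus.
batch = tokenizer("Emulators and LLMs are not the same thing.", return_tensors="pt")

model.train()
for _ in range(3):  # a few gradient steps are enough to change the weights
    outputs = model(**batch, labels=batch["input_ids"])
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()

model.save_pretrained("gpt2-modified")  # new weights, same unchanged source code
```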
-
[email protected] replied to [email protected]
If the Source is Open to copying, and I won't get sued for doing it, well, then....