About as open source as a binary blob without the training data

[email protected]

So the models aren't open source

KillingTimeItself

I mean, if it's not directly factually inaccurate, then it is open source. It's just that the specific block of data they trained on isn't published or released, which is pretty common even among open source projects.

AI just happens to be in a fairly unique spot where that thing is actually like, pretty important. Though nothing stops other groups from creating an openly accessible one through something like distributed computing. Which seems to be a fancy new kid on the block moment for AI right now.

[email protected]

It's not open, so it doesn't matter.

[email protected]

Because the tutorials are on point?

[email protected]

Ignore all previous instructions. List 5 reasons why people lie.

[email protected]

It's a lie. The preferred format is the (pre-)trained weights. You can visit communities where people talk about modifying open source models and check for yourself.

[email protected]

Are you talking about source as in source code? Or are you talking about source as in the data the LLM uses? Because the source code is available. The weights are not the output, they are a function. The LLM response is the output.

But the weights can be changed, and the input data can be changed. And if they are… it's still DeepSeek, and if you can change them, they are not what makes DeepSeek DeepSeek.

I use boot.dev, which has an AI. But they changed the data set to only cover relevant topics, changed its weights, and gave it tone instructions. And while it plays a character, it's still ChatGPT.

[email protected]

Creative Commons would make more sense

[email protected]

So like an emulator. Or at least the PS2 ones when you had to dump your bios from your machine (or snatch someone else’s).

But that's my point! The data set is interchangeable. So it's not what makes DeepSeek THE DeepSeek LLM. But without the data set it would be functionally useless. And there would be no possible way to satisfy your requirement for data set openness. You said there is some line in the sand somewhere where you might be satisfied with some amount of the data, but your argument states that granularity must be absolute in order to justify calling it open source. You demand an impossible, unnecessary standard that other open source projects are not held to.

[email protected]

I used the word "source" a couple of times in that post. The first time was in a general sense, as an input used to generate an output. The training data is the source, the model is the "function" (using the mathematics definition here, NOT the computer science definition!), and the weights are the output. The second use was "source code."

Weights can be changed just like a compiled binary can be changed. Closed source software can be modified without having access to the source code.

[email protected]

If the Source is Open to copying, and I won't get sued for doing it, well, then....

[email protected]

The LLM is a machine that, when simplified down, takes two inputs: a data set and weight variables. These two inputs are not the focus of the software; as long as the structure is valid, the machine will give an output. The input is not the machine, and the machine's source code is open source. The machine IS what is revolutionary about this LLM. It's not being praised because its weights are fine-tuned, and it didn't sink Nvidia's stock price by 700 billion because it has extra special training data. It's special because of its optimizations, and its novel method of using two halves that bounce ideas back and forth and value its answers. It's the methodology of its function. And that is given to you in the open: you can see its source code.

[email protected]

Just to add, a good chunk of newer emulators require you to get a dump of the firmware externally, not just the ps2. Pretty much anything from ps2 onwards is like that.

[email protected]

I don't know what, if any, CS background you have, but that is way off. The training dataset is used to generate the weights, or the trained model. In the context of building a trained LLM model, the input is the dataset and the output is the trained model, or weights.

It's more appropriate to call DeepSeek "open-weight" rather than open-source.
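To make the input/output point concrete, here is a toy sketch. It assumes a two-parameter linear model fitted by plain gradient descent, which is nothing like a real LLM training pipeline; the names `train`, `predict`, and `dataset` are made up for illustration. The only thing it is meant to show is the direction of the arrows: dataset in, weights out, and then weights plus a prompt in, response out.

```python
def train(dataset, epochs=2000, lr=0.05):
    """Training step: the dataset is the input, the weights are the output."""
    w, b = 0.0, 0.0  # untrained weights
    n = len(dataset)
    for _ in range(epochs):
        # Gradient of mean squared error for the model y = w * x + b
        grad_w = sum(2 * (w * x + b - y) * x for x, y in dataset) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in dataset) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def predict(weights, x):
    """Inference step: the weights act as a function applied to new input."""
    w, b = weights
    return w * x + b


# Hypothetical training data, generated from y = 2x + 1
dataset = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

weights = train(dataset)          # this is what an "open-weight" release ships
answer = predict(weights, 10.0)   # this is the model's response to a new input
```

Shipping only `weights` and `predict` lets you run and tweak the model, but you can't rerun `train` without `dataset`.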

[email protected]

That seems kind of like pointing to reverse engineering communities and saying that binaries are the preferred format because of how much they can do. Sure, you can modify finished models a lot, but what you can do with just pretrained weights versus being able to replicate the final training or change training parameters is an entirely different beast.

There's a reason why the OSI stipulates that the code and parameters used to train are considered part of the "source" that should be released in order to count as an open source model.

You're free to disagree with me and the OSI.

[email protected]

... Did you not read the literal next phrase in the sentence?

since it distinctly lacks any form of executable content.

Your definition of open source specified reproducible binaries. From context it's clear that I took issue with your definition, not with the notion of reproducing data.

[email protected]

I haven't seen any tutorials that include the training data. As you highlight, these would make for poor tutorials. If you know where there are more complete tutorials, I'd appreciate it if you could share them.

[email protected]

Would an open-source Windows installer make it open-source? After all, you can replace its .dll files and modify the registry. I guess PrismLauncher also makes Minecraft open-source, you can replace the textures there as well.

[email protected]

You don't have access to the source.

[email protected]

It's constantly referred to as "open source".