Gonna write a blog entry on the #OpenSourceAI definition.
-
Miroslav Suchý replied to Jan Wildeboer 😷:
@jwildeboer I do not want to train it on the same data. I want to slightly modify the input data.
-
Jan Wildeboer 😷 replied to Miroslav Suchý:
@mirek ... and train the same or a different model with that modified data. Which will need a lot of resources. That approach will not be the majority use case, IMHO. We created InstructLab for a reason.
-
Íñigo Huguet replied to Jan Wildeboer 😷:
@jwildeboer Hmm, let's see the blog post, but that argument sounds a bit like "why do you want the sources if you have the binary already compiled".
I know it's not exactly the same, because with an AI model you can modify it with the model alone.
But what about other aspects, like being able to inspect/audit how it has been trained? Or being able to reproduce the same training, to check that the model corresponds to the training it is supposed to have had?
-
Jan Wildeboer 😷 replied to Íñigo Huguet:
@ihuguet Yeah, it's complicated. Do you want to see all my failed code that led to the patch I sent upstream as a pull request? Are my private branches part of the complete source code? Where is the pragmatic middle ground? We need to discuss that in the AI training data context, IMHO.
-
Íñigo Huguet replied to Jan Wildeboer 😷:
@jwildeboer But that isn't a fair example either. Failed patches are not expected to be published, but the final one is. Private branches aren't expected to be published either, and the main one(s) are not strictly required, just nice to have.
The equivalent here would be doing some trial and error with the training data and, at the end, only publishing the good version.
Not publishing the training data, however, is more like publishing the Linux kernel as a blob plus the minimum needed to build third-party modules (kind of).
-
Jan Wildeboer 😷 replied to Íñigo Huguet:
@ihuguet Yep. A discussion we need to have, as I said, but one that hasn't really been had yet.
-
tokudan replied to Jan Wildeboer 😷:
@jwildeboer My understanding of these AI models is pretty basic, but to me the model acts like a binary blob.
Can you actually understand the contents of the model just by looking at it, without some "decompilation" of the model (if that's even possible), the way an image viewer or a similar program would let you? If not, that fits my description of source-not-available and thus disqualifies it from being open source.
-
Jan Wildeboer 😷 replied to tokudan:
@tokudan The only thing you can do with the training data is train a new model with (hopefully) the same parameters so the results come out identical, which proves that the model reflects the input. At the cost of another full training run, which needs a lot of resources.
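[Editor's note] The retraining argument can be illustrated with a toy sketch: when the RNG seed, data, and hyperparameters are fixed, two training runs produce bit-identical weights, which is what lets an auditor check a published model against its training data. Everything here (the `train` function, the data, the hyperparameters) is hypothetical; real-scale runs additionally depend on hardware, parallelism, and library determinism.

```python
import random

def train(data, seed=42, epochs=100, lr=0.01):
    """Toy 'training': fit y = w*x by SGD. Stand-in for a real training run."""
    rng = random.Random(seed)      # fixed seed -> same init, same shuffle order
    w = rng.uniform(-1, 1)         # random initial weight
    for _ in range(epochs):
        samples = data[:]
        rng.shuffle(samples)       # deterministic shuffle, thanks to the seed
        for x, y in samples:
            grad = 2 * (w * x - y) * x
            w -= lr * grad
    return w

data = [(x, 3.0 * x) for x in range(1, 6)]  # ground truth: w = 3
run_a = train(data)
run_b = train(data)                # retraining with the same seed and data
assert run_a == run_b              # bit-identical: the model reflects the input
```

With a different seed or altered data, `run_b` would diverge from `run_a`, which is exactly the signal a reproducibility audit looks for.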
-
Leszek replied to Jan Wildeboer 😷:
@jwildeboer The training data is not there for recreating the model. The training data is there so you have access and can validate that all the ingested data is indeed CC0 or under some other license, and thus that derivative work produced with the model, when it adheres to that other license, fulfills the requirements of the training data's license.
You don't have to move both together until you fork it with the intent of distributing it.
[1/n]
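[Editor's note] Leszek's validation point can be made concrete with a minimal sketch: given a manifest of training items and their declared licenses, an auditor only needs the data list, not a retraining run, to flag entries that would taint redistribution. The allowlist, file names, and license strings below are all made up for illustration.

```python
# Hypothetical allowlist of licenses acceptable for the training set.
APPROVED = {"CC0-1.0", "CC-BY-4.0", "MIT"}

# Hypothetical manifest: (item, declared license) pairs.
manifest = [
    ("corpus/wiki_dump.txt", "CC-BY-4.0"),
    ("corpus/pd_books.txt", "CC0-1.0"),
    ("corpus/scraped_site.txt", "proprietary"),
]

def audit(items):
    """Return the entries whose declared license is not in the allowlist."""
    return [(path, lic) for path, lic in items if lic not in APPROVED]

violations = audit(manifest)
# -> [('corpus/scraped_site.txt', 'proprietary')]
```

A real audit would also have to verify that each declared license is accurate, which is the part that genuinely requires access to the data itself.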
-
Leszek replied to Jan Wildeboer 😷:
@jwildeboer Re: "data hoarders will collect whatever training data"
Wait, are you considering "OS models" made with non-public data? If the data isn't in the public domain or under a permissive license, how do you want to legally distribute a derivative-work generator that carries traces of that training data under such a license?
-
Jan Wildeboer 😷 replied to Leszek:
@makdaam You are putting words in my mouth that I never said. But thanks for the input. I consider the “a model can only be open source when all training data is open source” position to be a dogmatic approach that I don’t subscribe to.