2008, me: I love the idea of cryptocurrency
BITCOIN: The word "cryptocurrency" now means "financial scams based on inefficient write-only ledgers"
2018, me: I love the idea of the metaverse
FACEBOOK: The word "metaverse" now means "proprietary 3D chat pro...
-
@pinkdrunkenelephants @datarama Because humans also are the ones who interpret and enforce laws and if the government does not enforce copyright against companies which market their products as "AI", then copyright does not apply to those companies.
-
datarama replied to pinkdrunkenelephants
@pinkdrunkenelephants @mcc In the EU, there actually is some legislation. Copyright explicitly *doesn't* protect works from being used in machine learning for academic research, but ML training for commercial products must respect a "machine-readable opt-out".
But that's easy enough to get around. That's why, e.g., Stability funded an "independent research lab" that did the actual data gathering for them.
-
@datarama I consider this illegitimate and fundamentally unfair because I have already released large amounts of work under creative commons/open source licenses. I can't retroactively add terms to some of them because the plain language somehow no longer applies. If I add such opt-outs now, it would be like I'm admitting the licenses previously didn't apply to statistics-based derivative works
-
pinkdrunkenelephants replied to datarama
-
@mcc I consider it illegitimate and fundamentally unfair because it's opt-out.
-
Did you see this? The whole thing with "the stack".
https://post.lurk.org/@emenel/112111014479288871
Some jerks did mass scraping of open source projects, putting them in a collection called "the stack" which they specifically recommend other people use as machine learning sources. If you look at their "GitHub opt-out repository" you'll find just page after page of people asking to have their stuff removed:
https://github.com/bigcode-project/opt-out-v2/issues
(1/2)
-
datarama replied to pinkdrunkenelephants
@pinkdrunkenelephants @mcc I think if there was a simple clear-cut answer to that, the world would be a *very* different place.
-
…but wait! If you look at what they actually did (correct me if I'm wrong), they aren't actually doing any machine learning in the "stack" repo itself. The "stack" just collects zillions of repos in one place. Mirroring my content as part of a corpus of open source software, torrenting it, putting it on microfilm in a seedbank is the kind of thing I want to encourage. The problem becomes that they then *suggest* people create derivative works of those repos in contravention of the license. (2/2)
-
So… what is happening here? All these people are opting out of having their content recorded as part of a corpus of open source code. And I'll probably do the same, because "The Stack" is falsely implying people have permission to use it for ML training. But this means "The Stack" has put a knife in the heart of publicly archiving open source code at all. Future attempts to preserve OSS code will, if they base themselves on "the stack", not have any of those opted-out repositories to draw from.
-
@mcc That's also basically how LAION made the dataset for Stable Diffusion. They collected a bunch of links to images with descriptive alt-text.
(Are you taking time to write good alt-text because you respect disabled people? Congratulations, your good work is being exploited by the worst assholes in tech. Silicon Valley never lets a good deed go unpunished.)
-
Like, heck, how am I *supposed* to rely on my code getting preserved after I lose interest, I die, BitBucket deletes every bit of Mercurial-hosted content it ever hosted, etc? Am I supposed to rely on *Microsoft* to responsibly preserve my work? Holy crud no.
We *want* people to want their code widely mirrored and distributed. That was the reason for the licenses. That was the social contract. But if machine learning means the social contract is dead, why would people want their code mirrored?
-
Graham Sutherland / Polynomial replied to mcc
@mcc I have generally come to the conclusion that this is an intended effect. All the things you feel compelled to do for the good of others, in an ordinarily altruistic sense, are essentially made impossible unless you accept that your works and your expressions will be repackaged, sold, and absorbed into commercialised datasets.
The SoaD line "manufacturing consent is the name of the game" has been in my head a lot lately.
-
Mark T. Tomczak replied to Graham Sutherland / Polynomial
@gsuberland @mcc One almost wonders if the end-game is to stop pulling and try pushing.
Maybe instead of trying to claw back data we've made publicly crawlable because "I wanted it visible, but not like that" we ask why any of these companies get to keep their data proprietary when it's built on ours?
Would people be more okay with all of this if the rule were "You can build a trained model off of publicly-available data, but that model must itself be publicly-available?"
-
@mark @gsuberland In my opinion, a trapdoor like "okay, well if copyright doesn't apply to the training data you stole, your model isn't copyrightable either" is no good. The US Gov has already said GenAI images and text are not copyrightable. It doesn't help. The thing about generative AI is it inherently takes heavy computational resources (disk space, CPU time, often-unacknowledged low-wage tagging work). Therefore, as a tool, it is inherently biased toward capital and away from individuals.
-
@mark @gsuberland If we say "AI is a new class of thing that is outside the copyright regime entirely", that is not a level playing field. The tool is designed in a way it inherently serves the powerful. "Machine learning models are inherently open" is the exact model I am afraid of— a world where copyright is something that applies to actors who have less than some specific amount of money, and anyone with more than that specific amount of money is liberated from it.
-
@mcc @mark @gsuberland Exactly.
Even if, say, GPT-4 wasn't covered by copyright, so what? Even if you could get it out of OpenAI's data centres in the first place, you couldn't run it with reasonable performance. And you *certainly* couldn't retrain it.
-
@mcc you're right to flag that, for sure
-
Irenes (many) replied to Irenes (many)
@mcc we definitely think that copyright as a tool for building a better world has bent the structure of capitalism as far as it is going to. we can't afford to REMOVE that crowbar, and in fact we should probably be coming up with more radical copyleft + non-commercial + anti-war licenses, but enforcement is going to keep favoring large power structures, not individuals.
-
Irenes (many) replied to Irenes (many)
@mcc (the point of making ever-more-radical licenses is to stay ahead of capitalist attempts to subsume critique into itself)