2008, me: I love the idea of cryptocurrency

BITCOIN: The word "cryptocurrency" now means "financial scams based on inefficient write-only ledgers"

2018, me: I love the idea of the metaverse

FACEBOOK: The word "metaverse" now means "proprietary 3D chat pro...

mcc

…but wait! If you look at what they actually did (correct me if I'm wrong), they aren't actually doing any machine learning in the "stack" repo itself. The "stack" just collects zillions of repos in one place. Mirroring my content as part of a corpus of open source software, torrenting it, putting it on microfilm in a seedbank is the kind of thing I want to encourage. The problem becomes that they then *suggest* people create derivative works of those repos in contravention of the license. (2/2)

mcc

So… what is happening here? All these people are opting out of having their content recorded as part of a corpus of open source code. And I'll probably do the same, because "The Stack" is falsely implying people have permission to use it for ML training. But this means "The Stack" has put a knife in the heart of publicly archiving open source code at all. Future attempts to preserve OSS code will, if they base themselves on "the stack", not have any of those opted-out repositories to draw from.

datarama

@mcc That's also basically how LAION made the dataset for Stable Diffusion. They collected a bunch of links to images with descriptive alt-text.

(Are you taking time to write good alt-text because you respect disabled people? Congratulations, your good work is being exploited by the worst assholes in tech. Silicon Valley never lets a good deed go unpunished.)

mcc

Like, heck, how am I *supposed* to rely on my code getting preserved after I lose interest, I die, BitBucket deletes every bit of Mercurial-hosted content it ever hosted, etc? Am I supposed to rely on *Microsoft* to responsibly preserve my work? Holy crud no.

We *want* people to want their code widely mirrored and distributed. That was the reason for the licenses. That was the social contract. But if machine learning means the social contract is dead, why would people want their code mirrored?

Graham Sutherland / Polynomial

@mcc I have generally come to the conclusion that this is an intended effect. All the things you feel compelled to do for the good of others, in an ordinarily altruistic sense, are essentially made impossible unless you accept that your works and your expressions will be repackaged, sold, and absorbed into commercialised datasets.

The SoaD line "manufacturing consent is the name of the game" has been in my head a lot lately.

Mark T. Tomczak

@gsuberland @mcc One almost wonders if the end-game is to stop pulling and try pushing.

Maybe instead of trying to claw back data we've made publicly crawlable because "I wanted it visible, but not like that" we ask why any of these companies get to keep their data proprietary when it's built on ours?

Would people be more okay with all of this if the rule were "You can build a trained model off of publicly-available data, but that model must itself be publicly-available?"

mcc

@mark @gsuberland In my opinion, a trapdoor like "okay, well if copyright doesn't apply to the training data you stole, your model isn't copyrightable either" is no good. The US Gov has already said GenAI images and text are not copyrightable. It doesn't help. The thing about generative AI is it inherently takes heavy computational resources (disk space, CPU time, often-unacknowledged low-wage tagging work). Therefore, as a tool, it is inherently biased toward capital and away from individuals.

mcc

@mark @gsuberland If we say "AI is a new class of thing that is outside the copyright regime entirely", that is not a level playing field. The tool is designed in a way it inherently serves the powerful. "Machine learning models are inherently open" is the exact model I am afraid of— a world where copyright is something that applies to actors who have less than some specific amount of money, and anyone with more than that specific amount of money is liberated from it.

datarama

@mcc @mark @gsuberland Exactly.

Even if, say, GPT-4 wasn't covered by copyright, so what? Even if you could get it out of OpenAI's data centres in the first place, you couldn't run it with reasonable performance. And you *certainly* couldn't retrain it.

Irenes (many)

@mcc you're right to flag that, for sure

Irenes (many)

@mcc we definitely think that copyright as a tool for building a better world has bent the structure of capitalism as far as it is going to. we can't afford to REMOVE that crowbar, and in fact we should probably be coming up with more radical copyleft + non-commercial + anti-war licenses, but enforcement is going to keep favoring large power structures, not individuals.

Irenes (many)

@mcc (the point of making ever-more-radical licenses is to stay ahead of capitalist attempts to subsume critique into itself)

Oblomov

@datarama @mcc @mark @gsuberland there is one upside to forcing these models to be open and it's that it removes one of the, if not the primary, incentives in developing them in the first place. Yes, they could still sell its execution as a service, but if they lose control of the model itself, it becomes a considerably less profitable endeavor.

datarama

@oblomov @mcc @mark @gsuberland How, though?

Let's say that tomorrow, a judge rules that GPT-4 is not covered by copyright. What has actually changed? OpenAI isn't compelled to share it with anyone, and it's too big for anyone except large and wealthy corporations to actually do anything with.

Sure, you couldn't get sued if you got a bittorrent of it somehow. But you're not getting a bittorrent of a 1.76 trillion parameter neural network anyway.

datarama

@gsuberland @mcc This isn't why the AI craze has made me anxious, but it *is* why I have become terribly depressed.

I like writing code and making various weird computer programs, and sharing them with people for mutual entertainment and occasional enlightenment. Now I can't do that without accepting that everything I do will be appropriated and commoditized by some of the most horrible people in tech, unless I do it in secret.

And then what's the point?

Carlos Solís

@datarama @mcc GenAI trained with your own art, on your own devices, is perfectly acceptable and the actual expected usage case for it. Commercial AI hedging on the seams of "fair use", not so much.

mcc

@csolisr @datarama Yeah, but how am I going to communicate what steps I did or didn't take to do the model training ethically/legally? People are just gonna see signifiers of AI and tune out.

crzwdjk ✅

@datarama @oblomov @mcc @mark @gsuberland 1.76 trillion parameters is about a hard drive's worth of data, no?

spaduf

@mcc @csolisr @datarama outside of mastodon I really don't think the connotation is so negative that you have to actively defend yourself.

mcc

@spaduf @csolisr @datarama A lot of my friends are artists