2008, me: I love the idea of cryptocurrency
BITCOIN: The word "cryptocurrency" now means "financial scams based on inefficient write-only ledgers"
2018, me: I love the idea of the metaverse
FACEBOOK: The word "metaverse" now means "proprietary 3D chat pro...

mcc

@pinkdrunkenelephants @datarama Because humans also are the ones who interpret and enforce laws and if the government does not enforce copyright against companies which market their products as "AI", then copyright does not apply to those companies.

pinkdrunkenelephants

@mcc @datarama I guess that's more of a bribery problem than a legal precedent one, then.

datarama

@pinkdrunkenelephants @mcc In the EU, there actually is some legislation. Copyright explicitly *doesn't* protect works from being used in machine learning for academic research, but ML training for commercial products must respect a "machine-readable opt-out".

But that's easy enough to get around. That's why, e.g., Stability funded an "independent research lab" that did the actual data gathering for them.

mcc

@datarama I consider this illegitimate and fundamentally unfair because I have already released large amounts of work under creative commons/open source licenses. I can't retroactively add terms to some of them because the plain language somehow no longer applies. If I add such opt-outs now, it would be like I'm admitting the licenses previously didn't apply to statistics-based derivative works

pinkdrunkenelephants

@datarama @mcc I wonder why it is people don't just revolt and destroy their servers then. Or drag them into jail.

Why do people delude themselves into accepting atrocities?

datarama

@mcc I consider it illegitimate and fundamentally unfair because it's opt-out.

mcc

Did you see this? The whole thing with "the stack".

emenel (@[email protected])

If you had code on GitHub at any point it looks like it might be included in a large dataset called “The Stack” — If you want your code removed from this massive “ai” training data go here: https://huggingface.co/spaces/bigcode/in-the-stack

I found two of my old Github repos in there. Both were deleted last year and both were private. This is a serious breach of trust by Github and @[email protected]. Remove all your code from Github.

CONSENT IS NOT OPT-OUT.

Edit — thanks for all the replies. More context here: https://hachyderm.io/@joeyh/112105744123363587

Also the repos i found of mine i’m sure were private, but even if they were public at some point, for a brief time, in the past that isn’t my consent to use them for purposes beyond their intent.

Edit 2 -- I see this made it to HN, which is a level of attention I do not want nor appreciate.... For all those wondering about the private repo issue -- No, I am not 100% sure that these ancient repos weren't at some point public for a split second before I changed it. I do know that they were never meant for this and that one of them didn't even contain any code. If my accidentally making a repo public for a moment just so happened to overlap with this scraping, then I guess that's possible. But it in no way invalidates the issues, and the anger that i feel about it.

post.lurk.org (post.lurk.org)

Some jerks did mass scraping of open source projects, putting them in a collection called "the stack" which they specifically recommend other people use as machine learning sources. If you look at their "Github opt-out repository" you'll find just page after page of people asking to have their stuff removed:

Issues · bigcode-project/opt-out-v2

Repository for opt-out requests.

GitHub (github.com)

(1/2)

datarama

@pinkdrunkenelephants @mcc I think if there was a simple clear-cut answer to that, the world would be a *very* different place.

mcc

…but wait! If you look at what they actually did (correct me if I'm wrong), they aren't actually doing any machine learning in the "stack" repo itself. The "stack" just collects zillions of repos in one place. Mirroring my content as part of a corpus of open source software, torrenting it, putting it on microfilm in a seedbank is the kind of thing I want to encourage. The problem becomes that they then *suggest* people create derivative works of those repos in contravention of the license. (2/2)

mcc

So… what is happening here? All these people are opting out of having their content recorded as part of a corpus of open source code. And I'll probably do the same, because "The Stack" is falsely implying people have permission to use it for ML training. But this means "The Stack" has put a knife in the heart of publicly archiving open source code at all. Future attempts to preserve OSS code will, if they base themselves on "the stack", not have any of those opted-out repositories to draw from.

datarama

@mcc That's also basically how LAION made the dataset for Stable Diffusion. They collected a bunch of links to images with descriptive alt-text.

(Are you taking time to write good alt-text because you respect disabled people? Congratulations, your good work is being exploited by the worst assholes in tech. Silicon Valley never lets a good deed go unpunished.)

mcc

Like, heck, how am I *supposed* to rely on my code getting preserved after I lose interest, I die, BitBucket deletes every bit of Mercurial-hosted content it ever hosted, etc? Am I supposed to rely on *Microsoft* to responsibly preserve my work? Holy crud no.

We *want* people to want their code widely mirrored and distributed. That was the reason for the licenses. That was the social contract. But if machine learning means the social contract is dead, why would people want their code mirrored?

Graham Sutherland / Polynomial

@mcc I have generally come to the conclusion that this is an intended effect. All the things you feel compelled to do for the good of others, in an ordinarily altruistic sense, are essentially made impossible unless you accept that your works and your expressions will be repackaged, sold, and absorbed into commercialised datasets.

The SoaD line "manufacturing consent is the name of the game" has been in my head a lot lately.

Mark T. Tomczak

@gsuberland @mcc One almost wonders if the end-game is to stop pulling and try pushing.

Maybe instead of trying to claw back data we've made publicly crawlable because "I wanted it visible, but not like that" we ask why any of these companies get to keep their data proprietary when it's built on ours?

Would people be more okay with all of this if the rule were "You can build a trained model off of publicly-available data, but that model must itself be publicly-available?"

mcc

@mark @gsuberland In my opinion, a trapdoor like "okay, well if copyright doesn't apply to the training data you stole, your model isn't copyrightable either" is no good. The US Gov has already said GenAI images and text are not copyrightable. It doesn't help. The thing about generative AI is it inherently takes heavy computational resources (disk space, CPU time, often-unacknowledged low-wage tagging work). Therefore, as a tool, it is inherently biased toward capital and away from individuals.

mcc

@mark @gsuberland If we say "AI is a new class of thing that is outside the copyright regime entirely", that is not a level playing field. The tool is designed in a way it inherently serves the powerful. "Machine learning models are inherently open" is the exact model I am afraid of— a world where copyright is something that applies to actors who have less than some specific amount of money, and anyone with more than that specific amount of money is liberated from it.

datarama

@mcc @mark @gsuberland Exactly.

Even if, say, GPT-4 wasn't covered by copyright, so what? Even if you could get it out of OpenAI's data centres in the first place, you couldn't run it with reasonable performance. And you *certainly* couldn't retrain it.

Irenes (many)

@mcc you're right to flag that, for sure

Irenes (many)

@mcc we definitely think that copyright as a tool for building a better world has bent the structure of capitalism as far as it is going to. we can't afford to REMOVE that crowbar, and in fact we should probably be coming up with more radical copyleft + non-commercial + anti-war licenses, but enforcement is going to keep favoring large power structures, not individuals.

Irenes (many)

@mcc (the point of making ever-more-radical licenses is to stay ahead of capitalist attempts to subsume critique into itself)