2008, me: I love the idea of cryptocurrency
BITCOIN: The word "cryptocurrency" now means "financial scams based on inefficient write-only ledgers"
2018, me: I love the idea of the metaverse
FACEBOOK: The word "metaverse" now means "proprietary 3D chat programs"
-
I'm really concerned about the effect "generative AI" is going to have on the attempt to build a copyleft/commons.
As artists/coders, we saw that copyright constrains us. So we decided to make a fenced-off area where we could make copyright work for us in a limited way, with permissions for derivative works within the commons according to clear rules set out in licenses.
Now OpenAI has made a world where rules and licenses don't apply to any company with a valuation over $N billion.
-
(The exact value of "N" is not known yet; I assume it will be solidly fixed by some upcoming court case.)
-
In a world where copyleft licenses turn out to restrict only the small actors they were meant to empower, and don't apply to big bad-actor "AI" companies, what is the incentive to put your work out under a license that will only serve to make it a target for "AI" scraping?
With NFTs, we saw people taking their work private because putting something behind a clickwall/paywall was the only way to keep it from being stolen for NFTs. I assume the same process will accelerate in an "AI" world.
-
@mcc They should just make a license that explicitly bans AI usage then.
-
@mcc There is no such incentive. There is a very, very strong incentive (namely, not wanting to empower the worst scumbags in tech) to *not* share your work publicly anymore.
This, to me, is the most harmful effect so far of generative AI.
-
datarama replied to pinkdrunkenelephants:
@pinkdrunkenelephants @mcc That doesn't work if copyright *itself* doesn't apply to AI training, which is what all those court cases are about. Licenses start from the assumption that the copyright holder reserves all rights, and then the license explicitly waives some of those rights under a set of given conditions.
But with AI, it's up in the air whether a copyright holder has any rights at all.
-
@pinkdrunkenelephants @datarama Because humans are also the ones who interpret and enforce laws. If the government does not enforce copyright against companies that market their products as "AI", then copyright effectively does not apply to those companies.
-
datarama replied to pinkdrunkenelephants:
@pinkdrunkenelephants @mcc In the EU, there actually is some legislation. Copyright explicitly *doesn't* protect works from being used in machine learning for academic research, but ML training for commercial products must respect a "machine-readable opt-out".
But that's easy enough to get around. That's why, e.g., Stability funded an "independent research lab" that did the actual data gathering for them.
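(For concreteness: in practice the "machine-readable opt-out" usually means something like robots.txt rules ("User-agent: GPTBot" / "Disallow: /") or the draft W3C TDM Reservation Protocol header "tdm-reservation: 1". Below is a minimal sketch, using only Python's standard library, of how a *compliant* crawler would check the robots.txt part. The domain and path are placeholders; whether any given scraper actually runs a check like this is exactly the open question.)

    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()  # fetch and parse the site's robots.txt

    # A compliant ML crawler identifying itself as "GPTBot" skips any path
    # the site has disallowed for that user-agent; a non-compliant one just
    # ignores the file entirely.
    if rp.can_fetch("GPTBot", "https://example.com/some/page"):
        print("allowed to crawl")
    else:
        print("opted out: skip this page")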
-
@datarama I consider this illegitimate and fundamentally unfair because I have already released large amounts of work under creative commons/open source licenses. I can't retroactively add terms to some of them just because the plain language somehow no longer applies. If I add such opt-outs now, it would be like admitting the licenses previously didn't apply to statistics-based derivative works.
-
@mcc I consider it illegitimate and fundamentally unfair because it's opt-out.
-
Did you see this? The whole thing with "the stack".
emenel (@[email protected])
If you had code on GitHub at any point it looks like it might be included in a large dataset called "The Stack". If you want your code removed from this massive "ai" training data go here: https://huggingface.co/spaces/bigcode/in-the-stack

I found two of my old Github repos in there. Both were deleted last year and both were private. This is a serious breach of trust by Github and @[email protected]. Remove all your code from Github. CONSENT IS NOT OPT-OUT.

Edit — thanks for all the replies. More context here: https://hachyderm.io/@joeyh/112105744123363587 Also the repos i found of mine i'm sure were private, but even if they were public at some point, for a brief time, in the past that isn't my consent to use them for purposes beyond their intent.

Edit 2 — I see this made it to HN, which is a level of attention I do not want nor appreciate. For all those wondering about the private repo issue: No, I am not 100% sure that these ancient repos weren't at some point public for a split second before I changed it. I do know that they were never meant for this and that one of them didn't even contain any code. If my accidentally making a repo public for a moment just so happened to overlap with this scraping, then I guess that's possible. But it in no way invalidates the issues, and the anger that i feel about it.
Some jerks did mass scraping of open source projects, putting them in a collection called "the stack" which they specifically recommend other people use as machine learning sources. If you look at their "Github opt-out repository" you'll find just page after page of people asking to have their stuff removed:
Issues · bigcode-project/opt-out-v2: repository for opt-out requests (github.com)
(1/2)
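(Side note: the dataset itself is published on Hugging Face, so you can also look for your own repos without going through their web form. A rough sketch, assuming the `datasets` library, that "bigcode/the-stack" is accessible to your account, and that its column names such as "max_stars_repo_name" haven't changed since the published schema; streaming avoids downloading terabytes, though scanning a whole language subset this way is still slow.)

    # pip install datasets
    from datasets import load_dataset

    # Stream one language subset of The Stack instead of downloading it all.
    ds = load_dataset("bigcode/the-stack", data_dir="data/python",
                      split="train", streaming=True)

    MY_GITHUB_USER = "your-username"  # hypothetical placeholder

    for row in ds:
        repo = row.get("max_stars_repo_name") or ""  # formatted "user/repo"
        if repo.split("/")[0] == MY_GITHUB_USER:
            print("found:", repo, row.get("max_stars_repo_path"))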
-
datarama replied to pinkdrunkenelephants:
@pinkdrunkenelephants @mcc I think if there was a simple clear-cut answer to that, the world would be a *very* different place.
-
…but wait! If you look at what they actually did (correct me if I'm wrong), they aren't actually doing any machine learning in the "stack" repo itself. "The Stack" just collects zillions of repos in one place. Mirroring my content as part of a corpus of open source software, torrenting it, putting it on microfilm in a seedbank: that is the kind of thing I want to encourage. The problem is that they then *suggest* people create derivative works of those repos in contravention of the license. (2/2)
-
So… what is happening here? All these people are opting out of having their content recorded as part of a corpus of open source code. And I'll probably do the same, because "The Stack" is falsely implying people have permission to use it for ML training. But this means "The Stack" has put a knife in the heart of publicly archiving open source code at all. Future attempts to preserve OSS code, if they base themselves on "The Stack", won't have any of those opted-out repositories to draw from.
-
@mcc That's also basically how LAION made the dataset for Stable Diffusion. They collected a bunch of links to images with descriptive alt-text.
(Are you taking time to write good alt-text because you respect disabled people? Congratulations, your good work is being exploited by the worst assholes in tech. Silicon Valley never lets a good deed go unpunished.)
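(To make concrete how low-tech that harvesting is: pairing images with their alt-text is a few lines of parsing. A toy sketch using only Python's standard library and a made-up HTML snippet; LAION's actual pipeline did roughly this over Common Crawl dumps at enormous scale.)

    from html.parser import HTMLParser

    class AltTextPairs(HTMLParser):
        """Collect (image URL, alt-text) pairs from HTML, LAION-style."""

        def __init__(self):
            super().__init__()
            self.pairs = []

        def handle_starttag(self, tag, attrs):
            if tag == "img":
                a = dict(attrs)
                # Only images with descriptive alt-text make useful
                # (image, caption) training pairs.
                if a.get("src") and a.get("alt"):
                    self.pairs.append((a["src"], a["alt"]))

    parser = AltTextPairs()
    parser.feed('<img src="https://example.com/cat.jpg" '
                'alt="a tabby cat asleep on a keyboard">')
    print(parser.pairs)
    # [('https://example.com/cat.jpg', 'a tabby cat asleep on a keyboard')]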
-
Like, heck, how am I *supposed* to rely on my code getting preserved after I lose interest, I die, Bitbucket deletes every bit of Mercurial-hosted content it ever hosted, etc.? Am I supposed to rely on *Microsoft* to responsibly preserve my work? Holy crud no.
We *want* people to want their code widely mirrored and distributed. That was the reason for the licenses. That was the social contract. But if machine learning means the social contract is dead, why would people want their code mirrored?