2008, me: I love the idea of cryptocurrency
BITCOIN: The word "cryptocurrency" now means "financial scams based on inefficient write-only ledgers"
2018, me: I love the idea of the metaverse
FACEBOOK: The word "metaverse" now means "proprietary 3D chat pro...
-
datarama replied to mcc
@mcc I feel like an asshole when I say I enjoy (and used to make) "generative art" now.
-
RONALD LACEY: Again we see, Ms. McClure, there is nothing you can possess which I cannot take away.
-
@mcc this monkey's paw is bound to run out of fingers eventually
-
-
I'm really concerned about the effect "generative AI" is going to have on the attempt to build a copyleft/commons.
As artists/coders, we saw that copyright constrains us. So we decided to make a fenced-off area where we could make copyright work for us in a limited way, with permissions for derivative works within the commons according to clear rules set out in licenses.
Now OpenAI has made a world where rules and licenses don't apply to any company with a valuation over $N billion.
-
(The exact value of "N" is not known yet; I assume it will be solidly fixed by some upcoming court case.)
-
In a world where copyleft licenses turn out to restrict only the small actors they were meant to empower, and don't apply to big bad-actor "AI" companies, what is the incentive to put your work out under a license that will only serve to make it a target for "AI" scraping?
With NFTs, we saw people taking their work private because putting something behind a clickwall/paywall was the only way to not be stolen for NFTs. I assume the same process will accelerate in an "AI" world.
-
@mcc They should just make a license that explicitly bans AI usage then.
-
@mcc There is no such incentive. There is a very, very strong incentive (namely, not wanting to empower the worst scumbags in tech) to *not* share your work publicly anymore.
This, to me, is the most harmful effect so far of generative AI.
-
datarama replied to pinkdrunkenelephants
@pinkdrunkenelephants @mcc That doesn't work if copyright *itself* doesn't apply to AI training, which is what all those court cases are about. Licenses start from the assumption that the copyright holder reserves all rights, and then the license explicitly waives some of those rights under a set of given conditions.
But with AI, it's up in the air whether a copyright holder has any rights at all.
-
pinkdrunkenelephants replied to datarama
-
@pinkdrunkenelephants @datarama Because humans are also the ones who interpret and enforce laws, and if the government does not enforce copyright against companies that market their products as "AI", then copyright does not apply to those companies.
-
-
datarama replied to pinkdrunkenelephants
@pinkdrunkenelephants @mcc In the EU, there actually is some legislation. Copyright explicitly *doesn't* protect works from being used in machine learning for academic research, but ML training for commercial products must respect a "machine-readable opt-out".
But that's easy enough to get around. That's why, e.g., Stability funded an "independent research lab" which did the actual data gathering for them.
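For context on what a "machine-readable opt-out" looks like in practice: one common convention is robots.txt directives aimed at the ML crawlers that have published their user-agent names. This is only a sketch; which names matter varies by crawler, and honoring them is entirely voluntary on the crawler's part:

```
# robots.txt: asks known ML-training crawlers not to fetch anything.
# These user-agent names are published by their operators (OpenAI,
# Common Crawl, Google), but compliance is voluntary.
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```

Note that this only covers future crawls by cooperating bots; it does nothing about data already collected, which is part of the complaint below.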
-
@datarama I consider this illegitimate and fundamentally unfair, because I have already released large amounts of work under Creative Commons/open source licenses. I can't retroactively add terms to some of them just because the plain language somehow no longer applies. And if I add such opt-outs now, it would be like admitting the licenses previously didn't apply to statistics-based derivative works.
-
pinkdrunkenelephants replied to datarama
-
@mcc I consider it illegitimate and fundamentally unfair because it's opt-out.
-
Did you see this? The whole thing with "the stack".
emenel (@[email protected])
If you had code on GitHub at any point, it looks like it might be included in a large dataset called "The Stack". If you want your code removed from this massive "ai" training data, go here: https://huggingface.co/spaces/bigcode/in-the-stack

I found two of my old GitHub repos in there. Both were deleted last year and both were private. This is a serious breach of trust by GitHub and @[email protected]. Remove all your code from GitHub. CONSENT IS NOT OPT-OUT.

Edit — thanks for all the replies. More context here: https://hachyderm.io/@joeyh/112105744123363587 Also, the repos I found of mine I'm sure were private, but even if they were public at some point, for a brief time, in the past, that isn't my consent to use them for purposes beyond their intent.

Edit 2 — I see this made it to HN, which is a level of attention I do not want nor appreciate. For all those wondering about the private repo issue: No, I am not 100% sure that these ancient repos weren't at some point public for a split second before I changed it. I do know that they were never meant for this, and that one of them didn't even contain any code. If my accidentally making a repo public for a moment just so happened to overlap with this scraping, then I guess that's possible. But it in no way invalidates the issues, and the anger that I feel about it.
Some jerks did mass scraping of open source projects, putting them in a collection called "the stack" which they specifically recommend other people use as machine learning sources. If you look at their "Github opt-out repository" you'll find just page after page of people asking to have their stuff removed:
Issues · bigcode-project/opt-out-v2 (github.com): repository for opt-out requests
(1/2)
-
datarama replied to pinkdrunkenelephants
@pinkdrunkenelephants @mcc I think if there was a simple clear-cut answer to that, the world would be a *very* different place.
-
…but wait! If you look at what they actually did (correct me if I'm wrong), they aren't actually doing any machine learning in the "stack" repo itself. The "stack" just collects zillions of repos in one place. Mirroring my content as part of a corpus of open source software, torrenting it, or putting it on microfilm in a seed bank is exactly the kind of thing I want to encourage. The problem is that they then *suggest* people create derivative works of those repos in contravention of the license. (2/2)