Someone Made a Dataset of One Million Bluesky Posts for 'Machine Learning Research'

The Nexus of Privacy

A Hugging Face employee made a huge dataset of Bluesky posts, and it’s already very popular.

404 Media (www.404media.co)

Who, who could have predicted?

Well, everybody. Bluesky's an all-public architecture that's optimized for surveillance capitalism. But other than "everybody" ... who, who could have predicted?

#bluesky

Bluesky's now thinking about allowing users to specify consent (or not) for AI training -- analogous to Mastodon et al.'s "indexable" attribute for search engines.

Bluesky (@bsky.app)

Brief update on our ongoing efforts to allow users to specify consent (or not) for AI training: 🧵

Bluesky Social (bsky.app)

"Bluesky is an open and public social network, much like websites on the Internet itself. Websites can specify whether they consent to outside companies crawling their data with a robots.txt file, and we’re investigating a similar practice here.
For example, this might look like a setting that allows Bluesky users to specify whether they consent to outside developers using their content in AI training datasets
Bluesky won’t be able to enforce this consent outside of our systems. It will be up to outside developers to respect these settings"

#bluesky

Here's a great thread with context on the non-consensual scraping of Bluesky from @cfiesler -- lots of valuable links and insights.

https://skywriter.blue/pages/cfiesler.bsky.social/post/3lbwurkbfcs2w is the single-page summary, and https://bsky.app/profile/cfiesler.bsky.social/post/3lbwurkbfcs2w is the Bluesky thread

And here's a thread from @huggingface's principal ethicist, including:

Open source and open science mean people explore, test, and correct mistakes if there were any, taking responsibility for them in a very transparent way. We don't need the ethicist there to say what's right and what's wrong.
Hugging Face is meant to be a platform for people to regain the power they have been losing from big tech companies, and be in control of their data. So we're on your side, and rest assured this won't be happening again in the future.

How reassuring!

And of course a few hours later it happened again. The second dataset (two million posts this time) wasn't uploaded by a Hugging Face employee. Instead, it was uploaded to Hugging Face by somebody who I've seen described as a "fascist" and who is apparently on blocklists for right-wing trolls. But hey, "open", right?

#bluesky