Would you share your Fediverse data with researchers?

Jason Pester

@evan I don't view it that way. I choose to share content / data on platforms with an understanding of how that platform respects my rights to the content / data I post. Others have expressed a similar sentiment (see links below). I'm a straight male, but I think the author's point in the Medium article regarding LGBTQ+ outing through network data analysis is a good one.

Just a moment...

(bmatb.medium.com)

More Mastodon Scraping Without Consent (Notes on Nobre et al 2022)

There’s a new paper out about Mastodon! But unfortunately, it’s a deeply problematic one. Nobre et al’s “More of the Same? A Study of Images Shared on Mastodon’s Federated Timeline” is a paper that is now published in proceedings from International Conference on Social Informatics. (Unfortunately, it’s not open access.) Because I’m currently researching the fediverse and blogging about that process, I thought I’d write up notes on this paper. Why this paper? Frankly, because I’m pretty certain it violates the community norms, as well as terms of service, of many Mastodon instances. It instantly reminded me of the controversial paper from Zignani et al, “Mastodon Content Warnings: Inappropriate Contents on a Microblogging Platform”, which resulted in a scathing open letter and the retraction of a dataset from the Harvard Dataverse. Nobre et al’s “More of the Same” is a study of image-sharing. The authors claim that it is about image-sharing on Mastodon, but really their focus is on images they culled from Mastodon.social’s federated timeline. They pulled 4M posts from 103K active users, of which 1M had images. Since they pulled posts from Mastodon.social’s federated timeline, they saw posts from 4K separate instances. The authors state that a “relevant number” of the images they found are “explicit.” They categorize the images as such after running them through Google’s Vision AI Safe Search system. They also run the images they find through Google’s image search to trace where the images came from and how they are shared on Mastodon. Ultimately, the authors don’t really make an argument, other than stating in passing that Mastodon needs better moderation, since people share explicit images. In some ways, “More of the Same” lives up to its title: it’s more of the same poor scholarship that can be seen in Zignani et al (in fact, Nobre et al cite that controversial paper). Here are my critiques:

FOSS Academic (fossacademic.tech)

Elasticsearch server actively scraping Mastodon user data; over 150,000 individuals exposed so far

If you’re a Twitter user, you’ve probably heard of Mastodon, a free open-source software with similar micro-blogging features.

Hot for Security (www.bitdefender.com)

Evan Prodromou

@JasonPester OK. I feel like the phrasing of the question suggests consent, but ok if you don't.

Kristian

@evan Would have voted "qualified no" if I actually could, at this point. Experiences with "researchers" on other platforms leave me very very cautious and concerned here. Much more than these, however: _My_ Fediverse data itself is irrelevant. What matters seems data that somehow relates me to others, and I can't at all be sure most (or even all) of my contacts are willing for me to share any information on our interactions, communications, messages. Plus, it feels at least not trivial to handle data from people that agreed _and_, from that set, weeding out data from people that have _not_ agreed without actually having at least this interaction information at hand, back then.
(On the other end, most of my communication out here is public. I have learnt not to very much trust the Fediverse and specificially ActivityPub from a technical perspective with real "private" data so guess any researcher could probably go there and utilize some sort of web or AP crawler to get whichever public information is there without a second thought.)

Michael Vogel

@evan While writing about European privacy laws, I realised that any Fediverse research that uses personal data (even the public data) would have to be on a strict opt-in basis for European users.

When you subscribe to a social network (Fediverse-based or otherwise), you always have to agree to its terms of use, which also have to tell you something about how your data will be used. If there is no mention of (academic) research, then no one is allowed to use my data for that. Data like "who interacts with whom" (the social graph) is declared as very sensitive data, so it can never be used for anything other than the intended purpose (communication) unless I explicitly agree.

The fines are really high. Meta, for example, has had to pay fines of more than 2 billion euros in recent years, Amazon almost 800 million euros, and so on.

Evan Prodromou

@heluecht

So, in my mind, "Would you share your Fediverse data with researchers?" implies that you have agency to share or not, and that you can consent or not.

But I guess you're reading it a different way.

Evan Prodromou

@z428 Why don't you trust ActivityPub with private data? It's as good as email.

Kristian

@evan At first, I don't trust e-mail with really "private" data either, due to its very nature (store-and-forward, unencrypted metadata, encryption mainly "just" done using PGP/GPG with long-lived private keys closely tied to my identity).

Plus, I think these things don't really compare. E-mail, by default, has access control and whatever is in _my_ mailbox is supposed to be in _my_ mailbox. With maybe the exception of mailing lists, I usually don't have such a thing as an e-mail sent out to a "random public" - it's always addressing one specific recipient and usually supposed to end up in this persons very inbox invisible to someone else. Fediverse, to me, seems more like "the old WWW" here where a lot of things are public by default and anything to reduce visibility is somewhat difficult to do right on top.

Adding to that, for ActivityPub things seem slightly more complex depending on how various implementations handle things. In example, I've seen a bunch of situations in which "private" or "follower-only" messages have made it to public views in Friendica. Not sure whether these issues arising from loopholes or weaknesses in ActivityPub as a spec or "just" flaws in individual implementations, yet this makes me very very cautious how to make sure "private" messages actually remain "private".

cc @heluecht

whither and d'ye

@evan depends on whether we mean academic research or corporate data scientists

Evan Prodromou

@squinky would you say yes to either?

Thibault Molleman🇧🇪 🌈🐝

@evan That is really vague tbf. What data? the data that's already public?

yes. But only if it's for academic non-commercial research

Evan Prodromou

@z428 @heluecht the UI for private messages in Mastodon leads to a lot of mistakes, such as people posting something publicly that was meant to be private.

Evan Prodromou

@thibaultmol it's a hypothetical question. The point is for you to think about what data you would share under what conditions.

Dr Pen

@evan academic researcher here. I'd really love to be involved in the researching of the Fediverse. What's being planned by swf? Data, interviews, UX, all of it hopefully. Sign me up.

ramin

@evan I actually just started a research project about environmental stewardship and social media. We are getting data from insta, fb, yt, tiktok & Twitter but we did not consider the fediverse. Too few users + hard to get consent.

Evan Prodromou

Really great results. I'm qualified yes; I have been part of academic research projects before and I'd be ok with doing one here if I felt like my privacy would be preserved and I supported the topic of study.

Lots of good responses in the comments. Some blanket statements by people saying they would not permit others to participate in research on the Fediverse, which is kind of overstepping.

Evan Prodromou

Many people assumed either nonconsensual or commercial research or both, which doesn't follow from the question for me, but I understand why people are vigilant.

Evan Prodromou

An interesting question came up about consent of connected people -- social graph or reactions to content, for example. I'm interested and I am going to investigate how it works for social networks.

Lawrence Pritchard Waterhouse

@evan For what it's worth: I think you worded it well. People are prone to vigorous knee-jerk reactions, and I can't blame 'em (Being prone to those myself, especially in areas where ethics are concerned)

Evan Prodromou

@lpwaterhouse I also was part of the launch of a new non-profit foundation, https://socialwebfoundation.org/ , this week. I think people are feeling worried about the SWF's purpose, so they interpreted the question really negatively.

Evan Prodromou

@Transflux Awesome! Thanks for the reply.

One thing I'm trying to figure out is how researchers deal with consent of creators of incidental content -- for example, if you analyse the primary subject's image post, analysing the "likes" of that post.

As far as I can tell, it's usually aggregated and/or anonymized, but researchers don't seek consent from secondary subjects. Is that about right?