Was talking to someone about #BlueSky the other day, and how they apparently used some sort of #AI for #moderation.
-
Kee Hinckley replied to mekka okereke :verified: last edited by
@mekkaokereke @Raccoon I absolutely agree. The lack of context is why Meta has such a huge false positive problem. Even their human reviewers don't get any larger context about relationships and adjacent behavior. And their human reviewers are almost always from completely different cultures, so they don't get that context either. Then throw in their move to using translation rather than native language speakers...
And to truly understand context, you have to understand punching up vs. punching down. Try selling that in an algorithm.
I think ML has a role. Especially for flagging potential issues before they are human-reported. But the actual decision has to be made by a human who has all the context. That doesn't scale in monolithic social networks. I'm not sure it scales in distributed ones either, but it has to be the goal.
-
Raccoon at TechHub :mastodon: replied to mekka okereke :verified: last edited by
@mekkaokereke @jdp23 @mattdm
Question... What if we were the ones to train the AI, with posts that moderators actively thought were either over the line or acceptable, and we made sure to include a diverse group of people?
If someone ever were to do this, I'd hope they'd source examples of racist and non-racist posts directly from Black Fedi's moderation teams. I'd also hope they were looking for very overt examples, the kind a moderator would read and say, "that's not appropriate", as opposed to trying to gauge vague metrics like "toxicity".
To give an example of the kind of posts I'd feed it...
Good Post: "Queer people want equal rights."
Bad Post: "Queer people want special rights."
Good Post: "Police kill too many disabled people."
Bad Post: "Police don't kill enough disabled people."
...so the kind of stuff that a simple word-checker wouldn't find, but that is obviously inappropriate.
How would you feel about that approach?
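To sketch what that training data might look like in practice, using the example posts above (a purely illustrative format, not an existing tool):

```python
# Hypothetical training set, labeled by moderator decisions:
# 1 = "a moderator would act on this", 0 = "acceptable".
labeled_posts = [
    ("Queer people want equal rights.", 0),
    ("Queer people want special rights.", 1),
    ("Police kill too many disabled people.", 0),
    ("Police don't kill enough disabled people.", 1),
    # ...in practice, thousands more examples sourced from a
    # diverse group of moderation teams.
]
texts = [text for text, _ in labeled_posts]
labels = [label for _, label in labeled_posts]
```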
-
Matthew Miller replied to Raccoon at TechHub :mastodon: last edited by
Maybe some, but that's not _nearly_ enough data to do the LLM trick. This could be used as an additional training layer (or something like InstructLab), but the underlying biases of the model will still be there.
And, how to keep up with evolving language, group-specific jargon and slang, and so on?
-
Raccoon at TechHub :mastodon: replied to Matthew Miller last edited by
@mattdm @mekkaokereke @jdp23
Once again, we are NOT talking about LLMs. They are far too large and inefficient to be feasible, and not what we need here: this is smaller, phrase-analysis type stuff, where you train it on phrases and it searches for patterns that match those phrases. The only LLM that might come up is if we make one to generate posts for it to learn from.

If language evolves to the point that it needs updating, we just add new posts to the training data and it learns from those.
Another example that just occurred to me is people calling Kamala Harris a "DEI Hire", which is clearly racist/sexist, so moderation should know about it. Searching for words doesn't help here, because the same words can be said in different ways, and 90% of the time they're not inappropriate. An AI might be good here, because it can take the arrangement of these words into account in a way that might pick up on inappropriate phrasing.
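A minimal sketch of the kind of small phrase-level classifier this describes, assuming scikit-learn and the example texts/labels from earlier (four examples are nowhere near enough data; this only shows the shape of the approach):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Word n-grams (1-3) let the model see arrangements like
# "special rights" or "DEI hire", not just isolated words.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 3)),
    LogisticRegression(max_iter=1000),
)
model.fit(texts, labels)  # texts/labels from the earlier sketch

# Keeping up with evolving language = appending newly labeled
# posts to texts/labels and calling fit() again.
score = model.predict_proba(["She's just a DEI hire."])[0][1]
print(f"flag probability: {score:.2f}")
```

The n-gram range is the design choice that matters here: it is what lets the model distinguish "equal rights" from "special rights" where a single-word checker sees nothing.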
-
Matthew Miller replied to Raccoon at TechHub :mastodon: last edited by
I'd be interested to see if you can get better results with that than with an old-school rules-based approach. My intuition is "probably not".
I mention LLMs because... 1) I think you need a pretty big model to get genuinely useful results from arbitrary posts and 2) that _is_ what we're actually testing.
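For contrast, the old-school rules-based approach would be something like this (the patterns are made-up placeholders):

```python
import re

# Each regex is one hand-written rule; cheap and auditable,
# but every new phrasing has to be added manually.
RULES = [
    re.compile(r"\bspecial rights\b", re.IGNORECASE),
    re.compile(r"\bdon'?t kill enough\b", re.IGNORECASE),
]

def rules_flag(post: str) -> bool:
    """Return True if any hand-written rule matches the post."""
    return any(rule.search(post) for rule in RULES)
```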
-
Raccoon at TechHub :mastodon: replied to Kee Hinckley last edited by
@nazgul @mekkaokereke
Yeah, two things I want to point out about what you said there. One is that you're talking about an AI that was built by Meta, a company with completely different goals from Fediverse moderators: we would end up making a completely different AI simply because we have different goals, a major one being an early warning system for racist / sexist / otherwise bigoted comments, which Meta has made clear it isn't interested in.

Also, you're talking about moderators coming from completely different language backgrounds. That's definitely a problem, but it once again goes to their goals versus ours, and I don't believe it's a problem that we have.
I think part of the problem is that we're trying to directly compare it to things done by corporations like Meta, bad actors in general, but something we make, like what I'm describing, would come out entirely different.
-
Raccoon at TechHub :mastodon: replied to Matthew Miller last edited by
@mattdm @mekkaokereke @jdp23
Maybe eventually I will try this out. Right now, I'm busy trying to tighten up our existing systems, which I'd like to build some very basic automation into, and build some basic resources for other servers.

If I come to a point where I feel like there's not enough coverage and everything else is already implemented, maybe I'll look back into this and build something like it, just as an experiment.
-
mekka okereke :verified: replied to Raccoon at TechHub :mastodon: last edited by
Unfortunately, the goals don't matter if your training corpus is incomplete. There are not enough Black users talking to Black users on the Fediverse to even train a hypothetical model that could capture some of that context. There are entire in-groups of Black users missing. E.g., Facebook has multiple Black Farmers groups, Black Fathers groups, Black Fishermen's groups, etc. And that's just the Black "F"s.
Good intentions are not enough. Data without intention, also not enough.
-
mekka okereke :verified: replied to mekka okereke :verified: last edited by
I guess my question remains... why do you want to use ML for this? To bring down costs? Or to increase moderation effectiveness?
If to increase effectiveness, I don't think that's going to happen.
If to bring down costs, how much of a drop in accuracy are you willing to accept for that? (And I'm not sure that costs would even decrease once you add in false positive flags).
I think the best use case is "sparkling grep", as in "hey human mod, look at this interaction."
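A sketch of what that "sparkling grep" role could look like in code, assuming a `model` like the classifier sketched earlier; the point is that it only queues things for humans and never acts on its own:

```python
REVIEW_THRESHOLD = 0.8  # tune so the human mod queue stays manageable

def triage(post_text: str, review_queue: list) -> None:
    """Flag-only triage: surface the post, never act on it."""
    score = model.predict_proba([post_text])[0][1]
    if score >= REVIEW_THRESHOLD:
        review_queue.append({
            "post": post_text,
            "score": round(score, 2),
            "note": "hey human mod, look at this interaction",
        })
```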
-
William Pietri replied to Raccoon at TechHub :mastodon: last edited by
@Raccoon Having worked on this professionally, I think that's wildly optimistic. Mastodon itself isn't "entirely different" from pre-Musk Twitter; it's mildly better in some ways, notably worse in others, and mostly just the same.
I'll particularly note the lack of diversity here. Even as a white guy, or perhaps especially as one, I'm skeptical that the privileged-white-guy-technologist-heavy Mastodon culture could build good tools for detecting racism and sexism.
@nazgul @mekkaokereke
-
Raccoon at TechHub :mastodon: replied to mekka okereke :verified: last edited by
@mekkaokereke @nazgul
This would bring up costs, because even though it wouldn't be as heavy as a whole server, it's still an ML algorithm.

But yeah, you've made a very good case against "take action and then allow appeal/override by human moderators". Personally, that's something I would never argue for: I just put it in there because I was curious to see if anyone would vote for it.
But I would agree with the last statement, that this could potentially be a good way to bring moderator attention to interactions that need it.
I guess whoever did this would have to figure out some way to get a lot of good posts to train this thing on, even if it's just people submitting chat logs.
I'm still curious as to how well this thing could work though, even without a huge amount of data. To be honest, if we get to the point where there are so many marginalized users talking that it's getting too many false positives, that might be a win in itself.
-
mybarkingdogs replied to William Pietri last edited by
@williampietri @Raccoon @nazgul @mekkaokereke This. Honestly, I'd say the only *good* use of ML in moderation would be an exact-word "extra set of eyes" for immediate referral to a human mod as a very limited first-pass filter - e.g. common slurs, terms that reference abuse material, spammer/SEO keywords.
It would need a human mod check to be certain that the slurs weren't reclaimed, or to tell the difference between someone talking about their own victimization vs. posting abuse fetishism, obviously.
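A sketch of that limited first pass, with placeholder terms standing in for a real watchlist; a match only refers the post to a human, it never decides anything:

```python
# Placeholder watchlist; a real one would be curated by mods.
WATCHLIST = {"example_slur", "example_spam_keyword"}

def first_pass(post_text: str) -> bool:
    """Exact-word match only; context (reclaimed use, someone
    describing their own victimization) is left to the human."""
    words = {w.strip(".,!?:;\"'").lower() for w in post_text.split()}
    return bool(words & WATCHLIST)
```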
-
mybarkingdogs replied to mybarkingdogs last edited by
@williampietri @Raccoon @nazgul @mekkaokereke One other thing that would be VERY necessary to make it better than a lot of automods: be transparent that it is in use, and that certain words will get a post read by a human moderator.
Also emphasize that the terms *themselves* aren't a judgment upon the poster per se, but often appear within contexts of abuse or harassment, and that's the point of having someone take a look and make sure this isn't one of those cases.
-
Raccoon at TechHub :mastodon: replied to mybarkingdogs last edited by
@mybarkingdogs @williampietri @nazgul @mekkaokereke
Oh definitely. I would want any sort of moderation system to be transparent, not only so that users know that it's doing a thing, but also so that other instances can see how it works and potentially implement it on their own servers.
-
PH7831 replied to Raccoon at TechHub :mastodon: last edited by
@Raccoon I chose "use AI only to report..."
And I'd add: ensure, with a process, that AI is not the only source of reports. Set a share of report sources so that AI is not the main source of reports.
-
Raccoon at TechHub :mastodon: replied to PH7831 last edited by [email protected]
@PH7831
I'm not entirely sure what you're talking about.

When we implement this sort of system on a Fediverse server, it typically watches the feed, checking every post that comes in and automatically generating a report whenever it flags something. With current systems, this means we get reports from word-checkers for things like people saying slurs in posts, or accounts with names that are associated with spam.
None of that precludes the average user from hitting the report button if they see something, or especially if someone is sending them inappropriate PMs. (Note that we cannot moderate the PM system unless people hit the report button, because we can't see those messages unless you report them.)
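As a sketch of that flow (the `client` object and both of its methods here are assumptions for illustration, not any specific server software's real API):

```python
def watch_feed(client, flagger):
    """Watch incoming public posts and auto-file reports.

    `client` is a hypothetical server API handle; `flagger` is any
    checker, e.g. the word list or classifier sketched earlier.
    """
    for post in client.stream_incoming_posts():  # assumed method
        if flagger(post.text):
            # Files an ordinary report, just like a user hitting
            # the report button; humans still make the decision.
            client.file_report(  # assumed method
                account=post.author,
                statuses=[post.id],
                comment="auto-flagged for human review",
            )
```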
-
Raccoon at TechHub :mastodon: replied to theothertom last edited by [email protected]
@tom
I don't think either one is significantly better than the other.

Large instances like TechHub tend to have easier sign-ups and very rule-driven moderation, which means we end up bringing in a lot more users who wouldn't be allowed on the smaller servers. But smaller servers being more numerous means it's harder to block them individually if one of them is a bad actor, and individual moderators have less control, because we can only ban an account from the entire network if it's on one of our servers.
Some of the most aggressive harassment vector instances are very small, but they blast the network with hate speech to try and actively scare off marginalized groups, moving from server to server over the days it takes for a FediBlock to fully proliferate, occasionally switching URLs when they run out of people to go after.
Meanwhile, the load on a moderation team seems to scale with the size of the network itself, not the server...
(Continued)
-
Raccoon at TechHub :mastodon: replied to Raccoon at TechHub :mastodon: last edited by
@tom
A smaller server is going to have fewer connections, and thus fewer opportunities for people to have those one-off conversations, but it is just as vulnerable to attacks, especially if it only has one person running it.

Meanwhile, a larger server consolidates a lot of admin responsibilities, especially blocking other servers, which takes a huge load off individual moderators, because 90% of problem traffic comes from the worst 200 or so servers (which are listed on the publicly available block list that I publish).
So there's a whole trade-off here.
-
Just ran into this article: "LLMs have a strong bias against use of African American English"
[H]ave the biases of larger society present in the materials used to train LLMs been beaten out of them, or were they simply suppressed? To find out, the researchers relied on the African American English sociolect (AAE), which originated during the period when African Americans were kept as slaves and has persisted and evolved since...
The researchers came up with pairs of phrases, one using standard American English and the other using patterns often seen in AAE and asked the LLMs to associate terms with the speakers of those phrases. The results were like a trip back in time to before even the earliest Princeton Trilogy, in that every single term every LLM came up with was negative. GPT2, RoBERTa, and T5 all produced the following list: "dirty," "stupid," "rude," "ignorant," and "lazy." GPT3.5 swapped out two of those terms, replacing them with "aggressive" and "suspicious." Even GPT4, the most highly trained system, produced "suspicious," "aggressive," "loud," "rude," and "ignorant."
https://arstechnica.com/ai/2024/08/llms-have-a-strong-bias-against-use-of-african-american-english/
@[email protected] @[email protected] @[email protected]