Was talking to someone about #BlueSky the other day, and how they apparently used some sort of #AI for #moderation.
-
Raccoon at TechHub :mastodon: replied to Jack :penelope_happy: last edited by
@jglypt
There IS a part of the moderation interface which tells us how many reports have been sent about a user, if they have "strikes", and lets us look back at everything quickly. Basically, when we look at you, we see this...
-
mekka okereke :verified: replied to Jon last edited by
Yup. It's a really hard problem. Perspective has made huge strides in this area, but problems persist.
If I describe a man as "beast mode," or "buff," or "swole," most white people know what I'm talking about. But if I describe a man as "sidewalk cracking," most white folk don't know what I'm talking about. That slang hasn't entered the white lexicon yet.
Non-toxic:
My N-word
My buff N-word
My beast mode N-word

76.52% toxic:
My sidewalk cracking N-word

1/N
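The scores above are the shape of output you get from Google's Perspective API (the TOXICITY attribute mentioned here). A minimal sketch of querying it, assuming you have a valid API key; the endpoint and response field names follow Google's public Perspective documentation:

```python
import json
import urllib.request

# Perspective's comment-analysis endpoint (key passed as query param).
API_URL = ("https://commentanalyzer.googleapis.com/v1alpha1/"
           "comments:analyze?key={key}")

def build_request(text: str) -> dict:
    """Build an analyze request asking only for the TOXICITY attribute."""
    return {
        "comment": {"text": text},
        "requestedAttributes": {"TOXICITY": {}},
    }

def toxicity(text: str, api_key: str) -> float:
    """Return Perspective's summary TOXICITY probability (0.0-1.0)."""
    body = json.dumps(build_request(text)).encode("utf-8")
    req = urllib.request.Request(
        API_URL.format(key=api_key),
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        scores = json.load(resp)
    # e.g. 0.7652 for the "sidewalk cracking" example above
    return scores["attributeScores"]["TOXICITY"]["summaryScore"]["value"]
```

As the thread points out, the score is computed on the text alone: there is no way to pass in the author's relationship to the person described, which is exactly why in-group slang gets misscored.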
-
mekka okereke :verified: replied to mekka okereke :verified: last edited by
People can infer that "sidewalk cracking" means "so heavy that the man cracks the pavement when he walks," but it's not clear from that alone that the term describes someone who is both extremely muscular and very large, and that it's intended as the highest compliment for a bodybuilder.
Without being able to pass in this context, and the relationship between the author and the person being described, it's not possible to get a reliably accurate toxicity score.
2/2
-
Raccoon at TechHub :mastodon: replied to Andrei Kucharavy last edited by [email protected]
@andrei_chiffa @jerry
We already get reports from within, and outside of, our instances. If moderation is overwhelmed, that means said trusted users should be brought in as moderators. This would be in addition to current systems, including bots which already exist to auto-flag posts based on words.
-
Raccoon at TechHub :mastodon: replied to Emelia 👸🏻 last edited by
@thisismissem
I wasn't talking about LLMs, I'm talking about the smaller ones that we can train on a decent computer and run in a background process on existing servers.
-
theothertom replied to Raccoon at TechHub :mastodon: last edited by
@Raccoon I voted “use for flagging”, based on imagining a system similar to SpamAssassin rules - possibly even around the instance as a whole, not just the user individually.
Something I did wonder, though: how does moderator workload grow in response to growth of the individual instance vs. the broader network?
Is encouraging lots of smaller instances something that would keep the burden manageable?
-
Kee Hinckley replied to mekka okereke :verified: last edited by
@mekkaokereke @Raccoon I absolutely agree. The lack of context is why Meta has such a huge false positive problem. Even their human reviewers don't get any larger context about relationships and adjacent behavior. And their human reviewers are almost always from completely different cultures, so they don't get that context either. Then throw in their move to using translation rather than native language speakers...
And to truly understand context, you have to understand punching up vs. punching down. Try selling that in an algorithm.
I think ML has a role. Especially for flagging potential issues before they are human-reported. But the actual decision has to be made by a human who has all the context. That doesn't scale in monolithic social networks. I'm not sure it scales in distributed ones either, but it has to be the goal.
-
Raccoon at TechHub :mastodon: replied to mekka okereke :verified: last edited by
@mekkaokereke @jdp23 @mattdm
Question... What if we were the ones to train the AI, with posts that moderators actively thought were either over the line or acceptable, and we made sure to include a diverse group of people?
If someone ever were to do this, I'd hope they'd source examples of racist and non-racist posts directly from Black Fedi's moderation teams. I'd also hope they were looking for very overt examples, the kind a moderator would read and say, "that's not appropriate", as opposed to trying to gauge vague metrics like "toxicity".
To give an example of the kind of posts I'd feed it...
Good Post: "Queer people want equal rights."
Bad Post: "Queer people want special rights."
Good Post: "Police kill too many disabled people."
Bad Post: "Police don't kill enough disabled people."
...so the kind of stuff that a simple word-checker wouldn't find, but that is obviously inappropriate.
How would you feel about that approach?
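A minimal sketch of what training on such moderator-labelled pairs could look like. Scikit-learn is my choice for illustration, not anything specified in the thread, and the four example posts above stand in for what would need to be a corpus of thousands of labelled posts:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training set taken from the examples in the thread.
posts = [
    "Queer people want equal rights.",
    "Queer people want special rights.",
    "Police kill too many disabled people.",
    "Police don't kill enough disabled people.",
]
labels = ["ok", "flag", "ok", "flag"]

# Word *and* bigram features, so phrase arrangement ("special rights",
# "don't kill enough") carries signal a bare word-checker would miss.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), lowercase=True),
    LogisticRegression(),
)
model.fit(posts, labels)

def score(text: str) -> float:
    """Probability that `text` should be flagged for human review."""
    flag_idx = list(model.classes_).index("flag")
    return model.predict_proba([text])[0][flag_idx]
```

The bigram features are what let the model separate "equal rights" from "special rights" even though every individual word is innocuous; with only four examples the probabilities are meaningless, which is the data-volume objection raised downthread.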
-
Matthew Miller replied to Raccoon at TechHub :mastodon: last edited by
Maybe some, but that's not _nearly_ enough data to do the LLM trick. This could be used as an additional training layer (or something like InstructLab), but the underlying biases of the model will still be there.
And, how to keep up with evolving language, group-specific jargon and slang, and so on?
-
Raccoon at TechHub :mastodon: replied to Matthew Miller last edited by
@mattdm @mekkaokereke @jdp23
Once again, we are NOT talking about LLMs. They are far too large and inefficient to be feasible, and not what we need here: this is smaller, phrase-analysis type stuff, where you train it on phrases and it searches for patterns that match those phrases. The only LLM that might come up is if we make one to generate posts for it to learn from.
If language evolves to the point that it needs updating, we just add new posts to the training data and it learns from those.
Another example that just occurred to me is people calling Kamala Harris a "DEI Hire", which is clearly racist/sexist, so moderation should know about it. Searching for words doesn't help here, because they can be said different ways and 90% of the time they're not inappropriate. An AI might be good here, because it's able to take the arrangements of these words into account in a way that might pick up on inappropriate phrasing.
-
Matthew Miller replied to Raccoon at TechHub :mastodon: last edited by
I'd be interested to see if you can get better results with that than with an old-school rules-based approach. My intuition is "probably not".
I mention LLMs because... 1) I think you need a pretty big model to get genuinely useful results from arbitrary posts and 2) that _is_ what we're actually testing.
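For comparison, the "old-school rules-based approach" could look like a SpamAssassin-style weighted rule list: each matching pattern adds a score, and posts over a threshold go to a human. The patterns and weights here are placeholders, not a proposed ruleset:

```python
import re

# SpamAssassin-style rules: each hit adds its weight; posts whose
# total crosses FLAG_THRESHOLD get queued for a human moderator.
# These patterns/weights are illustrative placeholders only.
RULES = [
    (re.compile(r"\bdei hire\b", re.I), 3.0, "DEI_HIRE_PHRASE"),
    (re.compile(r"\bspecial rights\b", re.I), 2.0, "SPECIAL_RIGHTS"),
    (re.compile(r"\bgo back to\b", re.I), 2.5, "GO_BACK_TO"),
]
FLAG_THRESHOLD = 2.0

def score_post(text: str):
    """Return (total_score, matched_rule_names) for a post."""
    hits = [(w, name) for pat, w, name in RULES if pat.search(text)]
    return sum(w for w, _ in hits), [name for _, name in hits]

def should_flag(text: str) -> bool:
    total, _ = score_post(text)
    return total >= FLAG_THRESHOLD
```

The named rules make every flag auditable ("flagged because DEI_HIRE_PHRASE"), which is a genuine advantage over a trained model; the trade-off is that someone has to hand-write and maintain the rules as slang evolves.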
-
Raccoon at TechHub :mastodon: replied to Kee Hinckley last edited by
@nazgul @mekkaokereke
Yeah, two things I want to point out about what you said there. One is that you're talking about an AI that was built by Meta, a company with completely different goals from Fediverse moderators: we would end up with a completely different AI simply because we have different goals, a major one being an early warning system for racist / sexist / otherwise bigoted comments, which they've made clear they aren't interested in.
Also, you're talking about moderators being from completely different language backgrounds, which is definitely a problem that once again comes down to their goals versus ours, and I don't believe it's a problem that we have.
I think part of the problem is that we're trying to directly compare it to things that are done by corporations like Meta, bad actors in general, but something that we make, like I'm describing, would come out entirely different.
-
Raccoon at TechHub :mastodon: replied to Matthew Miller last edited by
@mattdm @mekkaokereke @jdp23
Maybe eventually I will try this out. Right now, I'm busy trying to tighten up our existing systems, which I'd like to build some very basic automation into, and build some basic resources for other servers.
If I come to a point where I feel like there's not enough coverage and everything else is already implemented, maybe I'll look back into this and build something, just as an experiment.
-
mekka okereke :verified: replied to Raccoon at TechHub :mastodon: last edited by
Unfortunately, the goals don't matter if your training corpus is incomplete. There are not enough Black users talking to Black users on the Fediverse to even train a hypothetical model that could capture some of that context. Entire in-groups of Black users are missing. E.g., Facebook has multiple Black Farmers groups, Black Fathers groups, Black Fishermen's groups, etc. And that's just the Black "F's."
Good intentions are not enough. Data without intention, also not enough.
-
mekka okereke :verified: replied to mekka okereke :verified: last edited by
I guess my question remains... why do you want to use ML for this? To bring down costs? Or to increase moderation effectiveness?
If to increase effectiveness, I don't think that's going to happen.
If to bring down costs, how much of a drop in accuracy are you willing to accept for that? (And I'm not sure that costs would even decrease once you add in false positive flags).
I think the best use case is "sparkling grep" as in " hey human mod, look at this interaction."
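The "sparkling grep" use case is essentially a priority queue that only ever surfaces interactions to a human moderator and never acts on a post itself. A sketch, with the threshold and the source of the score left as assumptions:

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class ReviewItem:
    priority: float                       # sort key (negated score)
    post_id: str = field(compare=False)
    reason: str = field(compare=False)    # e.g. which rule/model fired

class ReviewQueue:
    """'Sparkling grep': ML only *surfaces* interactions; any action
    is taken (or not) by a human moderator who has the context."""

    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold
        self._heap = []

    def submit(self, post_id: str, score: float, reason: str) -> bool:
        """Queue a post for human review if its score clears the bar.
        Returns True if queued; never touches the post itself."""
        if score < self.threshold:
            return False
        # Negate the score so the highest-scoring post pops first
        # from Python's min-heap.
        heapq.heappush(self._heap, ReviewItem(-score, post_id, reason))
        return True

    def next_for_review(self):
        """Highest-priority post id, or None if the queue is empty."""
        return heapq.heappop(self._heap).post_id if self._heap else None
```

Carrying a `reason` with each item matters for the false-positive cost raised above: the moderator can see at a glance *why* the system flagged something before deciding it was noise.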
-
William Pietri replied to Raccoon at TechHub :mastodon: last edited by
@Raccoon Having worked on this professionally, I think that's wildly optimistic. Mastodon itself isn't "entirely different" from pre-Musk Twitter; it's mildly better in some ways, notably worse in others. And mostly just the same.
I'll particularly note the lack of diversity here. Even as a white guy, or perhaps especially as one, I'm skeptical that the privileged-white-guy-technologist-heavy Mastodon culture could build good tools for detecting racism and sexism.
@nazgul @mekkaokereke
-
Raccoon at TechHub :mastodon: replied to mekka okereke :verified: last edited by
@mekkaokereke @nazgul
This would bring up costs, because even though it wouldn't be as heavy as a whole server, it's still an ML algorithm.
But yeah, you've made a very good case against "take action and then allow appeal/override by human moderators". Personally, that's something I would never argue for: I just put it in there because I was curious to see if anyone would vote for that.
But I would agree with the last statement, that this could potentially be a good way to bring moderator attention to interactions that need it.
I guess whoever did this would have to figure out some way to get a lot of good posts to train this thing on, even if it's just people submitting chat logs.
I'm still curious as to how well this thing could work though, even without a huge amount of data. To be honest, if we get to the point where there are so many marginalized users talking that it's getting too many false positives, that might be a win in itself.
-
mybarkingdogs replied to William Pietri last edited by
@williampietri @Raccoon @nazgul @mekkaokereke This. Honestly I'd say the only *good* use of ML in moderation would be an exact-word "extra set of eyes" for immediate referral to a human mod, as a very limited first-pass filter - e.g. common slurs, terms that reference abuse material, spammer/SEO keywords.
It would need a human mod check to be certain that the slurs weren't reclaimed, or to tell the difference between someone talking about their own victimization vs. posting abuse fetishism, obviously.
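That limited first pass could be as simple as an exact-term watchlist that emits a referral (with the caveat attached) rather than any kind of verdict; the watchlist entries below are placeholders for a real, moderator-curated list:

```python
# First-pass filter: exact terms only *refer* a post to a human
# moderator; a match is never treated as a judgment on the poster.
# WATCHLIST entries are placeholders, not real terms.
WATCHLIST = {"slur1", "slur2", "seo-spam-term"}

def first_pass(post_id: str, text: str):
    """Return a referral dict for human review, or None if no match."""
    words = {w.strip(".,!?\"'").lower() for w in text.split()}
    matched = sorted(words & WATCHLIST)
    if not matched:
        return None
    return {
        "post_id": post_id,
        "matched_terms": matched,
        "note": ("Terms can be reclaimed, quoted, or self-referential; "
                 "a human must judge context before any action."),
    }
```

Attaching the note to every referral bakes the transparency point below into the tooling itself: the moderator is reminded, on every item, that the term is a signal to look, not a finding.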
-
@williampietri @Raccoon @nazgul @mekkaokereke One other thing that would be VERY necessary to make it better than a lot of automods: be transparent that it is in use, and certain words will get a post read by a human moderator.
Also emphasize that the terms *themselves* aren't a judgment upon the poster per se, but often appear within contexts of abuse or harassment, and that's the point of having someone take a look and be sure they aren't.