Was talking to someone about #BlueSky the other day, and how they apparently used some sort of #AI for #moderation.
-
Emelia 👸🏻 replied to Raccoon at TechHub :mastodon: last edited by
@Raccoon We have basically already done this (though we're not doing AI right now; it is a possibility, but it'll always be with instance operators'/owners' consent)
https://about.iftas.org/activities/moderation-as-a-service/content-classification-service/
-
Emelia 👸🏻 replied to Raccoon at TechHub :mastodon: last edited by
@Raccoon I don't think we necessarily need LLMs, as machine learning models or naive bayesian classifiers would probably cover most things without the huge expense & inefficiency of LLMs.
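For illustration, a minimal sketch of that kind of classifier in Python with scikit-learn; the library choice, training data, and threshold are my own assumptions, not anything IFTAS has built:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny illustrative training set: posts that drew no reports vs. reported ones.
ok_posts = ["thanks for the boost!", "new photos from my garden walk"]
reported_posts = ["example of a post that drew reports", "another reported example"]

texts = ok_posts + reported_posts
labels = [0] * len(ok_posts) + [1] * len(reported_posts)

# Bag-of-words features + multinomial naive Bayes: cheap to train and to run.
model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
model.fit(texts, labels)

# Score a new post; anything above a tuned threshold goes to the mod queue.
prob_reportable = model.predict_proba(["some new post to check"])[0][1]
if prob_reportable > 0.8:
    print("queue for human review")
```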
-
Artemesia replied to Raccoon at TechHub :mastodon: last edited by
Smells a bit like a solution in search of a problem, and could readily end up with an enshittified feed of flagging, analogous to what Google did to what was previously a good search product. Why not do tiered heuristics instead, feeding off keywords or key phrases, and mapping the application of tiers to past needed moderation of that user, plus user characteristics such as recency of signup, follows/follower profile, and quality of followers? For instance, a well-established user who has never needed past moderation would have their posts profiled only against tier 1, while a more dubious user might have posts profiled against tiers 1-4 (out of a hypothetical 5). And/or one could silo the application of keyword/phrase groups to users by topic: one might mark a certain user as suspect on racism, or sexism, or transphobia, etc.
One could also add some ad hoc rules; for instance, a bloomscrolling-type hashtag post that mentions a politician or current event gets flagged.
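For illustration, a minimal sketch of those tiered heuristics in Python; the tier patterns, thresholds, and user fields are all hypothetical:

```python
import re

# Hypothetical tier -> pattern lists; real lists would come from mod experience.
TIER_PATTERNS = {
    1: [r"\bexample-slur\b"],               # tier 1: unambiguous keywords
    2: [r"\bexample dog whistle\b"],        # higher tiers: more borderline
    3: [r"\bexample borderline phrase\b"],
}

def tiers_for(user: dict) -> list[int]:
    """Map a user's risk profile to the tiers applied to their posts."""
    established = user["account_age_days"] > 365 and user["past_mod_actions"] == 0
    return [1] if established else sorted(TIER_PATTERNS)

def check(user: dict, post: str):
    for tier in tiers_for(user):
        for pattern in TIER_PATTERNS[tier]:
            if re.search(pattern, post, re.IGNORECASE):
                return tier   # surface which tier tripped, for the mod queue
    return None
```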
-
Raccoon at TechHub :mastodon: replied to John Timaeus last edited by [email protected]
@johntimaeus @munin @reedmideke
Yeah, I'm not talking about ChatGPT here; I'm talking about a lightweight text analyzer that looks for patterns based on a set of posts that wouldn't get reported versus posts that are questionable enough that we would want them to be reported. The "ignore previous instructions" trick only works on ChatGPT and its knockoffs specifically; most LLMs don't even attempt to implement the concept of "instructions". It wouldn't be that simple to manipulate, unless you are specifically structuring a post that would get reported in such a way that it doesn't get flagged... in which case it will get reported anyway, because someone will see it and report it.
This would be an extension of the current automoderator systems on here, which just look for specific keywords, like the N-word or "DEI hire". (Note that they don't have to be things you can't say, just things that regularly show up in posts that break the rules.)
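As a sketch of what that extension could look like, here is a weighted keyword scorer instead of a hard blocklist; the keywords, weights, and threshold are hypothetical:

```python
# Keywords weighted by how often they show up in posts that break the rules.
KEYWORD_WEIGHTS = {
    "dei hire": 0.9,    # strong signal in past reported posts
    "go back to": 0.5,  # weaker signal on its own
}

def report_score(post: str) -> float:
    """Sum the weights of every flagged keyword present in the post."""
    text = post.lower()
    return sum(weight for kw, weight in KEYWORD_WEIGHTS.items() if kw in text)

# Anything over a tuned threshold lands in the queue for a human moderator.
if report_score("example post text here") >= 0.8:
    print("flag for moderator review")
```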
-
mekka okereke :verified: replied to Raccoon at TechHub :mastodon: last edited by
This is the kind of thing that sounds like a good idea to people who don't talk to enough Black people in tech.
The paradox of almost every ML-based moderation system in existence:
* Black women receive the most abuse online
* ML systems disproportionately false-positive statements by Black women, and disproportionately false-negative abuse against Black women

Similarly, facial recognition systems are most used against Black folk, and get the most false positives on Black folk.
1/N
-
Raccoon at TechHub :mastodon: replied to mekka okereke :verified: last edited by [email protected]
@mekkaokereke
Going to let you give your longer response, because these are definitely good thoughts, and I am familiar with Timnit Gebru's work on the subject. Just wanted to point out that current, non-AI systems already flag perfectly appropriate posts by Black people, mainly those using the N-word in an appropriate context. Extending them with this might actually make them less likely to flag those.

Continue though, because this is definitely worth thinking about.
-
mekka okereke :verified: replied to mekka okereke :verified: last edited by
I posted this after the Perspective toxicity API was first released.
Other gems from the initial launch:
"Police don't kill too many Black kids."
Score: Not toxic. ️"Police kill too many Black kids.
Score: 80.28% toxic."I'll never vote for Bernie Sanders until he apologizes to black women."
Score: 71.43% toxic. ️"South Carolina voters are low information people."
Score: Not toxic"Elizabeth Warren is a snake."
Score: Not toxic2/N
-
Raccoon at TechHub :mastodon: replied to Artemesia last edited by [email protected]
@artemesia
That's a good point. The current systems I've been looking at just check for specific words, but it would be very useful to check for words like "Harris", "Trump", "Genocide", and "MutualAid" on the bloomscrolling hashtag, just because the whole point of that tag is to have something that isn't depressing.

On top of that, I would definitely want any post that has both the MutualAid and KamalaHarris hashtags flagged.
AI wouldn't be good for that, but a more detailed automatic flagging script would.
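A quick sketch of those rules as a script; the tag names and term list come from the examples above, while the helper and its logic are hypothetical:

```python
# Ad hoc hashtag rules: political terms on the feel-good tag, and a
# specific tag combination, both get flagged for review.
POLITICAL_TERMS = {"harris", "trump", "genocide", "mutualaid"}

def should_flag(hashtags: set[str], text: str) -> bool:
    tags = {t.lower() for t in hashtags}
    body = text.lower()
    # Rule 1: political terms on the bloomscrolling hashtag.
    if "bloomscrolling" in tags and any(term in body for term in POLITICAL_TERMS):
        return True
    # Rule 2: the MutualAid + KamalaHarris combination.
    if {"mutualaid", "kamalaharris"} <= tags:
        return True
    return False

# Example: should_flag({"bloomscrolling"}, "my thoughts on Trump") -> True
```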
-
Raccoon at TechHub :mastodon: replied to mekka okereke :verified: last edited by [email protected]
@mekkaokereke
I remember that, yeah, and this was the system they insisted could replace actual moderation... not just that, but that not using this system was "completely irresponsible", because of all the things your actual moderators would not be able to catch or provide "unbiased decisions" for.
-
mekka okereke :verified: replied to mekka okereke :verified: last edited by
When someone tells me they're going to use ML for moderation, or for flagging toxic posts, I ask which model they're going to use, and what info the model is going to do inference on.
If the input doesn't include the relationship between the two people and the community it's being said in, then it's impossible not to get many false positives. There is not enough context to do reliable inference on just a short text sample.
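To make that concrete, a sketch of the inputs such a model would need beyond the raw text; every field here is hypothetical, not any real system's schema:

```python
from dataclasses import dataclass

@dataclass
class ModerationInput:
    text: str                    # the post itself: not enough on its own
    author_follows_target: bool  # mutual follows suggest banter, not abuse
    target_follows_author: bool
    prior_interactions: int      # history between the two accounts
    community: str               # instance and hashtag norms differ widely
    author_locale: str           # the same word lands differently by region

# A model fed only `text` can't tell friendly ribbing from drive-by abuse;
# the relationship and community fields are where that signal lives.
```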
mekka okereke :verified: (@[email protected])
@[email protected] @Paxxi @timbray The "C-word" is one of the most offensive words in the US. It's often used in the vilest, most misogynistic contexts. But it's less offensive in Australia? And when combined with other words and contexts, the meaning changes a lot. When an AUS friend told someone "Mekka's a hard c-word!" that was in reference to me being our rugby team's enforcer. When they said "Oh you're a sick c-word now!" that means I lost weight but stayed muscular. Both intended as compliments.🤷🏿♂️
3/N
-
mekka okereke :verified: replied to mekka okereke :verified: last edited by
So no, I don't like ML for moderation.
I could like it in theory, but in practice I rarely see implementations that:
a) include enough context
b) do not amplify the very problem experienced by the most vulnerable users

4/4
-
Raccoon at TechHub :mastodon: replied to Emelia 👸🏻 last edited by [email protected]
@thisismissem
This looks useful; some sort of program to streamline reporting of CSAM crosses my mind every time I come across that content (which, to be fair, is maybe once a month at most). I was thinking more of something that would look over new posts, run them through a quick algorithm, and decide if maybe a moderator should glance at them.

I would have to build a data set, create said AI, and do some testing to know if it's worth it. I don't see myself bothering anytime soon, as there are automated systems that are far simpler to implement and far more effective which we haven't implemented yet.
-
Artemesia replied to Raccoon at TechHub :mastodon: last edited by
Yeah, that's why I'd look at doing layers of regexp rules mapped to users' risk profiles, both for the apparent quality of their account and for topical marking (binning, if you like) based on dubious past posts. Say something borderline transphobic? Guess what: you didn't get modded because you stayed just under the line, but you're now getting the transphobia-flagging regexps applied to your future posts. Etc., etc. for other categories of behavior needing mod intervention. Way more powerful and tunable than trying to come up with perfect LLM prompts. More power-efficient, too.
Another thought: it would be wonderful if one could save the mods some schlep-work by backfeeding the user's past N posts through the regexp heuristics. Saves one having to manually poke through their history.
Another ad hoc rule: a high percentage of a user's replies (above some threshold count) have been to a single user who is not responding positively. Defeats 1-2 post/week sealioning where no single post is objectionable.
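A sketch of that last rule, with hypothetical thresholds a mod team would tune:

```python
from collections import Counter

def sealion_check(replies: list[tuple[str, bool]], min_count: int = 20,
                  ratio: float = 0.8):
    """replies: one (target_account, target_responded_positively) per reply."""
    if len(replies) < min_count:
        return None   # not enough replies yet to judge
    targets = Counter(target for target, _ in replies)
    top_target, count = targets.most_common(1)[0]
    positive = any(pos for target, pos in replies if target == top_target)
    if count / len(replies) >= ratio and not positive:
        return top_target   # surface to mods: possible slow-drip sealioning
    return None
```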
-
Raccoon at TechHub :mastodon: replied to mekka okereke :verified: last edited by
@mekkaokereke
Thanks for that long response; you brought up a good concern that I wasn't thinking about, but have read about from past work in the field.

Obviously, any implementation we have needs to keep this in mind, and I will note that current systems which flag slurs already end up flagging posts by Black and queer people using the N-word and the F-word respectively, which is not the kind of thing we are looking to catch with this. I'm well aware of the issue of these AI systems seeing different styles of communication and deciding differently based on that.
That said, this feels like a stronger argument against letting it run unsupervised than against using it as a flagging system in general: if we all know it's an automated system, that it's fallible, and that its point is to make sure we see things and look them over, not to tell us they're bad, then in theory we should be providing fair moderation.
(Continued)
-
Raccoon at TechHub :mastodon: replied to Raccoon at TechHub :mastodon: last edited by
@mekkaokereke
Something that made me really curious, though, is that you implied some tools work better than others for this. Obviously the one you quoted was an example of a bad one, but could you give me any examples or information about ones you think went down the right route?

Right now this is very much speculation, because again, I don't plan to implement this anytime soon, but it would be useful to know how we might avoid these pitfalls in the speculation phase, in case someone actually decides to do it.
-
Raccoon at TechHub :mastodon: replied to Artemesia last edited by
@artemesia
Oooh, interesting, this could be that "lower level" we keep saying would be useful: having a bot watch specific people more closely for posts that cross the line would not only be helpful, but would mean that we as moderators don't have to keep looking back at them. We could even set it to automatically unflag them after a certain period of time...

Definitely a good approach to be considering!
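A minimal sketch of that auto-expiring watchlist; the 90-day window and in-memory storage are hypothetical:

```python
import time

WATCH_TTL_SECONDS = 90 * 24 * 3600   # watch entries lapse after ~90 days
watchlist: dict[str, float] = {}     # account -> time it was added

def watch(account: str) -> None:
    watchlist[account] = time.time()

def is_watched(account: str) -> bool:
    added = watchlist.get(account)
    if added is None:
        return False
    if time.time() - added > WATCH_TTL_SECONDS:
        del watchlist[account]   # auto-unflag once the window passes
        return False
    return True
```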