finally set up hard blocks for ai crawlers on one of my servers and we are now serving a thousand 403's a minute
-
replied to jonny (good kind) last edited by
@jonny nice... How'd you do it? User agent?
-
replied to jonny (good kind) last edited by
just user agents for tonight, but i'm logging IPs and will set up the next phase of user agent -> IP blocks tomorrow. they have been spreading requests out over IPs and time such that they aren't tripping my fail2ban rules, but i'm on to them.
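(a rough sketch of what that user agent -> IP phase could look like, assuming a stock nginx/apache "combined" access log; the log path and the crawler UA list here are placeholders, not the actual setup:)

```python
# scan an access log for requests from known AI-crawler user agents
# and collect the source IPs for a later firewall block list.
import re
from collections import Counter

LOG_PATH = "/var/log/nginx/access.log"  # placeholder path
BAD_UA_PATTERNS = [
    re.compile(p, re.IGNORECASE)
    for p in (r"GPTBot", r"ClaudeBot", r"CCBot", r"Bytespider", r"Amazonbot")
]

# matches the common "combined" log format:
# IP - - [time] "request" status bytes "referer" "user-agent"
LINE_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d+ \S+ "[^"]*" "([^"]*)"')

def crawler_ips(log_path: str) -> Counter:
    hits: Counter = Counter()
    with open(log_path, errors="replace") as log:
        for line in log:
            m = LINE_RE.match(line)
            if not m:
                continue
            ip, ua = m.group(1), m.group(2)
            if any(p.search(ua) for p in BAD_UA_PATTERNS):
                hits[ip] += 1
    return hits

if __name__ == "__main__":
    # one IP per line, ready to feed into an ipset/iptables drop list
    for ip, count in crawler_ips(LOG_PATH).most_common():
        print(f"{ip}  # {count} crawler requests")
```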
-
replied to Kip Van Den Bos last edited by
@tobi yaya just user agent for now, but my git server was being so laggy from them crawling it that i'm like it's war now https://neuromatch.social/@jonny/113922213999005285
-
replied to jonny (good kind) last edited by
@jonny I'm considering if we should hardcode some UA blocks in GtS as well. We already follow most of the "recommended" stuff to politely stop crawling but I doubt politeness will keep cutting the mustard.
-
replied to Kip Van Den Bos last edited by
@tobi i never really understood why robots.txt was a thing, why ask nicely when you can just refuse to serve?
-
replied to jonny (good kind) last edited by
@jonny Feed them Markov spam instead of 403s. 403 will just make them switch UAs. Markov spam will rapidly accelerate model collapse.
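(for reference, the kind of thing being suggested here is only a few lines; a minimal word-level Markov babbler, with the corpus file and chain order as placeholders:)

```python
# build a simple word-level Markov chain from any local text corpus and
# generate junk prose to serve to crawlers instead of real pages.
import random
from collections import defaultdict

def build_chain(text: str, order: int = 2) -> dict:
    # map each tuple of `order` consecutive words to the words that follow it
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        chain[tuple(words[i:i + order])].append(words[i + order])
    return chain

def babble(chain: dict, length: int = 200) -> str:
    # random-walk the chain, restarting at a random key if we hit a dead end
    key = random.choice(list(chain.keys()))
    out = list(key)
    for _ in range(length):
        choices = chain.get(key)
        if not choices:
            key = random.choice(list(chain.keys()))
            choices = chain[key]
        nxt = random.choice(choices)
        out.append(nxt)
        key = (*key[1:], nxt)
    return " ".join(out)

if __name__ == "__main__":
    with open("corpus.txt") as f:  # any pile of text will do
        chain = build_chain(f.read())
    print(babble(chain))
```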
-
replied to jonny (good kind) last edited by
@jonny @tobi i don't know if this is historically accurate, but my assumption was that when it's not a hard rule for security/privacy/licensing reasons, but just guidance about what's actually relevant to search engine scrapers, it's simply more convenient to do it this way. blocking would work instead, but there isn't really a motivation to take that approach when the point is to help the bots do their thing better rather than to protect the content of the servers. for some things i run, i'll have robots.txt rules filter out pages that are completely irrelevant for bots, not because i don't want them to access those pages but because they wouldn't have any use for them and it would just be wasteful
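(for context, the whole robots.txt mechanism is opt-in on the crawler's side; roughly the check a polite bot is supposed to make, with example.com and the UA string as placeholders:)

```python
# a well-behaved crawler fetches robots.txt and asks before requesting a page;
# a rude one simply never runs this check.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

print(rp.can_fetch("GPTBot", "https://example.com/some/irrelevant/page"))
```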
-
replied to {Insert Pasta Pun} last edited by
@risottobias @jonny Because blocking them tells them they need to forge their UA to evade your block. Feeding them spam doesn't.
-
replied to Cassandrich last edited by [email protected]
@dalias @jonny I think letting a spammer think "oh, I solved it, they block on IPs" when it suddenly lets them through, while only messing with 1 in 100 visits, makes for a thorny, unreproducible, quiet poisoning
Being very obvious doesn't stall them for long
Optimize for making them debug a site that they don't know is actually poisoning their model
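(a sketch of that quiet 1-in-100 idea as WSGI middleware; the UA regex, the 1% rate, and make_spam() are all stand-ins for whatever a real setup would use:)

```python
# for requests whose User-Agent matches a crawler list, occasionally swap the
# real response for spam; everything else passes through untouched.
import random
import re

CRAWLER_UA = re.compile(r"GPTBot|ClaudeBot|CCBot|Bytespider", re.IGNORECASE)
POISON_RATE = 0.01  # mess with roughly 1 in 100 crawler requests

def make_spam() -> bytes:
    # stand-in for the markov babbler above
    return b"<html><body><p>placeholder spam</p></body></html>"

class QuietPoison:
    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "")
        if CRAWLER_UA.search(ua) and random.random() < POISON_RATE:
            body = make_spam()
            start_response("200 OK", [("Content-Type", "text/html"),
                                      ("Content-Length", str(len(body)))])
            return [body]
        return self.app(environ, start_response)

if __name__ == "__main__":
    from wsgiref.simple_server import make_server

    def real_app(environ, start_response):
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b"the real page\n"]

    make_server("127.0.0.1", 8000, QuietPoison(real_app)).serve_forever()
```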
-
replied to Cassandrich last edited by
@dalias @hanscees @jonny I think these sorts of tools? https://tldr.nettime.org/@asrg/113867412641585520
-
replied to jonny (good kind) last edited by
@qrazi @dalias I can't tell if it would be better to try to do gradient descent and generate text that would zero out the weight vectors, or to pick one topic area and go hard at poisoning one very localized neighborhood of the latent space, like a roadblock, since you can't possibly serve enough text to compete with the sum of written language