finally set up hard blocks for ai crawlers on one of my servers and we are now serving a thousand 403's a minute
-
replied to jonny (good kind) last edited by
@jonny nice... How'd you do it? User agent?
-
replied to jonny (good kind) last edited by
just user agents for tonight, but i'm logging IPs and will set up the next phase of user agent -> IP blocks tomorrow. they have been spreading requests out over IPs and time such that they aren't tripping my fail2ban rules, but i'm on to them.
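(a rough sketch of what that user agent -> IP phase could look like, assuming a stock nginx/apache "combined" access log; the log path and the crawler UA list here are placeholders, not the actual setup:)

```python
# scan an access log for requests from known AI-crawler user agents
# and collect the source IPs for a later firewall block list.
import re
from collections import Counter

LOG_PATH = "/var/log/nginx/access.log"  # placeholder path
BAD_UA_PATTERNS = [
    re.compile(p, re.IGNORECASE)
    for p in (r"GPTBot", r"ClaudeBot", r"CCBot", r"Bytespider", r"Amazonbot")
]

# matches the common "combined" log format:
# IP - - [time] "request" status bytes "referer" "user-agent"
LINE_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d+ \S+ "[^"]*" "([^"]*)"')

def crawler_ips(log_path: str) -> Counter:
    hits: Counter = Counter()
    with open(log_path, errors="replace") as log:
        for line in log:
            m = LINE_RE.match(line)
            if not m:
                continue
            ip, ua = m.group(1), m.group(2)
            if any(p.search(ua) for p in BAD_UA_PATTERNS):
                hits[ip] += 1
    return hits

if __name__ == "__main__":
    # one IP per line, ready to feed into an ipset/iptables drop list
    for ip, count in crawler_ips(LOG_PATH).most_common():
        print(f"{ip}  # {count} crawler requests")
```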
-
replied to Kip Van Den Bos last edited by
@tobi yaya just user agent for now, but my git server was being so laggy from them crawling it that i'm like it's war now https://neuromatch.social/@jonny/113922213999005285
-
replied to jonny (good kind) last edited by
@jonny I'm considering if we should hardcode some UA blocks in GtS as well. We already follow most of the "recommended" stuff to politely stop crawling but I doubt politeness will keep cutting the mustard.
-
replied to Kip Van Den Bos last edited by
@tobi i never really understood why robots.txt was a thing, why ask nicely when you can just refuse to serve?
-
replied to jonny (good kind) last edited by
@jonny Feed them Markov spam instead of 403s. 403 will just make them switch UAs. Markov spam will rapidly accelerate model collapse.
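(for reference, the kind of thing being suggested here is only a few lines; a minimal word-level Markov babbler, with the corpus file and chain order as placeholders:)

```python
# build a simple word-level Markov chain from any local text corpus and
# generate junk prose to serve to crawlers instead of real pages.
import random
from collections import defaultdict

def build_chain(text: str, order: int = 2) -> dict:
    # map each tuple of `order` consecutive words to the words that follow it
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        chain[tuple(words[i:i + order])].append(words[i + order])
    return chain

def babble(chain: dict, length: int = 200) -> str:
    # random-walk the chain, restarting at a random key if we hit a dead end
    key = random.choice(list(chain.keys()))
    out = list(key)
    for _ in range(length):
        choices = chain.get(key)
        if not choices:
            key = random.choice(list(chain.keys()))
            choices = chain[key]
        nxt = random.choice(choices)
        out.append(nxt)
        key = (*key[1:], nxt)
    return " ".join(out)

if __name__ == "__main__":
    with open("corpus.txt") as f:  # any pile of text will do
        chain = build_chain(f.read())
    print(babble(chain))
```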
-
replied to jonny (good kind) last edited by
@jonny @tobi i don't know if this is historically accurate, but my assumption was that when it's not a hard rule for security/privacy/licensing reasons, but just guidance about what's actually relevant to search engine scrapers, it's simply more convenient to do it this way. blocking would work instead, but there isn't really a motivation to take that approach when the point is to help the bots do their thing better rather than to protect the content of the servers. for some things i run, i'll have robots.txt rules filter out pages that are completely irrelevant for bots, not because i don't want them to access those pages but because they wouldn't have any use for them and it would just be wasteful
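(for context, the whole robots.txt mechanism is opt-in on the crawler's side; roughly the check a polite bot is supposed to make, with example.com and the UA string as placeholders:)

```python
# a well-behaved crawler fetches robots.txt and asks before requesting a page;
# a rude one simply never runs this check.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

print(rp.can_fetch("GPTBot", "https://example.com/some/irrelevant/page"))
```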
-
replied to {Insert Pasta Pun} last edited by
@risottobias @jonny Because blocking them tells them they need to forge their UA to evade your block. Feeding them spam doesn't.
-
replied to Cassandrich last edited by [email protected]
@dalias @jonny I think letting a spammer think "oh, I solved it, they block on IPs" when it suddenly lets them through, while only messing with 1 in 100 visits, makes for a thorny, unreproducible, quiet poisoning
Being very obvious doesn't stall them for long
Optimize for making them debug a site that they don't know is actually poisoning their model
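(a sketch of that quiet 1-in-100 idea as WSGI middleware; the UA regex, the 1% rate, and make_spam() are all stand-ins for whatever a real setup would use:)

```python
# for requests whose User-Agent matches a crawler list, occasionally swap the
# real response for spam; everything else passes through untouched.
import random
import re

CRAWLER_UA = re.compile(r"GPTBot|ClaudeBot|CCBot|Bytespider", re.IGNORECASE)
POISON_RATE = 0.01  # mess with roughly 1 in 100 crawler requests

def make_spam() -> bytes:
    # stand-in for the markov babbler above
    return b"<html><body><p>placeholder spam</p></body></html>"

class QuietPoison:
    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "")
        if CRAWLER_UA.search(ua) and random.random() < POISON_RATE:
            body = make_spam()
            start_response("200 OK", [("Content-Type", "text/html"),
                                      ("Content-Length", str(len(body)))])
            return [body]
        return self.app(environ, start_response)

if __name__ == "__main__":
    from wsgiref.simple_server import make_server

    def real_app(environ, start_response):
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b"the real page\n"]

    make_server("127.0.0.1", 8000, QuietPoison(real_app)).serve_forever()
```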
-
replied to Cassandrich last edited by
@dalias @hanscees @jonny I think these sorts of tools? https://tldr.nettime.org/@asrg/113867412641585520
-
replied to jonny (good kind) last edited by
@qrazi @dalias I can't tell if it would be better to try to do gradient descent and generate text that would zero out the weight vectors, or to pick one topic area and go hard at poisoning one very localized neighborhood of the latent space, like a roadblock, since you can't possibly serve enough text to compete with the sum of written language