setting silly stats turned out to be an incredibly effective way to get crawlers to start respecting robots.txt: https://github.com/superseriousbusiness/gotosocial/pull/3718#issuecomment-2629328372
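(the gist, for anyone who hasn't read the PR: the instance-level counters crawlers love to scrape get replaced with noise. a minimal sketch of that idea in Go, with a hand-rolled handler; this is illustrative only, not GoToSocial's actual code:)

    // illustrative sketch, not GoToSocial's actual implementation:
    // serve randomized instance stats so scraped numbers are meaningless.
    package main

    import (
        "encoding/json"
        "math/rand"
        "net/http"
    )

    func nodeInfo(w http.ResponseWriter, r *http.Request) {
        doc := map[string]any{
            "version":  "2.0",
            "software": map[string]string{"name": "gotosocial", "version": "x.y.z"}, // placeholder version
            "usage": map[string]any{
                // the "silly stats": random counts instead of real ones
                "users":      map[string]int{"total": rand.Intn(1_000_000)},
                "localPosts": rand.Intn(1_000_000),
            },
        }
        w.Header().Set("Content-Type", "application/json")
        json.NewEncoder(w).Encode(doc)
    }

    func main() {
        http.HandleFunc("/nodeinfo/2.0", nodeInfo)
        http.ListenAndServe(":8080", nil)
    }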
-
replied to Kip Van Den Bos
some of the beseeching in this issue leans heavily on concepts like "network respectability" that are never explained or unpacked in any meaningful way: https://github.com/superseriousbusiness/gotosocial/issues/3723

It's really interesting how gts setting some randomized stats to baffle crawlers is surfacing a long-standing tension between people who want a "respectable network" and people who just want to talk shit with their friends. It's also telling that the "respectable network" / "honor system" side of the argument has to paper over the fact that robots.txt is being ignored, which is pretty ironic given that disobeying robots.txt is explicitly breaking the rules
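(for reference, the opt-out being ignored is just a plaintext robots.txt served at the instance root, something like the following; the paths here are illustrative, not GtS's actual file:)

    # https://example.instance/robots.txt (illustrative, not GtS's real file)
    # crawlers are asked to stay out of stats/discovery endpoints...
    User-agent: *
    Disallow: /nodeinfo/
    Disallow: /.well-known/nodeinfo
    # ...and "polite" crawlers are expected to honor this voluntarily.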
-
replied to Kip Van Den Bos
I think there are fundamental ideological differences at play here, and they require some thinking
-
replied to Kip Van Den Bos
specifically I think we at gts are not very interested in making software for the sorts of people who get very heated about the respectability of baffling crawlers with random stats, but I have to think more about that
-
replied to Kip Van Den Bos
@[email protected] (if it helps, the moment I read about faking statistics I was like "fuck yeah, GOOD").
I know that they're not exactly advocating for logical approaches in this context (I suspect your framing of it as an ideological thing is correct), but "you should allow them to break the rules and not do anything about it" is the type of scenario that generally leads to bullies breaking things, soooooo
-
replied to Kip Van Den Bos
anyway, it's very interesting how in these comments the onus is being put on GtS to behave itself, but not on the crawlers to crawl more politely, or to simply keep their list of GtS instances without storing or serving the stats about posts and such; never mind that perhaps the whole fundamental assumption, that mapping the fediverse is good, is actually wrong
-
replied to Kip Van Den Bos
"on the previous issue you mention that GtS explicitly opts-out of crawling on /.well-known routes. my software will resolve your nodeinfo, regardless of robots.txt rules, upon receiving content from your instance: together with instance actor and its key, upub grabs nodeinfo to be able to know about the instance software. this is not optional: due to many kinks of AP it is necessary to know where content is coming from (for example, to un-mess lemmy hashtags). to be able to ignore GtS (as you suggest on the other issue) i would still need to resolve your nodeinfo. i presume mine is not the only software doing this"
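(the flow they're describing, sketched out: on receiving content, the server fetches the sender's /.well-known/nodeinfo discovery document and follows it to learn the instance software. a rough Go sketch of that dance; not upub's actual code:)

    // rough sketch of nodeinfo discovery as described above; not upub's actual code
    package main

    import (
        "encoding/json"
        "fmt"
        "net/http"
    )

    // /.well-known/nodeinfo is a links document pointing at the real nodeinfo
    type wellKnown struct {
        Links []struct {
            Rel  string `json:"rel"`
            Href string `json:"href"`
        } `json:"links"`
    }

    type nodeInfo struct {
        Software struct {
            Name    string `json:"name"`
            Version string `json:"version"`
        } `json:"software"`
    }

    func resolveNodeInfo(host string) (*nodeInfo, error) {
        // step 1: fetch the discovery document (note: robots.txt is never consulted)
        resp, err := http.Get("https://" + host + "/.well-known/nodeinfo")
        if err != nil {
            return nil, err
        }
        defer resp.Body.Close()
        var wk wellKnown
        if err := json.NewDecoder(resp.Body).Decode(&wk); err != nil {
            return nil, err
        }
        if len(wk.Links) == 0 {
            return nil, fmt.Errorf("no nodeinfo link for %s", host)
        }
        // step 2: follow the advertised href to the actual nodeinfo document
        resp2, err := http.Get(wk.Links[0].Href)
        if err != nil {
            return nil, err
        }
        defer resp2.Body.Close()
        var ni nodeInfo
        if err := json.NewDecoder(resp2.Body).Decode(&ni); err != nil {
            return nil, err
        }
        return &ni, nil
    }

    func main() {
        // hypothetical host, for illustration only
        ni, err := resolveNodeInfo("example.instance")
        if err != nil {
            fmt.Println("error:", err)
            return
        }
        fmt.Println("instance runs:", ni.Software.Name, ni.Software.Version)
    }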
is this individual talking about a crawler they've written or like, an actual ActivityPub server implementation that federates?
because if it's the former I want to tell them to go fuck themselves (I will not do this, for the record; I'm just support on the sidelines here). Because deliberately breaking the rules and then whining about the rules you're breaking? I have a complaint bin for that: the pain box from Dune (except I stab you with the Gom Jabbar regardless of what you do with your hand).
-
replied to Asta [AMP]
@aud they're talking about a federating AP server implementation I believe...
-
replied to Kip Van Den Bos
@[email protected] ah, okay. I suppose a followup question (that doesn't really change my opinion about the need to generate fake data): is the API endpoint they're hitting explicitly part of the AP protocol?
(even if it is, generating fake data is good. I want an entire fucking OS that serves up fake data to every application unless I explicitly authorize otherwise).
-
replied to Asta [AMP]
@aud I don't think nodeinfo is an explicit part of AP, but I'd have to check; it's one of those things, like webfinger, where it's sort of expected, iirc
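(concretely, the "sort of expected" part is a discovery document at a fixed well-known path, same pattern as webfinger; an illustrative example, values made up:)

    // GET https://example.instance/.well-known/nodeinfo
    {
      "links": [
        {
          "rel": "http://nodeinfo.diaspora.software/ns/schema/2.0",
          "href": "https://example.instance/nodeinfo/2.0"
        }
      ]
    }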
-
replied to Kip Van Den Bos
@[email protected] I suspected as much. Normally I'm not one to go all "protocol doesn't say shit about this" but, well, the protocol doesn't say shit about an endpoint that is easily abused so why the fuck should you make it available.
I dunno. I guess I'm not sympathetic to the line of arguments these people are making because we're well past the point where we can assume any good intention on any developer or scraper's part (if we ever were). Too many people abusing openness to pump up their own stupid horseshit.
The less known about a server, its occupants, etc, the better. The server admin has explicitly opted to tell you to fuck off, so if you don't like that the software lets them do just that, that's your problem.