I see that openai.com/gptbot is crawling my blog, top to bottom, side to side.
-
@datarama @timbray Yeah, you're not wrong. I was also looking into how best to stop these data-hungry LLM overlords from training on my writing without consent.
I believe:
- on the technical side, you can try a robots.txt, or blocking user agents or IP ranges
- on the license side, I'm also going with CC BY-NC-SA, but I can't say with certainty that this will fly in practice.
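For the robots.txt route, here is a minimal sketch of what the opt-out could look like and how to sanity-check it with Python's standard-library parser. It assumes `GPTBot` is the user-agent token the crawler announces; the example.org URLs and the `is_allowed` helper are made up for illustration. It only shows what a well-behaved crawler should conclude and does nothing about crawlers that ignore the file.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt: shut out GPTBot entirely, leave everyone else alone.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a crawler honouring robots_txt may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical blog URL, used only to exercise the rules.
    url = "https://example.org/posts/hello"
    print("GPTBot allowed:", is_allowed(ROBOTS_TXT, "GPTBot", url))
    print("Other agent allowed:", is_allowed(ROBOTS_TXT, "SomeFeedReader", url))
```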
-
@djh @timbray I've toyed with the idea of robots.txt *and* matching user agents or IP ranges: if they don't abide by robots.txt, they get a bunch of garbage instead of actual content. (Perhaps an LLM-generated bunch of bullshit based on the actual content?)
But they'll just fake their user agents.
It's depressing, really. You can't have a commons in the middle of a toxic open-pit mine.
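A rough sketch of that "garbage for misbehaving crawlers" idea, purely as an illustration: the token list, the IP range (192.0.2.0/24 is a reserved documentation range, not a real crawler range) and the function names are all assumptions. It also shows the weakness: a crawler sending a browser-like user agent from an unlisted address sails straight through.

```python
import ipaddress

# Assumed values for illustration only; real tokens and ranges would have
# to come from the crawler operator's own published documentation.
BLOCKED_AGENT_TOKENS = ("gptbot",)
BLOCKED_NETWORKS = (ipaddress.ip_network("192.0.2.0/24"),)


def looks_like_blocked_crawler(user_agent: str, remote_ip: str) -> bool:
    """Match on a user-agent substring or on membership in a blocked range."""
    ua = (user_agent or "").lower()
    if any(token in ua for token in BLOCKED_AGENT_TOKENS):
        return True
    address = ipaddress.ip_address(remote_ip)
    return any(address in network for network in BLOCKED_NETWORKS)


def choose_body(user_agent: str, remote_ip: str, real: str, decoy: str) -> str:
    """Serve the decoy to matched crawlers and the real post to everyone else."""
    if looks_like_blocked_crawler(user_agent, remote_ip):
        return decoy
    return real


if __name__ == "__main__":
    real, decoy = "actual post", "plausible-looking nonsense"
    print(choose_body("Mozilla/5.0 (compatible; GPTBot/1.0)", "198.51.100.7", real, decoy))
    # A spoofed user agent from an unlisted address still gets the real post.
    print(choose_body("Mozilla/5.0 (X11; Linux x86_64)", "198.51.100.7", real, decoy))
```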
-
@djh @timbray On the license side: I live in an EU country, where the law explicitly says that 1. data mining and ML for commercial products *must* respect a "machine-readable opt-out" (e.g. https://www.w3.org/community/reports/tdmrep/CG-FINAL-tdmrep-20240202/), and 2. ML for academic research purposes doesn't have to respect copyright at all.
This, I guess, is why these companies love having a "non-profit research lab" to do all the actual scraping for them.
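For what such a "machine-readable opt-out" might look like in practice, here is a small sketch based on my reading of the TDMRep report linked above, so check the field names against the report before relying on it. As I understand it, a site can publish a JSON file at `/.well-known/tdmrep.json` listing rules with `location`, `tdm-reservation` and an optional `tdm-policy`. The `build_tdmrep` helper is hypothetical.

```python
import json
from typing import Optional


def build_tdmrep(location: str = "/", policy_url: Optional[str] = None) -> str:
    """Build the JSON text for an assumed /.well-known/tdmrep.json file.

    Field names follow my reading of the TDMRep report; "tdm-reservation": 1
    is meant to signal that text-and-data-mining rights are reserved.
    """
    rule = {"location": location, "tdm-reservation": 1}
    if policy_url is not None:
        # Optional pointer to a policy document; omitted when not given.
        rule["tdm-policy"] = policy_url
    return json.dumps([rule], indent=2)


if __name__ == "__main__":
    # Reserve rights for the whole site, with no separate policy document.
    print(build_tdmrep("/"))
```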