I see that openai.com/gptbot is crawling my blog, top to bottom, side to side.

Tim Bray

I see that openai.com/gptbot is crawling my blog, top to bottom, side to side. I’m sure OpenAI has consulted the “Rights” link clearly displayed on every page, invoking a Creative Commons license that freely grants rights to reuse and remix but not for commercial purposes.

#genAI #llms

datarama

@timbray Oh, I'm sure it's OpenAI the nonprofit crawling your blog for academic research purposes. The totally nonprofit dataset they make will then be provided to OpenAI the company, who just happen to make a commercial product with it.

Daniel

@timbray (regardless of whether they're respecting the license) I thought Creative Commons Non-Commercial doesn't prevent models from being trained on the content under fair use? Not a lawyer, but wasn't there some outrage about this a while ago?

datarama

Daniel

@datarama @timbray Yeah, you're not wrong. I was also looking into how best to stop these data-hungry LLM overlords from training on my writing without consent.

I believe
- on the technical side you can try a robots.txt, or blocking user agents or IP ranges
- on the license side I'm also going with CC BY-NC-SA

but I can't say with certainty that this will fly in practice.
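For the robots.txt part, here is a minimal sketch of what such a rule could look like and how a well-behaved crawler would read it. It assumes `GPTBot` is the user agent token the crawler announces; the `ROBOTS_TXT` contents, the `is_allowed` helper and the example.com URLs are made up for illustration. It only shows what the file asks for, not whether any crawler actually honors it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut out one crawler by its user agent token,
# leave the site open to everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt lets user_agent fetch url.

    This mirrors the check a polite crawler performs before each request.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "SomeFeedReader"):
        print(agent, is_allowed(agent, "https://example.com/some/post"))
```

robots.txt is purely advisory, which is why it usually gets paired with server-side blocking by user agent or IP range.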

datarama

@djh @timbray I've toyed with the idea of robots.txt *and* matching user agents or IP ranges - if they don't abide by robots.txt, they get a bunch of garbage instead of actual content. (perhaps an LLM-generated bunch of bullshit based on the actual content?)

But they'll just fake their user agents.

It's depressing, really. You can't have a commons in the middle of a toxic open-pit mine.
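A rough sketch of the "serve garbage" idea, under some loud assumptions: the whole site is disallowed in robots.txt for the listed agents, so any page request from one of them already means it ignored the file. `BLOCKED_AGENT_TOKENS`, `choose_response` and `decoy_text` are hypothetical names, and the decoy here is just the real words shuffled, standing in for anything fancier.

```python
import hashlib
import random

# Hypothetical list of user agent substrings that robots.txt disallows site-wide.
BLOCKED_AGENT_TOKENS = ("gptbot",)


def decoy_text(real_content: str) -> str:
    """Return the words of real_content in a scrambled order.

    The shuffle is seeded from the content itself, so the same page always
    yields the same decoy.
    """
    words = real_content.split()
    seed = int(hashlib.sha256(real_content.encode("utf-8")).hexdigest(), 16)
    random.Random(seed).shuffle(words)
    return " ".join(words)


def choose_response(user_agent: str, real_content: str) -> str:
    """Serve the decoy to disallowed agents and the real page to everyone else."""
    ua = user_agent.lower()
    if any(token in ua for token in BLOCKED_AGENT_TOKENS):
        return decoy_text(real_content)
    return real_content


if __name__ == "__main__":
    page = "You cannot have a commons in the middle of a toxic open-pit mine."
    print(choose_response("Mozilla/5.0 (compatible; GPTBot/1.0)", page))
    print(choose_response("Mozilla/5.0 Firefox/126.0", page))
```

This only keys on the User-Agent header, so a crawler that fakes its user agent walks straight past it; matching on IP ranges would have to be layered on top.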

datarama

@djh @timbray On the license side: I live in an EU country, where the law explicitly says that 1. data mining and ML for commercial products *must* respect a "machine-readable opt-out" (e.g. https://www.w3.org/community/reports/tdmrep/CG-FINAL-tdmrep-20240202/ ), and 2. ML for academic research purposes doesn't have to respect copyright at all.

This, I guess, is why these companies love having a "non-profit research lab" to do all the actual scraping for them.
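For the curious, a sketch of what such a machine-readable opt-out might look like, based on my reading of the TDMRep report linked above: a JSON file served at `/.well-known/tdmrep.json` listing rules with a `location` pattern and a `tdm-reservation` flag, plus an optional `tdm-policy` URL. Treat the field names and the `/*` pattern as my understanding rather than gospel and check them against the report; `build_tdmrep` is a made-up helper.

```python
import json
from typing import Optional


def build_tdmrep(policy_url: Optional[str] = None) -> str:
    """Build a site-wide opt-out document in the style of tdmrep.json.

    Assumption: "tdm-reservation": 1 means text-and-data-mining rights are
    reserved for every path matched by "location".
    """
    rule = {"location": "/*", "tdm-reservation": 1}
    if policy_url is not None:
        # Optional pointer to a policy describing how rights can be obtained.
        rule["tdm-policy"] = policy_url
    return json.dumps([rule], indent=2)


if __name__ == "__main__":
    # The output would be saved as /.well-known/tdmrep.json on the site.
    print(build_tdmrep())
```

As I understand it, the same report also describes an equivalent HTTP header and HTML meta tag, which may be easier to set up on a hosted blog.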