Block AI training on a web site

09 Jun 2024

(Update 14 Jun 2024: Add darkvisitors.com API and GPC.)

I’m going to start with a warning: you can’t completely block “AI” training from a web site. Underground AI will always get through, and it might turn out that the future of AI-based infringement is bot accounts, so the sites that profit from it can just act shocked at what one of their users was doing, kind of like how big companies monetize copyright infringement.

But there are some ways to tell the halfway crooks of the AI business to go away. Will update if I find others.

robots.txt

Dark Visitors - A List of Known AI Agents on the Internet is a good source of an up-to-date set of lines to add to your robots.txt file.
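
The lines you get back are ordinary robots.txt rules, one block per crawler. For example, the entries for OpenAI’s and Anthropic’s crawlers (two of the many agents on the list) look like this:

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /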

This site uses the Dark Visitors API to catch up on the latest. So if I fall behind on reading the technology news, the Makefile has me covered.

# update AI crawlers blocking list from darkvisitors.com
tmp/robots.txt : 
        curl -X POST "https://api.darkvisitors.com/robots-txts" \
                -H "Authorization: Bearer $(shell pass darkvisitors-token)" \
                -H "Content-Type: application/json" \
                -d '{"agent_types": ["AI Data Scraper", "AI Assistant", "Undocumented AI Agent", "AI Search Crawler"], "disallow": "/"}' \
        > $@

# The real robots.txt is built from the local lines
# in the conf directory, with the
# darkvisitors.com lines added
public/robots.txt : conf/robots.txt tmp/robots.txt
        cat conf/robots.txt tmp/robots.txt > $@
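
With those two rules in place, make public/robots.txt does the whole job: it fetches tmp/robots.txt from the API if the cached copy is missing, then concatenates the local lines and the fetched ones.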

One of my cleanup scripts gets rid of the tmp/robots.txt fetched from Dark Visitors if it gets stale, and I use Pass to store the token.
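
The staleness check can be as simple as a find(1) one-liner, something like this (seven days is an arbitrary cutoff, pick your own):

# drop the cached Dark Visitors list after 7 days, so the
# next make run fetches a fresh copy
find tmp -name robots.txt -mtime +7 -delete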

noai meta tag

Raptive Support covers the noai meta tag. Pretty easy: just put this in the HTML head with any other meta and link elements.

<meta name="robots" content="noai, noimageai">
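
In a page template that means one more line in the head, something like this (the other elements here are just placeholders):

<head>
  <meta charset="utf-8">
  <title>Example post</title>
  <link rel="stylesheet" href="style.css">
  <meta name="robots" content="noai, noimageai">
</head>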

That support FAQ includes a good point that applies to all of these: the opt-out is stronger if it’s backed up by the site’s Terms of Service or User Agreement. Big companies have invested hella lawyer hours in making those agreements more enforceable, and an AI company that chose to override a site’s ToS would be acting against its own interest in keeping its sites in company town mode.

new: privacy opt-out for servers

This is the first site to include the new SPC meta tag and X-Robots-Tag header for a privacy opt-out that works like Global Privacy Control, but for servers. Basically, you have legally enforceable rights in your personal information, and blogs have personal information, but regular GPC only works from your browser (the client) to a company on the server. This goes the other way: it sends a legally enforceable* privacy signal from a personal blog on the server to an AI scraper on the client side.

*Yes, I know, this has not yet been tested in court, but give it a minute, we’re just getting started here.

So the new header on here is

X-Robots-Tag: noai, noimageai, SPC
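
Setting it is a one-liner in most servers. On nginx, for example, it goes in the site config (Apache’s mod_headers can do the same with Header set):

# send the opt-out on every response, including error pages
add_header X-Robots-Tag "noai, noimageai, SPC" always;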

So we’re up to four. Somebody send me number five?
