Block AI training on a web site

09 Jun 2024

(Update 14 Jun 2024: Add darkvisitors.com API and GPC.)

I’m going to start with a warning. You can’t completely block “AI” training from a web site. Underground AI will always get through, and it might turn out that the future of AI-based infringement is bot accounts so that the sites that profit from it can just be shocked at what one of their users was doing—kind of like how big companies monetize copyright infringement.

But there are some ways to tell the halfway crooks of the AI business to go away. Will update if I find others.

robots.txt

Dark Visitors - A List of Known AI Agents on the Internet is a good source of an up-to-date set of lines to add to your robots.txt file.

This site uses the API to catch up on the latest. So if I fall behind on reading the technology news, the Makefile has me covered.

# update AI crawlers blocking list from darkvisitors.com
# --fail makes curl exit non-zero on an HTTP error, and the response
# goes to a .part file first so a failed fetch never leaves a bad target
tmp/robots.txt :
	mkdir -p $(@D)
	curl --fail --silent --show-error -X POST "https://api.darkvisitors.com/robots-txts" \
		-H "Authorization: Bearer $(shell pass darkvisitors-token)" \
		-H "Content-Type: application/json" \
		-d '{"agent_types": ["AI Data Scraper", "AI Assistant", "Undocumented AI Agent", "AI Search Crawler"], "disallow": "/"}' \
		-o $@.part
	mv $@.part $@

# The real robots.txt is built from the local lines
# in the conf directory, with the
# darkvisitors.com lines added
public/robots.txt : conf/robots.txt tmp/robots.txt
	cat conf/robots.txt tmp/robots.txt > $@
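
Once public/robots.txt is built, it can be worth checking that it really turns away the crawlers you care about. Here is a minimal sketch using Python's standard urllib.robotparser. The function name agents_blocked and the SAMPLE text are made up for illustration, and GPTBot is just one example agent, not a claim about what the Dark Visitors list contains on any given day.

```python
from urllib.robotparser import RobotFileParser


def agents_blocked(robots_txt, agents, path="/"):
    """Map each user agent name to True if robots_txt disallows `path` for it."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network fetch is needed.
    parser.parse(robots_txt.splitlines())
    return {agent: not parser.can_fetch(agent, path) for agent in agents}


# Illustrative stand-in for a combined robots.txt: local rules first,
# then a block of the kind a crawler blocklist would append.
SAMPLE = """\
User-agent: *
Disallow: /private/

User-agent: GPTBot
Disallow: /
"""

if __name__ == "__main__":
    for agent, blocked in agents_blocked(SAMPLE, ["GPTBot", "ExampleBrowser"]).items():
        print(agent, "blocked" if blocked else "allowed")
```

To check the real file, read public/robots.txt and pass its text in place of SAMPLE.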

One of my cleanup scripts gets rid of the tmp/robots.txt fetched from Dark Visitors if it gets stale, and I use Pass to store the token.
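
The cleanup script itself isn't shown here, so the following is only a sketch of one way to do it: delete the cached file when its modification time is older than some threshold, so the next make run fetches a fresh copy. The name remove_if_stale and the one-day threshold in the usage note are assumptions, not the settings this site uses.

```python
import time
from pathlib import Path
from typing import Optional


def remove_if_stale(path: Path, max_age_seconds: float, now: Optional[float] = None) -> bool:
    """Delete `path` if it was last modified more than max_age_seconds ago.

    Returns True if the file was removed, False if it was fresh or missing.
    `now` can be passed in to make the check easy to test.
    """
    if now is None:
        now = time.time()
    try:
        age = now - path.stat().st_mtime
    except FileNotFoundError:
        # Nothing cached yet, so there is nothing to clean up.
        return False
    if age > max_age_seconds:
        path.unlink()
        return True
    return False
```

For example, remove_if_stale(Path("tmp/robots.txt"), max_age_seconds=24 * 60 * 60) would drop a cached copy that is more than a day old.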

noai meta tag

Raptive Support covers the noai meta tag. Pretty easy, just put this in the HTML head with any other meta and link elements.

<meta name="robots" content="noai, noimageai">

That support FAQ includes a good point that applies to all of these—the opt out is stronger if it’s backed up with the site Terms of Service or User Agreement. Big companies have invested hella lawyer hours in making these things more enforceable, and if they wanted to override ToS they would be acting against their other interests in keeping their sites in company town mode.
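
If you want to confirm that the meta tag above actually made it into your generated pages, a quick check with Python's standard html.parser is enough. This is a sketch, and the names RobotsMetaParser, robots_directives and has_ai_opt_out are made up for illustration.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from every <meta name="robots" content="..."> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None for bare attributes, so default to "".
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").strip().lower() != "robots":
            return
        for token in attr.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_directives(html):
    """Return the set of robots meta directives found in an HTML string."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.directives


def has_ai_opt_out(html):
    """True if the page carries both noai and noimageai."""
    return {"noai", "noimageai"} <= robots_directives(html)
```

Running has_ai_opt_out over the text of each built page would flag any template that lost the tag.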

new: privacy opt out for servers

This is the first site to include the new SPC meta tag and X-Robots-Tag header for a privacy opt-out that works like Global Privacy Control (GPC) but for servers. Basically, you have legally enforceable rights in your personal information, and blogs have personal information, but regular GPC only works from your browser (the client) to a company on the server. This goes the other way, and sends a legally enforceable* privacy signal from a personal blog on the server to an AI scraper on the client side.

* Yes, I know, this has not yet been tested in court, but give it a minute, we’re just getting started here.

So the new header on here is

X-Robots-Tag: noai, noimageai, SPC
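
How the header gets set depends on the server. On a static site it is usually one line of web server configuration. For a site served by a Python app, here is a sketch of a small WSGI middleware that adds the header to every response. The names add_x_robots_tag and hello_app are hypothetical, and hello_app is only a stand-in for a real application.

```python
# Header value copied from the line above.
X_ROBOTS_TAG = "noai, noimageai, SPC"


def add_x_robots_tag(app):
    """Wrap a WSGI app so every response also carries the X-Robots-Tag header."""

    def wrapped(environ, start_response):
        def start_with_tag(status, headers, exc_info=None):
            # Append rather than replace, so any X-Robots-Tag the app
            # already sets (for example noindex) is left alone.
            tagged = list(headers) + [("X-Robots-Tag", X_ROBOTS_TAG)]
            return start_response(status, tagged, exc_info)

        return app(environ, start_with_tag)

    return wrapped


def hello_app(environ, start_response):
    """Hypothetical stand-in for the real site."""
    start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8")])
    return [b"hello\n"]


app = add_x_robots_tag(hello_app)
```

Any WSGI server can then serve app in place of the unwrapped application.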

So we’re up to four, somebody send me number five?

Bonus links

The Internet is a Series of Webs: The future of the internet seems up in the air. Consumed by rotting behemoths. What we have now is failing, but it is also part of our every-day life, our politics, our society, our communities and our friendships. All of those are at risk, in part because the ways we communicate are under attack. (So if Google search ads are scammy enough to get an FBI warning, Meta is a shitshow, and Amazon is full of fake and stolen stuff, what do you do? Make a list of legit companies on your blog and hope others do the same?)

For tech CEOs, the dystopia is the point: The CEOs obviously don’t much care what some flyby cultural critics think of their branding aspirations, but beyond even that, we have to bear in mind that these dystopias are actively useful to them.

Apple Removes Nonconsensual AI Nude Apps Following 404 Media Investigation (think of how bad the Internet would be without independent sites covering the big companies…then go subscribe to 404 Media.)

Amazon is filled with garbage ebooks. Here’s how they get made. The biographer in question was just one in a vast, hidden ecosystem centered on the production and distribution of very cheap, low-quality ebooks about increasingly esoteric subjects. Many of them gleefully share misinformation or repackage basic facts from WikiHow behind a title that’s been search-engine-optimized to hell and back again. Some of them even steal the names of well-established existing authors and masquerade as new releases from those writers. (I’m going to the real bookstore.)

“Pink slime” local news outlets erupt all over US as election nears: Kathleen Carley, a computer science professor at Carnegie Mellon University, said her research suggests that following the 2022 midterms “a lot more money” is being poured into pink slime sites, including advertising on Meta.