“AI” companies think that we should have to opt-out of data-scraping bots that take our work to train their products. There isn’t even a required no-scraping period between the announcement and when they start. Too late? Tough. Once they have your data, they don’t provide you with a way to have it deleted, even before they’ve processed it for training.
These companies should be prevented from using data that they haven’t been given explicit consent for. Opt-out is problematic as it counts on concerned parties hearing about new or modified bots BEFORE their sites are targeted by them. That is simply not practical.
It should be strictly opt-in. No one should be required to provide their work for free to any person or organization. The online community is under no responsibility to help them create their products. Some will declare that I am “Anti-AI” for saying such things, but that would be a misrepresentation. I am not declaring that these systems should be torn down, simply that their developers aren’t entitled to our work. They can still build those systems with purchased or donated data.
There are ongoing court cases and debates in political circles around the world. Decisions and policies will move more slowly than either side on this issue would like, but in the meantime, SOME of the bots involved in scraping data for training have been identified and can be blocked. (Others may still be secret or operate without respect for the wishes of a website’s owner.) Here’s how:
(If you are not technically inclined, please talk to your webmaster, whatever support options are at your disposal, or a tech-savvy friend.)
robots.txt
This is a file placed in the home directory of your website that is used to tell web crawlers and bots which portions of your website they are allowed to visit. Well-behaved bots honor these directives. (Not all scraping bots are well-behaved and there are no consequences, short of negative public opinion, for ignoring them. At this point, there have been no claims that bots being named in this post have ignored these directives.)
This is what our robots.txt looks like:
User-agent: CCBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: GPTBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Applebot-Extended
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Omgilibot
Disallow: /
User-agent: Omgili
Disallow: /
User-agent: FacebookBot
Disallow: /
User-agent: Diffbot
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: ImagesiftBot
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: cohere-ai
Disallow: /
The first line identifies CCBot, the bot used by the Common Crawl. This data has been used by ChatGPT, Bard, and others for training a number of models. The second line states that this user-agent is not allowed to access data from our entire website. Some image scraping bots also use Common Crawl data to find images.
The next two user-agents identify ChatGPT-specific bots.
ChatGPT-User is the bot used when a ChatGPT user instructs it to reference your website. It’s not automatically going to your site on its own, but it is still accessing and using data from your site.
GPTBot is a bot that OpenAI specifically uses to collect bulk training data from your website for ChatGPT.
Google-Extended is the recently announced product token that allows you to block Google from scraping your site for Bard and VertexAI. This will not have an impact on Google Search indexing. The only way this works is if it is in your robots.txt. According to their documentation: “Google-Extended doesn’t have a separate HTTP request user agent string. Crawling is done with existing Google user agent strings; the robots.txt user-agent token is used in a control capacity.”
Applebot-Extended does not directly crawl webpages. It is used to determine whether or not pages crawled by the Applebot user agent will be used to train Apple’s models powering generative AI features across Apple products, including Apple Intelligence, Services, and Developer Tools.
anthropic-ai is used by Anthropic to gather data for their “AI” products, such as Claude.
ClaudeBot is another agent used by Anthropic that is more specifically related to Claude.
Omgilibot and Omgili are from webz.io. I noticed The New York Times was blocking them and discovered that they sell data for training LLMs.
FacebookBot is Meta’s bot that crawls public web pages to improve language models for their speech recognition technology. This is not what Facebook uses to get the image and snippet for when you post a link there.
Diffbot is a somewhat dishonest scraping bot used to collect data to train LLMs. This is their default user-agent, but they make it easy for their clients to change it to something else and ignore your wishes.
Bytespider has been identified as ByteDance’s bot used to gather data for their LLMs, including Doubao.
ImagesiftBot is billed as a reverse image search tool, but it’s associated with The Hive, a company that produces models for image generation. It’s not definitively scraping for “AI” models, but there are enough reasons to be concerned that it may be. Commenters here have suggested its inclusion. If anyone from the company would like to clarify, we’re all ears.
PerplexityBot is used by perplexity.ai, an AI-based search engine. While billed as a search engine, it can and does generate text based on scraped material.
cohere-ai is an unconfirmed bot believed to be associated with Cohere’s chatbot. It falls into the same class as ChatGPT-User, as it appears to trigger in response to a user-directed query.
ChatGPT has been previously reported to use another unnamed bot that had been referencing Reddit posts to find “quality data.” That bot’s user agent has never been officially identified and its current status is unknown. Reddit, Tumblr, and others have recently announced their intent to license their users’ content to the “AI” industry and in some cases, there are no opt-out controls. Protecting your work outside of your own websites is likely to become increasingly complicated.
Updating or Installing robots.txt
You can check if your website has a robots.txt by going to yourwebsite.com/robots.txt. If it doesn’t find that page, then you don’t have one.
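If you already have a robots.txt, it can help to check which of the bots listed in this post it still lets through. Here is a minimal sketch using Python's standard-library `urllib.robotparser`. The function name `unblocked_bots` and the sample text are my own illustrations, and Python's parser matches user agents by its own rules, which may differ slightly from how any particular bot reads the file.

```python
from urllib.robotparser import RobotFileParser

# Tokens taken from the robots.txt shown in this post.
AI_BOTS = [
    "CCBot", "ChatGPT-User", "GPTBot", "Google-Extended",
    "Applebot-Extended", "anthropic-ai", "ClaudeBot", "Omgilibot",
    "Omgili", "FacebookBot", "Diffbot", "Bytespider",
    "ImagesiftBot", "PerplexityBot", "cohere-ai",
]


def unblocked_bots(robots_txt: str, bots=AI_BOTS) -> list[str]:
    """Return the bots that this robots.txt text still allows to fetch "/"."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in bots if parser.can_fetch(bot, "/")]


if __name__ == "__main__":
    # Hypothetical robots.txt that only blocks two of the bots.
    sample = "User-agent: GPTBot\nDisallow: /\nUser-agent: CCBot\nDisallow: /\n"
    print(unblocked_bots(sample, ["GPTBot", "CCBot", "ClaudeBot"]))  # ['ClaudeBot']
```

Paste the contents of your own robots.txt in place of `sample` to see which names are missing. An empty list means every bot in `AI_BOTS` is disallowed from the whole site.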
If your site is hosted by Squarespace (see below), or another simple website-building site, you could have a problem. At present, many of those companies don’t allow you to update or add your own robots.txt. They may not even have the ability to do it for you. I recommend contacting support so you can get specific information regarding their current abilities and plans to offer such functionality. Remind them that once your work has been slurped up by a scraper, you have no ability to remove it from the “AI” company’s hold, so this is an urgent priority. (It also demonstrates once again why “opt-out” is a bad model.)
If you are using Wix, they provide directions for modifying your robots.txt here.
If you are using Squarespace, they provide directions for blocking a very fixed set of AI scraping bots here. They will allow you to block some, but not all of the bots mentioned in this post.
If you are using WordPress (not WordPress.com–see below), there are a few plugins that allow you to modify your robots.txt. Many SEO (Search Engine Optimization) plugins include robots.txt editing features. (Use those instead of making your own.) Here are a few we’ve run into:
- Yoast: directions
- AIOSEO: directions (there’s a report in the comments that user agent blocking may not be working at the moment)
- SEOPress: directions
- Dark Visitors: (under option 2) – this one will self-update with newly discovered bots, they also maintain a useful website for information about different bots
If your WordPress site doesn’t have a robots.txt or something else that modifies robots.txt, I recommend installing something like the Dark Visitors plugin. It provides a wide and robust range of bot-blocking capabilities. (The people behind the Dark Visitors plugin have been a very reliable and regularly updated source of information about bots via their website.) To avoid any signs of favoritism, I’ll mention other plugins I’ve heard of as well: Block AI Crawlers, Simple NoAI and NoImageAI. Do not install and activate more than one of these at a time.
For more experienced users: If you don’t have a robots.txt, you can create a text file by that name and upload it via FTP to your website’s home directory. If you have one, it can be downloaded, altered and reuploaded. If your hosting company provides you with cPanel or some other control panel, you can use its file manager to view, modify, or create the file as well.
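If you are creating the file by hand, the format is repetitive enough that a few lines of Python can generate it from a list of bot names. This is only a sketch: `build_robots_txt` is a hypothetical helper name, and it produces a file that blocks each named bot from your entire site, the same as the example earlier in this post. It does not merge with rules you may already have.

```python
def build_robots_txt(bots: list[str]) -> str:
    """Build robots.txt text that disallows the whole site for each bot."""
    return "".join(f"User-agent: {bot}\nDisallow: /\n" for bot in bots)


if __name__ == "__main__":
    # Print the text; save it as robots.txt and upload it to your site's home directory.
    print(build_robots_txt(["CCBot", "GPTBot", "Google-Extended"]), end="")
```

If your site already has a robots.txt, append the generated lines to the existing file rather than replacing it.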
If your site already has a robots.txt, it’s important to know where it came from as something else may be updating it. You don’t want to accidentally break something, so talk to whoever set up your website or your hosting provider’s support team.
Firewalls and CDNs (less common, but better option)
Your website may have a firewall or CDN in front of your actual server. Many of these products have the ability to block bots and specific user agents. Blocking the user agents (CCBot, GPTBot, ChatGPT-User, Anthropic-ai, ClaudeBot, Omgilibot, Omgili, FacebookBot, Diffbot, Bytespider, PerplexityBot, ImagesiftBot, and cohere-ai) there is even more effective than using a robots.txt directive. (As I mentioned, directives can be ignored. Blocks at the firewall level prevent them from accessing your site at all.) Some of these products include Sucuri, Cloudflare, QUIC.cloud, and Wordfence. (Happy to add more if people let me know about them. Please include a link to their user agent blocking documentation as well.) Contact their support if you need further assistance.
CLOUDFLARE USERS: In July 2024, Cloudflare updated their settings to allow you to block AI bots in the Web Application Firewall (WAF). Directions are in the linked post. The list of bots they are blocking is extensive and they’ve committed to updating it to block new bots as they are found.
NOTE: Google-Extended and Applebot-Extended aren’t bots. You need to have them in your robots.txt file if you want to prevent those companies from using your site content for training.
.htaccess (another option)
In the comments, DJ Mary pointed out that you can also block user agents with your website’s .htaccess file by adding these lines:
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (CCBot|ChatGPT|GPTBot|anthropic-ai|ClaudeBot|Omgilibot|Omgili|FacebookBot|Diffbot|Bytespider|PerplexityBot|ImagesiftBot|cohere-ai) [NC]
RewriteRule ^ - [F]
I’d rate this one as something for more experienced people to do. This has a similar effect to that of the firewall and CDN blocks above.
NOTE: Google-Extended and Applebot-Extended aren’t bots, so they can’t be blocked this way. You need to have them in your robots.txt file if you want to prevent those companies from using your site content for training.
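Before relying on a pattern like the one in the RewriteCond line, it can be worth trying it against a few user-agent strings. The sketch below uses Python's `re` module with the same alternation and a case-insensitive flag standing in for `[NC]`. The sample user-agent strings are made up for illustration, and `would_be_blocked` is a hypothetical name. For a simple alternation like this one I'd expect Python and Apache to agree, but treat it as a sanity check rather than a guarantee.

```python
import re

# Same alternation as the RewriteCond line; IGNORECASE plays the role of [NC].
PATTERN = re.compile(
    r"(CCBot|ChatGPT|GPTBot|anthropic-ai|ClaudeBot|Omgilibot|Omgili|FacebookBot"
    r"|Diffbot|Bytespider|PerplexityBot|ImagesiftBot|cohere-ai)",
    re.IGNORECASE,
)


def would_be_blocked(user_agent: str) -> bool:
    """True if the pattern matches anywhere in this User-Agent string."""
    return PATTERN.search(user_agent) is not None


if __name__ == "__main__":
    samples = [
        "Mozilla/5.0 (compatible; GPTBot/1.0)",           # made-up bot string
        "Mozilla/5.0 (X11; Linux x86_64) Firefox/120.0",  # made-up browser string
    ]
    for ua in samples:
        print(would_be_blocked(ua), ua)
```

Note that the pattern matches anywhere in the string, so `ChatGPT` also catches `ChatGPT-User`.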
Additional Protection for Images
There are some image-scraping tools that honor the following directive:
<meta name="robots" content="noai, noimageai">
when placed in the header section of your webpages. Unfortunately, many image-scraping tools allow their users to ignore this directive.
Tools like Glaze and Mist can make it more difficult for models to perform style mimicry based on altered images. (This assumes they don’t get or already have an unaltered copy from another source.)
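To confirm that a page actually carries the meta tag, you can scan its HTML. Here is a minimal sketch with Python's standard-library `html.parser`. The class and function names are my own, and it only checks that the tag is present, not whether any scraper honors it.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the comma-separated values of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def has_noai_directives(html: str) -> bool:
    """True if the page declares both noai and noimageai."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return {"noai", "noimageai"} <= parser.directives


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noai, noimageai"></head></html>'
    print(has_noai_directives(page))  # True
```

Save a page from your site (View Source in your browser) and pass its text to `has_noai_directives` to check it.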
There are other techniques that you can apply for further protection (blocking direct access to images, watermarking, etc.) but I’m probably not the best person to talk to for this one. If you know a good source, recommend them in the comments.
Podcasts
The standard lack of transparency from the “AI” industry makes it difficult to know what is being done with regards to audio. It is clear, however, that the Common Crawl has audio listed among the types of data it has acquired. Blocks to the bots mentioned should protect an RSS feed (the part of your site that shares information about episodes), but if your audio files (or RSS feed) are hosted on a third party website (like Libsyn, PodBean, Blubrry, etc.), it may be open from their end if they aren’t blocking. I am presently unaware of any that are blocking those bots, but I have started asking. The very nature of how podcasts are distributed makes it very difficult to close up the holes that would allow access. This is yet another reason why Opt-In needs to be the standard.
ai.txt
I just came across this one recently and I don’t know which “AI” companies are respecting Spawning’s ai.txt settings, but if anyone is, it’s worth having. They provide a tool to generate the file and an assortment of installation directions for different websites.
https://site.spawning.ai/spawning-ai-txt
Substack
If you are using Substack, there is an option to “block AI training,” but it is off by default. Courtesy of Alan Baxter on BlueSky: “If you do NOT want your publication to be used to train AI, open your publication, go to Settings > Publication details and switch it on.”
Image courtesy of Alan Baxter.
WordPress.com
If you are using the WordPress hosting provided by WordPress.com (this is not to be confused with WordPress installed on your own hosting plan), please be aware that they have partnered with “AI” companies and will be providing your content to those companies unless you opt-out. This post details what you can do to prevent that, but the option you are looking for (Prevent Third-Party Sharing) can be found under Settings → General, in the privacy section.
“Activating the “Prevent third-party sharing” feature excludes your site’s public content from our network of content and research partners. It also adds known AI bots to the “disallow” list in your site’s robots.txt file in order to stop them from crawling your site, though it is up to AI platforms to honor this request. Using this option also means your blog posts will not appear in the WordPress.com Reader.”
Closing
None of these options are guarantees. They are based on an honor system and there’s no shortage of dishonorable people who want to acquire your data for the “AI” gold rush or other purposes. Sadly, the most effective means of protecting your work from scraping is to not put it online at all. Even paywall models can be compromised by someone determined to do so.
Other techniques for informing bots that you don’t want work used for “AI” training (or might be willing to do so under license) are being developed. One such effort, the TDM Reservation Protocol (TDMRep), has been drafted by a W3C community group but “is not a W3C Standard nor is it on the W3C Standards Track.” I am unaware of any bots currently employing this functionality, though some vendors have mentioned it to me. While overly complex, the advantage to something like this is that it would avoid the necessity of having to block individual bots and companies. Like others, it would not protect you from bad actors.
Writers and artists should also start advocating for “AI”-specific clauses in their contracts to restrict publishers from using, selling, donating, or licensing their work for the purposes of training these systems. Online works might be the most vulnerable to being fed to training algorithms, but print, audio, and ebook editions developed by publishers can be used too. It is not safe to assume that anyone will make the necessary effort to protect your work from these uses, so get it in writing.
[This post will be updated with additional information as it becomes available.]
9/28/2023 – Added the recently announced Google-Extended robots.txt product token. This must be in robots.txt. There are no alternatives.
9/28/2023 – Added Omgilibot/Omgili, bots apparently used by a company that sells data for LLM training.
9/29/2023 – Adam Johnson on Mastodon pointed us at FacebookBot, which is used by Meta to help improve their language models.
11/6/2023 – Added anthropic-ai user-agent used by Anthropic.
11/16/2023 – Added Substack section information provided by Alan Baxter.
11/17/2023 – Squarespace has provided directions to block some, but not all bots mentioned in this post.
12/12/2023 – Added information about Cloudflare AI blocking rules in WAF.
1/25/2024 – Added Bytespider bot courtesy of darkvisitors.com.
2/28/2024 – Added WordPress.com directions for preventing the sharing of your content with their “AI” partners.
3/1/2024 – Added ImagesiftBot due to concerns raised by people commenting on this post. At this time, it’s not a confirmed “AI” scraping bot that is contributing to art generating models, but there’s sufficient cause to be cautious due to its association with a company (The Hive) that does. Hive hasn’t revealed how it acquires data or from where.
3/23/2024 – Added information about Diffbot and cohere-ai.
4/8/2024 – Added Dark Visitors WordPress Plugin
4/27/2024 – Added ClaudeBot as per comment from Scott Adams
5/4/2024 – Added PerplexityBot as per comment from Marijka Azzopardi
6/11/2024 – Added AppleBot as per comment from Robin Phillips and updated the list of WordPress plugins