Author: Neil Clarke

On fast rejections

On 02/11/2024

At Clarkesworld, we try to respond to submissions in under 48 hours. Sometimes life gets in the way and that can slide to a week, but we always make an effort to catch up and return to normal. We’ve made this response time a goal and have been committed to it for a very long time.

Every now and then, an author decides to take affront at a quick response. It happens often enough that I’ve sometimes sat on stories just to get them over the 24-hour mark, which seems to decrease, but not eliminate, the outrage factor. Rather than repeat myself with those authors, I’m just going to start pointing people to this post.

Things you need to know

We actually read every submission. Seriously, why would we waste our time handling this many submissions if we had no intention of reading them? Suggesting otherwise is insulting.
A good or great story can still be rejected. Most well-known editors have rejected at least one story that has gone on to win awards and usually, they have no regrets about that. The story simply wasn’t right for their project or they weren’t the right editor for that story. It eventually landed in the right market with the right editor, found its audience, and was celebrated for it. We like to see that.
Don’t be misled. Magazines that have slower response times don’t necessarily spend more time reading your story than we did. The difference in response times is almost entirely related to the amount of time your story sits in a pile, unread. If a publication averages 30 days, it means they can keep up with the daily volume and could have a 48-hour response time, if it was one of their priorities.
To us, getting back to you quickly is a sign of respect. The sooner we reject this one, the sooner you can send it to the next market. It also resets the clock on when you can send us your next one. Every story is a clean slate, so rejecting one or ten or a hundred doesn’t mean we’ve given up on you. While a few authors sell quickly, others have submitted over fifty before landing an acceptance here. Everyone’s journey is different.
We did warn you. Our average response time is listed in our guidelines and on the submission confirmation screen.

2023 Clarkesworld Submissions Snapshot

By Neil Clarke

On 02/02/2024

In clarkesworld magazine, slush, stats

Now that the year is over and I’ve had some time to sit with the data, I can share a snapshot of the 2023 submissions data and a few observations. For the sake of clarity, the data here represents stories that were submitted from January 1, 2023 through December 31, 2023. It does not include the submissions that we identified as plagiarism or generated by LLMs (also loosely called “AI” submissions).

We received a total of 13,207 submissions in 2023. This includes the 1,124 Spanish language submissions that were part of a one-month Spanish Language Project (SLP). We were closed to submissions from February 20, 2023 through March 12, 2023 due to a surge (hundreds in a month) of “AI”-generated submissions. During that period, we made updates to our software in an attempt to better manage this new form of unwanted works. The changes made helped us avoid the need to close submissions again, despite continued surges. (This is a band aid, not a cure. Higher volumes could necessitate future closures.)

Combined, the number of submissions received was our second-highest year ever. (2020 and its wild pandemic submission patterns holds the record by just a few dozen.) If we consider just the English language submissions, it was down, but only because we were closed for a few weeks. Assuming the weekly submissions volume would have held during that time, we would have had roughly the same volume as we did in 2022 and, with SLP submissions, our best overall year ever.

7,898 (59.8%) of the submissions were from authors that had never submitted to Clarkesworld before. Courtesy of the SLP, that percentage is considerably higher than normal. In case it isn’t clear, I consider this to be a good thing. Opening the door for authors that had been unable to submit in the past was a major point of that project.

We received submissions from 154 different countries, smashing our previous record by 12. US-based authors accounted for 7,780 (58.9%) of all works, a record-low share for that group. However, if we remove the SLP, it increases to 64.1%, which is the highest percentage we’ve seen in several years. I’ve tried to dig into the reasons for this change and have identified two likely causes, both of which are tied to the generated submissions crisis:

The press coverage concerning our situation with “AI” submissions was global and effectively spread the word that we had closed submissions, but the news of our reopening did not receive similar attention. American authors were far more likely to hear that we were once again accepting submissions.
With the amount of time we had to spend on the “AI” and Amazon subscription problems, we did not put as much effort into promoting our openness to international submissions. As I’ve noted in past years, many foreign authors have difficulty believing that US-based publications are open to submissions from them. (History certainly makes it look that way.) Making public statements about your willingness to do so is an important part of undoing the damage and it must be maintained. The decrease in the time we spent doing that in 2023 almost certainly had an impact.

While the percentage is disappointing (please don’t take this as not wanting more US-based authors, we want everyone, but one group needs more encouragement than the other), it is recoverable and was offset by the SLP. If anything, that’s a reason for doing more language-based projects in the future. It was clearly more effective than anything we’ve done in the past.
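The shares quoted above can be re-derived from the raw counts. This is a minimal sketch, assuming the counts are exactly as stated in this post; the variable and function names are illustrative only and are not part of our submissions software.

```python
# Counts as stated in this post; the names are illustrative only.
total_submissions = 13207   # all 2023 submissions, SLP included
slp_submissions = 1124      # Spanish Language Project submissions
new_authors = 7898          # authors who had never submitted before
us_based = 7780             # works from US-based authors


def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(part / whole * 100, 1)


# English-language submissions are everything outside the SLP.
english_submissions = total_submissions - slp_submissions

print(english_submissions)                      # English-language total
print(percent(new_authors, total_submissions))  # share from first-time authors
print(percent(us_based, total_submissions))     # US share, SLP included
```

The two printed percentages should match the 59.8% and 58.9% figures quoted in the text.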

Short stories (works under 7,500 words) represented 85.9% of all submissions, followed by novelettes (works between 7,500 and 17,500 words) at 12.7% and novellas (17,500+, our cap is 22,000) at 1.4%. Acceptances followed a similar trend with 82.9%, 14.8%, and 2.3% respectively.

Science fiction took the lead across all categories. The following is our submission funnel (by genre) for 2023. The outer ring is all submissions. The middle ring includes stories that were passed onto the second round of evaluation. The inner ring is acceptances.

I discourage people from using this data to determine what they should and should not send to us. The data reflects our opinions on the individual works submitted to us at that specific time, and does not represent a general opinion on the length or genre of works submitted to us. If we were no longer interested in considering a specific type of work, it would either be present in our hard-sell list or missing from our list of accepted genres. (This already happened with horror.) We will not waste your (or our) time by leaving a category open if we have no intention of considering it for publication.

Genres have trends and right now, some of those trends are working for or against stories in our submissions review process. Works across all categories are still reaching the second round, indicating (to me) that it is still important that we continue to encourage and consider works in those genres.

Clarkesworld Readers’ Poll (round one) now open for 48 hours

By Neil Clarke

On 01/27/2024

In clarkesworld magazine

The 48-hour nomination phase of our annual Clarkesworld Readers’ Poll is now open. Nominate your favorite 2023 Clarkesworld short stories, novelettes/novellas, and cover art. Finalists will be revealed in our February issue.
https://www.surveymonkey.com/r/cwreaders2023
CLOSES MONDAY AT NOON

Clarkesworld Magazine 2023 Stories and Cover Art

By Neil Clarke

On 12/17/2023

In clarkesworld magazine

Here are the stories and cover art published in Clarkesworld Magazine’s 2023 issues:

Short Stories

“Symbiosis” by D.A. Xiaolin Spires
“Sharp Undoing” by Natasha King
“Pearl” by Felix Rose Kawitzky
“The Portrait of a Survivor, Observed from the Water” by Yukimi Ogawa
“Somewhere, It’s About to Be Spring” by Samantha Murray
“Larva Pupa Imago” by Eric Schwitzgebel
“Silo, Sweet Silo” by James Castles
“Going Time” by Amal Singh
“Love in the Season of New Dance” by Bo Balder
“Pinocchio Photography” by Angela Liu
“The Spoil Heap” by Fiona Moore
“Failure to Convert” by Shih-Li Kow
“Zeta-Epsilon” by Isabel J. Kim
“AI Aboard the Golden Parrot” by Louise Hughes
“Love is a Process of Unbecoming” by Jonathan Kincade
“Re/Union” by L Chan
“There Are the Art-Makers, Dreamers of Dreams, and There Are AIs” by Andrea Kriz
“Rake the Leaves” by R.T. Ester
“Keeper of the Code” by Nick Thomas
“The Librarian and the Robot” by Shi Heiyao
“Voices Singing in the Void” by Rajan Khanna
“Better Living Through Algorithms” by Naomi Kritzer
“Through the Roof of the World” by Harry Turtledove
“LOL, Said the Scorpion” by Rich Larson
“Sensation and Sensibility” by Parker Ragland
“The Giants Among Us” by Megan Chee
“Action at a Distance” by An Hao
“The Fall” by Jordan Chase-Young
“The Officiant” by Dominica Phetteplace
“Vast and Trunkless Legs of Stone” by Carrie Vaughn
“Day Ten Thousand” by Isabel J. Kim
“The Moon Rabbi” by David Ebenbach
“. . . Your Little Light” by Jana Bianchi
“To Helen” by Bella Han
“Mirror View” by Rajeev Prasad
“Cheaper to Replace” by Marie Vibbert
“Death and Redemption, Somewhere Near Tuba City” by Lou J Berger
“Estivation Troubles” by Bo Balder
“Tigers for Sale” by Risa Wolf
“Timelock” by Davian Aw
“What Remains, the Echoes of a Flute Song” by Alexandra Seidel
“The Orchard of Tomorrow” by Kelsea Yu
“Every Seed is a Prayer (And Your World is a Seed)” by Stephen Case
“Window Boy” by Thomas Ha
“Empathetic Ear” by M. J. Pettit
“Gel Pen Notes from Generation Ship Y” by Marisca Pichette
“Resistant” by Koji A. Dae
“Stones” by Nnedi Okorafor
“The Queen of Calligraphic Susurrations” by D.A. Xiaolin Spires
“A Guide to Matchmaking on Station 9” by Nika Murphy
“The People from the Dead Whale” by Djuna
“The Five Remembrances, According to STE-319” by R. L. Meza
“Upgrade Day” by RJ Taylor
“Possibly Just About A Couch” by Suzanne Palmer
“The Blaumilch” by Lavie Tidhar
“Post Hacking for the Uninitiated” by Grace Chan
“Rafi” by Amal Singh
“Timothy: An Oral History” by Michael Swanwick
“Eddies are the Worst” by Bo Balder
“Bird-Girl Builds a Machine” by Hannah Yang
“The Long Mural” by James Van Pelt
“The Parts That Make Me” by Louise Hughes
“The Mub” by Thomas Ha
“Thin Ice” by Kemi Ashing-Giwa
“To Carry You Inside You” by Tia Tashiro
“Morag’s Boy” by Fiona Moore
“Thirteen Ways of Looking at a Cyborg” by Samara Auman
“In Memories We Drown” by Kelsea Yu
“Waffles Are Only Goodbye for Now” by Ryan Cole
“The World’s Wife” by Ng Yi-Sheng
“The Last Gamemaster in the World” by Angela Liu
“Kill That Groundhog” by Fu Qiang

Novelettes

“The Fortunate Isles” by Gregory Feeley
“Anais Gets a Turn” by R.T. Ester
“Zhuangzi’s Dream” by Cao Baiyu
“An Ode to Stardust” by R. P. Sand
“Introduction to 2181 Overture, Second Edition” by Gu Shi
“Bek, Ascendant” by Shari Paul
“Happiness” by Octavia Cade
“Stranger Shores” by Gregory Feeley
“Imagine: Purple-Haired Girl Shooting Down the Moon” by Angela Liu
“Clio’s Scroll” by Brenda W. Clough
“Light Speed Is Not a Speed” by Andy Dudak
“Who Can Have the Moon” by Congyun “Muming” Gu
“Down To The Root” by Lisa Papademetriou
“Such Is My Idea Of Happiness” by David Goodman
“De Profundis, a Space Love Letter” by Bella Han

Novellas

“To Sail Beyond the Botnet” by Suzanne Palmer
“Axiom of Dreams” by Arula Ratnakar
“Eight or Die (Part 1, Part 2)” by Thoraiya Dyer

Cover Art

“The Different Path” by Kishal Sukumaran
“Harvest II” by Arthur Haas
“Home” by Alex Rommel
“Android” by Lyss Menold
“Taking a Sample” by Arthur Haas
“Raid” by Pascal Blanché
“Autumn Pond” by Sergio Rebolledo
“Old Ways” by J.R. Slattum
“Escape” by Ignacio Bazan-Lazcano
“Utopia #2” by Dofresh
“The pirates on the beach” by Dofresh
“The Gift” by Matt Dixon

Thank you!

By Neil Clarke

On 10/27/2023

In awards, clarkesworld magazine, conventions

I spent a portion of this month in Chengdu, China attending the 2023 Worldcon. I have lots of nice things to say about my time there, but I contracted Covid somewhere along the line and I’m a bit fatigued right now. (So far, it’s an extremely mild case. Boosters are demonstrating their value right now.)

I did want to offer my thanks for a very enjoyable convention and express my joy at winning the Hugo Award for Best Editor (Short Form) for the second year in a row. I’ve had a long relationship with Chinese SF, so this was a very personally rewarding win among friends. It’s also been a particularly rough year, so well-timed on that front too. Thanks to everyone that voted for me!

I opted not to travel through three airports and multiple security checkpoints with my Hugo this time. It’s slowly making its way to the US, but for now, I have a few photos:

Block the Bots that Feed “AI” Models by Scraping Your Website

By Neil Clarke

On 08/23/2023

In technical

“AI” companies think that we should have to opt-out of data-scraping bots that take our work to train their products. There isn’t even a required no-scraping period between the announcement and when they start. Too late? Tough. Once they have your data, they don’t provide you with a way to have it deleted, even before they’ve processed it for training.

These companies should be prevented from using data that they haven’t been given explicit consent for. Opt-out is problematic as it counts on concerned parties hearing about new or modified bots BEFORE their sites are targeted by them. That is simply not practical.

It should be strictly opt-in. No one should be required to provide their work for free to any person or organization. The online community is under no responsibility to help them create their products. Some will declare that I am “Anti-AI” for saying such things, but that would be a misrepresentation. I am not declaring that these systems should be torn down, simply that their developers aren’t entitled to our work. They can still build those systems with purchased or donated data.

There are ongoing court cases and debates in political circles around the world. Decisions and policies will move more slowly than either side on this issue would like, but in the meantime, SOME of the bots involved in scraping data for training have been identified and can be blocked. (Others may still be secret or operate without respect for the wishes of a website’s owner.) Here’s how:

(If you are not technically inclined, please talk to your webmaster, whatever support options are at your disposal, or a tech-savvy friend.)

robots.txt

This is a file placed in the home directory of your website that is used to tell web crawlers and bots which portions of your website they are allowed to visit. Well-behaved bots honor these directives. (Not all scraping bots are well-behaved and there are no consequences, short of negative public opinion, for ignoring them. At this point, there have been no claims that bots being named in this post have ignored these directives.)

This is what our robots.txt looks like:

User-agent: CCBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Omgilibot
Disallow: /

User-agent: Omgili
Disallow: /

User-agent: FacebookBot
Disallow: /

User-agent: Diffbot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: ImagesiftBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: cohere-ai
Disallow: /

The first line identifies CCBot, the bot used by the Common Crawl. This data has been used by ChatGPT, Bard, and others for training a number of models. The second line states that this user-agent is not allowed to access data from our entire website. Some image scraping bots also use Common Crawl data to find images.
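If you want to confirm that rules like these say what you intend, Python's standard library includes a robots.txt parser. The following is a minimal sketch, not a description of how any particular bot evaluates the file. It uses a shortened copy of the rules above, and `is_allowed`, `SomeOtherBot`, and the example.com URL are hypothetical names chosen for illustration.

```python
from urllib.robotparser import RobotFileParser

# A shortened copy of the rules shown above (two of the listed bots).
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: GPTBot
Disallow: /
"""


def is_allowed(robots_txt, user_agent, url="https://example.com/any-page"):
    """Report whether robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, "CCBot"))         # named, and disallowed from "/"
print(is_allowed(ROBOTS_TXT, "GPTBot"))        # named, and disallowed from "/"
print(is_allowed(ROBOTS_TXT, "SomeOtherBot"))  # hypothetical bot with no matching rule
```

This only tells you what a well-behaved bot should conclude from the file. It says nothing about whether a given bot actually honors it.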

The next two user-agents identify ChatGPT-specific bots.

ChatGPT-User is the bot used when a ChatGPT user instructs it to reference your website. It’s not automatically going to your site on its own, but it is still accessing and using data from your site.

GPTBot is a bot that OpenAI specifically uses to collect bulk training data from your website for ChatGPT.

Google-Extended is the recently announced product token that allows you to block Google from scraping your site for Bard and VertexAI. This will not have an impact on Google Search indexing. The only way this works is if it is in your robots.txt. According to their documentation: “Google-Extended doesn’t have a separate HTTP request user agent string. Crawling is done with existing Google user agent strings; the robots.txt user-agent token is used in a control capacity.”

Applebot-Extended does not directly crawl webpages. It is used to determine whether or not pages crawled by the Applebot user agent will be used to train Apple’s models powering generative AI features across Apple products, including Apple Intelligence, Services, and Developer Tools.

anthropic-ai is used by Anthropic to gather data for their “AI” products, such as Claude.

ClaudeBot is another agent used by Anthropic that is more specifically related to Claude.

Omgilibot and Omgili are from webz.io. I noticed The New York Times was blocking them and discovered that they sell data for training LLMs.

FacebookBot is Meta’s bot that crawls public web pages to improve language models for their speech recognition technology. This is not what Facebook uses to get the image and snippet for when you post a link there.

Diffbot is a somewhat dishonest scraping bot used to collect data to train LLMs. This is their default user-agent, but they make it easy for their clients to change it to something else and ignore your wishes.

Bytespider has been identified as ByteDance’s bot used to gather data for their LLMs, including Doubao.

ImagesiftBot is billed as a reverse image search tool, but it’s associated with The Hive, a company that produces models for image generation. It’s not definitively scraping for “AI” models, but there are enough reasons to be concerned that it may be. Commenters here have suggested its inclusion. If anyone from the company would like to clarify, we’re all ears.

PerplexityBot is used by perplexity.ai, an AI-based search engine. While billed as a search engine, it can and does generate text based on scraped material.

cohere-ai is an unconfirmed bot believed to be associated with Cohere’s chatbot. It falls into the same class as ChatGPT-User, as it appears to trigger in response to a user-directed query.

ChatGPT has been previously reported to use another unnamed bot that had been referencing Reddit posts to find “quality data.” That bot’s user agent has never been officially identified and its current status is unknown. Reddit, Tumblr, and others have recently announced their intent to license their users’ content to the “AI” industry and in some cases, there are no opt-out controls. Protecting your work outside of your own websites is likely to become increasingly complicated.

Updating or Installing robots.txt

You can check if your website has a robots.txt by going to yourwebsite.com/robots.txt. If it doesn’t find that page, then you don’t have one.

If your site is hosted by Squarespace (see below), or another simple website-building site, you could have a problem. At present, many of those companies don’t allow you to update or add your own robots.txt. They may not even have the ability to do it for you. I recommend contacting support so you can get specific information regarding their current abilities and plans to offer such functionality. Remind them that once slurped up, you have no ability to remove your work from their hold, so this is an urgent priority. (It also demonstrates once again why “opt-out” is a bad model.)

If you are using Wix, they provide directions for modifying your robots.txt here.

If you are using Squarespace, they provide directions for blocking a very fixed set of AI scraping bots here. They will allow you to block some, but not all of the bots mentioned in this post.

If you are using WordPress (not WordPress.com; see below), there are a few plugins that allow you to modify your robots.txt. Many SEO (Search Engine Optimization) plugins include robots.txt editing features. (Use those instead of making your own.) Here are a few we’ve run into:

- Yoast: directions
- AIOSEO: directions (there’s a report in the comments that user agent blocking may not be working at the moment)
- SEOPress: directions
- Dark Visitors: (under option 2) – this one will self-update with newly discovered bots, they also maintain a useful website for information about different bots

If your WordPress site doesn’t have a robots.txt or something else that modifies robots.txt, I recommend installing something like the Dark Visitors plugin. It provides a wide and robust range of bot-blocking capabilities. (The people behind the Dark Visitors plugin have been a very reliable and regularly updated source of information about bots via their website.) To avoid any signs of favoritism, I’ll mention other plugins I’ve heard of as well: Block AI Crawlers, Simple NoAI and NoImageAI. Do not install and activate more than one of these at a time.

For more experienced users: If you don’t have a robots.txt, you can create a text file by that name and upload it via FTP to your website’s home directory. If you have one, it can be downloaded, altered and reuploaded. If your hosting company provides you with cPanel or some other control panel, you can use its file manager to view, modify, or create the file as well.

If your site already has a robots.txt, it’s important to know where it came from as something else may be updating it. You don’t want to accidentally break something, so talk to whoever set up your website or your hosting provider’s support team.
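For the manual route described above, the edit amounts to appending a block for each bot that your existing file doesn't already name. This is a minimal sketch, assuming you have already downloaded the file's contents as text. `AI_AGENTS`, `listed_agents`, and `add_missing_blocks` are hypothetical helper names, not part of any plugin or hosting tool.

```python
# User-agent tokens taken from the robots.txt shown earlier in this post.
AI_AGENTS = (
    "CCBot", "ChatGPT-User", "GPTBot", "Google-Extended", "Applebot-Extended",
    "anthropic-ai", "ClaudeBot", "Omgilibot", "Omgili", "FacebookBot",
    "Diffbot", "Bytespider", "ImagesiftBot", "PerplexityBot", "cohere-ai",
)


def listed_agents(robots_txt):
    """Return the lower-cased user-agent tokens already named in robots_txt."""
    found = set()
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # ignore comments
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "user-agent":
            found.add(value.strip().lower())
    return found


def add_missing_blocks(robots_txt, agents=AI_AGENTS):
    """Append a 'Disallow: /' block for each agent not already named."""
    present = listed_agents(robots_txt)
    blocks = [
        f"User-agent: {agent}\nDisallow: /\n"
        for agent in agents
        if agent.lower() not in present
    ]
    if not blocks:
        return robots_txt
    base = robots_txt.rstrip("\n")
    prefix = base + "\n\n" if base else ""
    return prefix + "\n".join(blocks)


existing = "User-agent: *\nDisallow: /private/\n\nUser-agent: GPTBot\nDisallow: /\n"
print(add_missing_blocks(existing, ["GPTBot", "CCBot"]))
```

One limitation: the sketch only checks whether a bot is named, not what its existing rules say. Review the result before uploading it, particularly if something else on your site also writes to robots.txt.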

Firewalls and CDNs (less common, but better option)

Your website may have a firewall or CDN in front of your actual server. Many of these products have the ability to block bots and specific user agents. Blocking the user agents (CCBot, GPTBot, ChatGPT-User, anthropic-ai, ClaudeBot, Omgilibot, Omgili, FacebookBot, Diffbot, Bytespider, PerplexityBot, ImagesiftBot, and cohere-ai) there is even more effective than using a robots.txt directive. (As I mentioned, directives can be ignored. Blocks at the firewall level prevent them from accessing your site at all.) Some of these products include Sucuri, Cloudflare, QUIC.cloud, and Wordfence. (Happy to add more if people let me know about them. Please include a link to their user agent blocking documentation as well.) Contact their support if you need further assistance.

CLOUDFLARE USERS: In July 2024, Cloudflare updated their settings to allow you to block AI bots in the Web Application Firewall (WAF). Directions are in the linked post. The list of bots they are blocking is extensive and they’ve committed to updating it to block new bots as they are found.

NOTE: Google-Extended and Applebot-Extended aren’t bots. You need to have this in your robots.txt file if you want to prevent them from using your site content as training.

.htaccess (another option)

In the comments, DJ Mary pointed out that you can also block user agents with your website’s .htaccess file by adding these lines:

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (CCBot|ChatGPT|GPTBot|anthropic-ai|ClaudeBot|Omgilibot|Omgili|FacebookBot|Diffbot|Bytespider|PerplexityBot|ImagesiftBot|cohere-ai) [NC]
RewriteRule ^ - [F]

I’d rate this one as something for more experienced people to do. This has a similar effect to that of the firewall and CDN blocks above.
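To see which requests a pattern like this would catch, you can try it against sample user-agent strings before touching your server. This is a rough sketch using Python's `re` module, with `re.IGNORECASE` standing in for the `[NC]` flag. The sample strings are simplified, made-up examples rather than the exact strings any bot sends, and `would_be_forbidden` is a hypothetical helper name.

```python
import re

# Same alternation as the RewriteCond above; IGNORECASE stands in for [NC].
BLOCKED = re.compile(
    r"(CCBot|ChatGPT|GPTBot|anthropic-ai|ClaudeBot|Omgilibot|Omgili|"
    r"FacebookBot|Diffbot|Bytespider|PerplexityBot|ImagesiftBot|cohere-ai)",
    re.IGNORECASE,
)


def would_be_forbidden(user_agent):
    """Return True if the pattern matches anywhere in the user-agent string."""
    return BLOCKED.search(user_agent) is not None


# Simplified, made-up user-agent strings for illustration only.
samples = [
    "Mozilla/5.0 (compatible; gptbot/1.0)",
    "ChatGPT-User/1.0",
    "ExampleBrowser/1.0",
]
for sample in samples:
    print(sample, "->", would_be_forbidden(sample))
```

Because the pattern is unanchored, the short “ChatGPT” entry also matches ChatGPT-User.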

NOTE: Google-Extended and Applebot-Extended aren’t bots. You need to have them in your robots.txt file if you want to prevent your site content from being used for training.

Additional Protection for Images

There are some image-scraping tools that honor the following directive:

<meta name="robots" content="noai, noimageai">

when placed in the header section of your webpages. Unfortunately, many image-scraping tools allow their users to ignore this directive.
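If you want to check whether a page already carries this tag, the HTML can be inspected with Python's built-in parser. This is a minimal sketch, assuming you have the page's HTML as a string. `RobotsMetaParser`, `robots_directives`, and the sample page are hypothetical and exist only for illustration.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        # Directives are comma-separated, e.g. "noai, noimageai".
        for part in content.split(","):
            if part.strip():
                self.directives.add(part.strip().lower())


def robots_directives(html):
    """Return the set of robots meta directives present in html."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


PAGE = """<html><head>
<meta name="robots" content="noai, noimageai">
</head><body></body></html>"""

print({"noai", "noimageai"} <= robots_directives(PAGE))
```

As with robots.txt, the presence of the tag only expresses your wishes. It does not force a tool to comply.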

Tools like Glaze and Mist can make it more difficult for models to perform style mimicry based on altered images. (Assuming they don’t get or already have an unaltered copy from another source.)

There are other techniques that you can apply for further protection (blocking direct access to images, watermarking, etc.) but I’m probably not the best person to talk to for this one. If you know a good source, recommend them in the comments.

Podcasts

The standard lack of transparency from the “AI” industry makes it difficult to know what is being done with regards to audio. It is clear, however, that the Common Crawl has audio listed among the types of data it has acquired. Blocks to the bots mentioned should protect an RSS feed (the part of your site that shares information about episodes), but if your audio files (or RSS feed) are hosted on a third party website (like Libsyn, PodBean, Blubrry, etc.), it may be open from their end if they aren’t blocking. I am presently unaware of any that are blocking those bots, but I have started asking. The very nature of how podcasts are distributed makes it very difficult to close up the holes that would allow access. This is yet another reason why Opt-In needs to be the standard.

ai.txt

I just came across this one recently and I don’t know which “AI” companies are respecting Spawning’s ai.txt settings, but if anyone is, it’s worth having. They provide a tool to generate the file and an assortment of installation directions for different websites.

https://site.spawning.ai/spawning-ai-txt

Substack

If you are using Substack, there is an option to “block AI training,” but it is defaulted to off. Courtesy of Alan Baxter on BlueSky: “If you do NOT want your publication to be used to train AI, open your publication, go to Settings > Publication details and switch it on.”

Image courtesy of Alan Baxter.

WordPress.com

If you are using the WordPress hosting provided by WordPress.com (this is not to be confused with WordPress installed on your own hosting plan), please be aware that they have partnered with “AI” companies and will be providing your content to those companies unless you opt-out. This post details what you can do to prevent that, but the option you are looking for (Prevent Third-Party Sharing) can be found under Settings → General, in the privacy section.

“Activating the “Prevent third-party sharing” feature excludes your site’s public content from our network of content and research partners. It also adds known AI bots to the “disallow” list in your site’s robots.txt file in order to stop them from crawling your site, though it is up to AI platforms to honor this request. Using this option also means your blog posts will not appear in the WordPress.com Reader.”

Closing

None of these options are guarantees. They are based on an honor system and there’s no shortage of dishonorable people who want to acquire your data for the “AI” gold rush or other purposes. Sadly, the most effective means of protecting your work from scraping is to not put it online at all. Even paywall models can be compromised by someone determined to do so.

Other techniques for informing bots that you don’t want work used for “AI” training (or might be willing to do so under license) are being developed. One such effort, TDM Reservation Protocol (TDMRep), has been drafted by a W3C community group but “is not a W3C Standard nor is it on the W3C Standards Track.” I am unaware of any bots currently employing this functionality, though some vendors have mentioned it to me. While overly complex, the advantage to something like this is that it would avoid the necessity of having to block individual bots and companies. Like others, it would not protect you from bad actors.

Writers and artists should also start advocating for “AI”-specific clauses in their contracts to restrict publishers using, selling, donating, or licensing your work for the purposes of training these systems. Online works might be the most vulnerable to being fed to training algorithms, but print, audio, and ebook editions developed by publishers can be used too. It is not safe to assume that anyone will take the necessary efforts to protect your work from these uses, so get it in writing.

[This post will be updated with additional information as it becomes available.]

9/28/2023 – Added the recently announced Google-Extended robots.txt product token. This must be in robots.txt. There are no alternatives.

9/28/2023 – Added Omgilibot/Omgili, bots apparently used by a company that sells data for LLM training.

9/29/2023 – Adam Johnson on Mastodon pointed us at FacebookBot, which is used by Meta to help improve their language models.

11/6/2023 – Added anthropic-ai user-agent used by Anthropic.

11/16/2023 – Added Substack section information provided by Alan Baxter.

11/17/2023 – Squarespace has provided directions to block some, but not all bots mentioned in this post.

12/12/2023 – Added information about Cloudflare AI blocking rules in WAF.

1/25/2024 – Added Bytespider bot courtesy of darkvisitors.com.

2/28/2024 – Added WordPress.com directions for preventing the sharing of your content with their “AI” partners.

3/1/2024 – Added ImagesiftBot due to concerns raised by people commenting on this post. At this time, it’s not a confirmed “AI” scraping bot that is contributing to art generating models, but there’s sufficient cause to be cautious due to its association with a company (The Hive) that does. Hive hasn’t revealed how it acquires data or from where.

3/23/2024 – Added information about Diffbot and cohere-ai.

4/8/2024 – Added Dark Visitors WordPress Plugin

4/27/2024 – Added ClaudeBot as per comment from Scott Adams

5/4/2024 – Added PerplexityBot as per comment from Marijka Azzopardi

6/11/2024 – Added AppleBot as per comment from Robin Phillips and updated the list of WordPress plugins