Award-Winning Editor of Clarkesworld Magazine, Forever Magazine, The Best Science Fiction of the Year, and More

AI statement

REVISED 6/12/23 based on feedback

Where We Stand on AI in Publishing

Recent innovations in AI, particularly in art and text generation, narration, and other areas, will have a significant impact on the publishing industry. AI technologies are here to stay, and even their developers say that regulation is needed. If we publishing professionals want to be part of that conversation, we need to be clear about our concerns, expectations, and intentions. We don’t have a formal organization or group that can speak for us, but as individuals we can voice our concerns, attitudes, and hopes. This was created as an open document that others can sign onto or adopt and modify as they need. The point is to speak up.

This version is a near-final draft. Feedback is welcome, but if you are here to discourage or silence, please move along.

We believe that:

Regarding AI & AI-Detection Development

  • AI technologies will likely create significant breakthroughs in a wide range of fields, but those gains should be earned through ethical use and acquisition of data.
  • the increased speed of progress achieved by acquiring AI training data without consent is not an adequate or legitimate excuse to continue employing those practices.
  • the companies and individuals sourcing data for the training of AI technologies should be fully transparent about the copyrighted works used for training, including, but not limited to, providing a searchable index of fully-attributed works that includes when, where, and how they were acquired.
  • the companies and individuals sourcing data for the training of AI technologies should be required to publicly identify the user-agents of all data-scraping tools and bots, provide simple methods to opt out of collection, and honor robots.txt and other standard bot-blocking technologies. (A sample robots.txt entry follows this list.)
  • copyright holders should have the ability to remove their work from training datasets (to prevent further training use) that have been created from digital or analog sources without their explicit consent.
  • user input to AI systems/tools should not be stored or used as further training material without opt-in consent.
  • AI technologies also have the potential to create significant harm and, to help mitigate some of that damage, the companies producing these tools should be required to provide easily available, inexpensive (or subsidized), and reliable detection tools.
  • detection and detection-avoidance will be locked in a never-ending struggle similar to that seen in computer virus and anti-virus development, but that it is critically important that detection not continue to be downplayed or treated as a lesser priority than the development of new or improved AIs.
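
For illustration, the opt-out described above might look like the following robots.txt entry. This is a minimal sketch: CCBot is Common Crawl’s long-established crawler identifier, while ExampleAIBot is a hypothetical stand-in for any AI scraper that publicly identifies itself as requested.

    # Block known or self-identified AI training crawlers site-wide.
    # "CCBot" is Common Crawl's crawler; "ExampleAIBot" is hypothetical.
    User-agent: CCBot
    Disallow: /

    User-agent: ExampleAIBot
    Disallow: /

A robots.txt entry is only effective when a scraper identifies itself and honors the file, which is why the bullet above asks for both.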


Regarding the Publishing Industry

  • publishers do not have the right to use contracted works in the training of AI or related technologies without contracts that have clauses explicitly granting those rights.
  • submitting a work for consideration does not entitle a publisher, agent, or submission software developer to use it in the training of any AI or related technologies.
  • publishers or agents that utilize AI in the evaluation of submitted works should indicate that in their submission guidelines.
  • publishers or agents should clearly state their position on AI or AI-assisted works in their submission guidelines.
  • publishers should make reasonable efforts to prevent third-party use of contracted works as training data for AI technologies.
  • publishing contracts should include statements regarding all parties’ use, intended use, or decision not to use AI in the development of the work.
  • publishers and booksellers should clearly label works generated by or with the assistance of AI tools as a form of transparency to their customers.
  • publishers and agents need to recognize that current detection tools for generated art and text produce both false positives and false negatives and should not be relied upon as the sole basis for determining that a work has been generated.
  • publishers should grant authors the right to decline the use of AI-generated or -assisted cover art, translations, or audiobooks/narration on newly, currently, and previously contracted works.


Regarding Authors, Artists, Translators, and Other Copyright Holders

  • individuals using AI tools should be transparent about the tools’ involvement in their process when those works are shared, distributed, or otherwise made available to a third party.
  • individuals should be respectful of publishers’, booksellers’, and other professionals’ policies regarding their desire not to work with AI-generated or -assisted works and not misrepresent those works as entirely their own.
  • individuals should acknowledge that there are limits to what a publisher can do to prevent the use of their work as training data by a third party that does not respect their right to say “no.”


Regarding Governments

  • governments should craft meaningful legislation that protects the rights of individuals, promotes the promise of this technology, and specifies consequences for those who seek to abuse it.
  • “fair use” exceptions (and similar exceptions in other countries) with regard to authors’, artists’, translators’, and narrators’ creative output should not apply to the training of AI technologies and that explicit consent to use those works should be required.
  • governments should be seeking advice on this legislation from a considerably wider range of people than just those who profit from this technology.
  • governments should attempt to reach a mutually-beneficial understanding with other countries on standards regarding AI development and training.
  • copyright should not be extended to generated works.


NOTES:

  • AI-generated and -assisted are not specifically defined because the technology is evolving and any current definition would likely create future loopholes. Broad terms are meant to encourage transparency and allow the parties involved to determine whether or not such uses cross their individual lines. For example, one’s attitude towards AI-assisted research may be different for non-fiction vs. fiction.
  • AI detection is a complex problem on the same level as, or perhaps a higher level than, the development of AI systems themselves. AI developers are in the best position to address these challenges, either within their tools (watermarking, throttling, etc.), via separate applications, or through some combination of efforts. They are also the only party capable of collecting sufficient materials generated by their systems, should that become a necessary dataset for detection. Industry professionals have been quoted saying things like “get used to it” and have put insufficient effort into addressing the problems they are creating. Providing a means of detection should be a cost of doing business in AI and not an alternate revenue stream where they can earn from both sides of an AI/AI-detection arms race.
  • We are in no way advocating the end of “fair use” or its equivalents in other countries. While members of the tech industry have made it clear that they would like fair use to apply to their AI training data, the courts have not yet agreed with them. We’re saying fair use shouldn’t be permitted to apply in this case. There are already vendors producing “ethically-sourced” datasets and we welcome this development. Preventing the wholesale use of copyrighted works may slow data acquisition (and by extension development of some AI products), but it does not end or undermine it.


49 Comments

  1. Well said. I am in.

  2. Mel Melcer

    Hear, hear!

  3. Mohd. Arman ul Haq

    Couldn’t be said better than this.
    I’m in.
    Let’s encourage creativity.

  4. Holly Messinger

    I’ll be sharing this. Thanks for articulating it so clearly.

  5. I agree with everything you’ve said here.

  6. John Doerschuk

    First, please define the acronym LLM as Large Language Model early in the statement. Not everyone is familiar with the term.

    This seems to boil down to a couple of basic things:

    As a baseline, LLM developers should not use the works of others for training without permission.
    Contractual agreement to use a copyright holder’s works in training an LLM should be required.
    Copyright holders must have the right to require their works to be removed from a training database in the event a contractual agreement for use does not exist.
    Transparency in what materials are being used in LLM training is required.
    LLM-generated materials must be detectable and clearly identified as LLM-generated.
    LLMs cannot copyright generated materials, nor can publishers of LLM-generated materials.

    These things make sense to me.

    LLMs and ‘art’ generation software are here and are not going away. We have to, very quickly, determine how to manage them on a global scale, not just in a single country, or we will have many issues going forward.

  7. Salvantonio Clemente

    Thank you for being at the vanguard at this critical juncture. I can find nothing to criticize in this well-reasoned and even-handed manifesto.


  9. It is ABSOLUTELY essential that writers, editors, and publishers agree to, promote, and continually support declarations such as this in order to preserve and perpetuate the human element in the creation of all types of written works.

    Count me ALL IN!!!!!

  10. Douglas Phillip Marx

    An excellent start. Thank you for getting the ball rolling.

  11. Kimberly Rei

    Thank you for putting words to thoughts I’ve struggled to express. On board with all of this.

  12. Yes! Five stars. Thank you. We really need this.

  13. Spot on. Thank you for writing this. It says everything that needs to be said.

  14. As I understand it, detection can be made very simple if developers choose to make it so. By including a “watermark” in LLM output, such as altering the frequency of letters in the text or embedding patterns of letters, they can make it possible for a basic program to detect a specific LLM’s output.

    Now this watermark could be removed with a lot of editing, but it would cut way down on low-effort spam, because it would no longer be as low effort.
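
    As a rough sketch of the statistical idea behind such a watermark (a toy illustration, not any vendor’s actual scheme; production proposals bias token choices using a secret key):

      # Toy frequency-based watermark detection. Real schemes (e.g.,
      # "green list" token biasing) work on tokens with secret keys;
      # this demo hashes whole words with a fixed seed.
      import hashlib

      def is_green(word, seed="demo-key"):
          # Deterministically assign each word to a "green" or "red" half.
          digest = hashlib.sha256((seed + word.lower()).encode()).digest()
          return digest[0] % 2 == 0

      def green_fraction(text):
          words = text.split()
          return sum(is_green(w) for w in words) / max(len(words), 1)

      # Unwatermarked text hovers near 0.5; a generator that prefers
      # "green" words pushes this fraction measurably higher.
      print(green_fraction("the quick brown fox jumps over the lazy dog"))

    Heavy editing drifts the fraction back toward 0.5, which matches the comment’s caveat that enough editing removes the watermark.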

  15. I’m happy to sign in on this. Thanks for your work on this serious issue.

  16. I’m no technology/AI expert, but I don’t think I saw a requirement to disclose source material. Say I purchase an AI-generated product. Would it be of value to require disclosure of all source material, unless the creator of the source material opts out?

    AI is a massive intellectual property endeavor. I would think some intellectual property attorneys should have input.

    Also accountants, from a top-shelf firm. Say there are royalties to be paid for material used to create the AI projects. Well, that means it needs to be auditable. What other financial considerations need to be included?

    Also, what about when disputes arise? Where does the content provider go to file a complaint? Is there an existing federal agency for this, or does one need to be created?

    I recognize this is a statement of principles, but I do feel that the input from these other disciplines will be invaluable at the front end. Ciao for now!

  17. Well, thanks. Mutual respect based on full transparency and choice sounds like a nice thing to have.

  18. Where do I sign?

  19. Very much agree.
    We need to do our best, as publishers (or in other positions within the publishing industry), to ensure that AI creations are not used in any part of our publishing process, to ensure to the best of our ability that our writers and artists are not used for AI training, and to ensure that human creativity is paramount in our process.

  20. William Powell

    Mostly clear, but I had to reread

    We believe that authors should acknowledge that there are limits to what a publisher can do to prevent the use of their work as training data by a third party that does not respect their right to say “no.”

    about 5 times before I worked out what your point was. Rephrase as ‘I believe that when X then Y’?

  21. I am in complete agreement with you in this matter.

  22. Tade Thompson

    I’d sign my name to this.

  23. T Vogl

    How about: LLMs should develop and honor a data metatagging format for websites similar to what “honorable” web crawlers supported (e.g., NoIndex, NoFollow). This should stop them from ingesting web content that owners do not want used for training purposes.
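
    A sketch of what such a tag might look like, modeled on the “noai”/“noimageai” directives some platforms have already begun emitting (the directive names are illustrative; no formal standard exists yet):

      <!-- Page-level opt-out of AI training, analogous to NoIndex/NoFollow -->
      <meta name="robots" content="noai, noimageai">

    As with robots.txt, this only binds crawlers that choose to honor it, which is why enforceable backing would also be needed.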

  24. Thank you for being such a strong leader in this area.

  25. I agree completely. We need all the original creativity we can get.

  26. Thank you for putting this together! Well thought out and a great start. Regulation on such a fast-moving technology must start from those who are most affected by it.

  27. “We believe that copyright holders should be able to request the removal of their works from databases that have been created from digital or analog sources without their consent.”

    Not sure this one makes sense, at least from how I understand LLMs to work (I could be wrong). From my understanding, the model that creates an LLM analyzes a text and then creates “weights” based upon the frequency of words, their relative placement, the correlations between them, etc. But once the text has been analyzed and the statistical data extracted, it’s no longer being “used” by the LLM. The text itself is not stored in any database, just its weights. Consequently, how could one “remove” a copyrighted text from an LLM once its weights have been added? Re-analyze it and “un-weight” its data? More accurate would be to say that copyrighted texts should not be analyzed without consent in the first place.

    • This is in reference to the databases of scraped data that are used to train AI models. Intent is to prevent further use of already scraped or otherwise acquired data in the development of new systems or the enhancement of existing ones.
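
    A toy example may clarify both halves of this exchange: training distills text into statistics, so “removal” realistically means excluding a work from the dataset before future training runs. (The corpus and function names below are invented for illustration.)

      # Toy "training" that reduces a corpus to bigram counts. The counts
      # are statistics about the text, not the text itself, so a work can
      # only be "removed" by excluding it before the next training run.
      from collections import Counter

      def train_bigrams(corpus):
          counts = Counter()
          for text in corpus:
              words = text.split()
              counts.update(zip(words, words[1:]))
          return counts

      model = train_bigrams(["the cat sat on the mat",
                             "the dog sat on the rug"])
      print(model[("sat", "on")])  # 2 -- a statistic, not the sentences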

  28. Very well said, and yes, thank you so much for taking on this critical endeavor.

  29. I agree, Neil. Well done. One suggestion: when you get to the final draft, lump the pieces on the same topic together (authors’ rights, publishers’ obligations, etc.). I suspect you planned to do that anyway, but just in case…

  30. a) Not just yes, but hell, yes.
    b) Weeks ago, I wrote to my Congressman about the issues with AI, asking for a response and asking that copyright law be expanded with one simple clause: “Generative AI may not be trained on copyrighted works without the explicit permission of the rights owner.” In other words, if they get an electronic copy (say, of my novel) and train an AI on it without ever contacting me, they have violated copyright.

    • Oh, and in the contract I signed in mid-April, I got the publisher to include that they would not do this. I wanted it stated on the disclaimer page, but their lawyer said they couldn’t police that.

    • Mark, would you be willing to post a copy of your letter as a template for the rest of us?

  31. Thank you for this. I edit a small peer-reviewed arts journal. As a researcher in the human sciences and an editor, I will say that the standards for non-fiction and research under review should include even stronger statements concerning the inappropriate use of AI-generated writing in research. In short: we have made the decision to accept no AI-generated writing of any kind. I look forward to the day when all of us in publishing act together, be it fiction, poetry, academic and science writing, or prose in journalism and editorial works. I completely agree with you on the issue, endorse these standards, and completely agree on requiring a contractual basis for training LLMs in AI systems.

  32. Well said. I will sign.

  33. Pierre Couillaud

    I would add three paragraphs (feel free to rephrase them in a better way if you want, but you’ll get the idea):

    1: Copyright holders should have access to a quick, free, and effective mechanism (something similar to a DMCA claim) to require the immediate removal of problematic AI-made material as soon as they discover it.
    “Problematic” in this case being defined as either an AI-made output (novel, image, etc.) that clearly infringes on their copyright, or an entire AI model that has been discovered to have been trained on their copyrighted material without consent.
    The removal of problematic AI materials should not be tied to a long and costly legal procedure every single time, as it would otherwise be impossible for freelancers who do not have an entire legal team and budget at their disposal to keep up.

    2: Whenever a problematic (copyright-infringing) AI system is discovered, ALL parties complicit in the infringement should be held accountable:
    - the user, if they knowingly purchased or used the problematic AI model.
    - the individual or company who made the problematic AI model.
    - the website or platform that allowed the problematic AI model or its output to be distributed without proper checking or safeguards.
    - when the problematic AI was created by way of additional training or modification of another AI system, the company who made the original AI system should also be held accountable, for having failed to put safeguards in place to prevent this type of unethical modification.

    3: The legal penalties for unethical use of ML/AI technologies should be large enough to be actually dissuasive.
    Otherwise, the very existence of those laws is moot. The fines will be cynically seen as nothing more than an “acceptable cost” of doing business by those companies, and the consequences will be far lower than the benefits of breaking the law anyway.

    In order to ensure that they are dissuasive, the fines should be based on a percentage of income rather than fixed values, with serious offenses potentially going as high as 10% or 20% of the company’s yearly income for the entire period in which they operated the unethical AI system.
    There should also be potential criminal charges against CEOs or other individuals in serious cases, possibly including prison time.

    Finally, it goes without saying, any AI model found to contain non-consented copyrighted materials in its dataset, or to be otherwise problematic or unethical, should be immediately destroyed.
    All other AI models based on or derived from it, as well as all media output by that model or any of its derivatives, should also immediately lose all legal legitimacy and be legally required to be destroyed too.

  34. Pierre Couillaud

    On another note, Neil, if you haven’t already done so, I HIGHLY recommend you send this draft to the U.S. Federal Register. On their “AI Accountability Policy Request for Comment” page, they are currently asking for comments precisely like this: what the public/artists would want fair legislation to look like.
    Here:
    https://www.federalregister.gov/documents/2023/04/13/2023-07776/ai-accountability-policy-request-for-comment

    Please also spread it to other artists you know if they have not sent their comments in as well. They are listening and this is super important, but I see hardly anyone talking about it…

    • I sent in a comment, using up nearly all my available characters, LOL. Thanks for putting up the link and for the heads-up.

  35. Well done. I was glad to see publishers start to include a statement that the content wasn’t generated in whole or in part by AI. It isn’t a solution to people who game systems, but it’s a statement of intent.

  36. I totally agree on the “opt-in” for training data. Copyrighted work should not be used for training commercial AI models without the rights holders’ express permission.

    We’re already seeing this happening with commercial AI-generated art platforms like Shutterstock, Adobe’s Firefly and Getty Images, so fully-licensed generative text AI models probably won’t be far behind.

    As far as labeling AI-generated work, this might end up having the same effect as those “cookie notices” on websites — people just start ignoring them because they’re everywhere.

    For a publisher, it would be a no-brainer to include the notice “some of the content in this work was generated by AI” on every single work if there’s a legal requirement to do so. After all, every piece of work out there has SOME words in it generated by AI — this comment, for example, had words generated by AI (I left out the word “be” when I originally wrote the first sentence in this paragraph, and an AI suggested that I add it). And generative AI is about to be included as a feature in Google Docs and Microsoft Word.

    So publishers, to be on the safe side, will mark every piece of work as having some AI-generated content. Once nearly every published work has this disclaimer, it will become meaningless.

    I don’t have a solution to propose here.

    I support the right of readers to avoid AI-generated content if they want to do so. I myself hate watching those repetitive, AI-generated “news” videos and downrank them whenever YouTube recommends them to me. But many of my favorite YouTube commentators are probably already using AI to help research, write, or edit their content — and I don’t mind if they do as long as it makes their videos better.

    As far as copyrighting AI-generated works goes, I have a feeling that these copyrights will soon become possible. After all, photographers can copyright their photographs, even if “all they did was press a button.” Plus, AI-generated text is often heavily edited by humans. Businesses in particular want to generate text with AI *and* want to protect that text, so there’s going to be considerable pressure here from the business community.

  37. Well put. I’d sign on.

  38. I appreciate everything you’ve said. My limited experience with ChatGPT (which I understand has been superseded) was uninspiring. I got tired of the constant insistence that my writing, and especially the characters, be less imaginative and more woke. I had not considered the use of AI for publishers sorting and selecting what to accept, but I see from your comments how that could be manipulated to further politicize the field. If popularity were part of the algorithm, we might never see another classic—only unending versions of Twilight and Spider-Man. I certainly would not like my stories being used to train AI, lest the world soon confuse my writing with synthetic creations.

  39. I do think that a statement should be made about compensation to those who weren’t notified their works were used to train the systems. A data set is a library; libraries have to license their works or negotiate (as in the case of the Bloomberg terminals at Stavros Niarchos or their backup copy of the Foundation Library) the terms of the free use of the data set, works, etc.

    If works were used illegally in a data set (read: library) to train a given thing without notification to the creators, the creators own a share in common with the LLM. They’re entitled to a royalty payout of the revenue it generates, and the easiest way to navigate that is through a sovereign wealth fund, managed by the state, that pays out a royalty for the maximum benefit of all stakeholders (inflation-proofed, invested in the global economy, etc.).

  40. Gus Loe

    I’m with you on this, Neil.

    However, I am severely dyslexic and I already use specialist software to help correct spelling and grammar errors. That software is good, and to a degree it adapts to the sorts of mistakes I make. But that software doesn’t catch all the mistakes, and there is still quite a bit of work to do in getting the text right. Using an AI tool based on LLMs could make that sort of software even better and more useful to me and other dyslexics. So if we are to have guidelines, policies, and principles on the use of AI in literature and art – and we should – then we need to ensure that these don’t discriminate against those who need specialist tools to support their work. I don’t know how we should phrase a possible exception to policies and guidelines on the use of AI, nor do I know how we would police it, but if we are to avoid a lot of unnecessary litigation and pain for all involved, then we should at least give it some thought.

  41. This is good stuff. My one concern is this clause: “copyright holders should have the ability remove their work from training datasets (to prevent further training use) that have been created from digital or analog sources without their explicit consent.”

    This is an opt-out rather than opt-in for authors, and experience has taught us that opt-outs are ineffectual as they rely on authors having to constantly track down and opt out of ever-proliferating databases, and it becomes impossible to keep up.

  42. We at Abyss & Apex are standing with you on this, Neil.
