Jacqueline Charlesworth (Yale University - Law School) has posted Generative AI's Illusory Case for Fair Use on SSRN. Here is the abstract:
Pointing to Google Books, HathiTrust, Sega and other technology-driven fair use precedents, AI companies and those who advocate for their interests claim that mass unauthorized reproduction of books, music, photographs, visual art, news articles and other copyrighted works to train generative AI systems is a fair use of those works. Though acknowledging that works are copied without permission for the training process, the proponents of fair use maintain that an AI machine learns only uncopyrightable information about the works during that process. Once trained, they say, the model does not comprise or make use of the content of the training works. As such, they contend, the copying is a fair use under U.S. law.
This article challenges the above narrative by reviewing generative AI training and functionality. Despite wide employment of anthropomorphic terms to describe their behavior, AI machines do not learn or reason as humans do. They do not "know" anything independently of the works on which they are trained, so their output is a function of the copied materials. Large language models, or LLMs, are trained by breaking textual works down into small segments, or "tokens" (typically individual words or parts of words), and converting the tokens into vectors: numerical representations of the tokens and of where they appear in relation to other tokens in the text. The training works thus do not disappear, as claimed, but are encoded, token by token, into the model and relied upon to generate output. AI image generators are trained through a "diffusion" process in which they learn to reconstruct particular training images in conjunction with associated descriptive text. Like an LLM, an AI image generator relies on encoded representations of training works to generate its output.
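The token-and-vector pipeline the abstract describes can be illustrated with a toy Python sketch. Everything here is hypothetical for illustration only: real LLMs use learned subword tokenizers (not whitespace splitting) and embedding vectors learned during training (not hand-assigned values).

```python
# Toy illustration of the abstract's description: text is broken into
# tokens, and each token is converted into a vector of numbers.
# NOT any production system's tokenizer or embedding scheme.

text = "the cat sat on the mat"

# 1. Tokenization: split the text into small segments ("tokens").
#    Real systems use subword tokenizers, not whitespace splitting.
tokens = text.split()

# 2. Vocabulary: map each distinct token to an integer id,
#    preserving first-seen order.
vocab = {tok: i for i, tok in enumerate(dict.fromkeys(tokens))}

# 3. Embedding: map each token to a vector of numbers. These toy
#    values are fixed by formula; real models learn them in training.
dim = 4
embeddings = {tok: [(i * dim + j) / 10.0 for j in range(dim)]
              for tok, i in vocab.items()}

# The work is now represented token by token as a sequence of vectors.
encoded = [embeddings[tok] for tok in tokens]

print(tokens)       # ['the', 'cat', 'sat', 'on', 'the', 'mat']
print(encoded[0])   # vector for "the": [0.0, 0.1, 0.2, 0.3]
```

Note that both occurrences of "the" map to the identical vector: the model's numerical representation is a function of the input text, which is the point the abstract presses.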
The exploitation of copied works for their intrinsic expressive value sharply distinguishes AI copying from that at issue in the technological fair use cases relied upon by AI's fair use advocates. In those earlier cases, the determination of fair use turned on the fact that the alleged infringer was not seeking to capitalize on expressive content; generative AI does exactly the opposite.
Generative AI's claim to fair use is further hampered by the propensity of models to generate copies and derivatives of training works, which are presumptively infringing. In addition, some AI models rely on retrieval-augmented generation, or RAG, which searches out and copies materials from online sources without permission to respond to user prompts (for example, a query concerning an event that postdates the training of the underlying model). Here again, the materials are being copied and exploited to make use of expressive content.
For these and other reasons, each of the four factors of section 107 of the Copyright Act weighs against AI's claim of fair use, especially when considered against the backdrop of a rapidly evolving market for licensed use of training materials.
Highly recommended.