The US Copyright Office recently released a pre-publication draft of Part 3 that addresses issues arising from using copyrighted works in developing and training generative AI systems, including infringement, fair use, and licensing.
We previously commented here on Part 2 of the Report on Copyright and AI (Report) by the US Copyright Office (Office).
Key conclusions by the Office include that generative AI training engages various rights, fair use is flexible enough to address issues raised by generative AI training, and voluntary licensing arrangements are the preferred mechanism for authorizing access. Below we summarize Part 3 of the Report and outline some practical tips and considerations in view of the Report.
Technical primer
Generative AI models, including large language models (LLMs) and image generators, are developed through machine learning techniques. Central to improving their functionality is training on vast datasets. Many of these datasets may comprise copyrighted works. The training process involves multiple phases, including pre-training to learn general patterns, and post-training or fine-tuning for specific tasks.
Developers scrape, filter, clean, curate, and aggregate data into vast training datasets. From these datasets, generative AI models learn to predict outputs from inputs by iteratively adjusting model “weights” until the model produces the desired outputs.
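The weight-optimization step described above can be illustrated with a deliberately simplified sketch. This is a generic toy example of training by error correction, not the architecture of any system discussed in the Report; the function names and the one-weight model are purely illustrative.

```python
# Toy illustration of "training": a model with a single weight w predicts
# y from x as w * x, and each pass over the examples nudges w to reduce
# the prediction error. Real generative AI models do the same thing with
# billions of weights over vast datasets.

def train(examples, epochs=100, lr=0.01):
    """Fit a single weight w so that the prediction w * x approximates y."""
    w = 0.0  # the model "weight", adjusted during training
    for _ in range(epochs):
        for x, y in examples:
            pred = w * x          # model output from the current weight
            error = pred - y      # how far off the prediction is
            w -= lr * error * x   # gradient step toward lower error
    return w

# Training data whose underlying pattern is y = 2x.
dataset = [(1, 2), (2, 4), (3, 6)]
w = train(dataset)
# After training, w is close to 2: the model has encoded the statistical
# pattern of the data rather than storing the examples themselves.
```

The point of the sketch is the one the Report's technical primer makes: what the training process produces is a set of optimized numerical weights derived from the data, which is why the extent to which those weights "memorize" protected expression is debated.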
Copyright implications in AI training
The Office identifies several aspects of AI development that may implicate copyright owners’ rights:
- Downloading, copying, and modifying copyrighted works for training datasets generally implicate the reproduction right. The process may also involve creating derivative works and, in some cases, removing copyright management information.
- Training models on copyrighted works involves further reproduction, including temporary copies during model optimization. The Office noted ongoing debate regarding the extent to which model weights “memorize” or retain copies of protected works. Further distribution of these weights may engage the reproduction right where they reflect substantial similarity to copyrighted works.
- Generative AI systems may output material that replicates or closely resembles copyrighted works. Using retrieval-augmented generation (RAG) to incorporate external content at the moment of output generation may also raise implications for infringement.
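The RAG mechanism noted in the last bullet can be sketched in a few lines. This is a hypothetical, minimal illustration (the document store, keyword retrieval, and function names are invented for the example, not a vendor API), but it shows why RAG outputs can carry protected expression: retrieved text is inserted verbatim into the model's prompt at generation time.

```python
# Minimal sketch of retrieval-augmented generation (RAG): before the model
# generates a response, relevant external content is retrieved and placed
# into the prompt. A real system would use embedding-based search over a
# large corpus; naive keyword matching stands in for that here.

DOCUMENTS = {
    "news": "Full text of a copyrighted news article ...",
    "report": "Full text of an analyst report ...",
}

def retrieve(query):
    """Naive keyword retrieval: return documents whose key appears in the query."""
    return [text for key, text in DOCUMENTS.items() if key in query]

def build_prompt(query):
    """Incorporate retrieved content into the prompt at generation time."""
    context = "\n".join(retrieve(query))
    return f"Context:\n{context}\n\nQuestion: {query}"

prompt = build_prompt("summarize the news article")
# The retrieved article text now sits verbatim in the prompt, so the
# generated summary may closely track the retrieved copyrighted work.
```

Because the copyrighted text enters the system at the moment of output generation rather than during training, RAG raises the output-side infringement and market-substitution concerns the Office discusses separately below.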
Fair use analysis
The fair use doctrine is a defence to copyright infringement. Multiple factors are considered in a fair use assessment. The pre-publication guidance explores the factors in depth under Section IV.
Factor One (whether the infringing use has a further purpose or different character − transformativeness and commerciality) was considered in great detail in view of a recent Supreme Court decision. The Office noted that while adding new expression can be relevant to evaluating whether a use has a different purpose and character, it does not necessarily make the use transformative. While transformative use is not an exception to infringement, it is an important factor in the fair use analysis, closely linked to the commerciality inquiry.
The Office rejected two common arguments that AI training is inherently transformative: (1) that using copyrighted works to train AI models is transformative by its very nature, and (2) that AI training is transformative because it resembles human learning.
The Office further stated that, “In the Office’s view, training a generative AI foundation model on a large and diverse dataset will often be transformative,” as “[t]he process converts a massive collection of training examples into a statistical model that can generate a wide range of outputs across a diverse array of new situations,” that is “meant to perform a variety of functions, some of which may be distinct from the purpose of the copyrighted works they are trained on.”
However, the Office further noted “although transformativeness often leads to a finding of fair use, not every transformative use is a fair one,” and “[u]ses that merely change the medium, or spare the user inconvenience, are not transformative.” The Office took a nuanced view of transformativeness for generative AI, noting that “generative AI models may simultaneously serve transformative and non-transformative purposes.”
The Office considered retrieval-augmented generation separately and noted that “use of RAG is less likely to be transformative where the purpose is to generate outputs that summarize or provide abridged versions of retrieved copyrighted works, such as news articles, as opposed to hyperlinks.”
The Office focused most on whether the output would compete with the original work. Training a model for research may be highly transformative, while training models to generate outputs that compete with or closely resemble copyrighted works is not.
The commerciality of the use and whether access to the training data was lawful or authorized are also relevant considerations. Ultimately, the Office noted that some generative AI training will qualify as fair use while other uses will not, and courts will have to weigh all relevant factors in any given case. The Office identified issues with “data laundering” by non-commercial entities (e.g., academic and non-profit researchers), and noted that “commerciality does not turn solely on whether an organization is designated as ‘profit’ or ‘nonprofit,’ but whether the use itself furthers commercial purposes” and “the nonprofit status of an organization should not in itself preclude a finding of commerciality.”
For pirated or illegally accessed works, the Office’s view was that “knowing use of a dataset that consists of pirated or illegally accessed works should weigh against fair use without being determinative.”
Factor Two (nature of the copyrighted works) was considered only briefly; the Office simply noted that “the facts will vary depending on the model and works at issue,” contrasting highly creative works like novels with those with more factual or functional content, like computer code or scholarly articles.
Factor Three (amount and substantiality of the portion used in relation to the copyrighted work as a whole) is relevant in relation to the nuances of how the training set is curated or accessed. The Office noted that “[d]ownloading works, curating them into a training dataset, and training on that dataset generally involve using all or substantially all of those works. Such wholesale taking ordinarily weighs against fair use,” and that “[t]he Office agrees that the use of entire copyrighted works is less clearly justified in the context of AI training than it was for Google Books or a thumbnail image search.” Finally, the Office conceded that “the use of entire works appears to be practically necessary for some forms of training for many generative AI models.”
For Factor Three, the Office also noted that “the third factor may weigh less heavily against generative AI training where there are effective limits on the trained model’s ability to output protected material from works in the training data” (emphasis added), and “[a]s in the intermediate copying cases, generative AI typically do not make all of what was copied available to the public.”
Factor Four (effect on the potential market: lost sales (e.g., market substitution), market dilution, and lost licensing opportunities) was also considered in detail. For RAG in particular, it was noted that “retrieval of copyrighted works by RAG can also result in market substitution,” and “RAG augments AI model responses by retrieving relevant content during the generation process, resulting in outputs that may be more likely to contain protectable expression, including derivative summaries and abridgments.”
The Office took a wide perspective in terms of defining the “effect” on the market, rejecting a narrower view proposed by some commenters that the fourth factor analysis considers only harm to markets for the specific copyrighted work.
In terms of lost licensing opportunities, the Office noted that “[m]any commenters stated that individual and collective licenses for AI use were already in existence or under development” (in respect of public licensing deals), taking the position that “[w]here licensing markets are available to meet AI training needs, unlicensed uses will be disfavored under the fourth factor.”
Licensing mechanisms
The Office considered both theoretical and practical issues relating to voluntary licensing of copyrighted works to train AI models, as well as potential mechanisms for government intervention, including collective and compulsory licensing.
Overall, the Office suggested voluntary licensing was preferred and should continue to develop. The Office noted some AI models have been trained exclusively on licensed or public domain works, but recognized there remained challenges and contexts where voluntary licensing may prove infeasible. These challenges include identifying individual rightsholders and negotiating terms for large, diverse datasets.
The Office also noted that collective management organizations (CMOs) can reduce transaction costs and facilitate bulk licensing. Extended collective licensing (ECL) could be an avenue to address market failures in voluntary licensing. ECL would operate through CMOs and resemble voluntary collective licensing but with more oversight and an opt-out mechanism (rather than an opt-in mechanism). This would ensure a license could be obtained for all works of a particular class and resolve issues relating to identifying and licensing disparate works/owners.
The Office did not consider statutory compulsory licensing to be a workable solution, suggesting it should be deployed only as a last resort if market-based solutions prove unworkable.
The Office recognized that voluntary licensing may not be feasible at scale, but suggested voluntary licensing should continue to develop without government intervention, and that the proposed alternatives should be considered only after market failures in voluntary licensing had been identified.
Key practical implications
Generative AI developers should consider licensing where feasible. Where it is not feasible, a careful assessment of the source, purpose, and use of copyrighted works in training AI will provide a sense of risks and whether fair use may be available. Developers should also exercise caution in sharing model weights, particularly if the model can reproduce substantial portions of copyrighted works as a result.
For more information, please contact your IP professional at Norton Rose Fulbright Canada LLP.
For a complete list of our IP team, click here.