
Infringement risk relating to training a generative AI system
Global | Publication | May 2025
Generative AI systems are trained using vast amounts of data, often taken from publicly available sources that may be protected by copyright or other intellectual property rights, such as, in the UK and EU, a database right.
It may not be obvious what source data have been used for the training. The EU has included provisions in Regulation (EU) 2024/1689 (EU AI Act) to address this concern in relation to so-called general-purpose AI models.
Transparency and compliance requirements for training under the EU AI Act
Legal basis for use
Recital 105 of the EU AI Act acknowledges that training general-purpose AI (GPAI) models often involves copyright material and that use of such content requires authorization from rightsholders, unless a legal exception applies. (The EU AI Act defines GPAI models as those trained on large datasets that exhibit significant generality, performing a wide range of distinct tasks.) Under Directive (EU) 2019/790 on copyright and related rights in the Digital Single Market (DSM Directive), text and data mining (TDM) is permitted unless rightsholders opt out using machine-readable means (e.g., robots.txt, metadata). Many crawlers ignore these signals, creating compliance risks.
Code of practice
Article 53(1)(c) of the EU AI Act requires Providers of GPAI models to put in place a policy to comply with EU law on copyright and related rights, and in particular to identify and comply with (including through state-of-the-art technologies) a reservation of rights (expressed pursuant to Article 4(3) of the DSM Directive). The EU AI Act does not specify what such a copyright compliance policy should consist of. Article 53(4) provides that Providers of GPAI models may rely on codes of practice to demonstrate compliance until the European Commission establishes harmonized standards. As at the date of this publication, such a code of practice has not yet been finalized (see European Commission, General-Purpose AI Code of Practice). The code is voluntary but (by virtue of Article 53(4) of the EU AI Act) is an official compliance instrument under the EU AI Act. It is being developed with broad stakeholder involvement. Originally expected by 2 May 2025, as at the date of this publication:
Once finalized, adhering to the code of practice could serve as evidence of lawful conduct in AI training, especially regarding:
Transparency
Article 53(1)(d) of the EU AI Act requires Providers of GPAI models to draw up and make publicly available a sufficiently detailed summary of the content used for training a GPAI model, according to a template provided by the EU AI Office. This transparency measure is intended to allow creators to verify whether their works have been used in training and whether opt-out requests have been complied with. Drafts of the template have been published by the European AI Office (as at the date of this publication, the current draft is: Template for summary training data). For information on the regulation of AI in the EU more generally, see our blog, The EU AI Act – the countdown begins.
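As an illustration of the machine-readable opt-out mentioned above, the following is a minimal Python sketch of how a crawler might check a site's robots.txt before collecting a page for TDM. It uses the standard-library `urllib.robotparser` module. The crawler name `ExampleTDMBot`, the helper `may_fetch` and the sample robots.txt content are assumptions made for illustration only. Whether a robots.txt rule amounts to a valid reservation of rights under Article 4(3) of the DSM Directive is a legal question that this sketch does not answer, and robots.txt is only one of several possible opt-out signals.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical crawler name, used for illustration only.
USER_AGENT = "ExampleTDMBot"

# Assumed robots.txt content: the site blocks ExampleTDMBot entirely
# and keeps /private/ off limits to every other crawler.
SAMPLE_ROBOTS_TXT = """\
User-agent: ExampleTDMBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def may_fetch(robots_txt: str, url: str, user_agent: str = USER_AGENT) -> bool:
    """Return True if the robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


article = "https://example.com/articles/1"
print(may_fetch(SAMPLE_ROBOTS_TXT, article))              # False: opted out
print(may_fetch(SAMPLE_ROBOTS_TXT, article, "OtherBot"))  # True: no rule applies
```

A crawler that skips every URL for which `may_fetch` returns False would at least respect this one signal. A fuller compliance policy would presumably also need to handle metadata-based reservations and site terms of use.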
Could training a generative AI system using publicly accessible copyright work constitute an infringement?
Where the system is trained using a copyright-protected work without the copyright owner’s consent, and assuming that the training involves an act of copying¹ of the whole or a substantial part of the work, this would in many jurisdictions be an infringement, unless a relevant defense or exception applies.
Case study: Germany
Whether a defense or exception applies depends on the jurisdiction in which the training occurs. For example, in Germany:
Recent case law in Germany
Do relevant defenses/exceptions exist (assuming the system is used for commercial purposes)?
Australia
Unlikely. The Copyright Act 1968 (Cth) (CA 1968) ‘fair dealing’ defenses for copyright infringement include dealings for the following purposes: research or study, criticism and review, news reporting, or parody and satire.² In addition to these prescriptive purposes (which are generally considered narrower than in other common law jurisdictions), the focus in Australia is also on what is considered to be ‘fair’, and a dealing driven primarily by commercial objectives is typically not considered to be fair.
Canada
There is no TDM exception under the Canadian Copyright Act (R.S.C., 1985, c. C-42), but two general exceptions may apply to training a generative AI system: (1) the temporary reproduction for a technological process exception; and (2) the fair dealing exception.
To qualify for the temporary reproduction for technological processes exception, three requirements must be met:³
A generative AI program that processes large datasets may need to make temporary reproductions of copyright material that are essential to its technological process. If reproductions are temporary and only exist for the duration of the dataset analysis, they may be covered by the exception of temporary reproduction for technological processes.
The Canadian Copyright Act has a fair dealing exception. This allows use of copyright works for the purpose of research, private study, education, satire, parody, criticism, review or news reporting, provided that the use of the work is ‘fair’.⁴
If the purpose of use is for criticism, review or news reporting, then the source and author of the work must be cited. Whether something is ‘fair’ will depend on the circumstances, and several factors will be considered in the analysis:⁵
When considering use in training generative AI systems, ‘research’ or ‘education’ may be relevant fair dealing purposes. The Supreme Court of Canada has held that ‘research is not limited to non-commercial or private contexts’ and should otherwise be given a liberal interpretation.⁶ In one case,⁷ for example, the Supreme Court found that listening to 30- to 90-second music previews to determine a user’s musical preferences constituted research for the purposes of the fair dealing exception. However, as at the date of this publication, no Canadian court has considered whether training generative AI systems using copyright material is within the scope of the ‘research’ exception. In November 2024, several Canadian media companies and news publishers filed a claim against OpenAI alleging that training OpenAI’s generative AI systems using news articles without permission is copyright infringement. The claim also alleges that OpenAI’s training of its generative AI systems using the materials circumvents technological protection measures and breaches the terms of use of various websites. This matter is ongoing as at the date of this publication.
China
No. There is currently no TDM exception in PRC copyright law. Generally, the relevant exceptions to an infringement claim under PRC law apply only to non-commercial use for personal study, research or appreciation, or to copying a small quantity for teaching or scientific research purposes. They will not apply to a system used commercially on a substantial or large scale.
EU
Yes. TDM (that is, reproduction and extraction) of lawfully accessible works is permitted for any purpose provided that the rights holder has not ‘expressly reserved’ its rights in an appropriate manner, which may be in a machine-readable way where the content is available online.⁸ For information on the regulation of AI in the EU, see our blog, The EU AI Act – the countdown begins.
France
Same as EU position.⁹ On 11 December 2024, the French High Council for Literary and Artistic Property published a report on the implementation of the EU AI Act, focusing on the transparency of data used for AI training (Article 53(1)(c) and (d) of the EU AI Act) and on respect for copyright and related rights. The report suggests guidelines to clarify these EU AI Act requirements and a template for the summary that AI model providers must make available to the public. The report also suggests two ways to facilitate the exchange of information and the resolution of disputes between rights holders and AI model providers, without prejudice to judicial action: a direct dialogue and a complaint procedure.
Germany
Same as EU position. Recent clarifications emphasize that rights holders can reserve their rights to prevent TDM, except when conducted for scientific research purposes. Landgericht Hamburg (27 September 2024): the Hamburg Regional Court confirmed the legality of using image data for AI training under Section 60d (the TDM exception) of the German Copyright Act for scientific purposes (see Case study: Germany, above).
Hong Kong
No. There is currently no TDM exception under the Hong Kong Copyright Ordinance, and the use in training of a generative AI system is unlikely to fall within any of the fair dealing exceptions under that Ordinance. However, following a public consultation conducted in 2024, the Hong Kong Government proposed to amend the Copyright Ordinance to introduce a TDM exception. The proposed TDM exception in Hong Kong will be applicable to both non-commercial and commercial uses, and there will be an ‘opt-out’ option provided for copyright owners to expressly reserve their rights. As at the date of this publication, the government aims to submit an amendment bill to the Legislative Council before July 2025.
The Netherlands
Same as EU position.
Singapore
Yes. There is a statutory exception permitting the copying of copyright works for the purpose of ‘computational data analysis’, which includes:
The Intellectual Property Office of Singapore has clarified that ‘computational data analysis’ includes sentiment analysis, TDM and training machine learning.¹¹ However, the exception is subject to certain conditions and safeguards to protect the commercial interests of copyright owners:
While this statutory exception allows a generative AI system to be trained without infringing copyright (as long as the above conditions are met), there is still a risk that the Output of the generative AI system will infringe copyright. For more information on:
South Africa
No. Provided that the use constitutes a substantive reproduction or adaptation of the original work (and authorship and ownership of that work can be proven), there would be no defense to copyright infringement in such circumstances.
UK
No. A statutory exception for TDM exists, but it is only available for non-commercial research purposes.¹² The UK government has undertaken a consultation in relation to expanding the statutory exception for TDM, but as at the date of this publication, the outcome of that consultation is not yet known. For more information, see UK Government consults on copyright and Artificial Intelligence.
USA
The fair use doctrine may apply to protect the challenged activity; however, as at the date of this publication, it has not yet been tested, and it is not clear to what extent it would apply. It is highly likely that the training process will involve the reproduction of entire works or substantial portions of them. OpenAI, for example, acknowledges that its programs are trained on large, publicly available datasets that include copyright works, and that copies of such works are made as part of the process. The copying of copyright works without consent (express or implied) from the copyright owner may result in liability for copyright infringement. It is expected that AI companies will argue that their training processes constitute fair use and, therefore, do not infringe any work copied. Whether or not copying constitutes fair use depends on four statutory factors under 17 U.S.C. § 107: (1) the purpose and character of the use, including whether the use is of a commercial nature or is for nonprofit educational purposes; (2) the nature of the copyrighted work; (3) the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and (4) the effect of the use upon the potential market for or value of the copyrighted work.
AI advocates are likely to argue that consideration of these factors requires a conclusion of fair use. For example, under the first factor above, AI companies may argue that their purpose is ‘transformative’ because the training process creates a useful generative AI system, rather than an expressive work. However, this argument will be affected by the U.S. Supreme Court’s decision in Warhol v. Goldsmith, in which the Court arguably downplayed the importance of ‘transformativeness’ within the overall first-factor fair use analysis. Under the third factor listed above, note that the copies are not made available to the public but are used only to train the program, an argument a court may weigh in favor of a fair use conclusion. Generative AI defendants rely heavily on this intermediate use argument in various ongoing U.S. copyright lawsuits, the logic being that any ‘use’ of the work is only in service of a further, non-infringing use (i.e., generative AI outputs, which, generally speaking, are not substantially similar to the original). In contrast, some generative AI applications have raised concern that training AI programs on copyright works allows them to generate works that compete with the original works. Such evidence would be considered under the fourth fair use factor listed above and would likely weigh against a conclusion of fair use. For information on the regulation of AI in the US, see our blog, President Biden issues sweeping artificial intelligence directives targeting safety, security and trust.
Footnotes
¹ A computer scientist’s view of training is that it does not strictly involve the creation of a copy of the training data per se. Rather, the training data is transformed into a mathematical model that, in the case of a written source, converts the words into tokens and ‘learns’ the correlations between tokens. Nevertheless, the assumption is that the model can reproduce the training data (and for example ChatGPT can quote verbatim certain texts that it has apparently been trained on) and, if so, it may ultimately not matter to the copyright analysis in what form the data is stored within the model.
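The point made in this footnote can be illustrated with a deliberately simplified Python sketch. It is a toy example only: the whitespace tokenizer and the count-based ‘model’ are assumptions chosen for brevity, whereas real generative AI systems use subword tokenizers and neural networks. The sketch stores only counts of which token follows which, not the training text itself, yet it can still regenerate that text verbatim.

```python
from collections import Counter, defaultdict


def tokenize(text: str) -> list[str]:
    """Toy tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def train_bigram_counts(texts: list[str]) -> dict:
    """Count how often each token is immediately followed by each other token."""
    counts = defaultdict(Counter)
    for text in texts:
        tokens = tokenize(text)
        for current, following in zip(tokens, tokens[1:]):
            counts[current][following] += 1
    return counts


def generate(counts: dict, start: str, max_tokens: int = 10) -> str:
    """Greedily emit the most frequent next token until a dead end is reached."""
    output = [start]
    while len(output) < max_tokens and counts.get(output[-1]):
        next_token = counts[output[-1]].most_common(1)[0][0]
        output.append(next_token)
    return " ".join(output)


# The 'model' holds token statistics only, not a stored copy of the sentence.
model = train_bigram_counts(["The quick brown fox jumps over a lazy dog"])
print(generate(model, "the"))  # the quick brown fox jumps over a lazy dog
```

With a single short training sentence the statistics determine the output completely, which is why the text comes back word for word. With large and varied training data, verbatim reproduction becomes less predictable but, as the footnote notes, remains possible.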