Generative AI systems are trained using vast amounts of data, often taken from publicly available sources that may be protected by copyright or other intellectual property rights, such as, in the UK and EU, a database right.

It may not be obvious what source data have been used for the training. The EU has included provisions in Regulation (EU) 2024/1689 (EU AI Act) to address this concern in relation to so-called general-purpose AI models.

Transparency and compliance requirements for training under the EU AI Act

Legal basis for use

Recital 105 of the EU AI Act acknowledges that training general-purpose AI (GPAI) models often involves copyright material and that use of such content requires authorization from rightsholders, unless a legal exception applies. (The EU AI Act defines GPAI models as those trained on large datasets that exhibit significant generality, performing a wide range of distinct tasks.)

Under Directive (EU) 2019/790 on copyright and related rights in the Digital Single Market (DSM Directive), text and data mining (TDM) is permitted unless rightsholders opt out using machine-readable means (e.g., robots.txt, metadata). Many crawlers ignore these signals, creating compliance risks.
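To illustrate what honoring a machine-readable signal can look like in practice, the following is a minimal sketch in Python of a crawler checking a robots.txt file before collecting a page. It assumes that robots.txt is the signal in use; the crawler name ExampleAITrainingBot, the example.com URL and the helper may_collect are hypothetical. It is not a statement of what Article 4(3) of the DSM Directive requires, and other signals (such as metadata) would need separate handling.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt in which a site disallows one AI-training crawler
# while leaving all other crawlers unrestricted.
ROBOTS_TXT = """\
User-agent: ExampleAITrainingBot
Disallow: /

User-agent: *
Allow: /
"""


def may_collect(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True only if robots.txt does not disallow this agent for this URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse the text directly, no network access
    return parser.can_fetch(user_agent, url)


url = "https://example.com/articles/1"
print(may_collect(ROBOTS_TXT, "ExampleAITrainingBot", url))  # the disallowed crawler
print(may_collect(ROBOTS_TXT, "SomeOtherBot", url))          # falls under the wildcard rule
```

A crawler built this way skips any page for which may_collect returns False. Keeping a record of each check may also help a Provider document how its copyright policy is applied.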

Code of practice

Article 53(1)(c) of the EU AI Act requires Providers of GPAI models to put in place a policy to comply with EU law on copyright and related rights, and in particular to identify and comply with (including through state-of-the-art technologies) a reservation of rights (expressed pursuant to Article 4(3) of the DSM Directive).

The EU AI Act does not specify what such a copyright compliance policy should consist of.  Article 53(4) provides that Providers of GPAI models may rely on codes of practice to demonstrate compliance until the European Commission establishes harmonized standards.

As at the date of this publication, such a code of practice has not yet been finalized (see European Commission, General-Purpose AI Code of Practice). The code is voluntary but (by virtue of Article 53(4) of the EU AI Act) is an official compliance instrument under the EU AI Act. It is being developed with broad stakeholder involvement.

The code was originally expected by 2 May 2025. As at the date of this publication:

  • The code is scheduled for publication in August 2025.
  • The current draft of the code provides (among other things) for:
    • Implementing a copyright policy.
    • Complying with TDM opt-outs.
    • Mitigating the risk of generating copyright infringing Outputs.
    • Designating a single point of contact and complaints handling procedures.

Once finalized, adhering to the code of practice could serve as evidence of lawful conduct in AI training, especially regarding:

  • Copyright requirements (e.g., use of protected training data).
  • Transparency obligations toward downstream Providers, the EU AI Office, and national competent authorities of EU member states.
  • Risk management and documentation duties for Providers of GPAI models with systemic risk.

Transparency

Article 53(1)(d) of the EU AI Act requires Providers of GPAI models to draw up and make publicly available a sufficiently detailed summary of the content used for training a GPAI model, according to a template provided by the EU AI Office.

This transparency measure is intended to allow creators to verify whether their works have been used in training and whether opt-out requests have been complied with.

Drafts of the template have been published by the European AI Office (as at the date of this publication, the current draft is: Template for summary training data).

For information on the regulation of AI in the EU more generally, see our blog, The EU AI Act – the countdown begins.

 

Could training a generative AI system using publicly accessible copyright work constitute an infringement?

Where the system is trained using a copyright-protected work without the copyright owner’s consent, and assuming that the training involves an act of copying1 of the whole or a substantial part of the work, this would in many jurisdictions be an infringement, unless a relevant defense or exception applies.

 

Case study: Germany

Whether a defense or exception applies depends on the jurisdiction in which the training occurs. For example, in Germany:

  • Under German copyright law, there are several exceptions which Providers can rely on when using publicly available copyright works to train their generative AI system: Section 44b of the German Copyright Act permits reproductions of works that are necessary for the automated analysis of these works, unless the author has expressly reserved this type of use – e.g. with metadata and machine-readable tags. There is, therefore, legal permission to collect works on a large scale and to create a training corpus from them. The exception also requires that the work has been lawfully accessed and is deleted once the training is completed and storage is no longer necessary.
  • If the author of the copyright works has expressly reserved its rights in the way described above, the Provider may rely on Section 44a of the German Copyright Act. This provision permits temporary reproductions of works where those reproductions have no individual economic value; data mining is expressly named as an act falling under this exception. However, AI training can only be based on this exception if no training corpus is created and the AI instead learns from works held in working memory, which are deleted immediately afterwards.
  • Finally, Section 60d of the German Copyright Act (the TDM exception) allows privileged research organizations to reproduce works even if there was an opt-out by the author of the work.

Recent case law in Germany

  • Landgericht Hamburg (27 September 2024): the Court ruled that the use of image data for training generative AI models was permissible under Section 60d (the TDM exception) of the German Copyright Act for scientific purposes, provided no commercial interests were involved.
  • Robert Kneschke v. LAION e.V.: the Hamburg Regional Court in 2024 confirmed that using publicly available images for creating AI training datasets through web scraping did not constitute copyright infringement if done for scientific research purposes, as outlined in Article 3 of the EU Directive on Copyright in the Digital Single Market and Section 60d (the TDM exception) of the German Copyright Act. For more information, see Germany: landmark court decision deals with AI training and copyright.
 
 

Do relevant defenses/exceptions exist (assuming the system is used for commercial purposes)?

Australia

Unlikely. The Copyright Act 1968 (Cth) (CA 1968) ‘fair dealing’ defenses for copyright infringement include dealings for the following purposes: research or study, criticism and review, news reporting, or parody and satire.2 In addition to these narrow prescriptive purposes (which are generally considered narrower than in other common law jurisdictions), the focus in Australia is also on what is considered to be ‘fair’, and a dealing driven primarily by commercial objectives is typically not considered to be fair.

Canada

Uncertain.
There is no TDM exception under the Canadian Copyright Act (R.S.C., 1985, c. C-42), but two general exceptions may apply to training a generative AI system: (1) the temporary reproduction for a technological process exception; and (2) the fair dealing exception.

To qualify for the temporary reproduction for technological processes exception, three requirements must be met:3

  • The reproduction must form an essential part of the technological process.
  • The reproduction should only be used to facilitate a use that is not an infringement of copyright.
  • The reproduction must exist only for the duration of the technological process.

A generative AI program that processes large datasets may need to make temporary reproductions of copyright material that are essential to its technological process. If reproductions are temporary and only exist for the duration of the dataset analysis, they may be covered by the exception of temporary reproduction for technological processes.

The Canadian Copyright Act has a fair dealing exception. This allows use of copyright works for the purpose of research, private study, education, satire, parody, criticism, review or news reporting, provided that the use of the work is ‘fair’.4

If the purpose of use is for criticism, review or news reporting, then the source and author of the work must be cited.

Whether something is ‘fair’ will depend on the circumstances, and several factors will be considered in the analysis:5

  • The purpose of the dealing (is it commercial or research/educational?)
  • The character of the dealing (what was done with the work? Was it an isolated use or ongoing, repetitive use? How widely was it distributed?)
  • The amount of the dealing (how much was copied?)
  • Alternatives to the dealing (was the work necessary for the result? Could a different work have been used instead?)
  • The nature of the work (is there a public interest in its dissemination? Was it previously unpublished?)
  • The effect of the dealing on the original work (does the use compete with the market of the original work?)

When considering use in training generative AI systems, ‘research’ or ‘education’ may be relevant fair dealings. The Supreme Court of Canada has held that ‘research is not limited to non-commercial or private contexts’ and should be otherwise afforded liberal interpretation.6

One Supreme Court case,7 for example, found that listening to 30 to 90 second music previews to determine a user’s musical preferences constituted research for the purposes of the fair dealing exception. However, as at the date of this publication, a Canadian court has not considered whether training generative AI systems using copyrighted material is within the scope of the ‘research’ exception.

In November 2024, a claim was filed against OpenAI by several Canadian media companies and news publishers alleging that training of OpenAI’s generative AI systems using news articles without permission is copyright infringement. The claim also alleges that OpenAI’s training of its generative AI systems using the materials circumvents technological protection measures and breaches terms of use of various websites. This matter is ongoing as at the date of this publication.


China

No. There is currently no TDM exception in PRC copyright law. Generally, the relevant exception to an infringement claim in PRC law applies only to non-commercial use for personal study, research or appreciation, or to copying a small quantity for teaching or scientific research purposes. It will not apply to a system used for substantial or large-scale commercial purposes.

EU

Yes. TDM (that is, reproduction and extraction) of lawfully accessible works is permitted for any purpose provided that the rights holder has not ‘expressly reserved’ its rights in an appropriate manner, which may be in a machine-readable way where the content is available online.8

For information on the regulation of AI in the EU, see our blog, The EU AI Act – the countdown begins.


France

Same as EU position.9

On 11 December 2024, the French High Council for Literary and Artistic Property published a report on the implementation of the EU AI Act, focusing on the transparency of data used for AI training (Article 53(1)(c) and (d) of the EU AI Act) and for respecting copyright and related rights. The report suggests guidelines to clarify these EU AI Act requirements and a template for the summary that AI model providers must make available to the public.

The report also suggests two ways to facilitate the exchange of information and the resolution of disputes between rights holders and AI model providers, without prejudice to judicial action: a direct dialogue and a complaint procedure.


Germany

Same as EU position.

Recent clarifications emphasize that rights holders can reserve their rights to prevent TDM, except when conducted for scientific research purposes. 

Landgericht Hamburg (27 September 2024): the Hamburg Regional Court confirmed the legality of using image data for AI training under Section 60d (the TDM exception) of the German Copyright Act for scientific purposes (see Case study: Germany, above).                                                                                                                                                                      


Hong Kong

No. There is currently no TDM exception under the Hong Kong Copyright Ordinance, and the use in training of a generative AI system is unlikely to fall within any of the fair dealing exceptions under that Ordinance. However, following a public consultation conducted in 2024, the Hong Kong Government proposed to amend the Copyright Ordinance to introduce a TDM exception. The proposed TDM exception in Hong Kong will be applicable to both non-commercial and commercial uses, and there will be an ‘opt-out’ option provided for copyright owners to expressly reserve their rights. As at the date of this publication, the government aims to submit an amendment bill to the Legislative Council before July 2025.

The Netherlands

Same as EU position.                                                                                                                                                                                                    

Singapore

Yes. There is a statutory exception permitting the copying of copyright works for the purpose of ‘computational data analysis’, which includes:
  • Using a computer program to identify, extract and analyze information or data from the work or recording; and
  • Using the work or recording as an example of a type of information or data to improve the functioning of a computer program relating to that type of information or data.10

The Intellectual Property Office of Singapore has clarified that ‘computational data analysis’ includes sentiment analysis, TDM and training machine learning.11

However, the exception is subject to certain conditions and safeguards to protect the commercial interests of copyright owners:

  • The user cannot share copies of the works with others, except for verifying the results of the computational data analysis or for collaborative research or study relating to the purpose of such analysis.
  • The user must not use copies of the works made under this exception for any other purpose.
  • The user must have lawful access to the works to be copied; and
  • The work from which copies are made must not itself be an infringing copy (unless the use of infringing copies is necessary for a prescribed analysis) or, if it is an infringing copy: (1) the user must not know this; and (2) if that copy was obtained from a flagrantly infringing online location, the user must not know (and could not reasonably have known) that.

While this statutory exception allows a generative AI system to be trained without infringing copyright (as long as the above conditions are met), there is still a risk that the Output of the generative AI system will infringe copyright.



South Africa

No. Provided that the use constitutes a substantive reproduction or adaptation of the original work (and authorship and ownership of that work can be proven), there would be no defense to copyright infringement in such circumstances.

UK

No. A statutory exception for TDM exists, but is only available for non-commercial research purposes.12 

The UK government has undertaken a consultation in relation to expanding the statutory exception for TDM, but as at the date of this publication, the outcome of that consultation is not yet known – for more information, see UK Government consults on copyright and Artificial Intelligence.


USA

Uncertain.

The fair use doctrine may apply to protect the challenged activity; however, as at the date of this publication, it has not been tested yet, and it is not clear to what extent it would apply.

It is highly likely that the training process will involve the reproduction of entire works or substantial portions of them. OpenAI, for example, acknowledges that its programs are trained on large, publicly available datasets that include copyright works, and that copies of such works are made as part of the process. The copying of copyright works without consent (express or implied) from the copyright owner may result in liability for copyright infringement.

It is expected that AI companies will argue that their training processes constitute fair use and, therefore, do not infringe any work copied. Whether or not copying constitutes fair use depends on four statutory factors under 17 U.S.C. § 107:

  1. The purpose and character of the use, including whether such use is commercial or is for nonprofit educational purposes.
  2. The nature of the copyright work.
  3. The amount and substantiality of the portion used in relation to the copyright work as a whole.
  4. The effect of the use upon the potential market for or value of the copyright work.

AI advocates are likely to argue that consideration of these factors requires a conclusion of fair use. For example, under the first factor above, AI companies may argue that their purpose is ‘transformative’ because the training process creates a useful generative AI system, rather than an expressive work. However, this argument is likely to be affected by the U.S. Supreme Court’s decision in Warhol v. Goldsmith, in which the Court arguably downplayed the importance of ‘transformativeness’ within the overall first-factor fair use analysis.

Under the third factor listed above, AI companies may point out that the copies are not made available to the public but are used only to train the program, an argument a court may weigh in favor of a fair use conclusion. This intermediate use argument is heavily relied upon by generative AI defendants in various ongoing U.S. copyright lawsuits over generative AI, the logic being that any ‘use’ of the work is only in service of a further, non-infringing use (i.e., generative AI outputs, which, generally speaking, are not substantially similar to the original).

In contrast, some generative AI applications have raised concern that training AI programs on copyright works allows them to generate works that compete with the original works. Such evidence would be considered under the fourth fair use factor listed above and would likely weigh against a conclusion of fair use.

For information on the regulation of AI in the US, see our blog, President Biden issues sweeping artificial intelligence directives targeting safety, security and trust.



Footnotes

1   A computer scientist’s view of training is that it does not strictly involve the creation of a copy of the training data per se. Rather, the training data is transformed into a mathematical model that, in the case of a written source, converts the words into tokens and ‘learns’ the correlations between tokens. Nevertheless, the assumption is that the model can reproduce the training data (and for example ChatGPT can quote verbatim certain texts that it has apparently been trained on) and, if so, it may ultimately not matter to the copyright analysis in what form the data is stored within the model.

3   s.30.71 of Copyright Act.

4   s.29 of Copyright Act.

5   CCH Canadian Ltd. v. Law Society of Upper Canada, 2004 SCC 13.

6   CCH Canadian Ltd. v. Law Society of Upper Canada, 2004 SCC 13.

7   Society of Composers, Authors and Music Publishers of Canada v. Bell Canada, 2012 SCC 36.
