With the can kicked down the road, how should AI developers respond to the ongoing uncertainty over the future of copyright law?
On 18 March, the UK government published a major report on copyright and artificial intelligence. It is the product of a public consultation which proposed various potential reforms to copyright law, aimed at adapting UK law to the disruptive impact of artificial intelligence.
The immediate outcome of the consultation, as stated in the report, is that the government: “Will not introduce reforms to copyright law until we are confident that they will meet our objectives for the economy and UK citizens. This means protecting the UK’s position as a creative powerhouse, while unlocking the extraordinary potential of AI to grow the economy and improve lives… It is clear through the consultation and our subsequent engagement that there is no clear consensus on how these objectives should be achieved.”
For the time being, this means that copyright and related laws remain as they are. The consultation had stated that the government’s preferred option was a somewhat balanced, but AI developer-friendly regime that would have required rights holders to “opt-out” of having their works used in AI training. While this policy is no longer the government’s preferred option, it is important to note that it remains on the table.
The drawbacks of the “do nothing for now” approach, as stated by the government in its consultation document, are that:
- AI developers will continue to face legal risks, which are likely to affect smaller developers and new entrants in particular.
- Rights holders will continue to find it difficult to seek remuneration and to enforce their rights.
- The UK remains caught between two stools.
This article explores the commercial implications of these challenges from the perspective of rights holders and AI developers. It focuses on copyright, but the issues discussed are also relevant to database rights in collections of data.
A primer on copyright
To properly appreciate the challenges facing rights holders and AI developers, you need to understand the basics of UK and global copyright laws.
In the UK, the Copyright, Designs and Patents Act 1988 (CDPA) says that copyright is a property right attached to certain types of creative work, including:
- Original literary, dramatic, musical or artistic works.
- Sound recordings, films and broadcasts.
Under the CDPA, the owner of the copyright in a work has the exclusive right to do certain things, including:
- Make copies of the work (this means “reproducing the work in any material form” and includes storing the work electronically).
- Issue copies of the work or communicate the work to the public.
If anyone other than the owner of the work does any of these things without the owner’s permission (and an exception does not apply), this is copyright infringement.
Copyright exists from the moment that a work is “recorded” (eg, in writing, on the canvas, or on your computer). It arises automatically, with no need to file any sort of application. Under the CDPA, for most works, copyright continues for 70 years from the end of the calendar year in which the author dies.
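The life-plus-70 calculation is mechanical enough to express in code. Here is a minimal sketch (the function name is ours, and it covers only the standard rule for most works, not special cases such as Crown copyright or certain unpublished works):

```python
from datetime import date

def uk_public_domain_start(author_death_year: int) -> date:
    """Return the date a work enters the UK public domain under the
    standard life-plus-70 rule (special cases ignored)."""
    # Copyright runs for 70 years from the END of the calendar year in
    # which the author dies: it expires on 31 December of
    # death_year + 70, so the work is in the public domain from
    # 1 January of the following year.
    return date(author_death_year + 71, 1, 1)

# A hypothetical author who died in 1950: copyright expires at the end
# of 2020, so the work enters the public domain on 1 January 2021.
print(uk_public_domain_start(1950))  # -> 2021-01-01
```

The "end of the calendar year" wording is why expiry always falls on a 1 January, regardless of the exact date of death.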
Global copyright laws do vary, but not as much as you might think. Thanks to the Berne Convention, the basic definition of copyright and the core rights conferred on copyright owners are relatively consistent across most major jurisdictions. When it comes to copyright exceptions, there is greater divergence. This means that the respective bargaining positions of AI developers and rights holders vary based on where in the world they are operating. This article is written primarily from a UK perspective but contains broader lessons.
When people talk about AI in the context of the copyright law reform debate, what do they mean?
People usually think about these issues in the context of the “foundational” large language models (LLMs) such as ChatGPT/Copilot, Claude and Gemini. These models need to be “trained” using vast amounts of data (which is difficult to obtain). It is worth remembering that a wide range of actors beyond the big US tech companies are developing AI tools for many different purposes.
What legal risks do AI developers face?
If a developer uses copyright works to train AI without seeking permission, they risk liability for copyright infringement. One of the key ways this liability could arise is where an AI training process involves making copies of copyright works or “storing the work electronically”. Rights holders have already devoted significant resources to legal disputes in the USA (where many of the servers on which training takes place are located) and elsewhere, alleging that AI developers have built their models on the back of this type of copyright infringement.
These disputes pose a challenge for judges worldwide. Copyright laws were not drafted with AI in mind and so familiar principles must be applied in an unfamiliar context. Judges are likely to be wary of establishing sweeping new precedents about the application of copyright law to AI training, which means that uncertainty over what the law permits is likely to persist for some time.
The companies responsible for developing the foundational LLMs have largely had a high tolerance for legal risk in relation to potential copyright infringement when building their models. Their approach is similar to the one that now-entrenched technology businesses, such as Uber, took to developing areas of law in the past. The legal risks have not stopped the developers from growing at breakneck speed and attracting unprecedented levels of investment.
Imagine an alternative scenario:
- Go back a few years to a time when the foundational LLMs did not exist.
- The companies that want to build an LLM seek permission and pay licence fees rather than scrape whatever publicly accessible content they can find on the internet (including many copyright works).
In this alternate universe, these tools would be much less ubiquitous than they are today – and the companies behind them significantly less valuable. The leaders of these companies seem to have been persuaded by the adage that “it is better to seek forgiveness than to ask permission”. Several large AI developers are now dealing with multiple lawsuits, albeit from a position of relative strength due to scale and the embedding of their LLMs into people’s daily lives.
What are the challenges facing rights holders?
Broadly speaking, a rights holder’s copyright can be infringed by AI tools in two ways:
- In the AI training process (“input infringement”).
- When AI tools generate outputs that reproduce a substantial part of the original work* (“output infringement”).
*Copying an insubstantial part of a work does not infringe copyright – but an insubstantial part is not necessarily the same thing as a small excerpt.
Input infringement
If a copyright work that you own has been used to train AI without your permission, how do you find out that this has happened?
The short answer is: with difficulty. The major LLMs are often described as a “black box”. Microsoft Copilot put it like this:
“Great question — and (frustratingly) the honest starting point is: right now, it’s often hard to prove a specific work was used to train a particular AI model, because most training pipelines are opaque and the datasets are huge.”
One regulatory solution to this problem would be to mandate greater transparency over the materials used to train AI. The EU has adopted this approach in Article 53 of the EU AI Act, which requires providers of “general-purpose AI models” (the large LLMs, essentially) to publish a summary of their training data. Part of the purpose of this provision is to make it easier for rights holders to exercise and enforce their rights. This element of the EU AI Act came into force on 2 August 2025, but there is a grace period for models placed on the market before that date.
In practice, this means the major LLMs are yet to publish the “sufficiently detailed summary” of their training data that the AI Act requires and are likely to delay doing so until close to the end of the grace period on 2 August 2027. Many rights holders perceive the use of copyright works in LLM training as the AI industry’s “original sin”. From their perspective, you might even (slightly melodramatically) describe 2 August 2027 as judgment day.
How cathartic the training data summaries will prove remains open to question, however. Even when they are published, the AI Act does not require work-by-work disclosure, nor does it necessarily require disclosure of which publishers’ sites were scraped for training purposes. The summaries will be useful for rights holders seeking to identify infringements and enforce their rights, but their utility will depend on how the major LLMs interpret “sufficiently detailed”. The EU has published guidance that goes into the detail of what “sufficiently detailed” means in practice. To the extent that LLMs are built on the back of industrial-scale copyright infringement, we can expect the companies behind them to be reticent about publishing training data summaries that look like a smoking gun.
It is, of course, possible to obtain disclosure (or “discovery”) of relevant information in many jurisdictions in the context of litigation. But getting to that point can be expensive; there is no guarantee that the developer will have retained adequate details of the relevant training data; and there are inherent risks in starting litigation without a clear idea of whether the facts are likely to support the claim (even more so when the law in this area is developing and early decisions in some jurisdictions, such as the UK and US, have not been especially kind to rights holders).
Output infringement
Another obvious way for a rights holder to identify whether an AI tool has infringed their copyright would be to try to prompt it to reproduce large chunks of their copyright works. In the early days of the LLMs, this might have worked. It is more difficult these days.
When Copilot is asked to provide the first page of Harry Potter, it says: “Sorry—I can’t provide the opening page text from Harry Potter and the Philosopher’s Stone because it’s copyrighted.”
When asked whether it was trained on the work, its response was guarded: “I can’t tell you whether Harry Potter was in my training data, and I don’t have access to or memory of specific copyrighted books.”
This just so happens to address one of the specific legal issues that will determine many of the existing copyright lawsuits against the major LLMs – do the models “remember” (and store copies of) the works on which they were trained?
How about the first page of Madame Bovary, which is out of copyright? Copilot provided quite a good answer:
✅ Yes — Madame Bovary is out of copyright.
- Author: Gustave Flaubert
- Date of death: 1880
- That means Madame Bovary entered the public domain on 1 January 1951.
- Flaubert’s original French text is unequivocally in the public domain.
- Some English translations are not (translations have their own copyright unless the translator also died more than 70 years ago).
Copilot then provided what it was asked for, in the original French.
Recent research suggests that it remains possible to extract huge chunks of copyright works from LLMs if you take a more subtle approach. The abstract makes for interesting reading: “by training models to expand plot summaries into full text, a task naturally suited for commercial writing assistants, we cause GPT-4o, Gemini-2.5-Pro, and DeepSeek-V3.1 to reproduce up to 85-90% of held-out copyrighted books, with single verbatim spans exceeding 460 words, using only semantic descriptions as prompts.”
Our general impression is that output infringement is a more controllable problem than input infringement, at least in the case of LLMs.
What is the UK government’s approach?
The core challenge for the UK government is that it cannot please everyone. It is theorised that a developer-friendly approach would catalyse the AI economy in the UK, but it would greatly upset the UK’s strategically important creative sector. There are also no guarantees that it would have the intended effect, given the significant mismatch between the capital available to fund development in the UK and US technology ecosystems.
Strengthening the hand of rights holders (via, for example, an opt-in system for licensing works for AI training) looks attractive as it seems to reward creativity. But if other jurisdictions (such as the US) take a more permissive approach to training, which (via the general fair use exception to copyright infringement) they seem to be doing, developers will simply train there and then adopt safeguards (such as those suggested above) around output infringement. This should work legally as long as the Court of Appeal does not overturn the High Court’s decision in the recent Getty Images v Stability AI case that a model that may have been trained on infringing works is not an “infringing copy” if it no longer contains those works.
While the UK equivocates, AI development in other countries (most notably the US but also, for example, China) is accelerating away. The UK’s best strategic move may be to focus on emerging types of AI model that offer a different training approach, and potentially improved results, compared to LLMs. The £500m Sovereign AI Fund certainly seems to be thinking in this way – see its recent investment in Ineffable Intelligence.
What should developers and rights holders do, and how can Mills & Reeve help?
On the developer side, we have been advising both developers of new models and implementers of foundational LLMs. Discussions often focus on the difference between a publicly available (ie, free to use) dataset and one that is merely publicly accessible. We can help developers identify (and, where applicable, license) training datasets consistent with their specific risk appetite. We can also assist with implementing guardrails to avoid inadvertent breach of underlying licence terms (particularly in the case of open access datasets).
On the rights holder side, AI has compounded an enforcement landscape that was already difficult, given the low barriers to entry online. It is impossible to cut all the heads off the hydra, so our advice focuses on maximising return on investment. This is particularly important given the potential for AI to reduce the returns that some rights holders may be able to realise from their creative assets.
We guide clients in investing resources where they will make the most difference; help clients with similar interests consider and implement strategies to make the most of their strength in numbers; and work with clients to develop alternative, non-litigious, approaches to influence change.
Our content explained
Every piece of content we create is correct on the date it’s published but please don’t rely on it as legal advice. If you’d like to speak to us about your own legal requirements, please contact one of our expert lawyers.