Authors of 'explosive' study proving AI model training infringes copyright explain why legal exceptions should not apply
And there's bad news for those hoping that generative models can 'unlearn' data
THE TRAINING OF generative AI models on copyrighted content without the consent of rightsholders is a topic we frequently cover in Charting. Several leading AIs have openly admitted training their generative models on data scraped from the web. They cite the ‘fair use’ defence in American copyright law, and the text and data mining exception under European law.
Last month, a groundbreaking study challenged conventional wisdom on the text and data mining defence, arguing it shouldn’t apply to generative AI training. According to the authors, intellectual property expert Tim Dornis and AI scientist Sebastian Stober, there’s “no suitable copyright exception to justify the massive infringements occurring during AI training”.
Their joint study was commissioned by the Copyright Initiative, which campaigns for fair copyright laws in Germany and across Europe. Hanna Möllers, legal advisor to the German Federation of Journalists and a representative of the European Federation of Journalists, described their findings as “explosive”, since they proved “we are dealing with large-scale theft of intellectual property”. “The ball is now in the politicians’ court to draw the necessary conclusions and finally put an end to this theft at the expense of journalists and other authors,” she said.
To find out how they arrived at their findings, I caught up with Tim, professor of law at Leibniz University Hannover and New York University School of Law, and Sebastian, professor of AI at Otto von Guericke University Magdeburg.
In a fascinating and illuminating discussion they told me:
How generative AI training goes so much further than text and data mining
Whether something genuinely ‘transformative’ is created by generative AI
How copyright laws still apply to US models made available beyond America
Whether it’s possible to remove copyrighted materials from gen AI models
What they want politicians to do and what they hope their study will achieve
Here’s the full audio interview, with transcript below:
Tim ... text and data mining, what is it, and why shouldn’t the defence apply to AI model training?
This goes to the heart of the problem. Classic text and data mining is used in the natural and social sciences to collect and analyse data in order to gain new information and knowledge. A short example: if I look at infections, let’s say Covid infections in a city, I also collect data on where all the people who have been infected were during the week: at work, at the gym, at the supermarket, and so on. This is simple information, raw facts. I put that in a big list, an Excel list, I apply a statistics program, and I might ultimately find some correlations. Then I can say, and that’s new information, that they caught their Covid infection at the gym. So, text and data mining takes what we call ‘semantic information’, that is, the mere facts of something. Generative AI training takes much more. It takes not only the information but everything that’s around the information. If you feed the system lots of paintings by Salvador Dalí, it will not only give you the scenes that Dalí pictured in those paintings; it will give much more. Even though style is not copyright-protected, all the details you are familiar with, if you know what Salvador Dalí paintings look like, will somehow be replicated. And that is much more than semantic information, the simple facts. So that’s the reason why the text and data mining exception should not apply to the training of generative AI models.
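To make Tim’s example concrete, here is a minimal sketch of that mining step in Python. The visit records are invented for illustration; the point is that what comes out is a single statistical fact, not a reproduction of the underlying records:

```python
# Hypothetical records: where each person went, and whether they caught Covid.
visits = [
    {"went_to_gym": True,  "infected": True},
    {"went_to_gym": True,  "infected": True},
    {"went_to_gym": True,  "infected": False},
    {"went_to_gym": False, "infected": False},
    {"went_to_gym": False, "infected": False},
    {"went_to_gym": False, "infected": True},
]

def infection_rate(records):
    """Share of people in `records` who were infected."""
    return sum(r["infected"] for r in records) / len(records)

gym_goers = [r for r in visits if r["went_to_gym"]]
others = [r for r in visits if not r["went_to_gym"]]

# The mined 'knowledge' is a new fact derived from the data,
# not a copy of the data itself.
print(f"Infection rate, gym-goers: {infection_rate(gym_goers):.0%}")
print(f"Infection rate, others:    {infection_rate(others):.0%}")
```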
Sebastian, give us your perspective please as someone who’s steeped in AI and the way that generative models are trained. Why shouldn’t the text and data mining exception apply to generative AI?
The really short answer here is that training generative models simply is not text and data mining. If you look at the definition used by researchers working in this field, data mining aims to extract useful knowledge from data collections; text mining is just its specific application to text. And ‘mining’ quite literally means processing data to gain knowledge, to get knowledge out of the data. That can mean all sorts of things, like attaching labels or annotations to data, or identifying clusters of similar data points. But training generative models is something totally different. Here the aim is to generate new data that just looks like the training data, so no new knowledge or insights are produced, just more of the same.
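The contrast Sebastian draws can be shown in a toy sketch: clustering extracts a compact piece of knowledge (cluster centres) from a dataset, while a generative model fitted to the same data simply samples ‘more of the same’. This is purely illustrative and assumes NumPy and scikit-learn:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy dataset: two groups of 2-D points.
data = np.vstack([rng.normal(0, 0.5, (50, 2)),
                  rng.normal(5, 0.5, (50, 2))])

# Data mining: extract knowledge from the data (here, a two-cluster structure).
centres = KMeans(n_clusters=2, n_init=10).fit(data).cluster_centers_
print("Mined knowledge - cluster centres:\n", centres)

# Generative modelling: fit the data distribution, then sample 'more of the same'.
model = GaussianMixture(n_components=2).fit(data)
samples, _ = model.sample(3)
print("Generated points resembling the training data:\n", samples)
```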
Tim, you’re also a professor of law at New York University. Tell us about the ‘fair use’ defence, and what do the AIs need to show in order to use it?
I’m sure there will be a battle going on for at least one or two more years. The courts have not settled on the fair use defence so far, but all the AI companies bring forward the defence and say, ‘well look, what we’re doing is fair use’. The most relevant question is: is this transformative fair use? Put most simply, you would say, well, if it brings out something new, if it creates something new, if it’s welfare-enhancing, if it’s knowledge-enhancing, if it’s whatever you can say AI is definitely doing, then you could make a good case to explain it as transformative, and then you’d have a good case to say it is fair use. I’m very careful about saying how it’s going to end, I mean whether the courts will ultimately say it’s fair use, but I would say there are good arguments in favour of fair use.
Sebastian, from a technology perspective, is something new being created by generative AI?
Well, what we definitely can see is that it can interpolate, in a way, new data points from the training sets. So these data points are new, but are they novel enough to justify the term transformative? I think the whole debate basically stands and falls with the interpretation of the term transformative. Let me share one perspective here that’s maybe a little bit over-restrictive, but this is how I would see transformative. From an artistic or creative point of view, I would expect it to venture into totally new, uncharted territory: not just interpolating [but] maybe extrapolating, basically going beyond creating more of the same, beyond staying within this bubble of the training set. And right now, I don’t see that yet with the models we currently have. They’re just not trained to do that. If they produce something that’s surprising and maybe goes outside the dataset, then this is probably triggered by some outside influence, maybe a very creative prompt, but the creativity is not in the model itself. I think at some point in the future AI will be able to do this kind of thing, to really create novel things, to move out of the training data distribution, but for that you might need very different approaches, very different from what we have nowadays.
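One way to picture the interpolation Sebastian describes: treat two training examples as points in some feature space, and a ‘new’ output as a blend that sits strictly between them. A deliberately tiny, hypothetical illustration:

```python
import numpy as np

# Two (hypothetical) training examples represented as feature vectors.
a = np.array([0.0, 1.0])
b = np.array([1.0, 0.0])

# Linear interpolation: 'new' points, but all confined to the segment
# between a and b, i.e. inside the bubble of the training data.
for t in (0.25, 0.5, 0.75):
    print((1 - t) * a + t * b)

# Extrapolation would mean leaving that segment entirely (t outside [0, 1]),
# which, on Sebastian's view, current generative models are not trained to do.
```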
Tim, your study says that even if AI training is taking place outside Europe the AIs can’t avoid European copyright laws. Tell us what you found and why European copyright laws still apply ...
Most scholars say copyright law is territorial; that’s a general principle. Each country has its own copyright law, and it only applies to what happens within that country. So, American copyright law applies to what happens in the United States, German copyright law to what happens in Germany, and so on. This seems to imply that we Europeans cannot do anything against American companies using European works for training their AI machines: since they are training the machines in the United States, American law applies, and if the fair use defence applies, then there is no claim against AI training. What I think we found is that if you accept that there is a copy, or something that comes close to a copy, a replication so to speak, of the training data inside the AI model, then making the model available online, as ChatGPT for instance, or even offering it for download to European users, would fall under what’s called the right of making available. That right is copyright-protected and universally accepted; in Europe it’s the right of making a work publicly available, and so that would be copyright infringement. That would somehow circumvent the problem of territoriality. Copyright would still be territorial, but you could say, well, if they offer it in Europe, if they offer it to European users, [then] that’s the act of infringement we’re attaching to, and that’s the basis of a lawsuit, for instance.
Sebastian, I’m interested to know if it’s possible to somehow remove copyrighted materials from AI models ...
Well, I’m afraid that for the current models it’s unfortunately impossible to unlearn data that was seen during training. Basically you would have to start training from scratch and leave out the problematic data, which is insanely expensive, so it’s probably not a good option. The only other option would be some output filtering: you check what your model produces against a database of things you don’t want the model to generate. But this is also tricky, because you have to define some measure of similarity; the output will not be exactly the same thing, but if it’s close enough then it will still be a problem. So this is really tricky to implement, and there’s actually no good solution to this problem right now.
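Here is a minimal sketch of the output-filtering idea Sebastian describes, assuming a hypothetical embed() function that maps text to a vector; a real system would need a far more robust notion of similarity, which is exactly the difficulty he points to:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_blocked(output_vec: np.ndarray,
               protected_vecs: list[np.ndarray],
               threshold: float = 0.9) -> bool:
    """Reject a generated output that is too close to any protected work.

    `threshold` is a made-up tuning knob: set it too high and near-copies
    slip through, too low and legitimate outputs get suppressed.
    """
    return any(cosine_similarity(output_vec, p) >= threshold
               for p in protected_vecs)

# Usage (embed() is hypothetical - any text-embedding model would do):
# if is_blocked(embed(generated_text), [embed(w) for w in protected_works]):
#     suppress_or_regenerate()
```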
Tim, what’s the reaction been to your study from legal colleagues in Europe?
Well, so far there have been no written reactions. They will come soon, I’m pretty sure. I met a couple of colleagues at a conference last week and they congratulated us on the study, but I’m pretty sure that many will take the other side. The majority of legal scholars in Europe, at least in Germany, support the text and data mining defence; they say, ‘well, this is text and data mining’, and they settled on that position pretty early. That’s what made the study so interesting, so fascinating, for me and I think for Sebastian as well: we were looking at something everybody, or at least most colleagues, seemed to agree on.
And Sebastian, how have your colleagues reacted?
So far I have not really seen reactions from colleagues within my field of research, but to me this is not surprising; most are busy working on improving the state of the art. Some discussion has been going on, but not really about the legal aspects, more from an ethical perspective. Hopefully next year we’ll have a little more discussion about the legal aspects.
Tim, what would you like politicians to do now?
For me it was a fascinating study. I thought the topic was interesting, and I was not set from the beginning on which direction it would go. That kind of extends to what I expect lawmakers to do. I’m not definitively saying do this or do that, but I would say you have to reconsider it, I mean you have to look at the technology. This was an eye-opener for me, working together with Sebastian, my first time working with an AI expert, because you can go back to the real expert and say, ‘well, look at this, that’s the legal doctrine, what do you think?’. I think this has still not happened enough; it’s something that still has to be done. Then you can also consider other consequences of generative AI, of course, but the technology, I think, is essential.
Sebastian, what do you hope your study will achieve?
First of all, for me, this was something I really wanted to engage with, because I think science communication is super important. As somebody working in AI research, I think the whole of society needs to follow [this] and be able to discuss it in an informed way, without panic or the typical headline-driven discussions, but really on an informed level. This is my main goal [and] why I participated, and I really hope it helps politicians to make better decisions in the future.
Tim, how is this clash between the AIs and rightsholders going to be resolved?
This is the most complicated question, I think. In a sense, this will come naturally through lawsuits. I’m really confident that there will be more lawsuits in Europe, in Germany; there are lots of lawsuits already in the United States. Both copyright owners and AI companies will probably battle fiercely over those issues, over those questions. It will probably take longer than we expected, or would like, to see a solution. I’m pretty sure that litigation will go on, and this is a problem that will stay with us for a couple of years, probably.
Sebastian, your final thoughts. How do we further raise awareness of what’s going on?
There isn’t really much that I can add here. We can only do so much, right, and putting together this document has already been quite some effort. I really enjoyed it, going back and forth with Tim, and also realising when we were not using the right metaphors or images to convey the meaning I wanted to convey. I hope that I can build on top of this; I’m already regularly giving public talks about this kind of thing. As Tim said, this will be something that will be around us for the next couple of years, and it’s going to get wilder, I’m pretty sure!
Huge thanks to Tim and Sebastian. More on the generative AI developments that threaten to reshape the human-made media landscape in Friday’s Weekly Newsletter.
HUMAN♥️MADE. While we may link to examples of generative content, no chatbots were harmed in the writing of this newsletter.
Excellent post. Shared to LinkedIn.
Had to block a Russian guy on LI who attacked me for recommending caution when proceeding with AI. He turned it into an ageist rant. I silently hope the CIA has him on their radar.
One thing these two guys miss in their analysis: fair use generally presumes attribution, which AI does not give. Even in cases where AI could be useful in data or text analysis, it may still infringe copyright by failing to cite or license.
That point needs to be more clearly made.
Thank you for this thoughtful discussion. While waiting on lawsuits to go through the process, Credtent.org has taken a proactive approach to helping creators opt out or license their work, while also helping LLMs select credible, ethically sourced content for training. With California's new AB 2013 law that just passed, LLMs will need to ensure transparency in their training sets, and that will invite more lawsuits if we don't get fair-market licensing under control.