The Associated Press quietly agreed to give OpenAI access to a portion of its text archive in July 2023. Not much fanfare. Terms not disclosed. Just a deal, struck amid a growing backlog of more than a hundred copyright cases in US courts. Modest by the standards of what followed, that agreement proved to be the first stone dropped into a very deep pond.
There was a rush after that. Axel Springer, News Corp, Condé Nast, The Atlantic, Hearst, Reddit, France’s Le Monde, Spain’s PRISA, and others signed content agreements with OpenAI alone. Google used a similar strategy.
| Category | Details |
|---|---|
| Topic | AI Copyright Litigation & Training Data Licensing |
| Primary Legal Cases | The New York Times v. OpenAI, Kadrey v. Meta, Bartz v. Anthropic, Getty Images v. Stability AI (UK) |
| Key AI Companies Involved | OpenAI, Anthropic, Meta, Google, Stability AI |
| Number of Active U.S. Copyright Cases | 100+ |
| First Major Licensing Deal | Associated Press × OpenAI — July 2023 |
| Notable Licensing Partners (OpenAI) | Axel Springer, News Corp, Condé Nast, The Atlantic, Hearst, Reddit, Financial Times, Le Monde |
| Estimated Deal Value | Hundreds of millions of dollars collectively |
| Regulatory Flashpoint | The “Big Beautiful Bill” — proposed 10-year moratorium on U.S. state AI regulations |
| Reference | U.S. Copyright Office – AI Policy |
| Key Legal Concept Contested | Fair use doctrine as applied to AI training on scraped internet data |
| UK Parallel Case | Getty Images v. Stability AI — output-based claims later partially dropped |
| Expert Source | A&O Shearman AI Legal Group |
| Emerging Market | Large-scale data licensing ecosystem for LLM development |
| Adjacent Legal Risk | Privacy law — repurposing data for model training raises data protection concerns |
Taken together, these agreements sent hundreds of millions of dollars to content owners who had previously watched helplessly as AI companies scraped their archives uninvited. There is an irony here: publishers who had watched their revenues decline for years suddenly held something Silicon Valley actually needed.
The companies writing these checks know the spending is not motivated by altruism alone. American courts are still debating whether training AI models on scraped internet data is fair use or something closer to the most egregious copyright violation in recent memory. That question remains genuinely unanswered. And when the legal foundation of an entire industry is that precarious, writing checks starts to look like the sensible course.

The peculiar position this places the AI companies in is difficult to ignore. Some legal experts think they have a strong case, especially for general-purpose models trained on publicly accessible text, so they might eventually prevail in the fair use debate. However, a “reasonable case” is not a guarantee, and making a mistake could have existential consequences. For example, the agreement between OpenAI and Reddit was never just about copyright.
Reddit does not even own most of the content its users post. Legal clearance mattered to the arrangement, but so did dependable API access and avoiding breach-of-contract exposure. In other words, these deals are doing several jobs at once.
Less discussed is how the changing nature of AI systems is altering that calculus. Early language models processed training data and generated outputs with minimal traceability to any particular source. That is no longer quite true. Retrieval-augmented generation, in which a model fetches real-time material from the internet before answering a query, is increasingly common.
The system may pull from a dozen news articles in real time when someone asks Claude about a significant Supreme Court decision, occasionally generating summaries that are uncomfortably close to the original text. Training is not the same as that kind of legal exposure. It is inference-time copying that occurs continuously, invisibly, and at scale. When you grasp that aspect of the issue, licensing begins to make a lot more sense.
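The retrieval pattern described above can be sketched in a few lines. Everything here is an illustrative stand-in, not any vendor's actual pipeline: the corpus is three hardcoded strings, the relevance score is crude word overlap, and the prompt format is invented. The point is structural, showing where source text gets copied into the model's input at inference time.

```python
# Minimal sketch of the retrieval-augmented generation (RAG) pattern:
# before answering, the system retrieves relevant documents and places
# excerpts directly into the prompt the model sees. The corpus, the
# scoring function, and the prompt layout are all hypothetical.

def score(query: str, doc: str) -> int:
    """Crude relevance score: number of lowercase words shared
    between the query and the document."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Return the top-k documents ranked by word overlap with the query."""
    return sorted(corpus, key=lambda d: score(query, d), reverse=True)[:k]

def build_prompt(query: str, corpus: list[str]) -> str:
    """Assemble the prompt the model would actually receive: retrieved
    excerpts (potentially near-verbatim source text) followed by the
    user's question. This is where inference-time copying happens."""
    context = "\n---\n".join(retrieve(query, corpus))
    return f"Context:\n{context}\n\nQuestion: {query}"

if __name__ == "__main__":
    corpus = [
        "The Supreme Court issued a major ruling on copyright today.",
        "Local weather: sunny with a chance of rain.",
        "Analysis: what the Supreme Court copyright ruling means for AI.",
    ]
    print(build_prompt("supreme court copyright ruling", corpus))
```

Note that the retrieved articles flow into the prompt verbatim, which is why a summary generated from that prompt can land uncomfortably close to the original text: the copying happens at query time, not during training.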
However, this licensing boom is rooted in a fundamental tension that receives insufficient attention. Nearly all of the agreements being made are between AI firms and sizable, well-funded content owners, such as wire services, major newspapers, and magazine publishers. Simply put, the economics do not scale downward.
Independent writers, local bloggers, and mid-sized news organizations without a legal department lack the infrastructure to negotiate these agreements, monitor usage, and enforce compliance. As copyright expert Matthew Sag has argued, mass licensing will never be a feasible option for the majority of the internet's actual producers. In practice, the agreements being hailed as progress may amount to a settlement between two sets of institutions that excludes everyone else.
An additional level of complexity is introduced by the circumstances in the United Kingdom. The Getty Images lawsuit against Stability AI has drawn a lot of attention, in part because Getty first included output-based claims in the lawsuit, claiming that the AI system was replicating copyrighted images in its results. However, Getty later dropped that part of the lawsuit.
That retreat is important. It implies that there is a high evidentiary burden to prove output infringement, which is helpful information for deployers attempting to assess their own risk. However, the training issue remains unresolved, and U.S. and UK courts operate under distinct legal traditions.
Right now, that uncertainty is making AI companies extremely difficult for investors to value. A foundation model built on training data that courts later find infringing is not just legally exposed; its entire development history becomes a liability.
Dataset-by-dataset, technology-by-technology analysis of training methods and licensing agreements, where they exist, is becoming a standard component of due diligence in AI transactions. For most M&A teams that is unfamiliar territory, and the frameworks for handling it are still being built.
As this develops, it seems possible that the licensing agreements themselves will influence the decisions made by the courts. Judges assessing fair use claims might be less receptive to AI companies claiming that obtaining permission would have been impractical if licensing is common and profitable. Strangely, the legality of not licensing and the practicality of licensing have become intertwined issues.
The market for data licensing is undoubtedly still in its early stages of development. Depending on whether training, fine-tuning, or real-time retrieval are involved, different structures are employed. Those structures will eventually come together as legal clarity develops, either through legislative action, court rulings, or both.
Until then, the agreements will continue to be made, the number of lawsuits will continue to increase, and somewhere in the midst of all that chaos, the regulations pertaining to creative ownership and artificial intelligence are being drafted slowly and haltingly.
Disclaimer
Nothing published on Creative Learning Guild — including news articles, legal news, lawsuit summaries, settlement guides, legal analysis, financial commentary, expert opinion, educational content, or any other material — constitutes legal advice, financial advice, investment advice, or professional counsel of any kind. All content on this website is provided strictly for informational, educational, and news reporting purposes only. Consult your legal or financial advisor before taking any step.
