The legal storm surrounding generative AI has reached a new peak. Encyclopedia Britannica and its subsidiary, Merriam-Webster, have officially filed a lawsuit against OpenAI, alleging that the AI giant’s models have not just learned from their vast repositories of knowledge, but have effectively “memorized” them.
This lawsuit, filed in federal court and first reported by Reuters, marks a significant escalation in the ongoing tension between traditional publishers and the architects of Large Language Models (LLMs). While previous lawsuits from authors and news organizations focused on the act of training, Britannica’s case highlights a more technical and potentially more damaging phenomenon: the near-verbatim regurgitation of proprietary text and definitions.
At the heart of the complaint is the distinction between an AI “understanding” a concept and simply storing a copy of the text. Britannica alleges that GPT-4 can output near-identical copies of its copyrighted articles on demand. For a company that has spent over 250 years curating human knowledge, this isn’t just a copyright violation; it is a direct threat to its business model.
To understand the gravity of this, consider the analogy of a student and a textbook. If a student reads a history book and then writes an original essay based on what they learned, that is generally considered transformative use. However, if that student walks into an exam and recites the textbook word-for-word, they are no longer demonstrating understanding; they are acting as a human photocopier. Britannica argues that OpenAI’s models are doing the latter.
The lawsuit provides specific examples in which GPT-4 allegedly produced responses that were “substantially similar” to Britannica’s entries. In the world of LLMs, this is known as “regurgitation”: a model encounters a passage so often during training that its weights effectively encode the text, and a sufficiently specific prompt can pull it back out near-verbatim.
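To make this concrete, here is a minimal sketch of how regurgitation can be flagged: count how many long word sequences (n-grams) a model’s output shares with a reference text. The sample passage, the n-gram length, and the 0.5 threshold below are illustrative assumptions, not values from the lawsuit.

```python
# A minimal sketch of regurgitation detection via word-level n-gram overlap.
# The texts and the threshold are hypothetical placeholders.

def ngrams(text: str, n: int = 8) -> set:
    """Return the set of word-level n-grams in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_ratio(output: str, reference: str, n: int = 8) -> float:
    """Fraction of the output's n-grams that also appear in the reference."""
    out_grams = ngrams(output, n)
    if not out_grams:
        return 0.0
    return len(out_grams & ngrams(reference, n)) / len(out_grams)

reference = ("The French Revolution was a period of political and societal "
             "change in France that began with the Estates General of 1789.")
model_output = ("The French Revolution was a period of political and societal "
                "change in France that began with the Estates General of 1789.")

# Eight-word sequences rarely match by coincidence, so a high ratio is
# strong evidence of copying rather than independent composition.
if overlap_ratio(model_output, reference) > 0.5:
    print("Potential regurgitation detected")
```

Plagiarism detectors and training-data audits use far more sophisticated fingerprinting, but the underlying intuition is the same: long exact matches don’t happen by accident.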
For Merriam-Webster, the stakes are equally high. Dictionary definitions are, by necessity, concise and specific. If an AI provides a definition that matches Merriam-Webster’s unique phrasing and structural nuances, it bypasses the need for a user to ever visit the publisher’s website. This “zero-click” reality drains ad revenue and subscription potential from the very institutions that provide the high-quality data AI relies on.
We have seen similar cases from The New York Times and various prominent novelists, but the Britannica case is unique for two reasons. First, it targets what the models output rather than the act of training itself, which makes the alleged copying directly observable instead of hidden inside the training pipeline. Second, it involves reference works: encyclopedia entries and dictionary definitions whose value lies in precise, carefully curated wording, so a near-identical reproduction substitutes entirely for the original.
While OpenAI has not yet released a full rebuttal to this specific filing, their historical defense remains consistent. They argue that training AI models on publicly available internet data constitutes “fair use.” They contend that the models are creating something entirely new—a multi-purpose reasoning engine—rather than a database of existing works.
OpenAI also frequently points to “guardrails” they have implemented to prevent the exact type of regurgitation Britannica is complaining about. However, as this lawsuit suggests, those guardrails may be more porous than the company admits, especially when users employ specific prompting techniques to “extract” training data.
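For a sense of what such an extraction probe might look like, here is a hedged sketch: feed a model the opening of a known passage and check whether its continuation reproduces the rest. It assumes the openai Python SDK (v1+) with an API key in the environment; the passage, model choice, and similarity threshold are placeholders, not the method used in the litigation.

```python
# A hypothetical memorization probe: prompt the model with the start of a
# known passage and compare its continuation to the real text. The passage
# and threshold are invented for illustration.
import difflib
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

known_passage = (
    "An encyclopedia is a reference work or compendium providing summaries "
    "of knowledge, either general or special, in a particular field."
)
prompt_prefix = known_passage[:60]   # feed the model only the opening
expected_tail = known_passage[60:]   # what verbatim recall would produce

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user",
               "content": f"Continue this text exactly: {prompt_prefix}"}],
)
continuation = response.choices[0].message.content or ""

# SequenceMatcher gives a rough character-level similarity in [0, 1].
similarity = difflib.SequenceMatcher(None, continuation, expected_tail).ratio()
if similarity > 0.8:
    print("Model reproduced the passage nearly verbatim")
```

Researchers have published variations of this basic technique; a guardrail that blocks one prompt phrasing can often be sidestepped by another.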
One of the most difficult aspects of this legal battle is the technical reality of LLMs. Once a model is trained on a dataset, “unlearning” that specific data is notoriously difficult. It isn’t as simple as deleting a file from a hard drive: no single parameter stores the text. Each memorized passage is diffused across billions of parameters, the same ones that encode everything else the model knows, so surgically removing one source risks degrading unrelated capabilities.
If the court rules in favor of Britannica, OpenAI might be forced to do more than just pay a fine. They could be required to filter outputs more aggressively or, in a worst-case scenario for the tech firm, retrain models from scratch without the disputed data—a process that would cost millions of dollars and months of compute time.
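Aggressive output filtering could, in principle, look something like the sketch below: a post-generation check that refuses to return text overlapping heavily with a protected corpus. The passage list, threshold, and refusal message are invented for illustration; a production guardrail would match against fingerprints of billions of documents, not a Python list.

```python
# A hypothetical output-side guardrail: before a response reaches the user,
# check it against known protected passages and substitute a refusal when the
# overlap is too high. Corpus, threshold, and message are placeholders.
import difflib

PROTECTED_PASSAGES = [
    "An encyclopedia is a reference work or compendium providing summaries "
    "of knowledge, either general or special, in a particular field.",
]

def guarded(draft: str, threshold: float = 0.8) -> str:
    """Return the draft unless it is near-verbatim protected text."""
    for passage in PROTECTED_PASSAGES:
        similarity = difflib.SequenceMatcher(
            None, draft.lower(), passage.lower()
        ).ratio()
        if similarity > threshold:
            return "I can't reproduce that text verbatim, but I can summarize it."
    return draft

print(guarded("An encyclopedia is a reference work or compendium providing "
              "summaries of knowledge, either general or special, in a "
              "particular field."))
```

The design trade-off is familiar from spam and content filters: a strict threshold blocks legitimate quotation and fair-use excerpts, while a lax one lets paraphrased copies through.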
This lawsuit is a bellwether for the “data licensing” era of AI. We are moving away from the “Wild West” period where AI companies scraped the web with impunity. In the coming months, we will likely see more high-profile partnerships where AI firms pay for access to high-quality, verified data silos.
For users, this could mean that AI responses become more transparent, with clearer citations and links back to original sources. For the industry, it means that the cost of building a top-tier LLM is about to go up significantly as “free” data sources start putting up legal paywalls.
As the legal landscape shifts, publishers, developers, and everyday users alike should watch this case closely: its outcome will help determine how training data is licensed, how aggressively outputs are filtered, and who ultimately pays for the high-quality information that AI depends on.