Irony or prescience?
On July 24, 2024, as tech stocks plummeted in a sudden AI-related market rout, researchers from the universities of Oxford and Cambridge, with colleagues at other institutions, published a groundbreaking study in Nature. The paper, titled “AI models collapse when trained on recursively generated data,” revealed a critical weakness in the way artificial intelligence systems are trained, one that could have far-reaching consequences for the technology’s future.
The irony of timing was not lost on industry observers. As investors fled from AI-focused companies, citing concerns about overvaluation and regulatory risks, the Nature study exposed a more fundamental threat to the technology’s long-term viability. This confluence of events underscored the volatile and unpredictable nature of AI development, highlighting the need for a deeper understanding of its limitations and potential pitfalls.
The proliferation of AI-generated content
As large language models (LLMs) like GPT-3 and ChatGPT have become increasingly sophisticated, they have begun to contribute significantly to the vast pool of online text. This trend shows no signs of slowing, with AI-generated content becoming ubiquitous across various platforms and industries. The researchers behind the Nature study pondered a crucial question: “What may happen to GPT-{n} once LLMs contribute much of the text found online?”
The answer, it turns out, is far from reassuring. The study revealed that indiscriminate use of model-generated content in training leads to irreversible defects in the resulting models. This phenomenon, dubbed “model collapse,” occurs when AI systems are trained on data produced by other AI models without sufficient input from human-created sources.
The rapid onset of model collapse
One of the most alarming findings of the research is the speed at which model collapse can occur. The study demonstrated that AI systems can completely break down after just seven to ten generations of training on synthetic data, producing nothing but gibberish. This rapid deterioration is due to a feedback loop in which errors are amplified and compounded with each iteration.
The researchers tested various types of AI models, including Gaussian mixture models (GMMs), variational autoencoders (VAEs), and large language models, and observed collapse in each case. In one example, a language model fed text about medieval church architecture needed only nine generations of recursive training before its output devolved into a repetitive list of jackrabbit varieties.
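The mechanism is easier to see in miniature. The sketch below is not the study’s code; it is a toy simulation (Python, using only NumPy) of the same feedback loop: fit a simple model to a finite sample, generate the next “training set” entirely from that fitted model, and repeat. Because every fit sees only a finite, model-generated sample, estimation error compounds, and the fitted distribution tends to drift and narrow, shedding the tails of the original data.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

SAMPLE_SIZE = 100   # deliberately small: finite-sample error drives the effect
GENERATIONS = 30

# Generation 0: "human" data drawn from a standard normal distribution.
data = rng.normal(loc=0.0, scale=1.0, size=SAMPLE_SIZE)

for gen in range(1, GENERATIONS + 1):
    # "Train" a model on the current data: estimate its mean and spread.
    mu, sigma = data.mean(), data.std()
    # Generate the next training set purely from the fitted model,
    # with no fresh human data mixed back in.
    data = rng.normal(loc=mu, scale=sigma, size=SAMPLE_SIZE)
    # Track how much of the original distribution's tails survives.
    tail_mass = np.mean(np.abs(data) > 2.0)
    if gen % 5 == 0:
        print(f"generation {gen:2d}: std={sigma:.3f}  tail mass (|x|>2): {tail_mass:.1%}")
```

In the paper’s terms, losing the tails first is “early model collapse”; the eventual convergence to a narrow, repetitive distribution that bears little resemblance to the original is “late model collapse”.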
Potential solutions and their challenges
While the study paints a grim picture of AI’s future, it also suggests some potential solutions to mitigate the risk of model collapse. These include:
- Preserving and periodically retraining AI models on “clean,” pre-AI data sets
- Introducing and retraining models on new human-generated content
- Exploring protective measures, such as watermarking content
However, implementing these solutions presents significant challenges. Watermarking, for instance, can be stripped out, and AI companies have so far been reluctant to coordinate on such schemes. Moreover, the sheer volume of AI-generated content already online makes it increasingly difficult to filter synthetic data out of training sets.
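To make the first two mitigations concrete, the toy simulation above can be extended so that a fixed share of every generation’s training set is drawn from a preserved archive of pre-AI data rather than from the previous model. This is only a sketch under that assumption (the 30% figure is illustrative, not taken from the study), but it shows the anchoring effect: the clean slice keeps pulling the fitted distribution back toward the original.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

SAMPLE_SIZE = 100
GENERATIONS = 30
HUMAN_FRACTION = 0.3   # illustrative share of each training set reserved for pre-AI data

# A preserved archive of "clean", pre-AI data (here, the original distribution).
clean_archive = rng.normal(loc=0.0, scale=1.0, size=10_000)

data = rng.choice(clean_archive, size=SAMPLE_SIZE, replace=False)

for gen in range(1, GENERATIONS + 1):
    mu, sigma = data.mean(), data.std()
    n_synthetic = int(SAMPLE_SIZE * (1 - HUMAN_FRACTION))
    # Most of the next training set still comes from the fitted model...
    synthetic = rng.normal(loc=mu, scale=sigma, size=n_synthetic)
    # ...but a slice of preserved human data is re-injected every generation,
    # anchoring the model to the original distribution instead of its own outputs.
    human = rng.choice(clean_archive, size=SAMPLE_SIZE - n_synthetic, replace=False)
    data = np.concatenate([synthetic, human])
    if gen % 5 == 0:
        print(f"generation {gen:2d}: std={sigma:.3f}")
```

Even this simple anchoring tends to keep the estimated spread close to the original over the same number of generations. The hard part in practice is knowing which web data is actually human-generated, which is exactly where watermarking and provenance tracking come in.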
The far-reaching implications of model collapse
The potential consequences of model collapse extend far beyond the realm of chatbots and internet content. As AI systems become more deeply integrated into various aspects of our lives, the stakes of any degradation rise accordingly.
Trust in AI-powered systems could erode rapidly if they begin producing nonsensical or unreliable outputs. This could have severe implications for industries relying on AI for critical decision-making, such as healthcare, finance, and autonomous vehicles. As Emily Wenger, a researcher not involved in the study, points out, “The problem must be taken seriously if we are to sustain the benefits of training from large-scale data scraped from the web.”
Furthermore, the study suggests that model collapse could lead to a loss of diversity in AI outputs. As less common elements of the original data distribution are gradually eliminated through successive generations of training, AI systems may fail to reflect the full variety of human experience and knowledge. This homogenization of AI-generated content could have profound societal implications, potentially reinforcing existing biases and limiting the range of ideas and perspectives represented in digital spaces.
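This homogenization can also be illustrated in miniature. The sketch below is another assumption-laden toy, not the study’s experiment: it uses a skewed “vocabulary” in which one category dominates and many are rare, standing in for common versus uncommon ideas. Each generation re-estimates the category frequencies from the previous generation’s samples and generates from that estimate, so any rare category that happens to miss a generation can never return.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

SAMPLE_SIZE = 200
GENERATIONS = 20

# A skewed "vocabulary": one dominant category plus twenty rare ones,
# standing in for common versus uncommon ideas in the training data.
true_probs = np.array([0.80] + [0.01] * 20)
vocab = np.arange(true_probs.size)

sample = rng.choice(vocab, size=SAMPLE_SIZE, p=true_probs)

for gen in range(1, GENERATIONS + 1):
    # "Train": estimate category frequencies from the current sample.
    counts = np.bincount(sample, minlength=vocab.size)
    est_probs = counts / counts.sum()
    # "Generate": the next training set comes only from the estimated model,
    # so any category estimated at zero probability is gone for good.
    sample = rng.choice(vocab, size=SAMPLE_SIZE, p=est_probs)
    surviving = int((np.bincount(sample, minlength=vocab.size) > 0).sum())
    if gen % 5 == 0:
        print(f"generation {gen:2d}: categories still represented: {surviving}/{vocab.size}")
```

Run for a few dozen generations, the count of represented categories typically shrinks toward the dominant few, a distributional analogue of the homogenization described above.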
The seeds of collapse have already been sown
Perhaps most concerning is the realization that the conditions for model collapse are already present in today’s AI landscape. NewsGuard, a company that assesses news platform credibility, has identified more than 500 news outlets that primarily rely on AI to generate articles with minimal human oversight. This widespread use of AI-generated content in journalism exemplifies the potential for rapid propagation of errors and biases.
The case of Microsoft’s MSN news portal serves as a stark warning. Relying heavily on AI to generate stories without human editorial supervision, the platform published an article in September 2023 that described the recently deceased NBA player Brandon Hunter as “useless at 42”. Such egregious errors highlight the risks of unchecked AI content generation and underscore the urgent need for robust safeguards against model collapse.
As we stand at the crossroads of AI development, the Nature study serves as a crucial wake-up call. The promise of artificial intelligence remains immense, but so too do the risks of its unchecked proliferation. To ensure a future in which AI enhances rather than degrades our information ecosystem, stakeholders across industry, academia, and government must work together to address the challenge of model collapse. Only through vigilant oversight, continued research, and a commitment to preserving the value of human-generated data can we hope to harness the full potential of AI while avoiding its pitfalls.
Peter is chairman of Flexiion and has a number of other business interests. (c) 2024, Peter Osborn