Published: June 15, 2025
For artificial intelligence researchers, the launch of OpenAI's ChatGPT on November 30, 2022, changed the world in a way similar to the detonation of the first atomic bomb. The Trinity test, conducted in New Mexico on July 16, 1945, marked the beginning of the atomic age. One manifestation of that moment was the contamination of metals manufactured after that date, as airborne particulates left over from Trinity and subsequent nuclear weapons tests permeated the environment.
The contaminated metals interfered with the function of sensitive medical and technical equipment. Thus, until recently, scientists building those devices sought metals free of this man-made radioactive contamination, referred to as low-background steel, low-background lead, and so on. One source of low-background steel was the German naval fleet that Admiral Ludwig von Reuter scuttled at Scapa Flow in 1919 to keep the ships out of British hands.
Shortly after the debut of ChatGPT, academics and technologists started to wonder whether the recent explosion of AI models had created contamination of its own. Their concern is that AI models are being trained on synthetic data created by other AI models. Subsequent generations of AI models may therefore become less and less reliable, a state known as AI model collapse.
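The dynamic is easy to see in a toy simulation (an illustrative Python sketch under simplified assumptions, not any lab's actual experiment): fit a simple model to a corpus, generate a synthetic corpus from the fitted model, retrain on the synthetic corpus, and repeat. Rare patterns that fail to appear in one synthetic generation can never be produced again, so diversity only ever shrinks.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size = 100
# "Real" data: a long-tailed (Zipf-like) distribution over 100 token types.
true_probs = 1.0 / np.arange(1, vocab_size + 1)
true_probs /= true_probs.sum()

probs = true_probs.copy()
for generation in range(1, 11):
    # Each generation: draw a finite synthetic corpus from the current model,
    # then refit the model as the empirical token frequencies of that corpus.
    corpus = rng.choice(vocab_size, size=500, p=probs)
    counts = np.bincount(corpus, minlength=vocab_size)
    probs = counts / counts.sum()
    surviving = int((probs > 0).sum())
    print(f"generation {generation:2d}: token types still generated = {surviving}")

# Token types that miss one synthetic corpus get probability zero and can never
# come back, so the tail of the distribution erodes generation after generation.
```

Real-world collapse is subtler than this, but the failure mode the papers describe is the same: when synthetic output displaces human data, the tails of the original distribution quietly disappear.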
In March 2023, John Graham-Cumming, then CTO of Cloudflare and now a board member, registered the web domain lowbackgroundsteel.ai and began posting about various sources of data compiled prior to the 2022 AI explosion, such as the Arctic Code Vault (a snapshot of GitHub repositories taken on February 2, 2020).
The Register asked Graham-Cumming whether he came up with the low-background steel analogy, but he said he didn't recall. 'I knew about low-background steel from reading about it years ago,' he responded by email. 'And I’d done some machine learning stuff in the early 2000s for POPFile. It was an analogy that just popped into my head and I liked the idea of a repository of known human-created stuff. Hence the site.'
Graham-Cumming isn't sure contaminated AI corpora are a problem, though. 'The interesting question is, "Does this matter?"' he asked. Some AI researchers think it does, and that AI model collapse is concerning. The year after ChatGPT's debut, several academic papers explored the potential consequences of model collapse, or Model Autophagy Disorder (MAD), as one set of authors termed the issue. The Register interviewed one of the authors of those papers, Ilia Shumailov, in early 2024.
Though AI practitioners have argued that model collapse can be mitigated, the extent to which that's true remains a matter of ongoing debate. Just last week, Apple researchers entered the fray with an analysis of model collapse in large reasoning models, only to have their conclusions challenged by Alex Lawsen, senior program associate at Open Philanthropy, with help from the AI model Claude Opus. Essentially, Lawsen argued that Apple's reasoning evaluations, which found reasoning models fail beyond a certain level of complexity, were flawed because they forced the models to write more tokens than their output limits could accommodate.
In December 2024, academics affiliated with several universities reiterated concerns about model collapse in a paper titled 'Legal Aspects of Access to Human-Generated Data and Other Essential Inputs for AI Training.' They contended the world needs sources of clean data, akin to low-background steel, to maintain the function of AI models and to preserve competition. 'I often say that the greatest contribution to nuclear medicine in the world was the German admiral who scuppered the fleet in 1919,' Maurice Chiodo, research associate at the Centre for the Study of Existential Risk at the University of Cambridge and one of the co-authors, told The Register. 'Because that enabled us to have this almost infinite supply of low-background steel. If it weren’t for that, we’d be kind of stuck.
'So the analogy works here because you need something that happened before a certain date. Now here the date is more flexible; let's say 2022. But if you're collecting data from before 2022, you're fairly confident that it has minimal, if any, contamination from generative AI. Everything before the date is "safe, fine, clean"; everything after that is "dirty".'
What Chiodo and his co-authors worry about is not so much that models fed on their own output will produce unreliable information, but that access to supplies of clean data will confer a competitive advantage to early market entrants. With AI model-makers spewing more and more generative AI data on a daily basis, AI startups will find it harder to obtain quality training data, creating a lockout effect that makes their models more susceptible to collapse and reinforces the power of dominant players. That's their theory, anyway.
'You can build a very usable model that lies. You can build quite a useless model that tells the truth,' Chiodo said. Rupprecht Podszun, professor of civil and competition law at Heinrich Heine University Düsseldorf and a co-author, said, 'If you look at email data or human communication data – which pre-2022 is really data which was typed in by human beings and sort of reflected their style of communication – that's much more useful [for AI training] than getting what a chatbot communicated after 2022.'
Podszun said that for AI training purposes, the accuracy of the content matters less than the style and creativity of genuine human interaction. Chiodo added that everyone participating in generative AI is polluting the data supply for everyone: for the model makers that follow, and even for the current ones.
So how can we clean up the AI environment? 'In terms of policy recommendation, it's difficult,' Chiodo admitted. 'We start by suggesting things like forced labeling of AI content, but even that gets hard because it's very hard to label text and very easy to clean off watermarking.' Labeling pictures and videos becomes complicated when different jurisdictions are involved, Chiodo added. 'Anyone can deploy data anywhere on the internet, and so because of this scraping of data, it's very hard to force all operating LLMs to always watermark output that they have,' he said.
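For context on why watermarking text is so fragile, here is a deliberately toy Python sketch of one common research idea, a statistical 'green list' watermark; the vocabulary, seeding scheme, and numbers are illustrative assumptions, not any vendor's actual implementation. It shows both how detection works and why modest paraphrasing dilutes the signal.

```python
import hashlib
import random

VOCAB = [f"w{i}" for i in range(1000)]  # toy vocabulary of 1,000 "tokens"

def green_list(prev_token: str, frac: float = 0.5) -> set:
    # Hash the previous token to seed a PRNG, then mark half the vocabulary
    # "green". A detector can recompute the list without access to the model.
    seed = int(hashlib.sha256(prev_token.encode()).hexdigest(), 16) % (2**32)
    return set(random.Random(seed).sample(VOCAB, int(len(VOCAB) * frac)))

def generate_watermarked(length: int = 300) -> list:
    # Toy "generation": always pick the next token from the green list.
    rng = random.Random(1)
    tokens = ["w0"]
    for _ in range(length):
        tokens.append(rng.choice(sorted(green_list(tokens[-1]))))
    return tokens

def green_fraction(tokens: list) -> float:
    # Detection statistic: the share of tokens in their predecessor's green
    # list. Roughly 0.5 for unmarked text, close to 1.0 for watermarked text.
    hits = sum(tok in green_list(prev) for prev, tok in zip(tokens, tokens[1:]))
    return hits / (len(tokens) - 1)

marked = generate_watermarked()
unmarked = [random.choice(VOCAB) for _ in range(300)]
# "Cleaning" the watermark: a crude paraphrase that swaps out every third token.
paraphrased = [random.choice(VOCAB) if i % 3 == 0 else t for i, t in enumerate(marked)]

print(f"watermarked: {green_fraction(marked):.2f}")    # ~1.00
print(f"unmarked:    {green_fraction(unmarked):.2f}")  # ~0.50
print(f"paraphrased: {green_fraction(paraphrased):.2f}")  # noticeably diluted
```

Rewriting even a third of the tokens pulls the detection statistic well away from its watermarked value, which is the sense in which watermarks on text are easy to clean off.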
The paper discusses other policy options, like promoting federated learning, in which those holding uncontaminated data would allow third parties to train models on that data without handing over the data itself. The idea is to limit the competitive advantage of those with access to unadulterated datasets, so we don't end up with AI model monopolies. But as Chiodo observed, there are other risks to a centralized, government-maintained store of uncontaminated data. 'You've got privacy and security risks for these vast amounts of data, so what do you keep, what do you not keep, how are you careful about what you keep, how do you keep it secure, how do you keep it politically stable,' he said. 'You might put it in the hands of some governments who are okay today, but tomorrow they're not.'
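Federated learning in its simplest form, federated averaging, works roughly as in the Python sketch below; the clients, data, and numbers are hypothetical, and a real deployment would add safeguards such as secure aggregation. The point is only that model updates, not the underlying clean data, are what gets shared.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: three data holders each keep a private "clean" dataset,
# modelled here as (X, y) samples from the same underlying linear relationship.
def make_client(n: int):
    X = rng.normal(size=(n, 3))
    y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=n)
    return X, y

clients = [make_client(n) for n in (200, 500, 300)]

def local_step(w, X, y, lr=0.1):
    # One gradient step on the client's own data; the raw data never leaves it.
    grad = 2 * X.T @ (X @ w - y) / len(y)
    return w - lr * grad

w_global = np.zeros(3)
for _ in range(50):
    # Each round, clients refine the shared weights locally, then only the
    # updated weights are sent back and averaged, weighted by dataset size.
    updates = [local_step(w_global, X, y) for X, y in clients]
    sizes = [len(y) for _, y in clients]
    w_global = np.average(updates, axis=0, weights=sizes)

print("federated estimate:", np.round(w_global, 2))  # approaches [2.0, -1.0, 0.5]
```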
Podszun argues that competition in the management of uncontaminated data can help mitigate the risks. 'That would obviously be something that is a bulwark against political influence, against technical mistakes, against sort of commercial concentration,' he said. 'The problem we're identifying with model collapse is that this issue is going to affect the development of AI itself,' said Chiodo. 'If the government cares about long-term good, productive, competitive development of AI, large-service models, then it should care very much about model collapse and about creating guardrails, regulations, guides for what's going to happen with datasets, how we might keep some datasets clean, how we might grant access to data.'
There's not much government regulation of AI in the US to speak of. The UK is also pursuing a light-touch regulatory regime for fear of falling behind rival nations. Europe, with the AI Act, seems more willing to set some ground rules.
Q: What is AI model collapse?
A: AI model collapse refers to the potential issue where AI models become less reliable and accurate over time due to being trained on synthetic data created by other AI models.
Q: Why is low-background steel relevant to AI?
A: Low-background steel is a metaphor used to describe data that is uncontaminated by AI-generated content, which is crucial for maintaining the reliability and accuracy of AI models.
Q: What are the potential consequences of model collapse?
A: The consequences of model collapse can include the production of unreliable information, reduced accuracy in AI models, and a competitive advantage for early market entrants with access to clean data.
Q: How can we mitigate the risks of AI data contamination?
A: Possible solutions include forced labeling of AI content, promoting federated learning, and creating policies to ensure the availability of uncontaminated data for AI training.
Q: What is the role of government in regulating AI data?
A: Governments can play a crucial role in setting regulations and guidelines to ensure the long-term, competitive, and secure development of AI, including measures to prevent data contamination and model collapse.