Published: July 19, 2024
For years, the developers of powerful artificial intelligence systems have relied on vast amounts of text, images, and videos scraped from the internet to train their models. However, a recent study by the Data Provenance Initiative, an MIT-led research group, has revealed an alarming trend: the data that powers AI is disappearing at a rapid pace.
The study, which analyzed 14,000 web domains, found that many of the most important web sources used for training AI models have restricted the use of their data. This has resulted in an 'emerging crisis in consent,' as publishers and online platforms take steps to prevent their data from being harvested. The researchers estimate that 5% of all data, and 25% of data from the highest-quality sources, has been restricted through the Robots Exclusion Protocol or website terms of service.
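The Robots Exclusion Protocol mentioned above works through a plain-text `robots.txt` file that sites publish to tell crawlers what they may fetch. As a minimal sketch of how such a restriction behaves, the snippet below uses Python's standard-library `urllib.robotparser` to evaluate a hypothetical policy (the site and rules are illustrative, not taken from the study):

```python
from urllib import robotparser

# A hypothetical robots.txt of the kind publishers now use to opt out
# of AI training: GPTBot (OpenAI's crawler) is blocked site-wide,
# while all other agents remain free to crawl.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

parser = robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The AI crawler is shut out, but an ordinary browser agent is not.
print(parser.can_fetch("GPTBot", "https://example.com/article"))       # False
print(parser.can_fetch("Mozilla/5.0", "https://example.com/article"))  # True
```

Note that the protocol is purely advisory: it signals consent (or its withdrawal) but does not technically enforce it, which is part of why the study frames the trend as a crisis of consent rather than of access control.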
This decline in consent will have significant ramifications not just for AI companies, but also for researchers, academics, and non-commercial entities that rely on public data sets. The study's lead author, Shayne Longpre, warns that the rapid decline in accessible data will have a profound impact on the development of AI.
Data is the lifeblood of today's generative AI systems, which are fed billions of examples of text, images, and videos. The more high-quality data these models receive, the better their outputs generally are. However, the backlash against the use of data for AI training has grown, with many publishers and website owners expressing misgivings about being used as AI training fodder without permission or compensation.
As a result, some publishers have set up paywalls or changed their terms of service to limit the use of their data for AI training. Others have blocked the automated web crawlers used by companies like OpenAI, Anthropic, and Google. Sites like Reddit and Stack Overflow have begun charging AI companies for access to data, and a few publishers have taken legal action, including The New York Times, which sued OpenAI and Microsoft for copyright infringement last year.
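In practice, blocking these companies' crawlers usually means listing their user-agent tokens in `robots.txt`. The fragment below is a hypothetical example using real, publicly documented crawler tokens (GPTBot for OpenAI, ClaudeBot for Anthropic, and Google-Extended, Google's opt-out token for AI training); the choice to block all three while leaving other agents alone is illustrative:

```text
# OpenAI's crawler
User-agent: GPTBot
Disallow: /

# Anthropic's crawler
User-agent: ClaudeBot
Disallow: /

# Google's AI-training opt-out token
User-agent: Google-Extended
Disallow: /
```

Because each company uses its own token, a publisher must name every crawler it wants to exclude, which is one reason blanket terms-of-service changes have accompanied these per-crawler rules.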
Companies like OpenAI, Google, and Meta have gone to great lengths to gather more data to improve their systems. Some have struck deals with publishers, including The Associated Press and News Corp, the owner of The Wall Street Journal, to gain ongoing access to their content. However, smaller AI outfits and academic researchers who rely on public data sets are in trouble.
The looming data crisis raises important questions about the ownership and control of data, and the ethics of using online content to train AI models. As the AI industry continues to grow, it is essential that we address these issues and find a way to balance the needs of AI developers with the rights of data owners.
Q: What is the main ingredient in today's generative AI systems?
A: Data is the main ingredient in today's generative AI systems, which are fed billions of examples of text, images, and videos.
Q: Why are publishers and website owners restricting access to their data?
A: Publishers and website owners are restricting access to their data due to concerns about being used as AI training fodder without permission or compensation.
Q: How does the decline in accessible data affect AI development?
A: The decline in accessible data will have a profound impact on the development of AI, making it more difficult to train and improve AI models.
Q: What are some examples of companies that have struck deals with publishers for data access?
A: Companies like OpenAI, Google, and Meta have struck deals with publishers, including The Associated Press and News Corp, the owner of The Wall Street Journal, to gain ongoing access to their content.
Q: Who will be most affected by the looming data crisis?
A: Smaller AI outfits and academic researchers who rely on public data sets will be most affected by the looming data crisis.