For years, the people building powerful artificial intelligence systems have used enormous troves of text, images and videos pulled from the internet to train their models.
Now that data is drying up.
Over the past year, many of the most important web sources used for training AI models have restricted the use of their data, according to a study published last week by the Data Provenance Initiative, a Massachusetts Institute of Technology-led research group.
The study, which looked at 14,000 web domains that are included in three commonly used AI training data sets, discovered an “emerging crisis in consent,” as publishers and online platforms have taken steps to prevent their data from being harvested.
The researchers estimate that in the three data sets — called C4, RefinedWeb and Dolma — 5% of all data, and 25% of data from the highest-quality sources, has been restricted. Those restrictions are set up through the Robots Exclusion Protocol, a decades-old method for website owners to prevent automated bots from crawling their pages using a file called robots.txt.
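For readers curious how such a restriction works in practice, the following is a minimal sketch, not taken from the study, of how a well-behaved crawler might consult a robots.txt file before fetching a page. It uses Python's standard urllib.robotparser module. The rules, the example.com addresses and the "ExampleResearchBot" name are hypothetical and chosen only for illustration; "GPTBot" is the name OpenAI publishes for its crawler. Compliance with robots.txt is voluntary, so the check only matters for crawlers that choose to perform it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one named AI crawler from the whole site,
# and keep every other bot out of /private/ only.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "ExampleResearchBot" is an illustrative name, not a real crawler.
    for agent in ("GPTBot", "ExampleResearchBot"):
        for path in ("/articles/1", "/private/notes"):
            url = "https://example.com" + path
            print(agent, path, is_allowed(ROBOTS_TXT, agent, url))
```

The sketch also shows the limit of the format: rules are keyed to a crawler's name, not to what the collected data will later be used for.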
The study also found that as much as 45% of the data in one set, C4, had been restricted by websites’ terms of service.
“We’re seeing a rapid decline in consent to use data across the web that will have ramifications not just for AI companies, but for researchers, academics and noncommercial entities,” Shayne Longpre, the study’s lead author, said in an interview.
Data is the main ingredient in today’s generative AI systems, which are fed billions of examples of text, images and videos. Much of that data is scraped from public websites by researchers and compiled in large data sets, which can be downloaded and freely used, or supplemented with data from other sources.
Learning from that data is what allows generative AI tools like OpenAI’s ChatGPT, Google’s Gemini and Anthropic’s Claude to write, code and generate images and videos. The more high-quality data is fed into these models, the better their outputs generally are.
For years, AI developers were able to gather data fairly easily. But the generative AI boom of the past few years has led to tensions with the owners of that data — many of whom have misgivings about being used as AI training fodder or at least want to be paid for it.
As the backlash has grown, some publishers have set up paywalls or changed their terms of service to limit the use of their data for AI training. Others have blocked the automated web crawlers used by companies like OpenAI, Anthropic and Google.
Sites like Reddit and Stack Overflow have begun charging AI companies for access to data, and a few publishers have taken legal action, including The New York Times, which sued OpenAI and Microsoft for copyright infringement last year, alleging that the companies used news articles to train their models without permission.
Companies like OpenAI, Google and Meta have gone to extreme lengths in recent years to gather more data to improve their systems, including transcribing YouTube videos and bending their own data policies.
More recently, some AI companies have struck deals with publishers including The Associated Press and News Corp., the owner of The Wall Street Journal, giving them ongoing access to their content.
But widespread data restrictions may pose a threat to AI companies, which need a steady supply of high-quality data to keep their models fresh and up to date.
They could also spell trouble for smaller AI outfits and academic researchers who rely on public data sets and can’t afford to license data directly from publishers. Common Crawl, one such data set that comprises billions of pages of web content and is maintained by a nonprofit, has been cited in more than 10,000 academic studies, Longpre said.
It’s not clear which popular AI products have been trained on these sources, since few developers disclose the full list of data they use. But data sets derived from Common Crawl, including C4 (which stands for Colossal Clean Crawled Corpus), have been used by companies including Google and OpenAI to train previous versions of their models. Spokespeople for Google and OpenAI declined to comment.
Yacine Jernite, a machine-learning researcher at Hugging Face, a company that provides tools and data to AI developers, characterized the consent crisis as a natural response to the AI industry’s aggressive data-gathering practices.
“Unsurprisingly, we’re seeing blowback from data creators after the text, images and videos they’ve shared online are used to develop commercial systems that sometimes directly threaten their livelihoods,” he said.
But he cautioned that if all AI training data needed to be obtained through licensing deals, it would exclude “researchers and civil society from participating in the governance of the technology.”
Stella Biderman, the executive director of EleutherAI, a nonprofit AI research organization, echoed those fears.
“Major tech companies already have all of the data,” she said. “Changing the license on the data doesn’t retroactively revoke that permission, and the primary impact is on later-arriving actors, who are typically either smaller startups or researchers.”
Longpre said that one of the big takeaways from the study is that we need new tools to give website owners more precise ways to control the use of their data. Some sites might object to AI giants using their data to train chatbots for a profit but might be willing to let a nonprofit or educational institution use the same data, he said. Right now, there’s no good way for them to distinguish between those uses, or block one while allowing the other.
But there’s also a lesson here for big AI companies, which have treated the internet as an all-you-can-eat data buffet for years, without giving the owners of that data much of value in return. Eventually, if you take advantage of the web, the web will start shutting its doors.