- The AI industry has a major problem: The real-world data used to make smarter models is running out.
- Companies scrambling for an alternative think synthetic data could offer a solution.
- Research suggests synthetic data could poison AI with low-quality information.
The AI world is on the cusp of running out of its most valuable resource, and that is pushing industry leaders into a fierce debate over a fast-growing alternative being touted as a replacement: synthetic data, or essentially "fake" data.
For years, the likes of OpenAI and Google have scraped data from the internet to train the large language models that power their AI tools and features. These LLMs digested reams of text, video, and other media online produced by humans over centuries — be it research papers, novels, or YouTube clips.
Now, the supply of "real," human-generated data is running dry. The research firm Epoch AI predicts textual data could run out by 2028. Meanwhile, companies that have mined every corner of the internet for usable training data — sometimes breaking their policies to do so — face increased restrictions on what remains.
To some, that's not necessarily a problem. OpenAI CEO Sam Altman has argued that AI models should eventually produce synthetic data good enough to train themselves effectively. The allure is obvious: Training data has become one of the most precious resources in the AI boom, and the prospect of generating it cheaply and seemingly infinitely is tantalizing.
Still, researchers debate whether synthetic data is the magic bullet, with some arguing this path could lead to AI models poisoning themselves with poor-quality information and that they could "collapse" as a result.
A recent paper published by a group of Oxford and Cambridge researchers said that feeding a model with AI-generated data eventually led it to produce gibberish. The authors found that AI-generated data could still be used for training but should be balanced with real-world data.
As the well of usable human-generated data dries up, more companies are looking into using synthetic data. In 2021, the research firm Gartner predicted that by 2024, 60% of data used for developing AI would be synthetically generated.
"It's a crisis," said Gary Marcus, an AI analyst and professor emeritus of psychology and neural science at New York University. "People had the illusion that you could infinitely make large language models better by just using more and more data, but now they've basically used all the data they can."
Of synthetic data, he added: "Yes, it will help you with some problems, but the deeper problem is that these systems don't really reason; they don't really plan. All the synthetic data you can imagine is not going to solve that foundational problem."
More companies create synthetic data
The need for "fake" data hinges on the notion that real-world data is quickly running out.
This is partly because tech firms have been moving as fast as possible to use publicly available data to train AI in an effort to outsmart rivals. It's also because online data owners have become increasingly wary of companies taking their data for free.
OpenAI researchers revealed in 2020 how they used free data from Common Crawl, a web-crawl archive that the AI company said contained "nearly a trillion words" from online resources, to train the AI model that would eventually power ChatGPT.
Research published in July by the Data Provenance Initiative found websites were putting restrictions in place to stop AI firms from using data that didn't belong to them. News publications and other top sites are increasingly blocking AI companies from freely cribbing their data.
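The article does not say which mechanism sites use for these restrictions. One common option, assumed here, is a robots.txt file that names specific crawlers. The sketch below uses Python's standard urllib.robotparser to check whether a given crawler may fetch a page. The robots.txt contents, the example.com URL, and the "SomeOtherBot" name are hypothetical, and GPTBot is used only as an example of an AI crawler's user-agent string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one AI crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/news/story"  # placeholder URL
    for agent in ("GPTBot", "SomeOtherBot"):
        print(agent, "allowed:", allowed(agent, page))
```

A robots.txt rule is a request rather than an enforcement mechanism, which is one reason sites may also rely on terms of service or paid licensing deals.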
To get around this problem, companies such as OpenAI and Google are cutting checks for tens of millions of dollars for access to data from Reddit and news outlets, which act as conveyor belts of fresh data for training models. Even this has its limitations.
"There are no longer major areas of the textual web just waiting to be grabbed," Nathan Lambert, a researcher at the Allen Institute for AI, wrote in May.
This is where synthetic data comes in. Rather than being pulled from the real world, synthetic data is generated by AI systems that have been trained on real-world data.
In June, for instance, Nvidia released an AI model that can create artificial datasets for training and alignment. In July, researchers at the Chinese tech giant Tencent created a synthetic-data generator called Persona Hub, which does a similar job.
Some startups, such as Gretel and SynthLabs, are even popping up with the sole purpose of generating and selling troves of specific types of data to companies that need it.
Proponents of synthetic data offer fair reasons for its use. Like the real world, human-generated data is often messy, leaving researchers with the complex and laborious task of cleaning and labeling it before it can be used.
Synthetic data may fill holes that human data cannot. In late July, Meta introduced Llama 3.1, a new series of AI models that generate synthetic data and rely on it for "fine-tuning" in training. In particular, it used the data to improve the performance of specific skills, such as coding in languages like Python, Java, and Rust, as well as solving math problems.
Synthetic training could be particularly effective for smaller AI models. Microsoft last year said it gave OpenAI's models a diverse list of words that a typical 3- to 4-year-old would know and then asked them to generate short stories using that data. The resulting dataset was used to create a group of small but capable language models.
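As a rough illustration of that kind of setup, the sketch below builds story prompts by sampling a few words from a small vocabulary. Everything in it is an assumption made for illustration: the word list, the split into nouns, verbs, and adjectives, and the prompt wording are hypothetical stand-ins, not Microsoft's actual data or prompts, and the step of sending each prompt to a model is left out.

```python
import random

# Hypothetical stand-in vocabulary; the real word list is not given here.
VOCABULARY = {
    "noun": ["dog", "ball", "tree", "cake", "boat"],
    "verb": ["jump", "find", "share", "hide", "sing"],
    "adjective": ["happy", "tiny", "red", "soft", "brave"],
}


def build_prompt(rng):
    """Sample one word per category and return (prompt, sampled_words)."""
    words = [rng.choice(VOCABULARY[kind]) for kind in ("noun", "verb", "adjective")]
    prompt = (
        "Write a short story using only words a typical 3- to 4-year-old "
        "would understand. The story must include these words: "
        + ", ".join(words)
        + "."
    )
    return prompt, words


def build_prompts(count, seed=0):
    """Build `count` prompts with a seeded generator for repeatability."""
    rng = random.Random(seed)
    return [build_prompt(rng) for _ in range(count)]


if __name__ == "__main__":
    for prompt, _ in build_prompts(3):
        print(prompt)
```

Varying the required words from prompt to prompt is one simple way to push a generator toward diverse outputs instead of many near-identical stories.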
Synthetic data may help offer some effective "countertuning" to the biases produced by real-world data, too. In their 2021 paper, "On the Dangers of Stochastic Parrots," the former Google researchers Timnit Gebru, Margaret Mitchell, and others said that LLMs trained on massive datasets of text from the internet would likely reflect the data's biases.
In April, a group of Google DeepMind researchers published a paper championing the use of synthetic data to address problems around data scarcity and privacy concerns in training, while cautioning that ensuring the accuracy and lack of bias in this AI-generated data "remains a critical challenge."
'Habsburg AI'
While the AI industry has found some advantages in synthetic data, it faces serious issues it can't afford to ignore, chief among them the risk that synthetic data can wreck AI models.
In Meta's research paper on Llama 3.1, the company said that training the 405-billion-parameter version of the latest model "on its own generated data is not helpful" and may even "degrade performance."
A study published in the journal Nature last month found that "indiscriminate use" of synthetic data in model training could cause "irreversible defects." The researchers called this phenomenon "model collapse" and said that the problem must be taken seriously "if we are to sustain the benefits of training from large-scale data scraped from the web."
Jathan Sadowski, a senior research fellow at Monash University, coined a term for this idea: Habsburg AI, in reference to the Austrian dynasty that some historians believe destroyed itself through inbreeding. Since coining the term, Sadowski told Business Insider he has felt validated by the research backing his assertion that models heavily trained on AI outputs could become mutated.
"The open question for researchers and companies building AI systems is, how much synthetic data is too much?" Sadowski said. "They need to find any possible solution to overcome the challenges of data scarcity for AI systems," he added, noting that some of the solutions may turn out to be short-term fixes that do more harm than good.
However, research published in April found that models trained on their own generated data didn't necessarily need to "collapse" if they were trained with both "real" and synthetic data. Now, some companies are betting on a future of "hybrid data," where synthetic data is generated by using some real data in an effort to stop the model from going off-piste.
Scale AI, which helps companies label and test data, said the company was exploring "the direction of hybrid data," using both synthetic and nonsynthetic data. (Scale AI CEO Alexandr Wang recently said: "Hybrid data is the real future.")
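The intuition behind "model collapse" and the hybrid remedy can be shown with a toy simulation. This is a minimal sketch under strong assumptions, not the method used in the studies cited above: the "model" is just a Gaussian fitted to 20 samples, and each generation is trained on samples drawn from the previous generation's fit. The function names, sample size, and generation count are all choices made for illustration.

```python
import random
import statistics


def fit(samples):
    """Fit a Gaussian by maximum likelihood: return (mean, std)."""
    return statistics.fmean(samples), statistics.pstdev(samples)


def run(generations, n, real_fraction, seed=0):
    """Retrain a Gaussian 'model' repeatedly and return its final (mean, std).

    Each generation trains on n samples: a share drawn from the true
    distribution N(0, 1) ("real" data) and the rest drawn from the
    previous generation's fitted model ("synthetic" data).
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0 matches the real distribution
    n_real = int(n * real_fraction)
    for _ in range(generations):
        synthetic = [rng.gauss(mu, sigma) for _ in range(n - n_real)]
        real = [rng.gauss(0.0, 1.0) for _ in range(n_real)]
        mu, sigma = fit(synthetic + real)
    return mu, sigma


if __name__ == "__main__":
    _, pure_std = run(generations=500, n=20, real_fraction=0.0)
    _, mixed_std = run(generations=500, n=20, real_fraction=0.5)
    print("std after training only on synthetic data:", pure_std)
    print("std after training on half real data:", mixed_std)
```

With only synthetic data, each refit slightly underestimates the spread and the errors compound, so the fitted distribution narrows toward a single point. Mixing in fresh real samples each generation keeps pulling the estimate back toward the true spread. Real language models are far more complex, so this shows the mechanism only, not its magnitude.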
In search of other solutions
AI may require new approaches, as simply jamming more data into models may only go so far.
A group of Google DeepMind researchers may have proved the merits of another approach in January when the company announced AlphaGeometry, an AI system that can solve geometry problems at an Olympiad level.
In a supplemental paper, the researchers said AlphaGeometry used a "neuro-symbolic" approach, which meshes the strengths of other AI approaches, landing somewhere between data-hungry deep-learning models and rule-based logical reasoning. IBM's research group said it saw the approach as "a pathway to achieve artificial general intelligence."
What's more, in the case of AlphaGeometry, it was pretrained on entirely synthetic data.
The neuro-symbolic field of AI is relatively young, and it remains to be seen whether it will propel AI forward.
Given the pressures companies such as OpenAI, Google, and Microsoft face in turning AI hype into profits, expect them to try every solution possible to solve the data crisis.
"We're still basically going to be stuck here unless we take new approaches altogether," Marcus said.