Synthetic Data at Scale: Inside NVIDIA’s 340B Model
Writer trained a frontier-class model for $700,000. A comparable OpenAI model reportedly cost $4.6 million. The difference wasn’t a smarter team. It was synthetic data, and it’s about to change how every enterprise AI budget gets built.
If you’re building a domain-specific model this year, synthetic data is no longer the experimental option. It’s the default line item. NVIDIA has reportedly spent more than $320 million buying into it. Microsoft trained part of Phi-4 on 400 billion synthetic tokens. And enterprise buyers evaluating vendors like Mostly AI, Tonic.ai, and Hazy need a clear answer to one question: does this actually work, or does it just get you to a worse model faster?
The honest answer, after digging through the peer-reviewed research, the regulatory filings, and the vendor claims: both. Synthetic data is solving a real, measurable problem. It’s also creating a new one that most vendor pitch decks conveniently skip.
Why synthetic data exists now
Every frontier lab is running into the same wall. Epoch AI estimates there are roughly 300 trillion tokens of high-quality public text on the entire internet. GPT-4-class models already consume 6 to 13 trillion tokens per training run. Do that math a few more times and the public web runs dry, not in some distant future, but on a timeline that matters for product roadmaps being written right now.
At the same time, real data got more expensive to use, not just to collect. GDPR, the EU AI Act’s phased rollout through 2026 and 2027, HIPAA, and CCPA all raise the cost and legal exposure of training on real customer or patient records. Synthetic data promised a way around both problems at once: manufacture the training signal instead of mining it, and skip the privacy landmine while you’re at it.
That promise isn’t new, either. Statistician Donald Rubin proposed generating synthetic records to protect the confidentiality of census microdata back in 1993. What changed is generative modeling. GANs, then diffusion models, then LLMs, made it possible to produce synthetic text, images, and tabular data realistic enough to actually train on, at a scale that simply didn’t exist five years ago.
NVIDIA’s 340B bet
The clearest signal that synthetic data moved from side project to platform strategy came from NVIDIA. In June 2024, the company released Nemotron-4 340B, an open, commercially licensed model family built specifically to generate synthetic training data for other LLMs. It’s not a small side experiment. Nemotron-4 340B was pretrained on 9 trillion tokens, and over 98% of the data used in its own alignment process was synthetically generated, according to NVIDIA’s technical report.
Then, in March 2025, NVIDIA acquired Gretel, a synthetic-data startup with roughly 80 employees and about $67 million in prior VC funding. The deal was reported at more than $320 million, exceeding Gretel’s last valuation, according to Wired and corroborated by TechCrunch, SiliconANGLE, and Benzinga. Terms weren’t fully disclosed, but the size of the number tells you how NVIDIA is thinking. This isn’t a compliance tool bolted onto the GPU business. It’s infrastructure.
The real cost math
Here’s the number that should actually change how your team plans a training budget. Writer, an enterprise generative AI company, trained its Palmyra X 004 model almost entirely on synthetic data for a reported $700,000. A comparably sized OpenAI model was estimated at around $4.6 million, according to TechCrunch’s reporting in December 2024.
That’s not a rounding error. That’s the difference between a project a mid-size company can actually greenlight and one that only a frontier lab can afford. If you’re building domain-specific LLMs rather than chasing frontier-lab scale, that cost gap is the opportunity, but only where your team has real curation and filtering discipline. Cheap synthetic data without quality control just gets you to a bad model faster and cheaper, which isn’t actually a win.
Synthetic data models let teams rapidly build on human intuition about what data a model actually needs. But raw synthetic data can’t be trusted to avoid forgetful, homogeneous outputs unless it’s carefully filtered and paired with fresh real data. (Luca Soldaini, Senior Research Scientist, Allen Institute for AI (AI2), via TechCrunch)
The model collapse problem
Here’s the part the optimistic vendor pitch skips. In 2024, a team led by Ilia Shumailov published a peer-reviewed study in Nature establishing what’s now called model collapse: when generative models are trained recursively on their own or other models’ synthetic outputs, generation after generation, the original data distribution’s tails erode. Rare events and minority patterns disappear first. Outputs drift toward a narrower, more generic mean.
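The tail-erosion mechanism can be illustrated with a toy simulation. The sketch below is not the setup from the Nature study. It assumes a made-up categorical distribution (one common pattern plus twenty rare ones), re-estimates that distribution from a finite sample each generation, and feeds the estimate into the next round. The function names, sample size, and probabilities are all illustrative assumptions.

```python
import random
from collections import Counter


def resample_generation(probs, n_samples, rng):
    """Draw n_samples from probs and return the re-estimated frequencies."""
    categories = list(probs)
    weights = [probs[c] for c in categories]
    draws = rng.choices(categories, weights=weights, k=n_samples)
    counts = Counter(draws)
    # Categories that were never drawn drop out of the estimate entirely.
    return {c: counts[c] / n_samples for c in counts}


def simulate_collapse(probs, generations, n_samples, seed=0):
    """Train each 'generation' only on samples from the previous estimate."""
    rng = random.Random(seed)
    current = dict(probs)
    history = [current]
    for _ in range(generations):
        current = resample_generation(current, n_samples, rng)
        history.append(current)
    return history


# Hypothetical long-tailed distribution: one common pattern, twenty rare ones.
START = {"common": 0.9, **{f"rare_{i}": 0.005 for i in range(20)}}

if __name__ == "__main__":
    for gen, dist in enumerate(simulate_collapse(START, generations=30, n_samples=200)):
        print(f"generation {gen:2d}: {len(dist)} categories still represented")
```

In this toy setup, a category that draws zero samples gets probability zero in the next generation and can never come back, so the set of surviving patterns can only shrink or hold steady. Mixing fresh real samples into each generation is what breaks that one-way behavior.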
This isn’t theoretical anymore. A February 2026 Communications of the ACM piece documented model collapse showing up in production systems already: background-removal tools failing on specific hair textures, image generators producing increasingly homogeneous outputs. These are shipped products, not lab experiments.
Synthetic data’s value lies in its statistical similarity to real data. Recent advances in generative modeling are what made large-scale, realistic synthetic data generation newly possible at a fidelity that simply didn’t exist before. (Kalyan Veeramachaneni, Principal Research Scientist, MIT LIDS; co-founder, DataCebo, via MIT News)
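One common way to put a number on "statistical similarity" for a single numeric column is the two-sample Kolmogorov-Smirnov statistic: the largest gap between the empirical cumulative distributions of the real and synthetic values. The sketch below is a minimal pure-Python version, not something Veeramachaneni or DataCebo prescribes. The function name and the sample values are hypothetical.

```python
def ks_statistic(real, synthetic):
    """Largest gap between the two empirical CDFs (0 = identical, 1 = disjoint)."""
    if not real or not synthetic:
        raise ValueError("both samples must be non-empty")
    a, b = sorted(real), sorted(synthetic)
    n, m = len(a), len(b)
    i = j = 0
    gap = 0.0
    while i < n and j < m:
        x = min(a[i], b[j])
        # Advance both pointers past every value <= x, then compare the CDFs at x.
        while i < n and a[i] <= x:
            i += 1
        while j < m and b[j] <= x:
            j += 1
        gap = max(gap, abs(i / n - j / m))
    return gap


if __name__ == "__main__":
    # Hypothetical transaction amounts: real records vs. a synthetic sample.
    real_amounts = [12.0, 15.5, 18.0, 22.0, 400.0]
    synthetic_amounts = [11.0, 16.0, 17.5, 21.0, 23.0]
    print(ks_statistic(real_amounts, synthetic_amounts))
```

A value near 0 means the two columns are distributed alike, and a value of 1 means they don't overlap at all. The check works one column at a time, so it says nothing about correlations between columns or about rare combinations of values, which is where collapse tends to show up first.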
There’s also a sharper version of this critique worth sitting with. Fraud detection is one of the most-cited synthetic-data success stories, but real fraud represents under 0.1% of transactions. That means synthetic fraud generation is filling in for genuinely rare edge cases that are inherently hard to validate against ground truth. It’s not simply “more of the same data, cheaper.” It’s manufacturing your own answer key for the exact patterns you have the least real evidence about.
AI companies may be aware of unresolved problems with synthetic data and model collapse, but they have strong financial incentive to downplay these risks so as not to spook investors during the AI boom. (Jathan Sadowski, researcher on AI political economy, via LGT)
What regulators are already doing
The biggest live risk for regulated-industry teams isn’t technical. It’s the assumption that synthetic data is automatically exempt from privacy law. It isn’t.
- EDPB Opinion 28/2024: The European Data Protection Board laid out a three-step legality test for whether synthetic data actually qualifies as anonymous under GDPR. The real data used to generate it still needs a lawful basis.
- NIST SP 800-226: Provides guidance on evaluating differential privacy claims, directly relevant to any vendor promising that synthetic data is inherently private.
- UK FCA Synthetic Data Expert Group: Actively mapping governance expectations onto existing model-risk policy for financial services.
If your compliance team’s current stance is “it’s synthetic, so it’s fine,” that stance is already out of date.
How big is this, really
Ask five research firms how big the synthetic data market is, and you’ll get five different answers for the exact same year. That spread matters, because a lot of vendor sales decks lean on the biggest number available.
| Firm | 2026 Estimate | 2030s Projection | CAGR |
|---|---|---|---|
| Precedence Research | $791.3M | $6.9B by 2034 | 31.1% |
| Mordor Intelligence | $710M | $3.67B by 2031 | 38.96% |
| Grand View Research | N/A (2023 baseline: $218.4M) | $1.79B by 2030 | 35.3% |
The gap exists because there’s no standardized definition of what counts as “the synthetic data market.” Some estimates count only dedicated vendors. Others fold in hyperscaler tooling revenue. Treat any single “the market will be worth $X billion” headline with a healthy dose of skepticism unless it names its methodology.
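As a sanity check, each row in the table can be reproduced with the compound-growth formula, end = start × (1 + CAGR)^years. The sketch below assumes each firm's CAGR runs from the baseline year shown in the table to its projection year. That is an assumption, since the firms' exact forecast windows aren't given here.

```python
def project(start_musd, cagr, years):
    """Compound a starting value (in $M) forward at a constant annual growth rate."""
    return start_musd * (1 + cagr) ** years


# (firm, baseline $M, baseline year, CAGR, projection year, stated projection $M)
ROWS = [
    ("Precedence Research", 791.3, 2026, 0.311, 2034, 6900.0),
    ("Mordor Intelligence", 710.0, 2026, 0.3896, 2031, 3670.0),
    ("Grand View Research", 218.4, 2023, 0.353, 2030, 1790.0),
]


def check_rows(rows):
    """Return (firm, projected $M, relative gap vs. the stated figure) per row."""
    results = []
    for firm, start, start_year, cagr, end_year, stated in rows:
        projected = project(start, cagr, end_year - start_year)
        rel_gap = abs(projected - stated) / stated
        results.append((firm, projected, rel_gap))
    return results


if __name__ == "__main__":
    for firm, projected, rel_gap in check_rows(ROWS):
        print(f"{firm}: ${projected / 1000:.2f}B projected, {rel_gap:.1%} off the stated figure")
```

Under that assumption, every projection lands within a couple of percent of the stated figure, so each row is internally consistent. The disagreement between firms comes from different baselines, end years, and market definitions, not from the arithmetic.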
Gartner’s frequently cited projection that 75% of businesses will use generative AI to create synthetic customer data by 2026 is also worth flagging clearly: it’s an analyst prediction, not a measured outcome. Decisions should be based on your own pilot data quality, not market-growth headlines.
What enterprise teams should do now
If you’re a CTO or data engineering lead evaluating this space, the practical split is between two very different use cases:
- Synthetic data for privacy-safe testing and data sharing. Mature, well-understood, low risk. This is the use case that’s actually been battle-tested for years.
- Synthetic data as a primary model training source. Higher risk, actively debated, and prone to collapse if used recursively without real-data anchoring. This is where the Writer cost-savings story lives, and also where the CACM production failures live.
Our read: the teams getting real value right now are the ones treating synthetic data as a supplement to real data, not a replacement for it, and the ones running their compliance check before their procurement check, not after.
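One way to make "supplement, not replacement" concrete is to cap the synthetic share of a training set. The sketch below is a hypothetical helper, not a method taken from Writer, NVIDIA, or any vendor named above. The function name, the default 50% cap, and the record format are all assumptions for illustration.

```python
import random


def build_training_mix(real, synthetic, max_synthetic_fraction=0.5, seed=0):
    """Combine all real records with at most a capped share of synthetic ones."""
    if not 0 <= max_synthetic_fraction < 1:
        raise ValueError("max_synthetic_fraction must be in [0, 1)")
    # Solve s / (r + s) <= f for s: the most synthetic rows the cap allows.
    f = max_synthetic_fraction
    cap = int(f * len(real) / (1 - f))
    rng = random.Random(seed)
    chosen = rng.sample(list(synthetic), min(cap, len(synthetic)))
    mix = list(real) + chosen
    rng.shuffle(mix)
    return mix


if __name__ == "__main__":
    # Hypothetical records tagged by origin so the final share can be audited.
    real_rows = [("real", i) for i in range(100)]
    synthetic_rows = [("synthetic", i) for i in range(1000)]
    mix = build_training_mix(real_rows, synthetic_rows, max_synthetic_fraction=0.5)
    share = sum(1 for origin, _ in mix if origin == "synthetic") / len(mix)
    print(len(mix), share)
```

The cap is driven by how much real data you have rather than by how much synthetic data you can generate, which keeps real records as the anchor. The right fraction is an empirical question for your own pilot, and 0.5 here is only a placeholder.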
Frequently Asked Questions
What is synthetic data in AI?
Synthetic data is artificial information generated by algorithms or AI models rather than collected from real-world events. It’s built to mimic the statistical properties of real data without exposing personal or sensitive records, and it’s used for AI training, testing, and privacy-safe data sharing.
Is synthetic data as good as real data?
It depends on the use case. Synthetic data can match real-data performance for well-understood patterns in tabular records, but it is harder to validate for rare events such as fraud, and it can degrade model quality through model collapse when used recursively across generations without real-data anchoring.
Does synthetic data solve AI privacy problems?
Only partially. The European Data Protection Board has clarified that synthetic data doesn’t automatically qualify as anonymous under GDPR. A legality test still applies, and the original real data used to generate it still needs a lawful basis.
How big is the synthetic data market?
Estimates vary by research firm. The two firms in the table above that publish a 2026 figure put it at roughly $710 million to $790 million, and long-range projections run from about $1.8 billion by 2030 to $6.9 billion by 2034, at 31 to 39 percent CAGR.
What is model collapse in AI?
Model collapse is the progressive degradation of an AI model’s outputs when it’s trained recursively on AI-generated data instead of real-world data. It causes loss of rare patterns and increasingly generic, homogeneous results over successive generations.
Where this goes next
What’s clear now that wasn’t clear a year ago: synthetic data isn’t a shortcut around the data wall, it’s a different tool with its own failure mode. NVIDIA’s infrastructure bet, Writer’s cost numbers, and the CACM production failures are all real, all documented, and all pointing in different directions at once.
Three things worth watching over the next 6 to 18 months: whether the 2025 rebuttal to Shumailov’s collapse findings holds up under further scrutiny, whether the EDPB’s GDPR test becomes the template other regulators copy, and whether the market-size estimates start converging as vendors standardize what actually counts as “synthetic data” revenue. Regulatory scrutiny of AI training data isn’t slowing down either. Our recent coverage of the ChatGPT Canada privacy ruling shows what happens when real-data training practices collide with privacy law. Synthetic data is one proposed way around that collision, though regulators are already scrutinizing it too.
