How Often Should You Retrain an ML Model? Google’s Data
A 450,000-model study out of Google and UC Berkeley just answered a question most MLOps teams have been guessing at for years, and the answer has almost nothing to do with drift schedules.
Somewhere inside Google, an ML pipeline retrained itself nine times before lunch and shipped exactly one of those runs to production. That is not a glitch in someone’s dashboard. It is the production reality behind a question every CTO funding a machine learning team eventually asks: how often should you retrain a machine learning model? A study from Google and UC Berkeley researchers, built on provenance data covering 3,000 production pipelines and more than 450,000 trained models, finally answers it with real numbers instead of conventional wisdom. The answer is stranger, and more useful, than picking a calendar cadence or chasing drift alerts.
Most retraining advice circulating online treats cadence like a calendar problem: pick weekly, monthly, or quarterly, and move on. The Google data suggests the real bottleneck isn’t how often models get retrained. It’s how rarely those runs actually matter once they’re finished.
What 450,000 Models Actually Told Researchers
The study behind these numbers comes from Doris Xin, Hui Miao, Aditya Parameswaran, and Neoklis Polyzotis, who analyzed the full provenance graph of production ML pipelines inside Google over a four-month window: 3,000 pipelines, 450,000-plus trained models, every training run and every push tracked end to end. It’s one of the largest empirical looks anyone has published at what production ML actually does, as opposed to what teams assume it does.
The numbers don’t match the “quarterly refresh” mental model most engineering orgs still budget around.
| What the data measured | What it found |
|---|---|
| Average retrains per pipeline | Roughly 7 times per day |
| Pipelines retraining more than 100 times a day | 1.12% of all pipelines |
| Retrained models that actually get deployed | About 1 in 4 |
| Mean model training time | 168 hours |
| Mean gap between deployed models | Roughly 40 hours |
Source: Xin, Miao, Parameswaran & Polyzotis, “Production Machine Learning Pipelines: Empirical Analysis and Optimization Opportunities,” arXiv:2103.16007.
If your team is retraining far less than seven times a day, that’s not necessarily a problem. Google’s pipelines include extremely high-velocity systems (ad ranking, search relevance) that skew the average up. But the second number matters regardless of your industry: only about one in four of those models ever ships. The rest, roughly 75 percent, get trained and quietly discarded.
The Habit Nobody Budgets For
Most retraining compute buys nothing
For every four models retrained inside Google’s pipelines, only one reached production. The other three consumed GPU hours, engineering attention, and CI capacity, then went nowhere. At a mean training time of 168 hours per model, that’s not a rounding error. It’s a budget line most platform teams don’t separate out, because most dashboards only track “models trained,” not “models trained and discarded.”
This is the number that should reframe the cadence conversation. The question isn’t “are we doing this often enough.” It’s “what happens to the three out of four runs that don’t ship, and why.”
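To put a rough size on that budget line, here is a back-of-the-envelope sketch in Python. It assumes, purely for illustration, that discarded runs take the same mean 168 hours as deployed ones; the study's headline figures don't confirm that, so treat the output as an order-of-magnitude estimate rather than a finding.

```python
# Back-of-the-envelope estimate of training hours spent per deployed model.
# Assumption (not from the study): discarded runs take the same mean time
# as deployed runs.

MEAN_TRAINING_HOURS = 168  # mean training time quoted in the table above
PUSH_RATE = 1 / 4          # about one in four retrained models is deployed


def hours_per_deployed_model(mean_hours: float, push_rate: float) -> float:
    """Total training hours spent, on average, for each model that ships."""
    if not 0 < push_rate <= 1:
        raise ValueError("push_rate must be in (0, 1]")
    return mean_hours / push_rate


def discarded_hours_per_deployed_model(mean_hours: float, push_rate: float) -> float:
    """Training hours spent on runs that never ship, per deployed model."""
    return hours_per_deployed_model(mean_hours, push_rate) - mean_hours


total = hours_per_deployed_model(MEAN_TRAINING_HOURS, PUSH_RATE)
discarded = discarded_hours_per_deployed_model(MEAN_TRAINING_HOURS, PUSH_RATE)
print(f"{total:.0f} training hours per deployed model, {discarded:.0f} of them discarded")
```

Swap in your own mean training time and push rate to see what each shipped model really costs in training hours.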
Why Drift Might Be the Wrong Villain
The instinctive answer is drift: the data moved, the model didn’t keep up, so it got rejected. The Google researchers tested that instinct directly, and it didn’t hold up.
They compared models that got pushed to production against models that didn’t, looking specifically at input-data similarity and code-change rates between the two groups. If drift or code changes were driving the decision to deploy, you’d expect a clear gap. There wasn’t one: data similarity scored 0.101 for pushed models versus 0.099 for unpushed ones, and code-match rates came in at 84.6% versus 83.8%. Statistically, that’s noise, not signal.
So if drift isn’t the reason most runs die quietly, what is? The researchers point to pipeline-level inefficiency and push-rate throttling instead, meaning the bottleneck sits in process and infrastructure, not in the data the model is learning from.
“Or another aspect is model drift. Things change over time.” Chip Huyen, ML engineer and author, TechTarget
Huyen’s point is correct and worth holding alongside the Google data rather than against it: drift is real, and it’s common. A 2022 study in Scientific Reports by Vela et al. tested 128 model-dataset combinations across healthcare, weather, airport traffic, and finance, and found measurable temporal degradation in 91% of pairs. That figure is genuine and peer-reviewed (read more in our breakdown of the underlying drift mechanics), but it measures whether degradation happens at all, not whether degradation is what’s killing your discarded retrains specifically. Those are two different claims, and conflating them is how “drift” becomes the catch-all explanation for problems that are really about pipeline design.
Our read: if your team explains every discarded retrain as “drift,” you’re probably explaining away a pipeline problem, not a data problem. Researchers Shreya Shankar, Rolando Garcia, Joseph Hellerstein, and Aditya Parameswaran reached a related conclusion from a different angle: in 18 interviews with practicing ML engineers at companies running chatbots, autonomous vehicles, and finance systems, they found that engineers consistently treat production behavior as something that can’t be fully known until the model is live, which is precisely why monitoring infrastructure, not retrain frequency, ends up being the deciding factor in whether a model ships.
There’s a second failure mode hiding in here too: alert fatigue. Statistical drift tests like the Population Stability Index and the Kolmogorov-Smirnov test routinely flag distribution shifts that never translate into a measurable performance drop, a pattern confirmed across multiple independent analyses (arXiv:2003.12808). Teams that don’t tune thresholds to actual business impact eventually start ignoring alerts altogether, real ones included. That’s arguably a bigger operational risk than drift itself, and it’s a pipeline-design problem too, not a data problem.
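One way to picture threshold tuning is a sketch that computes the Population Stability Index and then escalates only when measured accuracy has actually dropped. Everything here is an assumption for illustration: the equal-width binning, the 0.2 PSI cutoff (a common rule of thumb, not a standard), the 0.02 accuracy tolerance, and the function names `psi` and `triage_alert`.

```python
import math
from typing import Sequence


def psi(expected: Sequence[float], actual: Sequence[float], bins: int = 10) -> float:
    """Population Stability Index between a baseline and a live sample.

    Bin edges are equal-width over the baseline's range; live values outside
    that range are clamped into the first or last bin.
    """
    lo, hi = min(expected), max(expected)
    if hi == lo:
        raise ValueError("baseline has no spread")
    width = (hi - lo) / bins

    def proportions(values: Sequence[float]) -> list:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def triage_alert(psi_value: float, baseline_accuracy: float, live_accuracy: float,
                 psi_threshold: float = 0.2, max_accuracy_drop: float = 0.02) -> str:
    """Escalate on measured degradation; only log a shift with no impact."""
    degraded = (baseline_accuracy - live_accuracy) > max_accuracy_drop
    shifted = psi_value > psi_threshold
    if degraded:
        return "page"  # performance dropped, with or without an input shift
    if shifted:
        return "log"   # inputs moved, but accuracy held: record, don't page
    return "ok"


baseline = [float(v) for v in range(100)]
live = [v + 30.0 for v in baseline]  # made-up shifted sample

score = psi(baseline, live)
print(f"PSI = {score:.2f} -> {triage_alert(score, 0.90, 0.895)}")
```

The design choice is that a distribution shift alone never pages anyone. This only works if labeled outcomes arrive soon enough to measure live accuracy, which is not true for every system.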
This lines up with broader patterns we’ve tracked in MLOps pipeline failures: the infrastructure layer, not the model layer, is where most production ML actually breaks.
Building a Retrain Schedule That Matches Production, Not a Calendar
If push rate, rather than retrain frequency, is the real lever, instrument your retraining policy around it. Four steps, in order:
1. Baseline your own push rate first
Before touching your retrain schedule, measure how many of your team’s runs actually reach production today. That number, not your calendar, is your true starting point.
2. Track “retrained” and “deployed” as separate metrics
Most teams report retrain count as a proxy for ML activity. Splitting it into trained versus deployed exposes exactly the gap Google’s data found, and tells you where compute is leaking.
3. Instrument the deploy decision itself
Log the reason every retrained model did or didn’t ship: performance gate, manual review, throttling, rollback. That log tells you more about your real bottleneck than a drift dashboard ever will.
4. Use the 40-hour benchmark as a sanity check, not a target
Google’s pipelines averaged roughly 40 hours between deployed models. If your gap looks wildly different in either direction, investigate that gap before you touch the retrain calendar at all.
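The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not Google's tooling: the `TrainingRun` record, its field names, the reason labels, and the sample data are all hypothetical stand-ins for whatever your pipeline metadata store actually records.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Sequence


@dataclass
class TrainingRun:
    """One retrain. Every field name here is a hypothetical placeholder."""
    model_id: str
    trained_at: datetime
    deployed_at: Optional[datetime]  # None means the model never shipped
    decision_reason: str             # e.g. "passed_gate", "throttled"


def push_rate(runs: Sequence[TrainingRun]) -> float:
    """Steps 1 and 2: share of trained models that reached production."""
    if not runs:
        raise ValueError("no training runs to measure")
    deployed = sum(1 for r in runs if r.deployed_at is not None)
    return deployed / len(runs)


def discard_reasons(runs: Sequence[TrainingRun]) -> Counter:
    """Step 3: why the runs that did not ship were stopped."""
    return Counter(r.decision_reason for r in runs if r.deployed_at is None)


def mean_hours_between_deploys(runs: Sequence[TrainingRun]) -> Optional[float]:
    """Step 4: mean gap between consecutive deployments, in hours."""
    times = sorted(r.deployed_at for r in runs if r.deployed_at is not None)
    if len(times) < 2:
        return None
    gaps = [(b - a).total_seconds() / 3600 for a, b in zip(times, times[1:])]
    return sum(gaps) / len(gaps)


# Made-up sample data, for illustration only.
runs = [
    TrainingRun("m1", datetime(2024, 1, 1, 0), datetime(2024, 1, 1, 2), "passed_gate"),
    TrainingRun("m2", datetime(2024, 1, 1, 6), None, "failed_gate"),
    TrainingRun("m3", datetime(2024, 1, 1, 12), None, "throttled"),
    TrainingRun("m4", datetime(2024, 1, 1, 18), None, "throttled"),
    TrainingRun("m5", datetime(2024, 1, 2, 12), datetime(2024, 1, 2, 14), "passed_gate"),
]

print(f"push rate: {push_rate(runs):.0%}")
print(f"discard reasons: {dict(discard_reasons(runs))}")
print(f"mean hours between deploys: {mean_hours_between_deploys(runs)}")
```

Keeping `deployed_at` separate from `trained_at` is what makes the trained-versus-deployed split visible. A single "last retrained" timestamp hides it.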
Decisions like these tend to fall on whoever owns ML infrastructure, a role that, per our reporting on the enterprise AI skills gap, many organizations still haven’t clearly assigned.
When Nobody Catches It in Time
Process gaps like these aren’t abstract. Zillow’s iBuying arm, Zillow Offers, shut down in November 2021 after its pricing models systematically overvalued homes the company then had to sell at a loss. The numbers, from Zillow’s own Q3 2021 SEC filing, were stark: a $304 million quarterly operating loss, $175 to $230 million in additional impairment costs, and a roughly 25 percent workforce reduction.
“We were unintentionally purchasing homes at higher prices.” Rich Barton, Co-founder & CEO, Zillow Group, GeekWire
We’ve covered the full Zillow case study in detail elsewhere, so we won’t retell it here. The relevant point for this article: Zillow’s failure wasn’t primarily a story about retraining too rarely. It was a story about a pricing signal that kept degrading without anyone instrumenting the gap between “the model said X” and “X turned out to be wrong,” which is the same blind spot the Google study found at a much smaller, less catastrophic scale across thousands of unremarkable pipelines.
Frequently Asked Questions
What is model drift in machine learning?
Model drift is the gradual decline in a deployed model’s predictive accuracy as real-world data or relationships diverge from training conditions. It shows up as data drift, where the inputs change, or concept drift, where the relationship between inputs and outputs changes.
How often should you retrain a machine learning model?
There is no universal schedule. Production data from a 450,000-model Google study shows pipelines retrain roughly seven times a day on average, but only about one in four of those retrained models ever gets deployed, so cadence matters less than your deploy-decision process.
What is the difference between data drift and concept drift?
Data drift means the distribution of input features shifts while the relationship between inputs and outputs stays the same. Concept drift means that relationship itself breaks, so an input that looked normal now warrants a different correct answer, which makes it harder to catch.
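A toy Python sketch makes the distinction concrete. It assumes a frozen model that learned y = 2x perfectly, which real models rarely do, so the clean zero error under data drift is an artifact of the toy. In practice, shifted inputs often hurt accuracy too, because models extrapolate poorly. All names and numbers are made up.

```python
from statistics import mean


def model(x: float) -> float:
    """Frozen model that learned y = 2x from the training data."""
    return 2.0 * x


def mean_abs_error(inputs, labels) -> float:
    """Average absolute gap between predictions and true labels."""
    return mean(abs(model(x) - y) for x, y in zip(inputs, labels))


train_x = [1.0, 2.0, 3.0, 4.0]

# Data drift: the inputs move, but the true relationship is still y = 2x.
data_drift_x = [11.0, 12.0, 13.0, 14.0]
data_drift_y = [2.0 * x for x in data_drift_x]

# Concept drift: the inputs look the same, but the truth is now y = 3x.
concept_drift_x = list(train_x)
concept_drift_y = [3.0 * x for x in concept_drift_x]

print("data drift:    input mean", mean(data_drift_x),
      "error", mean_abs_error(data_drift_x, data_drift_y))
print("concept drift: input mean", mean(concept_drift_x),
      "error", mean_abs_error(concept_drift_x, concept_drift_y))
```

In the concept-drift case the input mean is identical to the training data, so a monitor that only watches input distributions sees nothing. The error only becomes visible once labeled outcomes come in.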
Does model drift cause most ML deployment failures?
Not necessarily. A Google and UC Berkeley study of 450,000 production models found no meaningful difference in data similarity or code changes between models that got deployed and models that did not, suggesting pipeline inefficiency, not drift, explains most discarded retrains.
How do you detect model drift?
Compare live production data against a training baseline using statistical tests like the Population Stability Index or the Kolmogorov-Smirnov test, paired with direct performance tracking against labeled outcomes. Watch for alert fatigue: poorly tuned thresholds flag shifts that never affect real accuracy.
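As a rough sketch of the statistical half of that check, the two-sample Kolmogorov-Smirnov statistic can be computed with the Python standard library alone. The function names and sample data are hypothetical, and the critical value uses the usual large-sample approximation, so treat it as approximate for small samples.

```python
import math
from bisect import bisect_right
from typing import Sequence


def ks_statistic(baseline: Sequence[float], live: Sequence[float]) -> float:
    """Two-sample KS statistic: the largest gap between the empirical CDFs."""
    a, b = sorted(baseline), sorted(live)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    d = 0.0
    for v in a + b:  # the empirical CDFs only change at observed values
        cdf_a = bisect_right(a, v) / len(a)
        cdf_b = bisect_right(b, v) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def ks_critical_value(n: int, m: int, alpha: float = 0.05) -> float:
    """Large-sample approximation of the rejection threshold at level alpha."""
    c_alpha = math.sqrt(-math.log(alpha / 2) / 2)
    return c_alpha * math.sqrt((n + m) / (n * m))


# Made-up samples: the live feature is the baseline shifted upward by 30.
baseline = [float(v) for v in range(100)]
live = [v + 30.0 for v in baseline]

d = ks_statistic(baseline, live)
threshold = ks_critical_value(len(baseline), len(live))
print(f"D = {d:.3f}, threshold = {threshold:.3f}, shifted = {d > threshold}")
```

A statistic above the threshold says only that the input distribution moved. Whether it matters still depends on tracking accuracy against labeled outcomes.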
The Bottom Line
The number worth carrying out of this article isn’t 91 percent (how often models drift) or even seven times a day (how often Google’s pipelines retrain). It’s one in four: how often a retrain actually earns its compute. Most conversations skip straight from “is our model degrading” to “how often should we retrain,” without ever asking whether that cadence was the bottleneck in the first place.
Over the next 6 to 18 months, expect this question to get more urgent, not less. Gartner forecasts worldwide AI spending will hit $2.59 trillion in 2026, up 47% year over year, and the same firm predicts that 40% of organizations deploying AI will adopt dedicated observability tooling by 2028. Budget is arriving faster than judgment about where to point it. The teams that benchmark their own push rate now, before the next wave of tooling spend, will be the ones who can tell the difference between buying real visibility and buying a more expensive version of the same blind spot.
It’s also worth keeping this separate from the broader AI-project failure narrative. The 70 to 95 percent failure-rate figures that get cited from MIT, Gartner, and RAND research on AI ROI measure pilot-to-production failure broadly, for reasons that often have nothing to do with retraining or drift. Treating them as the same problem inflates the apparent size of the drift issue and obscures the much narrower, much more fixable pipeline question this study actually answers.
Three things to watch from here: whether more vendors start publishing push-rate benchmarks the way this Google study did, whether the EU AI Act’s risk-monitoring provisions start requiring documented retrain-versus-deploy decisions rather than just drift scores, and whether the same questions get applied to hosted LLM and agent pipelines, where there’s often no training data to inspect at all. That last one is where this entire conversation is heading next.
