Google, Microsoft & xAI Will Hand the US Government Unreleased AI Models for National Security Testing
The Commerce Department’s CAISI has struck voluntary agreements with Google DeepMind, Microsoft, and Elon Musk’s xAI, giving federal analysts early access to frontier models before they reach the public. Here’s what the tests actually cover, why all three companies said yes, and what the deals can’t do.
On May 5, 2026, the US Department of Commerce’s Center for AI Standards and Innovation announced new pre-deployment testing agreements with Google DeepMind, Microsoft, and xAI. The pacts give government evaluators access to powerful AI models that haven’t shipped yet, with a specific mandate to probe them for national security risks: cyberattacks, biosecurity vulnerabilities, and capabilities that could compromise critical infrastructure. All three companies agreed voluntarily. No law required it.
This matters for a simple reason. These aren’t small players submitting to niche academic benchmarks. Google DeepMind, Microsoft, and xAI collectively represent a dominant share of the frontier AI market. When they open their pre-release pipelines to federal scrutiny, the shape of AI oversight in America shifts. Quietly, but it shifts.
What is CAISI? The Center for AI Standards and Innovation sits inside the National Institute of Standards and Technology (NIST) at the Commerce Department. Formerly known as the AI Safety Institute (AISI), it was renamed in June 2025 under Commerce Secretary Howard Lutnick as part of the Trump administration’s America’s AI Action Plan. CAISI leads frontier AI evaluations for the federal government, with a specific focus on national security implications.
A Pact Five Years in the Making, Signed in Five Minutes of News
The agreements announced Monday build on earlier voluntary pacts that the Biden administration struck with OpenAI and Anthropic, both of which were renegotiated in 2025 to conform with the Trump administration’s priorities. With Monday’s additions, all five major US frontier AI labs are now covered under some form of pre-deployment review. That’s not a coincidence. It’s the result of deliberate White House outreach: senior officials met with executives from Anthropic, Google, and OpenAI in the days before the announcement to align expectations.
CAISI director Chris Fall framed the significance plainly.
“Independent, rigorous measurement science is essential to understanding frontier AI and its national security implications. These expanded industry collaborations help us scale our work in the public interest at a critical moment.”
Chris Fall, Director, Center for AI Standards and Innovation (CAISI), NIST — NIST Press Release, May 5, 2026
The phrase “at a critical moment” isn’t rhetoric. CAISI has completed more than 40 AI model evaluations as of May 5, 2026. Several of those evaluations involved unreleased state-of-the-art models tested in classified environments. The pace of that work is accelerating precisely because frontier AI capabilities are accelerating.
What CAISI Actually Tests, and How
The testing methodology centers on red-teaming: structured adversarial probing designed to expose what a model can do when pushed, manipulated, or stripped of its safety guardrails. CAISI evaluates models both before and after deployment. The pre-release evaluations, which these new agreements specifically enable, are the more sensitive category.
In practice, participating labs provide access to models “with reduced or removed safeguards” for raw capability assessment. That phrasing comes directly from NIST documentation and it’s telling. Evaluators aren’t testing the polished, safety-tuned product that the public will use. They’re testing the base model underneath it, looking for capabilities that fine-tuning might mask but not eliminate.
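NIST hasn't published the harness behind these evaluations, so the sketch below is purely illustrative. It shows, in Python, the general shape of a "reduced or removed safeguards" comparison: `Model`, `refused`, and `refusal_gap` are hypothetical names invented here, `public_model` stands in for the safety-tuned product and `raw_model` for the safeguard-reduced variant a participating lab might hand evaluators, and the refusal check is a toy placeholder for whatever graders CAISI actually uses.

```python
from typing import Callable

# Hypothetical inference handles; each takes a prompt and returns model text.
# `public_model` is the safety-tuned product; `raw_model` stands in for the
# safeguard-reduced variant a participating lab provides to evaluators.
Model = Callable[[str], str]

def refused(output: str) -> bool:
    """Toy stand-in for the graders a real evaluation would use to judge refusals."""
    return output.strip().lower().startswith(("i can't", "i cannot", "i won't"))

def refusal_gap(public_model: Model, raw_model: Model, red_team_prompts: list[str]) -> float:
    """How much more often the public model refuses than the safeguard-reduced one.

    A large gap suggests the underlying capability exists and is merely masked
    by post-training safeguards -- the distinction that testing models 'with
    reduced or removed safeguards' is meant to surface.
    """
    n = len(red_team_prompts)
    public_refusals = sum(refused(public_model(p)) for p in red_team_prompts) / n
    raw_refusals = sum(refused(raw_model(p)) for p in red_team_prompts) / n
    return public_refusals - raw_refusals
```

The evaluations span four broad risk domains: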
- Cybersecurity: Models face simulated capture-the-flag (CTF) challenges from platforms like pwn.college, measuring their ability to autonomously exploit vulnerabilities.
- Biosecurity: Evaluators probe whether models can assist in synthesizing dangerous pathogens or provide meaningful uplift to someone attempting to do so.
- General capability: Standard benchmarks like MMLU-Pro gauge overall reasoning and knowledge depth, providing a baseline for comparing models across labs.
- Critical infrastructure: Tests probe whether models could assist in attacks against energy grids, financial systems, or government networks.
CAISI’s published evaluation of DeepSeek’s models offers the clearest public window into this methodology. DeepSeek V3.1 solved 28 percent of 577 CTF cyber tasks drawn from the pwn.college benchmark, according to CAISI’s September 2025 technical report. On the MMLU-Pro knowledge benchmark, it scored 89 percent versus 90 percent for the top US reference model. The numbers sound benign. They’re not. A model that successfully handles even 28 percent of advanced CTF challenges represents genuine cyber uplift for a malicious actor with no coding background.
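CAISI's evaluation harness isn't public, but the arithmetic behind a summary figure like "28 percent of 577 tasks" is simple to reconstruct. The Python sketch below uses entirely hypothetical per-task outcomes (the `TaskResult` records and pass pattern are invented for illustration) to show how individual pass/fail results roll up into the solve rates quoted in reports like the DeepSeek evaluation.

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    """Outcome of a single evaluation task, e.g. one pwn.college CTF challenge."""
    task_id: str
    category: str   # e.g. "cyber", "bio", "knowledge"
    solved: bool

def solve_rate(results: list[TaskResult], category: str) -> float:
    """Fraction of tasks in a category the model solved -- the kind of figure
    reported as '28 percent of 577 CTF tasks'."""
    in_category = [r for r in results if r.category == category]
    return sum(r.solved for r in in_category) / len(in_category) if in_category else 0.0

# Hypothetical per-task outcomes sized to match the published figure:
# 577 cyber tasks with roughly 28 percent solved.
results = [
    TaskResult(task_id=f"ctf-{i:03d}", category="cyber", solved=(i % 25 < 7))
    for i in range(577)
]

print(f"cyber solve rate: {solve_rate(results, 'cyber'):.0%}")  # -> 28%
```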
Why “Measurement Science” Is the Key Phrase
CAISI officials consistently use the term “measurement science” rather than “safety oversight.” That word choice is deliberate. The agency’s mandate isn’t to block releases or impose requirements. It’s to build empirical baselines that let the government understand, and eventually anticipate, what frontier AI can do. Think of it less like FDA drug approval and more like the FAA collecting flight data from airlines before writing the rules.
Why Google, Microsoft, and xAI Said Yes
Voluntarily handing pre-release models to the government is not the obvious choice for companies racing to ship. So why did Google DeepMind, Microsoft, and xAI all agree?
The business logic is clearer than it looks. Companies that participate get a seat at the table when testing frameworks are designed. That matters enormously: whoever shapes the benchmarks shapes what “safe” means in regulatory conversations. An AI lab that helps write the evaluation criteria for frontier models is in a very different position from one that waits for external standards to be imposed on it.
xAI’s position carries an additional dimension. Elon Musk’s relationship with the Trump administration creates alignment incentives that don’t apply to Google or Microsoft. Whether the agreement reflects genuine safety commitment, political calculation, or both, xAI now sits in the same oversight framework as the labs it routinely criticizes publicly.
The competitive angle: By joining the framework, Google DeepMind, Microsoft, and xAI can position compliance as a differentiator. Labs outside the agreement, whether foreign developers or smaller domestic players, face implicit comparison to an emerging US standard. That’s a reputational and potentially regulatory moat.
Microsoft’s participation also connects directly to its cloud business. Azure hosts a significant share of the AI workloads running in the US defense and intelligence community. Demonstrating pre-deployment cooperation with CAISI reinforces that positioning, particularly as government procurement decisions increasingly weigh AI safety posture alongside raw performance metrics.
Inside the TRAINS Taskforce That Reviews the Results
Once CAISI completes an evaluation, the findings flow into the TRAINS Taskforce, an interagency body established in November 2024. As of May 2026, TRAINS includes experts from more than ten federal agencies. The roster spans the Department of Defense, Department of Energy, Department of Homeland Security, the NSA, and the NIH, among others. Each brings a different threat lens: the NSA cares about cyber; NIH cares about biosecurity; DHS cares about infrastructure.
The interagency structure exists because AI risk doesn’t fit neatly into any one agency’s portfolio. A model that can assist in cyberattacks is a military problem, an intelligence problem, and a civilian infrastructure problem simultaneously. TRAINS attempts to synthesize those perspectives into a coherent federal assessment. Whether it succeeds in any given evaluation cycle isn’t public information.
All Five US Frontier Labs: How Their Agreements Compare
| Lab | Agreement Type | Origin Era | Pre-Release Access | Notes |
|---|---|---|---|---|
| OpenAI | Renegotiated voluntary pact | Biden-era, revised 2025 | Yes | Earliest US lab to enter formal AISI/CAISI framework |
| Anthropic | Renegotiated voluntary pact | Biden-era, revised 2025 | Yes | Aligned with America’s AI Action Plan under Lutnick |
| Google DeepMind | New voluntary agreement | Announced May 5, 2026 | Yes | Covers Gemini-family and future frontier models |
| Microsoft | New voluntary agreement | Announced May 5, 2026 | Yes | Relevant to Azure AI and OpenAI partnership models |
| xAI | New voluntary agreement | Announced May 5, 2026 | Yes | Musk’s Trump ties add political dimension to participation |
The table above maps what “all five labs covered” actually looks like in practice. The agreements aren’t identical. OpenAI and Anthropic have been operating under renegotiated versions of Biden-era pacts since mid-2025, giving CAISI roughly a year of working history with those organizations. Google DeepMind, Microsoft, and xAI are starting fresh under the new framework, which means the agency will spend time calibrating its evaluation approach to each lab’s specific model architecture and release cadence.
The Real Limits of Voluntary Testing
The honest accounting of what these agreements can’t do is just as important as what they can. CAISI has a staff of under 200 people, according to reporting from The Brightminded. Frontier AI labs each ship multiple major models per year, with continuous incremental updates between releases. The arithmetic doesn’t favor comprehensive coverage.
More fundamentally, the agreements are voluntary. Companies can withdraw. CAISI has no statutory authority to block a release based on evaluation findings. If an evaluation surfaces serious concerns, the agency can communicate those concerns to the lab and to other government stakeholders. It can’t issue a stop-order. The contrast with, say, the FDA’s authority over drug approvals is stark. This is measurement, not enforcement.
- CAISI cannot block or delay a model release based on evaluation results
- Labs may exit the agreement at will; no penalties exist for withdrawal
- Testing covers a snapshot of a model’s capabilities, not its ongoing behavior post-deployment
- Safeguard-removed evaluations test raw capability but may not reflect real-world attack surfaces
- The TRAINS Taskforce’s findings are not publicly released, limiting independent verification
Critics of the voluntary approach, including several researchers who spoke to outlets covering the announcement, argue that these structural weaknesses make the framework closer to a public relations exercise than a meaningful check. That criticism deserves engagement. The counter-argument is that measurement science has to precede regulation. You can’t write sensible rules for capabilities you don’t yet understand how to measure. CAISI’s real product isn’t compliance. It’s a body of empirical knowledge that could, eventually, support enforceable standards.
The UK and EU have taken somewhat harder regulatory stances. Brussels’ AI Act mandates certain transparency and testing requirements for high-risk AI applications, with teeth. The US approach, even with Monday’s expansion, remains far more industry-collaborative. Whether that gap narrows, or widens, depends significantly on what the TRAINS evaluations find over the next 12 to 18 months.
For a deeper look at how red-teaming methodologies have evolved since the Biden-era voluntary commitments, see our guide to AI red-teaming practices. And for context on how the Trump administration’s AI Action Plan changed CAISI’s mandate from its predecessor agency, our AISI-to-CAISI transition analysis covers the organizational shift in detail.
Reader Questions Answered
Will these tests delay when Google, Microsoft, or xAI can release new AI models?
What happens if CAISI finds a serious vulnerability in a pre-release model?
Why aren’t foreign labs like DeepSeek included?
How does this fit into the broader US-EU AI regulation picture?
Does this affect developers building on Google, Microsoft, or xAI models via API?
Google, Microsoft & xAI Are Now Inside the Framework. What Comes Next?
The addition of Google DeepMind, Microsoft, and xAI closes the most obvious gap in the US pre-deployment review framework. Every major domestic frontier lab now participates voluntarily. That’s a genuine milestone. But the harder questions are structural, and Monday’s announcement doesn’t resolve them.
CAISI’s staff of fewer than 200 will need to absorb three new institutional relationships, each with distinct model architectures, release schedules, and internal safety cultures. The TRAINS Taskforce must integrate those evaluation outputs across more than ten agencies with competing priorities. And the whole apparatus operates without binding authority, sustained only by the political consensus that voluntary cooperation beats nothing at all.
For now, that consensus holds. The Trump administration needs industry cooperation to advance its AI competitiveness agenda. The labs need government credibility to access defense contracts and shape the regulatory environment. The mutual interest is real, even if the underlying incentives aren’t purely about safety. That’s not unusual in technology policy. It’s just worth being clear-eyed about.
Read more on NeuralWired: our running tracker of US AI policy developments in 2026, and the comparison of frontier AI safety benchmarks currently in use by government and independent evaluators.
