Google DeepMind, Microsoft, and xAI have agreed to let federal analysts test their most powerful AI models before they ever reach the public.

Google, Microsoft & xAI Will Hand the US Government Unreleased AI Models for National Security Testing

The Commerce Department’s CAISI has struck voluntary agreements with Google DeepMind, Microsoft, and Elon Musk’s xAI, giving federal analysts early access to frontier models before they reach the public. Here’s what the tests actually cover, why all three companies said yes, and what the deals can’t do.

On May 5, 2026, the US Department of Commerce’s Center for AI Standards and Innovation announced new pre-deployment testing agreements with Google DeepMind, Microsoft, and xAI. The pacts give government evaluators access to powerful AI models that haven’t shipped yet, with a specific mandate to probe them for national security risks: cyberattacks, biosecurity vulnerabilities, and capabilities that could compromise critical infrastructure. All three companies agreed voluntarily. No law required it.

This matters for a simple reason. These aren’t small players submitting to niche academic benchmarks. Google DeepMind, Microsoft, and xAI collectively represent a dominant share of the frontier AI market. When they open their pre-release pipelines to federal scrutiny, the shape of AI oversight in America shifts. Quietly, but it shifts.

What is CAISI? The Center for AI Standards and Innovation sits inside the National Institute of Standards and Technology (NIST) at the Commerce Department. Formerly known as the AI Safety Institute (AISI), it was renamed in June 2025 under Commerce Secretary Howard Lutnick as part of the Trump administration’s America’s AI Action Plan. CAISI leads frontier AI evaluations for the federal government, with a specific focus on national security implications.

A Pact Five Years in the Making, Signed in Five Minutes of News

The agreements announced Monday build on earlier voluntary pacts that the Biden administration struck with OpenAI and Anthropic, both of which were renegotiated in 2025 to conform with the Trump administration’s priorities. With Monday’s additions, all five major US frontier AI labs are now covered under some form of pre-deployment review. That’s not a coincidence. It’s the result of deliberate White House outreach: senior officials met with executives from Anthropic, Google, and OpenAI in the days before the announcement to align expectations.

CAISI director Chris Fall framed the significance plainly.

“Independent, rigorous measurement science is essential to understanding frontier AI and its national security implications. These expanded industry collaborations help us scale our work in the public interest at a critical moment.”

Chris Fall, Director, Center for AI Standards and Innovation (CAISI), NIST — NIST Press Release, May 5, 2026

The phrase “at a critical moment” isn’t rhetoric. CAISI has completed more than 40 AI model evaluations as of May 5, 2026. Several of those evaluations involved unreleased state-of-the-art models tested in classified environments. The pace of that work is accelerating precisely because frontier AI capabilities are accelerating.

What CAISI Actually Tests, and How

The testing methodology centers on red-teaming: structured adversarial probing designed to expose what a model can do when pushed, manipulated, or stripped of its safety guardrails. CAISI evaluates models both before and after deployment. The pre-release evaluations, which these new agreements specifically enable, are the more sensitive category.

In practice, participating labs provide access to models “with reduced or removed safeguards” for raw capability assessment. That phrasing comes directly from NIST documentation, and it’s telling. Evaluators aren’t testing the polished, safety-tuned product that the public will use. They’re testing the base model underneath it, looking for capabilities that fine-tuning might mask but not eliminate.

  • 🛡️ Cybersecurity: Models face simulated capture-the-flag (CTF) challenges from platforms like pwn.college, measuring their ability to autonomously exploit vulnerabilities.
  • 🧬 Biosecurity: Evaluators probe whether models can assist in synthesizing dangerous pathogens or provide meaningful uplift to someone attempting to do so.
  • 📊 General Capability: Standard benchmarks like MMLU-Pro gauge overall reasoning and knowledge depth, providing a baseline for comparing models across labs.
  • 🏛️ Critical Infrastructure: Tests probe whether models could assist in attacks against energy grids, financial systems, or government networks.

CAISI’s published evaluation of DeepSeek’s models offers the clearest public window into this methodology. DeepSeek V3.1 solved 28 percent of 577 CTF cyber tasks drawn from the pwn.college benchmark, according to CAISI’s September 2025 technical report. On the MMLU-Pro knowledge benchmark, it scored 89 percent versus 90 percent for the top US reference model. The numbers sound benign. They’re not. A model that successfully handles even 28 percent of advanced CTF challenges represents genuine cyber uplift for a malicious actor with no coding background.
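
To make the arithmetic behind those headline numbers concrete, here is a minimal sketch of how a pass-rate tally for a CTF-style evaluation might look. It assumes a simple pass/fail outcome per task; the task IDs, data structures, and scoring logic are illustrative assumptions, not CAISI’s actual harness or pwn.college’s scoring code.

```python
# Minimal, hypothetical sketch of a pass-rate tally for a CTF-style evaluation.
# Task IDs, counts, and the solved/unsolved split are illustrative assumptions,
# not CAISI's actual harness or pwn.college's scoring code.
from dataclasses import dataclass


@dataclass
class TaskResult:
    task_id: str
    solved: bool  # did the model autonomously capture the flag?


def pass_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks solved outright (a simple pass/fail tally)."""
    if not results:
        return 0.0
    return sum(r.solved for r in results) / len(results)


# Roughly the shape of the published DeepSeek V3.1 figure:
# 28 percent of 577 CTF tasks is about 162 solved challenges.
example = [TaskResult(f"ctf-{i:03d}", solved=(i < 162)) for i in range(577)]
print(f"CTF pass rate: {pass_rate(example):.0%}")  # -> 28%
```

The point of a tally like this isn’t the percentage itself but the comparability: the same task set, run against successive frontier models, gives evaluators a trend line rather than a one-off score.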

Why “Measurement Science” Is the Key Phrase

CAISI officials consistently use the term “measurement science” rather than “safety oversight.” That word choice is deliberate. The agency’s mandate isn’t to block releases or impose requirements. It’s to build empirical baselines that let the government understand, and eventually anticipate, what frontier AI can do. Think of it less like an FDA drug approval and more like the FAA collecting flight data from airlines before any regulation exists.

Why Google, Microsoft, and xAI Said Yes

Voluntarily handing pre-release models to the government is not the obvious choice for companies racing to ship. So why did Google DeepMind, Microsoft, and xAI all agree?

The business logic is clearer than it looks. Companies that participate get a seat at the table when testing frameworks are designed. That matters enormously: whoever shapes the benchmarks shapes what “safe” means in regulatory conversations. An AI lab that helps write the evaluation criteria for frontier models is in a very different position from one that waits for external standards to be imposed on it.

xAI’s position carries an additional dimension. Elon Musk’s relationship with the Trump administration creates alignment incentives that don’t apply to Google or Microsoft. Whether the agreement reflects genuine safety commitment, political calculation, or both, xAI now sits in the same oversight framework as the labs it routinely criticizes publicly.

The competitive angle: By joining the framework, Google DeepMind, Microsoft, and xAI can position compliance as a differentiator. Labs outside the agreement, whether foreign developers or smaller domestic players, face implicit comparison to an emerging US standard. That’s a reputational and potentially regulatory moat.

Microsoft’s participation also connects directly to its cloud business. Azure hosts a significant share of the AI workloads running in the US defense and intelligence community. Demonstrating pre-deployment cooperation with CAISI reinforces that positioning, particularly as government procurement decisions increasingly weigh AI safety posture alongside raw performance metrics.

Inside the TRAINS Taskforce That Reviews the Results

Once CAISI completes an evaluation, the findings flow into the TRAINS Taskforce, an interagency body established in November 2024. As of May 2026, TRAINS includes experts from more than ten federal agencies. The roster spans the Department of Defense, Department of Energy, Department of Homeland Security, the NSA, and the NIH, among others. Each brings a different threat lens: the NSA cares about cyber; NIH cares about biosecurity; DHS cares about infrastructure.

The interagency structure exists because AI risk doesn’t fit neatly into any one agency’s portfolio. A model that can assist in cyberattacks is a military problem, an intelligence problem, and a civilian infrastructure problem simultaneously. TRAINS attempts to synthesize those perspectives into a coherent federal assessment. Whether it succeeds in any given evaluation cycle isn’t public information.

All Five US Frontier Labs: How Their Agreements Compare

Lab | Agreement Type | Origin Era | Pre-Release Access | Notes
OpenAI | Renegotiated voluntary pact | Biden-era, revised 2025 | Yes | Earliest US lab to enter formal AISI/CAISI framework
Anthropic | Renegotiated voluntary pact | Biden-era, revised 2025 | Yes | Aligned with America’s AI Action Plan under Lutnick
Google DeepMind | New voluntary agreement | Announced May 5, 2026 | Yes | Covers Gemini-family and future frontier models
Microsoft | New voluntary agreement | Announced May 5, 2026 | Yes | Relevant to Azure AI and OpenAI partnership models
xAI | New voluntary agreement | Announced May 5, 2026 | Yes | Musk’s Trump ties add political dimension to participation

The table above maps what “all five labs covered” actually looks like in practice. The agreements aren’t identical. OpenAI and Anthropic have been operating under renegotiated versions of Biden-era pacts since mid-2025, giving CAISI roughly a year of working history with those organizations. Google DeepMind, Microsoft, and xAI are starting fresh under the new framework, which means the agency will spend time calibrating its evaluation approach to each lab’s specific model architecture and release cadence.

The Real Limits of Voluntary Testing

The honest accounting of what these agreements can’t do is just as important as what they can. CAISI has a staff of under 200 people, according to reporting from The Brightminded. Frontier AI labs each ship multiple major models per year, with continuous incremental updates between releases. The arithmetic doesn’t favor comprehensive coverage.

More fundamentally, the agreements are voluntary. Companies can withdraw. CAISI has no statutory authority to block a release based on evaluation findings. If an evaluation surfaces serious concerns, the agency can communicate those concerns to the lab and to other government stakeholders. It can’t issue a stop-order. The contrast with, say, the FDA’s authority over drug approvals is stark. This is measurement, not enforcement.

  • CAISI cannot block or delay a model release based on evaluation results
  • Labs may exit the agreement at will; no penalties exist for withdrawal
  • Testing covers a snapshot of a model’s capabilities, not its ongoing behavior post-deployment
  • Safeguard-removed evaluations test raw capability but may not reflect real-world attack surfaces
  • The TRAINS Taskforce’s findings are not publicly released, limiting independent verification

Critics of the voluntary approach, including several researchers who spoke to outlets covering the announcement, argue that these structural weaknesses make the framework closer to a public relations exercise than a meaningful check. That criticism deserves engagement. The counter-argument is that measurement science has to precede regulation. You can’t write sensible rules for capabilities you don’t yet understand how to measure. CAISI’s real product isn’t compliance. It’s a body of empirical knowledge that could, eventually, support enforceable standards.

The EU has taken a notably harder regulatory stance: Brussels’ AI Act mandates certain transparency and testing requirements for high-risk AI applications, with teeth. The US approach, even with Monday’s expansion, remains far more industry-collaborative. Whether that gap narrows or widens depends significantly on what the TRAINS evaluations find over the next 12 to 18 months.

For a deeper look at how red-teaming methodologies have evolved since the Biden-era voluntary commitments, see our guide to AI red-teaming practices. And for context on how the Trump administration’s AI Action Plan changed CAISI’s mandate from its predecessor agency, our AISI-to-CAISI transition analysis covers the organizational shift in detail.

Reader Questions Answered

Will these tests delay when Google, Microsoft, or xAI can release new AI models?
Unlikely in the near term. CAISI has no authority to block a release, and the agreements don’t include any built-in delay mechanism. Labs share pre-release model access voluntarily, and testing proceeds in parallel with, not as a prerequisite to, public launch. Findings may prompt a lab to voluntarily adjust a model, but there’s no confirmed case where a CAISI evaluation has held up a release date.
What happens if CAISI finds a serious vulnerability in a pre-release model?
CAISI communicates findings to the lab and shares relevant assessments with the TRAINS Taskforce’s interagency partners. There’s no public disclosure mechanism tied to the current agreements. The lab then decides how to respond, whether by adjusting the model, adding additional guardrails, or proceeding with release anyway. CAISI can express concern; it can’t compel action.
Why aren’t foreign labs like DeepSeek included?
Foreign labs can’t be compelled to participate, and the voluntary framework depends on companies having enough trust in US government institutions to share unreleased models. CAISI has evaluated DeepSeek models, but those evaluations used publicly available or commercially accessible versions, not pre-release access. The September 2025 DeepSeek report is the clearest example of that kind of post-deployment evaluation.
How does this fit into the broader US-EU AI regulation picture?
The US approach remains voluntary and measurement-focused, in contrast to the EU’s AI Act, which mandates compliance for high-risk AI applications sold in European markets. Monday’s announcements expand the voluntary framework’s reach but don’t change its fundamental character. The US is betting that industry cooperation and shared standards development will produce better safety outcomes than top-down mandates. That bet is still unproven.
Does this affect developers building on Google, Microsoft, or xAI models via API?
Not directly. The agreements cover the labs’ own frontier models at the pre-release stage. Third-party developers building on those models via APIs work with whatever the labs ship publicly. However, if pre-release evaluations prompt a lab to adjust a model before launch, developers will indirectly benefit from any security or safety improvements that result.

Google, Microsoft & xAI Are Now Inside the Framework. What Comes Next?

The addition of Google DeepMind, Microsoft, and xAI closes the most obvious gap in the US pre-deployment review framework. Every major domestic frontier lab now participates voluntarily. That’s a genuine milestone. But the harder questions are structural, and Monday’s announcement doesn’t resolve them.

CAISI’s staff of fewer than 200 will need to absorb three new institutional relationships, each with distinct model architectures, release schedules, and internal safety cultures. The TRAINS Taskforce must integrate those evaluation outputs across more than ten agencies with competing priorities. And the whole apparatus operates without binding authority, sustained only by the political consensus that voluntary cooperation beats nothing at all.

For now, that consensus holds. The Trump administration needs industry cooperation to advance its AI competitiveness agenda. The labs need government credibility to access defense contracts and shape the regulatory environment. The mutual interest is real, even if the underlying incentives aren’t purely about safety. That’s not unusual in technology policy. It’s just worth being clear-eyed about.

What to Watch
01 CAISI’s first evaluation reports on Google DeepMind and xAI models. No timeline has been announced. The DeepSeek report took several months to produce; expect similar timelines for the new partners.
02 Whether any lab withdraws. The voluntary nature of these agreements means any company can walk away. A withdrawal would signal that evaluation findings became uncomfortable, or that competitive pressures outweighed the reputational benefits of participation.
03 Congressional appetite for enforcement authority. The current framework works only as long as voluntary cooperation holds. Legislators watching CAISI’s expanding portfolio may eventually push for mandatory pre-deployment review, particularly after the next high-profile AI safety incident.
04 TRAINS Taskforce output becoming public. The interagency process currently operates behind closed doors. If political pressure or a major disclosure forces TRAINS findings into the public record, the nature of what these evaluations actually find will become far clearer, and far more consequential.

Read more on NeuralWired: our running tracker of US AI policy developments in 2026, and the comparison of frontier AI safety benchmarks currently in use by government and independent evaluators.

