25% of Your Documents Are Already Corrupted. AI Agents Did It Silently.
A new Microsoft Research benchmark finds that frontier AI models corrupt roughly one in four documents after just 20 editing interactions, and most of the damage is invisible until it isn’t.
There’s a quiet crisis unfolding inside enterprise AI deployments, and most teams aren’t looking for it. When you hand an AI agent the keys to your document workflows, you aren’t just offloading labor. You’re also, according to a major new study, introducing a compounding fidelity problem that standard quality checks simply can’t catch.
A paper published on April 17, 2026, by Philippe Laban and colleagues at Microsoft Research describes what they call “silent document corruption,” a failure mode in which AI agents slowly degrade the structural and semantic integrity of documents over repeated editing sessions. The research, formally titled “LLMs Corrupt Your Documents When You Delegate,” doesn’t single out one weak model. It finds the problem across the 19 systems tested, including some of the most capable models currently available.
The implications are significant for any organization using agentic AI for document-heavy work: legal filings, financial reports, codebases, scientific notation, medical records. The content looks fine. The errors hide inside.
The DELEGATE-52 Benchmark
To measure the problem systematically, the Microsoft Research team built DELEGATE-52, a new evaluation framework designed from the ground up to stress-test long-horizon document editing. Standard AI benchmarks typically score a model on a single interaction: one prompt, one response, done. DELEGATE-52 works differently.
The benchmark simulates up to 20 sequential editing interactions across 52 distinct professional domains. That list spans a striking range: software code, music notation, crystallography data, legal contracts, financial models, and scientific literature. Each domain was selected because it has verifiable structural rules, meaning researchers could use programmatic parsers and backtranslation methods to objectively score whether the final document matched the original intent.
What is backtranslation evaluation? The DELEGATE-52 team converted documents to intermediate formats and back again, then compared results against ground-truth originals. This approach detects structural corruption that would pass a surface-level readability check, catching errors a human reviewer might miss entirely.
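The mechanics are easy to sketch. The snippet below is a minimal illustration of the round-trip idea, not the team’s actual pipeline: it normalizes both the ground-truth original and the edited document through an intermediate representation, then scores their similarity. JSON stands in here for whatever structured format a given domain uses, and the function name is invented for illustration.

```python
import difflib
import json

def backtranslation_score(ground_truth: str, edited: str) -> float:
    """Normalize both documents through an intermediate representation,
    then compare the edited version against the ground-truth original.
    Scores below 1.0 indicate structural drift a readability check misses."""
    def normalize(text: str) -> str:
        return json.dumps(json.loads(text), indent=2, sort_keys=True)
    return difflib.SequenceMatcher(None, normalize(ground_truth), normalize(edited)).ratio()

ground_truth = '{"clause": "Payment due in 30 days", "parties": ["A", "B"]}'
edited       = '{"clause": "Payment due in 60 days", "parties": ["A"]}'  # reads fine, isn't
print(f"fidelity: {backtranslation_score(ground_truth, edited):.2f}")
```

Both versions read perfectly well in isolation; only the comparison against the ground-truth original exposes the altered term and the dropped party.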
The team also constructed 310 distinct “work environments,” sets of documents that agents could reference while editing, including irrelevant “distractor” files that mimicked realistic workplace conditions. That detail matters. Real agentic deployments don’t operate in clean, isolated contexts. They swim in noise, and DELEGATE-52 was built to reflect that.
“Short-horizon performance is simply not predictive of long-horizon reliability. A model that edits a document well once can corrupt it systematically across twenty interactions.”
Philippe Laban, Senior Researcher, Microsoft Research — arXiv:2604.15597
The study defined a “ready” threshold of 98% fidelity or above, the minimum score at which a document is considered reliable enough for professional delegation. Only one domain, Python code, consistently cleared that bar across most frontier models. Every other domain fell short to varying degrees.
How Corruption Happens
The failure isn’t random noise. The research identifies a specific, predictable mechanism: compounding error propagation. Each time an agent edits a document, it works from its current understanding of that document’s state. If the previous edit introduced even a minor structural misstep, the next pass builds on that error. Then the next. By interaction 20, the cumulative drift can be substantial.
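A back-of-envelope calculation (not a model from the paper) shows why small per-edit error rates turn into large corruption rates: if each pass independently damages the document with probability p, the chance it survives 20 passes intact is (1 − p)^20.

```python
# Toy independence model of compounding error propagation, for intuition only.
def corrupted_after(n_edits: int, p_error_per_edit: float) -> float:
    """Probability a document is corrupted after n independent editing passes."""
    return 1.0 - (1.0 - p_error_per_edit) ** n_edits

for p in (0.005, 0.015, 0.05):
    print(f"per-edit error {p:.1%} -> {corrupted_after(20, p):.0%} corrupted after 20 edits")
```

Even a 1.5% per-edit error rate, invisible in any single interaction, lands in the neighborhood of the roughly 25% corruption the study reports for frontier models after 20 interactions.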
Three distinct corruption patterns emerge from the data.
Context Drift
Models lose track of document structure as conversation history grows. Earlier sections get reconstructed from inference rather than retained from source.
Content Hallucination
When an agent can’t retain full document context, it fills gaps with plausible-sounding but fabricated content, seamlessly, invisibly.
Silent Truncation
Sections of documents get quietly dropped, particularly in longer files. The output document is shorter and structurally incomplete, but coherent enough to appear complete.
One particularly counterintuitive finding: standard agentic tool use, where models manipulate files using Python scripts rather than outputting text directly, doesn’t prevent the degradation. The assumption that programmatic file handling would preserve fidelity turns out to be wrong. The structural awareness problem sits at the reasoning layer, not the output layer.
Key finding: Larger documents and the presence of distractor files consistently worsened corruption severity. More context doesn’t help the model stay accurate. It gives the model more surface area to get confused.
The corruption is also described as “silent” for a specific reason: the degraded documents typically remain readable. They don’t throw errors. They don’t look broken. A legal clause might be subtly reworded. A formula might be adjusted. A code function might be refactored into something plausibly equivalent but functionally different. Standard review processes, including AI-assisted review, catch very little of this.
By the Numbers
The headline figure from the DELEGATE-52 study is stark: across the 19 models tested, the average document corruption rate after 20 interactions sits at roughly 50%. Even frontier models, those at the top of current capability rankings, average around 25% corruption. That’s one document in four, significantly altered from its original intent.
| Metric | Value | What It Means |
|---|---|---|
| Frontier model corruption rate | ~25% | Share of documents corrupted, averaged across top-tier models after 20 editing interactions |
| All-model average corruption rate | ~50% | Aggregated across all 19 systems tested in the study |
| Professional domains tested | 52 | Spanning code, music notation, crystallography, legal, financial, and more |
| Work environments simulated | 310 | Including distractor files to mimic realistic, noisy workspaces |
| “Ready” fidelity threshold | 98%+ | Minimum benchmark score for reliable professional delegation |
| Domains clearing “ready” threshold | 1 (Python) | Only programmatic code consistently qualified across most models |
The Python exception is instructive. Code has a built-in verification layer: it either runs or it doesn’t. Syntax errors surface immediately. Semantic errors often surface in testing. That feedback loop provides a correction mechanism that prose documents, spreadsheets, music files, and scientific records simply don’t have. When there’s no external validator, errors survive and propagate.
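That loop is cheap to reproduce for code, which is part of why Python is the outlier. A minimal sketch of the pattern, with a placeholder path and a pytest invocation standing in for whatever test suite a project actually runs: reject the agent’s edit unless the file still compiles and the tests still pass.

```python
import py_compile
import subprocess

def accept_python_edit(path: str) -> bool:
    """Gate an agent edit behind code's built-in validators: syntax first,
    then the test suite. Prose and notation formats have no equivalent gate."""
    try:
        py_compile.compile(path, doraise=True)   # syntax errors surface immediately
    except py_compile.PyCompileError as exc:
        print(f"edit rejected, does not compile: {exc}")
        return False
    result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
    if result.returncode != 0:                   # semantic errors often surface in testing
        print("edit rejected, tests fail:")
        print(result.stdout)
        return False
    return True
```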
The study’s dataset and evaluation code were released publicly on April 19 via GitHub and Hugging Face, allowing independent researchers to replicate results and test additional models. Early community analysis, emerging from developer forums in the days after publication, largely confirmed the findings.
Enterprise Risk
For businesses that have deployed autonomous agents against document-heavy workflows, the DELEGATE-52 results constitute a direct operational warning. The scenarios most at risk aren’t hypothetical. They’re already live at scale.
- Legal teams using agents to draft, revise, or summarize contracts face the prospect of altered clauses that pass human review but introduce material ambiguity.
- Financial analysts relying on agents to update models and reports may receive outputs where key figures or formulas have been silently adjusted across iterative sessions.
- Engineering teams using AI for codebase maintenance are the best-positioned group, given code’s natural validation mechanisms, but remain exposed in documentation and configuration files.
- Research and scientific publishing workflows, where notation and citation integrity are critical, fall squarely into the high-corruption-risk categories identified by the study.
The phrase the research uses is “silent trust crisis.” That framing captures something real. The danger isn’t that organizations will notice AI agents producing obviously broken output. They won’t. The danger is that workflows will operate at scale on subtly corrupted content for months before any downstream failure makes the problem visible, at which point the audit trail is deep and the remediation is costly.
“Businesses relying on autonomous agents for high-stakes document management face a trust problem that won’t announce itself until it’s already caused damage.”
Microsoft Research Analysis — Emergent Mind coverage
The findings also arrived during ICLR 2026 in Rio de Janeiro, where related discussions on multi-agent system reliability ran alongside presentations on alignment and evaluation. The timing gave the research unusual visibility at a moment when deployment of agentic systems is accelerating rapidly.
What Researchers Suggest
The DELEGATE-52 paper doesn’t prescribe a complete solution, but the data points clearly toward where solutions need to develop. The findings push in three directions.
Better Memory and State Management
The core problem is that models lose structural awareness across long editing sessions. Any durable fix requires agents that can maintain, verify, and restore accurate representations of document state, not just the conversation history that surrounds it. This is an open research problem, and one the ICLR community is actively working on.
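Simpler versions of that idea can be prototyped today outside the model itself. The sketch below is a hypothetical illustration, not a mechanism from the paper: fingerprint the document’s structure before delegation and re-check it after every interaction, so drift is caught by an explicit state record rather than left to the conversation history.

```python
import hashlib

def structural_fingerprint(text: str) -> dict:
    """Coarse structural state: outline, section count, and length.
    Assumes a Markdown-style document; adapt the heading rule per format."""
    headings = [line for line in text.splitlines() if line.startswith("#")]
    return {
        "n_headings": len(headings),
        "n_lines": len(text.splitlines()),
        "outline_hash": hashlib.sha256("\n".join(headings).encode()).hexdigest(),
    }

def drift_warnings(before: dict, after: dict) -> list[str]:
    """Compare fingerprints taken before and after an agent interaction."""
    warnings = []
    if after["n_headings"] < before["n_headings"]:
        warnings.append("sections dropped (possible silent truncation)")
    if after["outline_hash"] != before["outline_hash"]:
        warnings.append("outline changed; verify against source before the next edit")
    return warnings
```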
Domain-Specific Verification Layers
Python code works because it has an execution environment that catches errors. Other domains need analogous validators. Music notation has formal parsing tools. Crystallography data has structural rules. Legal and financial documents don’t yet have widely deployed AI-compatible validators, but the study’s implicit argument is that building them should be a priority before agentic systems are trusted with high-stakes content in those fields. Verification infrastructure is infrastructure, and it needs investment to match deployment pace.
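What a validator looks like varies by domain, but the pattern is constant: a parser that knows the format’s rules sits between the agent’s output and acceptance. The sketch below is illustrative only, using Python’s XML parser as a stand-in for a contract filing format; the required-element rules are invented, not drawn from any real schema.

```python
import xml.etree.ElementTree as ET

REQUIRED_ELEMENTS = {"parties", "term", "governing_law"}  # illustrative, not a real schema

def validate_contract(document: str) -> list[str]:
    """The edited document must still parse, and structurally required
    elements must survive the edit. Returns a list of violations."""
    try:
        root = ET.fromstring(document)
    except ET.ParseError as exc:
        return [f"document no longer parses: {exc}"]
    present = {child.tag for child in root}
    return [f"required element dropped: <{tag}>"
            for tag in sorted(REQUIRED_ELEMENTS - present)]
```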
Long-Horizon Benchmarking Standards
Perhaps the most durable contribution of DELEGATE-52 is the benchmark itself. The AI industry has relied heavily on single-turn evaluations to compare models and declare capability milestones. This study makes a compelling empirical case that those evaluations miss something important. Evaluation methodology needs to catch up with actual deployment conditions, and that means longer horizon tests, noisier environments, and domain-specific fidelity scoring.
Practical step for teams now: The DELEGATE-52 dataset is publicly available. Organizations with high-stakes document workflows can use it to evaluate their specific deployed models before extending agent autonomy further. Testing against the benchmark won’t close the fidelity gap, but it can quantify it and help teams make more informed decisions about where human oversight stays essential.
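What such an internal evaluation might look like is sketched below. This is a hypothetical harness shape only: the record fields, the run_agent and score_fidelity hooks, and the interaction count are assumptions, not the benchmark’s published interface.

```python
from statistics import mean

def evaluate_agent(tasks, run_agent, score_fidelity, n_interactions=20, ready=0.98):
    """Drive each task through sequential edits with your deployed agent and
    report how many documents stay above the 'ready' fidelity threshold."""
    scores = []
    for task in tasks:                                  # one benchmark task per document
        doc = task["document"]
        for instruction in task["instructions"][:n_interactions]:
            doc = run_agent(doc, instruction)           # plug in your own agent call here
        scores.append(score_fidelity(task["ground_truth"], doc))
    ready_rate = sum(s >= ready for s in scores) / len(scores)
    print(f"mean fidelity {mean(scores):.3f}, ready rate {ready_rate:.0%}")
    return scores
```

A scorer along the lines of the backtranslation sketch earlier in this piece could fill the score_fidelity slot for structured formats.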
Frequently Asked Questions
What is the DELEGATE-52 benchmark?
DELEGATE-52 is an evaluation framework created by Microsoft Research to measure how well AI models maintain document fidelity across long editing sessions. It tests 19 AI systems across 52 professional domains and 310 simulated work environments, using up to 20 sequential interactions per session to expose compounding corruption that single-turn benchmarks miss.
Which AI models were tested in the study?
The study tested 19 models, including frontier systems like Gemini 3.1 Pro, Claude 4.6 Opus, and GPT-5.4. Even the highest-performing frontier models averaged around 25% document corruption after 20 interactions, with the average across all 19 models reaching approximately 50%.
Why does document corruption happen in AI agents?
Corruption results from compounding error propagation over long editing sessions. As agents make sequential edits, they lose track of the original document structure, leading to context drift, silent truncation, and hallucinated content inserted to bridge gaps. Each interaction builds on previous errors, amplifying the total drift from the original document.
Does using Python tools prevent document corruption?
No. The study found that standard agentic tool use, including having models manipulate files programmatically via Python, does not prevent degradation. The structural awareness problem occurs at the model’s reasoning layer, not at the output layer, so changing the output mechanism doesn’t resolve the underlying issue.
What is the “ready” threshold in DELEGATE-52?
The benchmark defines 98% fidelity or above as the “ready” threshold for reliable professional delegation. Only one domain, Python code, consistently cleared this bar across most tested models. All other evaluated domains fell below it, including legal, financial, scientific, and musical notation formats.
Is the DELEGATE-52 dataset publicly available?
Yes. Microsoft Research released the full DELEGATE-52 dataset and evaluation code on April 19, 2026, via GitHub and Hugging Face. Teams can use it to independently test their own deployed models against the benchmark before extending autonomous editing capabilities to high-stakes document workflows.
Which document domains carry the highest corruption risk?
Domains without built-in external validators carry the highest risk. These include legal contracts, financial models, music notation, scientific records, and crystallography data. Code, particularly Python, is the outlier because execution environments catch errors automatically, providing a fidelity correction mechanism other domains lack.
What should enterprise teams do right now?
Teams should audit which document types their AI agents are editing autonomously, especially across repeated sessions, and prioritize human review checkpoints for high-stakes content. Running internal models against the publicly available DELEGATE-52 benchmark can help quantify exposure before deciding how much autonomy to extend.
What This Means for AI Agents
The DELEGATE-52 study lands at a specific moment. Agentic AI systems are being deployed faster than the research community can fully characterize their failure modes. Most capability benchmarks measure what a model can do once, under clean conditions, with a clear prompt. The real world doesn’t work like that, and DELEGATE-52 is one of the clearest empirical demonstrations of the gap between benchmark performance and operational reliability.
Twenty-five percent corruption among frontier models isn’t a verdict against AI-assisted document work. It’s a calibration. It tells organizations where the boundary of trustworthy autonomy currently sits, and it’s more restrictive than most deployment decisions have assumed. The single domain that qualifies as “ready,” Python code, has the built-in properties the others lack. Everything else needs verification infrastructure that doesn’t yet exist at scale.
That infrastructure is buildable. Domain-specific validators, long-horizon evaluation standards, memory mechanisms that preserve structural state across sessions, these are solvable engineering and research challenges. But they require acknowledging the problem first. The study’s most important contribution may simply be making the silence audible.
