AgentStatus by Carmel Labs · April 2026
The State of AI Agent Drift:
88% of Agents Changed Behavior in 30 Days
We tracked behavioral drift across 6,200+ production AI agents over 30 days using 18 million tests. Nearly every agent we monitored experienced measurable changes in answer correctness, response speed, or both. When answer correctness dropped, it dropped hard.
What We Measured
Continuous drift detection across 6,200+ production AI agents.
What drift means: drift is when an AI agent's behavior changes over time. An agent that was giving correct answers last week starts giving wrong answers this week, or an agent that responded in 2 seconds starts taking 10 seconds. Drift can happen because the underlying AI model was updated, because infrastructure changed, or for reasons nobody can identify after the fact.
How we detect it: for each agent, we continuously compare its current test results against its own recent history (a rolling average of the past 7 days). When the current result is significantly worse than the agent's own recent average, the system records a drift event. We track four types: answer correctness, response speed, answer quality score, and time to first response.
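The comparison described above can be sketched in a few lines. This is a minimal illustration, not the AgentStatus detector: the report does not say what counts as "significantly worse", so the `min_drop` threshold of 20 points is a hypothetical stand-in, as are the `DriftEvent` and `detect_drift` names and the assumption that scores are on a 0 to 100 scale where higher is better.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class DriftEvent(NamedTuple):
    at: datetime
    before: float  # rolling 7-day average prior to this result
    after: float   # the current result
    change: float  # after - before (negative means worse)


def detect_drift(results, window=timedelta(days=7), min_drop=20.0):
    """Return drift events for one agent.

    results: list of (timestamp, score) sorted by time, higher is better.
    min_drop: hypothetical threshold; the report does not state its own.
    """
    events = []
    for i, (ts, score) in enumerate(results):
        # Baseline uses only this agent's earlier results inside the window.
        history = [s for t, s in results[:i] if ts - window <= t < ts]
        if not history:
            continue  # no baseline yet
        baseline = sum(history) / len(history)
        if baseline - score >= min_drop:
            events.append(DriftEvent(ts, baseline, score, score - baseline))
    return events
```

With daily scores of 90, 92, 94, 10, 12, this sketch records an event on both of the last two days, because the second bad result is still well below the rolling average. That is the mechanism by which one prolonged regression produces many recorded events.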
Over 30 days, the detector recorded 1,540,508 drift events. This does not mean 1.54 million separate incidents. The same agent can trigger many drift events while it stays in a degraded state, so one prolonged regression might generate dozens or hundreds of recorded events. Across the agents we monitor, those in the most active detection groups averaged 179 to 234 drift events per agent over the 30-day window.
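To make the distinction between events and incidents concrete, here is one way repeated events could be collapsed into incidents. It is a sketch under an assumption the report does not make: events from the same agent that are no more than `max_gap` apart (a hypothetical one hour here) are treated as one continuing regression.

```python
from datetime import datetime, timedelta


def group_into_incidents(event_times, max_gap=timedelta(hours=1)):
    """Group one agent's sorted drift-event timestamps into incidents.

    max_gap is a hypothetical choice: events no more than this far apart
    are treated as part of the same regression.
    """
    incidents = []
    for ts in event_times:
        if incidents and ts - incidents[-1][-1] <= max_gap:
            incidents[-1].append(ts)   # same degraded stretch
        else:
            incidents.append([ts])     # a new incident starts
    return incidents
```

Five events spread over two degraded stretches would count as five drift events but only two incidents under this grouping.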
Finding 1
88% of agents started giving worse answers at least once in 30 days.
5,475 out of approximately 6,200 active agents triggered the answer correctness drift detector at least once during the 30-day observation window. In plain terms, the agent's rate of giving correct answers dropped significantly below its own recent average at least once.
For response speed, coverage was even higher: 6,164 agents (over 99%) experienced at least one significant slowdown compared to their own recent performance.
[Chart: How many agents drifted, by type]
Finding 2
When answer correctness drifts, it does not degrade gracefully. It collapses.
Among the 1.1 million answer correctness drift events, the typical (median) drop was 93 points out of 100. That means at the moment the drift detector fired, the agent had gone from giving correct answers most of the time to giving correct answers almost none of the time. The average (mean) drop was 84 points out of 100.
This does not mean 88% of agents sat at near-zero correctness for an entire month. It means that when these agents drifted, the failures were severe. The detector fires at moments when the agent's current result is dramatically worse than its recent average, and those moments were not rare: most drifting agents triggered more than 100 of them over 30 days.
Typical answer correctness drop when drift is detected
Among 1.1 million answer correctness drift events, the typical (median) change was a 93-point drop on a 100-point scale. When agents break, they do not give slightly worse answers. They give wrong answers.
How severe is each type of drift?
| Drift Type | Typical Change | Average Change | Worst 5% of Cases |
|---|---|---|---|
| Answer correctness | -93 points | -84 points | -99 points |
| Answer quality score | -12 points | -12 points | -15 points |
| Response speed | 2x slower | 3.5x slower | 7x slower |
| Time to first response | 3x slower | 6x slower | 20x slower |
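The three columns in the table can be read as a median, a mean, and a tail percentile. The sketch below shows how such a summary might be computed from a list of per-event changes. The sample values are invented for illustration, and reading "Worst 5% of Cases" as the 5th percentile of the signed change is an assumption, since the report does not define that column.

```python
import statistics


def severity_summary(changes):
    """Summarize signed point changes (negative = worse) across drift events."""
    ordered = sorted(changes)  # most severe (most negative) first
    # Assumed reading of "worst 5%": the value at the 5th percentile.
    cut_index = max(0, int(len(ordered) * 0.05) - 1)
    return {
        "typical": statistics.median(ordered),
        "average": statistics.fmean(ordered),
        "worst_5pct": ordered[cut_index],
    }


# Invented sample: mostly near-total drops plus a couple of mild ones.
sample = [-99, -95, -93, -93, -93, -90, -40, -20]
summary = severity_summary(sample)
```

In a skewed sample like this one, a handful of mild events pulls the average toward zero while the median stays near the most common case, which is the same shape as the answer correctness row (median of -93 against an average of -84).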
Finding 3
Between March 27 and March 29, the number of drifting agents nearly tripled.
Between March 27 and March 29, the number of agents triggering answer quality drift jumped from 1,968 to 5,455. This was not a linear increase: March 28 actually dipped to 715 agents before the March 29 spike.
Our own test volume did increase over this period (from approximately 134,000 tests on March 27 to 217,000 on March 29, roughly a 1.6x increase), but the agent count jumped nearly 2.8x. The increase in test volume alone does not explain the spike. Something systemic affected the majority of the agents we monitor.
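The comparison in the paragraph above is simple arithmetic on the reported figures. Spelled out, using the approximate test counts given in the text:

```python
# Figures as reported for March 27 and March 29.
tests = {"Mar 27": 134_000, "Mar 29": 217_000}
agents = {"Mar 27": 1_968, "Mar 29": 5_455}

volume_growth = tests["Mar 29"] / tests["Mar 27"]    # about 1.62x
agent_growth = agents["Mar 29"] / agents["Mar 27"]   # about 2.77x

# Growth in drifting agents that test volume alone does not account for.
excess = agent_growth / volume_growth                # about 1.71x

print(f"tests: {volume_growth:.2f}x, agents: {agent_growth:.2f}x, excess: {excess:.2f}x")
```

Dividing one ratio by the other is only a rough normalization, since the number of distinct drifting agents cannot grow without limit as more tests are run, but it shows that the agent count grew about 1.7 times faster than test volume.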
We checked public model release timelines for this window. No major model provider (OpenAI, Anthropic, Google) published a release specifically on March 28 or 29. The cause could be a behind-the-scenes routing change, an update not reflected in public release notes, or a cascading infrastructure event. We cannot point to a specific provider based on the data we have today.
Agents with answer quality drift, by day
| Date | Agents drifting | Drift events | Tests run |
|---|---|---|---|
| Mar 27 | 1,968 | 2,465 | 134K |
| Mar 28 | 715 | 750 | 149K |
| Mar 29 | 5,455 | 11,835 | 217K |
| Mar 30 | 5,456 | 13,246 | 235K |
| Apr 1 | 5,449 | 17,290 | 254K |
| Apr 4 | 5,452 | 48,903 | 229K |
| Apr 10 | 5,394 | 48,561 | — |
Finding 4
Fewer than half of drifted agents showed signs of recovery.
We sampled 400 agents that had drifted and checked whether each one later produced a test result where its answer correctness was back above the level it had been at before the drift started. This is a loose definition of "recovery" since crossing that level once is not the same as returning to stable performance, but it gives a lower bound on how many agents showed any improvement at all.
Among the 174 agents that did show some recovery, the typical wait time was 9.6 hours before the agent produced a result back at its pre-drift level. In the slowest 10% of cases, it took 34.6 hours. A few agents took over 400 hours (more than two weeks) before producing a single result above their pre-drift level.
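The recovery definition used here (the first later result above the pre-drift level) is easy to state in code. The following is a sketch of that definition only; the function name, the argument shapes, and the example values in the usage note are illustrative assumptions.

```python
from datetime import datetime, timedelta


def hours_to_recovery(pre_drift_level, drift_start, later_results):
    """Hours until the first result above the pre-drift level, or None.

    later_results: list of (timestamp, correctness) sorted by time.
    """
    for ts, correctness in later_results:
        if ts > drift_start and correctness > pre_drift_level:
            return (ts - drift_start).total_seconds() / 3600
    return None  # no recovery seen in the observation window
```

For an agent with a pre-drift level of 90 whose first result above 90 arrives 9 hours 36 minutes after the drift began, this returns 9.6. An agent that never crosses its pre-drift level returns `None` and would fall into the group with no recovery. A single crossing is enough to count, which is why this is a loose definition.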
Typical wait before a drifted agent gives a correct answer again
For the 44% of agents that eventually returned to their pre-drift level, the typical wait was nearly 10 hours. In the slowest 10% of cases, it was 34.6 hours. The remaining 56% showed no recovery during the observation window.
Finding 5
Drift is not a one-time event. Most agents keep failing for weeks.
We counted how many times each agent triggered the drift detector over 30 days. The result is not a long tail where a few broken agents skew the numbers. The overwhelming majority of drifting agents triggered the detector more than 100 times each.
| Times drift was detected (per agent) | Number of agents | Total events |
|---|---|---|
| 1 to 20 | 10 | 83 |
| 21 to 50 | 45 | 1,666 |
| 51 to 100 | 61 | 4,556 |
| 101 to 200 | 2,792 | 499,380 |
| 201+ | 2,567 | 599,450 |
5,359 out of 5,475 drifting agents triggered the detector more than 100 times in 30 days, meaning they were consistently giving worse answers than their recent average for weeks on end. This is not a story about occasional bad results. The overwhelming pattern is that once an agent starts drifting, it continues drifting for an extended period.
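The totals and shares quoted in this section follow directly from the rows of the table above, which can be checked with a few sums:

```python
# Rows from the table above: (bucket label, number of agents, total events)
buckets = [
    ("1 to 20", 10, 83),
    ("21 to 50", 45, 1_666),
    ("51 to 100", 61, 4_556),
    ("101 to 200", 2_792, 499_380),
    ("201+", 2_567, 599_450),
]

total_agents = sum(n for _, n, _ in buckets)   # 5,475
total_events = sum(e for _, _, e in buckets)   # 1,105,135
over_100 = sum(n for label, n, _ in buckets if label in ("101 to 200", "201+"))
over_20 = total_agents - buckets[0][1]

share_over_100 = over_100 / total_agents       # about 97.9%
share_over_20 = over_20 / total_agents         # about 99.8%
```

The rows sum to the 5,475 drifting agents and roughly 1.1 million answer correctness events cited elsewhere in the report.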
Key Takeaways
What this data tells us about production AI agents.
Drift is the norm, not the exception.
88% of agents experienced answer correctness drift at least once in 30 days. Over 99% experienced response speed drift. Consistent, stable behavior across a full month was rare.
When agents drift on answer correctness, the drops are severe.
The typical correctness drop at the moment of detection was 93 points out of 100. Agents do not gradually get worse. They go from working to broken, and they do it suddenly.
Recovery is slow when it happens, and it often does not happen.
In a sample of 400 drifted agents, 56% showed no return to their pre-drift correctness level during the observation window. The 44% that did recover typically took 9.6 hours. If nobody is monitoring for this, nobody is fixing it quickly.
Systemic events can affect thousands of agents overnight.
By March 29, the number of drifting agents had nearly tripled from its March 27 level, from 1,968 to 5,455, and it stayed elevated for the rest of the observation period. The cause is not attributable to a specific provider, but the pattern suggests a systemic change that cascaded across thousands of agents simultaneously.
Drift is persistent, not transient.
99.8% of agents that drifted triggered the detector more than 20 times in 30 days. Most (5,359 of 5,475) triggered it more than 100 times, meaning they were consistently giving worse answers than their recent average for weeks, not experiencing occasional bad results.
Methodology
How we collected and analyzed this data.
This report is based on first-party data from the AgentStatus distributed testing network. Tests were run from real consumer devices on residential home networks across 26 regions. No third-party data sources were used.
Drift detection works by comparing each agent's current test result against a rolling average of that agent's own results over the previous 7 days. When the current result is significantly worse than the agent's own recent average, the system records a drift event with the before and after values and the size of the change. The same agent can generate many drift events while it remains in a degraded state, which is why the total event count (1.54 million) is much higher than the number of affected agents (~6,200).
The "88% experienced answer correctness drift" figure counts agents with at least one answer correctness drift event in the 30-day window (5,475 out of approximately 6,200 active agents). It does not mean 88% of agents were broken for the entire period.
The severity statistics (typical drop of 93 points, average drop of 84 points) describe how large the change was at the moment each drift event was recorded. Because agents in a degraded state generate many events, these numbers are weighted toward prolonged failures rather than one-time blips.
The recovery analysis is based on a random sample of 400 agents, not all 5,475, because analyzing the full set exceeded the database's query time limits. "Recovery" is defined as the first test result after a drift event where the agent's correctness rate was higher than the level it had been at before the drift started. This is a low bar that does not confirm the agent returned to stable performance.
The March 29 timeline analysis includes a comparison against test volume on the same dates. Test volume increased 1.6x while the number of drifting agents increased 2.8x, indicating that the growth in testing alone does not explain the spike in drift detection.