
AgentStatus by Carmel Labs · April 2026

The State of AI Agent Drift:
88% of Agents Changed Behavior in 30 Days

We tracked behavioral drift across 6,200+ production AI agents over 30 days using 18 million tests. Nearly every agent we monitored experienced measurable changes in answer correctness, response speed, or both. When answer correctness dropped, it dropped hard.

Total tests: ~10,000,000
Agents monitored: 6,200+
Drift events detected: 1,540,000+
Observation period: 30 days

What We Measured

Continuous drift detection across 6,200+ production AI agents.

What drift means: drift is when an AI agent's behavior changes over time. An agent that was giving correct answers last week starts giving wrong answers this week, or an agent that responded in 2 seconds starts taking 10 seconds. Drift can happen because the underlying AI model was updated, because infrastructure changed, or for reasons nobody can identify after the fact.

How we detect it: for each agent, we continuously compare its current test results against its own recent history (a rolling average of the past 7 days). When the current result is significantly worse than the agent's own recent average, the system records a drift event. We track four types: answer correctness, response speed, answer quality score, and time to first response.
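
The comparison described above can be sketched roughly as follows. This is a minimal illustration, not the AgentStatus implementation: the report does not state what counts as "significantly worse", so the `MIN_DROP` threshold, the `DriftDetector` class, and the `DriftEvent` fields are all assumptions made for the example.

```python
from collections import deque
from dataclasses import dataclass
from datetime import datetime, timedelta

WINDOW = timedelta(days=7)  # rolling window length, as stated in the report
MIN_DROP = 20.0             # hypothetical threshold in points (not from the report)


@dataclass
class DriftEvent:
    at: datetime
    baseline: float  # rolling average before this result
    current: float   # the result that triggered the event
    change: float    # current - baseline (negative means worse)


class DriftDetector:
    """Tracks one metric for one agent, where a higher score is better."""

    def __init__(self, window=WINDOW, min_drop=MIN_DROP):
        self.window = window
        self.min_drop = min_drop
        self.history = deque()  # (timestamp, score) pairs inside the window

    def observe(self, at, score):
        # Discard results that have aged out of the rolling window.
        while self.history and at - self.history[0][0] > self.window:
            self.history.popleft()

        event = None
        if self.history:
            baseline = sum(s for _, s in self.history) / len(self.history)
            if baseline - score >= self.min_drop:
                event = DriftEvent(at, baseline, score, score - baseline)

        self.history.append((at, score))
        return event


# Usage: four healthy results followed by two degraded ones.
detector = DriftDetector()
start = datetime(2026, 3, 1)
events = []
for hour, score in enumerate([95, 96, 94, 95, 5, 4]):
    event = detector.observe(start + timedelta(hours=hour), score)
    if event is not None:
        events.append(event)

for e in events:
    print(f"{e.at:%H:%M} baseline={e.baseline:.1f} current={e.current} change={e.change:.1f}")
```

In this sketch both degraded results produce an event, because the rolling average is still dominated by the healthy results. That is the same mechanism by which one prolonged regression can generate many recorded events.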

Over 30 days, the detector recorded 1,540,508 drift events. This does not mean 1.54 million separate incidents. The same agent can trigger many drift events while it stays in a degraded state, so one prolonged regression might generate dozens or hundreds of recorded events. Among agents with answer correctness drift, the two most active groups (those with 101 to 200 events and those with more than 200) averaged roughly 179 and 234 drift events per agent, respectively, over the 30-day window.
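
To make the distinction between events and incidents concrete, here is one possible way to collapse a stream of drift events into incidents. The report does not define an incident, so the `group_into_incidents` helper and its six-hour `max_gap` are purely illustrative assumptions.

```python
from datetime import datetime, timedelta


def group_into_incidents(event_times, max_gap=timedelta(hours=6)):
    """Group event timestamps into incidents.

    Consecutive events no more than `max_gap` apart are treated as the same
    incident. The gap value is a hypothetical choice for this sketch.
    """
    incidents = []
    for at in sorted(event_times):
        if incidents and at - incidents[-1][-1] <= max_gap:
            incidents[-1].append(at)
        else:
            incidents.append([at])
    return incidents


# Usage: four events in one regression, then two more events two days later.
start = datetime(2026, 3, 29)
times = [start + timedelta(hours=h) for h in (0, 1, 2, 3, 48, 49)]
incidents = group_into_incidents(times)
print(len(times), "events ->", len(incidents), "incidents")
```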

Answer correctness: 1.1M drift events across 5,475 agents
Response speed: 269K drift events across 6,164 agents
Answer quality score: 147K drift events across 747 agents
Time to first response: 20K drift events across 330 agents

Finding 1

88% of agents started giving worse answers at least once in 30 days.

5,475 out of approximately 6,200 active agents triggered the answer correctness drift detector at least once during the 30-day observation window. In plain terms, the agent's rate of giving correct answers dropped significantly below its own recent average at least once.

For response speed, coverage was even higher: 6,164 agents (over 99%) experienced at least one significant slowdown compared to their own recent performance.

Nearly every production AI agent we monitored experienced measurable behavioral change within a single month. Consistent, stable behavior is the exception, not the norm.

How many agents drifted, by type

Response speed: 99%
Answer correctness: 88%
Answer quality score: 12%
Time to first response: 5.3%
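
The percentages above follow from the per-type agent counts and the approximate number of active agents. The short check below reproduces them, assuming a denominator of exactly 6,200 (the report gives this figure only as an approximation).

```python
ACTIVE_AGENTS = 6_200  # approximate figure from the report, assumed exact here

agents_with_drift = {
    "Response speed": 6_164,
    "Answer correctness": 5_475,
    "Answer quality score": 747,
    "Time to first response": 330,
}

# Share of active agents with at least one drift event of each type.
coverage = {
    drift_type: 100 * count / ACTIVE_AGENTS
    for drift_type, count in agents_with_drift.items()
}

for drift_type, pct in coverage.items():
    print(f"{drift_type}: {pct:.1f}%")
```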

Finding 2

When answer correctness drifts, it does not degrade gracefully. It collapses.

Among the 1.1 million answer correctness drift events, the typical drop was 93 points out of 100. That means at the moment the drift detector fired, the agent had gone from giving correct answers most of the time to giving correct answers almost none of the time. The average drop was 84 points out of 100.

This does not mean 88% of agents sat at near-zero correctness for an entire month. It means that when these agents drifted, the failures were severe. The detector fires at moments when the agent's current result is dramatically worse than its recent average, and those moments were not rare: most drifting agents triggered more than 100 of them over 30 days.

93 points

Typical answer correctness drop when drift is detected

Among 1.1 million answer correctness drift events, the typical (median) change was a 93-point drop on a 100-point scale. When agents break, they do not give slightly worse answers. They give wrong answers.

How severe is each type of drift?

| Drift type | Typical change | Average change | Worst 5% of cases |
| --- | --- | --- | --- |
| Answer correctness | -93 points | -84 points | -99 points |
| Answer quality score | -12 points | -12 points | -15 points |
| Response speed | 2x slower | 3.5x slower | 7x slower |
| Time to first response | 3x slower | 6x slower | 20x slower |

Response speed drift is almost as striking as answer correctness drift. The typical slowdown was 2x, meaning response times roughly doubled at the moment the detector fired. In the worst 5% of cases, agents became nearly 7x slower. Time to first response was even more extreme: users typically waited 3x longer for the agent to start replying.
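
The three columns in the table above correspond to a median, a mean, and a tail percentile. The sketch below shows how such figures could be computed with Python's standard library. The `changes` list is made-up illustrative data, not data from the report, chosen so that a few milder events pull the mean above the median in the same way the reported -84 average sits above the -93 typical drop.

```python
from statistics import mean, median, quantiles

# Hypothetical per-event changes in answer correctness (points, negative = worse).
changes = [-99, -97, -95, -93, -93, -92, -90, -60, -35]

typical = median(changes)  # the "typical change" column
average = mean(changes)    # the "average change" column

# quantiles(n=20) returns 19 cut points; the first is the 5th percentile,
# i.e. the boundary of the worst 5% of drops.
worst_5pct = quantiles(changes, n=20, method="inclusive")[0]

print(f"typical={typical} average={average:.1f} worst 5%={worst_5pct:.1f}")
```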

Finding 3

On March 29, the number of drifting agents jumped to nearly three times its March 27 level.

Between March 27 and March 29, the number of agents triggering answer correctness drift jumped from 1,968 to 5,455. This was not a steady climb: on March 28 the count actually dipped to 715 agents before the March 29 spike.

Our own test volume did increase over this period (from approximately 134,000 tests on March 27 to 217,000 on March 29, roughly a 1.6x increase), but the agent count jumped nearly 2.8x. The increase in test volume alone does not explain the spike. Something systemic affected the majority of the agents we monitor.
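
The comparison in the paragraph above is simple arithmetic on the March 27 and March 29 figures. The check below uses the rounded test volumes quoted in the report, so its results are approximate.

```python
# Figures quoted in the report (test volumes are rounded).
tests = {"Mar 27": 134_000, "Mar 29": 217_000}
drifting_agents = {"Mar 27": 1_968, "Mar 29": 5_455}

test_growth = tests["Mar 29"] / tests["Mar 27"]
agent_growth = drifting_agents["Mar 29"] / drifting_agents["Mar 27"]

# How much faster the drifting-agent count grew than test volume did.
excess = agent_growth / test_growth

print(f"test volume: {test_growth:.2f}x")
print(f"drifting agents: {agent_growth:.2f}x")
print(f"growth beyond test volume: {excess:.2f}x")
```

Even after dividing out the growth in test volume, the number of drifting agents still rose by roughly 1.7x, which is why test volume alone does not account for the spike.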

We checked public model release timelines for this window. No major model provider (OpenAI, Anthropic, Google) published a release specifically on March 28 or 29. The cause could be a behind-the-scenes routing change, an update not reflected in public release notes, or a cascading infrastructure event. We cannot point to a specific provider based on the data we have today.

Agents with answer correctness drift, by day

| Date | Agents drifting | Drift events | Tests run |
| --- | --- | --- | --- |
| Mar 27 | 1,968 | 2,465 | 134K |
| Mar 28 | 715 | 750 | 149K |
| Mar 29 | 5,455 | 11,835 | 217K |
| Mar 30 | 5,456 | 13,246 | 235K |
| Apr 1 | 5,449 | 17,290 | 254K |
| Apr 4 | 5,452 | 48,903 | 229K |
| Apr 10 | 5,394 | 48,561 | |

Once agents started drifting around March 29, they largely stayed in a drifted state. The number of agents triggering drift detection remained above 5,200 for the rest of the observation period.

Finding 4

Less than half of drifted agents showed signs of recovery.

We sampled 400 agents that had drifted and checked whether each one later produced a test result where its answer correctness was back above the level it had been at before the drift started. This is a loose definition of "recovery" since crossing that level once is not the same as returning to stable performance, but it gives a lower bound on how many agents showed any improvement at all.

44% of sampled agents eventually gave at least one correct-enough result after drifting (174 of 400)
56% of sampled agents never returned to their pre-drift correctness level during our observation window (226 of 400)

Among the 174 agents that did show some recovery, the typical wait time was 9.6 hours before the agent produced a result back at its pre-drift level. In the slowest 10% of cases, it took 34.6 hours. A few agents took over 400 hours (more than two weeks) before producing a single result above their pre-drift level.
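
The recovery check described above can be sketched as a small function. This is an illustration under assumptions: the function name `hours_to_recovery`, its arguments, and the example data are hypothetical, and the report does not describe how its own query was written.

```python
from datetime import datetime, timedelta


def hours_to_recovery(drift_at, pre_drift_level, later_results):
    """Return hours until the first result above the pre-drift level.

    `later_results` is a list of (timestamp, correctness) pairs. Returns None
    if no later result exceeds the pre-drift level (no recovery observed).
    """
    for at, correctness in sorted(later_results):
        if at > drift_at and correctness > pre_drift_level:
            return (at - drift_at).total_seconds() / 3600
    return None


# Usage with made-up data: one agent recovers after 9.5 hours, one never does.
drift_at = datetime(2026, 3, 29)
recovering = [
    (drift_at + timedelta(hours=2), 10),
    (drift_at + timedelta(hours=6), 40),
    (drift_at + timedelta(hours=9, minutes=30), 92),
]
stuck = [(drift_at + timedelta(hours=h), 5) for h in (2, 6, 12)]

recovered_after = hours_to_recovery(drift_at, 90, recovering)
never_recovered = hours_to_recovery(drift_at, 90, stuck)
print(recovered_after, never_recovered)
```

Because a single result above the pre-drift level is enough to count, a check like this says nothing about whether the agent stayed at that level afterwards.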

9.6h

Typical wait before a drifted agent gives a correct answer again

For the 44% of agents that eventually returned to their pre-drift level, the typical wait was nearly 10 hours. In the slowest 10% of cases, it was 34.6 hours. The remaining 56% showed no recovery during the observation window.

Important context: This is based on a random sample of 400 agents, not all 5,475 that drifted. "Recovery" here means the agent produced at least one test result back at its pre-drift correctness level, which is a low bar and does not confirm the agent returned to stable, consistent performance. A full analysis across all agents was not possible within the database's query time limits.

Finding 5

Drift is not a one-time event. Most agents keep failing for weeks.

We counted how many times each agent triggered the drift detector over 30 days. The result is not a long tail where a few broken agents skew the numbers. The overwhelming majority of drifting agents triggered the detector more than 100 times each.

| Times drift was detected (per agent) | Number of agents | Total events |
| --- | --- | --- |
| 1 to 20 | 10 | 83 |
| 21 to 50 | 45 | 1,666 |
| 51 to 100 | 61 | 4,556 |
| 101 to 200 | 2,792 | 499,380 |
| 201+ | 2,567 | 599,450 |

5,359 out of 5,475 drifting agents triggered the detector more than 100 times in 30 days, meaning they were consistently giving worse answers than their recent average for weeks on end. This is not a story about occasional bad results. The overwhelming pattern is that once an agent starts drifting, it continues drifting for an extended period.

Only 10 agents out of 5,475 had 20 or fewer drift events in 30 days. For the other 99.8%, drift was a recurring, persistent condition rather than a one-time event.
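
The totals and shares quoted in this finding can be re-derived from the bucket counts in the table. The snippet below does that arithmetic, using only numbers that appear in the report.

```python
# (bucket label, number of agents, total events), copied from the table.
buckets = [
    ("1 to 20", 10, 83),
    ("21 to 50", 45, 1_666),
    ("51 to 100", 61, 4_556),
    ("101 to 200", 2_792, 499_380),
    ("201+", 2_567, 599_450),
]

total_agents = sum(agents for _, agents, _ in buckets)
total_events = sum(events for _, _, events in buckets)

# Agents in the two buckets above 100 events.
over_100 = sum(agents for label, agents, _ in buckets if label in ("101 to 200", "201+"))

# Share of drifting agents with more than 20 events.
share_over_20 = 100 * (total_agents - buckets[0][1]) / total_agents

# Average events per agent within each bucket.
avg_per_agent = {label: events / agents for label, agents, events in buckets}

print(total_agents, total_events, over_100, f"{share_over_20:.1f}%")
for label, avg in avg_per_agent.items():
    print(f"{label}: {avg:.0f} events per agent")
```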

Key Takeaways

What this data tells us about production AI agents.

1. Drift is the norm, not the exception.

88% of agents experienced answer correctness drift at least once in 30 days. Over 99% experienced response speed drift. Consistent, stable behavior across a full month was rare.

2. When agents drift on answer correctness, the drops are severe.

The typical correctness drop at the moment of detection was 93 points out of 100. Agents do not gradually get worse. They go from working to broken, and they do it suddenly.

3. Recovery is slow when it happens, and it often does not happen.

In a sample of 400 drifted agents, 56% showed no return to their pre-drift correctness level during the follow-up period. The 44% that did recover took a typical 9.6 hours. If nobody is monitoring for this, nobody is fixing it quickly.

4. Systemic events can affect thousands of agents overnight.

On March 29, the number of drifting agents reached nearly three times its March 27 level (5,455 versus 1,968) and stayed elevated for the rest of the observation period. The cause is not attributable to a specific provider, but the pattern suggests a systemic change that cascaded across thousands of agents simultaneously.

5. Drift is persistent, not transient.

99.8% of agents that drifted triggered the detector more than 20 times in 30 days. Most (5,359 of 5,475) triggered it more than 100 times, meaning they were consistently giving worse answers than their recent average for weeks, not experiencing occasional bad results.

Methodology

How we collected and analyzed this data.

This report is based on first-party data from the AgentStatus distributed testing network. Tests were run from real consumer devices on residential home networks across 26 regions. No third-party data sources were used.

Drift detection works by comparing each agent's current test result against a rolling average of that agent's own results over the previous 7 days. When the current result is significantly worse than the agent's own recent average, the system records a drift event with the before and after values and the size of the change. The same agent can generate many drift events while it remains in a degraded state, which is why the total event count (1.54 million) is much higher than the number of affected agents (~6,200).

The "88% experienced answer correctness drift" figure counts agents with at least one answer correctness drift event in the 30-day window (5,475 out of approximately 6,200 active agents). It does not mean 88% of agents were broken for the entire period.

The severity statistics (typical drop of 93 points, average drop of 84 points) describe how large the change was at the moment each drift event was recorded. Because agents in a degraded state generate many events, these numbers are weighted toward prolonged failures rather than one-time blips.
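
A small made-up example shows why per-event statistics lean toward prolonged failures. The data and agent names below are hypothetical: one agent has a single mild blip, the other has a long, severe regression that generates nine events.

```python
from statistics import mean

# Hypothetical drift events: (agent_id, change in correctness points).
events = [("agent-a", -20)] + [("agent-b", -95)] * 9

# Per-event average, the way the report's severity figures are described.
event_weighted = mean(change for _, change in events)

# Per-agent average: average each agent's events first, then average agents.
per_agent = {}
for agent, change in events:
    per_agent.setdefault(agent, []).append(change)
agent_weighted = mean(mean(changes) for changes in per_agent.values())

print(f"per-event average: {event_weighted}")
print(f"per-agent average: {agent_weighted}")
```

The per-event average is dominated by the agent that stayed degraded, while the per-agent average treats both agents equally. The report's severity numbers are of the first kind.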

The recovery analysis is based on a random sample of 400 agents, not all 5,475, because analyzing the full set exceeded the database's query time limits. "Recovery" is defined as the first test result after a drift event where the agent's correctness rate was higher than the level it had been at before the drift started. This is a low bar that does not confirm the agent returned to stable performance.

The March 29 timeline analysis includes a comparison against test volume on the same dates. Test volume increased 1.6x while the number of drifting agents increased 2.8x, indicating that the growth in testing alone does not explain the spike in drift detection.

Published April 2026 by AgentStatus, a product of Carmel Labs, Inc.
agentstatus.dev
