
AgentStatus by Carmel Labs

March 2026
Agent Reliability Report

A comprehensive analysis of AI agent reliability across 6,259 production agents, 4.5 million tests, and 10 geographic regions. Over half of evaluated test results (56.6%) showed 100% uptime, yet 89.2% of results failed every evaluation check. All data collected first-party by AgentStatus.

4,492,066
tests executed
6,259
agents registered
0.2%
fully successful
10
regions tested
agentstatus
agentstatus.dev | March 2026

Executive Summary

56.6% of evaluated tests showed 100% uptime. 89.2% failed every evaluation check.

In March 2026, AgentStatus executed 4,492,066 tests against 6,259 registered AI agents across 10 geographic regions. All data in this report was collected first-party by AgentStatus's distributed testing network. No third-party data sources were used. The headline finding is not about uptime. Over half of evaluated test results showed 100% uptime. The problem is what the agents said when they responded.

89.2% of all test results showed a 0% evaluation pass rate, meaning the agent responded to the prompt but the answer failed every quality check. Out of 4.5 million total test executions, only 9,381 were fully successful, a rate of 0.2%. When we narrow to the 1.1 million tests that received full reliability verdicts, only 0.8% came back healthy. The rest were either degraded (62.8%) or completely down (36.5%).

This is the gap between uptime monitoring and response quality monitoring. Traditional tools would have reported the 56.6% of results with 100% uptime as perfectly healthy. The agents were online. They responded. They just responded wrong. Uptime monitoring alone would not have shown it.

These are not synthetic benchmarks. These are real tests sent from real consumer devices on residential networks across the United States, Canada, Ukraine, the United Kingdom, Japan, Romania, Spain, Turkey, South Africa, and Rwanda.

The core finding: The majority of production AI agents are online and responding, but failing to give correct answers. This is invisible to every traditional monitoring tool because the HTTP response looks healthy. The failure is semantic, not structural.

Test Volume

4.5 million tests in one month.

AgentStatus executed 4,492,066 individual test runs across March 2026. Of these, 1,109,869 were unique evaluated tests with full reliability verdicts. The remaining executions include retries, connectivity checks, and multi-step evaluation runs. Test volume grew sharply through the month, peaking in the third week (Mar 16) before declining.

4,492,066
total test executions
1,109,869
evaluated tests with verdicts
6,259
agents registered

Weekly test volume

Mar 2: 12,451
Mar 9: 210,092
Mar 16: 434,459
Mar 23: 331,949
Mar 30: 120,855

Reliability

Only 0.8% of evaluated tests returned a healthy result.

Color key used throughout this report:
Green = UP / healthy / good · Orange = DEGRADED / warning · Red = DOWN / failing / critical

AgentStatus classifies each of the 1,109,869 tests that received full reliability verdicts into one of three statuses: UP (agent responded correctly), DEGRADED (agent responded but with issues), or DOWN (agent failed to respond or returned an error).

8,628
UP (0.8%)
696,473
DEGRADED (62.8%)
404,768
DOWN (36.5%)

Reliability distribution

UP 0.8%
DEGRADED 62.8%
DOWN 36.5%

The DEGRADED category is the most concerning from a product perspective. These agents responded to prompts but the response had issues, whether that was a hallucinated answer, incorrect information, a failed evaluation check, or a response that did not meet the defined quality threshold. In traditional monitoring, these would all appear as healthy since the HTTP status code was 200.

62.8% of evaluated tests returned a 200 status code with a problematic response. The agent did not crash. It responded. It just responded wrong. A healthy HTTP status code does not mean a correct answer.
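To make the gap concrete, here is a minimal Python sketch that contrasts a status-only check with a content check. The JSON shape, the `answer` field, and the keyword list are hypothetical illustrations, not AgentStatus's actual evaluation logic.

```python
import json


def status_only_check(status_code: int) -> bool:
    """What a traditional uptime monitor sees: a 200 counts as healthy."""
    return status_code == 200


def content_check(status_code: int, body: str, required_keywords: list) -> bool:
    """Hypothetical rule-based check: the body must parse as JSON, carry an
    'answer' string (assumed field name), and mention every required keyword."""
    if status_code != 200:
        return False
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    answer = payload.get("answer") if isinstance(payload, dict) else None
    if not isinstance(answer, str):
        return False
    text = answer.lower()
    return all(keyword.lower() in text for keyword in required_keywords)


# A response that is structurally fine but factually wrong.
body = '{"answer": "The capital of France is Berlin."}'
looks_healthy = status_only_check(200)
is_correct = content_check(200, body, ["paris"])
print(looks_healthy, is_correct)
```

The status-only check passes this response while the content check rejects it, which is the DEGRADED pattern described above.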

Quality

89% of test results scored 0% on evaluation checks.

AgentStatus evaluates response quality using evaluation prompts: predefined reference prompts with known correct answers. In March, the results were stark: 89.2% of all test results showed a complete 0% evaluation pass rate. Another 10.6% passed between 1% and 24% of checks. No test result reached a pass rate of 25% or above.

89.2%
of tests scored 0% on evaluation
10.6%
scored between 1% and 24%
0
test results scored 25% or above

Evaluation pass rate distribution

Pass Rate | Count | Percentage
0% | 989,088 | 89.2%
1-24% | 117,883 | 10.6%
NULL (not evaluated) | 2,898 | 0.3%
25% or above | 0 | 0.0%

The uptime vs quality disconnect

This is the most important finding in this report. When we separate uptime (is the agent reachable?) from quality (is the answer correct?), a clear picture emerges:

56.6%
of test results had 100% uptime
89.2%
of test results scored 0% on quality

Uptime distribution

Uptime | Count | Percentage
100% uptime | 627,833 | 56.6%
80-99% uptime | 880 | 0.1%
50-79% uptime | 61,242 | 5.5%
1-49% uptime | 37,627 | 3.4%
0% uptime | 382,287 | 34.4%

HTTP connectivity pass rate

HTTP Pass Rate | Count | Percentage
1-24% pass rate | 727,582 | 65.6%
0% pass rate | 382,287 | 34.4%

The agents are online. The answers are wrong. More than half of evaluated test results showed perfect uptime throughout March: the agent was reachable and responded to every prompt. Yet 89.2% of all results failed every single evaluation check. This is the exact failure mode that inside-out monitoring tools cannot detect. Your Grafana dashboard shows green. Your users are getting wrong answers.

Total test outcomes

9,381
fully successful tests (0.2%)
410,805
failed tests (9.1%)
4,492,066
total test executions

Out of nearly 4.5 million total test executions across the month, only 9,381 returned a fully successful result where the agent was reachable, responded within acceptable latency, and passed all evaluation checks. That is 0.2% of all executions.
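The 0.2% and 0.8% figures use different denominators, which is easy to miss. The sketch below recomputes both from the counts reported in this document; the variable and function names are illustrative only.

```python
# Counts taken from this report.
total_executions = 4_492_066
fully_successful = 9_381
failed = 410_805

evaluated_tests = 1_109_869
verdict_counts = {"UP": 8_628, "DEGRADED": 696_473, "DOWN": 404_768}


def pct(part: int, whole: int) -> float:
    """Percentage rounded to one decimal place, as presented in the report."""
    return round(100 * part / whole, 1)


# Denominator: every test execution, including retries and connectivity checks.
success_rate = pct(fully_successful, total_executions)
failure_rate = pct(failed, total_executions)

# Denominator: only the tests that received a full reliability verdict.
verdict_shares = {status: pct(n, evaluated_tests) for status, n in verdict_counts.items()}

print(success_rate, failure_rate, verdict_shares)
```

The three verdict counts sum exactly to the 1,109,869 evaluated tests; the rounded shares add up to 100.1% only because of rounding.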

Latency

Average response time: 7.5 seconds.

Latency varied dramatically across agents and regions. The average P50 latency was 7,535ms, but the median was 3,368ms, suggesting a long tail of slow agents pulling the average up. The slowest response observed took 22 minutes (1,320 seconds).

7,535ms
average P50 latency
3,368ms
median P50 latency
1,320s
max latency observed
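A small Python illustration of why a long tail separates the average from the median. The latency values below are made up for the example and are not taken from the report's dataset.

```python
from statistics import mean, median

# Hypothetical per-agent P50 latencies in milliseconds (not report data).
# Six agents respond within a few seconds; one is extremely slow.
p50_latencies_ms = [1_200, 2_500, 3_000, 3_400, 4_100, 5_000, 45_000]

average_ms = mean(p50_latencies_ms)
median_ms = median(p50_latencies_ms)

print(f"average: {average_ms:.0f} ms, median: {median_ms} ms")
```

A single slow agent more than doubles the average while leaving the median untouched, the same pattern as the 7,535ms average against the 3,368ms median.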

Time to first byte

For streaming agents, time to first byte (TTFB) averaged 1,032ms at P50 and 1,629ms at P95. This is the time between sending the prompt and receiving the first token of the response. A TTFB above 2 seconds typically correlates with a noticeably poor user experience.
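As a rough sketch of how TTFB can be measured for a streaming response, the Python below times the arrival of the first token. The `simulated_stream` generator is a stand-in for a real streaming agent, and both function names are hypothetical; this is not AgentStatus's measurement code.

```python
import time
from typing import Iterable, Iterator, List, Tuple


def simulated_stream(first_token_delay_s: float, tokens: List[str]) -> Iterator[str]:
    """Stand-in for a streaming agent: waits, then yields tokens one by one."""
    time.sleep(first_token_delay_s)
    for token in tokens:
        yield token


def measure_ttfb_ms(stream: Iterable[str]) -> Tuple[float, str]:
    """Return (time to first token in milliseconds, full response text)."""
    start = time.perf_counter()   # taken just before the request starts
    iterator = iter(stream)
    first = next(iterator)        # blocks until the first token arrives
    ttfb_ms = (time.perf_counter() - start) * 1000
    return ttfb_ms, first + "".join(iterator)


ttfb_ms, text = measure_ttfb_ms(simulated_stream(0.05, ["Hello", ", ", "world"]))
print(f"TTFB: {ttfb_ms:.0f} ms, response: {text!r}")
```

In a real probe the timer would start when the prompt is sent over the network, so DNS, TLS, and connection setup would count toward the measured TTFB.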

Errors

Two types of failure: access issues and agent issues.

When tests failed, the failure modes fell into two distinct categories. The first is access issues, where our test could not reach the agent at all. These include HTTP 4xx errors (blocked requests, rate limits, auth failures), DNS failures, connection timeouts, and TLS failures. These are not evidence that the agent is broken. In many cases, this is the agent's bot detection working as intended.

The second category is agent issues, where we reached the agent and it responded, but the response was wrong. These include evaluation misses (agent responded but failed quality checks), schema violations, read timeouts (agent started responding but never finished), and parse failures. These are the failures that matter most because they represent an agent that looks healthy but is not working correctly.

Access issues (agent may be fine, request was blocked)

Error Type | Count | % of Errors | What This Means
HTTP 4xx | 923,665 | 56.8% | Request blocked (rate limit, auth, forbidden, bot detection)
DNS Failure | 253,751 | 15.6% | Agent endpoint could not be resolved (down or misconfigured)
Connect Timeout | 28,675 | 1.8% | Could not establish connection to agent
TLS Failure | 9,712 | 0.6% | SSL/TLS certificate or handshake failure

Agent issues (agent responded but the response was wrong)

Error Type | Count | % of Errors | What This Means
Unknown Error | 168,960 | 10.4% | Unexpected failure (network issues, malformed responses)
Evaluation Miss | 112,396 | 6.9% | Agent responded but answer failed quality evaluation
HTTP 5xx | 103,046 | 6.3% | Server-side error (the type traditional monitoring does catch)
Read Timeout | 95,032 | 5.8% | Agent started responding but never completed
Schema Invalid | 49,346 | 3.0% | Response format did not match expected schema
Parse Failure | 8,894 | 0.5% | Response could not be parsed (invalid JSON, broken encoding)

Why this distinction matters: HTTP 4xx errors (56.8% of all failures) are not proof that an agent is malfunctioning. A 403 or 429 often means the agent's bot detection is working correctly. The more concerning failures are evaluation misses, schema violations, and read timeouts, where the agent responded but the response was wrong or incomplete. These are the failures that are invisible to traditional monitoring because the HTTP status code was 200.
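The two-way split can be expressed as a simple lookup. The Python sketch below groups the error types and counts from the two tables into the report's categories; the function and variable names are illustrative, and the totals count error records, not necessarily distinct tests.

```python
# Error counts as listed in the two tables of this section.
ERROR_COUNTS = {
    "HTTP 4xx": 923_665,
    "DNS Failure": 253_751,
    "Connect Timeout": 28_675,
    "TLS Failure": 9_712,
    "Unknown Error": 168_960,
    "Evaluation Miss": 112_396,
    "HTTP 5xx": 103_046,
    "Read Timeout": 95_032,
    "Schema Invalid": 49_346,
    "Parse Failure": 8_894,
}

# Grouping follows the report's two tables.
ACCESS_ISSUES = {"HTTP 4xx", "DNS Failure", "Connect Timeout", "TLS Failure"}
AGENT_ISSUES = {
    "Unknown Error", "Evaluation Miss", "HTTP 5xx",
    "Read Timeout", "Schema Invalid", "Parse Failure",
}


def category(error_type: str) -> str:
    """Map an error type to 'access' or 'agent'; reject unknown types."""
    if error_type in ACCESS_ISSUES:
        return "access"
    if error_type in AGENT_ISSUES:
        return "agent"
    raise ValueError(f"unrecognized error type: {error_type}")


totals = {"access": 0, "agent": 0}
for error_type, count in ERROR_COUNTS.items():
    totals[category(error_type)] += count

print(totals)
```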

Geographic Distribution

Agent behavior varies significantly by region.

Tests were executed from 10 geographic regions. The United States accounted for the largest share of test volume (69.7%), followed by Canada, Ukraine, and the United Kingdom. Latency varied dramatically by region, with tests run from Rwanda averaging 30,514ms (over 30 seconds) compared to 3,830ms from Canada.

Region | Tests | Share | Avg Latency | Relative
United States | 773,071 | 69.7% | 8,521ms | 2.2x baseline
Canada | 121,033 | 10.9% | 3,830ms | baseline
Ukraine | 57,008 | 5.1% | 3,904ms | 1.0x
United Kingdom | 53,322 | 4.8% | 4,390ms | 1.1x
Japan | 39,845 | 3.6% | 7,480ms | 2.0x
Romania | 26,328 | 2.4% | 5,807ms | 1.5x
Spain | 10,063 | 0.9% | 4,431ms | 1.2x
Turkey | 6,635 | 0.6% | 4,282ms | 1.1x
South Africa | 6,155 | 0.6% | 5,549ms | 1.4x
Rwanda | 4,996 | 0.5% | 30,514ms | 8.0x

Users in Rwanda experienced 8x worse latency than users in Canada for the same agents. This kind of geographic variance is invisible to any monitoring tool that tests from a single location. The agent works fine from the company's office. It is unusable for users in certain regions.
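The "Relative" column is each region's average latency divided by Canada's, the lowest in the table. A short Python sketch of that calculation, using the averages listed above (names are illustrative):

```python
# Average latency per test region in milliseconds, from the table above.
avg_latency_ms = {
    "United States": 8_521,
    "Canada": 3_830,
    "Ukraine": 3_904,
    "United Kingdom": 4_390,
    "Japan": 7_480,
    "Romania": 5_807,
    "Spain": 4_431,
    "Turkey": 4_282,
    "South Africa": 5_549,
    "Rwanda": 30_514,
}

baseline_ms = avg_latency_ms["Canada"]  # lowest regional average, used as baseline

# Multiple of the baseline, rounded to one decimal place as in the table.
relative = {region: round(ms / baseline_ms, 1) for region, ms in avg_latency_ms.items()}

for region, multiple in sorted(relative.items(), key=lambda item: item[1]):
    print(f"{region}: {multiple}x")
```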

Network

180 active nodes across 10 regions.

AgentStatus operates a distributed network of residential compute nodes that execute tests from real consumer devices. In March 2026, 180 nodes were actively running tests out of 228 total enrolled nodes, managed by 303 node providers.

303
node providers
180
active nodes
231,878
node transactions

Key Takeaways

What this data means for teams deploying AI agents.

1. Uptime is not reliability.

56.6% of evaluated test results showed 100% uptime in March. The agents behind them were online for every check. But 89.2% of all results failed every evaluation check. The agents were running. They were just wrong. If your definition of reliability is "the server responded," your agents look fine. If your definition is "the answer was correct," almost none of them are reliable.

2. Inside-out monitoring is blind to the most common failure mode.

The most common failure mode in March was not a server crash or a timeout. It was an agent returning a 200 status code with a wrong answer. Traditional monitoring tools that track logs, traces, and error rates would not have flagged any of these. The failure was in the content of the response, not the system metrics.

3. Geography changes what your users experience.

On average, agents that respond in 3.8 seconds when tested from Canada take over 30 seconds when tested from Rwanda. For global companies, testing from a single location gives a false sense of reliability. Bot detection, geo-routing, and CDN behavior all change what the user gets depending on where they are.

4. Only 0.2% of test executions were fully successful.

Out of nearly 4.5 million total test executions, only 9,381 returned a result where the agent was reachable, responded within acceptable latency, and passed all evaluation checks. This is the baseline state of AI agents in production in March 2026. The industry is shipping agents that work in demos but fail in the real world. Response quality is the new uptime.

Methodology

How we collected this data.

All data in this report was collected first-party by AgentStatus. No third-party data sources, APIs, or external datasets were used. Every data point comes from tests executed by AgentStatus's own distributed network of residential compute nodes. Each test consists of one prompt sent to one agent from one node in one geographic region. Tests were executed continuously throughout March 2026, with volume increasing as new agents and nodes were onboarded.

Agents in this dataset include publicly accessible AI agents discovered on the open internet, spanning customer support agents, coding assistants, search and research agents, travel assistants, and general-purpose chatbots. No agents in this dataset were tested with permission from the agent operator. All tests were conducted from the public internet, the same way a real user would interact with these agents.

Evaluation was performed using a combination of rule-based checks (response structure, schema validation, keyword matching) and LLM-as-Judge evaluation (semantic correctness assessment). Status classification follows AgentStatus's three-tier system: UP (all checks pass), DEGRADED (agent responds but fails one or more evaluation checks), DOWN (agent fails to respond or returns an error).
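The three-tier rule can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not AgentStatus's implementation: `ProbeResult` and `classify` are hypothetical names, and treating any HTTP status of 400 or above as "returns an error" is an assumption of this sketch.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ProbeResult:
    """Hypothetical record for one test: HTTP status plus evaluation outcomes."""
    http_status: Optional[int]  # None when the agent did not respond at all
    check_results: List[bool] = field(default_factory=list)


def classify(result: ProbeResult) -> str:
    """Apply the three-tier status rule described above."""
    # DOWN: no response, or an error response (assumed here: status >= 400).
    if result.http_status is None or result.http_status >= 400:
        return "DOWN"
    # DEGRADED: the agent responded but failed one or more evaluation checks.
    if not all(result.check_results):
        return "DEGRADED"
    # UP: the agent responded and every check passed.
    return "UP"


print(classify(ProbeResult(200, [True, True])))   # all checks pass
print(classify(ProbeResult(200, [True, False])))  # one check fails
print(classify(ProbeResult(None)))                # no response
```

Note that a 200 response alone is never enough for UP in this sketch; the evaluation checks decide between UP and DEGRADED.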

agentstatus
Published March 2026 by AgentStatus, a product of Carmel Labs, Inc.
agentstatus.dev
