AgentStatus by Carmel Labs

March 2026 Agent Reliability Report

An analysis of AI agent reliability across 6,259 production agents, 4.5 million tests, and 10 geographic regions. Over half of agents maintained 100% uptime, yet 89% of evaluated test results failed every quality check. All data collected first-party by AgentStatus.

4,492,066 tests executed
6,259 agents registered
0.2% fully successful
10 regions tested

agentstatus.dev | March 2026

Executive Summary

56.6% of agents were online. 89% gave wrong answers.

In March 2026, AgentStatus executed 4,492,066 tests against 6,259 registered AI agents across 10 geographic regions. All data in this report was collected first-party by AgentStatus's distributed testing network. No third-party data sources were used. The headline finding is not about uptime. Over half of all agents maintained 100% uptime throughout the month. The problem is what they said when they responded.

89.2% of all test results showed a 0% evaluation pass rate, meaning the agent responded to the prompt but the answer failed every quality check. Out of 4.5 million total test executions, only 9,381 were fully successful, a rate of 0.2%. When we narrow to the 1.1 million tests that received full reliability verdicts, only 0.8% came back healthy. The rest were either degraded (62.8%) or completely down (36.5%).

This is the gap between uptime monitoring and response quality monitoring. Traditional tools would have shown 56.6% of these agents as perfectly healthy. They were online. They responded. They just responded wrong. And nobody knew.

These are not synthetic benchmarks. These are real tests sent from real consumer devices on residential networks across the United States, Canada, Ukraine, the United Kingdom, Japan, Romania, Spain, Turkey, South Africa, and Rwanda.

The core finding: The majority of production AI agents are online and responding, but failing to give correct answers. This is invisible to every traditional monitoring tool because the HTTP response looks healthy. The failure is semantic, not structural.

Test Volume

4.5 million tests in one month.

AgentStatus executed 4,492,066 individual test runs across March 2026. Of these, 1,109,869 were unique evaluated tests with full reliability verdicts. The remaining executions include retries, connectivity checks, and multi-step evaluation runs. Weekly test volume grew sharply in the first half of the month, peaked in the week of March 16, and declined afterward. The final week, starting March 30, includes only two days of March.

4,492,066 total test executions
1,109,869 evaluated tests with verdicts
6,259 agents registered

Weekly test volume

Week	Tests
Mar 2	12,451
Mar 9	210,092
Mar 16	434,459
Mar 23	331,949
Mar 30	120,855

Reliability

Only 0.8% of evaluated tests returned a healthy result.

Color key used throughout this report:
Green = UP / healthy / good · Orange = DEGRADED / warning · Red = DOWN / failing / critical

Of the 1,109,869 tests that received full reliability verdicts, AgentStatus classifies each into one of three statuses: UP (agent responded correctly), DEGRADED (agent responded but with issues), or DOWN (agent failed to respond or returned an error).

Status	Count	Percentage
UP	8,628	0.8%
DEGRADED	696,473	62.8%
DOWN	404,768	36.5%

The DEGRADED category is the most concerning from a product perspective. These agents responded to prompts but the response had issues, whether that was a hallucinated answer, incorrect information, a failed evaluation check, or a response that did not meet the defined quality threshold. In traditional monitoring, these would all appear as healthy since the HTTP status code was 200.

62.8% of evaluated tests returned a 200 status code with a problematic response. The agent did not crash. It responded. It just responded wrong. A healthy HTTP status code does not mean a correct answer.
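The gap between a healthy status code and a correct answer can be shown with a minimal sketch. It assumes a keyword-based rule check, one of the rule-based check types named in the Methodology. The function names, the sample prompt, and the sample response are hypothetical and are not AgentStatus's implementation.

```python
from typing import List


def http_healthy(status_code: int) -> bool:
    """What a status-code monitor sees: any 2xx counts as healthy."""
    return 200 <= status_code < 300


def answer_correct(body: str, required_keywords: List[str]) -> bool:
    """A simple rule-based check: every required keyword must appear in the body."""
    text = body.lower()
    return all(keyword.lower() in text for keyword in required_keywords)


# Hypothetical response to the prompt "What is the capital of Australia?"
status_code = 200
body = "The capital of Australia is Sydney."

transport_ok = http_healthy(status_code)         # True: the agent responded
content_ok = answer_correct(body, ["Canberra"])  # False: the answer is wrong
degraded = transport_ok and not content_ok       # True: responded, but wrong
```

A monitor that stops at `http_healthy` reports this response as fine. Only the content check exposes the failure.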

Quality

89% of agents scored 0% on evaluation checks.

AgentStatus evaluates response quality using evaluation prompts: predefined reference prompts with known correct answers. In March, the results were stark: 89.2% of all test results showed a complete 0% evaluation pass rate. Another 10.6% passed between 1-24% of checks. No test result reached a pass rate of 25% or above.

89.2% of tests scored 0% on evaluation
10.6% scored between 1-24%
0 results scored 25% or above

Evaluation pass rate distribution

Pass Rate	Count	Percentage
0%	989,088	89.2%
1-24%	117,883	10.6%
NULL (not evaluated)	2,898	0.3%
25% or above	0	0.0%

The uptime vs quality disconnect

This is the most important finding in this report. When we separate uptime (is the agent reachable?) from quality (is the answer correct?), a clear picture emerges:

56.6% of agents had 100% uptime
89.2% of agents scored 0% on quality

Uptime distribution

Uptime	Count	Percentage
100% uptime	627,833	56.6%
80-99% uptime	880	0.1%
50-79% uptime	61,242	5.5%
1-49% uptime	37,627	3.4%
0% uptime	382,287	34.4%

HTTP connectivity pass rate

HTTP Pass Rate	Count	Percentage
1-24% pass rate	727,582	65.6%
0% pass rate	382,287	34.4%

The agents are online. The answers are wrong. More than half of all agents maintained perfect uptime throughout March. They were reachable. They responded to every prompt. Yet 89% of evaluated results failed every single evaluation check. This is the exact failure mode that inside-out monitoring tools cannot detect. Your Grafana dashboard shows green. Your users are getting wrong answers.

Total test outcomes

9,381 fully successful tests (0.2%)
410,805 failed tests (9.1%)
4,492,066 total test executions

Out of nearly 4.5 million total test executions across the month, only 9,381 returned a fully successful result where the agent was reachable, responded within acceptable latency, and passed all evaluation checks. That is 0.2% of all executions.
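The headline rates can be reproduced from the counts reported in this document. The sketch below is only an arithmetic check. The variable names and the `pct` helper are illustrative.

```python
# Counts as reported in this document.
TOTAL_EXECUTIONS = 4_492_066
FULLY_SUCCESSFUL = 9_381
FAILED = 410_805
EVALUATED = 1_109_869
VERDICTS = {"UP": 8_628, "DEGRADED": 696_473, "DOWN": 404_768}


def pct(part: int, whole: int) -> float:
    """Percentage of `whole`, rounded to one decimal place."""
    return round(100 * part / whole, 1)


success_rate = pct(FULLY_SUCCESSFUL, TOTAL_EXECUTIONS)  # 0.2
failed_rate = pct(FAILED, TOTAL_EXECUTIONS)             # 9.1

# Share of each reliability verdict among evaluated tests.
verdict_share = {status: pct(count, EVALUATED) for status, count in VERDICTS.items()}

# The three verdict counts account for every evaluated test.
verdicts_cover_all = sum(VERDICTS.values()) == EVALUATED
```

The verdict shares come out to 0.8%, 62.8%, and 36.5%. They sum to 100.1% only because each is rounded independently.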

Latency

Average response time: 7.5 seconds.

Latency varied dramatically across agents and regions. The average P50 latency was 7,535ms, but the median was 3,368ms, suggesting a long tail of slow agents pulling the average up. The slowest response observed took 1,320 seconds (22 minutes).

7,535ms average P50 latency
3,368ms median P50 latency
1,320s max latency observed
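A small sketch shows why the average sits so far above the median when a few agents are very slow. The latency values below are hypothetical and chosen only to illustrate the effect. They are not taken from the dataset.

```python
import statistics

# Hypothetical per-agent P50 latencies in milliseconds (illustrative only).
p50_ms = [1_200, 2_500, 3_300, 3_400, 4_100, 5_200, 60_000]

mean_ms = statistics.mean(p50_ms)      # pulled upward by the single slow agent
median_ms = statistics.median(p50_ms)  # unaffected by how slow the outlier is

# Reported maximum latency, converted from seconds to minutes.
max_latency_minutes = 1_320 / 60
```

In this sample one outlier is enough to push the mean to more than three times the median. This is the same pattern as the gap between 7,535ms and 3,368ms.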

Time to first byte

For streaming agents, time to first byte (TTFB) averaged 1,032ms at P50 and 1,629ms at P95. This is the time between sending the prompt and receiving the first token of the response. A TTFB above 2 seconds typically correlates with a noticeably poor user experience.
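As a rough sketch of how TTFB can be measured, the function below times how long a stream takes to yield its first chunk. The `fake_stream` generator is a hypothetical stand-in for a streaming agent response. A real measurement would start the clock when the prompt is sent over the network.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_ttfb_ms(chunks: Iterable[bytes]) -> Tuple[float, bytes]:
    """Return (milliseconds until the first chunk arrives, the first chunk)."""
    start = time.perf_counter()
    first_chunk = next(iter(chunks))  # blocks until the stream yields data
    return (time.perf_counter() - start) * 1000.0, first_chunk


def fake_stream(delay_s: float) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming agent response."""
    time.sleep(delay_s)  # simulated wait before the first byte
    yield b"Hello"
    yield b", world"


ttfb_ms, first = measure_ttfb_ms(fake_stream(0.05))
slow_first_byte = ttfb_ms > 2_000  # the 2-second threshold mentioned above
```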

Errors

Two types of failure: access issues and agent issues.

When tests failed, the failure modes fell into two distinct categories. The first is access issues, where our test could not reach the agent at all. These include HTTP 4xx errors (blocked requests, rate limits, auth failures), DNS failures, connection timeouts, and TLS failures. These are not evidence that the agent is broken. In many cases, this is the agent's bot detection working as intended.

The second category is agent issues, where we reached the agent and it responded, but the response was wrong. These include evaluation misses (agent responded but failed quality checks), schema violations, read timeouts (agent started responding but never finished), and parse failures. These are the failures that matter most because they represent an agent that looks healthy but is not working correctly.

Access issues (agent may be fine, request was blocked)

Error Type	Count	% of Errors	What This Means
HTTP 4xx	923,665	56.8%	Request blocked (rate limit, auth, forbidden, bot detection)
DNS Failure	253,751	15.6%	Agent endpoint could not be resolved (down or misconfigured)
Connect Timeout	28,675	1.8%	Could not establish connection to agent
TLS Failure	9,712	0.6%	SSL/TLS certificate or handshake failure

Agent issues (agent responded but the response was wrong)

Error Type	Count	% of Errors	What This Means
Unknown Error	168,960	10.4%	Unexpected failure (network issues, malformed responses)
Evaluation Miss	112,396	6.9%	Agent responded but answer failed quality evaluation
HTTP 5xx	103,046	6.3%	Server-side error (the type traditional monitoring does catch)
Read Timeout	95,032	5.8%	Agent started responding but never completed
Schema Invalid	49,346	3.0%	Response format did not match expected schema
Parse Failure	8,894	0.5%	Response could not be parsed (invalid JSON, broken encoding)

Why this distinction matters: HTTP 4xx errors (56.8% of all failures) are not proof that an agent is malfunctioning. A 403 or 429 often means the agent's bot detection is working correctly. The more concerning failures are evaluation misses, schema violations, and read timeouts, where the agent responded but the response was wrong or incomplete. These are the failures that are invisible to traditional monitoring because the HTTP status code was 200.
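The two-category split can be expressed as a simple lookup over the error types in the tables above. This is a minimal sketch. The grouping follows the tables, while the names `ACCESS_ISSUES`, `ERROR_COUNTS`, and `category` are illustrative.

```python
# Error types listed under "Access issues" in the table above.
ACCESS_ISSUES = {"HTTP 4xx", "DNS Failure", "Connect Timeout", "TLS Failure"}

# Counts as reported in the two error tables.
ERROR_COUNTS = {
    "HTTP 4xx": 923_665,
    "DNS Failure": 253_751,
    "Connect Timeout": 28_675,
    "TLS Failure": 9_712,
    "Unknown Error": 168_960,
    "Evaluation Miss": 112_396,
    "HTTP 5xx": 103_046,
    "Read Timeout": 95_032,
    "Schema Invalid": 49_346,
    "Parse Failure": 8_894,
}


def category(error_type: str) -> str:
    """Map an error type to 'access' or 'agent', following the tables above."""
    return "access" if error_type in ACCESS_ISSUES else "agent"


# Total error count per category.
totals = {"access": 0, "agent": 0}
for error_type, count in ERROR_COUNTS.items():
    totals[category(error_type)] += count
```

Separating the two totals keeps blocked requests from being counted as evidence that an agent is malfunctioning.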

Geographic Distribution

Agent behavior varies significantly by region.

Tests were executed from 10 geographic regions. The United States accounted for the largest share of test volume (69.7%), followed by Canada, Ukraine, and the United Kingdom. Latency varied dramatically by region: tests run from Rwanda averaged 30,514ms (over 30 seconds), compared to 3,830ms from Canada.

Region	Tests	Share	Avg Latency	Relative to Canada
United States	773,071	69.7%	8,521ms	2.2x
Canada	121,033	10.9%	3,830ms	1.0x (baseline)
Ukraine	57,008	5.1%	3,904ms	1.0x
United Kingdom	53,322	4.8%	4,390ms	1.1x
Japan	39,845	3.6%	7,480ms	2.0x
Romania	26,328	2.4%	5,807ms	1.5x
Spain	10,063	0.9%	4,431ms	1.2x
Turkey	6,635	0.6%	4,282ms	1.1x
South Africa	6,155	0.6%	5,549ms	1.4x
Rwanda	4,996	0.5%	30,514ms	8.0x

Users in Rwanda experienced roughly 8x the latency of users in Canada for the same agents. This kind of geographic variance is invisible to any monitoring tool that tests from a single location. The agent works fine from the company's office. It is unusable for users in certain regions.
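The relative figures in the table follow from dividing each region's average latency by the Canadian baseline. A short sketch of that calculation, using the averages from the table (the variable names are illustrative):

```python
# Average latency per test region in milliseconds, from the table above.
AVG_LATENCY_MS = {
    "United States": 8_521,
    "Canada": 3_830,
    "Ukraine": 3_904,
    "United Kingdom": 4_390,
    "Japan": 7_480,
    "Romania": 5_807,
    "Spain": 4_431,
    "Turkey": 4_282,
    "South Africa": 5_549,
    "Rwanda": 30_514,
}

baseline_ms = AVG_LATENCY_MS["Canada"]

# Latency of each region as a multiple of the Canadian baseline.
relative = {region: round(ms / baseline_ms, 1) for region, ms in AVG_LATENCY_MS.items()}

# Region with the highest average latency.
slowest = max(AVG_LATENCY_MS, key=AVG_LATENCY_MS.get)
```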

Network

180 active nodes across 10 regions.

AgentStatus operates a distributed network of residential compute nodes that execute tests from real consumer devices. In March 2026, 180 nodes were actively running tests out of 228 total enrolled nodes, managed by 303 node providers.

303 node providers
180 active nodes
231,878 node transactions

Key Takeaways

What this data means for teams deploying AI agents.

1. Uptime is not reliability.

56.6% of agents maintained 100% uptime in March. They were online for every single test. But 89% of evaluated results failed every evaluation check. The agents were running. They were just wrong. If your definition of reliability is "the server responded," your agents look fine. If your definition is "the answer was correct," almost none of them are reliable.

2. Inside-out monitoring is blind to the most common failure mode.

The most common failure mode in March was not a server crash or a timeout. It was an agent returning a 200 status code with a wrong answer. Traditional monitoring tools that track logs, traces, and error rates would not have flagged any of these. The failure was in the content of the response, not the system metrics.

3. Geography changes what your users experience.

On average, agents that respond in 3.8 seconds from Canada take over 30 seconds from Rwanda. For global companies, testing from a single location gives a false sense of reliability. Bot detection, geo-routing, and CDN behavior all change what the user gets depending on where they are.

4. Only 0.2% of test executions were fully successful.

Out of nearly 4.5 million total test executions, only 9,381 returned a result where the agent was reachable, responded within acceptable latency, and passed all evaluation checks. This is the baseline state of AI agents in production in March 2026. The industry is shipping agents that work in demos but fail in the real world. Response quality is the new uptime.

Methodology

How we collected this data.

All data in this report was collected first-party by AgentStatus. No third-party data sources, APIs, or external datasets were used. Every data point comes from tests executed by AgentStatus's own distributed network of residential compute nodes. Each test consists of one prompt sent to one agent from one node in one geographic region. Tests were executed continuously throughout March 2026, with volume increasing as new agents and nodes were onboarded.

Agents in this dataset include publicly accessible AI agents discovered on the open internet, spanning customer support agents, coding assistants, search and research agents, travel assistants, and general-purpose chatbots. No agents in this dataset were tested with permission from the agent operator. All tests were conducted from the public internet, the same way a real user would interact with these agents.

Evaluation was performed using a combination of rule-based checks (response structure, schema validation, keyword matching) and LLM-as-Judge evaluation (semantic correctness assessment). Status classification follows AgentStatus's three-tier system: UP (all checks pass), DEGRADED (agent responds but fails one or more evaluation checks), DOWN (agent fails to respond or returns an error).
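The three-tier classification described above can be sketched as a small function. This is an illustration under stated assumptions, not AgentStatus's implementation. The `ProbeResult` record and its field names are hypothetical, "returns an error" is assumed to mean an HTTP status of 400 or higher, and a response with no checks run is assumed to count as DEGRADED.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    UP = "UP"
    DEGRADED = "DEGRADED"
    DOWN = "DOWN"


@dataclass
class ProbeResult:
    """Hypothetical record of one test; field names are illustrative."""
    http_status: Optional[int]  # None when no response was received
    checks_passed: int
    checks_total: int


def classify(result: ProbeResult) -> Status:
    # DOWN: the agent failed to respond or returned an error
    # (assumed here to mean an HTTP status of 400 or higher).
    if result.http_status is None or result.http_status >= 400:
        return Status.DOWN
    # UP: the agent responded and every evaluation check passed.
    if result.checks_total > 0 and result.checks_passed == result.checks_total:
        return Status.UP
    # DEGRADED: the agent responded but failed one or more checks.
    # Assumption: a response with no checks run is also treated as DEGRADED.
    return Status.DEGRADED
```

For example, `classify(ProbeResult(http_status=200, checks_passed=3, checks_total=5))` returns `Status.DEGRADED`: the status code is healthy, but two checks failed.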

Published March 2026 by AgentStatus, a product of Carmel Labs, Inc.
