AgentStatus by Carmel Labs
March 2026
Agent Reliability Report
A comprehensive analysis of AI agent reliability across 6,259 production agents, 4.5 million tests, and 10 geographic regions. Over half of agents maintained 100% uptime. 89% of them gave wrong answers. All data collected first-party by AgentStatus.
Executive Summary
56.6% of agents were online. 89% gave wrong answers.
In March 2026, AgentStatus executed 4,492,066 tests against 6,259 registered AI agents across 10 geographic regions. All data in this report was collected first-party by AgentStatus's distributed testing network. No third-party data sources were used. The headline finding is not about uptime. Over half of all agents maintained 100% uptime throughout the month. The problem is what they said when they responded.
89.2% of all test results showed a 0% evaluation pass rate, meaning the agent responded to the prompt but the answer failed every quality check. Out of 4.5 million total test executions, only 9,381 were fully successful, a rate of 0.2%. When we narrow to the 1.1 million tests that received full reliability verdicts, only 0.8% came back healthy. The rest were either degraded (62.8%) or completely down (36.5%).
This is the gap between uptime monitoring and response quality monitoring. Traditional tools would have shown 56.6% of these agents as perfectly healthy. They were online. They responded. They just responded wrong. And nobody knew.
These are not synthetic benchmarks. These are real tests sent from real consumer devices on residential networks across the United States, Canada, Ukraine, the United Kingdom, Japan, Romania, Spain, Turkey, South Africa, and Rwanda.
Test Volume
4.5 million tests in one month.
AgentStatus executed 4,492,066 individual test runs across March 2026. Of these, 1,109,869 were unique evaluated tests with full reliability verdicts. The remaining executions include retries, connectivity checks, and multi-step evaluation runs. Test volume grew dramatically through the month, peaking in the third week before stabilizing.
[Chart: Weekly test volume]
Reliability
Only 0.8% of evaluated tests returned a healthy result.
AgentStatus classifies each of the 1,109,869 tests that received a full reliability verdict into one of three statuses: UP (agent responded correctly), DEGRADED (agent responded but with issues), or DOWN (agent failed to respond or returned an error).
Reliability distribution: UP 0.8%, DEGRADED 62.8%, DOWN 36.5%.
The DEGRADED category is the most concerning from a product perspective. These agents responded to prompts but the response had issues, whether that was a hallucinated answer, incorrect information, a failed evaluation check, or a response that did not meet the defined quality threshold. In traditional monitoring, these would all appear as healthy since the HTTP status code was 200.
Quality
89% of agents scored 0% on evaluation checks.
AgentStatus evaluates response quality using evaluation prompts, which are predefined reference prompts with known correct answers. In March, the results were stark: 89.2% of all test results showed a 0% evaluation pass rate. Another 10.6% passed between 1% and 24% of checks, and 0.3% were not evaluated. Not a single agent achieved a pass rate of 25% or above.
Evaluation pass rate distribution
| Pass Rate | Count | Percentage |
|---|---|---|
| 0% | 989,088 | 89.2% |
| 1-24% | 117,883 | 10.6% |
| NULL (not evaluated) | 2,898 | 0.3% |
| 25% or above | 0 | 0.0% |
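As a rough illustration of how a per-test pass rate could be mapped onto the buckets in this table, here is a minimal Python sketch. The function names are hypothetical, and the handling of rates between 0% and 1% (placed in the 1-24% bucket) is our assumption; the report does not describe AgentStatus's actual bucketing code.

```python
from typing import Optional, Sequence


def pass_rate(checks: Sequence[bool]) -> Optional[float]:
    """Fraction of evaluation checks passed, or None when no checks were run."""
    if not checks:
        return None
    return sum(checks) / len(checks)


def bucket(rate: Optional[float]) -> str:
    """Map a pass rate (0.0 to 1.0) onto the bucket labels used in the table."""
    if rate is None:
        return "NULL (not evaluated)"
    if rate == 0:
        return "0%"
    # Assumption: any non-zero rate below 25% belongs to the "1-24%" bucket.
    if rate < 0.25:
        return "1-24%"
    return "25% or above"
```

Under these assumptions, a test that passes one of five checks lands in the 1-24% bucket, and a test with no evaluation checks is reported as NULL rather than 0%.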
The uptime vs quality disconnect
This is the most important finding in this report. When we separate uptime (is the agent reachable?) from quality (is the answer correct?), a clear picture emerges:
Uptime distribution
| Uptime | Count | Percentage |
|---|---|---|
| 100% uptime | 627,833 | 56.6% |
| 80-99% uptime | 880 | 0.1% |
| 50-79% uptime | 61,242 | 5.5% |
| 1-49% uptime | 37,627 | 3.4% |
| 0% uptime | 382,287 | 34.4% |
HTTP connectivity pass rate
| HTTP Pass Rate | Count | Percentage |
|---|---|---|
| 1-24% pass rate | 727,582 | 65.6% |
| 0% pass rate | 382,287 | 34.4% |
Total test outcomes
Out of nearly 4.5 million total test executions across the month, only 9,381 returned a fully successful result where the agent was reachable, responded within acceptable latency, and passed all evaluation checks. That is 0.2% of all executions.
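The 0.2% figure follows directly from the two counts quoted above. A quick check in Python, using variable names of our own choosing:

```python
# Counts quoted in this report.
total_executions = 4_492_066
fully_successful = 9_381

# Share of all executions that were fully successful.
success_rate = fully_successful / total_executions
formatted = f"{success_rate:.1%}"
print(formatted)
```

Rounded to one decimal place, this matches the 0.2% reported here.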
Latency
Average response time: 7.5 seconds.
Latency varied dramatically across agents and regions. The average P50 latency was 7,535ms, but the median was 3,368ms, suggesting a long tail of slow agents pulling the average up. Some agents took over 22 minutes to respond.
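To see how a few slow agents can pull the average far above the median, consider the small sketch below. The latency values are hypothetical and chosen only for illustration; they are not drawn from the AgentStatus dataset.

```python
from statistics import mean, median

# Hypothetical per-agent P50 latencies in milliseconds (illustrative only).
# Five agents respond in a few seconds; one is extremely slow.
latencies_ms = [2_800, 3_100, 3_400, 3_600, 4_000, 28_000]

avg_ms = mean(latencies_ms)    # sensitive to the single slow agent
mid_ms = median(latencies_ms)  # barely affected by it

print(f"mean={avg_ms:.0f}ms median={mid_ms:.0f}ms")
```

In this toy sample a single outlier more than doubles the mean relative to the median, which is the same shape of gap the report describes.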
Time to first byte
For streaming agents, time to first byte (TTFB) averaged 1,032ms at P50 and 1,629ms at P95. This is the time between sending the prompt and receiving the first token of the response. A TTFB above 2 seconds typically correlates with a noticeably poor user experience.
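One way to measure TTFB for a streamed response is sketched below. This is an illustrative assumption, not AgentStatus's implementation: `measure_ttfb` and its `send_prompt` parameter are hypothetical names, and `send_prompt` stands in for whatever client call sends the prompt and returns an iterator of response chunks.

```python
import time
from typing import Callable, Iterator, Optional, Tuple


def measure_ttfb(
    send_prompt: Callable[[], Iterator[str]],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Optional[float], float]:
    """Return (ttfb_seconds, total_seconds) for one streamed response.

    ttfb_seconds is None if the stream ends without any non-empty chunk.
    """
    start = clock()  # taken just before the prompt is sent
    ttfb: Optional[float] = None
    for chunk in send_prompt():
        if ttfb is None and chunk:
            # First non-empty chunk: record time to first byte.
            ttfb = clock() - start
    return ttfb, clock() - start
```

Passing the clock as a parameter keeps the function easy to test with a fake clock. Collecting `ttfb` across many tests would then allow P50 and P95 values to be computed.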
Errors
Two types of failure: access issues and agent issues.
When tests failed, the failure modes fell into two distinct categories. The first is access issues, where our test could not reach the agent at all. These include HTTP 4xx errors (blocked requests, rate limits, auth failures), DNS failures, connection timeouts, and TLS failures. These are not evidence that the agent is broken. In many cases, this is the agent's bot detection working as intended.
The second category is agent issues, where we reached the agent and it responded, but the response was wrong. These include evaluation misses (agent responded but failed quality checks), schema violations, read timeouts (agent started responding but never finished), and parse failures. The table for this category also counts HTTP 5xx server errors and unknown errors. These are the failures that matter most because they represent an agent that looks healthy but is not working correctly.
Access issues (agent may be fine, request was blocked)
| Error Type | Count | % of Errors | What This Means |
|---|---|---|---|
| HTTP 4xx | 923,665 | 56.8% | Request blocked (rate limit, auth, forbidden, bot detection) |
| DNS Failure | 253,751 | 15.6% | Agent endpoint could not be resolved (down or misconfigured) |
| Connect Timeout | 28,675 | 1.8% | Could not establish connection to agent |
| TLS Failure | 9,712 | 0.6% | SSL/TLS certificate or handshake failure |
Agent issues (agent responded but the response was wrong)
| Error Type | Count | % of Errors | What This Means |
|---|---|---|---|
| Unknown Error | 168,960 | 10.4% | Unexpected failure (network issues, malformed responses) |
| Evaluation Miss | 112,396 | 6.9% | Agent responded but answer failed quality evaluation |
| HTTP 5xx | 103,046 | 6.3% | Server-side error (the type traditional monitoring does catch) |
| Read Timeout | 95,032 | 5.8% | Agent started responding but never completed |
| Schema Invalid | 49,346 | 3.0% | Response format did not match expected schema |
| Parse Failure | 8,894 | 0.5% | Response could not be parsed (invalid JSON, broken encoding) |
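The two-category split can be expressed as a simple lookup. The sketch below uses the error types and counts from the two tables above; the function and constant names are our own, and the grouping simply mirrors the tables rather than any published AgentStatus code.

```python
from typing import Dict

# Grouping taken from the two tables above.
ACCESS_ISSUES = {"HTTP 4xx", "DNS Failure", "Connect Timeout", "TLS Failure"}
AGENT_ISSUES = {
    "Unknown Error", "Evaluation Miss", "HTTP 5xx",
    "Read Timeout", "Schema Invalid", "Parse Failure",
}

# Row counts as reported in the tables.
ERROR_COUNTS: Dict[str, int] = {
    "HTTP 4xx": 923_665, "DNS Failure": 253_751,
    "Connect Timeout": 28_675, "TLS Failure": 9_712,
    "Unknown Error": 168_960, "Evaluation Miss": 112_396,
    "HTTP 5xx": 103_046, "Read Timeout": 95_032,
    "Schema Invalid": 49_346, "Parse Failure": 8_894,
}


def categorize(error_type: str) -> str:
    """Return 'access' or 'agent' for a known error type."""
    if error_type in ACCESS_ISSUES:
        return "access"
    if error_type in AGENT_ISSUES:
        return "agent"
    raise ValueError(f"unrecognized error type: {error_type}")


def totals_by_category(counts: Dict[str, int]) -> Dict[str, int]:
    """Sum the row counts within each category."""
    totals = {"access": 0, "agent": 0}
    for error_type, count in counts.items():
        totals[categorize(error_type)] += count
    return totals


totals = totals_by_category(ERROR_COUNTS)
```

Raising on unrecognized error types, rather than silently defaulting to one category, keeps a new failure mode from being miscounted as either an access issue or an agent issue.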
Geographic Distribution
Agent behavior varies significantly by region.
Tests were executed from 10 geographic regions. The United States accounted for the largest share of test volume (69.7%), followed by Canada, Ukraine, and the United Kingdom. Latency varied dramatically by region: tests run from Rwanda averaged 30,514ms (over 30 seconds), compared to 3,830ms for tests run from Canada.
| Region | Tests | Share | Avg Latency | Relative to Canada |
|---|---|---|---|---|
| United States | 773,071 | 69.7% | 8,521ms | 2.2x |
| Canada | 121,033 | 10.9% | 3,830ms | 1.0x (baseline) |
| Ukraine | 57,008 | 5.1% | 3,904ms | 1.0x |
| United Kingdom | 53,322 | 4.8% | 4,390ms | 1.1x |
| Japan | 39,845 | 3.6% | 7,480ms | 2.0x |
| Romania | 26,328 | 2.4% | 5,807ms | 1.5x |
| Spain | 10,063 | 0.9% | 4,431ms | 1.2x |
| Turkey | 6,635 | 0.6% | 4,282ms | 1.1x |
| South Africa | 6,155 | 0.6% | 5,549ms | 1.4x |
| Rwanda | 4,996 | 0.5% | 30,514ms | 8.0x |
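The relative column is each region's average latency divided by Canada's. The following sketch reproduces it from the table's own numbers; the dictionary and function names are ours.

```python
from typing import Dict

# Average latency per region in milliseconds, from the table above.
AVG_LATENCY_MS: Dict[str, int] = {
    "United States": 8_521, "Canada": 3_830, "Ukraine": 3_904,
    "United Kingdom": 4_390, "Japan": 7_480, "Romania": 5_807,
    "Spain": 4_431, "Turkey": 4_282, "South Africa": 5_549,
    "Rwanda": 30_514,
}


def relative_to_baseline(
    latencies: Dict[str, int], baseline: str = "Canada"
) -> Dict[str, float]:
    """Express each region's latency as a multiple of the baseline region's."""
    base = latencies[baseline]
    return {region: round(ms / base, 1) for region, ms in latencies.items()}


relative = relative_to_baseline(AVG_LATENCY_MS)
for region, factor in relative.items():
    print(f"{region}: {factor}x")
```

Canada serves as the baseline because it has the lowest average latency in the table, so every other region reads as a multiple of the best observed case.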
Network
180 active nodes across 10 regions.
AgentStatus operates a distributed network of residential compute nodes that execute tests from real consumer devices. In March 2026, 180 nodes were actively running tests out of 228 total enrolled nodes, managed by 303 node providers.
Key Takeaways
What this data means for teams deploying AI agents.
1. Uptime is not reliability.
56.6% of agents maintained 100% uptime in March. They were online for every single test. But 89% of those responses failed evaluation checks. The agents were running. They were just wrong. If your definition of reliability is "the server responded," your agents look fine. If your definition is "the answer was correct," almost none of them are reliable.
2. Inside-out monitoring is blind to the most common failure mode.
The most common failure mode in March was not a server crash or a timeout. It was an agent returning a 200 status code with a wrong answer. Traditional monitoring tools that track logs, traces, and error rates would not have flagged any of these. The failure was in the content of the response, not the system metrics.
3. Geography changes what your users experience.
Tests sent from Canada averaged 3.8 seconds. Tests sent from Rwanda averaged over 30 seconds. For global companies, testing from a single location gives a false sense of reliability. Bot detection, geo-routing, and CDN behavior all change what the user gets depending on where they are.
4. Only 0.2% of test executions were fully successful.
Out of nearly 4.5 million total test executions, only 9,381 returned a result where the agent was reachable, responded within acceptable latency, and passed all evaluation checks. This is the baseline state of AI agents in production in March 2026. The industry is shipping agents that work in demos but fail in the real world. Response quality is the new uptime.
Methodology
How we collected this data.
All data in this report was collected first-party by AgentStatus. No third-party data sources, APIs, or external datasets were used. Every data point comes from tests executed by AgentStatus's own distributed network of residential compute nodes. Each test consists of one prompt sent to one agent from one node in one geographic region. Tests were executed continuously throughout March 2026, with volume increasing as new agents and nodes were onboarded.
Agents in this dataset include publicly accessible AI agents discovered on the open internet, spanning customer support agents, coding assistants, search and research agents, travel assistants, and general-purpose chatbots. None of the agents in this dataset were tested with the operator's permission. All tests were conducted from the public internet, the same way a real user would interact with these agents.
Evaluation was performed using a combination of rule-based checks (response structure, schema validation, keyword matching) and LLM-as-Judge evaluation (semantic correctness assessment). Status classification follows AgentStatus's three-tier system: UP (all checks pass), DEGRADED (agent responds but fails one or more evaluation checks), DOWN (agent fails to respond or returns an error).
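The three-tier classification can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not AgentStatus's implementation: the `ProbeResult` type and its fields are hypothetical, HTTP status codes of 400 and above are assumed to count as "returns an error", and a response with no evaluation checks is assumed to be DEGRADED rather than UP.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ProbeResult:
    """Hypothetical record of one test: one prompt, one agent, one node."""
    responded: bool
    http_status: Optional[int] = None
    # One boolean per rule-based or LLM-as-Judge check.
    checks: List[bool] = field(default_factory=list)


def classify(result: ProbeResult) -> str:
    """Return 'UP', 'DEGRADED', or 'DOWN' for a single test result."""
    # DOWN: no response, or the response was an error.
    # Assumption: any HTTP status >= 400 counts as an error.
    if not result.responded or result.http_status is None:
        return "DOWN"
    if result.http_status >= 400:
        return "DOWN"
    # UP: the agent responded and every check passed.
    # Assumption: a response with no checks is not treated as UP.
    if result.checks and all(result.checks):
        return "UP"
    # DEGRADED: the agent responded but failed one or more checks.
    return "DEGRADED"
```

Note that a 200 response with a single failed check is DEGRADED here, which is exactly the case that status-code monitoring alone would report as healthy.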