AgentStatus by Carmel Labs · April 2026

Your Synthetic Monitor Can't Reach 64% of AI Agents

We tested 6,228 production AI agents from both datacenter and residential IPs, using the same prompts and the same evaluation pipeline. What datacenter monitoring sees and what residential testing reveals are far apart.

1,553,000+ datacenter validations
4,850,000+ residential validations
6,228 agents compared
3,995 (64%) agents unreachable from datacenter

The Study

We tested 6,228 agents from both datacenter and residential IPs.

We ran 1.55 million tests from datacenter IPs and 4.85 million tests from residential consumer devices against the same pool of production AI agents. The datacenter validations used the same IP ranges as Datadog Synthetics, Pingdom, UptimeRobot, and every other traditional monitoring tool. The residential validations came from real consumer devices on home networks across 26 regions.

Both groups used the same evaluation prompts and the same automated scoring pipeline, which uses a language model to judge whether the agent's response is actually correct. Each test produces one of three verdicts: the agent is fully working, partially working but giving degraded answers, or completely down. The only variable was the network origin of the test.
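As a minimal sketch of how the three verdicts could be tallied per network origin, consider the following. The names `Verdict`, `ProbeResult`, and `verdict_shares` are hypothetical and the sample records are invented for illustration; this is not the AgentStatus pipeline itself.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    WORKING = "fully working"
    DEGRADED = "partially working, degraded answers"
    DOWN = "completely down"


@dataclass(frozen=True)
class ProbeResult:
    agent_id: str
    origin: str  # "datacenter" or "residential"
    verdict: Verdict


def verdict_shares(results, origin):
    """Return {Verdict: fraction of tests} for one network origin."""
    counts = Counter(r.verdict for r in results if r.origin == origin)
    total = sum(counts.values())
    return {v: (counts[v] / total if total else 0.0) for v in Verdict}


# Invented sample data, only to show the shape of the calculation.
sample = [
    ProbeResult("agent-a", "datacenter", Verdict.DOWN),
    ProbeResult("agent-a", "datacenter", Verdict.DOWN),
    ProbeResult("agent-a", "residential", Verdict.WORKING),
    ProbeResult("agent-a", "residential", Verdict.DEGRADED),
]
residential_shares = verdict_shares(sample, "residential")
datacenter_shares = verdict_shares(sample, "datacenter")
```

Keeping the origin on every record is what lets the same scoring output be split into two comparable groups afterwards.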

1.55M datacenter tests from cloud infrastructure IPs
4.85M residential tests from real consumer devices
6,228 production AI agents tested by both groups

Finding 1

Most AI agents refuse to talk to datacenter IPs.

The most important metric is the rejection rate: how often does the agent see the incoming request and refuse to respond? From a datacenter IP, agents rejected 74.2% of all tests. From a residential IP, that number drops to 22.6%.

The majority of monitoring traffic from datacenter IPs gets rejected before it ever reaches the agent.

Datacenter

74.2% of requests rejected
67.3% throttled for sending too many requests
1.2% fully working

Residential

22.6% of requests rejected
15.8% throttled for sending too many requests
3.1% fully working

64% of agents block datacenter but allow residential through

3,995 out of 6,228 agents rejected more than half of datacenter requests while letting residential requests through at normal rates. Zero agents did the reverse. The relationship is entirely one-directional.
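The one-directional pattern can be expressed as a simple per-agent rule. The sketch below is illustrative only: `classify_agent` is a hypothetical name, and reusing the 50% threshold as the ceiling for a "normal" residential rejection rate is an assumption, since the report does not define "normal rates" numerically.

```python
def classify_agent(dc_rejection_pct, res_rejection_pct, threshold_pct=50.0):
    """Label an agent by which network origin it mostly rejects.

    threshold_pct=50.0 mirrors "rejected more than half"; applying the
    same cutoff to the residential side is an assumption.
    """
    blocks_dc = dc_rejection_pct > threshold_pct
    blocks_res = res_rejection_pct > threshold_pct
    if blocks_dc and not blocks_res:
        return "blocks_datacenter_only"
    if blocks_res and not blocks_dc:
        return "blocks_residential_only"
    return "blocks_both" if blocks_dc else "blocks_neither"


# Share of agents in the datacenter-only bucket, from the reported counts.
share_pct = round(3995 / 6228 * 100, 1)
```

Under this rule, the report's result corresponds to 3,995 agents landing in `blocks_datacenter_only` and none in `blocks_residential_only`.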

If your monitoring tool runs from AWS, GCP, or Azure, most of the agents it tries to test are rejecting it outright. The tool ends up measuring the bot-detection wall your agent puts up, not the agent itself.

Rejection rate by group: datacenter 74.2%, residential 22.6%

Throttle rate by group: datacenter 67.3%, residential 15.8%

What this means for Datadog Synthetics, Pingdom, and every traditional monitoring tool: Your synthetic monitor runs from a datacenter, and the agent rejects it. The monitor reports the agent as down or unreachable, but the agent is actually up and working fine for real users. It just refused to respond to a datacenter IP.

Finding 2

Even when datacenter validations get through, agents give them worse answers.

Rejection is the access problem, but even when datacenter tests were not rejected, the quality of responses differed. We compared how often the agent gave a correct answer to datacenter tests versus residential tests, using an automated scoring system that evaluates whether the response actually answers the question.

75.5% correct answer rate from residential tests
68.2% correct answer rate from datacenter tests
+7.3pp: residential tests score higher on answer quality

On the same agents running the same prompts, there is a 7.3 percentage point gap in whether the response was actually correct, determined entirely by where the request came from.
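The 7.3 figure is a difference in percentage points, not a relative percentage. A short sketch of the arithmetic makes the distinction concrete; the function names are illustrative, and the relative figure is derived here from the two reported rates rather than stated in the report.

```python
def percentage_point_gap(rate_a_pct, rate_b_pct):
    """Absolute difference between two rates, in percentage points."""
    return round(rate_a_pct - rate_b_pct, 1)


def relative_change_pct(rate_a_pct, rate_b_pct):
    """How much larger rate_a is than rate_b, relative to rate_b."""
    return round((rate_a_pct - rate_b_pct) / rate_b_pct * 100, 1)


residential_correct = 75.5
datacenter_correct = 68.2

gap_pp = percentage_point_gap(residential_correct, datacenter_correct)
relative_pct = relative_change_pct(residential_correct, datacenter_correct)
```

Here `gap_pp` is the reported 7.3 points, while `relative_pct` works out to roughly 10.7% when measured against the datacenter baseline.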

Some agents detect the request origin and serve a degraded or cached response to traffic that looks like automation. Others have model routing logic that treats datacenter IPs differently. In either case, what the datacenter test sees is not what your user sees.

2,116 agents gave correct answers to residential tests more than 10 percentage points more often than to datacenter tests. That is 42% of matched agents. For these agents, the quality of the response meaningfully differed based solely on where the request came from.
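A per-agent version of the same comparison could look like the sketch below. It assumes a hypothetical mapping from agent ID to a pair of correct-answer rates, with `None` standing for "no scored evaluation from that group"; the sample values are invented.

```python
def agents_with_quality_gap(correct_rates, min_gap_pp=10.0):
    """Return (flagged agent IDs, number of matched agents).

    correct_rates maps agent_id -> (residential_pct, datacenter_pct).
    An agent counts as matched only if both rates are present.
    """
    matched = {
        agent: rates
        for agent, rates in correct_rates.items()
        if rates[0] is not None and rates[1] is not None
    }
    flagged = [
        agent for agent, (res, dc) in matched.items() if res - dc > min_gap_pp
    ]
    return flagged, len(matched)


# Invented example rates, for illustration only.
example_rates = {
    "agent-a": (90.0, 70.0),  # 20-point gap: flagged
    "agent-b": (80.0, 75.0),  # 5-point gap: not flagged
    "agent-c": (60.0, None),  # no datacenter score: not matched
    "agent-d": (85.0, 74.0),  # 11-point gap: flagged
}
flagged, matched_count = agents_with_quality_gap(example_rates)
```

The denominator is the matched set, not the full agent pool, which is why the share of flagged agents depends on how many agents have scores from both groups.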

Finding 3

88% of agents behave differently depending on where you test from.

We looked at agents where the rejection rate was within 5 percentage points between datacenter and residential, meaning agents that treat both sources roughly equally. Only 731 out of 6,228 agents met that threshold, which is 12%.

The remaining 88% produce meaningfully different results depending on whether the request comes from a datacenter or a real consumer device.
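The consistency check described above reduces to an absolute difference against a tolerance. The sketch below assumes the 5-point bound is inclusive, which the report does not specify, and uses the hypothetical name `is_origin_consistent`.

```python
def is_origin_consistent(dc_rejection_pct, res_rejection_pct, tolerance_pp=5.0):
    """True if the two rejection rates are within tolerance_pp of each other.

    Treating the bound as inclusive (<=) is an assumption.
    """
    return abs(dc_rejection_pct - res_rejection_pct) <= tolerance_pp


# Shares derived from the reported counts.
consistent_share_pct = round(731 / 6228 * 100)
differing_share_pct = 100 - consistent_share_pct
```

With the reported counts, 731 of 6,228 rounds to 12%, leaving 88% of agents outside the tolerance.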

12% of agents behave the same regardless of where you test from
88% of agents behave differently depending on where you test from

If your monitoring tool runs from a datacenter, it is telling you about a different product than the one your users are experiencing, and that is true for 88% of the agents we tested.

Why This Happens

AI agents are not websites.

Synthetic monitoring was designed for web pages. Point it at a URL, check if the page loads, measure latency, alert if the server returns an error. That works for static content. It does not work for AI agents.

Agents have bot detection. Your agent sits behind rate limiting, IP reputation checks, and bot protection layers like DataDome, Cloudflare, and HUMAN. A datacenter IP looks like a bot. The agent may block it outright, throttle it aggressively, or return a different response. 74% of datacenter tests were blocked in our dataset.

Agents give different answers every time. A website either loads or it does not. An agent gives an answer, and whether that answer is correct is a quality question: it requires sending a real prompt and evaluating the real response. A successful server response does not mean the answer was right. Our March 2026 report found that 89% of agents with perfect uptime scored 0% on answer quality, meaning the server was responding fine but the answers were wrong.
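To illustrate why a successful server response and a correct answer are separate checks, here is a minimal sketch. Everything in it is hypothetical: `ProbeOutcome`, `evaluate_probe`, and especially `keyword_judge`, which is a trivial stand-in for the language-model judge the report describes and not a reproduction of it.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ProbeOutcome:
    reachable: bool  # what a classic uptime check measures
    correct: bool    # what an answer-quality check measures


def evaluate_probe(status_code: int, body: str,
                   judge: Callable[[str], bool]) -> ProbeOutcome:
    """Score one response on availability and correctness separately."""
    reachable = 200 <= status_code < 300
    # Only judge the answer if there was an answer to judge.
    correct = reachable and judge(body)
    return ProbeOutcome(reachable=reachable, correct=correct)


def keyword_judge(body: str) -> bool:
    """Stand-in judge for the prompt 'What is the capital of France?'."""
    return "paris" in body.lower()


up_but_wrong = evaluate_probe(200, "Sorry, I can't help with that.", keyword_judge)
up_and_right = evaluate_probe(200, "The capital of France is Paris.", keyword_judge)
blocked = evaluate_probe(403, "Forbidden", keyword_judge)
```

A monitor that records only `reachable` would treat `up_but_wrong` and `up_and_right` as identical, which is the gap the paragraph above describes.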

The coverage gap

| Tool | Network Origin | Reaches 64% of Agents? | Evaluates Response Quality? |
|---|---|---|---|
| Datadog Synthetics | AWS / GCP datacenter | No | No |
| Pingdom | Datacenter | No | No |
| UptimeRobot | Datacenter | No | No |
| New Relic Synthetics | Datacenter | No | No |
| AgentStatus | Residential, 26 regions | Yes | Yes |

Key Takeaways

Four things this data proves.

1. Datacenter monitoring cannot reach 64% of AI agents.

3,995 of 6,228 agents blocked datacenter validations while allowing residential through. If your monitoring runs from AWS or GCP, the majority of agents reject it before returning a response.

2. Agents give different answers depending on where the request comes from.

Correct answer rate is 7.3 percentage points lower for datacenter tests. In 2,116 agents the gap exceeds 10 points. What your monitoring tool sees is not what your user sees.

3. Rate limiting is 4x worse for datacenter traffic.

Datacenter tests get throttled 67.3% of the time. Residential tests get throttled 15.8% of the time. Agents actively slow down datacenter traffic at more than four times the rate of residential.

4. 88% of agents behave differently depending on where you test from.

Only 12% of agents produced consistent results regardless of network origin. For the other 88%, the tool you use to monitor fundamentally changes what you measure.
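The "4x" figure in the third takeaway follows directly from the two throttle rates. The sketch below works through that ratio, and the equivalent one for rejection rates; the helper name `rate_ratio` is illustrative, and the rejection ratio is derived here rather than stated in the report.

```python
def rate_ratio(datacenter_pct, residential_pct):
    """How many times higher the datacenter rate is than the residential rate."""
    return round(datacenter_pct / residential_pct, 2)


throttle_ratio = rate_ratio(67.3, 15.8)
rejection_ratio = rate_ratio(74.2, 22.6)
```

The throttle ratio comes out at about 4.26, consistent with "more than four times", while the overall rejection ratio is closer to 3.3.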

Methodology

How we collected this data.

This report is based on first-party data from the AgentStatus distributed testing network. No third-party data sources were used.

The datacenter group consists of 1.55 million tests executed from cloud infrastructure IPs, including ranges used by major cloud providers. The residential group consists of 4.85 million tests executed from verified residential consumer devices on home networks across 26 regions.

Both groups tested the same pool of 6,228 production AI agents using identical prompts and the same automated scoring system. Agents were required to have data from both groups to be included in comparative analysis. The rejection rate comparison required at least 20 tests per group per agent. The answer quality comparison required at least one scored evaluation from each group.
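The inclusion rules above can be written as two small filters. This is a sketch under assumed data shapes: `AgentCounts` and its field names are hypothetical, and only the two thresholds (20 tests per group, one scored evaluation per group) come from the methodology.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentCounts:
    dc_tests: int     # tests run from datacenter IPs
    res_tests: int    # tests run from residential devices
    dc_scored: int    # datacenter tests with a scored evaluation
    res_scored: int   # residential tests with a scored evaluation


def eligible_for_rejection_comparison(c: AgentCounts, min_tests: int = 20) -> bool:
    """At least min_tests tests from each group."""
    return c.dc_tests >= min_tests and c.res_tests >= min_tests


def eligible_for_quality_comparison(c: AgentCounts) -> bool:
    """At least one scored evaluation from each group."""
    return c.dc_scored >= 1 and c.res_scored >= 1


# Invented counts, for illustration only.
well_sampled = AgentCounts(dc_tests=40, res_tests=120, dc_scored=3, res_scored=90)
thin_datacenter = AgentCounts(dc_tests=12, res_tests=300, dc_scored=0, res_scored=250)
```

Because the two comparisons use different thresholds, an agent can qualify for one analysis and not the other, so the denominators behind the rejection and quality findings need not match.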

Published April 2026 by AgentStatus, a product of Carmel Labs, Inc.
agentstatus.dev
