AgentStatus by Carmel Labs · April 2026
Your Synthetic Monitor Can't Reach 64% of AI Agents
We tested 6,228 production AI agents from both datacenter and residential IPs using the same prompts and the same evaluation pipeline. The gap between what datacenter monitoring sees and what residential testing reveals is large.
The Study
We tested 6,228 agents from both datacenter and residential IPs.
We ran 1.55 million tests from datacenter IPs and 4.85 million tests from residential consumer devices against the same pool of production AI agents. The datacenter tests ran from cloud infrastructure IP ranges, the same kind of network origin that Datadog Synthetics, Pingdom, UptimeRobot, and other traditional monitoring tools use. The residential tests came from real consumer devices on home networks across 26 regions.
Both groups used the same evaluation prompts and the same automated scoring pipeline, which uses a language model to judge whether the agent's response is actually correct. Each test produces one of three verdicts: the agent is fully working, partially working but giving degraded answers, or completely down. The only variable was the network origin of the test.
Finding 1
Most AI agents refuse to talk to datacenter IPs.
The most important metric is the rejection rate: how often does the agent see the incoming request and refuse to respond? From a datacenter IP, agents rejected 74.2% of all tests. From a residential IP, that number drops to 22.6%.
The majority of monitoring traffic from datacenter IPs gets rejected before it ever reaches the agent.
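The rejection rate is a plain proportion, and the difference between the two groups can be read either in percentage points or as a ratio. A minimal sketch of that arithmetic follows. The function name and the per-1,000 counts are illustrative assumptions chosen to reproduce the reported rates; they are not the study's raw counts.

```python
def rejection_rate(rejected: int, total: int) -> float:
    """Share of tests the agent refused to answer, as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * rejected / total


# Hypothetical counts per 1,000 tests, picked only to match the reported rates.
datacenter = rejection_rate(rejected=742, total=1000)   # 74.2
residential = rejection_rate(rejected=226, total=1000)  # 22.6

gap_points = round(datacenter - residential, 1)  # difference in percentage points
ratio = round(datacenter / residential, 2)       # how many times higher

print(gap_points, ratio)
```

With the reported figures, the gap is 51.6 percentage points, and datacenter tests are rejected roughly 3.3 times as often as residential ones.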
[Chart: rejection rate, datacenter vs. residential]
64% of agents block datacenter but allow residential through
3,995 out of 6,228 agents rejected more than half of datacenter requests while letting residential requests through at normal rates. Zero agents did the reverse. The relationship is entirely one-directional.
If your monitoring tool runs from AWS, GCP, or Azure, most of the agents it tries to test are rejecting it outright. The tool ends up measuring the bot-detection wall your agent puts up, not the agent itself.
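The per-agent rule behind the 64% figure can be sketched as a simple classifier. The "more than half" datacenter threshold comes from the text. The report does not define "normal rates" for residential traffic, so the `residential_cutoff` parameter below is an assumption, and the class, function names, and sample agents are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AgentRates:
    agent_id: str
    datacenter_rejection: float   # fraction of datacenter tests rejected (0 to 1)
    residential_rejection: float  # fraction of residential tests rejected (0 to 1)


def blocks_datacenter_only(agent: AgentRates, residential_cutoff: float = 0.5) -> bool:
    """Rejects more than half of datacenter tests but lets residential through.

    residential_cutoff is an assumed stand-in for "normal rates".
    """
    return (agent.datacenter_rejection > 0.5
            and agent.residential_rejection <= residential_cutoff)


def blocks_residential_only(agent: AgentRates, datacenter_cutoff: float = 0.5) -> bool:
    """The reverse pattern, which the report says no agent showed."""
    return (agent.residential_rejection > 0.5
            and agent.datacenter_rejection <= datacenter_cutoff)


# Hypothetical sample agents, for illustration only.
agents = [
    AgentRates("agent-a", 0.92, 0.08),
    AgentRates("agent-b", 0.10, 0.12),
    AgentRates("agent-c", 0.71, 0.20),
]

one_way = [a.agent_id for a in agents if blocks_datacenter_only(a)]
reverse = [a.agent_id for a in agents if blocks_residential_only(a)]

# Share reported in the study: 3,995 of 6,228 agents.
reported_share = round(100 * 3995 / 6228, 1)
print(one_way, reverse, reported_share)
```

In this toy sample, `agent-a` and `agent-c` match the one-directional pattern and none match the reverse. The reported counts work out to 64.1%, which the headline rounds to 64%.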
[Charts: rejection rate by group; throttle rate by group]
Finding 2
Even when datacenter tests get through, agents give them worse answers.
Rejection is the access problem, but even when datacenter tests were not rejected, the quality of responses differed. We compared how often the agent gave a correct answer to datacenter tests versus residential tests, using an automated scoring system that evaluates whether the response actually answers the question.
On the same agents running the same prompts, there is a 7.3 percentage point gap in whether the response was actually correct, determined entirely by where the request came from.
Some agents detect the request origin and serve a degraded or cached response to traffic that looks like automation. Others have model routing logic that treats datacenter IPs differently. In either case, what the datacenter test sees is not what your user sees.
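One way to picture the quality comparison is as a per-agent gap in correct-answer rate between the two origins. The sketch below is an assumption about how such a gap could be computed: the report does not say whether the 7.3-point figure is pooled across all tests or averaged per agent, and the function names and sample verdicts are hypothetical. The 10-point cutoff mirrors the threshold the report uses when counting agents with a large gap.

```python
def correct_rate(verdicts: list[bool]) -> float:
    """Percentage of scored responses judged correct."""
    if not verdicts:
        raise ValueError("need at least one scored evaluation")
    return 100.0 * sum(verdicts) / len(verdicts)


def quality_gap(datacenter: list[bool], residential: list[bool]) -> float:
    """Residential minus datacenter correct rate, in percentage points."""
    return correct_rate(residential) - correct_rate(datacenter)


# Hypothetical scored verdicts (True = judged correct), for illustration only.
scored = {
    "agent-a": {
        "datacenter": [True, False, False, True],
        "residential": [True, True, True, False],
    },
    "agent-b": {
        "datacenter": [True, True],
        "residential": [True, True],
    },
}

gaps = {
    agent: quality_gap(groups["datacenter"], groups["residential"])
    for agent, groups in scored.items()
}

# Agents whose gap exceeds 10 percentage points.
large_gap_agents = [agent for agent, gap in gaps.items() if gap > 10.0]
print(gaps, large_gap_agents)
```

A positive gap here means residential tests received correct answers more often than datacenter tests for the same agent.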
Finding 3
88% of agents behave differently depending on where you test from.
We looked at agents where the rejection rate was within 5 percentage points between datacenter and residential, meaning agents that treat both sources roughly equally. Only 731 out of 6,228 agents met that threshold, which is 12%.
The remaining 88% produce meaningfully different results depending on whether the request comes from a datacenter or a real consumer device.
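The consistency threshold is a tolerance check on the two rejection rates. A minimal sketch, assuming the 5-point band is inclusive (the text does not say) and using a hypothetical function name:

```python
def is_origin_consistent(datacenter_pct: float, residential_pct: float,
                         tolerance_points: float = 5.0) -> bool:
    """True if the two rejection rates differ by at most tolerance_points.

    Treating the boundary as inclusive is an assumption.
    """
    return abs(datacenter_pct - residential_pct) <= tolerance_points


# Two illustrative agents: one treats both origins alike, one does not.
similar = is_origin_consistent(30.0, 33.0)
different = is_origin_consistent(74.2, 22.6)

# Shares implied by the reported counts: 731 of 6,228 agents were consistent.
consistent_share = round(100 * 731 / 6228, 1)
inconsistent_share = round(100 * (6228 - 731) / 6228, 1)
print(similar, different, consistent_share, inconsistent_share)
```

The reported counts give 11.7% consistent and 88.3% inconsistent, which round to the 12% and 88% quoted above.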
Why This Happens
AI agents are not websites.
Synthetic monitoring was designed for web pages. Point it at a URL, check if the page loads, measure latency, alert if the server returns an error. That works for static content. It does not work for AI agents.
Agents have bot detection. Your agent sits behind rate limiting, IP reputation checks, and bot protection layers like DataDome, Cloudflare, and HUMAN. A datacenter IP looks like a bot. The agent may block it outright, throttle it aggressively, or return a different response. 74% of datacenter tests were blocked in our dataset.
Agents give different answers every time. A website either loads or it does not, but an agent gives an answer, and whether that answer is correct is a quality question that requires sending a real prompt and evaluating the real response. A successful server response does not mean the answer was right. Our March 2026 report found that 89% of agents with perfect uptime scored 0% on answer quality, meaning the server was responding fine but the answers were wrong.
The coverage gap
| Tool | Network Origin | Reaches the 64% of Agents That Block Datacenter IPs? | Evaluates Response Quality? |
|---|---|---|---|
| Datadog Synthetics | AWS / GCP datacenter | No | No |
| Pingdom | Datacenter | No | No |
| UptimeRobot | Datacenter | No | No |
| New Relic Synthetics | Datacenter | No | No |
| AgentStatus | Residential, 26 regions | Yes | Yes |
Key Takeaways
Four things this data proves.
Datacenter monitoring cannot reach 64% of AI agents.
3,995 of 6,228 agents blocked datacenter tests while allowing residential tests through. If your monitoring runs from AWS or GCP, the majority of agents reject it before returning a response.
Agents give different answers depending on where the request comes from.
Correct answer rate is 7.3 percentage points lower for datacenter tests. In 2,116 agents the gap exceeds 10 points. What your monitoring tool sees is not what your user sees.
Rate limiting is 4x worse for datacenter traffic.
Datacenter tests get throttled 67.3% of the time. Residential tests get throttled 15.8% of the time. Agents actively slow down datacenter traffic at more than four times the rate of residential.
88% of agents behave differently depending on where you test from.
Only 12% of agents produced consistent results regardless of network origin. For the other 88%, the tool you use to monitor fundamentally changes what you measure.
Methodology
How we collected this data.
This report is based on first-party data from the AgentStatus distributed testing network. No third-party data sources were used.
The datacenter group consists of 1.55 million tests executed from cloud infrastructure IPs, including ranges used by major cloud providers. The residential group consists of 4.85 million tests executed from verified residential consumer devices on home networks across 26 regions.
Both groups tested the same pool of 6,228 production AI agents using identical prompts and the same automated scoring system. Agents were required to have data from both groups to be included in comparative analysis. The rejection rate comparison required at least 20 tests per group per agent. The answer quality comparison required at least one scored evaluation from each group.
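The inclusion criteria amount to two per-agent filters. The sketch below expresses them directly. The minimums (20 tests per group, one scored evaluation per group) come from the methodology; the function names, dictionary layout, and sample counts are hypothetical.

```python
MIN_TESTS_PER_GROUP = 20   # rejection-rate comparison, per the methodology
MIN_SCORED_PER_GROUP = 1   # answer-quality comparison, per the methodology


def eligible_for_rejection_comparison(datacenter_tests: int, residential_tests: int) -> bool:
    """Agent needs at least 20 tests from each network origin."""
    return (datacenter_tests >= MIN_TESTS_PER_GROUP
            and residential_tests >= MIN_TESTS_PER_GROUP)


def eligible_for_quality_comparison(datacenter_scored: int, residential_scored: int) -> bool:
    """Agent needs at least one scored evaluation from each network origin."""
    return (datacenter_scored >= MIN_SCORED_PER_GROUP
            and residential_scored >= MIN_SCORED_PER_GROUP)


# Hypothetical per-agent counts: (datacenter, residential).
test_counts = {"agent-a": (250, 800), "agent-b": (12, 400), "agent-c": (20, 20)}
scored_counts = {"agent-a": (40, 310), "agent-b": (0, 55), "agent-c": (1, 3)}

rejection_pool = [a for a, (dc, res) in test_counts.items()
                  if eligible_for_rejection_comparison(dc, res)]
quality_pool = [a for a, (dc, res) in scored_counts.items()
                if eligible_for_quality_comparison(dc, res)]
print(rejection_pool, quality_pool)
```

In this toy data, `agent-b` drops out of both comparisons because it has too few datacenter tests and no scored datacenter evaluations.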
More research
Continue reading
June 2026
The Two Failures Hiding in LLM-as-a-Judge
Calibration problems shrink with better technique. Competence problems do not. The structural ceiling in agent evaluation, and the two methods older than language models that get past it.
March 2026
The State of AI Agent Reliability
We monitored 3,260 production AI agents across 48 countries. 89% with perfect uptime scored 0% on quality. The full data is inside.
April 2026
The State of AI Agent Drift
88% of agents started giving worse answers at least once in 30 days. A look at how production AI agents drift — and the systemic March 29 event.
Research
9 Businesses You Can Build on Agent Behavioral Data
Insurance underwriting, credit scores, compliance certification, procurement intelligence — the commercial layer that sits on top of continuous agent monitoring.