AgentStatus by Carmel Labs · April 2026
9 Businesses You Can Build on AI Agent Behavioral Data
We have run 4.5 million tests on 8k+ production AI agents from real consumer devices in 26 regions. That dataset is an intelligence asset. Here are 9 companies we think someone should build on top of it — and a data residency API to make it possible.
The Thesis
Training data had a market. Production behavioral data will too.
Scale AI built a multi-billion dollar company selling labeled data to train models. That market exists because training data is hard to collect, expensive to curate, and critical to model quality.
We believe the same thing is about to happen for production behavioral data — the continuous record of how AI agents actually behave once they are deployed. How they respond to real users in real geographies. Whether they drift after model updates. When they hallucinate. Where they break.
AgentStatus collects this data every day. We test AI agents from the outside, from real consumer devices on residential networks, and we evaluate whether the responses are actually correct. Over time, this creates a longitudinal behavioral profile for every agent we monitor: uptime patterns, quality scores, geographic reliability, drift trajectories, latency distributions, and failure modes.
That data is useful for monitoring. But it is also useful for insurance, compliance, procurement, benchmarking, regulation, competitive intelligence, and things we have not thought of yet.
We cannot build all of those businesses. But we have already built, and are still building, the data layer that powers them.
The 9 Ideas
Businesses that should exist.
Each of these is a real product that can be built on continuous AI agent behavioral data. We have described the opportunity, the data that powers it, and who should build it. All 9 are open.
AI Agent Insurance Underwriting
The actuarial table for AI agents
If you are building insurance for AI agents, you need to price risk. A one-time certification audit tells you the agent passed on Tuesday. It does not tell you whether it drifted by Thursday, hallucinated for users in Japan on Friday, or went down in Germany on Saturday.
Continuous behavioral data turns a snapshot into an actuarial table. Uptime history, quality score trends, geographic failure patterns, drift velocity after model updates — this is the data an underwriting model needs to price a premium that reflects actual risk, not projected risk.
An agent with a 95% evaluation pass rate and stable behavior across 6 months is a different risk profile than one with a 72% pass rate and three major drift events. The behavioral data is the premium calculation.
Agent Credit Scores
The credit bureau for AI agents
Every AI agent should have a reliability score. Not a one-time benchmark. A living, continuously updated score based on how the agent actually performs in production, calculated from real behavioral data over time.
Think of it as a credit score for AI agents. Enterprises buying or integrating third-party agents could check the score before committing. Insurance companies could use it for underwriting, procurement teams for vendor evaluation, and regulators as a compliance reference. The methodology is public. The data is proprietary. The score is the product.
A score of 850 means the agent has maintained a 90%+ evaluation pass rate with less than 2% drift over 90 days across all tested regions, with P95 latency under 5 seconds. A score of 520 means it drifts frequently, fails evaluation in multiple regions, and has had two or more major incidents in the last quarter.
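To make the idea concrete, here is a minimal sketch of how such a score could be computed. It is an illustration under stated assumptions, not a published methodology: the 300 to 850 range, the weights, and every field name are hypothetical. Only the top-score thresholds (90%+ pass rate, under 2% drift, P95 latency under 5 seconds, no failing regions) come from the description above.

```python
from dataclasses import dataclass

SCORE_MIN, SCORE_MAX = 300, 850  # assumed range; only 850 and 520 appear in the text


@dataclass(frozen=True)
class AgentWindow:
    """Hypothetical 90-day summary for one agent."""
    pass_rate: float        # evaluation pass rate, 0.0 to 1.0
    drift: float            # quality drift over the window, 0.0 to 1.0
    p95_latency_s: float    # P95 latency in seconds
    failing_regions: int    # regions where evaluation fails
    tested_regions: int     # regions tested
    major_incidents: int    # major incidents in the window


def _clamp(value: float) -> float:
    return max(0.0, min(1.0, value))


def reliability_score(w: AgentWindow) -> int:
    if w.tested_regions <= 0:
        raise ValueError("tested_regions must be positive")
    # Each component is 1.0 when the top-score threshold is met.
    quality = _clamp(w.pass_rate / 0.90)
    stability = _clamp(1.0 - max(0.0, w.drift - 0.02) / 0.20)
    latency = _clamp(1.0 - max(0.0, w.p95_latency_s - 5.0) / 10.0)
    coverage = _clamp(1.0 - w.failing_regions / w.tested_regions)
    incidents = _clamp(1.0 - w.major_incidents / 4)
    # Illustrative weights (assumption); they sum to 1.0.
    blended = (0.35 * quality + 0.25 * stability + 0.10 * latency
               + 0.15 * coverage + 0.15 * incidents)
    return round(SCORE_MIN + (SCORE_MAX - SCORE_MIN) * blended)
```

An agent that meets every threshold lands at the top of the range. Anything weaker scores lower, in proportion to how far each signal falls short.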
Compliance Evidence-as-a-Service
Continuous proof that your agent stayed within guardrails
Regulated industries — finance, healthcare, government, legal — will need to prove that their AI agents behaved correctly over time. Not just that they passed an audit on a specific date, but that they maintained compliance continuously between audits. The EU AI Act, SOC 2 for AI, and industry-specific regulations are all moving in this direction.
A compliance evidence feed delivers exactly this: a continuous, third-party-verified record of agent behavior. An auditor can pull 90 days of behavioral data and see exactly how the agent performed, where it failed, and whether it stayed within defined parameters. Because the data is collected from outside the system, the agent operator cannot tamper with it.
Model Update Impact Intelligence
The early warning system for model changes
When OpenAI updates GPT-4o, when Anthropic ships a new Claude version, when Google pushes a Gemini revision — thousands of agents are affected simultaneously. Some break. Some get better. Most teams find out days later, from user complaints.
Because AgentStatus tests thousands of agents continuously, we see the impact of model updates in near real-time. We can detect a quality shift across hundreds of agents within hours of a model change, before any individual operator has noticed. That signal is worth money.
Imagine a product that tells you: "GPT-4o was updated 8 hours ago. Across 2,400 agents in our network that use GPT-4o, average evaluation pass rate dropped 6.2 percentage points. Customer support agents are the hardest-hit category. Here are the specific failure patterns we are seeing." That is Bloomberg Terminal-level intelligence for AI operations.
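The core calculation behind a message like that is a before-and-after comparison of pass rates, grouped by agent category. The sketch below is one simple way to do it. The record shape `(timestamp, category, passed)`, the function name, and the 8-hour comparison window are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def update_impact_pp(results, update_time, window=timedelta(hours=8)):
    """Return {category: change in pass rate, in percentage points}.

    results: iterable of (timestamp, category, passed) tuples (hypothetical shape).
    Compares the window before update_time with the window after it.
    """
    counts = defaultdict(lambda: {"before": [0, 0], "after": [0, 0]})
    for ts, category, passed in results:
        if update_time - window <= ts < update_time:
            side = "before"
        elif update_time <= ts < update_time + window:
            side = "after"
        else:
            continue  # outside both comparison windows
        bucket = counts[category][side]
        bucket[0] += int(passed)  # passed tests
        bucket[1] += 1            # total tests

    impact = {}
    for category, sides in counts.items():
        (b_pass, b_total), (a_pass, a_total) = sides["before"], sides["after"]
        if b_total and a_total:  # need data on both sides to compare
            impact[category] = 100.0 * (a_pass / a_total - b_pass / b_total)
    return impact
```

The hardest-hit category is then `min(impact, key=impact.get)`. A production version would also need a significance check, since small samples produce large swings.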
Agent Procurement Intelligence
The G2 or Gartner for AI agents, backed by real data
Enterprises are choosing between AI agent vendors with almost no objective data. They rely on demos, sales pitches, and pilot projects. What if a procurement team could see 90 days of real production behavioral data before signing a contract?
"Your shortlisted customer support agent vendors: Vendor A has a 78% evaluation pass rate with 4 drift events in the last quarter. Vendor B has an 89% pass rate with zero drift events. Vendor A fails in 3 out of 10 tested regions. Vendor B is consistent across all 10." That is the kind of data that changes purchasing decisions.
Geographic Access Intelligence
Where in the world your agent actually works
AI agents do not behave the same everywhere. Rate limiters, geo-restrictions, CDN routing, bot detection, and local regulations all change what a user in Tokyo sees compared to a user in London or Lagos. Most teams have no visibility into this.
A geographic access intelligence product tells you exactly where your agent works, where it does not, and where it gives different answers depending on the region. This data is useful for agent operators, but also for companies evaluating global AI infrastructure, regulators studying digital access equity, and researchers studying AI availability patterns.
SLA Verification for AI Agent Contracts
Third-party proof that the SLA was met — or breached
As AI agents become enterprise software, they will have SLAs. Uptime guarantees. Response quality commitments. Latency thresholds. Geographic availability requirements. And like every SLA, there will be disputes about whether they were actually met.
A third-party SLA verification service provides the neutral evidence. The agent vendor says they met 99.9% uptime. The customer suspects otherwise. The verification service has 90 days of continuous behavioral data collected from outside both systems. It can tell you exactly what happened, when, and where.
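As a rough sketch, checking an uptime claim against externally collected checks could look like the following. It assumes each check is a `(region, verdict)` pair using the UP/DEGRADED/DOWN verdicts, that every check carries equal weight, and that DEGRADED counts as available. A real contract would define all three. The names are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class SlaReport:
    checks: int
    observed_uptime_pct: float
    met: bool
    down_by_region: dict


def verify_uptime_sla(checks, sla_pct="99.9"):
    """checks: iterable of (region, verdict); verdict is UP, DEGRADED or DOWN."""
    checks = list(checks)
    if not checks:
        raise ValueError("no checks in the verification window")
    # Assumption: only DOWN counts against uptime.
    down = Counter(region for region, verdict in checks if verdict == "DOWN")
    total = len(checks)
    up = total - sum(down.values())
    # Exact rational arithmetic avoids float rounding at the SLA boundary.
    observed = Fraction(up, total)
    target = Fraction(sla_pct) / 100
    return SlaReport(total, float(observed * 100), observed >= target, dict(down))
```

The per-region breakdown covers the "where" part of a dispute. Adding timestamps to each check would cover the "when".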
AI Agent Security Posture Scoring
How exposed is this agent to adversarial conditions?
Security teams need to know how their agents behave under adversarial conditions. Not just whether the agent can be jailbroken in a lab, but whether its production behavior drifts in ways that create security exposure. Does it give different answers to different regions? Does it fail open or fail closed? Does it leak information under latency pressure?
A security posture score for AI agents combines behavioral data with adversarial testing signals. The behavioral baseline comes from continuous monitoring. The adversarial layer adds targeted validations designed to test boundaries: geo-context manipulation, prompt injection resistance under real-world conditions, consistency under load.
The AI Agent Behavioral Research Dataset
The largest public dataset of how AI agents actually behave in production
There is no large-scale public dataset of production AI agent behavior. Researchers studying AI reliability, drift, hallucination patterns, and geographic disparities have to build their own test infrastructure from scratch. That is expensive and slow.
An anonymized behavioral research dataset — aggregate patterns from millions of tests across thousands of agents — would accelerate research in AI safety, reliability engineering, and agent evaluation methodology. It would also position the contributing platform as the definitive source of truth for how AI agents behave in the wild.
This is the AI agent equivalent of the Common Crawl dataset, except it captures how agents behave rather than what the web contains. Academic researchers, AI safety organizations, and government bodies studying AI deployment would all use this.
The Data Platform
The AgentStatus Data Residency API
The idea is simple: we test thousands of production AI agents from the outside, continuously, from real consumer devices. That generates a behavioral dataset. The API lets companies building products in insurance, compliance, procurement, security, and research license that data instead of collecting it themselves. You build the product, we supply the signal through a programmatic interface.
What the data includes
Every test we run generates a behavioral record. Across millions of tests, these records compound into something more valuable than any individual data point: a longitudinal behavioral profile for each agent we monitor.
Per-test signals: HTTP status, response latency (P50/P95/P99), TTFB, response body, evaluation verdict (UP/DEGRADED/DOWN), semantic quality score, evaluation type used, geographic origin of test, timestamp, agent endpoint
Aggregated signals: uptime over time, quality score trends, drift detection (score change after model updates), geographic variance, error distribution, latency trends, evaluation pass rate by region, behavioral consistency score
Comparative signals: datacenter vs residential access rates, cross-agent category benchmarks, regional anomaly detection, model-update impact measurement
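To illustrate how per-test records roll up into aggregated signals, here is a small sketch. The field names are hypothetical, chosen to mirror the per-test signals listed above (the response body is omitted for brevity), and treating only an UP verdict as a pass is an assumption.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime
from enum import Enum


class Verdict(Enum):
    UP = "UP"
    DEGRADED = "DEGRADED"
    DOWN = "DOWN"


@dataclass(frozen=True)
class BehavioralRecord:
    """One per-test record (illustrative field names)."""
    agent_endpoint: str
    region: str
    timestamp: datetime
    http_status: int
    latency_ms: float
    ttfb_ms: float
    verdict: Verdict
    quality_score: float
    evaluation_type: str


def pass_rate_by_region(records):
    """Aggregate signal: share of tests with an UP verdict, per region."""
    totals, passed = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r.region] += 1
        if r.verdict is Verdict.UP:
            passed[r.region] += 1
    return {region: passed[region] / totals[region] for region in totals}


def geographic_spread(rates):
    """Gap between the best and worst region, a crude variance signal."""
    return max(rates.values()) - min(rates.values())
```

The same pattern, grouping records by a key and reducing them, extends to the other aggregated signals such as latency trends or error distribution.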
How the API works
Two tiers of access. Public endpoints serve aggregate benchmarks and category-level statistics — the kind of data that appears in our public reports, available to anyone. Licensed endpoints provide agent-level behavioral profiles for the 8k+ public agents we test from the outside, accessed via authenticated API keys with usage-based pricing. Partners can query individual agent behavioral histories, pull category-level comparisons, or access the full dataset for model training.
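The two-tier split can be sketched as a small request builder. Everything specific here is a placeholder: the host, the paths, the parameter names, and the bearer-token header are assumptions for illustration, not the documented AgentStatus API. The sketch only builds requests and does not send them.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "https://api.example.com/v1"  # placeholder host, not a real endpoint


def build_request(path, params=None, api_key=None, licensed=False):
    """Build a request for a public or licensed endpoint (hypothetical paths)."""
    if licensed and not api_key:
        # Licensed, agent-level data is only served to authenticated callers.
        raise PermissionError("licensed endpoints require an API key")
    url = f"{BASE_URL}{path}"
    if params:
        url += "?" + urlencode(params)
    headers = {"Accept": "application/json"}
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"  # assumed auth scheme
    return Request(url, headers=headers)


# Public tier: aggregate, category-level benchmarks, no key needed.
public = build_request("/benchmarks", {"category": "customer-support"})

# Licensed tier: one agent's behavioral history, key required.
licensed = build_request("/agents/agent-123/history", {"days": 90},
                         api_key="YOUR_API_KEY", licensed=True)
```

Keeping the key check in the builder mirrors the access model: aggregate data is open, agent-level data is gated and metered.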
This works because all of the data is first-party. We test public-facing agents from the outside, from real consumer devices, without needing access to the agent's internal systems. The behavioral data we collect belongs to us. Licensing it is no different from a credit bureau licensing financial data or a market research firm licensing consumer behavior data. The agents are public. The observations are ours.
Licensing models
Dataset license: Annual access to the full historical behavioral dataset for model training. Priced per year. Ideal for insurance companies building actuarial models, research labs studying agent behavior, or any company that needs the raw data to train its own systems.
Per-query API: Real-time lookups on specific agents or categories. Priced per call or per agent per month. Ideal for procurement platforms checking an agent before a purchase decision, compliance tools pulling behavioral evidence, or security platforms scoring agent posture.
Streaming feed: Continuous data delivery for partners who need real-time signal. Priced per agent per month. Ideal for model-update impact monitoring, drift alerting services, or any product that needs to react to behavioral changes as they happen.
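A partner consuming the streaming feed for drift alerting might start with something as simple as the sketch below. It compares a rolling window of recent quality scores against a fixed baseline. The class name, window size, and threshold are assumptions, and a real service would likely use a more robust statistical test.

```python
from collections import deque
from statistics import fmean


class DriftAlert:
    """Flag when recent quality scores fall well below an initial baseline."""

    def __init__(self, window=5, threshold=0.1):
        self.window = window
        self.threshold = threshold          # minimum drop in mean score to alert
        self.baseline = []                  # first `window` scores seen
        self.recent = deque(maxlen=window)  # rolling window of later scores

    def observe(self, score: float) -> bool:
        """Feed one quality score from the stream; return True on drift."""
        if len(self.baseline) < self.window:
            self.baseline.append(score)
            return False
        self.recent.append(score)
        if len(self.recent) < self.window:
            return False  # not enough recent data to compare yet
        return fmean(self.baseline) - fmean(self.recent) > self.threshold
```

Each `observe` call does work bounded by the window size, so one detector per agent can keep pace with a continuous feed.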
What We Are Building
We are building the data layer that makes all 9 of these businesses possible.
AgentStatus is a monitoring product. We test AI agents from the outside, from real consumer devices, and we evaluate whether the responses are correct. That is our business. That is what we do every day.
The data residency program is not a new product. It is an API on top of the monitoring infrastructure we already operate. Every test we run generates behavioral data. The API makes that data accessible to companies building products that need it.
All 9 of the ideas in this report are businesses that someone else should build. We are not pivoting into insurance, or compliance, or procurement analytics. We are saying: if you are building in any of these spaces, the data you need already exists, and we will license it to you through the API.
The more partners build on the data, the more valuable each monitored agent becomes. That brings more agents onto the platform, which means we collect more data. That is the flywheel. Monitoring is the engine. The Data Residency API is how the data compounds beyond monitoring.
Build on the data.
The AgentStatus Data Residency API gives you programmatic access to the largest continuous behavioral dataset on production AI agents. If any of these 9 ideas is your next company, or if you have a tenth we have not thought of, let's talk about licensing.