Continuous user-side validation
for AI agents.

We ask your agents real questions from where your users are. And tell you when the answers are wrong.

Agent Status dashboard

Non-determinism: five reasons your AI Agent gives a different answer every time.

Mechanism

Floating-point non-associativity

(a + b) + c  =  0.492371
a + (b + c)  =  0.492368
— argmax flips at bit 9

GPU kernels can reduce in a nondeterministic order, and floating-point addition is not associative. The same values, summed twice in a different order, need not produce the same logits.

The deterministic path is a marketing term.
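A minimal sketch of the effect, using ordinary Python floats rather than a real GPU reduction. The three terms, the rival logit of 0.6, and the helper names are illustrative assumptions; the point is only that summation order changes the last bits, and a near-tie turns that into a different greedy pick.

```python
# Illustrative values only; real logits come from much longer GPU reductions.

def sum_in_order(values, order):
    """Add the values one at a time in the given index order."""
    total = 0.0
    for i in order:
        total += values[i]
    return total

def greedy_pick(logits):
    """Index of the largest logit; ties go to the lowest index."""
    return max(range(len(logits)), key=lambda i: logits[i])

terms = [0.1, 0.2, 0.3]                       # partial sums feeding one logit
forward = sum_in_order(terms, [0, 1, 2])      # (0.1 + 0.2) + 0.3
reverse = sum_in_order(terms, [2, 1, 0])      # (0.3 + 0.2) + 0.1
rival = 0.6                                   # a competing token's logit

pick_forward = greedy_pick([rival, forward])
pick_reverse = greedy_pick([rival, reverse])

print(forward == reverse)          # the two orders disagree in the last bit
print(pick_forward, pick_reverse)  # so the chosen token index differs
```

The difference is a single unit in the last place, which is harmless until two candidate tokens are nearly tied.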

Mechanism

Batch composition

[Figure: a batch of requests (REQ 01, REQ 02, YOU, REQ 04, REQ 05, REQ 06); neighbors change the math]

Your prompt is served in a batch with other people's prompts.

Your answer depends on who else is querying the model right now.

Mechanism

Mixture-of-experts routing

[Figure: the tokens "the", "cat", "sat", "on", "mat" routed among Expert 1 through Expert 5]

MoE gating networks are themselves trained, and small differences in activation values route the same token to different experts.

The "model" you are calling is, at the level of computation, a different model on every call.
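To make the routing claim concrete, here is a hedged sketch of top-1 gating over five experts. The gate scores and the size of the perturbation are made-up numbers, and real MoE layers typically use learned top-k gating over many more experts; this only shows how a tiny change in activations moves a token to a different expert.

```python
def top1_expert(gate_scores):
    """Return the index of the expert with the highest gate score."""
    return max(range(len(gate_scores)), key=lambda i: gate_scores[i])

# Hypothetical gate scores for one token across five experts.
gate_scores = [0.20, 0.3500, 0.3499, 0.05, 0.05]

# A tiny numerical wobble on one score, of the kind a different
# reduction order or batch shape could introduce (assumed magnitude).
noise = [0.0, 0.0, 0.0002, 0.0, 0.0]
perturbed = [s + n for s, n in zip(gate_scores, noise)]

first_route = top1_expert(gate_scores)
second_route = top1_expert(perturbed)
print(first_route, second_route)  # the same token lands on different experts
```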

Mechanism

Speculative decoding

[Figure: DRAFT and FINAL token streams, labeled "stochastic boundary"]

A small draft model proposes tokens; a large verifier accepts or rejects them. The accept boundary is stochastic.

The final text arrives faster, and it is not guaranteed to be the same.
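The accept boundary can be sketched with the accept/reject rule commonly described for speculative sampling: accept a drafted token with probability min(1, p_target / p_draft), otherwise resample from the leftover target mass. The two-token vocabulary and the probabilities below are invented for illustration, and whether a given provider uses exactly this rule is an assumption.

```python
import random

def accept_probability(p_target, p_draft):
    """Chance that the verifier keeps a token the draft model proposed."""
    return min(1.0, p_target / p_draft)

def verify_token(token, target, draft, rng):
    """Keep the drafted token, or resample from the leftover target mass."""
    if rng.random() < accept_probability(target[token], draft[token]):
        return token
    # On rejection, sample where the target puts more mass than the draft.
    residual = {t: max(0.0, target[t] - draft.get(t, 0.0)) for t in target}
    tokens = list(residual)
    weights = [residual[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical next-token distributions for the two models.
target = {"refund": 0.5, "credit": 0.5}
draft = {"refund": 0.9, "credit": 0.1}

# The same drafted token, verified under different random draws.
outcomes = {verify_token("refund", target, draft, random.Random(seed))
            for seed in range(50)}
print(sorted(outcomes))
```

The rule is designed to match the large model's sampling distribution, but each call still consumes randomness, so the same prompt can end in a different token.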

Mechanism

Silent provider updates

[Figure: model · v3.2, marked stable; weights swapped from w_old to w_new]

The model identifier did not change. The model did.

You will learn about it from your customers.

Non-determinism is a failure mode that, by construction, cannot be detected by inside-out tools.

Agent Status measures it from outside.

For years, software told us when things break.

AI Agents broke the pattern.

Each era answered a question the previous one could not.

I
1971
Testing
Does the code do what we said it would?
JUnit, Jest, Pytest, Selenium
II
1995
Monitoring
Is the system on?
Nagios, Pingdom, PagerDuty, Zabbix
III
2014
Observability
Why is the system broken?
Datadog, New Relic, Honeycomb, Grafana
IV
2018
Synthetics
Would a fake user succeed?
Checkly, Cypress, Playwright, Datadog Synthetics
V
2022
ML evaluation
Did the model regress against the benchmark?
OpenAI Evals, LangSmith, Braintrust, Arize
VI
2026
User-side validation (now)
Are real users, right now, getting truthful, consistent help?
Agent Status

Note. Years are approximate; eras overlap and never fully retire. The claim is not that Era V is obsolete; it is that no era before VI was even attempting the right measurement.

User-side validation isn't theory. We've been running it.

Live infrastructure
8k+

Agents continuously monitored across the global network.

18M+

USER-SIDE VALIDATIONS

30+

Countries covered

What user-side validation actually means

Seven checks every production agent needs — from real home networks, on a schedule, with plain verdicts. No instrumentation. Just your URL.

Status

Is it up right now? We probe from home networks on a schedule and give you a plain verdict (up, degraded, or down), plus a run ledger and per-region view. Not a ping from your office.

Agent status dashboard with verdict, uptime, and run ledger

Reliability

Does it keep working over time? Pass rates, latency, time-to-first-byte, and week-over-week trends — so one green check does not fool you.

Reliability metrics dashboard with pass rates and latency over time
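As a rough illustration of the reliability numbers described above, the sketch below computes a pass rate, a median latency, and a week-over-week change from a list of probe runs. The `Run` record, its fields, and the sample data are hypothetical, not the product's actual schema.

```python
from dataclasses import dataclass
from statistics import median

@dataclass
class Run:
    passed: bool        # did the probe get an acceptable answer?
    latency_ms: float   # end-to-end response time for that probe

def summarize(runs):
    """Pass rate and median latency for one window of probe runs."""
    return {
        "pass_rate": sum(r.passed for r in runs) / len(runs),
        "median_latency_ms": median(r.latency_ms for r in runs),
    }

# Hypothetical data: last week all green, this week two failures.
last_week = [Run(True, 400.0) for _ in range(10)]
this_week = [Run(i >= 2, 400.0 + 20.0 * i) for i in range(10)]

before = summarize(last_week)
after = summarize(this_week)
pass_rate_change = after["pass_rate"] - before["pass_rate"]
print(after, round(pass_rate_change, 3))
```

A single green run from `this_week` would look healthy; the trend across the window is what exposes the regression.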

What changed

Slow shifts over time. We snapshot what normal looks like for your agent, then flag when behavior drifts away from it — day by day, with the rough runs worth a second look.

What changed dashboard with drift status and day-by-day behavioral snapshot
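One simple way to picture "snapshot normal, then flag drift" is a z-score against a baseline window. This is a sketch under stated assumptions: the tracked metric (average answer length per day), the threshold of 3, and the sample numbers are all illustrative, and a production drift check would likely track several behavioral signals.

```python
from statistics import mean, pstdev

def drifted(baseline, today, z_threshold=3.0):
    """Flag today's value if it sits far outside the baseline spread."""
    mu = mean(baseline)
    sigma = pstdev(baseline)
    if sigma == 0:
        return today != mu          # any change from a flat baseline counts
    return abs(today - mu) / sigma > z_threshold

# Hypothetical daily average answer length (in words) for one agent.
baseline_lengths = [120, 118, 125, 122, 119, 121, 123]

normal_day = drifted(baseline_lengths, 122)
rough_day = drifted(baseline_lengths, 180)
print(normal_day, rough_day)
```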

Answer quality

Reachable is table stakes. We grade the actual answer — spot checks, dual reviewers, format contracts, and domain-specific questions. A confident wrong answer is not up.

Answer quality dashboard with evaluation prompts and pass fail results
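A format contract, one of the checks named above, can be sketched as a validator over the raw response. The contract here (a JSON object with an `answer` string and a `sources` list) is a hypothetical example, not a schema the product prescribes.

```python
import json

# Hypothetical contract: field name -> required Python type.
REQUIRED_FIELDS = {"answer": str, "sources": list}

def check_format(raw):
    """Return (passed, reason) for one raw agent response."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return False, "not valid JSON"
    if not isinstance(payload, dict):
        return False, "top level is not an object"
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in payload:
            return False, f"missing field: {field}"
        if not isinstance(payload[field], expected_type):
            return False, f"wrong type for field: {field}"
    return True, "ok"

print(check_format('{"answer": "Yes", "sources": []}'))
print(check_format("Sure, here you go!"))
```

A format check like this only says the answer is well-shaped; whether it is correct still needs the graded checks described above.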

Conversations

Does it finish the job? We give a simulated user a real goal and let them pursue it over several messages — then judge whether they got what they came for.

Conversations dashboard with goal-driven scenarios and task outcomes

Consistency

Same situation, same story. Rephrased questions, rising stakes, and follow-ups should not flip the answer for no reason.

Consistency dashboard with wording stability and determinism checks
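A consistency check along these lines can be approximated by asking rephrasings of one question and measuring how often the answers agree. The normalization step and the sample answers below are assumptions for illustration; real answers are free text and would need a semantic comparison rather than string matching.

```python
from collections import Counter

def normalize(answer):
    """Lowercase and drop surrounding whitespace and final punctuation."""
    return answer.strip().lower().rstrip(".!?")

def agreement(answers):
    """Return the majority answer and the share of runs that gave it."""
    counts = Counter(normalize(a) for a in answers)
    majority, votes = counts.most_common(1)[0]
    return majority, votes / len(answers)

# Hypothetical answers to three rephrasings of the same question.
answers = ["Yes.", "yes", "No"]
majority, share = agreement(answers)
print(majority, round(share, 2))
```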

Robustness

Tools, streams, safety rules, and people trying to break it. Catch broken integrations and policy failures before customers do.

Robustness dashboard with tool probes, streaming, and safety rules

Explain

Why we flagged it. Every verdict comes with a probe trace — what we asked, what came back, which check failed, and a one-click reproduce so your team can fix it.

Explain dashboard with probe trace, verdict reasoning, and reproduce action

Alerts

When it breaks, you know. Slack, webhooks, PagerDuty, email digests — with enough context to fix it, not just a red dot.

Test an agent live. Get results in 30 seconds.

3 free tests per day. No account needed.

We work with all kinds of AI Agents

OpenAI, Claude, Anthropic, Google, Azure, AWS Bedrock, LangChain, LangGraph, LangServe, Langbase, Fetch.ai, Forethought, ElevenLabs, ElevenLabs Voice, Retell, Perplexity, Poe, Devin, Swarms, Voiceflow, Botpress, CrewAI, HuggingFace, Gradio, Google ADK / A2A, A2A JSON-RPC, Nanda A2A, Agent AI, Agor, Agentic, AutoGen, Bland, Boost, Decagon, Dify, Maven, MCP, n8n, OpenAI Assistants, OpenAI CUA, Talkdesk, uAgent, Vapi

Software fails loudly. Agents fail quietly.