Does AgentDiff have access to our source?

No. It runs as a status check that calls the deployed preview URL.

How long does a PR check take?

Under 60 seconds for typical eval sets. Larger sets parallelize across validation runners.

What if our agent has flaky responses?

Quality Score uses windowed pass rates, not single-run binary checks, so non-deterministic agents stabilize quickly.
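To make the windowed idea concrete, here is a minimal sketch of a pass rate computed over the most recent runs. The class name, the window size, and the method names are illustrative assumptions, not AgentDiff's actual implementation.

```python
from collections import deque


class WindowedPassRate:
    """Hypothetical helper: pass rate over the most recent `window` runs."""

    def __init__(self, window=20):
        if window <= 0:
            raise ValueError("window must be positive")
        # deque with maxlen drops the oldest result once the window is full
        self._results = deque(maxlen=window)

    def record(self, passed):
        """Store the outcome of one validation run."""
        self._results.append(bool(passed))

    def rate(self):
        """Fraction of passing runs in the window; 0.0 before any run."""
        if not self._results:
            return 0.0
        return sum(self._results) / len(self._results)
```

A single flaky failure in a window of 20 moves the rate by 5 points at most, where a single-run binary check would flip from pass to fail.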

Can we self-host the validation runner?

Yes, on enterprise plans for VPC-only agents.

Block the deploys that quietly regress your agent.

Every PR runs the same evaluation prompts against the new version, diffs against your baseline, and posts a verdict back to GitHub before merge.

User-side validation isn't theory. We've been running it.

Live infrastructure

8k+

Agents continuously monitored across the global network.

18M+

User-side validations

30+

Countries covered

What breaks today

The failure modes your current stack misses

Unit tests pass. Quality drops anyway.

Prompt tweaks, retrieval changes, and library bumps all pass green and still ship regressions.

Eval suites run too slowly to gate merges.

A full eval takes 20 minutes. Engineers merge without it and find regressions in prod.

Approvals are tribal.

Whoever owns the agent eyeballs the diff. There is no shared bar for 'good enough to ship'.

AgentDiff

Behavioral diff posted to the PR.

Each PR runs the same prompts against the new and the baseline build. The diff posts as a status check with per-prompt detail.

Pass / warn / fail per prompt
Side-by-side answer diff
Approve-anyway with required reason
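One way to picture a per-prompt verdict is as a comparison of baseline and candidate scores for the same prompt. The sketch below assumes each answer has already been scored between 0 and 1; the `PromptResult` type and the `warn_drop` and `fail_drop` thresholds are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptResult:
    """Hypothetical record: scores for one prompt on both builds."""
    prompt_id: str
    baseline_score: float
    candidate_score: float


def verdict(result, warn_drop=0.05, fail_drop=0.15):
    """Return 'pass', 'warn', or 'fail' based on how far the score dropped."""
    drop = result.baseline_score - result.candidate_score
    if drop >= fail_drop:
        return "fail"
    if drop >= warn_drop:
        return "warn"
    return "pass"


def summarize(results):
    """Map each prompt id to its verdict, as a PR check might list them."""
    return {r.prompt_id: verdict(r) for r in results}
```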

Quality Score gate

A single threshold engineering can defend.

Set a minimum Quality Score per agent. PRs that drop below the threshold block on the status check.

Configurable per agent
Visible in the PR check
Override audit-logged
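The gate logic described above can be sketched in a few lines. This is an illustration under stated assumptions, not the product's code: `evaluate_gate`, `GateDecision`, and the shape of the audit entry are all hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class GateDecision:
    allowed: bool
    reason: str


def evaluate_gate(score: float,
                  threshold: float,
                  override_reason: Optional[str] = None,
                  audit_log: Optional[List[dict]] = None) -> GateDecision:
    """Block when score < threshold unless a non-empty override reason is given."""
    if score >= threshold:
        return GateDecision(True, "score meets threshold")
    reason = (override_reason or "").strip()
    if reason:
        # Every override leaves an audit entry (hypothetical entry shape).
        if audit_log is not None:
            audit_log.append(
                {"score": score, "threshold": threshold, "override_reason": reason}
            )
        return GateDecision(True, "overridden")
    return GateDecision(False, "score below threshold")
```

Requiring a non-empty reason keeps the override path available for urgent fixes while ensuring each bypass is recorded.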

Region-aware checks

Validate the regions your agent serves, every PR.

Validations run from the regions you nominate. A PR that breaks in São Paulo but passes in New York fails the gate.

Multi-region per PR
Per-region pass rate
Region quarantine for noise
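As a rough sketch of how a region-aware gate could combine results: the PR fails if any nominated region falls below the minimum pass rate, and quarantined regions are left out of the decision. The function name, region identifiers, and the `min_rate` parameter are assumptions made for this example.

```python
from typing import Dict, FrozenSet, List, Tuple


def region_gate(pass_rates: Dict[str, float],
                min_rate: float,
                quarantined: FrozenSet[str] = frozenset()) -> Tuple[bool, List[str]]:
    """Return (gate_passed, failing_regions) for one PR.

    Regions listed in `quarantined` are reported nowhere and cannot fail the gate.
    """
    failing = sorted(
        region
        for region, rate in pass_rates.items()
        if region not in quarantined and rate < min_rate
    )
    return (not failing, failing)
```

For example, `region_gate({"sao-paulo": 0.6, "new-york": 0.98}, 0.9)` reports the gate as failed because of `sao-paulo`, matching the São Paulo and New York scenario above.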


How a pilot runs

From first validation to signed report in two weeks

Step 01

Connect

Point Agent Status at the user-facing surface of your agent. No SDK, no instrumentation. Average setup is under five minutes.

Step 02

Watch

Live verdicts stream in from every region you serve. Drift and latency alerts route to PagerDuty or Slack, with a signed report on every run.

Questions we hear most

Frequently asked

Stop shipping regressions you only catch in prod.

Spin up a validation in under five minutes. No credit card. First 100 runs free.