On every PR, AgentDiff runs the same evaluation prompts from where your users actually live, diffs the results against your baseline, and posts a verdict back to GitHub before merge.
AgentDiff is free. No code changes, no extra runner, install in under two minutes.
The contrast that matters









How it works
AgentDiff is a GitHub App, not yet another CI job. No new YAML, no new secrets in your runner. Just an outside-in verdict that updates the PR status.
Add the GitHub App and point it at your agent's preview URL. Auth secrets stay encrypted at rest.
On every PR sync we run paired validations against base and head, one residential node per region.
The verdict (pass, warn, or fail) sets the GitHub Check. The PR comment shows a per-region matrix and a one-click rerun.
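The last step above can be sketched as a small aggregation routine. This is a minimal illustration, not AgentDiff's implementation: the `RegionResult` type, the `overall_verdict` and `render_matrix` helpers, and the rule that the worst regional outcome decides the check are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical severity order: the worst regional outcome decides the check.
SEVERITY = {"pass": 0, "warn": 1, "fail": 2}


@dataclass(frozen=True)
class RegionResult:
    region: str   # e.g. "Mumbai"
    verdict: str  # "pass", "warn", or "fail"


def overall_verdict(results: list[RegionResult]) -> str:
    """Return the most severe verdict across regions (assumed policy)."""
    if not results:
        raise ValueError("no regional results to aggregate")
    return max((r.verdict for r in results), key=SEVERITY.__getitem__)


def render_matrix(results: list[RegionResult]) -> str:
    """Render a plain-text per-region matrix, one region per line."""
    width = max(len(r.region) for r in results)
    return "\n".join(f"{r.region.ljust(width)}  {r.verdict}" for r in results)
```

Under this assumed policy, a single failing region is enough to fail the whole check, which matches the idea that a regression in one geography should block the merge.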
What it catches
We run the same prompt twice, against base and head, from one residential node, so any divergence shows up before you merge.
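A minimal sketch of the paired comparison described above, assuming each run is summarized by an HTTP status code, an optional tool-error code, and the response text. The `Run` type, the `compare_runs` function, and the pass/warn/fail rules are hypothetical; they only illustrate diffing head against base from the same node.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Run:
    status: int                # HTTP status returned by the agent endpoint
    tool_error: Optional[str]  # e.g. "TOOL_FAIL", or None if tools succeeded
    text: str                  # response body returned for the prompt


def is_ok(run: Run) -> bool:
    """A run counts as healthy if it returned 200 with no tool error."""
    return run.status == 200 and run.tool_error is None


def compare_runs(base: Run, head: Run) -> str:
    """Classify head against base for one prompt from one node (assumed rules)."""
    if not is_ok(head):
        # Broken on head: a new regression if base worked, otherwise pre-existing.
        return "fail" if is_ok(base) else "warn"
    if is_ok(base) and head.text != base.text:
        # Naive exact-match check; real agent output would need a tolerance.
        return "warn"
    return "pass"
```

Because both runs originate from the same node, a difference between them is more likely to come from the code change than from the network path.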
Coverage
No SDK to install on your agent. AgentDiff validates whatever HTTP surface you already expose.
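To make "whatever HTTP surface you already expose" concrete, here is a sketch of what one outside-in validation request could look like. The JSON payload shape (`{"prompt": ...}`), the function name, and the example URL are assumptions for illustration; the actual request format depends on your agent's API.

```python
import json
import urllib.request


def build_validation_request(preview_url: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a POST carrying one evaluation prompt."""
    # Hypothetical payload shape; adapt to the endpoint your agent exposes.
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        preview_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it. Nothing here requires code inside the agent itself, which is the point of validating from the outside.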
Use both
evalview / dataset-diff: 1,402 prompts · no regression
evidently / drift-detection: no statistical drift detected
ci / unit-tests: 247 tests passed in 12.4s
agentdiff / residential-validate: Mumbai returned TOOL_FAIL; Lagos & São Paulo pass
For clarity
Is it a replacement for EvalView or Evidently?
No. Run them together.
They check golden datasets. We check geographic behavior from real residential ISPs. Both signals matter.
Is this a generic CI evaluation tool?
No. It does one thing.
Paired validations against the same agent from the same residential node, then a verdict. Nothing else.
Will it help with a localhost-only project?
Not yet. Ship a preview URL first.
AgentDiff validates deployed endpoints from real ISPs. If your agent never leaves your laptop, come back later.
Is it a consumer-facing persona experiment?
No. It is engineering infrastructure.
Built for the engineers shipping agents to production. No roleplay simulations and no synthetic personas, just real validations from the networks your users actually use.