AgentDiff · Public preview

Catch the regression before you merge.

On every PR, AgentDiff runs the same evaluation prompts from where your users actually live, diffs the results against your baseline, and posts a verdict back to GitHub before you merge.

AgentDiff is free. No code changes, no extra runner, install in under two minutes.

Paired validate · São Paulo node · Live
residential · SA · sa-east
base @ main · UP · 612 ms
head @ PR #218 · UP · 1.4 s · Δ regression
Validations: 12 · Pass: 11 · Δ: 1 · p95: +24 ms

The contrast that matters

Your CI runs in us-east-1. Your users live in São Paulo.

Your CI runner
$ pytest
12 passed in 4.2s
us-east-1 · 54.x.x.x
What users actually hit
São Paulo · 403 · rate-limited
Lagos · 429 · quota exhausted
Mumbai · TOOL_FAIL · tool regression
Jakarta · 503 · upstream timeout
Istanbul · AUTH_ERR · bearer rejected
Mexico City · JSON_DRIFT · schema mismatch
Cairo · GEO_BLOCK · region denied
Manila · STREAM_END · premature SSE close
Bangkok · 404 · endpoint missing

How it works

Install once, then get a verdict on every pull request.

AgentDiff is a GitHub App, not yet another CI job. No new YAML, no new secrets in your runner. Just an outside-in verdict that updates the PR status.

Step 01

Install

Add the GitHub App and point it at your agent's preview URL. Auth secrets stay encrypted at rest.

Step 02

Open a pull request

On every PR sync we run paired validations against base and head, one residential node per region.

Step 03

Get a verdict

A pass, warn, or fail verdict sets the GitHub Check status. The PR comment shows a per-region matrix and a one-click rerun.
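
As a rough illustration of how a per-region matrix could collapse into a single verdict, here is a minimal Python sketch. The aggregation rule, the `RegionResult` fields, and the `warn_delta_ms` threshold are all assumptions made for this example; the page does not describe how AgentDiff actually decides between pass, warn, and fail.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class RegionResult:
    """One row of a hypothetical per-region matrix."""
    region: str
    base_ok: bool        # did the validation succeed against base?
    head_ok: bool        # did the validation succeed against head?
    p95_delta_ms: float  # head p95 minus base p95


def verdict(results: Iterable[RegionResult], warn_delta_ms: float = 500.0) -> str:
    """Collapse the matrix into "pass", "warn", or "fail" (assumed rule)."""
    rows = list(results)
    # Assumed: a region that worked on base but breaks on head is a regression.
    if any(r.base_ok and not r.head_ok for r in rows):
        return "fail"
    # Assumed: a large latency increase is worth a warning, not a block.
    if any(r.p95_delta_ms > warn_delta_ms for r in rows):
        return "warn"
    return "pass"
```

Under this assumed rule, a single region that regresses is enough to fail the check, even when every other region passes.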

What it catches

Geographic regression testing, not generic CI evaluation.

We run the same prompt twice, against base and head, from one residential node, so any divergence shows up before you merge.
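
The pairing idea can be sketched in a few lines of Python. Everything named here is hypothetical: `send` stands in for whatever actually issues the request from the residential node, and the `Outcome` fields and the "UP" status string are illustrative, borrowed from the status labels shown earlier on this page.

```python
from typing import Callable, NamedTuple


class Outcome(NamedTuple):
    status: str        # e.g. "UP" or "TOOL_FAIL" (illustrative labels)
    latency_ms: float


class PairedDiff(NamedTuple):
    base: Outcome
    head: Outcome
    regressed: bool
    latency_delta_ms: float


def paired_validate(
    prompt: str,
    base_url: str,
    head_url: str,
    send: Callable[[str, str], Outcome],
) -> PairedDiff:
    """Send the same prompt to base, then head, and diff the outcomes."""
    base = send(base_url, prompt)  # same node, run sequentially
    head = send(head_url, prompt)
    # Assumed rule: a regression means base was healthy and head is not.
    regressed = base.status == "UP" and head.status != "UP"
    return PairedDiff(base, head, regressed, head.latency_ms - base.latency_ms)
```

Because both requests go through the same `send` callable, the only variable that changes between them is the target URL, which is the point of pairing.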

Geo-emergent regressions
Tool calling that works in us-east and breaks in São Paulo gets caught and labelled as a regression.
Auth surface drift
Bearer, custom-header, and basic-auth are validated end-to-end. The wrong shape returns AUTH_ERROR rather than passing silently.
Latency-paired diffs
p50 and p95 are reported per region. Validations are paired sequentially from the same node, so latency changes are attributable to the agent, not the network.
Flake-resistant by design
Retries use exponential backoff (300 ms, then 900 ms) to absorb transient 5xx, 429, and timeouts. Per-validate attempts surface in the report.
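
A retry policy along these lines could look like the Python sketch below. It assumes the 300 ms and 900 ms figures are successive waits between attempts, which is one reading of the description rather than a confirmed detail, so the delay schedule is a parameter. The function and parameter names are hypothetical.

```python
import time
from typing import Callable, List, Optional, Sequence, Tuple


def is_transient(status: Optional[int]) -> bool:
    """Timeouts (None), 429, and any 5xx count as transient."""
    return status is None or status == 429 or 500 <= status <= 599


def call_with_backoff(
    call: Callable[[], int],
    delays_s: Sequence[float] = (0.3, 0.9),  # assumed schedule
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[Optional[int], List[Optional[int]]]:
    """Retry transient failures; return the final status and every attempt."""
    attempts: List[Optional[int]] = []
    for i in range(len(delays_s) + 1):
        try:
            status: Optional[int] = call()
        except TimeoutError:
            status = None  # record the timeout as an attempt
        attempts.append(status)
        if not is_transient(status):
            break
        if i < len(delays_s):
            sleep(delays_s[i])  # wait before the next attempt
    return attempts[-1], attempts
```

Returning the full attempt list alongside the final status is what lets a report show each attempt rather than only the last one.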

Coverage

Works with the agent shapes you already ship.

No SDK to install on your agent. AgentDiff validates whatever HTTP surface you already expose.

Adapter shapes
OpenAI-compatible chat · Anthropic Messages · Google Gemini · Azure OpenAI · AWS Bedrock · LangChain · CrewAI · HuggingFace · Voiceflow · Botpress · Forethought · Retell · ElevenLabs · Perplexity · Poe · Devin · Swarms · Fetch.ai · Google ADK / A2A · Nanda A2A · Generic REST · Plain HTTP · MCP server (contact us)
Auth modes
No auth · Bearer token · Custom header · HTTP basic
Transport
HTTPS · Streaming SSE · WebSocket · Long-poll
Secrets
Envelope encryption at rest · Decrypted on residential node · Never logged · Never in webhooks
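
The four auth modes listed above map onto ordinary HTTP request headers. The Python sketch below shows one way that mapping could be written; the mode names and config keys (`token`, `name`, `value`, `username`, `password`) are hypothetical and are not AgentDiff's configuration format.

```python
import base64
from typing import Dict


def auth_headers(mode: str, **cfg: str) -> Dict[str, str]:
    """Build request headers for one of four assumed auth modes."""
    if mode == "none":
        return {}
    if mode == "bearer":
        return {"Authorization": f"Bearer {cfg['token']}"}
    if mode == "custom_header":
        # Arbitrary header name and value, e.g. an API-key header.
        return {cfg["name"]: cfg["value"]}
    if mode == "basic":
        # HTTP basic auth: base64 of "username:password".
        raw = f"{cfg['username']}:{cfg['password']}".encode("utf-8")
        return {"Authorization": "Basic " + base64.b64encode(raw).decode("ascii")}
    raise ValueError(f"unknown auth mode: {mode}")
```

Raising on an unknown mode, rather than falling back to no auth, keeps a misconfigured validation from passing silently.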

Use both

Sits next to your evals in the same PR check pipeline.

PR #218 · Refactor checkout agent prompt · Open

evalview / dataset-diff

1,402 prompts · no regression

evidently / drift-detection

no statistical drift detected

ci / unit-tests

247 tests passed in 12.4s

agentdiff / residential-validate · Blocking

Mumbai returned TOOL_FAIL; Lagos and São Paulo pass

Your evals stay green. AgentDiff catches what they can't see.

For clarity

What AgentDiff is not.

Q.01

Is it a replacement for EvalView or Evidently?

No. Run them together.

They check golden datasets. We check geographic behavior from real residential ISPs. Both signals matter.

Q.02

Is this a generic CI evaluation tool?

No. It does one thing.

Paired validations against the same agent from the same residential node, then a verdict. Nothing else.

Q.03

Will it help with a localhost-only project?

Not yet. Ship a preview URL first.

AgentDiff validates deployed endpoints from real ISPs. If your agent never leaves your laptop, come back later.

Q.04

Is it a consumer-facing persona experiment?

No. It is engineering infrastructure.

Built for the engineers shipping agents to production. There are no roleplay simulations, no synthetic personas, just real validations from real networks your users actually use.
