AgentDiff · Public preview

Catch the regression before you merge.

Every PR runs the same evaluation prompts from where your users actually live, diffs against your baseline, and posts a verdict back to GitHub before merge.

AgentDiff is free. No code changes, no extra runner, install in under two minutes.

Paired validate · São Paulo node · Live

Validations

Pass

+24ms

p95

The contrast that matters

Your CI runs in us-east-1. Your users live in São Paulo.

Your CI runner

$ pytest
12 passed in 4.2s
us-east-1 · 54.x.x.x

What users actually hit

São Paulo: rate-limited
Jakarta: upstream timeout
Istanbul: bearer rejected
Mexico City: schema mismatch
Manila: premature SSE close
Bangkok: endpoint missing

How it works

Install once, then get a verdict on every pull request.

AgentDiff is a GitHub App, not yet another CI job. No new YAML, no new secrets in your runner. Just an outside-in verdict that updates the PR status.

Step 01

Install

Add the GitHub App and point it at your agent's preview URL. Auth secrets stay encrypted at rest.

Step 02

Open a pull request

On every PR sync we run paired validations against base and head, one residential node per region.

Step 03

Get a verdict

Pass, warn, or fail flips the GitHub Check. The PR comment shows a per-region matrix and one-click rerun.
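
As a rough illustration of how a per-region matrix could collapse into a single pass, warn, or fail verdict, here is a minimal Python sketch. The rules are assumptions made for the example, not AgentDiff's documented logic: a region that worked on base and breaks on head fails the check, and a p95 increase above a hypothetical threshold warns. The names RegionResult, WARN_P95_DELTA_MS, and verdict are invented here.

```python
from dataclasses import dataclass


@dataclass
class RegionResult:
    region: str
    base_ok: bool        # did the validation against the base commit succeed?
    head_ok: bool        # did the validation against the head commit succeed?
    p95_delta_ms: float  # head p95 minus base p95 for this region


# Hypothetical threshold chosen for the example, not an AgentDiff default.
WARN_P95_DELTA_MS = 100.0


def verdict(results):
    """Collapse a per-region matrix into 'pass', 'warn', or 'fail'."""
    # Assumed rule: something that worked on base and breaks on head is a regression.
    if any(r.base_ok and not r.head_ok for r in results):
        return "fail"
    # Assumed rule: a large p95 increase in any region is worth a warning.
    if any(r.p95_delta_ms > WARN_P95_DELTA_MS for r in results):
        return "warn"
    return "pass"


matrix = [
    RegionResult("São Paulo", base_ok=True, head_ok=True, p95_delta_ms=24.0),
    RegionResult("Mumbai", base_ok=True, head_ok=False, p95_delta_ms=0.0),
]
print(verdict(matrix))
```

With the sample matrix, the Mumbai row alone is enough to fail the check, which matches the idea that one broken region should block the merge.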

What it catches

Geographic regression testing, not generic CI evaluation.

We run the same prompt twice, against base and head, from one residential node, so any divergence shows up before you merge.

Geo-emergent regressions

Tool calling that works in us-east and breaks in São Paulo gets caught and labelled as a regression.

Auth surface drift

Bearer, custom-header, and basic-auth are validated end-to-end. The wrong shape returns AUTH_ERROR rather than passing silently.
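
To make the auth shapes concrete, the sketch below builds the request headers each mode implies. The Bearer and Basic formats follow the standard HTTP Authorization schemes. The function name auth_headers, the mode strings, and the keyword arguments are hypothetical and are not AgentDiff configuration keys.

```python
import base64


def auth_headers(mode, **kw):
    """Return the HTTP headers implied by an auth mode (illustrative only)."""
    if mode == "none":
        return {}
    if mode == "bearer":
        return {"Authorization": f"Bearer {kw['token']}"}
    if mode == "custom_header":
        # The header name is whatever the agent expects, e.g. X-API-Key.
        return {kw["name"]: kw["value"]}
    if mode == "basic":
        # HTTP basic auth: base64 of "username:password".
        raw = f"{kw['username']}:{kw['password']}".encode("utf-8")
        return {"Authorization": "Basic " + base64.b64encode(raw).decode("ascii")}
    raise ValueError(f"unknown auth mode: {mode}")


print(auth_headers("bearer", token="example-token"))
print(auth_headers("basic", username="user", password="pass"))
```

Sending the wrong shape, such as a bearer token to an endpoint that expects a custom header, is the kind of mismatch the text describes as surfacing an explicit error instead of a silent pass.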

Latency-paired diffs

p50 and p95 are reported per region. Validations are paired sequentially from the same node so latency changes attribute to the agent, not the network.
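
The paired latency comparison can be sketched as follows. This assumes a nearest-rank percentile and latency samples collected for base and head from the same node. The function names and the sample numbers are illustrative, and AgentDiff's actual percentile method is not stated in the text.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of latencies."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def paired_latency_diff(base_ms, head_ms):
    """Head minus base at p50 and p95, for samples taken from the same node."""
    return {
        name: percentile(head_ms, pct) - percentile(base_ms, pct)
        for name, pct in (("p50", 50), ("p95", 95))
    }


# Made-up samples: only the slowest head request got slower.
base = [100, 110, 120, 130, 140]
head = [100, 110, 120, 130, 200]
print(paired_latency_diff(base, head))
```

Because both sample sets come from one node, a shift in these numbers is more plausibly caused by the change in the PR than by differences between network paths.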

Flake-resistant by design

Transient 5xx, 429, and timeout failures are retried with exponential backoff (300 ms, then 900 ms). The attempts made for each validation appear in the report.
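
A retry loop of this kind might look like the sketch below. It reads the 300 ms and 900 ms figures as the waits before two successive retries, which is an interpretation of the text, not a confirmed schedule. The names call_with_backoff, is_transient, and send are hypothetical. The list of attempts is returned so it could be surfaced in a report.

```python
import time

# Assumed schedule: wait 0.3 s before the first retry, 0.9 s before the second.
BACKOFF_DELAYS_S = (0.3, 0.9)


def is_transient(status):
    """Treat 429 and any 5xx status as retryable."""
    return status == 429 or 500 <= status <= 599


def call_with_backoff(send, delays=BACKOFF_DELAYS_S, sleep=time.sleep):
    """Call send() until it returns a non-transient status or retries run out.

    send() returns an HTTP status code or raises TimeoutError.
    Returns the status of every attempt (None marks a timeout).
    """
    attempts = []
    for i in range(len(delays) + 1):
        try:
            status = send()
        except TimeoutError:
            status = None
        attempts.append(status)
        if status is not None and not is_transient(status):
            break
        if i < len(delays):
            sleep(delays[i])
    return attempts


# Simulated endpoint: two transient failures, then success.
responses = iter([503, 429, 200])
print(call_with_backoff(lambda: next(responses), sleep=lambda s: None))
```

Passing sleep as a parameter keeps the example testable without real waiting.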

Coverage

Works with the agent shapes you already ship.

No SDK to install on your agent. AgentDiff validates whatever HTTP surface you already expose.

Adapter shapes

OpenAI-compatible chat · Anthropic Messages · Google Gemini · Azure OpenAI · AWS Bedrock · LangChain · CrewAI · HuggingFace · Voiceflow · Botpress · Forethought · Retell · ElevenLabs · Perplexity · Poe · Devin · Swarms · Fetch.ai · Google ADK / A2A · Nanda A2A · Generic REST · Plain HTTP · MCP server (contact us)

Auth modes

No auth · Bearer token · Custom header · HTTP basic

Transport

HTTPS · Streaming SSE · WebSocket · Long-poll

Secrets

Envelope encryption at rest · Decrypted on residential node · Never logged · Never in webhooks

Use both

Sits next to your evals in the same PR check pipeline.

PR #218 · Refactor checkout agent prompt · Open

✓

evalview / dataset-diff

1,402 prompts · no regression

✓

evidently / drift-detection

no statistical drift detected

✓

ci / unit-tests

247 tests passed in 12.4s

agentdiff / residential-validate

Mumbai returned TOOL_FAIL — Lagos & São Paulo pass

Blocking

Your evals stay green. AgentDiff catches what they can't see.

For clarity

What AgentDiff is not.

Q.01

Is it a replacement for EvalView or Evidently?

No. Run them together.

They check golden datasets. We check geographic behavior from real residential ISPs. Both signals matter.

Q.02

Is this a generic CI evaluation tool?

No. It does one thing.

Paired validations against the same agent from the same residential node, then a verdict. Nothing else.

Q.03

Will it help with a localhost-only project?

Not yet. Ship a preview URL first.

AgentDiff validates deployed endpoints from real ISPs. If your agent never leaves your laptop, come back later.

Q.04

Is it a consumer-facing persona experiment?

No. It is engineering infrastructure.

Built for the engineers shipping agents to production. There are no roleplay simulations and no synthetic personas, just real validations from the real networks your users are on.
