
AgentStatus × Hippocratic AI

RWE-LLM, from outside the stack.

Hippocratic's RWE-LLM framework set a new bar: output testing over input validation, 6,234 clinicians, 307,038 evaluated calls. We agree with the thesis, and we're the external complement: continuous voice-aware validations of production agents from 800+ network egress points in ~30 countries, testing what gets through after the call leaves Hippocratic infrastructure.

800+ network egress points

~30 countries

Voice-aware: ASR, latency, IVR, multilingual

~10M test runs in ~2 months

agentstatus.dev | partner brief

Why we're reaching out

Internal output testing is the gold standard. External output testing is the next layer.

Dr. Bhimani's RWE-LLM paper makes an argument we deeply agree with: traditional input-side benchmarking is insufficient for healthcare, and comprehensive output testing across diverse real scenarios is the only path to safety assurance at scale.

The framework's four stages (pre-implementation, tiered review, resolution, and continuous monitoring) are well developed inside Hippocratic. The continuous monitoring stage, by definition, has the largest external surface: ASR variance across network paths, IVR handoffs to third-party systems, multilingual auto-switch on degraded audio, and response consistency across geographies.

That's the layer we test. Not what your 22-model constellation says. What gets through after the call leaves your infrastructure.

What AgentStatus is

Voice-aware external testing for production AI agents.

We run scheduled outside-in validations of voice agents (full audio in, full audio out) from 800+ network egress points across ~30 countries. We measure ASR quality, response latency, audio dropouts, IVR navigation success, multilingual auto-switch behavior, and response consistency against expected answers. We have completed ~10M test runs in the last 2 months across ~6,000 agents.

Voice validation runs over programmatic telephony from independent network egress: not a datacenter, not Hippocratic infrastructure, not the customer's environment. We do not evaluate clinical accuracy and we do not replace your clinicians; we test the deployment layer between your model and the patient experience.
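As an illustration of what "ASR quality" can mean in practice, here is a minimal sketch of word error rate (WER), a common transcription-accuracy metric. This brief does not state which ASR metric AgentStatus actually uses, so treat WER as an assumption; the function name and the sample transcripts are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: WER stands in for "ASR quality" here; it is not
    confirmed as the metric AgentStatus reports.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical scripted utterance and two made-up transcripts of it.
script = "please refill my blood pressure medication"
clean_path = word_error_rate(script, "please refill my blood pressure medication")
degraded_path = word_error_rate(script, "please refill my blood pressure meditation")
```

Comparing the same scripted utterance across audio paths keeps the spoken content fixed, so any difference in the score reflects the path rather than the speaker.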

Where we fit

Two layers of output testing.

Hippocratic: RWE-LLM

• 6,234 clinicians evaluating call outputs
• 22-model constellation on Hippocratic infra
• Clinical safety: did the agent give the right medical answer
• Internal continuous monitoring: spot-checks, A/B, RWE feedback

AgentStatus: External RWE

• Independent voice validations from outside the stack
• ~30 countries, 800+ network egress points, scheduled and continuous
• Deployment integrity: did the right answer make it through ASR, IVR, network, channel
• External continuous monitoring: voice-aware, geography-split, drift-aware

One sentence. RWE-LLM answers "did our agent say the right thing?" AgentStatus answers "did the right thing actually reach the patient, every time, from everywhere?"

The deployment surface

What only outside-in voice validations can find.

Your internal eval runs on Hippocratic's infrastructure with Hippocratic's audio path. The deployment surface, the layer between Polaris and the patient, has variance your nurse panel can't see:

• ASR degradation across network paths: transcription accuracy varies with audio compression, jitter, packet loss
• Response latency by geography: P50/P95 latency differs across regions and network conditions in ways that affect conversational feel
• IVR navigation drift: third-party systems (other providers, labs, pharmacies) change behavior; outside-in tests catch breaks early
• Multilingual auto-switch behavior: Spanish at 99.83% on internal eval, but how does the auto-switch hold up on lower-quality audio paths?
• Configuration / model-swap drift: silent regressions after infra updates, third-party API changes, or version rollouts
• Regional outage detection: a customer in Tampa might experience failures invisible from Palo Alto
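The latency-by-geography item above can be sketched as a per-region P50/P95 summary. This is a minimal example under stated assumptions: the record shape `(region, latency_ms)`, the function names, and every sample value are made up for illustration, and the nearest-rank percentile definition is one common choice, not necessarily the one AgentStatus uses.

```python
import math
from collections import defaultdict


def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def latency_by_region(runs):
    """Group (region, latency_ms) pairs and report P50 and P95 per region."""
    grouped = defaultdict(list)
    for region, latency_ms in runs:
        grouped[region].append(latency_ms)
    return {
        region: {
            "p50": nearest_rank_percentile(samples, 50),
            "p95": nearest_rank_percentile(samples, 95),
        }
        for region, samples in grouped.items()
    }


# Made-up response latencies in milliseconds; not real measurements.
runs = [
    ("tampa", 820), ("tampa", 910), ("tampa", 2400), ("tampa", 880),
    ("palo_alto", 610), ("palo_alto", 640), ("palo_alto", 700), ("palo_alto", 655),
]
summary = latency_by_region(runs)
```

Reporting P95 alongside P50 is what surfaces the slow tail: in the made-up data, one slow call in a region barely moves its median but dominates its P95.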

For your customers

An asset Hippocratic can hand to health systems.

Health system procurement teams increasingly ask: "How do we continuously verify the AI vendor is performing as promised post-deployment?" RWE-LLM is the answer for Hippocratic's internal validation. AgentStatus can be the answer Hippocratic hands its enterprise buyers: independent, third-party, ongoing deployment monitoring rather than vendor self-reporting.

This makes procurement easier for Hippocratic, not harder, and it fits the published RWE-LLM thesis on output testing.

Proof of scale

What we've run so far.

~10M test runs in ~2 months across the network. ~6,000 agents being tracked, including ones from companies you'd recognise (specifics under NDA).

We've also caught node operators trying to game the network with datacenter VMs instead of real consumer egress. Detection of adversarial behavior is built into the product: the same kind of rigor RWE-LLM applies to clinical outputs, applied to network integrity.

What we are not claiming

An external layer that coexists.

We are not clinicians. We are not adjudicating clinical accuracy. We are not replacing RWE-LLM, your clinician panel, or any part of Polaris. We do not handle PHI; tests run against demo or synthetic agent surfaces. We are the deployment-layer external complement to your internal output testing.

What we'd like from this conversation

A 30-minute methodology conversation.

01

A 30-minute call with Dr. Bhimani's team

On whether external voice-aware output testing belongs as a published extension of the RWE-LLM continuous monitoring stage.

02

A joint readout or co-authored write-up

Hippocratic publishes papers as standard practice. We'd love to contribute the external-monitoring chapter: your voice, our data, and real findings from validating one of your demo agents across geographies and network conditions.

03

One demo agent for a 2-week sandbox

A non-PHI surface we can validate externally for two weeks. Output: a methodology artifact showing ASR variance, latency distribution, IVR navigation correctness, and multilingual behavior across geographies.

Closing

Hippocratic argued that output testing is the gold standard. AgentStatus is that same testing from outside the stack: voice-aware, continuous, geography-distributed, and ready to be the published external chapter of RWE-LLM.


Contact

dulra@carmel.so | roman@carmel.so

"Test runs" and "agent rows" mean what we said above. Hippocratic AI and RWE-LLM descriptions are from public pages and announcements, not an endorsement by Hippocratic AI.