
AgentStatus × Cognigy partner brief

Outside-in validation for Cognigy AI agents.

Seven checks from real home networks, status through alerts, alongside Cognigy Insights, OData, and Live Agent. We complement inside-out visibility; we do not replace it. No instrumentation on your stack. Synthetic validations only, no production transcripts.

20M+ validations

8,000+ agents tracked

800+ residential nodes

30 countries

agentstatus.dev | partner brief

What we understand about Cognigy

Cognigy already gives enterprises strong inside-out visibility across voice, chat, and messaging.

Cognigy.AI spans build, deploy, analyze, and optimize across voice, chat, and messaging. Cognigy Insights delivers conversation analytics and NLU performance. OData API feeds BI teams raw conversation data. Live Agent handles escalation; Agent Copilot supports agents in real time.

For outside-in validation, Cognigy's REST Endpoint is the natural surface: POST with userId, sessionId, and Bearer token, the same path a channel integration would use.
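As a rough illustration of that surface, the sketch below builds (but does not send) one validation request. The URL, token, and IDs are placeholders, and the `text` body field is an assumption not stated in this brief; check Cognigy's REST Endpoint documentation for the exact payload and auth scheme before relying on it.

```python
import json
import urllib.request


def build_validation_request(endpoint_url, token, user_id, session_id, text):
    """Build (without sending) a POST for one synthetic validation turn.

    Field names userId/sessionId and the Bearer header follow the brief;
    the "text" field is an assumption for illustration.
    """
    body = json.dumps(
        {"userId": user_id, "sessionId": session_id, "text": text}
    ).encode("utf-8")
    return urllib.request.Request(
        endpoint_url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )


# Placeholder values only; nothing is sent over the network here.
req = build_validation_request(
    "https://example.invalid/rest-endpoint",
    "SANDBOX_TOKEN",
    "validator-001",
    "session-001",
    "How do I reset my password?",
)
```

Passing `req` to `urllib.request.urlopen` would send it; the sketch stops short of that so it can be inspected safely.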

What we validate

Uptime is table stakes. We watch the rest too.

Reliability, consistency, and robustness. From home networks, on a schedule. Plain verdicts you can explain to your boss.

Reliability

Does it keep working, stay fast, and behave the same over time?

Can people reach it?

Check

Not just 200 OK from your office. We hit it from home networks where your users actually are.

Fast enough to feel alive

Check

first reply < 1s

Slow first replies feel broken even when the answer eventually shows up. We watch time-to-first-byte, not just total wait.
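A minimal sketch of that distinction, assuming the reply arrives as an iterable of chunks. The function names, the injectable clock, and the mapping of a slow first reply to "degraded" are illustrative assumptions, not AgentStatus's implementation; the 1-second budget comes from the card above.

```python
import time


def time_to_first_chunk(chunks, clock=time.monotonic):
    """Return (seconds until first non-empty chunk, total seconds)."""
    start = clock()
    first = None
    for chunk in chunks:
        # Record the arrival time of the first chunk that carries data.
        if chunk and first is None:
            first = clock() - start
    total = clock() - start
    return first, total


def first_reply_verdict(first_s, budget_s=1.0):
    """Assumed mapping: no reply fails, a slow first reply is degraded."""
    if first_s is None:
        return "fail"
    return "pass" if first_s < budget_s else "degraded"
```

A reply can pass the first-reply budget even when the total wait is long, which is why the two numbers are reported separately.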

From home networks

Check

Same agent, different country, different result. Geo blocks and CDN quirks show up here first.

Keeps working, run after run

Check

✓ ✓ ✓ · ·

One green check means nothing. We track whether it stays healthy across dozens of scheduled runs.
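One way to put a number on "stays healthy" is the pass^k statistic named in the Architecture section: the chance that k runs drawn from the history all pass. The sketch below uses a common combinatorial estimator, C(c, k) / C(n, k) for c passes out of n runs. Whether AgentStatus computes it this way is an assumption.

```python
from math import comb


def pass_hat_k(total_runs, passed_runs, k):
    """Estimate the probability that k runs sampled without replacement all pass."""
    if not 0 < k <= total_runs:
        raise ValueError("k must be between 1 and total_runs")
    if not 0 <= passed_runs <= total_runs:
        raise ValueError("passed_runs must be between 0 and total_runs")
    # comb(c, k) is 0 when c < k, so too few passes yields 0.0.
    return comb(passed_runs, k) / comb(total_runs, k)
```

For example, 9 passes in 10 runs is a 90% pass rate, but `pass_hat_k(10, 9, 2)` is 0.8: the score drops quickly as k grows, which is what makes it a stability measure rather than an average.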

When behavior starts to drift

Check

Slow leaks matter. Pass rates, latency, and answer patterns shifting week over week get flagged before users notice.
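As an illustration of week-over-week flagging for pass rates, the sketch below compares a baseline week with the current week using a two-proportion z-score. The test, the threshold of 2.0, and the function name are assumptions chosen for the example; the brief does not say which statistic AgentStatus uses.

```python
import math


def pass_rate_drift(base_pass, base_n, cur_pass, cur_n, z_threshold=2.0):
    """Flag a drop in pass rate between a baseline week and the current week."""
    p_base = base_pass / base_n
    p_cur = cur_pass / cur_n
    pooled = (base_pass + cur_pass) / (base_n + cur_n)
    se = math.sqrt(pooled * (1 - pooled) * (1 / base_n + 1 / cur_n))
    # With zero variance (all pass or all fail in both weeks) there is no drift signal.
    z = 0.0 if se == 0 else (p_base - p_cur) / se
    return {"baseline": p_base, "current": p_cur, "z": z, "drifted": z > z_threshold}
```

A fall from 98/100 to 80/100 is flagged; a fall from 98/100 to 97/100 is within noise and is not.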

Answer quality

Not just reachable. Actually helpful, shaped right, and checked more than once.

Are the answers good?

Check

"How do I reset my password?"

PASS

We read replies like a human would: pass, degraded, or inconclusive, with a short explanation you can act on.

Two reviewers had to agree

Check

both agree

When there is no single right string, two independent reviewers weigh in. If they disagree, we report inconclusive rather than raise a false alarm.
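The agreement rule can be sketched in a few lines. The verdict labels come from this brief (pass, degraded, inconclusive); the function name and error handling are illustrative assumptions.

```python
VERDICTS = {"pass", "degraded", "inconclusive"}


def combine_reviews(first, second):
    """Return the shared verdict when both reviewers agree, else 'inconclusive'."""
    for verdict in (first, second):
        if verdict not in VERDICTS:
            raise ValueError(f"unknown verdict: {verdict!r}")
    return first if first == second else "inconclusive"
```

Under this rule a single dissenting reviewer can never produce a failing verdict on its own, which is the point of requiring agreement.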

Your example questions

Check

“How do I cancel my plan?”

“Refund my last order, please.”

“Reschedule for next Tuesday.”

“Talk to a human, now.”

Bring your own scenarios: refunds, escalations, edge cases you already worry about.

Questions for what it actually does

Check

real chats

real replies

new test questions

We watch how the agent behaves in the wild and keep testing with fresh example questions that match its real job.

Exact shape, every time

Check

{
"status": "ok"
}
shape matched

JSON APIs and structured replies should look like you promised. We check the shape, not just the vibe.
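A minimal sketch of a shape check, assuming the promised shape is written as a nested dict of field names to Python types. This hand-rolled checker is for illustration only; a production setup would more likely use a schema standard such as JSON Schema.

```python
def shape_mismatches(value, expected, path="$"):
    """Return a list of human-readable mismatches; an empty list means the shape matched."""
    if isinstance(expected, dict):
        if not isinstance(value, dict):
            return [f"{path}: expected object"]
        problems = []
        for key, sub_expected in expected.items():
            if key not in value:
                problems.append(f"{path}.{key}: missing")
            else:
                problems.extend(
                    shape_mismatches(value[key], sub_expected, f"{path}.{key}")
                )
        return problems
    # Leaf case: expected is a Python type such as str, int, or list.
    if not isinstance(value, expected):
        return [f"{path}: expected {expected.__name__}"]
    return []
```

Extra fields in the reply are tolerated here; only missing or wrongly typed fields count as mismatches.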

Extra eyes when it looks fishy

Check

flagged sample

agree

new rule

Automated checks miss nuance. Flag samples for a human look, then turn repeat mistakes into a rule.

Conversations

Does it finish the job? Real goals, pursued over several messages, judged on the outcome.

A user with a real goal

Check

Hi, can you...

Sure, here's...

And after that?

Still on track.

We give a simulated user something real to want, like a refund, a booking, or a fix, and let them pursue it over several messages, clarifying and pushing back like a real person.

Did they get what they came for?

Check

goal achieved

When the conversation ends, an independent reviewer reads the whole transcript and makes one call: goal achieved or not.

We don't take its word for it

Check

claim → evidence → held up

If the agent claims success, we double-check the claim against outside evidence where it exists, such as a cited page, a structured field, or a second reviewer, before counting it.

Consistency

Same situation, same story. No flip-flopping when wording, mood, or follow-ups change.

Same question, different words

Check

same intent · different words · drift flagged

Rephrased prompts should not flip the answer. We flag replies that swing for no good reason.
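A sketch of how such a flag could be computed, assuming each reply to a paraphrase has already been reduced to a short answer label by a reviewer. The labels and the function name are hypothetical.

```python
from collections import Counter


def paraphrase_consistency(answer_labels):
    """Summarize agreement across replies to rephrasings of one question."""
    if not answer_labels:
        raise ValueError("need at least one reply label")
    counts = Counter(answer_labels)
    majority, majority_count = counts.most_common(1)[0]
    return {
        "majority": majority,
        "agreement": majority_count / len(answer_labels),
        # Any reply outside the majority label counts as a swing worth flagging.
        "drift_flagged": len(counts) > 1,
    }
```

The agreement ratio gives a gentler signal than the boolean flag when a team wants to tolerate rare outliers.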

Calm ask vs panicked ask

Check

Just curious about billing

Quick FAQ answer

A billing FAQ and a locked-out account are not the same urgency. Good agents step up when stakes rise.

Follow-ups still line up

Check

A → B → C

aligned

Ask A, then B, then C. Later answers should not contradict what the agent said two turns ago.

Robustness

Tools, streams, rules, and people trying to break it.

Do tools get called?

Check

{ search }

{ lookup }

{ book }

tool_call → 200 OK

Send questions that should trigger an action. Catch broken integrations before customers hit them.

MCP servers

Check

Discover tools and resources, run calls, and check the outputs for agents wired through MCP.

Streaming that finishes

Check

stream · 1/4 chunks

Streaming endpoints can hang, stall, or never complete. We catch that, not just the final text blob.
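The three failure modes can be told apart from chunk arrival times alone. The sketch below assumes each stream event is recorded as (seconds since request, chunk text, is-final flag); the event layout, the 5-second gap limit, and the verdict names are illustrative assumptions.

```python
def stream_verdict(events, max_gap_s=5.0):
    """Classify a recorded stream as 'completed', 'stalled', or 'never_completed'."""
    last_t = 0.0
    for t, _chunk, is_final in events:
        # A long silence before this chunk counts as a stall, even if it recovers.
        if t - last_t > max_gap_s:
            return "stalled"
        if is_final:
            return "completed"
        last_t = t
    # The stream ended, or recording stopped, without a final marker.
    return "never_completed"
```

A check that only inspects the assembled text would pass the second and third cases whenever the partial text looked plausible.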

Your safety rules

Check

callback

gate

rules

allow

Tell us what the agent must never do, and what it should still do. We check both sides.

When someone tries to trick it

Check

ignore your rules...

held the line

Pushy, weird, or adversarial prompts. Does the agent stay on policy or fold?

When it breaks, you know

Alerts, deploy gates, and one place to see it all.

Alerts when it breaks

Check

Slack

Webhook

PagerDuty

Slack, webhooks, PagerDuty, with enough context to fix it, not just a red dot.

Block bad deploys before ship

Check

feature → merge blocked

agent failed outside-in check

Wire us into CI so a broken agent does not merge just because the unit tests passed.
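In most CI systems a gate is just a step that exits non-zero. The sketch below turns a set of check verdicts into an exit code. How verdicts are fetched from AgentStatus is not described in this brief, so the input dict and check names are hypothetical.

```python
def gate_exit_code(verdicts, blocking=("fail",)):
    """Return 1 if any check has a blocking verdict, else 0."""
    blocked = [name for name, verdict in verdicts.items() if verdict in blocking]
    for name in blocked:
        print(f"merge blocked: {name} failed outside-in check")
    return 1 if blocked else 0
```

A CI step would call `sys.exit(gate_exit_code(verdicts))`. Teams that want a stricter gate can pass `blocking=("fail", "degraded")`.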

One dashboard

Check

Every check rolls up into a simple grid: what is OK, what is still collecting, and what needs attention.

Architecture

Validations run from home networks, hit your REST endpoint, and return as dashboard evidence.

Fabric nodes execute transport validations: reachability, node-side gold checks, streaming latency, and tool or MCP conformance against your REST endpoint. Raw results return to the AgentStatus backend, which schedules cycles, runs portfolio validations and adaptive scenarios, and applies the instrument layer on top of node evidence.

That instrument layer is not a single semantic pass. It includes format contracts, dual-review semantic scoring, metamorphic and task checks, outcome verification where configured, and statistical rollups: pass^k stability, drift baselines, alerts, and explain traces in the dashboard. Your Cognigy platform (Insights, OData, transcripts) stays inside your boundary. We do not pull from it.

Figure 1. Synthetic validations leave Fabric nodes, hit the Cognigy REST Endpoint, and return as verdicts in the customer's AgentStatus dashboard. No Insights or OData export required.

What runs on nodes vs. the backend

Nodes execute transport. The backend runs instruments and rollups.

Fabric nodes sit on residential networks in 30 countries and hit your REST Endpoint the way a real channel would. The backend receives those raw results and runs the multi-instrument scoring layer plus portfolio statistics. Semantic review is one instrument among several, not the whole system.

Results are verdict metadata, latency, region, and evidence snippets, not production conversation exports.

Data privacy

You do not need to pass customer conversation data to AgentStatus for this to work.

Most teams ask whether they must pass customer conversation data to AgentStatus. You do not. The pilot default is the Cognigy sandbox only: synthetic prompts you define, validation credentials you provision, no end-customer PII, and no production transcript pipeline.

What we need

• REST Endpoint URL + auth for validation traffic
• Synthetic prompts / scenarios you approve
• Agent responses to those validation prompts
• Verdict metadata (pass/fail, latency, region)

What we do not need

• Cognigy Insights exports or OData bulk feeds
• Production conversation transcripts
• End-customer PII
• Access inside your VPC beyond the validation endpoint

Credential-based, customer-approved monitoring is the right model for enterprise trust. We can share data-handling detail, retention, and audit evidence under NDA for security review.

Where we fit

AgentStatus complements Cognigy Insights. It does not compete with it.

01

Cognigy sees inside the platform; we see what the channel actually delivered to a user.

Cognigy sees what happens inside the platform and what you export to your stack. AgentStatus answers a different question: what did the real channel actually deliver from a specific geography, network path, and latency profile?

02

Insights aggregates conversations; we independently run seven validation groups from outside your stack.

Insights gives you conversation truth. We verify reachability, answer correctness, multi-turn task completion, consistency under rephrasing, and robustness under stress, with evidence procurement can audit.

03

We execute from more than 800 residential nodes across 30 countries, not a single cloud region.

Issues that reproduce only from specific locations or ISPs surface here first. Such issues are common across Cognigy's enterprise and telco customer base.

04

We only connect with credentials and scenarios your team approves.

No automatic discovery of Cognigy customers. Sandbox REST Endpoint, agreed synthetic scenarios, customer-approved credentials, aligned with procurement and data governance.

The split

Inside-out conversation truth and outside-in validation answer different questions.

Cognigy: inside-out

• Build, deploy, analyze, optimize
• Cognigy Insights dashboards
• OData API event exports
• Transcripts and session-level data
• Live Agent escalation and Copilot

AgentStatus: outside-in

• We check whether the agent is reachable from real home networks right now.
• We track whether it keeps working, stays fast, and holds its pass rate over time.
• We grade whether answers are actually good, not just HTTP 200.
• We run goal-driven conversations and verify whether the user got what they came for.
• We test whether replies stay consistent when wording, stakes, or follow-ups change.
• We stress tools, streams, safety rules, and adversarial inputs before customers do.
• We alert your team when something breaks, with evidence attached.

Proof of scale

We state scale metrics with plain definitions so procurement can audit the claims.

On the order of 20 million validation runs across the network. On the order of 8,000 agent records in our system: these are configurations we track, including evaluation and pipeline agents, not "8,000 paying customers."

Stricter production-only definitions available under NDA.

What we are not claiming

We are an independent evidence layer, not a replacement for Insights or OData.

We are not a replacement for Cognigy Insights, OData exports, or the Live Agent transcript layer. We help teams correlate outside-in validation outcomes with inside-out conversation truth when both matter to the buyer.

Suggested next steps

A sandbox pilot is the lowest-risk way to see if this fits your governance rules.

01

Start with a two-week sandbox pilot using synthetic scenarios only.

Sandbox REST Endpoint (URL, userId, sessionId, Bearer token), agreed synthetic scenarios, no production traffic, no end-customer data. Written report: what we validated, what passed, what drifted.

02

Walk through security and data governance before anything touches production.

How AgentStatus connects under enterprise privacy requirements: validation-traffic boundaries, least privilege, retention, audit evidence. We expect this before any production-adjacent work.

03

Decide whether internal QA, joint customer proof, or both is the right first use case.

Cognigy-internal QA, a joint customer who wants third-party evidence alongside Insights, or both.

Closing

Cognigy helps enterprises build and operate serious AI agents at scale. AgentStatus helps them continuously prove that those agents behave the way policy and customers require, globally, with evidence that holds up under scrutiny.


Contact

dulra@carmel.so, roman@carmel.so

Metrics use explicit definitions: validation runs are scheduled executions; agent records are database rows, not revenue customers. Cognigy product references reflect public documentation as of this note.