Monitoring MCP Servers in Production: The Layer Your Stack Forgot

Your agent works. Your model provider is up. Your prompt evals pass. Your dashboard is green.

Then a customer pings you: "The 'lookup_order' tool keeps failing."

You check the agent endpoint: 200 OK. You check the LLM API: green. You check your retrieval store: fine. Two hours later you discover the problem wasn't in any of those layers. The Model Context Protocol (MCP) server that exposes your tools to the agent had silently changed transports during a deploy. The agent was reaching it. The handshake was failing. Every tool call was returning an error wrapped in a perfectly valid agent response.

This is the layer almost no monitoring stack covers, and the one most likely to break next.


Why MCP Is Different (And Dangerous)

MCP is the protocol that exposes tools, resources, and prompts to LLM agents. In 2025 it was a curiosity. By mid-2026 it's everywhere. Claude Desktop, Cursor, agentic IDEs, internal LLM gateways, and most production agent stacks now sit on top of one or more MCP servers.

Three properties make MCP failures uniquely invisible to traditional monitoring:

  1. The agent is the consumer, not the user. A broken MCP call doesn't return a 5xx to your user-facing API. It returns a successful agent response that contains an error message, an apology, or worse, a hallucinated answer.
  2. Transports drift. A server can switch between HTTP, SSE, and Streamable HTTP between deploys. Your agent SDK may downgrade silently. Your monitoring sees a 200 because tools/list returned something, even if that something is unusable.
  3. The contract isn't a URL. It's a JSON-RPC method set: tools/list, resources/list, and prompts/list, plus a tools/call per tool. None of those show up in URL-level uptime checks.

If you're monitoring MCP servers the way you monitor REST APIs, you're not monitoring them at all.
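
To make the third point concrete, here is a minimal sketch of what a protocol-level probe sends and how it might judge the reply. The method names come from the list above. The helper names (`jsonrpc_request`, `is_usable_tools_list`) are hypothetical, and the check assumes the `result.tools` array that a tools/list response carries; it says nothing about how the bytes travel.

```python
import json
from itertools import count

_ids = count(1)  # monotonically increasing JSON-RPC request ids


def jsonrpc_request(method, params=None):
    """Build a JSON-RPC 2.0 request envelope for an MCP method."""
    msg = {"jsonrpc": "2.0", "id": next(_ids), "method": method}
    if params is not None:
        msg["params"] = params
    return msg


def is_usable_tools_list(body):
    """Return True only if body is a JSON-RPC result containing a tools array.

    An HTTP 200 whose body carries a JSON-RPC "error" member, or a result
    without a list of tools, counts as a failure.
    """
    try:
        msg = json.loads(body)
    except (TypeError, ValueError):
        return False  # not JSON at all (e.g. an HTML error page)
    if not isinstance(msg, dict) or msg.get("jsonrpc") != "2.0" or "error" in msg:
        return False
    result = msg.get("result")
    return isinstance(result, dict) and isinstance(result.get("tools"), list)


# The request a URL-level uptime check never sends.
probe = jsonrpc_request("tools/list")
```

The point of the second helper is that "the server answered" and "the server answered with a usable listing" are separate assertions. A status-code check only makes the first one.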


Failure Mode 1: The Transport Swap

In the Agent Status validate path, this is the difference between:

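# Verdict fields recorded when the server answers over an unsupported transport.
# `result` is initialised here only so the fragment runs on its own.
result = {}
result['error'] = 'transport_not_supported: sse'
result['transport_type'] = 'sse'
result['transport_supported'] = False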

and the silent-200 your dashboard would otherwise report.
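
As a rough illustration of how those three fields could be produced, the sketch below compares the transport a server answered with against the set your client supports. The field names match the fragment above. The function name, the `streamable_http` label, and the assumption that the transport has already been detected upstream are all hypothetical.

```python
def transport_verdict(detected, supported):
    """Fill the transport fields of a validate result.

    detected:  label for the transport the server actually answered with
    supported: set of transport labels the agent's SDK can speak
    """
    result = {
        "transport_type": detected,
        "transport_supported": detected in supported,
    }
    if not result["transport_supported"]:
        result["error"] = f"transport_not_supported: {detected}"
    return result


# Hypothetical client that only speaks Streamable HTTP.
verdict = transport_verdict("sse", supported={"streamable_http"})
```

The HTTP status never enters the decision. A server can return 200 on every request and still land in the unsupported branch.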


Failure Mode 2: The Vanishing Tool

A real validate should answer three distinct questions:

| Question | Validation |
| --- | --- |
| Is the server reachable? | tools/list returns 200 with valid JSON-RPC |
| Are the right tools present? | Diff result.tools against the expected set |
| Do the tools actually work? | Run tools/call with a known input, then assert the known output |

A tool that exists in the listing but errors on every call is the most expensive kind of broken: it looks healthy and acts dead.
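
The three questions can be sketched as one function that answers them in order. This is an illustration, not the Agent Status implementation: `list_tools` and `call_tool` are hypothetical callables standing in for whatever MCP client you use, and `cancel_order` plus the probe input and output are made up for the example.

```python
def three_question_validate(list_tools, call_tool, expected_tools, probe):
    """Answer reachability, inventory, and behavior, in that order.

    list_tools() -> iterable of tool names (raises on transport/protocol failure)
    call_tool(name, arguments) -> tool output (raises on tool error)
    probe = (tool_name, arguments, expected_output)
    """
    report = {"reachable": False, "missing_tools": None, "behavior_ok": None}
    try:
        listed = set(list_tools())
    except Exception:
        return report  # unreachable: the later questions cannot be answered
    report["reachable"] = True
    report["missing_tools"] = sorted(set(expected_tools) - listed)

    name, arguments, expected_output = probe
    if name not in listed:
        report["behavior_ok"] = False
        return report
    try:
        report["behavior_ok"] = call_tool(name, arguments) == expected_output
    except Exception:
        report["behavior_ok"] = False  # listed but dead
    return report


# Stand-ins for a server that lists its tools but fails every call.
def fake_list():
    return ["lookup_order", "cancel_order"]


def fake_call(name, arguments):
    raise RuntimeError("upstream timeout")


report = three_question_validate(
    fake_list,
    fake_call,
    expected_tools=["lookup_order"],
    probe=("lookup_order", {"order_id": "A1"}, {"status": "shipped"}),
)
```

With these stand-ins the report shows a reachable server, a complete inventory, and a failed behavior check, which is the "looks healthy and acts dead" case.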


Failure Mode 3: The Slow Discovery


Failure Mode 4: The Auth Boundary


Failure Mode 5: The Smithery / Gateway Hop


What "Real" MCP Monitoring Looks Like

A complete MCP validate answers six questions on every cycle:

  1. Reachability: did tools/list succeed?
  2. Transport: does the server answer over a transport my agent's SDK supports?
  3. Inventory: are the expected tools, resources, and prompts present and correctly named?
  4. Schema: does each tool's input/output contract match what the agent prompt expects?
  5. Behavior: does at least one tools/call per critical tool return the expected output for a known input?
  6. Performance: is discovery plus invocation latency within the user-experience SLA?

A failure of any one is a customer-visible defect, even when the URL still returns 200.

This is exactly the validate surface Agent Status runs against your MCP servers. The verdict isn't UP/DOWN based on HTTP. It's:

  • UP: reachable, transport supported, all tool validations pass
  • DEGRADED (mcp_tool_fail): reachable, but one or more tool calls fail
  • DEGRADED (mcp_transport_unsupported: <type>): the server changed transports out from under your client
  • DOWN (mcp_unreachable): the handshake failed entirely

Each verdict is attributable, alertable, and actionable.
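
One way to picture that mapping is a small function that turns check results into one of the four verdict strings. The strings are the ones listed above. The function name, its parameters, and the precedence (unreachable first, then transport, then tools) are assumptions made for this sketch.

```python
def mcp_verdict(reachable, transport_type, transport_supported, failed_tools):
    """Collapse check results into a single attributable verdict string."""
    if not reachable:
        return "DOWN (mcp_unreachable)"
    if not transport_supported:
        return f"DEGRADED (mcp_transport_unsupported: {transport_type})"
    if failed_tools:
        return "DEGRADED (mcp_tool_fail)"
    return "UP"


# A server that answers, over a supported transport, with one dead tool.
status = mcp_verdict(
    reachable=True,
    transport_type="streamable_http",
    transport_supported=True,
    failed_tools=["lookup_order"],
)
```

Because each branch names its cause, an alert built on the return value already says where to look.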


The Validation Profiles That Matter

Different MCP servers need different validate depths. We use four:

| Profile | What it does | When to use |
| --- | --- | --- |
| health_only | tools/list only | Public servers, cheap continuous checks |
| full_discovery | tools + resources + prompts | Servers exposing more than just tools |
| tool_contract | Discovery + named tools/call validations | Production servers with known critical tools |
| full_validation | Discovery + auto-generated validations from listings | Servers under active development |

Most teams default to health_only, which is roughly equivalent to a TCP ping. It's better than nothing, and it misses everything that matters.
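
To see how much each profile actually exercises, here is a hedged sketch that expands a profile name into the JSON-RPC calls it implies. The profile names are the four above. The `plan_checks` helper and the reading of "Discovery" as all three list methods are assumptions, not a description of how Agent Status schedules its checks.

```python
DISCOVERY = ["tools/list", "resources/list", "prompts/list"]


def plan_checks(profile, named_tools=(), listed_tools=()):
    """Return the ordered (method, tool_name) steps a profile would run.

    named_tools:  critical tools you chose by hand (tool_contract)
    listed_tools: every tool the server advertises (full_validation)
    """
    if profile == "health_only":
        return [("tools/list", None)]
    steps = [(method, None) for method in DISCOVERY]
    if profile == "full_discovery":
        return steps
    if profile == "tool_contract":
        return steps + [("tools/call", tool) for tool in named_tools]
    if profile == "full_validation":
        return steps + [("tools/call", tool) for tool in listed_tools]
    raise ValueError(f"unknown profile: {profile}")


cheap = plan_checks("health_only")
contract = plan_checks("tool_contract", named_tools=["lookup_order"])
```

Laid out this way, the gap is visible: health_only never issues a single tools/call.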


Quick Wins

If you only do three things this week:

1. Diff your tool inventory daily

Snapshot tools/list once a day. Diff against yesterday. Alert on any change you didn't ship.
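
A minimal sketch of that diff, assuming each snapshot is the list of tool objects from a tools/list result, each with a `name` and an `inputSchema`. The function names and the sample tools are hypothetical. Hashing the whole tool definition means a changed schema or description is flagged even when the name is untouched.

```python
import hashlib
import json


def fingerprint(tools):
    """Map tool name -> short hash of its full definition."""
    return {
        tool["name"]: hashlib.sha256(
            json.dumps(tool, sort_keys=True).encode()
        ).hexdigest()[:12]
        for tool in tools
    }


def diff_snapshots(yesterday, today):
    """Report tools that were added, removed, or changed between snapshots."""
    old, new = fingerprint(yesterday), fingerprint(today)
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(n for n in old.keys() & new.keys() if old[n] != new[n]),
    }


schema = {"type": "object", "required": ["order_id"]}
yesterday = [
    {"name": "lookup_order", "inputSchema": schema},
    {"name": "cancel_order", "inputSchema": schema},
]
today = [
    {"name": "lookup_orders", "inputSchema": schema},  # silently renamed
    {"name": "cancel_order",
     "inputSchema": {"type": "object", "required": ["order_id", "reason"]}},
]
changes = diff_snapshots(yesterday, today)
```

A rename shows up as one removal plus one addition, and a tightened schema shows up under "changed". Either is worth an alert if you didn't ship it.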

2. Validate at least one critical tool end-to-end

Pick the single MCP tool whose failure would hurt customers most. Call it on every validate cycle with a known input. Assert known output. This catches 80% of "broken in a way nobody notices" cases.
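
What "assert known output" means at the protocol level can be sketched as below. It assumes the tools/call response shape from the MCP specification, where a tool failure can arrive inside a successful JSON-RPC result flagged with `isError`. The function name and sample payloads are hypothetical.

```python
def tool_call_passed(response, expected_substring):
    """Judge a parsed tools/call JSON-RPC response against a known output."""
    if "error" in response:
        return False  # protocol-level failure, e.g. unknown tool
    result = response.get("result") or {}
    if result.get("isError"):
        return False  # tool ran and reported failure inside a valid result
    text = "".join(
        part.get("text", "")
        for part in result.get("content", [])
        if part.get("type") == "text"
    )
    return expected_substring in text


healthy = {
    "jsonrpc": "2.0", "id": 7,
    "result": {"content": [{"type": "text", "text": "Order A1: shipped"}],
               "isError": False},
}
wrapped_failure = {
    "jsonrpc": "2.0", "id": 8,
    "result": {"content": [{"type": "text", "text": "upstream timeout"}],
               "isError": True},
}
```

The second payload is the one a status-code check waves through: valid JSON-RPC, a result member present, and a tool that did not work.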

3. Validate through the same path your agent uses

If your agent calls MCP through a gateway, validate through the gateway. If it authenticates, your validate authenticates. The validate's job is to be indistinguishable from the agent; anything else is theater.


The Bottom Line

The MCP layer sits between your agent and everything it can actually do. When it breaks, your agent doesn't fail; it lies. It returns confident, fluent responses that just happen to omit the capabilities it lost.

Traditional uptime monitoring catches none of this. URL pings, HTTP status codes, and even LLM evaluators can't see a tool that's been silently renamed, a transport that quietly switched, or a gateway that started rejecting connections at the boundary.

If your agent depends on MCP (and increasingly, every production agent does), you need monitoring designed for the protocol, not for the URL it happens to live behind.

Your MCP server is one renamed tool away from a customer-visible regression. The only question is whether you'll find out from a validate or from a support ticket.