Every dashboard you have was designed for a world where failures look like 5xx responses.
That world doesn't include AI agents.
Modern agents fail in places where no HTTP code is ever emitted. The model provider hangs mid-stream. A tool call enters a retry loop nobody bounded. A background worker picks up a job and dies between heartbeats. An async planner gets stuck on a task it can't decompose. The agent isn't returning errors; it isn't returning at all. And because nothing was ever sent to your edge, your edge has nothing to report.
This is the most expensive failure mode in production agents and the one with the least monitoring coverage. Because the failure happens inside the agent loop, traditional observability, built for request/response systems with clean status codes, sees nothing.
Let's go through what actually breaks, why your stack misses it, and what monitoring these failures correctly looks like.
Why Agents Break Outside HTTP
Traditional web services are stateless. A request comes in, work happens synchronously, a response goes out. Every interaction emits a status code. Failures are observable by definition.
Agents are not that.
A modern production agent typically runs:
- A request handler that enqueues a job rather than executes it
- A worker that picks up the job and plans a response (multiple LLM calls)
- Tool invocations that may call other agents, MCP servers, or long-running APIs
- A streaming response back to the user via SSE or WebSocket
- Background reflection, memory writes, and follow-up workers triggered post-response
A failure can happen at any layer, and most layers don't speak HTTP. Some don't even have a request/response shape. They have steps that are supposed to complete and sometimes don't.
When something in this graph breaks, the user sees one of three things:
- Nothing. The connection sits open, gets killed by an upstream proxy minutes later, and a generic timeout page eventually loads.
- A truncated stream. The first 40 tokens arrive. The rest don't. The user assumes the response was complete.
- A confident lie. The agent replies with whatever it had before the loop went sideways. The result reads correct and isn't.
None of these produces a 5xx. None triggers your error-rate alert. All three are customer-impacting and invisible.
Failure Mode 1: The Stuck Worker
A reasonable watchdog query looks like this:
    UPDATE jobs
    SET status = 'failed',
        error_message = COALESCE(error_message, 'watchdog_timeout'),
        completed_at = NOW()
    WHERE id = $1
      AND status IN ('queued', 'executing');

The AND status IN (...) clause is the part that matters. It means a worker that reached a terminal state just before the watchdog's update landed still wins: the row no longer matches, so the update touches nothing. Without it, you're racing yourself.
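To see the guard do its job, here is a minimal sketch that runs the same update against an in-memory SQLite database. The table layout and the helper names (`fail_if_still_running`, `make_demo_db`) are assumptions for illustration, and SQLite's `CURRENT_TIMESTAMP` stands in for `NOW()`.

```python
import sqlite3

# Same guarded UPDATE as above, adapted to SQLite placeholders.
WATCHDOG_SQL = """
UPDATE jobs
SET status = 'failed',
    error_message = COALESCE(error_message, 'watchdog_timeout'),
    completed_at = CURRENT_TIMESTAMP
WHERE id = ?
  AND status IN ('queued', 'executing')
"""


def fail_if_still_running(conn: sqlite3.Connection, job_id: int) -> bool:
    """Mark a job failed only if it has not reached a terminal state.

    Returns True if the watchdog changed the row, False if a worker
    had already finished (or failed) the job.
    """
    cur = conn.execute(WATCHDOG_SQL, (job_id,))
    conn.commit()
    return cur.rowcount == 1


def make_demo_db() -> sqlite3.Connection:
    """Build a hypothetical jobs table with one stuck and one finished job."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE jobs (id INTEGER PRIMARY KEY, status TEXT NOT NULL, "
        "error_message TEXT, completed_at TEXT)"
    )
    conn.executemany(
        "INSERT INTO jobs (id, status) VALUES (?, ?)",
        [(1, "executing"), (2, "completed")],
    )
    conn.commit()
    return conn
```

Calling `fail_if_still_running` on job 1 flips it to failed; calling it on job 2 changes nothing, because the worker's terminal state no longer matches the guard.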
Failure Mode 2: The Provider Stream That Never Ends
Streaming is a special case. A connection that's "alive" but not delivering bytes is the most insidious form of broken. The only defense is treating time-since-last-byte as a vital sign, not just connection state.
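One way to treat time-since-last-byte as a vital sign is to wrap the stream so that every chunk has its own deadline. The sketch below assumes the provider SDK hands you an async iterator of text chunks; `with_stall_timeout` and `StreamStalled` are hypothetical names, not part of any SDK.

```python
import asyncio
from typing import AsyncIterator


class StreamStalled(Exception):
    """Raised when the stream stays silent longer than the allowed gap."""


async def with_stall_timeout(
    stream: AsyncIterator[str], max_gap_s: float
) -> AsyncIterator[str]:
    """Re-yield chunks from `stream`, failing if any single gap exceeds max_gap_s.

    The deadline resets after every chunk, so a long but healthy stream
    passes while an open-but-silent connection is cut off.
    """
    iterator = stream.__aiter__()
    while True:
        try:
            chunk = await asyncio.wait_for(iterator.__anext__(), timeout=max_gap_s)
        except StopAsyncIteration:
            return  # the provider closed the stream normally
        except asyncio.TimeoutError:
            raise StreamStalled(f"no data received for {max_gap_s}s") from None
        yield chunk
```

The consumer iterates over `with_stall_timeout(provider_stream, 15.0)` instead of the raw stream and handles `StreamStalled` as a terminal failure for the job. The 15-second figure is a placeholder; pick a gap that fits your provider's normal pacing.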
Failure Mode 3: The Recursive Planner
The default for any agent loop should be: bounded, and loud when it hits the bound. The worst outcome is a silent runaway. The second-worst is a successful response that arrived too late to matter.
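A bounded loop can be as small as a budget object that the planner has to consult on every step. This is a sketch under assumed names and limits (`LoopBudget`, `BudgetExceeded`, and the default numbers are all illustrative), not a prescription for any particular framework.

```python
import time
from dataclasses import dataclass, field


class BudgetExceeded(RuntimeError):
    """Raised when an agent run hits one of its hard limits."""


@dataclass
class LoopBudget:
    # Illustrative defaults; choose limits that match your own workload.
    max_depth: int = 4
    max_tool_calls: int = 20
    max_wall_clock_s: float = 120.0
    tool_calls: int = 0
    started_at: float = field(default_factory=time.monotonic)

    def check_depth(self, depth: int) -> None:
        """Call before the planner recurses one level deeper."""
        if depth > self.max_depth:
            raise BudgetExceeded(f"planning depth {depth} exceeds {self.max_depth}")

    def check_clock(self) -> None:
        """Call at any step boundary to enforce the wall-clock cap."""
        elapsed = time.monotonic() - self.started_at
        if elapsed > self.max_wall_clock_s:
            raise BudgetExceeded(f"run took {elapsed:.1f}s, cap is {self.max_wall_clock_s}s")

    def charge_tool_call(self) -> None:
        """Call before every tool invocation; also re-checks the clock."""
        self.tool_calls += 1
        if self.tool_calls > self.max_tool_calls:
            raise BudgetExceeded(f"tool call {self.tool_calls} exceeds {self.max_tool_calls}")
        self.check_clock()
```

Raising an exception rather than returning a flag is deliberate: the run ends in an explicit failed state that someone can count, instead of drifting on.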
Failure Mode 4: The Background Job That Wasn't
The asymmetry of foreground vs. background reliability is where most "the agent feels worse over time" complaints originate. The foreground response works. The background pipeline that maintains state quietly broke.
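Making that asymmetry visible mostly means breaking success rate out by job type instead of reporting one blended number. A minimal sketch, assuming each finished job is available as a `(job_type, status)` pair; the job type names in the usage note are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def success_rate_by_type(outcomes: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Return the fraction of jobs that ended 'completed', per job type."""
    totals: Counter = Counter()
    completed: Counter = Counter()
    for job_type, status in outcomes:
        totals[job_type] += 1
        if status == "completed":
            completed[job_type] += 1
    return {job_type: completed[job_type] / totals[job_type] for job_type in totals}
```

Fed a mix of foreground `respond` jobs and background `memory_write` jobs, this keeps a healthy foreground rate from hiding a background pipeline that fails half the time.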
Failure Mode 5: The Job Lifecycle You Don't Own
If you only count failures the provider tells you about, you're underreporting. The most honest number is the share of jobs that never reached any terminal state: 1 - (terminal_jobs / submitted_jobs). Anything above zero is failure nobody reported.
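As a sketch of that arithmetic, assuming you can count submitted jobs and jobs that reached a terminal state (completed or failed) over the same window; the function name is made up for illustration.

```python
def unterminated_fraction(submitted_jobs: int, terminal_jobs: int) -> float:
    """Fraction of submitted jobs that never reached a terminal state.

    Computes 1 - (terminal_jobs / submitted_jobs). Zero means every job
    was accounted for; anything above zero is unreported failure.
    """
    if submitted_jobs < 0 or terminal_jobs < 0:
        raise ValueError("counts must be non-negative")
    if terminal_jobs > submitted_jobs:
        raise ValueError("more terminal jobs than submitted jobs; check the window")
    if submitted_jobs == 0:
        return 0.0
    return 1 - terminal_jobs / submitted_jobs
```

With 1,000 jobs submitted and 985 terminal, 15 jobs (1.5%) simply vanished, regardless of what the provider's own error count says.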
Failure Mode 6: The Tool Loop
What "Real" Agent-Loop Monitoring Looks Like
Six questions every healthy agent observability stack answers:
- Did the request enter the queue? (Submission tracked.)
- Did a worker pick it up within SLA? (Queue latency bounded.)
- Did execution complete in any state, success or failure? (Watchdog catches stuck.)
- Did the streaming response deliver bytes consistently? (TTFT + TBT tracked.)
- Did all background tasks triggered by the response succeed? (Per-job-type success rate.)
- Did the agent respect its loop bounds? (Depth, token spend, tool calls per turn.)
The minimum infrastructure to answer those:
- A jobs table with explicit status (queued, executing, completed, failed) and timestamps.
- A watchdog process that sweeps stuck jobs on an interval shorter than your SLA.
- Per-step latency tracking (queue wait, execution time, time-to-first-token).
- Reconciliation between submitted and terminated jobs.
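The status column is only useful if illegal transitions are rejected. Here is a minimal sketch of the four-state lifecycle as an explicit transition table. The names are illustrative, and allowing queued to go straight to failed is an assumption that matches a watchdog sweeping jobs no worker ever picked up.

```python
from enum import Enum


class JobStatus(str, Enum):
    QUEUED = "queued"
    EXECUTING = "executing"
    COMPLETED = "completed"
    FAILED = "failed"


# Terminal states have no outgoing edges: once a job is done, it stays done.
ALLOWED_TRANSITIONS = {
    JobStatus.QUEUED: {JobStatus.EXECUTING, JobStatus.FAILED},
    JobStatus.EXECUTING: {JobStatus.COMPLETED, JobStatus.FAILED},
    JobStatus.COMPLETED: set(),
    JobStatus.FAILED: set(),
}


class IllegalTransition(ValueError):
    """Raised when code tries to move a job along an edge that does not exist."""


def transition(current: JobStatus, target: JobStatus) -> JobStatus:
    """Return the new status, or raise if the move is not allowed."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise IllegalTransition(f"{current.value} -> {target.value}")
    return target
```

In practice the same rule also belongs in the database write, as a guarded update on the current status, so that two processes cannot both believe they moved the job.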
Agent reliability isn't an HTTP status problem. It's a job-lifecycle problem. The teams that get this right treat agent runs the way payment systems treat transactions: every state transition logged, every terminal state reachable, every stuck record reconciled.
Quick Wins
Today
- Add an explicit job state machine. queued, then executing, then completed | failed. Persist transitions. Don't pass jobs around in memory hoping for the best.
- Set a read timeout on every streaming model call. If the SDK doesn't expose one, wrap it. The default is "wait forever" and that's not a default, that's a bug waiting.
This week
- Run a stuck-job watchdog. Sweep jobs older than your SLA. Mark them failed. Make the update race-safe (AND status IN ('queued','executing')). Run it every 30 seconds.
- Cap planning depth, tool calls per turn, and per-request wall clock. Pick numbers. Enforce them. A bounded failure beats an unbounded one every time.
This month
- Reconcile submitted vs. terminated. Daily report: how many jobs entered, how many reached a terminal state, what's the gap. If the gap isn't zero, you have invisible failures.
- Track TTFT and TBT separately. Stream health isn't a single number. Latency to first token is a UX metric. Time between tokens is a liveness metric. Different alerts, different thresholds.
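For the TTFT and TBT split, a small sketch of deriving both numbers from timestamps you record as tokens arrive. It assumes monotonic-clock readings in seconds; `StreamTimings` and `stream_timings` are hypothetical names.

```python
from typing import NamedTuple, Optional, Sequence


class StreamTimings(NamedTuple):
    ttft_s: Optional[float]     # time to first token (the UX metric)
    max_tbt_s: Optional[float]  # worst gap between tokens (the liveness metric)


def stream_timings(request_sent_at: float, token_times: Sequence[float]) -> StreamTimings:
    """Compute TTFT and the largest inter-token gap for one stream."""
    if not token_times:
        # No token ever arrived: neither metric exists for this stream.
        return StreamTimings(None, None)
    ttft = token_times[0] - request_sent_at
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    return StreamTimings(ttft, max(gaps) if gaps else None)
```

Reporting the maximum gap rather than the mean is a design choice: one long stall in an otherwise fast stream is exactly the event the liveness alert exists to catch.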
The Bottom Line
The reason these failures don't show up in your dashboard is the same reason they're so dangerous: they happen in code paths your dashboard wasn't designed to watch.
HTTP-status monitoring assumes every operation has a status. Agent loops don't. They have steps. Steps have lifecycles. Lifecycles get stuck. And when they do, the only thing your existing observability can tell you is that "something, somewhere, isn't responding," which is true of every healthy idle system in the world.
The teams running reliable agents in 2026 aren't doing anything exotic. They're treating agent execution the way reliable distributed systems have always been treated: explicit state, durable transitions, watchdog reconciliation, bounded loops, surfaced lifecycle metrics.
Everything else, every clever prompt, every novel architecture, every fancy framework, is downstream of whether your agent's loop can finish. Or fail loudly enough for someone to notice.
Failures that never reach HTTP are still failures. The question is whether your monitoring acknowledges that yet.