Your Alert Webhooks Are Failing Silently (And You Have No Idea)

Picture the worst version of an outage:

Your AI agent goes down at 2:14 AM. Your monitoring catches it instantly. The alert fires. Your incident channel is silent. PagerDuty stays quiet. No one wakes up. By the time engineering arrives at 9 AM, customers have been impacted for nearly seven hours.

Root cause: the webhook delivering the alert returned a 502 from a misconfigured reverse proxy. The alert was created. It was just never delivered.

This isn't a hypothetical. It's the most under-monitored failure mode in modern observability stacks: the alert pipeline itself.

You instrument your agent. You instrument your model providers. You instrument your dependencies. You almost never instrument the path that tells you any of those things broke.


01Section

The Quiet Assumption Underneath Every Alert

Every alerting system in production assumes the same chain works:

snippet
event -> rule -> alert -> channel adapter -> delivery -> human acknowledgement

Most teams instrument the first three steps and the last one. The middle two, channel adapter and delivery, are treated as plumbing. They're not. They're the part most likely to silently fail.

Slack is down. Discord rate-limits your bot. PagerDuty's Events API returns a 502 for 11 minutes. Your customer's webhook URL was rotated last week and you haven't redeployed. Microsoft Teams' incoming webhook connector was deprecated and you didn't notice. Each of these is a real outage we've seen, and in every case, the alert fired, the alert was created, and the alert was not delivered.

A monitoring system that can't prove its own alerts arrived isn't a monitoring system. It's a logging system with optimism.


02Section

Failure Mode 1: The Silent 4xx

The minimum bar: every delivery attempt should produce a row you can query.

snippet
attempt | channel_type | http_status | ok | error | sent_at | created_at

If you can't SELECT count(*) FROM alert_deliveries WHERE ok = false AND created_at > now() - interval '1 hour', you can't see your own alert failures.
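As a rough sketch of what that looks like in practice, here is the same query against SQLite, assuming a table named alert_deliveries with the row shape shown above and the created_at timestamp the query filters on. The helpers open_db, record_attempt, and failures_since are hypothetical names, and timestamps are stored as ISO 8601 UTC strings so they compare correctly as text.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def open_db(path=":memory:"):
    """Open (or create) the deliveries database. Table layout is an assumption."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS alert_deliveries (
               attempt      INTEGER NOT NULL,
               channel_type TEXT    NOT NULL,
               http_status  INTEGER,           -- NULL when no HTTP response arrived
               ok           INTEGER NOT NULL,  -- 1 = delivered, 0 = failed
               error        TEXT,
               sent_at      TEXT,
               created_at   TEXT    NOT NULL   -- ISO 8601, UTC
           )"""
    )
    return conn


def record_attempt(conn, attempt, channel_type, http_status, ok, error=None, now=None):
    """Persist one row per delivery attempt, successful or not."""
    now = now or datetime.now(timezone.utc)
    ts = now.isoformat(timespec="microseconds")
    conn.execute(
        "INSERT INTO alert_deliveries VALUES (?, ?, ?, ?, ?, ?, ?)",
        (attempt, channel_type, http_status, int(ok), error, ts, ts),
    )
    conn.commit()


def failures_since(conn, window=timedelta(hours=1), now=None):
    """Count failed delivery attempts created within the given window."""
    now = now or datetime.now(timezone.utc)
    cutoff = (now - window).isoformat(timespec="microseconds")
    row = conn.execute(
        "SELECT count(*) FROM alert_deliveries WHERE ok = 0 AND created_at > ?",
        (cutoff,),
    ).fetchone()
    return row[0]
```

Anything non-zero from failures_since is a number worth putting on a dashboard, or alerting on through a channel other than the one that is failing.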


03Section

Failure Mode 2: The Retry That Never Fires

A reasonable default looks like this:

snippet
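_MAX_RETRY_ATTEMPTS = 3
_RETRY_DELAYS = [2, 5]  # seconds to wait before attempt 2 and before attempt 3

# Retry only on network-level errors (no HTTP status) or 5xx responses.
# `result` is the outcome of the delivery attempt that just finished.
is_retryable = result.http_status is None or result.http_status >= 500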

Three attempts, staged delays, retryable-only, and a row in a deliveries table per attempt. That's the floor. Anything less is hope.
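Here is one way those constants could drive a full loop. This is a sketch, not a reference implementation: DeliveryResult, deliver_with_retries, and the send / record callables are hypothetical names, and network failures are assumed to surface as OSError (the base class for socket and connection errors in the standard library).

```python
import time
from dataclasses import dataclass
from typing import Optional

_MAX_RETRY_ATTEMPTS = 3
_RETRY_DELAYS = [2, 5]  # seconds to wait before attempt 2 and before attempt 3


@dataclass
class DeliveryResult:
    http_status: Optional[int]  # None means a network-level error
    error: Optional[str] = None

    @property
    def ok(self) -> bool:
        return self.http_status is not None and 200 <= self.http_status < 300


def deliver_with_retries(send, record, sleep=time.sleep):
    """Call send() up to _MAX_RETRY_ATTEMPTS times, recording every attempt.

    send   -- callable returning a DeliveryResult (may raise OSError)
    record -- callable(attempt_number, result) that persists one row
    """
    result = None
    for attempt in range(1, _MAX_RETRY_ATTEMPTS + 1):
        try:
            result = send()
        except OSError as exc:
            result = DeliveryResult(http_status=None, error=str(exc))
        record(attempt, result)  # one row per attempt, success or failure
        if result.ok:
            break
        is_retryable = result.http_status is None or result.http_status >= 500
        if not is_retryable or attempt == _MAX_RETRY_ATTEMPTS:
            break
        sleep(_RETRY_DELAYS[attempt - 1])
    return result
```

Passing sleep in as a parameter is a small design choice that lets you test the retry schedule without waiting seven real seconds.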


04Section

Failure Mode 3: The Signature Mismatch

The hardest part of HMAC isn't the algorithm. It's the rotation choreography.
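One way to make rotation survivable, sketched under assumptions: tag each signature with a key_id, and let the verifier hold both the old and the new secret during the overlap window. The header names X-Key-Id and X-Signature and the sign / verify helpers are hypothetical, not any particular provider's scheme.

```python
import hashlib
import hmac


def sign(payload: bytes, key_id: str, keys: dict) -> dict:
    """Return headers carrying the key_id and an HMAC-SHA256 of the payload."""
    digest = hmac.new(keys[key_id], payload, hashlib.sha256).hexdigest()
    return {"X-Key-Id": key_id, "X-Signature": "sha256=" + digest}


def verify(payload: bytes, headers: dict, keys: dict) -> bool:
    """Check the signature against whichever secret the key_id names."""
    secret = keys.get(headers.get("X-Key-Id", ""))
    if secret is None:
        return False  # unknown or retired key
    expected = "sha256=" + hmac.new(secret, payload, hashlib.sha256).hexdigest()
    received = headers.get("X-Signature", "")
    # Constant-time comparison; compare as bytes so odd header values cannot raise.
    return hmac.compare_digest(expected.encode(), received.encode())
```

With this shape, the choreography becomes: add the new secret to the verifier's keys, switch the sender's key_id, then remove the old secret once no traffic uses it. At no point does a valid signature fail.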


05Section

Failure Mode 4: The Channel Drift

A delivered-but-invisible alert is the worst kind of failure. The only defense is human-in-the-loop verification.
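A minimal sketch of what that verification could look like: each synthetic alert carries a random marker, and a channel only counts as visible once a human reports the marker back. SyntheticCheck and its methods are hypothetical names; how the marker reaches the channel and how the human reports it are left out.

```python
import secrets


class SyntheticCheck:
    """Tracks which synthetic alerts a human has confirmed seeing."""

    def __init__(self):
        self._sent = {}      # marker -> channel it was sent to
        self._acked = set()  # markers a human has reported back

    def issue(self, channel: str) -> str:
        """Create a marker to embed in the synthetic alert for this channel."""
        marker = secrets.token_hex(4)
        self._sent[marker] = channel
        return marker

    def ack(self, marker: str) -> bool:
        """Record a human acknowledgement; unknown markers are rejected."""
        if marker not in self._sent:
            return False
        self._acked.add(marker)
        return True

    def unseen_channels(self):
        """Channels with at least one synthetic alert nobody confirmed."""
        return sorted({ch for m, ch in self._sent.items() if m not in self._acked})
```

A channel that returns HTTP 200 but keeps showing up in unseen_channels is exactly the delivered-but-invisible case.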


06Section

Failure Mode 5: The Multi-Channel Illusion

Redundancy without measurement is just wishful thinking with extra steps.
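To make that measurable, a sketch like the following computes per-channel success rates and, more importantly, the alerts that reached no channel at all. It assumes you can produce one (alert_id, channel_type, ok) tuple per channel per alert from your deliveries data; the function name summarize is hypothetical.

```python
from collections import defaultdict


def summarize(rows):
    """rows: iterable of (alert_id, channel_type, ok), one final outcome per channel.

    Returns (success rate per channel, sorted alert_ids delivered nowhere).
    """
    per_channel = defaultdict(lambda: [0, 0])  # channel -> [ok_count, total]
    delivered = defaultdict(bool)              # alert_id -> reached any channel?
    for alert_id, channel, ok in rows:
        per_channel[channel][1] += 1
        if ok:
            per_channel[channel][0] += 1
        delivered[alert_id] = delivered[alert_id] or bool(ok)
    rates = {ch: ok_count / total for ch, (ok_count, total) in per_channel.items()}
    lost = sorted(a for a, reached in delivered.items() if not reached)
    return rates, lost
```

The second return value is the one that matters: an alert in that list had redundancy on paper and zero deliveries in practice.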


07Section

Failure Mode 6: The Acknowledgement Gap


08Section

What "Real" Alert Pipeline Monitoring Looks Like

Five questions a healthy alert system answers continuously:

  1. Did the alert get created? (Detection layer is working.)
  2. Did the dispatcher attempt every configured channel? (No channel was silently skipped.)
  3. Did each channel return success? (Per-channel HTTP status, persisted.)
  4. Did the receiver actually render the alert? (Synthetic alerts with markers.)
  5. Did a human acknowledge it within the SLA? (Fire-to-ack latency tracked.)

The minimum data model to answer those:

snippet
alert_deliveries (
  alert_id, channel_type, attempt, http_status, ok,
  provider_id, error_message, sent_at, created_at
)

If your alerting infrastructure can't tell you which alert went to which channel, succeeded or failed, on which attempt, with what response code, you don't have an alerting system. You have a print() statement with delusions of grandeur.
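As an illustration of questions 2 and 3, here is a sketch that uses the data model above in SQLite to report, for one alert, which channels delivered, which failed, and which were never attempted. The column types, the delivery_report helper, and the idea of passing in the list of configured channels are all assumptions.

```python
import sqlite3

SCHEMA = """
CREATE TABLE alert_deliveries (
    alert_id TEXT, channel_type TEXT, attempt INTEGER, http_status INTEGER,
    ok INTEGER, provider_id TEXT, error_message TEXT, sent_at TEXT, created_at TEXT
)
"""


def delivery_report(conn, alert_id, configured_channels):
    """Return (per-channel outcome, channels that were silently skipped)."""
    rows = conn.execute(
        """SELECT channel_type, max(ok), max(attempt)
           FROM alert_deliveries
           WHERE alert_id = ?
           GROUP BY channel_type""",
        (alert_id,),
    ).fetchall()
    # A channel counts as delivered if any attempt succeeded.
    attempted = {
        channel: {"delivered": bool(ok), "attempts": attempts}
        for channel, ok, attempts in rows
    }
    skipped = sorted(set(configured_channels) - set(attempted))
    return attempted, skipped
```

A non-empty skipped list means the dispatcher never tried a channel you believe is configured, which no per-attempt status code will ever show you.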


09Section

Quick Wins

This week

  1. Add a deliveries table. Persist per-channel attempt + status + error for every alert. Even SQLite is fine. The point is you can query it.
  2. Implement bounded retries. Three attempts, staged backoff, retryable-only (5xx + network), one row per attempt.
  3. Surface "last successful delivery" per channel. A channel that hasn't delivered in 24 hours when you fire alerts hourly is not "quiet"; it's broken.
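The third item can be sketched in a few lines. This assumes you already track, per channel, the timestamp of the last delivery that succeeded (or None if there has never been one); stale_channels and the 24-hour default are illustrative choices taken from the example above, not fixed recommendations.

```python
from datetime import datetime, timedelta, timezone


def stale_channels(last_success, now=None, max_silence=timedelta(hours=24)):
    """Return channels whose last successful delivery is missing or too old.

    last_success -- dict of channel name -> timezone-aware datetime, or None
    """
    now = now or datetime.now(timezone.utc)
    return sorted(
        channel
        for channel, ts in last_success.items()
        if ts is None or now - ts > max_silence
    )
```

Tune max_silence to your alert volume: the threshold should be several multiples of how often you normally expect a channel to deliver.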

This month

  1. Run a weekly synthetic alert through every channel. Have a human ack each one. Track ack rate per channel.
  2. HMAC-sign customer webhooks. Include a key_id so rotation doesn't break verifiers.
  3. Tier severity, then route by tier. Critical alerts page; informational alerts post; never both equally.

10Section

The Bottom Line

You spend weeks instrumenting your agent. You configure five alert channels. You feel safe.

Then a webhook returns a 502 at 2 AM and you find out the next morning, from a customer.

The alert was real. The detection worked. The dispatcher did its job. The human never knew.

Monitoring you can't prove arrived isn't monitoring. It's a story you tell yourself about your incident response.

The fix isn't more channels. It's measuring the channels you already have.
