Incident Response & Alerting

Alert Flow

Agent Status detects DOWN
       ↓
Webhook fires → PagerDuty creates incident
       ↓
Email sends → Backup notification
       ↓
Engineer acknowledges
       ↓
Investigation using Agent Status data
       ↓
Fix deployed
       ↓
Agent Status detects UP
       ↓
Recovery alert → Incident resolved

Setting Up Alerts

Email Alerts

Alert Email: oncall@yourcompany.com
Alert on DOWN: ✅
Alert on Recovery: ✅

Webhook to PagerDuty

Webhook URL: https://events.pagerduty.com/integration/YOUR_KEY/enqueue
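
If the receiving side needs to build the event itself rather than rely on the integration URL above, the body could look roughly like the sketch below. It assumes PagerDuty's Events API v2 (`https://events.pagerduty.com/v2/enqueue`, with the key sent as `routing_key` in the body). The `build_pagerduty_event` helper, the `support-bot` agent name, and the dedup key scheme are hypothetical; the payload Agent Status itself sends is not described here.

```python
import json

# Events API v2 endpoint; the routing key goes in the body, not the URL.
PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"


def build_pagerduty_event(routing_key: str, agent: str, verdict: str) -> dict:
    """Build an Events API v2 body for a DOWN or UP verdict (hypothetical helper)."""
    if verdict not in ("DOWN", "UP"):
        raise ValueError(f"unsupported verdict: {verdict}")
    return {
        "routing_key": routing_key,
        # DOWN opens an incident; UP resolves the one with the same dedup_key.
        "event_action": "trigger" if verdict == "DOWN" else "resolve",
        # Assumed scheme: at most one open incident per agent.
        "dedup_key": f"agentstatus-{agent}",
        "payload": {
            "summary": f"Agent Status: {agent} is {verdict}",
            "source": agent,
            "severity": "critical" if verdict == "DOWN" else "info",
        },
    }


event = build_pagerduty_event("YOUR_KEY", "support-bot", "DOWN")
print(json.dumps(event, indent=2))
```

Reusing the same `dedup_key` for the DOWN and UP events is what lets the recovery alert resolve the original incident instead of opening a new one.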

Multi-Channel

For critical agents, use multiple channels:

PagerDuty (pages on-call)
Slack (team awareness)
Email (backup)
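
A minimal sketch of the fan-out idea, assuming each channel is wrapped in a callable that takes a message. The `critical` flag, the `senders` mapping, and the email-only default for non-critical agents are illustrative assumptions, not Agent Status settings.

```python
from typing import Callable, Dict, List, Tuple

# Channel names mirror the list above.
ALL_CHANNELS = ("pagerduty", "slack", "email")


def channels_for(critical: bool) -> Tuple[str, ...]:
    """Critical agents notify every channel; others use email only (assumed default)."""
    return ALL_CHANNELS if critical else ("email",)


def notify(agent: str, verdict: str, critical: bool,
           senders: Dict[str, Callable[[str], None]]) -> List[str]:
    """Try each channel independently so one failing channel does not block the rest."""
    delivered = []
    for channel in channels_for(critical):
        try:
            senders[channel](f"{agent} is {verdict}")
            delivered.append(channel)
        except Exception as exc:  # keep going: email is the backup path
            print(f"{channel} failed: {exc}")
    return delivered
```

Sending to each channel inside its own try block is the point of the design: if paging fails, Slack and email still receive the alert.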

When You Get an Alert

Step 1: Acknowledge

Acknowledge the incident in PagerDuty or your incident system
Post in Slack: "Investigating Agent Status alert for [agent]"

Step 2: Check Agent Status Dashboard

Current verdict: Is it still DOWN?
When it changed: How long ago?
Regional breakdown: All regions or specific ones?
Error codes: What's failing?

Step 3: Check Recent Changes

Any recent deployments?
Config changes?
Upstream service changes?

Step 4: Reproduce

curl -X POST https://your-agent.com/v1/chat/completions \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello"}]}'

Step 5: Identify Root Cause

Agent Status Shows	Likely Cause	Check
All timeouts	Server down	Server logs, hosting dashboard
HTTP 5xx	Application error	Application logs
HTTP 401/403	Auth failure	API key validity
One region DOWN	Regional issue	CDN, DNS for that region
Gold prompt fail	Model issue	Model behavior, prompts
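
The table above can be read as a first-pass lookup. The sketch below encodes it as a function; the parameter names, the order in which the rows are checked, the reading of "one region" as "some but not all regions", and the final "Unknown" fallback are assumptions added for illustration.

```python
from typing import Optional, Tuple


def likely_cause(status_code: Optional[int] = None,
                 timed_out: bool = False,
                 regions_down: int = 0,
                 regions_total: int = 1,
                 gold_prompt_failed: bool = False) -> Tuple[str, str]:
    """Return (likely cause, what to check), following the rows of the table."""
    # Some but not all regions failing points at the network path, not the app.
    if 0 < regions_down < regions_total:
        return ("Regional issue", "CDN, DNS for that region")
    if timed_out:
        return ("Server down", "Server logs, hosting dashboard")
    if status_code in (401, 403):
        return ("Auth failure", "API key validity")
    if status_code is not None and 500 <= status_code <= 599:
        return ("Application error", "Application logs")
    if gold_prompt_failed:
        return ("Model issue", "Model behavior, prompts")
    # Not a row in the table: fallback when nothing matches.
    return ("Unknown", "Review recent changes")


print(likely_cause(status_code=503))
```

Treat the result as a starting point for investigation rather than a diagnosis; several rows can apply to one incident.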

Step 6: Fix

Deploy a fix, roll back to the last known good version, or update the configuration.

Step 7: Verify with Agent Status

Trigger a manual run:

curl -X POST "https://api.agentstatus.dev/api/v1/agents/$AGENT_ID/run" \
  -H "Authorization: Bearer $AGENTSTATUS_API_KEY"

Watch for the verdict to return to UP, all regions to report healthy, and latency to return to normal.
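
Watching for recovery can be scripted as a simple polling loop. How to read the current verdict programmatically is not covered here, so `fetch_verdict` below is a hypothetical callable you would supply (for example, a wrapper around whatever status lookup you use); the attempt count and delay are arbitrary defaults.

```python
import time
from typing import Callable


def wait_for_up(fetch_verdict: Callable[[], str],
                attempts: int = 10,
                delay_s: float = 30.0,
                sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll until the verdict is UP; return False if it never recovers."""
    for attempt in range(attempts):
        if fetch_verdict() == "UP":
            return True
        # Do not sleep after the final attempt.
        if attempt < attempts - 1:
            sleep(delay_s)
    return False
```

Passing `sleep` as a parameter keeps the loop testable without real waiting; in normal use the default `time.sleep` applies.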

Step 8: Resolve

Resolve the PagerDuty incident
Post summary in Slack
Update status page

Incident Documentation

For each incident, document: duration, impact, root cause, detection time, resolution, timeline, and follow-up actions.
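
One way to keep these fields consistent across incidents is a small record type. The sketch below is illustrative only: the `IncidentRecord` name, the field names, and the sample values are assumptions. Duration and detection time are derived from timestamps rather than entered by hand.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List


@dataclass
class IncidentRecord:
    agent: str
    started_at: datetime     # when the failure actually began
    detected_at: datetime    # when Agent Status reported DOWN
    resolved_at: datetime    # when the verdict returned to UP
    impact: str
    root_cause: str
    resolution: str
    timeline: List[str] = field(default_factory=list)
    follow_ups: List[str] = field(default_factory=list)

    @property
    def duration(self) -> timedelta:
        """Total time from start of failure to recovery."""
        return self.resolved_at - self.started_at

    @property
    def detection_time(self) -> timedelta:
        """How long the failure went unnoticed."""
        return self.detected_at - self.started_at


# Sample values are made up for illustration.
record = IncidentRecord(
    agent="support-bot",
    started_at=datetime(2024, 1, 1, 10, 0),
    detected_at=datetime(2024, 1, 1, 10, 5),
    resolved_at=datetime(2024, 1, 1, 10, 47),
    impact="Chat requests failed",
    root_cause="Expired API key",
    resolution="Rotated key",
)
print(record.duration, record.detection_time)
```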

Reducing Alert Noise

Transient Issues

Agent Status automatically retries requests that fail with 5xx errors, so a single transient failure does not trigger an alert.
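
The general technique looks like the sketch below. It is not Agent Status's implementation: the retry count, the exponential backoff, and the `probe` callable (anything that returns an HTTP status code) are assumptions chosen for illustration.

```python
import time
from typing import Callable


def check_with_retries(probe: Callable[[], int],
                       retries: int = 2,
                       backoff_s: float = 1.0,
                       sleep: Callable[[float], None] = time.sleep) -> int:
    """Return the final status code, retrying only on 5xx responses."""
    status = probe()
    for attempt in range(retries):
        if not 500 <= status <= 599:
            break  # success or a non-retryable error such as 401
        sleep(backoff_s * (2 ** attempt))  # wait 1x, 2x, ... the base delay
        status = probe()
    return status
```

Only the final status feeds the verdict, so a 503 followed by a 200 never reaches the alerting stage, while a 401 is reported immediately without retries.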

DEGRADED vs DOWN

Alert on DOWN: Always. Something is seriously wrong.
Alert on DEGRADED: Consider carefully. Alerting on every DEGRADED verdict may cause alert fatigue.
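
If you do alert on DEGRADED, one possible way to limit noise is to require several consecutive DEGRADED verdicts first. The sketch below is an assumption-laden illustration, not an Agent Status feature: the `should_alert` function and the threshold of 3 are hypothetical, and de-duplicating repeated alerts is left to the incident system.

```python
from typing import List


def should_alert(verdicts: List[str], degraded_threshold: int = 3) -> bool:
    """Decide whether the newest verdict is alert-worthy.

    verdicts is ordered oldest to newest.
    """
    if not verdicts:
        return False
    latest = verdicts[-1]
    if latest == "DOWN":
        return True  # always alert on DOWN
    if latest == "DEGRADED":
        # Alert only once the last N runs were all DEGRADED.
        recent = verdicts[-degraded_threshold:]
        return (len(recent) == degraded_threshold
                and all(v == "DEGRADED" for v in recent))
    return False


print(should_alert(["UP", "DEGRADED"]))
print(should_alert(["DEGRADED", "DEGRADED", "DEGRADED"]))
```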

Runbook Template

Create a runbook for each agent that covers contacts, quick links, common issues (HTTP 500, high latency, auth failures), and rollback procedures.