Current verdict — Is it still DOWN?
When it changed — How long ago?
Regional breakdown — All regions or specific ones?
Error codes — What's failing?
Incident Response & Alerting
Respond quickly when your agent goes down.
Alert Flow
Agent Status detects DOWN
↓
Webhook fires → PagerDuty creates incident
↓
Email sends → Backup notification
↓
Engineer acknowledges
↓
Investigation using Agent Status data
↓
Fix deployed
↓
Agent Status detects UP
↓
Recovery alert → Incident resolved
Setting Up Alerts
Email Alerts
Alert Email: oncall@yourcompany.com
Alert on DOWN: ✅
Alert on Recovery: ✅
Webhook to PagerDuty
Webhook URL: https://events.pagerduty.com/integration/YOUR_KEY/enqueue
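If the webhook payload has to be translated before it reaches PagerDuty, a small relay can do the mapping. The sketch below builds a PagerDuty Events API v2 event from an alert. The incoming field names (`agent_id`, `verdict`) are assumptions made for illustration, not the documented Agent Status payload, and the Events API v2 endpoint is used here instead of the integration URL above.

```python
import json

# PagerDuty Events API v2 endpoint; the routing key travels in the request body.
PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"


def build_pagerduty_event(alert: dict, routing_key: str) -> dict:
    """Translate an alert into a PagerDuty Events API v2 event.

    `alert` uses a hypothetical shape: {"agent_id": ..., "verdict": ...}.
    """
    agent_id = alert["agent_id"]
    verdict = alert["verdict"].upper()
    event = {
        "routing_key": routing_key,
        # Same dedup_key on trigger and resolve, so recovery closes the incident.
        "dedup_key": f"agent-status-{agent_id}",
        "event_action": "resolve" if verdict == "UP" else "trigger",
    }
    if event["event_action"] == "trigger":
        event["payload"] = {
            "summary": f"Agent {agent_id} is {verdict}",
            "source": "agent-status",
            "severity": "critical" if verdict == "DOWN" else "warning",
        }
    return event


if __name__ == "__main__":
    sample = {"agent_id": "support-bot", "verdict": "DOWN"}
    print(json.dumps(build_pagerduty_event(sample, "YOUR_KEY"), indent=2))
```

Reusing one `dedup_key` per agent is what lets the recovery alert resolve the incident that the DOWN alert opened.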
Multi-Channel
For critical agents, use multiple channels:
- PagerDuty (pages on-call)
- Slack (team awareness)
- Email (backup)
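When a relay fans alerts out to several channels itself, a failure in one channel should not block the others. A minimal sketch, assuming each channel is wrapped in a plain callable (the channel names and senders here are hypothetical):

```python
from typing import Callable, Dict

Notifier = Callable[[str], None]


def notify_all(channels: Dict[str, Notifier], message: str) -> Dict[str, bool]:
    """Send `message` to every channel and report which ones succeeded."""
    results = {}
    for name, send in channels.items():
        try:
            send(message)
            results[name] = True
        except Exception:
            # One failing channel must not prevent the others from firing.
            results[name] = False
    return results


if __name__ == "__main__":
    def broken_slack(_message: str) -> None:
        raise RuntimeError("Slack webhook unreachable")

    outcome = notify_all(
        {"pagerduty": print, "slack": broken_slack, "email": print},
        "support-bot is DOWN",
    )
    print(outcome)
```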
When You Get an Alert
Step 1: Acknowledge
- Acknowledge the incident in PagerDuty or your incident system
- Post in Slack: "Investigating Agent Status alert for [agent]"
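The Slack post can be scripted so it goes out the moment someone acknowledges. The sketch below assumes a Slack incoming webhook, which accepts a JSON body with a `text` field; the webhook URL is a placeholder you would supply.

```python
import json
import urllib.request


def build_slack_request(webhook_url: str, agent: str) -> urllib.request.Request:
    """Build the POST request for a Slack incoming webhook."""
    body = {"text": f"Investigating Agent Status alert for {agent}"}
    return urllib.request.Request(
        webhook_url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_investigating(webhook_url: str, agent: str, timeout: float = 5.0) -> int:
    """Send the message and return the HTTP status code."""
    request = build_slack_request(webhook_url, agent)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.status
```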
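Step 2: Check Agent Status Dashboard
Open the agent's dashboard and confirm the current verdict, when it changed, which regions are affected, and which error codes are being returned.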
Step 3: Check Recent Changes
- Any recent deployments?
- Config changes?
- Upstream service changes?
Step 4: Reproduce
curl -X POST https://your-agent.com/v1/chat/completions \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello"}]}'
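When you need to repeat the check or record latency, the same request can be scripted. This is a sketch using only the Python standard library; the 30-second timeout is an assumption, and the URL and token are whatever you used in the curl command.

```python
import json
import time
import urllib.error
import urllib.request


def build_probe_request(url: str, token: str) -> urllib.request.Request:
    """Build the same POST request as the curl command."""
    body = {"messages": [{"role": "user", "content": "Hello"}]}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def probe(url: str, token: str, timeout: float = 30.0) -> dict:
    """Send one request; report HTTP status (None on network failure) and latency."""
    start = time.monotonic()
    status, error = None, None
    try:
        request = build_probe_request(url, token)
        with urllib.request.urlopen(request, timeout=timeout) as resp:
            status = resp.status
    except urllib.error.HTTPError as exc:
        status = exc.code  # 4xx/5xx responses still carry a status code
    except OSError as exc:
        error = str(exc)  # DNS failure, refused connection, timeout, ...
    return {"status": status, "error": error, "latency_s": time.monotonic() - start}
```

A result with `status` set to `None` points at a network or timeout problem, while a numeric status feeds directly into the root-cause step.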
Step 5: Identify Root Cause
| Agent Status Shows | Likely Cause | Check |
|---|---|---|
| All timeouts | Server down | Server logs, hosting dashboard |
| HTTP 5xx | Application error | Application logs |
| HTTP 401/403 | Auth failure | API key validity |
| One region DOWN | Regional issue | CDN, DNS for that region |
| Gold prompt fail | Model issue | Model behavior, prompts |
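The table can be encoded as a first-pass triage helper. The sketch below assumes a hypothetical result shape (one entry per regional check, with `status` set to `None` on timeout and a `gold_ok` flag for the gold prompt); it is not the Agent Status data model, and its output is a starting hint, not a diagnosis.

```python
from typing import List, Optional, TypedDict


class CheckResult(TypedDict):
    region: str
    status: Optional[int]  # None means the request timed out
    gold_ok: bool          # did the gold prompt check pass?


def is_failure(result: CheckResult) -> bool:
    status = result["status"]
    return status is None or status >= 400 or not result["gold_ok"]


def likely_cause(results: List[CheckResult]) -> str:
    """Map observed failures to the likely cause from the table above."""
    bad = [r for r in results if is_failure(r)]
    if not bad:
        return "No failure observed"
    all_regions = {r["region"] for r in results}
    bad_regions = {r["region"] for r in bad}
    if len(all_regions) > 1 and bad_regions != all_regions:
        return "Regional issue: check CDN and DNS for " + ", ".join(sorted(bad_regions))
    statuses = [r["status"] for r in bad]
    if all(s is None for s in statuses):
        return "Server down: check server logs and hosting dashboard"
    if any(s is not None and 500 <= s < 600 for s in statuses):
        return "Application error: check application logs"
    if any(s in (401, 403) for s in statuses):
        return "Auth failure: check API key validity"
    if any(not r["gold_ok"] for r in bad):
        return "Model issue: check model behavior and prompts"
    return "Unclassified: inspect the raw results"
```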
Step 6: Fix
- Deploy a fix, roll back to the last known good version, or update the configuration
Step 7: Verify with Agent Status
Trigger a manual run:
curl -X POST "https://api.agentstatus.dev/api/v1/agents/$AGENT_ID/run" \
  -H "Authorization: Bearer $AGENTSTATUS_API_KEY"
Watch for verdict returning to UP, all regions healthy, and latency back to normal.
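Verification can be scripted as "trigger a run, then poll until the verdict is UP". In the sketch below only the run URL comes from this page. How the verdict is read back is left to a caller-supplied `fetch_verdict` function, which is hypothetical, and the attempt count and delay are assumptions.

```python
import time
import urllib.request
from typing import Callable


def build_run_request(agent_id: str, api_key: str) -> urllib.request.Request:
    """Build the manual-run request shown in the curl command above."""
    return urllib.request.Request(
        f"https://api.agentstatus.dev/api/v1/agents/{agent_id}/run",
        headers={"Authorization": f"Bearer {api_key}"},
        method="POST",
    )


def wait_for_up(
    fetch_verdict: Callable[[], str],
    attempts: int = 10,
    delay_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `fetch_verdict` until it returns "UP" or attempts run out."""
    for attempt in range(attempts):
        if fetch_verdict() == "UP":
            return True
        if attempt < attempts - 1:
            sleep(delay_s)  # no sleep after the final attempt
    return False
```

Passing `sleep` as a parameter keeps the loop easy to test without waiting in real time.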
Step 8: Resolve
- Close PagerDuty incident
- Post summary in Slack
- Update status page
Incident Documentation
For each incident, document: duration, impact, root cause, detection time, resolution, timeline, and follow-up actions.
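Keeping these fields in a structured record makes incidents comparable over time. A minimal sketch, with field names chosen here for illustration:

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List


@dataclass
class IncidentRecord:
    agent: str
    started_at: datetime   # when the failure began
    detected_at: datetime  # when the first alert fired
    resolved_at: datetime  # when the verdict returned to UP
    impact: str
    root_cause: str
    resolution: str
    timeline: List[str] = field(default_factory=list)
    follow_ups: List[str] = field(default_factory=list)

    @property
    def duration(self) -> timedelta:
        """Total time from start of failure to resolution."""
        return self.resolved_at - self.started_at

    @property
    def detection_time(self) -> timedelta:
        """How long the failure went unnoticed."""
        return self.detected_at - self.started_at
```

Deriving duration and detection time from timestamps avoids the two drifting apart when a record is edited later.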
Reducing Alert Noise
Transient Issues
Agent Status automatically retries requests that fail with 5xx errors, which avoids alerting on transient failures.
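The same idea is useful in your own checks. The sketch below illustrates retry-on-5xx with exponential backoff in general terms; it is not Agent Status's implementation, and the attempt count and delays are assumptions.

```python
import time
from typing import Callable


def check_with_retries(
    send: Callable[[], int],
    max_attempts: int = 3,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call `send` until it returns a non-5xx status or attempts run out.

    Returns the last status seen, so the caller alerts only on a
    failure that persisted across every attempt.
    """
    status = send()
    for attempt in range(1, max_attempts):
        if not 500 <= status < 600:
            break  # success or a non-retryable error such as 401
        sleep(base_delay_s * 2 ** (attempt - 1))  # 1s, 2s, 4s, ...
        status = send()
    return status
```

Only 5xx responses are retried here: a 401 or 403 will not fix itself on a second attempt, so it is reported immediately.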
DEGRADED vs DOWN
- Alert on DOWN: Always. Something is seriously wrong.
- Alert on DEGRADED: Consider carefully. Frequent DEGRADED alerts may cause alert fatigue.
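One possible middle ground, if you route alerts through your own relay, is to alert on DOWN immediately but on DEGRADED only when it persists. The sketch below assumes a list of recent verdicts ordered oldest to newest; the threshold of three consecutive checks is an arbitrary illustration, not a product setting.

```python
from typing import Sequence


def should_alert(recent_verdicts: Sequence[str], degraded_threshold: int = 3) -> bool:
    """Decide whether the latest verdict warrants an alert.

    DOWN always alerts. DEGRADED alerts only after `degraded_threshold`
    consecutive DEGRADED verdicts.
    """
    if not recent_verdicts:
        return False
    latest = recent_verdicts[-1]
    if latest == "DOWN":
        return True
    if latest == "DEGRADED":
        tail = recent_verdicts[-degraded_threshold:]
        return len(tail) == degraded_threshold and all(v == "DEGRADED" for v in tail)
    return False
```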
Runbook Template
Create runbooks for each agent covering: contacts, quick links, common issues (HTTP 500, high latency, auth failures), and rollback procedures.