Incident Response & Alerting

Respond quickly when your agent goes down.

Alert Flow

Agent Status detects DOWN
       ↓
Webhook fires → PagerDuty creates incident
       ↓
Email sends → Backup notification
       ↓
Engineer acknowledges
       ↓
Investigation using Agent Status data
       ↓
Fix deployed
       ↓
Agent Status detects UP
       ↓
Recovery alert → Incident resolved

Setting Up Alerts

Email Alerts

Alert Email: oncall@yourcompany.com
Alert on DOWN: ✅
Alert on Recovery: ✅

Webhook to PagerDuty

Webhook URL: https://events.pagerduty.com/integration/YOUR_KEY/enqueue
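Before relying on the webhook, it can help to confirm that PagerDuty accepts events for your service by sending a test event by hand. The sketch below is a rough illustration, not the payload Agent Status itself sends (that format is not shown here). It assumes the service also has an Events API v2 integration, which uses a different endpoint (`https://events.pagerduty.com/v2/enqueue`) and carries the key in the body as `routing_key`. The names `pd_event_json`, `send_pd_event`, and `PD_ROUTING_KEY` are hypothetical.

```shell
# Build a PagerDuty Events API v2 request body.
# Assumption: arguments contain no double quotes or backslashes (no JSON escaping is done).
# Usage: pd_event_json <routing_key> <trigger|resolve> <dedup_key> <summary>
pd_event_json() {
  printf '{"routing_key":"%s","event_action":"%s","dedup_key":"%s","payload":{"summary":"%s","source":"agent-status","severity":"critical"}}\n' \
    "$1" "$2" "$3" "$4"
}

# Send the event. PD_ROUTING_KEY is a hypothetical environment variable
# holding the Events API v2 integration key.
# Usage: send_pd_event <trigger|resolve> <dedup_key> <summary>
send_pd_event() {
  pd_event_json "$PD_ROUTING_KEY" "$1" "$2" "$3" |
    curl -sS -X POST https://events.pagerduty.com/v2/enqueue \
      -H "Content-Type: application/json" \
      -d @-
}
```

For example, `send_pd_event trigger my-agent-test "Test: my-agent is DOWN"` followed by `send_pd_event resolve my-agent-test "Test: my-agent recovered"` should open and then close one incident, because both events share the same `dedup_key`.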

Multi-Channel

For critical agents, use multiple channels:

  • PagerDuty (pages on-call)
  • Slack (team awareness)
  • Email (backup)

When You Get an Alert

Step 1: Acknowledge

  • Acknowledge the alert in PagerDuty or your incident system
  • Post in Slack: "Investigating Agent Status alert for [agent]"
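
If you would rather post the Slack message from a terminal, the following is a minimal sketch. It assumes a Slack incoming webhook whose URL is stored in `SLACK_WEBHOOK_URL`; that variable and the function names are hypothetical. Incoming webhooks accept a JSON body with a `text` field.

```shell
# Build the Slack message body for an acknowledgement.
# Assumption: the agent name contains no double quotes or backslashes.
# Usage: slack_ack_json <agent_name>
slack_ack_json() {
  printf '{"text":"Investigating Agent Status alert for %s"}\n' "$1"
}

# Post the acknowledgement to the incoming webhook.
# Usage: post_slack_ack <agent_name>
post_slack_ack() {
  slack_ack_json "$1" |
    curl -sS -X POST "$SLACK_WEBHOOK_URL" \
      -H "Content-Type: application/json" \
      -d @-
}
```

Usage: `post_slack_ack checkout-agent`.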

Step 2: Check Agent Status Dashboard

  • Current verdict: Is it still DOWN?
  • When it changed: How long ago?
  • Regional breakdown: All regions or specific ones?
  • Error codes: What's failing?

Step 3: Check Recent Changes

  • Any recent deployments?
  • Config changes?
  • Upstream service changes?

Step 4: Reproduce

    curl -X POST https://your-agent.com/v1/chat/completions \
      -H "Authorization: Bearer $TOKEN" \
      -H "Content-Type: application/json" \
      -d '{"messages": [{"role": "user", "content": "Hello"}]}'
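
When comparing your result with what Agent Status reports, the status code and latency are usually more useful than the response body. The variant below is a sketch that wraps the same request in a hypothetical `probe_agent` function and prints only those two values, using curl's `--write-out` variables. The 30-second timeout is an arbitrary choice.

```shell
# Print "<http_code> <total_seconds>" for one request to the agent.
# curl reports http_code 000 when no HTTP response arrived
# (for example on a timeout or a refused connection).
# Usage: probe_agent <url>   (TOKEN must be set, as in the command above)
probe_agent() {
  curl -sS -o /dev/null \
    -w '%{http_code} %{time_total}\n' \
    --max-time 30 \
    -X POST "$1" \
    -H "Authorization: Bearer $TOKEN" \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "Hello"}]}'
}
```

Usage: `probe_agent https://your-agent.com/v1/chat/completions`.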
    

Step 5: Identify Root Cause

Agent Status Shows    Likely Cause         Check
All timeouts          Server down          Server logs, hosting dashboard
HTTP 5xx              Application error    Application logs
HTTP 401/403          Auth failure         API key validity
One region DOWN       Regional issue       CDN, DNS for that region
Gold prompt fail      Model issue          Model behavior, prompts
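
The HTTP-related rows of the table above can be turned into a small triage helper. This is only a sketch: `likely_cause` is a hypothetical name, and the regional and gold prompt rows are left out because they cannot be inferred from a single status code.

```shell
# Map an observed HTTP status code to the matching table row.
# Pass 000 for a timeout (curl's http_code when no response arrives).
# Usage: likely_cause <http_code>
likely_cause() {
  case "$1" in
    000)     echo "Server down: check server logs, hosting dashboard" ;;
    401|403) echo "Auth failure: check API key validity" ;;
    5??)     echo "Application error: check application logs" ;;
    *)       echo "No matching row: compare regions and gold prompt results" ;;
  esac
}
```

Usage: `likely_cause 503` prints the application error row.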

Step 6: Fix

  • Deploy a fix, roll back to the last known good version, or update the configuration

Step 7: Verify with Agent Status

Trigger a manual run:

    curl -X POST "https://api.agentstatus.dev/api/v1/agents/$AGENT_ID/run" \
      -H "Authorization: Bearer $AGENTSTATUS_API_KEY"
    

Watch for the verdict to return to UP, all regions to report healthy, and latency to return to normal.
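
If you prefer to wait in a terminal instead of refreshing the dashboard, a generic polling loop can do the watching. The sketch below assumes you supply your own command that prints the current verdict (`UP`, `DOWN`, and so on); the Agent Status endpoint for reading a verdict is not shown here, so none is hard-coded. `wait_for_up` is a hypothetical name.

```shell
# Poll until the given command prints "UP", or give up after max_attempts.
# Usage: wait_for_up <max_attempts> <delay_seconds> <command> [args...]
wait_for_up() {
  local max_attempts="$1" delay="$2"
  shift 2
  local attempt=1 verdict
  while [ "$attempt" -le "$max_attempts" ]; do
    # A failing command is treated the same as "not UP yet".
    verdict=$("$@") || verdict=""
    if [ "$verdict" = "UP" ]; then
      echo "UP after $attempt check(s)"
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
  echo "not UP after $max_attempts check(s)" >&2
  return 1
}
```

Usage: `wait_for_up 10 30 my_verdict_command`, where `my_verdict_command` is whatever you use to read the verdict.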

Step 8: Resolve

  • Close the PagerDuty incident
  • Post a summary in Slack
  • Update the status page

Incident Documentation

For each incident, document the duration, impact, root cause, detection time, resolution, timeline, and follow-up actions.
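
A fixed skeleton makes it harder to forget one of these fields. As a minimal sketch, the hypothetical `incident_template` function below prints one and fills in the duration from the DOWN and UP times, given as Unix epoch seconds.

```shell
# Print an incident record skeleton with the duration filled in.
# Usage: incident_template <down_epoch_seconds> <up_epoch_seconds>
incident_template() {
  local minutes="$(( ($2 - $1) / 60 ))"
  cat <<EOF
Duration: ${minutes} min
Impact:
Root cause:
Detection time:
Resolution:
Timeline:
Follow-up actions:
EOF
}
```

Usage: `incident_template 1700000000 1700001800 > incident.txt`, then fill in the remaining fields by hand.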

Reducing Alert Noise

Transient Issues

Agent Status automatically retries 5xx errors to avoid alerting on transient flakes.
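
To illustrate why retrying suppresses transient flakes, here is a sketch of the general idea. It is not Agent Status's actual retry logic, whose retry count and timing are not described here. The hypothetical `check_with_retries` reruns a check command while it keeps printing a 5xx code and reports a failure only if the error persists.

```shell
# Run a check command that prints an HTTP status code; retry on 5xx.
# Prints the final code. Returns 0 unless every attempt ended in 5xx.
# Usage: check_with_retries <max_retries> <delay_seconds> <command> [args...]
check_with_retries() {
  local retries="$1" delay="$2"
  shift 2
  local code tries=0
  while :; do
    code=$("$@")
    case "$code" in
      5??) ;;                          # server error: possibly transient, retry
      *)   echo "$code"; return 0 ;;   # anything else is final
    esac
    [ "$tries" -ge "$retries" ] && break
    tries=$((tries + 1))
    sleep "$delay"
  done
  echo "$code"
  return 1
}
```

With `max_retries` set to 2, a check that fails once with 503 and then returns 200 is reported as 200, so no alert would fire for the single failure.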

DEGRADED vs DOWN

  • Alert on DOWN: Always. Something is seriously wrong.
  • Alert on DEGRADED: Optional. Enable it only if someone will act on it, since frequent DEGRADED alerts can cause alert fatigue.
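
If you forward alerts through your own handler, one possible policy is to page only on DOWN and send DEGRADED to a lower-urgency channel. The sketch below shows that mapping. The `route_alert` function and the action names are hypothetical and not an Agent Status feature.

```shell
# Decide what to do with an alert based on the verdict.
# Usage: route_alert <verdict>
route_alert() {
  case "$1" in
    DOWN)     echo "page" ;;      # wake the on-call engineer
    DEGRADED) echo "notify" ;;    # team channel only, no page
    UP)       echo "resolve" ;;   # recovery: close the incident
    *)        echo "ignore" ;;    # unknown verdicts are not alerted on
  esac
}
```

Usage: `route_alert DEGRADED` prints `notify`.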

Runbook Template

Create runbooks for each agent covering: contacts, quick links, common issues (HTTP 500, high latency, auth failures), and rollback procedures.
