# Alertmanager configuration for PagerDuty integration # # This file configures routing and escalation for StemeDB alerts to PagerDuty. # Place this in /etc/alertmanager/alertmanager.yml or merge with existing config. global: # PagerDuty Events API v2 endpoint pagerduty_url: 'https://events.pagerduty.com/v2/enqueue' # Default resolve timeout (how long to wait before auto-resolving) resolve_timeout: 5m # Route configuration route: # Group alerts by alert name and severity group_by: ['alertname', 'severity', 'component'] # Wait 10s before sending initial notification (batch alerts) group_wait: 10s # Send updates every 5 minutes for ongoing incidents group_interval: 5m # Repeat notifications every 3 hours if not resolved repeat_interval: 3h # Default receiver for all alerts receiver: 'pagerduty-warning' # Route critical alerts immediately to on-call routes: - match: severity: critical receiver: 'pagerduty-critical' group_wait: 10s repeat_interval: 1h - match: severity: warning receiver: 'pagerduty-warning' group_wait: 30s repeat_interval: 6h - match: severity: info receiver: 'slack-info' group_wait: 5m repeat_interval: 24h # Inhibition rules (prevent alert spam) inhibit_rules: # Inhibit warning alerts if critical alert is firing - source_match: severity: 'critical' target_match: severity: 'warning' equal: ['component', 'instance'] # Inhibit "slow fsync" if "disk nearly full" is firing - source_match: alertname: 'WALDiskNearlyFull' target_match: alertname: 'WALFsyncSlow' equal: ['instance'] # Inhibit "high latency" if "API down" is firing - source_match: alertname: 'StemeDBAPIDown' target_match: alertname: 'HighAPILatency' equal: ['instance'] # Receivers (notification destinations) receivers: # Critical alerts -> PagerDuty High Urgency - name: 'pagerduty-critical' pagerduty_configs: - service_key: '' severity: 'critical' description: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}' details: firing: '{{ .Alerts.Firing | len }}' resolved: '{{ .Alerts.Resolved | len }}' description: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}' runbook: '{{ range .Alerts }}{{ .Annotations.runbook }}{{ end }}' impact: '{{ range .Alerts }}{{ .Annotations.impact }}{{ end }}' action: '{{ range .Alerts }}{{ .Annotations.action }}{{ end }}' # Warning alerts -> PagerDuty Low Urgency - name: 'pagerduty-warning' pagerduty_configs: - service_key: '' severity: 'warning' description: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}' details: firing: '{{ .Alerts.Firing | len }}' description: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}' runbook: '{{ range .Alerts }}{{ .Annotations.runbook }}{{ end }}' # Info alerts -> Slack only (no PagerDuty) - name: 'slack-info' slack_configs: - api_url: '' channel: '#stemedb-alerts-info' title: 'StemeDB INFO Alert' text: '{{ range .Alerts }}{{ .Annotations.summary }}\n{{ end }}' # Configuration for PagerDuty Integration ## Setup Instructions ### 1. Create PagerDuty Service 1. Log into PagerDuty → **Configuration** → **Services** 2. Click **+ New Service** 3. Configure service: - **Name**: `StemeDB Critical` - **Escalation Policy**: `Ops On-Call` - **Integration Type**: `Events API v2` - **Urgency**: `High` 4. Copy the **Integration Key** (starts with `R0...`) 5. Repeat for Warning service with Low urgency ### 2. Configure Alertmanager Replace placeholders in this file: ```yaml service_key: '' ``` With your actual integration keys: ```yaml service_key: 'R01234567890ABCDEF1234567890ABCD' ``` ### 3. Test Alert ```bash # Send test alert to Alertmanager curl -X POST http://localhost:9093/api/v1/alerts -d '[{ "labels": { "alertname": "TestAlert", "severity": "critical", "component": "test" }, "annotations": { "summary": "Test alert from StemeDB monitoring setup", "description": "This is a test. Please acknowledge in PagerDuty." } }]' ``` Verify alert appears in PagerDuty within 30 seconds. ### 4. Configure Escalation Policy Recommended escalation for **Critical** alerts: 1. **Level 1** (immediate): Page primary on-call engineer 2. **Level 2** (after 5 min): Page backup on-call + manager 3. **Level 3** (after 15 min): Page director + open Slack incident channel Recommended escalation for **Warning** alerts: 1. **Level 1** (immediate): Email primary on-call engineer 2. **Level 2** (after 30 min): Page primary on-call 3. **Level 3** (after 2 hours): Page manager ### 5. Link Runbooks Update Prometheus alert rules to include PagerDuty-accessible runbook URLs: ```yaml annotations: runbook: "https://docs.stemedb.com/operations/runbooks/disk-full.md" ``` Ensure runbooks are hosted on publicly accessible URL (or VPN-accessible). ## Troubleshooting ### Alerts not appearing in PagerDuty 1. **Check Alertmanager logs:** ```bash journalctl -u alertmanager -f | grep pagerduty ``` 2. **Verify integration key:** ```bash curl -X POST https://events.pagerduty.com/v2/enqueue \ -H 'Content-Type: application/json' \ -d '{ "routing_key": "YOUR_KEY", "event_action": "trigger", "payload": { "summary": "Test event", "severity": "critical", "source": "test" } }' ``` 3. **Check PagerDuty service status:** - Verify service is not in Maintenance Mode - Check Integration Status shows "Connected" ### Alert spam / duplicates - Increase `group_interval` to batch more alerts - Add inhibition rules for related alerts - Use `repeat_interval` to reduce notification frequency ### Alerts not resolving - Verify Prometheus scrape is still working - Check `for` duration in alert rules (may need longer resolve time) - Review `resolve_timeout` in Alertmanager config ## Best Practices 1. **Test regularly**: Send test alerts monthly to verify routing 2. **Document runbooks**: Every critical alert should link to a runbook 3. **Review escalation**: Quarterly review of on-call rotation and escalation policy 4. **Alert hygiene**: Remove noisy alerts, tune thresholds based on production data 5. **Post-mortems**: Document alert response time and effectiveness after incidents