stemedb/docs/operations/monitoring/alerting/pagerduty-config.yml

# Alertmanager configuration for PagerDuty integration
#
# This file configures routing and escalation for StemeDB alerts to PagerDuty.
# Place this in /etc/alertmanager/alertmanager.yml or merge with existing config.

global:
  # PagerDuty Events API v2 endpoint
  pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'

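  # Time after which an alert is declared resolved if it carries no end time
  # and has not been updated. Prometheus always sets an end time, so this
  # mainly affects alerts pushed by other clients (for example, curl tests).
  resolve_timeout: 5m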

# Route configuration
route:
  # Group alerts by alert name, severity, and component
  group_by: ['alertname', 'severity', 'component']

  # Wait 10s before sending initial notification (batch alerts)
  group_wait: 10s

  # Send updates every 5 minutes for ongoing incidents
  group_interval: 5m

  # Repeat notifications every 3 hours if not resolved
  repeat_interval: 3h

  # Default receiver for all alerts
  receiver: 'pagerduty-warning'

  # Route critical alerts immediately to on-call
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      group_wait: 10s
      repeat_interval: 1h

    - match:
        severity: warning
      receiver: 'pagerduty-warning'
      group_wait: 30s
      repeat_interval: 6h

    - match:
        severity: info
      receiver: 'slack-info'
      group_wait: 5m
      repeat_interval: 24h

# Inhibition rules (prevent alert spam)
inhibit_rules:
  # Inhibit warning alerts if critical alert is firing
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['component', 'instance']

  # Inhibit "slow fsync" if "disk nearly full" is firing
  - source_match:
      alertname: 'WALDiskNearlyFull'
    target_match:
      alertname: 'WALFsyncSlow'
    equal: ['instance']

  # Inhibit "high latency" if "API down" is firing
  - source_match:
      alertname: 'StemeDBAPIDown'
    target_match:
      alertname: 'HighAPILatency'
    equal: ['instance']

# Receivers (notification destinations)
receivers:
  # Critical alerts -> PagerDuty High Urgency
  - name: 'pagerduty-critical'
    pagerduty_configs:
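      # Events API v2 integrations use routing_key (service_key is for the
      # legacy "Prometheus" integration type)
      - routing_key: '<YOUR_PAGERDUTY_INTEGRATION_KEY_CRITICAL>'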
        severity: 'critical'
        description: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
        details:
          firing: '{{ .Alerts.Firing | len }}'
          resolved: '{{ .Alerts.Resolved | len }}'
          description: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
          runbook: '{{ range .Alerts }}{{ .Annotations.runbook }}{{ end }}'
          impact: '{{ range .Alerts }}{{ .Annotations.impact }}{{ end }}'
          action: '{{ range .Alerts }}{{ .Annotations.action }}{{ end }}'

  # Warning alerts -> PagerDuty Low Urgency
  - name: 'pagerduty-warning'
    pagerduty_configs:
      - routing_key: '<YOUR_PAGERDUTY_INTEGRATION_KEY_WARNING>'
        severity: 'warning'
        description: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
        details:
          firing: '{{ .Alerts.Firing | len }}'
          description: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
          runbook: '{{ range .Alerts }}{{ .Annotations.runbook }}{{ end }}'

  # Info alerts -> Slack only (no PagerDuty)
  - name: 'slack-info'
    slack_configs:
      - api_url: '<YOUR_SLACK_WEBHOOK_URL>'
        channel: '#stemedb-alerts-info'
        title: 'StemeDB INFO Alert'
        # Double quotes so that \n is a real newline (single quotes keep it literal)
        text: "{{ range .Alerts }}{{ .Annotations.summary }}\n{{ end }}"

# Configuration for PagerDuty Integration

## Setup Instructions

### 1. Create PagerDuty Service

1. Log into PagerDuty → **Configuration** → **Services**
2. Click **+ New Service**
3. Configure service:
   - **Name**: `StemeDB Critical`
   - **Escalation Policy**: `Ops On-Call`
   - **Integration Type**: `Events API v2`
   - **Urgency**: `High`
4. Copy the **Integration Key** (starts with `R0...`)
5. Repeat these steps for a second, Warning service with **Low** urgency, and copy its Integration Key as well

### 2. Configure Alertmanager

Replace each placeholder in this file:

```yaml
routing_key: '<YOUR_PAGERDUTY_INTEGRATION_KEY_CRITICAL>'
```

with the integration key of the matching PagerDuty service:

```yaml
routing_key: 'R01234567890ABCDEF1234567890ABCD'
```
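
To keep integration keys out of the config file itself, recent Alertmanager releases also accept a `routing_key_file` field that reads the key from disk. A minimal sketch, assuming an Events API v2 integration and a hypothetical secrets path; check that your Alertmanager version supports the field before relying on it:

```yaml
receivers:
  - name: 'pagerduty-critical'
    pagerduty_configs:
      # Hypothetical path: a file containing only the integration key,
      # readable by the user that runs Alertmanager.
      - routing_key_file: '/etc/alertmanager/secrets/pagerduty-critical.key'
        severity: 'critical'
```

Use either `routing_key` or `routing_key_file` for a given entry, not both.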

### 3. Test Alert

```bash
# Send a test alert to Alertmanager (API v2; the v1 API was removed in Alertmanager 0.27)
curl -X POST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{
  "labels": {
    "alertname": "TestAlert",
    "severity": "critical",
    "component": "test"
  },
  "annotations": {
    "summary": "Test alert from StemeDB monitoring setup",
    "description": "This is a test. Please acknowledge in PagerDuty."
  }
}]'
```

Verify that the alert appears in PagerDuty within about 30 seconds (the critical route waits `group_wait: 10s` before sending the first notification).
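
The curl test exercises only Alertmanager and PagerDuty. To also cover the Prometheus side, one option is a temporary rule that always fires. This is a sketch, not part of the StemeDB rule set; the group name is hypothetical, and the labels mirror the curl payload above:

```yaml
groups:
  - name: stemedb-pagerduty-test   # hypothetical group name
    rules:
      - alert: TestAlert
        expr: vector(1)            # always returns a sample, so the alert always fires
        labels:
          severity: critical
          component: test
        annotations:
          summary: "Test alert from StemeDB monitoring setup"
          description: "This is a test. Please acknowledge in PagerDuty."
```

Remove the rule once the incident has been acknowledged; otherwise it keeps re-notifying at every `repeat_interval`.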

### 4. Configure Escalation Policy

Recommended escalation for **Critical** alerts:

1. **Level 1** (immediate): Page primary on-call engineer
2. **Level 2** (after 5 min): Page backup on-call + manager
3. **Level 3** (after 15 min): Page director + open Slack incident channel

Recommended escalation for **Warning** alerts:

1. **Level 1** (immediate): Email primary on-call engineer
2. **Level 2** (after 30 min): Page primary on-call
3. **Level 3** (after 2 hours): Page manager

### 5. Link Runbooks

Update Prometheus alert rules to include PagerDuty-accessible runbook URLs:

```yaml
annotations:
  runbook: "https://docs.stemedb.com/operations/runbooks/disk-full.md"
```

Ensure runbooks are hosted at a URL that on-call engineers can reach (publicly accessible or VPN-accessible).
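
The receiver templates in this file read the `summary`, `description`, `runbook`, `impact`, and `action` annotations, so a complete rule would set all five. The following is an illustrative sketch only: the group name, expression, mount point, and threshold are assumptions (it uses node_exporter filesystem metrics), not the actual StemeDB rule.

```yaml
groups:
  - name: stemedb-wal              # hypothetical group name
    rules:
      - alert: WALDiskNearlyFull
        # Assumed expression: less than 10% free on a hypothetical WAL mount point
        expr: |
          node_filesystem_avail_bytes{mountpoint="/var/lib/stemedb"}
            / node_filesystem_size_bytes{mountpoint="/var/lib/stemedb"} < 0.10
        for: 5m
        labels:
          severity: critical       # selects the 'pagerduty-critical' route
          component: wal           # used by group_by and the inhibition rules
        annotations:
          summary: "WAL disk on {{ $labels.instance }} is below 10% free"
          description: "Free space on the WAL volume has stayed under 10% for 5 minutes."
          runbook: "https://docs.stemedb.com/operations/runbooks/disk-full.md"
          impact: "Writes may fail once the disk fills."
          action: "Free space or expand the volume; see the runbook."
```

An annotation that is missing from a rule renders as an empty string in the PagerDuty details, so gaps show up as blank fields rather than errors.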

## Troubleshooting

### Alerts not appearing in PagerDuty

1. **Check Alertmanager logs:**
   ```bash
   journalctl -u alertmanager -f | grep pagerduty
   ```

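2. **Verify integration key:** send an event directly to PagerDuty, bypassing Alertmanager. A valid key returns HTTP 202 with `"status": "success"`.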
   ```bash
   curl -X POST https://events.pagerduty.com/v2/enqueue \
     -H 'Content-Type: application/json' \
     -d '{
       "routing_key": "YOUR_KEY",
       "event_action": "trigger",
       "payload": {
         "summary": "Test event",
         "severity": "critical",
         "source": "test"
       }
     }'
   ```

3. **Check PagerDuty service status:**
   - Verify service is not in Maintenance Mode
   - Check Integration Status shows "Connected"

### Alert spam / duplicates

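- Increase `group_interval` so that new alerts joining an existing group are batched into fewer updates
- Add inhibition rules so that a root-cause alert suppresses its symptom alerts
- Increase `repeat_interval` to re-notify less often while an alert remains unresolved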
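
As a sketch of the first two options, the fragment below adds a quieter child route for one noisy component and an extra inhibition rule. It is meant to be merged into the existing `route` and `inhibit_rules` sections, not used on its own. The alert name `StemeDBNodeDown`, the `wal` component value, and the intervals are hypothetical.

```yaml
route:
  routes:
    # Place this before the generic 'severity: warning' route:
    # child routes are evaluated in order and the first match wins.
    - match:
        severity: warning
        component: wal             # hypothetical noisy component
      receiver: 'pagerduty-warning'
      group_interval: 15m          # batch updates to an existing group
      repeat_interval: 12h         # re-notify less often

inhibit_rules:
  # Suppress warning alerts from an instance while that instance is reported down
  - source_match:
      alertname: 'StemeDBNodeDown' # hypothetical alert name
    target_match:
      severity: 'warning'
    equal: ['instance']
```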

### Alerts not resolving

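- Verify that Prometheus is still scraping the target and evaluating the rule
- Check the alert expression in Prometheus: an alert resolves only once its expression stops returning results (`for` delays firing, not resolution)
- Review `resolve_timeout` in the Alertmanager config; it applies only to alerts sent without an end time (such as manual curl tests), not to alerts sent by Prometheus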
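
Resolve notifications are also a per-receiver setting. Alertmanager sends them by default for `pagerduty_configs` but not for `slack_configs`. A sketch that makes the behavior explicit for the receivers in this file (a fragment to merge into the existing entries, assuming Events API v2 keys):

```yaml
receivers:
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - routing_key: '<YOUR_PAGERDUTY_INTEGRATION_KEY_CRITICAL>'
        severity: 'critical'
        send_resolved: true        # already the default for PagerDuty
  - name: 'slack-info'
    slack_configs:
      - api_url: '<YOUR_SLACK_WEBHOOK_URL>'
        channel: '#stemedb-alerts-info'
        send_resolved: true        # defaults to false for Slack
```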

## Best Practices

1. **Test regularly**: Send test alerts monthly to verify routing
2. **Document runbooks**: Every critical alert should link to a runbook
3. **Review escalation**: Quarterly review of on-call rotation and escalation policy
4. **Alert hygiene**: Remove noisy alerts, tune thresholds based on production data
5. **Post-mortems**: Document alert response time and effectiveness after incidents