stemedb/docs/operations/monitoring/alerting/pagerduty-config.yml
jml 3e7eddc074 feat: add enterprise production readiness infrastructure
This commit implements comprehensive production hardening across multiple
layers to prepare StemeDB for enterprise pilot deployments:

## API Layer
- Add rate limiting middleware with configurable limits per endpoint
- Enhance error handling with detailed context and proper HTTP status codes
- Add security hardening tests for input validation and boundary conditions
- Create store_helpers module for defensive storage access patterns

## Storage & WAL
- Optimize group commit batching for higher throughput
- Add defensive error handling in hybrid backend with proper fallbacks
- Enhance WAL journal durability guarantees with fsync validation
- Improve index store query performance with better caching

## Operations & Deployment
- Add comprehensive operations documentation (deployment, monitoring, DR)
- Create systemd units for backup, WAL archival, and verification
- Add monitoring configs (Prometheus alerts, metrics exporters)
- Implement backup/restore scripts with verification and S3 archival
- Add DR drill automation and runbook procedures
- Create load balancer configs (nginx, envoy) with health checks

## Documentation
- Update CLAUDE.md with operations and troubleshooting guides
- Expand roadmap with production readiness milestones
- Add pilot success criteria and deployment reference architecture
- Document TLS setup, monitoring integration, and incident response

## Configuration
- Add .env.example with all required environment variables
- Document resource sizing for different deployment scales
- Add configuration examples for various deployment topologies

This positions StemeDB for successful enterprise pilots with proper
operational discipline, monitoring, backup/DR, and security hardening.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-12 06:08:15 +00:00

229 lines
6.5 KiB
YAML

# Alertmanager configuration for PagerDuty integration
#
# This file configures routing and escalation for StemeDB alerts to PagerDuty.
# Place this in /etc/alertmanager/alertmanager.yml or merge with existing config.
global:
# PagerDuty Events API v2 endpoint
pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'
# Default resolve timeout (how long to wait before auto-resolving)
resolve_timeout: 5m
# Route configuration
route:
# Group alerts by alert name and severity
group_by: ['alertname', 'severity', 'component']
# Wait 10s before sending initial notification (batch alerts)
group_wait: 10s
# Send updates every 5 minutes for ongoing incidents
group_interval: 5m
# Repeat notifications every 3 hours if not resolved
repeat_interval: 3h
# Default receiver for all alerts
receiver: 'pagerduty-warning'
# Route critical alerts immediately to on-call
routes:
- match:
severity: critical
receiver: 'pagerduty-critical'
group_wait: 10s
repeat_interval: 1h
- match:
severity: warning
receiver: 'pagerduty-warning'
group_wait: 30s
repeat_interval: 6h
- match:
severity: info
receiver: 'slack-info'
group_wait: 5m
repeat_interval: 24h
# Inhibition rules (prevent alert spam)
inhibit_rules:
# Inhibit warning alerts if critical alert is firing
- source_match:
severity: 'critical'
target_match:
severity: 'warning'
equal: ['component', 'instance']
# Inhibit "slow fsync" if "disk nearly full" is firing
- source_match:
alertname: 'WALDiskNearlyFull'
target_match:
alertname: 'WALFsyncSlow'
equal: ['instance']
# Inhibit "high latency" if "API down" is firing
- source_match:
alertname: 'StemeDBAPIDown'
target_match:
alertname: 'HighAPILatency'
equal: ['instance']
# Receivers (notification destinations)
receivers:
# Critical alerts -> PagerDuty High Urgency
- name: 'pagerduty-critical'
pagerduty_configs:
- service_key: '<YOUR_PAGERDUTY_INTEGRATION_KEY_CRITICAL>'
severity: 'critical'
description: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
details:
firing: '{{ .Alerts.Firing | len }}'
resolved: '{{ .Alerts.Resolved | len }}'
description: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
runbook: '{{ range .Alerts }}{{ .Annotations.runbook }}{{ end }}'
impact: '{{ range .Alerts }}{{ .Annotations.impact }}{{ end }}'
action: '{{ range .Alerts }}{{ .Annotations.action }}{{ end }}'
# Warning alerts -> PagerDuty Low Urgency
- name: 'pagerduty-warning'
pagerduty_configs:
- service_key: '<YOUR_PAGERDUTY_INTEGRATION_KEY_WARNING>'
severity: 'warning'
description: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
details:
firing: '{{ .Alerts.Firing | len }}'
description: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
runbook: '{{ range .Alerts }}{{ .Annotations.runbook }}{{ end }}'
# Info alerts -> Slack only (no PagerDuty)
- name: 'slack-info'
slack_configs:
- api_url: '<YOUR_SLACK_WEBHOOK_URL>'
channel: '#stemedb-alerts-info'
title: 'StemeDB INFO Alert'
text: '{{ range .Alerts }}{{ .Annotations.summary }}\n{{ end }}'
# Configuration for PagerDuty Integration
## Setup Instructions
### 1. Create PagerDuty Service
1. Log into PagerDuty → **Configuration** → **Services**
2. Click **+ New Service**
3. Configure service:
- **Name**: `StemeDB Critical`
- **Escalation Policy**: `Ops On-Call`
- **Integration Type**: `Events API v2`
- **Urgency**: `High`
4. Copy the **Integration Key** (starts with `R0...`)
5. Repeat for Warning service with Low urgency
### 2. Configure Alertmanager
Replace placeholders in this file:
```yaml
service_key: '<YOUR_PAGERDUTY_INTEGRATION_KEY_CRITICAL>'
```
With your actual integration keys:
```yaml
service_key: 'R01234567890ABCDEF1234567890ABCD'
```
### 3. Test Alert
```bash
# Send test alert to Alertmanager
curl -X POST http://localhost:9093/api/v1/alerts -d '[{
"labels": {
"alertname": "TestAlert",
"severity": "critical",
"component": "test"
},
"annotations": {
"summary": "Test alert from StemeDB monitoring setup",
"description": "This is a test. Please acknowledge in PagerDuty."
}
}]'
```
Verify alert appears in PagerDuty within 30 seconds.
### 4. Configure Escalation Policy
Recommended escalation for **Critical** alerts:
1. **Level 1** (immediate): Page primary on-call engineer
2. **Level 2** (after 5 min): Page backup on-call + manager
3. **Level 3** (after 15 min): Page director + open Slack incident channel
Recommended escalation for **Warning** alerts:
1. **Level 1** (immediate): Email primary on-call engineer
2. **Level 2** (after 30 min): Page primary on-call
3. **Level 3** (after 2 hours): Page manager
### 5. Link Runbooks
Update Prometheus alert rules to include PagerDuty-accessible runbook URLs:
```yaml
annotations:
runbook: "https://docs.stemedb.com/operations/runbooks/disk-full.md"
```
Ensure runbooks are hosted on publicly accessible URL (or VPN-accessible).
## Troubleshooting
### Alerts not appearing in PagerDuty
1. **Check Alertmanager logs:**
```bash
journalctl -u alertmanager -f | grep pagerduty
```
2. **Verify integration key:**
```bash
curl -X POST https://events.pagerduty.com/v2/enqueue \
-H 'Content-Type: application/json' \
-d '{
"routing_key": "YOUR_KEY",
"event_action": "trigger",
"payload": {
"summary": "Test event",
"severity": "critical",
"source": "test"
}
}'
```
3. **Check PagerDuty service status:**
- Verify service is not in Maintenance Mode
- Check Integration Status shows "Connected"
### Alert spam / duplicates
- Increase `group_interval` to batch more alerts
- Add inhibition rules for related alerts
- Use `repeat_interval` to reduce notification frequency
### Alerts not resolving
- Verify Prometheus scrape is still working
- Check `for` duration in alert rules (may need longer resolve time)
- Review `resolve_timeout` in Alertmanager config
## Best Practices
1. **Test regularly**: Send test alerts monthly to verify routing
2. **Document runbooks**: Every critical alert should link to a runbook
3. **Review escalation**: Quarterly review of on-call rotation and escalation policy
4. **Alert hygiene**: Remove noisy alerts, tune thresholds based on production data
5. **Post-mortems**: Document alert response time and effectiveness after incidents