This commit implements comprehensive production hardening across multiple layers to prepare StemeDB for enterprise pilot deployments:

## API Layer
- Add rate limiting middleware with configurable limits per endpoint
- Enhance error handling with detailed context and proper HTTP status codes
- Add security hardening tests for input validation and boundary conditions
- Create store_helpers module for defensive storage access patterns

## Storage & WAL
- Optimize group commit batching for higher throughput
- Add defensive error handling in hybrid backend with proper fallbacks
- Enhance WAL journal durability guarantees with fsync validation
- Improve index store query performance with better caching

## Operations & Deployment
- Add comprehensive operations documentation (deployment, monitoring, DR)
- Create systemd units for backup, WAL archival, and verification
- Add monitoring configs (Prometheus alerts, metrics exporters)
- Implement backup/restore scripts with verification and S3 archival
- Add DR drill automation and runbook procedures
- Create load balancer configs (nginx, envoy) with health checks

## Documentation
- Update CLAUDE.md with operations and troubleshooting guides
- Expand roadmap with production readiness milestones
- Add pilot success criteria and deployment reference architecture
- Document TLS setup, monitoring integration, and incident response

## Configuration
- Add .env.example with all required environment variables
- Document resource sizing for different deployment scales
- Add configuration examples for various deployment topologies

This positions StemeDB for successful enterprise pilots with proper operational discipline, monitoring, backup/DR, and security hardening.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
229 lines
6.5 KiB
YAML
# Alertmanager configuration for PagerDuty integration
#
# This file configures routing and escalation for StemeDB alerts to PagerDuty.
# Place this in /etc/alertmanager/alertmanager.yml or merge with existing config.

global:
  # PagerDuty Events API v2 endpoint
  pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'

  # Default resolve timeout (how long to wait before auto-resolving)
  resolve_timeout: 5m

# Route configuration
route:
  # Group alerts by alert name, severity, and component
  group_by: ['alertname', 'severity', 'component']

  # Wait 10s before sending initial notification (batch alerts)
  group_wait: 10s

  # Send updates every 5 minutes for ongoing incidents
  group_interval: 5m

  # Repeat notifications every 3 hours if not resolved
  repeat_interval: 3h

  # Default receiver for all alerts
  receiver: 'pagerduty-warning'

  # Route critical alerts immediately to on-call
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      group_wait: 10s
      repeat_interval: 1h

    - match:
        severity: warning
      receiver: 'pagerduty-warning'
      group_wait: 30s
      repeat_interval: 6h

    - match:
        severity: info
      receiver: 'slack-info'
      group_wait: 5m
      repeat_interval: 24h

# Inhibition rules (prevent alert spam)
inhibit_rules:
  # Inhibit warning alerts if critical alert is firing
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['component', 'instance']

  # Inhibit "slow fsync" if "disk nearly full" is firing
  - source_match:
      alertname: 'WALDiskNearlyFull'
    target_match:
      alertname: 'WALFsyncSlow'
    equal: ['instance']

  # Inhibit "high latency" if "API down" is firing
  - source_match:
      alertname: 'StemeDBAPIDown'
    target_match:
      alertname: 'HighAPILatency'
    equal: ['instance']

# Receivers (notification destinations)
receivers:
  # Critical alerts -> PagerDuty High Urgency
  - name: 'pagerduty-critical'
    pagerduty_configs:
      # Events API v2 integrations take routing_key; service_key is only for
      # the legacy Events API v1 ("Prometheus") integration type.
      - routing_key: '<YOUR_PAGERDUTY_INTEGRATION_KEY_CRITICAL>'
        severity: 'critical'
        description: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
        details:
          firing: '{{ .Alerts.Firing | len }}'
          resolved: '{{ .Alerts.Resolved | len }}'
          description: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
          runbook: '{{ range .Alerts }}{{ .Annotations.runbook }}{{ end }}'
          impact: '{{ range .Alerts }}{{ .Annotations.impact }}{{ end }}'
          action: '{{ range .Alerts }}{{ .Annotations.action }}{{ end }}'

  # Warning alerts -> PagerDuty Low Urgency
  - name: 'pagerduty-warning'
    pagerduty_configs:
      - routing_key: '<YOUR_PAGERDUTY_INTEGRATION_KEY_WARNING>'
        severity: 'warning'
        description: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
        details:
          firing: '{{ .Alerts.Firing | len }}'
          description: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
          runbook: '{{ range .Alerts }}{{ .Annotations.runbook }}{{ end }}'

  # Info alerts -> Slack only (no PagerDuty)
  - name: 'slack-info'
    slack_configs:
      - api_url: '<YOUR_SLACK_WEBHOOK_URL>'
        channel: '#stemedb-alerts-info'
        title: 'StemeDB INFO Alert'
        # Double-quoted so that \n is a real newline between summaries
        text: "{{ range .Alerts }}{{ .Annotations.summary }}\n{{ end }}"

# Configuration for PagerDuty Integration

## Setup Instructions

### 1. Create PagerDuty Service

1. Log in to PagerDuty → **Configuration** → **Services**
2. Click **+ New Service**
3. Configure service:
   - **Name**: `StemeDB Critical`
   - **Escalation Policy**: `Ops On-Call`
   - **Integration Type**: `Events API v2`
   - **Urgency**: `High`
4. Copy the **Integration Key** (starts with `R0...`)
5. Repeat for the Warning service with Low urgency

### 2. Configure Alertmanager

Replace placeholders in this file:

```yaml
routing_key: '<YOUR_PAGERDUTY_INTEGRATION_KEY_CRITICAL>'
```

With your actual integration keys:

```yaml
routing_key: 'R01234567890ABCDEF1234567890ABCD'
```

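If keeping integration keys out of the main config file matters in your environment, the key can be read from a separate file instead. A minimal sketch, assuming your Alertmanager release supports `routing_key_file` (the file-based counterpart of `routing_key`); the secrets path is a hypothetical example, not something this repository defines:

```yaml
receivers:
  - name: 'pagerduty-critical'
    pagerduty_configs:
      # Assumption: the key is stored at this hypothetical path and is
      # readable only by the user that runs Alertmanager.
      - routing_key_file: '/etc/alertmanager/secrets/pagerduty-critical.key'
        severity: 'critical'
```

Set either the inline key or the file variant on a given entry, not both.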
### 3. Test Alert

```bash
# Send test alert to Alertmanager (v2 API; the v1 API has been removed from recent releases)
curl -X POST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{
    "labels": {
      "alertname": "TestAlert",
      "severity": "critical",
      "component": "test"
    },
    "annotations": {
      "summary": "Test alert from StemeDB monitoring setup",
      "description": "This is a test. Please acknowledge in PagerDuty."
    }
  }]'
```

Verify that the alert appears in PagerDuty within about 30 seconds (the critical route waits `group_wait: 10s` before notifying).

### 4. Configure Escalation Policy

Recommended escalation for **Critical** alerts:

1. **Level 1** (immediate): Page primary on-call engineer
2. **Level 2** (after 5 min): Page backup on-call + manager
3. **Level 3** (after 15 min): Page director + open Slack incident channel

Recommended escalation for **Warning** alerts:

1. **Level 1** (immediate): Email primary on-call engineer
2. **Level 2** (after 30 min): Page primary on-call
3. **Level 3** (after 2 hours): Page manager

### 5. Link Runbooks

Update Prometheus alert rules to include PagerDuty-accessible runbook URLs:

```yaml
annotations:
  runbook: "https://docs.stemedb.com/operations/runbooks/disk-full.md"
```

Ensure runbooks are hosted at a publicly accessible URL (or one reachable over the VPN that responders use).

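The receivers in this file read the `summary`, `description`, `runbook`, `impact`, and `action` annotations, and routing depends on the `severity` and `component` labels. The following sketch shows a Prometheus rule that supplies all of them. The group name, expression, threshold, mount point, and annotation wording are illustrative assumptions (node_exporter filesystem metrics are assumed to be scraped); only the alert name and runbook URL come from this document.

```yaml
groups:
  - name: stemedb-wal            # hypothetical group name
    rules:
      - alert: WALDiskNearlyFull
        # Assumption: the WAL volume is mounted at /var/lib/stemedb/wal
        expr: |
          node_filesystem_avail_bytes{mountpoint="/var/lib/stemedb/wal"}
            / node_filesystem_size_bytes{mountpoint="/var/lib/stemedb/wal"} < 0.10
        for: 5m
        labels:
          severity: critical     # selects the pagerduty-critical route
          component: wal         # used by group_by and the inhibit rules
        annotations:
          summary: "WAL disk on {{ $labels.instance }} has less than 10% free"
          description: "Free space on the WAL volume has stayed below 10% for 5 minutes."
          runbook: "https://docs.stemedb.com/operations/runbooks/disk-full.md"
          impact: "Writes may fail once the volume fills."
          action: "Archive or prune WAL segments, or extend the volume."
```

An annotation that a rule omits renders as an empty string in the PagerDuty details, so it is worth filling in every field the receiver templates reference.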
## Troubleshooting

### Alerts not appearing in PagerDuty

1. **Check Alertmanager logs:**
   ```bash
   journalctl -u alertmanager -f | grep pagerduty
   ```

2. **Verify integration key:**
   ```bash
   curl -X POST https://events.pagerduty.com/v2/enqueue \
     -H 'Content-Type: application/json' \
     -d '{
       "routing_key": "YOUR_KEY",
       "event_action": "trigger",
       "payload": {
         "summary": "Test event",
         "severity": "critical",
         "source": "test"
       }
     }'
   ```

3. **Check PagerDuty service status:**
   - Verify service is not in Maintenance Mode
   - Check Integration Status shows "Connected"

### Alert spam / duplicates

- Increase `group_interval` to batch more alerts
- Add inhibition rules for related alerts
- Use `repeat_interval` to reduce notification frequency

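One way to apply the first and third points without changing the global defaults is to give a single noisy alert its own child route with longer intervals. A sketch, with `HighAPILatency` picked only as an example and the interval values as assumptions to be tuned against real alert volume:

```yaml
route:
  routes:
    # Assumption: HighAPILatency is the noisy alert. List this entry before
    # the generic severity routes, because the first matching route wins.
    - match:
        alertname: HighAPILatency
      receiver: 'pagerduty-warning'
      group_interval: 15m    # batch updates for an ongoing incident
      repeat_interval: 12h   # re-notify less often while it stays open
```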
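### Alerts not resolving

- Verify that Prometheus is still scraping the target and evaluating the rule
- Check the expression and `for` duration in the alert rules: `for` only delays firing, and an alert resolves once its expression stops matching, so a flapping metric can keep it open
- Review `resolve_timeout` in the Alertmanager config (it applies only to alerts that arrive without an end time, which Prometheus normally sets)
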
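Whether a resolution is forwarded at all is controlled per receiver by `send_resolved`. In upstream Alertmanager the documented default is `true` for PagerDuty receivers and `false` for Slack receivers, so the `slack-info` receiver in this file would not post resolutions unless the option is set. A sketch of enabling it, assuming resolved messages are wanted in the info channel:

```yaml
receivers:
  - name: 'slack-info'
    slack_configs:
      - api_url: '<YOUR_SLACK_WEBHOOK_URL>'
        channel: '#stemedb-alerts-info'
        # Assumption: the team wants a follow-up message when an info alert clears
        send_resolved: true
```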
## Best Practices

1. **Test regularly**: Send test alerts monthly to verify routing
2. **Document runbooks**: Every critical alert should link to a runbook
3. **Review escalation**: Quarterly review of on-call rotation and escalation policy
4. **Alert hygiene**: Remove noisy alerts, tune thresholds based on production data
5. **Post-mortems**: Document alert response time and effectiveness after incidents