stemedb/docs/operations/monitoring/alerting/escalation-policy.md
jml 3e7eddc074 feat: add enterprise production readiness infrastructure
This commit implements comprehensive production hardening across multiple
layers to prepare StemeDB for enterprise pilot deployments:

## API Layer
- Add rate limiting middleware with configurable limits per endpoint
- Enhance error handling with detailed context and proper HTTP status codes
- Add security hardening tests for input validation and boundary conditions
- Create store_helpers module for defensive storage access patterns

## Storage & WAL
- Optimize group commit batching for higher throughput
- Add defensive error handling in hybrid backend with proper fallbacks
- Enhance WAL journal durability guarantees with fsync validation
- Improve index store query performance with better caching

## Operations & Deployment
- Add comprehensive operations documentation (deployment, monitoring, DR)
- Create systemd units for backup, WAL archival, and verification
- Add monitoring configs (Prometheus alerts, metrics exporters)
- Implement backup/restore scripts with verification and S3 archival
- Add DR drill automation and runbook procedures
- Create load balancer configs (nginx, envoy) with health checks

## Documentation
- Update CLAUDE.md with operations and troubleshooting guides
- Expand roadmap with production readiness milestones
- Add pilot success criteria and deployment reference architecture
- Document TLS setup, monitoring integration, and incident response

## Configuration
- Add .env.example with all required environment variables
- Document resource sizing for different deployment scales
- Add configuration examples for various deployment topologies

This positions StemeDB for successful enterprise pilots with proper
operational discipline, monitoring, backup/DR, and security hardening.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-12 06:08:15 +00:00

274 lines
8.7 KiB
Markdown

# StemeDB Alert Escalation Policy
This document defines how StemeDB alerts escalate based on severity, response time, and notification channels.
## Severity Levels
| Severity | Definition | Response Time | Notification |
|----------|------------|---------------|--------------|
| **CRITICAL** | Service down, data loss risk, security breach | Immediate (<5 min) | PagerDuty (page) + Slack + Email |
| **WARNING** | Service degraded, SLO at risk, capacity concern | 30 minutes | PagerDuty (email) + Slack |
| **INFO** | Informational, audit trail, no action required | Best effort | Slack only |
---
## CRITICAL Alert Escalation
### Level 1 (0-5 minutes)
- **Notification:** PagerDuty page + #stemedb-alerts-critical Slack mention
- **Recipients:** Primary on-call engineer
- **Action:** Acknowledge alert in PagerDuty within 5 minutes
### Level 2 (5-15 minutes)
- **Trigger:** No acknowledgment after 5 minutes
- **Notification:** PagerDuty page escalates to backup on-call + manager
- **Recipients:** Backup on-call engineer, Engineering Manager
- **Action:**
- Backup on-call joins incident
- Create incident channel: `#incident-YYYY-MM-DD-HH-MM`
- Manager monitors for escalation needs
### Level 3 (15-30 minutes)
- **Trigger:** No resolution after 15 minutes
- **Notification:** PagerDuty page escalates to director + SRE lead
- **Recipients:** Engineering Director, SRE Lead, Product Lead
- **Action:**
- Director assesses need for customer communication
- SRE lead coordinates with infrastructure teams
- Consider engaging vendor support (AWS, etc.)
### Level 4 (30+ minutes)
- **Trigger:** Ongoing incident >30 minutes
- **Notification:** Email to executive team
- **Recipients:** CTO, VP Engineering, Customer Success
- **Action:**
- CTO decides on customer communication
- Customer Success prepares incident notification
- Schedule post-mortem review
---
## WARNING Alert Escalation
### Level 1 (0-30 minutes)
- **Notification:** PagerDuty email + #stemedb-alerts-warning Slack
- **Recipients:** Primary on-call engineer
- **Action:** Review alert within 30 minutes, add to task backlog if non-urgent
### Level 2 (30-120 minutes)
- **Trigger:** No acknowledgment after 30 minutes
- **Notification:** PagerDuty escalates to page
- **Recipients:** Primary on-call engineer (now paged)
- **Action:** Acknowledge and triage within 15 minutes
### Level 3 (2-4 hours)
- **Trigger:** No resolution after 2 hours
- **Notification:** Email to manager
- **Recipients:** Engineering Manager
- **Action:** Manager assigns ticket, schedules investigation
### Level 4 (4+ hours / escalating)
- **Trigger:** Warning alert escalating to critical thresholds
- **Notification:** Upgrade to CRITICAL escalation path
- **Action:** Follow CRITICAL escalation policy
---
## INFO Alert Handling
- **Notification:** #stemedb-alerts-info Slack only (no pages)
- **Recipients:** Engineering team (optional monitoring)
- **Action:** No immediate action required. Review during business hours.
**Escalation:** INFO alerts do NOT escalate unless manually upgraded by on-call engineer.
---
## Alert-Specific Escalation
### StemeDBAPIDown (CRITICAL)
| Time | Action | Owner |
|------|--------|-------|
| 0 min | Page on-call | Primary on-call |
| 2 min | Check runbook, verify API health | Primary on-call |
| 5 min | If not resolved, escalate to backup + manager | Backup on-call |
| 10 min | Engage AWS support if infrastructure issue | Manager |
| 15 min | Customer communication decision | Director |
### WALDiskNearlyFull (CRITICAL)
| Time | Action | Owner |
|------|--------|-------|
| 0 min | Page on-call | Primary on-call |
| 5 min | Run disk cleanup script | Primary on-call |
| 10 min | If cleanup insufficient, request disk resize | Primary on-call |
| 15 min | Escalate to infrastructure team | Manager |
| 20 min | Consider failover to replica with more disk | SRE lead |
### ReplicationLagCritical (CRITICAL)
| Time | Action | Owner |
|------|--------|-------|
| 0 min | Page on-call | Primary on-call |
| 5 min | Check network connectivity, peer health | Primary on-call |
| 10 min | Check disk I/O on lagging node (`iostat -x`) | Primary on-call |
| 15 min | If persistent, escalate to network team | Manager |
| 30 min | Consider force-resyncing peer | SRE lead |
### HighAPIErrorRate (WARNING)
| Time | Action | Owner |
|------|--------|-------|
| 0 min | Email on-call | Primary on-call |
| 30 min | Review logs for error patterns | Primary on-call |
| 1 hour | If rate increasing, upgrade to CRITICAL | Primary on-call |
| 2 hours | Create ticket, assign to team | Manager |
---
## Notification Channels by Severity
| Severity | PagerDuty | Slack | Email | SMS |
|----------|-----------|-------|-------|-----|
| CRITICAL | ✅ Page (high urgency) | ✅ @channel mention | ✅ All on-call | ✅ Primary only |
| WARNING | ✅ Email (low urgency) | ✅ @here mention | ✅ Primary on-call | ❌ |
| INFO | ❌ | ✅ No mentions | ❌ | ❌ |
---
## On-Call Rotation
### Primary On-Call
- **Shift length:** 1 week (Mon 9am - Mon 9am)
- **Response time:** <5 minutes for CRITICAL, <30 minutes for WARNING
- **Compensation:** 1 day PTO per week on-call + overtime pay for incidents
- **Handoff:** Monday morning standup
### Backup On-Call
- **Role:** Escalation point if primary unavailable
- **Response time:** <10 minutes for CRITICAL escalation
- **Compensation:** 0.5 day PTO per week backup
### Manager On-Call
- **Role:** Escalation point for Level 2+, coordination
- **Response time:** <15 minutes for escalated CRITICAL
- **Compensation:** Part of manager responsibilities
---
## Incident Response Workflow
```mermaid
graph TD
A[Alert Fires] --> B{Severity?}
B -->|CRITICAL| C[Page on-call]
B -->|WARNING| D[Email on-call]
B -->|INFO| E[Slack only]
C --> F[Acknowledge <5min]
F --> G[Follow runbook]
G --> H{Resolved?}
H -->|Yes| I[Mark resolved]
H -->|No| J{>15min?}
J -->|Yes| K[Escalate Level 2]
K --> L[Manager joins]
L --> M[Create incident channel]
M --> N{Resolved?}
N -->|Yes| I
N -->|No| O{>30min?}
O -->|Yes| P[Escalate Level 3]
P --> Q[Director + CTO join]
Q --> R[Customer communication]
D --> S[Acknowledge <30min]
S --> T[Triage]
T --> U{Escalating?}
U -->|Yes| C
U -->|No| V[Schedule fix]
```
---
## Post-Incident Review
After **all CRITICAL alerts** and **WARNING alerts >2 hours**, conduct post-mortem:
### Template
**Incident:** [Alert name + timestamp]
**Duration:** [Time from alert to resolution]
**Impact:** [Services affected, customer impact]
**Root cause:** [Technical explanation]
**Resolution:** [What fixed it]
**Prevention:** [Action items to prevent recurrence]
### Review Meeting
- **Attendees:** On-call engineer(s), manager, affected team leads
- **Schedule:** Within 48 hours of incident
- **Duration:** 30-60 minutes
- **Output:** Action items assigned with due dates
### Metrics to Track
- **MTTA (Mean Time to Acknowledge):** Target <5 min for CRITICAL
- **MTTR (Mean Time to Resolve):** Target <30 min for CRITICAL
- **Alert accuracy:** % of alerts that required action (target >80%)
- **Escalation rate:** % of alerts that reached Level 2+ (target <20%)
---
## Alert Tuning Process
### Quarterly Review
1. **Analyze alert volume** (past 90 days)
2. **Identify noisy alerts** (>5 firings/day, low action rate)
3. **Review thresholds** (adjust based on production baseline)
4. **Remove unused alerts** (0 firings in 90 days)
5. **Add new alerts** (based on incident learnings)
### Alert Hygiene Rules
- **Every CRITICAL alert** must have a runbook
- **Every alert** must have a defined action (not just FYI)
- **False positive rate** must be <10%
- **Alert must be actionable** by on-call without expert knowledge
---
## Contact Information
| Role | Primary | Backup | Email | Phone |
|------|---------|--------|-------|-------|
| On-Call Engineer | [Name] | [Name] | oncall@example.com | +1-XXX-XXX-XXXX |
| Engineering Manager | [Name] | [Name] | manager@example.com | +1-XXX-XXX-XXXX |
| SRE Lead | [Name] | [Name] | sre-lead@example.com | +1-XXX-XXX-XXXX |
| Engineering Director | [Name] | | director@example.com | +1-XXX-XXX-XXXX |
| CTO | [Name] | | cto@example.com | +1-XXX-XXX-XXXX |
**PagerDuty Schedules:** https://yourcompany.pagerduty.com/schedules
**Slack Channels:**
- Critical: #stemedb-alerts-critical
- Warning: #stemedb-alerts-warning
- Info: #stemedb-alerts-info
- Incident: #incident-YYYY-MM-DD-HH-MM (created on-demand)
**Runbook Repository:** https://docs.stemedb.com/operations/runbooks/
**Grafana Dashboards:** https://grafana.example.com/dashboards/stemedb
---
## Revision History
| Date | Version | Changes | Author |
|------|---------|---------|--------|
| 2026-02-11 | 1.0 | Initial escalation policy | AI Assistant |
**Review schedule:** Quarterly (every 3 months)