# StemeDB Alert Escalation Policy

This document defines how StemeDB alerts escalate based on severity, response time, and notification channels.

## Severity Levels

| Severity | Definition | Response Time | Notification |
|----------|------------|---------------|--------------|
| **CRITICAL** | Service down, data loss risk, security breach | Immediate (<5 min) | PagerDuty (page) + Slack + Email |
| **WARNING** | Service degraded, SLO at risk, capacity concern | 30 minutes | PagerDuty (email) + Slack |
| **INFO** | Informational, audit trail, no action required | Best effort | Slack only |

---

## CRITICAL Alert Escalation

### Level 1 (0-5 minutes)

- **Notification:** PagerDuty page + #stemedb-alerts-critical Slack mention
- **Recipients:** Primary on-call engineer
- **Action:** Acknowledge alert in PagerDuty within 5 minutes

### Level 2 (5-15 minutes)

- **Trigger:** No acknowledgment after 5 minutes
- **Notification:** PagerDuty page escalates to backup on-call + manager
- **Recipients:** Backup on-call engineer, Engineering Manager
- **Action:**
  - Backup on-call joins incident
  - Create incident channel: `#incident-YYYY-MM-DD-HH-MM`
  - Manager monitors for escalation needs

### Level 3 (15-30 minutes)

- **Trigger:** No resolution after 15 minutes
- **Notification:** PagerDuty page escalates to director + SRE lead
- **Recipients:** Engineering Director, SRE Lead, Product Lead
- **Action:**
  - Director assesses need for customer communication
  - SRE lead coordinates with infrastructure teams
  - Consider engaging vendor support (AWS, etc.)

### Level 4 (30+ minutes)

- **Trigger:** Ongoing incident >30 minutes
- **Notification:** Email to executive team
- **Recipients:** CTO, VP Engineering, Customer Success
- **Action:**
  - CTO decides on customer communication
  - Customer Success prepares incident notification
  - Schedule post-mortem review

---

## WARNING Alert Escalation

### Level 1 (0-30 minutes)

- **Notification:** PagerDuty email + #stemedb-alerts-warning Slack
- **Recipients:** Primary on-call engineer
- **Action:** Review alert within 30 minutes, add to task backlog if non-urgent

### Level 2 (30-120 minutes)

- **Trigger:** No acknowledgment after 30 minutes
- **Notification:** PagerDuty escalates to page
- **Recipients:** Primary on-call engineer (now paged)
- **Action:** Acknowledge and triage within 15 minutes

### Level 3 (2-4 hours)

- **Trigger:** No resolution after 2 hours
- **Notification:** Email to manager
- **Recipients:** Engineering Manager
- **Action:** Manager assigns ticket, schedules investigation

### Level 4 (4+ hours / escalating)

- **Trigger:** Warning alert escalating to critical thresholds
- **Notification:** Upgrade to CRITICAL escalation path
- **Action:** Follow CRITICAL escalation policy

---

## INFO Alert Handling

- **Notification:** #stemedb-alerts-info Slack only (no pages)
- **Recipients:** Engineering team (optional monitoring)
- **Action:** No immediate action required. Review during business hours.

**Escalation:** INFO alerts do NOT escalate unless manually upgraded by on-call engineer.
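
The time-based levels in the CRITICAL and WARNING sections can be sketched as a small lookup. This is an illustration only, not the PagerDuty configuration: the names `Severity`, `LEVEL_START_MINUTES`, and `escalation_level` are hypothetical, thresholds are treated as inclusive (an assumption, since the policy says both "after 5 minutes" and "5-15 minutes"), and WARNING Level 4 is modeled only by its 4-hour heading, not by the threshold-based upgrade to CRITICAL. For example, under these assumptions `escalation_level(Severity.CRITICAL, 7)` gives Level 2, while the same alert already acknowledged stays at Level 1.

```python
from enum import Enum


class Severity(Enum):
    CRITICAL = "CRITICAL"
    WARNING = "WARNING"
    INFO = "INFO"


# Minutes since the alert fired at which Levels 1-4 begin, taken from the
# level headings above. INFO has no entry because it never auto-escalates.
LEVEL_START_MINUTES = {
    Severity.CRITICAL: [0, 5, 15, 30],
    Severity.WARNING: [0, 30, 120, 240],
}


def escalation_level(severity, minutes_elapsed, acknowledged=False, resolved=False):
    """Return the escalation level (1-4), or 0 when no escalation applies."""
    if minutes_elapsed < 0:
        raise ValueError("minutes_elapsed must be non-negative")
    if resolved or severity is Severity.INFO:
        return 0

    level = 1
    for candidate, start in enumerate(LEVEL_START_MINUTES[severity], start=1):
        # Assumption: a threshold is reached at exactly `start` minutes.
        if minutes_elapsed >= start:
            level = candidate

    # Level 2 is triggered by a missing acknowledgment, so an acknowledged
    # alert stays at Level 1 until the no-resolution threshold of Level 3.
    if level == 2 and acknowledged:
        return 1
    return level
```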

---

## Alert-Specific Escalation

### StemeDBAPIDown (CRITICAL)

| Time | Action | Owner |
|------|--------|-------|
| 0 min | Page on-call | Primary on-call |
| 2 min | Check runbook, verify API health | Primary on-call |
| 5 min | If not resolved, escalate to backup + manager | Backup on-call |
| 10 min | Engage AWS support if infrastructure issue | Manager |
| 15 min | Customer communication decision | Director |

### WALDiskNearlyFull (CRITICAL)

| Time | Action | Owner |
|------|--------|-------|
| 0 min | Page on-call | Primary on-call |
| 5 min | Run disk cleanup script | Primary on-call |
| 10 min | If cleanup insufficient, request disk resize | Primary on-call |
| 15 min | Escalate to infrastructure team | Manager |
| 20 min | Consider failover to replica with more disk | SRE lead |

### ReplicationLagCritical (CRITICAL)

| Time | Action | Owner |
|------|--------|-------|
| 0 min | Page on-call | Primary on-call |
| 5 min | Check network connectivity, peer health | Primary on-call |
| 10 min | Check disk I/O on lagging node (`iostat -x`) | Primary on-call |
| 15 min | If persistent, escalate to network team | Manager |
| 30 min | Consider force-resyncing peer | SRE lead |

### HighAPIErrorRate (WARNING)

| Time | Action | Owner |
|------|--------|-------|
| 0 min | Email on-call | Primary on-call |
| 30 min | Review logs for error patterns | Primary on-call |
| 1 hour | If rate increasing, upgrade to CRITICAL | Primary on-call |
| 2 hours | Create ticket, assign to team | Manager |

---

## Notification Channels by Severity

| Severity | PagerDuty | Slack | Email | SMS |
|----------|-----------|-------|-------|-----|
| CRITICAL | ✅ Page (high urgency) | ✅ @channel mention | ✅ All on-call | ✅ Primary only |
| WARNING | ✅ Email (low urgency) | ✅ @here mention | ✅ Primary on-call | ❌ |
| INFO | ❌ | ✅ No mentions | ❌ | ❌ |

---

## On-Call Rotation

### Primary On-Call

- **Shift length:** 1 week (Mon 9am - Mon 9am)
- **Response time:** <5 minutes for CRITICAL, <30 minutes
  for WARNING
- **Compensation:** 1 day PTO per week on-call + overtime pay for incidents
- **Handoff:** Monday morning standup

### Backup On-Call

- **Role:** Escalation point if primary unavailable
- **Response time:** <10 minutes for CRITICAL escalation
- **Compensation:** 0.5 day PTO per week backup

### Manager On-Call

- **Role:** Escalation point for Level 2+, coordination
- **Response time:** <15 minutes for escalated CRITICAL
- **Compensation:** Part of manager responsibilities

---

## Incident Response Workflow

```mermaid
graph TD
    A[Alert Fires] --> B{Severity?}
    B -->|CRITICAL| C[Page on-call]
    B -->|WARNING| D[Email on-call]
    B -->|INFO| E[Slack only]

    C --> F[Acknowledge within 5 min]
    F --> G[Follow runbook]
    G --> H{Resolved?}
    H -->|Yes| I[Mark resolved]
    H -->|No| J{Over 5 min?}
    J -->|Yes| K[Escalate Level 2]
    K --> L[Backup on-call + manager join]
    L --> M[Create incident channel]
    M --> N{Resolved?}
    N -->|Yes| I
    N -->|No| O{Over 15 min?}
    O -->|Yes| P[Escalate Level 3]
    P --> Q[Director + SRE lead join]
    Q --> R[Customer communication decision]

    D --> S[Acknowledge within 30 min]
    S --> T[Triage]
    T --> U{Escalating?}
    U -->|Yes| C
    U -->|No| V[Schedule fix]
```

---

## Post-Incident Review

Conduct a post-mortem after **every CRITICAL alert** and after any **WARNING alert lasting longer than 2 hours**.

### Template

- **Incident:** [Alert name + timestamp]
- **Duration:** [Time from alert to resolution]
- **Impact:** [Services affected, customer impact]
- **Root cause:** [Technical explanation]
- **Resolution:** [What fixed it]
- **Prevention:** [Action items to prevent recurrence]

### Review Meeting

- **Attendees:** On-call engineer(s), manager, affected team leads
- **Schedule:** Within 48 hours of incident
- **Duration:** 30-60 minutes
- **Output:** Action items assigned with due dates

### Metrics to Track

- **MTTA (Mean Time to Acknowledge):** Target <5 min for CRITICAL
- **MTTR (Mean Time to Resolve):** Target <30 min for CRITICAL
- **Alert accuracy:** % of alerts that required action (target >80%)
- **Escalation rate:** % of alerts that reached Level 2+
  (target <20%)

---

## Alert Tuning Process

### Quarterly Review

1. **Analyze alert volume** (past 90 days)
2. **Identify noisy alerts** (>5 firings/day, low action rate)
3. **Review thresholds** (adjust based on production baseline)
4. **Remove unused alerts** (0 firings in 90 days)
5. **Add new alerts** (based on incident learnings)

### Alert Hygiene Rules

- **Every CRITICAL alert** must have a runbook
- **Every alert** must have a defined action (not just FYI)
- **False positive rate** must be <10%
- **Alert must be actionable** by on-call without expert knowledge

---

## Contact Information

| Role | Primary | Backup | Email | Phone |
|------|---------|--------|-------|-------|
| On-Call Engineer | [Name] | [Name] | oncall@example.com | +1-XXX-XXX-XXXX |
| Engineering Manager | [Name] | [Name] | manager@example.com | +1-XXX-XXX-XXXX |
| SRE Lead | [Name] | [Name] | sre-lead@example.com | +1-XXX-XXX-XXXX |
| Engineering Director | [Name] | N/A | director@example.com | +1-XXX-XXX-XXXX |
| CTO | [Name] | N/A | cto@example.com | +1-XXX-XXX-XXXX |

**PagerDuty Schedules:** https://yourcompany.pagerduty.com/schedules

**Slack Channels:**

- Critical: #stemedb-alerts-critical
- Warning: #stemedb-alerts-warning
- Info: #stemedb-alerts-info
- Incident: #incident-YYYY-MM-DD-HH-MM (created on-demand)

**Runbook Repository:** https://docs.stemedb.com/operations/runbooks/

**Grafana Dashboards:** https://grafana.example.com/dashboards/stemedb

---

## Revision History

| Date | Version | Changes | Author |
|------|---------|---------|--------|
| 2026-02-11 | 1.0 | Initial escalation policy | AI Assistant |

**Review schedule:** Quarterly (every 3 months)
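
As a supplementary note to the Quarterly Review steps under Alert Tuning Process, the "noisy" and "unused" criteria could be checked with a sketch like the one below. The function name `classify_alert` and the constants are hypothetical. The policy does not define "low action rate", so this sketch assumes it means an action rate below the 80% alert-accuracy target; adjust that cutoff to whatever the team agrees on.

```python
REVIEW_WINDOW_DAYS = 90      # "past 90 days" from the Quarterly Review steps
NOISY_FIRINGS_PER_DAY = 5    # ">5 firings/day"
# Assumption: "low action rate" is read as below the 80% alert-accuracy target.
LOW_ACTION_RATE = 0.80


def classify_alert(firings, actioned, window_days=REVIEW_WINDOW_DAYS):
    """Classify one alert rule as 'remove', 'noisy', or 'keep'.

    firings  -- number of times the alert fired in the review window
    actioned -- number of those firings that required action
    """
    if firings < 0 or actioned < 0 or actioned > firings:
        raise ValueError("expected 0 <= actioned <= firings")
    if firings == 0:
        # Step 4: unused alerts (0 firings in the window) are removal candidates.
        return "remove"

    firings_per_day = firings / window_days
    action_rate = actioned / firings
    # Step 2: noisy means frequent firings combined with a low action rate.
    if firings_per_day > NOISY_FIRINGS_PER_DAY and action_rate < LOW_ACTION_RATE:
        return "noisy"
    return "keep"
```

Under these assumptions, an alert that fired 900 times in 90 days with 90 actioned firings is flagged as noisy, while one that never fired is flagged for removal.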