
# StemeDB Alert Escalation Policy

This document defines how StemeDB alerts escalate: the severity levels, the response time expected at each level, and the notification channels used to reach responders.

## Severity Levels

| Severity | Definition | Response Time | Notification |
|----------|------------|---------------|--------------|
| **CRITICAL** | Service down, data loss risk, security breach | Immediate (<5 min) | PagerDuty (page) + Slack + Email |
| **WARNING** | Service degraded, SLO at risk, capacity concern | 30 minutes | PagerDuty (email) + Slack |
| **INFO** | Informational, audit trail, no action required | Best effort | Slack only |

---

## CRITICAL Alert Escalation

### Level 1 (0-5 minutes)
- **Notification:** PagerDuty page + #stemedb-alerts-critical Slack mention
- **Recipients:** Primary on-call engineer
- **Action:** Acknowledge alert in PagerDuty within 5 minutes

### Level 2 (5-15 minutes)
- **Trigger:** No acknowledgment after 5 minutes
- **Notification:** PagerDuty page escalates to backup on-call + manager
- **Recipients:** Backup on-call engineer, Engineering Manager
- **Action:**
  - Backup on-call joins incident
  - Create incident channel: `#incident-YYYY-MM-DD-HH-MM`
  - Manager monitors for escalation needs
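
The incident channel name follows a fixed pattern, so it can be generated rather than typed by hand. A minimal sketch, assuming timestamps are normalized to UTC (the policy does not state a timezone) and that `incident_channel_name` is a hypothetical helper, not an existing tool:

```python
from datetime import datetime, timezone


def incident_channel_name(opened_at: datetime) -> str:
    """Build a channel name in the form #incident-YYYY-MM-DD-HH-MM.

    Assumption: `opened_at` is timezone-aware and is converted to UTC.
    """
    utc_time = opened_at.astimezone(timezone.utc)
    return "#incident-" + utc_time.strftime("%Y-%m-%d-%H-%M")
```

For example, an incident opened at 14:05 UTC on 2026-02-11 would map to `#incident-2026-02-11-14-05`.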

### Level 3 (15-30 minutes)
- **Trigger:** No resolution after 15 minutes
- **Notification:** PagerDuty page escalates to director + SRE lead
- **Recipients:** Engineering Director, SRE Lead, Product Lead
- **Action:**
  - Director assesses need for customer communication
  - SRE lead coordinates with infrastructure teams
  - Consider engaging vendor support (AWS, etc.)

### Level 4 (30+ minutes)
- **Trigger:** Ongoing incident >30 minutes
- **Notification:** Email to executive team
- **Recipients:** CTO, VP Engineering, Customer Success
- **Action:**
  - CTO decides on customer communication
  - Customer Success prepares incident notification
  - Schedule post-mortem review
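
The four CRITICAL levels reduce to a small decision function. The sketch below is illustrative only: `critical_escalation_level` is a hypothetical name, and treating the 15 and 30 minute boundaries as inclusive is an assumption, since the policy gives ranges rather than exact cut-offs.

```python
def critical_escalation_level(elapsed_min: float, acknowledged: bool, resolved: bool) -> int:
    """Return the CRITICAL escalation level (1-4), or 0 once the alert is resolved."""
    if resolved:
        return 0
    if elapsed_min >= 30:
        return 4  # ongoing incident: email the executive team
    if elapsed_min >= 15:
        return 3  # no resolution after 15 minutes: director + SRE lead
    if elapsed_min >= 5 and not acknowledged:
        return 2  # no acknowledgment after 5 minutes: backup on-call + manager
    return 1      # primary on-call is handling the page
```

Note that Level 2 depends on acknowledgment, while Levels 3 and 4 depend only on how long the incident has been open.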

---

## WARNING Alert Escalation

### Level 1 (0-30 minutes)
- **Notification:** PagerDuty email + #stemedb-alerts-warning Slack
- **Recipients:** Primary on-call engineer
- **Action:** Review the alert within 30 minutes; add it to the task backlog if it is non-urgent

### Level 2 (30-120 minutes)
- **Trigger:** No acknowledgment after 30 minutes
- **Notification:** PagerDuty escalates to page
- **Recipients:** Primary on-call engineer (now paged)
- **Action:** Acknowledge and triage within 15 minutes

### Level 3 (2-4 hours)
- **Trigger:** No resolution after 2 hours
- **Notification:** Email to manager
- **Recipients:** Engineering Manager
- **Action:** Manager assigns ticket, schedules investigation

### Level 4 (4+ hours / escalating)
- **Trigger:** WARNING alert still open after 4 hours, or its metric is approaching CRITICAL thresholds
- **Notification:** Upgrade to CRITICAL escalation path
- **Action:** Follow CRITICAL escalation policy
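
The WARNING path differs from the CRITICAL one mainly in that it can hand off to the CRITICAL policy. A rough sketch of that precedence, assuming the Level 4 heading means "open for 4 hours or metric reaching CRITICAL thresholds"; the function name and the returned labels are hypothetical:

```python
def warning_escalation_step(
    elapsed_min: float,
    acknowledged: bool,
    resolved: bool,
    at_critical_threshold: bool,
) -> str:
    """Return a label for the current step of the WARNING escalation path."""
    if resolved:
        return "resolved"
    # Level 4 takes precedence: from here the CRITICAL policy applies.
    if at_critical_threshold or elapsed_min >= 240:
        return "upgrade-to-critical"
    if elapsed_min >= 120:
        return "level-3-email-manager"
    if elapsed_min >= 30 and not acknowledged:
        return "level-2-page-primary"
    return "level-1-email-primary"
```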

---

## INFO Alert Handling

- **Notification:** #stemedb-alerts-info Slack only (no pages)
- **Recipients:** Engineering team (optional monitoring)
- **Action:** No immediate action required. Review during business hours.

**Escalation:** INFO alerts do NOT escalate unless manually upgraded by the on-call engineer.

---

## Alert-Specific Escalation

### StemeDBAPIDown (CRITICAL)

| Time | Action | Owner |
|------|--------|-------|
| 0 min | Page on-call | Primary on-call |
| 2 min | Check runbook, verify API health | Primary on-call |
| 5 min | If not resolved, escalate to backup + manager | Backup on-call |
| 10 min | Engage AWS support if infrastructure issue | Manager |
| 15 min | Customer communication decision | Director |

### WALDiskNearlyFull (CRITICAL)

| Time | Action | Owner |
|------|--------|-------|
| 0 min | Page on-call | Primary on-call |
| 5 min | Run disk cleanup script | Primary on-call |
| 10 min | If cleanup insufficient, request disk resize | Primary on-call |
| 15 min | Escalate to infrastructure team | Manager |
| 20 min | Consider failover to replica with more disk | SRE lead |

### ReplicationLagCritical (CRITICAL)

| Time | Action | Owner |
|------|--------|-------|
| 0 min | Page on-call | Primary on-call |
| 5 min | Check network connectivity, peer health | Primary on-call |
| 10 min | Check disk I/O on lagging node (`iostat -x`) | Primary on-call |
| 15 min | If persistent, escalate to network team | Manager |
| 30 min | Consider force-resyncing peer | SRE lead |

### HighAPIErrorRate (WARNING)

| Time | Action | Owner |
|------|--------|-------|
| 0 min | Email on-call | Primary on-call |
| 30 min | Review logs for error patterns | Primary on-call |
| 1 hour | If rate increasing, upgrade to CRITICAL | Primary on-call |
| 2 hours | Create ticket, assign to team | Manager |
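
The per-alert timelines above are plain data, so tooling could answer "what should already have happened?" and "what comes next?" during an incident. A minimal sketch using the WALDiskNearlyFull table; the `Step` type and both helper functions are hypothetical names introduced for illustration:

```python
from typing import NamedTuple, Optional


class Step(NamedTuple):
    at_min: int   # minutes after the alert fires
    action: str
    owner: str


# Transcribed from the WALDiskNearlyFull table in this document.
WAL_DISK_NEARLY_FULL = [
    Step(0, "Page on-call", "Primary on-call"),
    Step(5, "Run disk cleanup script", "Primary on-call"),
    Step(10, "If cleanup insufficient, request disk resize", "Primary on-call"),
    Step(15, "Escalate to infrastructure team", "Manager"),
    Step(20, "Consider failover to replica with more disk", "SRE lead"),
]


def steps_due(timeline: list[Step], elapsed_min: float) -> list[Step]:
    """Steps whose scheduled time has already been reached."""
    return [step for step in timeline if step.at_min <= elapsed_min]


def next_step(timeline: list[Step], elapsed_min: float) -> Optional[Step]:
    """The earliest step still in the future, or None if the timeline is exhausted."""
    upcoming = [step for step in timeline if step.at_min > elapsed_min]
    return min(upcoming, key=lambda step: step.at_min) if upcoming else None
```

Twelve minutes into a WALDiskNearlyFull incident, `steps_due` returns the first three rows and `next_step` returns the 15 minute escalation owned by the manager.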

---

## Notification Channels by Severity

| Severity | PagerDuty | Slack | Email | SMS |
|----------|-----------|-------|-------|-----|
| CRITICAL | ✅ Page (high urgency) | ✅ @channel mention | ✅ All on-call | ✅ Primary only |
| WARNING | ✅ Email (low urgency) | ✅ @here mention | ✅ Primary on-call | ❌ |
| INFO | ❌ | ✅ No mentions | ❌ | ❌ |
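
The channel matrix can also be kept as a lookup table so routing code and this document stay aligned. The sketch below mirrors the table above; the dictionary layout, key names, and `channels_for` are assumptions, not an existing StemeDB or PagerDuty interface:

```python
from typing import Optional

# None means the channel is not used for that severity.
NOTIFICATION_MATRIX: dict[str, dict[str, Optional[str]]] = {
    "CRITICAL": {
        "pagerduty": "page (high urgency)",
        "slack": "@channel mention",
        "email": "all on-call",
        "sms": "primary only",
    },
    "WARNING": {
        "pagerduty": "email (low urgency)",
        "slack": "@here mention",
        "email": "primary on-call",
        "sms": None,
    },
    "INFO": {
        "pagerduty": None,
        "slack": "no mentions",
        "email": None,
        "sms": None,
    },
}


def channels_for(severity: str) -> list[str]:
    """Return the channels enabled for a severity, in table column order."""
    row = NOTIFICATION_MATRIX[severity.upper()]
    return [channel for channel, mode in row.items() if mode is not None]
```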

---

## On-Call Rotation

### Primary On-Call
- **Shift length:** 1 week (Monday 9am to the following Monday 9am)
- **Response time:** <5 minutes for CRITICAL, <30 minutes for WARNING
- **Compensation:** 1 day PTO per week on-call + overtime pay for incidents
- **Handoff:** Monday morning standup
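
Shift boundaries follow directly from the Monday 9am handoff. A small sketch for finding the current shift window; `shift_window` is a hypothetical helper, and it assumes `now` is already expressed in the team's local timezone, which this policy does not specify:

```python
from datetime import datetime, timedelta


def shift_window(now: datetime) -> tuple[datetime, datetime]:
    """Return (start, end) of the on-call shift containing `now`.

    A shift starts on Monday at 09:00 and ends 7 days later.
    """
    monday = now - timedelta(days=now.weekday())  # weekday() is 0 on Monday
    start = monday.replace(hour=9, minute=0, second=0, microsecond=0)
    if start > now:
        # Monday before 09:00 still belongs to the previous week's shift.
        start -= timedelta(days=7)
    return start, start + timedelta(days=7)
```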

### Backup On-Call
- **Role:** Escalation point if primary unavailable
- **Response time:** <10 minutes for CRITICAL escalation
- **Compensation:** 0.5 day PTO per week backup

### Manager On-Call
- **Role:** Escalation point for Level 2+, coordination
- **Response time:** <15 minutes for escalated CRITICAL
- **Compensation:** Part of manager responsibilities

---

## Incident Response Workflow

```mermaid
graph TD
    A[Alert fires] --> B{Severity?}
    B -->|CRITICAL| C[Page primary on-call]
    B -->|WARNING| D[Email primary on-call]
    B -->|INFO| E[Slack only]

    C --> F{"Acknowledged within 5 min?"}
    F -->|Yes| G[Follow runbook]
    F -->|No| K["Level 2: page backup on-call + manager"]
    K --> M[Create incident channel]
    M --> G
    G --> H{Resolved?}
    H -->|Yes| I[Mark resolved]
    H -->|No| J{"Open 15 min or more?"}
    J -->|No| G
    J -->|Yes| P["Level 3: page director + SRE lead"]
    P --> N{Resolved?}
    N -->|Yes| I
    N -->|No| O{"Open 30 min or more?"}
    O -->|No| N
    O -->|Yes| Q["Level 4: email executive team"]
    Q --> R[CTO decides on customer communication]

    D --> S["Acknowledge within 30 min"]
    S --> T[Triage]
    T --> U{Escalating?}
    U -->|Yes| C
    U -->|No| V[Schedule fix]
```

---

## Post-Incident Review

After **every CRITICAL alert** and **every WARNING alert that stays open longer than 2 hours**, conduct a post-mortem:

### Template

- **Incident:** [Alert name + timestamp]
- **Duration:** [Time from alert to resolution]
- **Impact:** [Services affected, customer impact]
- **Root cause:** [Technical explanation]
- **Resolution:** [What fixed it]
- **Prevention:** [Action items to prevent recurrence]

### Review Meeting

- **Attendees:** On-call engineer(s), manager, affected team leads
- **Schedule:** Within 48 hours of incident
- **Duration:** 30-60 minutes
- **Output:** Action items assigned with due dates

### Metrics to Track

- **MTTA (Mean Time to Acknowledge):** Target <5 min for CRITICAL
- **MTTR (Mean Time to Resolve):** Target <30 min for CRITICAL
- **Alert accuracy:** % of alerts that required action (target >80%)
- **Escalation rate:** % of alerts that reached Level 2+ (target <20%)
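
These four metrics can be computed from a simple export of alert records. The following is a sketch under stated assumptions: the `AlertRecord` fields are hypothetical and would need to be mapped from whatever the incident tooling actually exports.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class AlertRecord:
    """One fired alert; durations are minutes measured from when it fired."""
    minutes_to_ack: Optional[float]      # None if never acknowledged
    minutes_to_resolve: Optional[float]  # None if still open
    required_action: bool                # did a responder have to do something?
    max_level: int                       # highest escalation level reached


def review_metrics(records: list[AlertRecord]) -> dict:
    """Compute MTTA, MTTR, alert accuracy, and escalation rate."""
    if not records:
        return {}
    ack_times = [r.minutes_to_ack for r in records if r.minutes_to_ack is not None]
    resolve_times = [r.minutes_to_resolve for r in records if r.minutes_to_resolve is not None]
    total = len(records)
    return {
        "mtta_min": mean(ack_times) if ack_times else None,
        "mttr_min": mean(resolve_times) if resolve_times else None,
        "alert_accuracy": sum(r.required_action for r in records) / total,
        "escalation_rate": sum(r.max_level >= 2 for r in records) / total,
    }
```

The results can then be compared against the targets listed above (for CRITICAL alerts, MTTA under 5 minutes and MTTR under 30 minutes).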

---

## Alert Tuning Process

### Quarterly Review

1. **Analyze alert volume** (past 90 days)
2. **Identify noisy alerts** (>5 firings/day, low action rate)
3. **Review thresholds** (adjust based on production baseline)
4. **Remove unused alerts** (0 firings in 90 days)
5. **Add new alerts** (based on incident learnings)
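
Steps 2 and 4 of the review lend themselves to a quick script over the 90 day firing counts. A minimal sketch: the input shape and `classify_alerts` are hypothetical, and the 50% cut-off for "low action rate" is an assumed placeholder because the policy does not define one.

```python
def classify_alerts(
    stats: dict[str, tuple[int, int]],
    window_days: int = 90,
    noisy_per_day: float = 5.0,
    low_action_rate: float = 0.5,  # assumption: the policy gives no number
) -> dict[str, str]:
    """Label each alert as 'unused', 'noisy', or 'ok'.

    `stats` maps alert name -> (firings, firings_that_required_action)
    over the review window.
    """
    labels = {}
    for name, (firings, actioned) in stats.items():
        if firings == 0:
            labels[name] = "unused"  # candidate for removal
        elif firings / window_days > noisy_per_day and actioned / firings < low_action_rate:
            labels[name] = "noisy"   # candidate for threshold tuning
        else:
            labels[name] = "ok"
    return labels
```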

### Alert Hygiene Rules

- **Every CRITICAL alert** must have a runbook
- **Every alert** must have a defined action (not just FYI)
- **False positive rate** must be <10%
- **Every alert** must be actionable by the on-call engineer without expert knowledge
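
The first three hygiene rules are mechanical enough to check automatically, for example in CI against alert definitions. A hedged sketch: the dictionary keys (`severity`, `runbook_url`, `action`, `false_positive_rate`) are assumed field names, not a real StemeDB or Prometheus schema.

```python
def hygiene_violations(alert: dict) -> list[str]:
    """Return the hygiene rules that an alert definition breaks."""
    problems = []
    if alert.get("severity") == "CRITICAL" and not alert.get("runbook_url"):
        problems.append("CRITICAL alert has no runbook")
    if not alert.get("action"):
        problems.append("alert has no defined action")
    fp_rate = alert.get("false_positive_rate")
    # The rate is only checked when it has been measured.
    if fp_rate is not None and fp_rate >= 0.10:
        problems.append("false positive rate is 10% or higher")
    return problems
```

The fourth rule (actionable without expert knowledge) is a judgment call and is better left to the quarterly review.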

---

## Contact Information

| Role | Primary | Backup | Email | Phone |
|------|---------|--------|-------|-------|
| On-Call Engineer | [Name] | [Name] | oncall@example.com | +1-XXX-XXX-XXXX |
| Engineering Manager | [Name] | [Name] | manager@example.com | +1-XXX-XXX-XXXX |
| SRE Lead | [Name] | [Name] | sre-lead@example.com | +1-XXX-XXX-XXXX |
| Engineering Director | [Name] | — | director@example.com | +1-XXX-XXX-XXXX |
| CTO | [Name] | — | cto@example.com | +1-XXX-XXX-XXXX |

**PagerDuty Schedules:** https://yourcompany.pagerduty.com/schedules

**Slack Channels:**
- Critical: #stemedb-alerts-critical
- Warning: #stemedb-alerts-warning
- Info: #stemedb-alerts-info
- Incident: #incident-YYYY-MM-DD-HH-MM (created on-demand)

**Runbook Repository:** https://docs.stemedb.com/operations/runbooks/

**Grafana Dashboards:** https://grafana.example.com/dashboards/stemedb

---

## Revision History

| Date | Version | Changes | Author |
|------|---------|---------|--------|
| 2026-02-11 | 1.0 | Initial escalation policy | AI Assistant |

**Review schedule:** Quarterly (every 3 months)