stemedb/docs/operations/monitoring/alerting/escalation-policy.md
jml 3e7eddc074 feat: add enterprise production readiness infrastructure
This commit implements comprehensive production hardening across multiple
layers to prepare StemeDB for enterprise pilot deployments:

## API Layer
- Add rate limiting middleware with configurable limits per endpoint
- Enhance error handling with detailed context and proper HTTP status codes
- Add security hardening tests for input validation and boundary conditions
- Create store_helpers module for defensive storage access patterns

## Storage & WAL
- Optimize group commit batching for higher throughput
- Add defensive error handling in hybrid backend with proper fallbacks
- Enhance WAL journal durability guarantees with fsync validation
- Improve index store query performance with better caching

## Operations & Deployment
- Add comprehensive operations documentation (deployment, monitoring, DR)
- Create systemd units for backup, WAL archival, and verification
- Add monitoring configs (Prometheus alerts, metrics exporters)
- Implement backup/restore scripts with verification and S3 archival
- Add DR drill automation and runbook procedures
- Create load balancer configs (nginx, envoy) with health checks

## Documentation
- Update CLAUDE.md with operations and troubleshooting guides
- Expand roadmap with production readiness milestones
- Add pilot success criteria and deployment reference architecture
- Document TLS setup, monitoring integration, and incident response

## Configuration
- Add .env.example with all required environment variables
- Document resource sizing for different deployment scales
- Add configuration examples for various deployment topologies

This positions StemeDB for successful enterprise pilots with proper
operational discipline, monitoring, backup/DR, and security hardening.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-12 06:08:15 +00:00

8.7 KiB

StemeDB Alert Escalation Policy

This document defines how StemeDB alerts escalate based on severity, response time, and notification channels.

Severity Levels

Severity Definition Response Time Notification
CRITICAL Service down, data loss risk, security breach Immediate (<5 min) PagerDuty (page) + Slack + Email
WARNING Service degraded, SLO at risk, capacity concern 30 minutes PagerDuty (email) + Slack
INFO Informational, audit trail, no action required Best effort Slack only

CRITICAL Alert Escalation

Level 1 (0-5 minutes)

  • Notification: PagerDuty page + #stemedb-alerts-critical Slack mention
  • Recipients: Primary on-call engineer
  • Action: Acknowledge alert in PagerDuty within 5 minutes

Level 2 (5-15 minutes)

  • Trigger: No acknowledgment after 5 minutes
  • Notification: PagerDuty page escalates to backup on-call + manager
  • Recipients: Backup on-call engineer, Engineering Manager
  • Action:
    • Backup on-call joins incident
    • Create incident channel: #incident-YYYY-MM-DD-HH-MM
    • Manager monitors for escalation needs

Level 3 (15-30 minutes)

  • Trigger: No resolution after 15 minutes
  • Notification: PagerDuty page escalates to director + SRE lead
  • Recipients: Engineering Director, SRE Lead, Product Lead
  • Action:
    • Director assesses need for customer communication
    • SRE lead coordinates with infrastructure teams
    • Consider engaging vendor support (AWS, etc.)

Level 4 (30+ minutes)

  • Trigger: Ongoing incident >30 minutes
  • Notification: Email to executive team
  • Recipients: CTO, VP Engineering, Customer Success
  • Action:
    • CTO decides on customer communication
    • Customer Success prepares incident notification
    • Schedule post-mortem review

WARNING Alert Escalation

Level 1 (0-30 minutes)

  • Notification: PagerDuty email + #stemedb-alerts-warning Slack
  • Recipients: Primary on-call engineer
  • Action: Review alert within 30 minutes, add to task backlog if non-urgent

Level 2 (30-120 minutes)

  • Trigger: No acknowledgment after 30 minutes
  • Notification: PagerDuty escalates to page
  • Recipients: Primary on-call engineer (now paged)
  • Action: Acknowledge and triage within 15 minutes

Level 3 (2-4 hours)

  • Trigger: No resolution after 2 hours
  • Notification: Email to manager
  • Recipients: Engineering Manager
  • Action: Manager assigns ticket, schedules investigation

Level 4 (4+ hours / escalating)

  • Trigger: Warning alert escalating to critical thresholds
  • Notification: Upgrade to CRITICAL escalation path
  • Action: Follow CRITICAL escalation policy

INFO Alert Handling

  • Notification: #stemedb-alerts-info Slack only (no pages)
  • Recipients: Engineering team (optional monitoring)
  • Action: No immediate action required. Review during business hours.

Escalation: INFO alerts do NOT escalate unless manually upgraded by on-call engineer.


Alert-Specific Escalation

StemeDBAPIDown (CRITICAL)

Time Action Owner
0 min Page on-call Primary on-call
2 min Check runbook, verify API health Primary on-call
5 min If not resolved, escalate to backup + manager Backup on-call
10 min Engage AWS support if infrastructure issue Manager
15 min Customer communication decision Director

WALDiskNearlyFull (CRITICAL)

Time Action Owner
0 min Page on-call Primary on-call
5 min Run disk cleanup script Primary on-call
10 min If cleanup insufficient, request disk resize Primary on-call
15 min Escalate to infrastructure team Manager
20 min Consider failover to replica with more disk SRE lead

ReplicationLagCritical (CRITICAL)

Time Action Owner
0 min Page on-call Primary on-call
5 min Check network connectivity, peer health Primary on-call
10 min Check disk I/O on lagging node (iostat -x) Primary on-call
15 min If persistent, escalate to network team Manager
30 min Consider force-resyncing peer SRE lead

HighAPIErrorRate (WARNING)

Time Action Owner
0 min Email on-call Primary on-call
30 min Review logs for error patterns Primary on-call
1 hour If rate increasing, upgrade to CRITICAL Primary on-call
2 hours Create ticket, assign to team Manager

Notification Channels by Severity

Severity PagerDuty Slack Email SMS
CRITICAL Page (high urgency) @channel mention All on-call Primary only
WARNING Email (low urgency) @here mention Primary on-call
INFO No mentions

On-Call Rotation

Primary On-Call

  • Shift length: 1 week (Mon 9am - Mon 9am)
  • Response time: <5 minutes for CRITICAL, <30 minutes for WARNING
  • Compensation: 1 day PTO per week on-call + overtime pay for incidents
  • Handoff: Monday morning standup

Backup On-Call

  • Role: Escalation point if primary unavailable
  • Response time: <10 minutes for CRITICAL escalation
  • Compensation: 0.5 day PTO per week backup

Manager On-Call

  • Role: Escalation point for Level 2+, coordination
  • Response time: <15 minutes for escalated CRITICAL
  • Compensation: Part of manager responsibilities

Incident Response Workflow

graph TD
    A[Alert Fires] --> B{Severity?}
    B -->|CRITICAL| C[Page on-call]
    B -->|WARNING| D[Email on-call]
    B -->|INFO| E[Slack only]

    C --> F[Acknowledge <5min]
    F --> G[Follow runbook]
    G --> H{Resolved?}
    H -->|Yes| I[Mark resolved]
    H -->|No| J{>15min?}

    J -->|Yes| K[Escalate Level 2]
    K --> L[Manager joins]
    L --> M[Create incident channel]
    M --> N{Resolved?}

    N -->|Yes| I
    N -->|No| O{>30min?}
    O -->|Yes| P[Escalate Level 3]
    P --> Q[Director + CTO join]
    Q --> R[Customer communication]

    D --> S[Acknowledge <30min]
    S --> T[Triage]
    T --> U{Escalating?}
    U -->|Yes| C
    U -->|No| V[Schedule fix]

Post-Incident Review

After all CRITICAL alerts and WARNING alerts >2 hours, conduct post-mortem:

Template

Incident: [Alert name + timestamp] Duration: [Time from alert to resolution] Impact: [Services affected, customer impact] Root cause: [Technical explanation] Resolution: [What fixed it] Prevention: [Action items to prevent recurrence]

Review Meeting

  • Attendees: On-call engineer(s), manager, affected team leads
  • Schedule: Within 48 hours of incident
  • Duration: 30-60 minutes
  • Output: Action items assigned with due dates

Metrics to Track

  • MTTA (Mean Time to Acknowledge): Target <5 min for CRITICAL
  • MTTR (Mean Time to Resolve): Target <30 min for CRITICAL
  • Alert accuracy: % of alerts that required action (target >80%)
  • Escalation rate: % of alerts that reached Level 2+ (target <20%)

Alert Tuning Process

Quarterly Review

  1. Analyze alert volume (past 90 days)
  2. Identify noisy alerts (>5 firings/day, low action rate)
  3. Review thresholds (adjust based on production baseline)
  4. Remove unused alerts (0 firings in 90 days)
  5. Add new alerts (based on incident learnings)

Alert Hygiene Rules

  • Every CRITICAL alert must have a runbook
  • Every alert must have a defined action (not just FYI)
  • False positive rate must be <10%
  • Alert must be actionable by on-call without expert knowledge

Contact Information

Role Primary Backup Email Phone
On-Call Engineer [Name] [Name] oncall@example.com +1-XXX-XXX-XXXX
Engineering Manager [Name] [Name] manager@example.com +1-XXX-XXX-XXXX
SRE Lead [Name] [Name] sre-lead@example.com +1-XXX-XXX-XXXX
Engineering Director [Name] director@example.com +1-XXX-XXX-XXXX
CTO [Name] cto@example.com +1-XXX-XXX-XXXX

PagerDuty Schedules: https://yourcompany.pagerduty.com/schedules

Slack Channels:

  • Critical: #stemedb-alerts-critical
  • Warning: #stemedb-alerts-warning
  • Info: #stemedb-alerts-info
  • Incident: #incident-YYYY-MM-DD-HH-MM (created on-demand)

Runbook Repository: https://docs.stemedb.com/operations/runbooks/

Grafana Dashboards: https://grafana.example.com/dashboards/stemedb


Revision History

Date Version Changes Author
2026-02-11 1.0 Initial escalation policy AI Assistant

Review schedule: Quarterly (every 3 months)