This commit implements comprehensive production hardening across multiple layers to prepare StemeDB for enterprise pilot deployments:

## API Layer
- Add rate limiting middleware with configurable limits per endpoint
- Enhance error handling with detailed context and proper HTTP status codes
- Add security hardening tests for input validation and boundary conditions
- Create store_helpers module for defensive storage access patterns

## Storage & WAL
- Optimize group commit batching for higher throughput
- Add defensive error handling in hybrid backend with proper fallbacks
- Enhance WAL journal durability guarantees with fsync validation
- Improve index store query performance with better caching

## Operations & Deployment
- Add comprehensive operations documentation (deployment, monitoring, DR)
- Create systemd units for backup, WAL archival, and verification
- Add monitoring configs (Prometheus alerts, metrics exporters)
- Implement backup/restore scripts with verification and S3 archival
- Add DR drill automation and runbook procedures
- Create load balancer configs (nginx, envoy) with health checks

## Documentation
- Update CLAUDE.md with operations and troubleshooting guides
- Expand roadmap with production readiness milestones
- Add pilot success criteria and deployment reference architecture
- Document TLS setup, monitoring integration, and incident response

## Configuration
- Add .env.example with all required environment variables
- Document resource sizing for different deployment scales
- Add configuration examples for various deployment topologies

This positions StemeDB for successful enterprise pilots with proper operational discipline, monitoring, backup/DR, and security hardening.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

# StemeDB Alert Escalation Policy

This document defines how StemeDB alerts escalate based on severity, response time, and notification channels.

## Severity Levels

| Severity | Definition | Response Time | Notification |
|----------|------------|---------------|--------------|
| **CRITICAL** | Service down, data loss risk, security breach | Immediate (<5 min) | PagerDuty (page) + Slack + Email |
| **WARNING** | Service degraded, SLO at risk, capacity concern | 30 minutes | PagerDuty (email) + Slack |
| **INFO** | Informational, audit trail, no action required | Best effort | Slack only |


---

## CRITICAL Alert Escalation

### Level 1 (0-5 minutes)
- **Notification:** PagerDuty page + #stemedb-alerts-critical Slack mention
- **Recipients:** Primary on-call engineer
- **Action:** Acknowledge alert in PagerDuty within 5 minutes

### Level 2 (5-15 minutes)
- **Trigger:** No acknowledgment after 5 minutes
- **Notification:** PagerDuty page escalates to backup on-call + manager
- **Recipients:** Backup on-call engineer, Engineering Manager
- **Action:**
  - Backup on-call joins incident
  - Create incident channel: `#incident-YYYY-MM-DD-HH-MM`
  - Manager monitors for escalation needs

### Level 3 (15-30 minutes)
- **Trigger:** No resolution after 15 minutes
- **Notification:** PagerDuty page escalates to director + SRE lead
- **Recipients:** Engineering Director, SRE Lead, Product Lead
- **Action:**
  - Director assesses need for customer communication
  - SRE lead coordinates with infrastructure teams
  - Consider engaging vendor support (AWS, etc.)

### Level 4 (30+ minutes)
- **Trigger:** Ongoing incident >30 minutes
- **Notification:** Email to executive team
- **Recipients:** CTO, VP Engineering, Customer Success
- **Action:**
  - CTO decides on customer communication
  - Customer Success prepares incident notification
  - Schedule post-mortem review

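
The CRITICAL levels above can be read as a small decision function. The following is a minimal sketch, not part of StemeDB or PagerDuty: the name `critical_escalation_level` and its parameters are hypothetical, and it assumes the time boundaries are inclusive and that an acknowledged alert stays at Level 1 until the 15-minute resolution trigger.

```python
def critical_escalation_level(minutes_elapsed: float, acknowledged: bool) -> int:
    """Return the escalation level (1-4) for an unresolved CRITICAL alert.

    Hypothetical helper: thresholds come from the policy text, but the
    inclusive boundaries (>= 5, >= 15, >= 30) are an assumption.
    """
    if minutes_elapsed < 0:
        raise ValueError("minutes_elapsed must be non-negative")
    if minutes_elapsed >= 30:
        return 4  # ongoing incident: executive team is emailed
    if minutes_elapsed >= 15:
        return 3  # no resolution: director + SRE lead are paged
    if minutes_elapsed >= 5 and not acknowledged:
        return 2  # no acknowledgment: backup on-call + manager are paged
    return 1      # primary on-call engineer owns the alert
```

For example, `critical_escalation_level(7, acknowledged=False)` returns 2, while the same call with `acknowledged=True` returns 1 because the Level 2 trigger depends on a missing acknowledgment.
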

---

## WARNING Alert Escalation

### Level 1 (0-30 minutes)
- **Notification:** PagerDuty email + #stemedb-alerts-warning Slack
- **Recipients:** Primary on-call engineer
- **Action:** Review alert within 30 minutes, add to task backlog if non-urgent

### Level 2 (30-120 minutes)
- **Trigger:** No acknowledgment after 30 minutes
- **Notification:** PagerDuty escalates to page
- **Recipients:** Primary on-call engineer (now paged)
- **Action:** Acknowledge and triage within 15 minutes

### Level 3 (2-4 hours)
- **Trigger:** No resolution after 2 hours
- **Notification:** Email to manager
- **Recipients:** Engineering Manager
- **Action:** Manager assigns ticket, schedules investigation

### Level 4 (4+ hours / escalating)
- **Trigger:** Warning alert escalating to critical thresholds
- **Notification:** Upgrade to CRITICAL escalation path
- **Action:** Follow CRITICAL escalation policy

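
The WARNING path differs from the CRITICAL one mainly in its hand-off: once an alert approaches critical thresholds, it leaves the WARNING ladder entirely. A possible sketch of that behavior follows. `WarningStep`, `warning_step`, and the `at_critical_threshold` flag are hypothetical names, and checking the upgrade condition before any time-based rule is an assumption about how Level 4 is meant to apply.

```python
from enum import Enum


class WarningStep(Enum):
    REVIEW = "Level 1: primary on-call reviews (PagerDuty email + Slack)"
    PAGE_PRIMARY = "Level 2: primary on-call is paged"
    EMAIL_MANAGER = "Level 3: manager assigns ticket, schedules investigation"
    UPGRADE_TO_CRITICAL = "Level 4: follow CRITICAL escalation policy"


def warning_step(minutes_elapsed: float, acknowledged: bool,
                 at_critical_threshold: bool) -> WarningStep:
    """Pick the current step for an unresolved WARNING alert."""
    # Assumption: an alert nearing critical thresholds is upgraded
    # immediately, regardless of how long it has been open.
    if at_critical_threshold:
        return WarningStep.UPGRADE_TO_CRITICAL
    if minutes_elapsed >= 120:
        return WarningStep.EMAIL_MANAGER   # no resolution after 2 hours
    if minutes_elapsed >= 30 and not acknowledged:
        return WarningStep.PAGE_PRIMARY    # no acknowledgment after 30 minutes
    return WarningStep.REVIEW
```
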

---

## INFO Alert Handling

- **Notification:** #stemedb-alerts-info Slack only (no pages)
- **Recipients:** Engineering team (optional monitoring)
- **Action:** No immediate action required. Review during business hours.

**Escalation:** INFO alerts do NOT escalate unless manually upgraded by the on-call engineer.


---

## Alert-Specific Escalation

### StemeDBAPIDown (CRITICAL)

| Time | Action | Owner |
|------|--------|-------|
| 0 min | Page on-call | Primary on-call |
| 2 min | Check runbook, verify API health | Primary on-call |
| 5 min | If not resolved, escalate to backup + manager | Backup on-call |
| 10 min | Engage AWS support if infrastructure issue | Manager |
| 15 min | Customer communication decision | Director |

### WALDiskNearlyFull (CRITICAL)

| Time | Action | Owner |
|------|--------|-------|
| 0 min | Page on-call | Primary on-call |
| 5 min | Run disk cleanup script | Primary on-call |
| 10 min | If cleanup insufficient, request disk resize | Primary on-call |
| 15 min | Escalate to infrastructure team | Manager |
| 20 min | Consider failover to replica with more disk | SRE lead |

### ReplicationLagCritical (CRITICAL)

| Time | Action | Owner |
|------|--------|-------|
| 0 min | Page on-call | Primary on-call |
| 5 min | Check network connectivity, peer health | Primary on-call |
| 10 min | Check disk I/O on lagging node (`iostat -x`) | Primary on-call |
| 15 min | If persistent, escalate to network team | Manager |
| 30 min | Consider force-resyncing peer | SRE lead |

### HighAPIErrorRate (WARNING)

| Time | Action | Owner |
|------|--------|-------|
| 0 min | Email on-call | Primary on-call |
| 30 min | Review logs for error patterns | Primary on-call |
| 1 hour | If rate increasing, upgrade to CRITICAL | Primary on-call |
| 2 hours | Create ticket, assign to team | Manager |

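
The per-alert timelines above are plain data, so an incident bot or checklist tool could look up which steps are due at a given moment. The sketch below illustrates one way to do that for two of the alerts. `TIMELINES` and `due_steps` are hypothetical names, and times are normalized to minutes (so "1 hour" becomes 60).

```python
# Each entry is (minutes after the alert fires, action, owner),
# copied from the StemeDBAPIDown and HighAPIErrorRate tables.
TIMELINES = {
    "StemeDBAPIDown": [
        (0, "Page on-call", "Primary on-call"),
        (2, "Check runbook, verify API health", "Primary on-call"),
        (5, "If not resolved, escalate to backup + manager", "Backup on-call"),
        (10, "Engage AWS support if infrastructure issue", "Manager"),
        (15, "Customer communication decision", "Director"),
    ],
    "HighAPIErrorRate": [
        (0, "Email on-call", "Primary on-call"),
        (30, "Review logs for error patterns", "Primary on-call"),
        (60, "If rate increasing, upgrade to CRITICAL", "Primary on-call"),
        (120, "Create ticket, assign to team", "Manager"),
    ],
}


def due_steps(alert_name: str, minutes_elapsed: float) -> list:
    """Return every timeline step whose start time has been reached."""
    return [step for step in TIMELINES[alert_name] if step[0] <= minutes_elapsed]
```

Six minutes into a `StemeDBAPIDown` incident, `due_steps("StemeDBAPIDown", 6)` returns the first three rows of its table.
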

---

## Notification Channels by Severity

| Severity | PagerDuty | Slack | Email | SMS |
|----------|-----------|-------|-------|-----|
| CRITICAL | ✅ Page (high urgency) | ✅ @channel mention | ✅ All on-call | ✅ Primary only |
| WARNING | ✅ Email (low urgency) | ✅ @here mention | ✅ Primary on-call | ❌ |
| INFO | ❌ | ✅ No mentions | ❌ | ❌ |

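
If the matrix above is kept as data next to the alerting configuration, routing code can derive the active channels for a severity instead of hard-coding them. This is an illustrative sketch only: `NOTIFICATION_MATRIX` and `channels_for` are hypothetical names, and `None` stands in for the ❌ cells.

```python
# Mirrors the table above; None means the channel is not used.
NOTIFICATION_MATRIX = {
    "CRITICAL": {
        "pagerduty": "page (high urgency)",
        "slack": "@channel mention",
        "email": "all on-call",
        "sms": "primary only",
    },
    "WARNING": {
        "pagerduty": "email (low urgency)",
        "slack": "@here mention",
        "email": "primary on-call",
        "sms": None,
    },
    "INFO": {
        "pagerduty": None,
        "slack": "no mentions",
        "email": None,
        "sms": None,
    },
}


def channels_for(severity: str) -> dict:
    """Return only the channels that are enabled for the given severity."""
    row = NOTIFICATION_MATRIX[severity.upper()]
    return {channel: mode for channel, mode in row.items() if mode is not None}
```
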

---

## On-Call Rotation

### Primary On-Call
- **Shift length:** 1 week (Mon 9am - Mon 9am)
- **Response time:** <5 minutes for CRITICAL, <30 minutes for WARNING
- **Compensation:** 1 day PTO per week on-call + overtime pay for incidents
- **Handoff:** Monday morning standup

### Backup On-Call
- **Role:** Escalation point if primary unavailable
- **Response time:** <10 minutes for CRITICAL escalation
- **Compensation:** 0.5 day PTO per week backup

### Manager On-Call
- **Role:** Escalation point for Level 2+, coordination
- **Response time:** <15 minutes for escalated CRITICAL
- **Compensation:** Part of manager responsibilities


---

## Incident Response Workflow

```mermaid
graph TD
    A[Alert Fires] --> B{Severity?}
    B -->|CRITICAL| C[Page on-call]
    B -->|WARNING| D[Email on-call]
    B -->|INFO| E[Slack only]

    C --> F{Acknowledged <5min?}
    F -->|Yes| G[Follow runbook]
    F -->|No| K[Escalate Level 2]
    K --> L[Backup on-call + manager join]
    L --> M[Create incident channel]
    M --> G
    G --> H{Resolved?}
    H -->|Yes| I[Mark resolved]
    H -->|No| J{>15min?}

    J -->|No| G
    J -->|Yes| P[Escalate Level 3]
    P --> Q[Director + SRE lead join]
    Q --> N{Resolved?}
    N -->|Yes| I
    N -->|No| O{>30min?}
    O -->|No| Q
    O -->|Yes| W[Escalate Level 4]
    W --> R[CTO decides on customer communication]

    D --> S[Acknowledge <30min]
    S --> T[Triage]
    T --> U{Escalating?}
    U -->|Yes| C
    U -->|No| V[Schedule fix]
```


---

## Post-Incident Review

After **all CRITICAL alerts** and **WARNING alerts lasting more than 2 hours**, conduct a post-mortem.

### Template

- **Incident:** [Alert name + timestamp]
- **Duration:** [Time from alert to resolution]
- **Impact:** [Services affected, customer impact]
- **Root cause:** [Technical explanation]
- **Resolution:** [What fixed it]
- **Prevention:** [Action items to prevent recurrence]

### Review Meeting

- **Attendees:** On-call engineer(s), manager, affected team leads
- **Schedule:** Within 48 hours of incident
- **Duration:** 30-60 minutes
- **Output:** Action items assigned with due dates

### Metrics to Track

- **MTTA (Mean Time to Acknowledge):** Target <5 min for CRITICAL
- **MTTR (Mean Time to Resolve):** Target <30 min for CRITICAL
- **Alert accuracy:** % of alerts that required action (target >80%)
- **Escalation rate:** % of alerts that reached Level 2+ (target <20%)

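
The four metrics above follow directly from per-alert records. As a rough sketch of the arithmetic, assuming each alert is exported with the fields shown (the `AlertRecord` type and `review_metrics` function are hypothetical, not an existing StemeDB or PagerDuty API):

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class AlertRecord:
    minutes_to_ack: float      # alert fired -> acknowledged
    minutes_to_resolve: float  # alert fired -> resolved
    required_action: bool      # did on-call have to do anything?
    max_level: int             # highest escalation level reached (1-4)


def review_metrics(records: list) -> dict:
    """Compute MTTA, MTTR, alert accuracy, and escalation rate.

    Expects a non-empty list; statistics.mean raises on empty input.
    """
    total = len(records)
    return {
        "mtta_min": mean(r.minutes_to_ack for r in records),
        "mttr_min": mean(r.minutes_to_resolve for r in records),
        "alert_accuracy_pct": 100 * sum(r.required_action for r in records) / total,
        "escalation_rate_pct": 100 * sum(r.max_level >= 2 for r in records) / total,
    }
```

Comparing the result against the targets (for instance `mtta_min < 5` for CRITICAL alerts) gives a quick pass/fail view for the review meeting.
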

---

## Alert Tuning Process

### Quarterly Review

1. **Analyze alert volume** (past 90 days)
2. **Identify noisy alerts** (>5 firings/day, low action rate)
3. **Review thresholds** (adjust based on production baseline)
4. **Remove unused alerts** (0 firings in 90 days)
5. **Add new alerts** (based on incident learnings)

### Alert Hygiene Rules

- **Every CRITICAL alert** must have a runbook
- **Every alert** must have a defined action (not just FYI)
- **False positive rate** must be <10%
- **Every alert** must be actionable by the on-call engineer without expert knowledge

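
Steps 2 and 4 of the quarterly review can be partly automated from firing counts. The sketch below is one possible approach, not an existing tool: `classify_alert` is a hypothetical helper, and because the policy does not put a number on "low action rate", the 80% cutoff is an assumption borrowed from the alert accuracy target.

```python
REVIEW_WINDOW_DAYS = 90
NOISY_FIRINGS_PER_DAY = 5
LOW_ACTION_RATE = 0.80  # assumption: the policy only says "low action rate"


def classify_alert(firings: int, actioned: int) -> str:
    """Classify one alert rule from its counts over the 90-day window.

    firings  -- number of times the alert fired
    actioned -- number of firings that required on-call action
    """
    if firings == 0:
        return "unused"  # candidate for removal (step 4)
    per_day = firings / REVIEW_WINDOW_DAYS
    action_rate = actioned / firings
    if per_day > NOISY_FIRINGS_PER_DAY and action_rate < LOW_ACTION_RATE:
        return "noisy"   # candidate for threshold tuning (steps 2 and 3)
    return "ok"
```
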

---

## Contact Information

| Role | Primary | Backup | Email | Phone |
|------|---------|--------|-------|-------|
| On-Call Engineer | [Name] | [Name] | oncall@example.com | +1-XXX-XXX-XXXX |
| Engineering Manager | [Name] | [Name] | manager@example.com | +1-XXX-XXX-XXXX |
| SRE Lead | [Name] | [Name] | sre-lead@example.com | +1-XXX-XXX-XXXX |
| Engineering Director | [Name] | — | director@example.com | +1-XXX-XXX-XXXX |
| CTO | [Name] | — | cto@example.com | +1-XXX-XXX-XXXX |

**PagerDuty Schedules:** https://yourcompany.pagerduty.com/schedules

**Slack Channels:**
- Critical: #stemedb-alerts-critical
- Warning: #stemedb-alerts-warning
- Info: #stemedb-alerts-info
- Incident: #incident-YYYY-MM-DD-HH-MM (created on-demand)

**Runbook Repository:** https://docs.stemedb.com/operations/runbooks/

**Grafana Dashboards:** https://grafana.example.com/dashboards/stemedb


---

## Revision History

| Date | Version | Changes | Author |
|------|---------|---------|--------|
| 2026-02-11 | 1.0 | Initial escalation policy | AI Assistant |

**Review schedule:** Quarterly (every 3 months)