StemeDB Alert Escalation Policy
This document defines how StemeDB alerts escalate at each severity level, including the expected response times and the notification channels used.
Severity Levels
| Severity | Definition | Response Time | Notification |
|---|---|---|---|
| CRITICAL | Service down, data loss risk, security breach | Immediate (<5 min) | PagerDuty (page) + Slack + Email |
| WARNING | Service degraded, SLO at risk, capacity concern | 30 minutes | PagerDuty (email) + Slack |
| INFO | Informational, audit trail, no action required | Best effort | Slack only |
CRITICAL Alert Escalation
Level 1 (0-5 minutes)
- Notification: PagerDuty page + #stemedb-alerts-critical Slack mention
- Recipients: Primary on-call engineer
- Action: Acknowledge alert in PagerDuty within 5 minutes
Level 2 (5-15 minutes)
- Trigger: No acknowledgment after 5 minutes
- Notification: PagerDuty page escalates to backup on-call + manager
- Recipients: Backup on-call engineer, Engineering Manager
- Action:
- Backup on-call joins incident
- Create incident channel: #incident-YYYY-MM-DD-HH-MM
- Manager monitors for escalation needs
Level 3 (15-30 minutes)
- Trigger: No resolution after 15 minutes
- Notification: PagerDuty page escalates to director + SRE lead
- Recipients: Engineering Director, SRE Lead, Product Lead
- Action:
- Director assesses need for customer communication
- SRE lead coordinates with infrastructure teams
- Consider engaging vendor support (AWS, etc.)
Level 4 (30+ minutes)
- Trigger: Ongoing incident >30 minutes
- Notification: Email to executive team
- Recipients: CTO, VP Engineering, Customer Success
- Action:
- CTO decides on customer communication
- Customer Success prepares incident notification
- Schedule post-mortem review
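The four CRITICAL levels above can be read as a function of elapsed time and acknowledgment state. The following is a minimal Python sketch of that reading, not the way PagerDuty is actually configured; the function name and its parameters are hypothetical, and treating each boundary (5, 15, 30 minutes) as inclusive is an assumption.

```python
def critical_escalation_level(elapsed_min: float, acknowledged: bool) -> int:
    """Return the escalation level (1-4) for an unresolved CRITICAL alert.

    Thresholds follow the policy text: Level 2 when unacknowledged after
    5 minutes, Level 3 when unresolved after 15, Level 4 after 30.
    Inclusive boundaries are an assumption of this sketch.
    """
    if elapsed_min >= 30:
        return 4
    if elapsed_min >= 15:
        return 3
    if elapsed_min >= 5 and not acknowledged:
        return 2
    return 1
```

Note that Levels 3 and 4 depend only on how long the alert has been unresolved, while Level 2 is skipped if the primary on-call acknowledged in time.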
WARNING Alert Escalation
Level 1 (0-30 minutes)
- Notification: PagerDuty email + #stemedb-alerts-warning Slack
- Recipients: Primary on-call engineer
- Action: Review alert within 30 minutes, add to task backlog if non-urgent
Level 2 (30-120 minutes)
- Trigger: No acknowledgment after 30 minutes
- Notification: PagerDuty escalates to page
- Recipients: Primary on-call engineer (now paged)
- Action: Acknowledge and triage within 15 minutes
Level 3 (2-4 hours)
- Trigger: No resolution after 2 hours
- Notification: Email to manager
- Recipients: Engineering Manager
- Action: Manager assigns ticket, schedules investigation
Level 4 (4+ hours / escalating)
- Trigger: Warning alert escalating to critical thresholds
- Notification: Upgrade to CRITICAL escalation path
- Action: Follow CRITICAL escalation policy
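The WARNING path differs from the CRITICAL one mainly in that it can hand off to the CRITICAL policy at any point. A rough Python sketch of that decision follows; the function name, the `crossing_critical` flag, and the returned labels are all hypothetical, and how "escalating to critical thresholds" is detected is left to the alerting system.

```python
def warning_next_step(elapsed_min: float, acknowledged: bool,
                      crossing_critical: bool) -> str:
    """Pick the notification step for an unresolved WARNING alert."""
    # Level 4: a warning that is trending into critical territory is
    # handed to the CRITICAL escalation path regardless of its age.
    if crossing_critical:
        return "upgrade-to-critical"
    # Level 3: unresolved after 2 hours, the manager is emailed.
    if elapsed_min >= 120:
        return "email-manager"
    # Level 2: unacknowledged after 30 minutes, the email becomes a page.
    if elapsed_min >= 30 and not acknowledged:
        return "page-primary"
    # Level 1: low-urgency email plus Slack.
    return "email-primary"
```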
INFO Alert Handling
- Notification: #stemedb-alerts-info Slack only (no pages)
- Recipients: Engineering team (optional monitoring)
- Action: No immediate action required. Review during business hours.
Escalation: INFO alerts do NOT escalate unless manually upgraded by on-call engineer.
Alert-Specific Escalation
StemeDBAPIDown (CRITICAL)
| Time | Action | Owner |
|---|---|---|
| 0 min | Page on-call | Primary on-call |
| 2 min | Check runbook, verify API health | Primary on-call |
| 5 min | If not resolved, escalate to backup + manager | Backup on-call |
| 10 min | Engage AWS support if infrastructure issue | Manager |
| 15 min | Customer communication decision | Director |
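A timeline like the one above could be kept as data so that tooling can list which steps are already due. This is an illustrative Python sketch only; `Step`, `API_DOWN_TIMELINE`, and `steps_due` are hypothetical names, and the rows are copied from the StemeDBAPIDown table.

```python
from typing import NamedTuple


class Step(NamedTuple):
    at_min: int   # minutes after the alert fired
    action: str
    owner: str


API_DOWN_TIMELINE = [
    Step(0, "Page on-call", "Primary on-call"),
    Step(2, "Check runbook, verify API health", "Primary on-call"),
    Step(5, "If not resolved, escalate to backup + manager", "Backup on-call"),
    Step(10, "Engage AWS support if infrastructure issue", "Manager"),
    Step(15, "Customer communication decision", "Director"),
]


def steps_due(timeline: list[Step], elapsed_min: float) -> list[Step]:
    """Return every step whose scheduled time has already passed."""
    return [step for step in timeline if step.at_min <= elapsed_min]
```

The same structure would fit the other alert-specific tables in this section.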
WALDiskNearlyFull (CRITICAL)
| Time | Action | Owner |
|---|---|---|
| 0 min | Page on-call | Primary on-call |
| 5 min | Run disk cleanup script | Primary on-call |
| 10 min | If cleanup insufficient, request disk resize | Primary on-call |
| 15 min | Escalate to infrastructure team | Manager |
| 20 min | Consider failover to replica with more disk | SRE lead |
ReplicationLagCritical (CRITICAL)
| Time | Action | Owner |
|---|---|---|
| 0 min | Page on-call | Primary on-call |
| 5 min | Check network connectivity, peer health | Primary on-call |
| 10 min | Check disk I/O on lagging node (iostat -x) | Primary on-call |
| 15 min | If persistent, escalate to network team | Manager |
| 30 min | Consider force-resyncing peer | SRE lead |
HighAPIErrorRate (WARNING)
| Time | Action | Owner |
|---|---|---|
| 0 min | Email on-call | Primary on-call |
| 30 min | Review logs for error patterns | Primary on-call |
| 1 hour | If rate increasing, upgrade to CRITICAL | Primary on-call |
| 2 hours | Create ticket, assign to team | Manager |
Notification Channels by Severity
| Severity | PagerDuty | Slack | Email | SMS |
|---|---|---|---|---|
| CRITICAL | ✅ Page (high urgency) | ✅ @channel mention | ✅ All on-call | ✅ Primary only |
| WARNING | ✅ Email (low urgency) | ✅ @here mention | ✅ Primary on-call | ❌ |
| INFO | ❌ | ✅ No mentions | ❌ | ❌ |
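The channel matrix can be expressed as a lookup table. The sketch below is a hypothetical Python encoding, not an actual StemeDB or PagerDuty configuration; it assumes the fourth and fifth columns of the table are Email and SMS, which matches the severity table at the top of this document.

```python
from typing import Optional

# None means the channel is not used for that severity.
NOTIFICATION_MATRIX: dict[str, dict[str, Optional[str]]] = {
    "CRITICAL": {"pagerduty": "page (high urgency)", "slack": "@channel",
                 "email": "all on-call", "sms": "primary only"},
    "WARNING": {"pagerduty": "email (low urgency)", "slack": "@here",
                "email": "primary on-call", "sms": None},
    "INFO": {"pagerduty": None, "slack": "no mentions",
             "email": None, "sms": None},
}


def active_channels(severity: str) -> list[str]:
    """List the channels that fire for a severity, in table order."""
    row = NOTIFICATION_MATRIX[severity.upper()]
    return [name for name, mode in row.items() if mode is not None]
```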
On-Call Rotation
Primary On-Call
- Shift length: 1 week (Mon 9am - Mon 9am)
- Response time: <5 minutes for CRITICAL, <30 minutes for WARNING
- Compensation: 1 day PTO per week on-call + overtime pay for incidents
- Handoff: Monday morning standup
Backup On-Call
- Role: Escalation point if primary unavailable
- Response time: <10 minutes for CRITICAL escalation
- Compensation: 0.5 day PTO per week backup
Manager On-Call
- Role: Escalation point for Level 2+, coordination
- Response time: <15 minutes for escalated CRITICAL
- Compensation: Part of manager responsibilities
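Because the primary shift runs Monday 9am to Monday 9am, the shift that owns a given moment can be computed directly. A small Python sketch, assuming naive datetimes in the team's local time zone (the policy does not name a time zone) and a hypothetical helper name:

```python
from datetime import datetime, timedelta


def shift_window(now: datetime) -> tuple[datetime, datetime]:
    """Return (start, end) of the weekly primary shift containing `now`."""
    # Monday of the current week at 09:00 (weekday() == 0 is Monday).
    start = (now - timedelta(days=now.weekday())).replace(
        hour=9, minute=0, second=0, microsecond=0)
    # Monday before 09:00 still belongs to the previous week's shift.
    if now < start:
        start -= timedelta(weeks=1)
    return start, start + timedelta(weeks=1)
```

An alert that fires on Monday at 08:30 therefore still belongs to the outgoing engineer, which is worth keeping in mind around the handoff at standup.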
Incident Response Workflow
graph TD
A[Alert Fires] --> B{Severity?}
B -->|CRITICAL| C[Page on-call]
B -->|WARNING| D[Email on-call]
B -->|INFO| E[Slack only]
C --> F[Acknowledge <5min]
F --> G[Follow runbook]
G --> H{Resolved?}
H -->|Yes| I[Mark resolved]
H -->|No| J{>15min?}
J -->|Yes| K[Escalate Level 2]
K --> L[Manager joins]
L --> M[Create incident channel]
M --> N{Resolved?}
N -->|Yes| I
N -->|No| O{>30min?}
O -->|Yes| P[Escalate Level 3]
P --> Q[Director + CTO join]
Q --> R[Customer communication]
D --> S[Acknowledge <30min]
S --> T[Triage]
T --> U{Escalating?}
U -->|Yes| C
U -->|No| V[Schedule fix]
Post-Incident Review
Conduct a post-mortem after every CRITICAL alert and after any WARNING alert that lasts longer than 2 hours:
Template
Incident: [Alert name + timestamp]
Duration: [Time from alert to resolution]
Impact: [Services affected, customer impact]
Root cause: [Technical explanation]
Resolution: [What fixed it]
Prevention: [Action items to prevent recurrence]
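The rule for when a post-mortem is required is simple enough to check automatically when an incident closes. A minimal Python sketch, with a hypothetical function name and the assumption that severity is passed as the uppercase label used in this document:

```python
from datetime import timedelta


def needs_postmortem(severity: str, duration: timedelta) -> bool:
    """Apply the policy: every CRITICAL, and any WARNING over 2 hours."""
    if severity == "CRITICAL":
        return True
    return severity == "WARNING" and duration > timedelta(hours=2)
```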
Review Meeting
- Attendees: On-call engineer(s), manager, affected team leads
- Schedule: Within 48 hours of incident
- Duration: 30-60 minutes
- Output: Action items assigned with due dates
Metrics to Track
- MTTA (Mean Time to Acknowledge): Target <5 min for CRITICAL
- MTTR (Mean Time to Resolve): Target <30 min for CRITICAL
- Alert accuracy: % of alerts that required action (target >80%)
- Escalation rate: % of alerts that reached Level 2+ (target <20%)
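The four metrics above can be computed from per-alert records. The following Python sketch assumes a hypothetical `AlertRecord` shape; in practice the fields would come from PagerDuty or incident tooling exports, whose formats are not specified here.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class AlertRecord:
    minutes_to_ack: float
    minutes_to_resolve: float
    required_action: bool   # did the alert need a human response?
    max_level: int          # highest escalation level reached (1-4)


def summarize(records: list[AlertRecord]) -> dict[str, float]:
    """Compute MTTA, MTTR, alert accuracy, and escalation rate.

    Raises statistics.StatisticsError if `records` is empty.
    """
    n = len(records)
    return {
        "mtta_min": mean(r.minutes_to_ack for r in records),
        "mttr_min": mean(r.minutes_to_resolve for r in records),
        "accuracy_pct": 100 * sum(r.required_action for r in records) / n,
        "escalation_pct": 100 * sum(r.max_level >= 2 for r in records) / n,
    }
```

To compare against the stated targets, filter the records to CRITICAL alerts before calling `summarize`, since the MTTA and MTTR targets apply to that severity only.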
Alert Tuning Process
Quarterly Review
- Analyze alert volume (past 90 days)
- Identify noisy alerts (>5 firings/day, low action rate)
- Review thresholds (adjust based on production baseline)
- Remove unused alerts (0 firings in 90 days)
- Add new alerts (based on incident learnings)
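The noisy and unused criteria in the review list can be applied to 90-day firing counts. This is a hedged Python sketch: the function and parameter names are hypothetical, and the 50% cutoff for a "low action rate" is an assumption, since the policy does not define one.

```python
def classify_alerts(firings_90d: dict[str, int],
                    actioned_90d: dict[str, int],
                    min_action_rate: float = 0.5) -> dict[str, list[str]]:
    """Flag alerts as noisy or unused over a 90-day window."""
    noisy: list[str] = []
    unused: list[str] = []
    for name, fired in firings_90d.items():
        if fired == 0:
            # No firings in 90 days: candidate for removal.
            unused.append(name)
            continue
        per_day = fired / 90
        action_rate = actioned_90d.get(name, 0) / fired
        # Noisy: more than 5 firings/day AND a low action rate.
        if per_day > 5 and action_rate < min_action_rate:
            noisy.append(name)
    return {"noisy": noisy, "unused": unused}
```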
Alert Hygiene Rules
- Every CRITICAL alert must have a runbook
- Every alert must have a defined action (not just FYI)
- False positive rate must be <10%
- Alert must be actionable by on-call without expert knowledge
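The first three hygiene rules lend themselves to an automated lint over alert definitions. A rough Python sketch follows; the dictionary keys (`severity`, `runbook_url`, `action`, `false_positive_rate`) are hypothetical and would need to be mapped onto whatever metadata the real alert rules carry. The fourth rule, actionability without expert knowledge, needs human review and is not checked.

```python
def hygiene_violations(alert: dict) -> list[str]:
    """Return the hygiene rules an alert definition breaks, if any."""
    problems: list[str] = []
    if alert.get("severity") == "CRITICAL" and not alert.get("runbook_url"):
        problems.append("CRITICAL alert has no runbook")
    if not alert.get("action"):
        problems.append("no defined action")
    # A missing rate is treated as 0.0, i.e. not yet measured.
    if alert.get("false_positive_rate", 0.0) >= 0.10:
        problems.append("false positive rate is 10% or higher")
    return problems
```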
Contact Information
| Role | Primary | Backup | Email | Phone |
|---|---|---|---|---|
| On-Call Engineer | [Name] | [Name] | oncall@example.com | +1-XXX-XXX-XXXX |
| Engineering Manager | [Name] | [Name] | manager@example.com | +1-XXX-XXX-XXXX |
| SRE Lead | [Name] | [Name] | sre-lead@example.com | +1-XXX-XXX-XXXX |
| Engineering Director | [Name] | — | director@example.com | +1-XXX-XXX-XXXX |
| CTO | [Name] | — | cto@example.com | +1-XXX-XXX-XXXX |
PagerDuty Schedules: https://yourcompany.pagerduty.com/schedules
Slack Channels:
- Critical: #stemedb-alerts-critical
- Warning: #stemedb-alerts-warning
- Info: #stemedb-alerts-info
- Incident: #incident-YYYY-MM-DD-HH-MM (created on-demand)
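Generating the incident channel name from a timestamp keeps the naming pattern consistent across incidents. A minimal Python sketch with a hypothetical helper name; whether the timestamp should be UTC or local time is not stated in this policy and is left to the caller.

```python
from datetime import datetime


def incident_channel_name(opened_at: datetime) -> str:
    """Format a channel name following #incident-YYYY-MM-DD-HH-MM."""
    return opened_at.strftime("#incident-%Y-%m-%d-%H-%M")
```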
Runbook Repository: https://docs.stemedb.com/operations/runbooks/
Grafana Dashboards: https://grafana.example.com/dashboards/stemedb
Revision History
| Date | Version | Changes | Author |
|---|---|---|---|
| 2026-02-11 | 1.0 | Initial escalation policy | AI Assistant |
Review schedule: Quarterly (every 3 months)