jml 3e7eddc074 feat: add enterprise production readiness infrastructure

This commit implements comprehensive production hardening across multiple
layers to prepare StemeDB for enterprise pilot deployments:

## API Layer
- Add rate limiting middleware with configurable limits per endpoint
- Enhance error handling with detailed context and proper HTTP status codes
- Add security hardening tests for input validation and boundary conditions
- Create store_helpers module for defensive storage access patterns

## Storage & WAL
- Optimize group commit batching for higher throughput
- Add defensive error handling in hybrid backend with proper fallbacks
- Enhance WAL journal durability guarantees with fsync validation
- Improve index store query performance with better caching

## Operations & Deployment
- Add comprehensive operations documentation (deployment, monitoring, DR)
- Create systemd units for backup, WAL archival, and verification
- Add monitoring configs (Prometheus alerts, metrics exporters)
- Implement backup/restore scripts with verification and S3 archival
- Add DR drill automation and runbook procedures
- Create load balancer configs (nginx, envoy) with health checks

## Documentation
- Update CLAUDE.md with operations and troubleshooting guides
- Expand roadmap with production readiness milestones
- Add pilot success criteria and deployment reference architecture
- Document TLS setup, monitoring integration, and incident response

## Configuration
- Add .env.example with all required environment variables
- Document resource sizing for different deployment scales
- Add configuration examples for various deployment topologies

This positions StemeDB for successful enterprise pilots with proper
operational discipline, monitoring, backup/DR, and security hardening.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

2026-02-12 06:08:15 +00:00

8.7 KiB

Raw Blame History

StemeDB Alert Escalation Policy

This document defines how StemeDB alerts escalate based on severity, response time, and notification channels.

Severity Levels

Severity	Definition	Response Time	Notification
CRITICAL	Service down, data loss risk, security breach	Immediate (<5 min)	PagerDuty (page) + Slack + Email
WARNING	Service degraded, SLO at risk, capacity concern	30 minutes	PagerDuty (email) + Slack
INFO	Informational, audit trail, no action required	Best effort	Slack only

CRITICAL Alert Escalation

Level 1 (0-5 minutes)

Notification: PagerDuty page + #stemedb-alerts-critical Slack mention
Recipients: Primary on-call engineer
Action: Acknowledge alert in PagerDuty within 5 minutes

Level 2 (5-15 minutes)

Trigger: No acknowledgment after 5 minutes
Notification: PagerDuty page escalates to backup on-call + manager
Recipients: Backup on-call engineer, Engineering Manager
Action:
- Backup on-call joins incident
- Create incident channel: #incident-YYYY-MM-DD-HH-MM
- Manager monitors for escalation needs

Level 3 (15-30 minutes)

Trigger: No resolution after 15 minutes
Notification: PagerDuty page escalates to director + SRE lead
Recipients: Engineering Director, SRE Lead, Product Lead
Action:
- Director assesses need for customer communication
- SRE lead coordinates with infrastructure teams
- Consider engaging vendor support (AWS, etc.)

Level 4 (30+ minutes)

Trigger: Ongoing incident >30 minutes
Notification: Email to executive team
Recipients: CTO, VP Engineering, Customer Success
Action:
- CTO decides on customer communication
- Customer Success prepares incident notification
- Schedule post-mortem review

WARNING Alert Escalation

Level 1 (0-30 minutes)

Notification: PagerDuty email + #stemedb-alerts-warning Slack
Recipients: Primary on-call engineer
Action: Review alert within 30 minutes, add to task backlog if non-urgent

Level 2 (30-120 minutes)

Trigger: No acknowledgment after 30 minutes
Notification: PagerDuty escalates to page
Recipients: Primary on-call engineer (now paged)
Action: Acknowledge and triage within 15 minutes

Level 3 (2-4 hours)

Trigger: No resolution after 2 hours
Notification: Email to manager
Recipients: Engineering Manager
Action: Manager assigns ticket, schedules investigation

Level 4 (4+ hours / escalating)

Trigger: Warning alert escalating to critical thresholds
Notification: Upgrade to CRITICAL escalation path
Action: Follow CRITICAL escalation policy

INFO Alert Handling

Notification: #stemedb-alerts-info Slack only (no pages)
Recipients: Engineering team (optional monitoring)
Action: No immediate action required. Review during business hours.

Escalation: INFO alerts do NOT escalate unless manually upgraded by on-call engineer.

Alert-Specific Escalation

StemeDBAPIDown (CRITICAL)

Time	Action	Owner
0 min	Page on-call	Primary on-call
2 min	Check runbook, verify API health	Primary on-call
5 min	If not resolved, escalate to backup + manager	Backup on-call
10 min	Engage AWS support if infrastructure issue	Manager
15 min	Customer communication decision	Director

WALDiskNearlyFull (CRITICAL)

Time	Action	Owner
0 min	Page on-call	Primary on-call
5 min	Run disk cleanup script	Primary on-call
10 min	If cleanup insufficient, request disk resize	Primary on-call
15 min	Escalate to infrastructure team	Manager
20 min	Consider failover to replica with more disk	SRE lead

ReplicationLagCritical (CRITICAL)

Time	Action	Owner
0 min	Page on-call	Primary on-call
5 min	Check network connectivity, peer health	Primary on-call
10 min	Check disk I/O on lagging node (`iostat -x`)	Primary on-call
15 min	If persistent, escalate to network team	Manager
30 min	Consider force-resyncing peer	SRE lead

HighAPIErrorRate (WARNING)

Time	Action	Owner
0 min	Email on-call	Primary on-call
30 min	Review logs for error patterns	Primary on-call
1 hour	If rate increasing, upgrade to CRITICAL	Primary on-call
2 hours	Create ticket, assign to team	Manager

Notification Channels by Severity

Severity	PagerDuty	Slack	Email	SMS
CRITICAL	✅ Page (high urgency)	✅ @channel mention	✅ All on-call	✅ Primary only
WARNING	✅ Email (low urgency)	✅ @here mention	✅ Primary on-call	❌
INFO	❌	✅ No mentions	❌	❌

On-Call Rotation

Primary On-Call

Shift length: 1 week (Mon 9am - Mon 9am)
Response time: <5 minutes for CRITICAL, <30 minutes for WARNING
Compensation: 1 day PTO per week on-call + overtime pay for incidents
Handoff: Monday morning standup

Backup On-Call

Role: Escalation point if primary unavailable
Response time: <10 minutes for CRITICAL escalation
Compensation: 0.5 day PTO per week backup

Manager On-Call

Role: Escalation point for Level 2+, coordination
Response time: <15 minutes for escalated CRITICAL
Compensation: Part of manager responsibilities

Incident Response Workflow

graph TD
    A[Alert Fires] --> B{Severity?}
    B -->|CRITICAL| C[Page on-call]
    B -->|WARNING| D[Email on-call]
    B -->|INFO| E[Slack only]

    C --> F[Acknowledge <5min]
    F --> G[Follow runbook]
    G --> H{Resolved?}
    H -->|Yes| I[Mark resolved]
    H -->|No| J{>15min?}

    J -->|Yes| K[Escalate Level 2]
    K --> L[Manager joins]
    L --> M[Create incident channel]
    M --> N{Resolved?}

    N -->|Yes| I
    N -->|No| O{>30min?}
    O -->|Yes| P[Escalate Level 3]
    P --> Q[Director + CTO join]
    Q --> R[Customer communication]

    D --> S[Acknowledge <30min]
    S --> T[Triage]
    T --> U{Escalating?}
    U -->|Yes| C
    U -->|No| V[Schedule fix]

Post-Incident Review

After all CRITICAL alerts and WARNING alerts >2 hours, conduct post-mortem:

Template

Incident: [Alert name + timestamp] Duration: [Time from alert to resolution] Impact: [Services affected, customer impact] Root cause: [Technical explanation] Resolution: [What fixed it] Prevention: [Action items to prevent recurrence]

Review Meeting

Attendees: On-call engineer(s), manager, affected team leads
Schedule: Within 48 hours of incident
Duration: 30-60 minutes
Output: Action items assigned with due dates

Metrics to Track

MTTA (Mean Time to Acknowledge): Target <5 min for CRITICAL
MTTR (Mean Time to Resolve): Target <30 min for CRITICAL
Alert accuracy: % of alerts that required action (target >80%)
Escalation rate: % of alerts that reached Level 2+ (target <20%)

Alert Tuning Process

Quarterly Review

Analyze alert volume (past 90 days)
Identify noisy alerts (>5 firings/day, low action rate)
Review thresholds (adjust based on production baseline)
Remove unused alerts (0 firings in 90 days)
Add new alerts (based on incident learnings)

Alert Hygiene Rules

Every CRITICAL alert must have a runbook
Every alert must have a defined action (not just FYI)
False positive rate must be <10%
Alert must be actionable by on-call without expert knowledge

Contact Information

Role	Primary	Backup	Email	Phone
On-Call Engineer	[Name]	[Name]	oncall@example.com	+1-XXX-XXX-XXXX
Engineering Manager	[Name]	[Name]	manager@example.com	+1-XXX-XXX-XXXX
SRE Lead	[Name]	[Name]	sre-lead@example.com	+1-XXX-XXX-XXXX
Engineering Director	[Name]	—	director@example.com	+1-XXX-XXX-XXXX
CTO	[Name]	—	cto@example.com	+1-XXX-XXX-XXXX

PagerDuty Schedules: https://yourcompany.pagerduty.com/schedules

Slack Channels:

Critical: #stemedb-alerts-critical
Warning: #stemedb-alerts-warning
Info: #stemedb-alerts-info
Incident: #incident-YYYY-MM-DD-HH-MM (created on-demand)

Runbook Repository: https://docs.stemedb.com/operations/runbooks/

Grafana Dashboards: https://grafana.example.com/dashboards/stemedb

Revision History

Date	Version	Changes	Author
2026-02-11	1.0	Initial escalation policy	AI Assistant

Review schedule: Quarterly (every 3 months)

8.7 KiB Raw Blame History