stemedb/docs/operations/runbooks/disaster-recovery.md
jml 3e7eddc074 feat: add enterprise production readiness infrastructure
2026-02-12 06:08:15 +00:00


# Runbook: Disaster Recovery
## Overview
**Purpose:** Restore StemeDB from backup after catastrophic failure.
**RTO (Recovery Time Objective):** 4 hours
**RPO (Recovery Point Objective):** 15 minutes
**Scope:** Complete server failure, data center outage, or regional disaster requiring restore from backups.
---
## When to Use This Runbook
Use this runbook for:
- **Complete server failure** - Hardware dead, cannot boot
- **Data center outage** - Entire DC offline, need to restore elsewhere
- **Disk failure** - Storage completely lost, no local recovery possible
- **Ransomware/corruption** - Data encrypted or corrupted, need clean restore
- **Regional disaster** - DR drill or actual disaster requiring failover
**Do NOT use for:**
- Single node failure in cluster → Use cluster failover instead
- WAL corruption → Use [Restore from Backup](./restore-from-backup.md) §2
- Index rebuild → Use [Restore from Backup](./restore-from-backup.md) §4
---
## Prerequisites
Before starting DR, ensure:
- [ ] **New server provisioned** (or existing server with clean disk)
- [ ] **S3 access configured** (credentials, network access to S3)
- [ ] **Dependencies installed** (Rust, PostgreSQL if using external stores)
- [ ] **Stakeholders notified** (team knows DR is in progress)
- [ ] **DNS/load balancer change prepared** (access confirmed and new record ready if the server IP changes; the cutover itself happens in §1 Step 7)
**Minimum server specs:**
- CPU: 4 cores
- RAM: 16GB
- Disk: 2x backup size (for restore + buffer)
- Network: 1Gbps (for S3 downloads)
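
The minimum specs above can be checked before any data is downloaded. The following is a minimal sketch, not part of StemeDB: `meets_min_specs` is a hypothetical helper that takes the host's core count, RAM, free disk, and the backup size (all whole numbers, in GB where applicable) and compares them with the thresholds listed above.

```bash
# Hypothetical preflight helper (assumption, not shipped with StemeDB).
# Usage: meets_min_specs <cores> <ram_gb> <free_disk_gb> <backup_gb>
meets_min_specs() {
  local cores=$1 ram_gb=$2 disk_gb=$3 backup_gb=$4
  local ok=0
  (( cores >= 4 )) || { echo "FAIL: need >= 4 cores, have $cores"; ok=1; }
  (( ram_gb >= 16 )) || { echo "FAIL: need >= 16 GB RAM, have $ram_gb"; ok=1; }
  (( disk_gb >= 2 * backup_gb )) || {
    echo "FAIL: need >= $(( 2 * backup_gb )) GB free disk, have $disk_gb"; ok=1; }
  (( ok == 0 )) && echo "OK: host meets minimum specs"
  return "$ok"
}

# Example invocation on the target host (values gathered with standard tools):
#   cores=$(nproc)
#   ram_gb=$(awk '/MemTotal/ {printf "%d", ($2 + 1048575) / 1048576}' /proc/meminfo)
#   disk_gb=$(df -BG --output=avail /var/lib | tail -n1 | tr -dc '0-9')
#   meets_min_specs "$cores" "$ram_gb" "$disk_gb" 100   # 100 = backup size in GB
```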
---
## Decision Tree
```
Disaster scenario
├─► Complete restore needed?
│   └─► §1 Full Restore from S3
├─► Point-in-time restore needed?
│   └─► §2 Point-in-Time Restore with WAL Replay
└─► Only recent data lost?
    └─► §3 WAL-Only Recovery
---
## Resolution Steps
### §1. Full Restore from S3 (RTO: 4 hours, RPO: 15 minutes)
**Use case:** Complete data loss, restore everything from S3.
**Step 1: Provision new server (30 min)**
```bash
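# Install dependencies (jq, git, rsync, and curl are used in later steps)
sudo apt update
sudo apt install -y awscli build-essential pkg-config libssl-dev postgresql-client jq git rsync curl
# Create stemedb user
sudo useradd -r -s /bin/bash -d /var/lib/stemedb -m stemedb
# Install Rust for the stemedb user, which runs the build in Step 5
# (-y skips the interactive prompt)
sudo -u stemedb -H bash -c "curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y"
# Create data and backup directories
sudo mkdir -p /var/lib/stemedb/{wal,db} /var/backups/stemedb
sudo chown -R stemedb:stemedb /var/lib/stemedb /var/backups/stemedb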
```
**Step 2: Download latest full backup from S3 (60 min)**
```bash
# List available backups
aws s3 ls s3://stemedb-backups-prod/ | grep stemedb-backup
# Expected output:
# PRE stemedb-backup-20260211-060000/
# PRE stemedb-backup-20260211-120000/
# PRE stemedb-backup-20260211-180000/ ← Latest
# Download latest full backup
LATEST_BACKUP=$(aws s3 ls s3://stemedb-backups-prod/ | grep stemedb-backup | tail -n1 | awk '{print $2}' | tr -d '/')
sudo -u stemedb aws s3 sync \
s3://stemedb-backups-prod/${LATEST_BACKUP} \
/var/backups/stemedb/${LATEST_BACKUP} \
--region us-east-1
# Verify download
ls -lh /var/backups/stemedb/${LATEST_BACKUP}/
# Should show: backup-metadata.json, wal/, db/
cat /var/backups/stemedb/${LATEST_BACKUP}/backup-metadata.json
# Verify timestamp, file counts
```
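
The manual `ls` check in Step 2 can be turned into a pass/fail test. This is a minimal sketch under one assumption taken from the step itself: a complete backup directory contains a non-empty `backup-metadata.json` plus `wal/` and `db/` subdirectories. `verify_backup_layout` is a hypothetical helper name.

```bash
# Hypothetical helper (assumption): checks only the layout described in Step 2,
# not the contents of the files.
verify_backup_layout() {
  local dir=$1 sub missing=0
  [[ -s "$dir/backup-metadata.json" ]] || {
    echo "missing or empty: backup-metadata.json"; missing=1; }
  for sub in wal db; do
    [[ -d "$dir/$sub" ]] || { echo "missing directory: $sub/"; missing=1; }
  done
  return "$missing"
}

# Example:
#   verify_backup_layout "/var/backups/stemedb/${LATEST_BACKUP}" || echo "re-run the sync"
```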
**Step 3: Download WAL segments since last backup (15 min)**
```bash
# Get backup timestamp
BACKUP_TIMESTAMP=$(jq -r .timestamp /var/backups/stemedb/${LATEST_BACKUP}/backup-metadata.json)
echo "Backup timestamp: $BACKUP_TIMESTAMP"
# Download the WAL archive
# (this syncs every archived segment, not only those newer than BACKUP_TIMESTAMP)
sudo -u stemedb mkdir -p /var/lib/stemedb/wal-archive
sudo -u stemedb aws s3 sync \
s3://stemedb-backups-prod/wal-archive/ \
/var/lib/stemedb/wal-archive/ \
--region us-east-1
# Count segments
WAL_COUNT=$(find /var/lib/stemedb/wal-archive -name "*.wal" | wc -l)
echo "Downloaded $WAL_COUNT WAL segments"
```
**Step 4: Restore data directories (30 min)**
```bash
# Restore from backup
sudo -u stemedb rsync -av \
/var/backups/stemedb/${LATEST_BACKUP}/wal/ \
/var/lib/stemedb/wal/
sudo -u stemedb rsync -av \
/var/backups/stemedb/${LATEST_BACKUP}/db/ \
/var/lib/stemedb/db/
# Copy archived WAL segments
sudo -u stemedb cp -r /var/lib/stemedb/wal-archive/*.wal /var/lib/stemedb/wal/
# Verify restoration
du -sh /var/lib/stemedb/{wal,db}
# Should match backup sizes + WAL archive
```
**Step 5: Build and start StemeDB (30 min)**
```bash
# Clone repository
cd /opt
sudo git clone https://github.com/yourusername/stemedb.git
sudo chown -R stemedb:stemedb /opt/stemedb
# Build release binary
cd /opt/stemedb
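# sudo does not inherit your PATH, so load the stemedb user's Rust environment first
# (requires the Rust toolchain to be installed for the stemedb user)
sudo -u stemedb -H bash -c 'source "$HOME/.cargo/env" && cargo build --release --bin stemedb-api'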
# Install systemd unit
sudo cp docs/operations/deployment/systemd/stemedb-api.service /etc/systemd/system/
sudo systemctl daemon-reload
# Configure environment
sudo tee /etc/default/stemedb <<ENV
STEMEDB_BIND_ADDR=0.0.0.0:18180
STEMEDB_WAL_DIR=/var/lib/stemedb/wal
STEMEDB_DB_DIR=/var/lib/stemedb/db
RUST_LOG=info
ENV
# Start StemeDB (will auto-replay WAL)
sudo systemctl start stemedb-api
# Monitor startup
sudo journalctl -u stemedb-api -f
# Expected logs:
# "Starting WAL recovery..."
# "Replayed 15234 entries from WAL"
# "Rebuilding indexes..."
# "Startup complete, listening on 0.0.0.0:18180"
```
**Step 6: Validate recovery (30 min)**
```bash
# Wait for startup to complete (watch journalctl)
# Then validate...
# Check health
curl http://localhost:18180/v1/health
# Expected:
# {
# "status": "healthy",
# "assertion_count": 105234,
# "wal_segments": 47,
# "uptime_seconds": 120
# }
# Verify assertion count matches expected
EXPECTED_COUNT=$(jq -r .assertion_count /var/backups/stemedb/${LATEST_BACKUP}/backup-metadata.json)
ACTUAL_COUNT=$(curl -s http://localhost:18180/v1/health | jq .assertion_count)
echo "Expected: $EXPECTED_COUNT"
echo "Actual: $ACTUAL_COUNT"
echo "Delta: $((ACTUAL_COUNT - EXPECTED_COUNT))"
# Delta should equal assertions from WAL replay
# (data added between backup and failure)
# Test query
curl -X POST http://localhost:18180/v1/query \
-H "Content-Type: application/json" \
-d '{
"concept_path": "test/dr",
"predicate": "recovered",
"lens": "recency"
}'
# Should return 200 (even if empty results)
# Test ingestion
curl -X POST http://localhost:18180/v1/assert \
-H "Content-Type: application/json" \
-d '{
"concept_path": "test/dr_validation",
"predicate": "restored",
"value": true,
"confidence": 1.0,
"authority_tier": "expert"
}'
# Should return 201 Created
```
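
Step 6 asks the operator to watch the logs and then compare counts by hand. The sketch below shows one way to script both parts. `wait_until` and `count_not_below` are hypothetical helpers; the only assumption about StemeDB is the one the step already makes, namely that `/v1/health` answers once startup is complete and reports `assertion_count`.

```bash
# Retry a command until it succeeds or the timeout (seconds) expires.
# Usage: wait_until <timeout_s> <interval_s> <command...>
wait_until() {
  local timeout=$1 interval=$2; shift 2
  local deadline=$(( SECONDS + timeout ))
  until "$@"; do
    (( SECONDS >= deadline )) && return 1
    sleep "$interval"
  done
}

# Print the delta and fail if the restored count is below the backup's count.
# Usage: count_not_below <expected> <actual>
count_not_below() {
  local expected=$1 actual=$2
  [[ $expected =~ ^[0-9]+$ && $actual =~ ^[0-9]+$ ]] || {
    echo "non-numeric count"; return 2; }
  echo "delta: $(( actual - expected ))"
  (( actual >= expected ))
}

# Example:
#   wait_until 300 5 curl -sf -o /dev/null http://localhost:18180/v1/health
#   count_not_below "$EXPECTED_COUNT" "$ACTUAL_COUNT"
```

A negative delta means the restored server holds fewer assertions than the backup recorded, which points at an incomplete restore rather than at WAL replay.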
**Step 7: Resume operations (60 min)**
```bash
# Update DNS (if IP changed)
# Point stemedb.yourdomain.com to new server IP
# Update load balancer (if using LB)
# Add new server to backend pool
# Enable backup automation
sudo systemctl enable stemedb-backup.timer
sudo systemctl start stemedb-backup.timer
sudo systemctl enable stemedb-archive-wal.timer
sudo systemctl start stemedb-archive-wal.timer
sudo systemctl enable stemedb-verify-backup.timer
sudo systemctl start stemedb-verify-backup.timer
# Verify timers
systemctl list-timers 'stemedb-*'
# Notify stakeholders
echo "StemeDB DR complete at $(date -u)" | mail -s "StemeDB DR Complete" oncall@yourcompany.com
```
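**Total time: ~4 hours 15 min if every step takes its full estimate (30 + 60 + 15 + 30 + 30 + 30 + 60 = 255 min). That is 15 min over the 4-hour RTO, so overlap the two S3 downloads (Steps 2 and 3) where possible.**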
---
### §2. Point-in-Time Restore with WAL Replay (RTO: 2 hours, RPO: 15 min)
**Use case:** Restore to specific timestamp (e.g., before bad data ingestion).
**Step 1: Identify target timestamp**
```bash
# Determine when bad data was ingested
# (from logs, monitoring, or user reports)
TARGET_TIMESTAMP="2026-02-11T14:30:00Z"
# Find backup immediately before target
aws s3 ls s3://stemedb-backups-prod/ | grep stemedb-backup | \
awk '{print $2}' | tr -d '/' | \
while read backup; do
BACKUP_TS=$(aws s3 cp s3://stemedb-backups-prod/${backup}/backup-metadata.json - | jq -r .timestamp)
if [[ "$BACKUP_TS" < "$TARGET_TIMESTAMP" ]]; then
echo "$backup ($BACKUP_TS)"
fi
done | tail -n1
# Use backup: stemedb-backup-20260211-120000 (2026-02-11T12:00:00Z)
```
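
The selection rule in Step 1 (the newest backup taken strictly before the target) can be isolated from the S3 calls so it is easy to test. A minimal sketch: `pick_backup_before` is a hypothetical helper that reads `name timestamp` pairs on stdin and assumes all timestamps use the same ISO 8601 UTC format, so that string comparison matches chronological order.

```bash
# Usage: <list of "name timestamp" lines> | pick_backup_before <target_timestamp>
# Prints the name of the latest backup strictly before the target; fails if none.
pick_backup_before() {
  local target=$1 name ts best_name="" best_ts=""
  while read -r name ts; do
    [[ -n $ts && $ts < $target ]] || continue
    if [[ -z $best_ts || $ts > $best_ts ]]; then
      best_name=$name
      best_ts=$ts
    fi
  done
  [[ -n $best_name ]] && echo "$best_name"
}

# Example:
#   printf '%s\n' \
#     'stemedb-backup-20260211-060000 2026-02-11T06:00:00Z' \
#     'stemedb-backup-20260211-120000 2026-02-11T12:00:00Z' \
#     'stemedb-backup-20260211-180000 2026-02-11T18:00:00Z' |
#     pick_backup_before 2026-02-11T14:30:00Z
```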
**Step 2: Restore base backup**
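Follow §1 Steps 1, 2, and 4 using the identified backup instead of the latest. Skip §1 Step 3 and the archived-WAL copy in §1 Step 4: copying the full archive into `wal/` would replay changes made after the target timestamp. Step 3 below copies only the segments needed.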
**Step 3: Replay WAL to target timestamp**
```bash
# Download all WAL segments between backup and target
sudo -u stemedb aws s3 sync \
s3://stemedb-backups-prod/wal-archive/ \
/var/lib/stemedb/wal-partial/ \
--region us-east-1
# Filter WAL segments by timestamp
# (Keep only segments last modified before the target; the mtime is
# formatted in UTC so it compares correctly with TARGET_TIMESTAMP)
for wal in /var/lib/stemedb/wal-partial/*.wal; do
  WAL_TS=$(date -u -d "@$(stat -c %Y "$wal")" +%Y-%m-%dT%H:%M:%SZ)
  if [[ "$WAL_TS" < "$TARGET_TIMESTAMP" ]]; then
    sudo -u stemedb cp "$wal" /var/lib/stemedb/wal/
  fi
done
# Start StemeDB (will replay filtered WAL)
sudo systemctl start stemedb-api
# Validate timestamp
LAST_ASSERTION_TS=$(curl -s http://localhost:18180/v1/query \
-H "Content-Type: application/json" \
-d '{"concept_path": "*", "lens": "recency", "limit": 1}' | \
jq -r '.assertions[0].timestamp')
echo "Last assertion timestamp: $LAST_ASSERTION_TS"
echo "Target timestamp: $TARGET_TIMESTAMP"
# Last assertion should be ≤ target
```
**Total time: ~2 hours**
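
The Step 3 filter selects segments by modification time. A variant that compares epoch seconds instead of formatted strings avoids any dependence on the host's timezone. This is a sketch assuming GNU `date` and `stat`; `segments_before` is a hypothetical helper, and it works at whole-segment granularity, so a segment modified after the target is skipped entirely.

```bash
# Usage: segments_before <dir> <target_iso8601>
# Prints the *.wal files in <dir> whose mtime is strictly before the target.
segments_before() {
  local dir=$1 target=$2 target_epoch f
  target_epoch=$(date -u -d "$target" +%s) || return 2
  for f in "$dir"/*.wal; do
    [[ -e $f ]] || continue          # no matches: the glob stays literal
    if (( $(stat -c %Y "$f") < target_epoch )); then
      echo "$f"
    fi
  done
}

# Example:
#   segments_before /var/lib/stemedb/wal-partial "$TARGET_TIMESTAMP" |
#     while read -r seg; do sudo -u stemedb cp "$seg" /var/lib/stemedb/wal/; done
```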
---
### §3. WAL-Only Recovery (RTO: 30 min, RPO: 0 min)
**Use case:** Database intact, only recent WAL lost (e.g., WAL disk failure).
**Step 1: Verify database is intact**
```bash
sudo systemctl stop stemedb-api
# Check DB directory
ls -lh /var/lib/stemedb/db/
# Should show: *.kv files, no corruption
# Check for errors
journalctl -u stemedb-api | tail -n100 | grep -i "db\|database\|storage"
# Should NOT show corruption errors
```
**Step 2: Download archived WAL**
```bash
# Download all archived WAL segments
# (no --delete: it would remove any surviving local segments that were never archived)
sudo -u stemedb aws s3 sync \
  s3://stemedb-backups-prod/wal-archive/ \
  /var/lib/stemedb/wal/ \
  --region us-east-1
# Verify download
ls -lh /var/lib/stemedb/wal/*.wal | wc -l
# Should show: N segments
```
**Step 3: Start and replay**
```bash
sudo systemctl start stemedb-api
# Monitor replay
sudo journalctl -u stemedb-api -f
# Expected:
# "Replayed 523 entries from WAL"
# "Startup complete"
# Validate
curl -s http://localhost:18180/v1/health | jq .assertion_count
# Should match expected count
```
**Total time: ~30 min**
---
## Validation Checklist
After any DR procedure, validate:
- [ ] **Server starts successfully**
```bash
systemctl status stemedb-api
# Active (running)
```
- [ ] **Health endpoint responds**
```bash
curl http://localhost:18180/v1/health
# Returns 200 OK
```
- [ ] **Assertion count correct**
```bash
# Compare to backup metadata or expected count
```
- [ ] **Queries work**
```bash
curl -X POST http://localhost:18180/v1/query \
-H "Content-Type: application/json" \
-d '{"concept_path": "test", "lens": "recency"}'
# Returns 200
```
- [ ] **Ingestion works**
```bash
# Test write
curl -X POST http://localhost:18180/v1/assert ... # 201 Created
```
- [ ] **Backups resume**
```bash
systemctl is-active stemedb-backup.timer # active
systemctl is-active stemedb-archive-wal.timer # active
```
- [ ] **Metrics exporting**
```bash
curl -s http://localhost:18180/metrics | grep stemedb_
# Shows metrics
```
- [ ] **Alerts firing correctly**
```bash
curl http://prometheus:9090/api/v1/alerts | jq .
# No backup alerts firing
```
- [ ] **DNS/LB updated**
```bash
nslookup stemedb.yourdomain.com
# Points to new IP (if changed)
```
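
Several items in the checklist above reduce to a command whose exit status signals success, so they can be run as a batch. The following is a minimal sketch; `run_check` and the `pass`/`fail` counters are hypothetical names, and the example commands are the ones already used in this runbook.

```bash
pass=0
fail=0

# Usage: run_check <label> <command...>
# Runs the command quietly and records whether it succeeded.
run_check() {
  local name=$1; shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS  $name"
    pass=$(( pass + 1 ))
  else
    echo "FAIL  $name"
    fail=$(( fail + 1 ))
  fi
}

# Example:
#   run_check "service active"   systemctl is-active --quiet stemedb-api
#   run_check "health endpoint"  curl -sf http://localhost:18180/v1/health
#   run_check "backup timer"     systemctl is-active --quiet stemedb-backup.timer
#   run_check "WAL archive timer" systemctl is-active --quiet stemedb-archive-wal.timer
#   echo "$pass passed, $fail failed"
```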
---
## RTO/RPO Metrics
| Scenario | RTO | RPO | Data Loss |
|----------|-----|-----|-----------|
| Full restore from S3 | 4h | 15min | Last 15min of WAL |
| Point-in-time restore | 2h | variable | Controlled (to target timestamp) |
| WAL-only recovery | 30min | 0min | None (if every WAL segment was archived before the failure) |
**Factors affecting RTO:**
- S3 download speed (network bandwidth)
- Backup size (larger = slower restore)
- Server provisioning time (cloud vs. bare metal)
- DNS/LB propagation delay
**Factors affecting RPO:**
- WAL archival frequency (default: 15 min)
- Last successful backup age (default: 6h intervals)
- Time of failure (worst case: just before backup)
---
## Post-DR Actions
**Immediate (within 1 hour):**
1. **Document incident**
   - Create incident report
   - Record timeline (failure time, detection time, recovery time)
   - Note RTO/RPO achieved vs. target
2. **Verify monitoring**
   - Check all alerts are firing correctly
   - Verify metrics are being collected
   - Test PagerDuty/Slack notifications
3. **Communicate status**
   - Notify stakeholders of recovery completion
   - Update status page
   - Send post-mortem invite
**Within 24 hours:**
1. **Root cause analysis**
   - Identify what caused failure
   - Determine if preventable
   - Create action items
2. **Test backups**
   - Verify next backup completes
   - Validate verification passes
   - Check S3 uploads working
3. **Review procedures**
   - Update runbook with lessons learned
   - Document any deviations from procedure
   - Propose improvements
**Within 1 week:**
1. **Conduct post-mortem**
   - Blameless review with team
   - Identify process improvements
   - Create corrective actions
2. **Update documentation**
   - Incorporate lessons learned
   - Update RTO/RPO estimates
   - Revise prerequisites
3. **Schedule DR drill**
   - Test procedure again (quarterly)
   - Validate improvements
   - Train new team members
---
## Common Pitfalls
### 1. Incomplete S3 sync
**Symptom:** Restore completes but assertion count too low.
**Cause:** S3 sync interrupted or incomplete.
**Fix:**
```bash
# Re-sync with --exact-timestamps
# (BACKUP is the backup directory name, e.g. the LATEST_BACKUP value from §1 Step 2)
sudo -u stemedb aws s3 sync \
s3://stemedb-backups-prod/${BACKUP} \
/var/backups/stemedb/${BACKUP} \
--exact-timestamps \
--region us-east-1
```
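
Before re-syncing, it can help to see which files are actually missing. A minimal sketch: `missing_files` is a hypothetical helper that reads the output of `aws s3 ls --recursive` (four columns: date, time, size, key) on stdin and compares each key with the local copy by existence and size. It does not verify checksums.

```bash
# Usage: aws s3 ls --recursive s3://<bucket>/<prefix>/ | missing_files <prefix>/ <local_dir>
# Prints one line per missing or wrongly sized file; fails if any are found.
missing_files() {
  local prefix=$1 dir=$2 d t size key rel rc=0
  while read -r d t size key; do
    [[ -n $key ]] || continue
    rel=${key#"$prefix"}             # key relative to the backup directory
    if [[ ! -f "$dir/$rel" ]]; then
      echo "missing: $rel"
      rc=1
    elif (( $(stat -c %s "$dir/$rel") != size )); then
      echo "size mismatch: $rel"
      rc=1
    fi
  done
  return "$rc"
}

# Example:
#   aws s3 ls --recursive "s3://stemedb-backups-prod/${BACKUP}/" |
#     missing_files "${BACKUP}/" "/var/backups/stemedb/${BACKUP}"
```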
### 2. WAL replay fails
**Symptom:** Server starts but assertion count wrong.
**Cause:** Corrupted WAL segment or version mismatch.
**Fix:**
```bash
# Check logs for specific segment
sudo journalctl -u stemedb-api | grep -i "wal.*error"
# If segment corrupted, skip it (accept data loss)
sudo mv /var/lib/stemedb/wal/segment-XXXXX.wal /tmp/
# Restart
sudo systemctl restart stemedb-api
```
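
Moving a suspect segment to `/tmp` risks losing it on reboot, which makes later analysis impossible. As an alternative, the sketch below moves it to a dedicated quarantine directory with a UTC timestamp suffix. `quarantine_segment` and the default directory `/var/lib/stemedb/wal-quarantine` are assumptions for illustration, not StemeDB conventions.

```bash
# Usage: quarantine_segment <segment_path> [quarantine_dir]
# Moves the segment aside and prints its new location.
quarantine_segment() {
  local segment=$1 qdir=${2:-/var/lib/stemedb/wal-quarantine}
  [[ -f $segment ]] || { echo "no such segment: $segment" >&2; return 1; }
  mkdir -p "$qdir" || return 1
  local dest
  dest="$qdir/$(basename "$segment").$(date -u +%Y%m%dT%H%M%SZ)"
  mv -- "$segment" "$dest" && echo "$dest"
}
```

Skipping a segment still means accepting the loss of whatever it contained; the quarantine copy only preserves the evidence.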
### 3. Permissions incorrect
**Symptom:** Server won't start, permission denied errors.
**Cause:** Restored files owned by wrong user.
**Fix:**
```bash
sudo chown -R stemedb:stemedb /var/lib/stemedb
# Directories become 755; data files are not made executable unless they already were
sudo chmod -R u=rwX,go=rX /var/lib/stemedb/wal /var/lib/stemedb/db
```
### 4. DNS not updated
**Symptom:** Clients can't connect to restored server.
**Cause:** DNS still pointing to old IP.
**Fix:**
```bash
# Update DNS record
# (method varies by DNS provider)
# Verify propagation
dig stemedb.yourdomain.com +short
# Should return new IP
```
---
## DR Drill Procedure
**Frequency:** Quarterly (every 90 days)
**Purpose:** Validate DR procedures, train team, measure RTO/RPO.
**Steps:**
1. **Schedule drill** (at least 1 week notice)
2. **Provision staging environment** (separate from prod)
3. **Execute DR procedure** (§1 Full Restore)
4. **Measure RTO/RPO achieved**
5. **Document results** (drill report)
6. **Review with team** (post-drill retro)
7. **Update runbook** (incorporate learnings)
**Drill report template:**
```markdown
# DR Drill Report - YYYY-MM-DD
## Summary
- Date: YYYY-MM-DD HH:MM UTC
- Participants: [names]
- Scenario: Full restore from S3
- Result: ✅ Success / ⚠️ Partial / ❌ Failed
## Metrics
- RTO Target: 4 hours
- RTO Achieved: X hours Y min
- RPO Target: 15 min
- RPO Achieved: X min
- Data Loss: X assertions (expected)
## Timeline
- HH:MM - Drill started
- HH:MM - Server provisioned
- HH:MM - Backup downloaded
- HH:MM - WAL downloaded
- HH:MM - Data restored
- HH:MM - Service started
- HH:MM - Validation complete
- HH:MM - Drill complete
## Issues Encountered
1. [Issue description]
   - Impact: [how it affected RTO]
   - Resolution: [how it was fixed]
   - Preventive action: [how to avoid next time]
## Lessons Learned
- [Lesson 1]
- [Lesson 2]
## Action Items
- [ ] [Action item 1] - Owner: [name] - Due: [date]
- [ ] [Action item 2] - Owner: [name] - Due: [date]
## Runbook Updates
- [Change 1: reason]
- [Change 2: reason]
```
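
The "RTO Achieved" line in the report template can be computed from the first and last timeline entries. A minimal sketch, assuming GNU `date` and ISO 8601 UTC timestamps; `elapsed_hm` and `within_rto` are hypothetical helper names.

```bash
# Usage: elapsed_hm <start_iso8601> <end_iso8601>
# Prints the elapsed time as "<hours>h <minutes>m".
elapsed_hm() {
  local start end secs
  start=$(date -u -d "$1" +%s) || return 2
  end=$(date -u -d "$2" +%s) || return 2
  secs=$(( end - start ))
  (( secs >= 0 )) || { echo "end precedes start" >&2; return 1; }
  echo "$(( secs / 3600 ))h $(( secs % 3600 / 60 ))m"
}

# Usage: within_rto <start_iso8601> <end_iso8601> <target_hours>
# Succeeds if the elapsed time does not exceed the target.
within_rto() {
  local start end
  start=$(date -u -d "$1" +%s) || return 2
  end=$(date -u -d "$2" +%s) || return 2
  (( end - start <= $3 * 3600 ))
}

# Example:
#   elapsed_hm 2026-02-11T10:00:00Z 2026-02-11T13:25:00Z
#   within_rto 2026-02-11T10:00:00Z 2026-02-11T13:25:00Z 4 && echo "RTO met"
```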
---
## Related Runbooks
- [Restore from Backup](./restore-from-backup.md) - Non-disaster restore scenarios
- [Server Won't Start](./server-wont-start.md) - Startup failures
- [Disk Full](./disk-full.md) - Storage management
---
## Last Updated
2026-02-12 (P5.3 Implementation)