stemedb/docs/operations/runbooks/disaster-recovery.md

Runbook: Disaster Recovery

Overview

Purpose: Restore StemeDB from backup after catastrophic failure.

RTO (Recovery Time Objective): 4 hours

RPO (Recovery Point Objective): 15 minutes

Scope: Complete server failure, data center outage, or regional disaster requiring restore from backups.


When to Use This Runbook

Use this runbook for:

  • Complete server failure - Hardware dead, cannot boot
  • Data center outage - Entire DC offline, need to restore elsewhere
  • Disk failure - Storage completely lost, no local recovery possible
  • Ransomware/corruption - Data encrypted or corrupted, need clean restore
  • Regional disaster - DR drill or actual disaster requiring failover

Do NOT use for:


Prerequisites

Before starting DR, ensure:

  • New server provisioned (or existing server with clean disk)
  • S3 access configured (credentials, network access to S3)
  • Dependencies installed (Rust, PostgreSQL if using external stores)
  • Stakeholders notified (team knows DR is in progress)
  • DNS/load balancer updated (if changing server IP)

Minimum server specs:

  • CPU: 4 cores
  • RAM: 16GB
  • Disk: 2x backup size (for restore + buffer)
  • Network: 1Gbps (for S3 downloads)
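
A preflight check along these lines can confirm that a candidate server meets the minimums before a restore starts. This is only a sketch and not part of the StemeDB tooling: the function names are hypothetical, the backup size is supplied by the operator, and the RAM threshold assumes decimal gigabytes compared against the kB figure in /proc/meminfo, so it may need adjusting for a given host.

```shell
#!/bin/sh
# Hypothetical preflight helpers (illustrative names, not StemeDB tooling).

# Disk needed for a restore: 2x the backup size (restore + buffer).
required_disk_kb() {
    echo $(( $1 * 2 ))
}

# check_min LABEL ACTUAL REQUIRED: print a verdict, return non-zero on failure.
check_min() {
    if [ "$2" -ge "$3" ]; then
        echo "OK   $1: $2 (need >= $3)"
    else
        echo "FAIL $1: $2 (need >= $3)"
        return 1
    fi
}

# preflight BACKUP_SIZE_KB DATA_DIR: check CPU, RAM, and free disk on this host.
# Runs in a subshell so its variables do not leak into the caller.
preflight() (
    backup_kb=$1
    data_dir=$2
    rc=0
    cores=$(nproc)
    mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
    free_kb=$(df -Pk "$data_dir" | awk 'NR==2 {print $4}')
    check_min "CPU cores" "$cores" 4 || rc=1
    # Assumption: "16GB" is read as 16,000,000 kB
    check_min "RAM (kB)" "$mem_kb" $(( 16 * 1000 * 1000 )) || rc=1
    check_min "Free disk (kB)" "$free_kb" "$(required_disk_kb "$backup_kb")" || rc=1
    exit "$rc"
)
```

For example, `preflight 52428800 /var/lib` would check the host against a 50 GiB backup.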

Decision Tree

Disaster scenario
    │
    ├─► Complete restore needed?
    │   └─► §1 Full Restore from S3
    │
    ├─► Point-in-time restore needed?
    │   └─► §2 Point-in-Time Restore with WAL Replay
    │
    └─► Only recent data lost?
        └─► §3 WAL-Only Recovery

Resolution Steps

§1. Full Restore from S3 (RTO: 4 hours, RPO: 15 minutes)

Use case: Complete data loss, restore everything from S3.

Step 1: Provision new server (30 min)

# Install dependencies
sudo apt update
sudo apt install -y awscli build-essential pkg-config libssl-dev postgresql-client curl git jq rsync

# Install Rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source $HOME/.cargo/env

# Create stemedb user
sudo useradd -r -s /bin/bash -d /var/lib/stemedb -m stemedb

# Create data and backup staging directories
sudo mkdir -p /var/lib/stemedb/{wal,db} /var/backups/stemedb
sudo chown -R stemedb:stemedb /var/lib/stemedb /var/backups/stemedb

Step 2: Download latest full backup from S3 (60 min)

# List available backups
aws s3 ls s3://stemedb-backups-prod/ | grep stemedb-backup

# Expected output:
#                            PRE stemedb-backup-20260211-060000/
#                            PRE stemedb-backup-20260211-120000/
#                            PRE stemedb-backup-20260211-180000/  ← Latest

# Download latest full backup
LATEST_BACKUP=$(aws s3 ls s3://stemedb-backups-prod/ | grep stemedb-backup | tail -n1 | awk '{print $2}' | tr -d '/')
sudo -u stemedb aws s3 sync \
    s3://stemedb-backups-prod/${LATEST_BACKUP} \
    /var/backups/stemedb/${LATEST_BACKUP} \
    --region us-east-1

# Verify download
ls -lh /var/backups/stemedb/${LATEST_BACKUP}/
# Should show: backup-metadata.json, wal/, db/

cat /var/backups/stemedb/${LATEST_BACKUP}/backup-metadata.json
# Verify timestamp, file counts
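
Because the backup names embed a YYYYMMDD-HHMMSS timestamp, sorting them as text also sorts them by time. The helper below is a sketch of selecting the latest backup without relying on the order in which `aws s3 ls` prints prefixes. The function name is hypothetical, and it assumes the listing format shown in the expected output above.

```shell
#!/bin/sh
# latest_backup: read `aws s3 ls` output on stdin and print the newest
# backup name (hypothetical helper; assumes "PRE <name>/" listing lines).
latest_backup() {
    awk '$1 == "PRE" && $2 ~ /^stemedb-backup-/ { sub("/$", "", $2); print $2 }' |
        sort | tail -n 1
}
```

Usage: `LATEST_BACKUP=$(aws s3 ls s3://stemedb-backups-prod/ | latest_backup)`.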

Step 3: Download WAL segments since last backup (15 min)

# Get backup timestamp
BACKUP_TIMESTAMP=$(jq -r .timestamp /var/backups/stemedb/${LATEST_BACKUP}/backup-metadata.json)
echo "Backup timestamp: $BACKUP_TIMESTAMP"

# Download archived WAL segments (this syncs the whole archive;
# BACKUP_TIMESTAMP above is for reference and is not applied as a filter)
sudo -u stemedb mkdir -p /var/lib/stemedb/wal-archive
sudo -u stemedb aws s3 sync \
    s3://stemedb-backups-prod/wal-archive/ \
    /var/lib/stemedb/wal-archive/ \
    --region us-east-1

# Count segments
WAL_COUNT=$(find /var/lib/stemedb/wal-archive -name "*.wal" | wc -l)
echo "Downloaded $WAL_COUNT WAL segments"

Step 4: Restore data directories (30 min)

# Restore from backup
sudo -u stemedb rsync -av \
    /var/backups/stemedb/${LATEST_BACKUP}/wal/ \
    /var/lib/stemedb/wal/

sudo -u stemedb rsync -av \
    /var/backups/stemedb/${LATEST_BACKUP}/db/ \
    /var/lib/stemedb/db/

# Copy archived WAL segments
sudo -u stemedb cp -r /var/lib/stemedb/wal-archive/*.wal /var/lib/stemedb/wal/

# Verify restoration
du -sh /var/lib/stemedb/{wal,db}
# Should match backup sizes + WAL archive

Step 5: Build and start StemeDB (30 min)

# Clone repository
cd /opt
sudo git clone https://github.com/yourusername/stemedb.git
sudo chown -R stemedb:stemedb /opt/stemedb

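# Build release binary
# (cargo must be on the stemedb user's PATH; Step 1 installs Rust for the
# invoking user, so install it for stemedb as well if cargo is not found)
cd /opt/stemedb
sudo -u stemedb cargo build --release --bin stemedb-api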

# Install systemd unit
sudo cp docs/operations/deployment/systemd/stemedb-api.service /etc/systemd/system/
sudo systemctl daemon-reload

# Configure environment
sudo tee /etc/default/stemedb <<ENV
STEMEDB_BIND_ADDR=0.0.0.0:18180
STEMEDB_WAL_DIR=/var/lib/stemedb/wal
STEMEDB_DB_DIR=/var/lib/stemedb/db
RUST_LOG=info
ENV

# Start StemeDB (will auto-replay WAL)
sudo systemctl start stemedb-api

# Monitor startup
sudo journalctl -u stemedb-api -f

# Expected logs:
# "Starting WAL recovery..."
# "Replayed 15234 entries from WAL"
# "Rebuilding indexes..."
# "Startup complete, listening on 0.0.0.0:18180"
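
Instead of watching the journal by hand, startup can be polled with a small retry loop. The following is a minimal sketch: `wait_until` is a hypothetical helper, and the attempt count and delay in the usage line are placeholders rather than tuned values.

```shell
#!/bin/sh
# wait_until ATTEMPTS DELAY_SECONDS COMMAND...
# Retry COMMAND until it succeeds or ATTEMPTS is exhausted (hypothetical helper).
wait_until() {
    attempts=$1
    delay=$2
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        i=$(( i + 1 ))
        # Sleep between attempts, but not after the final one
        if [ "$i" -le "$attempts" ]; then
            sleep "$delay"
        fi
    done
    return 1
}
```

Usage: `wait_until 60 5 curl -fsS http://localhost:18180/v1/health` polls the health endpoint every 5 seconds for up to 60 attempts. WAL replay time depends on data volume, so the limits should be sized accordingly.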

Step 6: Validate recovery (30 min)

# Wait for startup to complete (watch journalctl)
# Then validate...

# Check health
curl http://localhost:18180/v1/health

# Expected:
# {
#   "status": "healthy",
#   "assertion_count": 105234,
#   "wal_segments": 47,
#   "uptime_seconds": 120
# }

# Verify assertion count matches expected
EXPECTED_COUNT=$(jq -r .assertion_count /var/backups/stemedb/${LATEST_BACKUP}/backup-metadata.json)
ACTUAL_COUNT=$(curl -s http://localhost:18180/v1/health | jq .assertion_count)

echo "Expected: $EXPECTED_COUNT"
echo "Actual: $ACTUAL_COUNT"
echo "Delta: $((ACTUAL_COUNT - EXPECTED_COUNT))"

# Delta should equal assertions from WAL replay
# (data added between backup and failure)

# Test query
curl -X POST http://localhost:18180/v1/query \
  -H "Content-Type: application/json" \
  -d '{
    "concept_path": "test/dr",
    "predicate": "recovered",
    "lens": "recency"
  }'

# Should return 200 (even if empty results)

# Test ingestion
curl -X POST http://localhost:18180/v1/assert \
  -H "Content-Type: application/json" \
  -d '{
    "concept_path": "test/dr_validation",
    "predicate": "restored",
    "value": true,
    "confidence": 1.0,
    "authority_tier": "expert"
  }'

# Should return 201 Created
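
The delta check above can be made stricter. In shell arithmetic, a missing JSON field (which `jq` prints as `null`) is silently treated as 0, and a negative delta is easy to overlook. The sketch below rejects non-numeric counts and fails when the restored count is lower than the count recorded in the backup. Function names are hypothetical.

```shell
#!/bin/sh
# is_uint VALUE: succeed only for a non-empty string of digits.
is_uint() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;
    esac
}

# check_delta EXPECTED ACTUAL: ACTUAL must be at least the backup's count.
check_delta() {
    if ! is_uint "$1" || ! is_uint "$2"; then
        echo "ERROR: non-numeric count (expected='$1', actual='$2')"
        return 2
    fi
    delta=$(( $2 - $1 ))
    if [ "$delta" -lt 0 ]; then
        echo "FAIL: $(( -delta )) assertions missing relative to backup metadata"
        return 1
    fi
    echo "OK: $delta assertions recovered from WAL replay"
}
```

Usage: `check_delta "$EXPECTED_COUNT" "$ACTUAL_COUNT"`.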

Step 7: Resume operations (60 min)

# Update DNS (if IP changed)
# Point stemedb.yourdomain.com to new server IP

# Update load balancer (if using LB)
# Add new server to backend pool

# Enable backup automation
sudo systemctl enable stemedb-backup.timer
sudo systemctl start stemedb-backup.timer

sudo systemctl enable stemedb-archive-wal.timer
sudo systemctl start stemedb-archive-wal.timer

sudo systemctl enable stemedb-verify-backup.timer
sudo systemctl start stemedb-verify-backup.timer

# Verify timers
systemctl list-timers 'stemedb-*'

# Notify stakeholders
echo "StemeDB DR complete at $(date -u)" | mail -s "StemeDB DR Complete" oncall@yourcompany.com

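Total time: ~4 hours 15 minutes if every step takes its full estimate, which slightly exceeds the 4-hour RTO. The Step 2 and Step 3 downloads are independent of each other and can run in parallel to stay within the RTO.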
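
As a quick sanity check, the per-step estimates in §1 can be summed and compared against the 4-hour RTO. A minimal sketch (the helper name is illustrative):

```shell
#!/bin/sh
# sum_minutes N...: add up per-step estimates given in minutes.
sum_minutes() {
    total=0
    for m in "$@"; do
        total=$(( total + m ))
    done
    echo "$total"
}

RTO_MIN=240
# Step estimates from §1, Steps 1 through 7
TOTAL_MIN=$(sum_minutes 30 60 15 30 30 30 60)
echo "Estimated total: ${TOTAL_MIN} min ($(( TOTAL_MIN / 60 ))h $(( TOTAL_MIN % 60 ))m), RTO: ${RTO_MIN} min"
# prints: Estimated total: 255 min (4h 15m), RTO: 240 min
```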


§2. Point-in-Time Restore with WAL Replay (RTO: 2 hours, RPO: 15 min)

Use case: Restore to specific timestamp (e.g., before bad data ingestion).

Step 1: Identify target timestamp

# Determine when bad data was ingested
# (from logs, monitoring, or user reports)
TARGET_TIMESTAMP="2026-02-11T14:30:00Z"

# Find backup immediately before target
aws s3 ls s3://stemedb-backups-prod/ | grep stemedb-backup | \
  awk '{print $2}' | tr -d '/' | \
  while read backup; do
    BACKUP_TS=$(aws s3 cp s3://stemedb-backups-prod/${backup}/backup-metadata.json - | jq -r .timestamp)
    if [[ "$BACKUP_TS" < "$TARGET_TIMESTAMP" ]]; then
      echo "$backup ($BACKUP_TS)"
    fi
  done | tail -n1

# Use backup: stemedb-backup-20260211-120000 (2026-02-11T12:00:00Z)
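
The selection rule (the newest backup strictly before the target) can also be expressed without depending on listing order. The sketch below reads "NAME TIMESTAMP" pairs on stdin; `backup_before` is a hypothetical helper, and it assumes every timestamp uses the same UTC ISO 8601 format so that string comparison matches chronological order.

```shell
#!/bin/sh
# backup_before TARGET: read "NAME TIMESTAMP" lines on stdin and print the
# newest backup whose timestamp is strictly earlier than TARGET.
# Exits non-zero if no backup qualifies.
backup_before() {
    awk -v target="$1" '
        # Appending "" forces string comparison of the timestamps
        ($2 "") < (target "") && ($2 "") > best_ts { best_ts = $2; best = $1 }
        END { if (best != "") print best; else exit 1 }
    '
}
```

With the three example backups (06:00, 12:00, 18:00) and a target of 2026-02-11T14:30:00Z, this selects stemedb-backup-20260211-120000.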

Step 2: Restore base backup

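Follow §1 Steps 1, 2, and 4 with the identified backup in place of the latest one. Skip §1 Step 3 and the "Copy archived WAL segments" command in Step 4: copying the full archive into the WAL directory would replay entries newer than the target. The filtered segments are added in Step 3 below.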

Step 3: Replay WAL to target timestamp

# Download all WAL segments between backup and target
sudo -u stemedb aws s3 sync \
    s3://stemedb-backups-prod/wal-archive/ \
    /var/lib/stemedb/wal-partial/ \
    --region us-east-1

# Filter WAL segments by timestamp
# (Keep only segments whose file modification time is before the target timestamp)
for wal in /var/lib/stemedb/wal-partial/*.wal; do
    # Format the mtime in UTC so it compares correctly with TARGET_TIMESTAMP
    WAL_TS=$(date -u -d "@$(stat -c %Y "$wal")" +%Y-%m-%dT%H:%M:%SZ)
    if [[ "$WAL_TS" < "$TARGET_TIMESTAMP" ]]; then
        sudo -u stemedb cp "$wal" /var/lib/stemedb/wal/
    fi
done

# Start StemeDB (will replay filtered WAL)
sudo systemctl start stemedb-api

# Validate timestamp
LAST_ASSERTION_TS=$(curl -s http://localhost:18180/v1/query \
  -H "Content-Type: application/json" \
  -d '{"concept_path": "*", "lens": "recency", "limit": 1}' | \
  jq -r '.assertions[0].timestamp')

echo "Last assertion timestamp: $LAST_ASSERTION_TS"
echo "Target timestamp: $TARGET_TIMESTAMP"
# Last assertion should be ≤ target
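
The final comparison can be automated rather than checked by eye. This sketch assumes both timestamps are UTC ISO 8601 strings in the same format and precision, in which case string order matches time order. The helper name is hypothetical.

```shell
#!/bin/sh
# ts_not_after A B: succeed if timestamp A is less than or equal to B.
# Appending "" inside awk forces a string (not numeric) comparison.
ts_not_after() {
    awk -v a="$1" -v b="$2" 'BEGIN { exit !((a "") <= (b "")) }'
}
```

Usage: `ts_not_after "$LAST_ASSERTION_TS" "$TARGET_TIMESTAMP" || echo "Replay went past the target"`.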

Total time: ~2 hours


§3. WAL-Only Recovery (RTO: 30 min, RPO: 0 min)

Use case: Database intact, only recent WAL lost (e.g., WAL disk failure).

Step 1: Verify database is intact

sudo systemctl stop stemedb-api

# Check DB directory
ls -lh /var/lib/stemedb/db/
# Should show: *.kv files, no corruption

# Check for errors
journalctl -u stemedb-api | tail -n100 | grep -i "db\|database\|storage"
# Should NOT show corruption errors

Step 2: Download archived WAL

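# Download all archived WAL segments
# WARNING: --delete removes local files that are not in the archive, including
# any surviving WAL segments that were never archived. Omit it if local
# segments may still be needed.
sudo -u stemedb aws s3 sync \
    s3://stemedb-backups-prod/wal-archive/ \
    /var/lib/stemedb/wal/ \
    --region us-east-1 \
    --delete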

# Verify download
ls -lh /var/lib/stemedb/wal/*.wal | wc -l
# Should show: N segments

Step 3: Start and replay

sudo systemctl start stemedb-api

# Monitor replay
sudo journalctl -u stemedb-api -f

# Expected:
# "Replayed 523 entries from WAL"
# "Startup complete"

# Validate
curl -s http://localhost:18180/v1/health | jq .assertion_count
# Should match expected count

Total time: ~30 min


Validation Checklist

After any DR procedure, validate:

  • Server starts successfully

    systemctl status stemedb-api
    # Active (running)
    
  • Health endpoint responds

    curl http://localhost:18180/v1/health
    # Returns 200 OK
    
  • Assertion count correct

    # Compare to backup metadata or expected count
    
  • Queries work

    curl -X POST http://localhost:18180/v1/query \
      -H "Content-Type: application/json" \
      -d '{"concept_path": "test", "lens": "recency"}'
    # Returns 200
    
  • Ingestion works

    # Test write
    curl -X POST http://localhost:18180/v1/assert ... # 201 Created
    
  • Backups resume

    systemctl is-active stemedb-backup.timer  # active
    systemctl is-active stemedb-archive-wal.timer  # active
    
  • Metrics exporting

    curl http://localhost:18180/metrics | grep stemedb_
    # Shows metrics
    
  • Alerts in expected state

    curl http://prometheus:9090/api/v1/alerts | jq .
    # No backup alerts firing
    
  • DNS/LB updated

    nslookup stemedb.yourdomain.com
    # Points to new IP (if changed)
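
The checklist can be run as a script that records each result and exits non-zero if anything fails. The sketch below shows one possible shape. `run_check` and `summary` are hypothetical helpers, and the example checks in the comments reuse the commands from the list above.

```shell
#!/bin/sh
PASSED=0
FAILED=0

# run_check LABEL COMMAND...: run COMMAND quietly and record the outcome.
run_check() {
    label=$1
    shift
    if "$@" >/dev/null 2>&1; then
        PASSED=$(( PASSED + 1 ))
        echo "[PASS] $label"
    else
        FAILED=$(( FAILED + 1 ))
        echo "[FAIL] $label"
    fi
}

# summary: print totals; returns non-zero if any check failed.
summary() {
    echo "$PASSED passed, $FAILED failed"
    [ "$FAILED" -eq 0 ]
}

# Example checks (uncomment on the restored server):
# run_check "service active"  systemctl is-active --quiet stemedb-api
# run_check "health endpoint" curl -fsS http://localhost:18180/v1/health
# run_check "backup timer"    systemctl is-active --quiet stemedb-backup.timer
# run_check "WAL archive timer" systemctl is-active --quiet stemedb-archive-wal.timer
# summary
```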
    

RTO/RPO Metrics

| Scenario | RTO | RPO | Data Loss |
|---|---|---|---|
| Full restore from S3 | 4h | 15min | Last 15min of WAL |
| Point-in-time restore | 2h | variable | Controlled (to target timestamp) |
| WAL-only recovery | 30min | 0min | None (if WAL archived) |

Factors affecting RTO:

  • S3 download speed (network bandwidth)
  • Backup size (larger = slower restore)
  • Server provisioning time (cloud vs. bare metal)
  • DNS/LB propagation delay

Factors affecting RPO:

  • WAL archival frequency (default: 15 min)
  • Last successful backup age (default: 6h intervals)
  • Time of failure (worst case: just before backup)

Post-DR Actions

Immediate (within 1 hour):

  1. Document incident

    • Create incident report
    • Record timeline (failure time, detection time, recovery time)
    • Note RTO/RPO achieved vs. target
  2. Verify monitoring

    • Check all alerts are firing correctly
    • Verify metrics are being collected
    • Test PagerDuty/Slack notifications
  3. Communicate status

    • Notify stakeholders of recovery completion
    • Update status page
    • Send post-mortem invite

Within 24 hours:

  1. Root cause analysis

    • Identify what caused failure
    • Determine if preventable
    • Create action items
  2. Test backups

    • Verify next backup completes
    • Validate verification passes
    • Check S3 uploads working
  3. Review procedures

    • Update runbook with lessons learned
    • Document any deviations from procedure
    • Propose improvements

Within 1 week:

  1. Conduct post-mortem

    • Blameless review with team
    • Identify process improvements
    • Create corrective actions
  2. Update documentation

    • Incorporate lessons learned
    • Update RTO/RPO estimates
    • Revise prerequisites
  3. Schedule DR drill

    • Test procedure again (quarterly)
    • Validate improvements
    • Train new team members

Common Pitfalls

1. Incomplete S3 sync

Symptom: Restore completes but assertion count too low.

Cause: S3 sync interrupted or incomplete.

Fix:

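# Re-sync with --exact-timestamps
# (BACKUP is the name of the backup being restored)
BACKUP="${LATEST_BACKUP}"  # or the backup chosen in §2
sudo -u stemedb aws s3 sync \
    s3://stemedb-backups-prod/${BACKUP} \
    /var/backups/stemedb/${BACKUP} \
    --exact-timestamps \
    --region us-east-1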
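
An incomplete sync can be caught before the restore by comparing the number of local files with the count recorded at backup time. The sketch below assumes the expected count is known, for example from a field in backup-metadata.json. The name and meaning of that field are assumptions here, as are the helper names.

```shell
#!/bin/sh
# count_files DIR: number of regular files under DIR (recursive).
count_files() {
    find "$1" -type f | wc -l | tr -d ' '
}

# verify_file_count DIR EXPECTED: fail if the local copy has a different count.
verify_file_count() {
    actual=$(count_files "$1")
    if [ "$actual" -eq "$2" ]; then
        echo "OK: $actual files"
    else
        echo "MISMATCH: found $actual files, expected $2"
        return 1
    fi
}
```

Usage: `verify_file_count "/var/backups/stemedb/${BACKUP}" "$EXPECTED_FILES"`, where `EXPECTED_FILES` is read from the backup metadata.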

2. WAL replay fails

Symptom: Server starts but assertion count wrong.

Cause: Corrupted WAL segment or version mismatch.

Fix:

# Check logs for specific segment
sudo journalctl -u stemedb-api | grep -i "wal.*error"

# If segment corrupted, skip it (accept data loss)
sudo mv /var/lib/stemedb/wal/segment-XXXXX.wal /tmp/

# Restart
sudo systemctl restart stemedb-api

3. Permissions incorrect

Symptom: Server won't start, permission denied errors.

Cause: Restored files owned by wrong user.

Fix:

sudo chown -R stemedb:stemedb /var/lib/stemedb
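# rwX: directories (and files that are already executable) get execute;
# plain data files are left non-executable
sudo chmod -R u=rwX,go=rX /var/lib/stemedb/wal
sudo chmod -R u=rwX,go=rX /var/lib/stemedb/db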

4. DNS not updated

Symptom: Clients can't connect to restored server.

Cause: DNS still pointing to old IP.

Fix:

# Update DNS record
# (method varies by DNS provider)

# Verify propagation
dig stemedb.yourdomain.com +short
# Should return new IP

DR Drill Procedure

Frequency: Quarterly (every 90 days)

Purpose: Validate DR procedures, train team, measure RTO/RPO.

Steps:

  1. Schedule drill (at least 1 week notice)
  2. Provision staging environment (separate from prod)
  3. Execute DR procedure (§1 Full Restore)
  4. Measure RTO/RPO achieved
  5. Document results (drill report)
  6. Review with team (post-drill retro)
  7. Update runbook (incorporate learnings)

Drill report template:

# DR Drill Report - YYYY-MM-DD

## Summary
- Date: YYYY-MM-DD HH:MM UTC
- Participants: [names]
- Scenario: Full restore from S3
- Result: ✅ Success / ⚠️ Partial / ❌ Failed

## Metrics
- RTO Target: 4 hours
- RTO Achieved: X hours Y min
- RPO Target: 15 min
- RPO Achieved: X min
- Data Loss: X assertions (expected)

## Timeline
- HH:MM - Drill started
- HH:MM - Server provisioned
- HH:MM - Backup downloaded
- HH:MM - WAL downloaded
- HH:MM - Data restored
- HH:MM - Service started
- HH:MM - Validation complete
- HH:MM - Drill complete

## Issues Encountered
1. [Issue description]
   - Impact: [how it affected RTO]
   - Resolution: [how it was fixed]
   - Preventive action: [how to avoid next time]

## Lessons Learned
- [Lesson 1]
- [Lesson 2]

## Action Items
- [ ] [Action item 1] - Owner: [name] - Due: [date]
- [ ] [Action item 2] - Owner: [name] - Due: [date]

## Runbook Updates
- [Change 1: reason]
- [Change 2: reason]
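
To fill in "RTO Achieved" from the timeline, the elapsed time between two HH:MM entries can be computed as below. This is a sketch with hypothetical helper names. It assumes both times are in UTC and that the drill crosses midnight at most once.

```shell
#!/bin/sh
# elapsed_minutes START END: minutes between two HH:MM times.
# If END is earlier than START, one midnight crossing is assumed.
elapsed_minutes() {
    awk -v s="$1" -v e="$2" 'BEGIN {
        split(s, a, ":"); split(e, b, ":")
        d = (b[1] * 60 + b[2]) - (a[1] * 60 + a[2])
        if (d < 0) d += 1440
        print d
    }'
}

# format_duration MINUTES: render as "X hours Y min", as in the template.
format_duration() {
    echo "$(( $1 / 60 )) hours $(( $1 % 60 )) min"
}
```

For example, `format_duration "$(elapsed_minutes 09:00 12:45)"` prints `3 hours 45 min`.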


Last Updated

2026-02-12 (P5.3 Implementation)