stemedb/docs/operations/runbooks/disaster-recovery.md

# Runbook: Disaster Recovery

## Overview

**Purpose:** Restore StemeDB from backup after catastrophic failure.

**RTO (Recovery Time Objective):** 4 hours
**RPO (Recovery Point Objective):** 15 minutes

**Scope:** Complete server failure, data center outage, or regional disaster requiring restore from backups.

---

## When to Use This Runbook

Use this runbook for:

- **Complete server failure** - Hardware dead, cannot boot
- **Data center outage** - Entire DC offline, need to restore elsewhere
- **Disk failure** - Storage completely lost, no local recovery possible
- **Ransomware/corruption** - Data encrypted or corrupted, need clean restore
- **Regional disaster** - DR drill or actual disaster requiring failover

**Do NOT use for:**
- Single node failure in cluster → Use cluster failover instead
- WAL corruption → Use [Restore from Backup](./restore-from-backup.md) §2
- Index rebuild → Use [Restore from Backup](./restore-from-backup.md) §4

---

## Prerequisites

Before starting DR, ensure:

- [ ] **New server provisioned** (or existing server with clean disk)
- [ ] **S3 access configured** (credentials, network access to S3)
- [ ] **Dependencies installed** (Rust, PostgreSQL if using external stores)
- [ ] **Stakeholders notified** (team knows DR is in progress)
- [ ] **DNS/load balancer updated** (if changing server IP)

**Minimum server specs:**
- CPU: 4 cores
- RAM: 16GB
- Disk: 2x backup size (for restore + buffer)
- Network: 1Gbps (for S3 downloads)
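
A quick preflight check along these lines can confirm that a candidate server meets the minimums before the restore starts. This is only a sketch: `meets_minimum` and `BACKUP_SIZE_GB` are hypothetical names, not part of StemeDB, and the commented commands assume a Linux host with GNU coreutils.

```bash
#!/usr/bin/env bash
# Preflight sketch: compare measured resources against the minimum specs above.
# meets_minimum is a hypothetical helper, not a StemeDB tool.

# Usage: meets_minimum LABEL ACTUAL REQUIRED   (integers, same unit)
meets_minimum() {
    local label=$1 actual=$2 required=$3
    if (( actual >= required )); then
        echo "OK   $label: $actual (minimum $required)"
    else
        echo "FAIL $label: $actual (minimum $required)"
        return 1
    fi
}

# Example run on the target server (uncomment and set BACKUP_SIZE_GB):
# BACKUP_SIZE_GB=200   # hypothetical; size of the backup you plan to restore
# meets_minimum "CPU cores" "$(nproc)" 4
# # RAM is rounded to the nearest GB because MemTotal reads slightly below nominal
# meets_minimum "RAM (GB)" "$(awk '/MemTotal/ {printf "%d", $2/1024/1024 + 0.5}' /proc/meminfo)" 16
# meets_minimum "Free disk (GB)" "$(df --output=avail -BG /var/lib | tail -n1 | tr -dc '0-9')" "$((BACKUP_SIZE_GB * 2))"
```

Each check returns non-zero on failure, so the calls can be chained with `&&` to stop at the first shortfall.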

---

## Decision Tree

```
Disaster scenario
    │
    ├─► Complete restore needed?
    │   └─► §1 Full Restore from S3
    │
    ├─► Point-in-time restore needed?
    │   └─► §2 Point-in-Time Restore with WAL Replay
    │
    └─► Only recent data lost?
        └─► §3 WAL-Only Recovery
```

---

## Resolution Steps

### §1. Full Restore from S3 (RTO: 4 hours, RPO: 15 minutes)

**Use case:** Complete data loss, restore everything from S3.

**Step 1: Provision new server (30 min)**

```bash
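# Install dependencies (git, jq, rsync, and curl are used in later steps)
sudo apt update
sudo apt install -y awscli build-essential pkg-config libssl-dev postgresql-client git jq rsync curl

# Install Rust (-y accepts the default install without prompting)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
source "$HOME/.cargo/env"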

# Create stemedb user
sudo useradd -r -s /bin/bash -d /var/lib/stemedb -m stemedb

# Create data directories
sudo mkdir -p /var/lib/stemedb/{wal,db}
sudo chown -R stemedb:stemedb /var/lib/stemedb
```

**Step 2: Download latest full backup from S3 (60 min)**

```bash
# List available backups
aws s3 ls s3://stemedb-backups-prod/ | grep stemedb-backup

# Expected output:
#                            PRE stemedb-backup-20260211-060000/
#                            PRE stemedb-backup-20260211-120000/
#                            PRE stemedb-backup-20260211-180000/  ← Latest

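# Download latest full backup
LATEST_BACKUP=$(aws s3 ls s3://stemedb-backups-prod/ | grep stemedb-backup | tail -n1 | awk '{print $2}' | tr -d '/')
echo "Restoring from: ${LATEST_BACKUP:?no backup found}"

# Create the local backup directory (not created in Step 1)
sudo mkdir -p /var/backups/stemedb
sudo chown stemedb:stemedb /var/backups/stemedb

sudo -u stemedb aws s3 sync \
    "s3://stemedb-backups-prod/${LATEST_BACKUP}" \
    "/var/backups/stemedb/${LATEST_BACKUP}" \
    --region us-east-1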

# Verify download
ls -lh /var/backups/stemedb/${LATEST_BACKUP}/
# Should show: backup-metadata.json, wal/, db/

cat /var/backups/stemedb/${LATEST_BACKUP}/backup-metadata.json
# Verify timestamp, file counts
```
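
The one-liner that sets `LATEST_BACKUP` relies on the listing order and yields an empty string if nothing matches, in which case the sync source becomes the bucket root. A slightly more defensive selection might look like the sketch below. `pick_latest_backup` is a hypothetical helper, and it assumes the `stemedb-backup-YYYYMMDD-HHMMSS/` prefix format shown in the expected output, which sorts chronologically.

```bash
#!/usr/bin/env bash
# pick_latest_backup (hypothetical helper): read `aws s3 ls` output on stdin
# and print the newest stemedb-backup-* prefix, or fail if there is none.
pick_latest_backup() {
    local latest
    latest=$(awk '$1 == "PRE" && $2 ~ /^stemedb-backup-/ {print $2}' \
        | tr -d '/' | sort | tail -n1)
    if [ -z "$latest" ]; then
        echo "no stemedb-backup-* prefixes found" >&2
        return 1
    fi
    echo "$latest"
}

# Usage:
# LATEST_BACKUP=$(aws s3 ls s3://stemedb-backups-prod/ | pick_latest_backup) || exit 1
```

Sorting explicitly means the result does not depend on the order in which S3 returns prefixes.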

**Step 3: Download WAL segments since last backup (15 min)**

```bash
# Get backup timestamp
BACKUP_TIMESTAMP=$(jq -r .timestamp /var/backups/stemedb/${LATEST_BACKUP}/backup-metadata.json)
echo "Backup timestamp: $BACKUP_TIMESTAMP"

# Download archived WAL segments
# Note: this syncs the whole archive, not only segments newer than $BACKUP_TIMESTAMP
sudo -u stemedb mkdir -p /var/lib/stemedb/wal-archive
sudo -u stemedb aws s3 sync \
    s3://stemedb-backups-prod/wal-archive/ \
    /var/lib/stemedb/wal-archive/ \
    --region us-east-1

# Count segments
WAL_COUNT=$(find /var/lib/stemedb/wal-archive -name "*.wal" | wc -l)
echo "Downloaded $WAL_COUNT WAL segments"
```
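
The sync in this step pulls the whole archive, while `BACKUP_TIMESTAMP` is only echoed. To see which downloaded segments are actually newer than the backup, something like the following could be used. It is a sketch that assumes GNU `find` and that each file's local modification time reflects when the segment was archived; `wal_segments_after` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# wal_segments_after (hypothetical helper): list *.wal files under DIR whose
# modification time is later than SINCE (any timestamp GNU find can parse).
wal_segments_after() {
    local dir=$1 since=$2
    find "$dir" -name '*.wal' -newermt "$since" | sort
}

# Usage, with the variable set in this step:
# wal_segments_after /var/lib/stemedb/wal-archive "$BACKUP_TIMESTAMP" | wc -l
```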

**Step 4: Restore data directories (30 min)**

```bash
# Restore from backup
sudo -u stemedb rsync -av \
    /var/backups/stemedb/${LATEST_BACKUP}/wal/ \
    /var/lib/stemedb/wal/

sudo -u stemedb rsync -av \
    /var/backups/stemedb/${LATEST_BACKUP}/db/ \
    /var/lib/stemedb/db/

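# Copy archived WAL segments (top-level *.wal files only; the pattern is
# evaluated by rsync as the stemedb user, not by the invoking shell)
sudo -u stemedb rsync -av --include='*.wal' --exclude='*' \
    /var/lib/stemedb/wal-archive/ \
    /var/lib/stemedb/wal/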

# Verify restoration
du -sh /var/lib/stemedb/{wal,db}
# Should match backup sizes + WAL archive
```

**Step 5: Build and start StemeDB (30 min)**

```bash
# Clone repository
cd /opt
sudo git clone https://github.com/yourusername/stemedb.git
sudo chown -R stemedb:stemedb /opt/stemedb

# Build release binary
cd /opt/stemedb
sudo -u stemedb cargo build --release --bin stemedb-api

# Install systemd unit
sudo cp docs/operations/deployment/systemd/stemedb-api.service /etc/systemd/system/
sudo systemctl daemon-reload

# Configure environment
sudo tee /etc/default/stemedb <<ENV
STEMEDB_BIND_ADDR=0.0.0.0:18180
STEMEDB_WAL_DIR=/var/lib/stemedb/wal
STEMEDB_DB_DIR=/var/lib/stemedb/db
RUST_LOG=info
ENV

# Start StemeDB (will auto-replay WAL)
sudo systemctl start stemedb-api

# Monitor startup
sudo journalctl -u stemedb-api -f

# Expected logs:
# "Starting WAL recovery..."
# "Replayed 15234 entries from WAL"
# "Rebuilding indexes..."
# "Startup complete, listening on 0.0.0.0:18180"
```

**Step 6: Validate recovery (30 min)**

```bash
# Wait for startup to complete (watch journalctl)
# Then validate...

# Check health
curl http://localhost:18180/v1/health

# Expected:
# {
#   "status": "healthy",
#   "assertion_count": 105234,
#   "wal_segments": 47,
#   "uptime_seconds": 120
# }

# Verify assertion count matches expected
EXPECTED_COUNT=$(jq -r .assertion_count /var/backups/stemedb/${LATEST_BACKUP}/backup-metadata.json)
ACTUAL_COUNT=$(curl -s http://localhost:18180/v1/health | jq .assertion_count)

echo "Expected: $EXPECTED_COUNT"
echo "Actual: $ACTUAL_COUNT"
echo "Delta: $((ACTUAL_COUNT - EXPECTED_COUNT))"

# Delta should equal assertions from WAL replay
# (data added between backup and failure)

# Test query
curl -X POST http://localhost:18180/v1/query \
  -H "Content-Type: application/json" \
  -d '{
    "concept_path": "test/dr",
    "predicate": "recovered",
    "lens": "recency"
  }'

# Should return 200 (even if empty results)

# Test ingestion
curl -X POST http://localhost:18180/v1/assert \
  -H "Content-Type: application/json" \
  -d '{
    "concept_path": "test/dr_validation",
    "predicate": "restored",
    "value": true,
    "confidence": 1.0,
    "authority_tier": "expert"
  }'

# Should return 201 Created
```
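
The delta comparison above can be turned into a pass/fail check. The sketch below treats a negative delta as a failure, since the restored count should never be lower than the count recorded in the backup metadata. `check_assertion_delta` is a hypothetical helper, not a StemeDB command.

```bash
#!/usr/bin/env bash
# check_assertion_delta (hypothetical helper): succeed when the restored
# assertion count is at least the count recorded in the backup metadata.
check_assertion_delta() {
    local expected=$1 actual=$2
    local delta=$(( actual - expected ))
    if (( delta < 0 )); then
        echo "FAIL: $(( -delta )) assertions missing relative to backup"
        return 1
    fi
    echo "OK: $delta assertions recovered from WAL replay"
}

# Usage, with the variables from the validation step:
# check_assertion_delta "$EXPECTED_COUNT" "$ACTUAL_COUNT"
```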

**Step 7: Resume operations (60 min)**

```bash
# Update DNS (if IP changed)
# Point stemedb.yourdomain.com to new server IP

# Update load balancer (if using LB)
# Add new server to backend pool

# Enable backup automation
sudo systemctl enable stemedb-backup.timer
sudo systemctl start stemedb-backup.timer

sudo systemctl enable stemedb-archive-wal.timer
sudo systemctl start stemedb-archive-wal.timer

sudo systemctl enable stemedb-verify-backup.timer
sudo systemctl start stemedb-verify-backup.timer

# Verify timers
systemctl list-timers 'stemedb-*'

# Notify stakeholders
echo "StemeDB DR complete at $(date -u)" | mail -s "StemeDB DR Complete" oncall@yourcompany.com
```

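**Total time: ~4 hours 15 minutes at full step budgets (30 + 60 + 15 + 30 + 30 + 30 + 60 = 255 min); meeting the 4-hour RTO requires finishing at least 15 minutes under budget.**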

---

### §2. Point-in-Time Restore with WAL Replay (RTO: 2 hours, RPO: 15 min)

**Use case:** Restore to specific timestamp (e.g., before bad data ingestion).

**Step 1: Identify target timestamp**

```bash
# Determine when bad data was ingested
# (from logs, monitoring, or user reports)
TARGET_TIMESTAMP="2026-02-11T14:30:00Z"

# Find backup immediately before target
aws s3 ls s3://stemedb-backups-prod/ | grep stemedb-backup | \
  awk '{print $2}' | tr -d '/' | \
  while read backup; do
    BACKUP_TS=$(aws s3 cp s3://stemedb-backups-prod/${backup}/backup-metadata.json - | jq -r .timestamp)
    if [[ "$BACKUP_TS" < "$TARGET_TIMESTAMP" ]]; then
      echo "$backup ($BACKUP_TS)"
    fi
  done | tail -n1

# Use backup: stemedb-backup-20260211-120000 (2026-02-11T12:00:00Z)
```

**Step 2: Restore base backup**

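Follow §1 steps 1, 2, and 4, but use the identified backup instead of the latest. Skip §1 step 3 and the "Copy archived WAL segments" command in step 4: copying every archived segment into `/var/lib/stemedb/wal/` would replay past the target timestamp. Step 3 below copies only the segments needed.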

**Step 3: Replay WAL to target timestamp**

```bash
# Download archived WAL segments (this syncs the whole archive;
# the loop below selects the segments to replay)
sudo -u stemedb aws s3 sync \
    s3://stemedb-backups-prod/wal-archive/ \
    /var/lib/stemedb/wal-partial/ \
    --region us-east-1

# Filter WAL segments by timestamp
# (Keep only segments last modified before the target timestamp)
for wal in /var/lib/stemedb/wal-partial/*.wal; do
    # Format the mtime in UTC so it compares correctly with the "Z" target
    WAL_TS=$(date -u -d "@$(stat -c %Y "$wal")" +%Y-%m-%dT%H:%M:%SZ)
    if [[ "$WAL_TS" < "$TARGET_TIMESTAMP" ]]; then
        sudo -u stemedb cp "$wal" /var/lib/stemedb/wal/
    fi
done

# Start StemeDB (will replay filtered WAL)
sudo systemctl start stemedb-api

# Validate timestamp
LAST_ASSERTION_TS=$(curl -s http://localhost:18180/v1/query \
  -H "Content-Type: application/json" \
  -d '{"concept_path": "*", "lens": "recency", "limit": 1}' | \
  jq -r '.assertions[0].timestamp')

echo "Last assertion timestamp: $LAST_ASSERTION_TS"
echo "Target timestamp: $TARGET_TIMESTAMP"
# Last assertion should be ≤ target
```

**Total time: ~2 hours**

---

### §3. WAL-Only Recovery (RTO: 30 min, RPO: 0 min)

**Use case:** Database intact, only recent WAL lost (e.g., WAL disk failure).

**Step 1: Verify database is intact**

```bash
sudo systemctl stop stemedb-api

# Check DB directory
ls -lh /var/lib/stemedb/db/
# Should show: *.kv files, no corruption

# Check for errors (-w matches whole words, so "db" does not match "stemedb")
sudo journalctl -u stemedb-api -n 100 --no-pager | grep -iwE "db|database|storage"
# Should NOT show corruption errors
```

**Step 2: Download archived WAL**

```bash
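# Download all archived WAL segments
# WARNING: --delete removes local files that are not in the archive, including
# any surviving WAL segments not yet archived; copy the directory aside first if unsure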
sudo -u stemedb aws s3 sync \
    s3://stemedb-backups-prod/wal-archive/ \
    /var/lib/stemedb/wal/ \
    --region us-east-1 \
    --delete

# Verify download
ls -lh /var/lib/stemedb/wal/*.wal | wc -l
# Should show: N segments
```

**Step 3: Start and replay**

```bash
sudo systemctl start stemedb-api

# Monitor replay
sudo journalctl -u stemedb-api -f

# Expected:
# "Replayed 523 entries from WAL"
# "Startup complete"

# Validate
curl http://localhost:18180/v1/health | jq .assertion_count
# Should match expected count
```

**Total time: ~30 min**

---

## Validation Checklist

After any DR procedure, validate:

- [ ] **Server starts successfully**
  ```bash
  systemctl status stemedb-api
  # Active (running)
  ```

- [ ] **Health endpoint responds**
  ```bash
  curl http://localhost:18180/v1/health
  # Returns 200 OK
  ```

- [ ] **Assertion count correct**
  ```bash
  # Compare to backup metadata or expected count
  curl -s http://localhost:18180/v1/health | jq .assertion_count
  ```

- [ ] **Queries work**
  ```bash
  curl -X POST http://localhost:18180/v1/query \
    -H "Content-Type: application/json" \
    -d '{"concept_path": "test", "lens": "recency"}'
  # Returns 200
  ```

- [ ] **Ingestion works**
  ```bash
  # Test write
  curl -X POST http://localhost:18180/v1/assert ... # 201 Created
  ```

- [ ] **Backups resume**
  ```bash
  systemctl is-active stemedb-backup.timer  # active
  systemctl is-active stemedb-archive-wal.timer  # active
  ```

- [ ] **Metrics exporting**
  ```bash
  curl http://localhost:18180/metrics | grep stemedb_
  # Shows metrics
  ```

- [ ] **Alerts firing correctly**
  ```bash
  curl http://prometheus:9090/api/v1/alerts | jq .
  # No backup alerts firing
  ```

- [ ] **DNS/LB updated**
  ```bash
  nslookup stemedb.yourdomain.com
  # Points to new IP (if changed)
  ```

---

## RTO/RPO Metrics

| Scenario | RTO | RPO | Data Loss |
|----------|-----|-----|-----------|
| Full restore from S3 | 4h | 15min | Up to the last 15min of WAL |
| Point-in-time restore | 2h | variable | Controlled (to target timestamp) |
| WAL-only recovery | 30min | 0min | None (if WAL archived) |

**Factors affecting RTO:**
- S3 download speed (network bandwidth)
- Backup size (larger = slower restore)
- Server provisioning time (cloud vs. bare metal)
- DNS/LB propagation delay

**Factors affecting RPO:**
- WAL archival frequency (default: 15 min)
- Last successful backup age (default: 6h intervals)
- Time of failure (worst case: just before backup)
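
To estimate the RPO actually achieved for an incident, one option is to measure the gap between the failure time and the newest archived WAL segment. The sketch below assumes GNU `find` and that each file's modification time reflects when the segment was archived; `newest_wal_age_minutes` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# newest_wal_age_minutes (hypothetical helper): minutes between the newest
# *.wal file under DIR and a reference time in epoch seconds (default: now).
newest_wal_age_minutes() {
    local dir=$1 now=${2:-$(date +%s)}
    local newest
    # %T@ prints the mtime as epoch seconds with a fractional part
    newest=$(find "$dir" -name '*.wal' -printf '%T@\n' | sort -n | tail -n1)
    if [ -z "$newest" ]; then
        echo "no WAL segments in $dir" >&2
        return 1
    fi
    echo $(( (now - ${newest%.*}) / 60 ))
}

# Usage: pass the failure time as the reference point
# FAILURE_EPOCH=$(date -u -d '2026-02-11T14:29:00Z' +%s)   # hypothetical incident time
# newest_wal_age_minutes /var/lib/stemedb/wal-archive "$FAILURE_EPOCH"
```

A result above the 15-minute archival interval suggests that archiving had already stalled before the failure.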

---

## Post-DR Actions

**Immediate (within 1 hour):**

1. **Document incident**
   - Create incident report
   - Record timeline (failure time, detection time, recovery time)
   - Note RTO/RPO achieved vs. target

2. **Verify monitoring**
   - Check all alerts are firing correctly
   - Verify metrics are being collected
   - Test PagerDuty/Slack notifications

3. **Communicate status**
   - Notify stakeholders of recovery completion
   - Update status page
   - Send post-mortem invite

**Within 24 hours:**

1. **Root cause analysis**
   - Identify what caused failure
   - Determine if preventable
   - Create action items

2. **Test backups**
   - Verify next backup completes
   - Validate verification passes
   - Check S3 uploads working

3. **Review procedures**
   - Update runbook with lessons learned
   - Document any deviations from procedure
   - Propose improvements

**Within 1 week:**

1. **Conduct post-mortem**
   - Blameless review with team
   - Identify process improvements
   - Create corrective actions

2. **Update documentation**
   - Incorporate lessons learned
   - Update RTO/RPO estimates
   - Revise prerequisites

3. **Schedule DR drill**
   - Test procedure again (quarterly)
   - Validate improvements
   - Train new team members

---

## Common Pitfalls

### 1. Incomplete S3 sync

**Symptom:** Restore completes but assertion count too low.

**Cause:** S3 sync interrupted or incomplete.

**Fix:**
```bash
# Re-sync with --exact-timestamps
# (set BACKUP to the backup name first, e.g. BACKUP=$LATEST_BACKUP)
sudo -u stemedb aws s3 sync \
    s3://stemedb-backups-prod/${BACKUP} \
    /var/backups/stemedb/${BACKUP} \
    --exact-timestamps \
    --region us-east-1
```

### 2. WAL replay fails

**Symptom:** Server starts but assertion count wrong.

**Cause:** Corrupted WAL segment or version mismatch.

**Fix:**
```bash
# Check logs for specific segment
sudo journalctl -u stemedb-api | grep -i "wal.*error"

# If segment corrupted, skip it (accept data loss)
sudo mv /var/lib/stemedb/wal/segment-XXXXX.wal /tmp/

# Restart
sudo systemctl restart stemedb-api
```

### 3. Permissions incorrect

**Symptom:** Server won't start, permission denied errors.

**Cause:** Restored files owned by wrong user.

**Fix:**
```bash
sudo chown -R stemedb:stemedb /var/lib/stemedb
# u=rwX,go=rX gives directories 755 and data files 644 (no execute bit on files)
sudo chmod -R u=rwX,go=rX /var/lib/stemedb/wal
sudo chmod -R u=rwX,go=rX /var/lib/stemedb/db
```
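
Before or after applying the fix, it can help to list exactly which files have the wrong owner. A minimal sketch, where `not_owned_by` is a hypothetical helper name:

```bash
#!/usr/bin/env bash
# not_owned_by (hypothetical helper): print every path under DIR that is not
# owned by USER (a user name or numeric UID). No output means ownership is consistent.
not_owned_by() {
    local dir=$1 user=$2
    find "$dir" ! -user "$user" -print
}

# Usage on the restored server (run as root if directories are not world-readable):
# not_owned_by /var/lib/stemedb stemedb
```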

### 4. DNS not updated

**Symptom:** Clients can't connect to restored server.

**Cause:** DNS still pointing to old IP.

**Fix:**
```bash
# Update DNS record
# (method varies by DNS provider)

# Verify propagation
dig stemedb.yourdomain.com +short
# Should return new IP
```

---

## DR Drill Procedure

**Frequency:** Quarterly (every 90 days)

**Purpose:** Validate DR procedures, train team, measure RTO/RPO.

**Steps:**

1. **Schedule drill** (at least 1 week notice)
2. **Provision staging environment** (separate from prod)
3. **Execute DR procedure** (§1 Full Restore)
4. **Measure RTO/RPO achieved**
5. **Document results** (drill report)
6. **Review with team** (post-drill retro)
7. **Update runbook** (incorporate learnings)
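
For step 4, the achieved RTO can be computed from the drill's start and end timestamps. The following is a sketch assuming GNU `date`; `rto_report` is a hypothetical helper, and the target is passed in minutes (240 for the 4-hour RTO).

```bash
#!/usr/bin/env bash
# rto_report (hypothetical helper): print elapsed time between START and END
# (ISO-8601 timestamps) and return non-zero if it exceeds TARGET_MINUTES.
rto_report() {
    local start end target_min=$3 elapsed
    start=$(date -u -d "$1" +%s) || return 1
    end=$(date -u -d "$2" +%s) || return 1
    elapsed=$(( (end - start) / 60 ))
    printf 'RTO achieved: %dh %02dm (target %dh %02dm)\n' \
        $(( elapsed / 60 )) $(( elapsed % 60 )) \
        $(( target_min / 60 )) $(( target_min % 60 ))
    (( elapsed <= target_min ))
}

# Usage: record the drill start and the time validation finished
# rto_report 2026-02-11T09:00:00Z 2026-02-11T12:40:00Z 240
```

The printed line can be pasted into the Metrics section of the drill report.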

**Drill report template:**

```markdown
# DR Drill Report - YYYY-MM-DD

## Summary
- Date: YYYY-MM-DD HH:MM UTC
- Participants: [names]
- Scenario: Full restore from S3
- Result: ✅ Success / ⚠️ Partial / ❌ Failed

## Metrics
- RTO Target: 4 hours
- RTO Achieved: X hours Y min
- RPO Target: 15 min
- RPO Achieved: X min
- Data Loss: X assertions (expected)

## Timeline
- HH:MM - Drill started
- HH:MM - Server provisioned
- HH:MM - Backup downloaded
- HH:MM - WAL downloaded
- HH:MM - Data restored
- HH:MM - Service started
- HH:MM - Validation complete
- HH:MM - Drill complete

## Issues Encountered
1. [Issue description]
   - Impact: [how it affected RTO]
   - Resolution: [how it was fixed]
   - Preventive action: [how to avoid next time]

## Lessons Learned
- [Lesson 1]
- [Lesson 2]

## Action Items
- [ ] [Action item 1] - Owner: [name] - Due: [date]
- [ ] [Action item 2] - Owner: [name] - Due: [date]

## Runbook Updates
- [Change 1: reason]
- [Change 2: reason]
```

---

## Related Runbooks

- [Restore from Backup](./restore-from-backup.md) - Non-disaster restore scenarios
- [Server Won't Start](./server-wont-start.md) - Startup failures
- [Disk Full](./disk-full.md) - Storage management

---

## Last Updated

2026-02-12 (P5.3 Implementation)