# Runbook: Disaster Recovery ## Overview **Purpose:** Restore StemeDB from backup after catastrophic failure. **RTO (Recovery Time Objective):** 4 hours **RPO (Recovery Point Objective):** 15 minutes **Scope:** Complete server failure, data center outage, or regional disaster requiring restore from backups. --- ## When to Use This Runbook Use this runbook for: - **Complete server failure** - Hardware dead, cannot boot - **Data center outage** - Entire DC offline, need to restore elsewhere - **Disk failure** - Storage completely lost, no local recovery possible - **Ransomware/corruption** - Data encrypted or corrupted, need clean restore - **Regional disaster** - DR drill or actual disaster requiring failover **Do NOT use for:** - Single node failure in cluster → Use cluster failover instead - WAL corruption → Use [Restore from Backup](./restore-from-backup.md) §2 - Index rebuild → Use [Restore from Backup](./restore-from-backup.md) §4 --- ## Prerequisites Before starting DR, ensure: - [ ] **New server provisioned** (or existing server with clean disk) - [ ] **S3 access configured** (credentials, network access to S3) - [ ] **Dependencies installed** (Rust, PostgreSQL if using external stores) - [ ] **Stakeholders notified** (team knows DR is in progress) - [ ] **DNS/load balancer updated** (if changing server IP) **Minimum server specs:** - CPU: 4 cores - RAM: 16GB - Disk: 2x backup size (for restore + buffer) - Network: 1Gbps (for S3 downloads) --- ## Decision Tree ``` Disaster scenario │ ├─► Complete restore needed? │ └─► §1 Full Restore from S3 │ ├─► Point-in-time restore needed? │ └─► §2 Point-in-Time Restore with WAL Replay │ └─► Only recent data lost? └─► §3 WAL-Only Recovery ``` --- ## Resolution Steps ### §1. Full Restore from S3 (RTO: 4 hours, RPO: 15 minutes) **Use case:** Complete data loss, restore everything from S3. **Step 1: Provision new server (30 min)** ```bash # Install dependencies sudo apt update sudo apt install -y awscli build-essential pkg-config libssl-dev postgresql-client # Install Rust curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh source $HOME/.cargo/env # Create stemedb user sudo useradd -r -s /bin/bash -d /var/lib/stemedb -m stemedb # Create data directories sudo mkdir -p /var/lib/stemedb/{wal,db} sudo chown -R stemedb:stemedb /var/lib/stemedb ``` **Step 2: Download latest full backup from S3 (60 min)** ```bash # List available backups aws s3 ls s3://stemedb-backups-prod/ | grep stemedb-backup # Expected output: # PRE stemedb-backup-20260211-060000/ # PRE stemedb-backup-20260211-120000/ # PRE stemedb-backup-20260211-180000/ ← Latest # Download latest full backup LATEST_BACKUP=$(aws s3 ls s3://stemedb-backups-prod/ | grep stemedb-backup | tail -n1 | awk '{print $2}' | tr -d '/') sudo -u stemedb aws s3 sync \ s3://stemedb-backups-prod/${LATEST_BACKUP} \ /var/backups/stemedb/${LATEST_BACKUP} \ --region us-east-1 # Verify download ls -lh /var/backups/stemedb/${LATEST_BACKUP}/ # Should show: backup-metadata.json, wal/, db/ cat /var/backups/stemedb/${LATEST_BACKUP}/backup-metadata.json # Verify timestamp, file counts ``` **Step 3: Download WAL segments since last backup (15 min)** ```bash # Get backup timestamp BACKUP_TIMESTAMP=$(jq -r .timestamp /var/backups/stemedb/${LATEST_BACKUP}/backup-metadata.json) echo "Backup timestamp: $BACKUP_TIMESTAMP" # Download WAL segments archived after backup sudo -u stemedb mkdir -p /var/lib/stemedb/wal-archive sudo -u stemedb aws s3 sync \ s3://stemedb-backups-prod/wal-archive/ \ /var/lib/stemedb/wal-archive/ \ --region us-east-1 # Count segments WAL_COUNT=$(find /var/lib/stemedb/wal-archive -name "*.wal" | wc -l) echo "Downloaded $WAL_COUNT WAL segments" ``` **Step 4: Restore data directories (30 min)** ```bash # Restore from backup sudo -u stemedb rsync -av \ /var/backups/stemedb/${LATEST_BACKUP}/wal/ \ /var/lib/stemedb/wal/ sudo -u stemedb rsync -av \ /var/backups/stemedb/${LATEST_BACKUP}/db/ \ /var/lib/stemedb/db/ # Copy archived WAL segments sudo -u stemedb cp -r /var/lib/stemedb/wal-archive/*.wal /var/lib/stemedb/wal/ # Verify restoration du -sh /var/lib/stemedb/{wal,db} # Should match backup sizes + WAL archive ``` **Step 5: Build and start StemeDB (30 min)** ```bash # Clone repository cd /opt sudo git clone https://github.com/yourusername/stemedb.git sudo chown -R stemedb:stemedb /opt/stemedb # Build release binary cd /opt/stemedb sudo -u stemedb cargo build --release --bin stemedb-api # Install systemd unit sudo cp docs/operations/deployment/systemd/stemedb-api.service /etc/systemd/system/ sudo systemctl daemon-reload # Configure environment sudo tee /etc/default/stemedb <