# Runbook: Restore from Backup ## Symptom - Data loss after hardware failure, corruption, or operator error - WAL corruption preventing server startup - Need to rollback to known-good state - Assertion count doesn't match expected values - Database inconsistency detected **Metrics Alerts:** - N/A (typically discovered during incident response) --- ## Quick Diagnosis ``` Need to restore │ ├─► Data loss (hardware failure, operator error)? │ └─► §1 Complete Restore │ ├─► WAL corruption on startup? │ └─► §2 WAL-Only Restore │ ├─► Need to rollback to specific point in time? │ └─► §3 Point-in-Time Restore │ └─► Database inconsistency (assertion count mismatch)? └─► §4 Validation and Rebuild ``` --- ## Common Causes 1. **Hardware failure** — Likelihood: **30%** - Disk failure - Power loss during write - Network storage disconnection 2. **WAL corruption** — Likelihood: **25%** - Unclean shutdown (OOM kill, crash) - Disk corruption - Version mismatch after upgrade 3. **Operator error** — Likelihood: **20%** - Accidentally deleted data directory - Wrong command executed - Misconfigured deployment 4. **Software bug** — Likelihood: **15%** - Database corruption bug - Index inconsistency - Replication failure (cluster) 5. **Disaster recovery test** — Likelihood: **10%** - Scheduled DR validation - Migration to new infrastructure --- ## Prerequisites **Before starting restore:** - [ ] **Backup available:** ```bash ls -lh backups/ # Should show: stemedb-backup-YYYYMMDD-HHMMSS/ ``` - [ ] **Backup metadata valid:** ```bash cat backups/stemedb-backup-*/metadata.json # Should show: version, timestamp, assertion_count ``` - [ ] **Server stopped:** ```bash sudo systemctl stop stemedb-api sudo systemctl status stemedb-api # Should show: inactive (dead) ``` - [ ] **Disk space available:** ```bash df -h # Need: 2x backup size available ``` --- ## Resolution Steps ### §1. Complete Restore (Full Recovery) **Use case:** Data loss, complete restoration needed **Diagnostic:** ```bash # Verify backup integrity BACKUP_DIR="backups/stemedb-backup-20260211-100000" # Replace with your backup # Check metadata cat $BACKUP_DIR/metadata.json # Expected output: # { # "version": "0.1.0", # "timestamp": "2026-02-11T10:00:00Z", # "assertion_count": 10234, # "wal_segment_count": 15, # "backup_type": "full" # } # Check directory structure ls -lh $BACKUP_DIR/ # Should show: wal/ db/ metadata.json ``` **Resolution: Use restore script** ```bash # Run restore script (safe - renames existing dirs, never deletes) sudo ./scripts/restore-stemedb.sh $BACKUP_DIR # Expected output: # Stopping StemeDB API service... # Renaming existing data/wal to data/wal.backup.20260211-103045 # Renaming existing data/db to data/db.backup.20260211-103045 # Copying WAL from backup... # Copying DB from backup... # Copying metadata... # Restore complete. Starting StemeDB API service... # StemeDB API service started successfully. ``` **Validate restore:** ```bash # Check health endpoint curl http://localhost:18180/v1/health # Expected output: # { # "status": "healthy", # "version": "0.1.0", # "uptime_seconds": 5, # "assertion_count": 10234 # Should match backup metadata # } # Verify metadata matches cat data/metadata.json # Should match backup metadata.json # Test query curl -X POST http://localhost:18180/v1/query \ -H "Content-Type: application/json" \ -d '{"concept_path": "test/restore", "lens": "recency"}' # Should return 200 (even if empty results) ``` **If failed:** Health check shows different assertion_count → See §4 Validation and Rebuild. --- ### §2. WAL-Only Restore (Preserve Database) **Use case:** WAL corrupted but database intact ⚠️ **WARNING:** This preserves existing database but replaces WAL. Only use if confident database is uncorrupted. **Diagnostic:** ```bash # Check for WAL errors journalctl -u stemedb-api -n 50 | grep -i wal # Common errors indicating WAL corruption: # - "WAL magic byte validation failed" # - "Checksum mismatch in WAL segment" # - "Failed to recover WAL" # Verify database is intact ls -lh data/db/ # Should show: *.kv files, indexes, no corruption messages ``` **Resolution: Manual WAL replacement** ```bash # Stop server sudo systemctl stop stemedb-api # Backup corrupted WAL for forensics sudo mv data/wal data/wal.corrupted.$(date +%Y%m%d-%H%M%S) # Restore WAL from backup BACKUP_DIR="backups/stemedb-backup-20260211-100000" sudo cp -r $BACKUP_DIR/wal data/wal # Set correct permissions sudo chown -R stemedb:stemedb data/wal/ sudo chmod -R 755 data/wal/ # Start server (will replay WAL and rebuild indexes) sudo systemctl start stemedb-api # Monitor startup journalctl -u stemedb-api -f # Expected logs: # "Starting WAL recovery..." # "Replayed 1523 entries from WAL" # "Rebuilding indexes..." # "Startup complete" ``` **Validate WAL recovery:** ```bash # Check health curl http://localhost:18180/v1/health # Check metrics for WAL operations curl http://localhost:18180/metrics | grep wal_ # Should show: # wal_segments_total{...} 15 # wal_fsync_latency_seconds{...} <0.1 ``` **If failed:** Server still won't start with restored WAL → Perform complete restore (§1). --- ### §3. Point-in-Time Restore **Use case:** Rollback to specific timestamp (e.g., before bad data ingestion) ⚠️ **NOTE:** StemeDB is append-only, so this is "restore + filter" not true PITR. **Diagnostic:** ```bash # Identify when bad data was ingested curl http://localhost:18180/v1/query \ -H "Content-Type: application/json" \ -d '{"concept_path": "bad/data/path", "lens": "recency"}' | jq '.assertions[0].timestamp' # Find backup before this timestamp ls -lh backups/ | grep "before-timestamp" ``` **Resolution: Restore + retraction** ```bash # Step 1: Restore from backup before bad data sudo ./scripts/restore-stemedb.sh backups/stemedb-backup-20260210-230000 # Step 2: Start server sudo systemctl start stemedb-api # Step 3: If bad data source is known, retract it curl -X POST http://localhost:18180/v1/retract \ -H "Content-Type: application/json" \ -d '{ "concept_path": "source/bad_source", "reason": "data_quality_issue", "cascade": true }' # This marks source and all dependent assertions as retracted ``` **Validate rollback:** ```bash # Check assertion count curl http://localhost:18180/v1/health | jq '.assertion_count' # Should be less than current (rolled back) # Verify bad data is gone curl -X POST http://localhost:18180/v1/query \ -H "Content-Type: application/json" \ -d '{"concept_path": "bad/data/path", "lens": "recency"}' # Should return empty or show retracted status ``` **If failed:** Bad data still present → May need to filter WAL before replay (requires engineering support). --- ### §4. Validation and Rebuild **Use case:** Inconsistency detected, indexes corrupted **Diagnostic:** ```bash # Check health assertion_count vs expected curl http://localhost:18180/v1/health | jq '.assertion_count' HEALTH_COUNT=10234 cat data/metadata.json | jq '.assertion_count' METADATA_COUNT=10500 # If mismatch → Inconsistency detected # Check for index errors journalctl -u stemedb-api | grep -i "index" ``` **Resolution: Rebuild indexes from WAL** ```bash # Stop server sudo systemctl stop stemedb-api # Backup existing database sudo cp -r data/db data/db.backup.$(date +%Y%m%d-%H%M%S) # Remove indexes (will be rebuilt on startup) sudo rm -rf data/db/indexes/ # Start server (triggers full index rebuild) sudo systemctl start stemedb-api # Monitor rebuild progress journalctl -u stemedb-api -f # Expected logs: # "Index rebuild started..." # "Rebuilding predicate index from 10234 assertions..." # "Rebuilding concept index..." # "Index rebuild complete in 3.4s" ``` **Validate rebuild:** ```bash # Check health curl http://localhost:18180/v1/health # Verify assertion_count matches metadata HEALTH_COUNT=$(curl -s http://localhost:18180/v1/health | jq '.assertion_count') METADATA_COUNT=$(cat data/metadata.json | jq '.assertion_count') echo "Health: $HEALTH_COUNT, Metadata: $METADATA_COUNT" # Should match # Test query curl -X POST http://localhost:18180/v1/query \ -H "Content-Type: application/json" \ -d '{"concept_path": "test/validation", "lens": "recency"}' # Should return 200 with results ``` **If failed:** Rebuild fails or counts still mismatch → Perform complete restore (§1) from known-good backup. --- ## Validation After any restore procedure, validate system health: - [ ] **Server starts successfully** ```bash systemctl status stemedb-api # Should show: active (running) ``` - [ ] **Health endpoint returns correct count** ```bash curl http://localhost:18180/v1/health | jq '.assertion_count' # Should match backup metadata.json ``` - [ ] **Queries succeed** ```bash curl -X POST http://localhost:18180/v1/query \ -H "Content-Type: application/json" \ -d '{"concept_path": "test/restore", "lens": "recency"}' # Should return 200 ``` - [ ] **Ingest works** ```bash curl -X POST http://localhost:18180/v1/assert \ -H "Content-Type: application/json" \ -d '{ "concept_path": "test/restore_validation", "predicate": "restored", "value": true, "confidence": 0.95 }' # Should return 201 Created ``` - [ ] **Metrics are valid** ```bash curl http://localhost:18180/metrics | grep stemedb_ # Should show all metrics with reasonable values ``` - [ ] **Dashboard loads** - Open http://localhost:18188/ - Should show current assertion count - No errors in browser console --- ## Backup Script Reference **Script location:** `/home/jml/Workspace/stemedb/scripts/backup-stemedb.sh` **Usage:** ```bash # Manual backup sudo ./scripts/backup-stemedb.sh # Scheduled backup (cron) 0 2 * * * /path/to/backup-stemedb.sh >> /var/log/stemedb-backup.log 2>&1 ``` **Backup structure:** ``` backups/stemedb-backup-20260211-100000/ ├── metadata.json # Backup metadata ├── wal/ # Write-ahead log │ ├── segment-00001.log │ ├── segment-00002.log │ └── ... └── db/ # Database files ├── assertions.kv ├── indexes/ └── ... ``` **Restore script location:** `/home/jml/Workspace/stemedb/scripts/restore-stemedb.sh` **Safety features:** - Never deletes existing data (renames to `.backup.TIMESTAMP`) - Validates backup metadata before restore - Stops/starts service automatically - Logs all operations --- ## Recovery Time Objective (RTO) **Pilot 5 targets:** | Deployment | Backup Size | RTO Target | Actual (tested) | |------------|-------------|------------|-----------------| | Single-node pilot | <10K assertions | 2 hours | 15 minutes | | Three-node cluster | <100K assertions | 5 minutes | 30 minutes | **Factors affecting RTO:** - Backup size - Network bandwidth (if backup on remote storage) - Disk I/O speed - Index rebuild time --- ## Recovery Point Objective (RPO) **Pilot 5 targets:** | Deployment | Backup Frequency | RPO Target | Data Loss Window | |------------|------------------|------------|------------------| | Single-node pilot | Daily | 24 hours | Last backup to failure | | Three-node cluster | Hourly | 1 hour | Last backup to failure | **Reducing RPO:** - Increase backup frequency (cron schedule) - Use continuous replication (cluster) - Enable WAL archival to S3 (roadmap P6.4) --- ## Prevention ### Automated Backups **Set up daily backup cron:** ```bash # Edit crontab sudo crontab -e # Add daily backup at 2 AM 0 2 * * * /home/jml/Workspace/stemedb/scripts/backup-stemedb.sh >> /var/log/stemedb-backup.log 2>&1 # Verify cron job sudo crontab -l ``` **Set up backup retention:** ```bash # Keep last 7 daily backups find backups/ -name "stemedb-backup-*" -type d -mtime +7 -exec rm -rf {} \; # Add to cron (after backup) 0 3 * * * find /path/to/backups -name "stemedb-backup-*" -type d -mtime +7 -exec rm -rf {} \; ``` ### Backup Validation **Monthly DR test:** ```bash # Test restore on staging environment # 1. Copy production backup to staging scp -r prod:/backups/latest staging:/backups/test # 2. Restore on staging ssh staging "sudo ./scripts/restore-stemedb.sh /backups/test" # 3. Validate ssh staging "curl http://localhost:18180/v1/health" # 4. Document results echo "$(date): DR test passed, assertion_count: 10234" >> dr-test-log.txt ``` ### Monitoring **Set up alerts for:** ```yaml # Prometheus alert rules groups: - name: stemedb_backups rules: - alert: StemeDBBackupMissing expr: time() - stemedb_last_backup_timestamp_seconds > 86400 for: 1h labels: severity: warning annotations: summary: "StemeDB backup missing (>24 hours)" - alert: StemeDBBackupFailed expr: stemedb_backup_failures_total > 0 for: 5m labels: severity: critical annotations: summary: "StemeDB backup failed" ``` --- ## Related Runbooks - [Server Won't Start](./server-wont-start.md) - WAL corruption scenarios - [Disk Full](./disk-full.md) - Backup storage management - [High Query Latency](./high-query-latency.md) - Index rebuild performance --- ## Last Updated 2026-02-11