stemedb/docs/operations/runbooks/restore-from-backup.md

# Runbook: Restore from Backup

## Symptom

- Data loss after hardware failure, corruption, or operator error
- WAL corruption preventing server startup
- Need to rollback to known-good state
- Assertion count doesn't match expected values
- Database inconsistency detected

**Metrics Alerts:**
- N/A (typically discovered during incident response)

---

## Quick Diagnosis

```
Need to restore
    │
    ├─► Data loss (hardware failure, operator error)?
    │   └─► §1 Complete Restore
    │
    ├─► WAL corruption on startup?
    │   └─► §2 WAL-Only Restore
    │
    ├─► Need to rollback to specific point in time?
    │   └─► §3 Point-in-Time Restore
    │
    └─► Database inconsistency (assertion count mismatch)?
        └─► §4 Validation and Rebuild
```

---

## Common Causes

1. **Hardware failure** — Likelihood: **30%**
   - Disk failure
   - Power loss during write
   - Network storage disconnection

2. **WAL corruption** — Likelihood: **25%**
   - Unclean shutdown (OOM kill, crash)
   - Disk corruption
   - Version mismatch after upgrade

3. **Operator error** — Likelihood: **20%**
   - Accidentally deleted data directory
   - Wrong command executed
   - Misconfigured deployment

4. **Software bug** — Likelihood: **15%**
   - Database corruption bug
   - Index inconsistency
   - Replication failure (cluster)

5. **Disaster recovery test** — Likelihood: **10%**
   - Scheduled DR validation
   - Migration to new infrastructure

---

## Prerequisites

**Before starting restore:**

- [ ] **Backup available:**
  ```bash
  ls -lh backups/
  # Should show: stemedb-backup-YYYYMMDD-HHMMSS/
  ```

- [ ] **Backup metadata valid:**
  ```bash
  cat backups/stemedb-backup-*/metadata.json
  # Should show: version, timestamp, assertion_count
  ```

- [ ] **Server stopped:**
  ```bash
  sudo systemctl stop stemedb-api
  sudo systemctl status stemedb-api
  # Should show: inactive (dead)
  ```

- [ ] **Disk space available:**
  ```bash
  df -h
  # Need: 2x backup size available
  ```

---

## Resolution Steps

### §1. Complete Restore (Full Recovery)

**Use case:** Data loss, complete restoration needed

**Diagnostic:**
```bash
# Verify backup integrity
BACKUP_DIR="backups/stemedb-backup-20260211-100000"  # Replace with your backup

# Check metadata
cat $BACKUP_DIR/metadata.json

# Expected output:
# {
#   "version": "0.1.0",
#   "timestamp": "2026-02-11T10:00:00Z",
#   "assertion_count": 10234,
#   "wal_segment_count": 15,
#   "backup_type": "full"
# }

# Check directory structure
ls -lh $BACKUP_DIR/
# Should show: wal/ db/ metadata.json
```

**Resolution: Use restore script**

```bash
# Run restore script (safe - renames existing dirs, never deletes)
sudo ./scripts/restore-stemedb.sh $BACKUP_DIR

# Expected output:
# Stopping StemeDB API service...
# Renaming existing data/wal to data/wal.backup.20260211-103045
# Renaming existing data/db to data/db.backup.20260211-103045
# Copying WAL from backup...
# Copying DB from backup...
# Copying metadata...
# Restore complete. Starting StemeDB API service...
# StemeDB API service started successfully.
```

**Validate restore:**
```bash
# Check health endpoint
curl http://localhost:18180/v1/health

# Expected output:
# {
#   "status": "healthy",
#   "version": "0.1.0",
#   "uptime_seconds": 5,
#   "assertion_count": 10234  # Should match backup metadata
# }

# Verify metadata matches
cat data/metadata.json
# Should match backup metadata.json

# Test query
curl -X POST http://localhost:18180/v1/query \
  -H "Content-Type: application/json" \
  -d '{"concept_path": "test/restore", "lens": "recency"}'
# Should return 200 (even if empty results)
```

**If failed:** Health check shows different assertion_count → See §4 Validation and Rebuild.

---

### §2. WAL-Only Restore (Preserve Database)

**Use case:** WAL corrupted but database intact

⚠️ **WARNING:** This preserves existing database but replaces WAL. Only use if confident database is uncorrupted.

**Diagnostic:**
```bash
# Check for WAL errors
journalctl -u stemedb-api -n 50 | grep -i wal

# Common errors indicating WAL corruption:
# - "WAL magic byte validation failed"
# - "Checksum mismatch in WAL segment"
# - "Failed to recover WAL"

# Verify database is intact
ls -lh data/db/
# Should show: *.kv files, indexes, no corruption messages
```

**Resolution: Manual WAL replacement**

```bash
# Stop server
sudo systemctl stop stemedb-api

# Backup corrupted WAL for forensics
sudo mv data/wal data/wal.corrupted.$(date +%Y%m%d-%H%M%S)

# Restore WAL from backup
BACKUP_DIR="backups/stemedb-backup-20260211-100000"
sudo cp -r $BACKUP_DIR/wal data/wal

# Set correct permissions
sudo chown -R stemedb:stemedb data/wal/
sudo chmod -R 755 data/wal/

# Start server (will replay WAL and rebuild indexes)
sudo systemctl start stemedb-api

# Monitor startup
journalctl -u stemedb-api -f

# Expected logs:
# "Starting WAL recovery..."
# "Replayed 1523 entries from WAL"
# "Rebuilding indexes..."
# "Startup complete"
```

**Validate WAL recovery:**
```bash
# Check health
curl http://localhost:18180/v1/health

# Check metrics for WAL operations
curl http://localhost:18180/metrics | grep wal_

# Should show:
# wal_segments_total{...} 15
# wal_fsync_latency_seconds{...} <0.1
```

**If failed:** Server still won't start with restored WAL → Perform complete restore (§1).

---

### §3. Point-in-Time Restore

**Use case:** Rollback to specific timestamp (e.g., before bad data ingestion)

⚠️ **NOTE:** StemeDB is append-only, so this is "restore + filter" not true PITR.

**Diagnostic:**
```bash
# Identify when bad data was ingested
curl http://localhost:18180/v1/query \
  -H "Content-Type: application/json" \
  -d '{"concept_path": "bad/data/path", "lens": "recency"}' | jq '.assertions[0].timestamp'

# Find backup before this timestamp
ls -lh backups/ | grep "before-timestamp"
```

**Resolution: Restore + retraction**

```bash
# Step 1: Restore from backup before bad data
sudo ./scripts/restore-stemedb.sh backups/stemedb-backup-20260210-230000

# Step 2: Start server
sudo systemctl start stemedb-api

# Step 3: If bad data source is known, retract it
curl -X POST http://localhost:18180/v1/retract \
  -H "Content-Type: application/json" \
  -d '{
    "concept_path": "source/bad_source",
    "reason": "data_quality_issue",
    "cascade": true
  }'

# This marks source and all dependent assertions as retracted
```

**Validate rollback:**
```bash
# Check assertion count
curl http://localhost:18180/v1/health | jq '.assertion_count'
# Should be less than current (rolled back)

# Verify bad data is gone
curl -X POST http://localhost:18180/v1/query \
  -H "Content-Type: application/json" \
  -d '{"concept_path": "bad/data/path", "lens": "recency"}'
# Should return empty or show retracted status
```

**If failed:** Bad data still present → May need to filter WAL before replay (requires engineering support).

---

### §4. Validation and Rebuild

**Use case:** Inconsistency detected, indexes corrupted

**Diagnostic:**
```bash
# Check health assertion_count vs expected
curl http://localhost:18180/v1/health | jq '.assertion_count'
HEALTH_COUNT=10234

cat data/metadata.json | jq '.assertion_count'
METADATA_COUNT=10500

# If mismatch → Inconsistency detected

# Check for index errors
journalctl -u stemedb-api | grep -i "index"
```

**Resolution: Rebuild indexes from WAL**

```bash
# Stop server
sudo systemctl stop stemedb-api

# Backup existing database
sudo cp -r data/db data/db.backup.$(date +%Y%m%d-%H%M%S)

# Remove indexes (will be rebuilt on startup)
sudo rm -rf data/db/indexes/

# Start server (triggers full index rebuild)
sudo systemctl start stemedb-api

# Monitor rebuild progress
journalctl -u stemedb-api -f

# Expected logs:
# "Index rebuild started..."
# "Rebuilding predicate index from 10234 assertions..."
# "Rebuilding concept index..."
# "Index rebuild complete in 3.4s"
```

**Validate rebuild:**
```bash
# Check health
curl http://localhost:18180/v1/health

# Verify assertion_count matches metadata
HEALTH_COUNT=$(curl -s http://localhost:18180/v1/health | jq '.assertion_count')
METADATA_COUNT=$(cat data/metadata.json | jq '.assertion_count')

echo "Health: $HEALTH_COUNT, Metadata: $METADATA_COUNT"
# Should match

# Test query
curl -X POST http://localhost:18180/v1/query \
  -H "Content-Type: application/json" \
  -d '{"concept_path": "test/validation", "lens": "recency"}'
# Should return 200 with results
```

**If failed:** Rebuild fails or counts still mismatch → Perform complete restore (§1) from known-good backup.

---

## Validation

After any restore procedure, validate system health:

- [ ] **Server starts successfully**
  ```bash
  systemctl status stemedb-api
  # Should show: active (running)
  ```

- [ ] **Health endpoint returns correct count**
  ```bash
  curl http://localhost:18180/v1/health | jq '.assertion_count'
  # Should match backup metadata.json
  ```

- [ ] **Queries succeed**
  ```bash
  curl -X POST http://localhost:18180/v1/query \
    -H "Content-Type: application/json" \
    -d '{"concept_path": "test/restore", "lens": "recency"}'
  # Should return 200
  ```

- [ ] **Ingest works**
  ```bash
  curl -X POST http://localhost:18180/v1/assert \
    -H "Content-Type: application/json" \
    -d '{
      "concept_path": "test/restore_validation",
      "predicate": "restored",
      "value": true,
      "confidence": 0.95
    }'
  # Should return 201 Created
  ```

- [ ] **Metrics are valid**
  ```bash
  curl http://localhost:18180/metrics | grep stemedb_
  # Should show all metrics with reasonable values
  ```

- [ ] **Dashboard loads**
  - Open http://localhost:18188/
  - Should show current assertion count
  - No errors in browser console

---

## Backup Script Reference

**Script location:** `/home/jml/Workspace/stemedb/scripts/backup-stemedb.sh`

**Usage:**
```bash
# Manual backup
sudo ./scripts/backup-stemedb.sh

# Scheduled backup (cron)
0 2 * * * /path/to/backup-stemedb.sh >> /var/log/stemedb-backup.log 2>&1
```

**Backup structure:**
```
backups/stemedb-backup-20260211-100000/
├── metadata.json          # Backup metadata
├── wal/                   # Write-ahead log
│   ├── segment-00001.log
│   ├── segment-00002.log
│   └── ...
└── db/                    # Database files
    ├── assertions.kv
    ├── indexes/
    └── ...
```

**Restore script location:** `/home/jml/Workspace/stemedb/scripts/restore-stemedb.sh`

**Safety features:**
- Never deletes existing data (renames to `.backup.TIMESTAMP`)
- Validates backup metadata before restore
- Stops/starts service automatically
- Logs all operations

---

## Recovery Time Objective (RTO)

**Pilot 5 targets:**

| Deployment | Backup Size | RTO Target | Actual (tested) |
|------------|-------------|------------|-----------------|
| Single-node pilot | <10K assertions | 2 hours | 15 minutes |
| Three-node cluster | <100K assertions | 5 minutes | 30 minutes |

**Factors affecting RTO:**
- Backup size
- Network bandwidth (if backup on remote storage)
- Disk I/O speed
- Index rebuild time

---

## Recovery Point Objective (RPO)

**Pilot 5 targets:**

| Deployment | Backup Frequency | RPO Target | Data Loss Window |
|------------|------------------|------------|------------------|
| Single-node pilot | Daily | 24 hours | Last backup to failure |
| Three-node cluster | Hourly | 1 hour | Last backup to failure |

**Reducing RPO:**
- Increase backup frequency (cron schedule)
- Use continuous replication (cluster)
- Enable WAL archival to S3 (roadmap P6.4)

---

## Prevention

### Automated Backups

**Set up daily backup cron:**
```bash
# Edit crontab
sudo crontab -e

# Add daily backup at 2 AM
0 2 * * * /home/jml/Workspace/stemedb/scripts/backup-stemedb.sh >> /var/log/stemedb-backup.log 2>&1

# Verify cron job
sudo crontab -l
```

**Set up backup retention:**
```bash
# Keep last 7 daily backups
find backups/ -name "stemedb-backup-*" -type d -mtime +7 -exec rm -rf {} \;

# Add to cron (after backup)
0 3 * * * find /path/to/backups -name "stemedb-backup-*" -type d -mtime +7 -exec rm -rf {} \;
```

### Backup Validation

**Monthly DR test:**
```bash
# Test restore on staging environment
# 1. Copy production backup to staging
scp -r prod:/backups/latest staging:/backups/test

# 2. Restore on staging
ssh staging "sudo ./scripts/restore-stemedb.sh /backups/test"

# 3. Validate
ssh staging "curl http://localhost:18180/v1/health"

# 4. Document results
echo "$(date): DR test passed, assertion_count: 10234" >> dr-test-log.txt
```

### Monitoring

**Set up alerts for:**
```yaml
# Prometheus alert rules
groups:
  - name: stemedb_backups
    rules:
      - alert: StemeDBBackupMissing
        expr: time() - stemedb_last_backup_timestamp_seconds > 86400
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "StemeDB backup missing (>24 hours)"

      - alert: StemeDBBackupFailed
        expr: stemedb_backup_failures_total > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "StemeDB backup failed"
```

---

## Related Runbooks

- [Server Won't Start](./server-wont-start.md) - WAL corruption scenarios
- [Disk Full](./disk-full.md) - Backup storage management
- [High Query Latency](./high-query-latency.md) - Index rebuild performance

---

## Last Updated

2026-02-11