stemedb/docs/operations/runbooks/restore-from-backup.md
jml 3e7eddc074 feat: add enterprise production readiness infrastructure
This commit implements comprehensive production hardening across multiple
layers to prepare StemeDB for enterprise pilot deployments:

## API Layer
- Add rate limiting middleware with configurable limits per endpoint
- Enhance error handling with detailed context and proper HTTP status codes
- Add security hardening tests for input validation and boundary conditions
- Create store_helpers module for defensive storage access patterns

## Storage & WAL
- Optimize group commit batching for higher throughput
- Add defensive error handling in hybrid backend with proper fallbacks
- Enhance WAL journal durability guarantees with fsync validation
- Improve index store query performance with better caching

## Operations & Deployment
- Add comprehensive operations documentation (deployment, monitoring, DR)
- Create systemd units for backup, WAL archival, and verification
- Add monitoring configs (Prometheus alerts, metrics exporters)
- Implement backup/restore scripts with verification and S3 archival
- Add DR drill automation and runbook procedures
- Create load balancer configs (nginx, envoy) with health checks

## Documentation
- Update CLAUDE.md with operations and troubleshooting guides
- Expand roadmap with production readiness milestones
- Add pilot success criteria and deployment reference architecture
- Document TLS setup, monitoring integration, and incident response

## Configuration
- Add .env.example with all required environment variables
- Document resource sizing for different deployment scales
- Add configuration examples for various deployment topologies

This positions StemeDB for successful enterprise pilots with proper
operational discipline, monitoring, backup/DR, and security hardening.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-12 06:08:15 +00:00

559 lines
13 KiB
Markdown

# Runbook: Restore from Backup
## Symptom
- Data loss after hardware failure, corruption, or operator error
- WAL corruption preventing server startup
- Need to rollback to known-good state
- Assertion count doesn't match expected values
- Database inconsistency detected
**Metrics Alerts:**
- N/A (typically discovered during incident response)
---
## Quick Diagnosis
```
Need to restore
├─► Data loss (hardware failure, operator error)?
│ └─► §1 Complete Restore
├─► WAL corruption on startup?
│ └─► §2 WAL-Only Restore
├─► Need to rollback to specific point in time?
│ └─► §3 Point-in-Time Restore
└─► Database inconsistency (assertion count mismatch)?
└─► §4 Validation and Rebuild
```
---
## Common Causes
1. **Hardware failure** — Likelihood: **30%**
- Disk failure
- Power loss during write
- Network storage disconnection
2. **WAL corruption** — Likelihood: **25%**
- Unclean shutdown (OOM kill, crash)
- Disk corruption
- Version mismatch after upgrade
3. **Operator error** — Likelihood: **20%**
- Accidentally deleted data directory
- Wrong command executed
- Misconfigured deployment
4. **Software bug** — Likelihood: **15%**
- Database corruption bug
- Index inconsistency
- Replication failure (cluster)
5. **Disaster recovery test** — Likelihood: **10%**
- Scheduled DR validation
- Migration to new infrastructure
---
## Prerequisites
**Before starting restore:**
- [ ] **Backup available:**
```bash
ls -lh backups/
# Should show: stemedb-backup-YYYYMMDD-HHMMSS/
```
- [ ] **Backup metadata valid:**
```bash
cat backups/stemedb-backup-*/metadata.json
# Should show: version, timestamp, assertion_count
```
- [ ] **Server stopped:**
```bash
sudo systemctl stop stemedb-api
sudo systemctl status stemedb-api
# Should show: inactive (dead)
```
- [ ] **Disk space available:**
```bash
df -h
# Need: 2x backup size available
```
---
## Resolution Steps
### §1. Complete Restore (Full Recovery)
**Use case:** Data loss, complete restoration needed
**Diagnostic:**
```bash
# Verify backup integrity
BACKUP_DIR="backups/stemedb-backup-20260211-100000" # Replace with your backup
# Check metadata
cat $BACKUP_DIR/metadata.json
# Expected output:
# {
# "version": "0.1.0",
# "timestamp": "2026-02-11T10:00:00Z",
# "assertion_count": 10234,
# "wal_segment_count": 15,
# "backup_type": "full"
# }
# Check directory structure
ls -lh $BACKUP_DIR/
# Should show: wal/ db/ metadata.json
```
**Resolution: Use restore script**
```bash
# Run restore script (safe - renames existing dirs, never deletes)
sudo ./scripts/restore-stemedb.sh $BACKUP_DIR
# Expected output:
# Stopping StemeDB API service...
# Renaming existing data/wal to data/wal.backup.20260211-103045
# Renaming existing data/db to data/db.backup.20260211-103045
# Copying WAL from backup...
# Copying DB from backup...
# Copying metadata...
# Restore complete. Starting StemeDB API service...
# StemeDB API service started successfully.
```
**Validate restore:**
```bash
# Check health endpoint
curl http://localhost:18180/v1/health
# Expected output:
# {
# "status": "healthy",
# "version": "0.1.0",
# "uptime_seconds": 5,
# "assertion_count": 10234 # Should match backup metadata
# }
# Verify metadata matches
cat data/metadata.json
# Should match backup metadata.json
# Test query
curl -X POST http://localhost:18180/v1/query \
-H "Content-Type: application/json" \
-d '{"concept_path": "test/restore", "lens": "recency"}'
# Should return 200 (even if empty results)
```
**If failed:** Health check shows different assertion_count → See §4 Validation and Rebuild.
---
### §2. WAL-Only Restore (Preserve Database)
**Use case:** WAL corrupted but database intact
⚠️ **WARNING:** This preserves existing database but replaces WAL. Only use if confident database is uncorrupted.
**Diagnostic:**
```bash
# Check for WAL errors
journalctl -u stemedb-api -n 50 | grep -i wal
# Common errors indicating WAL corruption:
# - "WAL magic byte validation failed"
# - "Checksum mismatch in WAL segment"
# - "Failed to recover WAL"
# Verify database is intact
ls -lh data/db/
# Should show: *.kv files, indexes, no corruption messages
```
**Resolution: Manual WAL replacement**
```bash
# Stop server
sudo systemctl stop stemedb-api
# Backup corrupted WAL for forensics
sudo mv data/wal data/wal.corrupted.$(date +%Y%m%d-%H%M%S)
# Restore WAL from backup
BACKUP_DIR="backups/stemedb-backup-20260211-100000"
sudo cp -r $BACKUP_DIR/wal data/wal
# Set correct permissions
sudo chown -R stemedb:stemedb data/wal/
sudo chmod -R 755 data/wal/
# Start server (will replay WAL and rebuild indexes)
sudo systemctl start stemedb-api
# Monitor startup
journalctl -u stemedb-api -f
# Expected logs:
# "Starting WAL recovery..."
# "Replayed 1523 entries from WAL"
# "Rebuilding indexes..."
# "Startup complete"
```
**Validate WAL recovery:**
```bash
# Check health
curl http://localhost:18180/v1/health
# Check metrics for WAL operations
curl http://localhost:18180/metrics | grep wal_
# Should show:
# wal_segments_total{...} 15
# wal_fsync_latency_seconds{...} <0.1
```
**If failed:** Server still won't start with restored WAL → Perform complete restore (§1).
---
### §3. Point-in-Time Restore
**Use case:** Rollback to specific timestamp (e.g., before bad data ingestion)
⚠️ **NOTE:** StemeDB is append-only, so this is "restore + filter" not true PITR.
**Diagnostic:**
```bash
# Identify when bad data was ingested
curl http://localhost:18180/v1/query \
-H "Content-Type: application/json" \
-d '{"concept_path": "bad/data/path", "lens": "recency"}' | jq '.assertions[0].timestamp'
# Find backup before this timestamp
ls -lh backups/ | grep "before-timestamp"
```
**Resolution: Restore + retraction**
```bash
# Step 1: Restore from backup before bad data
sudo ./scripts/restore-stemedb.sh backups/stemedb-backup-20260210-230000
# Step 2: Start server
sudo systemctl start stemedb-api
# Step 3: If bad data source is known, retract it
curl -X POST http://localhost:18180/v1/retract \
-H "Content-Type: application/json" \
-d '{
"concept_path": "source/bad_source",
"reason": "data_quality_issue",
"cascade": true
}'
# This marks source and all dependent assertions as retracted
```
**Validate rollback:**
```bash
# Check assertion count
curl http://localhost:18180/v1/health | jq '.assertion_count'
# Should be less than current (rolled back)
# Verify bad data is gone
curl -X POST http://localhost:18180/v1/query \
-H "Content-Type: application/json" \
-d '{"concept_path": "bad/data/path", "lens": "recency"}'
# Should return empty or show retracted status
```
**If failed:** Bad data still present → May need to filter WAL before replay (requires engineering support).
---
### §4. Validation and Rebuild
**Use case:** Inconsistency detected, indexes corrupted
**Diagnostic:**
```bash
# Check health assertion_count vs expected
curl http://localhost:18180/v1/health | jq '.assertion_count'
HEALTH_COUNT=10234
cat data/metadata.json | jq '.assertion_count'
METADATA_COUNT=10500
# If mismatch → Inconsistency detected
# Check for index errors
journalctl -u stemedb-api | grep -i "index"
```
**Resolution: Rebuild indexes from WAL**
```bash
# Stop server
sudo systemctl stop stemedb-api
# Backup existing database
sudo cp -r data/db data/db.backup.$(date +%Y%m%d-%H%M%S)
# Remove indexes (will be rebuilt on startup)
sudo rm -rf data/db/indexes/
# Start server (triggers full index rebuild)
sudo systemctl start stemedb-api
# Monitor rebuild progress
journalctl -u stemedb-api -f
# Expected logs:
# "Index rebuild started..."
# "Rebuilding predicate index from 10234 assertions..."
# "Rebuilding concept index..."
# "Index rebuild complete in 3.4s"
```
**Validate rebuild:**
```bash
# Check health
curl http://localhost:18180/v1/health
# Verify assertion_count matches metadata
HEALTH_COUNT=$(curl -s http://localhost:18180/v1/health | jq '.assertion_count')
METADATA_COUNT=$(cat data/metadata.json | jq '.assertion_count')
echo "Health: $HEALTH_COUNT, Metadata: $METADATA_COUNT"
# Should match
# Test query
curl -X POST http://localhost:18180/v1/query \
-H "Content-Type: application/json" \
-d '{"concept_path": "test/validation", "lens": "recency"}'
# Should return 200 with results
```
**If failed:** Rebuild fails or counts still mismatch → Perform complete restore (§1) from known-good backup.
---
## Validation
After any restore procedure, validate system health:
- [ ] **Server starts successfully**
```bash
systemctl status stemedb-api
# Should show: active (running)
```
- [ ] **Health endpoint returns correct count**
```bash
curl http://localhost:18180/v1/health | jq '.assertion_count'
# Should match backup metadata.json
```
- [ ] **Queries succeed**
```bash
curl -X POST http://localhost:18180/v1/query \
-H "Content-Type: application/json" \
-d '{"concept_path": "test/restore", "lens": "recency"}'
# Should return 200
```
- [ ] **Ingest works**
```bash
curl -X POST http://localhost:18180/v1/assert \
-H "Content-Type: application/json" \
-d '{
"concept_path": "test/restore_validation",
"predicate": "restored",
"value": true,
"confidence": 0.95
}'
# Should return 201 Created
```
- [ ] **Metrics are valid**
```bash
curl http://localhost:18180/metrics | grep stemedb_
# Should show all metrics with reasonable values
```
- [ ] **Dashboard loads**
- Open http://localhost:18188/
- Should show current assertion count
- No errors in browser console
---
## Backup Script Reference
**Script location:** `/home/jml/Workspace/stemedb/scripts/backup-stemedb.sh`
**Usage:**
```bash
# Manual backup
sudo ./scripts/backup-stemedb.sh
# Scheduled backup (cron)
0 2 * * * /path/to/backup-stemedb.sh >> /var/log/stemedb-backup.log 2>&1
```
**Backup structure:**
```
backups/stemedb-backup-20260211-100000/
├── metadata.json # Backup metadata
├── wal/ # Write-ahead log
│ ├── segment-00001.log
│ ├── segment-00002.log
│ └── ...
└── db/ # Database files
├── assertions.kv
├── indexes/
└── ...
```
**Restore script location:** `/home/jml/Workspace/stemedb/scripts/restore-stemedb.sh`
**Safety features:**
- Never deletes existing data (renames to `.backup.TIMESTAMP`)
- Validates backup metadata before restore
- Stops/starts service automatically
- Logs all operations
---
## Recovery Time Objective (RTO)
**Pilot 5 targets:**
| Deployment | Backup Size | RTO Target | Actual (tested) |
|------------|-------------|------------|-----------------|
| Single-node pilot | <10K assertions | 2 hours | 15 minutes |
| Three-node cluster | <100K assertions | 5 minutes | 30 minutes |
**Factors affecting RTO:**
- Backup size
- Network bandwidth (if backup on remote storage)
- Disk I/O speed
- Index rebuild time
---
## Recovery Point Objective (RPO)
**Pilot 5 targets:**
| Deployment | Backup Frequency | RPO Target | Data Loss Window |
|------------|------------------|------------|------------------|
| Single-node pilot | Daily | 24 hours | Last backup to failure |
| Three-node cluster | Hourly | 1 hour | Last backup to failure |
**Reducing RPO:**
- Increase backup frequency (cron schedule)
- Use continuous replication (cluster)
- Enable WAL archival to S3 (roadmap P6.4)
---
## Prevention
### Automated Backups
**Set up daily backup cron:**
```bash
# Edit crontab
sudo crontab -e
# Add daily backup at 2 AM
0 2 * * * /home/jml/Workspace/stemedb/scripts/backup-stemedb.sh >> /var/log/stemedb-backup.log 2>&1
# Verify cron job
sudo crontab -l
```
**Set up backup retention:**
```bash
# Keep last 7 daily backups
find backups/ -name "stemedb-backup-*" -type d -mtime +7 -exec rm -rf {} \;
# Add to cron (after backup)
0 3 * * * find /path/to/backups -name "stemedb-backup-*" -type d -mtime +7 -exec rm -rf {} \;
```
### Backup Validation
**Monthly DR test:**
```bash
# Test restore on staging environment
# 1. Copy production backup to staging
scp -r prod:/backups/latest staging:/backups/test
# 2. Restore on staging
ssh staging "sudo ./scripts/restore-stemedb.sh /backups/test"
# 3. Validate
ssh staging "curl http://localhost:18180/v1/health"
# 4. Document results
echo "$(date): DR test passed, assertion_count: 10234" >> dr-test-log.txt
```
### Monitoring
**Set up alerts for:**
```yaml
# Prometheus alert rules
groups:
- name: stemedb_backups
rules:
- alert: StemeDBBackupMissing
expr: time() - stemedb_last_backup_timestamp_seconds > 86400
for: 1h
labels:
severity: warning
annotations:
summary: "StemeDB backup missing (>24 hours)"
- alert: StemeDBBackupFailed
expr: stemedb_backup_failures_total > 0
for: 5m
labels:
severity: critical
annotations:
summary: "StemeDB backup failed"
```
---
## Related Runbooks
- [Server Won't Start](./server-wont-start.md) - WAL corruption scenarios
- [Disk Full](./disk-full.md) - Backup storage management
- [High Query Latency](./high-query-latency.md) - Index rebuild performance
---
## Last Updated
2026-02-11