stemedb/docs/operations/runbooks/restore-from-backup.md
jml 3e7eddc074 feat: add enterprise production readiness infrastructure
This commit implements comprehensive production hardening across multiple
layers to prepare StemeDB for enterprise pilot deployments:

## API Layer
- Add rate limiting middleware with configurable limits per endpoint
- Enhance error handling with detailed context and proper HTTP status codes
- Add security hardening tests for input validation and boundary conditions
- Create store_helpers module for defensive storage access patterns

## Storage & WAL
- Optimize group commit batching for higher throughput
- Add defensive error handling in hybrid backend with proper fallbacks
- Enhance WAL journal durability guarantees with fsync validation
- Improve index store query performance with better caching

## Operations & Deployment
- Add comprehensive operations documentation (deployment, monitoring, DR)
- Create systemd units for backup, WAL archival, and verification
- Add monitoring configs (Prometheus alerts, metrics exporters)
- Implement backup/restore scripts with verification and S3 archival
- Add DR drill automation and runbook procedures
- Create load balancer configs (nginx, envoy) with health checks

## Documentation
- Update CLAUDE.md with operations and troubleshooting guides
- Expand roadmap with production readiness milestones
- Add pilot success criteria and deployment reference architecture
- Document TLS setup, monitoring integration, and incident response

## Configuration
- Add .env.example with all required environment variables
- Document resource sizing for different deployment scales
- Add configuration examples for various deployment topologies

This positions StemeDB for successful enterprise pilots with proper
operational discipline, monitoring, backup/DR, and security hardening.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-12 06:08:15 +00:00

13 KiB

Runbook: Restore from Backup

Symptom

  • Data loss after hardware failure, corruption, or operator error
  • WAL corruption preventing server startup
  • Need to rollback to known-good state
  • Assertion count doesn't match expected values
  • Database inconsistency detected

Metrics Alerts:

  • N/A (typically discovered during incident response)

Quick Diagnosis

Need to restore
    │
    ├─► Data loss (hardware failure, operator error)?
    │   └─► §1 Complete Restore
    │
    ├─► WAL corruption on startup?
    │   └─► §2 WAL-Only Restore
    │
    ├─► Need to rollback to specific point in time?
    │   └─► §3 Point-in-Time Restore
    │
    └─► Database inconsistency (assertion count mismatch)?
        └─► §4 Validation and Rebuild

Common Causes

  1. Hardware failure — Likelihood: 30%

    • Disk failure
    • Power loss during write
    • Network storage disconnection
  2. WAL corruption — Likelihood: 25%

    • Unclean shutdown (OOM kill, crash)
    • Disk corruption
    • Version mismatch after upgrade
  3. Operator error — Likelihood: 20%

    • Accidentally deleted data directory
    • Wrong command executed
    • Misconfigured deployment
  4. Software bug — Likelihood: 15%

    • Database corruption bug
    • Index inconsistency
    • Replication failure (cluster)
  5. Disaster recovery test — Likelihood: 10%

    • Scheduled DR validation
    • Migration to new infrastructure

Prerequisites

Before starting restore:

  • Backup available:

    ls -lh backups/
    # Should show: stemedb-backup-YYYYMMDD-HHMMSS/
    
  • Backup metadata valid:

    cat backups/stemedb-backup-*/metadata.json
    # Should show: version, timestamp, assertion_count
    
  • Server stopped:

    sudo systemctl stop stemedb-api
    sudo systemctl status stemedb-api
    # Should show: inactive (dead)
    
  • Disk space available:

    df -h
    # Need: 2x backup size available
    

Resolution Steps

§1. Complete Restore (Full Recovery)

Use case: Data loss, complete restoration needed

Diagnostic:

# Verify backup integrity
BACKUP_DIR="backups/stemedb-backup-20260211-100000"  # Replace with your backup

# Check metadata
cat $BACKUP_DIR/metadata.json

# Expected output:
# {
#   "version": "0.1.0",
#   "timestamp": "2026-02-11T10:00:00Z",
#   "assertion_count": 10234,
#   "wal_segment_count": 15,
#   "backup_type": "full"
# }

# Check directory structure
ls -lh $BACKUP_DIR/
# Should show: wal/ db/ metadata.json

Resolution: Use restore script

# Run restore script (safe - renames existing dirs, never deletes)
sudo ./scripts/restore-stemedb.sh $BACKUP_DIR

# Expected output:
# Stopping StemeDB API service...
# Renaming existing data/wal to data/wal.backup.20260211-103045
# Renaming existing data/db to data/db.backup.20260211-103045
# Copying WAL from backup...
# Copying DB from backup...
# Copying metadata...
# Restore complete. Starting StemeDB API service...
# StemeDB API service started successfully.

Validate restore:

# Check health endpoint
curl http://localhost:18180/v1/health

# Expected output:
# {
#   "status": "healthy",
#   "version": "0.1.0",
#   "uptime_seconds": 5,
#   "assertion_count": 10234  # Should match backup metadata
# }

# Verify metadata matches
cat data/metadata.json
# Should match backup metadata.json

# Test query
curl -X POST http://localhost:18180/v1/query \
  -H "Content-Type: application/json" \
  -d '{"concept_path": "test/restore", "lens": "recency"}'
# Should return 200 (even if empty results)

If failed: Health check shows different assertion_count → See §4 Validation and Rebuild.


§2. WAL-Only Restore (Preserve Database)

Use case: WAL corrupted but database intact

⚠️ WARNING: This preserves existing database but replaces WAL. Only use if confident database is uncorrupted.

Diagnostic:

# Check for WAL errors
journalctl -u stemedb-api -n 50 | grep -i wal

# Common errors indicating WAL corruption:
# - "WAL magic byte validation failed"
# - "Checksum mismatch in WAL segment"
# - "Failed to recover WAL"

# Verify database is intact
ls -lh data/db/
# Should show: *.kv files, indexes, no corruption messages

Resolution: Manual WAL replacement

# Stop server
sudo systemctl stop stemedb-api

# Backup corrupted WAL for forensics
sudo mv data/wal data/wal.corrupted.$(date +%Y%m%d-%H%M%S)

# Restore WAL from backup
BACKUP_DIR="backups/stemedb-backup-20260211-100000"
sudo cp -r $BACKUP_DIR/wal data/wal

# Set correct permissions
sudo chown -R stemedb:stemedb data/wal/
sudo chmod -R 755 data/wal/

# Start server (will replay WAL and rebuild indexes)
sudo systemctl start stemedb-api

# Monitor startup
journalctl -u stemedb-api -f

# Expected logs:
# "Starting WAL recovery..."
# "Replayed 1523 entries from WAL"
# "Rebuilding indexes..."
# "Startup complete"

Validate WAL recovery:

# Check health
curl http://localhost:18180/v1/health

# Check metrics for WAL operations
curl http://localhost:18180/metrics | grep wal_

# Should show:
# wal_segments_total{...} 15
# wal_fsync_latency_seconds{...} <0.1

If failed: Server still won't start with restored WAL → Perform complete restore (§1).


§3. Point-in-Time Restore

Use case: Rollback to specific timestamp (e.g., before bad data ingestion)

⚠️ NOTE: StemeDB is append-only, so this is "restore + filter" not true PITR.

Diagnostic:

# Identify when bad data was ingested
curl http://localhost:18180/v1/query \
  -H "Content-Type: application/json" \
  -d '{"concept_path": "bad/data/path", "lens": "recency"}' | jq '.assertions[0].timestamp'

# Find backup before this timestamp
ls -lh backups/ | grep "before-timestamp"

Resolution: Restore + retraction

# Step 1: Restore from backup before bad data
sudo ./scripts/restore-stemedb.sh backups/stemedb-backup-20260210-230000

# Step 2: Start server
sudo systemctl start stemedb-api

# Step 3: If bad data source is known, retract it
curl -X POST http://localhost:18180/v1/retract \
  -H "Content-Type: application/json" \
  -d '{
    "concept_path": "source/bad_source",
    "reason": "data_quality_issue",
    "cascade": true
  }'

# This marks source and all dependent assertions as retracted

Validate rollback:

# Check assertion count
curl http://localhost:18180/v1/health | jq '.assertion_count'
# Should be less than current (rolled back)

# Verify bad data is gone
curl -X POST http://localhost:18180/v1/query \
  -H "Content-Type: application/json" \
  -d '{"concept_path": "bad/data/path", "lens": "recency"}'
# Should return empty or show retracted status

If failed: Bad data still present → May need to filter WAL before replay (requires engineering support).


§4. Validation and Rebuild

Use case: Inconsistency detected, indexes corrupted

Diagnostic:

# Check health assertion_count vs expected
curl http://localhost:18180/v1/health | jq '.assertion_count'
HEALTH_COUNT=10234

cat data/metadata.json | jq '.assertion_count'
METADATA_COUNT=10500

# If mismatch → Inconsistency detected

# Check for index errors
journalctl -u stemedb-api | grep -i "index"

Resolution: Rebuild indexes from WAL

# Stop server
sudo systemctl stop stemedb-api

# Backup existing database
sudo cp -r data/db data/db.backup.$(date +%Y%m%d-%H%M%S)

# Remove indexes (will be rebuilt on startup)
sudo rm -rf data/db/indexes/

# Start server (triggers full index rebuild)
sudo systemctl start stemedb-api

# Monitor rebuild progress
journalctl -u stemedb-api -f

# Expected logs:
# "Index rebuild started..."
# "Rebuilding predicate index from 10234 assertions..."
# "Rebuilding concept index..."
# "Index rebuild complete in 3.4s"

Validate rebuild:

# Check health
curl http://localhost:18180/v1/health

# Verify assertion_count matches metadata
HEALTH_COUNT=$(curl -s http://localhost:18180/v1/health | jq '.assertion_count')
METADATA_COUNT=$(cat data/metadata.json | jq '.assertion_count')

echo "Health: $HEALTH_COUNT, Metadata: $METADATA_COUNT"
# Should match

# Test query
curl -X POST http://localhost:18180/v1/query \
  -H "Content-Type: application/json" \
  -d '{"concept_path": "test/validation", "lens": "recency"}'
# Should return 200 with results

If failed: Rebuild fails or counts still mismatch → Perform complete restore (§1) from known-good backup.


Validation

After any restore procedure, validate system health:

  • Server starts successfully

    systemctl status stemedb-api
    # Should show: active (running)
    
  • Health endpoint returns correct count

    curl http://localhost:18180/v1/health | jq '.assertion_count'
    # Should match backup metadata.json
    
  • Queries succeed

    curl -X POST http://localhost:18180/v1/query \
      -H "Content-Type: application/json" \
      -d '{"concept_path": "test/restore", "lens": "recency"}'
    # Should return 200
    
  • Ingest works

    curl -X POST http://localhost:18180/v1/assert \
      -H "Content-Type: application/json" \
      -d '{
        "concept_path": "test/restore_validation",
        "predicate": "restored",
        "value": true,
        "confidence": 0.95
      }'
    # Should return 201 Created
    
  • Metrics are valid

    curl http://localhost:18180/metrics | grep stemedb_
    # Should show all metrics with reasonable values
    
  • Dashboard loads


Backup Script Reference

Script location: /home/jml/Workspace/stemedb/scripts/backup-stemedb.sh

Usage:

# Manual backup
sudo ./scripts/backup-stemedb.sh

# Scheduled backup (cron)
0 2 * * * /path/to/backup-stemedb.sh >> /var/log/stemedb-backup.log 2>&1

Backup structure:

backups/stemedb-backup-20260211-100000/
├── metadata.json          # Backup metadata
├── wal/                   # Write-ahead log
│   ├── segment-00001.log
│   ├── segment-00002.log
│   └── ...
└── db/                    # Database files
    ├── assertions.kv
    ├── indexes/
    └── ...

Restore script location: /home/jml/Workspace/stemedb/scripts/restore-stemedb.sh

Safety features:

  • Never deletes existing data (renames to .backup.TIMESTAMP)
  • Validates backup metadata before restore
  • Stops/starts service automatically
  • Logs all operations

Recovery Time Objective (RTO)

Pilot 5 targets:

Deployment Backup Size RTO Target Actual (tested)
Single-node pilot <10K assertions 2 hours 15 minutes
Three-node cluster <100K assertions 5 minutes 30 minutes

Factors affecting RTO:

  • Backup size
  • Network bandwidth (if backup on remote storage)
  • Disk I/O speed
  • Index rebuild time

Recovery Point Objective (RPO)

Pilot 5 targets:

Deployment Backup Frequency RPO Target Data Loss Window
Single-node pilot Daily 24 hours Last backup to failure
Three-node cluster Hourly 1 hour Last backup to failure

Reducing RPO:

  • Increase backup frequency (cron schedule)
  • Use continuous replication (cluster)
  • Enable WAL archival to S3 (roadmap P6.4)

Prevention

Automated Backups

Set up daily backup cron:

# Edit crontab
sudo crontab -e

# Add daily backup at 2 AM
0 2 * * * /home/jml/Workspace/stemedb/scripts/backup-stemedb.sh >> /var/log/stemedb-backup.log 2>&1

# Verify cron job
sudo crontab -l

Set up backup retention:

# Keep last 7 daily backups
find backups/ -name "stemedb-backup-*" -type d -mtime +7 -exec rm -rf {} \;

# Add to cron (after backup)
0 3 * * * find /path/to/backups -name "stemedb-backup-*" -type d -mtime +7 -exec rm -rf {} \;

Backup Validation

Monthly DR test:

# Test restore on staging environment
# 1. Copy production backup to staging
scp -r prod:/backups/latest staging:/backups/test

# 2. Restore on staging
ssh staging "sudo ./scripts/restore-stemedb.sh /backups/test"

# 3. Validate
ssh staging "curl http://localhost:18180/v1/health"

# 4. Document results
echo "$(date): DR test passed, assertion_count: 10234" >> dr-test-log.txt

Monitoring

Set up alerts for:

# Prometheus alert rules
groups:
  - name: stemedb_backups
    rules:
      - alert: StemeDBBackupMissing
        expr: time() - stemedb_last_backup_timestamp_seconds > 86400
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "StemeDB backup missing (>24 hours)"

      - alert: StemeDBBackupFailed
        expr: stemedb_backup_failures_total > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "StemeDB backup failed"


Last Updated

2026-02-11