Runbook: Disaster Recovery
Overview
Purpose: Restore StemeDB from backup after catastrophic failure.
RTO (Recovery Time Objective): 4 hours
RPO (Recovery Point Objective): 15 minutes
Scope: Complete server failure, data center outage, or regional disaster requiring restore from backups.
When to Use This Runbook
Use this runbook for:
- Complete server failure - Hardware dead, cannot boot
- Data center outage - Entire DC offline, need to restore elsewhere
- Disk failure - Storage completely lost, no local recovery possible
- Ransomware/corruption - Data encrypted or corrupted, need clean restore
- Regional disaster - DR drill or actual disaster requiring failover
Do NOT use for:
- Single node failure in cluster → Use cluster failover instead
- WAL corruption → Use Restore from Backup §2
- Index rebuild → Use Restore from Backup §4
Prerequisites
Before starting DR, ensure:
- New server provisioned (or existing server with clean disk)
- S3 access configured (credentials, network access to S3)
- Dependencies installed (Rust, PostgreSQL if using external stores)
- Stakeholders notified (team knows DR is in progress)
- DNS/load balancer updated (if changing server IP)
Minimum server specs:
- CPU: 4 cores
- RAM: 16GB
- Disk: 2x backup size (for restore + buffer)
- Network: 1Gbps (for S3 downloads)
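The "2x backup size" disk guideline can be checked before any download starts. The following is a minimal sketch, not part of the StemeDB tooling: `has_enough_disk` is a hypothetical helper, and the paths in the usage comments are assumptions taken from the steps later in this runbook.

```bash
#!/usr/bin/env bash
# Sketch: check the "Disk: 2x backup size" guideline.
# has_enough_disk is a hypothetical helper, not a StemeDB command.
# Usage: has_enough_disk <backup_size_kib> <free_space_kib>
has_enough_disk() {
  local backup_kib=$1 free_kib=$2
  local required_kib=$(( backup_kib * 2 ))   # restore + buffer
  (( free_kib >= required_kib ))
}

# Example wiring (paths are assumptions):
#   backup_kib=$(du -sk /var/backups/stemedb | cut -f1)
#   free_kib=$(df -Pk /var/lib/stemedb | awk 'NR==2 {print $4}')
#   has_enough_disk "$backup_kib" "$free_kib" || echo "not enough disk" >&2
```

The function only does the arithmetic, so the backup size can come from wherever it is known, for example the backup metadata or an S3 listing.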
Decision Tree
Disaster scenario
│
├─► Complete restore needed?
│ └─► §1 Full Restore from S3
│
├─► Point-in-time restore needed?
│ └─► §2 Point-in-Time Restore with WAL Replay
│
└─► Only recent data lost?
└─► §3 WAL-Only Recovery
Resolution Steps
§1. Full Restore from S3 (RTO: 4 hours, RPO: 15 minutes)
Use case: Complete data loss, restore everything from S3.
Step 1: Provision new server (30 min)
# Install dependencies
sudo apt update
sudo apt install -y awscli build-essential pkg-config libssl-dev postgresql-client
# Install Rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source $HOME/.cargo/env
# Create stemedb user
sudo useradd -r -s /bin/bash -d /var/lib/stemedb -m stemedb
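# Create data and backup staging directories
# (Step 2 downloads into /var/backups/stemedb as the stemedb user)
sudo mkdir -p /var/lib/stemedb/{wal,db} /var/backups/stemedb
sudo chown -R stemedb:stemedb /var/lib/stemedb /var/backups/stemedb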
Step 2: Download latest full backup from S3 (60 min)
# List available backups
aws s3 ls s3://stemedb-backups-prod/ | grep stemedb-backup
# Expected output:
# PRE stemedb-backup-20260211-060000/
# PRE stemedb-backup-20260211-120000/
# PRE stemedb-backup-20260211-180000/ ← Latest
# Download latest full backup
LATEST_BACKUP=$(aws s3 ls s3://stemedb-backups-prod/ | grep stemedb-backup | tail -n1 | awk '{print $2}' | tr -d '/')
sudo -u stemedb aws s3 sync \
s3://stemedb-backups-prod/${LATEST_BACKUP} \
/var/backups/stemedb/${LATEST_BACKUP} \
--region us-east-1
# Verify download
ls -lh /var/backups/stemedb/${LATEST_BACKUP}/
# Should show: backup-metadata.json, wal/, db/
cat /var/backups/stemedb/${LATEST_BACKUP}/backup-metadata.json
# Verify timestamp, file counts
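The manual `ls` check in Step 2 can be turned into a pass/fail test so that an incomplete download is caught before the restore begins. This is a sketch only: `verify_backup_layout` is a hypothetical helper, and it checks nothing beyond the three entries the step says should be present (`backup-metadata.json`, `wal/`, `db/`).

```bash
#!/usr/bin/env bash
# Sketch: fail fast if a downloaded backup lacks the expected layout.
# verify_backup_layout is a hypothetical helper, not a StemeDB command.
verify_backup_layout() {
  local dir=$1 missing=0 sub
  if [[ ! -f "$dir/backup-metadata.json" ]]; then
    echo "missing: backup-metadata.json" >&2
    missing=1
  fi
  for sub in wal db; do
    if [[ ! -d "$dir/$sub" ]]; then
      echo "missing: $sub/" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example:
#   verify_backup_layout "/var/backups/stemedb/${LATEST_BACKUP}" || exit 1
```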
Step 3: Download WAL segments since last backup (15 min)
# Get backup timestamp
BACKUP_TIMESTAMP=$(jq -r .timestamp /var/backups/stemedb/${LATEST_BACKUP}/backup-metadata.json)
echo "Backup timestamp: $BACKUP_TIMESTAMP"
# Download the WAL archive
# (the sync fetches every archived segment, not only those newer than the backup)
sudo -u stemedb mkdir -p /var/lib/stemedb/wal-archive
sudo -u stemedb aws s3 sync \
s3://stemedb-backups-prod/wal-archive/ \
/var/lib/stemedb/wal-archive/ \
--region us-east-1
# Count segments
WAL_COUNT=$(find /var/lib/stemedb/wal-archive -name "*.wal" | wc -l)
echo "Downloaded $WAL_COUNT WAL segments"
Step 4: Restore data directories (30 min)
# Restore from backup
sudo -u stemedb rsync -av \
/var/backups/stemedb/${LATEST_BACKUP}/wal/ \
/var/lib/stemedb/wal/
sudo -u stemedb rsync -av \
/var/backups/stemedb/${LATEST_BACKUP}/db/ \
/var/lib/stemedb/db/
# Copy archived WAL segments
sudo -u stemedb cp -r /var/lib/stemedb/wal-archive/*.wal /var/lib/stemedb/wal/
# Verify restoration
du -sh /var/lib/stemedb/{wal,db}
# Should match backup sizes + WAL archive
Step 5: Build and start StemeDB (30 min)
# Clone repository
cd /opt
sudo git clone https://github.com/yourusername/stemedb.git
sudo chown -R stemedb:stemedb /opt/stemedb
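# Build release binary
# (cargo must be on the stemedb user's PATH; the rustup install in Step 1
# only covers the user who ran it)
cd /opt/stemedb
sudo -u stemedb cargo build --release --bin stemedb-api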
# Install systemd unit
sudo cp docs/operations/deployment/systemd/stemedb-api.service /etc/systemd/system/
sudo systemctl daemon-reload
# Configure environment
sudo tee /etc/default/stemedb <<ENV
STEMEDB_BIND_ADDR=0.0.0.0:18180
STEMEDB_WAL_DIR=/var/lib/stemedb/wal
STEMEDB_DB_DIR=/var/lib/stemedb/db
RUST_LOG=info
ENV
# Start StemeDB (will auto-replay WAL)
sudo systemctl start stemedb-api
# Monitor startup
sudo journalctl -u stemedb-api -f
# Expected logs:
# "Starting WAL recovery..."
# "Replayed 15234 entries from WAL"
# "Rebuilding indexes..."
# "Startup complete, listening on 0.0.0.0:18180"
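Instead of watching the journal by hand, startup can be polled until the health endpoint answers. The sketch below is one way to do that under stated assumptions: `wait_until` is a hypothetical generic retry helper, and the health URL is the one used elsewhere in this runbook.

```bash
#!/usr/bin/env bash
# Sketch: retry a command until it succeeds or attempts run out.
# wait_until is a hypothetical helper, not a StemeDB command.
# Usage: wait_until <attempts> <delay_seconds> <command...>
wait_until() {
  local attempts=$1 delay=$2 i
  shift 2
  for (( i = 1; i <= attempts; i++ )); do
    if "$@"; then
      return 0
    fi
    if (( i < attempts )); then
      sleep "$delay"
    fi
  done
  return 1
}

# Example: poll the health endpoint for up to 5 minutes (60 x 5 s).
#   wait_until 60 5 curl -fsS -o /dev/null http://localhost:18180/v1/health \
#     || echo "StemeDB did not become healthy in time" >&2
```

`curl -f` makes HTTP error responses count as failures, so the loop keeps waiting while the server is still replaying WAL and not yet serving requests.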
Step 6: Validate recovery (30 min)
# Wait for startup to complete (watch journalctl)
# Then validate...
# Check health
curl http://localhost:18180/v1/health
# Expected:
# {
# "status": "healthy",
# "assertion_count": 105234,
# "wal_segments": 47,
# "uptime_seconds": 120
# }
# Verify assertion count matches expected
EXPECTED_COUNT=$(jq -r .assertion_count /var/backups/stemedb/${LATEST_BACKUP}/backup-metadata.json)
ACTUAL_COUNT=$(curl -s http://localhost:18180/v1/health | jq .assertion_count)
echo "Expected: $EXPECTED_COUNT"
echo "Actual: $ACTUAL_COUNT"
echo "Delta: $((ACTUAL_COUNT - EXPECTED_COUNT))"
# Delta should equal assertions from WAL replay
# (data added between backup and failure)
# Test query
curl -X POST http://localhost:18180/v1/query \
-H "Content-Type: application/json" \
-d '{
"concept_path": "test/dr",
"predicate": "recovered",
"lens": "recency"
}'
# Should return 200 (even if empty results)
# Test ingestion
curl -X POST http://localhost:18180/v1/assert \
-H "Content-Type: application/json" \
-d '{
"concept_path": "test/dr_validation",
"predicate": "restored",
"value": true,
"confidence": 1.0,
"authority_tier": "expert"
}'
# Should return 201 Created
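The assertion-count comparison in Step 6 can be made into an explicit check. This sketch assumes that a restored server should never report fewer assertions than the backup metadata records, since WAL replay only adds data; `check_replay_delta` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Sketch: compare the restored assertion count to the backup's count.
# check_replay_delta is a hypothetical helper, not a StemeDB command.
# Assumption: WAL replay only adds assertions, so a negative delta
# means data is missing.
check_replay_delta() {
  local expected=$1 actual=$2
  local delta=$(( actual - expected ))
  echo "delta=$delta"
  (( delta >= 0 ))
}

# Example:
#   check_replay_delta "$EXPECTED_COUNT" "$ACTUAL_COUNT" \
#     || echo "restored count is below the backup count" >&2
```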
Step 7: Resume operations (60 min)
# Update DNS (if IP changed)
# Point stemedb.yourdomain.com to new server IP
# Update load balancer (if using LB)
# Add new server to backend pool
# Enable backup automation
sudo systemctl enable stemedb-backup.timer
sudo systemctl start stemedb-backup.timer
sudo systemctl enable stemedb-archive-wal.timer
sudo systemctl start stemedb-archive-wal.timer
sudo systemctl enable stemedb-verify-backup.timer
sudo systemctl start stemedb-verify-backup.timer
# Verify timers
systemctl list-timers 'stemedb-*'
# Notify stakeholders
echo "StemeDB DR complete at $(date -u)" | mail -s "StemeDB DR Complete" oncall@yourcompany.com
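Total time: ~4 hours 15 minutes by the step estimates (30 + 60 + 15 + 30 + 30 + 30 + 60 = 255 min), slightly over the 4-hour RTO. The service itself is restored and validated at the end of Step 6 (~3 hours 15 minutes), which stays within RTO only if the DNS/LB changes in Step 7 are made promptly.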
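The per-step estimates in §1 can be summed against the RTO budget to see how much slack a drill has. A small sketch, where `sum_minutes` is a hypothetical helper and the numbers are the step durations listed above:

```bash
#!/usr/bin/env bash
# Sketch: add up step estimates and compare them with the RTO budget.
# sum_minutes is a hypothetical helper, not a StemeDB command.
sum_minutes() {
  local total=0 m
  for m in "$@"; do
    total=$(( total + m ))
  done
  echo "$total"
}

rto_minutes=$(( 4 * 60 ))                      # 4-hour RTO
total=$(sum_minutes 30 60 15 30 30 30 60)      # Steps 1-7 of §1
echo "total=${total} min, budget=${rto_minutes} min"
# prints: total=255 min, budget=240 min
```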
§2. Point-in-Time Restore with WAL Replay (RTO: 2 hours, RPO: 15 min)
Use case: Restore to specific timestamp (e.g., before bad data ingestion).
Step 1: Identify target timestamp
# Determine when bad data was ingested
# (from logs, monitoring, or user reports)
TARGET_TIMESTAMP="2026-02-11T14:30:00Z"
# Find backup immediately before target
aws s3 ls s3://stemedb-backups-prod/ | grep stemedb-backup | \
awk '{print $2}' | tr -d '/' | \
while read backup; do
BACKUP_TS=$(aws s3 cp s3://stemedb-backups-prod/${backup}/backup-metadata.json - | jq -r .timestamp)
if [[ "$BACKUP_TS" < "$TARGET_TIMESTAMP" ]]; then
echo "$backup ($BACKUP_TS)"
fi
done | tail -n1
# Use backup: stemedb-backup-20260211-120000 (2026-02-11T12:00:00Z)
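The selection logic in Step 1 can be separated from the S3 calls, which makes it easy to test without network access. In this sketch `latest_backup_before` is a hypothetical helper that reads `name timestamp` pairs on stdin; it relies on the timestamps being ISO 8601 UTC strings in a single format, so that string comparison matches chronological order.

```bash
#!/usr/bin/env bash
# Sketch: pick the newest backup whose timestamp is before the target.
# latest_backup_before is a hypothetical helper, not a StemeDB command.
# Input: lines of "<backup-name> <ISO-8601 UTC timestamp>" on stdin.
latest_backup_before() {
  local target=$1 name ts best="" best_ts=""
  while read -r name ts; do
    if [[ "$ts" < "$target" ]] && [[ -z "$best_ts" || "$ts" > "$best_ts" ]]; then
      best=$name
      best_ts=$ts
    fi
  done
  [[ -n "$best" ]] && echo "$best"
}

printf '%s\n' \
  'stemedb-backup-20260211-060000 2026-02-11T06:00:00Z' \
  'stemedb-backup-20260211-120000 2026-02-11T12:00:00Z' \
  'stemedb-backup-20260211-180000 2026-02-11T18:00:00Z' |
  latest_backup_before '2026-02-11T14:30:00Z'
# prints: stemedb-backup-20260211-120000
```

Unlike a `tail -n1` over the listing, this does not depend on the order in which backups are listed.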
Step 2: Restore base backup
Follow §1 steps 1-4, but use the identified backup instead of latest.
Step 3: Replay WAL to target timestamp
# Download all WAL segments between backup and target
sudo -u stemedb aws s3 sync \
s3://stemedb-backups-prod/wal-archive/ \
/var/lib/stemedb/wal-partial/ \
--region us-east-1
# Filter WAL segments by timestamp
# (Keep only segments whose mtime, rendered in UTC, is before the target timestamp)
for wal in /var/lib/stemedb/wal-partial/*.wal; do
WAL_TS=$(date -u -d "@$(stat -c %Y "$wal")" +%Y-%m-%dT%H:%M:%SZ)
if [[ "$WAL_TS" < "$TARGET_TIMESTAMP" ]]; then
sudo -u stemedb cp "$wal" /var/lib/stemedb/wal/
fi
done
# Start StemeDB (will replay filtered WAL)
sudo systemctl start stemedb-api
# Validate timestamp
LAST_ASSERTION_TS=$(curl -s http://localhost:18180/v1/query \
-H "Content-Type: application/json" \
-d '{"concept_path": "*", "lens": "recency", "limit": 1}' | \
jq -r '.assertions[0].timestamp')
echo "Last assertion timestamp: $LAST_ASSERTION_TS"
echo "Target timestamp: $TARGET_TIMESTAMP"
# Last assertion should be ≤ target
Total time: ~2 hours
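The segment filter in Step 3 compares timestamps as strings, which is only meaningful when both sides are rendered in UTC in the same format. The sketch below shows that conversion with GNU `date` and `stat` (as found on the Debian/Ubuntu hosts this runbook targets); `epoch_to_utc_iso` and `segment_before_target` are hypothetical helpers.

```bash
#!/usr/bin/env bash
# Sketch: UTC-safe timestamp handling for the WAL segment filter.
# Both functions are hypothetical helpers and assume GNU date and stat.

# Convert seconds since the epoch to an ISO 8601 UTC string.
epoch_to_utc_iso() {
  date -u -d "@$1" +%Y-%m-%dT%H:%M:%SZ
}

# Succeed if the file's mtime is strictly before the target timestamp.
segment_before_target() {
  local file=$1 target=$2 ts
  ts=$(epoch_to_utc_iso "$(stat -c %Y "$file")")
  [[ "$ts" < "$target" ]]
}

epoch_to_utc_iso 0
# prints: 1970-01-01T00:00:00Z

# Example:
#   segment_before_target "$wal" "$TARGET_TIMESTAMP" && cp "$wal" /var/lib/stemedb/wal/
```

The filter works at whole-segment granularity and uses file mtime as a stand-in for the segment's contents, so the restored state can end slightly before the target timestamp.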
§3. WAL-Only Recovery (RTO: 30 min, RPO: 0 min)
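Use case: Database intact, only recent WAL lost (e.g., WAL disk failure). Data loss is zero only if every lost segment had already been archived; otherwise up to one archival interval (15 min by default) is lost.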
Step 1: Verify database is intact
sudo systemctl stop stemedb-api
# Check DB directory
ls -lh /var/lib/stemedb/db/
# Should show: *.kv files, no corruption
# Check for errors
journalctl -u stemedb-api | tail -n100 | grep -i "db\|database\|storage"
# Should NOT show corruption errors
Step 2: Download archived WAL
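# Download all archived WAL segments
# (--delete removes local files that are not in the archive;
# move any surviving local segments aside first)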
sudo -u stemedb aws s3 sync \
s3://stemedb-backups-prod/wal-archive/ \
/var/lib/stemedb/wal/ \
--region us-east-1 \
--delete
# Verify download
ls -lh /var/lib/stemedb/wal/*.wal | wc -l
# Should show: N segments
Step 3: Start and replay
sudo systemctl start stemedb-api
# Monitor replay
sudo journalctl -u stemedb-api -f
# Expected:
# "Replayed 523 entries from WAL"
# "Startup complete"
# Validate
curl http://localhost:18180/v1/health | jq .assertion_count
# Should match expected count
Total time: ~30 min
Validation Checklist
After any DR procedure, validate:
- Server starts successfully
  systemctl status stemedb-api
  # Active (running)
- Health endpoint responds
  curl http://localhost:18180/v1/health
  # Returns 200 OK
- Assertion count correct
  # Compare to backup metadata or expected count
- Queries work
  curl -X POST http://localhost:18180/v1/query \
  -H "Content-Type: application/json" \
  -d '{"concept_path": "test", "lens": "recency"}'
  # Returns 200
- Ingestion works
  # Test write
  curl -X POST http://localhost:18180/v1/assert ...
  # 201 Created
- Backups resume
  systemctl is-active stemedb-backup.timer # active
  systemctl is-active stemedb-archive-wal.timer # active
- Metrics exporting
  curl http://localhost:18180/metrics | grep stemedb_
  # Shows metrics
- Alerts firing correctly
  curl http://prometheus:9090/api/v1/alerts | jq .
  # No backup alerts firing
- DNS/LB updated
  nslookup stemedb.yourdomain.com
  # Points to new IP (if changed)
RTO/RPO Metrics
| Scenario | RTO | RPO | Data Loss |
|---|---|---|---|
| Full restore from S3 | 4h | 15min | Last 15min of WAL |
| Point-in-time restore | 2h | variable | Controlled (to target timestamp) |
| WAL-only recovery | 30min | 0min | None (if WAL archived) |
Factors affecting RTO:
- S3 download speed (network bandwidth)
- Backup size (larger = slower restore)
- Server provisioning time (cloud vs. bare metal)
- DNS/LB propagation delay
Factors affecting RPO:
- WAL archival frequency (default: 15 min)
- Last successful backup age (default: 6h intervals)
- Time of failure (worst case: just before backup)
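The RPO factors above reduce to a simple worst-case rule, sketched below as arithmetic. This is a simplification under a stated assumption: if the WAL archive is usable, loss is bounded by the archival interval, and if it is not, loss is bounded by the backup interval. `worst_case_rpo_minutes` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Sketch: worst-case data loss window, in minutes.
# worst_case_rpo_minutes is a hypothetical helper, not a StemeDB command.
# Usage: worst_case_rpo_minutes <wal_interval_min> <backup_interval_min> <yes|no>
#   third argument: whether the WAL archive is usable for replay
worst_case_rpo_minutes() {
  local wal_interval=$1 backup_interval=$2 wal_usable=$3
  if [[ "$wal_usable" == yes ]]; then
    echo "$wal_interval"      # bounded by WAL archival frequency
  else
    echo "$backup_interval"   # falls back to the age of the last full backup
  fi
}

worst_case_rpo_minutes 15 360 yes   # prints: 15
worst_case_rpo_minutes 15 360 no    # prints: 360 (6 h backup interval)
```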
Post-DR Actions
Immediate (within 1 hour):
- Document incident
  - Create incident report
  - Record timeline (failure time, detection time, recovery time)
  - Note RTO/RPO achieved vs. target
- Verify monitoring
  - Check all alerts are firing correctly
  - Verify metrics are being collected
  - Test PagerDuty/Slack notifications
- Communicate status
  - Notify stakeholders of recovery completion
  - Update status page
  - Send post-mortem invite
Within 24 hours:
- Root cause analysis
  - Identify what caused failure
  - Determine if preventable
  - Create action items
- Test backups
  - Verify next backup completes
  - Validate verification passes
  - Check S3 uploads working
- Review procedures
  - Update runbook with lessons learned
  - Document any deviations from procedure
  - Propose improvements
Within 1 week:
- Conduct post-mortem
  - Blameless review with team
  - Identify process improvements
  - Create corrective actions
- Update documentation
  - Incorporate lessons learned
  - Update RTO/RPO estimates
  - Revise prerequisites
- Schedule DR drill
  - Test procedure again (quarterly)
  - Validate improvements
  - Train new team members
Common Pitfalls
1. Incomplete S3 sync
Symptom: Restore completes but assertion count too low.
Cause: S3 sync interrupted or incomplete.
Fix:
# Re-sync with --exact-timestamps
sudo -u stemedb aws s3 sync \
s3://stemedb-backups-prod/${BACKUP} \
/var/backups/stemedb/${BACKUP} \
--exact-timestamps \
--region us-east-1
2. WAL replay fails
Symptom: Server starts but assertion count wrong.
Cause: Corrupted WAL segment or version mismatch.
Fix:
# Check logs for specific segment
sudo journalctl -u stemedb-api | grep -i "wal.*error"
# If segment corrupted, skip it (accept data loss)
sudo mv /var/lib/stemedb/wal/segment-XXXXX.wal /tmp/
# Restart
sudo systemctl restart stemedb-api
3. Permissions incorrect
Symptom: Server won't start, permission denied errors.
Cause: Restored files owned by wrong user.
Fix:
sudo chown -R stemedb:stemedb /var/lib/stemedb
sudo chmod -R u=rwX,go=rX /var/lib/stemedb/wal
sudo chmod -R u=rwX,go=rX /var/lib/stemedb/db
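Before and after applying the fix, it can help to list exactly which restored files have the wrong owner. A minimal sketch using `find -user`; `files_not_owned_by` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Sketch: list files under a directory that are not owned by a given user.
# files_not_owned_by is a hypothetical helper, not a StemeDB command.
# Usage: files_not_owned_by <directory> <user name or numeric uid>
files_not_owned_by() {
  local dir=$1 user=$2
  find "$dir" ! -user "$user" -print
}

# Example: empty output means ownership is correct.
#   files_not_owned_by /var/lib/stemedb stemedb
```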
4. DNS not updated
Symptom: Clients can't connect to restored server.
Cause: DNS still pointing to old IP.
Fix:
# Update DNS record
# (method varies by DNS provider)
# Verify propagation
dig stemedb.yourdomain.com +short
# Should return new IP
DR Drill Procedure
Frequency: Quarterly (every 90 days)
Purpose: Validate DR procedures, train team, measure RTO/RPO.
Steps:
- Schedule drill (at least 1 week notice)
- Provision staging environment (separate from prod)
- Execute DR procedure (§1 Full Restore)
- Measure RTO/RPO achieved
- Document results (drill report)
- Review with team (post-drill retro)
- Update runbook (incorporate learnings)
Drill report template:
# DR Drill Report - YYYY-MM-DD
## Summary
- Date: YYYY-MM-DD HH:MM UTC
- Participants: [names]
- Scenario: Full restore from S3
- Result: ✅ Success / ⚠️ Partial / ❌ Failed
## Metrics
- RTO Target: 4 hours
- RTO Achieved: X hours Y min
- RPO Target: 15 min
- RPO Achieved: X min
- Data Loss: X assertions (expected)
## Timeline
- HH:MM - Drill started
- HH:MM - Server provisioned
- HH:MM - Backup downloaded
- HH:MM - WAL downloaded
- HH:MM - Data restored
- HH:MM - Service started
- HH:MM - Validation complete
- HH:MM - Drill complete
## Issues Encountered
1. [Issue description]
- Impact: [how it affected RTO]
- Resolution: [how it was fixed]
- Preventive action: [how to avoid next time]
## Lessons Learned
- [Lesson 1]
- [Lesson 2]
## Action Items
- [ ] [Action item 1] - Owner: [name] - Due: [date]
- [ ] [Action item 2] - Owner: [name] - Due: [date]
## Runbook Updates
- [Change 1: reason]
- [Change 2: reason]
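The "RTO Achieved: X hours Y min" field in the template can be computed from the first and last timeline entries. This sketch assumes GNU `date` and ISO 8601 UTC timestamps; `elapsed_hm` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Sketch: compute elapsed time between two timestamps for the drill report.
# elapsed_hm is a hypothetical helper and assumes GNU date.
# Usage: elapsed_hm <start timestamp> <end timestamp>
elapsed_hm() {
  local start_s end_s diff
  start_s=$(date -u -d "$1" +%s) || return 1
  end_s=$(date -u -d "$2" +%s) || return 1
  diff=$(( end_s - start_s ))
  (( diff >= 0 )) || return 1
  echo "$(( diff / 3600 )) hours $(( (diff % 3600) / 60 )) min"
}

elapsed_hm '2026-02-11T10:00:00Z' '2026-02-11T13:45:00Z'
# prints: 3 hours 45 min
```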
Related Runbooks
- Restore from Backup - Non-disaster restore scenarios
- Server Won't Start - Startup failures
- Disk Full - Storage management
Last Updated
2026-02-12 (P5.3 Implementation)