jml 3e7eddc074 feat: add enterprise production readiness infrastructure

This commit implements comprehensive production hardening across multiple
layers to prepare StemeDB for enterprise pilot deployments:

## API Layer
- Add rate limiting middleware with configurable limits per endpoint
- Enhance error handling with detailed context and proper HTTP status codes
- Add security hardening tests for input validation and boundary conditions
- Create store_helpers module for defensive storage access patterns

## Storage & WAL
- Optimize group commit batching for higher throughput
- Add defensive error handling in hybrid backend with proper fallbacks
- Enhance WAL journal durability guarantees with fsync validation
- Improve index store query performance with better caching

## Operations & Deployment
- Add comprehensive operations documentation (deployment, monitoring, DR)
- Create systemd units for backup, WAL archival, and verification
- Add monitoring configs (Prometheus alerts, metrics exporters)
- Implement backup/restore scripts with verification and S3 archival
- Add DR drill automation and runbook procedures
- Create load balancer configs (nginx, envoy) with health checks

## Documentation
- Update CLAUDE.md with operations and troubleshooting guides
- Expand roadmap with production readiness milestones
- Add pilot success criteria and deployment reference architecture
- Document TLS setup, monitoring integration, and incident response

## Configuration
- Add .env.example with all required environment variables
- Document resource sizing for different deployment scales
- Add configuration examples for various deployment topologies

This positions StemeDB for successful enterprise pilots with proper
operational discipline, monitoring, backup/DR, and security hardening.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

2026-02-12 06:08:15 +00:00


Runbook: Disaster Recovery

Overview

Purpose: Restore StemeDB from backup after catastrophic failure.

RTO (Recovery Time Objective): 4 hours

RPO (Recovery Point Objective): 15 minutes

Scope: Complete server failure, data center outage, or regional disaster requiring restore from backups.

When to Use This Runbook

Use this runbook for:

Complete server failure - Hardware dead, cannot boot
Data center outage - Entire DC offline, need to restore elsewhere
Disk failure - Storage completely lost, no local recovery possible
Ransomware/corruption - Data encrypted or corrupted, need clean restore
Regional disaster - DR drill or actual disaster requiring failover

Do NOT use for:

Single node failure in cluster → Use cluster failover instead
WAL corruption → Use Restore from Backup §2
Index rebuild → Use Restore from Backup §4

Prerequisites

Before starting DR, ensure:

New server provisioned (or existing server with clean disk)
S3 access configured (credentials, network access to S3)
Dependencies installed (Rust, PostgreSQL if using external stores)
Stakeholders notified (team knows DR is in progress)
DNS/load balancer updated (if changing server IP)

Minimum server specs:

CPU: 4 cores
RAM: 16GB
Disk: 2x backup size (for restore + buffer)
Network: 1Gbps (for S3 downloads)
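
The disk requirement above can be checked before a restore begins. The following is a minimal sketch, not part of the StemeDB tooling: `has_restore_headroom` is a hypothetical helper that takes the backup size and the free space in bytes and applies the 2x rule from the list.

```shell
# Hypothetical helper (not part of StemeDB): succeeds when free space is at
# least twice the backup size, per the "2x backup size" requirement.
has_restore_headroom() {
    backup_bytes=$1
    free_bytes=$2
    [ "$free_bytes" -ge $((backup_bytes * 2)) ]
}

# Example with literal numbers: 40 GiB backup, 100 GiB free.
gib=$((1024 * 1024 * 1024))
if has_restore_headroom $((40 * gib)) $((100 * gib)); then
    echo "disk OK"
else
    echo "disk too small"
fi
# prints: disk OK
```

On a real server the free-space figure could come from something like `df --output=avail -B1 /var/lib/stemedb | tail -n1` (GNU coreutils), with the backup size taken from the backup metadata or the S3 listing.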

Decision Tree

Disaster scenario
    │
    ├─► Complete restore needed?
    │   └─► §1 Full Restore from S3
    │
    ├─► Point-in-time restore needed?
    │   └─► §2 Point-in-Time Restore with WAL Replay
    │
    └─► Only recent data lost?
        └─► §3 WAL-Only Recovery

Resolution Steps

§1. Full Restore from S3 (RTO: 4 hours, RPO: 15 minutes)

Use case: Complete data loss, restore everything from S3.

Step 1: Provision new server (30 min)

# Install dependencies
sudo apt update
sudo apt install -y awscli build-essential pkg-config libssl-dev postgresql-client

# Install Rust (-y accepts the default options without prompting)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
source "$HOME/.cargo/env"

# Create stemedb user
sudo useradd -r -s /bin/bash -d /var/lib/stemedb -m stemedb

# Create data directories
sudo mkdir -p /var/lib/stemedb/{wal,db}
sudo chown -R stemedb:stemedb /var/lib/stemedb

Step 2: Download latest full backup from S3 (60 min)

# List available backups
aws s3 ls s3://stemedb-backups-prod/ | grep stemedb-backup

# Expected output:
#                            PRE stemedb-backup-20260211-060000/
#                            PRE stemedb-backup-20260211-120000/
#                            PRE stemedb-backup-20260211-180000/  ← Latest

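# Create the local backup directory (must be writable by the stemedb user)
sudo mkdir -p /var/backups/stemedb
sudo chown stemedb:stemedb /var/backups/stemedb

# Download latest full backup
LATEST_BACKUP=$(aws s3 ls s3://stemedb-backups-prod/ | grep stemedb-backup | tail -n1 | awk '{print $2}' | tr -d '/')
sudo -u stemedb aws s3 sync \
    "s3://stemedb-backups-prod/${LATEST_BACKUP}" \
    "/var/backups/stemedb/${LATEST_BACKUP}" \
    --region us-east-1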

# Verify download
ls -lh /var/backups/stemedb/${LATEST_BACKUP}/
# Should show: backup-metadata.json, wal/, db/

cat /var/backups/stemedb/${LATEST_BACKUP}/backup-metadata.json
# Verify timestamp, file counts

Step 3: Download WAL segments since last backup (15 min)

# Get backup timestamp
BACKUP_TIMESTAMP=$(jq -r .timestamp /var/backups/stemedb/${LATEST_BACKUP}/backup-metadata.json)
echo "Backup timestamp: $BACKUP_TIMESTAMP"

# Download archived WAL segments
# (this syncs the entire wal-archive/ prefix, not only segments newer than $BACKUP_TIMESTAMP)
sudo -u stemedb mkdir -p /var/lib/stemedb/wal-archive
sudo -u stemedb aws s3 sync \
    s3://stemedb-backups-prod/wal-archive/ \
    /var/lib/stemedb/wal-archive/ \
    --region us-east-1

# Count segments
WAL_COUNT=$(find /var/lib/stemedb/wal-archive -name "*.wal" | wc -l)
echo "Downloaded $WAL_COUNT WAL segments"

Step 4: Restore data directories (30 min)

# Restore from backup
sudo -u stemedb rsync -av \
    /var/backups/stemedb/${LATEST_BACKUP}/wal/ \
    /var/lib/stemedb/wal/

sudo -u stemedb rsync -av \
    /var/backups/stemedb/${LATEST_BACKUP}/db/ \
    /var/lib/stemedb/db/

# Copy archived WAL segments
sudo -u stemedb cp -r /var/lib/stemedb/wal-archive/*.wal /var/lib/stemedb/wal/

# Verify restoration
du -sh /var/lib/stemedb/{wal,db}
# Should match backup sizes + WAL archive

Step 5: Build and start StemeDB (30 min)

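# Clone repository
cd /opt
sudo git clone https://github.com/yourusername/stemedb.git

# Build release binary as the current user: Rust was installed for this user
# in Step 1, so cargo is not on the stemedb user's PATH
sudo chown -R "$USER" /opt/stemedb
cd /opt/stemedb
cargo build --release --bin stemedb-api

# Hand the tree over to the service user
sudo chown -R stemedb:stemedb /opt/stemedb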

# Install systemd unit
sudo cp docs/operations/deployment/systemd/stemedb-api.service /etc/systemd/system/
sudo systemctl daemon-reload

# Configure environment
sudo tee /etc/default/stemedb <<ENV
STEMEDB_BIND_ADDR=0.0.0.0:18180
STEMEDB_WAL_DIR=/var/lib/stemedb/wal
STEMEDB_DB_DIR=/var/lib/stemedb/db
RUST_LOG=info
ENV

# Start StemeDB (will auto-replay WAL)
sudo systemctl start stemedb-api

# Monitor startup
sudo journalctl -u stemedb-api -f

# Expected logs:
# "Starting WAL recovery..."
# "Replayed 15234 entries from WAL"
# "Rebuilding indexes..."
# "Startup complete, listening on 0.0.0.0:18180"

Step 6: Validate recovery (30 min)

# Wait for startup to complete (watch journalctl)
# Then validate...

# Check health
curl http://localhost:18180/v1/health

# Expected:
# {
#   "status": "healthy",
#   "assertion_count": 105234,
#   "wal_segments": 47,
#   "uptime_seconds": 120
# }

# Verify assertion count matches expected
EXPECTED_COUNT=$(jq -r .assertion_count /var/backups/stemedb/${LATEST_BACKUP}/backup-metadata.json)
ACTUAL_COUNT=$(curl -s http://localhost:18180/v1/health | jq .assertion_count)

echo "Expected: $EXPECTED_COUNT"
echo "Actual: $ACTUAL_COUNT"
echo "Delta: $((ACTUAL_COUNT - EXPECTED_COUNT))"

# Delta should equal assertions from WAL replay
# (data added between backup and failure)

# Test query
curl -X POST http://localhost:18180/v1/query \
  -H "Content-Type: application/json" \
  -d '{
    "concept_path": "test/dr",
    "predicate": "recovered",
    "lens": "recency"
  }'

# Should return 200 (even if empty results)

# Test ingestion
curl -X POST http://localhost:18180/v1/assert \
  -H "Content-Type: application/json" \
  -d '{
    "concept_path": "test/dr_validation",
    "predicate": "restored",
    "value": true,
    "confidence": 1.0,
    "authority_tier": "expert"
  }'

# Should return 201 Created
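
The validation above can be scripted so that it waits for the service and then checks the count delta. This is a sketch under assumptions: `retry_until` and `check_assertion_delta` are hypothetical helpers, not StemeDB commands, and the example uses stand-in values instead of live calls.

```shell
# Hypothetical helper: run a command up to max_attempts times, pausing
# delay_seconds between failed attempts. Returns 0 on the first success.
retry_until() {
    max_attempts=$1
    delay_seconds=$2
    shift 2
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        if "$@"; then
            return 0
        fi
        attempt=$((attempt + 1))
        if [ "$attempt" -le "$max_attempts" ]; then
            sleep "$delay_seconds"
        fi
    done
    return 1
}

# Hypothetical helper: print (actual - expected) and fail if the restored
# count is lower than the count recorded in the backup metadata.
check_assertion_delta() {
    expected=$1
    actual=$2
    delta=$((actual - expected))
    echo "$delta"
    [ "$delta" -ge 0 ]
}

# Example with stand-in values. A real run would poll a command such as
#   curl -sf http://localhost:18180/v1/health
# and pass $EXPECTED_COUNT and $ACTUAL_COUNT from this step.
retry_until 3 0 true && echo "service is up"
check_assertion_delta 100000 105234
# prints: service is up
# prints: 5234
```

A negative delta means the restored server holds fewer assertions than the backup recorded, which points to an incomplete restore and not to missing WAL.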

Step 7: Resume operations (60 min)

# Update DNS (if IP changed)
# Point stemedb.yourdomain.com to new server IP

# Update load balancer (if using LB)
# Add new server to backend pool

# Enable backup automation
sudo systemctl enable stemedb-backup.timer
sudo systemctl start stemedb-backup.timer

sudo systemctl enable stemedb-archive-wal.timer
sudo systemctl start stemedb-archive-wal.timer

sudo systemctl enable stemedb-verify-backup.timer
sudo systemctl start stemedb-verify-backup.timer

# Verify timers
systemctl list-timers 'stemedb-*'

# Notify stakeholders
echo "StemeDB DR complete at $(date -u)" | mail -s "StemeDB DR Complete" oncall@yourcompany.com

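Total time: ~4 hours 15 minutes when the steps run strictly in sequence, slightly over the 4-hour RTO; overlap the Step 2 and Step 3 downloads to stay within it.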
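
The per-step estimates in §1 can be totalled and compared with the RTO budget. A small sketch of that arithmetic follows; `sum_minutes` is an illustrative helper, not part of any StemeDB tooling.

```shell
# Illustrative helper: add up a list of durations given in minutes.
sum_minutes() {
    total=0
    for m in "$@"; do
        total=$((total + m))
    done
    echo "$total"
}

# Step estimates from §1 (Steps 1 through 7), in minutes.
total=$(sum_minutes 30 60 15 30 30 30 60)
echo "total: ${total} min ($((total / 60))h $((total % 60))m), RTO budget: 240 min"
# prints: total: 255 min (4h 15m), RTO budget: 240 min
```

The same helper can be reused in drill reports to compare measured step durations with these estimates.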

§2. Point-in-Time Restore with WAL Replay (RTO: 2 hours, RPO: 15 min)

Use case: Restore to specific timestamp (e.g., before bad data ingestion).

Step 1: Identify target timestamp

# Determine when bad data was ingested
# (from logs, monitoring, or user reports)
TARGET_TIMESTAMP="2026-02-11T14:30:00Z"

# Find backup immediately before target
aws s3 ls s3://stemedb-backups-prod/ | grep stemedb-backup | \
  awk '{print $2}' | tr -d '/' | \
  while read backup; do
    BACKUP_TS=$(aws s3 cp s3://stemedb-backups-prod/${backup}/backup-metadata.json - | jq -r .timestamp)
    if [[ "$BACKUP_TS" < "$TARGET_TIMESTAMP" ]]; then
      echo "$backup ($BACKUP_TS)"
    fi
  done | tail -n1

# Use backup: stemedb-backup-20260211-120000 (2026-02-11T12:00:00Z)
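
The selection rule in this step (latest backup strictly before the target) can be isolated from the S3 calls. The sketch below assumes the names and timestamps have already been collected as "name timestamp" pairs; `backup_before` is a hypothetical helper.

```shell
# Hypothetical helper: reads "name timestamp" pairs on stdin and prints the
# name whose timestamp is the latest one strictly earlier than the target.
# ISO 8601 UTC timestamps in the same format compare correctly as strings.
backup_before() {
    awk -v target="$1" '
        BEGIN { best = ""; best_ts = "" }
        $2 < target && $2 > best_ts { best = $1; best_ts = $2 }
        END { if (best == "") exit 1; print best }
    '
}

backup_before "2026-02-11T14:30:00Z" <<'EOF'
stemedb-backup-20260211-060000 2026-02-11T06:00:00Z
stemedb-backup-20260211-120000 2026-02-11T12:00:00Z
stemedb-backup-20260211-180000 2026-02-11T18:00:00Z
EOF
# prints: stemedb-backup-20260211-120000
```

The function exits non-zero when no backup predates the target, which is worth treating as a hard stop before any restore work begins.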

Step 2: Restore base backup

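Follow §1 Steps 1, 2, and 4, using the identified backup instead of the latest. Skip §1 Step 3 and the "Copy archived WAL segments" command in §1 Step 4: copying the full archive into the WAL directory would replay data past the target timestamp. Archived segments are filtered and copied in Step 3 below.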

Step 3: Replay WAL to target timestamp

# Download all WAL segments between backup and target
sudo -u stemedb aws s3 sync \
    s3://stemedb-backups-prod/wal-archive/ \
    /var/lib/stemedb/wal-partial/ \
    --region us-east-1

# Filter WAL segments by timestamp
# (Keep only segments last modified before the target timestamp.
# The time is rendered in UTC so it is comparable with the Z-suffixed target.)
for wal in /var/lib/stemedb/wal-partial/*.wal; do
    WAL_TS=$(date -u -d "@$(stat -c %Y "$wal")" +%Y-%m-%dT%H:%M:%SZ)
    if [[ "$WAL_TS" < "$TARGET_TIMESTAMP" ]]; then
        sudo -u stemedb cp "$wal" /var/lib/stemedb/wal/
    fi
done

# Start StemeDB (will replay filtered WAL)
sudo systemctl start stemedb-api

# Validate timestamp
LAST_ASSERTION_TS=$(curl -s http://localhost:18180/v1/query \
  -H "Content-Type: application/json" \
  -d '{"concept_path": "*", "lens": "recency", "limit": 1}' | \
  jq -r '.assertions[0].timestamp')

echo "Last assertion timestamp: $LAST_ASSERTION_TS"
echo "Target timestamp: $TARGET_TIMESTAMP"
# Last assertion should be ≤ target
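
The segment filter in this step depends on two details: rendering a file's modification time as a UTC timestamp, and comparing ISO 8601 strings. The sketch below shows both in isolation. It assumes GNU `date` (for `-d @EPOCH`), and both function names are hypothetical.

```shell
# Hypothetical helper: render seconds-since-epoch as an ISO 8601 UTC string.
# Assumes a `date` that accepts -d @EPOCH (GNU coreutils).
epoch_to_utc_iso() {
    date -u -d "@$1" +%Y-%m-%dT%H:%M:%SZ
}

# Hypothetical helper: succeed when timestamp $1 sorts strictly before $2.
# Both must use the same fixed-width UTC format for this to be meaningful.
is_before() {
    awk -v a="$1" -v b="$2" 'BEGIN { exit !(a < b) }'
}

epoch_to_utc_iso 0
# prints: 1970-01-01T00:00:00Z

is_before "2026-02-11T14:00:00Z" "2026-02-11T14:30:00Z" && echo "keep segment"
# prints: keep segment
```

Rendering in local time while keeping the `Z` suffix would shift every comparison by the server's UTC offset, so the conversion is pinned to UTC.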

Total time: ~2 hours

§3. WAL-Only Recovery (RTO: 30 min, RPO: 0 min)

Use case: Database intact, only recent WAL lost (e.g., WAL disk failure).

Step 1: Verify database is intact

sudo systemctl stop stemedb-api

# Check DB directory
ls -lh /var/lib/stemedb/db/
# Should show: *.kv files, no corruption

# Check for errors
journalctl -u stemedb-api | tail -n100 | grep -i "db\|database\|storage"
# Should NOT show corruption errors

Step 2: Download archived WAL

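# Download all archived WAL segments
# WARNING: --delete removes local segments that are not in the archive.
# Drop the flag if any local WAL survived and has not been archived yet.
sudo -u stemedb aws s3 sync \
    s3://stemedb-backups-prod/wal-archive/ \
    /var/lib/stemedb/wal/ \
    --region us-east-1 \
    --delete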

# Verify download
ls -lh /var/lib/stemedb/wal/*.wal | wc -l
# Should show: N segments

Step 3: Start and replay

sudo systemctl start stemedb-api

# Monitor replay
sudo journalctl -u stemedb-api -f

# Expected:
# "Replayed 523 entries from WAL"
# "Startup complete"

# Validate
curl http://localhost:18180/v1/health | jq .assertion_count
# Should match expected count

Total time: ~30 min

Validation Checklist

After any DR procedure, validate:

Server starts successfully

systemctl status stemedb-api
# Active (running)

Health endpoint responds

curl http://localhost:18180/v1/health
# Returns 200 OK

Assertion count correct

# Compare to backup metadata or expected count

Queries work

curl -X POST http://localhost:18180/v1/query \
  -H "Content-Type: application/json" \
  -d '{"concept_path": "test", "lens": "recency"}'
# Returns 200

Ingestion works

# Test write
curl -X POST http://localhost:18180/v1/assert ... # 201 Created

Backups resume

systemctl is-active stemedb-backup.timer  # active
systemctl is-active stemedb-archive-wal.timer  # active

Metrics exporting

curl http://localhost:18180/metrics | grep stemedb_
# Shows metrics

Alerts firing correctly

curl http://prometheus:9090/api/v1/alerts | jq .
# No backup alerts firing

DNS/LB updated

nslookup stemedb.yourdomain.com
# Points to new IP (if changed)
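
The checklist above can be run as a single pass that tallies results. The sketch below shows one way to structure that; `run_check` is a hypothetical helper, and the two example checks are stand-ins for the real commands.

```shell
# Hypothetical checklist runner: each check is a name plus a command.
# The command's output is discarded; only its exit status is recorded.
passed=0
failed=0
run_check() {
    name=$1
    shift
    if "$@" >/dev/null 2>&1; then
        passed=$((passed + 1))
        echo "PASS  $name"
    else
        failed=$((failed + 1))
        echo "FAIL  $name"
    fi
}

# Stand-in checks to show the output format.
run_check "always passes" true
run_check "always fails" false
echo "passed=$passed failed=$failed"
# prints: PASS  always passes
# prints: FAIL  always fails
# prints: passed=1 failed=1
```

Real entries would wrap the commands from the checklist, for example `run_check "service active" systemctl is-active --quiet stemedb-api` or `run_check "health endpoint" curl -sf http://localhost:18180/v1/health`.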

RTO/RPO Metrics

Scenario	RTO	RPO	Data Loss
Full restore from S3	4h	15min	Last 15min of WAL
Point-in-time restore	2h	variable	Controlled (to target timestamp)
WAL-only recovery	30min	0min	None (if WAL archived)

Factors affecting RTO:

S3 download speed (network bandwidth)
Backup size (larger = slower restore)
Server provisioning time (cloud vs. bare metal)
DNS/LB propagation delay

Factors affecting RPO:

WAL archival frequency (default: 15 min)
Last successful backup age (default: 6h intervals)
Time of failure (worst case: just before backup)
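
The RPO factors above combine into a simple worst-case bound. The sketch below is a simplified model, not a measurement: it assumes data loss is bounded by the WAL archival interval when the archive is usable, and by the full backup interval when it is not. `worst_case_rpo_min` is a hypothetical helper.

```shell
# Hypothetical model of worst-case data loss in minutes.
#   $1 = WAL archival interval (minutes)
#   $2 = full backup interval (minutes)
#   $3 = "yes" if the WAL archive is usable, anything else otherwise
worst_case_rpo_min() {
    wal_interval_min=$1
    backup_interval_min=$2
    wal_archive_usable=$3
    if [ "$wal_archive_usable" = "yes" ]; then
        echo "$wal_interval_min"
    else
        echo "$backup_interval_min"
    fi
}

# Defaults from this runbook: 15 min WAL archival, 6 h (360 min) backups.
worst_case_rpo_min 15 360 yes   # prints: 15
worst_case_rpo_min 15 360 no    # prints: 360
```

The second case shows why WAL archival health matters for the 15-minute RPO: without it, the bound falls back to the backup interval.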

Post-DR Actions

Immediate (within 1 hour):

Document incident
- Create incident report
- Record timeline (failure time, detection time, recovery time)
- Note RTO/RPO achieved vs. target
Verify monitoring
- Check all alerts are firing correctly
- Verify metrics are being collected
- Test PagerDuty/Slack notifications
Communicate status
- Notify stakeholders of recovery completion
- Update status page
- Send post-mortem invite

Within 24 hours:

Root cause analysis
- Identify what caused failure
- Determine if preventable
- Create action items
Test backups
- Verify next backup completes
- Validate verification passes
- Check S3 uploads working
Review procedures
- Update runbook with lessons learned
- Document any deviations from procedure
- Propose improvements

Within 1 week:

Conduct post-mortem
- Blameless review with team
- Identify process improvements
- Create corrective actions
Update documentation
- Incorporate lessons learned
- Update RTO/RPO estimates
- Revise prerequisites
Schedule DR drill
- Test procedure again (quarterly)
- Validate improvements
- Train new team members

Common Pitfalls

1. Incomplete S3 sync

Symptom: Restore completes but assertion count too low.

Cause: S3 sync interrupted or incomplete.

Fix:

# Re-sync with --exact-timestamps
# (BACKUP is the backup name, e.g. the LATEST_BACKUP value from §1 Step 2)
sudo -u stemedb aws s3 sync \
    s3://stemedb-backups-prod/${BACKUP} \
    /var/backups/stemedb/${BACKUP} \
    --exact-timestamps \
    --region us-east-1
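
An incomplete sync can also be caught before the server is started by comparing the local file count with the expected one. In this sketch the expected count is passed in as an argument, because the exact field name in backup-metadata.json is not shown in this runbook; both helpers are hypothetical.

```shell
# Hypothetical helper: count regular files under a directory.
count_files() {
    find "$1" -type f | wc -l | tr -d ' '
}

# Hypothetical helper: succeed when the directory holds exactly $2 files.
sync_looks_complete() {
    [ "$(count_files "$1")" -eq "$2" ]
}

# Example on a throwaway directory shaped like a backup.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/wal" "$demo_dir/db"
touch "$demo_dir/backup-metadata.json" "$demo_dir/wal/a.wal" "$demo_dir/db/a.kv"
count_files "$demo_dir"
# prints: 3
sync_looks_complete "$demo_dir" 3 && echo "file count matches"
# prints: file count matches
rm -rf "$demo_dir"
```

A matching count does not prove the contents are intact, so it complements the re-sync above without replacing it.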

2. WAL replay fails

Symptom: Server starts but assertion count wrong.

Cause: Corrupted WAL segment or version mismatch.

Fix:

# Check logs for specific segment
sudo journalctl -u stemedb-api | grep -i "wal.*error"

# If segment corrupted, skip it (accept data loss)
sudo mv /var/lib/stemedb/wal/segment-XXXXX.wal /tmp/

# Restart
sudo systemctl restart stemedb-api

3. Permissions incorrect

Symptom: Server won't start, permission denied errors.

Cause: Restored files owned by wrong user.

Fix:

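sudo chown -R stemedb:stemedb /var/lib/stemedb
# Capital X adds the execute bit to directories (and to files that are
# already executable), so data files are not made executable
sudo chmod -R u=rwX,go=rX /var/lib/stemedb/wal
sudo chmod -R u=rwX,go=rX /var/lib/stemedb/db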
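
Before or after applying the fix, it can help to list exactly which restored paths have the wrong owner. A minimal sketch follows; `list_wrong_owner` is a hypothetical helper built on the standard `find ! -user` test.

```shell
# Hypothetical helper: print every path under $1 that is NOT owned by
# user $2 (a user name or a numeric UID).
list_wrong_owner() {
    find "$1" ! -user "$2"
}

# Example on a throwaway directory owned by the current user.
demo_dir=$(mktemp -d)
touch "$demo_dir/segment.wal"
list_wrong_owner "$demo_dir" "$(id -u)"
# prints nothing: every path is owned by the current user
rm -rf "$demo_dir"
```

On the restored server the call would be along the lines of `sudo sh -c 'find /var/lib/stemedb ! -user stemedb'`; empty output means ownership is consistent.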

4. DNS not updated

Symptom: Clients can't connect to restored server.

Cause: DNS still pointing to old IP.

Fix:

# Update DNS record
# (method varies by DNS provider)

# Verify propagation
dig stemedb.yourdomain.com +short
# Should return new IP

DR Drill Procedure

Frequency: Quarterly (every 90 days)

Purpose: Validate DR procedures, train team, measure RTO/RPO.

Steps:

Schedule drill (at least 1 week notice)
Provision staging environment (separate from prod)
Execute DR procedure (§1 Full Restore)
Measure RTO/RPO achieved
Document results (drill report)
Review with team (post-drill retro)
Update runbook (incorporate learnings)

Drill report template:

# DR Drill Report - YYYY-MM-DD

## Summary
- Date: YYYY-MM-DD HH:MM UTC
- Participants: [names]
- Scenario: Full restore from S3
- Result: ✅ Success / ⚠️ Partial / ❌ Failed

## Metrics
- RTO Target: 4 hours
- RTO Achieved: X hours Y min
- RPO Target: 15 min
- RPO Achieved: X min
- Data Loss: X assertions (expected)

## Timeline
- HH:MM - Drill started
- HH:MM - Server provisioned
- HH:MM - Backup downloaded
- HH:MM - WAL downloaded
- HH:MM - Data restored
- HH:MM - Service started
- HH:MM - Validation complete
- HH:MM - Drill complete

## Issues Encountered
1. [Issue description]
   - Impact: [how it affected RTO]
   - Resolution: [how it was fixed]
   - Preventive action: [how to avoid next time]

## Lessons Learned
- [Lesson 1]
- [Lesson 2]

## Action Items
- [ ] [Action item 1] - Owner: [name] - Due: [date]
- [ ] [Action item 2] - Owner: [name] - Due: [date]

## Runbook Updates
- [Change 1: reason]
- [Change 2: reason]
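
The "RTO Achieved" figure in the report can be derived from the first and last Timeline entries. The sketch below computes elapsed minutes from two HH:MM values. Both helper names are hypothetical, and it assumes the drill lasts less than 24 hours.

```shell
# Hypothetical helper: convert HH:MM to minutes since midnight.
to_minutes() {
    hh=${1%%:*}
    mm=${1##*:}
    # Strip one leading zero so values such as "08" are not read as octal.
    hh=${hh#0}
    mm=${mm#0}
    hh=${hh:-0}
    mm=${mm:-0}
    echo $((hh * 60 + mm))
}

# Hypothetical helper: minutes elapsed between two HH:MM times (same clock).
elapsed_min() {
    start=$(to_minutes "$1")
    end=$(to_minutes "$2")
    diff=$((end - start))
    # A drill that crosses midnight wraps around.
    if [ "$diff" -lt 0 ]; then
        diff=$((diff + 1440))
    fi
    echo "$diff"
}

elapsed_min 06:10 09:55
# prints: 225
```

Here 225 minutes is 3 hours 45 minutes, which would be recorded as "RTO Achieved: 3 hours 45 min" against the 4-hour target.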

Related Runbooks

Restore from Backup - Non-disaster restore scenarios
Server Won't Start - Startup failures
Disk Full - Storage management

Last Updated

2026-02-12 (P5.3 Implementation)
