jml 3e7eddc074 feat: add enterprise production readiness infrastructure

This commit implements comprehensive production hardening across multiple
layers to prepare StemeDB for enterprise pilot deployments:

## API Layer
- Add rate limiting middleware with configurable limits per endpoint
- Enhance error handling with detailed context and proper HTTP status codes
- Add security hardening tests for input validation and boundary conditions
- Create store_helpers module for defensive storage access patterns

## Storage & WAL
- Optimize group commit batching for higher throughput
- Add defensive error handling in hybrid backend with proper fallbacks
- Enhance WAL journal durability guarantees with fsync validation
- Improve index store query performance with better caching

## Operations & Deployment
- Add comprehensive operations documentation (deployment, monitoring, DR)
- Create systemd units for backup, WAL archival, and verification
- Add monitoring configs (Prometheus alerts, metrics exporters)
- Implement backup/restore scripts with verification and S3 archival
- Add DR drill automation and runbook procedures
- Create load balancer configs (nginx, envoy) with health checks

## Documentation
- Update CLAUDE.md with operations and troubleshooting guides
- Expand roadmap with production readiness milestones
- Add pilot success criteria and deployment reference architecture
- Document TLS setup, monitoring integration, and incident response

## Configuration
- Add .env.example with all required environment variables
- Document resource sizing for different deployment scales
- Add configuration examples for various deployment topologies

This positions StemeDB for successful enterprise pilots with proper
operational discipline, monitoring, backup/DR, and security hardening.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

2026-02-12 06:08:15 +00:00

13 KiB

Raw Blame History

Runbook: Restore from Backup

Symptom

Data loss after hardware failure, corruption, or operator error
WAL corruption preventing server startup
Need to rollback to known-good state
Assertion count doesn't match expected values
Database inconsistency detected

Metrics Alerts:

N/A (typically discovered during incident response)

Quick Diagnosis

Need to restore
    │
    ├─► Data loss (hardware failure, operator error)?
    │   └─► §1 Complete Restore
    │
    ├─► WAL corruption on startup?
    │   └─► §2 WAL-Only Restore
    │
    ├─► Need to rollback to specific point in time?
    │   └─► §3 Point-in-Time Restore
    │
    └─► Database inconsistency (assertion count mismatch)?
        └─► §4 Validation and Rebuild

Common Causes

Hardware failure — Likelihood: 30%
- Disk failure
- Power loss during write
- Network storage disconnection
WAL corruption — Likelihood: 25%
- Unclean shutdown (OOM kill, crash)
- Disk corruption
- Version mismatch after upgrade
Operator error — Likelihood: 20%
- Accidentally deleted data directory
- Wrong command executed
- Misconfigured deployment
Software bug — Likelihood: 15%
- Database corruption bug
- Index inconsistency
- Replication failure (cluster)
Disaster recovery test — Likelihood: 10%
- Scheduled DR validation
- Migration to new infrastructure

Prerequisites

Before starting restore:

Backup available:

ls -lh backups/
# Should show: stemedb-backup-YYYYMMDD-HHMMSS/

Backup metadata valid:

cat backups/stemedb-backup-*/metadata.json
# Should show: version, timestamp, assertion_count

Server stopped:

sudo systemctl stop stemedb-api
sudo systemctl status stemedb-api
# Should show: inactive (dead)

Disk space available:
```
df -h
# Need: 2x backup size available
```

Resolution Steps

§1. Complete Restore (Full Recovery)

Use case: Data loss, complete restoration needed

Diagnostic:

# Verify backup integrity
BACKUP_DIR="backups/stemedb-backup-20260211-100000"  # Replace with your backup

# Check metadata
cat $BACKUP_DIR/metadata.json

# Expected output:
# {
#   "version": "0.1.0",
#   "timestamp": "2026-02-11T10:00:00Z",
#   "assertion_count": 10234,
#   "wal_segment_count": 15,
#   "backup_type": "full"
# }

# Check directory structure
ls -lh $BACKUP_DIR/
# Should show: wal/ db/ metadata.json

Resolution: Use restore script

# Run restore script (safe - renames existing dirs, never deletes)
sudo ./scripts/restore-stemedb.sh $BACKUP_DIR

# Expected output:
# Stopping StemeDB API service...
# Renaming existing data/wal to data/wal.backup.20260211-103045
# Renaming existing data/db to data/db.backup.20260211-103045
# Copying WAL from backup...
# Copying DB from backup...
# Copying metadata...
# Restore complete. Starting StemeDB API service...
# StemeDB API service started successfully.

Validate restore:

# Check health endpoint
curl http://localhost:18180/v1/health

# Expected output:
# {
#   "status": "healthy",
#   "version": "0.1.0",
#   "uptime_seconds": 5,
#   "assertion_count": 10234  # Should match backup metadata
# }

# Verify metadata matches
cat data/metadata.json
# Should match backup metadata.json

# Test query
curl -X POST http://localhost:18180/v1/query \
  -H "Content-Type: application/json" \
  -d '{"concept_path": "test/restore", "lens": "recency"}'
# Should return 200 (even if empty results)

If failed: Health check shows different assertion_count → See §4 Validation and Rebuild.

§2. WAL-Only Restore (Preserve Database)

Use case: WAL corrupted but database intact

⚠️ WARNING: This preserves existing database but replaces WAL. Only use if confident database is uncorrupted.

Diagnostic:

# Check for WAL errors
journalctl -u stemedb-api -n 50 | grep -i wal

# Common errors indicating WAL corruption:
# - "WAL magic byte validation failed"
# - "Checksum mismatch in WAL segment"
# - "Failed to recover WAL"

# Verify database is intact
ls -lh data/db/
# Should show: *.kv files, indexes, no corruption messages

Resolution: Manual WAL replacement

# Stop server
sudo systemctl stop stemedb-api

# Backup corrupted WAL for forensics
sudo mv data/wal data/wal.corrupted.$(date +%Y%m%d-%H%M%S)

# Restore WAL from backup
BACKUP_DIR="backups/stemedb-backup-20260211-100000"
sudo cp -r $BACKUP_DIR/wal data/wal

# Set correct permissions
sudo chown -R stemedb:stemedb data/wal/
sudo chmod -R 755 data/wal/

# Start server (will replay WAL and rebuild indexes)
sudo systemctl start stemedb-api

# Monitor startup
journalctl -u stemedb-api -f

# Expected logs:
# "Starting WAL recovery..."
# "Replayed 1523 entries from WAL"
# "Rebuilding indexes..."
# "Startup complete"

Validate WAL recovery:

# Check health
curl http://localhost:18180/v1/health

# Check metrics for WAL operations
curl http://localhost:18180/metrics | grep wal_

# Should show:
# wal_segments_total{...} 15
# wal_fsync_latency_seconds{...} <0.1

If failed: Server still won't start with restored WAL → Perform complete restore (§1).

§3. Point-in-Time Restore

Use case: Rollback to specific timestamp (e.g., before bad data ingestion)

⚠️ NOTE: StemeDB is append-only, so this is "restore + filter" not true PITR.

Diagnostic:

# Identify when bad data was ingested
curl http://localhost:18180/v1/query \
  -H "Content-Type: application/json" \
  -d '{"concept_path": "bad/data/path", "lens": "recency"}' | jq '.assertions[0].timestamp'

# Find backup before this timestamp
ls -lh backups/ | grep "before-timestamp"

Resolution: Restore + retraction

# Step 1: Restore from backup before bad data
sudo ./scripts/restore-stemedb.sh backups/stemedb-backup-20260210-230000

# Step 2: Start server
sudo systemctl start stemedb-api

# Step 3: If bad data source is known, retract it
curl -X POST http://localhost:18180/v1/retract \
  -H "Content-Type: application/json" \
  -d '{
    "concept_path": "source/bad_source",
    "reason": "data_quality_issue",
    "cascade": true
  }'

# This marks source and all dependent assertions as retracted

Validate rollback:

# Check assertion count
curl http://localhost:18180/v1/health | jq '.assertion_count'
# Should be less than current (rolled back)

# Verify bad data is gone
curl -X POST http://localhost:18180/v1/query \
  -H "Content-Type: application/json" \
  -d '{"concept_path": "bad/data/path", "lens": "recency"}'
# Should return empty or show retracted status

If failed: Bad data still present → May need to filter WAL before replay (requires engineering support).

§4. Validation and Rebuild

Use case: Inconsistency detected, indexes corrupted

Diagnostic:

# Check health assertion_count vs expected
curl http://localhost:18180/v1/health | jq '.assertion_count'
HEALTH_COUNT=10234

cat data/metadata.json | jq '.assertion_count'
METADATA_COUNT=10500

# If mismatch → Inconsistency detected

# Check for index errors
journalctl -u stemedb-api | grep -i "index"

Resolution: Rebuild indexes from WAL

# Stop server
sudo systemctl stop stemedb-api

# Backup existing database
sudo cp -r data/db data/db.backup.$(date +%Y%m%d-%H%M%S)

# Remove indexes (will be rebuilt on startup)
sudo rm -rf data/db/indexes/

# Start server (triggers full index rebuild)
sudo systemctl start stemedb-api

# Monitor rebuild progress
journalctl -u stemedb-api -f

# Expected logs:
# "Index rebuild started..."
# "Rebuilding predicate index from 10234 assertions..."
# "Rebuilding concept index..."
# "Index rebuild complete in 3.4s"

Validate rebuild:

# Check health
curl http://localhost:18180/v1/health

# Verify assertion_count matches metadata
HEALTH_COUNT=$(curl -s http://localhost:18180/v1/health | jq '.assertion_count')
METADATA_COUNT=$(cat data/metadata.json | jq '.assertion_count')

echo "Health: $HEALTH_COUNT, Metadata: $METADATA_COUNT"
# Should match

# Test query
curl -X POST http://localhost:18180/v1/query \
  -H "Content-Type: application/json" \
  -d '{"concept_path": "test/validation", "lens": "recency"}'
# Should return 200 with results

If failed: Rebuild fails or counts still mismatch → Perform complete restore (§1) from known-good backup.

Validation

After any restore procedure, validate system health:

Server starts successfully

systemctl status stemedb-api
# Should show: active (running)

Health endpoint returns correct count

curl http://localhost:18180/v1/health | jq '.assertion_count'
# Should match backup metadata.json

Queries succeed

curl -X POST http://localhost:18180/v1/query \
  -H "Content-Type: application/json" \
  -d '{"concept_path": "test/restore", "lens": "recency"}'
# Should return 200

Ingest works

curl -X POST http://localhost:18180/v1/assert \
  -H "Content-Type: application/json" \
  -d '{
    "concept_path": "test/restore_validation",
    "predicate": "restored",
    "value": true,
    "confidence": 0.95
  }'
# Should return 201 Created

Metrics are valid

curl http://localhost:18180/metrics | grep stemedb_
# Should show all metrics with reasonable values

Dashboard loads
- Open http://localhost:18188/
- Should show current assertion count
- No errors in browser console

Backup Script Reference

Script location: /home/jml/Workspace/stemedb/scripts/backup-stemedb.sh

Usage:

# Manual backup
sudo ./scripts/backup-stemedb.sh

# Scheduled backup (cron)
0 2 * * * /path/to/backup-stemedb.sh >> /var/log/stemedb-backup.log 2>&1

Backup structure:

backups/stemedb-backup-20260211-100000/
├── metadata.json          # Backup metadata
├── wal/                   # Write-ahead log
│   ├── segment-00001.log
│   ├── segment-00002.log
│   └── ...
└── db/                    # Database files
    ├── assertions.kv
    ├── indexes/
    └── ...

Restore script location: /home/jml/Workspace/stemedb/scripts/restore-stemedb.sh

Safety features:

Never deletes existing data (renames to .backup.TIMESTAMP)
Validates backup metadata before restore
Stops/starts service automatically
Logs all operations

Recovery Time Objective (RTO)

Pilot 5 targets:

Deployment	Backup Size	RTO Target	Actual (tested)
Single-node pilot	<10K assertions	2 hours	15 minutes
Three-node cluster	<100K assertions	5 minutes	30 minutes

Factors affecting RTO:

Backup size
Network bandwidth (if backup on remote storage)
Disk I/O speed
Index rebuild time

Recovery Point Objective (RPO)

Pilot 5 targets:

Deployment	Backup Frequency	RPO Target	Data Loss Window
Single-node pilot	Daily	24 hours	Last backup to failure
Three-node cluster	Hourly	1 hour	Last backup to failure

Reducing RPO:

Increase backup frequency (cron schedule)
Use continuous replication (cluster)
Enable WAL archival to S3 (roadmap P6.4)

Prevention

Automated Backups

Set up daily backup cron:

# Edit crontab
sudo crontab -e

# Add daily backup at 2 AM
0 2 * * * /home/jml/Workspace/stemedb/scripts/backup-stemedb.sh >> /var/log/stemedb-backup.log 2>&1

# Verify cron job
sudo crontab -l

Set up backup retention:

# Keep last 7 daily backups
find backups/ -name "stemedb-backup-*" -type d -mtime +7 -exec rm -rf {} \;

# Add to cron (after backup)
0 3 * * * find /path/to/backups -name "stemedb-backup-*" -type d -mtime +7 -exec rm -rf {} \;

Backup Validation

Monthly DR test:

# Test restore on staging environment
# 1. Copy production backup to staging
scp -r prod:/backups/latest staging:/backups/test

# 2. Restore on staging
ssh staging "sudo ./scripts/restore-stemedb.sh /backups/test"

# 3. Validate
ssh staging "curl http://localhost:18180/v1/health"

# 4. Document results
echo "$(date): DR test passed, assertion_count: 10234" >> dr-test-log.txt

Monitoring

Set up alerts for:

# Prometheus alert rules
groups:
  - name: stemedb_backups
    rules:
      - alert: StemeDBBackupMissing
        expr: time() - stemedb_last_backup_timestamp_seconds > 86400
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "StemeDB backup missing (>24 hours)"

      - alert: StemeDBBackupFailed
        expr: stemedb_backup_failures_total > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "StemeDB backup failed"

Server Won't Start - WAL corruption scenarios
Disk Full - Backup storage management
High Query Latency - Index rebuild performance

Last Updated

2026-02-11

13 KiB Raw Blame History

Runbook: Restore from Backup

Symptom

Quick Diagnosis

Common Causes

Prerequisites

Resolution Steps

§1. Complete Restore (Full Recovery)

§2. WAL-Only Restore (Preserve Database)

§3. Point-in-Time Restore

§4. Validation and Rebuild

Validation

Backup Script Reference

Recovery Time Objective (RTO)

Recovery Point Objective (RPO)

Prevention

Automated Backups

Backup Validation

Monitoring

Related Runbooks

Last Updated

13 KiB

Raw Blame History