# Runbook: Restore from Backup

## Symptom

- Data loss after hardware failure, corruption, or operator error
- WAL corruption preventing server startup
- Need to roll back to a known-good state
- Assertion count doesn't match expected values
- Database inconsistency detected

**Metrics Alerts:**
- N/A (typically discovered during incident response)

---

## Quick Diagnosis

```
Need to restore
    │
    ├─► Data loss (hardware failure, operator error)?
    │   └─► §1 Complete Restore
    │
    ├─► WAL corruption on startup?
    │   └─► §2 WAL-Only Restore
    │
    ├─► Need to rollback to specific point in time?
    │   └─► §3 Point-in-Time Restore
    │
    └─► Database inconsistency (assertion count mismatch)?
        └─► §4 Validation and Rebuild
```

---

## Common Causes

1. **Hardware failure** (likelihood: **30%**)
   - Disk failure
   - Power loss during write
   - Network storage disconnection

2. **WAL corruption** (likelihood: **25%**)
   - Unclean shutdown (OOM kill, crash)
   - Disk corruption
   - Version mismatch after upgrade

3. **Operator error** (likelihood: **20%**)
   - Accidentally deleted data directory
   - Wrong command executed
   - Misconfigured deployment

4. **Software bug** (likelihood: **15%**)
   - Database corruption bug
   - Index inconsistency
   - Replication failure (cluster)

5. **Disaster recovery test** (likelihood: **10%**)
   - Scheduled DR validation
   - Migration to new infrastructure

---

## Prerequisites

**Before starting restore:**

- [ ] **Backup available:**
  ```bash
  ls -lh backups/
  # Should show: stemedb-backup-YYYYMMDD-HHMMSS/
  ```

- [ ] **Backup metadata valid:**
  ```bash
  cat backups/stemedb-backup-*/metadata.json
  # Should show: version, timestamp, assertion_count
  ```

- [ ] **Server stopped:**
  ```bash
  sudo systemctl stop stemedb-api
  sudo systemctl status stemedb-api
  # Should show: inactive (dead)
  ```

- [ ] **Disk space available:**
  ```bash
  du -sh backups/stemedb-backup-*/   # Size of each backup
  df -h
  # Need: free space of at least 2x the backup size
  ```

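The 2x rule in the last check can be scripted so it is not judged by eye. This is a minimal sketch, assuming `du -sk` and `df -Pk` are available; `has_restore_headroom` and its arguments are illustrative names, not part of the StemeDB scripts. A call such as `has_restore_headroom backups/stemedb-backup-20260211-100000 data/` returns non-zero when the target filesystem is too full.

```bash
# Hypothetical helper (not part of the StemeDB scripts): succeed only if the
# filesystem holding TARGET_DIR has at least 2x the backup's size free.
has_restore_headroom() {
  local backup_dir="$1" target_dir="$2"
  local backup_kb avail_kb
  # Total size of the backup in KiB.
  backup_kb=$(du -sk "$backup_dir" | awk '{print $1}')
  # df -P keeps each filesystem on one line; column 4 is the available KiB.
  avail_kb=$(df -Pk "$target_dir" | awk 'NR == 2 {print $4}')
  echo "backup: ${backup_kb} KiB, available: ${avail_kb} KiB"
  [ "$avail_kb" -ge $((backup_kb * 2)) ]
}
```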
---

## Resolution Steps

### §1. Complete Restore (Full Recovery)

**Use case:** Data loss, complete restoration needed

**Diagnostic:**
```bash
# Verify backup integrity
BACKUP_DIR="backups/stemedb-backup-20260211-100000"  # Replace with your backup

# Check metadata
cat "$BACKUP_DIR/metadata.json"

# Expected output:
# {
#   "version": "0.1.0",
#   "timestamp": "2026-02-11T10:00:00Z",
#   "assertion_count": 10234,
#   "wal_segment_count": 15,
#   "backup_type": "full"
# }

# Check directory structure
ls -lh "$BACKUP_DIR/"
# Should show: wal/ db/ metadata.json
```

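The manual checks above can be folded into one pre-flight function that fails fast on an incomplete backup. This is a sketch under assumptions: `check_backup_layout` is a hypothetical name, and the required keys are taken from the Prerequisites list (`version`, `timestamp`, `assertion_count`). The key test is a plain-text match, so it confirms the keys are present but does not validate the JSON. The restore script may already perform its own validation; this only mirrors the steps shown here.

```bash
# Hypothetical pre-flight check: confirm a backup has the layout and
# metadata keys this runbook expects before handing it to the restore script.
check_backup_layout() {
  local dir="$1" key
  [ -d "$dir/wal" ] || { echo "missing wal/ in $dir" >&2; return 1; }
  [ -d "$dir/db" ] || { echo "missing db/ in $dir" >&2; return 1; }
  [ -f "$dir/metadata.json" ] || { echo "missing metadata.json in $dir" >&2; return 1; }
  # Plain-text key check; it does not parse or validate the JSON.
  for key in version timestamp assertion_count; do
    grep -q "\"$key\"" "$dir/metadata.json" \
      || { echo "metadata.json lacks \"$key\"" >&2; return 1; }
  done
  echo "backup layout OK: $dir"
}
```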
**Resolution: Use restore script**

```bash
# Run restore script (safe: renames existing dirs, never deletes)
sudo ./scripts/restore-stemedb.sh "$BACKUP_DIR"

# Expected output:
# Stopping StemeDB API service...
# Renaming existing data/wal to data/wal.backup.20260211-103045
# Renaming existing data/db to data/db.backup.20260211-103045
# Copying WAL from backup...
# Copying DB from backup...
# Copying metadata...
# Restore complete. Starting StemeDB API service...
# StemeDB API service started successfully.
```

**Validate restore:**
```bash
# Check health endpoint
curl http://localhost:18180/v1/health

# Expected output:
# {
#   "status": "healthy",
#   "version": "0.1.0",
#   "uptime_seconds": 5,
#   "assertion_count": 10234   # Should match backup metadata
# }

# Verify metadata matches
cat data/metadata.json
# Should match backup metadata.json

# Test query
curl -X POST http://localhost:18180/v1/query \
  -H "Content-Type: application/json" \
  -d '{"concept_path": "test/restore", "lens": "recency"}'
# Should return 200 (even if empty results)
```

**If failed:** Health check shows a different `assertion_count` than the backup metadata → See §4 Validation and Rebuild.

---

### §2. WAL-Only Restore (Preserve Database)

**Use case:** WAL corrupted but database intact

⚠️ **WARNING:** This preserves the existing database but replaces the WAL. Only use it if you are confident the database is uncorrupted.

**Diagnostic:**
```bash
# Check for WAL errors
journalctl -u stemedb-api -n 50 | grep -i wal

# Common errors indicating WAL corruption:
# - "WAL magic byte validation failed"
# - "Checksum mismatch in WAL segment"
# - "Failed to recover WAL"

# Verify database is intact
ls -lh data/db/
# Should list *.kv files and indexes without I/O errors
```

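Grepping for `wal` alone also matches routine WAL log lines. A narrower filter can look only for the three corruption messages listed above. This is a sketch: `wal_corruption_lines` is a hypothetical helper, and the message strings are copied from this runbook, so they may need adjusting if the server's log wording differs between versions. It reads log text on stdin, for example `journalctl -u stemedb-api -n 200 --no-pager | wal_corruption_lines`.

```bash
# Hypothetical filter: read log text on stdin and print only the lines that
# contain one of the WAL corruption messages listed in this runbook.
# Exit status follows grep: 0 if any line matched, 1 if none did.
wal_corruption_lines() {
  grep -i \
    -e 'WAL magic byte validation failed' \
    -e 'Checksum mismatch in WAL segment' \
    -e 'Failed to recover WAL'
}
```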
**Resolution: Manual WAL replacement**

```bash
# Stop server
sudo systemctl stop stemedb-api

# Back up corrupted WAL for forensics
sudo mv data/wal data/wal.corrupted.$(date +%Y%m%d-%H%M%S)

# Restore WAL from backup
BACKUP_DIR="backups/stemedb-backup-20260211-100000"
sudo cp -r "$BACKUP_DIR/wal" data/wal

# Set correct permissions
sudo chown -R stemedb:stemedb data/wal/
sudo chmod -R 755 data/wal/

# Start server (will replay WAL and rebuild indexes)
sudo systemctl start stemedb-api

# Monitor startup
journalctl -u stemedb-api -f

# Expected logs:
# "Starting WAL recovery..."
# "Replayed 1523 entries from WAL"
# "Rebuilding indexes..."
# "Startup complete"
```

**Validate WAL recovery:**
```bash
# Check health
curl http://localhost:18180/v1/health

# Check metrics for WAL operations
curl -s http://localhost:18180/metrics | grep wal_

# Should show:
# wal_segments_total{...} 15
# wal_fsync_latency_seconds{...} <0.1
```

**If failed:** Server still won't start with restored WAL → Perform complete restore (§1).

---

### §3. Point-in-Time Restore

**Use case:** Roll back to a specific timestamp (e.g., before bad data ingestion)

⚠️ **NOTE:** StemeDB is append-only, so this is "restore + filter", not true PITR.

**Diagnostic:**
```bash
# Identify when bad data was ingested
curl -s http://localhost:18180/v1/query \
  -H "Content-Type: application/json" \
  -d '{"concept_path": "bad/data/path", "lens": "recency"}' | jq '.assertions[0].timestamp'

# Find a backup taken before this timestamp
# (directory names embed the backup time as YYYYMMDD-HHMMSS)
ls -1d backups/stemedb-backup-* | sort
```

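Picking the right backup by eye is error-prone during an incident. Because the directory names embed the time as `YYYYMMDD-HHMMSS`, they sort chronologically, and the newest backup before a cutoff can be selected mechanically. This is a sketch only: `latest_backup_before` is a hypothetical helper, and it assumes the cutoff has already been converted to the same format and timezone as the directory names (the API timestamp shown above is likely in a different format). For example, `latest_backup_before backups 20260211-093000` prints the chosen directory, or exits non-zero if no backup is old enough.

```bash
# Hypothetical helper: print the newest backup directory whose name timestamp
# is strictly earlier than CUTOFF (format YYYYMMDD-HHMMSS, same timezone as
# the backup names). Relies on the names sorting chronologically.
latest_backup_before() {
  local backups_dir="$1" cutoff="$2"
  find "$backups_dir" -mindepth 1 -maxdepth 1 -type d -name 'stemedb-backup-*' \
    | sort \
    | awk -F/ -v limit="stemedb-backup-$cutoff" '
        # Compare only the directory name (last path component) as a string.
        $NF < limit { pick = $0 }
        END { if (pick == "") exit 1; print pick }'
}
```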
**Resolution: Restore + retraction**

```bash
# Step 1: Restore from backup before bad data
sudo ./scripts/restore-stemedb.sh backups/stemedb-backup-20260210-230000

# Step 2: Make sure the server is running (the restore script normally starts it)
sudo systemctl start stemedb-api

# Step 3: If bad data source is known, retract it
curl -X POST http://localhost:18180/v1/retract \
  -H "Content-Type: application/json" \
  -d '{
    "concept_path": "source/bad_source",
    "reason": "data_quality_issue",
    "cascade": true
  }'

# This marks source and all dependent assertions as retracted
```

**Validate rollback:**
```bash
# Check assertion count
curl -s http://localhost:18180/v1/health | jq '.assertion_count'
# Should be lower than the count before the restore (rolled back)

# Verify bad data is gone
curl -X POST http://localhost:18180/v1/query \
  -H "Content-Type: application/json" \
  -d '{"concept_path": "bad/data/path", "lens": "recency"}'
# Should return empty or show retracted status
```

**If failed:** Bad data still present → May need to filter WAL before replay (requires engineering support).

---

### §4. Validation and Rebuild

**Use case:** Inconsistency detected, indexes corrupted

**Diagnostic:**
```bash
# Compare the live assertion_count with the count recorded in metadata
HEALTH_COUNT=$(curl -s http://localhost:18180/v1/health | jq '.assertion_count')
METADATA_COUNT=$(jq '.assertion_count' data/metadata.json)
echo "Health: $HEALTH_COUNT, Metadata: $METADATA_COUNT"
# Example: Health: 10234, Metadata: 10500

# If mismatch → Inconsistency detected

# Check for index errors
journalctl -u stemedb-api | grep -i "index"
```

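The count comparison is easier to reuse in scripts and DR drills when it returns an exit status rather than relying on someone reading two numbers. This is a minimal sketch; `compare_assertion_counts` is a hypothetical helper that takes the two counts as arguments, so it can be fed from the health endpoint and `metadata.json` or from any other source.

```bash
# Hypothetical helper: compare the live assertion count with the count
# recorded in metadata.json. Prints a one-line verdict and returns non-zero
# on mismatch so it can gate later steps in a script.
compare_assertion_counts() {
  local health="$1" metadata="$2"
  if [ "$health" -eq "$metadata" ]; then
    echo "OK: counts match ($health)"
  else
    echo "MISMATCH: health=$health metadata=$metadata diff=$((metadata - health))"
    return 1
  fi
}
```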
**Resolution: Rebuild indexes from WAL**

```bash
# Stop server
sudo systemctl stop stemedb-api

# Backup existing database
sudo cp -r data/db data/db.backup.$(date +%Y%m%d-%H%M%S)

# Remove indexes (will be rebuilt on startup)
sudo rm -rf data/db/indexes/

# Start server (triggers full index rebuild)
sudo systemctl start stemedb-api

# Monitor rebuild progress
journalctl -u stemedb-api -f

# Expected logs:
# "Index rebuild started..."
# "Rebuilding predicate index from 10234 assertions..."
# "Rebuilding concept index..."
# "Index rebuild complete in 3.4s"
```

**Validate rebuild:**
```bash
# Check health
curl http://localhost:18180/v1/health

# Verify assertion_count matches metadata
HEALTH_COUNT=$(curl -s http://localhost:18180/v1/health | jq '.assertion_count')
METADATA_COUNT=$(cat data/metadata.json | jq '.assertion_count')

echo "Health: $HEALTH_COUNT, Metadata: $METADATA_COUNT"
# Should match

# Test query
curl -X POST http://localhost:18180/v1/query \
  -H "Content-Type: application/json" \
  -d '{"concept_path": "test/validation", "lens": "recency"}'
# Should return 200 with results
```

**If failed:** Rebuild fails or counts still mismatch → Perform complete restore (§1) from known-good backup.

---

## Validation

After any restore procedure, validate system health:

- [ ] **Server starts successfully**
  ```bash
  systemctl status stemedb-api
  # Should show: active (running)
  ```

- [ ] **Health endpoint returns correct count**
  ```bash
  curl -s http://localhost:18180/v1/health | jq '.assertion_count'
  # Should match backup metadata.json
  ```

- [ ] **Queries succeed**
  ```bash
  curl -X POST http://localhost:18180/v1/query \
    -H "Content-Type: application/json" \
    -d '{"concept_path": "test/restore", "lens": "recency"}'
  # Should return 200
  ```

- [ ] **Ingest works**
  ```bash
  curl -X POST http://localhost:18180/v1/assert \
    -H "Content-Type: application/json" \
    -d '{
      "concept_path": "test/restore_validation",
      "predicate": "restored",
      "value": true,
      "confidence": 0.95
    }'
  # Should return 201 Created
  ```

- [ ] **Metrics are valid**
  ```bash
  curl -s http://localhost:18180/metrics | grep stemedb_
  # Should show all metrics with reasonable values
  ```

- [ ] **Dashboard loads**
  - Open http://localhost:18188/
  - Should show current assertion count
  - No errors in browser console

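
Right after a restart the health endpoint may not answer yet, so a single `curl` can fail even when the restore succeeded. A small retry wrapper avoids false alarms in scripted validation. This is a sketch: `retry_until_ok` is a hypothetical helper, not part of the StemeDB scripts, and the attempt count and delay are arbitrary. A typical call would be `retry_until_ok 10 3 curl -fsS http://localhost:18180/v1/health`, where `-f` makes `curl` return non-zero on HTTP error responses.

```bash
# Hypothetical helper: run a command until it succeeds, at most ATTEMPTS
# times, sleeping DELAY seconds between tries. Returns 0 on the first
# success and 1 if every attempt failed.
retry_until_ok() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    # Only sleep if another attempt is coming.
    if [ "$i" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}
```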
---

## Backup Script Reference

**Script location:** `/home/jml/Workspace/stemedb/scripts/backup-stemedb.sh`

**Usage:**
```bash
# Manual backup
sudo ./scripts/backup-stemedb.sh

# Scheduled backup (crontab entry)
0 2 * * * /path/to/backup-stemedb.sh >> /var/log/stemedb-backup.log 2>&1
```

**Backup structure:**
```
backups/stemedb-backup-20260211-100000/
├── metadata.json          # Backup metadata
├── wal/                   # Write-ahead log
│   ├── segment-00001.log
│   ├── segment-00002.log
│   └── ...
└── db/                    # Database files
    ├── assertions.kv
    ├── indexes/
    └── ...
```

**Restore script location:** `/home/jml/Workspace/stemedb/scripts/restore-stemedb.sh`

**Safety features:**
- Never deletes existing data (renames to `.backup.TIMESTAMP`)
- Validates backup metadata before restore
- Stops/starts service automatically
- Logs all operations

---

## Recovery Time Objective (RTO)

**Pilot 5 targets:**

| Deployment | Backup Size | RTO Target | Actual (tested) |
|------------|-------------|------------|-----------------|
| Single-node pilot | <10K assertions | 2 hours | 15 minutes |
| Three-node cluster | <100K assertions | 5 minutes | 30 minutes |

**Factors affecting RTO:**
- Backup size
- Network bandwidth (if backup on remote storage)
- Disk I/O speed
- Index rebuild time

---

## Recovery Point Objective (RPO)

**Pilot 5 targets:**

| Deployment | Backup Frequency | RPO Target | Data Loss Window |
|------------|------------------|------------|------------------|
| Single-node pilot | Daily | 24 hours | Last backup to failure |
| Three-node cluster | Hourly | 1 hour | Last backup to failure |

**Reducing RPO:**
- Increase backup frequency (cron schedule)
- Use continuous replication (cluster)
- Enable WAL archival to S3 (roadmap P6.4)

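
The "last backup to failure" window is simple arithmetic on two timestamps. As a worked example, a daily backup at 02:00 followed by a failure at 23:00 the same day loses 21 hours of writes, which is within the 24-hour single-node target. The helper below is a sketch with a hypothetical name; it takes Unix epoch seconds and rounds down to whole hours.

```bash
# Hypothetical helper: whole hours of writes lost if a failure happens at
# FAILURE_EPOCH and the newest backup was taken at BACKUP_EPOCH
# (both in Unix seconds). Integer division rounds down.
data_loss_window_hours() {
  local backup_epoch="$1" failure_epoch="$2"
  echo $(( (failure_epoch - backup_epoch) / 3600 ))
}
```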
---

## Prevention

### Automated Backups

**Set up daily backup cron:**
```bash
# Edit crontab
sudo crontab -e

# Add daily backup at 2 AM
0 2 * * * /home/jml/Workspace/stemedb/scripts/backup-stemedb.sh >> /var/log/stemedb-backup.log 2>&1

# Verify cron job
sudo crontab -l
```

**Set up backup retention:**
```bash
# Delete backups older than 7 days
# (-maxdepth 1 stops find from descending into directories it has just removed)
find backups/ -mindepth 1 -maxdepth 1 -name "stemedb-backup-*" -type d -mtime +7 -exec rm -rf {} +

# Add to cron (after backup)
0 3 * * * find /path/to/backups -mindepth 1 -maxdepth 1 -name "stemedb-backup-*" -type d -mtime +7 -exec rm -rf {} +
```

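Age-based retention has a failure mode: if backups silently stop running, the cleanup job keeps deleting until nothing is left. Count-based retention (keep the N newest) avoids that. The sketch below is a dry run that only lists what would be pruned; `backups_to_prune` is a hypothetical helper that relies on the `YYYYMMDD-HHMMSS` names sorting chronologically. Reviewing its output before piping it to any delete command is the safer habit.

```bash
# Hypothetical dry-run helper: list the backup directories that fall outside
# the KEEP newest ones, judged by the timestamp in the directory name.
# It only prints paths; nothing is deleted.
backups_to_prune() {
  local backups_dir="$1" keep="$2"
  find "$backups_dir" -mindepth 1 -maxdepth 1 -type d -name 'stemedb-backup-*' \
    | sort -r \
    | awk -v keep="$keep" 'NR > keep'
}
```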
### Backup Validation

**Monthly DR test:**
```bash
# Test restore on staging environment
# 1. Copy production backup to staging
scp -r prod:/backups/latest staging:/backups/test

# 2. Restore on staging
ssh staging "sudo ./scripts/restore-stemedb.sh /backups/test"

# 3. Validate
ssh staging "curl -s http://localhost:18180/v1/health"

# 4. Document results (records the count reported by staging)
COUNT=$(ssh staging "curl -s http://localhost:18180/v1/health" | jq '.assertion_count')
echo "$(date): DR test passed, assertion_count: $COUNT" >> dr-test-log.txt
```

### Monitoring

**Set up alerts for missing and failed backups:**
```yaml
# Prometheus alert rules
groups:
  - name: stemedb_backups
    rules:
      - alert: StemeDBBackupMissing
        expr: time() - stemedb_last_backup_timestamp_seconds > 86400
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "StemeDB backup missing (>24 hours)"

      - alert: StemeDBBackupFailed
        # increase() lets the alert clear once failures stop; a raw
        # counter > 0 would keep firing until the counter resets.
        expr: increase(stemedb_backup_failures_total[1h]) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "StemeDB backup failed"
```

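The rules above assume a `stemedb_last_backup_timestamp_seconds` metric exists; this runbook does not say what produces it. One common pattern, offered here as an assumption rather than a description of StemeDB's exporters, is for the backup job to write the value to a `.prom` file that node_exporter's textfile collector reads from the directory given by `--collector.textfile.directory`. The sketch below shows the file-writing half; `write_backup_metric` and the output path are hypothetical. It could be called at the end of a successful backup as `write_backup_metric /path/to/textfile-dir/stemedb_backup.prom "$(date +%s)"`.

```bash
# Hypothetical helper: write the last-backup timestamp in the Prometheus text
# exposition format. Writes to a temp file and renames it, so a collector
# never reads a half-written file.
write_backup_metric() {
  local out_file="$1" epoch="$2"
  local tmp_file="$out_file.$$"
  {
    echo '# HELP stemedb_last_backup_timestamp_seconds Unix time of the last successful backup.'
    echo '# TYPE stemedb_last_backup_timestamp_seconds gauge'
    echo "stemedb_last_backup_timestamp_seconds $epoch"
  } > "$tmp_file"
  mv "$tmp_file" "$out_file"
}
```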
---

## Related Runbooks

- [Server Won't Start](./server-wont-start.md) - WAL corruption scenarios
- [Disk Full](./disk-full.md) - Backup storage management
- [High Query Latency](./high-query-latency.md) - Index rebuild performance

---

## Last Updated

2026-02-11