# Runbook: Disaster Recovery

## Overview

**Purpose:** Restore StemeDB from backup after catastrophic failure.

**RTO (Recovery Time Objective):** 4 hours
**RPO (Recovery Point Objective):** 15 minutes

**Scope:** Complete server failure, data center outage, or regional disaster requiring restore from backups.

---

## When to Use This Runbook

Use this runbook for:

- **Complete server failure** - Hardware dead, cannot boot
- **Data center outage** - Entire DC offline, need to restore elsewhere
- **Disk failure** - Storage completely lost, no local recovery possible
- **Ransomware/corruption** - Data encrypted or corrupted, need clean restore
- **Regional disaster** - DR drill or actual disaster requiring failover

**Do NOT use for:**
- Single node failure in cluster → Use cluster failover instead
- WAL corruption → Use [Restore from Backup](./restore-from-backup.md) §2
- Index rebuild → Use [Restore from Backup](./restore-from-backup.md) §4

---

## Prerequisites

Before starting DR, ensure:

- [ ] **New server provisioned** (or existing server with clean disk)
- [ ] **S3 access configured** (credentials, network access to S3)
- [ ] **Dependencies installed** (Rust, PostgreSQL if using external stores)
- [ ] **Stakeholders notified** (team knows DR is in progress)
- [ ] **DNS/load balancer updated** (if changing server IP)

**Minimum server specs:**
- CPU: 4 cores
- RAM: 16 GB
- Disk: 2x backup size (for restore + buffer)
- Network: 1 Gbps (for S3 downloads)

---

## Decision Tree

```
Disaster scenario
│
├─► Complete restore needed?
│   └─► §1 Full Restore from S3
│
├─► Point-in-time restore needed?
│   └─► §2 Point-in-Time Restore with WAL Replay
│
└─► Only recent data lost?
    └─► §3 WAL-Only Recovery
```

---

## Resolution Steps

### §1. Full Restore from S3 (RTO: 4 hours, RPO: 15 minutes)

**Use case:** Complete data loss, restore everything from S3.

**Step 1: Provision new server (30 min)**

```bash
# Install dependencies (git, jq, and rsync are used in later steps)
sudo apt update
sudo apt install -y awscli build-essential pkg-config libssl-dev postgresql-client git jq rsync

# Create stemedb user
sudo useradd -r -s /bin/bash -d /var/lib/stemedb -m stemedb

# Install Rust for the stemedb user, which runs the build in Step 5
# (-y makes the installer non-interactive)
sudo -u stemedb -H sh -c "curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y"

# Create data directories
sudo mkdir -p /var/lib/stemedb/{wal,db}
sudo chown -R stemedb:stemedb /var/lib/stemedb
```

**Step 2: Download latest full backup from S3 (60 min)**

```bash
# List available backups
aws s3 ls s3://stemedb-backups-prod/ | grep stemedb-backup

# Expected output:
#   PRE stemedb-backup-20260211-060000/
#   PRE stemedb-backup-20260211-120000/
#   PRE stemedb-backup-20260211-180000/  ← Latest

# Create the local backup directory, writable by the stemedb user
sudo install -d -o stemedb -g stemedb /var/backups/stemedb

# Download latest full backup
LATEST_BACKUP=$(aws s3 ls s3://stemedb-backups-prod/ | grep stemedb-backup | tail -n1 | awk '{print $2}' | tr -d '/')
sudo -u stemedb aws s3 sync \
  "s3://stemedb-backups-prod/${LATEST_BACKUP}" \
  "/var/backups/stemedb/${LATEST_BACKUP}" \
  --region us-east-1

# Verify download
ls -lh "/var/backups/stemedb/${LATEST_BACKUP}/"
# Should show: backup-metadata.json, wal/, db/

cat "/var/backups/stemedb/${LATEST_BACKUP}/backup-metadata.json"
# Verify timestamp, file counts
```
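
The `LATEST_BACKUP` selection above depends on backup names sorting chronologically. As a minimal sketch, the same selection can be wrapped in a small function and tried offline against a sample listing before running it against S3. The function name `latest_backup` is hypothetical, and the sample lines assume the `PRE <prefix>/` listing format shown in the expected output.

```bash
# Hypothetical helper: read `aws s3 ls` output on stdin and print the newest
# backup prefix. Names embed YYYYMMDD-HHMMSS, so a plain sort is chronological.
latest_backup() {
  awk '$1 == "PRE" && $2 ~ /^stemedb-backup-/ { sub(/\/$/, "", $2); print $2 }' \
    | LC_ALL=C sort | tail -n1
}

# Offline check with a sample listing (no S3 access needed)
printf '%s\n' \
  '    PRE stemedb-backup-20260211-180000/' \
  '    PRE stemedb-backup-20260211-060000/' \
  '    PRE wal-archive/' \
  | latest_backup
# prints: stemedb-backup-20260211-180000
```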

**Step 3: Download WAL segments since last backup (15 min)**

```bash
# Get backup timestamp
BACKUP_TIMESTAMP=$(jq -r .timestamp "/var/backups/stemedb/${LATEST_BACKUP}/backup-metadata.json")
echo "Backup timestamp: $BACKUP_TIMESTAMP"

# Download archived WAL segments
# (this syncs the whole archive prefix, so segments already covered by the
# backup are downloaded as well)
sudo -u stemedb mkdir -p /var/lib/stemedb/wal-archive
sudo -u stemedb aws s3 sync \
  s3://stemedb-backups-prod/wal-archive/ \
  /var/lib/stemedb/wal-archive/ \
  --region us-east-1

# Count segments
WAL_COUNT=$(find /var/lib/stemedb/wal-archive -name "*.wal" | wc -l)
echo "Downloaded $WAL_COUNT WAL segments"
```

**Step 4: Restore data directories (30 min)**

```bash
# Restore from backup
sudo -u stemedb rsync -av \
  "/var/backups/stemedb/${LATEST_BACKUP}/wal/" \
  /var/lib/stemedb/wal/

sudo -u stemedb rsync -av \
  "/var/backups/stemedb/${LATEST_BACKUP}/db/" \
  /var/lib/stemedb/db/

# Copy archived WAL segments
sudo -u stemedb cp -r /var/lib/stemedb/wal-archive/*.wal /var/lib/stemedb/wal/

# Verify restoration
du -sh /var/lib/stemedb/{wal,db}
# Should match backup sizes + WAL archive
```

**Step 5: Build and start StemeDB (30 min)**

```bash
# Clone repository
cd /opt
sudo git clone https://github.com/yourusername/stemedb.git
sudo chown -R stemedb:stemedb /opt/stemedb

# Build release binary
# (login shell so ~/.cargo/bin is on PATH; Rust must be installed for the stemedb user)
cd /opt/stemedb
sudo -u stemedb -H bash -lc 'cargo build --release --bin stemedb-api'

# Install systemd unit
sudo cp docs/operations/deployment/systemd/stemedb-api.service /etc/systemd/system/
sudo systemctl daemon-reload

# Configure environment
sudo tee /etc/default/stemedb <<ENV
STEMEDB_BIND_ADDR=0.0.0.0:18180
STEMEDB_WAL_DIR=/var/lib/stemedb/wal
STEMEDB_DB_DIR=/var/lib/stemedb/db
RUST_LOG=info
ENV

# Start StemeDB (will auto-replay WAL)
sudo systemctl start stemedb-api

# Monitor startup (Ctrl-C to stop following the log)
sudo journalctl -u stemedb-api -f

# Expected logs:
# "Starting WAL recovery..."
# "Replayed 15234 entries from WAL"
# "Rebuilding indexes..."
# "Startup complete, listening on 0.0.0.0:18180"
```

**Step 6: Validate recovery (30 min)**

```bash
# Wait for startup to complete (watch journalctl)
# Then validate...

# Check health
curl http://localhost:18180/v1/health

# Expected:
# {
#   "status": "healthy",
#   "assertion_count": 105234,
#   "wal_segments": 47,
#   "uptime_seconds": 120
# }

# Verify assertion count matches expected
EXPECTED_COUNT=$(jq -r .assertion_count "/var/backups/stemedb/${LATEST_BACKUP}/backup-metadata.json")
ACTUAL_COUNT=$(curl -s http://localhost:18180/v1/health | jq .assertion_count)

echo "Expected: $EXPECTED_COUNT"
echo "Actual: $ACTUAL_COUNT"
echo "Delta: $((ACTUAL_COUNT - EXPECTED_COUNT))"

# Delta should equal assertions from WAL replay
# (data added between backup and failure)

# Test query
curl -X POST http://localhost:18180/v1/query \
  -H "Content-Type: application/json" \
  -d '{
    "concept_path": "test/dr",
    "predicate": "recovered",
    "lens": "recency"
  }'

# Should return 200 (even if empty results)

# Test ingestion
curl -X POST http://localhost:18180/v1/assert \
  -H "Content-Type: application/json" \
  -d '{
    "concept_path": "test/dr_validation",
    "predicate": "restored",
    "value": true,
    "confidence": 1.0,
    "authority_tier": "expert"
  }'

# Should return 201 Created
```
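
The delta check above is easy to misread under pressure, so it may help to turn it into a pass/fail test. The sketch below assumes only that a restored server should never report fewer assertions than the backup it was restored from. The function name `check_assertion_delta` is hypothetical and not part of StemeDB.

```bash
# Hypothetical helper: print the delta between the live assertion count and the
# backup's count, and return non-zero when the live count is lower.
check_assertion_delta() {
  expected=$1
  actual=$2
  delta=$((actual - expected))
  echo "delta=$delta"
  [ "$delta" -ge 0 ]
}

# Values as they would come from EXPECTED_COUNT and ACTUAL_COUNT
check_assertion_delta 105000 105234   # prints delta=234, exit status 0
check_assertion_delta 105000 104000 || echo "restore is missing assertions"
```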

**Step 7: Resume operations (60 min)**

```bash
# Update DNS (if IP changed)
# Point stemedb.yourdomain.com to new server IP

# Update load balancer (if using LB)
# Add new server to backend pool

# Enable backup automation
sudo systemctl enable stemedb-backup.timer
sudo systemctl start stemedb-backup.timer

sudo systemctl enable stemedb-archive-wal.timer
sudo systemctl start stemedb-archive-wal.timer

sudo systemctl enable stemedb-verify-backup.timer
sudo systemctl start stemedb-verify-backup.timer

# Verify timers
systemctl list-timers 'stemedb-*'

# Notify stakeholders
echo "StemeDB DR complete at $(date -u)" | mail -s "StemeDB DR Complete" oncall@yourcompany.com
```

**Total time: ~4 hours 15 minutes (the step estimates above sum to 255 min, slightly over the 4-hour RTO target)**

---

### §2. Point-in-Time Restore with WAL Replay (RTO: 2 hours, RPO: 15 min)

**Use case:** Restore to specific timestamp (e.g., before bad data ingestion).

**Step 1: Identify target timestamp**

```bash
# Determine when bad data was ingested
# (from logs, monitoring, or user reports)
TARGET_TIMESTAMP="2026-02-11T14:30:00Z"

# Find backup immediately before target
aws s3 ls s3://stemedb-backups-prod/ | grep stemedb-backup | \
  awk '{print $2}' | tr -d '/' | \
  while read backup; do
    BACKUP_TS=$(aws s3 cp s3://stemedb-backups-prod/${backup}/backup-metadata.json - | jq -r .timestamp)
    if [[ "$BACKUP_TS" < "$TARGET_TIMESTAMP" ]]; then
      echo "$backup ($BACKUP_TS)"
    fi
  done | tail -n1

# Use backup: stemedb-backup-20260211-120000 (2026-02-11T12:00:00Z)
```
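
The loop above works because ISO-8601 UTC timestamps in one fixed format compare correctly as plain strings. A minimal offline sketch of the same selection follows. It assumes the backup names and metadata timestamps have already been collected as `name timestamp` pairs, and the function name `backup_before` is hypothetical.

```bash
# Hypothetical helper: read "name timestamp" pairs on stdin and print the newest
# backup whose timestamp is strictly before the target timestamp.
backup_before() {
  target=$1
  LC_ALL=C sort -k2,2 \
    | awk -v t="$target" '$2 < t { best = $1 } END { if (best != "") print best }'
}

# Offline check with the sample backups from §1 Step 2
printf '%s\n' \
  'stemedb-backup-20260211-060000 2026-02-11T06:00:00Z' \
  'stemedb-backup-20260211-120000 2026-02-11T12:00:00Z' \
  'stemedb-backup-20260211-180000 2026-02-11T18:00:00Z' \
  | backup_before "2026-02-11T14:30:00Z"
# prints: stemedb-backup-20260211-120000
```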

**Step 2: Restore base backup**

Follow §1 Steps 1, 2, and 4, using the identified backup instead of the latest. Skip the "Copy archived WAL segments" command in Step 4, because Step 3 below copies only the segments that precede the target timestamp. On a new server, also build and configure StemeDB as in §1 Step 5, but do not start the service yet.

**Step 3: Replay WAL to target timestamp**

```bash
# Download the archived WAL segments to a staging directory
sudo -u stemedb aws s3 sync \
  s3://stemedb-backups-prod/wal-archive/ \
  /var/lib/stemedb/wal-partial/ \
  --region us-east-1

# Filter WAL segments by timestamp
# (Keep only segments last modified before the target timestamp)
for wal in /var/lib/stemedb/wal-partial/*.wal; do
  # Format the file mtime as a UTC ISO-8601 string (GNU stat and date)
  WAL_TS=$(date -u -d "@$(stat -c %Y "$wal")" +%Y-%m-%dT%H:%M:%SZ)
  if [[ "$WAL_TS" < "$TARGET_TIMESTAMP" ]]; then
    sudo -u stemedb cp "$wal" /var/lib/stemedb/wal/
  fi
done

# Start StemeDB (will replay filtered WAL)
sudo systemctl start stemedb-api

# Validate timestamp
LAST_ASSERTION_TS=$(curl -s http://localhost:18180/v1/query \
  -H "Content-Type: application/json" \
  -d '{"concept_path": "*", "lens": "recency", "limit": 1}' | \
  jq -r '.assertions[0].timestamp')

echo "Last assertion timestamp: $LAST_ASSERTION_TS"
echo "Target timestamp: $TARGET_TIMESTAMP"
# Last assertion should be ≤ target
```

**Total time: ~2 hours**

---

### §3. WAL-Only Recovery (RTO: 30 min, RPO: 0 min)

**Use case:** Database intact, only recent WAL lost (e.g., WAL disk failure).

**Step 1: Verify database is intact**

```bash
sudo systemctl stop stemedb-api

# Check DB directory
ls -lh /var/lib/stemedb/db/
# Should show: *.kv files, no corruption

# Check for errors
journalctl -u stemedb-api | tail -n100 | grep -i "db\|database\|storage"
# Should NOT show corruption errors
```

**Step 2: Download archived WAL**

```bash
# Download all archived WAL segments
# (--delete removes local files that are not in the archive, so run this
# only once the local WAL is confirmed lost)
sudo -u stemedb aws s3 sync \
  s3://stemedb-backups-prod/wal-archive/ \
  /var/lib/stemedb/wal/ \
  --region us-east-1 \
  --delete

# Verify download
ls -lh /var/lib/stemedb/wal/*.wal | wc -l
# Should show: N segments
```

**Step 3: Start and replay**

```bash
sudo systemctl start stemedb-api

# Monitor replay
sudo journalctl -u stemedb-api -f

# Expected:
# "Replayed 523 entries from WAL"
# "Startup complete"

# Validate
curl http://localhost:18180/v1/health | jq .assertion_count
# Should match expected count
```

**Total time: ~30 min**

---

## Validation Checklist

After any DR procedure, validate:

- [ ] **Server starts successfully**
  ```bash
  systemctl status stemedb-api
  # Active (running)
  ```

- [ ] **Health endpoint responds**
  ```bash
  curl http://localhost:18180/v1/health
  # Returns 200 OK
  ```

- [ ] **Assertion count correct**
  ```bash
  # Compare to backup metadata or expected count
  ```

- [ ] **Queries work**
  ```bash
  curl -X POST http://localhost:18180/v1/query \
    -H "Content-Type: application/json" \
    -d '{"concept_path": "test", "lens": "recency"}'
  # Returns 200
  ```

- [ ] **Ingestion works**
  ```bash
  # Test write
  curl -X POST http://localhost:18180/v1/assert ... # 201 Created
  ```

- [ ] **Backups resume**
  ```bash
  systemctl is-active stemedb-backup.timer # active
  systemctl is-active stemedb-archive-wal.timer # active
  ```

- [ ] **Metrics exporting**
  ```bash
  curl http://localhost:18180/metrics | grep stemedb_
  # Shows metrics
  ```

- [ ] **Alerts firing correctly**
  ```bash
  curl http://prometheus:9090/api/v1/alerts | jq .
  # No backup alerts firing
  ```

- [ ] **DNS/LB updated**
  ```bash
  nslookup stemedb.yourdomain.com
  # Points to new IP (if changed)
  ```

---

## RTO/RPO Metrics

| Scenario | RTO | RPO | Data Loss |
|----------|-----|-----|-----------|
| Full restore from S3 | 4h | 15min | Last 15min of WAL |
| Point-in-time restore | 2h | variable | Controlled (to target timestamp) |
| WAL-only recovery | 30min | 0min | None (if WAL archived) |

**Factors affecting RTO:**
- S3 download speed (network bandwidth)
- Backup size (larger = slower restore)
- Server provisioning time (cloud vs. bare metal)
- DNS/LB propagation delay

**Factors affecting RPO:**
- WAL archival frequency (default: 15 min)
- Last successful backup age (default: 6h intervals)
- Time of failure (worst case: just before backup)
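
As a quick sanity check on the RTO figure, the per-step estimates from §1 (30, 60, 15, 30, 30, 30, and 60 minutes) can be added up. The sketch below uses a hypothetical helper, `sum_minutes`. The total is 4h 15m, which leaves no slack against the 4-hour target, so any of the RTO factors listed above can push a real recovery over it.

```bash
# Hypothetical helper: sum per-step time budgets given in minutes and print the
# total as hours and minutes.
sum_minutes() {
  total=0
  for m in "$@"; do
    total=$((total + m))
  done
  printf '%dh %02dm\n' $((total / 60)) $((total % 60))
}

# §1 budgets: provision, backup download, WAL download, restore, build, validate, resume
sum_minutes 30 60 15 30 30 30 60
# prints: 4h 15m
```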

---

## Post-DR Actions

**Immediate (within 1 hour):**

1. **Document incident**
   - Create incident report
   - Record timeline (failure time, detection time, recovery time)
   - Note RTO/RPO achieved vs. target

2. **Verify monitoring**
   - Check all alerts are firing correctly
   - Verify metrics are being collected
   - Test PagerDuty/Slack notifications

3. **Communicate status**
   - Notify stakeholders of recovery completion
   - Update status page
   - Send post-mortem invite

**Within 24 hours:**

1. **Root cause analysis**
   - Identify what caused failure
   - Determine if preventable
   - Create action items

2. **Test backups**
   - Verify next backup completes
   - Validate verification passes
   - Check S3 uploads working

3. **Review procedures**
   - Update runbook with lessons learned
   - Document any deviations from procedure
   - Propose improvements

**Within 1 week:**

1. **Conduct post-mortem**
   - Blameless review with team
   - Identify process improvements
   - Create corrective actions

2. **Update documentation**
   - Incorporate lessons learned
   - Update RTO/RPO estimates
   - Revise prerequisites

3. **Schedule DR drill**
   - Test procedure again (quarterly)
   - Validate improvements
   - Train new team members

---

## Common Pitfalls

### 1. Incomplete S3 sync

**Symptom:** Restore completes but assertion count too low.

**Cause:** S3 sync interrupted or incomplete.

**Fix:**
```bash
# Re-sync with --exact-timestamps
sudo -u stemedb aws s3 sync \
  s3://stemedb-backups-prod/${BACKUP} \
  /var/backups/stemedb/${BACKUP} \
  --exact-timestamps \
  --region us-east-1
```

### 2. WAL replay fails

**Symptom:** Server starts but assertion count wrong.

**Cause:** Corrupted WAL segment or version mismatch.

**Fix:**
```bash
# Check logs for specific segment
sudo journalctl -u stemedb-api | grep -i "wal.*error"

# If segment corrupted, skip it (accept data loss)
sudo mv /var/lib/stemedb/wal/segment-XXXXX.wal /tmp/

# Restart
sudo systemctl restart stemedb-api
```

### 3. Permissions incorrect

**Symptom:** Server won't start, permission denied errors.

**Cause:** Restored files owned by wrong user.

**Fix:**
```bash
sudo chown -R stemedb:stemedb /var/lib/stemedb
sudo chmod -R 755 /var/lib/stemedb/wal
sudo chmod -R 755 /var/lib/stemedb/db
```

### 4. DNS not updated

**Symptom:** Clients can't connect to restored server.

**Cause:** DNS still pointing to old IP.

**Fix:**
```bash
# Update DNS record
# (method varies by DNS provider)

# Verify propagation
dig stemedb.yourdomain.com +short
# Should return new IP
```

---

## DR Drill Procedure

**Frequency:** Quarterly (every 90 days)

**Purpose:** Validate DR procedures, train team, measure RTO/RPO.

**Steps:**

1. **Schedule drill** (at least 1 week notice)
2. **Provision staging environment** (separate from prod)
3. **Execute DR procedure** (§1 Full Restore)
4. **Measure RTO/RPO achieved**
5. **Document results** (drill report)
6. **Review with team** (post-drill retro)
7. **Update runbook** (incorporate learnings)

**Drill report template:**

```markdown
# DR Drill Report - YYYY-MM-DD

## Summary
- Date: YYYY-MM-DD HH:MM UTC
- Participants: [names]
- Scenario: Full restore from S3
- Result: ✅ Success / ⚠️ Partial / ❌ Failed

## Metrics
- RTO Target: 4 hours
- RTO Achieved: X hours Y min
- RPO Target: 15 min
- RPO Achieved: X min
- Data Loss: X assertions (expected)

## Timeline
- HH:MM - Drill started
- HH:MM - Server provisioned
- HH:MM - Backup downloaded
- HH:MM - WAL downloaded
- HH:MM - Data restored
- HH:MM - Service started
- HH:MM - Validation complete
- HH:MM - Drill complete

## Issues Encountered
1. [Issue description]
   - Impact: [how it affected RTO]
   - Resolution: [how it was fixed]
   - Preventive action: [how to avoid next time]

## Lessons Learned
- [Lesson 1]
- [Lesson 2]

## Action Items
- [ ] [Action item 1] - Owner: [name] - Due: [date]
- [ ] [Action item 2] - Owner: [name] - Due: [date]

## Runbook Updates
- [Change 1: reason]
- [Change 2: reason]
```
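
To fill in the "RTO Achieved" line of the template, the elapsed time between the "Drill started" and "Drill complete" entries has to be computed. One possible sketch follows. The helper name `elapsed` is hypothetical, the timestamps are made-up sample values, and GNU `date` is assumed, as on the Debian/Ubuntu hosts used elsewhere in this runbook.

```bash
# Hypothetical helper: print the elapsed time between two UTC timestamps in the
# "X hours Y min" form used by the drill report template (requires GNU date).
elapsed() {
  start=$(date -u -d "$1" +%s)
  end=$(date -u -d "$2" +%s)
  secs=$((end - start))
  printf '%d hours %d min\n' $((secs / 3600)) $((secs % 3600 / 60))
}

# Sample values: drill started 18:20 UTC, drill complete 21:55 UTC
elapsed "2026-02-11T18:20:00Z" "2026-02-11T21:55:00Z"
# prints: 3 hours 35 min
```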

---

## Related Runbooks

- [Restore from Backup](./restore-from-backup.md) - Non-disaster restore scenarios
- [Server Won't Start](./server-wont-start.md) - Startup failures
- [Disk Full](./disk-full.md) - Storage management

---

## Last Updated

2026-02-12 (P5.3 Implementation)