stemedb/docs/operations/runbooks/high-replication-lag.md
jml 3e7eddc074 feat: add enterprise production readiness infrastructure
This commit implements comprehensive production hardening across multiple
layers to prepare StemeDB for enterprise pilot deployments:

## API Layer
- Add rate limiting middleware with configurable limits per endpoint
- Enhance error handling with detailed context and proper HTTP status codes
- Add security hardening tests for input validation and boundary conditions
- Create store_helpers module for defensive storage access patterns

## Storage & WAL
- Optimize group commit batching for higher throughput
- Add defensive error handling in hybrid backend with proper fallbacks
- Enhance WAL journal durability guarantees with fsync validation
- Improve index store query performance with better caching

## Operations & Deployment
- Add comprehensive operations documentation (deployment, monitoring, DR)
- Create systemd units for backup, WAL archival, and verification
- Add monitoring configs (Prometheus alerts, metrics exporters)
- Implement backup/restore scripts with verification and S3 archival
- Add DR drill automation and runbook procedures
- Create load balancer configs (nginx, envoy) with health checks

## Documentation
- Update CLAUDE.md with operations and troubleshooting guides
- Expand roadmap with production readiness milestones
- Add pilot success criteria and deployment reference architecture
- Document TLS setup, monitoring integration, and incident response

## Configuration
- Add .env.example with all required environment variables
- Document resource sizing for different deployment scales
- Add configuration examples for various deployment topologies

This positions StemeDB for successful enterprise pilots with proper
operational discipline, monitoring, backup/DR, and security hardening.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-12 06:08:15 +00:00

273 lines
5.8 KiB
Markdown

# High Replication Lag
## Severity: CRITICAL
## Alert Rule
**Alert:** `ReplicationLagCritical`
**Trigger:** Replica lag exceeds 10 seconds
**Duration:** 3m
## Symptom
- Query results from replicas are stale (missing recent assertions)
- Replication metrics show increasing lag (e.g., `stemedb_replication_lag_seconds > 10`)
- Merkle tree sync reports large diffs between primary and replica
- Clients reading from replicas see inconsistent data
## Impact
**User Impact:**
- Queries to replicas return outdated results
- Reads may miss assertions written in the last 10+ seconds
- Eventual consistency SLAs violated
**System Impact:**
- Replica may fall too far behind to catch up (cascading failure)
- Increased Merkle tree diff volume (bandwidth spike)
- Risk of replica demotion or rebuild
## Investigation Steps
### 1. Check Replication Status
```bash
# Query replication lag metric
curl -s http://localhost:18180/metrics | grep replication_lag
# Expected output (example):
# stemedb_replication_lag_seconds{replica="node2"} 12.5
```
### 2. Identify Bottleneck
**A. Network latency:**
```bash
# Ping replica from primary
ping -c 10 <replica-ip>
# Check bandwidth usage
iftop -i eth0 -f "port 18182"
```
**B. Replica disk I/O:**
```bash
# SSH to replica
iostat -x 1 10
# Look for high %util on WAL partition
```
**C. Replica CPU saturation:**
```bash
# SSH to replica
top -b -n 1 | grep stemedb
```
### 3. Check for Merkle Sync Errors
```bash
# Primary logs
journalctl -u stemedb-api | grep -i "merkle sync" | tail -20
# Replica logs
ssh replica "journalctl -u stemedb-api | grep -i 'sync error' | tail -20"
```
### 4. Compare Assertion Counts
```bash
# Primary assertion count
curl -s http://localhost:18180/metrics | grep assertions_indexed_total
# Replica assertion count
curl -s http://<replica>:18180/metrics | grep assertions_indexed_total
```
## Resolution
### If Network Latency is High
**1. Check network path:**
```bash
traceroute <replica-ip>
mtr -r -c 10 <replica-ip>
```
**2. Verify firewall rules:**
```bash
# RPC port 18182 should be open
telnet <replica-ip> 18182
```
**3. Increase RPC timeout if needed:**
Edit `/etc/stemedb/api.toml` on primary:
```toml
[cluster]
rpc_timeout_ms = 10000 # Increase from default 5000
```
Restart primary:
```bash
systemctl restart stemedb-api
```
### If Replica Disk I/O is Saturated
**1. Verify WAL write performance:**
```bash
# SSH to replica
cd /var/lib/stemedb/wal
time dd if=/dev/zero of=test.dat bs=1M count=1000 oflag=direct
rm test.dat
```
Expected: >100 MB/s on SSD.
**2. Check for competing I/O:**
```bash
iotop -o
```
**3. Temporarily reduce ingestion rate on primary:**
```bash
# Apply rate limit via admin endpoint
curl -X POST http://localhost:18180/v1/admin/rate-limit \
-H 'Content-Type: application/json' \
-d '{"max_assertions_per_sec": 1000}'
```
### If Replica is Falling Further Behind
**1. Initiate manual Merkle sync:**
```bash
curl -X POST http://localhost:18180/v1/admin/cluster/sync \
-H 'Content-Type: application/json' \
-d '{"replica_id": "node2", "force": true}'
```
**2. Monitor sync progress:**
```bash
watch -n 5 'curl -s http://localhost:18180/metrics | grep merkle_sync_progress'
```
**3. If sync fails repeatedly, rebuild replica:**
See `docs/operations/runbooks/rebuild-replica.md`.
### If Replication Stream is Blocked
**1. Check for circuit breaker trip:**
```bash
curl -s http://localhost:18180/v1/admin/circuit-breakers/tripped | jq
```
**2. Reset circuit breaker if needed:**
```bash
curl -X POST http://localhost:18180/v1/admin/circuit-breaker/reset \
-H 'Content-Type: application/json' \
-d '{"agent_id": "<replica_agent_id>"}'
```
## Prevention
### Monitoring and Alerting
**1. Add warning-level lag alert:**
```yaml
# Prometheus alert rule
- alert: ReplicationLagWarning
expr: stemedb_replication_lag_seconds > 5
for: 5m
annotations:
summary: "Replica lag exceeds 5 seconds"
```
**2. Monitor Merkle sync errors:**
```yaml
- alert: MerkleSyncFailures
expr: rate(stemedb_merkle_sync_errors_total[5m]) > 0.1
annotations:
summary: "Frequent Merkle sync failures detected"
```
### Capacity Planning
**1. Ensure replica hardware matches primary:**
- Same or better disk I/O (IOPS)
- Same network bandwidth
- Sufficient CPU headroom
**2. Set replication backpressure threshold:**
```toml
# /etc/stemedb/api.toml
[cluster]
max_replication_lag_seconds = 30 # Pause ingestion if lag exceeds
```
### Operational Best Practices
**1. Gradual rollout of high-volume ingestion:**
```bash
# Ramp up assertion rate slowly
for rate in 100 500 1000 2000; do
echo "Testing rate: $rate/sec"
# Apply rate via API
curl -X POST http://localhost:18180/v1/admin/rate-limit \
-d "{\"max_assertions_per_sec\": $rate}"
sleep 300 # Monitor for 5 minutes
# Check lag
curl -s http://localhost:18180/metrics | grep replication_lag
done
```
**2. Pre-provision replicas before traffic spikes:**
Add replicas 24 hours before expected load increase.
## Escalation
**Escalate immediately if:**
- Lag exceeds 60 seconds (replica rebuild likely needed)
- Replica is stuck in crash loop during sync
- Merkle sync reports corruption (data integrity issue)
- Multiple replicas lagging simultaneously (primary overload)
**Escalation path:**
1. **Primary on-call:** Cluster SRE
2. **Secondary:** Distributed systems engineer
3. **Final escalation:** Principal engineer (data corruption suspected)
## References
- **Dashboard:** [StemeDB Cluster Overview](http://grafana.example.com/d/stemedb-cluster)
- **Related alerts:** `ClusterSplitBrain`, `MerkleSyncFailure`, `HighNetworkUtilization`
- **Metrics to check:**
- `stemedb_replication_lag_seconds` (lag duration)
- `stemedb_merkle_sync_duration_seconds` (sync timing)
- `stemedb_assertions_indexed_total` (ingestion rate)
- `stemedb_network_bytes_sent_total` (replication bandwidth)
- **Runbooks:** `rebuild-replica.md`, `split-brain.md`