# High Replication Lag

## Severity: CRITICAL

## Alert Rule

**Alert:** `ReplicationLagCritical`

**Trigger:** Replica lag exceeds 10 seconds

**Duration:** 3m

## Symptom

- Query results from replicas are stale (missing recent assertions)
- Replication metrics show increasing lag (e.g., `stemedb_replication_lag_seconds > 10`)
- Merkle tree sync reports large diffs between primary and replica
- Clients reading from replicas see inconsistent data

## Impact

**User Impact:**

- Queries to replicas return outdated results
- Reads may miss assertions written in the last 10+ seconds
- Eventual consistency SLAs are violated

**System Impact:**

- Replica may fall too far behind to catch up (cascading failure)
- Increased Merkle tree diff volume (bandwidth spike)
- Risk of replica demotion or rebuild

## Investigation Steps

### 1. Check Replication Status

```bash
# Query replication lag metric
curl -s http://localhost:18180/metrics | grep replication_lag

# Expected output (example):
# stemedb_replication_lag_seconds{replica="node2"} 12.5
```

### 2. Identify Bottleneck

**A. Network latency:**

```bash
# REPLICA_HOST: hostname or IP of the lagging replica
# Ping replica from primary
ping -c 10 "$REPLICA_HOST"

# Check bandwidth usage on the RPC port
iftop -i eth0 -f "port 18182"
```

**B. Replica disk I/O:**

```bash
# SSH to replica
iostat -x 1 10
# Look for high %util on WAL partition
```

**C. Replica CPU saturation:**

```bash
# SSH to replica
top -b -n 1 | grep stemedb
```

### 3. Check for Merkle Sync Errors

```bash
# Primary logs
journalctl -u stemedb-api | grep -i "merkle sync" | tail -20

# Replica logs
ssh replica "journalctl -u stemedb-api | grep -i 'sync error' | tail -20"
```

### 4. Compare Assertion Counts

```bash
# Primary assertion count
curl -s http://localhost:18180/metrics | grep assertions_indexed_total

# Replica assertion count (REPLICA_HOST: hostname or IP of the replica)
curl -s "http://${REPLICA_HOST}:18180/metrics" | grep assertions_indexed_total
```

## Resolution

### If Network Latency is High

**1. Check network path:**

```bash
# REPLICA_HOST: hostname or IP of the lagging replica
traceroute "$REPLICA_HOST"
mtr -r -c 10 "$REPLICA_HOST"
```

**2. Verify firewall rules:**

```bash
# RPC port 18182 should be open
telnet "$REPLICA_HOST" 18182
```

**3. Increase RPC timeout if needed:**

Edit `/etc/stemedb/api.toml` on primary:

```toml
[cluster]
rpc_timeout_ms = 10000  # Increase from default 5000
```

Restart primary:

```bash
systemctl restart stemedb-api
```

### If Replica Disk I/O is Saturated

**1. Verify WAL write performance:**

```bash
# SSH to replica
cd /var/lib/stemedb/wal
# Writes about 1 GB of test data with direct I/O (bypasses the page cache)
time dd if=/dev/zero of=test.dat bs=1M count=1000 oflag=direct
rm test.dat
```

Expected: >100 MB/s on SSD.

**2. Check for competing I/O:**

```bash
iotop -o
```

**3. Temporarily reduce ingestion rate on primary:**

```bash
# Apply rate limit via admin endpoint
curl -X POST http://localhost:18180/v1/admin/rate-limit \
  -H 'Content-Type: application/json' \
  -d '{"max_assertions_per_sec": 1000}'
```

### If Replica is Falling Further Behind

**1. Initiate manual Merkle sync:**

```bash
curl -X POST http://localhost:18180/v1/admin/cluster/sync \
  -H 'Content-Type: application/json' \
  -d '{"replica_id": "node2", "force": true}'
```

**2. Monitor sync progress:**

```bash
watch -n 5 'curl -s http://localhost:18180/metrics | grep merkle_sync_progress'
```

**3. If sync fails repeatedly, rebuild replica:**

See `docs/operations/runbooks/rebuild-replica.md`.

### If Replication Stream is Blocked

**1. Check for circuit breaker trip:**

```bash
curl -s http://localhost:18180/v1/admin/circuit-breakers/tripped | jq
```

**2. Reset circuit breaker if needed:**

```bash
curl -X POST http://localhost:18180/v1/admin/circuit-breaker/reset \
  -H 'Content-Type: application/json' \
  -d '{"agent_id": ""}'
```

## Prevention

### Monitoring and Alerting

**1. Add warning-level lag alert:**

```yaml
# Prometheus alert rule
- alert: ReplicationLagWarning
  expr: stemedb_replication_lag_seconds > 5
  for: 5m
  annotations:
    summary: "Replica lag exceeds 5 seconds"
```

**2. Monitor Merkle sync errors:**

```yaml
- alert: MerkleSyncFailures
  expr: rate(stemedb_merkle_sync_errors_total[5m]) > 0.1
  annotations:
    summary: "Frequent Merkle sync failures detected"
```

### Capacity Planning

**1. Ensure replica hardware matches primary:**

- Same or better disk I/O (IOPS)
- Same network bandwidth
- Sufficient CPU headroom

**2. Set replication backpressure threshold:**

```toml
# /etc/stemedb/api.toml
[cluster]
max_replication_lag_seconds = 30  # Pause ingestion if lag exceeds this value
```

### Operational Best Practices

**1. Gradual rollout of high-volume ingestion:**

```bash
# Ramp up assertion rate slowly
for rate in 100 500 1000 2000; do
  echo "Testing rate: $rate/sec"
  # Apply rate via API
  curl -X POST http://localhost:18180/v1/admin/rate-limit \
    -H 'Content-Type: application/json' \
    -d "{\"max_assertions_per_sec\": $rate}"
  sleep 300  # Monitor for 5 minutes
  # Check lag
  curl -s http://localhost:18180/metrics | grep replication_lag
done
```

**2. Pre-provision replicas before traffic spikes:**

Add replicas 24 hours before an expected load increase.

## Escalation

**Escalate immediately if:**

- Lag exceeds 60 seconds (replica rebuild likely needed)
- Replica is stuck in a crash loop during sync
- Merkle sync reports corruption (data integrity issue)
- Multiple replicas are lagging simultaneously (primary overload)

**Escalation path:**

1. **Primary on-call:** Cluster SRE
2. **Secondary:** Distributed systems engineer
3. **Final escalation:** Principal engineer (if data corruption is suspected)

## References

- **Dashboard:** [StemeDB Cluster Overview](http://grafana.example.com/d/stemedb-cluster)
- **Related alerts:** `ClusterSplitBrain`, `MerkleSyncFailure`, `HighNetworkUtilization`
- **Metrics to check:**
  - `stemedb_replication_lag_seconds` (lag duration)
  - `stemedb_merkle_sync_duration_seconds` (sync timing)
  - `stemedb_assertions_indexed_total` (ingestion rate)
  - `stemedb_network_bytes_sent_total` (replication bandwidth)
- **Runbooks:** `rebuild-replica.md`, `split-brain.md`
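
The lag check from Investigation step 1 can be narrowed to print only the replicas that are over the alert threshold. The following is a minimal sketch, not part of StemeDB: the function name `lagging_replicas` is hypothetical, and it assumes the metric is exposed in the format shown in the example output (`stemedb_replication_lag_seconds{replica="..."} <value>`, with no trailing timestamp).

```shell
#!/usr/bin/env bash
# lagging_replicas [THRESHOLD_SECONDS]
# Hypothetical helper: reads Prometheus text-format metrics on stdin and
# prints "<replica> <lag>" for each replica whose lag exceeds the threshold.
# Assumes the sample value is the last field on the line (no timestamp).
lagging_replicas() {
  local threshold="${1:-10}"   # default matches the 10-second alert trigger
  awk -v t="$threshold" '
    /^stemedb_replication_lag_seconds[{]/ {
      lag = $NF
      # Extract the value of the replica="..." label, if present
      if (match($0, /replica="[^"]*"/)) {
        name = substr($0, RSTART + 9, RLENGTH - 10)
      } else {
        name = "unknown"
      }
      # Force numeric comparison on both sides
      if (lag + 0 > t + 0) print name, lag
    }'
}

# Usage (against the metrics endpoint used throughout this runbook):
#   curl -s http://localhost:18180/metrics | lagging_replicas 10
```

An empty result means no replica currently exceeds the threshold; passing `5` instead of `10` matches the warning-level rule under Prevention.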