stemedb/docs/operations/runbooks/high-replication-lag.md

# High Replication Lag

## Severity: CRITICAL

## Alert Rule

**Alert:** `ReplicationLagCritical`
**Trigger:** Replica lag exceeds 10 seconds
**Duration:** 3m

## Symptom

- Query results from replicas are stale (missing recent assertions)
- Replication metrics show increasing lag (e.g., `stemedb_replication_lag_seconds > 10`)
- Merkle tree sync reports large diffs between primary and replica
- Clients reading from replicas see inconsistent data

## Impact

**User Impact:**
- Queries to replicas return outdated results
- Reads may miss assertions written in the last 10+ seconds
- Eventual consistency SLAs violated

**System Impact:**
- Replica may fall too far behind to catch up (cascading failure)
- Increased Merkle tree diff volume (bandwidth spike)
- Risk of replica demotion or rebuild

## Investigation Steps

### 1. Check Replication Status

```bash
# Query replication lag metric
curl -s http://localhost:18180/metrics | grep replication_lag

# Expected output (example):
# stemedb_replication_lag_seconds{replica="node2"} 12.5
```

### 2. Identify Bottleneck

**A. Network latency:**

```bash
# Ping replica from primary
ping -c 10 <replica-ip>

# Check bandwidth usage
iftop -i eth0 -f "port 18182"
```

**B. Replica disk I/O:**

```bash
# SSH to replica
iostat -x 1 10

# Look for high %util on WAL partition
```

**C. Replica CPU saturation:**

```bash
# SSH to replica
top -b -n 1 | grep stemedb
```

### 3. Check for Merkle Sync Errors

```bash
# Primary logs
journalctl -u stemedb-api | grep -i "merkle sync" | tail -20

# Replica logs
ssh replica "journalctl -u stemedb-api | grep -i 'sync error' | tail -20"
```

### 4. Compare Assertion Counts

```bash
# Primary assertion count
curl -s http://localhost:18180/metrics | grep assertions_indexed_total

# Replica assertion count
curl -s http://<replica>:18180/metrics | grep assertions_indexed_total
```

## Resolution

### If Network Latency is High

**1. Check network path:**

```bash
traceroute <replica-ip>
mtr -r -c 10 <replica-ip>
```

**2. Verify firewall rules:**

```bash
# RPC port 18182 should be open
telnet <replica-ip> 18182
```

**3. Increase RPC timeout if needed:**

Edit `/etc/stemedb/api.toml` on primary:

```toml
[cluster]
rpc_timeout_ms = 10000  # Increase from default 5000
```

Restart primary:

```bash
systemctl restart stemedb-api
```

### If Replica Disk I/O is Saturated

**1. Verify WAL write performance:**

```bash
# SSH to replica
cd /var/lib/stemedb/wal
time dd if=/dev/zero of=test.dat bs=1M count=1000 oflag=direct
rm test.dat
```

Expected: >100 MB/s on SSD.

**2. Check for competing I/O:**

```bash
iotop -o
```

**3. Temporarily reduce ingestion rate on primary:**

```bash
# Apply rate limit via admin endpoint
curl -X POST http://localhost:18180/v1/admin/rate-limit \
  -H 'Content-Type: application/json' \
  -d '{"max_assertions_per_sec": 1000}'
```

### If Replica is Falling Further Behind

**1. Initiate manual Merkle sync:**

```bash
curl -X POST http://localhost:18180/v1/admin/cluster/sync \
  -H 'Content-Type: application/json' \
  -d '{"replica_id": "node2", "force": true}'
```

**2. Monitor sync progress:**

```bash
watch -n 5 'curl -s http://localhost:18180/metrics | grep merkle_sync_progress'
```

**3. If sync fails repeatedly, rebuild replica:**

See `docs/operations/runbooks/rebuild-replica.md`.

### If Replication Stream is Blocked

**1. Check for circuit breaker trip:**

```bash
curl -s http://localhost:18180/v1/admin/circuit-breakers/tripped | jq
```

**2. Reset circuit breaker if needed:**

```bash
curl -X POST http://localhost:18180/v1/admin/circuit-breaker/reset \
  -H 'Content-Type: application/json' \
  -d '{"agent_id": "<replica_agent_id>"}'
```

## Prevention

### Monitoring and Alerting

**1. Add warning-level lag alert:**

```yaml
# Prometheus alert rule
- alert: ReplicationLagWarning
  expr: stemedb_replication_lag_seconds > 5
  for: 5m
  annotations:
    summary: "Replica lag exceeds 5 seconds"
```

**2. Monitor Merkle sync errors:**

```yaml
- alert: MerkleSyncFailures
  expr: rate(stemedb_merkle_sync_errors_total[5m]) > 0.1
  annotations:
    summary: "Frequent Merkle sync failures detected"
```

### Capacity Planning

**1. Ensure replica hardware matches primary:**

- Same or better disk I/O (IOPS)
- Same network bandwidth
- Sufficient CPU headroom

**2. Set replication backpressure threshold:**

```toml
# /etc/stemedb/api.toml
[cluster]
max_replication_lag_seconds = 30  # Pause ingestion if lag exceeds
```

### Operational Best Practices

**1. Gradual rollout of high-volume ingestion:**

```bash
# Ramp up assertion rate slowly
for rate in 100 500 1000 2000; do
  echo "Testing rate: $rate/sec"
  # Apply rate via API
  curl -X POST http://localhost:18180/v1/admin/rate-limit \
    -d "{\"max_assertions_per_sec\": $rate}"
  sleep 300  # Monitor for 5 minutes
  # Check lag
  curl -s http://localhost:18180/metrics | grep replication_lag
done
```

**2. Pre-provision replicas before traffic spikes:**

Add replicas 24 hours before expected load increase.

## Escalation

**Escalate immediately if:**

- Lag exceeds 60 seconds (replica rebuild likely needed)
- Replica is stuck in crash loop during sync
- Merkle sync reports corruption (data integrity issue)
- Multiple replicas lagging simultaneously (primary overload)

**Escalation path:**

1. **Primary on-call:** Cluster SRE
2. **Secondary:** Distributed systems engineer
3. **Final escalation:** Principal engineer (data corruption suspected)

## References

- **Dashboard:** [StemeDB Cluster Overview](http://grafana.example.com/d/stemedb-cluster)
- **Related alerts:** `ClusterSplitBrain`, `MerkleSyncFailure`, `HighNetworkUtilization`
- **Metrics to check:**
  - `stemedb_replication_lag_seconds` (lag duration)
  - `stemedb_merkle_sync_duration_seconds` (sync timing)
  - `stemedb_assertions_indexed_total` (ingestion rate)
  - `stemedb_network_bytes_sent_total` (replication bandwidth)
- **Runbooks:** `rebuild-replica.md`, `split-brain.md`