stemedb/docs/operations/runbooks/high-replication-lag.md
jml 3e7eddc074 feat: add enterprise production readiness infrastructure
This commit implements comprehensive production hardening across multiple
layers to prepare StemeDB for enterprise pilot deployments:

## API Layer
- Add rate limiting middleware with configurable limits per endpoint
- Enhance error handling with detailed context and proper HTTP status codes
- Add security hardening tests for input validation and boundary conditions
- Create store_helpers module for defensive storage access patterns

## Storage & WAL
- Optimize group commit batching for higher throughput
- Add defensive error handling in hybrid backend with proper fallbacks
- Enhance WAL journal durability guarantees with fsync validation
- Improve index store query performance with better caching

## Operations & Deployment
- Add comprehensive operations documentation (deployment, monitoring, DR)
- Create systemd units for backup, WAL archival, and verification
- Add monitoring configs (Prometheus alerts, metrics exporters)
- Implement backup/restore scripts with verification and S3 archival
- Add DR drill automation and runbook procedures
- Create load balancer configs (nginx, envoy) with health checks

## Documentation
- Update CLAUDE.md with operations and troubleshooting guides
- Expand roadmap with production readiness milestones
- Add pilot success criteria and deployment reference architecture
- Document TLS setup, monitoring integration, and incident response

## Configuration
- Add .env.example with all required environment variables
- Document resource sizing for different deployment scales
- Add configuration examples for various deployment topologies

This positions StemeDB for successful enterprise pilots with proper
operational discipline, monitoring, backup/DR, and security hardening.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-12 06:08:15 +00:00

5.8 KiB

High Replication Lag

Severity: CRITICAL

Alert Rule

Alert: ReplicationLagCritical Trigger: Replica lag exceeds 10 seconds Duration: 3m

Symptom

  • Query results from replicas are stale (missing recent assertions)
  • Replication metrics show increasing lag (e.g., stemedb_replication_lag_seconds > 10)
  • Merkle tree sync reports large diffs between primary and replica
  • Clients reading from replicas see inconsistent data

Impact

User Impact:

  • Queries to replicas return outdated results
  • Reads may miss assertions written in the last 10+ seconds
  • Eventual consistency SLAs violated

System Impact:

  • Replica may fall too far behind to catch up (cascading failure)
  • Increased Merkle tree diff volume (bandwidth spike)
  • Risk of replica demotion or rebuild

Investigation Steps

1. Check Replication Status

# Query replication lag metric
curl -s http://localhost:18180/metrics | grep replication_lag

# Expected output (example):
# stemedb_replication_lag_seconds{replica="node2"} 12.5

2. Identify Bottleneck

A. Network latency:

# Ping replica from primary
ping -c 10 <replica-ip>

# Check bandwidth usage
iftop -i eth0 -f "port 18182"

B. Replica disk I/O:

# SSH to replica
iostat -x 1 10

# Look for high %util on WAL partition

C. Replica CPU saturation:

# SSH to replica
top -b -n 1 | grep stemedb

3. Check for Merkle Sync Errors

# Primary logs
journalctl -u stemedb-api | grep -i "merkle sync" | tail -20

# Replica logs
ssh replica "journalctl -u stemedb-api | grep -i 'sync error' | tail -20"

4. Compare Assertion Counts

# Primary assertion count
curl -s http://localhost:18180/metrics | grep assertions_indexed_total

# Replica assertion count
curl -s http://<replica>:18180/metrics | grep assertions_indexed_total

Resolution

If Network Latency is High

1. Check network path:

traceroute <replica-ip>
mtr -r -c 10 <replica-ip>

2. Verify firewall rules:

# RPC port 18182 should be open
telnet <replica-ip> 18182

3. Increase RPC timeout if needed:

Edit /etc/stemedb/api.toml on primary:

[cluster]
rpc_timeout_ms = 10000  # Increase from default 5000

Restart primary:

systemctl restart stemedb-api

If Replica Disk I/O is Saturated

1. Verify WAL write performance:

# SSH to replica
cd /var/lib/stemedb/wal
time dd if=/dev/zero of=test.dat bs=1M count=1000 oflag=direct
rm test.dat

Expected: >100 MB/s on SSD.

2. Check for competing I/O:

iotop -o

3. Temporarily reduce ingestion rate on primary:

# Apply rate limit via admin endpoint
curl -X POST http://localhost:18180/v1/admin/rate-limit \
  -H 'Content-Type: application/json' \
  -d '{"max_assertions_per_sec": 1000}'

If Replica is Falling Further Behind

1. Initiate manual Merkle sync:

curl -X POST http://localhost:18180/v1/admin/cluster/sync \
  -H 'Content-Type: application/json' \
  -d '{"replica_id": "node2", "force": true}'

2. Monitor sync progress:

watch -n 5 'curl -s http://localhost:18180/metrics | grep merkle_sync_progress'

3. If sync fails repeatedly, rebuild replica:

See docs/operations/runbooks/rebuild-replica.md.

If Replication Stream is Blocked

1. Check for circuit breaker trip:

curl -s http://localhost:18180/v1/admin/circuit-breakers/tripped | jq

2. Reset circuit breaker if needed:

curl -X POST http://localhost:18180/v1/admin/circuit-breaker/reset \
  -H 'Content-Type: application/json' \
  -d '{"agent_id": "<replica_agent_id>"}'

Prevention

Monitoring and Alerting

1. Add warning-level lag alert:

# Prometheus alert rule
- alert: ReplicationLagWarning
  expr: stemedb_replication_lag_seconds > 5
  for: 5m
  annotations:
    summary: "Replica lag exceeds 5 seconds"

2. Monitor Merkle sync errors:

- alert: MerkleSyncFailures
  expr: rate(stemedb_merkle_sync_errors_total[5m]) > 0.1
  annotations:
    summary: "Frequent Merkle sync failures detected"

Capacity Planning

1. Ensure replica hardware matches primary:

  • Same or better disk I/O (IOPS)
  • Same network bandwidth
  • Sufficient CPU headroom

2. Set replication backpressure threshold:

# /etc/stemedb/api.toml
[cluster]
max_replication_lag_seconds = 30  # Pause ingestion if lag exceeds

Operational Best Practices

1. Gradual rollout of high-volume ingestion:

# Ramp up assertion rate slowly
for rate in 100 500 1000 2000; do
  echo "Testing rate: $rate/sec"
  # Apply rate via API
  curl -X POST http://localhost:18180/v1/admin/rate-limit \
    -d "{\"max_assertions_per_sec\": $rate}"
  sleep 300  # Monitor for 5 minutes
  # Check lag
  curl -s http://localhost:18180/metrics | grep replication_lag
done

2. Pre-provision replicas before traffic spikes:

Add replicas 24 hours before expected load increase.

Escalation

Escalate immediately if:

  • Lag exceeds 60 seconds (replica rebuild likely needed)
  • Replica is stuck in crash loop during sync
  • Merkle sync reports corruption (data integrity issue)
  • Multiple replicas lagging simultaneously (primary overload)

Escalation path:

  1. Primary on-call: Cluster SRE
  2. Secondary: Distributed systems engineer
  3. Final escalation: Principal engineer (data corruption suspected)

References

  • Dashboard: StemeDB Cluster Overview
  • Related alerts: ClusterSplitBrain, MerkleSyncFailure, HighNetworkUtilization
  • Metrics to check:
    • stemedb_replication_lag_seconds (lag duration)
    • stemedb_merkle_sync_duration_seconds (sync timing)
    • stemedb_assertions_indexed_total (ingestion rate)
    • stemedb_network_bytes_sent_total (replication bandwidth)
  • Runbooks: rebuild-replica.md, split-brain.md