This commit implements comprehensive production hardening across multiple layers to prepare StemeDB for enterprise pilot deployments: ## API Layer - Add rate limiting middleware with configurable limits per endpoint - Enhance error handling with detailed context and proper HTTP status codes - Add security hardening tests for input validation and boundary conditions - Create store_helpers module for defensive storage access patterns ## Storage & WAL - Optimize group commit batching for higher throughput - Add defensive error handling in hybrid backend with proper fallbacks - Enhance WAL journal durability guarantees with fsync validation - Improve index store query performance with better caching ## Operations & Deployment - Add comprehensive operations documentation (deployment, monitoring, DR) - Create systemd units for backup, WAL archival, and verification - Add monitoring configs (Prometheus alerts, metrics exporters) - Implement backup/restore scripts with verification and S3 archival - Add DR drill automation and runbook procedures - Create load balancer configs (nginx, envoy) with health checks ## Documentation - Update CLAUDE.md with operations and troubleshooting guides - Expand roadmap with production readiness milestones - Add pilot success criteria and deployment reference architecture - Document TLS setup, monitoring integration, and incident response ## Configuration - Add .env.example with all required environment variables - Document resource sizing for different deployment scales - Add configuration examples for various deployment topologies This positions StemeDB for successful enterprise pilots with proper operational discipline, monitoring, backup/DR, and security hardening. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
5.8 KiB
High Replication Lag
Severity: CRITICAL
Alert Rule
Alert: ReplicationLagCritical
Trigger: Replica lag exceeds 10 seconds
Duration: 3m
Symptom
- Query results from replicas are stale (missing recent assertions)
- Replication metrics show increasing lag (e.g.,
stemedb_replication_lag_seconds > 10) - Merkle tree sync reports large diffs between primary and replica
- Clients reading from replicas see inconsistent data
Impact
User Impact:
- Queries to replicas return outdated results
- Reads may miss assertions written in the last 10+ seconds
- Eventual consistency SLAs violated
System Impact:
- Replica may fall too far behind to catch up (cascading failure)
- Increased Merkle tree diff volume (bandwidth spike)
- Risk of replica demotion or rebuild
Investigation Steps
1. Check Replication Status
# Query replication lag metric
curl -s http://localhost:18180/metrics | grep replication_lag
# Expected output (example):
# stemedb_replication_lag_seconds{replica="node2"} 12.5
2. Identify Bottleneck
A. Network latency:
# Ping replica from primary
ping -c 10 <replica-ip>
# Check bandwidth usage
iftop -i eth0 -f "port 18182"
B. Replica disk I/O:
# SSH to replica
iostat -x 1 10
# Look for high %util on WAL partition
C. Replica CPU saturation:
# SSH to replica
top -b -n 1 | grep stemedb
3. Check for Merkle Sync Errors
# Primary logs
journalctl -u stemedb-api | grep -i "merkle sync" | tail -20
# Replica logs
ssh replica "journalctl -u stemedb-api | grep -i 'sync error' | tail -20"
4. Compare Assertion Counts
# Primary assertion count
curl -s http://localhost:18180/metrics | grep assertions_indexed_total
# Replica assertion count
curl -s http://<replica>:18180/metrics | grep assertions_indexed_total
Resolution
If Network Latency is High
1. Check network path:
traceroute <replica-ip>
mtr -r -c 10 <replica-ip>
2. Verify firewall rules:
# RPC port 18182 should be open
telnet <replica-ip> 18182
3. Increase RPC timeout if needed:
Edit /etc/stemedb/api.toml on primary:
[cluster]
rpc_timeout_ms = 10000 # Increase from default 5000
Restart primary:
systemctl restart stemedb-api
If Replica Disk I/O is Saturated
1. Verify WAL write performance:
# SSH to replica
cd /var/lib/stemedb/wal
time dd if=/dev/zero of=test.dat bs=1M count=1000 oflag=direct
rm test.dat
Expected: >100 MB/s on SSD.
2. Check for competing I/O:
iotop -o
3. Temporarily reduce ingestion rate on primary:
# Apply rate limit via admin endpoint
curl -X POST http://localhost:18180/v1/admin/rate-limit \
-H 'Content-Type: application/json' \
-d '{"max_assertions_per_sec": 1000}'
If Replica is Falling Further Behind
1. Initiate manual Merkle sync:
curl -X POST http://localhost:18180/v1/admin/cluster/sync \
-H 'Content-Type: application/json' \
-d '{"replica_id": "node2", "force": true}'
2. Monitor sync progress:
watch -n 5 'curl -s http://localhost:18180/metrics | grep merkle_sync_progress'
3. If sync fails repeatedly, rebuild replica:
See docs/operations/runbooks/rebuild-replica.md.
If Replication Stream is Blocked
1. Check for circuit breaker trip:
curl -s http://localhost:18180/v1/admin/circuit-breakers/tripped | jq
2. Reset circuit breaker if needed:
curl -X POST http://localhost:18180/v1/admin/circuit-breaker/reset \
-H 'Content-Type: application/json' \
-d '{"agent_id": "<replica_agent_id>"}'
Prevention
Monitoring and Alerting
1. Add warning-level lag alert:
# Prometheus alert rule
- alert: ReplicationLagWarning
expr: stemedb_replication_lag_seconds > 5
for: 5m
annotations:
summary: "Replica lag exceeds 5 seconds"
2. Monitor Merkle sync errors:
- alert: MerkleSyncFailures
expr: rate(stemedb_merkle_sync_errors_total[5m]) > 0.1
annotations:
summary: "Frequent Merkle sync failures detected"
Capacity Planning
1. Ensure replica hardware matches primary:
- Same or better disk I/O (IOPS)
- Same network bandwidth
- Sufficient CPU headroom
2. Set replication backpressure threshold:
# /etc/stemedb/api.toml
[cluster]
max_replication_lag_seconds = 30 # Pause ingestion if lag exceeds
Operational Best Practices
1. Gradual rollout of high-volume ingestion:
# Ramp up assertion rate slowly
for rate in 100 500 1000 2000; do
echo "Testing rate: $rate/sec"
# Apply rate via API
curl -X POST http://localhost:18180/v1/admin/rate-limit \
-d "{\"max_assertions_per_sec\": $rate}"
sleep 300 # Monitor for 5 minutes
# Check lag
curl -s http://localhost:18180/metrics | grep replication_lag
done
2. Pre-provision replicas before traffic spikes:
Add replicas 24 hours before expected load increase.
Escalation
Escalate immediately if:
- Lag exceeds 60 seconds (replica rebuild likely needed)
- Replica is stuck in crash loop during sync
- Merkle sync reports corruption (data integrity issue)
- Multiple replicas lagging simultaneously (primary overload)
Escalation path:
- Primary on-call: Cluster SRE
- Secondary: Distributed systems engineer
- Final escalation: Principal engineer (data corruption suspected)
References
- Dashboard: StemeDB Cluster Overview
- Related alerts:
ClusterSplitBrain,MerkleSyncFailure,HighNetworkUtilization - Metrics to check:
stemedb_replication_lag_seconds(lag duration)stemedb_merkle_sync_duration_seconds(sync timing)stemedb_assertions_indexed_total(ingestion rate)stemedb_network_bytes_sent_total(replication bandwidth)
- Runbooks:
rebuild-replica.md,split-brain.md