This commit implements comprehensive production hardening across multiple layers to prepare StemeDB for enterprise pilot deployments: ## API Layer - Add rate limiting middleware with configurable limits per endpoint - Enhance error handling with detailed context and proper HTTP status codes - Add security hardening tests for input validation and boundary conditions - Create store_helpers module for defensive storage access patterns ## Storage & WAL - Optimize group commit batching for higher throughput - Add defensive error handling in hybrid backend with proper fallbacks - Enhance WAL journal durability guarantees with fsync validation - Improve index store query performance with better caching ## Operations & Deployment - Add comprehensive operations documentation (deployment, monitoring, DR) - Create systemd units for backup, WAL archival, and verification - Add monitoring configs (Prometheus alerts, metrics exporters) - Implement backup/restore scripts with verification and S3 archival - Add DR drill automation and runbook procedures - Create load balancer configs (nginx, envoy) with health checks ## Documentation - Update CLAUDE.md with operations and troubleshooting guides - Expand roadmap with production readiness milestones - Add pilot success criteria and deployment reference architecture - Document TLS setup, monitoring integration, and incident response ## Configuration - Add .env.example with all required environment variables - Document resource sizing for different deployment scales - Add configuration examples for various deployment topologies This positions StemeDB for successful enterprise pilots with proper operational discipline, monitoring, backup/DR, and security hardening. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
273 lines
5.8 KiB
Markdown
273 lines
5.8 KiB
Markdown
# High Replication Lag
|
|
|
|
## Severity: CRITICAL
|
|
|
|
## Alert Rule
|
|
|
|
**Alert:** `ReplicationLagCritical`
|
|
**Trigger:** Replica lag exceeds 10 seconds
|
|
**Duration:** 3m
|
|
|
|
## Symptom
|
|
|
|
- Query results from replicas are stale (missing recent assertions)
|
|
- Replication metrics show increasing lag (e.g., `stemedb_replication_lag_seconds > 10`)
|
|
- Merkle tree sync reports large diffs between primary and replica
|
|
- Clients reading from replicas see inconsistent data
|
|
|
|
## Impact
|
|
|
|
**User Impact:**
|
|
- Queries to replicas return outdated results
|
|
- Reads may miss assertions written in the last 10+ seconds
|
|
- Eventual consistency SLAs violated
|
|
|
|
**System Impact:**
|
|
- Replica may fall too far behind to catch up (cascading failure)
|
|
- Increased Merkle tree diff volume (bandwidth spike)
|
|
- Risk of replica demotion or rebuild
|
|
|
|
## Investigation Steps
|
|
|
|
### 1. Check Replication Status
|
|
|
|
```bash
|
|
# Query replication lag metric
|
|
curl -s http://localhost:18180/metrics | grep replication_lag
|
|
|
|
# Expected output (example):
|
|
# stemedb_replication_lag_seconds{replica="node2"} 12.5
|
|
```
|
|
|
|
### 2. Identify Bottleneck
|
|
|
|
**A. Network latency:**
|
|
|
|
```bash
|
|
# Ping replica from primary
|
|
ping -c 10 <replica-ip>
|
|
|
|
# Check bandwidth usage
|
|
iftop -i eth0 -f "port 18182"
|
|
```
|
|
|
|
**B. Replica disk I/O:**
|
|
|
|
```bash
|
|
# SSH to replica
|
|
iostat -x 1 10
|
|
|
|
# Look for high %util on WAL partition
|
|
```
|
|
|
|
**C. Replica CPU saturation:**
|
|
|
|
```bash
|
|
# SSH to replica
|
|
top -b -n 1 | grep stemedb
|
|
```
|
|
|
|
### 3. Check for Merkle Sync Errors
|
|
|
|
```bash
|
|
# Primary logs
|
|
journalctl -u stemedb-api | grep -i "merkle sync" | tail -20
|
|
|
|
# Replica logs
|
|
ssh replica "journalctl -u stemedb-api | grep -i 'sync error' | tail -20"
|
|
```
|
|
|
|
### 4. Compare Assertion Counts
|
|
|
|
```bash
|
|
# Primary assertion count
|
|
curl -s http://localhost:18180/metrics | grep assertions_indexed_total
|
|
|
|
# Replica assertion count
|
|
curl -s http://<replica>:18180/metrics | grep assertions_indexed_total
|
|
```
|
|
|
|
## Resolution
|
|
|
|
### If Network Latency is High
|
|
|
|
**1. Check network path:**
|
|
|
|
```bash
|
|
traceroute <replica-ip>
|
|
mtr -r -c 10 <replica-ip>
|
|
```
|
|
|
|
**2. Verify firewall rules:**
|
|
|
|
```bash
|
|
# RPC port 18182 should be open
|
|
telnet <replica-ip> 18182
|
|
```
|
|
|
|
**3. Increase RPC timeout if needed:**
|
|
|
|
Edit `/etc/stemedb/api.toml` on primary:
|
|
|
|
```toml
|
|
[cluster]
|
|
rpc_timeout_ms = 10000 # Increase from default 5000
|
|
```
|
|
|
|
Restart primary:
|
|
|
|
```bash
|
|
systemctl restart stemedb-api
|
|
```
|
|
|
|
### If Replica Disk I/O is Saturated
|
|
|
|
**1. Verify WAL write performance:**
|
|
|
|
```bash
|
|
# SSH to replica
|
|
cd /var/lib/stemedb/wal
|
|
time dd if=/dev/zero of=test.dat bs=1M count=1000 oflag=direct
|
|
rm test.dat
|
|
```
|
|
|
|
Expected: >100 MB/s on SSD.
|
|
|
|
**2. Check for competing I/O:**
|
|
|
|
```bash
|
|
iotop -o
|
|
```
|
|
|
|
**3. Temporarily reduce ingestion rate on primary:**
|
|
|
|
```bash
|
|
# Apply rate limit via admin endpoint
|
|
curl -X POST http://localhost:18180/v1/admin/rate-limit \
|
|
-H 'Content-Type: application/json' \
|
|
-d '{"max_assertions_per_sec": 1000}'
|
|
```
|
|
|
|
### If Replica is Falling Further Behind
|
|
|
|
**1. Initiate manual Merkle sync:**
|
|
|
|
```bash
|
|
curl -X POST http://localhost:18180/v1/admin/cluster/sync \
|
|
-H 'Content-Type: application/json' \
|
|
-d '{"replica_id": "node2", "force": true}'
|
|
```
|
|
|
|
**2. Monitor sync progress:**
|
|
|
|
```bash
|
|
watch -n 5 'curl -s http://localhost:18180/metrics | grep merkle_sync_progress'
|
|
```
|
|
|
|
**3. If sync fails repeatedly, rebuild replica:**
|
|
|
|
See `docs/operations/runbooks/rebuild-replica.md`.
|
|
|
|
### If Replication Stream is Blocked
|
|
|
|
**1. Check for circuit breaker trip:**
|
|
|
|
```bash
|
|
curl -s http://localhost:18180/v1/admin/circuit-breakers/tripped | jq
|
|
```
|
|
|
|
**2. Reset circuit breaker if needed:**
|
|
|
|
```bash
|
|
curl -X POST http://localhost:18180/v1/admin/circuit-breaker/reset \
|
|
-H 'Content-Type: application/json' \
|
|
-d '{"agent_id": "<replica_agent_id>"}'
|
|
```
|
|
|
|
## Prevention
|
|
|
|
### Monitoring and Alerting
|
|
|
|
**1. Add warning-level lag alert:**
|
|
|
|
```yaml
|
|
# Prometheus alert rule
|
|
- alert: ReplicationLagWarning
|
|
expr: stemedb_replication_lag_seconds > 5
|
|
for: 5m
|
|
annotations:
|
|
summary: "Replica lag exceeds 5 seconds"
|
|
```
|
|
|
|
**2. Monitor Merkle sync errors:**
|
|
|
|
```yaml
|
|
- alert: MerkleSyncFailures
|
|
expr: rate(stemedb_merkle_sync_errors_total[5m]) > 0.1
|
|
annotations:
|
|
summary: "Frequent Merkle sync failures detected"
|
|
```
|
|
|
|
### Capacity Planning
|
|
|
|
**1. Ensure replica hardware matches primary:**
|
|
|
|
- Same or better disk I/O (IOPS)
|
|
- Same network bandwidth
|
|
- Sufficient CPU headroom
|
|
|
|
**2. Set replication backpressure threshold:**
|
|
|
|
```toml
|
|
# /etc/stemedb/api.toml
|
|
[cluster]
|
|
max_replication_lag_seconds = 30 # Pause ingestion if lag exceeds
|
|
```
|
|
|
|
### Operational Best Practices
|
|
|
|
**1. Gradual rollout of high-volume ingestion:**
|
|
|
|
```bash
|
|
# Ramp up assertion rate slowly
|
|
for rate in 100 500 1000 2000; do
|
|
echo "Testing rate: $rate/sec"
|
|
# Apply rate via API
|
|
curl -X POST http://localhost:18180/v1/admin/rate-limit \
|
|
-d "{\"max_assertions_per_sec\": $rate}"
|
|
sleep 300 # Monitor for 5 minutes
|
|
# Check lag
|
|
curl -s http://localhost:18180/metrics | grep replication_lag
|
|
done
|
|
```
|
|
|
|
**2. Pre-provision replicas before traffic spikes:**
|
|
|
|
Add replicas 24 hours before expected load increase.
|
|
|
|
## Escalation
|
|
|
|
**Escalate immediately if:**
|
|
|
|
- Lag exceeds 60 seconds (replica rebuild likely needed)
|
|
- Replica is stuck in crash loop during sync
|
|
- Merkle sync reports corruption (data integrity issue)
|
|
- Multiple replicas lagging simultaneously (primary overload)
|
|
|
|
**Escalation path:**
|
|
|
|
1. **Primary on-call:** Cluster SRE
|
|
2. **Secondary:** Distributed systems engineer
|
|
3. **Final escalation:** Principal engineer (data corruption suspected)
|
|
|
|
## References
|
|
|
|
- **Dashboard:** [StemeDB Cluster Overview](http://grafana.example.com/d/stemedb-cluster)
|
|
- **Related alerts:** `ClusterSplitBrain`, `MerkleSyncFailure`, `HighNetworkUtilization`
|
|
- **Metrics to check:**
|
|
- `stemedb_replication_lag_seconds` (lag duration)
|
|
- `stemedb_merkle_sync_duration_seconds` (sync timing)
|
|
- `stemedb_assertions_indexed_total` (ingestion rate)
|
|
- `stemedb_network_bytes_sent_total` (replication bandwidth)
|
|
- **Runbooks:** `rebuild-replica.md`, `split-brain.md`
|