Cluster Split Brain
Severity: CRITICAL
Alert Rule
Alert: ClusterSplitBrain
Trigger: Multiple nodes claim to be primary
Duration: 1m
Symptom
- Metrics show stemedb_cluster_primary_count > 1
- Logs contain "primary election conflict" or "multiple primaries detected"
- Different clients see different primary nodes
- Different primaries issue assertion IDs for the same timestamp
- SWIM gossip reports conflicting cluster state
Impact
User Impact:
- Writes may be accepted by multiple primaries → data divergence
- Queries return different results depending on routing
- Inconsistent state across cluster (violates linearizability)
System Impact:
- Data loss when resolving split (one primary's writes discarded)
- Manual intervention required to merge diverged state
- Cluster trust degraded (reputation impact)
Investigation Steps
1. Identify All Nodes Claiming Primary
# Query each node's role
for node in node1 node2 node3; do
echo "=== $node ==="
curl -s http://$node:18180/v1/admin/cluster/status | jq '.role'
done
Expected: Exactly one node should return "primary".
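The check above can be automated rather than read by eye. The following is a minimal sketch, assuming each node prints its role on one line (as `jq '.role'` does); the helper names `count_primaries` and `check_single_primary` are hypothetical and not part of StemeDB.

```shell
# Reads one role per line (quoted or unquoted, as printed by jq)
# and prints how many of them are exactly "primary".
count_primaries() {
  awk '{ gsub(/"/, "") } $0 == "primary" { n++ } END { print n + 0 }'
}

# Prints a one-line verdict and returns non-zero unless exactly
# one node reported itself as primary.
check_single_primary() {
  n=$(count_primaries)
  if [ "$n" -eq 1 ]; then
    echo "OK: exactly one primary"
  else
    echo "UNEXPECTED: $n nodes report primary"
    return 1
  fi
}

# Usage (not executed here):
# for node in node1 node2 node3; do
#   curl -s "http://$node:18180/v1/admin/cluster/status" | jq '.role'
# done | check_single_primary
```

A count of zero is also reported as unexpected, since it means no node is accepting writes as primary.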
2. Check SWIM Gossip State
# Get cluster membership from each node
for node in node1 node2 node3; do
echo "=== $node ==="
curl -s http://$node:18180/v1/admin/cluster/members | jq '.members[] | {id, role, health}'
done
3. Check Network Partition
# Test connectivity between nodes
for src in node1 node2 node3; do
for dst in node1 node2 node3; do
[[ "$src" == "$dst" ]] && continue
echo "$src → $dst:"
ssh $src "timeout 2 nc -zv $dst 18182 2>&1 | tail -1"
done
done
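With more than a few nodes, the pairwise output is easier to act on once it is reduced to the nodes nobody can reach. The sketch below assumes the loop's results have been rewritten as `src dst status` lines, where `status` is `ok` or `fail`; that line format and the helper name `isolated_nodes` are assumptions made for illustration.

```shell
# Input: lines of "src dst status", status being "ok" or "fail".
# Output: every dst that no src could reach, one per line, sorted.
isolated_nodes() {
  awk '
    { seen[$2] = 1 }                 # remember every destination probed
    $3 == "ok" { reachable[$2] = 1 } # at least one peer got through
    END { for (n in seen) if (!(n in reachable)) print n }
  ' | sort
}

# Usage (not executed here): emit one line per pair, then summarize.
# for src in node1 node2 node3; do
#   for dst in node1 node2 node3; do
#     [ "$src" = "$dst" ] && continue
#     if ssh "$src" "timeout 2 nc -z $dst 18182" >/dev/null 2>&1; then
#       echo "$src $dst ok"
#     else
#       echo "$src $dst fail"
#     fi
#   done
# done | isolated_nodes
```

A node listed here is cut off on the probed port from every peer; an empty result with some failed pairs points to a partial or asymmetric partition instead.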
4. Review Election Logs
# Check when each node became primary
for node in node1 node2 node3; do
echo "=== $node ==="
ssh $node "journalctl -u stemedb-api | grep 'elected primary' | tail -5"
done
Resolution
Immediate Mitigation: Force Single Primary
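WARNING: Demoting a primary discards the writes it accepted during the split unless they are merged back (see Data Reconciliation After Split). Keep the node with the most recent data as the surviving primary.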
1. Identify primary with latest data:
# Compare indexed assertion counts (used here as a proxy for the most recent data)
for node in node1 node2 node3; do
echo "$node:"
curl -s http://$node:18180/metrics | grep assertions_indexed_total
done
Choose node with highest count.
2. Demote other primaries to replica:
# On each conflicting primary:
curl -X POST http://$node:18180/v1/admin/cluster/demote \
-H 'Content-Type: application/json' \
-d '{"force": true}'
3. Verify single primary:
for node in node1 node2 node3; do
curl -s http://$node:18180/v1/admin/cluster/status | jq '.role'
done
Expected: One "primary", all others "replica".
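The selection in step 1 can be scripted so the choice does not depend on comparing numbers by eye under pressure. This is a minimal sketch, assuming one `node count` pair per line; `pick_winner` is a hypothetical helper, and the highest count is the runbook's heuristic rather than a proof of which node holds the newest data.

```shell
# Input: lines of "node count". Output: the node with the highest count.
# On a tie the first node listed wins, so treat ties as "undecided".
pick_winner() {
  awk '
    NF >= 2 && (best == "" || $2 + 0 > max) { max = $2 + 0; best = $1 }
    END { if (best != "") print best }
  '
}

# Usage (not executed here). The metric may be exported with several
# label sets, so the values of all matching samples are summed per node.
# for node in node1 node2 node3; do
#   count=$(curl -s "http://$node:18180/metrics" \
#     | grep -v '^#' | grep assertions_indexed_total \
#     | awk '{ s += $NF } END { print s + 0 }')
#   echo "$node $count"
# done | pick_winner
```

If two primaries report nearly identical counts, the count alone is not enough to decide, which matches the "unable to identify winning primary" escalation criterion.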
Root Cause Resolution
If Network Partition Detected:
1. Restore network connectivity:
# Check firewall rules
iptables -L -n | grep 18182
# Check routing
ip route show
2. Verify SWIM gossip recovery:
# Watch gossip convergence
watch -n 2 'curl -s http://node1:18180/v1/admin/cluster/members | jq ".members[].health"'
If Split Caused by Clock Skew:
1. Check time drift:
for node in node1 node2 node3; do
echo "$node: $(ssh $node date +%s)"
done
2. Sync clocks:
# Restart NTP
for node in node1 node2 node3; do
ssh $node "systemctl restart chronyd && chronyc makestep"
done
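The drift check in step 1 prints raw epoch seconds, which still have to be compared by hand. A small sketch that reduces that output to a single spread value follows; it assumes lines of the form `node: epoch` as printed by that loop, and `max_clock_skew` is a hypothetical helper name.

```shell
# Input: lines ending in an integer epoch timestamp, e.g. "node1: 1700000000".
# Output: the difference in seconds between the largest and smallest value.
max_clock_skew() {
  awk '
    $NF ~ /^[0-9]+$/ {
      v = $NF + 0
      if (n == 0 || v < min) min = v
      if (n == 0 || v > max) max = v
      n++
    }
    END { if (n > 0) print max - min }
  '
}

# Usage (not executed here):
# for node in node1 node2 node3; do
#   echo "$node: $(ssh "$node" date +%s)"
# done | max_clock_skew
```

Because the nodes are queried one after another over SSH, a spread of a second or so can be measurement latency rather than real drift; the ClockSkewDetected alert's 1 second threshold on stemedb_clock_offset_seconds is the more reliable signal.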
If Split Caused by SWIM Bug:
1. Restart SWIM membership service:
# On each node
curl -X POST http://localhost:18180/v1/admin/cluster/restart-gossip
2. If restart fails, force cluster reset:
# On primary only
curl -X POST http://localhost:18180/v1/admin/cluster/reinit \
-H 'Content-Type: application/json' \
-d '{"bootstrap": true}'
# On replicas
curl -X POST http://localhost:18180/v1/admin/cluster/join \
-H 'Content-Type: application/json' \
-d '{"primary_address": "node1:18182"}'
Data Reconciliation After Split
1. Compare data divergence:
# Get Merkle tree diff between primaries
curl -X POST http://node1:18180/v1/admin/cluster/merkle-diff \
-H 'Content-Type: application/json' \
-d '{"other_node": "node2"}'
2. If divergence is small (<100 assertions), manual merge:
# Export assertions from demoted primary
curl -s http://node2:18180/v1/admin/export-assertions \
-H 'Content-Type: application/json' \
--data '{"since": <split_timestamp>}' \
> /tmp/node2-assertions.jsonl
# Import into winning primary
curl -X POST http://node1:18180/v1/admin/import-assertions \
--data-binary @/tmp/node2-assertions.jsonl
3. If divergence is large, escalate for manual resolution:
See docs/operations/runbooks/merge-diverged-clusters.md.
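Before importing, it may be worth confirming that the export really is below the manual-merge threshold from step 2. The sketch below assumes the export file holds one assertion per line (suggested by the `.jsonl` extension but not confirmed here); `export_count` and `reconcile_action` are hypothetical helpers.

```shell
MANUAL_MERGE_LIMIT=100   # threshold taken from step 2 of this section

# Count non-empty lines in a JSONL export (one assertion per line assumed).
export_count() {
  awk 'NF { n++ } END { print n + 0 }' "$1"
}

# Map a diverged-assertion count to the action this runbook prescribes.
reconcile_action() {
  case "$1" in
    ''|*[!0-9]*) echo "usage: reconcile_action <count>" >&2; return 2 ;;
  esac
  if [ "$1" -eq 0 ]; then
    echo "none"
  elif [ "$1" -lt "$MANUAL_MERGE_LIMIT" ]; then
    echo "manual-merge"
  else
    echo "escalate"
  fi
}

# Usage (not executed here):
# reconcile_action "$(export_count /tmp/node2-assertions.jsonl)"
```

Running this between the export and the import gives a cheap guard against merging a much larger divergence than the Merkle diff suggested.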
Prevention
Monitoring and Alerting
1. Alert on primary count:
- alert: MultiplePrimaries
expr: sum(stemedb_cluster_is_primary) > 1
for: 1m
annotations:
summary: "Split brain detected: multiple primaries"
2. Monitor SWIM gossip health:
- alert: GossipUnreachable
expr: stemedb_swim_unreachable_members > 0
for: 2m
annotations:
summary: "SWIM gossip detecting unreachable members"
3. Alert on clock skew:
- alert: ClockSkewDetected
expr: abs(stemedb_clock_offset_seconds) > 1
for: 5m
annotations:
summary: "Clock skew exceeds 1 second"
Capacity Planning
1. Deploy nodes across failure domains:
- Different racks (power/network isolation)
- Different availability zones (cloud deployments)
2. Use dedicated network for cluster gossip:
# /etc/stemedb/api.toml
[cluster]
gossip_bind_address = "10.0.1.100:18183" # Private network
3. Configure SWIM timeouts for network:
[cluster.swim]
suspicion_timeout_ms = 5000
probe_interval_ms = 1000
probe_timeout_ms = 500
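When tuning the SWIM values in step 3, a quick sanity check on how they relate to each other can catch typos before a rollout. The sketch below encodes a general SWIM rule of thumb, not a documented StemeDB constraint: the probe timeout should be shorter than the probe interval, and the suspicion timeout should cover at least one interval. `check_swim_timeouts` is a hypothetical helper.

```shell
# Arguments: suspicion_timeout_ms probe_interval_ms probe_timeout_ms
# Prints a verdict; returns non-zero if the values look inconsistent.
check_swim_timeouts() {
  suspicion=$1; interval=$2; timeout=$3
  if [ "$timeout" -ge "$interval" ]; then
    echo "probe_timeout_ms ($timeout) should be below probe_interval_ms ($interval)"
    return 1
  fi
  if [ "$suspicion" -lt "$interval" ]; then
    echo "suspicion_timeout_ms ($suspicion) should be at least probe_interval_ms ($interval)"
    return 1
  fi
  # Integer division: how many probe rounds fit in the suspicion window.
  echo "ok: suspicion window spans $((suspicion / interval)) probe intervals"
}

# Usage with the values shown above:
# check_swim_timeouts 5000 1000 500
```

With the values from step 3, a suspected node gets roughly five probe rounds to answer before the suspicion window closes; on high-latency links, widening that window trades slower failure detection for fewer false positives.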
Operational Best Practices
1. Regular cluster health checks:
# Daily validation
curl -s http://localhost:18180/v1/admin/cluster/validate | jq '{
primary_count: .primaries,
replica_count: .replicas,
unreachable: .unreachable
}'
2. Test network partitions in staging:
# Simulate partition with iptables
iptables -A INPUT -s 10.0.1.102 -j DROP
iptables -A OUTPUT -d 10.0.1.102 -j DROP
# Wait for detection
sleep 60
# Verify single primary
curl -s http://localhost:18180/v1/admin/cluster/status
# Restore network
iptables -D INPUT -s 10.0.1.102 -j DROP
iptables -D OUTPUT -d 10.0.1.102 -j DROP
3. Document primary election priority:
Configure explicit priority for deterministic elections:
[cluster]
election_priority = 100 # Higher on preferred primary
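The staging partition test in step 2 leaves the DROP rules in place if it is interrupted before the restore commands run. One way to make it safer is to wrap it so cleanup always happens; the sketch below assumes that approach, and the names `partition_rules`, `run_partition_drill`, and the `IPTABLES` override variable are hypothetical.

```shell
IPTABLES="${IPTABLES:-iptables}"   # override to dry-run with a stand-in command

# Add (-A) or delete (-D) the two DROP rules that isolate one peer.
partition_rules() {
  "$IPTABLES" "$1" INPUT -s "$2" -j DROP
  "$IPTABLES" "$1" OUTPUT -d "$2" -j DROP
}

# Arguments: peer_ip wait_seconds [verification command...]
# Runs in a subshell so the EXIT trap removes the rules when the drill
# ends, including after Ctrl-C or a failing verification command.
run_partition_drill() (
  peer=$1; wait_s=$2; shift 2
  trap 'partition_rules -D "$peer"' EXIT
  trap 'exit 130' INT TERM
  partition_rules -A "$peer"
  sleep "$wait_s"                       # give the cluster time to detect it
  if [ "$#" -gt 0 ]; then "$@"; fi      # verify while still partitioned
)

# Usage (staging only, as root; not executed here):
# run_partition_drill 10.0.1.102 60 \
#   curl -s http://localhost:18180/v1/admin/cluster/status
```

The verification command runs while the partition is still active, which is the state the drill is meant to observe.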
Escalation
Escalate immediately if:
- Split brain lasts >5 minutes (data divergence growing)
- Unable to identify winning primary (data loss unavoidable)
- Network partition affects >50% of cluster
- Split brain recurs after resolution (systemic issue)
Escalation path:
- Primary on-call: Cluster SRE
- Secondary: Distributed systems architect
- Final escalation: CTO + VP Engineering (customer-facing impact)
References
- Dashboard: StemeDB Cluster Health
- Related alerts: GossipUnreachable, PrimaryElectionFailed, HighReplicationLag
- Metrics:
  - stemedb_cluster_is_primary (0 or 1 per node)
  - stemedb_swim_unreachable_members (network health)
  - stemedb_clock_offset_seconds (time sync)
- Runbooks: high-replication-lag.md, merge-diverged-clusters.md