# Cluster Split Brain ## Severity: CRITICAL ## Alert Rule **Alert:** `ClusterSplitBrain` **Trigger:** Multiple nodes claim to be primary **Duration:** 1m ## Symptom - Metrics show `stemedb_cluster_primary_count > 1` - Logs contain "primary election conflict" or "multiple primaries detected" - Different clients see different primary nodes - Assertion IDs from different primaries for same timestamp - SWIM gossip reports conflicting cluster state ## Impact **User Impact:** - Writes may be accepted by multiple primaries → data divergence - Queries return different results depending on routing - Inconsistent state across cluster (violates linearizability) **System Impact:** - Data loss when resolving split (one primary's writes discarded) - Manual intervention required to merge diverged state - Cluster trust degraded (reputation impact) ## Investigation Steps ### 1. Identify All Nodes Claiming Primary ```bash # Query each node's role for node in node1 node2 node3; do echo "=== $node ===" curl -s http://$node:18180/v1/admin/cluster/status | jq '.role' done ``` Expected: Exactly one node should return `"primary"`. ### 2. Check SWIM Gossip State ```bash # Get cluster membership from each node for node in node1 node2 node3; do echo "=== $node ===" curl -s http://$node:18180/v1/admin/cluster/members | jq '.members[] | {id, role, health}' done ``` ### 3. Check Network Partition ```bash # Test connectivity between nodes for src in node1 node2 node3; do for dst in node1 node2 node3; do [[ $src == $dst ]] && continue echo "$src → $dst:" ssh $src "timeout 2 nc -zv $dst 18182 2>&1 | tail -1" done done ``` ### 4. Review Election Logs ```bash # Check when each node became primary for node in node1 node2 node3; do echo "=== $node ===" ssh $node "journalctl -u stemedb-api | grep 'elected primary' | tail -5" done ``` ## Resolution ### Immediate Mitigation: Force Single Primary **WARNING:** This will cause writes to one node to be discarded. Choose the node with the most recent data. **1. Identify primary with latest data:** ```bash # Compare latest assertion timestamps for node in node1 node2 node3; do echo "$node:" curl -s http://$node:18180/metrics | grep assertions_indexed_total done ``` Choose node with highest count. **2. Demote other primaries to replica:** ```bash # On each conflicting primary: curl -X POST http://$node:18180/v1/admin/cluster/demote \ -H 'Content-Type: application/json' \ -d '{"force": true}' ``` **3. Verify single primary:** ```bash for node in node1 node2 node3; do curl -s http://$node:18180/v1/admin/cluster/status | jq '.role' done ``` Expected: One `"primary"`, all others `"replica"`. ### Root Cause Resolution **If Network Partition Detected:** **1. Restore network connectivity:** ```bash # Check firewall rules iptables -L -n | grep 18182 # Check routing ip route show ``` **2. Verify SWIM gossip recovery:** ```bash # Watch gossip convergence watch -n 2 'curl -s http://node1:18180/v1/admin/cluster/members | jq .members[].health' ``` **If Split Caused by Clock Skew:** **1. Check time drift:** ```bash for node in node1 node2 node3; do echo "$node: $(ssh $node date +%s)" done ``` **2. Sync clocks:** ```bash # Restart NTP for node in node1 node2 node3; do ssh $node "systemctl restart chronyd && chronyc makestep" done ``` **If Split Caused by SWIM Bug:** **1. Restart SWIM membership service:** ```bash # On each node curl -X POST http://localhost:18180/v1/admin/cluster/restart-gossip ``` **2. If restart fails, force cluster reset:** ```bash # On primary only curl -X POST http://localhost:18180/v1/admin/cluster/reinit \ -d '{"bootstrap": true}' # On replicas curl -X POST http://localhost:18180/v1/admin/cluster/join \ -d '{"primary_address": "node1:18182"}' ``` ### Data Reconciliation After Split **1. Compare data divergence:** ```bash # Get Merkle tree diff between primaries curl -X POST http://node1:18180/v1/admin/cluster/merkle-diff \ -d '{"other_node": "node2"}' ``` **2. If divergence is small (<100 assertions), manual merge:** ```bash # Export assertions from demoted primary curl -s http://node2:18180/v1/admin/export-assertions \ --data '{"since": }' \ > /tmp/node2-assertions.jsonl # Import into winning primary curl -X POST http://node1:18180/v1/admin/import-assertions \ --data-binary @/tmp/node2-assertions.jsonl ``` **3. If divergence is large, escalate for manual resolution:** See `docs/operations/runbooks/merge-diverged-clusters.md`. ## Prevention ### Monitoring and Alerting **1. Alert on primary count:** ```yaml - alert: MultiplePrimaries expr: sum(stemedb_cluster_is_primary) > 1 for: 1m annotations: summary: "Split brain detected: multiple primaries" ``` **2. Monitor SWIM gossip health:** ```yaml - alert: GossipUnreachable expr: stemedb_swim_unreachable_members > 0 for: 2m annotations: summary: "SWIM gossip detecting unreachable members" ``` **3. Alert on clock skew:** ```yaml - alert: ClockSkewDetected expr: abs(stemedb_clock_offset_seconds) > 1 for: 5m annotations: summary: "Clock skew exceeds 1 second" ``` ### Capacity Planning **1. Deploy nodes across failure domains:** - Different racks (power/network isolation) - Different availability zones (cloud deployments) **2. Use dedicated network for cluster gossip:** ```toml # /etc/stemedb/api.toml [cluster] gossip_bind_address = "10.0.1.100:18183" # Private network ``` **3. Configure SWIM timeouts for network:** ```toml [cluster.swim] suspicion_timeout_ms = 5000 probe_interval_ms = 1000 probe_timeout_ms = 500 ``` ### Operational Best Practices **1. Regular cluster health checks:** ```bash # Daily validation curl -s http://localhost:18180/v1/admin/cluster/validate | jq '{ primary_count: .primaries, replica_count: .replicas, unreachable: .unreachable }' ``` **2. Test network partitions in staging:** ```bash # Simulate partition with iptables iptables -A INPUT -s 10.0.1.102 -j DROP iptables -A OUTPUT -d 10.0.1.102 -j DROP # Wait for detection sleep 60 # Verify single primary curl -s http://localhost:18180/v1/admin/cluster/status # Restore network iptables -D INPUT -s 10.0.1.102 -j DROP iptables -D OUTPUT -d 10.0.1.102 -j DROP ``` **3. Document primary election priority:** Configure explicit priority for deterministic elections: ```toml [cluster] election_priority = 100 # Higher on preferred primary ``` ## Escalation **Escalate immediately if:** - Split brain lasts >5 minutes (data divergence growing) - Unable to identify winning primary (data loss unavoidable) - Network partition affects >50% of cluster - Split brain recurs after resolution (systemic issue) **Escalation path:** 1. **Primary on-call:** Cluster SRE 2. **Secondary:** Distributed systems architect 3. **Final escalation:** CTO + VP Engineering (customer-facing impact) ## References - **Dashboard:** [StemeDB Cluster Health](http://grafana.example.com/d/stemedb-cluster) - **Related alerts:** `GossipUnreachable`, `PrimaryElectionFailed`, `HighReplicationLag` - **Metrics:** - `stemedb_cluster_is_primary` (0 or 1 per node) - `stemedb_swim_unreachable_members` (network health) - `stemedb_clock_offset_seconds` (time sync) - **Runbooks:** `high-replication-lag.md`, `merge-diverged-clusters.md`