stemedb/docs/operations/runbooks/split-brain.md

# Cluster Split Brain

## Severity: CRITICAL

## Alert Rule

**Alert:** `ClusterSplitBrain`
**Trigger:** Multiple nodes claim to be primary
**Duration:** 1m

## Symptom

- Metrics show `stemedb_cluster_primary_count > 1`
- Logs contain "primary election conflict" or "multiple primaries detected"
- Different clients see different primary nodes
- Assertion IDs from different primaries for same timestamp
- SWIM gossip reports conflicting cluster state

## Impact

**User Impact:**
- Writes may be accepted by multiple primaries → data divergence
- Queries return different results depending on routing
- Inconsistent state across cluster (violates linearizability)

**System Impact:**
- Data loss when resolving split (one primary's writes discarded)
- Manual intervention required to merge diverged state
- Cluster trust degraded (reputation impact)

## Investigation Steps

### 1. Identify All Nodes Claiming Primary

```bash
# Query each node's role
for node in node1 node2 node3; do
  echo "=== $node ==="
  curl -s http://$node:18180/v1/admin/cluster/status | jq '.role'
done
```

Expected: Exactly one node should return `"primary"`.

### 2. Check SWIM Gossip State

```bash
# Get cluster membership from each node
for node in node1 node2 node3; do
  echo "=== $node ==="
  curl -s http://$node:18180/v1/admin/cluster/members | jq '.members[] | {id, role, health}'
done
```

### 3. Check Network Partition

```bash
# Test connectivity between nodes
for src in node1 node2 node3; do
  for dst in node1 node2 node3; do
    [[ $src == $dst ]] && continue
    echo "$src → $dst:"
    ssh $src "timeout 2 nc -zv $dst 18182 2>&1 | tail -1"
  done
done
```

### 4. Review Election Logs

```bash
# Check when each node became primary
for node in node1 node2 node3; do
  echo "=== $node ==="
  ssh $node "journalctl -u stemedb-api | grep 'elected primary' | tail -5"
done
```

## Resolution

### Immediate Mitigation: Force Single Primary

**WARNING:** This will cause writes to one node to be discarded. Choose the node with the most recent data.

**1. Identify primary with latest data:**

```bash
# Compare latest assertion timestamps
for node in node1 node2 node3; do
  echo "$node:"
  curl -s http://$node:18180/metrics | grep assertions_indexed_total
done
```

Choose node with highest count.

**2. Demote other primaries to replica:**

```bash
# On each conflicting primary:
curl -X POST http://$node:18180/v1/admin/cluster/demote \
  -H 'Content-Type: application/json' \
  -d '{"force": true}'
```

**3. Verify single primary:**

```bash
for node in node1 node2 node3; do
  curl -s http://$node:18180/v1/admin/cluster/status | jq '.role'
done
```

Expected: One `"primary"`, all others `"replica"`.

### Root Cause Resolution

**If Network Partition Detected:**

**1. Restore network connectivity:**

```bash
# Check firewall rules
iptables -L -n | grep 18182

# Check routing
ip route show
```

**2. Verify SWIM gossip recovery:**

```bash
# Watch gossip convergence
watch -n 2 'curl -s http://node1:18180/v1/admin/cluster/members | jq .members[].health'
```

**If Split Caused by Clock Skew:**

**1. Check time drift:**

```bash
for node in node1 node2 node3; do
  echo "$node: $(ssh $node date +%s)"
done
```

**2. Sync clocks:**

```bash
# Restart NTP
for node in node1 node2 node3; do
  ssh $node "systemctl restart chronyd && chronyc makestep"
done
```

**If Split Caused by SWIM Bug:**

**1. Restart SWIM membership service:**

```bash
# On each node
curl -X POST http://localhost:18180/v1/admin/cluster/restart-gossip
```

**2. If restart fails, force cluster reset:**

```bash
# On primary only
curl -X POST http://localhost:18180/v1/admin/cluster/reinit \
  -d '{"bootstrap": true}'

# On replicas
curl -X POST http://localhost:18180/v1/admin/cluster/join \
  -d '{"primary_address": "node1:18182"}'
```

### Data Reconciliation After Split

**1. Compare data divergence:**

```bash
# Get Merkle tree diff between primaries
curl -X POST http://node1:18180/v1/admin/cluster/merkle-diff \
  -d '{"other_node": "node2"}'
```

**2. If divergence is small (<100 assertions), manual merge:**

```bash
# Export assertions from demoted primary
curl -s http://node2:18180/v1/admin/export-assertions \
  --data '{"since": <split_timestamp>}' \
  > /tmp/node2-assertions.jsonl

# Import into winning primary
curl -X POST http://node1:18180/v1/admin/import-assertions \
  --data-binary @/tmp/node2-assertions.jsonl
```

**3. If divergence is large, escalate for manual resolution:**

See `docs/operations/runbooks/merge-diverged-clusters.md`.

## Prevention

### Monitoring and Alerting

**1. Alert on primary count:**

```yaml
- alert: MultiplePrimaries
  expr: sum(stemedb_cluster_is_primary) > 1
  for: 1m
  annotations:
    summary: "Split brain detected: multiple primaries"
```

**2. Monitor SWIM gossip health:**

```yaml
- alert: GossipUnreachable
  expr: stemedb_swim_unreachable_members > 0
  for: 2m
  annotations:
    summary: "SWIM gossip detecting unreachable members"
```

**3. Alert on clock skew:**

```yaml
- alert: ClockSkewDetected
  expr: abs(stemedb_clock_offset_seconds) > 1
  for: 5m
  annotations:
    summary: "Clock skew exceeds 1 second"
```

### Capacity Planning

**1. Deploy nodes across failure domains:**

- Different racks (power/network isolation)
- Different availability zones (cloud deployments)

**2. Use dedicated network for cluster gossip:**

```toml
# /etc/stemedb/api.toml
[cluster]
gossip_bind_address = "10.0.1.100:18183"  # Private network
```

**3. Configure SWIM timeouts for network:**

```toml
[cluster.swim]
suspicion_timeout_ms = 5000
probe_interval_ms = 1000
probe_timeout_ms = 500
```

### Operational Best Practices

**1. Regular cluster health checks:**

```bash
# Daily validation
curl -s http://localhost:18180/v1/admin/cluster/validate | jq '{
  primary_count: .primaries,
  replica_count: .replicas,
  unreachable: .unreachable
}'
```

**2. Test network partitions in staging:**

```bash
# Simulate partition with iptables
iptables -A INPUT -s 10.0.1.102 -j DROP
iptables -A OUTPUT -d 10.0.1.102 -j DROP

# Wait for detection
sleep 60

# Verify single primary
curl -s http://localhost:18180/v1/admin/cluster/status

# Restore network
iptables -D INPUT -s 10.0.1.102 -j DROP
iptables -D OUTPUT -d 10.0.1.102 -j DROP
```

**3. Document primary election priority:**

Configure explicit priority for deterministic elections:

```toml
[cluster]
election_priority = 100  # Higher on preferred primary
```

## Escalation

**Escalate immediately if:**

- Split brain lasts >5 minutes (data divergence growing)
- Unable to identify winning primary (data loss unavoidable)
- Network partition affects >50% of cluster
- Split brain recurs after resolution (systemic issue)

**Escalation path:**

1. **Primary on-call:** Cluster SRE
2. **Secondary:** Distributed systems architect
3. **Final escalation:** CTO + VP Engineering (customer-facing impact)

## References

- **Dashboard:** [StemeDB Cluster Health](http://grafana.example.com/d/stemedb-cluster)
- **Related alerts:** `GossipUnreachable`, `PrimaryElectionFailed`, `HighReplicationLag`
- **Metrics:**
  - `stemedb_cluster_is_primary` (0 or 1 per node)
  - `stemedb_swim_unreachable_members` (network health)
  - `stemedb_clock_offset_seconds` (time sync)
- **Runbooks:** `high-replication-lag.md`, `merge-diverged-clusters.md`