stemedb/docs/operations/runbooks/split-brain.md
jml 3e7eddc074 feat: add enterprise production readiness infrastructure
This commit implements comprehensive production hardening across multiple
layers to prepare StemeDB for enterprise pilot deployments:

## API Layer
- Add rate limiting middleware with configurable limits per endpoint
- Enhance error handling with detailed context and proper HTTP status codes
- Add security hardening tests for input validation and boundary conditions
- Create store_helpers module for defensive storage access patterns

## Storage & WAL
- Optimize group commit batching for higher throughput
- Add defensive error handling in hybrid backend with proper fallbacks
- Enhance WAL journal durability guarantees with fsync validation
- Improve index store query performance with better caching

## Operations & Deployment
- Add comprehensive operations documentation (deployment, monitoring, DR)
- Create systemd units for backup, WAL archival, and verification
- Add monitoring configs (Prometheus alerts, metrics exporters)
- Implement backup/restore scripts with verification and S3 archival
- Add DR drill automation and runbook procedures
- Create load balancer configs (nginx, envoy) with health checks

## Documentation
- Update CLAUDE.md with operations and troubleshooting guides
- Expand roadmap with production readiness milestones
- Add pilot success criteria and deployment reference architecture
- Document TLS setup, monitoring integration, and incident response

## Configuration
- Add .env.example with all required environment variables
- Document resource sizing for different deployment scales
- Add configuration examples for various deployment topologies

This positions StemeDB for successful enterprise pilots with proper
operational discipline, monitoring, backup/DR, and security hardening.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-12 06:08:15 +00:00

7.2 KiB

Cluster Split Brain

Severity: CRITICAL

Alert Rule

Alert: ClusterSplitBrain Trigger: Multiple nodes claim to be primary Duration: 1m

Symptom

  • Metrics show stemedb_cluster_primary_count > 1
  • Logs contain "primary election conflict" or "multiple primaries detected"
  • Different clients see different primary nodes
  • Assertion IDs from different primaries for same timestamp
  • SWIM gossip reports conflicting cluster state

Impact

User Impact:

  • Writes may be accepted by multiple primaries → data divergence
  • Queries return different results depending on routing
  • Inconsistent state across cluster (violates linearizability)

System Impact:

  • Data loss when resolving split (one primary's writes discarded)
  • Manual intervention required to merge diverged state
  • Cluster trust degraded (reputation impact)

Investigation Steps

1. Identify All Nodes Claiming Primary

# Query each node's role
for node in node1 node2 node3; do
  echo "=== $node ==="
  curl -s http://$node:18180/v1/admin/cluster/status | jq '.role'
done

Expected: Exactly one node should return "primary".

2. Check SWIM Gossip State

# Get cluster membership from each node
for node in node1 node2 node3; do
  echo "=== $node ==="
  curl -s http://$node:18180/v1/admin/cluster/members | jq '.members[] | {id, role, health}'
done

3. Check Network Partition

# Test connectivity between nodes
for src in node1 node2 node3; do
  for dst in node1 node2 node3; do
    [[ $src == $dst ]] && continue
    echo "$src$dst:"
    ssh $src "timeout 2 nc -zv $dst 18182 2>&1 | tail -1"
  done
done

4. Review Election Logs

# Check when each node became primary
for node in node1 node2 node3; do
  echo "=== $node ==="
  ssh $node "journalctl -u stemedb-api | grep 'elected primary' | tail -5"
done

Resolution

Immediate Mitigation: Force Single Primary

WARNING: This will cause writes to one node to be discarded. Choose the node with the most recent data.

1. Identify primary with latest data:

# Compare latest assertion timestamps
for node in node1 node2 node3; do
  echo "$node:"
  curl -s http://$node:18180/metrics | grep assertions_indexed_total
done

Choose node with highest count.

2. Demote other primaries to replica:

# On each conflicting primary:
curl -X POST http://$node:18180/v1/admin/cluster/demote \
  -H 'Content-Type: application/json' \
  -d '{"force": true}'

3. Verify single primary:

for node in node1 node2 node3; do
  curl -s http://$node:18180/v1/admin/cluster/status | jq '.role'
done

Expected: One "primary", all others "replica".

Root Cause Resolution

If Network Partition Detected:

1. Restore network connectivity:

# Check firewall rules
iptables -L -n | grep 18182

# Check routing
ip route show

2. Verify SWIM gossip recovery:

# Watch gossip convergence
watch -n 2 'curl -s http://node1:18180/v1/admin/cluster/members | jq .members[].health'

If Split Caused by Clock Skew:

1. Check time drift:

for node in node1 node2 node3; do
  echo "$node: $(ssh $node date +%s)"
done

2. Sync clocks:

# Restart NTP
for node in node1 node2 node3; do
  ssh $node "systemctl restart chronyd && chronyc makestep"
done

If Split Caused by SWIM Bug:

1. Restart SWIM membership service:

# On each node
curl -X POST http://localhost:18180/v1/admin/cluster/restart-gossip

2. If restart fails, force cluster reset:

# On primary only
curl -X POST http://localhost:18180/v1/admin/cluster/reinit \
  -d '{"bootstrap": true}'

# On replicas
curl -X POST http://localhost:18180/v1/admin/cluster/join \
  -d '{"primary_address": "node1:18182"}'

Data Reconciliation After Split

1. Compare data divergence:

# Get Merkle tree diff between primaries
curl -X POST http://node1:18180/v1/admin/cluster/merkle-diff \
  -d '{"other_node": "node2"}'

2. If divergence is small (<100 assertions), manual merge:

# Export assertions from demoted primary
curl -s http://node2:18180/v1/admin/export-assertions \
  --data '{"since": <split_timestamp>}' \
  > /tmp/node2-assertions.jsonl

# Import into winning primary
curl -X POST http://node1:18180/v1/admin/import-assertions \
  --data-binary @/tmp/node2-assertions.jsonl

3. If divergence is large, escalate for manual resolution:

See docs/operations/runbooks/merge-diverged-clusters.md.

Prevention

Monitoring and Alerting

1. Alert on primary count:

- alert: MultiplePrimaries
  expr: sum(stemedb_cluster_is_primary) > 1
  for: 1m
  annotations:
    summary: "Split brain detected: multiple primaries"

2. Monitor SWIM gossip health:

- alert: GossipUnreachable
  expr: stemedb_swim_unreachable_members > 0
  for: 2m
  annotations:
    summary: "SWIM gossip detecting unreachable members"

3. Alert on clock skew:

- alert: ClockSkewDetected
  expr: abs(stemedb_clock_offset_seconds) > 1
  for: 5m
  annotations:
    summary: "Clock skew exceeds 1 second"

Capacity Planning

1. Deploy nodes across failure domains:

  • Different racks (power/network isolation)
  • Different availability zones (cloud deployments)

2. Use dedicated network for cluster gossip:

# /etc/stemedb/api.toml
[cluster]
gossip_bind_address = "10.0.1.100:18183"  # Private network

3. Configure SWIM timeouts for network:

[cluster.swim]
suspicion_timeout_ms = 5000
probe_interval_ms = 1000
probe_timeout_ms = 500

Operational Best Practices

1. Regular cluster health checks:

# Daily validation
curl -s http://localhost:18180/v1/admin/cluster/validate | jq '{
  primary_count: .primaries,
  replica_count: .replicas,
  unreachable: .unreachable
}'

2. Test network partitions in staging:

# Simulate partition with iptables
iptables -A INPUT -s 10.0.1.102 -j DROP
iptables -A OUTPUT -d 10.0.1.102 -j DROP

# Wait for detection
sleep 60

# Verify single primary
curl -s http://localhost:18180/v1/admin/cluster/status

# Restore network
iptables -D INPUT -s 10.0.1.102 -j DROP
iptables -D OUTPUT -d 10.0.1.102 -j DROP

3. Document primary election priority:

Configure explicit priority for deterministic elections:

[cluster]
election_priority = 100  # Higher on preferred primary

Escalation

Escalate immediately if:

  • Split brain lasts >5 minutes (data divergence growing)
  • Unable to identify winning primary (data loss unavoidable)
  • Network partition affects >50% of cluster
  • Split brain recurs after resolution (systemic issue)

Escalation path:

  1. Primary on-call: Cluster SRE
  2. Secondary: Distributed systems architect
  3. Final escalation: CTO + VP Engineering (customer-facing impact)

References

  • Dashboard: StemeDB Cluster Health
  • Related alerts: GossipUnreachable, PrimaryElectionFailed, HighReplicationLag
  • Metrics:
    • stemedb_cluster_is_primary (0 or 1 per node)
    • stemedb_swim_unreachable_members (network health)
    • stemedb_clock_offset_seconds (time sync)
  • Runbooks: high-replication-lag.md, merge-diverged-clusters.md