stemedb/docs/operations/runbooks/split-brain.md
jml 3e7eddc074 feat: add enterprise production readiness infrastructure
This commit implements comprehensive production hardening across multiple
layers to prepare StemeDB for enterprise pilot deployments:

## API Layer
- Add rate limiting middleware with configurable limits per endpoint
- Enhance error handling with detailed context and proper HTTP status codes
- Add security hardening tests for input validation and boundary conditions
- Create store_helpers module for defensive storage access patterns

## Storage & WAL
- Optimize group commit batching for higher throughput
- Add defensive error handling in hybrid backend with proper fallbacks
- Enhance WAL journal durability guarantees with fsync validation
- Improve index store query performance with better caching

## Operations & Deployment
- Add comprehensive operations documentation (deployment, monitoring, DR)
- Create systemd units for backup, WAL archival, and verification
- Add monitoring configs (Prometheus alerts, metrics exporters)
- Implement backup/restore scripts with verification and S3 archival
- Add DR drill automation and runbook procedures
- Create load balancer configs (nginx, envoy) with health checks

## Documentation
- Update CLAUDE.md with operations and troubleshooting guides
- Expand roadmap with production readiness milestones
- Add pilot success criteria and deployment reference architecture
- Document TLS setup, monitoring integration, and incident response

## Configuration
- Add .env.example with all required environment variables
- Document resource sizing for different deployment scales
- Add configuration examples for various deployment topologies

This positions StemeDB for successful enterprise pilots with proper
operational discipline, monitoring, backup/DR, and security hardening.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-12 06:08:15 +00:00

325 lines
7.2 KiB
Markdown

# Cluster Split Brain
## Severity: CRITICAL
## Alert Rule
**Alert:** `ClusterSplitBrain`
**Trigger:** Multiple nodes claim to be primary
**Duration:** 1m
## Symptom
- Metrics show `stemedb_cluster_primary_count > 1`
- Logs contain "primary election conflict" or "multiple primaries detected"
- Different clients see different primary nodes
- Assertion IDs from different primaries for same timestamp
- SWIM gossip reports conflicting cluster state
## Impact
**User Impact:**
- Writes may be accepted by multiple primaries → data divergence
- Queries return different results depending on routing
- Inconsistent state across cluster (violates linearizability)
**System Impact:**
- Data loss when resolving split (one primary's writes discarded)
- Manual intervention required to merge diverged state
- Cluster trust degraded (reputation impact)
## Investigation Steps
### 1. Identify All Nodes Claiming Primary
```bash
# Query each node's role
for node in node1 node2 node3; do
echo "=== $node ==="
curl -s http://$node:18180/v1/admin/cluster/status | jq '.role'
done
```
Expected: Exactly one node should return `"primary"`.
### 2. Check SWIM Gossip State
```bash
# Get cluster membership from each node
for node in node1 node2 node3; do
echo "=== $node ==="
curl -s http://$node:18180/v1/admin/cluster/members | jq '.members[] | {id, role, health}'
done
```
### 3. Check Network Partition
```bash
# Test connectivity between nodes
for src in node1 node2 node3; do
for dst in node1 node2 node3; do
[[ $src == $dst ]] && continue
echo "$src$dst:"
ssh $src "timeout 2 nc -zv $dst 18182 2>&1 | tail -1"
done
done
```
### 4. Review Election Logs
```bash
# Check when each node became primary
for node in node1 node2 node3; do
echo "=== $node ==="
ssh $node "journalctl -u stemedb-api | grep 'elected primary' | tail -5"
done
```
## Resolution
### Immediate Mitigation: Force Single Primary
**WARNING:** This will cause writes to one node to be discarded. Choose the node with the most recent data.
**1. Identify primary with latest data:**
```bash
# Compare latest assertion timestamps
for node in node1 node2 node3; do
echo "$node:"
curl -s http://$node:18180/metrics | grep assertions_indexed_total
done
```
Choose node with highest count.
**2. Demote other primaries to replica:**
```bash
# On each conflicting primary:
curl -X POST http://$node:18180/v1/admin/cluster/demote \
-H 'Content-Type: application/json' \
-d '{"force": true}'
```
**3. Verify single primary:**
```bash
for node in node1 node2 node3; do
curl -s http://$node:18180/v1/admin/cluster/status | jq '.role'
done
```
Expected: One `"primary"`, all others `"replica"`.
### Root Cause Resolution
**If Network Partition Detected:**
**1. Restore network connectivity:**
```bash
# Check firewall rules
iptables -L -n | grep 18182
# Check routing
ip route show
```
**2. Verify SWIM gossip recovery:**
```bash
# Watch gossip convergence
watch -n 2 'curl -s http://node1:18180/v1/admin/cluster/members | jq .members[].health'
```
**If Split Caused by Clock Skew:**
**1. Check time drift:**
```bash
for node in node1 node2 node3; do
echo "$node: $(ssh $node date +%s)"
done
```
**2. Sync clocks:**
```bash
# Restart NTP
for node in node1 node2 node3; do
ssh $node "systemctl restart chronyd && chronyc makestep"
done
```
**If Split Caused by SWIM Bug:**
**1. Restart SWIM membership service:**
```bash
# On each node
curl -X POST http://localhost:18180/v1/admin/cluster/restart-gossip
```
**2. If restart fails, force cluster reset:**
```bash
# On primary only
curl -X POST http://localhost:18180/v1/admin/cluster/reinit \
-d '{"bootstrap": true}'
# On replicas
curl -X POST http://localhost:18180/v1/admin/cluster/join \
-d '{"primary_address": "node1:18182"}'
```
### Data Reconciliation After Split
**1. Compare data divergence:**
```bash
# Get Merkle tree diff between primaries
curl -X POST http://node1:18180/v1/admin/cluster/merkle-diff \
-d '{"other_node": "node2"}'
```
**2. If divergence is small (<100 assertions), manual merge:**
```bash
# Export assertions from demoted primary
curl -s http://node2:18180/v1/admin/export-assertions \
--data '{"since": <split_timestamp>}' \
> /tmp/node2-assertions.jsonl
# Import into winning primary
curl -X POST http://node1:18180/v1/admin/import-assertions \
--data-binary @/tmp/node2-assertions.jsonl
```
**3. If divergence is large, escalate for manual resolution:**
See `docs/operations/runbooks/merge-diverged-clusters.md`.
## Prevention
### Monitoring and Alerting
**1. Alert on primary count:**
```yaml
- alert: MultiplePrimaries
expr: sum(stemedb_cluster_is_primary) > 1
for: 1m
annotations:
summary: "Split brain detected: multiple primaries"
```
**2. Monitor SWIM gossip health:**
```yaml
- alert: GossipUnreachable
expr: stemedb_swim_unreachable_members > 0
for: 2m
annotations:
summary: "SWIM gossip detecting unreachable members"
```
**3. Alert on clock skew:**
```yaml
- alert: ClockSkewDetected
expr: abs(stemedb_clock_offset_seconds) > 1
for: 5m
annotations:
summary: "Clock skew exceeds 1 second"
```
### Capacity Planning
**1. Deploy nodes across failure domains:**
- Different racks (power/network isolation)
- Different availability zones (cloud deployments)
**2. Use dedicated network for cluster gossip:**
```toml
# /etc/stemedb/api.toml
[cluster]
gossip_bind_address = "10.0.1.100:18183" # Private network
```
**3. Configure SWIM timeouts for network:**
```toml
[cluster.swim]
suspicion_timeout_ms = 5000
probe_interval_ms = 1000
probe_timeout_ms = 500
```
### Operational Best Practices
**1. Regular cluster health checks:**
```bash
# Daily validation
curl -s http://localhost:18180/v1/admin/cluster/validate | jq '{
primary_count: .primaries,
replica_count: .replicas,
unreachable: .unreachable
}'
```
**2. Test network partitions in staging:**
```bash
# Simulate partition with iptables
iptables -A INPUT -s 10.0.1.102 -j DROP
iptables -A OUTPUT -d 10.0.1.102 -j DROP
# Wait for detection
sleep 60
# Verify single primary
curl -s http://localhost:18180/v1/admin/cluster/status
# Restore network
iptables -D INPUT -s 10.0.1.102 -j DROP
iptables -D OUTPUT -d 10.0.1.102 -j DROP
```
**3. Document primary election priority:**
Configure explicit priority for deterministic elections:
```toml
[cluster]
election_priority = 100 # Higher on preferred primary
```
## Escalation
**Escalate immediately if:**
- Split brain lasts >5 minutes (data divergence growing)
- Unable to identify winning primary (data loss unavoidable)
- Network partition affects >50% of cluster
- Split brain recurs after resolution (systemic issue)
**Escalation path:**
1. **Primary on-call:** Cluster SRE
2. **Secondary:** Distributed systems architect
3. **Final escalation:** CTO + VP Engineering (customer-facing impact)
## References
- **Dashboard:** [StemeDB Cluster Health](http://grafana.example.com/d/stemedb-cluster)
- **Related alerts:** `GossipUnreachable`, `PrimaryElectionFailed`, `HighReplicationLag`
- **Metrics:**
- `stemedb_cluster_is_primary` (0 or 1 per node)
- `stemedb_swim_unreachable_members` (network health)
- `stemedb_clock_offset_seconds` (time sync)
- **Runbooks:** `high-replication-lag.md`, `merge-diverged-clusters.md`