# Cluster Split Brain

## Severity: CRITICAL

## Alert Rule

**Alert:** `ClusterSplitBrain`
**Trigger:** Multiple nodes claim to be primary
**Duration:** 1m

## Symptom

- Metrics show `stemedb_cluster_primary_count > 1`
- Logs contain "primary election conflict" or "multiple primaries detected"
- Different clients see different primary nodes
- Assertion IDs from different primaries appear for the same timestamp
- SWIM gossip reports conflicting cluster state

## Impact

**User Impact:**

- Writes may be accepted by multiple primaries → data divergence
- Queries return different results depending on routing
- Inconsistent state across cluster (violates linearizability)

**System Impact:**

- Data loss when resolving split (one primary's writes discarded)
- Manual intervention required to merge diverged state
- Cluster trust degraded (reputation impact)

## Investigation Steps

### 1. Identify All Nodes Claiming Primary

```bash
# Query each node's role
for node in node1 node2 node3; do
  echo "=== $node ==="
  curl -s "http://$node:18180/v1/admin/cluster/status" | jq '.role'
done
```

Expected: Exactly one node should return `"primary"`.

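The role check above can be reduced to a list of nodes that claim primary. This is a minimal sketch, not part of StemeDB: `collect_roles` and `list_primaries` are hypothetical helper names, and the sketch assumes the status endpoint returns a JSON object with a `role` field, as the `jq '.role'` call above implies.

```bash
# list_primaries: reads "node role" pairs on stdin and prints each node
# whose role is "primary".
list_primaries() {
  awk '$2 == "primary" { print $1 }'
}

# collect_roles NODE...: hypothetical helper that queries each node and prints
# "node role" lines. A node that does not answer is reported as "unreachable".
collect_roles() {
  local node role
  for node in "$@"; do
    role=$(curl -s --max-time 5 "http://$node:18180/v1/admin/cluster/status" | jq -r '.role') || role=""
    printf '%s %s\n' "$node" "${role:-unreachable}"
  done
}
```

Running `collect_roles node1 node2 node3 | list_primaries` should print exactly one line in a healthy cluster; more than one line confirms the split.
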
### 2. Check SWIM Gossip State

```bash
# Get cluster membership from each node
for node in node1 node2 node3; do
  echo "=== $node ==="
  curl -s "http://$node:18180/v1/admin/cluster/members" | jq '.members[] | {id, role, health}'
done
```

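During a split, the membership views returned by different nodes usually disagree. One way to spot that quickly is to save each node's view to a file and compare them. The sketch below assumes each file holds one member per line; `views_agree` is a hypothetical helper name, not a StemeDB command.

```bash
# views_agree FILE...: each FILE holds one node's membership view, one member
# per line. Returns 0 if every view matches the first one (line order is
# ignored); otherwise prints the files that differ and returns 1.
views_agree() {
  local reference=$1 file rc=0
  shift
  for file in "$@"; do
    if [ "$(sort "$reference")" != "$(sort "$file")" ]; then
      echo "view differs: $reference vs $file"
      rc=1
    fi
  done
  return "$rc"
}

# Example (not run here): write one line per member for each node, then compare.
#   curl -s "http://node1:18180/v1/admin/cluster/members" \
#     | jq -r '.members[] | "\(.id) \(.role) \(.health)"' > /tmp/members-node1.txt
#   views_agree /tmp/members-node1.txt /tmp/members-node2.txt /tmp/members-node3.txt
```
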
### 3. Check Network Partition

```bash
# Test connectivity between every ordered pair of nodes
for src in node1 node2 node3; do
  for dst in node1 node2 node3; do
    [[ "$src" == "$dst" ]] && continue
    echo "$src → $dst:"
    ssh "$src" "timeout 2 nc -zv $dst 18182 2>&1 | tail -1"
  done
done
```

### 4. Review Election Logs

```bash
# Check when each node became primary
for node in node1 node2 node3; do
  echo "=== $node ==="
  ssh "$node" "journalctl -u stemedb-api | grep 'elected primary' | tail -5"
done
```

## Resolution

### Immediate Mitigation: Force Single Primary

**WARNING:** Demoting a primary discards the writes that only it accepted. Keep the primary with the most recent data.

**1. Identify primary with latest data:**

```bash
# Compare indexed assertion counts (a proxy for which primary holds the most data)
for node in node1 node2 node3; do
  echo "$node:"
  curl -s "http://$node:18180/metrics" | grep assertions_indexed_total
done
```

Choose the node with the highest count.

**2. Demote other primaries to replica:**

```bash
# Run once per conflicting primary, with $node set to that node's hostname.
# Do not run this against the primary you are keeping.
curl -X POST "http://$node:18180/v1/admin/cluster/demote" \
  -H 'Content-Type: application/json' \
  -d '{"force": true}'
```

**3. Verify single primary:**

```bash
for node in node1 node2 node3; do
  curl -s "http://$node:18180/v1/admin/cluster/status" | jq '.role'
done
```

Expected: One `"primary"`, all others `"replica"`.

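The comparison in step 1 can be scripted so the choice does not depend on reading numbers by eye under pressure. The sketch below is an illustration only: `indexed_count` and `pick_highest` are hypothetical helper names, and it assumes the metrics endpoint serves the Prometheus text format with a single sample whose name contains `assertions_indexed_total`.

```bash
# indexed_count: reads Prometheus text-format metrics on stdin and prints the
# value of the first non-comment sample whose name contains
# assertions_indexed_total (assumes no trailing timestamp on the sample line).
indexed_count() {
  awk '!/^#/ && /assertions_indexed_total/ { print $NF; exit }'
}

# pick_highest: reads "node count" lines on stdin and prints the node with the
# highest count. On a tie the first node listed wins, so review ties by hand.
pick_highest() {
  awk 'NR == 1 || $2 + 0 > best { best = $2 + 0; node = $1 }
       END { if (NR > 0) print node }'
}

# Example (not run here): feed only the nodes that currently claim primary.
#   for node in node1 node2; do
#     echo "$node $(curl -s "http://$node:18180/metrics" | indexed_count)"
#   done | pick_highest
```

A higher count is only a proxy for "most recent data"; if the counts are close, compare the election log timestamps from the investigation steps before demoting anything.
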
### Root Cause Resolution

**If Network Partition Detected:**

**1. Restore network connectivity:**

```bash
# Check firewall rules
iptables -L -n | grep 18182

# Check routing
ip route show
```

**2. Verify SWIM gossip recovery:**

```bash
# Watch gossip convergence (the jq filter is quoted so the shell does not glob the brackets)
watch -n 2 'curl -s http://node1:18180/v1/admin/cluster/members | jq ".members[].health"'
```

**If Split Caused by Clock Skew:**

**1. Check time drift:**

```bash
for node in node1 node2 node3; do
  echo "$node: $(ssh "$node" date +%s)"
done
```

**2. Sync clocks:**

```bash
# Restart NTP
for node in node1 node2 node3; do
  ssh "$node" "systemctl restart chronyd && chronyc makestep"
done
```

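To turn the per-node timestamps from the drift check into a single number, the spread between the fastest and slowest clock can be computed directly. This is a sketch with a hypothetical helper name (`max_skew`); it expects `node epoch_seconds` pairs, so drop the colon from the `echo` format above when feeding it.

```bash
# max_skew: reads "node epoch_seconds" lines on stdin and prints the spread
# (largest minus smallest timestamp) in seconds.
max_skew() {
  awk '{ v = $2 + 0 }
       NR == 1 { min = max = v }
       v < min { min = v }
       v > max { max = v }
       END { if (NR > 0) print max - min }'
}

# Example (not run here):
#   for node in node1 node2 node3; do
#     echo "$node $(ssh "$node" date +%s)"
#   done | max_skew
```

Because the nodes are queried one after another over SSH, a spread of a second or two can be measurement noise rather than real skew. For a precise per-node offset, `chronyc tracking` on each node is the better source.
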
**If Split Caused by SWIM Bug:**

**1. Restart SWIM membership service:**

```bash
# On each node
curl -X POST http://localhost:18180/v1/admin/cluster/restart-gossip
```

**2. If restart fails, force cluster reset:**

```bash
# On primary only
curl -X POST http://localhost:18180/v1/admin/cluster/reinit \
  -H 'Content-Type: application/json' \
  -d '{"bootstrap": true}'

# On each replica (node1 is the primary in this example)
curl -X POST http://localhost:18180/v1/admin/cluster/join \
  -H 'Content-Type: application/json' \
  -d '{"primary_address": "node1:18182"}'
```

### Data Reconciliation After Split

**1. Compare data divergence:**

```bash
# Get Merkle tree diff between primaries
curl -X POST http://node1:18180/v1/admin/cluster/merkle-diff \
  -H 'Content-Type: application/json' \
  -d '{"other_node": "node2"}'
```

**2. If divergence is small (<100 assertions), manual merge:**

```bash
# Export assertions from demoted primary
# (replace <split_timestamp> with the time the split began)
curl -s http://node2:18180/v1/admin/export-assertions \
  -H 'Content-Type: application/json' \
  --data '{"since": <split_timestamp>}' \
  > /tmp/node2-assertions.jsonl

# Import into winning primary
curl -X POST http://node1:18180/v1/admin/import-assertions \
  --data-binary @/tmp/node2-assertions.jsonl
```

**3. If divergence is large, escalate for manual resolution:**

See `docs/operations/runbooks/merge-diverged-clusters.md`.

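The choice between steps 2 and 3 depends on how many assertions diverged. Once the export from step 2 exists, the count can be checked before importing anything. This sketch assumes the export is JSON Lines with one assertion per line (suggested by the `.jsonl` file name, but not confirmed here); `merge_strategy` is a hypothetical helper name.

```bash
# Threshold from the runbook: fewer than 100 diverged assertions can be merged by hand.
MANUAL_MERGE_LIMIT=100

# merge_strategy FILE: counts the non-empty lines in an exported file (assumed
# to hold one assertion per line) and prints the suggested path.
merge_strategy() {
  local file=$1 count
  if [ ! -r "$file" ]; then
    echo "cannot read $file" >&2
    return 2
  fi
  count=$(grep -c . "$file" || true)
  if [ "$count" -lt "$MANUAL_MERGE_LIMIT" ]; then
    echo "manual-merge ($count assertions)"
  else
    echo "escalate ($count assertions)"
  fi
}
```

For example, `merge_strategy /tmp/node2-assertions.jsonl` prints `manual-merge (N assertions)` when the export is below the threshold and `escalate (N assertions)` otherwise.
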
## Prevention

### Monitoring and Alerting

**1. Alert on primary count:**

```yaml
- alert: MultiplePrimaries
  expr: sum(stemedb_cluster_is_primary) > 1
  for: 1m
  annotations:
    summary: "Split brain detected: multiple primaries"
```

**2. Monitor SWIM gossip health:**

```yaml
- alert: GossipUnreachable
  expr: stemedb_swim_unreachable_members > 0
  for: 2m
  annotations:
    summary: "SWIM gossip detecting unreachable members"
```

**3. Alert on clock skew:**

```yaml
- alert: ClockSkewDetected
  expr: abs(stemedb_clock_offset_seconds) > 1
  for: 5m
  annotations:
    summary: "Clock skew exceeds 1 second"
```

### Capacity Planning

**1. Deploy nodes across failure domains:**

- Different racks (power/network isolation)
- Different availability zones (cloud deployments)

**2. Use dedicated network for cluster gossip:**

```toml
# /etc/stemedb/api.toml
[cluster]
gossip_bind_address = "10.0.1.100:18183" # Private network
```

**3. Tune SWIM timeouts to the network:**

```toml
[cluster.swim]
suspicion_timeout_ms = 5000
probe_interval_ms = 1000
probe_timeout_ms = 500
```

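A quick sanity check on the SWIM values above can catch a misconfiguration before it is deployed. The sketch below rests on a simplified model that may not match StemeDB's implementation: it assumes a probe must time out within one probe interval, and that a member is declared dead once a probe times out and the suspicion timeout then expires (no indirect probes, no refutation). `check_swim_timeouts` is a hypothetical helper name.

```bash
# check_swim_timeouts PROBE_INTERVAL_MS PROBE_TIMEOUT_MS SUSPICION_TIMEOUT_MS
# Fails if the probe timeout does not fit inside one probe interval. Otherwise
# prints the time in milliseconds from a failed probe being sent to the member
# being declared dead, under the simplified model described above.
check_swim_timeouts() {
  local interval=$1 timeout=$2 suspicion=$3
  if [ "$timeout" -ge "$interval" ]; then
    echo "probe_timeout_ms ($timeout) must be below probe_interval_ms ($interval)" >&2
    return 1
  fi
  echo $((timeout + suspicion))
}
```

With the values shown above, `check_swim_timeouts 1000 500 5000` prints `5500`. Under these assumptions a failed member is confirmed dead about 5.5 seconds after it is first probed, which is well inside the 1m alert duration.
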
### Operational Best Practices

**1. Regular cluster health checks:**

```bash
# Daily validation
curl -s http://localhost:18180/v1/admin/cluster/validate | jq '{
  primary_count: .primaries,
  replica_count: .replicas,
  unreachable: .unreachable
}'
```

**2. Test network partitions in staging:**

```bash
# Simulate partition with iptables
iptables -A INPUT -s 10.0.1.102 -j DROP
iptables -A OUTPUT -d 10.0.1.102 -j DROP

# Wait for detection
sleep 60

# Verify single primary
curl -s http://localhost:18180/v1/admin/cluster/status

# Restore network
iptables -D INPUT -s 10.0.1.102 -j DROP
iptables -D OUTPUT -d 10.0.1.102 -j DROP
```

**3. Document primary election priority:**

Configure explicit priority for deterministic elections:

```toml
[cluster]
election_priority = 100 # Higher on preferred primary
```

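The staging partition test in step 2 leaves the DROP rules in place if the script stops between the "simulate" and "restore" commands. One way to reduce that risk is to wrap the test so the rules are removed whether or not the check succeeds. This is a sketch, not a StemeDB tool: `partition_rules`, `with_partition`, and the `IPTABLES` override variable are hypothetical names.

```bash
# Command used to edit firewall rules; set IPTABLES=echo for a dry run.
IPTABLES=${IPTABLES:-iptables}

# partition_rules ACTION PEER: applies (-A) or removes (-D) DROP rules for PEER
# in both directions. Both rules are attempted even if the first one fails.
partition_rules() {
  local action=$1 peer=$2 rc=0
  $IPTABLES "$action" INPUT -s "$peer" -j DROP || rc=1
  $IPTABLES "$action" OUTPUT -d "$peer" -j DROP || rc=1
  return "$rc"
}

# with_partition PEER COMMAND...: isolates PEER, runs COMMAND, removes the
# rules whether or not COMMAND succeeded, and returns COMMAND's exit status.
# A signal that kills the shell still skips the cleanup.
with_partition() {
  local peer=$1 status=0
  shift
  if ! partition_rules -A "$peer"; then
    partition_rules -D "$peer"
    return 1
  fi
  "$@" || status=$?
  partition_rules -D "$peer"
  return "$status"
}

# Example (not run here, requires root):
#   with_partition 10.0.1.102 sh -c \
#     'sleep 60; curl -s http://localhost:18180/v1/admin/cluster/status'
```

Setting `IPTABLES=echo` before calling `with_partition` prints the four rule changes without touching the firewall, which is a cheap way to review the test before running it in staging.
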
## Escalation

**Escalate immediately if:**

- Split brain lasts >5 minutes (data divergence growing)
- Unable to identify winning primary (data loss unavoidable)
- Network partition affects >50% of cluster
- Split brain recurs after resolution (systemic issue)

**Escalation path:**

1. **Primary on-call:** Cluster SRE
2. **Secondary:** Distributed systems architect
3. **Final escalation:** CTO + VP Engineering (customer-facing impact)

## References

- **Dashboard:** [StemeDB Cluster Health](http://grafana.example.com/d/stemedb-cluster)
- **Related alerts:** `GossipUnreachable`, `PrimaryElectionFailed`, `HighReplicationLag`
- **Metrics:**
  - `stemedb_cluster_is_primary` (0 or 1 per node)
  - `stemedb_swim_unreachable_members` (network health)
  - `stemedb_clock_offset_seconds` (time sync)
- **Runbooks:** `high-replication-lag.md`, `merge-diverged-clusters.md`