jml 3e7eddc074 feat: add enterprise production readiness infrastructure

This commit implements comprehensive production hardening across multiple
layers to prepare StemeDB for enterprise pilot deployments:

## API Layer
- Add rate limiting middleware with configurable limits per endpoint
- Enhance error handling with detailed context and proper HTTP status codes
- Add security hardening tests for input validation and boundary conditions
- Create store_helpers module for defensive storage access patterns

## Storage & WAL
- Optimize group commit batching for higher throughput
- Add defensive error handling in hybrid backend with proper fallbacks
- Enhance WAL journal durability guarantees with fsync validation
- Improve index store query performance with better caching

## Operations & Deployment
- Add comprehensive operations documentation (deployment, monitoring, DR)
- Create systemd units for backup, WAL archival, and verification
- Add monitoring configs (Prometheus alerts, metrics exporters)
- Implement backup/restore scripts with verification and S3 archival
- Add DR drill automation and runbook procedures
- Create load balancer configs (nginx, envoy) with health checks

## Documentation
- Update CLAUDE.md with operations and troubleshooting guides
- Expand roadmap with production readiness milestones
- Add pilot success criteria and deployment reference architecture
- Document TLS setup, monitoring integration, and incident response

## Configuration
- Add .env.example with all required environment variables
- Document resource sizing for different deployment scales
- Add configuration examples for various deployment topologies

This positions StemeDB for successful enterprise pilots with proper
operational discipline, monitoring, backup/DR, and security hardening.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

2026-02-12 06:08:15 +00:00


Cluster Split Brain

Severity: CRITICAL

Alert Rule

Alert: ClusterSplitBrain
Trigger: Multiple nodes claim to be primary
Duration: 1m

Symptom

Metrics show stemedb_cluster_primary_count > 1
Logs contain "primary election conflict" or "multiple primaries detected"
Different clients see different primary nodes
Different primaries issue assertion IDs for the same timestamp
SWIM gossip reports conflicting cluster state

Impact

User Impact:

Writes may be accepted by multiple primaries → data divergence
Queries return different results depending on routing
Inconsistent state across cluster (violates linearizability)

System Impact:

Data loss when resolving split (one primary's writes discarded)
Manual intervention required to merge diverged state
Cluster trust degraded (reputation impact)

Investigation Steps

1. Identify All Nodes Claiming Primary

# Query each node's role
for node in node1 node2 node3; do
  echo "=== $node ==="
  curl -s http://$node:18180/v1/admin/cluster/status | jq '.role'
done

Expected: Exactly one node should return "primary".
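
With more than a handful of nodes, reading the loop output by eye gets error-prone. The sketch below tallies the roles instead. `check_single_primary` is a hypothetical helper, not part of StemeDB, and it assumes the `.role` field returns the plain strings `primary` and `replica` as shown above.

```shell
# Hypothetical helper: reads "node role" lines on stdin, prints the nodes
# that report "primary", and returns 0 only when exactly one node does.
check_single_primary() {
  awk '
    $2 == "primary" { sep = (n++ ? " " : ""); list = list sep $1 }
    END {
      printf "primaries(%d): %s\n", n, list
      exit (n == 1 ? 0 : 1)
    }
  '
}

# Usage (assumes the status endpoint and jq filter from the loop above):
#   for node in node1 node2 node3; do
#     echo "$node $(curl -s "http://$node:18180/v1/admin/cluster/status" | jq -r '.role')"
#   done | check_single_primary || echo "split brain suspected"
```

A non-zero exit status makes the check easy to wire into a cron job or a CI smoke test.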

2. Check SWIM Gossip State

# Get cluster membership from each node
for node in node1 node2 node3; do
  echo "=== $node ==="
  curl -s http://$node:18180/v1/admin/cluster/members | jq '.members[] | {id, role, health}'
done

3. Check Network Partition

# Test connectivity between nodes
for src in node1 node2 node3; do
  for dst in node1 node2 node3; do
    [[ $src == $dst ]] && continue
    echo "$src → $dst:"
    ssh $src "timeout 2 nc -zv $dst 18182 2>&1 | tail -1"
  done
done

4. Review Election Logs

# Check when each node became primary
for node in node1 node2 node3; do
  echo "=== $node ==="
  ssh $node "journalctl -u stemedb-api | grep 'elected primary' | tail -5"
done

Resolution

Immediate Mitigation: Force Single Primary

WARNING: This will cause writes to one node to be discarded. Choose the node with the most recent data.

1. Identify primary with latest data:

# Compare indexed assertion counts (a proxy for which primary holds the most data)
for node in node1 node2 node3; do
  echo "$node:"
  curl -s http://$node:18180/metrics | grep assertions_indexed_total
done

Choose the node with the highest count; it is the closest available proxy for the most recent data.

2. Demote other primaries to replica:

# Run once per conflicting primary, with $node set to that host (e.g. node=node2):
curl -X POST "http://$node:18180/v1/admin/cluster/demote" \
  -H 'Content-Type: application/json' \
  -d '{"force": true}'

3. Verify single primary:

for node in node1 node2 node3; do
  curl -s http://$node:18180/v1/admin/cluster/status | jq '.role'
done

Expected: One "primary", all others "replica".
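
Step 1 above (choosing the surviving primary) can be scripted so the choice is not made by eye under pressure. This is a minimal sketch: `pick_winner` is a hypothetical helper, and it assumes you have already reduced each node's metrics output to a single `node count` pair.

```shell
# Hypothetical helper: reads "node count" lines on stdin and prints the node
# with the highest count. A tie is reported on stderr (exit 2) instead of
# being resolved silently, because the tie-break is a human decision.
pick_winner() {
  awk '
    NF < 2 { next }
    !seen || $2 + 0 > max { max = $2 + 0; winner = $1; tie = 0; seen = 1; next }
    $2 + 0 == max { tie = 1 }
    END {
      if (!seen) exit 1
      if (tie) { print "tie at " max > "/dev/stderr"; exit 2 }
      print winner
    }
  '
}

# Usage (assumes the metric value is the last field of the matching line):
#   for node in node1 node2 node3; do
#     echo "$node $(curl -s "http://$node:18180/metrics" \
#       | awk '/assertions_indexed_total/ && !/^#/ {print $NF; exit}')"
#   done | pick_winner
```

Every other node that still reports "primary" is then a candidate for the demote call in step 2.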

Root Cause Resolution

If Network Partition Detected:

1. Restore network connectivity:

# Check firewall rules
iptables -L -n | grep 18182

# Check routing
ip route show

2. Verify SWIM gossip recovery:

# Watch gossip convergence
watch -n 2 'curl -s http://node1:18180/v1/admin/cluster/members | jq ".members[].health"'

If Split Caused by Clock Skew:

1. Check time drift:

for node in node1 node2 node3; do
  echo "$node: $(ssh $node date +%s)"
done

2. Sync clocks:

# Restart chrony and step the clock immediately
for node in node1 node2 node3; do
  ssh $node "systemctl restart chronyd && chronyc makestep"
done
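
The drift check in step 1 prints raw epoch values, which still have to be compared by hand. The sketch below computes the spread directly. `max_clock_skew` is a hypothetical helper; the default threshold of 1 second mirrors the ClockSkewDetected alert in this runbook.

```shell
# Hypothetical helper: reads "node epoch_seconds" lines on stdin and prints the
# spread between the fastest and slowest clock. Returns 1 when the spread
# exceeds the threshold in seconds (first argument, default 1).
max_clock_skew() {
  threshold="${1:-1}"
  awk -v limit="$threshold" '
    NF < 2 { next }
    !seen { min = max = $2 + 0; seen = 1; next }
    { t = $2 + 0; if (t < min) min = t; if (t > max) max = t }
    END {
      skew = max - min
      printf "skew=%ds\n", skew
      exit (skew > limit ? 1 : 0)
    }
  '
}

# Usage:
#   for node in node1 node2 node3; do
#     echo "$node $(ssh "$node" date +%s)"
#   done | max_clock_skew 1
```

Because the nodes are queried one after another, SSH latency is included in the measured spread. Treat a small non-zero result as noise and confirm with chrony's own tooling before acting on it.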

If Split Caused by SWIM Bug:

1. Restart SWIM membership service:

# On each node
curl -X POST http://localhost:18180/v1/admin/cluster/restart-gossip

2. If restart fails, force cluster reset:

# On primary only
curl -X POST http://localhost:18180/v1/admin/cluster/reinit \
  -H 'Content-Type: application/json' \
  -d '{"bootstrap": true}'

# On replicas
curl -X POST http://localhost:18180/v1/admin/cluster/join \
  -H 'Content-Type: application/json' \
  -d '{"primary_address": "node1:18182"}'

Data Reconciliation After Split

1. Compare data divergence:

# Get Merkle tree diff between primaries
curl -X POST http://node1:18180/v1/admin/cluster/merkle-diff \
  -H 'Content-Type: application/json' \
  -d '{"other_node": "node2"}'

2. If divergence is small (<100 assertions), merge manually:

# Export assertions from demoted primary
# (replace <split_timestamp> with the time the split began)
curl -s http://node2:18180/v1/admin/export-assertions \
  -H 'Content-Type: application/json' \
  --data '{"since": <split_timestamp>}' \
  > /tmp/node2-assertions.jsonl

# Import into winning primary
curl -X POST http://node1:18180/v1/admin/import-assertions \
  --data-binary @/tmp/node2-assertions.jsonl

3. If divergence is large, escalate for manual resolution:

See docs/operations/runbooks/merge-diverged-clusters.md.
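
The choice between the manual merge and escalation comes down to one threshold, which can be encoded so that scripts and humans apply the same rule. This is a sketch only: `reconcile_action` is a hypothetical helper, the response shape of merkle-diff is not specified in this runbook, and the "none" outcome for zero divergence is an added assumption.

```shell
# Hypothetical helper: maps a count of diverged assertions to the
# reconciliation path described above. The default cutoff of 100 is the
# runbook's own; the count is whatever you derive from the merkle-diff output.
reconcile_action() {
  diverged="$1"
  limit="${2:-100}"
  case "$diverged" in
    ''|*[!0-9]*) echo "invalid count: '$diverged'" >&2; return 2 ;;
  esac
  if [ "$diverged" -eq 0 ]; then
    echo "none"            # assumption: nothing to reconcile
  elif [ "$diverged" -lt "$limit" ]; then
    echo "manual-merge"    # export from the demoted primary, import into the winner
  else
    echo "escalate"        # follow merge-diverged-clusters.md
  fi
}
```

For example, `reconcile_action 42` prints `manual-merge`, while `reconcile_action 100` prints `escalate`.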

Prevention

Monitoring and Alerting

1. Alert on primary count:

- alert: MultiplePrimaries
  expr: sum(stemedb_cluster_is_primary) > 1
  for: 1m
  annotations:
    summary: "Split brain detected: multiple primaries"

2. Monitor SWIM gossip health:

- alert: GossipUnreachable
  expr: stemedb_swim_unreachable_members > 0
  for: 2m
  annotations:
    summary: "SWIM gossip detecting unreachable members"

3. Alert on clock skew:

- alert: ClockSkewDetected
  expr: abs(stemedb_clock_offset_seconds) > 1
  for: 5m
  annotations:
    summary: "Clock skew exceeds 1 second"

Capacity Planning

1. Deploy nodes across failure domains:

Different racks (power/network isolation)
Different availability zones (cloud deployments)

2. Use dedicated network for cluster gossip:

# /etc/stemedb/api.toml
[cluster]
gossip_bind_address = "10.0.1.100:18183"  # Private network

3. Tune SWIM timeouts to the network's latency:

[cluster.swim]
suspicion_timeout_ms = 5000
probe_interval_ms = 1000
probe_timeout_ms = 500
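
To judge whether these timeouts suit a given network, it helps to estimate how long a crashed node can go unnoticed. The arithmetic below is a rough sketch under simplifying assumptions that may not match StemeDB's implementation: a peer probes the failed node on its next interval, marks it suspect when that single probe times out, and declares it dead once the suspicion timeout expires. Indirect probes and round-robin target selection would lengthen the real figure.

```shell
# Rough estimate in milliseconds, under the assumptions stated above:
#   detection ~= probe_interval + probe_timeout + suspicion_timeout
swim_detection_ms() {
  probe_interval_ms="$1"
  probe_timeout_ms="$2"
  suspicion_timeout_ms="$3"
  echo $((probe_interval_ms + probe_timeout_ms + suspicion_timeout_ms))
}

swim_detection_ms 1000 500 5000   # values from the config above; prints 6500
```

With the values shown, the estimate is about 6.5 seconds. Lowering `suspicion_timeout_ms` shortens detection but makes false positives more likely on a lossy or high-latency link.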

Operational Best Practices

1. Regular cluster health checks:

# Daily validation
curl -s http://localhost:18180/v1/admin/cluster/validate | jq '{
  primary_count: .primaries,
  replica_count: .replicas,
  unreachable: .unreachable
}'

2. Test network partitions in staging:

# Simulate partition with iptables
iptables -A INPUT -s 10.0.1.102 -j DROP
iptables -A OUTPUT -d 10.0.1.102 -j DROP

# Wait for detection
sleep 60

# Verify this node's role (repeat on each node to confirm a single primary)
curl -s http://localhost:18180/v1/admin/cluster/status | jq '.role'

# Restore network
iptables -D INPUT -s 10.0.1.102 -j DROP
iptables -D OUTPUT -d 10.0.1.102 -j DROP

3. Document primary election priority:

Configure explicit priority for deterministic elections:

[cluster]
election_priority = 100  # Higher on preferred primary
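
The staging partition test in item 2 leaves the DROP rules in place if the session is interrupted before the restore step. One way to guard against that is a wrapper that always removes the rules on exit. This is a sketch: `with_partition` and the `IPTABLES` override are hypothetical names introduced here, and the real commands need root.

```shell
# Override for a dry run, e.g. IPTABLES=echo prints the rules instead of applying them.
IPTABLES="${IPTABLES:-iptables}"

# Hypothetical helper: runs a command while traffic to and from PEER_IP is
# dropped. The body runs in a subshell so the EXIT trap removes the rules
# whether the command succeeds, fails, or is interrupted.
with_partition() (
  peer_ip="$1"; shift
  cleanup() {
    "$IPTABLES" -D INPUT -s "$peer_ip" -j DROP
    "$IPTABLES" -D OUTPUT -d "$peer_ip" -j DROP
  }
  trap cleanup EXIT
  trap 'exit 130' INT TERM
  "$IPTABLES" -A INPUT -s "$peer_ip" -j DROP
  "$IPTABLES" -A OUTPUT -d "$peer_ip" -j DROP
  "$@"
  exit $?
)

# Usage (as root, in staging only):
#   with_partition 10.0.1.102 sh -c \
#     'sleep 60; curl -s http://localhost:18180/v1/admin/cluster/status'
```

The wrapper returns the wrapped command's exit status, so a failed verification still fails the calling script after the network has been restored.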

Escalation

Escalate immediately if:

Split brain lasts >5 minutes (data divergence growing)
Unable to identify winning primary (data loss unavoidable)
Network partition affects >50% of cluster
Split brain recurs after resolution (systemic issue)

Escalation path:

Primary on-call: Cluster SRE
Secondary: Distributed systems architect
Final escalation: CTO + VP Engineering (customer-facing impact)

References

Dashboard: StemeDB Cluster Health
Related alerts: GossipUnreachable, PrimaryElectionFailed, HighReplicationLag
Metrics:
- stemedb_cluster_is_primary (0 or 1 per node)
- stemedb_swim_unreachable_members (network health)
- stemedb_clock_offset_seconds (time sync)
Runbooks: high-replication-lag.md, merge-diverged-clusters.md
