This commit implements comprehensive production hardening across multiple layers to prepare StemeDB for enterprise pilot deployments: ## API Layer - Add rate limiting middleware with configurable limits per endpoint - Enhance error handling with detailed context and proper HTTP status codes - Add security hardening tests for input validation and boundary conditions - Create store_helpers module for defensive storage access patterns ## Storage & WAL - Optimize group commit batching for higher throughput - Add defensive error handling in hybrid backend with proper fallbacks - Enhance WAL journal durability guarantees with fsync validation - Improve index store query performance with better caching ## Operations & Deployment - Add comprehensive operations documentation (deployment, monitoring, DR) - Create systemd units for backup, WAL archival, and verification - Add monitoring configs (Prometheus alerts, metrics exporters) - Implement backup/restore scripts with verification and S3 archival - Add DR drill automation and runbook procedures - Create load balancer configs (nginx, envoy) with health checks ## Documentation - Update CLAUDE.md with operations and troubleshooting guides - Expand roadmap with production readiness milestones - Add pilot success criteria and deployment reference architecture - Document TLS setup, monitoring integration, and incident response ## Configuration - Add .env.example with all required environment variables - Document resource sizing for different deployment scales - Add configuration examples for various deployment topologies This positions StemeDB for successful enterprise pilots with proper operational discipline, monitoring, backup/DR, and security hardening. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
9.3 KiB
Three-Node Cluster Architecture
Target: Production deployments, enterprise pilots, high-availability requirements
✅ RECOMMENDED FOR PRODUCTION - Survives single node failure, automatic replication
Overview
The three-node cluster provides high availability through automatic replication (factor 2) and CRDT-based eventual consistency. Survives single node failure with <5 minute recovery time.
[See: diagrams/three-node.txt for ASCII diagram]
Target Specifications
| Metric | Value |
|---|---|
| Assertions | <100,000 |
| Queries/sec | <1,000 |
| Concurrent users | <500 |
| Availability | 99.9% (survives 1 node failure) |
| RTO | 5 minutes (automatic failover) |
| RPO | 1 minute (replication lag) |
| Consistency | Eventual (via CRDTs + Merkle sync) |
Hardware Requirements (Per Node)
Minimum (Pilot <50K assertions)
- CPU: 4 vCPUs
- RAM: 8GB
- Disk: 100GB SSD (50GB WAL + 50GB DB)
- Network: 1 Gbps, <5ms inter-node latency
Example instances (per node):
- AWS:
t3.large(2 vCPU, 8GB) × 3 = $180/month - GCP:
n2-standard-2(2 vCPU, 8GB) × 3 = $195/month - Azure:
Standard_D2s_v3(2 vCPU, 8GB) × 3 = $140/month
Recommended (Production <100K assertions)
- CPU: 8 vCPUs
- RAM: 16GB
- Disk: 200GB SSD (100GB WAL + 100GB DB)
- Network: 10 Gbps, <5ms inter-node latency
Example instances (per node):
- AWS:
t3.xlarge(4 vCPU, 16GB) × 3 = $300/month - GCP:
n2-standard-4(4 vCPU, 16GB) × 3 = $390/month - Azure:
Standard_D4s_v3(4 vCPU, 16GB) × 3 = $280/month
See: Resource Sizing Guide for detailed calculations.
Architecture Components
Node Layout
Each node runs the full stack:
- stemedb-api (port 18180) - HTTP API, queries, ingest
- stemedb-gateway (port 18181) - Cluster coordination
- stemedb-rpc (port 18182) - gRPC replication
- SWIM gossip (port 18183) - Membership, failure detection
Replication
CRDT-based with Merkle sync:
- Writes accepted locally (optimistic)
- Background Merkle tree comparison
- Automatic sync of missing assertions
- No distributed transactions
Replication factor 2:
- Each assertion stored on 2 nodes
- Survives 1 node failure
- Read from any node (eventually consistent)
Load Balancing
Round-robin across all nodes:
- Nginx or Envoy distribute queries
- No "primary" node (all equal)
- Health checks remove failed nodes
Deployment Steps
Prerequisites
- 3 servers provisioned (same specs)
- Private network with <5ms latency
- DNS records created
- TLS certificates provisioned
Step 1: Install StemeDB on All Nodes
# On each node (node1, node2, node3):
sudo curl -L https://github.com/yourorg/stemedb/releases/download/v0.1.0/stemedb-api -o /usr/local/bin/stemedb-api
sudo chmod +x /usr/local/bin/stemedb-api
sudo mkdir -p /data/{wal,db}
sudo useradd -r -s /bin/false stemedb
sudo chown -R stemedb:stemedb /data
Step 2: Configure Cluster
Node 1:
# /etc/stemedb/config.toml
[cluster]
enabled = true
node_id = "node1"
bind_addr = "10.0.1.51:18181"
rpc_addr = "10.0.1.51:18182"
swim_addr = "10.0.1.51:18183"
seeds = ["10.0.1.52:18183", "10.0.1.53:18183"]
[replication]
factor = 2
Node 2:
[cluster]
enabled = true
node_id = "node2"
bind_addr = "10.0.1.52:18181"
rpc_addr = "10.0.1.52:18182"
swim_addr = "10.0.1.52:18183"
seeds = ["10.0.1.51:18183", "10.0.1.53:18183"]
[replication]
factor = 2
Node 3:
[cluster]
enabled = true
node_id = "node3"
bind_addr = "10.0.1.53:18181"
rpc_addr = "10.0.1.53:18182"
swim_addr = "10.0.1.53:18183"
seeds = ["10.0.1.51:18183", "10.0.1.52:18183"]
[replication]
factor = 2
Step 3: Start All Nodes
# Start nodes sequentially (allows SWIM discovery)
ssh node1 "sudo systemctl start stemedb-api"
sleep 10
ssh node2 "sudo systemctl start stemedb-api"
sleep 10
ssh node3 "sudo systemctl start stemedb-api"
Step 4: Verify Cluster Formation
# Check membership (from any node)
curl http://node1:18181/cluster/members | jq '.'
# Expected output:
# {
# "members": [
# {"id": "node1", "status": "UP"},
# {"id": "node2", "status": "UP"},
# {"id": "node3", "status": "UP"}
# ]
# }
Step 5: Configure Load Balancer
See: Nginx Config or Envoy Config
Nginx upstream:
upstream stemedb_cluster {
server node1.example.com:18180;
server node2.example.com:18180;
server node3.example.com:18180;
}
Step 6: Set Up Monitoring
# Prometheus scrape config
scrape_configs:
- job_name: 'stemedb-cluster'
static_configs:
- targets:
- 'node1:18180'
- 'node2:18180'
- 'node3:18180'
Estimated deployment time: 4-8 hours (including load balancer, monitoring)
Failure Scenarios & Recovery
Single Node Failure
Impact: No service disruption, automatic failover
Recovery:
- Load balancer detects failed node (health check)
- Traffic routed to 2 remaining nodes
- Replication factor maintained (assertions still on 2 nodes)
- Replace failed node when convenient (see Add Node Runbook)
RTO: <1 minute (automatic) Data loss: None (replicated data preserved)
Two Nodes Fail (Catastrophic)
Impact: Read-only mode (no writes accepted)
Recovery:
- Manual intervention required
- Restore third node or add new node
- Trigger Merkle sync
- Resume writes when quorum restored
RTO: 30 minutes - 2 hours (manual) Data loss: Potential (depends on which nodes failed)
Network Partition
Impact: Split brain possible (both sides accept writes)
Recovery:
- CRDT merge resolves conflicts automatically
- Lenses (Recency, Authority) handle conflicts at read time
- No manual intervention needed after partition heals
Data loss: None (CRDTs preserve all writes)
Replication Lag
Impact: Queries may see stale data (<1 minute old)
Recovery:
- Automatic catch-up via Merkle sync
- If lag >5 minutes, see High Latency Runbook
Performance Characteristics
Query Latency
Target: p99 <200ms at <1K queries/sec
| Metric | Single-Node | Three-Node |
|---|---|---|
| p50 | 20ms | 25ms |
| p95 | 50ms | 75ms |
| p99 | 100ms | 150ms |
3-node has slightly higher latency due to network hops, but 3x query capacity
Write Throughput
Target: 1,000 assertions/sec sustained
- Each node accepts writes
- Replication happens asynchronously
- No coordination required (CRDTs)
Replication Lag
Target: <1 second typical, <5 seconds max
Measured by: replication_lag_seconds metric
Network Requirements
See: Network Requirements for full details.
Ports (Per Node)
| Port | Protocol | Purpose | Firewall Rule |
|---|---|---|---|
| 18180 | TCP/HTTP | API (clients → nodes) | Allow from load balancer |
| 18181 | TCP/HTTP | Cluster gateway (admin only) | Allow from internal network |
| 18182 | TCP/gRPC | Replication (node ↔ node) | Allow within cluster |
| 18183 | UDP | SWIM gossip (node ↔ node) | Allow within cluster |
Latency Requirement
<5ms inter-node latency required
- Deploy nodes in same region/AZ
- Private network (10 Gbps recommended)
- Test with:
ping -c 100 node2(should show avg <5ms)
Bandwidth
- Replication: ~1 Mbps per 100 assertions/sec
- Queries: ~10 Mbps at 1K queries/sec
- Recommended: 1 Gbps minimum, 10 Gbps for production
Monitoring & Alerts
Critical Metrics
# Prometheus alerts
- alert: StemeDBNodeDown
expr: up{job="stemedb-cluster"} == 0
for: 1m
- alert: StemeDBReplicationLag
expr: replication_lag_seconds > 5
for: 5m
- alert: StemeDBQuorumLost
expr: count(up{job="stemedb-cluster"} == 1) < 2
for: 1m
Grafana Dashboard Panels
- Cluster Health: Node count, status, replication lag
- Query Latency: p50, p95, p99 across all nodes
- Ingest Rate: Assertions/sec per node
- Disk Usage: WAL + DB per node
- Network: Replication bandwidth
Cost Estimate (AWS, us-east-1)
| Resource | Cost |
|---|---|
| Compute (3× t3.xlarge) | $300/month |
| Storage (3× 200GB SSD) | $60/month |
| Load Balancer (ALB) | $25/month |
| Data Transfer (internal) | $10/month |
| Backups (S3) | $30/month |
| Total | ~$425/month |
Compare to single-node ($87/month): 5x cost for 10x availability
Migration from Single-Node
See: Add Node Runbook for detailed procedure.
Summary:
- Provision 2 new nodes
- Configure cluster on all 3
- Restart single-node with cluster config
- Trigger Merkle sync
- Update load balancer
Downtime: 5-15 minutes for replication
Related Documentation
- Single-Node Pilot - Simpler architecture
- Network Requirements - Firewall rules
- Resource Sizing - Hardware calculations
- Add Node Runbook - Cluster operations
- High Query Latency Runbook - Performance troubleshooting
Last Updated: 2026-02-11