# Three-Node Cluster Architecture **Target:** Production deployments, enterprise pilots, high-availability requirements **✅ RECOMMENDED FOR PRODUCTION** - Survives single node failure, automatic replication --- ## Overview The three-node cluster provides high availability through automatic replication (factor 2) and CRDT-based eventual consistency. Survives single node failure with <5 minute recovery time. ``` [See: diagrams/three-node.txt for ASCII diagram] ``` --- ## Target Specifications | Metric | Value | |--------|-------| | **Assertions** | <100,000 | | **Queries/sec** | <1,000 | | **Concurrent users** | <500 | | **Availability** | 99.9% (survives 1 node failure) | | **RTO** | 5 minutes (automatic failover) | | **RPO** | 1 minute (replication lag) | | **Consistency** | Eventual (via CRDTs + Merkle sync) | --- ## Hardware Requirements (Per Node) ### Minimum (Pilot <50K assertions) - **CPU:** 4 vCPUs - **RAM:** 8GB - **Disk:** 100GB SSD (50GB WAL + 50GB DB) - **Network:** 1 Gbps, <5ms inter-node latency **Example instances (per node):** - AWS: `t3.large` (2 vCPU, 8GB) × 3 = $180/month - GCP: `n2-standard-2` (2 vCPU, 8GB) × 3 = $195/month - Azure: `Standard_D2s_v3` (2 vCPU, 8GB) × 3 = $140/month ### Recommended (Production <100K assertions) - **CPU:** 8 vCPUs - **RAM:** 16GB - **Disk:** 200GB SSD (100GB WAL + 100GB DB) - **Network:** 10 Gbps, <5ms inter-node latency **Example instances (per node):** - AWS: `t3.xlarge` (4 vCPU, 16GB) × 3 = $300/month - GCP: `n2-standard-4` (4 vCPU, 16GB) × 3 = $390/month - Azure: `Standard_D4s_v3` (4 vCPU, 16GB) × 3 = $280/month **See:** [Resource Sizing Guide](./resource-sizing.md) for detailed calculations. --- ## Architecture Components ### Node Layout Each node runs the full stack: - **stemedb-api** (port 18180) - HTTP API, queries, ingest - **stemedb-gateway** (port 18181) - Cluster coordination - **stemedb-rpc** (port 18182) - gRPC replication - **SWIM gossip** (port 18183) - Membership, failure detection ### Replication **CRDT-based with Merkle sync:** - Writes accepted locally (optimistic) - Background Merkle tree comparison - Automatic sync of missing assertions - No distributed transactions **Replication factor 2:** - Each assertion stored on 2 nodes - Survives 1 node failure - Read from any node (eventually consistent) ### Load Balancing **Round-robin across all nodes:** - Nginx or Envoy distribute queries - No "primary" node (all equal) - Health checks remove failed nodes --- ## Deployment Steps ### Prerequisites - [ ] 3 servers provisioned (same specs) - [ ] Private network with <5ms latency - [ ] DNS records created - [ ] TLS certificates provisioned ### Step 1: Install StemeDB on All Nodes ```bash # On each node (node1, node2, node3): sudo curl -L https://github.com/yourorg/stemedb/releases/download/v0.1.0/stemedb-api -o /usr/local/bin/stemedb-api sudo chmod +x /usr/local/bin/stemedb-api sudo mkdir -p /data/{wal,db} sudo useradd -r -s /bin/false stemedb sudo chown -R stemedb:stemedb /data ``` ### Step 2: Configure Cluster **Node 1:** ```toml # /etc/stemedb/config.toml [cluster] enabled = true node_id = "node1" bind_addr = "10.0.1.51:18181" rpc_addr = "10.0.1.51:18182" swim_addr = "10.0.1.51:18183" seeds = ["10.0.1.52:18183", "10.0.1.53:18183"] [replication] factor = 2 ``` **Node 2:** ```toml [cluster] enabled = true node_id = "node2" bind_addr = "10.0.1.52:18181" rpc_addr = "10.0.1.52:18182" swim_addr = "10.0.1.52:18183" seeds = ["10.0.1.51:18183", "10.0.1.53:18183"] [replication] factor = 2 ``` **Node 3:** ```toml [cluster] enabled = true node_id = "node3" bind_addr = "10.0.1.53:18181" rpc_addr = "10.0.1.53:18182" swim_addr = "10.0.1.53:18183" seeds = ["10.0.1.51:18183", "10.0.1.52:18183"] [replication] factor = 2 ``` ### Step 3: Start All Nodes ```bash # Start nodes sequentially (allows SWIM discovery) ssh node1 "sudo systemctl start stemedb-api" sleep 10 ssh node2 "sudo systemctl start stemedb-api" sleep 10 ssh node3 "sudo systemctl start stemedb-api" ``` ### Step 4: Verify Cluster Formation ```bash # Check membership (from any node) curl http://node1:18181/cluster/members | jq '.' # Expected output: # { # "members": [ # {"id": "node1", "status": "UP"}, # {"id": "node2", "status": "UP"}, # {"id": "node3", "status": "UP"} # ] # } ``` ### Step 5: Configure Load Balancer **See:** [Nginx Config](../../deployment/nginx/stemedb.conf) or [Envoy Config](../../deployment/envoy/stemedb.yaml) **Nginx upstream:** ```nginx upstream stemedb_cluster { server node1.example.com:18180; server node2.example.com:18180; server node3.example.com:18180; } ``` ### Step 6: Set Up Monitoring ```yaml # Prometheus scrape config scrape_configs: - job_name: 'stemedb-cluster' static_configs: - targets: - 'node1:18180' - 'node2:18180' - 'node3:18180' ``` **Estimated deployment time:** 4-8 hours (including load balancer, monitoring) --- ## Failure Scenarios & Recovery ### Single Node Failure **Impact:** No service disruption, automatic failover **Recovery:** 1. Load balancer detects failed node (health check) 2. Traffic routed to 2 remaining nodes 3. Replication factor maintained (assertions still on 2 nodes) 4. Replace failed node when convenient (see [Add Node Runbook](../../runbooks/add-node.md)) **RTO:** <1 minute (automatic) **Data loss:** None (replicated data preserved) ### Two Nodes Fail (Catastrophic) **Impact:** Read-only mode (no writes accepted) **Recovery:** 1. Manual intervention required 2. Restore third node or add new node 3. Trigger Merkle sync 4. Resume writes when quorum restored **RTO:** 30 minutes - 2 hours (manual) **Data loss:** Potential (depends on which nodes failed) ### Network Partition **Impact:** Split brain possible (both sides accept writes) **Recovery:** - CRDT merge resolves conflicts automatically - Lenses (Recency, Authority) handle conflicts at read time - No manual intervention needed after partition heals **Data loss:** None (CRDTs preserve all writes) ### Replication Lag **Impact:** Queries may see stale data (<1 minute old) **Recovery:** - Automatic catch-up via Merkle sync - If lag >5 minutes, see [High Latency Runbook](../../runbooks/high-query-latency.md) --- ## Performance Characteristics ### Query Latency **Target:** p99 <200ms at <1K queries/sec | Metric | Single-Node | Three-Node | |--------|-------------|------------| | **p50** | 20ms | 25ms | | **p95** | 50ms | 75ms | | **p99** | 100ms | 150ms | *3-node has slightly higher latency due to network hops, but 3x query capacity* ### Write Throughput **Target:** 1,000 assertions/sec sustained - Each node accepts writes - Replication happens asynchronously - No coordination required (CRDTs) ### Replication Lag **Target:** <1 second typical, <5 seconds max Measured by: `replication_lag_seconds` metric --- ## Network Requirements **See:** [Network Requirements](./network-requirements.md) for full details. ### Ports (Per Node) | Port | Protocol | Purpose | Firewall Rule | |------|----------|---------|---------------| | **18180** | TCP/HTTP | API (clients → nodes) | Allow from load balancer | | **18181** | TCP/HTTP | Cluster gateway (admin only) | Allow from internal network | | **18182** | TCP/gRPC | Replication (node ↔ node) | Allow within cluster | | **18183** | UDP | SWIM gossip (node ↔ node) | Allow within cluster | ### Latency Requirement **<5ms inter-node latency required** - Deploy nodes in same region/AZ - Private network (10 Gbps recommended) - Test with: `ping -c 100 node2` (should show avg <5ms) ### Bandwidth - **Replication:** ~1 Mbps per 100 assertions/sec - **Queries:** ~10 Mbps at 1K queries/sec - **Recommended:** 1 Gbps minimum, 10 Gbps for production --- ## Monitoring & Alerts ### Critical Metrics ```yaml # Prometheus alerts - alert: StemeDBNodeDown expr: up{job="stemedb-cluster"} == 0 for: 1m - alert: StemeDBReplicationLag expr: replication_lag_seconds > 5 for: 5m - alert: StemeDBQuorumLost expr: count(up{job="stemedb-cluster"} == 1) < 2 for: 1m ``` ### Grafana Dashboard Panels 1. **Cluster Health:** Node count, status, replication lag 2. **Query Latency:** p50, p95, p99 across all nodes 3. **Ingest Rate:** Assertions/sec per node 4. **Disk Usage:** WAL + DB per node 5. **Network:** Replication bandwidth --- ## Cost Estimate (AWS, us-east-1) | Resource | Cost | |----------|------| | **Compute** (3× t3.xlarge) | $300/month | | **Storage** (3× 200GB SSD) | $60/month | | **Load Balancer** (ALB) | $25/month | | **Data Transfer** (internal) | $10/month | | **Backups** (S3) | $30/month | | **Total** | **~$425/month** | Compare to single-node ($87/month): 5x cost for 10x availability --- ## Migration from Single-Node **See:** [Add Node Runbook](../../runbooks/add-node.md#1-bootstrap-3-node-cluster) for detailed procedure. **Summary:** 1. Provision 2 new nodes 2. Configure cluster on all 3 3. Restart single-node with cluster config 4. Trigger Merkle sync 5. Update load balancer **Downtime:** 5-15 minutes for replication --- ## Related Documentation - [Single-Node Pilot](./single-node-pilot.md) - Simpler architecture - [Network Requirements](./network-requirements.md) - Firewall rules - [Resource Sizing](./resource-sizing.md) - Hardware calculations - [Add Node Runbook](../../runbooks/add-node.md) - Cluster operations - [High Query Latency Runbook](../../runbooks/high-query-latency.md) - Performance troubleshooting --- **Last Updated:** 2026-02-11