stemedb/docs/operations/reference-architecture/three-node-cluster.md

# Three-Node Cluster Architecture

**Target:** Production deployments, enterprise pilots, high-availability requirements

**✅ RECOMMENDED FOR PRODUCTION** - Survives single node failure, automatic replication

---

## Overview

The three-node cluster provides high availability through automatic replication (factor 2) and CRDT-based eventual consistency. Survives single node failure with <5 minute recovery time.

```
[See: diagrams/three-node.txt for ASCII diagram]
```

---

## Target Specifications

| Metric | Value |
|--------|-------|
| **Assertions** | <100,000 |
| **Queries/sec** | <1,000 |
| **Concurrent users** | <500 |
| **Availability** | 99.9% (survives 1 node failure) |
| **RTO** | 5 minutes (automatic failover) |
| **RPO** | 1 minute (replication lag) |
| **Consistency** | Eventual (via CRDTs + Merkle sync) |

---

## Hardware Requirements (Per Node)

### Minimum (Pilot <50K assertions)

- **CPU:** 4 vCPUs
- **RAM:** 8GB
- **Disk:** 100GB SSD (50GB WAL + 50GB DB)
- **Network:** 1 Gbps, <5ms inter-node latency

**Example instances (per node):**
- AWS: `t3.large` (2 vCPU, 8GB) × 3 = $180/month
- GCP: `n2-standard-2` (2 vCPU, 8GB) × 3 = $195/month
- Azure: `Standard_D2s_v3` (2 vCPU, 8GB) × 3 = $140/month

### Recommended (Production <100K assertions)

- **CPU:** 8 vCPUs
- **RAM:** 16GB
- **Disk:** 200GB SSD (100GB WAL + 100GB DB)
- **Network:** 10 Gbps, <5ms inter-node latency

**Example instances (per node):**
- AWS: `t3.xlarge` (4 vCPU, 16GB) × 3 = $300/month
- GCP: `n2-standard-4` (4 vCPU, 16GB) × 3 = $390/month
- Azure: `Standard_D4s_v3` (4 vCPU, 16GB) × 3 = $280/month

**See:** [Resource Sizing Guide](./resource-sizing.md) for detailed calculations.

---

## Architecture Components

### Node Layout

Each node runs the full stack:
- **stemedb-api** (port 18180) - HTTP API, queries, ingest
- **stemedb-gateway** (port 18181) - Cluster coordination
- **stemedb-rpc** (port 18182) - gRPC replication
- **SWIM gossip** (port 18183) - Membership, failure detection

### Replication

**CRDT-based with Merkle sync:**
- Writes accepted locally (optimistic)
- Background Merkle tree comparison
- Automatic sync of missing assertions
- No distributed transactions

**Replication factor 2:**
- Each assertion stored on 2 nodes
- Survives 1 node failure
- Read from any node (eventually consistent)

### Load Balancing

**Round-robin across all nodes:**
- Nginx or Envoy distribute queries
- No "primary" node (all equal)
- Health checks remove failed nodes

---

## Deployment Steps

### Prerequisites

- [ ] 3 servers provisioned (same specs)
- [ ] Private network with <5ms latency
- [ ] DNS records created
- [ ] TLS certificates provisioned

### Step 1: Install StemeDB on All Nodes

```bash
# On each node (node1, node2, node3):
sudo curl -L https://github.com/yourorg/stemedb/releases/download/v0.1.0/stemedb-api -o /usr/local/bin/stemedb-api
sudo chmod +x /usr/local/bin/stemedb-api

sudo mkdir -p /data/{wal,db}
sudo useradd -r -s /bin/false stemedb
sudo chown -R stemedb:stemedb /data
```

### Step 2: Configure Cluster

**Node 1:**
```toml
# /etc/stemedb/config.toml
[cluster]
enabled = true
node_id = "node1"
bind_addr = "10.0.1.51:18181"
rpc_addr = "10.0.1.51:18182"
swim_addr = "10.0.1.51:18183"
seeds = ["10.0.1.52:18183", "10.0.1.53:18183"]

[replication]
factor = 2
```

**Node 2:**
```toml
[cluster]
enabled = true
node_id = "node2"
bind_addr = "10.0.1.52:18181"
rpc_addr = "10.0.1.52:18182"
swim_addr = "10.0.1.52:18183"
seeds = ["10.0.1.51:18183", "10.0.1.53:18183"]

[replication]
factor = 2
```

**Node 3:**
```toml
[cluster]
enabled = true
node_id = "node3"
bind_addr = "10.0.1.53:18181"
rpc_addr = "10.0.1.53:18182"
swim_addr = "10.0.1.53:18183"
seeds = ["10.0.1.51:18183", "10.0.1.52:18183"]

[replication]
factor = 2
```

### Step 3: Start All Nodes

```bash
# Start nodes sequentially (allows SWIM discovery)
ssh node1 "sudo systemctl start stemedb-api"
sleep 10

ssh node2 "sudo systemctl start stemedb-api"
sleep 10

ssh node3 "sudo systemctl start stemedb-api"
```

### Step 4: Verify Cluster Formation

```bash
# Check membership (from any node)
curl http://node1:18181/cluster/members | jq '.'

# Expected output:
# {
#   "members": [
#     {"id": "node1", "status": "UP"},
#     {"id": "node2", "status": "UP"},
#     {"id": "node3", "status": "UP"}
#   ]
# }
```

### Step 5: Configure Load Balancer

**See:** [Nginx Config](../../deployment/nginx/stemedb.conf) or [Envoy Config](../../deployment/envoy/stemedb.yaml)

**Nginx upstream:**
```nginx
upstream stemedb_cluster {
    server node1.example.com:18180;
    server node2.example.com:18180;
    server node3.example.com:18180;
}
```

### Step 6: Set Up Monitoring

```yaml
# Prometheus scrape config
scrape_configs:
  - job_name: 'stemedb-cluster'
    static_configs:
      - targets:
        - 'node1:18180'
        - 'node2:18180'
        - 'node3:18180'
```

**Estimated deployment time:** 4-8 hours (including load balancer, monitoring)

---

## Failure Scenarios & Recovery

### Single Node Failure

**Impact:** No service disruption, automatic failover

**Recovery:**
1. Load balancer detects failed node (health check)
2. Traffic routed to 2 remaining nodes
3. Replication factor maintained (assertions still on 2 nodes)
4. Replace failed node when convenient (see [Add Node Runbook](../../runbooks/add-node.md))

**RTO:** <1 minute (automatic)
**Data loss:** None (replicated data preserved)

### Two Nodes Fail (Catastrophic)

**Impact:** Read-only mode (no writes accepted)

**Recovery:**
1. Manual intervention required
2. Restore third node or add new node
3. Trigger Merkle sync
4. Resume writes when quorum restored

**RTO:** 30 minutes - 2 hours (manual)
**Data loss:** Potential (depends on which nodes failed)

### Network Partition

**Impact:** Split brain possible (both sides accept writes)

**Recovery:**
- CRDT merge resolves conflicts automatically
- Lenses (Recency, Authority) handle conflicts at read time
- No manual intervention needed after partition heals

**Data loss:** None (CRDTs preserve all writes)

### Replication Lag

**Impact:** Queries may see stale data (<1 minute old)

**Recovery:**
- Automatic catch-up via Merkle sync
- If lag >5 minutes, see [High Latency Runbook](../../runbooks/high-query-latency.md)

---

## Performance Characteristics

### Query Latency

**Target:** p99 <200ms at <1K queries/sec

| Metric | Single-Node | Three-Node |
|--------|-------------|------------|
| **p50** | 20ms | 25ms |
| **p95** | 50ms | 75ms |
| **p99** | 100ms | 150ms |

*3-node has slightly higher latency due to network hops, but 3x query capacity*

### Write Throughput

**Target:** 1,000 assertions/sec sustained

- Each node accepts writes
- Replication happens asynchronously
- No coordination required (CRDTs)

### Replication Lag

**Target:** <1 second typical, <5 seconds max

Measured by: `replication_lag_seconds` metric

---

## Network Requirements

**See:** [Network Requirements](./network-requirements.md) for full details.

### Ports (Per Node)

| Port | Protocol | Purpose | Firewall Rule |
|------|----------|---------|---------------|
| **18180** | TCP/HTTP | API (clients → nodes) | Allow from load balancer |
| **18181** | TCP/HTTP | Cluster gateway (admin only) | Allow from internal network |
| **18182** | TCP/gRPC | Replication (node ↔ node) | Allow within cluster |
| **18183** | UDP | SWIM gossip (node ↔ node) | Allow within cluster |

### Latency Requirement

**<5ms inter-node latency required**

- Deploy nodes in same region/AZ
- Private network (10 Gbps recommended)
- Test with: `ping -c 100 node2` (should show avg <5ms)

### Bandwidth

- **Replication:** ~1 Mbps per 100 assertions/sec
- **Queries:** ~10 Mbps at 1K queries/sec
- **Recommended:** 1 Gbps minimum, 10 Gbps for production

---

## Monitoring & Alerts

### Critical Metrics

```yaml
# Prometheus alerts
- alert: StemeDBNodeDown
  expr: up{job="stemedb-cluster"} == 0
  for: 1m

- alert: StemeDBReplicationLag
  expr: replication_lag_seconds > 5
  for: 5m

- alert: StemeDBQuorumLost
  expr: count(up{job="stemedb-cluster"} == 1) < 2
  for: 1m
```

### Grafana Dashboard Panels

1. **Cluster Health:** Node count, status, replication lag
2. **Query Latency:** p50, p95, p99 across all nodes
3. **Ingest Rate:** Assertions/sec per node
4. **Disk Usage:** WAL + DB per node
5. **Network:** Replication bandwidth

---

## Cost Estimate (AWS, us-east-1)

| Resource | Cost |
|----------|------|
| **Compute** (3× t3.xlarge) | $300/month |
| **Storage** (3× 200GB SSD) | $60/month |
| **Load Balancer** (ALB) | $25/month |
| **Data Transfer** (internal) | $10/month |
| **Backups** (S3) | $30/month |
| **Total** | **~$425/month** |

Compare to single-node ($87/month): 5x cost for 10x availability

---

## Migration from Single-Node

**See:** [Add Node Runbook](../../runbooks/add-node.md#1-bootstrap-3-node-cluster) for detailed procedure.

**Summary:**
1. Provision 2 new nodes
2. Configure cluster on all 3
3. Restart single-node with cluster config
4. Trigger Merkle sync
5. Update load balancer

**Downtime:** 5-15 minutes for replication

---

## Related Documentation

- [Single-Node Pilot](./single-node-pilot.md) - Simpler architecture
- [Network Requirements](./network-requirements.md) - Firewall rules
- [Resource Sizing](./resource-sizing.md) - Hardware calculations
- [Add Node Runbook](../../runbooks/add-node.md) - Cluster operations
- [High Query Latency Runbook](../../runbooks/high-query-latency.md) - Performance troubleshooting

---

**Last Updated:** 2026-02-11