stemedb/docs/operations/reference-architecture/three-node-cluster.md

# Three-Node Cluster Architecture

**Target:** Production deployments, enterprise pilots, high-availability requirements

**✅ RECOMMENDED FOR PRODUCTION** - Survives single node failure, automatic replication

> **Implementation status:** The cluster crates (`stemedb-cluster`, `stemedb-sync`, `stemedb-rpc`) are
> implemented. The k3s/Longhorn deployment path is the current production path (see Phase 2 section).
> Bare-metal deployment via config.toml is aspirational and not yet wired to the binary.

---

## Architectural Rationale: Why Gossip, Not Raft

This section exists because the wrong answer here — "just add Raft" — is commonly assumed and actively harmful for StemeDB's workload.

### The append-only insight

Most databases need Raft because writes can **conflict**: two nodes update the same row, and a leader must serialize them. StemeDB doesn't have this problem. Every assertion receives a **BLAKE3 content hash** as its ID. If two nodes both write the same assertion independently, they produce the same hash → the same key → identical data. There is nothing to conflict on.

This means the assertion write path is naturally **CRDT-like**: the system needs every node to eventually receive every assertion, but doesn't need consensus on which assertion "wins." Gossip + Merkle anti-entropy handles this correctly and efficiently. Raft would add leader-election overhead and write latency for zero benefit on the data path.

### What actually needs coordination

Not everything in StemeDB is append-only. Mutable state requires stronger guarantees:

| State | Type | Replication strategy | Why |
|-------|------|---------------------|-----|
| Assertions | Append-only (CRDT) | Gossip + Merkle anti-entropy | No conflicts possible by design |
| API keys | Mutable | Synchronous broadcast or coordinator | Revoked key must not be reusable |
| Quota / meter counts | Mutable counter | Coordinator node or bounded staleness | Double-spend if two nodes both allow |
| Circuit breaker state | Mutable | Synchronous broadcast | Trip/reset must propagate atomically |
| Epochs | Append-only, ordered | Gossip is sufficient | Creation order captured in content |

**Practical implication:** Admin operations (key management, quota changes) should be routed to a designated coordinator and synchronously acknowledged. Assertion writes and reads can go to any node with no coordination.

### Read scaling

Because Lens resolution is pure local computation on indexed data, **any node can serve any read with no coordination**. Reads scale horizontally to N nodes without inter-node communication.

### Write scaling (assertions)

Because assertion writes are idempotent by content hash, **any node can accept any write**. The cluster gateway (`stemedb-cluster`, port 18181) routes writes by subject hash shard prefix — each node owns a partition of the key space. Merkle anti-entropy ensures all nodes converge. Write throughput scales linearly with node count.

---

## Overview

The three-node cluster provides high availability through automatic replication (factor 2) and gossip-based eventual consistency for assertions. Survives single node failure with <5 minute recovery time.

```
                    ┌──────────────────────────────┐
Internet ──→  LB → │  Cluster Gateway (port 18181) │
                    │  Reads: round-robin any node  │
                    │  Writes: route by shard prefix│
                    └──────┬────────────┬───────────┘
                           │            │
                ┌──────────▼──┐    ┌────▼─────────┐    ┌──────────────┐
                │   Node A    │    │   Node B      │    │   Node C     │
                │ Shard 0-84  │    │ Shard 85-169  │    │ Shard 170-255│
                │ WAL + KV    │    │ WAL + KV      │    │ WAL + KV     │
                └──────┬──────┘    └────┬──────────┘    └───────┬──────┘
                       │     Merkle sync (gossip, port 18183)    │
                       └─────────────────────────────────────────┘
```

---

## Target Specifications

| Metric | Value |
|--------|-------|
| **Assertions** | <100,000 |
| **Queries/sec** | <1,000 |
| **Concurrent users** | <500 |
| **Availability** | 99.9% (survives 1 node failure) |
| **RTO** | 5 minutes (automatic failover) |
| **RPO** | 1 minute (replication lag) |
| **Consistency** | Eventual (via CRDTs + Merkle sync) |

---

## Hardware Requirements (Per Node)

### Minimum (Pilot <50K assertions)

- **CPU:** 4 vCPUs
- **RAM:** 8GB
- **Disk:** 100GB SSD (50GB WAL + 50GB DB)
- **Network:** 1 Gbps, <5ms inter-node latency

**Example instances (per node):**
- AWS: `t3.large` (2 vCPU, 8GB) × 3 = $180/month
- GCP: `n2-standard-2` (2 vCPU, 8GB) × 3 = $195/month
- Azure: `Standard_D2s_v3` (2 vCPU, 8GB) × 3 = $140/month

### Recommended (Production <100K assertions)

- **CPU:** 8 vCPUs
- **RAM:** 16GB
- **Disk:** 200GB SSD (100GB WAL + 100GB DB)
- **Network:** 10 Gbps, <5ms inter-node latency

**Example instances (per node):**
- AWS: `t3.xlarge` (4 vCPU, 16GB) × 3 = $300/month
- GCP: `n2-standard-4` (4 vCPU, 16GB) × 3 = $390/month
- Azure: `Standard_D4s_v3` (4 vCPU, 16GB) × 3 = $280/month

**See:** [Resource Sizing Guide](./resource-sizing.md) for detailed calculations.

---

## Architecture Components

### Node Layout

Each node runs the full stack:
- **stemedb-api** (port 18180) - HTTP API, queries, ingest
- **stemedb-gateway** (port 18181) - Cluster coordination
- **stemedb-rpc** (port 18182) - gRPC replication
- **SWIM gossip** (port 18183) - Membership, failure detection

### Replication

**CRDT-based with Merkle sync:**
- Writes accepted locally (optimistic)
- Background Merkle tree comparison
- Automatic sync of missing assertions
- No distributed transactions

**Replication factor 2:**
- Each assertion stored on 2 nodes
- Survives 1 node failure
- Read from any node (eventually consistent)

### Load Balancing

**Round-robin across all nodes:**
- Nginx or Envoy distribute queries
- No "primary" node (all equal)
- Health checks remove failed nodes

---

## k3s Deployment Path (Current — Longhorn + StatefulSet)

> This is the **current production deployment path** for k3s-fleet. The bare-metal steps below
> are for non-k8s environments and use a config.toml interface that is not yet wired to the binary.

For each cluster node on k3s, deploy a separate StatefulSet with its own Longhorn PVC:

```
k3s-fleet/deployments/k8s/base/stemedb/
├── stemedb.yaml          # Node A (current single-node — Phase 1)
├── stemedb-b.yaml        # Node B (Phase 2 — add when ready to scale reads)
├── stemedb-c.yaml        # Node C (Phase 2)
└── kustomization.yaml
```

**Critical k3s constraints:**
- Each node needs its own `ReadWriteOnce` Longhorn PVC — embedded KV (fjall) cannot share a volume
- Use `strategy: Recreate` on each Deployment (not RollingUpdate) — RWO PVC + 2 pods = deadlock
- Cluster gateway (port 18181) must be exposed as a separate Service for inter-node routing
- Use `topologySpreadConstraints` to ensure nodes land on different k3s worker hosts

**Phase 2 read-replica k8s addition (when ready):**
```yaml
# Add to stemedb-b.yaml — identical to stemedb.yaml except:
# - Different node ID env var
# - STEMEDB_CLUSTER_SEEDS pointing to Node A's gateway ClusterIP
# - Its own PVC claim
env:
  - name: STEMEDB_NODE_ID
    value: "node-b"
  - name: STEMEDB_CLUSTER_SEEDS
    value: "stemedb-api.stemedb.svc:18181"
```

**See:** [k8s Deploy Roadmap](../deployment/k8s-deploy-roadmap.md) for the phased rollout plan.

---

## Bare-Metal Deployment Steps

> ⚠️ The config.toml cluster configuration shown here is **planned** and not yet wired to the
> `stemedb-api` binary. Current binary configuration uses environment variables only. This section
> documents the intended interface for when cluster config is implemented.

### Prerequisites

- [ ] 3 servers provisioned (same specs)
- [ ] Private network with <5ms latency
- [ ] DNS records created
- [ ] TLS certificates provisioned

### Step 1: Install StemeDB on All Nodes

```bash
# On each node (node1, node2, node3):
sudo curl -L https://github.com/yourorg/stemedb/releases/download/v0.1.0/stemedb-api -o /usr/local/bin/stemedb-api
sudo chmod +x /usr/local/bin/stemedb-api

sudo mkdir -p /data/{wal,db}
sudo useradd -r -s /bin/false stemedb
sudo chown -R stemedb:stemedb /data
```

### Step 2: Configure Cluster

**Node 1:**
```toml
# /etc/stemedb/config.toml
[cluster]
enabled = true
node_id = "node1"
bind_addr = "10.0.1.51:18181"
rpc_addr = "10.0.1.51:18182"
swim_addr = "10.0.1.51:18183"
seeds = ["10.0.1.52:18183", "10.0.1.53:18183"]

[replication]
factor = 2
```

**Node 2:**
```toml
[cluster]
enabled = true
node_id = "node2"
bind_addr = "10.0.1.52:18181"
rpc_addr = "10.0.1.52:18182"
swim_addr = "10.0.1.52:18183"
seeds = ["10.0.1.51:18183", "10.0.1.53:18183"]

[replication]
factor = 2
```

**Node 3:**
```toml
[cluster]
enabled = true
node_id = "node3"
bind_addr = "10.0.1.53:18181"
rpc_addr = "10.0.1.53:18182"
swim_addr = "10.0.1.53:18183"
seeds = ["10.0.1.51:18183", "10.0.1.52:18183"]

[replication]
factor = 2
```

### Step 3: Start All Nodes

```bash
# Start nodes sequentially (allows SWIM discovery)
ssh node1 "sudo systemctl start stemedb-api"
sleep 10

ssh node2 "sudo systemctl start stemedb-api"
sleep 10

ssh node3 "sudo systemctl start stemedb-api"
```

### Step 4: Verify Cluster Formation

```bash
# Check membership (from any node)
curl http://node1:18181/cluster/members | jq '.'

# Expected output:
# {
#   "members": [
#     {"id": "node1", "status": "UP"},
#     {"id": "node2", "status": "UP"},
#     {"id": "node3", "status": "UP"}
#   ]
# }
```

### Step 5: Configure Load Balancer

**See:** [Nginx Config](../../deployment/nginx/stemedb.conf) or [Envoy Config](../../deployment/envoy/stemedb.yaml)

**Nginx upstream:**
```nginx
upstream stemedb_cluster {
    server node1.example.com:18180;
    server node2.example.com:18180;
    server node3.example.com:18180;
}
```

### Step 6: Set Up Monitoring

```yaml
# Prometheus scrape config
scrape_configs:
  - job_name: 'stemedb-cluster'
    static_configs:
      - targets:
        - 'node1:18180'
        - 'node2:18180'
        - 'node3:18180'
```

**Estimated deployment time:** 4-8 hours (including load balancer, monitoring)

---

## Failure Scenarios & Recovery

### Single Node Failure

**Impact:** No service disruption, automatic failover

**Recovery:**
1. Load balancer detects failed node (health check)
2. Traffic routed to 2 remaining nodes
3. Replication factor maintained (assertions still on 2 nodes)
4. Replace failed node when convenient (see [Add Node Runbook](../../runbooks/add-node.md))

**RTO:** <1 minute (automatic)
**Data loss:** None (replicated data preserved)

### Two Nodes Fail (Catastrophic)

**Impact:** Single surviving node continues accepting assertion writes and serving reads. Admin operations (key management) are degraded — single-node has no peer to synchronously acknowledge.

**Recovery:**
1. Manual intervention required to restore cluster
2. Restore failed nodes or add new nodes
3. Trigger Merkle sync (`/cluster/sync` endpoint) after nodes rejoin
4. Admin operations fully restored when cluster membership is repaired

**RTO:** 30 minutes - 2 hours (manual restore)
**Data loss:** Assertion writes continue on surviving node and merge on recovery. Recent admin operations (key revocations) issued during degraded window may not have propagated — audit after recovery.

### Network Partition

**Impact:**
- **Assertion writes:** Both partitions accept writes independently. This is safe — same content → same BLAKE3 hash, different content → different hashes that merge cleanly after partition heals.
- **Admin operations (API key revocations, quota changes):** A revocation issued to one partition is invisible to the other until partition heals. A revoked key may still be honored by nodes in the other partition during the partition window.

**Recovery:**
- Merkle anti-entropy detects and fills gaps automatically when partition heals
- Lenses (Recency, Authority) handle any assertion-level divergence at read time
- Admin state re-synchronizes via coordinator broadcast on reconnect

**Data loss:** None for assertions (all writes from both partitions preserved and merged).

### Replication Lag

**Impact:** Queries may see stale data (<1 minute old)

**Recovery:**
- Automatic catch-up via Merkle sync
- If lag >5 minutes, see [High Latency Runbook](../../runbooks/high-query-latency.md)

---

## Performance Characteristics

### Query Latency

**Target:** p99 <200ms at <1K queries/sec

| Metric | Single-Node | Three-Node |
|--------|-------------|------------|
| **p50** | 20ms | 25ms |
| **p95** | 50ms | 75ms |
| **p99** | 100ms | 150ms |

*3-node has slightly higher latency due to network hops, but 3x query capacity*

### Write Throughput

**Target:** 1,000 assertions/sec sustained

- Each node accepts assertion writes (routed by cluster gateway via shard prefix)
- Replication happens asynchronously via Merkle gossip
- No coordination required for assertions (CRDT-safe by content hash)
- Admin writes (API keys, quota changes) route to coordinator and require synchronous acknowledgment — expect higher latency on those operations (~50ms vs ~5ms for assertions)

### Replication Lag

**Target:** <1 second typical, <5 seconds max

Measured by: `replication_lag_seconds` metric

---

## Network Requirements

**See:** [Network Requirements](./network-requirements.md) for full details.

### Ports (Per Node)

| Port | Protocol | Purpose | Firewall Rule |
|------|----------|---------|---------------|
| **18180** | TCP/HTTP | API (clients → nodes) | Allow from load balancer |
| **18181** | TCP/HTTP | Cluster gateway (admin only) | Allow from internal network |
| **18182** | TCP/gRPC | Replication (node ↔ node) | Allow within cluster |
| **18183** | UDP | SWIM gossip (node ↔ node) | Allow within cluster |

### Latency Requirement

**<5ms inter-node latency required**

- Deploy nodes in same region/AZ
- Private network (10 Gbps recommended)
- Test with: `ping -c 100 node2` (should show avg <5ms)

### Bandwidth

- **Replication:** ~1 Mbps per 100 assertions/sec
- **Queries:** ~10 Mbps at 1K queries/sec
- **Recommended:** 1 Gbps minimum, 10 Gbps for production

---

## Monitoring & Alerts

### Critical Metrics

```yaml
# Prometheus alerts
- alert: StemeDBNodeDown
  expr: up{job="stemedb-cluster"} == 0
  for: 1m

- alert: StemeDBReplicationLag
  expr: replication_lag_seconds > 5
  for: 5m

- alert: StemeDBQuorumLost
  expr: count(up{job="stemedb-cluster"} == 1) < 2
  for: 1m
```

### Grafana Dashboard Panels

1. **Cluster Health:** Node count, status, replication lag
2. **Query Latency:** p50, p95, p99 across all nodes
3. **Ingest Rate:** Assertions/sec per node
4. **Disk Usage:** WAL + DB per node
5. **Network:** Replication bandwidth

---

## Cost Estimate (AWS, us-east-1)

| Resource | Cost |
|----------|------|
| **Compute** (3× t3.xlarge) | $300/month |
| **Storage** (3× 200GB SSD) | $60/month |
| **Load Balancer** (ALB) | $25/month |
| **Data Transfer** (internal) | $10/month |
| **Backups** (S3) | $30/month |
| **Total** | **~$425/month** |

Compare to single-node ($87/month): 5x cost for 10x availability

---

## Migration from Single-Node

**See:** [Add Node Runbook](../../runbooks/add-node.md#1-bootstrap-3-node-cluster) for detailed procedure.

**Summary:**
1. Provision 2 new nodes
2. Configure cluster on all 3
3. Restart single-node with cluster config
4. Trigger Merkle sync
5. Update load balancer

**Downtime:** 5-15 minutes for replication

---

## Scaling Path Beyond Three Nodes

Three nodes on k3s handles the 100-project target. For mass traffic beyond that, the scaling path is incremental — not a rearchitecture:

| Phase | Target | Work type | What changes |
|-------|--------|-----------|-------------|
| **Phase 1** | 1 node, 100 projects | ✅ Done | Single Deployment, Longhorn PVC, auth wired |
| **Phase 2** | 3 nodes, read-scaled | Ops-heavy | Add 2 read replicas as separate Deployments; cluster gateway routes reads round-robin |
| **Phase 3** | 3 nodes, write-sharded | Code-heavy | Gateway enforces shard ownership; each node owns ⅓ of subject hash space; reads still any-node |
| **Phase 4** | N nodes, coordinator | Code-heavy | Designate one node (or small 3-node Raft group) exclusively for mutable admin state; assertion nodes are pure data |

**What you do NOT need:** Raft on the assertion write path. The append-only, content-addressed design means there are no write conflicts to serialize. Raft belongs only on the mutable admin state path (Phase 4), which is a small fraction of total traffic.

---

## Related Documentation

- [Single-Node Pilot](./single-node-pilot.md) - Simpler architecture
- [Network Requirements](./network-requirements.md) - Firewall rules
- [Resource Sizing](./resource-sizing.md) - Hardware calculations
- [Add Node Runbook](../../runbooks/add-node.md) - Cluster operations
- [High Query Latency Runbook](../../runbooks/high-query-latency.md) - Performance troubleshooting

---

**Last Updated:** 2026-03-02 — Added architectural rationale (gossip vs Raft), k3s deployment path, fixed mutable-state coordination notes, added 4-phase scaling table