# Three-Node Cluster Architecture **Target:** Production deployments, enterprise pilots, high-availability requirements **✅ RECOMMENDED FOR PRODUCTION** - Survives single node failure, automatic replication > **Implementation status:** The cluster crates (`stemedb-cluster`, `stemedb-sync`, `stemedb-rpc`) are > implemented. The k3s/Longhorn deployment path is the current production path (see Phase 2 section). > Bare-metal deployment via config.toml is aspirational and not yet wired to the binary. --- ## Architectural Rationale: Why Gossip, Not Raft This section exists because the wrong answer here — "just add Raft" — is commonly assumed and actively harmful for StemeDB's workload. ### The append-only insight Most databases need Raft because writes can **conflict**: two nodes update the same row, and a leader must serialize them. StemeDB doesn't have this problem. Every assertion receives a **BLAKE3 content hash** as its ID. If two nodes both write the same assertion independently, they produce the same hash → the same key → identical data. There is nothing to conflict on. This means the assertion write path is naturally **CRDT-like**: the system needs every node to eventually receive every assertion, but doesn't need consensus on which assertion "wins." Gossip + Merkle anti-entropy handles this correctly and efficiently. Raft would add leader-election overhead and write latency for zero benefit on the data path. ### What actually needs coordination Not everything in StemeDB is append-only. Mutable state requires stronger guarantees: | State | Type | Replication strategy | Why | |-------|------|---------------------|-----| | Assertions | Append-only (CRDT) | Gossip + Merkle anti-entropy | No conflicts possible by design | | API keys | Mutable | Synchronous broadcast or coordinator | Revoked key must not be reusable | | Quota / meter counts | Mutable counter | Coordinator node or bounded staleness | Double-spend if two nodes both allow | | Circuit breaker state | Mutable | Synchronous broadcast | Trip/reset must propagate atomically | | Epochs | Append-only, ordered | Gossip is sufficient | Creation order captured in content | **Practical implication:** Admin operations (key management, quota changes) should be routed to a designated coordinator and synchronously acknowledged. Assertion writes and reads can go to any node with no coordination. ### Read scaling Because Lens resolution is pure local computation on indexed data, **any node can serve any read with no coordination**. Reads scale horizontally to N nodes without inter-node communication. ### Write scaling (assertions) Because assertion writes are idempotent by content hash, **any node can accept any write**. The cluster gateway (`stemedb-cluster`, port 18181) routes writes by subject hash shard prefix — each node owns a partition of the key space. Merkle anti-entropy ensures all nodes converge. Write throughput scales linearly with node count. --- ## Overview The three-node cluster provides high availability through automatic replication (factor 2) and gossip-based eventual consistency for assertions. Survives single node failure with <5 minute recovery time. ``` ┌──────────────────────────────┐ Internet ──→ LB → │ Cluster Gateway (port 18181) │ │ Reads: round-robin any node │ │ Writes: route by shard prefix│ └──────┬────────────┬───────────┘ │ │ ┌──────────▼──┐ ┌────▼─────────┐ ┌──────────────┐ │ Node A │ │ Node B │ │ Node C │ │ Shard 0-84 │ │ Shard 85-169 │ │ Shard 170-255│ │ WAL + KV │ │ WAL + KV │ │ WAL + KV │ └──────┬──────┘ └────┬──────────┘ └───────┬──────┘ │ Merkle sync (gossip, port 18183) │ └─────────────────────────────────────────┘ ``` --- ## Target Specifications | Metric | Value | |--------|-------| | **Assertions** | <100,000 | | **Queries/sec** | <1,000 | | **Concurrent users** | <500 | | **Availability** | 99.9% (survives 1 node failure) | | **RTO** | 5 minutes (automatic failover) | | **RPO** | 1 minute (replication lag) | | **Consistency** | Eventual (via CRDTs + Merkle sync) | --- ## Hardware Requirements (Per Node) ### Minimum (Pilot <50K assertions) - **CPU:** 4 vCPUs - **RAM:** 8GB - **Disk:** 100GB SSD (50GB WAL + 50GB DB) - **Network:** 1 Gbps, <5ms inter-node latency **Example instances (per node):** - AWS: `t3.large` (2 vCPU, 8GB) × 3 = $180/month - GCP: `n2-standard-2` (2 vCPU, 8GB) × 3 = $195/month - Azure: `Standard_D2s_v3` (2 vCPU, 8GB) × 3 = $140/month ### Recommended (Production <100K assertions) - **CPU:** 8 vCPUs - **RAM:** 16GB - **Disk:** 200GB SSD (100GB WAL + 100GB DB) - **Network:** 10 Gbps, <5ms inter-node latency **Example instances (per node):** - AWS: `t3.xlarge` (4 vCPU, 16GB) × 3 = $300/month - GCP: `n2-standard-4` (4 vCPU, 16GB) × 3 = $390/month - Azure: `Standard_D4s_v3` (4 vCPU, 16GB) × 3 = $280/month **See:** [Resource Sizing Guide](./resource-sizing.md) for detailed calculations. --- ## Architecture Components ### Node Layout Each pod runs two binaries via `scripts/entrypoint.sh`: - **stemedb-api** (port 18180) — HTTP API, WAL, KV store, ingestion, query - **stemedb-node** (port 18181) — Cluster Gateway HTTP (client-facing, routes by shard) - **stemedb-node** (port 18182) — gRPC (node-to-node sync) - **stemedb-node** (port 18183) — SWIM gossip (membership, failure detection) ### Replication **CRDT-based with Merkle sync:** - Writes accepted locally (optimistic) - Background Merkle tree comparison - Automatic sync of missing assertions - No distributed transactions **Replication factor 2:** - Each assertion stored on 2 nodes - Survives 1 node failure - Read from any node (eventually consistent) ### Load Balancing **Round-robin across all nodes:** - Nginx or Envoy distribute queries - No "primary" node (all equal) - Health checks remove failed nodes --- ## k3s Deployment (Current — StatefulSet + Longhorn) > This is the **current production deployment** on k3s-fleet. Deployed and running as of 2026-03-07. > The bare-metal steps below are for non-k8s environments and use a config.toml interface that is > not yet wired to the binary. A single 3-replica StatefulSet with `VolumeClaimTemplates` provides each pod its own 50Gi Longhorn PVC: ``` k3s-fleet/deployments/k8s/base/stemedb/ └── stemedb.yaml # Everything: ExternalSecret, headless Service, gateway Service, # 3-replica StatefulSet, Ingress ``` **Architecture per pod (dual-binary via entrypoint.sh):** - `stemedb-api` (:18180) — storage engine (WAL + KV) - `stemedb-node` (:18181) — Gateway HTTP (client-facing, cluster routing) - `stemedb-node` (:18182) — gRPC (node-to-node sync) - `stemedb-node` (:18183) — SWIM gossip (membership) **Pod DNS:** `stemedb-{0,1,2}.stemedb-headless.stemedb.svc.cluster.local` **Key k8s resources:** - **Headless Service** (`stemedb-headless`) — stable DNS for all 4 ports - **Gateway Service** (`stemedb-gateway`) — ClusterIP routing to Gateway port 18181 - **Ingress** — `stemedb.threesix.ai` → `stemedb-gateway:18181` via Traefik **Environment variables (set in StatefulSet):** - `STEMEDB_CLUSTER_MODE=true` — starts both binaries via `entrypoint.sh` - `STEMEDB_NODE_ID` — from `metadata.name` (stemedb-0, stemedb-1, stemedb-2) → stable NodeId via BLAKE3 - `STEMEDB_SEED_NODES` — all 3 headless DNS names on port 18182 - `STEMEDB_NUM_SHARDS=4`, `STEMEDB_REPLICATION_FACTOR=2` **CI/CD:** Woodpecker pipeline (`.woodpecker.yml`) → Kaniko builds → Zot registry (`registry.threesix.ai`) → `kubectl set image statefulset/stemedb` **See:** [k8s Deploy Roadmap](../deployment/k8s-deploy-roadmap.md) for the phased rollout plan. --- ## Bare-Metal Deployment Steps > ⚠️ The config.toml cluster configuration shown here is **planned** and not yet wired to the > `stemedb-api` binary. Current binary configuration uses environment variables only. This section > documents the intended interface for when cluster config is implemented. ### Prerequisites - [ ] 3 servers provisioned (same specs) - [ ] Private network with <5ms latency - [ ] DNS records created - [ ] TLS certificates provisioned ### Step 1: Install StemeDB on All Nodes ```bash # On each node (node1, node2, node3): sudo curl -L https://github.com/yourorg/stemedb/releases/download/v0.1.0/stemedb-api -o /usr/local/bin/stemedb-api sudo chmod +x /usr/local/bin/stemedb-api sudo mkdir -p /data/{wal,db} sudo useradd -r -s /bin/false stemedb sudo chown -R stemedb:stemedb /data ``` ### Step 2: Configure Cluster **Node 1:** ```toml # /etc/stemedb/config.toml [cluster] enabled = true node_id = "node1" bind_addr = "10.0.1.51:18181" rpc_addr = "10.0.1.51:18182" swim_addr = "10.0.1.51:18183" seeds = ["10.0.1.52:18183", "10.0.1.53:18183"] [replication] factor = 2 ``` **Node 2:** ```toml [cluster] enabled = true node_id = "node2" bind_addr = "10.0.1.52:18181" rpc_addr = "10.0.1.52:18182" swim_addr = "10.0.1.52:18183" seeds = ["10.0.1.51:18183", "10.0.1.53:18183"] [replication] factor = 2 ``` **Node 3:** ```toml [cluster] enabled = true node_id = "node3" bind_addr = "10.0.1.53:18181" rpc_addr = "10.0.1.53:18182" swim_addr = "10.0.1.53:18183" seeds = ["10.0.1.51:18183", "10.0.1.52:18183"] [replication] factor = 2 ``` ### Step 3: Start All Nodes ```bash # Start nodes sequentially (allows SWIM discovery) ssh node1 "sudo systemctl start stemedb-api" sleep 10 ssh node2 "sudo systemctl start stemedb-api" sleep 10 ssh node3 "sudo systemctl start stemedb-api" ``` ### Step 4: Verify Cluster Formation ```bash # Check membership (from any node) curl http://node1:18181/cluster/members | jq '.' # Expected output: # { # "members": [ # {"id": "node1", "status": "UP"}, # {"id": "node2", "status": "UP"}, # {"id": "node3", "status": "UP"} # ] # } ``` ### Step 5: Configure Load Balancer **See:** [Nginx Config](../../deployment/nginx/stemedb.conf) or [Envoy Config](../../deployment/envoy/stemedb.yaml) **Nginx upstream:** ```nginx upstream stemedb_cluster { server node1.example.com:18180; server node2.example.com:18180; server node3.example.com:18180; } ``` ### Step 6: Set Up Monitoring ```yaml # Prometheus scrape config scrape_configs: - job_name: 'stemedb-cluster' static_configs: - targets: - 'node1:18180' - 'node2:18180' - 'node3:18180' ``` **Estimated deployment time:** 4-8 hours (including load balancer, monitoring) --- ## Failure Scenarios & Recovery ### Single Node Failure **Impact:** No service disruption, automatic failover **Recovery:** 1. Load balancer detects failed node (health check) 2. Traffic routed to 2 remaining nodes 3. Replication factor maintained (assertions still on 2 nodes) 4. Replace failed node when convenient (see [Add Node Runbook](../../runbooks/add-node.md)) **RTO:** <1 minute (automatic) **Data loss:** None (replicated data preserved) ### Two Nodes Fail (Catastrophic) **Impact:** Single surviving node continues accepting assertion writes and serving reads. Admin operations (key management) are degraded — single-node has no peer to synchronously acknowledge. **Recovery:** 1. Manual intervention required to restore cluster 2. Restore failed nodes or add new nodes 3. Trigger Merkle sync (`/cluster/sync` endpoint) after nodes rejoin 4. Admin operations fully restored when cluster membership is repaired **RTO:** 30 minutes - 2 hours (manual restore) **Data loss:** Assertion writes continue on surviving node and merge on recovery. Recent admin operations (key revocations) issued during degraded window may not have propagated — audit after recovery. ### Network Partition **Impact:** - **Assertion writes:** Both partitions accept writes independently. This is safe — same content → same BLAKE3 hash, different content → different hashes that merge cleanly after partition heals. - **Admin operations (API key revocations, quota changes):** A revocation issued to one partition is invisible to the other until partition heals. A revoked key may still be honored by nodes in the other partition during the partition window. **Recovery:** - Merkle anti-entropy detects and fills gaps automatically when partition heals - Lenses (Recency, Authority) handle any assertion-level divergence at read time - Admin state re-synchronizes via coordinator broadcast on reconnect **Data loss:** None for assertions (all writes from both partitions preserved and merged). ### Replication Lag **Impact:** Queries may see stale data (<1 minute old) **Recovery:** - Automatic catch-up via Merkle sync - If lag >5 minutes, see [High Latency Runbook](../../runbooks/high-query-latency.md) --- ## Performance Characteristics ### Query Latency **Target:** p99 <200ms at <1K queries/sec | Metric | Single-Node | Three-Node | |--------|-------------|------------| | **p50** | 20ms | 25ms | | **p95** | 50ms | 75ms | | **p99** | 100ms | 150ms | *3-node has slightly higher latency due to network hops, but 3x query capacity* ### Write Throughput **Target:** 1,000 assertions/sec sustained - Each node accepts assertion writes (routed by cluster gateway via shard prefix) - Replication happens asynchronously via Merkle gossip - No coordination required for assertions (CRDT-safe by content hash) - Admin writes (API keys, quota changes) route to coordinator and require synchronous acknowledgment — expect higher latency on those operations (~50ms vs ~5ms for assertions) ### Replication Lag **Target:** <1 second typical, <5 seconds max Measured by: `replication_lag_seconds` metric --- ## Network Requirements **See:** [Network Requirements](./network-requirements.md) for full details. ### Ports (Per Node) | Port | Protocol | Purpose | Firewall Rule | |------|----------|---------|---------------| | **18180** | TCP/HTTP | API (clients → nodes) | Allow from load balancer | | **18181** | TCP/HTTP | Cluster gateway (admin only) | Allow from internal network | | **18182** | TCP/gRPC | Replication (node ↔ node) | Allow within cluster | | **18183** | UDP | SWIM gossip (node ↔ node) | Allow within cluster | ### Latency Requirement **<5ms inter-node latency required** - Deploy nodes in same region/AZ - Private network (10 Gbps recommended) - Test with: `ping -c 100 node2` (should show avg <5ms) ### Bandwidth - **Replication:** ~1 Mbps per 100 assertions/sec - **Queries:** ~10 Mbps at 1K queries/sec - **Recommended:** 1 Gbps minimum, 10 Gbps for production --- ## Monitoring & Alerts ### Critical Metrics ```yaml # Prometheus alerts - alert: StemeDBNodeDown expr: up{job="stemedb-cluster"} == 0 for: 1m - alert: StemeDBReplicationLag expr: replication_lag_seconds > 5 for: 5m - alert: StemeDBQuorumLost expr: count(up{job="stemedb-cluster"} == 1) < 2 for: 1m ``` ### Grafana Dashboard Panels 1. **Cluster Health:** Node count, status, replication lag 2. **Query Latency:** p50, p95, p99 across all nodes 3. **Ingest Rate:** Assertions/sec per node 4. **Disk Usage:** WAL + DB per node 5. **Network:** Replication bandwidth --- ## Cost Estimate (AWS, us-east-1) | Resource | Cost | |----------|------| | **Compute** (3× t3.xlarge) | $300/month | | **Storage** (3× 200GB SSD) | $60/month | | **Load Balancer** (ALB) | $25/month | | **Data Transfer** (internal) | $10/month | | **Backups** (S3) | $30/month | | **Total** | **~$425/month** | Compare to single-node ($87/month): 5x cost for 10x availability --- ## Migration from Single-Node **See:** [Add Node Runbook](../../runbooks/add-node.md#1-bootstrap-3-node-cluster) for detailed procedure. **Summary:** 1. Provision 2 new nodes 2. Configure cluster on all 3 3. Restart single-node with cluster config 4. Trigger Merkle sync 5. Update load balancer **Downtime:** 5-15 minutes for replication --- ## Scaling Path Beyond Three Nodes Three nodes on k3s handles the 100-project target. For mass traffic beyond that, the scaling path is incremental — not a rearchitecture: | Phase | Target | Work type | What changes | |-------|--------|-----------|-------------| | **Phase 1** | 1 node, 100 projects | ✅ Done | Single Deployment, Longhorn PVC, auth wired | | **Phase 2** | 3 nodes, cluster mode | ✅ Done | 3-replica StatefulSet, dual-binary, SWIM, Gateway routing, Woodpecker CI | | **Phase 3** | 3 nodes, full replication | Code-heavy | SWIM inter-node connectivity, Gateway HTTP forwarding, Merkle anti-entropy | | **Phase 4** | N nodes, coordinator | Code-heavy | Designate one node (or small 3-node Raft group) exclusively for mutable admin state; assertion nodes are pure data | **What you do NOT need:** Raft on the assertion write path. The append-only, content-addressed design means there are no write conflicts to serialize. Raft belongs only on the mutable admin state path (Phase 4), which is a small fraction of total traffic. --- ## Related Documentation - [Single-Node Pilot](./single-node-pilot.md) - Simpler architecture - [Network Requirements](./network-requirements.md) - Firewall rules - [Resource Sizing](./resource-sizing.md) - Hardware calculations - [Add Node Runbook](../../runbooks/add-node.md) - Cluster operations - [High Query Latency Runbook](../../runbooks/high-query-latency.md) - Performance troubleshooting --- **Last Updated:** 2026-03-07 — Updated k3s section to match actual 3-replica StatefulSet deployment, dual-binary architecture, Woodpecker CI pipeline