# Three-Node Cluster Architecture Diagram

## High-Level Topology

```
┌──────────────────────────────────────────────────────────────────────┐
│                             Client Layer                             │
│   ┌──────────────┐    ┌──────────────┐    ┌──────────────┐           │
│   │    Agents    │    │  Dashboard   │    │  CLI Tools   │           │
│   └──────┬───────┘    └──────┬───────┘    └──────┬───────┘           │
│          │                   │                   │                   │
│          └───────────────────┴───────────────────┘                   │
│                              │                                       │
│                              │ HTTPS (443)                           │
│                              ▼                                       │
└──────────────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────────────┐
│                         Load Balancer Layer                          │
│ ┌──────────────────────────────────────────────────────────────────┐ │
│ │                     Nginx / Envoy / AWS ALB                      │ │
│ │  • Round-robin distribution                                      │ │
│ │  • Health checks (5s interval)                                   │ │
│ │  • TLS termination                                               │ │
│ │  • Removes failed nodes automatically                            │ │
│ └────────────┬──────────────┬──────────────┬───────────────────────┘ │
│              │              │              │  HTTP (18180)           │
│              ▼              ▼              ▼                         │
└──────────────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────────────┐
│                        StemeDB Cluster Nodes                         │
│                                                                      │
│   ┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐  │
│   │     Node 1      │    │     Node 2      │    │     Node 3      │  │
│   │    10.0.1.51    │    │    10.0.1.52    │    │    10.0.1.53    │  │
│   │                 │    │                 │    │                 │  │
│   │  stemedb-api    │    │  stemedb-api    │    │  stemedb-api    │  │
│   │  :18180 (API)   │    │  :18180 (API)   │    │  :18180 (API)   │  │
│   │  :18181 (Gate)  │    │  :18181 (Gate)  │    │  :18181 (Gate)  │  │
│   │  :18182 (RPC)   │    │  :18182 (RPC)   │    │  :18182 (RPC)   │  │
│   │  :18183 (SWIM)  │    │  :18183 (SWIM)  │    │  :18183 (SWIM)  │  │
│   │                 │    │                 │    │                 │  │
│   │  /data/wal/     │    │  /data/wal/     │    │  /data/wal/     │  │
│   │  /data/db/      │    │  /data/db/      │    │  /data/db/      │  │
│   └────────┬────────┘    └────────┬────────┘    └────────┬────────┘  │
│            │                      │                      │           │
│            └──────────────────────┴──────────────────────┘           │
│                                   │                                  │
│                    SWIM Gossip + gRPC Replication                    │
│                       (UDP 18183 + TCP 18182)                        │
│                        Replication Factor: 2                         │
└──────────────────────────────────────────────────────────────────────┘
```

## Inter-Node Communication

```
Node 1 ◀──────────────────────────────────────────────────▶ Node 2
   │                                                           │
   │  SWIM Gossip (UDP 18183)                                  │
   │    • Membership: "Node 2 is UP"                           │
   │    • Failure detection: ping/ack                          │
   │    • Frequency: every 1 second                            │
   │                                                           │
   │  gRPC Replication (TCP 18182)                             │
   │    • Push assertions: "Assert X written to Node 1"        │
   │    • Pull sync: Merkle tree comparison                    │
   │    • Frequency: continuous                                │
   │                                                           │
   ▼                                                           ▼
   ◀───────────────────────────────────────────────────────────▶ Node 3
                  (Same protocol with Node 1 & 2)
```

## Write Path (Replication Factor 2)

```
Client submits assertion
                │
                ▼
Load Balancer (routes to Node 1)
                │
                ▼
┌───────────────────────────────────────┐
│  Node 1 (Coordinator)                 │
│                                       │
│  1. Validate assertion                │
│  2. Write to local WAL (fsync)        │
│  3. Return 201 Created to client      │
│  4. Async replicate to Node 2         │
│     (background, non-blocking)        │
└───────────────┬───────────────────────┘
                │
                │ gRPC (async)
                ▼
    ┌───────────────────────┐
    │  Node 2 (Replica)     │
    │  1. Receive assertion │
    │  2. Write to WAL      │
    │  3. ACK to Node 1     │
    └───────────────────────┘

(Node 3 may also receive a replica, depending on
 hash-based shard assignment)
```

## Read Path (Eventually Consistent)

```
Client queries concept_path: "drug/aspirin/safety"
                │
                ▼
Load Balancer (routes to any node, e.g., Node 2)
                │
                ▼
┌───────────────────────────────────────┐
│  Node 2 (Query Handler)               │
│                                       │
│  1. Check local KV store              │
│  2. Apply lens (RecencyLens)          │
│  3. Resolve conflicts (CRDTs)         │
│  4. Return result to client           │
│                                       │
│  No coordination with other nodes!    │
└───────────────────────────────────────┘
                │
                ▼
Client receives result
(may be slightly stale under replication lag)
```

## Failure Scenario: Node 2 Down

```
Initial State (All UP):
┌────────┐  ┌────────┐  ┌────────┐
│ Node 1 │  │ Node 2 │  │ Node 3 │
│   UP   │  │   UP   │  │   UP   │
└───┬────┘  └───┬────┘  └───┬────┘
    │           │           │
    └───────────┴───────────┘
        SWIM: All healthy

Node 2 Failure:
┌────────┐  ┌────────┐  ┌────────┐
│ Node 1 │  │ Node 2 │  │ Node 3 │
│   UP   │  │ ❌ DOWN│  │   UP   │
└───┬────┘  └────────┘  └───┬────┘
    │                       │
    └───────────────────────┘
    SWIM: Node 2 detected as DOWN
    Load Balancer: Health check fails, routes to Node 1 & 3
    Replication: Factor 2 maintained (data on Node 1 & 3)

Recovery (Automatic):
┌────────┐              ┌────────┐
│ Node 1 │              │ Node 3 │
│   UP   │──────────────│   UP   │
└────────┘              └────────┘
    Cluster continues operating
    No data loss (replicated)
    No manual intervention

    RTO: <1 minute (automatic)
    RPO: 0 (no loss of replicated data)
```

## Merkle Sync (Convergence)

```
     Node 1                          Node 2
┌──────────────┐                ┌──────────────┐
│ Merkle Tree  │                │ Merkle Tree  │
│ Root: ABC123 │◀───────────────│ Root: DEF456 │
│              │ Compare roots  │              │
│ /drug/       │   (differ!)    │ /drug/       │
│ /treatment/  │───────────────▶│ /treatment/  │
└──────────────┘                └──────────────┘
       │                               │
       │   Descend tree, find diffs    │
       ▼                               ▼
Node 1 has:                     Node 2 has:
- Assert X (missing on Node 2)  - Assert Y (missing on Node 1)
- Assert Z (both have)          - Assert Z (both have)
       │                               │
       ▼                               ▼
          Exchange missing assertions
       │                               │
       ▼                               ▼
          Both nodes now have: X, Y, Z
          Root hash: GHI789 (same!)
          Convergence achieved.
```

## Cluster Health Monitoring

```
┌─────────────────────────────────────────────────┐
│                   Prometheus                    │
│          Scrapes all 3 nodes every 15s          │
│                                                 │
│  Metrics:                                       │
│  - up{node="node1"} = 1                         │
│  - up{node="node2"} = 1                         │
│  - up{node="node3"} = 1                         │
│  - replication_lag_seconds{node="node2"} = 0.5  │
│  - stemedb_query_latency_seconds{node="node1"}  │
└─────────────────┬───────────────────────────────┘
                  │
                  ▼
         ┌─────────────────┐
         │     Grafana     │
         │    Dashboard    │
         │  • Cluster map  │
         │  • Latency p99  │
         │  • Repl lag     │
         └─────────────────┘
```

## Key Characteristics

- **High Availability:** Survives 1 node failure (99.9% uptime)
- **Replication:** Factor 2 (each assertion on 2 nodes)
- **Consistency:** Eventual (CRDTs + Merkle sync)
- **Recovery:** Automatic (<5 minute RTO)
- **Capacity:** <100K assertions, <1K queries/sec
- **Cost:** ~$425/month (AWS t3.xlarge × 3)
- **Use Case:** Production deployments, enterprise pilots

✅ RECOMMENDED FOR PRODUCTION
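
The Merkle sync step (compare roots, descend into the subtrees that differ, exchange the missing assertions) can be illustrated with a small Python sketch. This is a toy model under stated assumptions, not StemeDB's implementation: the `MerkleStore` class, the `merkle_sync` function, the two-level tree keyed by the first concept-path segment, the SHA-256 hashing, and the sample paths other than `drug/aspirin/safety` are all hypothetical.

```python
import hashlib
from collections import defaultdict


def _h(data: str) -> str:
    """SHA-256 hex digest of a string (hash choice is an assumption)."""
    return hashlib.sha256(data.encode("utf-8")).hexdigest()


class MerkleStore:
    """Toy assertion store with a two-level Merkle tree: root -> path-prefix buckets."""

    def __init__(self, assertions=None):
        # Maps assertion_id -> concept_path (hypothetical storage layout).
        self.assertions = dict(assertions or {})

    def bucket_hashes(self):
        """Hash each top-level prefix bucket over its sorted assertion IDs."""
        grouped = defaultdict(set)
        for assertion_id, path in self.assertions.items():
            grouped[path.split("/")[0]].add(assertion_id)
        return {p: _h("|".join(sorted(ids))) for p, ids in grouped.items()}

    def root(self):
        """Root hash over the sorted (prefix, bucket hash) pairs."""
        hashes = self.bucket_hashes()
        return _h("|".join(f"{p}:{hashes[p]}" for p in sorted(hashes)))


def merkle_sync(a: MerkleStore, b: MerkleStore) -> int:
    """Converge two stores; return how many assertions were exchanged."""
    if a.root() == b.root():
        return 0  # Roots match: nothing to do.
    ha, hb = a.bucket_hashes(), b.bucket_hashes()
    # Descend: only buckets whose hashes differ need to be inspected.
    differing = {p for p in set(ha) | set(hb) if ha.get(p) != hb.get(p)}
    exchanged = 0
    for src, dst in ((a, b), (b, a)):
        for assertion_id, path in list(src.assertions.items()):
            if path.split("/")[0] in differing and assertion_id not in dst.assertions:
                dst.assertions[assertion_id] = path
                exchanged += 1
    return exchanged


# Mirrors the diagram: Node 1 holds X and Z, Node 2 holds Y and Z.
node1 = MerkleStore({"X": "drug/aspirin/safety", "Z": "treatment/headache"})
node2 = MerkleStore({"Y": "drug/ibuprofen/safety", "Z": "treatment/headache"})
exchanged = merkle_sync(node1, node2)
print(exchanged, sorted(node1.assertions), node1.root() == node2.root())
```

In this sketch only the `drug` bucket differs, so the `treatment` bucket is never scanned; limiting comparison work to the differing subtrees is the main reason to compare hashes before exchanging data.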