stemedb/docs/operations/reference-architecture/three-node-cluster.md
jordan 6c6ee04e9c


Three-Node Cluster Architecture

Target: Production deployments, enterprise pilots, high-availability requirements

RECOMMENDED FOR PRODUCTION - Survives single node failure, automatic replication

Implementation status: The cluster crates (stemedb-cluster, stemedb-sync, stemedb-rpc) are implemented. The k3s/Longhorn deployment path is the current production path (see Phase 2 section). Bare-metal deployment via config.toml is aspirational and not yet wired to the binary.


Architectural Rationale: Why Gossip, Not Raft

This section exists because the wrong answer here — "just add Raft" — is commonly assumed and actively harmful for StemeDB's workload.

The append-only insight

Most databases need Raft because writes can conflict: two nodes update the same row, and a leader must serialize them. StemeDB doesn't have this problem. Every assertion receives a BLAKE3 content hash as its ID. If two nodes both write the same assertion independently, they produce the same hash → the same key → identical data. There is nothing to conflict on.

This means the assertion write path is naturally CRDT-like: the system needs every node to eventually receive every assertion, but doesn't need consensus on which assertion "wins." Gossip + Merkle anti-entropy handles this correctly and efficiently. Raft would add leader-election overhead and write latency for zero benefit on the data path.
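
A minimal sketch of why content-addressed writes cannot conflict. It assumes an assertion is serialized canonically before hashing, and it uses `hashlib.blake2b` from the Python standard library purely as a stand-in for BLAKE3, which the standard library does not ship. The names `assertion_id` and `Node` are hypothetical and are not taken from the StemeDB codebase.

```python
import hashlib
import json


def assertion_id(assertion: dict) -> str:
    """Derive a content-hash ID from a canonical serialization.

    Assumption: blake2b stands in for BLAKE3 here.
    """
    canonical = json.dumps(assertion, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.blake2b(canonical, digest_size=32).hexdigest()


class Node:
    """Hypothetical node holding an append-only map of ID -> assertion."""

    def __init__(self) -> None:
        self.store: dict[str, dict] = {}

    def write(self, assertion: dict) -> str:
        key = assertion_id(assertion)
        # Writing the same content twice is a no-op: same key, same value.
        self.store.setdefault(key, assertion)
        return key

    def merge_from(self, other: "Node") -> None:
        # Plain union of keys. No winner is chosen, because (barring hash
        # collisions) equal keys imply equal content.
        for key, value in other.store.items():
            self.store.setdefault(key, value)


node_a, node_b = Node(), Node()
fact = {"subject": "sky", "predicate": "color", "object": "blue"}
id_a = node_a.write(fact)
# Same content with a different key order still hashes to the same ID.
id_b = node_b.write(dict(reversed(list(fact.items()))))
node_b.write({"subject": "grass", "predicate": "color", "object": "green"})
node_a.merge_from(node_b)
```

After the merge, `node_a` holds two assertions and `id_a == id_b`, which is the property that makes consensus unnecessary on this path.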

What actually needs coordination

Not everything in StemeDB is append-only. Mutable state requires stronger guarantees:

| State | Type | Replication strategy | Why |
| --- | --- | --- | --- |
| Assertions | Append-only (CRDT) | Gossip + Merkle anti-entropy | No conflicts possible by design |
| API keys | Mutable | Synchronous broadcast or coordinator | Revoked key must not be reusable |
| Quota / meter counts | Mutable counter | Coordinator node or bounded staleness | Double-spend if two nodes both allow |
| Circuit breaker state | Mutable | Synchronous broadcast | Trip/reset must propagate atomically |
| Epochs | Append-only, ordered | Gossip is sufficient | Creation order captured in content |

Practical implication: Admin operations (key management, quota changes) should be routed to a designated coordinator and synchronously acknowledged. Assertion writes and reads can go to any node with no coordination.
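
The routing rule above can be sketched as a lookup from state type to replication path. This is illustrative only: the state-type keys and the `route` function are hypothetical names, and sending unknown types to the coordinator is an assumed fail-closed default rather than documented behavior.

```python
from enum import Enum


class Strategy(Enum):
    GOSSIP = "any node; gossip + Merkle anti-entropy"
    COORDINATOR = "coordinator; synchronous acknowledgment"
    SYNC_BROADCAST = "synchronous broadcast to all nodes"


# State types follow the table above; the string keys are hypothetical.
ROUTING = {
    "assertion": Strategy.GOSSIP,
    "epoch": Strategy.GOSSIP,
    "api_key": Strategy.COORDINATOR,
    "quota": Strategy.COORDINATOR,
    "circuit_breaker": Strategy.SYNC_BROADCAST,
}


def route(state_type: str) -> Strategy:
    """Pick a replication path for a write.

    Assumption: unknown state types fail closed to the coordinator.
    """
    return ROUTING.get(state_type, Strategy.COORDINATOR)
```

For example, `route("assertion")` returns `Strategy.GOSSIP`, while `route("api_key")` returns `Strategy.COORDINATOR`.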

Read scaling

Because Lens resolution is pure local computation on indexed data, any node can serve any read with no coordination. Reads scale horizontally to N nodes without inter-node communication.

Write scaling (assertions)

Because assertion writes are idempotent by content hash, any node can accept any write. The cluster gateway (stemedb-cluster, port 18181) routes writes by subject hash shard prefix — each node owns a partition of the key space. Merkle anti-entropy ensures all nodes converge. Write throughput scales linearly with node count.


Overview

The three-node cluster provides high availability through automatic replication (factor 2) and gossip-based eventual consistency for assertions. It survives a single node failure with a recovery time under 5 minutes.

                    ┌──────────────────────────────┐
Internet ──→  LB → │  Cluster Gateway (port 18181) │
                    │  Reads: round-robin any node  │
                    │  Writes: route by shard prefix│
                    └──────┬────────────┬───────────┘
                           │            │
                ┌──────────▼──┐    ┌────▼─────────┐    ┌──────────────┐
                │   Node A    │    │   Node B      │    │   Node C     │
                │ Shard 0-84  │    │ Shard 85-169  │    │ Shard 170-255│
                │ WAL + KV    │    │ WAL + KV      │    │ WAL + KV     │
                └──────┬──────┘    └────┬──────────┘    └───────┬──────┘
                       │     Merkle sync (gossip, port 18183)    │
                       └─────────────────────────────────────────┘
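
As a rough illustration of the write routing in the diagram, the sketch below maps a subject to a node by the first byte of its hash. Two things are assumptions: that the "shard prefix" is the first hash byte, and the use of `hashlib.blake2b` as a stand-in for BLAKE3. The node labels are placeholders; the ranges are copied from the diagram.

```python
import hashlib

# Shard ranges copied from the diagram; node names are labels only.
SHARD_RANGES = [
    ("node-a", range(0, 85)),     # prefixes 0-84
    ("node-b", range(85, 170)),   # prefixes 85-169
    ("node-c", range(170, 256)),  # prefixes 170-255
]


def shard_prefix(subject: str) -> int:
    """First byte of the subject hash (assumption: blake2b stands in for BLAKE3)."""
    return hashlib.blake2b(subject.encode(), digest_size=32).digest()[0]


def owner_for_prefix(prefix: int) -> str:
    """Return the node whose range contains the given prefix byte."""
    for node, prefixes in SHARD_RANGES:
        if prefix in prefixes:
            return node
    raise ValueError(f"prefix out of range: {prefix}")


def route_write(subject: str) -> str:
    """Route a write to the owner of the subject's shard prefix."""
    return owner_for_prefix(shard_prefix(subject))
```

Because the mapping depends only on the subject, every gateway instance makes the same routing decision without talking to the others.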

Target Specifications

| Metric | Value |
| --- | --- |
| Assertions | <100,000 |
| Queries/sec | <1,000 |
| Concurrent users | <500 |
| Availability | 99.9% (survives 1 node failure) |
| RTO | 5 minutes (automatic failover) |
| RPO | 1 minute (replication lag) |
| Consistency | Eventual (via CRDTs + Merkle sync) |

Hardware Requirements (Per Node)

Minimum (Pilot <50K assertions)

  • CPU: 4 vCPUs
  • RAM: 8GB
  • Disk: 100GB SSD (50GB WAL + 50GB DB)
  • Network: 1 Gbps, <5ms inter-node latency

Example instances (per node):

  • AWS: t3.large (2 vCPU, 8GB) × 3 = $180/month
  • GCP: n2-standard-2 (2 vCPU, 8GB) × 3 = $195/month
  • Azure: Standard_D2s_v3 (2 vCPU, 8GB) × 3 = $140/month

Recommended (Production)

  • CPU: 8 vCPUs
  • RAM: 16GB
  • Disk: 200GB SSD (100GB WAL + 100GB DB)
  • Network: 10 Gbps, <5ms inter-node latency

Example instances (per node):

  • AWS: t3.xlarge (4 vCPU, 16GB) × 3 = $300/month
  • GCP: n2-standard-4 (4 vCPU, 16GB) × 3 = $390/month
  • Azure: Standard_D4s_v3 (4 vCPU, 16GB) × 3 = $280/month

See: Resource Sizing Guide for detailed calculations.


Architecture Components

Node Layout

Each pod runs two binaries via scripts/entrypoint.sh:

  • stemedb-api (port 18180) — HTTP API, WAL, KV store, ingestion, query
  • stemedb-node (port 18181) — Cluster Gateway HTTP (client-facing, routes by shard)
  • stemedb-node (port 18182) — gRPC (node-to-node sync)
  • stemedb-node (port 18183) — SWIM gossip (membership, failure detection)

Replication

CRDT-based with Merkle sync:

  • Writes accepted locally (optimistic)
  • Background Merkle tree comparison
  • Automatic sync of missing assertions
  • No distributed transactions

Replication factor 2:

  • Each assertion stored on 2 nodes
  • Survives 1 node failure
  • Read from any node (eventually consistent)
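
The background Merkle comparison can be pictured with a two-level sketch: one digest per bucket of assertion IDs and a root digest over the buckets. This is a simplified illustration, not the `stemedb-sync` implementation. The bucketing by first character, the function names, and the blake2b stand-in are all assumptions, and both ID sets sit in memory here, whereas a real exchange would send digests over the wire first.

```python
import hashlib


def digest(parts: list[str]) -> str:
    """Order-independent digest of a list of strings."""
    h = hashlib.blake2b(digest_size=16)
    for part in sorted(parts):
        h.update(part.encode())
        h.update(b"\x00")  # separator so ("ab", "c") differs from ("a", "bc")
    return h.hexdigest()


def bucket_digests(ids: set[str]) -> dict[str, str]:
    """Leaf level: one digest per bucket, keyed by the first character of the ID."""
    buckets: dict[str, list[str]] = {}
    for assertion_id in ids:
        buckets.setdefault(assertion_id[0], []).append(assertion_id)
    return {prefix: digest(members) for prefix, members in buckets.items()}


def root_digest(ids: set[str]) -> str:
    """Root level: digest over all (bucket, digest) pairs."""
    leaves = bucket_digests(ids)
    return digest([f"{prefix}:{value}" for prefix, value in leaves.items()])


def missing_from_local(local: set[str], remote: set[str]) -> set[str]:
    """IDs the local node should fetch, descending only into differing buckets."""
    if root_digest(local) == root_digest(remote):
        return set()  # Roots match: nothing to transfer.
    local_leaves = bucket_digests(local)
    wanted: set[str] = set()
    for prefix, remote_value in bucket_digests(remote).items():
        if local_leaves.get(prefix) != remote_value:
            wanted |= {i for i in remote if i[0] == prefix} - local
    return wanted
```

With `local = {"a1", "b2"}` and `remote = {"a1", "b2", "b3", "c4"}`, only the `b` and `c` buckets differ, so the local node fetches `b3` and `c4` and never inspects the `a` bucket.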

Load Balancing

Round-robin across all nodes:

  • Nginx or Envoy distribute queries
  • No "primary" node (all equal)
  • Health checks remove failed nodes
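
The behavior described above (round-robin, no primary, failed nodes removed by health checks) can be sketched in a few lines. This is a toy model for reasoning about failover, not a replacement for Nginx or Envoy; the class and method names are hypothetical.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Round-robin over nodes, skipping any currently marked unhealthy."""

    def __init__(self, nodes: list[str]) -> None:
        self.nodes = list(nodes)
        self.healthy = set(nodes)
        self._ring = cycle(self.nodes)

    def mark(self, node: str, healthy: bool) -> None:
        """Record the result of a health check for one node."""
        if healthy:
            self.healthy.add(node)
        else:
            self.healthy.discard(node)

    def next_node(self) -> str:
        # Try at most one full lap around the ring before giving up.
        for _ in range(len(self.nodes)):
            candidate = next(self._ring)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy nodes available")
```

Marking one node unhealthy simply removes it from rotation; the remaining two keep alternating, which matches the single-node-failure behavior this document targets.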

k3s Deployment (Current — StatefulSet + Longhorn)

This is the current production deployment on k3s-fleet. Deployed and running as of 2026-03-07. The bare-metal steps below are for non-k8s environments and use a config.toml interface that is not yet wired to the binary.

A single 3-replica StatefulSet with VolumeClaimTemplates provides each pod its own 50Gi Longhorn PVC:

k3s-fleet/deployments/k8s/base/stemedb/
└── stemedb.yaml          # Everything: ExternalSecret, headless Service, gateway Service,
                          #   3-replica StatefulSet, Ingress

Architecture per pod (dual-binary via entrypoint.sh):

  • stemedb-api (:18180) — storage engine (WAL + KV)
  • stemedb-node (:18181) — Gateway HTTP (client-facing, cluster routing)
  • stemedb-node (:18182) — gRPC (node-to-node sync)
  • stemedb-node (:18183) — SWIM gossip (membership)

Pod DNS: stemedb-{0,1,2}.stemedb-headless.stemedb.svc.cluster.local

Key k8s resources:

  • Headless Service (stemedb-headless) — stable DNS for all 4 ports
  • Gateway Service (stemedb-gateway) — ClusterIP routing to Gateway port 18181
  • Ingress (stemedb.threesix.ai) → stemedb-gateway:18181 via Traefik

Environment variables (set in StatefulSet):

  • STEMEDB_CLUSTER_MODE=true — starts both binaries via entrypoint.sh
  • STEMEDB_NODE_ID — from metadata.name (stemedb-0, stemedb-1, stemedb-2) → stable NodeId via BLAKE3
  • STEMEDB_SEED_NODES — all 3 headless DNS names on port 18182
  • STEMEDB_NUM_SHARDS=4, STEMEDB_REPLICATION_FACTOR=2
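
For illustration, the seed list can be derived from the Pod DNS pattern and the gRPC port given above. The comma separator is an assumption, since this document does not state the exact format the binary expects for `STEMEDB_SEED_NODES`; the helper names are hypothetical.

```python
HEADLESS_SERVICE = "stemedb-headless.stemedb.svc.cluster.local"
GRPC_PORT = 18182


def pod_dns(ordinal: int) -> str:
    """Stable DNS name of a StatefulSet pod, following the Pod DNS pattern above."""
    return f"stemedb-{ordinal}.{HEADLESS_SERVICE}"


def seed_nodes(replicas: int = 3, separator: str = ",") -> str:
    """Build a STEMEDB_SEED_NODES value.

    Assumption: entries are host:port pairs joined by a comma.
    """
    return separator.join(f"{pod_dns(i)}:{GRPC_PORT}" for i in range(replicas))
```

Calling `seed_nodes()` yields three `host:18182` entries, one per replica, starting with `stemedb-0.stemedb-headless.stemedb.svc.cluster.local:18182`.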

CI/CD: Woodpecker pipeline (.woodpecker.yml) → Kaniko builds → Zot registry (registry.threesix.ai) → kubectl set image statefulset/stemedb

See: k8s Deploy Roadmap for the phased rollout plan.


Bare-Metal Deployment Steps

⚠️ The config.toml cluster configuration shown here is planned and not yet wired to the stemedb-api binary. Current binary configuration uses environment variables only. This section documents the intended interface for when cluster config is implemented.

Prerequisites

  • 3 servers provisioned (same specs)
  • Private network with <5ms latency
  • DNS records created
  • TLS certificates provisioned

Step 1: Install StemeDB on All Nodes

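# On each node (node1, node2, node3):
# Download the binary; -f makes curl fail on an HTTP error instead of saving the error page.
sudo curl -fL https://github.com/yourorg/stemedb/releases/download/v0.1.0/stemedb-api -o /usr/local/bin/stemedb-api
sudo chmod +x /usr/local/bin/stemedb-api

# Create the WAL and DB directories, plus a dedicated system user with no login shell.
sudo mkdir -p /data/{wal,db}
sudo useradd -r -s /bin/false stemedb
sudo chown -R stemedb:stemedb /data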

Step 2: Configure Cluster

Node 1:

# /etc/stemedb/config.toml
[cluster]
enabled = true
node_id = "node1"
bind_addr = "10.0.1.51:18181"
rpc_addr = "10.0.1.51:18182"
swim_addr = "10.0.1.51:18183"
seeds = ["10.0.1.52:18183", "10.0.1.53:18183"]

[replication]
factor = 2

Node 2:

[cluster]
enabled = true
node_id = "node2"
bind_addr = "10.0.1.52:18181"
rpc_addr = "10.0.1.52:18182"
swim_addr = "10.0.1.52:18183"
seeds = ["10.0.1.51:18183", "10.0.1.53:18183"]

[replication]
factor = 2

Node 3:

[cluster]
enabled = true
node_id = "node3"
bind_addr = "10.0.1.53:18181"
rpc_addr = "10.0.1.53:18182"
swim_addr = "10.0.1.53:18183"
seeds = ["10.0.1.51:18183", "10.0.1.52:18183"]

[replication]
factor = 2
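
The three files differ only in `node_id`, the node's own IP, and the seed list (every node except itself), so they can be generated from one template. A small sketch, assuming the planned `config.toml` layout shown above; `render_config` is a hypothetical helper.

```python
NODES = {"node1": "10.0.1.51", "node2": "10.0.1.52", "node3": "10.0.1.53"}


def render_config(node_id: str, nodes: dict[str, str] = NODES) -> str:
    """Render the planned config.toml for one node; seeds are all other nodes' SWIM addresses."""
    ip = nodes[node_id]
    seeds = ", ".join(
        f'"{peer_ip}:18183"' for peer, peer_ip in nodes.items() if peer != node_id
    )
    return (
        "[cluster]\n"
        "enabled = true\n"
        f'node_id = "{node_id}"\n'
        f'bind_addr = "{ip}:18181"\n'
        f'rpc_addr = "{ip}:18182"\n'
        f'swim_addr = "{ip}:18183"\n'
        f"seeds = [{seeds}]\n"
        "\n"
        "[replication]\n"
        "factor = 2\n"
    )
```

Generating the files this way avoids the most likely copy-paste error in this step: a node listing itself as a seed or omitting a peer.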

Step 3: Start All Nodes

# Start nodes sequentially (allows SWIM discovery)
ssh node1 "sudo systemctl start stemedb-api"
sleep 10

ssh node2 "sudo systemctl start stemedb-api"
sleep 10

ssh node3 "sudo systemctl start stemedb-api"

Step 4: Verify Cluster Formation

# Check membership (from any node)
curl http://node1:18181/cluster/members | jq '.'

# Expected output:
# {
#   "members": [
#     {"id": "node1", "status": "UP"},
#     {"id": "node2", "status": "UP"},
#     {"id": "node3", "status": "UP"}
#   ]
# }
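
The same check can be scripted for automation. The sketch below assumes the response has exactly the shape of the expected output above (a `members` list whose entries carry `id` and `status`); the endpoint belongs to the planned interface, so treat the shape as provisional. `cluster_is_healthy` is a hypothetical helper.

```python
import json


def cluster_is_healthy(payload: str, expected_nodes: int = 3) -> bool:
    """True when the payload lists the expected number of members and all report UP."""
    members = json.loads(payload).get("members", [])
    return len(members) == expected_nodes and all(
        member.get("status") == "UP" for member in members
    )


sample = json.dumps(
    {
        "members": [
            {"id": "node1", "status": "UP"},
            {"id": "node2", "status": "UP"},
            {"id": "node3", "status": "UP"},
        ]
    }
)
healthy = cluster_is_healthy(sample)
```

In a deployment script, the payload would come from the `curl` call above, with a non-zero exit code when the function returns `False`.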

Step 5: Configure Load Balancer

See: Nginx Config or Envoy Config

Nginx upstream:

upstream stemedb_cluster {
    server node1.example.com:18180;
    server node2.example.com:18180;
    server node3.example.com:18180;
}

Step 6: Set Up Monitoring

# Prometheus scrape config
scrape_configs:
  - job_name: 'stemedb-cluster'
    static_configs:
      - targets:
        - 'node1:18180'
        - 'node2:18180'
        - 'node3:18180'

Estimated deployment time: 4-8 hours (including load balancer, monitoring)


Failure Scenarios & Recovery

Single Node Failure

Impact: No service disruption, automatic failover

Recovery:

  1. Load balancer detects failed node (health check)
  2. Traffic routed to 2 remaining nodes
  3. Every assertion stays available: with replication factor 2, at least one copy of each assertion is on a surviving node
  4. Replace failed node when convenient (see Add Node Runbook)

RTO: <1 minute (automatic)

Data loss: None (replicated data preserved)

Two Nodes Fail (Catastrophic)

Impact: Single surviving node continues accepting assertion writes and serving reads. Admin operations (key management) are degraded — single-node has no peer to synchronously acknowledge.

Recovery:

  1. Manual intervention required to restore cluster
  2. Restore failed nodes or add new nodes
  3. Trigger Merkle sync (/cluster/sync endpoint) after nodes rejoin
  4. Admin operations fully restored when cluster membership is repaired

RTO: 30 minutes - 2 hours (manual restore)

Data loss: Assertion writes continue on the surviving node and merge on recovery. Recent admin operations (key revocations) issued during the degraded window may not have propagated; audit after recovery.

Network Partition

Impact:

  • Assertion writes: Both partitions accept writes independently. This is safe — same content → same BLAKE3 hash, different content → different hashes that merge cleanly after partition heals.
  • Admin operations (API key revocations, quota changes): A revocation issued to one partition is invisible to the other until partition heals. A revoked key may still be honored by nodes in the other partition during the partition window.

Recovery:

  • Merkle anti-entropy detects and fills gaps automatically when partition heals
  • Lenses (Recency, Authority) handle any assertion-level divergence at read time
  • Admin state re-synchronizes via coordinator broadcast on reconnect

Data loss: None for assertions (all writes from both partitions preserved and merged).
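
A toy model of the partition window makes the asymmetry concrete: assertion sets merge by union, but a revocation recorded on one side is not enforced on the other until the partition heals. Modeling admin resynchronization as a union of revoked-key sets is an assumption made for illustration; the class and function names are hypothetical.

```python
class PartitionSide:
    """Hypothetical view of one side of a partition."""

    def __init__(self) -> None:
        self.assertions: set[str] = set()
        self.revoked_keys: set[str] = set()

    def accepts(self, api_key: str) -> bool:
        """A key is honored unless this side has seen its revocation."""
        return api_key not in self.revoked_keys


def heal(left: PartitionSide, right: PartitionSide) -> None:
    """Converge both sides on the union of what each one saw."""
    merged_assertions = left.assertions | right.assertions
    merged_revocations = left.revoked_keys | right.revoked_keys
    # Give each side its own copy so later writes do not alias.
    left.assertions, right.assertions = set(merged_assertions), set(merged_assertions)
    left.revoked_keys, right.revoked_keys = set(merged_revocations), set(merged_revocations)


left, right = PartitionSide(), PartitionSide()
left.assertions.add("id-1")
right.assertions.add("id-2")
left.revoked_keys.add("key-123")           # revocation reaches only the left side
honored_during_partition = right.accepts("key-123")
heal(left, right)
honored_after_heal = right.accepts("key-123")
```

Here `honored_during_partition` is `True` and `honored_after_heal` is `False`, while both sides end up with both assertions. That gap is why the post-recovery audit matters.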

Replication Lag

Impact: Queries may see stale data (<1 minute old)

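Recovery: No manual action is normally needed. Background Merkle sync delivers the missing assertions automatically; track convergence with the replication_lag_seconds metric.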


Performance Characteristics

Query Latency

Target: p99 <200ms at <1K queries/sec

| Metric | Single-Node | Three-Node |
| --- | --- | --- |
| p50 | 20ms | 25ms |
| p95 | 50ms | 75ms |
| p99 | 100ms | 150ms |

The three-node cluster has slightly higher latency because of the extra network hops, but roughly 3x the query capacity.

Write Throughput

Target: 1,000 assertions/sec sustained

  • Each node accepts assertion writes (routed by cluster gateway via shard prefix)
  • Replication happens asynchronously via Merkle gossip
  • No coordination required for assertions (CRDT-safe by content hash)
  • Admin writes (API keys, quota changes) route to coordinator and require synchronous acknowledgment — expect higher latency on those operations (~50ms vs ~5ms for assertions)

Replication Lag

Target: <1 second typical, <5 seconds max

Measured by: replication_lag_seconds metric


Network Requirements

See: Network Requirements for full details.

Ports (Per Node)

| Port | Protocol | Purpose | Firewall Rule |
| --- | --- | --- | --- |
| 18180 | TCP/HTTP | API (clients → nodes) | Allow from load balancer |
| 18181 | TCP/HTTP | Cluster gateway (admin only) | Allow from internal network |
| 18182 | TCP/gRPC | Replication (node ↔ node) | Allow within cluster |
| 18183 | UDP | SWIM gossip (node ↔ node) | Allow within cluster |

Latency Requirement

<5ms inter-node latency required

  • Deploy nodes in same region/AZ
  • Private network (10 Gbps recommended)
  • Test with: ping -c 100 node2 (should show avg <5ms)

Bandwidth

  • Replication: ~1 Mbps per 100 assertions/sec
  • Queries: ~10 Mbps at 1K queries/sec
  • Recommended: 1 Gbps minimum, 10 Gbps for production
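
The rules of thumb above translate into a quick estimate. The sketch assumes both figures scale linearly with load, which this document does not claim; treat the result as an order-of-magnitude check rather than a capacity plan. The function name is hypothetical.

```python
def estimate_bandwidth_mbps(assertions_per_sec: float, queries_per_sec: float) -> dict[str, float]:
    """Estimate steady-state bandwidth in Mbps from the rules of thumb above.

    Assumption: both figures scale linearly with load.
    """
    replication = assertions_per_sec / 100 * 1.0  # ~1 Mbps per 100 assertions/sec
    queries = queries_per_sec / 1000 * 10.0       # ~10 Mbps at 1K queries/sec
    return {
        "replication": replication,
        "queries": queries,
        "total": replication + queries,
    }


# Target load from this document: 1,000 assertions/sec and 1,000 queries/sec.
estimate = estimate_bandwidth_mbps(1000, 1000)
```

At the stated targets this gives about 10 Mbps for replication and 10 Mbps for queries, so the 1 Gbps minimum leaves ample headroom for bursts and Merkle catch-up traffic.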

Monitoring & Alerts

Critical Metrics

# Prometheus alerts
- alert: StemeDBNodeDown
  expr: up{job="stemedb-cluster"} == 0
  for: 1m

- alert: StemeDBReplicationLag
  expr: replication_lag_seconds > 5
  for: 5m

- alert: StemeDBQuorumLost
  # sum() still returns 0 when every target is down; count(up == 1) would return no data and never fire.
  expr: sum(up{job="stemedb-cluster"}) < 2
  for: 1m

Grafana Dashboard Panels

  1. Cluster Health: Node count, status, replication lag
  2. Query Latency: p50, p95, p99 across all nodes
  3. Ingest Rate: Assertions/sec per node
  4. Disk Usage: WAL + DB per node
  5. Network: Replication bandwidth

Cost Estimate (AWS, us-east-1)

| Resource | Cost |
| --- | --- |
| Compute (3× t3.xlarge) | $300/month |
| Storage (3× 200GB SSD) | $60/month |
| Load Balancer (ALB) | $25/month |
| Data Transfer (internal) | $10/month |
| Backups (S3) | $30/month |
| Total | ~$425/month |

Compare to single-node ($87/month): 5x cost for 10x availability


Migration from Single-Node

See: Add Node Runbook for detailed procedure.

Summary:

  1. Provision 2 new nodes
  2. Configure cluster on all 3
  3. Restart single-node with cluster config
  4. Trigger Merkle sync
  5. Update load balancer

Downtime: 5-15 minutes for replication


Scaling Path Beyond Three Nodes

Three nodes on k3s handle the 100-project target. For mass traffic beyond that, the scaling path is incremental, not a rearchitecture:

| Phase | Target | Work type | What changes |
| --- | --- | --- | --- |
| Phase 1 | 1 node, 100 projects | Done | Single Deployment, Longhorn PVC, auth wired |
| Phase 2 | 3 nodes, cluster mode | Done | 3-replica StatefulSet, dual-binary, SWIM, Gateway routing, Woodpecker CI |
| Phase 3 | 3 nodes, full replication | Code-heavy | SWIM inter-node connectivity, Gateway HTTP forwarding, Merkle anti-entropy |
| Phase 4 | N nodes, coordinator | Code-heavy | Designate one node (or small 3-node Raft group) exclusively for mutable admin state; assertion nodes are pure data |

What you do NOT need: Raft on the assertion write path. The append-only, content-addressed design means there are no write conflicts to serialize. Raft belongs only on the mutable admin state path (Phase 4), which is a small fraction of total traffic.



Last Updated: 2026-03-07 — Updated k3s section to match actual 3-replica StatefulSet deployment, dual-binary architecture, Woodpecker CI pipeline