Three-Node Cluster Architecture
Target: Production deployments, enterprise pilots, high-availability requirements
✅ RECOMMENDED FOR PRODUCTION - Survives single node failure, automatic replication
Implementation status: The cluster crates (stemedb-cluster, stemedb-sync, stemedb-rpc) are implemented. The k3s/Longhorn deployment path is the current production path (see Phase 2 section). Bare-metal deployment via config.toml is aspirational and not yet wired to the binary.
Architectural Rationale: Why Gossip, Not Raft
This section exists because the wrong answer here — "just add Raft" — is commonly assumed and actively harmful for StemeDB's workload.
The append-only insight
Most databases need Raft because writes can conflict: two nodes update the same row, and a leader must serialize them. StemeDB doesn't have this problem. Every assertion receives a BLAKE3 content hash as its ID. If two nodes both write the same assertion independently, they produce the same hash → the same key → identical data. There is nothing to conflict on.
This means the assertion write path is naturally CRDT-like: the system needs every node to eventually receive every assertion, but doesn't need consensus on which assertion "wins." Gossip + Merkle anti-entropy handles this correctly and efficiently. Raft would add leader-election overhead and write latency for zero benefit on the data path.
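A minimal Python sketch of the idempotence argument above. The standard library has no BLAKE3, so BLAKE2b stands in for it here; `assertion_id`, `write`, and the JSON payloads are illustrative names, not StemeDB's API.

```python
import hashlib


def assertion_id(payload: bytes) -> str:
    # Stand-in for BLAKE3 (assumption): hashlib ships BLAKE2b, not BLAKE3.
    return hashlib.blake2b(payload, digest_size=32).hexdigest()


def write(store: dict[str, bytes], payload: bytes) -> str:
    key = assertion_id(payload)
    store.setdefault(key, payload)  # writing the same content again is a no-op
    return key


node_a: dict[str, bytes] = {}
node_b: dict[str, bytes] = {}

same = b'{"subject":"sky","predicate":"color","object":"blue"}'
key_a = write(node_a, same)  # written independently on node A
key_b = write(node_b, same)  # and on node B
write(node_b, b'{"subject":"sky","predicate":"color","object":"grey"}')

# Convergence is a plain union: equal keys always carry equal bytes,
# so there is no "winner" to pick.
merged = {**node_a, **node_b}
```

Because the key is derived from the content, the union never has to resolve two different values under one key, which is the property that makes consensus unnecessary on this path.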
What actually needs coordination
Not everything in StemeDB is append-only. Mutable state requires stronger guarantees:
| State | Type | Replication strategy | Why |
|---|---|---|---|
| Assertions | Append-only (CRDT) | Gossip + Merkle anti-entropy | No conflicts possible by design |
| API keys | Mutable | Synchronous broadcast or coordinator | Revoked key must not be reusable |
| Quota / meter counts | Mutable counter | Coordinator node or bounded staleness | Double-spend if two nodes both allow |
| Circuit breaker state | Mutable | Synchronous broadcast | Trip/reset must propagate atomically |
| Epochs | Append-only, ordered | Gossip is sufficient | Creation order captured in content |
Practical implication: Admin operations (key management, quota changes) should be routed to a designated coordinator and synchronously acknowledged. Assertion writes and reads can go to any node with no coordination.
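To make the double-spend row in the table concrete, here is a small Python sketch under simplifying assumptions: `QuotaCounter` is a hypothetical in-memory counter, and "coordination" is modeled as both nodes sharing one counter object rather than a real network call.

```python
class QuotaCounter:
    """Hypothetical quota counter; not StemeDB's actual meter."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.used = 0

    def try_consume(self) -> bool:
        if self.used >= self.limit:
            return False
        self.used += 1
        return True


LIMIT = 1

# Uncoordinated: each node checks only its own local copy of the counter.
node_a, node_b = QuotaCounter(LIMIT), QuotaCounter(LIMIT)
uncoordinated_grants = sum([node_a.try_consume(), node_b.try_consume()])

# Coordinated: both requests are checked against one coordinator-owned counter.
coordinator = QuotaCounter(LIMIT)
coordinated_grants = sum([coordinator.try_consume(), coordinator.try_consume()])

print(uncoordinated_grants, coordinated_grants)  # prints "2 1"
```

With a limit of 1, the uncoordinated nodes grant two requests while the coordinator grants one, which is why mutable counters are routed differently from assertions.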
Read scaling
Because Lens resolution is pure local computation on indexed data, any node can serve any read with no coordination. Reads scale horizontally to N nodes without inter-node communication.
Write scaling (assertions)
Because assertion writes are idempotent by content hash, any node can accept any write. The cluster gateway (stemedb-cluster, port 18181) routes writes by subject hash shard prefix — each node owns a partition of the key space. Merkle anti-entropy ensures all nodes converge. Write throughput scales linearly with node count.
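The routing rule can be sketched as follows. This is an illustration only: it assumes the shard prefix is the first byte of the subject hash, uses BLAKE2b as a stand-in for BLAKE3, and takes the shard ranges from the diagram in the Overview. The node names and function names are hypothetical.

```python
import hashlib

# Inclusive shard ranges, as drawn in the Overview diagram.
SHARD_RANGES = {"node-a": (0, 84), "node-b": (85, 169), "node-c": (170, 255)}


def shard_for(subject: str) -> int:
    # Assumption: shard = first byte of the subject hash (0..255).
    return hashlib.blake2b(subject.encode("utf-8"), digest_size=32).digest()[0]


def owner_for(subject: str) -> str:
    shard = shard_for(subject)
    for node, (low, high) in SHARD_RANGES.items():
        if low <= shard <= high:
            return node
    raise ValueError(f"shard {shard} is not covered by any range")


print(owner_for("sky") == owner_for("sky"))  # prints True: routing is deterministic
```

Since the owner depends only on the subject, any gateway instance computes the same route without consulting the others.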
Overview
The three-node cluster provides high availability through automatic replication (factor 2) and gossip-based eventual consistency for assertions. It survives a single node failure with a recovery time under 5 minutes.
                  ┌───────────────────────────────┐
Internet ──→ LB → │ Cluster Gateway (port 18181)  │
                  │ Reads: round-robin any node   │
                  │ Writes: route by shard prefix │
                  └──────┬────────────┬───────────┘
                         │            │
            ┌────────────▼─┐  ┌───────▼──────┐  ┌───────────────┐
            │ Node A       │  │ Node B       │  │ Node C        │
            │ Shard 0-84   │  │ Shard 85-169 │  │ Shard 170-255 │
            │ WAL + KV     │  │ WAL + KV     │  │ WAL + KV      │
            └──────┬───────┘  └──────┬───────┘  └───────┬───────┘
                   │                 │                  │
                   └─────────────────┴──────────────────┘
                      Merkle sync (gossip, port 18183)
Target Specifications
| Metric | Value |
|---|---|
| Assertions | <100,000 |
| Queries/sec | <1,000 |
| Concurrent users | <500 |
| Availability | 99.9% (survives 1 node failure) |
| RTO | 5 minutes (automatic failover) |
| RPO | 1 minute (replication lag) |
| Consistency | Eventual (via CRDTs + Merkle sync) |
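As a quick arithmetic check on the table above, the 99.9% availability target can be turned into a downtime budget and compared with the 5-minute RTO. The 30-day month and the helper name are assumptions made for the calculation.

```python
def downtime_budget_minutes(availability: float, period_days: float) -> float:
    """Minutes of allowed downtime for a given availability over a period."""
    return (1.0 - availability) * period_days * 24 * 60


monthly = downtime_budget_minutes(0.999, 30)   # about 43.2 minutes
yearly = downtime_budget_minutes(0.999, 365)   # about 525.6 minutes

RTO_MINUTES = 5
# Number of full worst-case failovers that fit in one month's budget.
failovers_per_month = int(monthly // RTO_MINUTES)
```

Under these assumptions the monthly budget covers roughly eight worst-case failovers, so the 5-minute RTO and the 99.9% target are compatible only if failovers stay infrequent.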
Hardware Requirements (Per Node)
Minimum (Pilot <50K assertions)
- CPU: 4 vCPUs
- RAM: 8GB
- Disk: 100GB SSD (50GB WAL + 50GB DB)
- Network: 1 Gbps, <5ms inter-node latency
Example instances (per node):
- AWS: t3.large (2 vCPU, 8GB) × 3 = $180/month
- GCP: n2-standard-2 (2 vCPU, 8GB) × 3 = $195/month
- Azure: Standard_D2s_v3 (2 vCPU, 8GB) × 3 = $140/month
Recommended (Production <100K assertions)
- CPU: 8 vCPUs
- RAM: 16GB
- Disk: 200GB SSD (100GB WAL + 100GB DB)
- Network: 10 Gbps, <5ms inter-node latency
Example instances (per node):
- AWS: t3.xlarge (4 vCPU, 16GB) × 3 = $300/month
- GCP: n2-standard-4 (4 vCPU, 16GB) × 3 = $390/month
- Azure: Standard_D4s_v3 (4 vCPU, 16GB) × 3 = $280/month
See: Resource Sizing Guide for detailed calculations.
Architecture Components
Node Layout
Each pod runs two binaries via scripts/entrypoint.sh:
- stemedb-api (port 18180) — HTTP API, WAL, KV store, ingestion, query
- stemedb-node (port 18181) — Cluster Gateway HTTP (client-facing, routes by shard)
- stemedb-node (port 18182) — gRPC (node-to-node sync)
- stemedb-node (port 18183) — SWIM gossip (membership, failure detection)
Replication
CRDT-based with Merkle sync:
- Writes accepted locally (optimistic)
- Background Merkle tree comparison
- Automatic sync of missing assertions
- No distributed transactions
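The Merkle comparison step can be illustrated with a deliberately small two-level tree. This is a sketch, not the stemedb-sync implementation: keys are bucketed by their first hex character, BLAKE2b stands in for BLAKE3, and all function names are hypothetical.

```python
import hashlib


def digest(parts: list[str]) -> str:
    h = hashlib.blake2b(digest_size=16)
    for part in parts:
        h.update(part.encode("ascii"))
        h.update(b"\x00")  # separator so ["ab", "c"] differs from ["a", "bc"]
    return h.hexdigest()


def bucket_hashes(keys: set[str]) -> dict[str, str]:
    # Leaf level: one hash per key prefix (first hex character).
    buckets: dict[str, list[str]] = {}
    for key in keys:
        buckets.setdefault(key[0], []).append(key)
    return {prefix: digest(sorted(members)) for prefix, members in buckets.items()}


def root_hash(keys: set[str]) -> str:
    hashes = bucket_hashes(keys)
    return digest([prefix + hashes[prefix] for prefix in sorted(hashes)])


def missing_from_local(local: set[str], remote: set[str]) -> set[str]:
    if root_hash(local) == root_hash(remote):
        return set()  # roots match: nothing to transfer
    mine, theirs = bucket_hashes(local), bucket_hashes(remote)
    differing = {p for p in theirs if mine.get(p) != theirs[p]}
    # In a real exchange only keys from differing buckets would cross the wire.
    return {k for k in remote if k[0] in differing and k not in local}


local = {"a1", "b2"}
remote = {"a1", "b2", "b3", "c4"}
print(sorted(missing_from_local(local, remote)))  # prints ['b3', 'c4']
```

The function is one-directional; running it once in each direction gives both peers the union of their assertion sets.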
Replication factor 2:
- Each assertion stored on 2 nodes
- Survives 1 node failure
- Read from any node (eventually consistent)
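One way to picture replication factor 2 is ring-successor placement. The placement rule below is an assumption chosen for illustration (the document does not state how the second replica is picked), and the node names are hypothetical.

```python
NODES = ["node-a", "node-b", "node-c"]
REPLICATION_FACTOR = 2


def replicas_for(owner: str) -> list[str]:
    # Assumption: replicas are the shard owner plus its successors in ring order.
    start = NODES.index(owner)
    return [NODES[(start + i) % len(NODES)] for i in range(REPLICATION_FACTOR)]


def readable_after_failure(failed: str) -> bool:
    # Every owner's replica set must still contain at least one live node.
    return all(
        any(node != failed for node in replicas_for(owner)) for owner in NODES
    )


print(replicas_for("node-c"))  # prints ['node-c', 'node-a']
print(all(readable_after_failure(n) for n in NODES))  # prints True
```

With two copies on distinct nodes, losing any single node leaves at least one copy of every shard reachable, which matches the "survives 1 node failure" claim.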
Load Balancing
Round-robin across all nodes:
- Nginx or Envoy distribute queries
- No "primary" node (all equal)
- Health checks remove failed nodes
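The load-balancing behavior described above can be sketched in a few lines of Python. `RoundRobin` is a hypothetical helper that mimics what Nginx or Envoy do internally; it is not part of StemeDB.

```python
from itertools import cycle


class RoundRobin:
    """Round-robin selection that skips nodes marked unhealthy."""

    def __init__(self, nodes: list[str]) -> None:
        self.nodes = list(nodes)
        self.healthy = set(self.nodes)
        self._ring = cycle(self.nodes)

    def mark(self, node: str, healthy: bool) -> None:
        if healthy:
            self.healthy.add(node)
        else:
            self.healthy.discard(node)

    def next_node(self) -> str:
        # Try each node at most once per call before giving up.
        for _ in range(len(self.nodes)):
            node = next(self._ring)
            if node in self.healthy:
                return node
        raise RuntimeError("no healthy nodes")


lb = RoundRobin(["node1", "node2", "node3"])
first_pass = [lb.next_node() for _ in range(3)]
lb.mark("node2", False)  # a failed health check removes node2
after_failure = [lb.next_node() for _ in range(4)]
print(after_failure)  # prints ['node1', 'node3', 'node1', 'node3']
```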
k3s Deployment (Current — StatefulSet + Longhorn)
This is the current production deployment on k3s-fleet. Deployed and running as of 2026-03-07. The bare-metal steps below are for non-k8s environments and use a config.toml interface that is not yet wired to the binary.
A single 3-replica StatefulSet with VolumeClaimTemplates provides each pod its own 50Gi Longhorn PVC:
k3s-fleet/deployments/k8s/base/stemedb/
└── stemedb.yaml # Everything: ExternalSecret, headless Service, gateway Service,
# 3-replica StatefulSet, Ingress
Architecture per pod (dual-binary via entrypoint.sh):
- stemedb-api (:18180): storage engine (WAL + KV)
- stemedb-node (:18181): Gateway HTTP (client-facing, cluster routing)
- stemedb-node (:18182): gRPC (node-to-node sync)
- stemedb-node (:18183): SWIM gossip (membership)
Pod DNS: stemedb-{0,1,2}.stemedb-headless.stemedb.svc.cluster.local
Key k8s resources:
- Headless Service (stemedb-headless): stable DNS for all 4 ports
- Gateway Service (stemedb-gateway): ClusterIP routing to Gateway port 18181
- Ingress: stemedb.threesix.ai → stemedb-gateway:18181 via Traefik
Environment variables (set in StatefulSet):
- STEMEDB_CLUSTER_MODE=true: starts both binaries via entrypoint.sh
- STEMEDB_NODE_ID: from metadata.name (stemedb-0, stemedb-1, stemedb-2) → stable NodeId via BLAKE3
- STEMEDB_SEED_NODES: all 3 headless DNS names on port 18182
- STEMEDB_NUM_SHARDS=4, STEMEDB_REPLICATION_FACTOR=2
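For illustration, the per-pod environment can be derived from the pod name and the headless DNS pattern shown above. This is a sketch: the comma-separated format for STEMEDB_SEED_NODES is an assumption, and `seed_nodes` and `cluster_env` are hypothetical helper names.

```python
HEADLESS_SUFFIX = "stemedb-headless.stemedb.svc.cluster.local"


def seed_nodes(replicas: int = 3, port: int = 18182) -> list[str]:
    # One stable DNS name per StatefulSet ordinal, on the gRPC port.
    return [f"stemedb-{i}.{HEADLESS_SUFFIX}:{port}" for i in range(replicas)]


def cluster_env(pod_name: str) -> dict[str, str]:
    return {
        "STEMEDB_CLUSTER_MODE": "true",
        "STEMEDB_NODE_ID": pod_name,  # taken from metadata.name
        # Assumption: seeds are passed as a comma-separated list.
        "STEMEDB_SEED_NODES": ",".join(seed_nodes()),
        "STEMEDB_NUM_SHARDS": "4",
        "STEMEDB_REPLICATION_FACTOR": "2",
    }


env = cluster_env("stemedb-1")
print(env["STEMEDB_SEED_NODES"].split(",")[0])
# prints stemedb-0.stemedb-headless.stemedb.svc.cluster.local:18182
```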
CI/CD: Woodpecker pipeline (.woodpecker.yml) → Kaniko builds → Zot registry (registry.threesix.ai) → kubectl set image statefulset/stemedb
See: k8s Deploy Roadmap for the phased rollout plan.
Bare-Metal Deployment Steps
⚠️ The config.toml cluster configuration shown here is planned and not yet wired to the stemedb-api binary. Current binary configuration uses environment variables only. This section documents the intended interface for when cluster config is implemented.
Prerequisites
- 3 servers provisioned (same specs)
- Private network with <5ms latency
- DNS records created
- TLS certificates provisioned
Step 1: Install StemeDB on All Nodes
# On each node (node1, node2, node3):
sudo curl -L https://github.com/yourorg/stemedb/releases/download/v0.1.0/stemedb-api -o /usr/local/bin/stemedb-api
sudo chmod +x /usr/local/bin/stemedb-api
sudo mkdir -p /data/{wal,db}
sudo useradd -r -s /bin/false stemedb
sudo chown -R stemedb:stemedb /data
Step 2: Configure Cluster
Node 1:
# /etc/stemedb/config.toml
[cluster]
enabled = true
node_id = "node1"
bind_addr = "10.0.1.51:18181"
rpc_addr = "10.0.1.51:18182"
swim_addr = "10.0.1.51:18183"
seeds = ["10.0.1.52:18183", "10.0.1.53:18183"]
[replication]
factor = 2
Node 2:
[cluster]
enabled = true
node_id = "node2"
bind_addr = "10.0.1.52:18181"
rpc_addr = "10.0.1.52:18182"
swim_addr = "10.0.1.52:18183"
seeds = ["10.0.1.51:18183", "10.0.1.53:18183"]
[replication]
factor = 2
Node 3:
[cluster]
enabled = true
node_id = "node3"
bind_addr = "10.0.1.53:18181"
rpc_addr = "10.0.1.53:18182"
swim_addr = "10.0.1.53:18183"
seeds = ["10.0.1.51:18183", "10.0.1.52:18183"]
[replication]
factor = 2
Step 3: Start All Nodes
# Start nodes sequentially (allows SWIM discovery)
ssh node1 "sudo systemctl start stemedb-api"
sleep 10
ssh node2 "sudo systemctl start stemedb-api"
sleep 10
ssh node3 "sudo systemctl start stemedb-api"
Step 4: Verify Cluster Formation
# Check membership (from any node)
curl http://node1:18181/cluster/members | jq '.'
# Expected output:
# {
# "members": [
# {"id": "node1", "status": "UP"},
# {"id": "node2", "status": "UP"},
# {"id": "node3", "status": "UP"}
# ]
# }
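The membership check can also be scripted. The sketch below assumes the response has exactly the shape shown in the expected output above; `members_not_up` is a hypothetical helper, and the payload is hard-coded rather than fetched so the example runs without a cluster.

```python
import json


def members_not_up(payload: str, expected: set[str]) -> set[str]:
    """Return expected node IDs that are missing or not reported as UP."""
    statuses = {m["id"]: m["status"] for m in json.loads(payload)["members"]}
    return {node for node in expected if statuses.get(node) != "UP"}


EXPECTED = {"node1", "node2", "node3"}
healthy = json.dumps(
    {"members": [{"id": n, "status": "UP"} for n in ("node1", "node2", "node3")]}
)
print(members_not_up(healthy, EXPECTED))  # prints set()
```

A non-empty result names the nodes to investigate before moving on to the load balancer step.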
Step 5: Configure Load Balancer
See: Nginx Config or Envoy Config
Nginx upstream:
upstream stemedb_cluster {
server node1.example.com:18180;
server node2.example.com:18180;
server node3.example.com:18180;
}
Step 6: Set Up Monitoring
# Prometheus scrape config
scrape_configs:
  - job_name: 'stemedb-cluster'
    static_configs:
      - targets:
          - 'node1:18180'
          - 'node2:18180'
          - 'node3:18180'
Estimated deployment time: 4-8 hours (including load balancer, monitoring)
Failure Scenarios & Recovery
Single Node Failure
Impact: No service disruption, automatic failover
Recovery:
- Load balancer detects failed node (health check)
- Traffic routed to 2 remaining nodes
- Replication factor maintained (assertions still on 2 nodes)
- Replace failed node when convenient (see Add Node Runbook)
RTO: <1 minute (automatic)
Data loss: None (replicated data preserved)
Two Nodes Fail (Catastrophic)
Impact: Single surviving node continues accepting assertion writes and serving reads. Admin operations (key management) are degraded — single-node has no peer to synchronously acknowledge.
Recovery:
- Manual intervention required to restore cluster
- Restore failed nodes or add new nodes
- Trigger Merkle sync (/cluster/sync endpoint) after nodes rejoin
- Admin operations fully restored when cluster membership is repaired
RTO: 30 minutes - 2 hours (manual restore)
Data loss: Assertion writes continue on surviving node and merge on recovery. Recent admin operations (key revocations) issued during degraded window may not have propagated; audit after recovery.
Network Partition
Impact:
- Assertion writes: Both partitions accept writes independently. This is safe — same content → same BLAKE3 hash, different content → different hashes that merge cleanly after partition heals.
- Admin operations (API key revocations, quota changes): A revocation issued to one partition is invisible to the other until partition heals. A revoked key may still be honored by nodes in the other partition during the partition window.
Recovery:
- Merkle anti-entropy detects and fills gaps automatically when partition heals
- Lenses (Recency, Authority) handle any assertion-level divergence at read time
- Admin state re-synchronizes via coordinator broadcast on reconnect
Data loss: None for assertions (all writes from both partitions preserved and merged).
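A small sketch of what "merge cleanly, resolve at read time" can look like. The `Assertion` fields and the rule that a Recency lens picks the newest timestamp are assumptions made for illustration; the real schema and lens semantics are not shown in this document.

```python
from typing import NamedTuple, Optional


class Assertion(NamedTuple):
    subject: str
    value: str
    timestamp: int  # hypothetical field


def heal(partition_a: set, partition_b: set) -> set:
    # Healing is a union: nothing written on either side is discarded.
    return partition_a | partition_b


def recency_lens(assertions: set, subject: str) -> Optional[Assertion]:
    # Assumed semantics: the most recent assertion for the subject wins at read time.
    candidates = [a for a in assertions if a.subject == subject]
    return max(candidates, key=lambda a: a.timestamp, default=None)


side_a = {Assertion("svc", "up", 100)}
side_b = {Assertion("svc", "down", 200)}
merged = heal(side_a, side_b)
print(recency_lens(merged, "svc").value)  # prints down
```

Both assertions survive the merge; the lens only decides which one a given read presents.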
Replication Lag
Impact: Queries may see stale data (<1 minute old)
Recovery:
- Automatic catch-up via Merkle sync
- If lag >5 minutes, see High Latency Runbook
Performance Characteristics
Query Latency
Target: p99 <200ms at <1K queries/sec
| Metric | Single-Node | Three-Node |
|---|---|---|
| p50 | 20ms | 25ms |
| p95 | 50ms | 75ms |
| p99 | 100ms | 150ms |
The three-node cluster has slightly higher latency due to network hops, but 3x the query capacity.
Write Throughput
Target: 1,000 assertions/sec sustained
- Each node accepts assertion writes (routed by cluster gateway via shard prefix)
- Replication happens asynchronously via Merkle gossip
- No coordination required for assertions (CRDT-safe by content hash)
- Admin writes (API keys, quota changes) route to coordinator and require synchronous acknowledgment — expect higher latency on those operations (~50ms vs ~5ms for assertions)
Replication Lag
Target: <1 second typical, <5 seconds max
Measured by: replication_lag_seconds metric
Network Requirements
See: Network Requirements for full details.
Ports (Per Node)
| Port | Protocol | Purpose | Firewall Rule |
|---|---|---|---|
| 18180 | TCP/HTTP | API (clients → nodes) | Allow from load balancer |
| 18181 | TCP/HTTP | Cluster gateway (admin only) | Allow from internal network |
| 18182 | TCP/gRPC | Replication (node ↔ node) | Allow within cluster |
| 18183 | UDP | SWIM gossip (node ↔ node) | Allow within cluster |
Latency Requirement
<5ms inter-node latency required
- Deploy nodes in same region/AZ
- Private network (10 Gbps recommended)
- Test with: ping -c 100 node2 (should show avg <5ms)
Bandwidth
- Replication: ~1 Mbps per 100 assertions/sec
- Queries: ~10 Mbps at 1K queries/sec
- Recommended: 1 Gbps minimum, 10 Gbps for production
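The bandwidth figures above can be combined into a rough estimate at the stated targets. The sketch assumes both rates scale linearly with load, which the document does not claim; the function name is hypothetical.

```python
REPLICATION_MBPS_PER_100_WRITES = 1.0  # from the list above
QUERY_MBPS_PER_1K_QPS = 10.0           # from the list above


def estimated_mbps(writes_per_sec: float, queries_per_sec: float) -> float:
    # Assumption: bandwidth grows linearly with write and query rates.
    replication = writes_per_sec / 100 * REPLICATION_MBPS_PER_100_WRITES
    queries = queries_per_sec / 1000 * QUERY_MBPS_PER_1K_QPS
    return replication + queries


peak = estimated_mbps(writes_per_sec=1000, queries_per_sec=1000)
share_of_1gbps = peak / 1000
print(peak, share_of_1gbps)  # prints 20.0 0.02
```

At the 1,000 assertions/sec and 1K queries/sec targets, the estimate is about 20 Mbps, or 2% of a 1 Gbps link, so the 1 Gbps minimum leaves ample headroom under these assumptions.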
Monitoring & Alerts
Critical Metrics
# Prometheus alerts
- alert: StemeDBNodeDown
  expr: up{job="stemedb-cluster"} == 0
  for: 1m
- alert: StemeDBReplicationLag
  expr: replication_lag_seconds > 5
  for: 5m
- alert: StemeDBQuorumLost
  expr: count(up{job="stemedb-cluster"} == 1) < 2
  for: 1m
Grafana Dashboard Panels
- Cluster Health: Node count, status, replication lag
- Query Latency: p50, p95, p99 across all nodes
- Ingest Rate: Assertions/sec per node
- Disk Usage: WAL + DB per node
- Network: Replication bandwidth
Cost Estimate (AWS, us-east-1)
| Resource | Cost |
|---|---|
| Compute (3× t3.xlarge) | $300/month |
| Storage (3× 200GB SSD) | $60/month |
| Load Balancer (ALB) | $25/month |
| Data Transfer (internal) | $10/month |
| Backups (S3) | $30/month |
| Total | ~$425/month |
Compare to single-node ($87/month): 5x cost for 10x availability
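The total and the cost ratio follow directly from the table; a short check in Python, using only the figures above:

```python
MONTHLY_COSTS_USD = {
    "compute": 300,
    "storage": 60,
    "load_balancer": 25,
    "data_transfer": 10,
    "backups": 30,
}
SINGLE_NODE_USD = 87

total = sum(MONTHLY_COSTS_USD.values())
ratio = total / SINGLE_NODE_USD
print(total, round(ratio, 1))  # prints 425 4.9
```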
Migration from Single-Node
See: Add Node Runbook for detailed procedure.
Summary:
- Provision 2 new nodes
- Configure cluster on all 3
- Restart single-node with cluster config
- Trigger Merkle sync
- Update load balancer
Downtime: 5-15 minutes for replication
Scaling Path Beyond Three Nodes
Three nodes on k3s handle the 100-project target. For mass traffic beyond that, the scaling path is incremental, not a rearchitecture:
| Phase | Target | Work type | What changes |
|---|---|---|---|
| Phase 1 | 1 node, 100 projects | ✅ Done | Single Deployment, Longhorn PVC, auth wired |
| Phase 2 | 3 nodes, cluster mode | ✅ Done | 3-replica StatefulSet, dual-binary, SWIM, Gateway routing, Woodpecker CI |
| Phase 3 | 3 nodes, full replication | Code-heavy | SWIM inter-node connectivity, Gateway HTTP forwarding, Merkle anti-entropy |
| Phase 4 | N nodes, coordinator | Code-heavy | Designate one node (or small 3-node Raft group) exclusively for mutable admin state; assertion nodes are pure data |
What you do NOT need: Raft on the assertion write path. The append-only, content-addressed design means there are no write conflicts to serialize. Raft belongs only on the mutable admin state path (Phase 4), which is a small fraction of total traffic.
Related Documentation
- Single-Node Pilot - Simpler architecture
- Network Requirements - Firewall rules
- Resource Sizing - Hardware calculations
- Add Node Runbook - Cluster operations
- High Query Latency Runbook - Performance troubleshooting
Last Updated: 2026-03-07 — Updated k3s section to match actual 3-replica StatefulSet deployment, dual-binary architecture, Woodpecker CI pipeline