# Three-Node Cluster Architecture

**Target:** Production deployments, enterprise pilots, high-availability requirements

**✅ RECOMMENDED FOR PRODUCTION** - Survives single node failure, automatic replication

> **Implementation status:** The cluster crates (`stemedb-cluster`, `stemedb-sync`, `stemedb-rpc`) are
> implemented. The k3s/Longhorn deployment path is the current production path (see the k3s Deployment Path section).
> Bare-metal deployment via config.toml is aspirational and not yet wired to the binary.

---

## Architectural Rationale: Why Gossip, Not Raft

This section exists because the wrong answer here, "just add Raft", is commonly assumed and actively harmful for StemeDB's workload.

### The append-only insight

Most databases need Raft because writes can **conflict**: two nodes update the same row, and a leader must serialize them. StemeDB doesn't have this problem. Every assertion receives a **BLAKE3 content hash** as its ID. If two nodes both write the same assertion independently, they produce the same hash → the same key → identical data. There is nothing to conflict on.

This means the assertion write path is naturally **CRDT-like**: the system needs every node to eventually receive every assertion, but doesn't need consensus on which assertion "wins." Gossip + Merkle anti-entropy handles this correctly and efficiently. Raft would add leader-election overhead and write latency for zero benefit on the data path.

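The idempotence argument above can be sketched in a few lines. This is a minimal illustration, not StemeDB's implementation: `NodeStore` and `assertion_id` are hypothetical names, and BLAKE2b from Python's standard library stands in for BLAKE3, which the standard library does not provide.

```python
import hashlib


def assertion_id(payload: bytes) -> str:
    # BLAKE2b-256 is a stand-in for BLAKE3 (assumption for this sketch).
    return hashlib.blake2b(payload, digest_size=32).hexdigest()


class NodeStore:
    """Hypothetical append-only store keyed by content hash."""

    def __init__(self):
        self.assertions = {}

    def write(self, payload: bytes) -> str:
        key = assertion_id(payload)
        # Writing the same payload twice is a no-op: same hash, same key, same bytes.
        self.assertions.setdefault(key, payload)
        return key

    def merge_from(self, other: "NodeStore") -> None:
        # Merging is a set union over keys; there is no "winner" to choose.
        for key, payload in other.assertions.items():
            self.assertions.setdefault(key, payload)


node_a, node_b = NodeStore(), NodeStore()
shared = b'{"subject": "s1", "claim": "x"}'
id_a = node_a.write(shared)
id_b = node_b.write(shared)  # same content written independently on another node
node_b.write(b'{"subject": "s2", "claim": "y"}')

# Merge in both directions; the order does not matter.
node_a.merge_from(node_b)
node_b.merge_from(node_a)
```

After the two merges both stores hold the same two assertions, and the independently written payload has a single ID on both nodes.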
### What actually needs coordination

Not everything in StemeDB is append-only. Mutable state requires stronger guarantees:

| State | Type | Replication strategy | Why |
|-------|------|---------------------|-----|
| Assertions | Append-only (CRDT) | Gossip + Merkle anti-entropy | No conflicts possible by design |
| API keys | Mutable | Synchronous broadcast or coordinator | Revoked key must not be reusable |
| Quota / meter counts | Mutable counter | Coordinator node or bounded staleness | Double-spend if two nodes both allow |
| Circuit breaker state | Mutable | Synchronous broadcast | Trip/reset must propagate atomically |
| Epochs | Append-only, ordered | Gossip is sufficient | Creation order captured in content |

**Practical implication:** Admin operations (key management, quota changes) should be routed to a designated coordinator and synchronously acknowledged. Assertion writes and reads can go to any node with no coordination.

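One way to express that routing rule is a small classifier that sends only known append-only operations to any node and everything else to the coordinator. The operation names below are hypothetical, chosen to mirror the table; the source does not define an operation naming scheme.

```python
from enum import Enum


class Route(Enum):
    ANY_NODE = "any_node"
    COORDINATOR = "coordinator"


# Hypothetical operation names for the append-only rows of the table.
APPEND_ONLY_OPS = {"assertion.write", "assertion.read", "epoch.create"}


def route_for(operation: str) -> Route:
    """Pick a route; unknown operations fall back to the coordinator."""
    if operation in APPEND_ONLY_OPS:
        return Route.ANY_NODE
    # Mutable state (API keys, quotas, circuit breakers) needs a synchronous ack.
    return Route.COORDINATOR


routes = {op: route_for(op) for op in ("assertion.write", "api_key.revoke", "quota.update")}
```

Defaulting unknown operations to the coordinator is a conservative choice for this sketch: a misclassified admin operation costs latency rather than correctness.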
### Read scaling

Because Lens resolution is pure local computation on indexed data, **any node can serve any read with no coordination**. Reads scale horizontally to N nodes without inter-node communication.

### Write scaling (assertions)

Because assertion writes are idempotent by content hash, **any node can accept any write**. The cluster gateway (`stemedb-cluster`, port 18181) routes writes by subject hash shard prefix, so each node owns a partition of the key space. Merkle anti-entropy ensures all nodes converge. Because each node ingests a distinct partition, write throughput should scale close to linearly with node count.

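A minimal sketch of shard-prefix routing follows, using the shard ranges shown in the Overview diagram (0-84, 85-169, 170-255). Treating the first byte of the subject hash as the shard number is an assumption, the node names are illustrative, and BLAKE2b again stands in for BLAKE3.

```python
import hashlib

# Shard ranges from the Overview diagram; node names are illustrative.
SHARD_OWNERS = [
    ("node-a", range(0, 85)),
    ("node-b", range(85, 170)),
    ("node-c", range(170, 256)),
]


def shard_of(subject: str) -> int:
    """Assumed scheme: the first byte of the subject hash selects one of 256 shards."""
    digest = hashlib.blake2b(subject.encode("utf-8"), digest_size=32).digest()
    return digest[0]


def owner_of(shard: int) -> str:
    for node, shards in SHARD_OWNERS:
        if shard in shards:
            return node
    raise ValueError(f"shard {shard} is outside 0-255")


def route_write(subject: str) -> str:
    return owner_of(shard_of(subject))


target = route_write("sensor/42")  # the same subject always routes to the same node
```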
---

## Overview

The three-node cluster provides high availability through automatic replication (factor 2) and gossip-based eventual consistency for assertions. It survives a single node failure with a recovery time under 5 minutes.

```
                  ┌──────────────────────────────┐
Internet ──→ LB → │ Cluster Gateway (port 18181) │
                  │ Reads: round-robin any node  │
                  │ Writes: route by shard prefix│
                  └──────┬────────────┬──────────┘
                         │            │
              ┌──────────▼──┐    ┌────▼─────────┐    ┌──────────────┐
              │ Node A      │    │ Node B       │    │ Node C       │
              │ Shard 0-84  │    │ Shard 85-169 │    │ Shard 170-255│
              │ WAL + KV    │    │ WAL + KV     │    │ WAL + KV     │
              └──────┬──────┘    └────┬─────────┘    └───────┬──────┘
                     │                │                      │
                     └────────────────┴──────────────────────┘
                         Merkle sync (gossip, port 18183)
```

---

## Target Specifications

| Metric | Value |
|--------|-------|
| **Assertions** | <100,000 |
| **Queries/sec** | <1,000 |
| **Concurrent users** | <500 |
| **Availability** | 99.9% (survives 1 node failure) |
| **RTO** | 5 minutes (automatic failover) |
| **RPO** | 1 minute (replication lag) |
| **Consistency** | Eventual (via CRDTs + Merkle sync) |

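To put the availability and RTO targets side by side, the downtime allowed by 99.9% can be worked out directly. The 30-day month is an assumption of this sketch, and the function name is illustrative.

```python
def downtime_budget_minutes(availability: float, days: int = 30) -> float:
    """Minutes of downtime permitted over `days` at the given availability."""
    return (1.0 - availability) * days * 24 * 60


budget = downtime_budget_minutes(0.999)     # roughly 43.2 minutes per 30 days
rto_minutes = 5                             # RTO target from the table
full_recoveries = int(budget // rto_minutes)  # worst-case recoveries that fit: 8
```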
---

## Hardware Requirements (Per Node)

### Minimum (Pilot <50K assertions)

- **CPU:** 4 vCPUs
- **RAM:** 8GB
- **Disk:** 100GB SSD (50GB WAL + 50GB DB)
- **Network:** 1 Gbps, <5ms inter-node latency

**Example instances (per node):**
- AWS: `t3.large` (2 vCPU, 8GB) × 3 = $180/month
- GCP: `n2-standard-2` (2 vCPU, 8GB) × 3 = $195/month
- Azure: `Standard_D2s_v3` (2 vCPU, 8GB) × 3 = $140/month

These instances match the RAM figure but provide 2 vCPUs, half the CPU minimum listed above; move up one size if the workload is CPU-bound.

### Recommended (Production <100K assertions)

- **CPU:** 8 vCPUs
- **RAM:** 16GB
- **Disk:** 200GB SSD (100GB WAL + 100GB DB)
- **Network:** 10 Gbps, <5ms inter-node latency

**Example instances (per node):**
- AWS: `t3.xlarge` (4 vCPU, 16GB) × 3 = $300/month
- GCP: `n2-standard-4` (4 vCPU, 16GB) × 3 = $390/month
- Azure: `Standard_D4s_v3` (4 vCPU, 16GB) × 3 = $280/month

As with the minimum tier, these instances match the RAM figure but provide 4 vCPUs rather than the 8 listed above.

**See:** [Resource Sizing Guide](./resource-sizing.md) for detailed calculations.

---

## Architecture Components

### Node Layout

Each node runs the full stack:
- **stemedb-api** (port 18180) - HTTP API, queries, ingest
- **stemedb-gateway** (port 18181) - Cluster coordination
- **stemedb-rpc** (port 18182) - gRPC replication
- **SWIM gossip** (port 18183) - Membership, failure detection

### Replication

**CRDT-based with Merkle sync:**
- Writes accepted locally (optimistic)
- Background Merkle tree comparison
- Automatic sync of missing assertions
- No distributed transactions

**Replication factor 2:**
- Each assertion stored on 2 nodes
- Survives 1 node failure
- Read from any node (eventually consistent)

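The Merkle comparison step can be illustrated with a deliberately small two-level tree: one digest per bucket of keys, plus a root over the bucket digests. This is a sketch of the general anti-entropy technique under stated assumptions, not the `stemedb-sync` implementation; the bucketing by first hex character, the function names, and the short example keys are all hypothetical.

```python
import hashlib

HEX = "0123456789abcdef"


def _digest(parts) -> str:
    h = hashlib.blake2b(digest_size=16)
    for part in parts:
        h.update(part.encode("ascii"))
        h.update(b"\n")  # separator so adjacent parts cannot run together
    return h.hexdigest()


def bucket_digests(keys):
    """Leaf level: one digest per bucket, bucketing by the key's first hex character."""
    buckets = {c: [] for c in HEX}
    for key in keys:
        buckets[key[0]].append(key)
    return {c: _digest(sorted(members)) for c, members in buckets.items()}


def root_digest(leaves) -> str:
    return _digest(leaves[c] for c in sorted(leaves))


def missing_keys(local_keys, remote_keys):
    """Keys the local node should pull from the remote node."""
    local_leaves = bucket_digests(local_keys)
    remote_leaves = bucket_digests(remote_keys)
    if root_digest(local_leaves) == root_digest(remote_leaves):
        return set()  # roots match, so nothing needs to be exchanged
    differing = {c for c in HEX if local_leaves[c] != remote_leaves[c]}
    # Only keys in differing buckets are candidates for transfer.
    return {k for k in remote_keys if k[0] in differing} - set(local_keys)


node_a = {"0a11", "7f22", "c033"}
node_b = {"0a11", "7f22", "c033", "c944", "e155"}
to_pull = missing_keys(node_a, node_b)
```

Here `to_pull` contains the two keys that only `node_b` holds; buckets `0` and `7` match on both sides and are never inspected key by key, which is what keeps the comparison cheap when nodes are mostly in sync.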
### Load Balancing

**Round-robin across all nodes:**
- Nginx or Envoy distributes queries
- No "primary" node (all equal)
- Health checks remove failed nodes

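The behavior expected from the load balancer (rotate across nodes, skip any that fail health checks) can be modeled in a few lines. This is only a model of the policy; in practice Nginx or Envoy does this, and the `RoundRobinPool` class is hypothetical.

```python
import itertools


class RoundRobinPool:
    """Hypothetical model of round-robin selection with health-based exclusion."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.healthy = set(self.nodes)
        self._cycle = itertools.cycle(self.nodes)

    def mark(self, node, is_healthy):
        if node not in self.nodes:
            raise ValueError(f"unknown node: {node}")
        if is_healthy:
            self.healthy.add(node)
        else:
            self.healthy.discard(node)

    def next_node(self):
        if not self.healthy:
            raise RuntimeError("no healthy nodes")
        # Advance the rotation, skipping nodes that failed their health check.
        for node in self._cycle:
            if node in self.healthy:
                return node


pool = RoundRobinPool(["node1", "node2", "node3"])
before = [pool.next_node() for _ in range(3)]
pool.mark("node2", False)  # health check fails for node2
after = [pool.next_node() for _ in range(4)]
```

With all nodes healthy the rotation visits each node once; after `node2` is marked unhealthy, requests alternate between `node1` and `node3`.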
---

## k3s Deployment Path (Current: Longhorn + StatefulSet)

> This is the **current production deployment path** for k3s-fleet. The bare-metal steps below
> are for non-k8s environments and use a config.toml interface that is not yet wired to the binary.

For each cluster node on k3s, deploy a separate StatefulSet with its own Longhorn PVC:

```
k3s-fleet/deployments/k8s/base/stemedb/
├── stemedb.yaml       # Node A (current single-node, Phase 1)
├── stemedb-b.yaml     # Node B (Phase 2, add when ready to scale reads)
├── stemedb-c.yaml     # Node C (Phase 2)
└── kustomization.yaml
```

**Critical k3s constraints:**
- Each node needs its own `ReadWriteOnce` Longhorn PVC; the embedded KV (fjall) cannot share a volume
- Use `strategy.type: Recreate` on each Deployment (not RollingUpdate); an RWO PVC with 2 pods deadlocks
- Cluster gateway (port 18181) must be exposed as a separate Service for inter-node routing
- Use `topologySpreadConstraints` to ensure nodes land on different k3s worker hosts

**Phase 2 read-replica k8s addition (when ready):**
```yaml
# Add to stemedb-b.yaml; identical to stemedb.yaml except:
# - Different node ID env var
# - STEMEDB_CLUSTER_SEEDS pointing to Node A's gateway ClusterIP
# - Its own PVC claim
env:
  - name: STEMEDB_NODE_ID
    value: "node-b"
  - name: STEMEDB_CLUSTER_SEEDS
    value: "stemedb-api.stemedb.svc:18181"
```

**See:** [k8s Deploy Roadmap](../deployment/k8s-deploy-roadmap.md) for the phased rollout plan.

---

## Bare-Metal Deployment Steps

> ⚠️ The config.toml cluster configuration shown here is **planned** and not yet wired to the
> `stemedb-api` binary. Current binary configuration uses environment variables only. This section
> documents the intended interface for when cluster config is implemented.

### Prerequisites

- [ ] 3 servers provisioned (same specs)
- [ ] Private network with <5ms latency
- [ ] DNS records created
- [ ] TLS certificates provisioned

### Step 1: Install StemeDB on All Nodes

```bash
# On each node (node1, node2, node3):
sudo curl -L https://github.com/yourorg/stemedb/releases/download/v0.1.0/stemedb-api -o /usr/local/bin/stemedb-api
sudo chmod +x /usr/local/bin/stemedb-api

sudo mkdir -p /data/{wal,db}
sudo useradd -r -s /bin/false stemedb
sudo chown -R stemedb:stemedb /data
```

### Step 2: Configure Cluster

**Node 1:**
```toml
# /etc/stemedb/config.toml
[cluster]
enabled = true
node_id = "node1"
bind_addr = "10.0.1.51:18181"
rpc_addr = "10.0.1.51:18182"
swim_addr = "10.0.1.51:18183"
seeds = ["10.0.1.52:18183", "10.0.1.53:18183"]

[replication]
factor = 2
```

**Node 2:**
```toml
[cluster]
enabled = true
node_id = "node2"
bind_addr = "10.0.1.52:18181"
rpc_addr = "10.0.1.52:18182"
swim_addr = "10.0.1.52:18183"
seeds = ["10.0.1.51:18183", "10.0.1.53:18183"]

[replication]
factor = 2
```

**Node 3:**
```toml
[cluster]
enabled = true
node_id = "node3"
bind_addr = "10.0.1.53:18181"
rpc_addr = "10.0.1.53:18182"
swim_addr = "10.0.1.53:18183"
seeds = ["10.0.1.51:18183", "10.0.1.52:18183"]

[replication]
factor = 2
```

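The three files differ only in the node ID, the node's own IP, and the seed list (every peer's SWIM address except the node's own). A small generator makes that pattern explicit and avoids copy-paste mistakes. The `render_config` helper is hypothetical, and since the config.toml interface is itself planned rather than implemented, treat the output as a sketch of the intended format.

```python
NODES = {"node1": "10.0.1.51", "node2": "10.0.1.52", "node3": "10.0.1.53"}
GATEWAY_PORT, RPC_PORT, SWIM_PORT = 18181, 18182, 18183


def render_config(node_id, nodes=NODES, replication_factor=2):
    """Render the planned config.toml for one node (hypothetical helper)."""
    ip = nodes[node_id]
    # Seeds are the SWIM addresses of every peer except this node.
    seeds = [f"{peer_ip}:{SWIM_PORT}" for peer, peer_ip in nodes.items() if peer != node_id]
    seed_list = ", ".join(f'"{seed}"' for seed in seeds)
    lines = [
        "[cluster]",
        "enabled = true",
        f'node_id = "{node_id}"',
        f'bind_addr = "{ip}:{GATEWAY_PORT}"',
        f'rpc_addr = "{ip}:{RPC_PORT}"',
        f'swim_addr = "{ip}:{SWIM_PORT}"',
        f"seeds = [{seed_list}]",
        "",
        "[replication]",
        f"factor = {replication_factor}",
        "",
    ]
    return "\n".join(lines)


configs = {node_id: render_config(node_id) for node_id in NODES}
```

Each value in `configs` can then be written to `/etc/stemedb/config.toml` on the matching host.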
### Step 3: Start All Nodes

```bash
# Assumes a stemedb-api systemd unit is already installed on each node.
# Start nodes sequentially (allows SWIM discovery)
ssh node1 "sudo systemctl start stemedb-api"
sleep 10

ssh node2 "sudo systemctl start stemedb-api"
sleep 10

ssh node3 "sudo systemctl start stemedb-api"
```

### Step 4: Verify Cluster Formation

```bash
# Check membership (from any node)
curl http://node1:18181/cluster/members | jq '.'

# Expected output:
# {
#   "members": [
#     {"id": "node1", "status": "UP"},
#     {"id": "node2", "status": "UP"},
#     {"id": "node3", "status": "UP"}
#   ]
# }
```

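For scripted deployments the same check can be automated. The sketch below assumes the `/cluster/members` response has exactly the shape shown in the expected output above; `unhealthy_members` and `fetch_members` are hypothetical helper names, and the hostnames are assumed to resolve from wherever the script runs.

```python
import json
import urllib.request

EXPECTED_NODES = {"node1", "node2", "node3"}


def unhealthy_members(payload, expected=EXPECTED_NODES):
    """Return the expected node IDs that are missing or not reported as UP."""
    up = {m["id"] for m in payload.get("members", []) if m.get("status") == "UP"}
    return set(expected) - up


def fetch_members(base_url="http://node1:18181", timeout=5.0):
    """Fetch the membership document from one node's cluster gateway."""
    with urllib.request.urlopen(f"{base_url}/cluster/members", timeout=timeout) as resp:
        return json.load(resp)


# Offline example using the expected output from the step above.
sample = json.loads(
    '{"members": [{"id": "node1", "status": "UP"},'
    ' {"id": "node2", "status": "UP"},'
    ' {"id": "node3", "status": "UP"}]}'
)
problems = unhealthy_members(sample)
```

Against a live cluster, `unhealthy_members(fetch_members())` returns an empty set once all three nodes have joined.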
### Step 5: Configure Load Balancer

**See:** [Nginx Config](../../deployment/nginx/stemedb.conf) or [Envoy Config](../../deployment/envoy/stemedb.yaml)

**Nginx upstream:**
```nginx
upstream stemedb_cluster {
    server node1.example.com:18180;
    server node2.example.com:18180;
    server node3.example.com:18180;
}
```

### Step 6: Set Up Monitoring

```yaml
# Prometheus scrape config
scrape_configs:
  - job_name: 'stemedb-cluster'
    static_configs:
      - targets:
          - 'node1:18180'
          - 'node2:18180'
          - 'node3:18180'
```

**Estimated deployment time:** 4-8 hours (including load balancer, monitoring)

---

## Failure Scenarios & Recovery

### Single Node Failure

**Impact:** No service disruption, automatic failover

**Recovery:**
1. Load balancer detects failed node (health check)
2. Traffic routed to 2 remaining nodes
3. Every assertion stays readable: with replication factor 2, at least one copy is on a surviving node. Assertions that had a copy on the failed node remain at a single copy until the node is replaced
4. Replace the failed node to restore full redundancy (see [Add Node Runbook](../../runbooks/add-node.md))

**RTO:** <1 minute (automatic)
**Data loss:** None (replicated data preserved)

### Two Nodes Fail (Catastrophic)

**Impact:** Single surviving node continues accepting assertion writes and serving reads. Admin operations (key management) are degraded because a single node has no peer to synchronously acknowledge them.

**Recovery:**
1. Manual intervention required to restore cluster
2. Restore failed nodes or add new nodes
3. Trigger Merkle sync (`/cluster/sync` endpoint) after nodes rejoin
4. Admin operations fully restored when cluster membership is repaired

**RTO:** 30 minutes - 2 hours (manual restore)
**Data loss:** Assertion writes continue on surviving node and merge on recovery. Recent admin operations (key revocations) issued during the degraded window may not have propagated, so audit them after recovery.

### Network Partition

**Impact:**
- **Assertion writes:** Both partitions accept writes independently. This is safe: identical content produces the same BLAKE3 hash, and different content produces different hashes that merge cleanly after the partition heals.
- **Admin operations (API key revocations, quota changes):** A revocation issued to one partition is invisible to the other until the partition heals. A revoked key may still be honored by nodes in the other partition during the partition window.

**Recovery:**
- Merkle anti-entropy detects and fills gaps automatically when partition heals
- Lenses (Recency, Authority) handle any assertion-level divergence at read time
- Admin state re-synchronizes via coordinator broadcast on reconnect

**Data loss:** None for assertions (all writes from both partitions preserved and merged).

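The revocation gap during a partition is easy to reproduce in a toy model. The sketch below assumes revocations only ever accumulate, so merging them as a set union on heal is safe; the `PartitionedNode` class and the key ID are hypothetical, and the real re-synchronization goes through the coordinator broadcast described above.

```python
class PartitionedNode:
    """Hypothetical node holding a local view of revoked API keys."""

    def __init__(self):
        self.revoked = set()

    def revoke(self, key_id):
        self.revoked.add(key_id)

    def is_accepted(self, key_id):
        return key_id not in self.revoked

    def resync_from(self, other):
        # Assumption: revocations are never undone, so a union is enough on heal.
        self.revoked |= other.revoked


left, right = PartitionedNode(), PartitionedNode()

left.revoke("key-123")                            # revocation reaches only one side
during_partition = right.is_accepted("key-123")   # the other side still honors the key

right.resync_from(left)                           # partition heals, admin state re-syncs
after_heal = right.is_accepted("key-123")
```

`during_partition` is `True` and `after_heal` is `False`, which is the window that needs auditing after a partition.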
### Replication Lag

**Impact:** Queries may see stale data (<1 minute old)

**Recovery:**
- Automatic catch-up via Merkle sync
- If lag >5 minutes, see [High Latency Runbook](../../runbooks/high-query-latency.md)

---

## Performance Characteristics

### Query Latency

**Target:** p99 <200ms at <1K queries/sec

| Metric | Single-Node | Three-Node |
|--------|-------------|------------|
| **p50** | 20ms | 25ms |
| **p95** | 50ms | 75ms |
| **p99** | 100ms | 150ms |

*3-node has slightly higher latency due to network hops, but 3x query capacity*

### Write Throughput

**Target:** 1,000 assertions/sec sustained

- Each node accepts assertion writes (routed by cluster gateway via shard prefix)
- Replication happens asynchronously via Merkle gossip
- No coordination required for assertions (CRDT-safe by content hash)
- Admin writes (API keys, quota changes) route to coordinator and require synchronous acknowledgment, so expect higher latency on those operations (~50ms vs ~5ms for assertions)

### Replication Lag

**Target:** <1 second typical, <5 seconds max

Measured by: `replication_lag_seconds` metric

---

## Network Requirements

**See:** [Network Requirements](./network-requirements.md) for full details.

### Ports (Per Node)

| Port | Protocol | Purpose | Firewall Rule |
|------|----------|---------|---------------|
| **18180** | TCP/HTTP | API (clients → nodes) | Allow from load balancer |
| **18181** | TCP/HTTP | Cluster gateway (admin only) | Allow from internal network |
| **18182** | TCP/gRPC | Replication (node ↔ node) | Allow within cluster |
| **18183** | UDP | SWIM gossip (node ↔ node) | Allow within cluster |

### Latency Requirement

**<5ms inter-node latency required**

- Deploy nodes in same region/AZ
- Private network (10 Gbps recommended)
- Test with: `ping -c 100 node2` (should show avg <5ms)

### Bandwidth

- **Replication:** ~1 Mbps per 100 assertions/sec
- **Queries:** ~10 Mbps at 1K queries/sec
- **Recommended:** 1 Gbps minimum, 10 Gbps for production

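As a quick check of those figures against the throughput targets, the estimates can be combined as follows. Linear scaling of both rates is an assumption of this sketch, and the function name is illustrative.

```python
REPLICATION_MBPS_PER_100_ASSERTIONS = 1.0  # from the replication estimate above
QUERY_MBPS_PER_1K_QPS = 10.0               # from the query estimate above


def estimated_mbps(assertions_per_sec, queries_per_sec):
    """Estimated bandwidth in Mbps, assuming both rates scale linearly."""
    replication = assertions_per_sec / 100 * REPLICATION_MBPS_PER_100_ASSERTIONS
    queries = queries_per_sec / 1000 * QUERY_MBPS_PER_1K_QPS
    return replication + queries


# Targets: 1,000 assertions/sec sustained and 1K queries/sec.
peak_mbps = estimated_mbps(1000, 1000)    # 10 Mbps replication + 10 Mbps queries
share_of_1gbps_link = peak_mbps / 1000    # fraction of a 1 Gbps link
```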
---

## Monitoring & Alerts

### Critical Metrics

```yaml
# Prometheus alerts
- alert: StemeDBNodeDown
  expr: up{job="stemedb-cluster"} == 0
  for: 1m

- alert: StemeDBReplicationLag
  expr: replication_lag_seconds > 5
  for: 5m

- alert: StemeDBQuorumLost
  # `or vector(0)` keeps the alert firing when every node is down and count() returns no series
  expr: (count(up{job="stemedb-cluster"} == 1) or vector(0)) < 2
  for: 1m
```

### Grafana Dashboard Panels

1. **Cluster Health:** Node count, status, replication lag
2. **Query Latency:** p50, p95, p99 across all nodes
3. **Ingest Rate:** Assertions/sec per node
4. **Disk Usage:** WAL + DB per node
5. **Network:** Replication bandwidth

---

## Cost Estimate (AWS, us-east-1)

| Resource | Cost |
|----------|------|
| **Compute** (3× t3.xlarge) | $300/month |
| **Storage** (3× 200GB SSD) | $60/month |
| **Load Balancer** (ALB) | $25/month |
| **Data Transfer** (internal) | $10/month |
| **Backups** (S3) | $30/month |
| **Total** | **~$425/month** |

Compare to single-node ($87/month): 5x cost for 10x availability

---

## Migration from Single-Node

**See:** [Add Node Runbook](../../runbooks/add-node.md#1-bootstrap-3-node-cluster) for detailed procedure.

**Summary:**
1. Provision 2 new nodes
2. Configure cluster on all 3
3. Restart single-node with cluster config
4. Trigger Merkle sync
5. Update load balancer

**Downtime:** 5-15 minutes for replication

---

## Scaling Path Beyond Three Nodes

Three nodes on k3s handle the 100-project target. For mass traffic beyond that, the scaling path is incremental, not a rearchitecture:

| Phase | Target | Work type | What changes |
|-------|--------|-----------|-------------|
| **Phase 1** | 1 node, 100 projects | ✅ Done | Single Deployment, Longhorn PVC, auth wired |
| **Phase 2** | 3 nodes, read-scaled | Ops-heavy | Add 2 read replicas as separate Deployments; cluster gateway routes reads round-robin |
| **Phase 3** | 3 nodes, write-sharded | Code-heavy | Gateway enforces shard ownership; each node owns ⅓ of subject hash space; reads still any-node |
| **Phase 4** | N nodes, coordinator | Code-heavy | Designate one node (or small 3-node Raft group) exclusively for mutable admin state; assertion nodes are pure data |

**What you do NOT need:** Raft on the assertion write path. The append-only, content-addressed design means there are no write conflicts to serialize. Raft belongs only on the mutable admin state path (Phase 4), which is a small fraction of total traffic.

---

## Related Documentation

- [Single-Node Pilot](./single-node-pilot.md) - Simpler architecture
- [Network Requirements](./network-requirements.md) - Firewall rules
- [Resource Sizing](./resource-sizing.md) - Hardware calculations
- [Add Node Runbook](../../runbooks/add-node.md) - Cluster operations
- [High Query Latency Runbook](../../runbooks/high-query-latency.md) - Performance troubleshooting

---

**Last Updated:** 2026-03-02 - Added architectural rationale (gossip vs Raft), k3s deployment path, fixed mutable-state coordination notes, added 4-phase scaling table