stemedb/docs/operations/reference-architecture/three-node-cluster.md
jordan 1e5ba8b946
Some checks failed
ci/woodpecker/push/woodpecker Pipeline failed
feat: wire auth bootstrap, cluster gateway, k8s deploy skill, and ops docs
- Wire auth bootstrap (root API key, startup guard, auth-first router) in main.rs
- Add cluster gateway handlers with proper error handling
- Update Dockerfile with optimized multi-stage build and .dockerignore
- Add orchard9-deploy skill for CI/CD pipeline (Gitea/Woodpecker/Kaniko/Zot)
- Add k8s deployment roadmap and provision-project-keys script
- Document production infrastructure in CLAUDE.md
- Update three-node-cluster reference architecture
- Trim hosted.rs doc comments to stay under 800-line limit
2026-03-07 00:56:31 -07:00

508 lines
17 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Three-Node Cluster Architecture
**Target:** Production deployments, enterprise pilots, high-availability requirements
**✅ RECOMMENDED FOR PRODUCTION** - Survives single node failure, automatic replication
> **Implementation status:** The cluster crates (`stemedb-cluster`, `stemedb-sync`, `stemedb-rpc`) are
> implemented. The k3s/Longhorn deployment path is the current production path (see Phase 2 section).
> Bare-metal deployment via config.toml is aspirational and not yet wired to the binary.
---
## Architectural Rationale: Why Gossip, Not Raft
This section exists because the wrong answer here — "just add Raft" — is commonly assumed and actively harmful for StemeDB's workload.
### The append-only insight
Most databases need Raft because writes can **conflict**: two nodes update the same row, and a leader must serialize them. StemeDB doesn't have this problem. Every assertion receives a **BLAKE3 content hash** as its ID. If two nodes both write the same assertion independently, they produce the same hash → the same key → identical data. There is nothing to conflict on.
This means the assertion write path is naturally **CRDT-like**: the system needs every node to eventually receive every assertion, but doesn't need consensus on which assertion "wins." Gossip + Merkle anti-entropy handles this correctly and efficiently. Raft would add leader-election overhead and write latency for zero benefit on the data path.
### What actually needs coordination
Not everything in StemeDB is append-only. Mutable state requires stronger guarantees:
| State | Type | Replication strategy | Why |
|-------|------|---------------------|-----|
| Assertions | Append-only (CRDT) | Gossip + Merkle anti-entropy | No conflicts possible by design |
| API keys | Mutable | Synchronous broadcast or coordinator | Revoked key must not be reusable |
| Quota / meter counts | Mutable counter | Coordinator node or bounded staleness | Double-spend if two nodes both allow |
| Circuit breaker state | Mutable | Synchronous broadcast | Trip/reset must propagate atomically |
| Epochs | Append-only, ordered | Gossip is sufficient | Creation order captured in content |
**Practical implication:** Admin operations (key management, quota changes) should be routed to a designated coordinator and synchronously acknowledged. Assertion writes and reads can go to any node with no coordination.
### Read scaling
Because Lens resolution is pure local computation on indexed data, **any node can serve any read with no coordination**. Reads scale horizontally to N nodes without inter-node communication.
### Write scaling (assertions)
Because assertion writes are idempotent by content hash, **any node can accept any write**. The cluster gateway (`stemedb-cluster`, port 18181) routes writes by subject hash shard prefix — each node owns a partition of the key space. Merkle anti-entropy ensures all nodes converge. Write throughput scales linearly with node count.
---
## Overview
The three-node cluster provides high availability through automatic replication (factor 2) and gossip-based eventual consistency for assertions. Survives single node failure with <5 minute recovery time.
```
┌──────────────────────────────┐
Internet ──→ LB → │ Cluster Gateway (port 18181) │
│ Reads: round-robin any node │
│ Writes: route by shard prefix│
└──────┬────────────┬───────────┘
│ │
┌──────────▼──┐ ┌────▼─────────┐ ┌──────────────┐
│ Node A │ │ Node B │ │ Node C │
│ Shard 0-84 │ │ Shard 85-169 │ │ Shard 170-255│
│ WAL + KV │ │ WAL + KV │ │ WAL + KV │
└──────┬──────┘ └────┬──────────┘ └───────┬──────┘
│ Merkle sync (gossip, port 18183) │
└─────────────────────────────────────────┘
```
---
## Target Specifications
| Metric | Value |
|--------|-------|
| **Assertions** | <100,000 |
| **Queries/sec** | <1,000 |
| **Concurrent users** | <500 |
| **Availability** | 99.9% (survives 1 node failure) |
| **RTO** | 5 minutes (automatic failover) |
| **RPO** | 1 minute (replication lag) |
| **Consistency** | Eventual (via CRDTs + Merkle sync) |
---
## Hardware Requirements (Per Node)
### Minimum (Pilot <50K assertions)
- **CPU:** 4 vCPUs
- **RAM:** 8GB
- **Disk:** 100GB SSD (50GB WAL + 50GB DB)
- **Network:** 1 Gbps, <5ms inter-node latency
**Example instances (per node):**
- AWS: `t3.large` (2 vCPU, 8GB) × 3 = $180/month
- GCP: `n2-standard-2` (2 vCPU, 8GB) × 3 = $195/month
- Azure: `Standard_D2s_v3` (2 vCPU, 8GB) × 3 = $140/month
### Recommended (Production <100K assertions)
- **CPU:** 8 vCPUs
- **RAM:** 16GB
- **Disk:** 200GB SSD (100GB WAL + 100GB DB)
- **Network:** 10 Gbps, <5ms inter-node latency
**Example instances (per node):**
- AWS: `t3.xlarge` (4 vCPU, 16GB) × 3 = $300/month
- GCP: `n2-standard-4` (4 vCPU, 16GB) × 3 = $390/month
- Azure: `Standard_D4s_v3` (4 vCPU, 16GB) × 3 = $280/month
**See:** [Resource Sizing Guide](./resource-sizing.md) for detailed calculations.
---
## Architecture Components
### Node Layout
Each node runs the full stack:
- **stemedb-api** (port 18180) - HTTP API, queries, ingest
- **stemedb-gateway** (port 18181) - Cluster coordination
- **stemedb-rpc** (port 18182) - gRPC replication
- **SWIM gossip** (port 18183) - Membership, failure detection
### Replication
**CRDT-based with Merkle sync:**
- Writes accepted locally (optimistic)
- Background Merkle tree comparison
- Automatic sync of missing assertions
- No distributed transactions
**Replication factor 2:**
- Each assertion stored on 2 nodes
- Survives 1 node failure
- Read from any node (eventually consistent)
### Load Balancing
**Round-robin across all nodes:**
- Nginx or Envoy distribute queries
- No "primary" node (all equal)
- Health checks remove failed nodes
---
## k3s Deployment Path (Current — Longhorn + StatefulSet)
> This is the **current production deployment path** for k3s-fleet. The bare-metal steps below
> are for non-k8s environments and use a config.toml interface that is not yet wired to the binary.
For each cluster node on k3s, deploy a separate StatefulSet with its own Longhorn PVC:
```
k3s-fleet/deployments/k8s/base/stemedb/
├── stemedb.yaml # Node A (current single-node — Phase 1)
├── stemedb-b.yaml # Node B (Phase 2 — add when ready to scale reads)
├── stemedb-c.yaml # Node C (Phase 2)
└── kustomization.yaml
```
**Critical k3s constraints:**
- Each node needs its own `ReadWriteOnce` Longhorn PVC embedded KV (fjall) cannot share a volume
- Use `strategy: Recreate` on each Deployment (not RollingUpdate) RWO PVC + 2 pods = deadlock
- Cluster gateway (port 18181) must be exposed as a separate Service for inter-node routing
- Use `topologySpreadConstraints` to ensure nodes land on different k3s worker hosts
**Phase 2 read-replica k8s addition (when ready):**
```yaml
# Add to stemedb-b.yaml — identical to stemedb.yaml except:
# - Different node ID env var
# - STEMEDB_CLUSTER_SEEDS pointing to Node A's gateway ClusterIP
# - Its own PVC claim
env:
- name: STEMEDB_NODE_ID
value: "node-b"
- name: STEMEDB_CLUSTER_SEEDS
value: "stemedb-api.stemedb.svc:18181"
```
**See:** [k8s Deploy Roadmap](../deployment/k8s-deploy-roadmap.md) for the phased rollout plan.
---
## Bare-Metal Deployment Steps
> ⚠️ The config.toml cluster configuration shown here is **planned** and not yet wired to the
> `stemedb-api` binary. Current binary configuration uses environment variables only. This section
> documents the intended interface for when cluster config is implemented.
### Prerequisites
- [ ] 3 servers provisioned (same specs)
- [ ] Private network with <5ms latency
- [ ] DNS records created
- [ ] TLS certificates provisioned
### Step 1: Install StemeDB on All Nodes
```bash
# On each node (node1, node2, node3):
sudo curl -L https://github.com/yourorg/stemedb/releases/download/v0.1.0/stemedb-api -o /usr/local/bin/stemedb-api
sudo chmod +x /usr/local/bin/stemedb-api
sudo mkdir -p /data/{wal,db}
sudo useradd -r -s /bin/false stemedb
sudo chown -R stemedb:stemedb /data
```
### Step 2: Configure Cluster
**Node 1:**
```toml
# /etc/stemedb/config.toml
[cluster]
enabled = true
node_id = "node1"
bind_addr = "10.0.1.51:18181"
rpc_addr = "10.0.1.51:18182"
swim_addr = "10.0.1.51:18183"
seeds = ["10.0.1.52:18183", "10.0.1.53:18183"]
[replication]
factor = 2
```
**Node 2:**
```toml
[cluster]
enabled = true
node_id = "node2"
bind_addr = "10.0.1.52:18181"
rpc_addr = "10.0.1.52:18182"
swim_addr = "10.0.1.52:18183"
seeds = ["10.0.1.51:18183", "10.0.1.53:18183"]
[replication]
factor = 2
```
**Node 3:**
```toml
[cluster]
enabled = true
node_id = "node3"
bind_addr = "10.0.1.53:18181"
rpc_addr = "10.0.1.53:18182"
swim_addr = "10.0.1.53:18183"
seeds = ["10.0.1.51:18183", "10.0.1.52:18183"]
[replication]
factor = 2
```
### Step 3: Start All Nodes
```bash
# Start nodes sequentially (allows SWIM discovery)
ssh node1 "sudo systemctl start stemedb-api"
sleep 10
ssh node2 "sudo systemctl start stemedb-api"
sleep 10
ssh node3 "sudo systemctl start stemedb-api"
```
### Step 4: Verify Cluster Formation
```bash
# Check membership (from any node)
curl http://node1:18181/cluster/members | jq '.'
# Expected output:
# {
# "members": [
# {"id": "node1", "status": "UP"},
# {"id": "node2", "status": "UP"},
# {"id": "node3", "status": "UP"}
# ]
# }
```
### Step 5: Configure Load Balancer
**See:** [Nginx Config](../../deployment/nginx/stemedb.conf) or [Envoy Config](../../deployment/envoy/stemedb.yaml)
**Nginx upstream:**
```nginx
upstream stemedb_cluster {
server node1.example.com:18180;
server node2.example.com:18180;
server node3.example.com:18180;
}
```
### Step 6: Set Up Monitoring
```yaml
# Prometheus scrape config
scrape_configs:
- job_name: 'stemedb-cluster'
static_configs:
- targets:
- 'node1:18180'
- 'node2:18180'
- 'node3:18180'
```
**Estimated deployment time:** 4-8 hours (including load balancer, monitoring)
---
## Failure Scenarios & Recovery
### Single Node Failure
**Impact:** No service disruption, automatic failover
**Recovery:**
1. Load balancer detects failed node (health check)
2. Traffic routed to 2 remaining nodes
3. Replication factor maintained (assertions still on 2 nodes)
4. Replace failed node when convenient (see [Add Node Runbook](../../runbooks/add-node.md))
**RTO:** <1 minute (automatic)
**Data loss:** None (replicated data preserved)
### Two Nodes Fail (Catastrophic)
**Impact:** Single surviving node continues accepting assertion writes and serving reads. Admin operations (key management) are degraded single-node has no peer to synchronously acknowledge.
**Recovery:**
1. Manual intervention required to restore cluster
2. Restore failed nodes or add new nodes
3. Trigger Merkle sync (`/cluster/sync` endpoint) after nodes rejoin
4. Admin operations fully restored when cluster membership is repaired
**RTO:** 30 minutes - 2 hours (manual restore)
**Data loss:** Assertion writes continue on surviving node and merge on recovery. Recent admin operations (key revocations) issued during degraded window may not have propagated audit after recovery.
### Network Partition
**Impact:**
- **Assertion writes:** Both partitions accept writes independently. This is safe same content same BLAKE3 hash, different content different hashes that merge cleanly after partition heals.
- **Admin operations (API key revocations, quota changes):** A revocation issued to one partition is invisible to the other until partition heals. A revoked key may still be honored by nodes in the other partition during the partition window.
**Recovery:**
- Merkle anti-entropy detects and fills gaps automatically when partition heals
- Lenses (Recency, Authority) handle any assertion-level divergence at read time
- Admin state re-synchronizes via coordinator broadcast on reconnect
**Data loss:** None for assertions (all writes from both partitions preserved and merged).
### Replication Lag
**Impact:** Queries may see stale data (<1 minute old)
**Recovery:**
- Automatic catch-up via Merkle sync
- If lag >5 minutes, see [High Latency Runbook](../../runbooks/high-query-latency.md)
---
## Performance Characteristics
### Query Latency
**Target:** p99 <200ms at <1K queries/sec
| Metric | Single-Node | Three-Node |
|--------|-------------|------------|
| **p50** | 20ms | 25ms |
| **p95** | 50ms | 75ms |
| **p99** | 100ms | 150ms |
*3-node has slightly higher latency due to network hops, but 3x query capacity*
### Write Throughput
**Target:** 1,000 assertions/sec sustained
- Each node accepts assertion writes (routed by cluster gateway via shard prefix)
- Replication happens asynchronously via Merkle gossip
- No coordination required for assertions (CRDT-safe by content hash)
- Admin writes (API keys, quota changes) route to coordinator and require synchronous acknowledgment expect higher latency on those operations (~50ms vs ~5ms for assertions)
### Replication Lag
**Target:** <1 second typical, <5 seconds max
Measured by: `replication_lag_seconds` metric
---
## Network Requirements
**See:** [Network Requirements](./network-requirements.md) for full details.
### Ports (Per Node)
| Port | Protocol | Purpose | Firewall Rule |
|------|----------|---------|---------------|
| **18180** | TCP/HTTP | API (clients nodes) | Allow from load balancer |
| **18181** | TCP/HTTP | Cluster gateway (admin only) | Allow from internal network |
| **18182** | TCP/gRPC | Replication (node node) | Allow within cluster |
| **18183** | UDP | SWIM gossip (node node) | Allow within cluster |
### Latency Requirement
**<5ms inter-node latency required**
- Deploy nodes in same region/AZ
- Private network (10 Gbps recommended)
- Test with: `ping -c 100 node2` (should show avg <5ms)
### Bandwidth
- **Replication:** ~1 Mbps per 100 assertions/sec
- **Queries:** ~10 Mbps at 1K queries/sec
- **Recommended:** 1 Gbps minimum, 10 Gbps for production
---
## Monitoring & Alerts
### Critical Metrics
```yaml
# Prometheus alerts
- alert: StemeDBNodeDown
expr: up{job="stemedb-cluster"} == 0
for: 1m
- alert: StemeDBReplicationLag
expr: replication_lag_seconds > 5
for: 5m
- alert: StemeDBQuorumLost
expr: count(up{job="stemedb-cluster"} == 1) < 2
for: 1m
```
### Grafana Dashboard Panels
1. **Cluster Health:** Node count, status, replication lag
2. **Query Latency:** p50, p95, p99 across all nodes
3. **Ingest Rate:** Assertions/sec per node
4. **Disk Usage:** WAL + DB per node
5. **Network:** Replication bandwidth
---
## Cost Estimate (AWS, us-east-1)
| Resource | Cost |
|----------|------|
| **Compute** (3× t3.xlarge) | $300/month |
| **Storage** (3× 200GB SSD) | $60/month |
| **Load Balancer** (ALB) | $25/month |
| **Data Transfer** (internal) | $10/month |
| **Backups** (S3) | $30/month |
| **Total** | **~$425/month** |
Compare to single-node ($87/month): 5x cost for 10x availability
---
## Migration from Single-Node
**See:** [Add Node Runbook](../../runbooks/add-node.md#1-bootstrap-3-node-cluster) for detailed procedure.
**Summary:**
1. Provision 2 new nodes
2. Configure cluster on all 3
3. Restart single-node with cluster config
4. Trigger Merkle sync
5. Update load balancer
**Downtime:** 5-15 minutes for replication
---
## Scaling Path Beyond Three Nodes
Three nodes on k3s handles the 100-project target. For mass traffic beyond that, the scaling path is incremental not a rearchitecture:
| Phase | Target | Work type | What changes |
|-------|--------|-----------|-------------|
| **Phase 1** | 1 node, 100 projects | Done | Single Deployment, Longhorn PVC, auth wired |
| **Phase 2** | 3 nodes, read-scaled | Ops-heavy | Add 2 read replicas as separate Deployments; cluster gateway routes reads round-robin |
| **Phase 3** | 3 nodes, write-sharded | Code-heavy | Gateway enforces shard ownership; each node owns of subject hash space; reads still any-node |
| **Phase 4** | N nodes, coordinator | Code-heavy | Designate one node (or small 3-node Raft group) exclusively for mutable admin state; assertion nodes are pure data |
**What you do NOT need:** Raft on the assertion write path. The append-only, content-addressed design means there are no write conflicts to serialize. Raft belongs only on the mutable admin state path (Phase 4), which is a small fraction of total traffic.
---
## Related Documentation
- [Single-Node Pilot](./single-node-pilot.md) - Simpler architecture
- [Network Requirements](./network-requirements.md) - Firewall rules
- [Resource Sizing](./resource-sizing.md) - Hardware calculations
- [Add Node Runbook](../../runbooks/add-node.md) - Cluster operations
- [High Query Latency Runbook](../../runbooks/high-query-latency.md) - Performance troubleshooting
---
**Last Updated:** 2026-03-02 Added architectural rationale (gossip vs Raft), k3s deployment path, fixed mutable-state coordination notes, added 4-phase scaling table