Three-Node Cluster Architecture

Target: Production deployments, enterprise pilots, high-availability requirements

✅ RECOMMENDED FOR PRODUCTION - Survives single node failure, automatic replication

Implementation status: The cluster crates (stemedb-cluster, stemedb-sync, stemedb-rpc) are implemented. The k3s/Longhorn deployment path is the current production path (see Phase 2 section). Bare-metal deployment via config.toml is aspirational and not yet wired to the binary.

Architectural Rationale: Why Gossip, Not Raft

This section exists because the wrong answer here — "just add Raft" — is commonly assumed and actively harmful for StemeDB's workload.

The append-only insight

Most databases need Raft because writes can conflict: two nodes update the same row, and a leader must serialize them. StemeDB doesn't have this problem. Every assertion receives a BLAKE3 content hash as its ID. If two nodes both write the same assertion independently, they produce the same hash → the same key → identical data. There is nothing to conflict on.

This means the assertion write path is naturally CRDT-like: the system needs every node to eventually receive every assertion, but doesn't need consensus on which assertion "wins." Gossip + Merkle anti-entropy handles this correctly and efficiently. Raft would add leader-election overhead and write latency for zero benefit on the data path.
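The idempotence argument above can be illustrated with a minimal sketch. It makes several assumptions: the assertion fields (`subject`, `predicate`, `object`) and the `Node` class are hypothetical, the canonical JSON serialization is an illustration rather than StemeDB's wire format, and BLAKE2b from the Python standard library stands in for BLAKE3.

```python
import hashlib
import json


def assertion_id(assertion):
    """Derive a content-addressed ID from a canonical serialization.

    StemeDB uses BLAKE3; hashlib has no BLAKE3, so BLAKE2b stands in here.
    """
    canonical = json.dumps(assertion, sort_keys=True, separators=(",", ":"))
    return hashlib.blake2b(canonical.encode("utf-8"), digest_size=32).hexdigest()


class Node:
    """A toy node: an append-only map from content hash to assertion."""

    def __init__(self):
        self.store = {}

    def write(self, assertion):
        key = assertion_id(assertion)
        # setdefault makes the write idempotent: a repeated write is a no-op.
        self.store.setdefault(key, assertion)
        return key

    def merge_from(self, other):
        # Union of two append-only sets; equal keys imply equal content.
        for key, value in other.store.items():
            self.store.setdefault(key, value)


node_a, node_b = Node(), Node()
same = {"subject": "sky", "predicate": "color", "object": "blue"}
id_a = node_a.write(same)
# Same content with a different key order still hashes to the same ID.
id_b = node_b.write(dict(reversed(list(same.items()))))
node_b.write({"subject": "grass", "predicate": "color", "object": "green"})

# Exchanging state in either order converges both nodes to the same store.
node_a.merge_from(node_b)
node_b.merge_from(node_a)
```

Because the merge is a set union keyed by content hash, it is commutative and idempotent, which is the property that lets gossip replace consensus on this path.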

What actually needs coordination

Not everything in StemeDB is append-only. Mutable state requires stronger guarantees:

State	Type	Replication strategy	Why
Assertions	Append-only (CRDT)	Gossip + Merkle anti-entropy	No conflicts possible by design
API keys	Mutable	Synchronous broadcast or coordinator	Revoked key must not be reusable
Quota / meter counts	Mutable counter	Coordinator node or bounded staleness	Double-spend if two nodes both allow
Circuit breaker state	Mutable	Synchronous broadcast	Trip/reset must propagate atomically
Epochs	Append-only, ordered	Gossip is sufficient	Creation order captured in content

Practical implication: Admin operations (key management, quota changes) should be routed to a designated coordinator and synchronously acknowledged. Assertion writes and reads can go to any node with no coordination.
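As a rough illustration of that routing rule, the sketch below classifies operations by whether they touch mutable admin state. The operation names and the `Route` enum are hypothetical and do not correspond to StemeDB's actual API.

```python
from enum import Enum


class Route(Enum):
    ANY_NODE = "any_node"        # no coordination needed
    COORDINATOR = "coordinator"  # synchronously acknowledged


# Hypothetical operation names, grouped according to the table above.
ADMIN_OPS = {"create_api_key", "revoke_api_key", "set_quota"}
APPEND_ONLY_OPS = {"write_assertion", "read_assertion", "create_epoch"}


def route_for(operation):
    """Pick a routing target based on the kind of state the operation touches."""
    if operation in ADMIN_OPS:
        return Route.COORDINATOR
    if operation in APPEND_ONLY_OPS:
        return Route.ANY_NODE
    # Refuse to guess for operations that have not been classified.
    raise ValueError(f"unclassified operation: {operation}")
```

Raising on unclassified operations is a deliberate choice in this sketch: silently sending an unknown mutable operation to any node would reintroduce the consistency problem the coordinator exists to prevent.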

Read scaling

Because Lens resolution is pure local computation on indexed data, any node can serve any read with no coordination. Reads scale horizontally to N nodes without inter-node communication.

Write scaling (assertions)

Because assertion writes are idempotent by content hash, any node can accept any write. The cluster gateway (stemedb-cluster, port 18181) routes writes by subject hash shard prefix — each node owns a partition of the key space. Merkle anti-entropy ensures all nodes converge. Write throughput scales linearly with node count.

Overview

The three-node cluster provides high availability through automatic replication (factor 2) and gossip-based eventual consistency for assertions. It survives a single-node failure with a recovery time under 5 minutes.

                    ┌──────────────────────────────┐
Internet ──→  LB → │  Cluster Gateway (port 18181) │
                    │  Reads: round-robin any node  │
                    │  Writes: route by shard prefix│
                    └──────┬────────────┬───────────┘
                           │            │
                ┌──────────▼──┐    ┌────▼─────────┐    ┌──────────────┐
                │   Node A    │    │   Node B      │    │   Node C     │
                │ Shard 0-84  │    │ Shard 85-169  │    │ Shard 170-255│
                │ WAL + KV    │    │ WAL + KV      │    │ WAL + KV     │
                └──────┬──────┘    └────┬──────────┘    └───────┬──────┘
                       │     Merkle sync (gossip, port 18183)    │
                       └─────────────────────────────────────────┘
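The write routing in the diagram can be sketched as follows. The shard ranges are copied from the diagram; taking the first byte of the subject hash as the shard, the node names, and the use of BLAKE2b in place of BLAKE3 are assumptions made for illustration.

```python
import hashlib

# Inclusive shard ranges copied from the diagram above; node names are illustrative.
SHARD_RANGES = [("node-a", 0, 84), ("node-b", 85, 169), ("node-c", 170, 255)]


def shard_of(subject):
    """Use the first byte of the subject hash as a shard in 0..255.

    BLAKE2b stands in for BLAKE3, which is not in the standard library.
    """
    return hashlib.blake2b(subject.encode("utf-8"), digest_size=32).digest()[0]


def owner_of(shard):
    """Return the node whose range contains the shard."""
    for node, low, high in SHARD_RANGES:
        if low <= shard <= high:
            return node
    raise ValueError(f"shard {shard} is outside 0..255")


def route_write(subject):
    """Route a write to the node that owns the subject's shard."""
    return owner_of(shard_of(subject))
```

Routing is a pure function of the subject, so every gateway instance reaches the same decision without sharing state.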

Target Specifications

Metric	Value
Assertions	<100,000
Queries/sec	<1,000
Concurrent users	<500
Availability	99.9% (survives 1 node failure)
RTO	5 minutes (automatic failover)
RPO	1 minute (replication lag)
Consistency	Eventual (via CRDTs + Merkle sync)
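To put the availability figure in concrete terms, the arithmetic below converts 99.9% into a monthly downtime budget. It assumes a 30-day month and, pessimistically, that every failover consumes the full 5-minute RTO from the table.

```python
MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # 43,200 minutes


def downtime_budget_minutes(availability, period_minutes=MINUTES_PER_30_DAY_MONTH):
    """Minutes of downtime allowed in the period at the given availability."""
    return (1.0 - availability) * period_minutes


budget = downtime_budget_minutes(0.999)  # about 43.2 minutes per 30-day month
rto_minutes = 5                          # RTO from the table above
full_rto_failovers = int(budget // rto_minutes)
print(f"budget: {budget:.1f} min, failovers at full RTO: {full_rto_failovers}")
# prints "budget: 43.2 min, failovers at full RTO: 8"
```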

Hardware Requirements (Per Node)

Minimum (Pilot <50K assertions)

CPU: 4 vCPUs
RAM: 8GB
Disk: 100GB SSD (50GB WAL + 50GB DB)
Network: 1 Gbps, <5ms inter-node latency

Example instances (per node):

AWS: t3.large (2 vCPU, 8GB) × 3 = $180/month
GCP: n2-standard-2 (2 vCPU, 8GB) × 3 = $195/month
Azure: Standard_D2s_v3 (2 vCPU, 8GB) × 3 = $140/month

Recommended (Production <100K assertions)

CPU: 8 vCPUs
RAM: 16GB
Disk: 200GB SSD (100GB WAL + 100GB DB)
Network: 10 Gbps, <5ms inter-node latency

Example instances (per node):

AWS: t3.xlarge (4 vCPU, 16GB) × 3 = $300/month
GCP: n2-standard-4 (4 vCPU, 16GB) × 3 = $390/month
Azure: Standard_D4s_v3 (4 vCPU, 16GB) × 3 = $280/month

See: Resource Sizing Guide for detailed calculations.

Architecture Components

Node Layout

Each pod runs two binaries via scripts/entrypoint.sh:

stemedb-api (port 18180) — HTTP API, WAL, KV store, ingestion, query
stemedb-node (port 18181) — Cluster Gateway HTTP (client-facing, routes by shard)
stemedb-node (port 18182) — gRPC (node-to-node sync)
stemedb-node (port 18183) — SWIM gossip (membership, failure detection)

Replication

CRDT-based with Merkle sync:

Writes accepted locally (optimistic)
Background Merkle tree comparison
Automatic sync of missing assertions
No distributed transactions
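The background comparison can be sketched with a deliberately simplified two-level tree: one root digest over a fixed number of bucket digests. The bucket count, the use of SHA-256, and the in-memory dictionaries are assumptions of this sketch; a production Merkle tree would be deeper and would exchange digests over the network.

```python
import hashlib

NUM_BUCKETS = 16  # assumption: leaf count chosen for illustration


def key_for(content):
    """Content-addressed key (SHA-256 stands in for BLAKE3)."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def bucket_of(key):
    """Assign a hex key to a bucket using its first byte."""
    return int(key[:2], 16) % NUM_BUCKETS


def bucket_digests(store):
    """Leaf digests: one hash per bucket over its sorted keys."""
    buckets = [[] for _ in range(NUM_BUCKETS)]
    for key in store:
        buckets[bucket_of(key)].append(key)
    return [hashlib.sha256("".join(sorted(keys)).encode()).hexdigest()
            for keys in buckets]


def root_digest(leaves):
    """Root digest over all leaf digests."""
    return hashlib.sha256("".join(leaves).encode()).hexdigest()


def sync(local, remote):
    """Copy assertions that remote has and local lacks; return the count."""
    local_leaves, remote_leaves = bucket_digests(local), bucket_digests(remote)
    if root_digest(local_leaves) == root_digest(remote_leaves):
        return 0  # roots match, nothing to transfer
    copied = 0
    for index in range(NUM_BUCKETS):
        if local_leaves[index] == remote_leaves[index]:
            continue  # this bucket is already identical
        for key, value in remote.items():
            if bucket_of(key) == index and key not in local:
                local[key] = value
                copied += 1
    return copied


node_a = {key_for(text): text for text in ["x", "y"]}
node_b = {key_for(text): text for text in ["y", "z"]}
pulled_by_a = sync(node_a, node_b)  # node_a receives "z"
pulled_by_b = sync(node_b, node_a)  # node_b receives "x"
```

Only buckets whose digests differ are examined, so two nearly identical nodes exchange little beyond the digests themselves.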

Replication factor 2:

Each assertion stored on 2 nodes
Survives 1 node failure
Read from any node (eventually consistent)
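One way to see why factor 2 tolerates a single failure is to place replicas and then remove each node in turn. The placement scheme below (shard index modulo node count, plus the next node in sorted order) is an assumption for illustration and is not confirmed as StemeDB's algorithm.

```python
def replica_nodes(shard, nodes, replication_factor=2):
    """Pick replication_factor consecutive nodes, starting at shard % len(nodes)."""
    if replication_factor > len(nodes):
        raise ValueError("replication factor exceeds node count")
    ordered = sorted(nodes)
    start = shard % len(ordered)
    return [ordered[(start + offset) % len(ordered)]
            for offset in range(replication_factor)]


nodes = ["stemedb-0", "stemedb-1", "stemedb-2"]
placement = {shard: replica_nodes(shard, nodes) for shard in range(4)}

# Every shard keeps at least one replica whichever single node is lost.
survives_any_single_failure = all(
    any(node != failed for node in replicas)
    for failed in nodes
    for replicas in placement.values()
)
```

The two replicas of a shard are always distinct nodes, so losing one node leaves a live copy of every shard.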

Load Balancing

Round-robin across all nodes:

Nginx or Envoy distribute queries
No "primary" node (all equal)
Health checks remove failed nodes
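The behaviour listed above, which Nginx or Envoy provide out of the box, amounts to the following. This is a conceptual sketch only; the class and method names are hypothetical, and health is set manually here rather than by probing.

```python
import itertools


class RoundRobinBalancer:
    """Rotate over all nodes, skipping any that are marked unhealthy."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.healthy = set(self.nodes)
        self._cycle = itertools.cycle(self.nodes)

    def mark(self, node, is_healthy):
        """Record the result of a health check for one node."""
        if is_healthy:
            self.healthy.add(node)
        else:
            self.healthy.discard(node)

    def next_node(self):
        """Return the next healthy node in rotation."""
        # One full pass over the ring is enough to find a healthy node if any exists.
        for _ in range(len(self.nodes)):
            candidate = next(self._cycle)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy nodes available")


balancer = RoundRobinBalancer(["node1", "node2", "node3"])
first_round = [balancer.next_node() for _ in range(3)]
balancer.mark("node2", False)  # a failed health check removes node2
after_failure = [balancer.next_node() for _ in range(4)]
```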

k3s Deployment (Current — StatefulSet + Longhorn)

This is the current production deployment on k3s-fleet. Deployed and running as of 2026-03-07. The bare-metal steps below are for non-k8s environments and use a config.toml interface that is not yet wired to the binary.

A single 3-replica StatefulSet with VolumeClaimTemplates provides each pod its own 50Gi Longhorn PVC:

k3s-fleet/deployments/k8s/base/stemedb/
└── stemedb.yaml          # Everything: ExternalSecret, headless Service, gateway Service,
                          #   3-replica StatefulSet, Ingress

Architecture per pod (dual-binary via entrypoint.sh):

stemedb-api (:18180) — storage engine (WAL + KV)
stemedb-node (:18181) — Gateway HTTP (client-facing, cluster routing)
stemedb-node (:18182) — gRPC (node-to-node sync)
stemedb-node (:18183) — SWIM gossip (membership)

Pod DNS: stemedb-{0,1,2}.stemedb-headless.stemedb.svc.cluster.local

Key k8s resources:

Headless Service (stemedb-headless) — stable DNS for all 4 ports
Gateway Service (stemedb-gateway) — ClusterIP routing to Gateway port 18181
Ingress — stemedb.threesix.ai → stemedb-gateway:18181 via Traefik

Environment variables (set in StatefulSet):

STEMEDB_CLUSTER_MODE=true — starts both binaries via entrypoint.sh
STEMEDB_NODE_ID — from metadata.name (stemedb-0, stemedb-1, stemedb-2) → stable NodeId via BLAKE3
STEMEDB_SEED_NODES — all 3 headless DNS names on port 18182
STEMEDB_NUM_SHARDS=4, STEMEDB_REPLICATION_FACTOR=2
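For illustration, the sketch below shows how these variables could be read and how the seed list follows from the pod DNS pattern. The variable names, DNS suffix, and port come from this section; the comma-separated seed format and the defaults are assumptions, and this is not the binary's actual parsing code.

```python
import os

HEADLESS_SUFFIX = "stemedb-headless.stemedb.svc.cluster.local"  # from the Pod DNS line
GRPC_PORT = 18182


def seed_nodes(replicas=3):
    """Build the headless DNS names for all replicas on the gRPC port."""
    return [f"stemedb-{index}.{HEADLESS_SUFFIX}:{GRPC_PORT}"
            for index in range(replicas)]


def load_cluster_env(env=None):
    """Read cluster settings from an environment mapping.

    Assumes STEMEDB_SEED_NODES is comma-separated; the real format may differ.
    """
    env = os.environ if env is None else env
    return {
        "cluster_mode": env.get("STEMEDB_CLUSTER_MODE", "false").lower() == "true",
        "node_id": env["STEMEDB_NODE_ID"],  # required: raises KeyError if missing
        "seed_nodes": [s for s in env.get("STEMEDB_SEED_NODES", "").split(",") if s],
        "num_shards": int(env.get("STEMEDB_NUM_SHARDS", "4")),
        "replication_factor": int(env.get("STEMEDB_REPLICATION_FACTOR", "2")),
    }


example_env = {
    "STEMEDB_CLUSTER_MODE": "true",
    "STEMEDB_NODE_ID": "stemedb-0",
    "STEMEDB_SEED_NODES": ",".join(seed_nodes()),
    "STEMEDB_NUM_SHARDS": "4",
    "STEMEDB_REPLICATION_FACTOR": "2",
}
config = load_cluster_env(example_env)
```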

CI/CD: Woodpecker pipeline (.woodpecker.yml) → Kaniko builds → Zot registry (registry.threesix.ai) → kubectl set image statefulset/stemedb

See: k8s Deploy Roadmap for the phased rollout plan.

Bare-Metal Deployment Steps

⚠️ The config.toml cluster configuration shown here is planned and not yet wired to the stemedb-api binary. Current binary configuration uses environment variables only. This section documents the intended interface for when cluster config is implemented.

Prerequisites

3 servers provisioned (same specs)
Private network with <5ms latency
DNS records created
TLS certificates provisioned

Step 1: Install StemeDB on All Nodes

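# On each node (node1, node2, node3):
# -f makes curl fail on an HTTP error instead of saving the error page as the binary.
sudo curl -fL https://github.com/yourorg/stemedb/releases/download/v0.1.0/stemedb-api -o /usr/local/bin/stemedb-api
sudo chmod +x /usr/local/bin/stemedb-api

# Data directories for the WAL and the DB, plus the config directory used in Step 2.
sudo mkdir -p /data/{wal,db} /etc/stemedb
# Dedicated system user with no login shell.
sudo useradd -r -s /bin/false stemedb
sudo chown -R stemedb:stemedb /data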

Step 2: Configure Cluster

Node 1:

# /etc/stemedb/config.toml
[cluster]
enabled = true
node_id = "node1"
bind_addr = "10.0.1.51:18181"
rpc_addr = "10.0.1.51:18182"
swim_addr = "10.0.1.51:18183"
seeds = ["10.0.1.52:18183", "10.0.1.53:18183"]

[replication]
factor = 2

Node 2:

[cluster]
enabled = true
node_id = "node2"
bind_addr = "10.0.1.52:18181"
rpc_addr = "10.0.1.52:18182"
swim_addr = "10.0.1.52:18183"
seeds = ["10.0.1.51:18183", "10.0.1.53:18183"]

[replication]
factor = 2

Node 3:

[cluster]
enabled = true
node_id = "node3"
bind_addr = "10.0.1.53:18181"
rpc_addr = "10.0.1.53:18182"
swim_addr = "10.0.1.53:18183"
seeds = ["10.0.1.51:18183", "10.0.1.52:18183"]

[replication]
factor = 2

Step 3: Start All Nodes

# Start nodes sequentially (allows SWIM discovery)
ssh node1 "sudo systemctl start stemedb-api"
sleep 10

ssh node2 "sudo systemctl start stemedb-api"
sleep 10

ssh node3 "sudo systemctl start stemedb-api"

Step 4: Verify Cluster Formation

# Check membership (from any node)
curl http://node1:18181/cluster/members | jq '.'

# Expected output:
# {
#   "members": [
#     {"id": "node1", "status": "UP"},
#     {"id": "node2", "status": "UP"},
#     {"id": "node3", "status": "UP"}
#   ]
# }
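The same check can be scripted. The sketch below assumes the response has exactly the shape shown in the expected output above; the function names are hypothetical, and the endpoint is the one used in the curl command.

```python
import json
import urllib.request

EXPECTED_NODES = {"node1", "node2", "node3"}


def cluster_is_healthy(payload, expected_ids=EXPECTED_NODES):
    """True when every expected node is listed with status UP."""
    up = {member["id"] for member in payload.get("members", [])
          if member.get("status") == "UP"}
    return set(expected_ids) <= up


def fetch_members(url="http://node1:18181/cluster/members", timeout=5.0):
    """Fetch and decode the membership document from a node."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


# Offline example using the expected output shown above.
sample = {
    "members": [
        {"id": "node1", "status": "UP"},
        {"id": "node2", "status": "UP"},
        {"id": "node3", "status": "UP"},
    ]
}
all_up = cluster_is_healthy(sample)
```

Against a live cluster, `cluster_is_healthy(fetch_members())` would perform the same check.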

Step 5: Configure Load Balancer

See: Nginx Config or Envoy Config

Nginx upstream:

upstream stemedb_cluster {
    server node1.example.com:18180;
    server node2.example.com:18180;
    server node3.example.com:18180;
}

Step 6: Set Up Monitoring

# Prometheus scrape config
scrape_configs:
  - job_name: 'stemedb-cluster'
    static_configs:
      - targets:
        - 'node1:18180'
        - 'node2:18180'
        - 'node3:18180'

Estimated deployment time: 4-8 hours (including load balancer, monitoring)

Failure Scenarios & Recovery

Single Node Failure

Impact: No service disruption, automatic failover

Recovery:

Load balancer detects failed node (health check)
Traffic routed to 2 remaining nodes
Replication factor maintained (assertions still on 2 nodes)
Replace failed node when convenient (see Add Node Runbook)

RTO: <1 minute (automatic)

Data loss: None (replicated data preserved)

Two Nodes Fail (Catastrophic)

Impact: The single surviving node continues accepting assertion writes and serving reads. Admin operations (key management) are degraded because a single node has no peer to synchronously acknowledge them.

Recovery:

Manual intervention required to restore cluster
Restore failed nodes or add new nodes
Trigger Merkle sync (/cluster/sync endpoint) after nodes rejoin
Admin operations fully restored when cluster membership is repaired

RTO: 30 minutes - 2 hours (manual restore)

Data loss: Assertion writes continue on the surviving node and merge on recovery. Recent admin operations (key revocations) issued during the degraded window may not have propagated, so audit them after recovery.

Network Partition

Impact:

Assertion writes: Both partitions accept writes independently. This is safe — same content → same BLAKE3 hash, different content → different hashes that merge cleanly after partition heals.
Admin operations (API key revocations, quota changes): A revocation issued to one partition is invisible to the other until partition heals. A revoked key may still be honored by nodes in the other partition during the partition window.

Recovery:

Merkle anti-entropy detects and fills gaps automatically when partition heals
Lenses (Recency, Authority) handle any assertion-level divergence at read time
Admin state re-synchronizes via coordinator broadcast on reconnect

Data loss: None for assertions (all writes from both partitions preserved and merged).

Replication Lag

Impact: Queries may see stale data (<1 minute old)

Recovery:

Automatic catch-up via Merkle sync
If lag >5 minutes, see High Latency Runbook

Performance Characteristics

Query Latency

Target: p99 <200ms at <1K queries/sec

Metric	Single-Node	Three-Node
p50	20ms	25ms
p95	50ms	75ms
p99	100ms	150ms

The three-node cluster has slightly higher latency because of network hops, but offers about 3x the query capacity.

Write Throughput

Target: 1,000 assertions/sec sustained

Each node accepts assertion writes (routed by cluster gateway via shard prefix)
Replication happens asynchronously via Merkle gossip
No coordination required for assertions (CRDT-safe by content hash)
Admin writes (API keys, quota changes) route to coordinator and require synchronous acknowledgment — expect higher latency on those operations (~50ms vs ~5ms for assertions)

Replication Lag

Target: <1 second typical, <5 seconds max

Measured by: replication_lag_seconds metric

Network Requirements

See: Network Requirements for full details.

Ports (Per Node)

Port	Protocol	Purpose	Firewall Rule
18180	TCP/HTTP	API (clients → nodes)	Allow from load balancer
18181	TCP/HTTP	Cluster gateway (admin only)	Allow from internal network
18182	TCP/gRPC	Replication (node ↔ node)	Allow within cluster
18183	UDP	SWIM gossip (node ↔ node)	Allow within cluster

Latency Requirement

<5ms inter-node latency required

Deploy nodes in same region/AZ
Private network (10 Gbps recommended)
Test with: ping -c 100 node2 (should show avg <5ms)

Bandwidth

Replication: ~1 Mbps per 100 assertions/sec
Queries: ~10 Mbps at 1K queries/sec
Recommended: 1 Gbps minimum, 10 Gbps for production
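As a worked example, the figures above can be combined at the stated targets (1,000 assertions/sec and 1,000 queries/sec). The sketch assumes both rates scale linearly, which the source states for replication but not explicitly for queries.

```python
def replication_mbps(assertions_per_sec):
    """About 1 Mbps per 100 assertions/sec, per the figure above."""
    return assertions_per_sec / 100.0


def query_mbps(queries_per_sec):
    """About 10 Mbps at 1,000 queries/sec; linear scaling is an assumption."""
    return queries_per_sec / 1000.0 * 10.0


total_mbps = replication_mbps(1000) + query_mbps(1000)  # 10 + 10 = 20 Mbps
share_of_1_gbps = total_mbps / 1000.0                   # 0.02, i.e. 2% of the link
print(f"{total_mbps:.0f} Mbps, {share_of_1_gbps:.0%} of a 1 Gbps link")
# prints "20 Mbps, 2% of a 1 Gbps link"
```

At these targets the 1 Gbps minimum leaves substantial headroom for bursts and for catch-up traffic after a node rejoins.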

Monitoring & Alerts

Critical Metrics

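# Prometheus alerts
# Fires when any node stops responding to scrapes for 1 minute.
- alert: StemeDBNodeDown
  expr: up{job="stemedb-cluster"} == 0
  for: 1m

# Fires when replication lag stays above the 5-second maximum for 5 minutes.
- alert: StemeDBReplicationLag
  expr: replication_lag_seconds > 5
  for: 5m

# Fires when fewer than 2 nodes are up. sum() is used instead of
# count(up == 1) because count() over an empty result returns no data,
# so the alert would stay silent when all nodes are down.
- alert: StemeDBQuorumLost
  expr: sum(up{job="stemedb-cluster"}) < 2
  for: 1m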

Grafana Dashboard Panels

Cluster Health: Node count, status, replication lag
Query Latency: p50, p95, p99 across all nodes
Ingest Rate: Assertions/sec per node
Disk Usage: WAL + DB per node
Network: Replication bandwidth

Cost Estimate (AWS, us-east-1)

Resource	Cost
Compute (3× t3.xlarge)	$300/month
Storage (3× 200GB SSD)	$60/month
Load Balancer (ALB)	$25/month
Data Transfer (internal)	$10/month
Backups (S3)	$30/month
Total	~$425/month

Compare to single-node ($87/month): 5x cost for 10x availability
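The total and the cost ratio follow directly from the table. The snippet below only reproduces that arithmetic; the line items are copied from the table and the single-node figure from the comparison above.

```python
# Monthly line items in USD, copied from the table above.
MONTHLY_COSTS_USD = {
    "compute": 300,
    "storage": 60,
    "load_balancer": 25,
    "data_transfer": 10,
    "backups": 30,
}
SINGLE_NODE_USD = 87  # single-node figure quoted above

total_usd = sum(MONTHLY_COSTS_USD.values())  # 425
cost_ratio = total_usd / SINGLE_NODE_USD     # about 4.9, rounded to 5x in the text
print(f"${total_usd}/month, {cost_ratio:.1f}x single-node")
# prints "$425/month, 4.9x single-node"
```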

Migration from Single-Node

See: Add Node Runbook for detailed procedure.

Summary:

Provision 2 new nodes
Configure cluster on all 3
Restart single-node with cluster config
Trigger Merkle sync
Update load balancer

Downtime: 5-15 minutes for replication

Scaling Path Beyond Three Nodes

Three nodes on k3s handle the 100-project target. For mass traffic beyond that, the scaling path is incremental, not a rearchitecture:

Phase	Target	Work type	What changes
Phase 1	1 node, 100 projects	✅ Done	Single Deployment, Longhorn PVC, auth wired
Phase 2	3 nodes, cluster mode	✅ Done	3-replica StatefulSet, dual-binary, SWIM, Gateway routing, Woodpecker CI
Phase 3	3 nodes, full replication	Code-heavy	SWIM inter-node connectivity, Gateway HTTP forwarding, Merkle anti-entropy
Phase 4	N nodes, coordinator	Code-heavy	Designate one node (or small 3-node Raft group) exclusively for mutable admin state; assertion nodes are pure data

What you do NOT need: Raft on the assertion write path. The append-only, content-addressed design means there are no write conflicts to serialize. Raft belongs only on the mutable admin state path (Phase 4), which is a small fraction of total traffic.

Related Documentation

Single-Node Pilot - Simpler architecture
Network Requirements - Firewall rules
Resource Sizing - Hardware calculations
Add Node Runbook - Cluster operations
High Query Latency Runbook - Performance troubleshooting

Last Updated: 2026-03-07 — Updated k3s section to match actual 3-replica StatefulSet deployment, dual-binary architecture, Woodpecker CI pipeline
