stemedb/docs/operations/reference-architecture/three-node-cluster.md
jml 3e7eddc074 feat: add enterprise production readiness infrastructure
This commit implements comprehensive production hardening across multiple
layers to prepare StemeDB for enterprise pilot deployments:

## API Layer
- Add rate limiting middleware with configurable limits per endpoint
- Enhance error handling with detailed context and proper HTTP status codes
- Add security hardening tests for input validation and boundary conditions
- Create store_helpers module for defensive storage access patterns

## Storage & WAL
- Optimize group commit batching for higher throughput
- Add defensive error handling in hybrid backend with proper fallbacks
- Enhance WAL journal durability guarantees with fsync validation
- Improve index store query performance with better caching

## Operations & Deployment
- Add comprehensive operations documentation (deployment, monitoring, DR)
- Create systemd units for backup, WAL archival, and verification
- Add monitoring configs (Prometheus alerts, metrics exporters)
- Implement backup/restore scripts with verification and S3 archival
- Add DR drill automation and runbook procedures
- Create load balancer configs (nginx, envoy) with health checks

## Documentation
- Update CLAUDE.md with operations and troubleshooting guides
- Expand roadmap with production readiness milestones
- Add pilot success criteria and deployment reference architecture
- Document TLS setup, monitoring integration, and incident response

## Configuration
- Add .env.example with all required environment variables
- Document resource sizing for different deployment scales
- Add configuration examples for various deployment topologies

This positions StemeDB for successful enterprise pilots with proper
operational discipline, monitoring, backup/DR, and security hardening.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-12 06:08:15 +00:00

9.3 KiB
Raw Blame History

Three-Node Cluster Architecture

Target: Production deployments, enterprise pilots, high-availability requirements

RECOMMENDED FOR PRODUCTION - Survives single node failure, automatic replication


Overview

The three-node cluster provides high availability through automatic replication (factor 2) and CRDT-based eventual consistency. Survives single node failure with <5 minute recovery time.

[See: diagrams/three-node.txt for ASCII diagram]

Target Specifications

Metric Value
Assertions <100,000
Queries/sec <1,000
Concurrent users <500
Availability 99.9% (survives 1 node failure)
RTO 5 minutes (automatic failover)
RPO 1 minute (replication lag)
Consistency Eventual (via CRDTs + Merkle sync)

Hardware Requirements (Per Node)

Minimum (Pilot <50K assertions)

  • CPU: 4 vCPUs
  • RAM: 8GB
  • Disk: 100GB SSD (50GB WAL + 50GB DB)
  • Network: 1 Gbps, <5ms inter-node latency

Example instances (per node):

  • AWS: t3.large (2 vCPU, 8GB) × 3 = $180/month
  • GCP: n2-standard-2 (2 vCPU, 8GB) × 3 = $195/month
  • Azure: Standard_D2s_v3 (2 vCPU, 8GB) × 3 = $140/month
  • CPU: 8 vCPUs
  • RAM: 16GB
  • Disk: 200GB SSD (100GB WAL + 100GB DB)
  • Network: 10 Gbps, <5ms inter-node latency

Example instances (per node):

  • AWS: t3.xlarge (4 vCPU, 16GB) × 3 = $300/month
  • GCP: n2-standard-4 (4 vCPU, 16GB) × 3 = $390/month
  • Azure: Standard_D4s_v3 (4 vCPU, 16GB) × 3 = $280/month

See: Resource Sizing Guide for detailed calculations.


Architecture Components

Node Layout

Each node runs the full stack:

  • stemedb-api (port 18180) - HTTP API, queries, ingest
  • stemedb-gateway (port 18181) - Cluster coordination
  • stemedb-rpc (port 18182) - gRPC replication
  • SWIM gossip (port 18183) - Membership, failure detection

Replication

CRDT-based with Merkle sync:

  • Writes accepted locally (optimistic)
  • Background Merkle tree comparison
  • Automatic sync of missing assertions
  • No distributed transactions

Replication factor 2:

  • Each assertion stored on 2 nodes
  • Survives 1 node failure
  • Read from any node (eventually consistent)

Load Balancing

Round-robin across all nodes:

  • Nginx or Envoy distribute queries
  • No "primary" node (all equal)
  • Health checks remove failed nodes

Deployment Steps

Prerequisites

  • 3 servers provisioned (same specs)
  • Private network with <5ms latency
  • DNS records created
  • TLS certificates provisioned

Step 1: Install StemeDB on All Nodes

# On each node (node1, node2, node3):
sudo curl -L https://github.com/yourorg/stemedb/releases/download/v0.1.0/stemedb-api -o /usr/local/bin/stemedb-api
sudo chmod +x /usr/local/bin/stemedb-api

sudo mkdir -p /data/{wal,db}
sudo useradd -r -s /bin/false stemedb
sudo chown -R stemedb:stemedb /data

Step 2: Configure Cluster

Node 1:

# /etc/stemedb/config.toml
[cluster]
enabled = true
node_id = "node1"
bind_addr = "10.0.1.51:18181"
rpc_addr = "10.0.1.51:18182"
swim_addr = "10.0.1.51:18183"
seeds = ["10.0.1.52:18183", "10.0.1.53:18183"]

[replication]
factor = 2

Node 2:

[cluster]
enabled = true
node_id = "node2"
bind_addr = "10.0.1.52:18181"
rpc_addr = "10.0.1.52:18182"
swim_addr = "10.0.1.52:18183"
seeds = ["10.0.1.51:18183", "10.0.1.53:18183"]

[replication]
factor = 2

Node 3:

[cluster]
enabled = true
node_id = "node3"
bind_addr = "10.0.1.53:18181"
rpc_addr = "10.0.1.53:18182"
swim_addr = "10.0.1.53:18183"
seeds = ["10.0.1.51:18183", "10.0.1.52:18183"]

[replication]
factor = 2

Step 3: Start All Nodes

# Start nodes sequentially (allows SWIM discovery)
ssh node1 "sudo systemctl start stemedb-api"
sleep 10

ssh node2 "sudo systemctl start stemedb-api"
sleep 10

ssh node3 "sudo systemctl start stemedb-api"

Step 4: Verify Cluster Formation

# Check membership (from any node)
curl http://node1:18181/cluster/members | jq '.'

# Expected output:
# {
#   "members": [
#     {"id": "node1", "status": "UP"},
#     {"id": "node2", "status": "UP"},
#     {"id": "node3", "status": "UP"}
#   ]
# }

Step 5: Configure Load Balancer

See: Nginx Config or Envoy Config

Nginx upstream:

upstream stemedb_cluster {
    server node1.example.com:18180;
    server node2.example.com:18180;
    server node3.example.com:18180;
}

Step 6: Set Up Monitoring

# Prometheus scrape config
scrape_configs:
  - job_name: 'stemedb-cluster'
    static_configs:
      - targets:
        - 'node1:18180'
        - 'node2:18180'
        - 'node3:18180'

Estimated deployment time: 4-8 hours (including load balancer, monitoring)


Failure Scenarios & Recovery

Single Node Failure

Impact: No service disruption, automatic failover

Recovery:

  1. Load balancer detects failed node (health check)
  2. Traffic routed to 2 remaining nodes
  3. Replication factor maintained (assertions still on 2 nodes)
  4. Replace failed node when convenient (see Add Node Runbook)

RTO: <1 minute (automatic) Data loss: None (replicated data preserved)

Two Nodes Fail (Catastrophic)

Impact: Read-only mode (no writes accepted)

Recovery:

  1. Manual intervention required
  2. Restore third node or add new node
  3. Trigger Merkle sync
  4. Resume writes when quorum restored

RTO: 30 minutes - 2 hours (manual) Data loss: Potential (depends on which nodes failed)

Network Partition

Impact: Split brain possible (both sides accept writes)

Recovery:

  • CRDT merge resolves conflicts automatically
  • Lenses (Recency, Authority) handle conflicts at read time
  • No manual intervention needed after partition heals

Data loss: None (CRDTs preserve all writes)

Replication Lag

Impact: Queries may see stale data (<1 minute old)

Recovery:


Performance Characteristics

Query Latency

Target: p99 <200ms at <1K queries/sec

Metric Single-Node Three-Node
p50 20ms 25ms
p95 50ms 75ms
p99 100ms 150ms

3-node has slightly higher latency due to network hops, but 3x query capacity

Write Throughput

Target: 1,000 assertions/sec sustained

  • Each node accepts writes
  • Replication happens asynchronously
  • No coordination required (CRDTs)

Replication Lag

Target: <1 second typical, <5 seconds max

Measured by: replication_lag_seconds metric


Network Requirements

See: Network Requirements for full details.

Ports (Per Node)

Port Protocol Purpose Firewall Rule
18180 TCP/HTTP API (clients → nodes) Allow from load balancer
18181 TCP/HTTP Cluster gateway (admin only) Allow from internal network
18182 TCP/gRPC Replication (node ↔ node) Allow within cluster
18183 UDP SWIM gossip (node ↔ node) Allow within cluster

Latency Requirement

<5ms inter-node latency required

  • Deploy nodes in same region/AZ
  • Private network (10 Gbps recommended)
  • Test with: ping -c 100 node2 (should show avg <5ms)

Bandwidth

  • Replication: ~1 Mbps per 100 assertions/sec
  • Queries: ~10 Mbps at 1K queries/sec
  • Recommended: 1 Gbps minimum, 10 Gbps for production

Monitoring & Alerts

Critical Metrics

# Prometheus alerts
- alert: StemeDBNodeDown
  expr: up{job="stemedb-cluster"} == 0
  for: 1m

- alert: StemeDBReplicationLag
  expr: replication_lag_seconds > 5
  for: 5m

- alert: StemeDBQuorumLost
  expr: count(up{job="stemedb-cluster"} == 1) < 2
  for: 1m

Grafana Dashboard Panels

  1. Cluster Health: Node count, status, replication lag
  2. Query Latency: p50, p95, p99 across all nodes
  3. Ingest Rate: Assertions/sec per node
  4. Disk Usage: WAL + DB per node
  5. Network: Replication bandwidth

Cost Estimate (AWS, us-east-1)

Resource Cost
Compute (3× t3.xlarge) $300/month
Storage (3× 200GB SSD) $60/month
Load Balancer (ALB) $25/month
Data Transfer (internal) $10/month
Backups (S3) $30/month
Total ~$425/month

Compare to single-node ($87/month): 5x cost for 10x availability


Migration from Single-Node

See: Add Node Runbook for detailed procedure.

Summary:

  1. Provision 2 new nodes
  2. Configure cluster on all 3
  3. Restart single-node with cluster config
  4. Trigger Merkle sync
  5. Update load balancer

Downtime: 5-15 minutes for replication



Last Updated: 2026-02-11