jml 3e7eddc074 feat: add enterprise production readiness infrastructure

This commit implements comprehensive production hardening across multiple
layers to prepare StemeDB for enterprise pilot deployments:

## API Layer
- Add rate limiting middleware with configurable limits per endpoint
- Enhance error handling with detailed context and proper HTTP status codes
- Add security hardening tests for input validation and boundary conditions
- Create store_helpers module for defensive storage access patterns

## Storage & WAL
- Optimize group commit batching for higher throughput
- Add defensive error handling in hybrid backend with proper fallbacks
- Enhance WAL journal durability guarantees with fsync validation
- Improve index store query performance with better caching

## Operations & Deployment
- Add comprehensive operations documentation (deployment, monitoring, DR)
- Create systemd units for backup, WAL archival, and verification
- Add monitoring configs (Prometheus alerts, metrics exporters)
- Implement backup/restore scripts with verification and S3 archival
- Add DR drill automation and runbook procedures
- Create load balancer configs (nginx, envoy) with health checks

## Documentation
- Update CLAUDE.md with operations and troubleshooting guides
- Expand roadmap with production readiness milestones
- Add pilot success criteria and deployment reference architecture
- Document TLS setup, monitoring integration, and incident response

## Configuration
- Add .env.example with all required environment variables
- Document resource sizing for different deployment scales
- Add configuration examples for various deployment topologies

This positions StemeDB for successful enterprise pilots with proper
operational discipline, monitoring, backup/DR, and security hardening.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

2026-02-12 06:08:15 +00:00

9.3 KiB

Raw Blame History

Three-Node Cluster Architecture

Target: Production deployments, enterprise pilots, high-availability requirements

✅ RECOMMENDED FOR PRODUCTION - Survives single node failure, automatic replication

Overview

The three-node cluster provides high availability through automatic replication (factor 2) and CRDT-based eventual consistency. Survives single node failure with <5 minute recovery time.

[See: diagrams/three-node.txt for ASCII diagram]

Target Specifications

Metric	Value
Assertions	<100,000
Queries/sec	<1,000
Concurrent users	<500
Availability	99.9% (survives 1 node failure)
RTO	5 minutes (automatic failover)
RPO	1 minute (replication lag)
Consistency	Eventual (via CRDTs + Merkle sync)

Hardware Requirements (Per Node)

Minimum (Pilot <50K assertions)

CPU: 4 vCPUs
RAM: 8GB
Disk: 100GB SSD (50GB WAL + 50GB DB)
Network: 1 Gbps, <5ms inter-node latency

Example instances (per node):

AWS: t3.large (2 vCPU, 8GB) × 3 = $180/month
GCP: n2-standard-2 (2 vCPU, 8GB) × 3 = $195/month
Azure: Standard_D2s_v3 (2 vCPU, 8GB) × 3 = $140/month

Recommended (Production <100K assertions)

CPU: 8 vCPUs
RAM: 16GB
Disk: 200GB SSD (100GB WAL + 100GB DB)
Network: 10 Gbps, <5ms inter-node latency

Example instances (per node):

AWS: t3.xlarge (4 vCPU, 16GB) × 3 = $300/month
GCP: n2-standard-4 (4 vCPU, 16GB) × 3 = $390/month
Azure: Standard_D4s_v3 (4 vCPU, 16GB) × 3 = $280/month

See: Resource Sizing Guide for detailed calculations.

Architecture Components

Node Layout

Each node runs the full stack:

stemedb-api (port 18180) - HTTP API, queries, ingest
stemedb-gateway (port 18181) - Cluster coordination
stemedb-rpc (port 18182) - gRPC replication
SWIM gossip (port 18183) - Membership, failure detection

Replication

CRDT-based with Merkle sync:

Writes accepted locally (optimistic)
Background Merkle tree comparison
Automatic sync of missing assertions
No distributed transactions

Replication factor 2:

Each assertion stored on 2 nodes
Survives 1 node failure
Read from any node (eventually consistent)

Load Balancing

Round-robin across all nodes:

Nginx or Envoy distribute queries
No "primary" node (all equal)
Health checks remove failed nodes

Deployment Steps

Prerequisites

3 servers provisioned (same specs)
Private network with <5ms latency
DNS records created
TLS certificates provisioned

Step 1: Install StemeDB on All Nodes

# On each node (node1, node2, node3):
sudo curl -L https://github.com/yourorg/stemedb/releases/download/v0.1.0/stemedb-api -o /usr/local/bin/stemedb-api
sudo chmod +x /usr/local/bin/stemedb-api

sudo mkdir -p /data/{wal,db}
sudo useradd -r -s /bin/false stemedb
sudo chown -R stemedb:stemedb /data

Step 2: Configure Cluster

Node 1:

# /etc/stemedb/config.toml
[cluster]
enabled = true
node_id = "node1"
bind_addr = "10.0.1.51:18181"
rpc_addr = "10.0.1.51:18182"
swim_addr = "10.0.1.51:18183"
seeds = ["10.0.1.52:18183", "10.0.1.53:18183"]

[replication]
factor = 2

Node 2:

[cluster]
enabled = true
node_id = "node2"
bind_addr = "10.0.1.52:18181"
rpc_addr = "10.0.1.52:18182"
swim_addr = "10.0.1.52:18183"
seeds = ["10.0.1.51:18183", "10.0.1.53:18183"]

[replication]
factor = 2

Node 3:

[cluster]
enabled = true
node_id = "node3"
bind_addr = "10.0.1.53:18181"
rpc_addr = "10.0.1.53:18182"
swim_addr = "10.0.1.53:18183"
seeds = ["10.0.1.51:18183", "10.0.1.52:18183"]

[replication]
factor = 2

Step 3: Start All Nodes

# Start nodes sequentially (allows SWIM discovery)
ssh node1 "sudo systemctl start stemedb-api"
sleep 10

ssh node2 "sudo systemctl start stemedb-api"
sleep 10

ssh node3 "sudo systemctl start stemedb-api"

Step 4: Verify Cluster Formation

# Check membership (from any node)
curl http://node1:18181/cluster/members | jq '.'

# Expected output:
# {
#   "members": [
#     {"id": "node1", "status": "UP"},
#     {"id": "node2", "status": "UP"},
#     {"id": "node3", "status": "UP"}
#   ]
# }

Step 5: Configure Load Balancer

See: Nginx Config or Envoy Config

Nginx upstream:

upstream stemedb_cluster {
    server node1.example.com:18180;
    server node2.example.com:18180;
    server node3.example.com:18180;
}

Step 6: Set Up Monitoring

# Prometheus scrape config
scrape_configs:
  - job_name: 'stemedb-cluster'
    static_configs:
      - targets:
        - 'node1:18180'
        - 'node2:18180'
        - 'node3:18180'

Estimated deployment time: 4-8 hours (including load balancer, monitoring)

Failure Scenarios & Recovery

Single Node Failure

Impact: No service disruption, automatic failover

Recovery:

Load balancer detects failed node (health check)
Traffic routed to 2 remaining nodes
Replication factor maintained (assertions still on 2 nodes)
Replace failed node when convenient (see Add Node Runbook)

RTO: <1 minute (automatic) Data loss: None (replicated data preserved)

Two Nodes Fail (Catastrophic)

Impact: Read-only mode (no writes accepted)

Recovery:

Manual intervention required
Restore third node or add new node
Trigger Merkle sync
Resume writes when quorum restored

RTO: 30 minutes - 2 hours (manual) Data loss: Potential (depends on which nodes failed)

Network Partition

Impact: Split brain possible (both sides accept writes)

Recovery:

CRDT merge resolves conflicts automatically
Lenses (Recency, Authority) handle conflicts at read time
No manual intervention needed after partition heals

Data loss: None (CRDTs preserve all writes)

Replication Lag

Impact: Queries may see stale data (<1 minute old)

Recovery:

Automatic catch-up via Merkle sync
If lag >5 minutes, see High Latency Runbook

Performance Characteristics

Query Latency

Target: p99 <200ms at <1K queries/sec

Metric	Single-Node	Three-Node
p50	20ms	25ms
p95	50ms	75ms
p99	100ms	150ms

3-node has slightly higher latency due to network hops, but 3x query capacity

Write Throughput

Target: 1,000 assertions/sec sustained

Each node accepts writes
Replication happens asynchronously
No coordination required (CRDTs)

Replication Lag

Target: <1 second typical, <5 seconds max

Measured by: replication_lag_seconds metric

Network Requirements

See: Network Requirements for full details.

Ports (Per Node)

Port	Protocol	Purpose	Firewall Rule
18180	TCP/HTTP	API (clients → nodes)	Allow from load balancer
18181	TCP/HTTP	Cluster gateway (admin only)	Allow from internal network
18182	TCP/gRPC	Replication (node ↔ node)	Allow within cluster
18183	UDP	SWIM gossip (node ↔ node)	Allow within cluster

Latency Requirement

<5ms inter-node latency required

Deploy nodes in same region/AZ
Private network (10 Gbps recommended)
Test with: ping -c 100 node2 (should show avg <5ms)

Bandwidth

Replication: ~1 Mbps per 100 assertions/sec
Queries: ~10 Mbps at 1K queries/sec
Recommended: 1 Gbps minimum, 10 Gbps for production

Monitoring & Alerts

Critical Metrics

# Prometheus alerts
- alert: StemeDBNodeDown
  expr: up{job="stemedb-cluster"} == 0
  for: 1m

- alert: StemeDBReplicationLag
  expr: replication_lag_seconds > 5
  for: 5m

- alert: StemeDBQuorumLost
  expr: count(up{job="stemedb-cluster"} == 1) < 2
  for: 1m

Grafana Dashboard Panels

Cluster Health: Node count, status, replication lag
Query Latency: p50, p95, p99 across all nodes
Ingest Rate: Assertions/sec per node
Disk Usage: WAL + DB per node
Network: Replication bandwidth

Cost Estimate (AWS, us-east-1)

Resource	Cost
Compute (3× t3.xlarge)	$300/month
Storage (3× 200GB SSD)	$60/month
Load Balancer (ALB)	$25/month
Data Transfer (internal)	$10/month
Backups (S3)	$30/month
Total	~$425/month

Compare to single-node ($87/month): 5x cost for 10x availability

Migration from Single-Node

See: Add Node Runbook for detailed procedure.

Summary:

Provision 2 new nodes
Configure cluster on all 3
Restart single-node with cluster config
Trigger Merkle sync
Update load balancer

Downtime: 5-15 minutes for replication

Single-Node Pilot - Simpler architecture
Network Requirements - Firewall rules
Resource Sizing - Hardware calculations
Add Node Runbook - Cluster operations
High Query Latency Runbook - Performance troubleshooting

Last Updated: 2026-02-11

9.3 KiB Raw Blame History Unescape Escape

Three-Node Cluster Architecture

Overview

Target Specifications

Hardware Requirements (Per Node)

Minimum (Pilot <50K assertions)

Recommended (Production <100K assertions)

Architecture Components

Node Layout

Replication

Load Balancing

Deployment Steps

Prerequisites

Step 1: Install StemeDB on All Nodes

Step 2: Configure Cluster

Step 3: Start All Nodes

Step 4: Verify Cluster Formation

Step 5: Configure Load Balancer

Step 6: Set Up Monitoring

Failure Scenarios & Recovery

Single Node Failure

Two Nodes Fail (Catastrophic)

Network Partition

Replication Lag

Performance Characteristics

Query Latency

Write Throughput

Replication Lag

Network Requirements

Ports (Per Node)

Latency Requirement

Bandwidth

Monitoring & Alerts

Critical Metrics

Grafana Dashboard Panels

Cost Estimate (AWS, us-east-1)

Migration from Single-Node

Related Documentation

9.3 KiB

Raw Blame History