stemedb/docs/operations/reference-architecture/three-node-cluster.md
jml 3e7eddc074 feat: add enterprise production readiness infrastructure
This commit implements comprehensive production hardening across multiple
layers to prepare StemeDB for enterprise pilot deployments:

## API Layer
- Add rate limiting middleware with configurable limits per endpoint
- Enhance error handling with detailed context and proper HTTP status codes
- Add security hardening tests for input validation and boundary conditions
- Create store_helpers module for defensive storage access patterns

## Storage & WAL
- Optimize group commit batching for higher throughput
- Add defensive error handling in hybrid backend with proper fallbacks
- Enhance WAL journal durability guarantees with fsync validation
- Improve index store query performance with better caching

## Operations & Deployment
- Add comprehensive operations documentation (deployment, monitoring, DR)
- Create systemd units for backup, WAL archival, and verification
- Add monitoring configs (Prometheus alerts, metrics exporters)
- Implement backup/restore scripts with verification and S3 archival
- Add DR drill automation and runbook procedures
- Create load balancer configs (nginx, envoy) with health checks

## Documentation
- Update CLAUDE.md with operations and troubleshooting guides
- Expand roadmap with production readiness milestones
- Add pilot success criteria and deployment reference architecture
- Document TLS setup, monitoring integration, and incident response

## Configuration
- Add .env.example with all required environment variables
- Document resource sizing for different deployment scales
- Add configuration examples for various deployment topologies

This positions StemeDB for successful enterprise pilots with proper
operational discipline, monitoring, backup/DR, and security hardening.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-12 06:08:15 +00:00

398 lines
9.3 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Three-Node Cluster Architecture
**Target:** Production deployments, enterprise pilots, high-availability requirements
**✅ RECOMMENDED FOR PRODUCTION** - Survives single node failure, automatic replication
---
## Overview
The three-node cluster provides high availability through automatic replication (factor 2) and CRDT-based eventual consistency. Survives single node failure with <5 minute recovery time.
```
[See: diagrams/three-node.txt for ASCII diagram]
```
---
## Target Specifications
| Metric | Value |
|--------|-------|
| **Assertions** | <100,000 |
| **Queries/sec** | <1,000 |
| **Concurrent users** | <500 |
| **Availability** | 99.9% (survives 1 node failure) |
| **RTO** | 5 minutes (automatic failover) |
| **RPO** | 1 minute (replication lag) |
| **Consistency** | Eventual (via CRDTs + Merkle sync) |
---
## Hardware Requirements (Per Node)
### Minimum (Pilot <50K assertions)
- **CPU:** 4 vCPUs
- **RAM:** 8GB
- **Disk:** 100GB SSD (50GB WAL + 50GB DB)
- **Network:** 1 Gbps, <5ms inter-node latency
**Example instances (per node):**
- AWS: `t3.large` (2 vCPU, 8GB) × 3 = $180/month
- GCP: `n2-standard-2` (2 vCPU, 8GB) × 3 = $195/month
- Azure: `Standard_D2s_v3` (2 vCPU, 8GB) × 3 = $140/month
### Recommended (Production <100K assertions)
- **CPU:** 8 vCPUs
- **RAM:** 16GB
- **Disk:** 200GB SSD (100GB WAL + 100GB DB)
- **Network:** 10 Gbps, <5ms inter-node latency
**Example instances (per node):**
- AWS: `t3.xlarge` (4 vCPU, 16GB) × 3 = $300/month
- GCP: `n2-standard-4` (4 vCPU, 16GB) × 3 = $390/month
- Azure: `Standard_D4s_v3` (4 vCPU, 16GB) × 3 = $280/month
**See:** [Resource Sizing Guide](./resource-sizing.md) for detailed calculations.
---
## Architecture Components
### Node Layout
Each node runs the full stack:
- **stemedb-api** (port 18180) - HTTP API, queries, ingest
- **stemedb-gateway** (port 18181) - Cluster coordination
- **stemedb-rpc** (port 18182) - gRPC replication
- **SWIM gossip** (port 18183) - Membership, failure detection
### Replication
**CRDT-based with Merkle sync:**
- Writes accepted locally (optimistic)
- Background Merkle tree comparison
- Automatic sync of missing assertions
- No distributed transactions
**Replication factor 2:**
- Each assertion stored on 2 nodes
- Survives 1 node failure
- Read from any node (eventually consistent)
### Load Balancing
**Round-robin across all nodes:**
- Nginx or Envoy distribute queries
- No "primary" node (all equal)
- Health checks remove failed nodes
---
## Deployment Steps
### Prerequisites
- [ ] 3 servers provisioned (same specs)
- [ ] Private network with <5ms latency
- [ ] DNS records created
- [ ] TLS certificates provisioned
### Step 1: Install StemeDB on All Nodes
```bash
# On each node (node1, node2, node3):
sudo curl -L https://github.com/yourorg/stemedb/releases/download/v0.1.0/stemedb-api -o /usr/local/bin/stemedb-api
sudo chmod +x /usr/local/bin/stemedb-api
sudo mkdir -p /data/{wal,db}
sudo useradd -r -s /bin/false stemedb
sudo chown -R stemedb:stemedb /data
```
### Step 2: Configure Cluster
**Node 1:**
```toml
# /etc/stemedb/config.toml
[cluster]
enabled = true
node_id = "node1"
bind_addr = "10.0.1.51:18181"
rpc_addr = "10.0.1.51:18182"
swim_addr = "10.0.1.51:18183"
seeds = ["10.0.1.52:18183", "10.0.1.53:18183"]
[replication]
factor = 2
```
**Node 2:**
```toml
[cluster]
enabled = true
node_id = "node2"
bind_addr = "10.0.1.52:18181"
rpc_addr = "10.0.1.52:18182"
swim_addr = "10.0.1.52:18183"
seeds = ["10.0.1.51:18183", "10.0.1.53:18183"]
[replication]
factor = 2
```
**Node 3:**
```toml
[cluster]
enabled = true
node_id = "node3"
bind_addr = "10.0.1.53:18181"
rpc_addr = "10.0.1.53:18182"
swim_addr = "10.0.1.53:18183"
seeds = ["10.0.1.51:18183", "10.0.1.52:18183"]
[replication]
factor = 2
```
### Step 3: Start All Nodes
```bash
# Start nodes sequentially (allows SWIM discovery)
ssh node1 "sudo systemctl start stemedb-api"
sleep 10
ssh node2 "sudo systemctl start stemedb-api"
sleep 10
ssh node3 "sudo systemctl start stemedb-api"
```
### Step 4: Verify Cluster Formation
```bash
# Check membership (from any node)
curl http://node1:18181/cluster/members | jq '.'
# Expected output:
# {
# "members": [
# {"id": "node1", "status": "UP"},
# {"id": "node2", "status": "UP"},
# {"id": "node3", "status": "UP"}
# ]
# }
```
### Step 5: Configure Load Balancer
**See:** [Nginx Config](../../deployment/nginx/stemedb.conf) or [Envoy Config](../../deployment/envoy/stemedb.yaml)
**Nginx upstream:**
```nginx
upstream stemedb_cluster {
server node1.example.com:18180;
server node2.example.com:18180;
server node3.example.com:18180;
}
```
### Step 6: Set Up Monitoring
```yaml
# Prometheus scrape config
scrape_configs:
- job_name: 'stemedb-cluster'
static_configs:
- targets:
- 'node1:18180'
- 'node2:18180'
- 'node3:18180'
```
**Estimated deployment time:** 4-8 hours (including load balancer, monitoring)
---
## Failure Scenarios & Recovery
### Single Node Failure
**Impact:** No service disruption, automatic failover
**Recovery:**
1. Load balancer detects failed node (health check)
2. Traffic routed to 2 remaining nodes
3. Replication factor maintained (assertions still on 2 nodes)
4. Replace failed node when convenient (see [Add Node Runbook](../../runbooks/add-node.md))
**RTO:** <1 minute (automatic)
**Data loss:** None (replicated data preserved)
### Two Nodes Fail (Catastrophic)
**Impact:** Read-only mode (no writes accepted)
**Recovery:**
1. Manual intervention required
2. Restore third node or add new node
3. Trigger Merkle sync
4. Resume writes when quorum restored
**RTO:** 30 minutes - 2 hours (manual)
**Data loss:** Potential (depends on which nodes failed)
### Network Partition
**Impact:** Split brain possible (both sides accept writes)
**Recovery:**
- CRDT merge resolves conflicts automatically
- Lenses (Recency, Authority) handle conflicts at read time
- No manual intervention needed after partition heals
**Data loss:** None (CRDTs preserve all writes)
### Replication Lag
**Impact:** Queries may see stale data (<1 minute old)
**Recovery:**
- Automatic catch-up via Merkle sync
- If lag >5 minutes, see [High Latency Runbook](../../runbooks/high-query-latency.md)
---
## Performance Characteristics
### Query Latency
**Target:** p99 <200ms at <1K queries/sec
| Metric | Single-Node | Three-Node |
|--------|-------------|------------|
| **p50** | 20ms | 25ms |
| **p95** | 50ms | 75ms |
| **p99** | 100ms | 150ms |
*3-node has slightly higher latency due to network hops, but 3x query capacity*
### Write Throughput
**Target:** 1,000 assertions/sec sustained
- Each node accepts writes
- Replication happens asynchronously
- No coordination required (CRDTs)
### Replication Lag
**Target:** <1 second typical, <5 seconds max
Measured by: `replication_lag_seconds` metric
---
## Network Requirements
**See:** [Network Requirements](./network-requirements.md) for full details.
### Ports (Per Node)
| Port | Protocol | Purpose | Firewall Rule |
|------|----------|---------|---------------|
| **18180** | TCP/HTTP | API (clients nodes) | Allow from load balancer |
| **18181** | TCP/HTTP | Cluster gateway (admin only) | Allow from internal network |
| **18182** | TCP/gRPC | Replication (node node) | Allow within cluster |
| **18183** | UDP | SWIM gossip (node node) | Allow within cluster |
### Latency Requirement
**<5ms inter-node latency required**
- Deploy nodes in same region/AZ
- Private network (10 Gbps recommended)
- Test with: `ping -c 100 node2` (should show avg <5ms)
### Bandwidth
- **Replication:** ~1 Mbps per 100 assertions/sec
- **Queries:** ~10 Mbps at 1K queries/sec
- **Recommended:** 1 Gbps minimum, 10 Gbps for production
---
## Monitoring & Alerts
### Critical Metrics
```yaml
# Prometheus alerts
- alert: StemeDBNodeDown
expr: up{job="stemedb-cluster"} == 0
for: 1m
- alert: StemeDBReplicationLag
expr: replication_lag_seconds > 5
for: 5m
- alert: StemeDBQuorumLost
expr: count(up{job="stemedb-cluster"} == 1) < 2
for: 1m
```
### Grafana Dashboard Panels
1. **Cluster Health:** Node count, status, replication lag
2. **Query Latency:** p50, p95, p99 across all nodes
3. **Ingest Rate:** Assertions/sec per node
4. **Disk Usage:** WAL + DB per node
5. **Network:** Replication bandwidth
---
## Cost Estimate (AWS, us-east-1)
| Resource | Cost |
|----------|------|
| **Compute** (3× t3.xlarge) | $300/month |
| **Storage** (3× 200GB SSD) | $60/month |
| **Load Balancer** (ALB) | $25/month |
| **Data Transfer** (internal) | $10/month |
| **Backups** (S3) | $30/month |
| **Total** | **~$425/month** |
Compare to single-node ($87/month): 5x cost for 10x availability
---
## Migration from Single-Node
**See:** [Add Node Runbook](../../runbooks/add-node.md#1-bootstrap-3-node-cluster) for detailed procedure.
**Summary:**
1. Provision 2 new nodes
2. Configure cluster on all 3
3. Restart single-node with cluster config
4. Trigger Merkle sync
5. Update load balancer
**Downtime:** 5-15 minutes for replication
---
## Related Documentation
- [Single-Node Pilot](./single-node-pilot.md) - Simpler architecture
- [Network Requirements](./network-requirements.md) - Firewall rules
- [Resource Sizing](./resource-sizing.md) - Hardware calculations
- [Add Node Runbook](../../runbooks/add-node.md) - Cluster operations
- [High Query Latency Runbook](../../runbooks/high-query-latency.md) - Performance troubleshooting
---
**Last Updated:** 2026-02-11