# Three-Node Cluster Architecture

**Target:** Production deployments, enterprise pilots, high-availability requirements

**✅ RECOMMENDED FOR PRODUCTION** - Survives single node failure, automatic replication

---

## Overview

The three-node cluster provides high availability through automatic replication (factor 2) and CRDT-based eventual consistency. It survives a single node failure with a recovery time under 5 minutes.

```
[See: diagrams/three-node.txt for ASCII diagram]
```

---

## Target Specifications

| Metric | Value |
|--------|-------|
| **Assertions** | <100,000 |
| **Queries/sec** | <1,000 |
| **Concurrent users** | <500 |
| **Availability** | 99.9% (survives 1 node failure) |
| **RTO** | 5 minutes (automatic failover) |
| **RPO** | 1 minute (replication lag) |
| **Consistency** | Eventual (via CRDTs + Merkle sync) |
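
To make the 99.9% availability target concrete, it can be converted into a downtime budget. The sketch below assumes a 30-day month; `downtime_budget_minutes` is an illustrative helper, not part of StemeDB. At 99.9% the budget is about 43 minutes per month, so a 5-minute failover fits several times over.

```bash
# downtime_budget_minutes AVAILABILITY_PERCENT DAYS
# Prints the downtime allowed (in minutes) over the given number of days.
downtime_budget_minutes() {
  awk -v a="$1" -v d="$2" 'BEGIN { printf "%.1f\n", (100 - a) / 100 * d * 24 * 60 }'
}

downtime_budget_minutes 99.9 30   # prints 43.2
```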

---

## Hardware Requirements (Per Node)

### Minimum (Pilot <50K assertions)

- **CPU:** 4 vCPUs
- **RAM:** 8GB
- **Disk:** 100GB SSD (50GB WAL + 50GB DB)
- **Network:** 1 Gbps, <5ms inter-node latency

**Example instances (spec per node, approximate cost for all 3 nodes):**
- AWS: `t3.large` (2 vCPU, 8GB) × 3 = $180/month
- GCP: `n2-standard-2` (2 vCPU, 8GB) × 3 = $195/month
- Azure: `Standard_D2s_v3` (2 vCPU, 8GB) × 3 = $140/month

Note: these instances meet the RAM target but provide 2 vCPUs, half of the 4 vCPUs listed above. Choose the next size up if the pilot is CPU-bound.

### Recommended (Production <100K assertions)

- **CPU:** 8 vCPUs
- **RAM:** 16GB
- **Disk:** 200GB SSD (100GB WAL + 100GB DB)
- **Network:** 10 Gbps, <5ms inter-node latency

**Example instances (spec per node, approximate cost for all 3 nodes):**
- AWS: `t3.xlarge` (4 vCPU, 16GB) × 3 = $300/month
- GCP: `n2-standard-4` (4 vCPU, 16GB) × 3 = $390/month
- Azure: `Standard_D4s_v3` (4 vCPU, 16GB) × 3 = $280/month

Note: these instances meet the RAM target but provide 4 vCPUs, half of the 8 vCPUs listed above.

**See:** [Resource Sizing Guide](./resource-sizing.md) for detailed calculations.

---

## Architecture Components

### Node Layout

Each node runs the full stack:
- **stemedb-api** (port 18180) - HTTP API, queries, ingest
- **stemedb-gateway** (port 18181) - Cluster coordination
- **stemedb-rpc** (port 18182) - gRPC replication
- **SWIM gossip** (port 18183) - Membership, failure detection

### Replication

**CRDT-based with Merkle sync:**
- Writes accepted locally (optimistic)
- Background Merkle tree comparison
- Automatic sync of missing assertions
- No distributed transactions

**Replication factor 2:**
- Each assertion stored on 2 nodes
- Survives 1 node failure
- Read from any node (eventually consistent)
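
The "compare, then sync what is missing" idea can be illustrated with a one-level simplification of a Merkle comparison: checksum each node's full set of assertion IDs and compute the difference only when the checksums disagree. This is a sketch, not StemeDB's actual sync protocol. The file-per-node layout (one assertion ID per line) and both function names are assumptions made for illustration. A real Merkle tree repeats the same comparison per subtree so that only the differing ranges are exchanged.

```bash
# set_digest FILE: checksum of the sorted, de-duplicated ID list
set_digest() {
  sort -u "$1" | cksum | cut -d' ' -f1
}

# missing_on_peer LOCAL_IDS PEER_IDS
# Prints IDs present locally but absent on the peer (what would be sent).
missing_on_peer() {
  # Matching digests: treat the two sets as identical, nothing to sync.
  if [ "$(set_digest "$1")" = "$(set_digest "$2")" ]; then
    return 0
  fi
  local_sorted=$(mktemp)
  peer_sorted=$(mktemp)
  sort -u "$1" > "$local_sorted"
  sort -u "$2" > "$peer_sorted"
  comm -23 "$local_sorted" "$peer_sorted"   # lines found only in the local set
  rm -f "$local_sorted" "$peer_sorted"
}
```

Running `missing_on_peer node1-ids.txt node2-ids.txt` in both directions gives each side the list of assertions it still needs to send.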

### Load Balancing

**Round-robin across all nodes:**
- Nginx or Envoy distributes queries
- No "primary" node (all equal)
- Health checks remove failed nodes

---

## Deployment Steps

### Prerequisites

- [ ] 3 servers provisioned (same specs)
- [ ] Private network with <5ms latency
- [ ] DNS records created
- [ ] TLS certificates provisioned

### Step 1: Install StemeDB on All Nodes

```bash
# On each node (node1, node2, node3):
# Download the release binary; -f makes curl fail on HTTP errors instead of saving an error page
sudo curl -fL https://github.com/yourorg/stemedb/releases/download/v0.1.0/stemedb-api -o /usr/local/bin/stemedb-api
sudo chmod +x /usr/local/bin/stemedb-api

# Create the WAL and DB directories and a dedicated system user that owns them
sudo mkdir -p /data/{wal,db}
sudo useradd -r -s /bin/false stemedb
sudo chown -R stemedb:stemedb /data
```

### Step 2: Configure Cluster

**Node 1:**
```toml
# /etc/stemedb/config.toml
[cluster]
enabled = true
node_id = "node1"
bind_addr = "10.0.1.51:18181"
rpc_addr = "10.0.1.51:18182"
swim_addr = "10.0.1.51:18183"
seeds = ["10.0.1.52:18183", "10.0.1.53:18183"]

[replication]
factor = 2
```

**Node 2:**
```toml
# /etc/stemedb/config.toml
[cluster]
enabled = true
node_id = "node2"
bind_addr = "10.0.1.52:18181"
rpc_addr = "10.0.1.52:18182"
swim_addr = "10.0.1.52:18183"
seeds = ["10.0.1.51:18183", "10.0.1.53:18183"]

[replication]
factor = 2
```

**Node 3:**
```toml
# /etc/stemedb/config.toml
[cluster]
enabled = true
node_id = "node3"
bind_addr = "10.0.1.53:18181"
rpc_addr = "10.0.1.53:18182"
swim_addr = "10.0.1.53:18183"
seeds = ["10.0.1.51:18183", "10.0.1.52:18183"]

[replication]
factor = 2
```
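
The three files differ only in `node_id`, the node's own address, and the seed list, which always names the other two nodes. The sketch below renders the config for one node from the list of addresses, which can help avoid copy-paste mistakes. `render_cluster_config` is a hypothetical helper; the keys and ports are taken from the examples above.

```bash
# render_cluster_config INDEX IP1 IP2 IP3
# Prints the config.toml for node INDEX (1-3); the other two IPs become seeds.
render_cluster_config() {
  index=$1
  shift
  self=""
  seeds=""
  n=0
  for ip in "$@"; do
    n=$((n + 1))
    if [ "$n" -eq "$index" ]; then
      self=$ip
    else
      # Seeds point at the SWIM gossip port (18183) of the other nodes
      seeds="${seeds:+$seeds, }\"$ip:18183\""
    fi
  done
  cat <<EOF
[cluster]
enabled = true
node_id = "node$index"
bind_addr = "$self:18181"
rpc_addr = "$self:18182"
swim_addr = "$self:18183"
seeds = [$seeds]

[replication]
factor = 2
EOF
}
```

For example, `render_cluster_config 2 10.0.1.51 10.0.1.52 10.0.1.53 | sudo tee /etc/stemedb/config.toml` would write the Node 2 file.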

### Step 3: Start All Nodes

```bash
# Start nodes sequentially (allows SWIM discovery)
ssh node1 "sudo systemctl start stemedb-api"
sleep 10

ssh node2 "sudo systemctl start stemedb-api"
sleep 10

ssh node3 "sudo systemctl start stemedb-api"
```
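
A fixed `sleep 10` assumes each node is ready within ten seconds. An alternative sketch polls until a readiness command succeeds or a timeout expires. `wait_until` is a hypothetical helper, and using the `/cluster/members` endpoint from Step 4 as the readiness probe is an assumption.

```bash
# wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds; returns 1 if TIMEOUT_SECONDS elapse first.
wait_until() {
  timeout_s=$1
  interval_s=$2
  shift 2
  waited=0
  until "$@" >/dev/null 2>&1; do
    if [ "$waited" -ge "$timeout_s" ]; then
      return 1
    fi
    sleep "$interval_s"
    waited=$((waited + interval_s))
  done
  return 0
}

# Example: start node1, then wait up to 60s for its gateway to answer
# ssh node1 "sudo systemctl start stemedb-api"
# wait_until 60 2 curl -fsS http://node1:18181/cluster/members
```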

### Step 4: Verify Cluster Formation

```bash
# Check membership (from any node)
curl -s http://node1:18181/cluster/members | jq '.'

# Expected output:
# {
#   "members": [
#     {"id": "node1", "status": "UP"},
#     {"id": "node2", "status": "UP"},
#     {"id": "node3", "status": "UP"}
#   ]
# }
```
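
For scripted checks, the membership response can be reduced to a single number. The sketch assumes the response shape shown above, with one `"status": "UP"` entry per healthy member; `count_up_members` is a hypothetical helper.

```bash
# count_up_members: reads the /cluster/members JSON on stdin, prints how many members are UP
count_up_members() {
  grep -o '"status": *"UP"' | wc -l | tr -d ' '
}

# Example: fail a deployment script unless all three nodes report UP
# [ "$(curl -fsS http://node1:18181/cluster/members | count_up_members)" -eq 3 ]
```

Text matching keeps the sketch dependency-free. Since `jq` is already used above, `jq '[.members[] | select(.status == "UP")] | length'` is a sturdier equivalent.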

### Step 5: Configure Load Balancer

**See:** [Nginx Config](../../deployment/nginx/stemedb.conf) or [Envoy Config](../../deployment/envoy/stemedb.yaml)

**Nginx upstream:**
```nginx
upstream stemedb_cluster {
    server node1.example.com:18180;
    server node2.example.com:18180;
    server node3.example.com:18180;
}
```

### Step 6: Set Up Monitoring

```yaml
# Prometheus scrape config
scrape_configs:
  - job_name: 'stemedb-cluster'
    static_configs:
      - targets:
          - 'node1:18180'
          - 'node2:18180'
          - 'node3:18180'
```

**Estimated deployment time:** 4-8 hours (including load balancer, monitoring)

---

## Failure Scenarios & Recovery

### Single Node Failure

**Impact:** No service disruption, automatic failover

**Recovery:**
1. Load balancer detects failed node (health check)
2. Traffic routed to 2 remaining nodes
3. Every assertion stays available on at least one surviving node; assertions that had a replica on the failed node run on a single copy until re-replication or node replacement restores the second
4. Replace failed node when convenient (see [Add Node Runbook](../../runbooks/add-node.md))

**RTO:** <1 minute (automatic)
**Data loss:** None for replicated data; writes the failed node accepted but had not yet replicated are at risk (see RPO in Target Specifications)

### Two Nodes Fail (Catastrophic)

**Impact:** Read-only mode (no writes accepted)

**Recovery:**
1. Manual intervention required
2. Restore one of the failed nodes or add a new node
3. Trigger Merkle sync
4. Resume writes when quorum restored

**RTO:** 30 minutes - 2 hours (manual)
**Data loss:** Potential (depends on which nodes failed)
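
The "quorum restored" condition in step 4 can be expressed as a one-line check. The sketch assumes quorum means a strict majority of configured nodes, which for three nodes is at least two, the same threshold the `StemeDBQuorumLost` alert in this document uses. `has_quorum` is a hypothetical helper name.

```bash
# has_quorum UP_NODES TOTAL_NODES: succeeds when a strict majority of nodes is up
has_quorum() {
  [ $(( $1 * 2 )) -gt "$2" ]
}

# Example: only resume writes once two of the three nodes are back
# has_quorum 2 3 && echo "quorum restored"
```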

### Network Partition

**Impact:** Split brain possible (both sides accept writes)

**Recovery:**
- CRDT merge resolves conflicts automatically
- Lenses (Recency, Authority) handle conflicts at read time
- No manual intervention needed after partition heals

**Data loss:** None (CRDTs preserve all writes)
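
Why a CRDT merge loses no writes can be seen with the simplest CRDT, a grow-only set, where merging is a set union. This is an illustration only: the document does not say which CRDT types StemeDB uses, and the file-based representation (one assertion ID per line) is an assumption.

```bash
# gset_merge FILE_A FILE_B: prints the union of two grow-only sets, one element per line
gset_merge() {
  sort -u "$1" "$2"
}

# Example: merge the writes accepted on each side of a healed partition
# gset_merge side-a-ids.txt side-b-ids.txt > merged-ids.txt
```

Because union is commutative and idempotent, both sides reach the same result regardless of merge order or how often the merge is repeated.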

### Replication Lag

**Impact:** Queries may see stale data (<1 minute old)

**Recovery:**
- Automatic catch-up via Merkle sync
- If lag >5 minutes, see [High Latency Runbook](../../runbooks/high-query-latency.md)

---

## Performance Characteristics

### Query Latency

**Target:** p99 <200ms at <1K queries/sec

| Metric | Single-Node | Three-Node |
|--------|-------------|------------|
| **p50** | 20ms | 25ms |
| **p95** | 50ms | 75ms |
| **p99** | 100ms | 150ms |

*3-node has slightly higher latency due to network hops, but 3x query capacity*

### Write Throughput

**Target:** 1,000 assertions/sec sustained

- Each node accepts writes
- Replication happens asynchronously
- No coordination required (CRDTs)

### Replication Lag

**Target:** <1 second typical, <5 seconds max

Measured by: `replication_lag_seconds` metric

---

## Network Requirements

**See:** [Network Requirements](./network-requirements.md) for full details.

### Ports (Per Node)

| Port | Protocol | Purpose | Firewall Rule |
|------|----------|---------|---------------|
| **18180** | TCP/HTTP | API (clients → nodes) | Allow from load balancer |
| **18181** | TCP/HTTP | Cluster gateway (admin only) | Allow from internal network |
| **18182** | TCP/gRPC | Replication (node ↔ node) | Allow within cluster |
| **18183** | UDP | SWIM gossip (node ↔ node) | Allow within cluster |

### Latency Requirement

**<5ms inter-node latency required**

- Deploy nodes in same region/AZ
- Private network (10 Gbps recommended)
- Test with: `ping -c 100 node2` (should show avg <5ms)
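
To turn the ping test into a pass/fail check, the average can be parsed from the summary line. The sketch assumes the `min/avg/max` summary format printed by Linux (iputils) and macOS `ping`; `avg_rtt_below` is a hypothetical helper.

```bash
# avg_rtt_below MAX_MS: reads ping output on stdin, succeeds if the average RTT is below MAX_MS
avg_rtt_below() {
  awk -F/ -v max="$1" '
    # Summary line looks like: rtt min/avg/max/mdev = 0.045/0.061/0.112/0.015 ms
    # Split on "/", the 5th field is the average.
    /min\/avg\/max/ { found = 1; ok = ($5 + 0 < max + 0) }
    END { exit !(found && ok) }
  '
}

# Example:
# ping -c 100 node2 | avg_rtt_below 5 && echo "latency OK"
```

The check also fails when no summary line is found, for example when the host is unreachable.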

### Bandwidth

- **Replication:** ~1 Mbps per 100 assertions/sec
- **Queries:** ~10 Mbps at 1K queries/sec
- **Recommended:** 1 Gbps minimum, 10 Gbps for production
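
These rules of thumb can be combined into a rough per-node estimate. The sketch assumes both figures scale linearly with load, which the document does not state (it gives only the two reference points); `estimate_mbps` is a hypothetical helper. At the target load of 1,000 assertions/sec and 1,000 queries/sec the estimate is about 20 Mbps, far below the 1 Gbps minimum.

```bash
# estimate_mbps ASSERTIONS_PER_SEC QUERIES_PER_SEC
# Replication: ~1 Mbps per 100 assertions/sec; queries: ~10 Mbps per 1,000 queries/sec.
# Integer arithmetic, each term rounded up to the next whole Mbps.
estimate_mbps() {
  replication=$(( ($1 + 99) / 100 ))
  queries=$(( ($2 + 99) / 100 ))
  echo $(( replication + queries ))
}

estimate_mbps 1000 1000   # prints 20
```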

---

## Monitoring & Alerts

### Critical Metrics

```yaml
# Prometheus alerts
- alert: StemeDBNodeDown
  expr: up{job="stemedb-cluster"} == 0
  for: 1m

- alert: StemeDBReplicationLag
  expr: replication_lag_seconds > 5
  for: 5m

# sum() rather than count(... == 1): when every node is down, the count() form
# returns no series at all, so the comparison would never fire.
- alert: StemeDBQuorumLost
  expr: sum(up{job="stemedb-cluster"}) < 2
  for: 1m
```

### Grafana Dashboard Panels

1. **Cluster Health:** Node count, status, replication lag
2. **Query Latency:** p50, p95, p99 across all nodes
3. **Ingest Rate:** Assertions/sec per node
4. **Disk Usage:** WAL + DB per node
5. **Network:** Replication bandwidth

---

## Cost Estimate (AWS, us-east-1)

| Resource | Cost |
|----------|------|
| **Compute** (3× t3.xlarge) | $300/month |
| **Storage** (3× 200GB SSD) | $60/month |
| **Load Balancer** (ALB) | $25/month |
| **Data Transfer** (internal) | $10/month |
| **Backups** (S3) | $30/month |
| **Total** | **~$425/month** |

Compared to single-node ($87/month), this is roughly 5x the cost in exchange for surviving a single node failure.
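
The total and the comparison ratio follow directly from the table. A quick check of the arithmetic, using only the figures listed above:

```bash
# Monthly line items from the table (USD): compute, storage, ALB, data transfer, backups
total=$(( 300 + 60 + 25 + 10 + 30 ))
echo "$total"                                            # prints 425

# Ratio against the single-node figure of $87/month
awk -v t="$total" 'BEGIN { printf "%.1f\n", t / 87 }'    # prints 4.9
```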

---

## Migration from Single-Node

**See:** [Add Node Runbook](../../runbooks/add-node.md#1-bootstrap-3-node-cluster) for detailed procedure.

**Summary:**
1. Provision 2 new nodes
2. Configure cluster on all 3
3. Restart single-node with cluster config
4. Trigger Merkle sync
5. Update load balancer

**Downtime:** 5-15 minutes for replication

---

## Related Documentation

- [Single-Node Pilot](./single-node-pilot.md) - Simpler architecture
- [Network Requirements](./network-requirements.md) - Firewall rules
- [Resource Sizing](./resource-sizing.md) - Hardware calculations
- [Add Node Runbook](../../runbooks/add-node.md) - Cluster operations
- [High Query Latency Runbook](../../runbooks/high-query-latency.md) - Performance troubleshooting

---

**Last Updated:** 2026-02-11