stemedb/docs/operations/runbooks/add-node.md

# Runbook: Add Node to Cluster
## Symptom
- Need to scale from single-node to 3-node cluster
- Need to add capacity to existing cluster
- Need to replace failed node
- Planning horizontal scaling
---
## Quick Diagnosis
```
Need to add node
├─► Currently single-node?
│   └─► §1 Bootstrap 3-Node Cluster
├─► Existing 3-node cluster, need more capacity?
│   └─► §2 Add Node to Existing Cluster
├─► Node failed, need replacement?
│   └─► §3 Replace Failed Node
└─► Planning scaling strategy?
    └─► See Reference Architectures
```
---
## Prerequisites
**Before adding node:**
- [ ] **Network connectivity:**
```bash
# From new node, ping existing nodes (-c 4 sends four probes, then exits)
ping -c 4 node1.example.com
ping -c 4 node2.example.com
# Average should be <5ms (same region required)
```
- [ ] **Ports open:**
```bash
# Test connectivity to cluster ports
nc -zv node1.example.com 18180 # HTTP API
nc -zv node1.example.com 18181 # Cluster Gateway
nc -zv node1.example.com 18182 # Cluster RPC
nc -zvu node1.example.com 18183 # SWIM Gossip (UDP, so -u; the result is best-effort because UDP is connectionless)
# All should succeed
```
- [ ] **StemeDB installed on new node:**
```bash
# Verify binary
which stemedb-api
# Should return: /usr/local/bin/stemedb-api (or installation path)
```
- [ ] **Disk space sufficient:**
```bash
df -h /data
# Should have >50GB available for pilot
```
- [ ] **Cluster healthy (if existing):**
```bash
curl http://node1:18180/v1/health
# Should return: {"status": "healthy", ...}
```
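
The latency and disk-space thresholds in the checklist can also be checked mechanically. The following is a minimal sketch, not part of StemeDB: the function names `avg_rtt_ms`, `avail_gib`, and `latency_ok` are hypothetical, and the parsing assumes the `min/avg/max` summary line printed by Linux or BSD `ping` and the POSIX output format of `df -Pk`.

```bash
#!/usr/bin/env bash
# Hypothetical preflight helpers. The two parsers read command output on stdin.

# Print the average round-trip time in ms from `ping -c N host` output.
# Matches the summary line, e.g.
#   rtt min/avg/max/mdev = 0.321/0.402/0.512/0.070 ms
avg_rtt_ms() {
  awk -F'/' '/min\/avg\/max/ { print $5 }'
}

# Print whole GiB available from `df -Pk PATH` output
# (second line, fourth column, reported in KiB).
avail_gib() {
  awk 'NR == 2 { print int($4 / 1048576) }'
}

# Return 0 when a (possibly fractional) latency is below the limit in ms.
# An empty latency (ping produced no summary) counts as a failure.
latency_ok() {
  awk -v rtt="$1" -v limit="$2" \
    'BEGIN { exit !(rtt != "" && rtt + 0 < limit + 0) }'
}
```

Possible usage on the new node: `latency_ok "$(ping -c 4 node1.example.com | avg_rtt_ms)" 5` and `[ "$(df -Pk /data | avail_gib)" -gt 50 ]`.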
---
## Resolution Steps
### §1. Bootstrap 3-Node Cluster (From Single-Node)
**Use case:** Migrating from single-node pilot to 3-node production cluster
**Diagnostic:**
```bash
# Check current single-node state
curl http://localhost:18180/v1/health
# Note assertion_count for validation later
ASSERTION_COUNT=$(curl -s http://localhost:18180/v1/health | jq '.assertion_count')
echo "Current assertions: $ASSERTION_COUNT"
# Verify no cluster config
curl http://localhost:18180/metrics | grep cluster_members
# Should return empty (single-node)
```
**Resolution: Step-by-step cluster bootstrap**
**Step 1: Provision 2 new nodes**
```bash
# AWS example: Launch 2 instances matching current node specs
aws ec2 run-instances \
  --image-id ami-xxx \
  --instance-type t3.large \
  --count 2 \
  --subnet-id subnet-xxx \
  --security-group-ids sg-xxx \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=stemedb-node}]'
# Both instances receive the same tags (tag keys must be unique per resource),
# so rename each one afterwards, e.g.:
#   aws ec2 create-tags --resources <instance-id> --tags Key=Name,Value=stemedb-node2
# Note instance IDs and private IPs
NODE2_IP="10.0.1.52"
NODE3_IP="10.0.1.53"
```
**Step 2: Install StemeDB on new nodes**
```bash
# SSH to node2
ssh ubuntu@$NODE2_IP
# Install StemeDB (same version as node1!)
sudo curl -L https://github.com/yourorg/stemedb/releases/download/v0.1.0/stemedb-api -o /usr/local/bin/stemedb-api
sudo chmod +x /usr/local/bin/stemedb-api
# Create data and config directories (assumes the stemedb service user already exists)
sudo mkdir -p /data/{wal,db} /etc/stemedb
sudo chown -R stemedb:stemedb /data
# Repeat for node3
```
**Step 3: Configure cluster on all nodes**
```bash
# Node 1 (existing): Enable cluster mode
cat <<EOF | sudo tee /etc/stemedb/cluster.toml
[cluster]
enabled = true
node_id = "node1"
bind_addr = "10.0.1.51:18181" # Node1 IP
rpc_addr = "10.0.1.51:18182"
swim_addr = "10.0.1.51:18183"
# Seed nodes for discovery
seeds = [
"10.0.1.52:18183", # Node2
"10.0.1.53:18183" # Node3
]
[replication]
factor = 2 # Replicate each assertion to 2 nodes
EOF
# Node 2: Similar config with node2 IPs
ssh node2 "cat <<EOF | sudo tee /etc/stemedb/cluster.toml
[cluster]
enabled = true
node_id = \"node2\"
bind_addr = \"10.0.1.52:18181\"
rpc_addr = \"10.0.1.52:18182\"
swim_addr = \"10.0.1.52:18183\"
seeds = [\"10.0.1.51:18183\", \"10.0.1.53:18183\"]
[replication]
factor = 2
EOF"
# Node 3: Similar config with node3 IPs
ssh node3 "cat <<EOF | sudo tee /etc/stemedb/cluster.toml
[cluster]
enabled = true
node_id = \"node3\"
bind_addr = \"10.0.1.53:18181\"
rpc_addr = \"10.0.1.53:18182\"
swim_addr = \"10.0.1.53:18183\"
seeds = [\"10.0.1.51:18183\", \"10.0.1.52:18183\"]
[replication]
factor = 2
EOF"
```
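
The three configs above differ only in node ID, node IP, and seed list, so they can be generated from one template instead of being edited by hand. This is a sketch under that assumption; `render_cluster_toml` is a hypothetical helper, and the keys and ports are copied from the configs in this step.

```bash
#!/usr/bin/env bash
# render_cluster_toml NODE_ID NODE_IP SEED_IP...
# Hypothetical helper: prints a cluster.toml for one node to stdout.
render_cluster_toml() {
  local node_id node_ip seeds ip
  node_id=$1
  node_ip=$2
  shift 2
  seeds=""
  # Build a comma-separated list of quoted SWIM seed addresses.
  for ip in "$@"; do
    seeds="${seeds:+$seeds, }\"${ip}:18183\""
  done
  cat <<EOF
[cluster]
enabled = true
node_id = "${node_id}"
bind_addr = "${node_ip}:18181"
rpc_addr = "${node_ip}:18182"
swim_addr = "${node_ip}:18183"
seeds = [${seeds}]

[replication]
factor = 2
EOF
}
```

For example, `render_cluster_toml node2 10.0.1.52 10.0.1.51 10.0.1.53 | ssh node2 "sudo tee /etc/stemedb/cluster.toml"` would produce the same node2 config as the heredoc above.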
**Step 4: Start new nodes first (empty data)**
```bash
# Start node2
ssh node2 "sudo systemctl start stemedb-api"
# Start node3
ssh node3 "sudo systemctl start stemedb-api"
# Verify startup
ssh node2 "curl http://localhost:18180/v1/health"
ssh node3 "curl http://localhost:18180/v1/health"
# Both should return: {"status": "healthy", "assertion_count": 0}
```
**Step 5: Restart node1 with cluster config**
```bash
# Restart node1 to join cluster
sudo systemctl restart stemedb-api
# Wait for SWIM gossip to converge (~10 seconds)
sleep 15
```
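
A fixed `sleep` either waits too long or not long enough. As an alternative, the wait can poll a condition until it holds or a timeout expires. The sketch below is a generic retry loop; `wait_until` and `all_up` are hypothetical names, and the commented example assumes the `/cluster/members` response shape shown in Step 6.

```bash
#!/usr/bin/env bash
# wait_until TIMEOUT_S INTERVAL_S COMMAND [ARGS...]
# Re-runs COMMAND every INTERVAL_S seconds until it succeeds.
# Returns 0 on success, 1 once TIMEOUT_S seconds of waiting have elapsed.
wait_until() {
  local timeout_s interval_s elapsed
  timeout_s=$1
  interval_s=$2
  shift 2
  elapsed=0
  until "$@"; do
    if [ "$elapsed" -ge "$timeout_s" ]; then
      return 1
    fi
    sleep "$interval_s"
    elapsed=$((elapsed + interval_s))
  done
}

# Example (not executed here): wait up to 60s for three members to be UP.
# all_up() {
#   curl -sf http://localhost:18181/cluster/members |
#     jq -e '[.members[] | select(.status == "UP")] | length == 3' >/dev/null
# }
# wait_until 60 5 all_up
```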
**Step 6: Verify cluster formation**
```bash
# Check cluster membership from any node
curl http://localhost:18181/cluster/members | jq '.'
# Expected output:
# {
#   "members": [
#     {"id": "node1", "status": "UP", "assertion_count": 10234},
#     {"id": "node2", "status": "UP", "assertion_count": 0},
#     {"id": "node3", "status": "UP", "assertion_count": 0}
#   ]
# }
# Check replication status
curl http://localhost:18180/metrics | grep replication_lag_seconds
# All nodes should show <1s lag
```
**Step 7: Trigger initial replication**
```bash
# Manually trigger Merkle sync to populate node2 and node3
curl -X POST http://localhost:18181/cluster/sync \
-H "Content-Type: application/json" \
-d '{"target_nodes": ["node2", "node3"], "force": true}'
# Monitor replication progress
watch -n 5 'curl -s http://localhost:18181/cluster/members | jq ".members[] | {id, assertion_count}"'
# Wait for node2 and node3 to reach same assertion_count as node1
# (Typically 1-5 minutes for <100K assertions)
```
**Validate cluster:**
```bash
# All nodes should have same assertion count
curl http://node1:18180/v1/health | jq '.assertion_count'
curl http://node2:18180/v1/health | jq '.assertion_count'
curl http://node3:18180/v1/health | jq '.assertion_count'
# All should match the original count ($ASSERTION_COUNT recorded in the diagnostic step)
# Test writes hit multiple nodes
curl -X POST http://localhost:18180/v1/assert \
-H "Content-Type: application/json" \
-d '{"concept_path": "test/cluster", "predicate": "replicated", "value": true}'
# Query from different nodes
curl -X POST http://node2:18180/v1/query \
-H "Content-Type: application/json" \
-d '{"concept_path": "test/cluster", "lens": "recency"}'
# Should return the assertion just written
```
**If failed:** Cluster won't form → Check firewall rules, SWIM gossip logs, network connectivity.
---
### §2. Add Node to Existing Cluster
**Use case:** Scaling existing 3-node cluster to 4+ nodes
⚠️ **NOTE:** Pilot 5 supports 3-node clusters only. Clusters of 4+ nodes are on the roadmap (P6), so the procedure below is forward-looking.
**Diagnostic:**
```bash
# Check current cluster state
curl http://node1:18181/cluster/members | jq '.members | length'
# Should return: 3
# Check cluster health
curl http://node1:18181/cluster/health
# Should return: {"status": "healthy", "quorum": true}
```
**Resolution: Add node4**
**Step 1: Provision new node**
```bash
# (Same as §1 Step 1)
NODE4_IP="10.0.1.54"
```
**Step 2: Install StemeDB on node4**
```bash
# (Same as §1 Step 2)
```
**Step 3: Configure node4**
```bash
ssh node4 "cat <<EOF | sudo tee /etc/stemedb/cluster.toml
[cluster]
enabled = true
node_id = \"node4\"
bind_addr = \"10.0.1.54:18181\"
rpc_addr = \"10.0.1.54:18182\"
swim_addr = \"10.0.1.54:18183\"
# Point to existing cluster for discovery
seeds = [
\"10.0.1.51:18183\", # Node1
\"10.0.1.52:18183\", # Node2
\"10.0.1.53:18183\" # Node3
]
[replication]
factor = 2
EOF"
```
**Step 4: Start node4**
```bash
ssh node4 "sudo systemctl start stemedb-api"
# SWIM gossip will auto-discover existing cluster
# No restart of existing nodes required!
```
**Step 5: Verify join**
```bash
# Check cluster membership
curl http://node1:18181/cluster/members | jq '.members | length'
# Should return: 4
# Check node4 status
curl http://node1:18181/cluster/members | jq '.members[] | select(.id=="node4")'
# Should show: {"id": "node4", "status": "UP", "assertion_count": 0}
```
**Step 6: Rebalance shards (manual for Pilot 5)**
⚠️ **NOTE:** Automatic rebalancing is on the roadmap (P6.3). Until then, shards must be moved manually.
```bash
# View current shard assignment
curl http://node1:18181/cluster/shards | jq '.'
# Identify shards to move to node4
# (Typically 25% of shards from node1, node2, node3)
# Move shard (example)
curl -X POST http://node1:18181/admin/shards/rebalance \
-H "Content-Type: application/json" \
-d '{
"shard_id": "shard-abc123",
"target_node": "node4",
"reason": "add_capacity"
}'
# Monitor rebalance progress
watch -n 5 'curl -s http://node1:18181/cluster/shards | jq ".shards[] | select(.id==\"shard-abc123\") | .rebalance_status"'
# Repeat for other shards until balanced
```
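
The 25% rule of thumb follows from simple arithmetic. If each of N existing nodes holds S shards, an even spread over N+1 nodes leaves N·S/(N+1) shards per node, so each existing node gives up S/(N+1). For N=3 that is S/4, or 25%. A small sketch of that calculation, assuming shards are evenly distributed and roughly equal in size (`shards_to_move` is a hypothetical name):

```bash
#!/usr/bin/env bash
# shards_to_move SHARDS_PER_NODE EXISTING_NODE_COUNT
# Prints how many shards each existing node should hand to ONE new node
# so that all nodes end up with about the same number (rounded down).
shards_to_move() {
  local per_node old_nodes
  per_node=$1
  old_nodes=$2
  echo $(( per_node / (old_nodes + 1) ))
}
```

With 12 shards on each of three nodes, `shards_to_move 12 3` prints 3: each existing node keeps 9 and node4 receives 9.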
**Validate:**
```bash
# All nodes should have similar assertion counts
curl http://node1:18181/cluster/members | jq '.members[] | {id, assertion_count}'
# Test query hits node4
curl -X POST http://node4:18180/v1/query \
-H "Content-Type: application/json" \
-d '{"concept_path": "test/node4", "lens": "recency"}'
# Should succeed
```
**If failed:** Node4 won't join → Check seed node IPs, firewall rules, SWIM logs.
---
### §3. Replace Failed Node
**Use case:** Node2 failed (hardware, software), need replacement
**Diagnostic:**
```bash
# Check cluster status
curl http://node1:18181/cluster/members | jq '.members[] | select(.status != "UP")'
# Expected output:
# {
#   "id": "node2",
#   "status": "DOWN",
#   "last_seen": "2026-02-11T10:15:00Z"
# }
# Check replication status
curl http://node1:18180/metrics | grep replication_lag_seconds
# May show elevated lag to node2
```
**Resolution: Replace node2**
**Step 1: Remove failed node from cluster**
```bash
# Gracefully remove node2 (allows rebalancing)
curl -X POST http://node1:18181/admin/cluster/remove \
-H "Content-Type: application/json" \
-d '{"node_id": "node2", "force": false}'
# Wait for shards to rebalance to node1 and node3
# (Typically 5-15 minutes for <100K assertions)
watch -n 10 'curl -s http://node1:18181/cluster/members | jq .members'
# node2 should disappear from list
```
**Step 2: Provision new node2**
```bash
# Launch new instance
NEW_NODE2_IP="10.0.1.55" # May be different IP
```
**Step 3: Configure new node2**
```bash
# (Same as §1 Step 3, using new IP)
ssh new-node2 "cat <<EOF | sudo tee /etc/stemedb/cluster.toml
[cluster]
enabled = true
node_id = \"node2-replacement\" # Different ID
bind_addr = \"10.0.1.55:18181\"
rpc_addr = \"10.0.1.55:18182\"
swim_addr = \"10.0.1.55:18183\"
seeds = [\"10.0.1.51:18183\", \"10.0.1.53:18183\"]
[replication]
factor = 2
EOF"
```
**Step 4: Start new node2**
```bash
ssh new-node2 "sudo systemctl start stemedb-api"
# Auto-joins cluster via SWIM
```
**Step 5: Verify join and replication**
```bash
# Check membership
curl http://node1:18181/cluster/members | jq '.members'
# Should show: node1, node2-replacement, node3
# Trigger replication to new node
curl -X POST http://node1:18181/cluster/sync \
-H "Content-Type: application/json" \
-d '{"target_nodes": ["node2-replacement"], "force": true}'
# Monitor
watch -n 5 'curl -s http://node1:18181/cluster/members | jq ".members[] | select(.id==\"node2-replacement\") | .assertion_count"'
```
**Validate:**
```bash
# Cluster healthy with 3 nodes
curl http://node1:18181/cluster/health
# Should return: {"status": "healthy", "quorum": true}
# New node2 has full data
curl http://new-node2:18180/v1/health | jq '.assertion_count'
# Should match node1 and node3
```
**If failed:** Replication not catching up → Check network bandwidth, disk I/O, Merkle sync logs.
---
## Validation
After adding node, validate cluster health:
- [ ] **Cluster members show new node**
```bash
curl http://node1:18181/cluster/members | jq '.members'
# Should list all nodes with status "UP"
```
- [ ] **Replication lag <1s**
```bash
curl http://node1:18180/metrics | grep replication_lag_seconds
# All nodes should show <1.0
```
- [ ] **Assertion counts match**
```bash
for node in node1 node2 node3; do
echo "$node: $(curl -s http://$node:18180/v1/health | jq '.assertion_count')"
done
# All should be equal (±1 for in-flight writes)
```
- [ ] **Queries work from new node**
```bash
curl -X POST http://new-node:18180/v1/query \
-H "Content-Type: application/json" \
-d '{"concept_path": "test/cluster", "lens": "recency"}'
# Should return results
```
- [ ] **Writes replicate to new node**
```bash
curl -X POST http://node1:18180/v1/assert \
-H "Content-Type: application/json" \
-d '{"concept_path": "test/new_node", "predicate": "validated", "value": true}'
# Query from new node
curl -X POST http://new-node:18180/v1/query \
-H "Content-Type: application/json" \
-d '{"concept_path": "test/new_node", "lens": "recency"}'
# Should return the assertion
```
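
The "counts match, give or take in-flight writes" check can be automated by comparing the spread between the smallest and largest count with a tolerance. This is a sketch only; `counts_within` is a hypothetical helper that takes the counts as plain integers, however they were collected.

```bash
#!/usr/bin/env bash
# counts_within MAX_DIFF COUNT [COUNT...]
# Returns 0 when the largest and smallest COUNT differ by at most MAX_DIFF,
# 1 when they differ by more, and 2 when no counts were given.
counts_within() {
  local max_diff min max c
  max_diff=$1
  shift
  if [ "$#" -eq 0 ]; then
    return 2
  fi
  min=$1
  max=$1
  for c in "$@"; do
    if [ "$c" -lt "$min" ]; then min=$c; fi
    if [ "$c" -gt "$max" ]; then max=$c; fi
  done
  [ $((max - min)) -le "$max_diff" ]
}
```

Fed from the health endpoint used in the checklist, this might look like `counts_within 1 $(for node in node1 node2 node3; do curl -s http://$node:18180/v1/health | jq '.assertion_count'; done)`.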
---
## Network Requirements
**For cluster operation, ensure:**
| Port | Protocol | Purpose | Required For |
|------|----------|---------|--------------|
| **18180** | TCP/HTTP | API queries | Client → Any node |
| **18181** | TCP/HTTP | Cluster gateway | Load balancer → Nodes |
| **18182** | TCP/gRPC | Cluster RPC (replication) | Node ↔ Node |
| **18183** | UDP | SWIM gossip (membership) | Node ↔ Node |
**Firewall rules (AWS Security Group example):**
```bash
# Allow cluster communication (node ↔ node)
aws ec2 authorize-security-group-ingress \
--group-id sg-xxx \
--source-group sg-xxx \
--protocol tcp \
--port 18180-18183
aws ec2 authorize-security-group-ingress \
--group-id sg-xxx \
--source-group sg-xxx \
--protocol udp \
--port 18183
# Allow client access (load balancer → nodes)
aws ec2 authorize-security-group-ingress \
--group-id sg-xxx \
--source-group sg-lb \
--protocol tcp \
--port 18180
```
**Latency requirement:** <5ms inter-node latency (same region/AZ required)
**See:** [Network Requirements](../reference-architecture/network-requirements.md) for full details.
---
## Load Balancer Configuration
**After adding nodes, update load balancer:**
**Nginx example:**
```nginx
upstream stemedb_cluster {
    # Round-robin by default
    server 10.0.1.51:18180 weight=1; # node1
    server 10.0.1.52:18180 weight=1; # node2
    server 10.0.1.53:18180 weight=1; # node3
    # Active health checks: the `check` directive comes from the third-party
    # nginx_upstream_check_module (bundled with Tengine), not stock nginx
    check interval=5000 rise=2 fall=3 timeout=3000;
}
server {
    listen 443 ssl;
    # ssl_certificate and ssl_certificate_key must also be set for this listener
    server_name stemedb.example.com;
    location / {
        proxy_pass http://stemedb_cluster;
        proxy_next_upstream error timeout http_502 http_503;
        proxy_connect_timeout 5s;
        proxy_send_timeout 30s;
        proxy_read_timeout 30s;
    }
}
```
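
When nodes are added often, the `upstream` block can be regenerated from the node list rather than edited by hand. A minimal sketch, assuming the API port 18180 and equal weights as in the example above. `render_upstream` is a hypothetical helper and it leaves out the health-check directive.

```bash
#!/usr/bin/env bash
# render_upstream NODE_IP...
# Prints an nginx upstream block with one server line per node IP.
render_upstream() {
  local ip
  echo "upstream stemedb_cluster {"
  for ip in "$@"; do
    echo "    server ${ip}:18180 weight=1;"
  done
  echo "}"
}
```

The output could be written to a file that the main nginx config includes, then validated and applied with `nginx -t && nginx -s reload`.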
**Envoy example:**
```yaml
clusters:
  - name: stemedb_cluster
    type: STRICT_DNS
    load_assignment:
      cluster_name: stemedb_cluster
      endpoints:
        - lb_endpoints:
            - endpoint:
                address:
                  socket_address:
                    address: node1.example.com
                    port_value: 18180
            - endpoint:
                address:
                  socket_address:
                    address: node2.example.com
                    port_value: 18180
            - endpoint:
                address:
                  socket_address:
                    address: node3.example.com
                    port_value: 18180
    health_checks:
      - timeout: 3s
        interval: 5s
        unhealthy_threshold: 3
        healthy_threshold: 2
        http_health_check:
          path: "/v1/health"
---
## Cluster Sizing Guidelines
**From [Resource Sizing Guide](../reference-architecture/resource-sizing.md):**
| Assertions | Nodes | Replication Factor | RTO | RPO |
|-----------|-------|-------------------|-----|-----|
| <10K | 1 | N/A | 2hr | 24hr |
| <100K | 3 | 2 | 5min | 1min |
| <1M | 5 | 3 | 1min | 10s |
**When to add nodes:**
- Query latency p99 >1s (capacity)
- Disk usage >80% (storage)
- CPU sustained >70% (compute)
- Planning for HA (minimum 3 nodes)
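
The three numeric triggers above can be expressed as a simple threshold check, for instance in a capacity-review script. This is a sketch, not an official tool: `scaling_reasons` is a hypothetical name, inputs are whole numbers, and collecting the metrics (and the HA planning criterion) is left to the operator.

```bash
#!/usr/bin/env bash
# scaling_reasons P99_MS DISK_PCT CPU_PCT
# Prints one line per exceeded threshold. Returns 0 if at least one
# threshold is exceeded (consider adding a node), 1 otherwise.
scaling_reasons() {
  local p99_ms disk_pct cpu_pct hit
  p99_ms=$1
  disk_pct=$2
  cpu_pct=$3
  hit=1
  if [ "$p99_ms" -gt 1000 ]; then
    echo "capacity: query latency p99 ${p99_ms}ms > 1000ms"
    hit=0
  fi
  if [ "$disk_pct" -gt 80 ]; then
    echo "storage: disk usage ${disk_pct}% > 80%"
    hit=0
  fi
  if [ "$cpu_pct" -gt 70 ]; then
    echo "compute: sustained CPU ${cpu_pct}% > 70%"
    hit=0
  fi
  return "$hit"
}
```

For example, `scaling_reasons 1200 85 40` reports the capacity and storage triggers and returns success, while `scaling_reasons 500 50 40` prints nothing and returns 1.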
---
## Related Documentation
- [Three-Node Cluster Architecture](../reference-architecture/three-node-cluster.md) - Deployment guide
- [Network Requirements](../reference-architecture/network-requirements.md) - Firewall rules
- [High Query Latency](./high-query-latency.md) - Shard rebalancing
- [Resource Sizing](../reference-architecture/resource-sizing.md) - Capacity planning
---
## Future Enhancements
**Roadmap P6.3 (Automatic Shard Rebalancing):**
- Auto-detect when new node joins
- Automatically rebalance shards for even distribution
- No manual `shards/rebalance` API calls needed
**Roadmap P6.4 (WAL Archival to S3):**
- Replicate WAL segments to S3 for durability
- Reduce local disk requirements
- Enable faster node replacement (restore from S3)
---
## Last Updated
2026-02-11