Runbook: Add Node to Cluster
Symptom
- Need to scale from single-node to 3-node cluster
- Need to add capacity to existing cluster
- Need to replace failed node
- Planning horizontal scaling
Quick Diagnosis
Need to add node
│
├─► Currently single-node?
│   └─► §1 Bootstrap 3-Node Cluster
│
├─► Existing 3-node cluster, need more capacity?
│   └─► §2 Add Node to Existing Cluster
│
├─► Node failed, need replacement?
│   └─► §3 Replace Failed Node
│
└─► Planning scaling strategy?
    └─► See Reference Architectures
Prerequisites
Before adding a node, confirm:
- Network connectivity:
  # From the new node, ping the existing nodes
  ping -c 5 node1.example.com
  ping -c 5 node2.example.com
  # Should show <5ms latency (same region required)
- Ports open:
  # Test connectivity to cluster ports
  nc -zv node1.example.com 18180 # HTTP API
  nc -zv node1.example.com 18181 # Cluster Gateway
  nc -zv node1.example.com 18182 # Cluster RPC
  nc -zvu node1.example.com 18183 # SWIM Gossip (UDP, so -u is required)
  # All should succeed
- StemeDB installed on new node:
  # Verify binary
  which stemedb-api
  # Should return: /usr/local/bin/stemedb-api (or installation path)
- Disk space sufficient:
  df -h /data
  # Should have >50GB available for pilot
- Cluster healthy (if existing):
  curl http://node1:18180/v1/health
  # Should return: {"status": "healthy", ...}
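The disk-space prerequisite can be checked by a script rather than by reading `df -h` output. The following is a minimal sketch, assuming POSIX `df -Pk` output (1024-byte blocks, available space in column 4) and treating the 50GB figure as GiB; the function names `kib_exceeds_gib` and `has_free_gib` are hypothetical and not part of StemeDB.

```shell
#!/bin/sh
# Pure comparison: succeeds when AVAIL_KIB is strictly greater than MIN_GIB.
# Usage: kib_exceeds_gib AVAIL_KIB MIN_GIB
kib_exceeds_gib() {
  [ "$1" -gt $(($2 * 1024 * 1024)) ]
}

# Hypothetical helper: succeeds when the filesystem holding PATH has more
# than MIN_GIB available. Relies on `df -Pk` printing one data line per
# filesystem with the available KiB in the fourth column.
# Usage: has_free_gib PATH MIN_GIB
has_free_gib() {
  avail_kib=$(df -Pk "$1" | awk 'NR == 2 { print $4 }')
  [ -n "$avail_kib" ] && kib_exceeds_gib "$avail_kib" "$2"
}
```

On a new node this might be called as `has_free_gib /data 50 || echo "not enough disk for pilot"`.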
Resolution Steps
§1. Bootstrap 3-Node Cluster (From Single-Node)
Use case: Migrating from single-node pilot to 3-node production cluster
Diagnostic:
# Check current single-node state
curl http://localhost:18180/v1/health
# Note assertion_count for validation later
ASSERTION_COUNT=$(curl -s http://localhost:18180/v1/health | jq '.assertion_count')
echo "Current assertions: $ASSERTION_COUNT"
# Verify no cluster config
curl http://localhost:18180/metrics | grep cluster_members
# Should return empty (single-node)
Resolution: Step-by-step cluster bootstrap
Step 1: Provision 2 new nodes
# AWS example: Launch 2 instances matching current node specs
aws ec2 run-instances \
--image-id ami-xxx \
--instance-type t3.large \
--count 2 \
--subnet-id subnet-xxx \
--security-group-ids sg-xxx \
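--tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=stemedb-node}]'
# --count 2 applies the same tags to both instances, and tag keys must be unique,
# so give each instance its own Name afterwards, for example:
# aws ec2 create-tags --resources i-xxx --tags Key=Name,Value=stemedb-node2
# Note instance IDs and private IPs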
NODE2_IP="10.0.1.52"
NODE3_IP="10.0.1.53"
Step 2: Install StemeDB on new nodes
# SSH to node2
ssh ubuntu@$NODE2_IP
# Install StemeDB (same version as node1!)
sudo curl -L https://github.com/yourorg/stemedb/releases/download/v0.1.0/stemedb-api -o /usr/local/bin/stemedb-api
sudo chmod +x /usr/local/bin/stemedb-api
# Create data directories
sudo mkdir -p /data/{wal,db}
sudo chown -R stemedb:stemedb /data
# Repeat for node3
Step 3: Configure cluster on all nodes
# Node 1 (existing): Enable cluster mode
cat <<EOF | sudo tee /etc/stemedb/cluster.toml
[cluster]
enabled = true
node_id = "node1"
bind_addr = "10.0.1.51:18181" # Node1 IP
rpc_addr = "10.0.1.51:18182"
swim_addr = "10.0.1.51:18183"
# Seed nodes for discovery
seeds = [
"10.0.1.52:18183", # Node2
"10.0.1.53:18183" # Node3
]
[replication]
factor = 2 # Replicate each assertion to 2 nodes
EOF
# Node 2: Similar config with node2 IPs
ssh node2 "cat <<EOF | sudo tee /etc/stemedb/cluster.toml
[cluster]
enabled = true
node_id = \"node2\"
bind_addr = \"10.0.1.52:18181\"
rpc_addr = \"10.0.1.52:18182\"
swim_addr = \"10.0.1.52:18183\"
seeds = [\"10.0.1.51:18183\", \"10.0.1.53:18183\"]
[replication]
factor = 2
EOF"
# Node 3: Similar config with node3 IPs
ssh node3 "cat <<EOF | sudo tee /etc/stemedb/cluster.toml
[cluster]
enabled = true
node_id = \"node3\"
bind_addr = \"10.0.1.53:18181\"
rpc_addr = \"10.0.1.53:18182\"
swim_addr = \"10.0.1.53:18183\"
seeds = [\"10.0.1.51:18183\", \"10.0.1.52:18183\"]
[replication]
factor = 2
EOF"
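The three configs differ only in node ID, node IP, and seed list, so they can be generated from one template instead of hand-escaping quotes inside `ssh "..."`. The sketch below assumes the `cluster.toml` keys and port numbers shown above; `render_cluster_toml` is a hypothetical helper, not a StemeDB tool.

```shell
#!/bin/sh
# Hypothetical helper: print a cluster.toml for one node on stdout.
# Usage: render_cluster_toml NODE_ID NODE_IP SEED_IP [SEED_IP...]
render_cluster_toml() {
  node_id=$1
  node_ip=$2
  shift 2
  seeds=""
  # Build a comma-separated list of quoted "ip:18183" SWIM seed addresses.
  for seed_ip in "$@"; do
    seeds="${seeds:+$seeds, }\"$seed_ip:18183\""
  done
  cat <<EOF
[cluster]
enabled = true
node_id = "$node_id"
bind_addr = "$node_ip:18181"
rpc_addr = "$node_ip:18182"
swim_addr = "$node_ip:18183"
seeds = [$seeds]

[replication]
factor = 2
EOF
}
```

A possible use is `render_cluster_toml node2 10.0.1.52 10.0.1.51 10.0.1.53 | ssh node2 "sudo tee /etc/stemedb/cluster.toml"`, which keeps the quoting local and sends only the finished file over SSH.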
Step 4: Start new nodes first (empty data)
# Start node2
ssh node2 "sudo systemctl start stemedb-api"
# Start node3
ssh node3 "sudo systemctl start stemedb-api"
# Verify startup
ssh node2 "curl http://localhost:18180/v1/health"
ssh node3 "curl http://localhost:18180/v1/health"
# Both should return: {"status": "healthy", "assertion_count": 0}
Step 5: Restart node1 with cluster config
# Restart node1 to join cluster
sudo systemctl restart stemedb-api
# Wait for SWIM gossip to converge (~10 seconds)
sleep 15
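A fixed sleep either waits too long or not long enough if gossip is slow. One alternative is to poll a condition with a timeout. This is a sketch only: `wait_until` is a hypothetical helper, and the commented usage assumes the `/cluster/members` response shape shown in Step 6.

```shell
#!/bin/sh
# Hypothetical helper: run COMMAND every INTERVAL seconds until it succeeds
# or roughly TIMEOUT seconds have passed. Returns 0 on success, 1 on timeout.
# Usage: wait_until TIMEOUT INTERVAL COMMAND [ARGS...]
wait_until() {
  wu_timeout=$1
  wu_interval=$2
  shift 2
  wu_elapsed=0
  until "$@"; do
    wu_elapsed=$((wu_elapsed + wu_interval))
    if [ "$wu_elapsed" -gt "$wu_timeout" ]; then
      return 1
    fi
    sleep "$wu_interval"
  done
  return 0
}

# Example condition (assumes jq is installed and the member list format
# documented in this runbook):
#   all_up() {
#     curl -sf http://localhost:18181/cluster/members |
#       jq -e '.members | length == 3 and all(.status == "UP")' >/dev/null
#   }
#   wait_until 60 5 all_up || echo "cluster did not converge within 60s"
```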
Step 6: Verify cluster formation
# Check cluster membership from any node
curl http://localhost:18181/cluster/members | jq '.'
# Expected output:
# {
# "members": [
# {"id": "node1", "status": "UP", "assertion_count": 10234},
# {"id": "node2", "status": "UP", "assertion_count": 0},
# {"id": "node3", "status": "UP", "assertion_count": 0}
# ]
# }
# Check replication status
curl http://localhost:18180/metrics | grep replication_lag_seconds
# All nodes should show <1s lag
Step 7: Trigger initial replication
# Manually trigger Merkle sync to populate node2 and node3
curl -X POST http://localhost:18181/cluster/sync \
-H "Content-Type: application/json" \
-d '{"target_nodes": ["node2", "node3"], "force": true}'
# Monitor replication progress
watch -n 5 'curl -s http://localhost:18181/cluster/members | jq ".members[] | {id, assertion_count}"'
# Wait for node2 and node3 to reach same assertion_count as node1
# (Typically 1-5 minutes for <100K assertions)
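While the sync runs, a percentage is easier to read than two raw counts. A minimal sketch, assuming progress can be approximated as the ratio of a replica's `assertion_count` to node1's count; `sync_progress_pct` is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: print replica progress as an integer percentage,
# capped at 100. A target of 0 is treated as already complete.
# Usage: sync_progress_pct CURRENT_COUNT TARGET_COUNT
sync_progress_pct() {
  current=$1
  target=$2
  if [ "$target" -le 0 ]; then
    echo 100
    return 0
  fi
  pct=$((current * 100 / target))
  if [ "$pct" -gt 100 ]; then
    pct=100
  fi
  echo "$pct"
}
```

For example, with `$ASSERTION_COUNT` captured in the diagnostic step, `sync_progress_pct "$(curl -s http://node2:18180/v1/health | jq '.assertion_count')" "$ASSERTION_COUNT"` would print node2's progress.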
Validate cluster:
# All nodes should have same assertion count
curl http://node1:18180/v1/health | jq '.assertion_count'
curl http://node2:18180/v1/health | jq '.assertion_count'
curl http://node3:18180/v1/health | jq '.assertion_count'
# All should match original count
# Test writes hit multiple nodes
curl -X POST http://localhost:18180/v1/assert \
-H "Content-Type: application/json" \
-d '{"concept_path": "test/cluster", "predicate": "replicated", "value": true}'
# Query from different nodes
curl -X POST http://node2:18180/v1/query \
-H "Content-Type: application/json" \
-d '{"concept_path": "test/cluster", "lens": "recency"}'
# Should return the assertion just written
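The count comparison above can be made into a pass/fail check. The sketch below assumes counts are plain integers and allows a small tolerance for in-flight writes; `counts_match` is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when every COUNT after the first is within
# TOLERANCE of the first (reference) count.
# Usage: counts_match TOLERANCE REFERENCE_COUNT [COUNT...]
counts_match() {
  tolerance=$1
  reference=$2
  shift 2
  for count in "$@"; do
    diff=$((count - reference))
    if [ "$diff" -lt 0 ]; then
      diff=$((-diff))
    fi
    if [ "$diff" -gt "$tolerance" ]; then
      return 1
    fi
  done
  return 0
}
```

It could be fed with the three `jq '.assertion_count'` results, for example `counts_match 1 "$N1" "$N2" "$N3"`, where the variable names are placeholders.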
If failed: Cluster won't form → Check firewall rules, SWIM gossip logs, network connectivity.
§2. Add Node to Existing Cluster
Use case: Scaling existing 3-node cluster to 4+ nodes
⚠️ NOTE: Pilot 5 supports 3-node clusters. Clusters of 4+ nodes are roadmap P6; the procedure below is written in advance of that support.
Diagnostic:
# Check current cluster state
curl http://node1:18181/cluster/members | jq '.members | length'
# Should return: 3
# Check cluster health
curl http://node1:18181/cluster/health
# Should return: {"status": "healthy", "quorum": true}
Resolution: Add node4
Step 1: Provision new node
# (Same as §1 Step 1)
NODE4_IP="10.0.1.54"
Step 2: Install StemeDB on node4
# (Same as §1 Step 2)
Step 3: Configure node4
ssh node4 "cat <<EOF | sudo tee /etc/stemedb/cluster.toml
[cluster]
enabled = true
node_id = \"node4\"
bind_addr = \"10.0.1.54:18181\"
rpc_addr = \"10.0.1.54:18182\"
swim_addr = \"10.0.1.54:18183\"
# Point to existing cluster for discovery
seeds = [
\"10.0.1.51:18183\", # Node1
\"10.0.1.52:18183\", # Node2
\"10.0.1.53:18183\" # Node3
]
[replication]
factor = 2
EOF"
Step 4: Start node4
ssh node4 "sudo systemctl start stemedb-api"
# SWIM gossip will auto-discover existing cluster
# No restart of existing nodes required!
Step 5: Verify join
# Check cluster membership
curl http://node1:18181/cluster/members | jq '.members | length'
# Should return: 4
# Check node4 status
curl http://node1:18181/cluster/members | jq '.members[] | select(.id=="node4")'
# Should show: {"id": "node4", "status": "UP", "assertion_count": 0}
Step 6: Rebalance shards (manual for Pilot 5)
⚠️ NOTE: Automatic rebalancing is roadmap P6.3. Manual process required.
# View current shard assignment
curl http://node1:18181/cluster/shards | jq '.'
# Identify shards to move to node4
# (Typically 25% of shards from node1, node2, node3)
# Move shard (example)
curl -X POST http://node1:18181/admin/shards/rebalance \
-H "Content-Type: application/json" \
-d '{
"shard_id": "shard-abc123",
"target_node": "node4",
"reason": "add_capacity"
}'
# Monitor rebalance progress
watch -n 5 'curl -s http://node1:18181/cluster/shards | jq ".shards[] | select(.id==\"shard-abc123\") | .rebalance_status"'
# Repeat for other shards until balanced
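To decide how many shards each existing node should give up, a simple even-spread calculation is usually enough. This is a sketch of that heuristic only, not StemeDB's placement logic: it assumes shards are roughly equal in size, and `shards_to_move` is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: print how many shards a donor node should hand over
# so that it ends at the per-node ceiling after the cluster grows.
# Usage: shards_to_move HELD_BY_DONOR TOTAL_SHARDS NEW_NODE_COUNT
shards_to_move() {
  held=$1
  total=$2
  nodes=$3
  # Ceiling of total / nodes: the most shards any node should keep.
  target=$(((total + nodes - 1) / nodes))
  excess=$((held - target))
  if [ "$excess" -lt 0 ]; then
    excess=0
  fi
  echo "$excess"
}
```

With 12 shards spread evenly over 3 nodes, `shards_to_move 4 12 4` prints `1`, which matches the 25% rule of thumb above: each of the three existing nodes moves one of its four shards to node4.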
Validate:
# All nodes should have similar assertion counts
curl http://node1:18181/cluster/members | jq '.members[] | {id, assertion_count}'
# Test query hits node4
curl -X POST http://node4:18180/v1/query \
-H "Content-Type: application/json" \
-d '{"concept_path": "test/node4", "lens": "recency"}'
# Should succeed
If failed: Node4 won't join → Check seed node IPs, firewall rules, SWIM logs.
§3. Replace Failed Node
Use case: Node2 failed (hardware, software), need replacement
Diagnostic:
# Check cluster status
curl http://node1:18181/cluster/members | jq '.members[] | select(.status != "UP")'
# Expected output:
# {
# "id": "node2",
# "status": "DOWN",
# "last_seen": "2026-02-11T10:15:00Z"
# }
# Check replication status
curl http://node1:18180/metrics | grep replication_lag_seconds
# May show elevated lag to node2
Resolution: Replace node2
Step 1: Remove failed node from cluster
# Gracefully remove node2 (allows rebalancing)
curl -X POST http://node1:18181/admin/cluster/remove \
-H "Content-Type: application/json" \
-d '{"node_id": "node2", "force": false}'
# Wait for shards to rebalance to node1 and node3
# (Typically 5-15 minutes for <100K assertions)
watch -n 10 'curl -s http://node1:18181/cluster/members | jq .members'
# node2 should disappear from list
Step 2: Provision new node2
# Launch new instance
NEW_NODE2_IP="10.0.1.55" # May be different IP
Step 3: Configure new node2
# (Same as §1 Step 3, using new IP)
ssh new-node2 "cat <<EOF | sudo tee /etc/stemedb/cluster.toml
[cluster]
enabled = true
node_id = \"node2-replacement\" # Different ID
bind_addr = \"10.0.1.55:18181\"
rpc_addr = \"10.0.1.55:18182\"
swim_addr = \"10.0.1.55:18183\"
seeds = [\"10.0.1.51:18183\", \"10.0.1.53:18183\"]
[replication]
factor = 2
EOF"
Step 4: Start new node2
ssh new-node2 "sudo systemctl start stemedb-api"
# Auto-joins cluster via SWIM
Step 5: Verify join and replication
# Check membership
curl http://node1:18181/cluster/members | jq '.members'
# Should show: node1, node2-replacement, node3
# Trigger replication to new node
curl -X POST http://node1:18181/cluster/sync \
-H "Content-Type: application/json" \
-d '{"target_nodes": ["node2-replacement"], "force": true}'
# Monitor
watch -n 5 'curl -s http://node1:18181/cluster/members | jq ".members[] | select(.id==\"node2-replacement\") | .assertion_count"'
Validate:
# Cluster healthy with 3 nodes
curl http://node1:18181/cluster/health
# Should return: {"status": "healthy", "quorum": true}
# New node2 has full data
curl http://new-node2:18180/v1/health | jq '.assertion_count'
# Should match node1 and node3
If failed: Replication not catching up → Check network bandwidth, disk I/O, Merkle sync logs.
Validation
After adding node, validate cluster health:
- Cluster members show new node
  curl http://node1:18181/cluster/members | jq '.members'
  # Should list all nodes with status "UP"
- Replication lag <1s
  curl http://node1:18180/metrics | grep replication_lag_seconds
  # All nodes should show <1.0
- Assertion counts match
  for node in node1 node2 node3; do
    echo "$node: $(curl -s http://$node:18180/v1/health | jq '.assertion_count')"
  done
  # All should be equal (±1 for in-flight writes)
- Queries work from new node
  curl -X POST http://new-node:18180/v1/query \
    -H "Content-Type: application/json" \
    -d '{"concept_path": "test/cluster", "lens": "recency"}'
  # Should return results
- Writes replicate to new node
  curl -X POST http://node1:18180/v1/assert \
    -H "Content-Type: application/json" \
    -d '{"concept_path": "test/new_node", "predicate": "validated", "value": true}'
  # Query from new node
  curl -X POST http://new-node:18180/v1/query \
    -H "Content-Type: application/json" \
    -d '{"concept_path": "test/new_node", "lens": "recency"}'
  # Should return the assertion
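The replication-lag check lends itself to automation because `grep` alone only prints values without judging them. The sketch below assumes `replication_lag_seconds` is exposed as a gauge in Prometheus text format with the value as the last field (no trailing timestamp); the label names in the usage are invented for illustration, and `lag_below` is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: read Prometheus text-format metrics on stdin and
# succeed only if at least one replication_lag_seconds sample exists and
# every sample is below MAX_SECONDS (default 1.0).
# Usage: curl -s http://node1:18180/metrics | lag_below [MAX_SECONDS]
lag_below() {
  awk -v max="${1:-1.0}" '
    /^#/ { next }                       # skip HELP/TYPE comment lines
    /replication_lag_seconds/ {
      seen = 1
      if ($NF + 0 >= max + 0) bad = 1   # value is assumed to be the last field
    }
    END { exit (seen && !bad) ? 0 : 1 }
  '
}
```

A missing metric is treated as a failure on purpose, so a node that exports nothing does not pass validation silently.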
Network Requirements
For cluster operation, ensure:
| Port | Protocol | Purpose | Required For |
|---|---|---|---|
| 18180 | TCP/HTTP | API queries | Client → Any node |
| 18181 | TCP/HTTP | Cluster gateway | Load balancer → Nodes |
| 18182 | TCP/gRPC | Cluster RPC (replication) | Node ↔ Node |
| 18183 | UDP | SWIM gossip (membership) | Node ↔ Node |
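The table can double as input for a connectivity check that uses the right protocol per port. This is a sketch, assuming an `nc` that supports `-z`, `-u`, and `-w` (OpenBSD netcat and recent Ncat do); `cluster_ports` and `check_cluster_ports` are hypothetical helpers. Note that a UDP probe only fails when an ICMP unreachable comes back, so "ok" on 18183 is weaker evidence than on the TCP ports.

```shell
#!/bin/sh
# Port, protocol, and purpose, copied from the table above.
cluster_ports() {
  cat <<'EOF'
18180 tcp HTTP-API
18181 tcp cluster-gateway
18182 tcp cluster-RPC
18183 udp SWIM-gossip
EOF
}

# Hypothetical helper: probe every cluster port on HOST and print one line
# per port. Returns non-zero if any probe fails.
# Usage: check_cluster_ports HOST
check_cluster_ports() {
  host=$1
  failed=0
  while read -r port proto name; do
    if [ "$proto" = udp ]; then flags=-zu; else flags=-z; fi
    # </dev/null stops nc from consuming the loop's own input.
    if nc -w 3 "$flags" "$host" "$port" </dev/null >/dev/null 2>&1; then
      echo "ok   $port/$proto $name"
    else
      echo "FAIL $port/$proto $name"
      failed=1
    fi
  done <<EOF
$(cluster_ports)
EOF
  return "$failed"
}
```

From a new node, `check_cluster_ports node1.example.com` would then report each port in turn.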
Firewall rules (AWS Security Group example):
# Allow cluster communication (node ↔ node)
aws ec2 authorize-security-group-ingress \
--group-id sg-xxx \
--source-group sg-xxx \
--protocol tcp \
--port 18180-18183
aws ec2 authorize-security-group-ingress \
--group-id sg-xxx \
--source-group sg-xxx \
--protocol udp \
--port 18183
# Allow client access (load balancer → nodes)
aws ec2 authorize-security-group-ingress \
--group-id sg-xxx \
--source-group sg-lb \
--protocol tcp \
--port 18180
Latency requirement: <5ms inter-node latency (same region/AZ required)
See: Network Requirements for full details.
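The <5ms latency requirement can be verified from `ping` output instead of by eye. The sketch below assumes the summary line printed by Linux iputils or BSD/macOS `ping` (`min/avg/max/...` followed by four slash-separated values); other implementations such as BusyBox use a different layout. `avg_rtt_ms` and `latency_ok` are hypothetical helpers.

```shell
#!/bin/sh
# Hypothetical helper: print the average RTT in ms from `ping -c N` output
# read on stdin. Splitting the summary line on "/" puts the average in
# field 5 for both "rtt min/avg/max/mdev = a/b/c/d ms" and
# "round-trip min/avg/max/stddev = a/b/c/d ms".
avg_rtt_ms() {
  awk -F'/' '/min\/avg\/max/ { print $5; exit }'
}

# Succeeds when the average RTT on stdin is below MAX_MS (default 5).
# Usage: ping -c 5 node1.example.com | latency_ok [MAX_MS]
latency_ok() {
  rtt=$(avg_rtt_ms)
  [ -n "$rtt" ] &&
    awk -v rtt="$rtt" -v max="${1:-5}" 'BEGIN { exit !(rtt + 0 < max + 0) }'
}
```

If no summary line is found (for example, when every packet is lost), `latency_ok` fails rather than passing on missing data.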
Load Balancer Configuration
After adding nodes, update load balancer:
Nginx example:
upstream stemedb_cluster {
    # Round-robin by default
    server 10.0.1.51:18180 weight=1; # node1
    server 10.0.1.52:18180 weight=1; # node2
    server 10.0.1.53:18180 weight=1; # node3
    # Health checks: the `check` directive comes from the third-party
    # nginx_upstream_check_module and is not part of stock nginx
    check interval=5000 rise=2 fall=3 timeout=3000;
}
server {
    listen 443 ssl;
    server_name stemedb.example.com;
    # ssl_certificate and ssl_certificate_key must also be set for this listener
    location / {
        proxy_pass http://stemedb_cluster;
        proxy_next_upstream error timeout http_502 http_503;
        proxy_connect_timeout 5s;
        proxy_send_timeout 30s;
        proxy_read_timeout 30s;
    }
}
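When nodes are added or replaced, the upstream list is the only part of the Nginx config that changes, so it can be regenerated from the current node IPs. A minimal sketch, assuming the upstream name and API port used above; `render_upstream` is a hypothetical helper and it emits only the `server` lines, not health-check directives.

```shell
#!/bin/sh
# Hypothetical helper: print an nginx upstream block with one server line
# per node IP, all on the HTTP API port 18180.
# Usage: render_upstream NODE_IP [NODE_IP...]
render_upstream() {
  echo "upstream stemedb_cluster {"
  for ip in "$@"; do
    echo "    server $ip:18180 weight=1;"
  done
  echo "}"
}
```

One way to use it is to write the output to an included file, run `nginx -t` to validate the configuration, and reload only if validation passes.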
Envoy example:
clusters:
  - name: stemedb_cluster
    type: STRICT_DNS
    load_assignment:
      cluster_name: stemedb_cluster
      endpoints:
        - lb_endpoints:
            - endpoint:
                address:
                  socket_address:
                    address: node1.example.com
                    port_value: 18180
            - endpoint:
                address:
                  socket_address:
                    address: node2.example.com
                    port_value: 18180
            - endpoint:
                address:
                  socket_address:
                    address: node3.example.com
                    port_value: 18180
    health_checks:
      - timeout: 3s
        interval: 5s
        unhealthy_threshold: 3
        healthy_threshold: 2
        http_health_check:
          path: "/v1/health"
Cluster Sizing Guidelines
From Resource Sizing Guide:
| Assertions | Nodes | Replication Factor | RTO | RPO |
|---|---|---|---|---|
| <10K | 1 | N/A | 2hr | 24hr |
| <100K | 3 | 2 | 5min | 1min |
| <1M | 5 | 3 | 1min | 10s |
When to add nodes:
- Query latency p99 >1s (capacity)
- Disk usage >80% (storage)
- CPU sustained >70% (compute)
- Planning for HA (minimum 3 nodes)
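The three numeric triggers above can be evaluated together in one check. This sketch assumes the caller supplies integer values that are already aggregated (p99 latency in milliseconds, disk and sustained CPU as whole percentages); `scale_reasons` is a hypothetical helper and does not collect the metrics itself.

```shell
#!/bin/sh
# Hypothetical helper: print one line per threshold that is exceeded.
# Prints nothing when no trigger fires.
# Usage: scale_reasons P99_LATENCY_MS DISK_USED_PCT CPU_SUSTAINED_PCT
scale_reasons() {
  if [ "$1" -gt 1000 ]; then
    echo "capacity: query latency p99 ${1}ms > 1000ms"
  fi
  if [ "$2" -gt 80 ]; then
    echo "storage: disk usage ${2}% > 80%"
  fi
  if [ "$3" -gt 70 ]; then
    echo "compute: sustained CPU ${3}% > 70%"
  fi
  return 0
}
```

For example, `scale_reasons 1500 85 50` prints the capacity and storage lines, while `scale_reasons 400 60 50` prints nothing.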
Related Documentation
- Three-Node Cluster Architecture - Deployment guide
- Network Requirements - Firewall rules
- High Query Latency - Shard rebalancing
- Resource Sizing - Capacity planning
Future Enhancements
Roadmap P6.3 (Automatic Shard Rebalancing):
- Auto-detect when new node joins
- Automatically rebalance shards for even distribution
- No manual shards/rebalance API calls needed
Roadmap P6.4 (WAL Archival to S3):
- Replicate WAL segments to S3 for durability
- Reduce local disk requirements
- Enable faster node replacement (restore from S3)
Last Updated
2026-02-11