# Runbook: Add Node to Cluster

## Symptom

- Need to scale from single-node to 3-node cluster
- Need to add capacity to existing cluster
- Need to replace failed node
- Planning horizontal scaling

---

## Quick Diagnosis

```
Need to add node
│
├─► Currently single-node?
│   └─► §1 Bootstrap 3-Node Cluster
│
├─► Existing 3-node cluster, need more capacity?
│   └─► §2 Add Node to Existing Cluster
│
├─► Node failed, need replacement?
│   └─► §3 Replace Failed Node
│
└─► Planning scaling strategy?
    └─► See Reference Architectures
```

---

## Prerequisites

**Before adding a node:**

- [ ] **Network connectivity:**
  ```bash
  # From new node, ping existing nodes
  ping node1.example.com
  ping node2.example.com
  # Should show <5ms latency (same region required)
  ```

- [ ] **Ports open:**
  ```bash
  # Test connectivity to cluster ports
  nc -zv node1.example.com 18180  # HTTP API
  nc -zv node1.example.com 18181  # Cluster Gateway
  nc -zv node1.example.com 18182  # Cluster RPC
  nc -zv node1.example.com 18183  # SWIM Gossip
  # All should succeed
  ```

- [ ] **StemeDB installed on new node:**
  ```bash
  # Verify binary
  which stemedb-api
  # Should return: /usr/local/bin/stemedb-api (or installation path)
  ```

- [ ] **Disk space sufficient:**
  ```bash
  df -h /data
  # Should have >50GB available for pilot
  ```

- [ ] **Cluster healthy (if existing):**
  ```bash
  curl http://node1:18180/v1/health
  # Should return: {"status": "healthy", ...}
  ```

---

## Resolution Steps

### §1. Bootstrap 3-Node Cluster (From Single-Node)

**Use case:** Migrating from single-node pilot to 3-node production cluster

**Diagnostic:**

```bash
# Check current single-node state
curl http://localhost:18180/v1/health

# Note assertion_count for validation later
ASSERTION_COUNT=$(curl -s http://localhost:18180/v1/health | jq '.assertion_count')
echo "Current assertions: $ASSERTION_COUNT"

# Verify no cluster config
curl http://localhost:18180/metrics | grep cluster_members
# Should return empty (single-node)
```

**Resolution: Step-by-step cluster bootstrap**

**Step 1: Provision 2 new nodes**

```bash
# AWS example: Launch 2 instances matching current node specs
# With --count 2 both instances receive the same tags, and EC2 rejects
# duplicate tag keys, so launch with one shared Name tag and rename the
# instances to stemedb-node2 / stemedb-node3 afterwards.
aws ec2 run-instances \
  --image-id ami-xxx \
  --instance-type t3.large \
  --count 2 \
  --subnet-id subnet-xxx \
  --security-group-ids sg-xxx \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=stemedb-node}]'

# Note instance IDs and private IPs
NODE2_IP="10.0.1.52"
NODE3_IP="10.0.1.53"
```

**Step 2: Install StemeDB on new nodes**

```bash
# SSH to node2
ssh ubuntu@$NODE2_IP

# Install StemeDB (same version as node1!)
sudo curl -L https://github.com/yourorg/stemedb/releases/download/v0.1.0/stemedb-api -o /usr/local/bin/stemedb-api
sudo chmod +x /usr/local/bin/stemedb-api

# Create data directories
sudo mkdir -p /data/{wal,db}
sudo chown -R stemedb:stemedb /data

# Repeat for node3
```

**Step 3: Configure cluster on all nodes**

```bash
# Node 1 (existing): Enable cluster mode
cat <
```

[…]

- […] 1s (capacity)
- Disk usage >80% (storage)
- CPU sustained >70% (compute)
- Planning for HA (minimum 3 nodes)

---

## Related Documentation

- [Three-Node Cluster Architecture](../reference-architecture/three-node-cluster.md) - Deployment guide
- [Network Requirements](../reference-architecture/network-requirements.md) - Firewall rules
- [High Query Latency](./high-query-latency.md) - Shard rebalancing
- [Resource Sizing](../reference-architecture/resource-sizing.md) - Capacity planning

---

## Future Enhancements

**Roadmap P6.3 (Automatic Shard Rebalancing):**

- Auto-detect when new node joins
- Automatically rebalance shards for even distribution
- No manual `shards/rebalance` API calls needed

**Roadmap P6.4 (WAL Archival to S3):**

- Replicate WAL segments to S3 for durability
- Reduce local disk requirements
- Enable faster node replacement (restore from S3)

---

## Last Updated

2026-02-11
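
The four port checks listed under Prerequisites can be wrapped in a small helper so that one command reports every closed port. The sketch below is an illustration only: the function names `probe` and `check_ports` are hypothetical, the 3-second `nc` timeout is an assumed value, and the port numbers and labels are the ones from the Prerequisites list.

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight helper (not part of StemeDB).

# probe HOST PORT: succeeds if a TCP connection to HOST:PORT can be opened.
# The 3-second timeout (-w 3) is an assumed value.
probe() {
  nc -z -w 3 "$1" "$2" >/dev/null 2>&1
}

# check_ports HOST: prints one OK/FAIL line per cluster port and
# returns non-zero if at least one port is unreachable.
check_ports() {
  local host="$1" failed=0 entry port label
  for entry in "18180:HTTP API" "18181:Cluster Gateway" \
               "18182:Cluster RPC" "18183:SWIM Gossip"; do
    port="${entry%%:*}"    # text before the first colon
    label="${entry#*:}"    # text after the first colon
    if probe "$host" "$port"; then
      echo "OK   $host:$port ($label)"
    else
      echo "FAIL $host:$port ($label)"
      failed=1
    fi
  done
  return "$failed"
}
```

Run from the new node, `check_ports node1.example.com` prints one line per port and returns non-zero when any port is closed, so it can gate a provisioning script before the join is attempted.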