# Runbook: Add Node to Cluster

## Symptom

- Need to scale from single-node to 3-node cluster
- Need to add capacity to existing cluster
- Need to replace failed node
- Planning horizontal scaling

---

## Quick Diagnosis

```
Need to add node
    │
    ├─► Currently single-node?
    │   └─► §1 Bootstrap 3-Node Cluster
    │
    ├─► Existing 3-node cluster, need more capacity?
    │   └─► §2 Add Node to Existing Cluster
    │
    ├─► Node failed, need replacement?
    │   └─► §3 Replace Failed Node
    │
    └─► Planning scaling strategy?
        └─► See Reference Architectures
```

---

## Prerequisites

**Before adding a node:**

- [ ] **Network connectivity:**
  ```bash
  # From the new node, ping existing nodes (-c 5 sends five probes, then exits)
  ping -c 5 node1.example.com
  ping -c 5 node2.example.com
  # Average RTT should be <5ms (same region required)
  ```

- [ ] **Ports open:**
  ```bash
  # Test connectivity to cluster ports
  nc -zv node1.example.com 18180  # HTTP API
  nc -zv node1.example.com 18181  # Cluster Gateway
  nc -zv node1.example.com 18182  # Cluster RPC
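  nc -zvu node1.example.com 18183  # SWIM Gossip (UDP, so -u is required)
  # TCP checks should all succeed; the UDP check is best-effort, because nc
  # cannot always tell an open UDP port from a filtered one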
  ```

- [ ] **StemeDB installed on new node:**
  ```bash
  # Verify binary
  which stemedb-api
  # Should return: /usr/local/bin/stemedb-api (or installation path)
  ```

- [ ] **Disk space sufficient:**
  ```bash
  df -h /data
  # Should have >50GB available for pilot
  ```

- [ ] **Cluster healthy (if existing):**
  ```bash
  curl http://node1:18180/v1/health
  # Should return: {"status": "healthy", ...}
  ```
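
The latency and disk checks above can be scripted so that a preflight run fails loudly instead of relying on someone reading the output. The following is a minimal sketch, assuming the Linux iputils `ping` summary line (`rtt min/avg/max/mdev = ...`) and POSIX `df -Pk` output. The helper names `avg_rtt_ms`, `latency_ok`, and `disk_ok` are hypothetical and not part of StemeDB.

```bash
# Hypothetical preflight helpers (not part of StemeDB).

# Read `ping -c N host` output on stdin and print the average RTT in ms.
# Assumes the iputils summary line: "rtt min/avg/max/mdev = a/b/c/d ms".
avg_rtt_ms() {
  awk -F'[=/ ]+' '/min\/avg\/max/ { print $(NF-3) }'
}

# Succeed if the average RTT ($1, ms) is below the limit ($2, ms; default 5).
latency_ok() {
  [ -n "$1" ] || return 1
  awk -v rtt="$1" -v max="${2:-5}" 'BEGIN { exit !(rtt + 0 < max + 0) }'
}

# Read `df -Pk <path>` output on stdin; succeed if the available space
# exceeds the minimum in GB ($1, default 50). Column 4 is "Available" in KB.
disk_ok() {
  awk -v min_gb="${1:-50}" \
    'NR == 2 { ok = ($4 / 1048576 > min_gb) } END { exit !ok }'
}

# Usage (on the new node):
#   rtt=$(ping -c 5 node1.example.com | avg_rtt_ms)
#   latency_ok "$rtt" 5 && df -Pk /data | disk_ok 50 && echo "preflight OK"
```

Both parsers read from stdin, so they can be exercised against saved command output before being pointed at a live node.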

---

## Resolution Steps

### §1. Bootstrap 3-Node Cluster (From Single-Node)

**Use case:** Migrating from single-node pilot to 3-node production cluster

**Diagnostic:**
```bash
# Check current single-node state
curl http://localhost:18180/v1/health

# Note assertion_count for validation later
ASSERTION_COUNT=$(curl -s http://localhost:18180/v1/health | jq '.assertion_count')
echo "Current assertions: $ASSERTION_COUNT"

# Verify no cluster config
curl http://localhost:18180/metrics | grep cluster_members
# Should return empty (single-node)
```

**Resolution: Step-by-step cluster bootstrap**

**Step 1: Provision 2 new nodes**

```bash
# AWS example: Launch 2 instances matching current node specs
aws ec2 run-instances \
  --image-id ami-xxx \
  --instance-type t3.large \
  --count 2 \
  --subnet-id subnet-xxx \
  --security-group-ids sg-xxx \
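  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=stemedb-node}]'

# Tag keys must be unique per resource, and --count 2 applies the same tags to
# both instances. Rename each one afterwards (instance IDs are placeholders):
aws ec2 create-tags --resources i-xxx --tags Key=Name,Value=stemedb-node2
aws ec2 create-tags --resources i-yyy --tags Key=Name,Value=stemedb-node3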

# Note instance IDs and private IPs
NODE2_IP="10.0.1.52"
NODE3_IP="10.0.1.53"
```

**Step 2: Install StemeDB on new nodes**

```bash
# SSH to node2
ssh ubuntu@$NODE2_IP

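# Install StemeDB (same version as node1!)
# -f makes curl fail on an HTTP error instead of saving the error page as the binary
sudo curl -fL https://github.com/yourorg/stemedb/releases/download/v0.1.0/stemedb-api -o /usr/local/bin/stemedb-api
sudo chmod +x /usr/local/bin/stemedb-api

# Create data and config directories (assumes the stemedb user and group already exist)
sudo mkdir -p /data/{wal,db} /etc/stemedb
sudo chown -R stemedb:stemedb /data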

# Repeat for node3
```

**Step 3: Configure cluster on all nodes**

```bash
# Node 1 (existing): Enable cluster mode
cat <<EOF | sudo tee /etc/stemedb/cluster.toml
[cluster]
enabled = true
node_id = "node1"
bind_addr = "10.0.1.51:18181"  # Node1 IP
rpc_addr = "10.0.1.51:18182"
swim_addr = "10.0.1.51:18183"

# Seed nodes for discovery
seeds = [
  "10.0.1.52:18183",  # Node2
  "10.0.1.53:18183"   # Node3
]

[replication]
factor = 2  # Replicate each assertion to 2 nodes
EOF

# Node 2: Same config with node2 IPs (assumes `node2` resolves, e.g. via ~/.ssh/config)
ssh node2 "cat <<EOF | sudo tee /etc/stemedb/cluster.toml
[cluster]
enabled = true
node_id = \"node2\"
bind_addr = \"10.0.1.52:18181\"
rpc_addr = \"10.0.1.52:18182\"
swim_addr = \"10.0.1.52:18183\"
seeds = [\"10.0.1.51:18183\", \"10.0.1.53:18183\"]
[replication]
factor = 2
EOF"

# Node 3: Similar config with node3 IPs
ssh node3 "cat <<EOF | sudo tee /etc/stemedb/cluster.toml
[cluster]
enabled = true
node_id = \"node3\"
bind_addr = \"10.0.1.53:18181\"
rpc_addr = \"10.0.1.53:18182\"
swim_addr = \"10.0.1.53:18183\"
seeds = [\"10.0.1.51:18183\", \"10.0.1.52:18183\"]
[replication]
factor = 2
EOF"
```
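
The three configs in Step 3 differ only in node ID, IP, and seed list, so they can be generated from one template to avoid copy-paste drift. The following is a minimal sketch, assuming the same keys and ports shown above; `render_cluster_toml` is a hypothetical helper, not a StemeDB tool.

```bash
# Hypothetical helper: print a cluster.toml for one node.
#   $1 = node ID, $2 = this node's IP, remaining args = IPs of all cluster nodes
# The node's own IP is skipped when building the seed list.
render_cluster_toml() {
  node_id=$1
  ip=$2
  shift 2
  seeds=""
  for peer in "$@"; do
    if [ "$peer" != "$ip" ]; then
      seeds="${seeds:+$seeds, }\"$peer:18183\""
    fi
  done
  cat <<EOF
[cluster]
enabled = true
node_id = "$node_id"
bind_addr = "$ip:18181"
rpc_addr = "$ip:18182"
swim_addr = "$ip:18183"
seeds = [$seeds]

[replication]
factor = 2
EOF
}

# Usage:
#   render_cluster_toml node2 10.0.1.52 10.0.1.51 10.0.1.52 10.0.1.53 \
#     | ssh node2 "sudo tee /etc/stemedb/cluster.toml"
```

Passing the full node list to every call keeps the invocations identical apart from the first two arguments.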

**Step 4: Start new nodes first (empty data)**

```bash
# Start node2
ssh node2 "sudo systemctl start stemedb-api"

# Start node3
ssh node3 "sudo systemctl start stemedb-api"

# Verify startup
ssh node2 "curl http://localhost:18180/v1/health"
ssh node3 "curl http://localhost:18180/v1/health"
# Both should return: {"status": "healthy", "assertion_count": 0}
```

**Step 5: Restart node1 with cluster config**

```bash
# Restart node1 to join cluster
sudo systemctl restart stemedb-api

# Wait for SWIM gossip to converge (~10 seconds)
sleep 15
```

**Step 6: Verify cluster formation**

```bash
# Check cluster membership from any node
curl http://localhost:18181/cluster/members | jq '.'

# Expected output:
# {
#   "members": [
#     {"id": "node1", "status": "UP", "assertion_count": 10234},
#     {"id": "node2", "status": "UP", "assertion_count": 0},
#     {"id": "node3", "status": "UP", "assertion_count": 0}
#   ]
# }

# Check replication status
curl http://localhost:18180/metrics | grep replication_lag_seconds
# All nodes should show <1s lag
```

**Step 7: Trigger initial replication**

```bash
# Manually trigger Merkle sync to populate node2 and node3
curl -X POST http://localhost:18181/cluster/sync \
  -H "Content-Type: application/json" \
  -d '{"target_nodes": ["node2", "node3"], "force": true}'

# Monitor replication progress
watch -n 5 'curl -s http://localhost:18181/cluster/members | jq ".members[] | {id, assertion_count}"'

# Wait for node2 and node3 to reach same assertion_count as node1
# (Typically 1-5 minutes for <100K assertions)
```
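
Instead of watching the counts by hand, the wait in Step 7 can be expressed as a bounded polling loop that exits once a node catches up or gives up after a fixed number of attempts. The following is a sketch under assumptions: `wait_for_count` and `node2_count` are hypothetical helpers, and the `/v1/health` call in the usage comment is the endpoint already used in this runbook.

```bash
# Hypothetical helper: poll until a node's assertion count reaches a target.
#   $1 = target count
#   $2 = command that prints the current count
#   $3 = max attempts (default 60), $4 = seconds between attempts (default 5)
wait_for_count() {
  target=$1
  fetch=$2
  tries=${3:-60}
  interval=${4:-5}
  i=0
  while [ "$i" -lt "$tries" ]; do
    count=$($fetch 2>/dev/null) || count=""
    case $count in
      '' | *[!0-9]*) ;;  # empty or non-numeric: treat as "not ready yet"
      *)
        if [ "$count" -ge "$target" ]; then
          echo "synced: $count"
          return 0
        fi
        ;;
    esac
    sleep "$interval"
    i=$((i + 1))
  done
  echo "timed out waiting for $target assertions" >&2
  return 1
}

# Usage:
#   node2_count() { curl -s http://node2:18180/v1/health | jq '.assertion_count'; }
#   wait_for_count "$ASSERTION_COUNT" node2_count
```

A non-zero exit status on timeout lets the bootstrap script stop before the validation step rather than validating a half-synced cluster.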

**Validate cluster:**
```bash
# All nodes should have same assertion count
curl http://node1:18180/v1/health | jq '.assertion_count'
curl http://node2:18180/v1/health | jq '.assertion_count'
curl http://node3:18180/v1/health | jq '.assertion_count'
# All should match original count

# Test writes hit multiple nodes
curl -X POST http://localhost:18180/v1/assert \
  -H "Content-Type: application/json" \
  -d '{"concept_path": "test/cluster", "predicate": "replicated", "value": true}'

# Query from different nodes
curl -X POST http://node2:18180/v1/query \
  -H "Content-Type: application/json" \
  -d '{"concept_path": "test/cluster", "lens": "recency"}'
# Should return the assertion just written
```

**If failed:** Cluster won't form → Check firewall rules, SWIM gossip logs, network connectivity.

---

### §2. Add Node to Existing Cluster

**Use case:** Scaling existing 3-node cluster to 4+ nodes

⚠️ **NOTE:** Pilot 5 supports 3-node clusters. Clusters of 4+ nodes are a roadmap item (P6); the procedure below is future-ready.

**Diagnostic:**
```bash
# Check current cluster state
curl http://node1:18181/cluster/members | jq '.members | length'
# Should return: 3

# Check cluster health
curl http://node1:18181/cluster/health
# Should return: {"status": "healthy", "quorum": true}
```

**Resolution: Add node4**

**Step 1: Provision new node**
```bash
# (Same as §1 Step 1)
NODE4_IP="10.0.1.54"
```

**Step 2: Install StemeDB on node4**
```bash
# (Same as §1 Step 2)
```

**Step 3: Configure node4**
```bash
ssh node4 "cat <<EOF | sudo tee /etc/stemedb/cluster.toml
[cluster]
enabled = true
node_id = \"node4\"
bind_addr = \"10.0.1.54:18181\"
rpc_addr = \"10.0.1.54:18182\"
swim_addr = \"10.0.1.54:18183\"

# Point to existing cluster for discovery
seeds = [
  \"10.0.1.51:18183\",  # Node1
  \"10.0.1.52:18183\",  # Node2
  \"10.0.1.53:18183\"   # Node3
]

[replication]
factor = 2
EOF"
```

**Step 4: Start node4**
```bash
ssh node4 "sudo systemctl start stemedb-api"

# SWIM gossip will auto-discover existing cluster
# No restart of existing nodes required!
```

**Step 5: Verify join**
```bash
# Check cluster membership
curl http://node1:18181/cluster/members | jq '.members | length'
# Should return: 4

# Check node4 status
curl http://node1:18181/cluster/members | jq '.members[] | select(.id=="node4")'
# Should show: {"id": "node4", "status": "UP", "assertion_count": 0}
```

**Step 6: Rebalance shards (manual for Pilot 5)**

⚠️ **NOTE:** Automatic rebalancing is a roadmap item (P6.3). Until then, rebalancing is a manual process.

```bash
# View current shard assignment
curl http://node1:18181/cluster/shards | jq '.'

# Identify shards to move to node4
# (Typically about 25% of the shards on each of node1, node2, and node3,
# so that node4 ends up with roughly a quarter of the total)

# Move shard (example)
curl -X POST http://node1:18181/admin/shards/rebalance \
  -H "Content-Type: application/json" \
  -d '{
    "shard_id": "shard-abc123",
    "target_node": "node4",
    "reason": "add_capacity"
  }'

# Monitor rebalance progress
watch -n 5 'curl -s http://node1:18181/cluster/shards | jq ".shards[] | select(.id==\"shard-abc123\") | .rebalance_status"'

# Repeat for other shards until balanced
```
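
The 25% figure in Step 6 follows from aiming for an even split: when a cluster grows from 3 to 4 nodes, each existing node gives up (4 - 3) / 4 of its own shards. The sketch below generalizes that arithmetic, assuming shards are currently spread evenly. `shards_to_move` is a hypothetical helper that rounds down.

```bash
# Hypothetical helper: how many shards an existing node should hand over.
#   $1 = shards currently on the node
#   $2 = node count before the change, $3 = node count after
shards_to_move() {
  echo $(( $1 * ($3 - $2) / $3 ))
}

# Usage: three nodes with 12 shards each, growing to four nodes
#   shards_to_move 12 3 4   # each existing node moves 3, so node4 receives 9 of 36
```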

**Validate:**
```bash
# All nodes should have similar assertion counts
curl http://node1:18181/cluster/members | jq '.members[] | {id, assertion_count}'

# Test query hits node4
curl -X POST http://node4:18180/v1/query \
  -H "Content-Type: application/json" \
  -d '{"concept_path": "test/node4", "lens": "recency"}'
# Should succeed
```

**If failed:** Node4 won't join → Check seed node IPs, firewall rules, SWIM logs.

---

### §3. Replace Failed Node

**Use case:** Node2 has failed (hardware or software fault) and needs a replacement

**Diagnostic:**
```bash
# Check cluster status
curl http://node1:18181/cluster/members | jq '.members[] | select(.status != "UP")'

# Expected output:
# {
#   "id": "node2",
#   "status": "DOWN",
#   "last_seen": "2026-02-11T10:15:00Z"
# }

# Check replication status
curl http://node1:18180/metrics | grep replication_lag_seconds
# May show elevated lag to node2
```

**Resolution: Replace node2**

**Step 1: Remove failed node from cluster**
```bash
# Gracefully remove node2 (allows rebalancing)
curl -X POST http://node1:18181/admin/cluster/remove \
  -H "Content-Type: application/json" \
  -d '{"node_id": "node2", "force": false}'

# Wait for shards to rebalance to node1 and node3
# (Typically 5-15 minutes for <100K assertions)

watch -n 10 'curl -s http://node1:18181/cluster/members | jq .members'
# node2 should disappear from list
```

**Step 2: Provision new node2**
```bash
# Launch new instance
NEW_NODE2_IP="10.0.1.55"  # May be different IP
```

**Step 3: Configure new node2**
```bash
# (Same as §1 Step 3, using new IP)
ssh new-node2 "cat <<EOF | sudo tee /etc/stemedb/cluster.toml
[cluster]
enabled = true
node_id = \"node2-replacement\"  # Different ID
bind_addr = \"10.0.1.55:18181\"
rpc_addr = \"10.0.1.55:18182\"
swim_addr = \"10.0.1.55:18183\"
seeds = [\"10.0.1.51:18183\", \"10.0.1.53:18183\"]
[replication]
factor = 2
EOF"
```

**Step 4: Start new node2**
```bash
ssh new-node2 "sudo systemctl start stemedb-api"

# Auto-joins cluster via SWIM
```

**Step 5: Verify join and replication**
```bash
# Check membership
curl http://node1:18181/cluster/members | jq '.members'
# Should show: node1, node2-replacement, node3

# Trigger replication to new node
curl -X POST http://node1:18181/cluster/sync \
  -H "Content-Type: application/json" \
  -d '{"target_nodes": ["node2-replacement"], "force": true}'

# Monitor
watch -n 5 'curl -s http://node1:18181/cluster/members | jq ".members[] | select(.id==\"node2-replacement\") | .assertion_count"'
```

**Validate:**
```bash
# Cluster healthy with 3 nodes
curl http://node1:18181/cluster/health
# Should return: {"status": "healthy", "quorum": true}

# New node2 has full data
curl http://new-node2:18180/v1/health | jq '.assertion_count'
# Should match node1 and node3
```

**If failed:** Replication not catching up → Check network bandwidth, disk I/O, Merkle sync logs.

---

## Validation

After adding a node, validate cluster health:

- [ ] **Cluster members show new node**
  ```bash
  curl http://node1:18181/cluster/members | jq '.members'
  # Should list all nodes with status "UP"
  ```

- [ ] **Replication lag <1s**
  ```bash
  curl http://node1:18180/metrics | grep replication_lag_seconds
  # All nodes should show <1.0
  ```

- [ ] **Assertion counts match**
  ```bash
  # Adjust the list to the current cluster members, including the newly added node
  for node in node1 node2 node3; do
    echo "$node: $(curl -s http://$node:18180/v1/health | jq '.assertion_count')"
  done
  # All should be equal (±1 for in-flight writes)
  ```

- [ ] **Queries work from new node**
  ```bash
  curl -X POST http://new-node:18180/v1/query \
    -H "Content-Type: application/json" \
    -d '{"concept_path": "test/cluster", "lens": "recency"}'
  # Should return results
  ```

- [ ] **Writes replicate to new node**
  ```bash
  curl -X POST http://node1:18180/v1/assert \
    -H "Content-Type: application/json" \
    -d '{"concept_path": "test/new_node", "predicate": "validated", "value": true}'

  # Query from new node
  curl -X POST http://new-node:18180/v1/query \
    -H "Content-Type: application/json" \
    -d '{"concept_path": "test/new_node", "lens": "recency"}'
  # Should return the assertion
  ```
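
The "assertion counts match" check above allows a difference of ±1 for in-flight writes, which is easy to misjudge by eye. The following is a small sketch of that comparison; `counts_match` is a hypothetical helper that takes the tolerance followed by the counts collected from each node.

```bash
# Hypothetical helper: succeed if all counts are within a tolerance of each other.
# Usage: counts_match TOLERANCE COUNT...
counts_match() {
  tol=$1
  shift
  [ "$#" -gt 0 ] || return 1
  min=$1
  max=$1
  for c in "$@"; do
    if [ "$c" -lt "$min" ]; then min=$c; fi
    if [ "$c" -gt "$max" ]; then max=$c; fi
  done
  [ $((max - min)) -le "$tol" ]
}

# Usage (node names are whatever the cluster currently contains):
#   counts_match 1 $(for n in node1 node2 node3; do
#     curl -s "http://$n:18180/v1/health" | jq '.assertion_count'
#   done) && echo "counts match"
```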

---

## Network Requirements

**For cluster operation, ensure:**

| Port | Protocol | Purpose | Required For |
|------|----------|---------|--------------|
| **18180** | TCP/HTTP | API queries | Client → Any node |
| **18181** | TCP/HTTP | Cluster gateway | Load balancer → Nodes |
| **18182** | TCP/gRPC | Cluster RPC (replication) | Node ↔ Node |
| **18183** | UDP | SWIM gossip (membership) | Node ↔ Node |

**Firewall rules (AWS Security Group example):**
```bash
# Allow cluster communication (node ↔ node)
aws ec2 authorize-security-group-ingress \
  --group-id sg-xxx \
  --source-group sg-xxx \
  --protocol tcp \
  --port 18180-18183

aws ec2 authorize-security-group-ingress \
  --group-id sg-xxx \
  --source-group sg-xxx \
  --protocol udp \
  --port 18183

# Allow client access (load balancer → nodes)
aws ec2 authorize-security-group-ingress \
  --group-id sg-xxx \
  --source-group sg-lb \
  --protocol tcp \
  --port 18180
```

**Latency requirement:** <5ms inter-node latency (same region/AZ required)

**See:** [Network Requirements](../reference-architecture/network-requirements.md) for full details.

---

## Load Balancer Configuration

**After adding nodes, update load balancer:**

**Nginx example:**
```nginx
upstream stemedb_cluster {
    # Round-robin by default
    server 10.0.1.51:18180 weight=1;  # node1
    server 10.0.1.52:18180 weight=1;  # node2
    server 10.0.1.53:18180 weight=1;  # node3

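    # Active health checks. The `check` directive is not part of stock NGINX;
    # it requires the third-party nginx_upstream_check_module (bundled with Tengine).
    check interval=5000 rise=2 fall=3 timeout=3000;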
}

server {
    listen 443 ssl;
    # ssl_certificate and ssl_certificate_key are also required for an ssl listener
    server_name stemedb.example.com;

    location / {
        proxy_pass http://stemedb_cluster;
        proxy_next_upstream error timeout http_502 http_503;
        proxy_connect_timeout 5s;
        proxy_send_timeout 30s;
        proxy_read_timeout 30s;
    }
}
```
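
Because the upstream list has to change every time a node is added or replaced, it can help to generate it from the node IPs instead of editing it by hand. The following is a minimal sketch; `render_upstream` is a hypothetical helper, the output path in the usage comment is an assumed location, and the health-check directive from the example above is deliberately left out.

```bash
# Hypothetical helper: print an upstream block for the given node IPs.
render_upstream() {
  echo "upstream stemedb_cluster {"
  for ip in "$@"; do
    echo "    server $ip:18180 weight=1;"
  done
  echo "}"
}

# Usage (the conf.d path is an assumption; adjust to your NGINX layout):
#   render_upstream 10.0.1.51 10.0.1.52 10.0.1.53 10.0.1.54 \
#     | sudo tee /etc/nginx/conf.d/stemedb_upstream.conf
#   sudo nginx -t && sudo nginx -s reload
```

Running `nginx -t` before the reload means a malformed upstream block is rejected without disturbing the running configuration.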

**Envoy example:**
```yaml
clusters:
  - name: stemedb_cluster
    type: STRICT_DNS
    load_assignment:
      cluster_name: stemedb_cluster
      endpoints:
        - lb_endpoints:
          - endpoint:
              address:
                socket_address:
                  address: node1.example.com
                  port_value: 18180
          - endpoint:
              address:
                socket_address:
                  address: node2.example.com
                  port_value: 18180
          - endpoint:
              address:
                socket_address:
                  address: node3.example.com
                  port_value: 18180
    health_checks:
      - timeout: 3s
        interval: 5s
        unhealthy_threshold: 3
        healthy_threshold: 2
        http_health_check:
          path: "/v1/health"
```

---

## Cluster Sizing Guidelines

**From [Resource Sizing Guide](../reference-architecture/resource-sizing.md):**

| Assertions | Nodes | Replication Factor | RTO | RPO |
|-----------|-------|-------------------|-----|-----|
| <10K | 1 | N/A | 2hr | 24hr |
| <100K | 3 | 2 | 5min | 1min |
| <1M | 5 | 3 | 1min | 10s |

**When to add nodes:**
- Query latency p99 >1s (capacity)
- Disk usage >80% (storage)
- CPU sustained >70% (compute)
- Planning for HA (minimum 3 nodes)
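
The numeric triggers above can be turned into a simple check for a monitoring cron job or alert hook. The following is a sketch, assuming the three values are already available as whole numbers from your metrics system; `scaling_triggers` is a hypothetical helper, and how the inputs are collected is left out.

```bash
# Hypothetical helper: print each scaling trigger that is exceeded.
# Exit status is 0 if at least one trigger fired, 1 otherwise.
#   $1 = query latency p99 in ms, $2 = disk usage %, $3 = sustained CPU %
scaling_triggers() {
  fired=1
  if [ "$1" -gt 1000 ]; then echo "capacity: p99 ${1}ms > 1000ms"; fired=0; fi
  if [ "$2" -gt 80 ]; then echo "storage: disk ${2}% > 80%"; fired=0; fi
  if [ "$3" -gt 70 ]; then echo "compute: cpu ${3}% > 70%"; fired=0; fi
  return "$fired"
}

# Usage:
#   scaling_triggers 1200 85 40 && echo "consider adding a node"
```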

---

## Related Documentation

- [Three-Node Cluster Architecture](../reference-architecture/three-node-cluster.md) - Deployment guide
- [Network Requirements](../reference-architecture/network-requirements.md) - Firewall rules
- [High Query Latency](./high-query-latency.md) - Shard rebalancing
- [Resource Sizing](../reference-architecture/resource-sizing.md) - Capacity planning

---

## Future Enhancements

**Roadmap P6.3 (Automatic Shard Rebalancing):**
- Auto-detect when new node joins
- Automatically rebalance shards for even distribution
- No manual `shards/rebalance` API calls needed

**Roadmap P6.4 (WAL Archival to S3):**
- Replicate WAL segments to S3 for durability
- Reduce local disk requirements
- Enable faster node replacement (restore from S3)

---

## Last Updated

2026-02-11