stemedb/docs/operations/runbooks/add-node.md

Runbook: Add Node to Cluster

Symptom

  • Need to scale from single-node to 3-node cluster
  • Need to add capacity to existing cluster
  • Need to replace failed node
  • Planning horizontal scaling

Quick Diagnosis

Need to add node
    │
    ├─► Currently single-node?
    │   └─► §1 Bootstrap 3-Node Cluster
    │
    ├─► Existing 3-node cluster, need more capacity?
    │   └─► §2 Add Node to Existing Cluster
    │
    ├─► Node failed, need replacement?
    │   └─► §3 Replace Failed Node
    │
    └─► Planning scaling strategy?
        └─► See Reference Architectures

Prerequisites

Before adding node:

  • Network connectivity:

    # From new node, ping existing nodes (-c 4 stops after four probes)
    ping -c 4 node1.example.com
    ping -c 4 node2.example.com
    # Should show <5ms latency (same region required)
    
  • Ports open:

    # Test connectivity to cluster ports
    nc -zv node1.example.com 18180  # HTTP API
    nc -zv node1.example.com 18181  # Cluster Gateway
    nc -zv node1.example.com 18182  # Cluster RPC
    nc -zuv node1.example.com 18183 # SWIM Gossip (UDP, so -u is needed)
    # TCP checks should all succeed; a UDP probe can only detect an explicit rejection
    
  • StemeDB installed on new node:

    # Verify binary
    which stemedb-api
    # Should return: /usr/local/bin/stemedb-api (or installation path)
    
  • Disk space sufficient:

    df -h /data
    # Should have >50GB available for pilot
    
  • Cluster healthy (if existing):

    curl http://node1:18180/v1/health
    # Should return: {"status": "healthy", ...}
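
The disk and TCP port checks above can be wrapped in small helpers so they are easy to repeat on each new node. This is a minimal sketch, not part of StemeDB: the function names are hypothetical, `timeout` is assumed to be the GNU coreutils command, and the TCP probe relies on bash's `/dev/tcp` feature (so it does not cover the UDP gossip port).

```shell
#!/usr/bin/env bash
# Hypothetical preflight helpers; names are illustrative only.

# Succeeds if <path> has at least <min_gb> GiB available.
enough_disk() {
  local path=$1 min_gb=$2 avail_kb
  # df -Pk prints a header line, then one line whose 4th column is available KiB
  avail_kb=$(df -Pk "$path" | awk 'NR==2 {print $4}')
  [ "$avail_kb" -ge $(( min_gb * 1024 * 1024 )) ]
}

# Succeeds if a TCP connection to <host>:<port> opens within 3 seconds.
tcp_port_open() {
  local host=$1 port=$2
  timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null
}

# Example (hostname and 50 GB threshold taken from the checklist above):
#   enough_disk /data 50 || echo "less than 50 GB free on /data"
#   for p in 18180 18181 18182; do
#     tcp_port_open node1.example.com "$p" || echo "TCP port $p unreachable"
#   done
```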
    

Resolution Steps

§1. Bootstrap 3-Node Cluster (From Single-Node)

Use case: Migrating from single-node pilot to 3-node production cluster

Diagnostic:

# Check current single-node state
curl http://localhost:18180/v1/health

# Note assertion_count for validation later
ASSERTION_COUNT=$(curl -s http://localhost:18180/v1/health | jq '.assertion_count')
echo "Current assertions: $ASSERTION_COUNT"

# Verify no cluster config
curl http://localhost:18180/metrics | grep cluster_members
# Should return empty (single-node)

Resolution: Step-by-step cluster bootstrap

Step 1: Provision 2 new nodes

# AWS example: Launch 2 instances matching current node specs
aws ec2 run-instances \
  --image-id ami-xxx \
  --instance-type t3.large \
  --count 2 \
  --subnet-id subnet-xxx \
  --security-group-ids sg-xxx \
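  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=stemedb-node}]'

# Tag keys must be unique per resource, and --count 2 applies the same tags to
# both instances, so set each node's Name afterwards (instance IDs are placeholders)
aws ec2 create-tags --resources i-xxx --tags Key=Name,Value=stemedb-node2
aws ec2 create-tags --resources i-yyy --tags Key=Name,Value=stemedb-node3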

# Note instance IDs and private IPs
NODE2_IP="10.0.1.52"
NODE3_IP="10.0.1.53"

Step 2: Install StemeDB on new nodes

# SSH to node2
ssh ubuntu@$NODE2_IP

# Install StemeDB (same version as node1!)
# -f makes curl fail on an HTTP error instead of saving the error page as the binary
sudo curl -fL https://github.com/yourorg/stemedb/releases/download/v0.1.0/stemedb-api -o /usr/local/bin/stemedb-api
sudo chmod +x /usr/local/bin/stemedb-api

# Create data directories
sudo mkdir -p /data/{wal,db}
sudo chown -R stemedb:stemedb /data

# Repeat for node3

Step 3: Configure cluster on all nodes

# Node 1 (existing): Enable cluster mode
cat <<EOF | sudo tee /etc/stemedb/cluster.toml
[cluster]
enabled = true
node_id = "node1"
bind_addr = "10.0.1.51:18181"  # Node1 IP
rpc_addr = "10.0.1.51:18182"
swim_addr = "10.0.1.51:18183"

# Seed nodes for discovery
seeds = [
  "10.0.1.52:18183",  # Node2
  "10.0.1.53:18183"   # Node3
]

[replication]
factor = 2  # Replicate each assertion to 2 nodes
EOF

# Node 2: Similar config with node2 IPs
ssh node2 "cat <<EOF | sudo tee /etc/stemedb/cluster.toml
[cluster]
enabled = true
node_id = \"node2\"
bind_addr = \"10.0.1.52:18181\"
rpc_addr = \"10.0.1.52:18182\"
swim_addr = \"10.0.1.52:18183\"
seeds = [\"10.0.1.51:18183\", \"10.0.1.53:18183\"]
[replication]
factor = 2
EOF"

# Node 3: Similar config with node3 IPs
ssh node3 "cat <<EOF | sudo tee /etc/stemedb/cluster.toml
[cluster]
enabled = true
node_id = \"node3\"
bind_addr = \"10.0.1.53:18181\"
rpc_addr = \"10.0.1.53:18182\"
swim_addr = \"10.0.1.53:18183\"
seeds = [\"10.0.1.51:18183\", \"10.0.1.52:18183\"]
[replication]
factor = 2
EOF"
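
The three configs differ only in node ID, local IP, and seed list, so they can be generated rather than hand-escaped inside `ssh "..."` strings. The following is a sketch under that assumption: `render_cluster_toml` is a hypothetical helper, and the keys and ports it emits are copied from the configs above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a cluster.toml for one node.
# Usage: render_cluster_toml <node_id> <node_ip> <peer_ip>...
render_cluster_toml() {
  local node_id=$1 ip=$2
  shift 2
  local seeds="" peer
  for peer in "$@"; do
    # Join peers as "ip:18183", separated by ", "
    seeds+="${seeds:+, }\"${peer}:18183\""
  done
  cat <<EOF
[cluster]
enabled = true
node_id = "${node_id}"
bind_addr = "${ip}:18181"
rpc_addr = "${ip}:18182"
swim_addr = "${ip}:18183"
seeds = [${seeds}]

[replication]
factor = 2
EOF
}

# Example: generate locally, write remotely (no nested quoting needed):
#   render_cluster_toml node2 10.0.1.52 10.0.1.51 10.0.1.53 \
#     | ssh node2 "sudo tee /etc/stemedb/cluster.toml"
```

Piping the rendered file into `sudo tee` over ssh keeps the TOML free of backslash-escaped quotes, which are easy to get wrong when editing by hand.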

Step 4: Start new nodes first (empty data)

# Start node2
ssh node2 "sudo systemctl start stemedb-api"

# Start node3
ssh node3 "sudo systemctl start stemedb-api"

# Verify startup
ssh node2 "curl http://localhost:18180/v1/health"
ssh node3 "curl http://localhost:18180/v1/health"
# Both should return: {"status": "healthy", "assertion_count": 0}

Step 5: Restart node1 with cluster config

# Restart node1 to join cluster
sudo systemctl restart stemedb-api

# Wait for SWIM gossip to converge (~10 seconds)
sleep 15
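
A fixed sleep can be too short on a slow network and wasteful on a fast one. As an alternative, a small polling helper can wait until a condition holds or a deadline passes. This is a generic sketch: `wait_until` and `cluster_has_members` are hypothetical names, and the membership check in the comment reuses the `/cluster/members` call from Step 6.

```shell
#!/usr/bin/env bash
# Hypothetical helper: retry a command until it succeeds or the timeout expires.
# Usage: wait_until <timeout_seconds> <interval_seconds> <command> [args...]
wait_until() {
  local timeout=$1 interval=$2
  shift 2
  local deadline=$(( SECONDS + timeout ))
  until "$@"; do
    if [ "$SECONDS" -ge "$deadline" ]; then
      return 1   # condition never became true in time
    fi
    sleep "$interval"
  done
}

# Example: wait up to 60s for three members to appear
#   cluster_has_members() {
#     [ "$(curl -s http://localhost:18181/cluster/members | jq '.members | length')" = "$1" ]
#   }
#   wait_until 60 5 cluster_has_members 3 || echo "cluster did not converge"
```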

Step 6: Verify cluster formation

# Check cluster membership from any node
curl http://localhost:18181/cluster/members | jq '.'

# Expected output:
# {
#   "members": [
#     {"id": "node1", "status": "UP", "assertion_count": 10234},
#     {"id": "node2", "status": "UP", "assertion_count": 0},
#     {"id": "node3", "status": "UP", "assertion_count": 0}
#   ]
# }

# Check replication status
curl http://localhost:18180/metrics | grep replication_lag_seconds
# Lag should settle below 1s on all nodes once the initial sync in Step 7 completes
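
Reading lag values by eye is error-prone when there are several peers. The sketch below checks the threshold mechanically. It assumes the metric is exposed in Prometheus text format with the value as the last field on each line (no trailing timestamp); the `peer` label in the example is invented for illustration, and `lag_below` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read metrics text on stdin and succeed only if at least
# one replication_lag_seconds sample exists and all samples are below <threshold>.
lag_below() {
  awk -v t="$1" '
    BEGIN { bad = 0; seen = 0 }
    /^#/ { next }                       # skip HELP/TYPE comment lines
    /replication_lag_seconds/ {
      seen = 1
      if ($NF + 0 >= t + 0) bad = 1     # last field is assumed to be the value
    }
    END { exit (bad || seen == 0) }
  '
}

# Example:
#   curl -s http://localhost:18180/metrics | lag_below 1.0 \
#     || echo "replication lag at or above 1s (or metric missing)"
```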

Step 7: Trigger initial replication

# Manually trigger Merkle sync to populate node2 and node3
curl -X POST http://localhost:18181/cluster/sync \
  -H "Content-Type: application/json" \
  -d '{"target_nodes": ["node2", "node3"], "force": true}'

# Monitor replication progress
watch -n 5 'curl -s http://localhost:18181/cluster/members | jq ".members[] | {id, assertion_count}"'

# Wait for node2 and node3 to reach same assertion_count as node1
# (Typically 1-5 minutes for <100K assertions)

Validate cluster:

# All nodes should have same assertion count
curl http://node1:18180/v1/health | jq '.assertion_count'
curl http://node2:18180/v1/health | jq '.assertion_count'
curl http://node3:18180/v1/health | jq '.assertion_count'
# All should match the $ASSERTION_COUNT recorded in the Diagnostic step

# Test that writes replicate across nodes
curl -X POST http://localhost:18180/v1/assert \
  -H "Content-Type: application/json" \
  -d '{"concept_path": "test/cluster", "predicate": "replicated", "value": true}'

# Query from different nodes
curl -X POST http://node2:18180/v1/query \
  -H "Content-Type: application/json" \
  -d '{"concept_path": "test/cluster", "lens": "recency"}'
# Should return the assertion just written

If failed: Cluster won't form → Check firewall rules, SWIM gossip logs, network connectivity.


§2. Add Node to Existing Cluster

Use case: Scaling existing 3-node cluster to 4+ nodes

⚠️ NOTE: Pilot 5 supports 3-node clusters only. Clusters of 4+ nodes are on the roadmap (P6); the procedure below is written ahead of that release.

Diagnostic:

# Check current cluster state
curl http://node1:18181/cluster/members | jq '.members | length'
# Should return: 3

# Check cluster health
curl http://node1:18181/cluster/health
# Should return: {"status": "healthy", "quorum": true}

Resolution: Add node4

Step 1: Provision new node

# (Same as §1 Step 1)
NODE4_IP="10.0.1.54"

Step 2: Install StemeDB on node4

# (Same as §1 Step 2)

Step 3: Configure node4

ssh node4 "cat <<EOF | sudo tee /etc/stemedb/cluster.toml
[cluster]
enabled = true
node_id = \"node4\"
bind_addr = \"10.0.1.54:18181\"
rpc_addr = \"10.0.1.54:18182\"
swim_addr = \"10.0.1.54:18183\"

# Point to existing cluster for discovery
seeds = [
  \"10.0.1.51:18183\",  # Node1
  \"10.0.1.52:18183\",  # Node2
  \"10.0.1.53:18183\"   # Node3
]

[replication]
factor = 2
EOF"

Step 4: Start node4

ssh node4 "sudo systemctl start stemedb-api"

# SWIM gossip will auto-discover existing cluster
# No restart of existing nodes required!

Step 5: Verify join

# Check cluster membership
curl http://node1:18181/cluster/members | jq '.members | length'
# Should return: 4

# Check node4 status
curl http://node1:18181/cluster/members | jq '.members[] | select(.id=="node4")'
# Should show: {"id": "node4", "status": "UP", "assertion_count": 0}

Step 6: Rebalance shards (manual for Pilot 5)

⚠️ NOTE: Automatic rebalancing is roadmap P6.3. Manual process required.

# View current shard assignment
curl http://node1:18181/cluster/shards | jq '.'

# Identify shards to move to node4
# (Typically about 25% of the shards held by each of node1, node2, and node3,
#  so that node4 ends up with roughly a quarter of the total)

# Move shard (example)
curl -X POST http://node1:18181/admin/shards/rebalance \
  -H "Content-Type: application/json" \
  -d '{
    "shard_id": "shard-abc123",
    "target_node": "node4",
    "reason": "add_capacity"
  }'

# Monitor rebalance progress
watch -n 5 'curl -s http://node1:18181/cluster/shards | jq ".shards[] | select(.id==\"shard-abc123\") | .rebalance_status"'

# Repeat for other shards until balanced
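
The 25% figure follows from simple arithmetic. With S shards spread evenly over 3 nodes, each node holds S/3. After a fourth node joins, each should hold S/4, so every existing node gives up S/3 - S/4 = S/12, which is a quarter of what it held. The sketch below generalizes this; it assumes shards are evenly distributed beforehand and uses integer division, and the shard totals in the example are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: how many shards each existing node should hand over
# when the cluster grows from <old_nodes> to <new_nodes>.
shards_to_move_per_node() {
  local total=$1 old_nodes=$2 new_nodes=$3
  local before=$(( total / old_nodes ))   # shards per node today
  local after=$(( total / new_nodes ))    # target shards per node
  echo $(( before - after ))
}

# Example: 120 shards, growing from 3 to 4 nodes
#   shards_to_move_per_node 120 3 4   # prints 10 (a quarter of the 40 each node held)
```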

Validate:

# All nodes should have similar assertion counts
curl http://node1:18181/cluster/members | jq '.members[] | {id, assertion_count}'

# Test query hits node4
curl -X POST http://node4:18180/v1/query \
  -H "Content-Type: application/json" \
  -d '{"concept_path": "test/node4", "lens": "recency"}'
# Should succeed

If failed: Node4 won't join → Check seed node IPs, firewall rules, SWIM logs.


§3. Replace Failed Node

Use case: Node2 failed (hardware, software), need replacement

Diagnostic:

# Check cluster status
curl http://node1:18181/cluster/members | jq '.members[] | select(.status != "UP")'

# Expected output:
# {
#   "id": "node2",
#   "status": "DOWN",
#   "last_seen": "2026-02-11T10:15:00Z"
# }

# Check replication status
curl http://node1:18180/metrics | grep replication_lag_seconds
# May show elevated lag to node2

Resolution: Replace node2

Step 1: Remove failed node from cluster

# Gracefully remove node2 (allows rebalancing)
curl -X POST http://node1:18181/admin/cluster/remove \
  -H "Content-Type: application/json" \
  -d '{"node_id": "node2", "force": false}'

# Wait for shards to rebalance to node1 and node3
# (Typically 5-15 minutes for <100K assertions)

watch -n 10 'curl -s http://node1:18181/cluster/members | jq .members'
# node2 should disappear from list

Step 2: Provision new node2

# Launch new instance
NEW_NODE2_IP="10.0.1.55"  # May be different IP

Step 3: Configure new node2

# (Same as §1 Step 3, using new IP)
ssh new-node2 "cat <<EOF | sudo tee /etc/stemedb/cluster.toml
[cluster]
enabled = true
node_id = \"node2-replacement\"  # Different ID
bind_addr = \"10.0.1.55:18181\"
rpc_addr = \"10.0.1.55:18182\"
swim_addr = \"10.0.1.55:18183\"
seeds = [\"10.0.1.51:18183\", \"10.0.1.53:18183\"]
[replication]
factor = 2
EOF"

Step 4: Start new node2

ssh new-node2 "sudo systemctl start stemedb-api"

# Auto-joins cluster via SWIM

Step 5: Verify join and replication

# Check membership
curl http://node1:18181/cluster/members | jq '.members'
# Should show: node1, node2-replacement, node3

# Trigger replication to new node
curl -X POST http://node1:18181/cluster/sync \
  -H "Content-Type: application/json" \
  -d '{"target_nodes": ["node2-replacement"], "force": true}'

# Monitor
watch -n 5 'curl -s http://node1:18181/cluster/members | jq ".members[] | select(.id==\"node2-replacement\") | .assertion_count"'

Validate:

# Cluster healthy with 3 nodes
curl http://node1:18181/cluster/health
# Should return: {"status": "healthy", "quorum": true}

# New node2 has full data
curl http://new-node2:18180/v1/health | jq '.assertion_count'
# Should match node1 and node3

If failed: Replication not catching up → Check network bandwidth, disk I/O, Merkle sync logs.


Validation

After adding node, validate cluster health:

  • Cluster members show new node

    curl http://node1:18181/cluster/members | jq '.members'
    # Should list all nodes with status "UP"
    
  • Replication lag <1s

    curl http://node1:18180/metrics | grep replication_lag_seconds
    # All nodes should show <1.0
    
  • Assertion counts match

    for node in node1 node2 node3; do
      echo "$node: $(curl -s http://$node:18180/v1/health | jq '.assertion_count')"
    done
    # All should be equal (±1 for in-flight writes)
    
  • Queries work from new node

    curl -X POST http://new-node:18180/v1/query \
      -H "Content-Type: application/json" \
      -d '{"concept_path": "test/cluster", "lens": "recency"}'
    # Should return results
    
  • Writes replicate to new node

    curl -X POST http://node1:18180/v1/assert \
      -H "Content-Type: application/json" \
      -d '{"concept_path": "test/new_node", "predicate": "validated", "value": true}'
    
    # Query from new node
    curl -X POST http://new-node:18180/v1/query \
      -H "Content-Type: application/json" \
      -d '{"concept_path": "test/new_node", "lens": "recency"}'
    # Should return the assertion
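
The "equal within ±1" check on assertion counts can be automated once the counts have been collected. A minimal sketch, assuming the counts are plain integers; `counts_match` is a hypothetical helper, not a StemeDB command.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if all counts lie within <tolerance> of each other.
# Usage: counts_match <tolerance> <count>...
counts_match() {
  local tol=$1
  shift
  local min=$1 max=$1 c
  for c in "$@"; do
    if [ "$c" -lt "$min" ]; then min=$c; fi
    if [ "$c" -gt "$max" ]; then max=$c; fi
  done
  [ $(( max - min )) -le "$tol" ]
}

# Example: collect counts from each node, then compare with a tolerance of 1
#   counts=()
#   for node in node1 node2 node3; do
#     counts+=("$(curl -s http://$node:18180/v1/health | jq '.assertion_count')")
#   done
#   counts_match 1 "${counts[@]}" || echo "assertion counts diverge"
```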
    

Network Requirements

For cluster operation, ensure:

Port   Protocol  Purpose                    Required For
18180  TCP/HTTP  API queries                Client → Any node
18181  TCP/HTTP  Cluster gateway            Load balancer → Nodes
18182  TCP/gRPC  Cluster RPC (replication)  Node ↔ Node
18183  UDP       SWIM gossip (membership)   Node ↔ Node

Firewall rules (AWS Security Group example):

# Allow cluster communication (node ↔ node)
aws ec2 authorize-security-group-ingress \
  --group-id sg-xxx \
  --source-group sg-xxx \
  --protocol tcp \
  --port 18180-18183

aws ec2 authorize-security-group-ingress \
  --group-id sg-xxx \
  --source-group sg-xxx \
  --protocol udp \
  --port 18183

# Allow client access (load balancer → nodes)
aws ec2 authorize-security-group-ingress \
  --group-id sg-xxx \
  --source-group sg-lb \
  --protocol tcp \
  --port 18180

Latency requirement: <5ms inter-node latency (same region/AZ required)

See: Network Requirements for full details.


Load Balancer Configuration

After adding nodes, update load balancer:

Nginx example:

upstream stemedb_cluster {
    # Round-robin by default
    server 10.0.1.51:18180 weight=1;  # node1
    server 10.0.1.52:18180 weight=1;  # node2
    server 10.0.1.53:18180 weight=1;  # node3

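    # Active health checks. The "check" directive is not part of stock nginx;
    # it requires the third-party nginx_upstream_check_module (or Tengine)
    check interval=5000 rise=2 fall=3 timeout=3000;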
}

server {
    listen 443 ssl;
    server_name stemedb.example.com;
    # ssl_certificate and ssl_certificate_key must also be set for "listen ... ssl"

    location / {
        proxy_pass http://stemedb_cluster;
        proxy_next_upstream error timeout http_502 http_503;
        proxy_connect_timeout 5s;
        proxy_send_timeout 30s;
        proxy_read_timeout 30s;
    }
}

Envoy example:

clusters:
  - name: stemedb_cluster
    type: STRICT_DNS
    load_assignment:
      cluster_name: stemedb_cluster
      endpoints:
        - lb_endpoints:
          - endpoint:
              address:
                socket_address:
                  address: node1.example.com
                  port_value: 18180
          - endpoint:
              address:
                socket_address:
                  address: node2.example.com
                  port_value: 18180
          - endpoint:
              address:
                socket_address:
                  address: node3.example.com
                  port_value: 18180
    health_checks:
      - timeout: 3s
        interval: 5s
        unhealthy_threshold: 3
        healthy_threshold: 2
        http_health_check:
          path: "/v1/health"

Cluster Sizing Guidelines

From Resource Sizing Guide:

Assertions  Nodes  Replication Factor  RTO   RPO
<10K        1      N/A                 2hr   24hr
<100K       3      2                   5min  1min
<1M         5      3                   1min  10s

When to add nodes:

  • Query latency p99 >1s (capacity)
  • Disk usage >80% (storage)
  • CPU sustained >70% (compute)
  • Planning for HA (minimum 3 nodes)
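
The three numeric triggers above can be expressed as a single check, for example in a capacity-review script. This is only a sketch: `should_add_node` is a hypothetical helper, its inputs are assumed to be integers (p99 latency in milliseconds, disk and CPU as whole percentages), and how those values are gathered is left to the monitoring stack.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if any scaling trigger from the list is exceeded.
# Usage: should_add_node <p99_latency_ms> <disk_used_pct> <cpu_sustained_pct>
should_add_node() {
  local p99_ms=$1 disk_pct=$2 cpu_pct=$3
  # Thresholds from the runbook: p99 > 1s, disk > 80%, CPU > 70%
  [ "$p99_ms" -gt 1000 ] || [ "$disk_pct" -gt 80 ] || [ "$cpu_pct" -gt 70 ]
}

# Example:
#   should_add_node 1200 50 40 && echo "consider adding a node"
```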


Future Enhancements

Roadmap P6.3 (Automatic Shard Rebalancing):

  • Auto-detect when new node joins
  • Automatically rebalance shards for even distribution
  • No manual shards/rebalance API calls needed

Roadmap P6.4 (WAL Archival to S3):

  • Replicate WAL segments to S3 for durability
  • Reduce local disk requirements
  • Enable faster node replacement (restore from S3)

Last Updated

2026-02-11