# Runbook: Add Node to Cluster

## Symptom

- Need to scale from single-node to 3-node cluster
- Need to add capacity to existing cluster
- Need to replace failed node
- Planning horizontal scaling

---

## Quick Diagnosis

```
Need to add node
│
├─► Currently single-node?
│   └─► §1 Bootstrap 3-Node Cluster
│
├─► Existing 3-node cluster, need more capacity?
│   └─► §2 Add Node to Existing Cluster
│
├─► Node failed, need replacement?
│   └─► §3 Replace Failed Node
│
└─► Planning scaling strategy?
    └─► See Reference Architectures
```

---

## Prerequisites

**Before adding node:**

- [ ] **Network connectivity:**
  ```bash
  # From new node, ping existing nodes (-c 5 stops after five probes)
  ping -c 5 node1.example.com
  ping -c 5 node2.example.com
  # Should show <5ms latency (same region required)
  ```

- [ ] **Ports open:**
  ```bash
  # Test connectivity to cluster ports
  nc -zv node1.example.com 18180   # HTTP API
  nc -zv node1.example.com 18181   # Cluster Gateway
  nc -zv node1.example.com 18182   # Cluster RPC
  nc -zuv node1.example.com 18183  # SWIM Gossip (UDP, see Network Requirements)
  # All should succeed
  ```

- [ ] **StemeDB installed on new node:**
  ```bash
  # Verify binary
  which stemedb-api
  # Should return: /usr/local/bin/stemedb-api (or installation path)
  ```

- [ ] **Disk space sufficient:**
  ```bash
  df -h /data
  # Should have >50GB available for pilot
  ```

- [ ] **Cluster healthy (if existing):**
  ```bash
  curl http://node1:18180/v1/health
  # Should return: {"status": "healthy", ...}
  ```

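
The latency item in the checklist above is judged by eye. As a minimal sketch, the two helpers below pull the average round-trip time out of `ping` output and compare it with the 5 ms budget. The function names `avg_rtt_ms` and `rtt_within_budget` are illustrative and not part of StemeDB, and the parser assumes the usual `min/avg/max/...` summary line printed by Linux and BSD `ping`. The commented lines at the end show one possible way to call them against a host from this runbook.

```bash
# Sketch: check inter-node latency against the runbook's 5 ms budget.
# Helper names are hypothetical; nothing here is shipped with StemeDB.

# Reads ping output on stdin and prints the average RTT in milliseconds.
# Assumes a summary line such as:
#   rtt min/avg/max/mdev = 0.045/0.052/0.061/0.007 ms
avg_rtt_ms() {
  awk -F'/' '/^(rtt|round-trip)/ { print $5 }'
}

# rtt_within_budget RTT_MS [LIMIT_MS]
# Succeeds when RTT_MS (may be fractional) is strictly below LIMIT_MS.
rtt_within_budget() {
  local rtt="$1" limit="${2:-5}"
  # An empty value means the summary line was missing (host unreachable).
  [ -n "$rtt" ] || return 1
  awk -v r="$rtt" -v l="$limit" 'BEGIN { exit !(r + 0 < l + 0) }'
}

# Possible usage on the new node:
#   rtt=$(ping -c 5 node1.example.com | avg_rtt_ms)
#   rtt_within_budget "$rtt" 5 && echo "latency OK: ${rtt} ms"
```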
---

## Resolution Steps

### §1. Bootstrap 3-Node Cluster (From Single-Node)

**Use case:** Migrating from single-node pilot to 3-node production cluster

**Diagnostic:**
```bash
# Check current single-node state
curl http://localhost:18180/v1/health

# Note assertion_count for validation later
ASSERTION_COUNT=$(curl -s http://localhost:18180/v1/health | jq '.assertion_count')
echo "Current assertions: $ASSERTION_COUNT"

# Verify no cluster config
curl http://localhost:18180/metrics | grep cluster_members
# Should return empty (single-node)
```

**Resolution: Step-by-step cluster bootstrap**

**Step 1: Provision 2 new nodes**

```bash
# AWS example: Launch 2 instances matching current node specs
aws ec2 run-instances \
  --image-id ami-xxx \
  --instance-type t3.large \
  --count 2 \
  --subnet-id subnet-xxx \
  --security-group-ids sg-xxx \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=stemedb-node}]'
# Both instances receive the same tags (a tag key may appear only once per
# resource); rename them to stemedb-node2 and stemedb-node3 after launch.

# Note instance IDs and private IPs
NODE2_IP="10.0.1.52"
NODE3_IP="10.0.1.53"
```

**Step 2: Install StemeDB on new nodes**

```bash
# SSH to node2
ssh "ubuntu@$NODE2_IP"

# Install StemeDB (same version as node1!)
sudo curl -L https://github.com/yourorg/stemedb/releases/download/v0.1.0/stemedb-api -o /usr/local/bin/stemedb-api
sudo chmod +x /usr/local/bin/stemedb-api

# Create data directories (assumes the stemedb service user already exists)
sudo mkdir -p /data/{wal,db}
sudo chown -R stemedb:stemedb /data

# Repeat for node3
```

**Step 3: Configure cluster on all nodes**

```bash
# Node 1 (existing): Enable cluster mode
cat <<EOF | sudo tee /etc/stemedb/cluster.toml
[cluster]
enabled = true
node_id = "node1"
bind_addr = "10.0.1.51:18181"  # Node1 IP
rpc_addr = "10.0.1.51:18182"
swim_addr = "10.0.1.51:18183"

# Seed nodes for discovery
seeds = [
  "10.0.1.52:18183",  # Node2
  "10.0.1.53:18183"   # Node3
]

[replication]
factor = 2  # Replicate each assertion to 2 nodes
EOF

# Node 2: Similar config with node2 IPs
ssh node2 "cat <<EOF | sudo tee /etc/stemedb/cluster.toml
[cluster]
enabled = true
node_id = \"node2\"
bind_addr = \"10.0.1.52:18181\"
rpc_addr = \"10.0.1.52:18182\"
swim_addr = \"10.0.1.52:18183\"
seeds = [\"10.0.1.51:18183\", \"10.0.1.53:18183\"]
[replication]
factor = 2
EOF"

# Node 3: Similar config with node3 IPs
ssh node3 "cat <<EOF | sudo tee /etc/stemedb/cluster.toml
[cluster]
enabled = true
node_id = \"node3\"
bind_addr = \"10.0.1.53:18181\"
rpc_addr = \"10.0.1.53:18182\"
swim_addr = \"10.0.1.53:18183\"
seeds = [\"10.0.1.51:18183\", \"10.0.1.52:18183\"]
[replication]
factor = 2
EOF"
```

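
The three files above differ only in the node ID, the node's own IP, and the seed list. As a sketch, a small generator can render the same keys from arguments, which avoids hand-escaping quotes inside the `ssh` heredocs. The function name `render_cluster_toml` is illustrative and not part of StemeDB; the keys, ports, and replication factor are copied from the configs above. The commented line at the end shows one way the output could be piped to a node.

```bash
# Sketch: render cluster.toml for one node. The helper name is hypothetical;
# the keys and ports mirror the hand-written configs in Step 3.

# render_cluster_toml NODE_ID SELF_IP PEER_IP...
render_cluster_toml() {
  local node_id="$1" self_ip="$2"
  shift 2
  local seeds="" peer
  # Build the seed list from the remaining arguments (SWIM port 18183).
  for peer in "$@"; do
    seeds="${seeds:+$seeds, }\"${peer}:18183\""
  done
  cat <<EOF
[cluster]
enabled = true
node_id = "${node_id}"
bind_addr = "${self_ip}:18181"
rpc_addr = "${self_ip}:18182"
swim_addr = "${self_ip}:18183"
seeds = [${seeds}]

[replication]
factor = 2
EOF
}

# Possible usage (renders locally, writes remotely):
#   render_cluster_toml node2 10.0.1.52 10.0.1.51 10.0.1.53 \
#     | ssh node2 "sudo tee /etc/stemedb/cluster.toml"
```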
**Step 4: Start new nodes first (empty data)**

```bash
# Start node2
ssh node2 "sudo systemctl start stemedb-api"

# Start node3
ssh node3 "sudo systemctl start stemedb-api"

# Verify startup
ssh node2 "curl http://localhost:18180/v1/health"
ssh node3 "curl http://localhost:18180/v1/health"
# Both should return: {"status": "healthy", "assertion_count": 0}
```

**Step 5: Restart node1 with cluster config**

```bash
# Restart node1 to join cluster
sudo systemctl restart stemedb-api

# Wait for SWIM gossip to converge (~10 seconds)
sleep 15
```

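
A fixed `sleep 15` either waits longer than needed or not long enough. As a sketch, the loop below polls the membership endpoint until the expected number of members report `"status": "UP"` or a timeout expires. The helper names are illustrative, and the response shape is assumed to be the one shown in Step 6. Counting is done with `grep` to avoid an extra dependency, and `fetch_members` is a separate function so it can be pointed at another node. A possible call is `wait_for_up_members 3 60 2`.

```bash
# Sketch: wait for cluster formation instead of sleeping a fixed time.
# Helper names are hypothetical; the endpoint is the one used in Step 6.

# Fetch the membership document from the local cluster gateway.
fetch_members() {
  curl -s http://localhost:18181/cluster/members
}

# Count members whose status is "UP" in the JSON read from stdin.
count_up_members() {
  grep -o '"status": *"UP"' | wc -l | tr -d ' '
}

# wait_for_up_members EXPECTED [TIMEOUT_S] [INTERVAL_S]
# Succeeds once at least EXPECTED members are UP; fails after TIMEOUT_S.
wait_for_up_members() {
  local expected="$1" timeout="${2:-60}" interval="${3:-2}"
  local waited=0 up
  while :; do
    up=$(fetch_members | count_up_members)
    if [ "$up" -ge "$expected" ]; then
      echo "cluster formed: $up members UP after ${waited}s"
      return 0
    fi
    if [ "$waited" -ge "$timeout" ]; then
      echo "timed out after ${waited}s: $up of $expected members UP" >&2
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
}
```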
**Step 6: Verify cluster formation**

```bash
# Check cluster membership from any node
curl http://localhost:18181/cluster/members | jq '.'

# Expected output:
# {
#   "members": [
#     {"id": "node1", "status": "UP", "assertion_count": 10234},
#     {"id": "node2", "status": "UP", "assertion_count": 0},
#     {"id": "node3", "status": "UP", "assertion_count": 0}
#   ]
# }

# Check replication status
curl http://localhost:18180/metrics | grep replication_lag_seconds
# All nodes should show <1s lag
```

**Step 7: Trigger initial replication**

```bash
# Manually trigger Merkle sync to populate node2 and node3
curl -X POST http://localhost:18181/cluster/sync \
  -H "Content-Type: application/json" \
  -d '{"target_nodes": ["node2", "node3"], "force": true}'

# Monitor replication progress
watch -n 5 'curl -s http://localhost:18181/cluster/members | jq ".members[] | {id, assertion_count}"'

# Wait for node2 and node3 to reach same assertion_count as node1
# (Typically 1-5 minutes for <100K assertions)
```

**Validate cluster:**
```bash
# All nodes should have same assertion count
curl http://node1:18180/v1/health | jq '.assertion_count'
curl http://node2:18180/v1/health | jq '.assertion_count'
curl http://node3:18180/v1/health | jq '.assertion_count'
# All should match the original count ($ASSERTION_COUNT from the diagnostic step)

# Test writes hit multiple nodes
curl -X POST http://localhost:18180/v1/assert \
  -H "Content-Type: application/json" \
  -d '{"concept_path": "test/cluster", "predicate": "replicated", "value": true}'

# Query from different nodes
curl -X POST http://node2:18180/v1/query \
  -H "Content-Type: application/json" \
  -d '{"concept_path": "test/cluster", "lens": "recency"}'
# Should return the assertion just written
```

**If failed:** Cluster won't form → Check firewall rules, SWIM gossip logs, network connectivity.

---

### §2. Add Node to Existing Cluster

**Use case:** Scaling existing 3-node cluster to 4+ nodes

⚠️ **NOTE:** Pilot 5 supports 3-node clusters only. Clusters of 4+ nodes are roadmap item P6; the procedure below is written ahead of that release.

**Diagnostic:**
```bash
# Check current cluster state
curl http://node1:18181/cluster/members | jq '.members | length'
# Should return: 3

# Check cluster health
curl http://node1:18181/cluster/health
# Should return: {"status": "healthy", "quorum": true}
```

**Resolution: Add node4**

**Step 1: Provision new node**
```bash
# (Same as §1 Step 1)
NODE4_IP="10.0.1.54"
```

**Step 2: Install StemeDB on node4**
```bash
# (Same as §1 Step 2)
```

**Step 3: Configure node4**
```bash
ssh node4 "cat <<EOF | sudo tee /etc/stemedb/cluster.toml
[cluster]
enabled = true
node_id = \"node4\"
bind_addr = \"10.0.1.54:18181\"
rpc_addr = \"10.0.1.54:18182\"
swim_addr = \"10.0.1.54:18183\"

# Point to existing cluster for discovery
seeds = [
  \"10.0.1.51:18183\",  # Node1
  \"10.0.1.52:18183\",  # Node2
  \"10.0.1.53:18183\"   # Node3
]

[replication]
factor = 2
EOF"
```

**Step 4: Start node4**
```bash
ssh node4 "sudo systemctl start stemedb-api"

# SWIM gossip will auto-discover existing cluster
# No restart of existing nodes required!
```

**Step 5: Verify join**
```bash
# Check cluster membership
curl http://node1:18181/cluster/members | jq '.members | length'
# Should return: 4

# Check node4 status
curl http://node1:18181/cluster/members | jq '.members[] | select(.id=="node4")'
# Should show: {"id": "node4", "status": "UP", "assertion_count": 0}
```

**Step 6: Rebalance shards (manual for Pilot 5)**

⚠️ **NOTE:** Automatic rebalancing is roadmap P6.3. Manual process required.

```bash
# View current shard assignment
curl http://node1:18181/cluster/shards | jq '.'

# Identify shards to move to node4
# (Typically 25% of shards from node1, node2, node3)

# Move shard (example)
curl -X POST http://node1:18181/admin/shards/rebalance \
  -H "Content-Type: application/json" \
  -d '{
    "shard_id": "shard-abc123",
    "target_node": "node4",
    "reason": "add_capacity"
  }'

# Monitor rebalance progress
watch -n 5 'curl -s http://node1:18181/cluster/shards | jq ".shards[] | select(.id==\"shard-abc123\") | .rebalance_status"'

# Repeat for other shards until balanced
```

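
The "25% of shards" figure follows from the node counts. With three nodes each holds a third of the shards, and with four each should hold a quarter, so every existing node gives up a quarter of what it holds. As a sketch, the arithmetic below computes the number of shards to move per existing node. It assumes shards are currently spread evenly and that the total divides cleanly by both node counts. The function names are illustrative and not part of StemeDB.

```bash
# Sketch: how many shards each existing node hands to one new node.
# Assumes an even starting distribution; helper names are hypothetical.

# shards_per_node TOTAL_SHARDS NODE_COUNT
# Prints the even share per node (integer division).
shards_per_node() {
  echo $(( $1 / $2 ))
}

# shards_to_move_per_node TOTAL_SHARDS OLD_NODE_COUNT
# Prints the difference between the old and new even share.
shards_to_move_per_node() {
  local total="$1" old_nodes="$2"
  local before after
  before=$(shards_per_node "$total" "$old_nodes")
  after=$(shards_per_node "$total" $((old_nodes + 1)))
  echo $(( before - after ))
}

# Example: 120 shards on 3 nodes is 40 each; with node4 the target is 30
# each, so every existing node moves 10 shards (25% of its 40).
#   shards_to_move_per_node 120 3
```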
**Validate:**
```bash
# All nodes should have similar assertion counts
curl http://node1:18181/cluster/members | jq '.members[] | {id, assertion_count}'

# Test query hits node4
curl -X POST http://node4:18180/v1/query \
  -H "Content-Type: application/json" \
  -d '{"concept_path": "test/node4", "lens": "recency"}'
# Should succeed
```

**If failed:** Node4 won't join → Check seed node IPs, firewall rules, SWIM logs.

---

### §3. Replace Failed Node

**Use case:** Node2 failed (hardware, software), need replacement

**Diagnostic:**
```bash
# Check cluster status
curl http://node1:18181/cluster/members | jq '.members[] | select(.status != "UP")'

# Expected output:
# {
#   "id": "node2",
#   "status": "DOWN",
#   "last_seen": "2026-02-11T10:15:00Z"
# }

# Check replication status
curl http://node1:18180/metrics | grep replication_lag_seconds
# May show elevated lag to node2
```

**Resolution: Replace node2**

**Step 1: Remove failed node from cluster**
```bash
# Gracefully remove node2 (allows rebalancing)
curl -X POST http://node1:18181/admin/cluster/remove \
  -H "Content-Type: application/json" \
  -d '{"node_id": "node2", "force": false}'

# Wait for shards to rebalance to node1 and node3
# (Typically 5-15 minutes for <100K assertions)

watch -n 10 'curl -s http://node1:18181/cluster/members | jq .members'
# node2 should disappear from list
```

**Step 2: Provision new node2**
```bash
# Launch new instance (same as §1 Step 1) and install StemeDB (same as §1 Step 2)
NEW_NODE2_IP="10.0.1.55"  # May be different IP
```

**Step 3: Configure new node2**
```bash
# (Same as §1 Step 3, using new IP)
ssh new-node2 "cat <<EOF | sudo tee /etc/stemedb/cluster.toml
[cluster]
enabled = true
node_id = \"node2-replacement\"  # Different ID
bind_addr = \"10.0.1.55:18181\"
rpc_addr = \"10.0.1.55:18182\"
swim_addr = \"10.0.1.55:18183\"
seeds = [\"10.0.1.51:18183\", \"10.0.1.53:18183\"]
[replication]
factor = 2
EOF"
```

**Step 4: Start new node2**
```bash
ssh new-node2 "sudo systemctl start stemedb-api"

# Auto-joins cluster via SWIM
```

**Step 5: Verify join and replication**
```bash
# Check membership
curl http://node1:18181/cluster/members | jq '.members'
# Should show: node1, node2-replacement, node3

# Trigger replication to new node
curl -X POST http://node1:18181/cluster/sync \
  -H "Content-Type: application/json" \
  -d '{"target_nodes": ["node2-replacement"], "force": true}'

# Monitor
watch -n 5 'curl -s http://node1:18181/cluster/members | jq ".members[] | select(.id==\"node2-replacement\") | .assertion_count"'
```

**Validate:**
```bash
# Cluster healthy with 3 nodes
curl http://node1:18181/cluster/health
# Should return: {"status": "healthy", "quorum": true}

# New node2 has full data
curl http://new-node2:18180/v1/health | jq '.assertion_count'
# Should match node1 and node3
```

**If failed:** Replication not catching up → Check network bandwidth, disk I/O, Merkle sync logs.

---

## Validation

After adding node, validate cluster health:

- [ ] **Cluster members show new node**
  ```bash
  curl http://node1:18181/cluster/members | jq '.members'
  # Should list all nodes with status "UP"
  ```

- [ ] **Replication lag <1s**
  ```bash
  curl http://node1:18180/metrics | grep replication_lag_seconds
  # All nodes should show <1.0
  ```

- [ ] **Assertion counts match**
  ```bash
  for node in node1 node2 node3; do
    echo "$node: $(curl -s http://$node:18180/v1/health | jq '.assertion_count')"
  done
  # All should be equal (±1 for in-flight writes)
  ```

- [ ] **Queries work from new node**
  ```bash
  curl -X POST http://new-node:18180/v1/query \
    -H "Content-Type: application/json" \
    -d '{"concept_path": "test/cluster", "lens": "recency"}'
  # Should return results
  ```

- [ ] **Writes replicate to new node**
  ```bash
  curl -X POST http://node1:18180/v1/assert \
    -H "Content-Type: application/json" \
    -d '{"concept_path": "test/new_node", "predicate": "validated", "value": true}'

  # Query from new node
  curl -X POST http://new-node:18180/v1/query \
    -H "Content-Type: application/json" \
    -d '{"concept_path": "test/new_node", "lens": "recency"}'
  # Should return the assertion
  ```

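
The "assertion counts match" item in the checklist above asks for equality within ±1, which is easy to misjudge by eye once counts reach five digits. As a sketch, the helper below takes a tolerance and any number of counts and succeeds when the largest and smallest differ by no more than the tolerance. The function name `counts_match` is illustrative and not part of StemeDB. The commented lines show one way to feed it from the health endpoint.

```bash
# Sketch: compare per-node assertion counts within a tolerance.
# The helper name is hypothetical.

# counts_match TOLERANCE COUNT...
# Succeeds when max(COUNT) - min(COUNT) <= TOLERANCE.
counts_match() {
  local tolerance="$1"
  shift
  # At least one count is required.
  [ "$#" -gt 0 ] || return 1
  local min="$1" max="$1" c
  for c in "$@"; do
    if [ "$c" -lt "$min" ]; then min="$c"; fi
    if [ "$c" -gt "$max" ]; then max="$c"; fi
  done
  [ $((max - min)) -le "$tolerance" ]
}

# Possible usage against a live cluster:
#   counts=$(for n in node1 node2 node3; do
#     curl -s "http://$n:18180/v1/health" | jq '.assertion_count'
#   done)
#   counts_match 1 $counts && echo "assertion counts agree"
```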
---

## Network Requirements

**For cluster operation, ensure:**

| Port | Protocol | Purpose | Required For |
|------|----------|---------|--------------|
| **18180** | TCP/HTTP | API queries | Client → Any node |
| **18181** | TCP/HTTP | Cluster gateway | Load balancer → Nodes |
| **18182** | TCP/gRPC | Cluster RPC (replication) | Node ↔ Node |
| **18183** | UDP | SWIM gossip (membership) | Node ↔ Node |

**Firewall rules (AWS Security Group example):**
```bash
# Allow cluster communication (node ↔ node)
aws ec2 authorize-security-group-ingress \
  --group-id sg-xxx \
  --source-group sg-xxx \
  --protocol tcp \
  --port 18180-18183

aws ec2 authorize-security-group-ingress \
  --group-id sg-xxx \
  --source-group sg-xxx \
  --protocol udp \
  --port 18183

# Allow client access (load balancer → nodes)
aws ec2 authorize-security-group-ingress \
  --group-id sg-xxx \
  --source-group sg-lb \
  --protocol tcp \
  --port 18180
```

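
Because the table lists 18183 as UDP, a plain TCP probe does not exercise the gossip path. As a sketch, the helper below prints one `nc` probe command per port and adds `-u` for UDP entries. The port list is copied from the table above, and the function names are illustrative. Note that a UDP probe with `nc -z` is only a weak signal, since it can report success whenever no ICMP "port unreachable" comes back, so the SWIM logs remain the authoritative check.

```bash
# Sketch: derive port probe commands from the runbook's port table.
# Helper names are hypothetical.

# Port table: "port protocol purpose", one entry per line.
cluster_ports() {
  cat <<'EOF'
18180 tcp HTTP API
18181 tcp Cluster gateway
18182 tcp Cluster RPC
18183 udp SWIM gossip
EOF
}

# print_probe_commands HOST
# Prints one nc command per port; UDP ports get the -u flag.
print_probe_commands() {
  local host="$1" port proto purpose flags
  cluster_ports | while read -r port proto purpose; do
    flags="-zv"
    if [ "$proto" = "udp" ]; then flags="-zuv"; fi
    echo "nc $flags $host $port  # $purpose"
  done
}

# Possible usage: review the commands, then run them from the new node.
#   print_probe_commands node1.example.com
```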
**Latency requirement:** <5ms inter-node latency (same region/AZ required)

**See:** [Network Requirements](../reference-architecture/network-requirements.md) for full details.

---

## Load Balancer Configuration

**After adding nodes, update load balancer:**

**Nginx example:**
```nginx
upstream stemedb_cluster {
    # Round-robin by default
    server 10.0.1.51:18180 weight=1;  # node1
    server 10.0.1.52:18180 weight=1;  # node2
    server 10.0.1.53:18180 weight=1;  # node3

    # Health checks. The `check` directive comes from the third-party
    # nginx_upstream_check_module; stock nginx does not provide it.
    check interval=5000 rise=2 fall=3 timeout=3000;
}

server {
    listen 443 ssl;
    server_name stemedb.example.com;

    location / {
        proxy_pass http://stemedb_cluster;
        proxy_next_upstream error timeout http_502 http_503;
        proxy_connect_timeout 5s;
        proxy_send_timeout 30s;
        proxy_read_timeout 30s;
    }
}
```

**Envoy example:**
```yaml
clusters:
- name: stemedb_cluster
  type: STRICT_DNS
  load_assignment:
    cluster_name: stemedb_cluster
    endpoints:
    - lb_endpoints:
      - endpoint:
          address:
            socket_address:
              address: node1.example.com
              port_value: 18180
      - endpoint:
          address:
            socket_address:
              address: node2.example.com
              port_value: 18180
      - endpoint:
          address:
            socket_address:
              address: node3.example.com
              port_value: 18180
  health_checks:
  - timeout: 3s
    interval: 5s
    unhealthy_threshold: 3
    healthy_threshold: 2
    http_health_check:
      path: "/v1/health"
```

---

## Cluster Sizing Guidelines

**From [Resource Sizing Guide](../reference-architecture/resource-sizing.md):**

| Assertions | Nodes | Replication Factor | RTO | RPO |
|-----------|-------|-------------------|-----|-----|
| <10K | 1 | N/A | 2hr | 24hr |
| <100K | 3 | 2 | 5min | 1min |
| <1M | 5 | 3 | 1min | 10s |

**When to add nodes:**
- Query latency p99 >1s (capacity)
- Disk usage >80% (storage)
- CPU sustained >70% (compute)
- Planning for HA (minimum 3 nodes)

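
The three numeric triggers above can be folded into a single check. As a sketch, the helper below takes a p99 latency in milliseconds and disk and CPU usage as whole percentages, prints one line per exceeded threshold, and succeeds when at least one is exceeded. The function name `scale_out_reasons` is illustrative, the thresholds are the ones listed above, and the caller is assumed to pass a CPU figure that is already a sustained average rather than a single sample.

```bash
# Sketch: evaluate the runbook's "when to add nodes" thresholds.
# The helper name is hypothetical; inputs are whole numbers.

# scale_out_reasons P99_MS DISK_PCT CPU_PCT
# Prints one line per exceeded threshold.
# Returns 0 if any threshold is exceeded, 1 otherwise.
scale_out_reasons() {
  local p99_ms="$1" disk_pct="$2" cpu_pct="$3" status=1
  if [ "$p99_ms" -gt 1000 ]; then
    echo "capacity: p99 ${p99_ms}ms > 1000ms"
    status=0
  fi
  if [ "$disk_pct" -gt 80 ]; then
    echo "storage: disk ${disk_pct}% > 80%"
    status=0
  fi
  if [ "$cpu_pct" -gt 70 ]; then
    echo "compute: cpu ${cpu_pct}% > 70%"
    status=0
  fi
  return "$status"
}

# Possible usage with figures taken from monitoring:
#   scale_out_reasons 1200 85 40 && echo "consider adding a node"
```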
---

## Related Documentation

- [Three-Node Cluster Architecture](../reference-architecture/three-node-cluster.md) - Deployment guide
- [Network Requirements](../reference-architecture/network-requirements.md) - Firewall rules
- [High Query Latency](./high-query-latency.md) - Shard rebalancing
- [Resource Sizing](../reference-architecture/resource-sizing.md) - Capacity planning

---

## Future Enhancements

**Roadmap P6.3 (Automatic Shard Rebalancing):**
- Auto-detect when new node joins
- Automatically rebalance shards for even distribution
- No manual `shards/rebalance` API calls needed

**Roadmap P6.4 (WAL Archival to S3):**
- Replicate WAL segments to S3 for durability
- Reduce local disk requirements
- Enable faster node replacement (restore from S3)

---

## Last Updated

2026-02-11