stemedb/docs/operations/node-lifecycle.md
jml ae7d2ed8b1 feat(admin): implement stemedb-admin CLI with API contract fixes
2026-02-12 08:23:36 +00:00


Node Lifecycle Operations

This guide covers adding, removing, and replacing nodes in a StemeDB cluster. All procedures use the stemedb-admin CLI tool for cluster inspection and management.

Prerequisites

  • stemedb-admin CLI installed (see install-admin-cli.md)
  • Network access to the gateway node (default: http://gateway:18181)
  • Credentials for cluster operations (planned for Phase 2)

Table of Contents

  1. Add Node Procedure
  2. Remove Node Procedure
  3. Replace Failed Node Procedure
  4. Troubleshooting
  5. Quick Reference: Common Commands

Add Node Procedure

Pre-flight Checks

Before adding a node to the cluster, verify:

  1. Network connectivity: New node can reach existing cluster nodes

    # From new node, test connectivity to gateway
    curl http://gateway:18181/v1/health
    
  2. Port availability: Required ports are not blocked

    # Check ports are open
    nc -zv gateway 18181  # Gateway
    nc -zv gateway 18182  # RPC
    nc -zv gateway 18183  # SWIM gossip
    
  3. Disk space: Adequate storage for shard replicas

    df -h /var/lib/stemedb
    # Recommendation: at least 100 GB free per node
    
  4. Configuration: Node config matches cluster settings

    cat /etc/stemedb/node.toml
    # Verify: cluster_name, seed_nodes, port settings
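
The disk-space check above can be turned into a pass/fail test for a provisioning script. This is a minimal sketch: `has_free_space` is a hypothetical helper name, not part of StemeDB, and it treats the 100 GB recommendation as 100 GiB for simplicity.

```shell
#!/bin/sh
# has_free_space MIN_GB PATH
# Hypothetical helper (not a StemeDB command).
# Succeeds when the filesystem holding PATH has at least MIN_GB GiB available.
has_free_space() {
    min_gb=$1
    path=$2
    # df -Pk prints POSIX-format output in 1024-byte blocks;
    # column 4 of the second line is the available space.
    avail_kb=$(df -Pk "$path" | awk 'NR == 2 { print $4 }')
    # Fail with a distinct status if df produced nothing (e.g. path missing).
    [ -n "$avail_kb" ] || return 2
    [ "$avail_kb" -ge $((min_gb * 1024 * 1024)) ]
}

# Example, using the path and threshold from the checklist:
#   has_free_space 100 /var/lib/stemedb || echo "less than 100 GB free" >&2
```

Returning an exit status rather than printing a number lets the check slot directly into `if` or `||` chains alongside the `curl` and `nc` checks.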
    

Add Node Steps

  1. Start the new node with seed node addresses:

    stemedb-node \
      --node-id $(uuidgen) \
      --seed-nodes gateway:18183,node-02:18183 \
      --data-dir /var/lib/stemedb
    
  2. Verify node joined the cluster:

    stemedb-admin node list
    

    Expected output:

    NODES
    ┌──────────┬────────┬──────────┬───────────┬──────────┐
    │ Node ID  │ State  │ Shards   │ Leader    │ Follower │
    ├──────────┼────────┼──────────┼───────────┼──────────┤
    │ a3f2b1c4 │ Alive  │ 10,15,22 │ -         │ -        │
    │ 7d8e9f0a │ Alive  │ 5,12,18  │ -         │ -        │
    │ NEW_NODE │ Alive  │          │ -         │ -        │  ← New node appears
    └──────────┴────────┴──────────┴───────────┴──────────┘
    
  3. Wait for shard assignment (typically 30-60 seconds):

    # Watch for shards to be assigned
    watch -n 5 'stemedb-admin node NEW_NODE shards'
    
  4. Verify shard replication:

    stemedb-admin node NEW_NODE shards
    # Check that shards are being replicated (size_bytes > 0)
    
  5. Check cluster health:

    stemedb-admin cluster health
    # Expected: ✓ Cluster is healthy
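
Steps 3 to 5 all involve waiting for a condition, so a bounded polling loop can replace watching the output by hand. The sketch below is one way to do that. `wait_until` is a hypothetical helper, and the usage example assumes that `stemedb-admin cluster health` exits non-zero when the cluster is unhealthy, which this guide does not confirm.

```shell
#!/bin/sh
# wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Hypothetical helper: re-runs COMMAND until it exits 0,
# or gives up once TIMEOUT_SECONDS of sleeping have accumulated.
wait_until() {
    timeout=$1
    interval=$2
    shift 2
    elapsed=0
    while ! "$@"; do
        if [ "$elapsed" -ge "$timeout" ]; then
            return 1    # condition never became true
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    return 0
}

# Example (assumes a non-zero exit status means "unhealthy"):
#   wait_until 300 5 stemedb-admin cluster health \
#       || echo "cluster not healthy after 5 minutes" >&2
```

Unlike `watch`, this form terminates on its own, so it can gate the next step of an automated rollout.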
    

Post-Add Validation

  • Node appears in stemedb-admin node list with Alive state
  • Node has been assigned shards (may take 1-2 minutes)
  • Cluster health check passes
  • Node logs show successful replication (no errors)

Timeline: 2-5 minutes for full node addition and initial replication.
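
The first validation item can be checked in a script by reading the State column from the table output. The following sketch assumes the table layout shown in the expected output earlier in this section (columns separated by `│`, node ID first, state second). `node_state` is a hypothetical helper, not a StemeDB command.

```shell
#!/bin/sh
# node_state NODE_ID
# Hypothetical helper: reads `stemedb-admin node list` table output on stdin
# and prints the State column for NODE_ID. Exits 1 if the node is not listed.
node_state() {
    awk -F'│' -v id="$1" '
        NF >= 3 {
            node = $2
            state = $3
            # Strip the padding around each cell.
            gsub(/^[ \t]+|[ \t]+$/, "", node)
            gsub(/^[ \t]+|[ \t]+$/, "", state)
            if (node == id) { print state; found = 1; exit }
        }
        END { exit (found ? 0 : 1) }
    '
}

# Example:
#   [ "$(stemedb-admin node list | node_state a3f2b1c4)" = "Alive" ]
```

Border rows contain no `│` separator, so they are skipped by the `NF >= 3` guard without special handling.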


Remove Node Procedure

Pre-removal Checks

  1. Check node is not critical for quorum:

    stemedb-admin cluster status
    # Verify: at least 3 nodes remain after removal (node_count >= 4 before removing one)
    
  2. Identify which shards will be affected:

    stemedb-admin node NODE_ID shards
    # Record: leader shards (need failover), follower shards (need replication)
    
  3. Check if node is leader for critical shards:

    stemedb-admin node NODE_ID shards --leader
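
The node-count check in step 1 is simple arithmetic, but it is easy to get off by one. The sketch below makes the rule explicit, assuming the 3-node minimum applies to the cluster after the node is gone. `can_remove_node` is a hypothetical helper, and the node count is read manually from `stemedb-admin cluster status`.

```shell
#!/bin/sh
# can_remove_node NODE_COUNT [MIN_NODES]
# Hypothetical helper: succeeds when removing one node still leaves
# at least MIN_NODES (default 3) in the cluster.
can_remove_node() {
    node_count=$1
    min_nodes=${2:-3}
    [ $((node_count - 1)) -ge "$min_nodes" ]
}

# Example: with 4 nodes, removal leaves 3, so this succeeds.
#   can_remove_node 4 || echo "removal would drop below the minimum" >&2
```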
    

Remove Node Steps

Phase 2 Feature: Graceful node removal with stemedb-admin node NODE_ID drain is planned but not yet implemented. Until then, stop the node and monitor failover manually using the steps below.

  1. Stop the node gracefully:

    # On the node being removed
    systemctl stop stemedb-node
    
  2. Wait for node to transition to Dead state (30-60 seconds):

    watch -n 5 'stemedb-admin node list'
    # Wait for state to change: Alive → Suspect → Dead
    
  3. Verify leader election for affected shards:

    # For each leader shard the removed node owned
    stemedb-admin shard SHARD_ID info
    # Check: leader_node is now a different node
    
  4. Monitor shard rebalancing:

    stemedb-admin cluster status
    # Watch shard_count stabilize across remaining nodes
    
  5. Verify cluster health:

    stemedb-admin cluster health
    # Expected: ✓ Cluster is healthy
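
Step 3 has to be repeated for every shard the removed node led. If the shard IDs were recorded as a comma-separated list (the format of the Shards column in `stemedb-admin node list`), a small loop avoids retyping the command. `for_each_shard` and `show_shard` are hypothetical names used only for this sketch.

```shell
#!/bin/sh
# for_each_shard SHARD_LIST COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND [ARGS...] SHARD_ID once for each ID
# in a comma-separated list such as "10,15,22".
for_each_shard() {
    shard_list=$1
    shift
    # Require a command to run.
    [ "$#" -ge 1 ] || return 2
    # Shard IDs are assumed to contain no spaces or glob characters,
    # so plain word splitting on the converted list is safe.
    for shard_id in $(printf '%s' "$shard_list" | tr ',' ' '); do
        "$@" "$shard_id"
    done
}

# Example, wrapping the shard info command from step 3:
#   show_shard() { stemedb-admin shard "$1" info; }
#   for_each_shard "10,15,22" show_shard
```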
    

Post-Removal Validation

  • Node shows Dead state in stemedb-admin node list
  • All shards previously led by removed node have new leaders
  • Cluster health check passes
  • Remaining nodes have picked up replica duties

Timeline: 2-5 minutes for failover and rebalancing.


Replace Failed Node Procedure

When a node fails unexpectedly (hardware failure, network partition, etc.), follow this procedure to replace it.

Confirm Failure

  1. Verify node is truly dead:

    stemedb-admin node NODE_ID info
    # Expected: State: Dead
    
  2. Identify affected shards:

    stemedb-admin node NODE_ID shards
    # Record which shards were on the failed node
    
  3. Check leader failover status:

    # For each shard
    stemedb-admin shard SHARD_ID info
    # Verify: leader_node is NOT the dead node
    

Replace Failed Node

  1. Provision replacement node with same configuration:

    # Use original node config, but generate new node-id
    stemedb-node \
      --node-id $(uuidgen) \
      --seed-nodes gateway:18183,node-02:18183 \
      --data-dir /var/lib/stemedb
    
  2. Verify replacement node joins cluster:

    stemedb-admin node list
    # Check new node appears with Alive state
    
  3. Monitor replica recovery:

    # Watch shards being assigned to replacement
    watch -n 10 'stemedb-admin node NEW_NODE_ID shards'
    
  4. Verify data replication:

    stemedb-admin node NEW_NODE_ID shards
    # Check size_bytes matches expected values
    
  5. Remove dead node from member list (Phase 2 feature):

    # Planned: stemedb-admin node DEAD_NODE_ID remove
    # Current: Dead nodes age out of membership after timeout
    

Post-Replacement Validation

  • Replacement node is Alive and has shards assigned
  • All previously affected shards have proper replication factor
  • Cluster health check passes
  • No ongoing replication errors in logs

Timeline: 5-10 minutes for full replacement and data sync.


Troubleshooting

Node Stuck in Suspect State

Symptom: Node shows Suspect state for an extended period (>2 minutes).

Possible Causes:

  • Network latency spikes
  • Node under heavy load (CPU/disk saturation)
  • SWIM gossip port blocked (18183)

Diagnosis:

# Check network latency
ping -c 10 node-hostname

# Check node resource usage
ssh node-hostname 'top -bn1 | head -20'

# Check SWIM port
nc -zv node-hostname 18183

Resolution:

  1. If the cause is a network issue: fix the network; the node will transition back to Alive
  2. If the cause is resource exhaustion: scale up node resources or reduce load
  3. If the problem persists: consider replacing the node (see Replace Failed Node Procedure)
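
To judge whether latency is a plausible cause, the average round-trip time can be pulled from the `ping` summary line and compared against a limit. This sketch assumes the common `min/avg/max/...` summary format printed by Linux and BSD `ping`. The helper names and the 50 ms limit are illustrative; this guide does not specify a latency threshold.

```shell
#!/bin/sh
# avg_rtt_ms
# Hypothetical helper: reads `ping` output on stdin and prints the average
# round-trip time in milliseconds. Exits 1 if no summary line is found.
avg_rtt_ms() {
    # Summary line looks like:
    #   rtt min/avg/max/mdev = 0.045/0.057/0.071/0.010 ms
    # Splitting on "/" puts the average in field 5.
    awk -F'/' '
        /min\/avg\/max/ { print $5; found = 1 }
        END { exit (found ? 0 : 1) }
    '
}

# rtt_above RTT_MS LIMIT_MS
# Succeeds when RTT_MS is strictly greater than LIMIT_MS (floating point).
rtt_above() {
    awk -v rtt="$1" -v limit="$2" 'BEGIN { exit ((rtt + 0 > limit + 0) ? 0 : 1) }'
}

# Example (50 ms is an arbitrary illustrative limit):
#   rtt=$(ping -c 10 node-hostname | avg_rtt_ms) && rtt_above "$rtt" 50 \
#       && echo "high latency to node-hostname: ${rtt} ms" >&2
```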

Shard Leader Election Issues

Symptom: Shard has no leader after node failure.

Diagnosis:

stemedb-admin shard SHARD_ID info
# Check: leader_node field

Resolution:

  1. Check replica nodes are alive:

    stemedb-admin node list
    # Verify replica nodes show Alive state
    
  2. Check logs for election failures:

    # On gateway node
    journalctl -u stemedb-gateway | grep -E "election|leader"
    
  3. If stuck, trigger a manual leader election (Phase 2):

    # Planned: stemedb-admin shard SHARD_ID elect-leader
    

Network Partition Scenarios

Symptom: The cluster splits into multiple segments, and nodes in each segment see the nodes in other segments as Dead.

Diagnosis:

# From a node in each segment
stemedb-admin cluster status
# Compare node counts and health status

Resolution:

  1. Restore network connectivity between segments

  2. Wait for SWIM to reconcile (30-60 seconds after connectivity restored)

  3. Verify cluster converges:

    stemedb-admin node list
    # All nodes should show Alive after partition heals
    
  4. Check for data divergence (Phase 2):

    # A forced anti-entropy sync is planned but not yet implemented
    # Planned: stemedb-admin cluster sync --force
    

Important: During a partition, writes may be accepted in multiple segments. After the partition heals, conflicts are resolved via lenses (Recency, Consensus, Authority).

Shard Rebalancing Not Occurring

Symptom: New node added but no shards assigned after 5+ minutes.

Diagnosis:

# Check cluster meta version is advancing
stemedb-admin cluster status
# meta_version should increment when topology changes

# Check gateway logs
journalctl -u stemedb-gateway | grep -E "rebalance|assign"

Resolution:

  1. Verify node is truly Alive:

    stemedb-admin node NEW_NODE_ID info
    
  2. Check node has adequate disk space:

    # Use the new node's hostname, not its node ID
    ssh node-hostname 'df -h /var/lib/stemedb'
    
  3. Trigger manual rebalance (Phase 2):

    # Planned: stemedb-admin shard rebalance --target-node NEW_NODE_ID
    

Quick Reference: Common Commands

# Check cluster health
stemedb-admin cluster health

# List all nodes
stemedb-admin node list

# Show node details
stemedb-admin node NODE_ID info

# Show shards on a node
stemedb-admin node NODE_ID shards

# List all shards
stemedb-admin shard list

# Show shard details
stemedb-admin shard SHARD_ID info

# Export debug state
stemedb-admin debug export --output cluster-state.json
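
When exporting debug state more than once, for example before and after a node replacement, timestamped file names keep the snapshots from overwriting each other. This is a small sketch; `snapshot_name` is a hypothetical helper, and only the `debug export --output` command comes from this guide.

```shell
#!/bin/sh
# snapshot_name [PREFIX]
# Hypothetical helper: prints a file name of the form
# PREFIX-YYYYMMDDTHHMMSSZ.json using the current UTC time.
snapshot_name() {
    printf '%s-%s.json\n' "${1:-cluster-state}" "$(date -u +%Y%m%dT%H%M%SZ)"
}

# Example:
#   stemedb-admin debug export --output "$(snapshot_name before-replace)"
```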