
# Node Lifecycle Operations

This guide covers adding, removing, and replacing nodes in a StemeDB cluster. All procedures use the `stemedb-admin` CLI tool for cluster inspection and management.

## Prerequisites

- `stemedb-admin` CLI installed (see [install-admin-cli.md](deployment/install-admin-cli.md))
- Network access to the gateway node (default: `http://gateway:18181`)
- Credentials for cluster operations (a Phase 2 requirement)

## Table of Contents

1. [Add Node Procedure](#add-node-procedure)
2. [Remove Node Procedure](#remove-node-procedure)
3. [Replace Failed Node Procedure](#replace-failed-node-procedure)
4. [Troubleshooting](#troubleshooting)

---

## Add Node Procedure

### Pre-flight Checks

Before adding a node to the cluster, verify:

1. **Network connectivity**: New node can reach existing cluster nodes
   ```bash
   # From new node, test connectivity to gateway
   curl http://gateway:18181/v1/health
   ```

2. **Port availability**: Required ports are not blocked
   ```bash
   # Check ports are open
   nc -zv gateway 18181  # Gateway
   nc -zv gateway 18182  # RPC
   nc -zv gateway 18183  # SWIM gossip (TCP probe; add -u if gossip runs over UDP)
   ```

3. **Disk space**: Adequate storage for shard replicas
   ```bash
   df -h /var/lib/stemedb
   # Recommendation: At least 100GB free per node
   ```

4. **Configuration**: Node config matches cluster settings
   ```bash
   cat /etc/stemedb/node.toml
   # Verify: cluster_name, seed_nodes, port settings
   ```
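The disk-space check can be scripted so that it fails explicitly instead of relying on a visual read of `df -h`. The following is a minimal sketch: `has_free_space` is a hypothetical helper (not part of `stemedb-admin`), and it interprets the 100GB recommendation as 100 GiB.

```bash
# has_free_space DIR MIN_GB
# Succeeds (exit 0) if the filesystem holding DIR has at least MIN_GB GiB available.
# Hypothetical helper for illustration; not part of stemedb-admin.
has_free_space() {
  local dir min_gb avail_kb
  dir=$1
  min_gb=$2
  # -P forces single-line POSIX output and -k reports 1024-byte blocks,
  # so column 4 of the second line is the available space in KiB.
  avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
  [ "$avail_kb" -ge $((min_gb * 1024 * 1024)) ]
}

# Usage:
#   has_free_space /var/lib/stemedb 100 || echo "less than 100 GiB free" >&2
```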

### Add Node Steps

1. **Start the new node** with seed node addresses:
   ```bash
   stemedb-node \
     --node-id $(uuidgen) \
     --seed-nodes gateway:18183,node-02:18183 \
     --data-dir /var/lib/stemedb
   ```

2. **Verify node joined the cluster**:
   ```bash
   stemedb-admin node list
   ```

   Expected output:
   ```
   NODES
   ┌──────────┬────────┬──────────┬───────────┬──────────┐
   │ Node ID  │ State  │ Shards   │ Leader    │ Follower │
   ├──────────┼────────┼──────────┼───────────┼──────────┤
   │ a3f2b1c4 │ Alive  │ 10,15,22 │ -         │ -        │
   │ 7d8e9f0a │ Alive  │ 5,12,18  │ -         │ -        │
   │ NEW_NODE │ Alive  │          │ -         │ -        │  ← New node appears
   └──────────┴────────┴──────────┴───────────┴──────────┘
   ```

3. **Wait for shard assignment** (typically 30-60 seconds):
   ```bash
   # Watch for shards to be assigned
   watch -n 5 'stemedb-admin node NEW_NODE_ID shards'
   ```

4. **Verify shard replication**:
   ```bash
   stemedb-admin node NEW_NODE_ID shards
   # Check that shards are being replicated (size_bytes > 0)
   ```

5. **Check cluster health**:
   ```bash
   stemedb-admin cluster health
   # Expected: ✓ Cluster is healthy
   ```
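`watch` works for interactive use, but in an automation script a bounded wait is usually easier to handle. The sketch below shows one way to poll with a timeout. `wait_until` is a hypothetical helper, and it only works if the command you pass exits non-zero until the condition holds. Whether `stemedb-admin cluster health` sets its exit status that way is an assumption not confirmed by this guide.

```bash
# wait_until TIMEOUT_SECS INTERVAL_SECS COMMAND [ARGS...]
# Re-runs COMMAND every INTERVAL_SECS until it succeeds or TIMEOUT_SECS elapse.
# Returns 0 on success and 1 on timeout. Hypothetical helper, not a stemedb-admin feature.
wait_until() {
  local timeout interval waited
  timeout=$1
  interval=$2
  waited=0
  shift 2
  until "$@"; do
    # Give up once the accumulated sleep time reaches the timeout.
    if [ "$waited" -ge "$timeout" ]; then
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  return 0
}

# Usage (assumes `cluster health` exits non-zero while the cluster is unhealthy):
#   wait_until 300 5 stemedb-admin cluster health || echo "cluster not healthy after 5 minutes" >&2
```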

### Post-Add Validation

- [ ] Node appears in `stemedb-admin node list` with `Alive` state
- [ ] Node has been assigned shards (may take 1-2 minutes)
- [ ] Cluster health check passes
- [ ] Node logs show successful replication (no errors)

**Timeline**: 2-5 minutes for full node addition and initial replication.

---

## Remove Node Procedure

### Pre-removal Checks

1. **Check node is not critical for quorum**:
   ```bash
   stemedb-admin cluster status
   # Verify: node_count >= 4, so that at least 3 nodes remain after removal (3-node minimum)
   ```

2. **Identify which shards will be affected**:
   ```bash
   stemedb-admin node NODE_ID shards
   # Record: leader shards (need failover), follower shards (need replication)
   ```

3. **Check if node is leader for critical shards**:
   ```bash
   stemedb-admin node NODE_ID shards --leader
   ```
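The quorum check in step 1 is simple arithmetic on the node count reported by `stemedb-admin cluster status`. A minimal sketch, assuming the 3-node minimum must still hold after the node is gone; `can_remove_node` is a hypothetical helper, and the node count is passed in by hand because the output format of `cluster status` is not shown here.

```bash
# can_remove_node NODE_COUNT [MIN_NODES]
# Succeeds if removing one node still leaves at least MIN_NODES (default 3).
# Hypothetical helper for illustration.
can_remove_node() {
  local count min
  count=$1
  min=${2:-3}
  [ $((count - 1)) -ge "$min" ]
}

# Usage:
#   can_remove_node 4 && echo "safe to remove one node"
#   can_remove_node 3 || echo "removal would drop the cluster below the minimum" >&2
```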

### Remove Node Steps

**Phase 2 Feature**: Graceful node removal with `stemedb-admin node NODE_ID drain` is planned but not yet implemented. Until then, removal means stopping the node and monitoring failover and rebalancing manually.

1. **Stop the node gracefully**:
   ```bash
   # On the node being removed
   systemctl stop stemedb-node
   ```

2. **Wait for node to transition to Dead state** (30-60 seconds):
   ```bash
   watch -n 5 'stemedb-admin node list'
   # Wait for state to change: Alive → Suspect → Dead
   ```

3. **Verify leader election for affected shards**:
   ```bash
   # For each leader shard the removed node owned
   stemedb-admin shard SHARD_ID info
   # Check: leader_node is now a different node
   ```

4. **Monitor shard rebalancing**:
   ```bash
   stemedb-admin cluster status
   # Watch shard_count stabilize across remaining nodes
   ```

5. **Verify cluster health**:
   ```bash
   stemedb-admin cluster health
   # Expected: ✓ Cluster is healthy
   ```
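To script the wait for the `Dead` state, the State column can be extracted from the node table. This sketch assumes `stemedb-admin node list` prints the box-drawn table shown in the Add Node section (Node ID in the first column, State in the second). `node_state` is a hypothetical helper and would need adjusting if the real output differs.

```bash
# node_state NODE_ID
# Reads a `stemedb-admin node list` table on stdin and prints the State cell
# of the row whose Node ID matches. Exits 1 if the node is not listed.
# Assumes the table layout shown in this guide; hypothetical helper.
node_state() {
  awk -F'│' -v id="$1" '
    NF >= 3 {
      gsub(/^[ \t]+|[ \t]+$/, "", $2)   # trim the Node ID cell
      gsub(/^[ \t]+|[ \t]+$/, "", $3)   # trim the State cell
      if ($2 == id) { print $3; found = 1 }
    }
    END { exit (found ? 0 : 1) }
  '
}

# Usage (polls every 5 seconds until the node is reported Dead):
#   until [ "$(stemedb-admin node list | node_state NODE_ID)" = "Dead" ]; do sleep 5; done
```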

### Post-Removal Validation

- [ ] Node shows `Dead` state in `stemedb-admin node list`
- [ ] All shards previously led by removed node have new leaders
- [ ] Cluster health check passes
- [ ] Remaining nodes have picked up replica duties

**Timeline**: 2-5 minutes for failover and rebalancing.

---

## Replace Failed Node Procedure

When a node fails unexpectedly (hardware failure, network partition, etc.), follow this procedure to replace it.

### Confirm Failure

1. **Verify node is truly dead**:
   ```bash
   stemedb-admin node NODE_ID info
   # Expected: State: Dead
   ```

2. **Identify affected shards**:
   ```bash
   stemedb-admin node NODE_ID shards
   # Record which shards were on the failed node
   ```

3. **Check leader failover status**:
   ```bash
   # For each shard
   stemedb-admin shard SHARD_ID info
   # Verify: leader_node is NOT the dead node
   ```
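When many shards are affected, checking each one by eye is error-prone. The sketch below filters a list of shard/leader pairs for shards still led by the dead node. The input format (`SHARD_ID LEADER_NODE`, one pair per line) is a hypothetical intermediate format: this guide does not show the output of `stemedb-admin shard SHARD_ID info`, so how you collect the pairs is left to you.

```bash
# shards_led_by NODE_ID
# Reads "SHARD_ID LEADER_NODE" pairs on stdin, one per line, and prints the
# shard IDs whose leader is NODE_ID. Like grep, exits 0 if any matched, 1 if none.
# The input format is an assumption for illustration, not stemedb-admin output.
shards_led_by() {
  awk -v node="$1" '
    $2 == node { print $1; n++ }
    END { exit (n > 0 ? 0 : 1) }
  '
}

# Usage:
#   if shards_led_by DEAD_NODE_ID < leaders.txt; then
#     echo "failover incomplete for the shards listed above" >&2
#   fi
```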

### Replace Failed Node

1. **Provision replacement node** with same configuration:
   ```bash
   # Use original node config, but generate new node-id
   stemedb-node \
     --node-id $(uuidgen) \
     --seed-nodes gateway:18183,node-02:18183 \
     --data-dir /var/lib/stemedb
   ```

2. **Verify replacement node joins cluster**:
   ```bash
   stemedb-admin node list
   # Check new node appears with Alive state
   ```

3. **Monitor replica recovery**:
   ```bash
   # Watch shards being assigned to replacement
   watch -n 10 'stemedb-admin node NEW_NODE_ID shards'
   ```

4. **Verify data replication**:
   ```bash
   stemedb-admin node NEW_NODE_ID shards
   # Check size_bytes matches expected values
   ```

5. **Remove dead node from member list** (Phase 2 feature):
   ```bash
   # Planned: stemedb-admin node DEAD_NODE_ID remove
   # Current: Dead nodes age out of membership after timeout
   ```

### Post-Replacement Validation

- [ ] Replacement node is `Alive` and has shards assigned
- [ ] All previously affected shards have proper replication factor
- [ ] Cluster health check passes
- [ ] No ongoing replication errors in logs

**Timeline**: 5-10 minutes for full replacement and data sync.

---

## Troubleshooting

### Node Stuck in Suspect State

**Symptom**: Node shows `Suspect` state for an extended period (>2 minutes).

**Possible Causes**:
- Network latency spikes
- Node under heavy load (CPU/disk saturation)
- SWIM gossip port blocked (18183)

**Diagnosis**:
```bash
# Check network latency
ping -c 10 node-hostname

# Check node resource usage
ssh node-hostname 'top -bn1 | head -20'

# Check SWIM port
nc -zv node-hostname 18183
```
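For repeated checks it can help to reduce the `ping` output to a single number. A minimal sketch that extracts the packet-loss percentage from the summary line printed by common `ping` implementations; `packet_loss_pct` is a hypothetical helper, and the loss level at which a node becomes `Suspect` depends on SWIM settings not covered in this guide.

```bash
# packet_loss_pct
# Reads `ping` output on stdin and prints the packet-loss percentage from the
# summary line (for example "0" or "20"). Prints nothing if no summary is found.
# Hypothetical helper for illustration.
packet_loss_pct() {
  sed -n 's/.*[ ,]\([0-9][0-9.]*\)% packet loss.*/\1/p'
}

# Usage:
#   loss=$(ping -c 10 node-hostname | packet_loss_pct)
#   echo "packet loss to node-hostname: ${loss}%"
```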

**Resolution**:
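1. If the cause is the network: restore connectivity; the node transitions back to `Alive`
2. If the cause is resource exhaustion: scale up the node's resources or reduce its load
3. If the state persists: consider replacing the node (see [Replace Failed Node Procedure](#replace-failed-node-procedure))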

### Shard Leader Election Issues

**Symptom**: Shard has no leader after node failure.

**Diagnosis**:
```bash
stemedb-admin shard SHARD_ID info
# Check: leader_node field
```

**Resolution**:
1. Check replica nodes are alive:
   ```bash
   stemedb-admin node list
   # Verify replica nodes show Alive state
   ```

2. Check logs for election failures:
   ```bash
   # On gateway node
   journalctl -u stemedb-gateway | grep -E "election|leader"
   ```

3. If stuck, trigger a manual leader election (Phase 2):
   ```bash
   # Planned: stemedb-admin shard SHARD_ID elect-leader
   ```

### Network Partition Scenarios

**Symptom**: Cluster split into multiple segments, nodes in each segment see others as `Dead`.

**Diagnosis**:
```bash
# On one node in each segment
stemedb-admin cluster status
# Compare node counts and health status
```

**Resolution**:
1. **Restore network connectivity** between segments
2. **Wait for SWIM to reconcile** (30-60 seconds after connectivity restored)
3. **Verify cluster converges**:
   ```bash
   stemedb-admin node list
   # All nodes should show Alive after partition heals
   ```

4. **Check for data divergence** (Phase 2 feature):
   ```bash
   # Trigger anti-entropy sync
   # Planned: stemedb-admin cluster sync --force
   ```

**Important**: During a partition, writes may be accepted in more than one segment. After the partition heals, conflicting writes are resolved by lenses (Recency, Consensus, Authority).

### Shard Rebalancing Not Occurring

**Symptom**: New node added but no shards assigned after 5+ minutes.

**Diagnosis**:
```bash
# Check cluster meta version is advancing
stemedb-admin cluster status
# meta_version should increment when topology changes

# Check gateway logs
journalctl -u stemedb-gateway | grep -E "rebalance|assign"
```

**Resolution**:
1. Verify node is truly `Alive`:
   ```bash
   stemedb-admin node NEW_NODE_ID info
   ```

2. Check node has adequate disk space:
   ```bash
   ssh NEW_NODE_HOST 'df -h /var/lib/stemedb'  # NEW_NODE_HOST: hostname or IP of the new node
   ```

3. Trigger manual rebalance (Phase 2):
   ```bash
   # Planned: stemedb-admin shard rebalance --target-node NEW_NODE_ID
   ```

---

## Quick Reference: Common Commands

```bash
# Check cluster health
stemedb-admin cluster health

# List all nodes
stemedb-admin node list

# Show node details
stemedb-admin node NODE_ID info

# Show shards on a node
stemedb-admin node NODE_ID shards

# List all shards
stemedb-admin shard list

# Show shard details
stemedb-admin shard SHARD_ID info

# Export debug state
stemedb-admin debug export --output cluster-state.json
```
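Before and after any lifecycle operation it can be useful to capture the read-only views in one step, so they can be compared afterwards. A minimal sketch using only commands listed above; `cluster_snapshot` and the `STEMEDB_ADMIN` override variable are hypothetical conveniences, not features of the CLI.

```bash
# cluster_snapshot OUT_DIR
# Saves the output of three read-only commands into OUT_DIR, one file each.
# STEMEDB_ADMIN (hypothetical variable) can point at a different binary or a
# stub for dry runs; it defaults to stemedb-admin on the PATH.
cluster_snapshot() {
  local out admin
  out=$1
  admin=${STEMEDB_ADMIN:-stemedb-admin}
  mkdir -p "$out" || return 1
  "$admin" cluster health > "$out/cluster-health.txt" 2>&1
  "$admin" node list      > "$out/node-list.txt" 2>&1
  "$admin" shard list     > "$out/shard-list.txt" 2>&1
}

# Usage:
#   cluster_snapshot "snapshots/$(date +%Y%m%dT%H%M%S)"
```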

---

## Related Documentation

- [Three-Node Cluster Setup](deployment/three-node-cluster.md)
- [Install Admin CLI](deployment/install-admin-cli.md)
- [Monitoring & Observability](monitoring/README.md)
- [Disaster Recovery](disaster-recovery/README.md)