tidaldb/docs/planning/milestone-8/phase-5/task-05-rolling-upgrade.md

# Task 05: RollingUpgradeCoordinator

## Delivers

`RollingUpgradeCoordinator` in `tidal/src/replication/upgrade.rs`. Upgrades nodes one at a time using a drain → upgrade → rejoin sequence. Uses WAL shipping to keep the remaining followers current during the upgrade window. Query availability is intended to stay at 100%: the coordinator rejects any drain that would leave no serving node, so at least one node is serving during every upgrade step.
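The drain → upgrade → rejoin sequence can be sketched as a simple driver loop. This is a minimal, synchronous sketch and not the planned implementation: the `UpgradeSteps` trait, the `rolling_upgrade` function, the `RecordingSteps` test double, and the use of `u32` as a node identifier are all hypothetical names introduced for illustration. In the real design, `drain` and `rejoin` would presumably map onto the async coordinator methods, and `upgrade` is performed by the caller.

```rust
/// Hypothetical per-node operations. `upgrade` stands in for the external
/// upgrade process, which is outside the coordinator's scope.
pub trait UpgradeSteps {
    fn drain(&mut self, node: u32) -> Result<(), String>;
    fn upgrade(&mut self, node: u32) -> Result<(), String>;
    fn rejoin(&mut self, node: u32) -> Result<(), String>;
}

/// Upgrades `nodes` strictly one at a time and stops at the first failure,
/// so at most one node is out of rotation at any moment.
/// Returns the nodes that completed all three steps, or the failing node
/// together with its error.
pub fn rolling_upgrade<S: UpgradeSteps>(
    steps: &mut S,
    nodes: &[u32],
) -> Result<Vec<u32>, (u32, String)> {
    let mut done = Vec::with_capacity(nodes.len());
    for &node in nodes {
        steps.drain(node).map_err(|e| (node, e))?;
        steps.upgrade(node).map_err(|e| (node, e))?;
        steps.rejoin(node).map_err(|e| (node, e))?;
        done.push(node);
    }
    Ok(done)
}

/// Test double that records the order of calls and can fail one upgrade.
#[derive(Default)]
pub struct RecordingSteps {
    pub log: Vec<String>,
    pub fail_upgrade_on: Option<u32>,
}

impl UpgradeSteps for RecordingSteps {
    fn drain(&mut self, node: u32) -> Result<(), String> {
        self.log.push(format!("drain {node}"));
        Ok(())
    }

    fn upgrade(&mut self, node: u32) -> Result<(), String> {
        self.log.push(format!("upgrade {node}"));
        if self.fail_upgrade_on == Some(node) {
            return Err("upgrade failed".to_string());
        }
        Ok(())
    }

    fn rejoin(&mut self, node: u32) -> Result<(), String> {
        self.log.push(format!("rejoin {node}"));
        Ok(())
    }
}
```

Stopping at the first failure is a deliberate choice in this sketch: a node that fails mid-upgrade stays out of rotation, and continuing to the next node would take a second node out of service at the same time.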

## Complexity: M

## Dependencies

- Task 03 (ControlPlane)
- Phase 8.2, Task 03 (WalShipper)
- Phase 8.2, Task 05 (FollowerDb / NodeRole)

## Technical Design

```rust
// tidal/src/replication/upgrade.rs

use std::collections::HashSet;
use std::sync::{Arc, Mutex};
use std::time::{Duration, Instant};

/// Coordinates a rolling upgrade across all nodes in a cluster.
///
/// Protocol (per node):
///   1. `drain(node)` -- stop routing new writes to the target node;
///      let in-flight operations complete; verify replication lag = 0.
///   2. Caller performs the upgrade (outside this coordinator's scope).
///   3. `rejoin(node)` -- re-enable routing to the upgraded node;
///      verify it can process new WAL segments.
///
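/// When nodes are upgraded one at a time, (N-1) of N nodes keep serving queries.
/// `drain` itself only enforces the weaker bound of at least 1 serving node.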
pub struct RollingUpgradeCoordinator {
    control_plane: Arc<ControlPlane>,
    wal_shipper: Arc<WalShipper>,
    /// Nodes currently in the "draining" state (not routing new writes).
    drained_nodes: Mutex<HashSet<ShardId>>,
}

/// Status of a single node's upgrade step.
#[derive(Debug, Clone, PartialEq, Eq, serde::Serialize, serde::Deserialize)]
pub enum NodeUpgradeStatus {
    Pending,
    Draining,
    Drained,         // ready for upgrade
    Upgrading,       // external process is upgrading the node
    Rejoining,       // node is catching up from WAL
    Complete,
    Failed { reason: String },
}

impl RollingUpgradeCoordinator {
    pub fn new(
        control_plane: Arc<ControlPlane>,
        wal_shipper: Arc<WalShipper>,
    ) -> Self {
        Self {
            control_plane,
            wal_shipper,
            drained_nodes: Mutex::new(HashSet::new()),
        }
    }

    /// Drain a node: stop routing writes to it, then wait for replication lag = 0.
    ///
    /// Fails if draining this node would leave fewer than 1 serving node.
    pub async fn drain(&self, target_shard: ShardId) -> Result<()> {
        let total_nodes = self.control_plane.topology().shards.len();

        // Check and mark under a single lock so two concurrent drains cannot
        // both pass the safety check. The guard is scoped to this block so it
        // is released before the `.await` below.
        {
            let mut drained = self.drained_nodes.lock().unwrap();
            // Count the target only if it is not already in the drained set.
            let drained_after =
                drained.len() + usize::from(!drained.contains(&target_shard));
            if drained_after >= total_nodes {
                return Err(TidalError::InvalidState(
                    "cannot drain: would leave no serving nodes".into()
                ));
            }
            // Mark as draining: routing layer stops sending new writes here.
            drained.insert(target_shard);
        }

        // Wait for replication lag to reach 0 (target has all events).
        self.await_zero_lag(target_shard, Duration::from_secs(30)).await?;

        Ok(())
    }

    /// Rejoin a (newly upgraded) node: ship missing WAL segments, then re-enable routing.
    ///
    /// The upgraded node may have missed WAL segments during its downtime.
    /// We ship those segments before re-enabling routing.
    pub async fn rejoin(&self, target_shard: ShardId) -> Result<()> {
        // Get the node's current applied seqno (via its reported stats).
        let follower_seqno = self.control_plane
            .shard_stats(target_shard)
            .map(|s| s.applied_seqno)
            .unwrap_or(0);

        // Ship missed segments.
        self.wal_shipper
            .ship_segments_since(target_shard, follower_seqno)
            .await?;

        // Wait for the node to apply all shipped segments.
        self.await_zero_lag(target_shard, Duration::from_secs(60)).await?;

        // Re-enable routing to this node.
        self.drained_nodes.lock().unwrap().remove(&target_shard);

        Ok(())
    }

    /// Returns `true` if `shard_id` is currently drained (not receiving writes).
    pub fn is_drained(&self, shard_id: ShardId) -> bool {
        self.drained_nodes.lock().unwrap().contains(&shard_id)
    }

    /// Wait until the replication lag for `target_shard` reaches 0.
    ///
    /// Polls the `ReplicationLagGauge` every 100ms. Times out after `timeout`.
    async fn await_zero_lag(
        &self,
        target_shard: ShardId,
        timeout: Duration,
    ) -> Result<()> {
        let deadline = Instant::now() + timeout;
        loop {
            // Check lag first so a shard that catches up right at the
            // deadline still succeeds.
            if self.control_plane.lag_for(target_shard) == 0 {
                return Ok(());
            }
            if Instant::now() >= deadline {
                // Used by both `drain` and `rejoin`, so the message is generic.
                return Err(TidalError::Timeout(format!(
                    "timed out waiting for zero replication lag on shard {:?}",
                    target_shard
                )));
            }
            tokio::time::sleep(Duration::from_millis(100)).await;
        }
    }
}
```
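The design above lists the `NodeUpgradeStatus` variants but does not spell out which transitions are legal. One possible reading, offered as an assumption rather than as the specified behavior, is a linear progression in which any non-terminal state may also move to `Failed`. The helper methods `is_terminal` and `can_transition_to` are hypothetical, and the serde derives are omitted so the sketch stays dependency-free.

```rust
/// Mirror of the enum above, without the serde derives.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum NodeUpgradeStatus {
    Pending,
    Draining,
    Drained,
    Upgrading,
    Rejoining,
    Complete,
    Failed { reason: String },
}

impl NodeUpgradeStatus {
    /// `Complete` and `Failed` are treated as terminal in this sketch.
    pub fn is_terminal(&self) -> bool {
        matches!(self, Self::Complete | Self::Failed { .. })
    }

    /// Assumed transition rules: the happy path is strictly linear, and
    /// every non-terminal state may fail.
    pub fn can_transition_to(&self, next: &Self) -> bool {
        use NodeUpgradeStatus::*;
        match (self, next) {
            (Pending, Draining)
            | (Draining, Drained)
            | (Drained, Upgrading)
            | (Upgrading, Rejoining)
            | (Rejoining, Complete) => true,
            (from, Failed { .. }) => !from.is_terminal(),
            _ => false,
        }
    }
}
```

Under these rules a node cannot skip from `Pending` straight to `Upgrading`, which matches the requirement that a node is drained before the external upgrade starts.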

### Routing Integration

```rust
// In WalShipper (additions)

impl WalShipper {
    /// Skip shipping to drained nodes.
    ///
    /// Synchronous: the check only reads the coordinator's in-memory drained set.
    fn should_ship_to(&self, shard_id: ShardId) -> bool {
        !self
            .upgrade_coordinator
            .as_ref()
            .is_some_and(|c| c.is_drained(shard_id))
    }
}
```
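The filtering behavior can be illustrated with a small standalone sketch. It assumes, purely for illustration, that the drained set is exposed through a cheaply cloneable handle shared by the coordinator and the shipper; `DrainedSet`, `mark_drained`, `mark_serving`, `ship_targets`, and the `u32` alias for `ShardId` are hypothetical and not part of the design above. Sharing a handle like this is one way to avoid the shipper and the coordinator holding strong references to each other.

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex};

/// Assumption: shard ids are modelled as plain integers here.
pub type ShardId = u32;

/// Hypothetical shared handle over the set of drained shards.
/// Cloning the handle shares the same underlying set.
#[derive(Clone, Default)]
pub struct DrainedSet(Arc<Mutex<HashSet<ShardId>>>);

impl DrainedSet {
    pub fn mark_drained(&self, shard: ShardId) {
        self.0.lock().unwrap().insert(shard);
    }

    pub fn mark_serving(&self, shard: ShardId) {
        self.0.lock().unwrap().remove(&shard);
    }

    pub fn is_drained(&self, shard: ShardId) -> bool {
        self.0.lock().unwrap().contains(&shard)
    }
}

/// Returns the shards that should receive the next WAL segment,
/// preserving the input order and skipping drained shards.
pub fn ship_targets(all: &[ShardId], drained: &DrainedSet) -> Vec<ShardId> {
    all.iter()
        .copied()
        .filter(|shard| !drained.is_drained(*shard))
        .collect()
}
```

With shards `[0, 1, 2]` and shard 1 drained, `ship_targets` yields `[0, 2]`; after `mark_serving(1)` it yields all three again.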

## Acceptance Criteria

- [ ] `drain(shard)` fails with `TidalError::InvalidState` if draining would leave 0 serving nodes
- [ ] `drain(shard)` succeeds once replication lag for that shard reaches 0
- [ ] During drain: writes from `WalShipper` skip the drained shard; reads from other shards succeed
- [ ] `rejoin(shard)` ships all WAL segments the node missed during its downtime, then re-enables routing
- [ ] Rolling upgrade of all N nodes: each drain+rejoin step maintains availability (property: at least 1 node serving throughout)
- [ ] Integration test: 3-node simulated cluster; drain node 0, "upgrade" (simulated by stop+restart), rejoin; verify all signals written during the upgrade are present on the rejoined node
- [ ] `cargo clippy -- -D warnings` and `cargo fmt --check` pass
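
The availability property in the criteria above can be checked against a small in-memory model before the integration test exists. The sketch below is an assumption-laden stand-in, not the coordinator itself: `Cluster`, `try_drain`, `rejoin`, `serving`, and `min_serving_during_rolling_upgrade` are hypothetical names, and replication lag and WAL shipping are not modelled. The guard treats a repeated drain of an already drained node as a no-op.

```rust
use std::collections::HashSet;

/// Minimal model of a cluster: a node count plus the set of drained nodes.
pub struct Cluster {
    total: usize,
    drained: HashSet<u32>,
}

impl Cluster {
    pub fn new(total: usize) -> Self {
        Self { total, drained: HashSet::new() }
    }

    /// Number of nodes currently serving.
    pub fn serving(&self) -> usize {
        self.total - self.drained.len()
    }

    /// Rejects any drain that would leave no serving node.
    pub fn try_drain(&mut self, node: u32) -> Result<(), String> {
        if self.drained.contains(&node) {
            return Ok(()); // already drained: nothing changes
        }
        if self.drained.len() + 1 >= self.total {
            return Err("cannot drain: would leave no serving nodes".to_string());
        }
        self.drained.insert(node);
        Ok(())
    }

    pub fn rejoin(&mut self, node: u32) {
        self.drained.remove(&node);
    }
}

/// Drains and rejoins every node in turn and reports the lowest number of
/// serving nodes observed along the way.
pub fn min_serving_during_rolling_upgrade(total: usize) -> Result<usize, String> {
    let mut cluster = Cluster::new(total);
    let mut min_serving = cluster.serving();
    for node in 0..total as u32 {
        cluster.try_drain(node)?;
        min_serving = min_serving.min(cluster.serving());
        cluster.rejoin(node);
    }
    Ok(min_serving)
}
```

For a 3-node cluster the minimum is 2 serving nodes, for a 2-node cluster it is 1, and a 1-node cluster cannot be drained at all, which is the behavior the first acceptance criterion asks for.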