# Task 05: RollingUpgradeCoordinator

## Delivers

`RollingUpgradeCoordinator` in `tidal/src/replication/upgrade.rs`. It upgrades nodes one at a time in three steps: drain → upgrade → rejoin. WAL shipping keeps the remaining followers current during the upgrade window and brings the upgraded node back up to date when it rejoins. Queries stay available throughout because at least one node is always serving during each upgrade step.

## Complexity: M

## Dependencies

- Task 03 (ControlPlane)
- Phase 8.2, Task 03 (WalShipper)
- Phase 8.2, Task 05 (FollowerDb / NodeRole)

## Technical Design

```rust
// tidal/src/replication/upgrade.rs

use std::collections::HashSet;
use std::sync::{Arc, Mutex};
use std::time::{Duration, Instant};

// `ControlPlane`, `WalShipper`, `ShardId`, `TidalError`, and `Result` are
// crate-local types; their import paths are omitted here.

/// Coordinates a rolling upgrade across all nodes in a cluster.
///
/// Protocol (per node):
/// 1. `drain(node)` -- stop routing new writes to the target node;
///    let in-flight operations complete; verify replication lag = 0.
/// 2. Caller performs the upgrade (outside this coordinator's scope).
/// 3. `rejoin(node)` -- ship the WAL segments the node missed, wait for it
///    to catch up, then re-enable routing to it.
///
/// As long as the caller upgrades one node at a time, at least (N-1) nodes
/// are serving queries at any point.
pub struct RollingUpgradeCoordinator {
    control_plane: Arc<ControlPlane>,
    wal_shipper: Arc<WalShipper>,
    /// Nodes currently in the "draining" state (not routing new writes).
    drained_nodes: Mutex<HashSet<ShardId>>,
}

/// Status of a single node's upgrade step.
#[derive(Debug, Clone, PartialEq, Eq, serde::Serialize, serde::Deserialize)]
pub enum NodeUpgradeStatus {
    Pending,
    Draining,
    Drained,   // ready for upgrade
    Upgrading, // external process is upgrading the node
    Rejoining, // node is catching up from WAL
    Complete,
    Failed { reason: String },
}

impl RollingUpgradeCoordinator {
    pub fn new(control_plane: Arc<ControlPlane>, wal_shipper: Arc<WalShipper>) -> Self {
        Self {
            control_plane,
            wal_shipper,
            drained_nodes: Mutex::new(HashSet::new()),
        }
    }

    /// Drain a node: stop routing writes to it, wait for replication lag = 0.
    ///
    /// Fails if draining this node would leave fewer than 1 serving node.
    pub async fn drain(&self, target_shard: ShardId) -> Result<()> {
        // Check and mark under a single lock acquisition so that two
        // concurrent `drain` calls cannot both pass the safety check.
        let newly_marked = {
            let mut drained = self.drained_nodes.lock().unwrap();
            let total_nodes = self.control_plane.topology().shards.len();
            // A shard that is already drained does not reduce the serving count again.
            let drained_after = drained.len() + usize::from(!drained.contains(&target_shard));
            if drained_after >= total_nodes {
                return Err(TidalError::InvalidState(
                    "cannot drain: would leave no serving nodes".into(),
                ));
            }
            // Mark as draining: routing layer stops sending new writes here.
            drained.insert(target_shard)
        }; // Lock guard is released here, before the `.await` below.

        // Wait for replication lag to reach 0 (target has all events).
        if let Err(e) = self.await_zero_lag(target_shard, Duration::from_secs(30)).await {
            // A failed drain must not leave the node out of rotation.
            if newly_marked {
                self.drained_nodes.lock().unwrap().remove(&target_shard);
            }
            return Err(e);
        }

        Ok(())
    }

    /// Rejoin a (newly upgraded) node: ship missed WAL segments, then re-enable routing.
    ///
    /// The upgraded node may have missed WAL segments during its downtime.
    /// We ship those segments before re-enabling routing.
    pub async fn rejoin(&self, target_shard: ShardId) -> Result<()> {
        // Get the node's current applied seqno (via its reported stats).
        // If no stats are available, fall back to 0 and ship from the start.
        let follower_seqno = self
            .control_plane
            .shard_stats(target_shard)
            .map(|s| s.applied_seqno)
            .unwrap_or(0);

        // Ship missed segments.
        self.wal_shipper
            .ship_segments_since(target_shard, follower_seqno)
            .await?;

        // Wait for the node to apply all shipped segments.
        self.await_zero_lag(target_shard, Duration::from_secs(60)).await?;

        // Re-enable routing to this node.
        self.drained_nodes.lock().unwrap().remove(&target_shard);

        Ok(())
    }

    /// Returns `true` if `shard_id` is currently drained (not receiving writes).
    pub fn is_drained(&self, shard_id: ShardId) -> bool {
        self.drained_nodes.lock().unwrap().contains(&shard_id)
    }

    /// Wait until the replication lag for `target_shard` reaches 0.
    ///
    /// Polls the `ReplicationLagGauge` every 100 ms. Times out after `timeout`.
    async fn await_zero_lag(&self, target_shard: ShardId, timeout: Duration) -> Result<()> {
        let deadline = Instant::now() + timeout;
        loop {
            // Check the lag first so an already caught-up node succeeds immediately.
            if self.control_plane.lag_for(target_shard) == 0 {
                return Ok(());
            }
            if Instant::now() >= deadline {
                // Shared by `drain` and `rejoin`, so the message names neither.
                return Err(TidalError::Timeout(format!(
                    "timed out waiting for zero replication lag on shard {:?}",
                    target_shard
                )));
            }
            tokio::time::sleep(Duration::from_millis(100)).await;
        }
    }
}
```
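
The per-node protocol above could be driven by a loop along the following lines. This is a minimal sketch under stated assumptions, not the planned implementation: the `UpgradeSteps` trait, the `Status` enum (a mirror of `NodeUpgradeStatus`), `FailingUpgrade`, and `upgrade_all` are all hypothetical names, nodes are plain `u32` ids, and the steps are synchronous so the example compiles with the standard library alone (the real `drain` and `rejoin` are `async`).

```rust
/// Hypothetical stand-in for the coordinator plus the external upgrade step.
pub trait UpgradeSteps {
    fn drain(&mut self, node: u32) -> Result<(), String>;
    fn upgrade(&mut self, node: u32) -> Result<(), String>;
    fn rejoin(&mut self, node: u32) -> Result<(), String>;
}

/// Mirrors the variants of `NodeUpgradeStatus` (assumption: same lifecycle).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Status {
    Draining,
    Drained,
    Upgrading,
    Rejoining,
    Complete,
    Failed { reason: String },
}

/// Run drain -> upgrade -> rejoin for one node, recording each transition.
fn upgrade_node<S: UpgradeSteps>(
    steps: &mut S,
    node: u32,
    log: &mut Vec<(u32, Status)>,
) -> Result<(), String> {
    log.push((node, Status::Draining));
    steps.drain(node)?;
    log.push((node, Status::Drained));
    log.push((node, Status::Upgrading));
    steps.upgrade(node)?;
    log.push((node, Status::Rejoining));
    steps.rejoin(node)?;
    log.push((node, Status::Complete));
    Ok(())
}

/// Upgrade nodes strictly one at a time. Stops at the first failure so the
/// remaining nodes keep serving on the old version.
pub fn upgrade_all<S: UpgradeSteps>(steps: &mut S, nodes: &[u32]) -> Vec<(u32, Status)> {
    let mut log = Vec::new();
    for &node in nodes {
        if let Err(reason) = upgrade_node(steps, node, &mut log) {
            log.push((node, Status::Failed { reason }));
            break;
        }
    }
    log
}

/// Test double: every step succeeds except `upgrade` on `fail_node`.
pub struct FailingUpgrade {
    pub fail_node: Option<u32>,
}

impl UpgradeSteps for FailingUpgrade {
    fn drain(&mut self, _node: u32) -> Result<(), String> {
        Ok(())
    }
    fn upgrade(&mut self, node: u32) -> Result<(), String> {
        if self.fail_node == Some(node) {
            Err("upgrade failed".to_string())
        } else {
            Ok(())
        }
    }
    fn rejoin(&mut self, _node: u32) -> Result<(), String> {
        Ok(())
    }
}
```

Stopping at the first failure is one possible policy. In this sketch a node whose upgrade fails stays drained, so whether to rejoin it on the old version or leave it out of rotation is a decision for the caller.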

### Routing Integration

```rust
// In WalShipper (additions)

impl WalShipper {
    /// Returns `false` for drained nodes so that regular shipping skips them.
    ///
    /// `ship_segments_since`, which `rejoin` calls while the node is still
    /// marked drained, must not be gated by this check.
    async fn should_ship_to(&self, shard_id: ShardId) -> bool {
        !self
            .upgrade_coordinator
            .as_ref()
            .is_some_and(|c| c.is_drained(shard_id))
    }
}
```
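
The coordinator holds an `Arc<WalShipper>`, and the shipper in turn consults the coordinator. If `upgrade_coordinator` were also stored as an `Arc`, the two would keep each other alive. One way to avoid that is for the shipper to hold a `Weak` reference. The sketch below illustrates the idea with stand-in types: `Coordinator`, `Shipper`, `mark_drained`, `mark_serving`, and `targets` are hypothetical, `ShardId` is assumed to be a plain integer, and the check is synchronous.

```rust
use std::collections::HashSet;
use std::sync::{Arc, Mutex, Weak};

/// Assumption: shard ids are small integers in this sketch.
type ShardId = u32;

/// Stand-in for `RollingUpgradeCoordinator`, reduced to the drained set.
pub struct Coordinator {
    drained: Mutex<HashSet<ShardId>>,
}

impl Coordinator {
    pub fn new() -> Arc<Self> {
        Arc::new(Self { drained: Mutex::new(HashSet::new()) })
    }

    pub fn mark_drained(&self, shard: ShardId) {
        self.drained.lock().unwrap().insert(shard);
    }

    pub fn mark_serving(&self, shard: ShardId) {
        self.drained.lock().unwrap().remove(&shard);
    }

    pub fn is_drained(&self, shard: ShardId) -> bool {
        self.drained.lock().unwrap().contains(&shard)
    }
}

/// Stand-in for `WalShipper`, holding only a weak link to the coordinator.
pub struct Shipper {
    upgrade_coordinator: Option<Weak<Coordinator>>,
}

impl Shipper {
    pub fn new(upgrade_coordinator: Option<Weak<Coordinator>>) -> Self {
        Self { upgrade_coordinator }
    }

    /// Ship unless a live coordinator reports the shard as drained.
    /// With no coordinator (never set, or already dropped) everything ships.
    pub fn should_ship_to(&self, shard: ShardId) -> bool {
        let drained = self
            .upgrade_coordinator
            .as_ref()
            .and_then(|weak| weak.upgrade())
            .is_some_and(|c| c.is_drained(shard));
        !drained
    }

    /// Filter a list of candidate shards down to the ones to ship to.
    pub fn targets(&self, all: &[ShardId]) -> Vec<ShardId> {
        all.iter().copied().filter(|&s| self.should_ship_to(s)).collect()
    }
}
```

A shipper built with `Shipper::new(Some(Arc::downgrade(&coordinator)))` skips drained shards while the coordinator is alive and falls back to shipping everywhere once it is dropped.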

## Acceptance Criteria

- [ ] `drain(shard)` fails with `TidalError::InvalidState` if draining would leave 0 serving nodes
- [ ] `drain(shard)` succeeds once replication lag for that shard reaches 0
- [ ] During drain: `WalShipper` skips the drained shard; reads from other shards succeed
- [ ] `rejoin(shard)` ships all WAL segments the node missed during its downtime, then re-enables routing
- [ ] Rolling upgrade of all N nodes: each drain+rejoin step maintains availability (property: at least 1 node serving throughout)
- [ ] Integration test: 3-node simulated cluster; drain node 0, "upgrade" (simulated by stop+restart), rejoin; verify all signals written during the upgrade are present on the rejoined node
- [ ] `cargo clippy -- -D warnings` and `cargo fmt --check` pass
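
The availability property and the integration test could be prototyped against an in-memory model before the real transport is wired in. The following is a rough sketch only: `SimCluster` and `rolling_upgrade` are hypothetical, signals are plain `u64` values, a node's applied seqno is modelled as the length of its applied list, and drain completes immediately because every serving node in this model is always caught up.

```rust
use std::collections::HashSet;

/// Hypothetical in-memory cluster: one leader log plus per-node applied signals.
pub struct SimCluster {
    log: Vec<u64>,          // leader WAL; index doubles as seqno
    applied: Vec<Vec<u64>>, // signals applied by each node
    drained: HashSet<usize>,
}

impl SimCluster {
    pub fn new(nodes: usize) -> Self {
        Self {
            log: Vec::new(),
            applied: vec![Vec::new(); nodes],
            drained: HashSet::new(),
        }
    }

    /// Number of nodes currently serving (not drained).
    pub fn serving(&self) -> usize {
        self.applied.len() - self.drained.len()
    }

    /// Append a signal and ship it to every node that is not drained.
    pub fn write(&mut self, signal: u64) {
        self.log.push(signal);
        for (node, applied) in self.applied.iter_mut().enumerate() {
            if !self.drained.contains(&node) {
                applied.push(signal);
            }
        }
    }

    /// Same safety check as `drain`: refuse to drain the last serving node.
    pub fn drain(&mut self, node: usize) -> Result<(), String> {
        let drained_after = self.drained.len() + usize::from(!self.drained.contains(&node));
        if drained_after >= self.applied.len() {
            return Err("cannot drain: would leave no serving nodes".to_string());
        }
        self.drained.insert(node);
        Ok(())
    }

    /// Ship everything past the node's applied seqno, then resume routing.
    pub fn rejoin(&mut self, node: usize) {
        let from = self.applied[node].len();
        self.applied[node].extend_from_slice(&self.log[from..]);
        self.drained.remove(&node);
    }

    pub fn applied(&self, node: usize) -> &[u64] {
        &self.applied[node]
    }

    pub fn log(&self) -> &[u64] {
        &self.log
    }
}

/// Upgrade every node in turn, writing one signal inside each upgrade window.
/// Returns the smallest number of serving nodes observed along the way.
pub fn rolling_upgrade(cluster: &mut SimCluster, nodes: usize) -> usize {
    let mut min_serving = cluster.serving();
    for node in 0..nodes {
        cluster.drain(node).expect("draining one node at a time must succeed");
        min_serving = min_serving.min(cluster.serving());
        cluster.write(100 + node as u64); // lands while `node` is out of rotation
        cluster.rejoin(node);
    }
    min_serving
}
```

For a 3-node cluster, the test would assert that `rolling_upgrade` never observes fewer than 2 serving nodes and that every node's applied signals equal the leader log afterwards.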