Task 05: RollingUpgradeCoordinator
Delivers
`RollingUpgradeCoordinator` in `tidal/src/replication/upgrade.rs`. Upgrades nodes one at a time in a drain → upgrade → rejoin cycle. Uses WAL shipping to keep the remaining followers current during the upgrade window. Queries stay available throughout because at least one node is always serving during each upgrade step.
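As a rough illustration of the drain → upgrade → rejoin cycle, the sketch below walks a toy cluster through a full rolling upgrade and records the lowest number of serving nodes seen along the way. `Cluster`, `rolling_upgrade`, and the `usize` node ids are hypothetical stand-ins, not part of the `tidal` API. The real coordinator is async and waits on replication lag, which this synchronous model leaves out.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for the coordinator: tracks which nodes are drained.
struct Cluster {
    total_nodes: usize,
    drained: HashSet<usize>,
}

impl Cluster {
    fn new(total_nodes: usize) -> Self {
        Self { total_nodes, drained: HashSet::new() }
    }

    /// Number of nodes currently serving (not drained).
    fn serving(&self) -> usize {
        self.total_nodes - self.drained.len()
    }

    /// Refuses to drain the last serving node.
    fn drain(&mut self, node: usize) -> Result<(), String> {
        if !self.drained.contains(&node) && self.serving() <= 1 {
            return Err("cannot drain: would leave no serving nodes".to_string());
        }
        self.drained.insert(node);
        Ok(())
    }

    fn rejoin(&mut self, node: usize) {
        self.drained.remove(&node);
    }
}

/// Upgrades every node in turn and returns the minimum number of serving
/// nodes observed at any step.
fn rolling_upgrade(
    cluster: &mut Cluster,
    mut upgrade: impl FnMut(usize),
) -> Result<usize, String> {
    let mut min_serving = cluster.serving();
    for node in 0..cluster.total_nodes {
        cluster.drain(node)?;
        min_serving = min_serving.min(cluster.serving());
        upgrade(node); // external step, outside the coordinator's scope
        cluster.rejoin(node);
    }
    Ok(min_serving)
}
```

In this model a three-node cluster never drops below two serving nodes, and a single-node cluster is rejected at the first drain.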
Complexity: M
Dependencies
- Task 03 (ControlPlane)
- Phase 8.2, Task 03 (WalShipper)
- Phase 8.2, Task 05 (FollowerDb / NodeRole)
Technical Design
// tidal/src/replication/upgrade.rs

use std::collections::HashSet;
use std::sync::{Arc, Mutex};
use std::time::{Duration, Instant};

/// Coordinates a rolling upgrade across all nodes in a cluster.
///
/// Protocol (per node):
/// 1. `drain(node)` -- stop routing new writes to the target node,
///    let in-flight operations complete, and verify replication lag = 0.
/// 2. Caller performs the upgrade (outside this coordinator's scope).
/// 3. `rejoin(node)` -- ship the WAL segments the node missed, verify it
///    has applied them, then re-enable routing to it.
///
/// When nodes are upgraded one at a time, (N-1) nodes keep serving queries.
/// `drain` itself only enforces the weaker floor of one serving node.
pub struct RollingUpgradeCoordinator {
    control_plane: Arc<ControlPlane>,
    wal_shipper: Arc<WalShipper>,
    /// Shards currently draining or drained (not routed new writes).
    drained_nodes: Mutex<HashSet<ShardId>>,
}

/// Status of a single node's upgrade step.
#[derive(Debug, Clone, PartialEq, Eq, serde::Serialize, serde::Deserialize)]
pub enum NodeUpgradeStatus {
    Pending,
    Draining,
    Drained,   // ready for upgrade
    Upgrading, // external process is upgrading the node
    Rejoining, // node is catching up from WAL
    Complete,
    Failed { reason: String },
}

impl RollingUpgradeCoordinator {
    pub fn new(
        control_plane: Arc<ControlPlane>,
        wal_shipper: Arc<WalShipper>,
    ) -> Self {
        Self {
            control_plane,
            wal_shipper,
            drained_nodes: Mutex::new(HashSet::new()),
        }
    }

    /// Drain a node: stop routing writes to it, then wait for replication lag = 0.
    ///
    /// Fails with `TidalError::InvalidState` if draining this node would
    /// leave no serving node. On timeout the shard stays marked as drained;
    /// the caller decides whether to retry or `rejoin`.
    pub async fn drain(&self, target_shard: ShardId) -> Result<()> {
        let total_nodes = self.control_plane.topology().shards.len();
        // Check and insert under a single lock acquisition so two concurrent
        // `drain` calls cannot both pass the safety check. The guard is
        // dropped at the end of this block, before the `.await` below.
        {
            let mut drained = self.drained_nodes.lock().unwrap();
            // A shard that is already drained does not reduce capacity further.
            if !drained.contains(&target_shard) && drained.len() + 1 >= total_nodes {
                return Err(TidalError::InvalidState(
                    "cannot drain: would leave no serving nodes".into(),
                ));
            }
            // Mark as draining: the routing layer stops sending new writes here.
            drained.insert(target_shard);
        }
        // Wait for replication lag to reach 0 (target has all events).
        self.await_zero_lag(target_shard, Duration::from_secs(30)).await?;
        Ok(())
    }

    /// Rejoin a (newly upgraded) node: ship missed WAL segments, then re-enable routing.
    ///
    /// The upgraded node may have missed WAL segments during its downtime.
    /// Those segments are shipped and applied before routing is re-enabled.
    pub async fn rejoin(&self, target_shard: ShardId) -> Result<()> {
        // Get the node's current applied seqno (via its reported stats).
        // Falls back to 0 when no stats are reported for the shard.
        let follower_seqno = self.control_plane
            .shard_stats(target_shard)
            .map(|s| s.applied_seqno)
            .unwrap_or(0);
        // Ship missed segments.
        self.wal_shipper
            .ship_segments_since(target_shard, follower_seqno)
            .await?;
        // Wait for the node to apply all shipped segments.
        self.await_zero_lag(target_shard, Duration::from_secs(60)).await?;
        // Re-enable routing to this node.
        self.drained_nodes.lock().unwrap().remove(&target_shard);
        Ok(())
    }

    /// Returns `true` if `shard_id` is currently drained (not receiving writes).
    pub fn is_drained(&self, shard_id: ShardId) -> bool {
        self.drained_nodes.lock().unwrap().contains(&shard_id)
    }

    /// Wait until the replication lag for `target_shard` reaches 0.
    ///
    /// Polls the `ReplicationLagGauge` every 100 ms. Times out after `timeout`.
    /// Used by both `drain` and `rejoin`.
    async fn await_zero_lag(
        &self,
        target_shard: ShardId,
        timeout: Duration,
    ) -> Result<()> {
        let deadline = Instant::now() + timeout;
        loop {
            // Check the lag first so a shard that is already caught up
            // succeeds even with a very short timeout.
            if self.control_plane.lag_for(target_shard) == 0 {
                return Ok(());
            }
            if Instant::now() >= deadline {
                return Err(TidalError::Timeout(format!(
                    "timed out waiting for zero replication lag on shard {:?}",
                    target_shard
                )));
            }
            tokio::time::sleep(Duration::from_millis(100)).await;
        }
    }
}
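The design above declares `NodeUpgradeStatus` but does not say which transitions are legal. One plausible reading, sketched here purely as an assumption, is a linear progression from `Pending` to `Complete`, with `Failed` reachable from any in-progress step. `can_transition` is a hypothetical helper, and the enum is redeclared without the serde derives so the example stands alone.

```rust
/// Redeclared locally (without serde derives) so the sketch is self-contained.
#[derive(Debug, Clone, PartialEq, Eq)]
enum NodeUpgradeStatus {
    Pending,
    Draining,
    Drained,
    Upgrading,
    Rejoining,
    Complete,
    Failed { reason: String },
}

/// Hypothetical transition check: assumes a strictly linear happy path and
/// treats `Complete` and `Failed` as terminal states.
fn can_transition(from: &NodeUpgradeStatus, to: &NodeUpgradeStatus) -> bool {
    use NodeUpgradeStatus::*;
    match (from, to) {
        // Happy path, one step at a time.
        (Pending, Draining)
        | (Draining, Drained)
        | (Drained, Upgrading)
        | (Upgrading, Rejoining)
        | (Rejoining, Complete) => true,
        // Any in-progress step may fail.
        (Draining | Drained | Upgrading | Rejoining, Failed { .. }) => true,
        // Everything else, including skipping steps, is rejected.
        _ => false,
    }
}
```

Whether a failed node may be retried (for example `Failed` back to `Draining`) is a policy choice the source leaves open; this sketch rejects it.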
Routing Integration
// In WalShipper (additions)
impl WalShipper {
    /// Skip shipping to drained nodes.
    /// With no coordinator configured, every shard is shipped to.
    async fn should_ship_to(&self, shard_id: ShardId) -> bool {
        self.upgrade_coordinator
            .as_ref()
            .map_or(true, |c| !c.is_drained(shard_id))
    }
}
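To make the effect of this check concrete, the following sketch filters a list of shards down to those that should still receive WAL segments. `DrainSet`, `shipping_targets`, and the `u32` alias for `ShardId` are hypothetical names introduced only for illustration; the real `WalShipper` fields and target selection are not shown in this task.

```rust
use std::collections::HashSet;
use std::sync::Mutex;

/// Hypothetical alias; the real `ShardId` type is defined elsewhere in `tidal`.
type ShardId = u32;

/// Hypothetical stand-in for the coordinator's drained-node bookkeeping.
struct DrainSet {
    drained: Mutex<HashSet<ShardId>>,
}

impl DrainSet {
    fn with_drained(ids: &[ShardId]) -> Self {
        Self { drained: Mutex::new(ids.iter().copied().collect()) }
    }

    fn is_drained(&self, shard_id: ShardId) -> bool {
        self.drained.lock().unwrap().contains(&shard_id)
    }
}

/// Returns the shards that should receive WAL segments, in input order.
/// `None` models a shipper with no upgrade coordinator attached.
fn shipping_targets(all: &[ShardId], coordinator: Option<&DrainSet>) -> Vec<ShardId> {
    all.iter()
        .copied()
        .filter(|&s| coordinator.map_or(true, |c| !c.is_drained(s)))
        .collect()
}
```

Without a coordinator every shard is a target; with shard 1 drained, only shards 0 and 2 remain.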
Acceptance Criteria
- `drain(shard)` fails with `TidalError::InvalidState` if draining would leave 0 serving nodes
- `drain(shard)` succeeds once replication lag for that shard reaches 0
- During drain: writes from `WalShipper` skip the drained shard; reads from other shards succeed
- `rejoin(shard)` ships all WAL segments the node missed during its downtime, then re-enables routing
- Rolling upgrade of all N nodes: each drain+rejoin step maintains availability (property: at least 1 node serving throughout)
- Integration test: 3-node simulated cluster; drain node 0, "upgrade" (simulated by stop+restart), rejoin; verify all signals written during the upgrade are present on the rejoined node
- `cargo clippy -- -D warnings` and `cargo fmt` pass
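The integration test in the criteria above could start from an in-memory model along these lines. Everything here (`SimNode`, `SimCluster`, signals as plain `u64` values) is a hypothetical simplification: there is no transport, no async, and "shipping" is a slice copy, so it only demonstrates the property being tested, not the real `WalShipper` path.

```rust
use std::collections::HashSet;

/// Hypothetical in-memory node: records applied signals in seqno order.
#[derive(Default)]
struct SimNode {
    applied: Vec<u64>,
}

/// Hypothetical cluster: `wal` is the authoritative log of written signals.
struct SimCluster {
    wal: Vec<u64>,
    nodes: Vec<SimNode>,
    drained: HashSet<usize>,
}

impl SimCluster {
    fn new(n: usize) -> Self {
        Self {
            wal: Vec::new(),
            nodes: (0..n).map(|_| SimNode::default()).collect(),
            drained: HashSet::new(),
        }
    }

    /// Append to the log and replicate to every node that is not drained.
    fn write(&mut self, signal: u64) {
        self.wal.push(signal);
        for (i, node) in self.nodes.iter_mut().enumerate() {
            if !self.drained.contains(&i) {
                node.applied.push(signal);
            }
        }
    }

    /// Same safety rule as the design: never drain the last serving node.
    fn drain(&mut self, node: usize) -> Result<(), String> {
        if self.drained.len() + 1 >= self.nodes.len() {
            return Err("cannot drain: would leave no serving nodes".to_string());
        }
        self.drained.insert(node);
        Ok(())
    }

    /// Ship everything past the node's applied position, then re-enable routing.
    fn rejoin(&mut self, node: usize) {
        let from = self.nodes[node].applied.len();
        let missed = self.wal[from..].to_vec();
        self.nodes[node].applied.extend(missed);
        self.drained.remove(&node);
    }
}
```

A test built on this model would write a signal, drain node 0, write more signals, then rejoin and assert that node 0 holds the same sequence as its peers.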