# Task 02: Fault Injection

## Delivers

`NetworkPartition` and `ShardCrash` in `tidal/src/testing/faults.rs`.

`NetworkPartition` intercepts `Transport::send_segment` and `Transport::send_session_batch` calls and drops them for specified region pairs. `ShardCrash` takes a shard's primary offline and triggers follower promotion. Both are RAII handles: the fault is active while the handle is alive. `NetworkPartition` heals its routes on drop; `ShardCrash` rejoins the shard on drop only when `auto_rejoin` is set.

## Complexity: M

## Dependencies

- Task 01 (SimulatedCluster)
- Phase 8.2, Task 01 (Transport trait)
- Phase 8.2, Task 05 (FollowerDb, NodeRole)

## Technical Design

```rust
// tidal/src/testing/faults.rs
// Only compiled with #[cfg(test)] or --features test-utils

use std::sync::atomic::Ordering;
use std::sync::Arc;

/// RAII handle for a network partition between two regions.
///
/// While this handle is alive, all `Transport::send_segment` and
/// `Transport::send_session_batch` calls from `from` to `to` (and
/// optionally `to` to `from` for symmetric partitions) are dropped
/// without delivery.
///
/// When the handle is dropped, the partition is automatically healed.
pub struct NetworkPartition {
    from: RegionId,
    to: RegionId,
    symmetric: bool,
    transport_factory: Arc<InProcessTransportFactory>,
}

impl NetworkPartition {
    /// Create a one-way partition: `from` cannot reach `to`.
    pub fn one_way(
        from: RegionId,
        to: RegionId,
        factory: Arc<InProcessTransportFactory>,
    ) -> Self {
        factory.block_route(from, to);
        Self { from, to, symmetric: false, transport_factory: factory }
    }

    /// Create a symmetric partition: neither side can reach the other.
    pub fn symmetric(
        region_a: RegionId,
        region_b: RegionId,
        factory: Arc<InProcessTransportFactory>,
    ) -> Self {
        factory.block_route(region_a, region_b);
        factory.block_route(region_b, region_a);
        Self { from: region_a, to: region_b, symmetric: true, transport_factory: factory }
    }

    /// Check how many segments have been dropped since the partition was injected.
    pub fn dropped_segments(&self) -> u64 {
        self.transport_factory.dropped_count(self.from, self.to)
    }
}

impl Drop for NetworkPartition {
    fn drop(&mut self) {
        self.transport_factory.unblock_route(self.from, self.to);
        if self.symmetric {
            self.transport_factory.unblock_route(self.to, self.from);
        }
    }
}

/// RAII handle for a simulated shard crash.
///
/// Crashes the primary of the given shard. The primary is taken offline
/// (stops processing WAL writes, stops shipping to followers). The most
/// advanced follower is promoted to leader automatically.
///
/// When the handle is dropped, the "crashed" shard can be optionally
/// restored (simulating a node restart) or left offline.
pub struct ShardCrash {
    crashed_shard: ShardId,
    original_leader_seqno: u64,
    cluster: Arc<SimulatedCluster>,
    auto_rejoin: bool,
}

impl ShardCrash {
    /// Crash the primary of `shard_id`.
    ///
    /// `auto_rejoin`: if true, the shard restarts and rejoins on drop.
    pub async fn crash(
        shard_id: ShardId,
        cluster: Arc<SimulatedCluster>,
        auto_rejoin: bool,
    ) -> Self {
        // Record the shard's current seqno before crash.
        let original_seqno = cluster.applied_seqno_for(shard_id);

        // Take the shard offline: stop WAL shipping, stop write processing.
        cluster.take_shard_offline(shard_id).await;

        // Promote the most advanced follower (if any).
        cluster.promote_best_follower(shard_id).await;

        Self {
            crashed_shard: shard_id,
            original_leader_seqno: original_seqno,
            cluster,
            auto_rejoin,
        }
    }

    /// How many events the crashed shard had applied at crash time.
    pub fn pre_crash_seqno(&self) -> u64 {
        self.original_leader_seqno
    }

    /// Manually rejoin the crashed shard (ship missed WAL, re-enable as follower).
    pub async fn rejoin(&self) {
        self.cluster.rejoin_shard(self.crashed_shard).await;
    }
}

impl Drop for ShardCrash {
    fn drop(&mut self) {
        if self.auto_rejoin {
            // Best effort async rejoin on drop (may race with test teardown).
            let cluster = self.cluster.clone();
            let shard = self.crashed_shard;
            // `tokio::spawn` panics outside a runtime, so only spawn when one is active.
            if let Ok(handle) = tokio::runtime::Handle::try_current() {
                handle.spawn(async move {
                    cluster.rejoin_shard(shard).await;
                });
            }
        }
    }
}

/// Extension to InProcessTransportFactory for fault injection.
impl InProcessTransportFactory {
    /// Block all traffic from `from` to `to`.
    pub fn block_route(&self, from: RegionId, to: RegionId) {
        self.blocked_routes.write().unwrap().insert((from, to));
    }

    /// Unblock traffic from `from` to `to`.
    pub fn unblock_route(&self, from: RegionId, to: RegionId) {
        self.blocked_routes.write().unwrap().remove(&(from, to));
    }

    /// Heal all partitions.
    pub fn heal_all(&self) {
        self.blocked_routes.write().unwrap().clear();
    }

    /// Count of segments dropped on a specific route since the factory was created.
    pub fn dropped_count(&self, from: RegionId, to: RegionId) -> u64 {
        self.drop_counters
            .get(&(from, to))
            .map(|c| c.load(Ordering::Relaxed))
            .unwrap_or(0)
    }

    /// Replay the last session batch that was dropped to `to` region.
    /// Used by idempotency tests to simulate duplicate delivery.
    pub async fn replay_last_session_batch(&self, to: RegionId) {
        // Clone the batch out first so the mutex guard is released before the `.await`.
        let batch = self.last_session_batch.lock().unwrap().get(&to).cloned();
        if let Some(batch) = batch {
            self.deliver_session_batch(to, batch).await;
        }
    }
}
```

## Acceptance Criteria

- [ ] `NetworkPartition::one_way(from, to)` drops all segments from `from` to `to`; segments from `to` to `from` still deliver
- [ ] `NetworkPartition::symmetric(a, b)` drops segments in both directions
- [ ] Dropping `NetworkPartition` heals the route; subsequent segments deliver normally
- [ ] `dropped_segments()` accurately counts segments dropped since partition injection
- [ ] `ShardCrash::crash(shard, cluster, false)` takes the shard offline; a follower is promoted
- [ ] After `ShardCrash::rejoin()`: the previously crashed shard catches up from WAL segments and its applied seqno matches the current leader's
- [ ] `heal_all()` restores all blocked routes in one call
- [ ] Partition test: inject partition, write 50 segments, verify they are not applied on isolated follower; heal, verify they are applied
- [ ] `cargo clippy -- -D warnings` and `cargo fmt` pass
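
The RAII lifecycle and the partition criteria above can be sketched with a std-only mock. This is a sketch under assumptions, not the `tidal` implementation: `MockFactory`, `Partition`, `Route`, `partition_lifecycle`, and the `u32` `RegionId` are hypothetical stand-ins, and `send_segment` here only reports whether a segment was delivered. The sketch also records a `baseline` at injection time, which is one possible way to report drops since injection when the factory's counter is cumulative.

```rust
use std::collections::{HashMap, HashSet};
use std::sync::{Arc, Mutex, RwLock};

// Hypothetical stand-ins for the real `RegionId` and route key.
type RegionId = u32;
type Route = (RegionId, RegionId);

/// Hypothetical mock of the factory's fault-injection surface.
#[derive(Default)]
struct MockFactory {
    blocked_routes: RwLock<HashSet<Route>>,
    drop_counters: Mutex<HashMap<Route, u64>>,
    delivered: Mutex<HashMap<Route, u64>>,
}

impl MockFactory {
    fn block_route(&self, from: RegionId, to: RegionId) {
        self.blocked_routes.write().unwrap().insert((from, to));
    }

    fn unblock_route(&self, from: RegionId, to: RegionId) {
        self.blocked_routes.write().unwrap().remove(&(from, to));
    }

    fn dropped_count(&self, from: RegionId, to: RegionId) -> u64 {
        self.drop_counters.lock().unwrap().get(&(from, to)).copied().unwrap_or(0)
    }

    fn delivered_count(&self, from: RegionId, to: RegionId) -> u64 {
        self.delivered.lock().unwrap().get(&(from, to)).copied().unwrap_or(0)
    }

    /// Stand-in for `Transport::send_segment`; returns true if delivered.
    fn send_segment(&self, from: RegionId, to: RegionId) -> bool {
        let blocked = self.blocked_routes.read().unwrap().contains(&(from, to));
        // Count the segment as either dropped or delivered on this route.
        let counters = if blocked { &self.drop_counters } else { &self.delivered };
        *counters.lock().unwrap().entry((from, to)).or_insert(0) += 1;
        !blocked
    }
}

/// RAII guard: routes are blocked while it lives and unblocked on drop.
struct Partition {
    from: RegionId,
    to: RegionId,
    symmetric: bool,
    baseline: u64, // cumulative drop count on `from -> to` at injection time
    factory: Arc<MockFactory>,
}

impl Partition {
    fn inject(from: RegionId, to: RegionId, symmetric: bool, factory: Arc<MockFactory>) -> Self {
        let baseline = factory.dropped_count(from, to);
        factory.block_route(from, to);
        if symmetric {
            factory.block_route(to, from);
        }
        Self { from, to, symmetric, baseline, factory }
    }

    /// Drops on `from -> to` since this partition was injected.
    fn dropped_segments(&self) -> u64 {
        self.factory.dropped_count(self.from, self.to) - self.baseline
    }
}

impl Drop for Partition {
    fn drop(&mut self) {
        self.factory.unblock_route(self.from, self.to);
        if self.symmetric {
            self.factory.unblock_route(self.to, self.from);
        }
    }
}

/// Inject a one-way partition, send 50 segments, heal, send 50 more.
/// Returns (dropped, delivered) totals for the partitioned route.
fn partition_lifecycle() -> (u64, u64) {
    let factory = Arc::new(MockFactory::default());
    let (us, eu): (RegionId, RegionId) = (1, 2);
    {
        let partition = Partition::inject(us, eu, false, Arc::clone(&factory));
        for _ in 0..50 {
            assert!(!factory.send_segment(us, eu)); // dropped while partitioned
        }
        assert!(factory.send_segment(eu, us)); // reverse direction still delivers
        assert_eq!(partition.dropped_segments(), 50);
    } // guard dropped here: route healed
    for _ in 0..50 {
        assert!(factory.send_segment(us, eu));
    }
    (factory.dropped_count(us, eu), factory.delivered_count(us, eu))
}
```

Scoping the guard in a block makes the heal point explicit in a test: everything sent before the closing brace is dropped, and everything sent after it is delivered. In the real cluster the healed follower would presumably catch up from shipped WAL segments rather than from re-sent ones, which this mock does not model.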