# Task 03: Crash Recovery and Replay

## Context

**Milestone:** 1 -- Signal Engine
**Phase:** m1p2 -- Write-Ahead Log
**Status:** COMPLETE
**Depends On:** Task 01 (wire format, `list_segments`, `WalError`)
**Blocks:** Task 04 (WalHandle calls `recover()` during `open()`)
**Complexity:** M

## Objective

Implement the WAL reader and crash recovery procedure. On startup, `recover()` reads the checkpoint metadata to find the last-materialized sequence number, identifies segments containing events after that checkpoint, scans them forward validating each batch via a two-phase check (magic + bounds, then BLAKE3), truncates at the first invalid batch, and returns all post-checkpoint events for replay by the signal materializer.

This is the component that makes the WAL a durable source of truth. If the process crashes mid-write, recovery must detect the partial batch, truncate it, and return only committed events. No committed event is ever lost; no partial write is ever presented as committed.

## Requirements

- `recover(wal_dir)` reads `checkpoint.meta` (or assumes checkpoint_seq=0 if absent), lists segments, and scans all batches after the checkpoint
- Two-phase batch validation:
  - Phase 1: verify magic bytes == `0x54494C44`, version == 1, `offset + 64 + payload_length <= file_length`
  - Phase 2: read payload, compute `blake3(header[0..32] || payload)`, compare to stored checksum
  - On any failure: truncate the file at the last valid batch boundary, stop scanning
- `RecoveryResult` carries: `events: Vec` (post-checkpoint), `next_seq: u64` (for writer to continue from)
- Recovery is sequential (not parallel): segments are scanned in ascending first-seq order
- Recovery time target: < 10ms for a WAL with 63 MB of content (one checkpoint interval at 100K events/sec)
- `WalReader` provides an iterator over batches in a single segment file

## Technical Design

### Recovery Procedure

```
recover(wal_dir):
  1. checkpoint_seq = CheckpointManager::read(wal_dir)?.unwrap_or(0)
  2. segments = list_segments(wal_dir)?   -- sorted ascending by first_seq
  3. filter to segments that may contain events > checkpoint_seq
  4. for each relevant segment:
     a. open file for reading
     b. offset = 0
     c. last_valid_offset = 0
     d. while offset < file_length:
        i.    if file_length - offset < 64: break (incomplete header)
        ii.   read 64 bytes as header candidate
        iii.  Phase 1: verify magic + version; verify offset+64+payload_len <= file_length
        iv.   if Phase 1 fails: truncate file at last_valid_offset, break
        v.    read payload_len bytes
        vi.   Phase 2: compute blake3(header[0..32] || payload); compare to stored checksum
        vii.  if Phase 2 fails: truncate file at last_valid_offset, break
        viii. decode event records from payload
        ix.   filter events where seq > checkpoint_seq, add to result
        x.    last_valid_offset = offset + 64 + payload_len
        xi.   advance offset
  5. return RecoveryResult { events, next_seq }
```

### API

```rust
pub struct RecoveryResult {
    /// Events since the last checkpoint, in order.
    pub events: Vec,
    /// The sequence number the writer should assign to the next new event.
    pub next_seq: u64,
}

/// Recover from crash. Scans WAL segments after the last checkpoint.
/// Truncates any partially-written trailing batch.
///
/// Returns the events to replay and the next sequence number to use.
pub fn recover(wal_dir: &Path) -> Result<RecoveryResult, WalError>;

/// Iterator over batches in a single WAL segment.
pub struct WalReader { /* file, current offset */ }

impl WalReader {
    pub fn open(path: &Path) -> Result<Self, WalError>;

    /// Read the next batch. Returns Ok(None) at EOF.
    /// Returns Err on validation failure (caller should truncate).
    pub fn next_batch(&mut self) -> Result)>, WalError>;

    /// File position of the last successfully-read batch's end.
    pub fn last_valid_offset(&self) -> u64;
}
```

## Test Strategy

### Unit Tests

```rust
#[test]
fn recover_empty_wal_returns_no_events() {
    let dir = tempfile::tempdir().unwrap();
    let result = recover(dir.path()).unwrap();
    assert!(result.events.is_empty());
    assert_eq!(result.next_seq, 1);
}

#[test]
fn recover_returns_events_after_checkpoint() {
    // Write 10 events, checkpoint at seq 5, write 5 more, close
    // recover() should return only the 5 post-checkpoint events
}

#[test]
fn recover_truncates_partial_header() {
    // Write valid batch, then write 32 bytes of garbage (half a header)
    // recover() should truncate the file at the end of the valid batch
    // Events from the valid batch are returned; partial header is gone
}

#[test]
fn recover_truncates_bad_checksum() {
    // Write valid batch, then write batch with corrupted payload
    // (flip a byte in the payload but leave header intact)
    // recover() should detect Phase 2 failure and truncate
}

#[test]
fn recover_truncates_short_payload() {
    // Write header with payload_len=210 but only write 100 bytes of payload
    // recover() Phase 1 detects payload doesn't fit, truncates
}

#[test]
fn recover_spans_multiple_segments() {
    // Write events across 2 segments, no checkpoint
    // recover() returns all events in order, next_seq is correct
}

#[test]
fn recover_after_segment_rotation_with_checkpoint() {
    // Seg 1: events 1-100; Seg 2: events 101-200; checkpoint at 100
    // recover() skips seg 1 (all before checkpoint), returns events from seg 2
}
```

## Acceptance Criteria

- [x] `recover()` reads checkpoint sequence from `checkpoint.meta` if it exists, defaults to 0
- [x] Segments are scanned in ascending first-seq order
- [x] Phase 1 validation: magic bytes and payload bounds checked before reading payload
- [x] Phase 2 validation: BLAKE3 computed and compared; corrupted batches cause truncation
- [x] Truncation removes the partial/corrupted batch from the file (not just skips it)
- [x] `RecoveryResult.events` contains exactly the events after `checkpoint_seq`, in order
- [x] `RecoveryResult.next_seq` is one greater than the highest sequence number seen
- [x] `WalHandle::open()` returns replayed events as `Vec` for the materializer

## Research References

- [docs/research/tidaldb_wal.md](../../../research/tidaldb_wal.md): Section 4 (crash detection: Approach 3, two-phase validation algorithm), Section 5 (checkpoint + truncation: recovery algorithm pseudocode, recovery time estimate ~8ms for 63 MB at BLAKE3's 8 GB/sec)

## Implementation Notes

- File truncation uses `File::set_len(last_valid_offset)` followed by `File::sync_all()` to flush the metadata update.
- Truncation writes are rare (only after crashes). No performance concern.
- Events with `seq <= checkpoint_seq` are still parsed during recovery (to advance the offset) but not added to `result.events`. This is necessary to correctly determine `next_seq` even when the checkpoint falls mid-segment.
- The dedup window is populated from replayed events after `recover()` returns (`WalHandle::open()` calls `dedup.populate_from_events(recovery.events)`). This is sequential: there is no race with the writer thread, which hasn't started yet.
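
The two-phase scan from the Recovery Procedure can be illustrated with a minimal in-memory sketch. It is not the actual implementation, and several details are assumptions made only for illustration. The header field offsets (magic at bytes 0..4, version at 4..6, payload length at 8..12, checksum at 32..64) and the little-endian encoding are hypothetical, since the real wire format is defined in Task 01. `toy_checksum` is a non-cryptographic stand-in for BLAKE3 so that the example needs no external crate. `encode_batch` and `scan_segment` are hypothetical helper names.

```rust
// Sketch only: field offsets, endianness, and the checksum are assumptions,
// not the Task 01 wire format.
const MAGIC: u32 = 0x5449_4C44;
const VERSION: u16 = 1;
const HEADER_LEN: usize = 64;

/// Stand-in for BLAKE3 (assumption): 64-bit FNV-1a over all parts,
/// repeated four times to fill 32 bytes. Not cryptographic.
fn toy_checksum(parts: &[&[u8]]) -> [u8; 32] {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325;
    for part in parts {
        for &b in part.iter() {
            h ^= b as u64;
            h = h.wrapping_mul(0x0000_0100_0000_01b3);
        }
    }
    let mut out = [0u8; 32];
    for chunk in out.chunks_mut(8) {
        chunk.copy_from_slice(&h.to_le_bytes());
    }
    out
}

/// Build one batch (header + payload) in the assumed layout.
fn encode_batch(payload: &[u8]) -> Vec<u8> {
    let mut header = [0u8; HEADER_LEN];
    header[0..4].copy_from_slice(&MAGIC.to_le_bytes());
    header[4..6].copy_from_slice(&VERSION.to_le_bytes());
    header[8..12].copy_from_slice(&(payload.len() as u32).to_le_bytes());
    // Checksum covers header[0..32] || payload, stored in header[32..64].
    let sum = toy_checksum(&[&header[0..32], payload]);
    header[32..64].copy_from_slice(&sum);
    let mut out = header.to_vec();
    out.extend_from_slice(payload);
    out
}

/// Scan a segment image front to back. Returns the payloads of all valid
/// batches and the offset at which the file should be truncated.
fn scan_segment(data: &[u8]) -> (Vec<&[u8]>, usize) {
    let mut payloads = Vec::new();
    let mut offset = 0usize;
    let mut last_valid_offset = 0usize;
    while offset < data.len() {
        // Incomplete header: stop scanning.
        if data.len() - offset < HEADER_LEN {
            break;
        }
        let h = &data[offset..offset + HEADER_LEN];
        // Phase 1: magic, version, and bounds, before touching the payload.
        let magic = u32::from_le_bytes([h[0], h[1], h[2], h[3]]);
        let version = u16::from_le_bytes([h[4], h[5]]);
        let payload_len = u32::from_le_bytes([h[8], h[9], h[10], h[11]]) as usize;
        let end = offset + HEADER_LEN + payload_len;
        if magic != MAGIC || version != VERSION || end > data.len() {
            break;
        }
        // Phase 2: recompute the checksum and compare with the stored one.
        let payload = &data[offset + HEADER_LEN..end];
        let sum = toy_checksum(&[&h[0..32], payload]);
        if sum[..] != h[32..64] {
            break;
        }
        payloads.push(payload);
        last_valid_offset = end;
        offset = end;
    }
    (payloads, last_valid_offset)
}

fn main() {
    let mut segment = encode_batch(b"batch-1");
    segment.extend(encode_batch(b"batch-2"));
    let committed = segment.len();
    segment.extend_from_slice(&[0xAB; 32]); // torn write: half a header
    let (payloads, last_valid_offset) = scan_segment(&segment);
    assert_eq!(payloads.len(), 2);
    assert_eq!(last_valid_offset, committed);
}
```

In a file-backed version, a caller that finds `last_valid_offset` smaller than the file length would then apply `File::set_len(last_valid_offset)` followed by `File::sync_all()`, as described in the notes above. Returning the offset instead of truncating inside the loop keeps the scan free of side effects, which is one possible design choice and not necessarily the one taken by `WalReader`.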