tidaldb/docs/planning/milestone-1/phase-2/task-03-crash-recovery-and-replay.md

# Task 03: Crash Recovery and Replay

## Context

**Milestone:** 1 -- Signal Engine
**Phase:** m1p2 -- Write-Ahead Log
**Status:** COMPLETE
**Depends On:** Task 01 (wire format, `list_segments`, `WalError`)
**Blocks:** Task 04 (WalHandle calls `recover()` during `open()`)
**Complexity:** M

## Objective

Implement the WAL reader and crash recovery procedure. On startup, `recover()` reads the checkpoint metadata to find the last-materialized sequence number and identifies the segments that may contain events after that checkpoint. It scans those segments forward, validating each batch with a two-phase check (magic + bounds, then BLAKE3), truncates at the first invalid batch, and returns all post-checkpoint events for replay by the signal materializer.

This is the component that makes the WAL a durable source of truth. If the process crashes mid-write, recovery must detect the partial batch, truncate it, and return only committed events. No committed event is ever lost; no partial write is ever presented as committed.

## Requirements

- `recover(wal_dir)` reads `checkpoint.meta` (or assumes checkpoint_seq=0 if absent), lists segments, and scans all batches after the checkpoint
- Two-phase batch validation:
  - Phase 1: verify magic bytes == `0x54494C44`, version == 1, `offset + 64 + payload_length <= file_length`
  - Phase 2: read payload, compute `blake3(header[0..32] || payload)`, compare to stored checksum
  - On any failure: truncate the file at the last valid batch boundary, stop scanning
- `RecoveryResult` carries: `events: Vec<EventRecord>` (post-checkpoint), `next_seq: u64` (for writer to continue from)
- Recovery is sequential (not parallel) — segments are scanned in ascending first-seq order
- Recovery time target: < 10ms for a WAL with 63 MB of content (one checkpoint interval at 100K events/sec)
- `WalReader` provides an iterator over batches in a single segment file

## Technical Design

### Recovery Procedure

```
recover(wal_dir):
1. checkpoint_seq = CheckpointManager::read(wal_dir)?.unwrap_or(0)
2. segments = list_segments(wal_dir)?  -- sorted ascending by first_seq
3. filter to segments that may contain events > checkpoint_seq
4. for each relevant segment:
   a. open file for reading
   b. offset = 0
   c. last_valid_offset = 0
   d. while offset < file_length:
        i.  if file_length - offset < 64: truncate file at last_valid_offset, break  (incomplete header)
        ii. read 64 bytes as header candidate
        iii. Phase 1: verify magic + version; verify offset+64+payload_len <= file_length
        iv. if Phase 1 fails: truncate file at last_valid_offset, break
        v.  read payload_len bytes
        vi. Phase 2: compute blake3(header[0..32] || payload); compare to stored checksum
        vii. if Phase 2 fails: truncate file at last_valid_offset, break
        viii. decode event records from payload
        ix. filter events where seq > checkpoint_seq, add to result
        x.  last_valid_offset = offset + 64 + payload_len
        xi. offset = last_valid_offset
5. return RecoveryResult { events, next_seq }  -- next_seq = highest seq seen + 1
```
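
Steps 4.ix and 5 (filtering by checkpoint and deriving `next_seq`) can be sketched in isolation as follows. The `Event`, `Replay`, and `select_for_replay` names are hypothetical stand-ins for `EventRecord`, `RecoveryResult`, and the logic inside `recover()`. Starting the running maximum at `checkpoint_seq` is an assumption, chosen so that an empty WAL with no checkpoint yields `next_seq == 1`.

```rust
/// Hypothetical stand-in for `EventRecord`; only the sequence number matters here.
#[derive(Debug, Clone, PartialEq)]
struct Event {
    seq: u64,
}

/// Hypothetical stand-in for `RecoveryResult`.
struct Replay {
    events: Vec<Event>,
    next_seq: u64,
}

/// Keep only events after the checkpoint, but track the highest sequence
/// number across every scanned event so the writer can continue from it.
fn select_for_replay(scanned: impl IntoIterator<Item = Event>, checkpoint_seq: u64) -> Replay {
    // Assumption: with nothing scanned, the writer continues right after the checkpoint.
    let mut highest = checkpoint_seq;
    let mut events = Vec::new();
    for ev in scanned {
        highest = highest.max(ev.seq);
        if ev.seq > checkpoint_seq {
            events.push(ev);
        }
    }
    Replay { events, next_seq: highest + 1 }
}
```

Events at or before the checkpoint still pass through the loop, which is why they must be decoded even though they are not replayed.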

### API

```rust
pub struct RecoveryResult {
    /// Events since the last checkpoint, in order.
    pub events: Vec<EventRecord>,
    /// The sequence number the writer should assign to the next new event.
    pub next_seq: u64,
}

/// Recover from crash. Scans WAL segments after the last checkpoint.
/// Truncates any partially-written trailing batch.
///
/// Returns the events to replay and the next sequence number to use.
pub fn recover(wal_dir: &Path) -> Result<RecoveryResult, WalError>;

/// Iterator over batches in a single WAL segment.
pub struct WalReader { /* file, current offset */ }

impl WalReader {
    pub fn open(path: &Path) -> Result<Self, WalError>;
    /// Read the next batch. Returns Ok(None) at EOF.
    /// Returns Err on validation failure (caller should truncate).
    pub fn next_batch(&mut self) -> Result<Option<(BatchHeader, Vec<EventRecord>)>, WalError>;
    /// File position of the last successfully-read batch's end.
    pub fn last_valid_offset(&self) -> u64;
}
```

## Test Strategy

### Unit Tests

```rust
#[test]
fn recover_empty_wal_returns_no_events() {
    let dir = tempfile::tempdir().unwrap();
    let result = recover(dir.path()).unwrap();
    assert!(result.events.is_empty());
    assert_eq!(result.next_seq, 1);
}

#[test]
fn recover_returns_events_after_checkpoint() {
    // Write 10 events (seq 1-10), checkpoint at seq 10, write 5 more (seq 11-15), close
    // recover() should return only the 5 post-checkpoint events
}

#[test]
fn recover_truncates_partial_header() {
    // Write valid batch, then write 32 bytes of garbage (half a header)
    // recover() should truncate the file at the end of the valid batch
    // Events from the valid batch are returned; partial header is gone
}

#[test]
fn recover_truncates_bad_checksum() {
    // Write valid batch, then write batch with corrupted payload
    // (flip a byte in the payload but leave header intact)
    // recover() should detect Phase 2 failure and truncate
}

#[test]
fn recover_truncates_short_payload() {
    // Write header with payload_len=210 but only write 100 bytes of payload
    // recover() Phase 1 detects payload doesn't fit, truncates
}

#[test]
fn recover_spans_multiple_segments() {
    // Write events across 2 segments, no checkpoint
    // recover() returns all events in order, next_seq is correct
}

#[test]
fn recover_after_segment_rotation_with_checkpoint() {
    // Seg 1: events 1-100; Seg 2: events 101-200; checkpoint at 100
    // recover() skips seg 1 (all before checkpoint), returns events from seg 2
}
```

## Acceptance Criteria

- [x] `recover()` reads checkpoint sequence from `checkpoint.meta` if it exists, defaults to 0
- [x] Segments are scanned in ascending first-seq order
- [x] Phase 1 validation: magic bytes, version, and payload bounds checked before reading payload
- [x] Phase 2 validation: BLAKE3 computed and compared — corrupted batches cause truncation
- [x] Truncation removes the partial/corrupted batch from the file (not just skips it)
- [x] `RecoveryResult.events` contains exactly the events after `checkpoint_seq`, in order
- [x] `RecoveryResult.next_seq` is one greater than the highest sequence number seen
- [x] `WalHandle::open()` returns replayed events as `Vec<SignalEvent>` for the materializer

## Research References

- [docs/research/tidaldb_wal.md](../../../research/tidaldb_wal.md) — Section 4 (crash detection: Approach 3, two-phase validation algorithm), Section 5 (checkpoint + truncation: recovery algorithm pseudocode, recovery time estimate ~8ms for 63 MB at BLAKE3's 8 GB/sec)
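
The ~8 ms estimate can be reproduced with a one-line calculation. This assumes decimal units (1 MB = 10^6 bytes, 1 GB = 10^9 bytes) and that BLAKE3 hashing dominates recovery time; file I/O and event decoding are ignored.

```latex
t_{\text{recover}} \approx \frac{63 \times 10^{6}\ \text{B}}{8 \times 10^{9}\ \text{B/s}}
                   = 7.875 \times 10^{-3}\ \text{s}
                   \approx 7.9\ \text{ms} < 10\ \text{ms}
```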

## Implementation Notes

- File truncation uses `File::set_len(last_valid_offset)` followed by `File::sync_all()` to flush the metadata update.
- Truncation writes are rare (only after crashes). No performance concern.
- Events with `seq <= checkpoint_seq` are still parsed during recovery (to advance the offset) but not added to `result.events`. This is necessary to correctly determine `next_seq` even when the checkpoint falls mid-segment.
- The dedup window is populated from replayed events after `recover()` returns (`WalHandle::open()` calls `dedup.populate_from_events(recovery.events)`). This is sequential, so there is no race with the writer thread, which has not started yet.
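
The truncation step from the first note can be sketched with the standard library alone. The function name `truncate_segment` is hypothetical. The length guard is an added assumption: `File::set_len` extends a file when the requested length is larger than the current one, so the sketch only shrinks and never grows the segment.

```rust
use std::fs::OpenOptions;
use std::io;
use std::path::Path;

/// Cut a segment file back to the end of its last valid batch.
/// Hypothetical helper; the real code may hold the file handle open instead.
fn truncate_segment(path: &Path, last_valid_offset: u64) -> io::Result<()> {
    // `set_len` requires a handle opened for writing.
    let file = OpenOptions::new().write(true).open(path)?;
    // Only shrink: `set_len` with a larger value would zero-extend the file.
    if file.metadata()?.len() > last_valid_offset {
        file.set_len(last_valid_offset)?;
        // Flush data and metadata so the new length survives another crash.
        file.sync_all()?;
    }
    Ok(())
}
```

Calling it twice with the same offset is harmless, which keeps recovery idempotent if the process crashes again during recovery.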