tidaldb/docs/planning/milestone-1/phase-2/task-03-crash-recovery-and-replay.md
jordan 29400d48db feat: implement Milestone 1 phases 1-3 — schema, WAL, and storage layer
2026-02-20 16:43:24 -07:00


Task 03: Crash Recovery and Replay

Context

  • Milestone: 1 -- Signal Engine
  • Phase: m1p2 -- Write-Ahead Log
  • Status: COMPLETE
  • Depends On: Task 01 (wire format, list_segments, WalError)
  • Blocks: Task 04 (WalHandle calls recover() during open())
  • Complexity: M

Objective

Implement the WAL reader and crash recovery procedure. On startup, recover() reads the checkpoint metadata to find the last-materialized sequence number and identifies the segments that may contain events after that checkpoint. It then scans those segments forward, validating each batch with a two-phase check (magic + bounds, then BLAKE3), truncates at the first invalid batch, and returns all post-checkpoint events for replay by the signal materializer.

This is the component that makes the WAL a durable source of truth. If the process crashes mid-write, recovery must detect the partial batch, truncate it, and return only committed events. No committed event is ever lost; no partial write is ever presented as committed.

Requirements

  • recover(wal_dir) reads checkpoint.meta (or assumes checkpoint_seq=0 if absent), lists segments, and scans all batches after the checkpoint
  • Two-phase batch validation:
    • Phase 1: verify magic bytes == 0x54494C44, version == 1, offset + 64 + payload_length <= file_length
    • Phase 2: read payload, compute blake3(header[0..32] || payload), compare to stored checksum
    • On any failure: truncate the file at the last valid batch boundary, stop scanning
  • RecoveryResult carries: events: Vec<EventRecord> (post-checkpoint), next_seq: u64 (for writer to continue from)
  • Recovery is sequential (not parallel) — segments are scanned in ascending first-seq order
  • Recovery time target: < 10ms for a WAL with 63 MB of content (one checkpoint interval at 100K events/sec)
  • WalReader provides an iterator over batches in a single segment file
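
The Phase 1 check listed above can be sketched as a pure function over a 64-byte header candidate. The field offsets and endianness used here (magic at bytes 0..4, version at 4..6, payload_length at 8..12, all little-endian) are assumptions made for illustration only; the real layout is defined by the Task 01 wire format, and phase1_check is a hypothetical name.

```rust
/// Magic value from the requirements above.
const MAGIC: u32 = 0x5449_4C44;
/// Only version 1 is accepted.
const VERSION: u16 = 1;
/// Fixed header size in bytes.
const HEADER_LEN: u64 = 64;

/// Hypothetical Phase 1 check: returns the payload length if the header is
/// plausible, or None if the batch must be treated as invalid.
/// ASSUMPTION: field offsets and little-endian encoding are illustrative,
/// not taken from the Task 01 wire format.
pub fn phase1_check(header: &[u8; 64], offset: u64, file_length: u64) -> Option<u32> {
    let magic = u32::from_le_bytes([header[0], header[1], header[2], header[3]]);
    let version = u16::from_le_bytes([header[4], header[5]]);
    let payload_len = u32::from_le_bytes([header[8], header[9], header[10], header[11]]);

    if magic != MAGIC || version != VERSION {
        return None;
    }

    // offset + 64 + payload_length <= file_length, using checked arithmetic
    // so a garbage offset or length cannot overflow.
    let end = offset
        .checked_add(HEADER_LEN)?
        .checked_add(u64::from(payload_len))?;
    if end > file_length {
        return None;
    }
    Some(payload_len)
}
```

Because this check touches only the header and the file length, a torn write is usually rejected before any payload bytes are read or hashed.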

Technical Design

Recovery Procedure

recover(wal_dir):
1. checkpoint_seq = CheckpointManager::read(wal_dir)?.unwrap_or(0)
2. segments = list_segments(wal_dir)?  -- sorted ascending by first_seq
3. filter to segments that may contain events > checkpoint_seq
4. for each relevant segment:
   a. open file for reading
   b. offset = 0
   c. last_valid_offset = 0
   d. while offset < file_length:
        i.  if file_length - offset < 64: truncate file at last_valid_offset, break  (incomplete header)
        ii. read 64 bytes as header candidate
        iii. Phase 1: verify magic + version; verify offset+64+payload_len <= file_length
        iv. if Phase 1 fails: truncate file at last_valid_offset, break
        v.  read payload_len bytes
        vi. Phase 2: compute blake3(header[0..32] || payload); compare to stored checksum
        vii. if Phase 2 fails: truncate file at last_valid_offset, break
        viii. decode event records from payload
        ix. filter events where seq > checkpoint_seq, add to result
        x.  last_valid_offset = offset + 64 + payload_len
        xi. offset = last_valid_offset
5. return RecoveryResult { events, next_seq }
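
As a rough illustration of step 4, the inner loop can be written against an in-memory image of one segment. This is a sketch under stated assumptions, not the tidalDB implementation: the header layout (magic at 0..4, version at 4..6, payload_length at 8..12, little-endian, checksum at 32..64) is assumed, scan_segment and ScanOutcome are hypothetical names, and the checksum is passed in as a parameter standing in for blake3(header[0..32] || payload) so the block needs no external crate.

```rust
const MAGIC: u32 = 0x5449_4C44;
const HEADER_LEN: usize = 64;

/// Outcome of scanning one segment image.
pub struct ScanOutcome {
    /// Byte ranges of the payloads of all valid batches, in file order.
    pub payloads: Vec<std::ops::Range<usize>>,
    /// End of the last valid batch; the file is truncated here if it is longer.
    pub last_valid_offset: usize,
}

/// Scan `data` forward and stop at the first batch that fails validation.
/// ASSUMPTION: header field offsets are illustrative; `checksum` is a
/// stand-in for BLAKE3 over header[0..32] followed by the payload.
pub fn scan_segment(
    data: &[u8],
    checksum: impl Fn(&[u8], &[u8]) -> [u8; 32],
) -> ScanOutcome {
    let mut payloads = Vec::new();
    let mut offset = 0usize;

    // Fewer than 64 bytes left means an incomplete header: stop scanning.
    while data.len() - offset >= HEADER_LEN {
        let header = &data[offset..offset + HEADER_LEN];

        // Phase 1: magic, version, and payload bounds.
        let magic = u32::from_le_bytes([header[0], header[1], header[2], header[3]]);
        let version = u16::from_le_bytes([header[4], header[5]]);
        let payload_len =
            u32::from_le_bytes([header[8], header[9], header[10], header[11]]) as usize;
        let start = offset + HEADER_LEN;
        let end = match start.checked_add(payload_len) {
            Some(e) if magic == MAGIC && version == 1 && e <= data.len() => e,
            _ => break,
        };

        // Phase 2: recompute the checksum and compare with the stored one.
        let expected = checksum(&header[..32], &data[start..end]);
        if expected[..] != header[32..64] {
            break;
        }

        payloads.push(start..end);
        // The offset only advances past validated batches, so it doubles
        // as last_valid_offset.
        offset = end;
    }

    ScanOutcome { payloads, last_valid_offset: offset }
}
```

A caller would compare last_valid_offset with the file length and truncate when they differ. Decoding event records from each payload range and filtering by checkpoint_seq are left out of the sketch.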

API

pub struct RecoveryResult {
    /// Events since the last checkpoint, in order.
    pub events: Vec<EventRecord>,
    /// The sequence number the writer should assign to the next new event.
    pub next_seq: u64,
}

/// Recover from crash. Scans WAL segments after the last checkpoint.
/// Truncates any partially-written trailing batch.
///
/// Returns the events to replay and the next sequence number to use.
pub fn recover(wal_dir: &Path) -> Result<RecoveryResult, WalError>;

/// Iterator over batches in a single WAL segment.
pub struct WalReader { /* file, current offset */ }

impl WalReader {
    pub fn open(path: &Path) -> Result<Self, WalError>;
    /// Read the next batch. Returns Ok(None) at EOF.
    /// Returns Err on validation failure (caller should truncate).
    pub fn next_batch(&mut self) -> Result<Option<(BatchHeader, Vec<EventRecord>)>, WalError>;
    /// File position of the last successfully-read batch's end.
    pub fn last_valid_offset(&self) -> u64;
}

Test Strategy

Unit Tests

#[test]
fn recover_empty_wal_returns_no_events() {
    let dir = tempfile::tempdir().unwrap();
    let result = recover(dir.path()).unwrap();
    assert!(result.events.is_empty());
    assert_eq!(result.next_seq, 1);
}

#[test]
fn recover_returns_events_after_checkpoint() {
    // Write 10 events, checkpoint at seq 10, write 5 more, close
    // recover() should return only the 5 post-checkpoint events (seq 11-15)
}

#[test]
fn recover_truncates_partial_header() {
    // Write valid batch, then write 32 bytes of garbage (half a header)
    // recover() should truncate the file at the end of the valid batch
    // Events from the valid batch are returned; partial header is gone
}

#[test]
fn recover_truncates_bad_checksum() {
    // Write valid batch, then write batch with corrupted payload
    // (flip a byte in the payload but leave header intact)
    // recover() should detect Phase 2 failure and truncate
}

#[test]
fn recover_truncates_short_payload() {
    // Write header with payload_len=210 but only write 100 bytes of payload
    // recover() Phase 1 detects payload doesn't fit, truncates
}

#[test]
fn recover_spans_multiple_segments() {
    // Write events across 2 segments, no checkpoint
    // recover() returns all events in order, next_seq is correct
}

#[test]
fn recover_after_segment_rotation_with_checkpoint() {
    // Seg 1: events 1-100; Seg 2: events 101-200; checkpoint at 100
    // recover() skips seg 1 (all at or before the checkpoint), returns events from seg 2
}
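
The segment-skipping behaviour exercised by the last test (step 3 of the recovery procedure) could look roughly like the following. The sketch assumes that sequence numbers are contiguous across segments, so segment i ends just before the first_seq of segment i + 1; relevant_segments is a hypothetical helper, not part of the documented API.

```rust
/// Given the first_seq of each segment, sorted ascending, return the indices
/// of segments that may hold events with seq > checkpoint_seq.
/// ASSUMPTION: a segment's last seq is (next segment's first_seq - 1).
pub fn relevant_segments(first_seqs: &[u64], checkpoint_seq: u64) -> Vec<usize> {
    (0..first_seqs.len())
        .filter(|&i| match first_seqs.get(i + 1) {
            // Segment i ends at next_first - 1; keep it only if that
            // exceeds the checkpoint.
            Some(&next_first) => next_first > checkpoint_seq.saturating_add(1),
            // The last segment has no known upper bound, so always scan it.
            None => true,
        })
        .collect()
}
```

With first_seqs = [1, 101] and checkpoint_seq = 100, only index 1 is returned, which matches the expectation that seg 1 is skipped. A checkpoint of 50 keeps both segments, because seg 1 then still holds events 51-100.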

Acceptance Criteria

  • recover() reads checkpoint sequence from checkpoint.meta if it exists, defaults to 0
  • Segments are scanned in ascending first-seq order
  • Phase 1 validation: magic bytes and payload bounds checked before reading payload
  • Phase 2 validation: BLAKE3 computed and compared — corrupted batches cause truncation
  • Truncation removes the partial/corrupted batch from the file (not just skips it)
  • RecoveryResult.events contains exactly the events after checkpoint_seq, in order
  • RecoveryResult.next_seq is one greater than the highest sequence number seen
  • WalHandle::open() returns replayed events as Vec<SignalEvent> for the materializer
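
The next_seq criterion can be illustrated with a small helper. Using checkpoint_seq as the floor when no events are seen is an assumption made here so that an empty WAL yields 1, as the first unit test expects; compute_next_seq is a hypothetical name.

```rust
/// One greater than the highest sequence number seen during the scan.
/// ASSUMPTION: checkpoint_seq acts as the floor when no events are seen.
pub fn compute_next_seq(checkpoint_seq: u64, seen: impl IntoIterator<Item = u64>) -> u64 {
    seen.into_iter().fold(checkpoint_seq, u64::max) + 1
}
```

The input must include events at or before the checkpoint as well, since they are parsed during the scan even though they are not returned for replay.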

Research References

  • docs/research/tidaldb_wal.md — Section 4 (crash detection: Approach 3, two-phase validation algorithm), Section 5 (checkpoint + truncation: recovery algorithm pseudocode, recovery time estimate ~8ms for 63 MB at BLAKE3's 8 GB/sec)
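
The ~8 ms figure cited above follows from dividing the WAL size by the hashing throughput. The arithmetic below assumes decimal units and that BLAKE3 hashing dominates; file reads and record decoding are ignored.

```latex
t_{\text{hash}} \approx \frac{63\ \text{MB}}{8\ \text{GB/s}}
  = \frac{63 \times 10^{6}\ \text{B}}{8 \times 10^{9}\ \text{B/s}}
  \approx 7.9\ \text{ms} < 10\ \text{ms}
```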

Implementation Notes

  • File truncation uses File::set_len(last_valid_offset) followed by File::sync_all() to flush the metadata update.
  • Truncation is rare (it happens only after a crash), so its cost is not a performance concern.
  • Events with seq <= checkpoint_seq are still parsed during recovery (to advance the offset) but not added to result.events. This is necessary to correctly determine next_seq even when the checkpoint falls mid-segment.
  • The dedup window is populated from replayed events after recover() returns (WalHandle::open() calls dedup.populate_from_events(recovery.events)). This step is sequential: the writer thread has not started yet, so there is no race with it.
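
The truncation step from the first note can be sketched with the standard library alone. truncate_segment is a hypothetical helper name; File::set_len and File::sync_all are the std calls named in that note.

```rust
use std::fs::OpenOptions;
use std::io;
use std::path::Path;

/// Cut a segment file back to `last_valid_offset` and make the change durable.
pub fn truncate_segment(path: &Path, last_valid_offset: u64) -> io::Result<()> {
    // set_len requires a handle opened for writing.
    let file = OpenOptions::new().write(true).open(path)?;
    // Drop everything after the last valid batch boundary.
    file.set_len(last_valid_offset)?;
    // sync_all flushes file data and metadata, including the new length.
    file.sync_all()
}
```

Opening the file separately for writing keeps the read path free of write permissions until a truncation is actually needed.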