Task 03: Crash Recovery and Replay
Context
Milestone: 1 -- Signal Engine
Phase: m1p2 -- Write-Ahead Log
Status: COMPLETE
Depends On: Task 01 (wire format, list_segments, WalError)
Blocks: Task 04 (WalHandle calls recover() during open())
Complexity: M
Objective
Implement the WAL reader and crash recovery procedure. On startup, recover() reads the checkpoint metadata to find the last-materialized sequence number, identifies the segments that contain events after that checkpoint, and scans them forward, validating each batch with a two-phase check (magic + bounds, then BLAKE3). It truncates at the first invalid batch and returns all post-checkpoint events for replay by the signal materializer.
This is the component that makes the WAL a durable source of truth. If the process crashes mid-write, recovery must detect the partial batch, truncate it, and return only committed events. No committed event is ever lost; no partial write is ever presented as committed.
Requirements
- recover(wal_dir) reads checkpoint.meta (or assumes checkpoint_seq = 0 if absent), lists segments, and scans all batches after the checkpoint
- Two-phase batch validation:
  - Phase 1: verify magic bytes == 0x54494C44, version == 1, offset + 64 + payload_length <= file_length
  - Phase 2: read payload, compute blake3(header[0..32] || payload), compare to stored checksum
  - On any failure: truncate the file at the last valid batch boundary, stop scanning
- RecoveryResult carries: events: Vec<EventRecord> (post-checkpoint), next_seq: u64 (for the writer to continue from)
- Recovery is sequential (not parallel): segments are scanned in ascending first-seq order
- Recovery time target: < 10 ms for a WAL with 63 MB of content (one checkpoint interval at 100K events/sec)
- WalReader provides an iterator over batches in a single segment file
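The Phase 1 check listed above can be sketched as a pure function over a 64-byte header candidate. The magic value, version, and bounds rule come from the requirements; the field offsets inside the header, the big-endian encoding, and the names phase1 and Phase1Error are assumptions made for illustration, since the wire format is defined in Task 01 and not reproduced here.

```rust
/// Fixed header size from the requirements.
const HEADER_LEN: u64 = 64;
/// Magic value from the requirements (ASCII "TILD").
const MAGIC: u32 = 0x5449_4C44;
/// Only supported wire-format version.
const VERSION: u16 = 1;

/// Why a header candidate was rejected in Phase 1 (hypothetical type).
#[derive(Debug, PartialEq)]
enum Phase1Error {
    BadMagic,
    BadVersion,
    PayloadOutOfBounds,
}

/// Phase 1: cheap structural checks that run before any payload is read.
/// Returns the payload length on success.
///
/// ASSUMED layout (not taken from the spec): magic in bytes 0..4,
/// version in bytes 4..6, payload_len in bytes 8..12, all big-endian.
fn phase1(header: &[u8; 64], offset: u64, file_len: u64) -> Result<u64, Phase1Error> {
    let magic = u32::from_be_bytes([header[0], header[1], header[2], header[3]]);
    if magic != MAGIC {
        return Err(Phase1Error::BadMagic);
    }

    let version = u16::from_be_bytes([header[4], header[5]]);
    if version != VERSION {
        return Err(Phase1Error::BadVersion);
    }

    let payload_len =
        u64::from(u32::from_be_bytes([header[8], header[9], header[10], header[11]]));

    // Checked arithmetic: a garbage length must not wrap around and
    // accidentally pass the bounds test.
    let end = offset
        .checked_add(HEADER_LEN)
        .and_then(|v| v.checked_add(payload_len));
    match end {
        Some(end) if end <= file_len => Ok(payload_len),
        _ => Err(Phase1Error::PayloadOutOfBounds),
    }
}
```

Running the bounds check before the payload read means a torn write is rejected without allocating or hashing a payload that may not exist on disk.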
Technical Design
Recovery Procedure
recover(wal_dir):
  1. checkpoint_seq = CheckpointManager::read(wal_dir)?.unwrap_or(0)
  2. segments = list_segments(wal_dir)?   -- sorted ascending by first_seq
  3. filter to segments that may contain events > checkpoint_seq
  4. for each relevant segment:
     a. open file for reading
     b. offset = 0
     c. last_valid_offset = 0
     d. while offset < file_length:
        i.    if file_length - offset < 64: truncate file at last_valid_offset, break (incomplete header)
        ii.   read 64 bytes as header candidate
        iii.  Phase 1: verify magic + version; verify offset + 64 + payload_len <= file_length
        iv.   if Phase 1 fails: truncate file at last_valid_offset, break
        v.    read payload_len bytes
        vi.   Phase 2: compute blake3(header[0..32] || payload); compare to stored checksum
        vii.  if Phase 2 fails: truncate file at last_valid_offset, break
        viii. decode event records from payload
        ix.   filter events where seq > checkpoint_seq, add to result
        x.    last_valid_offset = offset + 64 + payload_len
        xi.   offset = last_valid_offset
  5. return RecoveryResult { events, next_seq }   -- next_seq = highest seq seen + 1
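The offset bookkeeping in step 4d can be sketched over an in-memory segment. To keep the example free of external crates, the two-phase check is injected as a closure: validate is a hypothetical hook, not part of the planned API, and the real implementation would perform the magic, bounds, and BLAKE3 checks at that point. ScanOutcome and scan_segment are likewise illustrative names.

```rust
/// Fixed header size from the spec.
const HEADER_LEN: usize = 64;

/// Outcome of scanning one segment's bytes (hypothetical type).
#[derive(Debug, PartialEq)]
struct ScanOutcome {
    /// Byte ranges (start, end) of each valid batch, header included.
    batches: Vec<(usize, usize)>,
    /// Where the segment should be truncated; equals the input length
    /// when every byte belongs to a valid batch.
    last_valid_offset: usize,
}

/// Walk a segment front to back, stopping at the first invalid batch.
///
/// `validate(segment, offset)` stands in for the two-phase check: it
/// returns the payload length when the batch at `offset` is valid,
/// or `None` when either phase fails.
fn scan_segment<F>(segment: &[u8], validate: F) -> ScanOutcome
where
    F: Fn(&[u8], usize) -> Option<usize>,
{
    let mut offset = 0;
    let mut batches = Vec::new();

    while offset < segment.len() {
        // Fewer than 64 bytes left: an incomplete header from a torn write.
        if segment.len() - offset < HEADER_LEN {
            break;
        }
        let payload_len = match validate(segment, offset) {
            Some(n) => n,
            None => break,
        };
        let end = offset + HEADER_LEN + payload_len;
        // Defensive re-check in case the hook reports an impossible length.
        if end > segment.len() {
            break;
        }
        batches.push((offset, end));
        // `offset` only advances past validated batches, so it doubles
        // as last_valid_offset from the pseudocode.
        offset = end;
    }

    ScanOutcome { batches, last_valid_offset: offset }
}
```

A caller would compare last_valid_offset with the segment length and truncate the file only when the two differ.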
API
pub struct RecoveryResult {
    /// Events since the last checkpoint, in order.
    pub events: Vec<EventRecord>,
    /// The sequence number the writer should assign to the next new event.
    pub next_seq: u64,
}

/// Recover from crash. Scans WAL segments after the last checkpoint.
/// Truncates any partially-written trailing batch.
///
/// Returns the events to replay and the next sequence number to use.
pub fn recover(wal_dir: &Path) -> Result<RecoveryResult, WalError>;

/// Iterator over batches in a single WAL segment.
pub struct WalReader { /* file, current offset */ }

impl WalReader {
    pub fn open(path: &Path) -> Result<Self, WalError>;

    /// Read the next batch. Returns Ok(None) at EOF.
    /// Returns Err on validation failure (caller should truncate).
    pub fn next_batch(&mut self) -> Result<Option<(BatchHeader, Vec<EventRecord>)>, WalError>;

    /// File position of the last successfully-read batch's end.
    pub fn last_valid_offset(&self) -> u64;
}
Test Strategy
Unit Tests
#[test]
fn recover_empty_wal_returns_no_events() {
    let dir = tempfile::tempdir().unwrap();
    let result = recover(dir.path()).unwrap();
    assert!(result.events.is_empty());
    assert_eq!(result.next_seq, 1);
}

#[test]
fn recover_returns_events_after_checkpoint() {
    // Write 10 events, checkpoint at seq 5, write 5 more, close
    // recover() should return only the 5 post-checkpoint events
}

#[test]
fn recover_truncates_partial_header() {
    // Write valid batch, then write 32 bytes of garbage (half a header)
    // recover() should truncate the file at the end of the valid batch
    // Events from the valid batch are returned; partial header is gone
}

#[test]
fn recover_truncates_bad_checksum() {
    // Write valid batch, then write batch with corrupted payload
    // (flip a byte in the payload but leave header intact)
    // recover() should detect Phase 2 failure and truncate
}

#[test]
fn recover_truncates_short_payload() {
    // Write header with payload_len=210 but only write 100 bytes of payload
    // recover() Phase 1 detects payload doesn't fit, truncates
}

#[test]
fn recover_spans_multiple_segments() {
    // Write events across 2 segments, no checkpoint
    // recover() returns all events in order, next_seq is correct
}

#[test]
fn recover_after_segment_rotation_with_checkpoint() {
    // Seg 1: events 1-100; Seg 2: events 101-200; checkpoint at 100
    // recover() skips seg 1 (all before checkpoint), returns events from seg 2
}
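The segment-skipping behavior exercised by the last test can be sketched from the sorted first-seq values alone. This is a minimal sketch, assuming sequence numbers are contiguous across segments, so that a segment's highest possible seq is one less than the next segment's first seq. The function name relevant_segments is hypothetical.

```rust
/// Return the indices of segments that may hold events with
/// seq > checkpoint_seq.
///
/// `first_seqs` must be sorted ascending, one entry per segment.
/// ASSUMPTION: sequence numbers are contiguous across segments.
fn relevant_segments(first_seqs: &[u64], checkpoint_seq: u64) -> Vec<usize> {
    (0..first_seqs.len())
        .filter(|&i| match first_seqs.get(i + 1) {
            // Segment i ends just before the next one begins, so its
            // highest possible seq is next_first - 1. It is relevant
            // only if that exceeds the checkpoint.
            Some(&next_first) => next_first > checkpoint_seq.saturating_add(1),
            // The last segment has no known upper bound: always scan it.
            None => true,
        })
        .collect()
}
```

With segments starting at 1 and 101 and a checkpoint at 100, only the second segment is returned. With a checkpoint at 99, both are returned, because event 100 still lives in the first segment.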
Acceptance Criteria
- recover() reads checkpoint sequence from checkpoint.meta if it exists, defaults to 0
- Segments are scanned in ascending first-seq order
- Phase 1 validation: magic bytes and payload bounds checked before reading payload
- Phase 2 validation: BLAKE3 computed and compared; corrupted batches cause truncation
- Truncation removes the partial/corrupted batch from the file (not just skips it)
- RecoveryResult.events contains exactly the events after checkpoint_seq, in order
- RecoveryResult.next_seq is one greater than the highest sequence number seen
- WalHandle::open() returns replayed events as Vec<SignalEvent> for the materializer
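The two RecoveryResult criteria can be sketched together: every decoded event advances the highest-seen sequence number, but only post-checkpoint events are kept. The EventRecord below is a one-field stand-in for the real record type, and collect_post_checkpoint is a hypothetical helper. Seeding the maximum with checkpoint_seq is an assumption; it makes an empty WAL with no checkpoint yield next_seq = 1, matching the first unit test.

```rust
/// Minimal stand-in for the WAL's event record; only `seq` matters here.
#[derive(Debug, Clone, PartialEq)]
struct EventRecord {
    seq: u64,
}

/// Fold validated batches into (post-checkpoint events, next_seq).
///
/// ASSUMPTION: the checkpoint itself counts as "seen", so the maximum
/// starts at `checkpoint_seq` rather than 0.
fn collect_post_checkpoint(
    batches: &[Vec<EventRecord>],
    checkpoint_seq: u64,
) -> (Vec<EventRecord>, u64) {
    let mut events = Vec::new();
    let mut highest = checkpoint_seq;

    for batch in batches {
        for ev in batch {
            // Track every event, including ones at or before the
            // checkpoint, so next_seq is right when the checkpoint
            // falls mid-segment.
            highest = highest.max(ev.seq);
            if ev.seq > checkpoint_seq {
                events.push(ev.clone());
            }
        }
    }

    (events, highest + 1)
}
```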
Research References
- docs/research/tidaldb_wal.md — Section 4 (crash detection: Approach 3, two-phase validation algorithm), Section 5 (checkpoint + truncation: recovery algorithm pseudocode, recovery time estimate ~8ms for 63 MB at BLAKE3's 8 GB/sec)
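The ~8 ms estimate follows from dividing the WAL size by the hashing throughput. A rough check, assuming decimal units (1 MB = 10^6 bytes, 1 GB = 10^9 bytes) and that BLAKE3 hashing dominates the scan:

```latex
t_{\text{hash}} \approx \frac{63 \times 10^{6}\ \text{B}}{8 \times 10^{9}\ \text{B/s}}
               = 7.875 \times 10^{-3}\ \text{s} \approx 7.9\ \text{ms}
```

Under these assumptions about 2 ms of the 10 ms target remains for file I/O and record decoding.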
Implementation Notes
- File truncation uses File::set_len(last_valid_offset) followed by File::sync_all() to flush the metadata update.
- Truncation writes are rare (only after crashes). No performance concern.
- Events with seq <= checkpoint_seq are still parsed during recovery (to advance the offset) but not added to result.events. This is necessary to correctly determine next_seq even when the checkpoint falls mid-segment.
- The dedup window is populated from replayed events after recover() returns (WalHandle::open() calls dedup.populate_from_events(recovery.events)). This is sequential: there is no race with the writer thread, which hasn't started yet.
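The truncation step from the first note can be sketched with the standard library alone. The helper name truncate_segment is hypothetical. The guard against growing the file is an added precaution, since File::set_len extends a file with zeros when given a larger length.

```rust
use std::fs::OpenOptions;
use std::io;
use std::path::Path;

/// Cut a segment back to `last_valid_offset` and make the new length durable.
fn truncate_segment(path: &Path, last_valid_offset: u64) -> io::Result<()> {
    // Open for writing without the truncate flag, so existing bytes survive.
    let file = OpenOptions::new().write(true).open(path)?;

    // Only ever shrink: set_len would zero-extend if asked for more.
    if last_valid_offset < file.metadata()?.len() {
        file.set_len(last_valid_offset)?;
        // sync_all flushes file metadata (including length) as well as data.
        file.sync_all()?;
    }
    Ok(())
}
```

For example, calling truncate_segment on a 100-byte file with offset 40 leaves a 40-byte file, and a later call with offset 80 leaves it unchanged.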