# Task 03: Crash Recovery and Replay

## Context

**Milestone:** 1 -- Signal Engine
**Phase:** m1p2 -- Write-Ahead Log
**Status:** COMPLETE
**Depends On:** Task 01 (wire format, `list_segments`, `WalError`)
**Blocks:** Task 04 (WalHandle calls `recover()` during `open()`)
**Complexity:** M

## Objective

Implement the WAL reader and crash recovery procedure. On startup, `recover()` reads the checkpoint metadata to find the last-materialized sequence number, identifies segments containing events after that checkpoint, scans them forward validating each batch with a two-phase check (magic + bounds, then BLAKE3), truncates at the first invalid batch, and returns all post-checkpoint events for replay by the signal materializer.

This is the component that makes the WAL a durable source of truth. If the process crashes mid-write, recovery must detect the partial batch, truncate it, and return only committed events. No committed event is ever lost; no partial write is ever presented as committed.

## Requirements

- `recover(wal_dir)` reads `checkpoint.meta` (or assumes `checkpoint_seq = 0` if absent), lists segments, and scans all batches after the checkpoint
- Two-phase batch validation:
  - Phase 1: verify magic bytes == `0x54494C44`, version == 1, `offset + 64 + payload_length <= file_length`
  - Phase 2: read payload, compute `blake3(header[0..32] || payload)`, compare to stored checksum
  - On any failure: truncate the file at the last valid batch boundary, stop scanning
- `RecoveryResult` carries: `events: Vec<EventRecord>` (post-checkpoint), `next_seq: u64` (for writer to continue from)
- Recovery is sequential (not parallel): segments are scanned in ascending first-seq order
- Recovery time target: < 10 ms for a WAL with 63 MB of content (one checkpoint interval at 100K events/sec)
- `WalReader` provides an iterator over batches in a single segment file

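The Phase 1 check from the requirements can be sketched as follows. This is a minimal illustration, not the actual implementation: the field offsets (magic at bytes 0..4, version at 4..6, payload length at 8..12) and the big-endian encoding are assumptions, since the real layout is defined by the Task 01 wire format. `phase1_check` and `Phase1Error` are hypothetical names.

```rust
/// Size of the fixed batch header, per the bounds check in the requirements.
const HEADER_LEN: u64 = 64;
/// Magic value from the requirements.
const MAGIC: u32 = 0x5449_4C44;
const VERSION: u16 = 1;

#[derive(Debug, PartialEq)]
enum Phase1Error {
    BadMagic,
    BadVersion,
    PayloadOutOfBounds,
}

/// Phase 1: cheap structural checks that read only the header.
/// Returns the payload length on success.
///
/// ASSUMED layout (not taken from the spec): magic = bytes 0..4,
/// version = bytes 4..6, payload_length = bytes 8..12, all big-endian.
fn phase1_check(header: &[u8; 64], offset: u64, file_len: u64) -> Result<u32, Phase1Error> {
    let magic = u32::from_be_bytes([header[0], header[1], header[2], header[3]]);
    if magic != MAGIC {
        return Err(Phase1Error::BadMagic);
    }
    let version = u16::from_be_bytes([header[4], header[5]]);
    if version != VERSION {
        return Err(Phase1Error::BadVersion);
    }
    let payload_len = u32::from_be_bytes([header[8], header[9], header[10], header[11]]);
    // checked_add guards against overflow if the length field is garbage.
    let end = offset
        .checked_add(HEADER_LEN)
        .and_then(|v| v.checked_add(u64::from(payload_len)));
    match end {
        Some(end) if end <= file_len => Ok(payload_len),
        _ => Err(Phase1Error::PayloadOutOfBounds),
    }
}
```

Running the bounds check before reading the payload means a torn write with a nonsense length field is rejected without attempting a large read or hash.
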
## Technical Design

### Recovery Procedure

```
recover(wal_dir):
  1. checkpoint_seq = CheckpointManager::read(wal_dir)?.unwrap_or(0)
  2. segments = list_segments(wal_dir)?   -- sorted ascending by first_seq
  3. filter to segments that may contain events > checkpoint_seq
  4. for each relevant segment:
     a. open file for reading
     b. offset = 0
     c. last_valid_offset = 0
     d. while offset < file_length:
        i.    if file_length - offset < 64: truncate file at last_valid_offset, break (incomplete header)
        ii.   read 64 bytes as header candidate
        iii.  Phase 1: verify magic + version; verify offset+64+payload_len <= file_length
        iv.   if Phase 1 fails: truncate file at last_valid_offset, break
        v.    read payload_len bytes
        vi.   Phase 2: compute blake3(header[0..32] || payload); compare to stored checksum
        vii.  if Phase 2 fails: truncate file at last_valid_offset, break
        viii. decode event records from payload
        ix.   filter events where seq > checkpoint_seq, add to result; track highest seq seen
        x.    last_valid_offset = offset + 64 + payload_len
        xi.   offset = last_valid_offset
  5. return RecoveryResult { events, next_seq }   -- next_seq = highest seq seen + 1
```

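The inner scan loop (step 4d) can be sketched over an in-memory byte slice. This is an illustration under stated assumptions rather than the real reader: the header field offsets and the checksum position (bytes 32..64) are assumed, and the hash is passed in as a closure standing in for `blake3(header[0..32] || payload)` so the sketch needs only the standard library. `scan_segment` is a hypothetical name.

```rust
const HEADER_LEN: usize = 64;
const MAGIC: u32 = 0x5449_4C44;

/// Scans one segment held in memory. Returns the (start, end) byte range of
/// every valid payload plus `last_valid_offset`, the point at which the
/// caller should truncate the file if it is shorter than the file length.
///
/// ASSUMED layout: magic = bytes 0..4, version = 4..6, payload_length = 8..12
/// (big-endian), stored checksum = bytes 32..64.
fn scan_segment(
    data: &[u8],
    checksum: impl Fn(&[u8], &[u8]) -> [u8; 32],
) -> (Vec<(usize, usize)>, usize) {
    let mut payloads = Vec::new();
    let mut offset = 0;
    let mut last_valid_offset = 0;
    while offset < data.len() {
        // Incomplete header: stop, leaving last_valid_offset at the last good batch.
        if data.len() - offset < HEADER_LEN {
            break;
        }
        let header = &data[offset..offset + HEADER_LEN];
        // Phase 1: magic, version, and payload bounds.
        let magic = u32::from_be_bytes([header[0], header[1], header[2], header[3]]);
        let version = u16::from_be_bytes([header[4], header[5]]);
        let payload_len =
            u32::from_be_bytes([header[8], header[9], header[10], header[11]]) as usize;
        let start = offset + HEADER_LEN;
        if magic != MAGIC || version != 1 || payload_len > data.len() - start {
            break;
        }
        // Phase 2: checksum over header[0..32] || payload.
        let payload = &data[start..start + payload_len];
        let computed = checksum(&header[0..32], payload);
        if computed[..] != header[32..64] {
            break;
        }
        payloads.push((start, start + payload_len));
        last_valid_offset = start + payload_len;
        offset = last_valid_offset;
    }
    (payloads, last_valid_offset)
}
```

Because `last_valid_offset` only advances after both phases pass, every early exit leaves it pointing at the end of the last committed batch.
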
### API

```rust
pub struct RecoveryResult {
    /// Events since the last checkpoint, in order.
    pub events: Vec<EventRecord>,
    /// The sequence number the writer should assign to the next new event.
    pub next_seq: u64,
}

/// Recover from crash. Scans WAL segments after the last checkpoint.
/// Truncates any partially-written trailing batch.
///
/// Returns the events to replay and the next sequence number to use.
pub fn recover(wal_dir: &Path) -> Result<RecoveryResult, WalError>;

/// Iterator over batches in a single WAL segment.
pub struct WalReader { /* file, current offset */ }

impl WalReader {
    pub fn open(path: &Path) -> Result<Self, WalError>;
    /// Read the next batch. Returns Ok(None) at EOF.
    /// Returns Err on validation failure (caller should truncate).
    pub fn next_batch(&mut self) -> Result<Option<(BatchHeader, Vec<EventRecord>)>, WalError>;
    /// File position of the last successfully-read batch's end.
    pub fn last_valid_offset(&self) -> u64;
}
```

## Test Strategy

### Unit Tests

```rust
#[test]
fn recover_empty_wal_returns_no_events() {
    let dir = tempfile::tempdir().unwrap();
    let result = recover(dir.path()).unwrap();
    assert!(result.events.is_empty());
    assert_eq!(result.next_seq, 1);
}

#[test]
fn recover_returns_events_after_checkpoint() {
    // Write 5 events, checkpoint at seq 5, write 5 more (10 total), close
    // recover() should return only the 5 post-checkpoint events
}

#[test]
fn recover_truncates_partial_header() {
    // Write valid batch, then write 32 bytes of garbage (half a header)
    // recover() should truncate the file at the end of the valid batch
    // Events from the valid batch are returned; partial header is gone
}

#[test]
fn recover_truncates_bad_checksum() {
    // Write valid batch, then write batch with corrupted payload
    // (flip a byte in the payload but leave header intact)
    // recover() should detect Phase 2 failure and truncate
}

#[test]
fn recover_truncates_short_payload() {
    // Write header with payload_len=210 but only write 100 bytes of payload
    // recover() Phase 1 detects payload doesn't fit, truncates
}

#[test]
fn recover_spans_multiple_segments() {
    // Write events across 2 segments, no checkpoint
    // recover() returns all events in order, next_seq is correct
}

#[test]
fn recover_after_segment_rotation_with_checkpoint() {
    // Seg 1: events 1-100; Seg 2: events 101-200; checkpoint at 100
    // recover() skips seg 1 (all before checkpoint), returns events from seg 2
}
```

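The segment-rotation case depends on deciding which segments can be skipped without opening them. One possible way to make that decision from segment first-seq values alone is sketched below. It assumes each segment holds a contiguous range ending just before the next segment's first seq; the document does not state this, and `segments_to_scan` is a hypothetical helper.

```rust
/// Returns the indices of segments that may hold events with seq > checkpoint_seq.
///
/// `first_seqs` is the first sequence number of each segment, sorted ascending.
/// ASSUMPTION: segment i holds exactly the sequence numbers from first_seqs[i]
/// up to first_seqs[i + 1] - 1, so its last seq is known from its successor.
fn segments_to_scan(first_seqs: &[u64], checkpoint_seq: u64) -> Vec<usize> {
    (0..first_seqs.len())
        .filter(|&i| match first_seqs.get(i + 1) {
            // A later segment exists: keep this one only if its last seq
            // (next first_seq - 1) is past the checkpoint.
            Some(&next_first) => next_first.saturating_sub(1) > checkpoint_seq,
            // The newest segment has no known upper bound, so always scan it.
            None => true,
        })
        .collect()
}
```

With segments starting at 1 and 101 and a checkpoint at 100, only the second segment is selected, which matches the expectation in `recover_after_segment_rotation_with_checkpoint`.
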
## Acceptance Criteria

- [x] `recover()` reads checkpoint sequence from `checkpoint.meta` if it exists, defaults to 0
- [x] Segments are scanned in ascending first-seq order
- [x] Phase 1 validation: magic bytes and payload bounds checked before reading payload
- [x] Phase 2 validation: BLAKE3 computed and compared; corrupted batches cause truncation
- [x] Truncation removes the partial/corrupted batch from the file (not just skips it)
- [x] `RecoveryResult.events` contains exactly the events after `checkpoint_seq`, in order
- [x] `RecoveryResult.next_seq` is one greater than the highest sequence number seen
- [x] `WalHandle::open()` returns replayed events as `Vec<SignalEvent>` for the materializer

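The two `RecoveryResult` criteria can be illustrated with a small sketch. The `EventRecord` here is a stand-in with only a `seq` field, `build_result` is a hypothetical helper, and the fallback of `checkpoint_seq + 1` when nothing newer is found is an assumption chosen to be consistent with the empty-WAL test (`next_seq == 1`).

```rust
/// Minimal stand-in for the real `EventRecord`; only `seq` matters here.
#[derive(Debug, Clone, PartialEq)]
struct EventRecord {
    seq: u64,
}

struct RecoveryResult {
    events: Vec<EventRecord>,
    next_seq: u64,
}

/// Builds the result from every valid event decoded during the scan, in order.
/// ASSUMPTION: when no event is newer than the checkpoint, the writer resumes
/// at checkpoint_seq + 1 (which gives 1 for an empty WAL with no checkpoint).
fn build_result(scanned: Vec<EventRecord>, checkpoint_seq: u64) -> RecoveryResult {
    // Consider ALL scanned events, including those at or below the checkpoint,
    // so next_seq stays correct when the checkpoint falls mid-segment.
    let highest = scanned
        .iter()
        .map(|e| e.seq)
        .max()
        .unwrap_or(0)
        .max(checkpoint_seq);
    // Only events strictly after the checkpoint are handed back for replay.
    let events = scanned
        .into_iter()
        .filter(|e| e.seq > checkpoint_seq)
        .collect();
    RecoveryResult { events, next_seq: highest + 1 }
}
```
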
## Research References

- [docs/research/tidaldb_wal.md](../../../research/tidaldb_wal.md): Section 4 (crash detection: Approach 3, two-phase validation algorithm), Section 5 (checkpoint + truncation: recovery algorithm pseudocode, recovery time estimate ~8 ms for 63 MB at BLAKE3's 8 GB/sec)

## Implementation Notes

- File truncation uses `File::set_len(last_valid_offset)` followed by `File::sync_all()` to flush the metadata update.
- Truncation writes are rare (only after crashes). No performance concern.
- Events with `seq <= checkpoint_seq` are still parsed during recovery (to advance the offset) but not added to `result.events`. This is necessary to correctly determine `next_seq` even when the checkpoint falls mid-segment.
- The dedup window is populated from replayed events after `recover()` returns (`WalHandle::open()` calls `dedup.populate_from_events(recovery.events)`). This is sequential: there is no race with the writer thread, which has not started yet.
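
The truncation step described in the first note can be sketched with the standard library alone. `truncate_segment` is a hypothetical helper name; the real code may hold the file handle open across the scan instead of reopening it.

```rust
use std::fs::OpenOptions;
use std::io;
use std::path::Path;

/// Cuts a segment file back to `last_valid_offset` and syncs the change to disk.
fn truncate_segment(path: &Path, last_valid_offset: u64) -> io::Result<()> {
    // set_len requires a handle opened for writing.
    let file = OpenOptions::new().write(true).open(path)?;
    file.set_len(last_valid_offset)?;
    // sync_all flushes file data and metadata, including the new length.
    file.sync_all()
}
```
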