# Task 04: Deduplication, Checkpoint, and WalHandle Public API

## Context

**Milestone:** 1 -- Signal Engine
**Phase:** m1p2 -- Write-Ahead Log
**Status:** COMPLETE
**Depends On:** Task 02 (writer channel types), Task 03 (`recover()`)
**Blocks:** m1p4 (Signal Ledger uses `WalHandle` as its durability backend)
**Complexity:** M

## Objective

Deliver three components that complete the WAL:

1. **`DedupWindow`**: a double-buffered `HashSet` that detects duplicate signal events within a ~60-second window, keyed on the first 128 bits of each event's BLAKE3 hash. No false positives in practice (a 128-bit key makes accidental collisions negligible). Bounded memory.
2. **`CheckpointManager`**: reads and writes `checkpoint.meta`, the small 16-byte binary file that records the last-materialized sequence number. Enables recovery to skip already-materialized events.
3. **`WalHandle`**: the public API: `open()`, `append()`, `checkpoint()`, `truncate_before()`, `shutdown()`. The entry point for m1p4 (Signal Ledger) and m1p5 (Entity CRUD API).

## Requirements

### DedupWindow

- Two `HashSet` buffers, alternating every `window_duration` (default 30s)
- Effective dedup coverage: up to ~60 seconds (current + previous window), and never less than one full `window_duration`
- Hash key: first 16 bytes (128 bits) of `blake3::hash(event_bytes)` interpreted as `u128` little-endian
- `check_and_insert(event_bytes: &[u8]) -> bool`: returns `true` if duplicate
- `populate_from_events(events: Vec<EventRecord>)`: bulk-insert on startup from replayed events
- `maybe_rotate()`: called on each `check_and_insert`; when `rotation_time.elapsed() > window_duration` it swaps the buffers and clears the new `current` (the old `previous`), so the old `current` survives as `previous` for one more window

### CheckpointManager

- `checkpoint.meta` is a simple binary file: `[sequence: u64 LE][timestamp_nanos: u64 LE]` (16 bytes)
- `CheckpointManager::write(dir, seq, timestamp_nanos)`: writes atomically (write to temp file, fsync, rename)
- `CheckpointManager::read(dir) -> Result<Option<(u64, u64)>, WalError>`: `None` if the file does not exist
- File corruption (wrong size) returns `WalError::Corruption`

### WalHandle

- `WalHandle::open(config: WalConfig) -> Result<(Self, Vec<EventRecord>), WalError>`
  - Creates `{config.dir}/wal/` if absent
  - Calls `recover()`, initializes `DedupWindow` from replayed events
  - Finds or creates current segment
  - Spawns writer thread via `std::thread::Builder::new().name("tidaldb-wal-writer".into())`
  - Returns `(handle, replayed_events)`; replayed events are for m1p4 to feed into the signal materializer
- `WalHandle::append(event: SignalEvent) -> Result<u64, WalError>`: blocks until durably committed
- `WalHandle::checkpoint(seq: u64) -> Result<(), WalError>`: writes `checkpoint.meta` directly (no writer thread round-trip)
- `WalHandle::truncate_before(seq: u64) -> Result<(), WalError>`: dispatches `WalCommand::TruncateBefore` to writer thread
- `WalHandle::shutdown(self) -> Result<(), WalError>`: sends `WalCommand::Shutdown`, joins writer thread
- `impl Drop for WalHandle`: best-effort shutdown if not already shut down (ignores errors)
- `WalHandle: Send + Sync`, because the `Sender` is `Send + Sync`

## Technical Design

### DedupWindow

```rust
use std::collections::HashSet;
use std::time::{Duration, Instant};

pub struct DedupWindow {
    current: HashSet<u128>,
    previous: HashSet<u128>,
    rotation_time: Instant,
    window: Duration,
}

impl DedupWindow {
    pub fn new(window: Duration) -> Self {
        Self {
            current: HashSet::new(),
            previous: HashSet::new(),
            rotation_time: Instant::now(),
            window,
        }
    }

    /// Returns `true` if the event was already seen in either buffer.
    pub fn check_and_insert(&mut self, event_bytes: &[u8]) -> bool {
        self.maybe_rotate();
        let hash = self.hash(event_bytes);
        if self.current.contains(&hash) || self.previous.contains(&hash) {
            return true; // duplicate
        }
        self.current.insert(hash);
        false
    }

    /// Seeds the current buffer from events replayed during recovery.
    pub fn populate_from_events(&mut self, events: Vec<EventRecord>) {
        for e in events {
            let bytes = e.encode();
            let hash = self.hash(&bytes);
            self.current.insert(hash);
        }
    }

    /// First 16 bytes of the BLAKE3 digest, read as a little-endian u128.
    fn hash(&self, event_bytes: &[u8]) -> u128 {
        u128::from_le_bytes(
            blake3::hash(event_bytes).as_bytes()[..16].try_into().unwrap()
        )
    }

    fn maybe_rotate(&mut self) {
        if self.rotation_time.elapsed() > self.window {
            // After the swap, `current` holds the old `previous`, which is the
            // buffer that gets cleared; the old `current` lives on as `previous`.
            std::mem::swap(&mut self.current, &mut self.previous);
            self.current.clear();
            self.rotation_time = Instant::now();
        }
    }
}
```

**Memory at 10K events/sec:** ~300K entries/window * 16 bytes * 2 windows + HashSet overhead ≈ 19 MB

**Memory at 100K events/sec:** ~3M entries/window * 16 bytes * 2 windows + HashSet overhead ≈ 144 MB

### CheckpointManager

```rust
use std::path::Path;

pub struct CheckpointManager;

impl CheckpointManager {
    pub fn write(dir: &Path, seq: u64, timestamp_nanos: u64) -> Result<(), WalError> {
        // Write to temp file, fsync, rename (atomic on POSIX)
    }

    pub fn read(dir: &Path) -> Result<Option<(u64, u64)>, WalError> {
        // Returns None if checkpoint.meta does not exist
        // Returns Corruption if file is wrong size
    }
}
```

## Test Strategy

### DedupWindow Tests

```rust
#[test]
fn dedup_detects_duplicate() {
    let mut window = DedupWindow::new(Duration::from_secs(30));
    let bytes = [1u8; 21];
    assert!(!window.check_and_insert(&bytes)); // first: not duplicate
    assert!(window.check_and_insert(&bytes)); // second: duplicate
}

#[test]
fn dedup_different_events_not_duplicates() {
    let mut window = DedupWindow::new(Duration::from_secs(30));
    assert!(!window.check_and_insert(&[1u8; 21]));
    assert!(!window.check_and_insert(&[2u8; 21]));
}

#[test]
fn dedup_rotation_clears_old_events() {
    let mut window = DedupWindow::new(Duration::from_millis(10));
    let bytes = [1u8; 21];
    window.check_and_insert(&bytes);
    std::thread::sleep(Duration::from_millis(11)); // trigger rotation
    // After one rotation: event is in "previous" -- still caught
    assert!(window.check_and_insert(&bytes));
    std::thread::sleep(Duration::from_millis(11)); // trigger second rotation
    // After two rotations: event has left both windows
    assert!(!window.check_and_insert(&bytes));
}

#[test]
fn dedup_populate_from_events_seeds_correctly() {
    let mut window = DedupWindow::new(Duration::from_secs(30));
    let events = vec![EventRecord { entity_id: 1, signal_type: 1, weight: 1.0, timestamp_nanos: 0 }];
    window.populate_from_events(events);
    let bytes = EventRecord { entity_id: 1, signal_type: 1, weight: 1.0, timestamp_nanos: 0 }.encode();
    assert!(window.check_and_insert(&bytes)); // seeded event is detected as duplicate
}
```

### CheckpointManager Tests

```rust
#[test]
fn checkpoint_read_returns_none_if_absent() {
    let dir = tempfile::tempdir().unwrap();
    assert!(CheckpointManager::read(dir.path()).unwrap().is_none());
}

#[test]
fn checkpoint_write_then_read_roundtrip() {
    let dir = tempfile::tempdir().unwrap();
    CheckpointManager::write(dir.path(), 42, 1_700_000_000_000_000_000).unwrap();
    let result = CheckpointManager::read(dir.path()).unwrap().unwrap();
    assert_eq!(result.0, 42);
    assert_eq!(result.1, 1_700_000_000_000_000_000);
}

#[test]
fn checkpoint_overwrites_previous() {
    let dir = tempfile::tempdir().unwrap();
    CheckpointManager::write(dir.path(), 10, 0).unwrap();
    CheckpointManager::write(dir.path(), 20, 0).unwrap();
    let (seq, _) = CheckpointManager::read(dir.path()).unwrap().unwrap();
    assert_eq!(seq, 20);
}
```

### WalHandle Integration Tests

```rust
#[test]
fn open_creates_wal_directory() { /* ... */ }

#[test]
fn append_returns_sequence_number() { /* ... */ }

#[test]
fn dedup_returns_zero() { /* ... */ }

#[test]
fn checkpoint_writes_file() { /* ... */ }

#[test]
fn close_and_reopen_continues_sequence() { /* ... */ }

#[test]
fn drop_shuts_down_cleanly() {
    // WalHandle drops without explicit shutdown: no panic, no thread leak
    let dir = tempfile::tempdir().unwrap();
    let (handle, _) = WalHandle::open(test_config(dir.path())).unwrap();
    drop(handle); // should not hang or panic
}
```

## Acceptance Criteria

- [x] `DedupWindow::check_and_insert()` returns `true` for duplicates, `false` for new events
- [x] Duplicate detection covers a ~60-second window via double-buffer rotation
- [x] No false positives in practice: no legitimate events are silently dropped
- [x] `DedupWindow::populate_from_events()` seeds the window from WAL replay
- [x] `CheckpointManager::write()` is atomic (temp file + rename on POSIX)
- [x] `CheckpointManager::read()` returns `None` for a fresh WAL with no checkpoint
- [x] `WalHandle::open()` returns `(handle, replayed_events)` where `replayed_events` contains all events since the last checkpoint
- [x] `WalHandle::append()` returns `Ok(0)` for deduplicated events
- [x] `WalHandle::checkpoint()` does not go through the writer thread (no deadlock risk if writer is busy)
- [x] `WalHandle::truncate_before()` runs inside the writer thread (no race with active writes)
- [x] `impl Drop for WalHandle` provides best-effort shutdown without panicking

## Research References

- [docs/research/tidaldb_wal.md](../../../research/tidaldb_wal.md): Section 6 (Approach 3: bounded sliding window dedup, DedupWindow implementation, memory analysis), Section 5 (checkpoint.meta format, checkpoint process with atomic write)
- [thoughts.md](../../../../thoughts.md): Part II.1 (WAL convergence lessons from Engram/Citadel/StemeDB)

## Implementation Notes

- `blake3` is a direct dependency of the WAL module (`blake3 = "1"` in `Cargo.toml`). Already in the dependency plan per CODING_GUIDELINES.md.
- `crossbeam` is already a transitive dependency via fjall. Adding it as a direct dependency makes the version explicit and allows feature selection.
- The checkpoint file format (16 bytes, binary) is simpler than JSON and trivially parsed. The file carries no version field, so the current layout is an implicit version 1; if schema evolution is ever needed, introduce an explicit format version.
- `WalHandle` does not implement `Clone`, because there is exactly one writer thread. Wrap it in `Arc<WalHandle>` if it must be shared across threads.
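
The 16-byte checkpoint layout and the temp-file, fsync, rename sequence described above can be sketched with the standard library alone. This is a minimal illustration, not the shipped `CheckpointManager`: the free functions `write_checkpoint` and `read_checkpoint`, the `CheckpointError` enum (a stand-in for `WalError`), and the `checkpoint.meta.tmp` temp-file name are all assumptions made for the example.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

/// Hypothetical stand-in for `WalError`; only the cases this sketch needs.
#[derive(Debug)]
pub enum CheckpointError {
    Io(io::Error),
    Corruption(String),
}

impl From<io::Error> for CheckpointError {
    fn from(e: io::Error) -> Self {
        CheckpointError::Io(e)
    }
}

const CHECKPOINT_FILE: &str = "checkpoint.meta";
const CHECKPOINT_TMP: &str = "checkpoint.meta.tmp"; // assumed temp-file name

/// Encodes `[seq: u64 LE][timestamp_nanos: u64 LE]` and replaces the
/// checkpoint file via write-to-temp, fsync, rename.
pub fn write_checkpoint(dir: &Path, seq: u64, timestamp_nanos: u64) -> Result<(), CheckpointError> {
    let mut buf = [0u8; 16];
    buf[..8].copy_from_slice(&seq.to_le_bytes());
    buf[8..].copy_from_slice(&timestamp_nanos.to_le_bytes());

    let tmp = dir.join(CHECKPOINT_TMP);
    let mut file = File::create(&tmp)?;
    file.write_all(&buf)?;
    file.sync_all()?; // make the bytes durable before they become visible
    drop(file);

    // On POSIX the rename replaces the old file atomically.
    fs::rename(&tmp, dir.join(CHECKPOINT_FILE))?;
    Ok(())
}

/// Returns `Ok(None)` when no checkpoint exists, and a corruption error
/// when the file is not exactly 16 bytes long.
pub fn read_checkpoint(dir: &Path) -> Result<Option<(u64, u64)>, CheckpointError> {
    let bytes = match fs::read(dir.join(CHECKPOINT_FILE)) {
        Ok(b) => b,
        Err(e) if e.kind() == io::ErrorKind::NotFound => return Ok(None),
        Err(e) => return Err(e.into()),
    };
    if bytes.len() != 16 {
        return Err(CheckpointError::Corruption(format!(
            "checkpoint.meta: expected 16 bytes, found {}",
            bytes.len()
        )));
    }
    let mut seq = [0u8; 8];
    let mut ts = [0u8; 8];
    seq.copy_from_slice(&bytes[..8]);
    ts.copy_from_slice(&bytes[8..]);
    Ok(Some((u64::from_le_bytes(seq), u64::from_le_bytes(ts))))
}
```

Because readers only ever open `checkpoint.meta` and the rename swaps it in one step, a crash mid-write leaves either the previous checkpoint or the new one, never a partially written file under the real name.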