# Task 03: Background Syncer

## Delivers

`TextIndexSyncer`: a background thread that reads entity store writes (tracked via a sequence counter), feeds the Tantivy writer, commits on an interval (every 1000 docs or 2 seconds), and stores the last-processed sequence number in the commit payload. On crash recovery, it reads the commit payload to find the resume point and replays from the entity store.

## Complexity: L

## Dependencies

- Task 01 complete: `TextIndex`, `TextIndexConfig`
- Task 02 complete: `TextIndexWriter`, `commit(seq)`, `last_committed_seq()`
- `StorageEngine` trait with `scan_prefix()` for rebuild

## Technical Design

### Approach

Use an **outbox sequence counter** approach. The entity store write path increments a shared `AtomicU64` sequence counter each time an item is written. The syncer reads this counter and processes any items with sequence numbers above its last committed value.

For the initial m5p1 implementation, use a simpler approach:

1. The syncer runs on a configurable interval (default: 2 seconds)
2. On each tick, it scans ALL items from the entity store and re-indexes them if their sequence number is higher than the last committed one
3. A more sophisticated outbox pattern (WAL-based) is deferred to future work

This is correct but not optimally efficient: a full rebuild handles correctness, while partial updates would optimize throughput. For 10K items, a full rebuild takes < 1 second, so this is acceptable.
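The counter check described above could be sketched roughly as follows. This is only an illustration: the task specifies a shared `AtomicU64`, while the `WriteCounter` wrapper and the `record_write`, `current`, and `needs_sync` names are hypothetical.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical wrapper around the shared sequence counter.
pub struct WriteCounter {
    seq: AtomicU64,
}

impl WriteCounter {
    pub fn new() -> Self {
        Self { seq: AtomicU64::new(0) }
    }

    /// Called on the write path; returns the sequence number assigned to this write.
    pub fn record_write(&self) -> u64 {
        self.seq.fetch_add(1, Ordering::SeqCst) + 1
    }

    /// Highest sequence number handed out so far (0 if nothing was written).
    pub fn current(&self) -> u64 {
        self.seq.load(Ordering::SeqCst)
    }
}

/// True when at least one write is newer than the last committed sequence.
pub fn needs_sync(counter: &WriteCounter, last_committed_seq: u64) -> bool {
    counter.current() > last_committed_seq
}
```

Sequence numbers start at 1 in this sketch, so a `last_committed_seq` of 0 can stand for "nothing committed yet".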
Looking at the WAL sequence numbers and the entity store, the simplest correct approach is:

- Maintain a monotonic `write_counter: AtomicU64` in `TidalDb` that increments on each `write_item_with_metadata()` call
- The syncer checks if `write_counter > last_committed_seq` and, if so, does a full index rebuild
- This guarantees correctness at the cost of always doing a full rebuild (acceptable for 10K items)

For a more sophisticated approach with incremental updates, we track which entity IDs have been updated since the last commit via a concurrent queue:

```rust
// In TidalDb: a channel where item writes post (entity_id, write_seq) pairs
pending_text_updates: crossbeam::channel::Sender<(EntityId, u64)>
```

The syncer receives these pairs, batches them, and commits on interval. Use the channel approach: it is more efficient and correctly handles the outbox pattern. The `TextIndexSyncer` below carries each event as a `PendingWrite` struct, which also includes the item metadata and a delete flag.

### TextIndexSyncer

```rust
// tidal/src/text/syncer.rs

use std::sync::Arc;
use std::time::{Duration, Instant};

use crossbeam::channel::{Receiver, RecvTimeoutError};

use crate::schema::EntityId;
use crate::text::index::TextIndex;
use crate::storage::StorageEngine;
use crate::TidalError;

/// A pending write event: entity ID, item metadata, and the sequence number of the write.
#[derive(Debug, Clone)]
pub struct PendingWrite {
    pub entity_id: EntityId,
    // Key/value types are assumed to be `String`; match whatever
    // `TextIndexWriter::index_item` accepts.
    pub metadata: std::collections::HashMap<String, String>,
    pub seq: u64,
    /// If true, this is a delete (item was removed).
    pub deleted: bool,
}

/// Background syncer that feeds the Tantivy text index from the entity store outbox.
pub struct TextIndexSyncer {
    index: Arc<TextIndex>,
    rx: Receiver<PendingWrite>,
    commit_every_n: usize,
    commit_every: Duration,
}

impl TextIndexSyncer {
    pub fn new(
        index: Arc<TextIndex>,
        rx: Receiver<PendingWrite>,
        commit_every_n: usize,
        commit_every_secs: u64,
    ) -> Self {
        Self {
            index,
            rx,
            commit_every_n,
            commit_every: Duration::from_secs(commit_every_secs),
        }
    }

    /// Run the syncer loop. Blocks until the channel is closed (sender dropped).
    ///
    /// This is intended to run on a dedicated background thread.
    pub fn run(self) -> crate::Result<()> {
        let mut pending_count = 0usize;
        let mut last_commit_time = Instant::now();
        let mut last_seq = 0u64;

        let mut writer = self.index.writer_guard()?;

        loop {
            // Receive with a short timeout so the time-based commit check also runs when idle
            match self.rx.recv_timeout(Duration::from_millis(100)) {
                Ok(update) => {
                    if update.deleted {
                        writer.delete_item(update.entity_id);
                    } else {
                        writer.index_item(update.entity_id, &update.metadata)?;
                    }
                    last_seq = last_seq.max(update.seq);
                    pending_count += 1;
                }
                Err(RecvTimeoutError::Timeout) => {}
                Err(RecvTimeoutError::Disconnected) => {
                    // Channel closed: flush remaining
                    if pending_count > 0 {
                        writer.commit(last_seq)?;
                    }
                    break;
                }
            }

            // Commit if the batch is full or the interval has elapsed. Checking on every
            // iteration (not only on timeout) keeps a steady trickle of writes from
            // postponing the time-based commit indefinitely.
            let batch_full = pending_count >= self.commit_every_n;
            let interval_elapsed = last_commit_time.elapsed() >= self.commit_every;
            if pending_count > 0 && (batch_full || interval_elapsed) {
                writer.commit(last_seq)?;
                pending_count = 0;
                last_commit_time = Instant::now();
            }
        }

        Ok(())
    }
}
```

### Crash Recovery

On `TidalDb::open()` (or `TidalDb::builder().open()`), after opening the Tantivy index:

```rust
let last_committed = TextIndexWriter::last_committed_seq(&text_index.index);
// The syncer will process events with seq > last_committed
// Since entity_writes are tracked, items written after last_committed
// will be re-submitted to the syncer automatically on the first cycle.
```

For the initial implementation, implement `rebuild_from()`:

```rust
impl TextIndex {
    /// Rebuild the Tantivy index from the entity store.
    ///
    /// Scans all items in the entity store and re-indexes them.
    /// The last committed sequence is set to `last_seq` after rebuild.
    ///
    /// Used for crash recovery and initial setup.
    pub fn rebuild_from(
        &self,
        storage: &dyn crate::storage::StorageEngine,
        last_seq: u64,
    ) -> crate::Result<()> {
        let mut writer = self.writer_guard()?;

        // Delete all existing documents
        writer.writer.delete_all_documents()
            .map_err(|e| TidalError::Internal(format!("tantivy delete_all: {e}")))?;

        // Scan all items from entity store
        for entry in storage.scan_prefix(&[]) {
            let (key, value) = entry.map_err(TidalError::from)?;
            // Parse entity_id from key, metadata from value
            // ... decode and index each item
        }

        writer.commit(last_seq)
    }
}
```

### Integration in TidalDb

Add to `TidalDb`:

- `text_index: Option<Arc<TextIndex>>` (`None` if no text fields are declared in the schema)
- `text_tx: Option<Sender<PendingWrite>>` (channel to the syncer)
- `text_syncer_thread: Option<JoinHandle<crate::Result<()>>>` (background thread)

On `write_item_with_metadata()`, after the entity store write, send to `text_tx` if `Some`.

On `close()` / `shutdown()`, drop `text_tx` to signal the syncer to flush and exit, then join the thread.

## Acceptance Criteria

- [ ] `TextIndexSyncer` struct with `new()` and `run()` methods
- [ ] `PendingWrite` struct with `entity_id`, `metadata`, `seq`, `deleted` fields
- [ ] Syncer commits after `commit_every_n` documents
- [ ] Syncer commits after `commit_every_secs` timeout even with fewer documents
- [ ] Syncer flushes remaining documents when channel is closed (graceful shutdown)
- [ ] Each commit stores `last_seq` in the Tantivy commit payload
- [ ] `TextIndex::rebuild_from(storage, last_seq)` scans entity store and re-indexes all items
- [ ] `TidalDb` holds `Option<Arc<TextIndex>>` (`None` if schema has no text fields)
- [ ] `TidalDb::write_item_with_metadata()` sends `PendingWrite` to the syncer channel
- [ ] `TidalDb::close()` drops the channel sender and joins the syncer thread
- [ ] Unit tests: `syncer_commits_on_batch`, `syncer_commits_on_timeout`, `syncer_flushes_on_shutdown`, `rebuild_from_indexes_all_items`
- [ ] `cargo check`, `cargo fmt`, `cargo clippy -D warnings` all pass

## Test Strategy

```rust
#[test]
fn syncer_commits_on_batch() {
    let (tx, rx) = crossbeam::channel::unbounded();
    let idx = Arc::new(TextIndex::ephemeral(&test_fields()).unwrap());
    let syncer = TextIndexSyncer::new(Arc::clone(&idx), rx, 3, 60);

    let handle = std::thread::spawn(move || syncer.run());

    // Send 3 items → triggers the batch commit
    for i in 0..3u64 {
        tx.send(PendingWrite {
            entity_id: EntityId::new(i),
            metadata: make_meta(i),
            seq: i + 1,
            deleted: false,
        }).unwrap();
    }

    // Drop the sender so the syncer loop exits, then wait for the thread
    drop(tx);
    handle.join().unwrap().unwrap();

    // Assumes `reader` is a `tantivy::IndexReader`: reload it so the searcher
    // sees the commit regardless of the configured reload policy.
    idx.reader.reload().unwrap();

    // Verify all 3 items are in the index
    let searcher = idx.reader.searcher();
    assert_eq!(searcher.num_docs(), 3);
}
```
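The batch and shutdown-flush rules can also be checked without Tantivy by running the same commit policy against a stand-in that records commits. The following is a sketch under stated assumptions: `run_policy` and `commits_for` are hypothetical helpers, the channel carries bare sequence numbers instead of `PendingWrite`, and `std::sync::mpsc` is swapped in for crossbeam so the example has no external dependencies.

```rust
use std::sync::mpsc::{self, Receiver, RecvTimeoutError};
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the syncer loop: consumes sequence numbers and
/// returns the `last_seq` of every commit it would have issued.
pub fn run_policy(rx: Receiver<u64>, commit_every_n: usize, commit_every: Duration) -> Vec<u64> {
    let mut commits = Vec::new();
    let mut pending = 0usize;
    let mut last_seq = 0u64;
    let mut last_commit = Instant::now();

    loop {
        match rx.recv_timeout(Duration::from_millis(10)) {
            Ok(seq) => {
                last_seq = last_seq.max(seq);
                pending += 1;
            }
            Err(RecvTimeoutError::Timeout) => {}
            Err(RecvTimeoutError::Disconnected) => {
                // Sender dropped: flush whatever is still pending, then exit.
                if pending > 0 {
                    commits.push(last_seq);
                }
                return commits;
            }
        }

        // Commit when the batch is full or the interval has elapsed.
        let due = pending >= commit_every_n || last_commit.elapsed() >= commit_every;
        if pending > 0 && due {
            commits.push(last_seq);
            pending = 0;
            last_commit = Instant::now();
        }
    }
}

/// Queues `seqs`, drops the sender, and returns the recorded commits.
/// The 60-second interval keeps the time-based rule out of the picture.
pub fn commits_for(seqs: &[u64], commit_every_n: usize) -> Vec<u64> {
    let (tx, rx) = mpsc::channel();
    for &s in seqs {
        tx.send(s).unwrap();
    }
    drop(tx);
    run_policy(rx, commit_every_n, Duration::from_secs(60))
}
```

With a batch size of 3, five queued writes produce one batch commit at sequence 3 and one shutdown flush at sequence 5, which mirrors what `syncer_commits_on_batch` and `syncer_flushes_on_shutdown` are meant to assert against the real index.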