Task 02: Document Write/Delete
Delivers
TextIndexWriter with index_item(), delete_item(), field mapping (text → tokenized, keyword → raw), metadata-to-document conversion, and commit with sequence number payload.
Complexity: M
Dependencies
- Task 01 complete: TextIndex, TantivyFields, TextFieldDef, TextFieldType all exist
Technical Design
TextIndexWriter
// tidal/src/text/writer.rs
use std::collections::HashMap;
use std::sync::MutexGuard;

// The `doc!` macro and `schema::Value` are not used below, so they are not
// imported (unused imports would fail `clippy -D warnings`).
use tantivy::{Document, Term};

use crate::schema::EntityId;
use crate::text::index::{TantivyFields, TextIndex};
use crate::TidalError;
/// Write operations on the Tantivy text index.
///
/// This is a thin wrapper over the locked IndexWriter that converts tidalDB
/// metadata maps into Tantivy documents and handles entity_id-based deletes.
///
/// Thread safety: `TextIndexWriter` holds a `MutexGuard` on the IndexWriter.
/// Operations are batched in memory and only become visible after `commit()`.
pub struct TextIndexWriter<'a> {
    writer: MutexGuard<'a, tantivy::IndexWriter>,
    fields: &'a TantivyFields,
}
impl TextIndex {
    /// Lock the writer and return a `TextIndexWriter` for batch operations.
    ///
    /// # Errors
    /// Returns `TidalError::Internal` if the writer mutex is poisoned.
    pub fn writer_guard(&self) -> crate::Result<TextIndexWriter<'_>> {
        let writer = self
            .writer
            .lock()
            .map_err(|e| TidalError::Internal(format!("writer lock poisoned: {e}")))?;
        Ok(TextIndexWriter {
            writer,
            fields: &self.fields,
        })
    }
}
impl<'a> TextIndexWriter<'a> {
    /// Index or re-index an item.
    ///
    /// Tantivy has no atomic update: this deletes any existing document for
    /// `entity_id` and adds a fresh document. Both operations are in the same
    /// batch and become visible atomically on the next `commit()`.
    ///
    /// Only metadata keys that match a declared text field are indexed.
    /// Unknown keys are silently ignored.
    pub fn index_item(
        &mut self,
        entity_id: EntityId,
        metadata: &HashMap<String, String>,
    ) -> crate::Result<()> {
        // Delete any existing document for this entity_id
        let id_term = Term::from_field_u64(self.fields.entity_id, entity_id.get());
        self.writer.delete_term(id_term);

        // Build document
        let mut doc = Document::new();
        doc.add_u64(self.fields.entity_id, entity_id.get());
        for (key, tv_field, _field_type) in &self.fields.text_fields {
            if let Some(value) = metadata.get(key) {
                doc.add_text(*tv_field, value);
            }
        }

        self.writer
            .add_document(doc)
            .map_err(|e| TidalError::Internal(format!("tantivy add_document: {e}")))?;
        Ok(())
    }
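    /// Remove an item from the index.
    ///
    /// The delete takes effect on the next `commit()`.
    ///
    /// Note: `delete_term` matches through the inverted index, so the
    /// `entity_id` field must be declared INDEXED (not only FAST) in the
    /// Task 01 schema for this delete to find anything.
    pub fn delete_item(&mut self, entity_id: EntityId) {
        let id_term = Term::from_field_u64(self.fields.entity_id, entity_id.get());
        self.writer.delete_term(id_term);
    }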
    /// Commit all pending writes and store `last_seq` in the commit payload.
    ///
    /// This is the durability boundary: once `commit()` returns, the batch is
    /// persisted and becomes visible to searchers obtained after the
    /// `IndexReader` has reloaded.
    ///
    /// The `last_seq` is stored in the Tantivy commit payload via
    /// `PreparedCommit::set_payload()`.
    /// On crash recovery, read the last commit payload to find the resume point.
    ///
    /// # Errors
    /// Returns `TidalError::Internal` if the commit fails.
    pub fn commit(&mut self, last_seq: u64) -> crate::Result<()> {
        // The payload is set on the prepared commit, not on the writer itself.
        let mut prepared = self
            .writer
            .prepare_commit()
            .map_err(|e| TidalError::Internal(format!("tantivy prepare_commit: {e}")))?;
        prepared.set_payload(&last_seq.to_string());
        prepared
            .commit()
            .map_err(|e| TidalError::Internal(format!("tantivy commit: {e}")))?;
        Ok(())
    }
    /// Read the last committed sequence number from the Tantivy index payload.
    ///
    /// Returns 0 if no commit payload exists (fresh index or first run).
    pub fn last_committed_seq(index: &tantivy::Index) -> u64 {
        index
            .load_metas()
            .ok()
            .and_then(|meta| meta.payload)
            .and_then(|p| p.parse::<u64>().ok())
            .unwrap_or(0)
    }
}
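The batching rules above (delete-then-add in one batch, nothing visible until commit, sequence number carried in the commit payload) can be modelled without Tantivy. The following is a minimal std-only sketch, not the real writer: `BatchModel`, `Op`, and the plain `u64` ids are hypothetical stand-ins used only for illustration.

```rust
/// Hypothetical pending operation, standing in for Tantivy's internal batch.
enum Op {
    Delete(u64),
    Add(u64, String),
}

/// Hypothetical in-memory model of the writer's batch/commit behaviour.
#[derive(Default)]
struct BatchModel {
    pending: Vec<Op>,
    /// What a searcher would see. A Vec (not a map) so that a missing
    /// delete would show up as a duplicate, i.e. an orphan document.
    committed: Vec<(u64, String)>,
    /// Models the commit payload.
    payload: Option<String>,
}

impl BatchModel {
    /// Mirrors `index_item`: queue a delete, then an add, in the same batch.
    fn index_item(&mut self, id: u64, text: &str) {
        self.pending.push(Op::Delete(id));
        self.pending.push(Op::Add(id, text.to_string()));
    }

    /// Mirrors `delete_item`: queue a delete only.
    fn delete_item(&mut self, id: u64) {
        self.pending.push(Op::Delete(id));
    }

    /// Mirrors `commit`: apply the batch in order, then record `last_seq`.
    fn commit(&mut self, last_seq: u64) {
        for op in self.pending.drain(..) {
            match op {
                Op::Delete(id) => self.committed.retain(|(doc_id, _)| *doc_id != id),
                Op::Add(id, text) => self.committed.push((id, text)),
            }
        }
        self.payload = Some(last_seq.to_string());
    }

    /// Mirrors `last_committed_seq`: a missing or unparsable payload yields 0.
    fn last_committed_seq(&self) -> u64 {
        self.payload
            .as_deref()
            .and_then(|p| p.parse::<u64>().ok())
            .unwrap_or(0)
    }
}
```

In this model, re-indexing id 42 leaves the old text visible until the next `commit`, after which exactly one document for 42 remains. Note that the fallback to 0 also covers a payload that fails to parse, so a corrupted payload would look the same as a fresh index.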
Integration with TidalDb
Wire index_item calls into TidalDb::write_item_with_metadata() and write_item(). The text index is updated only after the entity store write succeeds (DB-primary consistency: the entity store wins, Tantivy is derived).
For m5p1, the index update is synchronous: a direct call immediately after each entity store write. Task-03 (Background Syncer) replaces this synchronous write with an async outbox pattern.
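The ordering rule for the synchronous path can be sketched as follows. This is an illustration under stated assumptions, not the TidalDb implementation: `EntityStore`, `DerivedIndex`, the free function, and the "empty metadata is an error" rule are all hypothetical and exist only to give the sketch a failure path.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the entity store (the source of truth).
#[derive(Default)]
struct EntityStore {
    items: HashMap<u64, HashMap<String, String>>,
}

impl EntityStore {
    /// Rejects empty metadata purely so the sketch has an error case.
    fn put(&mut self, id: u64, metadata: &HashMap<String, String>) -> Result<(), String> {
        if metadata.is_empty() {
            return Err("empty metadata".to_string());
        }
        self.items.insert(id, metadata.clone());
        Ok(())
    }
}

/// Hypothetical stand-in for the derived text index.
#[derive(Default)]
struct DerivedIndex {
    indexed: Vec<u64>,
}

impl DerivedIndex {
    fn index_item(&mut self, id: u64) {
        self.indexed.push(id);
    }
}

/// Synchronous write path: store first, index second.
fn write_item_with_metadata(
    store: &mut EntityStore,
    index: &mut DerivedIndex,
    id: u64,
    metadata: &HashMap<String, String>,
) -> Result<(), String> {
    // Primary write. An error here returns early and leaves the index untouched.
    store.put(id, metadata)?;
    // Derived write, reached only after the store write succeeded.
    index.index_item(id);
    Ok(())
}
```

Because the index call sits after the `?`, the index never contains an item the store rejected. The opposite gap (store written, index not yet updated) is the case the DB-primary rule tolerates, since the index is derived from the store.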
EntityId fast field access
EntityId must expose its inner u64 value. Check if EntityId::get() exists — if not, add it:
impl EntityId {
    pub fn get(&self) -> u64 {
        self.0 // or whatever the inner field is
    }
}
Acceptance Criteria
- TextIndexWriter::index_item(entity_id, metadata) builds a Tantivy document with entity_id fast field + all matching text fields
- Unknown metadata keys (not declared as text fields) are silently ignored
- delete_item(entity_id) issues a delete_term on the entity_id fast field
- index_item does delete-then-add (same batch): updating an item does not leave orphan documents
- commit(last_seq) calls set_payload(&last_seq.to_string()) before commit()
- TextIndexWriter::last_committed_seq(index) reads payload from last commit; returns 0 on fresh index
- TextIndex::writer_guard() acquires the mutex and returns TextIndexWriter
- Unit tests: index_and_search, delete_removes_document, update_replaces_document, commit_stores_sequence, last_committed_seq_returns_zero_fresh, last_committed_seq_returns_stored_value
- cargo check, cargo fmt, cargo clippy -- -D warnings all pass
Test Strategy
#[test]
fn index_and_search() {
    let fields = vec![
        TextFieldDef { key: "title".into(), field_type: TextFieldType::Text },
    ];
    let idx = TextIndex::ephemeral(&fields).unwrap();
    let mut w = idx.writer_guard().unwrap();

    let mut meta = HashMap::new();
    meta.insert("title".into(), "Rust programming language".into());
    w.index_item(EntityId::new(42), &meta).unwrap();
    w.commit(1).unwrap();

    // Searcher should find item 42 for query "Rust"
    idx.reader.reload().unwrap(); // force reader refresh in test
    let searcher = idx.reader.searcher();
    // Minimal check that the committed document is visible at all.
    assert_eq!(searcher.num_docs(), 1);
    // ... assert item found
}
#[test]
fn delete_removes_document() {
    // Write, commit, delete, commit, verify not found
}

#[test]
fn commit_stores_sequence() {
    let idx = TextIndex::ephemeral(&[]).unwrap(); // no text fields, just entity_id
    {
        // index_item with only entity_id field, commit(seq=42)
        let mut w = idx.writer_guard().unwrap();
        w.index_item(EntityId::new(1), &HashMap::new()).unwrap();
        w.commit(42).unwrap();
    } // writer guard released here

    let seq = TextIndexWriter::last_committed_seq(&idx.index);
    assert_eq!(seq, 42);
}
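The sequence round-trip tested above matters because the stored value is the resume point after a crash. As a rough sketch of how a caller might use it, assuming a hypothetical write log whose entries carry the same sequence numbers passed to commit (`LogEntry` and `entries_to_replay` are illustrative names, not part of this task):

```rust
/// Hypothetical write-log entry; `seq` is the number later passed to `commit`.
#[derive(Debug, Clone, PartialEq)]
struct LogEntry {
    seq: u64,
    entity_id: u64,
}

/// Entries that still need indexing after a restart: everything strictly
/// newer than the last sequence number stored in the commit payload.
///
/// Assumes sequence numbers start at 1, so that 0 (the value returned for a
/// fresh index) means "nothing committed yet" and replays the whole log.
fn entries_to_replay(log: &[LogEntry], last_committed_seq: u64) -> Vec<LogEntry> {
    log.iter()
        .filter(|entry| entry.seq > last_committed_seq)
        .cloned()
        .collect()
}
```

Replaying an entry twice is harmless under the delete-then-add behaviour of index_item, which is why a strict "greater than" filter is enough in this sketch.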