# Task 02: Document Write/Delete

## Delivers

`TextIndexWriter` with `index_item()`, `delete_item()`, field mapping (text → tokenized, keyword → raw), metadata-to-document conversion, and commit with sequence number payload.

## Complexity: M

## Dependencies

- Task 01 complete: `TextIndex`, `TantivyFields`, `TextFieldDef`, `TextFieldType` all exist

## Technical Design

### TextIndexWriter

```rust
// tidal/src/text/writer.rs

use std::collections::HashMap;
use std::sync::MutexGuard;

// NOTE: `Document` is a concrete struct in older Tantivy releases. In 0.22+
// it is a trait and the concrete type is `TantivyDocument`; adjust to the
// version pinned in Cargo.toml.
use tantivy::{Document, Term};

use crate::schema::EntityId;
use crate::text::index::{TantivyFields, TextIndex};
use crate::TidalError;

/// Write operations on the Tantivy text index.
///
/// This is a thin wrapper over the locked `IndexWriter` that converts tidalDB
/// metadata maps into Tantivy documents and handles `entity_id`-based deletes.
///
/// Thread safety: `TextIndexWriter` holds a `MutexGuard` on the `IndexWriter`
/// for as long as it lives, so other writers block until it is dropped.
/// Operations are batched in memory and only become visible after `commit()`.
pub struct TextIndexWriter<'a> {
    writer: MutexGuard<'a, tantivy::IndexWriter>,
    fields: &'a TantivyFields,
}

impl TextIndex {
    /// Lock the writer and return a `TextIndexWriter` for batch operations.
    ///
    /// # Errors
    /// Returns `TidalError::Internal` if the writer mutex is poisoned.
    pub fn writer_guard(&self) -> crate::Result<TextIndexWriter<'_>> {
        let writer = self
            .writer
            .lock()
            .map_err(|e| TidalError::Internal(format!("writer lock poisoned: {e}")))?;
        Ok(TextIndexWriter {
            writer,
            fields: &self.fields,
        })
    }
}

impl<'a> TextIndexWriter<'a> {
    /// Index or re-index an item.
    ///
    /// Tantivy has no atomic update: this deletes any existing document for
    /// `entity_id` and adds a fresh document. Both operations are in the same
    /// batch and become visible atomically on the next `commit()`.
    ///
    /// Only metadata keys that match a declared text field are indexed.
    /// Unknown keys are silently ignored.
    ///
    /// # Errors
    /// Returns `TidalError::Internal` if Tantivy rejects the document.
    pub fn index_item(
        &mut self,
        entity_id: EntityId,
        metadata: &HashMap<String, String>,
    ) -> crate::Result<()> {
        // Delete any existing document for this entity_id
        let id_term = Term::from_field_u64(self.fields.entity_id, entity_id.get());
        self.writer.delete_term(id_term);

        // Build document
        let mut doc = Document::new();
        doc.add_u64(self.fields.entity_id, entity_id.get());

        // Tokenized vs. raw handling is decided by the schema (Task 01), so the
        // field type is not needed when adding values here.
        for (key, tv_field, _field_type) in &self.fields.text_fields {
            if let Some(value) = metadata.get(key) {
                doc.add_text(*tv_field, value);
            }
        }

        self.writer
            .add_document(doc)
            .map_err(|e| TidalError::Internal(format!("tantivy add_document: {e}")))?;

        Ok(())
    }

    /// Remove an item from the index.
    ///
    /// The delete takes effect on the next `commit()`. `delete_term` matches
    /// through the inverted index, so the `entity_id` field must be declared
    /// `INDEXED` (not only `FAST`) in the schema from Task 01.
    pub fn delete_item(&mut self, entity_id: EntityId) {
        let id_term = Term::from_field_u64(self.fields.entity_id, entity_id.get());
        self.writer.delete_term(id_term);
    }

    /// Commit all pending writes and store `last_seq` in the commit payload.
    ///
    /// This is the durability boundary: after `commit()` returns, all indexed
    /// documents are persisted and become visible to searchers obtained from
    /// `IndexReader::searcher()` once the reader has reloaded.
    ///
    /// The `last_seq` is stored in the Tantivy commit payload via `set_payload()`.
    /// On crash recovery, read the last commit payload to find the resume point.
    ///
    /// # Errors
    /// Returns `TidalError::Internal` if the commit fails.
    pub fn commit(&mut self, last_seq: u64) -> crate::Result<()> {
        // The payload is set on a `PreparedCommit`, not on `IndexWriter` itself.
        let mut prepared = self
            .writer
            .prepare_commit()
            .map_err(|e| TidalError::Internal(format!("tantivy prepare_commit: {e}")))?;
        prepared.set_payload(&last_seq.to_string());
        prepared
            .commit()
            .map_err(|e| TidalError::Internal(format!("tantivy commit: {e}")))?;
        Ok(())
    }

    /// Read the last committed sequence number from the Tantivy index payload.
    ///
    /// Returns 0 if no commit payload exists (fresh index or first run) or if
    /// the payload cannot be read or parsed.
    pub fn last_committed_seq(index: &tantivy::Index) -> u64 {
        index
            .load_metas()
            .ok()
            .and_then(|meta| meta.payload)
            .and_then(|p| p.parse::<u64>().ok())
            .unwrap_or(0)
    }
}
```
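
The batching contract in the doc comments above (delete-then-add queued together, nothing visible until `commit()`) can be illustrated without Tantivy. The following is a minimal sketch using only the standard library; `BatchModel` and all of its methods are hypothetical stand-ins invented for illustration, not Tantivy or tidalDB APIs. Visible documents are kept in a `Vec` rather than a map so that an add without a preceding delete would leave a duplicate, which is the orphan case `index_item` is meant to avoid.

```rust
/// A pending operation, buffered until `commit()`.
enum Op {
    Delete(u64),
    Add(u64, String),
}

/// Toy model of a batched index writer (hypothetical, not Tantivy's API).
#[derive(Default)]
pub struct BatchModel {
    pending: Vec<Op>,
    visible: Vec<(u64, String)>, // documents a searcher could see
    payload: Option<String>,     // mirrors the commit payload
}

impl BatchModel {
    /// Delete-then-add, both queued in the same batch.
    pub fn index_item(&mut self, id: u64, text: &str) {
        self.pending.push(Op::Delete(id));
        self.pending.push(Op::Add(id, text.to_owned()));
    }

    /// Queue a delete for every document carrying `id`.
    pub fn delete_item(&mut self, id: u64) {
        self.pending.push(Op::Delete(id));
    }

    /// Apply all queued operations in order, then record the sequence number.
    pub fn commit(&mut self, last_seq: u64) {
        for op in self.pending.drain(..) {
            match op {
                Op::Delete(id) => self.visible.retain(|(d, _)| *d != id),
                Op::Add(id, text) => self.visible.push((id, text)),
            }
        }
        self.payload = Some(last_seq.to_string());
    }

    /// All visible document texts for `id` (more than one means an orphan).
    pub fn texts(&self, id: u64) -> Vec<&str> {
        self.visible
            .iter()
            .filter(|(d, _)| *d == id)
            .map(|(_, t)| t.as_str())
            .collect()
    }

    /// The payload stored by the most recent commit, if any.
    pub fn payload(&self) -> Option<&str> {
        self.payload.as_deref()
    }
}
```

In this model, re-indexing an id leaves the old text visible until the next `commit()`, after which exactly one document remains for that id. The `update_replaces_document` unit test listed in the acceptance criteria is expected to check the same property against the real index.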

### Integration with TidalDb

Wire `index_item` calls into `TidalDb::write_item_with_metadata()` and `write_item()`. Update the text index only **after** the entity store write succeeds (DB-primary consistency: the entity store wins, Tantivy is derived).

For m5p1, keep the index update synchronous: call the text index directly after each successful entity store write. Task 03 (Background Syncer) replaces this synchronous call with an async outbox pattern.
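
The store-first ordering can be sketched as follows. This is a standard-library-only illustration under stated assumptions: `EntityStore`, `DerivedIndex`, `WriteError`, and this free-function form of `write_item_with_metadata` are hypothetical stand-ins, not the real `TidalDb` types. The choice to surface an index error while keeping the stored row is also an assumption; the design above only fixes the ordering.

```rust
use std::collections::HashMap;

type Metadata = HashMap<String, String>;

/// Hypothetical stand-in for the entity store (the source of truth).
#[derive(Default)]
pub struct EntityStore {
    rows: HashMap<u64, Metadata>,
}

/// Hypothetical stand-in for the derived text index.
#[derive(Default)]
pub struct DerivedIndex {
    docs: HashMap<u64, Metadata>,
    /// Test hook: make the next index update fail.
    pub fail_next: bool,
}

#[derive(Debug, PartialEq)]
pub enum WriteError {
    Store(String),
    Index(String),
}

impl EntityStore {
    /// Rejects id 0 purely to give the sketch a failure path.
    pub fn put(&mut self, id: u64, meta: &Metadata) -> Result<(), String> {
        if id == 0 {
            return Err("invalid entity id".to_owned());
        }
        self.rows.insert(id, meta.clone());
        Ok(())
    }

    pub fn contains(&self, id: u64) -> bool {
        self.rows.contains_key(&id)
    }
}

impl DerivedIndex {
    pub fn index_item(&mut self, id: u64, meta: &Metadata) -> Result<(), String> {
        if self.fail_next {
            self.fail_next = false;
            return Err("index unavailable".to_owned());
        }
        self.docs.insert(id, meta.clone());
        Ok(())
    }

    pub fn contains(&self, id: u64) -> bool {
        self.docs.contains_key(&id)
    }
}

/// Store first, index second: the index is never touched if the store fails.
pub fn write_item_with_metadata(
    store: &mut EntityStore,
    index: &mut DerivedIndex,
    id: u64,
    meta: &Metadata,
) -> Result<(), WriteError> {
    store.put(id, meta).map_err(WriteError::Store)?;
    // Assumption: an index failure is reported, but the stored row is kept,
    // since the index is derived and can be rebuilt from the store.
    index.index_item(id, meta).map_err(WriteError::Index)?;
    Ok(())
}
```

With this ordering, a failed store write can never leave a searchable document behind. A failed index update leaves a stored item that is temporarily unsearchable, which is the gap an outbox-based syncer would be expected to close.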

### EntityId fast field access

`EntityId` must expose its inner `u64` value. Check whether `EntityId::get()` exists; if not, add it:

```rust
impl EntityId {
    /// Returns the raw `u64` identifier.
    pub fn get(&self) -> u64 {
        self.0 // or whatever the inner field is
    }
}
```

## Acceptance Criteria

- [ ] `TextIndexWriter::index_item(entity_id, metadata)` builds a Tantivy document with `entity_id` fast field + all matching text fields
- [ ] Unknown metadata keys (not declared as text fields) are silently ignored
- [ ] `delete_item(entity_id)` issues a `delete_term` on the `entity_id` fast field
- [ ] `index_item` does delete-then-add (same batch): updating an item does not leave orphan documents
- [ ] `commit(last_seq)` calls `set_payload(&last_seq.to_string())` on the prepared commit before `commit()`
- [ ] `TextIndexWriter::last_committed_seq(index)` reads payload from last commit; returns 0 on fresh index
- [ ] `TextIndex::writer_guard()` acquires the mutex and returns `TextIndexWriter`
- [ ] Unit tests: `index_and_search`, `delete_removes_document`, `update_replaces_document`, `commit_stores_sequence`, `last_committed_seq_returns_zero_fresh`, `last_committed_seq_returns_stored_value`
- [ ] `cargo check`, `cargo fmt`, `cargo clippy -- -D warnings` all pass
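
The payload round trip behind the `commit` and `last_committed_seq` criteria can be sketched in isolation. This is a standard-library-only illustration: `resume_seq`, `LogEntry`, and `entries_to_replay` are hypothetical names, and the change log they operate on is an assumption about how a recovery pass might consume the sequence number. The sketch also assumes sequence numbers start at 1, so that 0 can unambiguously mean "nothing committed yet".

```rust
/// Parse a commit payload into a sequence number.
/// A missing or malformed payload maps to 0 (treated as "nothing committed").
pub fn resume_seq(payload: Option<&str>) -> u64 {
    payload.and_then(|p| p.parse::<u64>().ok()).unwrap_or(0)
}

/// Hypothetical change-log entry, ordered by `seq`.
#[derive(Debug, PartialEq)]
pub struct LogEntry {
    pub seq: u64,
    pub entity_id: u64,
}

/// Entries that still need to be applied to the text index after a restart:
/// everything strictly newer than the last committed sequence number.
pub fn entries_to_replay(log: &[LogEntry], last_committed: u64) -> Vec<&LogEntry> {
    log.iter().filter(|e| e.seq > last_committed).collect()
}
```

Because deletes and re-indexing are keyed by `entity_id`, replaying an entry that was already applied is harmless, so a conservative resume point only costs extra work.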

## Test Strategy

```rust
// Assumes `HashMap`, `EntityId`, `TextIndex`, `TextIndexWriter`,
// `TextFieldDef`, and `TextFieldType` are in scope.

#[test]
fn index_and_search() {
    let fields = vec![TextFieldDef {
        key: "title".into(),
        field_type: TextFieldType::Text,
    }];
    let idx = TextIndex::ephemeral(&fields).unwrap();
    let mut w = idx.writer_guard().unwrap();
    let mut meta = HashMap::new();
    meta.insert("title".into(), "Rust programming language".into());
    w.index_item(EntityId::new(42), &meta).unwrap();
    w.commit(1).unwrap();

    // Searcher should find item 42 for query "Rust"
    idx.reader.reload().unwrap(); // force reader refresh in test
    let searcher = idx.reader.searcher();
    assert_eq!(searcher.num_docs(), 1);
    // ... assert item found
}

#[test]
fn delete_removes_document() {
    // Write, commit, delete, commit, verify not found
}

#[test]
fn commit_stores_sequence() {
    let idx = TextIndex::ephemeral(&[]).unwrap(); // no text fields, just entity_id
    let mut w = idx.writer_guard().unwrap();

    // index_item with only entity_id field, commit(seq=42)
    w.index_item(EntityId::new(1), &HashMap::new()).unwrap();
    w.commit(42).unwrap();

    let seq = TextIndexWriter::last_committed_seq(&idx.index);
    assert_eq!(seq, 42);
}
```