tidaldb/docs/planning/milestone-5/phase-1/task-02-document-write-delete.md
jordan 192c473f55 feat: complete Milestone 5 — full-text search, RRF fusion, and creator search
- M5p1: BM25 text indexing via Tantivy with background syncer (0.26ms @ 10K docs)
- M5p2: RRF fusion layer combining BM25 + ANN scores (46µs @ 1K candidates)
- M5p3: unified Search query API (8-stage pipeline, BM25 + vector + ranking)
- M5p4: creator text + vector indexing and creator search executor (< 20ms @ 200 creators)
- Refactor db/mod.rs into focused sub-modules (creators, items, sessions, signals, etc.)
- Decompose monolithic files into directory modules (query/executor, ranking/diversity, etc.)
- Split brute.rs → brute/mod.rs + brute/tests.rs; extract search executor helpers
- Add benches: fusion, search, session, text_index
- Add M5 UAT test suites (m5_uat, m5_search, m5p4_creator_search, text_index)
- Update blog posts, roadmap, content strategy, and M5 planning docs
- Add tmp/ and .claude/worktrees/ to .gitignore

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-21 23:53:16 -07:00


# Task 02: Document Write/Delete
## Delivers
`TextIndexWriter` with `index_item()`, `delete_item()`, field mapping (text → tokenized, keyword → raw), metadata-to-document conversion, and commit with sequence number payload.
## Complexity: M
## Dependencies
- Task 01 complete: `TextIndex`, `TantivyFields`, `TextFieldDef`, `TextFieldType` all exist
## Technical Design
### TextIndexWriter
```rust
// tidal/src/text/writer.rs
use std::collections::HashMap;
use std::sync::MutexGuard;

// Note: `Document` is a concrete struct only in older Tantivy releases; in
// releases where it is a trait, use `tantivy::TantivyDocument` instead.
use tantivy::{Document, Term};

use crate::schema::EntityId;
use crate::text::index::{TextIndex, TantivyFields};
use crate::TidalError;

/// Write operations on the Tantivy text index.
///
/// This is a thin wrapper over the locked IndexWriter that converts tidalDB
/// metadata maps into Tantivy documents and handles entity_id-based deletes.
///
/// Thread safety: `TextIndexWriter` holds a `MutexGuard` on the IndexWriter.
/// Operations are batched in memory and only become visible after `commit()`.
pub struct TextIndexWriter<'a> {
    writer: MutexGuard<'a, tantivy::IndexWriter>,
    fields: &'a TantivyFields,
}

impl TextIndex {
    /// Lock the writer and return a `TextIndexWriter` for batch operations.
    ///
    /// # Errors
    /// Returns `TidalError::Internal` if the writer mutex is poisoned.
    pub fn writer_guard(&self) -> crate::Result<TextIndexWriter<'_>> {
        let writer = self
            .writer
            .lock()
            .map_err(|e| TidalError::Internal(format!("writer lock poisoned: {e}")))?;
        Ok(TextIndexWriter {
            writer,
            fields: &self.fields,
        })
    }
}

impl<'a> TextIndexWriter<'a> {
    /// Index or re-index an item.
    ///
    /// Tantivy has no atomic update, so this deletes any existing document for
    /// `entity_id` and adds a fresh document. Both operations are in the same
    /// batch and become visible atomically on the next `commit()`.
    ///
    /// Only metadata keys that match a declared text field are indexed.
    /// Unknown keys are silently ignored.
    pub fn index_item(
        &mut self,
        entity_id: EntityId,
        metadata: &HashMap<String, String>,
    ) -> crate::Result<()> {
        // Delete any existing document for this entity_id. Term deletes match
        // through the inverted index, so `entity_id` must be an indexed field.
        let id_term = Term::from_field_u64(self.fields.entity_id, entity_id.get());
        self.writer.delete_term(id_term);

        // Build the replacement document: entity_id plus every declared text
        // field that has a value in `metadata`.
        let mut doc = Document::new();
        doc.add_u64(self.fields.entity_id, entity_id.get());
        for (key, tv_field, _field_type) in &self.fields.text_fields {
            if let Some(value) = metadata.get(key) {
                doc.add_text(*tv_field, value);
            }
        }

        self.writer
            .add_document(doc)
            .map_err(|e| TidalError::Internal(format!("tantivy add_document: {e}")))?;
        Ok(())
    }

    /// Remove an item from the index.
    ///
    /// The delete takes effect on the next `commit()`.
    pub fn delete_item(&mut self, entity_id: EntityId) {
        let id_term = Term::from_field_u64(self.fields.entity_id, entity_id.get());
        self.writer.delete_term(id_term);
    }

    /// Commit all pending writes and store `last_seq` in the commit payload.
    ///
    /// This is the durability boundary: once `commit()` returns, the batch is
    /// persisted and becomes visible to searchers obtained after the
    /// `IndexReader` has reloaded.
    ///
    /// The `last_seq` is stored in the Tantivy commit payload via `set_payload()`.
    /// On crash recovery, read the last commit payload to find the resume point.
    ///
    /// # Errors
    /// Returns `TidalError::Internal` if preparing or finalizing the commit fails.
    pub fn commit(&mut self, last_seq: u64) -> crate::Result<()> {
        // `set_payload()` is a method of Tantivy's `PreparedCommit`, not of
        // `IndexWriter`, so the commit is split into prepare + finalize.
        let mut prepared = self
            .writer
            .prepare_commit()
            .map_err(|e| TidalError::Internal(format!("tantivy prepare_commit: {e}")))?;
        prepared.set_payload(&last_seq.to_string());
        prepared
            .commit()
            .map_err(|e| TidalError::Internal(format!("tantivy commit: {e}")))?;
        Ok(())
    }

    /// Read the last committed sequence number from the Tantivy index payload.
    ///
    /// Returns 0 if no commit payload exists (fresh index or first run).
    pub fn last_committed_seq(index: &tantivy::Index) -> u64 {
        index
            .load_metas()
            .ok()
            .and_then(|meta| meta.payload)
            .and_then(|p| p.parse::<u64>().ok())
            .unwrap_or(0)
    }
}
```
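
The payload round-trip between `commit()` and `last_committed_seq()` can be illustrated without Tantivy. The sketch below is a standard-library-only model, not the tidalDB implementation: `decode_seq` and `events_to_replay` are hypothetical helper names, and the payload is assumed to be the decimal string produced by `last_seq.to_string()`. It shows that a missing or unparsable payload falls back to 0, so recovery replays from the start of the log instead of failing.

```rust
/// Decode a commit payload into a sequence number.
/// Hypothetical helper mirroring the fallback chain in `last_committed_seq`.
pub fn decode_seq(payload: Option<&str>) -> u64 {
    payload.and_then(|p| p.parse::<u64>().ok()).unwrap_or(0)
}

/// Select the log entries that still need indexing after a restart.
/// Assumption: real sequence numbers start at 1, so 0 can mean
/// "nothing committed yet" without colliding with a real entry.
pub fn events_to_replay(log: &[u64], payload: Option<&str>) -> Vec<u64> {
    let last = decode_seq(payload);
    log.iter().copied().filter(|&seq| seq > last).collect()
}
```

One consequence of this encoding: because 0 doubles as the "no payload" value, the scheme is only unambiguous if no real write is ever assigned sequence number 0. The task does not state where sequence numbers start, so that should be confirmed against the WAL or outbox design.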
### Integration with TidalDb
Wire `index_item` calls into `TidalDb::write_item_with_metadata()` and `write_item()`. The text index should be updated **after** the entity store write succeeds (DB-primary consistency: entity store wins, Tantivy is derived).
For m5p1, keep the index update synchronous: call `index_item` directly after each successful entity store write. Task-03 (Background Syncer) will replace this synchronous call with an async outbox pattern.
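
The write ordering described above can be sketched with in-memory stand-ins. Everything here is hypothetical scaffolding for illustration: `EntityStore`, `FakeTextIndex`, and this free-function form of `write_item_with_metadata` are not the real `TidalDb` types, and how an index failure should surface to the caller is not specified in this task (the sketch simply returns it).

```rust
use std::collections::HashMap;

/// Stand-in for the entity store (hypothetical; the real store lives in TidalDb).
#[derive(Default)]
pub struct EntityStore {
    pub items: HashMap<u64, HashMap<String, String>>,
}

/// Stand-in for the text index; `fail` simulates an indexing error.
#[derive(Default)]
pub struct FakeTextIndex {
    pub indexed: Vec<u64>,
    pub fail: bool,
}

impl FakeTextIndex {
    pub fn index_item(&mut self, id: u64) -> Result<(), String> {
        if self.fail {
            return Err("index unavailable".to_string());
        }
        self.indexed.push(id);
        Ok(())
    }
}

/// Write to the entity store first, then update the derived index.
/// An index failure is returned to the caller but does not undo the
/// store write: the entity store stays the source of truth.
pub fn write_item_with_metadata(
    store: &mut EntityStore,
    index: &mut FakeTextIndex,
    id: u64,
    metadata: HashMap<String, String>,
) -> Result<(), String> {
    store.items.insert(id, metadata); // primary write
    index.index_item(id) // derived write, attempted only after the primary
}
```

Because the index is derived, an item that is in the store but missing from the index can be repaired later by re-indexing, whereas the reverse (indexed but not stored) would return search hits for entities that do not exist.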
### EntityId fast field access
`EntityId` must expose its inner `u64` value. Check if `EntityId::get()` exists — if not, add it:
```rust
impl EntityId {
    pub fn get(&self) -> u64 {
        self.0 // or whatever the inner field is
    }
}
```
## Acceptance Criteria
- [ ] `TextIndexWriter::index_item(entity_id, metadata)` builds a Tantivy document with `entity_id` fast field + all matching text fields
- [ ] Unknown metadata keys (not declared as text fields) are silently ignored
- [ ] `delete_item(entity_id)` issues a `delete_term` on the `entity_id` fast field
- [ ] `index_item` does delete-then-add (same batch): updating an item does not leave orphan documents
- [ ] `commit(last_seq)` calls `set_payload(&last_seq.to_string())` before `commit()`
- [ ] `TextIndexWriter::last_committed_seq(index)` reads payload from last commit; returns 0 on fresh index
- [ ] `TextIndex::writer_guard()` acquires the mutex and returns `TextIndexWriter`
- [ ] Unit tests: `index_and_search`, `delete_removes_document`, `update_replaces_document`, `commit_stores_sequence`, `last_committed_seq_returns_zero_fresh`, `last_committed_seq_returns_stored_value`
- [ ] `cargo check`, `cargo fmt`, `cargo clippy -- -D warnings` all pass
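
The delete-then-add and commit-visibility criteria above can be modelled in a few lines of plain Rust. This is a toy model under stated assumptions, not Tantivy's actual machinery: `ToyIndex`, `Op`, and their methods are hypothetical names, documents are reduced to a single title string, and visibility is modelled as a list that only changes on `commit()`.

```rust
/// A buffered write operation (hypothetical model type).
enum Op {
    Delete(u64),
    Add(u64, String),
}

/// Toy index: `pending` holds uncommitted ops, `visible` is what searches see.
#[derive(Default)]
pub struct ToyIndex {
    pending: Vec<Op>,
    visible: Vec<(u64, String)>,
}

impl ToyIndex {
    /// Queue a delete followed by an add in the same batch, like `index_item`.
    pub fn index_item(&mut self, id: u64, title: &str) {
        self.pending.push(Op::Delete(id));
        self.pending.push(Op::Add(id, title.to_string()));
    }

    /// Queue a delete only, like `delete_item`.
    pub fn delete_item(&mut self, id: u64) {
        self.pending.push(Op::Delete(id));
    }

    /// Apply pending ops in order; nothing changes in `visible` before this runs.
    pub fn commit(&mut self) {
        for op in self.pending.drain(..) {
            match op {
                // A delete removes only documents added before it.
                Op::Delete(id) => self.visible.retain(|(doc_id, _)| *doc_id != id),
                Op::Add(id, title) => self.visible.push((id, title)),
            }
        }
    }

    /// Titles of all visible documents with the given id.
    pub fn find(&self, id: u64) -> Vec<&str> {
        self.visible
            .iter()
            .filter(|(doc_id, _)| *doc_id == id)
            .map(|(_, title)| title.as_str())
            .collect()
    }
}
```

In this model, re-indexing id 42 with a new title leaves exactly one visible document after the second commit, which is the property the `update_replaces_document` test is meant to check against the real index. Without the queued delete, `find(42)` would return both titles.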
## Test Strategy
```rust
// `TextIndex`, `TextFieldDef`, `TextFieldType` come from Task 01;
// `TextIndexWriter` and `EntityId::get()` come from this task.
use std::collections::HashMap;

#[test]
fn index_and_search() {
    let fields = vec![
        TextFieldDef { key: "title".into(), field_type: TextFieldType::Text },
    ];
    let idx = TextIndex::ephemeral(&fields).unwrap();
    let mut w = idx.writer_guard().unwrap();

    let mut meta = HashMap::new();
    meta.insert("title".into(), "Rust programming language".into());
    w.index_item(EntityId::new(42), &meta).unwrap();
    w.commit(1).unwrap();

    // Searcher should find item 42 for query "Rust"
    idx.reader.reload().unwrap(); // force reader refresh in test
    let searcher = idx.reader.searcher();
    // ... assert item found
}

#[test]
fn delete_removes_document() {
    // Write, commit, delete, commit, verify not found
}

#[test]
fn commit_stores_sequence() {
    let idx = TextIndex::ephemeral(&[]).unwrap(); // no text fields, just entity_id
    {
        // Index an item that carries only the entity_id field, then commit(seq=42).
        let mut w = idx.writer_guard().unwrap();
        w.index_item(EntityId::new(1), &HashMap::new()).unwrap();
        w.commit(42).unwrap();
    } // writer guard released here

    let seq = TextIndexWriter::last_committed_seq(&idx.index);
    assert_eq!(seq, 42);
}
```