tidaldb/docs/planning/milestone-5/phase-1/task-02-document-write-delete.md
jordan 192c473f55 feat: complete Milestone 5 — full-text search, RRF fusion, and creator search
2026-02-21 23:53:16 -07:00

Task 02: Document Write/Delete

Delivers

TextIndexWriter with index_item(), delete_item(), field mapping (text → tokenized, keyword → raw), metadata-to-document conversion, and commit with sequence number payload.

Complexity: M

Dependencies

  • Task 01 complete: TextIndex, TantivyFields, TextFieldDef, TextFieldType all exist

Technical Design

TextIndexWriter

// tidal/src/text/writer.rs

use std::collections::HashMap;
use std::sync::MutexGuard;
use tantivy::{Document, Term}; // `doc!` and `schema::Value` were unused and would fail `clippy -D warnings`

use crate::schema::EntityId;
use crate::text::index::{TextIndex, TantivyFields};
use crate::TidalError;

/// Write operations on the Tantivy text index.
///
/// This is a thin wrapper over the locked IndexWriter that converts tidalDB
/// metadata maps into Tantivy documents and handles entity_id-based deletes.
///
/// Thread safety: `TextIndexWriter` holds a `MutexGuard` on the IndexWriter.
/// Operations are batched in memory and only become visible after `commit()`.
pub struct TextIndexWriter<'a> {
    writer: MutexGuard<'a, tantivy::IndexWriter>,
    fields: &'a TantivyFields,
}

impl TextIndex {
    /// Lock the writer and return a `TextIndexWriter` for batch operations.
    ///
    /// # Errors
    /// Returns `TidalError::Internal` if the writer mutex is poisoned.
    pub fn writer_guard(&self) -> crate::Result<TextIndexWriter<'_>> {
        let writer = self
            .writer
            .lock()
            .map_err(|e| TidalError::Internal(format!("writer lock poisoned: {e}")))?;
        Ok(TextIndexWriter {
            writer,
            fields: &self.fields,
        })
    }
}

impl<'a> TextIndexWriter<'a> {
    /// Index or re-index an item.
    ///
    /// Tantivy has no atomic update — this deletes any existing document for
    /// `entity_id` and adds a fresh document. Both operations are in the same
    /// batch and become visible atomically on the next `commit()`.
    ///
    /// Only metadata keys that match a declared text field are indexed.
    /// Unknown keys are silently ignored.
    pub fn index_item(
        &mut self,
        entity_id: EntityId,
        metadata: &HashMap<String, String>,
    ) -> crate::Result<()> {
        // Delete any existing document for this entity_id
        let id_term = Term::from_field_u64(self.fields.entity_id, entity_id.get());
        self.writer.delete_term(id_term);

        // Build document
        let mut doc = Document::new();
        doc.add_u64(self.fields.entity_id, entity_id.get());

        for (key, tv_field, _field_type) in &self.fields.text_fields {
            if let Some(value) = metadata.get(key) {
                doc.add_text(*tv_field, value);
            }
        }

        self.writer
            .add_document(doc)
            .map_err(|e| TidalError::Internal(format!("tantivy add_document: {e}")))?;

        Ok(())
    }

    /// Remove an item from the index.
    ///
    /// The delete takes effect on the next `commit()`.
    pub fn delete_item(&mut self, entity_id: EntityId) {
        let id_term = Term::from_field_u64(self.fields.entity_id, entity_id.get());
        self.writer.delete_term(id_term);
    }

    /// Commit all pending writes and store `last_seq` in the commit payload.
    ///
    /// This is the durability boundary: once `commit()` returns, all pending
    /// adds and deletes are persisted and become visible to searchers acquired
    /// after the `IndexReader` reloads.
    ///
    /// `last_seq` is stored in the Tantivy commit payload via
    /// `PreparedCommit::set_payload()`. `IndexWriter` itself has no
    /// `set_payload`, so the commit goes through `prepare_commit()`.
    /// On crash recovery, read the last commit payload to find the resume point.
    ///
    /// # Errors
    /// Returns `TidalError::Internal` if preparing or finalizing the commit fails.
    pub fn commit(&mut self, last_seq: u64) -> crate::Result<()> {
        let mut prepared = self
            .writer
            .prepare_commit()
            .map_err(|e| TidalError::Internal(format!("tantivy prepare_commit: {e}")))?;
        prepared.set_payload(&last_seq.to_string());
        prepared
            .commit()
            .map_err(|e| TidalError::Internal(format!("tantivy commit: {e}")))?;
        Ok(())
    }

    /// Read the last committed sequence number from the Tantivy index payload.
    ///
    /// Returns 0 if no commit payload exists (fresh index or first run).
    pub fn last_committed_seq(index: &tantivy::Index) -> u64 {
        index
            .load_metas()
            .ok()
            .and_then(|meta| meta.payload)
            .and_then(|p| p.parse::<u64>().ok())
            .unwrap_or(0)
    }
}
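
The batching behaviour described above (delete-then-add in one batch, nothing visible until commit, sequence number stored as a string payload) can be illustrated without Tantivy. The following is a toy in-memory model only: `ToyIndex` and `Op` are hypothetical names invented for this sketch, and the real visibility rules are those of Tantivy's writer and reader.

```rust
use std::collections::HashMap;

/// Buffered operation, applied in order at commit time.
enum Op {
    Delete(u64),
    Add(u64, String),
}

/// Toy stand-in for a batched index writer (not Tantivy).
#[derive(Default)]
struct ToyIndex {
    committed: HashMap<u64, String>, // what readers would see
    pending: Vec<Op>,                // buffered until commit
    payload: Option<String>,         // payload of the last commit
}

impl ToyIndex {
    /// Delete-then-add in the same batch, mirroring `index_item`.
    fn index_item(&mut self, id: u64, text: &str) {
        self.pending.push(Op::Delete(id));
        self.pending.push(Op::Add(id, text.to_string()));
    }

    /// Queue a delete, mirroring `delete_item`.
    fn delete_item(&mut self, id: u64) {
        self.pending.push(Op::Delete(id));
    }

    /// Apply the batch in order and record `last_seq` as a string payload.
    fn commit(&mut self, last_seq: u64) {
        for op in self.pending.drain(..) {
            match op {
                Op::Delete(id) => {
                    self.committed.remove(&id);
                }
                Op::Add(id, text) => {
                    self.committed.insert(id, text);
                }
            }
        }
        self.payload = Some(last_seq.to_string());
    }

    /// Parse the stored payload, falling back to 0 when absent or invalid.
    fn last_committed_seq(&self) -> u64 {
        self.payload
            .as_deref()
            .and_then(|p| p.parse().ok())
            .unwrap_or(0)
    }
}

fn main() {
    let mut idx = ToyIndex::default();
    idx.index_item(42, "old title");
    assert!(idx.committed.is_empty()); // nothing visible before commit
    idx.commit(1);

    idx.index_item(42, "new title"); // update: delete + add in one batch
    idx.delete_item(99); // deleting an unknown id is a no-op
    idx.commit(2);

    assert_eq!(idx.committed.len(), 1); // no orphan document left behind
    assert_eq!(idx.committed[&42], "new title");
    assert_eq!(idx.last_committed_seq(), 2);
}
```

Because the delete is queued before the add, re-indexing the same id replaces the old entry instead of accumulating duplicates, which is the property the `update_replaces_document` test is meant to check against the real index.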

Integration with TidalDb

Wire index_item calls into TidalDb::write_item_with_metadata() and write_item(). Update the text index only after the entity store write succeeds (DB-primary consistency: the entity store is the source of truth and Tantivy is derived from it).

For m5p1, keep the update synchronous: call the text index directly after each successful entity store write. Task-03 (Background Syncer) replaces this synchronous call with an async outbox pattern.
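
A minimal sketch of that write ordering follows, using hypothetical stand-ins rather than tidalDB types: `EntityStore`, `TextSink`, `MemStore`, `RecordingSink` and this free-standing `write_item` are all invented for illustration. It only shows the ordering (primary write first, derived index second). What to do when the index update fails after the store write succeeded is not specified here, so the sketch simply propagates the error.

```rust
use std::collections::HashMap;

type Meta = HashMap<String, String>;

/// Hypothetical primary store (stand-in for the entity store).
trait EntityStore {
    fn put(&mut self, id: u64, meta: &Meta) -> Result<(), String>;
}

/// Hypothetical derived index (stand-in for the text index).
trait TextSink {
    fn index(&mut self, id: u64, meta: &Meta) -> Result<(), String>;
}

/// Write the primary store first; touch the derived index only on success.
fn write_item(
    store: &mut impl EntityStore,
    text: &mut impl TextSink,
    id: u64,
    meta: &Meta,
) -> Result<(), String> {
    store.put(id, meta)?; // an error here returns before any indexing
    text.index(id, meta) // synchronous derived write
}

/// In-memory store that can be told to fail.
struct MemStore {
    fail: bool,
    rows: HashMap<u64, Meta>,
}

impl EntityStore for MemStore {
    fn put(&mut self, id: u64, meta: &Meta) -> Result<(), String> {
        if self.fail {
            return Err("store unavailable".to_string());
        }
        self.rows.insert(id, meta.clone());
        Ok(())
    }
}

/// Sink that records which ids were indexed.
struct RecordingSink {
    indexed: Vec<u64>,
}

impl TextSink for RecordingSink {
    fn index(&mut self, id: u64, _meta: &Meta) -> Result<(), String> {
        self.indexed.push(id);
        Ok(())
    }
}

fn main() {
    let meta = Meta::new();
    let mut store = MemStore { fail: true, rows: HashMap::new() };
    let mut sink = RecordingSink { indexed: Vec::new() };

    // Failed primary write: the derived index is never touched.
    assert!(write_item(&mut store, &mut sink, 1, &meta).is_err());
    assert!(sink.indexed.is_empty());

    // Successful primary write: the index update follows.
    store.fail = false;
    assert!(write_item(&mut store, &mut sink, 1, &meta).is_ok());
    assert_eq!(sink.indexed, vec![1]);
}
```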

EntityId fast field access

EntityId must expose its inner u64 value. Check if EntityId::get() exists — if not, add it:

impl EntityId {
    pub fn get(&self) -> u64 {
        self.0  // or whatever the inner field is
    }
}

Acceptance Criteria

  • TextIndexWriter::index_item(entity_id, metadata) builds a Tantivy document with entity_id fast field + all matching text fields
  • Unknown metadata keys (not declared as text fields) are silently ignored
  • delete_item(entity_id) issues a delete_term on the entity_id field (the field must be indexed, not only fast, for the term delete to match any document)
  • index_item does delete-then-add (same batch): updating an item does not leave orphan documents
  • commit(last_seq) stores last_seq.to_string() in the commit payload (prepare_commit(), then set_payload() on the PreparedCommit) before finalizing the commit
  • TextIndexWriter::last_committed_seq(index) reads payload from last commit; returns 0 on fresh index
  • TextIndex::writer_guard() acquires the mutex and returns TextIndexWriter
  • Unit tests: index_and_search, delete_removes_document, update_replaces_document, commit_stores_sequence, last_committed_seq_returns_zero_fresh, last_committed_seq_returns_stored_value
  • cargo check, cargo fmt, cargo clippy -D warnings all pass
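
The payload criteria above can be sketched in isolation. The helper names `seq_from_payload` and `resume_from` are hypothetical, and the replay rule (re-apply every entry whose sequence number is greater than the last committed one) is an assumption about how the resume point would be used, not something this task specifies. Since a missing payload maps to 0, the sketch also assumes real sequence numbers start at 1.

```rust
/// Decode a commit payload into a sequence number.
/// A missing or unparsable payload yields 0 (fresh index or first run).
fn seq_from_payload(payload: Option<&str>) -> u64 {
    payload.and_then(|p| p.parse::<u64>().ok()).unwrap_or(0)
}

/// Assumed resume rule: replay entries strictly after the last committed seq.
fn resume_from(log: &[u64], last_committed: u64) -> Vec<u64> {
    log.iter().copied().filter(|&seq| seq > last_committed).collect()
}

fn main() {
    // Fresh index: no payload stored yet.
    assert_eq!(seq_from_payload(None), 0);
    // Payload written as `last_seq.to_string()` round-trips.
    assert_eq!(seq_from_payload(Some(&42u64.to_string())), 42);
    // Corrupt payload falls back to 0 rather than failing.
    assert_eq!(seq_from_payload(Some("not-a-number")), 0);

    // With sequence 2 committed, only entry 3 remains to replay.
    let log = [1, 2, 3];
    assert_eq!(resume_from(&log, 2), vec![3]);
    // On a fresh index everything is replayed.
    assert_eq!(resume_from(&log, seq_from_payload(None)), vec![1, 2, 3]);
}
```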

Test Strategy

#[test]
fn index_and_search() {
    let fields = vec![
        TextFieldDef { key: "title".into(), field_type: TextFieldType::Text },
    ];
    let idx = TextIndex::ephemeral(&fields).unwrap();
    let mut w = idx.writer_guard().unwrap();
    let mut meta = HashMap::new();
    meta.insert("title".into(), "Rust programming language".into());
    w.index_item(EntityId::new(42), &meta).unwrap();
    w.commit(1).unwrap();
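    // Searcher should find item 42 for query "Rust"
    idx.reader.reload().unwrap(); // force reader refresh in test
    let searcher = idx.reader.searcher();
    // Assumes `text_fields` is an indexable list of (key, Field, type) tuples,
    // as iterated in `index_item`; "title" is the only declared field.
    let title = idx.fields.text_fields[0].1;
    let query = tantivy::query::QueryParser::for_index(&idx.index, vec![title])
        .parse_query("Rust")
        .unwrap();
    let hits = searcher.search(&query, &tantivy::collector::Count).unwrap();
    assert_eq!(hits, 1); // exactly one document matches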
}

#[test]
fn delete_removes_document() {
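    // Write, commit, delete, commit, verify not found
    let idx = TextIndex::ephemeral(&[]).unwrap();
    let mut w = idx.writer_guard().unwrap();
    w.index_item(EntityId::new(7), &HashMap::new()).unwrap();
    w.commit(1).unwrap();
    w.delete_item(EntityId::new(7));
    w.commit(2).unwrap();
    idx.reader.reload().unwrap(); // force reader refresh in test
    assert_eq!(idx.reader.searcher().num_docs(), 0);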
}

#[test]
fn commit_stores_sequence() {
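    let idx = TextIndex::ephemeral(&[]).unwrap(); // no text fields, just entity_id
    let mut w = idx.writer_guard().unwrap();
    // Document carries only the entity_id field
    w.index_item(EntityId::new(1), &HashMap::new()).unwrap();
    w.commit(42).unwrap();
    let seq = TextIndexWriter::last_committed_seq(&idx.index);
    assert_eq!(seq, 42);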
}