tidaldb/docs/planning/milestone-5/phase-1/OVERVIEW.md
jordan 192c473f55 feat: complete Milestone 5 — full-text search, RRF fusion, and creator search
- M5p1: BM25 text indexing via Tantivy with background syncer (0.26ms @ 10K docs)
- M5p2: RRF fusion layer combining BM25 + ANN scores (46µs @ 1K candidates)
- M5p3: unified Search query API (8-stage pipeline, BM25 + vector + ranking)
- M5p4: creator text + vector indexing and creator search executor (< 20ms @ 200 creators)
- Refactor db/mod.rs into focused sub-modules (creators, items, sessions, signals, etc.)
- Decompose monolithic files into directory modules (query/executor, ranking/diversity, etc.)
- Split brute.rs → brute/mod.rs + brute/tests.rs; extract search executor helpers
- Add benches: fusion, search, session, text_index
- Add M5 UAT test suites (m5_uat, m5_search, m5p4_creator_search, text_index)
- Update blog posts, roadmap, content strategy, and M5 planning docs
- Add tmp/ and .claude/worktrees/ to .gitignore

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-21 23:53:16 -07:00

m5p1: Tantivy Integration

Delivers

Tantivy embedded as a derived index for full-text search. DB-primary consistency pattern: entity store is source of truth, Tantivy is a materialized view updated via an outbox sequence. BM25 scoring exposed via custom Collector and the Weight/Scorer seek pattern. Schema text fields (title, description, tags) automatically indexed. Crash recovery replays from the last committed sequence number stored in Tantivy's commit payload.
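The recovery rule described above can be illustrated without Tantivy. The following is a minimal sketch under stated assumptions: `DerivedIndex`, `Op`, `recover`, and `demo_recovery` are hypothetical names invented for illustration, the outbox is modelled as an in-memory slice, and `committed_seq` stands in for the sequence number that the real design stores in Tantivy's commit payload. It is not the project's implementation.

```rust
use std::collections::BTreeMap;

/// A change recorded in the outbox (hypothetical shape, for illustration only).
#[derive(Clone, Debug)]
enum Op {
    Upsert { entity_id: u64, text: String },
    Delete { entity_id: u64 },
}

/// Stand-in for the derived text index. This is NOT Tantivy's API.
#[derive(Default)]
struct DerivedIndex {
    committed: BTreeMap<u64, String>, // durable state, survives a crash
    committed_seq: u64,               // plays the role of the commit payload
    pending: Vec<(u64, Op)>,          // buffered changes, lost on a crash
}

impl DerivedIndex {
    /// Buffers a change; nothing is durable until `commit`.
    fn apply(&mut self, seq: u64, op: Op) {
        self.pending.push((seq, op));
    }

    /// Makes buffered changes durable and records the last applied sequence.
    fn commit(&mut self) {
        for (seq, op) in self.pending.drain(..) {
            match op {
                Op::Upsert { entity_id, text } => {
                    self.committed.insert(entity_id, text);
                }
                Op::Delete { entity_id } => {
                    self.committed.remove(&entity_id);
                }
            }
            self.committed_seq = seq;
        }
    }

    /// Simulates a crash: uncommitted changes disappear.
    fn crash(&mut self) {
        self.pending.clear();
    }
}

/// Replays every outbox entry newer than the last committed sequence number.
fn recover(index: &mut DerivedIndex, outbox: &[(u64, Op)]) {
    let from = index.committed_seq;
    for (seq, op) in outbox.iter().filter(|(seq, _)| *seq > from) {
        index.apply(*seq, op.clone());
    }
    index.commit();
}

/// Walks through: commit seq 1, lose seq 2 in a crash, then recover.
fn demo_recovery() -> DerivedIndex {
    let outbox = vec![
        (1, Op::Upsert { entity_id: 10, text: "jazz piano".into() }),
        (2, Op::Upsert { entity_id: 11, text: "blues guitar".into() }),
        (3, Op::Delete { entity_id: 10 }),
    ];
    let mut index = DerivedIndex::default();
    index.apply(1, outbox[0].1.clone());
    index.commit(); // committed_seq is now 1
    index.apply(2, outbox[1].1.clone());
    index.crash(); // seq 2 was buffered but never committed
    recover(&mut index, &outbox); // replays seq 2 and 3
    index
}
```

Because the entity store remains the source of truth, replay only needs to be idempotent from the recorded sequence onward; the derived index never has to be trusted beyond its last commit.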

Dependencies

  • m1p3 (storage engine, key encoding, StorageEngine trait, scan_prefix)
  • m1p5 (entity write API, WAL sequence numbers)
  • m2p2 (metadata fields used for field-scoped queries)
  • m4 (full TidalDb API with sessions and agents — all complete)

Research References

  • docs/research/tantivy.md — Collector API, consistency pattern, seek scoring, commit model, single-writer lock, segment merge
  • CODING_GUIDELINES.md Section 5 — Text Search guidelines
  • CODING_GUIDELINES.md Section 7 — Error handling

Acceptance Criteria (Phase Level)

  • TextIndex struct wraps Tantivy Index, IndexWriter (behind Mutex), and IndexReader with auto-reload
  • Tantivy schema created from tidalDB schema text field definitions: text fields get full-text tokenization; keyword fields get raw indexing
  • TextIndexWriter::index_item(entity_id, metadata) adds or updates a document in Tantivy; delete_item(entity_id) removes via delete_term
  • Background indexer: TextIndexSyncer reads entity store writes (via WAL sequence tracking) and feeds the Tantivy writer; the commit interval is configurable (default: every 1000 docs or every 2 seconds, whichever comes first)
  • Each Tantivy commit() stores the last-processed WAL sequence number in the commit payload via set_payload(); crash recovery replays from that sequence number
  • Custom AllScoresCollector implementing Tantivy's Collector trait returns all matching (EntityId, f32) pairs with BM25 scores; requires_scoring() returns true
  • ScoredCandidateCollector implementing Tantivy's Collector trait accepts a pre-sorted candidate set and returns BM25 scores via DocSet::seek()
  • External EntityId -> DocAddress mapping maintained via a fast field (entity_id_field) on every Tantivy document
  • Boolean query parsing: AND, OR, NOT operators; exact phrase ("..."); field-scoped (title:jazz); exclusion (-beginner); wildcard prefix (pian*)
  • Index rebuild from entity store: text_index.rebuild_from(storage) scans all items and rebuilds Tantivy index
  • BM25 query latency < 10ms at 10K documents (Criterion benchmarked)
  • Tantivy IndexWriter heap budget set to 50MB
  • LogMergePolicy configured with defaults; wait_merging_threads() called on shutdown
  • TextIndex is Send + Sync — safe to share across threads behind Arc
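The candidate-scoring criterion relies on a forward-only seek over postings, which is why the candidate set must be pre-sorted. The sketch below models only that contract in plain Rust. `PostingCursor` and `score_candidates` are hypothetical names, the scores are precomputed placeholders rather than real BM25 values, and `seek` returns an `Option` instead of mirroring Tantivy's exact `DocSet::seek` signature.

```rust
/// Forward-only cursor over one term's postings, sorted by doc id.
/// Hypothetical stand-in for a scorer; only the seek contract is modelled.
struct PostingCursor<'a> {
    postings: &'a [(u32, f32)], // (doc_id, placeholder score)
    pos: usize,
}

impl<'a> PostingCursor<'a> {
    fn new(postings: &'a [(u32, f32)]) -> Self {
        Self { postings, pos: 0 }
    }

    /// Advances to the first posting with doc_id >= target and returns it.
    /// The cursor never moves backwards, so targets must be non-decreasing.
    fn seek(&mut self, target: u32) -> Option<(u32, f32)> {
        let rest = &self.postings[self.pos..];
        // Binary search within the remaining postings only.
        self.pos += rest.partition_point(|&(doc, _)| doc < target);
        self.postings.get(self.pos).copied()
    }
}

/// Scores only the given candidates. Candidates that do not appear in the
/// postings (that is, do not match the term) are skipped.
fn score_candidates(postings: &[(u32, f32)], candidates: &[u32]) -> Vec<(u32, f32)> {
    debug_assert!(
        candidates.windows(2).all(|w| w[0] <= w[1]),
        "candidates must be sorted ascending"
    );
    let mut cursor = PostingCursor::new(postings);
    let mut out = Vec::new();
    for &cand in candidates {
        match cursor.seek(cand) {
            Some((doc, score)) if doc == cand => out.push((doc, score)),
            Some(_) => {}  // cursor landed past the candidate: no match
            None => break, // postings exhausted, nothing further can match
        }
    }
    out
}
```

With postings for docs 1, 4, 7, and 9 and candidates 2, 4, 9, and 12, only docs 4 and 9 are scored: candidate 2 is absent from the postings, and the cursor is exhausted before 12. The cost scales with the number of candidates rather than the number of matching documents, which is the point of scoring a pre-filtered set this way.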

Task Execution Order

task-01 (TextIndex Core)
    |
    v
task-02 (Document Write/Delete)
    |            |              |
    v            v              v
task-03       task-04        task-05
(Syncer)    (Collectors)  (Query Parser)

Tasks 01 and 02 are sequential. Tasks 03, 04, and 05 can run in parallel once task-02 completes.

Module Location

New module: tidal/src/text/ with submodules:

  • mod.rs: public re-exports
  • index.rs: TextIndex, TextIndexConfig
  • writer.rs: TextIndexWriter (write/delete operations)
  • syncer.rs: TextIndexSyncer (background indexing)
  • collectors.rs: AllScoresCollector, ScoredCandidateCollector
  • query.rs: TextQueryParser

Notes

  • tantivy must be added to tidal/Cargo.toml as a dependency
  • Text field definitions must be added to Schema / SchemaBuilder
  • The unsafe_code = "forbid" lint is crate-level and applies only to our own code, not to dependencies: tantivy uses unsafe internally, but our wrapper code does not need it
  • Within tantivy, forbid(unsafe_code) is set in some modules but not all; any unsafe or FFI code stays contained within that crate

Done When

All 14 acceptance criteria above pass. Tests pass with cargo test --manifest-path tidal/Cargo.toml. The text_index bench shows BM25 query < 10ms at 10K documents.