m5p1: Tantivy Integration
Delivers
Tantivy embedded as a derived index for full-text search. DB-primary consistency pattern: entity store is source of truth, Tantivy is a materialized view updated via an outbox sequence. BM25 scoring exposed via custom Collector and the Weight/Scorer seek pattern. Schema text fields (title, description, tags) automatically indexed. Crash recovery replays from the last committed sequence number stored in Tantivy's commit payload.
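The DB-primary pattern and its crash recovery can be sketched without Tantivy at all. The following is a minimal, std-only model under stated assumptions: `Op`, `DerivedIndex`, and `recover` are hypothetical names invented for illustration, the `committed_seq` field stands in for the sequence number stored in Tantivy's commit payload, and the outbox is a plain slice rather than the real entity store.

```rust
use std::collections::BTreeMap;

/// One outbox entry payload: a change made to the entity store (hypothetical type).
#[derive(Clone, Debug)]
enum Op {
    Upsert { entity_id: u64, text: String },
    Delete { entity_id: u64 },
}

/// Stand-in for the derived index (Tantivy in the real system).
#[derive(Default, Debug)]
struct DerivedIndex {
    /// Documents visible after the last commit.
    committed: BTreeMap<u64, String>,
    /// Sequence number recorded with the last commit (models the commit payload).
    committed_seq: u64,
    /// Changes applied but not yet committed; these are lost on a crash.
    pending: Vec<(u64, Op)>,
}

impl DerivedIndex {
    /// Buffers a change; nothing is durable until `commit` runs.
    fn apply(&mut self, seq: u64, op: Op) {
        self.pending.push((seq, op));
    }

    /// Makes pending changes durable and records the highest sequence applied.
    fn commit(&mut self) {
        for (seq, op) in self.pending.drain(..) {
            match op {
                Op::Upsert { entity_id, text } => {
                    self.committed.insert(entity_id, text);
                }
                Op::Delete { entity_id } => {
                    self.committed.remove(&entity_id);
                }
            }
            self.committed_seq = self.committed_seq.max(seq);
        }
    }

    /// Simulates a crash: uncommitted work disappears.
    fn crash(&mut self) {
        self.pending.clear();
    }
}

/// Replays every outbox entry newer than the last committed sequence number.
fn recover(index: &mut DerivedIndex, outbox: &[(u64, Op)]) {
    let from = index.committed_seq;
    for entry in outbox.iter().filter(|entry| entry.0 > from) {
        index.apply(entry.0, entry.1.clone());
    }
    index.commit();
}
```

Because the entity store remains the source of truth, a crash between commits loses nothing: the derived index simply replays the outbox tail it has not yet acknowledged.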
Dependencies
- m1p3 (storage engine, key encoding, `StorageEngine` trait, `scan_prefix`)
- m1p5 (entity write API, WAL sequence numbers)
- m2p2 (metadata fields used for field-scoped queries)
- m4 (full TidalDb API with sessions and agents — all complete)
Research References
- `docs/research/tantivy.md`: Collector API, consistency pattern, seek scoring, commit model, single-writer lock, segment merge
- `CODING_GUIDELINES.md` Section 5: Text Search guidelines
- `CODING_GUIDELINES.md` Section 7: Error handling
Acceptance Criteria (Phase Level)
- `TextIndex` struct wraps Tantivy `Index`, `IndexWriter` (behind `Mutex`), and `IndexReader` with auto-reload
- Tantivy schema created from tidalDB schema text field definitions: `text` fields get full-text tokenization; `keyword` fields get raw indexing
- `TextIndexWriter::index_item(entity_id, metadata)` adds or updates a document in Tantivy; `delete_item(entity_id)` removes via `delete_term`
- Background indexer: `TextIndexSyncer` reads entity store writes (via WAL sequence tracking) and feeds Tantivy writer; commit interval configurable (default: every 1000 docs or 2 seconds)
- Each Tantivy `commit()` stores the last-processed WAL sequence number in the commit payload via `set_payload()`; crash recovery replays from that sequence number
- Custom `AllScoresCollector` implementing Tantivy's `Collector` trait returns all matching `(EntityId, f32)` pairs with BM25 scores; `requires_scoring()` returns `true`
- `ScoredCandidateCollector` implementing Tantivy's `Collector` trait accepts a pre-sorted candidate set and returns BM25 scores via `DocSet::seek()`
- External `EntityId -> DocAddress` mapping maintained via a fast field (`entity_id_field`) on every Tantivy document
- Boolean query parsing: AND, OR, NOT operators; exact phrase (`"..."`); field-scoped (`title:jazz`); exclusion (`-beginner`); wildcard prefix (`pian*`)
- Index rebuild from entity store: `text_index.rebuild_from(storage)` scans all items and rebuilds Tantivy index
- BM25 query latency < 10ms at 10K documents (Criterion benchmarked)
- Tantivy `IndexWriter` heap budget set to 50MB
- `LogMergePolicy` configured with defaults; `wait_merging_threads()` called on shutdown
- `TextIndex` is `Send + Sync`, so it is safe to share across threads behind `Arc`
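The seek-based scoring required of `ScoredCandidateCollector` is easiest to see on a toy posting list. The sketch below is an assumption-laden model, not Tantivy code: `PostingCursor` and `score_candidates` are hypothetical names, document IDs are plain `u32` values, and the scores are precomputed numbers standing in for BM25. It only illustrates why the candidate set must be pre-sorted: the cursor moves forward only.

```rust
/// A cursor over (doc_id, score) postings sorted by ascending doc_id.
/// It models only the "advance to the first doc >= target" contract;
/// it is not Tantivy's `DocSet` trait.
struct PostingCursor<'a> {
    postings: &'a [(u32, f32)],
    pos: usize,
}

impl<'a> PostingCursor<'a> {
    fn new(postings: &'a [(u32, f32)]) -> Self {
        Self { postings, pos: 0 }
    }

    /// Moves forward to the first posting with doc_id >= target and returns it.
    /// The cursor never moves backwards, so targets must arrive in ascending order.
    fn seek(&mut self, target: u32) -> Option<(u32, f32)> {
        let rest = &self.postings[self.pos..];
        // Binary search over the remaining postings.
        let offset = rest.partition_point(|&(doc, _)| doc < target);
        self.pos += offset;
        self.postings.get(self.pos).copied()
    }
}

/// Scores only the given candidates (sorted ascending).
/// Candidates that do not match the query are omitted from the result.
fn score_candidates(postings: &[(u32, f32)], candidates: &[u32]) -> Vec<(u32, f32)> {
    let mut cursor = PostingCursor::new(postings);
    let mut scored = Vec::new();
    for &candidate in candidates {
        match cursor.seek(candidate) {
            // The cursor landed exactly on the candidate: it matches the query.
            Some((doc, score)) if doc == candidate => scored.push((doc, score)),
            // The cursor landed past the candidate: no match, try the next one.
            Some(_) => {}
            // Postings exhausted: no later candidate can match.
            None => break,
        }
    }
    scored
}
```

Compared with collecting every match and filtering afterwards, this touches only the postings near each candidate, which is the point of pairing a pre-sorted candidate set with a seek operation.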
Task Execution Order
    task-01 (TextIndex Core)
        |
        v
    task-02 (Document Write/Delete)
        |            |             |
        v            v             v
    task-03      task-04       task-05
    (Syncer)   (Collectors)  (Query Parser)
Tasks 01-02 are sequential. Tasks 03, 04, 05 can parallelize after task-02 completes.
Module Location
New module: tidal/src/text/ with submodules:
- `mod.rs`: public re-exports
- `index.rs`: `TextIndex`, `TextIndexConfig`
- `writer.rs`: `TextIndexWriter` (write/delete operations)
- `syncer.rs`: `TextIndexSyncer` (background indexing)
- `collectors.rs`: `AllScoresCollector`, `ScoredCandidateCollector`
- `query.rs`: `TextQueryParser`
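To make the scope of `query.rs` concrete, here is a rough, std-only sketch of recognizing the query forms named in the acceptance criteria (operators, phrases, field-scoped terms, exclusions, prefix wildcards). Everything in it is hypothetical: `Clause`, `split_tokens`, `classify`, and `parse_clauses` are illustration names, and it classifies tokens only, without building a query tree or handling precedence. Tantivy ships its own `QueryParser`; whether `TextQueryParser` wraps it or parses independently is not specified here.

```rust
/// One lexical unit of a search query (hypothetical type for illustration).
#[derive(Debug, PartialEq)]
enum Clause {
    And,
    Or,
    Not,
    Phrase(String),
    Field { field: String, term: String },
    Exclude(String),
    Prefix(String),
    Term(String),
}

/// Splits on whitespace while keeping double-quoted phrases in one token.
fn split_tokens(query: &str) -> Vec<String> {
    let mut tokens = Vec::new();
    let mut current = String::new();
    let mut in_quotes = false;
    for ch in query.chars() {
        match ch {
            '"' => {
                in_quotes = !in_quotes;
                current.push(ch);
            }
            c if c.is_whitespace() && !in_quotes => {
                if !current.is_empty() {
                    tokens.push(std::mem::take(&mut current));
                }
            }
            c => current.push(c),
        }
    }
    if !current.is_empty() {
        tokens.push(current);
    }
    tokens
}

/// Maps a single token to the clause kind its syntax indicates.
fn classify(token: &str) -> Clause {
    match token {
        "AND" => return Clause::And,
        "OR" => return Clause::Or,
        "NOT" => return Clause::Not,
        _ => {}
    }
    if token.len() >= 2 && token.starts_with('"') && token.ends_with('"') {
        return Clause::Phrase(token[1..token.len() - 1].to_string());
    }
    if let Some(rest) = token.strip_prefix('-') {
        if !rest.is_empty() {
            return Clause::Exclude(rest.to_string());
        }
    }
    if let Some((field, term)) = token.split_once(':') {
        if !field.is_empty() && !term.is_empty() {
            return Clause::Field { field: field.to_string(), term: term.to_string() };
        }
    }
    if let Some(stem) = token.strip_suffix('*') {
        if !stem.is_empty() {
            return Clause::Prefix(stem.to_string());
        }
    }
    Clause::Term(token.to_string())
}

/// Tokenizes a query string and classifies each token in order.
fn parse_clauses(query: &str) -> Vec<Clause> {
    split_tokens(query).iter().map(|t| classify(t)).collect()
}
```

For instance, `parse_clauses("title:jazz AND -beginner")` yields a `Field` clause, `And`, and an `Exclude` clause, in that order.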
Notes
- `tantivy` must be added to `tidal/Cargo.toml` as a dependency
- Text field definitions must be added to `Schema`/`SchemaBuilder`
- The `unsafe_code = "forbid"` lint is crate-level and applies only to our code: `tantivy` itself uses unsafe, but our wrapper code does not need any
- The `tantivy` crate has `forbid(unsafe_code)` in some modules but not all; its FFI is contained within that crate
Done When
All 14 acceptance criteria above pass. Tests pass with `cargo test --manifest-path tidal/Cargo.toml`. The `text_index` bench shows BM25 query latency < 10ms at 10K documents.