# m5p1: Tantivy Integration
## Delivers

Tantivy embedded as a derived index for full-text search. DB-primary consistency pattern: the entity store is the source of truth and Tantivy is a materialized view updated via an outbox sequence. BM25 scoring is exposed via custom `Collector` implementations and the Weight/Scorer seek pattern. Schema text fields (title, description, tags) are indexed automatically. Crash recovery replays from the last committed sequence number stored in Tantivy's commit payload.

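The DB-primary pattern and the recovery rule above can be sketched without Tantivy at all. The following is a minimal, std-only model, not the planned implementation: `Change`, `DerivedIndex`, and the `BTreeMap` outbox are hypothetical stand-ins, with `committed_seq` playing the role of the sequence number that the real index would keep in Tantivy's commit payload.

```rust
use std::collections::BTreeMap;

/// Hypothetical outbox record: one entity-store write, keyed by WAL sequence.
#[derive(Clone, Debug, PartialEq)]
pub enum Change {
    Upsert { entity_id: u64, text: String },
    Delete { entity_id: u64 },
}

/// Stand-in for the derived text index. `committed` models durable segments;
/// `committed_seq` models the sequence number stored with each commit.
#[derive(Default)]
pub struct DerivedIndex {
    pending: Vec<(u64, Change)>,
    committed: BTreeMap<u64, String>,
    committed_seq: Option<u64>,
}

impl DerivedIndex {
    /// Buffer a change; nothing is durable until `commit` runs.
    pub fn apply(&mut self, seq: u64, change: Change) {
        self.pending.push((seq, change));
    }

    /// Make buffered changes durable and record the highest sequence seen.
    pub fn commit(&mut self) {
        for (seq, change) in self.pending.drain(..) {
            match change {
                Change::Upsert { entity_id, text } => {
                    self.committed.insert(entity_id, text);
                }
                Change::Delete { entity_id } => {
                    self.committed.remove(&entity_id);
                }
            }
            self.committed_seq = Some(self.committed_seq.map_or(seq, |s| s.max(seq)));
        }
    }

    /// Simulate a crash: uncommitted work is lost, committed state survives.
    pub fn crash(&mut self) {
        self.pending.clear();
    }

    /// Replay every outbox record newer than the committed sequence, then commit.
    pub fn recover(&mut self, outbox: &BTreeMap<u64, Change>) {
        let start = self.committed_seq.map_or(0, |s| s + 1);
        for (seq, change) in outbox.range(start..) {
            self.apply(*seq, change.clone());
        }
        self.commit();
    }

    pub fn committed_seq(&self) -> Option<u64> {
        self.committed_seq
    }

    pub fn get(&self, entity_id: u64) -> Option<&str> {
        self.committed.get(&entity_id).map(String::as_str)
    }
}
```

Because the entity store keeps every write in sequence order, losing uncommitted index work is harmless in this model: recovery only needs the one number saved alongside the last commit.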
## Dependencies

- m1p3 (storage engine, key encoding, `StorageEngine` trait, `scan_prefix`)
- m1p5 (entity write API, WAL sequence numbers)
- m2p2 (metadata fields used for field-scoped queries)
- m4 (full TidalDb API with sessions and agents; all complete)

## Research References

- `docs/research/tantivy.md`: Collector API, consistency pattern, seek scoring, commit model, single-writer lock, segment merge
- `CODING_GUIDELINES.md` Section 5: Text Search guidelines
- `CODING_GUIDELINES.md` Section 7: Error handling

## Acceptance Criteria (Phase Level)

- [ ] `TextIndex` struct wraps Tantivy `Index`, `IndexWriter` (behind `Mutex`), and `IndexReader` with auto-reload
- [ ] Tantivy schema created from tidalDB schema text field definitions: `text` fields get full-text tokenization; `keyword` fields get raw indexing
- [ ] `TextIndexWriter::index_item(entity_id, metadata)` adds or updates a document in Tantivy; `delete_item(entity_id)` removes via `delete_term`
- [ ] Background indexer: `TextIndexSyncer` reads entity store writes (via WAL sequence tracking) and feeds Tantivy writer; commit interval configurable (default: every 1000 docs or 2 seconds)
- [ ] Each Tantivy `commit()` stores the last-processed WAL sequence number in the commit payload via `set_payload()`; crash recovery replays from that sequence number
- [ ] Custom `AllScoresCollector` implementing Tantivy's `Collector` trait returns all matching `(EntityId, f32)` pairs with BM25 scores; `requires_scoring()` returns `true`
- [ ] `ScoredCandidateCollector` implementing Tantivy's `Collector` trait accepts a pre-sorted candidate set and returns BM25 scores via `DocSet::seek()`
- [ ] External `EntityId -> DocAddress` mapping maintained via a fast field (`entity_id_field`) on every Tantivy document
- [ ] Boolean query parsing: AND, OR, NOT operators; exact phrase (`"..."`); field-scoped (`title:jazz`); exclusion (`-beginner`); wildcard prefix (`pian*`)
- [ ] Index rebuild from entity store: `text_index.rebuild_from(storage)` scans all items and rebuilds Tantivy index
- [ ] BM25 query latency < 10ms at 10K documents (Criterion benchmarked)
- [ ] Tantivy `IndexWriter` heap budget set to 50MB
- [ ] `LogMergePolicy` configured with defaults; `wait_merging_threads()` called on shutdown
- [ ] `TextIndex` is `Send + Sync`, so it is safe to share across threads behind `Arc`

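The seek-based criterion for `ScoredCandidateCollector` rests on one idea: when the candidate set is sorted by document id, a forward-only cursor can jump to each candidate instead of visiting every match. The sketch below models that idea with plain slices. `PostingCursor`, `score_candidates`, and the precomputed scores are hypothetical; a real collector would drive Tantivy's scorer rather than a slice.

```rust
/// Sentinel returned once the cursor is exhausted.
pub const TERMINATED: u32 = u32::MAX;

/// Hypothetical scored posting list: (doc id, BM25 score), ascending by doc id.
pub struct PostingCursor<'a> {
    postings: &'a [(u32, f32)],
    pos: usize,
}

impl<'a> PostingCursor<'a> {
    pub fn new(postings: &'a [(u32, f32)]) -> Self {
        Self { postings, pos: 0 }
    }

    /// Current doc id, or TERMINATED when past the end.
    pub fn doc(&self) -> u32 {
        self.postings.get(self.pos).map_or(TERMINATED, |p| p.0)
    }

    /// Advance to the first doc id >= target. The cursor never moves backwards.
    pub fn seek(&mut self, target: u32) -> u32 {
        let postings = self.postings;
        let advance = postings[self.pos..].partition_point(|p| p.0 < target);
        self.pos += advance;
        self.doc()
    }

    /// Score of the current doc; only valid while the cursor is not terminated.
    pub fn score(&self) -> f32 {
        self.postings[self.pos].1
    }
}

/// Score only the given candidates, which must be sorted ascending by doc id.
/// Candidates that do not match the query are skipped.
pub fn score_candidates(postings: &[(u32, f32)], candidates: &[u32]) -> Vec<(u32, f32)> {
    let mut cursor = PostingCursor::new(postings);
    let mut out = Vec::new();
    for &candidate in candidates {
        if cursor.seek(candidate) == TERMINATED {
            break; // no later candidate can match either
        }
        if cursor.doc() == candidate {
            out.push((candidate, cursor.score()));
        }
    }
    out
}
```

The pre-sorted requirement matters because the cursor only moves forward: an out-of-order candidate would be silently missed rather than rescored.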
## Task Execution Order

```
task-01 (TextIndex Core)
    |
    v
task-02 (Document Write/Delete)
    |           |             |
    v           v             v
 task-03     task-04       task-05
(Syncer)  (Collectors)  (Query Parser)
```

Tasks 01 and 02 are sequential. Tasks 03, 04, and 05 can run in parallel once task-02 completes.

## Module Location

New module: `tidal/src/text/` with submodules:

- `mod.rs`: public re-exports
- `index.rs`: `TextIndex`, `TextIndexConfig`
- `writer.rs`: `TextIndexWriter` (write/delete operations)
- `syncer.rs`: `TextIndexSyncer` (background indexing)
- `collectors.rs`: `AllScoresCollector`, `ScoredCandidateCollector`
- `query.rs`: `TextQueryParser`

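One piece of logic that would plausibly live in `syncer.rs` is the commit cadence from the acceptance criteria (every 1000 docs or 2 seconds). The sketch below is one way to express it, under the assumption that "or" means whichever threshold is reached first. `CommitPolicy` and its methods are hypothetical names, and time is passed in explicitly so the logic can be tested without sleeping.

```rust
use std::time::{Duration, Instant};

/// Hypothetical commit trigger: commit once either threshold is reached.
pub struct CommitPolicy {
    max_docs: usize,
    max_interval: Duration,
    pending_docs: usize,
    last_commit: Instant,
}

impl CommitPolicy {
    pub fn new(max_docs: usize, max_interval: Duration, now: Instant) -> Self {
        Self {
            max_docs,
            max_interval,
            pending_docs: 0,
            last_commit: now,
        }
    }

    /// Defaults taken from the acceptance criteria: 1000 docs or 2 seconds.
    pub fn with_defaults(now: Instant) -> Self {
        Self::new(1000, Duration::from_secs(2), now)
    }

    /// Call once per document handed to the index writer.
    pub fn record_doc(&mut self) {
        self.pending_docs += 1;
    }

    /// True when there is pending work and at least one threshold is crossed.
    pub fn should_commit(&self, now: Instant) -> bool {
        self.pending_docs > 0
            && (self.pending_docs >= self.max_docs
                || now.duration_since(self.last_commit) >= self.max_interval)
    }

    /// Reset the counters after a successful commit.
    pub fn mark_committed(&mut self, now: Instant) {
        self.pending_docs = 0;
        self.last_commit = now;
    }
}
```

Skipping the commit when nothing is pending keeps an idle syncer from producing empty commits every two seconds.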
## Notes

- `tantivy` must be added to `tidal/Cargo.toml` as a dependency
- Text field definitions must be added to `Schema` / `SchemaBuilder`
- The `unsafe_code = "forbid"` lint is crate-level: `tantivy` itself uses unsafe, but we do not need unsafe in our wrapper code
- The `tantivy` crate itself has `forbid(unsafe_code)` in some modules but not all; the FFI is contained within their crate

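The point about wrapper code needing no `unsafe` can be illustrated together with the `Send + Sync` criterion. In this sketch, `FakeWriter`, `TextIndexSketch`, and `index_from_threads` are hypothetical stand-ins rather than the planned types, and the lint is applied to a module here only so the example is self-contained.

```rust
use std::sync::Arc;
use std::thread;

#[forbid(unsafe_code)]
pub mod text {
    use std::sync::Mutex;

    /// Stand-in for a writer that needs exclusive access.
    #[derive(Default)]
    pub struct FakeWriter {
        pub docs: Vec<String>,
    }

    /// Hypothetical wrapper shape: the writer sits behind a `Mutex`,
    /// so shared references are enough to index documents.
    #[derive(Default)]
    pub struct TextIndexSketch {
        writer: Mutex<FakeWriter>,
    }

    impl TextIndexSketch {
        pub fn index_item(&self, text: &str) {
            let mut writer = self.writer.lock().expect("writer mutex poisoned");
            writer.docs.push(text.to_string());
        }

        pub fn doc_count(&self) -> usize {
            self.writer.lock().expect("writer mutex poisoned").docs.len()
        }
    }
}

/// Compiles only if `T` is `Send + Sync`; a cheap compile-time check.
pub fn assert_send_sync<T: Send + Sync>() {}

/// Share one index across `n` threads behind an `Arc` and index one doc per thread.
pub fn index_from_threads(n: usize) -> usize {
    assert_send_sync::<text::TextIndexSketch>();
    let index = Arc::new(text::TextIndexSketch::default());
    let handles: Vec<_> = (0..n)
        .map(|i| {
            let index = Arc::clone(&index);
            thread::spawn(move || index.index_item(&format!("doc {i}")))
        })
        .collect();
    for handle in handles {
        handle.join().expect("worker thread panicked");
    }
    index.doc_count()
}
```

Thread safety here comes entirely from `Mutex` and `Arc`, which is why the wrapper compiles under the lint without any `unsafe` block.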
## Done When

All 14 acceptance criteria above pass. Tests pass with `cargo test --manifest-path tidal/Cargo.toml`. The `text_index` bench shows BM25 query < 10ms at 10K documents.