tidaldb/docs/planning/milestone-5/phase-2/OVERVIEW.md
jordan 192c473f55 feat: complete Milestone 5 — full-text search, RRF fusion, and creator search
- M5p1: BM25 text indexing via Tantivy with background syncer (0.26ms @ 10K docs)
- M5p2: RRF fusion layer combining BM25 + ANN scores (46µs @ 1K candidates)
- M5p3: unified Search query API (8-stage pipeline, BM25 + vector + ranking)
- M5p4: creator text + vector indexing and creator search executor (< 20ms @ 200 creators)
- Refactor db/mod.rs into focused sub-modules (creators, items, sessions, signals, etc.)
- Decompose monolithic files into directory modules (query/executor, ranking/diversity, etc.)
- Split brute.rs → brute/mod.rs + brute/tests.rs; extract search executor helpers
- Add benches: fusion, search, session, text_index
- Add M5 UAT test suites (m5_uat, m5_search, m5p4_creator_search, text_index)
- Update blog posts, roadmap, content strategy, and M5 planning docs
- Add tmp/ and .claude/worktrees/ to .gitignore

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-21 23:53:16 -07:00


m5p2: Hybrid Fusion (RRF)

Delivers

Reciprocal Rank Fusion (RRF) combines the BM25 ranked list and the ANN ranked list into a single scored result set. The starting point is RRF with k=60; the architecture supports upgrading to a tuned linear combination once relevance labels exist. The phase handles three retrieval modes: text-only, vector-only, and hybrid. A RetrievalMode enum and a route_results() function encapsulate the decision logic that the m5p3 SearchExecutor will call.

Dependencies

  • m5p1 COMPLETE: TextIndex, AllScoresCollector, TextQueryParser — BM25 search that returns Vec<(EntityId, f32)>
  • m2p1 COMPLETE: VectorIndex trait, VectorSearchResult { id: VectorId, distance: f32 }, EmbeddingSlotRegistry — ANN search

Research References

  • docs/research/tantivy.md — Section "Start with Reciprocal Rank Fusion": RRF formula, k=60, Cormack et al. SIGIR 2009, production system comparison, upgrade path to linear combination

Acceptance Criteria (Phase Level)

  • HybridFusion struct with k: u32 field (default 60) in tidal/src/query/fusion.rs
  • HybridFusion::fuse(bm25_results: &[(EntityId, f32)], ann_results: &[(EntityId, f32)], k: u32) -> Vec<(EntityId, f64)> implements RRF
  • RRF formula: score(d) = 1.0 / (k + rank_bm25(d)) + 1.0 / (k + rank_ann(d)), ranks are 1-based (rank 1 = best)
  • Documents in only one list contribute only their single-list term; the missing-list term is zero
  • Results sorted descending by fused score
  • RetrievalMode enum: TextOnly, VectorOnly, Hybrid
  • RetrievalMode::determine(has_text: bool, has_vector: bool) -> Option<RetrievalMode> returns the correct mode
  • route_results() converts single-mode results to Vec<(EntityId, f64)> and calls HybridFusion::fuse() for hybrid
  • Pure BM25 path: results passed through as Vec<(EntityId, f64)> without fusion overhead
  • Pure ANN path: VectorSearchResult list converted to Vec<(EntityId, f64)> (score = 1.0 / (k + rank))
  • k parameter configurable; default 60
  • Fusion adds < 1ms to query time for 1000 candidates from each list (Criterion benchmarked)
  • Property test: for any pair of ranked lists, RRF output is the union of both input document sets; scores computed correctly to 6 decimal places
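
The RRF criteria above can be sketched roughly as follows. This is a minimal illustration, not the final implementation: `EntityId` is aliased to `u64` as a stand-in for the crate's real type, both inputs are assumed to be already sorted best-first with no duplicate ids, and the tie-break on ascending id is an assumption added only to make the output deterministic.

```rust
use std::collections::HashMap;

/// Stand-in for the crate's real `EntityId` (assumption for this sketch).
pub type EntityId = u64;

pub struct HybridFusion {
    pub k: u32,
}

impl Default for HybridFusion {
    fn default() -> Self {
        Self { k: 60 }
    }
}

impl HybridFusion {
    /// Fuses two ranked lists with RRF. Both slices are assumed to be
    /// sorted best-first; the f32 values are ignored, only position counts.
    pub fn fuse(
        bm25_results: &[(EntityId, f32)],
        ann_results: &[(EntityId, f32)],
        k: u32,
    ) -> Vec<(EntityId, f64)> {
        let mut scores: HashMap<EntityId, f64> = HashMap::new();
        for list in [bm25_results, ann_results] {
            for (idx, (id, _)) in list.iter().enumerate() {
                // Ranks are 1-based: index 0 is rank 1.
                let rank = (idx + 1) as f64;
                // A document missing from a list simply gets no term from it.
                *scores.entry(*id).or_insert(0.0) += 1.0 / (f64::from(k) + rank);
            }
        }
        let mut fused: Vec<(EntityId, f64)> = scores.into_iter().collect();
        // Descending by fused score; ties broken by ascending id (assumption).
        fused.sort_by(|a, b| b.1.total_cmp(&a.1).then(a.0.cmp(&b.0)));
        fused
    }
}
```

With k = 60, a BM25 order of [1, 2] and an ANN order of [2, 3], entity 2 scores 1/62 + 1/61 and ranks first, followed by entity 1 (1/61) and entity 3 (1/62).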

Task Execution Order

task-01 (RRF Implementation)
    |
    v
task-02 (Retrieval Mode Router)

The tasks run sequentially: Task 02 depends on the HybridFusion struct from Task 01.

Module Location

New module: tidal/src/query/fusion.rs

  • HybridFusion — RRF computation struct
  • RetrievalMode — enum for text-only / vector-only / hybrid
  • route_results() — routes pre-retrieved result lists through the appropriate path

Add pub mod fusion; to tidal/src/query/mod.rs.
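
One possible shape for the mode selection in this module is sketched below. The variant names and the `determine` signature come from the acceptance criteria; returning `None` when the query has neither text nor a vector is an assumption, since the criteria only say that the correct mode is returned.

```rust
/// Which retrieval path a query should take.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RetrievalMode {
    TextOnly,
    VectorOnly,
    Hybrid,
}

impl RetrievalMode {
    /// Picks a mode from the inputs present on the query.
    /// `None` for "no text and no vector" is an assumption of this sketch.
    pub fn determine(has_text: bool, has_vector: bool) -> Option<RetrievalMode> {
        match (has_text, has_vector) {
            (true, true) => Some(RetrievalMode::Hybrid),
            (true, false) => Some(RetrievalMode::TextOnly),
            (false, true) => Some(RetrievalMode::VectorOnly),
            (false, false) => None,
        }
    }
}
```

Matching on the tuple keeps all four input combinations visible in one place, so the compiler flags any case that is left out.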

Notes

  • RRF uses rank position only — the input f32 scores are used only for ordering, not for the fusion formula itself
  • BM25 results: (EntityId, f32) where higher score = better → sort descending, rank 1 = index 0
  • ANN results: VectorSearchResult { id, distance } where lower distance = better → sort ascending, rank 1 = index 0
  • For the ANN-only path, convert VectorSearchResult { id, distance } to (EntityId, score) where score = 1.0 / (k + rank) to produce a consistent f64 output
  • The rrf crate exists on crates.io but we implement from scratch to avoid a dependency and maintain full control over the algorithm
  • No unsafe code — pure indexing arithmetic
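
The ANN-only conversion described in the notes might look like the sketch below. The helper name `ann_to_rank_scores` is hypothetical, `EntityId` and `VectorId` are both aliased to `u64` here, and treating a `VectorId` as directly usable as an `EntityId` is an assumption; the real crate may need an explicit mapping.

```rust
/// Stand-ins for the crate's real id types (assumptions for this sketch).
pub type EntityId = u64;
pub type VectorId = u64;

pub struct VectorSearchResult {
    pub id: VectorId,
    pub distance: f32,
}

/// Hypothetical helper: orders ANN hits by ascending distance and assigns
/// each one the single-list RRF term 1 / (k + rank), with 1-based ranks.
pub fn ann_to_rank_scores(results: &[VectorSearchResult], k: u32) -> Vec<(EntityId, f64)> {
    let mut sorted: Vec<&VectorSearchResult> = results.iter().collect();
    // Lower distance is better, so rank 1 is the smallest distance.
    sorted.sort_by(|a, b| a.distance.total_cmp(&b.distance));
    sorted
        .into_iter()
        .enumerate()
        .map(|(idx, r)| (r.id, 1.0 / (f64::from(k) + (idx + 1) as f64)))
        .collect()
}
```

Because the output score depends only on position, two hits at distances 0.1 and 0.9 receive 1/61 and 1/62 with k = 60, the same values they would contribute as the ANN term inside hybrid fusion.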