tidaldb/docs/planning/milestone-5/phase-2/OVERVIEW.md

# m5p2: Hybrid Fusion (RRF)

## Delivers

Reciprocal Rank Fusion (RRF), which combines a BM25 ranked list and an ANN ranked list into a single scored result set. The starting point is RRF with k=60; the architecture leaves room to upgrade to a tuned linear combination once relevance labels exist. The phase covers all three retrieval modes: text-only, vector-only, and hybrid. A `RetrievalMode` enum and a `route_results()` function encapsulate the decision logic that the m5p3 `SearchExecutor` will call.

## Dependencies

- m5p1 COMPLETE: `TextIndex`, `AllScoresCollector`, `TextQueryParser` — BM25 search that returns `Vec<(EntityId, f32)>`
- m2p1 COMPLETE: `VectorIndex` trait, `VectorSearchResult { id: VectorId, distance: f32 }`, `EmbeddingSlotRegistry` — ANN search

## Research References

- `docs/research/tantivy.md` — Section "Start with Reciprocal Rank Fusion": RRF formula, k=60, Cormack et al. SIGIR 2009, production system comparison, upgrade path to linear combination

## Acceptance Criteria (Phase Level)

- [ ] `HybridFusion` struct with `k: u32` field (default 60) in `tidal/src/query/fusion.rs`
- [ ] `HybridFusion::fuse(bm25_results: &[(EntityId, f32)], ann_results: &[(EntityId, f32)], k: u32) -> Vec<(EntityId, f64)>` implements RRF
- [ ] RRF formula: `score(d) = 1.0 / (k + rank_bm25(d)) + 1.0 / (k + rank_ann(d))`, ranks are 1-based (rank 1 = best)
- [ ] Documents in only one list contribute only their single-list term; the missing-list term is zero
- [ ] Results sorted descending by fused score
- [ ] `RetrievalMode` enum: `TextOnly`, `VectorOnly`, `Hybrid`
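- [ ] `RetrievalMode::determine(has_text: bool, has_vector: bool) -> Option<RetrievalMode>` returns `Some(Hybrid)` when both inputs are present, `Some(TextOnly)` or `Some(VectorOnly)` when only one is present, and `None` when neither is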
- [ ] `route_results()` converts single-mode results to `Vec<(EntityId, f64)>` and calls `HybridFusion::fuse()` for hybrid
- [ ] Pure BM25 path: results passed through as `Vec<(EntityId, f64)>` without fusion overhead
- [ ] Pure ANN path: `VectorSearchResult` list converted to `Vec<(EntityId, f64)>` with `score = 1.0 / (k + rank)`, where `rank` is 1-based
- [ ] `k` parameter configurable; default 60
- [ ] Fusion adds less than 1 ms to query time for 1000 candidates from each list (Criterion benchmarked)
- [ ] Property test: for any pair of ranked lists, the set of documents in the RRF output equals the union of both input document sets, and every fused score matches the RRF formula to 6 decimal places
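
The RRF criteria above can be sketched as follows. This is a minimal illustration, not the final implementation: `EntityId` is aliased to `u64` as a stand-in for the real type, both input slices are assumed to be ordered best-first already with no duplicate ids within a list, and the tie-break by id is an added assumption to keep output deterministic.

```rust
use std::collections::HashMap;

/// Stand-in for the project's `EntityId` (assumption; the real type lives elsewhere).
pub type EntityId = u64;

/// RRF computation struct with a configurable `k`.
pub struct HybridFusion {
    pub k: u32,
}

impl Default for HybridFusion {
    fn default() -> Self {
        Self { k: 60 }
    }
}

impl HybridFusion {
    /// Fuses two ranked lists with Reciprocal Rank Fusion.
    ///
    /// Assumes each slice is already ordered best-first. The `f32` values are
    /// ignored here because RRF only uses rank position.
    pub fn fuse(
        bm25_results: &[(EntityId, f32)],
        ann_results: &[(EntityId, f32)],
        k: u32,
    ) -> Vec<(EntityId, f64)> {
        let mut scores: HashMap<EntityId, f64> = HashMap::new();

        for list in [bm25_results, ann_results] {
            for (index, (id, _)) in list.iter().enumerate() {
                // Ranks are 1-based: index 0 is rank 1.
                let rank = (index + 1) as f64;
                // A document missing from a list simply never receives that list's term.
                *scores.entry(*id).or_insert(0.0) += 1.0 / (f64::from(k) + rank);
            }
        }

        let mut fused: Vec<(EntityId, f64)> = scores.into_iter().collect();
        // Descending by fused score; ties broken by id (assumption, for determinism).
        fused.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
        fused
    }
}
```

As a worked check with k=60: a document ranked 2nd by BM25 and 1st by ANN scores 1/62 + 1/61 ≈ 0.032522, while a document that appears only at rank 1 of one list scores 1/61 ≈ 0.016393.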

## Task Execution Order

```
task-01 (RRF Implementation)
    |
    v
task-02 (Retrieval Mode Router)
```

The two tasks run sequentially: Task 02 depends on `HybridFusion` from Task 01.

## Module Location

New module: `tidal/src/query/fusion.rs`

- `HybridFusion` — RRF computation struct
- `RetrievalMode` — enum for text-only / vector-only / hybrid
- `route_results()` — routes pre-retrieved result lists through the appropriate path

Add `pub mod fusion;` to `tidal/src/query/mod.rs`.
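
A minimal sketch of the mode selection that `fusion.rs` would hold. The variant names and the `determine` signature come from the acceptance criteria; returning `None` when neither a text query nor a vector is supplied is an assumption about how the `Option` is meant to be used.

```rust
/// Which retrieval path a query takes.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RetrievalMode {
    TextOnly,
    VectorOnly,
    Hybrid,
}

impl RetrievalMode {
    /// Picks the mode from which query inputs are present.
    /// `None` means there is nothing to search with (assumed behaviour).
    pub fn determine(has_text: bool, has_vector: bool) -> Option<RetrievalMode> {
        match (has_text, has_vector) {
            (true, true) => Some(RetrievalMode::Hybrid),
            (true, false) => Some(RetrievalMode::TextOnly),
            (false, true) => Some(RetrievalMode::VectorOnly),
            (false, false) => None,
        }
    }
}
```

Matching on the tuple keeps all four input combinations visible in one place, so the compiler flags any case left unhandled.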

## Notes

- RRF uses **rank position** only — the input `f32` scores are used only for ordering, not for the fusion formula itself
- BM25 results: `(EntityId, f32)` where **higher score = better** → sort descending, rank 1 = index 0
- ANN results: `VectorSearchResult { id, distance }` where **lower distance = better** → sort ascending, rank 1 = index 0
- For the ANN-only path, convert `VectorSearchResult { id, distance }` to `(EntityId, score)` where `score = 1.0 / (k + rank)` to produce a consistent `f64` output
- The `rrf` crate exists on crates.io but we implement from scratch to avoid a dependency and maintain full control over the algorithm
- No unsafe code — pure indexing arithmetic
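
The two single-mode conversions described in these notes might look like the sketch below. The helper names `ann_to_rank_scores` and `bm25_passthrough` are hypothetical, `EntityId` is aliased to `u64` as a stand-in, and the local `VectorSearchResult` simplifies the m2p1 struct by using `EntityId` directly for `id` (the real struct carries a `VectorId`, and that mapping is out of scope here).

```rust
/// Stand-in for the project's `EntityId` (assumption).
pub type EntityId = u64;

/// Simplified local mirror of the m2p1 result type (assumption: `id` is an `EntityId`).
#[derive(Debug, Clone, Copy)]
pub struct VectorSearchResult {
    pub id: EntityId,
    pub distance: f32,
}

/// ANN-only path (hypothetical helper): lower distance is better, so sort
/// ascending and assign `score = 1.0 / (k + rank)` with a 1-based rank.
pub fn ann_to_rank_scores(results: &[VectorSearchResult], k: u32) -> Vec<(EntityId, f64)> {
    let mut sorted: Vec<&VectorSearchResult> = results.iter().collect();
    sorted.sort_by(|a, b| a.distance.total_cmp(&b.distance));
    sorted
        .iter()
        .enumerate()
        .map(|(index, r)| (r.id, 1.0 / (f64::from(k) + (index + 1) as f64)))
        .collect()
}

/// BM25-only path (hypothetical helper): higher score is better, so sort
/// descending and widen the `f32` scores to `f64` without any fusion step.
pub fn bm25_passthrough(results: &[(EntityId, f32)]) -> Vec<(EntityId, f64)> {
    let mut out: Vec<(EntityId, f64)> = results
        .iter()
        .map(|&(id, score)| (id, f64::from(score)))
        .collect();
    out.sort_by(|a, b| b.1.total_cmp(&a.1));
    out
}
```

Both helpers return scores that decrease down the list, so a caller can treat the output of either path the same way as fused results.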