tidaldb/docs/planning/milestone-5/phase-2/task-01-rrf-implementation.md
jordan 192c473f55 feat: complete Milestone 5 — full-text search, RRF fusion, and creator search
- M5p1: BM25 text indexing via Tantivy with background syncer (0.26ms @ 10K docs)
- M5p2: RRF fusion layer combining BM25 + ANN scores (46µs @ 1K candidates)
- M5p3: unified Search query API (8-stage pipeline, BM25 + vector + ranking)
- M5p4: creator text + vector indexing and creator search executor (< 20ms @ 200 creators)
- Refactor db/mod.rs into focused sub-modules (creators, items, sessions, signals, etc.)
- Decompose monolithic files into directory modules (query/executor, ranking/diversity, etc.)
- Split brute.rs → brute/mod.rs + brute/tests.rs; extract search executor helpers
- Add benches: fusion, search, session, text_index
- Add M5 UAT test suites (m5_uat, m5_search, m5p4_creator_search, text_index)
- Update blog posts, roadmap, content strategy, and M5 planning docs
- Add tmp/ and .claude/worktrees/ to .gitignore

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-21 23:53:16 -07:00

282 lines
10 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Task 01: RRF Implementation
## Delivers
`HybridFusion` struct implementing Reciprocal Rank Fusion. `fuse()` merges a BM25 ranked list and an ANN ranked list into a single `Vec<(EntityId, f64)>` sorted by descending fused score. Documents appearing in only one list contribute only their single-list term. `k = 60` by default, configurable.
## Complexity: S
## Dependencies
- m5p1 COMPLETE: `EntityId` type, `tidal/src/query/` module structure
- `tidal/src/query/mod.rs` exists for adding the `fusion` submodule
## Technical Design
### RRF Formula
From `docs/research/tantivy.md`:
> RRFscore(d) = 1/(60 + rank_bm25(d)) + 1/(60 + rank_ann(d))
Where:
- `rank_bm25(d)` is the 1-based rank of document `d` in the BM25 list (rank 1 = highest BM25 score)
- `rank_ann(d)` is the 1-based rank of document `d` in the ANN list (rank 1 = lowest L2 distance)
- Documents absent from a list contribute zero for that term
The k=60 constant is insensitive across 30100 range. We implement it as configurable, defaulting to 60.
### Input Conventions
- **BM25 results**: `&[(EntityId, f32)]` where the f32 is the BM25 score. **The caller must pre-sort these descending by score.** The `fuse()` function uses position-as-rank (position 0 = rank 1).
- **ANN results**: `&[(EntityId, f32)]` where the f32 is the L2-squared distance. **The caller must pre-sort these ascending by distance.** The `fuse()` function uses position-as-rank (position 0 = rank 1).
This design keeps `fuse()` a pure function with no sorting overhead. The caller controls ordering.
### HybridFusion
```rust
// tidal/src/query/fusion.rs
use std::collections::HashMap;
use crate::schema::EntityId;
/// Reciprocal Rank Fusion (Cormack et al. SIGIR 2009).
///
/// Merges ranked lists from heterogeneous retrieval systems (BM25 text scores,
/// ANN vector distances) into a single ranked list using only rank positions.
///
/// The k=60 constant is insensitive across [30, 100] — see the research
/// literature. A configurable k is provided for experimentation.
///
/// # Reference
///
/// Cormack, Clarke, Büttcher. "Reciprocal Rank Fusion Outperforms Condorcet
/// and Individual Rank Learning Methods." SIGIR 2009.
#[derive(Debug, Clone)]
pub struct HybridFusion {
/// Rank offset constant. Default 60 per the original paper.
pub k: u32,
}
impl Default for HybridFusion {
fn default() -> Self {
Self { k: 60 }
}
}
impl HybridFusion {
/// Create a fusion instance with the default k=60.
#[must_use]
pub fn new() -> Self {
Self::default()
}
/// Create a fusion instance with a custom k.
#[must_use]
pub fn with_k(k: u32) -> Self {
Self { k }
}
/// Fuse two ranked lists via Reciprocal Rank Fusion.
///
/// Both lists must be pre-sorted in "best first" order by the caller:
/// - `bm25_results`: sorted descending by BM25 score (index 0 = rank 1)
/// - `ann_results`: sorted ascending by L2 distance (index 0 = rank 1)
///
/// The f32 value in each tuple is used only to establish the ordering by
/// the caller; `fuse()` itself uses only position (0-indexed) as the rank.
///
/// Documents appearing in only one list contribute only their single-list
/// term. The missing-list contribution is zero.
///
/// Returns results sorted by descending fused RRF score.
#[must_use]
pub fn fuse(
&self,
bm25_results: &[(EntityId, f32)],
ann_results: &[(EntityId, f32)],
) -> Vec<(EntityId, f64)> {
let k = f64::from(self.k);
// Map entity_id -> accumulated RRF score
let capacity = bm25_results.len() + ann_results.len();
let mut scores: HashMap<u64, f64> = HashMap::with_capacity(capacity);
// BM25 contribution: rank is 1-based (position 0 → rank 1)
for (rank_0based, (entity_id, _score)) in bm25_results.iter().enumerate() {
let rank = (rank_0based + 1) as f64;
*scores.entry(entity_id.as_u64()).or_insert(0.0) += 1.0 / (k + rank);
}
// ANN contribution: rank is 1-based (position 0 → rank 1)
for (rank_0based, (entity_id, _distance)) in ann_results.iter().enumerate() {
let rank = (rank_0based + 1) as f64;
*scores.entry(entity_id.as_u64()).or_insert(0.0) += 1.0 / (k + rank);
}
// Collect and sort descending by fused score
let mut results: Vec<(EntityId, f64)> = scores
.into_iter()
.map(|(id, score)| (EntityId::new(id), score))
.collect();
results.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap_or(std::cmp::Ordering::Equal));
results
}
}
```
### Registration in query/mod.rs
Add to `tidal/src/query/mod.rs`:
```rust
pub mod fusion;
pub use fusion::HybridFusion;
```
## Acceptance Criteria
- [ ] `tidal/src/query/fusion.rs` created with `HybridFusion` struct
- [ ] `HybridFusion { k: u32 }` with `Default` trait (k=60)
- [ ] `HybridFusion::new()` returns default k=60 instance
- [ ] `HybridFusion::with_k(k)` returns configured instance
- [ ] `fuse(bm25: &[(EntityId, f32)], ann: &[(EntityId, f32)]) -> Vec<(EntityId, f64)>` implements RRF
- [ ] BM25 rank: position 0 → rank 1, position N-1 → rank N
- [ ] ANN rank: position 0 → rank 1, position N-1 → rank N
- [ ] Documents in only one list: single-list contribution only (missing term = 0)
- [ ] Results sorted descending by fused score
- [ ] `pub mod fusion;` added to `tidal/src/query/mod.rs`
- [ ] `pub use fusion::HybridFusion;` exported from `tidal/src/query/mod.rs`
- [ ] Unit tests: `fuse_both_lists`, `fuse_bm25_only`, `fuse_ann_only`, `fuse_empty_lists`, `fuse_single_doc_both_lists`, `fuse_k_affects_scores`
- [ ] Property test: for any pair of ranked lists, fused output is union of both document sets; score for doc in both lists > score for doc in only one list
- [ ] `cargo check`, `cargo fmt`, `cargo clippy -D warnings` all pass
## Test Strategy
```rust
#[test]
fn fuse_both_lists() {
// BM25: [A=1.0, B=0.8, C=0.5] (descending)
// ANN: [B=0.1, A=0.2, D=0.5] (ascending distance)
// Expected: B ranks highest (rank 1 in ANN, rank 2 in BM25)
// A is rank 1 in BM25, rank 2 in ANN
let bm25 = vec![
(EntityId::new(1), 1.0f32), // A, rank 1
(EntityId::new(2), 0.8f32), // B, rank 2
(EntityId::new(3), 0.5f32), // C, rank 3 (BM25 only)
];
let ann = vec![
(EntityId::new(2), 0.1f32), // B, rank 1 (best ANN match)
(EntityId::new(1), 0.2f32), // A, rank 2
(EntityId::new(4), 0.5f32), // D, rank 3 (ANN only)
];
let fusion = HybridFusion::new(); // k=60
let results = fusion.fuse(&bm25, &ann);
// B: 1/(60+2) + 1/(60+1) = 1/62 + 1/61 ≈ 0.01613 + 0.01639 = 0.03252
// A: 1/(60+1) + 1/(60+2) = 1/61 + 1/62 ≈ 0.03252 (same as B — tie!)
// Actually: B is rank 2 in BM25, rank 1 in ANN; A is rank 1 in BM25, rank 2 in ANN
// B: 1/(60+2) + 1/(60+1) = 0.03252
// A: 1/(60+1) + 1/(60+2) = 0.03252 (same score — tie)
// C: 1/(60+3) + 0 = 1/63 ≈ 0.01587
// D: 0 + 1/(60+3) = 1/63 ≈ 0.01587
// Verify all 4 documents are in the output
let ids: Vec<u64> = results.iter().map(|(id, _)| id.as_u64()).collect();
assert!(ids.contains(&1)); // A
assert!(ids.contains(&2)); // B
assert!(ids.contains(&3)); // C
assert!(ids.contains(&4)); // D
// C and D (single-list) have lower scores than A and B (both lists)
let c_score = results.iter().find(|(id, _)| id.as_u64() == 3).unwrap().1;
let d_score = results.iter().find(|(id, _)| id.as_u64() == 4).unwrap().1;
let a_score = results.iter().find(|(id, _)| id.as_u64() == 1).unwrap().1;
let b_score = results.iter().find(|(id, _)| id.as_u64() == 2).unwrap().1;
assert!(a_score > c_score);
assert!(b_score > d_score);
// Scores are sorted descending
let scores: Vec<f64> = results.iter().map(|(_, s)| *s).collect();
for i in 1..scores.len() {
assert!(scores[i-1] >= scores[i]);
}
}
#[test]
fn fuse_bm25_only() {
let bm25 = vec![(EntityId::new(1), 1.0f32), (EntityId::new(2), 0.5f32)];
let ann = vec![];
let fusion = HybridFusion::new();
let results = fusion.fuse(&bm25, &ann);
assert_eq!(results.len(), 2);
// rank 1 doc scores higher than rank 2
let score_1 = results.iter().find(|(id, _)| id.as_u64() == 1).unwrap().1;
let score_2 = results.iter().find(|(id, _)| id.as_u64() == 2).unwrap().1;
assert!(score_1 > score_2);
// Score = 1/(60+1) for rank 1 = 1/61
let expected = 1.0 / (60.0 + 1.0);
assert!((score_1 - expected).abs() < 1e-9);
}
#[test]
fn fuse_k_affects_scores() {
let bm25 = vec![(EntityId::new(1), 1.0f32)];
let ann = vec![(EntityId::new(1), 0.1f32)];
let fusion_60 = HybridFusion::new(); // k=60
let fusion_30 = HybridFusion::with_k(30); // k=30
let results_60 = fusion_60.fuse(&bm25, &ann);
let results_30 = fusion_30.fuse(&bm25, &ann);
// k=30: 1/(30+1) + 1/(30+1) = 2/31 ≈ 0.0645
// k=60: 1/(60+1) + 1/(60+1) = 2/61 ≈ 0.0328
// k=30 produces higher scores
assert!(results_30[0].1 > results_60[0].1);
}
```
### Property Test
```rust
use proptest::prelude::*;
proptest! {
#[test]
fn rrf_output_is_union_of_inputs(
bm25_ids in prop::collection::vec(1u64..=100, 0..20),
ann_ids in prop::collection::vec(1u64..=100, 0..20),
) {
let bm25: Vec<(EntityId, f32)> = bm25_ids.iter().enumerate()
.map(|(i, &id)| (EntityId::new(id), (100 - i) as f32))
.collect();
let ann: Vec<(EntityId, f32)> = ann_ids.iter().enumerate()
.map(|(i, &id)| (EntityId::new(id), i as f32 * 0.01))
.collect();
let fusion = HybridFusion::new();
let results = fusion.fuse(&bm25, &ann);
// Output must contain the union of all input IDs
let all_ids: std::collections::HashSet<u64> = bm25_ids.iter()
.chain(ann_ids.iter())
.copied()
.collect();
let result_ids: std::collections::HashSet<u64> = results.iter()
.map(|(id, _)| id.as_u64())
.collect();
prop_assert_eq!(&all_ids, &result_ids);
// Results must be sorted descending
for i in 1..results.len() {
prop_assert!(results[i-1].1 >= results[i].1);
}
}
}
```