# Task 01: RRF Implementation ## Delivers `HybridFusion` struct implementing Reciprocal Rank Fusion. `fuse()` merges a BM25 ranked list and an ANN ranked list into a single `Vec<(EntityId, f64)>` sorted by descending fused score. Documents appearing in only one list contribute only their single-list term. `k = 60` by default, configurable. ## Complexity: S ## Dependencies - m5p1 COMPLETE: `EntityId` type, `tidal/src/query/` module structure - `tidal/src/query/mod.rs` exists for adding the `fusion` submodule ## Technical Design ### RRF Formula From `docs/research/tantivy.md`: > RRFscore(d) = 1/(60 + rank_bm25(d)) + 1/(60 + rank_ann(d)) Where: - `rank_bm25(d)` is the 1-based rank of document `d` in the BM25 list (rank 1 = highest BM25 score) - `rank_ann(d)` is the 1-based rank of document `d` in the ANN list (rank 1 = lowest L2 distance) - Documents absent from a list contribute zero for that term The k=60 constant is insensitive across 30–100 range. We implement it as configurable, defaulting to 60. ### Input Conventions - **BM25 results**: `&[(EntityId, f32)]` where the f32 is the BM25 score. **The caller must pre-sort these descending by score.** The `fuse()` function uses position-as-rank (position 0 = rank 1). - **ANN results**: `&[(EntityId, f32)]` where the f32 is the L2-squared distance. **The caller must pre-sort these ascending by distance.** The `fuse()` function uses position-as-rank (position 0 = rank 1). This design keeps `fuse()` a pure function with no sorting overhead. The caller controls ordering. ### HybridFusion ```rust // tidal/src/query/fusion.rs use std::collections::HashMap; use crate::schema::EntityId; /// Reciprocal Rank Fusion (Cormack et al. SIGIR 2009). /// /// Merges ranked lists from heterogeneous retrieval systems (BM25 text scores, /// ANN vector distances) into a single ranked list using only rank positions. /// /// The k=60 constant is insensitive across [30, 100] — see the research /// literature. A configurable k is provided for experimentation. /// /// # Reference /// /// Cormack, Clarke, Büttcher. "Reciprocal Rank Fusion Outperforms Condorcet /// and Individual Rank Learning Methods." SIGIR 2009. #[derive(Debug, Clone)] pub struct HybridFusion { /// Rank offset constant. Default 60 per the original paper. pub k: u32, } impl Default for HybridFusion { fn default() -> Self { Self { k: 60 } } } impl HybridFusion { /// Create a fusion instance with the default k=60. #[must_use] pub fn new() -> Self { Self::default() } /// Create a fusion instance with a custom k. #[must_use] pub fn with_k(k: u32) -> Self { Self { k } } /// Fuse two ranked lists via Reciprocal Rank Fusion. /// /// Both lists must be pre-sorted in "best first" order by the caller: /// - `bm25_results`: sorted descending by BM25 score (index 0 = rank 1) /// - `ann_results`: sorted ascending by L2 distance (index 0 = rank 1) /// /// The f32 value in each tuple is used only to establish the ordering by /// the caller; `fuse()` itself uses only position (0-indexed) as the rank. /// /// Documents appearing in only one list contribute only their single-list /// term. The missing-list contribution is zero. /// /// Returns results sorted by descending fused RRF score. #[must_use] pub fn fuse( &self, bm25_results: &[(EntityId, f32)], ann_results: &[(EntityId, f32)], ) -> Vec<(EntityId, f64)> { let k = f64::from(self.k); // Map entity_id -> accumulated RRF score let capacity = bm25_results.len() + ann_results.len(); let mut scores: HashMap = HashMap::with_capacity(capacity); // BM25 contribution: rank is 1-based (position 0 → rank 1) for (rank_0based, (entity_id, _score)) in bm25_results.iter().enumerate() { let rank = (rank_0based + 1) as f64; *scores.entry(entity_id.as_u64()).or_insert(0.0) += 1.0 / (k + rank); } // ANN contribution: rank is 1-based (position 0 → rank 1) for (rank_0based, (entity_id, _distance)) in ann_results.iter().enumerate() { let rank = (rank_0based + 1) as f64; *scores.entry(entity_id.as_u64()).or_insert(0.0) += 1.0 / (k + rank); } // Collect and sort descending by fused score let mut results: Vec<(EntityId, f64)> = scores .into_iter() .map(|(id, score)| (EntityId::new(id), score)) .collect(); results.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap_or(std::cmp::Ordering::Equal)); results } } ``` ### Registration in query/mod.rs Add to `tidal/src/query/mod.rs`: ```rust pub mod fusion; pub use fusion::HybridFusion; ``` ## Acceptance Criteria - [ ] `tidal/src/query/fusion.rs` created with `HybridFusion` struct - [ ] `HybridFusion { k: u32 }` with `Default` trait (k=60) - [ ] `HybridFusion::new()` returns default k=60 instance - [ ] `HybridFusion::with_k(k)` returns configured instance - [ ] `fuse(bm25: &[(EntityId, f32)], ann: &[(EntityId, f32)]) -> Vec<(EntityId, f64)>` implements RRF - [ ] BM25 rank: position 0 → rank 1, position N-1 → rank N - [ ] ANN rank: position 0 → rank 1, position N-1 → rank N - [ ] Documents in only one list: single-list contribution only (missing term = 0) - [ ] Results sorted descending by fused score - [ ] `pub mod fusion;` added to `tidal/src/query/mod.rs` - [ ] `pub use fusion::HybridFusion;` exported from `tidal/src/query/mod.rs` - [ ] Unit tests: `fuse_both_lists`, `fuse_bm25_only`, `fuse_ann_only`, `fuse_empty_lists`, `fuse_single_doc_both_lists`, `fuse_k_affects_scores` - [ ] Property test: for any pair of ranked lists, fused output is union of both document sets; score for doc in both lists > score for doc in only one list - [ ] `cargo check`, `cargo fmt`, `cargo clippy -D warnings` all pass ## Test Strategy ```rust #[test] fn fuse_both_lists() { // BM25: [A=1.0, B=0.8, C=0.5] (descending) // ANN: [B=0.1, A=0.2, D=0.5] (ascending distance) // Expected: B ranks highest (rank 1 in ANN, rank 2 in BM25) // A is rank 1 in BM25, rank 2 in ANN let bm25 = vec![ (EntityId::new(1), 1.0f32), // A, rank 1 (EntityId::new(2), 0.8f32), // B, rank 2 (EntityId::new(3), 0.5f32), // C, rank 3 (BM25 only) ]; let ann = vec![ (EntityId::new(2), 0.1f32), // B, rank 1 (best ANN match) (EntityId::new(1), 0.2f32), // A, rank 2 (EntityId::new(4), 0.5f32), // D, rank 3 (ANN only) ]; let fusion = HybridFusion::new(); // k=60 let results = fusion.fuse(&bm25, &ann); // B: 1/(60+2) + 1/(60+1) = 1/62 + 1/61 ≈ 0.01613 + 0.01639 = 0.03252 // A: 1/(60+1) + 1/(60+2) = 1/61 + 1/62 ≈ 0.03252 (same as B — tie!) // Actually: B is rank 2 in BM25, rank 1 in ANN; A is rank 1 in BM25, rank 2 in ANN // B: 1/(60+2) + 1/(60+1) = 0.03252 // A: 1/(60+1) + 1/(60+2) = 0.03252 (same score — tie) // C: 1/(60+3) + 0 = 1/63 ≈ 0.01587 // D: 0 + 1/(60+3) = 1/63 ≈ 0.01587 // Verify all 4 documents are in the output let ids: Vec = results.iter().map(|(id, _)| id.as_u64()).collect(); assert!(ids.contains(&1)); // A assert!(ids.contains(&2)); // B assert!(ids.contains(&3)); // C assert!(ids.contains(&4)); // D // C and D (single-list) have lower scores than A and B (both lists) let c_score = results.iter().find(|(id, _)| id.as_u64() == 3).unwrap().1; let d_score = results.iter().find(|(id, _)| id.as_u64() == 4).unwrap().1; let a_score = results.iter().find(|(id, _)| id.as_u64() == 1).unwrap().1; let b_score = results.iter().find(|(id, _)| id.as_u64() == 2).unwrap().1; assert!(a_score > c_score); assert!(b_score > d_score); // Scores are sorted descending let scores: Vec = results.iter().map(|(_, s)| *s).collect(); for i in 1..scores.len() { assert!(scores[i-1] >= scores[i]); } } #[test] fn fuse_bm25_only() { let bm25 = vec![(EntityId::new(1), 1.0f32), (EntityId::new(2), 0.5f32)]; let ann = vec![]; let fusion = HybridFusion::new(); let results = fusion.fuse(&bm25, &ann); assert_eq!(results.len(), 2); // rank 1 doc scores higher than rank 2 let score_1 = results.iter().find(|(id, _)| id.as_u64() == 1).unwrap().1; let score_2 = results.iter().find(|(id, _)| id.as_u64() == 2).unwrap().1; assert!(score_1 > score_2); // Score = 1/(60+1) for rank 1 = 1/61 let expected = 1.0 / (60.0 + 1.0); assert!((score_1 - expected).abs() < 1e-9); } #[test] fn fuse_k_affects_scores() { let bm25 = vec![(EntityId::new(1), 1.0f32)]; let ann = vec![(EntityId::new(1), 0.1f32)]; let fusion_60 = HybridFusion::new(); // k=60 let fusion_30 = HybridFusion::with_k(30); // k=30 let results_60 = fusion_60.fuse(&bm25, &ann); let results_30 = fusion_30.fuse(&bm25, &ann); // k=30: 1/(30+1) + 1/(30+1) = 2/31 ≈ 0.0645 // k=60: 1/(60+1) + 1/(60+1) = 2/61 ≈ 0.0328 // k=30 produces higher scores assert!(results_30[0].1 > results_60[0].1); } ``` ### Property Test ```rust use proptest::prelude::*; proptest! { #[test] fn rrf_output_is_union_of_inputs( bm25_ids in prop::collection::vec(1u64..=100, 0..20), ann_ids in prop::collection::vec(1u64..=100, 0..20), ) { let bm25: Vec<(EntityId, f32)> = bm25_ids.iter().enumerate() .map(|(i, &id)| (EntityId::new(id), (100 - i) as f32)) .collect(); let ann: Vec<(EntityId, f32)> = ann_ids.iter().enumerate() .map(|(i, &id)| (EntityId::new(id), i as f32 * 0.01)) .collect(); let fusion = HybridFusion::new(); let results = fusion.fuse(&bm25, &ann); // Output must contain the union of all input IDs let all_ids: std::collections::HashSet = bm25_ids.iter() .chain(ann_ids.iter()) .copied() .collect(); let result_ids: std::collections::HashSet = results.iter() .map(|(id, _)| id.as_u64()) .collect(); prop_assert_eq!(&all_ids, &result_ids); // Results must be sorted descending for i in 1..results.len() { prop_assert!(results[i-1].1 >= results[i].1); } } } ```