tidaldb/docs/planning/milestone-5/phase-2/task-01-rrf-implementation.md

# Task 01: RRF Implementation

## Delivers

`HybridFusion` struct implementing Reciprocal Rank Fusion. `fuse()` merges a BM25 ranked list and an ANN ranked list into a single `Vec<(EntityId, f64)>` sorted by descending fused score. Documents appearing in only one list contribute only their single-list term. `k = 60` by default, configurable.

## Complexity: S

## Dependencies

- m5p1 COMPLETE: `EntityId` type, `tidal/src/query/` module structure
- `tidal/src/query/mod.rs` exists for adding the `fusion` submodule

## Technical Design

### RRF Formula

From `docs/research/tantivy.md`:

> RRFscore(d) = 1/(60 + rank_bm25(d)) + 1/(60 + rank_ann(d))

Where:
- `rank_bm25(d)` is the 1-based rank of document `d` in the BM25 list (rank 1 = highest BM25 score)
- `rank_ann(d)` is the 1-based rank of document `d` in the ANN list (rank 1 = lowest L2 distance)
- Documents absent from a list contribute zero for that term

The k=60 constant is insensitive across 30–100 range. We implement it as configurable, defaulting to 60.

### Input Conventions

- **BM25 results**: `&[(EntityId, f32)]` where the f32 is the BM25 score. **The caller must pre-sort these descending by score.** The `fuse()` function uses position-as-rank (position 0 = rank 1).
- **ANN results**: `&[(EntityId, f32)]` where the f32 is the L2-squared distance. **The caller must pre-sort these ascending by distance.** The `fuse()` function uses position-as-rank (position 0 = rank 1).

This design keeps `fuse()` a pure function with no sorting overhead. The caller controls ordering.

### HybridFusion

```rust
// tidal/src/query/fusion.rs

use std::collections::HashMap;
use crate::schema::EntityId;

/// Reciprocal Rank Fusion (Cormack et al. SIGIR 2009).
///
/// Merges ranked lists from heterogeneous retrieval systems (BM25 text scores,
/// ANN vector distances) into a single ranked list using only rank positions.
///
/// The k=60 constant is insensitive across [30, 100] — see the research
/// literature. A configurable k is provided for experimentation.
///
/// # Reference
///
/// Cormack, Clarke, Büttcher. "Reciprocal Rank Fusion Outperforms Condorcet
/// and Individual Rank Learning Methods." SIGIR 2009.
#[derive(Debug, Clone)]
pub struct HybridFusion {
    /// Rank offset constant. Default 60 per the original paper.
    pub k: u32,
}

impl Default for HybridFusion {
    fn default() -> Self {
        Self { k: 60 }
    }
}

impl HybridFusion {
    /// Create a fusion instance with the default k=60.
    #[must_use]
    pub fn new() -> Self {
        Self::default()
    }

    /// Create a fusion instance with a custom k.
    #[must_use]
    pub fn with_k(k: u32) -> Self {
        Self { k }
    }

    /// Fuse two ranked lists via Reciprocal Rank Fusion.
    ///
    /// Both lists must be pre-sorted in "best first" order by the caller:
    /// - `bm25_results`: sorted descending by BM25 score (index 0 = rank 1)
    /// - `ann_results`: sorted ascending by L2 distance (index 0 = rank 1)
    ///
    /// The f32 value in each tuple is used only to establish the ordering by
    /// the caller; `fuse()` itself uses only position (0-indexed) as the rank.
    ///
    /// Documents appearing in only one list contribute only their single-list
    /// term. The missing-list contribution is zero.
    ///
    /// Returns results sorted by descending fused RRF score.
    #[must_use]
    pub fn fuse(
        &self,
        bm25_results: &[(EntityId, f32)],
        ann_results: &[(EntityId, f32)],
    ) -> Vec<(EntityId, f64)> {
        let k = f64::from(self.k);

        // Map entity_id -> accumulated RRF score
        let capacity = bm25_results.len() + ann_results.len();
        let mut scores: HashMap<u64, f64> = HashMap::with_capacity(capacity);

        // BM25 contribution: rank is 1-based (position 0 → rank 1)
        for (rank_0based, (entity_id, _score)) in bm25_results.iter().enumerate() {
            let rank = (rank_0based + 1) as f64;
            *scores.entry(entity_id.as_u64()).or_insert(0.0) += 1.0 / (k + rank);
        }

        // ANN contribution: rank is 1-based (position 0 → rank 1)
        for (rank_0based, (entity_id, _distance)) in ann_results.iter().enumerate() {
            let rank = (rank_0based + 1) as f64;
            *scores.entry(entity_id.as_u64()).or_insert(0.0) += 1.0 / (k + rank);
        }

        // Collect and sort descending by fused score
        let mut results: Vec<(EntityId, f64)> = scores
            .into_iter()
            .map(|(id, score)| (EntityId::new(id), score))
            .collect();
        results.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap_or(std::cmp::Ordering::Equal));

        results
    }
}
```

### Registration in query/mod.rs

Add to `tidal/src/query/mod.rs`:
```rust
pub mod fusion;
pub use fusion::HybridFusion;
```

## Acceptance Criteria

- [ ] `tidal/src/query/fusion.rs` created with `HybridFusion` struct
- [ ] `HybridFusion { k: u32 }` with `Default` trait (k=60)
- [ ] `HybridFusion::new()` returns default k=60 instance
- [ ] `HybridFusion::with_k(k)` returns configured instance
- [ ] `fuse(bm25: &[(EntityId, f32)], ann: &[(EntityId, f32)]) -> Vec<(EntityId, f64)>` implements RRF
- [ ] BM25 rank: position 0 → rank 1, position N-1 → rank N
- [ ] ANN rank: position 0 → rank 1, position N-1 → rank N
- [ ] Documents in only one list: single-list contribution only (missing term = 0)
- [ ] Results sorted descending by fused score
- [ ] `pub mod fusion;` added to `tidal/src/query/mod.rs`
- [ ] `pub use fusion::HybridFusion;` exported from `tidal/src/query/mod.rs`
- [ ] Unit tests: `fuse_both_lists`, `fuse_bm25_only`, `fuse_ann_only`, `fuse_empty_lists`, `fuse_single_doc_both_lists`, `fuse_k_affects_scores`
- [ ] Property test: for any pair of ranked lists, fused output is union of both document sets; score for doc in both lists > score for doc in only one list
- [ ] `cargo check`, `cargo fmt`, `cargo clippy -D warnings` all pass

## Test Strategy

```rust
#[test]
fn fuse_both_lists() {
    // BM25: [A=1.0, B=0.8, C=0.5] (descending)
    // ANN:  [B=0.1, A=0.2, D=0.5] (ascending distance)
    // Expected: B ranks highest (rank 1 in ANN, rank 2 in BM25)
    // A is rank 1 in BM25, rank 2 in ANN
    let bm25 = vec![
        (EntityId::new(1), 1.0f32),  // A, rank 1
        (EntityId::new(2), 0.8f32),  // B, rank 2
        (EntityId::new(3), 0.5f32),  // C, rank 3 (BM25 only)
    ];
    let ann = vec![
        (EntityId::new(2), 0.1f32),  // B, rank 1 (best ANN match)
        (EntityId::new(1), 0.2f32),  // A, rank 2
        (EntityId::new(4), 0.5f32),  // D, rank 3 (ANN only)
    ];

    let fusion = HybridFusion::new();  // k=60
    let results = fusion.fuse(&bm25, &ann);

    // B: 1/(60+2) + 1/(60+1) = 1/62 + 1/61 ≈ 0.01613 + 0.01639 = 0.03252
    // A: 1/(60+1) + 1/(60+2) = 1/61 + 1/62 ≈ 0.03252 (same as B — tie!)
    // Actually: B is rank 2 in BM25, rank 1 in ANN; A is rank 1 in BM25, rank 2 in ANN
    // B: 1/(60+2) + 1/(60+1) = 0.03252
    // A: 1/(60+1) + 1/(60+2) = 0.03252 (same score — tie)
    // C: 1/(60+3) + 0 = 1/63 ≈ 0.01587
    // D: 0 + 1/(60+3) = 1/63 ≈ 0.01587

    // Verify all 4 documents are in the output
    let ids: Vec<u64> = results.iter().map(|(id, _)| id.as_u64()).collect();
    assert!(ids.contains(&1)); // A
    assert!(ids.contains(&2)); // B
    assert!(ids.contains(&3)); // C
    assert!(ids.contains(&4)); // D

    // C and D (single-list) have lower scores than A and B (both lists)
    let c_score = results.iter().find(|(id, _)| id.as_u64() == 3).unwrap().1;
    let d_score = results.iter().find(|(id, _)| id.as_u64() == 4).unwrap().1;
    let a_score = results.iter().find(|(id, _)| id.as_u64() == 1).unwrap().1;
    let b_score = results.iter().find(|(id, _)| id.as_u64() == 2).unwrap().1;
    assert!(a_score > c_score);
    assert!(b_score > d_score);

    // Scores are sorted descending
    let scores: Vec<f64> = results.iter().map(|(_, s)| *s).collect();
    for i in 1..scores.len() {
        assert!(scores[i-1] >= scores[i]);
    }
}

#[test]
fn fuse_bm25_only() {
    let bm25 = vec![(EntityId::new(1), 1.0f32), (EntityId::new(2), 0.5f32)];
    let ann = vec![];

    let fusion = HybridFusion::new();
    let results = fusion.fuse(&bm25, &ann);

    assert_eq!(results.len(), 2);
    // rank 1 doc scores higher than rank 2
    let score_1 = results.iter().find(|(id, _)| id.as_u64() == 1).unwrap().1;
    let score_2 = results.iter().find(|(id, _)| id.as_u64() == 2).unwrap().1;
    assert!(score_1 > score_2);
    // Score = 1/(60+1) for rank 1 = 1/61
    let expected = 1.0 / (60.0 + 1.0);
    assert!((score_1 - expected).abs() < 1e-9);
}

#[test]
fn fuse_k_affects_scores() {
    let bm25 = vec![(EntityId::new(1), 1.0f32)];
    let ann = vec![(EntityId::new(1), 0.1f32)];

    let fusion_60 = HybridFusion::new();  // k=60
    let fusion_30 = HybridFusion::with_k(30);  // k=30

    let results_60 = fusion_60.fuse(&bm25, &ann);
    let results_30 = fusion_30.fuse(&bm25, &ann);

    // k=30: 1/(30+1) + 1/(30+1) = 2/31 ≈ 0.0645
    // k=60: 1/(60+1) + 1/(60+1) = 2/61 ≈ 0.0328
    // k=30 produces higher scores
    assert!(results_30[0].1 > results_60[0].1);
}
```

### Property Test

```rust
use proptest::prelude::*;

proptest! {
    #[test]
    fn rrf_output_is_union_of_inputs(
        bm25_ids in prop::collection::vec(1u64..=100, 0..20),
        ann_ids  in prop::collection::vec(1u64..=100, 0..20),
    ) {
        let bm25: Vec<(EntityId, f32)> = bm25_ids.iter().enumerate()
            .map(|(i, &id)| (EntityId::new(id), (100 - i) as f32))
            .collect();
        let ann: Vec<(EntityId, f32)> = ann_ids.iter().enumerate()
            .map(|(i, &id)| (EntityId::new(id), i as f32 * 0.01))
            .collect();

        let fusion = HybridFusion::new();
        let results = fusion.fuse(&bm25, &ann);

        // Output must contain the union of all input IDs
        let all_ids: std::collections::HashSet<u64> = bm25_ids.iter()
            .chain(ann_ids.iter())
            .copied()
            .collect();
        let result_ids: std::collections::HashSet<u64> = results.iter()
            .map(|(id, _)| id.as_u64())
            .collect();
        prop_assert_eq!(&all_ids, &result_ids);

        // Results must be sorted descending
        for i in 1..results.len() {
            prop_assert!(results[i-1].1 >= results[i].1);
        }
    }
}
```