tidaldb/docs/planning/milestone-5/phase-2/task-01-rrf-implementation.md
jordan 192c473f55 feat: complete Milestone 5 — full-text search, RRF fusion, and creator search
- M5p1: BM25 text indexing via Tantivy with background syncer (0.26ms @ 10K docs)
- M5p2: RRF fusion layer combining BM25 + ANN scores (46µs @ 1K candidates)
- M5p3: unified Search query API (8-stage pipeline, BM25 + vector + ranking)
- M5p4: creator text + vector indexing and creator search executor (< 20ms @ 200 creators)
- Refactor db/mod.rs into focused sub-modules (creators, items, sessions, signals, etc.)
- Decompose monolithic files into directory modules (query/executor, ranking/diversity, etc.)
- Split brute.rs → brute/mod.rs + brute/tests.rs; extract search executor helpers
- Add benches: fusion, search, session, text_index
- Add M5 UAT test suites (m5_uat, m5_search, m5p4_creator_search, text_index)
- Update blog posts, roadmap, content strategy, and M5 planning docs
- Add tmp/ and .claude/worktrees/ to .gitignore

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-21 23:53:16 -07:00

10 KiB
Raw Permalink Blame History

Task 01: RRF Implementation

Delivers

HybridFusion struct implementing Reciprocal Rank Fusion. fuse() merges a BM25 ranked list and an ANN ranked list into a single Vec<(EntityId, f64)> sorted by descending fused score. Documents appearing in only one list contribute only their single-list term. k = 60 by default, configurable.

Complexity: S

Dependencies

  • m5p1 COMPLETE: EntityId type, tidal/src/query/ module structure
  • tidal/src/query/mod.rs exists for adding the fusion submodule

Technical Design

RRF Formula

From docs/research/tantivy.md:

RRFscore(d) = 1/(60 + rank_bm25(d)) + 1/(60 + rank_ann(d))

Where:

  • rank_bm25(d) is the 1-based rank of document d in the BM25 list (rank 1 = highest BM25 score)
  • rank_ann(d) is the 1-based rank of document d in the ANN list (rank 1 = lowest L2 distance)
  • Documents absent from a list contribute zero for that term

The k=60 constant is insensitive across 30100 range. We implement it as configurable, defaulting to 60.

Input Conventions

  • BM25 results: &[(EntityId, f32)] where the f32 is the BM25 score. The caller must pre-sort these descending by score. The fuse() function uses position-as-rank (position 0 = rank 1).
  • ANN results: &[(EntityId, f32)] where the f32 is the L2-squared distance. The caller must pre-sort these ascending by distance. The fuse() function uses position-as-rank (position 0 = rank 1).

This design keeps fuse() a pure function with no sorting overhead. The caller controls ordering.

HybridFusion

// tidal/src/query/fusion.rs

use std::collections::HashMap;
use crate::schema::EntityId;

/// Reciprocal Rank Fusion (Cormack et al. SIGIR 2009).
///
/// Merges ranked lists from heterogeneous retrieval systems (BM25 text scores,
/// ANN vector distances) into a single ranked list using only rank positions.
///
/// The k=60 constant is insensitive across [30, 100] — see the research
/// literature. A configurable k is provided for experimentation.
///
/// # Reference
///
/// Cormack, Clarke, Büttcher. "Reciprocal Rank Fusion Outperforms Condorcet
/// and Individual Rank Learning Methods." SIGIR 2009.
#[derive(Debug, Clone)]
pub struct HybridFusion {
    /// Rank offset constant. Default 60 per the original paper.
    pub k: u32,
}

impl Default for HybridFusion {
    fn default() -> Self {
        Self { k: 60 }
    }
}

impl HybridFusion {
    /// Create a fusion instance with the default k=60.
    #[must_use]
    pub fn new() -> Self {
        Self::default()
    }

    /// Create a fusion instance with a custom k.
    #[must_use]
    pub fn with_k(k: u32) -> Self {
        Self { k }
    }

    /// Fuse two ranked lists via Reciprocal Rank Fusion.
    ///
    /// Both lists must be pre-sorted in "best first" order by the caller:
    /// - `bm25_results`: sorted descending by BM25 score (index 0 = rank 1)
    /// - `ann_results`: sorted ascending by L2 distance (index 0 = rank 1)
    ///
    /// The f32 value in each tuple is used only to establish the ordering by
    /// the caller; `fuse()` itself uses only position (0-indexed) as the rank.
    ///
    /// Documents appearing in only one list contribute only their single-list
    /// term. The missing-list contribution is zero.
    ///
    /// Returns results sorted by descending fused RRF score.
    #[must_use]
    pub fn fuse(
        &self,
        bm25_results: &[(EntityId, f32)],
        ann_results: &[(EntityId, f32)],
    ) -> Vec<(EntityId, f64)> {
        let k = f64::from(self.k);

        // Map entity_id -> accumulated RRF score
        let capacity = bm25_results.len() + ann_results.len();
        let mut scores: HashMap<u64, f64> = HashMap::with_capacity(capacity);

        // BM25 contribution: rank is 1-based (position 0 → rank 1)
        for (rank_0based, (entity_id, _score)) in bm25_results.iter().enumerate() {
            let rank = (rank_0based + 1) as f64;
            *scores.entry(entity_id.as_u64()).or_insert(0.0) += 1.0 / (k + rank);
        }

        // ANN contribution: rank is 1-based (position 0 → rank 1)
        for (rank_0based, (entity_id, _distance)) in ann_results.iter().enumerate() {
            let rank = (rank_0based + 1) as f64;
            *scores.entry(entity_id.as_u64()).or_insert(0.0) += 1.0 / (k + rank);
        }

        // Collect and sort descending by fused score
        let mut results: Vec<(EntityId, f64)> = scores
            .into_iter()
            .map(|(id, score)| (EntityId::new(id), score))
            .collect();
        results.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap_or(std::cmp::Ordering::Equal));

        results
    }
}

Registration in query/mod.rs

Add to tidal/src/query/mod.rs:

pub mod fusion;
pub use fusion::HybridFusion;

Acceptance Criteria

  • tidal/src/query/fusion.rs created with HybridFusion struct
  • HybridFusion { k: u32 } with Default trait (k=60)
  • HybridFusion::new() returns default k=60 instance
  • HybridFusion::with_k(k) returns configured instance
  • fuse(bm25: &[(EntityId, f32)], ann: &[(EntityId, f32)]) -> Vec<(EntityId, f64)> implements RRF
  • BM25 rank: position 0 → rank 1, position N-1 → rank N
  • ANN rank: position 0 → rank 1, position N-1 → rank N
  • Documents in only one list: single-list contribution only (missing term = 0)
  • Results sorted descending by fused score
  • pub mod fusion; added to tidal/src/query/mod.rs
  • pub use fusion::HybridFusion; exported from tidal/src/query/mod.rs
  • Unit tests: fuse_both_lists, fuse_bm25_only, fuse_ann_only, fuse_empty_lists, fuse_single_doc_both_lists, fuse_k_affects_scores
  • Property test: for any pair of ranked lists, fused output is union of both document sets; score for doc in both lists > score for doc in only one list
  • cargo check, cargo fmt, cargo clippy -D warnings all pass

Test Strategy

#[test]
fn fuse_both_lists() {
    // BM25: [A=1.0, B=0.8, C=0.5] (descending)
    // ANN:  [B=0.1, A=0.2, D=0.5] (ascending distance)
    // Expected: B ranks highest (rank 1 in ANN, rank 2 in BM25)
    // A is rank 1 in BM25, rank 2 in ANN
    let bm25 = vec![
        (EntityId::new(1), 1.0f32),  // A, rank 1
        (EntityId::new(2), 0.8f32),  // B, rank 2
        (EntityId::new(3), 0.5f32),  // C, rank 3 (BM25 only)
    ];
    let ann = vec![
        (EntityId::new(2), 0.1f32),  // B, rank 1 (best ANN match)
        (EntityId::new(1), 0.2f32),  // A, rank 2
        (EntityId::new(4), 0.5f32),  // D, rank 3 (ANN only)
    ];

    let fusion = HybridFusion::new();  // k=60
    let results = fusion.fuse(&bm25, &ann);

    // B: 1/(60+2) + 1/(60+1) = 1/62 + 1/61 ≈ 0.01613 + 0.01639 = 0.03252
    // A: 1/(60+1) + 1/(60+2) = 1/61 + 1/62 ≈ 0.03252 (same as B — tie!)
    // Actually: B is rank 2 in BM25, rank 1 in ANN; A is rank 1 in BM25, rank 2 in ANN
    // B: 1/(60+2) + 1/(60+1) = 0.03252
    // A: 1/(60+1) + 1/(60+2) = 0.03252 (same score — tie)
    // C: 1/(60+3) + 0 = 1/63 ≈ 0.01587
    // D: 0 + 1/(60+3) = 1/63 ≈ 0.01587

    // Verify all 4 documents are in the output
    let ids: Vec<u64> = results.iter().map(|(id, _)| id.as_u64()).collect();
    assert!(ids.contains(&1)); // A
    assert!(ids.contains(&2)); // B
    assert!(ids.contains(&3)); // C
    assert!(ids.contains(&4)); // D

    // C and D (single-list) have lower scores than A and B (both lists)
    let c_score = results.iter().find(|(id, _)| id.as_u64() == 3).unwrap().1;
    let d_score = results.iter().find(|(id, _)| id.as_u64() == 4).unwrap().1;
    let a_score = results.iter().find(|(id, _)| id.as_u64() == 1).unwrap().1;
    let b_score = results.iter().find(|(id, _)| id.as_u64() == 2).unwrap().1;
    assert!(a_score > c_score);
    assert!(b_score > d_score);

    // Scores are sorted descending
    let scores: Vec<f64> = results.iter().map(|(_, s)| *s).collect();
    for i in 1..scores.len() {
        assert!(scores[i-1] >= scores[i]);
    }
}

#[test]
fn fuse_bm25_only() {
    let bm25 = vec![(EntityId::new(1), 1.0f32), (EntityId::new(2), 0.5f32)];
    let ann = vec![];

    let fusion = HybridFusion::new();
    let results = fusion.fuse(&bm25, &ann);

    assert_eq!(results.len(), 2);
    // rank 1 doc scores higher than rank 2
    let score_1 = results.iter().find(|(id, _)| id.as_u64() == 1).unwrap().1;
    let score_2 = results.iter().find(|(id, _)| id.as_u64() == 2).unwrap().1;
    assert!(score_1 > score_2);
    // Score = 1/(60+1) for rank 1 = 1/61
    let expected = 1.0 / (60.0 + 1.0);
    assert!((score_1 - expected).abs() < 1e-9);
}

#[test]
fn fuse_k_affects_scores() {
    let bm25 = vec![(EntityId::new(1), 1.0f32)];
    let ann = vec![(EntityId::new(1), 0.1f32)];

    let fusion_60 = HybridFusion::new();  // k=60
    let fusion_30 = HybridFusion::with_k(30);  // k=30

    let results_60 = fusion_60.fuse(&bm25, &ann);
    let results_30 = fusion_30.fuse(&bm25, &ann);

    // k=30: 1/(30+1) + 1/(30+1) = 2/31 ≈ 0.0645
    // k=60: 1/(60+1) + 1/(60+1) = 2/61 ≈ 0.0328
    // k=30 produces higher scores
    assert!(results_30[0].1 > results_60[0].1);
}

Property Test

use proptest::prelude::*;

proptest! {
    #[test]
    fn rrf_output_is_union_of_inputs(
        bm25_ids in prop::collection::vec(1u64..=100, 0..20),
        ann_ids  in prop::collection::vec(1u64..=100, 0..20),
    ) {
        let bm25: Vec<(EntityId, f32)> = bm25_ids.iter().enumerate()
            .map(|(i, &id)| (EntityId::new(id), (100 - i) as f32))
            .collect();
        let ann: Vec<(EntityId, f32)> = ann_ids.iter().enumerate()
            .map(|(i, &id)| (EntityId::new(id), i as f32 * 0.01))
            .collect();

        let fusion = HybridFusion::new();
        let results = fusion.fuse(&bm25, &ann);

        // Output must contain the union of all input IDs
        let all_ids: std::collections::HashSet<u64> = bm25_ids.iter()
            .chain(ann_ids.iter())
            .copied()
            .collect();
        let result_ids: std::collections::HashSet<u64> = results.iter()
            .map(|(id, _)| id.as_u64())
            .collect();
        prop_assert_eq!(&all_ids, &result_ids);

        // Results must be sorted descending
        for i in 1..results.len() {
            prop_assert!(results[i-1].1 >= results[i].1);
        }
    }
}