Task 01: RRF Implementation
Delivers
HybridFusion struct implementing Reciprocal Rank Fusion. fuse() merges a BM25 ranked list and an ANN ranked list into a single Vec<(EntityId, f64)> sorted by descending fused score. Documents appearing in only one list contribute only their single-list term. k = 60 by default, configurable.
Complexity: S
Dependencies
- m5p1 COMPLETE: EntityId type, tidal/src/query/ module structure
- tidal/src/query/mod.rs exists for adding the fusion submodule
Technical Design
RRF Formula
From docs/research/tantivy.md:
RRFscore(d) = 1/(60 + rank_bm25(d)) + 1/(60 + rank_ann(d))
Where:
- rank_bm25(d) is the 1-based rank of document d in the BM25 list (rank 1 = highest BM25 score)
- rank_ann(d) is the 1-based rank of document d in the ANN list (rank 1 = lowest L2 distance)
- Documents absent from a list contribute zero for that term
Fused rankings are reported to be largely insensitive to the constant k across the 30–100 range. We implement k as configurable, defaulting to 60.
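As a quick check of the formula, the sketch below computes the per-list term and the fused score for a document at given ranks. The names rrf_term and rrf_score are illustrative only and are not part of the planned API.

```rust
/// Contribution of a single ranked list to the RRF score.
/// `rank` is 1-based; `None` means the document is absent from that list.
fn rrf_term(k: u32, rank: Option<usize>) -> f64 {
    match rank {
        Some(r) => 1.0 / (f64::from(k) + r as f64),
        None => 0.0,
    }
}

/// Fused score for a document given its (optional) ranks in the two lists.
fn rrf_score(k: u32, rank_bm25: Option<usize>, rank_ann: Option<usize>) -> f64 {
    rrf_term(k, rank_bm25) + rrf_term(k, rank_ann)
}
```

With k = 60, a document at BM25 rank 1 and ANN rank 2 scores 1/61 + 1/62 ≈ 0.03252, while a document present only at rank 3 of one list scores 1/63 ≈ 0.01587.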
Input Conventions
- BM25 results: &[(EntityId, f32)] where the f32 is the BM25 score. The caller must pre-sort these descending by score. The fuse() function uses position-as-rank (position 0 = rank 1).
- ANN results: &[(EntityId, f32)] where the f32 is the L2-squared distance. The caller must pre-sort these ascending by distance. The fuse() function uses position-as-rank (position 0 = rank 1).
This design keeps fuse() a pure function that never sorts its inputs; the only sort it performs is on the fused output. The caller controls input ordering.
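A caller that does not already hold best-first lists could prepare them along these lines. This is a minimal sketch: the helper names are hypothetical, and plain u64 ids stand in for EntityId so the snippet compiles on its own.

```rust
/// Sort BM25 hits best-first: descending by score.
/// (Hypothetical helper; u64 stands in for EntityId.)
fn sort_bm25_desc(hits: &mut [(u64, f32)]) {
    // total_cmp gives a total order over f32, so NaN cannot panic the sort.
    hits.sort_by(|a, b| b.1.total_cmp(&a.1));
}

/// Sort ANN hits best-first: ascending by L2-squared distance.
/// (Hypothetical helper; u64 stands in for EntityId.)
fn sort_ann_asc(hits: &mut [(u64, f32)]) {
    hits.sort_by(|a, b| a.1.total_cmp(&b.1));
}
```

After these calls, index 0 of each slice is rank 1, which is the convention fuse() expects.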
HybridFusion
// tidal/src/query/fusion.rs
use std::collections::HashMap;

use crate::schema::EntityId;

/// Reciprocal Rank Fusion (Cormack et al. SIGIR 2009).
///
/// Merges ranked lists from heterogeneous retrieval systems (BM25 text scores,
/// ANN vector distances) into a single ranked list using only rank positions.
///
/// Fused rankings are reported to be largely insensitive to k across
/// [30, 100]; see the research literature. A configurable k is provided
/// for experimentation.
///
/// # Reference
///
/// Cormack, Clarke, Büttcher. "Reciprocal Rank Fusion Outperforms Condorcet
/// and Individual Rank Learning Methods." SIGIR 2009.
#[derive(Debug, Clone)]
pub struct HybridFusion {
    /// Rank offset constant. Default 60 per the original paper.
    pub k: u32,
}

impl Default for HybridFusion {
    fn default() -> Self {
        Self { k: 60 }
    }
}

impl HybridFusion {
    /// Create a fusion instance with the default k=60.
    #[must_use]
    pub fn new() -> Self {
        Self::default()
    }

    /// Create a fusion instance with a custom k.
    #[must_use]
    pub fn with_k(k: u32) -> Self {
        Self { k }
    }

    /// Fuse two ranked lists via Reciprocal Rank Fusion.
    ///
    /// Both lists must be pre-sorted in "best first" order by the caller:
    /// - `bm25_results`: sorted descending by BM25 score (index 0 = rank 1)
    /// - `ann_results`: sorted ascending by L2 distance (index 0 = rank 1)
    ///
    /// The f32 value in each tuple is used only to establish the ordering by
    /// the caller; `fuse()` itself uses only position (0-indexed) as the rank.
    ///
    /// Documents appearing in only one list contribute only their single-list
    /// term. The missing-list contribution is zero.
    ///
    /// Returns results sorted by descending fused RRF score. Ties are broken
    /// by ascending entity id so the output order is reproducible.
    #[must_use]
    pub fn fuse(
        &self,
        bm25_results: &[(EntityId, f32)],
        ann_results: &[(EntityId, f32)],
    ) -> Vec<(EntityId, f64)> {
        let k = f64::from(self.k);

        // Map entity_id -> accumulated RRF score
        let capacity = bm25_results.len() + ann_results.len();
        let mut scores: HashMap<u64, f64> = HashMap::with_capacity(capacity);

        // BM25 contribution: rank is 1-based (position 0 → rank 1)
        for (rank_0based, (entity_id, _score)) in bm25_results.iter().enumerate() {
            let rank = (rank_0based + 1) as f64;
            *scores.entry(entity_id.as_u64()).or_insert(0.0) += 1.0 / (k + rank);
        }

        // ANN contribution: rank is 1-based (position 0 → rank 1)
        for (rank_0based, (entity_id, _distance)) in ann_results.iter().enumerate() {
            let rank = (rank_0based + 1) as f64;
            *scores.entry(entity_id.as_u64()).or_insert(0.0) += 1.0 / (k + rank);
        }

        // Collect and sort descending by fused score. HashMap iteration order
        // is unspecified, so equal scores fall back to ascending id.
        let mut results: Vec<(EntityId, f64)> = scores
            .into_iter()
            .map(|(id, score)| (EntityId::new(id), score))
            .collect();
        results.sort_by(|a, b| {
            b.1.total_cmp(&a.1)
                .then_with(|| a.0.as_u64().cmp(&b.0.as_u64()))
        });
        results
    }
}
Registration in query/mod.rs
Add to tidal/src/query/mod.rs:
pub mod fusion;
pub use fusion::HybridFusion;
Acceptance Criteria
- tidal/src/query/fusion.rs created with HybridFusion struct
- HybridFusion { k: u32 } with Default trait (k=60)
- HybridFusion::new() returns default k=60 instance
- HybridFusion::with_k(k) returns configured instance
- fuse(bm25: &[(EntityId, f32)], ann: &[(EntityId, f32)]) -> Vec<(EntityId, f64)> implements RRF
- BM25 rank: position 0 → rank 1, position N-1 → rank N
- ANN rank: position 0 → rank 1, position N-1 → rank N
- Documents in only one list: single-list contribution only (missing term = 0)
- Results sorted descending by fused score
- pub mod fusion; added to tidal/src/query/mod.rs
- pub use fusion::HybridFusion; exported from tidal/src/query/mod.rs
- Unit tests: fuse_both_lists, fuse_bm25_only, fuse_ann_only, fuse_empty_lists, fuse_single_doc_both_lists, fuse_k_affects_scores
- Property test: for any pair of ranked lists, fused output is the union of both document sets; a doc present in both lists scores strictly higher than its contribution from either list alone
- cargo check, cargo fmt, cargo clippy -- -D warnings all pass
Test Strategy
#[test]
fn fuse_both_lists() {
    // BM25: [A=1.0, B=0.8, C=0.5] (descending score)
    // ANN:  [B=0.1, A=0.2, D=0.5] (ascending distance)
    // A is rank 1 in BM25 and rank 2 in ANN; B is rank 2 in BM25 and rank 1
    // in ANN, so A and B tie. C and D each appear in one list only.
    let bm25 = vec![
        (EntityId::new(1), 1.0f32), // A, rank 1
        (EntityId::new(2), 0.8f32), // B, rank 2
        (EntityId::new(3), 0.5f32), // C, rank 3 (BM25 only)
    ];
    let ann = vec![
        (EntityId::new(2), 0.1f32), // B, rank 1 (best ANN match)
        (EntityId::new(1), 0.2f32), // A, rank 2
        (EntityId::new(4), 0.5f32), // D, rank 3 (ANN only)
    ];
    let fusion = HybridFusion::new(); // k=60
    let results = fusion.fuse(&bm25, &ann);

    // A: 1/(60+1) + 1/(60+2) = 1/61 + 1/62 ≈ 0.01639 + 0.01613 = 0.03252
    // B: 1/(60+2) + 1/(60+1) = 1/62 + 1/61 ≈ 0.03252 (ties with A)
    // C: 1/(60+3) + 0 = 1/63 ≈ 0.01587
    // D: 0 + 1/(60+3) = 1/63 ≈ 0.01587

    // Verify all 4 documents are in the output
    let ids: Vec<u64> = results.iter().map(|(id, _)| id.as_u64()).collect();
    assert!(ids.contains(&1)); // A
    assert!(ids.contains(&2)); // B
    assert!(ids.contains(&3)); // C
    assert!(ids.contains(&4)); // D

    // C and D (single-list) have lower scores than A and B (both lists)
    let c_score = results.iter().find(|(id, _)| id.as_u64() == 3).unwrap().1;
    let d_score = results.iter().find(|(id, _)| id.as_u64() == 4).unwrap().1;
    let a_score = results.iter().find(|(id, _)| id.as_u64() == 1).unwrap().1;
    let b_score = results.iter().find(|(id, _)| id.as_u64() == 2).unwrap().1;
    assert!(a_score > c_score);
    assert!(b_score > d_score);

    // A and B hold mirrored ranks, so their fused scores are equal
    assert!((a_score - b_score).abs() < 1e-12);

    // Scores are sorted descending
    let scores: Vec<f64> = results.iter().map(|(_, s)| *s).collect();
    for i in 1..scores.len() {
        assert!(scores[i - 1] >= scores[i]);
    }
}

#[test]
fn fuse_bm25_only() {
    let bm25 = vec![(EntityId::new(1), 1.0f32), (EntityId::new(2), 0.5f32)];
    let ann: Vec<(EntityId, f32)> = vec![];
    let fusion = HybridFusion::new();
    let results = fusion.fuse(&bm25, &ann);
    assert_eq!(results.len(), 2);

    // rank 1 doc scores higher than rank 2
    let score_1 = results.iter().find(|(id, _)| id.as_u64() == 1).unwrap().1;
    let score_2 = results.iter().find(|(id, _)| id.as_u64() == 2).unwrap().1;
    assert!(score_1 > score_2);

    // Score = 1/(60+1) for rank 1 = 1/61
    let expected = 1.0 / (60.0 + 1.0);
    assert!((score_1 - expected).abs() < 1e-9);
}

#[test]
fn fuse_k_affects_scores() {
    let bm25 = vec![(EntityId::new(1), 1.0f32)];
    let ann = vec![(EntityId::new(1), 0.1f32)];
    let fusion_60 = HybridFusion::new(); // k=60
    let fusion_30 = HybridFusion::with_k(30); // k=30
    let results_60 = fusion_60.fuse(&bm25, &ann);
    let results_30 = fusion_30.fuse(&bm25, &ann);

    // k=30: 1/(30+1) + 1/(30+1) = 2/31 ≈ 0.0645
    // k=60: 1/(60+1) + 1/(60+1) = 2/61 ≈ 0.0328
    // k=30 produces higher scores
    assert!(results_30[0].1 > results_60[0].1);
}
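The acceptance criteria also name fuse_ann_only, fuse_empty_lists, and fuse_single_doc_both_lists, which are not spelled out above. The sketch below shows what they might assert. It is written against a simplified stand-in (fuse_ids, a hypothetical helper over plain u64 ids) so that it compiles on its own; the real tests would call HybridFusion::fuse with EntityId values, and the check_* function names are likewise illustrative.

```rust
use std::collections::HashMap;

/// Simplified stand-in for `HybridFusion::fuse` over plain u64 ids
/// (hypothetical helper; only list positions matter).
fn fuse_ids(k: u32, bm25: &[u64], ann: &[u64]) -> Vec<(u64, f64)> {
    let k = f64::from(k);
    let mut scores: HashMap<u64, f64> = HashMap::new();
    for list in [bm25, ann] {
        for (pos, id) in list.iter().enumerate() {
            // Position 0 is rank 1.
            *scores.entry(*id).or_insert(0.0) += 1.0 / (k + (pos + 1) as f64);
        }
    }
    let mut out: Vec<(u64, f64)> = scores.into_iter().collect();
    // Descending by score; ties broken by id so the order is reproducible.
    out.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    out
}

/// ANN-only input: every document receives just its ANN term.
fn check_ann_only() {
    let out = fuse_ids(60, &[], &[7, 9]);
    assert_eq!(out.len(), 2);
    assert_eq!(out[0].0, 7);
    assert!((out[0].1 - 1.0 / 61.0).abs() < 1e-12);
}

/// Two empty inputs produce an empty output.
fn check_empty_lists() {
    assert!(fuse_ids(60, &[], &[]).is_empty());
}

/// One document at rank 1 in both lists scores 2/(k+1).
fn check_single_doc_both_lists() {
    let out = fuse_ids(60, &[5], &[5]);
    assert_eq!(out.len(), 1);
    assert!((out[0].1 - 2.0 / 61.0).abs() < 1e-12);
}
```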
Property Test
use proptest::prelude::*;

proptest! {
    #[test]
    fn rrf_output_is_union_of_inputs(
        bm25_ids in prop::collection::vec(1u64..=100, 0..20),
        ann_ids in prop::collection::vec(1u64..=100, 0..20),
    ) {
        let bm25: Vec<(EntityId, f32)> = bm25_ids.iter().enumerate()
            .map(|(i, &id)| (EntityId::new(id), (100 - i) as f32))
            .collect();
        let ann: Vec<(EntityId, f32)> = ann_ids.iter().enumerate()
            .map(|(i, &id)| (EntityId::new(id), i as f32 * 0.01))
            .collect();
        let fusion = HybridFusion::new();
        let results = fusion.fuse(&bm25, &ann);

        // Output must contain the union of all input IDs
        let all_ids: std::collections::HashSet<u64> = bm25_ids.iter()
            .chain(ann_ids.iter())
            .copied()
            .collect();
        let result_ids: std::collections::HashSet<u64> = results.iter()
            .map(|(id, _)| id.as_u64())
            .collect();
        prop_assert_eq!(&all_ids, &result_ids);

        // Results must be sorted descending
        for i in 1..results.len() {
            prop_assert!(results[i - 1].1 >= results[i].1);
        }
    }
}