- M5p1: BM25 text indexing via Tantivy with background syncer (0.26ms @ 10K docs) - M5p2: RRF fusion layer combining BM25 + ANN scores (46µs @ 1K candidates) - M5p3: unified Search query API (8-stage pipeline, BM25 + vector + ranking) - M5p4: creator text + vector indexing and creator search executor (< 20ms @ 200 creators) - Refactor db/mod.rs into focused sub-modules (creators, items, sessions, signals, etc.) - Decompose monolithic files into directory modules (query/executor, ranking/diversity, etc.) - Split brute.rs → brute/mod.rs + brute/tests.rs; extract search executor helpers - Add benches: fusion, search, session, text_index - Add M5 UAT test suites (m5_uat, m5_search, m5p4_creator_search, text_index) - Update blog posts, roadmap, content strategy, and M5 planning docs - Add tmp/ and .claude/worktrees/ to .gitignore Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
282 lines
10 KiB
Markdown
282 lines
10 KiB
Markdown
# Task 01: RRF Implementation
|
||
|
||
## Delivers
|
||
|
||
`HybridFusion` struct implementing Reciprocal Rank Fusion. `fuse()` merges a BM25 ranked list and an ANN ranked list into a single `Vec<(EntityId, f64)>` sorted by descending fused score. Documents appearing in only one list contribute only their single-list term. `k = 60` by default, configurable.
|
||
|
||
## Complexity: S
|
||
|
||
## Dependencies
|
||
|
||
- m5p1 COMPLETE: `EntityId` type, `tidal/src/query/` module structure
|
||
- `tidal/src/query/mod.rs` exists for adding the `fusion` submodule
|
||
|
||
## Technical Design
|
||
|
||
### RRF Formula
|
||
|
||
From `docs/research/tantivy.md`:
|
||
|
||
> RRFscore(d) = 1/(60 + rank_bm25(d)) + 1/(60 + rank_ann(d))
|
||
|
||
Where:
|
||
- `rank_bm25(d)` is the 1-based rank of document `d` in the BM25 list (rank 1 = highest BM25 score)
|
||
- `rank_ann(d)` is the 1-based rank of document `d` in the ANN list (rank 1 = lowest L2 distance)
|
||
- Documents absent from a list contribute zero for that term
|
||
|
||
The k=60 constant is insensitive across 30–100 range. We implement it as configurable, defaulting to 60.
|
||
|
||
### Input Conventions
|
||
|
||
- **BM25 results**: `&[(EntityId, f32)]` where the f32 is the BM25 score. **The caller must pre-sort these descending by score.** The `fuse()` function uses position-as-rank (position 0 = rank 1).
|
||
- **ANN results**: `&[(EntityId, f32)]` where the f32 is the L2-squared distance. **The caller must pre-sort these ascending by distance.** The `fuse()` function uses position-as-rank (position 0 = rank 1).
|
||
|
||
This design keeps `fuse()` a pure function with no sorting overhead. The caller controls ordering.
|
||
|
||
### HybridFusion
|
||
|
||
```rust
|
||
// tidal/src/query/fusion.rs
|
||
|
||
use std::collections::HashMap;
|
||
use crate::schema::EntityId;
|
||
|
||
/// Reciprocal Rank Fusion (Cormack et al. SIGIR 2009).
|
||
///
|
||
/// Merges ranked lists from heterogeneous retrieval systems (BM25 text scores,
|
||
/// ANN vector distances) into a single ranked list using only rank positions.
|
||
///
|
||
/// The k=60 constant is insensitive across [30, 100] — see the research
|
||
/// literature. A configurable k is provided for experimentation.
|
||
///
|
||
/// # Reference
|
||
///
|
||
/// Cormack, Clarke, Büttcher. "Reciprocal Rank Fusion Outperforms Condorcet
|
||
/// and Individual Rank Learning Methods." SIGIR 2009.
|
||
#[derive(Debug, Clone)]
|
||
pub struct HybridFusion {
|
||
/// Rank offset constant. Default 60 per the original paper.
|
||
pub k: u32,
|
||
}
|
||
|
||
impl Default for HybridFusion {
|
||
fn default() -> Self {
|
||
Self { k: 60 }
|
||
}
|
||
}
|
||
|
||
impl HybridFusion {
|
||
/// Create a fusion instance with the default k=60.
|
||
#[must_use]
|
||
pub fn new() -> Self {
|
||
Self::default()
|
||
}
|
||
|
||
/// Create a fusion instance with a custom k.
|
||
#[must_use]
|
||
pub fn with_k(k: u32) -> Self {
|
||
Self { k }
|
||
}
|
||
|
||
/// Fuse two ranked lists via Reciprocal Rank Fusion.
|
||
///
|
||
/// Both lists must be pre-sorted in "best first" order by the caller:
|
||
/// - `bm25_results`: sorted descending by BM25 score (index 0 = rank 1)
|
||
/// - `ann_results`: sorted ascending by L2 distance (index 0 = rank 1)
|
||
///
|
||
/// The f32 value in each tuple is used only to establish the ordering by
|
||
/// the caller; `fuse()` itself uses only position (0-indexed) as the rank.
|
||
///
|
||
/// Documents appearing in only one list contribute only their single-list
|
||
/// term. The missing-list contribution is zero.
|
||
///
|
||
/// Returns results sorted by descending fused RRF score.
|
||
#[must_use]
|
||
pub fn fuse(
|
||
&self,
|
||
bm25_results: &[(EntityId, f32)],
|
||
ann_results: &[(EntityId, f32)],
|
||
) -> Vec<(EntityId, f64)> {
|
||
let k = f64::from(self.k);
|
||
|
||
// Map entity_id -> accumulated RRF score
|
||
let capacity = bm25_results.len() + ann_results.len();
|
||
let mut scores: HashMap<u64, f64> = HashMap::with_capacity(capacity);
|
||
|
||
// BM25 contribution: rank is 1-based (position 0 → rank 1)
|
||
for (rank_0based, (entity_id, _score)) in bm25_results.iter().enumerate() {
|
||
let rank = (rank_0based + 1) as f64;
|
||
*scores.entry(entity_id.as_u64()).or_insert(0.0) += 1.0 / (k + rank);
|
||
}
|
||
|
||
// ANN contribution: rank is 1-based (position 0 → rank 1)
|
||
for (rank_0based, (entity_id, _distance)) in ann_results.iter().enumerate() {
|
||
let rank = (rank_0based + 1) as f64;
|
||
*scores.entry(entity_id.as_u64()).or_insert(0.0) += 1.0 / (k + rank);
|
||
}
|
||
|
||
// Collect and sort descending by fused score
|
||
let mut results: Vec<(EntityId, f64)> = scores
|
||
.into_iter()
|
||
.map(|(id, score)| (EntityId::new(id), score))
|
||
.collect();
|
||
results.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap_or(std::cmp::Ordering::Equal));
|
||
|
||
results
|
||
}
|
||
}
|
||
```
|
||
|
||
### Registration in query/mod.rs
|
||
|
||
Add to `tidal/src/query/mod.rs`:
|
||
```rust
|
||
pub mod fusion;
|
||
pub use fusion::HybridFusion;
|
||
```
|
||
|
||
## Acceptance Criteria
|
||
|
||
- [ ] `tidal/src/query/fusion.rs` created with `HybridFusion` struct
|
||
- [ ] `HybridFusion { k: u32 }` with `Default` trait (k=60)
|
||
- [ ] `HybridFusion::new()` returns default k=60 instance
|
||
- [ ] `HybridFusion::with_k(k)` returns configured instance
|
||
- [ ] `fuse(bm25: &[(EntityId, f32)], ann: &[(EntityId, f32)]) -> Vec<(EntityId, f64)>` implements RRF
|
||
- [ ] BM25 rank: position 0 → rank 1, position N-1 → rank N
|
||
- [ ] ANN rank: position 0 → rank 1, position N-1 → rank N
|
||
- [ ] Documents in only one list: single-list contribution only (missing term = 0)
|
||
- [ ] Results sorted descending by fused score
|
||
- [ ] `pub mod fusion;` added to `tidal/src/query/mod.rs`
|
||
- [ ] `pub use fusion::HybridFusion;` exported from `tidal/src/query/mod.rs`
|
||
- [ ] Unit tests: `fuse_both_lists`, `fuse_bm25_only`, `fuse_ann_only`, `fuse_empty_lists`, `fuse_single_doc_both_lists`, `fuse_k_affects_scores`
|
||
- [ ] Property test: for any pair of ranked lists, fused output is union of both document sets; score for doc in both lists > score for doc in only one list
|
||
- [ ] `cargo check`, `cargo fmt`, `cargo clippy -D warnings` all pass
|
||
|
||
## Test Strategy
|
||
|
||
```rust
|
||
#[test]
|
||
fn fuse_both_lists() {
|
||
// BM25: [A=1.0, B=0.8, C=0.5] (descending)
|
||
// ANN: [B=0.1, A=0.2, D=0.5] (ascending distance)
|
||
// Expected: B ranks highest (rank 1 in ANN, rank 2 in BM25)
|
||
// A is rank 1 in BM25, rank 2 in ANN
|
||
let bm25 = vec![
|
||
(EntityId::new(1), 1.0f32), // A, rank 1
|
||
(EntityId::new(2), 0.8f32), // B, rank 2
|
||
(EntityId::new(3), 0.5f32), // C, rank 3 (BM25 only)
|
||
];
|
||
let ann = vec![
|
||
(EntityId::new(2), 0.1f32), // B, rank 1 (best ANN match)
|
||
(EntityId::new(1), 0.2f32), // A, rank 2
|
||
(EntityId::new(4), 0.5f32), // D, rank 3 (ANN only)
|
||
];
|
||
|
||
let fusion = HybridFusion::new(); // k=60
|
||
let results = fusion.fuse(&bm25, &ann);
|
||
|
||
// B: 1/(60+2) + 1/(60+1) = 1/62 + 1/61 ≈ 0.01613 + 0.01639 = 0.03252
|
||
// A: 1/(60+1) + 1/(60+2) = 1/61 + 1/62 ≈ 0.03252 (same as B — tie!)
|
||
// Actually: B is rank 2 in BM25, rank 1 in ANN; A is rank 1 in BM25, rank 2 in ANN
|
||
// B: 1/(60+2) + 1/(60+1) = 0.03252
|
||
// A: 1/(60+1) + 1/(60+2) = 0.03252 (same score — tie)
|
||
// C: 1/(60+3) + 0 = 1/63 ≈ 0.01587
|
||
// D: 0 + 1/(60+3) = 1/63 ≈ 0.01587
|
||
|
||
// Verify all 4 documents are in the output
|
||
let ids: Vec<u64> = results.iter().map(|(id, _)| id.as_u64()).collect();
|
||
assert!(ids.contains(&1)); // A
|
||
assert!(ids.contains(&2)); // B
|
||
assert!(ids.contains(&3)); // C
|
||
assert!(ids.contains(&4)); // D
|
||
|
||
// C and D (single-list) have lower scores than A and B (both lists)
|
||
let c_score = results.iter().find(|(id, _)| id.as_u64() == 3).unwrap().1;
|
||
let d_score = results.iter().find(|(id, _)| id.as_u64() == 4).unwrap().1;
|
||
let a_score = results.iter().find(|(id, _)| id.as_u64() == 1).unwrap().1;
|
||
let b_score = results.iter().find(|(id, _)| id.as_u64() == 2).unwrap().1;
|
||
assert!(a_score > c_score);
|
||
assert!(b_score > d_score);
|
||
|
||
// Scores are sorted descending
|
||
let scores: Vec<f64> = results.iter().map(|(_, s)| *s).collect();
|
||
for i in 1..scores.len() {
|
||
assert!(scores[i-1] >= scores[i]);
|
||
}
|
||
}
|
||
|
||
#[test]
|
||
fn fuse_bm25_only() {
|
||
let bm25 = vec![(EntityId::new(1), 1.0f32), (EntityId::new(2), 0.5f32)];
|
||
let ann = vec![];
|
||
|
||
let fusion = HybridFusion::new();
|
||
let results = fusion.fuse(&bm25, &ann);
|
||
|
||
assert_eq!(results.len(), 2);
|
||
// rank 1 doc scores higher than rank 2
|
||
let score_1 = results.iter().find(|(id, _)| id.as_u64() == 1).unwrap().1;
|
||
let score_2 = results.iter().find(|(id, _)| id.as_u64() == 2).unwrap().1;
|
||
assert!(score_1 > score_2);
|
||
// Score = 1/(60+1) for rank 1 = 1/61
|
||
let expected = 1.0 / (60.0 + 1.0);
|
||
assert!((score_1 - expected).abs() < 1e-9);
|
||
}
|
||
|
||
#[test]
|
||
fn fuse_k_affects_scores() {
|
||
let bm25 = vec![(EntityId::new(1), 1.0f32)];
|
||
let ann = vec![(EntityId::new(1), 0.1f32)];
|
||
|
||
let fusion_60 = HybridFusion::new(); // k=60
|
||
let fusion_30 = HybridFusion::with_k(30); // k=30
|
||
|
||
let results_60 = fusion_60.fuse(&bm25, &ann);
|
||
let results_30 = fusion_30.fuse(&bm25, &ann);
|
||
|
||
// k=30: 1/(30+1) + 1/(30+1) = 2/31 ≈ 0.0645
|
||
// k=60: 1/(60+1) + 1/(60+1) = 2/61 ≈ 0.0328
|
||
// k=30 produces higher scores
|
||
assert!(results_30[0].1 > results_60[0].1);
|
||
}
|
||
```
|
||
|
||
### Property Test
|
||
|
||
```rust
|
||
use proptest::prelude::*;
|
||
|
||
proptest! {
|
||
#[test]
|
||
fn rrf_output_is_union_of_inputs(
|
||
bm25_ids in prop::collection::vec(1u64..=100, 0..20),
|
||
ann_ids in prop::collection::vec(1u64..=100, 0..20),
|
||
) {
|
||
let bm25: Vec<(EntityId, f32)> = bm25_ids.iter().enumerate()
|
||
.map(|(i, &id)| (EntityId::new(id), (100 - i) as f32))
|
||
.collect();
|
||
let ann: Vec<(EntityId, f32)> = ann_ids.iter().enumerate()
|
||
.map(|(i, &id)| (EntityId::new(id), i as f32 * 0.01))
|
||
.collect();
|
||
|
||
let fusion = HybridFusion::new();
|
||
let results = fusion.fuse(&bm25, &ann);
|
||
|
||
// Output must contain the union of all input IDs
|
||
let all_ids: std::collections::HashSet<u64> = bm25_ids.iter()
|
||
.chain(ann_ids.iter())
|
||
.copied()
|
||
.collect();
|
||
let result_ids: std::collections::HashSet<u64> = results.iter()
|
||
.map(|(id, _)| id.as_u64())
|
||
.collect();
|
||
prop_assert_eq!(&all_ids, &result_ids);
|
||
|
||
// Results must be sorted descending
|
||
for i in 1..results.len() {
|
||
prop_assert!(results[i-1].1 >= results[i].1);
|
||
}
|
||
}
|
||
}
|
||
```
|