jordan 192c473f55 feat: complete Milestone 5 — full-text search, RRF fusion, and creator search

- M5p1: BM25 text indexing via Tantivy with background syncer (0.26ms @ 10K docs)
- M5p2: RRF fusion layer combining BM25 + ANN scores (46µs @ 1K candidates)
- M5p3: unified Search query API (8-stage pipeline, BM25 + vector + ranking)
- M5p4: creator text + vector indexing and creator search executor (< 20ms @ 200 creators)
- Refactor db/mod.rs into focused sub-modules (creators, items, sessions, signals, etc.)
- Decompose monolithic files into directory modules (query/executor, ranking/diversity, etc.)
- Split brute.rs → brute/mod.rs + brute/tests.rs; extract search executor helpers
- Add benches: fusion, search, session, text_index
- Add M5 UAT test suites (m5_uat, m5_search, m5p4_creator_search, text_index)
- Update blog posts, roadmap, content strategy, and M5 planning docs
- Add tmp/ and .claude/worktrees/ to .gitignore

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-02-21 23:53:16 -07:00


Task 02: Retrieval Mode Router

Delivers

The RetrievalMode enum and the route_results() function. RetrievalMode::determine() selects text-only, vector-only, or hybrid retrieval based on which inputs the query provides. route_results() sends pre-retrieved result lists down the matching path: BM25 score passthrough for text-only, rank-based scoring for vector-only, and HybridFusion::fuse() for hybrid. A Criterion benchmark confirms that fusion adds < 1ms at 1000 candidates per list.

Complexity: S

Dependencies

Task 01 COMPLETE: HybridFusion with fuse() method in tidal/src/query/fusion.rs
m5p1 COMPLETE: EntityId type
m2p1 COMPLETE: VectorSearchResult { id: VectorId, distance: f32 } in tidal/src/storage/vector/

Technical Design

RetrievalMode

// tidal/src/query/fusion.rs (additions)

/// Which retrieval system(s) to use for a search query.
///
/// Determined by what the query provides:
/// - `TextOnly` — only `query_text` is present
/// - `VectorOnly` — only `query_vector` is present
/// - `Hybrid` — both `query_text` and `query_vector` are present
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RetrievalMode {
    /// Execute BM25 text search only.
    TextOnly,
    /// Execute ANN vector search only.
    VectorOnly,
    /// Execute both and fuse results via RRF.
    Hybrid,
}

impl RetrievalMode {
    /// Determine the retrieval mode from query contents.
    ///
    /// Returns `None` if neither text nor vector is provided (invalid query).
    #[must_use]
    pub fn determine(has_text: bool, has_vector: bool) -> Option<Self> {
        match (has_text, has_vector) {
            (true, false) => Some(Self::TextOnly),
            (false, true) => Some(Self::VectorOnly),
            (true, true) => Some(Self::Hybrid),
            (false, false) => None,
        }
    }
}
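
A minimal sketch of how a caller might derive the two flags passed to determine(). The SearchQuery struct and the mode_for() helper are hypothetical names used only for illustration (the real query type is not shown in this task), and treating blank text or an empty vector as "absent" is an assumption of the sketch, not something this task specifies. The enum is repeated so the block compiles on its own.

```rust
/// Mirrors the `RetrievalMode` enum above so the sketch is self-contained.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RetrievalMode {
    TextOnly,
    VectorOnly,
    Hybrid,
}

impl RetrievalMode {
    #[must_use]
    pub fn determine(has_text: bool, has_vector: bool) -> Option<Self> {
        match (has_text, has_vector) {
            (true, false) => Some(Self::TextOnly),
            (false, true) => Some(Self::VectorOnly),
            (true, true) => Some(Self::Hybrid),
            (false, false) => None,
        }
    }
}

/// HYPOTHETICAL query shape; field names follow the doc comment on `RetrievalMode`.
pub struct SearchQuery {
    pub query_text: Option<String>,
    pub query_vector: Option<Vec<f32>>,
}

/// HYPOTHETICAL helper: derive the flags from the optional fields.
/// ASSUMPTION: blank text and an empty vector count as "not provided".
pub fn mode_for(query: &SearchQuery) -> Option<RetrievalMode> {
    let has_text = query
        .query_text
        .as_deref()
        .is_some_and(|t| !t.trim().is_empty());
    let has_vector = query.query_vector.as_ref().is_some_and(|v| !v.is_empty());
    RetrievalMode::determine(has_text, has_vector)
}
```

Returning Option rather than a fourth enum variant lets the caller turn the (false, false) case into a validation error before any index is touched.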

route_results()

/// Route pre-retrieved result lists through the appropriate fusion path.
///
/// - `TextOnly`: widens BM25 scores to `f64`; the caller's descending order is preserved.
/// - `VectorOnly`: replaces each ANN distance with a rank-based score `1.0 / (k + rank)`;
///   the caller's ascending-distance order is preserved, so scores come out descending.
/// - `Hybrid`: calls `HybridFusion::fuse()` and returns the fused result.
///
/// # Inputs
///
/// - `bm25_results`: `(EntityId, f32)` where f32 is BM25 score, **pre-sorted descending**.
/// - `ann_results`: `(EntityId, f32)` where f32 is L2-squared distance, **pre-sorted ascending**.
/// - Both slices may be empty; callers pass `&[]` for unused modes.
///
/// # Returns
///
/// `Vec<(EntityId, f64)>` sorted descending by score. Scores are not normalized:
/// `TextOnly` returns the raw BM25 scores widened to `f64`, and `VectorOnly`
/// returns `1.0 / (k + rank)` (at most `1/61 ≈ 0.016` for k=60).
/// For `Hybrid`, scores are raw RRF values (typically 0.01–0.04 for k=60).
pub fn route_results(
    mode: RetrievalMode,
    bm25_results: &[(EntityId, f32)],
    ann_results: &[(EntityId, f32)],
    fusion: &HybridFusion,
) -> Vec<(EntityId, f64)> {
    match mode {
        RetrievalMode::TextOnly => {
            // Convert f32 BM25 scores to f64; already sorted descending by caller.
            bm25_results
                .iter()
                .map(|(id, score)| (*id, f64::from(*score)))
                .collect()
        }
        RetrievalMode::VectorOnly => {
            // Convert rank position to a score using the same RRF term for
            // consistency: score = 1.0 / (k + rank). This keeps ANN-only scores
            // on the same scale as the per-list terms of a hybrid RRF score.
            let k = f64::from(fusion.k);
            ann_results
                .iter()
                .enumerate()
                .map(|(i, (id, _distance))| {
                    let rank = (i + 1) as f64;
                    (*id, 1.0 / (k + rank))
                })
                .collect()
        }
        RetrievalMode::Hybrid => fusion.fuse(bm25_results, ann_results),
    }
}
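
The score arithmetic behind the VectorOnly path can be sketched as follows. The helper names (rrf_term, vector_only_scores, rrf_upper_bound) are hypothetical, and the upper bound assumes HybridFusion::fuse() is plain, unweighted RRF over two lists, which this task does not show.

```rust
/// RRF contribution of one list for a 1-based rank: 1 / (k + rank).
pub fn rrf_term(k: f64, rank: usize) -> f64 {
    1.0 / (k + rank as f64)
}

/// Scores the `VectorOnly` path assigns to the first `n` positions.
pub fn vector_only_scores(k: f64, n: usize) -> Vec<f64> {
    (1..=n).map(|rank| rrf_term(k, rank)).collect()
}

/// Largest possible unweighted RRF sum over `lists` lists, reached when a
/// document is ranked first in every list.
/// ASSUMPTION: the real `fuse()` applies no per-list weights.
pub fn rrf_upper_bound(k: f64, lists: usize) -> f64 {
    lists as f64 * rrf_term(k, 1)
}
```

With k = 60, the best VectorOnly score is 1/61 (about 0.0164), while under the stated assumption a document ranked first in both lists of a hybrid query would score 2/61 (about 0.0328). Scores from different modes therefore share a scale but should not be compared against a single fixed threshold.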

ann_to_ranked()

A helper to convert Vec<VectorSearchResult> (returned by VectorIndex::search()) to Vec<(EntityId, f32)> suitable as input to fuse() or route_results():

use crate::storage::vector::VectorSearchResult;

/// Convert ANN search results to a ranked list for fusion input.
///
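/// The input slice is expected to be sorted ascending by distance (best first),
/// as returned by `VectorIndex::search()`. Each hit is mapped to `(EntityId, f32)`,
/// where the f32 is the distance reported by the index, passed through unchanged
/// (L2-squared, per the `route_results()` input contract).
/// The caller passes the result to `fuse()` or `route_results()`, which use position as rank.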
#[must_use]
pub fn ann_to_ranked(ann_results: &[VectorSearchResult]) -> Vec<(EntityId, f32)> {
    ann_results
        .iter()
        .map(|r| (EntityId::new(r.id), r.distance))
        .collect()
}
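
Because the downstream scoring uses position as rank, ann_to_ranked() only produces meaningful input if the hits really are in ascending-distance order. The sketch below repeats the helper with stand-in types and adds a hypothetical is_sorted_ascending() check that a caller could use in a debug assertion. The stand-in VectorSearchResult assumes id is a plain u64, matching the struct literals used in the tests; the real VectorId and EntityId definitions may differ.

```rust
/// Stand-in for `crate::storage::vector::VectorSearchResult`.
/// ASSUMPTION: `id` is a plain `u64`.
pub struct VectorSearchResult {
    pub id: u64,
    pub distance: f32,
}

/// Stand-in for the real `EntityId` newtype.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct EntityId(u64);

impl EntityId {
    pub fn new(raw: u64) -> Self {
        Self(raw)
    }

    pub fn as_u64(self) -> u64 {
        self.0
    }
}

/// Same mapping as the helper above: order and distances pass through unchanged.
pub fn ann_to_ranked(ann_results: &[VectorSearchResult]) -> Vec<(EntityId, f32)> {
    ann_results
        .iter()
        .map(|r| (EntityId::new(r.id), r.distance))
        .collect()
}

/// HYPOTHETICAL check of the ascending-distance precondition.
pub fn is_sorted_ascending(ranked: &[(EntityId, f32)]) -> bool {
    ranked.windows(2).all(|pair| pair[0].1 <= pair[1].1)
}
```

A caller could wrap the check in debug_assert! before routing, so release builds pay nothing for it.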

Module Integration

Add to tidal/src/query/mod.rs:

pub use fusion::{HybridFusion, RetrievalMode, ann_to_ranked, route_results};

Criterion Benchmark

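// tidal/benches/fusion.rs

use criterion::{criterion_group, criterion_main, Criterion};
use std::hint::black_box;
// Also import `HybridFusion` and `EntityId` from the tidal crate here.

fn bench_rrf_1k_per_list(c: &mut Criterion) {
    // 1000 BM25 results, scores descending (best first)
    let bm25: Vec<(EntityId, f32)> = (0u64..1000)
        .map(|i| (EntityId::new(i), (1000 - i) as f32))
        .collect();
    // 1000 ANN results, distances ascending, 50% overlap with BM25 (ids 500..1000)
    let ann: Vec<(EntityId, f32)> = (500u64..1500)
        .enumerate()
        .map(|(i, id)| (EntityId::new(id), i as f32 * 0.001))
        .collect();

    let fusion = HybridFusion::new();

    c.bench_function("rrf_fuse_1k_per_list", |b| {
        b.iter(|| {
            let results = fusion.fuse(black_box(&bm25), black_box(&ann));
            black_box(results)
        });
    });
}

// Required because the bench target sets `harness = false`.
criterion_group!(benches, bench_rrf_1k_per_list);
criterion_main!(benches);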

Acceptance Criteria

RetrievalMode enum with TextOnly, VectorOnly, Hybrid variants in fusion.rs
RetrievalMode::determine(has_text, has_vector) -> Option<RetrievalMode> returns correct variant
determine(false, false) returns None
route_results(mode, bm25, ann, fusion) -> Vec<(EntityId, f64)> implemented
TextOnly path: BM25 scores converted to f64, list preserved
VectorOnly path: ANN results converted to rank-based scores via 1.0 / (k + rank)
Hybrid path: calls HybridFusion::fuse() and returns result
ann_to_ranked(ann_results: &[VectorSearchResult]) -> Vec<(EntityId, f32)> helper
RetrievalMode, route_results, ann_to_ranked exported from tidal/src/query/mod.rs
tidal/benches/fusion.rs created with Criterion benchmark rrf_fuse_1k_per_list
Benchmark result confirms fusion < 1ms for 1000 candidates per list
[[bench]] entry (name = "fusion", harness = false) added to tidal/Cargo.toml
Unit tests: determine_text_only, determine_vector_only, determine_hybrid, determine_none, route_text_only_passthrough, route_vector_only_rank_based, route_hybrid_calls_fuse, ann_to_ranked_converts_correctly
cargo check, cargo fmt, cargo clippy -D warnings all pass

Test Strategy

#[test]
fn determine_text_only() {
    assert_eq!(RetrievalMode::determine(true, false), Some(RetrievalMode::TextOnly));
}

#[test]
fn determine_hybrid() {
    assert_eq!(RetrievalMode::determine(true, true), Some(RetrievalMode::Hybrid));
}

#[test]
fn determine_none() {
    assert_eq!(RetrievalMode::determine(false, false), None);
}

#[test]
fn route_text_only_passthrough() {
    let bm25 = vec![(EntityId::new(1), 1.0f32), (EntityId::new(2), 0.5f32)];
    let fusion = HybridFusion::new();
    let results = route_results(RetrievalMode::TextOnly, &bm25, &[], &fusion);
    assert_eq!(results.len(), 2);
    assert!((results[0].1 - 1.0f64).abs() < 1e-6);  // f32 → f64 exact
    assert!((results[1].1 - 0.5f64).abs() < 1e-6);
}

#[test]
fn route_vector_only_rank_based() {
    // ANN order (ascending distance): rank 1 (index 0) gets score 1/(60+1) with k = 60
    let ann = vec![
        (EntityId::new(1), 0.1f32),  // rank 1
        (EntityId::new(2), 0.2f32),  // rank 2
    ];
    let fusion = HybridFusion::new();
    let results = route_results(RetrievalMode::VectorOnly, &[], &ann, &fusion);
    assert_eq!(results.len(), 2);
    let expected_rank1 = 1.0 / (60.0 + 1.0);
    let expected_rank2 = 1.0 / (60.0 + 2.0);
    assert!((results[0].1 - expected_rank1).abs() < 1e-9);
    assert!((results[1].1 - expected_rank2).abs() < 1e-9);
}

#[test]
fn ann_to_ranked_converts_correctly() {
    use crate::storage::vector::VectorSearchResult;
    let ann_results = vec![
        VectorSearchResult { id: 42, distance: 0.1 },
        VectorSearchResult { id: 99, distance: 0.3 },
    ];
    let ranked = ann_to_ranked(&ann_results);
    assert_eq!(ranked.len(), 2);
    assert_eq!(ranked[0].0.as_u64(), 42);
    assert!((ranked[0].1 - 0.1f32).abs() < 1e-6);
    assert_eq!(ranked[1].0.as_u64(), 99);
}
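
The acceptance criteria also list route_hybrid_calls_fuse, which is not shown above. One possible shape for it is sketched below with stand-in types so the block compiles on its own. The HybridFusion stand-in is an assumption: it guesses a u32 field k defaulting to 60 and an unweighted RRF fuse(), whereas the real type comes from Task 01 and may differ. Only the first assertion (routing returns exactly what fuse() returns) is independent of that assumption.

```rust
use std::collections::HashMap;

// Stand-ins so the sketch compiles on its own; the real types live in the crate.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash, PartialOrd, Ord)]
pub struct EntityId(u64);

impl EntityId {
    pub fn new(raw: u64) -> Self {
        Self(raw)
    }
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RetrievalMode {
    TextOnly,
    VectorOnly,
    Hybrid,
}

/// ASSUMPTION: `k` is a `u32` defaulting to 60 and `fuse()` is unweighted RRF.
pub struct HybridFusion {
    pub k: u32,
}

impl HybridFusion {
    pub fn new() -> Self {
        Self { k: 60 }
    }

    pub fn fuse(&self, bm25: &[(EntityId, f32)], ann: &[(EntityId, f32)]) -> Vec<(EntityId, f64)> {
        let k = f64::from(self.k);
        let mut scores: HashMap<EntityId, f64> = HashMap::new();
        for list in [bm25, ann] {
            for (i, (id, _)) in list.iter().enumerate() {
                *scores.entry(*id).or_insert(0.0) += 1.0 / (k + (i + 1) as f64);
            }
        }
        let mut fused: Vec<(EntityId, f64)> = scores.into_iter().collect();
        // Descending by score; ties broken by id for a deterministic order.
        fused.sort_by(|a, b| b.1.total_cmp(&a.1).then(a.0.cmp(&b.0)));
        fused
    }
}

/// Same routing logic as `route_results()` above, in compact form.
pub fn route_results(
    mode: RetrievalMode,
    bm25: &[(EntityId, f32)],
    ann: &[(EntityId, f32)],
    fusion: &HybridFusion,
) -> Vec<(EntityId, f64)> {
    match mode {
        RetrievalMode::TextOnly => bm25.iter().map(|(id, s)| (*id, f64::from(*s))).collect(),
        RetrievalMode::VectorOnly => {
            let k = f64::from(fusion.k);
            ann.iter()
                .enumerate()
                .map(|(i, (id, _))| (*id, 1.0 / (k + (i + 1) as f64)))
                .collect()
        }
        RetrievalMode::Hybrid => fusion.fuse(bm25, ann),
    }
}

#[test]
fn route_hybrid_calls_fuse() {
    let bm25 = vec![(EntityId::new(1), 2.0f32), (EntityId::new(2), 1.0f32)];
    let ann = vec![(EntityId::new(2), 0.1f32), (EntityId::new(3), 0.4f32)];
    let fusion = HybridFusion::new();
    let routed = route_results(RetrievalMode::Hybrid, &bm25, &ann, &fusion);
    // Routing must not alter what fuse() produces.
    assert_eq!(routed, fusion.fuse(&bm25, &ann));
    // Under the assumed RRF, the id present in both lists ranks first.
    assert_eq!(routed[0].0, EntityId::new(2));
    assert!((routed[0].1 - (1.0 / 62.0 + 1.0 / 61.0)).abs() < 1e-12);
}
```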
