tidaldb/site/content/blog/hybrid-search-shipped.mdx
2026-02-23 22:41:16 -07:00

241 lines
14 KiB
Plaintext

---
title: "Hybrid search shipped. BM25 + ANN + signals in one call."
date: "2026-02-22"
author: "Jordan Washburn"
description: "The SEARCH pipeline is live: Tantivy BM25 at 0.26ms per 10K docs, USearch ANN wired as a first-class retrieval path, Reciprocal Rank Fusion at 46 microseconds, and signal-based re-ranking in the same process. One function call. No network boundary."
tags: ["search", "performance", "rust", "architecture"]
---
Yesterday's post on [search and ranking](/blog/search-and-ranking) ended with a sentence I wanted to make obsolete as quickly as possible: "The implementation is catching up."
It caught up.
```rust
let results = db.search(
&Search::builder()
.query("jazz piano")
.vector(user_embedding)
.for_user(user_id)
.diversity(DiversityConstraints::new().max_per_creator(2))
.limit(20)
.build()?,
)?;
```
That compiles. That runs. That returns ranked results with BM25 text scores, ANN semantic scores, and signal-boosted re-ranking. One function call. One process. No network boundary between the text index and the engagement signals.
This post covers what shipped, what it costs, and what the pipeline looks like from the inside.
## What was missing
The previous post was honest about the gaps. Three systems were described but not wired:
**Tantivy was researched, not integrated.** The research doc existed. The architecture decision was made -- Tantivy is a derived index, not a source of truth. No code had been written.
**ANN fell back to scan.** The USearch HNSW index was integrated and tested, but the `CandidateStrategy::Ann` arm in the query executor printed a warning and scanned the full entity set.
**`CandidateStrategy::Hybrid` returned `UnsupportedStrategy`.** The enum variant existed as type-level documentation of intent. Calling it produced an error.
All three are now operational.
## The SEARCH pipeline
The `RETRIEVE` query has a 5-stage pipeline. The `SEARCH` query has an 8-stage pipeline. The additional stages handle the retrieval modes that search introduces -- text, vector, and the fusion of both.
```
Stage 1a: BM25 text retrieval (Tantivy)
-> Vec<(EntityId, f32)> sorted by BM25 score
Stage 1b: ANN vector retrieval (USearch HNSW)
-> Vec<(EntityId, f32)> sorted by L2 distance
Stage 1c: Reciprocal Rank Fusion
-> Vec<(EntityId, f64)> merged ranking
Stage 2: Metadata filter (RoaringBitmap intersection)
Stage 2b: Creator metadata post-filter (when searching creators)
Stage 2.5: User-context filter (seen, hidden, blocked)
Stage 3: Signal scoring via ranking profile
Stage 4: Diversity enforcement
Stage 5: Result assembly
```
Stages 2 through 5 are the same pipeline that `RETRIEVE` uses. The search pipeline adds the retrieval front-end -- three sub-stages that produce a fused candidate list -- and feeds it into the existing infrastructure. No scoring code was duplicated. No diversity logic was forked. The ranking profiles, the signal ledger reads, the diversity selector, the pagination cursors -- all shared.
The executor determines which retrieval path to use from the query itself:
```rust
let mode = RetrievalMode::determine(
query.query_text.is_some(),
query.query_vector.is_some(),
);
```
Three modes: `TextOnly`, `VectorOnly`, `Hybrid`. Text-only queries skip ANN entirely. Vector-only queries skip Tantivy. Hybrid runs both and fuses. The downstream pipeline does not know which mode produced the candidates.
## Stage 1a: Tantivy BM25
Tantivy runs embedded. No separate process, no network call, no sidecar. The text index is a materialized view of the entity store -- when you call `write_item_with_metadata`, a background syncer feeds the metadata fields into Tantivy's inverted index.
The search executor calls Tantivy's `Searcher` with a custom `AllScoresCollector` that extracts BM25 scores for every matching document, not just the top-K. This matters because RRF fusion operates on ranks, not scores -- we need the complete ranked list from each source to fuse correctly.
```rust
let parser = idx.query_parser();
let tq = parser.parse(query_text)?;
let searcher = idx.searcher();
let collector = AllScoresCollector {
entity_id_field: idx.fields().entity_id,
};
let mut raw = searcher.search(tq.as_ref(), &collector)?;
raw.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap_or(std::cmp::Ordering::Equal));
```
Tantivy supports the query syntax you would expect: exact phrases (`"Rust tutorial"`), boolean exclusion (`rust -jazz`), multi-term queries with implicit OR. The schema declares which fields are indexed as `Text` (tokenized, BM25-scored) and which as `Keyword` (exact match, filterable).
The consistency model: the entity store is the source of truth. Tantivy indexes are derived. If the Tantivy index is lost, it rebuilds from storage. The syncer commits every 1,000 items or every 2 seconds, whichever comes first. For immediate visibility in tests, `flush_text_index()` blocks until the syncer acknowledges and the reader reloads.
**Cost: 0.26ms for a BM25 query against 10,000 documents.**
## Stage 1b: USearch ANN
The previous post described the ANN infrastructure as "integrated, not wired as a retrieval path." The `CandidateStrategy::Ann` arm fell back to a full scan. That arm no longer exists in the search pipeline -- the `SearchExecutor` reads the embedding registry directly.
```rust
let registry = embedding_registry.read()?;
match registry.get(query.entity_kind, "content") {
Some(slot) => {
let k = (query.limit as usize * 20).max(200);
let raw = slot.index.search(query_vector, k, slot.params.ef_search)?;
ann_results = ann_to_ranked(&raw);
}
None => {
warnings.push("no 'content' embedding slot; ANN retrieval skipped");
}
};
```
The overprovision factor is 20x the requested limit or 200, whichever is larger. This gives the fusion and diversity stages enough candidates to work with. The HNSW index returns results sorted by L2 distance -- for unit-normalized embeddings (which tidalDB normalizes at insertion time), this is equivalent to cosine similarity ranking.
When no embedding slot exists for the queried entity kind, the executor skips ANN and logs a warning. The query still succeeds via text retrieval alone. Graceful degradation, not failure.
## Stage 1c: Reciprocal Rank Fusion
RRF is the formula from Cormack, Clarke, and Buettcher (SIGIR 2009):
```
RRF(d) = 1/(k + rank_bm25(d)) + 1/(k + rank_ann(d))
```
`k` is 60 (the paper's default). The formula is rank-based, not score-based. This matters because BM25 scores and L2 distances have incomparable distributions. A BM25 score of 4.7 and an L2 distance of 0.3 cannot be meaningfully combined by addition or weighted average. But rank 3 in the text list and rank 7 in the vector list can be combined: `1/63 + 1/67 = 0.0308`.
A document that appears in both lists gets contributions from both terms. A document in only one list gets only its single-list term. Documents ranked highly in both lists surface to the top. Documents ranked highly in one and poorly in the other land in the middle. The formula requires no tuning, no score normalization, and no per-query alpha parameter.
```rust
pub fn fuse(
&self,
bm25_results: &[(EntityId, f32)],
ann_results: &[(EntityId, f32)],
) -> Vec<(EntityId, f64)> {
let k = f64::from(self.k);
let mut scores: HashMap<u64, f64> = HashMap::with_capacity(capacity);
for (rank, (entity_id, _)) in bm25_results.iter().enumerate() {
*scores.entry(entity_id.as_u64()).or_insert(0.0) += 1.0 / (k + (rank + 1) as f64);
}
for (rank, (entity_id, _)) in ann_results.iter().enumerate() {
*scores.entry(entity_id.as_u64()).or_insert(0.0) += 1.0 / (k + (rank + 1) as f64);
}
// Sort descending by fused score.
// ...
}
```
The fusion module includes property tests proving that the output is always the union of unique IDs from both inputs, sorted descending by fused score. RRF never drops a candidate that appeared in either source list.
**Cost: 46 microseconds for 1,000 candidates.**
## What a result looks like
Each result carries its provenance:
```rust
pub struct SearchResultItem {
pub entity_id: EntityId,
pub score: f64, // final profile score after signal re-ranking
pub rank: usize, // 1-based
pub bm25_score: Option<f32>, // present if text retrieval matched
pub semantic_score: Option<f32>, // present if ANN retrieval matched
pub signals: Vec<Signal>,
pub metadata: Option<HashMap<String, String>>,
}
```
`bm25_score` and `semantic_score` are the raw retrieval scores before fusion and before signal re-ranking. They are the search equivalent of the signal snapshot on `RETRIEVE` results -- a per-result explanation of how this item got here. A result with a high BM25 score and no semantic score was found by keywords alone. A result with both was found by text and by embedding. The final `score` reflects the ranking profile's signal boosts applied on top of the fused retrieval score.
## The UAT
The acceptance test writes 200 items (100 "rust tutorial" items, 100 "jazz piano" items), each with a 4-dimensional embedding pointing into distinct quadrants. It writes 50 creators with text metadata and embeddings. Then it runs 8 tests:
1. **Hybrid search** -- text + vector, both BM25 and semantic scores present, descending order.
2. **Text-only search** -- BM25 scores present, semantic scores absent.
3. **Exact phrase match** -- `"Rust tutorial"` in quotes.
4. **Boolean exclusion** -- `rust -jazz`.
5. **Creator text search** -- routes to the creator text index.
6. **Creator `similar_to`** -- reads a creator's stored embedding, runs ANN, source entity excluded from results.
7. **`search_click` signal** -- records engagement from search interaction.
8. **Re-search after signal write** -- proves signals written between queries are visible in the next search.
Step 8 is the thesis in miniature. Write a view signal. Search again. The signal ledger and the text index share a process. The ranking profile reads the updated signal state. No cache to invalidate. No consumer to wait for.
```rust
// Write a signal.
let _ = db.signal("view", EntityId::new(1), 1.0, Timestamp::now());
// Re-search should still work -- and reflect the new signal.
let results = db.search(&q).unwrap();
assert!(!results.is_empty());
```
A performance test verifies that hybrid search at 200 items averages under 50ms -- comfortably within the query budget even at this early scale.
## The numbers
| Operation | Measured | Scale |
|-----------|----------|-------|
| BM25 text query | 0.26ms | 10K documents |
| RRF fusion | 46us | 1K candidates |
| Hybrid search (end-to-end) | <50ms | 200 items |
The BM25 number is from Tantivy running embedded. No serialization, no network, no result deserialization. The 0.26ms includes query parsing, term lookup in the inverted index, BM25 scoring, and result collection. At 10K documents, the inverted index fits in L2 cache.
The RRF number is from the fusion module in isolation: two ranked lists of 500 candidates each, merged into a single sorted list. Two hash map passes and one sort. The 46 microseconds would not show up in a flame graph.
The end-to-end number includes all 8 stages: both retrieval modes, fusion, metadata filtering, signal scoring, diversity enforcement, and result assembly. Under 50ms with room to spare.
## What changed architecturally
Nothing.
That is the point. The previous post argued that the architecture was already designed for hybrid search -- that the pipeline split at the right boundary, that scores were composable numbers, that signals lived where the query could read them. Shipping the SEARCH pipeline proved it. The executor gained a new file (`search/executor.rs`). The fusion module is 166 lines. The `SearchBuilder` and result types are 320 lines. The existing scoring pipeline, diversity selector, filter evaluator, and pagination logic required zero changes.
The RRF fusion produces a `Vec<(EntityId, f64)>`. The ranking profile's `ProfileExecutor` expects a `Vec<EntityId>`. The diversity selector expects a `Vec<ScoredCandidate>`. These are the same types the `RETRIEVE` pipeline uses. Fusion is arithmetic. The pipeline does not care where the candidates came from.
Sixteen built-in ranking profiles now include `search` -- a profile with no exploration injection, no diversity bias, and light signal boosts from view and like counts. When you call `Search::builder()` without specifying a profile, it defaults to `search`. The profile treats text and semantic relevance as the primary ranking signal, with engagement data adjusting within the relevant set. A high-quality but unengaged result stays high. A low-relevance result does not surface on engagement alone.
## What remains
Creator search shipped alongside hybrid search. `entity_kind(EntityKind::Creator)` routes BM25 to the creator text index and ANN to the creator embedding index. `similar_to(EntityId::new(42))` reads a creator's stored embedding and uses it as the query vector. Sort modes include `MostFollowed` (by follow signal) and `CreatorEngagementRate` (view + like velocity over 24 hours). These are operational but deserve their own post.
Index persistence across restarts is not here yet. The Tantivy index is durable (it writes to disk), but the bitmap indexes and HNSW index rebuild from the entity store on reopen. At the current scale targets this is fast. At larger scales it will need attention.
The text query parser -- `SEARCH items QUERY "jazz piano" VECTOR [...] LIMIT 20` as a string -- is not yet implemented. Queries are constructed via the Rust builder API. The parser will produce the same `Search` struct the builder produces.
## Where to start
For the architectural backstory -- why search and ranking are the same system, why the pipeline splits where it does, why RRF over learned fusion -- read [Search and ranking are the same system](/blog/search-and-ranking). That post describes the design. This one describes the artifact.
---
*The search executor is at [tidal/src/query/search/executor.rs](https://github.com/orchard9/tidalDB/blob/main/tidal/src/query/search/executor.rs). The RRF fusion module is at [tidal/src/query/fusion.rs](https://github.com/orchard9/tidalDB/blob/main/tidal/src/query/fusion.rs). The acceptance test is at [tidal/tests/m5_uat.rs](https://github.com/orchard9/tidalDB/blob/main/tidal/tests/m5_uat.rs). Follow the build on [GitHub](https://github.com/orchard9/tidalDB).*