---
title: "Search and ranking are the same system"
date: "2026-02-21"
author: "Jordan Washburn"
description: "In the 6-system stack, search and ranking are separate pipelines with separate teams. tidalDB is designed to collapse text retrieval, vector retrieval, and signal-based ranking into a single query pipeline. Here is what that architecture looks like, what is built, and what remains."
tags: ["search", "ranking", "architecture", "rust"]
---
Your search results and your ranked feed are computed by different systems, maintained by different teams, and they return different answers to the same question.
A user types "jazz piano tutorial." Elasticsearch returns results ranked by BM25 text relevance. Separately, your ranking service reads the user's preference vector from a feature store, pulls engagement signals from Redis, and reranks the candidates. The text score and the engagement score are combined using a weighted formula that somebody wrote eighteen months ago and nobody has revisited since. If the user follows a jazz piano creator whose new tutorial has 500 completions in the last hour, that signal exists in Redis. It does not exist in Elasticsearch. The search result does not reflect it.
Meanwhile, the "For You" feed shows the same tutorial ranked highly -- because the ranking service reads the same Redis signals and the same preference vector. But the feed used a different candidate set (vector similarity from the vector database, not keyword match from Elasticsearch), a different scoring formula (hot decay, not BM25), and a different diversity pass. The user sees the tutorial in the feed. They search for it and it appears on page two.
This is not a bug. It is the architecture. Search and ranking are separate systems with separate data pipelines, and the seams between them are where relevance dies.
## Why search and ranking diverged
The separation is a historical accident, not a design choice.
Full-text search engines -- Lucene, then Solr, then Elasticsearch -- were built to answer a question about documents: "which ones match this query?" They index terms. They compute BM25 or TF-IDF scores. They return results ranked by textual relevance. The problem they solve is information retrieval.
Recommendation systems were built to answer a different question: "what should this user see?" They model user preferences. They track engagement signals. They compute scores based on behavioral data, not textual content. The problem they solve is personalization.
The two problems feel different, so they got different systems. But the question a real user asks is neither purely textual nor purely behavioral. When a user searches for "jazz piano tutorial," they want results that are textually relevant to those words, semantically related to that concept, and ranked according to the quality and freshness signals that their platform has accumulated. They want the search result that the feed would surface -- and the feed result that the search would find.
The 6-system stack cannot answer this question without stitching systems together. Elasticsearch produces text candidates. The vector database produces semantic candidates. Redis provides engagement signals. The ranking service merges everything. Each system has its own consistency model, its own latency profile, and its own failure mode. The merge happens in application code that nobody wants to own.
## What "unified" actually means
A unified search-and-ranking system handles three retrieval modes in a single pipeline:
**Text retrieval.** BM25 keyword relevance against an inverted index. "Jazz piano tutorial" matches documents containing those terms, weighted by term frequency and inverse document frequency. This is what Elasticsearch does.
**Vector retrieval.** Approximate nearest neighbor search over embeddings. The query "jazz piano" encoded as a vector finds documents whose embeddings are geometrically close in the latent space -- including documents titled "beginner jazz keyboard lessons" that share no keywords with the query. This is what a vector database does.
**Signal-based ranking.** Scoring candidates using live engagement signals -- decay scores, velocity, windowed counts, interaction weights, preference vectors. This is what the ranking service does.
In the 6-system stack, these are three systems. Three network calls. Three consistency models. The merge is application logic.
In tidalDB, the design is one pipeline: source candidates from one or more retrieval modes, fuse their scores, apply the ranking profile's signal boosts, enforce diversity, return results. One function call. One process. One consistency model.
The target query looks like this:
```
SEARCH items
QUERY "jazz piano"
VECTOR [embedding]
FOR USER @user_42
USING PROFILE search
DIVERSITY max_per_creator:2
LIMIT 20
```
Text relevance, semantic similarity, and personalized signal ranking in a single query. The fusion uses Reciprocal Rank Fusion: each retrieval mode produces a ranked list, and RRF combines them by summing `1 / (k + rank)` across lists. With the commonly used `k = 60`, an item that ranks 3rd by BM25 and 7th by ANN gets a higher fused score (about 0.0308) than an item that ranks 1st by BM25 but 50th by ANN (about 0.0255). The formula is simple, has a single tuning constant `k`, and is well-studied.
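The fusion step is small enough to sketch in full. The following is an illustration of RRF over plain `u64` ids, not tidalDB's implementation; the function name `rrf_fuse` and the tie-breaking rule are assumptions made for the example.

```rust
use std::collections::HashMap;

/// Fuses ranked lists of item ids with Reciprocal Rank Fusion.
/// Each list is ordered best-first and ranks are 1-based.
/// `k` is the smoothing constant (60 is the value commonly used).
pub fn rrf_fuse(lists: &[Vec<u64>], k: f64) -> Vec<(u64, f64)> {
    let mut scores: HashMap<u64, f64> = HashMap::new();
    for list in lists {
        for (idx, id) in list.iter().enumerate() {
            let rank = (idx + 1) as f64;
            // An item missing from a list simply contributes nothing for it.
            *scores.entry(*id).or_insert(0.0) += 1.0 / (k + rank);
        }
    }
    let mut fused: Vec<(u64, f64)> = scores.into_iter().collect();
    // Highest fused score first; ties broken by id so output is deterministic.
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then(a.0.cmp(&b.0)));
    fused
}
```

Only ranks enter the formula, so BM25 scores and vector distances never need to be normalized onto a common scale before fusing.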
Personalization re-ranks within the fused set. A low-relevance result never surfaces solely because the user likes the creator. Textual and semantic relevance establish the candidate floor. Signals adjust rank within the relevant set. An irrelevant result stays irrelevant regardless of the user's history.
## What is built today
tidalDB is not there yet. Here is exactly what exists and what does not.
### The RETRIEVE pipeline: operational
The 5-stage RETRIEVE query pipeline is complete and tested. It handles candidate generation, metadata filtering, signal scoring, diversity enforcement, and result assembly. Fifteen built-in ranking profiles cover trending, hot, new, top-week, top-month, top-all-time, hidden gems, controversial, most viewed, most liked, shuffle, for-you, following, related, and notification. Personalized ranking with preference vectors, interaction weights, hard negatives, and exploration injection all work.
```rust
// This works today.
let query = Retrieve::builder()
    .profile("trending")
    .for_user(42)
    .filter(FilterExpr::CategoryEq("jazz".into()))
    .diversity(DiversityConstraints::new().max_per_creator(1))
    .limit(25)
    .build()
    .expect("valid query");
let results = db.retrieve(&query).expect("retrieve");
```
Every result carries a signal snapshot showing the values that contributed to its score. The pipeline produces identical output for identical input. The acceptance tests verify this.
### USearch HNSW index: integrated, not wired as a retrieval path
The vector index is integrated and tested. USearch backs the `VectorIndex` trait with insert, search, filtered search, delete, save/load, and mmap `view()` mode. The adaptive query planner selects strategies based on filter selectivity: unfiltered HNSW for open queries, in-graph filtering for moderate selectivity, widened beam search for selective filters, and pre-filter-then-brute-force for extreme selectivity.
```rust
pub struct UsearchIndex {
    inner: usearch::Index,
    total_slots: AtomicUsize,
}

impl VectorIndex for UsearchIndex {
    fn search(
        &self,
        query: &[f32],
        k: usize,
        ef_search: usize,
    ) -> Result<Vec<VectorSearchResult>, VectorError> { /* ... */ }

    fn filtered_search(
        &self,
        query: &[f32],
        k: usize,
        ef_search: usize,
        filter: &dyn Fn(u64) -> bool,
    ) -> Result<Vec<VectorSearchResult>, VectorError> { /* ... */ }
}
```
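The selectivity-based strategy choice described above can be sketched as a single match. This is an illustration only: the enum, the function name, and especially the 20% and 1% thresholds are assumptions for the example, not values taken from tidalDB's planner.

```rust
/// The four retrieval strategies named in the text.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum SearchStrategy {
    UnfilteredHnsw,
    InGraphFilter,
    WidenedBeam,
    PreFilterBruteForce,
}

/// Picks a strategy from the fraction of the corpus that passes the filter.
/// `None` means the query has no filter at all.
/// The thresholds are hypothetical and exist only to make the sketch concrete.
pub fn choose_strategy(pass_fraction: Option<f64>) -> SearchStrategy {
    match pass_fraction {
        None => SearchStrategy::UnfilteredHnsw,
        // Many items pass: filtering during graph traversal stays cheap.
        Some(f) if f >= 0.20 => SearchStrategy::InGraphFilter,
        // Few items pass: widen the beam so enough survivors are visited.
        Some(f) if f >= 0.01 => SearchStrategy::WidenedBeam,
        // Almost nothing passes: filter first, then scan the survivors exactly.
        Some(_) => SearchStrategy::PreFilterBruteForce,
    }
}
```

The point of the shape is that the caller only supplies an estimate of how many items survive the filter; the index decides how to search.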
The embedding slot registry manages named vector slots per entity kind (e.g., "content" embeddings on Items, "creator_profile" embeddings on Creators). Embeddings are stored durably in the entity store and indexed in the HNSW index for search.
However, the `CandidateStrategy::Ann` variant in the query executor currently falls back to a full scan with a warning:
```rust
// From tidal/src/query/executor.rs
CandidateStrategy::Ann { .. } => {
    // ANN candidate strategy falls back to scan with a warning.
    warnings.push(
        "ANN candidate strategy not yet wired; falling back to scan"
            .to_string(),
    );
    self.scan_candidates(query.limit, has_user_context)
}
```
The vector infrastructure is there. The retrieval path through the query executor is not wired. The scan-based approach is sufficient for the item counts the current version targets (tens of thousands), but it does not scale to millions where ANN retrieval is essential.
### Tantivy: researched, not integrated
Full-text search via Tantivy has been researched in depth. The integration patterns are documented: a custom `Collector` for bulk BM25 score extraction, `Weight::scorer` with `DocSet::seek` for scoring a pre-existing candidate set, and the consistency model for keeping Tantivy's segment storage synchronized with tidalDB's entity store.
The research identified the key architectural decision: Tantivy is a derived index, not a source of truth. The entity store is canonical. Tantivy indexes are materialized views that can be rebuilt from storage. Crash recovery replays from a stored sequence number. This is simpler than two-phase commit and correct for an embedded database.
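A minimal sketch of the replay idea, assuming a change log ordered by ascending sequence number. Every name here (`Change`, `DerivedIndex`, `catch_up`) is hypothetical, and the `BTreeMap` stands in for a real Tantivy index; none of this is code from the repository.

```rust
use std::collections::BTreeMap;

/// A change recorded by the canonical entity store (hypothetical shape).
#[derive(Debug, Clone)]
pub struct Change {
    pub seq: u64,
    pub entity_id: u64,
    pub text: String,
}

/// Stand-in for a derived text index; a real one would wrap Tantivy.
#[derive(Debug, Default)]
pub struct DerivedIndex {
    pub docs: BTreeMap<u64, String>,
    /// Highest sequence number already applied to this index.
    pub applied_seq: u64,
}

impl DerivedIndex {
    /// Re-applies every change newer than `applied_seq`.
    /// Assumes `log` is sorted by ascending `seq`.
    /// Each change is an upsert, so replaying the same log twice is harmless.
    pub fn catch_up(&mut self, log: &[Change]) -> usize {
        let start = self.applied_seq;
        let mut applied = 0;
        for change in log.iter().filter(|c| c.seq > start) {
            self.docs.insert(change.entity_id, change.text.clone());
            self.applied_seq = change.seq;
            applied += 1;
        }
        applied
    }
}
```

Because the entity store stays canonical, the worst case after a crash is extra replay work, never lost data: dropping the index and calling `catch_up` from sequence zero rebuilds it.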
No Tantivy code has been written. No inverted index exists. No BM25 scoring is available. The `SEARCH` query type does not exist yet.
### The Hybrid and CohortTrending strategies: defined, not implemented
The `CandidateStrategy` enum includes variants for the target architecture:
```rust
// From tidal/src/ranking/profile.rs
pub enum CandidateStrategy {
    Ann { slot: String, limit: usize },
    Scan { sort_field: String },
    SignalRanked { signal: String, window: Window },
    Hybrid,
    Relationship,
    CohortTrending,
}
```
`Hybrid` is the strategy that will combine text and vector retrieval with RRF fusion. `CohortTrending` will scope signal aggregation to audience segments. Both return `UnsupportedStrategy` errors today. They are type-level documentation of intent.
## Why the architecture makes this possible
The interesting question is not "when will hybrid search ship." It is "why is the current architecture already designed to support it." The answer is in three decisions that were made before a line of search code was written.
### Decision 1: Ranking profiles control the retrieval path
The `CandidateStrategy` field on `RankingProfile` determines how candidates are sourced. The executor dispatches on this field. The scoring pipeline does not know or care where the candidates came from.
```rust
// From tidal/src/query/executor.rs
let mut candidates = match &profile.candidate_strategy {
    CandidateStrategy::Scan { .. } => self.scan_candidates(query.limit, has_user_context),
    CandidateStrategy::SignalRanked { signal, .. } => {
        self.signal_ranked_candidates(signal, query.limit)
    }
    CandidateStrategy::Ann { .. } => {
        // Falls back to scan today. Will query the HNSW index next.
        self.scan_candidates(query.limit, has_user_context)
    }
    CandidateStrategy::Relationship => {
        // Sources candidates from followed creators' item sets.
        // ...
    }
    // ...
};
```
Adding ANN retrieval means implementing the `Ann` arm. Adding hybrid retrieval means implementing the `Hybrid` arm. The scoring, filtering, diversity, and pagination stages are unchanged. The pipeline is already split at the right boundary.
### Decision 2: Scores are composable numbers, not opaque ranks
Every stage in the pipeline produces and consumes `f64` scores. The `ProfileExecutor` reads signal aggregations and computes a weighted sum. The diversity selector operates on scored candidates sorted by score. The result carries the score and a signal snapshot.
This means RRF fusion has a natural integration point. RRF produces a fused score from ranked lists. That score enters the same pipeline as any other candidate score. Boosts from the ranking profile add signal-weighted values on top. Personalization adjusts via interaction weights. The profile's `sort` mode can override the base score entirely if needed. The type system already supports it:
```rust
// From tidal/src/ranking/executor.rs
pub struct ScoredCandidate {
    pub entity_id: EntityId,
    pub score: f64,
    pub signal_snapshot: Vec<(String, f64)>,
    pub creator_id: Option<EntityId>,
    pub format: Option<String>,
}
```
A `ScoredCandidate` from ANN retrieval and a `ScoredCandidate` from text retrieval are the same type. Fusion is arithmetic, not type coercion.
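To make "fusion is arithmetic" concrete, here is a sketch of signal boosts added on top of a base score. The `Candidate` struct is a trimmed stand-in for `ScoredCandidate` with a plain `u64` id, and `apply_boosts`, the signal names, and the weights are all assumptions for illustration.

```rust
/// Trimmed stand-in for `ScoredCandidate`: id, running score, and snapshot.
#[derive(Debug, Clone)]
pub struct Candidate {
    pub entity_id: u64,
    pub score: f64,
    pub signal_snapshot: Vec<(String, f64)>,
}

/// Adds `weight * value` for each signal on top of the candidate's base score
/// (for example, a fused RRF score) and records every signal value it read.
/// Signals without a configured weight contribute zero to the score.
pub fn apply_boosts(
    mut c: Candidate,
    signals: &[(&str, f64)],
    weights: &[(&str, f64)],
) -> Candidate {
    for (name, value) in signals {
        let weight = weights
            .iter()
            .find(|(n, _)| n == name)
            .map(|(_, w)| *w)
            .unwrap_or(0.0);
        c.score += weight * value;
        c.signal_snapshot.push((name.to_string(), *value));
    }
    c
}
```

The base score can come from a scan, a signal ranking, or a fused list; the boost stage only ever sees an `f64`.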
### Decision 3: Signals live where the query can read them
In the 6-system stack, BM25 scores come from Elasticsearch, engagement signals come from Redis, and preference vectors come from a feature store. Merging them requires network calls across consistency boundaries.
In tidalDB, the data sources the ranking pipeline reads today are all in-process:
- **Signal ledger**: decay scores, velocity, windowed counts -- read with a function call, not a network request. A signal written 100ms ago is visible in the next query.
- **Preference vectors**: per-user taste embeddings stored in-memory, updated atomically on engagement events. Cosine similarity between user preference and item embedding is a dot product, not a feature store lookup.
- **Metadata indexes**: bitmap and range indexes for category, format, creator, tags, duration, timestamps. Filter evaluation is a bitmap intersection.
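The claim that cosine similarity reduces to a dot product holds when both vectors are unit length. A small sketch, assuming vectors are normalized once at write time; whether tidalDB stores them pre-normalized is not stated here, and the helper names are hypothetical.

```rust
/// Scales a vector to unit length; a zero vector is returned unchanged.
pub fn normalize(v: &[f32]) -> Vec<f32> {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm == 0.0 {
        return v.to_vec();
    }
    v.iter().map(|x| x / norm).collect()
}

/// Dot product of two equal-length vectors.
/// Equals cosine similarity when both inputs are unit length.
pub fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}
```

Paying the normalization cost on write keeps the read path to one multiply-add per dimension.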
When text retrieval is added, the BM25 score will be one more `f64` in the same process. Tantivy runs embedded. The score crosses no network boundary. The existing data sources are in-memory state updated by the write path and visible to the read path immediately; the text index, as a derived index, will trail the entity store only by its indexing lag.
This is why the unified architecture matters even before it is fully wired. The data model is already unified. The indexes are in the same process. The signal ledger and the preference vectors and the metadata indexes and (eventually) the text index and the vector index all share a memory space. Fusion is addition. Consistency is structural.
## What a unified search query replaces
Here is the dependency graph for a search query in the 6-system stack:
```
Application
  -> Elasticsearch (BM25 candidates from inverted index)
  -> Vector DB (semantic candidates from ANN)
  -> [merge candidate lists in application code]
  -> Ranking Service
       -> Redis (engagement signals per candidate)
       -> Feature Store (user preference vector)
       -> [score, rerank, diversity in application code]
  <- sorted results
```
Five systems besides the application. Three network calls from the application, plus two more inside the ranking service. Two candidate sets merged in application code that nobody tests. Diversity rules in a microservice that nobody wants to refactor. A BM25 score from Elasticsearch and an engagement score from Redis that were computed at different points in time against different consistency snapshots.
Here is the target in tidalDB:
```
Application
  -> db.search(&query)
       Stage 1a: Tantivy inverted index -> BM25-scored candidates
       Stage 1b: USearch HNSW index -> ANN-scored candidates
       Stage 1c: RRF fusion -> merged candidate list
       Stage 2: Bitmap filter evaluation -> surviving candidates
       Stage 3: Signal scoring via ranking profile -> scored candidates
       Stage 4: Diversity enforcement -> reordered candidates
       Stage 5: Result assembly -> Results
  <- Results
```
One function call. One process. No network boundary between text relevance, semantic similarity, and engagement signals. The data is never stale because every data source shares the same write path.
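Stage 4 is the easiest stage to picture in code. The sketch below shows one plausible way to enforce a `max_per_creator` cap with a greedy pass; the greedy approach, the `Scored` struct, and the function name are assumptions for illustration rather than a description of tidalDB's diversity selector.

```rust
use std::collections::HashMap;

/// One scored candidate, reduced to the fields the diversity pass needs.
#[derive(Debug, Clone, PartialEq)]
pub struct Scored {
    pub item_id: u64,
    pub creator_id: u64,
    pub score: f64,
}

/// Greedy diversity pass: walk candidates best-first and skip any item whose
/// creator already holds `max_per_creator` slots, until `limit` are picked.
pub fn enforce_max_per_creator(
    mut candidates: Vec<Scored>,
    max_per_creator: usize,
    limit: usize,
) -> Vec<Scored> {
    // Highest score first; ties broken by item id for deterministic output.
    candidates.sort_by(|a, b| {
        b.score.total_cmp(&a.score).then(a.item_id.cmp(&b.item_id))
    });
    let mut per_creator: HashMap<u64, usize> = HashMap::new();
    let mut picked = Vec::with_capacity(limit.min(candidates.len()));
    for c in candidates {
        if picked.len() == limit {
            break;
        }
        let used = per_creator.entry(c.creator_id).or_insert(0);
        if *used < max_per_creator {
            *used += 1;
            picked.push(c);
        }
    }
    picked
}
```

Because the pass runs over the full scored set, a capped creator's third-best item is replaced by the next-best item overall, not by whatever happened to be on the next page.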
## What is next
Three pieces of work stand between the current codebase and the unified search query.
**Wire ANN as a first-class retrieval path.** The USearch index is integrated. The adaptive query planner is implemented. The `CandidateStrategy::Ann` variant exists. What remains is plumbing: when the executor sees `Ann { slot, limit }`, it reads the query embedding (from the `Retrieve` struct or from the user's preference vector), calls the embedding registry to find the right HNSW index, runs the adaptive planner's `execute`, and returns the results as `Vec<EntityId>`. The scoring pipeline handles the rest. This is a wiring task, not an architecture change.
**Integrate Tantivy as a derived index.** The research doc maps out three integration patterns. The architecture decision is made: Tantivy is a materialized view of the entity store. The integration work is: add Tantivy as a dependency, build a background indexer that writes entity metadata to Tantivy segments, implement the custom `AllScoresCollector` for BM25 extraction, and expose text search as a candidate source in the executor. Crash recovery replays from a sequence number stored in the entity store.
**Implement RRF fusion and the SEARCH query.** Define a `Search` query type (analogous to `Retrieve` but with text and vector query fields). Implement the `Hybrid` candidate strategy that runs text retrieval and ANN retrieval in parallel, fuses the ranked lists using RRF, and feeds the merged candidates into Stage 2. The rest of the pipeline -- filtering, signal scoring, diversity, pagination -- is already built.
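Running the two retrieval modes side by side needs nothing beyond the standard library. A sketch using scoped threads, where the two closures stand in for text and vector retrieval; `retrieve_both` is a hypothetical helper, and whether the real `Hybrid` arm uses threads, a pool, or something else is an open design choice.

```rust
use std::thread;

/// Runs text retrieval on a scoped worker thread while vector retrieval runs
/// on the calling thread, then returns both best-first id lists.
pub fn retrieve_both(
    text: &(dyn Fn() -> Vec<u64> + Sync),
    vector: &(dyn Fn() -> Vec<u64> + Sync),
) -> (Vec<u64>, Vec<u64>) {
    thread::scope(|s| {
        let text_handle = s.spawn(|| text());
        let vector_ids = vector();
        // The scope guarantees the worker has finished before we return.
        let text_ids = text_handle.join().expect("text retrieval panicked");
        (text_ids, vector_ids)
    })
}
```

The two lists it returns are exactly the input a rank-fusion step expects, so the `Hybrid` arm reduces to retrieve, fuse, and hand off to Stage 2.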
Each piece builds on existing infrastructure. No architectural changes. No new consistency models. No new storage engines. The foundation is laid. What remains is connecting the pieces.
## Why the integration matters now
Even before the SEARCH query ships, the architectural unification has consequences.
When a platform team evaluates tidalDB today, they see a system where engagement signals, preference vectors, metadata indexes, and vector embeddings are co-located in a single process. They see a query pipeline that reads all of them in a single pass. They see ranking profiles that can reference any signal source without a network call. They see diversity enforcement that operates on the full scored set, not on a post-hoc splice in application code.
Adding text search to this system is mechanical. Adding text search to the 6-system stack is political -- because search is Elasticsearch, ranking is the ranking service, and the merge is whoever volunteered to own it last year.
The most expensive part of unifying search and ranking is not the code. It is the organizational decision to put them in the same system. In tidalDB, that decision was made at the schema level. Signals, profiles, entities, embeddings, and (soon) text indexes share a schema, a storage model, and a query pipeline. They are the same system because the data model says they are.
Search and ranking are the same question asked with different emphasis. "What matches this query for this user?" and "What should this user see right now?" differ in whether the user provided keywords. The ranking pipeline -- candidates, filters, signals, diversity -- is identical. The only variable is how candidates are sourced: by keyword, by embedding, by signal velocity, by relationship graph, or by some combination.
One pipeline. One set of candidates. One scoring pass. That is the design. The implementation is catching up.
---
*The RETRIEVE query executor is at [tidal/src/query/executor.rs](https://github.com/orchard9/tidalDB/blob/main/tidal/src/query/executor.rs). The ranking profiles are at [tidal/src/ranking/builtins.rs](https://github.com/orchard9/tidalDB/blob/main/tidal/src/ranking/builtins.rs). The vector index is at [tidal/src/storage/vector/usearch_index.rs](https://github.com/orchard9/tidalDB/blob/main/tidal/src/storage/vector/usearch_index.rs). The adaptive query planner is at [tidal/src/storage/vector/planner.rs](https://github.com/orchard9/tidalDB/blob/main/tidal/src/storage/vector/planner.rs). The Tantivy research is at [docs/research/tantivy.md](https://github.com/orchard9/tidalDB/blob/main/docs/research/tantivy.md). Follow the build on [GitHub](https://github.com/orchard9/tidalDB).*