---
title: "People search is not a different problem. It is a different entity_kind."
date: "2026-02-22"
author: "Jordan Washburn"
description: "Content discovery asks 'find me videos about jazz piano.' People discovery asks 'find me creators I should follow.' In the 6-system stack, these are two different architectures. In tidalDB, they differ by one field."
tags: ["search", "creators", "personalization", "rust"]
---

Content discovery and people discovery are treated as separate problems because, in practice, they require separate infrastructure. Content lives in one Elasticsearch index with one mapping. Creators live in another index with a different mapping, or they live in a Postgres table with a `tsvector` column that someone added during a hackathon and nobody optimized. The signal data that makes creator search interesting -- follow counts, engagement rates, trending velocity -- lives in Redis or a feature store. Not co-located with the search candidates. Not queryable in the same pipeline.

The result: content search works. People search is an afterthought. Or it is a separate project with its own team, its own latency budget, and its own set of bugs.

In tidalDB, creators are first-class entities. The same `SEARCH` query that finds content can find creators. Same pipeline. Same signal-based ranking. Same diversity enforcement. Same metadata filtering. You change one field.

```rust
// Content search
let results = db.search(&Search::builder()
    .query("jazz piano")
    .limit(10)
    .build()?)?;

// Creator search -- same pipeline, different entity_kind
let results = db.search(&Search::builder()
    .entity_kind(EntityKind::Creator)
    .query("jazz piano")
    .limit(10)
    .build()?)?;
```

That is the entire API difference. The executor handles the routing.

## How entity-aware search works

The search executor runs an 8-stage pipeline. Stage 1 is BM25 retrieval. Stage 1b is ANN retrieval. When `entity_kind` is `Creator`, both stages route to creator-specific indexes instead of item indexes. The text index searches creator names and handles instead of item titles and descriptions. The ANN index searches creator embeddings instead of item embeddings.

The routing happens once, at the top of the pipeline:

```rust
// From tidal/src/query/search/executor.rs
let effective_text_index = match query.entity_kind {
    EntityKind::Creator => self.creator_text_index,
    _ => self.text_index,
};
```

Everything downstream -- RRF fusion, signal scoring, diversity enforcement, pagination -- operates on entity IDs. It does not know or care whether those IDs refer to items or creators. The pipeline is entity-agnostic. The indexes are entity-specific.
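
To make "operates on entity IDs" concrete, here is a minimal sketch of reciprocal rank fusion over two ranked ID lists. It is an illustration under assumptions, not tidalDB's executor code: the `EntityId` alias, the `rrf_fuse` function, and the constant `k = 60.0` (a commonly used default) are all hypothetical names and values chosen for the example.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for an entity ID. The fusion step never inspects
/// whether an ID names an item or a creator.
type EntityId = u64;

/// Reciprocal rank fusion: every list contributes 1 / (k + rank) to each
/// entity it contains, with rank starting at 1.
fn rrf_fuse(ranked_lists: &[Vec<EntityId>], k: f64) -> Vec<(EntityId, f64)> {
    let mut scores: HashMap<EntityId, f64> = HashMap::new();
    for list in ranked_lists {
        for (idx, id) in list.iter().enumerate() {
            *scores.entry(*id).or_insert(0.0) += 1.0 / (k + (idx as f64) + 1.0);
        }
    }
    let mut fused: Vec<(EntityId, f64)> = scores.into_iter().collect();
    // Highest fused score first; ties broken by ID so the order is deterministic.
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then(a.0.cmp(&b.0)));
    fused
}
```

Given a BM25 list `[1, 2, 3]` and an ANN list `[2, 4, 1]`, this returns the IDs in the order 2, 1, 4, 3. Nothing in the function would change if those IDs came from the creator indexes instead of the item indexes.
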

Creator text fields are declared in the schema alongside item text fields:

```rust
let mut builder = SchemaBuilder::new();
builder.text_field("title", TextFieldType::Text); // item field
builder.creator_text_field("name", TextFieldType::Text); // creator field
builder.creator_text_field("handle", TextFieldType::Text); // creator field
builder.creator_text_field("language", TextFieldType::Keyword); // creator field
```

Separate declarations. Separate Tantivy indexes. Same schema object. The database knows which fields belong to which entity kind. The caller does not manage index routing.

## similar_to is not keyword search

The most interesting query in creator search does not use keywords at all.

```rust
let results = db.search(&Search::builder()
    .entity_kind(EntityKind::Creator)
    .similar_to(EntityId::new(1))
    .limit(5)
    .build()?)?;
```

This says: "find creators whose embeddings are close to creator 1." No query string. No BM25. The database reads creator 1's stored embedding, injects it as the query vector, runs ANN retrieval against the creator embedding index, and returns the nearest neighbors. Creator 1 is automatically excluded from the results.

The resolution happens in `TidalDb::search()` before the executor sees the query:

```rust
// From tidal/src/db/query_ops.rs
// `query_owned` is declared before this block so the borrow below can outlive it.
let query = if let (Some(similar_id), None) = (query.similar_to, &query.query_vector) {
    let emb = match query.entity_kind {
        EntityKind::Creator => self.read_creator_embedding(similar_id)?,
        _ => None,
    };
    if let Some(embedding) = emb {
        query_owned = query.clone();
        query_owned.query_vector = Some(embedding);
        if !query_owned.exclude.contains(&similar_id) {
            query_owned.exclude.push(similar_id);
        }
        &query_owned
    } else {
        query
    }
} else {
    // No `similar_to` target, or the caller already supplied a query vector:
    // use the query as-is.
    query
};
```

Three things happen. The stored embedding is read. It becomes the query vector. The source entity is excluded. By the time the executor runs, this is an ordinary ANN query. The executor does not know it originated from a `similar_to` call.
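
Those three steps can be sketched without any database at all. The example below is a simplification under stated assumptions: it swaps ANN retrieval for an exact brute-force scan with cosine similarity, and the `Creator` struct, `cosine`, and `similar_to` function are hypothetical names for illustration, not tidalDB APIs. Unlike the excerpt above, this sketch simply returns nothing when the source has no embedding.

```rust
/// Hypothetical creator record: an ID plus a stored embedding.
struct Creator {
    id: u64,
    embedding: Vec<f32>,
}

/// Cosine similarity; returns 0.0 when either vector has zero norm.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// Exact nearest-neighbor scan standing in for ANN retrieval.
fn similar_to(creators: &[Creator], source_id: u64, limit: usize) -> Vec<(u64, f32)> {
    // Step 1: read the stored embedding of the source creator.
    let Some(source) = creators.iter().find(|c| c.id == source_id) else {
        return Vec::new();
    };
    // Step 2: use it as the query vector. Step 3: exclude the source itself.
    let mut scored: Vec<(u64, f32)> = creators
        .iter()
        .filter(|c| c.id != source_id)
        .map(|c| (c.id, cosine(&source.embedding, &c.embedding)))
        .collect();
    // Most similar first, then keep the top `limit`.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.truncate(limit);
    scored
}
```

A real ANN index trades the exhaustive scan for an approximate structure, but the contract is the same: a vector goes in, ranked neighbor IDs with similarity scores come out, and the source never appears in its own results.
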

This is a fundamentally different kind of query than keyword search. You are not searching the text index. You are navigating embedding space. The use case is: "this user follows creator A. Show them creators who are like creator A." The embeddings encode whatever similarity your model learned -- genre, style, audience overlap, content type. The database does not generate the embeddings. It retrieves and ranks over them.

The acceptance test verifies it directly:

```rust
// Creator 1 is a jazz creator with embedding [0.0, 1.0, 0.0, 0.0].
// similar_to should return other jazz creators with nearby embeddings.
let results = db.search(&Search::builder()
    .entity_kind(EntityKind::Creator)
    .similar_to(EntityId::new(1))
    .limit(5)
    .build()?)?;

assert!(!results.is_empty());
assert!(results.items.iter().all(|r| r.entity_id != EntityId::new(1)));
assert!(results.items.iter().any(|r| r.semantic_score.is_some()));
```

Source excluded. Semantic scores present. No text query needed.

## Signal-based ranking for people

The `Sort::MostFollowed` and `Sort::CreatorEngagementRate` variants exist in the ranking executor. They read live signal state, not precomputed columns.

`MostFollowed` reads the "follow" signal's `AllTime` windowed value. This is not a static follower count field that was last updated by a cron job. It is the running accumulator in the signal ledger, computed at query time. A creator who gained 1,000 follows in the last hour has a different score than a creator with 1,000 total follows gained over two years -- if the signal has decay configured, the recent follows carry more weight.

`CreatorEngagementRate` sums two velocity reads: view velocity over 24 hours plus like velocity over 24 hours. Velocity is a native signal primitive in tidalDB -- it is not computed by dividing a count by a time window in application code. It is maintained by the signal ledger as events arrive. The read is O(1).

```rust
// From tidal/src/ranking/executor/mod.rs
Some(Sort::MostFollowed) => read_agg(
    entity_id, "follow", &SignalAgg::Value, Window::AllTime, self.ledger,
),
Some(Sort::CreatorEngagementRate) => {
    let view_vel = read_agg(
        entity_id, "view", &SignalAgg::Velocity, Window::TwentyFourHours, self.ledger,
    );
    let like_vel = read_agg(
        entity_id, "like", &SignalAgg::Velocity, Window::TwentyFourHours, self.ledger,
    );
    view_vel + like_vel
}
```

The data is the same signal ledger that scores items. The lens is different.
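
What does "maintained as events arrive, read in O(1)" look like? One common technique is an exponentially decayed counter. The sketch below is an assumption for illustration only; it does not claim to be how tidalDB's ledger stores velocity. The `Velocity` struct, its `tau` time constant, and `engagement_rate` are hypothetical names.

```rust
/// Hypothetical velocity tracker: an exponentially decayed event counter.
/// Assumes timestamps (in seconds) are non-decreasing.
struct Velocity {
    /// Decayed event mass as of `last_ts`.
    mass: f64,
    /// Timestamp of the last update, in seconds.
    last_ts: f64,
    /// Decay time constant in seconds (e.g. 86_400.0 for roughly a day).
    tau: f64,
}

impl Velocity {
    fn new(tau: f64) -> Self {
        Self { mass: 0.0, last_ts: 0.0, tau }
    }

    /// Decay the accumulated mass forward to `now`, then add the new event.
    fn record(&mut self, now: f64, weight: f64) {
        self.mass = self.mass * (-(now - self.last_ts) / self.tau).exp() + weight;
        self.last_ts = now;
    }

    /// O(1) read: approximate events per second as of `now`. No event scan.
    fn read(&self, now: f64) -> f64 {
        self.mass * (-(now - self.last_ts) / self.tau).exp() / self.tau
    }
}

/// Mirrors the shape of the engagement-rate sort: a sum of two velocity reads.
fn engagement_rate(views: &Velocity, likes: &Velocity, now: f64) -> f64 {
    views.read(now) + likes.read(now)
}
```

Each write costs one exponential and one addition. Each read costs one exponential and one division. The state is three numbers per signal per entity, which is why a sort over many candidates can afford to read it live.
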

## Metadata filtering on creators

Creator search supports the same `FilterExpr` predicates as item search. Language, category, any metadata key written with the creator entity.

```rust
let query = Search::builder()
    .entity_kind(EntityKind::Creator)
    .query("jazz")
    .filter(FilterExpr::eq("language", "en"))
    .limit(20)
    .build()?;
```

For creators, filtering happens in Stage 2b -- a post-filter that evaluates predicates against actual creator metadata from storage, rather than relying on bitmap indexes. This is because creator metadata lives in the creator storage engine, not in the item bitmap indexes that were built for content filtering. The trade-off: filtering is a storage read per candidate rather than a bitmap intersection. At creator-scale cardinalities (thousands, not millions), this is not a bottleneck.
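
A post-filter of this kind is short enough to sketch. The example below is hypothetical: the `Filter` enum is a simplified stand-in, not the real `FilterExpr` type, and a `HashMap` plays the role of the creator storage engine. It shows the trade-off described above, one metadata lookup per candidate, with retrieval order preserved.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified predicate type; the real `FilterExpr` may differ.
enum Filter {
    Eq(String, String),
    And(Vec<Filter>),
}

impl Filter {
    /// Evaluate the predicate against one creator's stored metadata.
    fn matches(&self, metadata: &HashMap<String, String>) -> bool {
        match self {
            Filter::Eq(key, value) => metadata.get(key) == Some(value),
            Filter::And(parts) => parts.iter().all(|p| p.matches(metadata)),
        }
    }
}

/// Post-filter: one storage read per candidate, keeping retrieval order.
/// Candidates with no stored metadata are dropped.
fn post_filter(
    candidates: &[u64],
    storage: &HashMap<u64, HashMap<String, String>>,
    filter: &Filter,
) -> Vec<u64> {
    candidates
        .iter()
        .copied()
        .filter(|id| storage.get(id).is_some_and(|meta| filter.matches(meta)))
        .collect()
}
```

The cost is linear in the number of candidates that survive retrieval, not in the size of the creator corpus. That is why the approach holds up at thousands of creators.
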

Creator results also carry metadata in the response. Each `SearchResultItem` has an optional `metadata` field that is populated for creator results:

```rust
pub struct SearchResultItem {
    pub entity_id: EntityId,
    pub score: f64,
    pub rank: usize,
    pub bm25_score: Option<f32>,
    pub semantic_score: Option<f32>,
    pub signals: Vec<Signal>,
    pub metadata: Option<HashMap<String, String>>, // populated for creators
}
```

The caller gets the creator's name, handle, and any other metadata without a second round-trip.

## The cost

Creator text search averages under 20 milliseconds per query with 200 creators indexed. The acceptance test measures it:

```rust
let avg = total / iters;
assert!(
    avg < Duration::from_millis(20),
    "Average creator text search latency {avg:?} exceeds 20ms target"
);
```

This is the full pipeline: BM25 retrieval from the creator text index, signal scoring, result assembly with metadata enrichment. Not a microbenchmark. The real query path.
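
The excerpt above shows only the final assertion. A measurement loop that produces `total` and `iters` might look like the sketch below. The `average_latency` helper is a hypothetical name, and the closure stands in for one full search call; the actual test may structure its timing differently.

```rust
use std::time::{Duration, Instant};

/// Run `op` `iters` times and return the mean wall-clock latency.
/// `op` stands in for one full search call; `iters` must be non-zero.
fn average_latency<F: FnMut()>(iters: u32, mut op: F) -> Duration {
    let mut total = Duration::ZERO;
    for _ in 0..iters {
        let start = Instant::now();
        op();
        total += start.elapsed();
    }
    total / iters
}
```

Timing each call separately and dividing at the end keeps loop overhead out of the measured path. An average hides tail latency, so a stricter test would also record the slowest iteration.
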

## What this replaces

In the 6-system stack, people search requires:

1. A separate Elasticsearch index with a creator-specific mapping.
2. A separate ingestion pipeline to keep creator metadata in sync.
3. A call to Redis or the feature store for follow counts and engagement rates.
4. Application code to merge search results with signal data.
5. A separate ranking pass, or no ranking at all -- just BM25 relevance.

In tidalDB, creators are entities. Signals are ledger entries. Embeddings are stored vectors. Search is a pipeline. You declare the text fields, write the creators, and query. The same process that handles content search handles people search. The same signal ledger that scores items scores creators. No sync pipeline. No second index to manage. No feature store call.

One pipeline. Two `entity_kind` values. The database handles the routing.

---

*The creator search tests are at [tidal/tests/m5p4_creator_search.rs](https://github.com/orchard9/tidalDB/blob/main/tidal/tests/m5p4_creator_search.rs). The search executor is at [tidal/src/query/search/executor.rs](https://github.com/orchard9/tidalDB/blob/main/tidal/src/query/search/executor.rs). The M5 UAT is at [tidal/tests/m5_uat.rs](https://github.com/orchard9/tidalDB/blob/main/tidal/tests/m5_uat.rs). Follow the build on [GitHub](https://github.com/orchard9/tidalDB).*