---
title: "Cold start without application logic"
date: "2026-02-21"
author: "Jordan Washburn"
description: "New users with zero history and new items with zero signals are not special cases. tidalDB handles cold start as a profile declaration, not an application-level exception path."
tags: ["ranking", "signals", "rust"]
---

Every recommendation system has a moment of silence. A new user signs up. The system has no history, no preferences, no interaction weights, no signal data. What do you show them?

The conventional answer is: write application code.

A function in your ranking service checks whether the user has a preference vector. If not, it branches to a "cold start" path that queries a separate "popular items" table, or samples from a hand-curated editorial list, or returns items sorted by global trending -- and none of these paths share infrastructure with the personalized ranking pipeline that handles the other 99% of traffic.

The new-item problem is the mirror image. A creator publishes content. It has zero views, zero likes, zero shares. The trending formula reads its velocity as zero. The hot score reads its view count as zero. The hidden gems formula divides zero quality by zero reach. Every scoring formula that depends on signals produces a zero or a NaN. The item is invisible until enough users stumble onto it through some other path -- search, chronological browse, a direct link -- and generate the initial signals the ranking system needs to function.

Both problems share a root cause: the ranking system treats "no data" as an edge case that application code must handle. tidalDB treats it as a profile configuration.

## The application code problem

Here is what cold start looks like in the 6-system stack.

For new users, the ranking service contains a function like this:

```python
def get_feed(user_id):
    prefs = feature_store.get(user_id)
    if prefs is None:
        # Cold start: no preference vector.
        # Fall back to global trending.
        return elasticsearch.search(
            index="items",
            body={"sort": [{"trending_score": "desc"}]},
            size=50
        )
    else:
        # Warm user: personalized ranking.
        candidates = vector_db.ann_search(prefs.embedding, k=200)
        scored = ranking_service.score(candidates, prefs)
        return scored[:50]
```

Two code paths. Two data sources. Two latency profiles. Two sets of failure modes. The cold-start path hits Elasticsearch with a sort query. The warm path reads the feature store, then hits a vector database, then a ranking service. When something breaks, you need to know which path the user was on. When you change the ranking formula, you need to change it in both paths -- or forget, and spend a quarter wondering why new user retention is declining while existing user engagement is steady.

For new items, the application hacks around the zero-signal problem differently. Some teams add a "boost" column to the items table and manually set it to 1.0 for the first 24 hours. Some teams maintain a separate "new items" feed that is chronological and unranked. Some teams inject random new items into the ranked feed at a fixed rate -- one new item for every nine ranked items -- implemented as a post-ranking splice in the API layer. Every approach is bespoke, undocumented, and coupled to the specific ranking pipeline it patches.

The cost is operational. Cold-start logic is scattered across services, maintained by different teams, tested to different standards. It works until someone changes the ranking formula and forgets to update the fallback. It works until the "new items" Elasticsearch index falls behind by 6 hours because a Kafka consumer crashed and nobody noticed. It works until a product manager asks "can we show new users content from creators in their geographic region?" and the answer is "that requires changes to three services and a new data pipeline."

## Why random and chronological fail

The obvious fallback for new users is random or chronological: show the newest content, or show a random sample. Both are worse than they sound.

Chronological exposes the user to whatever was published most recently. Most content on any platform is mediocre. The median completion rate of a freshly published video is low. A new user whose first experience is five mediocre items sorted by timestamp is more likely to churn than a new user who sees five items selected for quality. The first session matters. Chronological does not optimize for it.

Random is better than chronological if the random sample is quality-weighted. But "quality-weighted random" is itself a ranking formula. It needs signal data -- completion rate, like ratio, engagement velocity -- to determine what "quality" means. At that point, you are not avoiding the ranking problem. You are implementing a second ranking pipeline inside the cold-start path.

The better answer is: there is no cold-start path. There is only the ranking pipeline, with profiles configured to produce sensible results when signals are sparse.

## Cold start as a profile

A `RankingProfile` in tidalDB has an `exploration` field:

```rust
// From tidal/src/ranking/profile.rs

pub struct RankingProfile {
    pub name: String,
    pub version: u32,
    pub candidate_strategy: CandidateStrategy,
    pub boosts: Vec<Boost>,
    pub decay: Option<ProfileDecay>,
    pub gates: Vec<Gate>,
    pub penalties: Vec<Penalty>,
    pub excludes: Vec<Exclude>,
    pub diversity: DiversitySpec,
    pub exploration: f64,  // <-- this field
    pub sort: Option<Sort>,
    pub is_builtin: bool,
}
```

`exploration` is a fraction between 0.0 and 0.5. It controls what percentage of the result set is reserved for candidates that did not make it into the top-ranked results. When set to 0.1, 10% of the result set is filled with items outside the scored top. When set to 0.5, half the results are exploration candidates.

This is not a cold-start hack. It is a profile-level declaration that controls the exploration-exploitation tradeoff. But it is also what makes cold start work.

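To make the fraction concrete, here is a minimal sketch of how a page of results might be divided. `split_slots` is a hypothetical helper written for this post, not a tidalDB function, and clamping to the 0.0 to 0.5 range is an assumption. The `ceil` rounding follows the `inject_exploration` excerpt later in this post.

```rust
/// Hypothetical helper, not part of tidalDB: split a page of `limit` results
/// into (ranked slots, exploration slots) for a given exploration fraction.
fn split_slots(limit: usize, exploration: f64) -> (usize, usize) {
    // Assumption: out-of-range fractions are clamped to the documented range.
    let fraction = exploration.clamp(0.0, 0.5);
    // Round up, so any nonzero fraction reserves at least one slot.
    let explore = ((fraction * limit as f64).ceil() as usize).min(limit);
    (limit - explore, explore)
}
```

Under these assumptions, a 50-item page at `exploration = 0.1` keeps 45 ranked slots and reserves 5 for exploration. At 0.5 the split is 25 and 25.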
## New user cold start

A new user has no preference vector. No interaction history. No seen set. No hard negatives. When a `RETRIEVE` query runs `FOR USER @new_user USING PROFILE for_you`, the query executor proceeds through the same pipeline as any other query:

**Stage 1: Candidate generation.** The profile uses `CandidateStrategy::Scan`. All items in the universe are candidates. No history-based filtering narrows the candidate set because there is no history.

**Stage 2: Filter evaluation.** Metadata filters (category, format, duration) apply normally. A new user with a locale set to `ja-JP` still gets Japanese-language content if the query includes a locale filter. Filters do not depend on signal history.

**Stage 2.5: User-context filtering.** The executor checks seen items, hidden items, blocked creators, hard negatives. All bitmaps are empty. No candidates are removed. The full universe proceeds to scoring.

**Stage 3: Signal scoring.** The `for_you` profile scores candidates using the `Hot` sort mode and three boosts:

```rust
// From tidal/src/ranking/builtins.rs

fn for_you() -> RankingProfile {
    let mut p = skeleton("for_you");
    p.sort = Some(Sort::Hot { gravity: 1.5 });
    p.boosts = vec![
        Boost {
            signal: "view".into(),
            agg: SignalAgg::DecayScore,
            window: Window::AllTime,
            weight: 1.0,
        },
        Boost {
            signal: "like".into(),
            agg: SignalAgg::DecayScore,
            window: Window::AllTime,
            weight: 2.0,
        },
        Boost {
            signal: "share".into(),
            agg: SignalAgg::Velocity,
            window: Window::TwentyFourHours,
            weight: 1.5,
        },
    ];
    p.diversity = DiversitySpec {
        max_per_creator: Some(2),
        format_mix_max_fraction: Some(0.4),
    };
    p.exploration = 0.1;
    p
}
```

The scoring reads population-level signals. Item A has 500 views and 200 likes from all users. Item B has 50 views and 3 likes. The `for_you` profile scores Item A higher -- not because it matches this user's preferences (the user has none), but because the population signals indicate quality. The decay scores, velocity, and hot formula operate on the items' global signal state. They do not require per-user data to produce a useful ordering.

The personalized scoring path builds a `UserContext` from the interaction ledger:

```rust
fn build_user_context(&self, user_id: u64, now: Timestamp) -> UserContext {
    let top_creators = self.interaction_ledger
        .map(|il| il.top_creators(user_id, 50, now.as_nanos()))
        .unwrap_or_default();

    // ... expand creators to per-item boosts ...

    UserContext {
        user_id,
        creator_interaction_boosts,
    }
}
```

For a new user, `top_creators` returns an empty vec. The `creator_interaction_boosts` map is empty. The additive interaction boost is 0.0 for every candidate. The scoring falls through to population-level signals. No special case. No branch. The same code runs. The boost map just happens to be empty.

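A minimal sketch of that fall-through, assuming the boost map is keyed by item id and the boost is simply added to the population-level score. The trimmed-down `UserContext` and the `personalized_score` function are illustrative stand-ins, not the executor's actual types.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the per-user context described above.
struct UserContext {
    /// Assumed layout: item id -> additive boost from creator interactions.
    creator_interaction_boosts: HashMap<u64, f64>,
}

/// Final score = population-level score + per-user additive boost.
/// A missing entry contributes 0.0, so an empty map needs no special case.
fn personalized_score(population_score: f64, item_id: u64, ctx: &UserContext) -> f64 {
    let boost = ctx
        .creator_interaction_boosts
        .get(&item_id)
        .copied()
        .unwrap_or(0.0);
    population_score + boost
}
```

With an empty map every lookup misses, so every candidate keeps its population-level score. Once the map has entries, the same function starts returning boosted scores for those items only.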
**Stage 3.5: Exploration injection.** This is where the `exploration: 0.1` field matters. After scoring, the executor reserves 10% of result slots for candidates outside the top-scored set:

```rust
fn inject_exploration(
    scored: &mut Vec<ScoredCandidate>,
    all_candidates: &[EntityId],
    exploration_fraction: f64,
) {
    let exploration_slots = (exploration_fraction * scored.len() as f64).ceil() as usize;

    // Find candidates not in the scored set.
    let scored_ids: HashSet<u64> = scored.iter().map(|c| c.entity_id.as_u64()).collect();
    let mut exploration_pool: Vec<EntityId> = all_candidates.iter()
        .filter(|id| !scored_ids.contains(&id.as_u64()))
        .copied()
        .collect();

    // Deterministic shuffle using BLAKE3 hash.
    exploration_pool.sort_by(|a, b| {
        let hash_a = blake3::hash(&a.as_u64().to_le_bytes());
        let hash_b = blake3::hash(&b.as_u64().to_le_bytes());
        hash_a.as_bytes().cmp(hash_b.as_bytes())
    });

    // Trim scored to make room, then append exploration candidates.
    let keep = scored.len().saturating_sub(exploration_slots);
    scored.truncate(keep);
    for &entity_id in exploration_pool.iter().take(exploration_slots) {
        scored.push(ScoredCandidate {
            entity_id,
            score: 0.0,
            signal_snapshot: vec![],
            creator_id: None,
            format: None,
        });
    }
}
```

For a warm user, exploration prevents filter bubbles -- it surfaces content the personalized model would not have chosen. For a cold-start user, exploration introduces variety into a result set that would otherwise be entirely population-ranked. The mechanism is the same. The effect differs because the user's state differs.

**Stage 4: Diversity enforcement.** The `for_you` profile enforces `max_per_creator: 2` and `format_mix_max_fraction: 0.4`. A new user sees at most two items from any creator, and no single format exceeds 40% of results. This is critical for cold start: without diversity constraints, a new user's first feed would be dominated by the globally most popular creator. Diversity forces breadth. Breadth generates signal diversity. Signal diversity makes the preference vector converge faster.

The result: a new user's first feed is ranked by population-level quality, diversified across creators and formats, with 10% exploration. It is not personalized -- there is nothing to personalize against -- but it is also not random, not chronological, and not a separate code path. It is the same pipeline with empty user state.

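The diversity stage is described here without code, so the following is only a sketch of one way such constraints could be enforced: a greedy pass over score-sorted candidates. The `Candidate` struct, the `enforce_diversity` function, and the choice to express the format cap as `floor(fraction * limit)` slots are all assumptions made for illustration; tidalDB's actual diversity pass may work differently.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
struct Candidate {
    id: u64,
    creator_id: u64,
    format: &'static str,
    score: f64,
}

/// Greedy diversity pass (illustrative only).
/// `ranked` must already be sorted by descending score.
fn enforce_diversity(
    ranked: &[Candidate],
    limit: usize,
    max_per_creator: usize,
    format_mix_max_fraction: f64,
) -> Vec<Candidate> {
    // Assumed interpretation: the format cap is a number of slots on the page.
    let max_per_format = ((format_mix_max_fraction * limit as f64).floor() as usize).max(1);
    let mut per_creator: HashMap<u64, usize> = HashMap::new();
    let mut per_format: HashMap<&str, usize> = HashMap::new();
    let mut page = Vec::with_capacity(limit);

    for c in ranked {
        if page.len() == limit {
            break;
        }
        let creator_count = per_creator.entry(c.creator_id).or_insert(0);
        let format_count = per_format.entry(c.format).or_insert(0);
        if *creator_count >= max_per_creator || *format_count >= max_per_format {
            continue; // this creator or format already has its share
        }
        *creator_count += 1;
        *format_count += 1;
        page.push(c.clone());
    }
    page
}
```

With `limit = 5`, `max_per_creator = 2`, and a fraction of 0.4, a creator holding the top three scores contributes only two items, and a third video from anyone is skipped in favor of lower-scored items in other formats.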
## New item cold start

A new item has no signals. Zero views, zero likes, zero shares. The `for_you` profile scores it as follows:

- `Hot` sort: `log10(max(views, 1)) / (age_hours + 2)^gravity`. With zero views, the numerator is `log10(1) = 0`. The score is 0.0.
- View decay score boost: 0.0 (no signals recorded).
- Like decay score boost: 0.0.
- Share velocity boost: 0.0.

Total score: 0.0. The item will not appear in a `for_you` result set unless it lands in the exploration pool.

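The zero falls straight out of the `Hot` formula. The sketch below reconstructs that formula from the expression in the list above; it is not copied from the executor, and the argument order is an assumption based on how `hot_score(0.0, age, gravity)` is referred to later in this post.

```rust
/// Hot score as written in the text:
/// log10(max(views, 1)) / (age_hours + 2)^gravity.
fn hot_score(views: f64, age_hours: f64, gravity: f64) -> f64 {
    // max(views, 1) keeps the logarithm defined when there are no views.
    views.max(1.0).log10() / (age_hours + 2.0).powf(gravity)
}
```

At `gravity = 1.5`, an item with zero views scores exactly 0.0 at any age, because the numerator is `log10(1)`. The `+ 2` in the denominator keeps the division defined even at age zero, so the result is a plain zero rather than a NaN.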
This is where the built-in profiles for discovery surfaces matter. The `hidden_gems` profile is designed to find high-quality, low-reach content:

```rust
// From tidal/src/ranking/builtins.rs

fn hidden_gems() -> RankingProfile {
    let mut p = skeleton("hidden_gems");
    p.sort = Some(Sort::HiddenGems);
    p
}
```

The scoring formula:

```rust
// From tidal/src/ranking/executor.rs

/// Hidden gems: `quality / log10(view_count + 10)`
fn hidden_gems_score(quality: f64, view_count: f64) -> f64 {
    quality / (view_count + 10.0).log10()
}
```

An item with zero views and nonzero completion rate has `quality / log10(10) = quality / 1.0 = quality`. An item with 10,000 views and the same completion rate scores `quality / log10(10010)`, which is approximately `quality / 4.0`. The formula structurally favors low-reach content. A new item with a few completions outscores a popular item with the same quality metric.

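As a worked version of that arithmetic, the sketch below repeats the formula from the executor excerpt and evaluates it for a fresh item and a popular one. The quality value of 0.8 and the `compare_reach` helper are made up for illustration.

```rust
/// Same formula as the executor excerpt: quality / log10(view_count + 10).
fn hidden_gems_score(quality: f64, view_count: f64) -> f64 {
    quality / (view_count + 10.0).log10()
}

/// Illustrative comparison using a made-up quality value of 0.8.
/// Returns (score of the fresh item, score of the popular item).
fn compare_reach() -> (f64, f64) {
    let fresh = hidden_gems_score(0.8, 0.0); // divisor is log10(10) = 1
    let popular = hidden_gems_score(0.8, 10_000.0); // divisor is log10(10010), about 4.0
    (fresh, popular)
}
```

The fresh item keeps its full quality of 0.8, while the popular item lands near 0.2, a fourfold advantage for the item nobody has seen yet.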
The `new` profile is simpler. It sorts by recency:

```rust
// From tidal/src/ranking/builtins.rs

fn new() -> RankingProfile {
    let mut p = skeleton("new");
    p.sort = Some(Sort::New);
    p
}
```

The `shuffle` profile goes further -- 50% exploration:

```rust
// From tidal/src/ranking/builtins.rs

fn shuffle() -> RankingProfile {
    let mut p = skeleton("shuffle");
    p.sort = Some(Sort::Shuffle);
    p.exploration = SHUFFLE_EXPLORATION;  // 0.5
    p
}
```

The `shuffle` sort uses a deterministic hash of the entity ID for stable random ordering. With 50% exploration, half the result set is randomly selected regardless of score. New items have an equal chance of appearing as established ones.

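A rough sketch of what "deterministic hash for stable random ordering" can look like. To keep the example free of external crates it uses FNV-1a as a stand-in hash; the exploration excerpt earlier in this post uses BLAKE3, and the hash the `Shuffle` sort actually uses is not shown here. Both function names are hypothetical.

```rust
/// FNV-1a over the little-endian bytes of the id.
/// Stand-in hash for illustration only.
fn stable_key(id: u64) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325; // FNV-1a 64-bit offset basis
    for byte in id.to_le_bytes().iter() {
        hash ^= u64::from(*byte);
        hash = hash.wrapping_mul(0x100000001b3); // FNV-1a 64-bit prime
    }
    hash
}

/// Order ids by their hash: the result looks shuffled, but it is identical
/// on every call and does not depend on the input order.
fn deterministic_shuffle(ids: &[u64]) -> Vec<u64> {
    let mut out = ids.to_vec();
    out.sort_by_key(|&id| stable_key(id));
    out
}
```

Because the ordering depends only on the id, an item's position is unrelated to its signal counts, which is why an item with zero signals is placed no worse than an established one.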
These profiles are not cold-start logic. They are discovery surfaces. `hidden_gems` answers "what is good but underseen." `new` answers "what was just published." `shuffle` answers "surprise me." Each one, by design, gives new items a fair chance -- not because of a special case, but because the scoring formula does not penalize the absence of signals.

## The transition

The preference vector initializes on the first positive engagement:

```rust
pub fn update(&self, user_id: u64, interaction_embedding: &[f32]) -> bool {
    let lr = self.learning_rate;
    match self.inner.entry(user_id) {
        Entry::Occupied(mut occ) => {
            let pref = occ.get_mut();
            for (p, &i) in pref.iter_mut().zip(interaction_embedding.iter()) {
                *p = (1.0 - lr).mul_add(*p, lr * i);
            }
            l2_normalize(pref);
        }
        Entry::Vacant(vac) => {
            // First interaction: the item's embedding becomes the preference.
            let mut v = interaction_embedding.to_vec();
            l2_normalize(&mut v);
            vac.insert(v);
        }
    }
    true
}
```

When the user has no preference vector and likes an item, the item's embedding becomes the initial preference vector. One interaction. One data point. The preference is crude -- it is a single point in a 128-dimensional space -- but it is not nothing. The second like blends the new item's embedding with the existing preference using an exponential moving average at learning rate 0.1: `pref = 0.9 * pref + 0.1 * new_embedding`, followed by L2 normalization. The third like refines further. By the tenth positive interaction, the preference vector has converged to a meaningful region of the embedding space.

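To see the blend in numbers, here is a self-contained sketch of the same update rule on toy 2-dimensional vectors instead of 128 dimensions. `update_preference` and this `l2_normalize` are simplified stand-ins written for illustration: they keep the preference in an `Option` rather than a concurrent map, and the zero-vector guard in `l2_normalize` is an assumption.

```rust
/// Scale `v` to unit length in place; an all-zero vector is left untouched.
fn l2_normalize(v: &mut [f32]) {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        for x in v.iter_mut() {
            *x /= norm;
        }
    }
}

/// One preference update: the first interaction adopts the embedding,
/// later ones blend it in with learning rate `lr` and renormalize.
fn update_preference(pref: &mut Option<Vec<f32>>, embedding: &[f32], lr: f32) {
    match pref {
        Some(p) => {
            for (p_i, &e_i) in p.iter_mut().zip(embedding) {
                *p_i = (1.0 - lr) * *p_i + lr * e_i;
            }
            l2_normalize(p);
        }
        None => {
            let mut v = embedding.to_vec();
            l2_normalize(&mut v);
            *pref = Some(v);
        }
    }
}
```

Starting from no preference, a first like on an item with embedding `[2.0, 0.0]` yields `[1.0, 0.0]`. A second like on `[0.0, 1.0]` at `lr = 0.1` blends to `[0.9, 0.1]`, which normalizes to roughly `[0.994, 0.110]`: the vector has moved slightly toward the new item without abandoning the first.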
The vector is being built, but scoring does not use it yet. Cosine similarity between the preference vector and candidate embeddings -- the scoring path that will rank items near this region higher -- is planned for M5. It requires an O(1) per-item embedding lookup table that does not yet exist. The architecture is in place; the wiring is the next step.

The working cold-to-warm mechanism today is the interaction ledger. A new user's interaction ledger is empty. After one view of creator A's content, the ledger has one entry: `(user, creator_A) -> 1.0`. After three more views at weight 1.0 each, the entry decays and accumulates: the score is the sum of `weight * exp(-lambda * dt)` over all interactions. The decay half-life is 7 days. Recent interactions dominate. By the time the user has engaged with five creators, the `top_creators()` call returns a ranked list that meaningfully differentiates the user's preferences. Those weights flow into `creator_interaction_boosts`, which apply additive boosts to items from favored creators during scoring. Empty ledger, no boosts. One creator interaction, one boost. The transition from cold to warm is continuous. There is no threshold, no flag, no branch.

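The decay sum can be sketched in a few lines. The sketch assumes `lambda = ln(2) / half_life`, the usual relationship between a decay constant and a half-life, and it represents each interaction as a `(weight, age_in_days)` pair. Both the representation and the `decayed_score` name are assumptions, not the ledger's real layout.

```rust
const HALF_LIFE_DAYS: f64 = 7.0;

/// Decayed interaction score: sum of weight * exp(-lambda * dt),
/// with lambda = ln(2) / half-life so a contribution halves every 7 days.
/// `interactions` holds (weight, age_in_days) pairs.
fn decayed_score(interactions: &[(f64, f64)]) -> f64 {
    let lambda = std::f64::consts::LN_2 / HALF_LIFE_DAYS;
    interactions
        .iter()
        .map(|&(weight, age_days)| weight * (-lambda * age_days).exp())
        .sum()
}
```

Under these assumptions, three views at weight 1.0 made today, 7 days ago, and 14 days ago sum to 1.0 + 0.5 + 0.25 = 1.75, and an empty ledger sums to 0.0. There is no separate "cold" value: the score starts at zero and grows with each interaction.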
The exploration fraction does not change. It stays at 10% for `for_you` regardless of how many signals the user has generated. For a cold-start user, 10% exploration introduces variety into a population-ranked feed. For a warm user, 10% exploration prevents the feedback loop from closing too tightly. The same value serves both purposes because the purpose is the same: prevent the ranking from converging to a local optimum.

## What the application does not write

Here is the code required to handle cold start in tidalDB:

```rust
let query = RetrieveBuilder::new(EntityKind::Item, ProfileRef::new("for_you"))
    .for_user(user_id)
    .limit(50)
    .build()
    .expect("valid query");
let results = db.retrieve(&query).expect("retrieve");
```

No `if user.is_new()`. No `get_cold_start_items()`. No `select_popular_fallback()`. No feature flag for the cold-start experiment. No A/B test between the cold-start path and the warm path. No monitoring for "how many users are hitting the cold-start branch." No incident review when the cold-start Elasticsearch index falls behind.

The application chooses a profile name. The database handles the rest. The same query, the same profile, the same pipeline produces a population-ranked, diversity-enforced, exploration-injected feed for a new user and a personalized, interaction-weighted feed for a returning user. The difference is the data, not the code.

For new items, the application does even less. An item is written with metadata. It has no signals. If a query uses `hidden_gems`, the item competes on quality. If a query uses `new`, the item appears by recency. If a query uses `shuffle`, the item has a random chance proportional to the exploration budget. If a query uses `for_you`, the item can appear in the 10% exploration pool. The application did not write cold-start injection logic. It did not maintain a "new items" index. It did not implement a boost column with a 24-hour TTL.

The profiles handle it because the profiles were designed to handle it. `exploration` is a field, not a feature. `hidden_gems` is a formula, not a workaround. `new` is a sort mode, not a fallback. Cold start is not a special case in the system. It is the initial state that the normal ranking pipeline handles correctly.

## The broader pattern

Cold start is one instance of a general principle in tidalDB: the database should produce correct results for every valid state of the data, including the empty state. A new user with zero signals is a valid state. A new item with zero engagement is a valid state. The ranking pipeline should not require application-level exception handling to produce useful output from valid state.

The `read_agg` function in the executor demonstrates this at the lowest level:

```rust
// From tidal/src/ranking/executor.rs

/// Read a signal aggregation for a candidate. Returns 0.0 on any error or
/// missing data -- scoring must never fail, only degrade.
fn read_agg(
    entity_id: EntityId,
    signal: &str,
    agg: &SignalAgg,
    window: Window,
    ledger: &SignalLedger,
) -> f64 {
    match agg {
        SignalAgg::Value => {
            let count = ledger
                .read_windowed_count(entity_id, signal, window)
                .unwrap_or(0) as f64;
            count
        }
        SignalAgg::Velocity => ledger
            .read_velocity(entity_id, signal, window)
            .unwrap_or(0.0),
        SignalAgg::DecayScore => ledger
            .read_decay_score(entity_id, signal, 0)
            .unwrap_or(None)
            .unwrap_or(0.0),
        // ...
    }
}
```

Missing data returns 0.0. Not an error. Not a null. Not a sentinel value that requires special handling upstream. Zero. The scoring formulas are written to produce defined output when their inputs are zero. `hot_score(0.0, age, gravity)` returns 0.0 because `log10(max(0, 1)) = 0`. `trending_score(0.0, 0.0)` returns 0.0. `hidden_gems_score(0.0, 0.0)` returns 0.0 because `0.0 / log10(10) = 0.0`. No NaN. No division by zero. No panic.

The normalization pass that follows scoring handles the case where all candidates have the same score:

```rust
// From tidal/src/ranking/executor.rs

fn normalize(candidates: &mut [ScoredCandidate]) {
    // ...
    let range = max - min;
    for c in candidates.iter_mut() {
        c.score = if range < f64::EPSILON {
            1.0  // All equal -> all get 1.0
        } else {
            (c.score - min) / range
        };
    }
}
```

When every candidate has the same score (including the case where every score is 0.0), normalization assigns 1.0 to all of them. The diversity pass then selects based on creator and format constraints. The result is a diverse selection from equally-scored candidates -- which is exactly what you want for a new user viewing a universe of items they have never interacted with.

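The excerpt above elides how `min` and `max` are computed, so here is a self-contained sketch that fills the gap with a single fold over the scores. The fold, the early return on an empty slice, and the trimmed-down `ScoredCandidate` are assumptions; only the branch on `range` comes from the excerpt.

```rust
/// Trimmed-down stand-in for the executor's candidate type.
struct ScoredCandidate {
    entity_id: u64,
    score: f64,
}

/// Min-max normalization into [0, 1]; a flat score distribution maps to 1.0.
fn normalize(candidates: &mut [ScoredCandidate]) {
    if candidates.is_empty() {
        return;
    }
    // Assumed: min and max are found in one pass over the scores.
    let (min, max) = candidates.iter().fold(
        (f64::INFINITY, f64::NEG_INFINITY),
        |(lo, hi), c| (lo.min(c.score), hi.max(c.score)),
    );
    let range = max - min;
    for c in candidates.iter_mut() {
        c.score = if range < f64::EPSILON {
            1.0 // all equal, including all zero
        } else {
            (c.score - min) / range
        };
    }
}
```

Three candidates that all score 0.0 come out as three candidates at 1.0, with no division by the zero range. Scores of 2.0, 4.0, and 6.0 come out as 0.0, 0.5, and 1.0.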
Every layer degrades gracefully toward the empty state. The empty state is not handled by special code. It is handled by the mathematical properties of the formulas and the structural guarantees of the pipeline.

---

The 6-system stack treats cold start as an engineering problem because it is -- when your ranking depends on data scattered across six systems, the absence of data in any one of them breaks the pipeline. tidalDB treats cold start as a data state because that is what it is. A new user is a user with empty bitmaps. A new item is an item with zero signal counts. The ranking pipeline reads those values, produces scores, applies diversity, injects exploration, and returns results. No exception path. No fallback service. No application code.

The cold-start problem is solved when it stops being a problem.

---

*The ranking profiles are at [tidal/src/ranking/builtins.rs](https://github.com/orchard9/tidalDB/blob/main/tidal/src/ranking/builtins.rs). The profile executor and scoring formulas are at [tidal/src/ranking/executor.rs](https://github.com/orchard9/tidalDB/blob/main/tidal/src/ranking/executor.rs). The preference vector is at [tidal/src/entities/preference.rs](https://github.com/orchard9/tidalDB/blob/main/tidal/src/entities/preference.rs). The exploration injection is at [tidal/src/query/executor.rs](https://github.com/orchard9/tidalDB/blob/main/tidal/src/query/executor.rs). Follow the build on [GitHub](https://github.com/orchard9/tidalDB).*