jordan 39ada28c6e feat: complete Milestones 2–4 — RETRIEVE query, vector index, ranking profiles, diversity, entity system, sessions

M2: RETRIEVE query pipeline with 5-stage execution (candidate → filter → score → diversify → limit),
    usearch HNSW vector index, bitmap/range/universe filters, ranking profiles with signal scoring,
    MMR diversity enforcement, and m2_uat integration tests.

M3: Entity system with typed metadata, relationship graph (follows/blocks/interactions),
    creator entities, session tracking, and m3_uat integration tests.

M4: Advanced ranking with builtin functions (freshness, trending, controversy, wilson),
    ranking executor with explain mode, query executor integration, benchmarks for
    query/ranking/vector/filters/diversity, and m4_uat integration tests.

Includes: 9 new blog posts, marketing site updates, updated roadmap, and updated vision doc.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-02-21 16:24:48 -07:00

16 KiB


Vision

The Problem

Every platform that serves personalized content — a media library, a social feed, a marketplace, a content discovery surface — eventually builds the same distributed system from scratch. Elasticsearch for retrieval. Redis for hot signals. Kafka for event ingestion. A feature store for user profiles. A vector database for semantic search. A ranking service that tries to stitch all of the above together into a single ordered list.

This is not an ecosystem. It is scar tissue. The seams between these systems are where correctness dies — stale signals, inconsistent ranking, cache invalidation bugs, and an operational burden that consumes entire engineering teams.

The root cause is that existing databases were not built with this problem in mind. They treat ranking as an afterthought — a sort clause, a float field, a bolt-on scoring function. They have no concept of a signal that evolves over time, no concept of a user context that shapes relevance, no concept of diversity as a query constraint, no concept of the feedback loop between what a user sees and what the system learns.

Worse: every team building one of these platforms discovers that their users want the same things. Search with typo tolerance and boolean operators. Filter by duration, date, language, format, quality, creator size, and a dozen other dimensions simultaneously. Sort by trending, hot, rising, controversial, top-this-week, hidden gems, shuffle. Personalize the result of all of the above. Apply diversity constraints. Close the feedback loop.

These are not exotic requirements. They are table stakes for any serious content platform. And today, every team builds them from scratch, on top of systems not designed for the task.

The Thesis

Ranking is not a feature. It is a primitive.

A database purpose-built for personalized content delivery should model the world the way this problem actually works:

Content has metadata, embeddings, and signals. Signals are not fields — they are typed, timestamped streams with native decay, velocity, and windowed aggregation semantics.
Users have preferences, histories, and relationships. These are not rows — they are living profiles that update continuously as events arrive.
Personalization has scopes — global, community, and session/agent — and every scope is user-controlled and revocable.
Agents mediate most interactions. They retrieve context, elicit preferences, and publish structured feedback (reward, tool usage, confidence) as first-class signals. The system must let them read and write memory instantly.
A query is not "give me items matching these filters sorted by this field." It is "given this user, this context, and this surface — what should they see, in what order, subject to these constraints?"
Filters, sort modes, and diversity rules are first-class query citizens — not application logic bolted on top.
Engagement is not application logic that happens to write back into the database. It is a first-class write path that closes the feedback loop natively.

This is the database that models that world.

What It Is

A single-node-first, embeddable Rust database designed specifically for the personalized content ranking problem. For this one domain, it replaces the six-system stack described above (search engine, signal cache, event stream, feature store, vector database, ranking service) with a single process, a single query interface, and a single operational model.

It is strongly opinionated. It does not try to be a general-purpose database. It does not try to solve problems outside its domain. Every design decision is made in service of one question: given a user and a context, what content should they see, and in what order?

First-Class Primitives

Entities are the nodes of the system — Items (content), Users, and Creators. Every entity has metadata, a vector embedding slot, and an attached signal ledger.

Signals are typed, timestamped event streams. The database natively understands signal semantics: velocity (rate of change), decay (exponential or linear, configurable per signal type), and windowed aggregation (last hour, last day, last 7 days, all time). You do not pre-compute trending_score_7d and store it in a field. You declare a view signal type and query its 7-day windowed velocity at ranking time.
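As a rough illustration of those semantics, the sketch below computes an exponentially decayed sum, a windowed sum, and a windowed velocity over a list of raw events. The `Event` type and the function names are hypothetical and are not tidalDB's API; a real signal ledger would presumably maintain these aggregates incrementally rather than rescanning events.

```rust
/// One timestamped signal event (hypothetical shape, not tidalDB's API).
#[derive(Debug, Clone, Copy)]
pub struct Event {
    /// Event time in seconds since an arbitrary epoch.
    pub at_secs: f64,
    /// Contribution of the event, e.g. 1.0 per view.
    pub weight: f64,
}

/// Exponentially decayed sum: an event counts for half as much
/// after every `half_life_secs` of age.
pub fn decayed_sum(events: &[Event], now_secs: f64, half_life_secs: f64) -> f64 {
    events
        .iter()
        .filter(|e| e.at_secs <= now_secs)
        .map(|e| e.weight * 0.5_f64.powf((now_secs - e.at_secs) / half_life_secs))
        .sum()
}

/// Total weight of events inside the trailing window (now - window, now].
pub fn windowed_sum(events: &[Event], now_secs: f64, window_secs: f64) -> f64 {
    events
        .iter()
        .filter(|e| e.at_secs > now_secs - window_secs && e.at_secs <= now_secs)
        .map(|e| e.weight)
        .sum()
}

/// Velocity: windowed weight per second over the trailing window.
pub fn velocity(events: &[Event], now_secs: f64, window_secs: f64) -> f64 {
    windowed_sum(events, now_secs, window_secs) / window_secs
}
```

With a 7-day window, `velocity` over view events plays the role that a stored trending_score_7d field would otherwise play, but it is derived at read time from the same events that every other window uses.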

Users have preferences, histories, relationships, and attributes. Attributes include demographics, locale, interests, and behavioral segments. These attributes are queryable for cohort membership and enable cohort-scoped signal aggregation. Some attributes are application-set (locale, age); others are database-computed from engagement patterns (interest affinity, engagement level, format preferences).

Relationships are first-class edges between entities — follows, blocks, interactions, similarity. They are weighted, directional, and traversable in queries.

Ranking Profiles are named, versioned scoring functions declared in schema. They reference signals, relationship weights, recency curves, and diversity rules. A profile is not code deployed separately — it lives in the database, is versioned alongside your data, and can be swapped at query time by name.
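One way to picture a profile is as a named, versioned set of signal weights held in a registry that queries look up by name. The sketch below assumes a plain weighted sum; `RankingProfile`, `ProfileRegistry`, and the signal names used with them are hypothetical, and the real schema language is not shown in this document.

```rust
use std::collections::HashMap;

/// A named, versioned scoring function (hypothetical shape).
pub struct RankingProfile {
    pub name: String,
    pub version: u32,
    /// Weight per signal name; signals missing at query time contribute 0.
    pub weights: HashMap<String, f64>,
}

impl RankingProfile {
    /// Weighted sum of the signal values this profile references.
    pub fn score(&self, signals: &HashMap<String, f64>) -> f64 {
        self.weights
            .iter()
            .map(|(name, w)| w * signals.get(name).copied().unwrap_or(0.0))
            .sum()
    }
}

/// Profiles stored next to the data and selected by name at query time.
#[derive(Default)]
pub struct ProfileRegistry {
    by_name: HashMap<String, RankingProfile>,
}

impl ProfileRegistry {
    /// Store a profile, keeping the highest version seen for each name.
    pub fn register(&mut self, profile: RankingProfile) {
        let newer = self
            .by_name
            .get(&profile.name)
            .map_or(true, |old| profile.version > old.version);
        if newer {
            self.by_name.insert(profile.name.clone(), profile);
        }
    }

    /// Look up the current version of a profile by name.
    pub fn get(&self, name: &str) -> Option<&RankingProfile> {
        self.by_name.get(name)
    }
}
```

Swapping profiles at query time then amounts to passing a different name to `get`; nothing in the application needs to be redeployed.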

Cohorts are named predicates over user attributes — demographic, behavioral, and interest-based segments. A cohort is not a static list of users — it is a live query over user state. "US users aged 18-24 who engage with jazz content" is a cohort. The database maintains per-cohort signal aggregation so that trending, rising, and quality signals can be scoped to any cohort at query time. This enables the three-layer trending model: global trending, cohort-scoped trending, and search within cohort-scoped trending.
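The "live query" idea can be sketched as a predicate evaluated against current user attributes rather than a stored membership list. The `UserAttrs` and `Cohort` types below are hypothetical and cover only the three attribute kinds from the jazz example.

```rust
/// Attributes a cohort predicate can test (hypothetical subset).
#[derive(Debug, Clone)]
pub struct UserAttrs {
    pub locale: String,
    pub age: u8,
    pub interests: Vec<String>,
}

/// A cohort is a predicate over user state, not a list of user ids.
pub enum Cohort {
    Locale(String),
    AgeRange { min: u8, max: u8 },
    Interest(String),
    /// Conjunction: the user must satisfy every part.
    All(Vec<Cohort>),
}

impl Cohort {
    /// Evaluate membership against the user's current attributes.
    pub fn contains(&self, user: &UserAttrs) -> bool {
        match self {
            Cohort::Locale(l) => user.locale == *l,
            Cohort::AgeRange { min, max } => (*min..=*max).contains(&user.age),
            Cohort::Interest(i) => user.interests.iter().any(|x| x == i),
            Cohort::All(parts) => parts.iter().all(|c| c.contains(user)),
        }
    }
}
```

Because membership is recomputed from attributes, a user whose age or interests change moves in or out of the cohort without any list being rewritten.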

Personalization Layers are composable scopes over the same signal model: a user-owned global profile, optional community overlays, and short-lived session/agent context. Ranking can blend them, but ownership stays explicit per layer.

Sessions / Agent Context capture in-flight conversations and tool use. They bind a user, an agent, and a session identifier to short-lived signals (preference hints, rewards, critiques) with aggressive decay. Sessions can be forked, merged, and policy-limited so an agent only sees what it is allowed to remember. Users can revoke agent scope and remove agent-contributed signals from specific personalization layers.

The Query is a single operation that encapsulates candidate retrieval, filtering, ranking, and diversity enforcement:

RETRIEVE items
FOR USER @user_id
FOR SESSION @session_id
CONTEXT feed
USING PROFILE for_you
FILTER unseen, unblocked, format:video, duration:short
DIVERSITY max_per_creator:2, format_mix:true
LIMIT 50

This is what 6 systems currently produce. It is one query here.

Cohort scoping and query composition extend this further. Trending scoped to a cohort:

RETRIEVE items
USING PROFILE trending
COHORT locale:US, age:18-24, interest:jazz
WINDOW 24h
DIVERSITY max_per_creator:1
LIMIT 25

Search within cohort-scoped trending:

SEARCH items
QUERY "piano tutorial"
FOR SESSION @session_id
WITHIN TRENDING
COHORT locale:US, age:18-24, interest:jazz
WINDOW 24h
LIMIT 20

Three queries, three layers of the same question: what's happening globally, what's happening for people like this, and can I find something specific within that.

The Full Query Surface

tidalDB is designed to handle every retrieval and ranking pattern a content platform needs. This is the complete surface the database covers natively:

Retrieval modes:

Full-text keyword search with BM25 relevance scoring
Exact phrase match, boolean operators (AND/OR/NOT), field-scoped search
Semantic search — query by meaning, not just keywords
Vector similarity search — ANN over item and creator embeddings
Visual similarity search — find items near a reference image embedding
Hybrid search — text relevance + semantic similarity, merged score
User history search — find something the user previously engaged with
Collaborative filtering — "users who engaged with X also engaged with Y"
Social graph traversal — content from or engaged by a user's follows

Sort modes (all native, no application implementation required):

Relevance (text + semantic match)
Personalized (user preference match)
New / Old (chronological)
Hot (score with age decay — Reddit model)
Trending (pure velocity)
Rising (velocity relative to creator/category baseline, age-boosted)
Top: All Time / This Year / This Month / This Week / Today / This Hour
Controversial (maximizes product of positive and negative signals)
Hidden Gems (high quality, low reach)
Most Viewed / Most Liked / Most Commented / Most Shared
Shortest / Longest (by duration)
Alphabetical A-Z / Z-A
Shuffle (random, quality-weighted)
Live Viewer Count (for live surfaces)
Date Saved (for personal library)
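Two of these modes reduce to short formulas. The sketch below shows the lower bound of the Wilson score interval, a standard way to rank by like ratio without over-rewarding items that have very few votes, and a controversy score built from the product of positive and negative counts. The function names are hypothetical, and using the geometric mean as the controversy formula is an assumption made for illustration, not the database's documented definition.

```rust
/// Lower bound of the Wilson score interval for `positive` out of `total`
/// votes. `z` is the normal quantile, e.g. 1.96 for ~95% confidence.
pub fn wilson_lower_bound(positive: u64, total: u64, z: f64) -> f64 {
    if total == 0 {
        return 0.0;
    }
    let n = total as f64;
    let p = positive as f64 / n;
    let z2 = z * z;
    let centre = p + z2 / (2.0 * n);
    let margin = z * ((p * (1.0 - p) + z2 / (4.0 * n)) / n).sqrt();
    (centre - margin) / (1.0 + z2 / n)
}

/// Illustrative controversy score: geometric mean of positive and negative
/// counts. It is zero when either side is absent and, for a fixed total,
/// largest when the two sides are balanced.
pub fn controversy(positive: u64, negative: u64) -> f64 {
    ((positive as f64) * (negative as f64)).sqrt()
}
```

Under the Wilson bound, an item with 10 likes out of 10 outranks an item with 1 like out of 1, even though both have a 100% like ratio, because the larger sample supports a tighter interval.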

Filter dimensions (all composable simultaneously):

Content type / format: video, short, live, VOD, podcast, article, image, gallery, audio
Duration: range or presets (short / medium / long)
Date range: presets or custom (last hour, today, this week, custom range)
Category, tag, hashtag, flair (multi-select, OR logic within dimension)
Language, subtitle language, dubbed language
Technical quality: SD / HD / 4K / HDR / Dolby / spatial audio
Accessibility: subtitles available, audio description, sign language
Content rating / maturity level
Safe search toggle
Status: published, live, scheduled, archived
Availability: free, premium, subscriber-only, downloadable, leaving soon
Creator: specific creator, exclude creator, verified only, follower count range, new to user, followed by user
Engagement thresholds: minimum views, likes, like ratio, comments, shares, completion rate
Community signals: flair, minimum score, award/gilded, post type, original only
User state: unseen, in progress, saved, liked, downloaded, in collection
Geography: content region, creator region, near location, trending in region

Discovery surfaces (all driven by the same underlying query engine):

For You personalized feed
Following / subscription feed
Trending (global, category-scoped, cohort-scoped, social-graph-scoped, region-scoped)
Cohort-scoped discovery — "trending for people like you"
Rising / breakout content
Browse by category with any sort mode
Related / up next recommendations
Hidden gems and underrated content
Live and scheduled content
Mood and aesthetic-filtered browse
Visual similarity browse (Pinterest model)
Creator discovery ("creators like X")
Notification prioritization
Search suggestions and autocomplete
Saved searches as persistent feeds

Every one of these surfaces is driven by the same underlying query primitives. The application does not implement ranking logic — it specifies profiles, filters, and context.

The Feedback Loop

When a user engages with content — directly or via an agent — that event is written to the database as a signal. The agent can attach structured metadata (reward, confidence, tool invocation) in the same write. The database updates the item's signal ledger, the user's implicit preference profile, the relationship weight, and the session-scoped memory — automatically, as part of the write transaction. The next ranking or grounding query reflects this immediately. There is no Kafka consumer to lag, no feature store sync to schedule, no cache to invalidate.
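A minimal in-memory sketch of that write path follows. It assumes a single `write` call that updates the item ledger, the user-to-creator affinity, and the session memory together; the types, field names, and affinity deltas are all hypothetical, and the `&mut self` call merely stands in for a real write transaction.

```rust
use std::collections::HashMap;

/// Typed feedback intent (hypothetical subset).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Intent {
    Like,
    Completion,
    SkipForNow,
    NotForMe,
}

impl Intent {
    /// Signed contribution to user-to-creator affinity (illustrative values).
    pub fn affinity_delta(self) -> f64 {
        match self {
            Intent::Like => 1.0,
            Intent::Completion => 0.5,
            Intent::SkipForNow => -0.1,
            Intent::NotForMe => -1.0,
        }
    }
}

/// One engagement event as written by the application or an agent.
pub struct Engagement {
    pub user: u64,
    pub item: u64,
    pub creator: u64,
    pub session: u64,
    pub intent: Intent,
}

#[derive(Default)]
pub struct Store {
    /// Per-item count of each intent.
    pub item_ledger: HashMap<(u64, Intent), u64>,
    /// (user, creator) relationship weight.
    pub affinity: HashMap<(u64, u64), f64>,
    /// Session-scoped memory of what happened in this session.
    pub session_memory: HashMap<u64, Vec<(u64, Intent)>>,
}

impl Store {
    /// Apply one event to every derived structure in a single call,
    /// so the next read observes all of its effects.
    pub fn write(&mut self, e: &Engagement) {
        *self.item_ledger.entry((e.item, e.intent)).or_insert(0) += 1;
        *self.affinity.entry((e.user, e.creator)).or_insert(0.0) += e.intent.affinity_delta();
        self.session_memory.entry(e.session).or_default().push((e.item, e.intent));
    }
}
```

The point of the design is that there is no second system to catch up: a ranking read issued after `write` returns sees the updated ledger, affinity, and session state.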

Negative signals are equal citizens. A skip, a hide, a block, a "not interested," a downvote — these update the system with the same immediacy and precision as a like or a completion. Feedback intent is typed (skip_for_now, not_for_me, low_quality, hide/mute/block) and scope-aware (local, community, session/agent), so ranking updates preserve meaning rather than collapsing all negatives into one bucket.

What It Is Not

It is not a general-purpose document store. It is not a replacement for PostgreSQL for your transactional data. It is not trying to win the NewSQL wars or build a distributed OLAP engine.

It is not schema-free. Strong opinions about data shape enable strong guarantees about ranking correctness.

It is not trying to generate embeddings. It accepts vectors — you bring your model, you bring your embeddings, you write them in. The database owns retrieval and ranking over those vectors, not generation.

It is not limited to a single node. It is embeddable first: it runs in your process, with no separate service to deploy or operate. But it is designed for scale from day one. Key encoding, storage isolation, and signal aggregation are all partitioning-ready. The single-node deployment is the first target, not the ceiling. When you outgrow one node, the architecture supports horizontal scaling without a rewrite.

It is not trying to solve moderation, payments, authentication, or content delivery. It solves one problem: given a user and a context, what content should they see, and in what order.

Design Principles

Temporal decay is a type, not a formula you write. Signal half-lives are declared in schema. The database applies them at query time.

Negative signals are equal citizens. A skip, a hide, a block, a mute, a downvote — these are not the absence of positive engagement. They are data. They belong in the ranking function with the same weight and precision as a like.

Feedback intent is explicit. "Not for me," "low quality," and "skip for now" are different semantics, with different ranking effects and retention policies.

All sort modes are native. Trending, hot, rising, controversial, hidden gems, shuffle — these are built-in sort modes, not formulas the application implements and passes in. The application names a sort mode. The database executes it correctly.

All filters are composable. Any combination of filter dimensions produces a valid, efficiently-executed query. There is no special-casing for "common" filter combinations. Faceted queries are first-class.

Diversity is a query constraint, not application logic. "No more than 2 items per creator" does not belong in your API layer. It belongs in the query.
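To make the constraint concrete, here is a minimal sketch of a per-creator cap applied as a greedy pass over scored candidates. This is a deliberately simple technique chosen for illustration, not the engine's actual diversity algorithm, and the `Ranked` type and `diversify` function are hypothetical names.

```rust
use std::collections::HashMap;

/// A scored candidate (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
pub struct Ranked {
    pub item: u64,
    pub creator: u64,
    pub score: f64,
}

/// Walk candidates in descending score order, keeping an item only while
/// its creator is still under `max_per_creator`, and stop at `limit`.
pub fn diversify(mut candidates: Vec<Ranked>, max_per_creator: usize, limit: usize) -> Vec<Ranked> {
    candidates.sort_by(|a, b| b.score.total_cmp(&a.score));
    let mut per_creator: HashMap<u64, usize> = HashMap::new();
    let mut out = Vec::with_capacity(limit.min(candidates.len()));
    for c in candidates {
        if out.len() == limit {
            break;
        }
        let seen = per_creator.entry(c.creator).or_insert(0);
        if *seen < max_per_creator {
            *seen += 1;
            out.push(c);
        }
    }
    out
}
```

Running the cap inside the query means the engine can keep pulling candidates until the limit is filled, whereas an API-layer filter applied after LIMIT can return fewer results than requested.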

The write path and the read path are one system. Engagement events and ranking queries share a storage model and a signal ledger. There is no ETL between them.

Cold start is handled by the database. New content with no signals gets an exploration budget. New users with no history get a sensible default experience. The application does not manage this.

Cohorts are live queries, not static lists. A cohort is a predicate over user attributes — demographics, interests, behavioral segments. Users flow in and out of cohorts as their attributes change. Signal aggregation runs per-cohort so trending and quality signals reflect what's happening within any audience segment.

Agents own managed contexts. Sessions scope short-lived memory, rewards, and tool usage. Agents can only read/write within their sessions, and policy guards live in schema, not ad-hoc middleware.

Personalization is user-owned and revocable. Users can opt into community overlays, scope agent access, and remove scoped contributions from future ranking.

Correctness over cleverness. Ranking is already approximate by nature. The database does not need to be more clever than the signals it has. It needs to be fast, consistent, and operationally simple.

Who This Is For

Engineering teams building any surface where content is ranked for a user — media libraries, social feeds, content discovery, search — who are currently operating a multi-system stack and paying the consistency, latency, and operational cost of the seams between those systems.

The target developer has domain data that fits the entity/signal/relationship model, has immediate use cases that need this in production, and values a single system with sharp opinions over a flexible system with unlimited configuration.

The target scale is platforms serving millions of users across diverse audiences — where "what's trending" means different things to different cohorts, and the ability to slice engagement signals by audience segment is not a nice-to-have but the core product question.

The Name

tidalDB — the tide that surfaces the right content for the right person at the right time. Rising signals, ebbing decay, a natural rhythm of discovery.

(The idea matters more than the label.)

This is a focused tool for a focused problem. It will do one thing and do it correctly.
