---
title: "Why we're building tidalDB"
date: "2026-02-20"
author: "Jordan Washburn"
description: "tidalDB is a single-process Rust database for personalized content ranking. Here is what it does and how it works."
tags: ["vision", "architecture"]
---
tidalDB is a database that answers one question: given this user, right now, what should they see?
Agents now sit between users and many of the surfaces they browse, so session memory matters as well. But the core focus is personalized content ranking. tidalDB is not trying to out-feature every search platform. It is a database where signals, decay, negative feedback, and diversity are schema-level primitives, and where ranking updates immediately after user actions.
I wrote separately about [why every content platform ends up operating six systems](/blog/every-platform-builds-the-same-6-systems) to answer that question. This post is about what we are building instead.
## The primitives
tidalDB has five core concepts. Everything else follows from them.
**Entities** are Items, Users, and Creators. Each carries metadata, an embedding slot, and a signal ledger. You define them in schema with typed fields — text fields are full-text indexed, keyword fields are filterable, embeddings are ANN-indexed. The database owns the indexes.
**Signals** are typed, timestamped event streams with decay and velocity built in. You declare a signal type once:
```rust
use std::time::Duration;
use tidaldb::schema::{DecaySpec, EntityKind, SchemaBuilder, Window};

let mut builder = SchemaBuilder::new();
// Declare a "view" signal on items with a 7-day exponential half-life.
let _ = builder
    .signal(
        "view",
        EntityKind::Item,
        DecaySpec::Exponential { half_life: Duration::from_secs(7 * 24 * 3600) },
    )
    // Windowed counts the database maintains for this signal.
    .windows(&[Window::OneHour, Window::TwentyFourHours, Window::SevenDays, Window::AllTime])
    .velocity(true)
    .add();
let schema = builder.build()?;
```
That declaration tells the database everything it needs. When a view event arrives, the database maintains windowed counts, computes velocity, and applies exponential decay — all at write time, all O(1). You never compute `trending_score = views / (age_hours + 2)^1.8` in application code. You never update a stale float field on a cron schedule. The database does this natively, and it does it correctly.
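To make the "O(1) at write time" claim concrete, here is a minimal sketch of an exponentially decayed counter. It is not tidalDB's implementation; the `DecayedCounter` type and its methods are hypothetical names, and it uses only the standard library. The idea is that each write decays the stored value to the event's timestamp and adds the new weight, so no background job ever has to rescan old events.

```rust
/// Hypothetical illustration, not tidalDB's API: an exponentially
/// decayed counter that is updated in O(1) per event.
#[derive(Debug, Clone, Copy)]
struct DecayedCounter {
    half_life_secs: f64,
    value: f64,
    last_ts_secs: f64,
}

impl DecayedCounter {
    fn new(half_life_secs: f64) -> Self {
        Self { half_life_secs, value: 0.0, last_ts_secs: 0.0 }
    }

    /// Value as of `now_secs`, decayed from the most recent write.
    /// Timestamps earlier than the last write are read with no decay.
    fn value_at(&self, now_secs: f64) -> f64 {
        let elapsed = (now_secs - self.last_ts_secs).max(0.0);
        self.value * 0.5f64.powf(elapsed / self.half_life_secs)
    }

    /// Record an event: decay the stored value to `now_secs`, then add
    /// the event weight. Late events are folded in at the last timestamp.
    fn record(&mut self, weight: f64, now_secs: f64) {
        self.value = self.value_at(now_secs) + weight;
        self.last_ts_secs = self.last_ts_secs.max(now_secs);
    }
}

fn main() {
    let week = 7.0 * 24.0 * 3600.0;
    let mut views = DecayedCounter::new(week);
    views.record(1.0, 0.0);
    // One half-life later the first view counts for 0.5.
    views.record(1.0, week);
    println!("decayed views: {:.2}", views.value_at(week));
}
```

Reads are also O(1): the stored value is decayed to the query time on demand, which is why a stale float field never needs a cron job in this model.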
Negative signals — skips, hides, blocks — are the same type. A skip is not the absence of a like. It is data with its own decay rate and its own weight in the scoring function.
**Ranking Profiles** are named, versioned scoring functions declared in schema. They reference signals, relationship weights, recency curves, and diversity rules. You swap profiles at query time by name — no redeploy, no recompile. This is how you A/B test ranking: two profiles, one query parameter.
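As a rough illustration of swapping profiles by name, the sketch below models a profile as a list of per-signal weights and picks one at query time from a registry. Everything here is an assumption for illustration: the `Profile` shape, the `rank` function, the weights, and the `for_you_v2` name are hypothetical, and real tidalDB profiles also cover relationship weights, recency curves, and diversity rules.

```rust
use std::collections::HashMap;

/// Per-item signal values keyed by signal name (hypothetical shape).
type Signals = HashMap<&'static str, f64>;

/// A profile reduced to per-signal weights, for illustration only.
struct Profile {
    weights: Vec<(&'static str, f64)>,
}

impl Profile {
    /// Weighted sum of the signals this profile references.
    fn score(&self, signals: &Signals) -> f64 {
        self.weights
            .iter()
            .map(|(name, w)| w * signals.get(name).copied().unwrap_or(0.0))
            .sum()
    }
}

/// Rank item ids with the profile selected by name; None if it is unknown.
fn rank(
    profiles: &HashMap<&'static str, Profile>,
    profile_name: &str,
    items: &[(u64, Signals)],
) -> Option<Vec<u64>> {
    let profile = profiles.get(profile_name)?;
    let mut scored: Vec<(u64, f64)> =
        items.iter().map(|(id, s)| (*id, profile.score(s))).collect();
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    Some(scored.into_iter().map(|(id, _)| id).collect())
}

/// Two made-up profiles: one favors likes, the other favors views.
fn demo_profiles() -> HashMap<&'static str, Profile> {
    HashMap::from([
        ("for_you", Profile { weights: vec![("like", 1.0), ("view", 0.01)] }),
        ("for_you_v2", Profile { weights: vec![("like", 0.1), ("view", 1.0)] }),
    ])
}

fn demo_items() -> Vec<(u64, Signals)> {
    vec![
        (1, HashMap::from([("like", 10.0), ("view", 20.0)])),
        (2, HashMap::from([("like", 1.0), ("view", 100.0)])),
    ]
}

fn main() {
    let (profiles, items) = (demo_profiles(), demo_items());
    // Same data, same code path; only the profile name differs.
    println!("{:?}", rank(&profiles, "for_you", &items));
    println!("{:?}", rank(&profiles, "for_you_v2", &items));
}
```

Because the profile is looked up by name at query time, an A/B test in this sketch is a matter of passing a different string per cohort.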
**Sessions** capture agent context. A session binds a user, an agent identity, and a short-lived memory lane. Agents append structured signals (preference hints, reward scores, tool metadata) that decay aggressively. Policies live in schema: what an agent can read, how often it may write, and how long data persists.
**The query** brings it together. Candidate retrieval, filtering, personalized ranking, and diversity enforcement in a single operation:
```
RETRIEVE items
FOR USER @user_id
FOR SESSION @session_id
CONTEXT feed
USING PROFILE for_you
FILTER unseen, unblocked, format:video, duration:short
DIVERSITY max_per_creator:2, format_mix:true
LIMIT 50
```
Today queries are built via the Rust builder API (`Retrieve::builder()`); a parsed text query language is planned for a future milestone.
One call. No network hops between subsystems. No merging results from five data sources. The database handles retrieval strategy (ANN, BM25, graph walk, full scan), applies hard filters, scores candidates against live signal state, enforces diversity constraints, and returns a ranked list. The agent gets the list along with a session snapshot (top signals, reward velocity, last tool it used) so it can explain its answer.
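The diversity stage is the easiest one to show in isolation. The sketch below is one plausible way to enforce a `max_per_creator` cap on an already scored candidate list; the `Candidate` struct and `enforce_max_per_creator` function are hypothetical names, not part of tidalDB's API.

```rust
use std::collections::HashMap;

/// A scored candidate (hypothetical shape for illustration).
#[derive(Debug, Clone, PartialEq)]
struct Candidate {
    item_id: u64,
    creator_id: u64,
    score: f64,
}

/// Walk candidates from highest to lowest score and keep at most
/// `max_per_creator` items from any one creator, up to `limit` results.
fn enforce_max_per_creator(
    mut ranked: Vec<Candidate>,
    max_per_creator: usize,
    limit: usize,
) -> Vec<Candidate> {
    ranked.sort_by(|a, b| b.score.total_cmp(&a.score));
    let mut kept_per_creator: HashMap<u64, usize> = HashMap::new();
    let mut out = Vec::with_capacity(limit.min(ranked.len()));
    for candidate in ranked {
        if out.len() == limit {
            break;
        }
        let kept = kept_per_creator.entry(candidate.creator_id).or_insert(0);
        if *kept < max_per_creator {
            *kept += 1;
            out.push(candidate);
        }
    }
    out
}

fn main() {
    let candidates = vec![
        Candidate { item_id: 1, creator_id: 10, score: 0.9 },
        Candidate { item_id: 2, creator_id: 10, score: 0.8 },
        Candidate { item_id: 3, creator_id: 10, score: 0.7 },
        Candidate { item_id: 4, creator_id: 20, score: 0.6 },
    ];
    // Item 3 is dropped: creator 10 already has two slots.
    for c in enforce_max_per_creator(candidates, 2, 50) {
        println!("item {} by creator {}", c.item_id, c.creator_id);
    }
}
```

Running the cap after scoring, rather than before, means a creator's best items are the ones that survive.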
## The feedback loop
This is the part that makes the architecture honest.
When a user likes an item, the database updates the item's signal ledger, the user's preference vector, and the user-to-creator relationship weight in the same call. The next ranking query — even 100ms later — reflects the updated state.
```rust
// Item-level signal
db.signal("like", EntityId::new(42), 1.0, Timestamp::now())?;
// With user context (updates interaction weight and preference vector)
db.signal_with_context(
    "like", EntityId::new(42), 1.0, Timestamp::now(),
    Some(user_id), Some(creator_id),
)?;
```
There is no event bus between the engagement and the ranking update. No consumer lag. No cache to invalidate. The write path and the read path are one system. A user who skips three items in a row sees the fourth query adjust — the skips add to the user's exclusion bitmap, and the next retrieve filters them out. Not after a batch pipeline runs, not after a feature store syncs. Now.
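A toy model of that loop, assuming nothing about tidalDB's internals: the `TinyStore` type below is hypothetical and uses a `HashSet` where the real system uses an exclusion bitmap. The point it illustrates is that when writes and reads share one in-process state, the very next retrieve sees the skip.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical in-process store: one struct owns both the write path
/// (like, skip) and the read path (retrieve).
#[derive(Default)]
struct TinyStore {
    item_scores: HashMap<u64, f64>,
    /// user id -> item ids excluded for that user
    excluded: HashMap<u64, HashSet<u64>>,
}

impl TinyStore {
    fn like(&mut self, item: u64) {
        *self.item_scores.entry(item).or_insert(0.0) += 1.0;
    }

    fn skip(&mut self, user: u64, item: u64) {
        self.excluded.entry(user).or_default().insert(item);
    }

    /// Rank items by score (ties broken by id), minus the user's exclusions.
    fn retrieve(&self, user: u64, limit: usize) -> Vec<u64> {
        let excluded = self.excluded.get(&user);
        let mut items: Vec<(u64, f64)> = self
            .item_scores
            .iter()
            .filter(|(id, _)| excluded.map_or(true, |set| !set.contains(*id)))
            .map(|(id, score)| (*id, *score))
            .collect();
        items.sort_by(|a, b| b.1.total_cmp(&a.1).then(a.0.cmp(&b.0)));
        items.into_iter().take(limit).map(|(id, _)| id).collect()
    }
}

fn main() {
    let mut store = TinyStore::default();
    for item in [1, 2, 2, 3] {
        store.like(item);
    }
    println!("before skip: {:?}", store.retrieve(7, 10));
    store.skip(7, 2);
    // No queue, no cache: the skip is visible on the next read.
    println!("after skip:  {:?}", store.retrieve(7, 10));
}
```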
## Where we are deliberately narrow
If your primary problem is operating a large, general search serving platform, systems like Vespa are excellent and mature.
Our wedge is narrower and opinionated:
- Optimize for the personalization loop, not broad search platform parity.
- Make negative feedback intent explicit and immediate: `skip_for_now` (soft), `not_for_me` (preference), `low_quality` (quality), `hide/mute/block` (hard excludes).
- Treat "next refresh reflects feedback" as a hard product promise, not a best effort.
- Keep the first deployment embeddable and in-process for low-latency iteration.
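The negative feedback taxonomy in the list above maps naturally onto an enum. This is a sketch of that mapping only; the `NegativeFeedback` and `FeedbackClass` types are hypothetical and do not claim to match tidalDB's schema types.

```rust
/// The negative feedback kinds named in the list (hypothetical enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum NegativeFeedback {
    SkipForNow,
    NotForMe,
    LowQuality,
    Hide,
    Mute,
    Block,
}

/// How each kind is meant to be interpreted by ranking.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum FeedbackClass {
    Soft,
    Preference,
    Quality,
    HardExclude,
}

impl NegativeFeedback {
    fn class(self) -> FeedbackClass {
        match self {
            NegativeFeedback::SkipForNow => FeedbackClass::Soft,
            NegativeFeedback::NotForMe => FeedbackClass::Preference,
            NegativeFeedback::LowQuality => FeedbackClass::Quality,
            NegativeFeedback::Hide | NegativeFeedback::Mute | NegativeFeedback::Block => {
                FeedbackClass::HardExclude
            }
        }
    }

    /// Hard excludes remove content outright instead of down-weighting it.
    fn is_hard_exclude(self) -> bool {
        self.class() == FeedbackClass::HardExclude
    }
}

fn main() {
    for fb in [NegativeFeedback::SkipForNow, NegativeFeedback::Block] {
        println!("{:?} -> {:?} (hard: {})", fb, fb.class(), fb.is_hard_exclude());
    }
}
```

Making the class explicit in the type keeps "I skipped this once" from being treated the same as "never show me this creator again".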
## Where the build stands
tidalDB is early. I want to be direct about what exists today and what does not.
**M1 — Signal engine.** Schema system with entity, signal, and profile definitions. Write-ahead log with segment rotation, checksummed records, BLAKE3 deduplication, and crash recovery. Storage engine backed by fjall with trait abstraction, key encoding, and batch writes. Signal ledger with forward-decay scoring, hot-path state, and warm-path persistence.
**M2 — Query and retrieval.** RETRIEVE query with a five-stage execution pipeline: candidate generation, filter evaluation, signal scoring, diversity enforcement, result assembly. Vector index (USearch HNSW), bitmap and range indexes, 15 built-in ranking profiles.
**M3 — Personalized ranking.** FOR USER context in queries. Relationship graph (follows, blocks). Interaction ledger with lazy decay. Preference vectors blended from positive engagement signals. Full feedback loop from signal write to ranking adjustment.
**M4 — Entity system and sessions.** User and Creator entities with metadata and signal ledgers. Agent sessions with identity binding, policy enforcement, and session-scoped signals. Negative signal classification (skip, hide, dislike, block) with hard-negative exclusion bitmaps. Cold-start fallback profiles.
All four milestones are complete with 661+ passing tests.
**Next (M5):** Hybrid search — RRF fusion across text (Tantivy BM25) and vector retrieval, the SEARCH executor, and a parsed query language.
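For readers unfamiliar with reciprocal rank fusion (RRF), here is a minimal sketch of the standard formula: each document scores the sum of `1 / (k + rank)` over every ranked list it appears in. The `rrf_fuse` function and the choice of `k = 60` (a commonly used default) are assumptions for illustration, not tidalDB's fusion layer.

```rust
use std::collections::HashMap;

/// Fuse several ranked lists of document ids with reciprocal rank fusion.
/// Ranks are 1-based; documents missing from a list contribute nothing.
fn rrf_fuse(rankings: &[Vec<u64>], k: f64) -> Vec<(u64, f64)> {
    let mut scores: HashMap<u64, f64> = HashMap::new();
    for ranking in rankings {
        for (idx, doc) in ranking.iter().enumerate() {
            let rank = (idx + 1) as f64;
            *scores.entry(*doc).or_insert(0.0) += 1.0 / (k + rank);
        }
    }
    let mut fused: Vec<(u64, f64)> = scores.into_iter().collect();
    // Highest fused score first; ties broken by document id.
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then(a.0.cmp(&b.0)));
    fused
}

fn main() {
    let bm25 = vec![1, 2, 3]; // text ranking, best first
    let ann = vec![3, 1, 4]; // vector ranking, best first
    for (doc, score) in rrf_fuse(&[bm25, ann], 60.0) {
        println!("doc {doc}: {score:.5}");
    }
}
```

RRF only looks at ranks, so BM25 scores and vector distances never need to be normalized onto a shared scale before fusing.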
The foundation is Rust, single-node, embeddable. The storage layer is designed for horizontal scaling later — key encoding and storage isolation are partition-ready — but single-node correctness comes first. This is how we differentiate from Vespa, Milvus, or any search-first system: tidalDB embeds inside your agent runtime, exposes a declarative query+session API, and guarantees every signal the agent writes is visible on the next read without a distributed hop.
The code is on [GitHub](https://github.com/orchard9/tidalDB). Every architectural decision gets documented.
## Why open source
The personalized content ranking problem is universal. Every content platform needs it. The solution should be a tool you embed in your process and point at your data — not a vendor you depend on for a query you could run locally.
MIT licensed. No asterisks.
---
*If you want the full diagnosis of why the 6-system stack exists and where correctness fails between the seams, read [Every content platform builds the same 6 systems from scratch](/blog/every-platform-builds-the-same-6-systems).*