tidaldb/site/content/blog/what-three-databases-taught-us.mdx
jordan 192c473f55 feat: complete Milestone 5 — full-text search, RRF fusion, and creator search
- M5p1: BM25 text indexing via Tantivy with background syncer (0.26ms @ 10K docs)
- M5p2: RRF fusion layer combining BM25 + ANN scores (46µs @ 1K candidates)
- M5p3: unified Search query API (8-stage pipeline, BM25 + vector + ranking)
- M5p4: creator text + vector indexing and creator search executor (< 20ms @ 200 creators)
- Refactor db/mod.rs into focused sub-modules (creators, items, sessions, signals, etc.)
- Decompose monolithic files into directory modules (query/executor, ranking/diversity, etc.)
- Split brute.rs → brute/mod.rs + brute/tests.rs; extract search executor helpers
- Add benches: fusion, search, session, text_index
- Add M5 UAT test suites (m5_uat, m5_search, m5p4_creator_search, text_index)
- Update blog posts, roadmap, content strategy, and M5 planning docs
- Add tmp/ and .claude/worktrees/ to .gitignore

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-21 23:53:16 -07:00

221 lines
15 KiB
Plaintext

---
title: "What three databases taught us before we wrote a line of code"
date: "2026-02-21"
author: "Jordan Washburn"
description: "We studied three purpose-built databases we had already shipped. Their convergent patterns became tidalDB's foundation. Their gaps became our roadmap."
tags: ["architecture", "rust", "lessons"]
---
Before tidalDB existed as code, it existed as a list of patterns we had already seen work -- and a shorter list of gaps none of our prior systems had closed.
We had built three databases before this one. A cognitive memory system that modeled decay and associative recall. A defensive logging engine that guaranteed durability before everything else. A knowledge graph that held contradictions and resolved them at read time. Each was purpose-built for a different domain. Each solved its domain well. None of them solved personalized content ranking.
But when we laid their architectures side by side, we found convergences that could not be coincidence. And we found gaps that defined exactly what tidalDB needed to be.
This post is about both.
## The convergences
When three independent databases arrive at the same solution, the solution is probably load-bearing. Here are the patterns that recurred across all three, and how they appear in tidalDB today.
### WAL-first durability
All three systems used a write-ahead log with explicit fsync control as the durability boundary. The storage engine -- whether an LSM-tree, a B-tree, or a memory-mapped log -- was an optimization layer built on top of a durable append-only record. Crash recovery in every case meant the same thing: replay the WAL from the last checkpoint.
The specifics differed. The memory system used CRC32C checksums with hardware acceleration. The logging engine used BLAKE3 and a quarantine journal that fsynced before the client received an acknowledgment. The knowledge graph used CRC32C for speed with BLAKE3 for content-addressed identity.
But the principle was identical. The WAL is the source of truth. Everything else is derived state.
tidalDB adopted this directly. Signal events are durably logged before any aggregation occurs. The signal ledger -- with its decay scores, windowed counts, and velocity computations -- is a materialized view over the WAL. If the aggregation layer crashes, it replays from the log. No signal is ever lost.
```rust
// From tidal/src/wal/mod.rs — the WAL module header
//! The WAL is the durability primitive for signal events. Every view, like,
//! skip, and completion is appended to the WAL before any aggregation occurs.
//! Signal aggregates, decay scores, and windowed counts are derived state
//! that can always be rebuilt from WAL replay.
```
### Group commit for amortized fsync
The logging engine taught us this one explicitly. Individual fsync per write is O(N) in syscall overhead. At thousands of signal events per second, that overhead dominates.
The solution: batch writes and amortize the fsync cost across the batch. The logging engine's default was 100 writes or 10 milliseconds, whichever came first. PostgreSQL uses the same technique with its commit delay.
tidalDB's WAL writer runs on a dedicated thread and forms batches the same way. Events arrive via a crossbeam channel, accumulate until the batch is full or the timeout expires, then write as a single atomic unit with one BLAKE3 checksum and one fsync.
```rust
// From tidal/src/wal/mod.rs
/// Default batch size: up to 100 events per batch.
const DEFAULT_BATCH_SIZE: usize = 100;
/// Default batch timeout: 10 milliseconds.
const DEFAULT_BATCH_TIMEOUT: Duration = Duration::from_millis(10);
```
One hundred signal events. One syscall. The same insight, carried forward.
### Cache-line aligned hot data
The memory system was obsessive about hardware-aware data layout. Its hot nodes were `#[repr(C, align(64))]` -- exactly one L1 cache line per node. Atomic fields for concurrent access. No false sharing between adjacent entries.
The reasoning: when your ranking query touches every candidate entity, and your write path ingests thousands of events per second, those two paths will contend on the same data. If two entities share a cache line, a write to entity A invalidates the cache for entity B's reader on another core. At the throughput of a ranking pipeline, false sharing is measurable.
tidalDB's `HotSignalState` is 64 bytes. One cache line. Compile-time assertions enforce it.
```rust
// From tidal/src/signals/hot.rs
#[repr(C, align(64))]
pub struct HotSignalState {
entity_id: u64,
last_update_ns: AtomicU64,
signal_type_id: u16,
flags: u16,
_pad0: [u8; 4],
decay_scores: [AtomicU64; 3],
_pad1: [u8; 16],
}
const _SIZE: () = assert!(std::mem::size_of::<HotSignalState>() == 64);
const _ALIGN: () = assert!(std::mem::align_of::<HotSignalState>() == 64);
```
The memory ordering is deliberate: `Acquire`/`Release` for the timestamp to establish happens-before between writers and readers, `AcqRel` on CAS success for the decay scores. No `SeqCst` anywhere on the hot path. These are the weakest orderings that maintain correctness.
### Subject-prefix key encoding
The knowledge graph organized its storage with a subject-prefix key scheme: `{subject}\x00{TAG}:{suffix}`. A prefix scan on `{subject}\x00` returned every assertion, vote, index, and materialized view for that subject. All data for one entity, co-located. Natural shard boundaries. No joins.
tidalDB uses the same layout:
```rust
// From tidal/src/storage/keys.rs
/// Layout: [entity_id: 8 bytes BE][0x00][tag: 1 byte][suffix bytes...]
pub fn encode_key(entity_id: EntityId, tag: Tag, suffix: &[u8]) -> Vec<u8> {
let mut key = Vec::with_capacity(8 + 1 + 1 + suffix.len());
key.extend_from_slice(&entity_id.to_be_bytes());
key.push(NUL);
key.push(tag.as_byte());
key.extend_from_slice(suffix);
key
}
```
Big-endian encoding means entity IDs sort lexicographically the same way they sort numerically. The `0x00` separator byte guarantees no tag byte collides with the entity ID prefix. Tags discriminate data categories: `Evt` for raw signal events, `Sig` for running decay scores, `Meta` for entity metadata, `Rel` for relationship edges, `Mv` for materialized views, `Idx` for secondary indexes.
A prefix scan on `entity_prefix(id)` returns everything the database knows about that entity. A scan on `entity_tag_prefix(id, Tag::Sig)` returns only its signal state. This is the same insight the knowledge graph proved: co-located data makes the common queries fast and the future sharding obvious.
### Per-entity-kind storage isolation
The logging engine isolated storage per tenant. Separate directories. Separate files. Separate quotas. The philosophy: a noisy neighbor should not degrade a quiet one.
tidalDB isolates per entity kind. A single fjall database opens at one path, and Items, Users, and Creators each get their own keyspace within it:
```
{base}/ # single fjall database
├── items/ # keyspace for item entities
├── users/ # keyspace for user entities
└── creators/ # keyspace for creator entities
```
They share one database instance, but compaction runs independently per keyspace. A burst of signal events for a viral item does not slow down user profile reads. The I/O pressure stays contained.
### Append-only core with mutable views
All three databases converged on the same separation. The source of truth is immutable -- append-only events, assertions, or log records. Mutability is confined to derived state that can be recomputed.
The memory system appended to a WAL and kept mutable activation levels as atomics. The logging engine appended to a quarantine journal and produced immutable Parquet segments. The knowledge graph appended immutable assertions and maintained materialized views in a background worker.
tidalDB follows the same boundary. Signal events are immutable facts: "user U liked item I at timestamp T." Signal aggregates -- decay scores, windowed counts, velocity -- are mutable derived state. The WAL holds the facts. The signal ledger holds the derived state. If the derived state is lost, the WAL rebuilds it.
## The gaps
The convergent patterns gave us a foundation. The gaps gave us a roadmap. These are the problems none of the three systems solved -- and the problems that define tidalDB's reason to exist.
### Temporal signals as schema-level primitives
The memory system had the most sophisticated decay functions of the three, validated against decades of psychology research. But even there, temporal behavior was *implemented*, not *declared*. You could not say "this signal has a 7-day half-life and I want its 24-hour velocity" in schema. You wrote code that computed it.
In tidalDB, signals are a schema type. You declare the decay rate, the windows, and whether velocity tracking is enabled. The database maintains the aggregates. The application writes events and reads scores.
```rust
// Schema declaration
let _ = builder.signal("view", EntityKind::Item,
DecaySpec::Exponential { half_life: Duration::from_secs(7 * 24 * 3600) })
.windows(&[Window::OneHour, Window::TwentyFourHours])
.velocity(true)
.add();
```
The application never computes `trending_score = views / (age + 2)^1.8`. The database does.
### Ranking as a database operation
The memory system ranked by spreading activation. The knowledge graph ranked by lens resolution. The logging engine did not rank at all. None of them modeled multi-signal, multi-factor ranking with user context as a query primitive.
Content ranking requires combining text relevance, vector similarity, temporal signals, user preferences, social graph weights, negative feedback, quality gates, and diversity constraints. No existing database combines all of these in a single operation.
tidalDB will. Named ranking profiles, declared in schema, referencing signal types and relationship weights. The application says `USING PROFILE trending`. The database executes the pipeline.
### The closed feedback loop
This is the gap that mattered most.
The knowledge graph came closest -- it accepted votes that affected future read-time resolution. But the feedback path was designed for knowledge claims, not engagement signals. In all three databases, "user engaged with content" and "content's ranking changes" were separate operations in separate systems or at separate times.
In tidalDB, the engagement write *is* the ranking update. A signal write atomically updates the item's signal ledger, the windowed aggregates, and the velocity computation. The next ranking query -- even 100 milliseconds later -- reflects the change. No event bus. No consumer lag. No cache to invalidate.
This is the architectural thesis. The write path and the read path share a process, a memory space, and a signal ledger. Not "eventually consistent" sharing via replication. The same data, immediately visible.
## What we did not take
Studying prior systems also means knowing what to leave behind.
The memory system's spreading activation engine was 98 kilobytes of lock-free work-stealing code. Extraordinary engineering for associative recall. Irrelevant to content ranking. We took the cache-line alignment. We left the graph traversal engine.
The logging engine's tiered storage moved data from NVMe to HDD to object storage based on age. Content ranking needs a different tiering model -- based on access pattern, not calendar time. A six-month-old video still getting steady views stays hot. Yesterday's content that nobody watched goes warm. We took the durability model. We left the tier migration policy.
The knowledge graph's lens system -- pluggable resolution strategies that could return different answers from the same data -- was its most distinctive feature. But lenses resolve truth from competing claims. Ranking resolves preference from competing signals. The domain semantics are different enough that the abstraction does not transfer. We took the key encoding and the background materializer. We left the lenses.
Knowing what to steal is half the work. Knowing what to leave is the other half.
## The pattern underneath
There is a progression in how databases relate to their data.
Stage 1 databases store and retrieve. You put data in, you get data out, unchanged. The database's job is correctness and durability. This is the relational era.
Stage 2 databases index and search. The shape of the question changed -- full-text search, graph traversal, nearest-neighbor lookup. Each new query shape produced a new database. This era fragmented the landscape. Every application now runs three to seven databases, and the seams between them are where correctness dies.
Stage 3 databases understand domain semantics. The database does not just store "a float." It stores "a signal with a decay rate and a velocity window." It does not plan a generic sort. It knows that `USING PROFILE trending` means signal velocity, not total count.
Our three prior systems were Stage 3 databases. Each understood its domain. The memory system understood forgetting. The logging engine understood durability. The knowledge graph understood conflict.
tidalDB is Stage 3 for the ranking domain. Signals, decay, velocity, diversity, and feedback are not application logic. They are database primitives.
The gaps we found point toward something further -- a system where the write path and read path are not separate operations the application stitches together, but a single loop the database manages. Where a ranking query is simultaneously a read, a write, and a state transition. We are not there yet. But the foundation is built to get there.
## The list
For anyone building a purpose-built database, here is what we took and where it came from:
| Pattern | Source | tidalDB implementation |
|---------|--------|----------------------|
| WAL-first durability | All three | Signal events logged before aggregation |
| Group commit | Logging engine | Batch writer, 100 events or 10ms |
| Cache-line aligned hot structs | Memory system | `HotSignalState`, 64 bytes, `#[repr(C, align(64))]` |
| Subject-prefix key encoding | Knowledge graph | `[entity_id BE][0x00][tag][suffix]` |
| Per-entity-kind isolation | Logging engine | Separate fjall keyspaces per `EntityKind` |
| Background materializer | Knowledge graph | Checkpoint thread, 30-second interval |
| Append-only core / mutable views | All three | WAL is truth, signal ledger is derived |
| Lock-free hot path | All three | Atomic CAS, no mutex on read or write |
| BLAKE3 checksums | Logging engine + knowledge graph | WAL batch integrity verification |
Nine patterns. Three databases. Zero invented from scratch.
The best code you write is the code someone already proved works.
---
*tidalDB is an open-source, embeddable Rust database for personalized content ranking. The storage and signal code referenced in this post is at [tidal/src/](https://github.com/orchard9/tidalDB/tree/main/tidal/src). Follow the build on [GitHub](https://github.com/orchard9/tidalDB).*