jordan 29400d48db feat: implement Milestone 1 phases 1-3 — schema, WAL, and storage layer

Implements the foundation of tidalDB's data pipeline:

**Phase 1 – Schema primitives**
- EntityId newtype (u64, big-endian ordering)
- SignalTypeDefinition with pre-computed decay λ, deduped/sorted windows
- SchemaBuilder with full constraint validation (duplicates, identifiers,
  half-life, windows, velocity)
- LumenError wrapping all subsystems with required From impls

**Phase 2 – Write-Ahead Log**
- Length-prefixed, BLAKE3-protected entry format
- Group-commit writer (batch up to 100 events / 10 ms)
- Double-buffered content-hash deduplication
- Checkpoint, truncation, and crash-recovery with full replay
- Integration, property, and UAT tests (incl. 5,500-event deterministic UAT)
- Proptest coverage scaled to 10 000 events/run (was ≤500) to meet
  acceptance criterion; cases reduced 100→10 to keep runtime comparable

**Phase 3 – Storage engine**
- StorageEngine trait (get/put/delete/scan/batch/flush)
- Key encoding: [EntityId][0x00][Tag][suffix] with ordering/prefix helpers
- InMemoryBackend (BTreeMap + RwLock)
- FjallStorage with three isolated keyspaces and atomic batch helper
- Property tests for key ordering and round-trip correctness

Also adds planning docs for phases 4-5, research docs, architecture
overview, and roadmap updates.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-02-20 16:43:24 -07:00

23 KiB

Raw Blame History

Architecture

tidalDB is a purpose-built ranking database. Its architecture is shaped by a single constraint: every design decision must serve the question "given a user and a context, what content should they see, in what order?" Nothing else.

This document describes how the system is structured, why it is structured that way, and how the major subsystems interact. For the API surface, see API.md. For engineering standards, see CODING_GUIDELINES.md. For the research behind specific decisions, see docs/research/.

Core Thesis

Every content platform (YouTube, TikTok, Reddit, Netflix) builds the same 6-system distributed stack from scratch — Elasticsearch, Redis, Kafka, a feature store, a vector DB, and a ranking service. The seams between those systems are where correctness fails: stale signals, inconsistent ranking, cache invalidation bugs, ETL lag.

The root cause is that existing databases treat ranking as an afterthought. They have no native concept of signals that evolve over time, no understanding of user context, no diversity as a query constraint, and no feedback loop between what users see and what the system learns.

tidalDB treats ranking as a primitive. Signals, decay, velocity, user preferences, relationships, and diversity are first-class schema concepts — not application logic bolted on top.

Domain Model

Five first-class entity types:

Type	What it represents
Item	A piece of content (video, article, post) — has metadata, an embedding slot, a signal ledger
User	A viewer — has attributes, a preference vector, a signal ledger, a seen-item set
Creator	An author — has attributes, an embedding slot, a signal ledger
Relationship	A weighted, directional edge between any two entities (follows, blocks, interaction weight)
Cohort	A named, live predicate over user attributes (e.g. `age_range ∈ {18-24} AND locale = en-US`)

Five schema-level primitives:

Primitive	What it captures
Signal	A typed, timestamped event stream (view, like, skip, hide...) with declared decay rate, velocity, and windowed aggregation
Ranking Profile	A named, versioned scoring function: candidate retrieval strategy, boosts, penalties, quality gates, diversity rules, exploration budget
Relationship	Weighted edges: follows, blocks, interaction strength — used as ranking inputs
Cohort	Live predicate membership — enables cohort-scoped signal aggregation and trending
Filter	Composable predicates over entity attributes, signal values, and relationship state

Module Structure

The dependency chain is strict. No circular dependencies. Each module knows only about modules beneath it.

schema/      ← standalone; no dependencies; defines all types
  ↑
storage/     ← depends on schema; knows nothing about signals or ranking
  ↑
signals/     ← depends on storage; knows nothing about queries or ranking
  ↑
query/       ← depends on storage + signals; orchestrates execution
  ↑
ranking/     ← depends on signals; invoked by the query executor

`schema/`

The type system. Defines EntityId, SignalDef, ProfileDef, CohortDef, TidalError, and validation logic. No dependencies. Every other module depends on this one.

No external crates except thiserror (error derives).

`storage/`

The persistence layer. Owns:

WAL — the durability boundary. Every write goes here first.
Entity store — item, user, creator metadata. Trait-abstracted: EntityStore, SignalLedgerStore, RelationshipStore.
Key encoding — [entity_id: u64 BE][0x00][TAG:suffix] for co-location and range scans.

The storage backend (fjall initially) sits behind a trait. No storage engine types leak into higher modules.

`signals/`

Signal ingestion and aggregation. Owns:

Ingest — validates, hashes (BLAKE3 for deduplication), writes to WAL, triggers downstream
Decay — forward-decay formula maintenance (S(t) = S(t_prev) * exp(-λ * dt) + weight)
Aggregation — windowed counters (SWAG-based), velocity computation
Materialization — background worker that writes pre-computed aggregates to O(1) lookup keys

`query/`

Query parsing and execution. Owns:

Parser — validates Retrieve, Search, Suggest inputs
Planner — selects candidate retrieval strategy (ANN vs. scan vs. cohort-scoped), estimates filter selectivity
Executor — orchestrates retrieval → filter → score → diversity → paginate

`ranking/`

Scoring and diversity. Owns:

Profile engine — loads named profiles, applies boosts, penalties, gates
Signal scoring — reads decay scores and windowed aggregates from signals/
Diversity enforcement — post-scoring reordering pass enforcing max_per_creator, format mix, topic spread
Exploration — injects new-item candidates at the declared exploration rate

Storage Architecture

WAL as source of truth

Every write — entity, signal, relationship — goes through the Write-Ahead Log before any processing. The entity store, signal aggregates, vector index, and text index are all derived state. If they are lost, they can be rebuilt from the WAL.

write_signal(event)
  → hash payload (BLAKE3)             // deduplication
  → append to WAL (fsync amortized)   // durability boundary
  → update in-memory decay score      // hot path, atomic
  → update windowed counter           // hot path, lock-free
  → enqueue for materializer          // background

Signal durability is configurable per signal type:

Immediate — fsync per event (purchases, high-value actions)
Batched — fsync per N events or T ms, whichever comes first (likes, views)
Eventual — OS-buffered (impressions, hover events)

Default: Batched { max_events: 100, max_delay_ms: 10 }.

Key encoding

All keys follow the subject-prefix pattern: [entity_id: u64 BE][0x00][TAG:suffix].

Big-endian encoding ensures byte-lexicographic order matches numeric order — enabling range scans and prefix compression. All data for one entity is co-located.

{item_id}\x00SIG:view:1h     → windowed aggregate (1-hour view count)
{item_id}\x00SIG:view:decay  → running decay score
{item_id}\x00META            → entity metadata
{item_id}\x00EMB             → embedding vector reference
{user_id}\x00PREF            → preference vector
{user_id}\x00SEEN:{item_id}  → seen-item record
{user_id}\x00REL:follows:{creator_id} → relationship edge

This layout is shard-ready: entity_id is a partition key, and range-based partitioning needs no format migration.

Storage isolation

Item signal ledgers, user preference vectors, and creator profiles occupy separate storage namespaces (column families). A burst of view events on a viral item must not slow down user profile reads.

Hybrid backend

LSM-tree (fjall) for the signal event log — write-heavy, sequential, FIFO-compacted. The same engine serves entity metadata with prefix bloom filters for point lookups. If a B-tree backend proves faster for entity random reads in benchmarks, the trait abstraction allows substitution without touching higher layers.

Signal System

Decay model

Decay is declared in schema, applied at query time. The application never computes trending_score = views / (age_hours + 2)^1.8.

DEFINE SIGNAL view ON item
  DECAY exponential HALF_LIFE 7d
  WINDOWS 1h, 24h, 7d, 30d, all_time
  VELOCITY enabled

The forward-decay formula is mathematically exact and O(1) per operation:

// Write path (3 exp() ≈ 36ns)
S(t) = S(t_prev) * exp(-λ * dt) + weight

// Read path (1 exp() ≈ 15ns per entity per λ)
current = stored * exp(-λ * dt_since_last)

For 200 candidates: ~3-4 µs total. This replaces scanning raw events, which costs 160-1600 µs at 50 events/entity.

Out-of-order events: when t_event < last_update, pre-decay the weight: score += weight * exp(-λ * (last_update - t_event)). last_update is not modified — it already reflects a more recent time.

Windowed aggregation

Per-signal, per-window counters maintained using a SWAG (Sliding Window Aggregate) structure. The database tracks counts within declared windows (1h, 24h, 7d, 30d) at all times. No re-scan of raw events at query time.

Materialization

A background materializer continuously pre-computes aggregate values and writes them to O(1) lookup keys:

{item_id}\x00SIG:view:vel:1h   → view velocity (events/hour over last hour)
{item_id}\x00SIG:like:24h      → like count in last 24 hours
{item_id}\x00SIG:completion:all → completion rate, all time

Ranking queries read from materialized state on the fast path. If materialized state is stale (background worker lagging), the query falls back to computing from the in-memory decay score and windowed counters — slower but never wrong.

Cohort-scoped aggregation

When a signal event arrives and cohorts are defined, the signal fans out to per-entity aggregates and per-cohort-entity aggregates:

signal(view, item: X, user: U)
  → update item X's entity-level aggregates        // always
  → for each cohort C where U ∈ C:
      update (cohort C, item X) aggregate            // fan-out

Cohort membership is maintained as RoaringBitmaps — O(1) membership test. The per-cohort-item aggregate is sparse: only active (cohort, item) pairs with at least one signal are stored. Write amplification is ~6x for 5 cohorts per user on average; mitigated by batching.

Immutable events, mutable aggregates

Signal events (user U liked item I at time T) are immutable facts appended to the WAL. Signal aggregates (item I has 1,247 likes in the last 24h) are mutable derived state maintained in the signal ledger. These layers are kept strictly separate. Aggregates can always be recomputed from events.

Vector Index

USearch (Unum Cloud) is the HNSW engine. It is not built from scratch — correct, high-performance, concurrent HNSW with SIMD distance computation is 6-12 months of dedicated work. USearch runs in ScyllaDB, ClickHouse, and DuckDB at scale. The FFI boundary via cxx is thin.

Quantization

f16 by default: 10M vectors at 1536D → ~31.5 GB (f16) vs ~60 GB (float32). Less than 1% recall loss. Float32 only if benchmarks prove f16 is insufficient for a specific embedding model.

Embeddings are normalized to unit length at insertion time. L2 distance is then equivalent to cosine similarity, and more SIMD-friendly. Re-normalization at query time never happens.

Adaptive filtered search

The query planner estimates filter selectivity from metadata indexes (roaring bitmaps per creator, B-tree for date ranges), then selects a strategy:

Estimated selectivity	Strategy
< 2%	Pre-filter via bitmap intersection → brute-force L2 over matched set
2%–100%	`index.filtered_search(vector, k, \|key\| predicate(key))` — USearch evaluates filters inline during HNSW traversal; non-matching nodes are skipped for results but still used for graph navigation
Fallback	Widen `ef_search`; if still insufficient, fall back to pre-filter + brute-force

This matches how ScyllaDB uses USearch in production and how Weaviate and Qdrant handle the same problem.

Persistence lifecycle

Active index in RAM for reads and writes during operation.
Periodic save() coordinated with WAL checkpointing.
On restart: view() for immediate read-only mmap serving while a writable copy loads in background.
Segment-based management for growing datasets: new inserts go to a new segment; periodic compaction merges segments and reclaims tombstoned space.

Multi-vector user preference

User interest is not a single vector. Averaging engagement embeddings across topics ("hiking," "cooking," "cars") produces a centroid that represents none of them. Instead, each user's preference is represented as 3-10 interest cluster centroids (PinnerSage-style), maintained by the database as signals arrive. At query time, the planner issues one filtered HNSW query per active cluster and merges results. This requires no special index modifications — standard filtered_search per cluster, results deduped by score.

Text Search

Tantivy is the full-text / BM25 engine. It is a derived index, not a source of truth.

Consistency model

The entity store is the source of truth. Tantivy is a materialized view over it. If the Tantivy index is corrupted or lost, it can be rebuilt from the entity store by replaying the entity outbox.

write_item(item)
  → write to entity store (within WAL)
  → append to background indexer outbox
  → [async] background indexer → Tantivy
  → on each Tantivy commit, store last-processed WAL sequence number
  → on crash recovery, replay from that sequence number

Tantivy's single-writer guarantee is enforced via filesystem lock. Segment merging runs on background threads to avoid query latency spikes.

Hybrid fusion

Search queries combine BM25 relevance and ANN semantic similarity using Reciprocal Rank Fusion:

RRF(d) = 1/(60 + rank_bm25) + 1/(60 + rank_ann)

RRF is rank-based — no score normalization required, robust across query types. Graduate to a tuned linear combination α * bm25 + (1-α) * ann only after relevance labels exist to set α.

Personalization re-ranks the fused set using the user's preference vector and relationship graph. The order of operations: text retrieval → ANN retrieval → RRF fusion → personalization re-ranking → diversity enforcement.

Query Execution Pipeline

Every retrieve() or search() call follows this pipeline:

1. Parse & validate
   └── input types, profile existence, filter validity

2. Plan candidate retrieval
   ├── ANN (user preference vector → top-k items by embedding similarity)
   ├── BM25 (text query → top-k items by relevance)
   ├── Full scan (trending/browse — no user vector required)
   ├── Graph walk (following feed — reverse-chronological from followed creators)
   └── Cohort-scoped (trending/rising within a named cohort)

3. Apply hard filters
   └── unseen, unblocked, unhidden, field predicates — eliminate ineligible candidates
   └── Negative relationship checks (blocked creators, muted topics)

4. Score candidates
   ├── Load decay scores and windowed aggregates (from materialized state or computed)
   ├── Apply profile boosts (signal velocity, relationship weight, social proof)
   ├── Apply profile penalties (skip count, hide, negative engagement)
   ├── Apply freshness decay (age-based score reduction)
   └── Apply quality gates (minimum completion rate, minimum score threshold)

5. Diversity enforcement (post-scoring reordering pass)
   └── max_per_creator, format_mix, topic_diversity
   └── Reorders — does not reduce result count

6. Exploration injection
   └── Inject new/low-signal items at declared exploration rate (e.g. 10%)
   └── New items get exploration budget until signals accumulate

7. Paginate and return
   └── Cursor-based, stable across pages

Ranking profiles are data, not code

Profiles are schema-level declarations — parsed, validated, versioned, stored in the database. They are not Rust functions compiled into the binary. Changing a profile weight requires no recompile, no redeploy. The query planner reasons about profile structure to optimize execution (e.g. a profile that only uses velocity signals skips the ANN step).

Graceful degradation

Under load, the executor degrades in order — never returns errors for well-formed queries:

Reduce candidate set size (top_k: 500 → 200)
Use coarser signal aggregates (skip velocity, use windowed counts)
Skip diversity enforcement
Return from materialized ranking cache

Write Path: Single Engagement Signal

Tracing db.signal(Signal { kind: "like", item: "I", user: "U", ... }):

1. Hash event payload (BLAKE3) → deduplicate
2. Append to WAL → fsync (batched)
3. Update item I's like decay score (atomic CAS)
4. Increment item I's like_count windowed counters (atomic add)
5. Recompute like velocity for item I
6. Update user U → item I relationship weight
7. Increment user U → creator C interaction weight
8. Shift user U's preference vector toward item I's embedding
9. Fan-out to cohort aggregates for each cohort U belongs to
10. Enqueue item I for materializer (windowed aggregate refresh)

Steps 3-9 execute atomically in memory. Step 10 is background. A ranking query issued 100ms later sees the updated decay score, relationship weight, and preference vector.

Concurrency Model

Hot path: lock-free

Signal counters, decay scores, and windowed aggregates use atomic operations exclusively.

AtomicU64 with Relaxed ordering for monotonic counters (view_count, like_count)
AtomicU64 via f64::to_bits / from_bits with CAS loops for decay scores
Acquire/Release at synchronization points (checkpoint, materializer flush)
DashMap for concurrent entity state access (sharded, no global lock)

A like event increments an atomic. A ranking query reads it. No blocking between writers and readers.

Cold path: mutex acceptable

Schema changes, profile definitions, background compaction coordination — these happen infrequently and outside the query hot path. Mutexes are acceptable here.

Hot-path structs: cache-line aligned

Any struct touched during candidate scoring is #[repr(C, align(64))] — one L1 cache line. This prevents false sharing under concurrent access and keeps scoring loops cache-friendly.

#[repr(C, align(64))]
struct EntitySignalState {
    entity_id: u64,
    decay_scores: [f64; 3],      // one per declared decay rate
    last_update_ns: u64,
    window_counts: BucketedCounter,
    // padded to 64-byte boundary
}

Performance Targets

These are constraints, not aspirations. Regressions are bugs.

Operation	Target
Signal write (including WAL, amortized)	< 100 µs
Decay score read per candidate	~15 ns
200-candidate scoring pass	< 5 µs
ANN retrieval at 1M vectors	< 10 ms p99
BM25 query at 1M documents	< 10 ms
End-to-end RETRIEVE query	< 50 ms

The 200-candidate scoring budget breaks down as: 200 × 15 ns (decay read) + 200 × (boost/penalty application) + 1 diversity pass. Everything else in the pipeline must fit within the remainder of the 50 ms budget.

Dependency Map

usearch  (C++ FFI via cxx)    → vector index
tantivy  (pure Rust)          → text/BM25 index
fjall    (pure Rust)          → storage engine (WAL, entity store, signal ledger)
roaring  (pure Rust)          → bitmap indexes (cohort membership, filter selectivity)
blake3   (pure Rust)          → content-addressed signal deduplication
dashmap  (pure Rust)          → concurrent entity state map
thiserror                     → typed error derives
tracing                       → structured spans (embedder provides subscriber)
serde / serde_json            → serialization at API boundaries only
criterion                     → benchmarking (dev dependency)
proptest                      → property testing (dev dependency)

Every dependency must justify its existence against "could we write this in 200 lines?" The approved list above is the complete list. No additions without research justification.

Key Architectural Decisions

Decision	Choice	Why
WAL strategy	Append-only, fsync batched	Durability before processing; replay-based recovery; matches Citadel and Engram patterns
Storage engine	fjall (LSM-tree)	Pure Rust, embeddable, FIFO compaction for event logs, prefix bloom filters
Vector index	USearch	150x faster than Lucene, predicate callback during HNSW traversal, mmap, quantization; used in ScyllaDB/ClickHouse/DuckDB
Quantization	f16 by default	50% memory savings, <1% recall loss; 10M × 1536D → ~31.5 GB
Filtered ANN	Adaptive planner	<2% selectivity: pre-filter + brute-force; 2-100%: USearch predicate callback
Text search	Tantivy as derived index	40K lines of battle-tested Rust; custom Collector for score extraction; DB-primary with background indexer
Hybrid fusion	RRF (k=60)	Rank-based, no score normalization, proven better than CombMNZ
Decay model	Forward-decay formula	Mathematically exact, O(1) write/read; no raw-event scanning at query time
Decay storage	f64 via AtomicU64	15 significant digits; sufficient for 528-year precision
Timestamps	u64 nanoseconds since Unix epoch	Overflows year 2554; matches ClickHouse/Sonnerie; no external dependency
Cohort membership	RoaringBitmap	O(1) membership test; sparse fan-out for signal aggregation
Signal deduplication	BLAKE3 content hash	Automatic deduplication of webhook retries and client double-submissions
Key encoding	`[entity_id: u64 BE]\x00TAG:suffix`	Co-location, range scans, natural shard boundaries, no migration path needed
Ranking profiles	Schema declarations	Swappable at query time by name; A/B testable; no recompile on change
Diversity	Post-scoring reordering pass	Does not reduce result count; enforces constraints after scoring is complete
Error handling	`thiserror` enum with 6 variants	Typed, actionable errors; used by fjall/tantivy/tikv; no `unwrap()` outside tests
Observability	`tracing` spans, embedder provides subscriber	Library crate; never initializes a subscriber; `#[tracing::instrument]` at subsystem boundaries

What This Replaces

Elasticsearch          →  Tantivy (BM25, derived index)
Redis                  →  In-memory decay scores + windowed counters (lock-free atomics)
Kafka                  →  WAL (durable, ordered, replayable)
Feature store          →  Signal ledger + materialized aggregates
Vector DB              →  USearch (HNSW, embedded)
Ranking service        →  Named profiles, query-time scoring

One process. One query interface. One operational model.

The test: this query should execute in under 50 ms, incorporate signals written 100 ms ago, enforce diversity without application logic, handle cold-start items without application intervention:

db.retrieve(Retrieve {
    entity: EntityKind::Item,
    for_user: Some("user_123"),
    context: Some("feed"),
    profile: "for_you",
    filters: vec![Filter::unseen(), Filter::not_blocked(), Filter::eq("format", "video")],
    diversity: Some(DiversitySpec { max_per_creator: Some(2), format_mix: true }),
    limit: 50,
})

That is what six systems currently produce. It should be one query here.

23 KiB Raw Blame History Unescape Escape