# Architecture

tidalDB is a purpose-built ranking database. Its architecture is shaped by a single constraint: every design decision must serve the question "given a user and a context, what content should they see, in what order?" Nothing else.

This document describes how the system is structured, why it is structured that way, and how the major subsystems interact. For the API surface, see [API.md](API.md). For engineering standards, see [CODING_GUIDELINES.md](CODING_GUIDELINES.md). For the research behind specific decisions, see [docs/research/](docs/research/).

---

## Core Thesis

Every content platform (YouTube, TikTok, Reddit, Netflix) builds the same 6-system distributed stack from scratch — Elasticsearch, Redis, Kafka, a feature store, a vector DB, and a ranking service. The seams between those systems are where correctness fails: stale signals, inconsistent ranking, cache invalidation bugs, ETL lag.

The root cause is that existing databases treat ranking as an afterthought. They have no native concept of signals that evolve over time, no understanding of user context, no diversity as a query constraint, and no feedback loop between what users see and what the system learns.

tidalDB treats ranking as a primitive. Signals, decay, velocity, user preferences, relationships, and diversity are first-class schema concepts — not application logic bolted on top.

---

## Domain Model

Six first-class entity types:

| Type | What it represents |
|------|--------------------|
| **Item** | A piece of content (video, article, post) — has metadata, an embedding slot, a signal ledger |
| **User** | A viewer — has attributes, a preference vector, a signal ledger, a seen-item set |
| **Creator** | An author — has attributes, an embedding slot, a signal ledger |
| **Relationship** | A weighted, directional edge between any two entities (follows, blocks, interaction weight) |
| **Cohort** | A named, live predicate over user attributes (e.g. `age_range ∈ {18-24} AND locale = en-US`) |
| **Session / Agent Context** | A short-lived, agent-scoped memory surface binding a user, agent identity, and session metadata (tools, reward hints, policy) |

Six schema-level primitives:

| Primitive | What it captures |
|-----------|-----------------|
| **Signal** | A typed, timestamped event stream (view, like, skip, hide...) with declared decay rate, velocity, and windowed aggregation |
| **Ranking Profile** | A named, versioned scoring function: candidate retrieval strategy, boosts, penalties, quality gates, diversity rules, exploration budget |
| **Relationship** | Weighted edges: follows, blocks, interaction strength — used as ranking inputs |
| **Cohort** | Live predicate membership — enables cohort-scoped signal aggregation and trending |
| **Session** | Agent-scoped conversational context: short-lived signals, reward hints, policy tags, decay curves |
| **Filter** | Composable predicates over entity attributes, signal values, and relationship state |

---

## Module Structure

The dependency chain is strict. No circular dependencies. Each module knows only about modules beneath it.

```
schema/      ← standalone; no dependencies; defines all types
  ↑
storage/     ← depends on schema; knows nothing about signals or ranking
  ↑
signals/     ← depends on storage; knows nothing about queries or ranking
  ↑
agent/       ← depends on signals; manages sessions, policy, agent APIs
  ↑
ranking/     ← depends on signals; invoked by the query executor
  ↑
query/       ← depends on agent + signals + ranking; orchestrates execution
```

### `schema/`

The type system. Defines `EntityId`, `SignalDef`, `ProfileDef`, `CohortDef`, `TidalError`, and validation logic. No dependencies. Every other module depends on this one.

No external crates except `thiserror` (error derives).

### `storage/`

The persistence layer. Owns:
- **WAL** — the durability boundary. Every write goes here first.
- **Entity store** — item, user, creator metadata. Trait-abstracted: `EntityStore`, `SignalLedgerStore`, `RelationshipStore`.
- **Key encoding** — `[entity_id: u64 BE][0x00][TAG:suffix]` for co-location and range scans.

The storage backend (fjall initially) sits behind a trait. No storage engine types leak into higher modules.

### `signals/`

Signal ingestion and aggregation. Owns:
- **Ingest** — validates, hashes (BLAKE3 for deduplication), writes to WAL, triggers downstream
- **Decay** — forward-decay formula maintenance (`S(t) = S(t_prev) * exp(-λ * dt) + weight`)
- **Aggregation** — windowed counters (SWAG-based), velocity computation
- **Materialization** — background worker that writes pre-computed aggregates to O(1) lookup keys

### `agent/`

Session and policy management for agents. Owns:
- **Session store** — lifecycle of `(user, agent, session_id)` plus short-lived signals
- **Session materializers** — aggressive-decay aggregates agents can query (`last_5m_reward`, “tools used”, etc.)
- **Policy enforcement** — per-agent read/write rules, rate limiting, and isolation guardrails
- **API surface** — typed commands (`start_session`, `append_signal`, `close_session`) used by query/ranking layers

### `query/`

Query parsing and execution. Owns:
- **Parser** — validates `Retrieve`, `Search`, `Suggest` inputs
- **Planner** — selects candidate retrieval strategy (ANN vs. scan vs. cohort-scoped), estimates filter selectivity
- **Executor** — orchestrates retrieval → filter → score → diversity → paginate

### `ranking/`

Scoring and diversity. Owns:
- **Profile engine** — loads named profiles, applies boosts, penalties, gates
- **Signal scoring** — reads decay scores and windowed aggregates from signals/
- **Diversity enforcement** — post-scoring reordering pass enforcing `max_per_creator`, format mix, topic spread
- **Exploration** — injects new-item candidates at the declared exploration rate

---

## Storage Architecture

### WAL as source of truth

Every write — entity, signal, relationship — goes through the Write-Ahead Log before any processing. The entity store, signal aggregates, vector index, and text index are all derived state. If they are lost, they can be rebuilt from the WAL.

```
write_signal(event)
  → hash payload (BLAKE3)             // deduplication
  → append to WAL (fsync amortized)   // durability boundary
  → update in-memory decay score      // hot path, atomic
  → update windowed counter           // hot path, lock-free
  → enqueue for materializer          // background
```

Signal durability is configurable per signal type:
- `Immediate` — fsync per event (purchases, high-value actions)
- `Batched` — fsync per N events or T ms, whichever comes first (likes, views)
- `Eventual` — OS-buffered (impressions, hover events)

Default: `Batched { max_events: 100, max_delay_ms: 10 }`.
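The three durability levels map naturally onto an enum. The sketch below is illustrative only: the variant and field names come from the text above, while `Durability::should_fsync` and its parameters are hypothetical and not part of the documented API.

```rust
/// Per-signal durability policy (illustrative sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Durability {
    /// fsync per event.
    Immediate,
    /// fsync per N events or T ms, whichever comes first.
    Batched { max_events: u32, max_delay_ms: u64 },
    /// Leave flushing to the OS.
    Eventual,
}

impl Default for Durability {
    fn default() -> Self {
        Durability::Batched { max_events: 100, max_delay_ms: 10 }
    }
}

impl Durability {
    /// Hypothetical helper: should the WAL fsync now, given the events and
    /// time accumulated since the last fsync?
    pub fn should_fsync(&self, pending_events: u32, elapsed_ms: u64) -> bool {
        match *self {
            Durability::Immediate => pending_events > 0,
            Durability::Batched { max_events, max_delay_ms } => {
                pending_events >= max_events
                    || (pending_events > 0 && elapsed_ms >= max_delay_ms)
            }
            Durability::Eventual => false,
        }
    }
}
```

With the default policy, `should_fsync(5, 3)` is false, while either 100 pending events or a 10 ms delay triggers a flush.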

### Key encoding

All keys follow the subject-prefix pattern: `[entity_id: u64 BE][0x00][TAG:suffix]`.

Big-endian encoding ensures byte-lexicographic order matches numeric order — enabling range scans and prefix compression. All data for one entity is co-located.

```
{item_id}\x00SIG:view:1h     → windowed aggregate (1-hour view count)
{item_id}\x00SIG:view:decay  → running decay score
{item_id}\x00META            → entity metadata
{item_id}\x00EMB             → embedding vector reference
{user_id}\x00PREF            → preference vector
{user_id}\x00SEEN:{item_id}  → seen-item record
{user_id}\x00REL:follows:{creator_id} → relationship edge
```

This layout is shard-ready: `entity_id` is a partition key, and range-based partitioning needs no format migration.
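A minimal sketch of the encoding, assuming the tag and suffix are plain ASCII strings such as `SIG:view:1h`. The helper names `encode_key` and `entity_prefix` are hypothetical.

```rust
/// Builds a key: 8-byte big-endian entity id, a 0x00 separator, then TAG:suffix.
pub fn encode_key(entity_id: u64, tag_suffix: &str) -> Vec<u8> {
    let mut key = Vec::with_capacity(8 + 1 + tag_suffix.len());
    key.extend_from_slice(&entity_id.to_be_bytes());
    key.push(0x00);
    key.extend_from_slice(tag_suffix.as_bytes());
    key
}

/// Prefix shared by every key of one entity; usable as a range-scan bound.
pub fn entity_prefix(entity_id: u64) -> [u8; 9] {
    let mut prefix = [0u8; 9]; // the ninth byte stays 0x00, the separator
    prefix[..8].copy_from_slice(&entity_id.to_be_bytes());
    prefix
}
```

Because the id is big-endian, `encode_key(255, ..)` sorts before `encode_key(256, ..)` under plain byte comparison. A little-endian encoding would reverse that pair.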

### Storage isolation

Item signal ledgers, user preference vectors, and creator profiles occupy separate storage namespaces (column families). A burst of view events on a viral item must not slow down user profile reads.

### Hybrid backend

LSM-tree (fjall) for the signal event log — write-heavy, sequential, FIFO-compacted. The same engine serves entity metadata with prefix bloom filters for point lookups. If a B-tree backend proves faster for entity random reads in benchmarks, the trait abstraction allows substitution without touching higher layers.

---

## Signal System

### Decay model

Decay is declared in schema, applied at query time. The application never computes `trending_score = views / (age_hours + 2)^1.8`.

```
DEFINE SIGNAL view ON item
  DECAY exponential HALF_LIFE 7d
  WINDOWS 1h, 24h, 7d, 30d, all_time
  VELOCITY enabled
```

The forward-decay formula is mathematically exact and O(1) per operation:

```
// Write path (one exp() per declared decay rate; 3 rates ≈ 36ns)
S(t) = S(t_prev) * exp(-λ * dt) + weight

// Read path (1 exp() ≈ 15ns per entity per λ)
current = stored * exp(-λ * dt_since_last)
```

For 200 candidates: ~3-4 µs total. This replaces scanning raw events, which costs 160-1600 µs at 50 events/entity.

Out-of-order events: when `t_event < last_update`, pre-decay the weight: `score += weight * exp(-λ * (last_update - t_event))`. `last_update` is not modified — it already reflects a more recent time.
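The write, read, and out-of-order rules fit in a few lines. This is a sketch under stated assumptions: time is an `f64` in seconds for readability (the engine stores `u64` nanoseconds), λ is derived from the half-life as `ln 2 / half_life`, and the `DecayScore` type is hypothetical.

```rust
/// Running decay score for one signal on one entity (illustrative).
#[derive(Debug, Clone, Copy)]
pub struct DecayScore {
    pub score: f64,
    pub last_update: f64, // seconds
    pub lambda: f64,      // decay rate, ln(2) / half-life
}

impl DecayScore {
    pub fn with_half_life(half_life_secs: f64) -> Self {
        DecayScore {
            score: 0.0,
            last_update: 0.0,
            lambda: std::f64::consts::LN_2 / half_life_secs,
        }
    }

    /// Write path: decay the stored score forward to the event time, then add the weight.
    /// A late event is pre-decayed instead, and `last_update` is left untouched.
    pub fn record(&mut self, t_event: f64, weight: f64) {
        if t_event >= self.last_update {
            let dt = t_event - self.last_update;
            self.score = self.score * (-self.lambda * dt).exp() + weight;
            self.last_update = t_event;
        } else {
            let lag = self.last_update - t_event;
            self.score += weight * (-self.lambda * lag).exp();
        }
    }

    /// Read path: one exp() to bring the stored score up to `now`.
    pub fn current(&self, now: f64) -> f64 {
        self.score * (-self.lambda * (now - self.last_update)).exp()
    }
}
```

With a 10-second half-life, an event of weight 1 at t = 0 reads as 0.5 at t = 10. A late-arriving event stamped t = 0 contributes the same 0.5 when it is folded in at `last_update = 10`.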

### Windowed aggregation

Per-signal, per-window counters maintained using a SWAG (Sliding Window Aggregate) structure. The database tracks counts within declared windows (1h, 24h, 7d, 30d) at all times. No re-scan of raw events at query time.
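To make the idea concrete, here is a deliberately simpler stand-in: a bucketed ring counter, not the SWAG structure itself. It only illustrates how a window count can be kept current without re-scanning events. `WindowCounter` and its bucket granularity are assumptions, and the window edge is only accurate to one bucket.

```rust
/// Ring of per-bucket counts covering one window (illustrative, not SWAG).
pub struct WindowCounter {
    buckets: Vec<u64>,
    bucket_secs: u64,
    head_bucket: u64, // absolute index of the newest bucket
    total: u64,
}

impl WindowCounter {
    pub fn new(window_secs: u64, bucket_secs: u64) -> Self {
        let n = (window_secs / bucket_secs).max(1) as usize;
        WindowCounter { buckets: vec![0; n], bucket_secs, head_bucket: 0, total: 0 }
    }

    /// Expires every bucket that has slid out of the window as of `now_secs`.
    fn advance(&mut self, now_secs: u64) {
        let target = now_secs / self.bucket_secs;
        let n = self.buckets.len() as u64;
        let steps = target.saturating_sub(self.head_bucket).min(n);
        for i in 1..=steps {
            let idx = ((self.head_bucket + i) % n) as usize;
            self.total -= self.buckets[idx];
            self.buckets[idx] = 0;
        }
        if target > self.head_bucket {
            self.head_bucket = target;
        }
    }

    /// Adds `count` events at `now_secs`. Late events land in the newest bucket.
    pub fn add(&mut self, now_secs: u64, count: u64) {
        self.advance(now_secs);
        let idx = (self.head_bucket % self.buckets.len() as u64) as usize;
        self.buckets[idx] += count;
        self.total += count;
    }

    /// Count of events inside the window ending at `now_secs`.
    pub fn count(&mut self, now_secs: u64) -> u64 {
        self.advance(now_secs);
        self.total
    }
}
```

For a 1-hour window with 60-second buckets, events recorded at t = 0 still count at t = 3599 and drop out at t = 3600.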

### Materialization

A background materializer continuously pre-computes aggregate values and writes them to O(1) lookup keys:

```
{item_id}\x00SIG:view:vel:1h   → view velocity (events/hour over last hour)
{item_id}\x00SIG:like:24h      → like count in last 24 hours
{item_id}\x00SIG:completion:all → completion rate, all time
```

Ranking queries read from materialized state on the fast path. If materialized state is stale (background worker lagging), the query falls back to computing from the in-memory decay score and windowed counters — slower but never wrong.

### Cohort-scoped aggregation

When a signal event arrives and cohorts are defined, the signal fans out to per-entity aggregates **and** per-cohort-entity aggregates:

```
signal(view, item: X, user: U)
  → update item X's entity-level aggregates        // always
  → for each cohort C where U ∈ C:
      update (cohort C, item X) aggregate            // fan-out
```

Cohort membership is maintained as RoaringBitmaps, which give an effectively constant-time membership test. The per-cohort-item aggregate is sparse: only (cohort, item) pairs with at least one signal are stored. With 5 cohorts per user on average, each event triggers one entity-level update plus five cohort-level updates, so write amplification is ~6x; batching mitigates it.
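The fan-out can be sketched with standard-library collections. This is an assumption-laden illustration: `HashSet` stands in for the RoaringBitmap, the cohort loop scans every cohort rather than using an index, and the type and method names are hypothetical.

```rust
use std::collections::{HashMap, HashSet};

pub type UserId = u32;
pub type ItemId = u64;
pub type CohortId = u16;

#[derive(Default)]
pub struct CohortAggregates {
    members: HashMap<CohortId, HashSet<UserId>>, // stand-in for RoaringBitmap
    per_item: HashMap<ItemId, u64>,
    per_cohort_item: HashMap<(CohortId, ItemId), u64>, // sparse: created on first signal
}

impl CohortAggregates {
    pub fn add_member(&mut self, cohort: CohortId, user: UserId) {
        self.members.entry(cohort).or_default().insert(user);
    }

    /// Records one signal and returns how many aggregates were written.
    pub fn record(&mut self, item: ItemId, user: UserId) -> usize {
        *self.per_item.entry(item).or_insert(0) += 1; // entity-level, always
        let mut writes = 1;
        for (&cohort, users) in &self.members {
            if users.contains(&user) {
                *self.per_cohort_item.entry((cohort, item)).or_insert(0) += 1;
                writes += 1;
            }
        }
        writes
    }

    pub fn item_count(&self, item: ItemId) -> u64 {
        self.per_item.get(&item).copied().unwrap_or(0)
    }

    pub fn cohort_item_count(&self, cohort: CohortId, item: ItemId) -> u64 {
        self.per_cohort_item.get(&(cohort, item)).copied().unwrap_or(0)
    }
}
```

A user who belongs to five cohorts produces six aggregate writes per event, which is where the ~6x figure comes from.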

### Immutable events, mutable aggregates

Signal events (user U liked item I at time T) are immutable facts appended to the WAL. Signal aggregates (item I has 1,247 likes in the last 24h) are mutable derived state maintained in the signal ledger. These layers are kept strictly separate. Aggregates can always be recomputed from events.

---

## Vector Index

**USearch** (Unum Cloud) is the HNSW engine. It is not built from scratch — correct, high-performance, concurrent HNSW with SIMD distance computation is 6-12 months of dedicated work. USearch runs in ScyllaDB, ClickHouse, and DuckDB at scale. The FFI boundary via `cxx` is thin.

### Quantization

f16 by default: 10M vectors at 1536D → ~31.5 GB (f16) vs ~60 GB (float32). Less than 1% recall loss. Float32 only if benchmarks prove f16 is insufficient for a specific embedding model.

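Embeddings are normalized to unit length at insertion time. For unit vectors, squared L2 distance is a monotonic function of cosine similarity (‖a − b‖² = 2 − 2·cos(a, b)), so L2 ranking matches cosine ranking while being more SIMD-friendly. Stored embeddings are never re-normalized at query time.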
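The relationship between L2 and cosine on unit vectors is easy to check numerically. The helpers below are a standalone sketch with hypothetical names; they do not reflect how the index computes distances internally.

```rust
/// Scales a vector to unit length; a zero vector is returned unchanged.
pub fn normalize(v: &[f32]) -> Vec<f32> {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm == 0.0 {
        return v.to_vec();
    }
    v.iter().map(|x| x / norm).collect()
}

/// Squared Euclidean distance.
pub fn l2_squared(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Dot product; equals cosine similarity when both inputs are unit length.
pub fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}
```

For `a = normalize(&[3.0, 4.0])` and `b = normalize(&[1.0, 0.0])`, `l2_squared(&a, &b)` and `2.0 - 2.0 * dot(&a, &b)` agree up to floating-point error.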

### Adaptive filtered search

The query planner estimates filter selectivity from metadata indexes (roaring bitmaps per creator, B-tree for date ranges), then selects a strategy:

| Estimated selectivity | Strategy |
|-----------------------|----------|
| < 2% | Pre-filter via bitmap intersection → brute-force L2 over matched set |
| 2%–100% | `index.filtered_search(vector, k, \|key\| predicate(key))` — USearch evaluates filters inline during HNSW traversal; non-matching nodes are skipped for results but still used for graph navigation |
| Fallback | Widen `ef_search`; if still insufficient, fall back to pre-filter + brute-force |

This matches how ScyllaDB uses USearch in production and how Weaviate and Qdrant handle the same problem.
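The threshold in the table above reduces to a small decision function. This sketch assumes selectivity is estimated as matching items over total items; `FilterStrategy` and `choose_strategy` are hypothetical names, and the widening fallback is omitted.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FilterStrategy {
    /// Bitmap intersection, then brute-force distance over the matched set.
    PreFilterBruteForce,
    /// Predicate evaluated inline during HNSW traversal.
    InlinePredicate,
}

/// Picks a strategy from an estimated match count, using the 2% threshold.
pub fn choose_strategy(estimated_matches: u64, total_items: u64) -> FilterStrategy {
    // Integer form of `matches / total < 0.02`, widened to avoid overflow.
    if (estimated_matches as u128) * 100 < (total_items as u128) * 2 {
        FilterStrategy::PreFilterBruteForce
    } else {
        FilterStrategy::InlinePredicate
    }
}
```

At 1M items, a filter matching 10,000 items (1%) takes the pre-filter path, while one matching 20,000 (2%) uses the inline predicate.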

### Persistence lifecycle

1. Active index in RAM for reads and writes during operation.
2. Periodic `save()` coordinated with WAL checkpointing.
3. On restart: `view()` for immediate read-only mmap serving while a writable copy loads in background.
4. Segment-based management for growing datasets: new inserts go to a new segment; periodic compaction merges segments and reclaims tombstoned space.

### Multi-vector user preference

User interest is not a single vector. Averaging engagement embeddings across topics ("hiking," "cooking," "cars") produces a centroid that represents none of them. Instead, each user's preference is represented as 3-10 interest cluster centroids (PinnerSage-style), maintained by the database as signals arrive. At query time, the planner issues one filtered HNSW query per active cluster and merges the results. This requires no special index modifications: it is a standard `filtered_search` per cluster, with results deduplicated by item ID (keeping each item's best score) before the merged list is ranked.
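The merge step might look like the following. It assumes each per-cluster query returns `(item_id, distance)` pairs where a smaller distance is better; `merge_cluster_results` is a hypothetical helper and does not call the index itself.

```rust
use std::collections::HashMap;

/// Merges per-cluster hit lists, keeping each item's smallest distance,
/// and returns the `k` closest items overall.
pub fn merge_cluster_results(per_cluster: &[Vec<(u64, f32)>], k: usize) -> Vec<(u64, f32)> {
    let mut best: HashMap<u64, f32> = HashMap::new();
    for hits in per_cluster {
        for &(id, dist) in hits {
            best.entry(id)
                .and_modify(|d| {
                    if dist < *d {
                        *d = dist;
                    }
                })
                .or_insert(dist);
        }
    }
    let mut merged: Vec<(u64, f32)> = best.into_iter().collect();
    // Ascending distance; ties broken by id so the output is deterministic.
    merged.sort_by(|a, b| a.1.total_cmp(&b.1).then(a.0.cmp(&b.0)));
    merged.truncate(k);
    merged
}
```

An item returned by two clusters appears once in the output, at the better of its two distances.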

---

## Text Search

**Tantivy** is the full-text / BM25 engine. It is a derived index, not a source of truth.

### Consistency model

For text search, the entity store is the source of truth (itself recoverable from the WAL), and Tantivy is a materialized view over it. If the Tantivy index is corrupted or lost, it can be rebuilt from the entity store by replaying the entity outbox.

```
write_item(item)
  → write to entity store (within WAL)
  → append to background indexer outbox
  → [async] background indexer → Tantivy
  → on each Tantivy commit, store last-processed WAL sequence number
  → on crash recovery, replay from that sequence number
```

Tantivy's single-writer guarantee is enforced via filesystem lock. Segment merging runs on background threads to avoid query latency spikes.

### Hybrid fusion

Search queries combine BM25 relevance and ANN semantic similarity using Reciprocal Rank Fusion:

```
RRF(d) = 1/(60 + rank_bm25) + 1/(60 + rank_ann)
```

RRF is rank-based — no score normalization required, robust across query types. Graduate to a tuned linear combination `α * bm25 + (1-α) * ann` only after relevance labels exist to set α.

Personalization re-ranks the fused set using the user's preference vector and relationship graph. The order of operations: text retrieval → ANN retrieval → RRF fusion → personalization re-ranking → diversity enforcement.
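The fusion step is small enough to show in full. This sketch assumes each retriever returns item ids in rank order (best first, rank starting at 1); `rrf_fuse` is a hypothetical helper name.

```rust
use std::collections::HashMap;

/// Reciprocal Rank Fusion over two ranked id lists.
/// An item missing from one list simply gets no contribution from it.
pub fn rrf_fuse(bm25: &[u64], ann: &[u64], k: f64) -> Vec<(u64, f64)> {
    let mut scores: HashMap<u64, f64> = HashMap::new();
    for list in &[bm25, ann] {
        for (i, &id) in list.iter().enumerate() {
            let rank = (i + 1) as f64; // ranks are 1-based
            *scores.entry(id).or_insert(0.0) += 1.0 / (k + rank);
        }
    }
    let mut fused: Vec<(u64, f64)> = scores.into_iter().collect();
    // Descending score; ties broken by id so the output is deterministic.
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then(a.0.cmp(&b.0)));
    fused
}
```

With `bm25 = [1, 2, 3]`, `ann = [3, 1, 4]`, and `k = 60`, item 1 scores 1/61 + 1/62 and item 3 scores 1/63 + 1/61, so the fused order is 1, 3, 2, 4.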

---

## Query Execution Pipeline

Every `retrieve()` or `search()` call follows this pipeline:

```
1. Parse & validate
   └── input types, profile existence, filter validity

2. Plan candidate retrieval
   ├── ANN (user preference vector → top-k items by embedding similarity)
   ├── BM25 (text query → top-k items by relevance)
   ├── Full scan (trending/browse — no user vector required)
   ├── Graph walk (following feed — reverse-chronological from followed creators)
   └── Cohort-scoped (trending/rising within a named cohort)

3. Apply hard filters
   ├── unseen, unblocked, unhidden, field predicates: eliminate ineligible candidates
   └── Negative relationship checks (blocked creators, muted topics)

4. Score candidates
   ├── Load decay scores and windowed aggregates (from materialized state or computed)
   ├── Apply profile boosts (signal velocity, relationship weight, social proof)
   ├── Apply profile penalties (skip count, hide, negative engagement)
   ├── Apply freshness decay (age-based score reduction)
   └── Apply quality gates (minimum completion rate, minimum score threshold)

5. Diversity enforcement (post-scoring reordering pass)
   ├── max_per_creator, format_mix, topic_diversity
   └── Reorders; does not reduce result count

6. Exploration injection
   ├── Inject new/low-signal items at declared exploration rate (e.g. 10%)
   └── New items get exploration budget until signals accumulate

7. Paginate and return
   └── Cursor-based, stable across pages
```
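Step 5 can be read as a stable reordering. The sketch below is one possible interpretation of `max_per_creator`, not necessarily the engine's algorithm: candidates beyond a creator's cap are deferred to the tail rather than dropped, so the result count is unchanged. `Scored` and `enforce_max_per_creator` are hypothetical names.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
pub struct Scored {
    pub item_id: u64,
    pub creator_id: u64,
    pub score: f64,
}

/// Takes candidates already sorted by descending score and moves any item
/// beyond a creator's cap behind all items that fit within the cap.
pub fn enforce_max_per_creator(ranked: Vec<Scored>, max_per_creator: usize) -> Vec<Scored> {
    let mut seen: HashMap<u64, usize> = HashMap::new();
    let mut kept = Vec::with_capacity(ranked.len());
    let mut deferred = Vec::new();
    for candidate in ranked {
        let count = seen.entry(candidate.creator_id).or_insert(0);
        if *count < max_per_creator {
            *count += 1;
            kept.push(candidate);
        } else {
            deferred.push(candidate); // keeps its relative order in the tail
        }
    }
    kept.extend(deferred);
    kept
}
```

With a cap of 2 and scored creators `[A, A, A, B]`, the third item from A moves behind B's item, and all four items are still returned.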

### Ranking profiles are data, not code

Profiles are schema-level declarations — parsed, validated, versioned, stored in the database. They are not Rust functions compiled into the binary. Changing a profile weight requires no recompile, no redeploy. The query planner reasons about profile structure to optimize execution (e.g. a profile that only uses velocity signals skips the ANN step).

### Graceful degradation

Under load, the executor degrades in order — never returns errors for well-formed queries:

1. Reduce candidate set size (top_k: 500 → 200)
2. Use coarser signal aggregates (skip velocity, use windowed counts)
3. Skip diversity enforcement
4. Return from materialized ranking cache

---

## Write Path: Single Engagement Signal

Tracing `db.signal(Signal { kind: "like", item: "I", user: "U", ... })`:

```
1. Hash event payload (BLAKE3) → deduplicate
2. Append to WAL → fsync (batched)
3. Update item I's like decay score (atomic CAS)
4. Increment item I's like_count windowed counters (atomic add)
5. Recompute like velocity for item I
6. Update user U → item I relationship weight
7. Increment user U → creator C interaction weight
8. Shift user U's preference vector toward item I's embedding
9. Fan-out to cohort aggregates for each cohort U belongs to
10. Enqueue item I for materializer (windowed aggregate refresh)
```

Steps 3-9 are in-memory updates applied synchronously on the write path, each as an atomic operation. Step 10 is background. A ranking query issued 100 ms later sees the updated decay score, relationship weight, and preference vector.

---

## Concurrency Model

### Hot path: lock-free

Signal counters, decay scores, and windowed aggregates use atomic operations exclusively.

- `AtomicU64` with `Relaxed` ordering for monotonic counters (view_count, like_count)
- `AtomicU64` via `f64::to_bits / from_bits` with CAS loops for decay scores
- `Acquire/Release` at synchronization points (checkpoint, materializer flush)
- `DashMap` for concurrent entity state access (sharded, no global lock)

A `like` event increments an atomic. A ranking query reads it. No blocking between writers and readers.
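The `f64`-in-`AtomicU64` pattern with a CAS loop looks roughly like this. It is a minimal sketch: the `AtomicDecay` type is hypothetical, `Relaxed` ordering on the CAS is an assumption, and coordinating the score with its `last_update` timestamp is left out.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// An f64 stored as its bit pattern so it can be updated without a lock.
pub struct AtomicDecay {
    bits: AtomicU64,
}

impl AtomicDecay {
    pub fn new(value: f64) -> Self {
        AtomicDecay { bits: AtomicU64::new(value.to_bits()) }
    }

    pub fn load(&self) -> f64 {
        f64::from_bits(self.bits.load(Ordering::Relaxed))
    }

    /// Applies `score = score * decay_factor + weight` and returns the new score.
    /// Retries whenever another writer changed the value in between.
    pub fn decay_and_add(&self, decay_factor: f64, weight: f64) -> f64 {
        let mut current = self.bits.load(Ordering::Relaxed);
        loop {
            let next = f64::from_bits(current) * decay_factor + weight;
            match self.bits.compare_exchange_weak(
                current,
                next.to_bits(),
                Ordering::Relaxed,
                Ordering::Relaxed,
            ) {
                Ok(_) => return next,
                Err(observed) => current = observed, // lost the race; retry
            }
        }
    }
}
```

Readers call `load()` and never block. Concurrent writers may retry, but no update is lost.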

### Cold path: mutex acceptable

Schema changes, profile definitions, background compaction coordination — these happen infrequently and outside the query hot path. Mutexes are acceptable here.

### Hot-path structs: cache-line aligned

Any struct touched during candidate scoring is `#[repr(C, align(64))]`, so it starts on a 64-byte L1 cache-line boundary and its size is padded to a multiple of 64 bytes. This prevents false sharing under concurrent access and keeps scoring loops cache-friendly.

```rust
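use std::sync::atomic::AtomicU64;

#[repr(C, align(64))]
struct EntitySignalState {
    entity_id: u64,
    decay_scores: [AtomicU64; 3],   // f64 bit patterns, one per declared decay rate
    last_update_ns: u64,
    window_counts: BucketedCounter, // type defined elsewhere; not shown here
    // align(64) pads the struct size up to a multiple of 64 bytes
}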
```

---

## Performance Targets

These are constraints, not aspirations. Regressions are bugs.

| Operation | Target |
|-----------|--------|
| Signal write (including WAL, amortized) | < 100 µs |
| Decay score read per candidate | ~15 ns |
| 200-candidate scoring pass | < 5 µs |
| ANN retrieval at 1M vectors | < 10 ms p99 |
| BM25 query at 1M documents | < 10 ms |
| End-to-end RETRIEVE query | < 50 ms |

The 200-candidate scoring budget breaks down as: 200 × 15 ns = 3 µs (decay reads) + 200 × (boost/penalty application) + 1 diversity pass, which leaves roughly 2 µs of the 5 µs target for the last two terms. Everything else in the pipeline must fit within the remainder of the 50 ms budget.

---

## Dependency Map

```
usearch  (C++ FFI via cxx)    → vector index
tantivy  (pure Rust)          → text/BM25 index
fjall    (pure Rust)          → storage engine (WAL, entity store, signal ledger)
roaring  (pure Rust)          → bitmap indexes (cohort membership, filter selectivity)
blake3   (pure Rust)          → content-addressed signal deduplication
dashmap  (pure Rust)          → concurrent entity state map
thiserror                     → typed error derives
tracing                       → structured spans (embedder provides subscriber)
serde / serde_json            → serialization at API boundaries only
criterion                     → benchmarking (dev dependency)
proptest                      → property testing (dev dependency)
```

Every dependency must justify its existence against "could we write this in 200 lines?" The approved list above is the complete list. No additions without research justification.

---

## Key Architectural Decisions

| Decision | Choice | Why |
|----------|--------|-----|
| WAL strategy | Append-only, fsync batched | Durability before processing; replay-based recovery; matches Citadel and Engram patterns |
| Storage engine | fjall (LSM-tree) | Pure Rust, embeddable, FIFO compaction for event logs, prefix bloom filters |
| Vector index | USearch | 150x faster than Lucene, predicate callback during HNSW traversal, mmap, quantization; used in ScyllaDB/ClickHouse/DuckDB |
| Quantization | f16 by default | 50% memory savings, <1% recall loss; 10M × 1536D → ~31.5 GB |
| Filtered ANN | Adaptive planner | <2% selectivity: pre-filter + brute-force; 2-100%: USearch predicate callback |
| Text search | Tantivy as derived index | 40K lines of battle-tested Rust; custom Collector for score extraction; DB-primary with background indexer |
| Hybrid fusion | RRF (k=60) | Rank-based, no score normalization, proven better than CombMNZ |
| Decay model | Forward-decay formula | Mathematically exact, O(1) write/read; no raw-event scanning at query time |
| Decay storage | f64 via AtomicU64 | 15 significant digits; sufficient for 528-year precision |
| Timestamps | u64 nanoseconds since Unix epoch | Overflows year 2554; matches ClickHouse/Sonnerie; no external dependency |
| Cohort membership | RoaringBitmap | O(1) membership test; sparse fan-out for signal aggregation |
| Signal deduplication | BLAKE3 content hash | Automatic deduplication of webhook retries and client double-submissions |
| Key encoding | `[entity_id: u64 BE][0x00][TAG:suffix]` | Co-location, range scans, natural shard boundaries, no format migration needed for sharding |
| Ranking profiles | Schema declarations | Swappable at query time by name; A/B testable; no recompile on change |
| Diversity | Post-scoring reordering pass | Does not reduce result count; enforces constraints after scoring is complete |
| Error handling | `thiserror` enum with 6 variants | Typed, actionable errors; used by fjall/tantivy/tikv; no `unwrap()` outside tests |
| Observability | `tracing` spans, embedder provides subscriber | Library crate; never initializes a subscriber; `#[tracing::instrument]` at subsystem boundaries |

---

## What This Replaces

```
Elasticsearch          →  Tantivy (BM25, derived index)
Redis                  →  In-memory decay scores + windowed counters (lock-free atomics)
Kafka                  →  WAL (durable, ordered, replayable)
Feature store          →  Signal ledger + materialized aggregates
Vector DB              →  USearch (HNSW, embedded)
Ranking service        →  Named profiles, query-time scoring
```

One process. One query interface. One operational model.

The test: this query should execute in under 50 ms, incorporate signals written 100 ms ago, enforce diversity without application logic, and handle cold-start items without application intervention:

```rust
db.retrieve(Retrieve {
    entity: EntityKind::Item,
    for_user: Some("user_123"),
    context: Some("feed"),
    profile: "for_you",
    filters: vec![Filter::unseen(), Filter::not_blocked(), Filter::eq("format", "video")],
    diversity: Some(DiversitySpec { max_per_creator: Some(2), format_mix: true }),
    limit: 50,
})
```

That is what six systems currently produce. It should be one query here.