tidaldb/docs/specs/01-storage-engine.md

# Storage Engine Specification

**Status:** Draft
**Author:** tidalDB Engineering
**Last Updated:** 2026-02-20
**Prerequisites:** [VISION.md](../../VISION.md), [thoughts.md](../../thoughts.md), [Signal Ledger Research](../research/tidaldb_signal_ledger.md)

---

## 1. Design Principles

tidalDB's storage engine serves one master: the ranking query. Every design decision flows from this question: _can we score 200 candidates in under 5 microseconds while sustaining thousands of signal writes per second without losing a single event?_

The storage engine is not a general-purpose key-value store. It is a purpose-built substrate for three workloads that coexist in a single process:

1. **Signal ingestion** -- high-velocity, append-heavy, durability-critical writes (thousands/sec)
2. **Ranking reads** -- low-latency, random-access reads across hundreds of entities per query (<5 us for 200 candidates)
3. **Background materialization** -- continuous compaction of raw events into pre-computed aggregates

These workloads have fundamentally different I/O profiles. Forcing them through a single storage engine is the architectural mistake that StemeDB's hybrid routing pattern, described in thoughts.md, avoids. We use two engines, routed by key prefix, behind a single trait boundary.

### Invariants

These must hold at all times. They are not aspirational. Property tests and crash recovery tests enforce them.

1. **WAL-before-visibility.** No signal event is visible to any reader until it is durably logged in the WAL.
2. **No lost events.** A signal event acknowledged to the caller at `Immediate` or `Batched` durability (Section 3) survives any single crash. The WAL is the source of truth; everything else is derived state.
3. **Aggregate consistency.** Materialized aggregates are always computable from the WAL + raw events. If they diverge, the aggregates are wrong, not the events.
4. **Entity isolation.** A write storm on one entity type (viral item signals) does not degrade read latency for another entity type (user profile lookups).
5. **Crash recovery is bounded.** Recovery time is proportional to the WAL tail (events since last checkpoint), not total data size.
6. **Key co-location.** All data for a single entity is retrievable via a single prefix scan. No cross-entity joins at the storage layer.

---

## 2. Write-Ahead Log

The WAL is the durability primitive. Every mutation -- signal event, entity write, relationship update -- is serialized into the WAL before any downstream processing occurs. The signal ledger, entity store, search index, and materialized aggregates are all derived state that can be rebuilt from the WAL.

### 2.1 Record Format

Each WAL record is a length-prefixed, checksummed byte sequence. The format is designed for sequential write performance and crash-safe parsing.

```
WAL Record Layout (on disk)
+---------+----------+---------+--------+---------+
| Length  | Checksum | SeqNo   | Type   | Payload |
| 4 bytes | 32 bytes | 8 bytes | 1 byte | N bytes |
+---------+----------+---------+--------+---------+
|<--------- header (45 bytes) --------->|
```

Field definitions:

| Field      | Size     | Encoding          | Description                                                                         |
| ---------- | -------- | ----------------- | ----------------------------------------------------------------------------------- |
| `length`   | 4 bytes  | u32 little-endian | Total record size including header. Max record: 4 GiB.                              |
| `checksum` | 32 bytes | BLAKE3 hash       | Hash of `seqno \|\| type \|\| payload`. Covers everything after the checksum field. |
| `seqno`    | 8 bytes  | u64 little-endian | Monotonically increasing sequence number. Never reused. Survives crash recovery.    |
| `type`     | 1 byte   | u8 enum           | Record type discriminator (see below).                                              |
| `payload`  | variable | type-dependent    | Serialized record body.                                                             |

Record types:

| Value  | Name                | Description                                  |
| ------ | ------------------- | -------------------------------------------- |
| `0x01` | `SignalEvent`       | Engagement signal (view, like, skip, etc.)   |
| `0x02` | `EntityWrite`       | Entity metadata create/update                |
| `0x03` | `RelationshipWrite` | Relationship edge create/update              |
| `0x04` | `SchemaChange`      | Schema DDL (define signal, define profile)   |
| `0x05` | `Checkpoint`        | Checkpoint marker with materializer state    |
| `0x06` | `BatchBoundary`     | Group commit boundary marker                 |
| `0xFF` | `Padding`           | Fill to segment boundary (ignored on replay) |

**Why BLAKE3, not CRC32.** CRC32 detects accidental corruption but not adversarial modification. BLAKE3 is a cryptographic hash that also serves as the content-address for signal event deduplication (see Section 2.4). The cost difference is negligible -- BLAKE3 processes 1 GiB/s/core on modern hardware, and WAL records are small. Using BLAKE3 for both checksumming and deduplication avoids computing two separate hashes.
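
The framing in Section 2.1 can be sketched as follows. This is a minimal illustration, not the engine's actual encoder: `encode_record`, `HashFn`, and `toy_hash` are hypothetical names, and because BLAKE3 is not part of the Rust standard library, the hash is passed in as a function so the sketch stays dependency-free. `toy_hash` is a non-cryptographic stand-in used only to exercise the layout.

```rust
/// Size of the fixed header: length (4) + checksum (32) + seqno (8) + type (1).
pub const HEADER_LEN: usize = 4 + 32 + 8 + 1;

/// A 32-byte hash function. Production code would plug in BLAKE3 here.
pub type HashFn = fn(&[u8]) -> [u8; 32];

/// Serialize one WAL record as: length | checksum | seqno | type | payload.
/// The checksum covers everything after the checksum field.
pub fn encode_record(seqno: u64, rec_type: u8, payload: &[u8], hash: HashFn) -> Vec<u8> {
    // Bytes covered by the checksum: seqno || type || payload.
    let mut body = Vec::with_capacity(9 + payload.len());
    body.extend_from_slice(&seqno.to_le_bytes());
    body.push(rec_type);
    body.extend_from_slice(payload);

    // `length` is the total record size, header included.
    let total = (HEADER_LEN + payload.len()) as u32;
    let mut out = Vec::with_capacity(total as usize);
    out.extend_from_slice(&total.to_le_bytes());
    out.extend_from_slice(&hash(&body));
    out.extend_from_slice(&body);
    out
}

/// Hypothetical stand-in hash for illustration only; NOT cryptographic.
pub fn toy_hash(data: &[u8]) -> [u8; 32] {
    let mut h = [0u8; 32];
    for (i, b) in data.iter().enumerate() {
        h[i % 32] = h[i % 32].wrapping_mul(31).wrapping_add(*b);
    }
    h
}
```

Calling `encode_record(7, 0x01, b"hello", toy_hash)` yields a 50-byte record: the 45-byte header followed by the 5-byte payload.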

### 2.2 WAL Segments

The WAL is divided into fixed-size segments to bound file sizes and simplify cleanup.

```
WAL Segment Layout (filesystem)

data/
  wal/
    segment-000000000001.wal    # oldest active segment
    segment-000000000002.wal
    segment-000000000003.wal    # current write segment
```

| Parameter      | Default | Tuning Guidance                                                                                                                  |
| -------------- | ------- | -------------------------------------------------------------------------------------------------------------------------------- |
| `segment_size` | 64 MiB  | Larger segments reduce file count but increase recovery time. 64 MiB balances: ~2 seconds of writes at 32 MB/s sustained ingest. |
| `max_segments` | 128     | 8 GiB total WAL. Segments older than the last checkpoint are eligible for cleanup.                                               |
| `preallocate`  | `true`  | Pre-allocate segment files with `fallocate()` to avoid filesystem metadata updates on every write.                               |

**Segment lifecycle:**

1. **Create.** When the current segment reaches `segment_size`, a new segment file is pre-allocated and becomes the active write target. The segment number is the first `seqno` it will contain.
2. **Seal.** When a segment is no longer the write target, it is sealed (marked read-only). Sealed segments are used for crash recovery replay and WAL tailing by the background materializer.
3. **Cleanup.** After a checkpoint is written and confirmed durable, all segments whose highest `seqno` is less than the checkpoint's `seqno` are eligible for deletion. Cleanup runs after every checkpoint.

**Invariant:** The WAL always retains all segments from the last confirmed checkpoint forward. Deleting a segment before its records are checkpointed violates the crash recovery guarantee.
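
The cleanup rule can be sketched as a pure function over segment metadata. `SealedSegment` and `deletable_segments` are hypothetical names introduced for illustration; the sketch assumes each sealed segment knows the lowest and highest `seqno` it contains.

```rust
/// Seqno range of one sealed WAL segment (illustrative type).
pub struct SealedSegment {
    pub first_seqno: u64,
    pub last_seqno: u64,
}

/// Return the first seqno (the segment number) of every segment that may be
/// deleted: those whose highest seqno is below the checkpoint's seqno.
pub fn deletable_segments(segments: &[SealedSegment], checkpoint_seqno: u64) -> Vec<u64> {
    segments
        .iter()
        .filter(|s| s.last_seqno < checkpoint_seqno)
        .map(|s| s.first_seqno)
        .collect()
}
```

A segment that straddles the checkpoint is kept, which is what preserves the invariant above.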

### 2.3 Crash Recovery

On startup, the storage engine:

1. **Locates the last checkpoint record** by scanning backward from the newest WAL segment. The checkpoint record contains the `seqno` at which all derived state (entity store, signal aggregates, materialized views) was consistent.
2. **Replays all records after the checkpoint `seqno`** in sequence order. Each record is validated against its BLAKE3 checksum. Records with invalid checksums are discarded (they represent incomplete writes interrupted by a crash).
3. **Applies replayed records** to the entity store, signal ledger, and materialized views, bringing them to a consistent state.
4. **Writes a new checkpoint** once recovery is complete, establishing a clean recovery boundary for future crashes.

**Torn write detection.** If the last record in a segment has a valid `length` field but an invalid checksum, the write was interrupted mid-record. The record is discarded. If `length` itself is torn (partially written), the parser detects this because the remaining bytes in the segment are fewer than `length` specifies. Both cases are safe -- the record was never acknowledged to the caller (fsync had not completed), so discarding it does not violate the durability guarantee.
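
A minimal sketch of the torn-write checks, assuming the record layout from Section 2.1. All names here (`scan_valid_seqnos`, `frame`, `toy_hash`) are hypothetical, and `toy_hash` is a non-cryptographic stand-in for BLAKE3 so the example needs no external crate. The scanner stops at the first record that fails either check.

```rust
use std::convert::TryInto;

/// Fixed header: length (4) + checksum (32) + seqno (8) + type (1).
const HEADER_LEN: usize = 45;

/// Hypothetical stand-in for BLAKE3; NOT cryptographic, illustration only.
fn toy_hash(data: &[u8]) -> [u8; 32] {
    let mut h = [0u8; 32];
    for (i, b) in data.iter().enumerate() {
        h[i % 32] = h[i % 32].wrapping_mul(31).wrapping_add(*b);
    }
    h
}

/// Build one framed record (helper for exercising the scanner).
pub fn frame(seqno: u64, rec_type: u8, payload: &[u8]) -> Vec<u8> {
    let mut body = seqno.to_le_bytes().to_vec();
    body.push(rec_type);
    body.extend_from_slice(payload);
    let mut out = ((HEADER_LEN + payload.len()) as u32).to_le_bytes().to_vec();
    out.extend_from_slice(&toy_hash(&body));
    out.extend_from_slice(&body);
    out
}

/// Walk a segment buffer and return the seqnos of all valid records.
/// Stops at the first torn record: a length that is too small or runs past
/// the end of the buffer, or a checksum mismatch.
pub fn scan_valid_seqnos(segment: &[u8]) -> Vec<u64> {
    let mut seqnos = Vec::new();
    let mut pos = 0;
    while segment.len() - pos >= HEADER_LEN {
        let len = u32::from_le_bytes(segment[pos..pos + 4].try_into().unwrap()) as usize;
        if len < HEADER_LEN || len > segment.len() - pos {
            break; // torn or garbage length field
        }
        let stored = &segment[pos + 4..pos + 36];
        let body = &segment[pos + 36..pos + len]; // seqno || type || payload
        if toy_hash(body).as_slice() != stored {
            break; // length intact, body interrupted mid-write
        }
        seqnos.push(u64::from_le_bytes(body[..8].try_into().unwrap()));
        pos += len;
    }
    seqnos
}
```

Zero-filled space at the end of a pre-allocated segment decodes as `length = 0`, which the first check rejects, so the scan ends cleanly at the last complete record.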

**Recovery time bound.** Recovery replays only the WAL tail (records since last checkpoint). With the default checkpoint interval of 30 seconds (Section 8) and a write rate of 10,000 events/sec, the WAL tail contains at most ~300,000 records. At ~1 us per record replay, recovery completes in roughly 300 ms.

### 2.4 Signal Event Deduplication

Signal events are content-addressed using BLAKE3. The hash is computed over the canonical fields that define event identity:

```
BLAKE3(entity_id || signal_type || user_id || timestamp_ns)
```

The resulting 32-byte hash serves dual purpose:

1. **WAL checksum** -- the same hash stored in the WAL record header.
2. **Deduplication key** -- before appending a signal event to the WAL, the writer checks a bloom filter of recent event hashes. If the filter reports the hash as possibly present, the writer confirms the match against the WAL; a confirmed duplicate (webhook retry, client double-submit) is silently acknowledged without writing.

The deduplication bloom filter covers the last `dedup_window` (default: 5 minutes) of event hashes. At 10,000 events/sec, this is 3 million entries. A bloom filter with 10 bits/entry and 7 hash functions consumes ~3.75 MB with a false positive rate just under 1%. False positives cause a harmless duplicate check against the WAL -- they do not cause event loss.
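
The sizing arithmetic can be checked with the standard bloom filter approximations. The function names below are illustrative, and the false-positive formula `(1 - e^(-k/b))^k` is the usual textbook estimate for `b` bits per entry and `k` hash functions, not something specific to tidalDB.

```rust
/// Expected false-positive rate of a bloom filter with `bits_per_entry`
/// bits per inserted element and `k` hash functions: (1 - e^(-k/b))^k.
pub fn bloom_fpr(bits_per_entry: f64, k: u32) -> f64 {
    (1.0 - (-(k as f64) / bits_per_entry).exp()).powi(k as i32)
}

/// Number of hash functions that minimizes the rate: b * ln 2, rounded.
pub fn optimal_k(bits_per_entry: f64) -> u32 {
    (bits_per_entry * std::f64::consts::LN_2).round() as u32
}

/// Filter size in bytes for `entries` elements at `bits_per_entry` bits each.
pub fn bloom_bytes(entries: u64, bits_per_entry: u64) -> u64 {
    entries * bits_per_entry / 8
}
```

For a 5-minute window at 10,000 events/sec, `bloom_bytes(3_000_000, 10)` is 3,750,000 bytes, and `optimal_k(10.0)` is 7, which gives a rate of about 0.8%.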

---

## 3. Durability Levels

Not all writes carry the same durability requirement. A purchase event must survive any crash. An impression event can tolerate losing the last few seconds of writes. The storage engine exposes three durability levels, configurable per signal type in schema.

### 3.1 Durability Level Definitions

```rust
/// Durability guarantee for a write operation.
pub enum DurabilityLevel {
    /// fsync after every write. The write is durable when the call returns.
    /// Use for: purchases, subscriptions, blocks, reports.
    Immediate,

    /// fsync per batch. Writes are buffered until either `max_batch_size`
    /// records accumulate or `max_delay` elapses, whichever comes first.
    /// Use for: likes, comments, shares, follows (default for engagement).
    Batched {
        max_batch_size: u32,
        max_delay: Duration,
    },

    /// fsync on OS schedule (typically every 30s on Linux).
    /// Use for: impressions, scroll depth, hover events, telemetry.
    Eventual,
}
```

| Level       | Default Parameters                       | Worst-Case Data Loss on Crash              | fsync Cost                                                            |
| ----------- | ---------------------------------------- | ------------------------------------------ | --------------------------------------------------------------------- |
| `Immediate` | --                                       | 0 bytes                                    | 1 fsync per write (~200 us on NVMe)                                   |
| `Batched`   | `max_batch_size: 256`, `max_delay: 10ms` | Up to 256 records or 10 ms of writes       | 1 fsync per batch (~200 us amortized over 256 writes = ~0.8 us/write) |
| `Eventual`  | --                                       | Up to ~30 seconds of writes (OS-dependent) | 0 explicit fsyncs                                                     |

### 3.2 Group Commit

Group commit amortizes the cost of `fsync` across multiple concurrent writers. This is the same technique used by PostgreSQL's `commit_delay` and Citadel's `GroupCommitQueue`.

**Mechanism:**

1. Writers append their WAL records to an in-memory buffer and register a notification channel.
2. A dedicated **commit thread** monitors the buffer. It triggers a flush when either condition is met:
    - The buffer contains `max_batch_size` records.
    - `max_delay` has elapsed since the first unflushed record was buffered.
3. The commit thread writes all buffered records to the WAL segment file in a single `writev()` call, then issues one `fdatasync()`.
4. After `fdatasync()` returns, the commit thread notifies all waiting writers that their records are durable.
5. Writers blocked on `Immediate` or `Batched` durability wake up and return success.

```
Group Commit Timeline

Writer A:  write -----> [wait] -----> ACK
Writer B:       write --------> [wait] -> ACK
Writer C:            write ---> [wait] -> ACK
                                  |
Commit thread:          writev + fdatasync
                        (one syscall pair for 3 records)
```
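
The flush trigger in step 2 can be sketched as a small state machine. `CommitBuffer` and its methods are hypothetical names; the sketch covers only the batching decision and leaves out the `writev()`/`fdatasync()` calls and the writer notification, which the caller would perform after `take_batch`.

```rust
use std::time::{Duration, Instant};

/// In-memory buffer that decides when a group commit flush is due.
pub struct CommitBuffer {
    records: Vec<Vec<u8>>,
    first_buffered_at: Option<Instant>,
    max_batch_size: usize,
    max_delay: Duration,
}

impl CommitBuffer {
    pub fn new(max_batch_size: usize, max_delay: Duration) -> Self {
        CommitBuffer { records: Vec::new(), first_buffered_at: None, max_batch_size, max_delay }
    }

    /// Append a serialized record; remember when the batch started.
    pub fn push(&mut self, record: Vec<u8>, now: Instant) {
        if self.records.is_empty() {
            self.first_buffered_at = Some(now);
        }
        self.records.push(record);
    }

    /// True when the batch is full or the oldest record has waited `max_delay`.
    pub fn should_flush(&self, now: Instant) -> bool {
        match self.first_buffered_at {
            None => false, // nothing buffered
            Some(t0) => {
                self.records.len() >= self.max_batch_size
                    || now.duration_since(t0) >= self.max_delay
            }
        }
    }

    /// Hand the whole batch to the caller, which writes it, syncs it,
    /// and only then acknowledges the waiting writers.
    pub fn take_batch(&mut self) -> Vec<Vec<u8>> {
        self.first_buffered_at = None;
        std::mem::take(&mut self.records)
    }
}
```

Passing `now` in explicitly keeps the policy deterministic under test; a real commit thread would read the clock itself.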

**Configuration:**

| Parameter                | Default | Tuning Guidance                                                                                                                                               |
| ------------------------ | ------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `group_commit_max_batch` | 256     | Higher values amortize fsync better but increase tail latency for early arrivals in the batch. At 10K writes/sec, 256 records accumulate in ~25 ms.           |
| `group_commit_max_delay` | 10 ms   | Maximum time any writer waits for the batch to fill. 10 ms is the sweet spot: perceptible latency is >50 ms, and 10 ms captures most of the batching benefit. |

**Interaction with durability levels:**

- `Immediate` writers are always included in the next group commit flush. They wait for the fdatasync but benefit from batching with concurrent writers.
- `Batched` writers share the group commit mechanism with their configured parameters.
- `Eventual` writers append to the WAL buffer but do not wait for fdatasync. Their records ride along with the next flush but the writer returns immediately.

**Invariant:** A writer that receives an ACK for an `Immediate` or `Batched` write is guaranteed that the record has been fsynced. The group commit thread never acknowledges a record before fdatasync completes.

---

## 4. Hybrid Storage Backend

### 4.1 Rationale

tidalDB has a split personality: signal ingestion is write-heavy and append-mostly; ranking queries are read-heavy and random-access. No single storage engine excels at both.

From thoughts.md, StemeDB's key insight: _"Rather than forcing one storage engine to be good at everything, pick two and route intelligently."_ StemeDB uses fjall (LSM-tree) for write-heavy assertion appends and redb (B-tree) for read-heavy index lookups. tidalDB adopts the same pattern for the same reasons.

| Workload                       | Access Pattern                                      | Optimal Engine   | Why                                                                                                                               |
| ------------------------------ | --------------------------------------------------- | ---------------- | --------------------------------------------------------------------------------------------------------------------------------- |
| Signal event log               | Append-only, sequential writes, range scans by time | LSM-tree (fjall) | LSM-trees batch writes in memtables and flush sequentially. Write amplification with FIFO compaction is 2x.                       |
| Signal ledger (running scores) | Frequent point updates, frequent point reads        | LSM-tree (fjall) | Running decay scores are updated on every event and read on every ranking query. LSM memtable serves both from memory.            |
| Entity metadata                | Infrequent writes, frequent random reads            | B-tree (redb)    | B-trees provide O(log n) point reads with no compaction overhead. Entity metadata changes rarely but is read on every query.      |
| Relationship graph             | Infrequent writes, range scans per entity           | B-tree (redb)    | Relationships are read during social-graph-aware ranking. Range scans over a user's edges are B-tree's sweet spot.                |
| Materialized aggregates        | Periodic batch writes, frequent point reads         | B-tree (redb)    | Aggregates are written by the background materializer and read during ranking. Write frequency is low (once per rollup interval). |
| Schema definitions             | Rare writes, reads on startup + DDL                 | B-tree (redb)    | Tiny dataset, read-heavy. B-tree is simpler.                                                                                      |

### 4.2 Engine Selection

**LSM-tree: fjall v3.** Pure Rust (`#![forbid(unsafe_code)]`). Embeddable. Keyspace-based isolation (equivalent to column families). Batch write performance competitive with RocksDB. Compiles in 3.5 seconds vs RocksDB's 40 seconds. No C++ FFI boundary. Aligns with tidalDB's pure-Rust-where-possible philosophy.

**B-tree: redb.** Pure Rust. ACID transactions. Copy-on-write B-tree with MVCC. No compaction overhead. Crash-safe by design (COW means the old page is valid until the new page is fully written). Zero-copy reads.

Both engines sit behind trait boundaries (Section 4.4). If benchmarks reveal fjall or redb is insufficient for a specific workload, the engine can be swapped without touching any code outside the storage module.

### 4.3 Key Routing

All keys follow the subject-prefix encoding defined in Section 5. The router dispatches based on the tag byte in the key:

```rust
/// Routes a key to the appropriate storage backend.
fn route(key: &[u8]) -> Backend {
    let tag = extract_tag(key);
    match tag {
        Tag::Sig | Tag::Evt => Backend::Lsm,   // signal state + raw events
        Tag::Meta | Tag::Rel | Tag::Mv | Tag::Idx | Tag::Schema => Backend::Btree,
    }
}
```

```
Key Routing Diagram

                    +------------------+
   write(key, val)  |   Key Router     |
   ----------------->  extract_tag(key) |
                    +--------+---------+
                             |
              +--------------+--------------+
              |                             |
        tag in {SIG, EVT}           tag in {META, REL,
              |                      MV, IDX, SCHEMA}
              v                             v
     +--------+--------+          +--------+--------+
     |   fjall (LSM)   |          |   redb (B-tree) |
     |                  |          |                  |
     | - Signal events  |          | - Entity metadata|
     | - Decay scores   |          | - Relationships  |
     | - Window counts  |          | - Materialized   |
     | - Raw event log  |          |   aggregates     |
     +---------+--------+          | - Schema defs    |
               |                   | - Secondary idx  |
               v                   +---------+--------+
     FIFO compaction for                     |
     events; leveled for                     v
     signal state                  COW B-tree, MVCC,
                                   crash-safe by design
```
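
A self-contained sketch of tag extraction and routing, assuming the key layout from Section 5.2 (8-byte entity ID, NUL separator, ASCII tag). The function returns `Option<Backend>` rather than the bare `Backend` shown above so that malformed keys are visible in the example; that choice, and the constant names, are illustrative. `SCHEMA` is left out because Section 5.3 does not list its byte encoding.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum Backend {
    Lsm,
    Btree,
}

/// Offset of the first tag byte: 8-byte entity ID plus the NUL separator.
const TAG_OFFSET: usize = 9;

const LSM_TAGS: [&[u8]; 2] = [b"SIG", b"EVT"];
const BTREE_TAGS: [&[u8]; 4] = [b"META", b"REL", b"MV", b"IDX"];

/// Route a key by its tag. Returns `None` for keys that are too short,
/// lack the NUL separator, or carry an unknown tag.
pub fn route(key: &[u8]) -> Option<Backend> {
    let rest = key.get(TAG_OFFSET..)?;
    if key[8] != 0x00 {
        return None;
    }
    if LSM_TAGS.iter().any(|t| rest.starts_with(t)) {
        Some(Backend::Lsm)
    } else if BTREE_TAGS.iter().any(|t| rest.starts_with(t)) {
        Some(Backend::Btree)
    } else {
        None
    }
}
```

No tag is a prefix of another, so checking `starts_with` against each tag is unambiguous even though the tags differ in length.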

### 4.4 Trait Abstraction

The storage engine exposes a single trait boundary. No module outside of `storage/` knows whether data is served from fjall, redb, or an in-memory cache.

```rust
/// The storage engine trait. All access to durable state goes through this.
pub trait StorageEngine: Send + Sync {
    /// Read a single key.
    fn get(&self, key: &[u8]) -> Result<Option<Vec<u8>>, StorageError>;

    /// Write a single key-value pair. Durability is governed by the WAL,
    /// not by this call -- this updates derived state only.
    fn put(&self, key: &[u8], value: &[u8]) -> Result<(), StorageError>;

    /// Delete a key.
    fn delete(&self, key: &[u8]) -> Result<(), StorageError>;

    /// Scan all keys with the given prefix, in lexicographic order.
    fn scan_prefix(&self, prefix: &[u8]) -> Result<PrefixIterator, StorageError>;

    /// Write a batch of key-value pairs atomically within a single backend.
    fn write_batch(&self, batch: &WriteBatch) -> Result<(), StorageError>;

    /// Force all buffered data to stable storage.
    fn flush(&self) -> Result<(), StorageError>;
}
```

The `HybridStorage` implementation composes an `LsmBackend` (fjall) and a `BtreeBackend` (redb), routing each call based on key prefix as described above. Tests use an `InMemoryStorage` implementation that stores everything in a `BTreeMap`, enabling deterministic testing without disk I/O.
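
A minimal sketch of how a `BTreeMap`-backed test store might serve prefix scans. It is not the `InMemoryStorage` described above: it takes `&mut self` for writes instead of implementing the `StorageEngine` trait (which would need interior mutability), and it returns a plain iterator rather than `PrefixIterator`.

```rust
use std::collections::BTreeMap;

/// Simplified in-memory store for illustration.
#[derive(Default)]
pub struct InMemoryStorage {
    map: BTreeMap<Vec<u8>, Vec<u8>>,
}

impl InMemoryStorage {
    pub fn put(&mut self, key: &[u8], value: &[u8]) {
        self.map.insert(key.to_vec(), value.to_vec());
    }

    pub fn get(&self, key: &[u8]) -> Option<&[u8]> {
        self.map.get(key).map(|v| v.as_slice())
    }

    /// Yield all entries whose key starts with `prefix`, in byte order.
    /// The scan seeks to the first key >= prefix and stops at the first
    /// key that no longer matches.
    pub fn scan_prefix<'a>(
        &'a self,
        prefix: &'a [u8],
    ) -> impl Iterator<Item = (&'a [u8], &'a [u8])> + 'a {
        self.map
            .range(prefix.to_vec()..)
            .take_while(move |(k, _)| k.starts_with(prefix))
            .map(|(k, v)| (k.as_slice(), v.as_slice()))
    }
}
```

Because `BTreeMap` orders keys lexicographically by byte, the scan returns results in the same order a disk-backed engine would.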

---

## 5. Key Encoding Scheme

### 5.1 Design Goals

The key encoding must satisfy:

1. **Co-location.** All data for a single entity (metadata, signals, relationships, aggregates) shares a common prefix, enabling single-prefix-scan retrieval.
2. **Shard boundary.** The entity ID prefix is a natural partition key for future range-based sharding (Section 9).
3. **Lexicographic ordering.** Byte ordering matches logical ordering. Range scans over time-ordered data yield chronologically sorted results.
4. **Tag-based routing.** The tag byte enables the key router (Section 4.3) to dispatch to the correct backend without parsing the full key.

### 5.2 Key Layout

```
Subject-Prefix Key Encoding

+-------------------+------+------+---------------------------+
| Entity ID         | NUL  | Tag  | Suffix                    |
| 8 bytes           | 1 b  | 2-4b | variable                  |
+-------------------+------+------+---------------------------+
  u64 big-endian     0x00   ASCII   tag-dependent encoding

Total header: 11-13 bytes (entity_id + NUL + tag)
```

**Why big-endian for the entity ID.** Byte-lexicographic ordering of big-endian integers matches numeric ordering. This means a prefix scan over entity IDs 1000-2000 is a contiguous range scan in the storage engine. Little-endian would scatter numerically adjacent entities across the keyspace.

**Why NUL separator.** The `0x00` byte between entity ID and tag guarantees that no valid entity ID suffix collides with a tag prefix. Entity IDs are u64 values that may contain `0x00` bytes internally, but the NUL separator is always at offset 8, so parsing is unambiguous.

### 5.3 Tag Types

| Tag    | Bytes                 | Backend | Description                                         |
| ------ | --------------------- | ------- | --------------------------------------------------- |
| `EVT`  | `0x45 0x56 0x54`      | LSM     | Raw signal event log                                |
| `SIG`  | `0x53 0x49 0x47`      | LSM     | Running decay scores, window counts                 |
| `META` | `0x4D 0x45 0x54 0x41` | B-tree  | Entity metadata (title, format, embedding pointer)  |
| `REL`  | `0x52 0x45 0x4C`      | B-tree  | Relationship edges (follows, blocks, interactions)  |
| `MV`   | `0x4D 0x56`           | B-tree  | Materialized view aggregates (hourly/daily rollups) |
| `IDX`  | `0x49 0x44 0x58`      | B-tree  | Secondary indexes (inverted index postings, etc.)   |

### 5.4 Suffix Encoding by Tag

**EVT (raw signal events):**

```
{entity_id:8BE}{0x00}EVT{signal_type:2BE}{timestamp_ns:8BE}{event_hash:8}
                                                             ^-- first 8 bytes of BLAKE3
Total: 30 bytes
```

Events for a given entity and signal type are ordered chronologically by the timestamp suffix. The truncated event hash breaks ties for events at the same nanosecond.

**SIG (signal ledger state):**

```
{entity_id:8BE}{0x00}SIG{signal_type:2BE}{window_tag:1}
                                           ^-- 0x00=running, 0x01=1h, 0x02=24h, etc.
Total: 15 bytes
```

The running decay score, windowed counts, and velocity are stored as separate keys under the SIG tag. Each is a small fixed-size value (8-32 bytes).

**META (entity metadata):**

```
{entity_id:8BE}{0x00}META
Total: 13 bytes (value is the serialized entity struct)
```

**REL (relationships):**

```
{entity_id:8BE}{0x00}REL{rel_type:2BE}{target_id:8BE}
Total: 22 bytes (value is weight + metadata)
```

Range scan on `{entity_id}\x00REL` returns all relationships for an entity. Scan on `{entity_id}\x00REL{rel_type}` returns all relationships of a given type.

**MV (materialized aggregates):**

```
{entity_id:8BE}{0x00}MV{signal_type:2BE}{bucket_tag:1}{bucket_id:4BE}
                                          ^-- 0x01=hourly, 0x02=daily
Total: 18 bytes
```

`bucket_id` is hours-since-epoch for hourly rollups (a u32 holds roughly 490,000 years of hourly buckets), or days-since-epoch for daily rollups.
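
Deriving a bucket ID from a nanosecond timestamp is integer division. The helper names are illustrative, and the sketch assumes timestamps are u64 nanoseconds since the Unix epoch, as in the EVT key.

```rust
const NS_PER_HOUR: u64 = 3_600 * 1_000_000_000;
const NS_PER_DAY: u64 = 24 * NS_PER_HOUR;

/// Hours since the Unix epoch. A u64 nanosecond timestamp divided by
/// NS_PER_HOUR is at most about 5.1 million, so the cast cannot truncate.
pub fn hourly_bucket(timestamp_ns: u64) -> u32 {
    (timestamp_ns / NS_PER_HOUR) as u32
}

/// Days since the Unix epoch.
pub fn daily_bucket(timestamp_ns: u64) -> u32 {
    (timestamp_ns / NS_PER_DAY) as u32
}
```

All events within the same hour map to the same `bucket_id`, so the materializer can fold them into one MV key.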

### 5.5 Byte-Level Example

For entity ID `0x00000000000003E8` (1000), a view signal event at timestamp `1740000000000000000` ns:

```
Offset  Bytes                              Meaning
------  ---------------------------------  -------
0x00    00 00 00 00 00 00 03 E8            entity_id = 1000 (u64 BE)
0x08    00                                 NUL separator
0x09    45 56 54                           "EVT" tag
0x0C    00 01                              signal_type = 1 (view)
0x0E    18 25 B8 C7 F5 2E 00 00            timestamp_ns (u64 BE)
0x16    A3 B7 2C 19 F0 81 DD 04            event_hash (first 8 bytes of BLAKE3)
        --------------------------------
        Total: 30 bytes
```
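
The same layout expressed as an encoder, as a sketch. `encode_evt_key` is a hypothetical name, and the event hash is taken as an already-computed 32-byte value so the example does not depend on a BLAKE3 crate.

```rust
/// Build the 30-byte EVT key:
/// entity_id (8, BE) | NUL | "EVT" | signal_type (2, BE) | timestamp_ns (8, BE) | hash[0..8]
pub fn encode_evt_key(
    entity_id: u64,
    signal_type: u16,
    timestamp_ns: u64,
    event_hash: &[u8; 32],
) -> [u8; 30] {
    let mut key = [0u8; 30];
    key[0..8].copy_from_slice(&entity_id.to_be_bytes());
    key[8] = 0x00; // NUL separator, always at offset 8
    key[9..12].copy_from_slice(b"EVT");
    key[12..14].copy_from_slice(&signal_type.to_be_bytes());
    key[14..22].copy_from_slice(&timestamp_ns.to_be_bytes());
    key[22..30].copy_from_slice(&event_hash[..8]); // truncated hash breaks ties
    key
}
```

Because every integer field is big-endian, comparing two keys byte by byte orders them by entity, then signal type, then time, which is what makes chronological range scans possible.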

### 5.6 Why This Enables Sharding

The entity ID prefix is the natural shard key. A range-based partition scheme divides the entity ID space into contiguous ranges:

```
Shard 0:  entity_id [0x0000000000000000, 0x3FFFFFFFFFFFFFFF]
Shard 1:  entity_id [0x4000000000000000, 0x7FFFFFFFFFFFFFFF]
Shard 2:  entity_id [0x8000000000000000, 0xBFFFFFFFFFFFFFFF]
Shard 3:  entity_id [0xC000000000000000, 0xFFFFFFFFFFFFFFFF]
```

Because all keys for an entity share the same 8-byte prefix, shard splits never bisect an entity's data. All signals, metadata, relationships, and aggregates for entity X live on the same shard. Cross-shard ranking queries fan out by shard, score locally, and merge results -- the same scatter-gather pattern used by Elasticsearch and most distributed search engines.
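
For the equal-width layout shown above, the shard index is simply the top bits of the entity ID. This is a sketch under the assumption that the shard count is a power of two; `shard_for` is an illustrative name, and a real range-partitioned deployment would more likely consult a table of split points.

```rust
/// Shard index when the u64 entity ID space is divided into
/// 2^shard_bits equal, contiguous ranges.
pub fn shard_for(entity_id: u64, shard_bits: u32) -> u64 {
    assert!(shard_bits <= 64, "a u64 key has only 64 bits");
    if shard_bits == 0 {
        0 // single shard; avoids shifting by 64
    } else {
        entity_id >> (64 - shard_bits)
    }
}
```

With `shard_bits = 2` this reproduces the four ranges listed above.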

---

## 6. Tiered Storage

### 6.1 Architecture

Data moves through three tiers based on access pattern, not just age. A viral old video's signal state stays hot. Yesterday's impression data for a video nobody watched moves to warm.

```
Tiered Storage Architecture

+--------------------------------------------------+
|  HOT TIER (in-memory)                            |
|                                                  |
|  DashMap<EntityId, EntitySignalState>            |
|  - Running decay scores (per-lambda)             |
|  - SWAG windowed counters (1h window)            |
|  - Recent event buffer (last N events)           |
|  - Velocity estimates                            |
|                                                  |
|  Budget: ~80 bytes/entity x 10M = 800 MB         |
|  Eviction: access-pattern-based (see 6.3)        |
+------------------------+-------------------------+
                         |
           promote on    | demote when
           access        | cold
                         v
+--------------------------------------------------+
|  WARM TIER (SSD - fjall + redb)                  |
|                                                  |
|  Signal ledger state (SIG keys)                  |
|  Raw events (EVT keys, 7-day retention)          |
|  Hourly rollups (MV keys, 30-day retention)      |
|  Entity metadata (META keys)                     |
|  Relationship graph (REL keys)                   |
|                                                  |
|  Budget: ~460 GB for full workload               |
+------------------------+-------------------------+
                         |
           archive when  | load on
           beyond window | ad-hoc query
                         v
+--------------------------------------------------+
|  COLD TIER (compressed archival)                 |
|                                                  |
|  Daily rollups (MV keys, no TTL)                 |
|  Compressed raw events beyond retention window   |
|  (optional, for compliance/audit)                |
|                                                  |
|  Format: ZSTD-compressed, columnar               |
|  Budget: grows at ~320 MB/day                    |
+--------------------------------------------------+
```

### 6.2 Hot Tier Design

The hot tier is an in-memory cache of per-entity signal state, optimized for the ranking query hot path. It is NOT a source of truth -- every value in the hot tier is derivable from the WAL and warm tier.

```rust
/// Per-entity signal state, cache-line aligned for zero false sharing.
/// This is the hottest struct in the entire system. Every ranking query
/// touches ~200 of these.
#[repr(C, align(64))]
pub struct EntitySignalState {
    // -- 8 bytes: identity
    entity_id: u64,

    // -- 24 bytes: running decay scores (one per configured lambda)
    // Lambdas: ln(2)/3600 (1h), ln(2)/86400 (24h), ln(2)/604800 (7d)
    decay_scores: [f64; 3],

    // -- 8 bytes: last update timestamp
    last_update_ns: u64,

    // -- 8 bytes: windowed count (SWAG-backed, 1h window)
    window_count_1h: u32,
    velocity_1h: f32,

    // -- 8 bytes: access tracking for tier management
    last_access_ns: u64,

    // -- 8 bytes: padding to 64-byte boundary
    _pad: [u8; 8],
}
// Total: 64 bytes = exactly 1 cache line

const _: () = assert!(core::mem::size_of::<EntitySignalState>() == 64);
```
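
The struct stores one running score per lambda plus the time of the last update. The spec does not spell out the update rule in this section, so the sketch below assumes the standard exponential-decay recurrence `score' = score * e^(-lambda * dt) + weight`; both function names are hypothetical.

```rust
/// Decay constant for a given half-life in seconds: ln(2) / half_life.
pub fn lambda_for_half_life(half_life_secs: f64) -> f64 {
    std::f64::consts::LN_2 / half_life_secs
}

/// Fold a new event of the given weight into a running decayed score.
/// ASSUMPTION: standard exponential decay; not confirmed by the spec.
pub fn decay_update(
    score: f64,
    last_update_ns: u64,
    now_ns: u64,
    lambda: f64,
    weight: f64,
) -> f64 {
    // Out-of-order events (now < last update) are treated as dt = 0.
    let dt_secs = now_ns.saturating_sub(last_update_ns) as f64 / 1e9;
    score * (-lambda * dt_secs).exp() + weight
}
```

Under this rule a score of 8.0 with a 1-hour half-life decays to 4.0 after one hour, and only `decay_scores` and `last_update_ns` need to be touched per event.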

**Memory budget at scale:**

| Entities in hot tier      | Memory                |
| ------------------------- | --------------------- |
| 1 million (active)        | 64 MB                 |
| 10 million (all)          | 640 MB                |
| 1 million hot + lazy load | 64 MB + demand-loaded |

The recommended configuration for a 10M-entity deployment is to keep the most active 1-2 million entities in the hot tier (64-128 MB) and load others on demand from the warm tier. On-demand loading from redb/fjall adds ~10-50 us per entity -- acceptable for cold entities that appear infrequently in candidate sets.

### 6.3 Tier Migration Policy

Migration is driven by **access pattern**, not age. The policy uses two signals:

1. **Signal write frequency.** Entities receiving signals in the last `hot_write_window` (default: 1 hour) are hot.
2. **Ranking read frequency.** Entities that appeared in a ranking candidate set in the last `hot_read_window` (default: 15 minutes) are hot.

An entity becomes hot when it receives a signal write or is read by a ranking query. An entity becomes cold when neither condition has been true for `cold_threshold` (default: 2 hours).

| Parameter          | Default   | Tuning Guidance                                                                                                             |
| ------------------ | --------- | --------------------------------------------------------------------------------------------------------------------------- |
| `hot_write_window` | 1 hour    | Entities with recent signals stay hot. Increase for workloads with bursty signal patterns.                                  |
| `hot_read_window`  | 15 min    | Entities recently scored in ranking stay hot. Increase if the same entities are queried repeatedly (e.g., trending page).   |
| `cold_threshold`   | 2 hours   | How long an idle entity stays in memory. Decrease to reduce memory pressure; increase to absorb intermittent access spikes. |
| `max_hot_entities` | 2 million | Hard cap on hot tier size. When exceeded, the least-recently-accessed entities are evicted regardless of activity.          |

**Eviction on memory pressure:** When the hot tier reaches `max_hot_entities`, the entity with the oldest `last_access_ns` is evicted. Its state is already persisted in the warm tier (the hot tier is a cache), so eviction is a simple memory deallocation with no I/O.
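
The hot test and the eviction choice can be sketched as two pure functions. The names and the flat `(entity_id, last_access_ns)` slice are illustrative; a production hot tier would avoid a linear scan for the victim.

```rust
/// An entity is hot if it was written within `hot_write_window_ns`
/// or read by a ranking query within `hot_read_window_ns`.
pub fn is_hot(
    last_write_ns: u64,
    last_read_ns: u64,
    now_ns: u64,
    hot_write_window_ns: u64,
    hot_read_window_ns: u64,
) -> bool {
    now_ns.saturating_sub(last_write_ns) <= hot_write_window_ns
        || now_ns.saturating_sub(last_read_ns) <= hot_read_window_ns
}

/// Pick the entity with the oldest `last_access_ns` as the eviction victim.
/// Each entry is `(entity_id, last_access_ns)`.
pub fn eviction_victim(entries: &[(u64, u64)]) -> Option<u64> {
    entries
        .iter()
        .min_by_key(|(_, last_access)| *last_access)
        .map(|(id, _)| *id)
}
```

Eviction only drops the in-memory copy; the entity is reloaded from the warm tier the next time it is written or scored.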

### 6.4 Per-Signal-Window Tiering

Signal aggregates have natural temperature that correlates with window size:

| Aggregate                  | Tier                    | Update Frequency                        | Read Frequency        |
| -------------------------- | ----------------------- | --------------------------------------- | --------------------- |
| Running decay score        | Hot                     | Every signal event                      | Every ranking query   |
| 1h windowed count/velocity | Hot                     | Every signal event                      | Trending/rising sorts |
| 24h windowed count         | Warm (SIG key in fjall) | Every signal event or per-minute rollup | Hot/top-today sorts   |
| 7d windowed count          | Warm (MV key in redb)   | Hourly rollup                           | Top-this-week sorts   |
| 30d aggregate              | Warm (MV key in redb)   | Daily rollup                            | Top-this-month sorts  |
| All-time aggregate         | Cold (MV key in redb)   | Daily rollup                            | Top-all-time sorts    |

This means the 1-hour velocity computation (the backbone of trending/rising sorts) never touches disk on the hot path. The 7-day aggregate is a single point read from redb (B-tree, sub-millisecond). The all-time count is the same read cost but accessed less frequently.

---

## 7. Compaction Strategy

Compaction applies only to the LSM-tree backend (fjall). The B-tree backend (redb) uses copy-on-write pages and does not require compaction.

### 7.1 Signal Event Log (EVT keys)

**Strategy: FIFO compaction.**

The signal event log is append-only and time-ordered. FIFO compaction achieves write amplification of 2x: each event is written once when it is logged ahead of the memtable insert and once when the memtable is flushed to an L0 SST file. Old SST files are dropped whole when they fall outside the retention window.

| Parameter          | Default | Rationale                                                                                                                                                                 |
| ------------------ | ------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `evt_retention`    | 7 days  | Raw events are needed for: (a) crash recovery replay, (b) backfill when adding new decay lambdas, (c) ad-hoc historical queries. 7 days covers all active signal windows. |
| `evt_max_sst_size` | 256 MiB | Larger SSTs reduce file count; smaller SSTs enable finer-grained retention cleanup. 256 MiB balances both.                                                                |

**Why FIFO, not leveled.** Leveled compaction for append-only time-series data has write amplification of 12-32x. Solana's BlockStore measured a 6.5x speedup after switching from leveled to FIFO. For tidalDB's event log, where data is written once and deleted by time window, FIFO is strictly superior.

**Retention enforcement.** Every SST file in the event log has a maximum timestamp recorded in its metadata. A background task periodically scans SST metadata and deletes files whose maximum timestamp is older than `now - evt_retention`. Cost: O(1) per file, zero write amplification for deletion.
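
A minimal sketch of the retention scan described above. `SstMeta` and `expired_ssts` are hypothetical names for illustration and are not fjall types; the sketch assumes timestamps are nanoseconds since the epoch.

```rust
use std::path::PathBuf;
use std::time::Duration;

/// Hypothetical view of one SST file's metadata (not a fjall type).
#[derive(Debug, Clone, PartialEq)]
pub struct SstMeta {
    pub path: PathBuf,
    /// Largest event timestamp stored in the file (assumed: ns since epoch).
    pub max_timestamp_ns: u64,
}

/// Returns the files whose newest event is older than `now - evt_retention`.
/// Each file is inspected once through its metadata; file contents are never read.
pub fn expired_ssts(files: &[SstMeta], now_ns: u64, evt_retention: Duration) -> Vec<&SstMeta> {
    let cutoff_ns = now_ns.saturating_sub(evt_retention.as_nanos() as u64);
    files
        .iter()
        .filter(|f| f.max_timestamp_ns < cutoff_ns)
        .collect()
}
```

A file is dropped only when its maximum timestamp is strictly older than the cutoff, so a file that still holds any in-window event is kept whole.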

### 7.2 Signal Ledger State (SIG keys)

**Strategy: Leveled compaction.**

The signal ledger contains per-entity running decay scores and windowed counters. These are updated frequently (on every signal event) and read frequently (on every ranking query). Leveled compaction ensures read amplification stays low (1-2 levels for point reads with bloom filters).

| Parameter                   | Default | Rationale                                                                       |
| --------------------------- | ------- | ------------------------------------------------------------------------------- |
| `sig_level_size_multiplier` | 10      | Standard leveled compaction ratio. Each level is 10x the size of the previous.  |
| `sig_bloom_bits_per_key`    | 10      | 1% false positive rate. Sufficient for the signal ledger's point-read workload. |
| `sig_target_file_size`      | 64 MiB  | Balances compaction granularity with file count.                                |
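
The ~1% figure for `sig_bloom_bits_per_key` follows from the standard bloom filter false-positive formula, `(1 - e^(-k/b))^k` for `b` bits per key and `k` hash functions. The helper names below are illustrative, and the sketch assumes the filter uses the optimal number of hash functions, which the spec does not state.

```rust
/// False-positive rate of a bloom filter with `bits_per_key` bits per entry
/// and `hashes` hash functions: (1 - e^(-k/b))^k.
pub fn bloom_fpr(bits_per_key: f64, hashes: u32) -> f64 {
    let k = f64::from(hashes);
    (1.0 - (-k / bits_per_key).exp()).powf(k)
}

/// Number of hash functions that minimizes the rate: round(b * ln 2), at least 1.
pub fn optimal_hashes(bits_per_key: f64) -> u32 {
    (bits_per_key * std::f64::consts::LN_2).round().max(1.0) as u32
}
```

At 10 bits per key the optimal hash count is 7 and the rate comes out slightly under 1%.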

### 7.3 Write Amplification Analysis

For the reference workload (10M entities, 50 events/day, ~5,800 events/sec sustained):

| Component                | Daily Data Written                            | Write Amp | Disk I/O        |
| ------------------------ | --------------------------------------------- | --------- | --------------- |
| WAL                      | 32 GB/day                                     | 1x        | 32 GB/day       |
| EVT SSTs (FIFO)          | 32 GB/day                                     | 2x        | 64 GB/day       |
| SIG updates (leveled)    | ~1.6 GB/day (10M entities x 32B x ~5 updates) | ~10x      | ~16 GB/day      |
| MV rollups (B-tree, COW) | ~5 GB/day                                     | ~2x (COW) | ~10 GB/day      |
| **Total**                |                                               |           | **~122 GB/day** |

At ~122 GB/day sustained, the average write throughput is ~1.4 MB/s -- trivial for any modern NVMe SSD rated at 1+ GB/s sequential writes. The SSD write endurance requirement is ~44.5 TB/year, well within the rated endurance of enterprise NVMe drives (typically 1+ DWPD on 2 TB = 730 TB/year).
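
The totals above can be re-derived from the table. The sketch below only restates the table's figures (decimal units, 1 GB = 1,000 MB); the constant and function names are illustrative.

```rust
/// (component, logical GB written per day, write amplification), from the table.
pub const COMPONENTS: [(&str, f64, f64); 4] = [
    ("WAL", 32.0, 1.0),
    ("EVT SSTs (FIFO)", 32.0, 2.0),
    ("SIG updates (leveled)", 1.6, 10.0),
    ("MV rollups (B-tree, COW)", 5.0, 2.0),
];

/// Sum of logical bytes times write amplification across all components.
pub fn total_disk_gb_per_day() -> f64 {
    COMPONENTS.iter().map(|(_, gb, amp)| gb * amp).sum()
}

/// Average sustained throughput in MB/s for a given daily volume.
pub fn avg_mb_per_sec(gb_per_day: f64) -> f64 {
    gb_per_day * 1_000.0 / 86_400.0
}

/// Yearly write volume in TB for a given daily volume.
pub fn tb_per_year(gb_per_day: f64) -> f64 {
    gb_per_day * 365.0 / 1_000.0
}

/// Sustained event rate for the reference workload.
pub fn events_per_sec(entities: u64, events_per_entity_per_day: u64) -> f64 {
    (entities * events_per_entity_per_day) as f64 / 86_400.0
}
```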

---

## 8. Checkpoint Strategy

Checkpoints snapshot the materializer state, creating a recovery boundary that limits WAL replay length on crash restart.

### 8.1 Checkpoint Contents

A checkpoint record in the WAL (type `0x05`) contains:

```
Checkpoint Record Payload

+-------------------+-------------------+-------------------+
| Checkpoint SeqNo  | Materializer Pos  | Entity State Hash |
| 8 bytes           | 8 bytes           | 32 bytes          |
+-------------------+-------------------+-------------------+
| Timestamp         | Hot Tier Count    |
| 8 bytes           | 4 bytes           |
+-------------------+-------------------+

Total: 60 bytes
```

| Field               | Description                                                                                                                                                |
| ------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `checkpoint_seqno`  | The WAL sequence number up to which all derived state is consistent.                                                                                       |
| `materializer_pos`  | The last event `seqno` processed by the background materializer.                                                                                           |
| `entity_state_hash` | BLAKE3 hash of a deterministic serialization of all in-memory entity signal states. Used to verify that warm-tier persisted state matches in-memory state. |
| `timestamp`         | Wall-clock time of the checkpoint (for monitoring/debugging).                                                                                              |
| `hot_tier_count`    | Number of entities in the hot tier at checkpoint time (for monitoring).                                                                                    |
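
One way to encode the 60-byte payload is sketched below. The field order follows the diagram; the little-endian byte order for the integer fields is an assumption, since this section does not specify one, and `CheckpointRecord` with its methods is a hypothetical type for illustration.

```rust
use std::convert::TryInto;

pub const CHECKPOINT_PAYLOAD_LEN: usize = 60;

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct CheckpointRecord {
    pub checkpoint_seqno: u64,
    pub materializer_pos: u64,
    pub entity_state_hash: [u8; 32],
    pub timestamp: u64,
    pub hot_tier_count: u32,
}

impl CheckpointRecord {
    /// Serializes the fields in diagram order (integers little-endian: an assumption).
    pub fn encode(&self) -> [u8; CHECKPOINT_PAYLOAD_LEN] {
        let mut buf = [0u8; CHECKPOINT_PAYLOAD_LEN];
        buf[0..8].copy_from_slice(&self.checkpoint_seqno.to_le_bytes());
        buf[8..16].copy_from_slice(&self.materializer_pos.to_le_bytes());
        buf[16..48].copy_from_slice(&self.entity_state_hash);
        buf[48..56].copy_from_slice(&self.timestamp.to_le_bytes());
        buf[56..60].copy_from_slice(&self.hot_tier_count.to_le_bytes());
        buf
    }

    /// Parses a payload; returns `None` unless it is exactly 60 bytes long.
    pub fn decode(buf: &[u8]) -> Option<Self> {
        if buf.len() != CHECKPOINT_PAYLOAD_LEN {
            return None;
        }
        Some(Self {
            checkpoint_seqno: u64::from_le_bytes(buf[0..8].try_into().ok()?),
            materializer_pos: u64::from_le_bytes(buf[8..16].try_into().ok()?),
            entity_state_hash: buf[16..48].try_into().ok()?,
            timestamp: u64::from_le_bytes(buf[48..56].try_into().ok()?),
            hot_tier_count: u32::from_le_bytes(buf[56..60].try_into().ok()?),
        })
    }
}
```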

### 8.2 Checkpoint Procedure

1. **Pause signal writes** (briefly). The write path acquires a lightweight checkpoint lock. Writers that arrive during checkpoint are buffered in the group commit queue -- they do not block, they just ride the next batch.
2. **Flush entity signal state.** All dirty `EntitySignalState` entries in the hot tier are written to the warm tier (SIG keys in fjall). This is a batch write of only the entries modified since the last checkpoint.
3. **Flush fjall memtable.** Force-flush the fjall memtable to ensure all SIG key writes are durable on disk.
4. **Write checkpoint record to WAL.** The checkpoint record contains the current `seqno` and materializer position.
5. **fdatasync the WAL.** The checkpoint record is durable.
6. **Release the checkpoint lock.** Writers resume.
7. **Clean up old WAL segments.** Segments fully before the new checkpoint `seqno` are deleted.
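
The ordering of these steps matters for crash safety: old WAL segments may only be deleted after the checkpoint record is durable. A minimal sketch of that ordering, assuming a hypothetical `CheckpointBackend` trait (not part of fjall, redb, or the spec's `StorageEngine` trait) and a `Recorder` test double:

```rust
use std::io;
use std::sync::Mutex;

/// Hypothetical interface over the operations the procedure needs.
pub trait CheckpointBackend {
    fn flush_dirty_entity_state(&mut self) -> io::Result<usize>;
    fn flush_memtable(&mut self) -> io::Result<()>;
    fn append_checkpoint_record(&mut self, seqno: u64, materializer_pos: u64) -> io::Result<()>;
    fn sync_wal(&mut self) -> io::Result<()>;
    fn delete_wal_segments_before(&mut self, seqno: u64) -> io::Result<()>;
}

/// Runs steps 1-7 in order; any failure returns before segment cleanup.
pub fn run_checkpoint<B: CheckpointBackend>(
    checkpoint_lock: &Mutex<()>,
    backend: &mut B,
    seqno: u64,
    materializer_pos: u64,
) -> io::Result<usize> {
    let flushed = {
        // Step 1: take the checkpoint lock; it is released when `_guard` drops.
        let _guard = checkpoint_lock.lock().expect("checkpoint lock poisoned");
        let flushed = backend.flush_dirty_entity_state()?; // step 2
        backend.flush_memtable()?; // step 3
        backend.append_checkpoint_record(seqno, materializer_pos)?; // step 4
        backend.sync_wal()?; // step 5
        flushed
    }; // step 6: lock released here
    backend.delete_wal_segments_before(seqno)?; // step 7
    Ok(flushed)
}

/// Test double that records call order and can fail at `sync_wal`.
#[derive(Default)]
pub struct Recorder {
    pub steps: Vec<&'static str>,
    pub fail_sync: bool,
}

impl CheckpointBackend for Recorder {
    fn flush_dirty_entity_state(&mut self) -> io::Result<usize> {
        self.steps.push("flush_state");
        Ok(3)
    }
    fn flush_memtable(&mut self) -> io::Result<()> {
        self.steps.push("flush_memtable");
        Ok(())
    }
    fn append_checkpoint_record(&mut self, _seqno: u64, _pos: u64) -> io::Result<()> {
        self.steps.push("append_record");
        Ok(())
    }
    fn sync_wal(&mut self) -> io::Result<()> {
        self.steps.push("sync_wal");
        if self.fail_sync {
            return Err(io::Error::new(io::ErrorKind::Other, "injected fdatasync failure"));
        }
        Ok(())
    }
    fn delete_wal_segments_before(&mut self, _seqno: u64) -> io::Result<()> {
        self.steps.push("delete_segments");
        Ok(())
    }
}
```

If the sync in step 5 fails, the function returns before step 7, so the segments needed to recover from the previous checkpoint remain on disk.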

**Checkpoint duration.** Steps 2-5 are the critical section. Flushing dirty entity state is O(dirty entries). At the default 30-second interval with 5,800 events/sec, at most ~174,000 entities are dirty (5,800 x 30, assuming every event touches a distinct entity). At ~1 us per key-value write to fjall's memtable, this takes ~174 ms. The fdatasync adds ~200 us. Total checkpoint duration: ~175 ms in the worst case.

During this 175 ms, signal writers are not blocked -- they are buffered in the group commit queue. The only observable effect is slightly higher write latency for events that arrive during the checkpoint flush.

### 8.3 Configuration

| Parameter                    | Default    | Tuning Guidance                                                                                                                                        |
| ---------------------------- | ---------- | ------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `checkpoint_interval`        | 30 seconds | Shorter intervals reduce recovery time but increase disk I/O. At 30s with 5,800 events/sec, recovery replays ~174K records (~174 ms).                  |
| `checkpoint_dirty_threshold` | 100,000    | Force a checkpoint when this many entity states are dirty, even if the interval has not elapsed. Prevents unbounded recovery time during write spikes. |
| `max_recovery_time_target`   | 500 ms     | Advisory. The system tunes `checkpoint_interval` to keep estimated recovery time below this target.                                                    |
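
The two checkpoint triggers and the recovery estimate from the table can be sketched as below. `CheckpointConfig`, `should_checkpoint`, and `estimated_recovery` are illustrative names; the ~1 us per-record replay cost is the figure implied by "~174K records (~174 ms)".

```rust
use std::convert::TryFrom;
use std::time::Duration;

pub struct CheckpointConfig {
    pub checkpoint_interval: Duration,
    pub checkpoint_dirty_threshold: usize,
}

impl Default for CheckpointConfig {
    fn default() -> Self {
        Self {
            checkpoint_interval: Duration::from_secs(30),
            checkpoint_dirty_threshold: 100_000,
        }
    }
}

/// A checkpoint is due when the interval has elapsed or too many entities are dirty.
pub fn should_checkpoint(cfg: &CheckpointConfig, since_last: Duration, dirty_entities: usize) -> bool {
    since_last >= cfg.checkpoint_interval || dirty_entities >= cfg.checkpoint_dirty_threshold
}

/// Estimated replay time for the records accumulated since the last checkpoint.
pub fn estimated_recovery(events_per_sec: u64, since_last: Duration, per_record: Duration) -> Duration {
    let records = events_per_sec.saturating_mul(since_last.as_secs());
    // Clamp instead of truncating if the record count exceeds u32::MAX.
    per_record.saturating_mul(u32::try_from(records).unwrap_or(u32::MAX))
}
```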

### 8.4 Recovery Procedure

```
Crash Recovery Sequence

1. Open WAL segments
2. Scan backward to find last checkpoint record
3. Read checkpoint: seqno=N, materializer_pos=M
4. Replay WAL records from seqno N+1 to end:
   - SignalEvent: update entity signal state + re-derive aggregates
   - EntityWrite: apply to entity store (redb)
   - RelationshipWrite: apply to relationship store (redb)
   - SchemaChange: apply to schema store (redb)
   - Padding/BatchBoundary: skip
5. Verify: entity_state_hash matches recomputed state (debug builds)
6. Write new checkpoint at current position
7. Resume normal operation
```
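
Steps 2-4 of the sequence can be sketched over an in-memory record list. The `WalRecord` variants mirror the record types named above, but the enum itself, `ReplayStats`, and the function name are hypothetical; counting records stands in for applying them, and the sketch assumes sequence numbers start at 1.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum WalRecord {
    SignalEvent { seqno: u64 },
    EntityWrite { seqno: u64 },
    RelationshipWrite { seqno: u64 },
    SchemaChange { seqno: u64 },
    Checkpoint { seqno: u64, materializer_pos: u64 },
    Padding,
    BatchBoundary,
}

#[derive(Debug, Default, PartialEq, Eq)]
pub struct ReplayStats {
    pub signal_events: u64,
    pub entity_writes: u64,
    pub relationship_writes: u64,
    pub schema_changes: u64,
    pub skipped: u64,
}

/// Returns the last checkpoint seqno (if any) and what was replayed after it.
pub fn replay_after_last_checkpoint(wal: &[WalRecord]) -> (Option<u64>, ReplayStats) {
    // Step 2: scan backward for the most recent checkpoint record.
    let last = wal.iter().rposition(|r| matches!(r, WalRecord::Checkpoint { .. }));
    // Step 3: read its seqno N.
    let checkpoint_seqno = last.and_then(|i| match &wal[i] {
        WalRecord::Checkpoint { seqno, .. } => Some(*seqno),
        _ => None,
    });
    let start = last.map_or(0, |i| i + 1);
    let floor = checkpoint_seqno.unwrap_or(0); // assumes seqnos start at 1
    let mut stats = ReplayStats::default();
    // Step 4: replay every record with seqno > N; skip everything else.
    for record in &wal[start..] {
        match record {
            WalRecord::SignalEvent { seqno } if *seqno > floor => stats.signal_events += 1,
            WalRecord::EntityWrite { seqno } if *seqno > floor => stats.entity_writes += 1,
            WalRecord::RelationshipWrite { seqno } if *seqno > floor => stats.relationship_writes += 1,
            WalRecord::SchemaChange { seqno } if *seqno > floor => stats.schema_changes += 1,
            _ => stats.skipped += 1,
        }
    }
    (checkpoint_seqno, stats)
}
```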

---

## 9. Per-Entity-Type Isolation

### 9.1 Namespace Architecture

Items, Users, and Creators occupy separate storage namespaces. This is not merely a key prefix convention -- it maps to separate fjall keyspaces and separate redb tables. The goal: a viral item's signal burst does not contend with user profile reads at the storage engine level.

```
Storage Namespace Layout

fjall instance
  +-- keyspace: "item_signals"     (EVT + SIG keys for items)
  +-- keyspace: "user_signals"     (EVT + SIG keys for users)
  +-- keyspace: "creator_signals"  (EVT + SIG keys for creators)

redb instance
  +-- table: "item_meta"           (META keys for items)
  +-- table: "user_meta"           (META keys for users)
  +-- table: "creator_meta"        (META keys for creators)
  +-- table: "relationships"       (REL keys, all entity types)
  +-- table: "materialized_views"  (MV keys, all entity types)
  +-- table: "schema"              (schema definitions)
  +-- table: "indexes"             (IDX keys, secondary indexes)
```
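
The layout above amounts to a fixed mapping from entity type to namespace names. A small sketch, with `EntityType` and its methods as hypothetical names (the string values are taken from the layout):

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EntityType {
    Item,
    User,
    Creator,
}

impl EntityType {
    /// fjall keyspace holding this entity type's EVT and SIG keys.
    pub fn signal_keyspace(self) -> &'static str {
        match self {
            EntityType::Item => "item_signals",
            EntityType::User => "user_signals",
            EntityType::Creator => "creator_signals",
        }
    }

    /// redb table holding this entity type's META keys.
    pub fn meta_table(self) -> &'static str {
        match self {
            EntityType::Item => "item_meta",
            EntityType::User => "user_meta",
            EntityType::Creator => "creator_meta",
        }
    }
}

/// redb tables shared by all entity types.
pub const SHARED_TABLES: [&str; 4] = ["relationships", "materialized_views", "schema", "indexes"];
```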

### 9.2 Why Separate Namespaces

**Independent compaction.** Item signals compact on their own schedule without affecting user signal reads. At 10M items generating 50 events/day each, the item_signals keyspace handles ~5,800 writes/sec. User signals are typically 10x lower volume. Without isolation, item signal compaction would stall user signal reads.

**Independent memory budgets.** Each fjall keyspace has its own memtable and block cache. The item_signals keyspace can be allocated a larger memtable (more write-buffering) while user_signals gets a smaller memtable but larger block cache (more read-caching).

**Independent monitoring.** Latency, throughput, and error metrics are per-namespace. When item signal write latency spikes, you know it is an item signal problem, not a user profile problem.

**Shard-ready.** When tidalDB moves to a multi-node deployment, each namespace maps naturally to an independent shard group. Item shards and user shards can be placed on different machines based on their workload profiles.

### 9.3 Cross-Entity Reads

A ranking query touches multiple namespaces: item signals (candidate scoring), user signals (preference vector), creator signals (creator quality), and relationships (social graph). These are separate read operations that execute concurrently via async I/O or a thread pool. The storage engine does not provide cross-namespace transactions -- the query executor handles consistency by reading from a consistent WAL position.

---

## 10. Scale-Ready Design

tidalDB is single-node first. But the storage architecture is designed so that the transition to multi-node requires changing the deployment topology, not the storage engine.

### 10.1 What Stays the Same

| Component         | Single-Node                     | Multi-Node                                                                                        |
| ----------------- | ------------------------------- | ------------------------------------------------------------------------------------------------- |
| Key encoding      | `{entity_id}\x00{TAG}:{suffix}` | Identical. Entity ID prefix is the shard key.                                                     |
| WAL               | Local WAL per process           | Local WAL per shard. Each shard is a self-contained tidalDB instance.                             |
| Hybrid backend    | fjall + redb in-process         | fjall + redb per shard. Same code, same configuration.                                            |
| Trait abstraction | `StorageEngine` trait           | Same trait. The multi-node router implements `StorageEngine` by dispatching to the correct shard. |
| Checkpoints       | Local checkpoints               | Per-shard checkpoints. Same mechanism.                                                            |
| Compaction        | Local compaction                | Per-shard compaction. Same strategies.                                                            |

### 10.2 What Changes

| Concern             | Single-Node    | Multi-Node                                                                                           |
| ------------------- | -------------- | ---------------------------------------------------------------------------------------------------- |
| Shard routing       | All keys local | A routing layer maps `entity_id` to shard via consistent hashing or range partitioning.              |
| Cross-shard queries | N/A            | Ranking queries fan out to shards containing candidate entities, score locally, merge results.       |
| Replication         | N/A            | Each shard is replicated via WAL shipping (leader ships sealed WAL segments to followers).           |
| Rebalancing         | N/A            | Shard splits use the key encoding's natural range boundaries. All data for an entity moves together. |

### 10.3 Design Decisions That Enable This

1. **Entity ID as the universal prefix.** Every key starts with the entity ID. This means shard routing is a single 8-byte prefix lookup, and shard splits never bisect an entity's data.

2. **No cross-entity storage transactions.** The storage engine provides per-entity atomicity (all keys for entity X are updated atomically), not cross-entity atomicity. This means a ranking query that scores items A, B, C reads each independently -- there is no global snapshot. This is acceptable because ranking is inherently approximate, and signal staleness of a few milliseconds does not affect result quality.

3. **Namespace isolation maps to shard groups.** The per-entity-type namespaces (Section 9) are independent keyspaces and tables with their own compaction schedules and memory budgets. In a multi-node deployment, item shards can run on high-write-throughput machines while user shards run on high-read-throughput machines.

4. **WAL segments are self-contained.** Each WAL segment contains complete records that can be replayed independently. This makes WAL shipping for replication straightforward: the leader ships sealed segments to followers, who replay them locally.

5. **Checksums enable verification.** BLAKE3 checksums on every WAL record and checkpoint enable followers to verify the integrity of replicated data without trusting the network.
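
Decision 1 can be illustrated with the key encoding from Section 10.1. The sketch assumes the entity ID is stored as 8 big-endian bytes (consistent with invariant 9 in Appendix C); the function names are hypothetical, and the modulo in `shard_for` is only a placeholder for consistent hashing or range partitioning.

```rust
use std::convert::TryInto;

/// Builds `{entity_id}\x00{TAG}:{suffix}` with an 8-byte big-endian entity ID.
pub fn encode_key(entity_id: u64, tag: &str, suffix: &[u8]) -> Vec<u8> {
    let mut key = Vec::with_capacity(8 + 1 + tag.len() + 1 + suffix.len());
    key.extend_from_slice(&entity_id.to_be_bytes());
    key.push(0x00);
    key.extend_from_slice(tag.as_bytes());
    key.push(b':');
    key.extend_from_slice(suffix);
    key
}

/// Reads the shard key (the entity ID) from the first 8 bytes of any key.
pub fn entity_id_of(key: &[u8]) -> Option<u64> {
    let prefix: [u8; 8] = key.get(..8)?.try_into().ok()?;
    Some(u64::from_be_bytes(prefix))
}

/// Placeholder routing: modulo stands in for the real partitioning scheme.
pub fn shard_for(key: &[u8], shard_count: u64) -> Option<u64> {
    if shard_count == 0 {
        return None;
    }
    entity_id_of(key).map(|id| id % shard_count)
}
```

Because routing looks only at the prefix, every key belonging to one entity lands on the same shard regardless of its tag or suffix.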

---

## Appendix A: Configuration Reference

All parameters with defaults, valid ranges, and descriptions, consolidated.

### WAL Configuration

| Parameter              | Default | Range      | Description                                                      |
| ---------------------- | ------- | ---------- | ---------------------------------------------------------------- |
| `wal.segment_size`     | 64 MiB  | 16-256 MiB | Size of each WAL segment file.                                   |
| `wal.max_segments`     | 128     | 8-1024     | Maximum number of WAL segments before forced cleanup.            |
| `wal.preallocate`      | `true`  | --         | Pre-allocate segment files to avoid filesystem metadata updates. |
| `wal.dedup_window`     | 5 min   | 1-60 min   | Time window for signal event deduplication bloom filter.         |
| `wal.dedup_bloom_bits` | 10      | 5-20       | Bits per entry in the dedup bloom filter. 10 = ~1% FPR.          |

### Group Commit Configuration

| Parameter                | Default | Range    | Description                             |
| ------------------------ | ------- | -------- | --------------------------------------- |
| `group_commit.max_batch` | 256     | 1-4096   | Maximum records per group commit batch. |
| `group_commit.max_delay` | 10 ms   | 1-100 ms | Maximum time before a batch is flushed. |

### Durability Defaults (per signal type)

| Signal Category                       | Default Level           | Override In Schema              |
| ------------------------------------- | ----------------------- | ------------------------------- |
| Financial (purchase, subscribe)       | `Immediate`             | `DURABILITY immediate`          |
| Engagement (like, comment, share)     | `Batched { 256, 10ms }` | `DURABILITY batched(256, 10ms)` |
| Telemetry (impression, scroll, hover) | `Eventual`              | `DURABILITY eventual`           |
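
The durability levels and the group commit flush rule could be modeled as below. This is a sketch under assumptions: the enum and function names are hypothetical, and the `Batched` field names are borrowed from `group_commit.max_batch` and `group_commit.max_delay`.

```rust
use std::time::Duration;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Durability {
    Immediate,
    Batched { max_batch: usize, max_delay: Duration },
    Eventual,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SignalCategory {
    Financial,
    Engagement,
    Telemetry,
}

/// Default durability level per signal category, as listed in the table.
pub fn default_durability(category: SignalCategory) -> Durability {
    match category {
        SignalCategory::Financial => Durability::Immediate,
        SignalCategory::Engagement => Durability::Batched {
            max_batch: 256,
            max_delay: Duration::from_millis(10),
        },
        SignalCategory::Telemetry => Durability::Eventual,
    }
}

/// A non-empty batch is flushed when it is full or its oldest record has waited long enough.
pub fn should_flush(pending: usize, oldest_wait: Duration, max_batch: usize, max_delay: Duration) -> bool {
    pending > 0 && (pending >= max_batch || oldest_wait >= max_delay)
}
```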

### Tiered Storage Configuration

| Parameter                | Default   | Range     | Description                                        |
| ------------------------ | --------- | --------- | -------------------------------------------------- |
| `tiers.hot_write_window` | 1 hour    | 5min-24h  | Signal write recency threshold for hot tier.       |
| `tiers.hot_read_window`  | 15 min    | 1min-1h   | Ranking read recency threshold for hot tier.       |
| `tiers.cold_threshold`   | 2 hours   | 30min-24h | Inactivity duration before demotion from hot tier. |
| `tiers.max_hot_entities` | 2 million | 100K-50M  | Hard cap on hot tier entity count.                 |

### Compaction Configuration

| Parameter                         | Default | Range       | Description                                      |
| --------------------------------- | ------- | ----------- | ------------------------------------------------ |
| `compaction.evt_retention`        | 7 days  | 1-90 days   | Retention window for raw signal events.          |
| `compaction.evt_max_sst_size`     | 256 MiB | 64-1024 MiB | Target SST file size for event log.              |
| `compaction.sig_level_multiplier` | 10      | 4-20        | Leveled compaction size ratio for signal ledger. |
| `compaction.sig_bloom_bits`       | 10      | 5-20        | Bloom filter bits per key for signal ledger.     |

### Checkpoint Configuration

| Parameter                        | Default | Range     | Description                                         |
| -------------------------------- | ------- | --------- | --------------------------------------------------- |
| `checkpoint.interval`            | 30 sec  | 5sec-5min | Time between periodic checkpoints.                  |
| `checkpoint.dirty_threshold`     | 100,000 | 10K-1M    | Dirty entity count that forces an early checkpoint. |
| `checkpoint.max_recovery_target` | 500 ms  | 100ms-5s  | Advisory target for maximum crash recovery time.    |

---

## Appendix B: Filesystem Layout

```
{data_dir}/
  wal/
    segment-{seqno}.wal              # WAL segments (rotated at segment_size)
  lsm/
    item_signals/                    # fjall keyspace: item EVT + SIG keys
      ...                            # fjall internal structure
    user_signals/                    # fjall keyspace: user EVT + SIG keys
      ...
    creator_signals/                 # fjall keyspace: creator EVT + SIG keys
      ...
  btree/
    tidaldb.redb                     # single redb file containing all B-tree tables
  meta/
    config.json                      # persisted configuration (checkpoint interval, etc.)
    LOCK                             # flock-based single-writer guard
```

The `LOCK` file prevents multiple tidalDB instances from opening the same data directory. It uses `flock(LOCK_EX | LOCK_NB)` on open -- if the lock cannot be acquired, the process fails with a clear error message. This prevents silent data corruption from concurrent access.

---

## Appendix C: Invariant Checklist

These invariants must be verified by property tests and crash recovery tests. Each maps to a specific test case.

| #   | Invariant                                                                                                                 | Test Strategy                                                                                                          |
| --- | ------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------- |
| 1   | A WAL record with a valid checksum is never silently dropped during replay.                                               | Property test: write N records, replay, verify all N are present.                                                      |
| 2   | A WAL record with an invalid checksum is never applied during replay.                                                     | Property test: corrupt random bytes in WAL segment, replay, verify only valid records are applied.                     |
| 3   | Crash at any point during checkpoint leaves the previous checkpoint valid.                                                | Crash test: inject crashes during each step of the checkpoint procedure, verify recovery uses the previous checkpoint. |
| 4   | The group commit thread never ACKs a record before fdatasync completes.                                                   | Instrumented test: mock fdatasync to delay, verify writers block until it returns.                                     |
| 5   | Materialized aggregates are always consistent with the WAL.                                                               | Property test: write random signal events, compute aggregates from WAL, compare with materialized state.               |
| 6   | Key routing is deterministic: the same key always routes to the same backend.                                             | Property test: generate random keys, verify route() is a pure function.                                                |
| 7   | Entity isolation: writes to one namespace do not affect read latency in another.                                          | Benchmark test: measure user_meta read latency while saturating item_signals writes.                                   |
| 8   | Deduplication never causes a unique event to be silently dropped.                                                         | Property test: generate events with guaranteed-unique hashes, verify all are written.                                  |
| 9   | Big-endian entity ID encoding preserves numeric ordering in byte-lexicographic scans.                                     | Property test: generate random u64 pairs, verify BE encoding preserves ordering.                                       |
| 10  | After crash recovery, the hot tier state matches what would be produced by replaying all events from the last checkpoint. | Crash test: fill hot tier, crash, recover, compare entity states against fresh computation from WAL.                   |
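
As an example of how one of these checks might look, invariant 9 can be exercised without external crates. The sketch below uses a small xorshift generator in place of a property-testing framework; all names are illustrative.

```rust
/// Small xorshift64 generator so the check needs no external crates.
pub struct XorShift64(pub u64);

impl XorShift64 {
    pub fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// True when numeric order of `a` and `b` equals the byte-lexicographic
/// order of their big-endian encodings.
pub fn be_order_preserved(a: u64, b: u64) -> bool {
    a.cmp(&b) == a.to_be_bytes().cmp(&b.to_be_bytes())
}

/// Checks the property over `samples` pseudo-random pairs.
pub fn check_be_ordering(samples: usize, seed: u64) -> bool {
    // xorshift must not start from zero, so clamp the seed to at least 1.
    let mut rng = XorShift64(seed.max(1));
    (0..samples).all(|_| be_order_preserved(rng.next_u64(), rng.next_u64()))
}
```

Little-endian encoding fails the same property (for example, 1 sorts after 256 byte-wise), which is why the key prefix uses big-endian.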

---

## References

- [Signal Ledger Research](../research/tidaldb_signal_ledger.md) -- Three-tier hybrid architecture, running decay scores, SWAG, compaction analysis
- [thoughts.md](../../thoughts.md) -- Lessons from Engram (cache-line alignment), Citadel (quarantine-first durability, group commit), StemeDB (hybrid backend routing, subject-prefix keys, background materializer)
- [CODING_GUIDELINES.md](../../CODING_GUIDELINES.md) -- `#[repr(C, align(64))]` for hot structs, lock-free hot path, trait-abstracted backends
- [VISION.md](../../VISION.md) -- The ranking query that this storage engine exists to serve
- Cormode et al., "Forward Decay: A Practical Time Decay Model for Streaming Systems" (ICDE 2009) -- Running decay score correctness proof
- Tangwongsan, Hirzel, Schneider, "Sliding-Window Aggregation Algorithms" (PVLDB 2015) -- Two-Stacks SWAG algorithm
- Traub et al., "Scotty: Efficient Window Aggregation for out-of-order Stream Processing" (EDBT 2019) -- Stream-slicing for shared windows