tidaldb/docs/specs/01-storage-engine.md

Storage Engine Specification

Status: Draft
Author: tidalDB Engineering
Last Updated: 2026-02-20
Prerequisites: VISION.md, thoughts.md, Signal Ledger Research


1. Design Principles

tidalDB's storage engine serves one master: the ranking query. Every design decision flows from this question: can we score 200 candidates in under 5 microseconds while sustaining thousands of signal writes per second without losing a single event?

The storage engine is not a general-purpose key-value store. It is a purpose-built substrate for three workloads that coexist in a single process:

  1. Signal ingestion -- high-velocity, append-heavy, durability-critical writes (thousands/sec)
  2. Ranking reads -- low-latency, random-access reads across hundreds of entities per query (<5 us for 200 candidates)
  3. Background materialization -- continuous compaction of raw events into pre-computed aggregates

These workloads have fundamentally different I/O profiles. Forcing them through a single storage engine is the architectural mistake that StemeDB's hybrid routing pattern, described in thoughts.md, is designed to avoid. We use two engines, routed by the tag embedded in each key, behind a single trait boundary.

Invariants

These must hold at all times. They are not aspirational. Property tests and crash recovery tests enforce them.

  1. WAL-before-visibility. No signal event is visible to any reader until it is durably logged in the WAL.
  2. No lost events. A signal event acknowledged to the caller survives any single crash. The WAL is the source of truth; everything else is derived state.
  3. Aggregate consistency. Materialized aggregates are always computable from the WAL + raw events. If they diverge, the aggregates are wrong, not the events.
  4. Entity isolation. A write storm on one entity type (viral item signals) does not degrade read latency for another entity type (user profile lookups).
  5. Crash recovery is bounded. Recovery time is proportional to the WAL tail (events since last checkpoint), not total data size.
  6. Key co-location. All data for a single entity is retrievable via a single prefix scan. No cross-entity joins at the storage layer.

2. Write-Ahead Log

The WAL is the durability primitive. Every mutation -- signal event, entity write, relationship update -- is serialized into the WAL before any downstream processing occurs. The signal ledger, entity store, search index, and materialized aggregates are all derived state that can be rebuilt from the WAL.

2.1 Record Format

Each WAL record is a length-prefixed, checksummed byte sequence. The format is designed for sequential write performance and crash-safe parsing.

WAL Record Layout (on disk)
+---------+----------+---------+--------+---------+
| Length  | Checksum | SeqNo   | Type   | Payload |
| 4 bytes | 32 bytes | 8 bytes | 1 byte | N bytes |
+---------+----------+---------+--------+---------+
|<--------- header (45 bytes) --------->|

Field definitions:

| Field | Size | Encoding | Description |
|---|---|---|---|
| length | 4 bytes | u32 little-endian | Total record size including header. Max record: 4 GiB. |
| checksum | 32 bytes | BLAKE3 hash | Hash of the concatenation of seqno, type, and payload. Covers everything after the checksum field. |
| seqno | 8 bytes | u64 little-endian | Monotonically increasing sequence number. Never reused. Survives crash recovery. |
| type | 1 byte | u8 enum | Record type discriminator (see below). |
| payload | variable | type-dependent | Serialized record body. |

Record types:

| Value | Name | Description |
|---|---|---|
| 0x01 | SignalEvent | Engagement signal (view, like, skip, etc.) |
| 0x02 | EntityWrite | Entity metadata create/update |
| 0x03 | RelationshipWrite | Relationship edge create/update |
| 0x04 | SchemaChange | Schema DDL (define signal, define profile) |
| 0x05 | Checkpoint | Checkpoint marker with materializer state |
| 0x06 | BatchBoundary | Group commit boundary marker |
| 0xFF | Padding | Fill to segment boundary (ignored on replay) |

Why BLAKE3, not CRC32. CRC32 detects accidental corruption but not adversarial modification. BLAKE3 is a cryptographic hash that also serves as the content-address for signal event deduplication (see Section 2.4). The cost difference is negligible -- BLAKE3 processes 1 GiB/s/core on modern hardware, and WAL records are small. Using BLAKE3 for both checksumming and deduplication avoids computing two separate hashes.
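
The record layout above can be sketched as a pair of encode/decode functions. This is a minimal illustration, not the engine's implementation: `encode_record`, `decode_record`, and `HashFn` are hypothetical names, and the hash is passed in as a function pointer so the sketch compiles with the standard library alone. The `toy_hash` stand-in is for demonstration only and has none of BLAKE3's properties.

```rust
/// Header size: length (4) + checksum (32) + seqno (8) + type (1).
const HEADER_LEN: usize = 45;

/// Hypothetical stand-in for a 32-byte hash such as BLAKE3.
type HashFn = fn(&[u8]) -> [u8; 32];

/// Serializes one record as length | checksum | seqno | type | payload.
fn encode_record(seqno: u64, rtype: u8, payload: &[u8], hash: HashFn) -> Vec<u8> {
    // The checksum covers everything after the checksum field.
    let mut body = Vec::with_capacity(9 + payload.len());
    body.extend_from_slice(&seqno.to_le_bytes());
    body.push(rtype);
    body.extend_from_slice(payload);

    let total = (HEADER_LEN + payload.len()) as u32;
    let mut out = Vec::with_capacity(total as usize);
    out.extend_from_slice(&total.to_le_bytes());
    out.extend_from_slice(&hash(&body));
    out.extend_from_slice(&body);
    out
}

/// Parses the record at the front of `buf`. Returns `None` when the record
/// is truncated or its checksum does not match.
fn decode_record(buf: &[u8], hash: HashFn) -> Option<(u64, u8, &[u8])> {
    if buf.len() < HEADER_LEN {
        return None;
    }
    let mut len_bytes = [0u8; 4];
    len_bytes.copy_from_slice(&buf[0..4]);
    let total = u32::from_le_bytes(len_bytes) as usize;
    if total < HEADER_LEN || total > buf.len() {
        return None; // length field is torn or the body is incomplete
    }
    let body = &buf[36..total];
    let expected = hash(body);
    if expected[..] != buf[4..36] {
        return None; // checksum mismatch
    }
    let mut seq_bytes = [0u8; 8];
    seq_bytes.copy_from_slice(&body[0..8]);
    Some((u64::from_le_bytes(seq_bytes), body[8], &body[9..]))
}

/// Toy hash for demonstration only; NOT a substitute for BLAKE3.
fn toy_hash(data: &[u8]) -> [u8; 32] {
    let mut h = [0u8; 32];
    for (i, b) in data.iter().enumerate() {
        h[i % 32] = h[i % 32].wrapping_mul(31).wrapping_add(*b);
    }
    h
}
```

Passing the hash as a parameter keeps the framing logic testable without pulling in a hashing crate; a real build would presumably call BLAKE3 directly.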

2.2 WAL Segments

The WAL is divided into fixed-size segments to bound file sizes and simplify cleanup.

WAL Segment Layout (filesystem)

data/
  wal/
    segment-000000000001.wal    # oldest active segment
    segment-000000000002.wal
    segment-000000000003.wal    # current write segment

| Parameter | Default | Tuning Guidance |
|---|---|---|
| segment_size | 64 MiB | Larger segments reduce file count but increase recovery time. 64 MiB balances: ~2 seconds of writes at 32 MB/s sustained ingest. |
| max_segments | 128 | 8 GiB total WAL. Segments older than the last checkpoint are eligible for cleanup. |
| preallocate | true | Pre-allocate segment files with fallocate() to avoid filesystem metadata updates on every write. |

Segment lifecycle:

  1. Create. When the current segment reaches segment_size, a new segment file is pre-allocated and becomes the active write target. The segment number is the first seqno it will contain.
  2. Seal. When a segment is no longer the write target, it is sealed (marked read-only). Sealed segments are used for crash recovery replay and WAL tailing by the background materializer.
  3. Cleanup. After a checkpoint is written and confirmed durable, all segments whose highest seqno is less than the checkpoint's seqno are eligible for deletion. Cleanup runs after every checkpoint.

Invariant: The WAL always retains all segments from the last confirmed checkpoint forward. Deleting a segment before its records are checkpointed violates the crash recovery guarantee.
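
The cleanup rule in step 3 reduces to a single comparison per sealed segment. The sketch below assumes a hypothetical `SegmentMeta` struct that tracks the first and last seqno of each segment; the file-name format mirrors the listing above.

```rust
/// Metadata for one sealed WAL segment (field names are illustrative).
#[derive(Debug, Clone, PartialEq)]
struct SegmentMeta {
    first_seqno: u64,
    last_seqno: u64,
}

impl SegmentMeta {
    /// File name derived from the first seqno the segment contains.
    fn file_name(&self) -> String {
        format!("segment-{:012}.wal", self.first_seqno)
    }
}

/// Returns the segments that may be deleted once a checkpoint at
/// `checkpoint_seqno` is confirmed durable: those whose highest seqno
/// is strictly less than the checkpoint's seqno.
fn eligible_for_cleanup(segments: &[SegmentMeta], checkpoint_seqno: u64) -> Vec<SegmentMeta> {
    segments
        .iter()
        .filter(|s| s.last_seqno < checkpoint_seqno)
        .cloned()
        .collect()
}
```

A segment that contains the checkpoint record itself is never returned, which is what preserves the invariant stated above.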

2.3 Crash Recovery

On startup, the storage engine:

  1. Locates the last checkpoint record by scanning backward from the newest WAL segment. The checkpoint record contains the seqno at which all derived state (entity store, signal aggregates, materialized views) was consistent.
  2. Replays all records after the checkpoint seqno in sequence order. Each record is validated against its BLAKE3 checksum. Records with invalid checksums are discarded (they represent incomplete writes interrupted by a crash).
  3. Applies replayed records to the entity store, signal ledger, and materialized views, bringing them to a consistent state.
  4. Writes a new checkpoint once recovery is complete, establishing a clean recovery boundary for future crashes.

Torn write detection. If the last record in a segment has a valid length field but an invalid checksum, the write was interrupted mid-record. The record is discarded. If length itself is torn (partially written), the parser detects this because the remaining bytes in the segment are fewer than length specifies. Both cases are safe -- the record was never acknowledged to the caller (fsync had not completed), so discarding it does not violate the durability guarantee.

Recovery time bound. Recovery replays only the WAL tail (records since last checkpoint). With the default checkpoint interval of 30 seconds (Section 8) and a write rate of 10,000 events/sec, the WAL tail contains at most ~300,000 records. At ~1 us per record replay, recovery completes in roughly 300 ms.
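
The bound is simple arithmetic, sketched here so it can be re-run with other checkpoint intervals or write rates. `recovery_bound` is a hypothetical helper, and the per-record replay cost is the estimate quoted above, not a measurement.

```rust
use std::time::Duration;

/// Upper bound on the WAL tail and on the time needed to replay it.
/// Returns (records in the tail, replay duration).
fn recovery_bound(
    checkpoint_interval: Duration,
    events_per_sec: u64,
    replay_cost_per_record: Duration,
) -> (u64, Duration) {
    let tail_records = checkpoint_interval.as_secs() * events_per_sec;
    let nanos = tail_records * replay_cost_per_record.as_nanos() as u64;
    (tail_records, Duration::from_nanos(nanos))
}
```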

2.4 Signal Event Deduplication

Signal events are content-addressed using BLAKE3. The hash is computed over the canonical fields that define event identity:

BLAKE3(entity_id || signal_type || user_id || timestamp_ns)

The resulting 32-byte hash serves dual purpose:

  1. WAL checksum -- the same hash stored in the WAL record header.
  2. Deduplication key -- before appending a signal event to the WAL, the writer checks a bloom filter of recent event hashes. If the filter reports the hash as present, the event is a probable duplicate (webhook retry, client double-submit); once the WAL check described below confirms it, the event is acknowledged without writing.

The deduplication bloom filter covers the last dedup_window (default: 5 minutes) of event hashes. At 10,000 events/sec, this is 3 million entries. A bloom filter with 10 bits/entry and 7 hash functions consumes ~3.75 MB with a false positive rate just under 1% (3 hash functions at the same density would give about 1.7%). False positives cause a harmless duplicate check against the WAL -- they do not cause event loss.
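
The sizing can be checked with the standard Bloom filter approximation, where the false positive rate for k hash functions and b bits per entry is (1 - e^(-k/b))^k. The helper names below are hypothetical; the formula is the textbook approximation, not something specific to tidalDB.

```rust
/// Approximate false-positive rate of a Bloom filter with
/// `bits_per_entry` bits per inserted element and `k` hash functions.
fn bloom_fpr(bits_per_entry: f64, k: u32) -> f64 {
    (1.0 - (-(k as f64) / bits_per_entry).exp()).powi(k as i32)
}

/// Filter size in bytes for `entries` elements.
fn bloom_bytes(entries: u64, bits_per_entry: u64) -> u64 {
    entries * bits_per_entry / 8
}

/// Number of hashes retained over the deduplication window.
fn window_entries(events_per_sec: u64, window_secs: u64) -> u64 {
    events_per_sec * window_secs
}
```

Evaluating `bloom_fpr` for a few values of `k` shows how sensitive the rate is to the number of hash functions at a fixed bit budget.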


3. Durability Levels

Not all writes carry the same durability requirement. A purchase event must survive any crash. An impression event can tolerate losing the last 100 ms of writes. The storage engine exposes three durability levels, configurable per signal type in schema.

3.1 Durability Level Definitions

use std::time::Duration;

/// Durability guarantee for a write operation.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DurabilityLevel {
    /// fsync after every write. The write is durable when the call returns.
    /// Use for: purchases, subscriptions, blocks, reports.
    Immediate,

    /// fsync per batch. Writes are buffered until either `max_batch_size`
    /// records accumulate or `max_delay` elapses, whichever comes first.
    /// Use for: likes, comments, shares, follows (default for engagement).
    Batched {
        max_batch_size: u32,
        max_delay: Duration,
    },

    /// fsync on OS schedule (typically every 30s on Linux).
    /// Use for: impressions, scroll depth, hover events, telemetry.
    Eventual,
}

| Level | Default Parameters | Worst-Case Data Loss on Crash | fsync Cost |
|---|---|---|---|
| Immediate | -- | 0 bytes | 1 fsync per write (~200 us on NVMe) |
| Batched | max_batch_size: 256, max_delay: 10ms | Up to 256 records or 10 ms of writes | 1 fsync per batch (~200 us amortized over 256 writes = ~0.8 us/write) |
| Eventual | -- | Up to ~30 seconds of writes (OS-dependent) | 0 explicit fsyncs |

3.2 Group Commit

Group commit amortizes the cost of fsync across multiple concurrent writers. This is the same technique used by PostgreSQL's commit_delay and Citadel's GroupCommitQueue.

Mechanism:

  1. Writers append their WAL records to an in-memory buffer and register a notification channel.
  2. A dedicated commit thread monitors the buffer. It triggers a flush when either condition is met:
    • The buffer contains max_batch_size records.
    • max_delay has elapsed since the first unflushed record was buffered.
  3. The commit thread writes all buffered records to the WAL segment file in a single writev() call, then issues one fdatasync().
  4. After fdatasync() returns, the commit thread notifies all waiting writers that their records are durable.
  5. Writers blocked on Immediate or Batched durability wake up and return success.
Group Commit Timeline

Writer A:  write -----> [wait] -----> ACK
Writer B:       write --------> [wait] -> ACK
Writer C:            write ---> [wait] -> ACK
                                  |
Commit thread:          writev + fdatasync
                        (one syscall pair for 3 records)

Configuration:

| Parameter | Default | Tuning Guidance |
|---|---|---|
| group_commit_max_batch | 256 | Higher values amortize fsync better but increase tail latency for early arrivals in the batch. At 10K writes/sec, 256 records accumulate in ~25 ms. |
| group_commit_max_delay | 10 ms | Maximum time any writer waits for the batch to fill. 10 ms is the sweet spot: perceptible latency is >50 ms, and 10 ms captures most of the batching benefit. |

Interaction with durability levels:

  • Immediate writers are always included in the next group commit flush. They wait for the fdatasync but benefit from batching with concurrent writers.
  • Batched writers share the group commit mechanism with their configured parameters.
  • Eventual writers append to the WAL buffer but do not wait for fdatasync. Their records ride along with the next flush but the writer returns immediately.

Invariant: A writer that receives an ACK for an Immediate or Batched write is guaranteed that the record has been fsynced. The group commit thread never acknowledges a record before fdatasync completes.
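
The two flush conditions from the mechanism above can be isolated from threading and I/O. The sketch below is one possible shape for the commit thread's buffer, with hypothetical names (`CommitBuffer`, `take_batch_if_due`). Time is passed in explicitly as an offset from an arbitrary origin so the logic is deterministic; the actual writev(), fdatasync(), and writer notification are left to the caller.

```rust
use std::time::Duration;

/// Flush thresholds (defaults in the configuration table are 256 and 10 ms).
struct GroupCommitConfig {
    max_batch: usize,
    max_delay: Duration,
}

/// In-memory buffer owned by the commit thread.
struct CommitBuffer {
    cfg: GroupCommitConfig,
    records: Vec<Vec<u8>>,
    first_buffered_at: Option<Duration>,
}

impl CommitBuffer {
    fn new(cfg: GroupCommitConfig) -> Self {
        CommitBuffer { cfg, records: Vec::new(), first_buffered_at: None }
    }

    /// Appends a serialized WAL record and remembers when the batch started.
    fn push(&mut self, record: Vec<u8>, now: Duration) {
        if self.records.is_empty() {
            self.first_buffered_at = Some(now);
        }
        self.records.push(record);
    }

    /// Returns the batch to write if either flush condition holds. The caller
    /// would write it with one writev(), issue one fdatasync(), and only then
    /// acknowledge the waiting writers.
    fn take_batch_if_due(&mut self, now: Duration) -> Option<Vec<Vec<u8>>> {
        let first = self.first_buffered_at?;
        let full = self.records.len() >= self.cfg.max_batch;
        let timed_out = now >= first + self.cfg.max_delay;
        if full || timed_out {
            self.first_buffered_at = None;
            Some(std::mem::replace(&mut self.records, Vec::new()))
        } else {
            None
        }
    }
}
```

Keeping the decision logic free of clocks and syscalls makes the "never acknowledge before fdatasync" invariant a property of the caller's ordering, which is easier to test in isolation.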


4. Hybrid Storage Backend

4.1 Rationale

tidalDB has a split personality: signal ingestion is write-heavy and append-mostly; ranking queries are read-heavy and random-access. No single storage engine excels at both.

From thoughts.md, StemeDB's key insight: "Rather than forcing one storage engine to be good at everything, pick two and route intelligently." StemeDB uses fjall (LSM-tree) for write-heavy assertion appends and redb (B-tree) for read-heavy index lookups. tidalDB adopts the same pattern for the same reasons.

| Workload | Access Pattern | Optimal Engine | Why |
|---|---|---|---|
| Signal event log | Append-only, sequential writes, range scans by time | LSM-tree (fjall) | LSM-trees batch writes in memtables and flush sequentially. Write amplification with FIFO compaction is 2x. |
| Signal ledger (running scores) | Frequent point updates, frequent point reads | LSM-tree (fjall) | Running decay scores are updated on every event and read on every ranking query. LSM memtable serves both from memory. |
| Entity metadata | Infrequent writes, frequent random reads | B-tree (redb) | B-trees provide O(log n) point reads with no compaction overhead. Entity metadata changes rarely but is read on every query. |
| Relationship graph | Infrequent writes, range scans per entity | B-tree (redb) | Relationships are read during social-graph-aware ranking. Range scans over a user's edges are B-tree's sweet spot. |
| Materialized aggregates | Periodic batch writes, frequent point reads | B-tree (redb) | Aggregates are written by the background materializer and read during ranking. Write frequency is low (once per rollup interval). |
| Schema definitions | Rare writes, reads on startup + DDL | B-tree (redb) | Tiny dataset, read-heavy. B-tree is simpler. |

4.2 Engine Selection

LSM-tree: fjall v3. Pure Rust (#![forbid(unsafe_code)]). Embeddable. Keyspace-based isolation (equivalent to column families). Batch write performance competitive with RocksDB. Compiles in 3.5 seconds vs RocksDB's 40 seconds. No C++ FFI boundary. Aligns with tidalDB's pure-Rust-where-possible philosophy.

B-tree: redb. Pure Rust. ACID transactions. Copy-on-write B-tree with MVCC. No compaction overhead. Crash-safe by design (COW means the old page is valid until the new page is fully written). Zero-copy reads via memory mapping.

Both engines sit behind trait boundaries (Section 4.4). If benchmarks reveal fjall or redb is insufficient for a specific workload, the engine can be swapped without touching any code outside the storage module.

4.3 Key Routing

All keys follow the subject-prefix encoding defined in Section 5. The router dispatches based on the tag byte in the key:

/// Routes a key to the appropriate storage backend.
fn route(key: &[u8]) -> Backend {
    let tag = extract_tag(key);
    match tag {
        Tag::Sig | Tag::Evt => Backend::Lsm,   // signal state + raw events
        Tag::Meta | Tag::Rel | Tag::Mv | Tag::Idx | Tag::Schema => Backend::Btree,
    }
}
Key Routing Diagram

                    +------------------+
   write(key, val)  |   Key Router     |
   ----------------->  extract_tag(key) |
                    +--------+---------+
                             |
              +--------------+--------------+
              |                             |
        tag in {SIG, EVT}           tag in {META, REL,
              |                      MV, IDX, SCHEMA}
              v                             v
     +--------+--------+          +--------+--------+
     |   fjall (LSM)   |          |   redb (B-tree) |
     |                  |          |                  |
     | - Signal events  |          | - Entity metadata|
     | - Decay scores   |          | - Relationships  |
     | - Window counts  |          | - Materialized   |
     | - Raw event log  |          |   aggregates     |
     +---------+--------+          | - Schema defs    |
               |                   | - Secondary idx  |
               v                   +---------+--------+
     FIFO compaction for                     |
     events; leveled for                     v
     signal state                  COW B-tree, MVCC,
                                   crash-safe by design
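
The router depends on an `extract_tag` helper that the snippet above leaves undefined. One way it might look is sketched here, assuming the key layout from Section 5.2 (8-byte entity ID, NUL separator, then the ASCII tag) and the byte values from the tag table in Section 5.3. The SCHEMA tag is omitted because its byte encoding is not given in that table, and returning `Option` for malformed keys is a choice made for this sketch.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Tag { Evt, Sig, Meta, Rel, Mv, Idx }

#[derive(Debug, Clone, Copy, PartialEq)]
enum Backend { Lsm, Btree }

/// Offset of the tag: 8-byte entity ID followed by the NUL separator.
const TAG_OFFSET: usize = 9;

/// Reads the tag that follows the entity ID and NUL separator. Returns
/// `None` for keys that are too short, lack the separator, or carry a
/// tag this sketch does not know.
fn extract_tag(key: &[u8]) -> Option<Tag> {
    if key.len() <= TAG_OFFSET || key[8] != 0x00 {
        return None;
    }
    let rest = &key[TAG_OFFSET..];
    if rest.starts_with(b"EVT") {
        Some(Tag::Evt)
    } else if rest.starts_with(b"SIG") {
        Some(Tag::Sig)
    } else if rest.starts_with(b"MET") {
        Some(Tag::Meta) // the META tag is encoded as the bytes 0x4D 0x45 0x54
    } else if rest.starts_with(b"REL") {
        Some(Tag::Rel)
    } else if rest.starts_with(b"MV") {
        Some(Tag::Mv)
    } else if rest.starts_with(b"IDX") {
        Some(Tag::Idx)
    } else {
        None
    }
}

/// Routes a key to a backend, mirroring the match in the router above.
fn route(key: &[u8]) -> Option<Backend> {
    extract_tag(key).map(|tag| match tag {
        Tag::Evt | Tag::Sig => Backend::Lsm,
        Tag::Meta | Tag::Rel | Tag::Mv | Tag::Idx => Backend::Btree,
    })
}
```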

4.4 Trait Abstraction

The storage engine exposes a single trait boundary. No module outside of storage/ knows whether data is served from fjall, redb, or an in-memory cache.

/// The storage engine trait. All access to durable state goes through this.
pub trait StorageEngine: Send + Sync {
    /// Read a single key.
    fn get(&self, key: &[u8]) -> Result<Option<Vec<u8>>, StorageError>;

    /// Write a single key-value pair. Durability is governed by the WAL,
    /// not by this call -- this updates derived state only.
    fn put(&self, key: &[u8], value: &[u8]) -> Result<(), StorageError>;

    /// Delete a key.
    fn delete(&self, key: &[u8]) -> Result<(), StorageError>;

    /// Scan all keys with the given prefix, in lexicographic order.
    fn scan_prefix(&self, prefix: &[u8]) -> Result<PrefixIterator, StorageError>;

    /// Write a batch of key-value pairs atomically within a single backend.
    fn write_batch(&self, batch: &WriteBatch) -> Result<(), StorageError>;

    /// Force all buffered data to stable storage.
    fn flush(&self) -> Result<(), StorageError>;
}

The HybridStorage implementation composes an LsmBackend (fjall) and a BtreeBackend (redb), routing each call based on key prefix as described above. Tests use an InMemoryStorage implementation that stores everything in a BTreeMap, enabling deterministic testing without disk I/O.
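
A BTreeMap-backed store along the lines described above might look like the following. It is a simplified sketch: it mirrors only `get`, `put`, `delete`, and `scan_prefix`, returns plain values instead of `Result<_, StorageError>`, and materializes the scan into a `Vec` instead of the `PrefixIterator` type, since those types are not defined in this section.

```rust
use std::collections::BTreeMap;
use std::sync::RwLock;

/// Simplified in-memory store for deterministic tests. A poisoned lock is
/// treated as fatal in this sketch.
#[derive(Default)]
struct InMemoryStorage {
    map: RwLock<BTreeMap<Vec<u8>, Vec<u8>>>,
}

impl InMemoryStorage {
    fn get(&self, key: &[u8]) -> Option<Vec<u8>> {
        self.map.read().expect("lock poisoned").get(key).cloned()
    }

    fn put(&self, key: &[u8], value: &[u8]) {
        self.map
            .write()
            .expect("lock poisoned")
            .insert(key.to_vec(), value.to_vec());
    }

    fn delete(&self, key: &[u8]) {
        self.map.write().expect("lock poisoned").remove(key);
    }

    /// Returns all pairs whose key starts with `prefix`, in lexicographic
    /// order. The range starts at the prefix itself and stops at the first
    /// key that no longer shares it.
    fn scan_prefix(&self, prefix: &[u8]) -> Vec<(Vec<u8>, Vec<u8>)> {
        let guard = self.map.read().expect("lock poisoned");
        let out: Vec<(Vec<u8>, Vec<u8>)> = guard
            .range(prefix.to_vec()..)
            .take_while(|(k, _)| k.starts_with(prefix))
            .map(|(k, v)| (k.clone(), v.clone()))
            .collect();
        out
    }
}
```

Because BTreeMap iterates keys in byte order, the same lexicographic guarantees that the key encoding relies on (Section 5) hold in tests.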


5. Key Encoding Scheme

5.1 Design Goals

The key encoding must satisfy:

  1. Co-location. All data for a single entity (metadata, signals, relationships, aggregates) shares a common prefix, enabling single-prefix-scan retrieval.
  2. Shard boundary. The entity ID prefix is a natural partition key for future range-based sharding (Section 9).
  3. Lexicographic ordering. Byte ordering matches logical ordering. Range scans over time-ordered data yield chronologically sorted results.
  4. Tag-based routing. The tag byte enables the key router (Section 4.3) to dispatch to the correct backend without parsing the full key.

5.2 Key Layout

Subject-Prefix Key Encoding

+-------------------+------+------+---------------------------+
| Entity ID         | NUL  | Tag  | Suffix                    |
| 8 bytes           | 1 b  | 1-3b | variable                  |
+-------------------+------+------+---------------------------+
  u64 big-endian     0x00   ASCII   tag-dependent encoding

Total header: 10-12 bytes (entity_id + NUL + tag)

Why big-endian for the entity ID. Byte-lexicographic ordering of big-endian integers matches numeric ordering. This means a prefix scan over entity IDs 1000-2000 is a contiguous range scan in the storage engine. Little-endian would scatter numerically adjacent entities across the keyspace.

Why NUL separator. The 0x00 byte between entity ID and tag guarantees that no valid entity ID suffix collides with a tag prefix. Entity IDs are u64 values that may contain 0x00 bytes internally, but the NUL separator is always at offset 8, so parsing is unambiguous.

5.3 Tag Types

| Tag | Bytes | Backend | Description |
|---|---|---|---|
| EVT | 0x45 0x56 0x54 | LSM | Raw signal event log |
| SIG | 0x53 0x49 0x47 | LSM | Running decay scores, window counts |
| META | 0x4D 0x45 0x54 | B-tree | Entity metadata (title, format, embedding pointer) |
| REL | 0x52 0x45 0x4C | B-tree | Relationship edges (follows, blocks, interactions) |
| MV | 0x4D 0x56 | B-tree | Materialized view aggregates (hourly/daily rollups) |
| IDX | 0x49 0x44 0x58 | B-tree | Secondary indexes (inverted index postings, etc.) |

5.4 Suffix Encoding by Tag

EVT (raw signal events):

{entity_id:8BE}{0x00}EVT{signal_type:2BE}{timestamp_ns:8BE}{event_hash:8}
                                                             ^-- first 8 bytes of BLAKE3
Total: 30 bytes

Events for a given entity and signal type are ordered chronologically by the timestamp suffix. The truncated event hash breaks ties for events at the same nanosecond.

SIG (signal ledger state):

{entity_id:8BE}{0x00}SIG{signal_type:2BE}{window_tag:1}
                                           ^-- 0x00=running, 0x01=1h, 0x02=24h, etc.
Total: 15 bytes (8 + 1 + 3 + 2 + 1)

The running decay score, windowed counts, and velocity are stored as separate keys under the SIG tag. Each is a small fixed-size value (8-32 bytes).

META (entity metadata):

{entity_id:8BE}{0x00}META
Total: 12 bytes (the tag is stored as the three bytes 0x4D 0x45 0x54 listed in Section 5.3; value is the serialized entity struct)

REL (relationships):

{entity_id:8BE}{0x00}REL{rel_type:2BE}{target_id:8BE}
Total: 22 bytes (8 + 1 + 3 + 2 + 8; value is weight + metadata)

Range scan on {entity_id}\x00REL returns all relationships for an entity. Scan on {entity_id}\x00REL{rel_type} returns all relationships of a given type.

MV (materialized aggregates):

{entity_id:8BE}{0x00}MV{signal_type:2BE}{bucket_tag:1}{bucket_id:4BE}
                                          ^-- 0x01=hourly, 0x02=daily
Total: 18 bytes

bucket_id is hours-since-epoch for hourly rollups, or days-since-epoch for daily rollups. Both fit comfortably in a u32: 2^32 hours is roughly 490,000 years.
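
As a sketch of the MV layout, the functions below derive bucket IDs from a nanosecond timestamp and assemble the 18-byte key. The function names are hypothetical, and the epoch is assumed to be the Unix epoch, which this section does not state explicitly.

```rust
const NANOS_PER_HOUR: u64 = 3_600 * 1_000_000_000;
const NANOS_PER_DAY: u64 = 24 * NANOS_PER_HOUR;

/// Hours since the epoch for a nanosecond timestamp.
fn hourly_bucket(timestamp_ns: u64) -> u32 {
    (timestamp_ns / NANOS_PER_HOUR) as u32
}

/// Days since the epoch for a nanosecond timestamp.
fn daily_bucket(timestamp_ns: u64) -> u32 {
    (timestamp_ns / NANOS_PER_DAY) as u32
}

/// Builds the 18-byte MV key:
/// entity_id | NUL | "MV" | signal_type | bucket_tag | bucket_id.
/// All integers are big-endian so byte order matches numeric order.
fn encode_mv_key(entity_id: u64, signal_type: u16, bucket_tag: u8, bucket_id: u32) -> [u8; 18] {
    let mut key = [0u8; 18];
    key[0..8].copy_from_slice(&entity_id.to_be_bytes());
    key[8] = 0x00;
    key[9..11].copy_from_slice(b"MV");
    key[11..13].copy_from_slice(&signal_type.to_be_bytes());
    key[13] = bucket_tag; // 0x01 = hourly, 0x02 = daily
    key[14..18].copy_from_slice(&bucket_id.to_be_bytes());
    key
}
```

Any u64 nanosecond timestamp divided by the nanoseconds in an hour is below 2^32, so the casts above cannot truncate.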

5.5 Byte-Level Example

For entity ID 0x00000000000003E8 (1000), a view signal event at timestamp 1740000000000000000 ns:

Offset  Bytes                              Meaning
------  ---------------------------------  -------
0x00    00 00 00 00 00 00 03 E8            entity_id = 1000 (u64 BE)
0x08    00                                 NUL separator
0x09    45 56 54                           "EVT" tag
0x0C    00 01                              signal_type = 1 (view)
0x0E    18 25 B8 C7 F5 2E 00 00            timestamp_ns (u64 BE)
0x16    A3 B7 2C 19 F0 81 DD 04           event_hash (first 8 bytes of BLAKE3)
        --------------------------------
        Total: 30 bytes
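
The same layout can be produced programmatically. The encoder below is a minimal sketch with hypothetical function names; it follows the EVT suffix encoding from Section 5.4 and takes the full 32-byte event hash, keeping only its first 8 bytes.

```rust
/// Builds the 30-byte EVT key:
/// entity_id | NUL | "EVT" | signal_type | timestamp_ns | hash[..8].
/// All integers are big-endian so byte order matches numeric order.
fn encode_evt_key(
    entity_id: u64,
    signal_type: u16,
    timestamp_ns: u64,
    event_hash: &[u8; 32],
) -> [u8; 30] {
    let mut key = [0u8; 30];
    key[0..8].copy_from_slice(&entity_id.to_be_bytes());
    key[8] = 0x00;
    key[9..12].copy_from_slice(b"EVT");
    key[12..14].copy_from_slice(&signal_type.to_be_bytes());
    key[14..22].copy_from_slice(&timestamp_ns.to_be_bytes());
    key[22..30].copy_from_slice(&event_hash[..8]);
    key
}

/// Prefix shared by every key of one entity (entity_id | NUL).
fn entity_prefix(entity_id: u64) -> [u8; 9] {
    let mut prefix = [0u8; 9];
    prefix[0..8].copy_from_slice(&entity_id.to_be_bytes());
    prefix
}
```

Two keys for the same entity and signal type compare in timestamp order, which is what makes a range scan over the EVT prefix chronological.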

5.6 Why This Enables Sharding

The entity ID prefix is the natural shard key. A range-based partition scheme divides the entity ID space into contiguous ranges:

Shard 0:  entity_id [0x0000000000000000, 0x3FFFFFFFFFFFFFFF]
Shard 1:  entity_id [0x4000000000000000, 0x7FFFFFFFFFFFFFFF]
Shard 2:  entity_id [0x8000000000000000, 0xBFFFFFFFFFFFFFFF]
Shard 3:  entity_id [0xC000000000000000, 0xFFFFFFFFFFFFFFFF]

Because all keys for an entity share the same 8-byte prefix, shard splits never bisect an entity's data. All signals, metadata, relationships, and aggregates for entity X live on the same shard. Cross-shard ranking queries fan out by shard, score locally, and merge results -- the same scatter-gather pattern used by Elasticsearch and most distributed search engines.
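
For equal-width ranges with a power-of-two shard count, the shard of an entity is simply the top bits of its ID. The sketch below assumes that scheme (the four-shard example corresponds to `shard_bits = 2`); the function names are hypothetical, and a production scheme with uneven or splittable ranges would need a range table instead.

```rust
/// Maps an entity ID to one of `1 << shard_bits` equal-width ranges by
/// taking its top `shard_bits` bits. Assumes `shard_bits < 64`.
fn shard_for(entity_id: u64, shard_bits: u32) -> u64 {
    if shard_bits == 0 {
        0
    } else {
        entity_id >> (64 - shard_bits)
    }
}

/// Inclusive entity ID range covered by `shard`. Assumes `shard_bits < 64`.
fn shard_range(shard: u64, shard_bits: u32) -> (u64, u64) {
    if shard_bits == 0 {
        return (0, u64::MAX);
    }
    let start = shard << (64 - shard_bits);
    let end = start | (u64::MAX >> shard_bits);
    (start, end)
}
```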


6. Tiered Storage

6.1 Architecture

Data moves through three tiers based on access pattern, not just age. A viral old video's signal state stays hot. Yesterday's impression data for a video nobody watched moves to warm.

Tiered Storage Architecture

+--------------------------------------------------+
|  HOT TIER (in-memory)                            |
|                                                  |
|  DashMap<EntityId, EntitySignalState>            |
|  - Running decay scores (per-lambda)             |
|  - SWAG windowed counters (1h window)            |
|  - Recent event buffer (last N events)           |
|  - Velocity estimates                            |
|                                                  |
|  Budget: ~80 bytes/entity x 10M = 800 MB         |
|  Eviction: access-pattern-based (see 6.3)        |
+------------------------+-------------------------+
                         |
           promote on    | demote when
           access        | cold
                         v
+--------------------------------------------------+
|  WARM TIER (SSD - fjall + redb)                  |
|                                                  |
|  Signal ledger state (SIG keys)                  |
|  Raw events (EVT keys, 7-day retention)          |
|  Hourly rollups (MV keys, 30-day retention)      |
|  Entity metadata (META keys)                     |
|  Relationship graph (REL keys)                   |
|                                                  |
|  Budget: ~460 GB for full workload               |
+------------------------+-------------------------+
                         |
           archive when  | load on
           beyond window | ad-hoc query
                         v
+--------------------------------------------------+
|  COLD TIER (compressed archival)                 |
|                                                  |
|  Daily rollups (MV keys, no TTL)                 |
|  Compressed raw events beyond retention window   |
|  (optional, for compliance/audit)                |
|                                                  |
|  Format: ZSTD-compressed, columnar               |
|  Budget: grows at ~320 MB/day                    |
+--------------------------------------------------+

6.2 Hot Tier Design

The hot tier is an in-memory cache of per-entity signal state, optimized for the ranking query hot path. It is NOT a source of truth -- every value in the hot tier is derivable from the WAL and warm tier.

/// Per-entity signal state, cache-line aligned for zero false sharing.
/// This is the hottest struct in the entire system. Every ranking query
/// touches ~200 of these.
#[repr(C, align(64))]
pub struct EntitySignalState {
    // -- 8 bytes: identity
    entity_id: u64,

    // -- 24 bytes: running decay scores (one per configured lambda)
    // Lambdas: ln(2)/3600 (1h), ln(2)/86400 (24h), ln(2)/604800 (7d)
    decay_scores: [f64; 3],

    // -- 8 bytes: last update timestamp
    last_update_ns: u64,

    // -- 8 bytes: windowed count (SWAG-backed, 1h window)
    window_count_1h: u32,
    velocity_1h: f32,

    // -- 8 bytes: access tracking for tier management
    last_access_ns: u64,

    // -- 8 bytes: padding to 64-byte boundary
    _pad: [u8; 8],
}
// Total: 64 bytes = exactly 1 cache line

const _: () = assert!(core::mem::size_of::<EntitySignalState>() == 64);

Memory budget at scale:

| Entities in hot tier | Memory |
|---|---|
| 1 million (active) | 64 MB |
| 10 million (all) | 640 MB |
| 1 million hot + lazy load | 64 MB + demand-loaded |

The recommended configuration for a 10M-entity deployment is to keep the most active 1-2 million entities in the hot tier (64-128 MB) and load others on demand from the warm tier. On-demand loading from redb/fjall adds ~10-50 us per entity -- acceptable for cold entities that appear infrequently in candidate sets.
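
To illustrate how the three running decay scores could be maintained, the sketch below applies the standard exponential-decay update: decay the stored score by e^(-lambda * dt), then add the new event's weight. This update rule and the function names are assumptions for illustration; the section above defines the lambdas and the struct fields but not the update formula.

```rust
/// Decay constants for half-lives of 1 hour, 24 hours, and 7 days.
fn lambdas() -> [f64; 3] {
    let ln2 = std::f64::consts::LN_2;
    [ln2 / 3_600.0, ln2 / 86_400.0, ln2 / 604_800.0]
}

/// Decays each running score from `last_update_ns` to `now_ns`, then adds
/// `weight` for the new event. Out-of-order events (now < last update) are
/// applied without decay in this sketch.
fn apply_event(scores: &mut [f64; 3], last_update_ns: &mut u64, now_ns: u64, weight: f64) {
    let dt_secs = now_ns.saturating_sub(*last_update_ns) as f64 / 1e9;
    let lambdas = lambdas();
    for (score, lambda) in scores.iter_mut().zip(lambdas.iter()) {
        *score = *score * (-lambda * dt_secs).exp() + weight;
    }
    *last_update_ns = (*last_update_ns).max(now_ns);
}

/// Read-time value of a stored score, decayed to `now_ns` without mutation.
fn score_at(score: f64, lambda: f64, last_update_ns: u64, now_ns: u64) -> f64 {
    let dt_secs = now_ns.saturating_sub(last_update_ns) as f64 / 1e9;
    score * (-lambda * dt_secs).exp()
}
```

Storing the score together with `last_update_ns` means a read never needs the raw events: the ranking path decays the stored value to the query time in constant time.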

6.3 Tier Migration Policy

Migration is driven by access pattern, not age. The policy uses two signals:

  1. Signal write frequency. Entities receiving signals in the last hot_write_window (default: 1 hour) are hot.
  2. Ranking read frequency. Entities that appeared in a ranking candidate set in the last hot_read_window (default: 15 minutes) are hot.

An entity becomes hot when it receives a signal write or is read by a ranking query. An entity becomes cold when neither condition has been true for cold_threshold (default: 2 hours).

| Parameter | Default | Tuning Guidance |
|---|---|---|
| hot_write_window | 1 hour | Entities with recent signals stay hot. Increase for workloads with bursty signal patterns. |
| hot_read_window | 15 min | Entities recently scored in ranking stay hot. Increase if the same entities are queried repeatedly (e.g., trending page). |
| cold_threshold | 2 hours | How long an idle entity stays in memory. Decrease to reduce memory pressure; increase to absorb intermittent access spikes. |
| max_hot_entities | 2 million | Hard cap on hot tier size. When exceeded, the least-recently-accessed entities are evicted regardless of activity. |

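Eviction on memory pressure: When the hot tier reaches max_hot_entities, the entity with the oldest last_access_ns is evicted. The hot tier is a cache: an entry that has not changed since the last checkpoint is already persisted in the warm tier, so evicting it is a simple memory deallocation with no I/O. An entry modified since the last checkpoint must first be written back to its SIG keys (as in step 2 of Section 8.2) or be rebuilt later from the WAL.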
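
The hot/cold rule can be written as two predicates. The sketch below is one reading of the policy: an entity counts as hot until the later of (last write + hot_write_window) and (last read + hot_read_window), and becomes cold once cold_threshold has elapsed after that point. That interpretation of "neither condition has been true for cold_threshold", along with the type and method names, is an assumption of this sketch. Times are offsets from an arbitrary origin.

```rust
use std::time::Duration;

struct TierPolicy {
    hot_write_window: Duration,
    hot_read_window: Duration,
    cold_threshold: Duration,
}

impl Default for TierPolicy {
    fn default() -> Self {
        TierPolicy {
            hot_write_window: Duration::from_secs(3_600),
            hot_read_window: Duration::from_secs(15 * 60),
            cold_threshold: Duration::from_secs(2 * 3_600),
        }
    }
}

impl TierPolicy {
    /// Latest instant at which either hot condition still holds.
    fn hot_until(&self, last_write: Duration, last_read: Duration) -> Duration {
        (last_write + self.hot_write_window).max(last_read + self.hot_read_window)
    }

    /// True while a recent signal write or ranking read keeps the entity hot.
    fn is_hot(&self, now: Duration, last_write: Duration, last_read: Duration) -> bool {
        now <= self.hot_until(last_write, last_read)
    }

    /// True once neither condition has held for `cold_threshold`.
    fn is_cold(&self, now: Duration, last_write: Duration, last_read: Duration) -> bool {
        now >= self.hot_until(last_write, last_read) + self.cold_threshold
    }
}
```

Between `is_hot` turning false and `is_cold` turning true the entity is idle but still resident, which is the grace period that cold_threshold controls.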

6.4 Per-Signal-Window Tiering

Signal aggregates have natural temperature that correlates with window size:

| Aggregate | Tier | Update Frequency | Read Frequency |
|---|---|---|---|
| Running decay score | Hot | Every signal event | Every ranking query |
| 1h windowed count/velocity | Hot | Every signal event | Trending/rising sorts |
| 24h windowed count | Warm (SIG key in fjall) | Every signal event or per-minute rollup | Hot/top-today sorts |
| 7d windowed count | Warm (MV key in redb) | Hourly rollup | Top-this-week sorts |
| 30d aggregate | Warm (MV key in redb) | Daily rollup | Top-this-month sorts |
| All-time aggregate | Cold (MV key in redb) | Daily rollup | Top-all-time sorts |

This means the 1-hour velocity computation (the backbone of trending/rising sorts) never touches disk on the hot path. The 7-day aggregate is a single point read from redb (B-tree, sub-millisecond). The all-time count is the same read cost but accessed less frequently.


7. Compaction Strategy

Compaction applies only to the LSM-tree backend (fjall). The B-tree backend (redb) uses copy-on-write pages and does not require compaction.

7.1 Signal Event Log (EVT keys)

Strategy: FIFO compaction.

The signal event log is append-only and time-ordered. FIFO compaction achieves write amplification of 2x (1x WAL flush to memtable, 1x memtable flush to L0 SST file). Old SST files are dropped whole when they fall outside the retention window.

| Parameter | Default | Rationale |
|---|---|---|
| evt_retention | 7 days | Raw events are needed for: (a) crash recovery replay, (b) backfill when adding new decay lambdas, (c) ad-hoc historical queries. 7 days covers all active signal windows. |
| evt_max_sst_size | 256 MiB | Larger SSTs reduce file count; smaller SSTs enable finer-grained retention cleanup. 256 MiB balances both. |

Why FIFO, not leveled. Leveled compaction for append-only time-series data has write amplification of 12-32x. Solana's BlockStore measured a 6.5x speedup after switching from leveled to FIFO. For tidalDB's event log, where data is written once and deleted by time window, FIFO is strictly superior.

Retention enforcement. Every SST file in the event log has a maximum timestamp recorded in its metadata. A background task periodically scans SST metadata and deletes files whose maximum timestamp is older than now - evt_retention. Cost: O(1) per file, zero write amplification for deletion.

7.2 Signal Ledger State (SIG keys)

Strategy: Leveled compaction.

The signal ledger contains per-entity running decay scores and windowed counters. These are updated frequently (on every signal event) and read frequently (on every ranking query). Leveled compaction ensures read amplification stays low (1-2 levels for point reads with bloom filters).

| Parameter | Default | Rationale |
|---|---|---|
| sig_level_size_multiplier | 10 | Standard leveled compaction ratio. Each level is 10x the size of the previous. |
| sig_bloom_bits_per_key | 10 | 1% false positive rate. Sufficient for the signal ledger's point-read workload. |
| sig_target_file_size | 64 MiB | Balances compaction granularity with file count. |

7.3 Write Amplification Analysis

For the reference workload (10M entities, 50 events/day, ~5,800 events/sec sustained):

| Component | Daily Data Written | Write Amp | Disk I/O |
|---|---|---|---|
| WAL | 32 GB/day | 1x | 32 GB/day |
| EVT SSTs (FIFO) | 32 GB/day | 2x | 64 GB/day |
| SIG updates (leveled) | ~1.6 GB/day (10M entities x 32B x ~5 updates) | ~10x | ~16 GB/day |
| MV rollups (B-tree, COW) | ~5 GB/day | ~2x (COW) | ~10 GB/day |
| Total | | | ~122 GB/day |

At ~122 GB/day sustained, the average write throughput is ~1.4 MB/s -- trivial for any modern NVMe SSD rated at 1+ GB/s sequential writes. The SSD write endurance requirement is ~44.5 TB/year, well within the rated endurance of enterprise NVMe drives (typically 1+ DWPD on 2 TB = 730 TB/year).
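
The totals above follow from multiplying each component's logical volume by its write amplification. The sketch below reproduces that arithmetic with hypothetical helper names and decimal units (1 GB = 1000 MB, 1 TB = 1000 GB), so the figures can be recomputed for a different workload.

```rust
/// Sums physical disk I/O in GB/day from (logical GB/day, write amp) pairs.
fn daily_disk_io_gb(components: &[(f64, f64)]) -> f64 {
    components.iter().map(|(gb, amp)| gb * amp).sum()
}

/// Average write throughput in MB/s for a given GB/day.
fn avg_mb_per_sec(gb_per_day: f64) -> f64 {
    gb_per_day * 1_000.0 / 86_400.0
}

/// Yearly write volume in TB for a given GB/day.
fn tb_per_year(gb_per_day: f64) -> f64 {
    gb_per_day * 365.0 / 1_000.0
}
```

Feeding in the four rows of the table, (32, 1), (32, 2), (1.6, 10), and (5, 2), gives the daily total from which the throughput and endurance figures are derived.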


8. Checkpoint Strategy

Checkpoints snapshot the materializer state, creating a recovery boundary that limits WAL replay length on crash restart.

8.1 Checkpoint Contents

A checkpoint record in the WAL (type 0x05) contains:

Checkpoint Record Payload

+-------------------+-------------------+-------------------+
| Checkpoint SeqNo  | Materializer Pos  | Entity State Hash |
| 8 bytes           | 8 bytes           | 32 bytes          |
+-------------------+-------------------+-------------------+
| Timestamp         | Hot Tier Count    |
| 8 bytes           | 4 bytes           |
+-------------------+-------------------+

Total: 60 bytes
| Field | Description |
|---|---|
| checkpoint_seqno | The WAL sequence number up to which all derived state is consistent. |
| materializer_pos | The last event seqno processed by the background materializer. |
| entity_state_hash | BLAKE3 hash of a deterministic serialization of all in-memory entity signal states. Used to verify that warm-tier persisted state matches in-memory state. |
| timestamp | Wall-clock time of the checkpoint (for monitoring/debugging). |
| hot_tier_count | Number of entities in the hot tier at checkpoint time (for monitoring). |
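As a rough illustration of the 60-byte layout, the payload could be encoded and decoded as below. This is a sketch under assumptions: the `CheckpointRecord` type and its methods are hypothetical, little-endian integer encoding is assumed (the layout above does not state a byte order), and the hash is taken as an opaque 32-byte array rather than computed here.

```rust
/// Payload size from the layout above: 8 + 8 + 32 + 8 + 4 bytes.
pub const CHECKPOINT_PAYLOAD_LEN: usize = 60;

/// Hypothetical in-memory form of the checkpoint payload.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct CheckpointRecord {
    pub checkpoint_seqno: u64,
    pub materializer_pos: u64,
    pub entity_state_hash: [u8; 32],
    pub timestamp: u64,
    pub hot_tier_count: u32,
}

impl CheckpointRecord {
    /// Serializes fields in layout order. Little-endian is an assumption.
    pub fn encode(&self) -> [u8; CHECKPOINT_PAYLOAD_LEN] {
        let mut buf = [0u8; CHECKPOINT_PAYLOAD_LEN];
        buf[0..8].copy_from_slice(&self.checkpoint_seqno.to_le_bytes());
        buf[8..16].copy_from_slice(&self.materializer_pos.to_le_bytes());
        buf[16..48].copy_from_slice(&self.entity_state_hash);
        buf[48..56].copy_from_slice(&self.timestamp.to_le_bytes());
        buf[56..60].copy_from_slice(&self.hot_tier_count.to_le_bytes());
        buf
    }

    /// Parses a payload; returns None when the length is not exactly 60 bytes.
    pub fn decode(buf: &[u8]) -> Option<Self> {
        if buf.len() != CHECKPOINT_PAYLOAD_LEN {
            return None;
        }
        let u64_at = |start: usize| -> u64 {
            let mut b = [0u8; 8];
            b.copy_from_slice(&buf[start..start + 8]);
            u64::from_le_bytes(b)
        };
        let mut hash = [0u8; 32];
        hash.copy_from_slice(&buf[16..48]);
        let mut count = [0u8; 4];
        count.copy_from_slice(&buf[56..60]);
        Some(Self {
            checkpoint_seqno: u64_at(0),
            materializer_pos: u64_at(8),
            entity_state_hash: hash,
            timestamp: u64_at(48),
            hot_tier_count: u32::from_le_bytes(count),
        })
    }
}
```

A fixed-size array return type makes the 60-byte total a compile-time property rather than a runtime check.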

8.2 Checkpoint Procedure

  1. Pause signal writes (briefly). The write path acquires a lightweight checkpoint lock. Writers that arrive during checkpoint are buffered in the group commit queue -- they do not block, they just ride the next batch.
  2. Flush entity signal state. All dirty EntitySignalState entries in the hot tier are written to the warm tier (SIG keys in fjall). This is a batch write of only the entries modified since the last checkpoint.
  3. Flush fjall memtable. Force-flush the fjall memtable to ensure all SIG key writes are durable on disk.
  4. Write checkpoint record to WAL. The checkpoint record contains the current seqno and materializer position.
  5. fdatasync the WAL. The checkpoint record is durable.
  6. Release the checkpoint lock. Writers resume.
  7. Clean up old WAL segments. Segments fully before the new checkpoint seqno are deleted.

Checkpoint duration. Steps 2-5 are the critical section. Flushing dirty entity state is O(dirty entries). At the default 30-second interval with 5,800 events/sec, at most ~174,000 entities can become dirty between interval-driven checkpoints (in practice fewer, because the default checkpoint_dirty_threshold of 100,000 forces a checkpoint first). At ~1 us per key-value write to fjall's memtable, flushing ~174,000 entries takes ~174 ms. The fdatasync adds ~200 us. Total checkpoint duration: ~175 ms in the worst case.

During this 175 ms, signal writers are not blocked -- they are buffered in the group commit queue. The only observable effect is slightly higher write latency for events that arrive during the checkpoint flush.
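The ordering of the seven steps can be sketched against a small trait. Everything here is hypothetical scaffolding for illustration: `CheckpointBackend`, `run_checkpoint`, and `RecordingBackend` are not real fjall, redb, or tidalDB APIs, and a plain `std::sync::Mutex` stands in for the lightweight checkpoint lock.

```rust
use std::sync::Mutex;

/// Hypothetical storage hooks used by the checkpoint routine.
pub trait CheckpointBackend {
    /// Step 2: write dirty hot-tier entries to the warm tier; returns how many.
    fn flush_dirty_entity_state(&mut self) -> Result<usize, String>;
    /// Step 3: force the LSM memtable to disk.
    fn flush_memtable(&mut self) -> Result<(), String>;
    /// Step 4: append the checkpoint record to the WAL.
    fn append_checkpoint_record(&mut self, seqno: u64, materializer_pos: u64) -> Result<(), String>;
    /// Step 5: make the WAL durable (fdatasync in the text above).
    fn sync_wal(&mut self) -> Result<(), String>;
    /// Step 7: delete WAL segments that lie fully before `seqno`.
    fn delete_segments_before(&mut self, seqno: u64) -> Result<(), String>;
}

/// Runs steps 1-7 in order and returns the number of entity states flushed.
pub fn run_checkpoint<B: CheckpointBackend>(
    checkpoint_lock: &Mutex<()>,
    backend: &mut B,
    seqno: u64,
    materializer_pos: u64,
) -> Result<usize, String> {
    let flushed = {
        // Step 1: take the checkpoint lock for the critical section.
        let _guard = checkpoint_lock.lock().map_err(|e| e.to_string())?;
        let flushed = backend.flush_dirty_entity_state()?;
        backend.flush_memtable()?;
        backend.append_checkpoint_record(seqno, materializer_pos)?;
        backend.sync_wal()?;
        flushed
        // Step 6: the guard drops at the end of this block, releasing the lock.
    };
    // Step 7 runs outside the critical section.
    backend.delete_segments_before(seqno)?;
    Ok(flushed)
}

/// Test double that records the order of calls instead of touching disk.
#[derive(Default)]
pub struct RecordingBackend {
    pub calls: Vec<&'static str>,
    pub dirty: usize,
}

impl CheckpointBackend for RecordingBackend {
    fn flush_dirty_entity_state(&mut self) -> Result<usize, String> {
        self.calls.push("flush_dirty");
        Ok(std::mem::take(&mut self.dirty))
    }
    fn flush_memtable(&mut self) -> Result<(), String> {
        self.calls.push("flush_memtable");
        Ok(())
    }
    fn append_checkpoint_record(&mut self, _seqno: u64, _pos: u64) -> Result<(), String> {
        self.calls.push("append_checkpoint");
        Ok(())
    }
    fn sync_wal(&mut self) -> Result<(), String> {
        self.calls.push("sync_wal");
        Ok(())
    }
    fn delete_segments_before(&mut self, _seqno: u64) -> Result<(), String> {
        self.calls.push("delete_segments");
        Ok(())
    }
}
```

Segment cleanup is kept outside the lock in this sketch so that the critical section covers only steps 2-5. Any early error return leaves the old segments in place.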

8.3 Configuration

| Parameter | Default | Tuning Guidance |
|---|---|---|
| checkpoint_interval | 30 seconds | Shorter intervals reduce recovery time but increase disk I/O. At 30s with 5,800 events/sec, recovery replays ~174K records (~174 ms). |
| checkpoint_dirty_threshold | 100,000 | Force a checkpoint when this many entity states are dirty, even if the interval has not elapsed. Prevents unbounded recovery time during write spikes. |
| max_recovery_time_target | 500 ms | Advisory. The system tunes checkpoint_interval to keep estimated recovery time below this target. |
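How the interval and the dirty threshold combine can be sketched as a small policy type. This is an assumption-laden illustration: `CheckpointPolicy`, `should_checkpoint`, and `estimated_replay_time` are hypothetical names, and the ~1 us per replayed record cost is the estimate implied by the guidance above, not a measured value.

```rust
use std::time::Duration;

/// Hypothetical holder for the two trigger parameters.
pub struct CheckpointPolicy {
    pub interval: Duration,
    pub dirty_threshold: usize,
}

impl Default for CheckpointPolicy {
    /// Defaults from the configuration above: 30 seconds, 100,000 dirty entities.
    fn default() -> Self {
        Self { interval: Duration::from_secs(30), dirty_threshold: 100_000 }
    }
}

impl CheckpointPolicy {
    /// A checkpoint is due when either the interval has elapsed or the
    /// dirty entity count has reached the threshold.
    pub fn should_checkpoint(&self, since_last: Duration, dirty_entities: usize) -> bool {
        since_last >= self.interval || dirty_entities >= self.dirty_threshold
    }
}

/// Estimated WAL replay time: records written since the checkpoint times an
/// assumed per-record replay cost.
pub fn estimated_replay_time(
    events_per_sec: u64,
    since_checkpoint: Duration,
    per_record: Duration,
) -> Duration {
    let records = events_per_sec.saturating_mul(since_checkpoint.as_secs());
    let nanos = (per_record.as_nanos() as u64).saturating_mul(records);
    Duration::from_nanos(nanos)
}
```

At 5,800 events/sec, 30 seconds, and 1 us per record this gives 174,000 records and 174 ms, which is under the 500 ms advisory target.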

8.4 Recovery Procedure

Crash Recovery Sequence

1. Open WAL segments
2. Scan backward to find last checkpoint record
3. Read checkpoint: seqno=N, materializer_pos=M
4. Replay WAL records from seqno N+1 to end:
   - SignalEvent: update entity signal state + re-derive aggregates
   - EntityWrite: apply to entity store (redb)
   - RelationshipWrite: apply to relationship store (redb)
   - SchemaChange: apply to schema store (redb)
   - Padding/BatchBoundary: skip
5. Verify: entity_state_hash matches recomputed state (debug builds)
6. Write new checkpoint at current position
7. Resume normal operation
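Steps 2-4 of the sequence can be sketched over an in-memory log. This is a simplified model, not the real WAL reader: `WalRecord`, `ReplayStats`, and `recover` are hypothetical, records are counted rather than applied, and a log with no checkpoint is assumed to replay from the beginning.

```rust
/// Hypothetical WAL record kinds, mirroring the names in the sequence above.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum WalRecord {
    SignalEvent { entity_id: u64 },
    EntityWrite,
    RelationshipWrite,
    SchemaChange,
    Padding,
    BatchBoundary,
    Checkpoint { seqno: u64, materializer_pos: u64 },
}

/// Counts of records handled during replay (stand-in for applying them).
#[derive(Debug, Default, PartialEq, Eq)]
pub struct ReplayStats {
    pub signal_events: u64,
    pub entity_writes: u64,
    pub relationship_writes: u64,
    pub schema_changes: u64,
    pub skipped: u64,
}

/// Finds the last checkpoint by scanning backward, then replays every record
/// whose seqno is greater than the checkpoint seqno N.
/// Returns (N, stats). N is 0 when no checkpoint exists (assumption).
pub fn recover(log: &[(u64, WalRecord)]) -> (u64, ReplayStats) {
    let checkpoint_seqno = log
        .iter()
        .rev()
        .find_map(|(_, rec)| match rec {
            WalRecord::Checkpoint { seqno, .. } => Some(*seqno),
            _ => None,
        })
        .unwrap_or(0);

    let mut stats = ReplayStats::default();
    for (_, rec) in log.iter().filter(|(s, _)| *s > checkpoint_seqno) {
        match rec {
            WalRecord::SignalEvent { .. } => stats.signal_events += 1,
            WalRecord::EntityWrite => stats.entity_writes += 1,
            WalRecord::RelationshipWrite => stats.relationship_writes += 1,
            WalRecord::SchemaChange => stats.schema_changes += 1,
            // Padding, batch boundaries, and the checkpoint record itself carry
            // no state to re-apply.
            WalRecord::Padding | WalRecord::BatchBoundary | WalRecord::Checkpoint { .. } => {
                stats.skipped += 1
            }
        }
    }
    (checkpoint_seqno, stats)
}
```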

9. Per-Entity-Type Isolation

9.1 Namespace Architecture

Items, Users, and Creators occupy separate storage namespaces. This is not merely a key prefix convention -- it maps to separate fjall keyspaces and separate redb tables. The goal: a viral item's signal burst does not contend with user profile reads at the storage engine level.

Storage Namespace Layout

fjall instance
  +-- keyspace: "item_signals"     (EVT + SIG keys for items)
  +-- keyspace: "user_signals"     (EVT + SIG keys for users)
  +-- keyspace: "creator_signals"  (EVT + SIG keys for creators)

redb instance
  +-- table: "item_meta"           (META keys for items)
  +-- table: "user_meta"           (META keys for users)
  +-- table: "creator_meta"        (META keys for creators)
  +-- table: "relationships"       (REL keys, all entity types)
  +-- table: "materialized_views"  (MV keys, all entity types)
  +-- table: "schema"              (schema definitions)
  +-- table: "indexes"             (IDX keys, secondary indexes)
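The layout above implies a fixed mapping from entity type to namespace. A minimal sketch follows, assuming an `EntityKind` enum with exactly these three variants (the real schema type may differ) and hypothetical method names. The keyspace and table names are the ones listed in the layout.

```rust
/// Entity types that get their own storage namespace (variants assumed).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum EntityKind {
    Item,
    User,
    Creator,
}

impl EntityKind {
    pub const ALL: [EntityKind; 3] = [EntityKind::Item, EntityKind::User, EntityKind::Creator];

    /// Name of the fjall keyspace holding this kind's EVT and SIG keys.
    pub fn signal_keyspace(self) -> &'static str {
        match self {
            EntityKind::Item => "item_signals",
            EntityKind::User => "user_signals",
            EntityKind::Creator => "creator_signals",
        }
    }

    /// Name of the redb table holding this kind's META keys.
    pub fn meta_table(self) -> &'static str {
        match self {
            EntityKind::Item => "item_meta",
            EntityKind::User => "user_meta",
            EntityKind::Creator => "creator_meta",
        }
    }
}
```

An exhaustive `match` means that adding a new entity kind fails to compile until its namespace is chosen.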

9.2 Why Separate Namespaces

Independent compaction. Item signals compact on their own schedule without affecting user signal reads. At 10M items generating 50 events/day each, the item_signals keyspace handles ~5,800 writes/sec. User signals are typically 10x lower volume. Without isolation, item signal compaction would stall user signal reads.

Independent memory budgets. Each fjall keyspace has its own memtable and block cache. The item_signals keyspace can be allocated a larger memtable (more write-buffering) while user_signals gets a smaller memtable but larger block cache (more read-caching).

Independent monitoring. Latency, throughput, and error metrics are per-namespace. When item signal write latency spikes, you know it is an item signal problem, not a user profile problem.

Shard-ready. When tidalDB moves to multi-node, each namespace maps naturally to an independent shard group. Item shards and user shards can be placed on different machines based on their workload profiles.

9.3 Cross-Entity Reads

A ranking query touches multiple namespaces: item signals (candidate scoring), user signals (preference vector), creator signals (creator quality), and relationships (social graph). These are separate read operations that execute concurrently via async I/O or a thread pool. The storage engine does not provide cross-namespace transactions -- the query executor handles consistency by reading from a consistent WAL position.


10. Scale-Ready Design

tidalDB is single-node first. But the storage architecture is designed so that the transition to multi-node requires changing the deployment topology, not the storage engine.

10.1 What Stays the Same

| Component | Single-Node | Multi-Node |
|---|---|---|
| Key encoding | {entity_id}\x00{TAG}:{suffix} | Identical. Entity ID prefix is the shard key. |
| WAL | Local WAL per process | Local WAL per shard. Each shard is a self-contained tidalDB instance. |
| Hybrid backend | fjall + redb in-process | fjall + redb per shard. Same code, same configuration. |
| Trait abstraction | StorageEngine trait | Same trait. The multi-node router implements StorageEngine by dispatching to the correct shard. |
| Checkpoints | Local checkpoints | Per-shard checkpoints. Same mechanism. |
| Compaction | Local compaction | Per-shard compaction. Same strategies. |

10.2 What Changes

| Concern | Single-Node | Multi-Node |
|---|---|---|
| Shard routing | All keys local | A routing layer maps entity_id to shard via consistent hashing or range partitioning. |
| Cross-shard queries | N/A | Ranking queries fan out to shards containing candidate entities, score locally, merge results. |
| Replication | N/A | Each shard is replicated via WAL shipping (leader ships sealed WAL segments to followers). |
| Rebalancing | N/A | Shard splits use the key encoding's natural range boundaries. All data for an entity moves together. |

10.3 Design Decisions That Enable This

  1. Entity ID as the universal prefix. Every key starts with the entity ID. This means shard routing is a single 8-byte prefix lookup, and shard splits never bisect an entity's data.

  2. No cross-entity storage transactions. The storage engine provides per-entity atomicity (all keys for entity X are updated atomically), not cross-entity atomicity. This means a ranking query that scores items A, B, C reads each independently -- there is no global snapshot. This is acceptable because ranking is inherently approximate, and signal staleness of a few milliseconds does not affect result quality.

  3. Namespace isolation maps to shard groups. The per-entity-type namespaces (Section 9) are isolated from one another (separate fjall keyspaces and separate redb tables). In a multi-node deployment, item shards can run on high-write-throughput machines while user shards run on high-read-throughput machines.

  4. WAL segments are self-contained. Each WAL segment contains complete records that can be replayed independently. This makes WAL shipping for replication straightforward: the leader ships sealed segments to followers, who replay them locally.

  5. Checksums enable verification. BLAKE3 checksums on every WAL record and checkpoint enable followers to verify the integrity of replicated data without trusting the network.
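Decision 1 can be illustrated with a small routing sketch. It assumes the key layout `{entity_id}\x00{TAG}:{suffix}` with an 8-byte big-endian entity ID, and it uses range partitioning over sorted split points as one of the two routing options named in Section 10.2. The function names and the split-point representation are hypothetical.

```rust
/// Builds a key as `{entity_id}\x00{TAG}:{suffix}` with a big-endian entity id.
pub fn encode_key(entity_id: u64, tag: &str, suffix: &[u8]) -> Vec<u8> {
    let mut key = Vec::with_capacity(8 + 1 + tag.len() + 1 + suffix.len());
    key.extend_from_slice(&entity_id.to_be_bytes());
    key.push(0);
    key.extend_from_slice(tag.as_bytes());
    key.push(b':');
    key.extend_from_slice(suffix);
    key
}

/// Reads the entity id from the first 8 bytes; None if the key is too short.
pub fn entity_id_of(key: &[u8]) -> Option<u64> {
    let mut prefix = [0u8; 8];
    prefix.copy_from_slice(key.get(..8)?);
    Some(u64::from_be_bytes(prefix))
}

/// Range partitioning: `split_points` must be sorted ascending. Shard i owns
/// entity ids in [split_points[i-1], split_points[i]), so N split points
/// define N + 1 shards.
pub fn shard_for(entity_id: u64, split_points: &[u64]) -> usize {
    split_points.partition_point(|&boundary| boundary <= entity_id)
}

/// Routes any key by its entity id prefix alone.
pub fn shard_for_key(key: &[u8], split_points: &[u64]) -> Option<usize> {
    entity_id_of(key).map(|id| shard_for(id, split_points))
}
```

Because routing looks only at the prefix, every key for one entity (whatever its tag or suffix) lands on the same shard, so a split at an entity ID boundary cannot separate an entity's data.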


Appendix A: Configuration Reference

All parameters with defaults and tuning guidance, consolidated.

WAL Configuration

| Parameter | Default | Range | Description |
|---|---|---|---|
| wal.segment_size | 64 MiB | 16-256 MiB | Size of each WAL segment file. |
| wal.max_segments | 128 | 8-1024 | Maximum number of WAL segments before forced cleanup. |
| wal.preallocate | true | -- | Pre-allocate segment files to avoid filesystem metadata updates. |
| wal.dedup_window | 5 min | 1-60 min | Time window for signal event deduplication bloom filter. |
| wal.dedup_bloom_bits | 10 | 5-20 | Bits per entry in the dedup bloom filter. 10 = ~1% FPR. |

Group Commit Configuration

| Parameter | Default | Range | Description |
|---|---|---|---|
| group_commit.max_batch | 256 | 1-4096 | Maximum records per group commit batch. |
| group_commit.max_delay | 10 ms | 1-100 ms | Maximum time before a batch is flushed. |

Durability Defaults (per signal type)

| Signal Category | Default Level | Override In Schema |
|---|---|---|
| Financial (purchase, subscribe) | Immediate | DURABILITY immediate |
| Engagement (like, comment, share) | Batched { 256, 10ms } | DURABILITY batched(256, 10ms) |
| Telemetry (impression, scroll, hover) | Eventual | DURABILITY eventual |

Tiered Storage Configuration

| Parameter | Default | Range | Description |
|---|---|---|---|
| tiers.hot_write_window | 1 hour | 5min-24h | Signal write recency threshold for hot tier. |
| tiers.hot_read_window | 15 min | 1min-1h | Ranking read recency threshold for hot tier. |
| tiers.cold_threshold | 2 hours | 30min-24h | Inactivity duration before demotion from hot tier. |
| tiers.max_hot_entities | 2 million | 100K-50M | Hard cap on hot tier entity count. |

Compaction Configuration

| Parameter | Default | Range | Description |
|---|---|---|---|
| compaction.evt_retention | 7 days | 1-90 days | Retention window for raw signal events. |
| compaction.evt_max_sst_size | 256 MiB | 64-1024 MiB | Target SST file size for event log. |
| compaction.sig_level_multiplier | 10 | 4-20 | Leveled compaction size ratio for signal ledger. |
| compaction.sig_bloom_bits | 10 | 5-20 | Bloom filter bits per key for signal ledger. |

Checkpoint Configuration

| Parameter | Default | Range | Description |
|---|---|---|---|
| checkpoint.interval | 30 sec | 5sec-5min | Time between periodic checkpoints. |
| checkpoint.dirty_threshold | 100,000 | 10K-1M | Dirty entity count that forces an early checkpoint. |
| checkpoint.max_recovery_target | 500 ms | 100ms-5s | Advisory target for maximum crash recovery time. |

Appendix B: Filesystem Layout

{data_dir}/
  wal/
    segment-{seqno}.wal              # WAL segments (rotated at segment_size)
  lsm/
    item_signals/                    # fjall keyspace: item EVT + SIG keys
      ...                            # fjall internal structure
    user_signals/                    # fjall keyspace: user EVT + SIG keys
      ...
    creator_signals/                 # fjall keyspace: creator EVT + SIG keys
      ...
  btree/
    tidaldb.redb                     # single redb file containing all B-tree tables
  meta/
    config.json                      # persisted configuration (checkpoint interval, etc.)
    LOCK                             # flock-based single-writer guard

The LOCK file prevents multiple tidalDB instances from opening the same data directory. It uses flock(LOCK_EX | LOCK_NB) on open -- if the lock cannot be acquired, the process fails with a clear error message. This prevents silent data corruption from concurrent access.


Appendix C: Invariant Checklist

These invariants must be verified by property tests and crash recovery tests. Each maps to a specific test case.

| # | Invariant | Test Strategy |
|---|---|---|
| 1 | A WAL record with a valid checksum is never silently dropped during replay. | Property test: write N records, replay, verify all N are present. |
| 2 | A WAL record with an invalid checksum is never applied during replay. | Property test: corrupt random bytes in WAL segment, replay, verify only valid records are applied. |
| 3 | Crash at any point during checkpoint leaves the previous checkpoint valid. | Crash test: inject crashes during each step of the checkpoint procedure, verify recovery uses the previous checkpoint. |
| 4 | The group commit thread never ACKs a record before fdatasync completes. | Instrumented test: mock fdatasync to delay, verify writers block until it returns. |
| 5 | Materialized aggregates are always consistent with the WAL. | Property test: write random signal events, compute aggregates from WAL, compare with materialized state. |
| 6 | Key routing is deterministic: the same key always routes to the same backend. | Property test: generate random keys, verify route() is a pure function. |
| 7 | Entity isolation: writes to one namespace do not affect read latency in another. | Benchmark test: measure user_meta read latency while saturating item_signals writes. |
| 8 | Deduplication never causes a unique event to be silently dropped. | Property test: generate events with guaranteed-unique hashes, verify all are written. |
| 9 | Big-endian entity ID encoding preserves numeric ordering in byte-lexicographic scans. | Property test: generate random u64 pairs, verify BE encoding preserves ordering. |
| 10 | After crash recovery, the hot tier state matches what would be produced by replaying all events from the last checkpoint. | Crash test: fill hot tier, crash, recover, compare entity states against fresh computation from WAL. |
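As one example of how an invariant maps to a test, invariant 9 can be checked with a dependency-free sketch. The project's real property tests presumably use a property-testing crate. The hand-rolled `XorShift64` generator and the function name here are stand-ins chosen only to keep the example self-contained.

```rust
/// Small deterministic xorshift64 generator so the check needs no external crates.
pub struct XorShift64(u64);

impl XorShift64 {
    /// A zero seed would lock xorshift at zero, so it is replaced with a constant.
    pub fn new(seed: u64) -> Self {
        XorShift64(if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed })
    }

    pub fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// For `samples` pseudo-random pairs (a, b), checks that comparing the
/// big-endian byte encodings lexicographically gives the same ordering as
/// comparing the integers numerically.
pub fn be_encoding_preserves_order(samples: usize, seed: u64) -> bool {
    let mut rng = XorShift64::new(seed);
    (0..samples).all(|_| {
        let (a, b) = (rng.next_u64(), rng.next_u64());
        a.cmp(&b) == a.to_be_bytes().cmp(&b.to_be_bytes())
    })
}
```

Little-endian encoding fails the same check: 1 encodes as `[1, 0, ...]` and 256 as `[0, 1, ...]`, so 1 sorts after 256 in a byte-lexicographic scan.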

References

  • Signal Ledger Research -- Three-tier hybrid architecture, running decay scores, SWAG, compaction analysis
  • thoughts.md -- Lessons from Engram (cache-line alignment), Citadel (quarantine-first durability, group commit), StemeDB (hybrid backend routing, subject-prefix keys, background materializer)
  • CODING_GUIDELINES.md -- #[repr(C, align(64))] for hot structs, lock-free hot path, trait-abstracted backends
  • VISION.md -- The ranking query that this storage engine exists to serve
  • Cormode et al., "Forward Decay: A Practical Time Decay Model for Streaming Systems" (ICDE 2009) -- Running decay score correctness proof
  • Tangwongsan, Hirzel, Schneider, "Sliding-Window Aggregation Algorithms" (PVLDB 2015) -- Two-Stacks SWAG algorithm
  • Traub et al., "Scotty: Efficient Window Aggregation for out-of-order Stream Processing" (ICDE 2018) -- Stream-slicing for shared windows