Signal System Specification

Status: Draft
Authors: tidalDB Engineering
Date: 2026-02-20
Depends on: WAL subsystem, Entity Store, Schema Engine
Research: docs/research/tidaldb_signal_ledger.md


Table of Contents

  1. Overview
  2. Signal Type Declaration
  3. Signal Ledger (Per-Entity)
  4. Decay Computation
  5. Velocity Computation
  6. Windowed Aggregation
  7. Cohort-Scoped Signal Aggregation
  8. Signal Write Path
  9. Background Materializer
  10. Signal Event Format
  11. Signal Types Reference
  12. Performance Targets
  13. Invariants and Correctness Guarantees

1. Overview

The signal system is the temporal event backbone of tidalDB. Every engagement event -- a view, a like, a skip, a share -- flows through the signal system and updates the state that ranking queries consume. The system must sustain thousands of signal writes per second while serving sub-millisecond aggregate reads across hundreds of candidate entities.

Signals are not fields. They are typed, timestamped streams with native temporal semantics: decay, velocity, and windowed aggregation are computed by the database, not by the application. The application writes SIGNAL view item:@id user:@uid. The ranking profile references view.velocity(24h). No application code touches temporal math.

Design Principles

  1. WAL-first durability. Every signal event is durably logged before any processing occurs. The signal aggregation system can crash, restart, and replay from the WAL. Signals cannot be lost.

  2. O(1) running scores. Decay scores are maintained as running accumulators updated on each write, not recomputed by scanning raw events. Read cost is one exp() call per entity per decay rate.

  3. Immutable events, mutable aggregates. Signal events are immutable facts. Aggregates are derived state that can always be recomputed from events.

  4. Lock-free hot path. Signal counters and decay scores use atomic operations. A signal write never blocks a ranking query. A ranking query never blocks a signal write.

  5. Cohort aggregation as a first-class primitive. Not just "this item has 50k views in 24h" but "this item has 50k views in 24h among US users aged 18-24 who like jazz."


2. Signal Type Declaration

Signal types are declared in schema before signal events can be written. A signal declaration specifies: what the signal is called, what entity type it targets, how it decays, what windows it maintains, and whether velocity is computed.

Schema Definition

db.define_signal(SignalDef {
    name: "view",
    target: EntityKind::Item,
    decay: Decay::Exponential { half_life: Duration::days(7) },
    windows: vec![
        Window::hours(1),
        Window::hours(24),
        Window::days(7),
        Window::days(30),
        Window::all_time(),
    ],
    velocity: true,
})?;

Signal Definition Fields

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| name | &str | Yes | Unique signal identifier. Lowercase alphanumeric plus underscores. |
| target | EntityKind | Yes | Which entity type this signal targets: Item, User, or Creator. |
| decay | Decay | Yes | How signal weight diminishes over time. |
| windows | Vec<Window> | Yes | Time windows for which aggregates are maintained. May be empty (e.g., hide). |
| velocity | bool | Yes | Whether to compute rate-of-change per window. |

Decay Types

pub enum Decay {
    /// Signal weight halves every `half_life` duration.
    /// Formula: w(t) = w_0 * exp(-lambda * t), lambda = ln(2) / half_life
    Exponential { half_life: Duration },

    /// Signal weight drops linearly to zero over `lifetime`.
    /// Formula: w(t) = w_0 * max(0, 1 - t / lifetime)
    Linear { lifetime: Duration },

    /// Signal weight never decays. For permanent state: hides, blocks, follows.
    Permanent,
}

Lambda precomputation. For exponential decay, lambda is computed once at schema definition time and stored alongside the signal definition:

lambda = ln(2) / half_life_seconds

| Half-Life | Lambda (s^-1) | Interpretation |
| --- | --- | --- |
| 1 hour | 1.925e-4 | Fast decay. Impressions, skips. Signal is negligible after ~7 hours. |
| 24 hours | 8.022e-6 | Medium decay. Shares, comments. Signal halves daily. |
| 7 days | 1.146e-6 | Slow decay. Views, likes. Signal persists for weeks. |
| 30 days | 2.674e-7 | Very slow decay. Completions, saves. Signal persists for months. |
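The lambda values in the table can be reproduced with a few lines of Rust. This is a minimal sketch; `lambda_from_half_life` and `remaining_fraction` are hypothetical helper names used only for illustration, not part of the tidalDB API.

```rust
/// Hypothetical helper: decay constant (per second) for a given half-life.
fn lambda_from_half_life(half_life_secs: f64) -> f64 {
    std::f64::consts::LN_2 / half_life_secs
}

/// Fraction of an event's original weight that remains after `elapsed_secs`.
fn remaining_fraction(lambda: f64, elapsed_secs: f64) -> f64 {
    (-lambda * elapsed_secs).exp()
}
```

With a 1-hour half-life, `remaining_fraction` returns 0.5 after 3,600 seconds and falls below 1% after seven half-lives, which matches the "negligible after ~7 hours" note in the table.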

Window Definitions

pub enum Window {
    /// Fixed-duration sliding window.
    Sliding { duration: Duration },
    /// Unbounded accumulator -- all events since entity creation.
    AllTime,
}

impl Window {
    pub fn hours(n: u64) -> Self { Window::Sliding { duration: Duration::hours(n) } }
    pub fn days(n: u64) -> Self { Window::Sliding { duration: Duration::days(n) } }
    pub fn all_time() -> Self { Window::AllTime }
}

Windows define the time boundaries for count/sum aggregation. A signal with windows: [hours(1), hours(24), days(7), all_time()] maintains four independent aggregates. Each window answers "how many/how much of this signal occurred within the last N?"

Velocity Declaration

When velocity: true, the system computes the rate of change of the signal count within each declared window. Velocity answers "is this signal accelerating or decelerating?" -- the foundation of trending and rising detection.

Velocity is computed per window. view.velocity(1h) measures short-term acceleration. view.velocity(24h) measures daily trend. These are different signals with different noise characteristics, and ranking profiles choose which to reference.

Schema Validation Rules

  1. Signal names must be unique within a target entity type.
  2. Permanent decay signals must have velocity: false (rate of change is meaningless for permanent state).
  3. Windows must be non-empty unless the signal is boolean/permanent (e.g., hide, block).
  4. all_time() windows do not support velocity (no bounded window to measure rate over).
  5. Maximum 8 windows per signal type (bounded by the hot-tier struct layout).
  6. Maximum 64 signal types per entity type (bounded by storage layout).
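A sketch of how rules 2, 3, and 5 could be checked at schema definition time. The types below are simplified stand-ins for the spec's `SignalDef`, `Decay`, and `Window` (durations are plain seconds), and `SchemaError` and `validate` are hypothetical names. Rules 1 and 6 need the schema registry and rule 4 concerns which windows velocity is computed for, so they are left out here.

```rust
#[allow(dead_code)]
#[derive(Debug, PartialEq)]
enum Decay {
    Exponential { half_life_secs: u64 },
    Linear { lifetime_secs: u64 },
    Permanent,
}

#[allow(dead_code)]
#[derive(Debug, PartialEq)]
enum Window {
    Sliding { secs: u64 },
    AllTime,
}

#[derive(Debug, PartialEq)]
enum SchemaError {
    PermanentWithVelocity, // rule 2
    EmptyWindows,          // rule 3
    TooManyWindows,        // rule 5
}

#[allow(dead_code)]
struct SignalDef {
    name: &'static str,
    decay: Decay,
    windows: Vec<Window>,
    velocity: bool,
}

/// Rule 5: bounded by the hot-tier struct layout.
const MAX_WINDOWS: usize = 8;

fn validate(def: &SignalDef) -> Result<(), SchemaError> {
    let permanent = def.decay == Decay::Permanent;
    if permanent && def.velocity {
        return Err(SchemaError::PermanentWithVelocity);
    }
    if def.windows.is_empty() && !permanent {
        return Err(SchemaError::EmptyWindows);
    }
    if def.windows.len() > MAX_WINDOWS {
        return Err(SchemaError::TooManyWindows);
    }
    Ok(())
}
```

Returning a typed error per rule keeps the rejection reason machine-readable, which fits the `thiserror`-based error handling the project declares.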

3. Signal Ledger (Per-Entity)

Every entity in tidalDB has a signal ledger: the complete temporal state of all signals targeting that entity. The ledger is implemented as a three-tier hybrid, following the architecture validated in the research document.

Three-Tier Architecture

                    +---------------------------+
  Ranking queries   |     HOT TIER (Memory)     |   ~64 bytes per signal type
  read from here    | Running decay scores      |   10M entities = ~640 MB per signal type
  (sub-microsecond) | Atomic counters           |
                    | Last-update timestamp     |
                    +---------------------------+
                              |
                    +---------------------------+
  Windowed queries  |    WARM TIER (Memory)     |   Per-minute bucket counters
  merge from here   | Time-bucketed counters    |   ~1 GB per signal type at 5% active entities
  (microseconds)    | Recent event buffer       |
                    | SWAG stacks               |
                    +---------------------------+
                              |
                    +---------------------------+
  Replay, ad-hoc,   |    COLD TIER (Disk)       |   Raw events: 7 days retention
  backfill from     | Raw signal events (WAL)   |   Rollups: 30 days hourly,
  here              | Hourly rollups            |   daily indefinitely
                    | Daily rollups             |   Total: ~460 GB at scale
                    +---------------------------+

Hot Tier: Per-Entity Signal State

The hot tier is the structure touched on every ranking query. It must be cache-line aligned, lock-free, and as compact as possible.

Memory Layout:

        Offset  Size  Field
        ------  ----  ---------------------------------------
         0       8    entity_id (u64)
         8       8    last_update_ns (AtomicU64)
        16       2    signal_type_id (u16)
        18       2    flags (u16)
        20       4    _pad0
        24       8    decay_scores[0] (f64 bits in AtomicU64)
        32       8    decay_scores[1] (f64 bits in AtomicU64)
        40       8    decay_scores[2] (f64 bits in AtomicU64)
        48      16    _pad1

        Total: 64 bytes per signal type per entity (one cache line)

/// Hot-path signal state for a single signal type on a single entity.
/// One cache line. Touched on every ranking query involving this signal.
///
/// Contains running decay scores for up to 3 decay rates (matching the
/// common configuration of 1h, 24h, 7d half-lives) and the timestamp
/// of the last update for lazy decay application at read time.
#[repr(C, align(64))]
pub struct HotSignalState {
    /// Entity this state belongs to.
    entity_id: u64,                 // 8 bytes [0..8]

    /// Nanosecond timestamp of the last signal write to this entity.
    /// Used for lazy decay: score(now) = stored_score * exp(-lambda * (now - last_update)).
    /// Stored as AtomicU64 for lock-free read/write.
    last_update_ns: AtomicU64,      // 8 bytes [8..16]

    /// Signal type index (0..63) within this entity's signal set.
    signal_type_id: u16,            // 2 bytes [16..18]

    /// Flags: bit 0 = velocity_enabled, bits 1-15 reserved.
    flags: u16,                     // 2 bytes [18..20]

    /// Padding to align decay_scores to 8-byte boundary.
    _pad0: [u8; 4],                 // 4 bytes [20..24]

    /// Running exponential decay scores. One per configured decay rate.
    /// Updated atomically via CAS on f64 bit patterns.
    /// Index 0: primary decay rate (from signal definition).
    /// Index 1-2: additional rates if the signal participates in
    ///            multiple ranking profiles with different half-lives.
    decay_scores: [AtomicU64; 3],   // 24 bytes [24..48] (f64 via from_bits/to_bits)

    /// Padding to fill cache line.
    _pad1: [u8; 16],               // 16 bytes [48..64]
}
// Static assertion: size_of::<HotSignalState>() == 64
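The static assertion mentioned in the comment can be written with a `const` item, so the build fails if the layout ever drifts from one cache line. A minimal sketch that repeats the struct so the block is self-contained:

```rust
use std::sync::atomic::AtomicU64;

#[allow(dead_code)]
#[repr(C, align(64))]
pub struct HotSignalState {
    entity_id: u64,               // [0..8]
    last_update_ns: AtomicU64,    // [8..16]
    signal_type_id: u16,          // [16..18]
    flags: u16,                   // [18..20]
    _pad0: [u8; 4],               // [20..24]
    decay_scores: [AtomicU64; 3], // [24..48]
    _pad1: [u8; 16],              // [48..64]
}

// Compile-time layout checks (assert! in const context needs Rust 1.57+).
const _: () = assert!(std::mem::size_of::<HotSignalState>() == 64);
const _: () = assert!(std::mem::align_of::<HotSignalState>() == 64);
```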

Atomic access patterns:

  • Signal write: Load last_update_ns (Acquire), compute decayed score, CAS decay_scores[i] (AcqRel), store last_update_ns (Release).
  • Ranking read: Load last_update_ns (Acquire), load decay_scores[i] (Acquire), apply lazy decay with exp(-lambda * dt).
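  • Memory ordering rationale: The writer publishes the score (CAS) before the timestamp (Release store), and the reader loads the timestamp with Acquire before the score. A reader that observes a new timestamp is therefore guaranteed to observe the matching (or a newer) score. Without this ordering, a reader could pair a new timestamp with an old score, producing a value that skips the decay between the two updates and omits the newest event's weight. The reverse pairing (old timestamp, new score) is still possible when the reader's two loads straddle a write; it transiently over-decays the score.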
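The "CAS on f64 bit patterns" step in the access patterns above can be factored into a small helper. This is a sketch under the assumption that scores live in an `AtomicU64` as described; `atomic_f64_update` is a hypothetical name, not an existing tidalDB function.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical helper: atomically replace the f64 stored (as bits) in `cell`
/// with `f(current)`, retrying if another thread wins the race.
/// Returns the value that was successfully stored.
fn atomic_f64_update(cell: &AtomicU64, f: impl Fn(f64) -> f64) -> f64 {
    let mut cur = cell.load(Ordering::Acquire);
    loop {
        let next = f(f64::from_bits(cur));
        match cell.compare_exchange_weak(
            cur,
            next.to_bits(),
            Ordering::AcqRel,
            Ordering::Acquire,
        ) {
            Ok(_) => return next,
            // On failure the CAS hands back the value it observed, so the
            // retry recomputes from fresh state without an extra load.
            Err(actual) => cur = actual,
        }
    }
}
```

A write then becomes `atomic_f64_update(&score, |s| s * decay_factor + weight)`. The closure may run more than once under contention, so it should be free of side effects.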

Memory budget:

| Entity Count | Signal Types | Hot Tier Size |
| --- | --- | --- |
| 1M | 6 | 384 MB |
| 10M | 6 | 3.84 GB |
| 10M | 3 | 1.92 GB |

For the 10M entity target, the hot tier consumes 2-4 GB depending on signal type count. This is within the recommended memory_budget of 2-4 GB. Entities with no recent signals can be evicted to warm/cold tier and loaded on demand (see "Hot/Cold Entity Tiering" later in this section).

Warm Tier: Bucketed Counters and SWAG Stacks

The warm tier maintains the data structures needed for windowed aggregation and velocity computation. It is in-memory but not cache-line-aligned -- it trades compactness for query flexibility.

/// Warm-tier signal state for windowed aggregation.
/// One instance per signal type per entity.
pub struct WarmSignalState {
    /// Per-minute event count buckets for the last 60 minutes.
    /// Used for 1h window. Shared across 24h, 7d via hierarchical rollup.
    minute_buckets: [AtomicU32; 60],     // 240 bytes

    /// Per-hour event count buckets for the last 168 hours (7 days).
    /// Used for 24h and 7d windows.
    hour_buckets: [AtomicU32; 168],      // 672 bytes

    /// Weighted sum buckets (same granularity as count buckets).
    /// For signals with non-unit weights (e.g., completion ratio).
    minute_weight_sums: [AtomicU32; 60], // 240 bytes (f32 via bits)
    hour_weight_sums: [AtomicU32; 168],  // 672 bytes (f32 via bits)

    /// Current bucket index (minute of the hour for minute_buckets).
    current_minute: AtomicU8,            // 1 byte

    /// Current bucket index (hour of the week for hour_buckets).
    current_hour: AtomicU8,              // 1 byte

    /// All-time counters.
    all_time_count: AtomicU64,           // 8 bytes
    all_time_weighted_sum: AtomicU64,    // 8 bytes (f64 via bits)

    /// SWAG Two-Stacks state for O(1) amortized windowed aggregation.
    /// One pair of stacks per active window.
    swag_stacks: Vec<SwagState>,         // heap-allocated, per window
}
// ~1.8 KB per signal type per entity
// 10M entities * 6 signal types * 1.8 KB = ~108 GB -- TOO LARGE

Critical sizing decision. At 1.8 KB per signal per entity, the warm tier for 10M entities with 6 signal types would consume ~108 GB. This is infeasible. The warm tier must be sparse: only entities with recent activity maintain warm-tier state. The vast majority of entities (>95%) have no signals in the last hour and need only the hot-tier running scores.

Revised warm tier: active-entity-only.

/// Warm tier is a concurrent hash map keyed by (entity_id, signal_type_id).
/// Only entities with signal activity in the last 7 days have entries.
/// Evicted to cold tier on inactivity.
type WarmTier = DashMap<(EntityId, SignalTypeId), WarmSignalState>;

At 5% active rate (500K entities with recent activity), warm tier = 500K * 6 * 1.8 KB = ~5.4 GB. Manageable within an 8 GB total memory budget.

Eviction policy: Warm-tier entries with no signal writes in the last 2 * max_window_duration are evicted. Their bucketed state is rolled up into the cold tier before eviction.

Cold Tier: Durable Storage

The cold tier is on disk. It stores raw signal events and pre-computed rollups.

Column families (or keyspaces):

CF "signal_events"      FIFO compaction, 7-day TTL
    Key:   [entity_id: u64 BE][timestamp_ns: u64 BE][signal_type: u8]
    Value: [user_id: u64][weight: f32][context_len: u16][context: bytes]
    Prefix bloom filter on first 8 bytes (entity_id)

CF "hourly_rollups"     Leveled compaction, 30-day TTL
    Key:   [entity_id: u64 BE][signal_type: u8][hour_bucket: u32 BE]
    Value: HourlyRollup (see below)

CF "daily_rollups"      Leveled compaction, no TTL
    Key:   [entity_id: u64 BE][signal_type: u8][day_bucket: u16 BE]
    Value: DailyRollup (see below)

CF "entity_signal_state" Leveled compaction, no TTL
    Key:   [entity_id: u64 BE]
    Value: Serialized hot-tier state (for crash recovery checkpoint)
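The keys above use big-endian integers so that byte-wise (lexicographic) key order matches numeric order, which is what lets a prefix scan on `entity_id` return events in time order. A minimal sketch of the `signal_events` key encoding; the function names are hypothetical.

```rust
/// Encode a `signal_events` key: entity_id (BE) | timestamp_ns (BE) | signal_type.
fn encode_event_key(entity_id: u64, timestamp_ns: u64, signal_type: u8) -> [u8; 17] {
    let mut key = [0u8; 17];
    key[..8].copy_from_slice(&entity_id.to_be_bytes());
    key[8..16].copy_from_slice(&timestamp_ns.to_be_bytes());
    key[16] = signal_type;
    key
}

/// Decode a key back into (entity_id, timestamp_ns, signal_type).
fn decode_event_key(key: &[u8; 17]) -> (u64, u64, u8) {
    let mut entity = [0u8; 8];
    let mut ts = [0u8; 8];
    entity.copy_from_slice(&key[..8]);
    ts.copy_from_slice(&key[8..16]);
    (u64::from_be_bytes(entity), u64::from_be_bytes(ts), key[16])
}
```

With little-endian encoding, timestamp 256 would sort before timestamp 255 byte-wise, which would break range scans over a time interval.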

Rollup record formats:

/// Composable hourly aggregate. Never store averages -- store sum + count.
struct HourlyRollup {
    total_count: u32,
    weighted_sum: f32,
    unique_users: u32,          // HyperLogLog sketch cardinality
    max_weight: f32,
    min_weight: f32,
}  // 20 bytes

/// Composable daily aggregate. Computed from hourly rollups, not raw events.
struct DailyRollup {
    total_count: u64,
    weighted_sum: f64,
    unique_users: u64,          // HyperLogLog union
    hourly_peak_count: u32,     // max count in any single hour
    _pad: u32,
}  // 32 bytes
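"Store sum + count, never averages" is what makes rollups composable: sums and counts from any set of hours can be added, and the mean is derived at read time. The sketch below folds hourly records into daily totals using simplified, hypothetical structs (`HourlyCounts`, `DailyTotals`). `unique_users` is deliberately omitted, because distinct counts are not additive and merging them needs the underlying sketches rather than their cardinalities.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
struct HourlyCounts {
    total_count: u32,
    weighted_sum: f32,
}

#[derive(Debug, Clone, Copy, PartialEq, Default)]
struct DailyTotals {
    total_count: u64,
    weighted_sum: f64,
    hourly_peak_count: u32,
}

/// Fold hourly rollups into one daily aggregate.
fn fold_hours(hours: &[HourlyCounts]) -> DailyTotals {
    let mut day = DailyTotals::default();
    for h in hours {
        day.total_count += u64::from(h.total_count);
        day.weighted_sum += f64::from(h.weighted_sum);
        day.hourly_peak_count = day.hourly_peak_count.max(h.total_count);
    }
    day
}

/// The average is derived at read time, never stored.
fn mean_weight(day: &DailyTotals) -> Option<f64> {
    if day.total_count == 0 {
        None
    } else {
        Some(day.weighted_sum / day.total_count as f64)
    }
}
```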

Storage Cost Analysis

For the reference workload (10M entities, 50 events/day average, 40+ signal types in schema but ~6 active per entity):

| Component | Storage Size | Write Amplification | Retention |
| --- | --- | --- | --- |
| Raw signal events | 224 GB | 2x (FIFO) | 7 days |
| Hourly rollups | 231 GB | ~15x (leveled) | 30 days |
| Daily rollups | Growing 320 MB/day | ~15x (leveled) | Indefinite |
| Hot-tier checkpoint | ~3.8 GB | Periodic | Latest only |
| Total | ~460 GB | Blended ~6x | |

Hot/Cold Entity Tiering

Not all 10M entities need hot-tier state in memory at all times. An entity that received its last signal 3 months ago does not need a 64-byte cache-line-aligned struct occupying hot-tier memory.

Tiering policy:

| Activity Level | Tier | Read Latency | Eviction Rule |
| --- | --- | --- | --- |
| Signal in last 1h | Hot (memory, aligned) | ~15 ns | N/A |
| Signal in last 7d | Warm (memory, unaligned) | ~100 ns | No activity for 2x max window |
| Signal older than 7d | Cold (disk) | ~50 us | Loaded on demand |

On a cold-tier read miss, the entity's checkpoint is loaded from entity_signal_state CF, promoted to hot tier, and lazy-decayed to current time. The cold read adds ~50 us latency for that single entity, amortized over future queries.


4. Decay Computation

The Running Score Formula

Exponential decay scores are maintained as running accumulators. The formula is mathematically exact (not an approximation), proven by the Forward Decay model (Cormode et al., ICDE 2009) and independently described by Jules Jacobs.

Definition. Given a stream of signal events with weights w_1, w_2, ..., w_n arriving at times t_1, t_2, ..., t_n, the exponential decay score at time t is:

S(t) = SUM_i [ w_i * exp(-lambda * (t - t_i)) ]

Incremental update. When a new event with weight w arrives at time t_new:

S(t_new) = S(t_prev) * exp(-lambda * (t_new - t_prev)) + w

Proof of exactness. If S(t_prev) = SUM_i [ w_i * exp(-lambda * (t_prev - t_i)) ] for all events up to t_prev, then multiplying by exp(-lambda * (t_new - t_prev)) shifts every event's decay to be relative to t_new, and adding w incorporates the new event with zero age. The result is exactly SUM_i [ w_i * exp(-lambda * (t_new - t_i)) ] for all events including the new one.
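The exactness claim can be checked numerically by comparing the direct sum with the incremental recurrence. This is an illustrative sketch with plain `f64` seconds and no atomics; events are `(time, weight)` pairs assumed to be sorted by time, and both function names are hypothetical.

```rust
/// Direct definition: S(t) = SUM_i w_i * exp(-lambda * (t - t_i)).
fn direct_score(events: &[(f64, f64)], lambda: f64, t: f64) -> f64 {
    events
        .iter()
        .map(|&(t_i, w_i)| w_i * (-lambda * (t - t_i)).exp())
        .sum()
}

/// Incremental form: decay the running score to each event time, add the
/// weight, then decay once more to the query time `t`.
fn running_score(events: &[(f64, f64)], lambda: f64, t: f64) -> f64 {
    let mut t_prev = match events.first() {
        Some(&(t0, _)) => t0,
        None => return 0.0,
    };
    let mut score = 0.0_f64;
    for &(t_i, w_i) in events {
        score = score * (-lambda * (t_i - t_prev)).exp() + w_i;
        t_prev = t_i;
    }
    score * (-lambda * (t - t_prev)).exp()
}
```

The two functions agree up to floating-point rounding, which is the practical meaning of "exact" here: the recurrence introduces no approximation beyond ordinary f64 rounding.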

Write-Path Update

impl HotSignalState {
    /// Update running decay scores on a new signal event.
    ///
    /// Cost: K * exp() calls where K = number of configured decay rates.
    /// At K=3: ~36ns on modern hardware (12ns per exp()).
    pub fn on_signal(
        &self,
        weight: f64,
        event_time_ns: u64,
        lambdas: &[f64],
    ) {
        // Acquire: ensures we see the latest decay_score before updating.
        let prev_time = self.last_update_ns.load(Ordering::Acquire);
        let dt = (event_time_ns.saturating_sub(prev_time)) as f64 / 1e9;

        for (i, &lambda) in lambdas.iter().enumerate().take(3) {
            loop {
                // Acquire: read current score.
                let prev_bits = self.decay_scores[i].load(Ordering::Acquire);
                let prev_score = f64::from_bits(prev_bits);

                // Apply decay to previous score, then add new weight.
                let new_score = prev_score * (-lambda * dt).exp() + weight;
                let new_bits = new_score.to_bits();

                // AcqRel CAS: if another writer updated between our load and
                // this CAS, we retry with the newer value.
                match self.decay_scores[i].compare_exchange_weak(
                    prev_bits,
                    new_bits,
                    Ordering::AcqRel,
                    Ordering::Acquire,
                ) {
                    Ok(_) => break,
                    Err(_) => continue, // Retry with updated value
                }
            }
        }

        // Release: make updated scores visible to ranking queries.
        // Only advance timestamp if this event is newer than the last update.
        if event_time_ns > prev_time {
            self.last_update_ns.store(event_time_ns, Ordering::Release);
        }
    }
}

Read-Path Query

impl HotSignalState {
    /// Read the current decay score at query time.
    ///
    /// Applies lazy decay from last_update to query_time.
    /// Cost: 1 exp() + 1 multiply = ~15ns per entity per decay rate.
    pub fn current_score(
        &self,
        decay_rate_idx: usize,
        query_time_ns: u64,
        lambda: f64,
    ) -> f64 {
        // Acquire: ensures we see the score matching the timestamp.
        let last_update = self.last_update_ns.load(Ordering::Acquire);
        let stored_bits = self.decay_scores[decay_rate_idx].load(Ordering::Acquire);
        let stored_score = f64::from_bits(stored_bits);

        let dt = (query_time_ns.saturating_sub(last_update)) as f64 / 1e9;
        stored_score * (-lambda * dt).exp()
    }
}

Out-of-Order Events

When an event arrives with t_event < last_update_ns (out-of-order delivery, late-arriving data):

score += weight * exp(-lambda * (last_update - t_event))

The weight is pre-decayed to reflect that the event is older than the current state. The last_update_ns timestamp is not changed because it already reflects a more recent time. The on_signal implementation above does not handle this case: saturating_sub clamps the negative interval to dt = 0, so the decay factor is exp(0) = 1.0 and the late event is added at full weight, which is incorrect. The corrected handling is:

// Correct out-of-order handling (replaces the `dt` computation in on_signal):
let dt_seconds = if event_time_ns >= prev_time {
    (event_time_ns - prev_time) as f64 / 1e9
} else {
    // Out-of-order: the stored score is already expressed at prev_time, so it
    // is left undecayed and the weight is pre-decayed by how late the event is:
    // new_score = prev_score + weight * exp(-lambda * late_by)
    let late_by = (prev_time - event_time_ns) as f64 / 1e9;
    for (i, &lambda) in lambdas.iter().enumerate().take(3) {
        let adjusted_weight = weight * (-lambda * late_by).exp();
        // CAS loop: add adjusted_weight to decay_scores[i].
        let mut cur = self.decay_scores[i].load(Ordering::Acquire);
        loop {
            let next = (f64::from_bits(cur) + adjusted_weight).to_bits();
            match self.decay_scores[i].compare_exchange_weak(
                cur, next, Ordering::AcqRel, Ordering::Acquire,
            ) {
                Ok(_) => break,
                Err(actual) => cur = actual, // Retry with the observed value
            }
        }
    }
    return; // Don't update last_update_ns
};
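To isolate the arithmetic from the atomics, the following single-threaded sketch applies the same rule and shows that a late event yields the same score as in-order delivery. `Running` is a hypothetical stand-in for the hot-tier state, using `f64` seconds instead of nanosecond timestamps.

```rust
#[derive(Debug, Clone, Copy)]
struct Running {
    score: f64,
    last_update_s: f64,
}

impl Running {
    /// Apply one event; late events are pre-decayed and do not move the clock.
    fn apply(&mut self, weight: f64, event_time_s: f64, lambda: f64) {
        if event_time_s >= self.last_update_s {
            let dt = event_time_s - self.last_update_s;
            self.score = self.score * (-lambda * dt).exp() + weight;
            self.last_update_s = event_time_s;
        } else {
            let late_by = self.last_update_s - event_time_s;
            self.score += weight * (-lambda * late_by).exp();
        }
    }

    /// Lazy decay to the query time.
    fn at(&self, t_s: f64, lambda: f64) -> f64 {
        self.score * (-lambda * (t_s - self.last_update_s)).exp()
    }
}
```

Applying events at t = 0, 10, 20 in order, or as 0, 20, 10, gives the same score at any later query time (up to rounding), because each event's contribution depends only on its own timestamp.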

The Jacobs Forward-Decay Trick

For ranking-only queries (where only relative ordering matters, not absolute scores), the running score can be reformulated to eliminate all read-time computation:

S(t) = exp(-lambda * t) * SUM_i [ w_i * exp(lambda * t_i) ]

The term S_static = SUM_i [ w_i * exp(lambda * t_i) ] changes only on writes. Since exp(-lambda * t) is the same for all entities at a given query time, relative ordering is determined by S_static alone.

Overflow prevention. S_static grows exponentially. After time T, the magnitude is approximately exp(lambda * T). With a 1-hour half-life and lambda = 1.925e-4, after 1 year: exp(1.925e-4 * 3.15e7) = exp(6063) -- catastrophic overflow.

Solution: log-space arithmetic. Store z = log(S_static) instead. Update rule:

z_new = log(exp(z_prev) + w * exp(lambda * t_event))
      = z_prev + log(1 + w * exp(lambda * t_event - z_prev))

Using the log1p function for numerical stability when the addend is small.
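A sketch of the log-space update. It uses the standard log-sum-exp variant that factors out the larger of the two terms (rather than always factoring out z_prev as in the formula above), so the `exp()` argument is never positive and cannot overflow. The function names are hypothetical, weights are assumed to be strictly positive, and an empty accumulator is represented as negative infinity.

```rust
/// log(exp(a) + exp(b)) without overflow: factor out the larger term.
fn log_add(a: f64, b: f64) -> f64 {
    let (hi, lo) = if a >= b { (a, b) } else { (b, a) };
    if hi == f64::NEG_INFINITY {
        return f64::NEG_INFINITY; // both terms are zero in linear space
    }
    hi + (lo - hi).exp().ln_1p()
}

/// z_new = log(exp(z_prev) + w * exp(lambda * t_event)), assuming w > 0.
fn update_log_static(z_prev: f64, weight: f64, lambda: f64, t_event_s: f64) -> f64 {
    log_add(z_prev, weight.ln() + lambda * t_event_s)
}

/// Recover the absolute score when needed: S(t) = exp(z - lambda * t).
fn score_at(z: f64, lambda: f64, t_s: f64) -> f64 {
    (z - lambda * t_s).exp()
}
```

Because `ln` is monotonic, ranking by `z` gives the same order as ranking by S_static, so the hot path can compare `z` values directly. Note that this holds only between entities scored with the same lambda.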

Applicability. Implement the Jacobs trick only for the primary ranking hot path where it eliminates the per-entity exp() call. Fall back to standard lazy-decay for queries that need absolute score values (e.g., SignalSnapshot in the response).

Numerical Stability

f64 precision is not a practical concern. Each running-score update introduces ~0.5 ULP of rounding error. After 10^12 updates, accumulated error would be ~10^-10 relative. Jules Jacobs analyzed that with f64 and a 1-hour half-life, the system can run until the year 18,000 without precision issues.

Underflow is desirable. When an entity receives no signals for a long time, its decay score approaches 0.0. This is correct behavior -- the content has become irrelevant. Underflow to exactly 0.0 (which for an f64 score of order 1 happens at approximately lambda * dt > 745, i.e. dt > ~1,075 half-lives) produces the correct ranking: the entity drops out of contention.

Invariant. Decay scores are non-negative. A negative score indicates a bug. Assert score >= 0.0 on every update in debug builds.

Linear Decay

For signals using Decay::Linear { lifetime }:

S(t) = SUM_i [ w_i * max(0, 1 - (t - t_i) / lifetime) ]

Linear decay cannot use the running-score trick because the max(0, ...) clamp is not multiplicatively composable. Instead, linear-decay signals rely on windowed aggregation with the window duration set to lifetime. The aggregate at query time is the count/sum of events within the lifetime window, with the weight linearly interpolated at the window boundary.

Linear decay is primarily used for signals where a hard expiry is desirable: the contribution ramps down and reaches exactly zero at lifetime -- e.g., a promotion that lasts exactly 7 days.
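For reference, the linear-decay formula evaluated directly over a list of events. This is a sketch of the definition only, not of the windowed implementation; events are `(time, weight)` pairs in seconds, and clamping the factor to [0, 1] (so that events stamped in the future are not over-weighted) is an assumption of this example.

```rust
/// S(t) = SUM_i w_i * max(0, 1 - (t - t_i) / lifetime), evaluated directly.
fn linear_decay_score(events: &[(f64, f64)], lifetime_s: f64, now_s: f64) -> f64 {
    events
        .iter()
        .map(|&(t_i, w_i)| {
            let age = now_s - t_i;
            // Assumption: clamp to [0, 1] so future-dated events count at most once.
            w_i * (1.0 - age / lifetime_s).clamp(0.0, 1.0)
        })
        .sum()
}
```

With a lifetime of 100 s, an event of weight 2 that is 50 s old contributes 1.0, and any event 100 s old or older contributes nothing, which is the hard expiry described above.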


5. Velocity Computation

Velocity is the rate of change of signal volume within a window. It answers: "Is this signal accelerating or decelerating?" Velocity is the primary signal for trending and rising surfaces.

Definition

For a signal with windowed count C(t, w) representing the number of events in the window [t-w, t]:

velocity(t, w) = C(t, w) / w

This is the simplest form: events per unit time. For example, with w = 1h, a view velocity of 500/hour means 500 views in the last hour.

Relative Velocity (Acceleration)

For rising/breakout detection, what matters is not absolute velocity but velocity relative to a baseline:

relative_velocity(t) = velocity(t, w_short) / velocity(t, w_long)

Where w_short is a short window (e.g., 1h) and w_long is a longer window (e.g., 24h). When relative_velocity > 1.0, the signal is accelerating. When relative_velocity >> 1.0, the content is breaking out.

Example. An item averaging 100 views/hour over the last 24h that suddenly receives 1,000 views in the last hour has relative_velocity = 10.0. This is a strong rising signal.

Smoothed Velocity (EWMA)

Raw velocity is noisy at short windows. A single burst of views creates a spike that disappears one window-duration later. For ranking stability, velocity is smoothed using an Exponentially Weighted Moving Average (EWMA):

V_smooth(t) = alpha * V_raw(t) + (1 - alpha) * V_smooth(t_prev)

Where alpha determines the smoothing factor. Smaller alpha = smoother but slower to react. Larger alpha = noisier but faster to detect changes.

| Window | Recommended alpha | Rationale |
| --- | --- | --- |
| 1h | 0.3 | Fast reaction for real-time trending |
| 24h | 0.1 | Smooth daily trend with less noise |
| 7d | 0.05 | Very smooth weekly trend |

Implementation

Velocity does not require a separate data structure. It is computed from the bucketed counters in the warm tier:

impl WarmSignalState {
    /// Compute velocity for a given window.
    ///
    /// Sums the relevant minute/hour buckets and divides by window duration.
    /// Cost: O(bucket_count) -- at most 168 for 7-day window at hourly granularity.
    pub fn velocity(&self, window: &Window, now_ns: u64) -> f64 {
        let (count, duration_secs) = match window {
            Window::Sliding { duration } if duration <= &Duration::hours(1) => {
                let minutes = duration.as_secs() / 60;
                let count = self.sum_minute_buckets(minutes as usize, now_ns);
                (count, duration.as_secs_f64())
            }
            Window::Sliding { duration } => {
                let hours = duration.as_secs() / 3600;
                let count = self.sum_hour_buckets(hours as usize, now_ns);
                (count, duration.as_secs_f64())
            }
            Window::AllTime => return 0.0, // velocity is undefined for all-time
        };
        count as f64 / duration_secs
    }

    /// Compute relative velocity (acceleration).
    ///
    /// ratio > 1.0 means accelerating; ratio < 1.0 means decelerating.
    pub fn relative_velocity(
        &self,
        short_window: &Window,
        long_window: &Window,
        now_ns: u64,
    ) -> f64 {
        let v_short = self.velocity(short_window, now_ns);
        let v_long = self.velocity(long_window, now_ns);
        if v_long < f64::EPSILON {
            // No baseline -- treat as infinite acceleration if short > 0.
            if v_short > 0.0 { f64::MAX } else { 0.0 }
        } else {
            v_short / v_long
        }
    }
}

Velocity as EWMA (Smoothed)

The EWMA velocity is maintained as an additional atomic field in the warm tier, updated every time the minute bucket rolls over:

/// Updated once per minute by the bucket rotation logic.
fn update_smoothed_velocity(&self, raw_velocity: f64, alpha: f64) {
    loop {
        let prev_bits = self.smoothed_velocity.load(Ordering::Acquire);
        let prev = f64::from_bits(prev_bits);
        let new = alpha * raw_velocity + (1.0 - alpha) * prev;
        match self.smoothed_velocity.compare_exchange_weak(
            prev_bits,
            new.to_bits(),
            Ordering::AcqRel,
            Ordering::Acquire,
        ) {
            Ok(_) => break,
            Err(_) => continue,
        }
    }
}

6. Windowed Aggregation

SWAG: Sliding Window Aggregation via Two-Stacks

For O(1) amortized sliding window aggregation, we use the Two-Stacks algorithm (Tangwongsan, Hirzel, Schneider, PVLDB 2015).

Requirements. The aggregation operator must be associative (forming a monoid). This covers count, sum, min, max, and compositions thereof.

Structure. Two stacks, each storing (value, prefix_aggregate) pairs:

  • Back stack: New events are pushed here. back.top.agg = combine(back.prev.agg, new_value).
  • Front stack: Evictions pop from here. If empty, flip all elements from back to front.

Insert event:  push to back stack     O(1)
Evict event:   pop from front stack   O(1) amortized (O(n) flip at most once per element)
Query agg:     combine(front.top.agg, back.top.agg)   O(1)
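A compact sketch of the Two-Stacks structure described above, generic over any associative `combine`. The type and method names are illustrative, not tidalDB's `SwagState`. Each stack entry stores the value together with the aggregate of itself and everything beneath it, and the operand order is kept oldest-to-newest so non-commutative operators also work.

```rust
/// Two-Stacks sliding-window aggregator for an associative `combine`.
struct TwoStacks<T: Copy, F: Fn(T, T) -> T> {
    front: Vec<(T, T)>, // (value, aggregate of this entry and all below it)
    back: Vec<(T, T)>,
    combine: F,
}

impl<T: Copy, F: Fn(T, T) -> T> TwoStacks<T, F> {
    fn new(combine: F) -> Self {
        Self { front: Vec::new(), back: Vec::new(), combine }
    }

    /// Insert the newest event. O(1).
    fn push(&mut self, value: T) {
        let agg = match self.back.last() {
            Some(&(_, below)) => (self.combine)(below, value),
            None => value,
        };
        self.back.push((value, agg));
    }

    /// Evict the oldest event. O(1) amortized: each element is flipped once.
    fn pop(&mut self) -> Option<T> {
        if self.front.is_empty() {
            // Flip: newest lands at the bottom of `front`, oldest on top.
            while let Some((value, _)) = self.back.pop() {
                let agg = match self.front.last() {
                    Some(&(_, below)) => (self.combine)(value, below),
                    None => value,
                };
                self.front.push((value, agg));
            }
        }
        self.front.pop().map(|(value, _)| value)
    }

    /// Aggregate over the whole window. O(1).
    fn query(&self) -> Option<T> {
        match (self.front.last(), self.back.last()) {
            (Some(&(_, f)), Some(&(_, b))) => Some((self.combine)(f, b)),
            (Some(&(_, f)), None) => Some(f),
            (None, Some(&(_, b))) => Some(b),
            (None, None) => None,
        }
    }
}
```

For example, `TwoStacks::new(|a: i64, b: i64| a.max(b))` maintains a sliding maximum, an aggregate that cannot be maintained by simple subtraction on eviction, which is the case Two-Stacks is designed for.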

Scotty Stream-Slicing: Practical Implementation

Rather than maintaining pure SWAG stacks per window, tidalDB uses the Scotty stream-slicing approach (Traub et al., EDBT 2019): divide the event stream into non-overlapping time slices (per-minute and per-hour buckets), compute partial aggregates per slice, and share these across all concurrent windows.

This means a single set of per-minute and per-hour counters supports simultaneous 1h, 24h, and 7d window queries. The cost of a windowed query is O(number_of_buckets_in_window):

| Window | Bucket Granularity | Buckets to Sum | Cost |
| --- | --- | --- | --- |
| 1h | per-minute | 60 | ~120 ns |
| 24h | per-hour | 24 | ~48 ns |
| 7d | per-hour | 168 | ~336 ns |
| 30d | per-hour | 720 (from rollups) | ~1.4 us |
| all_time | single counter | 1 | ~2 ns |

For the 30-day window, the system merges hourly rollups from the cold tier (disk) with in-memory hour buckets for the current 7 days. This follows the TimescaleDB real-time continuous aggregate pattern.

Bucket Rotation

Minute buckets rotate every 60 seconds. Hour buckets rotate every 3600 seconds. Rotation is performed by the background materializer thread:

  1. Record the current bucket's final value.
  2. Zero the next bucket in the ring (the oldest slot) so it can be reused.
  3. Update the current-bucket pointer (atomic store).
  4. If hour boundary crossed: aggregate the last 60 minute buckets into the hour bucket.

Concurrency during rotation. Writers continue incrementing the new current bucket via atomic add. Readers sum buckets starting from the current pointer and wrapping backwards. The window between "bucket zeroed" and "pointer advanced" is at most one atomic store apart, and a reader that sees the old pointer will include one extra bucket (slightly over-counting rather than under-counting), which is acceptable for ranking purposes.
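The rotation steps can be sketched with a small ring of atomic counters. This is an illustration under two assumptions: the bucket zeroed in step 2 is the slot about to be reused (the oldest one), and the current index is an `AtomicUsize` rather than the `AtomicU8` used in `WarmSignalState`. `MinuteRing` is a hypothetical name, and the hour-boundary rollup (step 4) is omitted.

```rust
use std::sync::atomic::{AtomicU32, AtomicUsize, Ordering};

const SLOTS: usize = 60;

struct MinuteRing {
    buckets: [AtomicU32; SLOTS],
    current: AtomicUsize,
}

impl MinuteRing {
    fn new() -> Self {
        Self {
            buckets: std::array::from_fn(|_| AtomicU32::new(0)),
            current: AtomicUsize::new(0),
        }
    }

    /// Writer path: count one event in the current bucket.
    fn record(&self) {
        let i = self.current.load(Ordering::Acquire);
        self.buckets[i].fetch_add(1, Ordering::Relaxed);
    }

    /// Materializer path: called once per minute. Returns the finished count.
    fn rotate(&self) -> u32 {
        let cur = self.current.load(Ordering::Acquire);
        let finished = self.buckets[cur].load(Ordering::Acquire); // step 1
        let next = (cur + 1) % SLOTS;
        self.buckets[next].store(0, Ordering::Release); // step 2: clear oldest slot
        self.current.store(next, Ordering::Release); // step 3: publish pointer
        finished
    }

    /// Reader path: sum the most recent `n` buckets, wrapping backwards.
    fn sum_last(&self, n: usize) -> u64 {
        let cur = self.current.load(Ordering::Acquire);
        (0..n.min(SLOTS))
            .map(|k| u64::from(self.buckets[(cur + SLOTS - k) % SLOTS].load(Ordering::Acquire)))
            .sum()
    }
}
```

Only the single materializer thread calls `rotate`, so the index update needs no CAS. Writers and readers tolerate observing the index one step stale, as discussed in the concurrency note above.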

Multiple Simultaneous Windows

All windows for a given signal type share the same bucket arrays. A 1h query sums the last 60 minute buckets. A 24h query sums the last 24 hour buckets. A 7d query sums the last 168 hour buckets. No duplicated storage.

The all_time window is a simple atomic counter incremented on every event. No bucketing needed.


7. Cohort-Scoped Signal Aggregation

This section specifies the architecture for cohort-scoped signal queries: "this item has 50k views in 24h among US users aged 18-24 who like jazz." This is the foundation for cohort-based trending, demographic-targeted recommendations, and audience analytics.

Problem Statement

Global signal aggregates answer "what is trending for everyone." Cohort-scoped aggregates answer "what is trending for this group of users." The groups can be defined by:

  • Demographics: region, language, age bracket
  • Behavioral: users who like jazz, users who prefer short-form, users who are power consumers
  • Social: users in this follower graph, users in this community
  • Composite: US users aged 18-24 who like jazz AND prefer short-form video

The number of possible cohort combinations is combinatorially explosive. The system must support thousands of pre-defined cohorts and ad-hoc cohort queries without unbounded storage growth.

Approach Evaluation

Three approaches were evaluated:

Approach A: Pre-computed cohort signals. At signal write time, resolve which cohorts the user belongs to and increment per-item-per-cohort counters.

  • Write amplification: events/sec * avg_cohorts_per_user (typically 5-15x).
  • Storage: items * cohorts * signals * windows * 4 bytes. At 10M items * 1000 cohorts * 6 signals * 5 windows * 4 bytes = 1.2 TB. Infeasible.
  • Read latency: O(1). Direct counter lookup.
  • Verdict: Rejected. Storage and write amplification are unacceptable at 1000+ cohorts.

Approach B: Query-time cohort filtering. Store signal events with user attributes attached. Filter events by cohort predicate at query time.

  • Write amplification: 1x (no additional writes).
  • Storage: Marginal increase per event (cohort attributes stored inline).
  • Read latency: O(events_in_window) per entity. At 50K events/day per popular item, scanning 24h of events = ~50K events * 50 ns = 2.5 ms per entity. For 200 candidates: 500 ms. Infeasible.
  • Verdict: Rejected. Read latency is unacceptable.

Approach C: Hierarchical rollups with dimensional decomposition. This is the recommended approach.

The design decomposes the cohort space into a fixed hierarchy of dimensions with pre-computed rollups at each level. Fine-grained cohort queries are answered by intersecting the appropriate dimensional rollups.

Dimension Hierarchy

Level 0: GLOBAL
    One counter per item per signal per window.
    Always maintained. Source of truth for global trending.

Level 1: PRIMARY DIMENSIONS (independently maintained)
    region:    {US, EU, APAC, LATAM, ...}     ~20 values
    language:  {en, es, fr, de, ja, ...}       ~30 values
    age_group: {13-17, 18-24, 25-34, 35-44, 45-54, 55+}  6 values
    Total Level 1 cohorts: ~56

Level 2: BEHAVIORAL SEGMENTS (computed, not enumerated)
    Defined by the application in schema. Examples:
    - "jazz_fans": users where preference_vector cosine_sim > 0.7 with jazz centroid
    - "power_users": users with > 100 signals in last 7 days
    - "short_form_preferred": users where > 70% of views are format:short
    Maximum: 100 application-defined segments.

Level 3: COMPOSITE (computed at query time)
    Intersection of Level 1 and Level 2 dimensions.
    e.g., "US + 18-24 + jazz_fans"
    Not pre-computed. Estimated from Level 1 and Level 2 aggregates.

Storage Layout

Cohort-scoped counters are stored in a dedicated column family:

CF "cohort_signals"     Leveled compaction, TTL matches window
    Key:   [item_id: u64 BE][signal_type: u8][dimension: u8][cohort_value: u16 BE][hour_bucket: u32 BE]
    Value: CohortBucket { count: u32, weighted_sum: f32, unique_users_hll: [u8; 12] }

Dimension encoding:

| Dimension ID (u8) | Dimension | Max Values | Description |
|---|---|---|---|
| 0 | global | 1 | Global aggregate (Level 0) |
| 1 | region | 20 | Geographic region |
| 2 | language | 30 | User language |
| 3 | age_group | 6 | Age bracket |
| 4-103 | segment_0..99 | 2 each (in/out) | Behavioral segments |
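As an illustration of the key layout, the sketch below packs the five key fields into a fixed 16-byte array (8 + 1 + 1 + 2 + 4). The function name `encode_cohort_key` is hypothetical, and encoding the two single-byte fields as raw bytes is an assumption. Big-endian encoding means lexicographic key order matches (item, signal, dimension, cohort, hour) order, so all hour buckets for one cohort are contiguous on disk.

```rust
/// Byte length of a cohort_signals key: 8 + 1 + 1 + 2 + 4.
const COHORT_KEY_LEN: usize = 16;

/// Hypothetical encoder for the cohort_signals key layout.
fn encode_cohort_key(
    item_id: u64,
    signal_type: u8,
    dimension: u8,
    cohort_value: u16,
    hour_bucket: u32,
) -> [u8; COHORT_KEY_LEN] {
    let mut key = [0u8; COHORT_KEY_LEN];
    key[0..8].copy_from_slice(&item_id.to_be_bytes());
    key[8] = signal_type;
    key[9] = dimension;
    key[10..12].copy_from_slice(&cohort_value.to_be_bytes());
    key[12..16].copy_from_slice(&hour_bucket.to_be_bytes());
    key
}
```

A range scan over consecutive `hour_bucket` values for a fixed prefix then reads the buckets needed for a windowed sum in order.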

Storage Cost Analysis

Per-item, per-signal-type, per-hour:

Level 0: 1 global bucket                              = 20 bytes
Level 1: (20 + 30 + 6) = 56 cohort buckets            = 1,120 bytes
Level 2: 100 segment buckets (boolean in/out)          = 2,000 bytes
Total per item per signal per hour:                    = 3,140 bytes

For 10M items * 6 signal types * 24 hours * 3,140 bytes = 4.5 TB/day at full population. This is infeasible for all 10M items.

Critical insight: cohort counters are only needed for candidate items. Cohort-scoped trending queries operate over at most a few thousand candidate items (e.g., items with global velocity above a threshold). The vast majority of items have negligible signal activity and do not need cohort decomposition.

Revised approach: threshold-gated cohort tracking.

/// Cohort tracking is activated for an item + signal when the global
/// signal rate exceeds this threshold. Below this threshold, cohort
/// breakdown adds no useful information.
const COHORT_ACTIVATION_THRESHOLD: u32 = 100; // events per hour

This analysis assumes that, at any given time, fewer than 100K items exceed 100 events/hour for any signal type. Cohort storage for 100K items:

100K items * 6 signals * 24 hours * 3,140 bytes = 45.2 GB/day

With 7-day retention on hourly cohort rollups: 316 GB. Feasible.
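The storage arithmetic in this section can be reproduced with a few constants. This is only a worked calculation; the constant and function names are illustrative.

```rust
/// Size of one CohortBucket: count (4) + weighted_sum (4) + HLL (12).
const BUCKET_BYTES: u64 = 20;
/// Buckets per item per signal per hour: 1 global + 56 Level 1 + 100 Level 2.
const BUCKETS_PER_ITEM_SIGNAL_HOUR: u64 = 1 + 56 + 100;

/// Bytes of cohort bucket data written per day for `items` tracked items.
fn cohort_bytes_per_day(items: u64, signal_types: u64) -> u64 {
    items * signal_types * 24 * BUCKETS_PER_ITEM_SIGNAL_HOUR * BUCKET_BYTES
}

/// Bytes retained when hourly cohort rollups are kept for `days` days.
fn cohort_bytes_retained(items: u64, signal_types: u64, days: u64) -> u64 {
    cohort_bytes_per_day(items, signal_types) * days
}
```

With 100K tracked items and 6 signal types this gives 45,216,000,000 bytes per day and 316,512,000,000 bytes over 7 days, matching the figures above.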

Write Path: Cohort Attribution

At signal write time, the user's cohort memberships are resolved and cached:

/// Resolved once per user, cached in the user's hot-tier state.
/// Refreshed when user metadata changes or behavioral segments are recomputed.
struct UserCohortMemberships {
    region: CohortValueId,          // 2 bytes
    language: CohortValueId,        // 2 bytes
    age_group: CohortValueId,       // 2 bytes
    segments: BitSet128,            // 16 bytes -- one bit per behavioral segment
}
// 22 bytes per user. 10M users = 220 MB.

On signal write:

  1. Look up the user's UserCohortMemberships (hot-tier, O(1)).
  2. If the target item has cohort tracking activated: a. Increment the global counter (always). b. Increment the region counter for this user's region. c. Increment the language counter for this user's language. d. Increment the age_group counter for this user's age group. e. For each behavioral segment the user belongs to, increment that segment's counter.
  3. If the item does not have cohort tracking activated: a. Increment the global counter only. b. Check if the global counter crossed the activation threshold. If so, activate cohort tracking.
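The attribution steps above can be sketched as a function that lists which counters a single event touches. This is a simplified illustration: `UserCohorts` stands in for `UserCohortMemberships`, with plain `u16` and `u128` fields in place of `CohortValueId` and `BitSet128`, and using cohort value 1 for "in segment" is an assumption.

```rust
/// Simplified stand-in for UserCohortMemberships (field types are assumptions).
struct UserCohorts {
    region: u16,
    language: u16,
    age_group: u16,
    /// One bit per behavioral segment; only bits 0..100 are used.
    segments: u128,
}

/// Returns the (dimension_id, cohort_value) counters to increment for one event.
fn counters_to_increment(user: &UserCohorts, cohort_tracking_active: bool) -> Vec<(u8, u16)> {
    // The global counter (dimension 0) is always incremented.
    let mut out = vec![(0u8, 0u16)];
    if !cohort_tracking_active {
        return out;
    }
    out.push((1, user.region));
    out.push((2, user.language));
    out.push((3, user.age_group));
    // Dimension IDs 4..=103 map to segment_0..segment_99.
    for seg in 0..100u8 {
        if user.segments & (1u128 << seg) != 0 {
            out.push((4 + seg, 1));
        }
    }
    out
}
```

The length of the returned vector is the per-event write amplification: 1 below the activation threshold, and 1 + 3 + (number of segments) above it.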

Write amplification analysis:

Scenario Counter Increments per Event
Below threshold (vast majority) 1 (global only)
Above threshold, user in 8 segments 1 + 3 + 8 = 12
Above threshold, user in 20 segments 1 + 3 + 20 = 24

Average write amplification across all events (assuming 1% of events target cohort-tracked items, users average 10 segments): 0.99 * 1 + 0.01 * 14 = 1.13x. Negligible.

Read Path: Cohort-Scoped Queries

Single-dimension queries (e.g., "trending in US") are direct lookups:

/// O(1) per item per signal. Same as global trending but reads from
/// the dimension-specific counter.
fn cohort_velocity(
    &self,
    item: EntityId,
    signal: SignalTypeId,
    dimension: DimensionId,
    cohort_value: CohortValueId,
    window: &Window,
) -> f64 {
    // Sum the hour buckets for this (item, signal, dimension, cohort_value)
    // Same pattern as global velocity but from the cohort_signals CF.
}

Read latency: same as global windowed query, ~50 ns to ~1.4 us depending on window.
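A minimal sketch of the bucket-sum step, assuming the hour buckets for one (item, signal, dimension, cohort_value) have already been loaded into a slice ordered oldest to newest. The free function `velocity_from_hour_buckets` is hypothetical; it computes velocity as windowed count divided by window duration.

```rust
/// Events per second over the most recent `window_hours` hour buckets.
/// `hour_counts` is ordered oldest to newest.
fn velocity_from_hour_buckets(hour_counts: &[u32], window_hours: usize) -> f64 {
    if window_hours == 0 {
        return 0.0;
    }
    // Missing buckets (fewer than `window_hours` available) count as zero.
    let take = window_hours.min(hour_counts.len());
    let total: u64 = hour_counts[hour_counts.len() - take..]
        .iter()
        .map(|&c| u64::from(c))
        .sum();
    total as f64 / (window_hours as f64 * 3600.0)
}
```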

Composite queries (e.g., "trending among US users aged 18-24 who like jazz"):

Composite cohort queries combine multiple dimensions. Because each dimension is counted separately, the intersection is not observed directly; it is estimated from the per-dimension counters under an independence assumption (the product of each dimension's share of the global count).

Estimation approach for composite cohorts:

For two independent dimensions A and B, the count of events from users in both A and B is estimated as:

C(A AND B) ~= C(global) * (C(A) / C(global)) * (C(B) / C(global))
           = C(A) * C(B) / C(global)

This assumes independence between dimensions. For correlated dimensions (e.g., region and language are correlated: US users are more likely to speak English), the estimate has error proportional to the correlation strength.

For three dimensions A, B, S (two Level 1 + one Level 2):

C(A AND B AND S) ~= C(A) * C(B) * C(S) / C(global)^2

Accuracy bounds. The estimate is exact only when the dimensions are statistically independent; the error grows with the dependence (mutual information) between them. For region/language (moderately correlated), empirical testing on real engagement data shows ~15-25% relative error. For region/age_group (weakly correlated), error is ~5-10%.
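The two- and three-dimension formulas generalize to any number of dimensions: start from the global count and multiply by each dimension's share. The sketch below shows that calculation; the function name `estimate_composite` is illustrative.

```rust
/// Independence-based estimate of the event count in the intersection of
/// several cohort dimensions: C(global) * product(C(d_i) / C(global)).
fn estimate_composite(global: f64, dimension_counts: &[f64]) -> f64 {
    if global <= 0.0 {
        return 0.0;
    }
    dimension_counts
        .iter()
        .fold(global, |acc, &count| acc * (count / global))
}
```

For example, with 1,000 global events, 400 from the target region and 250 from the target age group, the estimate is 1000 * 0.4 * 0.25 = 100 events.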

When estimation is insufficient: For high-value composite cohorts that the application queries frequently, the application can define them as Level 2 behavioral segments with exact counting. A segment "us_young_jazz" that is the intersection of region:US, age_group:18-24, and jazz_fans gets its own exact counter tracked at write time.

Cohort Membership Changes Over Time

User cohort memberships change:

  • Demographics (Level 1): Rarely change. Region changes on relocation. Age group changes yearly. Language changes rarely.
  • Behavioral segments (Level 2): Change as user preferences evolve. A user may enter or leave the "jazz_fans" segment as their engagement shifts.

Membership refresh policy:

  1. Level 1 memberships are updated when user metadata is explicitly changed (db.update_user()).
  2. Level 2 memberships are recomputed by the background materializer on a configurable schedule (default: every hour).
  3. When a membership changes, future signal events use the new membership. Historical counters are not retroactively adjusted -- this is acceptable because cohort trending is inherently a "what's happening now" query, not a historical audit.

Implication for accuracy. If a user's behavioral segment changes between refreshes, counters for the old segment may include events from users who no longer belong. The staleness is bounded by the refresh interval (default 1 hour). For trending queries over a 24h window, this introduces at most ~4% error in the worst case (1 stale hour out of 24). For a 1h window, the stale interval can span the entire window, so segment-scoped 1h results should be treated as approximate.

Capacity and Scaling

| Metric | Value |
|---|---|
| Maximum pre-defined cohorts (Level 1 + Level 2) | ~156 |
| Maximum ad-hoc composite cohorts | Unlimited (estimated at query time) |
| Items with active cohort tracking | ~100K (threshold-gated) |
| Storage for cohort data | ~316 GB (7-day retention) |
| Write amplification (average) | ~1.13x |
| Read latency (single dimension) | ~50 ns to ~1.4 us |
| Read latency (composite, 2 dimensions) | ~100 ns to ~3 us |
| Read latency (composite, 3+ dimensions) | ~200 ns to ~5 us |
| Accuracy (single dimension) | Exact |
| Accuracy (2-dimension composite) | ~85-95% (independence assumption) |
| Accuracy (3+ dimension composite) | ~75-90% (use exact segments for critical queries) |

8. Signal Write Path

The signal write path is the most performance-critical transaction in tidalDB. A single db.signal() call triggers a cascade of updates across multiple subsystems.

Write Path Data Flow

Application calls db.signal(Signal { kind: "view", item: "X", user: "U", ... })
     |
     v
[1. DEDUP CHECK] ---- BLAKE3(signal_type, item_id, user_id, timestamp) ---> content hash
     |                 If hash exists in dedup set: return Ok(()) silently.
     |                 Dedup set: in-memory bloom filter + on-disk hash set.
     v
[2. WAL APPEND] -----> Write signal event to WAL segment.
     |                  Durability: Immediate, Batched, or Eventual per signal type.
     |                  Event is durable after this step.
     v
[3. HOT-TIER UPDATE] -> Update HotSignalState.decay_scores (atomic CAS).
     |                   Update HotSignalState.last_update_ns (atomic store).
     |                   Cost: ~36ns (3 exp() calls).
     v
[4. WARM-TIER UPDATE] -> Increment minute bucket (atomic add).
     |                    Increment all-time counter (atomic add).
     |                    If cohort tracking active: increment cohort counters.
     |                    Cost: ~20ns (atomic increments).
     v
[5. USER PREF UPDATE] -> Shift user preference vector toward/away from item embedding.
     |                    Direction: toward for positive signals, away for negative.
     |                    Magnitude: proportional to signal weight * learning_rate.
     |                    Cost: ~200ns (vector arithmetic on 1536D embedding).
     v
[6. RELATIONSHIP UPDATE] -> Update user->creator interaction_weight.
     |                       Update user->item state (seen, liked, hidden, etc.).
     |                       Cost: ~50ns (atomic updates).
     v
[7. RETURN Ok(())]

Atomicity Guarantees

Steps 3-6 are not wrapped in a transaction. They are independent atomic updates to separate data structures. The WAL (step 2) is the source of truth. If the process crashes between step 3 and step 6:

  • The WAL contains the event.
  • On recovery, the WAL is replayed from the last checkpoint.
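  • Steps 3-6 are re-executed against the state restored from that checkpoint, so each logged event is applied exactly once. The dedup hash prevents retried events from being double-counted, and running-score updates are commutative, so the replay order does not need to match the original interleaving.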

This is a deliberate choice: transactional atomicity across all four updates would require a mutex or 2PC, which violates the lock-free hot-path requirement. Instead, eventual consistency is achieved through WAL replay.

Consistency guarantee: After WAL replay completes (bounded by max_replay_time, typically <30 seconds), all aggregates are consistent with the event stream.

Content-Addressed Deduplication

Signal events are deduplicated using BLAKE3 hashing:

/// Compute the content hash for deduplication.
fn signal_content_hash(signal: &Signal) -> [u8; 32] {
    let mut hasher = blake3::Hasher::new();
    hasher.update(signal.kind.as_bytes());
    hasher.update(&signal.item.to_bytes());
    hasher.update(&signal.user.to_bytes());
    // Truncate timestamp to second granularity to handle
    // sub-second retries of the same logical event.
    let ts_secs = signal.timestamp.timestamp();
    hasher.update(&ts_secs.to_le_bytes());
    *hasher.finalize().as_bytes()
}

Dedup storage: A bloom filter (in-memory, ~240 MB for 100M events at 0.01% FPR, i.e. about 19 bits per event) provides fast negative lookups. On a bloom filter hit (potential duplicate), the on-disk hash set is consulted for confirmation. False positives in the bloom filter cause unnecessary disk reads (~50 us) but do not cause data loss.
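Bloom filter memory can be sized with the standard formula m = -n ln(p) / (ln 2)^2 bits for n keys at false-positive rate p, with k = (m / n) ln 2 hash functions. The sketch below is a sizing calculation only; the function names are illustrative.

```rust
use std::f64::consts::LN_2;

/// Optimal number of bits for `n` keys at false-positive rate `fpr`.
fn bloom_bits(n: u64, fpr: f64) -> u64 {
    (-(n as f64) * fpr.ln() / (LN_2 * LN_2)).ceil() as u64
}

/// Optimal number of hash functions for a filter of `bits` bits holding `n` keys.
fn bloom_hashes(n: u64, bits: u64) -> u32 {
    ((bits as f64 / n as f64) * LN_2).round().max(1.0) as u32
}
```

For 100M keys at a 0.01% false-positive rate this works out to roughly 19 bits per key, or about 240 MB, with 13 hash functions.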

Group Commit

Signal writes use the configurable Durability level from Config:

pub enum Durability {
    /// fsync every write. For financial/purchase events.
    /// Latency: ~1ms per write (dominated by fsync).
    Immediate,

    /// fsync per batch. Default for engagement signals.
    /// Accumulate up to max_batch events or max_delay_ms, whichever comes first.
    /// Latency: ~10-100us per write (amortized fsync).
    Batched { max_batch: usize, max_delay_ms: u64 },

    /// fsync on OS schedule. For impressions, low-value telemetry.
    /// Latency: ~1us per write (no fsync).
    /// Risk: up to OS buffer duration of events lost on power failure.
    Eventual,
}

The group commit queue accumulates signal events and issues a single fsync per batch. Writers are notified of completion via a per-batch condition variable. This follows the PostgreSQL commit delay pattern, validated in production by Citadel's GroupCommitQueue.

Throughput at Batched { max_batch: 100, max_delay_ms: 10 }:

  • 1 fsync per 100 events or per 10ms.
  • At 10,000 events/sec: 100 fsyncs/sec, each flushing ~100 events.
  • NVMe SSD fsync latency: ~50-100us.
  • Throughput: bounded by event processing, not fsync. >50,000 events/sec achievable.
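The flush rule for `Batched` durability can be sketched as a small policy type. This is an illustration under stated assumptions: `BatchPolicy` and its methods are hypothetical, and the fsync-rate estimate assumes a steady arrival rate.

```rust
use std::time::Duration;

/// Hypothetical flush policy mirroring Durability::Batched.
struct BatchPolicy {
    max_batch: usize,
    max_delay: Duration,
}

impl BatchPolicy {
    /// Flush when the batch is full or the oldest pending event has waited
    /// `max_delay`, whichever comes first. An empty queue never flushes.
    fn should_flush(&self, pending: usize, oldest_age: Duration) -> bool {
        pending > 0 && (pending >= self.max_batch || oldest_age >= self.max_delay)
    }

    /// Approximate fsyncs per second at a steady event rate.
    fn fsyncs_per_sec(&self, events_per_sec: f64) -> f64 {
        let by_size = events_per_sec / self.max_batch as f64;
        let by_time = 1.0 / self.max_delay.as_secs_f64();
        // There is never more than one fsync per event.
        events_per_sec.min(by_size.max(by_time))
    }
}
```

At `max_batch: 100` and `max_delay_ms: 10`, a rate of 10,000 events/sec yields about 100 fsyncs/sec, and 50,000 events/sec yields about 500.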

Signal Weight Semantics

The weight field in a signal event has signal-type-specific semantics:

| Signal Type | Weight Meaning | Typical Values |
|---|---|---|
| view | 1.0 per view | Always 1.0 |
| completion | Fraction completed | 0.0 to 1.0 |
| like | 1.0 per like | Always 1.0 |
| skip | 1.0 per skip | Always 1.0 |
| dwell_time | Seconds of dwell | 0.0 to 3600.0 |
| share | 1.0 per share | Always 1.0 |
| search_click | 1.0 / log2(rank + 1) | 1.0 at rank 1, decreasing with rank |

Weights are validated at write time against the signal definition. Negative weights are rejected (negative signals use separate signal types, not negative weights).
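A minimal sketch of write-time weight validation and the `search_click` weight formula. The `WeightError` enum, the per-type `max` bound, and the 1-based rank convention are assumptions for illustration; the text above only states that negative weights are rejected and that weights are checked against the signal definition.

```rust
#[derive(Debug, PartialEq)]
enum WeightError {
    NotFinite,
    Negative,
    OutOfRange,
}

/// Weight for a search click at a 1-based result rank: 1.0 / log2(rank + 1).
fn search_click_weight(rank: u32) -> f64 {
    // Rank 0 is treated as rank 1 to avoid dividing by log2(1) = 0.
    1.0 / (f64::from(rank.max(1)) + 1.0).log2()
}

/// Rejects NaN, infinite, negative, and above-maximum weights.
fn validate_weight(weight: f64, max: f64) -> Result<f64, WeightError> {
    if !weight.is_finite() {
        return Err(WeightError::NotFinite);
    }
    if weight < 0.0 {
        return Err(WeightError::Negative);
    }
    if weight > max {
        return Err(WeightError::OutOfRange);
    }
    Ok(weight)
}
```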


9. Background Materializer

The background materializer is a dedicated thread (or thread pool) that continuously maintains materialized aggregates, performs bucket rotation, computes behavioral segments, and manages tier transitions.

Responsibilities

  1. Bucket rotation. Every minute: rotate minute buckets. Every hour: aggregate minute buckets into hour buckets. Every day: aggregate hour buckets into daily rollups.

  2. Rollup generation. Incrementally compute hourly and daily rollups and persist to the cold tier. Follows the TimescaleDB continuous aggregate pattern.

  3. Hot-tier checkpointing. Periodically (every 30-60 seconds) snapshot hot-tier HotSignalState to the entity_signal_state CF for crash recovery.

  4. Cohort segment recomputation. Hourly: recompute behavioral segment memberships for users with recent activity.

  5. Cohort activation/deactivation. Monitor global signal rates and activate/deactivate cohort tracking for items crossing the threshold.

  6. Warm-tier eviction. Evict warm-tier entries for entities with no recent activity.

  7. Velocity smoothing. Update EWMA velocity estimates on each bucket rotation.

Staleness Bounds

The materializer guarantees that materialized state is fresh within a bounded staleness interval:

| Materialized State | Staleness Bound | Rationale |
|---|---|---|
| Hot-tier decay scores | 0 (updated inline on write) | Part of the write path, not materializer |
| Minute-bucket counts | 0 (updated inline on write) | Part of the write path |
| Hour-bucket counts | 60 seconds | Aggregated from minute buckets on rotation |
| Hourly rollups (disk) | 65 seconds | Written after hour-bucket rotation + flush |
| Daily rollups (disk) | 25 hours | Computed from hourly rollups with 1h grace period |
| Behavioral segments | 1 hour | Recomputed hourly |
| Smoothed velocity (EWMA) | 60 seconds | Updated on minute-bucket rotation |
| Hot-tier checkpoint | 60 seconds | Persisted every 30-60 seconds |

Rollup Schedule

Every 1 minute:
    - Rotate minute buckets for all active entities.
    - Update EWMA velocity for all active entities.
    - Flush completed minute aggregates to hour-bucket accumulators.

Every 1 hour:
    - Finalize hourly rollup for the just-completed hour (after 1-minute grace).
    - Write hourly rollups to cold-tier CF "hourly_rollups".
    - Recompute behavioral segment memberships for recently active users.
    - Evaluate cohort activation thresholds.

Every 1 day:
    - Compute daily rollups from the 24 hourly rollups of the just-completed day.
    - Write daily rollups to cold-tier CF "daily_rollups".
    - Drop expired hourly rollups (>30 days) and raw events (>7 days).
    - Log storage size metrics.

Every 30-60 seconds:
    - Checkpoint hot-tier state to entity_signal_state CF.

Rollup Composability

Rollups store composable aggregates -- never store averages, percentiles, or other non-composable statistics. Store the components from which any statistic can be derived:

/// Composable hourly aggregate.
/// Invariant: a daily rollup is computed by composing 24 hourly rollups.
/// Invariant: a 7-day aggregate is computed by composing 168 hourly rollups.
struct HourlyRollup {
    /// Total event count in this hour.
    total_count: u32,
    /// Sum of event weights in this hour.
    weighted_sum: f32,
    /// Approximate unique user count (HyperLogLog, 12-byte register).
    unique_users_hll: [u8; 12],
    /// Maximum single-event weight (for outlier detection).
    max_weight: f32,
}

// Composition:
impl HourlyRollup {
    fn compose(a: &Self, b: &Self) -> Self {
        HourlyRollup {
            total_count: a.total_count + b.total_count,
            weighted_sum: a.weighted_sum + b.weighted_sum,
            unique_users_hll: hll_union(&a.unique_users_hll, &b.unique_users_hll),
            max_weight: a.max_weight.max(b.max_weight),
        }
    }
}
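The `hll_union` helper is not defined in this section. One plausible shape, assuming the 12 bytes hold one HyperLogLog register each (the actual packing is not specified here), is the standard register-wise maximum:

```rust
/// Register-wise maximum of two HyperLogLog sketches.
/// Assumes one register per byte; the real packing may differ.
fn hll_union(a: &[u8; 12], b: &[u8; 12]) -> [u8; 12] {
    std::array::from_fn(|i| a[i].max(b[i]))
}
```

Register-wise max is commutative, associative, and idempotent, which is what makes the HLL field composable across rollups in any grouping or order.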

Real-Time Continuous Aggregates

At query time, a windowed aggregate is computed by merging pre-materialized rollups with un-rolled-up recent data:

window_count(entity, signal, 7d) =
    sum(hourly_rollups for hours h-168..h-1)     // from cold tier
    + sum(minute_buckets for current hour)        // from warm tier

This is the TimescaleDB real-time continuous aggregate pattern. Materialized state provides the bulk of the answer (168 lookups from sorted on-disk data), and the warm tier fills in the gap since the last materialization. The measured speedup over scanning raw events is ~979x (TimescaleDB benchmark).
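The merge itself is a plain sum over two sources. A minimal sketch, assuming both inputs have already been fetched into slices (the function name and slice-based signature are illustrative):

```rust
/// Windowed count = materialized hourly rollups + live minute buckets
/// for the current, not yet rolled-up hour.
fn merged_window_count(hourly_rollup_counts: &[u32], current_hour_minutes: &[u32]) -> u64 {
    let materialized: u64 = hourly_rollup_counts.iter().map(|&c| u64::from(c)).sum();
    let live: u64 = current_hour_minutes.iter().map(|&c| u64::from(c)).sum();
    materialized + live
}
```

Summing into `u64` avoids overflow when many `u32` hourly counts are combined.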

Changelog

When a materialized aggregate changes significantly (configurable threshold, default: >20% relative change), the materializer records the change:

CF "signal_changelog"
    Key:   [entity_id: u64 BE][signal_type: u8][window_id: u8][timestamp_ns: u64 BE]
    Value: { old_value: f64, new_value: f64 }

The changelog enables:

  • "What was trending yesterday?" queries.
  • Debugging ranking behavior over time.
  • Alerting on unusual signal spikes (breakout detection).
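A sketch of the change-detection rule and the changelog key encoding. The function names are hypothetical, and treating any move away from zero as significant is an assumption, since a relative change is undefined when the old value is zero.

```rust
/// True when the relative change exceeds `threshold` (e.g. 0.20 for 20%).
fn should_record_change(old_value: f64, new_value: f64, threshold: f64) -> bool {
    if old_value == 0.0 {
        // Relative change is undefined at zero; record any non-zero new value.
        return new_value != 0.0;
    }
    ((new_value - old_value) / old_value).abs() > threshold
}

/// Encodes the signal_changelog key: 8 + 1 + 1 + 8 = 18 bytes, big-endian,
/// so entries for one (entity, signal, window) sort by timestamp.
fn changelog_key(entity_id: u64, signal_type: u8, window_id: u8, timestamp_ns: u64) -> [u8; 18] {
    let mut key = [0u8; 18];
    key[0..8].copy_from_slice(&entity_id.to_be_bytes());
    key[8] = signal_type;
    key[9] = window_id;
    key[10..18].copy_from_slice(&timestamp_ns.to_be_bytes());
    key
}
```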

10. Signal Event Format

Wire Format (API Boundary)

pub struct Signal<'a> {
    /// Signal type name. Must match a defined signal type.
    pub kind: &'a str,

    /// Target item entity ID.
    pub item: &'a str,

    /// Source user entity ID.
    pub user: &'a str,

    /// Event timestamp. If None, uses server time.
    pub timestamp: Option<DateTime<Utc>>,

    /// Signal weight. Meaning depends on signal type.
    /// Must be non-negative. Default: 1.0.
    pub weight: f64,

    /// Optional context for signal attribution and analysis.
    pub context: Option<serde_json::Value>,
}

Internal Storage Format (WAL)

+--------+--------+--------+--------+--------+--------+--------+--------+
| magic  | len    | signal | item_id          | user_id          | ts_ns
| (u8)   | (u16)  | type   | (u64 BE)         | (u64 BE)         | (u64 BE)
|        |        | (u8)   |                  |                  |
+--------+--------+--------+--------+--------+--------+--------+--------+
  ts_ns           | weight | ctx_len| context (variable)        | blake3
  (continued)     | (f32)  | (u16)  |                           | checksum
                  |        |        |                           | (first 8 bytes)
+--------+--------+--------+--------+--------+--------+--------+--------+

Fixed header:  34 bytes (1 + 2 + 1 + 8 + 8 + 8 + 4 + 2)
Context:       0 to 65535 bytes
Checksum:      8 bytes (truncated BLAKE3)
Total:         42 + context_len bytes

Design decisions:

  • signal_type is stored as a u8 index (not string) for compactness. Mapped from the signal name via the schema's signal type registry.
  • item_id and user_id are stored as u64 after the application's string IDs are mapped to internal numeric IDs by the entity store.
  • weight is stored as f32 (not f64) in the WAL for compactness. The running decay score in the hot tier uses f64 for accumulated precision; individual event weights do not need f64.
  • context is stored as raw bytes (MessagePack or JSON). Only parsed when accessed for analysis, never on the hot path.
  • BLAKE3 checksum (truncated to 8 bytes) provides corruption detection. Full 32-byte hash is used for deduplication but not stored in the WAL record.
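The fixed-width fields in the diagram sum to 34 bytes (1 + 2 + 1 + 8 + 8 + 8 + 4 + 2). The sketch below encodes just that fixed header. `WalHeader` and `encode_header` are hypothetical names, and big-endian encoding for `len`, `weight`, and `ctx_len` is an assumption, since the diagram marks only the IDs and timestamp as BE. The context bytes and truncated BLAKE3 checksum would follow the header.

```rust
const FIXED_HEADER_LEN: usize = 34;

/// Hypothetical in-memory form of the fixed WAL record header.
struct WalHeader {
    magic: u8,
    len: u16,
    signal_type: u8,
    item_id: u64,
    user_id: u64,
    ts_ns: u64,
    weight: f32,
    ctx_len: u16,
}

/// Serializes the fixed header fields in diagram order.
fn encode_header(h: &WalHeader) -> [u8; FIXED_HEADER_LEN] {
    let mut buf = [0u8; FIXED_HEADER_LEN];
    buf[0] = h.magic;
    buf[1..3].copy_from_slice(&h.len.to_be_bytes());
    buf[3] = h.signal_type;
    buf[4..12].copy_from_slice(&h.item_id.to_be_bytes());
    buf[12..20].copy_from_slice(&h.user_id.to_be_bytes());
    buf[20..28].copy_from_slice(&h.ts_ns.to_be_bytes());
    buf[28..32].copy_from_slice(&h.weight.to_be_bytes());
    buf[32..34].copy_from_slice(&h.ctx_len.to_be_bytes());
    buf
}
```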

Context Field Schema

The context field carries signal-type-specific attribution data:

| Signal Type | Context Fields | Purpose |
|---|---|---|
| view | source_surface, position_in_feed | Attribution |
| search_click | query, rank_at_click | Relevance training |
| skip | dwell_ms, source | Quality/format signal |
| completion | total_duration_ms, completed_duration_ms | Precision |
| share | platform, share_type | Virality analysis |
| dwell_time | total_ms, active_ms | Engagement depth |

Context is not indexed or aggregated. It is stored for offline analysis, model training, and debugging. It is never read on the ranking hot path.


11. Signal Types Reference

All signal types from USE_CASES.md Appendix C, grouped by category with recommended configuration.

Positive Engagement Signals

| Signal | Type | Decay | Windows | Velocity | Primary Use |
|---|---|---|---|---|---|
| view | count | Exp 7d | 1h, 24h, 7d, 30d, all | Yes | Baseline reach |
| unique_view | count | Exp 7d | 1h, 24h, 7d, all | Yes | Deduplicated reach |
| like | count | Exp 7d | 1h, 24h, 7d, all | Yes | Positive sentiment |
| share | count | Exp 3d | 1h, 24h, 7d | Yes | Virality |
| repost | count | Exp 3d | 1h, 24h, 7d | Yes | Amplification |
| quote | count | Exp 3d | 1h, 24h, 7d | Yes | Engaged resharing |
| comment | count | Exp 3d | 1h, 24h, 7d, all | Yes | Discussion |
| reply | count | Exp 3d | 24h, 7d | No | Discussion depth |
| upvote | count | Exp 3d | 1h, 24h, 7d, all | Yes | Forum positive |
| save | count | Exp 7d | 24h, 7d, all | No | Return intent |
| pin | count | Exp 7d | 24h, 7d, all | No | Curation |
| collection_add | count | Exp 7d | 24h, 7d, all | No | Curation |
| download | count | Exp 7d | 24h, 7d, all | No | High-intent |
| screenshot | count | Exp 7d | 24h, 7d | No | Save intent |
| outbound_click | count | Exp 3d | 24h, 7d | No | Link engagement |
| replay | count | Exp 3d | 24h, 7d | No | Exceptional content |
| award_given | count | Permanent | all | No | Community endorsement |

Negative Engagement Signals

| Signal | Type | Decay | Windows | Velocity | Primary Use |
|---|---|---|---|---|---|
| skip | count | Exp 1d | 1h, 24h | No | Quality negative |
| skip_intro | bool | Exp 1d | -- | No | Format preference |
| hide | bool | Permanent | -- | No | Hard item negative |
| not_interested | bool | Permanent | -- | No | Hard topic negative |
| dislike | count | Exp 7d | 1h, 24h, 7d, all | Yes | Explicit negative |
| downvote | count | Exp 3d | 1h, 24h, 7d, all | Yes | Forum negative |
| report | count | Permanent | all | No | Moderation flag |

Quality Signals

| Signal | Type | Decay | Windows | Velocity | Primary Use |
|---|---|---|---|---|---|
| completion | ratio 0-1 | Exp 30d | all | No | Content quality |
| partial_completion | float | Exp 7d | -- | No | Continue watching |
| dwell_time | duration | Exp 3d | 24h, 7d | No | Engagement depth |
| impression | count | Exp 1d | 1h, 24h | No | Exposure tracking |

Relationship Signals

| Signal | Type | Decay | Windows | Velocity | Primary Use |
|---|---|---|---|---|---|
| follow | bool | Permanent | -- | No | User-creator edge |
| unfollow | event | Decays follow | -- | No | Edge removal |
| block | bool | Permanent | -- | No | Hard filter |
| mute | bool | Permanent | -- | No | Soft filter |
| interaction_weight | float | Exp 7d | -- | No | Relationship strength |

Recommendation Feedback Signals

| Signal | Type | Decay | Windows | Velocity | Primary Use |
|---|---|---|---|---|---|
| autoplay_accept | bool | Exp 3d | 24h | No | Rec quality |
| autoplay_reject | bool | Exp 1d | 24h | No | Rec failure |
| notification_open | bool | Exp 7d | 7d | No | Notification priority |
| notification_dismiss | bool | Exp 3d | 7d | No | Reduce push |
| reminder_set | bool | Exp 7d | -- | No | Intent for scheduled |
| search_click | count+rank | Exp 3d | 24h, 7d | No | Query relevance |
| search_impression | count | Exp 1d | 1h, 24h | No | Query exposure |

Signal Type Configuration Summary

| Category | Count | Typical Decay Range | Typical Windows |
|---|---|---|---|
| Positive engagement | 17 | 3d - 7d half-life | 1h, 24h, 7d, all |
| Negative engagement | 7 | 1d - permanent | 1h, 24h or none |
| Quality | 4 | 1d - 30d half-life | 24h, 7d, all |
| Relationship | 5 | 7d - permanent | None (state, not stream) |
| Recommendation feedback | 7 | 1d - 7d half-life | 24h, 7d |
| Total | 40 | | |

12. Performance Targets

These are the latency and throughput targets the signal system must meet. Regressions against these numbers are treated as bugs.

Write Path Targets

| Operation | Target | Measurement Point |
|---|---|---|
| Signal write (end-to-end, Batched durability) | < 100 us p50, < 500 us p99 | db.signal() return |
| WAL append (amortized fsync) | < 50 us p50 | WAL write + batch fsync |
| Hot-tier update (decay scores) | < 50 ns | 3 CAS operations |
| Warm-tier update (bucket increment) | < 20 ns | Atomic add |
| User preference vector shift | < 500 ns | 1536D vector arithmetic |
| Content-address dedup check | < 100 ns (bloom miss), < 50 us (bloom hit) | BLAKE3 hash + lookup |
| Sustained write throughput | > 50,000 events/sec | Single writer thread |

Read Path Targets

| Operation | Target | Measurement Point |
|---|---|---|
| Decay score read (per entity per lambda) | ~15 ns | 1 load + 1 exp() + 1 mul |
| 200-candidate scoring pass (decay only) | < 5 us | 200 * 15ns + overhead |
| Windowed count (1h, per entity) | < 200 ns | Sum 60 minute buckets |
| Windowed count (7d, per entity) | < 500 ns | Sum 168 hour buckets |
| Velocity computation (per entity) | < 500 ns | Windowed count / duration |
| Cohort-scoped velocity (single dimension) | < 2 us | Disk-backed bucket sum |
| Cohort-scoped velocity (composite, 2-dim) | < 5 us | Estimation arithmetic |
| Signal snapshot (all windows, 1 entity) | < 5 us | All counters + decay reads |

Background Materializer Targets

| Operation | Target | Measurement Point |
|---|---|---|
| Minute-bucket rotation (all active entities) | < 100 ms | Rotate + EWMA update |
| Hourly rollup generation | < 5 seconds | All active entities |
| Daily rollup generation | < 30 seconds | All entities with hourly data |
| Hot-tier checkpoint | < 2 seconds | Serialize + write to disk |
| Behavioral segment recomputation | < 60 seconds | All recently active users |

Crash Recovery Targets

| Operation | Target | Notes |
|---|---|---|
| WAL replay (cold start) | < 60 seconds | For 7 days of events at scale |
| Hot-tier restore from checkpoint | < 10 seconds | For 10M entities |
| Time to first query after crash | < 15 seconds | Serve from checkpoint, replay in background |

13. Invariants and Correctness Guarantees

These invariants must hold at all times. They are encoded as property tests, assertions, and crash recovery tests.

Signal Integrity Invariants

INV-SIG-1: No signal loss. Every signal event accepted by db.signal() (i.e., after Ok(()) is returned) is reflected in all aggregates after WAL replay completes. Formally: if signal(s) returns Ok(()) at time t, then for all t' > t + max_replay_time, all aggregate queries reflect s.

INV-SIG-2: Decay score monotonic decrease. In the absence of new signal events, a decay score monotonically decreases toward zero. Formally: if no events arrive for entity e signal s between times t1 and t2 where t2 > t1, then score(e, s, t2) <= score(e, s, t1).

INV-SIG-3: Decay score non-negative. Decay scores are always non-negative. score(e, s, t) >= 0.0 for all entities, signals, and times.

INV-SIG-4: Windowed count consistency. The windowed count for window w at time t equals the number of events in [t-w, t]. Formally: window_count(e, s, w, t) == |{event in events(e, s) : event.time in [t-w, t]}|. This is exact for counts maintained in the warm tier, and exact to within the rollup boundary granularity for counts composed from cold-tier rollups.

INV-SIG-5: Running score exactness. The running decay score matches the analytical sum to within floating-point epsilon. Formally: |running_score(e, s, t) - SUM_i[w_i * exp(-lambda * (t - t_i))]| < epsilon where epsilon = n * 2^-52 * max_score and n is the number of events.

INV-SIG-6: Deduplication idempotency. Writing the same signal event twice produces the same state as writing it once. Formally: state(write(s) ; write(s)) == state(write(s)).

Crash Recovery Invariants

INV-CR-1: WAL completeness. After crash recovery, the WAL contains all events that were acknowledged to the caller (events for which db.signal() returned Ok(())). Events in the WAL but not yet processed are replayed.

INV-CR-2: Checkpoint consistency. The hot-tier checkpoint, when restored and replayed from the checkpoint's WAL position, produces state identical to the pre-crash state (modulo lazy-decay time differences, which are corrected at read time).

INV-CR-3: No phantom state. After crash recovery, no aggregate reflects an event that was not durably committed to the WAL. There are no phantom signal counts.

Concurrency Invariants

INV-CON-1: Lock-free reads. Ranking queries never acquire a mutex. They read atomic values and apply lazy decay. A concurrent signal write may cause a ranking query to see either the pre-update or post-update state, but never a torn or invalid state.

INV-CON-2: CAS correctness. Under concurrent signal writes to the same entity, every event's weight is reflected in the running score. The CAS retry loop ensures that concurrent updates are serialized without loss. Formally: if write(w1) and write(w2) execute concurrently, the final score equals the score that would result from either sequential ordering w1;w2 or w2;w1.

INV-CON-3: Bucket atomicity. Atomic increment of bucket counters ensures that concurrent writes to the same minute bucket are correctly accumulated. No count is lost.
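The CAS retry loop referenced in INV-CON-2 can be sketched as follows. This is a simplified illustration that stores an `f64` score as bits in an `AtomicU64` and adds a weight without decay; the function name `add_weight` is hypothetical, and the real hot-tier update also applies lazy decay before adding.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Adds `weight` to an f64 score stored as raw bits in an AtomicU64.
/// Retries until the compare-exchange succeeds, so no concurrent update is lost.
fn add_weight(score_bits: &AtomicU64, weight: f64) -> f64 {
    let mut current = score_bits.load(Ordering::Relaxed);
    loop {
        let updated = f64::from_bits(current) + weight;
        match score_bits.compare_exchange_weak(
            current,
            updated.to_bits(),
            Ordering::AcqRel,
            Ordering::Relaxed,
        ) {
            Ok(_) => return updated,
            // Another writer won the race; retry against the value it wrote.
            Err(observed) => current = observed,
        }
    }
}
```

Readers load the same atomic and call `f64::from_bits`, so they see either the pre-update or post-update score and never a torn value, which is the property INV-CON-1 relies on.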

Property Tests

The following properties must be verified with proptest:

use proptest::prelude::*;

// P1: Decay scores decrease monotonically without new events.
proptest! {
    #[test]
    fn decay_monotonic_decrease(
        initial_score in 0.0f64..1e12,
        lambda in 1e-7f64..1e-3,
        dt_secs in 1.0f64..1e7,
    ) {
        let decayed = initial_score * (-lambda * dt_secs).exp();
        prop_assert!(decayed <= initial_score);
        prop_assert!(decayed >= 0.0);
    }
}

// P2: Running score matches analytical sum.
proptest! {
    #[test]
    fn running_score_matches_analytical(
        events in prop::collection::vec((0.1f64..10.0, 1u64..1_000_000), 1..100),
        lambda in 1e-7f64..1e-3,
    ) {
        // The running update assumes non-decreasing timestamps. Sort first
        // so that `time - last_time` cannot underflow on generated input.
        let mut events = events;
        events.sort_by_key(|&(_, time)| time);

        let mut running = 0.0f64;
        let mut last_time = 0u64;
        let query_time = events.iter().map(|e| e.1).max().unwrap_or(0) + 1000;

        // Compute running score
        for &(weight, time) in &events {
            let dt = (time - last_time) as f64;
            running = running * (-lambda * dt).exp() + weight;
            last_time = time;
        }
        let final_running = running * (-lambda * (query_time - last_time) as f64).exp();

        // Compute analytical sum
        let analytical: f64 = events.iter()
            .map(|&(w, t)| w * (-lambda * (query_time - t) as f64).exp())
            .sum();

        let relative_error = (final_running - analytical).abs() / analytical.max(1e-15);
        prop_assert!(relative_error < 1e-10,
            "running={}, analytical={}, error={}", final_running, analytical, relative_error);
    }
}

// P3: Windowed count equals event count in window.
proptest! {
    #[test]
    fn windowed_count_matches_events(
        event_times in prop::collection::vec(0u64..86400, 1..1000),
        window_secs in 60u64..86400,
        query_time in 0u64..172800,
    ) {
        // Count events in the half-open window (query_time - window_secs, query_time].
        // Comparing ages avoids clamping the lower bound to 0, which would
        // wrongly exclude an event at t = 0 when query_time < window_secs.
        let expected = event_times.iter()
            .filter(|&&t| t <= query_time && query_time - t < window_secs)
            .count();

        // The warm-tier bucket count should match.
        // `warm_tier` is the fixture under test; it must have ingested
        // `event_times` before this point (setup is implementation-specific).
        let actual = warm_tier.windowed_count(window_secs, query_time);
        prop_assert_eq!(expected, actual);
    }
}

// P4: Out-of-order events produce same final score as in-order.
proptest! {
    #[test]
    fn out_of_order_events_commutative(
        events in prop::collection::vec((0.1f64..10.0, 1u64..1_000_000), 2..50),
        lambda in 1e-7f64..1e-3,
    ) {
        // The vec is never empty (size range 2..50); unwrap_or avoids unwrap().
        let query_time = events.iter().map(|e| e.1).max().unwrap_or(0) + 1000;

        // Apply events in the generated (arbitrary) order
        let score_ordered = apply_events_and_query(&events, lambda, query_time);

        // Apply the same events in reverse-timestamp order, so every event
        // after the first arrives late
        let mut shuffled = events.clone();
        shuffled.sort_by_key(|e| std::cmp::Reverse(e.1));
        let score_shuffled = apply_events_and_query(&shuffled, lambda, query_time);

        let relative_error = (score_ordered - score_shuffled).abs()
            / score_ordered.max(1e-15);
        prop_assert!(relative_error < 1e-10);
    }
}

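// P5: Dedup produces idempotent state.
proptest! {
    #[test]
    fn dedup_idempotent(
        event in arb_signal_event(),
    ) {
        // Both calls are assumed to target the same store, with `apply_signal`
        // returning the store's state after the call. The second call replays
        // the identical event and must be dropped by dedup.
        let state_once = apply_signal(&event);
        let state_twice = apply_signal(&event); // same event again
        prop_assert_eq!(state_once, state_twice);
    }
}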

// P6: WAL replay produces same state as uninterrupted execution.
proptest! {
    #[test]
    fn wal_replay_consistency(
        events in prop::collection::vec(arb_signal_event(), 1..500),
        crash_point in 0usize..500,
    ) {
        // `crash_point` is drawn independently of `events.len()`; clamp it so
        // the crash always lands inside the generated event sequence.
        let crash_point = crash_point.min(events.len());

        // Execute all events without crash
        let expected_state = execute_all(&events);

        // Execute up to crash_point, then "crash" and replay from WAL
        let (wal, partial_state) = execute_with_crash(&events, crash_point);
        let recovered_state = replay_from_wal(wal, partial_state);

        prop_assert_eq!(expected_state, recovered_state);
    }
}
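
P4 calls a helper, `apply_events_and_query`, that this section does not define. One possible shape is sketched below under stated assumptions: events are `(weight, timestamp_secs)` pairs, the state is a single running score plus the newest timestamp seen, and a late event is folded in by pre-decaying its weight to the current reference time. The `Running` struct is hypothetical, and the real write path may handle late arrivals differently.

```rust
/// Hypothetical running-score state for one entity and one signal type.
struct Running {
    /// Decayed score as of `last_time`.
    score: f64,
    /// Timestamp (seconds) of the newest event applied so far.
    last_time: u64,
}

/// Apply `events` in the order given, then read the score at `query_time`.
pub fn apply_events_and_query(events: &[(f64, u64)], lambda: f64, query_time: u64) -> f64 {
    let mut st = Running { score: 0.0, last_time: 0 };
    for &(weight, time) in events {
        if time >= st.last_time {
            // In-order event: decay the score forward to `time`, then add.
            let dt = (time - st.last_time) as f64;
            st.score = st.score * (-lambda * dt).exp() + weight;
            st.last_time = time;
        } else {
            // Late event: its weight has already decayed for `age` seconds
            // relative to the reference time, so add the decayed weight.
            let age = (st.last_time - time) as f64;
            st.score += weight * (-lambda * age).exp();
        }
    }
    // Lazy decay from the last update to the read time.
    let dt = query_time.saturating_sub(st.last_time) as f64;
    st.score * (-lambda * dt).exp()
}
```

Both branches leave `score` equal to the sum of `w * exp(-lambda * (last_time - t))` over the events applied so far, which is why arrival order affects the result only through floating-point rounding.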

Appendix A: Glossary

| Term | Definition |
|------|------------|
| Signal | A typed, timestamped engagement event (view, like, skip, etc.) |
| Signal Ledger | The per-entity aggregation of all signals targeting that entity |
| Decay Score | The running exponential decay aggregate: recent events weighted more heavily |
| Lambda | The decay rate constant: ln(2) / half_life |
| Velocity | The rate of signal events per unit time within a window |
| Relative Velocity | Ratio of short-window to long-window velocity (acceleration) |
| SWAG | Sliding Window Aggregation -- O(1) amortized algorithm for windowed aggregate maintenance |
| Scotty Slicing | Stream-slicing approach where partial aggregates per time bucket are shared across windows |
| Cohort | A group of users sharing a common attribute (region, age, behavioral segment) |
| Dimensional Rollup | Per-dimension pre-aggregated counters for cohort-scoped queries |
| Hot Tier | In-memory, cache-line-aligned signal state for sub-microsecond reads |
| Warm Tier | In-memory bucketed counters for active entities, supporting windowed aggregation |
| Cold Tier | On-disk raw events and rollups for durability and historical queries |
| Running Score | The incrementally maintained decay score: S(t) = S(prev) * exp(-lambda * dt) + w |
| Forward Decay | The mathematical model (Cormode et al.) proving the running score formula is exact |
| Jacobs Trick | Log-space reformulation that eliminates read-time computation for ranking-only queries |
| Group Commit | Batching fsync calls to amortize durability cost across multiple writes |
| Content-Addressed | Identifying events by BLAKE3 hash of content for automatic deduplication |
| EWMA | Exponentially Weighted Moving Average for smoothing noisy velocity signals |

Appendix B: References

  1. Cormode, G., Shkapenyuk, V., Srivastava, D., Xu, B. "Forward Decay: A Practical Time Decay Model for Streaming Systems." ICDE 2009.
  2. Tangwongsan, K., Hirzel, M., Schneider, S. "General Incremental Sliding-Window Aggregation." PVLDB 2015.
  3. Traub, J., Grulich, P., Cuevas, A., et al. "Scotty: General and Efficient Open-Source Window Aggregation." EDBT 2019 (Best Paper).
  4. Jacobs, J. "Exponentially Decaying Sums With a Twist." 2023.
  5. Miller, E. "How Not To Sort By Average Rating." 2009.
  6. TimescaleDB Documentation. "Continuous Aggregates." 2024.
  7. Flajolet, P., Fusy, E., Gandouet, O., Meunier, F. "HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm." DMTCS 2007.