# Signal System Specification
**Status:** Draft
**Authors:** tidalDB Engineering
**Date:** 2026-02-20
**Depends on:** WAL subsystem, Entity Store, Schema Engine
**Research:** `docs/research/tidaldb_signal_ledger.md`
---
## Table of Contents
1. [Overview](#1-overview)
2. [Signal Type Declaration](#2-signal-type-declaration)
3. [Signal Ledger (Per-Entity)](#3-signal-ledger-per-entity)
4. [Decay Computation](#4-decay-computation)
5. [Velocity Computation](#5-velocity-computation)
6. [Windowed Aggregation](#6-windowed-aggregation)
7. [Cohort-Scoped Signal Aggregation](#7-cohort-scoped-signal-aggregation)
8. [Signal Write Path](#8-signal-write-path)
9. [Background Materializer](#9-background-materializer)
10. [Signal Event Format](#10-signal-event-format)
11. [Signal Types Reference](#11-signal-types-reference)
12. [Performance Targets](#12-performance-targets)
13. [Invariants and Correctness Guarantees](#13-invariants-and-correctness-guarantees)
---
## 1. Overview
The signal system is the temporal event backbone of tidalDB. Every engagement event -- a view, a like, a skip, a share -- flows through the signal system and updates the state that ranking queries consume. The system must sustain thousands of signal writes per second while serving sub-millisecond aggregate reads across hundreds of candidate entities.
Signals are not fields. They are typed, timestamped streams with native temporal semantics: decay, velocity, and windowed aggregation are computed by the database, not by the application. The application writes `SIGNAL view item:@id user:@uid`. The ranking profile references `view.velocity(24h)`. No application code touches temporal math.
### Design Principles
1. **WAL-first durability.** Every signal event is durably logged before any processing occurs. The signal aggregation system can crash, restart, and replay from the WAL. Signals cannot be lost.
2. **O(1) running scores.** Decay scores are maintained as running accumulators updated on each write, not recomputed by scanning raw events. Read cost is one `exp()` call per entity per decay rate.
3. **Immutable events, mutable aggregates.** Signal events are immutable facts. Aggregates are derived state that can always be recomputed from events.
4. **Lock-free hot path.** Signal counters and decay scores use atomic operations. A signal write never blocks a ranking query. A ranking query never blocks a signal write.
5. **Cohort aggregation as a first-class primitive.** Not just "this item has 50k views in 24h" but "this item has 50k views in 24h among US users aged 18-24 who like jazz."
---
## 2. Signal Type Declaration
Signal types are declared in schema before signal events can be written. A signal declaration specifies: what the signal is called, what entity type it targets, how it decays, what windows it maintains, and whether velocity is computed.
### Schema Definition
```rust
db.define_signal(SignalDef {
name: "view",
target: EntityKind::Item,
decay: Decay::Exponential { half_life: Duration::days(7) },
windows: vec![
Window::hours(1),
Window::hours(24),
Window::days(7),
Window::days(30),
Window::all_time(),
],
velocity: true,
})?;
```
### Signal Definition Fields
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `name` | `&str` | Yes | Unique signal identifier. Lowercase alphanumeric plus underscores. |
| `target` | `EntityKind` | Yes | Which entity type this signal targets: `Item`, `User`, or `Creator`. |
| `decay` | `Decay` | Yes | How signal weight diminishes over time. |
| `windows` | `Vec<Window>` | Yes | Time windows for which aggregates are maintained. May be empty (e.g., `hide`). |
| `velocity` | `bool` | Yes | Whether to compute rate-of-change per window. |
### Decay Types
```rust
pub enum Decay {
/// Signal weight halves every `half_life` duration.
/// Formula: w(t) = w_0 * exp(-lambda * t), lambda = ln(2) / half_life
Exponential { half_life: Duration },
/// Signal weight drops linearly to zero over `lifetime`.
/// Formula: w(t) = w_0 * max(0, 1 - t / lifetime)
Linear { lifetime: Duration },
/// Signal weight never decays. For permanent state: hides, blocks, follows.
Permanent,
}
```
**Lambda precomputation.** For exponential decay, `lambda` is computed once at schema definition time and stored alongside the signal definition:
```
lambda = ln(2) / half_life_seconds
```
| Half-Life | Lambda (s^-1) | Interpretation |
|-----------|--------------|----------------|
| 1 hour | 1.925e-4 | Fast decay. Impressions, skips. Signal is negligible after ~7 hours. |
| 24 hours | 8.022e-6 | Medium decay. Shares, comments. Signal halves daily. |
| 7 days | 1.146e-6 | Slow decay. Views, likes. Signal persists for weeks. |
| 30 days | 2.674e-7 | Very slow decay. Completions, saves. Signal persists for months. |
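The table values can be reproduced with a few lines of Rust. This is a minimal sketch; the function names `lambda_from_half_life` and `remaining_fraction` are illustrative and not part of the tidalDB API.

```rust
use std::f64::consts::LN_2;

/// Decay constant (per second) for a half-life given in seconds.
/// Hypothetical helper mirroring lambda = ln(2) / half_life_seconds.
pub fn lambda_from_half_life(half_life_secs: f64) -> f64 {
    LN_2 / half_life_secs
}

/// Fraction of an event's original weight that remains after `age_secs`.
pub fn remaining_fraction(lambda: f64, age_secs: f64) -> f64 {
    (-lambda * age_secs).exp()
}
```

After seven half-lives `remaining_fraction` returns 2^-7, a little under 1%, which is the basis for calling a 1-hour half-life signal negligible after ~7 hours.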
### Window Definitions
```rust
pub enum Window {
/// Fixed-duration sliding window.
Sliding { duration: Duration },
/// Unbounded accumulator -- all events since entity creation.
AllTime,
}
impl Window {
pub fn hours(n: u64) -> Self { Window::Sliding { duration: Duration::hours(n) } }
pub fn days(n: u64) -> Self { Window::Sliding { duration: Duration::days(n) } }
pub fn all_time() -> Self { Window::AllTime }
}
```
Windows define the time boundaries for count/sum aggregation. A signal with `windows: [hours(1), hours(24), days(7), all_time()]` maintains four independent aggregates. Each window answers "how many/how much of this signal occurred within the last N?"
### Velocity Declaration
When `velocity: true`, the system computes the rate of change of the signal count within each declared window. Velocity answers "is this signal accelerating or decelerating?" -- the foundation of trending and rising detection.
Velocity is computed per window. `view.velocity(1h)` measures short-term acceleration. `view.velocity(24h)` measures daily trend. These are different signals with different noise characteristics, and ranking profiles choose which to reference.
### Schema Validation Rules
1. Signal names must be unique within a target entity type.
2. `Permanent` decay signals must have `velocity: false` (rate of change is meaningless for permanent state).
3. Windows must be non-empty unless the signal is boolean/permanent (e.g., `hide`, `block`).
4. `all_time()` windows do not support velocity (no bounded window to measure rate over).
5. Maximum 8 windows per signal type (bounded by the hot-tier struct layout).
6. Maximum 64 signal types per entity type (bounded by storage layout).
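The per-definition rules above could be checked roughly as follows. This is a sketch under simplifying assumptions: the `Decay`, `Window`, and `SignalDef` types here are cut-down stand-ins (durations omitted), `SchemaError` and `validate` are hypothetical names, and rules 1 and 6 are left out because they need the full schema registry. Rule 4 is not an error condition; it only means no velocity is computed for `all_time()`.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Decay { Exponential, Linear, Permanent } // simplified: durations omitted

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Window { Sliding { secs: u64 }, AllTime }

#[derive(Debug, PartialEq)]
pub enum SchemaError { InvalidName, PermanentWithVelocity, EmptyWindows, TooManyWindows }

pub struct SignalDef<'a> {
    pub name: &'a str,
    pub decay: Decay,
    pub windows: Vec<Window>,
    pub velocity: bool,
}

/// Checks the rules that can be evaluated on a single definition.
pub fn validate(def: &SignalDef<'_>) -> Result<(), SchemaError> {
    // Names: lowercase alphanumeric plus underscores.
    let name_ok = !def.name.is_empty()
        && def.name.chars().all(|c| c.is_ascii_lowercase() || c.is_ascii_digit() || c == '_');
    if !name_ok {
        return Err(SchemaError::InvalidName);
    }
    // Rule 2: permanent signals have no meaningful rate of change.
    if def.decay == Decay::Permanent && def.velocity {
        return Err(SchemaError::PermanentWithVelocity);
    }
    // Rule 3: only permanent signals may declare no windows.
    if def.windows.is_empty() && def.decay != Decay::Permanent {
        return Err(SchemaError::EmptyWindows);
    }
    // Rule 5: at most 8 windows per signal type.
    if def.windows.len() > 8 {
        return Err(SchemaError::TooManyWindows);
    }
    Ok(())
}
```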
---
## 3. Signal Ledger (Per-Entity)
Every entity in tidalDB has a signal ledger: the complete temporal state of all signals targeting that entity. The ledger is implemented as a three-tier hybrid, following the architecture validated in the research document.
### Three-Tier Architecture
```
                    +---------------------------+
Ranking queries     | HOT TIER (Memory)         |  64 bytes per signal type per entity
read from here      | Running decay scores      |  10M entities = 1.9-3.8 GB
(sub-microsecond)   | Atomic counters           |  (3-6 signal types)
                    | Last-update timestamp     |
                    +---------------------------+
                                  |
                    +---------------------------+
Windowed queries    | WARM TIER (Memory)        |  Per-minute and per-hour bucket counters
merge from here     | Time-bucketed counters    |  500K active entities = ~5.4 GB
(microseconds)      | Recent event buffer       |
                    | SWAG stacks               |
                    +---------------------------+
                                  |
                    +---------------------------+
Replay, ad-hoc,     | COLD TIER (Disk)          |  Raw events: 7 days retention
backfill from       | Raw signal events (WAL)   |  Rollups: 30 days hourly,
here                | Hourly rollups            |           daily indefinitely
                    | Daily rollups             |  Total: ~460 GB at scale
                    +---------------------------+
```
### Hot Tier: Per-Entity Signal State
The hot tier is the structure touched on every ranking query. It must be cache-line aligned, lock-free, and as compact as possible.
**Memory Layout:**
```
Offset  Size  Field
------  ----  -----------------------------------------------------
0       8     entity_id (u64)
8       8     last_update_ns (AtomicU64)
16      2     signal_type_id (u16)
18      2     flags (u16)
20      4     _pad0
24      24    decay_scores[0..3] (3 x AtomicU64 holding f64 bits)
48      16    _pad1
Total: 64 bytes per signal type per entity (one cache line)
```
```rust
/// Hot-path signal state for a single signal type on a single entity.
/// One cache line. Touched on every ranking query involving this signal.
///
/// Contains running decay scores for up to 3 decay rates (matching the
/// common configuration of 1h, 24h, 7d half-lives) and the timestamp
/// of the last update for lazy decay application at read time.
#[repr(C, align(64))]
pub struct HotSignalState {
/// Entity this state belongs to.
entity_id: u64, // 8 bytes [0..8]
/// Nanosecond timestamp of the last signal write to this entity.
/// Used for lazy decay: score(now) = stored_score * exp(-lambda * (now - last_update)).
/// Stored as AtomicU64 for lock-free read/write.
last_update_ns: AtomicU64, // 8 bytes [8..16]
/// Signal type index (0..63) within this entity's signal set.
signal_type_id: u16, // 2 bytes [16..18]
/// Flags: bit 0 = velocity_enabled, bits 1-15 reserved.
flags: u16, // 2 bytes [18..20]
/// Padding to align decay_scores to 8-byte boundary.
_pad0: [u8; 4], // 4 bytes [20..24]
/// Running exponential decay scores. One per configured decay rate.
/// Updated atomically via CAS on f64 bit patterns.
/// Index 0: primary decay rate (from signal definition).
/// Index 1-2: additional rates if the signal participates in
/// multiple ranking profiles with different half-lives.
decay_scores: [AtomicU64; 3], // 24 bytes [24..48] (f64 via from_bits/to_bits)
/// Padding to fill cache line.
_pad1: [u8; 16], // 16 bytes [48..64]
}
// Compile-time check: the struct must occupy exactly one 64-byte cache line.
const _: () = assert!(std::mem::size_of::<HotSignalState>() == 64);
```
**Atomic access patterns:**
- **Signal write:** Load `last_update_ns` (Acquire), compute decayed score, CAS `decay_scores[i]` (AcqRel), store `last_update_ns` (Release).
- **Ranking read:** Load `last_update_ns` (Acquire), load `decay_scores[i]` (Acquire), apply lazy decay with `exp(-lambda * dt)`.
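- **Memory ordering rationale:** The writer publishes the score before it Release-stores `last_update_ns`, so a reader that Acquire-loads a timestamp is guaranteed to see a score at least as new as that timestamp. Without this ordering, a reader could see a new timestamp with an old score, producing an incorrect (under-decayed) value that also omits the newest event's weight. The opposite pairing (old timestamp, newer score) can still be observed between the CAS and the timestamp store; it transiently over-decays the score.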
**Memory budget:**
| Entity Count | Signal Types | Hot Tier Size |
|-------------|-------------|---------------|
| 1M | 6 | 384 MB |
| 10M | 6 | 3.84 GB |
| 10M | 3 | 1.92 GB |
For the 10M entity target, the hot tier consumes roughly 2-4 GB depending on signal type count. This is within the recommended `memory_budget` of 2-4 GB. Entities with no recent signals can be evicted to the warm or cold tier and loaded on demand (see "Hot/Cold Entity Tiering" below).
### Warm Tier: Bucketed Counters and SWAG Stacks
The warm tier maintains the data structures needed for windowed aggregation and velocity computation. It is in-memory but not cache-line-aligned -- it trades compactness for query flexibility.
```rust
/// Warm-tier signal state for windowed aggregation.
/// One instance per signal type per entity.
pub struct WarmSignalState {
/// Per-minute event count buckets for the last 60 minutes.
/// Used for 1h window. Shared across 24h, 7d via hierarchical rollup.
minute_buckets: [AtomicU32; 60], // 240 bytes
/// Per-hour event count buckets for the last 168 hours (7 days).
/// Used for 24h and 7d windows.
hour_buckets: [AtomicU32; 168], // 672 bytes
/// Weighted sum buckets (same granularity as count buckets).
/// For signals with non-unit weights (e.g., completion ratio).
minute_weight_sums: [AtomicU32; 60], // 240 bytes (f32 via bits)
hour_weight_sums: [AtomicU32; 168], // 672 bytes (f32 via bits)
/// Current bucket index (minute of the hour for minute_buckets).
current_minute: AtomicU8, // 1 byte
/// Current bucket index (hour of the week for hour_buckets).
current_hour: AtomicU8, // 1 byte
/// All-time counters.
all_time_count: AtomicU64, // 8 bytes
all_time_weighted_sum: AtomicU64, // 8 bytes (f64 via bits)
/// SWAG Two-Stacks state for O(1) amortized windowed aggregation.
/// One pair of stacks per active window.
swag_stacks: Vec<SwagState>, // heap-allocated, per window
}
// ~1.8 KB per signal type per entity
// 10M entities * 6 signal types * 1.8 KB = ~108 GB -- TOO LARGE
```
**Critical sizing decision.** At 1.8 KB per signal per entity, the warm tier for 10M entities with 6 signal types would consume ~108 GB. This is infeasible. The warm tier must be **sparse**: only entities with recent activity maintain warm-tier state. The vast majority of entities (>95%) have no signals in the last hour and need only the hot-tier running scores.
**Revised warm tier: active-entity-only.**
```rust
/// Warm tier is a concurrent hash map keyed by (entity_id, signal_type_id).
/// Only entities with signal activity in the last 7 days have entries.
/// Evicted to cold tier on inactivity.
type WarmTier = DashMap<(EntityId, SignalTypeId), WarmSignalState>;
```
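At a 5% active rate (500K of 10M entities with recent activity), the warm tier is 500K * 6 * 1.8 KB = ~5.4 GB. Together with a hot tier at the low end of its range (~1.9 GB), this fits within an 8 GB total memory budget; a 6-signal hot tier (3.84 GB) would push the combined footprint to ~9.2 GB.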
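The sizing arithmetic for the two in-memory tiers can be captured in a short sketch. The constants come from the struct layouts above; the function names are illustrative, and sizes use decimal units (1 GB = 10^9 bytes) to match the tables.

```rust
/// Bytes per hot-tier entry: one cache line per signal type per entity.
pub const HOT_ENTRY_BYTES: u64 = 64;
/// Approximate bytes per warm-tier entry (the ~1.8 KB estimate above).
pub const WARM_ENTRY_BYTES: u64 = 1_800;

/// Hot tier: every entity holds one entry per signal type.
pub fn hot_tier_bytes(entities: u64, signal_types: u64) -> u64 {
    entities * signal_types * HOT_ENTRY_BYTES
}

/// Warm tier: only the active fraction of entities holds entries.
pub fn warm_tier_bytes(entities: u64, signal_types: u64, active_fraction: f64) -> u64 {
    let active = (entities as f64 * active_fraction).round() as u64;
    active * signal_types * WARM_ENTRY_BYTES
}
```

With 10M entities and 6 signal types, a dense warm tier (`active_fraction = 1.0`) comes to 108 GB, while a 5% active rate brings it down to 5.4 GB.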
**Eviction policy:** Warm-tier entries with no signal writes in the last `2 * max_window_duration` are evicted. Their bucketed state is rolled up into the cold tier before eviction.
### Cold Tier: Durable Storage
The cold tier is on disk. It stores raw signal events and pre-computed rollups.
**Column families (or keyspaces):**
```
CF "signal_events" FIFO compaction, 7-day TTL
Key: [entity_id: u64 BE][timestamp_ns: u64 BE][signal_type: u8]
Value: [user_id: u64][weight: f32][context_len: u16][context: bytes]
Prefix bloom filter on first 8 bytes (entity_id)
CF "hourly_rollups" Leveled compaction, 30-day TTL
Key: [entity_id: u64 BE][signal_type: u8][hour_bucket: u32 BE]
Value: HourlyRollup (see below)
CF "daily_rollups" Leveled compaction, no TTL
Key: [entity_id: u64 BE][signal_type: u8][day_bucket: u16 BE]
Value: DailyRollup (see below)
CF "entity_signal_state" Leveled compaction, no TTL
Key: [entity_id: u64 BE]
Value: Serialized hot-tier state (for crash recovery checkpoint)
```
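The big-endian key layout matters because byte-wise key comparison then matches numeric order, so all events for one entity are contiguous and time-sorted. A minimal sketch of the `signal_events` key encoding, with hypothetical helper names:

```rust
/// Encode a `signal_events` key: entity_id (BE) | timestamp_ns (BE) | signal_type.
pub fn signal_event_key(entity_id: u64, timestamp_ns: u64, signal_type: u8) -> [u8; 17] {
    let mut key = [0u8; 17];
    key[0..8].copy_from_slice(&entity_id.to_be_bytes());
    key[8..16].copy_from_slice(&timestamp_ns.to_be_bytes());
    key[16] = signal_type;
    key
}

/// The 8-byte entity_id prefix that a prefix filter or prefix scan would use.
pub fn entity_prefix(key: &[u8; 17]) -> [u8; 8] {
    let mut prefix = [0u8; 8];
    prefix.copy_from_slice(&key[0..8]);
    prefix
}
```

A little-endian encoding would break this property: the bytes of timestamp 500 compare greater than the bytes of timestamp 70,000 in little-endian form.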
**Rollup record formats:**
```rust
/// Composable hourly aggregate. Never store averages -- store sum + count.
struct HourlyRollup {
total_count: u32,
weighted_sum: f32,
unique_users: u32, // HyperLogLog sketch cardinality
max_weight: f32,
min_weight: f32,
} // 20 bytes
/// Composable daily aggregate. Computed from hourly rollups, not raw events.
struct DailyRollup {
total_count: u64,
weighted_sum: f64,
unique_users: u64, // HyperLogLog union
hourly_peak_count: u32, // max count in any single hour
_pad: u32,
} // 32 bytes
```
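The "store sum + count" rule is what makes rollups composable. The sketch below folds hourly aggregates into a daily one using simplified stand-in structs (`HourlyAgg`, `DailyAgg`) and hypothetical function names. The `unique_users` field is deliberately omitted: distinct counts cannot be added together, so composing them would require merging the underlying sketches rather than their cardinalities.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct HourlyAgg {
    pub total_count: u32,
    pub weighted_sum: f32,
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub struct DailyAgg {
    pub total_count: u64,
    pub weighted_sum: f64,
    pub hourly_peak_count: u32,
}

/// Fold hourly aggregates into one daily aggregate without touching raw events.
pub fn roll_up_day(hours: &[HourlyAgg]) -> DailyAgg {
    let mut day = DailyAgg { total_count: 0, weighted_sum: 0.0, hourly_peak_count: 0 };
    for h in hours {
        day.total_count += u64::from(h.total_count);
        day.weighted_sum += f64::from(h.weighted_sum);
        day.hourly_peak_count = day.hourly_peak_count.max(h.total_count);
    }
    day
}

/// The mean is derived at read time from sum and count, never stored.
pub fn mean_weight(day: &DailyAgg) -> Option<f64> {
    if day.total_count == 0 {
        None
    } else {
        Some(day.weighted_sum / day.total_count as f64)
    }
}
```

For two hours with (count 10, sum 10.0) and (count 30, sum 15.0), the derived daily mean is 25/40 = 0.625, whereas averaging the two hourly means would give 0.75.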
### Storage Cost Analysis
For the reference workload (10M entities, 50 events/day average, 40+ signal types in schema but ~6 active per entity):
| Component | Storage Size | Write Amplification | Retention |
|-----------|-------------|---------------------|-----------|
| Raw signal events | 224 GB | 2x (FIFO) | 7 days |
| Hourly rollups | 231 GB | ~15x (leveled) | 30 days |
| Daily rollups | Growing 320 MB/day | ~15x (leveled) | Indefinite |
| Hot-tier checkpoint | ~3.8 GB | Periodic | Latest only |
| **Total** | **~460 GB** | **Blended ~6x** | |
### Hot/Cold Entity Tiering
Not all 10M entities need hot-tier state in memory at all times. An entity that received its last signal 3 months ago does not need a 64-byte cache-line-aligned struct consuming L1 capacity.
**Tiering policy:**
| Activity Level | Tier | Read Latency | Eviction Rule |
|---------------|------|-------------|---------------|
| Signal in last 1h | Hot (memory, aligned) | ~15 ns | N/A |
| Signal in last 7d | Warm (memory, unaligned) | ~100 ns | No activity for 2x max window |
| Signal older than 7d | Cold (disk) | ~50 us | Loaded on demand |
On a cold-tier read miss, the entity's checkpoint is loaded from `entity_signal_state` CF, promoted to hot tier, and lazy-decayed to current time. The cold read adds ~50 us latency for that single entity, amortized over future queries.
---
## 4. Decay Computation
### The Running Score Formula
Exponential decay scores are maintained as running accumulators. The formula is mathematically exact (not an approximation), proven by the Forward Decay model (Cormode et al., ICDE 2009) and independently described by Jules Jacobs.
**Definition.** Given a stream of signal events with weights `w_1, w_2, ..., w_n` arriving at times `t_1, t_2, ..., t_n`, the exponential decay score at time `t` is:
```
S(t) = SUM_i [ w_i * exp(-lambda * (t - t_i)) ]
```
**Incremental update.** When a new event with weight `w` arrives at time `t_new`:
```
S(t_new) = S(t_prev) * exp(-lambda * (t_new - t_prev)) + w
```
**Proof of exactness.** If `S(t_prev) = SUM_i [ w_i * exp(-lambda * (t_prev - t_i)) ]` for all events up to `t_prev`, then multiplying by `exp(-lambda * (t_new - t_prev))` shifts every event's decay to be relative to `t_new`, and adding `w` incorporates the new event with zero age. The result is exactly `SUM_i [ w_i * exp(-lambda * (t_new - t_i)) ]` for all events including the new one.
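The exactness claim is easy to check numerically. The following sketch (plain `f64`, single-threaded, function names are illustrative) evaluates the definition directly and via the incremental update, for events given as `(time_secs, weight)` pairs sorted by time.

```rust
/// Direct evaluation: S(t) = sum_i w_i * exp(-lambda * (t - t_i)).
pub fn direct_score(events: &[(f64, f64)], lambda: f64, t: f64) -> f64 {
    events
        .iter()
        .map(|&(t_i, w_i)| w_i * (-lambda * (t - t_i)).exp())
        .sum()
}

/// Running evaluation: one incremental update per event (events sorted by
/// time), followed by a lazy decay from the last event to `t`.
pub fn running_score(events: &[(f64, f64)], lambda: f64, t: f64) -> f64 {
    let mut score = 0.0;
    let mut last = 0.0;
    for &(t_i, w_i) in events {
        score = score * (-lambda * (t_i - last)).exp() + w_i;
        last = t_i;
    }
    score * (-lambda * (t - last)).exp()
}
```

The two functions agree up to floating-point rounding, while `running_score` keeps only two numbers of state per decay rate.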
### Write-Path Update
```rust
impl HotSignalState {
/// Update running decay scores on a new signal event.
///
/// Cost: K * exp() calls where K = number of configured decay rates.
/// At K=3: ~36ns on modern hardware (12ns per exp()).
pub fn on_signal(
&self,
weight: f64,
event_time_ns: u64,
lambdas: &[f64],
) {
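// Acquire: pairs with the Release store at the end of a previous write.
let prev_time = self.last_update_ns.load(Ordering::Acquire);
// NOTE: saturating_sub clamps an out-of-order event (event_time_ns < prev_time)
// to dt = 0, so its weight is added undecayed. See "Out-of-Order Events".
let dt = (event_time_ns.saturating_sub(prev_time)) as f64 / 1e9;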
for (i, &lambda) in lambdas.iter().enumerate().take(3) {
loop {
// Acquire: read current score.
let prev_bits = self.decay_scores[i].load(Ordering::Acquire);
let prev_score = f64::from_bits(prev_bits);
// Apply decay to previous score, then add new weight.
let new_score = prev_score * (-lambda * dt).exp() + weight;
let new_bits = new_score.to_bits();
// AcqRel CAS: if another writer updated between our load and
// this CAS, we retry with the newer value.
match self.decay_scores[i].compare_exchange_weak(
prev_bits,
new_bits,
Ordering::AcqRel,
Ordering::Acquire,
) {
Ok(_) => break,
Err(_) => continue, // Retry with updated value
}
}
}
// Release: make updated scores visible to ranking queries.
// Only advance timestamp if this event is newer than the last update.
if event_time_ns > prev_time {
self.last_update_ns.store(event_time_ns, Ordering::Release);
}
}
}
```
### Read-Path Query
```rust
impl HotSignalState {
/// Read the current decay score at query time.
///
/// Applies lazy decay from last_update to query_time.
/// Cost: 1 exp() + 1 multiply = ~15ns per entity per decay rate.
pub fn current_score(
&self,
decay_rate_idx: usize,
query_time_ns: u64,
lambda: f64,
) -> f64 {
// Acquire: ensures we see the score matching the timestamp.
let last_update = self.last_update_ns.load(Ordering::Acquire);
let stored_bits = self.decay_scores[decay_rate_idx].load(Ordering::Acquire);
let stored_score = f64::from_bits(stored_bits);
let dt = (query_time_ns.saturating_sub(last_update)) as f64 / 1e9;
stored_score * (-lambda * dt).exp()
}
}
```
### Out-of-Order Events
When an event arrives with `t_event < last_update_ns` (out-of-order delivery, late-arriving data):
```
score += weight * exp(-lambda * (last_update - t_event))
```
The weight is pre-decayed to reflect that the event is older than the current state. The `last_update_ns` timestamp is not changed because it already reflects a more recent time. The `on_signal` implementation above does not handle this case correctly: `saturating_sub` clamps the negative `dt` to 0, so the decay factor becomes `exp(0) = 1.0` and the late event's full weight is added undecayed. The corrected logic branches on event order:
```rust
// Correct out-of-order handling (replaces the `dt` computation in `on_signal`):
let dt_seconds = if event_time_ns >= prev_time {
    (event_time_ns - prev_time) as f64 / 1e9
} else {
    // Out-of-order: the stored score is already expressed at `prev_time`,
    // so leave it undecayed (dt = 0) and pre-decay the late weight instead:
    // new_score = prev_score + weight * exp(-lambda * late_by)
    let late_by = (prev_time - event_time_ns) as f64 / 1e9;
    for (i, &lambda) in lambdas.iter().enumerate().take(3) {
        let adjusted_weight = weight * (-lambda * late_by).exp();
        let mut prev_bits = self.decay_scores[i].load(Ordering::Acquire);
        loop {
            let new_bits = (f64::from_bits(prev_bits) + adjusted_weight).to_bits();
            match self.decay_scores[i].compare_exchange_weak(
                prev_bits, new_bits, Ordering::AcqRel, Ordering::Acquire,
            ) {
                Ok(_) => break,
                Err(actual) => prev_bits = actual, // another writer won; retry on its value
            }
        }
    }
    return; // Don't update last_update_ns
};
```
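As a sanity check on the pre-decay rule, here is a single-threaded sketch without atomics. `RunningScore` is a hypothetical stand-in for one decay slot of the hot-tier state; it shows that a late event yields the same score as if it had arrived in order.

```rust
/// Single-threaded stand-in for one decay slot (no atomics, seconds as f64).
pub struct RunningScore {
    pub score: f64,
    pub last_update: f64,
}

impl RunningScore {
    pub fn apply(&mut self, t_event: f64, weight: f64, lambda: f64) {
        if t_event >= self.last_update {
            // In-order: decay the stored score forward, then add the weight.
            self.score = self.score * (-lambda * (t_event - self.last_update)).exp() + weight;
            self.last_update = t_event;
        } else {
            // Out-of-order: pre-decay the weight; the timestamp stays put.
            self.score += weight * (-lambda * (self.last_update - t_event)).exp();
        }
    }
}
```

Feeding the events `(0, 1.0), (100, 2.0), (200, 3.0)` in order, or with the last two swapped, leaves the same `score` (up to rounding) and the same `last_update`.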
### The Jacobs Forward-Decay Trick
For **ranking-only queries** (where only relative ordering matters, not absolute scores), the running score can be reformulated to eliminate all read-time computation:
```
S(t) = exp(-lambda * t) * SUM_i [ w_i * exp(lambda * t_i) ]
```
The term `S_static = SUM_i [ w_i * exp(lambda * t_i) ]` changes only on writes. Since `exp(-lambda * t)` is the same for all entities at a given query time, relative ordering is determined by `S_static` alone.
**Overflow prevention.** `S_static` grows exponentially. After time `T`, the magnitude is approximately `exp(lambda * T)`. With a 1-hour half-life and `lambda = 1.925e-4`, after 1 year: `exp(1.925e-4 * 3.15e7) = exp(6063)` -- catastrophic overflow.
**Solution: log-space arithmetic.** Store `z = log(S_static)` instead. Update rule:
```
z_new = log(exp(z_prev) + w * exp(lambda * t_event))
= z_prev + log(1 + w * exp(lambda * t_event - z_prev))
```
The `log1p` function keeps this update numerically stable when the addend `w * exp(lambda * t_event - z_prev)` is small.
**Applicability.** Implement the Jacobs trick only for the primary ranking hot path where it eliminates the per-entity `exp()` call. Fall back to standard lazy-decay for queries that need absolute score values (e.g., `SignalSnapshot` in the response).
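A minimal sketch of the log-space accumulator follows. `LogScore` and its methods are hypothetical names. As a variation on the update rule written above, the sketch uses the max-shifted form of the same identity, `ln(e^a + e^b) = max + log1p(e^(min - max))`, so the exponent is never positive even when a new event's term is far larger than the stored `z`.

```rust
/// Log-space forward-decay accumulator: z = ln(sum_i w_i * exp(lambda * t_i)).
#[derive(Debug, Clone, Copy)]
pub struct LogScore {
    pub z: f64,
}

impl LogScore {
    /// Empty accumulator: ln(0) = -infinity.
    pub fn new() -> Self {
        LogScore { z: f64::NEG_INFINITY }
    }

    /// Fold in an event with weight `w > 0` at time `t_event` (seconds).
    pub fn add(&mut self, w: f64, t_event: f64, lambda: f64) {
        let x = w.ln() + lambda * t_event; // ln of the new term
        if self.z == f64::NEG_INFINITY {
            self.z = x;
        } else {
            // The exponent (lo - hi) is <= 0, so exp() cannot overflow.
            let (hi, lo) = if self.z >= x { (self.z, x) } else { (x, self.z) };
            self.z = hi + (lo - hi).exp().ln_1p();
        }
    }

    /// Absolute decayed score at `now`, for callers that need the value.
    pub fn score_at(&self, now: f64, lambda: f64) -> f64 {
        (self.z - lambda * now).exp()
    }
}
```

For ranking, entities sharing the same `lambda` can be ordered by `z` directly; `score_at` is only needed when an absolute value is reported.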
### Numerical Stability
**f64 precision is not a practical concern.** Each running-score update introduces ~0.5 ULP of rounding error. After 10^12 updates, accumulated error would be ~10^-10 relative. Jules Jacobs analyzed that with f64 and a 1-hour half-life, the system can run until the year 18,000 without precision issues.
**Underflow is desirable.** When an entity receives no signals for a long time, its decay score approaches 0.0. This is correct behavior -- the content has become irrelevant. Underflow to exactly 0.0 happens only once `lambda * dt` exceeds roughly 745 for a unit-magnitude score, i.e. after about 1,075 half-lives for f64 (the smallest subnormal is 2^-1074), and it produces the correct ranking: the entity drops out of contention.
**Invariant.** Decay scores are non-negative. A negative score indicates a bug. Assert `score >= 0.0` on every update in debug builds.
### Linear Decay
For signals using `Decay::Linear { lifetime }`:
```
S(t) = SUM_i [ w_i * max(0, 1 - (t - t_i) / lifetime) ]
```
Linear decay cannot use the running-score trick because the `max(0, ...)` clamp is not multiplicatively composable. Instead, linear-decay signals rely on windowed aggregation with the window duration set to `lifetime`. The aggregate at query time is the count/sum of events within the lifetime window, with the weight linearly interpolated at the window boundary.
Linear decay is primarily used for signals where the "cliff" behavior is desirable -- e.g., a promotion that lasts exactly 7 days.
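For reference, the linear-decay definition evaluated by a direct scan looks like this. It is a sketch of the formula only, not of the bucketed implementation described above; `linear_score` is an illustrative name and events are `(time_secs, weight)` pairs.

```rust
/// Linear-decay score: sum of w_i * max(0, 1 - (t - t_i) / lifetime).
pub fn linear_score(events: &[(f64, f64)], t: f64, lifetime: f64) -> f64 {
    events
        .iter()
        .filter(|&&(t_i, _)| t_i <= t) // ignore events dated after the query time
        .map(|&(t_i, w_i)| w_i * (1.0 - (t - t_i) / lifetime).max(0.0))
        .sum()
}
```

With a 7-day lifetime, an event of weight 1.0 contributes 0.5 after 3.5 days and exactly 0.0 from day 7 onward, so events older than `lifetime` never need to be retained for this signal.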
---
## 5. Velocity Computation
Velocity is the rate of change of signal volume within a window. It answers: "Is this signal accelerating or decelerating?" Velocity is the primary signal for trending and rising surfaces.
### Definition
For a signal with windowed count `C(t, w)` representing the number of events in the window `[t-w, t]`:
```
velocity(t, w) = C(t, w) / w
```
This is the simplest form: events per unit time. A view velocity of 500/hour means 500 views in the last hour.
### Relative Velocity (Acceleration)
For rising/breakout detection, what matters is not absolute velocity but **velocity relative to a baseline**:
```
relative_velocity(t) = velocity(t, w_short) / velocity(t, w_long)
```
Where `w_short` is a short window (e.g., 1h) and `w_long` is a longer window (e.g., 24h). When `relative_velocity > 1.0`, the signal is accelerating. When `relative_velocity >> 1.0`, the content is breaking out.
**Example.** An item averaging 100 views/hour over the last 24h that suddenly receives 1,000 views in the last hour has `relative_velocity = 10.0`. This is a strong rising signal.
### Smoothed Velocity (EWMA)
Raw velocity is noisy at short windows. A single burst of views creates a spike that disappears one window-duration later. For ranking stability, velocity is smoothed using an Exponentially Weighted Moving Average (EWMA):
```
V_smooth(t) = alpha * V_raw(t) + (1 - alpha) * V_smooth(t_prev)
```
Where `alpha` determines the smoothing factor. Smaller `alpha` = smoother but slower to react. Larger `alpha` = noisier but faster to detect changes.
| Window | Recommended alpha | Rationale |
|--------|------------------|-----------|
| 1h | 0.3 | Fast reaction for real-time trending |
| 24h | 0.1 | Smooth daily trend with less noise |
| 7d | 0.05 | Very smooth weekly trend |
### Implementation
Velocity does not require a separate data structure. It is computed from the bucketed counters in the warm tier:
```rust
impl WarmSignalState {
/// Compute velocity for a given window.
///
/// Sums the relevant minute/hour buckets and divides by window duration.
/// Cost: O(bucket_count) -- at most 168 for 7-day window at hourly granularity.
pub fn velocity(&self, window: &Window, now_ns: u64) -> f64 {
let (count, duration_secs) = match window {
Window::Sliding { duration } if duration <= &Duration::hours(1) => {
let minutes = duration.as_secs() / 60;
let count = self.sum_minute_buckets(minutes as usize, now_ns);
(count, duration.as_secs_f64())
}
Window::Sliding { duration } => {
let hours = duration.as_secs() / 3600;
let count = self.sum_hour_buckets(hours as usize, now_ns);
(count, duration.as_secs_f64())
}
Window::AllTime => return 0.0, // velocity is undefined for all-time
};
count as f64 / duration_secs
}
/// Compute relative velocity (acceleration).
///
/// ratio > 1.0 means accelerating; ratio < 1.0 means decelerating.
pub fn relative_velocity(
&self,
short_window: &Window,
long_window: &Window,
now_ns: u64,
) -> f64 {
let v_short = self.velocity(short_window, now_ns);
let v_long = self.velocity(long_window, now_ns);
if v_long < f64::EPSILON {
// No baseline -- treat as infinite acceleration if short > 0.
if v_short > 0.0 { f64::MAX } else { 0.0 }
} else {
v_short / v_long
}
}
}
```
### Velocity as EWMA (Smoothed)
The EWMA velocity is maintained as an additional atomic field in the warm tier, updated every time the minute bucket rolls over:
```rust
/// Updated once per minute by the bucket rotation logic.
fn update_smoothed_velocity(&self, raw_velocity: f64, alpha: f64) {
loop {
let prev_bits = self.smoothed_velocity.load(Ordering::Acquire);
let prev = f64::from_bits(prev_bits);
let new = alpha * raw_velocity + (1.0 - alpha) * prev;
match self.smoothed_velocity.compare_exchange_weak(
prev_bits,
new.to_bits(),
Ordering::AcqRel,
Ordering::Acquire,
) {
Ok(_) => break,
Err(_) => continue,
}
}
}
```
---
## 6. Windowed Aggregation
### SWAG: Sliding Window Aggregation via Two-Stacks
For O(1) amortized sliding window aggregation, we use the Two-Stacks algorithm (Tangwongsan, Hirzel, Schneider, PVLDB 2015).
**Requirements.** The aggregation operator must be associative (forming a monoid). This covers `count`, `sum`, `min`, `max`, and compositions thereof.
**Structure.** Two stacks, each storing `(value, prefix_aggregate)` pairs:
- **Back stack:** New events are pushed here. `back.top.agg = combine(back.prev.agg, new_value)`.
- **Front stack:** Evictions pop from here. If empty, flip all elements from back to front.
```
Insert event: push to back stack O(1)
Evict event: pop from front stack O(1) amortized (O(n) flip at most once per element)
Query agg: combine(front.top.agg, back.top.agg) O(1)
```
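The three operations above can be sketched for the `sum` monoid over `u64`. `TwoStacksSum` is an illustrative type, not the `SwagState` referenced earlier, and it relies on `sum` being commutative; a non-commutative operator would need care about operand order when combining the two stacks.

```rust
/// Two-Stacks sliding-window aggregator for `sum` (identity element 0).
#[derive(Default)]
pub struct TwoStacksSum {
    /// Each entry is (value, aggregate of this entry and everything below it).
    front: Vec<(u64, u64)>,
    back: Vec<(u64, u64)>,
}

impl TwoStacksSum {
    pub fn new() -> Self {
        Self::default()
    }

    /// Insert the newest element. O(1).
    pub fn push(&mut self, value: u64) {
        let below = self.back.last().map_or(0, |&(_, agg)| agg);
        self.back.push((value, below + value));
    }

    /// Evict the oldest element. Amortized O(1).
    pub fn pop(&mut self) -> Option<u64> {
        if self.front.is_empty() {
            // Flip: drain `back` onto `front`, so the oldest element ends up
            // on top, recomputing the running aggregates along the way.
            while let Some((value, _)) = self.back.pop() {
                let below = self.front.last().map_or(0, |&(_, agg)| agg);
                self.front.push((value, below + value));
            }
        }
        self.front.pop().map(|(value, _)| value)
    }

    /// Aggregate over everything currently in the window. O(1).
    pub fn query(&self) -> u64 {
        let f = self.front.last().map_or(0, |&(_, agg)| agg);
        let b = self.back.last().map_or(0, |&(_, agg)| agg);
        f + b
    }
}
```

Each element is moved from `back` to `front` at most once, which is where the amortized O(1) eviction bound comes from.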
### Scotty Stream-Slicing: Practical Implementation
Rather than maintaining pure SWAG stacks per window, tidalDB uses the Scotty stream-slicing approach (Traub et al., EDBT 2019): divide the event stream into non-overlapping time slices (per-minute and per-hour buckets), compute partial aggregates per slice, and share these across all concurrent windows.
This means a single set of per-minute and per-hour counters supports simultaneous 1h, 24h, and 7d window queries. The cost of a windowed query is O(number_of_buckets_in_window):
| Window | Bucket Granularity | Buckets to Sum | Cost |
|--------|--------------------|---------------|------|
| 1h | per-minute | 60 | ~120 ns |
| 24h | per-hour | 24 | ~48 ns |
| 7d | per-hour | 168 | ~336 ns |
| 30d | per-hour | 720 (from rollups) | ~1.4 us |
| all_time | single counter | 1 | ~2 ns |
For the 30-day window, the system merges hourly rollups from the cold tier (disk) with in-memory hour buckets for the current 7 days. This follows the TimescaleDB real-time continuous aggregate pattern.
### Bucket Rotation
Minute buckets rotate every 60 seconds. Hour buckets rotate every 3600 seconds. Rotation is performed by the background materializer thread:
1. Record the current bucket's final value.
2. Zero the bucket for reuse.
3. Update the current-bucket pointer (atomic store).
4. If hour boundary crossed: aggregate the last 60 minute buckets into the hour bucket.
**Concurrency during rotation.** Writers continue incrementing the new current bucket via atomic add. Readers sum buckets starting from the current pointer and wrapping backwards. The window between "bucket zeroed" and "pointer advanced" is at most one atomic store apart, and a reader that sees the old pointer will include one extra bucket (slightly over-counting rather than under-counting), which is acceptable for ranking purposes.
### Multiple Simultaneous Windows
All windows for a given signal type share the same bucket arrays. A 1h query sums the last 60 minute buckets. A 24h query sums the last 24 hour buckets. A 7d query sums the last 168 hour buckets. No duplicated storage.
The `all_time` window is a simple atomic counter incremented on every event. No bucketing needed.
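The shared-bucket idea can be illustrated with a single-threaded sketch. `Buckets` is a hypothetical type with plain integers instead of atomics, and it makes one simplifying assumption that differs from the rotation steps above: each event increments both the current minute and the current hour slice on write, rather than folding minute buckets into the hour bucket at the hour boundary.

```rust
/// Single-threaded sketch of shared time-sliced counters (no atomics).
pub struct Buckets {
    minute: [u32; 60],
    hour: [u32; 168],
    cur_minute: usize,
    cur_hour: usize,
    all_time: u64,
}

impl Buckets {
    pub fn new() -> Self {
        Buckets { minute: [0; 60], hour: [0; 168], cur_minute: 0, cur_hour: 0, all_time: 0 }
    }

    /// Record one event in the current minute and hour slices.
    pub fn record(&mut self) {
        self.minute[self.cur_minute] += 1;
        self.hour[self.cur_hour] += 1;
        self.all_time += 1;
    }

    /// Advance to the next minute slice, clearing the slot being reused.
    pub fn rotate_minute(&mut self) {
        self.cur_minute = (self.cur_minute + 1) % 60;
        self.minute[self.cur_minute] = 0;
    }

    /// Advance to the next hour slice, clearing the slot being reused.
    pub fn rotate_hour(&mut self) {
        self.cur_hour = (self.cur_hour + 1) % 168;
        self.hour[self.cur_hour] = 0;
    }

    /// Sum the `n` most recent minute slices (capped at 60).
    pub fn last_minutes(&self, n: usize) -> u64 {
        (0..n.min(60))
            .map(|k| u64::from(self.minute[(self.cur_minute + 60 - k) % 60]))
            .sum()
    }

    /// Sum the `n` most recent hour slices (capped at 168).
    pub fn last_hours(&self, n: usize) -> u64 {
        (0..n.min(168))
            .map(|k| u64::from(self.hour[(self.cur_hour + 168 - k) % 168]))
            .sum()
    }

    pub fn all_time(&self) -> u64 {
        self.all_time
    }
}
```

A 1h query is `last_minutes(60)`, a 24h query is `last_hours(24)`, and a 7d query is `last_hours(168)`, all served from the same two arrays.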
---
## 7. Cohort-Scoped Signal Aggregation
This section specifies the architecture for cohort-scoped signal queries: "this item has 50k views in 24h among US users aged 18-24 who like jazz." This is the foundation for cohort-based trending, demographic-targeted recommendations, and audience analytics.
### Problem Statement
Global signal aggregates answer "what is trending for everyone." Cohort-scoped aggregates answer "what is trending for **this group of users**." The groups can be defined by:
- **Demographics:** region, language, age bracket
- **Behavioral:** users who like jazz, users who prefer short-form, users who are power consumers
- **Social:** users in this follower graph, users in this community
- **Composite:** US users aged 18-24 who like jazz AND prefer short-form video
The number of possible cohort combinations is combinatorially explosive. The system must support thousands of pre-defined cohorts and ad-hoc cohort queries without unbounded storage growth.
### Approach Evaluation
Three approaches were evaluated:
**Approach A: Pre-computed cohort signals.** At signal write time, resolve which cohorts the user belongs to and increment per-item-per-cohort counters.
- Write amplification: `events/sec * avg_cohorts_per_user` (typically 5-15x).
- Storage: `items * cohorts * signals * windows * 4 bytes`. At 10M items * 1000 cohorts * 6 signals * 5 windows * 4 bytes = **1.2 TB**. Infeasible.
- Read latency: O(1). Direct counter lookup.
- Verdict: **Rejected.** Storage and write amplification are unacceptable at 1000+ cohorts.
**Approach B: Query-time cohort filtering.** Store signal events with user attributes attached. Filter events by cohort predicate at query time.
- Write amplification: 1x (no additional writes).
- Storage: Marginal increase per event (cohort attributes stored inline).
- Read latency: O(events_in_window) per entity. At 50K events/day per popular item, scanning 24h of events = ~50K events * 50 ns = **2.5 ms per entity**. For 200 candidates: **500 ms**. Infeasible.
- Verdict: **Rejected.** Read latency is unacceptable.
**Approach C: Hierarchical rollups with dimensional decomposition.** This is the recommended approach.
### Recommended Architecture: Hierarchical Dimensional Rollups
The design decomposes the cohort space into a fixed hierarchy of dimensions with pre-computed rollups at each level. Fine-grained cohort queries are answered by intersecting the appropriate dimensional rollups.
#### Dimension Hierarchy
```
Level 0: GLOBAL
  One counter per item per signal per window.
  Always maintained. Source of truth for global trending.

Level 1: PRIMARY DIMENSIONS (independently maintained)
  region:     {US, EU, APAC, LATAM, ...}                ~20 values
  language:   {en, es, fr, de, ja, ...}                 ~30 values
  age_group:  {13-17, 18-24, 25-34, 35-44, 45-54, 55+}    6 values
  Total Level 1 cohorts: ~56

Level 2: BEHAVIORAL SEGMENTS (computed, not enumerated)
  Defined by the application in schema. Examples:
    - "jazz_fans": users where preference_vector cosine_sim > 0.7 with jazz centroid
    - "power_users": users with > 100 signals in last 7 days
    - "short_form_preferred": users where > 70% of views are format:short
  Maximum: 100 application-defined segments.

Level 3: COMPOSITE (computed at query time)
  Intersection of Level 1 and Level 2 dimensions.
  e.g., "US + 18-24 + jazz_fans"
  Not pre-computed. Estimated from Level 1 and Level 2 aggregates.
```
#### Storage Layout
Cohort-scoped counters are stored in a dedicated column family:
```
CF "cohort_signals" Leveled compaction, TTL matches window
Key: [item_id: u64 BE][signal_type: u8][dimension: u8][cohort_value: u16 BE][hour_bucket: u32 BE]
Value: CohortBucket { count: u32, weighted_sum: f32, unique_users_hll: [u8; 12] }
```
**Dimension encoding:**
| Dimension ID (u8) | Dimension | Max Values | Description |
|-------------------|-----------|------------|-------------|
| 0 | global | 1 | Global aggregate (Level 0) |
| 1 | region | 20 | Geographic region |
| 2 | language | 30 | User language |
| 3 | age_group | 6 | Age bracket |
| 4-103 | segment_0..99 | 2 each (in/out) | Behavioral segments |
#### Storage Cost Analysis
Per-item, per-signal-type, per-hour:
```
Level 0: 1 global bucket = 20 bytes
Level 1: (20 + 30 + 6) = 56 cohort buckets = 1,120 bytes
Level 2: 100 segment buckets (boolean in/out) = 2,000 bytes
Total per item per signal per hour: = 3,140 bytes
```
For 10M items * 6 signal types * 24 hours * 3,140 bytes = **4.5 TB/day** at full population. This is infeasible for all 10M items.
**Critical insight: cohort counters are only needed for candidate items.** Cohort-scoped trending queries operate over at most a few thousand candidate items (e.g., items with global velocity above a threshold). The vast majority of items have negligible signal activity and do not need cohort decomposition.
**Revised approach: threshold-gated cohort tracking.**
```rust
/// Cohort tracking is activated for an item + signal when the global
/// signal rate exceeds this threshold. Below this threshold, cohort
/// breakdown adds no useful information.
const COHORT_ACTIVATION_THRESHOLD: u32 = 100; // events per hour
```
At any given time, fewer than 100K items have >100 events/hour for any signal type. Cohort storage for 100K items:
```
100K items * 6 signals * 24 hours * 3,140 bytes = 45.2 GB/day
```
With 7-day retention on hourly cohort rollups: **316 GB**. Feasible.
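The storage figures above can be reproduced with a few lines of arithmetic. The sketch below only restates the numbers from this section; the constant and function names are illustrative, not part of the codebase.

```rust
/// count: u32 (4) + weighted_sum: f32 (4) + unique_users_hll: [u8; 12] (12).
const BUCKET_BYTES: u64 = 4 + 4 + 12;
/// region (20) + language (30) + age_group (6).
const LEVEL1_COHORTS: u64 = 20 + 30 + 6;
/// One bucket per application-defined behavioral segment.
const LEVEL2_SEGMENTS: u64 = 100;

/// Bytes needed for one item, one signal type, one hour bucket
/// across the global, Level 1, and Level 2 counters.
fn bytes_per_item_signal_hour() -> u64 {
    (1 + LEVEL1_COHORTS + LEVEL2_SEGMENTS) * BUCKET_BYTES
}

/// Total cohort storage for a population of tracked items.
fn cohort_storage_bytes(tracked_items: u64, signal_types: u64, retained_hours: u64) -> u64 {
    tracked_items * signal_types * retained_hours * bytes_per_item_signal_hour()
}
```

For example, `cohort_storage_bytes(100_000, 6, 24 * 7)` gives the 7-day figure, and swapping in `10_000_000` items shows why tracking the full population is not viable.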
#### Write Path: Cohort Attribution
At signal write time, the user's cohort memberships are resolved and cached:
```rust
/// Resolved once per user, cached in the user's hot-tier state.
/// Refreshed when user metadata changes or behavioral segments are recomputed.
struct UserCohortMemberships {
    region: CohortValueId,    // 2 bytes
    language: CohortValueId,  // 2 bytes
    age_group: CohortValueId, // 2 bytes
    segments: BitSet128,      // 16 bytes -- one bit per behavioral segment
}
// 22 bytes per user. 10M users = 220 MB.
```
On signal write:
1. Look up the user's `UserCohortMemberships` (hot-tier, O(1)).
2. If the target item has cohort tracking activated:
   a. Increment the global counter (always).
   b. Increment the region counter for this user's region.
   c. Increment the language counter for this user's language.
   d. Increment the age_group counter for this user's age group.
   e. For each behavioral segment the user belongs to, increment that segment's counter.
3. If the item does not have cohort tracking activated:
   a. Increment the global counter only.
   b. Check if the global counter crossed the activation threshold. If so, activate cohort tracking.
**Write amplification analysis:**
| Scenario | Counter Increments per Event |
|----------|---------------------------|
| Below threshold (vast majority) | 1 (global only) |
| Above threshold, user in 8 segments | 1 + 3 + 8 = 12 |
| Above threshold, user in 20 segments | 1 + 3 + 20 = 24 |
Average write amplification across all events (assuming 1% of events target cohort-tracked items, users average 10 segments): `0.99 * 1 + 0.01 * 14 = 1.13x`. Negligible.
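The per-event increment count and the blended average can be checked with a small sketch. It assumes a `u128` bitmask as a stand-in for `BitSet128` and `u16` for `CohortValueId`; `counter_increments` and `average_amplification` are hypothetical helper names.

```rust
/// Stand-in for the cached per-user memberships (u128 replaces BitSet128).
#[allow(dead_code)]
struct UserCohortMemberships {
    region: u16,
    language: u16,
    age_group: u16,
    segments: u128, // one bit per behavioral segment
}

/// Number of counters touched by a single signal event.
fn counter_increments(user: &UserCohortMemberships, cohort_tracking_active: bool) -> u32 {
    if !cohort_tracking_active {
        return 1; // global counter only
    }
    // global + (region, language, age_group) + one per segment the user is in
    1 + 3 + user.segments.count_ones()
}

/// Expected increments per event across the whole event stream.
fn average_amplification(tracked_fraction: f64, avg_segments: f64) -> f64 {
    (1.0 - tracked_fraction) * 1.0 + tracked_fraction * (1.0 + 3.0 + avg_segments)
}
```

Because the segment cost is a popcount over the cached bitmask, the write path never needs to evaluate segment predicates inline.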
#### Read Path: Cohort-Scoped Queries
**Single-dimension queries** (e.g., "trending in US") are direct lookups:
```rust
/// O(1) per item per signal. Same as global trending but reads from
/// the dimension-specific counter.
fn cohort_velocity(
    &self,
    item: EntityId,
    signal: SignalTypeId,
    dimension: DimensionId,
    cohort_value: CohortValueId,
    window: &Window,
) -> f64 {
    // Sum the hour buckets for this (item, signal, dimension, cohort_value)
    // Same pattern as global velocity but from the cohort_signals CF.
}
```
Read latency: same as global windowed query, ~50 ns to ~1.4 us depending on window.
**Composite queries** (e.g., "trending among US users aged 18-24 who like jazz"):
Composite cohort queries combine multiple dimensions. Because each dimension's counters are maintained separately, no joint counter exists for the intersection. The intersection is instead estimated from the per-dimension counters under a statistical independence assumption.
**Estimation approach for composite cohorts:**
For two independent dimensions A and B, the count of events from users in both A and B is estimated as:
```
C(A AND B) ~= C(global) * (C(A) / C(global)) * (C(B) / C(global))
= C(A) * C(B) / C(global)
```
This assumes independence between dimensions. For correlated dimensions (e.g., region and language are correlated: US users are more likely to speak English), the estimate has error proportional to the correlation strength.
For three dimensions A, B, S (two Level 1 + one Level 2):
```
C(A AND B AND S) ~= C(A) * C(B) * C(S) / C(global)^2
```
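Both formulas are instances of the same product of marginal fractions. A minimal sketch, assuming counts are already read as `f64` and using a hypothetical helper name `estimate_composite`:

```rust
/// Estimate the event count for the intersection of several cohorts,
/// assuming the dimensions are statistically independent.
///
/// `global` is C(global); `marginals` holds C(A), C(B), ... for each dimension.
/// Computes C(global) * prod(C(i) / C(global)), which equals
/// prod(C(i)) / C(global)^(k-1) for k dimensions.
fn estimate_composite(global: f64, marginals: &[f64]) -> f64 {
    if global <= 0.0 || marginals.is_empty() {
        return 0.0; // no events at all, or nothing to intersect
    }
    marginals.iter().fold(global, |acc, &count| acc * (count / global))
}
```

Multiplying by one fraction at a time, rather than forming the full product first, keeps intermediate values within the range of the global count.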
**Accuracy bounds.** Under the independence assumption, the estimation error is bounded by the mutual information between dimensions. For region/language (moderately correlated), empirical testing on real engagement data shows ~15-25% relative error. For region/age_group (weakly correlated), error is ~5-10%.
**When estimation is insufficient:** For high-value composite cohorts that the application queries frequently, the application can define them as Level 2 behavioral segments with exact counting. A segment "us_young_jazz" that is the intersection of region:US, age_group:18-24, and jazz_fans gets its own exact counter tracked at write time.
#### Cohort Membership Changes Over Time
User cohort memberships change:
- **Demographics (Level 1):** Rarely change. Region changes on relocation. Age group changes yearly. Language changes rarely.
- **Behavioral segments (Level 2):** Change as user preferences evolve. A user may enter or leave the "jazz_fans" segment as their engagement shifts.
**Membership refresh policy:**
1. Level 1 memberships are updated when user metadata is explicitly changed (`db.update_user()`).
2. Level 2 memberships are recomputed by the background materializer on a configurable schedule (default: every hour).
3. When a membership changes, future signal events use the new membership. Historical counters are not retroactively adjusted -- this is acceptable because cohort trending is inherently a "what's happening now" query, not a historical audit.
**Implication for accuracy.** Between refreshes, events are attributed to the cached membership, so a segment's counters may include events from users who no longer belong to it. The staleness is bounded by the refresh interval (default 1 hour). For a 24h window this affects at most ~4% of the window (1 stale hour out of 24). For a 1h window the entire window can fall inside the stale interval, so 1h segment-scoped results should be treated as approximate.
#### Capacity and Scaling
| Metric | Value |
|--------|-------|
| Maximum pre-defined cohorts (Level 1 + Level 2) | ~156 |
| Maximum ad-hoc composite cohorts | Unlimited (estimated at query time) |
| Items with active cohort tracking | ~100K (threshold-gated) |
| Storage for cohort data | ~316 GB (7-day retention) |
| Write amplification (average) | ~1.13x |
| Read latency (single dimension) | ~50 ns to ~1.4 us |
| Read latency (composite, 2 dimensions) | ~100 ns to ~3 us |
| Read latency (composite, 3+ dimensions) | ~200 ns to ~5 us |
| Accuracy (single dimension) | Exact |
| Accuracy (2-dimension composite) | ~85-95% (independence assumption) |
| Accuracy (3+ dimension composite) | ~75-90% (use exact segments for critical queries) |
---
## 8. Signal Write Path
The signal write path is the most performance-critical transaction in tidalDB. A single `db.signal()` call triggers a cascade of updates across multiple subsystems.
### Write Path Data Flow
```
Application calls db.signal(Signal { kind: "view", item: "X", user: "U", ... })
|
v
[1. DEDUP CHECK] ---- BLAKE3(signal_type, item_id, user_id, timestamp) ---> content hash
| If hash exists in dedup set: return Ok(()) silently.
| Dedup set: in-memory bloom filter + on-disk hash set.
v
[2. WAL APPEND] -----> Write signal event to WAL segment.
| Durability: Immediate, Batched, or Eventual per signal type.
| Event is durable after this step.
v
[3. HOT-TIER UPDATE] -> Update HotSignalState.decay_scores (atomic CAS).
| Update HotSignalState.last_update_ns (atomic store).
| Cost: ~36ns (3 exp() calls).
v
[4. WARM-TIER UPDATE] -> Increment minute bucket (atomic add).
| Increment all-time counter (atomic add).
| If cohort tracking active: increment cohort counters.
| Cost: ~20ns (atomic increments).
v
[5. USER PREF UPDATE] -> Shift user preference vector toward/away from item embedding.
| Direction: toward for positive signals, away for negative.
| Magnitude: proportional to signal weight * learning_rate.
| Cost: ~200ns (vector arithmetic on 1536D embedding).
v
[6. RELATIONSHIP UPDATE] -> Update user->creator interaction_weight.
| Update user->item state (seen, liked, hidden, etc.).
| Cost: ~50ns (atomic updates).
v
[7. RETURN Ok(())]
```
### Atomicity Guarantees
Steps 3-6 are **not** wrapped in a transaction. They are independent atomic updates to separate data structures. The WAL (step 2) is the source of truth. If the process crashes between step 3 and step 6:
- The WAL contains the event.
- On recovery, the WAL is replayed from the last checkpoint.
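- Steps 3-6 are re-applied on top of the state restored from the checkpoint, so each logged event is counted exactly once (the dedup hash prevents a replayed event from entering the dedup set twice, and running-score updates are commutative, so replay order does not change the result).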
This is a deliberate choice: transactional atomicity across all four updates would require a mutex or 2PC, which violates the lock-free hot-path requirement. Instead, eventual consistency is achieved through WAL replay.
**Consistency guarantee:** After WAL replay completes (bounded by `max_replay_time`, typically <30 seconds), all aggregates are consistent with the event stream.
### Content-Addressed Deduplication
Signal events are deduplicated using BLAKE3 hashing:
```rust
/// Compute the content hash for deduplication.
fn signal_content_hash(signal: &Signal) -> [u8; 32] {
    let mut hasher = blake3::Hasher::new();
    hasher.update(signal.kind.as_bytes());
    hasher.update(&signal.item.to_bytes());
    hasher.update(&signal.user.to_bytes());
    // Truncate timestamp to second granularity to handle
    // sub-second retries of the same logical event.
    let ts_secs = signal.timestamp.timestamp();
    hasher.update(&ts_secs.to_le_bytes());
    *hasher.finalize().as_bytes()
}
```
**Dedup storage:** A bloom filter (in-memory, ~240 MB for 100M events at 0.01% FPR, i.e. ~19 bits per key) provides fast negative lookups. On a bloom filter hit (potential duplicate), the on-disk hash set is consulted for confirmation. False positives in the bloom filter cause unnecessary disk reads (~50 us) but do not cause data loss.
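Bloom filter memory follows from the standard sizing formula `m = -n * ln(p) / (ln 2)^2` bits, with `k = -log2(p)` hash functions at the optimum. The sketch below applies it to the dedup set; the function names are illustrative.

```rust
/// Bits required for a bloom filter holding `n` keys at false-positive rate `fpr`.
fn bloom_bits(n: u64, fpr: f64) -> f64 {
    -(n as f64) * fpr.ln() / std::f64::consts::LN_2.powi(2)
}

/// Optimal number of hash functions for the target false-positive rate.
fn bloom_hashes(fpr: f64) -> u32 {
    (-fpr.log2()).round() as u32
}

/// Same sizing expressed in decimal megabytes.
fn bloom_megabytes(n: u64, fpr: f64) -> f64 {
    bloom_bits(n, fpr) / 8.0 / 1e6
}
```

Under this formula, 100M keys at 0.01% FPR come to roughly 240 MB, and relaxing the target to 1% roughly halves that.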
### Group Commit
Signal writes use the configurable `Durability` level from `Config`:
```rust
pub enum Durability {
    /// fsync every write. For financial/purchase events.
    /// Latency: ~1ms per write (dominated by fsync).
    Immediate,
    /// fsync per batch. Default for engagement signals.
    /// Accumulate up to max_batch events or max_delay_ms, whichever comes first.
    /// Latency: ~10-100us per write (amortized fsync).
    Batched { max_batch: usize, max_delay_ms: u64 },
    /// fsync on OS schedule. For impressions, low-value telemetry.
    /// Latency: ~1us per write (no fsync).
    /// Risk: up to OS buffer duration of events lost on power failure.
    Eventual,
}
```
The group commit queue accumulates signal events and issues a single fsync per batch. Writers are notified of completion via a per-batch condition variable. This follows the PostgreSQL commit delay pattern, validated in production by Citadel's `GroupCommitQueue`.
**Throughput at Batched { max_batch: 100, max_delay_ms: 10 }:**
- 1 fsync per 100 events or per 10ms.
- At 10,000 events/sec: 100 fsyncs/sec, each flushing ~100 events.
- NVMe SSD fsync latency: ~50-100us.
- Throughput: bounded by event processing, not fsync. >50,000 events/sec achievable.
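The flush decision behind these numbers can be sketched as a small policy type. This is an illustration under stated assumptions: `BatchPolicy` is a hypothetical name, and the fsync-rate estimate is a steady-state approximation that ignores bursts.

```rust
use std::time::Duration;

/// Hypothetical flush policy mirroring `Durability::Batched`.
#[derive(Clone, Copy)]
struct BatchPolicy {
    max_batch: usize,
    max_delay: Duration,
}

impl BatchPolicy {
    /// Flush when the batch is full or the oldest pending event has waited long enough.
    fn should_flush(&self, pending: usize, oldest_age: Duration) -> bool {
        pending > 0 && (pending >= self.max_batch || oldest_age >= self.max_delay)
    }

    /// Approximate steady-state fsync rate for a given event rate.
    fn fsyncs_per_sec(&self, events_per_sec: f64) -> f64 {
        // Size trigger: one fsync per full batch.
        let by_size = events_per_sec / self.max_batch as f64;
        // Timer trigger: one fsync per delay interval, but never more than one per event.
        let by_delay = (1.0 / self.max_delay.as_secs_f64()).min(events_per_sec);
        by_size.max(by_delay)
    }
}
```

With `max_batch: 100` and a 10 ms delay, both triggers fire at the same rate at 10,000 events/sec; below that rate the timer dominates and batches flush partially full.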
### Signal Weight Semantics
The `weight` field in a signal event has signal-type-specific semantics:
| Signal Type | Weight Meaning | Typical Values |
|------------|----------------|----------------|
| `view` | 1.0 per view | Always 1.0 |
| `completion` | Fraction completed | 0.0 to 1.0 |
| `like` | 1.0 per like | Always 1.0 |
| `skip` | 1.0 per skip | Always 1.0 |
| `dwell_time` | Seconds of dwell | 0.0 to 3600.0 |
| `share` | 1.0 per share | Always 1.0 |
| `search_click` | 1.0 / log2(rank + 1) | 1.0 at rank 1, decreasing logarithmically with rank |
Weights are validated at write time against the signal definition. Negative weights are rejected (negative signals use separate signal types, not negative weights).
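A minimal sketch of write-time weight validation and the `search_click` weight, assuming a 1-based rank and an optional per-type upper bound. `WeightError`, `validate_weight`, and `search_click_weight` are hypothetical names, not the schema engine's API.

```rust
#[derive(Debug, PartialEq)]
enum WeightError {
    NotFinite,
    Negative,
    AboveMax,
}

/// Reject NaN/infinite and negative weights, and weights above the
/// signal type's upper bound when one is defined (e.g. 1.0 for `completion`).
fn validate_weight(weight: f64, max: Option<f64>) -> Result<f64, WeightError> {
    if !weight.is_finite() {
        return Err(WeightError::NotFinite);
    }
    if weight < 0.0 {
        return Err(WeightError::Negative);
    }
    match max {
        Some(limit) if weight > limit => Err(WeightError::AboveMax),
        _ => Ok(weight),
    }
}

/// Weight for a search click at a 1-based result rank: 1.0 / log2(rank + 1).
/// Rank 0 is clamped to 1 to avoid dividing by log2(1) = 0.
fn search_click_weight(rank: u32) -> f64 {
    let rank = f64::from(rank.max(1));
    1.0 / (rank + 1.0).log2()
}
```

A click at rank 1 carries full weight, rank 3 carries half, and rank 7 a third, which matches the logarithmic discount in the table.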
---
## 9. Background Materializer
The background materializer is a dedicated thread (or thread pool) that continuously maintains materialized aggregates, performs bucket rotation, computes behavioral segments, and manages tier transitions.
### Responsibilities
1. **Bucket rotation.** Every minute: rotate minute buckets. Every hour: aggregate minute buckets into hour buckets. Every day: aggregate hour buckets into daily rollups.
2. **Rollup generation.** Incrementally compute hourly and daily rollups and persist to the cold tier. Follows the TimescaleDB continuous aggregate pattern.
3. **Hot-tier checkpointing.** Periodically (every 30-60 seconds) snapshot hot-tier `HotSignalState` to the `entity_signal_state` CF for crash recovery.
4. **Cohort segment recomputation.** Hourly: recompute behavioral segment memberships for users with recent activity.
5. **Cohort activation/deactivation.** Monitor global signal rates and activate/deactivate cohort tracking for items crossing the threshold.
6. **Warm-tier eviction.** Evict warm-tier entries for entities with no recent activity.
7. **Velocity smoothing.** Update EWMA velocity estimates on each bucket rotation.
### Staleness Bounds
The materializer guarantees that materialized state is fresh within a bounded staleness interval:
| Materialized State | Staleness Bound | Rationale |
|-------------------|----------------|-----------|
| Hot-tier decay scores | 0 (updated inline on write) | Part of the write path, not materializer |
| Minute-bucket counts | 0 (updated inline on write) | Part of the write path |
| Hour-bucket counts | 60 seconds | Aggregated from minute buckets on rotation |
| Hourly rollups (disk) | 65 seconds | Written after hour-bucket rotation + flush |
| Daily rollups (disk) | 25 hours | Computed from hourly rollups with 1h grace period |
| Behavioral segments | 1 hour | Recomputed hourly |
| Smoothed velocity (EWMA) | 60 seconds | Updated on minute-bucket rotation |
| Hot-tier checkpoint | 60 seconds | Persisted every 30-60 seconds |
### Rollup Schedule
```
Every 1 minute:
- Rotate minute buckets for all active entities.
- Update EWMA velocity for all active entities.
- Flush completed minute aggregates to hour-bucket accumulators.
Every 1 hour:
- Finalize hourly rollup for the just-completed hour (after 1-minute grace).
- Write hourly rollups to cold-tier CF "hourly_rollups".
- Recompute behavioral segment memberships for recently active users.
- Evaluate cohort activation thresholds.
Every 1 day:
- Compute daily rollups from the 24 hourly rollups of the just-completed day.
- Write daily rollups to cold-tier CF "daily_rollups".
- Drop expired hourly rollups (>30 days) and raw events (>7 days).
- Log storage size metrics.
Every 30-60 seconds:
- Checkpoint hot-tier state to entity_signal_state CF.
```
### Rollup Composability
Rollups store **composable aggregates** -- never store averages, percentiles, or other non-composable statistics. Store the components from which any statistic can be derived:
```rust
/// Composable hourly aggregate.
/// Invariant: a daily rollup is computed by composing 24 hourly rollups.
/// Invariant: a 7-day aggregate is computed by composing 168 hourly rollups.
struct HourlyRollup {
    /// Total event count in this hour.
    total_count: u32,
    /// Sum of event weights in this hour.
    weighted_sum: f32,
    /// Approximate unique user count (HyperLogLog sketch, 12 bytes).
    unique_users_hll: [u8; 12],
    /// Maximum single-event weight (for outlier detection).
    max_weight: f32,
}

// Composition:
impl HourlyRollup {
    fn compose(a: &Self, b: &Self) -> Self {
        HourlyRollup {
            // Saturate instead of overflowing when many busy hours are composed.
            total_count: a.total_count.saturating_add(b.total_count),
            weighted_sum: a.weighted_sum + b.weighted_sum,
            // `hll_union` merges two sketches register by register.
            unique_users_hll: hll_union(&a.unique_users_hll, &b.unique_users_hll),
            max_weight: a.max_weight.max(b.max_weight),
        }
    }
}
```
### Real-Time Continuous Aggregates
At query time, a windowed aggregate is computed by merging pre-materialized rollups with un-rolled-up recent data:
```
window_count(entity, signal, 7d) =
sum(hourly_rollups for hours h-168..h-1) // from cold tier
+ sum(minute_buckets for current hour) // from warm tier
```
This is the TimescaleDB real-time continuous aggregate pattern. Materialized state provides the bulk of the answer (168 lookups from sorted on-disk data), and the warm tier fills in the gap since the last materialization. The measured speedup over scanning raw events is ~979x (TimescaleDB benchmark).
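The merge itself is a plain sum over two sources. A minimal sketch, assuming the completed hourly counts arrive as a slice ordered oldest first and the current hour's minute buckets as a second slice; `window_count` here is an illustrative free function, not the query engine's signature.

```rust
/// Combine materialized hourly counts with the live minute buckets
/// of the current, not yet rolled-up hour.
///
/// `completed_hours` is ordered oldest first; only the last
/// `window_hours` entries fall inside the window.
fn window_count(
    completed_hours: &[u32],
    current_hour_minutes: &[u32],
    window_hours: usize,
) -> u64 {
    let start = completed_hours.len().saturating_sub(window_hours);
    let materialized: u64 = completed_hours[start..].iter().map(|&c| u64::from(c)).sum();
    let live: u64 = current_hour_minutes.iter().map(|&c| u64::from(c)).sum();
    materialized + live
}
```

Widening to `u64` before summing avoids overflow when many busy hours are combined.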
### Changelog
When a materialized aggregate changes significantly (configurable threshold, default: >20% relative change), the materializer records the change:
```
CF "signal_changelog"
Key: [entity_id: u64 BE][signal_type: u8][window_id: u8][timestamp_ns: u64 BE]
Value: { old_value: f64, new_value: f64 }
```
The changelog enables:
- "What was trending yesterday?" queries.
- Debugging ranking behavior over time.
- Alerting on unusual signal spikes (breakout detection).
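The threshold check and key encoding can be sketched as below. Both helpers are hypothetical; the only facts taken from the spec are the key layout and the 20% default threshold.

```rust
/// True when the relative change from `old` to `new` exceeds `threshold`
/// (e.g. 0.2 for the 20% default). A change away from zero always counts.
fn should_log_change(old: f64, new: f64, threshold: f64) -> bool {
    if old == new {
        return false;
    }
    let base = old.abs().max(f64::MIN_POSITIVE);
    (new - old).abs() / base > threshold
}

/// Encode a changelog key: [entity_id BE][signal_type][window_id][timestamp_ns BE].
/// Big-endian integers make byte-wise key order match numeric order,
/// so a prefix scan returns one entity's changes in time order.
fn changelog_key(entity_id: u64, signal_type: u8, window_id: u8, timestamp_ns: u64) -> [u8; 18] {
    let mut key = [0u8; 18];
    key[0..8].copy_from_slice(&entity_id.to_be_bytes());
    key[8] = signal_type;
    key[9] = window_id;
    key[10..18].copy_from_slice(&timestamp_ns.to_be_bytes());
    key
}
```

A "what was trending yesterday" query then becomes a range scan between two keys that share the entity, signal, and window prefix.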
---
## 10. Signal Event Format
### Wire Format (API Boundary)
```rust
use chrono::{DateTime, Utc};

pub struct Signal<'a> {
    /// Signal type name. Must match a defined signal type.
    pub kind: &'a str,
    /// Target item entity ID.
    pub item: &'a str,
    /// Source user entity ID.
    pub user: &'a str,
    /// Event timestamp. If None, uses server time.
    pub timestamp: Option<DateTime<Utc>>,
    /// Signal weight. Meaning depends on signal type.
    /// Must be non-negative. Default: 1.0.
    pub weight: f64,
    /// Optional context for signal attribution and analysis.
    pub context: Option<serde_json::Value>,
}
```
### Internal Storage Format (WAL)
```
+--------+--------+--------+--------+--------+--------+--------+--------+
| magic | len | signal | item_id | user_id | ts_ns
| (u8) | (u16) | type | (u64 BE) | (u64 BE) | (u64 BE)
| | | (u8) | | |
+--------+--------+--------+--------+--------+--------+--------+--------+
ts_ns | weight | ctx_len| context (variable) | blake3
(continued) | (f32) | (u16) | | checksum
| | | | (first 8 bytes)
+--------+--------+--------+--------+--------+--------+--------+--------+
Fixed header: 34 bytes (1 + 2 + 1 + 8 + 8 + 8 + 4 + 2)
Context: 0 to 65535 bytes
Checksum: 8 bytes (truncated BLAKE3)
Total: 42 + context_len bytes
```
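Summing the field widths in the diagram gives the record size directly, which the following encoder sketch makes explicit. Several details are assumptions, not part of the spec: the magic byte value, big-endian encoding for `len`, `weight`, and `ctx_len`, and the interpretation of `len` as the number of bytes after the magic and length fields. The checksum is passed in precomputed so the sketch needs no hashing dependency.

```rust
const MAGIC: u8 = 0xA5; // hypothetical magic byte
const FIXED_HEADER_LEN: usize = 1 + 2 + 1 + 8 + 8 + 8 + 4 + 2;
const CHECKSUM_LEN: usize = 8;

/// Encode one WAL record. Returns None if the record does not fit the u16 fields.
fn encode_record(
    signal_type: u8,
    item_id: u64,
    user_id: u64,
    ts_ns: u64,
    weight: f32,
    context: &[u8],
    checksum: [u8; CHECKSUM_LEN],
) -> Option<Vec<u8>> {
    let total = FIXED_HEADER_LEN + context.len() + CHECKSUM_LEN;
    // Assumption: `len` counts the bytes that follow the magic and len fields.
    let body_len = total - 3;
    if context.len() > u16::MAX as usize || body_len > u16::MAX as usize {
        return None;
    }
    let mut buf = Vec::with_capacity(total);
    buf.push(MAGIC);
    buf.extend_from_slice(&(body_len as u16).to_be_bytes());
    buf.push(signal_type);
    buf.extend_from_slice(&item_id.to_be_bytes());
    buf.extend_from_slice(&user_id.to_be_bytes());
    buf.extend_from_slice(&ts_ns.to_be_bytes());
    buf.extend_from_slice(&weight.to_be_bytes());
    buf.extend_from_slice(&(context.len() as u16).to_be_bytes());
    buf.extend_from_slice(context);
    buf.extend_from_slice(&checksum);
    Some(buf)
}
```

Under this reading of `len`, the largest context that fits in one record is slightly below 65535 bytes, since the length field must also cover the fixed fields and checksum.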
**Design decisions:**
- `signal_type` is stored as a `u8` index (not string) for compactness. Mapped from the signal name via the schema's signal type registry.
- `item_id` and `user_id` are stored as `u64` after the application's string IDs are mapped to internal numeric IDs by the entity store.
- `weight` is stored as `f32` (not `f64`) in the WAL for compactness. The running decay score in the hot tier uses `f64` for accumulated precision; individual event weights do not need f64.
- `context` is stored as raw bytes (MessagePack or JSON). Only parsed when accessed for analysis, never on the hot path.
- BLAKE3 checksum (truncated to 8 bytes) provides corruption detection. Full 32-byte hash is used for deduplication but not stored in the WAL record.
### Context Field Schema
The `context` field carries signal-type-specific attribution data:
| Signal Type | Context Fields | Purpose |
|------------|----------------|---------|
| `view` | `source_surface`, `position_in_feed` | Attribution |
| `search_click` | `query`, `rank_at_click` | Relevance training |
| `skip` | `dwell_ms`, `source` | Quality/format signal |
| `completion` | `total_duration_ms`, `completed_duration_ms` | Precision |
| `share` | `platform`, `share_type` | Virality analysis |
| `dwell_time` | `total_ms`, `active_ms` | Engagement depth |
Context is not indexed or aggregated. It is stored for offline analysis, model training, and debugging. It is never read on the ranking hot path.
---
## 11. Signal Types Reference
All signal types from USE_CASES.md Appendix C, grouped by category with recommended configuration.
### Positive Engagement Signals
| Signal | Type | Decay | Windows | Velocity | Primary Use |
|--------|------|-------|---------|----------|-------------|
| `view` | count | Exp 7d | 1h, 24h, 7d, 30d, all | Yes | Baseline reach |
| `unique_view` | count | Exp 7d | 1h, 24h, 7d, all | Yes | Deduplicated reach |
| `like` | count | Exp 7d | 1h, 24h, 7d, all | Yes | Positive sentiment |
| `share` | count | Exp 3d | 1h, 24h, 7d | Yes | Virality |
| `repost` | count | Exp 3d | 1h, 24h, 7d | Yes | Amplification |
| `quote` | count | Exp 3d | 1h, 24h, 7d | Yes | Engaged resharing |
| `comment` | count | Exp 3d | 1h, 24h, 7d, all | Yes | Discussion |
| `reply` | count | Exp 3d | 24h, 7d | No | Discussion depth |
| `upvote` | count | Exp 3d | 1h, 24h, 7d, all | Yes | Forum positive |
| `save` | count | Exp 7d | 24h, 7d, all | No | Return intent |
| `pin` | count | Exp 7d | 24h, 7d, all | No | Curation |
| `collection_add` | count | Exp 7d | 24h, 7d, all | No | Curation |
| `download` | count | Exp 7d | 24h, 7d, all | No | High-intent |
| `screenshot` | count | Exp 7d | 24h, 7d | No | Save intent |
| `outbound_click` | count | Exp 3d | 24h, 7d | No | Link engagement |
| `replay` | count | Exp 3d | 24h, 7d | No | Exceptional content |
| `award_given` | count | Permanent | all | No | Community endorsement |
### Negative Engagement Signals
| Signal | Type | Decay | Windows | Velocity | Primary Use |
|--------|------|-------|---------|----------|-------------|
| `skip` | count | Exp 1d | 1h, 24h | No | Quality negative |
| `skip_intro` | bool | Exp 1d | -- | No | Format preference |
| `hide` | bool | Permanent | -- | No | Hard item negative |
| `not_interested` | bool | Permanent | -- | No | Hard topic negative |
| `dislike` | count | Exp 7d | 1h, 24h, 7d, all | Yes | Explicit negative |
| `downvote` | count | Exp 3d | 1h, 24h, 7d, all | Yes | Forum negative |
| `report` | count | Permanent | all | No | Moderation flag |
### Quality Signals
| Signal | Type | Decay | Windows | Velocity | Primary Use |
|--------|------|-------|---------|----------|-------------|
| `completion` | ratio 0-1 | Exp 30d | all | No | Content quality |
| `partial_completion` | float | Exp 7d | -- | No | Continue watching |
| `dwell_time` | duration | Exp 3d | 24h, 7d | No | Engagement depth |
| `impression` | count | Exp 1d | 1h, 24h | No | Exposure tracking |
### Relationship Signals
| Signal | Type | Decay | Windows | Velocity | Primary Use |
|--------|------|-------|---------|----------|-------------|
| `follow` | bool | Permanent | -- | No | User-creator edge |
| `unfollow` | event | Decays follow | -- | No | Edge removal |
| `block` | bool | Permanent | -- | No | Hard filter |
| `mute` | bool | Permanent | -- | No | Soft filter |
| `interaction_weight` | float | Exp 7d | -- | No | Relationship strength |
### Recommendation Feedback Signals
| Signal | Type | Decay | Windows | Velocity | Primary Use |
|--------|------|-------|---------|----------|-------------|
| `autoplay_accept` | bool | Exp 3d | 24h | No | Rec quality |
| `autoplay_reject` | bool | Exp 1d | 24h | No | Rec failure |
| `notification_open` | bool | Exp 7d | 7d | No | Notification priority |
| `notification_dismiss` | bool | Exp 3d | 7d | No | Reduce push |
| `reminder_set` | bool | Exp 7d | -- | No | Intent for scheduled |
| `search_click` | count+rank | Exp 3d | 24h, 7d | No | Query relevance |
| `search_impression` | count | Exp 1d | 1h, 24h | No | Query exposure |
### Signal Type Configuration Summary
| Category | Count | Typical Decay Range | Typical Windows |
|----------|-------|--------------------|-----------------|
| Positive engagement | 17 | 3d - 7d half-life | 1h, 24h, 7d, all |
| Negative engagement | 7 | 1d - permanent | 1h, 24h or none |
| Quality | 4 | 1d - 30d half-life | 24h, 7d, all |
| Relationship | 5 | 7d - permanent | None (state, not stream) |
| Recommendation feedback | 7 | 1d - 7d half-life | 24h, 7d |
| **Total** | **40** | | |
---
## 12. Performance Targets
These are the latency and throughput targets the signal system must meet. Regressions against these numbers are treated as bugs.
### Write Path Targets
| Operation | Target | Measurement Point |
|-----------|--------|-------------------|
| Signal write (end-to-end, Batched durability) | < 100 us p50, < 500 us p99 | `db.signal()` return |
| WAL append (amortized fsync) | < 50 us p50 | WAL write + batch fsync |
| Hot-tier update (decay scores) | < 50 ns | 3 CAS operations |
| Warm-tier update (bucket increment) | < 20 ns | Atomic add |
| User preference vector shift | < 500 ns | 1536D vector arithmetic |
| Content-address dedup check | < 100 ns (bloom miss), < 50 us (bloom hit) | BLAKE3 hash + lookup |
| Sustained write throughput | > 50,000 events/sec | Single writer thread |
### Read Path Targets
| Operation | Target | Measurement Point |
|-----------|--------|-------------------|
| Decay score read (per entity per lambda) | ~15 ns | 1 load + 1 exp() + 1 mul |
| 200-candidate scoring pass (decay only) | < 5 us | 200 * 15ns + overhead |
| Windowed count (1h, per entity) | < 200 ns | Sum 60 minute buckets |
| Windowed count (7d, per entity) | < 500 ns | Sum 168 hour buckets |
| Velocity computation (per entity) | < 500 ns | Windowed count / duration |
| Cohort-scoped velocity (single dimension) | < 2 us | Disk-backed bucket sum |
| Cohort-scoped velocity (composite, 2-dim) | < 5 us | Estimation arithmetic |
| Signal snapshot (all windows, 1 entity) | < 5 us | All counters + decay reads |
### Background Materializer Targets
| Operation | Target | Measurement Point |
|-----------|--------|-------------------|
| Minute-bucket rotation (all active entities) | < 100 ms | Rotate + EWMA update |
| Hourly rollup generation | < 5 seconds | All active entities |
| Daily rollup generation | < 30 seconds | All entities with hourly data |
| Hot-tier checkpoint | < 2 seconds | Serialize + write to disk |
| Behavioral segment recomputation | < 60 seconds | All recently active users |
### Crash Recovery Targets
| Operation | Target | Notes |
|-----------|--------|-------|
| WAL replay (cold start) | < 60 seconds | For 7 days of events at scale |
| Hot-tier restore from checkpoint | < 10 seconds | For 10M entities |
| Time to first query after crash | < 15 seconds | Serve from checkpoint, replay in background |
---
## 13. Invariants and Correctness Guarantees
These invariants must hold at all times. They are encoded as property tests, assertions, and crash recovery tests.
### Signal Integrity Invariants
**INV-SIG-1: No signal loss.** Every signal event accepted by `db.signal()` (i.e., after `Ok(())` is returned) is reflected in all aggregates after WAL replay completes. Formally: if `signal(s)` returns `Ok(())` at time `t`, then for all `t' > t + max_replay_time`, all aggregate queries reflect `s`.
**INV-SIG-2: Decay score monotonic decrease.** In the absence of new signal events, a decay score monotonically decreases toward zero. Formally: if no events arrive for entity `e` signal `s` between times `t1` and `t2` where `t2 > t1`, then `score(e, s, t2) <= score(e, s, t1)`.
**INV-SIG-3: Decay score non-negative.** Decay scores are always non-negative. `score(e, s, t) >= 0.0` for all entities, signals, and times.
**INV-SIG-4: Windowed count consistency.** The windowed count for window `w` at time `t` equals the number of events in the half-open interval `(t-w, t]`. Formally: `window_count(e, s, w, t) == |{event in events(e, s) : event.time in (t-w, t]}|`. This is exact for counts maintained in the warm tier, and exact to within the rollup boundary granularity for counts composed from cold-tier rollups.
**INV-SIG-5: Running score exactness.** The running decay score matches the analytical sum to within floating-point epsilon. Formally: `|running_score(e, s, t) - SUM_i[w_i * exp(-lambda * (t - t_i))]| < epsilon` where `epsilon = n * 2^-52 * max_score` and `n` is the number of events.
**INV-SIG-6: Deduplication idempotency.** Writing the same signal event twice produces the same state as writing it once. Formally: `state(write(s) ; write(s)) == state(write(s))`.
### Crash Recovery Invariants
**INV-CR-1: WAL completeness.** After crash recovery, the WAL contains all events that were acknowledged to the caller (events for which `db.signal()` returned `Ok(())`). Events in the WAL but not yet processed are replayed.
**INV-CR-2: Checkpoint consistency.** The hot-tier checkpoint, when restored and replayed from the checkpoint's WAL position, produces state identical to the pre-crash state (modulo lazy-decay time differences, which are corrected at read time).
**INV-CR-3: No phantom state.** After crash recovery, no aggregate reflects an event that was not durably committed to the WAL. There are no phantom signal counts.
### Concurrency Invariants
**INV-CON-1: Lock-free reads.** Ranking queries never acquire a mutex. They read atomic values and apply lazy decay. A concurrent signal write may cause a ranking query to see either the pre-update or post-update state, but never a torn or invalid state.
**INV-CON-2: CAS correctness.** Under concurrent signal writes to the same entity, every event's weight is reflected in the running score. The CAS retry loop ensures that concurrent updates are serialized without loss. Formally: if `write(w1)` and `write(w2)` execute concurrently, the final score equals the score that would result from either sequential ordering `w1;w2` or `w2;w1`.
**INV-CON-3: Bucket atomicity.** Atomic increment of bucket counters ensures that concurrent writes to the same minute bucket are correctly accumulated. No count is lost.
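A minimal sketch of the bucket increment, assuming a fixed ring of per-minute counters. `MinuteBuckets` is a hypothetical type, and bucket rotation and reset are deliberately left out.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Fixed ring of per-minute counters shared between writer threads.
pub struct MinuteBuckets {
    counts: Vec<AtomicU32>,
}

impl MinuteBuckets {
    pub fn new(minutes: usize) -> Self {
        assert!(minutes > 0, "need at least one bucket");
        Self { counts: (0..minutes).map(|_| AtomicU32::new(0)).collect() }
    }

    /// Maps an event timestamp to its minute slot in the ring.
    fn slot(&self, event_secs: u64) -> &AtomicU32 {
        let minute = event_secs / 60;
        &self.counts[(minute % self.counts.len() as u64) as usize]
    }

    /// A single atomic read-modify-write: concurrent increments to the
    /// same bucket cannot overwrite each other.
    pub fn record(&self, event_secs: u64) {
        self.slot(event_secs).fetch_add(1, Ordering::Relaxed);
    }

    pub fn count_for(&self, event_secs: u64) -> u32 {
        self.slot(event_secs).load(Ordering::Relaxed)
    }
}
```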
### Property Tests
The following properties must be verified with `proptest`:
```rust
use proptest::prelude::*;

// P1: Decay scores decrease monotonically without new events.
proptest! {
    #[test]
    fn decay_monotonic_decrease(
        initial_score in 0.0f64..1e12,
        lambda in 1e-7f64..1e-3,
        dt_secs in 1.0f64..1e7,
    ) {
        let decayed = initial_score * (-lambda * dt_secs).exp();
        prop_assert!(decayed <= initial_score);
        prop_assert!(decayed >= 0.0);
    }
}
// P2: Running score matches analytical sum.
proptest! {
    #[test]
    fn running_score_matches_analytical(
        events in prop::collection::vec((0.1f64..10.0, 1u64..1_000_000), 1..100),
        lambda in 1e-7f64..1e-3,
    ) {
        // The recurrence assumes in-order arrival. Sorting guarantees that
        // `time - last_time` cannot underflow and that the last event is
        // the newest one.
        let mut events = events;
        events.sort_by_key(|&(_, time)| time);

        let mut running = 0.0f64;
        let mut last_time = 0u64;
        let query_time = events.last().expect("strategy yields at least one event").1 + 1000;

        // Compute running score
        for &(weight, time) in &events {
            let dt = (time - last_time) as f64;
            running = running * (-lambda * dt).exp() + weight;
            last_time = time;
        }
        let final_running = running * (-lambda * (query_time - last_time) as f64).exp();

        // Compute analytical sum
        let analytical: f64 = events.iter()
            .map(|&(w, t)| w * (-lambda * (query_time - t) as f64).exp())
            .sum();

        let relative_error = (final_running - analytical).abs() / analytical.max(1e-15);
        prop_assert!(relative_error < 1e-10,
            "running={}, analytical={}, error={}", final_running, analytical, relative_error);
    }
}
// P3: Windowed count equals event count in window.
proptest! {
    #[test]
    fn windowed_count_matches_events(
        event_times in prop::collection::vec(0u64..86400, 1..1000),
        window_secs in 60u64..86400,
        query_time in 0u64..172800,
    ) {
        // Count events in the closed interval
        // [query_time - window_secs, query_time], as stated in INV-SIG-4.
        let window_start = query_time.saturating_sub(window_secs);
        let expected = event_times.iter()
            .filter(|&&t| t >= window_start && t <= query_time)
            .count();

        // Populate the warm tier with the same events before querying it.
        // `WarmTier::from_event_times` is a placeholder for the
        // implementation-specific setup.
        let warm_tier = WarmTier::from_event_times(&event_times);
        let actual = warm_tier.windowed_count(window_secs, query_time);
        prop_assert_eq!(expected, actual);
    }
}
// P4: Out-of-order events produce same final score as in-order.
proptest! {
    #[test]
    fn out_of_order_events_commutative(
        events in prop::collection::vec((0.1f64..10.0, 1u64..1_000_000), 2..50),
        lambda in 1e-7f64..1e-3,
    ) {
        let query_time = events.iter().map(|e| e.1).max()
            .expect("strategy yields at least two events") + 1000;

        // Apply events in timestamp order
        let mut ordered = events.clone();
        ordered.sort_by_key(|e| e.1);
        let score_ordered = apply_events_and_query(&ordered, lambda, query_time);

        // Apply the same events in reverse timestamp order, so that every
        // event after the first arrives late
        let mut reversed = events;
        reversed.sort_by_key(|e| std::cmp::Reverse(e.1));
        let score_reversed = apply_events_and_query(&reversed, lambda, query_time);

        let relative_error = (score_ordered - score_reversed).abs()
            / score_ordered.max(1e-15);
        prop_assert!(relative_error < 1e-10);
    }
}
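// P5: Dedup produces idempotent state.
proptest! {
    #[test]
    fn dedup_idempotent(
        event in arb_signal_event(),
    ) {
        // `apply_signals` is a placeholder helper: it writes the given
        // events, in order, to a fresh database and returns the final state.
        let state_once = apply_signals(&[event.clone()]);
        let state_twice = apply_signals(&[event.clone(), event]); // same event written twice
        prop_assert_eq!(state_once, state_twice);
    }
}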
// P6: WAL replay produces same state as uninterrupted execution.
proptest! {
    #[test]
    fn wal_replay_consistency(
        events in prop::collection::vec(arb_signal_event(), 1..500),
        crash_point in 0usize..500,
    ) {
        // Keep the crash point inside the generated event sequence.
        let crash_point = crash_point.min(events.len());

        // Execute all events without crash
        let expected_state = execute_all(&events);

        // Execute up to crash_point, then "crash" and replay from WAL
        let (wal, partial_state) = execute_with_crash(&events, crash_point);
        let recovered_state = replay_from_wal(wal, partial_state);

        prop_assert_eq!(expected_state, recovered_state);
    }
}
```
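P4 relies on a helper, `apply_events_and_query`, that this section does not define. One possible shape is sketched below, assuming the write path handles a late event by decaying its weight to the ledger's current timestamp before adding it. This is an illustration of the property, not the specified write path.

```rust
/// Applies `(weight, time)` events in the given order and returns the
/// decayed score at `query_time`.
pub fn apply_events_and_query(
    events: &[(f64, u64)],
    lambda: f64,
    query_time: u64,
) -> f64 {
    let mut score = 0.0f64;
    let mut last_time = 0u64;
    for &(weight, time) in events {
        if time >= last_time {
            // In-order event: decay the running score forward, then add.
            let dt = (time - last_time) as f64;
            score = score * (-lambda * dt).exp() + weight;
            last_time = time;
        } else {
            // Late event: pre-decay its weight by its age relative to the
            // newest timestamp seen so far. `last_time` stays unchanged.
            let age = (last_time - time) as f64;
            score += weight * (-lambda * age).exp();
        }
    }
    // Lazy decay to the query time. A query time older than the newest
    // event applies no decay in this sketch.
    let dt = query_time.saturating_sub(last_time) as f64;
    score * (-lambda * dt).exp()
}
```

Both branches contribute `w * exp(-lambda * (query_time - t))` for each event once the final decay is applied, which is why arrival order changes the result only by floating-point rounding.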
---
## Appendix A: Glossary
| Term | Definition |
|------|------------|
| **Signal** | A typed, timestamped engagement event (view, like, skip, etc.) |
| **Signal Ledger** | The per-entity aggregation of all signals targeting that entity |
| **Decay Score** | The running exponential decay aggregate: recent events weighted more heavily |
| **Lambda** | The decay rate constant: `ln(2) / half_life` |
| **Velocity** | The rate of signal events per unit time within a window |
| **Relative Velocity** | Ratio of short-window to long-window velocity (acceleration) |
| **SWAG** | Sliding Window Aggregation -- O(1) amortized algorithm for windowed aggregate maintenance |
| **Scotty Slicing** | Stream-slicing approach where partial aggregates per time bucket are shared across windows |
| **Cohort** | A group of users sharing a common attribute (region, age, behavioral segment) |
| **Dimensional Rollup** | Per-dimension pre-aggregated counters for cohort-scoped queries |
| **Hot Tier** | In-memory, cache-line-aligned signal state for sub-microsecond reads |
| **Warm Tier** | In-memory bucketed counters for active entities, supporting windowed aggregation |
| **Cold Tier** | On-disk raw events and rollups for durability and historical queries |
| **Running Score** | The incrementally maintained decay score: `S(t) = S(prev) * exp(-lambda * dt) + w` |
| **Forward Decay** | The mathematical model (Cormode et al.) under which the running score formula is exact |
| **Jacobs Trick** | Log-space reformulation that eliminates read-time computation for ranking-only queries |
| **Group Commit** | Batching fsync calls to amortize durability cost across multiple writes |
| **Content-Addressed** | Identifying events by BLAKE3 hash of content for automatic deduplication |
| **EWMA** | Exponentially Weighted Moving Average for smoothing noisy velocity signals |
## Appendix B: References
1. Cormode, G., Shkapenyuk, V., Srivastava, D., Xu, B. "Forward Decay: A Practical Time Decay Model for Streaming Systems." ICDE 2009.
2. Tangwongsan, K., Hirzel, M., Schneider, S., Wu, K.-L. "General Incremental Sliding-Window Aggregation." PVLDB 2015.
3. Traub, J., Grulich, P., Cuevas, A., et al. "Efficient Window Aggregation with General Stream Slicing." EDBT 2019 (Best Paper).
4. Jacobs, J. "Exponentially Decaying Sums With a Twist." 2023.
5. Miller, E. "How Not To Sort By Average Rating." 2009.
6. TimescaleDB Documentation. "Continuous Aggregates." 2024.
7. Flajolet, P., Fusy, E., Gandouet, O., Meunier, F. "HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm." DMTCS 2007.