---
title: "Running decay scores are O(1) -- here is the math"
date: "2026-02-21"
author: "tidalDB"
description: "The forward-decay formula eliminates raw-event scanning at query time. One exp() call per decay rate on write, one on read. 15 nanoseconds per entity. Here is how it works."
tags: ["signals", "architecture", "performance"]
---

Every content platform computes some variant of this formula:

```python
trending_score = sum(views) / (age_hours + 2) ** 1.8
```

It runs in a cron job, or a Kafka consumer, or a Redis Lua script. It scans raw events, computes a score, writes the result to a field, and hopes the field is still accurate by the time the ranking query reads it. It is always stale. And it is always O(N) in the number of events per entity.

tidalDB does this differently. The signal ledger maintains running exponentially-decayed scores that update in O(1) per write and O(1) per read. No event scanning. No batch recomputation. No stale cache. The math is exact -- not an approximation -- and the implementation fits in 64 bytes per entity-signal pair.

This post explains the formula, the implementation, and what it costs.

## The formula

Exponential decay maps naturally to engagement signals. A view from 10 minutes ago matters more than a view from 10 days ago. The standard formulation:

```
score(t) = sum over all events i of: weight_i * exp(-lambda * (t - t_i))
```

where `lambda = ln(2) / half_life`. After one half-life, a signal's contribution drops to 50%. After two, 25%. The curve is smooth, continuous, and parameterized by a single value you declare in schema.

Computing this sum naively requires iterating every event for the entity. For an item with 500 views, that is 500 `exp()` calls at query time. For 200 candidate entities in a ranking pass, that is 100,000 `exp()` calls. At roughly 15 nanoseconds per `exp()`, the scan alone costs 1.5 milliseconds -- before you have done any scoring, filtering, or diversity enforcement.

The insight is that you do not need the sum. You need a running accumulator.

## The running accumulator

When a new event arrives at time `t` with weight `w`, the relationship between the old score and the new score is:

```
S(t) = S(t_prev) * exp(-lambda * dt) + w
```

where `dt = t - t_prev`. That is one `exp()` call and one multiply-add. The proof is direct: if `S(t_prev)` already equals the sum of all prior events decayed to `t_prev`, then multiplying by `exp(-lambda * dt)` shifts every prior event's decay to be relative to `t`, and adding `w` incorporates the new event with zero age. The result is exactly the analytical sum.

This is not an approximation. The running score and the brute-force sum produce identical results to floating-point precision. We verify this with property tests that generate random event sequences and compare the running score against the analytical computation:

```rust
// From tidal/src/signals/hot.rs -- property test P2
proptest! {
    #[test]
    fn running_score_matches_analytical(
        events in proptest::collection::vec(
            (0.1f64..10.0, 1_000_000u64..1_000_000_000),
            1..100,
        ),
        lambda in 1e-7f64..1e-3,
    ) {
        let mut sorted_events = events;
        sorted_events.sort_by_key(|e| e.1);

        let query_time_ns = sorted_events.last().unwrap().1 + 1_000_000_000;

        let state = HotSignalState::new(42, 0);
        for &(weight, time_ns) in &sorted_events {
            state.on_signal(weight, time_ns, &[lambda]);
        }
        let running = state.current_score(0, query_time_ns, lambda);

        let analytical: f64 = sorted_events.iter()
            .map(|&(w, t)| w * (-lambda * (query_time_ns - t) as f64 / 1e9).exp())
            .sum();

        let relative_error = (running - analytical).abs() / analytical;
        prop_assert!(relative_error < 1e-6);
    }
}
```

This test generates between 1 and 99 events with random weights and timestamps, processes them through the running accumulator, and asserts the result matches the analytical sum to within one part per million. It runs thousands of iterations across the full parameter space.

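
The recurrence is easy to try outside tidalDB. The following is a minimal single-threaded sketch, not the library's implementation: `RunningScore`, `lambda_from_half_life`, and `brute_force` are hypothetical names, timestamps are plain `f64` seconds rather than nanoseconds, and events are assumed to arrive in order.

```rust
/// Decay rate in 1/seconds for a given half-life: lambda = ln(2) / half_life.
fn lambda_from_half_life(half_life_secs: f64) -> f64 {
    std::f64::consts::LN_2 / half_life_secs
}

/// Hypothetical single-threaded accumulator (not tidalDB's HotSignalState).
struct RunningScore {
    score: f64,
    last_secs: f64,
}

impl RunningScore {
    fn new() -> Self {
        Self { score: 0.0, last_secs: 0.0 }
    }

    /// In-order write: decay the stored score forward to `t_secs`, then add `w`.
    fn on_event(&mut self, w: f64, t_secs: f64, lambda: f64) {
        let dt = t_secs - self.last_secs;
        self.score = self.score * (-lambda * dt).exp() + w;
        self.last_secs = t_secs;
    }

    /// Read: one exp() brings the stored score forward to `now_secs`.
    fn current(&self, now_secs: f64, lambda: f64) -> f64 {
        self.score * (-lambda * (now_secs - self.last_secs)).exp()
    }
}

/// Brute-force reference: one exp() per event.
fn brute_force(events: &[(f64, f64)], now_secs: f64, lambda: f64) -> f64 {
    events
        .iter()
        .map(|&(w, t)| w * (-lambda * (now_secs - t)).exp())
        .sum()
}

fn main() {
    let lambda = lambda_from_half_life(3600.0); // 1-hour half-life
    // (weight, time in seconds), already sorted by time
    let events = [(1.0, 10.0), (2.0, 600.0), (0.5, 4000.0)];

    let mut acc = RunningScore::new();
    for &(w, t) in &events {
        acc.on_event(w, t, lambda);
    }

    let now = 7200.0;
    let running = acc.current(now, lambda);
    let reference = brute_force(&events, now, lambda);
    assert!((running - reference).abs() / reference < 1e-12);
    println!("running = {running:.6}, brute force = {reference:.6}");
}
```

The accumulator touches two fields per write no matter how many events came before, while `brute_force` grows linearly with the event list.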

## The read path

The stored score reflects the state at `last_update_ns` -- the timestamp of the most recent event. At query time, we apply one final decay to bring the score forward to the current moment:

```rust
// From tidal/src/signals/hot.rs
pub fn current_score(&self, decay_rate_idx: usize, query_time_ns: u64, lambda: f64) -> f64 {
    let last_ns = self.last_update_ns.load(Ordering::Acquire);
    // Select the stored score for the requested decay rate.
    let stored = f64::from_bits(self.decay_scores[decay_rate_idx].load(Ordering::Acquire));
    // u64 subtraction: callers must pass query_time_ns >= last_update_ns.
    let dt_secs = (query_time_ns - last_ns) as f64 / 1e9;
    stored * (-lambda * dt_secs).exp()
}
```

One `exp()`, one multiply. That is the entire read path. The score accounts for every event ever written, decayed to the exact query instant, without touching a single raw event record.

## Out-of-order events

Distributed systems deliver events out of order. A view that happened at `t=5s` may arrive after a view at `t=10s` has already been processed. The naive approach -- recompute from raw events -- handles this trivially but at O(N) cost. The running accumulator handles it at O(1) cost with a different formula.

When an event arrives with `t_event < last_update_ns`, we pre-decay its weight by the event's age relative to the current state:

```rust
// From tidal/src/signals/hot.rs -- out-of-order path
let age_secs = (last_ns - event_time_ns) as f64 / 1e9;
let effective_weight = weight * (-lambda * age_secs).exp();
// CAS loop to add effective_weight to the running score
```

The timestamp does not regress. The weight is reduced as if the event had arrived on time and then decayed. The result is analytically identical to processing events in order.

We check this with a second property test that processes the same events in forward and reverse order and asserts both produce the same score:

```rust
// From tidal/src/signals/hot.rs -- property test P4
proptest! {
    #[test]
    fn out_of_order_events_commutative(
        events in proptest::collection::vec(
            (0.1f64..10.0, 1_000_000u64..1_000_000_000),
            2..50,
        ),
        lambda in 1e-7f64..1e-3,
    ) {
        // Process in-order
        let mut sorted = events.clone();
        sorted.sort_by_key(|e| e.1);
        // Query time: assumed here to be one second after the latest event, as in P2
        let query_time_ns = sorted.last().unwrap().1 + 1_000_000_000;
        let state_ordered = HotSignalState::new(42, 0);
        for &(w, t) in &sorted {
            state_ordered.on_signal(w, t, &[lambda]);
        }

        // Process in reverse order
        sorted.reverse();
        let state_reversed = HotSignalState::new(42, 0);
        for &(w, t) in &sorted {
            state_reversed.on_signal(w, t, &[lambda]);
        }

        // Both match the analytical sum
        let analytical: f64 = events.iter()
            .map(|&(w, t)| w * (-lambda * (query_time_ns - t) as f64 / 1e9).exp())
            .sum();

        // Assertions: both within 1e-6 relative error of analytical
        for state in [&state_ordered, &state_reversed] {
            let score = state.current_score(0, query_time_ns, lambda);
            prop_assert!((score - analytical).abs() / analytical < 1e-6);
        }
    }
}
```

Event order does not matter. The final score is the same.

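
The same property can be seen in a few lines of plain Rust. This is a sketch under simplifying assumptions rather than the code from `hot.rs`: `LateTolerantScore` and `score_after` are hypothetical names, the state is single-threaded, and timestamps are `f64` seconds starting at zero.

```rust
/// Hypothetical single-threaded accumulator that tolerates late events.
/// (tidalDB's HotSignalState uses atomics and nanosecond timestamps instead.)
struct LateTolerantScore {
    value: f64,
    last_secs: f64,
}

impl LateTolerantScore {
    fn new() -> Self {
        Self { value: 0.0, last_secs: 0.0 }
    }

    fn on_event(&mut self, weight: f64, t_secs: f64, lambda: f64) {
        if t_secs >= self.last_secs {
            // In-order: decay the stored score forward, then add the new weight.
            let dt = t_secs - self.last_secs;
            self.value = self.value * (-lambda * dt).exp() + weight;
            self.last_secs = t_secs;
        } else {
            // Late: pre-decay the weight by its age; the timestamp stays put.
            let age = self.last_secs - t_secs;
            self.value += weight * (-lambda * age).exp();
        }
    }

    fn current(&self, now_secs: f64, lambda: f64) -> f64 {
        self.value * (-lambda * (now_secs - self.last_secs)).exp()
    }
}

/// Feed `events` in the given order and read the score at `now_secs`.
fn score_after(events: &[(f64, f64)], now_secs: f64, lambda: f64) -> f64 {
    let mut acc = LateTolerantScore::new();
    for &(w, t) in events {
        acc.on_event(w, t, lambda);
    }
    acc.current(now_secs, lambda)
}

fn main() {
    let lambda = std::f64::consts::LN_2 / 3600.0; // 1-hour half-life
    let forward = [(1.0, 10.0), (2.0, 600.0), (0.5, 4000.0)];
    let mut reversed = forward;
    reversed.reverse();

    let a = score_after(&forward, 7200.0, lambda);
    let b = score_after(&reversed, 7200.0, lambda);
    assert!((a - b).abs() / a < 1e-12);
    println!("forward = {a:.9}, reversed = {b:.9}");
}
```

Both branches cost one `exp()`; the only difference is whether the decay factor is applied to the stored score or to the incoming weight.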

## 64 bytes per entity

The hot-path struct fits exactly one CPU cache line:

```rust
// From tidal/src/signals/hot.rs
#[repr(C, align(64))]
pub struct HotSignalState {
    entity_id: u64,
    last_update_ns: AtomicU64,
    signal_type_id: u16,
    flags: u16,
    _pad0: [u8; 4],
    decay_scores: [AtomicU64; 3], // f64 bits stored as u64 for atomic CAS
    _pad1: [u8; 16],
}

const _SIZE: () = assert!(std::mem::size_of::<HotSignalState>() == 64);
const _ALIGN: () = assert!(std::mem::align_of::<HotSignalState>() == 64);
```

Cache-line alignment eliminates false sharing. When two threads score different entities concurrently, their `HotSignalState` structs live on different cache lines. No invalidation traffic. No contention.

The `decay_scores` array holds three simultaneous decay rates per signal type -- a 1-hour half-life, a 24-hour half-life, and a 7-day half-life can all be maintained in a single struct. Each score is stored as the bit pattern of an `f64` inside an `AtomicU64`, updated via compare-and-swap. Readers are never blocked by writers.

The memory ordering is deliberate. `last_update_ns` uses `Acquire`/`Release` to establish happens-before between writers and readers. The decay scores use `AcqRel` on CAS success to make new values visible to concurrent readers. CAS failure uses `Acquire` to load the freshest competing write for the next retry. These are the weakest orderings that maintain correctness -- no `SeqCst` anywhere on the hot path.

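
A CAS loop over `f64` bits with those orderings might look like the sketch below. `atomic_decay_add` is a hypothetical helper, not a function from the tidalDB source, and it takes the decay factor as an argument; the real write path also has to coordinate with `last_update_ns`, which this sketch leaves out.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// Atomically replace the f64 stored in `cell` with `old * decay + weight`.
/// Returns the value that was written. (Hypothetical helper for illustration.)
fn atomic_decay_add(cell: &AtomicU64, decay: f64, weight: f64) -> f64 {
    let mut current = cell.load(Ordering::Acquire);
    loop {
        let new = f64::from_bits(current) * decay + weight;
        match cell.compare_exchange_weak(
            current,
            new.to_bits(),
            Ordering::AcqRel,  // success: publish the new score
            Ordering::Acquire, // failure: observe the competing write
        ) {
            Ok(_) => return new,
            Err(observed) => current = observed, // retry against the fresher value
        }
    }
}

fn main() {
    let cell = Arc::new(AtomicU64::new(0.0_f64.to_bits()));

    // Four writers each add 1.0 a thousand times (decay factor 1.0 keeps the sum exact).
    let handles: Vec<_> = (0..4)
        .map(|_| {
            let cell = Arc::clone(&cell);
            thread::spawn(move || {
                for _ in 0..1000 {
                    atomic_decay_add(&cell, 1.0, 1.0);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }

    let total = f64::from_bits(cell.load(Ordering::Acquire));
    assert_eq!(total, 4000.0); // no update was lost
    println!("total = {total}");
}
```

A failed exchange simply recomputes from the value another writer published, so no update is dropped and no reader ever waits on a lock.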

## The warm tier

Decay scores give you a weighted, time-discounted aggregate. But ranking also needs windowed counts: "how many views in the last hour?" "what is the velocity over 24 hours?" These are different questions with different data structures.

The warm tier maintains bucketed counters -- circular buffers of per-minute and per-hour event counts:

```rust
// From tidal/src/signals/warm.rs
pub struct BucketedCounter {
    minute_buckets: [AtomicU32; 60],  // last 60 minutes
    hour_buckets: [AtomicU32; 168],   // last 168 hours (7 days)
    current_minute: AtomicU8,
    current_hour: AtomicU8,
    all_time_count: AtomicU64,
    last_minute_rotation_ns: AtomicU64,
    last_hour_rotation_ns: AtomicU64,
}
```

A 1-hour windowed count sums 60 minute buckets. A 7-day count sums 168 hour buckets. An all-time count reads a single atomic. No scanning. No aggregation pipeline. Rotation is trigger-based -- checked inline on each write, no background thread required.

Velocity falls out naturally: `windowed_count / window_duration_seconds`. The database computes this at read time from the bucketed counters.

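
To make the rotation concrete, here is a simplified sketch of the minute ring alone. It assumes a single thread and plain integers where the real struct uses atomics; `MinuteRing` and its methods are hypothetical names, and a production read path would also rotate before summing.

```rust
/// Hypothetical single-threaded ring of per-minute counts.
struct MinuteRing {
    buckets: [u32; 60],
    current: usize,      // index of the bucket for the current minute
    current_minute: u64, // absolute minute number of that bucket
}

impl MinuteRing {
    fn new(start_minute: u64) -> Self {
        Self { buckets: [0; 60], current: 0, current_minute: start_minute }
    }

    /// Inline rotation: advance one bucket per elapsed minute, zeroing each.
    /// Capped at 60 steps, which already clears the whole ring.
    fn rotate_to(&mut self, minute: u64) {
        let elapsed = minute.saturating_sub(self.current_minute).min(60);
        for _ in 0..elapsed {
            self.current = (self.current + 1) % 60;
            self.buckets[self.current] = 0;
        }
        if minute > self.current_minute {
            self.current_minute = minute;
        }
    }

    /// Write path: rotate if needed, then bump the current bucket.
    fn record(&mut self, now_secs: u64) {
        self.rotate_to(now_secs / 60);
        self.buckets[self.current] += 1;
    }

    /// Count over the last 60 minutes, as of the most recent rotation.
    fn last_hour_count(&self) -> u64 {
        self.buckets.iter().map(|&c| u64::from(c)).sum()
    }

    /// Velocity: windowed count divided by window duration in seconds.
    fn velocity_per_sec(&self) -> f64 {
        self.last_hour_count() as f64 / 3600.0
    }
}

fn main() {
    let mut ring = MinuteRing::new(0);
    for t in [0, 30, 90, 3000] {
        ring.record(t);
    }
    assert_eq!(ring.last_hour_count(), 4);

    // At t = 3700 s (minute 61) the events from minutes 0 and 1 have rotated out.
    ring.record(3700);
    assert_eq!(ring.last_hour_count(), 2);
    println!("velocity = {:.6} events/s", ring.velocity_per_sec());
}
```

Reading the window is a fixed 60-element sum, independent of how many events were recorded.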

## The cost

The benchmark setup: 200 entities, each with 50 pre-written signals spread over one hour. The scoring pass reads the decay score for every entity using direct `DashMap` access -- isolating the hot-path read from schema lookup overhead.

```rust
// From tidal/benches/signals.rs
// (definitions of `base_ns`, `now_ns`, and `LAMBDA_7D` are elided in this excerpt)
fn bench_200_entity_scoring_pass(c: &mut Criterion) {
    let (ledger, type_id) = view_ledger();

    // Pre-warm: 200 entities x 50 signals each, one every 72 s across one hour
    let entity_ids: Vec<EntityId> = (0u64..200).map(EntityId::new).collect();
    for &entity_id in &entity_ids {
        for j in 0u64..50 {
            let ts = Timestamp::from_nanos(
                base_ns.saturating_sub(3_600_000_000_000) + j * 72_000_000_000,
            );
            ledger.record_signal("view", entity_id, 1.0, ts).unwrap();
        }
    }

    c.bench_function("signal_200_entity_scoring_pass", |b| {
        b.iter(|| {
            let mut sum = 0.0_f64;
            for &entity_id in black_box(&entity_ids) {
                if let Some(entry) = ledger.entries().get(&(entity_id, type_id)) {
                    sum += entry.hot.current_score(0, now_ns, LAMBDA_7D);
                }
            }
            black_box(sum)
        });
    });
}
```

The target was under 5 microseconds for the full 200-entity pass. That is 25 nanoseconds per entity -- one `DashMap` lookup, one `exp()`, one multiply.

Compare this to scanning raw events. At 50 events per entity and 15 nanoseconds per `exp()`, the raw scan costs 750 nanoseconds per entity. For 200 entities, 150 microseconds. At 500 events per entity -- a moderately popular item -- the scan costs 1.5 milliseconds. The O(1) approach does not change with event count. The 200-entity pass costs the same whether each entity has 50 events or 50,000.

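
The comparison is plain arithmetic, spelled out here as a small sketch. The 15 ns per `exp()` and the 25 ns per-entity target are the figures quoted above; the function names are hypothetical and the model ignores everything except the `exp()` calls and the per-entity read.

```rust
/// Estimated cost, in nanoseconds, of rescanning raw events for a ranking pass.
fn scan_cost_ns(entities: u64, events_per_entity: u64, exp_ns: u64) -> u64 {
    entities * events_per_entity * exp_ns
}

/// Estimated cost, in nanoseconds, of an O(1) read per entity.
fn running_cost_ns(entities: u64, per_entity_ns: u64) -> u64 {
    entities * per_entity_ns
}

fn main() {
    let exp_ns = 15; // per-exp() figure used in the post
    assert_eq!(scan_cost_ns(200, 50, exp_ns), 150_000); // 150 microseconds
    assert_eq!(scan_cost_ns(200, 500, exp_ns), 1_500_000); // 1.5 milliseconds
    assert_eq!(running_cost_ns(200, 25), 5_000); // 5 microsecond target

    for events in [50u64, 500, 50_000] {
        println!(
            "{events:>6} events/entity: scan {:>11} ns, running {} ns",
            scan_cost_ns(200, events, exp_ns),
            running_cost_ns(200, 25),
        );
    }
}
```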

## Checkpoint and crash recovery

Running scores live in memory. Memory is volatile. The signal ledger checkpoints its entire state to durable storage as a single atomic write batch -- every `HotSignalState` and every `BucketedCounter` serialized to a 983-byte fixed-length record per entity-signal pair.

On restart, the ledger restores from the checkpoint and replays WAL events that arrived after the checkpoint was taken. The WAL is the source of truth. The checkpoint is an optimization that bounds replay time. Periodic checkpoints run every 30 seconds in a background thread.

The full UAT scenario validates this end-to-end: open a database, define three signal types (view, like, skip), write 100 items, write 10,000 signal events spread over 7 days, verify decay scores match analytical computation, close the database, reopen it, and verify the recovered scores match. Including crash recovery, the deviation is under 0.1%.

```rust
// From tidal/tests/signal_api.rs -- crash recovery test
// Declared outside the first block so the reopen check below can read it.
let score_before;
{
    let db = TidalDb::builder()
        .with_data_dir(tmp.path())
        .with_schema(schema)
        .open()?;

    // Write 100 signals over 7 days
    for i in 0..100u64 {
        let ts = Timestamp::from_nanos(/* spread over 7 days */);
        db.signal("view", entity, 1.0, ts)?;
    }

    score_before = db.read_decay_score(entity, "view", 0)?.unwrap();
    db.close()?;
}

// Reopen: WAL replay restores state
{
    let db = TidalDb::builder()
        .with_data_dir(tmp.path())
        .with_schema(schema)
        .open()?;

    let score_after = db.read_decay_score(entity, "view", 0)?.unwrap();

    let rel_err = (score_after - score_before).abs() / score_before;
    assert!(rel_err < 0.001); // Under 0.1% deviation
}
```
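
The checkpoint-plus-replay idea can be illustrated without any storage engine. The sketch below is an assumption-heavy toy, not tidalDB's recovery code: `State`, `WalEntry`, `Checkpoint`, and `recover` are hypothetical names, the WAL is an in-memory slice, and replay is keyed on a sequence number, which the post does not specify.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
struct State {
    score: f64,
    last_secs: f64,
}

#[derive(Clone, Copy)]
struct WalEntry {
    seq: u64, // assumed monotonically increasing log position
    weight: f64,
    t_secs: f64,
}

/// Snapshot of the state plus the last WAL entry it already includes.
struct Checkpoint {
    state: State,
    last_seq: u64,
}

/// Apply one in-order event with the running-accumulator update.
fn apply(state: &mut State, e: &WalEntry, lambda: f64) {
    let dt = e.t_secs - state.last_secs;
    state.score = state.score * (-lambda * dt).exp() + e.weight;
    state.last_secs = e.t_secs;
}

/// Restore from the checkpoint, then replay only the entries written after it.
fn recover(cp: &Checkpoint, wal: &[WalEntry], lambda: f64) -> State {
    let mut state = cp.state;
    for e in wal.iter().filter(|e| e.seq > cp.last_seq) {
        apply(&mut state, e, lambda);
    }
    state
}

fn main() {
    let lambda = std::f64::consts::LN_2 / 3600.0;
    let wal: Vec<WalEntry> = (1..=10)
        .map(|i| WalEntry { seq: i, weight: 1.0, t_secs: i as f64 * 60.0 })
        .collect();

    // Live process: apply six entries, checkpoint, apply the rest, then "crash".
    let (before, after) = wal.split_at(6);
    let mut live = State { score: 0.0, last_secs: 0.0 };
    for e in before {
        apply(&mut live, e, lambda);
    }
    let cp = Checkpoint { state: live, last_seq: before.last().unwrap().seq };
    for e in after {
        apply(&mut live, e, lambda);
    }

    // Recovery replays four entries instead of ten and lands on the same state.
    let recovered = recover(&cp, &wal, lambda);
    assert_eq!(recovered, live);
    println!("recovered score = {:.6}", recovered.score);
}
```

The checkpoint only shortens the replay; dropping it and replaying the whole log from an empty state would reach the same result.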

## What this replaces

The standard approach to computing trending scores in a content platform:

1. Engagement events flow into Kafka.
2. A consumer aggregates events into Redis counters with TTLs.
3. A cron job reads Redis counters and computes `trending_score = f(views, age)`.
4. The cron job writes the score to a field in Elasticsearch.
5. The ranking query reads the field.

Steps 2 through 4 introduce lag. The score in Elasticsearch reflects the state of the Redis counters at the time the cron job last ran. The Redis counters reflect the state of the Kafka topic at the time the consumer last processed events. The Kafka topic reflects the state of the application at the time the events were published. At every seam, time passes. At every seam, correctness degrades.

In tidalDB, step 1 is `db.signal("view", entity_id, 1.0, timestamp)`. There are no other steps. The decay score is updated in the same call, in the same process, in the same memory space. The next ranking query -- even 100 milliseconds later -- reads the updated score. No lag. No cache. No batch pipeline.

One `exp()` call per decay rate on write -- up to three if you register three rates, typically one. One on read. 64 bytes per entity. The score is always current because the score is always computed, not cached.

---

*tidalDB is an open-source, embeddable Rust database for personalized content ranking. The signal ledger code referenced in this post is at [tidal/src/signals/](https://github.com/orchard9/tidalDB/tree/main/tidal/src/signals). Follow the build on [GitHub](https://github.com/orchard9/tidalDB).*