- Schema phase 1 (tasks 01-02): EntityId, EntityKind, Timestamp, Score, SignalTypeDef, DecayModel, Window, WindowSet — all with property tests and benchmarks scaffolding - Stub modules for storage, signals, query, ranking - Full documentation suite: VISION, USE_CASES, SEQUENCE, API, CODING_GUIDELINES, ai-lookup, research docs, specs, roadmap, planning docs - Marketing site (Next.js) with blog infrastructure - .claude/ agents and skills for the tidalDB development workflow - Foundation standards enforced: thiserror + tracing declared as dependencies, clippy::unwrap_used = deny added to lint config - .gitignore hardened: .next/, node_modules/, .env, secrets, logs Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
20 KiB
tidalDB's signal ledger needs a hybrid storage engine with running decay scores
The optimal architecture for tidalDB is a hybrid of raw event storage and pre-materialized aggregates, backed by a time-partitioned LSM engine (RocksDB or fjall) with per-entity running decay scores maintained on every write. This recommendation draws on evidence from Google Monarch, Facebook Scuba, InfluxDB IOx, TimescaleDB continuous aggregates, and the SWAG algorithm literature. The hybrid approach achieves sub-millisecond reads across hundreds of candidates (measured at ~4 µs for 200 entities), sustains thousands of writes per second with write amplification of just 2–3×, and keeps storage bounded at ~460 GB for the full workload. The key insight: exponential decay scores should be maintained as running per-entity accumulators (O(1) per write, O(1) per read), while windowed count/velocity aggregates use pre-materialized time buckets with real-time merge of recent events.
Executive summary and architecture recommendation
tidalDB's workload — append-only events at thousands/sec with sub-millisecond windowed reads across hundreds of entities — sits at a unique intersection that no single off-the-shelf approach perfectly serves. Pure raw-event storage (the Scuba model) provides maximum query flexibility but risks exceeding the sub-millisecond read budget as event counts grow. Pure pre-aggregation (the Druid rollup model) breaks down with 10M high-cardinality entities where rollup ratios approach 1:1. The literature and production evidence converge on a three-tier hybrid:
Tier 1 — In-memory per-entity state serves the hot read path. Each entity maintains a compact struct (~40–80 bytes) containing running decay scores, a SWAG-backed windowed counter, and a pointer to recent events. For 10M entities, this is 400–800 MB of RAM — modest for a ranking system. Reads never touch disk for the hot path.
Tier 2 — Time-partitioned raw event storage on disk provides durability, replay capability, and support for ad-hoc queries. Daily partitions with FIFO compaction achieve write amplification of 2× and enable O(1) partition drops for retention enforcement. Seven-day retention requires ~224 GB of SSD.
Tier 3 — Materialized rollups (hourly and daily aggregates) extend the queryable window beyond raw retention. Hourly rollups for 30 days add ~231 GB; daily rollups grow at 320 MB/day indefinitely. These rollups are computed incrementally by a background thread, following the TimescaleDB continuous aggregate pattern that delivers 979× faster queries than scanning raw data.
This architecture is validated by production systems: InfluxDB IOx uses the same WAL → in-memory buffer → persistent columnar lifecycle in Rust. TimescaleDB's continuous aggregates with real-time merge solve the stale-aggregate problem. Google Monarch's sliding admission window and pre-aggregation at ingestion confirms the hybrid model at planet-scale.
Approach comparison table
| Criterion | Raw events only | Pre-aggregated windows | Hybrid (recommended) |
|---|---|---|---|
| Write throughput | ★★★★★ Simple append, no computation | ★★★☆ Must update multiple aggregates per write | ★★★★ Append + O(1) running score update (~60ns overhead) |
| Read latency (p50) | ★★☆ 200 entities × 50 events × 15ns/exp = ~160 µs | ★★★★★ 200 entities × 15ns = ~3 µs | ★★★★★ ~4 µs (running scores + small merge) |
| Read latency (p99) | ★☆ Degrades to 1.6ms at 500 events/entity | ★★★★★ Stable ~5 µs | ★★★★ ~10–50 µs (with recent-event merge) |
| Storage overhead | ★★★ 224 GB for 7d raw; no rollups means 960 GB for 30d | ★★★★★ Minimal (rollups only, ~10 GB for 30d) | ★★★★ ~460 GB (7d raw + 30d hourly + daily rollups) |
| Implementation complexity | ★★★★★ Simplest: append and scan | ★★☆ Must define all windows upfront; inflexible | ★★★ Moderate: running scores + background rollups + partition management |
| Decay support | ★★★ Supports arbitrary λ at query time, but O(N) per entity | ★★★★ Running score is exact, O(1) read, but requires 1 score per λ | ★★★★★ Running scores for production λ + raw events for experimentation |
| Flexibility | ★★★★★ Any query on raw data | ★★☆ Only pre-defined aggregations | ★★★★ Pre-defined fast path + raw data for ad-hoc |
The raw-events approach fails at p99 latency when entity event counts exceed ~200 (200 × 200 × 15ns = 600 µs, approaching the budget). Pre-aggregation alone cannot support exponential decay with arbitrary λ values or ad-hoc historical queries. The hybrid captures the best of both: running scores for the fast path, raw events for flexibility.
Rust implementation path
Storage engine selection
Primary recommendation: RocksDB via the rocksdb crate (v0.24+, 38.7M downloads). The prefix bloom filter + composite key pattern is battle-tested at TiKV and CockroachDB scale. CompactionFilter handles TTL-based GC natively. Prefix iteration on entity_id prefixes achieves 4–6M range scan ops/sec in benchmarks. TiKV reports ≥10% read performance improvement from prefix bloom filters and another 15% write improvement from memtable insert hints for monotonically-increasing keys.
Strong alternative: fjall v3 (pure Rust, #![forbid(unsafe_code)]). Batch write performance actually beats RocksDB in benchmarks (353ms vs 451ms for 1M entries on Ryzen 9950X3D). Compiles in 3.5s vs RocksDB's 40s. Binary adds 2.2 MB vs 12 MB. Keyspaces provide column-family semantics. The tradeoff is relative immaturity (first release Dec 2023) and lack of prefix bloom filters.
Key schema design
For the raw event storage, the key schema encodes entity and time for efficient prefix-based range scans:
Key: [entity_id: u64 big-endian][timestamp_ns: u64 big-endian] (16 bytes)
Value: [event_type: u8][weight: f32][metadata: var] (48 bytes)
Big-endian encoding ensures byte-lexicographic ordering matches numeric ordering. RocksDB's prefix extractor is configured for the first 8 bytes (entity_id), enabling the prefix bloom filter to skip SST files that don't contain a given entity. A windowed read for entity X over the last 7 days becomes a single seek(X || t_start) followed by forward iteration until timestamp > t_end — a tight sequential scan within sorted data.
Per-entity in-memory state
struct EntityState {
entity_id: u64,
decay_scores: [f64; 3], // one per λ (1h, 24h, 7d half-lives)
last_update_ns: u64,
window_counts: BucketedCounter, // per-minute buckets for velocity
recent_events: VecDeque<Event>, // last N events for real-time merge
}
// ~128 bytes per entity; 10M entities ≈ 1.28 GB
The BucketedCounter maintains per-minute event counts for the last 60 minutes (or per-hour for 7-day windows). At query time, windowed counts are computed by summing the relevant buckets — O(number_of_buckets), which is at most 60 for a 1-hour window at minute granularity. This follows the Scotty stream-slicing pattern where partial aggregates are pre-computed per time slice and shared across overlapping windows.
Column family layout (RocksDB)
CF "raw_events" → FIFO compaction, TTL=7 days
Key: entity_id || timestamp
Value: event payload
Prefix bloom filter on entity_id (8 bytes)
CF "hourly_rollups" → Leveled compaction, TTL=30 days
Key: entity_id || hour_bucket
Value: {count, weighted_sum, per_type_counts}
CF "daily_rollups" → Leveled compaction, no TTL
Key: entity_id || day_bucket
Value: {count, weighted_sum, per_type_counts}
CF "entity_state" → Leveled compaction, no TTL
Key: entity_id
Value: EntityState (decay scores, last_update)
All four column families share a single WAL, enabling atomic cross-CF writes. The entity_state CF provides crash recovery for in-memory state — on startup, each entity's running scores and counters are restored from this CF.
Decay implementation
The running-score formula is the right approach
The formula S(t) = S(t_prev) × e^(-λ × Δt) + w is mathematically exact (not an approximation) and provides O(1) update cost per event. This is proven by the Forward Decay model formalized by Cormode, Shkapenyuk, Srivastava, and Xu in their ICDE 2009 paper, and independently described by Jules Jacobs and Evan Miller.
The proof is straightforward: if S(t_prev) = Σ w_i × e^(-λ(t_prev - t_i)) for all events up to t_prev, then multiplying by e^(-λ(t - t_prev)) shifts every event's decay to be relative to the new time t, and adding the new weight w incorporates the new event with zero age. The result is exactly Σ w_i × e^(-λ(t - t_i)) for all events including the new one.
Write path (on each engagement event):
fn on_event(&mut self, weight: f64, event_time_ns: u64, lambdas: &[f64; 3]) {
let dt = (event_time_ns - self.last_update_ns) as f64 / 1e9;
for i in 0..3 {
self.decay_scores[i] = self.decay_scores[i] * (-lambdas[i] * dt).exp() + weight;
}
self.last_update_ns = event_time_ns;
}
// Cost: 3 exp() calls ≈ 36ns on modern hardware
Read path (at query time):
fn current_score(&self, lambda_idx: usize, query_time_ns: u64, lambda: f64) -> f64 {
let dt = (query_time_ns - self.last_update_ns) as f64 / 1e9;
self.decay_scores[lambda_idx] * (-lambda * dt).exp()
}
// Cost: 1 exp() + 1 mul ≈ 15ns per entity per lambda
Why this beats alternatives by 20–60×
Scanning 50 raw events to compute decay at read time costs 750–900ns (scalar) per entity: 50 memory loads at 2–5ns each, 50 exp() calls at 12ns each, 50 multiply-accumulates. Reading a single pre-computed score costs 15–20ns: one 16-byte load, one exp(), one multiply. For 200 candidate entities, that's 3–4 µs vs 160 µs — comfortably sub-millisecond either way, but the running-score approach leaves massive headroom for growth to 500+ events/entity where raw scanning would hit 1.6ms and bust the budget.
Handling edge cases
Out-of-order events are handled correctly without recomputation. When an event arrives with t_event < last_update, pre-decay the weight: score += weight × exp(-λ × (last_update - t_event)). The last_update timestamp doesn't change since it already reflects a more recent time.
Multiple λ values require one score per λ per entity. With K=3 decay rates (1-hour, 24-hour, 7-day half-lives), storage is 3 × 8 bytes = 24 bytes per entity plus 8 bytes for the timestamp — 32 bytes total. For 10M entities, that's 320 MB. Adding a new λ requires either a backfill pass over raw events (feasible since we keep 7 days) or starting fresh.
Floating-point precision is not a concern with f64. Each update introduces ~0.5 ULP of rounding error. After 10^12 updates, accumulated error would be ~10^-10 relative — negligible. Underflow (score decaying to zero) is desirable behavior, not a bug. Jules Jacobs analyzed that with f64 and a 1-hour half-life, the system can run until the year 18,000 without precision issues.
The Jacobs forward-decay trick for ranking
For ranking-only queries (no absolute score needed), an even faster approach exists. Factor out the time-dependent term: Σ w_i × e^(-λ(t_now - t_i)) = e^(-λ × t_now) × Σ w_i × e^(λ × t_i). The term S_static = Σ w_i × e^(λ × t_i) changes only on writes. Since e^(-λ × t_now) is the same for all entities, relative ordering is determined by S_static alone — zero read-time computation for ranking. The catch: S_static grows exponentially over time, requiring log-space arithmetic (z = log(S_static)) to avoid overflow. This is worth implementing for the primary ranking hot path.
SWAG algorithm summary
Two-Stacks achieves O(1) amortized sliding window aggregation
The Two-Stacks algorithm, introduced by Tangwongsan, Hirzel, and Schneider (PVLDB 2015), maintains a sliding window aggregate using two stacks. The back stack accumulates new insertions; the front stack serves evictions. Each stack entry stores both the element's value and the cumulative aggregate of all elements below it in the stack.
Insert: push to back stack, compute back.top.agg = combine(back.previous_top.agg, new_value). O(1).
Evict: pop from front stack. O(1) unless front is empty, which triggers a "flip" — all elements from back are popped and pushed to front with recomputed prefix aggregates. The flip is O(n) but each element flips at most once, yielding O(1) amortized.
Query: combine(front.top.agg, back.top.agg) — one combine operation, O(1).
The requirement is that the aggregation operator be associative (forming a monoid). This covers count, sum, min, max, and any composition thereof. DABA (De-Amortized Banker's Aggregator) from the same group eliminates the occasional O(n) flip spike, achieving O(1) worst-case with a more complex data structure. FiBA extends this to out-of-order streams with O(log d) cost where d is the distance from the window boundary.
Applicability to tidalDB's use case
SWAG directly applies to tidalDB's windowed count and sum aggregates (view_count last 7d, like_count last 1h). These are associative operations that fit the Two-Stacks model perfectly. For velocity (rate of change), SWAG can maintain a windowed count, with velocity = count / window_duration.
Exponential decay is NOT compatible with standard SWAG because the weight of each event depends on the current query time, which changes continuously — the aggregation is not associative in the required sense. However, this is a non-issue because the running-score approach described above already provides O(1) decay computation without needing SWAG.
For practical implementation, the Scotty stream-slicing approach (Traub et al., EDBT 2019 Best Paper) is most relevant to tidalDB. It divides the event stream into non-overlapping time slices (e.g., 1-minute buckets), computes partial aggregates per slice, and shares these across all concurrent windows. This means a single set of per-minute counters supports simultaneous 1-hour, 24-hour, and 7-day window queries — a natural fit for tidalDB's bucketed counter design. Reference implementations exist in Rust at segeljakt/swag and IBM/sliding-window-aggregators on GitHub.
Compaction and retention strategy
Time-partitioned FIFO is the right model for raw events
For tidalDB's append-only, timestamp-ordered event workload, FIFO compaction achieves write amplification of just 2× (1× WAL + 1× memtable flush), compared to 12–32× for leveled compaction. This finding is validated by Solana's BlockStore, which switched from leveled to FIFO compaction and achieved 6.5× faster compaction with 1/3 the disk writes.
The recommended partition layout uses daily partitions:
/data/raw/2026-02-14/ → RocksDB instance, FIFO compaction
/data/raw/2026-02-15/ → RocksDB instance, FIFO compaction
...
/data/raw/2026-02-20/ → Active partition
/data/rollups/hourly/ → Single instance, leveled compaction, 30-day TTL
/data/rollups/daily/ → Single instance, leveled compaction, no TTL
Retention enforcement is trivial: close the partition handle, delete the directory. O(1) cost, zero write amplification for deletion. This avoids the fundamental problem InfluxDB identified: "In LSM Trees, a delete is as expensive, if not more so, than a write." With 7 daily partitions plus 2 rollup instances, the system manages only 9 database instances — well within file handle limits.
Concrete storage and I/O estimates
For the reference workload of 10M entities × 50 events/day:
| Component | Daily writes to disk | Stored data | Write amplification |
|---|---|---|---|
| Raw events (FIFO) | 64 GB/day | 224 GB (7 days) | 2× |
| Hourly rollups (leveled) | ~115 GB/day | ~231 GB (30 days) | ~15× |
| Daily rollups (leveled) | ~5 GB/day | Growing 320 MB/day | ~15× |
| Total | ~184 GB/day | ~460 GB | Blended ~6× |
Optimizing further with time-partitioned rollups (FIFO instead of leveled for hourly rollups) reduces total daily disk I/O to ~80 GB/day with a blended write amplification of ~2.5×. Sustained disk I/O is ~925 KB/s average for the FIFO path — trivial for any modern NVMe SSD.
Rollup generation strategy
Rollups are generated by a background thread using incremental aggregation (the Flink ReduceFunction pattern). An in-memory hash map of per-entity hourly accumulators is updated on every write — O(1) per event. Every hour, the accumulated counters are flushed to the hourly rollup CF. Daily rollups are computed hierarchically from hourly rollups, not raw data. Following TimescaleDB's best practice: never store averages (store sum + count instead), snap timestamps to bucket boundaries, and keep a 1-hour grace period for late arrivals before finalizing rollups.
Critical rollup design: store composable aggregates per bucket:
struct HourlyRollup {
entity_id: u64,
hour_bucket: u32, // hours since epoch
total_count: u32,
weighted_sum: f32,
view_count: u16,
like_count: u16,
skip_count: u16,
completion_count: u16,
} // ~24 bytes per rollup record
At query time for a 7-day window, the system merges 168 hourly rollup records (7 × 24) plus a handful of recent un-rolled-up events — still sub-millisecond. This "real-time continuous aggregate" pattern, where pre-computed rollups are merged with recent unmaterialized data at query time, is exactly what TimescaleDB implements and what produced their measured 979× speedup over raw queries.
Open questions requiring benchmarks
Several design decisions should be validated with actual tidalDB benchmarks before committing to production:
RocksDB vs fjall write throughput under realistic contention. Fjall's batch writes beat RocksDB in synthetic benchmarks (353ms vs 451ms for 1M entries), but real-world performance with concurrent readers, prefix bloom filters, and multiple column families may differ. Run a 24-hour stress test at 2× expected write rate with simultaneous read load.
Optimal time bucket granularity for windowed aggregates. Per-minute buckets (60 per hour, 10,080 per week) vs per-5-minute (2,016 per week) vs per-hour (168 per week). Finer granularity improves accuracy for "last 1 hour" windows at the sliding boundary but increases memory and merge cost. Benchmark the actual latency difference for tidalDB's target candidate set sizes.
In-memory state recovery time on crash restart. With 10M entities and 7 days of raw events, reconstructing all running decay scores from the WAL/raw events could take minutes. Benchmark this and determine the right checkpoint interval for the entity_state CF — likely every 30–60 seconds.
Prefix bloom filter false-positive rate tuning. RocksDB's default 10 bits/key yields ~1% false positive rate. For tidalDB's per-entity prefix scans across potentially thousands of SST files, higher bit counts (20 bits/key at 0.01% FPR) may significantly reduce unnecessary I/O. Measure actual range scan latency under varying bloom filter configurations.
Memory budget sensitivity. The recommended architecture assumes ~1.3 GB for per-entity in-memory state. If this is too large, evaluate a tiered approach: hot entities (recently active) in memory, cold entities loaded on demand from the entity_state CF. The threshold between hot and cold — and the p99 latency impact of cold-entity reads — needs measurement.
Decay score accuracy over long idle periods. When an entity receives no events for days, its running score decays toward zero. Verify that f64 precision remains adequate and that the exp() underflow behavior (score → 0.0) doesn't cause ranking artifacts compared to scanning the actual raw events.