tidaldb/docs/planning/milestone-1/phase-4/task-01-hot-tier-signal-state.md

# Task 01: Hot-Tier Signal State

## Context

**Milestone:** 1 -- Signal Engine
**Phase:** m1p4 -- Signal Ledger
**Depends On:** None (uses types from m1p1 but no m1p4 tasks)
**Blocks:** Task 03 (Signal Ledger and Velocity)
**Complexity:** L

## Objective

Deliver `HotSignalState`, the cache-line-aligned, lock-free struct that holds running exponential decay scores for a single signal type on a single entity. This is the structure touched on every ranking query -- it must be exactly 64 bytes, use atomic operations for concurrent read/write, and implement the running decay formula with mathematical exactness. The struct handles both in-order and out-of-order signal events, and provides lazy decay at read time so ranking queries pay only one `exp()` call per entity per decay rate.

This is the single most performance-critical data structure in tidalDB. Every design choice is driven by the hot-path constraint: a ranking query scoring 200 candidates must complete in under 5 microseconds. That means ~25 nanoseconds per entity (5 µs / 200) for decay score reads, a budget that covers roughly one L1 cache miss and one `exp()` call.

## Requirements

- `HotSignalState` must be `#[repr(C, align(64))]` -- exactly one L1 cache line
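- Compile-time assertions that `size_of::<HotSignalState>() == 64` and `align_of::<HotSignalState>() == 64` (written as `const _: () = assert!(...)`, since Rust has no built-in `static_assert!`)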
- Running decay formula: `S(t) = S(t_prev) * exp(-lambda * dt) + weight`
- `on_signal()` updates decay scores via CAS loop with correct memory ordering
- `current_score()` applies lazy decay at read time: `stored_score * exp(-lambda * dt)`
- Out-of-order events: when `t_event < last_update_ns`, pre-decay the weight instead of advancing time
- Decay scores are non-negative (debug assertion)
- All atomic operations use Acquire/Release/AcqRel -- no Relaxed without explicit justification
- `Send + Sync` (derived automatically: every field is either an atomic or a plain integer that is never mutated after construction)
- No `unsafe` code
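
The running formula and the lazy read listed above can be sketched with plain `f64` arithmetic, leaving atomics aside. The helper names `ns_to_secs`, `decay_step`, `lazy_read`, and `analytical_sum` are hypothetical and not part of the planned API; the sketch assumes lambdas are per-second rates and only illustrates that folding in-order events one at a time reproduces the brute-force sum.

```rust
/// Nanoseconds to seconds (assumption: lambdas are per-second rates).
fn ns_to_secs(ns: u64) -> f64 {
    ns as f64 / 1e9
}

/// One in-order update: S(t) = S(t_prev) * exp(-lambda * dt) + weight.
/// Hypothetical helper; the caller guarantees event_ns >= last_ns.
fn decay_step(score: f64, last_ns: u64, weight: f64, event_ns: u64, lambda: f64) -> (f64, u64) {
    let dt = ns_to_secs(event_ns - last_ns);
    (score * (-lambda * dt).exp() + weight, event_ns)
}

/// Lazy read: decay the stored score from last_ns up to query_ns.
fn lazy_read(score: f64, last_ns: u64, query_ns: u64, lambda: f64) -> f64 {
    let dt = ns_to_secs(query_ns.saturating_sub(last_ns));
    score * (-lambda * dt).exp()
}

/// Brute-force reference: each event's weight decayed individually.
fn analytical_sum(events: &[(f64, u64)], query_ns: u64, lambda: f64) -> f64 {
    events
        .iter()
        .map(|&(w, t)| w * (-lambda * ns_to_secs(query_ns - t)).exp())
        .sum()
}
```

Folding a time-sorted event list through `decay_step` and then calling `lazy_read` should agree with `analytical_sum` up to floating-point rounding, which is the property the tests later in this task check against the real struct.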

## Technical Design

### Module Structure

```
tidal/src/signals/
  hot.rs    -- HotSignalState, all methods
```

### Public API

```rust
// === signals/hot.rs ===

use std::sync::atomic::{AtomicU64, Ordering};

/// Hot-path signal state for a single signal type on a single entity.
///
/// One cache line (64 bytes). Touched on every ranking query involving this
/// signal. Contains running decay scores for up to 3 decay rates and the
/// timestamp of the last update for lazy decay at read time.
///
/// # Memory Layout
///
/// ```text
///  Offset   Size   Field
///  0..8     8      entity_id (u64)
///  8..16    8      last_update_ns (AtomicU64)
///  16..18   2      signal_type_id (u16)
///  18..20   2      flags (u16)
///  20..24   4      _pad0
///  24..32   8      decay_scores[0] (AtomicU64, f64 via to_bits/from_bits)
///  32..40   8      decay_scores[1] (AtomicU64)
///  40..48   8      decay_scores[2] (AtomicU64)
///  48..64   16     _pad1
/// ```
///
/// # Concurrency
///
/// - Writers: CAS loop on each `decay_scores[i]`, then conditional store on
///   `last_update_ns`. Multiple concurrent writers are serialized by CAS retry.
/// - Readers: Acquire load on `last_update_ns`, then Acquire load on
///   `decay_scores[i]`. Lazy decay applied from stored time to query time.
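/// - A reader that races with a writer may pair a fresh score with a stale
///   timestamp. The gap between the two timestamps is then decayed twice
///   (over-decayed, too small). The error is bounded by that gap and goes
///   away once the timestamp store becomes visible.
/// - The reverse pairing (stale score, fresh timestamp) would under-decay;
///   the Release store on `last_update_ns` is intended to rule it out.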
#[repr(C, align(64))]
pub struct HotSignalState {
    entity_id: u64,
    last_update_ns: AtomicU64,
    signal_type_id: u16,
    flags: u16,
    _pad0: [u8; 4],
    decay_scores: [AtomicU64; 3],
    _pad1: [u8; 16],
}

// Compile-time size and alignment assertions
const _: () = assert!(std::mem::size_of::<HotSignalState>() == 64);
const _: () = assert!(std::mem::align_of::<HotSignalState>() == 64);

/// Maximum number of decay rate slots per signal type.
pub const MAX_DECAY_RATES: usize = 3;

impl HotSignalState {
    /// Construct a new, zeroed state for the given entity and signal type.
    pub fn new(entity_id: u64, signal_type_id: u16) -> Self;

    /// Construct with the velocity_enabled flag set.
    pub fn with_flags(entity_id: u64, signal_type_id: u16, velocity_enabled: bool) -> Self;

    /// The entity this state belongs to.
    pub fn entity_id(&self) -> u64;

    /// The signal type index.
    pub fn signal_type_id(&self) -> u16;

    /// Whether velocity computation is enabled for this signal.
    pub fn velocity_enabled(&self) -> bool;

    /// Update running decay scores on a new signal event.
    ///
    /// For each configured lambda, applies the decay formula:
    ///   new_score = old_score * exp(-lambda * dt) + effective_weight
    ///
    /// For in-order events (event_time_ns >= last_update_ns):
    ///   dt = (event_time_ns - last_update_ns) as seconds
    ///   effective_weight = weight
    ///   last_update_ns is advanced to event_time_ns
    ///
    /// For out-of-order events (event_time_ns < last_update_ns):
    ///   The existing score is not decayed (dt=0 for the score shift).
    ///   Instead, the weight is pre-decayed:
    ///   age = (last_update_ns - event_time_ns) as seconds
    ///   effective_weight = weight * exp(-lambda * age)
    ///   last_update_ns is NOT changed.
    ///
    /// Cost: K * exp() calls where K = number of configured decay rates.
    /// At K=1 (M1 default): ~12ns. At K=3: ~36ns.
    pub fn on_signal(
        &self,
        weight: f64,
        event_time_ns: u64,
        lambdas: &[f64],
    );

    /// Read the current decay score at query time.
    ///
    /// Applies lazy decay from last_update to query_time_ns:
    ///   score = stored_score * exp(-lambda * dt)
    ///
    /// Cost: 2 atomic loads (same cache line) + 1 exp() + 1 multiply = ~15ns.
    pub fn current_score(
        &self,
        decay_rate_idx: usize,
        query_time_ns: u64,
        lambda: f64,
    ) -> f64;

    /// Read the raw stored score without lazy decay.
    /// Used only for checkpoint serialization.
    pub fn stored_score(&self, decay_rate_idx: usize) -> f64;

    /// Read the last update timestamp in nanoseconds.
    pub fn last_update_ns(&self) -> u64;

    /// Restore state from a checkpoint (sets `last_update_ns` and the stored decay scores).
    /// Called during crash recovery before WAL replay.
    pub fn restore(
        &self,
        last_update_ns: u64,
        scores: &[f64],
    );
}
```

### Internal Design

**Atomic memory ordering rationale:**

The critical invariant is that a reader who loads `last_update_ns` via Acquire must see decay scores that are consistent with (or more recent than) that timestamp. Without this, a reader could pair a new timestamp with an old score. The old score has not been decayed across the interval up to the new timestamp and lacks the new event's weight, so the lazily decayed result would not correspond to any valid state.

- `last_update_ns` loads: `Ordering::Acquire` -- establishes a happens-before edge with the Release store from the writer.
- `last_update_ns` stores: `Ordering::Release` -- makes all prior decay score CAS operations visible to readers who Acquire this timestamp.
- `decay_scores[i]` loads: `Ordering::Acquire` -- ensures we read the most recent value stored by any CAS.
- `decay_scores[i]` CAS: `Ordering::AcqRel` (success), `Ordering::Acquire` (failure) -- AcqRel on success makes the new score visible and acquires the latest value; Acquire on failure loads the freshest competing write.

The write order is critical: CAS all decay scores FIRST, then conditionally store `last_update_ns`. If a reader observes the state between the CAS and the timestamp store (for example because the writer was preempted), the worst case is that it applies lazy decay from the older timestamp to a score that was already decayed up to the new event, producing a slightly over-decayed (too small) score. This is safe for ranking because the error is bounded by the gap between the two timestamps and disappears as soon as the timestamp store becomes visible.
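
A minimal single-rate sketch of this ordering follows. `SketchState` is a hypothetical stand-in for `HotSignalState`, and `fetch_max` is an assumed way to realize the "conditional store" on the timestamp; neither is confirmed by the spec. The sketch is meant to show the CAS-then-timestamp sequence, not to be the implementation.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical single-rate stand-in for `HotSignalState` (illustration only).
pub struct SketchState {
    last_update_ns: AtomicU64,
    score_bits: AtomicU64, // f64 stored via to_bits/from_bits
}

impl SketchState {
    pub fn new() -> Self {
        // A zeroed AtomicU64 reads back as 0.0 through f64::from_bits.
        Self { last_update_ns: AtomicU64::new(0), score_bits: AtomicU64::new(0) }
    }

    pub fn on_signal(&self, weight: f64, event_ns: u64, lambda: f64) {
        let last = self.last_update_ns.load(Ordering::Acquire);
        // In-order: decay the stored score forward. Out-of-order: pre-decay the weight.
        let (decay, effective_weight) = if event_ns >= last {
            ((-lambda * (event_ns - last) as f64 / 1e9).exp(), weight)
        } else {
            (1.0, weight * (-lambda * (last - event_ns) as f64 / 1e9).exp())
        };
        // Step 1: CAS the score, retrying with the freshest value on failure.
        let mut current = self.score_bits.load(Ordering::Acquire);
        loop {
            let updated = f64::from_bits(current) * decay + effective_weight;
            match self.score_bits.compare_exchange_weak(
                current,
                updated.to_bits(),
                Ordering::AcqRel,
                Ordering::Acquire,
            ) {
                Ok(_) => break,
                Err(actual) => current = actual,
            }
        }
        // Step 2: advance the timestamp only after the score is published.
        // Assumption: fetch_max plays the role of the conditional store.
        if event_ns > last {
            self.last_update_ns.fetch_max(event_ns, Ordering::AcqRel);
        }
    }

    pub fn current_score(&self, query_ns: u64, lambda: f64) -> f64 {
        // Timestamp first, then score, matching the reader order described above.
        let last = self.last_update_ns.load(Ordering::Acquire);
        let stored = f64::from_bits(self.score_bits.load(Ordering::Acquire));
        stored * (-lambda * query_ns.saturating_sub(last) as f64 / 1e9).exp()
    }
}
```

Note that the sketch reads `last_update_ns` once before the CAS loop, so it is only exact with a single writer per state. Two in-order writers racing on the same state could each apply decay over overlapping intervals, which the real implementation would need to address.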

**Out-of-order event handling:**

When `event_time_ns < last_update_ns`, the event arrived late. We cannot "rewind" the running score. Instead, we pre-decay the weight to account for the event's age relative to the current state:

```
adjusted_weight = weight * exp(-lambda * (last_update_ns - event_time_ns) / 1e9)
```

This is mathematically equivalent to having processed the event at its original time: the contribution of the late event to the score at `last_update_ns` is exactly `weight * exp(-lambda * age)`.

For the CAS loop on out-of-order events, `dt` is 0 (the score is not decayed), and the adjusted weight is added:

```
new_score = old_score + adjusted_weight
```
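
The equivalence claimed above can be checked numerically with a pure-`f64` model of the update rule. `apply_event` and `fold_events` are hypothetical helpers introduced for illustration, with no atomics involved; they assume the same nanosecond timestamps and per-second lambdas used elsewhere in this task.

```rust
/// Hypothetical pure-f64 model of the update rule (no atomics).
/// Returns the new (score, last_update_ns) pair.
fn apply_event(score: f64, last_ns: u64, weight: f64, event_ns: u64, lambda: f64) -> (f64, u64) {
    if event_ns >= last_ns {
        // In-order: decay the running score forward, then add the full weight.
        let dt = (event_ns - last_ns) as f64 / 1e9;
        (score * (-lambda * dt).exp() + weight, event_ns)
    } else {
        // Out-of-order: leave the score and timestamp alone, pre-decay the weight.
        let age = (last_ns - event_ns) as f64 / 1e9;
        (score + weight * (-lambda * age).exp(), last_ns)
    }
}

/// Fold a sequence of (weight, event_ns) events starting from the zero state.
fn fold_events(events: &[(f64, u64)], lambda: f64) -> (f64, u64) {
    events
        .iter()
        .fold((0.0, 0), |(s, t), &(w, e)| apply_event(s, t, w, e, lambda))
}
```

Feeding the same events in sorted and in reversed order through `fold_events` should yield the same score (up to rounding) and the same final timestamp, because each event contributes `weight * exp(-lambda * age)` at `last_update_ns` either way.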

**f64 via AtomicU64:**

Decay scores are f64 values stored as u64 bit patterns using `f64::to_bits()` and `f64::from_bits()`. Both functions are safe, const, and well defined for every bit pattern, including 0.0, negative zero, subnormals, infinities, and NaN. NaN bit patterns are never stored because the decay formula cannot produce NaN from finite, non-negative inputs, provided the accumulated score itself stays finite.
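
As an illustration of the bit-pattern approach, here is a small sketch of an f64 cell built on `AtomicU64`. The type `AtomicF64Bits` and its `update` method are hypothetical and not part of the planned API; the debug assertion mirrors the non-negativity requirement and also rejects NaN, since `NaN >= 0.0` is false.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical f64 cell stored as a u64 bit pattern (illustration only).
pub struct AtomicF64Bits(AtomicU64);

impl AtomicF64Bits {
    /// All-zero bits decode to 0.0, so no special initialization is needed.
    pub fn zero() -> Self {
        Self(AtomicU64::new(0))
    }

    pub fn load(&self) -> f64 {
        f64::from_bits(self.0.load(Ordering::Acquire))
    }

    /// Apply `f` to the stored value via a CAS retry loop; returns the new value.
    pub fn update(&self, f: impl Fn(f64) -> f64) -> f64 {
        let mut current = self.0.load(Ordering::Acquire);
        loop {
            let next = f(f64::from_bits(current));
            debug_assert!(next >= 0.0, "score must be non-negative, got {next}");
            // The weak variant may fail spuriously, which the loop absorbs.
            match self.0.compare_exchange_weak(
                current,
                next.to_bits(),
                Ordering::AcqRel,
                Ordering::Acquire,
            ) {
                Ok(_) => return next,
                Err(actual) => current = actual,
            }
        }
    }
}
```

Because a failed CAS returns the value that won the race, each retry recomputes `f` on the freshest score, so concurrent updates are applied one after another rather than lost.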

### Error Handling

No fallible operations. `on_signal()` and `current_score()` are infallible. `decay_rate_idx` out of bounds is a caller error -- debug-asserted but saturated to 0 in release (never panics on the hot path).

## Test Strategy

### Property Tests

```rust
use proptest::prelude::*;

// P1: Decay scores decrease monotonically without new events.
proptest! {
    #[test]
    fn decay_monotonic_decrease(
        initial_score in 0.0f64..1e12,
        lambda in 1e-7f64..1e-3,
        dt_secs in 1.0f64..1e7,
    ) {
        let decayed = initial_score * (-lambda * dt_secs).exp();
        prop_assert!(decayed <= initial_score);
        prop_assert!(decayed >= 0.0);
    }
}

// P2: Running score matches analytical sum within a relative error of 1e-6.
proptest! {
    #[test]
    fn running_score_matches_analytical(
        events in prop::collection::vec(
            (0.1f64..10.0, 1_000_000u64..1_000_000_000),
            1..100,
        ),
        lambda in 1e-7f64..1e-3,
    ) {
        // Sort events by time for in-order processing
        let mut sorted_events = events.clone();
        sorted_events.sort_by_key(|e| e.1);

        let query_time_ns = sorted_events.last().unwrap().1 + 1_000_000_000; // +1 second

        // Build HotSignalState and process events
        let state = HotSignalState::new(42, 0);
        for &(weight, time_ns) in &sorted_events {
            state.on_signal(weight, time_ns, &[lambda]);
        }
        let running = state.current_score(0, query_time_ns, lambda);

        // Compute analytical sum
        let analytical: f64 = sorted_events.iter()
            .map(|&(w, t)| w * (-lambda * (query_time_ns - t) as f64 / 1e9).exp())
            .sum();

        let relative_error = if analytical.abs() < 1e-15 {
            running.abs()
        } else {
            (running - analytical).abs() / analytical
        };
        prop_assert!(
            relative_error < 1e-6,
            "running={running}, analytical={analytical}, relative_error={relative_error}"
        );
    }
}

// P4: Out-of-order events produce same final score as in-order.
proptest! {
    #[test]
    fn out_of_order_events_commutative(
        events in prop::collection::vec(
            (0.1f64..10.0, 1_000_000u64..1_000_000_000),
            2..50,
        ),
        lambda in 1e-7f64..1e-3,
    ) {
        let query_time_ns = events.iter().map(|e| e.1).max().unwrap() + 1_000_000_000;

        // Process in-order
        let mut sorted = events.clone();
        sorted.sort_by_key(|e| e.1);
        let state_ordered = HotSignalState::new(42, 0);
        for &(w, t) in &sorted {
            state_ordered.on_signal(w, t, &[lambda]);
        }
        let score_ordered = state_ordered.current_score(0, query_time_ns, lambda);

        // Process in reverse order (all out-of-order except first)
        sorted.reverse();
        let state_reversed = HotSignalState::new(42, 0);
        for &(w, t) in &sorted {
            state_reversed.on_signal(w, t, &[lambda]);
        }
        let score_reversed = state_reversed.current_score(0, query_time_ns, lambda);

        // Also compare to analytical sum
        let analytical: f64 = events.iter()
            .map(|&(w, t)| w * (-lambda * (query_time_ns - t) as f64 / 1e9).exp())
            .sum();

        let error_ordered = if analytical.abs() < 1e-15 {
            score_ordered.abs()
        } else {
            (score_ordered - analytical).abs() / analytical
        };
        let error_reversed = if analytical.abs() < 1e-15 {
            score_reversed.abs()
        } else {
            (score_reversed - analytical).abs() / analytical
        };

        prop_assert!(error_ordered < 1e-6,
            "ordered: running={score_ordered}, analytical={analytical}, error={error_ordered}");
        prop_assert!(error_reversed < 1e-6,
            "reversed: running={score_reversed}, analytical={analytical}, error={error_reversed}");
    }
}

// Decay scores are always non-negative (INV-SIG-3).
proptest! {
    #[test]
    fn decay_scores_non_negative(
        events in prop::collection::vec(
            (0.0f64..100.0, 0u64..2_000_000_000),
            1..200,
        ),
        lambda in 1e-7f64..1e-3,
        query_offset in 0u64..2_000_000_000,
    ) {
        let state = HotSignalState::new(1, 0);
        for &(w, t) in &events {
            state.on_signal(w, t, &[lambda]);
        }
        let query_time = events.iter().map(|e| e.1).max().unwrap_or(0) + query_offset;
        let score = state.current_score(0, query_time, lambda);
        prop_assert!(score >= 0.0, "score was {score}");
    }
}
```

### Unit Tests

```rust
#[test]
fn hot_signal_state_size_and_alignment() {
    assert_eq!(std::mem::size_of::<HotSignalState>(), 64);
    assert_eq!(std::mem::align_of::<HotSignalState>(), 64);
}

#[test]
fn new_state_is_zeroed() {
    let state = HotSignalState::new(42, 5);
    assert_eq!(state.entity_id(), 42);
    assert_eq!(state.signal_type_id(), 5);
    assert_eq!(state.last_update_ns(), 0);
    assert_eq!(state.stored_score(0), 0.0);
    assert_eq!(state.stored_score(1), 0.0);
    assert_eq!(state.stored_score(2), 0.0);
}

#[test]
fn single_event_sets_score_to_weight() {
    let state = HotSignalState::new(1, 0);
    let lambda = std::f64::consts::LN_2 / (7.0 * 24.0 * 3600.0); // 7-day half-life
    let t = 1_000_000_000u64; // 1 second in nanos

    state.on_signal(1.0, t, &[lambda]);

    // Immediately after, with no time elapsed, score should be ~1.0
    let score = state.current_score(0, t, lambda);
    assert!((score - 1.0).abs() < 1e-10);
}

#[test]
fn score_halves_after_half_life() {
    let half_life_secs = 3600.0; // 1 hour
    let lambda = std::f64::consts::LN_2 / half_life_secs;
    let state = HotSignalState::new(1, 0);

    let t0 = 0u64;
    state.on_signal(1.0, t0, &[lambda]);

    // Read after exactly one half-life
    let t1 = (half_life_secs * 1e9) as u64;
    let score = state.current_score(0, t1, lambda);
    assert!((score - 0.5).abs() < 1e-10, "score was {score}, expected ~0.5");
}

#[test]
fn two_events_accumulate() {
    let lambda = std::f64::consts::LN_2 / 3600.0; // 1h half-life
    let state = HotSignalState::new(1, 0);

    let t0 = 0u64;
    let t1 = 1_000_000_000u64; // 1 second later

    state.on_signal(1.0, t0, &[lambda]);
    state.on_signal(1.0, t1, &[lambda]);

    let score = state.current_score(0, t1, lambda);
    // score = 1.0 * exp(-lambda * 1.0) + 1.0
    let expected = 1.0_f64 * (-lambda * 1.0).exp() + 1.0;
    assert!((score - expected).abs() < 1e-10, "score={score}, expected={expected}");
}

#[test]
fn out_of_order_event_predecays_weight() {
    let lambda = std::f64::consts::LN_2 / 3600.0;
    let state = HotSignalState::new(1, 0);

    // Process event at t=10s first
    let t_late = 10_000_000_000u64;
    state.on_signal(1.0, t_late, &[lambda]);

    // Then process event at t=5s (out of order)
    let t_early = 5_000_000_000u64;
    state.on_signal(1.0, t_early, &[lambda]);

    // Query at t=10s -- should match analytical result
    let analytical = 1.0 * (-lambda * 0.0).exp()  // event at t=10, age=0
                   + 1.0 * (-lambda * 5.0).exp();  // event at t=5, age=5s
    let actual = state.current_score(0, t_late, lambda);
    assert!((actual - analytical).abs() < 1e-10,
        "actual={actual}, analytical={analytical}");
}

#[test]
fn last_update_ns_not_regressed_by_out_of_order() {
    let lambda = std::f64::consts::LN_2 / 3600.0;
    let state = HotSignalState::new(1, 0);

    state.on_signal(1.0, 10_000_000_000, &[lambda]);
    let ts_before = state.last_update_ns();

    state.on_signal(1.0, 5_000_000_000, &[lambda]); // older event
    let ts_after = state.last_update_ns();

    assert_eq!(ts_before, ts_after, "timestamp should not regress");
    assert_eq!(ts_after, 10_000_000_000);
}

#[test]
fn score_decays_to_near_zero_after_many_half_lives() {
    let lambda = std::f64::consts::LN_2 / 3600.0; // 1h half-life
    let state = HotSignalState::new(1, 0);

    state.on_signal(1.0, 0, &[lambda]);

    // After 100 half-lives (~100 hours), score should be essentially zero
    let t = (100.0 * 3600.0 * 1e9) as u64;
    let score = state.current_score(0, t, lambda);
    assert!(score < 1e-20, "score was {score}");
}

#[test]
fn velocity_flag() {
    let state = HotSignalState::with_flags(1, 0, true);
    assert!(state.velocity_enabled());

    let state2 = HotSignalState::with_flags(1, 0, false);
    assert!(!state2.velocity_enabled());
}

#[test]
fn restore_sets_all_fields() {
    let state = HotSignalState::new(1, 0);
    state.restore(42_000_000_000, &[1.5, 2.5, 3.5]);

    assert_eq!(state.last_update_ns(), 42_000_000_000);
    assert!((state.stored_score(0) - 1.5).abs() < 1e-15);
    assert!((state.stored_score(1) - 2.5).abs() < 1e-15);
    assert!((state.stored_score(2) - 3.5).abs() < 1e-15);
}

#[test]
fn multiple_lambdas() {
    let lambda_fast = std::f64::consts::LN_2 / 3600.0;   // 1h half-life
    let lambda_slow = std::f64::consts::LN_2 / 604800.0;  // 7d half-life
    let lambdas = [lambda_fast, lambda_slow];
    let state = HotSignalState::new(1, 0);

    state.on_signal(1.0, 0, &lambdas);

    // After 1 hour, fast score ~0.5, slow score ~0.9959
    let t = (3600.0 * 1e9) as u64;
    let score_fast = state.current_score(0, t, lambda_fast);
    let score_slow = state.current_score(1, t, lambda_slow);
    assert!((score_fast - 0.5).abs() < 1e-6);
    assert!((score_slow - (-lambda_slow * 3600.0).exp()).abs() < 1e-6);
    assert!(score_slow > score_fast, "slow decay should retain more");
}
```

## Acceptance Criteria

- [ ] `HotSignalState` is `#[repr(C, align(64))]` with compile-time size assertion `== 64`
- [ ] `on_signal()` implements the running decay formula with CAS loops using `AcqRel`/`Acquire` ordering
- [ ] `current_score()` applies lazy decay with `Acquire` loads
- [ ] Out-of-order events pre-decay the weight and do not regress `last_update_ns`
- [ ] Running score matches analytical brute-force sum within a relative error of 1e-6 (property test P2)
- [ ] Decay scores monotonically decrease without new events (property test P1)
- [ ] Decay scores are always non-negative across all property test inputs (INV-SIG-3)
- [ ] Out-of-order processing produces same score as in-order within a relative error of 1e-6 (property test P4)
- [ ] `restore()` correctly sets all fields for checkpoint recovery
- [ ] No `unsafe` code
- [ ] `cargo clippy -- -D warnings` passes
- [ ] All property tests and unit tests pass

## Research References

- [docs/research/tidaldb_signal_ledger.md](../../../research/tidaldb_signal_ledger.md) -- Section 3 (running-score formula proof), Section 4 (EntityState struct layout), Section 5 (f64 precision analysis: "adequate through year 18,000"), performance estimates (12ns per exp(), 36ns for 3 rates)
- Cormode, G. et al., "Forward Decay: A Practical Time Decay Model for Streaming Systems," ICDE 2009 -- mathematical foundation for running score exactness

## Spec References

- [docs/specs/03-signal-system.md](../../../specs/03-signal-system.md) -- Section 3 (HotSignalState layout), Section 4 (decay computation: write-path `on_signal`, read-path `current_score`, out-of-order handling, numerical stability), invariants INV-SIG-2 (monotonic decrease), INV-SIG-3 (non-negative), INV-SIG-5 (running score exactness), INV-CON-1 (lock-free reads), INV-CON-2 (CAS correctness), performance targets (Section 12: hot-tier update < 50ns, decay score read ~15ns)
- [docs/specs/00-architecture-overview.md](../../../specs/00-architecture-overview.md) -- Section 8 code module map showing `signal/hot.rs`

## Implementation Notes

- `f64::from_bits(0u64)` returns `0.0` and `(0.0f64).to_bits()` returns `0u64`. This means a zeroed `AtomicU64` reads as `0.0` through `from_bits`, which is the correct initial decay score. No special initialization needed.
- `compare_exchange_weak` is used instead of `compare_exchange` because we are in a retry loop. The weak variant may fail spuriously but is faster on architectures with LL/SC (ARM). On x86, both compile to `CMPXCHG`.
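- The `_pad0` and `_pad1` fields make the padding explicit so the struct's 64-byte layout is visible in the source. With `#[repr(C, align(64))]` the compiler would insert equivalent padding on its own, but naming the bytes documents every offset. `#[repr(C)]` makes the field order and layout deterministic.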
- Do NOT implement the Jacobs forward-decay trick in this task. It eliminates read-time computation but requires log-space arithmetic and overflow prevention. Deferred to M2+ as an optimization.
- Do NOT add benchmark harness in this task. Benchmarks are added in Task 03 after the full signal ledger is assembled. Property tests are the correctness gate for this task.
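
The layout notes above can be checked at compile time. The sketch below mirrors the documented field order in a hypothetical `LayoutSketch` type and asserts the offsets from the memory-layout table. It assumes a 64-bit target where `AtomicU64` has 8-byte alignment and a toolchain with `std::mem::offset_of!` (Rust 1.77 or later). `NoExplicitPad` is likewise hypothetical and shows that the size and the `decay_scores` offset stay the same when the padding fields are omitted.

```rust
use std::mem::{align_of, offset_of, size_of};
use std::sync::atomic::AtomicU64;

/// Hypothetical mirror of the documented layout, used only to check offsets.
#[repr(C, align(64))]
pub struct LayoutSketch {
    entity_id: u64,
    last_update_ns: AtomicU64,
    signal_type_id: u16,
    flags: u16,
    _pad0: [u8; 4],
    decay_scores: [AtomicU64; 3],
    _pad1: [u8; 16],
}

// Offsets from the memory-layout table, verified at compile time.
const _: () = assert!(size_of::<LayoutSketch>() == 64);
const _: () = assert!(align_of::<LayoutSketch>() == 64);
const _: () = assert!(offset_of!(LayoutSketch, last_update_ns) == 8);
const _: () = assert!(offset_of!(LayoutSketch, signal_type_id) == 16);
const _: () = assert!(offset_of!(LayoutSketch, flags) == 18);
const _: () = assert!(offset_of!(LayoutSketch, decay_scores) == 24);
const _: () = assert!(offset_of!(LayoutSketch, _pad1) == 48);

/// Same fields without explicit padding: the compiler pads implicitly.
#[repr(C, align(64))]
pub struct NoExplicitPad {
    entity_id: u64,
    last_update_ns: AtomicU64,
    signal_type_id: u16,
    flags: u16,
    decay_scores: [AtomicU64; 3],
}

const _: () = assert!(size_of::<NoExplicitPad>() == 64);
const _: () = assert!(offset_of!(NoExplicitPad, decay_scores) == 24);
```

If any field is added or reordered, one of these assertions fails at build time instead of silently pushing the struct onto a second cache line.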