tidaldb/docs/planning/milestone-7/phase-4/task-04-session-cohort-degradation-metrics.md
2026-02-23 22:41:16 -07:00

Task 04: Session + Cohort + Degradation Metrics

Delivers

Prometheus gauges and counters for session lifecycle (active, closed, auto-closed), cohort ledger size, degradation level, and rate-limiting activity. These metrics give operators visibility into agent session health, cohort engine load, and system stress without reading application logs.

Complexity: S

Dependencies

  • task-01 complete (establishes instrumentation pattern)
  • m7p2 complete (provides DegradationLevel and rate-limiting infrastructure)
  • tidal/src/db/metrics.rs -- MetricsState to extend
  • tidal/src/db/sessions.rs -- session start/close paths
  • tidal/src/cohort/mod.rs -- CohortSignalLedger and CohortRegistry

Technical Design

1. Add atomic counters to MetricsState

In tidal/src/db/metrics.rs:

pub struct MetricsState {
    // ... existing + task-02 + task-03 fields ...

    // ── Session + cohort + degradation metrics (m7p4) ──────────────────
    /// Number of currently active sessions.
    #[cfg(feature = "metrics")]
    pub(crate) active_sessions: AtomicU64,
    /// Total sessions closed since open (cumulative).
    #[cfg(feature = "metrics")]
    pub(crate) closed_sessions_total: AtomicU64,
    /// Total sessions auto-closed due to timeout since open (cumulative).
    #[cfg(feature = "metrics")]
    pub(crate) session_auto_closed_total: AtomicU64,
    /// Total requests rate-limited since open (cumulative).
    #[cfg(feature = "metrics")]
    pub(crate) rate_limited_total: AtomicU64,
    /// Current degradation level (0 = healthy, 1 = warn, 2 = shed, 3 = critical).
    #[cfg(feature = "metrics")]
    pub(crate) degradation_level: AtomicU64,
}

2. Instrument session lifecycle

In tidal/src/db/sessions.rs:

// start_session():
#[cfg(feature = "metrics")]
self.metrics.active_sessions.fetch_add(1, Ordering::Relaxed);

// close_session():
#[cfg(feature = "metrics")]
{
    // Assumes each session is closed exactly once. AtomicU64::fetch_sub wraps,
    // so a second decrement at 0 would leave the gauge at u64::MAX.
    self.metrics.active_sessions.fetch_sub(1, Ordering::Relaxed);
    self.metrics.closed_sessions_total.fetch_add(1, Ordering::Relaxed);
}

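// auto_close_expired_sessions() (existing timeout reaper), once per expired session.
// If the reaper delegates to close_session(), bump only session_auto_closed_total
// here so the other two metrics are not double-counted.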
#[cfg(feature = "metrics")]
{
    self.metrics.active_sessions.fetch_sub(1, Ordering::Relaxed);
    self.metrics.closed_sessions_total.fetch_add(1, Ordering::Relaxed);
    self.metrics.session_auto_closed_total.fetch_add(1, Ordering::Relaxed);
}
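
The decrements above are safe only while every session is closed exactly once. If an explicit close and the timeout reaper could ever race on the same session, one defensive option is a decrement that refuses to go below zero. The sketch below illustrates this with a hypothetical helper, `saturating_dec`, that is not part of the existing codebase:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Decrements a gauge without wrapping below zero.
/// Returns `true` if the gauge was decremented, `false` if it was already 0.
/// (Hypothetical helper, shown for illustration only.)
pub fn saturating_dec(gauge: &AtomicU64) -> bool {
    gauge
        // checked_sub returns None at 0, which makes fetch_update leave the
        // value untouched and return Err.
        .fetch_update(Ordering::Relaxed, Ordering::Relaxed, |v| v.checked_sub(1))
        .is_ok()
}
```

A `false` return signals a double close, which could be logged rather than silently corrupting the gauge.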

3. Instrument rate limiter

In the m7p2 rate-limiting path (wherever a request is rejected due to load):

#[cfg(feature = "metrics")]
self.metrics.rate_limited_total.fetch_add(1, Ordering::Relaxed);
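
To make the placement concrete, here is a minimal sketch of an admission check that bumps the counter only on the rejection branch. The function `admit` and its `in_flight` / `limit` parameters are assumptions standing in for whatever the m7p2 limiter actually tracks:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical admission check; not the m7p2 implementation.
pub fn admit(
    in_flight: u64,
    limit: u64,
    rate_limited_total: &AtomicU64,
) -> Result<(), &'static str> {
    if in_flight >= limit {
        // Count rejections only, so the counter's rate is the rejection rate.
        rate_limited_total.fetch_add(1, Ordering::Relaxed);
        return Err("rate limited");
    }
    Ok(())
}
```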

4. Update degradation level gauge

In the m7p2 degradation level setter:

#[cfg(feature = "metrics")]
self.metrics.degradation_level.store(new_level as u64, Ordering::Relaxed);
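
The `as u64` cast assumes `DegradationLevel` is a fieldless enum whose discriminants match the 0 to 3 scale documented on the gauge. The sketch below uses a stand-in enum with assumed variant names (the real type comes from m7p2) to show how explicit discriminants pin the exported values:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical stand-in for the m7p2 `DegradationLevel`; variant names are assumed.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
#[repr(u8)]
pub enum DegradationLevel {
    Healthy = 0,
    Warn = 1,
    Shed = 2,
    Critical = 3,
}

/// Stores the level in the gauge. Explicit discriminants keep the exported
/// numbers stable even if the variants are later reordered.
pub fn record_level(gauge: &AtomicU64, level: DegradationLevel) {
    gauge.store(level as u64, Ordering::Relaxed);
}
```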

5. Render in Prometheus format

Extend MetricsState::render_prometheus():

// Sessions
write_gauge(&mut out, "tidaldb_active_sessions",
    "Number of currently active agent sessions",
    self.active_sessions.load(Ordering::Relaxed) as f64);

write_counter(&mut out, "tidaldb_closed_sessions_total",
    "Total agent sessions closed since open",
    self.closed_sessions_total.load(Ordering::Relaxed) as f64);

write_counter(&mut out, "tidaldb_session_auto_closed_total",
    "Total agent sessions auto-closed due to timeout",
    self.session_auto_closed_total.load(Ordering::Relaxed) as f64);

// Rate limiting
write_counter(&mut out, "tidaldb_rate_limited_total",
    "Total requests rate-limited due to overload",
    self.rate_limited_total.load(Ordering::Relaxed) as f64);

// Degradation
write_gauge(&mut out, "tidaldb_degradation_level",
    "Current degradation level (0=healthy, 1=warn, 2=shed, 3=critical)",
    self.degradation_level.load(Ordering::Relaxed) as f64);
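
`write_gauge` and `write_counter` are presumably the helpers established by task-01 and are not defined in this document. For reference, a minimal sketch of what such helpers might look like under the Prometheus text exposition format follows; the signatures are inferred from the call sites above and the bodies are an assumption:

```rust
use std::fmt::Write;

// Emits the HELP line, the TYPE line, and one unlabeled sample.
fn write_metric(out: &mut String, name: &str, kind: &str, help: &str, value: f64) {
    // Writing to a String cannot fail, so the results are ignored.
    let _ = writeln!(out, "# HELP {name} {help}");
    let _ = writeln!(out, "# TYPE {name} {kind}");
    let _ = writeln!(out, "{name} {value}");
}

pub fn write_gauge(out: &mut String, name: &str, help: &str, value: f64) {
    write_metric(out, name, "gauge", help, value);
}

pub fn write_counter(out: &mut String, name: &str, help: &str, value: f64) {
    write_metric(out, name, "counter", help, value);
}
```

With this formatting, whole-number values render without a fractional part (for example `tidaldb_degradation_level 2`), which is what the tests can match against.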

6. Metric names (string literals)

| Metric name | Type | Description |
|---|---|---|
| tidaldb_active_sessions | gauge | Number of currently active agent sessions |
| tidaldb_closed_sessions_total | counter | Total sessions closed since open |
| tidaldb_session_auto_closed_total | counter | Total sessions auto-closed due to timeout |
| tidaldb_rate_limited_total | counter | Total requests rate-limited due to overload |
| tidaldb_degradation_level | gauge | Current degradation level (0-3) |

Acceptance Criteria

  • MetricsState extended with 5 AtomicU64 fields (3 counters, 2 gauges), all #[cfg(feature = "metrics")]
  • tidaldb_active_sessions incremented on start_session, decremented on close_session and auto-close
  • tidaldb_closed_sessions_total incremented on every session close (explicit or timeout-based)
  • tidaldb_session_auto_closed_total incremented only on timeout-based auto-close
  • tidaldb_rate_limited_total incremented on every rate-limited rejection
  • tidaldb_degradation_level updated when the degradation level changes
  • /metrics endpoint renders all 5 new metrics in valid Prometheus format
  • cargo clippy -- -D warnings and cargo fmt --check pass

Test Strategy

// The new fields only exist when the `metrics` feature is enabled.
#[cfg(all(test, feature = "metrics"))]
mod tests {
    use super::*;

    #[test]
    fn active_sessions_tracks_lifecycle() {
        let state = MetricsState::new();
        state.active_sessions.fetch_add(1, Ordering::Relaxed);
        state.active_sessions.fetch_add(1, Ordering::Relaxed);
        assert_eq!(state.active_sessions.load(Ordering::Relaxed), 2);
        state.active_sessions.fetch_sub(1, Ordering::Relaxed);
        assert_eq!(state.active_sessions.load(Ordering::Relaxed), 1);
    }

    #[test]
    fn degradation_level_renders_correctly() {
        let state = MetricsState::new();
        state.degradation_level.store(2, Ordering::Relaxed);
        let output = state.render_prometheus();
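        // Parse the sample line itself; a bare `" 2"` could match any other metric.
        let value = output
            .lines()
            .find_map(|l| l.strip_prefix("tidaldb_degradation_level "))
            .and_then(|v| v.trim().parse::<f64>().ok());
        assert_eq!(value, Some(2.0));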
    }
}

Integration test:

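#[cfg(feature = "metrics")]
#[test]
fn session_metrics_increment_on_start_close() {
    // Reads a sample value by metric name, skipping `# HELP` / `# TYPE` lines.
    fn sample(prom: &str, name: &str) -> Option<f64> {
        prom.lines().find_map(|l| {
            l.strip_prefix(name)?.strip_prefix(' ')?.trim().parse::<f64>().ok()
        })
    }

    // Assumes the test DB starts with no active sessions.
    let db = make_test_db_with_sessions();
    let sid = db.start_session(1, &AgentId::new("test").unwrap(), "default").unwrap();

    let prom = db.metrics().render_prometheus();
    assert_eq!(sample(&prom, "tidaldb_active_sessions"), Some(1.0));

    db.close_session(sid).unwrap();
    let prom = db.metrics().render_prometheus();
    assert_eq!(sample(&prom, "tidaldb_active_sessions"), Some(0.0));
    assert_eq!(sample(&prom, "tidaldb_closed_sessions_total"), Some(1.0));
}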