# Task 04: Session + Cohort + Degradation Metrics

## Delivers

Prometheus gauges and counters for session lifecycle (active, closed, auto-closed), cohort ledger size, degradation level, and rate-limiting activity. These metrics give operators visibility into agent session health, cohort engine load, and system stress without reading application logs.

## Complexity: S

## Dependencies

- task-01 complete (establishes instrumentation pattern)
- m7p2 complete (provides `DegradationLevel` and rate-limiting infrastructure)
- `tidal/src/db/metrics.rs` -- `MetricsState` to extend
- `tidal/src/db/sessions.rs` -- session start/close paths
- `tidal/src/cohort/mod.rs` -- `CohortSignalLedger` and `CohortRegistry`

## Technical Design

### 1. Add atomic counters to MetricsState

In `tidal/src/db/metrics.rs`:

```rust
pub struct MetricsState {
    // ... existing + task-02 + task-03 fields ...

    // ── Session + cohort + degradation metrics (m7p4) ──────────────────
    /// Number of currently active sessions.
    #[cfg(feature = "metrics")]
    pub(crate) active_sessions: AtomicU64,
    /// Total sessions closed since open (cumulative).
    #[cfg(feature = "metrics")]
    pub(crate) closed_sessions_total: AtomicU64,
    /// Total sessions auto-closed due to timeout since open (cumulative).
    #[cfg(feature = "metrics")]
    pub(crate) session_auto_closed_total: AtomicU64,
    /// Total requests rate-limited since open (cumulative).
    #[cfg(feature = "metrics")]
    pub(crate) rate_limited_total: AtomicU64,
    /// Current degradation level (0 = healthy, 1 = warn, 2 = shed, 3 = critical).
    #[cfg(feature = "metrics")]
    pub(crate) degradation_level: AtomicU64,
}
```

### 2. Instrument session lifecycle

In `tidal/src/db/sessions.rs`:

```rust
// start_session():
#[cfg(feature = "metrics")]
self.metrics.active_sessions.fetch_add(1, Ordering::Relaxed);

// close_session():
#[cfg(feature = "metrics")]
{
    // Decrement will not underflow because active_sessions >= 1 when a session exists.
    self.metrics.active_sessions.fetch_sub(1, Ordering::Relaxed);
    self.metrics.closed_sessions_total.fetch_add(1, Ordering::Relaxed);
}

// auto_close_expired_sessions() (existing timeout reaper):
#[cfg(feature = "metrics")]
{
    self.metrics.active_sessions.fetch_sub(1, Ordering::Relaxed);
    self.metrics.closed_sessions_total.fetch_add(1, Ordering::Relaxed);
    self.metrics.session_auto_closed_total.fetch_add(1, Ordering::Relaxed);
}
```

### 3. Instrument rate limiter

In the m7p2 rate-limiting path (wherever a request is rejected due to load):

```rust
#[cfg(feature = "metrics")]
self.metrics.rate_limited_total.fetch_add(1, Ordering::Relaxed);
```

### 4. Update degradation level gauge

In the m7p2 degradation level setter:

```rust
#[cfg(feature = "metrics")]
self.metrics.degradation_level.store(new_level as u64, Ordering::Relaxed);
```

### 5. Render in Prometheus format

Extend `MetricsState::render_prometheus()`:

```rust
// Sessions
write_gauge(
    &mut out,
    "tidaldb_active_sessions",
    "Number of currently active agent sessions",
    self.active_sessions.load(Ordering::Relaxed) as f64,
);
write_counter(
    &mut out,
    "tidaldb_closed_sessions_total",
    "Total agent sessions closed since open",
    self.closed_sessions_total.load(Ordering::Relaxed) as f64,
);
write_counter(
    &mut out,
    "tidaldb_session_auto_closed_total",
    "Total agent sessions auto-closed due to timeout",
    self.session_auto_closed_total.load(Ordering::Relaxed) as f64,
);

// Rate limiting
write_counter(
    &mut out,
    "tidaldb_rate_limited_total",
    "Total requests rate-limited due to overload",
    self.rate_limited_total.load(Ordering::Relaxed) as f64,
);

// Degradation
write_gauge(
    &mut out,
    "tidaldb_degradation_level",
    "Current degradation level (0=healthy, 1=warn, 2=shed, 3=critical)",
    self.degradation_level.load(Ordering::Relaxed) as f64,
);
```

### 6. Metric names (string literals)

| Metric name | Type | Description |
|---|---|---|
| `tidaldb_active_sessions` | gauge | Number of currently active agent sessions |
| `tidaldb_closed_sessions_total` | counter | Total sessions closed since open |
| `tidaldb_session_auto_closed_total` | counter | Total sessions auto-closed due to timeout |
| `tidaldb_rate_limited_total` | counter | Total requests rate-limited due to overload |
| `tidaldb_degradation_level` | gauge | Current degradation level (0-3) |

## Acceptance Criteria

- [ ] `MetricsState` extended with 5 atomic counters, all `#[cfg(feature = "metrics")]`
- [ ] `tidaldb_active_sessions` incremented on `start_session`, decremented on `close_session` and auto-close
- [ ] `tidaldb_closed_sessions_total` incremented on every session close
- [ ] `tidaldb_session_auto_closed_total` incremented only on timeout-based auto-close
- [ ] `tidaldb_rate_limited_total` incremented on every rate-limited rejection
- [ ] `tidaldb_degradation_level` updated when the degradation level changes
- [ ] `/metrics` endpoint renders all 5 new metrics in valid Prometheus format
- [ ] `cargo clippy -- -D warnings` and `cargo fmt --check` pass

## Test Strategy

```rust
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn active_sessions_tracks_lifecycle() {
        let state = MetricsState::new();
        state.active_sessions.fetch_add(1, Ordering::Relaxed);
        state.active_sessions.fetch_add(1, Ordering::Relaxed);
        assert_eq!(state.active_sessions.load(Ordering::Relaxed), 2);
        state.active_sessions.fetch_sub(1, Ordering::Relaxed);
        assert_eq!(state.active_sessions.load(Ordering::Relaxed), 1);
    }

    #[test]
    fn degradation_level_renders_correctly() {
        let state = MetricsState::new();
        state.degradation_level.store(2, Ordering::Relaxed);
        let output = state.render_prometheus();
        assert!(output.contains("tidaldb_degradation_level"));
        assert!(output.contains(" 2"));
    }
}
```

Integration test:

```rust
#[test]
fn session_metrics_increment_on_start_close() {
    let db = make_test_db_with_sessions();
    let sid = db.start_session(1, &AgentId::new("test").unwrap(), "default").unwrap();
    let prom = db.metrics().render_prometheus();
    assert!(prom.contains("tidaldb_active_sessions"));
    db.close_session(sid).unwrap();
    let prom = db.metrics().render_prometheus();
    assert!(prom.contains("tidaldb_closed_sessions_total"));
}
```
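
The tests above check only that a metric name, or the substring `" 2"`, appears in the rendered output. They would still pass if a counter never moved. One possible tightening is to parse the sample value for a given metric and assert on it directly. The sketch below is a minimal illustration. `metric_value` is a hypothetical test helper, not part of the existing codebase, and it assumes `render_prometheus()` emits unlabeled samples as `name value` lines in the standard Prometheus text exposition format.

```rust
/// Hypothetical test helper: returns the value of an unlabeled sample
/// from Prometheus text exposition output, or `None` if it is absent.
fn metric_value(prom: &str, name: &str) -> Option<f64> {
    prom.lines()
        .map(str::trim)
        // Skip `# HELP` / `# TYPE` comment lines and blank lines.
        .filter(|line| !line.is_empty() && !line.starts_with('#'))
        .find_map(|line| {
            let mut parts = line.split_whitespace();
            // Exact match on the name, so `foo` never matches `foo_total`.
            if parts.next()? != name {
                return None;
            }
            parts.next()?.parse::<f64>().ok()
        })
}

fn main() {
    // Hand-written sample; the real `render_prometheus()` output may differ.
    let prom = "\
# HELP tidaldb_active_sessions Number of currently active agent sessions
# TYPE tidaldb_active_sessions gauge
tidaldb_active_sessions 1
# HELP tidaldb_degradation_level Current degradation level
# TYPE tidaldb_degradation_level gauge
tidaldb_degradation_level 2
";

    assert_eq!(metric_value(prom, "tidaldb_active_sessions"), Some(1.0));
    assert_eq!(metric_value(prom, "tidaldb_degradation_level"), Some(2.0));
    // A metric that was never rendered is reported as missing.
    assert_eq!(metric_value(prom, "tidaldb_rate_limited_total"), None);
}
```

With a helper like this, the integration test could assert that `tidaldb_active_sessions` is `1.0` after `start_session` and `0.0` after `close_session`, and that `tidaldb_closed_sessions_total` is `1.0`, instead of checking only that the names are present.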