tidaldb/docs/planning/milestone-7/phase-4/task-04-session-cohort-degradation-metrics.md
2026-02-23 22:41:16 -07:00

# Task 04: Session + Cohort + Degradation Metrics
## Delivers
Prometheus gauges and counters for session lifecycle (active, closed, auto-closed), cohort ledger size, degradation level, and rate-limiting activity. These metrics give operators visibility into agent session health, cohort engine load, and system stress without reading application logs.
## Complexity: S
## Dependencies
- task-01 complete (establishes instrumentation pattern)
- m7p2 complete (provides `DegradationLevel` and rate-limiting infrastructure)
- `tidal/src/db/metrics.rs` -- `MetricsState` to extend
- `tidal/src/db/sessions.rs` -- session start/close paths
- `tidal/src/cohort/mod.rs` -- `CohortSignalLedger` and `CohortRegistry`
## Technical Design
### 1. Add atomic counters to MetricsState
In `tidal/src/db/metrics.rs`:
```rust
pub struct MetricsState {
    // ... existing + task-02 + task-03 fields ...

    // ── Session + cohort + degradation metrics (m7p4) ──────────────────
    /// Number of currently active sessions.
    #[cfg(feature = "metrics")]
    pub(crate) active_sessions: AtomicU64,
    /// Total sessions closed since open (cumulative).
    #[cfg(feature = "metrics")]
    pub(crate) closed_sessions_total: AtomicU64,
    /// Total sessions auto-closed due to timeout since open (cumulative).
    #[cfg(feature = "metrics")]
    pub(crate) session_auto_closed_total: AtomicU64,
    /// Total requests rate-limited since open (cumulative).
    #[cfg(feature = "metrics")]
    pub(crate) rate_limited_total: AtomicU64,
    /// Current degradation level (0 = healthy, 1 = warn, 2 = shed, 3 = critical).
    #[cfg(feature = "metrics")]
    pub(crate) degradation_level: AtomicU64,
}
```
### 2. Instrument session lifecycle
In `tidal/src/db/sessions.rs`:
```rust
// start_session():
#[cfg(feature = "metrics")]
self.metrics.active_sessions.fetch_add(1, Ordering::Relaxed);

// close_session():
#[cfg(feature = "metrics")]
{
    // Assumes each session is closed exactly once, so active_sessions >= 1 here.
    // `fetch_sub` wraps on underflow, so a double close (for example, a race with
    // the timeout reaper) would leave the gauge near u64::MAX.
    self.metrics.active_sessions.fetch_sub(1, Ordering::Relaxed);
    self.metrics.closed_sessions_total.fetch_add(1, Ordering::Relaxed);
}

// auto_close_expired_sessions() (existing timeout reaper), once per reaped session:
#[cfg(feature = "metrics")]
{
    self.metrics.active_sessions.fetch_sub(1, Ordering::Relaxed);
    self.metrics.closed_sessions_total.fetch_add(1, Ordering::Relaxed);
    self.metrics.session_auto_closed_total.fetch_add(1, Ordering::Relaxed);
}
```
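If the "closed exactly once" invariant cannot be guaranteed at every call site, the gauge decrement could be made saturating instead. The following is a minimal sketch, not part of the existing design: `saturating_dec` is a hypothetical helper name, and it relies only on the standard library's `AtomicU64::fetch_update`.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical helper: decrements `gauge` by one, but never below zero.
/// Returns `true` if the value was decremented, `false` if it was already 0.
pub fn saturating_dec(gauge: &AtomicU64) -> bool {
    gauge
        // `checked_sub` yields `None` at zero, which makes `fetch_update`
        // leave the value untouched and return `Err`.
        .fetch_update(Ordering::Relaxed, Ordering::Relaxed, |v| v.checked_sub(1))
        .is_ok()
}
```

A `false` return would indicate a double close, which the caller could log rather than letting the gauge wrap.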
### 3. Instrument rate limiter
In the m7p2 rate-limiting path (wherever a request is rejected due to load):
```rust
#[cfg(feature = "metrics")]
self.metrics.rate_limited_total.fetch_add(1, Ordering::Relaxed);
```
### 4. Update degradation level gauge
In the m7p2 degradation level setter:
```rust
#[cfg(feature = "metrics")]
self.metrics.degradation_level.store(new_level as u64, Ordering::Relaxed);
```
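The `new_level as u64` cast only yields the documented 0-3 values if the enum's discriminants match that mapping. The sketch below illustrates one way to pin them down. The real `DegradationLevel` comes from m7p2, so the variant names here are assumptions derived from the gauge description, and `publish_level` is a hypothetical helper.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical stand-in for the m7p2 `DegradationLevel`; variant names are assumed.
/// Explicit discriminants keep the exported gauge values stable if variants are reordered.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
#[repr(u8)]
pub enum DegradationLevel {
    Healthy = 0,
    Warn = 1,
    Shed = 2,
    Critical = 3,
}

/// Hypothetical helper: publishes the level's discriminant to the gauge.
pub fn publish_level(gauge: &AtomicU64, level: DegradationLevel) {
    gauge.store(level as u64, Ordering::Relaxed);
}
```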
### 5. Render in Prometheus format
Extend `MetricsState::render_prometheus()`:
```rust
// Sessions
write_gauge(&mut out, "tidaldb_active_sessions",
"Number of currently active agent sessions",
self.active_sessions.load(Ordering::Relaxed) as f64);
write_counter(&mut out, "tidaldb_closed_sessions_total",
"Total agent sessions closed since open",
self.closed_sessions_total.load(Ordering::Relaxed) as f64);
write_counter(&mut out, "tidaldb_session_auto_closed_total",
"Total agent sessions auto-closed due to timeout",
self.session_auto_closed_total.load(Ordering::Relaxed) as f64);
// Rate limiting
write_counter(&mut out, "tidaldb_rate_limited_total",
"Total requests rate-limited due to overload",
self.rate_limited_total.load(Ordering::Relaxed) as f64);
// Degradation
write_gauge(&mut out, "tidaldb_degradation_level",
"Current degradation level (0=healthy, 1=warn, 2=shed, 3=critical)",
self.degradation_level.load(Ordering::Relaxed) as f64);
```
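For reference, each call above is expected to emit a `# HELP` line, a `# TYPE` line, and one sample line in the Prometheus text exposition format. The actual `write_gauge` and `write_counter` helpers are presumably established in task-01, so the sketch below only illustrates what they might look like. The bodies and the shared `write_metric` function are assumptions.

```rust
use std::fmt::Write;

/// Hypothetical shared body: appends one unlabeled metric as
/// a HELP line, a TYPE line, then the sample line.
fn write_metric(out: &mut String, kind: &str, name: &str, help: &str, value: f64) {
    // Writing to a `String` cannot fail, so the results are ignored.
    let _ = writeln!(out, "# HELP {} {}", name, help);
    let _ = writeln!(out, "# TYPE {} {}", name, kind);
    let _ = writeln!(out, "{} {}", name, value);
}

/// Assumed shape of the task-01 gauge helper.
pub fn write_gauge(out: &mut String, name: &str, help: &str, value: f64) {
    write_metric(out, "gauge", name, help, value);
}

/// Assumed shape of the task-01 counter helper.
pub fn write_counter(out: &mut String, name: &str, help: &str, value: f64) {
    write_metric(out, "counter", name, help, value);
}
```

With this shape, a whole-number `f64` such as `2.0` is rendered as `2`, because Rust's `Display` for `f64` omits a trailing `.0`.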
### 6. Metric names (string literals)
| Metric name | Type | Description |
|---|---|---|
| `tidaldb_active_sessions` | gauge | Number of currently active agent sessions |
| `tidaldb_closed_sessions_total` | counter | Total sessions closed since open |
| `tidaldb_session_auto_closed_total` | counter | Total sessions auto-closed due to timeout |
| `tidaldb_rate_limited_total` | counter | Total requests rate-limited due to overload |
| `tidaldb_degradation_level` | gauge | Current degradation level (0-3) |
## Acceptance Criteria
- [ ] `MetricsState` extended with 5 atomic counters, all `#[cfg(feature = "metrics")]`
- [ ] `tidaldb_active_sessions` incremented on `start_session`, decremented on `close_session` and auto-close
- [ ] `tidaldb_closed_sessions_total` incremented on every session close
- [ ] `tidaldb_session_auto_closed_total` incremented only on timeout-based auto-close
- [ ] `tidaldb_rate_limited_total` incremented on every rate-limited rejection
- [ ] `tidaldb_degradation_level` updated when the degradation level changes
- [ ] `/metrics` endpoint renders all 5 new metrics in valid Prometheus format
- [ ] `cargo clippy -- -D warnings` and `cargo fmt --check` pass
## Test Strategy
```rust
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn active_sessions_tracks_lifecycle() {
        let state = MetricsState::new();
        state.active_sessions.fetch_add(1, Ordering::Relaxed);
        state.active_sessions.fetch_add(1, Ordering::Relaxed);
        assert_eq!(state.active_sessions.load(Ordering::Relaxed), 2);
        state.active_sessions.fetch_sub(1, Ordering::Relaxed);
        assert_eq!(state.active_sessions.load(Ordering::Relaxed), 1);
    }

    #[test]
    fn degradation_level_renders_correctly() {
        let state = MetricsState::new();
        state.degradation_level.store(2, Ordering::Relaxed);
        let output = state.render_prometheus();
        // Match the sample line itself. If the help text is rendered, it already
        // contains " 2" (in "2=shed"), so `contains(" 2")` would pass for any level.
        assert!(output
            .lines()
            .any(|l| l.starts_with("tidaldb_degradation_level ") && l.trim_end().ends_with(" 2")));
    }
}
```
Integration test:
```rust
#[test]
fn session_metrics_increment_on_start_close() {
    let db = make_test_db_with_sessions();
    let sid = db.start_session(1, &AgentId::new("test").unwrap(), "default").unwrap();
    let prom = db.metrics().render_prometheus();
    assert!(prom.contains("tidaldb_active_sessions"));
    db.close_session(sid).unwrap();
    let prom = db.metrics().render_prometheus();
    assert!(prom.contains("tidaldb_closed_sessions_total"));
}
```
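The integration test above only checks that the metric names appear, which would hold even if the counters never moved. To assert on values, a small parser over the rendered text could be used. The sketch below is one possible approach: `metric_value` is a hypothetical test helper, and it assumes unlabeled samples rendered as `name value` on a single line.

```rust
/// Hypothetical test helper: returns the value of the first sample line for `name`.
/// Comment lines (`# HELP`, `# TYPE`) are skipped. Assumes samples carry no labels.
pub fn metric_value(rendered: &str, name: &str) -> Option<f64> {
    rendered
        .lines()
        .filter(|line| !line.starts_with('#'))
        .find_map(|line| {
            let mut parts = line.split_whitespace();
            match (parts.next(), parts.next()) {
                // Require an exact name match so that a metric whose name is a
                // prefix of another is not confused with it.
                (Some(n), Some(v)) if n == name => v.parse::<f64>().ok(),
                _ => None,
            }
        })
}
```

With such a helper, the test could compare `metric_value(&prom, "tidaldb_active_sessions")` before and after `close_session` instead of relying on `contains`.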