Task 04: Session + Cohort + Degradation Metrics
Delivers
Prometheus gauges and counters for session lifecycle (active, closed, auto-closed), cohort ledger size, degradation level, and rate-limiting activity. These metrics give operators visibility into agent session health, cohort engine load, and system stress without reading application logs.
Complexity: S
Dependencies
- task-01 complete (establishes instrumentation pattern)
- m7p2 complete (provides DegradationLevel and rate-limiting infrastructure)
- tidal/src/db/metrics.rs -- MetricsState to extend
- tidal/src/db/sessions.rs -- session start/close paths
- tidal/src/cohort/mod.rs -- CohortSignalLedger and CohortRegistry
Technical Design
1. Add atomic counters to MetricsState
In tidal/src/db/metrics.rs:
pub struct MetricsState {
    // ... existing + task-02 + task-03 fields ...

    // ── Session + cohort + degradation metrics (m7p4) ──────────────────
    /// Number of currently active sessions.
    #[cfg(feature = "metrics")]
    pub(crate) active_sessions: AtomicU64,
    /// Total sessions closed since open (cumulative).
    #[cfg(feature = "metrics")]
    pub(crate) closed_sessions_total: AtomicU64,
    /// Total sessions auto-closed due to timeout since open (cumulative).
    #[cfg(feature = "metrics")]
    pub(crate) session_auto_closed_total: AtomicU64,
    /// Total requests rate-limited since open (cumulative).
    #[cfg(feature = "metrics")]
    pub(crate) rate_limited_total: AtomicU64,
    /// Current degradation level (0 = healthy, 1 = warn, 2 = shed, 3 = critical).
    #[cfg(feature = "metrics")]
    pub(crate) degradation_level: AtomicU64,
}
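As a minimal sketch of how the new fields might be zero-initialized, the standalone struct below mirrors the five fields. The names SessionMetrics, new() and snapshot() are hypothetical; the real fields live in MetricsState behind the metrics feature gate, which is left out here so the snippet compiles on its own.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical stand-in for the m7p4 fields of `MetricsState`.
#[derive(Debug, Default)]
pub struct SessionMetrics {
    pub active_sessions: AtomicU64,
    pub closed_sessions_total: AtomicU64,
    pub session_auto_closed_total: AtomicU64,
    pub rate_limited_total: AtomicU64,
    pub degradation_level: AtomicU64,
}

impl SessionMetrics {
    /// All fields start at zero because `AtomicU64::default()` is 0.
    pub fn new() -> Self {
        Self::default()
    }

    /// Reads all five values in declaration order.
    /// Each load is independent, so this is not an atomic snapshot of the set.
    pub fn snapshot(&self) -> [u64; 5] {
        [
            self.active_sessions.load(Ordering::Relaxed),
            self.closed_sessions_total.load(Ordering::Relaxed),
            self.session_auto_closed_total.load(Ordering::Relaxed),
            self.rate_limited_total.load(Ordering::Relaxed),
            self.degradation_level.load(Ordering::Relaxed),
        ]
    }
}
```

Relaxed ordering is enough here because each value is read independently for reporting and no other memory access is synchronized through it.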
2. Instrument session lifecycle
In tidal/src/db/sessions.rs:
// start_session():
#[cfg(feature = "metrics")]
self.metrics.active_sessions.fetch_add(1, Ordering::Relaxed);
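// close_session(), after the session has actually been removed:
#[cfg(feature = "metrics")]
{
    // Must run exactly once per open session. fetch_sub on an AtomicU64 wraps
    // at zero, so a double close must be rejected before reaching this point.
    self.metrics.active_sessions.fetch_sub(1, Ordering::Relaxed);
    self.metrics.closed_sessions_total.fetch_add(1, Ordering::Relaxed);
}
// auto_close_expired_sessions() (existing timeout reaper), once per session reaped.
// If the reaper closes sessions by calling close_session(), keep only the
// session_auto_closed_total increment here to avoid double counting.
#[cfg(feature = "metrics")]
{
    self.metrics.active_sessions.fetch_sub(1, Ordering::Relaxed);
    self.metrics.closed_sessions_total.fetch_add(1, Ordering::Relaxed);
    self.metrics.session_auto_closed_total.fetch_add(1, Ordering::Relaxed);
}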
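If the close paths cannot guarantee a single decrement per session, one defensive option is a decrement that stops at zero. The sketch below is an alternative to the plain fetch_sub shown above, not part of the existing code; dec_saturating is a hypothetical helper name.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Decrements `gauge` by one unless it is already zero.
/// Returns `true` if the decrement happened, `false` if the gauge was at zero.
pub fn dec_saturating(gauge: &AtomicU64) -> bool {
    gauge
        // The closure returns None at zero, which makes fetch_update return Err
        // and leaves the stored value untouched.
        .fetch_update(Ordering::Relaxed, Ordering::Relaxed, |v| v.checked_sub(1))
        .is_ok()
}
```

A `false` return points to a bookkeeping bug such as a double close, so a caller could log it or count it separately instead of letting the gauge wrap to u64::MAX.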
3. Instrument rate limiter
In the m7p2 rate-limiting path (wherever a request is rejected due to load):
#[cfg(feature = "metrics")]
self.metrics.rate_limited_total.fetch_add(1, Ordering::Relaxed);
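The m7p2 limiter itself is outside this task, so the following is only an illustration of where the increment sits relative to the reject decision. Limiter, budget and try_admit are hypothetical names; the point is that the counter moves only on the rejection branch.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical admission check standing in for the m7p2 rate limiter.
pub struct Limiter {
    /// Remaining permits (hypothetical; the real limiter may work differently).
    pub budget: AtomicU64,
    /// Stands in for `MetricsState::rate_limited_total`.
    pub rate_limited_total: AtomicU64,
}

impl Limiter {
    /// Takes one permit if available; otherwise counts a rejection.
    pub fn try_admit(&self) -> bool {
        let admitted = self
            .budget
            .fetch_update(Ordering::Relaxed, Ordering::Relaxed, |b| b.checked_sub(1))
            .is_ok();
        if !admitted {
            // Counted once per rejected request, never for admitted ones.
            self.rate_limited_total.fetch_add(1, Ordering::Relaxed);
        }
        admitted
    }
}
```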
4. Update degradation level gauge
In the m7p2 degradation level setter:
#[cfg(feature = "metrics")]
self.metrics.degradation_level.store(new_level as u64, Ordering::Relaxed);
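The cast above relies on DegradationLevel being a fieldless enum whose discriminants match the documented 0 to 3 mapping. A minimal sketch under that assumption follows; the variant names are taken from the gauge's doc comment, and the actual m7p2 definition may differ.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Assumed shape of the m7p2 enum, with explicit discriminants so that
/// `level as u64` always yields the documented gauge value.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u64)]
pub enum DegradationLevel {
    Healthy = 0,
    Warn = 1,
    Shed = 2,
    Critical = 3,
}

impl DegradationLevel {
    /// Maps a gauge value back to a level; unknown values yield `None`.
    pub fn from_u64(v: u64) -> Option<Self> {
        match v {
            0 => Some(Self::Healthy),
            1 => Some(Self::Warn),
            2 => Some(Self::Shed),
            3 => Some(Self::Critical),
            _ => None,
        }
    }
}

/// Hypothetical setter hook: stores the numeric level in the gauge.
pub fn set_level(gauge: &AtomicU64, level: DegradationLevel) {
    gauge.store(level as u64, Ordering::Relaxed);
}
```

Pinning the discriminants explicitly keeps the exported values stable if variants are later reordered.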
5. Render in Prometheus format
Extend MetricsState::render_prometheus():
// Sessions
write_gauge(&mut out, "tidaldb_active_sessions",
    "Number of currently active agent sessions",
    self.active_sessions.load(Ordering::Relaxed) as f64);
write_counter(&mut out, "tidaldb_closed_sessions_total",
    "Total agent sessions closed since open",
    self.closed_sessions_total.load(Ordering::Relaxed) as f64);
write_counter(&mut out, "tidaldb_session_auto_closed_total",
    "Total agent sessions auto-closed due to timeout",
    self.session_auto_closed_total.load(Ordering::Relaxed) as f64);
// Rate limiting
write_counter(&mut out, "tidaldb_rate_limited_total",
    "Total requests rate-limited due to overload",
    self.rate_limited_total.load(Ordering::Relaxed) as f64);
// Degradation
write_gauge(&mut out, "tidaldb_degradation_level",
    "Current degradation level (0=healthy, 1=warn, 2=shed, 3=critical)",
    self.degradation_level.load(Ordering::Relaxed) as f64);
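write_gauge and write_counter come from the task-01 instrumentation pattern and are not defined in this document. For orientation, here is one plausible shape, assuming out is a String and each metric is emitted as a HELP line, a TYPE line and one unlabeled sample line in the Prometheus text format. The shared write_metric helper is hypothetical.

```rust
use std::fmt::Write;

/// Hypothetical shared helper: emits HELP, TYPE and one sample line.
fn write_metric(out: &mut String, name: &str, help: &str, kind: &str, value: f64) {
    // Writing into a String cannot fail, so the Results are discarded.
    let _ = writeln!(out, "# HELP {} {}", name, help);
    let _ = writeln!(out, "# TYPE {} {}", name, kind);
    // f64's Display prints integral values without a fractional part (2.0 -> "2").
    let _ = writeln!(out, "{} {}", name, value);
}

/// Assumed signature, matching the call sites above.
pub fn write_gauge(out: &mut String, name: &str, help: &str, value: f64) {
    write_metric(out, name, help, "gauge", value);
}

/// Assumed signature, matching the call sites above.
pub fn write_counter(out: &mut String, name: &str, help: &str, value: f64) {
    write_metric(out, name, help, "counter", value);
}
```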
6. Metric names (string literals)
| Metric name | Type | Description |
|---|---|---|
| tidaldb_active_sessions | gauge | Number of currently active agent sessions |
| tidaldb_closed_sessions_total | counter | Total sessions closed since open |
| tidaldb_session_auto_closed_total | counter | Total sessions auto-closed due to timeout |
| tidaldb_rate_limited_total | counter | Total requests rate-limited due to overload |
| tidaldb_degradation_level | gauge | Current degradation level (0-3) |
Acceptance Criteria
- MetricsState extended with 5 atomic fields, all #[cfg(feature = "metrics")]
- tidaldb_active_sessions incremented on start_session, decremented on close_session and auto-close
- tidaldb_closed_sessions_total incremented on every session close
- tidaldb_session_auto_closed_total incremented only on timeout-based auto-close
- tidaldb_rate_limited_total incremented on every rate-limited rejection
- tidaldb_degradation_level updated when the degradation level changes
- /metrics endpoint renders all 5 new metrics in valid Prometheus format
- cargo clippy -- -D warnings and cargo fmt --check pass
Test Strategy
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn active_sessions_tracks_lifecycle() {
        let state = MetricsState::new();
        state.active_sessions.fetch_add(1, Ordering::Relaxed);
        state.active_sessions.fetch_add(1, Ordering::Relaxed);
        assert_eq!(state.active_sessions.load(Ordering::Relaxed), 2);
        state.active_sessions.fetch_sub(1, Ordering::Relaxed);
        assert_eq!(state.active_sessions.load(Ordering::Relaxed), 1);
    }

    #[test]
    fn degradation_level_renders_correctly() {
        let state = MetricsState::new();
        state.degradation_level.store(2, Ordering::Relaxed);
        let output = state.render_prometheus();
        // Match the sample line itself. The help text contains " 2=shed", so a bare
        // " 2" check would pass for any stored value. Assumes write_gauge renders
        // integral values as "<name> <value>".
        assert!(output.contains("tidaldb_degradation_level 2"));
    }
}
Integration test:
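#[test]
fn session_metrics_increment_on_start_close() {
    let db = make_test_db_with_sessions();
    let sid = db.start_session(1, &AgentId::new("test").unwrap(), "default").unwrap();
    let prom = db.metrics().render_prometheus();
    // Check sample values, not just names: the names are rendered even at zero.
    // Assumes the test db starts with no open sessions and "<name> <value>" sample lines.
    assert!(prom.contains("tidaldb_active_sessions 1"));
    db.close_session(sid).unwrap();
    let prom = db.metrics().render_prometheus();
    assert!(prom.contains("tidaldb_active_sessions 0"));
    assert!(prom.contains("tidaldb_closed_sessions_total 1"));
}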
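Substring assertions on rendered output can match help text by accident. A small parsing helper makes the tests compare actual sample values instead. The sketch below assumes unlabeled metrics rendered as "<name> <value>" lines; metric_value is a hypothetical test helper, not an existing function.

```rust
/// Returns the sample value for `name`, skipping `# HELP` / `# TYPE` comment lines.
/// Only handles unlabeled samples of the form `<name> <value>`.
pub fn metric_value(output: &str, name: &str) -> Option<f64> {
    output
        .lines()
        .filter(|line| !line.starts_with('#'))
        .find_map(|line| {
            let mut parts = line.split_whitespace();
            match (parts.next(), parts.next()) {
                // Require an exact name match so a longer metric name sharing
                // the same prefix is not picked up.
                (Some(n), Some(v)) if n == name => v.parse::<f64>().ok(),
                _ => None,
            }
        })
}
```

With this helper a test can assert, for example, that metric_value(&prom, "tidaldb_active_sessions") equals Some(1.0) after one start_session call.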