# Task 04: Session + Cohort + Degradation Metrics

## Delivers

Prometheus gauges and counters for session lifecycle (active, closed, auto-closed), cohort ledger size, degradation level, and rate-limiting activity. These metrics give operators visibility into agent session health, cohort engine load, and system stress without reading application logs.

## Complexity: S

## Dependencies

- task-01 complete (establishes instrumentation pattern)
- m7p2 complete (provides `DegradationLevel` and rate-limiting infrastructure)
- `tidal/src/db/metrics.rs` -- `MetricsState` to extend
- `tidal/src/db/sessions.rs` -- session start/close paths
- `tidal/src/cohort/mod.rs` -- `CohortSignalLedger` and `CohortRegistry`

## Technical Design

### 1. Add atomic counters to MetricsState

In `tidal/src/db/metrics.rs`:

```rust
pub struct MetricsState {
    // ... existing + task-02 + task-03 fields ...

    // ── Session + cohort + degradation metrics (m7p4) ──────────────────
    /// Number of currently active sessions.
    #[cfg(feature = "metrics")]
    pub(crate) active_sessions: AtomicU64,
    /// Total sessions closed since open (cumulative).
    #[cfg(feature = "metrics")]
    pub(crate) closed_sessions_total: AtomicU64,
    /// Total sessions auto-closed due to timeout since open (cumulative).
    #[cfg(feature = "metrics")]
    pub(crate) session_auto_closed_total: AtomicU64,
    /// Total requests rate-limited since open (cumulative).
    #[cfg(feature = "metrics")]
    pub(crate) rate_limited_total: AtomicU64,
    /// Current degradation level (0 = healthy, 1 = warn, 2 = shed, 3 = critical).
    #[cfg(feature = "metrics")]
    pub(crate) degradation_level: AtomicU64,
}
```

### 2. Instrument session lifecycle

In `tidal/src/db/sessions.rs`:

```rust
// start_session():
#[cfg(feature = "metrics")]
self.metrics.active_sessions.fetch_add(1, Ordering::Relaxed);

// close_session():
#[cfg(feature = "metrics")]
{
    // The decrement cannot underflow as long as every close is paired with
    // exactly one earlier start_session(), so active_sessions >= 1 here.
    self.metrics.active_sessions.fetch_sub(1, Ordering::Relaxed);
    self.metrics.closed_sessions_total.fetch_add(1, Ordering::Relaxed);
}

// auto_close_expired_sessions() (existing timeout reaper), once per reaped session:
#[cfg(feature = "metrics")]
{
    // If the reaper delegates to close_session(), bump only
    // session_auto_closed_total here to avoid double counting.
    self.metrics.active_sessions.fetch_sub(1, Ordering::Relaxed);
    self.metrics.closed_sessions_total.fetch_add(1, Ordering::Relaxed);
    self.metrics.session_auto_closed_total.fetch_add(1, Ordering::Relaxed);
}
```

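If a session could ever be closed twice (for example, an explicit close racing the timeout reaper), a plain `fetch_sub` would wrap the gauge to `u64::MAX`. The following is a minimal sketch of a saturating decrement. The helper name `gauge_dec_saturating` is hypothetical and not part of the existing codebase.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Decrements a gauge without wrapping below zero.
/// Returns `true` if the gauge was decremented, `false` if it was already 0.
/// (Hypothetical helper, shown for illustration only.)
pub fn gauge_dec_saturating(gauge: &AtomicU64) -> bool {
    // `fetch_update` retries the closure in a compare-and-swap loop.
    // `checked_sub` returns `None` at 0, which aborts the update.
    gauge
        .fetch_update(Ordering::Relaxed, Ordering::Relaxed, |v| v.checked_sub(1))
        .is_ok()
}
```

A `false` return would indicate an unpaired close, which may be worth logging rather than silently ignoring.
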
### 3. Instrument rate limiter

In the m7p2 rate-limiting path (wherever a request is rejected due to load):

```rust
#[cfg(feature = "metrics")]
self.metrics.rate_limited_total.fetch_add(1, Ordering::Relaxed);
```

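The counter should move exactly once per rejection and never on the admit path. The m7p2 limiter is not shown in this task, so the sketch below uses a hypothetical `Limiter` type with an in-flight cap purely to illustrate where the increment sits. All names and the admission policy are assumptions.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical admission check; the real m7p2 limiter may look quite different.
pub struct Limiter {
    pub max_in_flight: u64,
    pub in_flight: AtomicU64,
    pub rate_limited_total: AtomicU64,
}

impl Limiter {
    /// Returns `true` if the request is admitted; otherwise counts the rejection.
    pub fn try_admit(&self) -> bool {
        // Atomically claim a slot only while below the cap.
        let admitted = self
            .in_flight
            .fetch_update(Ordering::Relaxed, Ordering::Relaxed, |n| {
                (n < self.max_in_flight).then(|| n + 1)
            })
            .is_ok();
        if !admitted {
            // Rejection branch only: one increment per rejected request.
            self.rate_limited_total.fetch_add(1, Ordering::Relaxed);
        }
        admitted
    }
}
```
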
### 4. Update degradation level gauge

In the m7p2 degradation level setter:

```rust
#[cfg(feature = "metrics")]
self.metrics.degradation_level.store(new_level as u64, Ordering::Relaxed);
```

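The `as u64` cast only yields the documented 0 to 3 values if the enum's discriminants match that order. The sketch below assumes a fieldless `DegradationLevel` with variants named after the levels listed in the field doc comment. The actual m7p2 definition is not shown here, so the variant names and the `publish_level` helper are hypothetical.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical stand-in for the m7p2 `DegradationLevel`; real variants may differ.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
#[repr(u8)]
pub enum DegradationLevel {
    Healthy = 0,
    Warn = 1,
    Shed = 2,
    Critical = 3,
}

/// Publishes the level to the gauge. Explicit discriminants keep the exported
/// number stable even if the variants are later reordered in the source.
pub fn publish_level(gauge: &AtomicU64, level: DegradationLevel) {
    gauge.store(level as u64, Ordering::Relaxed);
}
```

Pinning the discriminants explicitly means dashboards and alert thresholds keyed on the numeric value do not change meaning when the enum is edited.
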
### 5. Render in Prometheus format

Extend `MetricsState::render_prometheus()`:

```rust
// Sessions
write_gauge(&mut out, "tidaldb_active_sessions",
    "Number of currently active agent sessions",
    self.active_sessions.load(Ordering::Relaxed) as f64);

write_counter(&mut out, "tidaldb_closed_sessions_total",
    "Total agent sessions closed since open",
    self.closed_sessions_total.load(Ordering::Relaxed) as f64);

write_counter(&mut out, "tidaldb_session_auto_closed_total",
    "Total agent sessions auto-closed due to timeout",
    self.session_auto_closed_total.load(Ordering::Relaxed) as f64);

// Rate limiting
write_counter(&mut out, "tidaldb_rate_limited_total",
    "Total requests rate-limited due to overload",
    self.rate_limited_total.load(Ordering::Relaxed) as f64);

// Degradation
write_gauge(&mut out, "tidaldb_degradation_level",
    "Current degradation level (0=healthy, 1=warn, 2=shed, 3=critical)",
    self.degradation_level.load(Ordering::Relaxed) as f64);
```

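`write_gauge` and `write_counter` are presumably established by task-01 and are not reproduced in this task. For orientation, here is a sketch of what such helpers could look like when emitting the Prometheus text exposition format (`# HELP`, `# TYPE`, then the sample line). The bodies and the shared `write_metric` helper are assumptions, not the existing implementation.

```rust
use std::fmt::Write;

/// Appends one unlabeled metric in the Prometheus text exposition format:
/// a `# HELP` line, a `# TYPE` line, then the sample itself.
/// (Assumed shape; the real helpers may differ.)
fn write_metric(out: &mut String, name: &str, help: &str, kind: &str, value: f64) {
    // Writing into a `String` cannot fail, so the results are ignored.
    let _ = writeln!(out, "# HELP {name} {help}");
    let _ = writeln!(out, "# TYPE {name} {kind}");
    // `Display` for f64 prints whole numbers without a fraction, e.g. 2.0 as "2".
    let _ = writeln!(out, "{name} {value}");
}

pub fn write_gauge(out: &mut String, name: &str, help: &str, value: f64) {
    write_metric(out, name, help, "gauge", value);
}

pub fn write_counter(out: &mut String, name: &str, help: &str, value: f64) {
    write_metric(out, name, help, "counter", value);
}
```

Under these assumptions, a degradation level of 2 would render as the sample line `tidaldb_degradation_level 2`.
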
### 6. Metric names (string literals)

| Metric name | Type | Description |
|---|---|---|
| `tidaldb_active_sessions` | gauge | Number of currently active agent sessions |
| `tidaldb_closed_sessions_total` | counter | Total agent sessions closed since open |
| `tidaldb_session_auto_closed_total` | counter | Total agent sessions auto-closed due to timeout |
| `tidaldb_rate_limited_total` | counter | Total requests rate-limited due to overload |
| `tidaldb_degradation_level` | gauge | Current degradation level (0=healthy, 1=warn, 2=shed, 3=critical) |

## Acceptance Criteria

- [ ] `MetricsState` extended with 5 atomic fields (3 counters, 2 gauges), all `#[cfg(feature = "metrics")]`
- [ ] `tidaldb_active_sessions` incremented on `start_session`, decremented on `close_session` and auto-close
- [ ] `tidaldb_closed_sessions_total` incremented on every session close, including auto-close
- [ ] `tidaldb_session_auto_closed_total` incremented only on timeout-based auto-close
- [ ] `tidaldb_rate_limited_total` incremented on every rate-limited rejection
- [ ] `tidaldb_degradation_level` updated when the degradation level changes
- [ ] `/metrics` endpoint renders all 5 new metrics in valid Prometheus format
- [ ] `cargo clippy -- -D warnings` and `cargo fmt --check` pass

## Test Strategy

```rust
// Gated on the feature too: the fields used below only exist with `metrics` enabled.
#[cfg(all(test, feature = "metrics"))]
mod tests {
    use super::*;

    #[test]
    fn active_sessions_tracks_lifecycle() {
        let state = MetricsState::new();
        state.active_sessions.fetch_add(1, Ordering::Relaxed);
        state.active_sessions.fetch_add(1, Ordering::Relaxed);
        assert_eq!(state.active_sessions.load(Ordering::Relaxed), 2);
        state.active_sessions.fetch_sub(1, Ordering::Relaxed);
        assert_eq!(state.active_sessions.load(Ordering::Relaxed), 1);
    }

    #[test]
    fn degradation_level_renders_correctly() {
        let state = MetricsState::new();
        state.degradation_level.store(2, Ordering::Relaxed);
        let output = state.render_prometheus();
        // Match the full sample line; a bare " 2" could match another metric's value.
        assert!(output.contains("tidaldb_degradation_level 2"));
    }
}
```

Integration test:

```rust
#[cfg(feature = "metrics")]
#[test]
fn session_metrics_increment_on_start_close() {
    // Assumes the helper returns a fresh database with no sessions yet.
    let db = make_test_db_with_sessions();
    let sid = db.start_session(1, &AgentId::new("test").unwrap(), "default").unwrap();

    // Check values, not just names: the names render even when nothing was counted.
    // Assumes samples render as `<name> <value>` with no labels.
    let prom = db.metrics().render_prometheus();
    assert!(prom.contains("tidaldb_active_sessions 1"));

    db.close_session(sid).unwrap();
    let prom = db.metrics().render_prometheus();
    assert!(prom.contains("tidaldb_active_sessions 0"));
    assert!(prom.contains("tidaldb_closed_sessions_total 1"));
}
```

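Beyond the tests above, a small stress check can confirm that `Relaxed` increments are never lost under contention, which is the property the cumulative counters rely on. The sketch below is standalone and uses a hypothetical `hammer_counter` function rather than `MetricsState`, so it could be adapted to whichever counter field is under test.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// Spawns `threads` workers that each add 1 to a shared counter `per_thread`
/// times, then returns the final value once every worker has been joined.
/// (Hypothetical test helper, shown for illustration only.)
pub fn hammer_counter(threads: u64, per_thread: u64) -> u64 {
    let counter = Arc::new(AtomicU64::new(0));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..per_thread {
                    // Atomic read-modify-write: no increment is lost, even with Relaxed.
                    counter.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().expect("worker thread panicked");
    }
    // `join` synchronizes with each worker, so this load observes every increment.
    counter.load(Ordering::Relaxed)
}
```

`Relaxed` is sufficient here because each metric is an independent value and no other memory is published through it.
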