tidaldb/docs/planning/milestone-7/phase-2/OVERVIEW.md
2026-02-23 22:41:16 -07:00

84 lines
5.1 KiB
Markdown

# m7p2: Graceful Degradation, Rate Limiting, and Session Cleanup
## Delivers
Automatic quality reduction under load pressure. 4-stage degradation order documented and enforced. Backpressure on write path. Per-agent token-bucket rate limiting. Session TTL auto-cleanup sweeper. All load behavior visible in query responses.
## Dependencies
- m7p1 complete (observability, metrics, health check)
- m4 sessions (active session tracking, `SessionHandle`, `SessionState`)
- m1p4 WAL (write path, `WalHandle`, `WalSender`, channel-based group commit)
- m2p5 query executor (`RetrieveExecutor`, `Results`)
- m5p3 search executor (`SearchExecutor`, `SearchResults`)
## Research References
- `docs/research/tidaldb_signal_ledger.md` -- signal write path, WAL durability guarantees
- `CODING_GUIDELINES.md` Section 7 -- error handling, `Result<T, E>` everywhere
- `thoughts.md` -- lessons from Engram on graceful degradation
- `ARCHITECTURE.md` -- single-node scaling philosophy
## Acceptance Criteria (Phase Level)
- [ ] `DegradationLevel` enum: `Full`, `ReducedCandidates`, `CoarseAggregates`, `NoDiversity`
- [ ] Load detection: `AtomicU64` tracking in-flight query count; thresholds: 200->ReducedCandidates, 500->CoarseAggregates, 1000->NoDiversity (configurable)
- [ ] `ReducedCandidates`: ANN `top_k` reduced from 500 to 200; BM25 candidate limit halved
- [ ] `CoarseAggregates`: windowed count reads use AllTime; velocity reads use 24h window
- [ ] `NoDiversity`: diversity pass skipped entirely
- [ ] Under 3x overload, all well-formed queries return results (no ServiceUnavailable)
- [ ] Degradation level in `Results.degradation_level` and `SearchResults.degradation_level`
- [ ] `TidalError::Backpressure { retry_after_ms: u64 }` when WAL queue depth exceeds threshold
- [ ] Per-agent token-bucket rate limiting: `RateLimiter` with `DashMap<(AgentId, SessionId), TokenBucket>`; unlimited by default; configure via `RateLimiterConfig::limited(rate, burst)` in `TidalDb::builder()`
- [ ] `TidalError::RateLimited { agent_id, limit, retry_after_ms }` on excess
- [ ] Rate limiter does not affect non-session signal writes
- [ ] Session TTL sweeper: runs every 60s; auto-closes expired sessions; `SessionSummary.auto_closed: bool`
- [ ] Sweeper cancellable on db.close(); no dangling threads
## Task Execution Order
```
task-01 (DegradationLevel + LoadDetector)
|
+------+------+
| | |
task-02 task-03 task-04 task-05
(exec) (resp+bp) (ratelimit) (sweeper)
| | | |
+------+------+------+
|
task-06 (load test)
```
Tasks 02-05 can parallelize after task-01. Task-06 depends on all prior tasks.
## Module Location
New module: `tidal/src/load/` with submodules:
- `mod.rs` -- public re-exports (`DegradationLevel`, `LoadDetector`, `InFlightGuard`, `RateLimiter`)
- `detector.rs` -- `LoadDetector` struct, `InFlightGuard` RAII type, degradation threshold config
- `rate_limiter.rs` -- `RateLimiter`, `TokenBucket`, per-agent keying
Modified modules:
- `tidal/src/query/executor/mod.rs` -- `RetrieveExecutor` degradation branches
- `tidal/src/query/search/executor.rs` -- `SearchExecutor` degradation branches
- `tidal/src/query/retrieve/types.rs` -- `Results.degradation_level` field
- `tidal/src/query/search/types.rs` -- `SearchResults.degradation_level` field
- `tidal/src/schema/error.rs` -- `TidalError::Backpressure`, `TidalError::RateLimited`
- `tidal/src/session/types.rs` -- `SessionSummary.auto_closed` field
- `tidal/src/db/mod.rs` -- `LoadDetector` field, sweeper thread field
- `tidal/src/db/sessions.rs` -- rate limiter check in `session_signal()`, sweeper logic
- `tidal/src/lib.rs` -- `pub mod load;` re-export
## Notes
- The `LoadDetector` uses only atomics -- no mutex on the hot path. `AtomicU64::fetch_add` on query entry, `AtomicU64::fetch_sub` via RAII `Drop` on exit. The `DegradationLevel` is computed from the current counter value at query entry time and threaded through the executor, not re-checked mid-pipeline.
- The token-bucket `RateLimiter` is keyed by `(AgentId, SessionId)` so that each agent-session pair gets its own bucket. The refill is computed lazily at check time (no background thread), using `Instant::now()` delta. This avoids a timer thread and keeps the hot path lock-free via `DashMap` shard-level locking.
- The session sweeper thread is a simple `loop { sleep(60s); scan; }` with a `shutdown: AtomicBool` checked each iteration. It calls `TidalDb::close_session_internal()` (a variant of `close_session` that does not require a `SessionHandle`) and sets the `auto_closed` flag.
- Backpressure on the WAL write path is checked by inspecting the channel's `len()` against a configurable threshold. This is O(1) on `crossbeam::channel::bounded`. The check happens in `TidalDb::signal()` before enqueuing, not inside the WAL writer.
- The `DegradationLevel` is `Copy + Clone + PartialEq + Eq + Debug` and has a `u8` repr for cheap embedding in response structs and tracing spans.
## Done When
All 13 acceptance criteria above pass. `cargo test --manifest-path tidal/Cargo.toml` passes. Load test (task-06) verifies degradation progression under 3x simulated overload. No ServiceUnavailable errors under sustained load.