# m7p4: Operational Visibility ## Delivers Query execution stats, signal system health metrics, index health metrics, session and cohort degradation metrics, structured error reporting with context, `tidalctl diagnostics` command, zero-overhead `metrics` feature flag, RLHF training data export API, and cross-session aggregation query. This phase makes tidalDB observable. Before m7p4, an operator diagnosing a slow query or a stale index must read code to know what to measure. After m7p4, every critical subsystem reports its health through Prometheus counters and gauges, every query returns execution statistics, every error carries structured context, and `tidalctl diagnostics` prints a one-screen summary of the entire system's state. ## Dependencies - m7p1 complete (crash recovery hardening -- checkpoint BLAKE3, WAL compaction) - m7p2 complete (graceful degradation -- `DegradationLevel` gauge, rate-limiting counters) - M6 complete (all entity types, session layer, cohort engine, collections, suggestion index) - `tidal/src/db/metrics.rs` -- existing `MetricsState` with uptime and health gauges - `tidal/src/query/executor/mod.rs` -- `RetrieveExecutor` pipeline - `tidal/src/query/search/executor.rs` -- `SearchExecutor` pipeline - `tidal/src/query/retrieve/types.rs` -- `Results`, `RetrieveResult` - `tidal/src/query/search/types.rs` -- `SearchResults`, `SearchResultItem` - `tidal/src/schema/error.rs` -- `TidalError` enum - `tidal/src/query/retrieve/errors.rs` -- `QueryError` enum - `tidal/src/wal/mod.rs` -- `WalHandle`, segment management - `tidal/src/signals/checkpoint/meta.rs` -- `CheckpointMeta` - `tidal/src/session/types.rs` -- `SessionSummary` ## Research References - `docs/research/tidaldb_tooling_and_diagnostics.md` -- CLI framework choice (manual), HTTP server (hand-rolled), Prometheus text format (hand-written) - `docs/research/tidaldb_signal_ledger.md` -- three-tier hybrid, checkpoint semantics - `thoughts.md` -- lessons from Engram/Citadel on observability and diagnostics ## Acceptance Criteria (Phase Level) - [ ] `QueryStats` struct with fields: `candidates_considered`, `candidates_after_filter`, `candidates_after_diversity`, `filters_applied`, `scoring_time_us`, `diversity_time_us`, `total_time_us`, `degradation_level`, `profile_name` - [ ] `Results.stats: QueryStats` and `SearchResults.stats: QueryStats` populated by executors - [ ] Signal metrics at `/metrics`: `tidaldb_wal_lag_bytes`, `tidaldb_wal_compacted_segments_total`, `tidaldb_checkpoint_age_seconds`, `tidaldb_signal_hot_entries`, `tidaldb_signal_writes_total`, `tidaldb_signal_write_latency_us` histogram - [ ] Index metrics: `tidaldb_tantivy_segment_count`, `tidaldb_tantivy_indexed_docs`, `tidaldb_usearch_index_size_bytes`, `tidaldb_usearch_vector_count`, `tidaldb_bitmap_index_cardinality` - [ ] Session/cohort metrics: `tidaldb_active_sessions`, `tidaldb_closed_sessions_total`, `tidaldb_session_auto_closed_total`, `tidaldb_rate_limited_total` - [ ] Degradation: `tidaldb_degradation_level` gauge (0-3) - [ ] `tidalctl diagnostics` prints WAL state, checkpoint age, signal size, index sizes, session count, degradation level, collection count, cohort count - [ ] All `TidalError` variants have operation name + context; no bare strings - [ ] `db.export_signals(ExportRequest { user_id, signal_types, since, until, format }) -> Result>`; `ExportFormat::JsonLines` supported - [ ] `db.user_session_summary(user_id, since) -> Result`; returns `sessions_count`, `total_signals`, `total_rejections`, `top_signal_types`, `preference_drift` (cosine distance) - [ ] Metrics zero-overhead without `metrics` feature; verified by compile + inspection - [ ] All new code passes `cargo clippy -D warnings` and `cargo fmt --check` - [ ] Integration test suite `m7p4_visibility` passes ## Task Execution Order ``` task-01 (QueryStats struct + executor instrumentation) | +----> task-02 (signal + WAL metrics) | +----> task-03 (index health metrics) | +----> task-04 (session + cohort + degradation metrics) | +----> task-06 (structured error context audit) | +----> task-07 (metrics feature flag + zero-overhead) | +----> task-08 (RLHF export) | +----> task-09 (cross-session aggregation) | v task-05 (tidalctl diagnostics) -- depends on task-02, task-03, task-04 ``` task-01 is the foundation: `QueryStats` is referenced by executor instrumentation that tasks 02-09 build upon. Tasks 02, 03, 04, 06, 07, 08, and 09 can parallelize after task-01. Task-05 depends on tasks 02, 03, and 04 because `tidalctl diagnostics` reads the metrics those tasks expose. ## Module Location New and modified modules: ``` tidal/src/ query/ stats.rs -- new: QueryStats struct and builder executor/mod.rs -- modified: instrument RetrieveExecutor stages search/executor.rs -- modified: instrument SearchExecutor stages retrieve/types.rs -- modified: add stats field to Results search/types.rs -- modified: add stats field to SearchResults db/ metrics.rs -- modified: add AtomicU64 counters/gauges, histogram export.rs -- new: ExportRequest, ExportedSignal, ExportFormat, export_signals() sessions.rs -- modified: add user_session_summary() mod.rs -- modified: wire new metrics fields, export/aggregation methods schema/ error.rs -- modified: add ErrorContext to TidalError variants wal/ mod.rs -- modified: expose lag_bytes(), compacted_count() via metrics tidal/tests/ m7p4_visibility.rs -- new: integration tests for all m7p4 functionality ``` ## Notes ### QueryStats design philosophy `QueryStats` is a pure data struct, not a builder. The executor constructs it incrementally during pipeline execution using `Instant::elapsed()` for timing fields. It is cheap to construct (no heap allocations) and always populated -- there is no `Option`. Even when a query returns zero results, the stats reflect the work done to determine that. ### Histogram for write latency The `tidaldb_signal_write_latency_us` metric uses a hand-written histogram with fixed bucket boundaries: `[1, 5, 10, 25, 50, 100, 250, 500, 1000, 5000, 10000]` microseconds. This matches the Prometheus histogram convention (cumulative buckets + sum + count) and avoids pulling in a histogram library. The bucket boundaries are chosen based on expected signal write latency: most writes should complete in <100us, with WAL fsync pushing outliers to the 1-5ms range. ### RLHF export and WAL scanning `export_signals` reads from the WAL segment files directly using the existing `reader::recover()` path, filtered by time range and signal type. This is an offline operation -- it does not interfere with the live write path. For large WAL backlogs, the caller should use narrow time ranges. The `ExportedSignal` type is a flat struct suitable for JSON serialization. ### Cross-session aggregation `user_session_summary` scans `closed_sessions` (the in-memory `DashMap`) filtered by user ID and timestamp. It does not read from persistent storage -- only closed sessions that exist in the current process's memory are visible. This is a deliberate simplification: persistent session archive scanning is deferred to M8 when the distributed fabric needs cross-node session aggregation. ### Zero-overhead metrics feature flag All `AtomicU64` counters and the histogram are wrapped in `#[cfg(feature = "metrics")]` blocks. When the feature is disabled, the compiler eliminates all counter increments, load operations, and the HTTP endpoint. The `QueryStats` struct is always present (it has value even without Prometheus export), but the Prometheus-specific rendering and atomic counters are gated. ## Done When All 12 acceptance criteria above pass. `cargo test --manifest-path tidal/Cargo.toml` passes including the new `m7p4_visibility` integration test suite. `tidalctl diagnostics --path ` prints a correct summary for a running database. `cargo test --manifest-path tidal/Cargo.toml --no-default-features` compiles without the `metrics` feature and produces no metrics overhead.