m7p4: Operational Visibility
Delivers
Query execution stats, signal system health metrics, index health metrics, session and cohort degradation metrics, structured error reporting with context, tidalctl diagnostics command, zero-overhead metrics feature flag, RLHF training data export API, and cross-session aggregation query.
This phase makes tidalDB observable. Before m7p4, an operator diagnosing a slow query or a stale index must read code to know what to measure. After m7p4, every critical subsystem reports its health through Prometheus counters and gauges, every query returns execution statistics, every error carries structured context, and tidalctl diagnostics prints a one-screen summary of the entire system's state.
Dependencies
- m7p1 complete (crash recovery hardening -- checkpoint BLAKE3, WAL compaction)
- m7p2 complete (graceful degradation -- DegradationLevel gauge, rate-limiting counters)
- M6 complete (all entity types, session layer, cohort engine, collections, suggestion index)
- tidal/src/db/metrics.rs -- existing MetricsState with uptime and health gauges
- tidal/src/query/executor/mod.rs -- RetrieveExecutor pipeline
- tidal/src/query/search/executor.rs -- SearchExecutor pipeline
- tidal/src/query/retrieve/types.rs -- Results, RetrieveResult
- tidal/src/query/search/types.rs -- SearchResults, SearchResultItem
- tidal/src/schema/error.rs -- TidalError enum
- tidal/src/query/retrieve/errors.rs -- QueryError enum
- tidal/src/wal/mod.rs -- WalHandle, segment management
- tidal/src/signals/checkpoint/meta.rs -- CheckpointMeta
- tidal/src/session/types.rs -- SessionSummary
Research References
- docs/research/tidaldb_tooling_and_diagnostics.md -- CLI framework choice (manual), HTTP server (hand-rolled), Prometheus text format (hand-written)
- docs/research/tidaldb_signal_ledger.md -- three-tier hybrid, checkpoint semantics
- thoughts.md -- lessons from Engram/Citadel on observability and diagnostics
Acceptance Criteria (Phase Level)
- QueryStats struct with fields: candidates_considered, candidates_after_filter, candidates_after_diversity, filters_applied, scoring_time_us, diversity_time_us, total_time_us, degradation_level, profile_name
- Results.stats: QueryStats and SearchResults.stats: QueryStats populated by executors
- Signal metrics at /metrics: tidaldb_wal_lag_bytes, tidaldb_wal_compacted_segments_total, tidaldb_checkpoint_age_seconds, tidaldb_signal_hot_entries, tidaldb_signal_writes_total, tidaldb_signal_write_latency_us histogram
- Index metrics: tidaldb_tantivy_segment_count, tidaldb_tantivy_indexed_docs, tidaldb_usearch_index_size_bytes, tidaldb_usearch_vector_count, tidaldb_bitmap_index_cardinality
- Session/cohort metrics: tidaldb_active_sessions, tidaldb_closed_sessions_total, tidaldb_session_auto_closed_total, tidaldb_rate_limited_total
- Degradation: tidaldb_degradation_level gauge (0-3)
- tidalctl diagnostics prints WAL state, checkpoint age, signal size, index sizes, session count, degradation level, collection count, cohort count
- All TidalError variants have operation name + context; no bare strings
- db.export_signals(ExportRequest { user_id, signal_types, since, until, format }) -> Result<Vec<ExportedSignal>>; ExportFormat::JsonLines supported
- db.user_session_summary(user_id, since) -> Result<UserSessionSummary>; returns sessions_count, total_signals, total_rejections, top_signal_types, preference_drift (cosine distance)
- Metrics zero-overhead without metrics feature; verified by compile + inspection
- All new code passes cargo clippy -- -D warnings and cargo fmt --check
- Integration test suite m7p4_visibility passes
Task Execution Order
task-01 (QueryStats struct + executor instrumentation)
|
+----> task-02 (signal + WAL metrics)
|
+----> task-03 (index health metrics)
|
+----> task-04 (session + cohort + degradation metrics)
|
+----> task-06 (structured error context audit)
|
+----> task-07 (metrics feature flag + zero-overhead)
|
+----> task-08 (RLHF export)
|
+----> task-09 (cross-session aggregation)
|
v
task-05 (tidalctl diagnostics) -- depends on task-02, task-03, task-04
task-01 is the foundation: QueryStats is referenced by executor instrumentation that tasks 02-09 build upon. Tasks 02, 03, 04, 06, 07, 08, and 09 can run in parallel once task-01 is complete. task-05 depends on tasks 02, 03, and 04 because tidalctl diagnostics reads the metrics those tasks expose.
Module Location
New and modified modules:
tidal/src/
query/
stats.rs -- new: QueryStats struct (plain data, no builder; see Notes)
executor/mod.rs -- modified: instrument RetrieveExecutor stages
search/executor.rs -- modified: instrument SearchExecutor stages
retrieve/types.rs -- modified: add stats field to Results
search/types.rs -- modified: add stats field to SearchResults
db/
metrics.rs -- modified: add AtomicU64 counters/gauges, histogram
export.rs -- new: ExportRequest, ExportedSignal, ExportFormat, export_signals()
sessions.rs -- modified: add user_session_summary()
mod.rs -- modified: wire new metrics fields, export/aggregation methods
schema/
error.rs -- modified: add ErrorContext to TidalError variants
wal/
mod.rs -- modified: expose lag_bytes(), compacted_count() via metrics
tidal/tests/
m7p4_visibility.rs -- new: integration tests for all m7p4 functionality
Notes
QueryStats design philosophy
QueryStats is a pure data struct, not a builder. The executor constructs it incrementally during pipeline execution using Instant::elapsed() for timing fields. It is cheap to construct (no heap allocations) and always populated -- there is no Option<QueryStats>. Even when a query returns zero results, the stats reflect the work done to determine that.
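A minimal sketch of this design follows. The field names come from the acceptance criteria, but the field types are assumptions (in particular, profile_name as &'static str and filters_applied as a count are guesses chosen to keep the struct free of heap allocations). The run_pipeline function is a hypothetical stand-in for the real executor stages.

```rust
use std::time::Instant;

/// Plain data, Copy, no heap allocation. Field types are assumed.
#[derive(Debug, Clone, Copy, Default, PartialEq)]
pub struct QueryStats {
    pub candidates_considered: u64,
    pub candidates_after_filter: u64,
    pub candidates_after_diversity: u64,
    pub filters_applied: u32,
    pub scoring_time_us: u64,
    pub diversity_time_us: u64,
    pub total_time_us: u64,
    pub degradation_level: u8,
    pub profile_name: &'static str,
}

/// Hypothetical three-stage pipeline over integer scores. The stats are
/// filled in stage by stage and returned even when no candidate survives.
pub fn run_pipeline(
    candidates: &[u32],
    min_score: u32,
    limit: usize,
    profile_name: &'static str,
) -> (Vec<u32>, QueryStats) {
    let started = Instant::now();
    let mut stats = QueryStats {
        candidates_considered: candidates.len() as u64,
        profile_name,
        ..QueryStats::default()
    };

    // Filter stage: a single threshold filter.
    let mut kept: Vec<u32> = candidates
        .iter()
        .copied()
        .filter(|&c| c >= min_score)
        .collect();
    stats.filters_applied = 1;
    stats.candidates_after_filter = kept.len() as u64;

    // Scoring stage: highest score first.
    let scoring = Instant::now();
    kept.sort_unstable_by(|a, b| b.cmp(a));
    stats.scoring_time_us = scoring.elapsed().as_micros() as u64;

    // Diversity stage: drop adjacent duplicates, then cap the result size.
    let diversity = Instant::now();
    kept.dedup();
    kept.truncate(limit);
    stats.diversity_time_us = diversity.elapsed().as_micros() as u64;
    stats.candidates_after_diversity = kept.len() as u64;

    stats.total_time_us = started.elapsed().as_micros() as u64;
    (kept, stats)
}
```

Calling run_pipeline(&[], 3, 2, "default") still yields a fully populated QueryStats with zero counts, which is the "always populated" property described above.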
Histogram for write latency
The tidaldb_signal_write_latency_us metric uses a hand-written histogram with fixed bucket boundaries: [1, 5, 10, 25, 50, 100, 250, 500, 1000, 5000, 10000] microseconds. This matches the Prometheus histogram convention (cumulative buckets + sum + count) and avoids pulling in a histogram library. The bucket boundaries are chosen based on expected signal write latency: most writes should complete in <100us, with WAL fsync pushing outliers to the 1-5ms range.
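The cumulative-bucket convention can be sketched as follows. The bucket boundaries are the ones listed above; the LatencyHistogram type, its method names, and the choice to store per-bucket counts and accumulate at render time are assumptions for illustration, not the layout in tidal/src/db/metrics.rs.

```rust
use std::fmt::Write as _;
use std::sync::atomic::{AtomicU64, Ordering};

/// Upper bounds in microseconds, as listed in the note above.
const BUCKETS_US: [u64; 11] = [1, 5, 10, 25, 50, 100, 250, 500, 1000, 5000, 10000];

/// Hypothetical hand-written histogram. Each slot holds the count for one
/// bucket only; the extra last slot is the overflow (+Inf) bucket.
pub struct LatencyHistogram {
    counts: [AtomicU64; 12],
    sum_us: AtomicU64,
}

impl LatencyHistogram {
    pub fn new() -> Self {
        LatencyHistogram {
            counts: Default::default(),
            sum_us: AtomicU64::new(0),
        }
    }

    /// Record one observation in the first bucket whose bound covers it.
    pub fn observe(&self, us: u64) {
        let idx = BUCKETS_US
            .iter()
            .position(|&bound| us <= bound)
            .unwrap_or(BUCKETS_US.len());
        self.counts[idx].fetch_add(1, Ordering::Relaxed);
        self.sum_us.fetch_add(us, Ordering::Relaxed);
    }

    /// Render in Prometheus text format: cumulative buckets, sum, count.
    pub fn render(&self, name: &str) -> String {
        let mut out = String::new();
        let _ = writeln!(out, "# TYPE {} histogram", name);
        let mut cumulative = 0u64;
        for (i, bound) in BUCKETS_US.iter().enumerate() {
            cumulative += self.counts[i].load(Ordering::Relaxed);
            let _ = writeln!(out, "{}_bucket{{le=\"{}\"}} {}", name, bound, cumulative);
        }
        cumulative += self.counts[BUCKETS_US.len()].load(Ordering::Relaxed);
        let _ = writeln!(out, "{}_bucket{{le=\"+Inf\"}} {}", name, cumulative);
        let _ = writeln!(out, "{}_sum {}", name, self.sum_us.load(Ordering::Relaxed));
        // Reuse the running total so _count always equals the +Inf bucket.
        let _ = writeln!(out, "{}_count {}", name, cumulative);
        out
    }
}
```

Deriving _count from the running total rather than a separate atomic is a design choice of this sketch: it keeps the +Inf bucket and the count equal within a single scrape even while writers are active.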
RLHF export and WAL scanning
export_signals reads from the WAL segment files directly using the existing reader::recover() path, filtered by time range and signal type. This is an offline operation -- it does not interfere with the live write path. For large WAL backlogs, the caller should use narrow time ranges. The ExportedSignal type is a flat struct suitable for JSON serialization.
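The filtering and JSON Lines rendering steps might look like the sketch below. It does not call reader::recover(); an in-memory slice stands in for the recovered WAL records. The ExportRequest field names follow the acceptance criteria, but every type here is an assumption: the ExportedSignal fields, millisecond timestamps, the half-open [since, until) range, and "empty signal_types means all types" are illustrative choices only.

```rust
/// Flat record suitable for JSON serialization (assumed fields).
#[derive(Debug, Clone, PartialEq)]
pub struct ExportedSignal {
    pub user_id: u64,
    pub entity_id: u64,
    pub signal_type: String,
    pub timestamp_ms: u64,
}

pub enum ExportFormat {
    JsonLines,
}

/// Field names from the acceptance criteria; types are assumptions.
pub struct ExportRequest {
    pub user_id: Option<u64>,
    pub signal_types: Vec<String>,
    pub since: u64,
    pub until: u64,
    pub format: ExportFormat,
}

/// Keep records matching the user, the requested signal types, and the
/// half-open time range [since, until).
pub fn filter_signals(records: &[ExportedSignal], req: &ExportRequest) -> Vec<ExportedSignal> {
    records
        .iter()
        .filter(|r| req.user_id.map_or(true, |u| r.user_id == u))
        .filter(|r| req.signal_types.is_empty() || req.signal_types.contains(&r.signal_type))
        .filter(|r| r.timestamp_ms >= req.since && r.timestamp_ms < req.until)
        .cloned()
        .collect()
}

/// Escape the characters that JSON strings must not contain unescaped.
fn escape_json(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Render one JSON object per line.
pub fn render(signals: &[ExportedSignal], format: &ExportFormat) -> String {
    match format {
        ExportFormat::JsonLines => signals
            .iter()
            .map(|s| {
                format!(
                    "{{\"user_id\":{},\"entity_id\":{},\"signal_type\":\"{}\",\"timestamp_ms\":{}}}\n",
                    s.user_id,
                    s.entity_id,
                    escape_json(&s.signal_type),
                    s.timestamp_ms
                )
            })
            .collect(),
    }
}
```

Because the time-range filter is applied per record, a narrow since/until window bounds the output size but not the scan cost in this sketch; skipping whole WAL segments by their time bounds would be a separate optimization.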
Cross-session aggregation
user_session_summary scans closed_sessions (the in-memory DashMap<SessionId, SessionSnapshot>) filtered by user ID and timestamp. It does not read from persistent storage -- only closed sessions that exist in the current process's memory are visible. This is a deliberate simplification: persistent session archive scanning is deferred to M8 when the distributed fabric needs cross-node session aggregation.
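The scan described above can be sketched with the standard library alone. A HashMap stands in for the DashMap, and the SessionSnapshot fields are hypothetical. The source says only that preference_drift is a cosine distance; measuring it between the earliest and latest matching session's preference vectors is an assumption of this sketch, as is capping top_signal_types at three entries.

```rust
use std::collections::HashMap;

/// Assumed snapshot shape; the real SessionSnapshot may differ.
pub struct SessionSnapshot {
    pub user_id: u64,
    pub closed_at_ms: u64,
    pub signal_counts: HashMap<String, u64>,
    pub rejections: u64,
    pub preference: Vec<f32>,
}

#[derive(Debug)]
pub struct UserSessionSummary {
    pub sessions_count: usize,
    pub total_signals: u64,
    pub total_rejections: u64,
    pub top_signal_types: Vec<(String, u64)>,
    pub preference_drift: f32,
}

/// 1 - cosine similarity; defined as 0.0 when either vector has zero norm.
pub fn cosine_distance(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0;
    }
    1.0 - dot / (norm_a * norm_b)
}

/// Aggregate the in-memory closed sessions for one user since a timestamp.
pub fn user_session_summary(
    closed: &HashMap<u64, SessionSnapshot>,
    user_id: u64,
    since_ms: u64,
) -> UserSessionSummary {
    let mut matching: Vec<&SessionSnapshot> = closed
        .values()
        .filter(|s| s.user_id == user_id && s.closed_at_ms >= since_ms)
        .collect();
    matching.sort_by_key(|s| s.closed_at_ms);

    let mut by_type: HashMap<&str, u64> = HashMap::new();
    let mut total_rejections = 0u64;
    for s in &matching {
        total_rejections += s.rejections;
        for (signal_type, n) in &s.signal_counts {
            *by_type.entry(signal_type.as_str()).or_insert(0) += *n;
        }
    }
    let total_signals: u64 = by_type.values().sum();

    // Highest count first; ties broken by name so the order is stable.
    let mut top: Vec<(String, u64)> =
        by_type.into_iter().map(|(t, n)| (t.to_string(), n)).collect();
    top.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    top.truncate(3);

    let preference_drift = match (matching.first(), matching.last()) {
        (Some(first), Some(last)) => cosine_distance(&first.preference, &last.preference),
        _ => 0.0,
    };

    UserSessionSummary {
        sessions_count: matching.len(),
        total_signals,
        total_rejections,
        top_signal_types: top,
        preference_drift,
    }
}
```

As in the note above, nothing here touches persistent storage: a session that was closed before the current process started is simply absent from the map and therefore from the summary.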
Zero-overhead metrics feature flag
All AtomicU64 counters and the histogram are wrapped in #[cfg(feature = "metrics")] blocks. When the feature is disabled, the compiler eliminates all counter increments, load operations, and the HTTP endpoint. The QueryStats struct is always present (it has value even without Prometheus export), but the Prometheus-specific rendering and atomic counters are gated.
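One way to get this effect while keeping call sites free of cfg attributes is to gate the counter type itself and supply a zero-sized no-op stand-in when the feature is off. This is a variation on the per-block gating described above, not necessarily how tidalDB does it; the Counter type, the imp module, SIGNAL_WRITES_TOTAL, and record_signal_write are all hypothetical names.

```rust
#[cfg(feature = "metrics")]
mod imp {
    use std::sync::atomic::{AtomicU64, Ordering};

    /// Real counter, compiled only when the metrics feature is enabled.
    pub struct Counter(AtomicU64);

    impl Counter {
        pub const fn new() -> Self {
            Counter(AtomicU64::new(0))
        }
        #[inline]
        pub fn inc(&self) {
            self.0.fetch_add(1, Ordering::Relaxed);
        }
        #[inline]
        pub fn get(&self) -> u64 {
            self.0.load(Ordering::Relaxed)
        }
    }
}

#[cfg(not(feature = "metrics"))]
mod imp {
    /// Zero-sized stand-in: same API, empty bodies the optimizer can drop.
    pub struct Counter;

    impl Counter {
        pub const fn new() -> Self {
            Counter
        }
        #[inline(always)]
        pub fn inc(&self) {}
        #[inline(always)]
        pub fn get(&self) -> u64 {
            0
        }
    }
}

pub use self::imp::Counter;

/// Hypothetical global counter; occupies no storage without the feature.
pub static SIGNAL_WRITES_TOTAL: Counter = Counter::new();

/// Call sites look identical in both configurations.
pub fn record_signal_write() {
    SIGNAL_WRITES_TOTAL.inc();
}
```

With the feature disabled, Counter has size zero and inc() has an empty inlined body, so the "verified by compile + inspection" criterion could be checked by asserting std::mem::size_of::<Counter>() == 0 and inspecting the generated code for the write path.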
Done When
All acceptance criteria above pass. cargo test --manifest-path tidal/Cargo.toml passes, including the new m7p4_visibility integration test suite. tidalctl diagnostics --path <dir> prints a correct summary for a running database. cargo test --manifest-path tidal/Cargo.toml --no-default-features compiles without the metrics feature and produces no metrics overhead.