tidaldb/docs/planning/milestone-7/phase-4/OVERVIEW.md
2026-02-23 22:41:16 -07:00

8.1 KiB

m7p4: Operational Visibility

Delivers

Query execution stats, signal system health metrics, index health metrics, session and cohort degradation metrics, structured error reporting with context, tidalctl diagnostics command, zero-overhead metrics feature flag, RLHF training data export API, and cross-session aggregation query.

This phase makes tidalDB observable. Before m7p4, an operator diagnosing a slow query or a stale index must read code to know what to measure. After m7p4, every critical subsystem reports its health through Prometheus counters and gauges, every query returns execution statistics, every error carries structured context, and tidalctl diagnostics prints a one-screen summary of the entire system's state.

Dependencies

  • m7p1 complete (crash recovery hardening -- checkpoint BLAKE3, WAL compaction)
  • m7p2 complete (graceful degradation -- DegradationLevel gauge, rate-limiting counters)
  • M6 complete (all entity types, session layer, cohort engine, collections, suggestion index)
  • tidal/src/db/metrics.rs -- existing MetricsState with uptime and health gauges
  • tidal/src/query/executor/mod.rs -- RetrieveExecutor pipeline
  • tidal/src/query/search/executor.rs -- SearchExecutor pipeline
  • tidal/src/query/retrieve/types.rs -- Results, RetrieveResult
  • tidal/src/query/search/types.rs -- SearchResults, SearchResultItem
  • tidal/src/schema/error.rs -- TidalError enum
  • tidal/src/query/retrieve/errors.rs -- QueryError enum
  • tidal/src/wal/mod.rs -- WalHandle, segment management
  • tidal/src/signals/checkpoint/meta.rs -- CheckpointMeta
  • tidal/src/session/types.rs -- SessionSummary

Research References

  • docs/research/tidaldb_tooling_and_diagnostics.md -- CLI framework choice (manual), HTTP server (hand-rolled), Prometheus text format (hand-written)
  • docs/research/tidaldb_signal_ledger.md -- three-tier hybrid, checkpoint semantics
  • thoughts.md -- lessons from Engram/Citadel on observability and diagnostics

Acceptance Criteria (Phase Level)

  • QueryStats struct with fields: candidates_considered, candidates_after_filter, candidates_after_diversity, filters_applied, scoring_time_us, diversity_time_us, total_time_us, degradation_level, profile_name
  • Results.stats: QueryStats and SearchResults.stats: QueryStats populated by executors
  • Signal metrics at /metrics: tidaldb_wal_lag_bytes, tidaldb_wal_compacted_segments_total, tidaldb_checkpoint_age_seconds, tidaldb_signal_hot_entries, tidaldb_signal_writes_total, tidaldb_signal_write_latency_us histogram
  • Index metrics: tidaldb_tantivy_segment_count, tidaldb_tantivy_indexed_docs, tidaldb_usearch_index_size_bytes, tidaldb_usearch_vector_count, tidaldb_bitmap_index_cardinality
  • Session/cohort metrics: tidaldb_active_sessions, tidaldb_closed_sessions_total, tidaldb_session_auto_closed_total, tidaldb_rate_limited_total
  • Degradation: tidaldb_degradation_level gauge (0-3)
  • tidalctl diagnostics prints WAL state, checkpoint age, signal size, index sizes, session count, degradation level, collection count, cohort count
  • All TidalError variants have operation name + context; no bare strings
  • db.export_signals(ExportRequest { user_id, signal_types, since, until, format }) -> Result<Vec<ExportedSignal>>; ExportFormat::JsonLines supported
  • db.user_session_summary(user_id, since) -> Result<UserSessionSummary>; returns sessions_count, total_signals, total_rejections, top_signal_types, preference_drift (cosine distance)
  • Metrics zero-overhead without metrics feature; verified by compile + inspection
  • All new code passes cargo clippy -D warnings and cargo fmt --check
  • Integration test suite m7p4_visibility passes

Task Execution Order

task-01 (QueryStats struct + executor instrumentation)
    |
    +----> task-02 (signal + WAL metrics)
    |
    +----> task-03 (index health metrics)
    |
    +----> task-04 (session + cohort + degradation metrics)
    |
    +----> task-06 (structured error context audit)
    |
    +----> task-07 (metrics feature flag + zero-overhead)
    |
    +----> task-08 (RLHF export)
    |
    +----> task-09 (cross-session aggregation)
    |
    v
task-05 (tidalctl diagnostics) -- depends on task-02, task-03, task-04

task-01 is the foundation: QueryStats is referenced by executor instrumentation that tasks 02-09 build upon. Tasks 02, 03, 04, 06, 07, 08, and 09 can parallelize after task-01. Task-05 depends on tasks 02, 03, and 04 because tidalctl diagnostics reads the metrics those tasks expose.

Module Location

New and modified modules:

tidal/src/
  query/
    stats.rs            -- new: QueryStats struct and builder
    executor/mod.rs     -- modified: instrument RetrieveExecutor stages
    search/executor.rs  -- modified: instrument SearchExecutor stages
    retrieve/types.rs   -- modified: add stats field to Results
    search/types.rs     -- modified: add stats field to SearchResults
  db/
    metrics.rs          -- modified: add AtomicU64 counters/gauges, histogram
    export.rs           -- new: ExportRequest, ExportedSignal, ExportFormat, export_signals()
    sessions.rs         -- modified: add user_session_summary()
    mod.rs              -- modified: wire new metrics fields, export/aggregation methods
  schema/
    error.rs            -- modified: add ErrorContext to TidalError variants
  wal/
    mod.rs              -- modified: expose lag_bytes(), compacted_count() via metrics

tidal/tests/
  m7p4_visibility.rs    -- new: integration tests for all m7p4 functionality

Notes

QueryStats design philosophy

QueryStats is a pure data struct, not a builder. The executor constructs it incrementally during pipeline execution using Instant::elapsed() for timing fields. It is cheap to construct (no heap allocations) and always populated -- there is no Option<QueryStats>. Even when a query returns zero results, the stats reflect the work done to determine that.

Histogram for write latency

The tidaldb_signal_write_latency_us metric uses a hand-written histogram with fixed bucket boundaries: [1, 5, 10, 25, 50, 100, 250, 500, 1000, 5000, 10000] microseconds. This matches the Prometheus histogram convention (cumulative buckets + sum + count) and avoids pulling in a histogram library. The bucket boundaries are chosen based on expected signal write latency: most writes should complete in <100us, with WAL fsync pushing outliers to the 1-5ms range.

RLHF export and WAL scanning

export_signals reads from the WAL segment files directly using the existing reader::recover() path, filtered by time range and signal type. This is an offline operation -- it does not interfere with the live write path. For large WAL backlogs, the caller should use narrow time ranges. The ExportedSignal type is a flat struct suitable for JSON serialization.

Cross-session aggregation

user_session_summary scans closed_sessions (the in-memory DashMap<SessionId, SessionSnapshot>) filtered by user ID and timestamp. It does not read from persistent storage -- only closed sessions that exist in the current process's memory are visible. This is a deliberate simplification: persistent session archive scanning is deferred to M8 when the distributed fabric needs cross-node session aggregation.

Zero-overhead metrics feature flag

All AtomicU64 counters and the histogram are wrapped in #[cfg(feature = "metrics")] blocks. When the feature is disabled, the compiler eliminates all counter increments, load operations, and the HTTP endpoint. The QueryStats struct is always present (it has value even without Prometheus export), but the Prometheus-specific rendering and atomic counters are gated.

Done When

All 12 acceptance criteria above pass. cargo test --manifest-path tidal/Cargo.toml passes including the new m7p4_visibility integration test suite. tidalctl diagnostics --path <dir> prints a correct summary for a running database. cargo test --manifest-path tidal/Cargo.toml --no-default-features compiles without the metrics feature and produces no metrics overhead.