jordan 213b8efcca feat: complete M6-M7 + Enterprise Readiness milestones; split oversized source files per CODING_GUIDELINES §9

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-02-23 22:41:16 -07:00


Monitoring

This document covers tidalDB's built-in Prometheus metrics endpoint, all exposed metrics, and recommended alerting thresholds.

Setup

Enable the metrics HTTP server via the builder:

let db = TidalDb::builder()
    .with_data_dir("/var/lib/tidaldb")
    .with_schema(schema)
    .enable_metrics("127.0.0.1:9090")
    .open()?;

// Discover the bound address (useful when using port 0):
if let Some(addr) = db.metrics_addr() {
    println!("metrics at http://{}/metrics", addr);
    println!("health at http://{}/healthz", addr);
}

Security: The metrics endpoint has no authentication. Bind to 127.0.0.1 (loopback) only. If you need to scrape from a remote Prometheus server, use your infrastructure's network controls (SSH tunnel, reverse proxy with auth, or VPN) rather than binding to 0.0.0.0. tidalDB logs a WARN-level message if you bind to a non-loopback address.
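As a rough illustration of the loopback rule above, an application could validate its configured bind address before passing it to enable_metrics. This is a minimal sketch using only the Rust standard library; is_loopback_bind is a hypothetical helper, not part of the tidalDB API, and it does not reproduce how tidalDB itself decides when to log the warning.

```rust
use std::net::{AddrParseError, SocketAddr};

/// Returns Ok(true) when `bind` (for example "127.0.0.1:9090") parses to a
/// socket address whose IP is a loopback address.
/// Hypothetical helper for illustration; not part of the tidalDB API.
fn is_loopback_bind(bind: &str) -> Result<bool, AddrParseError> {
    let addr: SocketAddr = bind.parse()?;
    Ok(addr.ip().is_loopback())
}
```

Calling is_loopback_bind("0.0.0.0:9090") yields Ok(false), which a deployment script could treat as a configuration error rather than relying on the log message alone.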

Feature flag: The metrics HTTP server requires the metrics feature, which is enabled by default. Build with --no-default-features to disable the HTTP server entirely. Base metrics (uptime_seconds, health_ok, info, checkpoint_failures_total) are always compiled regardless of the feature flag.

Endpoints

Path	Content-Type	Description
`/metrics`	`text/plain`	Prometheus text exposition format
`/healthz`	`application/json`	JSON health check: `{"status":"ok","uptime_seconds":123.456,"version":"0.1.0","build_hash":"..."}`

Prometheus Scrape Configuration

scrape_configs:
  - job_name: 'tidaldb'
    static_configs:
      - targets: ['127.0.0.1:9090']
    scrape_interval: 15s

Metrics Reference

All metrics use the tidaldb_ prefix. Sections marked "(feature-gated)" list metrics that are only emitted when the metrics Cargo feature is enabled (default: enabled), unless a metric's description says otherwise.

Build and Health

Metric	Type	Description	Labels
`tidaldb_uptime_seconds`	gauge	Seconds since the database was opened. Monotonically increasing.	`partition_id="0"`
`tidaldb_health_ok`	gauge	Whether the database is healthy. `1` = ok, `0` = degraded or closed.	`partition_id="0"`
`tidaldb_info`	gauge	Build and version information. Always `1`.	`version`, `build_hash`, `partition_id="0"`

Normal range for tidaldb_health_ok: Always 1 during normal operation. Drops to 0 during shutdown or if an internal health check fails. Alert immediately if 0 during expected uptime.
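For ad hoc checks outside Prometheus, a gauge such as tidaldb_health_ok can be read directly from the /metrics body. The sketch below is a deliberately small parser for the Prometheus text exposition format; sample_value is a hypothetical helper. It returns only the first matching sample and ignores label filtering.

```rust
/// Returns the value of the first sample named `metric` in a Prometheus
/// text exposition body, or None if no such sample can be parsed.
/// Hypothetical helper for illustration; not part of the tidalDB API.
fn sample_value(body: &str, metric: &str) -> Option<f64> {
    body.lines()
        .map(str::trim)
        // Skip blank lines and `# HELP` / `# TYPE` comment lines.
        .filter(|line| !line.is_empty() && !line.starts_with('#'))
        .find_map(|line| {
            let rest = line.strip_prefix(metric)?;
            // The name must be followed by a label set or whitespace, so a
            // query for `foo` does not match a sample named `foo_bucket`.
            if !rest.starts_with(|c: char| c == '{' || c.is_whitespace()) {
                return None;
            }
            // Skip the label set, if present; the value is the next token.
            let after_labels = match rest.rfind('}') {
                Some(idx) => &rest[idx + 1..],
                None => rest,
            };
            after_labels.split_whitespace().next()?.parse::<f64>().ok()
        })
}
```

A caller would fetch the body with any HTTP client and treat sample_value(body, "tidaldb_health_ok") != Some(1.0) as unhealthy, which also covers the case where the metric is missing.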

Signal System (feature-gated)

Metric	Type	Unit	Description
`tidaldb_signal_writes_total`	counter	count	Total signal writes since database open. Includes all signal types across all entities.
`tidaldb_signal_hot_entries`	gauge	count	Number of entries currently in the signal ledger hot tier (DashMap). Each entry is one `(entity_id, signal_type_id)` pair.
`tidaldb_signal_write_latency_us`	histogram	microseconds	Signal write latency distribution. Bucket boundaries: 1, 5, 10, 25, 50, 100, 250, 500, 1000, 5000, 10000 microseconds.

Normal range for signal_hot_entries: Proportional to active_entities * signal_types_per_entity. The hot tier is trimmed at 5M entries (DEFAULT_MAX_SIGNAL_ENTRIES). Alert when the count exceeds 80% of that budget (4M entries).

Normal range for signal_write_latency_us: p50 should be < 50us, p99 should be < 1ms. If p99 exceeds 5ms, investigate WAL write latency or DashMap contention.
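The p50 and p99 targets above have to be estimated from bucket counts, since a histogram does not store individual observations. The sketch below shows one way to do that with linear interpolation inside the matching bucket, in the same spirit as PromQL's histogram_quantile. The bucket bounds come from the table above; estimate_quantile_us and its input layout are assumptions for illustration.

```rust
/// Finite bucket upper bounds in microseconds, as listed for
/// tidaldb_signal_write_latency_us.
const BUCKET_BOUNDS_US: [f64; 11] = [
    1.0, 5.0, 10.0, 25.0, 50.0, 100.0, 250.0, 500.0, 1000.0, 5000.0, 10000.0,
];

/// Estimates quantile `q` (0.0..=1.0) in microseconds.
/// `cumulative[i]` is the number of observations <= BUCKET_BOUNDS_US[i];
/// `total` is the count in the +Inf bucket (all observations).
/// Hypothetical helper for illustration; not part of the tidalDB API.
fn estimate_quantile_us(q: f64, cumulative: &[u64; 11], total: u64) -> Option<f64> {
    if total == 0 || !(0.0..=1.0).contains(&q) {
        return None;
    }
    let rank = q * total as f64;
    let mut lower_bound = 0.0;
    let mut lower_count = 0u64;
    for (&upper_bound, &count) in BUCKET_BOUNDS_US.iter().zip(cumulative.iter()) {
        if count as f64 >= rank {
            let in_bucket = count.saturating_sub(lower_count) as f64;
            if in_bucket == 0.0 {
                return Some(lower_bound);
            }
            // Assume observations are spread evenly within the bucket.
            let fraction = (rank - lower_count as f64) / in_bucket;
            return Some(lower_bound + (upper_bound - lower_bound) * fraction);
        }
        lower_bound = upper_bound;
        lower_count = count;
    }
    // The rank falls in the +Inf bucket: report the largest finite bound.
    Some(BUCKET_BOUNDS_US[BUCKET_BOUNDS_US.len() - 1])
}
```

Because the estimate is interpolated, its resolution is limited by bucket width: a reported p99 between 1000 and 5000 us only tells you the true value lies somewhere in that bucket.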

WAL and Checkpoint (feature-gated)

Metric	Type	Unit	Description
`tidaldb_wal_lag_bytes`	gauge	bytes	Total bytes of WAL segment files not yet compacted. Updated after each checkpoint cycle.
`tidaldb_wal_compacted_segments_total`	counter	count	Total WAL segments deleted by compaction since database open.
`tidaldb_checkpoint_age_seconds`	gauge	seconds	Seconds since the last successful signal checkpoint. Derived from `last_checkpoint_ns` at render time.
`tidaldb_checkpoint_failures_total`	counter	count	Total number of failed periodic signal checkpoints. Not feature-gated -- always emitted.

Normal range for checkpoint_age_seconds: Should stay below 60 seconds (checkpoint runs every 30 seconds, with some jitter from the 500ms poll interval). Alert if > 300 seconds (5 minutes) -- the checkpoint thread may be stuck or the storage engine is under pressure.

Normal range for wal_lag_bytes: Depends on signal write rate. At 1K signals/sec, expect ~1.2 MB of WAL per 30-second checkpoint cycle. Alert if > 1 GB -- compaction may be failing.
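The figure above can be scaled to other write rates with simple arithmetic. Reading 1.2 MB as 1,200,000 bytes, 1K signals/sec over a 30-second cycle works out to about 40 bytes per WAL record. That per-record size is inferred from the numbers in this document, not a documented constant, and both helpers below are hypothetical.

```rust
/// Average WAL bytes per signal implied by an observed WAL volume.
/// Returns None when the inputs describe zero records or overflow.
fn implied_bytes_per_record(wal_bytes: u64, signals_per_sec: u64, interval_secs: u64) -> Option<u64> {
    let records = signals_per_sec.checked_mul(interval_secs)?;
    if records == 0 {
        None
    } else {
        Some(wal_bytes / records)
    }
}

/// WAL bytes expected to accumulate during one checkpoint cycle.
/// `bytes_per_record` is an assumed average, not a documented constant.
fn expected_wal_bytes_per_cycle(signals_per_sec: u64, bytes_per_record: u64, interval_secs: u64) -> u64 {
    signals_per_sec * bytes_per_record * interval_secs
}
```

Under the same assumption, a sustained 50K signals/sec would produce roughly 60 MB of WAL per cycle, still well below the 1 GB alert threshold as long as compaction keeps up.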

Normal range for checkpoint_failures_total: Should be 0. Any non-zero value means signal durability is at risk -- the hot tier is not being persisted. Investigate storage errors (disk full, I/O errors).

Index Health (feature-gated)

Metric	Type	Unit	Description
`tidaldb_tantivy_segment_count`	gauge	count	Number of Tantivy index segments for the items text index.
`tidaldb_tantivy_indexed_docs`	gauge	count	Number of documents indexed in the items Tantivy text index.
`tidaldb_usearch_index_size_bytes`	gauge	bytes	Estimated total byte size of all USearch vector index files (f16).
`tidaldb_usearch_vector_count`	gauge	count	Number of vectors stored across all USearch indexes.
`tidaldb_bitmap_index_cardinality`	gauge	count	Total entity IDs across all four bitmap indexes (category + format + creator + tag).

Index health metrics are refreshed every 10 seconds by the checkpoint thread (3x more frequently than checkpoints) so operators get near-real-time visibility.

Normal range for tantivy_segment_count: Should stay below 20 during normal operation. Tantivy merges segments in the background. If segment count grows unbounded, the text syncer thread may have stalled.

Normal range for usearch_vector_count: Should match the number of entities with embeddings written via write_item_embedding() or write_creator_embedding().

Session Lifecycle (feature-gated)

Metric	Type	Unit	Description
`tidaldb_active_sessions`	gauge	count	Number of currently active agent sessions.
`tidaldb_closed_sessions_total`	counter	count	Total agent sessions closed (explicitly or by sweeper) since database open.
`tidaldb_session_auto_closed_total`	counter	count	Total sessions auto-closed by the TTL sweeper due to exceeding `max_session_duration`.

Normal range for active_sessions: Depends on your application's agent concurrency. Each open session consumes memory for signal state tracking. Alert if this grows unbounded -- agents may be leaking sessions (opening without closing).

Rate Limiting and Degradation (feature-gated)

Metric	Type	Unit	Description
`tidaldb_rate_limited_total`	counter	count	Total signal write requests rejected due to per-agent rate limits since database open.
`tidaldb_degradation_level`	gauge	level	Current graceful degradation level. `0` = full quality, `1` = reduced candidates, `2` = coarse aggregates, `3` = no diversity enforcement.

Normal range for degradation_level: Should be 0 during normal operation. Any value > 0 means the load detector has triggered degradation to protect latency. Investigate system load (CPU, memory pressure, I/O saturation).
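Client-side tooling that reads this gauge may find it convenient to map the numeric value onto named levels. The enum below mirrors the four levels described in the table above; the type and its method names are hypothetical and are not taken from the tidalDB source.

```rust
/// Named view of the tidaldb_degradation_level gauge values.
/// Hypothetical type for illustration; not part of the tidalDB API.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DegradationLevel {
    FullQuality,            // 0
    ReducedCandidates,      // 1
    CoarseAggregates,       // 2
    NoDiversityEnforcement, // 3
}

impl DegradationLevel {
    /// Converts a scraped gauge value; returns None for anything other
    /// than the whole numbers 0 through 3 (including NaN and infinities).
    fn from_gauge(value: f64) -> Option<Self> {
        if value.fract() != 0.0 {
            return None;
        }
        match value as i64 {
            0 => Some(Self::FullQuality),
            1 => Some(Self::ReducedCandidates),
            2 => Some(Self::CoarseAggregates),
            3 => Some(Self::NoDiversityEnforcement),
            _ => None,
        }
    }

    /// True for every level other than full quality.
    fn is_degraded(self) -> bool {
        self != Self::FullQuality
    }
}
```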

Recommended Alerts

Alert Name	Condition	Severity	Meaning
TidalDB Down	`tidaldb_health_ok == 0`	Critical	Database is unhealthy or shut down. Immediate investigation required.
Checkpoint Stale	`tidaldb_checkpoint_age_seconds > 300`	Warning	Checkpoint has not run in 5+ minutes. Signal durability at risk. Check storage I/O and disk space.
Checkpoint Failures	`tidaldb_checkpoint_failures_total > 0`	Warning	At least one checkpoint has failed. Signal state may not be durable. Check disk space and storage errors.
WAL Disk Pressure	`tidaldb_wal_lag_bytes > 1000000000`	Warning	WAL exceeds 1 GB uncompacted. Compaction may be stuck or checkpoint is failing.
Signal Backlog	`tidaldb_signal_hot_entries > 4000000`	Warning	Signal ledger over 80% of the 5M entry budget. Cold entry trimming will begin at 5M.
Degraded Ranking	`tidaldb_degradation_level > 0`	Warning	Load-based degradation is active. Ranking quality is reduced to protect latency. Scale up or reduce load.
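Session Leak	`deriv(tidaldb_active_sessions[5m]) > 10 and tidaldb_active_sessions > 100`	Warning	Active session count growing rapidly. `tidaldb_active_sessions` is a gauge, so its growth is measured with `deriv()` (per-second slope) rather than `rate()`, which applies only to counters. Agents may not be closing sessions.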
High Rate Limiting	`rate(tidaldb_rate_limited_total[5m]) > 100`	Info	Sustained rate limiting. Review agent rate limit configuration or reduce write volume.
Tantivy Segment Bloat	`tidaldb_tantivy_segment_count > 30`	Warning	Tantivy has many unmerged segments. Text syncer may be stalled.
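Where a full Prometheus alerting setup is not available, for example in a smoke test or a one-off health script, the instantaneous thresholds from the table above can be checked against a single scrape. The sketch below assumes the values have already been parsed into a Snapshot struct; the struct and firing_alerts are hypothetical. Session Leak and High Rate Limiting are omitted because they depend on change over a time window rather than on a single sample.

```rust
/// One scrape's worth of the gauges and counters used by the alert table.
/// Hypothetical type for illustration; not part of the tidalDB API.
#[derive(Debug, Clone, Copy)]
struct Snapshot {
    health_ok: f64,
    checkpoint_age_seconds: f64,
    checkpoint_failures_total: f64,
    wal_lag_bytes: f64,
    signal_hot_entries: f64,
    degradation_level: f64,
    tantivy_segment_count: f64,
}

/// Returns the names of the single-sample alerts whose conditions hold.
fn firing_alerts(s: &Snapshot) -> Vec<&'static str> {
    let rules: [(&'static str, bool); 7] = [
        ("TidalDB Down", s.health_ok == 0.0),
        ("Checkpoint Stale", s.checkpoint_age_seconds > 300.0),
        ("Checkpoint Failures", s.checkpoint_failures_total > 0.0),
        ("WAL Disk Pressure", s.wal_lag_bytes > 1_000_000_000.0),
        ("Signal Backlog", s.signal_hot_entries > 4_000_000.0),
        ("Degraded Ranking", s.degradation_level > 0.0),
        ("Tantivy Segment Bloat", s.tantivy_segment_count > 30.0),
    ];
    rules
        .iter()
        .filter(|rule| rule.1)
        .map(|rule| rule.0)
        .collect()
}
```

Keeping the rules in one table-like array makes it easy to compare the code against the alert table when thresholds change.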

Grafana Dashboard Suggestions

Row 1: Health overview

tidaldb_health_ok (stat panel, green/red)
tidaldb_uptime_seconds (stat panel)
tidaldb_degradation_level (stat panel, thresholds at 1/2/3)
tidaldb_info labels (stat panel showing version + build hash)

Row 2: Signal throughput

rate(tidaldb_signal_writes_total[5m]) (time series, signals/sec)
tidaldb_signal_write_latency_us histogram (heatmap or quantile panel)
tidaldb_signal_hot_entries (gauge, threshold at 4M/5M)

Row 3: Durability

tidaldb_checkpoint_age_seconds (time series, threshold line at 300)
tidaldb_checkpoint_failures_total (stat panel, should be 0)
tidaldb_wal_lag_bytes (time series)
rate(tidaldb_wal_compacted_segments_total[5m]) (time series)

Row 4: Index health

tidaldb_tantivy_indexed_docs (stat panel)
tidaldb_tantivy_segment_count (gauge)
tidaldb_usearch_vector_count (stat panel)
tidaldb_usearch_index_size_bytes (stat panel, bytes format)
tidaldb_bitmap_index_cardinality (stat panel)

Row 5: Sessions and rate limiting

tidaldb_active_sessions (time series)
rate(tidaldb_closed_sessions_total[5m]) (time series)
tidaldb_session_auto_closed_total (stat panel)
rate(tidaldb_rate_limited_total[5m]) (time series)
