Monitoring
This document covers tidalDB's built-in Prometheus metrics endpoint, all exposed metrics, and recommended alerting thresholds.
Setup
Enable the metrics HTTP server via the builder:

    let db = TidalDb::builder()
        .with_data_dir("/var/lib/tidaldb")
        .with_schema(schema)
        .enable_metrics("127.0.0.1:9090")
        .open()?;

    // Discover the bound address (useful when using port 0):
    if let Some(addr) = db.metrics_addr() {
        println!("metrics at http://{}/metrics", addr);
        println!("health at http://{}/healthz", addr);
    }

Security: The metrics endpoint has no authentication. Bind to `127.0.0.1` (loopback) only. If you need to scrape from a remote Prometheus server, use your infrastructure's network controls (SSH tunnel, reverse proxy with auth, or VPN) rather than binding to `0.0.0.0`. tidalDB logs a WARN-level message if you bind to a non-loopback address.
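
If the bind address comes from configuration, a small startup guard can reject non-loopback values before they are passed to `enable_metrics`. The following is a minimal sketch using only the standard library; `is_loopback_bind` is a hypothetical helper, not part of the tidalDB API.

```rust
use std::net::SocketAddr;

/// Returns true when `addr` parses as an IP socket address on a loopback IP
/// (127.0.0.0/8 or ::1). Hypothetical helper, not part of tidalDB.
/// Anything that does not parse as `ip:port` (including hostnames such as
/// "localhost:9090") is treated as not loopback.
pub fn is_loopback_bind(addr: &str) -> bool {
    addr.parse::<SocketAddr>()
        .map(|sa| sa.ip().is_loopback())
        .unwrap_or(false)
}
```

Hostnames are rejected on purpose: resolving them would require a DNS lookup, and the result could change between the check and the bind.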
Feature flag: The metrics HTTP server requires the `metrics` feature, which is enabled by default. Build with `--no-default-features` to disable the HTTP server entirely. Base metrics (`uptime_seconds`, `health_ok`, `info`, `checkpoint_failures_total`) are always compiled regardless of the feature flag.
Endpoints
| Path | Content-Type | Description |
|---|---|---|
| `/metrics` | `text/plain` | Prometheus text exposition format |
| `/healthz` | `application/json` | JSON health check: `{"status":"ok","uptime_seconds":123.456,"version":"0.1.0","build_hash":"..."}` |
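
For ad-hoc checks without a Prometheus server, individual sample lines from `/metrics` can be parsed by hand. The sketch below handles the common `name{labels} value` shape of the text exposition format. It is a simplification: it assumes no trailing timestamp and keeps the label set as raw text. `Sample` and `parse_sample` are illustrative names, not tidalDB types.

```rust
/// One sample line from the Prometheus text exposition format.
/// Illustrative type, not part of tidalDB.
#[derive(Debug, PartialEq)]
pub struct Sample {
    pub name: String,
    /// Raw text between the braces; empty when the line has no labels.
    pub labels: String,
    pub value: f64,
}

/// Parses a single `name{labels} value` line.
/// Returns None for blank lines, `#` comment lines, and malformed input.
/// Assumption: the line carries no optional trailing timestamp.
pub fn parse_sample(line: &str) -> Option<Sample> {
    let line = line.trim();
    if line.is_empty() || line.starts_with('#') {
        return None;
    }
    // The value is the token after the last space; label values may
    // themselves contain spaces, so split from the right.
    let (series, value) = line.rsplit_once(' ')?;
    let value: f64 = value.parse().ok()?;
    let series = series.trim();
    let (name, labels) = match series.split_once('{') {
        Some((name, rest)) => (name, rest.strip_suffix('}')?),
        None => (series, ""),
    };
    Some(Sample {
        name: name.to_string(),
        labels: labels.to_string(),
        value,
    })
}
```

For anything beyond quick diagnostics, a maintained exposition-format parser is a better choice than this hand-rolled one.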
Prometheus Scrape Configuration

    scrape_configs:
      - job_name: 'tidaldb'
        static_configs:
          - targets: ['127.0.0.1:9090']
        scrape_interval: 15s

Metrics Reference
All metrics use the `tidaldb_` prefix. Metrics marked with "(feature-gated)" are only emitted when the `metrics` Cargo feature is enabled (default: enabled).
Build and Health
| Metric | Type | Description | Labels |
|---|---|---|---|
| `tidaldb_uptime_seconds` | gauge | Seconds since the database was opened. Monotonically increasing. | `partition_id="0"` |
| `tidaldb_health_ok` | gauge | Whether the database is healthy. 1 = ok, 0 = degraded or closed. | `partition_id="0"` |
| `tidaldb_info` | gauge | Build and version information. Always 1. | `version`, `build_hash`, `partition_id="0"` |
Normal range for tidaldb_health_ok: Always 1 during normal operation. Drops to 0 during shutdown or if an internal health check fails. Alert immediately if 0 during expected uptime.
Signal System (feature-gated)
| Metric | Type | Unit | Description |
|---|---|---|---|
| `tidaldb_signal_writes_total` | counter | count | Total signal writes since database open. Includes all signal types across all entities. |
| `tidaldb_signal_hot_entries` | gauge | count | Number of entries currently in the signal ledger hot tier (DashMap). Each entry is one (entity_id, signal_type_id) pair. |
| `tidaldb_signal_write_latency_us` | histogram | microseconds | Signal write latency distribution. Bucket boundaries: 1, 5, 10, 25, 50, 100, 250, 500, 1000, 5000, 10000 microseconds. |
Normal range for signal_hot_entries: Proportional to active_entities * signal_types_per_entity. The hot tier is trimmed at 5M entries (DEFAULT_MAX_SIGNAL_ENTRIES). Alert if approaching 80% of budget (4M entries).
Normal range for signal_write_latency_us: p50 should be < 50us, p99 should be < 1ms. If p99 exceeds 5ms, investigate WAL write latency or DashMap contention.
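
The p50 and p99 figures above are not exported directly; they have to be estimated from the cumulative histogram buckets. The sketch below shows the usual approach (linear interpolation inside the bucket that contains the target rank, which is also how PromQL's `histogram_quantile` behaves) applied to the documented bucket boundaries. `BOUNDS_US` and `estimate_quantile` are illustrative names, and observations above the last finite bound are ignored for simplicity.

```rust
/// Upper bounds, in microseconds, of the documented latency buckets.
pub const BOUNDS_US: [f64; 11] = [
    1.0, 5.0, 10.0, 25.0, 50.0, 100.0, 250.0, 500.0, 1000.0, 5000.0, 10000.0,
];

/// Estimates quantile `q` (0.0..=1.0) from cumulative bucket counts, where
/// `cumulative[i]` is the number of observations <= BOUNDS_US[i].
/// Simplification: there is no +Inf bucket here, so observations slower than
/// the last bound are not represented.
pub fn estimate_quantile(q: f64, cumulative: &[u64; 11]) -> Option<f64> {
    let total = *cumulative.last()?;
    if total == 0 || !(0.0..=1.0).contains(&q) {
        return None;
    }
    let rank = q * total as f64;
    let mut prev_bound = 0.0;
    let mut prev_count = 0u64;
    for (i, &count) in cumulative.iter().enumerate() {
        if count as f64 >= rank {
            let in_bucket = (count - prev_count) as f64;
            if in_bucket == 0.0 {
                // Only reachable for an empty first bucket with rank 0.
                return Some(BOUNDS_US[i]);
            }
            // Assume observations are spread evenly within the bucket.
            let frac = (rank - prev_count as f64) / in_bucket;
            return Some(prev_bound + (BOUNDS_US[i] - prev_bound) * frac);
        }
        prev_bound = BOUNDS_US[i];
        prev_count = count;
    }
    None
}
```

Because of the interpolation, the estimate can only be as precise as the bucket it lands in: a p99 that falls in the 1000 to 5000 microsecond bucket cannot distinguish 1.1 ms from 4.9 ms.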
WAL and Checkpoint (feature-gated)
| Metric | Type | Unit | Description |
|---|---|---|---|
| `tidaldb_wal_lag_bytes` | gauge | bytes | Total bytes of WAL segment files not yet compacted. Updated after each checkpoint cycle. |
| `tidaldb_wal_compacted_segments_total` | counter | count | Total WAL segments deleted by compaction since database open. |
| `tidaldb_checkpoint_age_seconds` | gauge | seconds | Seconds since the last successful signal checkpoint. Derived from `last_checkpoint_ns` at render time. |
| `tidaldb_checkpoint_failures_total` | counter | count | Total number of failed periodic signal checkpoints. Not feature-gated -- always emitted. |
Normal range for checkpoint_age_seconds: Should stay below 60 seconds (checkpoint runs every 30 seconds, with some jitter from the 500ms poll interval). Alert if > 300 seconds (5 minutes) -- the checkpoint thread may be stuck or the storage engine is under pressure.
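
The age thresholds above can be expressed as a small classification. This is a sketch only: it assumes `last_checkpoint_ns` and the current time are nanosecond timestamps on the same clock, which this document does not specify, and `CheckpointFreshness`, `checkpoint_age_seconds`, and `classify` are hypothetical names.

```rust
/// Interpretation of the checkpoint age using the thresholds from this section.
#[derive(Debug, PartialEq, Eq)]
pub enum CheckpointFreshness {
    /// Below 60 seconds: expected with a 30-second checkpoint interval.
    Normal,
    /// Between 60 and 300 seconds: worth watching, not yet alerting.
    Late,
    /// Above 300 seconds: the alert threshold.
    Stale,
}

/// Age in seconds derived from two nanosecond timestamps.
/// Assumption: both values come from the same clock. A checkpoint timestamp
/// that is ahead of `now_ns` saturates to an age of zero.
pub fn checkpoint_age_seconds(now_ns: u64, last_checkpoint_ns: u64) -> f64 {
    now_ns.saturating_sub(last_checkpoint_ns) as f64 / 1e9
}

pub fn classify(age_seconds: f64) -> CheckpointFreshness {
    if age_seconds > 300.0 {
        CheckpointFreshness::Stale
    } else if age_seconds >= 60.0 {
        CheckpointFreshness::Late
    } else {
        CheckpointFreshness::Normal
    }
}
```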
Normal range for wal_lag_bytes: Depends on signal write rate. At 1K signals/sec, expect ~1.2 MB of WAL per 30-second checkpoint cycle. Alert if > 1 GB -- compaction may be failing.
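
The figures above imply roughly 40 bytes of WAL per signal (1.2 MB / (1,000 signals/sec x 30 sec)). That per-signal size is inferred from the numbers in this section, not a documented record size, so measure your own before relying on it. Under that assumption, the arithmetic can be sketched as follows; both function names are illustrative.

```rust
/// Expected WAL growth, in bytes, over one checkpoint interval.
/// `bytes_per_signal` is an assumption: about 40 is implied by this
/// document's 1.2 MB per 30 s at 1K signals/sec example.
pub fn expected_wal_bytes(signals_per_sec: u64, bytes_per_signal: u64, interval_secs: u64) -> u64 {
    signals_per_sec * bytes_per_signal * interval_secs
}

/// Seconds of uncompacted writes needed to reach `threshold_bytes`.
/// Returns None when nothing is being written.
pub fn secs_until_threshold(
    threshold_bytes: u64,
    signals_per_sec: u64,
    bytes_per_signal: u64,
) -> Option<u64> {
    let bytes_per_sec = signals_per_sec * bytes_per_signal;
    if bytes_per_sec == 0 {
        None
    } else {
        Some(threshold_bytes / bytes_per_sec)
    }
}
```

At 1K signals/sec and 40 bytes per signal, `secs_until_threshold(1_000_000_000, 1_000, 40)` gives 25,000 seconds, so the 1 GB alert corresponds to about seven hours of writes with no successful compaction.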
Normal range for checkpoint_failures_total: Should be 0. Any non-zero value means signal durability is at risk -- the hot tier is not being persisted. Investigate storage errors (disk full, I/O errors).
Index Health (feature-gated)
| Metric | Type | Unit | Description |
|---|---|---|---|
| `tidaldb_tantivy_segment_count` | gauge | count | Number of Tantivy index segments for the items text index. |
| `tidaldb_tantivy_indexed_docs` | gauge | count | Number of documents indexed in the items Tantivy text index. |
| `tidaldb_usearch_index_size_bytes` | gauge | bytes | Estimated total byte size of all USearch vector index files (f16). |
| `tidaldb_usearch_vector_count` | gauge | count | Number of vectors stored across all USearch indexes. |
| `tidaldb_bitmap_index_cardinality` | gauge | count | Total entity IDs across all four bitmap indexes (category + format + creator + tag). |
Index health metrics are refreshed every 10 seconds by the checkpoint thread, three times as often as the 30-second checkpoint itself, so operators get near-real-time visibility.
Normal range for tantivy_segment_count: Should stay below 20 during normal operation. Tantivy merges segments in the background. If segment count grows unbounded, the text syncer thread may have stalled.
Normal range for usearch_vector_count: Should match the number of entities with embeddings written via write_item_embedding() or write_creator_embedding().
Session Lifecycle (feature-gated)
| Metric | Type | Unit | Description |
|---|---|---|---|
| `tidaldb_active_sessions` | gauge | count | Number of currently active agent sessions. |
| `tidaldb_closed_sessions_total` | counter | count | Total agent sessions closed (explicitly or by sweeper) since database open. |
| `tidaldb_session_auto_closed_total` | counter | count | Total sessions auto-closed by the TTL sweeper due to exceeding `max_session_duration`. |
Normal range for active_sessions: Depends on your application's agent concurrency. Each open session consumes memory for signal state tracking. Alert if this grows unbounded -- agents may be leaking sessions (opening without closing).
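
"Grows unbounded" can be made concrete as a sustained positive slope of the gauge over a window. One common measure is the least-squares slope across the scraped samples, which is what PromQL's `deriv()` computes for gauges. The sketch below is a standalone illustration; `slope_per_second` is not a tidalDB function.

```rust
/// Least-squares slope of `(timestamp_secs, value)` samples, in units per second.
/// Returns None with fewer than two samples or when all timestamps are equal.
pub fn slope_per_second(samples: &[(f64, f64)]) -> Option<f64> {
    if samples.len() < 2 {
        return None;
    }
    let n = samples.len() as f64;
    let mean_t = samples.iter().map(|(t, _)| t).sum::<f64>() / n;
    let mean_v = samples.iter().map(|(_, v)| v).sum::<f64>() / n;

    let mut covariance = 0.0;
    let mut variance = 0.0;
    for &(t, v) in samples {
        covariance += (t - mean_t) * (v - mean_v);
        variance += (t - mean_t) * (t - mean_t);
    }
    if variance == 0.0 {
        None
    } else {
        Some(covariance / variance)
    }
}
```

A regression over the whole window is less sensitive to a single noisy scrape than comparing only the first and last samples.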
Rate Limiting and Degradation (feature-gated)
| Metric | Type | Unit | Description |
|---|---|---|---|
| `tidaldb_rate_limited_total` | counter | count | Total signal write requests rejected due to per-agent rate limits since database open. |
| `tidaldb_degradation_level` | gauge | level | Current graceful degradation level. 0 = full quality, 1 = reduced candidates, 2 = coarse aggregates, 3 = no diversity enforcement. |
Normal range for degradation_level: Should be 0 during normal operation. Any value > 0 means the load detector has triggered degradation to protect latency. Investigate system load (CPU, memory pressure, I/O saturation).
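
Client code that reads this gauge may find it convenient to map the numeric level back to a named state. The enum below mirrors the four levels listed in the table; the type and its variant names are illustrative and are not taken from the tidalDB source.

```rust
/// Named view of the `tidaldb_degradation_level` gauge values.
/// Illustrative type; variant names follow the descriptions in this document.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DegradationLevel {
    FullQuality,
    ReducedCandidates,
    CoarseAggregates,
    NoDiversityEnforcement,
}

impl DegradationLevel {
    /// Converts a scraped gauge value into a level.
    /// Returns None for non-integral, non-finite, or out-of-range values.
    pub fn from_gauge(value: f64) -> Option<Self> {
        // NaN and infinities have a NaN fractional part, so they fail here too.
        if value.fract() != 0.0 {
            return None;
        }
        match value as i64 {
            0 => Some(Self::FullQuality),
            1 => Some(Self::ReducedCandidates),
            2 => Some(Self::CoarseAggregates),
            3 => Some(Self::NoDiversityEnforcement),
            _ => None,
        }
    }

    /// True for any level above 0, matching the alert condition `> 0`.
    pub fn is_degraded(self) -> bool {
        self != Self::FullQuality
    }
}
```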
Recommended Alerts
| Alert Name | Condition | Severity | Meaning |
|---|---|---|---|
| TidalDB Down | `tidaldb_health_ok == 0` | Critical | Database is unhealthy or shut down. Immediate investigation required. |
| Checkpoint Stale | `tidaldb_checkpoint_age_seconds > 300` | Warning | No successful checkpoint in 5+ minutes. Signal durability at risk. Check storage I/O and disk space. |
| Checkpoint Failures | `tidaldb_checkpoint_failures_total > 0` | Warning | At least one checkpoint has failed. Signal state may not be durable. Check disk space and storage errors. |
| WAL Disk Pressure | `tidaldb_wal_lag_bytes > 1000000000` | Warning | WAL exceeds 1 GB uncompacted. Compaction may be stuck or checkpoint is failing. |
| Signal Backlog | `tidaldb_signal_hot_entries > 4000000` | Warning | Signal ledger over 80% of the 5M entry budget. Cold entry trimming will begin at 5M. |
| Degraded Ranking | `tidaldb_degradation_level > 0` | Warning | Load-based degradation is active. Ranking quality is reduced to protect latency. Scale up or reduce load. |
| Session Leak | `deriv(tidaldb_active_sessions[5m]) > 10 and tidaldb_active_sessions > 100` | Warning | Active session count growing rapidly (more than 10 sessions per second). Agents may not be closing sessions. `deriv()` is used because `tidaldb_active_sessions` is a gauge; `rate()` is only meaningful for counters. |
| High Rate Limiting | `rate(tidaldb_rate_limited_total[5m]) > 100` | Info | Sustained rate limiting. Review agent rate limit configuration or reduce write volume. |
| Tantivy Segment Bloat | `tidaldb_tantivy_segment_count > 30` | Warning | Tantivy has many unmerged segments. Text syncer may be stalled. |
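
For environments without Alertmanager, the threshold-based rows of the table can be checked in process against a snapshot of scraped values. The sketch below covers only the instantaneous conditions; the rate-based alerts need a time window and are left out. `Snapshot` and `firing_alerts` are hypothetical names, and the thresholds are copied from the table above.

```rust
/// Latest scraped values for the metrics used by the threshold alerts.
/// Hypothetical type; populate it from your own scrape of /metrics.
#[derive(Debug, Clone, Copy)]
pub struct Snapshot {
    pub health_ok: f64,
    pub checkpoint_age_seconds: f64,
    pub checkpoint_failures_total: f64,
    pub wal_lag_bytes: f64,
    pub signal_hot_entries: f64,
    pub degradation_level: f64,
    pub tantivy_segment_count: f64,
}

/// Returns the names of the threshold alerts that fire for `s`,
/// in the same order as the alert table.
pub fn firing_alerts(s: &Snapshot) -> Vec<&'static str> {
    let rules: [(&'static str, bool); 7] = [
        ("TidalDB Down", s.health_ok == 0.0),
        ("Checkpoint Stale", s.checkpoint_age_seconds > 300.0),
        ("Checkpoint Failures", s.checkpoint_failures_total > 0.0),
        ("WAL Disk Pressure", s.wal_lag_bytes > 1_000_000_000.0),
        ("Signal Backlog", s.signal_hot_entries > 4_000_000.0),
        ("Degraded Ranking", s.degradation_level > 0.0),
        ("Tantivy Segment Bloat", s.tantivy_segment_count > 30.0),
    ];
    rules
        .iter()
        .filter(|(_, firing)| *firing)
        .map(|(name, _)| *name)
        .collect()
}
```

Keeping the rules in one table-like array makes it straightforward to compare the code against the documented thresholds when either changes.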
Grafana Dashboard Suggestions
Row 1: Health overview
- `tidaldb_health_ok` (stat panel, green/red)
- `tidaldb_uptime_seconds` (stat panel)
- `tidaldb_degradation_level` (stat panel, thresholds at 1/2/3)
- `tidaldb_info` labels (stat panel showing version + build hash)

Row 2: Signal throughput
- `rate(tidaldb_signal_writes_total[5m])` (time series, signals/sec)
- `tidaldb_signal_write_latency_us` histogram (heatmap or quantile panel)
- `tidaldb_signal_hot_entries` (gauge, threshold at 4M/5M)

Row 3: Durability
- `tidaldb_checkpoint_age_seconds` (time series, threshold line at 300)
- `tidaldb_checkpoint_failures_total` (stat panel, should be 0)
- `tidaldb_wal_lag_bytes` (time series)
- `rate(tidaldb_wal_compacted_segments_total[5m])` (time series)

Row 4: Index health
- `tidaldb_tantivy_indexed_docs` (stat panel)
- `tidaldb_tantivy_segment_count` (gauge)
- `tidaldb_usearch_vector_count` (stat panel)
- `tidaldb_usearch_index_size_bytes` (stat panel, bytes format)
- `tidaldb_bitmap_index_cardinality` (stat panel)

Row 5: Sessions
- `tidaldb_active_sessions` (time series)
- `rate(tidaldb_closed_sessions_total[5m])` (time series)
- `tidaldb_session_auto_closed_total` (stat panel)
- `rate(tidaldb_rate_limited_total[5m])` (time series)