# Monitoring

This document covers tidalDB's built-in Prometheus metrics endpoint, all exposed metrics, and recommended alerting thresholds.

---

## Setup

Enable the metrics HTTP server via the builder:

```rust
let db = TidalDb::builder()
    .with_data_dir("/var/lib/tidaldb")
    .with_schema(schema)
    .enable_metrics("127.0.0.1:9090")
    .open()?;

// Discover the bound address (useful when using port 0):
if let Some(addr) = db.metrics_addr() {
    println!("metrics at http://{}/metrics", addr);
    println!("health at http://{}/healthz", addr);
}
```

**Security:** The metrics endpoint has no authentication. Bind to `127.0.0.1` (loopback) only. If you need to scrape from a remote Prometheus server, use your infrastructure's network controls (SSH tunnel, reverse proxy with auth, or VPN) rather than binding to `0.0.0.0`. tidalDB logs a WARN-level message if you bind to a non-loopback address.

**Feature flag:** The metrics HTTP server requires the `metrics` feature, which is enabled by default. Build with `--no-default-features` to disable the HTTP server entirely. Base metrics (`uptime_seconds`, `health_ok`, `info`, `checkpoint_failures_total`) are always compiled regardless of the feature flag.

---

## Endpoints

| Path | Content-Type | Description |
|:-----|:-------------|:------------|
| `/metrics` | `text/plain` | Prometheus text exposition format |
| `/healthz` | `application/json` | JSON health check: `{"status":"ok","uptime_seconds":123.456,"version":"0.1.0","build_hash":"..."}` |

---

## Prometheus Scrape Configuration

```yaml
scrape_configs:
  - job_name: 'tidaldb'
    static_configs:
      - targets: ['127.0.0.1:9090']
    scrape_interval: 15s
```

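Each scrape of `/metrics` returns plain-text sample lines in the Prometheus exposition format. As a rough illustration of what a scraper sees, the sketch below parses one such line using only the Rust standard library. `Sample`, `parse_sample`, and `parse_labels` are hypothetical helper names, not part of tidalDB, and the parser deliberately ignores escaped quotes and commas inside label values.

```rust
/// One parsed sample line from the Prometheus text exposition format.
/// (Hypothetical helper type for illustration; not a tidalDB API.)
#[derive(Debug, PartialEq)]
pub struct Sample {
    pub name: String,
    pub labels: Vec<(String, String)>,
    pub value: f64,
}

/// Parses a single sample line such as `tidaldb_health_ok{partition_id="0"} 1`.
/// Returns `None` for blank lines, `# HELP` / `# TYPE` comments, and malformed input.
pub fn parse_sample(line: &str) -> Option<Sample> {
    let line = line.trim();
    if line.is_empty() || line.starts_with('#') {
        return None;
    }
    // Separate the metric name, the optional `{...}` label set, and the remainder.
    let (name, labels, rest) = match line.find('{') {
        Some(open) => {
            let close = open + line[open..].find('}')?;
            let labels = parse_labels(&line[open + 1..close])?;
            (&line[..open], labels, &line[close + 1..])
        }
        None => {
            let split = line.find(char::is_whitespace)?;
            (&line[..split], Vec::new(), &line[split..])
        }
    };
    // The value is the first token after the name; an optional timestamp may follow.
    let value = rest.split_whitespace().next()?.parse::<f64>().ok()?;
    Some(Sample { name: name.to_string(), labels, value })
}

/// Parses `key="value"` pairs separated by commas.
/// Simplification: label values containing `,`, `}` or escaped quotes are not supported.
fn parse_labels(body: &str) -> Option<Vec<(String, String)>> {
    let mut labels = Vec::new();
    for pair in body.split(',').filter(|p| !p.trim().is_empty()) {
        let (key, raw) = pair.split_once('=')?;
        let value = raw.trim().strip_prefix('"')?.strip_suffix('"')?;
        labels.push((key.trim().to_string(), value.to_string()));
    }
    Some(labels)
}
```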
---

## Metrics Reference

All metrics use the `tidaldb_` prefix. Metrics marked with "(feature-gated)" are only emitted when the `metrics` Cargo feature is enabled (default: enabled).

### Build and Health

| Metric | Type | Description | Labels |
|:-------|:-----|:------------|:-------|
| `tidaldb_uptime_seconds` | gauge | Seconds since the database was opened. Monotonically increasing. | `partition_id="0"` |
| `tidaldb_health_ok` | gauge | Whether the database is healthy. `1` = ok, `0` = degraded or closed. | `partition_id="0"` |
| `tidaldb_info` | gauge | Build and version information. Always `1`. | `version`, `build_hash`, `partition_id="0"` |

**Normal range for `tidaldb_health_ok`:** Always `1` during normal operation. Drops to `0` during shutdown or if an internal health check fails. Alert immediately if `0` during expected uptime.

### Signal System (feature-gated)

| Metric | Type | Unit | Description |
|:-------|:-----|:-----|:------------|
| `tidaldb_signal_writes_total` | counter | count | Total signal writes since database open. Includes all signal types across all entities. |
| `tidaldb_signal_hot_entries` | gauge | count | Number of entries currently in the signal ledger hot tier (DashMap). Each entry is one `(entity_id, signal_type_id)` pair. |
| `tidaldb_signal_write_latency_us` | histogram | microseconds | Signal write latency distribution. Bucket boundaries: 1, 5, 10, 25, 50, 100, 250, 500, 1000, 5000, 10000 microseconds. |

**Normal range for `signal_hot_entries`:** Proportional to `active_entities * signal_types_per_entity`. The hot tier is trimmed at 5M entries (`DEFAULT_MAX_SIGNAL_ENTRIES`). Alert if approaching 80% of budget (4M entries).

**Normal range for `signal_write_latency_us`:** p50 should be < 50 µs, p99 should be < 1 ms. If p99 exceeds 5 ms, investigate WAL write latency or DashMap contention.

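The p50 and p99 targets for the latency histogram are estimated from bucket counts rather than measured directly. The following sketch shows one way to do that, using the same linear interpolation within a bucket that Prometheus's `histogram_quantile()` applies to classic histograms. `estimate_quantile` is an illustrative helper, not a tidalDB API, and the input is assumed to be cumulative counts per bucket in the order of the boundaries listed in the table.

```rust
/// Upper bucket bounds in microseconds, as listed for
/// `tidaldb_signal_write_latency_us`.
pub const BUCKET_BOUNDS_US: [f64; 11] = [
    1.0, 5.0, 10.0, 25.0, 50.0, 100.0, 250.0, 500.0, 1000.0, 5000.0, 10000.0,
];

/// Estimates quantile `q` (0.0..=1.0) from cumulative bucket counts.
///
/// `cumulative[i]` is the number of observations <= `BUCKET_BOUNDS_US[i]`;
/// `total` is the overall observation count (the `+Inf` bucket).
/// Returns `None` when there are no observations or `q` is out of range.
pub fn estimate_quantile(cumulative: &[u64; 11], total: u64, q: f64) -> Option<f64> {
    if total == 0 || !(0.0..=1.0).contains(&q) {
        return None;
    }
    let rank = q * total as f64;
    // The first bucket is assumed to start at 0 microseconds.
    let mut lower_bound = 0.0;
    let mut lower_count = 0u64;
    for (i, &count) in cumulative.iter().enumerate() {
        if count as f64 >= rank {
            let in_bucket = count.saturating_sub(lower_count) as f64;
            if in_bucket == 0.0 {
                return Some(lower_bound);
            }
            // Interpolate linearly between the bucket's lower and upper bound.
            let fraction = (rank - lower_count as f64) / in_bucket;
            return Some(lower_bound + (BUCKET_BOUNDS_US[i] - lower_bound) * fraction);
        }
        lower_bound = BUCKET_BOUNDS_US[i];
        lower_count = count;
    }
    // The quantile falls beyond the last finite bucket: report the highest finite bound.
    Some(BUCKET_BOUNDS_US[10])
}
```

For example, with 100 observations of which 40 fall at or below 10 µs and 70 at or below 25 µs, the p50 estimate lands at 15 µs, inside the 50 µs target.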
### WAL and Checkpoint (feature-gated)

| Metric | Type | Unit | Description |
|:-------|:-----|:-----|:------------|
| `tidaldb_wal_lag_bytes` | gauge | bytes | Total bytes of WAL segment files not yet compacted. Updated after each checkpoint cycle. |
| `tidaldb_wal_compacted_segments_total` | counter | count | Total WAL segments deleted by compaction since database open. |
| `tidaldb_checkpoint_age_seconds` | gauge | seconds | Seconds since the last successful signal checkpoint. Derived from `last_checkpoint_ns` at render time. |
| `tidaldb_checkpoint_failures_total` | counter | count | Total number of failed periodic signal checkpoints. **Not feature-gated** -- always emitted. |

**Normal range for `checkpoint_age_seconds`:** Should stay below 60 seconds (checkpoint runs every 30 seconds, with some jitter from the 500ms poll interval). Alert if > 300 seconds (5 minutes) -- the checkpoint thread may be stuck or the storage engine is under pressure.

**Normal range for `wal_lag_bytes`:** Depends on signal write rate. At 1K signals/sec, expect ~1.2 MB of WAL per 30-second checkpoint cycle. Alert if > 1 GB -- compaction may be failing.

**Normal range for `checkpoint_failures_total`:** Should be 0. Any non-zero value means signal durability is at risk -- the hot tier is not being persisted. Investigate storage errors (disk full, I/O errors).

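The WAL figures quoted for 1K signals/sec can be turned into a rough sizing rule for other write rates. The sketch below works through that arithmetic. It assumes decimal units (1 MB = 10^6 bytes, 1 GB = 10^9 bytes) and linear WAL growth; the function and constant names are illustrative only and do not come from tidalDB.

```rust
/// Checkpoint interval quoted in the text, in seconds.
pub const CHECKPOINT_INTERVAL_SECS: f64 = 30.0;
/// WAL lag alert threshold quoted in the text (1 GB, decimal).
pub const WAL_ALERT_BYTES: f64 = 1_000_000_000.0;

/// Average WAL bytes per signal implied by a write rate and the WAL growth
/// observed over one checkpoint cycle.
pub fn bytes_per_signal(signals_per_sec: f64, wal_bytes_per_cycle: f64) -> f64 {
    wal_bytes_per_cycle / (signals_per_sec * CHECKPOINT_INTERVAL_SECS)
}

/// Seconds until WAL lag would cross the alert threshold if compaction
/// stopped entirely and writes continued at a constant rate.
pub fn secs_until_alert(signals_per_sec: f64, bytes_per_signal: f64) -> f64 {
    WAL_ALERT_BYTES / (signals_per_sec * bytes_per_signal)
}
```

Plugging in the documented numbers, `bytes_per_signal(1000.0, 1_200_000.0)` gives 40 bytes per signal, and `secs_until_alert(1000.0, 40.0)` gives 25,000 seconds, so at that rate a stalled compactor would take roughly seven hours to trip the 1 GB alert.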
### Index Health (feature-gated)

| Metric | Type | Unit | Description |
|:-------|:-----|:-----|:------------|
| `tidaldb_tantivy_segment_count` | gauge | count | Number of Tantivy index segments for the items text index. |
| `tidaldb_tantivy_indexed_docs` | gauge | count | Number of documents indexed in the items Tantivy text index. |
| `tidaldb_usearch_index_size_bytes` | gauge | bytes | Estimated total byte size of all USearch vector index files (f16). |
| `tidaldb_usearch_vector_count` | gauge | count | Number of vectors stored across all USearch indexes. |
| `tidaldb_bitmap_index_cardinality` | gauge | count | Total entity IDs across all four bitmap indexes (category + format + creator + tag). |

Index health metrics are refreshed every 10 seconds by the checkpoint thread (3x more frequently than checkpoints) so operators get near-real-time visibility.

**Normal range for `tantivy_segment_count`:** Should stay below 20 during normal operation. Tantivy merges segments in the background. If segment count grows unbounded, the text syncer thread may have stalled.

**Normal range for `usearch_vector_count`:** Should match the number of entities with embeddings written via `write_item_embedding()` or `write_creator_embedding()`.

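One way to sanity-check `tidaldb_usearch_index_size_bytes` against `tidaldb_usearch_vector_count` is to compare it with the raw f16 payload, at 2 bytes per vector component. The sketch below is only a lower-bound estimate: it ignores graph links, keys, and file headers, and the embedding dimension is an assumption you must supply, since it is not exposed as a metric here. The helper names are hypothetical.

```rust
/// Bytes per vector component when vectors are stored as f16.
pub const F16_BYTES: u64 = 2;

/// Raw payload size of `vector_count` vectors with `dimensions` components each.
/// This is a lower bound: index graph structure and metadata are not counted.
pub fn raw_vector_bytes(vector_count: u64, dimensions: u64) -> u64 {
    vector_count * dimensions * F16_BYTES
}

/// Ratio of the reported index size to the raw payload.
/// Returns `None` when there are no vectors to compare against.
pub fn overhead_ratio(index_size_bytes: u64, vector_count: u64, dimensions: u64) -> Option<f64> {
    let raw = raw_vector_bytes(vector_count, dimensions);
    if raw == 0 {
        None
    } else {
        Some(index_size_bytes as f64 / raw as f64)
    }
}
```

For instance, one million vectors at an assumed 768 dimensions carry 1,536,000,000 bytes of raw f16 data, so a reported index size far below that figure would suggest the two metrics are out of step.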
### Session Lifecycle (feature-gated)

| Metric | Type | Unit | Description |
|:-------|:-----|:-----|:------------|
| `tidaldb_active_sessions` | gauge | count | Number of currently active agent sessions. |
| `tidaldb_closed_sessions_total` | counter | count | Total agent sessions closed (explicitly or by sweeper) since database open. |
| `tidaldb_session_auto_closed_total` | counter | count | Total sessions auto-closed by the TTL sweeper due to exceeding `max_session_duration`. |

**Normal range for `active_sessions`:** Depends on your application's agent concurrency. Each open session consumes memory for signal state tracking. Alert if this grows unbounded -- agents may be leaking sessions (opening without closing).

### Rate Limiting and Degradation (feature-gated)

| Metric | Type | Unit | Description |
|:-------|:-----|:-----|:------------|
| `tidaldb_rate_limited_total` | counter | count | Total signal write requests rejected due to per-agent rate limits since database open. |
| `tidaldb_degradation_level` | gauge | level | Current graceful degradation level. `0` = full quality, `1` = reduced candidates, `2` = coarse aggregates, `3` = no diversity enforcement. |

**Normal range for `degradation_level`:** Should be `0` during normal operation. Any value > 0 means the load detector has triggered degradation to protect latency. Investigate system load (CPU, memory pressure, I/O saturation).

---

## Recommended Alerts

| Alert Name | Condition | Severity | Meaning |
|:-----------|:----------|:---------|:--------|
| TidalDB Down | `tidaldb_health_ok == 0` | Critical | Database is unhealthy or shut down. Immediate investigation required. |
| Checkpoint Stale | `tidaldb_checkpoint_age_seconds > 300` | Warning | Checkpoint has not run in 5+ minutes. Signal durability at risk. Check storage I/O and disk space. |
| Checkpoint Failures | `tidaldb_checkpoint_failures_total > 0` | Warning | At least one checkpoint has failed. Signal state may not be durable. Check disk space and storage errors. |
| WAL Disk Pressure | `tidaldb_wal_lag_bytes > 1000000000` | Warning | WAL exceeds 1 GB uncompacted. Compaction may be stuck or checkpoint is failing. |
| Signal Backlog | `tidaldb_signal_hot_entries > 4000000` | Warning | Signal ledger over 80% of the 5M entry budget. Cold entry trimming will begin at 5M. |
| Degraded Ranking | `tidaldb_degradation_level > 0` | Warning | Load-based degradation is active. Ranking quality is reduced to protect latency. Scale up or reduce load. |
| Session Leak | `deriv(tidaldb_active_sessions[5m]) > 10 and tidaldb_active_sessions > 100` | Warning | Active session count growing rapidly. Agents may not be closing sessions. (`deriv()` is used because `tidaldb_active_sessions` is a gauge; `rate()` is only valid on counters.) |
| High Rate Limiting | `rate(tidaldb_rate_limited_total[5m]) > 100` | Info | Sustained rate limiting. Review agent rate limit configuration or reduce write volume. |
| Tantivy Segment Bloat | `tidaldb_tantivy_segment_count > 30` | Warning | Tantivy has many unmerged segments. Text syncer may be stalled. |

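The conditions in the table map directly onto Prometheus alerting rules. The fragment below is a minimal sketch covering three of them. The group name, the `for` durations, and the annotation text are assumptions chosen for illustration, so tune them to your own paging policy.

```yaml
groups:
  - name: tidaldb
    rules:
      - alert: TidalDBDown
        expr: tidaldb_health_ok == 0
        for: 1m            # assumed hold time, not specified by tidalDB
        labels:
          severity: critical
        annotations:
          summary: "tidalDB is unhealthy or shut down"
      - alert: TidalDBCheckpointStale
        expr: tidaldb_checkpoint_age_seconds > 300
        for: 1m            # assumed hold time
        labels:
          severity: warning
        annotations:
          summary: "No successful checkpoint in 5+ minutes"
      - alert: TidalDBWalDiskPressure
        expr: tidaldb_wal_lag_bytes > 1000000000
        for: 5m            # assumed hold time
        labels:
          severity: warning
        annotations:
          summary: "More than 1 GB of uncompacted WAL"
```

Load a file like this through `rule_files` in `prometheus.yml`; the remaining rows of the table follow the same pattern.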
### Grafana Dashboard Suggestions

**Row 1: Health overview**
- `tidaldb_health_ok` (stat panel, green/red)
- `tidaldb_uptime_seconds` (stat panel)
- `tidaldb_degradation_level` (stat panel, thresholds at 1/2/3)
- `tidaldb_info` labels (stat panel showing version + build hash)

**Row 2: Signal throughput**
- `rate(tidaldb_signal_writes_total[5m])` (time series, signals/sec)
- `tidaldb_signal_write_latency_us` histogram (heatmap or quantile panel)
- `tidaldb_signal_hot_entries` (gauge, threshold at 4M/5M)

**Row 3: Durability**
- `tidaldb_checkpoint_age_seconds` (time series, threshold line at 300)
- `tidaldb_checkpoint_failures_total` (stat panel, should be 0)
- `tidaldb_wal_lag_bytes` (time series)
- `rate(tidaldb_wal_compacted_segments_total[5m])` (time series)

**Row 4: Index health**
- `tidaldb_tantivy_indexed_docs` (stat panel)
- `tidaldb_tantivy_segment_count` (gauge)
- `tidaldb_usearch_vector_count` (stat panel)
- `tidaldb_usearch_index_size_bytes` (stat panel, bytes format)
- `tidaldb_bitmap_index_cardinality` (stat panel)

**Row 5: Sessions**
- `tidaldb_active_sessions` (time series)
- `rate(tidaldb_closed_sessions_total[5m])` (time series)
- `tidaldb_session_auto_closed_total` (stat panel)
- `rate(tidaldb_rate_limited_total[5m])` (time series)