# Phase 8B: Observability

Prometheus metrics and admin endpoints for monitoring StemeDB clusters.

## Overview

StemeDB exposes metrics in Prometheus format and provides admin endpoints for operators to monitor cluster health, diagnose sync issues, and force anti-entropy convergence.

## Endpoints

### Standalone API Server (stemedb-api)

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/metrics` | GET | Prometheus metrics in text format |

### Cluster Gateway (stemedb-cluster)

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/metrics` | GET | Prometheus metrics in text format |
| `/v1/admin/cluster` | GET | Cluster status (alias for `/v1/cluster/status`) |
| `/v1/admin/ranges` | GET | All shard/range assignments |
| `/v1/admin/sync` | POST | Force anti-entropy sync |

## Metrics Reference

### Application Metrics (stemedb-api)

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `stemedb_assertions_total` | Gauge | - | Total assertions in database (updated on health check) |
| `stemedb_assertions_ingested_total` | Counter | - | Assertions ingested via `POST /v1/assert` |
| `stemedb_queries_total` | Counter | `endpoint` | Queries executed (query, skeptic, layered, constraints) |
| `stemedb_query_latency_seconds` | Histogram | `endpoint` | End-to-end query latency by endpoint |
| `stemedb_quarantine_pending` | Gauge | - | Pending quarantine events (updated on health check) |
| `stemedb_circuit_breakers_open` | Gauge | - | Open circuit breakers (updated on health check) |

**Source files:**

- `handlers/health.rs`: gauges for assertions_total, quarantine_pending, circuit_breakers_open
- `handlers/assert.rs`: counter for assertions_ingested_total
- `handlers/query.rs`, `skeptic.rs`, `layered.rs`, `constraints.rs`: counter + histogram per endpoint

### Grafana Dashboard

A pre-built Grafana dashboard is available at `docs/grafana/stemedb-overview.json`.

**Rows:**

1. **Overview**: assertions_total, queries/sec, quarantine_pending, circuit_breakers_open (stat panels)
2. **Query Performance**: latency p50/p95/p99 histogram, queries by endpoint (time series)
3. **Cluster Health**: node counts, sync lag, convergence latency
4. **Write Path**: assertions ingested rate, sync throughput

Import via Grafana UI > Dashboards > Import. Uses `${DS_PROMETHEUS}` variable for datasource portability.

### Sync Metrics (stemedb-sync)

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `stemedb_sync_cycles_total` | Counter | `peer` | Total anti-entropy sync cycles completed |
| `stemedb_sync_failures_total` | Counter | `peer` | Total sync failures |
| `stemedb_assertions_synced_total` | Counter | `peer` | Total assertions synced from peers |
| `stemedb_sync_lag_seconds` | Gauge | `peer` | Seconds since last successful sync with peer |
| `stemedb_merkle_diff_size` | Gauge | `peer` | Number of assertions different from peer |
| `stemedb_convergence_latency_seconds` | Histogram | `peer` | Time to converge after detecting divergence |

### Membership Metrics (stemedb-cluster)

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `stemedb_membership_events_total` | Counter | `type` | Membership change events |
| `stemedb_cluster_nodes_alive` | Gauge | - | Number of alive nodes |
| `stemedb_cluster_nodes_suspect` | Gauge | - | Number of suspect nodes |
| `stemedb_cluster_nodes_total` | Gauge | - | Total nodes (alive + suspect) |

### Membership Event Types

| Type | Description |
|------|-------------|
| `joined` | Node joined the cluster |
| `suspected` | Node marked as suspect (unresponsive) |
| `failed` | Node marked as dead |
| `left` | Node left gracefully |
| `recovered` | Node recovered from suspect state |

## Admin API Details

### GET /v1/admin/ranges

Returns all shard assignments with their key ranges, replicas, and size metrics.

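
When triaging a suspect node, it can help to list which ranges that node leads or replicates. The sketch below is only an illustration and assumes the payload has already been decoded into plain structs. `RangeInfo`, `Role`, and `ranges_on_node` are hypothetical names; only the field names `range_id`, `leader_node`, and `replica_nodes` are taken from the response example in this section.

```rust
/// Subset of one entry in the `ranges` array (hypothetical struct;
/// field names follow the response example in this section).
#[derive(Debug, Clone)]
pub struct RangeInfo {
    pub range_id: String,
    pub leader_node: String,
    pub replica_nodes: Vec<String>,
}

/// How a node participates in a range.
#[derive(Debug, PartialEq, Eq)]
pub enum Role {
    Leader,
    Replica,
}

/// Returns `(range_id, role)` for every range that `node_id` leads or replicates.
/// A node that is both leader and replica of a range is reported once, as leader.
pub fn ranges_on_node<'a>(ranges: &'a [RangeInfo], node_id: &str) -> Vec<(&'a str, Role)> {
    ranges
        .iter()
        .filter_map(|r| {
            if r.leader_node == node_id {
                Some((r.range_id.as_str(), Role::Leader))
            } else if r.replica_nodes.iter().any(|n| n == node_id) {
                Some((r.range_id.as_str(), Role::Replica))
            } else {
                None
            }
        })
        .collect()
}
```

Ranges where the suspect node is the leader are likely the first ones to check, since the remaining replicas would have to serve them.
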

**Response:**

```json
{
  "ranges": [
    {
      "range_id": "shard_0",
      "start_key": "",
      "end_key": "8000000000000000000000000000000000000000000000000000000000000000",
      "size_bytes": 1048576,
      "assertion_count": 1000,
      "leader_node": "abc123",
      "replica_nodes": ["abc123", "def456"],
      "generation": 1
    }
  ],
  "total_ranges": 16
}
```

### POST /v1/admin/sync

Triggers immediate anti-entropy sync with all peers, bypassing the normal interval timer.

**Request:**

```json
{
  "peer_id": null
}
```

**Response:**

```json
{
  "triggered": true,
  "peers_notified": 3,
  "message": "Anti-entropy sync triggered for 3 peer(s)"
}
```

## Example Prometheus Queries

### Sync Health

```promql
# Sync lag per peer (should be < 60s normally)
stemedb_sync_lag_seconds

# Sync failure rate over 5 minutes
rate(stemedb_sync_failures_total[5m])

# 95th-percentile convergence time
histogram_quantile(0.95, rate(stemedb_convergence_latency_seconds_bucket[5m]))
```

### Cluster Health

```promql
# Total cluster size
stemedb_cluster_nodes_total

# Percentage of healthy nodes
stemedb_cluster_nodes_alive / stemedb_cluster_nodes_total * 100

# Membership churn rate
rate(stemedb_membership_events_total[1h])
```

### Replication Throughput

```promql
# Assertions synced per second
rate(stemedb_assertions_synced_total[1m])

# Merkle diff backlog (should trend toward 0)
sum(stemedb_merkle_diff_size)
```

## Grafana Dashboard Suggestions

1. **Cluster Overview Panel**
   - Nodes alive/suspect/total gauges
   - Membership event timeline
2. **Sync Health Panel**
   - Sync lag heatmap by peer
   - Convergence latency histogram
   - Sync failure rate alert
3. **Replication Panel**
   - Assertions synced rate
   - Merkle diff backlog trend
   - Sync cycles per peer

## Alerting Rules

```yaml
groups:
  - name: stemedb-sync
    rules:
      - alert: SyncLagHigh
        expr: stemedb_sync_lag_seconds > 300
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Sync lag with peer {{ $labels.peer }} is {{ $value }}s"

      - alert: MerkleDiffBacklog
        expr: stemedb_merkle_diff_size > 10000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Large Merkle diff with peer {{ $labels.peer }}: {{ $value }} assertions"

      - alert: ClusterNodeDown
        expr: stemedb_cluster_nodes_alive < 3
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Cluster has only {{ $value }} alive nodes"
```

## User Journey: Incident Response

```
[Grafana alert: SyncLagHigh fires]
  -> [SRE opens /v1/admin/cluster to see node status]
  -> [Identifies node-3 has state "suspect"]
  -> [Checks /v1/admin/ranges to see if node-3 ranges are affected]
  -> [Triggers POST /v1/admin/sync to force anti-entropy]
  -> [Monitors stemedb_merkle_diff_size dropping toward 0]
  -> [Alert auto-resolves when sync_lag < 300s]
```

## Implementation Notes

### Force Sync Mechanism

The admin sync endpoint uses `tokio::sync::Notify` to signal anti-entropy workers:

1. Gateway registers notify handles from each `AntiEntropyWorker`
2. `POST /v1/admin/sync` calls `notify.notify_one()` on all handles
3. Workers wake from `tokio::select!` and run sync immediately
4. Normal interval-based sync continues after force sync completes

### Metrics Storage

Metrics use the `metrics` crate with `metrics-exporter-prometheus`:

- Counters/gauges are lock-free atomic operations
- Histograms use DDSketch for memory-efficient percentiles
- Labels are allocated once per unique label combination
- `/metrics` endpoint renders all registered metrics in Prometheus format

## Related Documentation

- [API Documentation](../services/api.md)
- [Phase 6 UAT](phase6-uat.md)
- [Distributed Architecture](../../docs/research/distributed-write-path.md)
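
## Appendix: Force Sync Sketch

The wake-or-wait behavior described under Force Sync Mechanism can be illustrated without an async runtime. The following is a std-only analogue, not the StemeDB implementation: a channel read with `recv_timeout` stands in for `tokio::sync::Notify` combined with `tokio::select!`, and the names `SyncCause`, `spawn_worker`, and `force_sync` are hypothetical.

```rust
use std::sync::mpsc::{self, Receiver, RecvTimeoutError, Sender};
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Why a sync cycle ran.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum SyncCause {
    Interval,
    Forced,
}

/// Spawns a worker that syncs every `interval`, or immediately when a unit
/// message arrives on the returned sender. Each cycle is reported on
/// `reports`. Dropping the returned sender stops the worker.
pub fn spawn_worker(
    interval: Duration,
    reports: Sender<SyncCause>,
) -> (Sender<()>, JoinHandle<()>) {
    let (force_tx, force_rx): (Sender<()>, Receiver<()>) = mpsc::channel();
    let handle = thread::spawn(move || loop {
        // Wait for whichever comes first: a force signal or the interval.
        let cause = match force_rx.recv_timeout(interval) {
            Ok(()) => SyncCause::Forced,
            Err(RecvTimeoutError::Timeout) => SyncCause::Interval,
            Err(RecvTimeoutError::Disconnected) => break,
        };
        // A real worker would run the anti-entropy exchange here.
        if reports.send(cause).is_err() {
            break; // nobody is listening for reports any more
        }
    });
    (force_tx, handle)
}

/// Gateway side: signal every registered worker and return how many
/// accepted the signal.
pub fn force_sync(workers: &[Sender<()>]) -> usize {
    workers.iter().filter(|tx| tx.send(()).is_ok()).count()
}
```

One difference from the original design is worth keeping in mind: `Notify::notify_one` stores at most one permit, so repeated force requests coalesce, whereas this channel queues every signal and would run one extra cycle per queued message.
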