stemedb/ai-lookup/features/observability.md
jordan b3e8a9a058 feat: Multi-application expansion with chaos testing and community UI
2026-02-04 01:24:14 -07:00


# Phase 8B: Observability
Prometheus metrics and admin endpoints for monitoring StemeDB clusters.
## Overview
StemeDB exposes metrics in Prometheus format and provides admin endpoints for operators to monitor cluster health, diagnose sync issues, and force anti-entropy convergence.
## Endpoints
### Standalone API Server (stemedb-api)
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/metrics` | GET | Prometheus metrics in text format |
### Cluster Gateway (stemedb-cluster)
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/metrics` | GET | Prometheus metrics in text format |
| `/v1/admin/cluster` | GET | Cluster status (alias for `/v1/cluster/status`) |
| `/v1/admin/ranges` | GET | All shard/range assignments |
| `/v1/admin/sync` | POST | Force anti-entropy sync |
## Metrics Reference
### Sync Metrics (stemedb-sync)
| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `stemedb_sync_cycles_total` | Counter | `peer` | Total anti-entropy sync cycles completed |
| `stemedb_sync_failures_total` | Counter | `peer` | Total sync failures |
| `stemedb_assertions_synced_total` | Counter | `peer` | Total assertions synced from peers |
| `stemedb_sync_lag_seconds` | Gauge | `peer` | Seconds since last successful sync with peer |
| `stemedb_merkle_diff_size` | Gauge | `peer` | Number of assertions different from peer |
| `stemedb_convergence_latency_seconds` | Histogram | `peer` | Time to converge after detecting divergence |
### Membership Metrics (stemedb-cluster)
| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `stemedb_membership_events_total` | Counter | `type` | Membership change events |
| `stemedb_cluster_nodes_alive` | Gauge | - | Number of alive nodes |
| `stemedb_cluster_nodes_suspect` | Gauge | - | Number of suspect nodes |
| `stemedb_cluster_nodes_total` | Gauge | - | Total nodes (alive + suspect) |
### Membership Event Types
| Type | Description |
|------|-------------|
| `joined` | Node joined the cluster |
| `suspected` | Node marked as suspect (unresponsive) |
| `failed` | Node marked as dead |
| `left` | Node left gracefully |
| `recovered` | Node recovered from suspect state |
## Admin API Details
### GET /v1/admin/ranges
Returns all shard assignments with their key ranges, replicas, and size metrics.
**Response:**
```json
{
  "ranges": [
    {
      "range_id": "shard_0",
      "start_key": "",
      "end_key": "8000000000000000000000000000000000000000000000000000000000000000",
      "size_bytes": 1048576,
      "assertion_count": 1000,
      "leader_node": "abc123",
      "replica_nodes": ["abc123", "def456"],
      "generation": 1
    }
  ],
  "total_ranges": 16
}
```
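As an illustration of consuming this response, the Python sketch below tallies ranges per leader and flags ranges with fewer replicas than expected. The helper name `summarize_ranges` and the `min_replicas` threshold are hypothetical; only the field names are taken from the response above.

```python
import json
from collections import Counter


def summarize_ranges(payload: dict, min_replicas: int = 2) -> dict:
    """Summarize a /v1/admin/ranges body (field names as documented above)."""
    ranges = payload.get("ranges", [])
    # Count how many ranges each node currently leads.
    leaders = Counter(r["leader_node"] for r in ranges)
    # Collect ranges whose replica list is shorter than the assumed minimum.
    under_replicated = [
        r["range_id"] for r in ranges if len(r["replica_nodes"]) < min_replicas
    ]
    return {
        "ranges_per_leader": dict(leaders),
        "under_replicated": under_replicated,
        "total_bytes": sum(r["size_bytes"] for r in ranges),
    }


# Sample body copied from the documented response (only one range is shown there).
sample = json.loads("""
{
  "ranges": [
    {
      "range_id": "shard_0",
      "start_key": "",
      "end_key": "8000000000000000000000000000000000000000000000000000000000000000",
      "size_bytes": 1048576,
      "assertion_count": 1000,
      "leader_node": "abc123",
      "replica_nodes": ["abc123", "def456"],
      "generation": 1
    }
  ],
  "total_ranges": 16
}
""")

summary = summarize_ranges(sample)
print(summary)
```

A skewed `ranges_per_leader` count or a non-empty `under_replicated` list would be a reasonable cue to look at membership state before forcing a sync.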
### POST /v1/admin/sync
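Triggers an immediate anti-entropy sync, bypassing the normal interval timer. The request body carries a `peer_id` field; the example below sends `null`, and the sample response reports all 3 peers notified.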
**Request:**
```json
{
  "peer_id": null
}
```
**Response:**
```json
{
  "triggered": true,
  "peers_notified": 3,
  "message": "Anti-entropy sync triggered for 3 peer(s)"
}
```
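A minimal Python sketch of a client for this endpoint, using only the standard library. The gateway address `http://localhost:8080`, the helper names, and the `Content-Type` header are assumptions for illustration; the path, method, and JSON fields come from the examples above.

```python
import json
import urllib.request


def build_sync_request(base_url: str, peer_id=None) -> urllib.request.Request:
    """Build (but do not send) a POST /v1/admin/sync request."""
    body = json.dumps({"peer_id": peer_id}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/admin/sync",
        data=body,
        method="POST",
        # Assumption: the gateway expects a JSON content type.
        headers={"Content-Type": "application/json"},
    )


def describe_sync_response(raw: bytes) -> str:
    """Turn the documented response body into a one-line summary."""
    reply = json.loads(raw)
    if not reply.get("triggered"):
        return "sync not triggered: " + reply.get("message", "")
    return f"sync triggered for {reply['peers_notified']} peer(s)"


# Hypothetical gateway address; substitute the real one.
req = build_sync_request("http://localhost:8080")

# To actually send the request against a running gateway:
#   raw = urllib.request.urlopen(req, timeout=5).read()
#   print(describe_sync_response(raw))
```

Building the request separately from sending it keeps the sketch checkable without a running cluster.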
## Example Prometheus Queries
### Sync Health
```promql
# Sync lag per peer (should be < 60s normally)
stemedb_sync_lag_seconds
# Sync failure rate over 5 minutes
rate(stemedb_sync_failures_total[5m])
# 95th-percentile convergence time over 5 minutes
histogram_quantile(0.95, rate(stemedb_convergence_latency_seconds_bucket[5m]))
```
### Cluster Health
```promql
# Total cluster size
stemedb_cluster_nodes_total
# Percentage of healthy nodes
stemedb_cluster_nodes_alive / stemedb_cluster_nodes_total * 100
# Membership churn rate
rate(stemedb_membership_events_total[1h])
```
### Replication Throughput
```promql
# Assertions synced per second
rate(stemedb_assertions_synced_total[1m])
# Merkle diff backlog (should trend toward 0)
sum(stemedb_merkle_diff_size)
```
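The same checks can be approximated directly against the `/metrics` text output, without a Prometheus server. The Python sketch below parses a small hand-written sample (the values and peer names are invented for illustration, and the parser handles only simple sample lines, not the full exposition format), then reports peers whose sync lag exceeds 60 seconds and the percentage of alive nodes.

```python
import re

# name, optional {labels}, value, optional timestamp
SAMPLE_LINE = re.compile(
    r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+-?\d+)?$'
)
LABEL_PAIR = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text: str) -> list:
    """Return (name, labels, value) tuples; comments and blank lines are skipped."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_LINE.match(line)
        if match is None:
            continue  # ignore anything this simplified parser does not understand
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_PAIR.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def lagging_peers(samples: list, threshold_seconds: float = 60.0) -> list:
    """Peers whose stemedb_sync_lag_seconds exceeds the threshold."""
    return sorted(
        labels.get("peer", "")
        for name, labels, value in samples
        if name == "stemedb_sync_lag_seconds" and value > threshold_seconds
    )


def percent_alive(samples: list):
    """alive / total * 100, or None when the total is missing or zero."""
    unlabeled = {name: value for name, labels, value in samples if not labels}
    total = unlabeled.get("stemedb_cluster_nodes_total", 0.0)
    if total == 0:
        return None
    return unlabeled.get("stemedb_cluster_nodes_alive", 0.0) / total * 100


# Hypothetical scrape output, not captured from a real cluster.
SAMPLE = """
# TYPE stemedb_sync_lag_seconds gauge
stemedb_sync_lag_seconds{peer="node-2"} 12.5
stemedb_sync_lag_seconds{peer="node-3"} 340
# TYPE stemedb_cluster_nodes_alive gauge
stemedb_cluster_nodes_alive 3
stemedb_cluster_nodes_total 4
"""

samples = parse_metrics(SAMPLE)
print(lagging_peers(samples))   # ['node-3']
print(percent_alive(samples))   # 75.0
```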
## Grafana Dashboard Suggestions
1. **Cluster Overview Panel**
   - Nodes alive/suspect/total gauges
   - Membership event timeline
2. **Sync Health Panel**
   - Sync lag heatmap by peer
   - Convergence latency histogram
   - Sync failure rate alert
3. **Replication Panel**
   - Assertions synced rate
   - Merkle diff backlog trend
   - Sync cycles per peer
## Alerting Rules
```yaml
groups:
  - name: stemedb-sync
    rules:
      - alert: SyncLagHigh
        expr: stemedb_sync_lag_seconds > 300
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Sync lag with peer {{ $labels.peer }} is {{ $value }}s"
      - alert: MerkleDiffBacklog
        expr: stemedb_merkle_diff_size > 10000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Large Merkle diff with peer {{ $labels.peer }}: {{ $value }} assertions"
      - alert: ClusterNodeDown
        expr: stemedb_cluster_nodes_alive < 3
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Cluster has only {{ $value }} alive nodes"
```
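To make the effect of the `for:` clause concrete, here is a simplified Python model of how a single series moves through inactive, pending, and firing for a rule like `SyncLagHigh`. It is a sketch of the general Prometheus behavior, not of StemeDB code; the 60-second evaluation interval and the lag values are invented for illustration.

```python
def alert_states(evaluations, threshold, for_seconds):
    """Model a `expr: metric > threshold` rule with a `for:` duration.

    evaluations: list of (timestamp_seconds, value) pairs in time order,
    one per rule evaluation. Returns the alert state after each evaluation.
    """
    states = []
    active_since = None
    for ts, value in evaluations:
        if value > threshold:
            if active_since is None:
                active_since = ts  # condition became true at this evaluation
            held = ts - active_since
            states.append("firing" if held >= for_seconds else "pending")
        else:
            active_since = None  # any false evaluation resets the timer
            states.append("inactive")
    return states


# SyncLagHigh: lag > 300 held for 5m (300 s), evaluated every 60 s (assumed).
lag = [(0, 120), (60, 310), (120, 320), (180, 330),
       (240, 340), (300, 350), (360, 360), (420, 100)]

states = alert_states(lag, threshold=300, for_seconds=300)
for (ts, value), state in zip(lag, states):
    print(f"t={ts:>3}s lag={value:>3}s -> {state}")
```

In this model the alert only fires at `t=360s`, five minutes after the lag first crossed the threshold, and a single healthy evaluation clears it.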
## User Journey: Incident Response
```
[Grafana alert: SyncLagHigh fires]
-> [SRE opens /v1/admin/cluster to see node status]
-> [Identifies node-3 has state "suspect"]
-> [Checks /v1/admin/ranges to see if node-3 ranges are affected]
-> [Triggers POST /v1/admin/sync to force anti-entropy]
-> [Monitors stemedb_merkle_diff_size dropping toward 0]
-> [Alert auto-resolves once stemedb_sync_lag_seconds is no longer above 300]
```
## Implementation Notes
### Force Sync Mechanism
The admin sync endpoint uses `tokio::sync::Notify` to signal anti-entropy workers:
1. Gateway registers notify handles from each `AntiEntropyWorker`
2. `POST /v1/admin/sync` calls `notify.notify_one()` on all handles
3. Workers wake from `tokio::select!` and run sync immediately
4. Normal interval-based sync continues after force sync completes
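The wake-or-timeout pattern in the steps above can be illustrated with a small Python `asyncio` analog. This is not the Rust implementation: `asyncio.Event` stands in for `tokio::sync::Notify`, `asyncio.wait_for` with a timeout stands in for the `tokio::select!` between the notify handle and the interval timer, and the function names, interval, and round count are hypothetical.

```python
import asyncio


async def anti_entropy_worker(wake: asyncio.Event, interval: float,
                              rounds: int, log: list) -> None:
    """Run `rounds` sync cycles, each started by a wake-up or by the timer."""
    for _ in range(rounds):
        try:
            # Wait for a forced wake-up, but no longer than the normal interval.
            await asyncio.wait_for(wake.wait(), timeout=interval)
            wake.clear()  # consume the signal so the next cycle waits again
            log.append("forced")
        except asyncio.TimeoutError:
            log.append("interval")
        # A real worker would run the sync with its peer here.


async def demo() -> list:
    wake = asyncio.Event()
    log = []
    worker = asyncio.create_task(anti_entropy_worker(wake, 0.2, 2, log))
    await asyncio.sleep(0.01)
    wake.set()  # stands in for the admin handler signalling the worker
    await worker
    return log


result = asyncio.run(demo())
print(result)
```

The first cycle is triggered by the signal and the second by the timer, which mirrors step 4: interval-based sync resumes after the forced run.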
### Metrics Storage
Metrics use the `metrics` crate with `metrics-exporter-prometheus`:
- Counters/gauges are lock-free atomic operations
- Histogram uses DDSketch for memory-efficient percentiles
- Labels are allocated once per unique label combination
- `/metrics` endpoint renders all registered metrics in Prometheus format
## Related Documentation
- [API Documentation](../services/api.md)
- [Phase 6 UAT](phase6-uat.md)
- [Distributed Architecture](../../docs/research/distributed-write-path.md)