Phase 8B: Observability
Prometheus metrics and admin endpoints for monitoring StemeDB clusters.
Overview
StemeDB exposes metrics in Prometheus format and provides admin endpoints for operators to monitor cluster health, diagnose sync issues, and force anti-entropy convergence.
Endpoints
Standalone API Server (stemedb-api)
| Endpoint | Method | Description |
|---|---|---|
| `/metrics` | GET | Prometheus metrics in text format |
Cluster Gateway (stemedb-cluster)
| Endpoint | Method | Description |
|---|---|---|
| `/metrics` | GET | Prometheus metrics in text format |
| `/v1/admin/cluster` | GET | Cluster status (alias for `/v1/cluster/status`) |
| `/v1/admin/ranges` | GET | All shard/range assignments |
| `/v1/admin/sync` | POST | Force anti-entropy sync |
Metrics Reference
Sync Metrics (stemedb-sync)
| Metric | Type | Labels | Description |
|---|---|---|---|
| `stemedb_sync_cycles_total` | Counter | `peer` | Total anti-entropy sync cycles completed |
| `stemedb_sync_failures_total` | Counter | `peer` | Total sync failures |
| `stemedb_assertions_synced_total` | Counter | `peer` | Total assertions synced from peers |
| `stemedb_sync_lag_seconds` | Gauge | `peer` | Seconds since last successful sync with peer |
| `stemedb_merkle_diff_size` | Gauge | `peer` | Number of assertions different from peer |
| `stemedb_convergence_latency_seconds` | Histogram | `peer` | Time to converge after detecting divergence |
Membership Metrics (stemedb-cluster)
| Metric | Type | Labels | Description |
|---|---|---|---|
| `stemedb_membership_events_total` | Counter | `type` | Membership change events |
| `stemedb_cluster_nodes_alive` | Gauge | - | Number of alive nodes |
| `stemedb_cluster_nodes_suspect` | Gauge | - | Number of suspect nodes |
| `stemedb_cluster_nodes_total` | Gauge | - | Total nodes (alive + suspect) |
Membership Event Types
| Type | Description |
|---|---|
| `joined` | Node joined the cluster |
| `suspected` | Node marked as suspect (unresponsive) |
| `failed` | Node marked as dead |
| `left` | Node left gracefully |
| `recovered` | Node recovered from suspect state |
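The metrics above are served as Prometheus text-format lines, so a quick operator script can read them without a full Prometheus client. The following is a minimal sketch, not part of StemeDB: `parse_metrics`, `lagging_peers`, and the sample payload are hypothetical, and the parser ignores timestamps and does not unescape label values. The 300-second default mirrors the `SyncLagHigh` alert threshold used later in this document.

```python
import re

# One sample line: metric name, optional {labels}, then the value.
_SAMPLE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')

# Hypothetical scrape output; real values will differ.
SAMPLE = """\
# TYPE stemedb_sync_lag_seconds gauge
stemedb_sync_lag_seconds{peer="node-2"} 12.5
stemedb_sync_lag_seconds{peer="node-3"} 340
# TYPE stemedb_cluster_nodes_alive gauge
stemedb_cluster_nodes_alive 3
"""


def parse_metrics(text):
    """Parse Prometheus text format into {(name, labels): value}.

    labels is a tuple of sorted (key, value) pairs. Comment lines
    (# HELP, # TYPE) and blank lines are skipped.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = tuple(sorted(_LABEL.findall(raw_labels or "")))
        samples[(name, labels)] = float(value)
    return samples


def lagging_peers(samples, threshold_seconds=300.0):
    """Return the peers whose sync lag exceeds the threshold, sorted by name."""
    peers = []
    for (name, labels), value in samples.items():
        if name == "stemedb_sync_lag_seconds" and value > threshold_seconds:
            peers.append(dict(labels).get("peer", ""))
    return sorted(peers)
```

With the sample payload, `lagging_peers(parse_metrics(SAMPLE))` reports only `node-3`, the one peer above the threshold.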
Admin API Details
GET /v1/admin/ranges
Returns all shard assignments with their key ranges, replicas, and size metrics.
Response:
{
  "ranges": [
    {
      "range_id": "shard_0",
      "start_key": "",
      "end_key": "8000000000000000000000000000000000000000000000000000000000000000",
      "size_bytes": 1048576,
      "assertion_count": 1000,
      "leader_node": "abc123",
      "replica_nodes": ["abc123", "def456"],
      "generation": 1
    }
  ],
  "total_ranges": 16
}
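During an incident it is often useful to know which ranges a given node leads or replicates. The sketch below works only from the response fields shown above; the helper names `summarize_ranges` and `ranges_on_node` are hypothetical, and the embedded payload repeats the single range from the example rather than a full 16-range response.

```python
import json
from collections import defaultdict

# The example response from above, with only one of the ranges listed.
SAMPLE_RESPONSE = """
{
  "ranges": [
    {
      "range_id": "shard_0",
      "start_key": "",
      "end_key": "8000000000000000000000000000000000000000000000000000000000000000",
      "size_bytes": 1048576,
      "assertion_count": 1000,
      "leader_node": "abc123",
      "replica_nodes": ["abc123", "def456"],
      "generation": 1
    }
  ],
  "total_ranges": 16
}
"""


def summarize_ranges(payload):
    """Aggregate range count, bytes, and assertions per leader node."""
    summary = defaultdict(
        lambda: {"ranges": 0, "size_bytes": 0, "assertion_count": 0}
    )
    for rng in payload["ranges"]:
        entry = summary[rng["leader_node"]]
        entry["ranges"] += 1
        entry["size_bytes"] += rng["size_bytes"]
        entry["assertion_count"] += rng["assertion_count"]
    return dict(summary)


def ranges_on_node(payload, node_id):
    """List the range IDs that have node_id among their replicas."""
    return [
        rng["range_id"]
        for rng in payload["ranges"]
        if node_id in rng["replica_nodes"]
    ]
```

For the sample payload, `ranges_on_node(json.loads(SAMPLE_RESPONSE), "def456")` returns `["shard_0"]`.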
POST /v1/admin/sync
Triggers immediate anti-entropy sync with all peers, bypassing the normal interval timer.
Request:
{
  "peer_id": null
}
Response:
{
  "triggered": true,
  "peers_notified": 3,
  "message": "Anti-entropy sync triggered for 3 peer(s)"
}
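A small client for this endpoint might look like the following sketch, using only the Python standard library. The base URL is a placeholder, and the helpers `build_sync_request` and `force_sync` are hypothetical. It assumes that `"peer_id": null` requests a sync with all peers, as in the example above; whether a non-null `peer_id` targets a single peer, and whether the endpoint requires authentication, is not stated here.

```python
import json
import urllib.request


def build_sync_request(base_url, peer_id=None):
    """Build the POST /v1/admin/sync request without sending it."""
    body = json.dumps({"peer_id": peer_id}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/admin/sync",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def force_sync(base_url, peer_id=None, timeout=10.0):
    """Send the request and return the decoded JSON response."""
    request = build_sync_request(base_url, peer_id)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling `force_sync("http://localhost:8080")` (a placeholder address) would return a dictionary shaped like the response above, so a script can check `result["triggered"]` and `result["peers_notified"]`.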
Example Prometheus Queries
Sync Health
# Sync lag per peer (should be < 60s normally)
stemedb_sync_lag_seconds
# Sync failure rate over 5 minutes
rate(stemedb_sync_failures_total[5m])
# 95th-percentile convergence time
histogram_quantile(0.95, rate(stemedb_convergence_latency_seconds_bucket[5m]))
Cluster Health
# Total cluster size
stemedb_cluster_nodes_total
# Percentage of healthy nodes
stemedb_cluster_nodes_alive / stemedb_cluster_nodes_total * 100
# Membership churn rate
rate(stemedb_membership_events_total[1h])
Replication Throughput
# Assertions synced per second
rate(stemedb_assertions_synced_total[1m])
# Merkle diff backlog (should trend toward 0)
sum(stemedb_merkle_diff_size)
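Several of the queries above rely on `rate()` over a counter. As a rough illustration of what that computes, here is a simplified sketch: it sums the increase between consecutive samples in a window, treats any drop in value as a counter reset, and divides by the elapsed time. Prometheus additionally extrapolates toward the window boundaries, which this sketch omits; `simple_rate` is a hypothetical helper, not a Prometheus API.

```python
def simple_rate(samples):
    """Approximate per-second increase of a counter.

    samples is a list of (timestamp_seconds, counter_value) pairs in
    ascending time order. Returns None if the rate cannot be computed.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    previous = samples[0][1]
    for _, value in samples[1:]:
        # A drop means the counter reset (for example, after a process
        # restart), so the new value is counted from zero.
        increase += value if value < previous else value - previous
        previous = value
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None
```

For example, a `stemedb_assertions_synced_total` counter that reads 10, 16, and 22 at 60-second intervals increased by 12 over 120 seconds, which is 0.1 assertions per second.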
Grafana Dashboard Suggestions
- Cluster Overview Panel
  - Nodes alive/suspect/total gauges
  - Membership event timeline
- Sync Health Panel
  - Sync lag heatmap by peer
  - Convergence latency histogram
  - Sync failure rate alert
- Replication Panel
  - Assertions synced rate
  - Merkle diff backlog trend
  - Sync cycles per peer
Alerting Rules
groups:
  - name: stemedb-sync
    rules:
      - alert: SyncLagHigh
        expr: stemedb_sync_lag_seconds > 300
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Sync lag with peer {{ $labels.peer }} is {{ $value }}s"
      - alert: MerkleDiffBacklog
        expr: stemedb_merkle_diff_size > 10000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Large Merkle diff with peer {{ $labels.peer }}: {{ $value }} assertions"
      - alert: ClusterNodeDown
        expr: stemedb_cluster_nodes_alive < 3
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Cluster has only {{ $value }} alive nodes"
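The `for:` clause in these rules means an alert does not fire the moment its expression becomes true; it stays pending until the condition has held for the whole duration. The sketch below is a simplified model of that behaviour for a "greater than" rule, evaluated once per sample rather than on Prometheus's own evaluation interval. `alert_states` is a hypothetical helper, and the sample values are invented.

```python
def alert_states(samples, threshold, for_seconds):
    """Classify each sample as 'inactive', 'pending', or 'firing'.

    samples is a list of (timestamp_seconds, value) pairs in ascending
    time order. The alert is pending while value > threshold and fires
    once that has held continuously for at least for_seconds.
    """
    states = []
    breach_start = None
    for timestamp, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = timestamp
            held_for = timestamp - breach_start
            states.append("firing" if held_for >= for_seconds else "pending")
        else:
            # Any sample back under the threshold resets the timer.
            breach_start = None
            states.append("inactive")
    return states


# SyncLagHigh: lag above 300s must persist for 5 minutes (300s).
lag_samples = [
    (0, 100), (60, 310), (120, 320), (180, 330),
    (240, 340), (300, 350), (360, 360), (420, 200),
]
states = alert_states(lag_samples, threshold=300, for_seconds=300)
```

Here the lag first exceeds 300s at t=60, so the alert is pending until t=360, fires there, and returns to inactive at t=420 when the lag drops to 200s.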
User Journey: Incident Response
[Grafana alert: SyncLagHigh fires]
-> [SRE opens /v1/admin/cluster to see node status]
-> [Identifies node-3 has state "suspect"]
-> [Checks /v1/admin/ranges to see if node-3 ranges are affected]
-> [Triggers POST /v1/admin/sync to force anti-entropy]
-> [Monitors stemedb_merkle_diff_size dropping toward 0]
-> [Alert auto-resolves when sync_lag < 300s]
Implementation Notes
Force Sync Mechanism
The admin sync endpoint uses tokio::sync::Notify to signal anti-entropy workers:
- Gateway registers notify handles from each `AntiEntropyWorker`
- `POST /v1/admin/sync` calls `notify.notify_one()` on all handles
- Workers wake from `tokio::select!` and run sync immediately
- Normal interval-based sync continues after force sync completes
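The wake-or-timeout pattern in the steps above can be illustrated outside Rust. The following sketch uses Python's `asyncio.Event` as a stand-in for `tokio::sync::Notify`, and `asyncio.wait_for` as a stand-in for `tokio::select!` over the interval timer and the notification. It is an analogy, not StemeDB's implementation: the function names, the 0.05-second interval, and the fixed cycle count are all hypothetical.

```python
import asyncio


async def anti_entropy_worker(wake, interval, run_sync, cycles):
    """Run `cycles` syncs, each after the interval or an earlier wake-up."""
    for _ in range(cycles):
        try:
            # Wait for a forced wake-up, but no longer than the interval.
            await asyncio.wait_for(wake.wait(), timeout=interval)
            reason = "forced"
        except asyncio.TimeoutError:
            reason = "interval"
        # Clear the flag so the next cycle waits again.
        wake.clear()
        await run_sync(reason)


async def demo():
    log = []

    async def run_sync(reason):
        log.append(reason)

    wake = asyncio.Event()
    worker = asyncio.create_task(
        anti_entropy_worker(wake, interval=0.05, run_sync=run_sync, cycles=2)
    )
    # Stand-in for what the admin handler does for each registered worker.
    wake.set()
    await worker
    return log
```

Running `asyncio.run(demo())` yields `["forced", "interval"]`: the first cycle runs immediately because of the wake-up, and the second falls back to the normal interval.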
Metrics Storage
Metrics use the metrics crate with metrics-exporter-prometheus:
- Counters/gauges are lock-free atomic operations
- Histogram uses DDSketch for memory-efficient percentiles
- Labels are allocated once per unique label combination
- `/metrics` endpoint renders all registered metrics in Prometheus format
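To make the "one entry per unique label combination" point concrete, here is a toy registry that keys each counter by its name plus sorted label pairs and renders the result as Prometheus text lines. It is a conceptual sketch only: `CounterRegistry` is hypothetical, it is not how the `metrics` crate is implemented (no atomics, no `# TYPE` lines, no label escaping), and it covers counters only.

```python
class CounterRegistry:
    """Toy counter registry keyed by (name, sorted label pairs)."""

    def __init__(self):
        self._values = {}

    def increment(self, name, amount=1, **labels):
        # Each unique label combination gets exactly one entry, created
        # on first use and reused on every later increment.
        key = (name, tuple(sorted(labels.items())))
        self._values[key] = self._values.get(key, 0) + amount

    def render(self):
        """Render all counters as Prometheus text-format sample lines."""
        lines = []
        for (name, labels), value in sorted(self._values.items()):
            if labels:
                rendered = ",".join(f'{k}="{v}"' for k, v in labels)
                lines.append(f"{name}{{{rendered}}} {value}")
            else:
                lines.append(f"{name} {value}")
        return "\n".join(lines) + "\n"


registry = CounterRegistry()
registry.increment("stemedb_sync_cycles_total", peer="node-2")
registry.increment("stemedb_sync_cycles_total", peer="node-2")
registry.increment("stemedb_sync_cycles_total", peer="node-3")
output = registry.render()
```

The two increments for `node-2` land in the same entry, so `output` contains two lines: one per peer.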