# Phase 8B: Observability

Prometheus metrics and admin endpoints for monitoring StemeDB clusters.

## Overview

StemeDB exposes metrics in Prometheus format and provides admin endpoints for operators to monitor cluster health, diagnose sync issues, and force anti-entropy convergence.

## Endpoints

### Standalone API Server (stemedb-api)

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/metrics` | GET | Prometheus metrics in text format |

### Cluster Gateway (stemedb-cluster)

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/metrics` | GET | Prometheus metrics in text format |
| `/v1/admin/cluster` | GET | Cluster status (alias for `/v1/cluster/status`) |
| `/v1/admin/ranges` | GET | All shard/range assignments |
| `/v1/admin/sync` | POST | Force anti-entropy sync |

## Metrics Reference

### Sync Metrics (stemedb-sync)

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `stemedb_sync_cycles_total` | Counter | `peer` | Total anti-entropy sync cycles completed |
| `stemedb_sync_failures_total` | Counter | `peer` | Total sync failures |
| `stemedb_assertions_synced_total` | Counter | `peer` | Total assertions synced from peers |
| `stemedb_sync_lag_seconds` | Gauge | `peer` | Seconds since last successful sync with peer |
| `stemedb_merkle_diff_size` | Gauge | `peer` | Number of assertions different from peer |
| `stemedb_convergence_latency_seconds` | Histogram | `peer` | Time to converge after detecting divergence |

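For quick checks without a Prometheus server, the per-peer gauges above can be read straight from a `/metrics` payload. The following is a minimal sketch in Python, assuming the standard Prometheus text exposition format; the helper name `gauge_by_peer` and the sample payload (peer names and values) are hypothetical and only illustrate the shape of the data.

```python
import re
from typing import Dict

# Matches sample lines such as: stemedb_sync_lag_seconds{peer="node-2"} 12.5
_SAMPLE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
# Matches one label pair inside the braces, e.g. peer="node-2"
_LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def gauge_by_peer(metrics_text: str, metric: str) -> Dict[str, float]:
    """Return {peer: value} for every sample of `metric` in a /metrics payload."""
    result: Dict[str, float] = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE comments
            continue
        match = _SAMPLE.match(line)
        if match is None or match.group("name") != metric:
            continue
        labels = dict(_LABEL.findall(match.group("labels") or ""))
        result[labels.get("peer", "")] = float(match.group("value"))
    return result


# Hypothetical payload; real output will contain many more series.
SAMPLE_PAYLOAD = """\
# TYPE stemedb_sync_lag_seconds gauge
stemedb_sync_lag_seconds{peer="node-2"} 12.5
stemedb_sync_lag_seconds{peer="node-3"} 412
# TYPE stemedb_merkle_diff_size gauge
stemedb_merkle_diff_size{peer="node-3"} 18000
"""

lag = gauge_by_peer(SAMPLE_PAYLOAD, "stemedb_sync_lag_seconds")
lagging = {peer: value for peer, value in lag.items() if value > 300}
print(lagging)  # {'node-3': 412.0}
```

The parser handles only the labels and values needed here; for anything beyond ad hoc checks, a Prometheus client library's parser is the safer choice.
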
### Membership Metrics (stemedb-cluster)

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `stemedb_membership_events_total` | Counter | `type` | Membership change events |
| `stemedb_cluster_nodes_alive` | Gauge | - | Number of alive nodes |
| `stemedb_cluster_nodes_suspect` | Gauge | - | Number of suspect nodes |
| `stemedb_cluster_nodes_total` | Gauge | - | Total nodes (alive + suspect) |

### Membership Event Types

| Type | Description |
|------|-------------|
| `joined` | Node joined the cluster |
| `suspected` | Node marked as suspect (unresponsive) |
| `failed` | Node marked as dead |
| `left` | Node left gracefully |
| `recovered` | Node recovered from suspect state |

## Admin API Details

### GET /v1/admin/ranges

Returns all shard assignments with their key ranges, replicas, and size metrics.

**Response:**

```json
{
  "ranges": [
    {
      "range_id": "shard_0",
      "start_key": "",
      "end_key": "8000000000000000000000000000000000000000000000000000000000000000",
      "size_bytes": 1048576,
      "assertion_count": 1000,
      "leader_node": "abc123",
      "replica_nodes": ["abc123", "def456"],
      "generation": 1
    }
  ],
  "total_ranges": 16
}
```

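When triaging a suspect node, it helps to know which ranges it holds. The sketch below decodes a response of the shape shown above and lists the ranges replicated on a given node. The `RangeInfo` class and both helper functions are hypothetical names, and the sketch assumes each entry carries exactly the fields in the example; extra fields would need to be filtered out before constructing the dataclass.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class RangeInfo:
    """One entry of the `ranges` array, mirroring the example response."""
    range_id: str
    start_key: str
    end_key: str
    size_bytes: int
    assertion_count: int
    leader_node: str
    replica_nodes: List[str]
    generation: int


def parse_ranges(body: str) -> List[RangeInfo]:
    """Decode a /v1/admin/ranges response body into RangeInfo records."""
    payload = json.loads(body)
    return [RangeInfo(**entry) for entry in payload["ranges"]]


def ranges_on_node(ranges: List[RangeInfo], node_id: str) -> List[str]:
    """IDs of the ranges that list `node_id` among their replicas."""
    return [r.range_id for r in ranges if node_id in r.replica_nodes]


# The example response from this document, used as test input.
EXAMPLE_BODY = """
{
  "ranges": [
    {
      "range_id": "shard_0",
      "start_key": "",
      "end_key": "8000000000000000000000000000000000000000000000000000000000000000",
      "size_bytes": 1048576,
      "assertion_count": 1000,
      "leader_node": "abc123",
      "replica_nodes": ["abc123", "def456"],
      "generation": 1
    }
  ],
  "total_ranges": 16
}
"""

ranges = parse_ranges(EXAMPLE_BODY)
print(ranges_on_node(ranges, "def456"))  # ['shard_0']
```
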
### POST /v1/admin/sync

Triggers immediate anti-entropy sync with all peers, bypassing the normal interval timer.

**Request:**

```json
{
  "peer_id": null
}
```

**Response:**

```json
{
  "triggered": true,
  "peers_notified": 3,
  "message": "Anti-entropy sync triggered for 3 peer(s)"
}
```

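A force sync can be scripted with nothing but the standard library. The sketch below builds the request documented above and, optionally, sends it. The function names and the gateway address are hypothetical, and the sketch assumes the endpoint needs no authentication. This document only shows `peer_id` set to `null`, so the code passes through whatever value it is given without assuming what a non-null value does.

```python
import json
import urllib.request
from typing import Optional


def build_force_sync_request(
    base_url: str, peer_id: Optional[str] = None
) -> urllib.request.Request:
    """Build the POST /v1/admin/sync request; peer_id=None serializes to null."""
    body = json.dumps({"peer_id": peer_id}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/admin/sync",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def force_sync(
    base_url: str, peer_id: Optional[str] = None, timeout: float = 10.0
) -> dict:
    """Send the request and return the decoded JSON response."""
    request = build_force_sync_request(base_url, peer_id)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Hypothetical gateway address; nothing is sent until force_sync() is called.
request = build_force_sync_request("http://gateway.example:8080/")
print(request.get_method(), request.full_url)
```

Calling `force_sync("http://gateway.example:8080")` against a live gateway should return a dictionary with the `triggered`, `peers_notified`, and `message` fields shown in the response example.
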
## Example Prometheus Queries

### Sync Health

```promql
# Sync lag per peer (should be < 60s normally)
stemedb_sync_lag_seconds

# Sync failure rate over 5 minutes
rate(stemedb_sync_failures_total[5m])

# 95th-percentile convergence time over 5 minutes
histogram_quantile(0.95, rate(stemedb_convergence_latency_seconds_bucket[5m]))
```

### Cluster Health

```promql
# Total cluster size
stemedb_cluster_nodes_total

# Percentage of healthy nodes
stemedb_cluster_nodes_alive / stemedb_cluster_nodes_total * 100

# Membership churn rate
rate(stemedb_membership_events_total[1h])
```

### Replication Throughput

```promql
# Assertions synced per second
rate(stemedb_assertions_synced_total[1m])

# Merkle diff backlog (should trend toward 0)
sum(stemedb_merkle_diff_size)
```

## Grafana Dashboard Suggestions

1. **Cluster Overview Panel**
   - Nodes alive/suspect/total gauges
   - Membership event timeline

2. **Sync Health Panel**
   - Sync lag heatmap by peer
   - Convergence latency histogram
   - Sync failure rate alert

3. **Replication Panel**
   - Assertions synced rate
   - Merkle diff backlog trend
   - Sync cycles per peer

## Alerting Rules

```yaml
groups:
  - name: stemedb-sync
    rules:
      - alert: SyncLagHigh
        expr: stemedb_sync_lag_seconds > 300
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Sync lag with peer {{ $labels.peer }} is {{ $value }}s"

      - alert: MerkleDiffBacklog
        expr: stemedb_merkle_diff_size > 10000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Large Merkle diff with peer {{ $labels.peer }}: {{ $value }} assertions"

      - alert: ClusterNodeDown
        expr: stemedb_cluster_nodes_alive < 3
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Cluster has only {{ $value }} alive nodes"
```

## User Journey: Incident Response

```
[Grafana alert: SyncLagHigh fires]
-> [SRE opens /v1/admin/cluster to see node status]
-> [Identifies node-3 has state "suspect"]
-> [Checks /v1/admin/ranges to see if node-3 ranges are affected]
-> [Triggers POST /v1/admin/sync to force anti-entropy]
-> [Monitors stemedb_merkle_diff_size dropping toward 0]
-> [Alert auto-resolves when sync_lag < 300s]
```

## Implementation Notes

### Force Sync Mechanism

The admin sync endpoint uses `tokio::sync::Notify` to signal anti-entropy workers:

1. Gateway registers notify handles from each `AntiEntropyWorker`
2. `POST /v1/admin/sync` calls `notify.notify_one()` on all handles
3. Workers wake from `tokio::select!` and run sync immediately
4. Normal interval-based sync continues after force sync completes

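The wake-up pattern in the steps above can be sketched outside Rust as well. The following is an illustrative analogue in Python's `asyncio`, not the StemeDB implementation: `asyncio.Event` stands in for `tokio::sync::Notify`, and `asyncio.wait_for` with a timeout stands in for the `tokio::select!` race between the signal and the interval timer. The worker function, peer names, and the 0.2-second interval are all hypothetical.

```python
import asyncio
from typing import List


async def anti_entropy_worker(
    peer: str, wake: asyncio.Event, interval: float, log: List[str], cycles: int
) -> None:
    """Run `cycles` sync rounds, each started by the timer or a force-sync signal."""
    for _ in range(cycles):
        try:
            # Race the force-sync signal against the interval timer.
            await asyncio.wait_for(wake.wait(), timeout=interval)
            wake.clear()  # consume the signal so the next round waits again
            reason = "forced"
        except asyncio.TimeoutError:
            reason = "interval"
        log.append(f"{peer}:{reason}")  # a real worker would run the sync here


async def main() -> List[str]:
    log: List[str] = []
    # Step 1: the gateway keeps one handle per worker.
    handles = {peer: asyncio.Event() for peer in ("node-2", "node-3")}
    workers = [
        asyncio.create_task(anti_entropy_worker(peer, wake, 0.2, log, cycles=2))
        for peer, wake in handles.items()
    ]
    await asyncio.sleep(0.01)  # let the workers start waiting
    # Step 2: the admin endpoint signals every handle.
    for wake in handles.values():
        wake.set()
    # Steps 3 and 4: workers sync immediately, then fall back to the interval.
    await asyncio.gather(*workers)
    return log


log = asyncio.run(main())
print(log)
```

Each worker logs one forced round followed by one interval round. Like a stored `Notify` permit, a set `Event` is remembered if the worker is busy when the signal arrives, so the force sync is not lost.
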
### Metrics Storage

Metrics use the `metrics` crate with `metrics-exporter-prometheus`:

- Counters/gauges are lock-free atomic operations
- Histogram uses DDSketch for memory-efficient percentiles
- Labels are allocated once per unique label combination
- `/metrics` endpoint renders all registered metrics in Prometheus format

## Related Documentation

- [API Documentation](../services/api.md)
- [Phase 6 UAT](phase6-uat.md)
- [Distributed Architecture](../../docs/research/distributed-write-path.md)