stemedb/ai-lookup/features/observability.md

# Phase 8B: Observability

Prometheus metrics and admin endpoints for monitoring StemeDB clusters.

## Overview

StemeDB exposes metrics in Prometheus format and provides admin endpoints for operators to monitor cluster health, diagnose sync issues, and force anti-entropy convergence.

## Endpoints

### Standalone API Server (stemedb-api)

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/metrics` | GET | Prometheus metrics in text format |

### Cluster Gateway (stemedb-cluster)

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/metrics` | GET | Prometheus metrics in text format |
| `/v1/admin/cluster` | GET | Cluster status (alias for `/v1/cluster/status`) |
| `/v1/admin/ranges` | GET | All shard/range assignments |
| `/v1/admin/sync` | POST | Force anti-entropy sync |

## Metrics Reference

### Sync Metrics (stemedb-sync)

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `stemedb_sync_cycles_total` | Counter | `peer` | Total anti-entropy sync cycles completed |
| `stemedb_sync_failures_total` | Counter | `peer` | Total sync failures |
| `stemedb_assertions_synced_total` | Counter | `peer` | Total assertions synced from peers |
| `stemedb_sync_lag_seconds` | Gauge | `peer` | Seconds since last successful sync with peer |
| `stemedb_merkle_diff_size` | Gauge | `peer` | Number of assertions different from peer |
| `stemedb_convergence_latency_seconds` | Histogram | `peer` | Time to converge after detecting divergence |
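
For ad-hoc checks without a Prometheus server, the text output of `/metrics` can be read directly. The following is a minimal sketch, not part of StemeDB: the `parse_metrics` helper, the peer names, and the sample values are all hypothetical, and the parser only handles the simple `name{labels} value` lines of the Prometheus text format.

```python
import re

# Matches sample lines such as: metric_name{label="value"} 12.5
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return {metric_name: {peer_label_or_None: value}} from exposition text."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blank lines and HELP/TYPE comments
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.setdefault(name, {})[labels.get("peer")] = float(value)
    return samples


# Hypothetical scrape output; peer names and values are made up for illustration.
SAMPLE = """\
# TYPE stemedb_sync_lag_seconds gauge
stemedb_sync_lag_seconds{peer="node-2"} 12.5
stemedb_sync_lag_seconds{peer="node-3"} 412
# TYPE stemedb_merkle_diff_size gauge
stemedb_merkle_diff_size{peer="node-3"} 15000
"""

metrics = parse_metrics(SAMPLE)
# Peers whose sync lag exceeds 300 seconds.
lagging = sorted(p for p, lag in metrics["stemedb_sync_lag_seconds"].items() if lag > 300)
print(lagging)
```

Keying on the `peer` label keeps the sketch short; metrics with other labels (such as `type` on membership events) would need a richer key.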

### Membership Metrics (stemedb-cluster)

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `stemedb_membership_events_total` | Counter | `type` | Membership change events |
| `stemedb_cluster_nodes_alive` | Gauge | - | Number of alive nodes |
| `stemedb_cluster_nodes_suspect` | Gauge | - | Number of suspect nodes |
| `stemedb_cluster_nodes_total` | Gauge | - | Total nodes (alive + suspect) |

### Membership Event Types

| Type | Description |
|------|-------------|
| `joined` | Node joined the cluster |
| `suspected` | Node marked as suspect (unresponsive) |
| `failed` | Node marked as dead |
| `left` | Node left gracefully |
| `recovered` | Node recovered from suspect state |

## Admin API Details

### GET /v1/admin/ranges

Returns all shard assignments with their key ranges, replicas, and size metrics.

**Response:**
```json
{
  "ranges": [
    {
      "range_id": "shard_0",
      "start_key": "",
      "end_key": "8000000000000000000000000000000000000000000000000000000000000000",
      "size_bytes": 1048576,
      "assertion_count": 1000,
      "leader_node": "abc123",
      "replica_nodes": ["abc123", "def456"],
      "generation": 1
    }
  ],
  "total_ranges": 16
}
```
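
When a node is suspect, it helps to know which ranges it leads and which it only replicates. The sketch below assumes the response shape shown above; the `ranges_on_node` helper and the second range (`shard_1`) are hypothetical and exist only for illustration.

```python
import json


def ranges_on_node(response, node_id):
    """Split a node's ranges into those it leads and those it only replicates."""
    led, replicated = [], []
    for entry in response["ranges"]:
        if entry["leader_node"] == node_id:
            led.append(entry["range_id"])
        elif node_id in entry["replica_nodes"]:
            replicated.append(entry["range_id"])
    return led, replicated


# Trimmed, hypothetical response body with two ranges.
body = json.loads("""
{
  "ranges": [
    {"range_id": "shard_0", "leader_node": "abc123",
     "replica_nodes": ["abc123", "def456"], "generation": 1},
    {"range_id": "shard_1", "leader_node": "def456",
     "replica_nodes": ["def456", "abc123"], "generation": 1}
  ],
  "total_ranges": 2
}
""")

led, replicated = ranges_on_node(body, "abc123")
print("leads:", led, "replicates:", replicated)
```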

### POST /v1/admin/sync

Triggers an immediate anti-entropy sync with all peers, bypassing the normal interval timer.

**Request:**
```json
{
  "peer_id": null
}
```

**Response:**
```json
{
  "triggered": true,
  "peers_notified": 3,
  "message": "Anti-entropy sync triggered for 3 peer(s)"
}
```
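
A minimal sketch of building this request with Python's standard library. The gateway address `http://localhost:8080` and the `build_sync_request` helper are assumptions, as is the `Content-Type` header; any authentication the gateway requires is not covered here.

```python
import json
import urllib.request


def build_sync_request(base_url, peer_id=None):
    """Build (but do not send) a POST request for the force-sync endpoint."""
    payload = json.dumps({"peer_id": peer_id}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/admin/sync",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical gateway address.
request = build_sync_request("http://localhost:8080/")
print(request.get_method(), request.full_url, request.data.decode("utf-8"))
```

To send it, pass the request to `urllib.request.urlopen(request, timeout=10)` and decode the JSON body; the `triggered` and `peers_notified` fields indicate whether workers were signalled.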

## Example Prometheus Queries

### Sync Health

```promql
# Sync lag per peer (should be < 60s normally)
stemedb_sync_lag_seconds

# Sync failure rate over 5 minutes
rate(stemedb_sync_failures_total[5m])

# 95th-percentile convergence time (requires the metric to be exported with histogram buckets)
histogram_quantile(0.95, rate(stemedb_convergence_latency_seconds_bucket[5m]))
```

### Cluster Health

```promql
# Total cluster size
stemedb_cluster_nodes_total

# Percentage of healthy nodes
stemedb_cluster_nodes_alive / stemedb_cluster_nodes_total * 100

# Membership churn rate
rate(stemedb_membership_events_total[1h])
```

### Replication Throughput

```promql
# Assertions synced per second
rate(stemedb_assertions_synced_total[1m])

# Merkle diff backlog (should trend toward 0)
sum(stemedb_merkle_diff_size)
```

## Grafana Dashboard Suggestions

1. **Cluster Overview Panel**
   - Nodes alive/suspect/total gauges
   - Membership event timeline

2. **Sync Health Panel**
   - Sync lag heatmap by peer
   - Convergence latency histogram
   - Sync failure rate alert

3. **Replication Panel**
   - Assertions synced rate
   - Merkle diff backlog trend
   - Sync cycles per peer

## Alerting Rules

```yaml
groups:
  - name: stemedb-sync
    rules:
      - alert: SyncLagHigh
        expr: stemedb_sync_lag_seconds > 300
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Sync lag with peer {{ $labels.peer }} is {{ $value }}s"

      - alert: MerkleDiffBacklog
        expr: stemedb_merkle_diff_size > 10000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Large Merkle diff with peer {{ $labels.peer }}: {{ $value }} assertions"

      - alert: ClusterNodeDown
        # The threshold of 3 is an example value; set it to match your cluster size
        expr: stemedb_cluster_nodes_alive < 3
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Cluster has only {{ $value }} alive nodes"
```
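
The `for:` clause means a rule fires only after its expression has stayed true across evaluations for the whole duration; a brief spike resets the timer. The sketch below is a simplified model of that behavior, not Prometheus's actual evaluator. The `firing_at` helper and the sample values are hypothetical.

```python
def firing_at(samples, threshold, for_seconds):
    """Return the first timestamp at which the alert would fire, or None.

    `samples` is a list of (timestamp_seconds, value) pairs, one per rule evaluation.
    """
    pending_since = None
    for timestamp, value in samples:
        if value > threshold:
            if pending_since is None:
                pending_since = timestamp  # condition just became true: pending
            if timestamp - pending_since >= for_seconds:
                return timestamp           # held long enough: firing
        else:
            pending_since = None           # condition cleared: reset the timer
    return None


# Hypothetical sync-lag readings evaluated every 60 seconds.
spike_then_sustained = [(0, 100), (60, 350), (120, 400), (180, 250)]
spike_then_sustained += [(t, 320) for t in range(240, 601, 60)]

# SyncLagHigh: expr > 300, for: 5m (300 seconds).
print(firing_at(spike_then_sustained, threshold=300, for_seconds=300))
```

In this data the first excursion above 300 lasts only two evaluations and is discarded; the alert fires only once the second excursion, starting at t=240, has lasted 300 seconds.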

## User Journey: Incident Response

```
[Grafana alert: SyncLagHigh fires]
  -> [SRE opens /v1/admin/cluster to see node status]
  -> [Identifies node-3 has state "suspect"]
  -> [Checks /v1/admin/ranges to see if node-3 ranges are affected]
  -> [Triggers POST /v1/admin/sync to force anti-entropy]
  -> [Monitors stemedb_merkle_diff_size dropping toward 0]
  -> [Alert auto-resolves once stemedb_sync_lag_seconds is no longer above 300]
```

## Implementation Notes

### Force Sync Mechanism

The admin sync endpoint uses `tokio::sync::Notify` to signal anti-entropy workers:

1. Gateway registers notify handles from each `AntiEntropyWorker`
2. `POST /v1/admin/sync` calls `notify.notify_one()` on each registered handle
3. Workers wake from `tokio::select!` and run sync immediately
4. Normal interval-based sync continues after force sync completes
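
The wake-or-timeout pattern in the steps above can be illustrated with a Python analog. This is not the Rust implementation: `asyncio.Event` stands in for `tokio::sync::Notify`, and `asyncio.wait_for` with a timeout stands in for the `tokio::select!` between the notify handle and the interval timer. Worker names, the interval, and the `anti_entropy_worker` function are hypothetical.

```python
import asyncio


async def anti_entropy_worker(name, wake, interval, log, cycles):
    """Run `cycles` sync rounds, each started by a force signal or the interval timer."""
    for _ in range(cycles):
        try:
            # Wait for a force-sync signal, but no longer than the normal interval.
            await asyncio.wait_for(wake.wait(), timeout=interval)
            reason = "forced"
        except asyncio.TimeoutError:
            reason = "interval"
        wake.clear()  # re-arm the handle for the next round
        log.append((name, reason))  # a real worker would run the sync here


async def main():
    log = []
    # The gateway keeps one wake handle per worker (step 1).
    handles = {"node-2": asyncio.Event(), "node-3": asyncio.Event()}
    workers = [
        asyncio.create_task(anti_entropy_worker(name, event, 0.2, log, cycles=2))
        for name, event in handles.items()
    ]
    await asyncio.sleep(0.05)
    # What the admin handler would do on POST /v1/admin/sync (step 2).
    for event in handles.values():
        event.set()
    await asyncio.gather(*workers)
    return log


log = asyncio.run(main())
print(sorted(log))
```

Each worker records one forced round followed by one interval round, matching steps 3 and 4. One difference from `Notify`: an `asyncio.Event` stays set until cleared, whereas `notify_one()` stores at most a single permit.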

### Metrics Storage

Metrics use the `metrics` crate with `metrics-exporter-prometheus`:

- Counters/gauges are lock-free atomic operations
- Histograms use DDSketch for memory-efficient percentiles
- Labels are allocated once per unique label combination
- `/metrics` endpoint renders all registered metrics in Prometheus format

## Related Documentation

- [API Documentation](../services/api.md)
- [Phase 6 UAT](phase6-uat.md)
- [Distributed Architecture](../../docs/research/distributed-write-path.md)