Phase 8B: Observability

Prometheus metrics and admin endpoints for monitoring StemeDB clusters.

Overview

StemeDB exposes metrics in Prometheus format and provides admin endpoints for operators to monitor cluster health, diagnose sync issues, and force anti-entropy convergence.

Endpoints

Standalone API Server (stemedb-api)

| Endpoint | Method | Description |
|----------|--------|-------------|
| /metrics | GET | Prometheus metrics in text format |

Cluster Gateway (stemedb-cluster)

| Endpoint | Method | Description |
|----------|--------|-------------|
| /metrics | GET | Prometheus metrics in text format |
| /v1/admin/cluster | GET | Cluster status (alias for /v1/cluster/status) |
| /v1/admin/ranges | GET | All shard/range assignments |
| /v1/admin/sync | POST | Force anti-entropy sync |

Metrics Reference

Sync Metrics (stemedb-sync)

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| stemedb_sync_cycles_total | Counter | peer | Total anti-entropy sync cycles completed |
| stemedb_sync_failures_total | Counter | peer | Total sync failures |
| stemedb_assertions_synced_total | Counter | peer | Total assertions synced from peers |
| stemedb_sync_lag_seconds | Gauge | peer | Seconds since last successful sync with peer |
| stemedb_merkle_diff_size | Gauge | peer | Number of assertions different from peer |
| stemedb_convergence_latency_seconds | Histogram | peer | Time to converge after detecting divergence |

Membership Metrics (stemedb-cluster)

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| stemedb_membership_events_total | Counter | type | Membership change events |
| stemedb_cluster_nodes_alive | Gauge | - | Number of alive nodes |
| stemedb_cluster_nodes_suspect | Gauge | - | Number of suspect nodes |
| stemedb_cluster_nodes_total | Gauge | - | Total nodes (alive + suspect) |

Membership Event Types

| Type | Description |
|------|-------------|
| joined | Node joined the cluster |
| suspected | Node marked as suspect (unresponsive) |
| failed | Node marked as dead |
| left | Node left gracefully |
| recovered | Node recovered from suspect state |
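
The event types above are the values of the type label on stemedb_membership_events_total. As a minimal sketch of how a client or test harness might model them, the enum below maps each variant to its label string and back. The names MembershipEvent, as_label, and from_label are hypothetical and are not taken from the StemeDB codebase.

```rust
/// Hypothetical model of the `type` label values listed in the table above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum MembershipEvent {
    Joined,
    Suspected,
    Failed,
    Left,
    Recovered,
}

impl MembershipEvent {
    /// Returns the label value as it would appear in the metric series.
    fn as_label(self) -> &'static str {
        match self {
            MembershipEvent::Joined => "joined",
            MembershipEvent::Suspected => "suspected",
            MembershipEvent::Failed => "failed",
            MembershipEvent::Left => "left",
            MembershipEvent::Recovered => "recovered",
        }
    }

    /// Parses a label value; unknown strings yield `None`.
    fn from_label(label: &str) -> Option<Self> {
        match label {
            "joined" => Some(MembershipEvent::Joined),
            "suspected" => Some(MembershipEvent::Suspected),
            "failed" => Some(MembershipEvent::Failed),
            "left" => Some(MembershipEvent::Left),
            "recovered" => Some(MembershipEvent::Recovered),
            _ => None,
        }
    }
}

fn main() {
    let event = MembershipEvent::Suspected;
    // Build the series selector a dashboard query might use for this event.
    println!(
        "stemedb_membership_events_total{{type=\"{}\"}}",
        event.as_label()
    );
}
```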

Admin API Details

GET /v1/admin/ranges

Returns all shard assignments with their key ranges, replicas, and size metrics.

Response:

{
  "ranges": [
    {
      "range_id": "shard_0",
      "start_key": "",
      "end_key": "8000000000000000000000000000000000000000000000000000000000000000",
      "size_bytes": 1048576,
      "assertion_count": 1000,
      "leader_node": "abc123",
      "replica_nodes": ["abc123", "def456"],
      "generation": 1
    }
  ],
  "total_ranges": 16
}
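
During an incident it is often useful to know which ranges a particular node leads or replicates. The sketch below assumes the response has already been decoded into plain structs (the field names follow the JSON above; the JSON decoding step is omitted) and splits the ranges touching a node into "led" and "replicated only". RangeInfo, Impact, and ranges_on_node are hypothetical names, and the second range in the sample data is invented for illustration.

```rust
/// Subset of the fields returned by GET /v1/admin/ranges.
#[derive(Debug, Clone)]
struct RangeInfo {
    range_id: String,
    leader_node: String,
    replica_nodes: Vec<String>,
}

/// Ranges affected by one node, split by the node's role.
#[derive(Debug, PartialEq)]
struct Impact {
    led: Vec<String>,
    replicated_only: Vec<String>,
}

/// Classifies every range that involves `node` as leader or as replica only.
fn ranges_on_node(ranges: &[RangeInfo], node: &str) -> Impact {
    let mut impact = Impact { led: Vec::new(), replicated_only: Vec::new() };
    for r in ranges {
        if r.leader_node == node {
            impact.led.push(r.range_id.clone());
        } else if r.replica_nodes.iter().any(|n| n == node) {
            impact.replicated_only.push(r.range_id.clone());
        }
    }
    impact
}

/// Sample data: shard_0 mirrors the response above, shard_1 is made up.
fn sample_ranges() -> Vec<RangeInfo> {
    vec![
        RangeInfo {
            range_id: "shard_0".to_string(),
            leader_node: "abc123".to_string(),
            replica_nodes: vec!["abc123".to_string(), "def456".to_string()],
        },
        RangeInfo {
            range_id: "shard_1".to_string(),
            leader_node: "def456".to_string(),
            replica_nodes: vec!["def456".to_string(), "abc123".to_string()],
        },
    ]
}

fn main() {
    let impact = ranges_on_node(&sample_ranges(), "abc123");
    println!("led: {:?}", impact.led);
    println!("replicated only: {:?}", impact.replicated_only);
}
```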

POST /v1/admin/sync

Triggers an immediate anti-entropy sync with all peers, bypassing the normal interval timer.

Request:

{
  "peer_id": null
}

Response:

{
  "triggered": true,
  "peers_notified": 3,
  "message": "Anti-entropy sync triggered for 3 peer(s)"
}

Example Prometheus Queries

Sync Health

# Sync lag per peer (should be < 60s normally)
stemedb_sync_lag_seconds

# Sync failure rate over 5 minutes
rate(stemedb_sync_failures_total[5m])

# 95th-percentile convergence time
histogram_quantile(0.95, rate(stemedb_convergence_latency_seconds_bucket[5m]))

Cluster Health

# Total cluster size
stemedb_cluster_nodes_total

# Percentage of healthy nodes
stemedb_cluster_nodes_alive / stemedb_cluster_nodes_total * 100

# Membership churn rate
rate(stemedb_membership_events_total[1h])

Replication Throughput

# Assertions synced per second
rate(stemedb_assertions_synced_total[1m])

# Merkle diff backlog (should trend toward 0)
sum(stemedb_merkle_diff_size)

Grafana Dashboard Suggestions

  1. Cluster Overview Panel

    • Nodes alive/suspect/total gauges
    • Membership event timeline
  2. Sync Health Panel

    • Sync lag heatmap by peer
    • Convergence latency histogram
    • Sync failure rate alert
  3. Replication Panel

    • Assertions synced rate
    • Merkle diff backlog trend
    • Sync cycles per peer

Alerting Rules

groups:
  - name: stemedb-sync
    rules:
      - alert: SyncLagHigh
        expr: stemedb_sync_lag_seconds > 300
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Sync lag with peer {{ $labels.peer }} is {{ $value }}s"

      - alert: MerkleDiffBacklog
        expr: stemedb_merkle_diff_size > 10000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Large Merkle diff with peer {{ $labels.peer }}: {{ $value }} assertions"

      - alert: ClusterNodeDown
        expr: stemedb_cluster_nodes_alive < 3
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Cluster has only {{ $value }} alive nodes"
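
Each rule above carries a for clause. In Prometheus, a rule whose expression becomes true is first pending and only starts firing once the expression has stayed true for the whole for duration; if it turns false in between, the timer resets. The sketch below models that state machine for the SyncLagHigh rule (threshold 300, for: 5m). AlertState, Status, and sync_lag_high are hypothetical names, and the evaluation timestamps are illustrative.

```rust
/// Lifecycle of one alert series.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Status {
    Inactive,
    Pending,
    Firing,
}

/// Remembers when the rule expression first became true.
#[derive(Debug, Default)]
struct AlertState {
    active_since: Option<u64>,
}

impl AlertState {
    /// Evaluates the rule at `now_secs` and returns the resulting status.
    fn evaluate(&mut self, now_secs: u64, condition: bool, for_secs: u64) -> Status {
        if !condition {
            // Any false evaluation resets the pending timer.
            self.active_since = None;
            return Status::Inactive;
        }
        let since = *self.active_since.get_or_insert(now_secs);
        if now_secs - since >= for_secs {
            Status::Firing
        } else {
            Status::Pending
        }
    }
}

/// Mirrors `expr: stemedb_sync_lag_seconds > 300`.
fn sync_lag_high(lag_secs: f64) -> bool {
    lag_secs > 300.0
}

fn main() {
    let for_secs = 5 * 60; // for: 5m
    let mut state = AlertState::default();
    // (evaluation time in seconds, observed sync lag in seconds)
    for (now, lag) in [(0, 400.0), (120, 450.0), (300, 500.0), (360, 100.0)] {
        let status = state.evaluate(now, sync_lag_high(lag), for_secs);
        println!("t={now}s lag={lag}s -> {status:?}");
    }
}
```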

User Journey: Incident Response

[Grafana alert: SyncLagHigh fires]
  -> [SRE opens /v1/admin/cluster to see node status]
  -> [Identifies node-3 has state "suspect"]
  -> [Checks /v1/admin/ranges to see if node-3 ranges are affected]
  -> [Triggers POST /v1/admin/sync to force anti-entropy]
  -> [Monitors stemedb_merkle_diff_size dropping toward 0]
  -> [Alert auto-resolves once stemedb_sync_lag_seconds falls back to 300s or below]

Implementation Notes

Force Sync Mechanism

The admin sync endpoint uses tokio::sync::Notify to signal anti-entropy workers:

  1. Gateway registers notify handles from each AntiEntropyWorker
  2. POST /v1/admin/sync calls notify.notify_one() on all handles
  3. Workers wake from tokio::select! and run sync immediately
  4. Normal interval-based sync continues after force sync completes
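
The four steps above can be sketched without external crates. This version swaps tokio::sync::Notify and tokio::select! for a std::sync::mpsc channel with recv_timeout, so it is an analogy and not the StemeDB implementation: each worker blocks until either the interval elapses or a wake signal arrives. One behavioural difference to keep in mind is that a channel queues every signal, whereas Notify stores at most one pending permit. The names run_worker, force_sync, SyncStats, and demo are hypothetical.

```rust
use std::sync::mpsc::{self, RecvTimeoutError, Sender};
use std::thread;
use std::time::Duration;

/// Counts how each sync run was started.
#[derive(Debug, Default, PartialEq)]
struct SyncStats {
    forced: u32,
    scheduled: u32,
}

/// Worker loop: wake on a force signal or when the interval expires.
fn run_worker(wake: mpsc::Receiver<()>, interval: Duration) -> SyncStats {
    let mut stats = SyncStats::default();
    loop {
        match wake.recv_timeout(interval) {
            Ok(()) => stats.forced += 1, // force sync requested
            Err(RecvTimeoutError::Timeout) => stats.scheduled += 1, // normal interval
            Err(RecvTimeoutError::Disconnected) => return stats, // gateway shut down
        }
    }
}

/// Gateway side: signal every registered worker, return how many were reached.
fn force_sync(handles: &[Sender<()>]) -> usize {
    handles.iter().filter(|h| h.send(()).is_ok()).count()
}

/// Spawns three workers, forces one sync, then shuts everything down.
fn demo() -> (usize, Vec<SyncStats>) {
    let mut handles = Vec::new();
    let mut workers = Vec::new();
    for _ in 0..3 {
        let (tx, rx) = mpsc::channel();
        handles.push(tx);
        // A long interval keeps scheduled syncs out of this short demo.
        workers.push(thread::spawn(move || run_worker(rx, Duration::from_secs(3600))));
    }
    let notified = force_sync(&handles);
    drop(handles); // closing the channels lets the workers exit
    let stats = workers.into_iter().map(|w| w.join().unwrap()).collect();
    (notified, stats)
}

fn main() {
    let (notified, stats) = demo();
    println!("Anti-entropy sync triggered for {notified} peer(s)");
    println!("{stats:?}");
}
```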

Metrics Storage

Metrics use the metrics crate with metrics-exporter-prometheus:

  • Counter and gauge updates are lock-free atomic operations
  • Histograms use DDSketch for memory-efficient percentiles
  • Labels are allocated once per unique label combination
  • The /metrics endpoint renders all registered metrics in Prometheus format
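
To make the points above concrete, here is a hand-rolled sketch of a labelled counter family. It is not how the metrics crate or metrics-exporter-prometheus is implemented; it only illustrates the same ideas with the standard library: one atomic per label value allocated up front, lock-free increments, and rendering in the Prometheus text format. CounterFamily and its methods are hypothetical names.

```rust
use std::collections::BTreeMap;
use std::sync::atomic::{AtomicU64, Ordering};

/// One counter metric with a single `peer` label.
struct CounterFamily {
    name: &'static str,
    by_peer: BTreeMap<String, AtomicU64>,
}

impl CounterFamily {
    /// Allocates one atomic per peer up front, so increments never allocate.
    fn new(name: &'static str, peers: &[&str]) -> Self {
        let by_peer = peers
            .iter()
            .map(|p| (p.to_string(), AtomicU64::new(0)))
            .collect();
        Self { name, by_peer }
    }

    /// Lock-free increment; unknown peers are ignored in this sketch.
    fn increment(&self, peer: &str) {
        if let Some(counter) = self.by_peer.get(peer) {
            counter.fetch_add(1, Ordering::Relaxed);
        }
    }

    /// Renders the family in Prometheus text exposition format.
    fn render(&self) -> String {
        let mut out = format!("# TYPE {} counter\n", self.name);
        for (peer, value) in &self.by_peer {
            out.push_str(&format!(
                "{}{{peer=\"{}\"}} {}\n",
                self.name,
                peer,
                value.load(Ordering::Relaxed)
            ));
        }
        out
    }
}

fn main() {
    let cycles = CounterFamily::new("stemedb_sync_cycles_total", &["node-1", "node-2"]);
    cycles.increment("node-2");
    cycles.increment("node-2");
    print!("{}", cycles.render());
}
```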