
# HTTP SLI Metrics Completion Guide

## Status: Layer 3 (HTTP SLI Metrics) - 5% Complete

**Completed:**
- ✅ Pattern established in `handlers/vote.rs` (reference implementation)
- ✅ Helper script created at `scripts/add_http_metrics.sh`

**Remaining:** the 27 handlers listed below (22 if the Aphoria feature is disabled) need the same pattern applied

## Reference Pattern (from vote.rs)

```rust
pub async fn handler_function(
    State(state): State<AppState>,
    // ... other parameters
) -> Result<(StatusCode, Json<Response>)> {
    // 1. Start timing + increment request counter
    let start = std::time::Instant::now();
    metrics::counter!("stemedb_http_requests_total", "method" => "POST", "path" => "/v1/endpoint").increment(1);

    // 2. Handler logic (unchanged)
    // ...

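    // 3. Capture result. Note: an early return via `?` in step 2 leaves the
    //    function before steps 3-4 run, so that request is not recorded.
    let result = Ok((StatusCode::OK, Json(response)));

    // 4. Track duration with status (any `Err` is recorded as 500)
    let status = match &result {
        Ok((s, _)) => s.as_u16(),
        Err(_) => 500,
    };
    metrics::histogram!("stemedb_http_request_duration_seconds",
        "method" => "POST",
        "path" => "/v1/endpoint",
        // Pass an owned `String`: a `&str` borrowed from a temporary
        // does not live long enough to be used as a label value.
        "status" => status.to_string()
    ).record(start.elapsed().as_secs_f64());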

    result
}
```
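
In the pattern above, a `?` early return inside the handler logic leaves the function before step 4 runs, so that request never reaches the histogram. One possible way around this, sketched below under stated assumptions, is to run the handler body in a closure and measure around it. The names `observe` and `parse_weight` are hypothetical and do not exist in the codebase, `u16` stands in for `StatusCode`, and the sketch deliberately makes no `metrics` calls so that it compiles with the standard library alone.

```rust
use std::time::Instant;

/// Hypothetical helper: runs `body`, then returns its result together with
/// the status code and elapsed seconds that step 4 would record.
/// `u16` stands in for `StatusCode` so the sketch needs no external crates.
fn observe<T, E>(
    body: impl FnOnce() -> Result<(u16, T), E>,
) -> (Result<(u16, T), E>, u16, f64) {
    let start = Instant::now();
    let result = body();
    let status = match &result {
        Ok((s, _)) => *s,
        Err(_) => 500, // same fallback as the reference pattern
    };
    (result, status, start.elapsed().as_secs_f64())
}

/// Hypothetical stand-in for handler logic that can fail early with `?`.
fn parse_weight(raw: &str) -> Result<(u16, u32), std::num::ParseIntError> {
    let weight: u32 = raw.parse()?; // early return on bad input
    Ok((200, weight))
}
```

Calling `observe(|| parse_weight("not-a-number"))` yields an `Err` result paired with status `500`, because the `?` now returns from the closure rather than from the function that records the metric. In a real handler, the returned status and duration would feed the `metrics::histogram!` call from step 4.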

## Handlers Requiring Metrics

### Write Endpoints
- [ ] `handlers/supersession.rs::supersede` (POST /v1/supersede)
- [ ] `handlers/epoch.rs::create_epoch` (POST /v1/epoch)
- [ ] `handlers/source.rs::store_source` (POST /v1/source)

### Admin Endpoints
- [ ] `handlers/admin.rs::decay_trust_ranks` (POST /v1/admin/decay_trust_ranks)
- [ ] `handlers/escalation.rs::resolve_escalation` (POST /v1/admin/escalation/resolve)
- [ ] `handlers/gold_standard.rs::create_gold_standard` (POST /v1/gold_standard)
- [ ] `handlers/gold_standard.rs::remove_gold_standard` (DELETE /v1/gold_standard)
- [ ] `handlers/gold_standard.rs::verify_agent` (POST /v1/gold_standard/verify)
- [ ] `handlers/quarantine.rs::approve_quarantine` (POST /v1/admin/quarantine/approve)
- [ ] `handlers/quarantine.rs::reject_quarantine` (POST /v1/admin/quarantine/reject)
- [ ] `handlers/circuit_breaker.rs::reset_circuit` (POST /v1/admin/circuit_breaker/reset)
- [ ] `handlers/api_keys.rs::create_api_key` (POST /v1/admin/api_keys)
- [ ] `handlers/api_keys.rs::revoke_api_key` (DELETE /v1/admin/api_keys)
- [ ] `handlers/api_keys.rs::rotate_api_key` (POST /v1/admin/api_keys/rotate)
- [ ] `handlers/api_keys.rs::update_api_key` (PATCH /v1/admin/api_keys)

### Read Endpoints
- [ ] `handlers/audit.rs::list_audits` (GET /v1/audit)
- [ ] `handlers/audit.rs::get_audit` (GET /v1/audit/{id})
- [ ] `handlers/source.rs::get_provenance` (GET /v1/source/provenance)
- [ ] `handlers/concepts.rs::resolve_alias` (GET /v1/concepts/alias)
- [ ] `handlers/concepts.rs::list_aliases` (GET /v1/concepts/aliases)
- [ ] `handlers/concepts.rs::suggest_aliases` (GET /v1/concepts/suggest)
- [ ] `handlers/concepts.rs::parse_concept_path` (GET /v1/concepts/parse)

### Aphoria Endpoints (if feature enabled)
- [ ] `handlers/aphoria/policy.rs::bless` (POST /v1/aphoria/policy/bless)
- [ ] `handlers/aphoria/policy.rs::export_policy` (GET /v1/aphoria/policy/export)
- [ ] `handlers/aphoria/policy.rs::import_policy` (POST /v1/aphoria/policy/import)
- [ ] `handlers/aphoria/scan.rs::scan` (POST /v1/aphoria/scan)
- [ ] `handlers/aphoria/report.rs::push_observations` (POST /v1/aphoria/report)

## Completion Steps

1. **For each handler:**
   - Add `let start = std::time::Instant::now();` at function start
   - Add `metrics::counter!` increment after timing starts
   - Wrap the return value in a variable (`let result = Ok(...)`)
   - Add status extraction and histogram recording before returning
   - Return `result`

2. **Verification:**
   ```bash
   # After making changes
   cargo build --workspace
   cargo run --bin stemedb-api &

   # Trigger endpoint
   curl -X POST http://localhost:18180/v1/vote -d '...'

   # Check metrics
   curl http://localhost:18180/metrics | grep stemedb_http_request_duration_seconds
   curl http://localhost:18180/metrics | grep stemedb_http_requests_total
   ```

3. **Estimated time:** ~2-3 hours for all handlers listed above
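
The `grep` checks in the verification step can also be written as a small check over the scraped text, which makes it easier to see which endpoints are still missing. The sketch below assumes `/metrics` returns the Prometheus text exposition format, where each sample line starts with the metric name followed by its labels. The function names `has_series` and `missing_endpoints` are hypothetical.

```rust
/// Returns true if `body` (text scraped from `/metrics`) has at least one
/// sample line for `metric` carrying the label `path="<path>"`.
/// Prefix matching mirrors the `grep` above, so suffixed series such as
/// `_sum` and `_count` also count.
fn has_series(body: &str, metric: &str, path: &str) -> bool {
    let label = format!("path=\"{}\"", path);
    body.lines()
        .filter(|line| !line.starts_with('#')) // skip HELP/TYPE comments
        .any(|line| line.starts_with(metric) && line.contains(&label))
}

/// Lists the paths from `expected` that have no request counter sample yet.
fn missing_endpoints<'a>(body: &str, expected: &[&'a str]) -> Vec<&'a str> {
    expected
        .iter()
        .copied()
        .filter(|path| !has_series(body, "stemedb_http_requests_total", path))
        .collect()
}
```

A series is likely to appear only after its endpoint has received at least one request, so trigger each endpoint before treating a missing path as a gap in instrumentation.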

## Metrics Added

Once complete, these metrics will be available:

- `stemedb_http_requests_total{method,path}` (counter) - Total request count per endpoint
- `stemedb_http_request_duration_seconds{method,path,status}` (histogram) - Request latency distribution
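
Because the duration histogram carries a `status` label, its per-status sample counts can be turned into an availability-style SLI. The sketch below is one illustrative definition, not one prescribed by this guide: it assumes "good" means any status below 500, and that the `(status, count)` pairs have already been read from the metrics output. The function name `availability` is hypothetical.

```rust
/// Fraction of requests that did not end in a 5xx status.
/// `counts` holds (status, request count) pairs, e.g. taken from the
/// per-status sample counts of the duration histogram.
/// Returns None when there is no traffic, so callers never divide by zero.
fn availability(counts: &[(u16, u64)]) -> Option<f64> {
    let total: u64 = counts.iter().map(|(_, n)| n).sum();
    if total == 0 {
        return None;
    }
    let good: u64 = counts
        .iter()
        .filter(|(status, _)| *status < 500) // assumption: 5xx is "bad"
        .map(|(_, n)| n)
        .sum();
    Some(good as f64 / total as f64)
}
```

For example, three requests with status 200 and one with status 503 give `Some(0.75)`. In practice this ratio would more likely be computed in a Prometheus query than in application code.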

## Next Steps After Completion

After Layer 3 is complete:
1. Verify that all metrics appear in the `/metrics` endpoint output
2. Create Grafana dashboards (Layer 5)
3. Configure Prometheus alerts (Layer 6)
4. Set up PagerDuty/Slack integration (Layer 7)