stemedb/docs/operations/monitoring/http-metrics-completion.md
jml 3e7eddc074 feat: add enterprise production readiness infrastructure
2026-02-12 06:08:15 +00:00

HTTP SLI Metrics Completion Guide

Status: Layer 3 (HTTP SLI Metrics) - 5% Complete

Completed:

  • Pattern established in handlers/vote.rs (reference implementation)
  • Helper script created at scripts/add_http_metrics.sh

Remaining: the 27 handlers listed below (22 if the Aphoria feature is disabled) still need the same pattern applied

Reference Pattern (from vote.rs)

pub async fn handler_function(
    State(state): State<AppState>,
    // ... other parameters
) -> Result<(StatusCode, Json<Response>)> {
    // 1. Start timing + increment request counter
    let start = std::time::Instant::now();
    metrics::counter!("stemedb_http_requests_total", "method" => "POST", "path" => "/v1/endpoint").increment(1);

    // 2. Handler logic (unchanged)
    // Note: an early `return` or `?` here skips step 4, so that request is
    // counted but its duration is never recorded.
    // ...

    // 3. Capture result
    let result = Ok((StatusCode::OK, Json(response)));

    // 4. Track duration with status (every error is reported as 500)
    let status = match &result {
        Ok((s, _)) => s.as_u16(),
        Err(_) => 500,
    };
    // Label values must be owned or 'static, so pass the String itself
    // rather than a borrowed `.as_str()` of a temporary.
    metrics::histogram!("stemedb_http_request_duration_seconds",
        "method" => "POST",
        "path" => "/v1/endpoint",
        "status" => status.to_string()
    ).record(start.elapsed().as_secs_f64());

    result
}
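
The pattern above records the duration only when control reaches step 4, so a handler that returns early is counted but never timed. One possible way around this is a guard that records when it is dropped. The sketch below uses only the standard library. `RequestTimer`, `set_status`, and `demo_handler` are hypothetical names that do not come from vote.rs, and the closure stands in for the `metrics::histogram!` call.

```rust
use std::time::{Duration, Instant};

/// Hypothetical guard: calls `record` exactly once, with the last status set
/// and the elapsed time, when it goes out of scope (early returns included).
pub struct RequestTimer<F: FnMut(u16, Duration)> {
    start: Instant,
    status: u16,
    record: F,
}

impl<F: FnMut(u16, Duration)> RequestTimer<F> {
    /// Starts timing. The status defaults to 500 so that a handler which
    /// exits without calling `set_status` is recorded as a server error.
    pub fn start(record: F) -> Self {
        Self { start: Instant::now(), status: 500, record }
    }

    /// Overrides the status that will be reported on drop.
    pub fn set_status(&mut self, status: u16) {
        self.status = status;
    }
}

impl<F: FnMut(u16, Duration)> Drop for RequestTimer<F> {
    fn drop(&mut self) {
        (self.record)(self.status, self.start.elapsed());
    }
}

/// Stand-in for a handler body: fails early on empty input, otherwise
/// succeeds with 200. `seen` collects the statuses the guard reports.
pub fn demo_handler(input: &str, seen: &mut Vec<u16>) -> Result<u16, String> {
    let mut timer = RequestTimer::start(|status, _elapsed| seen.push(status));
    if input.is_empty() {
        // Early exit: the guard still fires and reports the default 500.
        return Err("empty input".to_string());
    }
    timer.set_status(200);
    Ok(200)
}
```

In a real handler the closure would presumably forward to the histogram with the same `method`, `path`, and `status` labels as the reference pattern. Whether that fits the existing handlers is an assumption to check against vote.rs.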

Handlers Requiring Metrics

Write Endpoints

  • handlers/supersession.rs::supersede (POST /v1/supersede)
  • handlers/epoch.rs::create_epoch (POST /v1/epoch)
  • handlers/source.rs::store_source (POST /v1/source)

Admin Endpoints

  • handlers/admin.rs::decay_trust_ranks (POST /v1/admin/decay_trust_ranks)
  • handlers/escalation.rs::resolve_escalation (POST /v1/admin/escalation/resolve)
  • handlers/gold_standard.rs::create_gold_standard (POST /v1/gold_standard)
  • handlers/gold_standard.rs::remove_gold_standard (DELETE /v1/gold_standard)
  • handlers/gold_standard.rs::verify_agent (POST /v1/gold_standard/verify)
  • handlers/quarantine.rs::approve_quarantine (POST /v1/admin/quarantine/approve)
  • handlers/quarantine.rs::reject_quarantine (POST /v1/admin/quarantine/reject)
  • handlers/circuit_breaker.rs::reset_circuit (POST /v1/admin/circuit_breaker/reset)
  • handlers/api_keys.rs::create_api_key (POST /v1/admin/api_keys)
  • handlers/api_keys.rs::revoke_api_key (DELETE /v1/admin/api_keys)
  • handlers/api_keys.rs::rotate_api_key (POST /v1/admin/api_keys/rotate)
  • handlers/api_keys.rs::update_api_key (PATCH /v1/admin/api_keys)

Read Endpoints

  • handlers/audit.rs::list_audits (GET /v1/audit)
  • handlers/audit.rs::get_audit (GET /v1/audit/{id})
  • handlers/source.rs::get_provenance (GET /v1/source/provenance)
  • handlers/concepts.rs::resolve_alias (GET /v1/concepts/alias)
  • handlers/concepts.rs::list_aliases (GET /v1/concepts/aliases)
  • handlers/concepts.rs::suggest_aliases (GET /v1/concepts/suggest)
  • handlers/concepts.rs::parse_concept_path (GET /v1/concepts/parse)

Aphoria Endpoints (if feature enabled)

  • handlers/aphoria/policy.rs::bless (POST /v1/aphoria/policy/bless)
  • handlers/aphoria/policy.rs::export_policy (GET /v1/aphoria/policy/export)
  • handlers/aphoria/policy.rs::import_policy (POST /v1/aphoria/policy/import)
  • handlers/aphoria/scan.rs::scan (POST /v1/aphoria/scan)
  • handlers/aphoria/report.rs::push_observations (POST /v1/aphoria/report)

Completion Steps

  1. For each handler:

    • Add let start = std::time::Instant::now(); at function start
    • Add metrics::counter! increment after timing starts
    • Wrap the return value in a variable (let result = Ok(...))
    • Add status extraction and histogram recording before returning
    • Return result
  2. Verification:

    # After making changes
    cargo build --workspace
    cargo run --bin stemedb-api &
    
    # Trigger endpoint
    curl -X POST http://localhost:18180/v1/vote -H 'Content-Type: application/json' -d '...'
    
    # Check metrics
    curl http://localhost:18180/metrics | grep stemedb_http_request_duration_seconds
    curl http://localhost:18180/metrics | grep stemedb_http_requests_total
    
  3. Estimated time: ~2-3 hours for all 27 handlers listed above
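
The status extraction step maps every error to 500, which would also report client errors such as validation failures as server errors. If the handler error type can report its own status code, the label can be taken from it instead. The sketch below assumes a hypothetical `ApiError` enum with a `status_code` method, because the real StemeDB error type is not shown in this guide. It uses a plain `u16` in place of `StatusCode` so that it stays self-contained.

```rust
/// Hypothetical error type, used only for illustration.
#[derive(Debug)]
pub enum ApiError {
    BadRequest(String),
    NotFound,
    Internal(String),
}

impl ApiError {
    /// Status code this error would be rendered with (assumed mapping).
    pub fn status_code(&self) -> u16 {
        match self {
            ApiError::BadRequest(_) => 400,
            ApiError::NotFound => 404,
            ApiError::Internal(_) => 500,
        }
    }
}

/// Returns an owned value for the `status` label, taken from whichever
/// branch of the handler result is present.
pub fn status_label<T>(result: &Result<(u16, T), ApiError>) -> String {
    match result {
        Ok((status, _)) => status.to_string(),
        Err(err) => err.status_code().to_string(),
    }
}
```

Returning an owned `String` means the value can be passed directly as a label value without borrowing from a temporary.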

Metrics Added

Once complete, these metrics will be available:

  • stemedb_http_requests_total{method,path} (counter) - Total request count per endpoint
  • stemedb_http_request_duration_seconds{method,path,status} (histogram) - Request latency distribution

Next Steps After Completion

After Layer 3 is complete:

  1. Verify all metrics appear in /metrics endpoint
  2. Create Grafana dashboards (Layer 5)
  3. Configure Prometheus alerts (Layer 6)
  4. Set up PagerDuty/Slack integration (Layer 7)
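
Checking step 1 by hand with grep gets tedious across 27 endpoints. A rough sketch of an automated check follows, assuming the `/metrics` body has already been fetched as a string in the Prometheus text exposition format. `has_series` and `missing_endpoints` are hypothetical helpers, not existing StemeDB code. The label matching is a plain substring test that ignores escaping, which should be adequate for the static method and path labels used here.

```rust
/// Returns true if `body` contains at least one sample of `metric` whose
/// label set includes every `(key, value)` pair in `labels`.
pub fn has_series(body: &str, metric: &str, labels: &[(&str, &str)]) -> bool {
    body.lines()
        .map(str::trim)
        // Skip blank lines and `# HELP` / `# TYPE` comment lines.
        .filter(|line| !line.is_empty() && !line.starts_with('#'))
        .any(|line| {
            // The metric name ends at the first '{' (labelled) or ' ' (unlabelled).
            let name_end = line
                .find(|c: char| c == '{' || c == ' ')
                .unwrap_or(line.len());
            if &line[..name_end] != metric {
                return false;
            }
            // Simplification: substring match on `key="value"`.
            labels
                .iter()
                .all(|(key, value)| line.contains(&format!("{key}=\"{value}\"")))
        })
}

/// Returns the `(method, path)` pairs from `expected` that have no
/// `stemedb_http_requests_total` sample in `body`.
pub fn missing_endpoints<'a>(
    body: &str,
    expected: &[(&'a str, &'a str)],
) -> Vec<(&'a str, &'a str)> {
    expected
        .iter()
        .copied()
        .filter(|&(method, path)| {
            !has_series(
                body,
                "stemedb_http_requests_total",
                &[("method", method), ("path", path)],
            )
        })
        .collect()
}
```

An endpoint appears in the output only after it has served at least one request, so each endpoint has to be triggered once before the check is run. For the duration metric, the `_count` series (`stemedb_http_request_duration_seconds_count`) is the one to look for, since the base name itself does not appear as a sample.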