jordan/stemedb

jml 3e7eddc074 feat: add enterprise production readiness infrastructure

This commit implements comprehensive production hardening across multiple
layers to prepare StemeDB for enterprise pilot deployments:

## API Layer
- Add rate limiting middleware with configurable limits per endpoint
- Enhance error handling with detailed context and proper HTTP status codes
- Add security hardening tests for input validation and boundary conditions
- Create store_helpers module for defensive storage access patterns

## Storage & WAL
- Optimize group commit batching for higher throughput
- Add defensive error handling in hybrid backend with proper fallbacks
- Enhance WAL journal durability guarantees with fsync validation
- Improve index store query performance with better caching

## Operations & Deployment
- Add comprehensive operations documentation (deployment, monitoring, DR)
- Create systemd units for backup, WAL archival, and verification
- Add monitoring configs (Prometheus alerts, metrics exporters)
- Implement backup/restore scripts with verification and S3 archival
- Add DR drill automation and runbook procedures
- Create load balancer configs (nginx, envoy) with health checks

## Documentation
- Update CLAUDE.md with operations and troubleshooting guides
- Expand roadmap with production readiness milestones
- Add pilot success criteria and deployment reference architecture
- Document TLS setup, monitoring integration, and incident response

## Configuration
- Add .env.example with all required environment variables
- Document resource sizing for different deployment scales
- Add configuration examples for various deployment topologies

This positions StemeDB for successful enterprise pilots with proper
operational discipline, monitoring, backup/DR, and security hardening.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

2026-02-12 06:08:15 +00:00


HTTP SLI Metrics Completion Guide

Status: Layer 3 (HTTP SLI Metrics) - 5% Complete

Completed:

✅ Pattern established in handlers/vote.rs (reference implementation)
✅ Helper script created at scripts/add_http_metrics.sh

Remaining: the 27 handlers listed below (22 core, 5 Aphoria) need the same pattern applied

Reference Pattern (from vote.rs)

pub async fn handler_function(
    State(state): State<AppState>,
    // ... other parameters
) -> Result<(StatusCode, Json<Response>)> {
    // 1. Start timing + increment request counter
    let start = std::time::Instant::now();
    metrics::counter!("stemedb_http_requests_total", "method" => "POST", "path" => "/v1/endpoint").increment(1);

    // 2. Handler logic (unchanged)
    // ...

    // 3. Capture result
    let result = Ok((StatusCode::OK, Json(response)));

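    // 4. Track duration with status
    //    (every error is recorded as 500 here, whatever status it is rendered as)
    let status = match &result {
        Ok((s, _)) => s.as_u16(),
        Err(_) => 500,
    };
    // Label values must be owned or 'static, so pass the String itself
    // rather than a &str borrowed from a temporary.
    metrics::histogram!("stemedb_http_request_duration_seconds",
        "method" => "POST",
        "path" => "/v1/endpoint",
        "status" => status.to_string()
    ).record(start.elapsed().as_secs_f64());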

    result
}
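
Step 4 of the pattern labels every error as 500, so client errors (4xx) and server errors (5xx) end up indistinguishable in the status label. The following is a minimal sketch of one way to avoid that. It assumes a hypothetical ApiError enum that knows its own status code (the real StemeDB error type may look different), and uses a plain u16 in place of StatusCode so the snippet has no dependencies.

```rust
/// Hypothetical error type for illustration; not taken from the StemeDB source.
#[derive(Debug)]
pub enum ApiError {
    BadRequest(String),
    NotFound,
    Internal(String),
}

impl ApiError {
    /// HTTP status this error would be rendered as.
    pub fn status(&self) -> u16 {
        match self {
            ApiError::BadRequest(_) => 400,
            ApiError::NotFound => 404,
            ApiError::Internal(_) => 500,
        }
    }
}

/// Derive the `status` label value from a handler result.
/// `u16` stands in for `StatusCode` to keep the sketch dependency-free.
pub fn status_label<T>(result: &Result<(u16, T), ApiError>) -> String {
    let code = match result {
        Ok((status, _)) => *status,
        Err(err) => err.status(),
    };
    // Return an owned String so it can be passed directly as a label value.
    code.to_string()
}
```

With a helper of this shape, the `Err(_) => 500` arm in the pattern would become a call to the error's own status method.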

Handlers Requiring Metrics

Write Endpoints

handlers/supersession.rs::supersede (POST /v1/supersede)
handlers/epoch.rs::create_epoch (POST /v1/epoch)
handlers/source.rs::store_source (POST /v1/source)

Admin Endpoints

handlers/admin.rs::decay_trust_ranks (POST /v1/admin/decay_trust_ranks)
handlers/escalation.rs::resolve_escalation (POST /v1/admin/escalation/resolve)
handlers/gold_standard.rs::create_gold_standard (POST /v1/gold_standard)
handlers/gold_standard.rs::remove_gold_standard (DELETE /v1/gold_standard)
handlers/gold_standard.rs::verify_agent (POST /v1/gold_standard/verify)
handlers/quarantine.rs::approve_quarantine (POST /v1/admin/quarantine/approve)
handlers/quarantine.rs::reject_quarantine (POST /v1/admin/quarantine/reject)
handlers/circuit_breaker.rs::reset_circuit (POST /v1/admin/circuit_breaker/reset)
handlers/api_keys.rs::create_api_key (POST /v1/admin/api_keys)
handlers/api_keys.rs::revoke_api_key (DELETE /v1/admin/api_keys)
handlers/api_keys.rs::rotate_api_key (POST /v1/admin/api_keys/rotate)
handlers/api_keys.rs::update_api_key (PATCH /v1/admin/api_keys)

Read Endpoints

handlers/audit.rs::list_audits (GET /v1/audit)
handlers/audit.rs::get_audit (GET /v1/audit/{id})
handlers/source.rs::get_provenance (GET /v1/source/provenance)
handlers/concepts.rs::resolve_alias (GET /v1/concepts/alias)
handlers/concepts.rs::list_aliases (GET /v1/concepts/aliases)
handlers/concepts.rs::suggest_aliases (GET /v1/concepts/suggest)
handlers/concepts.rs::parse_concept_path (GET /v1/concepts/parse)

Aphoria Endpoints (if feature enabled)

handlers/aphoria/policy.rs::bless (POST /v1/aphoria/policy/bless)
handlers/aphoria/policy.rs::export_policy (GET /v1/aphoria/policy/export)
handlers/aphoria/policy.rs::import_policy (POST /v1/aphoria/policy/import)
handlers/aphoria/scan.rs::scan (POST /v1/aphoria/scan)
handlers/aphoria/report.rs::push_observations (POST /v1/aphoria/report)

Completion Steps

For each handler:
- Add let start = std::time::Instant::now(); at the start of the function
- Increment metrics::counter! immediately after starting the timer
- Bind the return value to a variable (let result = Ok(...)); early exits via ? or return bypass the steps below, so route them through result too
- Extract the status code and record the histogram before returning
- Return result
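
The five steps can also be applied once in a wrapper instead of being copied into each handler. The sketch below is one possible shape, not the project's implementation. Sink, Sample, instrument, and double are hypothetical names, the in-memory Sink stands in for metrics::counter! and metrics::histogram!, u16 stands in for StatusCode, and the code is synchronous whereas the real handlers are async. Because the handler body runs inside a closure, an early return through ? still reaches the recording step.

```rust
use std::cell::RefCell;
use std::time::Instant;

/// One duration observation; stands in for a histogram sample.
#[derive(Debug)]
pub struct Sample {
    pub method: &'static str,
    pub path: &'static str,
    pub status: String,
    pub seconds: f64,
}

/// Hypothetical in-memory stand-in for the metrics recorder.
#[derive(Default)]
pub struct Sink {
    pub requests: RefCell<u64>,
    pub samples: RefCell<Vec<Sample>>,
}

/// Run `body`, counting the request and recording its duration and status.
pub fn instrument<T, E>(
    sink: &Sink,
    method: &'static str,
    path: &'static str,
    body: impl FnOnce() -> Result<(u16, T), E>,
) -> Result<(u16, T), E> {
    // Steps 1 and 2: start timing, then count the request.
    let start = Instant::now();
    *sink.requests.borrow_mut() += 1;

    // Step 3: capture the result instead of returning it directly.
    let result = body();

    // Step 4: extract the status and record the duration.
    let status = match &result {
        Ok((code, _)) => *code,
        Err(_) => 500,
    };
    sink.samples.borrow_mut().push(Sample {
        method,
        path,
        status: status.to_string(),
        seconds: start.elapsed().as_secs_f64(),
    });

    // Step 5: return the captured result.
    result
}

/// Example body with an early return: a failed parse exits through `?`.
pub fn double(input: &str) -> Result<(u16, i64), std::num::ParseIntError> {
    let n: i64 = input.trim().parse()?;
    Ok((200, n * 2))
}
```

Calling instrument(&sink, "POST", "/v1/endpoint", || double("nope")) records one request and one sample with status "500", even though double exits early.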

Verification:

# After making changes
cargo build --workspace
cargo run --bin stemedb-api &

# Trigger endpoint
curl -X POST http://localhost:18180/v1/vote -d '...'

# Check metrics
curl http://localhost:18180/metrics | grep stemedb_http_request_duration_seconds
curl http://localhost:18180/metrics | grep stemedb_http_requests_total
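
The grep checks above only confirm that a metric name appears somewhere in the output. The sketch below is a stricter check, assuming the /metrics endpoint emits the Prometheus text exposition format. has_series is a hypothetical helper, not part of StemeDB, and it deliberately ignores label values that contain a comma.

```rust
/// Return true if `body` contains a sample line for exactly `name` whose
/// label set includes every `(key, value)` pair in `labels`.
/// Simplification: label values containing `,` are not handled.
pub fn has_series(body: &str, name: &str, labels: &[(&str, &str)]) -> bool {
    body.lines()
        .map(str::trim)
        // Skip blank lines and `# HELP` / `# TYPE` comments.
        .filter(|line| !line.is_empty() && !line.starts_with('#'))
        .any(|line| {
            let Some(rest) = line.strip_prefix(name) else {
                return false;
            };
            // Require `{` directly after the name so that `foo` does not
            // match `foo_count`; a label-free series matches only when no
            // labels were requested.
            let Some(rest) = rest.strip_prefix('{') else {
                return labels.is_empty() && rest.starts_with(' ');
            };
            // Split at the last `}` so values such as `/v1/audit/{id}` survive.
            let Some((label_part, _value)) = rest.rsplit_once('}') else {
                return false;
            };
            labels.iter().all(|(key, value)| {
                let wanted = format!("{key}=\"{value}\"");
                label_part.split(',').any(|pair| pair.trim() == wanted)
            })
        })
}
```

Given the response body as a string, has_series(&body, "stemedb_http_requests_total", &[("method", "POST"), ("path", "/v1/vote")]) would confirm that the counter exists for that specific endpoint, not just somewhere in the output.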

Estimated time: ~2-3 hours for all 27 handlers listed above

Metrics Added

Once complete, these metrics will be available:

stemedb_http_requests_total{method,path} (counter) - Total request count per endpoint
stemedb_http_request_duration_seconds{method,path,status} (histogram) - Request latency distribution
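
Because the duration histogram carries a status label, its per-status observation counts are enough to derive a basic availability SLI. The sketch below assumes availability means the share of requests that did not end in a 5xx status. This guide does not define the SLI, so that definition, the function name availability, and its input shape are all assumptions.

```rust
/// Fraction of requests whose status is not 5xx.
/// `counts` pairs a `status` label value with the number of requests
/// observed for it. Returns `None` when there were no requests at all.
pub fn availability(counts: &[(&str, u64)]) -> Option<f64> {
    let total: u64 = counts.iter().map(|(_, n)| *n).sum();
    if total == 0 {
        return None;
    }
    // Count only server-side failures; 4xx responses are treated as served.
    let failed: u64 = counts
        .iter()
        .filter(|(status, _)| status.starts_with('5'))
        .map(|(_, n)| *n)
        .sum();
    Some((total - failed) as f64 / total as f64)
}
```

Treating 4xx responses as successful is a design choice. A stricter SLI could exclude them from both the numerator and the denominator.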

Next Steps After Completion

After Layer 3 is complete:

Verify all metrics appear in /metrics endpoint
Create Grafana dashboards (Layer 5)
Configure Prometheus alerts (Layer 6)
Set up PagerDuty/Slack integration (Layer 7)