stemedb/docs/operations/monitoring/http-metrics-completion.md
jml 3e7eddc074 feat: add enterprise production readiness infrastructure
This commit implements comprehensive production hardening across multiple
layers to prepare StemeDB for enterprise pilot deployments:

## API Layer
- Add rate limiting middleware with configurable limits per endpoint
- Enhance error handling with detailed context and proper HTTP status codes
- Add security hardening tests for input validation and boundary conditions
- Create store_helpers module for defensive storage access patterns

## Storage & WAL
- Optimize group commit batching for higher throughput
- Add defensive error handling in hybrid backend with proper fallbacks
- Enhance WAL journal durability guarantees with fsync validation
- Improve index store query performance with better caching

## Operations & Deployment
- Add comprehensive operations documentation (deployment, monitoring, DR)
- Create systemd units for backup, WAL archival, and verification
- Add monitoring configs (Prometheus alerts, metrics exporters)
- Implement backup/restore scripts with verification and S3 archival
- Add DR drill automation and runbook procedures
- Create load balancer configs (nginx, envoy) with health checks

## Documentation
- Update CLAUDE.md with operations and troubleshooting guides
- Expand roadmap with production readiness milestones
- Add pilot success criteria and deployment reference architecture
- Document TLS setup, monitoring integration, and incident response

## Configuration
- Add .env.example with all required environment variables
- Document resource sizing for different deployment scales
- Add configuration examples for various deployment topologies

This positions StemeDB for successful enterprise pilots with proper
operational discipline, monitoring, backup/DR, and security hardening.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-12 06:08:15 +00:00

# HTTP SLI Metrics Completion Guide
## Status: Layer 3 (HTTP SLI Metrics) - 5% Complete
**Completed:**
- ✅ Pattern established in `handlers/vote.rs` (reference implementation)
- ✅ Helper script created at `scripts/add_http_metrics.sh`
**Remaining:** the 27 handlers listed below (22 without the Aphoria feature) need the same pattern applied
## Reference Pattern (from vote.rs)
```rust
// Assumes axum; `AppState`, `Response`, and `Result` are crate-local types.
use axum::{extract::State, http::StatusCode, Json};

pub async fn handler_function(
    State(state): State<AppState>,
    // ... other parameters
) -> Result<(StatusCode, Json<Response>)> {
    // 1. Start timing + increment request counter
    let start = std::time::Instant::now();
    metrics::counter!("stemedb_http_requests_total", "method" => "POST", "path" => "/v1/endpoint").increment(1);

    // 2. Handler logic (unchanged)
    // ...

    // 3. Capture result
    let result = Ok((StatusCode::OK, Json(response)));

    // 4. Track duration with status (any error is recorded as 500)
    let status = match &result {
        Ok((s, _)) => s.as_u16(),
        Err(_) => 500,
    };
    // Label values must be owned or 'static, so pass the String itself
    // rather than a borrowed `.as_str()` of a temporary.
    metrics::histogram!("stemedb_http_request_duration_seconds",
        "method" => "POST",
        "path" => "/v1/endpoint",
        "status" => status.to_string()
    ).record(start.elapsed().as_secs_f64());

    result
}
```
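If a handler returns early (for example via `?`), the recording code in step 4 never runs, and that request is missing from the duration histogram. One way to guard against this is to run the handler logic inside a closure, so the measurement happens on every exit path. The sketch below is synchronous and std-only for brevity: `Recorded`, `status_of`, and `observe` are hypothetical names, `Recorded` stands in for the `metrics` macros, and a plain `u16` stands in for `StatusCode`. It is not the code in `vote.rs`.

```rust
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the metrics sink; real handlers would call
/// `metrics::counter!` and `metrics::histogram!` instead.
#[derive(Default)]
pub struct Recorded {
    pub requests: u64,
    pub observations: Vec<(u16, Duration)>,
}

/// Map a handler result to the value used for the `status` label.
/// Any error is reported as 500, mirroring the reference pattern.
pub fn status_of<T, E>(result: &Result<(u16, T), E>) -> u16 {
    match result {
        Ok((status, _)) => *status,
        Err(_) => 500,
    }
}

/// Run `logic`, then record the outcome regardless of how it finished.
pub fn observe<T, E>(
    sink: &mut Recorded,
    logic: impl FnOnce() -> Result<(u16, T), E>,
) -> Result<(u16, T), E> {
    let start = Instant::now();
    sink.requests += 1;
    // An early return inside `logic` lands here, not past the recording code.
    let result = logic();
    sink.observations.push((status_of(&result), start.elapsed()));
    result
}
```

An async handler could apply the same idea by awaiting an inner `async` block before recording; whether that fits the existing handlers is left as an assumption.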
## Handlers Requiring Metrics
### Write Endpoints
- [ ] `handlers/supersession.rs::supersede` (POST /v1/supersede)
- [ ] `handlers/epoch.rs::create_epoch` (POST /v1/epoch)
- [ ] `handlers/source.rs::store_source` (POST /v1/source)
### Admin Endpoints
- [ ] `handlers/admin.rs::decay_trust_ranks` (POST /v1/admin/decay_trust_ranks)
- [ ] `handlers/escalation.rs::resolve_escalation` (POST /v1/admin/escalation/resolve)
- [ ] `handlers/gold_standard.rs::create_gold_standard` (POST /v1/gold_standard)
- [ ] `handlers/gold_standard.rs::remove_gold_standard` (DELETE /v1/gold_standard)
- [ ] `handlers/gold_standard.rs::verify_agent` (POST /v1/gold_standard/verify)
- [ ] `handlers/quarantine.rs::approve_quarantine` (POST /v1/admin/quarantine/approve)
- [ ] `handlers/quarantine.rs::reject_quarantine` (POST /v1/admin/quarantine/reject)
- [ ] `handlers/circuit_breaker.rs::reset_circuit` (POST /v1/admin/circuit_breaker/reset)
- [ ] `handlers/api_keys.rs::create_api_key` (POST /v1/admin/api_keys)
- [ ] `handlers/api_keys.rs::revoke_api_key` (DELETE /v1/admin/api_keys)
- [ ] `handlers/api_keys.rs::rotate_api_key` (POST /v1/admin/api_keys/rotate)
- [ ] `handlers/api_keys.rs::update_api_key` (PATCH /v1/admin/api_keys)
### Read Endpoints
- [ ] `handlers/audit.rs::list_audits` (GET /v1/audit)
- [ ] `handlers/audit.rs::get_audit` (GET /v1/audit/{id})
- [ ] `handlers/source.rs::get_provenance` (GET /v1/source/provenance)
- [ ] `handlers/concepts.rs::resolve_alias` (GET /v1/concepts/alias)
- [ ] `handlers/concepts.rs::list_aliases` (GET /v1/concepts/aliases)
- [ ] `handlers/concepts.rs::suggest_aliases` (GET /v1/concepts/suggest)
- [ ] `handlers/concepts.rs::parse_concept_path` (GET /v1/concepts/parse)
### Aphoria Endpoints (if feature enabled)
- [ ] `handlers/aphoria/policy.rs::bless` (POST /v1/aphoria/policy/bless)
- [ ] `handlers/aphoria/policy.rs::export_policy` (GET /v1/aphoria/policy/export)
- [ ] `handlers/aphoria/policy.rs::import_policy` (POST /v1/aphoria/policy/import)
- [ ] `handlers/aphoria/scan.rs::scan` (POST /v1/aphoria/scan)
- [ ] `handlers/aphoria/report.rs::push_observations` (POST /v1/aphoria/report)
## Completion Steps
1. **For each handler:**
   - Add `let start = std::time::Instant::now();` at function start
   - Add `metrics::counter!` increment after timing starts
   - Wrap the return value in a variable (`let result = Ok(...)`)
   - Add status extraction and histogram recording before returning
   - Return `result`
2. **Verification:**
   ```bash
   # After making changes
   cargo build --workspace
   cargo run --bin stemedb-api &
   # Trigger endpoint
   curl -X POST http://localhost:18180/v1/vote -d '...'
   # Check metrics
   curl http://localhost:18180/metrics | grep stemedb_http_request_duration_seconds
   curl http://localhost:18180/metrics | grep stemedb_http_requests_total
   ```
3. **Estimated time:** ~2-3 hours for all 27 handlers listed above
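With this many handlers, running the `grep` checks from the verification step by hand for each endpoint gets tedious. As a minimal sketch, the same check can be written as a function over the text returned by `/metrics`. It assumes the endpoint serves the Prometheus text exposition format with `path="..."` labels. `EXPECTED` and `missing_for_path` are hypothetical names, not part of the codebase.

```rust
/// Metric families this guide expects once a handler is instrumented.
pub const EXPECTED: [&str; 2] = [
    "stemedb_http_requests_total",
    "stemedb_http_request_duration_seconds",
];

/// Return the expected metric families that have no sample for `path`
/// in `body`, the text scraped from the `/metrics` endpoint.
pub fn missing_for_path(body: &str, path: &str) -> Vec<&'static str> {
    // The closing quote prevents "/v1/audit" from matching "/v1/audit/{id}".
    let label = format!("path=\"{path}\"");
    EXPECTED
        .iter()
        .copied()
        .filter(|family| {
            !body.lines().any(|line| {
                // Skip `# HELP` / `# TYPE` comment lines. The prefix match
                // also accepts `_sum` and `_count` series of the histogram.
                !line.starts_with('#')
                    && line.starts_with(*family)
                    && line.contains(label.as_str())
            })
        })
        .collect()
}
```

An empty return value means both metrics were seen for that path. Note that a series only appears after the endpoint has been hit at least once.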
## Metrics Added
Once complete, these metrics will be available:
- `stemedb_http_requests_total{method,path}` (counter) - Total request count per endpoint
- `stemedb_http_request_duration_seconds{method,path,status}` (histogram) - Request latency distribution
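The handler list includes a parameterized route, `GET /v1/audit/{id}`. In the reference pattern the `path` label is a hard-coded literal, so writing the route template there keeps the number of series bounded. If the label were ever derived from the request URI instead, each distinct id would create a new series. The sketch below shows one way to map a concrete path back to its template. `route_template` is a hypothetical helper and the matching rules are an assumption, not StemeDB's router behavior.

```rust
/// Return the route template matching a concrete request path, if any.
/// A `{name}` segment in a template matches any single non-empty segment.
pub fn route_template<'a>(path: &str, templates: &[&'a str]) -> Option<&'a str> {
    templates.iter().copied().find(|template| {
        let mut want = template.split('/');
        let mut got = path.split('/');
        loop {
            match (want.next(), got.next()) {
                // Both exhausted at the same time: every segment matched.
                (None, None) => return true,
                (Some(w), Some(g)) => {
                    let is_param = w.starts_with('{') && w.ends_with('}');
                    let param_match = is_param && !g.is_empty();
                    if !param_match && w != g {
                        return false;
                    }
                }
                // Different number of segments.
                _ => return false,
            }
        }
    })
}
```

The returned template, rather than the raw path, would then be used as the `path` label value.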
## Next Steps After Completion
After Layer 3 is complete:
1. Verify all metrics appear in the `/metrics` endpoint output
2. Create Grafana dashboards (Layer 5)
3. Configure Prometheus alerts (Layer 6)
4. Set up PagerDuty/Slack integration (Layer 7)