This commit implements comprehensive production hardening across multiple layers to prepare StemeDB for enterprise pilot deployments:

## API Layer
- Add rate limiting middleware with configurable limits per endpoint
- Enhance error handling with detailed context and proper HTTP status codes
- Add security hardening tests for input validation and boundary conditions
- Create store_helpers module for defensive storage access patterns

## Storage & WAL
- Optimize group commit batching for higher throughput
- Add defensive error handling in hybrid backend with proper fallbacks
- Enhance WAL journal durability guarantees with fsync validation
- Improve index store query performance with better caching

## Operations & Deployment
- Add comprehensive operations documentation (deployment, monitoring, DR)
- Create systemd units for backup, WAL archival, and verification
- Add monitoring configs (Prometheus alerts, metrics exporters)
- Implement backup/restore scripts with verification and S3 archival
- Add DR drill automation and runbook procedures
- Create load balancer configs (nginx, envoy) with health checks

## Documentation
- Update CLAUDE.md with operations and troubleshooting guides
- Expand roadmap with production readiness milestones
- Add pilot success criteria and deployment reference architecture
- Document TLS setup, monitoring integration, and incident response

## Configuration
- Add .env.example with all required environment variables
- Document resource sizing for different deployment scales
- Add configuration examples for various deployment topologies

This positions StemeDB for successful enterprise pilots with proper operational discipline, monitoring, backup/DR, and security hardening.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
HTTP SLI Metrics Completion Guide
Status: Layer 3 (HTTP SLI Metrics) - 5% Complete
Completed:
- ✅ Pattern established in handlers/vote.rs (reference implementation)
- ✅ Helper script created at scripts/add_http_metrics.sh
Remaining: 19+ handlers need the same pattern applied
Reference Pattern (from vote.rs)
pub async fn handler_function(
    State(state): State<AppState>,
    // ... other parameters
) -> Result<(StatusCode, Json<Response>)> {
    // 1. Start timing + increment request counter
    let start = std::time::Instant::now();
    metrics::counter!("stemedb_http_requests_total", "method" => "POST", "path" => "/v1/endpoint").increment(1);

    // 2. Handler logic (unchanged)
    // ...

    // 3. Capture result
    let result = Ok((StatusCode::OK, Json(response)));

    // 4. Track duration with status (any error is reported as 500)
    let status = match &result {
        Ok((s, _)) => s.as_u16(),
        Err(_) => 500,
    };
    metrics::histogram!("stemedb_http_request_duration_seconds",
        "method" => "POST",
        "path" => "/v1/endpoint",
        // Label values must be 'static or owned, so pass the String itself
        // rather than a &str borrowed from a temporary.
        "status" => status.to_string()
    ).record(start.elapsed().as_secs_f64());

    result
}
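One thing the pattern above leaves implicit: an early return inside the handler logic (for example through `?`) skips step 4, so that request never reaches the histogram. The following is a minimal sketch of one way around this, assuming the handler body can be moved into a closure or inner function. `Recorder`, `observe` and `parse_and_double` are hypothetical names invented for illustration; they are not part of StemeDB or the `metrics` crate, and the in-memory recorder only stands in for the real metrics backend so the example runs with the standard library alone.

```rust
use std::time::{Duration, Instant};

/// Hypothetical in-memory stand-in for the metrics backend.
#[derive(Default)]
struct Recorder {
    /// One (path, status, elapsed) entry per observed request.
    samples: Vec<(&'static str, u16, Duration)>,
}

/// Runs `body`, then records its duration and status, even when `body`
/// bailed out early with `?`. Errors are reported as 500, as in vote.rs.
fn observe<T, E>(
    rec: &mut Recorder,
    path: &'static str, // route template such as "/v1/audit/{id}", not the concrete URL
    body: impl FnOnce() -> Result<(u16, T), E>,
) -> Result<(u16, T), E> {
    let start = Instant::now();
    let result = body();
    let status = match &result {
        Ok((s, _)) => *s,
        Err(_) => 500,
    };
    rec.samples.push((path, status, start.elapsed()));
    result
}

/// Hypothetical handler logic that can fail early through `?`.
fn parse_and_double(input: &str) -> Result<(u16, i64), std::num::ParseIntError> {
    let n: i64 = input.parse()?; // early return on bad input
    Ok((200, n * 2))
}

fn main() {
    let mut rec = Recorder::default();
    let _ = observe(&mut rec, "/v1/endpoint", || parse_and_double("21"));
    let _ = observe(&mut rec, "/v1/endpoint", || parse_and_double("oops"));
    for (path, status, elapsed) in &rec.samples {
        println!("{path} {status} {elapsed:?}");
    }
}
```

Taking `path` as `&'static str` nudges callers toward the route template as the label value, which keeps label cardinality bounded for parameterised routes.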
Handlers Requiring Metrics
Write Endpoints
- handlers/supersession.rs::supersede (POST /v1/supersede)
- handlers/epoch.rs::create_epoch (POST /v1/epoch)
- handlers/source.rs::store_source (POST /v1/source)
Admin Endpoints
- handlers/admin.rs::decay_trust_ranks (POST /v1/admin/decay_trust_ranks)
- handlers/escalation.rs::resolve_escalation (POST /v1/admin/escalation/resolve)
- handlers/gold_standard.rs::create_gold_standard (POST /v1/gold_standard)
- handlers/gold_standard.rs::remove_gold_standard (DELETE /v1/gold_standard)
- handlers/gold_standard.rs::verify_agent (POST /v1/gold_standard/verify)
- handlers/quarantine.rs::approve_quarantine (POST /v1/admin/quarantine/approve)
- handlers/quarantine.rs::reject_quarantine (POST /v1/admin/quarantine/reject)
- handlers/circuit_breaker.rs::reset_circuit (POST /v1/admin/circuit_breaker/reset)
- handlers/api_keys.rs::create_api_key (POST /v1/admin/api_keys)
- handlers/api_keys.rs::revoke_api_key (DELETE /v1/admin/api_keys)
- handlers/api_keys.rs::rotate_api_key (POST /v1/admin/api_keys/rotate)
- handlers/api_keys.rs::update_api_key (PATCH /v1/admin/api_keys)
Read Endpoints
- handlers/audit.rs::list_audits (GET /v1/audit)
- handlers/audit.rs::get_audit (GET /v1/audit/{id})
- handlers/source.rs::get_provenance (GET /v1/source/provenance)
- handlers/concepts.rs::resolve_alias (GET /v1/concepts/alias)
- handlers/concepts.rs::list_aliases (GET /v1/concepts/aliases)
- handlers/concepts.rs::suggest_aliases (GET /v1/concepts/suggest)
- handlers/concepts.rs::parse_concept_path (GET /v1/concepts/parse)
Aphoria Endpoints (if feature enabled)
- handlers/aphoria/policy.rs::bless (POST /v1/aphoria/policy/bless)
- handlers/aphoria/policy.rs::export_policy (GET /v1/aphoria/policy/export)
- handlers/aphoria/policy.rs::import_policy (POST /v1/aphoria/policy/import)
- handlers/aphoria/scan.rs::scan (POST /v1/aphoria/scan)
- handlers/aphoria/report.rs::push_observations (POST /v1/aphoria/report)
Completion Steps
- For each handler:
  - Add let start = std::time::Instant::now(); at function start
  - Add metrics::counter! increment after timing starts
  - Wrap the return value in a variable (let result = Ok(...))
  - Add status extraction and histogram recording before returning
  - Return result
- Verification:
      # After making changes
      cargo build --workspace
      cargo run --bin stemedb-api &
      # Trigger endpoint
      curl -X POST http://localhost:18180/v1/vote -d '...'
      # Check metrics
      curl http://localhost:18180/metrics | grep stemedb_http_request_duration_seconds
      curl http://localhost:18180/metrics | grep stemedb_http_requests_total
- Estimated time: ~2-3 hours for all 20+ handlers
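To check coverage after a batch of handlers has been converted, the /metrics output can be scanned for the `path` labels that already report `stemedb_http_requests_total`. This is a rough sketch, assuming the endpoint serves the Prometheus text exposition format and that label values contain no commas or escaped quotes. `instrumented_paths` is a hypothetical helper, and the sample text in `main` is made up for illustration rather than captured from a running server.

```rust
use std::collections::BTreeSet;

/// Collects the distinct `path` label values reported for `metric` in
/// Prometheus text-format output. Comment lines (`# HELP`, `# TYPE`)
/// and other metrics are skipped.
fn instrumented_paths(exposition: &str, metric: &str) -> BTreeSet<String> {
    let mut paths = BTreeSet::new();
    for line in exposition.lines() {
        let line = line.trim();
        // Require `metric{` so that a longer metric name sharing the prefix is ignored.
        let Some(rest) = line.strip_prefix(metric) else { continue };
        let Some(labels) = rest.strip_prefix('{') else { continue };
        // Use the last `}`: route templates such as /v1/audit/{id} contain braces too.
        let Some(end) = labels.rfind('}') else { continue };
        for label in labels[..end].split(',') {
            if let Some(value) = label.trim().strip_prefix("path=\"") {
                if let Some(value) = value.strip_suffix('"') {
                    paths.insert(value.to_string());
                }
            }
        }
    }
    paths
}

fn main() {
    // Made-up sample; real input would come from `curl http://localhost:18180/metrics`.
    let sample = r#"
# TYPE stemedb_http_requests_total counter
stemedb_http_requests_total{method="POST",path="/v1/vote"} 3
stemedb_http_requests_total{method="GET",path="/v1/audit/{id}"} 1
"#;
    for path in instrumented_paths(sample, "stemedb_http_requests_total") {
        println!("instrumented: {path}");
    }
}
```

Comparing the returned set against the handler lists above shows which endpoints still lack the counter. A handler only appears once it has served at least one request, so each endpoint needs to be triggered first.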
Metrics Added
Once complete, these metrics will be available:
- stemedb_http_requests_total{method,path} (counter) - Total request count per endpoint
- stemedb_http_request_duration_seconds{method,path,status} (histogram) - Request latency distribution
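Because the duration histogram carries a `status` label, its per-status sample counts are enough to derive a basic availability SLI. This is a worked sketch, assuming availability is defined as the share of requests that did not end in a 5xx status (this guide does not fix a definition). `availability` is a hypothetical helper and the counts in `main` are invented.

```rust
/// Fraction of requests whose status is not 5xx.
/// `counts` holds (status, request count) pairs, for example taken from the
/// per-status sample counts of the duration histogram.
/// Returns None when there was no traffic at all.
fn availability(counts: &[(u16, u64)]) -> Option<f64> {
    let total: u64 = counts.iter().map(|&(_, n)| n).sum();
    if total == 0 {
        return None;
    }
    let failed: u64 = counts
        .iter()
        .filter(|&&(status, _)| (500..600).contains(&status))
        .map(|&(_, n)| n)
        .sum();
    Some((total - failed) as f64 / total as f64)
}

fn main() {
    // Invented numbers: 990 successes, 6 client errors, 4 server errors.
    let counts: [(u16, u64); 3] = [(200, 990), (400, 6), (500, 4)];
    match availability(&counts) {
        Some(ratio) => println!("availability: {:.3}", ratio),
        None => println!("no traffic"),
    }
}
```

With these numbers, 996 of 1000 requests count as good. Client errors (4xx) are treated as successful from the service's point of view, which is a common but not universal choice.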
Next Steps After Completion
After Layer 3 is complete:
- Verify all metrics appear in /metrics endpoint
- Create Grafana dashboards (Layer 5)
- Configure Prometheus alerts (Layer 6)
- Set up PagerDuty/Slack integration (Layer 7)