This commit implements comprehensive production hardening across multiple layers to prepare StemeDB for enterprise pilot deployments:

## API Layer
- Add rate limiting middleware with configurable limits per endpoint
- Enhance error handling with detailed context and proper HTTP status codes
- Add security hardening tests for input validation and boundary conditions
- Create store_helpers module for defensive storage access patterns

## Storage & WAL
- Optimize group commit batching for higher throughput
- Add defensive error handling in hybrid backend with proper fallbacks
- Enhance WAL journal durability guarantees with fsync validation
- Improve index store query performance with better caching

## Operations & Deployment
- Add comprehensive operations documentation (deployment, monitoring, DR)
- Create systemd units for backup, WAL archival, and verification
- Add monitoring configs (Prometheus alerts, metrics exporters)
- Implement backup/restore scripts with verification and S3 archival
- Add DR drill automation and runbook procedures
- Create load balancer configs (nginx, envoy) with health checks

## Documentation
- Update CLAUDE.md with operations and troubleshooting guides
- Expand roadmap with production readiness milestones
- Add pilot success criteria and deployment reference architecture
- Document TLS setup, monitoring integration, and incident response

## Configuration
- Add .env.example with all required environment variables
- Document resource sizing for different deployment scales
- Add configuration examples for various deployment topologies

This positions StemeDB for successful enterprise pilots with proper operational discipline, monitoring, backup/DR, and security hardening.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

# HTTP SLI Metrics Completion Guide

## Status: Layer 3 (HTTP SLI Metrics) - 5% Complete

**Completed:**
- ✅ Pattern established in `handlers/vote.rs` (reference implementation)
- ✅ Helper script created at `scripts/add_http_metrics.sh`

**Remaining:** the 27 handlers listed below (22 if the Aphoria feature is disabled) need the same pattern applied

## Reference Pattern (from vote.rs)

```rust
pub async fn handler_function(
    State(state): State<AppState>,
    // ... other parameters
) -> Result<(StatusCode, Json<Response>)> {
    // 1. Start timing + increment request counter
    let start = std::time::Instant::now();
    metrics::counter!("stemedb_http_requests_total", "method" => "POST", "path" => "/v1/endpoint").increment(1);

    // 2. Handler logic (unchanged)
    // ...

    // 3. Capture result
    let result = Ok((StatusCode::OK, Json(response)));

    // 4. Track duration with status
    let status = match &result {
        Ok((s, _)) => s.as_u16(),
        Err(_) => 500,
    };
    metrics::histogram!("stemedb_http_request_duration_seconds",
        "method" => "POST",
        "path" => "/v1/endpoint",
        // Pass the owned String: a `&str` borrowed from a temporary does not
        // live long enough to be used as a label value.
        "status" => status.to_string()
    ).record(start.elapsed().as_secs_f64());

    result
}
```

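Because the same four steps repeat in every handler, it can help to see them factored into a single wrapper. The sketch below is illustrative only and is not taken from `vote.rs`: `Recorder` and `observe` are hypothetical names, the recorder is an in-memory stand-in for the `metrics` macros so that the example builds with the standard library alone, and status codes are plain `u16` values rather than `StatusCode`.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical in-memory stand-in for the `metrics` recorder, used only so
/// this sketch compiles without external crates.
#[derive(Default)]
pub struct Recorder {
    /// Request counts keyed by (method, path).
    pub requests: HashMap<(String, String), u64>,
    /// Observed durations keyed by (method, path, status).
    pub durations: HashMap<(String, String, u16), Vec<Duration>>,
}

/// Runs `handler`, then records the same two observations as the reference
/// pattern: a request count and a duration labelled with the final status.
pub fn observe<T, E>(
    rec: &mut Recorder,
    method: &str,
    path: &str,
    handler: impl FnOnce() -> Result<(u16, T), E>,
) -> Result<(u16, T), E> {
    // 1. Start timing + increment request counter.
    let start = Instant::now();
    *rec.requests
        .entry((method.to_string(), path.to_string()))
        .or_insert(0) += 1;

    // 2. Handler logic (unchanged); early returns inside it still pass
    //    through the recording code below.
    let result = handler();

    // 3-4. Derive the status label; errors are reported as 500, mirroring
    //      the reference pattern.
    let status = match &result {
        Ok((s, _)) => *s,
        Err(_) => 500,
    };
    rec.durations
        .entry((method.to_string(), path.to_string(), status))
        .or_default()
        .push(start.elapsed());

    result
}
```

Calling `observe(&mut rec, "POST", "/v1/vote", || Ok((200, body)))` records one request and one duration under status 200, while a closure that returns `Err` is recorded under status 500. In the real handlers, the two `metrics` macro calls would take the place of the map updates.
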
## Handlers Requiring Metrics

### Write Endpoints
- [ ] `handlers/supersession.rs::supersede` (POST /v1/supersede)
- [ ] `handlers/epoch.rs::create_epoch` (POST /v1/epoch)
- [ ] `handlers/source.rs::store_source` (POST /v1/source)

### Admin Endpoints
- [ ] `handlers/admin.rs::decay_trust_ranks` (POST /v1/admin/decay_trust_ranks)
- [ ] `handlers/escalation.rs::resolve_escalation` (POST /v1/admin/escalation/resolve)
- [ ] `handlers/gold_standard.rs::create_gold_standard` (POST /v1/gold_standard)
- [ ] `handlers/gold_standard.rs::remove_gold_standard` (DELETE /v1/gold_standard)
- [ ] `handlers/gold_standard.rs::verify_agent` (POST /v1/gold_standard/verify)
- [ ] `handlers/quarantine.rs::approve_quarantine` (POST /v1/admin/quarantine/approve)
- [ ] `handlers/quarantine.rs::reject_quarantine` (POST /v1/admin/quarantine/reject)
- [ ] `handlers/circuit_breaker.rs::reset_circuit` (POST /v1/admin/circuit_breaker/reset)
- [ ] `handlers/api_keys.rs::create_api_key` (POST /v1/admin/api_keys)
- [ ] `handlers/api_keys.rs::revoke_api_key` (DELETE /v1/admin/api_keys)
- [ ] `handlers/api_keys.rs::rotate_api_key` (POST /v1/admin/api_keys/rotate)
- [ ] `handlers/api_keys.rs::update_api_key` (PATCH /v1/admin/api_keys)

### Read Endpoints
- [ ] `handlers/audit.rs::list_audits` (GET /v1/audit)
- [ ] `handlers/audit.rs::get_audit` (GET /v1/audit/{id})
- [ ] `handlers/source.rs::get_provenance` (GET /v1/source/provenance)
- [ ] `handlers/concepts.rs::resolve_alias` (GET /v1/concepts/alias)
- [ ] `handlers/concepts.rs::list_aliases` (GET /v1/concepts/aliases)
- [ ] `handlers/concepts.rs::suggest_aliases` (GET /v1/concepts/suggest)
- [ ] `handlers/concepts.rs::parse_concept_path` (GET /v1/concepts/parse)

### Aphoria Endpoints (if feature enabled)
- [ ] `handlers/aphoria/policy.rs::bless` (POST /v1/aphoria/policy/bless)
- [ ] `handlers/aphoria/policy.rs::export_policy` (GET /v1/aphoria/policy/export)
- [ ] `handlers/aphoria/policy.rs::import_policy` (POST /v1/aphoria/policy/import)
- [ ] `handlers/aphoria/scan.rs::scan` (POST /v1/aphoria/scan)
- [ ] `handlers/aphoria/report.rs::push_observations` (POST /v1/aphoria/report)

## Completion Steps

1. **For each handler:**
   - Add `let start = std::time::Instant::now();` at function start
   - Add `metrics::counter!` increment after timing starts
   - Wrap the return value in a variable (`let result = Ok(...)`); early returns, including those from `?`, bypass the histogram unless they are routed through `result` as well
   - Add status extraction and histogram recording before returning
   - Return `result`

2. **Verification:**
   ```bash
   # After making changes
   cargo build --workspace
   cargo run --bin stemedb-api &

   # Trigger endpoint
   curl -X POST http://localhost:18180/v1/vote -d '...'

   # Check metrics (-s hides curl's progress output when piping to grep)
   curl -s http://localhost:18180/metrics | grep stemedb_http_request_duration_seconds
   curl -s http://localhost:18180/metrics | grep stemedb_http_requests_total
   ```

3. **Estimated time:** ~2-3 hours for the 27 handlers listed above

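Grepping by hand works for one endpoint but gets tedious across the whole checklist. As a rough sketch, the check can be automated by scanning a saved `/metrics` scrape for each expected `method`/`path` pair. The function name `missing_series` is hypothetical, and the code assumes the endpoint returns the Prometheus text format with labels rendered as `method="..."` and `path="..."`.

```rust
/// Returns the (method, path) pairs that have no series for `metric` in a
/// Prometheus text-format scrape.
///
/// Matching is by name prefix, so for a histogram the `_bucket`, `_sum` and
/// `_count` series all count as a hit.
pub fn missing_series<'a>(
    scrape: &str,
    metric: &str,
    expected: &[(&'a str, &'a str)],
) -> Vec<(&'a str, &'a str)> {
    expected
        .iter()
        .copied()
        .filter(|(method, path)| {
            let method_label = format!("method=\"{}\"", method);
            let path_label = format!("path=\"{}\"", path);
            let found = scrape.lines().any(|line| {
                // Skip "# HELP" / "# TYPE" comment lines.
                !line.starts_with('#')
                    && line.starts_with(metric)
                    && line.contains(&method_label)
                    && line.contains(&path_label)
            });
            !found
        })
        .collect()
}
```

Given a scrape that contains only the `/v1/vote` counter, `missing_series(scrape, "stemedb_http_requests_total", &[("POST", "/v1/vote"), ("POST", "/v1/supersede")])` returns just the `/v1/supersede` pair, which points at the handler that still needs the pattern (or has not been exercised yet).
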
## Metrics Added

Once complete, these metrics will be available:

- `stemedb_http_requests_total{method,path}` (counter) - Total request count per endpoint
- `stemedb_http_request_duration_seconds{method,path,status}` (histogram) - Request latency distribution

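The `status` label on the duration histogram is what makes an availability SLI possible, since request counts can be split into successes and server errors. The arithmetic is sketched below under two assumptions that are not from this guide: per-status request counts have already been extracted from the metrics, and only 5xx responses count against availability. The function name `availability` is hypothetical.

```rust
/// Fraction of requests that did not end in a 5xx status, given per-status
/// request counts. Returns `None` when there were no requests at all.
///
/// Treating only 5xx as "bad" is an assumption of this sketch; a real SLI
/// definition may also exclude or include other status classes.
pub fn availability(counts_by_status: &[(u16, u64)]) -> Option<f64> {
    let total: u64 = counts_by_status.iter().map(|(_, n)| n).sum();
    if total == 0 {
        return None;
    }
    let server_errors: u64 = counts_by_status
        .iter()
        .filter(|(status, _)| *status >= 500)
        .map(|(_, n)| n)
        .sum();
    Some((total - server_errors) as f64 / total as f64)
}
```

For example, 97 responses with status 200, 1 with 404 and 2 with 500 give 98 non-error responses out of 100, an availability of 0.98.
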
## Next Steps After Completion

After Layer 3 is complete:
1. Verify all metrics appear in the `/metrics` endpoint
2. Create Grafana dashboards (Layer 5)
3. Configure Prometheus alerts (Layer 6)
4. Set up PagerDuty/Slack integration (Layer 7)