# High API Error Rate

## Severity: WARNING

## Alert Rule

**Alert:** `HighAPIErrorRate`
**Trigger:** HTTP 5xx error rate > 5% of total requests
**Duration:** 5m

## Symptom

- Metrics show `sum(rate(stemedb_http_requests_total{status=~"5.."}[5m])) / sum(rate(stemedb_http_requests_total[5m])) > 0.05`
- API returns 500/503 errors for a subset of requests
- Logs contain repeated error patterns
- Client applications report intermittent failures

## Impact

**User Impact:**
- Degraded user experience (retries, slow responses)
- Data operations fail for a subset of requests
- Inconsistent query results

**System Impact:**
- Increased retry traffic (amplification)
- Potential cascading failures
- SLA violations if sustained

## Investigation Steps

### 1. Check Error Rate by Endpoint

```bash
# 5xx request counts per series (endpoint and status), highest first
curl -s http://localhost:18180/metrics | \
  grep 'stemedb_http_requests_total.*status="5' | \
  awk '{print $NF, $1}' | sort -rn

# Look for specific endpoints with a high error count
```

### 2. Check Error Types

```bash
# Recent errors grouped by type
journalctl -u stemedb-api --since "5 min ago" | \
  grep -i "error" | \
  grep -oP 'Error: \K[^:]+' | \
  sort | uniq -c | sort -rn | head -10
```

**Common error patterns:**
- `StorageError`: Storage layer failures (disk, LSM tree)
- `TimeoutError`: Operations exceeding configured timeouts
- `SerializationError`: Data corruption or version mismatch
- `NetworkError`: Cluster communication failures
- `AuthenticationError`: API key or signature validation failures

### 3. Check System Resources

```bash
# CPU
top -b -n 1 | grep stemedb-api

# Memory (%MEM and RSS in KB); the [s] keeps grep from matching itself
ps aux | grep '[s]temedb-api' | awk '{print $4, $6}'

# Disk I/O
iostat -x 1 5

# Network
netstat -s | grep -i "segments retransmitted"
```

### 4. Check Downstream Dependencies

```bash
# WAL health
curl -s http://localhost:18180/metrics | grep wal_fsync_errors

# Storage health
curl -s http://localhost:18180/metrics | grep storage_operation_errors

# Cluster health
curl -s http://localhost:18180/v1/admin/cluster/status | jq '.health'
```

### 5. Check Client Patterns

```bash
# Top error-generating clients (by agent_id)
journalctl -u stemedb-api --since "5 min ago" | \
  grep "HTTP.*500" | \
  grep -oP 'agent_id=\K[^ ]+' | \
  sort | uniq -c | sort -rn | head -10
```

## Resolution

### If Storage Errors Detected

```bash
# Check storage error rate
curl -s http://localhost:18180/metrics | grep storage_operation_errors_total
```

**See:** `docs/operations/runbooks/storage-errors.md`

### If Memory Pressure Detected

```bash
# Check memory usage
free -h
ps aux | grep '[s]temedb-api' | awk '{print $6 / 1024 " MB"}'
```

**See:** `docs/operations/runbooks/memory-exhaustion.md`

### If Timeout Errors

**1. Identify slow operations:**

```bash
# Slow queries
curl -s http://localhost:18180/v1/admin/slow-queries | jq '.queries[] | select(.duration_ms > 1000)'
```

**2. Increase timeout temporarily:**

Edit `/etc/stemedb/api.toml`:

```toml
[api]
request_timeout_seconds = 60  # Increase from default 30
```

Restart:

```bash
systemctl restart stemedb-api
```

**3. Optimize slow queries:**

```bash
# Identify expensive query patterns
curl -s http://localhost:18180/v1/admin/slow-queries | jq -r \
  '.queries[] | "\(.subject) \(.predicate) \(.duration_ms)ms"' | \
  sort -k3 -rn | head -10
```

### If Authentication Errors

**1. Check API key validity:**

```bash
# List disabled/expired keys
# (the comparison assumes expires_at is a Unix timestamp; for ISO 8601
# strings, use `(.expires_at | fromdateiso8601) < now` instead)
curl -s http://localhost:18180/v1/admin/api-keys | jq \
  '.keys[] | select(.enabled==false or (.expires_at != null and .expires_at < now))'
```

**2. Check signature verification errors:**

```bash
journalctl -u stemedb-api --since "5 min ago" | grep "signature verification failed"
```

**3. If auth failures are widespread, check clock skew:**

```bash
# Check time on all nodes
for node in node1 node2 node3; do
  echo "$node: $(ssh "$node" date +%s)"
done

# Sync clocks if skew >1 second
for node in node1 node2 node3; do
  ssh "$node" "systemctl restart chronyd && chronyc makestep"
done
```

### If Network Errors

**1. Check cluster connectivity:**

```bash
# Test RPC connectivity
for node in node2 node3; do
  timeout 2 nc -zv "$node" 18182 || echo "FAIL: $node unreachable"
done
```

**2. Check for packet loss:**

```bash
ping -c 100 node2 | tail -2
# Expected: 0% packet loss
```

**3. If packet loss detected:**

```bash
# Check network interface errors (-A1 includes the counter line under each header)
ip -s link show eth0 | grep -A1 -E "(RX|TX).*errors"

# Check for MTU mismatch
ping -c 3 -M do -s 1472 node2  # Should succeed if MTU=1500
```

### If Client Abuse Detected

**1. Identify abusive pattern:**

```bash
# Request count by agent (summed across all series carrying the same agent label)
curl -s http://localhost:18180/metrics | \
  grep 'stemedb_http_requests_total{.*agent=' | \
  awk 'match($0, /agent="[^"]*"/) {sum[substr($0, RSTART, RLENGTH)] += $NF}
       END {for (a in sum) print sum[a], a}' | \
  sort -rn | head -5
```

**2. Rate limit or block abusive agent:**

```bash
# Enable rate limiting (set agent_id to the offending agent)
curl -X POST http://localhost:18180/v1/admin/rate-limit \
  -H 'Content-Type: application/json' \
  -d '{"agent_id": "", "max_requests_per_min": 100}'

# Or trip circuit breaker
curl -X POST http://localhost:18180/v1/admin/circuit-breaker/trip \
  -H 'Content-Type: application/json' \
  -d '{"agent_id": ""}'
```

### If Errors Persist

**1. Enable debug logging:**

Edit `/etc/stemedb/api.toml`:

```toml
[logging]
level = "debug"
```

Restart:

```bash
systemctl restart stemedb-api
```

**2. Capture detailed traces:**

```bash
# Watch errors in real time
# (journald's JSON output exposes MESSAGE and __REALTIME_TIMESTAMP fields)
journalctl -u stemedb-api -f --output=json | \
  jq 'select(.MESSAGE | strings | test("error"; "i")) | {time: .__REALTIME_TIMESTAMP, error: .MESSAGE}'
```

**3. Collect diagnostic bundle:**

```bash
# Create bundle for escalation
mkdir -p /tmp/stemedb-diag
cp /etc/stemedb/api.toml /tmp/stemedb-diag/
journalctl -u stemedb-api --since "1 hour ago" > /tmp/stemedb-diag/logs.txt
curl -s http://localhost:18180/metrics > /tmp/stemedb-diag/metrics.txt
tar czf "/tmp/stemedb-diag-$(date +%Y%m%d-%H%M).tar.gz" -C /tmp stemedb-diag
```

## Prevention

### Monitoring

**1. Error rate by endpoint:**

```yaml
- alert: EndpointErrorRateHigh
  expr: |
    sum by (path) (rate(stemedb_http_requests_total{status=~"5.."}[5m]))
    /
    sum by (path) (rate(stemedb_http_requests_total[5m]))
    > 0.05
  for: 5m
  annotations:
    summary: "Endpoint {{$labels.path}} has >5% error rate"
```

**2. Alert on new error types:**

```yaml
- alert: NewErrorTypeDetected
  expr: |
    stemedb_error_count_by_type > 0
    unless
    stemedb_error_count_by_type offset 1h > 0
  annotations:
    summary: "New error type detected: {{$labels.error_type}}"
```

**3. Track error budget consumption:**

```yaml
- alert: ErrorBudgetExhausted
  expr: |
    # Note: every non-2xx response (including 3xx and 4xx) counts against the budget
    (1 - sum(rate(stemedb_http_requests_total{status=~"2.."}[30d]))
       / sum(rate(stemedb_http_requests_total[30d]))) > 0.001  # 99.9% SLA
  annotations:
    summary: "Monthly error budget exhausted"
```

### Capacity Planning

**1. Load test error behavior:**

```bash
# Test error rate under load
hey -z 60s -c 100 -q 50 http://localhost:18180/v1/query

# Monitor error rate during test
watch -n 1 'curl -s http://localhost:18180/metrics | grep "status=\"5"'
```

**2. Set error rate thresholds:**

```toml
# /etc/stemedb/api.toml
[slo]
target_availability = 0.999           # 99.9%
error_budget_burn_rate_alert = 0.1    # Alert at 10% burn rate
```

### Operational Best Practices

**1. Implement circuit breakers:**

```toml
[resilience]
enable_circuit_breaker = true
failure_threshold = 5       # Open after 5 consecutive failures
timeout_ms = 5000
reset_timeout_ms = 30000
```

**2. Graceful degradation:**

```toml
[fallback]
enable_cache_fallback = true  # Serve stale data on storage errors
max_stale_seconds = 300
```

**3. Regular chaos testing:**

```bash
# Monthly chaos experiment
# - Kill random process
# - Inject network latency
# - Fill disk to 95%
# - Verify error handling is graceful
```

## Escalation

**Escalate if:**
- Error rate exceeds 10% for >15 minutes
- Errors indicate data corruption (`SerializationError`)
- New error type with no known resolution
- Error rate climbing despite mitigation attempts

**Escalation path:**
1. **Primary on-call:** API/Platform SRE
2. **Secondary:** Backend engineer
3. **Final escalation:** Engineering manager + on-call incident commander

## References

- **Dashboard:** [StemeDB API Health](http://grafana.example.com/d/stemedb-api-health)
- **Related alerts:** `HighStorageErrorRate`, `SlowAPIResponses`, `CircuitBreakerTripped`
- **Metrics:**
  - `stemedb_http_requests_total{status=~"5.."}` (5xx count)
  - `stemedb_http_request_duration_seconds` (latency)
  - `stemedb_error_count_by_type` (error breakdown)
- **Runbooks:** `storage-errors.md`, `memory-exhaustion.md`, `slow-fsync.md`
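
## Appendix: Quick Error Ratio Check

This runbook uses two thresholds: 5% to fire the alert and 10% to escalate. When Prometheus or the dashboard is unavailable, a rough check can be made from the raw `/metrics` output. The following is a minimal sketch, not part of the StemeDB tooling: `error_ratio` and `classify_ratio` are hypothetical helper names, and the script assumes the endpoint serves Prometheus text format with the value as the last field on each line (no trailing timestamps). Because counters are cumulative, the result is the 5xx share since the process started, not the 5-minute rate the alert evaluates.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not shipped with StemeDB).

# Read Prometheus text-format metrics on stdin and print the share of
# stemedb_http_requests_total samples whose status label is 5xx.
error_ratio() {
  awk '
    /^stemedb_http_requests_total[{]/ {
      total += $NF                                   # value is the last field
      if ($0 ~ /status="5[0-9][0-9]"/) errors += $NF
    }
    END {
      if (total == 0) { print "0.0000"; exit }       # no traffic recorded
      printf "%.4f\n", errors / total
    }
  '
}

# Map a ratio onto the runbook thresholds: above 10% escalate, above 5% warn.
classify_ratio() {
  awk -v r="$1" 'BEGIN {
    if (r + 0 > 0.10)      print "ESCALATE"
    else if (r + 0 > 0.05) print "WARNING"
    else                   print "OK"
  }'
}
```

A possible invocation, after sourcing the functions, is `classify_ratio "$(curl -s http://localhost:18180/metrics | error_ratio)"`. To approximate a rate rather than a lifetime ratio, take two snapshots a few minutes apart and compare the counter deltas instead.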