This commit implements comprehensive production hardening across multiple layers to prepare StemeDB for enterprise pilot deployments:

## API Layer

- Add rate limiting middleware with configurable limits per endpoint
- Enhance error handling with detailed context and proper HTTP status codes
- Add security hardening tests for input validation and boundary conditions
- Create store_helpers module for defensive storage access patterns

## Storage & WAL

- Optimize group commit batching for higher throughput
- Add defensive error handling in hybrid backend with proper fallbacks
- Enhance WAL journal durability guarantees with fsync validation
- Improve index store query performance with better caching

## Operations & Deployment

- Add comprehensive operations documentation (deployment, monitoring, DR)
- Create systemd units for backup, WAL archival, and verification
- Add monitoring configs (Prometheus alerts, metrics exporters)
- Implement backup/restore scripts with verification and S3 archival
- Add DR drill automation and runbook procedures
- Create load balancer configs (nginx, envoy) with health checks

## Documentation

- Update CLAUDE.md with operations and troubleshooting guides
- Expand roadmap with production readiness milestones
- Add pilot success criteria and deployment reference architecture
- Document TLS setup, monitoring integration, and incident response

## Configuration

- Add .env.example with all required environment variables
- Document resource sizing for different deployment scales
- Add configuration examples for various deployment topologies

This positions StemeDB for successful enterprise pilots with proper operational discipline, monitoring, backup/DR, and security hardening.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

# High API Error Rate

## Severity: WARNING

## Alert Rule

- **Alert:** `HighAPIErrorRate`
- **Trigger:** HTTP 5xx error rate > 5% of total requests
- **Duration:** 5m

## Symptom

- Metrics show `sum(rate(stemedb_http_requests_total{status=~"5.."}[5m])) / sum(rate(stemedb_http_requests_total[5m])) > 0.05`
- API returns 500/503 errors for a subset of requests
- Logs contain repeated error patterns
- Client applications report intermittent failures

## Impact

**User Impact:**

- Degraded user experience (retries, slow responses)
- Data operations fail for a subset of requests
- Inconsistent query results

**System Impact:**

- Increased retry traffic (amplification)
- Potential cascading failures
- SLA violations if sustained

## Investigation Steps

### 1. Check Error Rate by Endpoint

```bash
# 5xx request counts per series (endpoint and status), highest first
curl -s http://localhost:18180/metrics | \
  grep 'stemedb_http_requests_total.*status="5' | \
  awk '{print $NF, $1}' | sort -rn

# Look for specific endpoints with a high error count
```

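To turn the raw counts above into a per-endpoint ratio, the scrape output can be post-processed locally. The following is a minimal sketch, assuming each `stemedb_http_requests_total` sample carries `path` and `status` labels (the `path` label matches the one used in the per-endpoint alert rule in this runbook) and has no trailing timestamp. `error_ratio_by_path` is a hypothetical helper name. It divides cumulative counters, so it reports the ratio since process start rather than the 5-minute rate the alert evaluates.

```bash
# Sketch: per-path 5xx ratio from Prometheus text exposition read on stdin.
# Assumption: samples look like
#   stemedb_http_requests_total{path="/v1/query",status="500"} 10
error_ratio_by_path() {
  awk '
    /^stemedb_http_requests_total\{/ {
      path = ""; status = ""
      # Extract the label values (strip the name, the ="..." quoting).
      if (match($0, /path="[^"]*"/))   path   = substr($0, RSTART + 6, RLENGTH - 7)
      if (match($0, /status="[^"]*"/)) status = substr($0, RSTART + 8, RLENGTH - 9)
      total[path] += $NF
      if (status ~ /^5/) errors[path] += $NF
    }
    END {
      for (p in total)
        if (total[p] > 0) printf "%.4f %s\n", errors[p] / total[p], p
    }
  ' | sort -rn
}
```

Typical use would be `curl -s http://localhost:18180/metrics | error_ratio_by_path | head -5`, which lists the worst endpoints first.
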
### 2. Check Error Types

```bash
# Recent errors grouped by type
journalctl -u stemedb-api --since "5 min ago" | \
  grep -i "error" | \
  grep -oP 'Error: \K[^:]+' | \
  sort | uniq -c | sort -rn | head -10
```

**Common error patterns:**

- `StorageError`: Storage layer failures (disk, LSM tree)
- `TimeoutError`: Operations exceeding configured timeouts
- `SerializationError`: Data corruption or version mismatch
- `NetworkError`: Cluster communication failures
- `AuthenticationError`: API key or signature validation failures

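The grouped counts can be mapped straight onto the sections of this runbook. The helper below is a small sketch of that lookup; `triage_hint` and `triage_report` are hypothetical names, and the mapping only restates the sections and escalation criteria found in this document.

```bash
# Sketch: map an error type to the runbook section that handles it.
triage_hint() {
  case "$1" in
    StorageError)        echo "If Storage Errors Detected" ;;
    TimeoutError)        echo "If Timeout Errors" ;;
    SerializationError)  echo "Escalation" ;;  # possible data corruption
    NetworkError)        echo "If Network Errors" ;;
    AuthenticationError) echo "If Authentication Errors" ;;
    *)                   echo "If Errors Persist" ;;
  esac
}

# Reads "<count> <type>" lines (the shape produced by `uniq -c`) on stdin.
triage_report() {
  while read -r count type; do
    printf '%s x%s -> %s\n' "$type" "$count" "$(triage_hint "$type")"
  done
}
```

Piping the output of the grouping command above into `triage_report` prints one line per error type with the section to read next.
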
### 3. Check System Resources

```bash
# CPU
top -b -n 1 | grep stemedb-api

# Memory (%MEM and RSS in KiB); the [s] keeps grep from matching itself
ps aux | grep '[s]temedb-api' | awk '{print $4, $6}'

# Disk I/O
iostat -x 1 5

# Network (older net-tools releases spell it "retransmited")
netstat -s | grep -i "retransmit"
```

### 4. Check Downstream Dependencies

```bash
# WAL health
curl -s http://localhost:18180/metrics | grep wal_fsync_errors

# Storage health
curl -s http://localhost:18180/metrics | grep storage_operation_errors

# Cluster health
curl -s http://localhost:18180/v1/admin/cluster/status | jq '.health'
```

### 5. Check Client Patterns

```bash
# Top error-generating clients (by agent_id or IP)
journalctl -u stemedb-api --since "5 min ago" | \
  grep "HTTP.*500" | \
  grep -oP 'agent_id=\K[^ ]+' | \
  sort | uniq -c | sort -rn | head -10
```

## Resolution

### If Storage Errors Detected

```bash
# Check storage error rate
curl -s http://localhost:18180/metrics | grep storage_operation_errors_total
```

**See:** `docs/operations/runbooks/storage-errors.md`

### If Memory Pressure Detected

```bash
# Check memory usage
free -h
# Resident memory per process; the [s] keeps grep from matching itself
ps aux | grep '[s]temedb-api' | awk '{print $6 / 1024 " MB"}'
```

**See:** `docs/operations/runbooks/memory-exhaustion.md`

### If Timeout Errors

**1. Identify slow operations:**

```bash
# Slow queries
curl -s http://localhost:18180/v1/admin/slow-queries | jq '.queries[] | select(.duration_ms > 1000)'
```

**2. Increase timeout temporarily:**

Edit `/etc/stemedb/api.toml`:

```toml
[api]
request_timeout_seconds = 60  # Increase from default 30
```

Restart:

```bash
systemctl restart stemedb-api
```

**3. Optimize slow queries:**

```bash
# Identify expensive query patterns
curl -s http://localhost:18180/v1/admin/slow-queries | jq -r \
  '.queries[] | "\(.subject) \(.predicate) \(.duration_ms)ms"' | \
  sort -k3 -rn | head -10
```

### If Authentication Errors

**1. Check API key validity:**

```bash
# List disabled/expired keys
# (the comparison only works if expires_at is a Unix timestamp in seconds)
curl -s http://localhost:18180/v1/admin/api-keys | jq \
  '.keys[] | select(.enabled==false or .expires_at < now)'
```

**2. Check signature verification errors:**

```bash
journalctl -u stemedb-api --since "5 min ago" | grep "signature verification failed"
```

**3. If widespread auth failures, check clock skew:**

```bash
# Check time on all nodes
for node in node1 node2 node3; do
  echo "$node: $(ssh "$node" date +%s)"
done

# Sync clocks if skew >1 second
for node in node1 node2 node3; do
  ssh "$node" "systemctl restart chronyd && chronyc makestep"
done
```

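Comparing the epoch values by eye is error-prone with more than a few nodes. As a rough sketch, the output of the first loop above can be reduced to a single skew figure; `max_clock_skew` is a hypothetical helper that expects lines of the form `<node>: <epoch_seconds>`. Because the nodes are queried one after another, SSH latency inflates the result, so treat it as an upper bound rather than an exact measurement.

```bash
# Sketch: largest difference (in seconds) between the reported node clocks.
# stdin: one "<node>: <epoch_seconds>" line per node.
max_clock_skew() {
  awk '
    NR == 1  { min = max = $2 }
    $2 < min { min = $2 }
    $2 > max { max = $2 }
    END      { if (NR > 0) print max - min }
  '
}
```

A result above 1 matches the ">1 second" threshold used for the resync step.
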
### If Network Errors

**1. Check cluster connectivity:**

```bash
# Test RPC connectivity
for node in node2 node3; do
  timeout 2 nc -zv "$node" 18182 || echo "FAIL: $node unreachable"
done
```

**2. Check for packet loss:**

```bash
ping -c 100 node2 | tail -2
# Expected: 0% packet loss
```

**3. If packet loss detected:**

```bash
# Check network interface errors
ip -s link show eth0 | grep -E "(RX|TX).*errors"

# Check for MTU mismatch (don't-fragment set; -c 3 stops after three probes)
ping -c 3 -M do -s 1472 node2  # Should succeed if MTU=1500
```

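The `1472` in the MTU probe is the largest ICMP payload that fits in a 1500-byte IPv4 packet: 1500 minus the 20-byte IPv4 header (no options) and the 8-byte ICMP header. For links with a different MTU, such as jumbo frames, the same arithmetic applies. A minimal sketch, with `icmp_payload_for_mtu` as a hypothetical helper name:

```bash
# Sketch: ICMP payload size to pass to `ping -s` for a given IPv4 MTU.
# Assumes a 20-byte IPv4 header (no options) and an 8-byte ICMP header.
icmp_payload_for_mtu() {
  echo $(( $1 - 20 - 8 ))
}
```

For example, `ping -c 3 -M do -s "$(icmp_payload_for_mtu 9000)" node2` probes a path that is expected to carry 9000-byte frames.
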
### If Client Abuse Detected

**1. Identify abusive pattern:**

```bash
# Request count by agent (summed across all other labels)
curl -s http://localhost:18180/metrics | \
  grep 'stemedb_http_requests_total{.*agent=' | \
  awk 'match($0, /agent="[^"]*"/) {sum[substr($0, RSTART, RLENGTH)] += $NF}
       END {for (a in sum) print sum[a], a}' | \
  sort -rn | head -5
```

**2. Rate limit or block abusive agent:**

```bash
# Enable rate limiting
curl -X POST http://localhost:18180/v1/admin/rate-limit \
  -H 'Content-Type: application/json' \
  -d '{"agent_id": "<agent_id>", "max_requests_per_min": 100}'

# Or trip circuit breaker
curl -X POST http://localhost:18180/v1/admin/circuit-breaker/trip \
  -H 'Content-Type: application/json' \
  -d '{"agent_id": "<agent_id>"}'
```

### If Errors Persist

**1. Enable debug logging:**

Edit `/etc/stemedb/api.toml`:

```toml
[logging]
level = "debug"
```

Restart:

```bash
systemctl restart stemedb-api
```

**2. Capture detailed traces:**

```bash
# Watch errors in real-time
# (assumes the service logs JSON lines; journald wraps each one in its MESSAGE field)
journalctl -u stemedb-api -f --output=json | \
  jq --unbuffered '.MESSAGE | fromjson? | select(.level=="ERROR") | {time: .timestamp, error: .message}'
```

**3. Collect diagnostic bundle:**

```bash
# Create bundle for escalation
mkdir -p /tmp/stemedb-diag
cp /etc/stemedb/api.toml /tmp/stemedb-diag/
journalctl -u stemedb-api --since "1 hour ago" > /tmp/stemedb-diag/logs.txt
curl -s http://localhost:18180/metrics > /tmp/stemedb-diag/metrics.txt
tar czf "/tmp/stemedb-diag-$(date +%Y%m%d-%H%M).tar.gz" -C /tmp stemedb-diag
```

## Prevention

### Monitoring

**1. Error rate by endpoint:**

```yaml
- alert: EndpointErrorRateHigh
  expr: |
    sum by (path) (rate(stemedb_http_requests_total{status=~"5.."}[5m]))
    /
    sum by (path) (rate(stemedb_http_requests_total[5m]))
    > 0.05
  for: 5m
  annotations:
    summary: "Endpoint {{$labels.path}} has >5% error rate"
```

**2. Alert on new error types:**

```yaml
- alert: NewErrorTypeDetected
  expr: |
    stemedb_error_count_by_type > 0
    unless
    stemedb_error_count_by_type offset 1h > 0
  annotations:
    summary: "New error type detected: {{$labels.error_type}}"
```

**3. Track error budget consumption:**

```yaml
- alert: ErrorBudgetExhausted
  expr: |
    (1 - sum(rate(stemedb_http_requests_total{status=~"2.."}[30d]))
    / sum(rate(stemedb_http_requests_total[30d]))) > 0.001  # 99.9% SLA
  annotations:
    summary: "Monthly error budget exhausted"
```

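The `0.001` threshold in `ErrorBudgetExhausted` is the complement of the 99.9% target. To make that budget tangible, it can be converted into time: a 99.9% target over 30 days allows (1 - 0.999) × 30 × 24 × 60 = 43.2 minutes of full outage. The rule itself is request-based, so the time figure only holds under roughly uniform traffic. A small sketch of the conversion, with `error_budget_minutes` as a hypothetical helper:

```bash
# Sketch: error budget for an availability target, expressed in minutes.
# Usage: error_budget_minutes <availability_target> <window_days>
error_budget_minutes() {
  awk -v target="$1" -v days="$2" \
    'BEGIN { printf "%.1f\n", (1 - target) * days * 24 * 60 }'
}

error_budget_minutes 0.999 30   # prints 43.2
```
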
### Capacity Planning

**1. Load test error behavior:**

```bash
# Test error rate under load
hey -z 60s -c 100 -q 50 http://localhost:18180/v1/query

# Monitor error rate during test
watch -n 1 'curl -s http://localhost:18180/metrics | grep "status=\"5"'
```

**2. Set error rate thresholds:**

```toml
# /etc/stemedb/api.toml
[slo]
target_availability = 0.999  # 99.9%
error_budget_burn_rate_alert = 0.1  # Alert at 10% burn rate
```

### Operational Best Practices

**1. Implement circuit breakers:**

```toml
[resilience]
enable_circuit_breaker = true
failure_threshold = 5  # Open after 5 consecutive failures
timeout_ms = 5000
reset_timeout_ms = 30000
```

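To see how `failure_threshold` and `reset_timeout_ms` interact, the generic circuit-breaker pattern can be simulated offline. The sketch below is an illustration of that pattern only, not StemeDB's implementation: it assumes the breaker opens after N consecutive failures, rejects requests while open, and lets a probe request through once the reset timeout has elapsed. `simulate_breaker` is a hypothetical helper, and `timeout_ms` is not modeled.

```bash
# Sketch: replay a request trace through a simple circuit breaker.
# Usage: simulate_breaker <failure_threshold> <reset_timeout_ms>
# stdin:  "<timestamp_ms> <ok|fail>" per request
# stdout: "<timestamp_ms> <allowed|rejected> <state after the request>"
simulate_breaker() {
  awk -v threshold="$1" -v reset_ms="$2" '
    BEGIN { state = "closed"; failures = 0 }
    {
      now = $1; outcome = $2
      # Once the reset timeout has elapsed, allow a single probe (half-open).
      if (state == "open" && now - opened_at >= reset_ms) state = "half-open"
      # While open, the request is rejected without reaching the backend.
      if (state == "open") { print now, "rejected", state; next }
      if (outcome == "ok") {
        failures = 0; state = "closed"
      } else {
        failures++
        # A failed probe, or too many consecutive failures, opens the breaker.
        if (state == "half-open" || failures >= threshold) {
          state = "open"; opened_at = now
        }
      }
      print now, "allowed", state
    }
  '
}
```

With the values from the config above, `simulate_breaker 5 30000` fed a trace of five consecutive failures shows the breaker opening on the fifth and rejecting traffic for the following 30 seconds.
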
**2. Graceful degradation:**

```toml
[fallback]
enable_cache_fallback = true  # Serve stale data on storage errors
max_stale_seconds = 300
```

**3. Regular chaos testing:**

```bash
# Monthly chaos experiment
# - Kill random process
# - Inject network latency
# - Fill disk to 95%
# - Verify error handling is graceful
```

## Escalation

**Escalate if:**

- Error rate exceeds 10% for >15 minutes
- Errors indicate data corruption (SerializationError)
- New error type with no known resolution
- Error rate climbing despite mitigation attempts

**Escalation path:**

1. **Primary on-call:** API/Platform SRE
2. **Secondary:** Backend engineer
3. **Final escalation:** Engineering manager + on-call incident commander

## References

- **Dashboard:** [StemeDB API Health](http://grafana.example.com/d/stemedb-api-health)
- **Related alerts:** `HighStorageErrorRate`, `SlowAPIResponses`, `CircuitBreakerTripped`
- **Metrics:**
  - `stemedb_http_requests_total{status=~"5.."}` (5xx count)
  - `stemedb_http_request_duration_seconds` (latency)
  - `stemedb_error_count_by_type` (error breakdown)
- **Runbooks:** `storage-errors.md`, `memory-exhaustion.md`, `slow-fsync.md`