jml 3e7eddc074 feat: add enterprise production readiness infrastructure

This commit implements comprehensive production hardening across multiple
layers to prepare StemeDB for enterprise pilot deployments:

## API Layer
- Add rate limiting middleware with configurable limits per endpoint
- Enhance error handling with detailed context and proper HTTP status codes
- Add security hardening tests for input validation and boundary conditions
- Create store_helpers module for defensive storage access patterns

## Storage & WAL
- Optimize group commit batching for higher throughput
- Add defensive error handling in hybrid backend with proper fallbacks
- Enhance WAL journal durability guarantees with fsync validation
- Improve index store query performance with better caching

## Operations & Deployment
- Add comprehensive operations documentation (deployment, monitoring, DR)
- Create systemd units for backup, WAL archival, and verification
- Add monitoring configs (Prometheus alerts, metrics exporters)
- Implement backup/restore scripts with verification and S3 archival
- Add DR drill automation and runbook procedures
- Create load balancer configs (nginx, envoy) with health checks

## Documentation
- Update CLAUDE.md with operations and troubleshooting guides
- Expand roadmap with production readiness milestones
- Add pilot success criteria and deployment reference architecture
- Document TLS setup, monitoring integration, and incident response

## Configuration
- Add .env.example with all required environment variables
- Document resource sizing for different deployment scales
- Add configuration examples for various deployment topologies

This positions StemeDB for successful enterprise pilots with proper
operational discipline, monitoring, backup/DR, and security hardening.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

2026-02-12 06:08:15 +00:00

8.4 KiB

Raw Blame History

High API Error Rate

Severity: WARNING

Alert Rule

Alert: HighAPIErrorRate Trigger: HTTP 5xx error rate > 5% of total requests Duration: 5m

Symptom

Metrics show rate(stemedb_http_requests_total{status=~"5.."}[5m]) / rate(stemedb_http_requests_total[5m]) > 0.05
API returns 500/503 errors for subset of requests
Logs contain repeated error patterns
Client applications report intermittent failures

Impact

User Impact:

Degraded user experience (retries, slow responses)
Data operations fail for subset of requests
Inconsistent query results

System Impact:

Increased retry traffic (amplification)
Potential cascading failures
SLA violations if sustained

Investigation Steps

1. Check Error Rate by Endpoint

# Error rate per endpoint
curl -s http://localhost:18180/metrics | \
  grep 'stemedb_http_requests_total.*status="5' | \
  awk '{print $1}' | sort | uniq -c

# Look for specific endpoints with high error rate

2. Check Error Types

# Recent errors grouped by type
journalctl -u stemedb-api --since "5 min ago" | \
  grep -i "error" | \
  grep -oP 'Error: \K[^:]+' | \
  sort | uniq -c | sort -rn | head -10

Common error patterns:

StorageError: Storage layer failures (disk, LSM tree)
TimeoutError: Operations exceeding configured timeouts
SerializationError: Data corruption or version mismatch
NetworkError: Cluster communication failures
AuthenticationError: API key or signature validation failures

3. Check System Resources

# CPU
top -b -n 1 | grep stemedb-api

# Memory
ps aux | grep stemedb-api | awk '{print $4, $6}'

# Disk I/O
iostat -x 1 5

# Network
netstat -s | grep -i "segments retransmitted"

4. Check Downstream Dependencies

# WAL health
curl -s http://localhost:18180/metrics | grep wal_fsync_errors

# Storage health
curl -s http://localhost:18180/metrics | grep storage_operation_errors

# Cluster health
curl -s http://localhost:18180/v1/admin/cluster/status | jq '.health'

5. Check Client Patterns

# Top error-generating clients (by agent_id or IP)
journalctl -u stemedb-api --since "5 min ago" | \
  grep "HTTP.*500" | \
  grep -oP 'agent_id=\K[^ ]+' | \
  sort | uniq -c | sort -rn | head -10

Resolution

If Storage Errors Detected

# Check storage error rate
curl -s http://localhost:18180/metrics | grep storage_operation_errors_total

See: docs/operations/runbooks/storage-errors.md

If Memory Pressure Detected

# Check memory usage
free -h
ps aux | grep stemedb-api | awk '{print $6 / 1024 " MB"}'

See: docs/operations/runbooks/memory-exhaustion.md

If Timeout Errors

1. Identify slow operations:

# Slow queries
curl -s http://localhost:18180/v1/admin/slow-queries | jq '.queries[] | select(.duration_ms > 1000)'

2. Increase timeout temporarily:

Edit /etc/stemedb/api.toml:

[api]
request_timeout_seconds = 60  # Increase from default 30

Restart:

systemctl restart stemedb-api

3. Optimize slow queries:

# Identify expensive query patterns
curl -s http://localhost:18180/v1/admin/slow-queries | jq -r \
  '.queries[] | "\(.subject) \(.predicate) \(.duration_ms)ms"' | \
  sort -k3 -rn | head -10

If Authentication Errors

1. Check API key validity:

# List disabled/expired keys
curl -s http://localhost:18180/v1/admin/api-keys | jq \
  '.keys[] | select(.enabled==false or .expires_at < now)'

2. Check signature verification errors:

journalctl -u stemedb-api --since "5 min ago" | grep "signature verification failed"

3. If widespread auth failures, check clock skew:

# Check time on all nodes
for node in node1 node2 node3; do
  echo "$node: $(ssh $node date +%s)"
done

# Sync clocks if skew >1 second
for node in node1 node2 node3; do
  ssh $node "systemctl restart chronyd && chronyc makestep"
done

If Network Errors

1. Check cluster connectivity:

# Test RPC connectivity
for node in node2 node3; do
  timeout 2 nc -zv $node 18182 || echo "FAIL: $node unreachable"
done

2. Check for packet loss:

ping -c 100 node2 | tail -2
# Expected: 0% packet loss

3. If packet loss detected:

# Check network interface errors
ip -s link show eth0 | grep -E "(RX|TX).*errors"

# Check for MTU mismatch
ping -M do -s 1472 node2  # Should succeed if MTU=1500

If Client Abuse Detected

1. Identify abusive pattern:

# Request rate by agent
curl -s http://localhost:18180/metrics | \
  grep 'stemedb_http_requests_total{.*agent=' | \
  awk '{sum[$1]+=$NF} END {for(i in sum) print sum[i], i}' | \
  sort -rn | head -5

2. Rate limit or block abusive agent:

# Enable rate limiting
curl -X POST http://localhost:18180/v1/admin/rate-limit \
  -d '{"agent_id": "<agent_id>", "max_requests_per_min": 100}'

# Or trip circuit breaker
curl -X POST http://localhost:18180/v1/admin/circuit-breaker/trip \
  -d '{"agent_id": "<agent_id>"}'

If Errors Persist

1. Enable debug logging:

Edit /etc/stemedb/api.toml:

[logging]
level = "debug"

Restart:

systemctl restart stemedb-api

2. Capture detailed traces:

# Watch errors in real-time
journalctl -u stemedb-api -f --output=json | \
  jq 'select(.level=="ERROR") | {time: .timestamp, error: .message}'

3. Collect diagnostic bundle:

# Create bundle for escalation
mkdir /tmp/stemedb-diag
cp /etc/stemedb/api.toml /tmp/stemedb-diag/
journalctl -u stemedb-api --since "1 hour ago" > /tmp/stemedb-diag/logs.txt
curl -s http://localhost:18180/metrics > /tmp/stemedb-diag/metrics.txt
tar czf /tmp/stemedb-diag-$(date +%Y%m%d-%H%M).tar.gz /tmp/stemedb-diag/

Prevention

Monitoring

1. Error rate by endpoint:

- alert: EndpointErrorRateHigh
  expr: |
    sum by (path) (rate(stemedb_http_requests_total{status=~"5.."}[5m]))
    /
    sum by (path) (rate(stemedb_http_requests_total[5m]))
    > 0.05    
  for: 5m
  annotations:
    summary: "Endpoint {{$labels.path}} has >5% error rate"

2. Alert on new error types:

- alert: NewErrorTypeDetected
  expr: |
    stemedb_error_count_by_type > 0
    unless
    stemedb_error_count_by_type offset 1h > 0    
  annotations:
    summary: "New error type detected: {{$labels.error_type}}"

3. Track error budget consumption:

- alert: ErrorBudgetExhausted
  expr: |
    (1 - sum(rate(stemedb_http_requests_total{status=~"2.."}[30d]))
     / sum(rate(stemedb_http_requests_total[30d]))) > 0.001  # 99.9% SLA    
  annotations:
    summary: "Monthly error budget exhausted"

Capacity Planning

1. Load test error behavior:

# Test error rate under load
hey -z 60s -c 100 -q 50 http://localhost:18180/v1/query

# Monitor error rate during test
watch -n 1 'curl -s http://localhost:18180/metrics | grep "status=\"5"'

2. Set error rate thresholds:

# /etc/stemedb/api.toml
[slo]
target_availability = 0.999  # 99.9%
error_budget_burn_rate_alert = 0.1  # Alert at 10% burn rate

Operational Best Practices

1. Implement circuit breakers:

[resilience]
enable_circuit_breaker = true
failure_threshold = 5  # Open after 5 consecutive failures
timeout_ms = 5000
reset_timeout_ms = 30000

2. Graceful degradation:

[fallback]
enable_cache_fallback = true  # Serve stale data on storage errors
max_stale_seconds = 300

3. Regular chaos testing:

# Monthly chaos experiment
# - Kill random process
# - Inject network latency
# - Fill disk to 95%
# - Verify error handling is graceful

Escalation

Escalate if:

Error rate exceeds 10% for >15 minutes
Errors indicate data corruption (SerializationError)
New error type with no known resolution
Error rate climbing despite mitigation attempts

Escalation path:

Primary on-call: API/Platform SRE
Secondary: Backend engineer
Final escalation: Engineering manager + on-call incident commander

References

Dashboard: StemeDB API Health
Related alerts: HighStorageErrorRate, SlowAPIResponses, CircuitBreakerTripped
Metrics:
- stemedb_http_requests_total{status=~"5.."} (5xx count)
- stemedb_http_request_duration_seconds (latency)
- stemedb_error_count_by_type (error breakdown)
Runbooks: storage-errors.md, memory-exhaustion.md, slow-fsync.md

8.4 KiB Raw Blame History