This commit implements comprehensive production hardening across multiple layers to prepare StemeDB for enterprise pilot deployments: ## API Layer - Add rate limiting middleware with configurable limits per endpoint - Enhance error handling with detailed context and proper HTTP status codes - Add security hardening tests for input validation and boundary conditions - Create store_helpers module for defensive storage access patterns ## Storage & WAL - Optimize group commit batching for higher throughput - Add defensive error handling in hybrid backend with proper fallbacks - Enhance WAL journal durability guarantees with fsync validation - Improve index store query performance with better caching ## Operations & Deployment - Add comprehensive operations documentation (deployment, monitoring, DR) - Create systemd units for backup, WAL archival, and verification - Add monitoring configs (Prometheus alerts, metrics exporters) - Implement backup/restore scripts with verification and S3 archival - Add DR drill automation and runbook procedures - Create load balancer configs (nginx, envoy) with health checks ## Documentation - Update CLAUDE.md with operations and troubleshooting guides - Expand roadmap with production readiness milestones - Add pilot success criteria and deployment reference architecture - Document TLS setup, monitoring integration, and incident response ## Configuration - Add .env.example with all required environment variables - Document resource sizing for different deployment scales - Add configuration examples for various deployment topologies This positions StemeDB for successful enterprise pilots with proper operational discipline, monitoring, backup/DR, and security hardening. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
7.6 KiB
Memory Exhaustion
Severity: CRITICAL
Alert Rule
Alert: MemoryExhaustion
Trigger: Available memory < 10% for 5 minutes
Duration: 5m
Symptom
- System metrics show high memory usage (>90%)
- Logs contain "Out of memory" or allocation failures
- Process killed by OOM killer:
kernel: Out of memory: Kill process stemedb-api - API becomes unresponsive or crashes
- Swap usage increasing rapidly
Impact
User Impact:
- API requests timeout or return 503 errors
- Service crashes and restarts (data in flight lost)
- Degraded performance (heavy swapping)
System Impact:
- OOM killer may terminate stemedb-api
- System instability (swap thrashing)
- Risk of cascading failures if other services affected
Investigation Steps
1. Check Memory Usage
# Overall system memory
free -h
# Process-specific memory
ps aux | grep stemedb-api | awk '{print $2, $4, $5, $6}'
# PID %MEM VSZ RSS
# Detailed process memory map
pmap -x $(pgrep stemedb-api)
2. Check for Memory Leaks
# Memory growth over time
curl -s http://localhost:18180/metrics | grep process_resident_memory_bytes
# Compare with historical data
# Expected: Stable after warmup, not continuously increasing
3. Check Index/Cache Size
# Check index memory usage
curl -s http://localhost:18180/v1/admin/storage/stats | jq '{
index_memory_mb: (.index_memory_bytes / 1e6),
cache_memory_mb: (.cache_memory_bytes / 1e6)
}'
4. Identify Large Allocations
# Enable heap profiling (if compiled with jemalloc)
curl -X POST http://localhost:18180/v1/admin/debug/heap-profile
# Download profile
curl -s http://localhost:18180/v1/admin/debug/heap-profile/download > /tmp/heap.prof
# Analyze with jeprof
jeprof --text /usr/bin/stemedb-api /tmp/heap.prof | head -20
5. Check for Query Bomb
# Recent large queries
curl -s http://localhost:18180/v1/admin/slow-queries | jq '.queries[] | select(.memory_mb > 100)'
Resolution
Immediate Mitigation: Free Memory
1. Drop caches (safe, temporary relief):
sync
echo 3 > /proc/sys/vm/drop_caches
2. Restart service to reclaim memory:
systemctl restart stemedb-api
3. Monitor memory after restart:
watch -n 5 'free -h; echo "---"; ps aux | grep stemedb-api | awk "{print \$4, \$6}"'
If Memory Leak Suspected
1. Compare memory usage before/after restart:
# Record initial memory
INITIAL=$(curl -s http://localhost:18180/metrics | grep process_resident_memory_bytes | awk '{print $2}')
# Wait 1 hour
sleep 3600
# Check growth
CURRENT=$(curl -s http://localhost:18180/metrics | grep process_resident_memory_bytes | awk '{print $2}')
echo "Growth: $(( ($CURRENT - $INITIAL) / 1024 / 1024 )) MB/hour"
2. If growth exceeds 100 MB/hour, collect diagnostic data:
# Enable memory profiling
export MALLOC_CONF="prof:true,prof_leak:true,lg_prof_sample:19"
# Restart with profiling
systemctl restart stemedb-api
# Wait for leak to accumulate
sleep 7200 # 2 hours
# Dump heap profile
curl -X POST http://localhost:18180/v1/admin/debug/heap-profile
3. Escalate with profile data:
Attach heap profile to incident ticket.
If Index/Cache Too Large
1. Reduce cache size:
Edit /etc/stemedb/api.toml:
[storage]
max_cache_size_mb = 512 # Reduce from default 2048
Restart:
systemctl restart stemedb-api
2. Enable index eviction:
[storage]
index_eviction_enabled = true
index_max_memory_mb = 1024
3. Monitor memory after changes:
curl -s http://localhost:18180/metrics | grep -E '(cache|index)_memory_bytes'
If Query Bomb Detected
1. Identify expensive query pattern:
curl -s http://localhost:18180/v1/admin/slow-queries | jq -r '.queries[] |
select(.memory_mb > 100) |
"\(.agent_id) \(.subject) \(.predicate)"' | sort | uniq -c
2. Block abusive agent (if identified):
curl -X POST http://localhost:18180/v1/admin/circuit-breaker/trip \
-d '{"agent_id": "<agent_id_hex>"}'
3. Set query memory limit:
[query]
max_memory_per_query_mb = 256
query_timeout_seconds = 30
If OOM Killer Triggered
1. Check OOM killer logs:
dmesg | grep -i "killed process"
# kernel: Out of memory: Kill process 1234 (stemedb-api) score 800 or sacrifice child
2. Increase OOM score adjustment (make less likely to be killed):
# Set lower score (less likely to be killed)
echo -500 > /proc/$(pgrep stemedb-api)/oom_score_adj
3. Add to systemd service:
Edit /etc/systemd/system/stemedb-api.service:
[Service]
OOMScoreAdjust=-500
Prevention
Monitoring and Alerting
1. Memory warning alert:
- alert: MemoryWarning
expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.2
for: 10m
annotations:
summary: "Available memory below 20%"
2. Memory growth alert:
- alert: MemoryLeakSuspected
expr: rate(process_resident_memory_bytes[1h]) > 1e8 # 100 MB/hour
for: 2h
annotations:
summary: "Memory growing continuously, possible leak"
3. Swap usage alert:
- alert: HighSwapUsage
expr: (node_memory_SwapCached_bytes / node_memory_SwapTotal_bytes) > 0.5
annotations:
summary: "Swap usage exceeds 50%"
Capacity Planning
1. Right-size instance memory:
# Calculate memory requirements:
# - Base process: 500 MB
# - Cache: 2 GB (configurable)
# - Index: 1 GB per 10M assertions
# - Headroom: 20% buffer
# Example for 50M assertions:
# Total = 500 + 2000 + 5000 + (7500 * 0.2) = 9 GB minimum
2. Configure memory limits:
# /etc/stemedb/api.toml
[resources]
max_memory_mb = 8192 # Hard limit (OOM before this)
cache_limit_mb = 2048
index_limit_mb = 5000
3. Enable memory ballast (prevent GC thrashing):
[runtime]
memory_ballast_mb = 100 # Pre-allocate to reduce GC frequency
Operational Best Practices
1. Regular memory profiling:
# Weekly heap dump
curl -X POST http://localhost:18180/v1/admin/debug/heap-profile
curl -s http://localhost:18180/v1/admin/debug/heap-profile/download \
> /backup/heap-$(date +%Y%m%d).prof
2. Monitor memory per assertion:
# Calculate memory efficiency
ASSERTIONS=$(curl -s http://localhost:18180/metrics | grep assertions_indexed_total | awk '{print $2}')
MEMORY_MB=$(ps aux | grep stemedb-api | awk '{print $6 / 1024}')
echo "Memory per assertion: $(echo "scale=2; $MEMORY_MB / $ASSERTIONS * 1000" | bc) KB"
3. Test memory limits in staging:
# Simulate memory pressure
stress-ng --vm 1 --vm-bytes 6G --vm-method all --verify -t 300s
# Monitor API behavior under pressure
while true; do
curl -s http://localhost:18180/health || echo "FAIL"
sleep 10
done
Escalation
Escalate immediately if:
- Memory exhaustion recurs after restart (<1 hour)
- Clear memory leak identified (>200 MB/hour growth)
- OOM killer terminates process 3+ times in 24 hours
- No memory available for critical system operations
Escalation path:
- Primary on-call: Performance engineer
- Secondary: Rust/systems developer
- Final escalation: Principal engineer (memory safety issue)
References
- Dashboard: StemeDB Memory Usage
- Related alerts:
HighSwapUsage,ProcessRestarted,CacheEvictionRate - Metrics:
process_resident_memory_bytes(RSS)stemedb_cache_memory_bytes(cache usage)stemedb_index_memory_bytes(index usage)node_memory_MemAvailable_bytes(system memory)
- Logs:
/var/log/syslog(OOM killer),journalctl -u stemedb-api