This commit implements comprehensive production hardening across multiple layers to prepare StemeDB for enterprise pilot deployments: ## API Layer - Add rate limiting middleware with configurable limits per endpoint - Enhance error handling with detailed context and proper HTTP status codes - Add security hardening tests for input validation and boundary conditions - Create store_helpers module for defensive storage access patterns ## Storage & WAL - Optimize group commit batching for higher throughput - Add defensive error handling in hybrid backend with proper fallbacks - Enhance WAL journal durability guarantees with fsync validation - Improve index store query performance with better caching ## Operations & Deployment - Add comprehensive operations documentation (deployment, monitoring, DR) - Create systemd units for backup, WAL archival, and verification - Add monitoring configs (Prometheus alerts, metrics exporters) - Implement backup/restore scripts with verification and S3 archival - Add DR drill automation and runbook procedures - Create load balancer configs (nginx, envoy) with health checks ## Documentation - Update CLAUDE.md with operations and troubleshooting guides - Expand roadmap with production readiness milestones - Add pilot success criteria and deployment reference architecture - Document TLS setup, monitoring integration, and incident response ## Configuration - Add .env.example with all required environment variables - Document resource sizing for different deployment scales - Add configuration examples for various deployment topologies This positions StemeDB for successful enterprise pilots with proper operational discipline, monitoring, backup/DR, and security hardening. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
350 lines
7.6 KiB
Markdown
350 lines
7.6 KiB
Markdown
# Memory Exhaustion
|
|
|
|
## Severity: CRITICAL
|
|
|
|
## Alert Rule
|
|
|
|
**Alert:** `MemoryExhaustion`
|
|
**Trigger:** Available memory < 10% for 5 minutes
|
|
**Duration:** 5m
|
|
|
|
## Symptom
|
|
|
|
- System metrics show high memory usage (>90%)
|
|
- Logs contain "Out of memory" or allocation failures
|
|
- Process killed by OOM killer: `kernel: Out of memory: Kill process stemedb-api`
|
|
- API becomes unresponsive or crashes
|
|
- Swap usage increasing rapidly
|
|
|
|
## Impact
|
|
|
|
**User Impact:**
|
|
- API requests timeout or return 503 errors
|
|
- Service crashes and restarts (data in flight lost)
|
|
- Degraded performance (heavy swapping)
|
|
|
|
**System Impact:**
|
|
- OOM killer may terminate stemedb-api
|
|
- System instability (swap thrashing)
|
|
- Risk of cascading failures if other services affected
|
|
|
|
## Investigation Steps
|
|
|
|
### 1. Check Memory Usage
|
|
|
|
```bash
|
|
# Overall system memory
|
|
free -h
|
|
|
|
# Process-specific memory
|
|
ps aux | grep stemedb-api | awk '{print $2, $4, $5, $6}'
|
|
# PID %MEM VSZ RSS
|
|
|
|
# Detailed process memory map
|
|
pmap -x $(pgrep stemedb-api)
|
|
```
|
|
|
|
### 2. Check for Memory Leaks
|
|
|
|
```bash
|
|
# Memory growth over time
|
|
curl -s http://localhost:18180/metrics | grep process_resident_memory_bytes
|
|
|
|
# Compare with historical data
|
|
# Expected: Stable after warmup, not continuously increasing
|
|
```
|
|
|
|
### 3. Check Index/Cache Size
|
|
|
|
```bash
|
|
# Check index memory usage
|
|
curl -s http://localhost:18180/v1/admin/storage/stats | jq '{
|
|
index_memory_mb: (.index_memory_bytes / 1e6),
|
|
cache_memory_mb: (.cache_memory_bytes / 1e6)
|
|
}'
|
|
```
|
|
|
|
### 4. Identify Large Allocations
|
|
|
|
```bash
|
|
# Enable heap profiling (if compiled with jemalloc)
|
|
curl -X POST http://localhost:18180/v1/admin/debug/heap-profile
|
|
|
|
# Download profile
|
|
curl -s http://localhost:18180/v1/admin/debug/heap-profile/download > /tmp/heap.prof
|
|
|
|
# Analyze with jeprof
|
|
jeprof --text /usr/bin/stemedb-api /tmp/heap.prof | head -20
|
|
```
|
|
|
|
### 5. Check for Query Bomb
|
|
|
|
```bash
|
|
# Recent large queries
|
|
curl -s http://localhost:18180/v1/admin/slow-queries | jq '.queries[] | select(.memory_mb > 100)'
|
|
```
|
|
|
|
## Resolution
|
|
|
|
### Immediate Mitigation: Free Memory
|
|
|
|
**1. Drop caches (safe, temporary relief):**
|
|
|
|
```bash
|
|
sync
|
|
echo 3 > /proc/sys/vm/drop_caches
|
|
```
|
|
|
|
**2. Restart service to reclaim memory:**
|
|
|
|
```bash
|
|
systemctl restart stemedb-api
|
|
```
|
|
|
|
**3. Monitor memory after restart:**
|
|
|
|
```bash
|
|
watch -n 5 'free -h; echo "---"; ps aux | grep stemedb-api | awk "{print \$4, \$6}"'
|
|
```
|
|
|
|
### If Memory Leak Suspected
|
|
|
|
**1. Compare memory usage before/after restart:**
|
|
|
|
```bash
|
|
# Record initial memory
|
|
INITIAL=$(curl -s http://localhost:18180/metrics | grep process_resident_memory_bytes | awk '{print $2}')
|
|
|
|
# Wait 1 hour
|
|
sleep 3600
|
|
|
|
# Check growth
|
|
CURRENT=$(curl -s http://localhost:18180/metrics | grep process_resident_memory_bytes | awk '{print $2}')
|
|
echo "Growth: $(( ($CURRENT - $INITIAL) / 1024 / 1024 )) MB/hour"
|
|
```
|
|
|
|
**2. If growth exceeds 100 MB/hour, collect diagnostic data:**
|
|
|
|
```bash
|
|
# Enable memory profiling
|
|
export MALLOC_CONF="prof:true,prof_leak:true,lg_prof_sample:19"
|
|
|
|
# Restart with profiling
|
|
systemctl restart stemedb-api
|
|
|
|
# Wait for leak to accumulate
|
|
sleep 7200 # 2 hours
|
|
|
|
# Dump heap profile
|
|
curl -X POST http://localhost:18180/v1/admin/debug/heap-profile
|
|
```
|
|
|
|
**3. Escalate with profile data:**
|
|
|
|
Attach heap profile to incident ticket.
|
|
|
|
### If Index/Cache Too Large
|
|
|
|
**1. Reduce cache size:**
|
|
|
|
Edit `/etc/stemedb/api.toml`:
|
|
|
|
```toml
|
|
[storage]
|
|
max_cache_size_mb = 512 # Reduce from default 2048
|
|
```
|
|
|
|
Restart:
|
|
|
|
```bash
|
|
systemctl restart stemedb-api
|
|
```
|
|
|
|
**2. Enable index eviction:**
|
|
|
|
```toml
|
|
[storage]
|
|
index_eviction_enabled = true
|
|
index_max_memory_mb = 1024
|
|
```
|
|
|
|
**3. Monitor memory after changes:**
|
|
|
|
```bash
|
|
curl -s http://localhost:18180/metrics | grep -E '(cache|index)_memory_bytes'
|
|
```
|
|
|
|
### If Query Bomb Detected
|
|
|
|
**1. Identify expensive query pattern:**
|
|
|
|
```bash
|
|
curl -s http://localhost:18180/v1/admin/slow-queries | jq -r '.queries[] |
|
|
select(.memory_mb > 100) |
|
|
"\(.agent_id) \(.subject) \(.predicate)"' | sort | uniq -c
|
|
```
|
|
|
|
**2. Block abusive agent (if identified):**
|
|
|
|
```bash
|
|
curl -X POST http://localhost:18180/v1/admin/circuit-breaker/trip \
|
|
-d '{"agent_id": "<agent_id_hex>"}'
|
|
```
|
|
|
|
**3. Set query memory limit:**
|
|
|
|
```toml
|
|
[query]
|
|
max_memory_per_query_mb = 256
|
|
query_timeout_seconds = 30
|
|
```
|
|
|
|
### If OOM Killer Triggered
|
|
|
|
**1. Check OOM killer logs:**
|
|
|
|
```bash
|
|
dmesg | grep -i "killed process"
|
|
# kernel: Out of memory: Kill process 1234 (stemedb-api) score 800 or sacrifice child
|
|
```
|
|
|
|
**2. Increase OOM score adjustment (make less likely to be killed):**
|
|
|
|
```bash
|
|
# Set lower score (less likely to be killed)
|
|
echo -500 > /proc/$(pgrep stemedb-api)/oom_score_adj
|
|
```
|
|
|
|
**3. Add to systemd service:**
|
|
|
|
Edit `/etc/systemd/system/stemedb-api.service`:
|
|
|
|
```ini
|
|
[Service]
|
|
OOMScoreAdjust=-500
|
|
```
|
|
|
|
## Prevention
|
|
|
|
### Monitoring and Alerting
|
|
|
|
**1. Memory warning alert:**
|
|
|
|
```yaml
|
|
- alert: MemoryWarning
|
|
expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.2
|
|
for: 10m
|
|
annotations:
|
|
summary: "Available memory below 20%"
|
|
```
|
|
|
|
**2. Memory growth alert:**
|
|
|
|
```yaml
|
|
- alert: MemoryLeakSuspected
|
|
expr: rate(process_resident_memory_bytes[1h]) > 1e8 # 100 MB/hour
|
|
for: 2h
|
|
annotations:
|
|
summary: "Memory growing continuously, possible leak"
|
|
```
|
|
|
|
**3. Swap usage alert:**
|
|
|
|
```yaml
|
|
- alert: HighSwapUsage
|
|
expr: (node_memory_SwapCached_bytes / node_memory_SwapTotal_bytes) > 0.5
|
|
annotations:
|
|
summary: "Swap usage exceeds 50%"
|
|
```
|
|
|
|
### Capacity Planning
|
|
|
|
**1. Right-size instance memory:**
|
|
|
|
```bash
|
|
# Calculate memory requirements:
|
|
# - Base process: 500 MB
|
|
# - Cache: 2 GB (configurable)
|
|
# - Index: 1 GB per 10M assertions
|
|
# - Headroom: 20% buffer
|
|
|
|
# Example for 50M assertions:
|
|
# Total = 500 + 2000 + 5000 + (7500 * 0.2) = 9 GB minimum
|
|
```
|
|
|
|
**2. Configure memory limits:**
|
|
|
|
```toml
|
|
# /etc/stemedb/api.toml
|
|
[resources]
|
|
max_memory_mb = 8192 # Hard limit (OOM before this)
|
|
cache_limit_mb = 2048
|
|
index_limit_mb = 5000
|
|
```
|
|
|
|
**3. Enable memory ballast (prevent GC thrashing):**
|
|
|
|
```toml
|
|
[runtime]
|
|
memory_ballast_mb = 100 # Pre-allocate to reduce GC frequency
|
|
```
|
|
|
|
### Operational Best Practices
|
|
|
|
**1. Regular memory profiling:**
|
|
|
|
```bash
|
|
# Weekly heap dump
|
|
curl -X POST http://localhost:18180/v1/admin/debug/heap-profile
|
|
curl -s http://localhost:18180/v1/admin/debug/heap-profile/download \
|
|
> /backup/heap-$(date +%Y%m%d).prof
|
|
```
|
|
|
|
**2. Monitor memory per assertion:**
|
|
|
|
```bash
|
|
# Calculate memory efficiency
|
|
ASSERTIONS=$(curl -s http://localhost:18180/metrics | grep assertions_indexed_total | awk '{print $2}')
|
|
MEMORY_MB=$(ps aux | grep stemedb-api | awk '{print $6 / 1024}')
|
|
echo "Memory per assertion: $(echo "scale=2; $MEMORY_MB / $ASSERTIONS * 1000" | bc) KB"
|
|
```
|
|
|
|
**3. Test memory limits in staging:**
|
|
|
|
```bash
|
|
# Simulate memory pressure
|
|
stress-ng --vm 1 --vm-bytes 6G --vm-method all --verify -t 300s
|
|
|
|
# Monitor API behavior under pressure
|
|
while true; do
|
|
curl -s http://localhost:18180/health || echo "FAIL"
|
|
sleep 10
|
|
done
|
|
```
|
|
|
|
## Escalation
|
|
|
|
**Escalate immediately if:**
|
|
|
|
- Memory exhaustion recurs after restart (<1 hour)
|
|
- Clear memory leak identified (>200 MB/hour growth)
|
|
- OOM killer terminates process 3+ times in 24 hours
|
|
- No memory available for critical system operations
|
|
|
|
**Escalation path:**
|
|
|
|
1. **Primary on-call:** Performance engineer
|
|
2. **Secondary:** Rust/systems developer
|
|
3. **Final escalation:** Principal engineer (memory safety issue)
|
|
|
|
## References
|
|
|
|
- **Dashboard:** [StemeDB Memory Usage](http://grafana.example.com/d/stemedb-memory)
|
|
- **Related alerts:** `HighSwapUsage`, `ProcessRestarted`, `CacheEvictionRate`
|
|
- **Metrics:**
|
|
- `process_resident_memory_bytes` (RSS)
|
|
- `stemedb_cache_memory_bytes` (cache usage)
|
|
- `stemedb_index_memory_bytes` (index usage)
|
|
- `node_memory_MemAvailable_bytes` (system memory)
|
|
- **Logs:** `/var/log/syslog` (OOM killer), `journalctl -u stemedb-api`
|