# Memory Exhaustion ## Severity: CRITICAL ## Alert Rule **Alert:** `MemoryExhaustion` **Trigger:** Available memory < 10% for 5 minutes **Duration:** 5m ## Symptom - System metrics show high memory usage (>90%) - Logs contain "Out of memory" or allocation failures - Process killed by OOM killer: `kernel: Out of memory: Kill process stemedb-api` - API becomes unresponsive or crashes - Swap usage increasing rapidly ## Impact **User Impact:** - API requests timeout or return 503 errors - Service crashes and restarts (data in flight lost) - Degraded performance (heavy swapping) **System Impact:** - OOM killer may terminate stemedb-api - System instability (swap thrashing) - Risk of cascading failures if other services affected ## Investigation Steps ### 1. Check Memory Usage ```bash # Overall system memory free -h # Process-specific memory ps aux | grep stemedb-api | awk '{print $2, $4, $5, $6}' # PID %MEM VSZ RSS # Detailed process memory map pmap -x $(pgrep stemedb-api) ``` ### 2. Check for Memory Leaks ```bash # Memory growth over time curl -s http://localhost:18180/metrics | grep process_resident_memory_bytes # Compare with historical data # Expected: Stable after warmup, not continuously increasing ``` ### 3. Check Index/Cache Size ```bash # Check index memory usage curl -s http://localhost:18180/v1/admin/storage/stats | jq '{ index_memory_mb: (.index_memory_bytes / 1e6), cache_memory_mb: (.cache_memory_bytes / 1e6) }' ``` ### 4. Identify Large Allocations ```bash # Enable heap profiling (if compiled with jemalloc) curl -X POST http://localhost:18180/v1/admin/debug/heap-profile # Download profile curl -s http://localhost:18180/v1/admin/debug/heap-profile/download > /tmp/heap.prof # Analyze with jeprof jeprof --text /usr/bin/stemedb-api /tmp/heap.prof | head -20 ``` ### 5. Check for Query Bomb ```bash # Recent large queries curl -s http://localhost:18180/v1/admin/slow-queries | jq '.queries[] | select(.memory_mb > 100)' ``` ## Resolution ### Immediate Mitigation: Free Memory **1. Drop caches (safe, temporary relief):** ```bash sync echo 3 > /proc/sys/vm/drop_caches ``` **2. Restart service to reclaim memory:** ```bash systemctl restart stemedb-api ``` **3. Monitor memory after restart:** ```bash watch -n 5 'free -h; echo "---"; ps aux | grep stemedb-api | awk "{print \$4, \$6}"' ``` ### If Memory Leak Suspected **1. Compare memory usage before/after restart:** ```bash # Record initial memory INITIAL=$(curl -s http://localhost:18180/metrics | grep process_resident_memory_bytes | awk '{print $2}') # Wait 1 hour sleep 3600 # Check growth CURRENT=$(curl -s http://localhost:18180/metrics | grep process_resident_memory_bytes | awk '{print $2}') echo "Growth: $(( ($CURRENT - $INITIAL) / 1024 / 1024 )) MB/hour" ``` **2. If growth exceeds 100 MB/hour, collect diagnostic data:** ```bash # Enable memory profiling export MALLOC_CONF="prof:true,prof_leak:true,lg_prof_sample:19" # Restart with profiling systemctl restart stemedb-api # Wait for leak to accumulate sleep 7200 # 2 hours # Dump heap profile curl -X POST http://localhost:18180/v1/admin/debug/heap-profile ``` **3. Escalate with profile data:** Attach heap profile to incident ticket. ### If Index/Cache Too Large **1. Reduce cache size:** Edit `/etc/stemedb/api.toml`: ```toml [storage] max_cache_size_mb = 512 # Reduce from default 2048 ``` Restart: ```bash systemctl restart stemedb-api ``` **2. Enable index eviction:** ```toml [storage] index_eviction_enabled = true index_max_memory_mb = 1024 ``` **3. Monitor memory after changes:** ```bash curl -s http://localhost:18180/metrics | grep -E '(cache|index)_memory_bytes' ``` ### If Query Bomb Detected **1. Identify expensive query pattern:** ```bash curl -s http://localhost:18180/v1/admin/slow-queries | jq -r '.queries[] | select(.memory_mb > 100) | "\(.agent_id) \(.subject) \(.predicate)"' | sort | uniq -c ``` **2. Block abusive agent (if identified):** ```bash curl -X POST http://localhost:18180/v1/admin/circuit-breaker/trip \ -d '{"agent_id": ""}' ``` **3. Set query memory limit:** ```toml [query] max_memory_per_query_mb = 256 query_timeout_seconds = 30 ``` ### If OOM Killer Triggered **1. Check OOM killer logs:** ```bash dmesg | grep -i "killed process" # kernel: Out of memory: Kill process 1234 (stemedb-api) score 800 or sacrifice child ``` **2. Increase OOM score adjustment (make less likely to be killed):** ```bash # Set lower score (less likely to be killed) echo -500 > /proc/$(pgrep stemedb-api)/oom_score_adj ``` **3. Add to systemd service:** Edit `/etc/systemd/system/stemedb-api.service`: ```ini [Service] OOMScoreAdjust=-500 ``` ## Prevention ### Monitoring and Alerting **1. Memory warning alert:** ```yaml - alert: MemoryWarning expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.2 for: 10m annotations: summary: "Available memory below 20%" ``` **2. Memory growth alert:** ```yaml - alert: MemoryLeakSuspected expr: rate(process_resident_memory_bytes[1h]) > 1e8 # 100 MB/hour for: 2h annotations: summary: "Memory growing continuously, possible leak" ``` **3. Swap usage alert:** ```yaml - alert: HighSwapUsage expr: (node_memory_SwapCached_bytes / node_memory_SwapTotal_bytes) > 0.5 annotations: summary: "Swap usage exceeds 50%" ``` ### Capacity Planning **1. Right-size instance memory:** ```bash # Calculate memory requirements: # - Base process: 500 MB # - Cache: 2 GB (configurable) # - Index: 1 GB per 10M assertions # - Headroom: 20% buffer # Example for 50M assertions: # Total = 500 + 2000 + 5000 + (7500 * 0.2) = 9 GB minimum ``` **2. Configure memory limits:** ```toml # /etc/stemedb/api.toml [resources] max_memory_mb = 8192 # Hard limit (OOM before this) cache_limit_mb = 2048 index_limit_mb = 5000 ``` **3. Enable memory ballast (prevent GC thrashing):** ```toml [runtime] memory_ballast_mb = 100 # Pre-allocate to reduce GC frequency ``` ### Operational Best Practices **1. Regular memory profiling:** ```bash # Weekly heap dump curl -X POST http://localhost:18180/v1/admin/debug/heap-profile curl -s http://localhost:18180/v1/admin/debug/heap-profile/download \ > /backup/heap-$(date +%Y%m%d).prof ``` **2. Monitor memory per assertion:** ```bash # Calculate memory efficiency ASSERTIONS=$(curl -s http://localhost:18180/metrics | grep assertions_indexed_total | awk '{print $2}') MEMORY_MB=$(ps aux | grep stemedb-api | awk '{print $6 / 1024}') echo "Memory per assertion: $(echo "scale=2; $MEMORY_MB / $ASSERTIONS * 1000" | bc) KB" ``` **3. Test memory limits in staging:** ```bash # Simulate memory pressure stress-ng --vm 1 --vm-bytes 6G --vm-method all --verify -t 300s # Monitor API behavior under pressure while true; do curl -s http://localhost:18180/health || echo "FAIL" sleep 10 done ``` ## Escalation **Escalate immediately if:** - Memory exhaustion recurs after restart (<1 hour) - Clear memory leak identified (>200 MB/hour growth) - OOM killer terminates process 3+ times in 24 hours - No memory available for critical system operations **Escalation path:** 1. **Primary on-call:** Performance engineer 2. **Secondary:** Rust/systems developer 3. **Final escalation:** Principal engineer (memory safety issue) ## References - **Dashboard:** [StemeDB Memory Usage](http://grafana.example.com/d/stemedb-memory) - **Related alerts:** `HighSwapUsage`, `ProcessRestarted`, `CacheEvictionRate` - **Metrics:** - `process_resident_memory_bytes` (RSS) - `stemedb_cache_memory_bytes` (cache usage) - `stemedb_index_memory_bytes` (index usage) - `node_memory_MemAvailable_bytes` (system memory) - **Logs:** `/var/log/syslog` (OOM killer), `journalctl -u stemedb-api`