
# Memory Exhaustion

## Severity: CRITICAL

## Alert Rule

**Alert:** `MemoryExhaustion`
**Trigger:** Available memory < 10% for 5 minutes
**Duration:** 5m
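
The trigger condition can be checked by hand on the affected host. The following is a minimal sketch, assuming the alert is computed as `MemAvailable / MemTotal` (this runbook does not show the actual rule expression). `mem_available_pct` is a hypothetical helper name, not part of StemeDB.

```bash
# Hypothetical helper: prints MemAvailable as a percentage of MemTotal.
# Reads /proc/meminfo-formatted text on stdin so it can be run against
# a saved copy as well as the live file.
mem_available_pct() {
  awk '
    $1 == "MemTotal:"     { total = $2 }
    $1 == "MemAvailable:" { avail = $2 }
    END {
      if (total > 0) printf "%.1f\n", avail * 100 / total
      else exit 1
    }'
}

# Usage on a live host:
#   mem_available_pct < /proc/meminfo
```

A result below 10 corresponds to the trigger threshold stated above.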

## Symptom

- System metrics show high memory usage (>90%)
- Logs contain "Out of memory" or allocation failures
- Process killed by OOM killer: `kernel: Out of memory: Killed process <pid> (stemedb-api)`
- API becomes unresponsive or crashes
- Swap usage increasing rapidly

## Impact

**User Impact:**
- API requests timeout or return 503 errors
- Service crashes and restarts (data in flight lost)
- Degraded performance (heavy swapping)

**System Impact:**
- OOM killer may terminate stemedb-api
- System instability (swap thrashing)
- Risk of cascading failures if other services affected

## Investigation Steps

### 1. Check Memory Usage

```bash
# Overall system memory
free -h

# Process-specific memory
# (ps aux | grep would also match the grep process itself)
ps -o pid,%mem,vsz,rss -p "$(pgrep -d, -x stemedb-api)"
# Columns: PID  %MEM  VSZ (KiB)  RSS (KiB)

# Detailed process memory map
pmap -x $(pgrep stemedb-api)
```

### 2. Check for Memory Leaks

```bash
# Memory growth over time
curl -s http://localhost:18180/metrics | grep process_resident_memory_bytes

# Compare with historical data
# Expected: Stable after warmup, not continuously increasing
```
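
A plain `grep` on the metrics endpoint also returns the `# HELP` and `# TYPE` comment lines of the Prometheus text format, which gets in the way when the value is needed in a script. The sketch below extracts only the sample value. It assumes the metric is exposed without labels, and `prom_value` is a hypothetical helper name.

```bash
# Hypothetical helper: prints the value of an unlabelled metric from
# Prometheus text-format input on stdin. Matching the first field exactly
# skips comment lines, whose first field is "#".
prom_value() {
  awk -v name="$1" '$1 == name { print $2; exit }'
}

# Usage against the live endpoint:
#   curl -s http://localhost:18180/metrics | prom_value process_resident_memory_bytes
```

Recording this value at intervals gives the series to compare against the expected "stable after warmup" behavior.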

### 3. Check Index/Cache Size

```bash
# Check index memory usage
curl -s http://localhost:18180/v1/admin/storage/stats | jq '{
  index_memory_mb: (.index_memory_bytes / 1e6),
  cache_memory_mb: (.cache_memory_bytes / 1e6)
}'
```

### 4. Identify Large Allocations

```bash
# Enable heap profiling (if compiled with jemalloc)
curl -X POST http://localhost:18180/v1/admin/debug/heap-profile

# Download profile
curl -s http://localhost:18180/v1/admin/debug/heap-profile/download > /tmp/heap.prof

# Analyze with jeprof
jeprof --text /usr/bin/stemedb-api /tmp/heap.prof | head -20
```

### 5. Check for Query Bomb

```bash
# Recent large queries
curl -s http://localhost:18180/v1/admin/slow-queries | jq '.queries[] | select(.memory_mb > 100)'
```

## Resolution

### Immediate Mitigation: Free Memory

**1. Drop page caches (run as root; non-destructive, but frees only reclaimable cache, so relief is limited and temporary):**

```bash
sync
echo 3 > /proc/sys/vm/drop_caches
```

**2. Restart service to reclaim memory:**

```bash
systemctl restart stemedb-api
```

**3. Monitor memory after restart:**

```bash
watch -n 5 'free -h; echo "---"; ps -o %mem,rss -C stemedb-api'
```

### If Memory Leak Suspected

**1. Measure memory growth over one hour after restart:**

```bash
# Record initial RSS (skip the "# HELP" / "# TYPE" comment lines)
INITIAL=$(curl -s http://localhost:18180/metrics | awk '!/^#/ && /process_resident_memory_bytes/ {print $2; exit}')

# Wait 1 hour
sleep 3600

# Check growth (awk handles values in scientific notation, which $(( )) cannot)
CURRENT=$(curl -s http://localhost:18180/metrics | awk '!/^#/ && /process_resident_memory_bytes/ {print $2; exit}')
awk -v a="$INITIAL" -v b="$CURRENT" 'BEGIN { printf "Growth: %.0f MB/hour\n", (b - a) / 1024 / 1024 }'
```
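
Waiting a full hour is not always practical during an incident. As a rough sketch, two samples taken any number of seconds apart can be normalized to MB/hour. `growth_mb_per_hour` is a hypothetical helper, and a short sampling window gives a noisier estimate than the one-hour measurement above.

```bash
# Hypothetical helper: growth rate in MB/hour between two RSS samples.
#   $1 = earlier sample (bytes)
#   $2 = later sample (bytes)
#   $3 = seconds between the two samples
# awk does the arithmetic because shell $(( )) cannot parse values
# such as 1.5e+09, which the metrics endpoint may emit.
growth_mb_per_hour() {
  awk -v a="$1" -v b="$2" -v s="$3" 'BEGIN {
    if (s <= 0) exit 1
    printf "%.1f\n", (b - a) / 1048576 * 3600 / s
  }'
}

# Usage: growth_mb_per_hour "$INITIAL" "$CURRENT" 600
```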

**2. If growth exceeds 100 MB/hour, collect diagnostic data:**

```bash
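# Enable memory profiling. An exported shell variable does not reach a
# systemd service, so set it in a drop-in override (file name is arbitrary).
mkdir -p /etc/systemd/system/stemedb-api.service.d
cat > /etc/systemd/system/stemedb-api.service.d/heap-profile.conf <<'EOF'
[Service]
Environment="MALLOC_CONF=prof:true,prof_leak:true,lg_prof_sample:19"
EOF
systemctl daemon-reload

# Restart with profiling
systemctl restart stemedb-api

# Wait for leak to accumulate
sleep 7200  # 2 hours

# Dump heap profile
curl -X POST http://localhost:18180/v1/admin/debug/heap-profile

# Afterwards, remove the drop-in, reload, and restart to disable profiling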
```

**3. Escalate with profile data:**

Attach heap profile to incident ticket.

### If Index/Cache Too Large

**1. Reduce cache size:**

Edit `/etc/stemedb/api.toml`:

```toml
[storage]
max_cache_size_mb = 512  # Reduce from default 2048
```

Restart:

```bash
systemctl restart stemedb-api
```

**2. Enable index eviction:**

```toml
[storage]
index_eviction_enabled = true
index_max_memory_mb = 1024
```

**3. Monitor memory after changes:**

```bash
curl -s http://localhost:18180/metrics | grep -E '(cache|index)_memory_bytes'
```

### If Query Bomb Detected

**1. Identify expensive query pattern:**

```bash
curl -s http://localhost:18180/v1/admin/slow-queries | jq -r '.queries[] |
  select(.memory_mb > 100) |
  "\(.agent_id) \(.subject) \(.predicate)"' | sort | uniq -c
```

**2. Block abusive agent (if identified):**

```bash
curl -X POST http://localhost:18180/v1/admin/circuit-breaker/trip \
  -H 'Content-Type: application/json' \
  -d '{"agent_id": "<agent_id_hex>"}'
```

**3. Set query memory limit:**

```toml
[query]
max_memory_per_query_mb = 256
query_timeout_seconds = 30
```

### If OOM Killer Triggered

**1. Check OOM killer logs:**

```bash
dmesg | grep -iE "out of memory|killed process"
# Older kernels: Out of memory: Kill process 1234 (stemedb-api) score 800 or sacrifice child
# Newer kernels: Out of memory: Killed process 1234 (stemedb-api) ...
```
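
To see how often each process has been killed, the log lines can be reduced to a per-process count. This is a sketch only: it assumes the kernel message contains `Out of memory: Kill process` or `Out of memory: Killed process` followed by the PID and the process name in parentheses, and `oom_kill_counts` is a hypothetical helper name.

```bash
# Hypothetical helper: counts OOM kills per process name from kernel log
# text on stdin. Accepts both the "Kill process" and "Killed process"
# wording. Output is one "name count" pair per line.
oom_kill_counts() {
  sed -nE 's/.*Out of memory: Kill(ed)? process [0-9]+ \(([^)]+)\).*/\2/p' |
    sort | uniq -c | awk '{ print $2, $1 }'
}

# Usage:
#   dmesg | oom_kill_counts
#   journalctl -k --since -24h | oom_kill_counts
```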

**2. Lower the OOM score adjustment (makes the process less likely to be killed):**

```bash
# Set lower score (less likely to be killed)
echo -500 > /proc/$(pgrep stemedb-api)/oom_score_adj
```

**3. Add to systemd service:**

Edit `/etc/systemd/system/stemedb-api.service`, then run `systemctl daemon-reload` and restart the service for the change to take effect:

```ini
[Service]
OOMScoreAdjust=-500
```

## Prevention

### Monitoring and Alerting

**1. Memory warning alert:**

```yaml
- alert: MemoryWarning
  expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.2
  for: 10m
  annotations:
    summary: "Available memory below 20%"
```

**2. Memory growth alert:**

```yaml
- alert: MemoryLeakSuspected
  expr: delta(process_resident_memory_bytes[1h]) > 1e8  # gauge grew by 100 MB over the last hour
  for: 2h
  annotations:
    summary: "Memory growing continuously, possible leak"
```

**3. Swap usage alert:**

```yaml
- alert: HighSwapUsage
  expr: (1 - node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes) > 0.5
  annotations:
    summary: "Swap usage exceeds 50%"
```

### Capacity Planning

**1. Right-size instance memory:**

```bash
# Calculate memory requirements:
# - Base process: 500 MB
# - Cache: 2 GB (configurable)
# - Index: 1 GB per 10M assertions
# - Headroom: 20% buffer

# Example for 50M assertions:
# Total = 500 + 2000 + 5000 + (7500 * 0.2) = 9 GB minimum
```
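
The same estimate can be scripted for other assertion counts. The sketch below uses only the sizing figures from the estimate above (500 MB base, 1000 MB of index per 10M assertions, 20% headroom) and takes the cache size as an optional argument. `required_memory_mb` is a hypothetical helper, and the figures should be checked against measured usage before sizing production hosts.

```bash
# Hypothetical helper: minimum memory in MB for a given assertion count.
#   $1 = assertion count
#   $2 = cache size in MB (optional, defaults to 2000)
required_memory_mb() {
  awk -v n="$1" -v cache="${2:-2000}" 'BEGIN {
    base     = 500                    # base process, MB
    index_mb = n / 10000000 * 1000    # 1000 MB per 10M assertions
    printf "%.0f\n", (base + cache + index_mb) * 1.2   # add 20% headroom
  }'
}

# Usage: required_memory_mb 50000000
```

For 50M assertions this reproduces the 9000 MB (9 GB) figure from the worked example.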

**2. Configure memory limits:**

```toml
# /etc/stemedb/api.toml
[resources]
max_memory_mb = 8192  # Hard limit (OOM before this)
cache_limit_mb = 2048
index_limit_mb = 5000
```

**3. Enable memory ballast (prevent GC thrashing):**

```toml
[runtime]
memory_ballast_mb = 100  # Pre-allocate to reduce GC frequency
```

### Operational Best Practices

**1. Regular memory profiling:**

```bash
# Weekly heap dump
curl -X POST http://localhost:18180/v1/admin/debug/heap-profile
curl -s http://localhost:18180/v1/admin/debug/heap-profile/download \
  > /backup/heap-$(date +%Y%m%d).prof
```

**2. Monitor memory per assertion:**

```bash
# Calculate memory efficiency (RSS in KB divided by indexed assertions)
ASSERTIONS=$(curl -s http://localhost:18180/metrics | awk '!/^#/ && /assertions_indexed_total/ {print $2; exit}')
RSS_KB=$(ps -o rss= -p "$(pgrep -o -x stemedb-api)")
awk -v kb="$RSS_KB" -v n="$ASSERTIONS" 'BEGIN { if (n > 0) printf "Memory per assertion: %.2f KB\n", kb / n }'
```

**3. Test memory limits in staging:**

```bash
# Simulate memory pressure in the background for 5 minutes
stress-ng --vm 1 --vm-bytes 6G --vm-method all --verify -t 300s &

# Monitor API behavior while the stressor is running
while kill -0 $! 2>/dev/null; do
  curl -sf http://localhost:18180/health || echo "FAIL"
  sleep 10
done
```
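
For a staging test that should end with a pass or fail result, a bounded poll that counts failed probes can be easier to work with than an open-ended loop. This is a sketch under assumptions: `poll_health` is a hypothetical helper, and the probe command is passed in so that any check can be used.

```bash
# Hypothetical helper: runs a probe command a fixed number of times and
# reports how many attempts failed. Returns 0 only if every probe passed.
#   $1 = number of attempts
#   $2 = seconds to wait between attempts
#   remaining arguments = probe command and its arguments
poll_health() {
  local attempts=$1 interval=$2 failures=0 i=0
  shift 2
  while [ "$i" -lt "$attempts" ]; do
    "$@" >/dev/null 2>&1 || failures=$((failures + 1))
    i=$((i + 1))
    # Sleep between attempts, but not after the last one
    [ "$i" -lt "$attempts" ] && sleep "$interval"
  done
  echo "failures: $failures/$attempts"
  [ "$failures" -eq 0 ]
}

# Usage (30 probes, 10 seconds apart; curl -f fails on HTTP errors):
#   poll_health 30 10 curl -sf http://localhost:18180/health
```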

## Escalation

**Escalate immediately if:**

- Memory exhaustion recurs within 1 hour of a restart
- Clear memory leak identified (>200 MB/hour growth)
- OOM killer terminates process 3+ times in 24 hours
- No memory available for critical system operations
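
The three numeric criteria above can be checked mechanically. The sketch below is illustrative only: `should_escalate` is a hypothetical helper, its inputs must be gathered separately, and the fourth criterion (no memory for critical system operations) remains a judgment call for the responder.

```bash
# Hypothetical helper: applies the numeric escalation thresholds.
#   $1 = RSS growth in MB/hour
#   $2 = OOM kills of the process in the last 24 hours
#   $3 = minutes from the last restart until the alert recurred (0 = none)
# Prints each reason that applies; returns 0 if any applies, 1 otherwise.
should_escalate() {
  awk -v growth="$1" -v kills="$2" -v recur="$3" 'BEGIN {
    n = 0
    if (recur > 0 && recur < 60) { print "recurred within 1 hour of restart"; n++ }
    if (growth > 200)            { print "growth above 200 MB/hour"; n++ }
    if (kills >= 3)              { print "3 or more OOM kills in 24 hours"; n++ }
    exit (n == 0)
  }'
}

# Usage: should_escalate 250 1 0 && echo "Escalate now"
```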

**Escalation path:**

1. **Primary on-call:** Performance engineer
2. **Secondary:** Rust/systems developer
3. **Final escalation:** Principal engineer (memory safety issue)

## References

- **Dashboard:** [StemeDB Memory Usage](http://grafana.example.com/d/stemedb-memory)
- **Related alerts:** `HighSwapUsage`, `ProcessRestarted`, `CacheEvictionRate`
- **Metrics:**
  - `process_resident_memory_bytes` (RSS)
  - `stemedb_cache_memory_bytes` (cache usage)
  - `stemedb_index_memory_bytes` (index usage)
  - `node_memory_MemAvailable_bytes` (system memory)
- **Logs:** `/var/log/syslog` (OOM killer), `journalctl -u stemedb-api`