stemedb/docs/operations/runbooks/memory-exhaustion.md
jml 3e7eddc074 feat: add enterprise production readiness infrastructure
This commit implements comprehensive production hardening across multiple
layers to prepare StemeDB for enterprise pilot deployments:

## API Layer
- Add rate limiting middleware with configurable limits per endpoint
- Enhance error handling with detailed context and proper HTTP status codes
- Add security hardening tests for input validation and boundary conditions
- Create store_helpers module for defensive storage access patterns

## Storage & WAL
- Optimize group commit batching for higher throughput
- Add defensive error handling in hybrid backend with proper fallbacks
- Enhance WAL journal durability guarantees with fsync validation
- Improve index store query performance with better caching

## Operations & Deployment
- Add comprehensive operations documentation (deployment, monitoring, DR)
- Create systemd units for backup, WAL archival, and verification
- Add monitoring configs (Prometheus alerts, metrics exporters)
- Implement backup/restore scripts with verification and S3 archival
- Add DR drill automation and runbook procedures
- Create load balancer configs (nginx, envoy) with health checks

## Documentation
- Update CLAUDE.md with operations and troubleshooting guides
- Expand roadmap with production readiness milestones
- Add pilot success criteria and deployment reference architecture
- Document TLS setup, monitoring integration, and incident response

## Configuration
- Add .env.example with all required environment variables
- Document resource sizing for different deployment scales
- Add configuration examples for various deployment topologies

This positions StemeDB for successful enterprise pilots with proper
operational discipline, monitoring, backup/DR, and security hardening.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-12 06:08:15 +00:00

350 lines
7.6 KiB
Markdown

# Memory Exhaustion
## Severity: CRITICAL
## Alert Rule
**Alert:** `MemoryExhaustion`
**Trigger:** Available memory < 10% for 5 minutes
**Duration:** 5m
## Symptom
- System metrics show high memory usage (>90%)
- Logs contain "Out of memory" or allocation failures
- Process killed by OOM killer: `kernel: Out of memory: Kill process stemedb-api`
- API becomes unresponsive or crashes
- Swap usage increasing rapidly
## Impact
**User Impact:**
- API requests timeout or return 503 errors
- Service crashes and restarts (data in flight lost)
- Degraded performance (heavy swapping)
**System Impact:**
- OOM killer may terminate stemedb-api
- System instability (swap thrashing)
- Risk of cascading failures if other services affected
## Investigation Steps
### 1. Check Memory Usage
```bash
# Overall system memory
free -h
# Process-specific memory
ps aux | grep stemedb-api | awk '{print $2, $4, $5, $6}'
# PID %MEM VSZ RSS
# Detailed process memory map
pmap -x $(pgrep stemedb-api)
```
### 2. Check for Memory Leaks
```bash
# Memory growth over time
curl -s http://localhost:18180/metrics | grep process_resident_memory_bytes
# Compare with historical data
# Expected: Stable after warmup, not continuously increasing
```
### 3. Check Index/Cache Size
```bash
# Check index memory usage
curl -s http://localhost:18180/v1/admin/storage/stats | jq '{
index_memory_mb: (.index_memory_bytes / 1e6),
cache_memory_mb: (.cache_memory_bytes / 1e6)
}'
```
### 4. Identify Large Allocations
```bash
# Enable heap profiling (if compiled with jemalloc)
curl -X POST http://localhost:18180/v1/admin/debug/heap-profile
# Download profile
curl -s http://localhost:18180/v1/admin/debug/heap-profile/download > /tmp/heap.prof
# Analyze with jeprof
jeprof --text /usr/bin/stemedb-api /tmp/heap.prof | head -20
```
### 5. Check for Query Bomb
```bash
# Recent large queries
curl -s http://localhost:18180/v1/admin/slow-queries | jq '.queries[] | select(.memory_mb > 100)'
```
## Resolution
### Immediate Mitigation: Free Memory
**1. Drop caches (safe, temporary relief):**
```bash
sync
echo 3 > /proc/sys/vm/drop_caches
```
**2. Restart service to reclaim memory:**
```bash
systemctl restart stemedb-api
```
**3. Monitor memory after restart:**
```bash
watch -n 5 'free -h; echo "---"; ps aux | grep stemedb-api | awk "{print \$4, \$6}"'
```
### If Memory Leak Suspected
**1. Compare memory usage before/after restart:**
```bash
# Record initial memory
INITIAL=$(curl -s http://localhost:18180/metrics | grep process_resident_memory_bytes | awk '{print $2}')
# Wait 1 hour
sleep 3600
# Check growth
CURRENT=$(curl -s http://localhost:18180/metrics | grep process_resident_memory_bytes | awk '{print $2}')
echo "Growth: $(( ($CURRENT - $INITIAL) / 1024 / 1024 )) MB/hour"
```
**2. If growth exceeds 100 MB/hour, collect diagnostic data:**
```bash
# Enable memory profiling
export MALLOC_CONF="prof:true,prof_leak:true,lg_prof_sample:19"
# Restart with profiling
systemctl restart stemedb-api
# Wait for leak to accumulate
sleep 7200 # 2 hours
# Dump heap profile
curl -X POST http://localhost:18180/v1/admin/debug/heap-profile
```
**3. Escalate with profile data:**
Attach heap profile to incident ticket.
### If Index/Cache Too Large
**1. Reduce cache size:**
Edit `/etc/stemedb/api.toml`:
```toml
[storage]
max_cache_size_mb = 512 # Reduce from default 2048
```
Restart:
```bash
systemctl restart stemedb-api
```
**2. Enable index eviction:**
```toml
[storage]
index_eviction_enabled = true
index_max_memory_mb = 1024
```
**3. Monitor memory after changes:**
```bash
curl -s http://localhost:18180/metrics | grep -E '(cache|index)_memory_bytes'
```
### If Query Bomb Detected
**1. Identify expensive query pattern:**
```bash
curl -s http://localhost:18180/v1/admin/slow-queries | jq -r '.queries[] |
select(.memory_mb > 100) |
"\(.agent_id) \(.subject) \(.predicate)"' | sort | uniq -c
```
**2. Block abusive agent (if identified):**
```bash
curl -X POST http://localhost:18180/v1/admin/circuit-breaker/trip \
-d '{"agent_id": "<agent_id_hex>"}'
```
**3. Set query memory limit:**
```toml
[query]
max_memory_per_query_mb = 256
query_timeout_seconds = 30
```
### If OOM Killer Triggered
**1. Check OOM killer logs:**
```bash
dmesg | grep -i "killed process"
# kernel: Out of memory: Kill process 1234 (stemedb-api) score 800 or sacrifice child
```
**2. Increase OOM score adjustment (make less likely to be killed):**
```bash
# Set lower score (less likely to be killed)
echo -500 > /proc/$(pgrep stemedb-api)/oom_score_adj
```
**3. Add to systemd service:**
Edit `/etc/systemd/system/stemedb-api.service`:
```ini
[Service]
OOMScoreAdjust=-500
```
## Prevention
### Monitoring and Alerting
**1. Memory warning alert:**
```yaml
- alert: MemoryWarning
expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.2
for: 10m
annotations:
summary: "Available memory below 20%"
```
**2. Memory growth alert:**
```yaml
- alert: MemoryLeakSuspected
expr: rate(process_resident_memory_bytes[1h]) > 1e8 # 100 MB/hour
for: 2h
annotations:
summary: "Memory growing continuously, possible leak"
```
**3. Swap usage alert:**
```yaml
- alert: HighSwapUsage
expr: (node_memory_SwapCached_bytes / node_memory_SwapTotal_bytes) > 0.5
annotations:
summary: "Swap usage exceeds 50%"
```
### Capacity Planning
**1. Right-size instance memory:**
```bash
# Calculate memory requirements:
# - Base process: 500 MB
# - Cache: 2 GB (configurable)
# - Index: 1 GB per 10M assertions
# - Headroom: 20% buffer
# Example for 50M assertions:
# Total = 500 + 2000 + 5000 + (7500 * 0.2) = 9 GB minimum
```
**2. Configure memory limits:**
```toml
# /etc/stemedb/api.toml
[resources]
max_memory_mb = 8192 # Hard limit (OOM before this)
cache_limit_mb = 2048
index_limit_mb = 5000
```
**3. Enable memory ballast (prevent GC thrashing):**
```toml
[runtime]
memory_ballast_mb = 100 # Pre-allocate to reduce GC frequency
```
### Operational Best Practices
**1. Regular memory profiling:**
```bash
# Weekly heap dump
curl -X POST http://localhost:18180/v1/admin/debug/heap-profile
curl -s http://localhost:18180/v1/admin/debug/heap-profile/download \
> /backup/heap-$(date +%Y%m%d).prof
```
**2. Monitor memory per assertion:**
```bash
# Calculate memory efficiency
ASSERTIONS=$(curl -s http://localhost:18180/metrics | grep assertions_indexed_total | awk '{print $2}')
MEMORY_MB=$(ps aux | grep stemedb-api | awk '{print $6 / 1024}')
echo "Memory per assertion: $(echo "scale=2; $MEMORY_MB / $ASSERTIONS * 1000" | bc) KB"
```
**3. Test memory limits in staging:**
```bash
# Simulate memory pressure
stress-ng --vm 1 --vm-bytes 6G --vm-method all --verify -t 300s
# Monitor API behavior under pressure
while true; do
curl -s http://localhost:18180/health || echo "FAIL"
sleep 10
done
```
## Escalation
**Escalate immediately if:**
- Memory exhaustion recurs after restart (<1 hour)
- Clear memory leak identified (>200 MB/hour growth)
- OOM killer terminates process 3+ times in 24 hours
- No memory available for critical system operations
**Escalation path:**
1. **Primary on-call:** Performance engineer
2. **Secondary:** Rust/systems developer
3. **Final escalation:** Principal engineer (memory safety issue)
## References
- **Dashboard:** [StemeDB Memory Usage](http://grafana.example.com/d/stemedb-memory)
- **Related alerts:** `HighSwapUsage`, `ProcessRestarted`, `CacheEvictionRate`
- **Metrics:**
- `process_resident_memory_bytes` (RSS)
- `stemedb_cache_memory_bytes` (cache usage)
- `stemedb_index_memory_bytes` (index usage)
- `node_memory_MemAvailable_bytes` (system memory)
- **Logs:** `/var/log/syslog` (OOM killer), `journalctl -u stemedb-api`