stemedb/docs/operations/runbooks/memory-exhaustion.md
jml 3e7eddc074 feat: add enterprise production readiness infrastructure
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-12 06:08:15 +00:00

Memory Exhaustion

Severity: CRITICAL

Alert Rule

Alert: MemoryExhaustion
Trigger: Available memory < 10% for 5 minutes
Duration: 5m
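The trigger condition can be spot-checked by hand on an affected host. The following is a minimal sketch, assuming a Linux host that exposes `/proc/meminfo`; the helper name `mem_available_pct` is illustrative and is not a StemeDB command.

```shell
# Print available memory as a percentage of total memory.
# Reads a meminfo-format file (defaults to /proc/meminfo).
# mem_available_pct is a hypothetical helper name used for illustration.
mem_available_pct() {
  awk '
    $1 == "MemTotal:"     { total = $2 }
    $1 == "MemAvailable:" { avail = $2 }
    END {
      if (total <= 0) exit 1          # missing or malformed input
      printf "%.1f\n", avail * 100 / total
    }
  ' "${1:-/proc/meminfo}"
}
```

Running `mem_available_pct` with no argument reads the live system; a value that stays below 10 for five minutes corresponds to the alert condition above.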

Symptom

  • System metrics show high memory usage (>90%)
  • Logs contain "Out of memory" or allocation failures
  • Process killed by OOM killer: kernel: Out of memory: Kill process stemedb-api
  • API becomes unresponsive or crashes
  • Swap usage increasing rapidly

Impact

User Impact:

  • API requests timeout or return 503 errors
  • Service crashes and restarts (in-flight data is lost)
  • Degraded performance (heavy swapping)

System Impact:

  • OOM killer may terminate stemedb-api
  • System instability (swap thrashing)
  • Risk of cascading failures if other services affected

Investigation Steps

1. Check Memory Usage

# Overall system memory
free -h

# Process-specific memory (pgrep avoids matching the grep command itself)
ps -o pid,%mem,vsz,rss -p "$(pgrep -d, stemedb-api)"
# PID  %MEM  VSZ   RSS

# Detailed process memory map
pmap -x $(pgrep stemedb-api)

2. Check for Memory Leaks

# Memory growth over time
curl -s http://localhost:18180/metrics | grep process_resident_memory_bytes

# Compare with historical data
# Expected: Stable after warmup, not continuously increasing
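Grepping the metrics output also returns the `# HELP` and `# TYPE` comment lines, and Prometheus clients often print large values in scientific notation. A small sketch of a parser that handles both, assuming the standard text exposition format; `rss_bytes_from_metrics` is an illustrative name:

```shell
# Read Prometheus text exposition on stdin and print RSS in whole bytes.
# Comment lines ("# HELP", "# TYPE") never match because their first field is "#".
# "%.0f" converts values such as 1.5e+09 into a plain integer string.
rss_bytes_from_metrics() {
  awk '$1 == "process_resident_memory_bytes" { printf "%.0f\n", $2; exit }'
}
```

Typical use would be `curl -s http://localhost:18180/metrics | rss_bytes_from_metrics`, which yields a single integer suitable for shell arithmetic.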

3. Check Index/Cache Size

# Check index memory usage
curl -s http://localhost:18180/v1/admin/storage/stats | jq '{
  index_memory_mb: (.index_memory_bytes / 1e6),
  cache_memory_mb: (.cache_memory_bytes / 1e6)
}'

4. Identify Large Allocations

# Enable heap profiling (if compiled with jemalloc)
curl -X POST http://localhost:18180/v1/admin/debug/heap-profile

# Download profile
curl -s http://localhost:18180/v1/admin/debug/heap-profile/download > /tmp/heap.prof

# Analyze with jeprof
jeprof --text /usr/bin/stemedb-api /tmp/heap.prof | head -20

5. Check for Query Bomb

# Recent large queries
curl -s http://localhost:18180/v1/admin/slow-queries | jq '.queries[] | select(.memory_mb > 100)'

Resolution

Immediate Mitigation: Free Memory

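1. Drop caches (safe, temporary relief; frees page cache only, not memory held by stemedb-api):

sync
echo 3 > /proc/sys/vm/drop_caches  # run as root; with sudo: echo 3 | sudo tee /proc/sys/vm/drop_caches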

2. Restart service to reclaim memory:

systemctl restart stemedb-api

3. Monitor memory after restart:

watch -n 5 'free -h; echo "---"; ps -o pid,%mem,rss -p "$(pgrep -d, stemedb-api)"'

If Memory Leak Suspected

1. Measure memory growth over one hour after a restart:

# Record initial memory (skips "# HELP"/"# TYPE" lines; converts values like 1.5e+09 to an integer)
INITIAL=$(curl -s http://localhost:18180/metrics | awk '$1 == "process_resident_memory_bytes" {printf "%.0f", $2}')

# Wait 1 hour
sleep 3600

# Check growth
CURRENT=$(curl -s http://localhost:18180/metrics | awk '$1 == "process_resident_memory_bytes" {printf "%.0f", $2}')
echo "Growth: $(( (CURRENT - INITIAL) / 1024 / 1024 )) MB/hour"
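When the two samples are not exactly one hour apart, the growth figure has to be normalized before comparing it with the thresholds in this runbook (100 MB/hour to collect diagnostics, 200 MB/hour to escalate). A minimal sketch under that assumption; both function names are illustrative:

```shell
# growth_mb_per_hour INITIAL_BYTES CURRENT_BYTES ELAPSED_SECONDS
# Prints RSS growth normalized to MB/hour (1 MB = 1024 * 1024 bytes).
growth_mb_per_hour() {
  awk -v a="$1" -v b="$2" -v s="$3" 'BEGIN {
    if (s <= 0) exit 1                       # elapsed time must be positive
    printf "%.1f\n", (b - a) / 1048576 * 3600 / s
  }'
}

# classify_growth MB_PER_HOUR
# Maps a rate onto the runbook thresholds.
classify_growth() {
  awk -v r="$1" 'BEGIN {
    if (r > 200)      print "escalate"
    else if (r > 100) print "collect-diagnostics"
    else              print "ok"
  }'
}
```

For example, `classify_growth "$(growth_mb_per_hour "$INITIAL" "$CURRENT" 1800)"` evaluates two samples taken 30 minutes apart.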

2. If growth exceeds 100 MB/hour, collect diagnostic data:

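# Enable memory profiling. A variable exported in your shell is not passed to
# a systemd service, so set it in a drop-in instead (opens an editor):
systemctl edit stemedb-api
#   [Service]
#   Environment=MALLOC_CONF=prof:true,prof_leak:true,lg_prof_sample:19

# Restart with profiling
systemctl restart stemedb-api

# Wait for leak to accumulate
sleep 7200  # 2 hours

# Dump heap profile
curl -X POST http://localhost:18180/v1/admin/debug/heap-profile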

3. Escalate with profile data:

Attach heap profile to incident ticket.

If Index/Cache Too Large

1. Reduce cache size:

Edit /etc/stemedb/api.toml:

[storage]
max_cache_size_mb = 512  # Reduce from default 2048

Restart:

systemctl restart stemedb-api

2. Enable index eviction:

[storage]
index_eviction_enabled = true
index_max_memory_mb = 1024

3. Monitor memory after changes:

curl -s http://localhost:18180/metrics | grep -E '(cache|index)_memory_bytes'

If Query Bomb Detected

1. Identify expensive query pattern:

curl -s http://localhost:18180/v1/admin/slow-queries | jq -r '.queries[] |
  select(.memory_mb > 100) |
  "\(.agent_id) \(.subject) \(.predicate)"' | sort | uniq -c

2. Block abusive agent (if identified):

curl -X POST http://localhost:18180/v1/admin/circuit-breaker/trip \
  -H 'Content-Type: application/json' \
  -d '{"agent_id": "<agent_id_hex>"}'

3. Set query memory limit:

[query]
max_memory_per_query_mb = 256
query_timeout_seconds = 30

If OOM Killer Triggered

1. Check OOM killer logs:

dmesg | grep -iE "kill(ed)? process"
# kernel: Out of memory: Kill process 1234 (stemedb-api) score 800 or sacrifice child
# (newer kernels log "Killed process" instead of "Kill process")

2. Lower the OOM score adjustment (makes the process less likely to be killed):

# Lower value = less likely to be killed (range -1000 to 1000; requires root)
echo -500 > /proc/$(pgrep -o stemedb-api)/oom_score_adj

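3. Add to the systemd service so the setting persists across restarts:

Edit /etc/systemd/system/stemedb-api.service:

[Service]
OOMScoreAdjust=-500

Apply:

systemctl daemon-reload
systemctl restart stemedb-api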

Prevention

Monitoring and Alerting

1. Memory warning alert:

- alert: MemoryWarning
  expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.2
  for: 10m
  annotations:
    summary: "Available memory below 20%"

2. Memory growth alert:

- alert: MemoryLeakSuspected
  expr: delta(process_resident_memory_bytes[1h]) > 1e8  # gauge, so delta() rather than rate(); 100 MB/hour
  for: 2h
  annotations:
    summary: "Memory growing continuously, possible leak"

3. Swap usage alert:

- alert: HighSwapUsage
  expr: (1 - node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes) > 0.5
  annotations:
    summary: "Swap usage exceeds 50%"

Capacity Planning

1. Right-size instance memory:

# Calculate memory requirements:
# - Base process: 500 MB
# - Cache: 2 GB (configurable)
# - Index: 1 GB per 10M assertions
# - Headroom: 20% buffer

# Example for 50M assertions:
# Total = 500 + 2000 + 5000 + (7500 * 0.2) = 9 GB minimum
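The sizing rule above can be turned into a small calculator for other assertion counts. This is a sketch of that arithmetic only, treating 1 GB as 1000 MB as the worked example does; `required_memory_mb` is an illustrative name:

```shell
# required_memory_mb ASSERTIONS [CACHE_MB]
# Applies the sizing rule: 500 MB base + cache + 1000 MB per 10M assertions,
# then adds 20% headroom. Prints the result in whole MB.
required_memory_mb() {
  awk -v n="$1" -v cache="${2:-2000}" 'BEGIN {
    base     = 500
    index_mb = n / 10000000 * 1000
    subtotal = base + cache + index_mb
    printf "%.0f\n", subtotal * 1.2
  }'
}
```

`required_memory_mb 50000000` reproduces the 9 GB figure from the example; passing a second argument models a smaller or larger cache.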

2. Configure memory limits:

# /etc/stemedb/api.toml
[resources]
max_memory_mb = 8192  # Hard limit (OOM before this)
cache_limit_mb = 2048
index_limit_mb = 5000
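Before deploying a configuration like this, it can help to confirm that the individual limits fit under the hard limit. A minimal sketch, assuming the 500 MB base process figure from the sizing estimate applies; `check_limits` is a hypothetical helper, not part of StemeDB:

```shell
# check_limits MAX_MB CACHE_MB INDEX_MB [BASE_MB]
# Succeeds and prints the remaining headroom when base + cache + index fits
# under the hard limit; prints the overshoot and returns 1 otherwise.
check_limits() {
  max=$1 cache=$2 index=$3 base=${4:-500}
  used=$((base + cache + index))
  if [ "$used" -gt "$max" ]; then
    echo "oversubscribed by $((used - max)) MB"
    return 1
  fi
  echo "headroom $((max - used)) MB"
}
```

With the values shown above, `check_limits 8192 2048 5000` reports the margin left for everything that is not cache, index, or base process memory.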

3. Enable memory ballast (prevent GC thrashing):

[runtime]
memory_ballast_mb = 100  # Pre-allocate to reduce GC frequency

Operational Best Practices

1. Regular memory profiling:

# Weekly heap dump
curl -X POST http://localhost:18180/v1/admin/debug/heap-profile
curl -s http://localhost:18180/v1/admin/debug/heap-profile/download \
  > /backup/heap-$(date +%Y%m%d).prof

2. Monitor memory per assertion:

# Calculate memory efficiency
ASSERTIONS=$(curl -s http://localhost:18180/metrics | grep -v '^#' | grep assertions_indexed_total | awk '{printf "%.0f", $2; exit}')
RSS_KB=$(ps -o rss= -p "$(pgrep -o stemedb-api)")
# Divide last: with a small bc scale, dividing MB by the assertion count first truncates to 0
echo "Memory per assertion: $(echo "scale=3; $RSS_KB / $ASSERTIONS" | bc) KB"

3. Test memory limits in staging:

# Simulate memory pressure
stress-ng --vm 1 --vm-bytes 6G --vm-method all --verify -t 300s

# Monitor API behavior under pressure
while true; do
  # -f makes curl fail on HTTP errors such as 503; --max-time catches hangs
  curl -sf --max-time 5 http://localhost:18180/health || echo "FAIL"
  sleep 10
done

Escalation

Escalate immediately if:

  • Memory exhaustion recurs within 1 hour of a restart
  • Clear memory leak identified (>200 MB/hour growth)
  • OOM killer terminates process 3+ times in 24 hours
  • No memory available for critical system operations
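The OOM-kill criterion above can be checked by counting kernel log entries. A minimal sketch, assuming the kernel message formats "Out of memory: Kill process" (older kernels) and "Out of memory: Killed process" (newer kernels); `count_oom_kills` is an illustrative name:

```shell
# count_oom_kills [PROCESS_NAME]
# Reads kernel log lines on stdin and prints how many OOM kills name the process.
# Requiring the "out of memory:" prefix avoids double-counting the separate
# "Killed process ..." line that older kernels print for the same event.
count_oom_kills() {
  name=${1:-stemedb-api}
  # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true"
  grep -ciE "out of memory: kill(ed)? process [0-9]+ \(${name}\)" || true
}
```

For the 24-hour window, something like `journalctl -k --since "24 hours ago" | count_oom_kills` can be compared against the threshold of 3.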

Escalation path:

  1. Primary on-call: Performance engineer
  2. Secondary: Rust/systems developer
  3. Final escalation: Principal engineer (memory safety issue)

References

  • Dashboard: StemeDB Memory Usage
  • Related alerts: HighSwapUsage, ProcessRestarted, CacheEvictionRate
  • Metrics:
    • process_resident_memory_bytes (RSS)
    • stemedb_cache_memory_bytes (cache usage)
    • stemedb_index_memory_bytes (index usage)
    • node_memory_MemAvailable_bytes (system memory)
  • Logs: /var/log/syslog (OOM killer), journalctl -u stemedb-api