# Runbook: High Memory Usage

## Alert

**RdevAPIHighMemory**: Memory usage exceeds 80% of limit

## Impact

- Risk of OOMKill
- Service disruption
- Lost in-flight requests

## Investigation

### 1. Confirm the Issue

```bash
# Check current memory usage
kubectl -n rdev top pod -l app=rdev-api

# Check for OOMKilled events
kubectl -n rdev get events --field-selector reason=OOMKilled

# Check whether the last container termination was an OOM kill
# (the events query above can be empty, e.g. once events have expired)
kubectl -n rdev get pods -l app=rdev-api -o jsonpath='{.items[*].status.containerStatuses[*].lastState.terminated.reason}'

# Check pod restarts
kubectl -n rdev get pods -l app=rdev-api -o jsonpath='{.items[*].status.containerStatuses[*].restartCount}'
```

### 2. Identify the Cause

```bash
# Check active SSE connections (potential memory leak source)
curl -s http://rdev-api:8080/metrics | grep sse_connections_active

# Check active commands
curl -s http://rdev-api:8080/metrics | grep commands_active

# Capture a heap profile (written to /tmp/heap.prof inside the container)
kubectl -n rdev exec -it deployment/rdev-api -- \
  curl -s -o /tmp/heap.prof localhost:8080/debug/pprof/heap
```

### 3. Common Causes

- **SSE connection leaks**: Clients not closing connections properly
- **Large command outputs**: Commands producing excessive output
- **Many concurrent commands**: Each command buffers its output in memory
- **Cache growth**: Project cache entries not expiring

## Remediation

### Immediate: Restart Pod

If memory is critical (>95%):

```bash
# Restart a specific pod (substitute the name reported by `kubectl top pod`)
kubectl -n rdev delete pod <pod-name>

# Or perform a rolling restart of all pods
kubectl -n rdev rollout restart deployment/rdev-api
```

### Short-term: Increase Limits

```bash
kubectl -n rdev patch deployment rdev-api --type='json' -p='[
  {"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/memory", "value": "1Gi"}
]'
```

### If SSE Connections Are Leaking

1. Check for stuck connections:

   ```bash
   # --tail=-1 is required: with a label selector, kubectl logs returns only the last 10 lines per pod by default
   kubectl -n rdev logs -l app=rdev-api --tail=-1 | grep "SSE connection" | tail -50
   ```

2. Reduce the connection timeout on the ingress (value in seconds):

   ```yaml
   nginx.ingress.kubernetes.io/proxy-read-timeout: "1800"  # 30 min max
   ```

### If Command Output Is Too Large

1. Commands should implement output limits
2. Check for runaway commands:

   ```bash
   # --tail=-1 is required: with a label selector, kubectl logs returns only the last 10 lines per pod by default
   kubectl -n rdev logs -l app=rdev-api --tail=-1 | grep "output line" | wc -l
   ```

### If Cache Is Growing

1. Reduce cache TTL:

   ```bash
   # Changing the pod template env triggers a rolling restart of the deployment
   kubectl -n rdev set env deployment/rdev-api CACHE_TTL=15s
   ```

## Verification

```bash
# Confirm memory has stabilized
kubectl -n rdev top pod -l app=rdev-api

# Check for new OOMKill events (kubectl get has no --since flag;
# sort by time instead and inspect the most recent entries)
kubectl -n rdev get events --field-selector reason=OOMKilled --sort-by=.lastTimestamp

# Verify service is healthy
curl -s http://rdev-api:8080/ready
```

## Post-Incident

1. Analyze the heap profile for memory leaks
2. Review the SSE connection lifecycle
3. Consider implementing output size limits
4. Update memory limits based on findings
5. Consider adding a memory-based HPA
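
When revisiting memory limits after an incident, it can help to express observed usage as a fraction of the limit, the same way this runbook states its thresholds (alert above 80%, critical above 95%). The following is a minimal sketch, not part of any rdev-api tooling: the helper names `to_bytes`, `mem_pct`, and `mem_status` are hypothetical, and it assumes whole-number quantities with `Ki`, `Mi`, or `Gi` suffixes, entered by hand from `kubectl top pod` output and the deployment's configured limit.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for comparing memory usage against a limit.
# Assumption: quantities are whole numbers with an optional Ki/Mi/Gi suffix.

# Convert a memory quantity such as 450Mi or 1Gi to bytes.
to_bytes() {
  local q num mult
  q="$1"
  case "$q" in
    *Ki) num="${q%Ki}"; mult=1024 ;;
    *Mi) num="${q%Mi}"; mult=$((1024 * 1024)) ;;
    *Gi) num="${q%Gi}"; mult=$((1024 * 1024 * 1024)) ;;
    *)   num="$q"; mult=1 ;;
  esac
  # Reject empty or non-integer values (e.g. 1.5Gi) instead of guessing.
  case "$num" in
    ''|*[!0-9]*) echo "unsupported quantity: $q" >&2; return 1 ;;
  esac
  echo $(( num * mult ))
}

# Print usage as a percentage of the limit, rounded down.
mem_pct() {
  local used limit
  used=$(to_bytes "$1") || return 1
  limit=$(to_bytes "$2") || return 1
  [ "$limit" -gt 0 ] || { echo "limit must be greater than 0" >&2; return 1; }
  echo $(( used * 100 / limit ))
}

# Classify usage against the runbook thresholds: >95% critical, >80% alert.
# Cross-multiplication avoids rounding errors near the thresholds.
mem_status() {
  local used limit
  used=$(to_bytes "$1") || return 1
  limit=$(to_bytes "$2") || return 1
  if [ $(( used * 100 )) -gt $(( limit * 95 )) ]; then
    echo "critical"
  elif [ $(( used * 100 )) -gt $(( limit * 80 )) ]; then
    echo "alert"
  else
    echo "ok"
  fi
}
```

For example, `mem_pct 450Mi 512Mi` prints `87` and `mem_status 450Mi 512Mi` prints `alert`. Running the same check against a candidate limit (`mem_status 450Mi 1Gi` prints `ok`) gives a quick sense of the headroom a new limit would provide.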