
# Runbook: High Memory Usage

## Alert

**RdevAPIHighMemory**: Memory usage exceeds 80% of the configured memory limit

## Impact

- Risk of OOMKill
- Service disruption
- Lost in-flight requests

## Investigation

### 1. Confirm the Issue

```bash
# Check current memory usage
kubectl -n rdev top pod -l app=rdev-api

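# Check for OOMKilled events (may return nothing: OOMKilled is usually
# recorded as a container termination reason rather than an event reason)
kubectl -n rdev get events --field-selector reason=OOMKilled

# Check the last termination reason of each container
kubectl -n rdev get pods -l app=rdev-api -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}'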

# Check pod restarts
kubectl -n rdev get pods -l app=rdev-api -o jsonpath='{.items[*].status.containerStatuses[*].restartCount}'
```
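
To compare the figures from `kubectl top` against the 80% alert threshold and the 95% critical threshold used later in this runbook, the arithmetic can be sketched as below. The helper names (`to_mib`, `mem_pct`, `classify`) are hypothetical, and the sketch assumes whole-number quantities with `Ki`, `Mi`, or `Gi` suffixes.

```bash
# Sketch: convert Kubernetes memory quantities to MiB and compare usage
# against the limit. Only integer Ki/Mi/Gi quantities are handled (assumption).
to_mib() {
  local q=$1
  case "$q" in
    *Gi) echo $(( ${q%Gi} * 1024 )) ;;
    *Mi) echo "${q%Mi}" ;;
    *Ki) echo $(( ${q%Ki} / 1024 )) ;;
    *)   echo "unsupported quantity: $q" >&2; return 1 ;;
  esac
}

# mem_pct USED LIMIT: print usage as an integer percentage (rounded down)
mem_pct() {
  local used limit
  used=$(to_mib "$1") || return 1
  limit=$(to_mib "$2") || return 1
  echo $(( used * 100 / limit ))
}

# classify USED LIMIT: print ok, alert (>80%), or critical (>95%)
classify() {
  local used limit
  used=$(to_mib "$1") || return 1
  limit=$(to_mib "$2") || return 1
  # Compare cross-multiplied values so rounding cannot hide a breach
  if   [ $(( used * 100 )) -gt $(( limit * 95 )) ]; then echo critical
  elif [ $(( used * 100 )) -gt $(( limit * 80 )) ]; then echo alert
  else echo ok
  fi
}
```

For example, `classify 412Mi 512Mi` takes the usage reported by `kubectl top` and the limit from the deployment spec.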

### 2. Identify the Cause

```bash
# Check active SSE connections (potential memory leak source)
curl -s http://rdev-api:8080/metrics | grep sse_connections_active

# Check active commands
curl -s http://rdev-api:8080/metrics | grep commands_active

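# Capture a heap profile to the local machine
# (no -it: a TTY would corrupt the binary output; assumes curl exists in the image)
kubectl -n rdev exec deployment/rdev-api -- \
  curl -s localhost:8080/debug/pprof/heap > heap.prof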
```
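
If the metrics carry labels, `grep` prints one line per label set. A small helper can total them; this is a sketch only: `sum_metric` is a hypothetical name, and it assumes the standard Prometheus text format without trailing timestamps.

```bash
# sum_metric NAME: read Prometheus text exposition from stdin and print the
# sum of all samples whose metric name contains NAME.
sum_metric() {
  awk -v name="$1" '
    /^#/ { next }                        # skip HELP/TYPE comment lines
    {
      metric = $1
      sub(/[{].*/, "", metric)           # drop the label set, if any
      # Assumes the sample value is the last field (no timestamp)
      if (index(metric, name) > 0) total += $NF
    }
    END { print total + 0 }
  '
}
```

Usage: `curl -s http://rdev-api:8080/metrics | sum_metric sse_connections_active`.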

### 3. Common Causes

- **SSE connection leaks**: Clients not closing connections properly
- **Large command outputs**: Commands producing excessive output
- **Many concurrent commands**: Each command buffers output
- **Cache growth**: Project cache not expiring

## Remediation

### Immediate: Restart Pod

If memory usage is critical (above 95% of the limit):

```bash
# Restart specific pod
kubectl -n rdev delete pod <pod-name>

# Or perform a rolling restart of all pods
kubectl -n rdev rollout restart deployment/rdev-api
```
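
When only one pod is affected, the `<pod-name>` to delete is the one using the most memory. A rough way to pick it from `kubectl top` output is sketched here; `top_mem_pod` is a hypothetical helper and assumes memory is reported in `Mi`.

```bash
# top_mem_pod: read `kubectl top pod --no-headers` output (NAME CPU MEMORY)
# from stdin and print the name of the pod with the highest memory usage.
top_mem_pod() {
  awk '
    {
      mem = $3
      sub(/Mi$/, "", mem)                # assumes values are reported in Mi
      if (NR == 1 || mem + 0 > max) { max = mem + 0; pod = $1 }
    }
    END { if (pod != "") print pod }
  '
}
```

Usage: `kubectl -n rdev top pod -l app=rdev-api --no-headers | top_mem_pod`.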

### Short-term: Increase Limits

```bash
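# Raise the memory limit (1Gi is an example value; the change triggers a rolling restart).
# The "replace" op fails if the container has no memory limit set yet.
kubectl -n rdev patch deployment rdev-api --type='json' -p='[
  {"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/memory", "value": "1Gi"}
]'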
```

### If SSE Connections Are Leaking

1. Check for stuck connections:
   ```bash
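   # --tail=-1: with a label selector, kubectl logs returns only the last 10 lines per pod by default
   kubectl -n rdev logs -l app=rdev-api --tail=-1 | grep "SSE connection" | tail -50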
   ```

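2. Reduce connection timeouts in Traefik. `respondingTimeouts` is set in the static configuration of the entrypoint; for a per-service forwarding timeout, use a `ServersTransport`. The router annotation only selects which entrypoint serves the route:
   ```yaml
   # Selects the entrypoint; the timeouts themselves are configured on
   # that entrypoint (respondingTimeouts) or on a ServersTransport
   traefik.ingress.kubernetes.io/router.entrypoints: websecure
   ```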

### If Command Output Is Too Large

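1. Ensure commands enforce an output size limit so that a single command cannot buffer unbounded output
2. Check for runaway commands:
   ```bash
   # --tail=-1: with a label selector, kubectl logs returns only the last 10 lines per pod by default
   kubectl -n rdev logs -l app=rdev-api --tail=-1 | grep -c "output line"
   ```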
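
As an illustration of an output limit, the sketch below caps what a command can emit. It is a shell analogy, not the service's actual mechanism; `run_capped` is a hypothetical helper.

```bash
# run_capped MAX_BYTES CMD [ARGS...]: run CMD and pass through at most
# MAX_BYTES of its combined stdout/stderr; anything beyond that is dropped.
run_capped() {
  local max=$1
  shift
  # head exits after MAX_BYTES, so a still-writing command receives SIGPIPE
  "$@" 2>&1 | head -c "$max"
}
```

The same idea applies inside the service: stop reading (or stop buffering) once a fixed budget is reached, rather than holding the full output in memory.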

### If Cache Is Growing

1. Reduce the cache TTL (changing the environment variable triggers a rolling restart):
   ```bash
   kubectl -n rdev set env deployment/rdev-api CACHE_TTL=15s
   ```

## Verification

```bash
# Confirm memory has stabilized
kubectl -n rdev top pod -l app=rdev-api

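# Check for new OOMKill events (most recent last; `kubectl get events` has no
# --since flag, so compare timestamps against the time of the remediation)
kubectl -n rdev get events --field-selector reason=OOMKilled --sort-by=.lastTimestamp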

# Verify service is healthy
curl -s http://rdev-api:8080/ready
```
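
"Stabilized" is easier to judge from several samples than from a single reading. One possible check is sketched below. `is_stable` is a hypothetical helper, and the tolerance is an assumption to tune per service.

```bash
# is_stable TOLERANCE_PCT: read memory samples (MiB, one per line, oldest
# first) from stdin. Succeeds if the last sample is no more than
# TOLERANCE_PCT above the first, i.e. usage is not still climbing.
is_stable() {
  awk -v tol="$1" '
    NF == 0 { next }                     # ignore blank lines
    !seen   { first = $1; seen = 1 }
            { last = $1 }
    END {
      if (!seen || first == 0) exit 2    # no usable samples
      growth = (last - first) * 100 / first
      if (growth <= tol) exit 0
      exit 1
    }
  '
}
```

Collect samples a few minutes apart from `kubectl -n rdev top pod`, strip the `Mi` suffix, and pipe them in, for example `printf '400\n405\n410\n' | is_stable 5`.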

## Post-Incident

1. Analyze heap profile for memory leaks
2. Review SSE connection lifecycle
3. Consider implementing output size limits
4. Update memory limits based on findings
5. Consider adding memory-based HPA
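
For item 5, a memory-based HPA might look like the sketch below. The replica counts and target are placeholders, and `hpa_manifest` is a hypothetical helper that only prints the manifest. Note that `Utilization` targets are measured against the container's memory *request*, not its limit, so the deployment must set a request. An HPA spreads load across more pods; it does not fix a leak.

```bash
# hpa_manifest MIN MAX TARGET_PCT: print an autoscaling/v2 HPA manifest for
# deployment/rdev-api that scales on average memory utilization.
hpa_manifest() {
  cat <<EOF
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: rdev-api
  namespace: rdev
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: rdev-api
  minReplicas: $1
  maxReplicas: $2
  metrics:
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: $3
EOF
}
```

Usage: review the output of `hpa_manifest 2 5 70`, then pipe it to `kubectl apply -f -`.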