rdev/docs/operations/runbooks/high-cpu.md

# Runbook: High CPU Usage

## Alert

**RdevAPIHighCPU**: CPU usage exceeds 80% for 5+ minutes

## Impact

- Slow request processing
- Increased latency
- Potential request timeouts

## Investigation

### 1. Confirm the Issue

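```bash
# Check current CPU usage
kubectl -n rdev top pod -l app=rdev-api

# Check for recent restarts or terminations (e.g. OOMKilled)
kubectl -n rdev get pod -l app=rdev-api -o jsonpath='{.items[*].status.containerStatuses[*].lastState}'

# Check CPU throttling: nr_throttled counts throttled CFS periods
# (path shown is cgroup v2; on cgroup v1 use /sys/fs/cgroup/cpu/cpu.stat)
kubectl -n rdev exec deployment/rdev-api -- cat /sys/fs/cgroup/cpu.stat
```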
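
Throttling is easier to judge as a ratio than as a raw counter. The sketch below assumes the container's cgroup `cpu.stat` file is readable and contains the standard `nr_periods` and `nr_throttled` fields; the helper name `throttle_pct` is hypothetical.

```bash
# Print the percentage of CFS scheduling periods in which the container
# was throttled. Reads the contents of cpu.stat on stdin.
throttle_pct() {
  awk '
    $1 == "nr_periods"   { periods = $2 }
    $1 == "nr_throttled" { throttled = $2 }
    END {
      if (periods > 0) printf "%.1f\n", 100 * throttled / periods
      else print "0.0"
    }'
}

# Example (assumes cgroup v2 and that `cat` exists in the container):
#   kubectl -n rdev exec deployment/rdev-api -- cat /sys/fs/cgroup/cpu.stat | throttle_pct
```

The counters are cumulative since the container started, so compare two readings taken a minute apart to see whether throttling is happening now.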

### 2. Identify the Cause

```bash
# Check request rate
curl -s http://rdev-api:8080/metrics | grep http_requests_total

# Check active commands
curl -s http://rdev-api:8080/metrics | grep commands_active

# Check logs for errors (with -l, kubectl returns only the last 10 lines
# per pod unless --tail is set)
kubectl -n rdev logs -l app=rdev-api --since=5m --tail=-1 | grep -i error
```
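
A raw counter such as `http_requests_total` only becomes a rate when two samples are compared. The following is a minimal sketch, assuming the endpoint serves the Prometheus text format without timestamps; the helper name `sum_counter` is hypothetical.

```bash
# Sum every series of one metric from Prometheus text exposition on stdin.
# $1 = metric name, e.g. http_requests_total
sum_counter() {
  awk -v name="$1" '
    /^#/ { next }                 # skip HELP/TYPE comments
    {
      metric = $1
      sub(/\{.*/, "", metric)     # strip the label set, if any
      if (metric == name) total += $NF
    }
    END { printf "%d\n", total }'
}

# Example: approximate requests per second over a 10-second window
#   a=$(curl -s http://rdev-api:8080/metrics | sum_counter http_requests_total)
#   sleep 10
#   b=$(curl -s http://rdev-api:8080/metrics | sum_counter http_requests_total)
#   echo "$(( (b - a) / 10 )) req/s"
```

If the service address resolves to several pods, the two samples may come from different replicas, so treat the result as a rough indication only.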

### 3. Check for Hot Paths

If possible, capture a CPU profile:

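```bash
# Pick one pod so the capture and the copy target the same container
POD=$(kubectl -n rdev get pod -l app=rdev-api -o jsonpath='{.items[0].metadata.name}')

# Start 30-second profile (quote the URL so the shell does not expand "?")
kubectl -n rdev exec "$POD" -- \
  curl -s -o /tmp/cpu.prof 'localhost:8080/debug/pprof/profile?seconds=30'

# Copy profile locally (kubectl cp needs a pod name and tar in the container)
kubectl -n rdev cp "$POD:/tmp/cpu.prof" cpu.prof

# Analyze
go tool pprof cpu.prof
```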

## Remediation

### Immediate: Scale Up

```bash
# Increase replicas
kubectl -n rdev scale deployment/rdev-api --replicas=4

# Verify new pods are running
kubectl -n rdev get pods -l app=rdev-api -w
```
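
The replica count can be estimated rather than guessed. This sketch applies the same proportional rule the Horizontal Pod Autoscaler documents, `ceil(currentReplicas * currentUtilization / targetUtilization)`; the helper name `desired_replicas` and the example numbers are illustrative assumptions, not values from this deployment.

```bash
# Estimate how many replicas bring average CPU utilization down to a target.
# $1 = current replicas, $2 = current average CPU %, $3 = target CPU %
desired_replicas() {
  awk -v n="$1" -v cur="$2" -v target="$3" 'BEGIN {
    if (target <= 0) exit 1
    d = n * cur / target
    r = int(d)
    if (r < d) r++      # round up to a whole replica
    print r
  }'
}

# Example: 3 replicas averaging 85% CPU, aiming for 60%
#   kubectl -n rdev scale deployment/rdev-api --replicas="$(desired_replicas 3 85 60)"
```

The estimate assumes load spreads evenly across replicas, which does not hold if a single hot pod is causing the alert.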

### Short-term: Increase Limits

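If pods are being CPU-throttled but not OOM-killed, raise the CPU limit. Patching the pod template triggers a rolling restart, and the `replace` operation fails if no CPU limit is currently set: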

```bash
kubectl -n rdev patch deployment rdev-api --type='json' -p='[
  {"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/cpu", "value": "1000m"}
]'
```

### If Caused by Command Load

1. Reduce concurrent command limit:
   ```bash
   kubectl -n rdev set env deployment/rdev-api CONCURRENT_COMMANDS=3
   ```

2. Investigate which commands are heavy:
   ```bash
   kubectl -n rdev logs -l app=rdev-api --tail=-1 | grep "command started" | tail -20
   ```

### If Caused by Request Volume

1. Lower rate limits temporarily:
   ```bash
   kubectl -n rdev set env deployment/rdev-api RATE_LIMIT_RPS=5
   ```

2. Identify high-volume clients from logs
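
One way to find high-volume clients is to count log lines per client. The sketch below assumes each request log line carries a `client=<id>` key-value field. That field name is hypothetical, as is the helper name `top_clients`, so adjust the pattern to the actual log format.

```bash
# Count requests per client from log lines on stdin, busiest first.
# $1 = number of clients to show (default 5)
top_clients() {
  grep -o 'client=[^ ]*' \
    | sort | uniq -c | sort -rn \
    | head -n "${1:-5}" \
    | awk '{ print $1, $2 }'      # normalize uniq's padded output
}

# Example:
#   kubectl -n rdev logs -l app=rdev-api --since=5m --tail=-1 | top_clients 10
```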

## Verification

```bash
# Confirm CPU has stabilized
kubectl -n rdev top pod -l app=rdev-api

# Check request latency is normal
curl -s http://rdev-api:8080/metrics | grep request_duration
```
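
To avoid eyeballing the `kubectl top` table, the output can be filtered for pods still above a threshold. This is a sketch under the assumption that the CPU column is reported in millicores (for example `950m`) or whole cores; the helper name `pods_over` and the 800m threshold are illustrative.

```bash
# List pods whose CPU usage exceeds a threshold.
# $1 = threshold in millicores; stdin = output of `kubectl top pod`
pods_over() {
  awk -v limit="$1" '
    NR == 1 && $1 == "NAME" { next }          # skip the header row
    {
      cpu = $2
      if (cpu ~ /m$/) { sub(/m$/, "", cpu); mc = cpu + 0 }
      else            { mc = cpu * 1000 }     # whole cores to millicores
      if (mc > limit) print $1, mc "m"
    }'
}

# Example: show pods using more than 800m
#   kubectl -n rdev top pod -l app=rdev-api | pods_over 800
```

Empty output means no pod is above the threshold.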

## Post-Incident

1. Review capacity planning
2. Consider enabling a Horizontal Pod Autoscaler (HPA) if one is not already configured
3. Analyze traffic patterns
4. Update resource requests/limits