# Runbook: High CPU Usage ## Alert **RdevAPIHighCPU**: CPU usage exceeds 80% for 5+ minutes ## Impact - Slow request processing - Increased latency - Potential request timeouts ## Investigation ### 1. Confirm the Issue ```bash # Check current CPU usage kubectl -n rdev top pod -l app=rdev-api # Check CPU throttling kubectl -n rdev get pod -l app=rdev-api -o jsonpath='{.items[*].status.containerStatuses[*].lastState}' ``` ### 2. Identify the Cause ```bash # Check request rate curl -s http://rdev-api:8080/metrics | grep http_requests_total # Check active commands curl -s http://rdev-api:8080/metrics | grep commands_active # Check logs for errors kubectl -n rdev logs -l app=rdev-api --since=5m | grep -i error ``` ### 3. Check for Hot Paths If possible, capture a CPU profile: ```bash # Start 30-second profile kubectl -n rdev exec -it deployment/rdev-api -- \ curl -o /tmp/cpu.prof localhost:8080/debug/pprof/profile?seconds=30 # Copy profile locally kubectl -n rdev cp deployment/rdev-api:/tmp/cpu.prof cpu.prof # Analyze go tool pprof cpu.prof ``` ## Remediation ### Immediate: Scale Up ```bash # Increase replicas kubectl -n rdev scale deployment/rdev-api --replicas=4 # Verify new pods are running kubectl -n rdev get pods -l app=rdev-api -w ``` ### Short-term: Increase Limits If throttling is occurring but not OOM: ```bash kubectl -n rdev patch deployment rdev-api --type='json' -p='[ {"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/cpu", "value": "1000m"} ]' ``` ### If Caused by Command Load 1. Reduce concurrent command limit: ```bash kubectl -n rdev set env deployment/rdev-api CONCURRENT_COMMANDS=3 ``` 2. Investigate which commands are heavy: ```bash kubectl -n rdev logs -l app=rdev-api | grep "command started" | tail -20 ``` ### If Caused by Request Volume 1. Lower rate limits temporarily: ```bash kubectl -n rdev set env deployment/rdev-api RATE_LIMIT_RPS=5 ``` 2. Identify high-volume clients from logs ## Verification ```bash # Confirm CPU has stabilized kubectl -n rdev top pod -l app=rdev-api # Check request latency is normal curl -s http://rdev-api:8080/metrics | grep request_duration ``` ## Post-Incident 1. Review capacity planning 2. Consider enabling HPA if not already 3. Analyze traffic patterns 4. Update resource requests/limits