rdev/docs/operations/runbooks/high-cpu.md
jordan 72d16929ca feat: Implement hexagonal architecture with services, webhooks, queue, and telemetry
2026-01-25 19:57:46 -07:00

Runbook: High CPU Usage

Alert

RdevAPIHighCPU: CPU usage exceeds 80% for 5+ minutes

Impact

  • Slow request processing
  • Increased latency
  • Potential request timeouts

Investigation

1. Confirm the Issue

# Check current CPU usage
kubectl -n rdev top pod -l app=rdev-api

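# Check for recent container restarts (lastState records the previous termination, not throttling)
kubectl -n rdev get pod -l app=rdev-api -o jsonpath='{.items[*].status.containerStatuses[*].lastState}'

# Check CPU throttling via the cgroup counters (cgroup v2 path; on cgroup v1 use /sys/fs/cgroup/cpu/cpu.stat)
kubectl -n rdev exec deployment/rdev-api -- cat /sys/fs/cgroup/cpu.stat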
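
CPU throttling shows up in the `nr_periods` and `nr_throttled` counters of the container's cgroup `cpu.stat` file. The sketch below turns those counters into a percentage. `throttle_pct` is a hypothetical helper name, and the example path assumes cgroup v2 and an image that ships `cat`.

```shell
# Print the percentage of CFS scheduling periods in which the container was throttled.
# Reads cpu.stat content on stdin; nr_periods and nr_throttled exist in cgroup v1 and v2.
throttle_pct() {
  awk '
    $1 == "nr_periods"   { periods = $2 }
    $1 == "nr_throttled" { throttled = $2 }
    END {
      if (periods > 0) printf "%.1f\n", 100 * throttled / periods
      else print "0.0"
    }'
}

# Example (assumed cgroup v2 path):
#   kubectl -n rdev exec deployment/rdev-api -- cat /sys/fs/cgroup/cpu.stat | throttle_pct
```

The counters are cumulative since the container started, so a high value may reflect an earlier spike. Sampling twice and comparing gives a better picture of current throttling.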

2. Identify the Cause

# Check request rate
curl -s http://rdev-api:8080/metrics | grep http_requests_total

# Check active commands
curl -s http://rdev-api:8080/metrics | grep commands_active

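# Check logs for errors (--tail=-1 is needed because kubectl shows only the last 10 lines per pod when a selector is used)
kubectl -n rdev logs -l app=rdev-api --since=5m --tail=-1 | grep -i error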
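
The raw `http_requests_total` output is a set of per-label counters, which is hard to read at a glance. As a rough sketch, the hypothetical helper `sum_metric` below adds up every series of one metric from Prometheus text-format input. It assumes the exporter does not append timestamps, so the value is the last field on each line.

```shell
# Sum every series of one metric from Prometheus text-format input on stdin.
sum_metric() {
  awk -v name="$1" '
    /^#/ { next }                          # skip HELP/TYPE comment lines
    {
      metric = $1
      sub(/\{.*/, "", metric)              # strip the label set, if any
      if (metric == name) total += $NF     # value is assumed to be the last field
    }
    END { printf "%d\n", total }'
}

# Example: approximate requests per second over a 10-second window.
#   a=$(curl -s http://rdev-api:8080/metrics | sum_metric http_requests_total)
#   sleep 10
#   b=$(curl -s http://rdev-api:8080/metrics | sum_metric http_requests_total)
#   echo $(( (b - a) / 10 ))
```

Counters are per pod, so two requests through the Service address may reach different replicas and produce a meaningless delta. Sampling a single pod (for example through `kubectl port-forward`) avoids that.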

3. Check for Hot Paths

If possible, capture a CPU profile:

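# Pick one pod so the profile is captured and copied from the same place
POD=$(kubectl -n rdev get pod -l app=rdev-api -o jsonpath='{.items[0].metadata.name}')

# Capture a 30-second profile (assumes curl is available in the container image)
kubectl -n rdev exec "$POD" -- \
  curl -s -o /tmp/cpu.prof 'localhost:8080/debug/pprof/profile?seconds=30'

# Copy profile locally (kubectl cp takes a pod name, not deployment/<name>, and needs tar in the container)
kubectl -n rdev cp "$POD":/tmp/cpu.prof cpu.prof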

# Analyze
go tool pprof cpu.prof

Remediation

Immediate: Scale Up

# Increase replicas
kubectl -n rdev scale deployment/rdev-api --replicas=4

# Verify new pods are running
kubectl -n rdev get pods -l app=rdev-api -w

Short-term: Increase Limits

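If the pods are being CPU-throttled but not OOM-killed, raise the CPU limit. The replace operation below assumes the first container already has a CPU limit set: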

kubectl -n rdev patch deployment rdev-api --type='json' -p='[
  {"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/cpu", "value": "1000m"}
]'

If Caused by Command Load

  1. Reduce the concurrent command limit (changing the environment triggers a rolling restart):

    kubectl -n rdev set env deployment/rdev-api CONCURRENT_COMMANDS=3
    
  2. Investigate which commands are heavy:

    kubectl -n rdev logs -l app=rdev-api --tail=-1 | grep "command started" | tail -20
    

If Caused by Request Volume

  1. Lower rate limits temporarily:

    kubectl -n rdev set env deployment/rdev-api RATE_LIMIT_RPS=5
    
  2. Identify high-volume clients from logs
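
One way to find high-volume clients is to count requests per client in the access logs. The sketch below assumes each request log line carries a `client_ip=<address>` field. That field name is hypothetical, so adjust the pattern to the real log format. `top_clients` is likewise an illustrative helper name.

```shell
# Print the N most frequent client_ip values (default 5) from log lines on stdin.
# Output is "count client_ip=<address>", highest count first.
top_clients() {
  grep -o 'client_ip=[^ ]*' | sort | uniq -c | sort -rn | head -n "${1:-5}"
}

# Example:
#   kubectl -n rdev logs -l app=rdev-api --since=15m --tail=-1 | top_clients 10
```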

Verification

# Confirm CPU has stabilized
kubectl -n rdev top pod -l app=rdev-api

# Check request latency is normal
curl -s http://rdev-api:8080/metrics | grep request_duration
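
If the duration metric is a Prometheus histogram or summary, its `_sum` and `_count` series give a quick mean latency. This is a minimal sketch: `avg_latency` is a hypothetical helper, and the metric name `http_request_duration_seconds` in the example is an assumption, so substitute whatever the grep above actually returns.

```shell
# Print the mean observed duration for a histogram/summary metric,
# computed as sum(<name>_sum) / sum(<name>_count) across all label sets.
avg_latency() {
  awk -v name="$1" '
    /^#/ { next }                          # skip HELP/TYPE comment lines
    {
      metric = $1
      sub(/\{.*/, "", metric)              # strip the label set, if any
      if (metric == name "_sum")   sum += $NF
      if (metric == name "_count") count += $NF
    }
    END {
      if (count > 0) printf "%.3f\n", sum / count
      else print "n/a"
    }'
}

# Example (metric name is an assumption):
#   curl -s http://rdev-api:8080/metrics | avg_latency http_request_duration_seconds
```

The result is an average since the process started, so it reacts slowly to recent changes. Comparing two samples taken a minute apart gives a better view of whether latency has recovered.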

Post-Incident

  1. Review capacity planning
  2. Consider enabling a HorizontalPodAutoscaler (HPA) if one is not already configured
  3. Analyze traffic patterns
  4. Update resource requests/limits