Major refactoring to hexagonal (ports & adapters) architecture: - Add service layer (apikey_service, project_service) for business logic - Add webhook system with dispatcher and delivery tracking - Add command queue with priority-based processing - Add rate limiting with sliding window algorithm - Add audit logging for command execution - Add OpenTelemetry integration (traces, metrics, spans) - Add circuit breaker for fault tolerance - Add cached repository wrapper for performance - Add comprehensive validation package - Add Kubernetes client integration for pod management - Add database migrations (allowed_ips, audit_log, rate_limiting, queue, webhooks) - Add network policy and PodDisruptionBudget for k8s - Remove legacy executor and projects/registry packages - Untrack secrets.yaml (now managed via envault) - Add coverage.out to .gitignore - Add e2e test infrastructure with docker-compose - Add comprehensive documentation (API, architecture, operations, plans) - Add golangci-lint config and pre-commit hook Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
113 lines
2.3 KiB
Markdown
113 lines
2.3 KiB
Markdown
# Runbook: High CPU Usage
|
|
|
|
## Alert
|
|
|
|
**RdevAPIHighCPU**: CPU usage exceeds 80% for 5+ minutes
|
|
|
|
## Impact
|
|
|
|
- Slow request processing
|
|
- Increased latency
|
|
- Potential request timeouts
|
|
|
|
## Investigation
|
|
|
|
### 1. Confirm the Issue
|
|
|
|
```bash
|
|
# Check current CPU usage
|
|
kubectl -n rdev top pod -l app=rdev-api
|
|
|
|
# Check CPU throttling
|
|
kubectl -n rdev get pod -l app=rdev-api -o jsonpath='{.items[*].status.containerStatuses[*].lastState}'
|
|
```
|
|
|
|
### 2. Identify the Cause
|
|
|
|
```bash
|
|
# Check request rate
|
|
curl -s http://rdev-api:8080/metrics | grep http_requests_total
|
|
|
|
# Check active commands
|
|
curl -s http://rdev-api:8080/metrics | grep commands_active
|
|
|
|
# Check logs for errors
|
|
kubectl -n rdev logs -l app=rdev-api --since=5m | grep -i error
|
|
```
|
|
|
|
### 3. Check for Hot Paths
|
|
|
|
If possible, capture a CPU profile:
|
|
|
|
```bash
|
|
# Start 30-second profile
|
|
kubectl -n rdev exec -it deployment/rdev-api -- \
|
|
curl -o /tmp/cpu.prof localhost:8080/debug/pprof/profile?seconds=30
|
|
|
|
# Copy profile locally
|
|
kubectl -n rdev cp deployment/rdev-api:/tmp/cpu.prof cpu.prof
|
|
|
|
# Analyze
|
|
go tool pprof cpu.prof
|
|
```
|
|
|
|
## Remediation
|
|
|
|
### Immediate: Scale Up
|
|
|
|
```bash
|
|
# Increase replicas
|
|
kubectl -n rdev scale deployment/rdev-api --replicas=4
|
|
|
|
# Verify new pods are running
|
|
kubectl -n rdev get pods -l app=rdev-api -w
|
|
```
|
|
|
|
### Short-term: Increase Limits
|
|
|
|
If throttling is occurring but not OOM:
|
|
|
|
```bash
|
|
kubectl -n rdev patch deployment rdev-api --type='json' -p='[
|
|
{"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/cpu", "value": "1000m"}
|
|
]'
|
|
```
|
|
|
|
### If Caused by Command Load
|
|
|
|
1. Reduce concurrent command limit:
|
|
```bash
|
|
kubectl -n rdev set env deployment/rdev-api CONCURRENT_COMMANDS=3
|
|
```
|
|
|
|
2. Investigate which commands are heavy:
|
|
```bash
|
|
kubectl -n rdev logs -l app=rdev-api | grep "command started" | tail -20
|
|
```
|
|
|
|
### If Caused by Request Volume
|
|
|
|
1. Lower rate limits temporarily:
|
|
```bash
|
|
kubectl -n rdev set env deployment/rdev-api RATE_LIMIT_RPS=5
|
|
```
|
|
|
|
2. Identify high-volume clients from logs
|
|
|
|
## Verification
|
|
|
|
```bash
|
|
# Confirm CPU has stabilized
|
|
kubectl -n rdev top pod -l app=rdev-api
|
|
|
|
# Check request latency is normal
|
|
curl -s http://rdev-api:8080/metrics | grep request_duration
|
|
```
|
|
|
|
## Post-Incident
|
|
|
|
1. Review capacity planning
|
|
2. Consider enabling HPA if not already
|
|
3. Analyze traffic patterns
|
|
4. Update resource requests/limits
|