rdev/docs/operations/runbooks/high-memory.md
# Runbook: High Memory Usage
## Alert
**RdevAPIHighMemory**: Memory usage exceeds 80% of limit
## Impact
- Risk of OOMKill
- Service disruption
- Lost in-flight requests
## Investigation
### 1. Confirm the Issue
```bash
# Check current memory usage
kubectl -n rdev top pod -l app=rdev-api
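# Check for OOMKilled containers (the reason is recorded in each container's last terminated state)
kubectl -n rdev get pods -l app=rdev-api -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}'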
# Check pod restarts
kubectl -n rdev get pods -l app=rdev-api -o jsonpath='{.items[*].status.containerStatuses[*].restartCount}'
```
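The alert threshold is a percentage of the memory limit, while `kubectl top` reports absolute usage. A minimal sketch for converting between the two follows. The helper names `to_mib` and `mem_pct` are hypothetical, and the conversion assumes whole-number quantities with `Ki`, `Mi`, or `Gi` suffixes only.

```bash
# to_mib QUANTITY: convert a Kubernetes memory quantity to whole MiB.
# Assumption: integer values with Ki/Mi/Gi suffixes; anything else is rejected.
to_mib() {
  local q="$1"
  case "$q" in
    *Gi) echo $(( ${q%Gi} * 1024 )) ;;
    *Mi) echo "${q%Mi}" ;;
    *Ki) echo $(( ${q%Ki} / 1024 )) ;;
    *) echo "unsupported quantity: $q" >&2; return 1 ;;
  esac
}

# mem_pct USAGE LIMIT: print usage as an integer percentage of the limit.
mem_pct() {
  local used limit
  used=$(to_mib "$1") || return 1
  limit=$(to_mib "$2") || return 1
  if [ "$limit" -le 0 ]; then
    echo "limit must be greater than zero" >&2
    return 1
  fi
  echo $(( used * 100 / limit ))
}

# Example wiring (not run here; <pod-name> is a placeholder):
# usage=$(kubectl -n rdev top pod <pod-name> --no-headers | awk '{print $3}')
# limit=$(kubectl -n rdev get pod <pod-name> -o jsonpath='{.spec.containers[0].resources.limits.memory}')
# mem_pct "$usage" "$limit"
```

A result above 80 matches the alert condition; above 95 matches the "critical" threshold used under Remediation.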
### 2. Identify the Cause
```bash
# Check active SSE connections (potential memory leak source)
curl -s http://rdev-api:8080/metrics | grep sse_connections_active
# Check active commands
curl -s http://rdev-api:8080/metrics | grep commands_active
# Capture a heap profile to the local machine (assumes curl is available in the container image)
kubectl -n rdev exec deployment/rdev-api -- \
curl -s localhost:8080/debug/pprof/heap > /tmp/heap.prof
```
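If a gauge carries labels, `grep` returns one line per series rather than a single number. The sketch below sums those lines into one value. `metric_sum` is a hypothetical helper, and it assumes the Prometheus text exposition format with no trailing timestamps and no spaces inside label values.

```bash
# metric_sum NAME: read Prometheus text-format metrics on stdin and print the
# sum of all sample values whose series name contains NAME.
metric_sum() {
  awk -v name="$1" '
    /^#/ { next }                    # skip HELP and TYPE comment lines
    index($1, name) { sum += $NF }   # $1 is name{labels}, $NF is the sample value
    END { print sum + 0 }            # prints 0 when nothing matched
  '
}

# Example wiring (not run here):
# curl -s http://rdev-api:8080/metrics | metric_sum sse_connections_active
```

Because the match is a substring match, it behaves like the `grep` calls above and also picks up prefixed metric names.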
### 3. Common Causes
- **SSE connection leaks**: Clients not closing connections properly
- **Large command outputs**: Commands producing excessive output
- **Many concurrent commands**: Each command buffers output
- **Cache growth**: Project cache not expiring
## Remediation
### Immediate: Restart Pod
If memory usage is critical (above 95% of the limit):
```bash
# Restart specific pod
kubectl -n rdev delete pod <pod-name>
# Or perform a rolling restart of all pods
kubectl -n rdev rollout restart deployment/rdev-api
```
### Short-term: Increase Limits
```bash
kubectl -n rdev patch deployment rdev-api --type='json' -p='[
{"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/memory", "value": "1Gi"}
]'
```
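When the new limit comes from a variable, building the JSON patch by hand is error-prone. The following is a small sketch of a patch builder. `memory_patch` is a hypothetical helper, and it assumes the target container sits at index 0 unless told otherwise, as in the command above.

```bash
# memory_patch VALUE [CONTAINER_INDEX]: print a JSON Patch document that
# replaces the memory limit of one container in the pod template.
# Assumption: VALUE is a plain quantity such as 1Gi (no quotes or backslashes).
memory_patch() {
  local value="$1" index="${2:-0}"
  printf '[{"op":"replace","path":"/spec/template/spec/containers/%s/resources/limits/memory","value":"%s"}]\n' \
    "$index" "$value"
}

# Example wiring (not run here):
# kubectl -n rdev get deployment rdev-api \
#   -o jsonpath='{.spec.template.spec.containers[0].resources.limits.memory}'   # current limit
# kubectl -n rdev patch deployment rdev-api --type=json -p "$(memory_patch 1Gi)"
```

Reading the current limit first makes it easier to record the before and after values in the incident notes.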
### If SSE Connections Are Leaking
1. Check for stuck connections:
```bash
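# --tail=-1 is required: with a label selector, kubectl logs returns only the last 10 lines per pod by default
kubectl -n rdev logs -l app=rdev-api --tail=-1 | grep "SSE connection" | tail -50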
```
2. Reduce the read timeout on the ingress (ingress-nginx annotation, value in seconds):
```yaml
nginx.ingress.kubernetes.io/proxy-read-timeout: "1800" # 30 min max
```
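One way to tell a leak from a busy period is to sample the `sse_connections_active` gauge over several minutes: a gauge that only climbs and never drops is a common sign of connections not being released. The sketch below checks a series of samples for that pattern. `is_monotonic_growth` is a hypothetical helper and it assumes one numeric sample per line.

```bash
# is_monotonic_growth: read one numeric sample per line on stdin.
# Exits 0 when the series never decreases and ends higher than it started,
# and 1 otherwise (including when fewer than two samples are given).
is_monotonic_growth() {
  awk '
    NR == 1 { first = $1; prev = $1; next }
    $1 < prev { dropped = 1 }        # any decrease means connections were released
    { prev = $1 }
    END { exit !(NR > 1 && !dropped && prev > first) }
  '
}

# Example wiring (not run here; assumes the gauge is a single unlabelled series):
# for i in 1 2 3 4 5; do
#   curl -s http://rdev-api:8080/metrics | awk '!/^#/ && /sse_connections_active/ {print $NF}'
#   sleep 60
# done | is_monotonic_growth && echo "gauge only grew: possible leak"
```

A positive result is only a hint; the heap profile remains the more reliable evidence.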
### If Command Output Is Too Large
1. Confirm that commands enforce an output size limit; unbounded output is buffered in memory
2. Check for runaway commands:
```bash
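# --tail=-1 is required: with a label selector, kubectl logs returns only the last 10 lines per pod by default
kubectl -n rdev logs -l app=rdev-api --tail=-1 | grep "output line" | wc -l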
```
### If Cache Is Growing
1. Reduce cache TTL:
```bash
# Changing the pod template triggers a rolling restart of the deployment
kubectl -n rdev set env deployment/rdev-api CACHE_TTL=15s
```
## Verification
```bash
# Confirm memory has stabilized
kubectl -n rdev top pod -l app=rdev-api
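# Check for new OOM kills: restart counts should stop increasing and recent events should show no OOM activity
# (kubectl get events has no --since flag, so sort by timestamp instead)
kubectl -n rdev get pods -l app=rdev-api
kubectl -n rdev get events --sort-by=.lastTimestamp | tail -20
# Verify the service is healthy (-f makes curl exit non-zero on HTTP errors)
curl -fsS http://rdev-api:8080/ready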
```
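Right after a restart the readiness check may fail for a short while, so a single `curl` can give a false negative. A minimal retry sketch follows. `retry` is a hypothetical helper; the attempt count and delay in the example are placeholders to tune.

```bash
# retry ATTEMPTS DELAY CMD...: run CMD until it succeeds or ATTEMPTS is reached,
# sleeping DELAY seconds between tries. Returns 0 on success, 1 otherwise.
retry() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"               # no sleep after the final failed attempt
    fi
    i=$((i + 1))
  done
  return 1
}

# Example wiring (not run here):
# retry 10 3 curl -fsS http://rdev-api:8080/ready
```

The `-f` flag matters here: without it, `curl` exits 0 on an HTTP error response and the loop would stop on the first try.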
## Post-Incident
1. Analyze heap profile for memory leaks
2. Review SSE connection lifecycle
3. Consider implementing output size limits
4. Update memory limits based on findings
5. Consider adding memory-based HPA