# Runbook: High Memory Usage

## Alert

**RdevAPIHighMemory**: Memory usage exceeds 80% of limit

## Impact

- Risk of OOMKill
- Service disruption
- Lost in-flight requests

## Investigation

### 1. Confirm the Issue

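```bash
# Check current memory usage
kubectl -n rdev top pod -l app=rdev-api

# Check for OOM-related events (this may return nothing; an OOM kill is
# recorded more reliably in the container's last state, checked below)
kubectl -n rdev get events --field-selector reason=OOMKilled

# Check restart counts and the reason for each container's last termination
kubectl -n rdev get pods -l app=rdev-api -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].restartCount}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}'
```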
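
To relate a `kubectl top` reading to the thresholds used in this runbook (alert above 80% of the limit, critical above 95%), the arithmetic can be sketched as a small helper. The function name `mem_state` and its two arguments (usage and limit, both in Mi) are assumptions for illustration and are not part of rdev.

```bash
# mem_state USAGE_MI LIMIT_MI
# Hypothetical helper: classifies memory usage against the runbook thresholds.
# Prints "critical" above 95% of the limit, "alert" above 80%, otherwise "ok".
mem_state() {
  usage_mi=$1
  limit_mi=$2
  if [ "$limit_mi" -le 0 ]; then
    echo "unknown"
    return 1
  fi
  # Compare cross-multiplied integers so no precision is lost to division.
  if [ $((usage_mi * 100)) -gt $((limit_mi * 95)) ]; then
    echo "critical"
  elif [ $((usage_mi * 100)) -gt $((limit_mi * 80)) ]; then
    echo "alert"
  else
    echo "ok"
  fi
}
```

For example, `mem_state 410 512` prints `alert`, because 410Mi is just over 80% of a 512Mi limit.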

### 2. Identify the Cause

```bash
# Check active SSE connections (potential memory leak source)
curl -s http://rdev-api:8080/metrics | grep sse_connections_active

# Check active commands
curl -s http://rdev-api:8080/metrics | grep commands_active

# Capture a heap profile (saved as /tmp/heap.prof inside the container)
kubectl -n rdev exec -it deployment/rdev-api -- \
  curl -o /tmp/heap.prof localhost:8080/debug/pprof/heap
```
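
When a metric is exported with several label sets, the `grep` output above has to be added up by hand. A minimal sketch of a helper that sums the samples is shown here, assuming the endpoint serves the standard Prometheus text format. The name `metric_sum` is hypothetical, and it matches metric names by substring, like the `grep` calls above, because the exact exported names are not stated in this runbook.

```bash
# metric_sum NAME
# Hypothetical helper: reads Prometheus text-format metrics on stdin and
# prints the sum of all samples whose metric name contains NAME.
metric_sum() {
  awk -v name="$1" '
    /^#/ { next }                      # skip HELP and TYPE comment lines
    {
      line = $0
      sub(/[{].*[}]/, "", line)        # drop the label set, if any
      split(line, field, " ")          # field[1] = metric name, field[2] = value
      if (index(field[1], name) > 0) sum += field[2]
    }
    END { print sum + 0 }
  '
}
```

Usage would look like `curl -s http://rdev-api:8080/metrics | metric_sum sse_connections_active`. A total that keeps rising while client traffic stays flat points toward a connection leak.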

### 3. Common Causes

- **SSE connection leaks**: Clients not closing connections properly
- **Large command outputs**: Commands producing excessive output
- **Many concurrent commands**: Each command buffers output
- **Cache growth**: Project cache not expiring

## Remediation

### Immediate: Restart Pod

If memory is critical (>95%):

```bash
# Restart a specific pod
kubectl -n rdev delete pod <pod-name>

# Or perform a rolling restart of all pods
kubectl -n rdev rollout restart deployment/rdev-api
```

### Short-term: Increase Limits

```bash
kubectl -n rdev patch deployment rdev-api --type='json' -p='[
  {"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/memory", "value": "1Gi"}
]'
```

### If SSE Connections Are Leaking

1. Check for stuck connections:
   ```bash
   # --tail=-1 returns the full log; with a label selector, kubectl otherwise
   # shows only the last 10 lines per pod
   kubectl -n rdev logs -l app=rdev-api --tail=-1 | grep "SSE connection" | tail -50
   ```

2. Reduce connection timeout at the Traefik entrypoint level:
   ```yaml
   # Traefik: configure respondingTimeouts at entrypoint level
   # or use ServersTransport for per-service forwarding timeout
   traefik.ingress.kubernetes.io/router.entrypoints: websecure
   ```

### If Command Output Is Too Large

1. Commands should implement output limits
2. Check for runaway commands:
   ```bash
   # Count matching lines across all pods (--tail=-1 returns the full log)
   kubectl -n rdev logs -l app=rdev-api --tail=-1 | grep -c "output line"
   ```
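
The idea behind an output limit in step 1 can be illustrated with a small shell wrapper. This is only a sketch of the technique and not how rdev executes commands. The name `cap_output` is hypothetical, and it assumes a `head` that supports `-c` (GNU and BSD both do).

```bash
# cap_output MAX_BYTES CMD [ARGS...]
# Hypothetical wrapper: runs CMD and passes through at most MAX_BYTES of its
# combined stdout and stderr.
cap_output() {
  max_bytes=$1
  shift
  # head exits after max_bytes; a command that keeps writing then receives
  # SIGPIPE and normally stops instead of filling a buffer.
  "$@" 2>&1 | head -c "$max_bytes"
}
```

For example, `cap_output 5 printf 'abcdefghij'` prints `abcde`. The same principle applies in the service itself: bound the buffer per command rather than buffering everything a command emits.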

### If Cache Is Growing

1. Reduce cache TTL:
   ```bash
   kubectl -n rdev set env deployment/rdev-api CACHE_TTL=15s
   ```

## Verification

```bash
# Confirm memory has stabilized
kubectl -n rdev top pod -l app=rdev-api

# Check for new OOMKill events (sorted by time; the most recent appear last)
kubectl -n rdev get events --field-selector reason=OOMKilled --sort-by=.lastTimestamp

# Verify service is healthy
curl -s http://rdev-api:8080/ready
```

## Post-Incident

1. Analyze heap profile for memory leaks
2. Review SSE connection lifecycle
3. Consider implementing output size limits
4. Update memory limits based on findings
5. Consider adding memory-based HPA
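
For item 5, a memory-based HPA could be sketched as follows. `kubectl autoscale` only takes a CPU target, so the sketch renders an `autoscaling/v2` manifest instead. The function name `render_memory_hpa` and the replica counts and target in the usage example are assumptions, not values taken from the rdev deployment.

```bash
# render_memory_hpa NAME MIN_REPLICAS MAX_REPLICAS TARGET_PERCENT
# Hypothetical helper: prints an autoscaling/v2 HPA manifest that scales the
# Deployment NAME on average memory utilization.
render_memory_hpa() {
  hpa_name=$1
  hpa_min=$2
  hpa_max=$3
  hpa_target=$4
  cat <<EOF
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ${hpa_name}
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ${hpa_name}
  minReplicas: ${hpa_min}
  maxReplicas: ${hpa_max}
  metrics:
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: ${hpa_target}
EOF
}
```

It could be applied with `render_memory_hpa rdev-api 2 5 75 | kubectl -n rdev apply -f -`. Two caveats: HPA utilization is measured against the container's memory request, not its limit, and scaling out spreads load but does not fix a leak, so it complements the heap analysis in item 1 rather than replacing it.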