rdev/docs/operations/runbooks/high-memory.md
jordan a9ad3d8304
2026-02-10 23:16:56 -07:00

Runbook: High Memory Usage

Alert

RdevAPIHighMemory: Memory usage exceeds 80% of limit

Impact

  • Risk of OOMKill
  • Service disruption
  • Lost in-flight requests

Investigation

1. Confirm the Issue

# Check current memory usage
kubectl -n rdev top pod -l app=rdev-api

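# Check for OOMKilled events
kubectl -n rdev get events --field-selector reason=OOMKilled

# Check each container's last termination reason (still shows OOMKilled after events have expired)
kubectl -n rdev get pods -l app=rdev-api -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}'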

# Check pod restarts
kubectl -n rdev get pods -l app=rdev-api -o jsonpath='{.items[*].status.containerStatuses[*].restartCount}'
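
The usage figures from `kubectl top` can be compared against the 80% alert threshold directly. The following is a minimal sketch, not part of the rdev tooling: `pods_over_threshold` is a hypothetical helper, and the 512 Mi limit and the pod names in the sample are assumed for illustration. Substitute the deployment's actual memory limit.

```shell
# Hypothetical helper: reads `kubectl top pod --no-headers` lines on stdin
# (NAME CPU MEMORY) and prints pods at or above PCT percent of LIMIT_MI.
# Assumes memory is reported in Mi; lines in other units are ignored.
pods_over_threshold() {
  limit_mi=$1
  pct=$2
  awk -v limit="$limit_mi" -v pct="$pct" '
    $3 ~ /Mi$/ {
      used = $3
      sub(/Mi$/, "", used)
      if (used * 100 >= limit * pct) printf "%s %d%%\n", $1, used * 100 / limit
    }'
}

# Sample input standing in for real output (pod names and values are made up):
printf '%s\n' 'rdev-api-abc 12m 420Mi' 'rdev-api-def 8m 200Mi' |
  pods_over_threshold 512 80
# prints: rdev-api-abc 82%
```

Against a live cluster, the input would come from `kubectl -n rdev top pod -l app=rdev-api --no-headers`.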

2. Identify the Cause

# Check active SSE connections (potential memory leak source)
curl -s http://rdev-api:8080/metrics | grep sse_connections_active

# Check active commands
curl -s http://rdev-api:8080/metrics | grep commands_active

# Capture a heap profile to the local machine (a copy inside the pod is lost on restart)
kubectl -n rdev exec deployment/rdev-api -- \
  curl -s localhost:8080/debug/pprof/heap > heap.prof
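
A plain `grep` also matches the `# HELP` and `# TYPE` comment lines and leaves per-label samples unsummed. The sketch below totals a metric across all of its label sets. `metric_sum` is a hypothetical helper, and the `project` label in the sample is an assumption; the actual metric names and labels exposed by rdev-api may differ.

```shell
# Hypothetical helper: sums every sample whose metric name contains NAME,
# reading Prometheus text exposition format on stdin. Comment lines are skipped.
# Assumes samples carry no trailing timestamp, so the value is the last field.
metric_sum() {
  awk -v name="$1" '
    /^#/ { next }
    {
      split($1, parts, "{")            # drop the label set, keep the metric name
      if (index(parts[1], name)) { sum += $NF; found = 1 }
    }
    END { if (found) print sum + 0 }'
}

# Sample input standing in for the /metrics endpoint (values are made up):
printf '%s\n' \
  '# TYPE sse_connections_active gauge' \
  'sse_connections_active{project="a"} 12' \
  'sse_connections_active{project="b"} 30' |
  metric_sum sse_connections_active
# prints: 42
```

Against the service, the input would be `curl -s http://rdev-api:8080/metrics`. A total that keeps rising while traffic is flat points toward a leak rather than load.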

3. Common Causes

  • SSE connection leaks: Clients not closing connections properly
  • Large command outputs: Commands producing excessive output
  • Many concurrent commands: Each command buffers output
  • Cache growth: Project cache not expiring
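
To tell these causes apart, a heap profile from the pprof endpoint can be inspected with `go tool pprof`. This is a sketch that assumes a local Go toolchain and that profiles have been copied to the local machine; the function names and the `heap-1.prof` / `heap-2.prof` filenames are hypothetical.

```shell
# Hypothetical helpers around `go tool pprof`; they need a local Go toolchain.

# Largest in-use allocations in a single heap profile.
heap_top() {
  go tool pprof -top -sample_index=inuse_space "$1"
}

# Growth between two profiles captured some minutes apart (older one first).
heap_growth() {
  go tool pprof -top -sample_index=inuse_space -base "$1" "$2"
}

# Usage, assuming two profiles were saved as heap-1.prof and heap-2.prof:
#   heap_top heap-2.prof
#   heap_growth heap-1.prof heap-2.prof
```

The diff is usually the more telling view: a leak tends to show as one call path growing between the two captures, whereas a large but flat entry suggests buffering or cache size.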

Remediation

Immediate: Restart Pod

If memory usage is critical (above 95% of the limit):

# Restart specific pod
kubectl -n rdev delete pod <pod-name>

# Or perform a rolling restart of all pods
kubectl -n rdev rollout restart deployment/rdev-api

Short-term: Increase Limits

kubectl -n rdev patch deployment rdev-api --type='json' -p='[
  {"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/memory", "value": "1Gi"}
]'
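
Two properties of this patch are worth keeping in mind: a JSON Patch `replace` fails if the container has no memory limit set yet, and changing the pod template triggers a rolling update. The sketch below wraps the patch and waits for the rollout to finish. `set_memory_limit` and the `KUBECTL` override are hypothetical conveniences, and the 180s timeout is an arbitrary choice.

```shell
# KUBECTL is overridable, e.g. KUBECTL=echo prints the commands instead of running them.
KUBECTL=${KUBECTL:-kubectl}

# Hypothetical helper: set the memory limit of the first container, then wait
# for the resulting rolling update to complete.
set_memory_limit() {
  new_limit=$1
  "$KUBECTL" -n rdev patch deployment rdev-api --type=json -p "[
    {\"op\": \"replace\", \"path\": \"/spec/template/spec/containers/0/resources/limits/memory\", \"value\": \"$new_limit\"}
  ]" &&
  "$KUBECTL" -n rdev rollout status deployment/rdev-api --timeout=180s
}

# Usage:
#   set_memory_limit 1Gi
```

If the deployment is managed from Git or a template, the same change should be committed there as well, or the next deploy will revert it.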

If SSE Connections Are Leaking

  1. Check for stuck connections:

    kubectl -n rdev logs -l app=rdev-api --tail=-1 | grep "SSE connection" | tail -50
    
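  2. Reduce connection timeouts in Traefik. The router annotation below only selects the entrypoint; the timeouts themselves are configured elsewhere:

    # Entrypoint level (static configuration): entryPoints.websecure.transport.respondingTimeouts
    # Per service: forwardingTimeouts on a ServersTransport resource
    traefik.ingress.kubernetes.io/router.entrypoints: websecure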
    

If Command Output Is Too Large

  1. Commands should implement output limits
  2. Check for runaway commands:
    kubectl -n rdev logs -l app=rdev-api --tail=-1 | grep -c "output line"
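
The idea behind an output limit can be illustrated in shell, although the service would enforce it in its own code. This is only a sketch of the technique: `run_capped` is a hypothetical wrapper, the 16-byte cap is chosen to keep the example small, and `head -c` is assumed to be available (it is in GNU and BSD userlands).

```shell
# Hypothetical wrapper: run a command and keep at most MAX_BYTES of its
# combined stdout and stderr; anything beyond the cap is discarded.
run_capped() {
  max_bytes=$1
  shift
  "$@" 2>&1 | head -c "$max_bytes"
}

run_capped 16 printf 'abcdefghijklmnopqrstuvwxyz'
# prints: abcdefghijklmnop
```

Capping at the point where output is read, rather than after it has been buffered, is what keeps memory bounded.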
    

If Cache Is Growing

  1. Reduce cache TTL:
    kubectl -n rdev set env deployment/rdev-api CACHE_TTL=15s
    

Verification

# Confirm memory has stabilized
kubectl -n rdev top pod -l app=rdev-api

# Check for new OOMKill events (newest last; confirm none fall within the last 5 minutes)
kubectl -n rdev get events --field-selector reason=OOMKilled --sort-by=.lastTimestamp

# Verify service is healthy
curl -s http://rdev-api:8080/ready
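
A single `kubectl top` reading cannot show whether memory has stabilized; a short series of samples can. The sketch below treats memory as stable when the last sample is no more than a given percentage above the first. `is_stable`, the 5% tolerance, and the sample values are assumptions for illustration.

```shell
# Hypothetical helper: reads one memory sample (in Mi) per line on stdin and
# exits 0 if the last sample is at most TOL percent above the first, else 1.
is_stable() {
  awk -v tol="$1" '
    NR == 1 { first = $1 }
    { last = $1 }
    END {
      if (NR > 0 && last <= first * (1 + tol / 100)) exit 0
      exit 1
    }'
}

# Sample series standing in for real readings (values are made up):
printf '%s\n' 410 412 409 415 | is_stable 5 && echo "stable"
# prints: stable
printf '%s\n' 300 340 390 | is_stable 5 || echo "still growing"
# prints: still growing

# Collecting real samples, one per minute for five minutes:
#   for i in 1 2 3 4 5; do
#     kubectl -n rdev top pod <pod-name> --no-headers | awk '{ sub(/Mi$/, "", $3); print $3 }'
#     sleep 60
#   done | is_stable 5
```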

Post-Incident

  1. Analyze heap profile for memory leaks
  2. Review SSE connection lifecycle
  3. Consider implementing output size limits
  4. Update memory limits based on findings
  5. Consider adding memory-based HPA
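
For the last item, `kubectl autoscale` only accepts a CPU target, so a memory-based HPA has to be written as an `autoscaling/v2` manifest. The following is a sketch under assumptions: `render_memory_hpa` is a hypothetical helper, and the HPA name, replica bounds, and 70% target are illustrative values, not taken from the rdev repository.

```shell
# Hypothetical helper: prints an autoscaling/v2 HPA manifest for rdev-api.
# Name, replica bounds, and the 70% target are illustrative values.
render_memory_hpa() {
  cat <<'EOF'
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: rdev-api-memory
  namespace: rdev
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: rdev-api
  minReplicas: 2
  maxReplicas: 6
  metrics:
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 70
EOF
}

# Review the manifest, then apply it:
#   render_memory_hpa | kubectl apply -f -
```

Utilization here is measured against the container's memory request, not its limit, so the deployment needs a memory request for the target to work. Scaling out spreads load across more pods but does not fix a leak; each pod will still grow until it is restarted.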