rdev/docs/operations/troubleshooting.md
jordan a9ad3d8304
2026-02-10 23:16:56 -07:00

Troubleshooting Guide

Common issues and their resolutions for the rdev API.

Prerequisites

# REQUIRED: Set kubeconfig before any kubectl command
export KUBECONFIG=~/.kube/orchard9-k3sf.yaml

Quick Diagnostics

# Check pod status
kubectl -n rdev get pods -l app=rdev-api

# Check logs (use script for convenience)
./scripts/logs.sh           # Last 100 lines
./scripts/logs.sh -e        # Errors only

# Check events
kubectl -n rdev get events --sort-by='.lastTimestamp'

# Check endpoints
kubectl -n rdev get endpoints rdev-api

# Test health (-f: non-zero exit on HTTP errors, -sS: quiet but still print errors)
curl -fsS "$RDEV_API_URL/health"

Common Issues

Pod Not Starting

Symptoms:

  • Pod stuck in Pending or CrashLoopBackOff
  • No endpoints registered

Diagnosis:

kubectl -n rdev describe pod -l app=rdev-api
kubectl -n rdev logs -l app=rdev-api --previous

Common Causes:

  1. Missing secrets:

    Error: secret "rdev-api-secrets" not found
    

    Fix: Create the required secret

    kubectl -n rdev create secret generic rdev-api-secrets \
      --from-literal=postgres-password=xxx
    
  2. Resource constraints:

    0/3 nodes are available: insufficient memory
    

    Fix: Reduce resource requests or add nodes

  3. Image pull errors:

    Failed to pull image "registry/rdev-api:latest"
    

    Fix: Check image name, registry credentials
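
The three causes above each leave a recognizable fragment in pod events. As a rough triage aid, here is a minimal sketch that maps a message to a likely cause. The function name `classify_pod_error` and the category labels are hypothetical, and the patterns are taken only from the example messages in this guide, so real messages may need additional patterns.

```shell
# classify_pod_error MESSAGE
# Hypothetical helper: print a likely cause for a pod event/error message.
# Patterns are based on the example messages shown in this guide.
classify_pod_error() {
  case "$1" in
    *secret*"not found"*)      echo "missing-secret" ;;
    *"nodes are available"*)   echo "resource-constraints" ;;
    *"Failed to pull image"*)  echo "image-pull" ;;
    *)                         echo "unknown" ;;
  esac
}

# Example use against warning events (one message per line):
#   kubectl -n rdev get events --field-selector type=Warning \
#     -o jsonpath='{range .items[*]}{.message}{"\n"}{end}' |
#     while IFS= read -r msg; do classify_pod_error "$msg"; done
```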

Database Connection Failed

Symptoms:

  • Readiness probe failing
  • Logs show dial tcp: connection refused

Diagnosis:

# Check database pods
kubectl get pods -n databases

# Test CockroachDB
kubectl exec -n databases cockroachdb-0 -- \
  /cockroach/cockroach node status --insecure --host=localhost:26257

# Test Redis
REDIS_PASS=$(kubectl get secret -n threesix redis-credentials -o jsonpath="{.data.REDIS_PASSWORD}" | base64 -d)
kubectl exec -n threesix redis-0 -- redis-cli -a "$REDIS_PASS" ping

# Test PostgreSQL
kubectl exec -n databases postgres-0 -- psql -U rdev -d rdev -c "SELECT 1;"

See database-connections.md for full connection details.

Common Causes:

  1. Wrong host/port: Check ConfigMap values match actual database

  2. Network policy blocking:

    kubectl -n rdev get networkpolicy
    

    Ensure egress to database namespace is allowed

  3. Credentials incorrect: Verify secret values match database credentials

Authentication Failures

Symptoms:

  • All requests return 401
  • Logs show invalid API key

Diagnosis:

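# Check if keys exist in database
# Step 1: open a shell in the API pod
kubectl -n rdev exec -it deployment/rdev-api -- sh
# Step 2: inside that shell (assumes psql is available in the image)
psql "$DATABASE_URL" -c "SELECT id, name, revoked_at FROM api_keys LIMIT 10;"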

Common Causes:

  1. Key not created: Create an admin key manually if needed

  2. Key revoked: Check revoked_at is NULL for the key

  3. Wrong key format: Keys must start with rdev_
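
Before querying the database, the key format can be ruled out on the client side. The sketch below checks only the `rdev_` prefix rule stated above; the function name `has_rdev_prefix` is hypothetical, and any further rules on key length or character set are unknown here and not checked.

```shell
# has_rdev_prefix KEY
# Hypothetical helper: succeed (exit 0) if KEY starts with "rdev_".
# Only the prefix is checked; this does not prove the key exists or is active.
has_rdev_prefix() {
  case "$1" in
    rdev_*) return 0 ;;
    *)      return 1 ;;
  esac
}

# Example:
#   has_rdev_prefix "$API_KEY" || echo "key does not start with rdev_" >&2
```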

Rate Limiting Issues

Symptoms:

  • Intermittent 429 responses
  • X-RateLimit-Remaining: 0

Diagnosis:

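# Check rate limit metrics (run from a pod in the rdev namespace;
# the rdev-api service name does not resolve outside the cluster)
curl -s http://rdev-api:8080/metrics | grep ratelimit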

Solutions:

  1. Increase limits: Update ConfigMap:

    RATE_LIMIT_RPS: "20"
    
  2. Check for loops: Client may be making excessive requests

  3. Use separate keys: Different clients should use different API keys
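
When a client loop is the cause, backing off after a 429 usually helps more than raising the limit. The following is a minimal sketch of a capped exponential delay calculation for a shell client. The function name `backoff_delay` and the default base (1s) and cap (30s) are assumptions for illustration, not values taken from the rdev API.

```shell
# backoff_delay ATTEMPT [BASE] [MAX]
# Hypothetical helper: print the delay in seconds before retry number ATTEMPT
# (1-based). The delay doubles from BASE on each attempt and is capped at MAX.
# The body runs in a subshell so its variables do not leak to the caller.
backoff_delay() (
  attempt=$1
  base=${2:-1}
  max=${3:-30}
  delay=$base
  i=1
  while [ "$i" -lt "$attempt" ] && [ "$delay" -lt "$max" ]; do
    delay=$((delay * 2))
    i=$((i + 1))
  done
  if [ "$delay" -gt "$max" ]; then
    delay=$max
  fi
  echo "$delay"
)

# Example: after the Nth consecutive 429 response
#   sleep "$(backoff_delay "$attempt")"
```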

Command Execution Timeouts

Symptoms:

  • Commands hang indefinitely
  • SSE stream never completes

Diagnosis:

# Check active commands
# Step 1: open a shell in the API pod
kubectl -n rdev exec -it deployment/rdev-api -- sh
# Step 2: inside that shell
curl -s localhost:8080/metrics | grep commands_active

# Check target pod
kubectl -n rdev get pod <target-pod> -o wide
kubectl -n rdev exec -it <target-pod> -- ps aux

Common Causes:

  1. Target pod not running:

    kubectl -n rdev get pods -l rdev.orchard9.ai/project=true
    
  2. Command actually slow: Some commands take a long time legitimately

  3. Network issues: Check connectivity between API pod and target pod

SSE Connection Drops

Symptoms:

  • Clients disconnect unexpectedly
  • Events stop arriving mid-command

Diagnosis:

# Check the Ingress definition (it lives in the rdev namespace, next to the service)
kubectl -n rdev get ingress rdev-api -o yaml

Common Causes:

  1. Proxy timeout: Traefik timeouts are configured at the entrypoint level via HelmChartConfig, not through per-Ingress annotations. See .claude/guides/ops/traefik-v3.md for details. The Ingress annotations below only select the entrypoint and enable TLS; they do not change timeouts:

    traefik.ingress.kubernetes.io/router.entrypoints: websecure
    traefik.ingress.kubernetes.io/router.tls: "true"
    
  2. Client timeout: Check client-side timeout configuration

  3. Network interruption: Implement reconnection with Last-Event-ID
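
For the reconnection case, the client needs the last `id:` value it received so it can send it back in the `Last-Event-ID` request header. Here is a minimal sketch that extracts that value from a captured stream. The function name `last_event_id`, the file `stream.log`, and the `STREAM_URL` variable are hypothetical, and it assumes the rdev API emits `id:` fields and honors `Last-Event-ID` on reconnect.

```shell
# last_event_id
# Hypothetical helper: read an SSE stream on stdin and print the value of the
# last "id:" field seen. Prints nothing if the stream carried no ids.
last_event_id() {
  awk '/^id:/ { sub(/^id: ?/, ""); id = $0 }
       END    { if (id != "") print id }'
}

# Example (STREAM_URL and stream.log are placeholders):
#   curl -sN "$STREAM_URL" | tee stream.log        # connection drops here
#   id=$(last_event_id < stream.log)
#   curl -sN -H "Last-Event-ID: $id" "$STREAM_URL" # resume after that event
```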

High Memory Usage

Symptoms:

  • OOMKilled events
  • Slow response times

Diagnosis:

# Check memory metrics
kubectl -n rdev top pod -l app=rdev-api

# Check for memory leaks in logs
kubectl -n rdev logs -l app=rdev-api | grep -i memory

Solutions:

  1. Increase limits:

    resources:
      limits:
        memory: "1Gi"
    
  2. Check for stream leaks: Ensure SSE connections are properly closed

  3. Restart pod:

    kubectl -n rdev rollout restart deployment/rdev-api
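
To spot pods approaching their limit before they are OOMKilled, the `kubectl top pod` output can be filtered by a threshold. This is a minimal sketch; the function name `pods_over_memory` is hypothetical, and it assumes the usual three-column output (NAME, CPU, MEMORY) with memory reported in Ki, Mi, or Gi.

```shell
# pods_over_memory LIMIT_MI
# Hypothetical helper: read `kubectl top pod` output on stdin and print the
# names of pods whose memory usage is at or above LIMIT_MI mebibytes.
pods_over_memory() {
  awk -v limit="$1" '
    $1 == "NAME" { next }                  # skip the header row if present
    {
      mem = $3
      if (mem ~ /Gi$/)      { sub(/Gi$/, "", mem); mem *= 1024 }
      else if (mem ~ /Mi$/) { sub(/Mi$/, "", mem) }
      else if (mem ~ /Ki$/) { sub(/Ki$/, "", mem); mem /= 1024 }
      else next                            # unrecognized unit: skip the row
      if (mem + 0 >= limit + 0) print $1
    }'
}

# Example: list rdev-api pods using 800Mi or more
#   kubectl -n rdev top pod -l app=rdev-api | pods_over_memory 800
```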
    

High CPU Usage

Symptoms:

  • CPU throttling
  • Slow request processing

Diagnosis:

# Check CPU metrics
kubectl -n rdev top pod -l app=rdev-api

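# Capture a 30-second CPU profile if pprof is exposed
# (no -it: a TTY would corrupt the binary output; -s keeps curl progress out of the way)
kubectl -n rdev exec deployment/rdev-api -- curl -s "localhost:8080/debug/pprof/profile?seconds=30" > cpu.prof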

Solutions:

  1. Scale horizontally:

    kubectl -n rdev scale deployment/rdev-api --replicas=3
    
  2. Identify hot paths: Use profiling to find CPU-intensive code

  3. Check command sanitization: Complex regex can be expensive

Recovery Procedures

Emergency Restart

# Restart all pods
kubectl -n rdev rollout restart deployment/rdev-api

# Alternative: scale down and up (causes downtime while replicas=0)
kubectl -n rdev scale deployment/rdev-api --replicas=0
kubectl -n rdev scale deployment/rdev-api --replicas=2
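
A restart is only complete once the new pods are ready, so it can help to wait on the rollout before declaring recovery. Here is a minimal sketch that chains the restart with `kubectl rollout status`. The function name `restart_and_wait` and the 120s default timeout are assumptions chosen for illustration.

```shell
# restart_and_wait NAMESPACE DEPLOYMENT [TIMEOUT]
# Hypothetical helper: trigger a rolling restart, then block until the rollout
# finishes or TIMEOUT (default 120s) expires. Returns non-zero on failure.
restart_and_wait() {
  kubectl -n "$1" rollout restart "deployment/$2" &&
    kubectl -n "$1" rollout status "deployment/$2" --timeout="${3:-120s}"
}

# Example:
#   restart_and_wait rdev rdev-api 180s || echo "rollout did not complete" >&2
```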

Rollback

# Check rollout history
kubectl -n rdev rollout history deployment/rdev-api

# Rollback to previous
kubectl -n rdev rollout undo deployment/rdev-api

# Rollback to specific revision
kubectl -n rdev rollout undo deployment/rdev-api --to-revision=5

Database Recovery

# Connect to database
kubectl -n databases exec -it deployment/postgres -- psql -U rdev

-- The remaining commands are entered at the psql prompt
-- Check tables
\dt

-- Check recent keys
SELECT id, name, created_at FROM api_keys ORDER BY created_at DESC LIMIT 10;

Getting Help

  1. Check logs for specific error messages
  2. Search this troubleshooting guide
  3. Check runbooks for specific scenarios
  4. Contact the platform team with:
    • Request ID (from error response)
    • Timestamp
    • Steps to reproduce
    • Relevant logs