Troubleshooting Guide
Common issues and their resolutions for rdev API.
Prerequisites
# REQUIRED: Set kubeconfig before any kubectl command
export KUBECONFIG=~/.kube/orchard9-k3sf.yaml
Quick Diagnostics
# Check pod status
kubectl -n rdev get pods -l app=rdev-api
# Check logs (use script for convenience)
./scripts/logs.sh # Last 100 lines
./scripts/logs.sh -e # Errors only
# Check events
kubectl -n rdev get events --sort-by='.lastTimestamp'
# Check endpoints
kubectl -n rdev get endpoints rdev-api
# Test health
curl $RDEV_API_URL/health
Common Issues
Pod Not Starting
Symptoms:
- Pod stuck in Pending or CrashLoopBackOff
- No endpoints registered
Diagnosis:
kubectl -n rdev describe pod -l app=rdev-api
kubectl -n rdev logs -l app=rdev-api --previous
Common Causes:
- Missing secrets:
  Error: secret "rdev-api-secrets" not found
  Fix: Create the required secret
  kubectl -n rdev create secret generic rdev-api-secrets \
    --from-literal=postgres-password=xxx
- Resource constraints:
  0/3 nodes are available: insufficient memory
  Fix: Reduce resource requests or add nodes
- Image pull errors:
  Failed to pull image "registry/rdev-api:latest"
  Fix: Check image name, registry credentials
Database Connection Failed
Symptoms:
- Readiness probe failing
- Logs show dial tcp: connection refused
Diagnosis:
# Check database pods
kubectl get pods -n databases
# Test CockroachDB
kubectl exec -n databases cockroachdb-0 -- \
/cockroach/cockroach node status --insecure --host=localhost:26257
# Test Redis
REDIS_PASS=$(kubectl get secret -n threesix redis-credentials -o jsonpath="{.data.REDIS_PASSWORD}" | base64 -d)
kubectl exec -n threesix redis-0 -- redis-cli -a "$REDIS_PASS" ping
# Test PostgreSQL
kubectl exec -n databases postgres-0 -- psql -U rdev -d rdev -c "SELECT 1;"
See database-connections.md for full connection details.
Common Causes:
- Wrong host/port: Check ConfigMap values match actual database
- Network policy blocking:
  kubectl -n rdev get networkpolicy
  Ensure egress to database namespace is allowed
- Credentials incorrect: Verify secret values match database credentials
Authentication Failures
Symptoms:
- All requests return 401
- Logs show invalid API key
Diagnosis:
# Check if keys exist in database
kubectl -n rdev exec -it deployment/rdev-api -- sh
psql $DATABASE_URL -c "SELECT id, name, revoked_at FROM api_keys LIMIT 10;"
Common Causes:
- Key not created: Create an admin key manually if needed
- Key revoked: Check revoked_at is NULL for the key
- Wrong key format: Keys must start with rdev_
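A quick client-side check can rule out the key-format cause before querying the database. The sketch below relies only on the prefix rule stated above; the function name is hypothetical, the requirement of at least one character after the prefix is an assumption, and the check does not validate the key against the API.

```shell
# has_rdev_prefix KEY
# Returns 0 if KEY starts with "rdev_" followed by at least one character
# (the "at least one character" part is an assumption, not a documented rule).
# This only checks the format; it does not confirm the key exists or is active.
has_rdev_prefix() {
  case "$1" in
    rdev_?*) return 0 ;;
    *)       return 1 ;;
  esac
}
```

For example, `has_rdev_prefix "$KEY" || echo "key does not start with rdev_"` flags a malformed key before any request is sent.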
Rate Limiting Issues
Symptoms:
- Intermittent 429 responses
- X-RateLimit-Remaining: 0
Diagnosis:
# Check rate limit metrics
curl http://rdev-api:8080/metrics | grep ratelimit
Solutions:
- Increase limits: Update ConfigMap:
  RATE_LIMIT_RPS: "20"
- Check for loops: Client may be making excessive requests
- Use separate keys: Different clients should use different API keys
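To see how close a client is to the limit, the X-RateLimit-Remaining header mentioned above can be read from a response. This is a minimal sketch: the function name is hypothetical, and the example request in the comments uses a placeholder endpoint and omits authentication because this guide does not specify the header format.

```shell
# ratelimit_remaining
# Reads raw HTTP response headers on stdin and prints the value of
# X-RateLimit-Remaining (matched case-insensitively), or nothing if absent.
ratelimit_remaining() {
  tr -d '\r' | awk -F': *' 'tolower($1) == "x-ratelimit-remaining" { print $2; exit }'
}

# Example (not run here): print only the response headers of one request
# and extract the remaining quota. Replace <endpoint> with a real path and
# add whatever authentication header your deployment expects.
#   curl -s -o /dev/null -D - "$RDEV_API_URL/<endpoint>" | ratelimit_remaining
```

A value that drops to 0 quickly for a single client suggests a request loop rather than a limit that is too low.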
Command Execution Timeouts
Symptoms:
- Commands hang indefinitely
- SSE stream never completes
Diagnosis:
# Check active commands
kubectl -n rdev exec -it deployment/rdev-api -- sh
curl localhost:8080/metrics | grep commands_active
# Check target pod
kubectl -n rdev get pod <target-pod> -o wide
kubectl -n rdev exec -it <target-pod> -- ps aux
Common Causes:
- Target pod not running:
  kubectl -n rdev get pods -l rdev.orchard9.ai/project=true
- Command actually slow: Some commands take a long time legitimately
- Network issues: Check connectivity between API pod and target pod
SSE Connection Drops
Symptoms:
- Clients disconnect unexpectedly
- Events stop arriving mid-command
Diagnosis:
# Check ingress settings (the Ingress lives in the same namespace as the rdev-api Service)
kubectl -n rdev get ingress rdev-api -o yaml
Common Causes:
- Proxy timeout: Traefik timeout is configured at the entrypoint level via HelmChartConfig, not per-Ingress annotations. See .claude/guides/ops/traefik-v3.md for details.
  # Traefik timeout is configured at the entrypoint level via HelmChartConfig
  # See .claude/guides/ops/traefik-v3.md for details
  traefik.ingress.kubernetes.io/router.entrypoints: websecure
  traefik.ingress.kubernetes.io/router.tls: "true"
- Client timeout: Check client-side timeout configuration
- Network interruption: Implement reconnection with Last-Event-ID
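The reconnection advice above can be sketched with curl. This is an illustration under assumptions, not the project's client: the function names, the 2-second pause, and the reconnect cap are hypothetical, and it presumes the server resumes from the ID sent in the Last-Event-ID request header, as the SSE protocol describes.

```shell
# last_event_id
# Reads an SSE stream on stdin and prints the value of the last "id:" field,
# or nothing if the stream carried no IDs.
last_event_id() {
  tr -d '\r' | awk '/^id:/ { sub(/^id: ?/, ""); id = $0 } END { if (id != "") print id }'
}

# sse_follow URL [MAX_RECONNECTS]
# Streams URL to stdout. Each time the connection ends, it reconnects after a
# short pause and sends the last seen event ID so the server can resume.
# Stops after MAX_RECONNECTS reconnects (default 5, an arbitrary choice).
sse_follow() {
  url=$1
  max_reconnects=${2:-5}
  last_id=""
  log=$(mktemp) || return 1
  attempt=0
  while [ "$attempt" -le "$max_reconnects" ]; do
    if [ -n "$last_id" ]; then
      curl -sN -H "Accept: text/event-stream" -H "Last-Event-ID: $last_id" "$url" | tee "$log"
    else
      curl -sN -H "Accept: text/event-stream" "$url" | tee "$log"
    fi
    # Remember the newest ID from this connection, if any arrived.
    new_id=$(last_event_id < "$log")
    if [ -n "$new_id" ]; then last_id=$new_id; fi
    attempt=$((attempt + 1))
    sleep 2
  done
  rm -f "$log"
}
```

The loop cannot tell a finished command from a dropped connection, so a real client would also stop when it sees the stream's final event.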
High Memory Usage
Symptoms:
- OOMKilled events
- Slow response times
Diagnosis:
# Check memory metrics
kubectl -n rdev top pod -l app=rdev-api
# Check for memory leaks in logs
kubectl -n rdev logs -l app=rdev-api | grep -i memory
Solutions:
- Increase limits:
  resources:
    limits:
      memory: "1Gi"
- Check for stream leaks: Ensure SSE connections are properly closed
- Restart pod:
  kubectl -n rdev rollout restart deployment/rdev-api
High CPU Usage
Symptoms:
- CPU throttling
- Slow request processing
Diagnosis:
# Check CPU metrics
kubectl -n rdev top pod -l app=rdev-api
# Profile if possible (no -it: an allocated TTY can corrupt the binary profile)
kubectl -n rdev exec deployment/rdev-api -- curl -s localhost:8080/debug/pprof/profile > cpu.prof
Solutions:
- Scale horizontally:
  kubectl -n rdev scale deployment/rdev-api --replicas=3
- Identify hot paths: Use profiling to find CPU-intensive code
- Check command sanitization: Complex regex can be expensive
Recovery Procedures
Emergency Restart
# Restart all pods
kubectl -n rdev rollout restart deployment/rdev-api
# Scale down and up
kubectl -n rdev scale deployment/rdev-api --replicas=0
kubectl -n rdev scale deployment/rdev-api --replicas=2
Rollback
# Check rollout history
kubectl -n rdev rollout history deployment/rdev-api
# Rollback to previous
kubectl -n rdev rollout undo deployment/rdev-api
# Rollback to specific revision
kubectl -n rdev rollout undo deployment/rdev-api --to-revision=5
Database Recovery
# Connect to database
kubectl -n databases exec -it deployment/postgres -- psql -U rdev
# Check tables
\dt
# Check recent keys
SELECT id, name, created_at FROM api_keys ORDER BY created_at DESC LIMIT 10;
Getting Help
- Check logs for specific error messages
- Search this troubleshooting guide
- Check runbooks for specific scenarios
- Contact the platform team with:
  - Request ID (from error response)
  - Timestamp
  - Steps to reproduce
  - Relevant logs