# Troubleshooting Guide

Common issues and their resolutions for rdev API.

## Prerequisites

```bash
# REQUIRED: Set kubeconfig before any kubectl command
export KUBECONFIG=~/.kube/orchard9-k3sf.yaml
```

## Quick Diagnostics

```bash
# Check pod status
kubectl -n rdev get pods -l app=rdev-api

# Check logs (use script for convenience)
./scripts/logs.sh           # Last 100 lines
./scripts/logs.sh -e        # Errors only

# Check events
kubectl -n rdev get events --sort-by='.lastTimestamp'

# Check endpoints
kubectl -n rdev get endpoints rdev-api

# Test health
curl $RDEV_API_URL/health
```

## Common Issues

### Pod Not Starting

**Symptoms:**
- Pod stuck in `Pending` or `CrashLoopBackOff`
- No endpoints registered

**Diagnosis:**
```bash
kubectl -n rdev describe pod -l app=rdev-api
kubectl -n rdev logs -l app=rdev-api --previous
```

**Common Causes:**

1. **Missing secrets:**
   ```
   Error: secret "rdev-api-secrets" not found
   ```
   Fix: Create the required secret
   ```bash
   kubectl -n rdev create secret generic rdev-api-secrets \
     --from-literal=postgres-password=xxx
   ```

2. **Resource constraints:**
   ```
   0/3 nodes are available: insufficient memory
   ```
   Fix: Reduce resource requests or add nodes

3. **Image pull errors:**
   ```
   Failed to pull image "registry/rdev-api:latest"
   ```
   Fix: Check image name, registry credentials

### Database Connection Failed

**Symptoms:**
- Readiness probe failing
- Logs show `dial tcp: connection refused`

**Diagnosis:**
```bash
# Check database pods
kubectl get pods -n databases

# Test CockroachDB
kubectl exec -n databases cockroachdb-0 -- \
  /cockroach/cockroach node status --insecure --host=localhost:26257

# Test Redis
REDIS_PASS=$(kubectl get secret -n threesix redis-credentials -o jsonpath="{.data.REDIS_PASSWORD}" | base64 -d)
kubectl exec -n threesix redis-0 -- redis-cli -a "$REDIS_PASS" ping

# Test PostgreSQL
kubectl exec -n databases postgres-0 -- psql -U rdev -d rdev -c "SELECT 1;"
```

See [database-connections.md](database-connections.md) for full connection details.

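When the logs show `dial tcp: connection refused`, it can help to first confirm whether anything is listening on the host and port the API is expected to use, before looking at credentials. The following is a minimal sketch, not part of the rdev tooling: `check_tcp` is a hypothetical helper name, and it assumes `bash` (for its `/dev/tcp` redirection) and the coreutils `timeout` command are available wherever it runs. It is only meaningful when run from a host or pod that shares the API pod's network path.

```bash
#!/usr/bin/env bash
# check_tcp HOST PORT [TIMEOUT_SECONDS]
# Hypothetical helper (an assumption, not an rdev script).
# Prints "open HOST:PORT" and returns 0 if a TCP connection can be opened,
# otherwise prints "unreachable HOST:PORT" and returns 1.
check_tcp() {
  local host="$1" port="$2" secs="${3:-3}"
  # bash opens a TCP socket when redirecting to /dev/tcp/HOST/PORT;
  # timeout bounds the attempt so a filtered port does not hang the check.
  if timeout "$secs" bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "open ${host}:${port}"
  else
    echo "unreachable ${host}:${port}"
    return 1
  fi
}
```

For example, `check_tcp <db-host> 26257` would probe the CockroachDB port used in the diagnosis above, with `<db-host>` replaced by the host the API is configured to use. An `unreachable` result points toward host, port, or network policy problems rather than credentials.
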

**Common Causes:**

1. **Wrong host/port:** Check ConfigMap values match actual database
2. **Network policy blocking:**
   ```bash
   kubectl -n rdev get networkpolicy
   ```
   Ensure egress to database namespace is allowed
3. **Credentials incorrect:** Verify secret values match database credentials

### Authentication Failures

**Symptoms:**
- All requests return 401
- Logs show `invalid API key`

**Diagnosis:**
```bash
# Check if keys exist in database
# Open a shell in the API pod
kubectl -n rdev exec -it deployment/rdev-api -- sh
# Then, inside the pod shell:
psql "$DATABASE_URL" -c "SELECT id, name, revoked_at FROM api_keys LIMIT 10;"
```

**Common Causes:**

1. **Key not created:** Create an admin key manually if needed
2. **Key revoked:** Check `revoked_at` is NULL for the key
3. **Wrong key format:** Keys must start with `rdev_`

### Rate Limiting Issues

**Symptoms:**
- Intermittent 429 responses
- `X-RateLimit-Remaining: 0`

**Diagnosis:**
```bash
# Check rate limit metrics
curl http://rdev-api:8080/metrics | grep ratelimit
```

**Solutions:**

1. **Increase limits:** Update ConfigMap:
   ```yaml
   RATE_LIMIT_RPS: "20"
   ```
2. **Check for loops:** Client may be making excessive requests
3. **Use separate keys:** Different clients should use different API keys

### Command Execution Timeouts

**Symptoms:**
- Commands hang indefinitely
- SSE stream never completes

**Diagnosis:**
```bash
# Check active commands
# Open a shell in the API pod
kubectl -n rdev exec -it deployment/rdev-api -- sh
# Then, inside the pod shell:
curl localhost:8080/metrics | grep commands_active

# Check target pod (replace <target-pod> with the pod the command runs in)
kubectl -n rdev get pod <target-pod> -o wide
kubectl -n rdev exec -it <target-pod> -- ps aux
```

**Common Causes:**

1. **Target pod not running:**
   ```bash
   kubectl -n rdev get pods -l rdev.orchard9.ai/project=true
   ```
2. **Command actually slow:** Some commands take a long time legitimately
3. **Network issues:** Check connectivity between API pod and target pod

### SSE Connection Drops

**Symptoms:**
- Clients disconnect unexpectedly
- Events stop arriving mid-command

**Diagnosis:**
```bash
# Check ingress timeout settings
kubectl -n ingress-nginx get ing rdev-api -o yaml
```

**Common Causes:**

1. **Proxy timeout:** Ensure ingress has long timeout:
   ```yaml
   nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
   ```
2. **Client timeout:** Check client-side timeout configuration
3. **Network interruption:** Implement reconnection with `Last-Event-ID`

### High Memory Usage

**Symptoms:**
- OOMKilled events
- Slow response times

**Diagnosis:**
```bash
# Check memory metrics
kubectl -n rdev top pod -l app=rdev-api

# Check for memory leaks in logs
kubectl -n rdev logs -l app=rdev-api | grep -i memory
```

**Solutions:**

1. **Increase limits:**
   ```yaml
   resources:
     limits:
       memory: "1Gi"
   ```
2. **Check for stream leaks:** Ensure SSE connections are properly closed
3. **Restart pod:**
   ```bash
   kubectl -n rdev rollout restart deployment/rdev-api
   ```

### High CPU Usage

**Symptoms:**
- CPU throttling
- Slow request processing

**Diagnosis:**
```bash
# Check CPU metrics
kubectl -n rdev top pod -l app=rdev-api

# Profile if possible (no -it: a TTY would corrupt the binary profile on stdout)
kubectl -n rdev exec deployment/rdev-api -- curl -s localhost:8080/debug/pprof/profile > cpu.prof
```

**Solutions:**

1. **Scale horizontally:**
   ```bash
   kubectl -n rdev scale deployment/rdev-api --replicas=3
   ```
2. **Identify hot paths:** Use profiling to find CPU-intensive code
3. **Check command sanitization:** Complex regex can be expensive

## Recovery Procedures

### Emergency Restart

```bash
# Restart all pods (rolling restart)
kubectl -n rdev rollout restart deployment/rdev-api

# Alternative: scale down and back up (the API is unavailable while replicas=0)
kubectl -n rdev scale deployment/rdev-api --replicas=0
kubectl -n rdev scale deployment/rdev-api --replicas=2
```

### Rollback

```bash
# Check rollout history
kubectl -n rdev rollout history deployment/rdev-api

# Rollback to previous
kubectl -n rdev rollout undo deployment/rdev-api

# Rollback to specific revision
kubectl -n rdev rollout undo deployment/rdev-api --to-revision=5
```

### Database Recovery

```bash
# Connect to database
kubectl -n databases exec -it deployment/postgres -- psql -U rdev

# The remaining commands are entered at the psql prompt, not in the shell.
# Check tables
\dt

# Check recent keys
SELECT id, name, created_at FROM api_keys ORDER BY created_at DESC LIMIT 10;
```

## Getting Help

1. Check logs for specific error messages
2. Search this troubleshooting guide
3. Check runbooks for specific scenarios
4. Contact the platform team with:
   - Request ID (from error response)
   - Timestamp
   - Steps to reproduce
   - Relevant logs
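
The details requested by the platform team can be collected into one block before sending. This is a minimal sketch under stated assumptions: `support_report` is a hypothetical helper, not an existing rdev script, and it assumes that log lines contain the request ID as plain text, which the guide does not confirm.

```bash
#!/usr/bin/env bash
# support_report REQUEST_ID TIMESTAMP [LOG_FILE]
# Hypothetical helper (an assumption, not an rdev script).
# Prints the fields listed under "Getting Help" as a single text block.
support_report() {
  local request_id="$1" timestamp="$2" log_file="${3:-}"
  printf 'Request ID: %s\n' "$request_id"
  printf 'Timestamp: %s\n' "$timestamp"
  printf 'Steps to reproduce:\n  (fill in)\n'
  printf 'Relevant logs:\n'
  if [ -n "$log_file" ] && [ -r "$log_file" ]; then
    # Keep only lines mentioning the request ID, capped at the last 50.
    grep -F -- "$request_id" "$log_file" | tail -n 50
  else
    printf '  (none attached)\n'
  fi
}
```

One possible use is to save the output of `./scripts/logs.sh` to a file and pass that file as the third argument, along with the request ID from the error response and the time the error occurred.
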