Troubleshooting Guide
Common issues and their resolutions for rdev API.
Quick Diagnostics
# Check pod status
kubectl -n rdev get pods -l app=rdev-api
# Check logs
kubectl -n rdev logs -l app=rdev-api --tail=100
# Check events
kubectl -n rdev get events --sort-by='.lastTimestamp'
# Check endpoints
kubectl -n rdev get endpoints rdev-api
# Test health
kubectl -n rdev exec -it deployment/rdev-api -- wget -qO- localhost:8080/health
Common Issues
Pod Not Starting
Symptoms:
- Pod stuck in Pending or CrashLoopBackOff
- No endpoints registered
Diagnosis:
kubectl -n rdev describe pod -l app=rdev-api
kubectl -n rdev logs -l app=rdev-api --previous
Common Causes:
- Missing secrets:
  Error: secret "rdev-api-secrets" not found
  Fix: Create the required secret
  kubectl -n rdev create secret generic rdev-api-secrets \
    --from-literal=postgres-password=xxx
- Resource constraints:
  0/3 nodes are available: insufficient memory
  Fix: Reduce resource requests or add nodes
- Image pull errors:
  Failed to pull image "registry/rdev-api:latest"
  Fix: Check image name, registry credentials
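The three causes above can be matched mechanically against describe or log output. The following is a minimal sketch, assuming the messages appear roughly as quoted in this guide; suggest_fixes is a hypothetical helper name, not part of rdev.

```shell
#!/bin/sh
# Hypothetical helper (not part of rdev): read `kubectl describe` or log output
# on stdin and print the fix this guide lists for each message it recognizes.
suggest_fixes() {
  while IFS= read -r line; do
    case "$line" in
      *'secret "'*'" not found'*)
        echo "missing secret: create the required secret" ;;
      *'nodes are available'*)
        echo "resource constraints: reduce resource requests or add nodes" ;;
      *'Failed to pull image'*)
        echo "image pull error: check image name and registry credentials" ;;
    esac
  done
}

# Usage (needs cluster access, so it is left commented out):
#   kubectl -n rdev describe pod -l app=rdev-api | suggest_fixes
```

Lines that match none of the patterns are skipped silently, so an empty result means the failure is not one of the three causes listed here.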
Database Connection Failed
Symptoms:
- Readiness probe failing
- Logs show dial tcp: connection refused
Diagnosis:
# Check database connectivity from pod
kubectl -n rdev exec -it deployment/rdev-api -- sh
nc -zv postgres.databases.svc 5432
Common Causes:
- Wrong host/port: Check ConfigMap values match actual database
- Network policy blocking:
  kubectl -n rdev get networkpolicy
  Ensure egress to database namespace is allowed
- Credentials incorrect: Verify secret values match database credentials
Authentication Failures
Symptoms:
- All requests return 401
- Logs show invalid API key
Diagnosis:
# Check if keys exist in database
kubectl -n rdev exec -it deployment/rdev-api -- sh
psql "$DATABASE_URL" -c "SELECT id, name, revoked_at FROM api_keys LIMIT 10;"
Common Causes:
- Key not created: Create an admin key manually if needed
- Key revoked: Check revoked_at is NULL for the key
- Wrong key format: Keys must start with rdev_
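A quick client-side check can rule out the key-format cause before the database is queried. This is a sketch that tests only the rdev_ prefix named above; the rest of the key format is not documented here, so nothing else is validated. has_rdev_prefix is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the argument starts with "rdev_" and
# has at least one character after the prefix. Nothing else is validated.
has_rdev_prefix() {
  case "$1" in
    rdev_?*) return 0 ;;
    *)       return 1 ;;
  esac
}

# Usage:
#   has_rdev_prefix "$API_KEY" || echo "key does not start with rdev_" >&2
```

The check is strict about leading whitespace, so it also catches keys pasted with a stray space or newline in front.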
Rate Limiting Issues
Symptoms:
- Intermittent 429 responses
- Responses carry the header X-RateLimit-Remaining: 0
Diagnosis:
# Check rate limit metrics
curl http://rdev-api:8080/metrics | grep ratelimit
Solutions:
- Increase limits: Update ConfigMap:
  RATE_LIMIT_RPS: "20"
- Check for loops: Client may be making excessive requests
- Use separate keys: Different clients should use different API keys
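To confirm that a client is exhausting its budget, the X-RateLimit-Remaining header can be read straight from a response. A minimal sketch follows; ratelimit_remaining is a hypothetical helper name and the URL in the usage comment is a placeholder.

```shell
#!/bin/sh
# Hypothetical helper: read raw HTTP response headers on stdin and print the
# value of X-RateLimit-Remaining (header names are case-insensitive).
ratelimit_remaining() {
  # HTTP header lines end in CRLF; drop the CR before splitting on ": ".
  tr -d '\r' | awk -F': *' 'tolower($1) == "x-ratelimit-remaining" { print $2; exit }'
}

# Usage (URL is a placeholder for any rate-limited endpoint; add whatever
# authentication header your deployment expects):
#   curl -s -o /dev/null -D - "$URL" | ratelimit_remaining
```

Running this in a loop from the suspect client shows whether the remaining count drops steadily (a request loop) or only occasionally reaches zero (limits set too low).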
Command Execution Timeouts
Symptoms:
- Commands hang indefinitely
- SSE stream never completes
Diagnosis:
# Check active commands
kubectl -n rdev exec -it deployment/rdev-api -- sh
curl localhost:8080/metrics | grep commands_active
# Check target pod
kubectl -n rdev get pod <target-pod> -o wide
kubectl -n rdev exec -it <target-pod> -- ps aux
Common Causes:
- Target pod not running:
  kubectl -n rdev get pods -l rdev.orchard9.ai/project=true
- Command actually slow: Some commands take a long time legitimately
- Network issues: Check connectivity between API pod and target pod
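When many project pods exist, the first cause is easier to spot by filtering the pod list down to anything that is not Running. A sketch, assuming the default kubectl column layout (NAME, READY, STATUS, ...); not_running_pods is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: read `kubectl get pods --no-headers` output on stdin
# and print every pod whose STATUS column (third field) is not "Running".
not_running_pods() {
  awk '$3 != "Running" { print $1 " (" $3 ")" }'
}

# Usage (needs cluster access, so it is left commented out):
#   kubectl -n rdev get pods -l rdev.orchard9.ai/project=true --no-headers | not_running_pods
```

Pods in Completed state are listed as well, since the filter only compares the STATUS text.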
SSE Connection Drops
Symptoms:
- Clients disconnect unexpectedly
- Events stop arriving mid-command
Diagnosis:
# Check ingress timeout settings (the Ingress lives in the same namespace as the rdev-api Service)
kubectl -n rdev get ingress rdev-api -o yaml
Common Causes:
- Proxy timeout: Ensure ingress has long timeout:
  nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
- Client timeout: Check client-side timeout configuration
- Network interruption: Implement reconnection with Last-Event-ID
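Reconnection with Last-Event-ID can be sketched from the client side as follows. The sketch assumes the stream emits standard SSE id: fields and that the server resumes from the ID it receives; neither is confirmed by this guide. last_event_id and resume_stream are hypothetical helper names, and the URL is a placeholder.

```shell
#!/bin/sh
# Hypothetical helper: read captured SSE text on stdin and print the value of
# the last "id:" field seen, or nothing if the stream carried no IDs.
last_event_id() {
  tr -d '\r' | awk '/^id:/ { sub(/^id: ?/, ""); id = $0 } END { if (id != "") print id }'
}

# Hypothetical helper: reconnect to an SSE URL, sending the last ID found in
# the log file as Last-Event-ID, and append new events to the same log.
resume_stream() {
  url=$1
  log=$2
  id=$(last_event_id < "$log")
  if [ -n "$id" ]; then
    curl -sN -H "Last-Event-ID: $id" "$url" | tee -a "$log"
  else
    curl -sN "$url" | tee -a "$log"
  fi
}

# Usage (URL is a placeholder for the command's event stream):
#   : > events.log
#   resume_stream "$STREAM_URL" events.log   # run again after a drop to resume
```

If a resumed stream replays events from the beginning, the server is probably ignoring the header, which points at the API rather than the proxy or the client.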
High Memory Usage
Symptoms:
- OOMKilled events
- Slow response times
Diagnosis:
# Check memory metrics
kubectl -n rdev top pod -l app=rdev-api
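# Search logs for memory-related messages (with a label selector, kubectl shows only the last 10 lines per pod unless --tail is set)
kubectl -n rdev logs -l app=rdev-api --tail=-1 | grep -i memory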
Solutions:
- Increase limits:
  resources:
    limits:
      memory: "1Gi"
- Check for stream leaks: Ensure SSE connections are properly closed
- Restart pod:
  kubectl -n rdev rollout restart deployment/rdev-api
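Before raising limits, it helps to confirm which pods were actually OOMKilled. The sketch below filters name and termination-reason pairs; the jsonpath expression in the usage comment reads each container's last terminated state. oom_killed_pods is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: read "pod-name<TAB>last-termination-reason" lines on
# stdin and print the names of pods whose reason includes OOMKilled.
oom_killed_pods() {
  awk -F'\t' '$2 ~ /OOMKilled/ { print $1 }'
}

# Usage (needs cluster access, so it is left commented out):
#   kubectl -n rdev get pods -l app=rdev-api \
#     -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}' \
#     | oom_killed_pods
```

An empty result alongside slow responses suggests the pod is under memory pressure but has not yet hit its limit.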
High CPU Usage
Symptoms:
- CPU throttling
- Slow request processing
Diagnosis:
# Check CPU metrics
kubectl -n rdev top pod -l app=rdev-api
# Capture a CPU profile if the pprof endpoint is enabled (no -it: a TTY would corrupt the binary output)
kubectl -n rdev exec deployment/rdev-api -- curl -s localhost:8080/debug/pprof/profile > cpu.prof
Solutions:
- Scale horizontally:
  kubectl -n rdev scale deployment/rdev-api --replicas=3
- Identify hot paths: Use profiling to find CPU-intensive code
- Check command sanitization: Complex regex can be expensive
Recovery Procedures
Emergency Restart
# Restart all pods
kubectl -n rdev rollout restart deployment/rdev-api
# Scale down and up (interrupts service; use only if a rolling restart does not help)
kubectl -n rdev scale deployment/rdev-api --replicas=0
kubectl -n rdev scale deployment/rdev-api --replicas=2
Rollback
# Check rollout history
kubectl -n rdev rollout history deployment/rdev-api
# Rollback to previous
kubectl -n rdev rollout undo deployment/rdev-api
# Rollback to specific revision
kubectl -n rdev rollout undo deployment/rdev-api --to-revision=5
Database Recovery
# Connect to database
kubectl -n databases exec -it deployment/postgres -- psql -U rdev
# Check tables
\dt
# Check recent keys
SELECT id, name, created_at FROM api_keys ORDER BY created_at DESC LIMIT 10;
Getting Help
- Check logs for specific error messages
- Search this troubleshooting guide
- Check runbooks for specific scenarios
- Contact the platform team with:
  - Request ID (from error response)
  - Timestamp
  - Steps to reproduce
  - Relevant logs
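The logs and cluster state for a report can be gathered in one step. This is a sketch built only from the kubectl commands used earlier in this guide; collect_bundle, the output directory name, and the 500-line log tail are illustrative choices, not an official tool.

```shell
#!/bin/sh
# Hypothetical helper: write pod status, describe output, recent logs, and
# events into a directory and print its path. Errors from kubectl are kept
# in the same files so a failed command is visible in the bundle.
collect_bundle() {
  out=${1:-rdev-support-$(date -u +%Y%m%dT%H%M%SZ)}
  mkdir -p "$out" || return 1
  kubectl -n rdev get pods -l app=rdev-api -o wide > "$out/pods.txt" 2>&1
  kubectl -n rdev describe pod -l app=rdev-api > "$out/describe.txt" 2>&1
  kubectl -n rdev logs -l app=rdev-api --tail=500 > "$out/logs.txt" 2>&1
  kubectl -n rdev get events --sort-by='.lastTimestamp' > "$out/events.txt" 2>&1
  echo "$out"
}

# Usage (needs cluster access, so it is left commented out):
#   dir=$(collect_bundle) && tar czf "$dir.tar.gz" "$dir"
```

Review the files before sharing them, since logs and describe output can contain sensitive values.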