Troubleshooting Guide
Common issues and their resolutions for rdev API.
Prerequisites
# REQUIRED: Set kubeconfig before any kubectl command
export KUBECONFIG=~/.kube/orchard9-k3sf.yaml
Quick Diagnostics
# Check pod status
kubectl -n rdev get pods -l app=rdev-api
# Check logs (use script for convenience)
./scripts/logs.sh # Last 100 lines
./scripts/logs.sh -e # Errors only
# Check events
kubectl -n rdev get events --sort-by='.lastTimestamp'
# Check endpoints
kubectl -n rdev get endpoints rdev-api
# Test health
curl $RDEV_API_URL/health
Common Issues
Pod Not Starting
Symptoms:
- Pod stuck in Pending or CrashLoopBackOff
- No endpoints registered
Diagnosis:
kubectl -n rdev describe pod -l app=rdev-api
kubectl -n rdev logs -l app=rdev-api --previous
Common Causes:
- Missing secrets:
Error: secret "rdev-api-secrets" not found
Fix: Create the required secret
kubectl -n rdev create secret generic rdev-api-secrets \
  --from-literal=postgres-password=xxx
- Resource constraints:
0/3 nodes are available: insufficient memory
Fix: Reduce resource requests or add nodes
- Image pull errors:
Failed to pull image "registry/rdev-api:latest"
Fix: Check image name, registry credentials
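The three causes above can be told apart by matching the error text. The following is a minimal sketch, assuming the messages resemble the examples in this guide; `classify_pod_failure` is a hypothetical helper name, and the patterns are matched case-insensitively because exact wording may vary between Kubernetes versions.

```shell
# Hypothetical helper: read text on stdin (for example the output of
# `kubectl -n rdev describe pod -l app=rdev-api`) and print which of the
# three causes listed in this guide it matches, or "unknown".
classify_pod_failure() {
  input=$(cat)
  if printf '%s\n' "$input" | grep -qi 'secret ".*" not found'; then
    echo "missing-secret"
  elif printf '%s\n' "$input" | grep -qi 'insufficient memory'; then
    echo "resource-constraints"
  elif printf '%s\n' "$input" | grep -qi 'failed to pull image'; then
    echo "image-pull"
  else
    echo "unknown"
  fi
}
```

Possible usage: kubectl -n rdev describe pod -l app=rdev-api | classify_pod_failure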
Database Connection Failed
Symptoms:
- Readiness probe failing
- Logs show dial tcp: connection refused
Diagnosis:
# Check database pods
kubectl get pods -n databases
# Test CockroachDB
kubectl exec -n databases cockroachdb-0 -- \
/cockroach/cockroach node status --insecure --host=localhost:26257
# Test Redis
REDIS_PASS=$(kubectl get secret -n threesix redis-credentials -o jsonpath="{.data.REDIS_PASSWORD}" | base64 -d)
kubectl exec -n threesix redis-0 -- redis-cli -a "$REDIS_PASS" ping
# Test PostgreSQL
kubectl exec -n databases postgres-0 -- psql -U rdev -d rdev -c "SELECT 1;"
See database-connections.md for full connection details.
Common Causes:
- Wrong host/port: Check ConfigMap values match actual database
- Network policy blocking: Ensure egress to database namespace is allowed
kubectl -n rdev get networkpolicy
- Credentials incorrect: Verify secret values match database credentials
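When checking for a wrong host or port, it can help to extract just the host:port pair from a connection URL and compare it with the database service. This is a sketch under assumptions: `url_host_port` is a hypothetical helper, the URL is assumed to have the usual scheme://user:pass@host:port/db shape, and special characters in the password are assumed to be percent-encoded.

```shell
# Hypothetical helper: print "host:port" from a connection URL such as
# postgres://user:pass@host:5432/dbname?sslmode=disable
url_host_port() {
  rest=${1#*://}       # drop the scheme
  rest=${rest%%/*}     # drop the path and everything after it
  rest=${rest%%\?*}    # drop a query string that follows the port directly
  rest=${rest##*@}     # drop credentials, if present
  printf '%s\n' "$rest"
}
```

Possible usage inside the API pod: url_host_port "$DATABASE_URL"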
Authentication Failures
Symptoms:
- All requests return 401
- Logs show invalid API key
Diagnosis:
# Check if keys exist in database
kubectl -n rdev exec -it deployment/rdev-api -- sh
# Then, inside the pod shell:
psql "$DATABASE_URL" -c "SELECT id, name, revoked_at FROM api_keys LIMIT 10;"
Common Causes:
- Key not created: Create an admin key manually if needed
- Key revoked: Check revoked_at is NULL for the key
- Wrong key format: Keys must start with rdev_
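The key format rule can be checked on the client before a request is sent. A minimal sketch, assuming the only format requirement is the rdev_ prefix stated above; `has_rdev_prefix` is a hypothetical helper name.

```shell
# Hypothetical helper: succeed only if the argument starts with "rdev_"
# and has at least one character after the prefix.
has_rdev_prefix() {
  case "$1" in
    rdev_?*) return 0 ;;
    *) return 1 ;;
  esac
}
```

Possible usage: has_rdev_prefix "$API_KEY" || echo "key does not start with rdev_" (API_KEY is an assumed variable name).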
Rate Limiting Issues
Symptoms:
- Intermittent 429 responses
- Responses include X-RateLimit-Remaining: 0
Diagnosis:
# Check rate limit metrics
curl http://rdev-api:8080/metrics | grep ratelimit
Solutions:
- Increase limits: Update ConfigMap:
RATE_LIMIT_RPS: "20"
- Check for loops: Client may be making excessive requests
- Use separate keys: Different clients should use different API keys
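To see how close a client is to the limit, the X-RateLimit-Remaining header can be read from a response. The sketch below assumes the header is present on the responses you inspect; `ratelimit_remaining` is a hypothetical helper, and header names are compared case-insensitively because HTTP does not fix their case.

```shell
# Hypothetical helper: read raw HTTP response headers on stdin and print
# the value of X-RateLimit-Remaining, if present.
ratelimit_remaining() {
  awk -F': *' '
    { sub(/\r$/, "") }                                  # strip the trailing CR
    tolower($1) == "x-ratelimit-remaining" { print $2; exit }
  '
}
```

Possible usage: curl -s -D - -o /dev/null "$RDEV_API_URL/<path>" | ratelimit_remaining, where <path> is any rate-limited endpoint.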
Command Execution Timeouts
Symptoms:
- Commands hang indefinitely
- SSE stream never completes
Diagnosis:
# Check active commands
kubectl -n rdev exec -it deployment/rdev-api -- sh
# Then, inside the pod shell:
curl localhost:8080/metrics | grep commands_active
# Check target pod
kubectl -n rdev get pod <target-pod> -o wide
kubectl -n rdev exec -it <target-pod> -- ps aux
Common Causes:
- Target pod not running:
kubectl -n rdev get pods -l rdev.orchard9.ai/project=true
- Command actually slow: Some commands legitimately take a long time
- Network issues: Check connectivity between API pod and target pod
SSE Connection Drops
Symptoms:
- Clients disconnect unexpectedly
- Events stop arriving mid-command
Diagnosis:
# Check ingress timeout settings
kubectl -n ingress-nginx get ing rdev-api -o yaml
Common Causes:
- Proxy timeout: Ensure ingress has long timeout:
nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
- Client timeout: Check client-side timeout configuration
- Network interruption: Implement reconnection with Last-Event-ID
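Reconnection with Last-Event-ID can be sketched on the client side as follows. This assumes the server emits id: fields in its SSE stream and replays missed events when it receives the Last-Event-ID request header; neither is confirmed by this guide. The function names, the stream URL, and the log file are all hypothetical.

```shell
# Print the value of the last "id:" field seen in an SSE stream on stdin.
last_event_id() {
  awk '
    { sub(/\r$/, "") }                                   # tolerate CRLF line endings
    /^id:/ { v = $0; sub(/^id: ?/, "", v); id = v }      # keep the most recent id
    END { if (id != "") print id }
  '
}

# Hypothetical resume step: reconnect to an assumed stream URL and send the
# last id recorded in a local log of the previous connection.
resume_stream() {
  url=$1
  log=$2
  id=$(last_event_id < "$log")
  if [ -n "$id" ]; then
    curl -sN -H "Last-Event-ID: $id" "$url" | tee -a "$log"
  else
    curl -sN "$url" | tee -a "$log"
  fi
}
```

The log file doubles as the record of delivered events, so a repeated call to resume_stream after each disconnect continues from the most recent id.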
High Memory Usage
Symptoms:
- OOMKilled events
- Slow response times
Diagnosis:
# Check memory metrics
kubectl -n rdev top pod -l app=rdev-api
# Check for memory leaks in logs
kubectl -n rdev logs -l app=rdev-api | grep -i memory
Solutions:
- Increase limits:
resources:
  limits:
    memory: "1Gi"
- Check for stream leaks: Ensure SSE connections are properly closed
- Restart pod:
kubectl -n rdev rollout restart deployment/rdev-api
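Before raising limits, it may be worth confirming that the previous container was actually OOMKilled. The sketch below reads the last termination reason from pod status; the function names are hypothetical, and it assumes the app=rdev-api label used elsewhere in this guide.

```shell
# Print the last termination reason of every container in the API pods.
last_termination_reasons() {
  kubectl -n rdev get pods -l app=rdev-api \
    -o jsonpath='{.items[*].status.containerStatuses[*].lastState.terminated.reason}'
}

# Succeed if any container was last terminated with reason OOMKilled.
was_oom_killed() {
  last_termination_reasons | grep -q 'OOMKilled'
}
```

Possible usage: was_oom_killed && echo "previous container was OOMKilled"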
High CPU Usage
Symptoms:
- CPU throttling
- Slow request processing
Diagnosis:
# Check CPU metrics
kubectl -n rdev top pod -l app=rdev-api
# Profile if possible
# No -it here: a TTY can corrupt the binary profile written to cpu.prof
kubectl -n rdev exec deployment/rdev-api -- curl -s localhost:8080/debug/pprof/profile > cpu.prof
Solutions:
- Scale horizontally:
kubectl -n rdev scale deployment/rdev-api --replicas=3
- Identify hot paths: Use profiling to find CPU-intensive code
- Check command sanitization: Complex regex can be expensive
Recovery Procedures
Emergency Restart
# Restart all pods
kubectl -n rdev rollout restart deployment/rdev-api
# Or scale to zero and back up (interrupts service while no replicas run)
kubectl -n rdev scale deployment/rdev-api --replicas=0
kubectl -n rdev scale deployment/rdev-api --replicas=2
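A restart is only useful once the new pods are ready, so it can help to wait for the rollout to finish. This is a minimal sketch: `restart_and_wait` is a hypothetical wrapper, and the 120s default timeout is an assumption to tune for your cluster.

```shell
# Hypothetical wrapper: trigger a rolling restart, then block until the
# rollout completes or the timeout expires (non-zero exit on failure).
restart_and_wait() {
  ns=${1:-rdev}
  deploy=${2:-rdev-api}
  wait_for=${3:-120s}   # assumed default timeout
  kubectl -n "$ns" rollout restart "deployment/$deploy" || return 1
  kubectl -n "$ns" rollout status "deployment/$deploy" --timeout="$wait_for"
}
```

Possible usage: restart_and_wait rdev rdev-api 300s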
Rollback
# Check rollout history
kubectl -n rdev rollout history deployment/rdev-api
# Rollback to previous
kubectl -n rdev rollout undo deployment/rdev-api
# Rollback to specific revision
kubectl -n rdev rollout undo deployment/rdev-api --to-revision=5
Database Recovery
# Connect to database
kubectl -n databases exec -it deployment/postgres -- psql -U rdev
# Check tables
\dt
# Check recent keys
SELECT id, name, created_at FROM api_keys ORDER BY created_at DESC LIMIT 10;
Getting Help
- Check logs for specific error messages
- Search this troubleshooting guide
- Check runbooks for specific scenarios
- Contact the platform team with:
  - Request ID (from error response)
  - Timestamp
  - Steps to reproduce
  - Relevant logs