# Troubleshooting Guide

Common issues and their resolutions for the rdev API.

## Prerequisites

```bash
# REQUIRED: Set kubeconfig before running any kubectl command
export KUBECONFIG=~/.kube/orchard9-k3sf.yaml
```

## Quick Diagnostics

```bash
# Check pod status
kubectl -n rdev get pods -l app=rdev-api

# Check logs (use script for convenience)
./scripts/logs.sh      # Last 100 lines
./scripts/logs.sh -e   # Errors only

# Check events
kubectl -n rdev get events --sort-by='.lastTimestamp'

# Check endpoints
kubectl -n rdev get endpoints rdev-api

# Test health
curl "$RDEV_API_URL/health"
```

## Common Issues

### Pod Not Starting

**Symptoms:**
- Pod stuck in `Pending` or `CrashLoopBackOff`
- No endpoints registered

**Diagnosis:**
```bash
kubectl -n rdev describe pod -l app=rdev-api
# With a label selector, kubectl logs prints only the last 10 lines unless --tail is set
kubectl -n rdev logs -l app=rdev-api --previous --tail=100
```

**Common Causes:**

1. **Missing secrets:**
   ```
   Error: secret "rdev-api-secrets" not found
   ```
   Fix: Create the required secret
   ```bash
   kubectl -n rdev create secret generic rdev-api-secrets \
     --from-literal=postgres-password=xxx
   ```

2. **Resource constraints:**
   ```
   0/3 nodes are available: insufficient memory
   ```
   Fix: Reduce resource requests or add nodes

3. **Image pull errors:**
   ```
   Failed to pull image "registry/rdev-api:latest"
   ```
   Fix: Check the image name and registry credentials

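The causes above usually show up in the STATUS column of `kubectl get pods`. The sketch below maps a status value to the most likely cause from this section. The function name `classify_pod_status` is hypothetical, and the mapping is a rough triage aid: it assumes the secret is referenced in a way that surfaces as `CreateContainerConfigError`, and `Pending` can have causes other than resource pressure.

```bash
# classify_pod_status STATUS
# Hypothetical helper: maps a STATUS value from `kubectl get pods` to the
# likely cause listed in this section. A triage hint, not a diagnosis.
classify_pod_status() {
  case "$1" in
    Pending)
      echo "resource constraints: check node capacity and resource requests" ;;
    ImagePullBackOff|ErrImagePull)
      echo "image pull error: check image name and registry credentials" ;;
    CreateContainerConfigError)
      echo "missing secret or config: check that rdev-api-secrets exists" ;;
    CrashLoopBackOff)
      echo "container crashing: check logs with --previous" ;;
    Running)
      echo "ok" ;;
    *)
      echo "unknown: run kubectl describe pod" ;;
  esac
}

# Usage (needs cluster access), classifying every rdev-api pod:
#   kubectl -n rdev get pods -l app=rdev-api --no-headers |
#     while read -r name ready status rest; do
#       echo "$name: $(classify_pod_status "$status")"
#     done
```
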
### Database Connection Failed

**Symptoms:**
- Readiness probe failing
- Logs show `dial tcp: connection refused`

**Diagnosis:**
```bash
# Check database pods
kubectl get pods -n databases

# Test CockroachDB
kubectl exec -n databases cockroachdb-0 -- \
  /cockroach/cockroach node status --insecure --host=localhost:26257

# Test Redis
REDIS_PASS=$(kubectl get secret -n threesix redis-credentials -o jsonpath="{.data.REDIS_PASSWORD}" | base64 -d)
kubectl exec -n threesix redis-0 -- redis-cli -a "$REDIS_PASS" ping

# Test PostgreSQL
kubectl exec -n databases postgres-0 -- psql -U rdev -d rdev -c "SELECT 1;"
```

See [database-connections.md](database-connections.md) for full connection details.

**Common Causes:**

1. **Wrong host/port:**
   Check that the ConfigMap values match the actual database host and port

2. **Network policy blocking:**
   ```bash
   kubectl -n rdev get networkpolicy
   ```
   Ensure egress to the database namespace is allowed

3. **Credentials incorrect:**
   Verify that the secret values match the database credentials

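For the wrong host/port case, it can help to pull the host and port out of a connection URL and compare them with the running database service. The following is a minimal sketch using only shell parameter expansion. The function name `db_host_port` and the example host names are hypothetical, and it assumes a URL of the form `scheme://user:password@host:port/dbname`.

```bash
# db_host_port URL
# Hypothetical helper: prints the host:port part of a connection URL.
# Assumes scheme://[user[:password]@]host[:port][/path][?query].
db_host_port() {
  _hp="${1#*://}"      # drop the scheme, e.g. postgres://
  _hp="${_hp%%/*}"     # drop the path (/dbname) and anything after it
  _hp="${_hp%%[?]*}"   # drop a query string when there was no path
  _hp="${_hp##*@}"     # drop user:password@ if present
  printf '%s\n' "$_hp"
}

# Usage (the host name below is an invented example):
#   db_host_port "postgres://rdev:secret@postgres.databases.svc:5432/rdev"
```

The printed value can then be checked against the service list from `kubectl get svc -n databases`.
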
### Authentication Failures

**Symptoms:**
- All requests return 401
- Logs show `invalid API key`

**Diagnosis:**
```bash
# Open a shell in the API pod
kubectl -n rdev exec -it deployment/rdev-api -- sh

# Inside that shell: check if keys exist in the database
psql "$DATABASE_URL" -c "SELECT id, name, revoked_at FROM api_keys LIMIT 10;"
```

**Common Causes:**

1. **Key not created:**
   Create an admin key manually if needed

2. **Key revoked:**
   Check that `revoked_at` is NULL for the key

3. **Wrong key format:**
   Keys must start with `rdev_`

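The key format rule is easy to check on the client before sending a request. This is a minimal sketch; the function name `has_rdev_prefix` is hypothetical, and requiring at least one character after the prefix is an assumption, since this guide only states that keys start with `rdev_`.

```bash
# has_rdev_prefix KEY
# Hypothetical helper: succeeds if KEY starts with "rdev_" followed by at
# least one more character (the non-empty suffix is an assumption).
has_rdev_prefix() {
  case "$1" in
    rdev_?*) return 0 ;;
    *)       return 1 ;;
  esac
}

# Usage: catches a wrong prefix or stray leading whitespace from copy/paste.
#   has_rdev_prefix "$API_KEY" || echo "key does not start with rdev_" >&2
```
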
### Rate Limiting Issues

**Symptoms:**
- Intermittent 429 responses
- `X-RateLimit-Remaining: 0`

**Diagnosis:**
```bash
# Check rate limit metrics (the service name resolves only from inside the cluster)
curl -s http://rdev-api:8080/metrics | grep ratelimit
```

**Solutions:**

1. **Increase limits:**
   Update ConfigMap:
   ```yaml
   RATE_LIMIT_RPS: "20"
   ```

2. **Check for loops:**
   Client may be making excessive requests

3. **Use separate keys:**
   Different clients should use different API keys

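To confirm that a client is hitting the limit, read the `X-RateLimit-Remaining` header from a response. The sketch below extracts it from raw response headers on standard input. The function name `ratelimit_remaining` is hypothetical; header names are matched case-insensitively because HTTP does not fix their case.

```bash
# ratelimit_remaining
# Hypothetical helper: reads HTTP response headers on stdin and prints the
# value of X-RateLimit-Remaining. Returns non-zero if the header is absent.
ratelimit_remaining() {
  awk -F': *' '
    tolower($1) == "x-ratelimit-remaining" {
      gsub(/\r/, "", $2)   # strip the carriage return from the CRLF line ending
      print $2
      found = 1
      exit
    }
    END { exit found ? 0 : 1 }
  '
}

# Usage: dump headers only and pipe them in. Replace <url> with a
# rate-limited endpoint and add whatever auth header your client sends.
#   curl -s -D - -o /dev/null <url> | ratelimit_remaining
```
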
### Command Execution Timeouts

**Symptoms:**
- Commands hang indefinitely
- SSE stream never completes

**Diagnosis:**
```bash
# Check active commands (run curl inside the API pod shell)
kubectl -n rdev exec -it deployment/rdev-api -- sh
curl -s localhost:8080/metrics | grep commands_active

# Check target pod
kubectl -n rdev get pod <target-pod> -o wide
kubectl -n rdev exec -it <target-pod> -- ps aux
```

**Common Causes:**

1. **Target pod not running:**
   ```bash
   kubectl -n rdev get pods -l rdev.orchard9.ai/project=true
   ```

2. **Command actually slow:**
   Some commands legitimately take a long time

3. **Network issues:**
   Check connectivity between the API pod and the target pod

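The `grep commands_active` check prints raw metric lines, which may be split across several label sets. This is a minimal sketch that sums them into one number, assuming the endpoint serves the Prometheus text format. The function name `metric_sum` and the metric name `rdev_commands_active` in the usage comment are hypothetical; the guide only shows the substring `commands_active`.

```bash
# metric_sum PATTERN
# Hypothetical helper: reads Prometheus text-format metrics on stdin and
# prints the sum of all samples whose name contains PATTERN.
# Assumes no timestamps and no spaces inside label values.
# Returns non-zero if nothing matched.
metric_sum() {
  awk -v pat="$1" '
    !/^#/ && index($1, pat) { sum += $NF; n++ }   # skip HELP/TYPE comments
    END { if (n) print sum; else exit 1 }
  '
}

# Usage, inside the API pod shell:
#   curl -s localhost:8080/metrics | metric_sum commands_active
# With samples such as rdev_commands_active{project="a"} 2 and
# rdev_commands_active{project="b"} 1 (invented names), this prints 3.
```

A value that stays above zero while no client is waiting on output suggests commands that never finished.
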
### SSE Connection Drops

**Symptoms:**
- Clients disconnect unexpectedly
- Events stop arriving mid-command

**Diagnosis:**
```bash
# Check ingress timeout settings (an Ingress lives in the same namespace as its backend Service)
kubectl -n rdev get ing rdev-api -o yaml
```

**Common Causes:**

1. **Proxy timeout:**
   Ensure the ingress has a long read timeout:
   ```yaml
   nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
   ```

2. **Client timeout:**
   Check client-side timeout configuration

3. **Network interruption:**
   Implement reconnection with `Last-Event-ID`

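Reconnecting with `Last-Event-ID` means remembering the most recent `id:` field seen on the stream and sending it back as a request header. The sketch below covers the first half, assuming the stream follows the standard SSE wire format. The function name `last_event_id`, the file `events.log` and the `$url` variable are hypothetical; this guide does not name the SSE endpoint.

```bash
# last_event_id
# Hypothetical helper: reads a captured SSE stream on stdin and prints the
# value of the last "id:" field, or nothing if no event carried an id.
last_event_id() {
  awk '
    /^id:/ {
      v = $0
      sub(/^id: ?/, "", v)   # SSE strips one optional space after the colon
      sub(/\r$/, "", v)      # tolerate CRLF line endings
      id = v
    }
    END { if (id != "") print id }
  '
}

# Usage: capture the stream, then resume from the last event after a drop.
#   curl -sN "$url" | tee -a events.log
#   curl -sN -H "Last-Event-ID: $(last_event_id < events.log)" "$url"
```
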
### High Memory Usage

**Symptoms:**
- OOMKilled events
- Slow response times

**Diagnosis:**
```bash
# Check memory metrics
kubectl -n rdev top pod -l app=rdev-api

# Check for memory leaks in logs (--tail=-1 reads everything; selectors default to the last 10 lines)
kubectl -n rdev logs -l app=rdev-api --tail=-1 | grep -i memory
```

**Solutions:**

1. **Increase limits:**
   ```yaml
   resources:
     limits:
       memory: "1Gi"
   ```

2. **Check for stream leaks:**
   Ensure SSE connections are properly closed

3. **Restart pod:**
   ```bash
   kubectl -n rdev rollout restart deployment/rdev-api
   ```

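When several replicas are running, filtering the `kubectl top pod` output by a threshold makes the heavy pods easier to spot. This is a minimal sketch; the function name `pods_over_mem` is hypothetical, and it assumes the memory column is reported in `Mi` or `Gi`.

```bash
# pods_over_mem LIMIT_MI
# Hypothetical helper: reads `kubectl top pod` output on stdin and prints
# the pods whose memory usage exceeds LIMIT_MI mebibytes.
# Assumes columns NAME CPU MEMORY with memory in Mi or Gi.
pods_over_mem() {
  awk -v limit="$1" '
    $1 == "NAME" { next }                      # skip the header row
    {
      mem = $3
      if (mem ~ /Gi$/)      { sub(/Gi$/, "", mem); mem = mem * 1024 }
      else if (mem ~ /Mi$/) { sub(/Mi$/, "", mem); mem = mem + 0 }
      else next                                # unknown unit: ignore the row
      if (mem > limit + 0) print $1, mem "Mi"
    }
  '
}

# Usage: list rdev-api pods using more than 512Mi.
#   kubectl -n rdev top pod -l app=rdev-api | pods_over_mem 512
```
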
### High CPU Usage

**Symptoms:**
- CPU throttling
- Slow request processing

**Diagnosis:**
```bash
# Check CPU metrics
kubectl -n rdev top pod -l app=rdev-api

# Profile if possible (no -t flag: a TTY would corrupt the binary profile)
kubectl -n rdev exec deployment/rdev-api -- curl -s localhost:8080/debug/pprof/profile > cpu.prof
```

**Solutions:**

1. **Scale horizontally:**
   ```bash
   kubectl -n rdev scale deployment/rdev-api --replicas=3
   ```

2. **Identify hot paths:**
   Use profiling to find CPU-intensive code

3. **Check command sanitization:**
   Complex regular expressions can be expensive

## Recovery Procedures

### Emergency Restart

```bash
# Restart all pods (rolling restart)
kubectl -n rdev rollout restart deployment/rdev-api

# Alternative: scale down and up (the API is unavailable while replicas=0)
kubectl -n rdev scale deployment/rdev-api --replicas=0
kubectl -n rdev scale deployment/rdev-api --replicas=2
```

### Rollback

```bash
# Check rollout history
kubectl -n rdev rollout history deployment/rdev-api

# Rollback to previous
kubectl -n rdev rollout undo deployment/rdev-api

# Rollback to specific revision
kubectl -n rdev rollout undo deployment/rdev-api --to-revision=5
```

### Database Recovery

```bash
# Connect to database
kubectl -n databases exec -it deployment/postgres -- psql -U rdev

# Check tables (run inside psql)
\dt

# Check recent keys (run inside psql)
SELECT id, name, created_at FROM api_keys ORDER BY created_at DESC LIMIT 10;
```

## Getting Help

1. Check logs for specific error messages
2. Search this troubleshooting guide
3. Check runbooks for specific scenarios
4. Contact the platform team with:
   - Request ID (from error response)
   - Timestamp
   - Steps to reproduce
   - Relevant logs