# Troubleshooting Guide

Common issues and their resolutions for rdev API.

## Quick Diagnostics

```bash
# Check pod status
kubectl -n rdev get pods -l app=rdev-api

# Check logs
kubectl -n rdev logs -l app=rdev-api --tail=100

# Check events
kubectl -n rdev get events --sort-by='.lastTimestamp'

# Check endpoints
kubectl -n rdev get endpoints rdev-api

# Test health
kubectl -n rdev exec -it deployment/rdev-api -- wget -qO- localhost:8080/health
```

## Common Issues

### Pod Not Starting

**Symptoms:**
- Pod stuck in `Pending` or `CrashLoopBackOff`
- No endpoints registered

**Diagnosis:**
```bash
kubectl -n rdev describe pod -l app=rdev-api
kubectl -n rdev logs -l app=rdev-api --previous
```

**Common Causes:**

1. **Missing secrets:**
   ```
   Error: secret "rdev-api-secrets" not found
   ```
   Fix: Create the required secret:
   ```bash
   kubectl -n rdev create secret generic rdev-api-secrets \
     --from-literal=postgres-password=xxx
   ```

2. **Resource constraints:**
   ```
   0/3 nodes are available: insufficient memory
   ```
   Fix: Reduce the pod's resource requests or add nodes.

3. **Image pull errors:**
   ```
   Failed to pull image "registry/rdev-api:latest"
   ```
   Fix: Check the image name and the registry credentials.

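When several replicas exist, it can help to list only the pods that are not healthy before describing them one by one. The following is a minimal sketch, not part of the rdev tooling: `unhealthy_pods` is a hypothetical helper name, and it assumes the default column layout of `kubectl get pods --no-headers` (NAME, READY, STATUS, ...).

```bash
#!/usr/bin/env bash
# Hypothetical helper: print "<pod> <status>" for every pod whose STATUS
# column is neither Running nor Completed.
# Input: `kubectl get pods --no-headers` output on stdin
# (assumed columns: NAME READY STATUS RESTARTS AGE).
unhealthy_pods() {
  awk '$3 != "Running" && $3 != "Completed" { print $1, $3 }'
}

# Usage against the cluster:
#   kubectl -n rdev get pods -l app=rdev-api --no-headers | unhealthy_pods
```

Each pod it prints is a candidate for the `describe` and `logs --previous` commands above.
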
### Database Connection Failed

**Symptoms:**
- Readiness probe failing
- Logs show `dial tcp: connection refused`

**Diagnosis:**
```bash
# Open a shell in the API pod
kubectl -n rdev exec -it deployment/rdev-api -- sh

# Inside the pod shell: check database connectivity
nc -zv postgres.databases.svc 5432
```

**Common Causes:**

1. **Wrong host/port:**
   Check that the ConfigMap values match the actual database host and port.

2. **Network policy blocking:**
   ```bash
   kubectl -n rdev get networkpolicy
   ```
   Ensure egress to the database namespace is allowed.

3. **Credentials incorrect:**
   Verify that the secret values match the database credentials.

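To confirm which host and port the API is actually configured to reach, the connection URL can be split before running `nc`. This is a rough sketch under assumptions: it supposes `DATABASE_URL` is a URL of the form `postgres://user:pass@host:port/db`, the helper name `db_host_port` is made up for illustration, and IPv6 literals are not handled.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print "<host> <port>" from a postgres:// style URL.
# Falls back to port 5432 when the URL carries no explicit port.
db_host_port() {
  local rest=${1#*://}   # drop the scheme
  rest=${rest%%/*}       # drop /dbname and anything after it
  rest=${rest%%\?*}      # drop a query string if there was no path
  rest=${rest##*@}       # drop user:password@
  case $rest in
    *:*) printf '%s %s\n' "${rest%%:*}" "${rest##*:}" ;;
    *)   printf '%s %s\n' "$rest" 5432 ;;
  esac
}

# Usage inside the pod shell (assumes DATABASE_URL is set there):
#   set -- $(db_host_port "$DATABASE_URL")
#   nc -zv "$1" "$2"
```

If the printed host or port differs from the real database service, the ConfigMap or secret that builds the URL is the first place to look.
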
### Authentication Failures

**Symptoms:**
- All requests return 401
- Logs show `invalid API key`

**Diagnosis:**
```bash
# Open a shell in the API pod
kubectl -n rdev exec -it deployment/rdev-api -- sh

# Inside the pod shell (requires psql in the image): check if keys exist in database
psql "$DATABASE_URL" -c "SELECT id, name, revoked_at FROM api_keys LIMIT 10;"
```

**Common Causes:**

1. **Key not created:**
   Create an admin key manually if needed.

2. **Key revoked:**
   Check that `revoked_at` is NULL for the key.

3. **Wrong key format:**
   Keys must start with `rdev_`.

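The key format rule can be checked on the client before a request is sent. A minimal sketch follows; `has_rdev_prefix` and the `RDEV_API_KEY` variable are hypothetical names, and the check covers only the `rdev_` prefix stated above, not key length or character set.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed (exit 0) only if the argument starts with rdev_.
# A key with leading whitespace or a missing prefix fails the check.
has_rdev_prefix() {
  case $1 in
    rdev_*) return 0 ;;
    *)      return 1 ;;
  esac
}

# Usage (RDEV_API_KEY is an assumed variable name):
#   has_rdev_prefix "$RDEV_API_KEY" || echo "key does not start with rdev_" >&2
```
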
### Rate Limiting Issues

**Symptoms:**
- Intermittent 429 responses
- `X-RateLimit-Remaining: 0`

**Diagnosis:**
```bash
# Check rate limit metrics
curl http://rdev-api:8080/metrics | grep ratelimit
```

**Solutions:**

1. **Increase limits:**
   Update ConfigMap:
   ```yaml
   RATE_LIMIT_RPS: "20"
   ```

2. **Check for loops:**
   Client may be making excessive requests

3. **Use separate keys:**
   Different clients should use different API keys

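A client that retries immediately after a 429 tends to stay rate limited. One common mitigation is capped exponential backoff, sketched below. This is an illustration rather than rdev client code: the function names, the default base of 1 second, and the 60-second cap are assumptions, and authentication headers are omitted because the header name is not given here.

```bash
#!/usr/bin/env bash
# Hypothetical helper: delay in seconds for a retry attempt (0-based),
# computed as base * 2^attempt and capped at `cap`.
backoff_seconds() {
  local attempt=$1 base=${2:-1} cap=${3:-60}
  local delay=$(( base * (1 << attempt) ))
  if (( delay > cap )); then
    delay=$cap
  fi
  echo "$delay"
}

# Hypothetical helper: GET a URL, retrying only while the server answers 429.
# Prints the final HTTP status; returns 1 if every attempt was rate limited.
request_with_backoff() {
  local url=$1 max_attempts=${2:-5} attempt status
  for (( attempt = 0; attempt < max_attempts; attempt++ )); do
    status=$(curl -s -o /dev/null -w '%{http_code}' "$url")
    if [ "$status" != "429" ]; then
      echo "$status"
      return 0
    fi
    sleep "$(backoff_seconds "$attempt")"
  done
  echo "429"
  return 1
}

# Usage:
#   request_with_backoff "http://rdev-api:8080/health"
```

With the defaults, the waits between attempts are 1, 2, 4, 8, and 16 seconds, which keeps a looping client from consuming the whole window.
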
### Command Execution Timeouts

**Symptoms:**
- Commands hang indefinitely
- SSE stream never completes

**Diagnosis:**
```bash
# Open a shell in the API pod
kubectl -n rdev exec -it deployment/rdev-api -- sh

# Inside the pod shell: check active commands
curl localhost:8080/metrics | grep commands_active

# From your workstation: check target pod
kubectl -n rdev get pod <target-pod> -o wide
kubectl -n rdev exec -it <target-pod> -- ps aux
```

**Common Causes:**

1. **Target pod not running:**
   ```bash
   kubectl -n rdev get pods -l rdev.orchard9.ai/project=true
   ```

2. **Command actually slow:**
   Some commands legitimately take a long time.

3. **Network issues:**
   Check connectivity between the API pod and the target pod.

### SSE Connection Drops

**Symptoms:**
- Clients disconnect unexpectedly
- Events stop arriving mid-command

**Diagnosis:**
```bash
# Check ingress timeout settings
# (an Ingress lives in the same namespace as the Service it routes to)
kubectl -n rdev get ing rdev-api -o yaml
```

**Common Causes:**

1. **Proxy timeout:**
   Ensure the ingress has a long read timeout:
   ```yaml
   nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
   ```

2. **Client timeout:**
   Check the client-side timeout configuration.

3. **Network interruption:**
   Implement reconnection with `Last-Event-ID`.

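Reconnection with `Last-Event-ID` means remembering the most recent `id:` field received and sending it back as a request header when the stream is reopened. The sketch below shows the bookkeeping half under assumptions: `last_event_id`, `stream.log`, and `STREAM_URL` are hypothetical names, and it supposes the client has been appending the raw stream to a file.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the value of the last "id:" field found in raw
# SSE text on stdin. Prints nothing if the stream carried no ids.
last_event_id() {
  awk '
    /^id:/ {
      sub(/\r$/, "")      # tolerate CRLF line endings
      sub(/^id: ?/, "")   # drop the field name and one optional space
      id = $0
    }
    END { if (id != "") print id }
  '
}

# Usage (stream.log and STREAM_URL are assumed names):
#   id=$(last_event_id < stream.log)
#   curl -N -H "Last-Event-ID: $id" "$STREAM_URL" | tee -a stream.log
```

Only lines that begin with `id:` are considered, so an `id:` string inside a `data:` payload is not mistaken for an event ID.
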
### High Memory Usage

**Symptoms:**
- OOMKilled events
- Slow response times

**Diagnosis:**
```bash
# Check memory metrics
kubectl -n rdev top pod -l app=rdev-api

# Check for memory leaks in logs
kubectl -n rdev logs -l app=rdev-api | grep -i memory
```

**Solutions:**

1. **Increase limits:**
   ```yaml
   resources:
     limits:
       memory: "1Gi"
   ```

2. **Check for stream leaks:**
   Ensure SSE connections are properly closed

3. **Restart pod:**
   ```bash
   kubectl -n rdev rollout restart deployment/rdev-api
   ```

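To spot pods approaching their memory limit before they are OOMKilled, the `kubectl top` output can be filtered against a threshold. This is a sketch only: `pods_over_memory` is a hypothetical helper, it assumes memory is reported in `Mi` in the third column of `kubectl top pod --no-headers`, and rows in other units are ignored.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print "<pod> <memory>" for pods using more than
# the given number of MiB. Reads `kubectl top pod --no-headers` on stdin
# (assumed columns: NAME CPU MEMORY, with MEMORY such as 340Mi).
pods_over_memory() {
  local limit_mi=$1
  awk -v limit="$limit_mi" '
    $3 ~ /Mi$/ {
      mem = $3
      sub(/Mi$/, "", mem)
      if (mem + 0 > limit) print $1, $3
    }
  '
}

# Usage: flag pods above 800 MiB
#   kubectl -n rdev top pod -l app=rdev-api --no-headers | pods_over_memory 800
```
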
### High CPU Usage

**Symptoms:**
- CPU throttling
- Slow request processing

**Diagnosis:**
```bash
# Check CPU metrics
kubectl -n rdev top pod -l app=rdev-api

# Profile if possible (requires the pprof endpoint to be enabled).
# No -it here: allocating a TTY would corrupt the binary profile written to cpu.prof.
kubectl -n rdev exec deployment/rdev-api -- curl -s localhost:8080/debug/pprof/profile > cpu.prof
```

**Solutions:**

1. **Scale horizontally:**
   ```bash
   kubectl -n rdev scale deployment/rdev-api --replicas=3
   ```

2. **Identify hot paths:**
   Use profiling to find CPU-intensive code.

3. **Check command sanitization:**
   Complex regular expressions can be expensive.

## Recovery Procedures

### Emergency Restart

```bash
# Restart all pods
kubectl -n rdev rollout restart deployment/rdev-api

# Scale down and up (the API is unavailable while replicas are at 0)
kubectl -n rdev scale deployment/rdev-api --replicas=0
kubectl -n rdev scale deployment/rdev-api --replicas=2
```

### Rollback

```bash
# Check rollout history
kubectl -n rdev rollout history deployment/rdev-api

# Rollback to previous
kubectl -n rdev rollout undo deployment/rdev-api

# Rollback to specific revision
kubectl -n rdev rollout undo deployment/rdev-api --to-revision=5
```

### Database Recovery

```bash
# Connect to database
kubectl -n databases exec -it deployment/postgres -- psql -U rdev

# At the psql prompt: check tables
\dt

# At the psql prompt: check recent keys
SELECT id, name, created_at FROM api_keys ORDER BY created_at DESC LIMIT 10;
```

## Getting Help

1. Check logs for specific error messages
2. Search this troubleshooting guide
3. Check runbooks for specific scenarios
4. Contact the platform team with:
   - Request ID (from error response)
   - Timestamp
   - Steps to reproduce
   - Relevant logs