rdev/docs/operations/troubleshooting.md
jordan a9ad3d8304
chore: accumulated platform hardening and CI fixes
CI / Woodpecker:
- Add explicit depends_on to all .woodpecker.yml steps (rdev + templates)
- Fix skip_tls_verify -> skip-tls-verify (correct Kaniko flag name)
- Add replicasets get/list to deployer RBAC for rollout status
- Skeleton template: add failure:ignore on docs steps, Traefik TLS
  annotations on ingress, depends_on on verify step

Component templates:
- Fix container name in deploy steps (PROJECT_NAME-COMPONENT_NAME)
- Replace kubectl scale with kubectl patch for replicas
- Add post-deploy image verification and rollout status checks
- Applied consistently across all 5 component templates

Adapters:
- gitea: Add HTTP client timeout (30s), context cancellation checks,
  handle 404 on GetRepo/DeleteRepo
- zot: Add retry with exponential backoff (doWithRetry), limit response
  body reads to 10MB
- cockroach: Use net.JoinHostPort for IPv6-safe DSN construction
- woodpecker: Fix error wrapping (%v -> %w)
- redis: Fix error wrapping (%v -> %w)
- deployer: Add context cancellation checks

Services:
- apikey_service: Fix error wrapping (%v -> %w)
- component_deploy: Fix error wrapping (%v -> %w)
- project_infra: Fix error wrapping (%v -> %w)
- webhook/dispatcher: Fix error wrapping (%v -> %w)

Other:
- CLAUDE.md: Add guide links for Gitea, Go 1.25, Woodpecker v3,
  Traefik v3, Zot registry
- circuitbreaker: Add test for error wrapping
- docs: Update deployment, troubleshooting, and runbook docs
- health: Fix error wrapping (%v -> %w)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-10 23:16:56 -07:00


# Troubleshooting Guide
Common issues and their resolutions for the rdev API.
## Prerequisites
```bash
# REQUIRED: Set kubeconfig before any kubectl command
export KUBECONFIG=~/.kube/orchard9-k3sf.yaml
```
## Quick Diagnostics
```bash
# Check pod status
kubectl -n rdev get pods -l app=rdev-api
# Check logs (use script for convenience)
./scripts/logs.sh # Last 100 lines
./scripts/logs.sh -e # Errors only
# Check events
kubectl -n rdev get events --sort-by='.lastTimestamp'
# Check endpoints
kubectl -n rdev get endpoints rdev-api
# Test health
curl "$RDEV_API_URL/health"
```
## Common Issues
### Pod Not Starting
**Symptoms:**
- Pod stuck in `Pending` or `CrashLoopBackOff`
- No endpoints registered
**Diagnosis:**
```bash
kubectl -n rdev describe pod -l app=rdev-api
kubectl -n rdev logs -l app=rdev-api --previous
```
**Common Causes:**
1. **Missing secrets:**
```
Error: secret "rdev-api-secrets" not found
```
Fix: Create the required secret
```bash
kubectl -n rdev create secret generic rdev-api-secrets \
  --from-literal=postgres-password=xxx
```
2. **Resource constraints:**
```
0/3 nodes are available: insufficient memory
```
Fix: Reduce resource requests or add nodes
3. **Image pull errors:**
```
Failed to pull image "registry/rdev-api:latest"
```
Fix: Check image name, registry credentials
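The three causes above can be told apart mechanically from the `describe` and log output. The following is a minimal sketch, not part of the rdev tooling: `classify_pod_failure` is a hypothetical helper name, and the patterns are only the example messages from this guide plus the standard `ImagePullBackOff` and `ErrImagePull` reasons.

```bash
#!/usr/bin/env bash
# Sketch: map `kubectl describe` or log text to the causes listed in this guide.
# classify_pod_failure is a hypothetical helper; the patterns are not exhaustive.
classify_pod_failure() {
  # Reads text on stdin and prints one line per matched cause.
  local input matched=0
  input=$(cat)
  if printf '%s\n' "$input" | grep -qiE 'secret "[^"]+" not found'; then
    echo "missing-secret: create the required secret"
    matched=1
  fi
  if printf '%s\n' "$input" | grep -qiE 'insufficient (memory|cpu)'; then
    echo "resources: reduce resource requests or add nodes"
    matched=1
  fi
  if printf '%s\n' "$input" | grep -qiE 'failed to pull image|ImagePullBackOff|ErrImagePull'; then
    echo "image-pull: check image name and registry credentials"
    matched=1
  fi
  # Fall through when none of the known messages appear.
  [ "$matched" -eq 1 ] || echo "unknown: inspect the full describe output"
}

# Usage (against a live cluster):
#   kubectl -n rdev describe pod -l app=rdev-api | classify_pod_failure
```

An `unknown` result means none of the three messages appeared, so read the full output rather than relying on the helper.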
### Database Connection Failed
**Symptoms:**
- Readiness probe failing
- Logs show `dial tcp: connection refused`
**Diagnosis:**
```bash
# Check database pods
kubectl get pods -n databases
# Test CockroachDB
kubectl exec -n databases cockroachdb-0 -- \
  /cockroach/cockroach node status --insecure --host=localhost:26257
# Test Redis
REDIS_PASS=$(kubectl get secret -n threesix redis-credentials -o jsonpath="{.data.REDIS_PASSWORD}" | base64 -d)
kubectl exec -n threesix redis-0 -- redis-cli -a "$REDIS_PASS" ping
# Test PostgreSQL
kubectl exec -n databases postgres-0 -- psql -U rdev -d rdev -c "SELECT 1;"
```
See [database-connections.md](database-connections.md) for full connection details.
**Common Causes:**
1. **Wrong host/port:**
Check that the host and port in the ConfigMap match the actual database service
2. **Network policy blocking:**
```bash
kubectl -n rdev get networkpolicy
```
Ensure egress to database namespace is allowed
3. **Credentials incorrect:**
Verify secret values match database credentials
### Authentication Failures
**Symptoms:**
- All requests return 401
- Logs show `invalid API key`
**Diagnosis:**
```bash
# Check if keys exist in database
# Open a shell in the API pod, then run psql inside it (assumes psql is available in the image)
kubectl -n rdev exec -it deployment/rdev-api -- sh
psql "$DATABASE_URL" -c "SELECT id, name, revoked_at FROM api_keys LIMIT 10;"
```
**Common Causes:**
1. **Key not created:**
Create an admin key manually if needed
2. **Key revoked:**
Check `revoked_at` is NULL for the key
3. **Wrong key format:**
Keys must start with `rdev_`
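The format cause can be ruled out locally before querying the database. This is a small sketch of that check; `has_rdev_prefix` and the `RDEV_API_KEY` variable are hypothetical names, and the only rule encoded is the `rdev_` prefix stated above.

```bash
#!/usr/bin/env bash
# Sketch: verify that a key carries the required rdev_ prefix.
# has_rdev_prefix is a hypothetical helper name.
has_rdev_prefix() {
  case "$1" in
    rdev_*) return 0 ;;  # prefix present
    *)      return 1 ;;  # wrong prefix, leading whitespace, or empty
  esac
}

# Usage (RDEV_API_KEY is an assumed variable name):
#   has_rdev_prefix "$RDEV_API_KEY" || echo "key does not start with rdev_" >&2
```

A key copied with leading whitespace or quotes fails this check, which is a common reason for a valid key being rejected.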
### Rate Limiting Issues
**Symptoms:**
- Intermittent 429 responses
- `X-RateLimit-Remaining: 0`
**Diagnosis:**
```bash
# Check rate limit metrics (the service name resolves only from inside the cluster, in the rdev namespace)
curl -s http://rdev-api:8080/metrics | grep ratelimit
```
**Solutions:**
1. **Increase limits:**
Update ConfigMap:
```yaml
RATE_LIMIT_RPS: "20"
```
2. **Check for loops:**
Client may be making excessive requests
3. **Use separate keys:**
Different clients should use different API keys
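On the client side, backing off after a 429 usually stops a request loop from exhausting the limit again immediately. The sketch below shows one way to do that with capped exponential backoff. `backoff_delay` and `request_with_backoff` are hypothetical helpers, the base and cap values are arbitrary, and the `curl` call omits authentication because this guide does not specify the header.

```bash
#!/usr/bin/env bash
# Sketch: capped exponential backoff for 429 responses (helper names are hypothetical).

# backoff_delay ATTEMPT [BASE] [CAP]: prints base * 2^(attempt-1) seconds, capped at CAP.
backoff_delay() {
  local attempt=$1 base=${2:-1} cap=${3:-30}
  local delay=$(( base * (1 << (attempt - 1)) ))
  [ "$delay" -gt "$cap" ] && delay=$cap
  echo "$delay"
}

# request_with_backoff URL [MAX_ATTEMPTS]: retries while the server answers 429.
# Prints the final HTTP status; returns 1 if every attempt was rate limited.
request_with_backoff() {
  local url=$1 max_attempts=${2:-5} attempt=1 status
  while [ "$attempt" -le "$max_attempts" ]; do
    # Add your authentication header here; it is left out of this sketch.
    status=$(curl -s -o /dev/null -w '%{http_code}' "$url")
    if [ "$status" != "429" ]; then
      echo "$status"
      return 0
    fi
    sleep "$(backoff_delay "$attempt")"
    attempt=$((attempt + 1))
  done
  echo "429"
  return 1
}

# Usage:
#   request_with_backoff "$RDEV_API_URL/health"
```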
### Command Execution Timeouts
**Symptoms:**
- Commands hang indefinitely
- SSE stream never completes
**Diagnosis:**
```bash
# Check active commands (open a shell in the API pod, then run curl inside it)
kubectl -n rdev exec -it deployment/rdev-api -- sh
curl localhost:8080/metrics | grep commands_active
# Check target pod
kubectl -n rdev get pod <target-pod> -o wide
kubectl -n rdev exec -it <target-pod> -- ps aux
```
**Common Causes:**
1. **Target pod not running:**
```bash
kubectl -n rdev get pods -l rdev.orchard9.ai/project=true
```
2. **Command actually slow:**
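Some commands legitimately take a long time; confirm with `ps aux` on the target pod that the process is still running before treating it as hung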
3. **Network issues:**
Check connectivity between API pod and target pod
### SSE Connection Drops
**Symptoms:**
- Clients disconnect unexpectedly
- Events stop arriving mid-command
**Diagnosis:**
```bash
# Inspect the Ingress and its annotations
kubectl -n rdev get ingress rdev-api -o yaml
```
**Common Causes:**
1. **Proxy timeout:**
Traefik timeout is configured at the entrypoint level via HelmChartConfig,
not per-Ingress annotations. See `.claude/guides/ops/traefik-v3.md` for details.
```yaml
# Routing annotations on the Ingress (timeouts are not set here)
traefik.ingress.kubernetes.io/router.entrypoints: websecure
traefik.ingress.kubernetes.io/router.tls: "true"
```
2. **Client timeout:**
Check client-side timeout configuration
3. **Network interruption:**
Implement reconnection with `Last-Event-ID`
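Reconnection with `Last-Event-ID` can be sketched with `curl` as follows. The client records the `id:` of the last event it received and sends it back in the `Last-Event-ID` header when it reconnects. `last_event_id` and `resume_stream` are hypothetical helper names, and the stream URL is a placeholder, since this guide does not name the SSE endpoint.

```bash
#!/usr/bin/env bash
# Sketch: resume an SSE stream from the last received event (helper names are hypothetical).

# last_event_id: prints the id of the last event found in SSE text on stdin.
last_event_id() {
  awk '{ sub(/\r$/, "") }                       # tolerate CRLF line endings
       /^id:/ { sub(/^id: ?/, ""); id = $0 }    # remember the most recent id field
       END { if (id != "") print id }'
}

# resume_stream URL LOGFILE: reconnects once, sending Last-Event-ID when the
# log of the previous attempt contains one, and appends new events to the log.
resume_stream() {
  local url=$1 log=$2 last_id=""
  [ -f "$log" ] && last_id=$(last_event_id < "$log")
  if [ -n "$last_id" ]; then
    curl -sN -H "Accept: text/event-stream" -H "Last-Event-ID: $last_id" "$url" | tee -a "$log"
  else
    curl -sN -H "Accept: text/event-stream" "$url" | tee -a "$log"
  fi
}

# Usage (the URL is a placeholder):
#   resume_stream "<stream-url>" events.log
```

This only helps if the server replays events after the given ID, so confirm that behavior against the API before relying on it.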
### High Memory Usage
**Symptoms:**
- OOMKilled events
- Slow response times
**Diagnosis:**
```bash
# Check memory metrics
kubectl -n rdev top pod -l app=rdev-api
# Check for memory leaks in logs
kubectl -n rdev logs -l app=rdev-api | grep -i memory
```
**Solutions:**
1. **Increase limits:**
```yaml
resources:
  limits:
    memory: "1Gi"
```
2. **Check for stream leaks:**
Ensure SSE connections are properly closed
3. **Restart pod:**
```bash
kubectl -n rdev rollout restart deployment/rdev-api
```
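To spot pods approaching their limit before they are OOMKilled, the `kubectl top` output can be filtered against a threshold. This is a rough sketch: `pods_over_threshold` is a hypothetical helper, it assumes memory is reported in `Mi` (lines in other units are skipped), and the 800 Mi threshold is only an example value below a 1Gi (1024 Mi) limit.

```bash
#!/usr/bin/env bash
# Sketch: list pods whose memory usage exceeds a threshold in Mi.
# pods_over_threshold is a hypothetical helper; input is `kubectl top pod` output.
pods_over_threshold() {
  local threshold_mi=$1
  awk -v max="$threshold_mi" '
    $3 ~ /^[0-9]+Mi$/ {            # column 3 is MEMORY(bytes), e.g. 850Mi
      mem = $3
      sub(/Mi$/, "", mem)
      if (mem + 0 > max + 0) print $1, $3
    }'
}

# Usage (against a live cluster):
#   kubectl -n rdev top pod -l app=rdev-api --no-headers | pods_over_threshold 800
```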
### High CPU Usage
**Symptoms:**
- CPU throttling
- Slow request processing
**Diagnosis:**
```bash
# Check CPU metrics
kubectl -n rdev top pod -l app=rdev-api
# Profile if possible (blocks for about 30 s by default; omit -it, because a TTY corrupts the binary output)
kubectl -n rdev exec deployment/rdev-api -- curl -s localhost:8080/debug/pprof/profile > cpu.prof
```
**Solutions:**
1. **Scale horizontally:**
```bash
kubectl -n rdev scale deployment/rdev-api --replicas=3
```
2. **Identify hot paths:**
Use profiling to find CPU-intensive code
3. **Check command sanitization:**
Complex regex can be expensive
## Recovery Procedures
### Emergency Restart
```bash
# Restart all pods
kubectl -n rdev rollout restart deployment/rdev-api
# If the restart does not help: scale to zero and back up (causes downtime)
kubectl -n rdev scale deployment/rdev-api --replicas=0
kubectl -n rdev scale deployment/rdev-api --replicas=2
```
### Rollback
```bash
# Check rollout history
kubectl -n rdev rollout history deployment/rdev-api
# Rollback to previous
kubectl -n rdev rollout undo deployment/rdev-api
# Rollback to specific revision
kubectl -n rdev rollout undo deployment/rdev-api --to-revision=5
```
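`kubectl rollout undo` returns as soon as the change is submitted, so it is worth waiting for the rollout to finish before declaring the rollback done. A minimal sketch, assuming the same namespace and deployment as above; `rollback_and_wait` is a hypothetical helper and the 120 s timeout is an arbitrary choice.

```bash
#!/usr/bin/env bash
# Sketch: roll back and block until the deployment is healthy again.
# rollback_and_wait is a hypothetical helper; pass a revision number or nothing.
rollback_and_wait() {
  local revision=${1:-}
  if [ -n "$revision" ]; then
    kubectl -n rdev rollout undo deployment/rdev-api --to-revision="$revision" || return 1
  else
    kubectl -n rdev rollout undo deployment/rdev-api || return 1
  fi
  # Exits non-zero if the rollout does not complete within the timeout.
  kubectl -n rdev rollout status deployment/rdev-api --timeout=120s
}

# Usage:
#   rollback_and_wait      # previous revision
#   rollback_and_wait 5    # specific revision
```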
### Database Recovery
```bash
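# Connect to database (same pod as in the Database Connection Failed diagnosis)
kubectl -n databases exec -it postgres-0 -- psql -U rdev -d rdev
# The remaining commands run at the psql prompt
# Check tables
\dt
# Check recent keys
SELECT id, name, created_at FROM api_keys ORDER BY created_at DESC LIMIT 10;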
```
## Getting Help
1. Check logs for specific error messages
2. Search this troubleshooting guide
3. Check runbooks for specific scenarios
4. Contact the platform team with:
- Request ID (from error response)
- Timestamp
- Steps to reproduce
- Relevant logs
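The logs and cluster state for such a report can be gathered in one step. The sketch below writes the output of the Quick Diagnostics commands into a directory to attach to the request. `collect_diagnostics` is a hypothetical helper, and the file names and the 500-line log tail are arbitrary choices.

```bash
#!/usr/bin/env bash
# Sketch: bundle diagnostics for the platform team.
# collect_diagnostics is a hypothetical helper; it prints the output directory.
collect_diagnostics() {
  local out=${1:-rdev-diag-$(date -u +%Y%m%dT%H%M%SZ)}
  mkdir -p "$out" || return 1
  # Each file captures stdout and stderr, so failed commands are recorded too.
  kubectl -n rdev get pods -l app=rdev-api -o wide > "$out/pods.txt" 2>&1
  kubectl -n rdev describe pod -l app=rdev-api > "$out/describe.txt" 2>&1
  kubectl -n rdev get events --sort-by='.lastTimestamp' > "$out/events.txt" 2>&1
  kubectl -n rdev logs -l app=rdev-api --tail=500 > "$out/logs.txt" 2>&1
  echo "$out"
}

# Usage:
#   dir=$(collect_diagnostics) && tar czf "$dir.tar.gz" "$dir"
```

Review the bundle for secrets before sharing it, since logs and `describe` output can contain sensitive values.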