
# Troubleshooting Guide

Common issues and their resolutions for the rdev API.

## Prerequisites

```bash
# REQUIRED: Set kubeconfig before any kubectl command
export KUBECONFIG=~/.kube/orchard9-k3sf.yaml
```
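Because every command in this guide depends on the kubeconfig, a small guard at the top of a diagnostic script can fail fast when it is missing. The following is a minimal sketch; `require_kubeconfig` is a hypothetical helper (not part of the rdev scripts), and it assumes `KUBECONFIG` holds a single path rather than a colon-separated list.

```bash
# Sketch: fail fast when KUBECONFIG is unset or points at a missing file.
# require_kubeconfig is a hypothetical helper, not part of the rdev scripts.
# Assumes KUBECONFIG is a single path, not a colon-separated list.
require_kubeconfig() {
  if [ -z "${KUBECONFIG:-}" ]; then
    echo "KUBECONFIG is not set" >&2
    return 1
  fi
  if [ ! -f "$KUBECONFIG" ]; then
    echo "KUBECONFIG file not found: $KUBECONFIG" >&2
    return 1
  fi
}
```

Calling `require_kubeconfig || exit 1` before the first `kubectl` command stops the script before it can target the wrong cluster.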

## Quick Diagnostics

```bash
# Check pod status
kubectl -n rdev get pods -l app=rdev-api

# Check logs (use script for convenience)
./scripts/logs.sh           # Last 100 lines
./scripts/logs.sh -e        # Errors only

# Check events
kubectl -n rdev get events --sort-by='.lastTimestamp'

# Check endpoints
kubectl -n rdev get endpoints rdev-api

# Test health
curl "$RDEV_API_URL/health"
```
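When a pod is still starting, a single health request can be misleading. The sketch below polls the health endpoint a few times before giving up. `wait_for_health`, its default attempt count, and its delay are hypothetical, and it assumes a healthy API answers with HTTP 200.

```bash
# Sketch: poll the health endpoint until it answers 200 or attempts run out.
# wait_for_health and its defaults are hypothetical, not part of rdev.
# Assumes a healthy API responds with HTTP 200.
wait_for_health() {
  local url="$1" attempts="${2:-10}" delay="${3:-3}" code i
  for ((i = 1; i <= attempts; i++)); do
    # -s silences progress, -o discards the body, -w prints only the status code
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$url")
    if [ "$code" = "200" ]; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    echo "attempt $i/$attempts: HTTP $code" >&2
    sleep "$delay"
  done
  return 1
}
```

For example, `wait_for_health "$RDEV_API_URL/health" 20 3` waits up to roughly a minute before returning a non-zero status.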

## Common Issues

### Pod Not Starting

**Symptoms:**
- Pod stuck in `Pending` or `CrashLoopBackOff`
- No endpoints registered

**Diagnosis:**
```bash
kubectl -n rdev describe pod -l app=rdev-api
kubectl -n rdev logs -l app=rdev-api --previous
```

**Common Causes:**

1. **Missing secrets:**
   ```
   Error: secret "rdev-api-secrets" not found
   ```
   Fix: Create the required secret
   ```bash
   kubectl -n rdev create secret generic rdev-api-secrets \
     --from-literal=postgres-password=xxx
   ```

2. **Resource constraints:**
   ```
   0/3 nodes are available: insufficient memory
   ```
   Fix: Reduce resource requests or add nodes

3. **Image pull errors:**
   ```
   Failed to pull image "registry/rdev-api:latest"
   ```
   Fix: Check image name, registry credentials
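The causes above usually show up as a container waiting reason, which can be read without scanning the full `describe` output. The following is a rough sketch; both function names are hypothetical, and the mapping covers only the standard Kubernetes reasons that correspond to the causes listed here.

```bash
# Sketch: list each rdev-api pod with the reason its containers are waiting.
# pod_waiting_reasons and hint_for_reason are hypothetical helper names.
pod_waiting_reasons() {
  kubectl -n rdev get pods -l app=rdev-api \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].state.waiting.reason}{"\n"}{end}'
}

# Map a waiting reason to the matching cause in this guide.
hint_for_reason() {
  case "$1" in
    CreateContainerConfigError) echo "missing secret or ConfigMap" ;;
    ImagePullBackOff|ErrImagePull) echo "image pull error" ;;
    CrashLoopBackOff) echo "container crashing: check logs with --previous" ;;
    "") echo "no waiting container: check scheduling events" ;;
    *) echo "unrecognized reason: $1" ;;
  esac
}
```

A pod that is `Pending` because it cannot be scheduled has no container status yet, so its reason column is empty; the resource-constraint message appears in the pod events instead.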

### Database Connection Failed

**Symptoms:**
- Readiness probe failing
- Logs show `dial tcp: connection refused`

**Diagnosis:**
```bash
# Check database pods
kubectl get pods -n databases

# Test CockroachDB
kubectl exec -n databases cockroachdb-0 -- \
  /cockroach/cockroach node status --insecure --host=localhost:26257

# Test Redis
REDIS_PASS=$(kubectl get secret -n threesix redis-credentials -o jsonpath="{.data.REDIS_PASSWORD}" | base64 -d)
kubectl exec -n threesix redis-0 -- redis-cli -a "$REDIS_PASS" ping

# Test PostgreSQL
kubectl exec -n databases postgres-0 -- psql -U rdev -d rdev -c "SELECT 1;"
```

See [database-connections.md](database-connections.md) for full connection details.

**Common Causes:**

1. **Wrong host/port:**
   Check that the ConfigMap values match the actual database host and port

2. **Network policy blocking:**
   ```bash
   kubectl -n rdev get networkpolicy
   ```
   Ensure egress to database namespace is allowed

3. **Credentials incorrect:**
   Verify secret values match database credentials
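To verify credentials, it helps to decode the value the pod actually receives and compare it with what the database expects. The sketch below wraps the same `jsonpath` plus `base64 -d` pattern used for Redis above. `secret_value` is a hypothetical helper, and it assumes the key name contains no dots (dots would need escaping in the JSONPath expression).

```bash
# Sketch: decode one key of a Kubernetes secret for comparison.
# secret_value is a hypothetical helper; key names containing dots
# would need escaping and are not handled here.
secret_value() {
  local namespace="$1" name="$2" key="$3"
  kubectl -n "$namespace" get secret "$name" \
    -o jsonpath="{.data.$key}" | base64 -d
}
```

For example, `secret_value rdev rdev-api-secrets postgres-password` prints the password stored in the secret created earlier in this guide. The output is the plaintext credential, so avoid running it in shared terminals or logged sessions.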

### Authentication Failures

**Symptoms:**
- All requests return 401
- Logs show `invalid API key`

**Diagnosis:**
```bash
# Check if keys exist in database (assumes psql is available in the API container)
kubectl -n rdev exec deployment/rdev-api -- \
  sh -c 'psql "$DATABASE_URL" -c "SELECT id, name, revoked_at FROM api_keys LIMIT 10;"'
```

**Common Causes:**

1. **Key not created:**
   Create an admin key manually if needed

2. **Key revoked:**
   Check `revoked_at` is NULL for the key

3. **Wrong key format:**
   Keys must start with `rdev_`
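The prefix rule is easy to check on the client side before a request is sent. This is a minimal sketch that tests only the one format rule stated above; `has_rdev_prefix` is a hypothetical helper, and requiring at least one character after the prefix is an assumption.

```bash
# Sketch: check the one format rule this guide states (the rdev_ prefix).
# has_rdev_prefix is a hypothetical helper; requiring at least one
# character after the prefix is an assumption.
has_rdev_prefix() {
  case "$1" in
    rdev_?*) return 0 ;;
    *) return 1 ;;
  esac
}
```

A passing check does not mean the key is valid; it still has to exist in `api_keys` with `revoked_at` set to NULL.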

### Rate Limiting Issues

**Symptoms:**
- Intermittent 429 responses
- `X-RateLimit-Remaining: 0`

**Diagnosis:**
```bash
# Check rate limit metrics (run from inside the cluster, where rdev-api resolves)
curl -s http://rdev-api:8080/metrics | grep ratelimit
```

**Solutions:**

1. **Increase limits:**
   Update ConfigMap:
   ```yaml
   RATE_LIMIT_RPS: "20"
   ```

2. **Check for loops:**
   Client may be making excessive requests

3. **Use separate keys:**
   Different clients should use different API keys
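On the client side, retrying with exponential backoff avoids the tight retry loops that keep a key pinned at its limit. The following is a sketch under stated assumptions: `backoff_delay` and `get_with_backoff` are hypothetical helpers, the 30 second cap is arbitrary, and any authentication headers are passed through as extra `curl` arguments because this guide does not specify their format.

```bash
# Sketch: retry a request with exponential backoff while the server answers 429.
# backoff_delay and get_with_backoff are hypothetical helpers.

# Delay in seconds for a 1-based attempt: 1, 2, 4, ... capped at $2 (default 30).
backoff_delay() {
  local attempt="$1" cap="${2:-30}" delay
  if [ "$attempt" -gt 16 ]; then
    echo "$cap"
    return 0
  fi
  delay=$((1 << (attempt - 1)))
  if [ "$delay" -gt "$cap" ]; then
    delay="$cap"
  fi
  echo "$delay"
}

# Usage: get_with_backoff MAX_ATTEMPTS URL [extra curl args...]
# Prints the final HTTP status code; returns 1 if every attempt got 429.
get_with_backoff() {
  local max="$1" attempt code
  shift
  for ((attempt = 1; attempt <= max; attempt++)); do
    code=$(curl -s -o /dev/null -w '%{http_code}' "$@")
    if [ "$code" != "429" ]; then
      echo "$code"
      return 0
    fi
    sleep "$(backoff_delay "$attempt")"
  done
  echo "429"
  return 1
}
```

With five attempts the helper waits 1, 2, 4, 8, and 16 seconds between tries, which is usually enough for a per-second limit to recover.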

### Command Execution Timeouts

**Symptoms:**
- Commands hang indefinitely
- SSE stream never completes

**Diagnosis:**
```bash
# Check active commands
kubectl -n rdev exec deployment/rdev-api -- \
  curl -s localhost:8080/metrics | grep commands_active

# Check target pod
kubectl -n rdev get pod <target-pod> -o wide
kubectl -n rdev exec -it <target-pod> -- ps aux
```

**Common Causes:**

1. **Target pod not running:**
   ```bash
   kubectl -n rdev get pods -l rdev.orchard9.ai/project=true
   ```

2. **Command actually slow:**
   Some commands take a long time legitimately

3. **Network issues:**
   Check connectivity between API pod and target pod
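To narrow down the first cause quickly, the target pods can be filtered to those whose phase is not `Running`. This is a small sketch using the label from the command above; `non_running_targets` is a hypothetical helper name.

```bash
# Sketch: list project pods whose phase is anything other than Running.
# non_running_targets is a hypothetical helper name.
non_running_targets() {
  kubectl -n rdev get pods -l rdev.orchard9.ai/project=true \
    --field-selector='status.phase!=Running' \
    -o custom-columns=NAME:.metadata.name,PHASE:.status.phase --no-headers
}
```

An empty result does not rule out a broken target: a pod in `CrashLoopBackOff` still reports the `Running` phase, so check restart counts as well.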

### SSE Connection Drops

**Symptoms:**
- Clients disconnect unexpectedly
- Events stop arriving mid-command

**Diagnosis:**
```bash
# Inspect the Ingress and its annotations (it lives in the same namespace as the rdev-api Service)
kubectl -n rdev get ingress rdev-api -o yaml
```

**Common Causes:**

1. **Proxy timeout:**
   Traefik timeouts are configured at the entrypoint level via HelmChartConfig,
   not through per-Ingress annotations. See `.claude/guides/ops/traefik-v3.md` for details.
   The Ingress annotations only select the entrypoint and enable TLS:
   ```yaml
   traefik.ingress.kubernetes.io/router.entrypoints: websecure
   traefik.ingress.kubernetes.io/router.tls: "true"
   ```

2. **Client timeout:**
   Check client-side timeout configuration

3. **Network interruption:**
   Implement reconnection with `Last-Event-ID`
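Reconnection with `Last-Event-ID` can be sketched with `curl`: the client remembers the most recent `id:` field it saw and sends it back as a header on the next connection. The example below is an illustration only. `sse_follow` is a hypothetical helper, the stream URL is a placeholder, and whether the rdev API replays missed events for a given ID is assumed rather than confirmed here.

```bash
# Sketch: follow an SSE stream and resume from the last event ID after a drop.
# sse_follow is a hypothetical helper. It cannot tell a normal end of stream
# from a dropped connection, so it simply reconnects up to max_connects times.
sse_follow() {
  local url="$1" max_connects="${2:-5}" last_id="" line connects=0
  local -a args
  while [ "$connects" -lt "$max_connects" ]; do
    # -N disables output buffering so events are printed as they arrive
    args=(-sN -H "Accept: text/event-stream")
    if [ -n "$last_id" ]; then
      args+=(-H "Last-Event-ID: $last_id")
    fi
    while IFS= read -r line; do
      line="${line%$'\r'}"
      case "$line" in
        # Remember the newest id, dropping the single optional space after the colon
        id:*) last_id="${line#id:}"; last_id="${last_id# }" ;;
      esac
      printf '%s\n' "$line"
    done < <(curl "${args[@]}" "$url")
    connects=$((connects + 1))
    sleep 1
  done
}
```

A real client would also stop reconnecting once it sees whatever event marks command completion; that marker is not described in this guide, so the sketch only bounds the number of connections.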

### High Memory Usage

**Symptoms:**
- OOMKilled events
- Slow response times

**Diagnosis:**
```bash
# Check memory metrics
kubectl -n rdev top pod -l app=rdev-api

# Check for memory leaks in logs
kubectl -n rdev logs -l app=rdev-api | grep -i memory
```

**Solutions:**

1. **Increase limits:**
   ```yaml
   resources:
     limits:
       memory: "1Gi"
   ```

2. **Check for stream leaks:**
   Ensure SSE connections are properly closed

3. **Restart pod:**
   ```bash
   kubectl -n rdev rollout restart deployment/rdev-api
   ```
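Before raising limits or restarting, it is worth confirming that the last container termination really was an OOM kill. The sketch below reads the last termination reason and restart count from the pod status; `oom_killed_pods` is a hypothetical helper, and it assumes a single container per pod.

```bash
# Sketch: list rdev-api pods whose last container termination was OOMKilled.
# oom_killed_pods is a hypothetical helper; assumes one container per pod.
oom_killed_pods() {
  kubectl -n rdev get pods -l app=rdev-api \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\t"}{.status.containerStatuses[*].restartCount}{"\n"}{end}' \
    | awk -F'\t' '$2 ~ /OOMKilled/ { print $1 " restarts=" $3 }'
}
```

An empty result suggests the slow responses have another cause, such as CPU throttling.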

### High CPU Usage

**Symptoms:**
- CPU throttling
- Slow request processing

**Diagnosis:**
```bash
# Check CPU metrics
kubectl -n rdev top pod -l app=rdev-api

# Profile if possible (no -it: a TTY would corrupt the binary profile)
kubectl -n rdev exec deployment/rdev-api -- curl -s localhost:8080/debug/pprof/profile > cpu.prof
```

**Solutions:**

1. **Scale horizontally:**
   ```bash
   kubectl -n rdev scale deployment/rdev-api --replicas=3
   ```

2. **Identify hot paths:**
   Use profiling to find CPU-intensive code

3. **Check command sanitization:**
   Complex regex can be expensive

## Recovery Procedures

### Emergency Restart

```bash
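# Rolling restart of all pods
kubectl -n rdev rollout restart deployment/rdev-api

# If the rolling restart does not help: scale to zero and back up
# (the API is unavailable while replicas=0)
kubectl -n rdev scale deployment/rdev-api --replicas=0
kubectl -n rdev scale deployment/rdev-api --replicas=2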
```
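A restart is only finished once the new pods are ready, so it can help to block on the rollout before declaring recovery. This is a minimal sketch; `restart_and_wait` is a hypothetical helper and the 120 second default timeout is an arbitrary choice.

```bash
# Sketch: restart the deployment and wait for the rollout to complete.
# restart_and_wait is a hypothetical helper; the default timeout is arbitrary.
restart_and_wait() {
  local timeout="${1:-120s}"
  kubectl -n rdev rollout restart deployment/rdev-api || return 1
  # rollout status blocks until the rollout finishes or the timeout expires
  kubectl -n rdev rollout status deployment/rdev-api --timeout="$timeout"
}
```

If `restart_and_wait` returns non-zero, the new pods did not become ready in time, which usually points back to the "Pod Not Starting" checks.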

### Rollback

```bash
# Check rollout history
kubectl -n rdev rollout history deployment/rdev-api

# Rollback to previous
kubectl -n rdev rollout undo deployment/rdev-api

# Rollback to specific revision
kubectl -n rdev rollout undo deployment/rdev-api --to-revision=5
```

### Database Recovery

```bash
# Connect to database (opens an interactive psql session)
kubectl -n databases exec -it deployment/postgres -- psql -U rdev

# At the psql prompt: check tables
\dt

# At the psql prompt: check recent keys
SELECT id, name, created_at FROM api_keys ORDER BY created_at DESC LIMIT 10;
```

## Getting Help

1. Check logs for specific error messages
2. Search this troubleshooting guide
3. Check runbooks for specific scenarios
4. Contact the platform team with:
   - Request ID (from error response)
   - Timestamp
   - Steps to reproduce
   - Relevant logs