# Troubleshooting Guide

Common issues and their resolutions for rdev API.

## Prerequisites

```bash
# REQUIRED: Set kubeconfig before any kubectl command
export KUBECONFIG=~/.kube/orchard9-k3sf.yaml
```

## Quick Diagnostics

```bash
# Check pod status
kubectl -n rdev get pods -l app=rdev-api

# Check logs (use script for convenience)
./scripts/logs.sh      # Last 100 lines
./scripts/logs.sh -e   # Errors only

# Check events
kubectl -n rdev get events --sort-by='.lastTimestamp'

# Check endpoints
kubectl -n rdev get endpoints rdev-api

# Test health (-f makes curl exit non-zero on HTTP errors)
curl -fsS "$RDEV_API_URL/health"
```

## Common Issues

### Pod Not Starting

**Symptoms:**
- Pod stuck in `Pending` or `CrashLoopBackOff`
- No endpoints registered

**Diagnosis:**
```bash
kubectl -n rdev describe pod -l app=rdev-api
# With a label selector, kubectl logs prints only the last 10 lines unless --tail is set
kubectl -n rdev logs -l app=rdev-api --previous --tail=100
```

**Common Causes:**

1. **Missing secrets:**
   ```
   Error: secret "rdev-api-secrets" not found
   ```
   Fix: Create the required secret
   ```bash
   kubectl -n rdev create secret generic rdev-api-secrets \
     --from-literal=postgres-password=xxx
   ```

2. **Resource constraints:**
   ```
   0/3 nodes are available: insufficient memory
   ```
   Fix: Reduce resource requests or add nodes

3. **Image pull errors:**
   ```
   Failed to pull image "registry/rdev-api:latest"
   ```
   Fix: Check the image name and registry credentials

### Database Connection Failed

**Symptoms:**
- Readiness probe failing
- Logs show `dial tcp: connection refused`

**Diagnosis:**
```bash
# Check database pods
kubectl get pods -n databases

# Test CockroachDB
kubectl exec -n databases cockroachdb-0 -- \
  /cockroach/cockroach node status --insecure --host=localhost:26257

# Test Redis
REDIS_PASS=$(kubectl get secret -n threesix redis-credentials -o jsonpath="{.data.REDIS_PASSWORD}" | base64 -d)
kubectl exec -n threesix redis-0 -- redis-cli -a "$REDIS_PASS" ping

# Test PostgreSQL
kubectl exec -n databases postgres-0 -- psql -U rdev -d rdev -c "SELECT 1;"
```

See [database-connections.md](database-connections.md) for full connection details.

**Common Causes:**

1. **Wrong host/port:**
   Check ConfigMap values match actual database

2. **Network policy blocking:**
   ```bash
   kubectl -n rdev get networkpolicy
   ```
   Ensure egress to database namespace is allowed

3. **Credentials incorrect:**
   Verify secret values match database credentials

### Authentication Failures

**Symptoms:**
- All requests return 401
- Logs show `invalid API key`

**Diagnosis:**
```bash
# Check if keys exist in database
# Step 1: open a shell in the API pod
kubectl -n rdev exec -it deployment/rdev-api -- sh
# Step 2: run this inside that shell
psql "$DATABASE_URL" -c "SELECT id, name, revoked_at FROM api_keys LIMIT 10;"
```

**Common Causes:**

1. **Key not created:**
   Create an admin key manually if needed

2. **Key revoked:**
   Check `revoked_at` is NULL for the key

3. **Wrong key format:**
   Keys must start with `rdev_`

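
The `rdev_` prefix rule from cause 3 can be checked on the client before a request is sent. The following is a minimal sketch: the helper name `is_rdev_key` is hypothetical, and it checks only the prefix, not whether the key exists or has been revoked.

```bash
# is_rdev_key KEY
# Hypothetical helper: succeeds (exit 0) when KEY starts with "rdev_" and has
# at least one character after the prefix. Format check only.
is_rdev_key() {
  case "$1" in
    rdev_?*) return 0 ;;
    *)       return 1 ;;
  esac
}
```

A caller might use it as `is_rdev_key "$API_KEY" || echo "key does not start with rdev_" >&2`, where `API_KEY` is whatever variable holds the key in your environment.
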
### Rate Limiting Issues

**Symptoms:**
- Intermittent 429 responses
- `X-RateLimit-Remaining: 0`

**Diagnosis:**
```bash
# Check rate limit metrics (the rdev-api service name resolves only inside the cluster)
curl -s http://rdev-api:8080/metrics | grep ratelimit
```

**Solutions:**

1. **Increase limits:**
   Update ConfigMap:
   ```yaml
   RATE_LIMIT_RPS: "20"
   ```

2. **Check for loops:**
   A client may be polling or retrying in a tight loop

3. **Use separate keys:**
   Different clients should use different API keys

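
On the client side, a tight retry loop is a common source of repeated 429s. The sketch below shows one way to back off between retries. Both function names are hypothetical, the defaults (1s base, 30s cap, 5 attempts) are illustrative assumptions rather than rdev settings, and any authentication header your deployment needs would have to be added to the `curl` call.

```bash
# backoff_delay ATTEMPT [BASE] [CAP]
# Prints the delay in seconds before retry number ATTEMPT (1-based):
# BASE * 2^(ATTEMPT-1), capped at CAP. Defaults: BASE=1, CAP=30.
backoff_delay() {
  bd_attempt=$1; bd_base=${2:-1}; bd_cap=${3:-30}
  bd_delay=$bd_base
  bd_i=1
  while [ "$bd_i" -lt "$bd_attempt" ] && [ "$bd_delay" -lt "$bd_cap" ]; do
    bd_delay=$((bd_delay * 2))
    bd_i=$((bd_i + 1))
  done
  if [ "$bd_delay" -gt "$bd_cap" ]; then
    bd_delay=$bd_cap
  fi
  echo "$bd_delay"
}

# retry_on_429 URL [MAX_ATTEMPTS]
# GETs URL and retries with exponential backoff while the status is 429.
# Prints the final HTTP status; returns 1 if every attempt was rate limited.
retry_on_429() {
  r_url=$1; r_max=${2:-5}
  r_n=1
  while [ "$r_n" -le "$r_max" ]; do
    r_code=$(curl -s -o /dev/null -w '%{http_code}' "$r_url") || true
    if [ "$r_code" != "429" ]; then
      echo "$r_code"
      return 0
    fi
    if [ "$r_n" -lt "$r_max" ]; then
      sleep "$(backoff_delay "$r_n")"
    fi
    r_n=$((r_n + 1))
  done
  echo "429"
  return 1
}
```

For example, `retry_on_429 "$RDEV_API_URL/health" 4` waits 1s, 2s, and 4s between the four attempts if each one is rate limited.
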
### Command Execution Timeouts

**Symptoms:**
- Commands hang indefinitely
- SSE stream never completes

**Diagnosis:**
```bash
# Check active commands
# Step 1: open a shell in the API pod
kubectl -n rdev exec -it deployment/rdev-api -- sh
# Step 2: run this inside that shell
curl -s localhost:8080/metrics | grep commands_active

# Check target pod
kubectl -n rdev get pod <target-pod> -o wide
kubectl -n rdev exec -it <target-pod> -- ps aux
```

**Common Causes:**

1. **Target pod not running:**
   ```bash
   kubectl -n rdev get pods -l rdev.orchard9.ai/project=true
   ```

2. **Command actually slow:**
   Some commands legitimately take a long time

3. **Network issues:**
   Check connectivity between API pod and target pod

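
To tell a slow command from a stuck one, it can help to put an explicit deadline on the client call while reproducing the problem. This is a minimal sketch that assumes the coreutils `timeout` utility is available (it exits with status 124 when the deadline passes); the wrapper name `run_with_deadline` is hypothetical.

```bash
# run_with_deadline SECONDS COMMAND [ARGS...]
# Runs COMMAND under `timeout`. Returns COMMAND's exit status, or 124 with a
# message on stderr if the deadline was reached first.
run_with_deadline() {
  rd_secs=$1
  shift
  rd_rc=0
  timeout "$rd_secs" "$@" || rd_rc=$?
  if [ "$rd_rc" -eq 124 ]; then
    echo "timed out after ${rd_secs}s: $*" >&2
  fi
  return "$rd_rc"
}
```

For example, `run_with_deadline 10 curl -fsS "$RDEV_API_URL/health"` fails with status 124 if the API does not answer within ten seconds.
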
### SSE Connection Drops

**Symptoms:**
- Clients disconnect unexpectedly
- Events stop arriving mid-command

**Diagnosis:**
```bash
# Check ingress settings (Ingress objects live in the application namespace,
# not in the ingress controller's namespace)
kubectl -n rdev get ingress rdev-api -o yaml
```

**Common Causes:**

1. **Proxy timeout:**
   Traefik timeout is configured at the entrypoint level via HelmChartConfig,
   not per-Ingress annotations. See `.claude/guides/ops/traefik-v3.md` for details.
   The Ingress annotations only select the entrypoint and enable TLS:
   ```yaml
   traefik.ingress.kubernetes.io/router.entrypoints: websecure
   traefik.ingress.kubernetes.io/router.tls: "true"
   ```

2. **Client timeout:**
   Check client-side timeout configuration

3. **Network interruption:**
   Implement reconnection with `Last-Event-ID`

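
Reconnection with `Last-Event-ID` can be sketched with `curl` and `awk`. The function names and the log-file approach are hypothetical. The sketch also assumes the server tags events with `id:` lines, honors the `Last-Event-ID` request header, and closes the stream cleanly when the command finishes, so that `curl` exits 0 only on completion.

```bash
# last_event_id
# Reads an SSE stream on stdin and prints the value of the last "id:" field.
# Prints nothing if no id was seen.
last_event_id() {
  awk '/^id:/ { sub(/^id:[ ]?/, ""); sub(/\r$/, ""); id = $0 }
       END { if (id != "") print id }'
}

# stream_with_resume URL LOGFILE
# Appends the event stream to LOGFILE. After a dropped connection it reconnects
# and sends the last id recorded in LOGFILE as Last-Event-ID.
stream_with_resume() {
  s_url=$1; s_log=$2
  : >> "$s_log"
  while :; do
    s_id=$(last_event_id < "$s_log")
    if [ -n "$s_id" ]; then
      curl -sN -H "Accept: text/event-stream" -H "Last-Event-ID: $s_id" "$s_url" >> "$s_log" && break
    else
      curl -sN -H "Accept: text/event-stream" "$s_url" >> "$s_log" && break
    fi
    sleep 1   # brief pause before reconnecting
  done
}
```

Keeping the raw stream in a file means the resume point survives a restart of the client as well as a dropped connection.
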
### High Memory Usage

**Symptoms:**
- OOMKilled events
- Slow response times

**Diagnosis:**
```bash
# Check memory metrics
kubectl -n rdev top pod -l app=rdev-api

# Check for memory-related messages in logs
# (with a label selector, kubectl logs prints only the last 10 lines unless --tail is set)
kubectl -n rdev logs -l app=rdev-api --tail=1000 | grep -i memory
```

**Solutions:**

1. **Increase limits:**
   ```yaml
   resources:
     limits:
       memory: "1Gi"
   ```

2. **Check for stream leaks:**
   Ensure SSE connections are properly closed

3. **Restart pod:**
   ```bash
   kubectl -n rdev rollout restart deployment/rdev-api
   ```

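
To spot pods approaching their memory limit, the `kubectl top` output can be filtered by a threshold. This is a minimal sketch: the helper name `pods_over_mem` is hypothetical, and it assumes the usual three-column `NAME CPU MEMORY` layout with memory reported in `Mi`.

```bash
# pods_over_mem LIMIT_MI
# Reads `kubectl top pod --no-headers` output on stdin and prints the name and
# memory of each pod whose usage in Mi exceeds LIMIT_MI.
pods_over_mem() {
  awk -v limit="$1" '
    $3 ~ /Mi$/ {
      mem = $3
      sub(/Mi$/, "", mem)
      if (mem + 0 > limit + 0) print $1, $3
    }'
}
```

For example, `kubectl -n rdev top pod -l app=rdev-api --no-headers | pods_over_mem 768` lists the pods using more than 768Mi, which is three quarters of the 1Gi limit shown above.
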
### High CPU Usage

**Symptoms:**
- CPU throttling
- Slow request processing

**Diagnosis:**
```bash
# Check CPU metrics
kubectl -n rdev top pod -l app=rdev-api

# Profile if possible (the pprof endpoint samples for 30s by default).
# Do not pass -it here: a TTY would corrupt the binary profile written to cpu.prof.
kubectl -n rdev exec deployment/rdev-api -- curl -s localhost:8080/debug/pprof/profile > cpu.prof
```

**Solutions:**

1. **Scale horizontally:**
   ```bash
   kubectl -n rdev scale deployment/rdev-api --replicas=3
   ```

2. **Identify hot paths:**
   Use profiling to find CPU-intensive code

3. **Check command sanitization:**
   Complex regex can be expensive

## Recovery Procedures

### Emergency Restart

```bash
# Restart all pods (rolling restart)
kubectl -n rdev rollout restart deployment/rdev-api

# Scale down and up (full stop: the API is unavailable while replicas=0)
kubectl -n rdev scale deployment/rdev-api --replicas=0
kubectl -n rdev scale deployment/rdev-api --replicas=2

# Wait until the new pods are ready
kubectl -n rdev rollout status deployment/rdev-api
```

### Rollback

```bash
# Check rollout history
kubectl -n rdev rollout history deployment/rdev-api

# Rollback to previous
kubectl -n rdev rollout undo deployment/rdev-api

# Rollback to specific revision
kubectl -n rdev rollout undo deployment/rdev-api --to-revision=5
```

### Database Recovery

```bash
# Connect to database (same pod as in the PostgreSQL diagnostic above)
kubectl -n databases exec -it postgres-0 -- psql -U rdev -d rdev

# The remaining commands are entered at the psql prompt, not in the shell.

# Check tables
\dt

# Check recent keys
SELECT id, name, created_at FROM api_keys ORDER BY created_at DESC LIMIT 10;
```

## Getting Help

1. Check logs for specific error messages
2. Search this troubleshooting guide
3. Check runbooks for specific scenarios
4. Contact the platform team with:
   - Request ID (from error response)
   - Timestamp
   - Steps to reproduce
   - Relevant logs