chore: accumulated platform hardening and CI fixes

CI / Woodpecker:
- Add explicit depends_on to all .woodpecker.yml steps (rdev + templates)
- Fix skip_tls_verify -> skip-tls-verify (correct Kaniko flag name)
- Add replicasets get/list to deployer RBAC for rollout status
- Skeleton template: add failure:ignore on docs steps, Traefik TLS
  annotations on ingress, depends_on on verify step

Component templates:
- Fix container name in deploy steps (PROJECT_NAME-COMPONENT_NAME)
- Replace kubectl scale with kubectl patch for replicas
- Add post-deploy image verification and rollout status checks
- Applied consistently across all 5 component templates

Adapters:
- gitea: Add HTTP client timeout (30s), context cancellation checks,
  handle 404 on GetRepo/DeleteRepo
- zot: Add retry with exponential backoff (doWithRetry), limit response
  body reads to 10MB
- cockroach: Use net.JoinHostPort for IPv6-safe DSN construction
- woodpecker: Fix error wrapping (%v -> %w)
- redis: Fix error wrapping (%v -> %w)
- deployer: Add context cancellation checks

Services:
- apikey_service: Fix error wrapping (%v -> %w)
- component_deploy: Fix error wrapping (%v -> %w)
- project_infra: Fix error wrapping (%v -> %w)
- webhook/dispatcher: Fix error wrapping (%v -> %w)

Other:
- CLAUDE.md: Add guide links for Gitea, Go 1.25, Woodpecker v3,
  Traefik v3, Zot registry
- circuitbreaker: Add test for error wrapping
- docs: Update deployment, troubleshooting, and runbook docs
- health: Fix error wrapping (%v -> %w)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-02-10 23:16:56 -07:00

Troubleshooting Guide

Common issues and their resolutions for rdev API.

Prerequisites

# REQUIRED: Set kubeconfig before any kubectl command
export KUBECONFIG=~/.kube/orchard9-k3sf.yaml

Quick Diagnostics

# Check pod status
kubectl -n rdev get pods -l app=rdev-api

# Check logs (use script for convenience)
./scripts/logs.sh           # Last 100 lines
./scripts/logs.sh -e        # Errors only

# Check events
kubectl -n rdev get events --sort-by='.lastTimestamp'

# Check endpoints
kubectl -n rdev get endpoints rdev-api

# Test health
curl $RDEV_API_URL/health

Common Issues

Pod Not Starting

Symptoms:

Pod stuck in Pending or CrashLoopBackOff
No endpoints registered

Diagnosis:

kubectl -n rdev describe pod -l app=rdev-api
kubectl -n rdev logs -l app=rdev-api --previous

Common Causes:

Missing secrets:

Error: secret "rdev-api-secrets" not found

Fix: Create the required secret

kubectl -n rdev create secret generic rdev-api-secrets \
  --from-literal=postgres-password=xxx

Resource constraints:

```
0/3 nodes are available: insufficient memory
```

Fix: Reduce resource requests or add nodes

Image pull errors:

```
Failed to pull image "registry/rdev-api:latest"
```

Fix: Check image name, registry credentials
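
The three causes above can usually be told apart from the event text alone. The following is a minimal sketch, assuming the messages resemble the samples in this section; `classify_pod_failure` is a hypothetical helper name and is not part of the rdev scripts.

```shell
# Hypothetical helper (not part of the rdev scripts): read `kubectl describe`
# or event text on stdin and print the most likely cause category.
# The patterns come from the sample messages in this guide and may need tuning.
classify_pod_failure() (
  text=$(cat)
  if printf '%s\n' "$text" | grep -Eqi 'secret "[^"]+" not found'; then
    echo "missing-secret"
  elif printf '%s\n' "$text" | grep -Eqi 'insufficient (memory|cpu)'; then
    echo "resource-constraints"
  elif printf '%s\n' "$text" | grep -Eqi 'failed to pull image|imagepullbackoff|errimagepull'; then
    echo "image-pull"
  else
    echo "unknown"
  fi
)
```

Usage: `kubectl -n rdev describe pod -l app=rdev-api | classify_pod_failure`. The first matching category wins, so a pod with several problems reports only one of them.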

Database Connection Failed

Symptoms:

Readiness probe failing
Logs show dial tcp: connection refused

Diagnosis:

# Check database pods
kubectl get pods -n databases

# Test CockroachDB
kubectl exec -n databases cockroachdb-0 -- \
  /cockroach/cockroach node status --insecure --host=localhost:26257

# Test Redis
REDIS_PASS=$(kubectl get secret -n threesix redis-credentials -o jsonpath="{.data.REDIS_PASSWORD}" | base64 -d)
kubectl exec -n threesix redis-0 -- redis-cli -a "$REDIS_PASS" ping

# Test PostgreSQL
kubectl exec -n databases postgres-0 -- psql -U rdev -d rdev -c "SELECT 1;"

See database-connections.md for full connection details.

Common Causes:

Wrong host/port: Check ConfigMap values match actual database

Network policy blocking:

```
kubectl -n rdev get networkpolicy
```

Ensure egress to database namespace is allowed

Credentials incorrect: Verify secret values match database credentials

Authentication Failures

Symptoms:

All requests return 401
Logs show invalid API key

Diagnosis:

# Check if keys exist in database
kubectl -n rdev exec -it deployment/rdev-api -- sh
psql $DATABASE_URL -c "SELECT id, name, revoked_at FROM api_keys LIMIT 10;"

Common Causes:

Key not created: Create an admin key manually if needed
Key revoked: Check revoked_at is NULL for the key
Wrong key format: Keys must start with rdev_
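
A quick client-side check for the key format can rule out the last cause before looking at the database. This is a sketch only: `is_rdev_key` is a hypothetical helper, and requiring at least one character after the prefix is an assumption. It checks the prefix, not whether the key exists or has been revoked.

```shell
# Hypothetical helper: succeed only if the key starts with "rdev_" and has
# at least one character after the prefix (the non-empty rest is an assumption).
is_rdev_key() {
  case "$1" in
    rdev_?*) return 0 ;;
    *) return 1 ;;
  esac
}
```

Usage: `is_rdev_key "$KEY" || echo "key does not start with rdev_"`, where `KEY` holds the value the client sends.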

Rate Limiting Issues

Symptoms:

Intermittent 429 responses
X-RateLimit-Remaining: 0

Diagnosis:

# Check rate limit metrics
curl http://rdev-api:8080/metrics | grep ratelimit

Solutions:

Increase limits: Update ConfigMap:

```
RATE_LIMIT_RPS: "20"
```

Check for loops: Client may be making excessive requests

Use separate keys: Different clients should use different API keys
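
On the client side, backing off after a 429 avoids making the problem worse. The sketch below retries with exponential backoff; the helper names, the 30-second cap, and the limit of 5 attempts are illustrative assumptions, not rdev defaults. Extra arguments after the URL are passed straight to curl, so authentication headers can be supplied by the caller.

```shell
# Hypothetical helper: delay in seconds for a 1-based attempt number,
# doubling each time and capped at 30 (the cap is an assumption).
backoff_seconds() (
  attempt=$1
  delay=1
  i=1
  while [ "$i" -lt "$attempt" ]; do
    delay=$((delay * 2))
    i=$((i + 1))
  done
  if [ "$delay" -gt 30 ]; then
    delay=30
  fi
  echo "$delay"
)

# Hypothetical helper: request a URL, retrying only on HTTP 429.
# Prints the final status code; fails if all 5 attempts were rate limited.
request_with_retry() (
  url=$1
  shift
  attempt=1
  while [ "$attempt" -le 5 ]; do
    status=$(curl -s -o /dev/null -w '%{http_code}' "$@" "$url")
    if [ "$status" != "429" ]; then
      echo "$status"
      exit 0
    fi
    sleep "$(backoff_seconds "$attempt")"
    attempt=$((attempt + 1))
  done
  echo "429"
  exit 1
)
```

Usage: `request_with_retry "$RDEV_API_URL/health"`. If retries are still exhausted, that points to a request loop or a limit that is too low rather than a transient burst.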

Command Execution Timeouts

Symptoms:

Commands hang indefinitely
SSE stream never completes

Diagnosis:

# Check active commands
kubectl -n rdev exec -it deployment/rdev-api -- sh
curl localhost:8080/metrics | grep commands_active

# Check target pod
kubectl -n rdev get pod <target-pod> -o wide
kubectl -n rdev exec -it <target-pod> -- ps aux

Common Causes:

Target pod not running:

kubectl -n rdev get pods -l rdev.orchard9.ai/project=true

Command actually slow: Some commands take a long time legitimately
Network issues: Check connectivity between API pod and target pod

SSE Connection Drops

Symptoms:

Clients disconnect unexpectedly
Events stop arriving mid-command

Diagnosis:

# Inspect the Ingress resource and its annotations
kubectl -n rdev get ingress rdev-api -o yaml

Common Causes:

Proxy timeout: Traefik timeouts are configured at the entrypoint level via HelmChartConfig, not through per-Ingress annotations. See .claude/guides/ops/traefik-v3.md for details. The Ingress itself carries only routing annotations:

traefik.ingress.kubernetes.io/router.entrypoints: websecure
traefik.ingress.kubernetes.io/router.tls: "true"

Client timeout: Check client-side timeout configuration
Network interruption: Implement reconnection with Last-Event-ID
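
Reconnection with Last-Event-ID can be sketched from the shell as follows. Both helper names are hypothetical, and the sketch assumes the server sets `id:` fields on its events and resumes from the ID sent in the `Last-Event-ID` request header. The stream URL is left to the caller because this guide does not name it.

```shell
# Hypothetical helper: print the value of the last `id:` field found in an
# SSE stream on stdin (prints nothing if no event carried an id).
last_event_id() {
  awk '/^id:/ { sub(/^id: ?/, ""); id = $0 } END { if (id != "") print id }'
}

# Hypothetical helper: (re)connect to an SSE URL, appending the stream to a
# log file. If the log already holds events, resume after the last seen id.
resume_stream() (
  url=$1
  log_file=$2
  id=$(last_event_id < "$log_file")
  if [ -n "$id" ]; then
    curl -sN -H "Last-Event-ID: $id" "$url" | tee -a "$log_file"
  else
    curl -sN "$url" | tee -a "$log_file"
  fi
)
```

Usage: create an empty log with `: > events.log`, then call `resume_stream "<stream-url>" events.log` again after each drop. Each call picks up from the last ID recorded in the log.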

High Memory Usage

Symptoms:

OOMKilled events
Slow response times

Diagnosis:

# Check memory metrics
kubectl -n rdev top pod -l app=rdev-api

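# Check for memory leaks in logs
# (with a label selector, kubectl logs shows only the last 10 lines unless --tail is set)
kubectl -n rdev logs -l app=rdev-api --tail=1000 | grep -i memory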

Solutions:

Increase limits:

```
resources:
  limits:
    memory: "1Gi"
```

Check for stream leaks: Ensure SSE connections are properly closed

Restart pod:

kubectl -n rdev rollout restart deployment/rdev-api
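
To confirm the OOMKilled symptom before raising limits, the last termination reason of each container can be read from the pod status. This is a sketch, assuming the `app=rdev-api` label used elsewhere in this guide; `oom_killed_pods` is a hypothetical helper name.

```shell
# Hypothetical helper: list rdev-api pods whose previous container
# termination reason was OOMKilled.
oom_killed_pods() {
  kubectl -n rdev get pods -l app=rdev-api \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}' |
    awk -F'\t' '$2 ~ /OOMKilled/ { print $1 }'
}
```

An empty result means no container in the current pods was last terminated for exceeding its memory limit, so slow responses probably have another cause.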

High CPU Usage

Symptoms:

CPU throttling
Slow request processing

Diagnosis:

# Check CPU metrics
kubectl -n rdev top pod -l app=rdev-api

# Profile if possible (no -it: a TTY would corrupt the binary profile; -s keeps curl quiet)
kubectl -n rdev exec deployment/rdev-api -- curl -s localhost:8080/debug/pprof/profile > cpu.prof

Solutions:

Scale horizontally:

kubectl -n rdev scale deployment/rdev-api --replicas=3

Identify hot paths: Use profiling to find CPU-intensive code
Check command sanitization: Complex regex can be expensive

Recovery Procedures

Emergency Restart

# Restart all pods
kubectl -n rdev rollout restart deployment/rdev-api

# Scale down and up
kubectl -n rdev scale deployment/rdev-api --replicas=0
kubectl -n rdev scale deployment/rdev-api --replicas=2

Rollback

# Check rollout history
kubectl -n rdev rollout history deployment/rdev-api

# Rollback to previous
kubectl -n rdev rollout undo deployment/rdev-api

# Rollback to specific revision
kubectl -n rdev rollout undo deployment/rdev-api --to-revision=5
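
A rollback is only complete once the new pods are ready, so it helps to wait on the rollout afterwards. The sketch below combines both steps; `rollback_and_wait` is a hypothetical helper and the 120-second timeout is an assumption.

```shell
# Hypothetical helper: roll back rdev-api (optionally to a given revision)
# and wait until the rollout finishes or the timeout expires.
rollback_and_wait() (
  revision=${1:-}
  if [ -n "$revision" ]; then
    kubectl -n rdev rollout undo deployment/rdev-api --to-revision="$revision" || exit 1
  else
    kubectl -n rdev rollout undo deployment/rdev-api || exit 1
  fi
  # The 120s timeout is an illustrative value.
  kubectl -n rdev rollout status deployment/rdev-api --timeout=120s
)
```

Usage: `rollback_and_wait` for the previous revision, or `rollback_and_wait 5` for a specific one. A non-zero exit status means the undo failed or the rollout did not finish in time.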

Database Recovery

# Connect to database (same pod and database as the PostgreSQL test above)
kubectl -n databases exec -it postgres-0 -- psql -U rdev -d rdev

# Check tables
\dt

# Check recent keys
SELECT id, name, created_at FROM api_keys ORDER BY created_at DESC LIMIT 10;

Getting Help

Check logs for specific error messages
Search this troubleshooting guide
Check runbooks for specific scenarios
Contact the platform team with:
- Request ID (from error response)
- Timestamp
- Steps to reproduce
- Relevant logs
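
The items above can be gathered into a single message. This is a sketch only: `support_report` is a hypothetical helper and the layout is an assumption, not a format the platform team requires.

```shell
# Hypothetical helper: print a report with the request ID, a UTC timestamp,
# the steps to reproduce, and log lines read from stdin.
support_report() (
  request_id=$1
  steps=$2
  echo "Request ID: $request_id"
  echo "Timestamp (UTC): $(date -u +%Y-%m-%dT%H:%M:%SZ)"
  echo "Steps to reproduce: $steps"
  echo "Relevant logs:"
  # Indent log lines so they stand out in the report.
  sed 's/^/  /'
)
```

Usage: `./scripts/logs.sh -e | support_report "<request-id>" "<what you did>"`. Replace the timestamp with the time of the failure if it differs from the time the report is written.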
