jordan 72d16929ca feat: Implement hexagonal architecture with services, webhooks, queue, and telemetry

Major refactoring to hexagonal (ports & adapters) architecture:

- Add service layer (apikey_service, project_service) for business logic
- Add webhook system with dispatcher and delivery tracking
- Add command queue with priority-based processing
- Add rate limiting with sliding window algorithm
- Add audit logging for command execution
- Add OpenTelemetry integration (traces, metrics, spans)
- Add circuit breaker for fault tolerance
- Add cached repository wrapper for performance
- Add comprehensive validation package
- Add Kubernetes client integration for pod management
- Add database migrations (allowed_ips, audit_log, rate_limiting, queue, webhooks)
- Add network policy and PodDisruptionBudget for k8s
- Remove legacy executor and projects/registry packages
- Untrack secrets.yaml (now managed via envault)
- Add coverage.out to .gitignore
- Add e2e test infrastructure with docker-compose
- Add comprehensive documentation (API, architecture, operations, plans)
- Add golangci-lint config and pre-commit hook

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2026-01-25 19:57:46 -07:00

5.9 KiB


Troubleshooting Guide

Common issues and their resolutions for the rdev API.

Quick Diagnostics

# Check pod status
kubectl -n rdev get pods -l app=rdev-api

# Check logs
kubectl -n rdev logs -l app=rdev-api --tail=100

# Check events
kubectl -n rdev get events --sort-by='.lastTimestamp'

# Check endpoints
kubectl -n rdev get endpoints rdev-api

# Test health
kubectl -n rdev exec -it deployment/rdev-api -- wget -qO- localhost:8080/health

Common Issues

Pod Not Starting

Symptoms:

Pod stuck in Pending or CrashLoopBackOff
No endpoints registered

Diagnosis:

kubectl -n rdev describe pod -l app=rdev-api
kubectl -n rdev logs -l app=rdev-api --previous

Common Causes:

Missing secrets:

Error: secret "rdev-api-secrets" not found

Fix: Create the required secret, replacing xxx with the actual database password

kubectl -n rdev create secret generic rdev-api-secrets \
  --from-literal=postgres-password=xxx

Resource constraints:
```
0/3 nodes are available: insufficient memory
```
Fix: Reduce resource requests or add nodes

Image pull errors:
```
Failed to pull image "registry/rdev-api:latest"
```
Fix: Check image name, registry credentials
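
To see quickly which of the causes above applies, the pod status can be mapped to a likely cause. The following is a minimal sketch, not part of rdev: `triage_pods` is a hypothetical helper, and its status-to-cause mapping is a heuristic based on common Kubernetes status strings rather than an exhaustive list.

```shell
# Hypothetical helper: read `kubectl get pods --no-headers` lines on stdin and
# print a likely cause for each pod that is not healthy.
# The mapping is a heuristic, not an exhaustive list of pod states.
triage_pods() {
  while read -r name _ready status _rest; do
    case "$status" in
      Running|Completed) continue ;;
      Pending) hint="check resource requests and node capacity" ;;
      ImagePullBackOff|ErrImagePull) hint="check image name and registry credentials" ;;
      CreateContainerConfigError) hint="check that referenced secrets exist" ;;
      CrashLoopBackOff) hint="check logs with --previous" ;;
      *) hint="inspect with kubectl describe pod" ;;
    esac
    echo "$name $status: $hint"
  done
}
```

Typical use would be `kubectl -n rdev get pods -l app=rdev-api --no-headers | triage_pods`. Healthy pods produce no output.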

Database Connection Failed

Symptoms:

Readiness probe failing
Logs show dial tcp: connection refused

Diagnosis:

# Check database connectivity from pod
kubectl -n rdev exec -it deployment/rdev-api -- sh
nc -zv postgres.databases.svc 5432

Common Causes:

Wrong host/port: Check ConfigMap values match actual database
Network policy blocking:
```
kubectl -n rdev get networkpolicy
```
Ensure egress from the rdev namespace to the database namespace is allowed
Credentials incorrect: Verify secret values match database credentials
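
When comparing the configured host and port with the actual database, it can help to extract them from the connection string the pod really uses. This is a minimal sketch, assuming `DATABASE_URL` has the common `postgres://user:pass@host:port/db` shape. `db_host_port` is a hypothetical helper, and the fallback to port 5432 is an assumption based on the PostgreSQL default.

```shell
# Hypothetical helper: print "host port" from a URL shaped like
# postgres://user:pass@host:port/db. Falls back to 5432 when no port is given.
# Does not handle IPv6 literals or passwords containing "/" or "?".
db_host_port() {
  rest=${1#*://}              # drop the scheme
  authority=${rest%%[/?]*}    # keep everything before the path or query
  hostport=${authority##*@}   # drop credentials, if any
  host=${hostport%%:*}
  case "$hostport" in
    *:*) port=${hostport##*:} ;;
    *)   port=5432 ;;
  esac
  echo "$host $port"
}
```

Inside the pod shell, the result can feed the connectivity check directly: `set -- $(db_host_port "$DATABASE_URL") && nc -zv "$1" "$2"`.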

Authentication Failures

Symptoms:

All requests return 401
Logs show invalid API key

Diagnosis:

# Check if keys exist in database
kubectl -n rdev exec -it deployment/rdev-api -- sh
psql $DATABASE_URL -c "SELECT id, name, revoked_at FROM api_keys LIMIT 10;"

Common Causes:

Key not created: Create an admin key manually if needed
Key revoked: Check revoked_at is NULL for the key
Wrong key format: Keys must start with rdev_
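
The key format rule can be checked on the client before a request is sent. The sketch below tests only the `rdev_` prefix stated above. `has_rdev_prefix` is a hypothetical helper, and treating a bare `rdev_` with nothing after it as invalid is an assumption. Key length and character set are not documented here, so they are not validated.

```shell
# Hypothetical helper: return 0 if the key starts with "rdev_" followed by at
# least one more character, 1 otherwise. Only the prefix is checked.
has_rdev_prefix() {
  case "$1" in
    rdev_?*) return 0 ;;
    *)       return 1 ;;
  esac
}
```

For example, `has_rdev_prefix "$API_KEY" || echo "key does not start with rdev_" >&2`, where `API_KEY` stands for wherever the client stores its key.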

Rate Limiting Issues

Symptoms:

Intermittent 429 responses
X-RateLimit-Remaining: 0

Diagnosis:

# Check rate limit metrics
curl http://rdev-api:8080/metrics | grep ratelimit

Solutions:

Increase limits: Update ConfigMap:
```
RATE_LIMIT_RPS: "20"
```
Check for loops: Client may be making excessive requests
Use separate keys: Different clients should use different API keys
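
A client that retries immediately on every 429 tends to keep itself rate limited. One common remedy is exponential backoff, sketched below. `retry_on_429` and the `BACKOFF_BASE` variable are hypothetical names for this example. The wrapped command is expected to print only an HTTP status code.

```shell
# Hypothetical helper: run a command that prints an HTTP status code, retrying
# with exponential backoff while it prints 429.
# Usage: retry_on_429 MAX_ATTEMPTS CMD [ARGS...]
# Prints the final status code; returns 1 only if every attempt was a 429.
# BACKOFF_BASE is the first delay in seconds (default 1) and doubles each retry.
retry_on_429() {
  max_attempts=$1; shift
  delay=${BACKOFF_BASE:-1}
  attempt=1
  while :; do
    status=$("$@")
    if [ "$status" != "429" ]; then
      echo "$status"
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "$status"
      return 1
    fi
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}
```

With curl, the wrapped command could be `curl -s -o /dev/null -w '%{http_code}' "$URL"`, where `URL` is whichever endpoint the client calls.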

Command Execution Timeouts

Symptoms:

Commands hang indefinitely
SSE stream never completes

Diagnosis:

# Check active commands
kubectl -n rdev exec -it deployment/rdev-api -- sh
curl localhost:8080/metrics | grep commands_active

# Check target pod
kubectl -n rdev get pod <target-pod> -o wide
kubectl -n rdev exec -it <target-pod> -- ps aux

Common Causes:

Target pod not running:

kubectl -n rdev get pods -l rdev.orchard9.ai/project=true

Command is legitimately slow: Some commands take a long time to finish
Network issues: Check connectivity between API pod and target pod

SSE Connection Drops

Symptoms:

Clients disconnect unexpectedly
Events stop arriving mid-command

Diagnosis:

# Check ingress timeout settings
kubectl -n ingress-nginx get ing rdev-api -o yaml

Common Causes:

Proxy timeout: Ensure ingress has long timeout:

nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"

Client timeout: Check client-side timeout configuration
Network interruption: Implement reconnection with Last-Event-ID
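
Reconnection with `Last-Event-ID` means remembering the `id:` of the last event received and sending it back as a request header. The sketch below shows the idea with curl. `last_event_id` and `resume_stream` are hypothetical helpers, the stream URL is a placeholder, and whether the rdev API replays missed events on reconnect is an assumption to verify.

```shell
# Hypothetical helper: print the value of the last "id:" field in an SSE
# stream read from stdin. One optional space after the colon is not part of
# the value, and a trailing carriage return is stripped.
last_event_id() {
  awk '/^id:/ { sub(/\r$/, ""); sub(/^id: ?/, ""); id = $0 }
       END { if (id != "") print id }'
}

# Hypothetical helper: (re)connect to a stream, appending events to a log file
# and resuming after the last event ID found in that file, if any.
# Usage: resume_stream STREAM_URL LOG_FILE
resume_stream() {
  url=$1; log=$2
  id=""
  [ -f "$log" ] && id=$(last_event_id < "$log")
  if [ -n "$id" ]; then
    curl -sN -H "Last-Event-ID: $id" "$url" | tee -a "$log"
  else
    curl -sN "$url" | tee -a "$log"
  fi
}
```

Calling `resume_stream` again after a drop sends the stored ID, so a server that supports replay can continue from that point.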

High Memory Usage

Symptoms:

OOMKilled events
Slow response times

Diagnosis:

# Check memory metrics
kubectl -n rdev top pod -l app=rdev-api

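# Check for memory leaks in logs (--tail=-1 reads the full log; with -l, kubectl shows only the last 10 lines per pod by default)
kubectl -n rdev logs -l app=rdev-api --tail=-1 | grep -i memory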

Solutions:

Increase limits:
```
resources:
  limits:
    memory: "1Gi"
```
Check for stream leaks: Ensure SSE connections are properly closed

Restart pod:

kubectl -n rdev rollout restart deployment/rdev-api
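
Before raising limits or restarting, it can be useful to know which pods are actually close to their limit. The sketch below filters the `kubectl top pod` output from the diagnosis step. `pods_over_memory` is a hypothetical helper, and it assumes the MEMORY(bytes) column is reported in Mi; rows in other units are skipped.

```shell
# Hypothetical helper: read `kubectl top pod` output on stdin and print the
# name and memory of each pod at or above THRESHOLD_MI.
# Usage: pods_over_memory THRESHOLD_MI
# Assumes the third column looks like "210Mi"; other units are ignored.
pods_over_memory() {
  awk -v limit="$1" 'NR > 1 && $3 ~ /Mi$/ {
    mem = $3
    sub(/Mi$/, "", mem)
    if (mem + 0 >= limit + 0) print $1, $3
  }'
}
```

With the 1Gi limit shown above, `kubectl -n rdev top pod -l app=rdev-api | pods_over_memory 800` would list pods using roughly 80% or more of their limit.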

High CPU Usage

Symptoms:

CPU throttling
Slow request processing

Diagnosis:

# Check CPU metrics
kubectl -n rdev top pod -l app=rdev-api

# Profile if possible (no -it: a TTY would corrupt the binary profile written to cpu.prof)
kubectl -n rdev exec deployment/rdev-api -- curl -s localhost:8080/debug/pprof/profile > cpu.prof

Solutions:

Scale horizontally:

kubectl -n rdev scale deployment/rdev-api --replicas=3

Identify hot paths: Use profiling to find CPU-intensive code
Check command sanitization: Complex regex can be expensive

Recovery Procedures

Emergency Restart

# Restart all pods
kubectl -n rdev rollout restart deployment/rdev-api

# Scale down and up (causes downtime; use only if the rolling restart does not help)
kubectl -n rdev scale deployment/rdev-api --replicas=0
kubectl -n rdev scale deployment/rdev-api --replicas=2

Rollback

# Check rollout history
kubectl -n rdev rollout history deployment/rdev-api

# Rollback to previous
kubectl -n rdev rollout undo deployment/rdev-api

# Rollback to specific revision
kubectl -n rdev rollout undo deployment/rdev-api --to-revision=5
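
`kubectl rollout undo` returns as soon as the change is submitted, not when the old revision is serving again. A minimal sketch of a rollback that waits for the result follows. `rollback_and_wait` and the `ROLLOUT_TIMEOUT` variable are hypothetical names, and the 120s default is an arbitrary choice for illustration.

```shell
# Hypothetical helper: roll back one revision, then block until the rollout
# finishes or the timeout expires. Returns non-zero if either step fails.
# ROLLOUT_TIMEOUT is passed to `kubectl rollout status --timeout` (default 120s).
rollback_and_wait() {
  kubectl -n rdev rollout undo deployment/rdev-api || return 1
  kubectl -n rdev rollout status deployment/rdev-api --timeout="${ROLLOUT_TIMEOUT:-120s}"
}
```

A non-zero exit from `rollback_and_wait` means the rollback was rejected or did not become ready within the timeout, which is the point at which to inspect pods and events.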

Database Recovery

# Connect to database
kubectl -n databases exec -it deployment/postgres -- psql -U rdev

# Check tables
\dt

# Check recent keys
SELECT id, name, created_at FROM api_keys ORDER BY created_at DESC LIMIT 10;

Getting Help

Check logs for specific error messages
Search this troubleshooting guide
Check runbooks for specific scenarios
Contact the platform team with:
- Request ID (from error response)
- Timestamp
- Steps to reproduce
- Relevant logs
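
The request ID and relevant logs can be gathered in one step. The sketch below is one way to do it. `support_report` is a hypothetical helper, and it makes no assumption about the log format beyond the request ID appearing somewhere on the line.

```shell
# Hypothetical helper: build a plain-text report from logs read on stdin.
# Usage: support_report REQUEST_ID
support_report() {
  request_id=$1
  echo "Request ID: $request_id"
  echo "Collected at (UTC): $(date -u +%Y-%m-%dT%H:%M:%SZ)"
  echo "Relevant logs:"
  # -F matches the ID literally; "|| true" keeps a report with no matching
  # lines from returning a failure status.
  grep -F -- "$request_id" || true
}
```

For example, `kubectl -n rdev logs -l app=rdev-api --tail=-1 | support_report <request-id> > report.txt`. The timestamp of the failure itself and the steps to reproduce still need to be added by hand.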
