Runbook: Pod Not Found
Alert
RdevAPIProjectNotFound: Project pod not found errors increasing
Impact
- Users cannot execute commands on their projects
- API returns 404 for valid project IDs
Investigation
1. Confirm the Issue
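# Check for NOT_FOUND errors in logs
# (--tail=-1 is needed: with a label selector, kubectl logs otherwise shows only the last 10 lines per pod)
kubectl -n rdev logs -l app=rdev-api --since=10m --tail=-1 | grep "project not found"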
# Check metrics
curl -s http://rdev-api:8080/metrics | grep 'http_requests_total.*status="404"'
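To judge whether the errors are actually increasing, it can help to bucket the matches per minute instead of reading raw grep output. The following is a minimal sketch, assuming log lines start with an RFC 3339 timestamp (as added by `kubectl logs --timestamps`); `count_not_found_per_minute` is a hypothetical helper name, not part of rdev.

```shell
# Count "project not found" log lines per minute.
# Assumption: every line starts with an RFC 3339 timestamp,
# e.g. "2024-01-01T12:03:45Z ...", so the first 16 characters identify the minute.
count_not_found_per_minute() {
  grep "project not found" \
    | cut -c1-16 \
    | sort \
    | uniq -c \
    | awk '{ print $2, $1 }'
}

# Usage (not executed here):
#   kubectl -n rdev logs -l app=rdev-api --since=10m --tail=-1 --timestamps \
#     | count_not_found_per_minute
```

Each output line is a minute followed by the number of matching log lines in that minute.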
2. Verify Target Pods Exist
# List all project pods
kubectl -n rdev get pods -l rdev.orchard9.ai/project=true
# Check specific project
kubectl -n rdev get pods -l rdev.orchard9.ai/project-id=<project-id>
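The two checks above can be combined into one helper that fails loudly when no pod carries both labels. This is a sketch only; the function name `project_pod_phases` is hypothetical, and the namespace and label keys are the ones used in this runbook.

```shell
# Print the phase of every pod labelled for the given project.
# Returns 1 if no such pod exists, 2 if kubectl itself fails.
project_pod_phases() {
  local project_id="$1" phases
  phases=$(kubectl -n rdev get pods \
    -l "rdev.orchard9.ai/project=true,rdev.orchard9.ai/project-id=${project_id}" \
    -o jsonpath='{.items[*].status.phase}') || return 2
  if [ -z "$phases" ]; then
    echo "no pod found for project ${project_id}" >&2
    return 1
  fi
  echo "$phases"
}

# Usage (not executed here):
#   project_pod_phases <project-id>
```

A phase other than `Running` (for example `Pending` or `Failed`) points at a pod problem rather than a discovery problem.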
3. Check Pod Discovery
# Verify API can see pods: open a shell in the API container, then query the API from inside it
kubectl -n rdev exec -it deployment/rdev-api -- sh
curl localhost:8080/projects
# Check RBAC permissions
kubectl auth can-i list pods -n rdev --as=system:serviceaccount:rdev:rdev-api
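If the API needs more than `list` on pods, the same check can be looped over several verbs. A minimal sketch: the verb list `get list watch` is an assumption and should be adjusted to the Role that is actually deployed; `check_pod_rbac` is a hypothetical name.

```shell
# Report which pod verbs the rdev-api ServiceAccount may use.
# Returns 1 if any verb is not allowed.
check_pod_rbac() {
  local verb answer rc=0
  for verb in get list watch; do   # assumed verb set, not taken from the rdev manifests
    answer=$(kubectl auth can-i "$verb" pods -n rdev \
      --as=system:serviceaccount:rdev:rdev-api 2>/dev/null)
    echo "${verb}: ${answer:-unknown}"
    [ "$answer" = "yes" ] || rc=1
  done
  return "$rc"
}

# Usage (not executed here):
#   check_pod_rbac || echo "RBAC is incomplete, see 'If RBAC Is Broken'"
```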
4. Common Causes
- Pod terminated: The project pod was deleted or crashed
- Wrong namespace: The API is looking in the wrong namespace
- Missing labels: The pod is missing the required labels (rdev.orchard9.ai/project, rdev.orchard9.ai/project-id)
- RBAC issues: The API's ServiceAccount cannot list pods
- Cache stale: The project list cache is outdated
Remediation
If Pod Is Missing
- Check if pod should exist:
kubectl -n rdev get deployments
- Recreate if needed:
kubectl -n rdev apply -f <project-deployment.yaml>
If Labels Are Wrong
- Check current labels:
kubectl -n rdev get pod <pod-name> --show-labels
- Add required labels:
kubectl -n rdev label pod <pod-name> rdev.orchard9.ai/project=true
kubectl -n rdev label pod <pod-name> rdev.orchard9.ai/project-id=<project-id>
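To see at a glance which of the two required labels a pod lacks, the LABELS column from `--show-labels` can be fed to a small checker. This is a sketch; `missing_project_labels` is a hypothetical helper, and it only checks that the keys are present, not that the values are correct.

```shell
# Print each required label key that is absent from a comma-separated
# "key=value,key=value" string (the format of the LABELS column).
missing_project_labels() {
  local labels=",$1," key
  for key in rdev.orchard9.ai/project rdev.orchard9.ai/project-id; do
    case "$labels" in
      *",${key}="*) ;;        # key present, nothing to report
      *) echo "$key" ;;       # key missing
    esac
  done
}

# Usage (not executed here):
#   missing_project_labels "app=demo,rdev.orchard9.ai/project=true"
```

Empty output means both keys are present.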
If RBAC Is Broken
- Verify ServiceAccount:
kubectl -n rdev get serviceaccount rdev-api
- Check RoleBinding:
kubectl -n rdev get rolebinding rdev-api-binding -o yaml
- Reapply RBAC:
kubectl apply -f deployments/k8s/base/rdev-api.yaml
If Cache Is Stale
- Force cache refresh by restarting:
kubectl -n rdev rollout restart deployment/rdev-api
- Or reduce cache TTL:
kubectl -n rdev set env deployment/rdev-api CACHE_TTL=5s
If Wrong Namespace
- Check configured namespace:
kubectl -n rdev get deployment rdev-api -o jsonpath='{.spec.template.spec.containers[0].env}' | jq
- Update if wrong:
kubectl -n rdev set env deployment/rdev-api RDEV_NAMESPACE=rdev
Verification
# List projects from API
curl -H "X-API-Key: $API_KEY" http://rdev-api:8080/projects
# Get specific project
curl -H "X-API-Key: $API_KEY" http://rdev-api:8080/projects/<project-id>
# Execute test command
curl -X POST -H "X-API-Key: $API_KEY" -H "Content-Type: application/json" \
http://rdev-api:8080/projects/<project-id>/shell \
-d '{"command": "echo hello"}'
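For repeated checks during recovery, the single-project lookup can be wrapped so it returns a usable exit code. A minimal sketch, assuming a healthy project answers with HTTP 200 (the runbook only states that the failure case is 404); `project_reachable` is a hypothetical name, and the URL and `X-API-Key` header are the ones used above.

```shell
# Return 0 if GET /projects/<id> answers 200, non-zero otherwise.
# Assumption: 200 is the success status for this endpoint.
project_reachable() {
  local project_id="$1" status
  status=$(curl -s -o /dev/null -w '%{http_code}' \
    -H "X-API-Key: ${API_KEY}" \
    "http://rdev-api:8080/projects/${project_id}")
  echo "project ${project_id}: HTTP ${status}"
  [ "$status" = "200" ]
}

# Usage (not executed here):
#   until project_reachable <project-id>; do sleep 5; done
```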
Post-Incident
- Review pod lifecycle management
- Consider adding pod status monitoring
- Review label conventions
- Add alerts for project pod terminations