rdev/docs/operations/runbooks/pod-not-found.md
jordan 72d16929ca feat: Implement hexagonal architecture with services, webhooks, queue, and telemetry
2026-01-25 19:57:46 -07:00

Runbook: Pod Not Found

Alert

RdevAPIProjectNotFound: Project pod not found errors increasing

Impact

  • Users cannot execute commands on their projects
  • API returns 404 for valid project IDs

Investigation

1. Confirm the Issue

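# Check for NOT_FOUND errors in logs
# (--tail=-1 is needed: with a label selector, kubectl logs otherwise shows only the last 10 lines per pod)
kubectl -n rdev logs -l app=rdev-api --since=10m --tail=-1 | grep "project not found"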

# Check metrics
curl -s http://rdev-api:8080/metrics | grep 'http_requests_total.*status="404"'
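
To track whether the error rate is rising, it can help to count matching log lines rather than print them. The following is a minimal sketch; `count_not_found` is a hypothetical helper name, and the match string is the one used in the log check above (matched case-insensitively here).

```shell
# Count log lines that mention "project not found" (case-insensitive).
# Reads log text from stdin and prints the number of matching lines.
# count_not_found is an illustrative helper, not part of rdev.
count_not_found() {
  # grep -c prints 0 but exits non-zero when nothing matches,
  # so mask the exit status for callers running under `set -e`.
  grep -ci "project not found" || true
}
```

Usage: `kubectl -n rdev logs -l app=rdev-api --since=10m --tail=-1 | count_not_found`. Running it twice a few minutes apart gives a rough sense of the trend.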

2. Verify Target Pods Exist

# List all project pods
kubectl -n rdev get pods -l rdev.orchard9.ai/project=true

# Check specific project
kubectl -n rdev get pods -l rdev.orchard9.ai/project-id=<project-id>
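
A pod can exist without being usable, so the phase is worth checking alongside presence. The sketch below wraps the project-id query above; `project_pod_phases` is a hypothetical helper name, and the label key is the one used in the commands above.

```shell
# Print "<pod-name> <phase>" for every pod labeled with the given project ID.
# Prints nothing when no pod carries the label.
# project_pod_phases is an illustrative helper, not part of rdev.
project_pod_phases() {
  project_id="$1"
  kubectl -n rdev get pods \
    -l "rdev.orchard9.ai/project-id=${project_id}" \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.phase}{"\n"}{end}'
}
```

Usage: `project_pod_phases <project-id>`. Empty output points at a missing pod or missing label; a phase other than `Running` points at a crashed or pending pod.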

3. Check Pod Discovery

# Verify API can see pods: open a shell in the API pod...
kubectl -n rdev exec -it deployment/rdev-api -- sh
# ...then, inside that shell, query the project list
curl localhost:8080/projects

# Check RBAC permissions
kubectl auth can-i list pods -n rdev --as=system:serviceaccount:rdev:rdev-api
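
Pod discovery may need more than `list`. Assuming the API also uses `get` and `watch` on pods (an assumption; the source only confirms `list`), the same `can-i` check can be looped over several verbs. `check_pod_rbac` is a hypothetical helper name.

```shell
# Report which pod verbs the rdev-api ServiceAccount is allowed to use.
# The verbs other than "list" are an assumption; adjust to the actual Role.
check_pod_rbac() {
  for verb in get list watch; do
    # `kubectl auth can-i` prints "yes" or "no" and exits non-zero on "no".
    answer=$(kubectl auth can-i "$verb" pods -n rdev \
      --as=system:serviceaccount:rdev:rdev-api 2>/dev/null) || true
    echo "${verb}: ${answer:-unknown}"
  done
}
```

Any `no` in the output is a candidate for the "RBAC issues" cause listed below.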

4. Common Causes

  • Pod terminated: The project pod was deleted or crashed
  • Wrong namespace: The API is looking in the wrong namespace
  • Missing labels: The pod lacks the required labels
  • RBAC issues: The API's ServiceAccount cannot list pods
  • Stale cache: The project list cache is outdated

Remediation

If Pod Is Missing

  1. Check if pod should exist:

    kubectl -n rdev get deployments
    
  2. Recreate if needed:

    kubectl -n rdev apply -f <project-deployment.yaml>
    

If Labels Are Wrong

  1. Check current labels:

    kubectl -n rdev get pod <pod-name> --show-labels
    
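  2. Add or correct the required labels (--overwrite is needed when the label already exists with a wrong value):

    kubectl -n rdev label pod <pod-name> rdev.orchard9.ai/project=true --overwrite
    kubectl -n rdev label pod <pod-name> rdev.orchard9.ai/project-id=<project-id> --overwrite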
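
Spotting a missing key in a long label string by eye is error-prone. As a sketch, the helper below reports which of the two label keys used in this runbook are absent from a comma-separated label string; `missing_labels` is a hypothetical name.

```shell
# Print each required label key that is absent from a comma-separated
# label string, such as the LABELS column of `kubectl get pod --show-labels`.
# missing_labels is an illustrative helper, not part of rdev.
missing_labels() {
  labels="$1"
  for key in rdev.orchard9.ai/project rdev.orchard9.ai/project-id; do
    case ",${labels}," in
      *",${key}="*) ;;        # key present (value is not checked)
      *) echo "$key" ;;       # key absent
    esac
  done
}
```

Usage: `missing_labels "$(kubectl -n rdev get pod <pod-name> --no-headers --show-labels | awk '{print $NF}')"`. The helper only checks that the keys are present, so a wrong `project-id` value still needs a manual look.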
    

If RBAC Is Broken

  1. Verify ServiceAccount:

    kubectl -n rdev get serviceaccount rdev-api
    
  2. Check RoleBinding:

    kubectl -n rdev get rolebinding rdev-api-binding -o yaml
    
  3. Reapply RBAC:

    kubectl apply -f deployments/k8s/base/rdev-api.yaml
    

If Cache Is Stale

  1. Force a cache refresh by restarting the API:

    kubectl -n rdev rollout restart deployment/rdev-api
    
  2. Or reduce the cache TTL (changing the env var also triggers a new rollout):

    kubectl -n rdev set env deployment/rdev-api CACHE_TTL=5s
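
Either command starts a new rollout, and the cache is only fresh once the replacement pods are serving. A minimal sketch that blocks until the rollout completes; the 120-second default timeout is an arbitrary choice and `wait_for_api_rollout` is a hypothetical name.

```shell
# Wait for the rdev-api rollout to finish.
# Optional first argument: timeout (default 120s, an arbitrary choice).
# Exits non-zero if the rollout does not complete in time.
wait_for_api_rollout() {
  kubectl -n rdev rollout status deployment/rdev-api --timeout="${1:-120s}"
}
```

Usage: `wait_for_api_rollout 60s`, then move on to the Verification steps.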
    

If Wrong Namespace

  1. Check configured namespace:

    kubectl -n rdev get deployment rdev-api -o jsonpath='{.spec.template.spec.containers[0].env}' | jq
    
  2. Update if wrong:

    kubectl -n rdev set env deployment/rdev-api RDEV_NAMESPACE=rdev
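
To read just the namespace setting instead of scanning the full env list, a JSONPath filter can select the one variable. This is a sketch that assumes `RDEV_NAMESPACE` is set as a literal value on the first container, as in the commands above; `configured_namespace` is a hypothetical name.

```shell
# Print the RDEV_NAMESPACE value set on the first container of rdev-api.
# Prints nothing if the variable is unset or comes from valueFrom
# (for example a ConfigMap reference) rather than a literal value.
configured_namespace() {
  kubectl -n rdev get deployment rdev-api \
    -o jsonpath='{.spec.template.spec.containers[0].env[?(@.name=="RDEV_NAMESPACE")].value}'
}
```

Usage: `[ "$(configured_namespace)" = "rdev" ] || echo "namespace mismatch"`.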
    

Verification

# List projects from API
curl -H "X-API-Key: $API_KEY" http://rdev-api:8080/projects

# Get specific project
curl -H "X-API-Key: $API_KEY" http://rdev-api:8080/projects/<project-id>

# Execute test command
curl -X POST -H "X-API-Key: $API_KEY" -H "Content-Type: application/json" \
  http://rdev-api:8080/projects/<project-id>/shell \
  -d '{"command": "echo hello"}'
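
The read-only checks above can be scripted so that the result is a pass or fail rather than output to inspect. This sketch assumes a healthy API answers both GET requests with HTTP 200 (the source only states that the failure case is 404); `BASE_URL`, `api_status`, and `verify_project` are hypothetical names.

```shell
# Print the HTTP status code for a GET against the rdev API.
# API_KEY must be set; BASE_URL defaults to the in-cluster address used above.
api_status() {
  endpoint="$1"
  curl -s -o /dev/null -w '%{http_code}' \
    -H "X-API-Key: ${API_KEY}" \
    "${BASE_URL:-http://rdev-api:8080}${endpoint}"
}

# Succeed only if the project list and the given project both return 200.
# Treating 200 as the success code is an assumption.
verify_project() {
  project_id="$1"
  [ "$(api_status /projects)" = "200" ] &&
    [ "$(api_status "/projects/${project_id}")" = "200" ]
}
```

Usage: `verify_project <project-id> && echo "project reachable"`. The shell command test is left manual because it changes state in the project pod.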

Post-Incident

  1. Review pod lifecycle management
  2. Consider adding pod status monitoring
  3. Review label conventions
  4. Add alerts for project pod terminations