rdev/docs/operations/runbooks/pod-not-found.md
jordan 72d16929ca feat: Implement hexagonal architecture with services, webhooks, queue, and telemetry
Major refactoring to hexagonal (ports & adapters) architecture:

- Add service layer (apikey_service, project_service) for business logic
- Add webhook system with dispatcher and delivery tracking
- Add command queue with priority-based processing
- Add rate limiting with sliding window algorithm
- Add audit logging for command execution
- Add OpenTelemetry integration (traces, metrics, spans)
- Add circuit breaker for fault tolerance
- Add cached repository wrapper for performance
- Add comprehensive validation package
- Add Kubernetes client integration for pod management
- Add database migrations (allowed_ips, audit_log, rate_limiting, queue, webhooks)
- Add network policy and PodDisruptionBudget for k8s
- Remove legacy executor and projects/registry packages
- Untrack secrets.yaml (now managed via envault)
- Add coverage.out to .gitignore
- Add e2e test infrastructure with docker-compose
- Add comprehensive documentation (API, architecture, operations, plans)
- Add golangci-lint config and pre-commit hook

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-25 19:57:46 -07:00

# Runbook: Pod Not Found
## Alert
**RdevAPIProjectNotFound**: Project pod not found errors increasing
## Impact
- Users cannot execute commands on their projects
- API returns 404 for valid project IDs
## Investigation
### 1. Confirm the Issue
```bash
# Check for NOT_FOUND errors in logs
kubectl -n rdev logs -l app=rdev-api --since=10m | grep "project not found"
# Check metrics
curl -s http://rdev-api:8080/metrics | grep 'http_requests_total.*status="404"'
```
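To see whether the 404 count is actually rising, it helps to reduce the metrics output to one number and compare two samples. A minimal sketch, assuming the endpoint emits standard Prometheus text format and that the counter is named `http_requests_total` with a `status` label, as the grep above implies; `sum_404s` is a hypothetical helper name.

```bash
# Sum every http_requests_total sample with status="404" from Prometheus
# text-format input on stdin. sum_404s is a hypothetical helper, not part of rdev.
sum_404s() {
  awk '
    /^http_requests_total\{/ && /status="404"/ {
      # Drop the metric name and label set (up to the last closing brace);
      # the first remaining field is the sample value, any timestamp follows it.
      sub(/^.*\} */, "")
      total += $1
    }
    END { printf "%d\n", total }
  '
}

# Usage: take two samples a minute apart and compare the totals.
# curl -s http://rdev-api:8080/metrics | sum_404s
```

A total that grows between samples confirms the alert is still firing on live traffic rather than on a past spike.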
### 2. Verify Target Pods Exist
```bash
# List all project pods
kubectl -n rdev get pods -l rdev.orchard9.ai/project=true
# Check specific project
kubectl -n rdev get pods -l rdev.orchard9.ai/project-id=<project-id>
```
### 3. Check Pod Discovery
```bash
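# Verify the API can see pods by querying it from inside its own container
# (assumes the image ships curl; the key header matches the Verification section)
kubectl -n rdev exec deployment/rdev-api -- curl -s -H "X-API-Key: $API_KEY" localhost:8080/projects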
# Check RBAC permissions
kubectl auth can-i list pods -n rdev --as=system:serviceaccount:rdev:rdev-api
```
### 4. Common Causes
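- **Pod terminated**: The project pod was deleted or crashed
- **Wrong namespace**: The API is looking in a different namespace than the one the project pods run in
- **Missing labels**: The pod lacks the `rdev.orchard9.ai/project` or `rdev.orchard9.ai/project-id` label
- **RBAC issues**: The API's ServiceAccount cannot list pods
- **Cache stale**: The API's cached project list is outdated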
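The causes above can be told apart roughly in the order sketched here. This is an illustrative triage helper, not part of rdev: `triage_project` and its output keywords are hypothetical, and it assumes the ServiceAccount is named `rdev-api` and the labels are the ones used in the earlier steps.

```bash
# triage_project PROJECT_ID [NAMESPACE]
# Hypothetical helper: prints one keyword naming the most likely cause.
triage_project() {
  local project_id="$1" ns="${2:-rdev}" selector allowed pods
  selector="rdev.orchard9.ai/project-id=${project_id}"

  # RBAC: can the API's ServiceAccount list pods at all?
  allowed=$(kubectl auth can-i list pods -n "$ns" \
    --as="system:serviceaccount:${ns}:rdev-api" 2>/dev/null)
  if [ "$allowed" != "yes" ]; then
    echo "rbac"
    return 0
  fi

  # A labeled pod exists in the expected namespace: discovery should work,
  # so suspect a stale cache or the API's configured namespace.
  pods=$(kubectl -n "$ns" get pods -l "$selector" -o name 2>/dev/null)
  if [ -n "$pods" ]; then
    echo "pod-ok"
    return 0
  fi

  # A labeled pod exists, but in some other namespace.
  pods=$(kubectl get pods --all-namespaces -l "$selector" -o name 2>/dev/null)
  if [ -n "$pods" ]; then
    echo "wrong-namespace"
    return 0
  fi

  # Nothing carries the label: the pod is gone or was never labeled.
  echo "missing-pod-or-labels"
}
```

Each keyword maps onto one of the remediation sections that follow. `missing-pod-or-labels` still needs a manual look at `--show-labels` to decide between the two.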
## Remediation
### If Pod Is Missing
1. Check if pod should exist:
```bash
kubectl -n rdev get deployments
```
2. Recreate if needed:
```bash
kubectl -n rdev apply -f <project-deployment.yaml>
```
### If Labels Are Wrong
1. Check current labels:
```bash
kubectl -n rdev get pod <pod-name> --show-labels
```
2. Add required labels:
```bash
# --overwrite is required when the label already exists with a wrong value
kubectl -n rdev label pod <pod-name> --overwrite rdev.orchard9.ai/project=true
kubectl -n rdev label pod <pod-name> --overwrite rdev.orchard9.ai/project-id=<project-id>
```
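After relabeling, the result can be checked mechanically instead of by eye. A small sketch, assuming the two labels above are the only ones discovery requires; `has_required_labels` is a hypothetical helper that takes the comma-separated LABELS column printed by `--show-labels`.

```bash
# has_required_labels LABELS PROJECT_ID
# LABELS is the comma-separated label list from `kubectl get pod --show-labels`.
# Returns 0 only if both discovery labels are present with the expected values.
has_required_labels() {
  # Wrap in commas so each key=value pair can be matched exactly.
  local labels=",$1," project_id="$2"
  case "$labels" in
    *",rdev.orchard9.ai/project=true,"*) ;;
    *) return 1 ;;
  esac
  case "$labels" in
    *",rdev.orchard9.ai/project-id=${project_id},"*) ;;
    *) return 1 ;;
  esac
}

# Usage (LABELS is the last column of the --show-labels output):
# labels=$(kubectl -n rdev get pod <pod-name> --no-headers --show-labels | awk '{print $NF}')
# has_required_labels "$labels" <project-id> && echo "labels ok"
```

Labels set directly on a pod do not survive the pod being recreated by its controller, so a passing check here is worth repeating after the next restart.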
### If RBAC Is Broken
1. Verify ServiceAccount:
```bash
kubectl -n rdev get serviceaccount rdev-api
```
2. Check RoleBinding:
```bash
kubectl -n rdev get rolebinding rdev-api-binding -o yaml
```
3. Reapply RBAC:
```bash
kubectl apply -f deployments/k8s/base/rdev-api.yaml
```
### If Cache Is Stale
1. Force cache refresh by restarting:
```bash
kubectl -n rdev rollout restart deployment/rdev-api
kubectl -n rdev rollout status deployment/rdev-api
```
2. Or reduce cache TTL:
```bash
kubectl -n rdev set env deployment/rdev-api CACHE_TTL=5s
```
### If Wrong Namespace
1. Check configured namespace:
```bash
kubectl -n rdev get deployment rdev-api -o jsonpath='{.spec.template.spec.containers[0].env}' | jq
```
2. Update if wrong:
```bash
kubectl -n rdev set env deployment/rdev-api RDEV_NAMESPACE=rdev
```
## Verification
```bash
# List projects from API
curl -H "X-API-Key: $API_KEY" http://rdev-api:8080/projects
# Get specific project
curl -H "X-API-Key: $API_KEY" http://rdev-api:8080/projects/<project-id>
# Execute test command
curl -X POST -H "X-API-Key: $API_KEY" -H "Content-Type: application/json" \
http://rdev-api:8080/projects/<project-id>/shell \
-d '{"command": "echo hello"}'
```
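Because a restart or cache refresh can take a few seconds to take effect, the project lookup above may need a few retries before it succeeds. A minimal polling sketch, assuming the API answers `200` once the project is discoverable and `404` until then; `wait_for_project` is a hypothetical helper and the default attempt count and delay are arbitrary.

```bash
# wait_for_project PROJECT_ID [ATTEMPTS] [DELAY_SECONDS]
# Hypothetical helper: polls the project endpoint until it returns HTTP 200.
# Expects API_KEY to be set in the environment, as in the commands above.
wait_for_project() {
  local project_id="$1" attempts="${2:-10}" delay="${3:-3}" i=1 code
  while [ "$i" -le "$attempts" ]; do
    # -w '%{http_code}' prints only the status code; the body is discarded.
    code=$(curl -s -o /dev/null -w '%{http_code}' \
      -H "X-API-Key: $API_KEY" \
      "http://rdev-api:8080/projects/${project_id}")
    if [ "$code" = "200" ]; then
      echo "project ${project_id} visible after ${i} attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "project ${project_id} still returning ${code}" >&2
  return 1
}
```

If the helper times out, the remediation did not take effect and the investigation steps should be repeated.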
## Post-Incident
1. Review pod lifecycle management
2. Consider adding pod status monitoring
3. Review label conventions
4. Add alerts for project pod terminations