# Runbook: Pod Not Found

## Alert

**RdevAPIProjectNotFound**: Project pod not found errors increasing

## Impact

- Users cannot execute commands on their projects
- API returns 404 for valid project IDs

## Investigation

### 1. Confirm the Issue

```bash
# Check for NOT_FOUND errors in logs.
# --tail=-1 returns all lines in the window; with a selector (-l),
# kubectl otherwise shows only the last 10 lines per pod.
kubectl -n rdev logs -l app=rdev-api --since=10m --tail=-1 | grep "project not found"

# Check metrics
curl -s http://rdev-api:8080/metrics | grep 'http_requests_total.*status="404"'
```

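To judge whether the error count is still climbing, it can help to turn the log check into a number that is easy to compare between runs. The sketch below is only an illustration: `count_not_found` is a hypothetical helper name, and it assumes the log message contains the literal text `project not found`, as in the grep above.

```bash
# Hypothetical helper: count log lines that mention "project not found".
# Reads log lines on stdin so it can be fed from `kubectl logs`.
count_not_found() {
  # grep -c prints the number of matching lines; it exits 1 when there
  # are no matches, so mask that to keep pipelines from failing.
  grep -c "project not found" || true
}
```

A possible invocation is `kubectl -n rdev logs -l app=rdev-api --since=10m --tail=-1 | count_not_found`, repeated a few minutes apart.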
### 2. Verify Target Pods Exist

```bash
# List all project pods
kubectl -n rdev get pods -l rdev.orchard9.ai/project=true

# Check specific project
kubectl -n rdev get pods -l rdev.orchard9.ai/project-id=<project-id>
```

### 3. Check Pod Discovery

```bash
# Verify API can see pods (runs curl inside the API container,
# which assumes the image ships curl)
kubectl -n rdev exec deployment/rdev-api -- curl -s localhost:8080/projects

# Check RBAC permissions
kubectl auth can-i list pods -n rdev --as=system:serviceaccount:rdev:rdev-api
```

### 4. Common Causes

- **Pod terminated**: The project pod was deleted or crashed
- **Wrong namespace**: The API is looking in the wrong namespace
- **Missing labels**: The pod is missing required labels
- **RBAC issues**: The API cannot list pods
- **Cache stale**: The project list cache is outdated

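The first checks above can be strung together into a quick triage pass for a single project. This is a minimal sketch, not part of rdev: `diagnose_project` is a hypothetical function name, and it assumes the `rdev` namespace, the `rdev.orchard9.ai/project-id` label, and the `rdev-api` ServiceAccount used elsewhere in this runbook.

```bash
# Hypothetical triage helper: walks the pod and RBAC checks for one project ID.
diagnose_project() {
  local project_id="$1" ns="rdev" pods answer

  # Pod terminated or missing labels: is any pod carrying the project-id label?
  pods=$(kubectl -n "$ns" get pods -l "rdev.orchard9.ai/project-id=${project_id}" -o name)
  if [ -z "$pods" ]; then
    echo "no pod labelled for project ${project_id} in namespace ${ns}"
    return 1
  fi

  # RBAC issues: can the API's ServiceAccount list pods?
  # `kubectl auth can-i` prints "yes" or "no" and exits non-zero on "no".
  answer=$(kubectl auth can-i list pods -n "$ns" \
    --as="system:serviceaccount:${ns}:rdev-api" || true)
  if [ "$answer" != "yes" ]; then
    echo "rdev-api ServiceAccount cannot list pods in ${ns}"
    return 2
  fi

  # Pod and RBAC look fine, so a stale cache or namespace mismatch is more likely.
  echo "pod and RBAC look healthy: ${pods}"
}
```

Running `diagnose_project <project-id>` returns 1 when no labelled pod exists, 2 when the RBAC check fails, and 0 otherwise, which narrows down which remediation section to follow.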
## Remediation

### If Pod Is Missing

1. Check if pod should exist:
```bash
kubectl -n rdev get deployments
```

2. Recreate if needed:
```bash
kubectl -n rdev apply -f <project-deployment.yaml>
```

### If Labels Are Wrong

1. Check current labels:
```bash
kubectl -n rdev get pod <pod-name> --show-labels
```

2. Add required labels (append `--overwrite` to replace a label that already exists with the wrong value):
```bash
kubectl -n rdev label pod <pod-name> rdev.orchard9.ai/project=true
kubectl -n rdev label pod <pod-name> rdev.orchard9.ai/project-id=<project-id>
```

Labels set directly on a pod are lost when the pod is recreated, so also correct them in the pod template of whatever creates the pod.

### If RBAC Is Broken

1. Verify ServiceAccount:
```bash
kubectl -n rdev get serviceaccount rdev-api
```

2. Check RoleBinding:
```bash
kubectl -n rdev get rolebinding rdev-api-binding -o yaml
```

3. Reapply RBAC:
```bash
kubectl apply -f deployments/k8s/base/rdev-api.yaml
```

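After reapplying RBAC, it may be worth confirming more than the `list` permission. The sketch below loops over a few verbs with `kubectl auth can-i`. `missing_pod_verbs` is a hypothetical helper, and the verb list (`get`, `list`, `watch`) is an assumption rather than a statement of what rdev-api actually requires.

```bash
# Hypothetical helper: print the pod verbs the rdev-api ServiceAccount lacks.
# The verb list is an assumption; adjust it to what the API really needs.
missing_pod_verbs() {
  local verb answer missing=""
  for verb in get list watch; do
    # `kubectl auth can-i` prints "yes" or "no" and exits non-zero on "no".
    answer=$(kubectl auth can-i "$verb" pods -n rdev \
      --as=system:serviceaccount:rdev:rdev-api || true)
    # Append the verb (space-separated) when the answer is not "yes".
    [ "$answer" = "yes" ] || missing="${missing}${missing:+ }${verb}"
  done
  echo "$missing"
}
```

An empty result from `missing_pod_verbs` suggests the Role and RoleBinding cover the checked verbs; any verb it prints points at a gap in the Role.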
### If Cache Is Stale

1. Force cache refresh by restarting:
```bash
kubectl -n rdev rollout restart deployment/rdev-api
```

2. Or reduce cache TTL (changing the environment also triggers a rollout of the deployment):
```bash
kubectl -n rdev set env deployment/rdev-api CACHE_TTL=5s
```

### If Wrong Namespace

1. Check configured namespace:
```bash
kubectl -n rdev get deployment rdev-api -o jsonpath='{.spec.template.spec.containers[0].env}' | jq .
```

2. Update if wrong:
```bash
kubectl -n rdev set env deployment/rdev-api RDEV_NAMESPACE=rdev
```

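Instead of reading the whole environment list, the namespace value can be pulled out directly with a JSONPath filter. This is a sketch under assumptions: `configured_namespace` and `namespace_matches` are hypothetical helper names, and the filter only finds `RDEV_NAMESPACE` when it is set as a literal `value` on the first container, not via `valueFrom` or `envFrom`.

```bash
# Hypothetical helper: print the literal RDEV_NAMESPACE value on the rdev-api container.
# Prints nothing if the variable is unset or comes from a ConfigMap or Secret reference.
configured_namespace() {
  kubectl -n rdev get deployment rdev-api \
    -o jsonpath='{.spec.template.spec.containers[0].env[?(@.name=="RDEV_NAMESPACE")].value}'
}

# Succeeds only when the configured value equals the expected namespace.
namespace_matches() {
  local expected="$1" actual
  actual=$(configured_namespace)
  [ "$actual" = "$expected" ]
}
```

For example, `namespace_matches rdev || echo "RDEV_NAMESPACE is not rdev"` flags a mismatch before deciding whether the `set env` step is needed.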
## Verification

```bash
# List projects from API
curl -H "X-API-Key: $API_KEY" http://rdev-api:8080/projects

# Get specific project
curl -H "X-API-Key: $API_KEY" http://rdev-api:8080/projects/<project-id>

# Execute test command
curl -X POST -H "X-API-Key: $API_KEY" -H "Content-Type: application/json" \
  http://rdev-api:8080/projects/<project-id>/shell \
  -d '{"command": "echo hello"}'
```

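When verifying several projects, checking the HTTP status code is less error-prone than reading response bodies by eye. The following is a minimal sketch: `project_is_visible` is a hypothetical helper, and it assumes `API_KEY` is set and that the `rdev-api` hostname resolves from wherever the check runs.

```bash
# Hypothetical helper: succeed only if the API returns HTTP 200 for a project.
project_is_visible() {
  local project_id="$1" status
  # -o /dev/null drops the body; -w '%{http_code}' prints only the status code.
  status=$(curl -s -o /dev/null -w '%{http_code}' \
    -H "X-API-Key: ${API_KEY}" \
    "http://rdev-api:8080/projects/${project_id}")
  echo "GET /projects/${project_id} -> ${status}"
  [ "$status" = "200" ]
}
```

A call such as `project_is_visible <project-id>` prints the status it saw and returns non-zero for anything other than 200, so it can gate the test command step.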
## Post-Incident

1. Review pod lifecycle management
2. Consider adding pod status monitoring
3. Review label conventions
4. Add alerts for project pod terminations