# Worker Pool
**Last Updated:** 2026-02-06
**Confidence:** High
## Summary
A distributed task execution system in which standalone worker pods poll rdev-api for tasks and execute them through a claudebox sidecar. It scales horizontally: adding worker pods adds task capacity.
**Key Facts:**
- **Architecture:** Pull-based polling (not push/websocket)
- **Sidecar pattern:** Worker + claudebox in same pod, communicate via localhost HTTP
- **Atomic dequeue:** PostgreSQL `FOR UPDATE SKIP LOCKED` prevents duplicate claims
- **Task types:** `build` (Claude Code prompts), `sdlc` (SDLC commands)
- **Scaling:** Add replicas to handle more concurrent tasks
- **Resilience:** Stale workers marked offline, stuck tasks re-queued automatically
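
The atomic dequeue listed above relies on PostgreSQL row locking. The sketch below shows what such a claim might look like from Go. Only the `work_queue` table name comes from this document; the column names, the status values, and the `claimNext` helper are assumptions for illustration and may not match `work_queue.go`.

```go
package main

import (
	"context"
	"database/sql"
	"errors"
	"fmt"
)

// claimSQL locks one pending row and marks it running in a single statement.
// SKIP LOCKED makes concurrent claimers pass over rows that another
// transaction has already locked instead of waiting on them.
// Column names and status values are hypothetical.
const claimSQL = `
UPDATE work_queue
SET status = 'running', worker_id = $1
WHERE id = (
    SELECT id FROM work_queue
    WHERE status = 'pending'
    ORDER BY created_at
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING id, type, spec`

type claimedTask struct {
	ID   string
	Type string
	Spec []byte
}

// claimNext returns the claimed task, or nil when nothing is pending.
func claimNext(ctx context.Context, db *sql.DB, workerID string) (*claimedTask, error) {
	var t claimedTask
	err := db.QueryRowContext(ctx, claimSQL, workerID).Scan(&t.ID, &t.Type, &t.Spec)
	if errors.Is(err, sql.ErrNoRows) {
		return nil, nil
	}
	if err != nil {
		return nil, err
	}
	return &t, nil
}

func main() {
	// No database is opened here; the query is printed for inspection only.
	fmt.Println(claimSQL)
}
```

Two workers running this statement at the same time lock different rows, so neither blocks the other and no task is handed out twice.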
## File Pointers
### Standalone Worker Binary
- **Entry:** `cmd/rdev-worker/main.go` - Main binary, registration, heartbeat, poll loop
- **API Client:** `internal/worker/api_client.go` - HTTP client to rdev-api
- **Build Executor:** `internal/worker/http_build_executor.go` - Execute builds via claudebox
- **SDLC Executor:** `internal/worker/http_sdlc_executor.go` - Execute SDLC tasks via claudebox
### Claudebox Sidecar Client
- **Client:** `internal/adapter/claudebox/client.go` - HTTP client to claudebox sidecar
- **Endpoints:** `/health`, `/execute`, `/git/clone`, `/git/commit-and-push`, `/sdlc`
### rdev-api Server-Side
- **Handlers:** `internal/handlers/workers.go` - `/workers/*` endpoints
- **Service:** `internal/service/worker_service.go` - Claim, complete, fail logic
- **Registry:** `internal/adapter/postgres/worker_registry.go` - Worker state persistence
- **Queue:** `internal/adapter/postgres/work_queue.go` - Task queue with atomic dequeue
### Domain
- **Worker:** `internal/domain/worker.go` - Worker, WorkerStatus
- **Task:** `internal/domain/work.go` - WorkTask, WorkTaskType, WorkTaskStatus
- **Build:** `internal/domain/build.go` - BuildSpec, BuildResult
### Kubernetes
- **Deployment:** `deployments/k8s/base/rdev-worker.yaml` - Worker + claudebox pod spec
## Architecture
```
┌─────────────────────┐            HTTP Polling (5s)            ┌──────────────────────────┐
│ rdev-api            │◄───────────────────────────────────────►│ Worker Pod               │
│                     │                                         │ ┌─────────┐ ┌─────────┐  │
│  POST /workers/register                ← Register at startup  │ │ worker  │→│claudebox│  │
│  POST /workers/{id}/heartbeat          ← Every 30s            │ └─────────┘ └─────────┘  │
│  POST /workers/{id}/claim              ← Poll for tasks       │      ↓ HTTP localhost    │
│  POST /workers/{id}/complete/{taskId}  ← Success              │ Claude Code execution    │
│  POST /workers/{id}/fail/{taskId}      ← Failure              └──────────────────────────┘
│                     │
│  PostgreSQL         │
│  ├─ workers         │  (worker registry)
│  ├─ work_queue      │  (task queue)
│  └─ build_audit     │  (execution history)
└─────────────────────┘
```
## Worker Lifecycle
1. **Register:** Worker pod starts → `POST /workers/register` with ID, hostname, capabilities
2. **Heartbeat:** Every 30s → `POST /workers/{id}/heartbeat` to stay alive
3. **Poll:** Every 5s → `POST /workers/{id}/claim` to get next task
4. **Execute:** Call claudebox sidecar HTTP API to run Claude Code / SDLC commands
5. **Report:** `POST /workers/{id}/complete/{taskId}` or `/fail/{taskId}` with results
6. **Shutdown:** Graceful wait for in-flight tasks via `sync.WaitGroup`
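
The lifecycle above can be sketched as one loop driven by two tickers. This is a minimal illustration, not the code in `cmd/rdev-worker/main.go`: the `workerAPI` interface, the `task` type, and the in-memory `fakeAPI` are hypothetical. The sketch also does not cap concurrent tasks or track the `busy` status.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// task and workerAPI are hypothetical stand-ins for the real domain types.
type task struct{ ID string }

type workerAPI interface {
	Register(ctx context.Context) error
	Heartbeat(ctx context.Context) error
	Claim(ctx context.Context) (*task, error) // nil task: nothing queued
	Complete(ctx context.Context, taskID string) error
	Fail(ctx context.Context, taskID string, cause error) error
}

// runWorker registers once, then heartbeats and polls until ctx is cancelled.
func runWorker(ctx context.Context, api workerAPI, poll, beat time.Duration,
	execute func(context.Context, *task) error) error {
	if err := api.Register(ctx); err != nil {
		return fmt.Errorf("register: %w", err)
	}
	pollTick, beatTick := time.NewTicker(poll), time.NewTicker(beat)
	defer pollTick.Stop()
	defer beatTick.Stop()

	var inflight sync.WaitGroup
	defer inflight.Wait() // graceful shutdown: let running tasks finish
	for {
		select {
		case <-ctx.Done():
			return nil
		case <-beatTick.C:
			_ = api.Heartbeat(ctx) // a real worker would log failures
		case <-pollTick.C:
			t, err := api.Claim(ctx)
			if err != nil || t == nil {
				continue // nothing claimed; retry on the next tick
			}
			inflight.Add(1)
			go func() {
				defer inflight.Done()
				// Background context: shutdown must not abort the task or its report.
				if err := execute(context.Background(), t); err != nil {
					_ = api.Fail(context.Background(), t.ID, err)
					return
				}
				_ = api.Complete(context.Background(), t.ID)
			}()
		}
	}
}

// fakeAPI records calls in memory so the loop can run without a server.
type fakeAPI struct {
	mu    sync.Mutex
	queue []*task
	calls []string
}

func (f *fakeAPI) record(call string) error {
	f.mu.Lock()
	defer f.mu.Unlock()
	f.calls = append(f.calls, call)
	return nil
}
func (f *fakeAPI) Register(context.Context) error  { return f.record("register") }
func (f *fakeAPI) Heartbeat(context.Context) error { return f.record("heartbeat") }
func (f *fakeAPI) Complete(_ context.Context, id string) error {
	return f.record("complete " + id)
}
func (f *fakeAPI) Fail(_ context.Context, id string, _ error) error {
	return f.record("fail " + id)
}
func (f *fakeAPI) Claim(context.Context) (*task, error) {
	f.mu.Lock()
	defer f.mu.Unlock()
	if len(f.queue) == 0 {
		return nil, nil
	}
	t := f.queue[0]
	f.queue, f.calls = f.queue[1:], append(f.calls, "claim "+t.ID)
	return t, nil
}

// demoRun drives the loop for 200ms with shortened intervals.
func demoRun() []string {
	api := &fakeAPI{queue: []*task{{ID: "t1"}}}
	ctx, cancel := context.WithTimeout(context.Background(), 200*time.Millisecond)
	defer cancel()
	noop := func(context.Context, *task) error { return nil }
	_ = runWorker(ctx, api, 10*time.Millisecond, 50*time.Millisecond, noop)
	return api.calls
}

func main() { fmt.Println(demoRun()) }
```

Heartbeats share the select loop with polling, and tasks run in their own goroutines, so a long task does not delay the next heartbeat.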
## Worker Statuses
| Status | Meaning |
|--------|---------|
| `idle` | Ready to claim new tasks |
| `busy` | Currently executing a task |
| `draining` | Not accepting new tasks (pre-shutdown) |
| `offline` | Missed heartbeat threshold (>90s) |
## Task Types
### Build Tasks (`WorkTaskTypeBuild`)
Execute Claude Code prompts with optional git operations.
**Spec:**
```json
{
  "prompt": "Build a React app with...",
  "auto_commit": true,
  "auto_push": false,
  "git_clone_url": "https://gitea.../repo.git"
}
```
**Execution Flow:**
1. Clone repo via `claudebox /git/clone`
2. Execute prompt via `claudebox /execute` (streaming)
3. Commit/push via `claudebox /git/commit-and-push`
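
To make the flow concrete, the sketch below decodes a build spec and lists the claudebox endpoints it would trigger, in order. The struct is an illustrative copy, not `BuildSpec` from `internal/domain/build.go`. It assumes the clone step is skipped when no URL is given and the commit step is skipped unless `auto_commit` is set; `auto_push` is parsed but not modelled.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// buildSpec mirrors the JSON spec above (hypothetical copy of the domain type).
type buildSpec struct {
	Prompt      string `json:"prompt"`
	AutoCommit  bool   `json:"auto_commit"`
	AutoPush    bool   `json:"auto_push"`
	GitCloneURL string `json:"git_clone_url"`
}

// parseBuildSpec decodes a task spec from its JSON form.
func parseBuildSpec(raw string) (buildSpec, error) {
	var s buildSpec
	err := json.Unmarshal([]byte(raw), &s)
	return s, err
}

// planBuild lists the claudebox endpoints a build would call, in order.
// Which steps are optional is an assumption of this sketch.
func planBuild(s buildSpec) []string {
	var steps []string
	if s.GitCloneURL != "" {
		steps = append(steps, "/git/clone")
	}
	steps = append(steps, "/execute")
	if s.AutoCommit {
		steps = append(steps, "/git/commit-and-push")
	}
	return steps
}

func main() {
	spec, err := parseBuildSpec(`{
		"prompt": "Build a React app",
		"auto_commit": true,
		"auto_push": false,
		"git_clone_url": "https://example.invalid/repo.git"
	}`)
	if err != nil {
		panic(err)
	}
	fmt.Println(planBuild(spec)) // [/git/clone /execute /git/commit-and-push]
}
```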
### SDLC Tasks (`WorkTaskTypeSDLC`)
Execute SDLC CLI commands.
**Spec:**
```json
{
  "command": "feature",
  "args": ["init", "feature-name"],
  "git_clone_url": "https://gitea.../repo.git"
}
```
**Execution Flow:**
1. Clone repo via `claudebox /git/clone`
2. Run SDLC command via `claudebox /sdlc`
3. Commit/push changes
## API Endpoints
| Method | Path | Description |
|--------|------|-------------|
| POST | `/workers/register` | Register new worker |
| POST | `/workers/{id}/heartbeat` | Keep worker alive |
| POST | `/workers/{id}/claim` | Claim next available task (204 if none) |
| POST | `/workers/{id}/complete/{taskId}` | Report successful completion |
| POST | `/workers/{id}/fail/{taskId}` | Report failure |
| GET | `/workers` | List all workers |
| GET | `/workers/{id}` | Get worker details |
| POST | `/workers/{id}/drain` | Set worker to draining |
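
The claim endpoint answers 204 when the queue is empty, so a client should treat that status as "no work" and not as an error. The sketch below shows one way to handle it, exercised against an in-process fake server. The response body shape (`id`, `type`) and the 200 status on success are assumptions; the real client lives in `internal/worker/api_client.go`.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
)

// claimedTask is a hypothetical response body for a successful claim.
type claimedTask struct {
	ID   string `json:"id"`
	Type string `json:"type"`
}

// claim posts to /workers/{id}/claim. An empty queue (204) is reported as
// (nil, nil) so callers can simply wait for the next poll.
func claim(c *http.Client, baseURL, workerID string) (*claimedTask, error) {
	resp, err := c.Post(baseURL+"/workers/"+workerID+"/claim", "application/json", nil)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	switch resp.StatusCode {
	case http.StatusNoContent:
		return nil, nil
	case http.StatusOK:
		var t claimedTask
		if err := json.NewDecoder(resp.Body).Decode(&t); err != nil {
			return nil, err
		}
		return &t, nil
	default:
		return nil, fmt.Errorf("claim: unexpected status %s", resp.Status)
	}
}

// newFakeAPI serves one task, then 204 for every later claim.
func newFakeAPI() *httptest.Server {
	var mu sync.Mutex
	served := false
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		mu.Lock()
		first := !served
		served = true
		mu.Unlock()
		if !first {
			w.WriteHeader(http.StatusNoContent)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		_ = json.NewEncoder(w).Encode(claimedTask{ID: "task-1", Type: "build"})
	}))
}

func main() {
	srv := newFakeAPI()
	defer srv.Close()
	for i := 0; i < 2; i++ {
		t, err := claim(srv.Client(), srv.URL, "worker-1")
		fmt.Printf("task=%+v err=%v\n", t, err)
	}
}
```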
## Kubernetes Deployment
```yaml
# deployments/k8s/base/rdev-worker.yaml (abbreviated; volume names omitted)
spec:
  replicas: 1  # Scale by increasing
  strategy:
    type: RollingUpdate  # RWX PVC enables multi-pod mounts
    rollingUpdate:
      maxSurge: 2
      maxUnavailable: 0
  template:
    spec:
      containers:
        - name: worker
          image: registry.threesix.ai/rdev/worker:latest
          env:
            - name: RDEV_API_URL
              value: http://rdev-api.rdev.svc.cluster.local:8080
            - name: CLAUDEBOX_URL
              value: http://localhost:8080
            - name: WORKER_POLL_INTERVAL
              value: 5s
            - name: WORKER_HEARTBEAT_INTERVAL
              value: 30s
            - name: WORKER_TASK_TIMEOUT
              value: 15m
        - name: claudebox
          image: registry.threesix.ai/rdev/claudebox:latest
          volumeMounts:
            - mountPath: /workspace     # emptyDir
            - mountPath: /root/.claude  # RWX PVC, shared Claude auth
```
**Storage:** The `claudebox-claude-config` PVC uses `ReadWriteMany` (RWX) access mode with Longhorn NFS, allowing multiple worker pods to share Claude OAuth credentials.
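
The interval and timeout variables in the manifest use Go duration syntax (`5s`, `30s`, `15m`). The sketch below shows one way a worker might read them. The `workerConfig` type and helper names are hypothetical, and the fallback values simply copy the manifest; the binary's actual defaults are not documented here.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

type workerConfig struct {
	APIURL       string
	ClaudeboxURL string
	Poll         time.Duration
	Heartbeat    time.Duration
	TaskTimeout  time.Duration
}

// durationOr parses values such as "5s" or "15m"; an unset variable yields def.
func durationOr(getenv func(string) string, key string, def time.Duration) (time.Duration, error) {
	raw := getenv(key)
	if raw == "" {
		return def, nil
	}
	d, err := time.ParseDuration(raw)
	if err != nil {
		return 0, fmt.Errorf("%s=%q: %w", key, raw, err)
	}
	return d, nil
}

// loadConfig takes getenv as a parameter so tests can pass a map lookup
// in place of os.Getenv.
func loadConfig(getenv func(string) string) (workerConfig, error) {
	cfg := workerConfig{
		APIURL:       getenv("RDEV_API_URL"),
		ClaudeboxURL: getenv("CLAUDEBOX_URL"),
	}
	fields := []struct {
		key string
		dst *time.Duration
		def time.Duration
	}{
		{"WORKER_POLL_INTERVAL", &cfg.Poll, 5 * time.Second},
		{"WORKER_HEARTBEAT_INTERVAL", &cfg.Heartbeat, 30 * time.Second},
		{"WORKER_TASK_TIMEOUT", &cfg.TaskTimeout, 15 * time.Minute},
	}
	for _, f := range fields {
		d, err := durationOr(getenv, f.key, f.def)
		if err != nil {
			return workerConfig{}, err
		}
		*f.dst = d
	}
	return cfg, nil
}

func main() {
	cfg, err := loadConfig(os.Getenv)
	fmt.Printf("%+v err=%v\n", cfg, err)
}
```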
## Error Classification
Failed tasks are assigned an error code based on the failure message, and the code determines whether the task is retried:
| Code | Trigger | Retryable |
|------|---------|-----------|
| `RATE_LIMITED` | "rate limit", "quota exceeded" | Yes (with backoff) |
| `AUTH_FAILED` | "unauthorized", "invalid api key" | No |
| `TIMEOUT` | "context deadline exceeded" | Yes |
| `AGENT_ERROR` | Generic error | Yes (limited retries) |
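
The table above maps directly onto a substring classifier. The sketch below is one possible shape for it. The case-insensitive matching, the order of the checks, and the `classify` and `errorClass` names are assumptions. Backoff and retry limits are left out.

```go
package main

import (
	"fmt"
	"strings"
)

type errorClass struct {
	Code      string
	Retryable bool
}

// containsAny reports whether s contains at least one of the substrings.
func containsAny(s string, subs ...string) bool {
	for _, sub := range subs {
		if strings.Contains(s, sub) {
			return true
		}
	}
	return false
}

// classify maps a failure message to a code from the table above.
// Anything unrecognised falls through to AGENT_ERROR.
func classify(msg string) errorClass {
	m := strings.ToLower(msg)
	switch {
	case containsAny(m, "rate limit", "quota exceeded"):
		return errorClass{"RATE_LIMITED", true}
	case containsAny(m, "unauthorized", "invalid api key"):
		return errorClass{"AUTH_FAILED", false}
	case containsAny(m, "context deadline exceeded"):
		return errorClass{"TIMEOUT", true}
	default:
		return errorClass{"AGENT_ERROR", true}
	}
}

func main() {
	msgs := []string{
		"429: Rate limit reached",
		"401 Unauthorized",
		"context deadline exceeded",
		"exit status 1",
	}
	for _, msg := range msgs {
		fmt.Printf("%q -> %+v\n", msg, classify(msg))
	}
}
```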
## Queue Maintenance
A background goroutine in rdev-api handles:
- **Stale worker marking:** Workers without heartbeat >90s → `offline`
- **Stale task recovery:** Tasks running >30m without completion → re-queued
- **Old task cleanup:** Completed/failed tasks >7 days → deleted
- **Metrics refresh:** Queue depth and worker counts → Prometheus
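
The thresholds above reduce to a few comparisons per sweep. The sketch below captures only those decisions, using the 90s, 30m, and 7-day limits from the list. The function names and status strings are hypothetical, and the database updates and Prometheus refresh are not shown.

```go
package main

import (
	"fmt"
	"time"
)

// Thresholds taken from the maintenance list above.
const (
	heartbeatTimeout = 90 * time.Second
	stuckTaskAfter   = 30 * time.Minute
	retention        = 7 * 24 * time.Hour
)

// workerIsStale reports whether a worker should be marked offline.
func workerIsStale(sinceHeartbeat time.Duration) bool {
	return sinceHeartbeat > heartbeatTimeout
}

// taskAction decides what one sweep does with a task. age is the time since
// the task started (running) or finished (completed/failed).
// The status strings are assumptions.
func taskAction(status string, age time.Duration) string {
	switch {
	case status == "running" && age > stuckTaskAfter:
		return "requeue"
	case (status == "completed" || status == "failed") && age > retention:
		return "delete"
	default:
		return "keep"
	}
}

func main() {
	fmt.Println(workerIsStale(2 * time.Minute))        // true
	fmt.Println(taskAction("running", 45*time.Minute)) // requeue
	fmt.Println(taskAction("failed", 24*time.Hour))    // keep
}
```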
## Graceful Shutdown
Worker uses `sync.WaitGroup` to track in-flight tasks:
1. Receive SIGTERM/SIGINT
2. Cancel context (stops polling)
3. Wait for WaitGroup with timeout (`WORKER_TASK_TIMEOUT`)
4. Log success or timeout warning
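
`sync.WaitGroup` has no built-in timeout, so step 3 needs a small helper. The sketch below shows a common pattern: wait in a goroutine and race the result against a timer. The `inflight` type and `demoTimeout` are hypothetical, and the signal is simulated so the example exits on its own.

```go
package main

import (
	"context"
	"fmt"
	"os/signal"
	"sync"
	"syscall"
	"time"
)

// demoTimeout stands in for WORKER_TASK_TIMEOUT.
const demoTimeout = time.Second

// inflight tracks running tasks with a WaitGroup.
type inflight struct{ wg sync.WaitGroup }

func (f *inflight) start() { f.wg.Add(1) }
func (f *inflight) done()  { f.wg.Done() }

// wait reports whether every tracked task finished before the timeout.
// On timeout the helper goroutine lingers until the tasks end, which is
// acceptable because the process is about to exit.
func (f *inflight) wait(timeout time.Duration) bool {
	finished := make(chan struct{})
	go func() {
		f.wg.Wait()
		close(finished)
	}()
	select {
	case <-finished:
		return true
	case <-time.After(timeout):
		return false
	}
}

func main() {
	// Steps 1-2: a SIGTERM or SIGINT cancels ctx, which stops the poll loop.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, syscall.SIGINT)
	defer stop()

	var tasks inflight
	tasks.start()
	go func() {
		defer tasks.done()
		time.Sleep(50 * time.Millisecond) // stand-in for a running task
	}()

	// Calling stop also cancels ctx; here it simulates the signal arriving.
	time.AfterFunc(10*time.Millisecond, stop)
	<-ctx.Done()

	// Steps 3-4: bounded wait, then log the outcome.
	if tasks.wait(demoTimeout) {
		fmt.Println("shutdown: all in-flight tasks finished")
	} else {
		fmt.Println("shutdown: timed out waiting for tasks")
	}
}
```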
## Related Topics
- [Work Queue](./work-queue.md) - Task queue implementation
- [Build Orchestration](../features/build-orchestration.md) - Build API and specs
- [SDLC Orchestration](./sdlc.md) - SDLC task integration