# Worker Pool

**Last Updated:** 2026-02-06
**Confidence:** High

## Summary

Distributed task execution system where standalone worker pods poll rdev-api for tasks and execute them via a claudebox sidecar. Supports horizontal scaling by adding more worker pods.

**Key Facts:**

- **Architecture:** Pull-based polling (not push/websocket)
- **Sidecar pattern:** Worker + claudebox in same pod, communicate via localhost HTTP
- **Atomic dequeue:** PostgreSQL `FOR UPDATE SKIP LOCKED` prevents duplicate claims
- **Task types:** `build` (Claude Code prompts), `sdlc` (SDLC commands)
- **Scaling:** Add replicas to handle more concurrent tasks
- **Resilience:** Stale workers marked offline, stuck tasks re-queued automatically

## File Pointers

### Standalone Worker Binary

- **Entry:** `cmd/rdev-worker/main.go` - Main binary, registration, heartbeat, poll loop
- **API Client:** `internal/worker/api_client.go` - HTTP client to rdev-api
- **Build Executor:** `internal/worker/http_build_executor.go` - Execute builds via claudebox
- **SDLC Executor:** `internal/worker/http_sdlc_executor.go` - Execute SDLC tasks via claudebox

### Claudebox Sidecar Client

- **Client:** `internal/adapter/claudebox/client.go` - HTTP client to claudebox sidecar
- **Endpoints:** `/health`, `/execute`, `/git/clone`, `/git/commit-and-push`, `/sdlc`

### rdev-api Server-Side

- **Handlers:** `internal/handlers/workers.go` - `/workers/*` endpoints
- **Service:** `internal/service/worker_service.go` - Claim, complete, fail logic
- **Registry:** `internal/adapter/postgres/worker_registry.go` - Worker state persistence
- **Queue:** `internal/adapter/postgres/work_queue.go` - Task queue with atomic dequeue

### Domain

- **Worker:** `internal/domain/worker.go` - Worker, WorkerStatus
- **Task:** `internal/domain/work.go` - WorkTask, WorkTaskType, WorkTaskStatus
- **Build:** `internal/domain/build.go` - BuildSpec, BuildResult

### Kubernetes

- **Deployment:** `deployments/k8s/base/rdev-worker.yaml` - Worker + claudebox pod spec

## Architecture

```
┌─────────────────────┐        HTTP Polling (5s)         ┌──────────────────────────┐
│      rdev-api       │◄────────────────────────────────►│        Worker Pod        │
│                     │                                  │ ┌─────────┐ ┌─────────┐  │
│ POST /workers/register ← Register at startup           │ │ worker  │→│claudebox│  │
│ POST /workers/{id}/heartbeat ← Every 30s               │ └─────────┘ └─────────┘  │
│ POST /workers/{id}/claim ← Poll for tasks              │     ↓ HTTP localhost     │
│ POST /workers/{id}/complete/{taskId} ← Success         │  Claude Code execution   │
│ POST /workers/{id}/fail/{taskId} ← Failure             └──────────────────────────┘
│                     │
│ PostgreSQL          │
│ ├─ workers          │ (worker registry)
│ ├─ work_queue       │ (task queue)
│ └─ build_audit      │ (execution history)
└─────────────────────┘
```

## Worker Lifecycle

1. **Register:** Worker pod starts → `POST /workers/register` with ID, hostname, capabilities
2. **Heartbeat:** Every 30s → `POST /workers/{id}/heartbeat` to stay alive
3. **Poll:** Every 5s → `POST /workers/{id}/claim` to get next task
4. **Execute:** Call claudebox sidecar HTTP API to run Claude Code / SDLC commands
5. **Report:** `POST /workers/{id}/complete/{taskId}` or `/fail/{taskId}` with results
6. **Shutdown:** Graceful wait for in-flight tasks via `sync.WaitGroup`

## Worker Statuses

| Status | Meaning |
|--------|---------|
| `idle` | Ready to claim new tasks |
| `busy` | Currently executing a task |
| `draining` | Not accepting new tasks (pre-shutdown) |
| `offline` | Missed heartbeat threshold (>90s) |

## Task Types

### Build Tasks (`WorkTaskTypeBuild`)

Execute Claude Code prompts with optional git operations.

**Spec:**

```json
{
  "prompt": "Build a React app with...",
  "auto_commit": true,
  "auto_push": false,
  "git_clone_url": "https://gitea.../repo.git"
}
```

**Execution Flow:**

1. Clone repo via `claudebox /git/clone`
2. Execute prompt via `claudebox /execute` (streaming)
3. Commit/push via `claudebox /git/commit-and-push`

### SDLC Tasks (`WorkTaskTypeSDLC`)

Execute SDLC CLI commands.


**Spec:**

```json
{
  "command": "feature",
  "args": ["init", "feature-name"],
  "git_clone_url": "https://gitea.../repo.git"
}
```

**Execution Flow:**

1. Clone repo via `claudebox /git/clone`
2. Run SDLC command via `claudebox /sdlc`
3. Commit/push changes

## API Endpoints

| Method | Path | Description |
|--------|------|-------------|
| POST | `/workers/register` | Register new worker |
| POST | `/workers/{id}/heartbeat` | Keep worker alive |
| POST | `/workers/{id}/claim` | Claim next available task (204 if none) |
| POST | `/workers/{id}/complete/{taskId}` | Report successful completion |
| POST | `/workers/{id}/fail/{taskId}` | Report failure |
| GET | `/workers` | List all workers |
| GET | `/workers/{id}` | Get worker details |
| POST | `/workers/{id}/drain` | Set worker to draining |

## Kubernetes Deployment

```yaml
# deployments/k8s/base/rdev-worker.yaml
# Abridged: env and volumeMounts are shown in shorthand, not literal manifest syntax.
spec:
  replicas: 1  # Scale by increasing
  strategy:
    type: RollingUpdate  # RWX PVC enables multi-pod mounts
    rollingUpdate:
      maxSurge: 2
      maxUnavailable: 0
  containers:
    - name: worker
      image: registry.threesix.ai/rdev/worker:latest
      env:
        - RDEV_API_URL: http://rdev-api.rdev.svc.cluster.local:8080
        - CLAUDEBOX_URL: http://localhost:8080
        - WORKER_POLL_INTERVAL: 5s
        - WORKER_HEARTBEAT_INTERVAL: 30s
        - WORKER_TASK_TIMEOUT: 15m
    - name: claudebox
      image: registry.threesix.ai/rdev/claudebox:latest
      volumeMounts:
        - /workspace (EmptyDir)
        - /root/.claude (RWX PVC - shared Claude auth)
```

**Storage:** The `claudebox-claude-config` PVC uses `ReadWriteMany` (RWX) access mode with Longhorn NFS, allowing multiple worker pods to share Claude OAuth credentials.

## Error Classification

Failed tasks are classified for smart retry logic:

| Code | Trigger | Retryable |
|------|---------|-----------|
| `RATE_LIMITED` | "rate limit", "quota exceeded" | Yes (with backoff) |
| `AUTH_FAILED` | "unauthorized", "invalid api key" | No |
| `TIMEOUT` | "context deadline exceeded" | Yes |
| `AGENT_ERROR` | Generic error | Yes (limited retries) |

## Queue Maintenance

Background goroutine in rdev-api:

- **Stale worker marking:** Workers without heartbeat >90s → `offline`
- **Stale task recovery:** Tasks running >30m without completion → re-queued
- **Old task cleanup:** Completed/failed tasks >7 days → deleted
- **Metrics refresh:** Queue depth and worker counts → Prometheus

## Graceful Shutdown

Worker uses `sync.WaitGroup` to track in-flight tasks:

1. Receive SIGTERM/SIGINT
2. Cancel context (stops polling)
3. Wait for WaitGroup with timeout (`WORKER_TASK_TIMEOUT`)
4. Log success or timeout warning

## Related Topics

- [Work Queue](./work-queue.md) - Task queue implementation
- [Build Orchestration](../features/build-orchestration.md) - Build API and specs
- [SDLC Orchestration](./sdlc.md) - SDLC task integration