Worker Pool
Last Updated: 2026-02-06 Confidence: High
Summary
Distributed task execution system where standalone worker pods poll rdev-api for tasks and execute them via a claudebox sidecar. Supports horizontal scaling by adding more worker pods.
Key Facts:
- Architecture: Pull-based polling (not push/websocket)
- Sidecar pattern: Worker + claudebox in same pod, communicate via localhost HTTP
- Atomic dequeue: PostgreSQL `FOR UPDATE SKIP LOCKED` prevents duplicate claims
- Task types: `build` (Claude Code prompts), `sdlc` (SDLC commands)
- Scaling: Add replicas to handle more concurrent tasks
- Resilience: Stale workers marked offline, stuck tasks re-queued automatically
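The atomic dequeue can be pictured as a single claim statement. The following is a minimal sketch, not the code in `work_queue.go`: only the `work_queue` table name and the `FOR UPDATE SKIP LOCKED` clause come from this page, while the column names (`id`, `status`, `worker_id`, `created_at`, `claimed_at`), the status values, and the function name are assumptions.

```go
package main

import (
	"context"
	"database/sql"
	"errors"
)

// claimSQL claims the oldest pending row. SKIP LOCKED makes concurrent
// transactions skip rows that another worker has already locked, so two
// workers never receive the same task. Column names are hypothetical.
const claimSQL = `
UPDATE work_queue
SET status = 'running', worker_id = $1, claimed_at = now()
WHERE id = (
    SELECT id FROM work_queue
    WHERE status = 'pending'
    ORDER BY created_at
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING id`

// claimNext returns the claimed task ID, or ok=false when the queue is empty.
func claimNext(ctx context.Context, db *sql.DB, workerID string) (id string, ok bool, err error) {
	err = db.QueryRowContext(ctx, claimSQL, workerID).Scan(&id)
	if errors.Is(err, sql.ErrNoRows) {
		return "", false, nil // nothing pending, or every pending row is locked
	}
	if err != nil {
		return "", false, err
	}
	return id, true, nil
}
```

Because the lock and the status update happen in one statement, a worker that loses the race simply sees no row instead of blocking on the winner.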
File Pointers
Standalone Worker Binary
- Entry: `cmd/rdev-worker/main.go` - Main binary, registration, heartbeat, poll loop
- API Client: `internal/worker/api_client.go` - HTTP client to rdev-api
- Build Executor: `internal/worker/http_build_executor.go` - Execute builds via claudebox
- SDLC Executor: `internal/worker/http_sdlc_executor.go` - Execute SDLC tasks via claudebox
Claudebox Sidecar Client
- Client: `internal/adapter/claudebox/client.go` - HTTP client to claudebox sidecar
- Endpoints: `/health`, `/execute`, `/git/clone`, `/git/commit-and-push`, `/sdlc`
rdev-api Server-Side
- Handlers: `internal/handlers/workers.go` - `/workers/*` endpoints
- Service: `internal/service/worker_service.go` - Claim, complete, fail logic
- Registry: `internal/adapter/postgres/worker_registry.go` - Worker state persistence
- Queue: `internal/adapter/postgres/work_queue.go` - Task queue with atomic dequeue
Domain
- Worker: `internal/domain/worker.go` - Worker, WorkerStatus
- Task: `internal/domain/work.go` - WorkTask, WorkTaskType, WorkTaskStatus
- Build: `internal/domain/build.go` - BuildSpec, BuildResult
Kubernetes
- Deployment: `deployments/k8s/base/rdev-worker.yaml` - Worker + claudebox pod spec
Architecture
┌─────────────────────┐        HTTP Polling (5s)         ┌──────────────────────────┐
│ rdev-api            │◄────────────────────────────────►│ Worker Pod               │
│                     │                                  │ ┌─────────┐ ┌─────────┐  │
│ POST /workers/register ← Register at startup           │ │ worker  │→│claudebox│  │
│ POST /workers/{id}/heartbeat ← Every 30s               │ └─────────┘ └─────────┘  │
│ POST /workers/{id}/claim ← Poll for tasks              │      ↓ HTTP localhost    │
│ POST /workers/{id}/complete/{taskId} ← Success         │  Claude Code execution   │
│ POST /workers/{id}/fail/{taskId} ← Failure             └──────────────────────────┘
│                     │
│ PostgreSQL          │
│ ├─ workers          │ (worker registry)
│ ├─ work_queue       │ (task queue)
│ └─ build_audit      │ (execution history)
└─────────────────────┘
Worker Lifecycle
- Register: Worker pod starts → `POST /workers/register` with ID, hostname, capabilities
- Heartbeat: Every 30s → `POST /workers/{id}/heartbeat` to stay alive
- Poll: Every 5s → `POST /workers/{id}/claim` to get next task
- Execute: Call claudebox sidecar HTTP API to run Claude Code / SDLC commands
- Report: `POST /workers/{id}/complete/{taskId}` or `/fail/{taskId}` with results
- Shutdown: Graceful wait for in-flight tasks via `sync.WaitGroup`
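The lifecycle above can be sketched as a heartbeat goroutine plus a poll loop. This is an illustration under stated assumptions, not the code in `cmd/rdev-worker/main.go`: the `API` interface, its method names, and the one-task-at-a-time execution are hypothetical, and registration is left out for brevity.

```go
package main

import (
	"context"
	"sync"
	"time"
)

// API is a hypothetical client interface; the real methods in
// internal/worker/api_client.go may differ.
type API interface {
	Heartbeat(ctx context.Context) error
	Claim(ctx context.Context) (taskID string, ok bool, err error)
	Complete(ctx context.Context, taskID string) error
	Fail(ctx context.Context, taskID, reason string) error
}

// run heartbeats and polls until ctx is cancelled.
func run(ctx context.Context, api API, exec func(context.Context, string) error, poll, beat time.Duration) {
	var wg sync.WaitGroup
	wg.Add(1)
	go func() { // heartbeat runs on its own so long tasks do not starve it
		defer wg.Done()
		t := time.NewTicker(beat)
		defer t.Stop()
		for {
			select {
			case <-ctx.Done():
				return
			case <-t.C:
				_ = api.Heartbeat(ctx) // a real worker would log failures
			}
		}
	}()
	defer wg.Wait()

	t := time.NewTicker(poll)
	defer t.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-t.C:
		}
		if ctx.Err() != nil {
			return // cancelled while a tick was also pending
		}
		id, ok, err := api.Claim(ctx)
		if err != nil || !ok {
			continue // empty queue or transient error: retry on the next tick
		}
		// Detached context: shutdown stops polling but lets this task finish.
		taskCtx := context.Background()
		if err := exec(taskCtx, id); err != nil {
			_ = api.Fail(taskCtx, id, err.Error())
			continue
		}
		_ = api.Complete(taskCtx, id)
	}
}

// fakeAPI hands out one task, records calls, and stops the loop when the
// task is reported.
type fakeAPI struct {
	calls []string
	given bool
	stop  context.CancelFunc
}

func (f *fakeAPI) Heartbeat(context.Context) error { return nil }
func (f *fakeAPI) Claim(context.Context) (string, bool, error) {
	f.calls = append(f.calls, "claim")
	if f.given {
		return "", false, nil
	}
	f.given = true
	return "t1", true, nil
}
func (f *fakeAPI) Complete(_ context.Context, id string) error {
	f.calls = append(f.calls, "complete "+id)
	f.stop()
	return nil
}
func (f *fakeAPI) Fail(_ context.Context, id, _ string) error {
	f.calls = append(f.calls, "fail "+id)
	f.stop()
	return nil
}

// lifecycleExample runs one task through the loop and returns the recorded calls.
func lifecycleExample() []string {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	api := &fakeAPI{stop: cancel}
	run(ctx, api, func(context.Context, string) error { return nil }, time.Millisecond, time.Hour)
	return api.calls
}
```

Keeping the heartbeat on a separate goroutine matters because a build can run far longer than the 90s offline threshold.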
Worker Statuses
| Status | Meaning |
|---|---|
| `idle` | Ready to claim new tasks |
| `busy` | Currently executing a task |
| `draining` | Not accepting new tasks (pre-shutdown) |
| `offline` | Missed heartbeat threshold (>90s) |
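As a rough illustration, the statuses map naturally onto a string-backed Go type. The constant names and the `CanClaim` helper are assumptions; the actual definitions live in `internal/domain/worker.go` and may differ.

```go
package main

// WorkerStatus mirrors the statuses in the table above.
type WorkerStatus string

const (
	StatusIdle     WorkerStatus = "idle"
	StatusBusy     WorkerStatus = "busy"
	StatusDraining WorkerStatus = "draining"
	StatusOffline  WorkerStatus = "offline"
)

// CanClaim reports whether a worker in this status may be handed a new
// task. Only idle workers qualify in this sketch.
func (s WorkerStatus) CanClaim() bool { return s == StatusIdle }
```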
Task Types
Build Tasks (WorkTaskTypeBuild)
Execute Claude Code prompts with optional git operations.
Spec:
{
  "prompt": "Build a React app with...",
  "auto_commit": true,
  "auto_push": false,
  "git_clone_url": "https://gitea.../repo.git"
}
Execution Flow:
- Clone repo via `claudebox /git/clone`
- Execute prompt via `claudebox /execute` (streaming)
- Commit/push via `claudebox /git/commit-and-push`
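The spec and flow above could be modelled along these lines. The struct is a sketch of the JSON fields shown, not the real `BuildSpec` in `internal/domain/build.go`; skipping the clone step when no URL is set, and the commit step when neither flag is set, is an assumption about how optional git operations behave.

```go
package main

import "encoding/json"

// BuildSpec mirrors the JSON fields of a build task payload.
type BuildSpec struct {
	Prompt      string `json:"prompt"`
	AutoCommit  bool   `json:"auto_commit"`
	AutoPush    bool   `json:"auto_push"`
	GitCloneURL string `json:"git_clone_url"`
}

// parseBuildSpec decodes a task payload.
func parseBuildSpec(raw []byte) (BuildSpec, error) {
	var s BuildSpec
	err := json.Unmarshal(raw, &s)
	return s, err
}

// buildSteps lists the claudebox endpoints a build would call, in order.
func buildSteps(s BuildSpec) []string {
	var steps []string
	if s.GitCloneURL != "" {
		steps = append(steps, "/git/clone")
	}
	steps = append(steps, "/execute")
	if s.AutoCommit || s.AutoPush {
		steps = append(steps, "/git/commit-and-push")
	}
	return steps
}
```

A prompt-only spec would therefore reduce to a single `/execute` call.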
SDLC Tasks (WorkTaskTypeSDLC)
Execute SDLC CLI commands.
Spec:
{
  "command": "feature",
  "args": ["init", "feature-name"],
  "git_clone_url": "https://gitea.../repo.git"
}
Execution Flow:
- Clone repo via `claudebox /git/clone`
- Run SDLC command via `claudebox /sdlc`
- Commit/push changes
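With two task types, the worker needs some way to route a claimed task to the right executor. A minimal sketch follows, assuming a map-based dispatcher. The `Executor` type, `dispatch`, and `sdlcCommandLine` are hypothetical names, and the stand-in executor only renders the command instead of calling claudebox.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// WorkTaskType uses the two type names listed under Key Facts.
type WorkTaskType string

const (
	WorkTaskTypeBuild WorkTaskType = "build"
	WorkTaskTypeSDLC  WorkTaskType = "sdlc"
)

// SDLCSpec mirrors the JSON fields of an SDLC task payload.
type SDLCSpec struct {
	Command     string   `json:"command"`
	Args        []string `json:"args"`
	GitCloneURL string   `json:"git_clone_url"`
}

// Executor runs one task payload. Real executors would call claudebox.
type Executor func(raw []byte) (string, error)

// dispatch routes a payload to the executor registered for its type.
func dispatch(execs map[WorkTaskType]Executor, t WorkTaskType, raw []byte) (string, error) {
	exec, ok := execs[t]
	if !ok {
		return "", fmt.Errorf("unknown task type %q", t)
	}
	return exec(raw)
}

// sdlcCommandLine is a stand-in executor: it decodes the spec and returns
// the command and arguments it would pass on.
func sdlcCommandLine(raw []byte) (string, error) {
	var s SDLCSpec
	if err := json.Unmarshal(raw, &s); err != nil {
		return "", err
	}
	return strings.Join(append([]string{s.Command}, s.Args...), " "), nil
}
```

Registering executors in a map keeps the poll loop unchanged if a further task type is added later.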
API Endpoints
| Method | Path | Description |
|---|---|---|
| POST | `/workers/register` | Register new worker |
| POST | `/workers/{id}/heartbeat` | Keep worker alive |
| POST | `/workers/{id}/claim` | Claim next available task (204 if none) |
| POST | `/workers/{id}/complete/{taskId}` | Report successful completion |
| POST | `/workers/{id}/fail/{taskId}` | Report failure |
| GET | `/workers` | List all workers |
| GET | `/workers/{id}` | Get worker details |
| POST | `/workers/{id}/drain` | Set worker to draining |
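The claim endpoint is the one a client has to treat specially, because an empty queue is signalled by 204 and is not an error. Below is a sketch of how a client might handle it. The `Task` fields and the use of 200 for a successful claim are assumptions, and `claimFromStub` is only a usage example against an in-process stub server.

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// Task is a hypothetical response body; the real WorkTask has more fields.
type Task struct {
	ID   string `json:"id"`
	Type string `json:"type"`
}

// claim POSTs to /workers/{id}/claim and returns (nil, nil) on 204 No Content.
func claim(ctx context.Context, c *http.Client, baseURL, workerID string) (*Task, error) {
	url := baseURL + "/workers/" + workerID + "/claim"
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, nil)
	if err != nil {
		return nil, err
	}
	resp, err := c.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	switch resp.StatusCode {
	case http.StatusNoContent:
		return nil, nil // queue is empty; poll again later
	case http.StatusOK:
		var t Task
		if err := json.NewDecoder(resp.Body).Decode(&t); err != nil {
			return nil, err
		}
		return &t, nil
	default:
		return nil, fmt.Errorf("claim: unexpected status %s", resp.Status)
	}
}

// claimFromStub runs claim against a stub server that answers with the
// given status and body.
func claimFromStub(status int, body string) (*Task, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost || r.URL.Path != "/workers/w1/claim" {
			http.NotFound(w, r)
			return
		}
		w.WriteHeader(status)
		if body != "" {
			_, _ = w.Write([]byte(body))
		}
	}))
	defer srv.Close()
	return claim(context.Background(), srv.Client(), srv.URL, "w1")
}
```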
Kubernetes Deployment
# deployments/k8s/base/rdev-worker.yaml
spec:
  replicas: 1  # Scale by increasing
  strategy:
    type: RollingUpdate  # RWX PVC enables multi-pod mounts
    rollingUpdate:
      maxSurge: 2
      maxUnavailable: 0
  containers:
    - name: worker
      image: registry.threesix.ai/rdev/worker:latest
      env:
        - RDEV_API_URL: http://rdev-api.rdev.svc.cluster.local:8080
        - CLAUDEBOX_URL: http://localhost:8080
        - WORKER_POLL_INTERVAL: 5s
        - WORKER_HEARTBEAT_INTERVAL: 30s
        - WORKER_TASK_TIMEOUT: 15m
    - name: claudebox
      image: registry.threesix.ai/rdev/claudebox:latest
      volumeMounts:
        - /workspace (EmptyDir)
        - /root/.claude (RWX PVC - shared Claude auth)
Storage: The claudebox-claude-config PVC uses ReadWriteMany (RWX) access mode with Longhorn NFS, allowing multiple worker pods to share Claude OAuth credentials.
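On the worker side, the environment variables in the manifest have to be turned into typed settings. One way to do that is sketched here. The `Config` type and `loadConfig` are hypothetical, and falling back to the documented values when a variable is unset is an assumption, not confirmed behaviour.

```go
package main

import (
	"fmt"
	"time"
)

// Config holds the worker settings named in the manifest above.
type Config struct {
	APIURL            string
	ClaudeboxURL      string
	PollInterval      time.Duration
	HeartbeatInterval time.Duration
	TaskTimeout       time.Duration
}

// loadConfig reads settings through getenv (pass os.Getenv in production)
// and parses the interval values with time.ParseDuration.
func loadConfig(getenv func(string) string) (Config, error) {
	cfg := Config{
		APIURL:       getenv("RDEV_API_URL"),
		ClaudeboxURL: getenv("CLAUDEBOX_URL"),
	}
	durations := []struct {
		key, def string
		dst      *time.Duration
	}{
		{"WORKER_POLL_INTERVAL", "5s", &cfg.PollInterval},
		{"WORKER_HEARTBEAT_INTERVAL", "30s", &cfg.HeartbeatInterval},
		{"WORKER_TASK_TIMEOUT", "15m", &cfg.TaskTimeout},
	}
	for _, d := range durations {
		v := getenv(d.key)
		if v == "" {
			v = d.def // assumed default, taken from the manifest values
		}
		parsed, err := time.ParseDuration(v)
		if err != nil {
			return Config{}, fmt.Errorf("%s: %w", d.key, err)
		}
		*d.dst = parsed
	}
	return cfg, nil
}
```

Taking a lookup function instead of calling `os.Getenv` directly keeps the parser easy to exercise without touching the process environment.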
Error Classification
Failed tasks are classified for smart retry logic:
| Code | Trigger | Retryable |
|---|---|---|
| `RATE_LIMITED` | "rate limit", "quota exceeded" | Yes (with backoff) |
| `AUTH_FAILED` | "unauthorized", "invalid api key" | No |
| `TIMEOUT` | "context deadline exceeded" | Yes |
| `AGENT_ERROR` | Generic error | Yes (limited retries) |
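The table reads as a substring match over the error message. A minimal sketch of such a classifier follows. The function and constant names are hypothetical, and the case-insensitive matching is an assumption; backoff and retry limits are left out.

```go
package main

import "strings"

// ErrorCode values are the codes from the table above.
type ErrorCode string

const (
	RateLimited ErrorCode = "RATE_LIMITED"
	AuthFailed  ErrorCode = "AUTH_FAILED"
	Timeout     ErrorCode = "TIMEOUT"
	AgentError  ErrorCode = "AGENT_ERROR"
)

// classify maps an error message to a code by substring match, falling
// back to AGENT_ERROR when no trigger phrase is present.
func classify(msg string) ErrorCode {
	m := strings.ToLower(msg)
	switch {
	case strings.Contains(m, "rate limit"), strings.Contains(m, "quota exceeded"):
		return RateLimited
	case strings.Contains(m, "unauthorized"), strings.Contains(m, "invalid api key"):
		return AuthFailed
	case strings.Contains(m, "context deadline exceeded"):
		return Timeout
	default:
		return AgentError
	}
}

// Retryable reports whether the table marks the code as retryable.
func (c ErrorCode) Retryable() bool { return c != AuthFailed }
```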
Queue Maintenance
Background goroutine in rdev-api:
- Stale worker marking: Workers without heartbeat >90s → `offline`
- Stale task recovery: Tasks running >30m without completion → re-queued
- Old task cleanup: Completed/failed tasks >7 days → deleted
- Metrics refresh: Queue depth and worker counts → Prometheus
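The three thresholds above can be expressed as small pure decisions, which the background goroutine would then apply on each pass. This is a sketch only: the status strings (`running`, `completed`, `failed`), the `Action` type, and the function names are assumptions, and the Prometheus refresh is omitted.

```go
package main

import "time"

// Thresholds taken from the list above.
const (
	heartbeatTimeout = 90 * time.Second
	stuckTaskAfter   = 30 * time.Minute
	retention        = 7 * 24 * time.Hour
)

// workerIsStale reports whether a worker should be marked offline, given
// the time elapsed since its last heartbeat.
func workerIsStale(sinceHeartbeat time.Duration) bool {
	return sinceHeartbeat > heartbeatTimeout
}

// Action is what one maintenance pass does with a task.
type Action string

const (
	Keep    Action = "keep"
	Requeue Action = "requeue"
	Delete  Action = "delete"
)

// taskAction decides the fate of a task. age is the time since the task
// started (for running tasks) or finished (for completed/failed tasks).
func taskAction(status string, age time.Duration) Action {
	switch {
	case status == "running" && age > stuckTaskAfter:
		return Requeue
	case (status == "completed" || status == "failed") && age > retention:
		return Delete
	default:
		return Keep
	}
}
```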
Graceful Shutdown
Worker uses sync.WaitGroup to track in-flight tasks:
- Receive SIGTERM/SIGINT
- Cancel context (stops polling)
- Wait for WaitGroup with timeout (`WORKER_TASK_TIMEOUT`)
- Log success or timeout warning
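`sync.WaitGroup` has no built-in timeout, so the wait step is usually written with a helper goroutine and a `select`. The sketch below shows one common way to do this and is not necessarily how `rdev-worker` implements it. All function names are hypothetical, and `simulate` is only a usage example.

```go
package main

import (
	"context"
	"log"
	"os"
	"os/signal"
	"sync"
	"syscall"
	"time"
)

// waitTimeout waits for wg and reports whether it finished before timeout.
func waitTimeout(wg *sync.WaitGroup, timeout time.Duration) bool {
	done := make(chan struct{})
	go func() {
		wg.Wait()
		close(done)
	}()
	select {
	case <-done:
		return true
	case <-time.After(timeout):
		return false // the helper goroutine stays blocked until tasks end
	}
}

// shutdownOnSignal blocks until SIGTERM or SIGINT, then drains in-flight
// tasks. In a real worker the poll loop would share ctx so that it stops
// polling as soon as the signal arrives.
func shutdownOnSignal(wg *sync.WaitGroup, timeout time.Duration) {
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, os.Interrupt)
	defer stop()
	<-ctx.Done()
	if waitTimeout(wg, timeout) {
		log.Println("shutdown: all in-flight tasks finished")
	} else {
		log.Println("shutdown: timed out waiting for in-flight tasks")
	}
}

// simulate starts one task lasting taskMs milliseconds and waits up to
// timeoutMs milliseconds for it.
func simulate(taskMs, timeoutMs int) bool {
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		time.Sleep(time.Duration(taskMs) * time.Millisecond)
	}()
	return waitTimeout(&wg, time.Duration(timeoutMs)*time.Millisecond)
}
```

On timeout the process is expected to exit, so the still-blocked helper goroutine is not a leak in practice.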
Related Topics
- Work Queue - Task queue implementation
- Build Orchestration - Build API and specs
- SDLC Orchestration - SDLC task integration