# Worker Pool

**Last Updated:** 2026-02-06
**Confidence:** High

## Summary

Distributed task execution system where standalone worker pods poll rdev-api for tasks and execute them via a claudebox sidecar. Supports horizontal scaling by adding more worker pods.

**Key Facts:**
- **Architecture:** Pull-based polling (not push/websocket)
- **Sidecar pattern:** Worker + claudebox in same pod, communicate via localhost HTTP
- **Atomic dequeue:** PostgreSQL `FOR UPDATE SKIP LOCKED` prevents duplicate claims
- **Task types:** `build` (Claude Code prompts), `sdlc` (SDLC commands)
- **Scaling:** Add replicas to handle more concurrent tasks
- **Resilience:** Stale workers marked offline, stuck tasks re-queued automatically

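The atomic-dequeue fact above can be illustrated with a minimal sketch. The table and column names (`work_queue`, `status`, `claimed_by`, `created_at`) and the `claimNext` helper are assumptions for illustration; they are not taken from `work_queue.go`.

```go
package main

import (
	"context"
	"database/sql"
	"errors"
	"fmt"
)

// claimQuery locks one pending row and marks it running in a single statement.
// SKIP LOCKED makes concurrent claimers pass over rows that another
// transaction has already locked instead of waiting on them.
// The schema (table and column names) is hypothetical.
const claimQuery = `
UPDATE work_queue
SET status = 'running', claimed_by = $1
WHERE id = (
    SELECT id FROM work_queue
    WHERE status = 'pending'
    ORDER BY created_at
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING id`

// claimNext returns the claimed task ID, or ok=false when the queue is empty.
func claimNext(ctx context.Context, db *sql.DB, workerID string) (id string, ok bool, err error) {
	err = db.QueryRowContext(ctx, claimQuery, workerID).Scan(&id)
	if errors.Is(err, sql.ErrNoRows) {
		return "", false, nil
	}
	if err != nil {
		return "", false, err
	}
	return id, true, nil
}

func main() {
	// No database is opened here; the sketch only shows the statement shape.
	fmt.Println(claimQuery)
}
```

Because the row lock and the status change happen in one statement, two workers polling at the same moment cannot both receive the same task.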
## File Pointers

### Standalone Worker Binary
- **Entry:** `cmd/rdev-worker/main.go` - Main binary, registration, heartbeat, poll loop
- **API Client:** `internal/worker/api_client.go` - HTTP client to rdev-api
- **Build Executor:** `internal/worker/http_build_executor.go` - Execute builds via claudebox
- **SDLC Executor:** `internal/worker/http_sdlc_executor.go` - Execute SDLC tasks via claudebox

### Claudebox Sidecar Client
- **Client:** `internal/adapter/claudebox/client.go` - HTTP client to claudebox sidecar
- **Endpoints:** `/health`, `/execute`, `/git/clone`, `/git/commit-and-push`, `/sdlc`

### rdev-api Server-Side
- **Handlers:** `internal/handlers/workers.go` - `/workers/*` endpoints
- **Service:** `internal/service/worker_service.go` - Claim, complete, fail logic
- **Registry:** `internal/adapter/postgres/worker_registry.go` - Worker state persistence
- **Queue:** `internal/adapter/postgres/work_queue.go` - Task queue with atomic dequeue

### Domain
- **Worker:** `internal/domain/worker.go` - Worker, WorkerStatus
- **Task:** `internal/domain/work.go` - WorkTask, WorkTaskType, WorkTaskStatus
- **Build:** `internal/domain/build.go` - BuildSpec, BuildResult

### Kubernetes
- **Deployment:** `deployments/k8s/base/rdev-worker.yaml` - Worker + claudebox pod spec

## Architecture

```
┌─────────────────────┐          HTTP Polling (5s)           ┌──────────────────────────┐
│ rdev-api            │◄────────────────────────────────────►│ Worker Pod               │
│                     │                                      │ ┌─────────┐ ┌─────────┐  │
│ POST /workers/register               ← Register at startup │ │ worker  │→│claudebox│  │
│ POST /workers/{id}/heartbeat         ← Every 30s           │ └─────────┘ └─────────┘  │
│ POST /workers/{id}/claim             ← Poll for tasks      │     ↓ HTTP localhost     │
│ POST /workers/{id}/complete/{taskId} ← Success             │  Claude Code execution   │
│ POST /workers/{id}/fail/{taskId}     ← Failure             └──────────────────────────┘
│                     │
│ PostgreSQL          │
│ ├─ workers          │ (worker registry)
│ ├─ work_queue       │ (task queue)
│ └─ build_audit      │ (execution history)
└─────────────────────┘
```

## Worker Lifecycle

1. **Register:** Worker pod starts → `POST /workers/register` with ID, hostname, capabilities
2. **Heartbeat:** Every 30s → `POST /workers/{id}/heartbeat` to stay alive
3. **Poll:** Every 5s → `POST /workers/{id}/claim` to get next task
4. **Execute:** Call claudebox sidecar HTTP API to run Claude Code / SDLC commands
5. **Report:** `POST /workers/{id}/complete/{taskId}` or `/fail/{taskId}` with results
6. **Shutdown:** Graceful wait for in-flight tasks via `sync.WaitGroup`

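The heartbeat, poll, and shutdown steps above can be sketched as a single select loop over two tickers. This is a minimal illustration, not the code in `cmd/rdev-worker/main.go`: the `API` interface, `run`, `fakeAPI`, and `demo` are hypothetical names, and running each claimed task in its own goroutine is an assumption (the source only states that in-flight tasks are tracked with a `sync.WaitGroup`).

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// API is a hypothetical stand-in for the rdev-api HTTP client.
type API interface {
	Heartbeat(ctx context.Context) error
	Claim(ctx context.Context) (taskID string, ok bool, err error)
}

// run heartbeats and polls on separate tickers until ctx is cancelled.
// Each claimed task runs in its own goroutine and is tracked by wg.
func run(ctx context.Context, api API, poll, beat time.Duration, exec func(string), wg *sync.WaitGroup) {
	pollT, beatT := time.NewTicker(poll), time.NewTicker(beat)
	defer pollT.Stop()
	defer beatT.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-beatT.C:
			_ = api.Heartbeat(ctx) // a real worker would log failures
		case <-pollT.C:
			id, ok, err := api.Claim(ctx)
			if err != nil || !ok {
				continue // empty queue or transient error: retry next tick
			}
			wg.Add(1)
			go func() {
				defer wg.Done()
				exec(id)
			}()
		}
	}
}

// fakeAPI serves a fixed list of task IDs from memory.
type fakeAPI struct {
	mu    sync.Mutex
	tasks []string
}

func (f *fakeAPI) Heartbeat(context.Context) error { return nil }

func (f *fakeAPI) Claim(context.Context) (string, bool, error) {
	f.mu.Lock()
	defer f.mu.Unlock()
	if len(f.tasks) == 0 {
		return "", false, nil
	}
	id := f.tasks[0]
	f.tasks = f.tasks[1:]
	return id, true, nil
}

// demo runs the loop against fakeAPI and returns how many tasks executed.
// Intervals are shortened from 5s/30s so the example finishes quickly.
func demo() int {
	api := &fakeAPI{tasks: []string{"task-1", "task-2"}}
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	var (
		wg   sync.WaitGroup
		mu   sync.Mutex
		done int
	)
	exec := func(string) {
		mu.Lock()
		defer mu.Unlock()
		done++
		if done == 2 {
			cancel() // stop polling once both tasks have run
		}
	}
	run(ctx, api, time.Millisecond, 5*time.Millisecond, exec, &wg)
	wg.Wait() // graceful shutdown: wait for in-flight tasks
	return done
}

func main() {
	fmt.Println("tasks executed:", demo())
}
```

Cancelling the context stops new claims, while `wg.Wait()` lets tasks that are already running finish.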
## Worker Statuses

| Status | Meaning |
|--------|---------|
| `idle` | Ready to claim new tasks |
| `busy` | Currently executing a task |
| `draining` | Not accepting new tasks (pre-shutdown) |
| `offline` | Missed heartbeat threshold (>90s) |

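The statuses above map naturally onto a string-backed type. The sketch below assumes the constant names and the `effectiveStatus` helper; only the `WorkerStatus` type name, the four status values, and the 90s threshold come from this document.

```go
package main

import (
	"fmt"
	"time"
)

// WorkerStatus values match the table; the constant names are assumptions.
type WorkerStatus string

const (
	StatusIdle     WorkerStatus = "idle"
	StatusBusy     WorkerStatus = "busy"
	StatusDraining WorkerStatus = "draining"
	StatusOffline  WorkerStatus = "offline"
)

// offlineAfter is the missed-heartbeat threshold (>90s).
const offlineAfter = 90 * time.Second

// effectiveStatus overrides the reported status with offline once the
// time since the last heartbeat exceeds the threshold.
func effectiveStatus(reported WorkerStatus, sinceHeartbeat time.Duration) WorkerStatus {
	if sinceHeartbeat > offlineAfter {
		return StatusOffline
	}
	return reported
}

func main() {
	fmt.Println(effectiveStatus(StatusIdle, 30*time.Second))
	fmt.Println(effectiveStatus(StatusBusy, 2*time.Minute))
}
```

The comparison is strict, so a worker whose last heartbeat is exactly 90s old keeps its reported status.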
## Task Types

### Build Tasks (`WorkTaskTypeBuild`)

Execute Claude Code prompts with optional git operations.

**Spec:**
```json
{
  "prompt": "Build a React app with...",
  "auto_commit": true,
  "auto_push": false,
  "git_clone_url": "https://gitea.../repo.git"
}
```

**Execution Flow:**
1. Clone repo via `claudebox /git/clone`
2. Execute prompt via `claudebox /execute` (streaming)
3. Commit/push via `claudebox /git/commit-and-push`

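A build spec like the one above could be decoded into a Go struct as follows. `BuildSpec` is named in `internal/domain/build.go`, but the Go field names, the `parseBuildSpec` and `steps` helpers, and the rule that git steps are skipped when their fields are unset are assumptions. The clone URL in `main` is a placeholder.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// BuildSpec mirrors the JSON keys of the build spec; field names are assumed.
type BuildSpec struct {
	Prompt      string `json:"prompt"`
	AutoCommit  bool   `json:"auto_commit"`
	AutoPush    bool   `json:"auto_push"`
	GitCloneURL string `json:"git_clone_url"`
}

// parseBuildSpec decodes a raw JSON spec.
func parseBuildSpec(raw string) (BuildSpec, error) {
	var s BuildSpec
	err := json.Unmarshal([]byte(raw), &s)
	return s, err
}

// steps lists the claudebox endpoints a build would call, in order.
// Treating the git steps as conditional is an assumption based on the
// phrase "optional git operations".
func (s BuildSpec) steps() []string {
	var out []string
	if s.GitCloneURL != "" {
		out = append(out, "/git/clone")
	}
	out = append(out, "/execute")
	if s.AutoCommit || s.AutoPush {
		out = append(out, "/git/commit-and-push")
	}
	return out
}

func main() {
	raw := `{
	  "prompt": "Build a React app",
	  "auto_commit": true,
	  "auto_push": false,
	  "git_clone_url": "https://git.example.com/org/repo.git"
	}`
	spec, err := parseBuildSpec(raw)
	if err != nil {
		fmt.Println("invalid spec:", err)
		return
	}
	fmt.Println(spec.steps())
}
```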
### SDLC Tasks (`WorkTaskTypeSDLC`)

Execute SDLC CLI commands.

**Spec:**
```json
{
  "command": "feature",
  "args": ["init", "feature-name"],
  "git_clone_url": "https://gitea.../repo.git"
}
```

**Execution Flow:**
1. Clone repo via `claudebox /git/clone`
2. Run SDLC command via `claudebox /sdlc`
3. Commit/push changes

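The three-step SDLC flow above can be sketched as a sequence that stops at the first failing step. The `Sidecar` interface, `SDLCSpec` field names, `runSDLC`, and the `recorder` fake are all hypothetical; the document does not state which claudebox endpoint handles the final commit/push step for SDLC tasks.

```go
package main

import (
	"errors"
	"fmt"
)

// Sidecar is a hypothetical interface over the claudebox calls an SDLC task needs.
type Sidecar interface {
	Clone(url string) error
	SDLC(command string, args []string) error
	CommitAndPush() error
}

// SDLCSpec mirrors the JSON keys of the SDLC spec; field names are assumed.
type SDLCSpec struct {
	Command     string   `json:"command"`
	Args        []string `json:"args"`
	GitCloneURL string   `json:"git_clone_url"`
}

// runSDLC performs clone, command, and commit/push in order. It returns
// at the first error, wrapped with the name of the step that failed.
func runSDLC(sc Sidecar, spec SDLCSpec) error {
	if err := sc.Clone(spec.GitCloneURL); err != nil {
		return fmt.Errorf("clone: %w", err)
	}
	if err := sc.SDLC(spec.Command, spec.Args); err != nil {
		return fmt.Errorf("sdlc %s: %w", spec.Command, err)
	}
	if err := sc.CommitAndPush(); err != nil {
		return fmt.Errorf("commit-and-push: %w", err)
	}
	return nil
}

// recorder is an in-memory Sidecar that logs calls and can fail one step.
type recorder struct {
	calls  []string
	failOn string
}

func (r *recorder) step(name string) error {
	r.calls = append(r.calls, name)
	if name == r.failOn {
		return errors.New("simulated failure")
	}
	return nil
}

func (r *recorder) Clone(string) error          { return r.step("clone") }
func (r *recorder) SDLC(string, []string) error { return r.step("sdlc") }
func (r *recorder) CommitAndPush() error        { return r.step("commit-and-push") }

func main() {
	spec := SDLCSpec{
		Command:     "feature",
		Args:        []string{"init", "feature-name"},
		GitCloneURL: "https://git.example.com/org/repo.git", // placeholder URL
	}
	ok := &recorder{}
	fmt.Println(runSDLC(ok, spec), ok.calls)

	bad := &recorder{failOn: "sdlc"}
	fmt.Println(runSDLC(bad, spec), bad.calls)
}
```

Wrapping each error with its step name gives the failure report enough context to tell a clone problem from a command problem.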
## API Endpoints

| Method | Path | Description |
|--------|------|-------------|
| POST | `/workers/register` | Register new worker |
| POST | `/workers/{id}/heartbeat` | Keep worker alive |
| POST | `/workers/{id}/claim` | Claim next available task (204 if none) |
| POST | `/workers/{id}/complete/{taskId}` | Report successful completion |
| POST | `/workers/{id}/fail/{taskId}` | Report failure |
| GET | `/workers` | List all workers |
| GET | `/workers/{id}` | Get worker details |
| POST | `/workers/{id}/drain` | Set worker to draining |

## Kubernetes Deployment

```yaml
# deployments/k8s/base/rdev-worker.yaml (abridged; volume names are illustrative)
spec:
  replicas: 1                  # Scale by increasing
  strategy:
    type: RollingUpdate        # RWX PVC enables multi-pod mounts
    rollingUpdate:
      maxSurge: 2
      maxUnavailable: 0
  template:
    spec:
      containers:
        - name: worker
          image: registry.threesix.ai/rdev/worker:latest
          env:
            - name: RDEV_API_URL
              value: http://rdev-api.rdev.svc.cluster.local:8080
            - name: CLAUDEBOX_URL
              value: http://localhost:8080
            - name: WORKER_POLL_INTERVAL
              value: 5s
            - name: WORKER_HEARTBEAT_INTERVAL
              value: 30s
            - name: WORKER_TASK_TIMEOUT
              value: 15m
        - name: claudebox
          image: registry.threesix.ai/rdev/claudebox:latest
          volumeMounts:
            - name: workspace        # EmptyDir
              mountPath: /workspace
            - name: claude-config    # RWX PVC - shared Claude auth
              mountPath: /root/.claude
```

**Storage:** The `claudebox-claude-config` PVC uses `ReadWriteMany` (RWX) access mode with Longhorn NFS, allowing multiple worker pods to share Claude OAuth credentials.

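The three `WORKER_*` variables in the manifest are Go-style duration strings, so the worker could read them with `time.ParseDuration`. The sketch below is an assumption about how such loading might look: the `Config` type, the helper names, and the fallback defaults (copied from the manifest values) are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// Config holds the worker timing settings named in the manifest.
type Config struct {
	PollInterval      time.Duration
	HeartbeatInterval time.Duration
	TaskTimeout       time.Duration
}

// durationOr parses a duration string such as "5s" or "15m", falling
// back to def when the value is empty, malformed, or not positive.
func durationOr(raw string, def time.Duration) time.Duration {
	d, err := time.ParseDuration(raw)
	if err != nil || d <= 0 {
		return def
	}
	return d
}

// loadConfig reads settings through getenv so callers can inject a fake.
func loadConfig(getenv func(string) string) Config {
	return Config{
		PollInterval:      durationOr(getenv("WORKER_POLL_INTERVAL"), 5*time.Second),
		HeartbeatInterval: durationOr(getenv("WORKER_HEARTBEAT_INTERVAL"), 30*time.Second),
		TaskTimeout:       durationOr(getenv("WORKER_TASK_TIMEOUT"), 15*time.Minute),
	}
}

func main() {
	cfg := loadConfig(os.Getenv)
	fmt.Printf("%+v\n", cfg)
}
```

Passing the lookup function in, rather than calling `os.Getenv` directly, keeps the parsing logic testable without touching the process environment.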
## Error Classification

Failed tasks are classified by error message so that the retry decision can depend on the kind of failure:

| Code | Trigger | Retryable |
|------|---------|-----------|
| `RATE_LIMITED` | "rate limit", "quota exceeded" | Yes (with backoff) |
| `AUTH_FAILED` | "unauthorized", "invalid api key" | No |
| `TIMEOUT` | "context deadline exceeded" | Yes |
| `AGENT_ERROR` | Generic error | Yes (limited retries) |

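The classification table above can be expressed as an ordered rule list. This is a sketch: the `rule` type and `classify` function are hypothetical, and case-insensitive substring matching is an assumption about how triggers are detected. Backoff and retry limits are left out.

```go
package main

import (
	"fmt"
	"strings"
)

// rule pairs an error code with the message fragments that trigger it.
type rule struct {
	code      string
	triggers  []string
	retryable bool
}

// rules are checked in order; the first trigger found in the message wins.
var rules = []rule{
	{"RATE_LIMITED", []string{"rate limit", "quota exceeded"}, true},
	{"AUTH_FAILED", []string{"unauthorized", "invalid api key"}, false},
	{"TIMEOUT", []string{"context deadline exceeded"}, true},
}

// classify maps an error message to a code and a retry decision.
// Messages that match no trigger fall through to AGENT_ERROR.
func classify(msg string) (code string, retryable bool) {
	lower := strings.ToLower(msg)
	for _, r := range rules {
		for _, t := range r.triggers {
			if strings.Contains(lower, t) {
				return r.code, r.retryable
			}
		}
	}
	return "AGENT_ERROR", true
}

func main() {
	for _, msg := range []string{
		"429: Rate limit reached",
		"Unauthorized",
		"context deadline exceeded",
		"exit status 1",
	} {
		code, retry := classify(msg)
		fmt.Printf("%-28q %-13s retry=%v\n", msg, code, retry)
	}
}
```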
## Queue Maintenance

A background goroutine in rdev-api performs periodic maintenance:
- **Stale worker marking:** Workers without heartbeat >90s → `offline`
- **Stale task recovery:** Tasks running >30m without completion → re-queued
- **Old task cleanup:** Completed/failed tasks >7 days → deleted
- **Metrics refresh:** Queue depth and worker counts → Prometheus

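The task-related rules above reduce to a per-task decision that a maintenance pass could apply. In this sketch the `taskState` type, the status strings, and the `sweep` function are assumptions; only the 30m and 7-day thresholds come from the list.

```go
package main

import (
	"fmt"
	"time"
)

const (
	staleTaskAfter = 30 * time.Minute   // running longer than this: re-queue
	retainFinished = 7 * 24 * time.Hour // finished longer ago than this: delete
)

// taskState is a hypothetical view of a queue row.
type taskState struct {
	Status string        // "pending", "running", "completed", or "failed"
	Age    time.Duration // time since the task entered this status
}

// sweep decides what one maintenance pass would do with a task.
func sweep(t taskState) string {
	finished := t.Status == "completed" || t.Status == "failed"
	switch {
	case t.Status == "running" && t.Age > staleTaskAfter:
		return "requeue"
	case finished && t.Age > retainFinished:
		return "delete"
	default:
		return "keep"
	}
}

func main() {
	fmt.Println(sweep(taskState{Status: "running", Age: 45 * time.Minute}))
	fmt.Println(sweep(taskState{Status: "completed", Age: 8 * 24 * time.Hour}))
	fmt.Println(sweep(taskState{Status: "pending", Age: time.Hour}))
}
```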
## Graceful Shutdown

Worker uses `sync.WaitGroup` to track in-flight tasks:
1. Receive SIGTERM/SIGINT
2. Cancel context (stops polling)
3. Wait for WaitGroup with timeout (`WORKER_TASK_TIMEOUT`)
4. Log success or timeout warning

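`sync.WaitGroup` has no built-in timeout, so step 3 above needs a small wrapper. One common approach is sketched below; `waitTimeout` and `demo` are hypothetical names, signal handling is omitted, and millisecond durations stand in for `WORKER_TASK_TIMEOUT`.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// waitTimeout waits for wg but gives up after d.
// It reports whether all tracked tasks finished in time.
func waitTimeout(wg *sync.WaitGroup, d time.Duration) bool {
	done := make(chan struct{})
	go func() {
		wg.Wait()
		close(done)
	}()
	timer := time.NewTimer(d)
	defer timer.Stop()
	select {
	case <-done:
		return true
	case <-timer.C:
		return false // the waiting goroutine lingers until the tasks return
	}
}

// demo starts one task lasting taskMillis and waits up to limitMillis for it.
func demo(taskMillis, limitMillis int) bool {
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		time.Sleep(time.Duration(taskMillis) * time.Millisecond)
	}()
	return waitTimeout(&wg, time.Duration(limitMillis)*time.Millisecond)
}

func main() {
	fmt.Println("finished in time:", demo(10, 1000))
	fmt.Println("finished in time:", demo(1000, 10))
}
```

On timeout the process would log a warning and exit. The helper goroutine is not cleaned up, which is acceptable only because the process is about to terminate.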
## Related Topics

- [Work Queue](./work-queue.md) - Task queue implementation
- [Build Orchestration](../features/build-orchestration.md) - Build API and specs
- [SDLC Orchestration](./sdlc.md) - SDLC task integration