
# Worker Pool

**Last Updated:** 2026-02-06
**Confidence:** High

## Summary

Distributed task execution system where standalone worker pods poll rdev-api for tasks and execute them via a claudebox sidecar. Supports horizontal scaling by adding more worker pods.

**Key Facts:**
- **Architecture:** Pull-based polling (not push/websocket)
- **Sidecar pattern:** Worker + claudebox in same pod, communicate via localhost HTTP
- **Atomic dequeue:** PostgreSQL `FOR UPDATE SKIP LOCKED` prevents duplicate claims
- **Task types:** `build` (Claude Code prompts), `sdlc` (SDLC commands)
- **Scaling:** Add replicas to handle more concurrent tasks
- **Resilience:** Stale workers marked offline, stuck tasks re-queued automatically
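
The atomic dequeue can be pictured with a minimal Go sketch like the one below. It is an illustration only: the real query lives in `internal/adapter/postgres/work_queue.go`, and the column names (`status`, `worker_id`, `created_at`, `task_type`), the status values, and the `ClaimNext` helper are assumptions made for this example.

```go
package main

import (
	"context"
	"database/sql"
	"errors"
)

// claimQuery picks one pending row and marks it running in a single statement.
// SKIP LOCKED makes concurrent claimers skip rows another transaction has
// already locked, so two workers never receive the same task.
// Column names and status values are assumptions, not the real schema.
const claimQuery = `
UPDATE work_queue
SET status = 'running', worker_id = $1
WHERE id = (
    SELECT id
    FROM work_queue
    WHERE status = 'pending'
    ORDER BY created_at
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING id, task_type`

// ClaimNext (hypothetical helper) returns ok=false when nothing is claimable.
func ClaimNext(ctx context.Context, db *sql.DB, workerID string) (id, taskType string, ok bool, err error) {
	err = db.QueryRowContext(ctx, claimQuery, workerID).Scan(&id, &taskType)
	if errors.Is(err, sql.ErrNoRows) {
		// Queue is empty, or every pending row is locked by another claimer.
		return "", "", false, nil
	}
	if err != nil {
		return "", "", false, err
	}
	return id, taskType, true, nil
}
```

Because the row lock and the status update happen in one statement, no separate "check then claim" round trip is needed.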

## File Pointers

### Standalone Worker Binary
- **Entry:** `cmd/rdev-worker/main.go` - Main binary, registration, heartbeat, poll loop
- **API Client:** `internal/worker/api_client.go` - HTTP client to rdev-api
- **Build Executor:** `internal/worker/http_build_executor.go` - Execute builds via claudebox
- **SDLC Executor:** `internal/worker/http_sdlc_executor.go` - Execute SDLC tasks via claudebox

### Claudebox Sidecar Client
- **Client:** `internal/adapter/claudebox/client.go` - HTTP client to claudebox sidecar
- **Endpoints:** `/health`, `/execute`, `/git/clone`, `/git/commit-and-push`, `/sdlc`

### rdev-api Server-Side
- **Handlers:** `internal/handlers/workers.go` - `/workers/*` endpoints
- **Service:** `internal/service/worker_service.go` - Claim, complete, fail logic
- **Registry:** `internal/adapter/postgres/worker_registry.go` - Worker state persistence
- **Queue:** `internal/adapter/postgres/work_queue.go` - Task queue with atomic dequeue

### Domain
- **Worker:** `internal/domain/worker.go` - Worker, WorkerStatus
- **Task:** `internal/domain/work.go` - WorkTask, WorkTaskType, WorkTaskStatus
- **Build:** `internal/domain/build.go` - BuildSpec, BuildResult

### Kubernetes
- **Deployment:** `deployments/k8s/base/rdev-worker.yaml` - Worker + claudebox pod spec

## Architecture

```
┌─────────────────────┐         HTTP Polling (5s)        ┌──────────────────────────┐
│     rdev-api        │◄────────────────────────────────►│    Worker Pod            │
│                     │                                   │  ┌─────────┐ ┌─────────┐ │
│  POST /workers/register  ← Register at startup         │  │ worker  │→│claudebox│ │
│  POST /workers/{id}/heartbeat  ← Every 30s             │  └─────────┘ └─────────┘ │
│  POST /workers/{id}/claim  ← Poll for tasks            │      ↓ HTTP localhost    │
│  POST /workers/{id}/complete/{taskId}  ← Success       │  Claude Code execution   │
│  POST /workers/{id}/fail/{taskId}  ← Failure           └──────────────────────────┘
│                     │
│  PostgreSQL         │
│  ├─ workers         │  (worker registry)
│  ├─ work_queue      │  (task queue)
│  └─ build_audit     │  (execution history)
└─────────────────────┘
```

## Worker Lifecycle

1. **Register:** Worker pod starts → `POST /workers/register` with ID, hostname, capabilities
2. **Heartbeat:** Every 30s → `POST /workers/{id}/heartbeat` to stay alive
3. **Poll:** Every 5s → `POST /workers/{id}/claim` to get next task
4. **Execute:** Call claudebox sidecar HTTP API to run Claude Code / SDLC commands
5. **Report:** `POST /workers/{id}/complete/{taskId}` or `/fail/{taskId}` with results
6. **Shutdown:** On SIGTERM/SIGINT, stop polling and wait for in-flight tasks via `sync.WaitGroup`

## Worker Statuses

| Status | Meaning |
|--------|---------|
| `idle` | Ready to claim new tasks |
| `busy` | Currently executing a task |
| `draining` | Not accepting new tasks (pre-shutdown) |
| `offline` | No heartbeat for >90s (three missed 30s intervals) |
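
As a rough sketch, the statuses map naturally onto a string-typed enum in Go. The constant names and the `CanClaim` helper are hypothetical; the actual definitions are in `internal/domain/worker.go`. Treating only `idle` workers as claim-eligible is an assumption drawn from the table.

```go
package main

// WorkerStatus mirrors the status values in the table above.
type WorkerStatus string

const (
	WorkerStatusIdle     WorkerStatus = "idle"
	WorkerStatusBusy     WorkerStatus = "busy"
	WorkerStatusDraining WorkerStatus = "draining"
	WorkerStatusOffline  WorkerStatus = "offline"
)

// CanClaim reports whether a worker in this status should receive a new task.
// Assumption: only idle workers claim; busy workers finish their task first.
func (s WorkerStatus) CanClaim() bool {
	return s == WorkerStatusIdle
}
```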

## Task Types

### Build Tasks (`WorkTaskTypeBuild`)

Execute Claude Code prompts with optional git operations.

**Spec:**
```json
{
  "prompt": "Build a React app with...",
  "auto_commit": true,
  "auto_push": false,
  "git_clone_url": "https://gitea.../repo.git"
}
```

**Execution Flow:**
1. Clone repo via `claudebox /git/clone`
2. Execute prompt via `claudebox /execute` (streaming)
3. Commit/push via `claudebox /git/commit-and-push`

### SDLC Tasks (`WorkTaskTypeSDLC`)

Execute SDLC CLI commands.

**Spec:**
```json
{
  "command": "feature",
  "args": ["init", "feature-name"],
  "git_clone_url": "https://gitea.../repo.git"
}
```

**Execution Flow:**
1. Clone repo via `claudebox /git/clone`
2. Run SDLC command via `claudebox /sdlc`
3. Commit/push changes via `claudebox /git/commit-and-push`
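
For illustration, the SDLC spec could be decoded and validated along these lines before it is handed to the sidecar. `SDLCSpec`, `ParseSDLCSpec`, and `Argv` are hypothetical names, and rejecting an empty `command` is an assumption about what the worker would want to enforce.

```go
package main

import (
	"encoding/json"
	"errors"
)

// SDLCSpec mirrors the JSON spec shown above.
type SDLCSpec struct {
	Command     string   `json:"command"`
	Args        []string `json:"args"`
	GitCloneURL string   `json:"git_clone_url"`
}

// ParseSDLCSpec decodes the spec and rejects one without a command.
func ParseSDLCSpec(raw []byte) (SDLCSpec, error) {
	var s SDLCSpec
	if err := json.Unmarshal(raw, &s); err != nil {
		return SDLCSpec{}, err
	}
	if s.Command == "" {
		return SDLCSpec{}, errors.New("sdlc spec: command is required")
	}
	return s, nil
}

// Argv returns the command followed by its arguments,
// e.g. ["feature", "init", "feature-name"] for the example spec.
func (s SDLCSpec) Argv() []string {
	return append([]string{s.Command}, s.Args...)
}
```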

## API Endpoints

| Method | Path | Description |
|--------|------|-------------|
| POST | `/workers/register` | Register new worker |
| POST | `/workers/{id}/heartbeat` | Keep worker alive |
| POST | `/workers/{id}/claim` | Claim next available task (`204 No Content` if the queue is empty) |
| POST | `/workers/{id}/complete/{taskId}` | Report successful completion |
| POST | `/workers/{id}/fail/{taskId}` | Report failure |
| GET | `/workers` | List all workers |
| GET | `/workers/{id}` | Get worker details |
| POST | `/workers/{id}/drain` | Set worker to draining |

## Kubernetes Deployment

```yaml
# deployments/k8s/base/rdev-worker.yaml (abridged; volume names are illustrative)
spec:
  replicas: 1  # Scale by increasing
  strategy:
    type: RollingUpdate  # RWX PVC enables multi-pod mounts
    rollingUpdate:
      maxSurge: 2
      maxUnavailable: 0
  template:
    spec:
      containers:
        - name: worker
          image: registry.threesix.ai/rdev/worker:latest
          env:
            - name: RDEV_API_URL
              value: http://rdev-api.rdev.svc.cluster.local:8080
            - name: CLAUDEBOX_URL
              value: http://localhost:8080
            - name: WORKER_POLL_INTERVAL
              value: 5s
            - name: WORKER_HEARTBEAT_INTERVAL
              value: 30s
            - name: WORKER_TASK_TIMEOUT
              value: 15m
        - name: claudebox
          image: registry.threesix.ai/rdev/claudebox:latest
          volumeMounts:
            - name: workspace        # emptyDir
              mountPath: /workspace
            - name: claude-config    # RWX PVC - shared Claude auth
              mountPath: /root/.claude
```

**Storage:** The `claudebox-claude-config` PVC uses `ReadWriteMany` (RWX) access mode with Longhorn NFS, allowing multiple worker pods to share Claude OAuth credentials.
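
On the worker side, the environment variables from the manifest could be read roughly as follows. This is a sketch under assumptions: the `Config` struct and `LoadConfig` function are hypothetical, and the fallback values simply reuse the numbers from the manifest rather than any defaults confirmed in the worker source.

```go
package main

import (
	"fmt"
	"time"
)

// Config is a hypothetical holder for the worker's environment settings.
type Config struct {
	APIURL            string
	ClaudeboxURL      string
	PollInterval      time.Duration
	HeartbeatInterval time.Duration
	TaskTimeout       time.Duration
}

// durationOr parses key as a Go duration ("5s", "15m") or falls back to def.
func durationOr(getenv func(string) string, key string, def time.Duration) (time.Duration, error) {
	v := getenv(key)
	if v == "" {
		return def, nil
	}
	d, err := time.ParseDuration(v)
	if err != nil {
		return 0, fmt.Errorf("%s: %w", key, err)
	}
	return d, nil
}

// LoadConfig reads settings through getenv; pass os.Getenv in production.
func LoadConfig(getenv func(string) string) (Config, error) {
	cfg := Config{APIURL: getenv("RDEV_API_URL"), ClaudeboxURL: getenv("CLAUDEBOX_URL")}
	if cfg.APIURL == "" {
		return Config{}, fmt.Errorf("RDEV_API_URL is required")
	}
	if cfg.ClaudeboxURL == "" {
		cfg.ClaudeboxURL = "http://localhost:8080" // sidecar shares the pod network
	}
	var err error
	if cfg.PollInterval, err = durationOr(getenv, "WORKER_POLL_INTERVAL", 5*time.Second); err != nil {
		return Config{}, err
	}
	if cfg.HeartbeatInterval, err = durationOr(getenv, "WORKER_HEARTBEAT_INTERVAL", 30*time.Second); err != nil {
		return Config{}, err
	}
	if cfg.TaskTimeout, err = durationOr(getenv, "WORKER_TASK_TIMEOUT", 15*time.Minute); err != nil {
		return Config{}, err
	}
	return cfg, nil
}
```

Taking `getenv` as a parameter keeps the loader testable without touching the process environment.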

## Error Classification

Failed tasks are classified by matching the error message against known substrings; the resulting code determines whether the task is retried:

| Code | Trigger | Retryable |
|------|---------|-----------|
| `RATE_LIMITED` | "rate limit", "quota exceeded" | Yes (with backoff) |
| `AUTH_FAILED` | "unauthorized", "invalid api key" | No |
| `TIMEOUT` | "context deadline exceeded" | Yes |
| `AGENT_ERROR` | Generic error | Yes (limited retries) |
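
The table can be read as a substring classifier. The following sketch shows one plausible shape; the `ErrorCode` type, the `Classify` function, and the case-insensitive matching are assumptions, and backoff or retry limits are left out.

```go
package main

import "strings"

// ErrorCode mirrors the codes in the table above.
type ErrorCode string

const (
	RateLimited ErrorCode = "RATE_LIMITED"
	AuthFailed  ErrorCode = "AUTH_FAILED"
	Timeout     ErrorCode = "TIMEOUT"
	AgentError  ErrorCode = "AGENT_ERROR"
)

// containsAny reports whether s contains at least one of the needles.
func containsAny(s string, needles ...string) bool {
	for _, n := range needles {
		if strings.Contains(s, n) {
			return true
		}
	}
	return false
}

// Classify maps an error message to a code and whether a retry makes sense.
// Matching is case-insensitive here, which is an assumption.
func Classify(msg string) (code ErrorCode, retryable bool) {
	m := strings.ToLower(msg)
	switch {
	case containsAny(m, "rate limit", "quota exceeded"):
		return RateLimited, true
	case containsAny(m, "unauthorized", "invalid api key"):
		return AuthFailed, false
	case strings.Contains(m, "context deadline exceeded"):
		return Timeout, true
	default:
		return AgentError, true
	}
}
```

Only `AUTH_FAILED` is non-retryable: retrying with the same bad credentials would fail the same way, while the other classes may succeed on a later attempt.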

## Queue Maintenance

Background goroutine in rdev-api:
- **Stale worker marking:** Workers without heartbeat >90s → `offline`
- **Stale task recovery:** Tasks running >30m without completion → re-queued
- **Old task cleanup:** Completed/failed tasks >7 days → deleted
- **Metrics refresh:** Queue depth and worker counts → Prometheus

## Graceful Shutdown

Worker uses `sync.WaitGroup` to track in-flight tasks:
1. Receive SIGTERM/SIGINT
2. Cancel context (stops polling)
3. Wait for WaitGroup with timeout (`WORKER_TASK_TIMEOUT`)
4. Log success or timeout warning

## Related Topics

- [Work Queue](./work-queue.md) - Task queue implementation
- [Build Orchestration](../features/build-orchestration.md) - Build API and specs
- [SDLC Orchestration](./sdlc.md) - SDLC task integration