rdev/ai-lookup/services/worker-pool.md
jordan bc010c4746 feat: add RWX storage class and full SDLC lifecycle cookbook
2026-02-06 11:37:57 -07:00

Worker Pool

Last Updated: 2026-02-06 | Confidence: High

Summary

Distributed task execution system where standalone worker pods poll rdev-api for tasks and execute them via a claudebox sidecar. Supports horizontal scaling by adding more worker pods.

Key Facts:

  • Architecture: Pull-based polling (not push/websocket)
  • Sidecar pattern: Worker + claudebox in same pod, communicate via localhost HTTP
  • Atomic dequeue: PostgreSQL FOR UPDATE SKIP LOCKED prevents duplicate claims
  • Task types: build (Claude Code prompts), sdlc (SDLC commands)
  • Scaling: Add replicas to handle more concurrent tasks
  • Resilience: Stale workers marked offline, stuck tasks re-queued automatically
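
The atomic dequeue noted above relies on PostgreSQL row locking: a row locked by one claiming transaction is skipped by the others instead of blocking them. A minimal Go sketch of what such a claim could look like, using only database/sql. The column names (id, status, worker_id, created_at) and the 'queued'/'running' status values are assumptions for illustration, not the schema used in work_queue.go:

```go
package main

import (
	"context"
	"database/sql"
	"errors"
	"fmt"
)

// claimQuery picks the oldest queued row, skips rows that another
// transaction has already locked, and marks the row as running in a
// single statement. Column names and status values are hypothetical.
const claimQuery = `
UPDATE work_queue
SET status = 'running', worker_id = $1
WHERE id = (
    SELECT id FROM work_queue
    WHERE status = 'queued'
    ORDER BY created_at
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING id`

// claimNext returns the claimed task ID, or "" when nothing is claimable.
func claimNext(ctx context.Context, db *sql.DB, workerID string) (string, error) {
	var id string
	err := db.QueryRowContext(ctx, claimQuery, workerID).Scan(&id)
	if errors.Is(err, sql.ErrNoRows) {
		return "", nil // queue empty, or every queued row is locked
	}
	return id, err
}

func main() {
	// Executing the query needs a live PostgreSQL connection and a driver,
	// so this sketch only prints the statement.
	fmt.Println(claimQuery)
}
```

Two workers running this statement concurrently each lock a different row, which is what prevents duplicate claims.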

File Pointers

Standalone Worker Binary

  • Entry: cmd/rdev-worker/main.go - Main binary, registration, heartbeat, poll loop
  • API Client: internal/worker/api_client.go - HTTP client to rdev-api
  • Build Executor: internal/worker/http_build_executor.go - Execute builds via claudebox
  • SDLC Executor: internal/worker/http_sdlc_executor.go - Execute SDLC tasks via claudebox

Claudebox Sidecar Client

  • Client: internal/adapter/claudebox/client.go - HTTP client to claudebox sidecar
  • Endpoints: /health, /execute, /git/clone, /git/commit-and-push, /sdlc

rdev-api Server-Side

  • Handlers: internal/handlers/workers.go - /workers/* endpoints
  • Service: internal/service/worker_service.go - Claim, complete, fail logic
  • Registry: internal/adapter/postgres/worker_registry.go - Worker state persistence
  • Queue: internal/adapter/postgres/work_queue.go - Task queue with atomic dequeue

Domain

  • Worker: internal/domain/worker.go - Worker, WorkerStatus
  • Task: internal/domain/work.go - WorkTask, WorkTaskType, WorkTaskStatus
  • Build: internal/domain/build.go - BuildSpec, BuildResult

Kubernetes

  • Deployment: deployments/k8s/base/rdev-worker.yaml - Worker + claudebox pod spec

Architecture

┌─────────────────────┐         HTTP Polling (5s)        ┌──────────────────────────┐
│     rdev-api        │◄────────────────────────────────►│    Worker Pod            │
│                     │                                   │  ┌─────────┐ ┌─────────┐ │
│  POST /workers/register  ← Register at startup         │  │ worker  │→│claudebox│ │
│  POST /workers/{id}/heartbeat  ← Every 30s             │  └─────────┘ └─────────┘ │
│  POST /workers/{id}/claim  ← Poll for tasks            │      ↓ HTTP localhost    │
│  POST /workers/{id}/complete/{taskId}  ← Success       │  Claude Code execution   │
│  POST /workers/{id}/fail/{taskId}  ← Failure           └──────────────────────────┘
│                     │
│  PostgreSQL         │
│  ├─ workers         │  (worker registry)
│  ├─ work_queue      │  (task queue)
│  └─ build_audit     │  (execution history)
└─────────────────────┘

Worker Lifecycle

  1. Register: Worker pod starts → POST /workers/register with ID, hostname, capabilities
  2. Heartbeat: Every 30s → POST /workers/{id}/heartbeat so the worker is not marked offline
  3. Poll: Every 5s → POST /workers/{id}/claim to get the next task
  4. Execute: Call claudebox sidecar HTTP API to run Claude Code / SDLC commands
  5. Report: POST /workers/{id}/complete/{taskId} or /fail/{taskId} with results
  6. Shutdown: Graceful wait for in-flight tasks via sync.WaitGroup
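
Steps 2 and 3 above can be sketched as two tickers driven by one select loop. This is an illustration only: the function names are hypothetical, and the heartbeat and claim callbacks stand in for the HTTP calls to rdev-api. The demo uses millisecond intervals in place of 5s and 30s so it finishes quickly:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// loop fires heartbeat and claim on their own intervals until ctx is cancelled.
func loop(ctx context.Context, pollEvery, beatEvery time.Duration, heartbeat, claim func(context.Context)) {
	poll := time.NewTicker(pollEvery)
	defer poll.Stop()
	beat := time.NewTicker(beatEvery)
	defer beat.Stop()

	for {
		select {
		case <-ctx.Done():
			return // shutdown: stop polling
		case <-beat.C:
			heartbeat(ctx)
		case <-poll.C:
			claim(ctx)
		}
	}
}

// demo runs the loop for 200ms and counts how often each callback fired.
func demo() (beats, claims int) {
	ctx, cancel := context.WithTimeout(context.Background(), 200*time.Millisecond)
	defer cancel()
	loop(ctx, 10*time.Millisecond, 50*time.Millisecond,
		func(context.Context) { beats++ },
		func(context.Context) { claims++ },
	)
	return beats, claims
}

func main() {
	beats, claims := demo()
	fmt.Printf("heartbeats=%d claims=%d\n", beats, claims)
}
```

In this single-goroutine sketch a long-running claim callback would delay heartbeats, so a real worker would presumably execute tasks in their own goroutines.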

Worker Statuses

| Status | Meaning |
| --- | --- |
| idle | Ready to claim new tasks |
| busy | Currently executing a task |
| draining | Not accepting new tasks (pre-shutdown) |
| offline | Missed heartbeat threshold (>90s) |
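
A sketch of how these statuses might be modeled in Go. The constant names and the CanClaim helper are illustrative and not copied from internal/domain/worker.go; the rule that only idle workers claim assumes one task per worker at a time:

```go
package main

import "fmt"

// WorkerStatus mirrors the four statuses listed above.
type WorkerStatus string

const (
	StatusIdle     WorkerStatus = "idle"
	StatusBusy     WorkerStatus = "busy"
	StatusDraining WorkerStatus = "draining"
	StatusOffline  WorkerStatus = "offline"
)

// CanClaim reports whether a worker in this status should be handed a task.
// Assumption: a worker runs one task at a time, so only idle workers claim.
func (s WorkerStatus) CanClaim() bool {
	return s == StatusIdle
}

func main() {
	for _, s := range []WorkerStatus{StatusIdle, StatusBusy, StatusDraining, StatusOffline} {
		fmt.Printf("%-8s canClaim=%v\n", s, s.CanClaim())
	}
}
```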

Task Types

Build Tasks (WorkTaskTypeBuild)

Execute Claude Code prompts with optional git operations.

Spec:

{
  "prompt": "Build a React app with...",
  "auto_commit": true,
  "auto_push": false,
  "git_clone_url": "https://gitea.../repo.git"
}

Execution Flow:

  1. Clone repo via claudebox /git/clone
  2. Execute prompt via claudebox /execute (streaming)
  3. Commit/push via claudebox /git/commit-and-push
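
The spec and flow above could be modeled roughly as follows. The Go field names and the planSteps helper are hypothetical. Skipping the clone when no URL is given and skipping the commit when auto_commit is false are assumptions based on "optional git operations"; how auto_push is passed to /git/commit-and-push is not covered by this sketch:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// BuildSpec matches the JSON fields shown above.
type BuildSpec struct {
	Prompt      string `json:"prompt"`
	AutoCommit  bool   `json:"auto_commit"`
	AutoPush    bool   `json:"auto_push"`
	GitCloneURL string `json:"git_clone_url"`
}

// parseBuildSpec decodes a task spec from its JSON form.
func parseBuildSpec(raw string) (BuildSpec, error) {
	var spec BuildSpec
	err := json.Unmarshal([]byte(raw), &spec)
	return spec, err
}

// planSteps lists the claudebox endpoints a build would call, in order.
func planSteps(spec BuildSpec) []string {
	var steps []string
	if spec.GitCloneURL != "" {
		steps = append(steps, "/git/clone")
	}
	steps = append(steps, "/execute")
	if spec.AutoCommit {
		steps = append(steps, "/git/commit-and-push")
	}
	return steps
}

func main() {
	raw := `{"prompt":"Build a React app","auto_commit":true,"auto_push":false,"git_clone_url":"https://example.invalid/repo.git"}`
	spec, err := parseBuildSpec(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(planSteps(spec))
}
```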

SDLC Tasks (WorkTaskTypeSDLC)

Execute SDLC CLI commands.

Spec:

{
  "command": "feature",
  "args": ["init", "feature-name"],
  "git_clone_url": "https://gitea.../repo.git"
}

Execution Flow:

  1. Clone repo via claudebox /git/clone
  2. Run SDLC command via claudebox /sdlc
  3. Commit/push changes

API Endpoints

| Method | Path | Description |
| --- | --- | --- |
| POST | /workers/register | Register new worker |
| POST | /workers/{id}/heartbeat | Keep worker alive |
| POST | /workers/{id}/claim | Claim next available task (204 if none) |
| POST | /workers/{id}/complete/{taskId} | Report successful completion |
| POST | /workers/{id}/fail/{taskId} | Report failure |
| GET | /workers | List all workers |
| GET | /workers/{id} | Get worker details |
| POST | /workers/{id}/drain | Set worker to draining |
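
The claim endpoint is the one a client has to handle with care, because an empty queue is a 204 rather than an error. A minimal sketch of a client call, with a local httptest server standing in for rdev-api. The Task shape, the 200 status on success, and the function names are assumptions:

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// Task carries only an ID here; the real WorkTask has more fields.
type Task struct {
	ID string `json:"id"`
}

// claim POSTs to /workers/{id}/claim and returns (nil, nil) on 204 No Content.
func claim(ctx context.Context, c *http.Client, baseURL, workerID string) (*Task, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, baseURL+"/workers/"+workerID+"/claim", nil)
	if err != nil {
		return nil, err
	}
	resp, err := c.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	switch resp.StatusCode {
	case http.StatusNoContent:
		return nil, nil // queue is empty
	case http.StatusOK:
		var t Task
		if err := json.NewDecoder(resp.Body).Decode(&t); err != nil {
			return nil, err
		}
		return &t, nil
	default:
		return nil, fmt.Errorf("claim: unexpected status %d", resp.StatusCode)
	}
}

// demo claims twice against a stand-in server that has exactly one task.
func demo() (first, second *Task, err error) {
	var served atomic.Bool
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if served.Swap(true) {
			w.WriteHeader(http.StatusNoContent)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(Task{ID: "task-1"})
	}))
	defer srv.Close()

	if first, err = claim(context.Background(), srv.Client(), srv.URL, "w1"); err != nil {
		return nil, nil, err
	}
	second, err = claim(context.Background(), srv.Client(), srv.URL, "w1")
	return first, second, err
}

func main() {
	first, second, err := demo()
	if err != nil {
		panic(err)
	}
	fmt.Println("first claim:", first.ID)
	fmt.Println("second claim empty:", second == nil)
}
```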

Kubernetes Deployment

# deployments/k8s/base/rdev-worker.yaml (abridged: metadata, selector, volume names, and volumes omitted)
spec:
  replicas: 1  # Scale by increasing
  strategy:
    type: RollingUpdate  # RWX PVC enables multi-pod mounts
    rollingUpdate:
      maxSurge: 2
      maxUnavailable: 0
  template:
    spec:
      containers:
        - name: worker
          image: registry.threesix.ai/rdev/worker:latest
          env:
            - name: RDEV_API_URL
              value: http://rdev-api.rdev.svc.cluster.local:8080
            - name: CLAUDEBOX_URL
              value: http://localhost:8080
            - name: WORKER_POLL_INTERVAL
              value: 5s
            - name: WORKER_HEARTBEAT_INTERVAL
              value: 30s
            - name: WORKER_TASK_TIMEOUT
              value: 15m
        - name: claudebox
          image: registry.threesix.ai/rdev/claudebox:latest
          volumeMounts:
            - mountPath: /workspace     # emptyDir
            - mountPath: /root/.claude  # RWX PVC, shared Claude auth

Storage: The claudebox-claude-config PVC uses the ReadWriteMany (RWX) access mode, provisioned by Longhorn (which serves RWX volumes over NFS), so multiple worker pods can mount the same Claude OAuth credentials at /root/.claude.
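
On the worker side, the environment variables in the manifest could be read as shown below. This is a sketch under assumptions: the Config type and helper names are hypothetical, and falling back to the manifest values as defaults is a guess about how cmd/rdev-worker/main.go behaves:

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// Defaults mirror the manifest values (assumed, not confirmed, fallbacks).
const (
	defaultPoll      = 5 * time.Second
	defaultHeartbeat = 30 * time.Second
	defaultTimeout   = 15 * time.Minute
)

type Config struct {
	APIURL            string
	ClaudeboxURL      string
	PollInterval      time.Duration
	HeartbeatInterval time.Duration
	TaskTimeout       time.Duration
}

// durationOr parses the variable as a Go duration ("5s", "15m"), or returns def if unset.
func durationOr(getenv func(string) string, key string, def time.Duration) (time.Duration, error) {
	v := getenv(key)
	if v == "" {
		return def, nil
	}
	d, err := time.ParseDuration(v)
	if err != nil {
		return 0, fmt.Errorf("%s: %w", key, err)
	}
	return d, nil
}

// loadConfig takes getenv as a parameter so it can be exercised without touching the process environment.
func loadConfig(getenv func(string) string) (Config, error) {
	cfg := Config{APIURL: getenv("RDEV_API_URL"), ClaudeboxURL: getenv("CLAUDEBOX_URL")}
	var err error
	if cfg.PollInterval, err = durationOr(getenv, "WORKER_POLL_INTERVAL", defaultPoll); err != nil {
		return Config{}, err
	}
	if cfg.HeartbeatInterval, err = durationOr(getenv, "WORKER_HEARTBEAT_INTERVAL", defaultHeartbeat); err != nil {
		return Config{}, err
	}
	if cfg.TaskTimeout, err = durationOr(getenv, "WORKER_TASK_TIMEOUT", defaultTimeout); err != nil {
		return Config{}, err
	}
	return cfg, nil
}

func main() {
	cfg, err := loadConfig(os.Getenv)
	if err != nil {
		fmt.Fprintln(os.Stderr, "config error:", err)
		os.Exit(1)
	}
	fmt.Printf("%+v\n", cfg)
}
```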

Error Classification

Failed tasks are classified by matching the error message against known trigger phrases; the resulting code determines whether the task is retried:

| Code | Trigger | Retryable |
| --- | --- | --- |
| RATE_LIMITED | "rate limit", "quota exceeded" | Yes (with backoff) |
| AUTH_FAILED | "unauthorized", "invalid api key" | No |
| TIMEOUT | "context deadline exceeded" | Yes |
| AGENT_ERROR | Generic error | Yes (limited retries) |
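
A classifier along these lines can be sketched as an ordered list of substring rules with AGENT_ERROR as the fallback. The type and function names are hypothetical, and case-insensitive matching is an assumption; backoff and retry limits are left out:

```go
package main

import (
	"fmt"
	"strings"
)

type ErrorCode string

const (
	RateLimited ErrorCode = "RATE_LIMITED"
	AuthFailed  ErrorCode = "AUTH_FAILED"
	Timeout     ErrorCode = "TIMEOUT"
	AgentError  ErrorCode = "AGENT_ERROR"
)

// rules are checked in order; the first matching trigger wins.
var rules = []struct {
	code     ErrorCode
	triggers []string
}{
	{RateLimited, []string{"rate limit", "quota exceeded"}},
	{AuthFailed, []string{"unauthorized", "invalid api key"}},
	{Timeout, []string{"context deadline exceeded"}},
}

// classify maps an error message to a code, falling back to AGENT_ERROR.
func classify(msg string) ErrorCode {
	lower := strings.ToLower(msg)
	for _, r := range rules {
		for _, t := range r.triggers {
			if strings.Contains(lower, t) {
				return r.code
			}
		}
	}
	return AgentError
}

// retryable follows the table: every code except AUTH_FAILED may be retried.
func retryable(c ErrorCode) bool {
	return c != AuthFailed
}

func main() {
	msgs := []string{"429: Rate limit reached", "invalid API key", "context deadline exceeded", "exit status 1"}
	for _, msg := range msgs {
		c := classify(msg)
		fmt.Printf("%q -> %s retryable=%v\n", msg, c, retryable(c))
	}
}
```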

Queue Maintenance

Background goroutine in rdev-api:

  • Stale worker marking: Workers with no heartbeat for >90s (three missed 30s intervals) → offline
  • Stale task recovery: Tasks running >30m without completion (twice the 15m WORKER_TASK_TIMEOUT) → re-queued
  • Old task cleanup: Completed/failed tasks >7 days → deleted
  • Metrics refresh: Queue depth and worker counts → Prometheus
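
The three thresholds above can be expressed as simple predicates. The sketch below is in-memory only; the actual maintenance loop presumably applies the same cutoffs in SQL against the workers and work_queue tables. All names are hypothetical:

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// Thresholds taken from the list above.
const (
	staleWorkerAfter = 90 * time.Second
	stuckTaskAfter   = 30 * time.Minute
	purgeTaskAfter   = 7 * 24 * time.Hour
)

func workerIsStale(sinceHeartbeat time.Duration) bool  { return sinceHeartbeat > staleWorkerAfter }
func taskIsStuck(running time.Duration) bool           { return running > stuckTaskAfter }
func taskIsPurgeable(sinceFinished time.Duration) bool { return sinceFinished > purgeTaskAfter }

// staleWorkers returns, in sorted order, the IDs of workers to mark offline.
func staleWorkers(lastHeartbeat map[string]time.Time, now time.Time) []string {
	var ids []string
	for id, seen := range lastHeartbeat {
		if workerIsStale(now.Sub(seen)) {
			ids = append(ids, id)
		}
	}
	sort.Strings(ids)
	return ids
}

func main() {
	now := time.Now()
	last := map[string]time.Time{
		"worker-a": now.Add(-10 * time.Second),
		"worker-b": now.Add(-2 * time.Minute),
	}
	fmt.Println("mark offline:", staleWorkers(last, now))
	fmt.Println("re-queue 45m task:", taskIsStuck(45*time.Minute))
	fmt.Println("purge 3-day-old task:", taskIsPurgeable(72*time.Hour))
}
```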

Graceful Shutdown

Worker uses sync.WaitGroup to track in-flight tasks:

  1. Receive SIGTERM/SIGINT
  2. Cancel context (stops polling)
  3. Wait on the WaitGroup, giving up after WORKER_TASK_TIMEOUT
  4. Log success or timeout warning
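
sync.WaitGroup has no built-in timeout, so step 3 is usually implemented by waiting in a goroutine and racing it against a timer. A minimal sketch of that pattern follows; the helper names are hypothetical, and the demo uses millisecond durations in place of WORKER_TASK_TIMEOUT. In the worker this would run after the signal-driven context cancellation in step 2:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// waitTimeout waits for wg, giving up after d.
// It reports whether all in-flight tasks finished in time.
func waitTimeout(wg *sync.WaitGroup, d time.Duration) bool {
	done := make(chan struct{})
	go func() {
		wg.Wait()
		close(done)
	}()

	timer := time.NewTimer(d)
	defer timer.Stop()

	select {
	case <-done:
		return true
	case <-timer.C:
		return false
	}
}

// drain simulates one in-flight task lasting taskMillis and a shutdown
// budget of timeoutMillis.
func drain(taskMillis, timeoutMillis int) bool {
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		time.Sleep(time.Duration(taskMillis) * time.Millisecond)
	}()
	return waitTimeout(&wg, time.Duration(timeoutMillis)*time.Millisecond)
}

func main() {
	fmt.Println("short task finished in time:", drain(10, 1000))
	fmt.Println("long task finished in time:", drain(500, 20))
}
```

On timeout the waiting goroutine stays blocked until the task ends, which is acceptable here because the process is about to exit.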