jordan bc010c4746 feat: add RWX storage class and full SDLC lifecycle cookbook

- Add longhorn-rwx StorageClass for RWX volume support
- Add slackpath-5-full-lifecycle.yaml cookbook tree (all 10 SDLC phases)
- Update worker-pool.md documentation
- Consolidate PVC configuration, remove separate pvc-shared-claude.yaml
- Update rdev-worker and kustomization for new PVC structure

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2026-02-06 11:37:57 -07:00


Worker Pool

Last Updated: 2026-02-06
Confidence: High

Summary

Distributed task execution system where standalone worker pods poll rdev-api for tasks and execute them via a claudebox sidecar. Supports horizontal scaling by adding more worker pods.

Key Facts:

Architecture: Pull-based polling (not push/websocket)
Sidecar pattern: Worker + claudebox in same pod, communicate via localhost HTTP
Atomic dequeue: PostgreSQL FOR UPDATE SKIP LOCKED prevents duplicate claims
Task types: build (Claude Code prompts), sdlc (SDLC commands)
Scaling: Add replicas to handle more concurrent tasks
Resilience: Stale workers marked offline, stuck tasks re-queued automatically

File Pointers

Standalone Worker Binary

Entry: cmd/rdev-worker/main.go - Main binary, registration, heartbeat, poll loop
API Client: internal/worker/api_client.go - HTTP client to rdev-api
Build Executor: internal/worker/http_build_executor.go - Execute builds via claudebox
SDLC Executor: internal/worker/http_sdlc_executor.go - Execute SDLC tasks via claudebox

Claudebox Sidecar Client

Client: internal/adapter/claudebox/client.go - HTTP client to claudebox sidecar
Endpoints: /health, /execute, /git/clone, /git/commit-and-push, /sdlc

rdev-api Server-Side

Handlers: internal/handlers/workers.go - /workers/* endpoints
Service: internal/service/worker_service.go - Claim, complete, fail logic
Registry: internal/adapter/postgres/worker_registry.go - Worker state persistence
Queue: internal/adapter/postgres/work_queue.go - Task queue with atomic dequeue

Domain

Worker: internal/domain/worker.go - Worker, WorkerStatus
Task: internal/domain/work.go - WorkTask, WorkTaskType, WorkTaskStatus
Build: internal/domain/build.go - BuildSpec, BuildResult

Kubernetes

Deployment: deployments/k8s/base/rdev-worker.yaml - Worker + claudebox pod spec

Architecture

┌─────────────────────┐         HTTP Polling (5s)        ┌──────────────────────────┐
│     rdev-api        │◄────────────────────────────────►│    Worker Pod            │
│                     │                                   │  ┌─────────┐ ┌─────────┐ │
│  POST /workers/register  ← Register at startup         │  │ worker  │→│claudebox│ │
│  POST /workers/{id}/heartbeat  ← Every 30s             │  └─────────┘ └─────────┘ │
│  POST /workers/{id}/claim  ← Poll for tasks            │      ↓ HTTP localhost    │
│  POST /workers/{id}/complete/{taskId}  ← Success       │  Claude Code execution   │
│  POST /workers/{id}/fail/{taskId}  ← Failure           └──────────────────────────┘
│                     │
│  PostgreSQL         │
│  ├─ workers         │  (worker registry)
│  ├─ work_queue      │  (task queue)
│  └─ build_audit     │  (execution history)
└─────────────────────┘

Worker Lifecycle

Register: Worker pod starts → POST /workers/register with ID, hostname, capabilities
Heartbeat: Every 30s → POST /workers/{id}/heartbeat to stay alive
Poll: Every 5s → POST /workers/{id}/claim to get next task
Execute: Call claudebox sidecar HTTP API to run Claude Code / SDLC commands
Report: POST /workers/{id}/complete/{taskId} or /fail/{taskId} with results
Shutdown: Graceful wait for in-flight tasks via sync.WaitGroup

Worker Statuses

Status	Meaning
`idle`	Ready to claim new tasks
`busy`	Currently executing a task
`draining`	Not accepting new tasks (pre-shutdown)
`offline`	No heartbeat received for more than 90s (three missed 30s heartbeats)

Task Types

Build Tasks (`WorkTaskTypeBuild`)

Execute Claude Code prompts with optional git operations.

Spec:

{
  "prompt": "Build a React app with...",
  "auto_commit": true,
  "auto_push": false,
  "git_clone_url": "https://gitea.../repo.git"
}

Execution Flow:

Clone repo via claudebox /git/clone
Execute prompt via claudebox /execute (streaming)
Commit/push via claudebox /git/commit-and-push

SDLC Tasks (`WorkTaskTypeSDLC`)

Execute SDLC CLI commands.

Spec:

{
  "command": "feature",
  "args": ["init", "feature-name"],
  "git_clone_url": "https://gitea.../repo.git"
}

Execution Flow:

Clone repo via claudebox /git/clone
Run SDLC command via claudebox /sdlc
Commit/push changes

API Endpoints

Method	Path	Description
POST	`/workers/register`	Register new worker
POST	`/workers/{id}/heartbeat`	Keep worker alive
POST	`/workers/{id}/claim`	Claim next available task (returns 204 No Content if none is available)
POST	`/workers/{id}/complete/{taskId}`	Report successful completion
POST	`/workers/{id}/fail/{taskId}`	Report failure
GET	`/workers`	List all workers
GET	`/workers/{id}`	Get worker details
POST	`/workers/{id}/drain`	Set worker to draining

Kubernetes Deployment

# deployments/k8s/base/rdev-worker.yaml (abridged; volume names are illustrative)
spec:
  replicas: 1  # Scale by increasing
  strategy:
    type: RollingUpdate  # RWX PVC enables multi-pod mounts
    rollingUpdate:
      maxSurge: 2
      maxUnavailable: 0
  template:
    spec:
      containers:
        - name: worker
          image: registry.threesix.ai/rdev/worker:latest
          env:
            - name: RDEV_API_URL
              value: "http://rdev-api.rdev.svc.cluster.local:8080"
            - name: CLAUDEBOX_URL
              value: "http://localhost:8080"
            - name: WORKER_POLL_INTERVAL
              value: "5s"
            - name: WORKER_HEARTBEAT_INTERVAL
              value: "30s"
            - name: WORKER_TASK_TIMEOUT
              value: "15m"
        - name: claudebox
          image: registry.threesix.ai/rdev/claudebox:latest
          volumeMounts:
            - name: workspace       # emptyDir
              mountPath: /workspace
            - name: claude-config   # RWX PVC, shared Claude auth
              mountPath: /root/.claude

Storage: The claudebox-claude-config PVC uses the ReadWriteMany (RWX) access mode, which Longhorn serves over NFS, so multiple worker pods can mount it at once and share Claude OAuth credentials.

Error Classification

Failed tasks are classified by matching the error message against known trigger phrases; the resulting code decides whether the task is retried:

Code	Trigger	Retryable
`RATE_LIMITED`	"rate limit", "quota exceeded"	Yes (with backoff)
`AUTH_FAILED`	"unauthorized", "invalid api key"	No
`TIMEOUT`	"context deadline exceeded"	Yes
`AGENT_ERROR`	Generic error	Yes (limited retries)
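The table above maps naturally onto a substring classifier. The following is a minimal sketch under two assumptions: that matching is case-insensitive, and that the trigger phrases are checked in the order the table lists them. The classify and retryable names are hypothetical, and backoff and retry limits are left out.

```go
package main

import (
	"fmt"
	"strings"
)

// ErrorCode is one of the classification codes from the table.
type ErrorCode string

const (
	RateLimited ErrorCode = "RATE_LIMITED"
	AuthFailed  ErrorCode = "AUTH_FAILED"
	Timeout     ErrorCode = "TIMEOUT"
	AgentError  ErrorCode = "AGENT_ERROR"
)

// classify maps an error message to a code by case-insensitive substring match.
// Anything that matches no trigger phrase falls back to AGENT_ERROR.
func classify(msg string) ErrorCode {
	m := strings.ToLower(msg)
	switch {
	case strings.Contains(m, "rate limit"), strings.Contains(m, "quota exceeded"):
		return RateLimited
	case strings.Contains(m, "unauthorized"), strings.Contains(m, "invalid api key"):
		return AuthFailed
	case strings.Contains(m, "context deadline exceeded"):
		return Timeout
	default:
		return AgentError
	}
}

// retryable reports whether a task that failed with this code may be retried.
func retryable(c ErrorCode) bool { return c != AuthFailed }

func main() {
	msgs := []string{
		"429: Rate limit reached",
		"invalid API key",
		"context deadline exceeded",
		"exit status 1",
	}
	for _, msg := range msgs {
		c := classify(msg)
		fmt.Printf("%q -> %s (retryable=%v)\n", msg, c, retryable(c))
	}
}
```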

Queue Maintenance

Background goroutine in rdev-api:

Stale worker marking: Workers without heartbeat >90s → offline
Stale task recovery: Tasks running >30m without completion → re-queued
Old task cleanup: Completed/failed tasks >7 days → deleted
Metrics refresh: Queue depth and worker counts → Prometheus

Graceful Shutdown

Worker uses sync.WaitGroup to track in-flight tasks:

Receive SIGTERM/SIGINT
Cancel context (stops polling)
Wait for WaitGroup with timeout (WORKER_TASK_TIMEOUT)
Log success or timeout warning

Related

Work Queue - Task queue implementation
Build Orchestration - Build API and specs
SDLC Orchestration - SDLC task integration
