Worker Pool
Last Updated: 2026-01-31
Confidence: High
Summary
Shared worker pool that executes build tasks for any project. It currently runs as an embedded WorkExecutor daemon inside rdev-api. Workers register with the worker registry, poll the work queue for tasks, and execute Claude Code in pods via kubectl exec. Post-build git operations (commit/push) are performed programmatically by PodGitOperations and are not LLM-driven.
Key Facts:
- LLM vs rdev boundary: Claude writes code; rdev handles git ops programmatically (no LLM for runbook tasks)
- Embedded WorkExecutor daemon runs inside rdev-api process
- Workers poll work queue every 5 seconds, heartbeat every 30 seconds
- Stale workers (no heartbeat for 2 minutes) automatically marked offline by QueueMaintenance
- Stale tasks (running >30 min without completion) automatically requeued
- Old tasks (>7 days) automatically cleaned up
- Queue depth and worker counts exported as Prometheus metrics
- Future: external worker binary for separate pod deployment
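
The LLM vs rdev boundary above can be illustrated with a minimal sketch of a programmatic commit/push that shells out to kubectl exec. This is not the actual PodGitOperations code: the podTarget type, its fields, the example namespace and pod names, and the exact git steps are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"os/exec"
)

// podTarget identifies where the git commands run. The type and its fields
// are hypothetical names for this sketch, not rdev's actual types.
type podTarget struct {
	Namespace string
	Pod       string
	RepoDir   string // path of the checkout inside the pod (assumed)
}

// kubectlGitArgs builds the argument list for
// kubectl exec -n <ns> <pod> -- git -C <dir> <gitArgs...>
func kubectlGitArgs(t podTarget, gitArgs ...string) []string {
	args := []string{"exec", "-n", t.Namespace, t.Pod, "--", "git", "-C", t.RepoDir}
	return append(args, gitArgs...)
}

// commitAndPush runs a fixed add, commit, push sequence inside the pod.
// The sequence is deterministic; no LLM is involved at this stage.
func commitAndPush(t podTarget, message string) error {
	steps := [][]string{
		{"add", "-A"},
		{"commit", "-m", message},
		{"push"},
	}
	for _, step := range steps {
		out, err := exec.Command("kubectl", kubectlGitArgs(t, step...)...).CombinedOutput()
		if err != nil {
			return fmt.Errorf("git %s failed: %w: %s", step[0], err, out)
		}
	}
	return nil
}

func main() {
	// Example values only; real namespaces and pod names will differ.
	t := podTarget{Namespace: "builds", Pod: "project-pod-0", RepoDir: "/workspace"}
	fmt.Println(kubectlGitArgs(t, "status", "--short"))
}
```

Keeping the argument construction in its own function makes the exact command line easy to unit test without a cluster.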
File Pointers:
- Domain: internal/domain/worker.go (Worker, WorkerStatus)
- Domain: internal/domain/build.go (BuildSpec, BuildResult)
- Port: internal/port/worker_registry.go (WorkerRegistry interface)
- Port: internal/port/build_audit.go (BuildAudit interface)
- Adapter: internal/adapter/postgres/worker_registry.go
- Adapter: internal/adapter/postgres/build_audit.go
- Service: internal/service/worker_service.go
- Service: internal/service/build_service.go
- Executor: internal/worker/work_executor.go (poll loop, heartbeat, task routing)
- Executor: internal/worker/build_executor.go (BuildSpec→AgentRequest)
- Git: internal/worker/pod_git_operations.go (post-build commit/push via kubectl exec)
- Maintenance: internal/worker/queue_maintenance.go (stale recovery, cleanup, metrics)
- Handler: internal/handlers/workers.go (REST API for workers)
- Handler: internal/handlers/builds.go (REST API for builds)
- Handler: internal/handlers/create_and_build.go (combined create+build)
- Migration: internal/db/migrations/012_worker_registry.sql
Worker Lifecycle (Embedded)
- rdev-api starts → WorkExecutor registers as worker in registry
- Heartbeat loop: every 30s sends heartbeat via WorkerService
- Poll loop: every 5s dequeues next task from work queue
- BuildExecutor: executes CodeAgent in pod, then programmatically commits/pushes if auto_commit
- Reports completion with BuildResult via WorkerService
- Graceful shutdown: deregisters worker on rdev-api stop
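
The lifecycle above can be sketched as a register step followed by two loops and a deferred deregister. This is a simplified illustration, not the real WorkExecutor: the hooks struct and its function fields are hypothetical stand-ins for WorkerService and the work queue, and running the heartbeat in its own goroutine is an assumption made so that a long build cannot delay heartbeats.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// hooks groups the operations the loop calls. All names are hypothetical.
type hooks struct {
	Register   func() error
	Heartbeat  func() error
	Dequeue    func() (task string, ok bool)
	Execute    func(task string)
	Deregister func()
}

// run registers the worker, heartbeats and polls until ctx is cancelled,
// then waits for the heartbeat goroutine and deregisters.
func run(ctx context.Context, h hooks, pollEvery, heartbeatEvery time.Duration) error {
	if err := h.Register(); err != nil {
		return fmt.Errorf("register worker: %w", err)
	}
	defer h.Deregister() // runs last: graceful shutdown

	stopped := make(chan struct{})
	go func() {
		defer close(stopped)
		beat := time.NewTicker(heartbeatEvery)
		defer beat.Stop()
		for {
			select {
			case <-ctx.Done():
				return
			case <-beat.C:
				_ = h.Heartbeat() // a real worker would log or retry on error
			}
		}
	}()
	defer func() { <-stopped }() // runs before Deregister

	poll := time.NewTicker(pollEvery)
	defer poll.Stop()
	for {
		select {
		case <-ctx.Done():
			return nil
		case <-poll.C:
			if task, ok := h.Dequeue(); ok {
				h.Execute(task)
			}
		}
	}
}

// demo wires the loop to in-memory fakes and stops after one task.
func demo() (events []string) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	pending := []string{"build-1"}
	h := hooks{
		Register:  func() error { events = append(events, "register"); return nil },
		Heartbeat: func() error { return nil },
		Dequeue: func() (string, bool) {
			if len(pending) == 0 {
				return "", false
			}
			t := pending[0]
			pending = pending[1:]
			return t, true
		},
		Execute:    func(t string) { events = append(events, "execute "+t); cancel() },
		Deregister: func() { events = append(events, "deregister") },
	}
	// Short intervals keep the demo fast; the documented values are 5s and 30s.
	_ = run(ctx, h, time.Millisecond, time.Millisecond)
	return events
}

func main() {
	fmt.Println(demo())
}
```

With the documented intervals the call would be `run(ctx, h, 5*time.Second, 30*time.Second)`.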
Worker Statuses
- idle - available for new tasks
- busy - currently executing a task
- draining - not accepting new tasks (pre-shutdown)
- offline - missed heartbeat threshold
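
One way to model these four statuses in Go is a string-backed type with a small validation helper. The status values come from the list above; the representation, the constant names, and the CanAcceptTask and ParseWorkerStatus helpers are assumptions for illustration and may differ from what internal/domain/worker.go actually defines.

```go
package main

import "fmt"

// WorkerStatus is modelled here as a string type (an assumption).
type WorkerStatus string

const (
	StatusIdle     WorkerStatus = "idle"
	StatusBusy     WorkerStatus = "busy"
	StatusDraining WorkerStatus = "draining"
	StatusOffline  WorkerStatus = "offline"
)

// CanAcceptTask reports whether a worker in this status may take a new task.
// Only idle workers qualify: busy workers are occupied, draining workers
// refuse new work, and offline workers missed their heartbeat threshold.
func (s WorkerStatus) CanAcceptTask() bool {
	return s == StatusIdle
}

// ParseWorkerStatus validates a raw string against the known statuses.
func ParseWorkerStatus(raw string) (WorkerStatus, error) {
	switch s := WorkerStatus(raw); s {
	case StatusIdle, StatusBusy, StatusDraining, StatusOffline:
		return s, nil
	default:
		return "", fmt.Errorf("unknown worker status %q", raw)
	}
}

func main() {
	for _, raw := range []string{"idle", "draining", "paused"} {
		s, err := ParseWorkerStatus(raw)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s accepts tasks: %t\n", s, s.CanAcceptTask())
	}
}
```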
API Endpoints
| Method | Path | Description |
|---|---|---|
| GET | /workers | List all workers with status summary |
| GET | /workers/{workerId} | Get worker details |
| POST | /workers/{workerId}/drain | Set worker to draining |
| POST | /projects/{id}/builds | Start build for project |
| GET | /projects/{id}/builds | List builds for project |
| GET | /builds/{taskId} | Get build status |
| POST | /project/create-and-build | Create project + start build |
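
As a hedged example of calling one of these endpoints, the sketch below fetches a build's status with GET /builds/{taskId}. The base URL, the absence of authentication, and the handling of the response as raw bytes are assumptions; the response schema is not described here, so the body is returned undecoded.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"net/url"
	"strings"
)

// buildStatusURL joins a base URL with the /builds/{taskId} path,
// escaping the task ID so it stays a single path segment.
func buildStatusURL(base, taskID string) string {
	return strings.TrimRight(base, "/") + "/builds/" + url.PathEscape(taskID)
}

// fetchBuildStatus issues the GET request and returns the raw response body.
func fetchBuildStatus(ctx context.Context, client *http.Client, base, taskID string) ([]byte, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, buildStatusURL(base, taskID), nil)
	if err != nil {
		return nil, err
	}
	resp, err := client.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("GET /builds/%s: unexpected status %d", taskID, resp.StatusCode)
	}
	return body, nil
}

func main() {
	// Hypothetical base URL; substitute the address where rdev-api listens.
	fmt.Println(buildStatusURL("http://localhost:8080/", "task-123"))
}
```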
Queue Maintenance
The QueueMaintenance worker runs inside rdev-api alongside the WorkExecutor:
- Stale task recovery (every 1m): Requeues tasks that have been running >30m without completion. Also sets the build_audit status to "pending" so the API correctly reflects the requeued state.
- Stale worker marking (every 1m): Marks workers offline after 2m without heartbeat
- Old task cleanup (every 1m): Removes completed/failed/cancelled tasks >7 days old
- Metrics refresh (every 15s): Updates Prometheus gauges for queue depth and worker counts
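
The schedule and thresholds above can be captured as plain constants and predicates. The sketch below is illustrative only: the identifier names are hypothetical, and treating each threshold as a strict "greater than" comparison is an assumption based on the ">30m" and ">7 days" wording.

```go
package main

import (
	"fmt"
	"time"
)

// Thresholds as stated in the maintenance list.
const (
	staleTaskAfter   = 30 * time.Minute   // running task with no completion
	staleWorkerAfter = 2 * time.Minute    // worker with no heartbeat
	oldTaskAfter     = 7 * 24 * time.Hour // finished task eligible for cleanup
)

// job pairs a maintenance action with how often it runs.
type job struct {
	Name     string
	Interval time.Duration
}

var jobs = []job{
	{"stale task recovery", time.Minute},
	{"stale worker marking", time.Minute},
	{"old task cleanup", time.Minute},
	{"metrics refresh", 15 * time.Second},
}

// isStaleTask reports whether a running task should be requeued.
func isStaleTask(runningFor time.Duration) bool { return runningFor > staleTaskAfter }

// isStaleWorker reports whether a worker should be marked offline.
func isStaleWorker(sinceHeartbeat time.Duration) bool { return sinceHeartbeat > staleWorkerAfter }

// isOldTask reports whether a completed, failed, or cancelled task should be removed.
func isOldTask(sinceFinished time.Duration) bool { return sinceFinished > oldTaskAfter }

func main() {
	for _, j := range jobs {
		fmt.Printf("%-22s every %s\n", j.Name, j.Interval)
	}
	fmt.Println(isStaleTask(45*time.Minute), isStaleWorker(90*time.Second), isOldTask(8*24*time.Hour))
}
```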
Build Audit Sync: When stale tasks are requeued, both work_queue and build_audit tables are updated atomically. This prevents builds from appearing stuck in "running" when the underlying task has been requeued for retry due to worker death.
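
A minimal sketch of such an atomic update, assuming both tables live in the same Postgres database and are updated inside one transaction. The table names and the "pending" status come from this page; the column names (id, task_id, status, worker_id), the SQL text, and the txn interface are assumptions for illustration. The interface matches methods that *sql.Tx from database/sql provides, so a transaction obtained from db.BeginTx could be passed in directly.

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
)

// txn is the subset of *sql.Tx this sketch needs.
type txn interface {
	ExecContext(ctx context.Context, query string, args ...any) (sql.Result, error)
	Commit() error
	Rollback() error
}

// Column names below are assumed, not taken from the real schema.
const (
	requeueTaskSQL = `UPDATE work_queue SET status = 'pending', worker_id = NULL
	                  WHERE id = $1 AND status = 'running'`
	syncAuditSQL = `UPDATE build_audit SET status = 'pending'
	                WHERE task_id = $1 AND status = 'running'`
)

// requeueStaleTask applies both updates and commits only if both succeed,
// so the two tables never disagree about a requeued task.
func requeueStaleTask(ctx context.Context, tx txn, taskID string) error {
	for _, q := range []string{requeueTaskSQL, syncAuditSQL} {
		if _, err := tx.ExecContext(ctx, q, taskID); err != nil {
			_ = tx.Rollback()
			return fmt.Errorf("requeue %s: %w", taskID, err)
		}
	}
	return tx.Commit()
}

// fakeTx records calls so the sketch runs without a database.
type fakeTx struct {
	failOn     int // 1-based statement index that fails; 0 means none
	execs      int
	committed  bool
	rolledBack bool
}

func (f *fakeTx) ExecContext(_ context.Context, _ string, _ ...any) (sql.Result, error) {
	f.execs++
	if f.execs == f.failOn {
		return nil, fmt.Errorf("simulated failure on statement %d", f.execs)
	}
	return nil, nil
}
func (f *fakeTx) Commit() error   { f.committed = true; return nil }
func (f *fakeTx) Rollback() error { f.rolledBack = true; return nil }

// demo runs the requeue against the fake and reports how it ended.
func demo(failOn int) (committed, rolledBack bool) {
	tx := &fakeTx{failOn: failOn}
	_ = requeueStaleTask(context.Background(), tx, "task-123")
	return tx.committed, tx.rolledBack
}

func main() {
	fmt.Println(demo(0)) // both statements succeed
	fmt.Println(demo(2)) // build_audit update fails
}
```

If the second statement fails, the rollback also undoes the work_queue change, which is what keeps a build from appearing as "running" in one table and "pending" in the other.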