# Worker Pool

**Last Updated:** 2026-01-31
**Confidence:** High

## Summary

A shared worker pool executes build tasks for any project. It currently runs as an embedded WorkExecutor daemon inside rdev-api. Workers register with the worker registry, poll the work queue for tasks, and execute Claude Code in pods via `kubectl exec`. Post-build git operations (commit/push) are performed programmatically by PodGitOperations and are not LLM-driven.

**Key Facts:**
- **LLM vs rdev boundary:** Claude writes code; rdev handles git ops programmatically (no LLM for runbook tasks)
- The embedded WorkExecutor daemon runs inside the rdev-api process
- Workers poll the work queue every 5 seconds and send a heartbeat every 30 seconds
- Stale workers (no heartbeat for 2 minutes) are automatically marked offline by QueueMaintenance
- Stale tasks (running >30 min without completion) are automatically requeued
- Old tasks (>7 days) are automatically cleaned up
- Queue depth and worker counts are exported as Prometheus metrics
- Future: external worker binary for separate pod deployment

**File Pointers:**
- Domain: `internal/domain/worker.go` (Worker, WorkerStatus)
- Domain: `internal/domain/build.go` (BuildSpec, BuildResult)
- Port: `internal/port/worker_registry.go` (WorkerRegistry interface)
- Port: `internal/port/build_audit.go` (BuildAudit interface)
- Adapter: `internal/adapter/postgres/worker_registry.go`
- Adapter: `internal/adapter/postgres/build_audit.go`
- Service: `internal/service/worker_service.go`
- Service: `internal/service/build_service.go`
- Executor: `internal/worker/work_executor.go` (poll loop, heartbeat, task routing)
- Executor: `internal/worker/build_executor.go` (BuildSpec→AgentRequest)
- Git: `internal/worker/pod_git_operations.go` (post-build commit/push via kubectl exec)
- Maintenance: `internal/worker/queue_maintenance.go` (stale recovery, cleanup, metrics)
- Handler: `internal/handlers/workers.go` (REST API for workers)
- Handler: `internal/handlers/builds.go` (REST API for builds)
- Handler: `internal/handlers/create_and_build.go` (combined create+build)
- Migration: `internal/db/migrations/012_worker_registry.sql`

## Worker Lifecycle (Embedded)

1. rdev-api starts → WorkExecutor registers as a worker in the registry
2. Heartbeat loop: every 30s, sends a heartbeat via WorkerService
3. Poll loop: every 5s, dequeues the next task from the work queue
4. BuildExecutor: executes CodeAgent in the pod, then programmatically commits/pushes if `auto_commit` is set
5. Reports completion with a BuildResult via WorkerService
6. Graceful shutdown: deregisters the worker when rdev-api stops

## Worker Statuses

- `idle` - available for new tasks
- `busy` - currently executing a task
- `draining` - not accepting new tasks (pre-shutdown)
- `offline` - missed heartbeat threshold

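The four statuses map naturally onto a string-backed Go type. The sketch below reuses the `WorkerStatus` name mentioned under File Pointers, but the definition, the constant names, and both helpers are assumptions rather than the contents of `internal/domain/worker.go`. It also assumes a busy worker does not take a second task.

```go
package main

import "fmt"

// WorkerStatus is a sketch of the status type; the real definition may differ.
type WorkerStatus string

const (
	StatusIdle     WorkerStatus = "idle"
	StatusBusy     WorkerStatus = "busy"
	StatusDraining WorkerStatus = "draining"
	StatusOffline  WorkerStatus = "offline"
)

// AcceptsTasks reports whether a worker in this status can take new work.
// Only idle workers qualify under the assumption stated above.
func (s WorkerStatus) AcceptsTasks() bool {
	return s == StatusIdle
}

// ParseWorkerStatus validates a raw string against the known statuses.
func ParseWorkerStatus(v string) (WorkerStatus, error) {
	switch s := WorkerStatus(v); s {
	case StatusIdle, StatusBusy, StatusDraining, StatusOffline:
		return s, nil
	}
	return "", fmt.Errorf("unknown worker status %q", v)
}

func main() {
	for _, raw := range []string{"idle", "draining", "paused"} {
		s, err := ParseWorkerStatus(raw)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s accepts tasks: %v\n", s, s.AcceptsTasks())
	}
}
```
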
## API Endpoints

| Method | Path | Description |
|--------|------|-------------|
| GET | `/workers` | List all workers with status summary |
| GET | `/workers/{workerId}` | Get worker details |
| POST | `/workers/{workerId}/drain` | Set worker to draining |
| POST | `/projects/{id}/builds` | Start build for project |
| GET | `/projects/{id}/builds` | List builds for project |
| GET | `/builds/{taskId}` | Get build status |
| POST | `/project/create-and-build` | Create project + start build |

## Queue Maintenance

The QueueMaintenance worker runs inside rdev-api alongside the WorkExecutor:
- **Stale task recovery** (every 1m): Requeues tasks that have been running >30m without completion. It also sets the `build_audit` status to "pending" so the API reflects the requeued state.
- **Stale worker marking** (every 1m): Marks workers offline after 2m without a heartbeat
- **Old task cleanup** (every 1m): Removes completed/failed/cancelled tasks >7 days old
- **Metrics refresh** (every 15s): Updates Prometheus gauges for queue depth and worker counts

**Build Audit Sync:** When stale tasks are requeued, the `work_queue` and `build_audit` tables are updated atomically. This prevents a build from appearing stuck in "running" in the API after its underlying task has been requeued for retry because the worker died.

## Related Topics

- [Work Queue](./work-queue.md)
- [Build Orchestration](../features/build-orchestration.md)