rdev/ai-lookup/services/worker-pool.md
jordan 910bcb62e1 fix: Sync build audit with work queue when stale tasks are requeued
When a worker dies mid-build, queue maintenance now updates both
work_queue and build_audit tables when requeuing stale tasks.
This prevents builds from showing "running" forever in the API.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-31 02:07:52 -07:00

# Worker Pool
**Last Updated:** 2026-01-31
**Confidence:** High
## Summary
A shared worker pool that executes build tasks for any project. It currently runs as an embedded WorkExecutor daemon inside rdev-api. Workers register with the worker registry, poll the work queue for tasks, and execute Claude Code in pods via kubectl exec. Post-build git operations (commit/push) are performed programmatically by PodGitOperations, not by the LLM.
**Key Facts:**
- **LLM vs rdev boundary:** Claude writes code; rdev handles git ops programmatically (no LLM for runbook tasks)
- Embedded WorkExecutor daemon runs inside rdev-api process
- Workers poll work queue every 5 seconds, heartbeat every 30 seconds
- Stale workers (no heartbeat for 2 minutes) automatically marked offline by QueueMaintenance
- Stale tasks (running >30 min without completion) automatically requeued
- Old tasks (>7 days) automatically cleaned up
- Queue depth and worker counts exported as Prometheus metrics
- Future: external worker binary for separate pod deployment
**File Pointers:**
- Domain: `internal/domain/worker.go` (Worker, WorkerStatus)
- Domain: `internal/domain/build.go` (BuildSpec, BuildResult)
- Port: `internal/port/worker_registry.go` (WorkerRegistry interface)
- Port: `internal/port/build_audit.go` (BuildAudit interface)
- Adapter: `internal/adapter/postgres/worker_registry.go`
- Adapter: `internal/adapter/postgres/build_audit.go`
- Service: `internal/service/worker_service.go`
- Service: `internal/service/build_service.go`
- Executor: `internal/worker/work_executor.go` (poll loop, heartbeat, task routing)
- Executor: `internal/worker/build_executor.go` (BuildSpec→AgentRequest)
- Git: `internal/worker/pod_git_operations.go` (post-build commit/push via kubectl exec)
- Maintenance: `internal/worker/queue_maintenance.go` (stale recovery, cleanup, metrics)
- Handler: `internal/handlers/workers.go` (REST API for workers)
- Handler: `internal/handlers/builds.go` (REST API for builds)
- Handler: `internal/handlers/create_and_build.go` (combined create+build)
- Migration: `internal/db/migrations/012_worker_registry.sql`
## Worker Lifecycle (Embedded)
1. rdev-api starts → WorkExecutor registers as worker in registry
2. Heartbeat loop: every 30s sends heartbeat via WorkerService
3. Poll loop: every 5s dequeues next task from work queue
4. BuildExecutor: executes CodeAgent in the pod, then programmatically commits/pushes if auto_commit is enabled
5. Reports completion with BuildResult via WorkerService
6. Graceful shutdown: deregisters worker on rdev-api stop
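
The lifecycle above can be sketched as two ticker-driven loops that share a context. This is a minimal illustration, not the code in `internal/worker/work_executor.go`: the `hooks` struct, `run`, and `runDemo` are hypothetical names, and running the heartbeat in its own goroutine (so a long build cannot starve it) is an assumption about the design.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// hooks stands in for the real collaborators (WorkerService, work queue,
// BuildExecutor). All four field names are hypothetical.
type hooks struct {
	register   func() // step 1: register in the worker registry
	heartbeat  func() // step 2: send one heartbeat
	pollOnce   func() // steps 3-5: dequeue, execute, report
	deregister func() // step 6: deregister on shutdown
}

// run registers the worker, then heartbeats and polls until ctx is cancelled.
// The heartbeat loop has its own goroutine so a slow pollOnce cannot delay it.
func run(ctx context.Context, heartbeatEvery, pollEvery time.Duration, h hooks) {
	h.register()
	defer h.deregister()

	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		t := time.NewTicker(heartbeatEvery)
		defer t.Stop()
		for {
			select {
			case <-ctx.Done():
				return
			case <-t.C:
				h.heartbeat()
			}
		}
	}()

	t := time.NewTicker(pollEvery)
	defer t.Stop()
	for {
		select {
		case <-ctx.Done():
			wg.Wait() // stop heartbeating before the deferred deregister runs
			return
		case <-t.C:
			h.pollOnce()
		}
	}
}

// runDemo runs the loops for totalMs milliseconds and returns call counts.
func runDemo(totalMs, heartbeatMs, pollMs int) (registered, deregistered, beats, polls int) {
	ctx, cancel := context.WithTimeout(context.Background(), time.Duration(totalMs)*time.Millisecond)
	defer cancel()
	run(ctx,
		time.Duration(heartbeatMs)*time.Millisecond,
		time.Duration(pollMs)*time.Millisecond,
		hooks{
			register:   func() { registered++ },
			heartbeat:  func() { beats++ },
			pollOnce:   func() { polls++ },
			deregister: func() { deregistered++ },
		})
	return
}

func main() {
	// The documented intervals are 30s (heartbeat) and 5s (poll); they are
	// scaled down to milliseconds here so the demo finishes quickly.
	r, d, b, p := runDemo(200, 30, 5)
	fmt.Printf("registered=%d deregistered=%d heartbeats=%d polls=%d\n", r, d, b, p)
}
```

Cancelling the context plays the role of rdev-api stopping: both loops exit, and the deferred `deregister` runs exactly once.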
## Worker Statuses
- `idle` - available for new tasks
- `busy` - currently executing a task
- `draining` - not accepting new tasks (pre-shutdown)
- `offline` - missed heartbeat threshold
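
As an illustration, the four statuses could be modelled as a string-backed type. The type, constant, and method names below are assumptions; the actual definitions are in `internal/domain/worker.go` and may differ. Treating `busy` as not accepting work assumes one task per worker at a time.

```go
package main

import "fmt"

// WorkerStatus mirrors the four statuses listed above.
// Names are hypothetical, chosen for this sketch only.
type WorkerStatus string

const (
	StatusIdle     WorkerStatus = "idle"
	StatusBusy     WorkerStatus = "busy"
	StatusDraining WorkerStatus = "draining"
	StatusOffline  WorkerStatus = "offline"
)

// AcceptsTasks reports whether a worker in this status may be handed a new
// task. Only idle workers qualify under the one-task-at-a-time assumption.
func (s WorkerStatus) AcceptsTasks() bool {
	return s == StatusIdle
}

func main() {
	for _, s := range []WorkerStatus{StatusIdle, StatusBusy, StatusDraining, StatusOffline} {
		fmt.Printf("%-8s accepts new tasks: %v\n", s, s.AcceptsTasks())
	}
}
```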
## API Endpoints
| Method | Path | Description |
|--------|------|-------------|
| GET | `/workers` | List all workers with status summary |
| GET | `/workers/{workerId}` | Get worker details |
| POST | `/workers/{workerId}/drain` | Set worker to draining |
| POST | `/projects/{id}/builds` | Start build for project |
| GET | `/projects/{id}/builds` | List builds for project |
| GET | `/builds/{taskId}` | Get build status |
| POST | `/project/create-and-build` | Create project + start build |
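
A client call against `GET /builds/{taskId}` might look like the sketch below. The response schema is not documented here, so the `status` JSON field, the `buildStatus` type, and both helper functions are assumptions; a local `httptest` server stands in for rdev-api, and authentication is not covered.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/url"
)

// buildStatus models only a hypothetical "status" field; the real response
// body of GET /builds/{taskId} may look different.
type buildStatus struct {
	Status string `json:"status"`
}

// getBuildStatus fetches the build status for one task ID.
func getBuildStatus(c *http.Client, baseURL, taskID string) (buildStatus, error) {
	var out buildStatus
	resp, err := c.Get(baseURL + "/builds/" + url.PathEscape(taskID))
	if err != nil {
		return out, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return out, fmt.Errorf("unexpected status %s", resp.Status)
	}
	err = json.NewDecoder(resp.Body).Decode(&out)
	return out, err
}

// newFakeAPI starts a local stand-in for rdev-api that answers
// GET /builds/{taskId} with a fixed, made-up payload.
func newFakeAPI() *httptest.Server {
	mux := http.NewServeMux()
	mux.HandleFunc("/builds/", func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodGet {
			http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(buildStatus{Status: "pending"})
	})
	return httptest.NewServer(mux)
}

func main() {
	srv := newFakeAPI()
	defer srv.Close()

	st, err := getBuildStatus(srv.Client(), srv.URL, "task-123")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("build status:", st.Status)
}
```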
## Queue Maintenance
The QueueMaintenance worker runs inside rdev-api alongside the WorkExecutor:
- **Stale task recovery** (every 1m): Requeues tasks running >30m without completion. Also sets the build_audit status to "pending" so the API correctly reflects the requeued state.
- **Stale worker marking** (every 1m): Marks workers offline after 2m without heartbeat
- **Old task cleanup** (every 1m): Removes completed/failed/cancelled tasks >7 days old
- **Metrics refresh** (every 15s): Updates Prometheus gauges for queue depth and worker counts
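
The three time thresholds above reduce to simple age comparisons. The sketch below is illustrative only: the constant and function names are hypothetical, and measuring the 7-day cleanup window from a task's finish time is an assumption.

```go
package main

import (
	"fmt"
	"time"
)

// Thresholds as stated in this document.
const (
	staleTaskAfter   = 30 * time.Minute   // running longer than this: requeue
	staleWorkerAfter = 2 * time.Minute    // no heartbeat for this long: offline
	oldTaskAfter     = 7 * 24 * time.Hour // finished longer ago than this: delete
)

// demoNow is a fixed reference time so the example is deterministic.
var demoNow = time.Date(2026, 1, 31, 12, 0, 0, 0, time.UTC)

// isStaleTask reports whether a running task has exceeded the 30-minute limit.
func isStaleTask(startedAt, now time.Time) bool {
	return now.Sub(startedAt) > staleTaskAfter
}

// isStaleWorker reports whether a worker has missed the heartbeat threshold.
func isStaleWorker(lastHeartbeat, now time.Time) bool {
	return now.Sub(lastHeartbeat) > staleWorkerAfter
}

// isOldTask reports whether a finished task is past the 7-day retention window.
func isOldTask(finishedAt, now time.Time) bool {
	return now.Sub(finishedAt) > oldTaskAfter
}

func main() {
	fmt.Println("task running 31m is stale:", isStaleTask(demoNow.Add(-31*time.Minute), demoNow))
	fmt.Println("worker silent 90s is stale:", isStaleWorker(demoNow.Add(-90*time.Second), demoNow))
	fmt.Println("task finished 8d ago is old:", isOldTask(demoNow.Add(-8*24*time.Hour), demoNow))
}
```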
**Build Audit Sync:** When stale tasks are requeued, both `work_queue` and `build_audit` tables are updated atomically. This prevents builds from appearing stuck in "running" when the underlying task has been requeued for retry due to worker death.
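
The invariant can be modelled in memory: whenever a stale task's queue row is reset, its audit row changes in the same critical section. This is a sketch, not the Postgres adapter, which would presumably use a database transaction; the `store` type, its methods, and the `"pending"` value for the requeued `work_queue` row are assumptions (only the `build_audit` value is stated in this document).

```go
package main

import (
	"fmt"
	"sort"
	"sync"
	"time"
)

const staleAfter = 30 * time.Minute

// demoNow is a fixed reference time so the example is deterministic.
var demoNow = time.Date(2026, 1, 31, 12, 0, 0, 0, time.UTC)

type queuedTask struct {
	status    string
	startedAt time.Time
}

// store is an in-memory stand-in for the work_queue and build_audit tables.
type store struct {
	mu         sync.Mutex
	workQueue  map[string]*queuedTask // task ID -> queue row
	buildAudit map[string]string      // task ID -> status reported by the API
}

func newStore() *store {
	return &store{workQueue: map[string]*queuedTask{}, buildAudit: map[string]string{}}
}

// start records a task as running in both tables.
func (s *store) start(id string, startedAt time.Time) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.workQueue[id] = &queuedTask{status: "running", startedAt: startedAt}
	s.buildAudit[id] = "running"
}

// requeueStale resets every stale running task and its audit row together,
// under one lock, and returns the affected task IDs in sorted order.
func (s *store) requeueStale(now time.Time) []string {
	s.mu.Lock()
	defer s.mu.Unlock()
	var requeued []string
	for id, t := range s.workQueue {
		if t.status == "running" && now.Sub(t.startedAt) > staleAfter {
			t.status = "pending"
			s.buildAudit[id] = "pending"
			requeued = append(requeued, id)
		}
	}
	sort.Strings(requeued)
	return requeued
}

// statuses returns the queue and audit status for one task.
func (s *store) statuses(id string) (queue, audit string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.workQueue[id].status, s.buildAudit[id]
}

func main() {
	s := newStore()
	s.start("stale-task", demoNow.Add(-45*time.Minute))
	s.start("fresh-task", demoNow.Add(-5*time.Minute))

	fmt.Println("requeued:", s.requeueStale(demoNow))
	for _, id := range []string{"stale-task", "fresh-task"} {
		q, a := s.statuses(id)
		fmt.Printf("%s: work_queue=%s build_audit=%s\n", id, q, a)
	}
}
```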
## Related Topics
- [Work Queue](./work-queue.md)
- [Build Orchestration](../features/build-orchestration.md)