# Worker Executor Implementation Plan
> Close the last gap in the landing page cookbook: automated code generation via the worker pool.
## Context
The work queue, worker registry, build audit, and code agent systems are **all implemented**. The critical missing piece is a **work executor**: a background loop that consumes queued tasks and executes them via a code agent. It is analogous to the existing `QueueProcessor`, which processes per-project command queue tasks, but it serves the generic `WorkQueue` (cross-project worker pool tasks).
### What Already Exists
| Component | File | Status |
|-----------|------|--------|
| Work queue (PostgreSQL) | `internal/adapter/postgres/work_queue.go` | Done |
| Worker registry (PostgreSQL) | `internal/adapter/postgres/worker_registry.go` | Done |
| Build audit (PostgreSQL) | `internal/adapter/postgres/build_audit.go` | Done |
| WorkService (enqueue/dequeue/complete/fail) | `internal/service/work_service.go` | Done |
| WorkerService (claim/complete/health) | `internal/service/worker_service.go` | Done |
| BuildService (start/status/complete) | `internal/service/build_service.go` | Done |
| WorkHandler (REST API) | `internal/handlers/work.go` | Done |
| AgentsHandler (REST API) | `internal/handlers/agents.go` | Done |
| CodeAgent interface | `internal/port/code_agent.go` | Done |
| Domain models (WorkTask, Worker, BuildSpec) | `internal/domain/` | Done |
| Command QueueProcessor (reference pattern) | `internal/worker/queue_processor.go` | Done |
### What's Missing
| Gap | Priority |
|-----|----------|
| Work executor daemon (poll loop) | Critical |
| BuildSpec → AgentRequest translation | Critical |
| Git clone/commit/push in executor | Critical |
| Git credential resolution for cross-project | High |
| Worker management REST endpoints | Medium |
| DNS alias endpoint | Medium |
| Create-and-build endpoint | Medium |
| Woodpecker build status proxy | Low |
---
## Week 1: Work Executor Core
**Goal:** A background loop that claims tasks from the work queue and executes them via a code agent. By end of week, `POST /work/enqueue` → task claimed → agent executes → result recorded.
### Tasks
1. **Create `internal/worker/work_executor.go`**
- Follow the `QueueProcessor` pattern from `queue_processor.go`
- Poll loop: calls `WorkerService.ClaimTask(workerID)` on a ticker
- On task claim: route to appropriate handler based on `task.Type`
- On completion: call `WorkerService.CompleteTask(workerID, taskID, result)`
- On failure: call `WorkService.FailTask(taskID, errMsg)` (`FailTask` applies the retry logic)
- Graceful shutdown via context cancellation
- Self-registers as a worker via `WorkerService.Register()` on start
- Sends heartbeats via `WorkerService.Heartbeat()` on a 30s ticker
2. **Create `internal/worker/build_executor.go`**
- Handles `WorkTaskTypeBuild` tasks specifically
- Extracts `BuildSpec` fields from `WorkTask.Spec` (`map[string]any` → typed fields)
- Translates `BuildSpec.Prompt` into `domain.AgentRequest`
- Calls `CodeAgent.Execute()` with event streaming
- Collects output, files changed, duration into `domain.BuildResult`
- Returns `BuildResult` to the work executor
3. **Wire into `cmd/rdev-api/main.go`**
- Create `WorkExecutor` alongside existing `QueueProcessor`
- Inject: `WorkerService`, `WorkService` (needed for `FailTask`), `BuildService`, `CodeAgentRegistry`
- Start on boot, stop on shutdown
- Worker ID: hostname or pod name (from `HOSTNAME` env var)
4. **Create `internal/worker/work_executor_test.go`**
- Test: executor starts and registers as a worker
- Test: executor claims a task and routes to build handler
- Test: build handler translates spec and calls code agent
- Test: results are recorded via `CompleteTask`
- Test: failures trigger `FailTask` with retry
- Test: graceful shutdown stops the poll loop
- Use mock implementations of ports
### Deliverables
- `POST /work/enqueue` with a build task → executor picks it up → agent runs → result in `GET /work/{taskId}`
- Worker visible in registry during execution
- Build audit entry created with spec and result
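
Task 2 above converts the untyped `WorkTask.Spec` map into typed fields. One possible approach, sketched here, is a JSON round trip. The `BuildSpec` struct below is hypothetical: its field names are taken from keys mentioned in this plan (`prompt`, `git_url`, `auto_commit`, `auto_push`), and the real `domain.BuildSpec` may be shaped differently.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// BuildSpec is a hypothetical typed view of WorkTask.Spec.
type BuildSpec struct {
	Prompt     string `json:"prompt"`
	GitURL     string `json:"git_url"`
	AutoCommit bool   `json:"auto_commit"`
	AutoPush   bool   `json:"auto_push"`
}

// DecodeBuildSpec converts the untyped spec map into a BuildSpec.
// Re-encoding through JSON rejects values of the wrong type
// (for example a numeric prompt) without hand-written type assertions.
func DecodeBuildSpec(raw map[string]any) (BuildSpec, error) {
	var spec BuildSpec
	data, err := json.Marshal(raw)
	if err != nil {
		return spec, fmt.Errorf("encode spec: %w", err)
	}
	if err := json.Unmarshal(data, &spec); err != nil {
		return spec, fmt.Errorf("decode spec: %w", err)
	}
	if spec.Prompt == "" {
		return spec, errors.New("build spec: prompt is required")
	}
	return spec, nil
}

func main() {
	spec, err := DecodeBuildSpec(map[string]any{
		"prompt":      "Add a pricing section",
		"auto_commit": true,
	})
	fmt.Printf("%+v %v\n", spec, err)
}
```

Validating at this boundary means a malformed spec fails the task immediately instead of surfacing as a confusing agent error later.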
### Files Created/Modified
| File | Action |
|------|--------|
| `internal/worker/work_executor.go` | Create |
| `internal/worker/build_executor.go` | Create |
| `internal/worker/work_executor_test.go` | Create |
| `cmd/rdev-api/main.go` | Modify (wire executor) |
---
## Week 2: Git Operations & Cross-Project Execution
**Goal:** The executor can clone any project's repo, run the agent in that directory, and push results back. By end of week, the full build cycle works: enqueue → clone → agent generates code → commit → push → CI triggers.
### Tasks
1. **Create `internal/worker/pod_git_operations.go`** ✅ IMPLEMENTED
- `CommitAndPush(ctx, podName, workDir, message, push) *PostBuildResult`
- Runs git commands **inside the pod** via `kubectl exec` (not locally)
- Post-build phase: Claude writes code, then rdev programmatically commits/pushes
- Follows "LLM vs rdev" principle: LLMs generate code, rdev handles deterministic ops
2. **Add git credential resolution to `BuildExecutor`**
- Option A (simplest): Use the Gitea token already in `InfraConfig.GiteaToken`
- All project repos are in Gitea, so one token covers all repos
- Pass token via HTTPS clone URL: `https://token@git.threesix.ai/org/repo.git`
- Option B (per-project): Look up project's git URL from database, resolve credentials
- **Recommendation:** Option A — the Gitea token is already loaded and available
3. **Integrate git ops into `BuildExecutor`**
- Before agent execution: clone the project's repo to a temp directory
- Look up project git URL from database (add `ProjectStore` port or query directly)
- After agent execution: if `auto_commit` is true, commit changes
- After commit: if `auto_push` is true, push to remote
- Capture `commit_sha` and `files_changed` in `BuildResult`
4. **Add project git URL lookup**
- The `ProjectInfraService` stores git URLs in the database during `CreateProject`
- Add a method to retrieve git info by project ID
- Or: include `git_url` in the `WorkTask.Spec` at enqueue time (simpler, no extra lookup)
5. **Test pod git operations**
- Integration test via cookbook scripts
- Verify commit is created in pod workspace
- Verify push succeeds via kubectl exec
6. **Integration test**
- Enqueue a build task with a real prompt
- Verify agent executes in cloned repo
- Verify commit is created (if `auto_commit`)
- Verify push succeeds (if `auto_push`)
- Verify `BuildResult` has correct fields
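
Option A in task 2 embeds the Gitea token in the HTTPS clone URL. The sketch below shows one way to build that URL with `net/url`, plus a helper that strips the credential before the URL is logged. Both function names are hypothetical, and the token value in `main` is a placeholder.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
)

// CloneURLWithToken returns repoURL with the token set as the userinfo part,
// matching the https://token@host/org/repo.git form used in the plan.
func CloneURLWithToken(repoURL, token string) (string, error) {
	u, err := url.Parse(repoURL)
	if err != nil {
		return "", fmt.Errorf("parse repo URL: %w", err)
	}
	if u.Scheme != "https" {
		return "", errors.New("token auth requires an https clone URL")
	}
	if token == "" {
		return "", errors.New("empty token")
	}
	u.User = url.User(token)
	return u.String(), nil
}

// RedactCloneURL removes any userinfo so the URL is safe to log.
func RedactCloneURL(cloneURL string) string {
	u, err := url.Parse(cloneURL)
	if err != nil {
		return "<unparseable URL>"
	}
	u.User = nil
	return u.String()
}

func main() {
	authed, err := CloneURLWithToken("https://git.threesix.ai/org/repo.git", "example-token")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("log-safe URL:", RedactCloneURL(authed))
}
```

Because the token ends up inside the URL, anything that records the clone command (logs, build audit entries, error messages) should use the redacted form.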
### Deliverables
- Full build cycle: enqueue → clone → execute → commit → push
- Git credentials resolved from infrastructure config
- Temp workspace created and cleaned per task
- Build audit shows commit SHA and files changed
### Files Created/Modified
| File | Action |
|------|--------|
| `internal/worker/pod_git_operations.go` | Create ✅ |
| `internal/worker/build_executor.go` | Modify (add git integration) |
| `internal/worker/work_executor.go` | Modify (pass git config) |
| `cmd/rdev-api/main.go` | Modify (pass gitea token to executor) |
---
## Week 3: API Enhancements
**Goal:** Add the REST endpoints that complete the platform experience. By end of week, users can create a project, enqueue a build, monitor CI status, and manage DNS — all through rdev-api.
### Tasks
1. **Worker management endpoints — `internal/handlers/workers.go`**
- `GET /workers` — list all workers with status
- `GET /workers/{id}` — get worker details
- `POST /workers/{id}/drain` — drain a worker
- Wire `WorkerService` into handler
- Register in `cmd/rdev-api/main.go` and `openapi.go`
2. **Build management endpoints — `internal/handlers/builds.go`**
- `POST /projects/{id}/builds` — enqueue a build (wraps `BuildService.StartBuild()`)
- `GET /projects/{id}/builds` — list build history
- `GET /projects/{id}/builds/{taskId}` — get build status
- Simpler API than raw `/work/enqueue` — project-scoped, build-specific
- Register in `cmd/rdev-api/main.go` and `openapi.go`
3. **DNS alias endpoint — `internal/handlers/infrastructure.go`**
- `POST /projects/{id}/domains` — add DNS alias (A or CNAME record)
- `GET /projects/{id}/domains` — list domains for project
- `DELETE /projects/{id}/domains/{domain}` — remove alias
- Uses existing Cloudflare adapter's `CreateRecord()` and `DeleteRecordByName()`
- The adapter already supports full CRUD — just needs a handler
4. **Woodpecker build status proxy — `internal/handlers/ci.go`**
- `GET /projects/{id}/ci/pipelines` — list recent Woodpecker pipelines
- `GET /projects/{id}/ci/pipelines/{number}` — get pipeline details
- Add `ListPipelines()` and `GetPipeline()` to `port.CIProvider`
- Implement in `internal/adapter/woodpecker/client.go` using Woodpecker SDK
- Low priority — can defer if time is tight
5. **Create-and-build endpoint — `internal/handlers/project_management.go`**
- `POST /project/create-and-build`
- Request: `{ name, description, template, prompt, auto_push }`
- Calls `ProjectInfraService.CreateProject()` then `BuildService.StartBuild()`
- Returns project info + task ID
- Trivial once executor is working
6. **Tests for all new handlers**
- Follow existing patterns in `handlers/*_test.go`
- Test request validation, success paths, error handling
### Deliverables
- `POST /projects/{id}/builds` as the clean API for code generation
- `GET /workers` for monitoring the worker pool
- `POST /projects/{id}/domains` for DNS aliases
- `POST /project/create-and-build` for the single-call flow
- All endpoints documented in `openapi.go`
### Files Created/Modified
| File | Action |
|------|--------|
| `internal/handlers/workers.go` | Create |
| `internal/handlers/workers_test.go` | Create |
| `internal/handlers/builds.go` | Create |
| `internal/handlers/builds_test.go` | Create |
| `internal/handlers/infrastructure.go` | Modify (add domain endpoints) |
| `internal/handlers/ci.go` | Create (if time) |
| `internal/handlers/project_management.go` | Modify (add create-and-build) |
| `internal/adapter/woodpecker/client.go` | Modify (add pipeline methods, if time) |
| `internal/port/ci.go` (or whichever port file defines `CIProvider`) | Modify (add pipeline interface, if time) |
| `cmd/rdev-api/main.go` | Modify (wire new handlers) |
| `cmd/rdev-api/openapi.go` | Modify (add routes to spec) |
---
## Week 4: Polish, Validation & Observability
**Goal:** End-to-end validation of the cookbook flow. Observability for production operation. Documentation updated.
### Tasks
1. **End-to-end cookbook validation**
- Run the landing page cookbook flow from start to finish
- `POST /project` with `astro-landing` template
- `POST /projects/landing/builds` with customization prompt
- Monitor via `GET /work/{taskId}/status`
- Verify CI triggers on push
- Verify site is live at `https://landing.threesix.ai`
- Fix any issues found during validation
2. **Stale task recovery**
- Call `RequeueStale()` periodically so tasks whose worker crashed mid-execution are requeued
- Call `CleanupOld()` periodically to remove long-completed tasks
- Both methods exist on `WorkQueue`, but nothing calls them yet
- The periodic calls run in the queue maintenance worker (task 4)
3. **Observability additions**
- Add metrics to work executor: tasks_claimed, tasks_completed, tasks_failed, execution_duration
- Add metrics to worker service: workers_registered, workers_idle, workers_busy
- Follow existing pattern in `internal/metrics/metrics.go`
- Add work executor health to readiness check (`GET /ready`)
4. **Queue maintenance worker**
- Create `internal/worker/queue_maintenance.go`
- Runs on a slower ticker (every 5 minutes)
- Calls `RequeueStale(ctx, 10*time.Minute)` — requeue tasks running > 10min with no heartbeat
- Calls `CleanupOld(ctx, 7*24*time.Hour)` — prune tasks older than 7 days
- Wire into main.go
5. **Update documentation**
- Update `cookbooks/landing-page.md` with final validated flow
- Update `ai-lookup/features/build-orchestration.md`
- Update `ai-lookup/services/worker-pool.md`
- Add `.claude/guides/services/build-orchestration.md` if needed
6. **Update CLAUDE.md roadmap**
- Mark "Work Queue" as implemented
- Mark "Worker Pool" as implemented
- Mark "Build Orchestration" as implemented
- Update "Bot Communication" status
### Deliverables
- Cookbook flow works end-to-end without manual intervention (other than supplying the code generation prompt)
- Stale task recovery running in production
- Metrics visible in `/metrics` endpoint
- All documentation reflects actual capabilities
### Files Created/Modified
| File | Action |
|------|--------|
| `internal/worker/queue_maintenance.go` | Create |
| `internal/metrics/metrics.go` | Modify (add work executor metrics) |
| `internal/handlers/health.go` | Modify (add executor health) |
| `cookbooks/landing-page.md` | Modify (final validation) |
| `ai-lookup/features/build-orchestration.md` | Modify |
| `ai-lookup/services/worker-pool.md` | Modify |
| `CLAUDE.md` | Modify (update roadmap) |
| `cmd/rdev-api/main.go` | Modify (wire maintenance worker) |
---
## Risk & Dependencies
| Risk | Mitigation |
|------|-----------|
| `CodeAgent` execution in a temp directory (not a K8s pod) may behave differently from in-pod execution | Test early in Week 1; the fallback is to `kubectl exec` into a worker pod |
| Gitea token may lack permissions for new repos created by different users | Test with actual token; all repos should be in the same org |
| Agent execution may take longer than expected (10+ minutes for complex prompts) | Make the execution timeout configurable and raise the default |
| Worker process crash loses in-flight task | Stale requeue (Week 4) handles this automatically |
| 500-line file limit may require splitting new files | Plan for split from the start; `work_executor.go` + `build_executor.go` + `pod_git_operations.go` keeps things modular |
## Architecture Decision: In-Process vs External Worker
The plan above implements the executor **in-process** (running inside the rdev-api binary). This is simpler and matches the existing `QueueProcessor` pattern. The alternative would be a separate worker binary, which would allow independent scaling. The in-process approach is the right starting point — it can be extracted into a separate binary later if scaling requires it.
## Summary
| Week | Focus | Key Deliverable |
|------|-------|----------------|
| 1 | Work executor core | Tasks flow from queue → agent → result |
| 2 | Git operations | Clone → execute → commit → push cycle |
| 3 | API enhancements | Build, worker, DNS, create-and-build endpoints |
| 4 | Polish & validation | E2E cookbook flow, observability, docs |