diff --git a/ai-lookup/index.md b/ai-lookup/index.md index 005413b..5954b88 100644 --- a/ai-lookup/index.md +++ b/ai-lookup/index.md @@ -14,7 +14,7 @@ Quick reference for rdev concepts and facts. | Webhooks | [services/webhooks.md](./services/webhooks.md) | High | 2025-01 | Event subscriptions and delivery | | **Worker Infrastructure** | | Work Queue | [services/work-queue.md](./services/work-queue.md) | High | 2025-01 | Task queue for worker pool | -| Worker Pool | [services/worker-pool.md](./services/worker-pool.md) | High | 2026-01 | Embedded work executor with queue maintenance and metrics | +| Worker Pool | [services/worker-pool.md](./services/worker-pool.md) | High | 2026-02 | Standalone worker pods with claudebox sidecar, HTTP polling | | External Health | [services/external-health.md](./services/external-health.md) | High | 2026-02 | Background health monitoring of registry, CI, git | | CI Provider | [services/ci-provider.md](./services/ci-provider.md) | High | 2025-01 | Woodpecker auto-activation | | DNS / Cloudflare | [services/dns-cloudflare.md](./services/dns-cloudflare.md) | High | 2026-01 | Domain management for threesix.ai | diff --git a/ai-lookup/services/worker-pool.md b/ai-lookup/services/worker-pool.md index be631a7..b0e0a71 100644 --- a/ai-lookup/services/worker-pool.md +++ b/ai-lookup/services/worker-pool.md @@ -1,79 +1,193 @@ # Worker Pool -**Last Updated:** 2026-01-31 +**Last Updated:** 2026-02-06 **Confidence:** High ## Summary -Shared worker pool that executes build tasks for any project. Currently runs as an embedded WorkExecutor daemon inside rdev-api. Workers register with the worker registry, poll the work queue for tasks, execute Claude Code in pods via kubectl exec. Post-build git operations (commit/push) are programmatic via PodGitOperations, not LLM-driven. +Distributed task execution system where standalone worker pods poll rdev-api for tasks and execute them via a claudebox sidecar. 
Supports horizontal scaling by adding more worker pods. **Key Facts:** -- **LLM vs rdev boundary:** Claude writes code; rdev handles git ops programmatically (no LLM for runbook tasks) -- Embedded WorkExecutor daemon runs inside rdev-api process -- Workers poll work queue every 5 seconds, heartbeat every 30 seconds -- Stale workers (no heartbeat for 2 minutes) automatically marked offline by QueueMaintenance -- Stale tasks (running >30 min without completion) automatically requeued -- Old tasks (>7 days) automatically cleaned up -- Queue depth and worker counts exported as Prometheus metrics -- Future: external worker binary for separate pod deployment +- **Architecture:** Pull-based polling (not push/websocket) +- **Sidecar pattern:** Worker + claudebox in same pod, communicate via localhost HTTP +- **Atomic dequeue:** PostgreSQL `FOR UPDATE SKIP LOCKED` prevents duplicate claims +- **Task types:** `build` (Claude Code prompts), `sdlc` (SDLC commands) +- **Scaling:** Add replicas to handle more concurrent tasks +- **Resilience:** Stale workers marked offline, stuck tasks re-queued automatically -**File Pointers:** -- Domain: `internal/domain/worker.go` (Worker, WorkerStatus) -- Domain: `internal/domain/build.go` (BuildSpec, BuildResult) -- Port: `internal/port/worker_registry.go` (WorkerRegistry interface) -- Port: `internal/port/build_audit.go` (BuildAudit interface) -- Adapter: `internal/adapter/postgres/worker_registry.go` -- Adapter: `internal/adapter/postgres/build_audit.go` -- Service: `internal/service/worker_service.go` -- Service: `internal/service/build_service.go` -- Executor: `internal/worker/work_executor.go` (poll loop, heartbeat, task routing) -- Executor: `internal/worker/build_executor.go` (BuildSpec→AgentRequest) -- Git: `internal/worker/pod_git_operations.go` (post-build commit/push via kubectl exec) -- Maintenance: `internal/worker/queue_maintenance.go` (stale recovery, cleanup, metrics) -- Handler: `internal/handlers/workers.go` (REST API for 
workers) -- Handler: `internal/handlers/builds.go` (REST API for builds) -- Handler: `internal/handlers/create_and_build.go` (combined create+build) -- Migration: `internal/db/migrations/012_worker_registry.sql` +## File Pointers -## Worker Lifecycle (Embedded) +### Standalone Worker Binary +- **Entry:** `cmd/rdev-worker/main.go` - Main binary, registration, heartbeat, poll loop +- **API Client:** `internal/worker/api_client.go` - HTTP client to rdev-api +- **Build Executor:** `internal/worker/http_build_executor.go` - Execute builds via claudebox +- **SDLC Executor:** `internal/worker/http_sdlc_executor.go` - Execute SDLC tasks via claudebox -1. rdev-api starts → WorkExecutor registers as worker in registry -2. Heartbeat loop: every 30s sends heartbeat via WorkerService -3. Poll loop: every 5s dequeues next task from work queue -4. BuildExecutor: executes CodeAgent in pod, then programmatically commits/pushes if auto_commit -5. Reports completion with BuildResult via WorkerService -6. Graceful shutdown: deregisters worker on rdev-api stop +### Claudebox Sidecar Client +- **Client:** `internal/adapter/claudebox/client.go` - HTTP client to claudebox sidecar +- **Endpoints:** `/health`, `/execute`, `/git/clone`, `/git/commit-and-push`, `/sdlc` + +### rdev-api Server-Side +- **Handlers:** `internal/handlers/workers.go` - `/workers/*` endpoints +- **Service:** `internal/service/worker_service.go` - Claim, complete, fail logic +- **Registry:** `internal/adapter/postgres/worker_registry.go` - Worker state persistence +- **Queue:** `internal/adapter/postgres/work_queue.go` - Task queue with atomic dequeue + +### Domain +- **Worker:** `internal/domain/worker.go` - Worker, WorkerStatus +- **Task:** `internal/domain/work.go` - WorkTask, WorkTaskType, WorkTaskStatus +- **Build:** `internal/domain/build.go` - BuildSpec, BuildResult + +### Kubernetes +- **Deployment:** `deployments/k8s/base/rdev-worker.yaml` - Worker + claudebox pod spec + +## Architecture + +``` 
+┌─────────────────────┐ HTTP Polling (5s) ┌──────────────────────────┐ +│ rdev-api │◄────────────────────────────────►│ Worker Pod │ +│ │ │ ┌─────────┐ ┌─────────┐ │ +│ POST /workers/register ← Register at startup │ │ worker │→│claudebox│ │ +│ POST /workers/{id}/heartbeat ← Every 30s │ └─────────┘ └─────────┘ │ +│ POST /workers/{id}/claim ← Poll for tasks │ ↓ HTTP localhost │ +│ POST /workers/{id}/complete/{taskId} ← Success │ Claude Code execution │ +│ POST /workers/{id}/fail/{taskId} ← Failure └──────────────────────────┘ +│ │ +│ PostgreSQL │ +│ ├─ workers │ (worker registry) +│ ├─ work_queue │ (task queue) +│ └─ build_audit │ (execution history) +└─────────────────────┘ +``` + +## Worker Lifecycle + +1. **Register:** Worker pod starts → `POST /workers/register` with ID, hostname, capabilities +2. **Heartbeat:** Every 30s → `POST /workers/{id}/heartbeat` to stay alive +3. **Poll:** Every 5s → `POST /workers/{id}/claim` to get next task +4. **Execute:** Call claudebox sidecar HTTP API to run Claude Code / SDLC commands +5. **Report:** `POST /workers/{id}/complete/{taskId}` or `/fail/{taskId}` with results +6. **Shutdown:** Graceful wait for in-flight tasks via `sync.WaitGroup` ## Worker Statuses -- `idle` - available for new tasks -- `busy` - currently executing a task -- `draining` - not accepting new tasks (pre-shutdown) -- `offline` - missed heartbeat threshold +| Status | Meaning | +|--------|---------| +| `idle` | Ready to claim new tasks | +| `busy` | Currently executing a task | +| `draining` | Not accepting new tasks (pre-shutdown) | +| `offline` | Missed heartbeat threshold (>90s) | + +## Task Types + +### Build Tasks (`WorkTaskTypeBuild`) + +Execute Claude Code prompts with optional git operations. + +**Spec:** +```json +{ + "prompt": "Build a React app with...", + "auto_commit": true, + "auto_push": false, + "git_clone_url": "https://gitea.../repo.git" +} +``` + +**Execution Flow:** +1. Clone repo via `claudebox /git/clone` +2. 
Execute prompt via `claudebox /execute` (streaming) +3. Commit/push via `claudebox /git/commit-and-push` + +### SDLC Tasks (`WorkTaskTypeSDLC`) + +Execute SDLC CLI commands. + +**Spec:** +```json +{ + "command": "feature", + "args": ["init", "feature-name"], + "git_clone_url": "https://gitea.../repo.git" +} +``` + +**Execution Flow:** +1. Clone repo via `claudebox /git/clone` +2. Run SDLC command via `claudebox /sdlc` +3. Commit/push changes ## API Endpoints | Method | Path | Description | |--------|------|-------------| -| GET | `/workers` | List all workers with status summary | -| GET | `/workers/{workerId}` | Get worker details | -| POST | `/workers/{workerId}/drain` | Set worker to draining | -| POST | `/projects/{id}/builds` | Start build for project | -| GET | `/projects/{id}/builds` | List builds for project | -| GET | `/builds/{taskId}` | Get build status | -| POST | `/project/create-and-build` | Create project + start build | +| POST | `/workers/register` | Register new worker | +| POST | `/workers/{id}/heartbeat` | Keep worker alive | +| POST | `/workers/{id}/claim` | Claim next available task (204 if none) | +| POST | `/workers/{id}/complete/{taskId}` | Report successful completion | +| POST | `/workers/{id}/fail/{taskId}` | Report failure | +| GET | `/workers` | List all workers | +| GET | `/workers/{id}` | Get worker details | +| POST | `/workers/{id}/drain` | Set worker to draining | + +## Kubernetes Deployment + +```yaml +# deployments/k8s/base/rdev-worker.yaml +spec: + replicas: 1 # Scale by increasing + strategy: + type: RollingUpdate # RWX PVC enables multi-pod mounts + rollingUpdate: + maxSurge: 2 + maxUnavailable: 0 + containers: + - name: worker + image: registry.threesix.ai/rdev/worker:latest + env: + - RDEV_API_URL: http://rdev-api.rdev.svc.cluster.local:8080 + - CLAUDEBOX_URL: http://localhost:8080 + - WORKER_POLL_INTERVAL: 5s + - WORKER_HEARTBEAT_INTERVAL: 30s + - WORKER_TASK_TIMEOUT: 15m + - name: claudebox + image: 
registry.threesix.ai/rdev/claudebox:latest + volumeMounts: + - /workspace (EmptyDir) + - /root/.claude (RWX PVC - shared Claude auth) +``` + +**Storage:** The `claudebox-claude-config` PVC uses `ReadWriteMany` (RWX) access mode with Longhorn NFS, allowing multiple worker pods to share Claude OAuth credentials. + +## Error Classification + +Failed tasks are classified for smart retry logic: + +| Code | Trigger | Retryable | +|------|---------|-----------| +| `RATE_LIMITED` | "rate limit", "quota exceeded" | Yes (with backoff) | +| `AUTH_FAILED` | "unauthorized", "invalid api key" | No | +| `TIMEOUT` | "context deadline exceeded" | Yes | +| `AGENT_ERROR` | Generic error | Yes (limited retries) | ## Queue Maintenance -The QueueMaintenance worker runs inside rdev-api alongside the WorkExecutor: -- **Stale task recovery** (every 1m): Requeues tasks running >30m without completion. Also syncs build_audit status to "pending" so API correctly reflects requeued state. -- **Stale worker marking** (every 1m): Marks workers offline after 2m without heartbeat -- **Old task cleanup** (every 1m): Removes completed/failed/cancelled tasks >7 days old -- **Metrics refresh** (every 15s): Updates Prometheus gauges for queue depth and worker counts +Background goroutine in rdev-api: +- **Stale worker marking:** Workers without heartbeat >90s → `offline` +- **Stale task recovery:** Tasks running >30m without completion → re-queued +- **Old task cleanup:** Completed/failed tasks >7 days → deleted +- **Metrics refresh:** Queue depth and worker counts → Prometheus -**Build Audit Sync:** When stale tasks are requeued, both `work_queue` and `build_audit` tables are updated atomically. This prevents builds from appearing stuck in "running" when the underlying task has been requeued for retry due to worker death. +## Graceful Shutdown + +Worker uses `sync.WaitGroup` to track in-flight tasks: +1. Receive SIGTERM/SIGINT +2. Cancel context (stops polling) +3. 
Wait for WaitGroup with timeout (`WORKER_TASK_TIMEOUT`) +4. Log success or timeout warning ## Related Topics -- [Work Queue](./work-queue.md) -- [Build Orchestration](../features/build-orchestration.md) +- [Work Queue](./work-queue.md) - Task queue implementation +- [Build Orchestration](../features/build-orchestration.md) - Build API and specs +- [SDLC Orchestration](./sdlc.md) - SDLC task integration diff --git a/cookbooks/trees/slackpath-5-full-lifecycle.yaml b/cookbooks/trees/slackpath-5-full-lifecycle.yaml new file mode 100644 index 0000000..e4941c3 --- /dev/null +++ b/cookbooks/trees/slackpath-5-full-lifecycle.yaml @@ -0,0 +1,536 @@ +name: full-lifecycle +description: "Slack Path 5: The Full Lifecycle. Tests all 10 SDLC phases with explicit artifact approvals." +version: 1 + +vars: + project_name: "" + feature_slug: "user-preferences" + feature_title: "User Preferences API" + +steps: + # ============================================================ + # INFRASTRUCTURE + # ============================================================ + create-project: + action: api + method: POST + endpoint: /project + body: + name: "{{ .vars.project_name }}" + description: "Slack Path 5: Full SDLC Lifecycle" + outputs: + - project_id: .data.name + - domain: .data.domain + + add-db: + description: Add database for preferences storage + depends_on: [create-project] + on_error: continue + action: api + method: POST + endpoint: "/projects/{{ .outputs.create-project.project_id }}/components" + body: + type: postgres + name: "main-db" + + add-service: + description: Add API service + depends_on: [add-db] + action: api + method: POST + endpoint: "/projects/{{ .outputs.create-project.project_id }}/components" + body: + type: service + name: "preferences-api" + + wait-init: + depends_on: [add-service] + action: wait_pipeline + project_id: "{{ .outputs.create-project.project_id }}" + + # ============================================================ + # PHASE 1: DRAFT + # Create feature 
(starts in draft phase) + # ============================================================ + create-feature: + description: "Create feature in draft phase" + depends_on: [wait-init] + action: api + method: POST + endpoint: "/projects/{{ .outputs.create-project.project_id }}/sdlc/features" + body: + slug: "{{ .vars.feature_slug }}" + title: "{{ .vars.feature_title }}" + outputs: + - feature_phase: .data.phase + + verify-draft: + description: "Verify feature is in draft phase" + depends_on: [create-feature] + action: shell + command: | + PHASE="{{ .outputs.create-feature.feature_phase }}" + if [ "$PHASE" == "draft" ]; then + echo "Feature created in draft phase" + exit 0 + else + echo "Expected draft, got $PHASE" + exit 1 + fi + + # ============================================================ + # PHASE 2: DRAFT → SPECIFIED + # Agent writes spec, API approves, transition + # ============================================================ + write-spec: + description: "Agent writes the spec artifact" + depends_on: [verify-draft] + action: api + method: POST + endpoint: "/projects/{{ .outputs.create-project.project_id }}/builds" + body: + prompt: "/spec-feature {{ .vars.feature_slug }} --requirements 'CRUD API for user preferences. GET/PUT /preferences/{user_id}. Preferences are key-value pairs stored in DB. 
Support theme, language, notifications settings.'" + auto_commit: true + auto_push: true + git_clone_url: "https://git.threesix.ai/jordan/{{ .outputs.create-project.project_id }}.git" + outputs: + - build_id: .data.task_id + + wait-spec: + depends_on: [write-spec] + action: wait_build + build_id: "{{ .outputs.write-spec.build_id }}" + max_attempts: 60 + poll_interval: 5 + + approve-spec: + description: "API approves the spec artifact" + depends_on: [wait-spec] + action: api + method: POST + endpoint: "/projects/{{ .outputs.create-project.project_id }}/sdlc/features/{{ .vars.feature_slug }}/artifacts/spec/approve" + body: + comment: "Spec approved by automation" + + transition-to-specified: + description: "Transition from draft to specified" + depends_on: [approve-spec] + action: api + method: POST + endpoint: "/projects/{{ .outputs.create-project.project_id }}/sdlc/features/{{ .vars.feature_slug }}/transition" + body: + phase: "specified" + outputs: + - new_phase: .data.phase + + # ============================================================ + # PHASE 3: SPECIFIED → PLANNED + # Agent writes design, tasks, qa_plan. API approves each. 
+ # ============================================================ + write-design: + description: "Agent writes the design artifact" + depends_on: [transition-to-specified] + action: api + method: POST + endpoint: "/projects/{{ .outputs.create-project.project_id }}/builds" + body: + prompt: "/design-feature {{ .vars.feature_slug }}" + auto_commit: true + auto_push: true + git_clone_url: "https://git.threesix.ai/jordan/{{ .outputs.create-project.project_id }}.git" + outputs: + - build_id: .data.task_id + + wait-design: + depends_on: [write-design] + action: wait_build + build_id: "{{ .outputs.write-design.build_id }}" + max_attempts: 60 + poll_interval: 5 + + approve-design: + description: "API approves the design artifact" + depends_on: [wait-design] + action: api + method: POST + endpoint: "/projects/{{ .outputs.create-project.project_id }}/sdlc/features/{{ .vars.feature_slug }}/artifacts/design/approve" + body: + comment: "Design approved by automation" + + write-tasks: + description: "Agent breaks down into tasks" + depends_on: [approve-design] + action: api + method: POST + endpoint: "/projects/{{ .outputs.create-project.project_id }}/builds" + body: + prompt: "/breakdown-feature {{ .vars.feature_slug }}" + auto_commit: true + auto_push: true + git_clone_url: "https://git.threesix.ai/jordan/{{ .outputs.create-project.project_id }}.git" + outputs: + - build_id: .data.task_id + + wait-tasks: + depends_on: [write-tasks] + action: wait_build + build_id: "{{ .outputs.write-tasks.build_id }}" + max_attempts: 60 + poll_interval: 5 + + approve-tasks: + description: "API approves the tasks artifact" + depends_on: [wait-tasks] + action: api + method: POST + endpoint: "/projects/{{ .outputs.create-project.project_id }}/sdlc/features/{{ .vars.feature_slug }}/artifacts/tasks/approve" + body: + comment: "Tasks approved by automation" + + write-qa-plan: + description: "Agent writes QA plan" + depends_on: [approve-tasks] + action: api + method: POST + endpoint: "/projects/{{ 
.outputs.create-project.project_id }}/builds" + body: + prompt: "/create-qa-plan {{ .vars.feature_slug }}" + auto_commit: true + auto_push: true + git_clone_url: "https://git.threesix.ai/jordan/{{ .outputs.create-project.project_id }}.git" + outputs: + - build_id: .data.task_id + + wait-qa-plan: + depends_on: [write-qa-plan] + action: wait_build + build_id: "{{ .outputs.write-qa-plan.build_id }}" + max_attempts: 60 + poll_interval: 5 + + approve-qa-plan: + description: "API approves the QA plan artifact" + depends_on: [wait-qa-plan] + action: api + method: POST + endpoint: "/projects/{{ .outputs.create-project.project_id }}/sdlc/features/{{ .vars.feature_slug }}/artifacts/qa_plan/approve" + body: + comment: "QA plan approved by automation" + + transition-to-planned: + description: "Transition from specified to planned" + depends_on: [approve-qa-plan] + action: api + method: POST + endpoint: "/projects/{{ .outputs.create-project.project_id }}/sdlc/features/{{ .vars.feature_slug }}/transition" + body: + phase: "planned" + outputs: + - new_phase: .data.phase + + # ============================================================ + # PHASE 4: PLANNED → READY + # No new artifacts needed, just transition + # ============================================================ + transition-to-ready: + description: "Transition from planned to ready" + depends_on: [transition-to-planned] + action: api + method: POST + endpoint: "/projects/{{ .outputs.create-project.project_id }}/sdlc/features/{{ .vars.feature_slug }}/transition" + body: + phase: "ready" + outputs: + - new_phase: .data.phase + + # ============================================================ + # PHASE 5: READY → IMPLEMENTATION + # Agent implements all tasks + # ============================================================ + implement-feature: + description: "Agent implements all tasks for the feature" + depends_on: [transition-to-ready] + action: api + method: POST + endpoint: "/projects/{{ 
.outputs.create-project.project_id }}/builds" + body: + prompt: "/implement-feature {{ .vars.feature_slug }}" + auto_commit: true + auto_push: true + git_clone_url: "https://git.threesix.ai/jordan/{{ .outputs.create-project.project_id }}.git" + outputs: + - build_id: .data.task_id + + wait-implement: + depends_on: [implement-feature] + action: wait_build + build_id: "{{ .outputs.implement-feature.build_id }}" + max_attempts: 120 + poll_interval: 5 + + wait-deploy-impl: + description: "Wait for implementation to deploy" + depends_on: [wait-implement] + action: wait_pipeline + project_id: "{{ .outputs.create-project.project_id }}" + + transition-to-implementation: + description: "Transition to implementation phase (marks code complete)" + depends_on: [wait-deploy-impl] + action: api + method: POST + endpoint: "/projects/{{ .outputs.create-project.project_id }}/sdlc/features/{{ .vars.feature_slug }}/transition" + body: + phase: "implementation" + outputs: + - new_phase: .data.phase + + # ============================================================ + # PHASE 6: IMPLEMENTATION → REVIEW + # Agent writes code review + # ============================================================ + write-review: + description: "Agent writes code review" + depends_on: [transition-to-implementation] + action: api + method: POST + endpoint: "/projects/{{ .outputs.create-project.project_id }}/builds" + body: + prompt: "/review-feature {{ .vars.feature_slug }}" + auto_commit: true + auto_push: true + git_clone_url: "https://git.threesix.ai/jordan/{{ .outputs.create-project.project_id }}.git" + outputs: + - build_id: .data.task_id + + wait-review: + depends_on: [write-review] + action: wait_build + build_id: "{{ .outputs.write-review.build_id }}" + max_attempts: 60 + poll_interval: 5 + + approve-review: + description: "API approves the review" + depends_on: [wait-review] + action: api + method: POST + endpoint: "/projects/{{ .outputs.create-project.project_id }}/sdlc/features/{{ 
.vars.feature_slug }}/artifacts/review/approve" + body: + comment: "Review approved by automation" + + transition-to-review: + description: "Transition to review phase" + depends_on: [approve-review] + action: api + method: POST + endpoint: "/projects/{{ .outputs.create-project.project_id }}/sdlc/features/{{ .vars.feature_slug }}/transition" + body: + phase: "review" + outputs: + - new_phase: .data.phase + + # ============================================================ + # PHASE 7: REVIEW → AUDIT + # Agent writes security/architecture audit + # ============================================================ + write-audit: + description: "Agent writes security audit" + depends_on: [transition-to-review] + action: api + method: POST + endpoint: "/projects/{{ .outputs.create-project.project_id }}/builds" + body: + prompt: "/audit-feature {{ .vars.feature_slug }}" + auto_commit: true + auto_push: true + git_clone_url: "https://git.threesix.ai/jordan/{{ .outputs.create-project.project_id }}.git" + outputs: + - build_id: .data.task_id + + wait-audit: + depends_on: [write-audit] + action: wait_build + build_id: "{{ .outputs.write-audit.build_id }}" + max_attempts: 60 + poll_interval: 5 + + approve-audit: + description: "API approves the audit" + depends_on: [wait-audit] + action: api + method: POST + endpoint: "/projects/{{ .outputs.create-project.project_id }}/sdlc/features/{{ .vars.feature_slug }}/artifacts/audit/approve" + body: + comment: "Audit approved by automation" + + transition-to-audit: + description: "Transition to audit phase" + depends_on: [approve-audit] + action: api + method: POST + endpoint: "/projects/{{ .outputs.create-project.project_id }}/sdlc/features/{{ .vars.feature_slug }}/transition" + body: + phase: "audit" + outputs: + - new_phase: .data.phase + + # ============================================================ + # PHASE 8: AUDIT → QA + # Agent runs QA tests + # ============================================================ + run-qa: + description: 
"Agent runs QA plan" + depends_on: [transition-to-audit] + action: api + method: POST + endpoint: "/projects/{{ .outputs.create-project.project_id }}/builds" + body: + prompt: "/run-qa {{ .vars.feature_slug }}" + auto_commit: true + auto_push: true + git_clone_url: "https://git.threesix.ai/jordan/{{ .outputs.create-project.project_id }}.git" + outputs: + - build_id: .data.task_id + + wait-qa: + depends_on: [run-qa] + action: wait_build + build_id: "{{ .outputs.run-qa.build_id }}" + max_attempts: 60 + poll_interval: 5 + + transition-to-qa: + description: "Transition to QA phase" + depends_on: [wait-qa] + action: api + method: POST + endpoint: "/projects/{{ .outputs.create-project.project_id }}/sdlc/features/{{ .vars.feature_slug }}/transition" + body: + phase: "qa" + outputs: + - new_phase: .data.phase + + # ============================================================ + # PHASE 9: QA → MERGE + # Merge feature branch to main + # ============================================================ + merge-feature: + description: "Merge feature branch to main" + depends_on: [transition-to-qa] + action: api + method: POST + endpoint: "/projects/{{ .outputs.create-project.project_id }}/sdlc/features/{{ .vars.feature_slug }}/merge" + body: + strategy: "squash" + outputs: + - merge_commit: .data.commit_sha + + transition-to-merge: + description: "Transition to merge phase" + depends_on: [merge-feature] + action: api + method: POST + endpoint: "/projects/{{ .outputs.create-project.project_id }}/sdlc/features/{{ .vars.feature_slug }}/transition" + body: + phase: "merge" + outputs: + - new_phase: .data.phase + + # ============================================================ + # PHASE 10: MERGE → RELEASED + # Archive the feature + # ============================================================ + wait-final-deploy: + description: "Wait for merged code to deploy" + depends_on: [transition-to-merge] + action: wait_pipeline + project_id: "{{ .outputs.create-project.project_id }}" + + 
archive-feature: + description: "Archive the completed feature" + depends_on: [wait-final-deploy] + action: api + method: POST + endpoint: "/projects/{{ .outputs.create-project.project_id }}/sdlc/features/{{ .vars.feature_slug }}/archive" + + transition-to-released: + description: "Transition to released phase" + depends_on: [archive-feature] + action: api + method: POST + endpoint: "/projects/{{ .outputs.create-project.project_id }}/sdlc/features/{{ .vars.feature_slug }}/transition" + body: + phase: "released" + outputs: + - final_phase: .data.phase + + # ============================================================ + # VERIFICATION + # ============================================================ + verify-service-running: + description: "Verify the preferences API is running" + depends_on: [transition-to-released] + action: shell + command: | + DOMAIN="{{ .outputs.create-project.domain }}" + HEALTH=$(curl -s "https://$DOMAIN/api/preferences-api/health" | jq -r '.data.status // empty') + if [ "$HEALTH" == "healthy" ]; then + echo "Service healthy" + exit 0 + else + echo "Service not healthy: $HEALTH" + exit 1 + fi + + verify-preferences-api: + description: "Test CRUD operations on preferences" + depends_on: [verify-service-running] + on_error: continue + action: shell + command: | + DOMAIN="{{ .outputs.create-project.domain }}" + BASE_URL="https://$DOMAIN/api/preferences-api" + USER_ID="test-user-123" + + # PUT preferences + echo "Setting preferences..." + PUT_RESP=$(curl -s -X PUT "$BASE_URL/preferences/$USER_ID" \ + -H "Content-Type: application/json" \ + -d '{"theme":"dark","language":"en","notifications":true}') + echo "PUT response: $PUT_RESP" + + # GET preferences + echo "Getting preferences..." 
+ GET_RESP=$(curl -s "$BASE_URL/preferences/$USER_ID") + echo "GET response: $GET_RESP" + + # Verify theme is dark + THEME=$(echo "$GET_RESP" | jq -r '.theme // .data.theme // empty') + if [ "$THEME" == "dark" ]; then + echo "Preferences API working correctly" + exit 0 + else + echo "Expected theme=dark, got: $THEME" + exit 1 + fi + + verify-lifecycle-complete: + description: "Verify feature reached released phase" + depends_on: [verify-preferences-api] + action: shell + command: | + FINAL_PHASE="{{ .outputs.transition-to-released.final_phase }}" + if [ "$FINAL_PHASE" == "released" ]; then + echo "SUCCESS: Feature completed full lifecycle (draft → released)" + echo "All 10 phases traversed with explicit approvals" + exit 0 + else + echo "FAIL: Expected released, got $FINAL_PHASE" + exit 1 + fi + +teardown: + - action: api + method: DELETE + endpoint: "/project/{{ .outputs.create-project.project_id }}" diff --git a/deployments/k8s/base/kustomization.yaml b/deployments/k8s/base/kustomization.yaml index b21235d..c313bc7 100644 --- a/deployments/k8s/base/kustomization.yaml +++ b/deployments/k8s/base/kustomization.yaml @@ -6,9 +6,11 @@ namespace: rdev resources: - namespace.yaml + # Storage classes (must be applied before PVCs) + - storageclass-rwx.yaml + # Shared worker claudebox (runs all project builds) - pvc.yaml - - pvc-shared-claude.yaml - claudebox.yaml - configmaps.yaml diff --git a/deployments/k8s/base/pvc-shared-claude.yaml b/deployments/k8s/base/pvc-shared-claude.yaml deleted file mode 100644 index da6e88f..0000000 --- a/deployments/k8s/base/pvc-shared-claude.yaml +++ /dev/null @@ -1,29 +0,0 @@ -# Shared Claude credentials PVC -# v0.6 - All claudebox pods share this for auth -# Commands/skills/agents live in /workspace/.claude (per-project, in git) -# -# IMPORTANT: ReadWriteMany (RWX) requires Longhorn with NFS enabled. -# Verify with: kubectl get settings -n longhorn-system rwx-volume-fast-failover -# If RWX is not available, either: -# 1. 
Enable Longhorn NFS: kubectl apply -f longhorn-nfs-provisioner.yaml -# 2. Or use separate PVCs per pod (revert to per-project claude-config PVCs) -# -# RWX is needed because multiple claudebox pods mount this simultaneously -# to share Claude authentication credentials. - -apiVersion: v1 -kind: PersistentVolumeClaim -metadata: - name: claudebox-shared-claude-config - namespace: rdev - labels: - app.kubernetes.io/name: claudebox - app.kubernetes.io/part-of: rdev - rdev.orchard9.ai/type: shared-config -spec: - accessModes: - - ReadWriteMany # Multiple pods can mount simultaneously - storageClassName: longhorn - resources: - requests: - storage: 1Gi diff --git a/deployments/k8s/base/pvc.yaml b/deployments/k8s/base/pvc.yaml index b430a4d..b378510 100644 --- a/deployments/k8s/base/pvc.yaml +++ b/deployments/k8s/base/pvc.yaml @@ -14,6 +14,12 @@ spec: requests: storage: 20Gi --- +# Claude config PVC - shared across claudebox and worker pods +# RWX (ReadWriteMany) allows multiple pods to mount simultaneously +# Contains Claude subscription OAuth credentials (~/.claude) +# +# IMPORTANT: Requires longhorn-rwx StorageClass (see storageclass-rwx.yaml) +# After recreating this PVC, re-authenticate with: claude login apiVersion: v1 kind: PersistentVolumeClaim metadata: @@ -22,10 +28,11 @@ metadata: labels: app.kubernetes.io/name: claudebox app.kubernetes.io/part-of: rdev + rdev.orchard9.ai/type: shared-config spec: accessModes: - - ReadWriteOnce - storageClassName: longhorn + - ReadWriteMany + storageClassName: longhorn-rwx resources: requests: storage: 1Gi diff --git a/deployments/k8s/base/rdev-worker.yaml b/deployments/k8s/base/rdev-worker.yaml index 5448477..46d5720 100644 --- a/deployments/k8s/base/rdev-worker.yaml +++ b/deployments/k8s/base/rdev-worker.yaml @@ -10,10 +10,13 @@ metadata: app.kubernetes.io/part-of: rdev spec: replicas: 1 - # Recreate strategy required: claudebox-claude-config PVC is RWO (ReadWriteOnce) - # and cannot be attached to multiple pods 
simultaneously + # RollingUpdate enabled by RWX (ReadWriteMany) PVC for claude-config + # See: deployments/k8s/base/pvc.yaml and storageclass-rwx.yaml strategy: - type: Recreate + type: RollingUpdate + rollingUpdate: + maxSurge: 2 + maxUnavailable: 0 selector: matchLabels: app: rdev-worker diff --git a/deployments/k8s/base/storageclass-rwx.yaml b/deployments/k8s/base/storageclass-rwx.yaml new file mode 100644 index 0000000..d655266 --- /dev/null +++ b/deployments/k8s/base/storageclass-rwx.yaml @@ -0,0 +1,24 @@ +# RWX (ReadWriteMany) StorageClass for shared volumes +# Enables multiple pods to mount the same PVC simultaneously +# Used for: claudebox-claude-config (shared Claude auth credentials) +# +# Prerequisites: +# - Longhorn 1.4.0+ with NFS support +# - Verify: kubectl get settings -n longhorn-system | grep -i rwx +# +# Note: RWX volumes need an NFSv4 client installed on every node. +# The rwx-volume-fast-failover setting only tunes failover of RWX volumes; it does not enable RWX. + +apiVersion: storage.k8s.io/v1 +kind: StorageClass +metadata: + name: longhorn-rwx + labels: + app.kubernetes.io/part-of: rdev +provisioner: driver.longhorn.io +allowVolumeExpansion: true +reclaimPolicy: Retain +parameters: + numberOfReplicas: "2" + staleReplicaTimeout: "30" + nfsOptions: "vers=4.1,noresvport"
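
Review note on the Error Classification table that this diff adds to `ai-lookup/services/worker-pool.md`: the table reads like a first-match substring lookup over the failure message. The Go sketch below shows that reading only. `ErrorClass`, `Classify`, the rule order, and the case-insensitive matching are all assumptions made for illustration; none of them are taken from the repository, and the backoff and retry limits mentioned in the table are not modeled.

```go
package main

import (
	"fmt"
	"strings"
)

// ErrorClass pairs a classification code with a retry hint.
// Hypothetical type: the real representation is not shown in this diff.
type ErrorClass struct {
	Code      string
	Retryable bool
}

// rules mirrors the rows of the Error Classification table.
// They are checked in order, so the first matching row wins (an assumption).
var rules = []struct {
	class    ErrorClass
	triggers []string
}{
	{ErrorClass{"RATE_LIMITED", true}, []string{"rate limit", "quota exceeded"}},
	{ErrorClass{"AUTH_FAILED", false}, []string{"unauthorized", "invalid api key"}},
	{ErrorClass{"TIMEOUT", true}, []string{"context deadline exceeded"}},
}

// Classify returns the first class whose trigger substring occurs in msg.
// Lowercasing the message before matching is an assumption of this sketch.
// Anything unmatched falls through to the generic AGENT_ERROR row.
func Classify(msg string) ErrorClass {
	lower := strings.ToLower(msg)
	for _, r := range rules {
		for _, t := range r.triggers {
			if strings.Contains(lower, t) {
				return r.class
			}
		}
	}
	return ErrorClass{"AGENT_ERROR", true}
}

func main() {
	samples := []string{
		"429: Rate limit reached",
		"401 Unauthorized",
		"context deadline exceeded",
		"claudebox exited with status 1",
	}
	for _, msg := range samples {
		c := Classify(msg)
		fmt.Printf("%q -> %s (retryable=%t)\n", msg, c.Code, c.Retryable)
	}
}
```

Keeping the rules in an ordered slice rather than a map makes the precedence explicit when a message contains more than one trigger, which seems to matter here because `AUTH_FAILED` is the only non-retryable row.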