- Add UndeployAll() using label selectors to clean up monorepo components
on project deletion (replaces name-based Undeploy in DeleteProject and
the direct undeploy handler)
- Add ResourceGC background worker that periodically finds K8s resources
whose project label has no matching DB record, deletes after 1h safety
window
- Widen deployer client type from *kubernetes.Clientset to
kubernetes.Interface for testability
- UndeployAll accumulates errors via errors.Join instead of failing fast
- Add checkout/checkin sidecar dev flow: temporary git tokens, branch
checkout, review on checkin with cleanup workers
- Add interactive sessions: pod binding, command execution, SSE streaming,
ephemeral preview URLs with session cleanup workers
- Add GET /workers/pool endpoint for aggregate capacity and queue depth
- Add sessions:read and sessions:execute auth scopes
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The `sdlc merge` command reads the Branch field from the feature manifest
on main, but `sdlc branch create` was only committing that state to the
feature branch (via the executor's CommitAndPush). This caused merge to
fail with "feature has no branch".
Two changes:
1. cmd/sdlc/cmd_branch.go: commit .sdlc/ state to main before
`git checkout -b`, ensuring Branch metadata is on main where merge
reads it.
2. internal/worker/sdlc_executor.go: reset workspace to main
(`git fetch && git checkout main && git reset --hard origin/main`)
before each SDLC task, preventing cross-task branch contamination
from commands that switch branches.
Also updates foundary cookbook with architect fallback pattern and
on_error: continue for steps that may fail during early lifecycle.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Two critical fixes for build retry behavior:
1. pod_git_operations.go: Normalize remote URL before comparison
- Clone stores URL with token (https://token:x@host/...)
- Subsequent retry compares against URL without token
- Without normalization, URLs never match, so workspace is always
cleared and re-cloned, losing all code from previous attempt
2. build_audit.go: Clear stale result data when task transitions to running
- When a failed task is retried, UpdateStatus only updated status/worker_id
- Result and completed_at from previous failure remained, causing
API to return stale failure data even while retry was running
- Now clears result, completed_at and resets started_at when
status is set to "running"
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Critical fix: WorkersHandler was missing workService dependency, causing
500 errors when workers tried to fail tasks. This caused tasks to get
stuck in "running" state permanently.
Also adds:
- /work/tasks endpoint for debugging all tasks across projects
- List method to WorkQueue interface for admin views
- HTTP client tests for api_client.go and claudebox/client.go (48 tests)
- Split work.go DTOs into work_dto.go to stay under 500 lines
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add WaitGroup for graceful shutdown of in-flight tasks
- Change replicas to 1 with Recreate strategy (RWO PVC limitation)
- Optimize Dockerfile: combine RUN commands for smaller layers
- Add compiled binaries to .gitignore
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Implements horizontally-scalable worker pool architecture:
- claudebox-sidecar: HTTP server for Claude Code, git, and SDLC ops
- rdev-worker: standalone worker binary polling rdev-api for tasks
- HTTP client adapter for sidecar communication
- HPA with custom Prometheus metrics for autoscaling
- ServiceMonitor for metrics scraping
Code review fixes applied:
- URL-encode query parameters in GitStatus (Critical #1)
- Remove unused shellQuote function (Critical #2)
- Use stdlib strings.Split/TrimSpace (Critical #3)
- Add version injection via ldflags (Warning #4)
- Add debug logging for swallowed git/sdlc errors (Warning #5, #6)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Three coordinated fixes for CI pipeline race conditions:
1. Woodpecker step dependencies: Added depends_on: [deps] to all 6 component
templates (service, worker, cli, app-astro, app-react, app-nextjs) so build
steps wait for go work sync to complete.
2. Idempotent resource provisioning: Modified provisionResources() to check
for existing database/cache before creating, preventing "already exists"
errors on component re-adds.
3. Batch component endpoint: POST /projects/{id}/components/batch enables
atomic multi-component additions in a single git commit. Validates all
components upfront, provisions infra sequentially, commits code components
atomically.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Worker template fixes:
- Replace panic() with logger.Error() + os.Exit(1) for config errors
- Remove double-timeout application (context + middleware)
- Add error message truncation to prevent log bloat
- Use named constants for shutdown grace period and stale check interval
Skeleton pkg/auth fixes:
- Fix error wrapping to use %w consistently in jwt.go
- Add GetUserOrError() as safe alternative to MustGetUser() panic
Skeleton pkg/queue fixes:
- Check RowsAffected() errors instead of ignoring them
- Add input validation to EnqueueWithOptions (require job type, cap retries)
- Add log truncation for error messages
- Fix inaccurate doc comment claiming exponential backoff
Worker timeout consolidation:
- Add internal/worker/timeouts.go with named constants
- Migrate all workers to use timeout constants
Cleanup:
- Remove obsolete slack-preparation-thoughts.md files
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Major changes:
- Add internal/logging package with field constants, context propagation,
sensitive data auto-redaction, and per-component log levels
- Add worker timeout constants (TimeoutQuickOp, TimeoutHealthCheck, etc.)
- Extend SDLC with callback handlers, generate endpoints, and executor
- Add new cookbook trees for aeries and slackpath progression
- Add skeleton templates for queue, realtime, and microservices
- Add worker component template with async job processing
- Refactor services and handlers to use new logging infrastructure
- Split component.go into component_infra.go and component_listing.go
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add /diagnostics endpoint for system health overview
- Add external health worker for monitoring Gitea, Woodpecker, Registry
- Add health check methods to Gitea and Woodpecker clients
- Remove hardcoded fallback projects (pantheon, aeries)
- Add diagnostics domain types and service layer
- Add comprehensive tests for diagnostics handler and service
- Fix tests to use registered test project instead of hardcoded one
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add WorkErrorCode type with RATE_LIMITED, AUTH_FAILED, TIMEOUT, STALE_WORKER, AGENT_ERROR, INVALID_SPEC
- Add ClassifyAgentError function to detect error patterns from stderr
- Add error_code column to work_queue table (migration 016)
- Add FailWithCode method to WorkQueue interface and implementations
- Update RequeueStaleWithIDs to mark permanently failed tasks with STALE_WORKER
- Add ErrorCode to BuildResult for API responses
- Update work executor to classify errors before failing tasks
This enables users to see actual failure reasons (e.g., "RATE_LIMITED") instead of
builds stuck in "running" state forever when Claude hits rate limits.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When a worker dies mid-build, queue maintenance now updates both
work_queue and build_audit tables when requeuing stale tasks.
This prevents builds from showing "running" forever in the API.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add session_id, model, allowed_tools to Claude request handler
- Update OpenAPI spec for Claude endpoint
- Fix BuildExecutor constructor call sites
- Rewrite landing-test.sh for agent-driven flow
- Fix cookbook documentation for correct API format
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Root cause: WorkerService.ClaimTask() was modifying the audit entry
in memory but never persisting it to the database. This caused build
tasks to remain stuck at "pending" status even after being claimed.
Changes:
- Add UpdateStatus method to port.BuildAudit interface
- Implement UpdateStatus in postgres.BuildAuditRepository
- Fix ClaimTask to call audit.UpdateStatus() to persist status
- Add test coverage for audit update during task claim
- Update all mock implementations
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>