Auth errors like "OAuth token has expired" were lost because Claude writes
them to stdout, not stderr. The error message only showed kubectl's generic
"command terminated with exit code 1". Now includes both stdout and stderr
in the error, making failures immediately diagnosable.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add UndeployAll() using label selectors to clean up monorepo components
on project deletion (replaces name-based Undeploy in DeleteProject and
the direct undeploy handler)
- Add ResourceGC background worker that periodically finds K8s resources
whose project label has no matching DB record, deletes after 1h safety
window
- Widen deployer client type from *kubernetes.Clientset to
kubernetes.Interface for testability
- UndeployAll accumulates errors via errors.Join instead of failing fast
- Add checkout/checkin sidecar dev flow: temporary git tokens, branch
checkout, review on checkin with cleanup workers
- Add interactive sessions: pod binding, command execution, SSE streaming,
ephemeral preview URLs with session cleanup workers
- Add GET /workers/pool endpoint for aggregate capacity and queue depth
- Add sessions:read and sessions:execute auth scopes
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The `sdlc merge` command reads the Branch field from the feature manifest
on main, but `sdlc branch create` was only committing that state to the
feature branch (via the executor's CommitAndPush). This caused merge to
fail with "feature has no branch".
Two changes:
1. cmd/sdlc/cmd_branch.go: commit .sdlc/ state to main before
`git checkout -b`, ensuring Branch metadata is on main where merge
reads it.
2. internal/worker/sdlc_executor.go: reset workspace to main
(`git fetch && git checkout main && git reset --hard origin/main`)
before each SDLC task, preventing cross-task branch contamination
from commands that switch branches.
Also updates foundary cookbook with architect fallback pattern and
on_error: continue for steps that may fail during early lifecycle.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Two critical fixes for build retry behavior:
1. pod_git_operations.go: Normalize remote URL before comparison
- Clone stores URL with token (https://token:x@host/...)
- Subsequent retry compares against URL without token
- Without normalization, URLs never match, so workspace is always
cleared and re-cloned, losing all code from previous attempt
2. build_audit.go: Clear stale result data when task transitions to running
- When a failed task is retried, UpdateStatus only updated status/worker_id
- Result and completed_at from previous failure remained, causing
API to return stale failure data even while retry was running
- Now clears result, completed_at and resets started_at when
status is set to "running"
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Critical fix: WorkersHandler was missing workService dependency, causing
500 errors when workers tried to fail tasks. This caused tasks to get
stuck in "running" state permanently.
Also adds:
- /work/tasks endpoint for debugging all tasks across projects
- List method to WorkQueue interface for admin views
- HTTP client tests for api_client.go and claudebox/client.go (48 tests)
- Split work.go DTOs into work_dto.go to stay under 500 lines
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add WaitGroup for graceful shutdown of in-flight tasks
- Change replicas to 1 with Recreate strategy (RWO PVC limitation)
- Optimize Dockerfile: combine RUN commands for smaller layers
- Add compiled binaries to .gitignore
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Implements horizontally-scalable worker pool architecture:
- claudebox-sidecar: HTTP server for Claude Code, git, and SDLC ops
- rdev-worker: standalone worker binary polling rdev-api for tasks
- HTTP client adapter for sidecar communication
- HPA with custom Prometheus metrics for autoscaling
- ServiceMonitor for metrics scraping
Code review fixes applied:
- URL-encode query parameters in GitStatus (Critical #1)
- Remove unused shellQuote function (Critical #2)
- Use stdlib strings.Split/TrimSpace (Critical #3)
- Add version injection via ldflags (Warning #4)
- Add debug logging for swallowed git/sdlc errors (Warning #5, #6)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Three coordinated fixes for CI pipeline race conditions:
1. Woodpecker step dependencies: Added depends_on: [deps] to all 6 component
templates (service, worker, cli, app-astro, app-react, app-nextjs) so build
steps wait for go work sync to complete.
2. Idempotent resource provisioning: Modified provisionResources() to check
for existing database/cache before creating, preventing "already exists"
errors on component re-adds.
3. Batch component endpoint: POST /projects/{id}/components/batch enables
atomic multi-component additions in a single git commit. Validates all
components upfront, provisions infra sequentially, commits code components
atomically.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Worker template fixes:
- Replace panic() with logger.Error() + os.Exit(1) for config errors
- Remove double-timeout application (context + middleware)
- Add error message truncation to prevent log bloat
- Use named constants for shutdown grace period and stale check interval
Skeleton pkg/auth fixes:
- Fix error wrapping to use %w consistently in jwt.go
- Add GetUserOrError() as safe alternative to MustGetUser() panic
Skeleton pkg/queue fixes:
- Check RowsAffected() errors instead of ignoring them
- Add input validation to EnqueueWithOptions (require job type, cap retries)
- Add log truncation for error messages
- Fix inaccurate doc comment claiming exponential backoff
Worker timeout consolidation:
- Add internal/worker/timeouts.go with named constants
- Migrate all workers to use timeout constants
Cleanup:
- Remove obsolete slack-preparation-thoughts.md files
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Major changes:
- Add internal/logging package with field constants, context propagation,
sensitive data auto-redaction, and per-component log levels
- Add worker timeout constants (TimeoutQuickOp, TimeoutHealthCheck, etc.)
- Extend SDLC with callback handlers, generate endpoints, and executor
- Add new cookbook trees for aeries and slackpath progression
- Add skeleton templates for queue, realtime, and microservices
- Add worker component template with async job processing
- Refactor services and handlers to use new logging infrastructure
- Split component.go into component_infra.go and component_listing.go
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add /diagnostics endpoint for system health overview
- Add external health worker for monitoring Gitea, Woodpecker, Registry
- Add health check methods to Gitea and Woodpecker clients
- Remove hardcoded fallback projects (pantheon, aeries)
- Add diagnostics domain types and service layer
- Add comprehensive tests for diagnostics handler and service
- Fix tests to use registered test project instead of hardcoded one
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add WorkErrorCode type with RATE_LIMITED, AUTH_FAILED, TIMEOUT, STALE_WORKER, AGENT_ERROR, INVALID_SPEC
- Add ClassifyAgentError function to detect error patterns from stderr
- Add error_code column to work_queue table (migration 016)
- Add FailWithCode method to WorkQueue interface and implementations
- Update RequeueStaleWithIDs to mark permanently failed tasks with STALE_WORKER
- Add ErrorCode to BuildResult for API responses
- Update work executor to classify errors before failing tasks
This enables users to see actual failure reasons (e.g., "RATE_LIMITED") instead of
builds stuck in "running" state forever when Claude hits rate limits.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When a worker dies mid-build, queue maintenance now updates both
work_queue and build_audit tables when requeuing stale tasks.
This prevents builds from showing "running" forever in the API.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add session_id, model, allowed_tools to Claude request handler
- Update OpenAPI spec for Claude endpoint
- Fix BuildExecutor constructor call sites
- Rewrite landing-test.sh for agent-driven flow
- Fix cookbook documentation for correct API format
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Root cause: WorkerService.ClaimTask() was modifying the audit entry
in memory but never persisting it to the database. This caused build
tasks to remain stuck at "pending" status even after being claimed.
Changes:
- Add UpdateStatus method to port.BuildAudit interface
- Implement UpdateStatus in postgres.BuildAuditRepository
- Fix ClaimTask to call audit.UpdateStatus() to persist status
- Add test coverage for audit update during task claim
- Update all mock implementations
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>