jordan/rdev

Author	SHA1	Message	Date
jordan	cefc15aa7d	fix(worker): include stdout in error messages when Claude command fails [CI: ci/woodpecker/push/woodpecker pipeline failed] Auth errors like "OAuth token has expired" were lost because Claude writes them to stdout, not stderr. The error message only showed kubectl's generic "command terminated with exit code 1". Now includes both stdout and stderr in the error, making failures immediately diagnosable. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-10 17:55:46 -07:00
jordan	9226454b85	feat: label-based undeploy, GC reconciliation, checkout/sessions, pool status [CI: ci/woodpecker/push/woodpecker pipeline failed] - Add UndeployAll() using label selectors to clean up monorepo components on project deletion (replaces name-based Undeploy in DeleteProject and the direct undeploy handler) - Add ResourceGC background worker that periodically finds K8s resources whose project label has no matching DB record, deletes after 1h safety window - Widen deployer client type from *kubernetes.Clientset to kubernetes.Interface for testability - UndeployAll accumulates errors via errors.Join instead of failing fast - Add checkout/checkin sidecar dev flow: temporary git tokens, branch checkout, review on checkin with cleanup workers - Add interactive sessions: pod binding, command execution, SSE streaming, ephemeral preview URLs with session cleanup workers - Add GET /workers/pool endpoint for aggregate capacity and queue depth - Add sessions:read and sessions:execute auth scopes Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-09 19:11:28 -07:00
jordan	6ec2a4fea3	fix(sdlc): persist branch metadata on main before feature branch creation [CI: ci/woodpecker/push/woodpecker pipeline failed] The `sdlc merge` command reads the Branch field from the feature manifest on main, but `sdlc branch create` was only committing that state to the feature branch (via the executor's CommitAndPush). This caused merge to fail with "feature has no branch". Two changes: 1. cmd/sdlc/cmd_branch.go: commit .sdlc/ state to main before `git checkout -b`, ensuring Branch metadata is on main where merge reads it. 2. internal/worker/sdlc_executor.go: reset workspace to main (`git fetch && git checkout main && git reset --hard origin/main`) before each SDLC task, preventing cross-task branch contamination from commands that switch branches. Also updates foundary cookbook with architect fallback pattern and on_error: continue for steps that may fail during early lifecycle. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-09 08:36:10 -07:00
jordan	9833725f31	fix: preserve work on build retry, clear stale audit data Two critical fixes for build retry behavior: 1. pod_git_operations.go: Normalize remote URL before comparison - Clone stores URL with token (https://token:x@host/...) - Subsequent retry compares against URL without token - Without normalization, URLs never match, so workspace is always cleared and re-cloned, losing all code from previous attempt 2. build_audit.go: Clear stale result data when task transitions to running - When a failed task is retried, UpdateStatus only updated status/worker_id - Result and completed_at from previous failure remained, causing API to return stale failure data even while retry was running - Now clears result, completed_at and resets started_at when status is set to "running" Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 08:40:36 -07:00
jordan	d74efb75ff	fix: wire workService to WorkersHandler and add /work/tasks endpoint Critical fix: WorkersHandler was missing workService dependency, causing 500 errors when workers tried to fail tasks. This caused tasks to get stuck in "running" state permanently. Also adds: - /work/tasks endpoint for debugging all tasks across projects - List method to WorkQueue interface for admin views - HTTP client tests for api_client.go and claudebox/client.go (48 tests) - Split work.go DTOs into work_dto.go to stay under 500 lines Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-06 10:35:39 -07:00
jordan	d7a6f37593	fix: worker graceful shutdown and RWO PVC compatibility [CI: ci/woodpecker/push/woodpecker pipeline failed] - Add WaitGroup for graceful shutdown of in-flight tasks - Change replicas to 1 with Recreate strategy (RWO PVC limitation) - Optimize Dockerfile: combine RUN commands for smaller layers - Add compiled binaries to .gitignore Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-06 00:35:00 -07:00
jordan	3b35900a2d	feat: enterprise worker pool with HTTP sidecar pattern Implements horizontally-scalable worker pool architecture: - claudebox-sidecar: HTTP server for Claude Code, git, and SDLC ops - rdev-worker: standalone worker binary polling rdev-api for tasks - HTTP client adapter for sidecar communication - HPA with custom Prometheus metrics for autoscaling - ServiceMonitor for metrics scraping Code review fixes applied: - URL-encode query parameters in GitStatus (Critical #1) - Remove unused shellQuote function (Critical #2) - Use stdlib strings.Split/TrimSpace (Critical #3) - Add version injection via ldflags (Warning #4) - Add debug logging for swallowed git/sdlc errors (Warning #5, #6) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 16:21:11 -07:00
jordan	853ec4cf81	fix: go.work race condition with batch components and idempotent provisioning Three coordinated fixes for CI pipeline race conditions: 1. Woodpecker step dependencies: Added depends_on: [deps] to all 6 component templates (service, worker, cli, app-astro, app-react, app-nextjs) so build steps wait for go work sync to complete. 2. Idempotent resource provisioning: Modified provisionResources() to check for existing database/cache before creating, preventing "already exists" errors on component re-adds. 3. Batch component endpoint: POST /projects/{id}/components/batch enables atomic multi-component additions in a single git commit. Validates all components upfront, provisions infra sequentially, commits code components atomically. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 12:31:40 -07:00
jordan	53862c773b	fix: resolve systemic debt in worker and skeleton templates Worker template fixes: - Replace panic() with logger.Error() + os.Exit(1) for config errors - Remove double-timeout application (context + middleware) - Add error message truncation to prevent log bloat - Use named constants for shutdown grace period and stale check interval Skeleton pkg/auth fixes: - Fix error wrapping to use %w consistently in jwt.go - Add GetUserOrError() as safe alternative to MustGetUser() panic Skeleton pkg/queue fixes: - Check RowsAffected() errors instead of ignoring them - Add input validation to EnqueueWithOptions (require job type, cap retries) - Add log truncation for error messages - Fix inaccurate doc comment claiming exponential backoff Worker timeout consolidation: - Add internal/worker/timeouts.go with named constants - Migrate all workers to use timeout constants Cleanup: - Remove obsolete slack-preparation-thoughts.md files Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-04 23:44:55 -07:00
jordan	d69da6d627	feat: add structured logging infrastructure and SDLC extensions Major changes: - Add internal/logging package with field constants, context propagation, sensitive data auto-redaction, and per-component log levels - Add worker timeout constants (TimeoutQuickOp, TimeoutHealthCheck, etc.) - Extend SDLC with callback handlers, generate endpoints, and executor - Add new cookbook trees for aeries and slackpath progression - Add skeleton templates for queue, realtime, and microservices - Add worker component template with async job processing - Refactor services and handlers to use new logging infrastructure - Split component.go into component_infra.go and component_listing.go Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-04 22:56:04 -07:00
jordan	210064d490	feat: add diagnostics endpoint and external health monitoring - Add /diagnostics endpoint for system health overview - Add external health worker for monitoring Gitea, Woodpecker, Registry - Add health check methods to Gitea and Woodpecker clients - Remove hardcoded fallback projects (pantheon, aeries) - Add diagnostics domain types and service layer - Add comprehensive tests for diagnostics handler and service - Fix tests to use registered test project instead of hardcoded one Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-03 19:10:56 -07:00
jordan	b5fdf35f1b	feat: add WorkerService.FailTask for audit updates + visual verification scaffolding - Add FailTask to WorkerService to update build_audit on failure path (fixes bug where audit showed "running" when task actually failed) - Add WorkServiceFailer interface to avoid circular dependency - Add VerifyExecutor with Playwright-based visual verification - Add verify domain types (VerifySpec, VerifyResult, screenshot capture) - Wire VerifyExecutor placeholder into WorkExecutor (impl in Week 2) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-03 00:09:16 -07:00
jordan	cfba724f8a	feat: add work task error classification and user-facing error codes - Add WorkErrorCode type with RATE_LIMITED, AUTH_FAILED, TIMEOUT, STALE_WORKER, AGENT_ERROR, INVALID_SPEC - Add ClassifyAgentError function to detect error patterns from stderr - Add error_code column to work_queue table (migration 016) - Add FailWithCode method to WorkQueue interface and implementations - Update RequeueStaleWithIDs to mark permanently failed tasks with STALE_WORKER - Add ErrorCode to BuildResult for API responses - Update work executor to classify errors before failing tasks This enables users to see actual failure reasons (e.g., "RATE_LIMITED") instead of builds stuck in "running" state forever when Claude hits rate limits. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-03 00:07:34 -07:00
jordan	c280a92012	feat: add operations audit system and template improvements Operations Audit (new feature): - Add Operation domain model with status tracking (pending, running, completed, failed, cancelled) - Add OperationRepository with PostgreSQL implementation - Add OperationService for CRUD and lifecycle management - Add operations handlers (list, get, cancel endpoints) - Add migration 015_operations.sql for operations table - Add operation cleanup worker for stale operation handling - Add ErrOperationNotFound to domain errors Template Improvements: - Add CLAUDE.md configuration files to astro-landing, default, and go-api templates - Fix PORT template variable usage in nginx configs for app templates - Add replace directives for local pkg module in Go templates - Simplify Go service/worker Dockerfiles for workspace builds - Fix TypeScript error in logger template Other: - Refactor landing-test.sh cookbook script - Update CLAUDE.md version reference Note: Some files exceed 500-line limit (pre-existing debt + new feature) - component.go: 550 lines (unchanged, pre-existing) - main.go: 522 lines (added operations wiring) - operation_repo.go: 569 lines (new, needs splitting) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-01 19:08:57 -07:00
jordan	c59d348040	chore: prepare for composable monorepo template implementation This commit captures the current state before implementing the composable monorepo template system. Key changes included: Infrastructure: - Add CockroachDB provisioner adapter for database provisioning - Add Redis provisioner adapter for cache provisioning - Add build events system with PostgreSQL storage - Add WebSocket endpoint for real-time build progress Code agent improvements: - Fix Claude Code adapter to use default allowed tools instead of dangerously-skip-permissions - Add context-aware stream closing for cancellation support - Improve parser tests for edge cases Build system: - Add build event constants and metrics - Remove deprecated git_operations.go (replaced by pod_git_operations.go) - Add rollback logic for multi-step provisioning operations Documentation: - Add composable-monorepo feature documentation - Add DNS/Cloudflare service documentation - Update deployment and troubleshooting guides Cookbooks: - Add fullstack-app cookbook - Refactor landing-test with shared library Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-31 11:39:28 -07:00
jordan	910bcb62e1	fix: Sync build audit with work queue when stale tasks are requeued When a worker dies mid-build, queue maintenance now updates both work_queue and build_audit tables when requeuing stale tasks. This prevents builds from showing "running" forever in the API. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-31 02:07:52 -07:00
jordan	9c15976f86	feat: Complete Claude endpoint and update cookbook - Add session_id, model, allowed_tools to Claude request handler - Update OpenAPI spec for Claude endpoint - Fix BuildExecutor constructor call sites - Rewrite landing-test.sh for agent-driven flow - Fix cookbook documentation for correct API format Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-29 21:25:29 -07:00
jordan	4a18b1cd07	fix: Persist build audit status when worker claims task Root cause: WorkerService.ClaimTask() was modifying the audit entry in memory but never persisting it to the database. This caused build tasks to remain stuck at "pending" status even after being claimed. Changes: - Add UpdateStatus method to port.BuildAudit interface - Implement UpdateStatus in postgres.BuildAuditRepository - Fix ClaimTask to call audit.UpdateStatus() to persist status - Add test coverage for audit update during task claim - Update all mock implementations Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-29 21:25:04 -07:00
jordan	bc47e426b0	feat: Add CI pipeline proxy, DNS alias management, and worker executor system - Add ListPipelines/GetPipeline to CIProvider port with Woodpecker adapter - Add DNS alias endpoints: GET/POST/DELETE /projects/{id}/domains - Implement worker executor daemon, build executor, and git operations - Add build service, worker service, and build audit tracking - Add worker registry with PostgreSQL adapter and migration - Add multi-provider code agent interface (Claude Code + OpenCode) - Add create-and-build combo endpoint - Update landing-page cookbook to reflect all gaps closed - Fix tech debt: unified validation, auth scopes, error wrapping, slog patterns Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-27 21:05:28 -07:00
jordan	72d16929ca	feat: Implement hexagonal architecture with services, webhooks, queue, and telemetry Major refactoring to hexagonal (ports & adapters) architecture: - Add service layer (apikey_service, project_service) for business logic - Add webhook system with dispatcher and delivery tracking - Add command queue with priority-based processing - Add rate limiting with sliding window algorithm - Add audit logging for command execution - Add OpenTelemetry integration (traces, metrics, spans) - Add circuit breaker for fault tolerance - Add cached repository wrapper for performance - Add comprehensive validation package - Add Kubernetes client integration for pod management - Add database migrations (allowed_ips, audit_log, rate_limiting, queue, webhooks) - Add network policy and PodDisruptionBudget for k8s - Remove legacy executor and projects/registry packages - Untrack secrets.yaml (now managed via envault) - Add coverage.out to .gitignore - Add e2e test infrastructure with docker-compose - Add comprehensive documentation (API, architecture, operations, plans) - Add golangci-lint config and pre-commit hook Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-25 19:57:46 -07:00

20 Commits
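
Commit 9833725f31 describes normalizing the git remote URL before comparing it, because the clone stores the URL with an embedded token while the retry compares against the token-free form. The following is a minimal sketch of that idea, not the code in pod_git_operations.go: the function names `normalizeRemoteURL` and `sameRemote` and the host `git.example.com` are hypothetical, and it assumes that stripping the userinfo part of the URL is the only normalization needed.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// normalizeRemoteURL (hypothetical name) removes embedded credentials
// such as "token:x@" so that two URLs for the same repository compare equal.
func normalizeRemoteURL(raw string) (string, error) {
	u, err := url.Parse(strings.TrimSpace(raw))
	if err != nil {
		return "", fmt.Errorf("parse remote URL: %w", err)
	}
	u.User = nil // drop "user:password@" if present
	return u.String(), nil
}

// sameRemote (hypothetical name) reports whether two remote URLs point
// at the same repository once credentials are ignored.
func sameRemote(a, b string) (bool, error) {
	na, err := normalizeRemoteURL(a)
	if err != nil {
		return false, err
	}
	nb, err := normalizeRemoteURL(b)
	if err != nil {
		return false, err
	}
	return na == nb, nil
}

func main() {
	// Assumed example values: the URL recorded at clone time carries a token,
	// the URL requested on retry does not.
	stored := "https://token:x@git.example.com/jordan/rdev.git"
	wanted := "https://git.example.com/jordan/rdev.git"

	same, err := sameRemote(stored, wanted)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("same remote:", same)
}
```

With a comparison like this, a retry could keep the existing workspace when the remotes match instead of clearing it and re-cloning.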