jordan/rdev

Author	SHA1	Message	Date
jordan	9226454b85	feat: label-based undeploy, GC reconciliation, checkout/sessions, pool status - Add UndeployAll() using label selectors to clean up monorepo components on project deletion (replaces name-based Undeploy in DeleteProject and the direct undeploy handler) - Add ResourceGC background worker that periodically finds K8s resources whose project label has no matching DB record, deletes after 1h safety window - Widen deployer client type from *kubernetes.Clientset to kubernetes.Interface for testability - UndeployAll accumulates errors via errors.Join instead of failing fast - Add checkout/checkin sidecar dev flow: temporary git tokens, branch checkout, review on checkin with cleanup workers - Add interactive sessions: pod binding, command execution, SSE streaming, ephemeral preview URLs with session cleanup workers - Add GET /workers/pool endpoint for aggregate capacity and queue depth - Add sessions:read and sessions:execute auth scopes Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-09 19:11:28 -07:00
jordan	6ec2a4fea3	fix(sdlc): persist branch metadata on main before feature branch creation The `sdlc merge` command reads the Branch field from the feature manifest on main, but `sdlc branch create` was only committing that state to the feature branch (via the executor's CommitAndPush). This caused merge to fail with "feature has no branch". Two changes: 1. cmd/sdlc/cmd_branch.go: commit .sdlc/ state to main before `git checkout -b`, ensuring Branch metadata is on main where merge reads it. 2. internal/worker/sdlc_executor.go: reset workspace to main (`git fetch && git checkout main && git reset --hard origin/main`) before each SDLC task, preventing cross-task branch contamination from commands that switch branches. Also updates foundary cookbook with architect fallback pattern and on_error: continue for steps that may fail during early lifecycle. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-09 08:36:10 -07:00
jordan	9833725f31	fix: preserve work on build retry, clear stale audit data Two critical fixes for build retry behavior: 1. pod_git_operations.go: Normalize remote URL before comparison - Clone stores URL with token (https://token:x@host/...) - Subsequent retry compares against URL without token - Without normalization, URLs never match, so workspace is always cleared and re-cloned, losing all code from previous attempt 2. build_audit.go: Clear stale result data when task transitions to running - When a failed task is retried, UpdateStatus only updated status/worker_id - Result and completed_at from previous failure remained, causing API to return stale failure data even while retry was running - Now clears result, completed_at and resets started_at when status is set to "running" Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-07 08:40:36 -07:00
jordan	d74efb75ff	fix: wire workService to WorkersHandler and add /work/tasks endpoint Critical fix: WorkersHandler was missing workService dependency, causing 500 errors when workers tried to fail tasks. This caused tasks to get stuck in "running" state permanently. Also adds: - /work/tasks endpoint for debugging all tasks across projects - List method to WorkQueue interface for admin views - HTTP client tests for api_client.go and claudebox/client.go (48 tests) - Split work.go DTOs into work_dto.go to stay under 500 lines Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-06 10:35:39 -07:00
jordan	d7a6f37593	fix: worker graceful shutdown and RWO PVC compatibility - Add WaitGroup for graceful shutdown of in-flight tasks - Change replicas to 1 with Recreate strategy (RWO PVC limitation) - Optimize Dockerfile: combine RUN commands for smaller layers - Add compiled binaries to .gitignore Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-06 00:35:00 -07:00
jordan	3b35900a2d	feat: enterprise worker pool with HTTP sidecar pattern Implements horizontally-scalable worker pool architecture: - claudebox-sidecar: HTTP server for Claude Code, git, and SDLC ops - rdev-worker: standalone worker binary polling rdev-api for tasks - HTTP client adapter for sidecar communication - HPA with custom Prometheus metrics for autoscaling - ServiceMonitor for metrics scraping Code review fixes applied: - URL-encode query parameters in GitStatus (Critical #1) - Remove unused shellQuote function (Critical #2) - Use stdlib strings.Split/TrimSpace (Critical #3) - Add version injection via ldflags (Warning #4) - Add debug logging for swallowed git/sdlc errors (Warning #5, #6) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 16:21:11 -07:00
jordan	853ec4cf81	fix: go.work race condition with batch components and idempotent provisioning Three coordinated fixes for CI pipeline race conditions: 1. Woodpecker step dependencies: Added depends_on: [deps] to all 6 component templates (service, worker, cli, app-astro, app-react, app-nextjs) so build steps wait for go work sync to complete. 2. Idempotent resource provisioning: Modified provisionResources() to check for existing database/cache before creating, preventing "already exists" errors on component re-adds. 3. Batch component endpoint: POST /projects/{id}/components/batch enables atomic multi-component additions in a single git commit. Validates all components upfront, provisions infra sequentially, commits code components atomically. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-05 12:31:40 -07:00
jordan	53862c773b	fix: resolve systemic debt in worker and skeleton templates Worker template fixes: - Replace panic() with logger.Error() + os.Exit(1) for config errors - Remove double-timeout application (context + middleware) - Add error message truncation to prevent log bloat - Use named constants for shutdown grace period and stale check interval Skeleton pkg/auth fixes: - Fix error wrapping to use %w consistently in jwt.go - Add GetUserOrError() as safe alternative to MustGetUser() panic Skeleton pkg/queue fixes: - Check RowsAffected() errors instead of ignoring them - Add input validation to EnqueueWithOptions (require job type, cap retries) - Add log truncation for error messages - Fix inaccurate doc comment claiming exponential backoff Worker timeout consolidation: - Add internal/worker/timeouts.go with named constants - Migrate all workers to use timeout constants Cleanup: - Remove obsolete slack-preparation-thoughts.md files Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-04 23:44:55 -07:00
jordan	d69da6d627	feat: add structured logging infrastructure and SDLC extensions Major changes: - Add internal/logging package with field constants, context propagation, sensitive data auto-redaction, and per-component log levels - Add worker timeout constants (TimeoutQuickOp, TimeoutHealthCheck, etc.) - Extend SDLC with callback handlers, generate endpoints, and executor - Add new cookbook trees for aeries and slackpath progression - Add skeleton templates for queue, realtime, and microservices - Add worker component template with async job processing - Refactor services and handlers to use new logging infrastructure - Split component.go into component_infra.go and component_listing.go Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-04 22:56:04 -07:00
jordan	210064d490	feat: add diagnostics endpoint and external health monitoring - Add /diagnostics endpoint for system health overview - Add external health worker for monitoring Gitea, Woodpecker, Registry - Add health check methods to Gitea and Woodpecker clients - Remove hardcoded fallback projects (pantheon, aeries) - Add diagnostics domain types and service layer - Add comprehensive tests for diagnostics handler and service - Fix tests to use registered test project instead of hardcoded one Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-03 19:10:56 -07:00
jordan	b5fdf35f1b	feat: add WorkerService.FailTask for audit updates + visual verification scaffolding - Add FailTask to WorkerService to update build_audit on failure path (fixes bug where audit showed "running" when task actually failed) - Add WorkServiceFailer interface to avoid circular dependency - Add VerifyExecutor with Playwright-based visual verification - Add verify domain types (VerifySpec, VerifyResult, screenshot capture) - Wire VerifyExecutor placeholder into WorkExecutor (impl in Week 2) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-03 00:09:16 -07:00
jordan	cfba724f8a	feat: add work task error classification and user-facing error codes - Add WorkErrorCode type with RATE_LIMITED, AUTH_FAILED, TIMEOUT, STALE_WORKER, AGENT_ERROR, INVALID_SPEC - Add ClassifyAgentError function to detect error patterns from stderr - Add error_code column to work_queue table (migration 016) - Add FailWithCode method to WorkQueue interface and implementations - Update RequeueStaleWithIDs to mark permanently failed tasks with STALE_WORKER - Add ErrorCode to BuildResult for API responses - Update work executor to classify errors before failing tasks This enables users to see actual failure reasons (e.g., "RATE_LIMITED") instead of builds stuck in "running" state forever when Claude hits rate limits. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-03 00:07:34 -07:00
jordan	c280a92012	feat: add operations audit system and template improvements Operations Audit (new feature): - Add Operation domain model with status tracking (pending, running, completed, failed, cancelled) - Add OperationRepository with PostgreSQL implementation - Add OperationService for CRUD and lifecycle management - Add operations handlers (list, get, cancel endpoints) - Add migration 015_operations.sql for operations table - Add operation cleanup worker for stale operation handling - Add ErrOperationNotFound to domain errors Template Improvements: - Add CLAUDE.md configuration files to astro-landing, default, and go-api templates - Fix PORT template variable usage in nginx configs for app templates - Add replace directives for local pkg module in Go templates - Simplify Go service/worker Dockerfiles for workspace builds - Fix TypeScript error in logger template Other: - Refactor landing-test.sh cookbook script - Update CLAUDE.md version reference Note: Some files exceed 500-line limit (pre-existing debt + new feature) - component.go: 550 lines (unchanged, pre-existing) - main.go: 522 lines (added operations wiring) - operation_repo.go: 569 lines (new, needs splitting) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-01 19:08:57 -07:00
jordan	c59d348040	chore: prepare for composable monorepo template implementation This commit captures the current state before implementing the composable monorepo template system. Key changes included: Infrastructure: - Add CockroachDB provisioner adapter for database provisioning - Add Redis provisioner adapter for cache provisioning - Add build events system with PostgreSQL storage - Add WebSocket endpoint for real-time build progress Code agent improvements: - Fix Claude Code adapter to use default allowed tools instead of dangerously-skip-permissions - Add context-aware stream closing for cancellation support - Improve parser tests for edge cases Build system: - Add build event constants and metrics - Remove deprecated git_operations.go (replaced by pod_git_operations.go) - Add rollback logic for multi-step provisioning operations Documentation: - Add composable-monorepo feature documentation - Add DNS/Cloudflare service documentation - Update deployment and troubleshooting guides Cookbooks: - Add fullstack-app cookbook - Refactor landing-test with shared library Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-31 11:39:28 -07:00
jordan	910bcb62e1	fix: Sync build audit with work queue when stale tasks are requeued When a worker dies mid-build, queue maintenance now updates both work_queue and build_audit tables when requeuing stale tasks. This prevents builds from showing "running" forever in the API. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-31 02:07:52 -07:00
jordan	9c15976f86	feat: Complete Claude endpoint and update cookbook - Add session_id, model, allowed_tools to Claude request handler - Update OpenAPI spec for Claude endpoint - Fix BuildExecutor constructor call sites - Rewrite landing-test.sh for agent-driven flow - Fix cookbook documentation for correct API format Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-29 21:25:29 -07:00
jordan	4a18b1cd07	fix: Persist build audit status when worker claims task Root cause: WorkerService.ClaimTask() was modifying the audit entry in memory but never persisting it to the database. This caused build tasks to remain stuck at "pending" status even after being claimed. Changes: - Add UpdateStatus method to port.BuildAudit interface - Implement UpdateStatus in postgres.BuildAuditRepository - Fix ClaimTask to call audit.UpdateStatus() to persist status - Add test coverage for audit update during task claim - Update all mock implementations Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-29 21:25:04 -07:00
jordan	bc47e426b0	feat: Add CI pipeline proxy, DNS alias management, and worker executor system - Add ListPipelines/GetPipeline to CIProvider port with Woodpecker adapter - Add DNS alias endpoints: GET/POST/DELETE /projects/{id}/domains - Implement worker executor daemon, build executor, and git operations - Add build service, worker service, and build audit tracking - Add worker registry with PostgreSQL adapter and migration - Add multi-provider code agent interface (Claude Code + OpenCode) - Add create-and-build combo endpoint - Update landing-page cookbook to reflect all gaps closed - Fix tech debt: unified validation, auth scopes, error wrapping, slog patterns Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-27 21:05:28 -07:00
jordan	72d16929ca	feat: Implement hexagonal architecture with services, webhooks, queue, and telemetry Major refactoring to hexagonal (ports & adapters) architecture: - Add service layer (apikey_service, project_service) for business logic - Add webhook system with dispatcher and delivery tracking - Add command queue with priority-based processing - Add rate limiting with sliding window algorithm - Add audit logging for command execution - Add OpenTelemetry integration (traces, metrics, spans) - Add circuit breaker for fault tolerance - Add cached repository wrapper for performance - Add comprehensive validation package - Add Kubernetes client integration for pod management - Add database migrations (allowed_ips, audit_log, rate_limiting, queue, webhooks) - Add network policy and PodDisruptionBudget for k8s - Remove legacy executor and projects/registry packages - Untrack secrets.yaml (now managed via envault) - Add coverage.out to .gitignore - Add e2e test infrastructure with docker-compose - Add comprehensive documentation (API, architecture, operations, plans) - Add golangci-lint config and pre-commit hook Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-25 19:57:46 -07:00
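
Commit 9833725f31 describes normalizing the remote URL before comparing it, so that a clone URL carrying a token matches the same URL without one. A minimal sketch of that idea follows, assuming URL-form remotes only; `normalizeRemote`, `sameRemote`, and the example host are hypothetical names, not taken from pod_git_operations.go.

```go
package main

import (
	"fmt"
	"net/url"
)

// normalizeRemote strips any userinfo (token:password@) from a remote URL so
// that a token-bearing clone URL and its plain form compare equal.
// Hypothetical helper: input that does not parse as a URL (for example an
// scp-style "git@host:path" remote) is returned unchanged.
func normalizeRemote(raw string) string {
	u, err := url.Parse(raw)
	if err != nil {
		return raw
	}
	u.User = nil
	return u.String()
}

// sameRemote reports whether two remote URLs refer to the same repository
// once credentials are ignored.
func sameRemote(a, b string) bool {
	return normalizeRemote(a) == normalizeRemote(b)
}

func main() {
	stored := "https://token:x@git.example.com/jordan/rdev.git"
	wanted := "https://git.example.com/jordan/rdev.git"
	fmt.Println(sameRemote(stored, wanted)) // prints: true
}
```

With a comparison like this, a retry would see the existing workspace as matching and could keep it rather than clearing and re-cloning.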

19 Commits