- Add WaitGroup for graceful shutdown of in-flight tasks
- Change replicas to 1 with Recreate strategy (RWO PVC limitation)
- Optimize Dockerfile: combine RUN commands for smaller layers
- Add compiled binaries to .gitignore
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Implements horizontally-scalable worker pool architecture:
- claudebox-sidecar: HTTP server for Claude Code, git, and SDLC ops
- rdev-worker: standalone worker binary polling rdev-api for tasks
- HTTP client adapter for sidecar communication
- HPA with custom Prometheus metrics for autoscaling
- ServiceMonitor for metrics scraping
Code review fixes applied:
- URL-encode query parameters in GitStatus (Critical #1)
- Remove unused shellQuote function (Critical #2)
- Use stdlib strings.Split/TrimSpace (Critical #3)
- Add version injection via ldflags (Warning #4)
- Add debug logging for swallowed git/sdlc errors (Warning #5, #6)
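For the GitStatus fix, a sketch of URL-encoding a query parameter with the standard library. The `/git/status` path and the `path` parameter name are assumptions for illustration; the actual sidecar route may differ.

```go
package main

import (
	"fmt"
	"net/url"
)

// gitStatusURL builds a request URL with the repository path passed as an
// encoded query parameter. Endpoint path and parameter name are hypothetical.
func gitStatusURL(base, repoPath string) (string, error) {
	u, err := url.Parse(base)
	if err != nil {
		return "", fmt.Errorf("parse base URL: %w", err)
	}
	u.Path = "/git/status"
	q := url.Values{}
	q.Set("path", repoPath) // Encode() escapes spaces, '&', '=' and '/'
	u.RawQuery = q.Encode()
	return u.String(), nil
}

func main() {
	s, err := gitStatusURL("http://localhost:8080", "/work/my repo&x=1")
	if err != nil {
		panic(err)
	}
	fmt.Println(s)
}
```

Without the encoding step, the `&x=1` in the path above would be read by the server as a second query parameter.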
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Updates slackpath-2 and slackpath-4 to use POST /projects/{id}/components/batch
for adding multiple Go components atomically in a single git commit. This
prevents the go.work race condition where individual commits reference modules
that don't exist yet.
Also adds on_error: continue for infrastructure provisioning steps that may
already exist from skeleton (redis, postgres).
Verified:
- slackpath-1: ✅ Complete (wait_build polled 5 times, detected success)
- slackpath-2: ✅ Complete (wait_build polled 111 times, detected success)
- slackpath-3: ✅ Infrastructure passed (further testing limited by worker capacity)
- slackpath-4: ✅ Infrastructure passed (further testing limited by worker capacity)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Three coordinated fixes for CI pipeline race conditions:
1. Woodpecker step dependencies: Added depends_on: [deps] to all 6 component
   templates (service, worker, cli, app-astro, app-react, app-nextjs) so build
   steps wait for go work sync to complete.
2. Idempotent resource provisioning: Modified provisionResources() to check
   for existing database/cache before creating, preventing "already exists"
   errors on component re-adds.
3. Batch component endpoint: POST /projects/{id}/components/batch enables
   atomic multi-component additions in a single git commit. Validates all
   components upfront, provisions infra sequentially, commits code components
   atomically.
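The idempotent provisioning in item 2 can be sketched roughly as a check-then-create helper. `ResourceStore`, `memStore`, and `ensureResource` are hypothetical stand-ins; the real provisionResources() talks to actual database and cache backends.

```go
package main

import (
	"errors"
	"fmt"
)

var errAlreadyExists = errors.New("already exists")

// ResourceStore is a hypothetical abstraction over a database or cache backend.
type ResourceStore interface {
	Exists(name string) (bool, error)
	Create(name string) error
}

// memStore is an in-memory stand-in used only for this sketch.
type memStore struct {
	items   map[string]bool
	creates int
}

func (m *memStore) Exists(name string) (bool, error) { return m.items[name], nil }

func (m *memStore) Create(name string) error {
	if m.items[name] {
		return errAlreadyExists
	}
	m.items[name] = true
	m.creates++
	return nil
}

// ensureResource creates the resource only when it is missing and reports
// whether a create actually happened.
func ensureResource(s ResourceStore, name string) (bool, error) {
	ok, err := s.Exists(name)
	if err != nil {
		return false, fmt.Errorf("check %s: %w", name, err)
	}
	if ok {
		return false, nil
	}
	if err := s.Create(name); err != nil {
		// Another caller may have created it between the check and the create.
		if errors.Is(err, errAlreadyExists) {
			return false, nil
		}
		return false, fmt.Errorf("create %s: %w", name, err)
	}
	return true, nil
}

func main() {
	s := &memStore{items: map[string]bool{}}
	for i := 0; i < 2; i++ {
		created, err := ensureResource(s, "postgres")
		fmt.Println("created:", created, "err:", err)
	}
}
```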
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Components now automatically receive DATABASE_URL, REDIS_URL, and other
infrastructure credentials when deployed. Previously, credentials were
provisioned and stored but never injected into K8s deployments.
Changes:
- Add fetchProjectCredentials() to component_deploy.go
- Populate spec.Secrets before calling deployer.Deploy()
- Fix slackpath-4 to provision postgres + redis before services
- Add terminology docs to clarify platform vs skeleton code
This completes the infrastructure provisioning flow:
1. add-db → provisions CockroachDB, stores DATABASE_URL
2. add-service → deploys with DATABASE_URL in environment
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Worker template fixes:
- Replace panic() with logger.Error() + os.Exit(1) for config errors
- Remove double-timeout application (context + middleware)
- Add error message truncation to prevent log bloat
- Use named constants for shutdown grace period and stale check interval
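A sketch of the error message truncation idea. The function name, the suffix text, and the rune-based limit are assumptions; the template may truncate by bytes or use a different marker.

```go
package main

import "fmt"

// truncate limits msg to max runes and marks the cut so readers know the
// log line is incomplete. Working on runes avoids splitting a UTF-8 sequence.
func truncate(msg string, max int) string {
	if max <= 0 {
		return ""
	}
	r := []rune(msg)
	if len(r) <= max {
		return msg
	}
	return string(r[:max]) + "...(truncated)"
}

func main() {
	fmt.Println(truncate("connection refused: dial tcp 10.0.0.1:5432", 18))
}
```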
Skeleton pkg/auth fixes:
- Fix error wrapping to use %w consistently in jwt.go
- Add GetUserOrError() as safe alternative to MustGetUser() panic
Skeleton pkg/queue fixes:
- Check RowsAffected() errors instead of ignoring them
- Add input validation to EnqueueWithOptions (require job type, cap retries)
- Add log truncation for error messages
- Fix inaccurate doc comment claiming exponential backoff
Worker timeout consolidation:
- Add internal/worker/timeouts.go with named constants
- Migrate all workers to use timeout constants
Cleanup:
- Remove obsolete slack-preparation-thoughts.md files
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Major changes:
- Add internal/logging package with field constants, context propagation,
  sensitive data auto-redaction, and per-component log levels
- Add worker timeout constants (TimeoutQuickOp, TimeoutHealthCheck, etc.)
- Extend SDLC with callback handlers, generate endpoints, and executor
- Add new cookbook trees for aeries and slackpath progression
- Add skeleton templates for queue, realtime, and microservices
- Add worker component template with async job processing
- Refactor services and handlers to use new logging infrastructure
- Split component.go into component_infra.go and component_listing.go
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Adds AddIngressPath and RemoveIngressPath to the Deployer interface
for managing per-component ingress rules in monorepo projects.
- Implement conflict retry logic for concurrent ingress updates
- Add K8s client interface for testability
- Add comprehensive unit tests for ingress path operations
- Add component deployment and teardown methods to ComponentService
- Update service templates with OpenAPI spec improvements
- Add evolving-app cookbook tree for reference
- Split resources.go into resources_ingress.go for path-based routing
- Split component.go into component_deploy.go for deployment helpers
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Use glob pattern go.work.su[m] instead of go.work.sum to allow
the COPY to succeed even when go.work.sum doesn't exist yet.
This happens on fresh monorepos before dependencies are synced.
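Dockerfile COPY wildcards are documented as following Go's filepath.Match rules, so the pattern can be checked directly in Go. This sketch only demonstrates the matching semantics; how a given builder treats a pattern with zero matches is builder-specific and is not shown here.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// matches reports whether name satisfies the glob pattern.
func matches(pattern, name string) bool {
	ok, err := filepath.Match(pattern, name)
	return err == nil && ok
}

func main() {
	// The character class [m] matches exactly one 'm', so the pattern matches
	// the same single file name as the literal, but is treated as a wildcard.
	for _, name := range []string{"go.work.sum", "go.work", "go.work.su"} {
		fmt.Printf("%-12s %v\n", name, matches("go.work.su[m]", name))
	}
}
```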
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add REST API endpoints for submitting visual verification tasks,
tracking progress via SSE, and retrieving screenshot/video artifacts.
Changes:
- Add ScopeVerifyRead/ScopeVerifyWrite auth scopes
- Create VerifyService for task submission and lifecycle management
- Create VerifyHandler with POST/GET/DELETE/SSE endpoints:
  - POST /verify - Submit capture task
  - GET /verify/{taskId} - Get task status and artifacts
  - GET /verify/{taskId}/stream - SSE progress stream
  - DELETE /verify/{taskId} - Cancel pending task
  - GET /projects/{id}/verify - List verify tasks
- Wire VerifyExecutor in main.go for Playwright pod execution
- Fix work.go validation to include "verify" task type
- Add comprehensive handler tests
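A minimal sketch of an SSE progress handler like the /verify/{taskId}/stream endpoint. The event name `progress`, the fixed event list, and the function names are assumptions; the real handler presumably streams live task updates.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// streamProgress writes each event in Server-Sent Events framing and flushes
// after every event so clients see progress immediately. Events are assumed
// to be single-line strings.
func streamProgress(events []string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		flusher, ok := w.(http.Flusher)
		if !ok {
			http.Error(w, "streaming unsupported", http.StatusInternalServerError)
			return
		}
		w.Header().Set("Content-Type", "text/event-stream")
		w.Header().Set("Cache-Control", "no-cache")
		for _, e := range events {
			select {
			case <-r.Context().Done(): // client disconnected
				return
			default:
			}
			fmt.Fprintf(w, "event: progress\ndata: %s\n\n", e)
			flusher.Flush()
		}
	}
}

// render runs the handler against an in-memory recorder.
func render(events []string) (contentType, body string) {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/verify/task-1/stream", nil)
	streamProgress(events)(rec, req)
	return rec.Result().Header.Get("Content-Type"), rec.Body.String()
}

func main() {
	ct, body := render([]string{"queued", "capturing", "done"})
	fmt.Println(ct)
	fmt.Print(body)
}
```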
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add /diagnostics endpoint for system health overview
- Add external health worker for monitoring Gitea, Woodpecker, Registry
- Add health check methods to Gitea and Woodpecker clients
- Remove hardcoded fallback projects (pantheon, aeries)
- Add diagnostics domain types and service layer
- Add comprehensive tests for diagnostics handler and service
- Fix tests to use registered test project instead of hardcoded one
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Composable monorepo CI fixes:
- Add empty go.sum.tmpl files for pkg, service, worker, and cli components
- Fix Dockerfile.tmpl glob patterns (COPY go.work.sum* is invalid in Kaniko)
- Add deps step to CI that runs go work sync and go mod tidy before builds
- Fix scalar-go dependency version (v0.1.2 doesn't exist, use v0.13.0)
Health endpoint improvements:
- Add registry health check (zot OCI /v2/ endpoint)
- Add health metrics for CI, registry, and Git
- Add /health/ci endpoint for Woodpecker health
Visual verification scaffolding:
- Add Playwright pod and scripts ConfigMap
- Add vision.md and implementation breakdown plan
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add WorkErrorCode type with RATE_LIMITED, AUTH_FAILED, TIMEOUT, STALE_WORKER, AGENT_ERROR, INVALID_SPEC
- Add ClassifyAgentError function to detect error patterns from stderr
- Add error_code column to work_queue table (migration 016)
- Add FailWithCode method to WorkQueue interface and implementations
- Update RequeueStaleWithIDs to mark permanently failed tasks with STALE_WORKER
- Add ErrorCode to BuildResult for API responses
- Update work executor to classify errors before failing tasks
This enables users to see actual failure reasons (e.g., "RATE_LIMITED") instead of
builds stuck in "running" state forever when Claude hits rate limits.
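A sketch of stderr-based classification. The code values come from the list above; the constant names, the function signature, and the matched substrings are assumptions and not the actual patterns in ClassifyAgentError.

```go
package main

import (
	"fmt"
	"strings"
)

// WorkErrorCode values mirror the codes listed in this commit.
type WorkErrorCode string

const (
	CodeRateLimited WorkErrorCode = "RATE_LIMITED"
	CodeAuthFailed  WorkErrorCode = "AUTH_FAILED"
	CodeTimeout     WorkErrorCode = "TIMEOUT"
	CodeAgentError  WorkErrorCode = "AGENT_ERROR"
)

// ClassifyAgentError maps stderr text to a code using substring patterns.
// The patterns here are illustrative guesses.
func ClassifyAgentError(stderr string) WorkErrorCode {
	s := strings.ToLower(stderr)
	switch {
	case strings.Contains(s, "rate limit"):
		return CodeRateLimited
	case strings.Contains(s, "unauthorized"), strings.Contains(s, "invalid api key"):
		return CodeAuthFailed
	case strings.Contains(s, "deadline exceeded"), strings.Contains(s, "timed out"):
		return CodeTimeout
	default:
		return CodeAgentError // fallback when nothing more specific matches
	}
}

func main() {
	fmt.Println(ClassifyAgentError("Error: Rate limit exceeded, retry later"))
	fmt.Println(ClassifyAgentError("context deadline exceeded"))
	fmt.Println(ClassifyAgentError("unexpected EOF"))
}
```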
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add pass/fail/needs-fix CLI commands to cmd/sdlc/cmd_artifact.go
- Add 3 new methods to SDLCExecutor interface in internal/port
- Implement methods in kubernetes adapter
- Add service methods to SDLCService
- Add HTTP handlers for POST .../artifacts/{type}/pass|fail|needs-fix
- Update 6 skeleton commands to evaluate and set artifact status
- Update test mocks
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add feature-dev-test.sh: full 10-step E2E test for SDLC + Claude Code workflow
- Update feature-development.md cookbook with complete workflow documentation
- Fix SDLC orchestrator and improve project management handler
- Update scaffold-test.sh with minor fixes
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add POST /workers/register and POST /workers/{workerId}/heartbeat endpoints
- Start worker health checker goroutine in main.go
- Fix network policy to allow K8s API server access (includes real endpoint IPs)
- Add rdev.orchard9.ai/role: worker label to claudebox StatefulSet
This enables the embedded WorkExecutor to reach claudebox-0 for executing
builds on composable projects that don't have dedicated pods.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add AUTO_TEARDOWN env var and --auto-teardown flag to cookbook scripts
- Scripts automatically delete created projects on exit (including Ctrl+C)
- Add DELETE /projects/cleanup API endpoint for bulk cleanup
  - Supports shell-style glob patterns (e.g., "tree-test-*")
  - Includes dry_run mode and older_than_hours filter for safety
  - Requires admin scope for actual deletion
- Update cookbook scripts: landing-test, composable-test, template-validation,
  feature-test, tree-runner
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add branch lifecycle commands (branch, merge, archive) to the SDLC CLI.
Introduce orchestrator handler and service for multi-step SDLC workflows.
Expand skeleton template with 15 Claude commands covering the full feature
lifecycle. Extend classifier rules, error types, and executor port for
branch operations. Split rules.go and classifier_test.go to stay within
500-line limit.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add the strongest parts of the local Claude Code configuration to the
composable monorepo skeleton template, giving new projects a capable
starting configuration.
Commands added (4):
- do-parallel: Execute tasks in parallel waves with agent selection
- remember: Store learnings as institutional memory
- prepare: Pre-implementation readiness assessment
- root-cause: Root cause analysis with parallel investigation
Skills added (5):
- orchestrated-execution: Task pipelines with implementation → review → fix
- root-cause-analyst: Systematic diagnosis with confidence scoring
- knowledge-librarian: Organize learnings in ai-lookup/ structure
- feature-verifier: Verify features work with evidence matrix
- prepare: Binary outcome readiness assessment (brief or gap list)
Agents added (1):
- quality-engineer: Code quality, test coverage, error handling reviewer
All Citadel-specific references are genericized to use the skeleton's
existing agents (go-specialist, testing-strategist, security-architect, etc.).
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add GET /projects/{id}/pipelines/{number}/steps endpoint
  - Return step name, status, duration, exit_code for all steps
  - Include last 50 lines of log for failed steps
- Enhance test script with automatic diagnostics on failure
  - Add diagnose subcommand for deep pipeline analysis
  - Show K8s pod state on site accessibility failures
- Split woodpecker adapter into client.go and pipelines.go
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add deploy-{name} CI steps to all component templates (app-astro,
app-react, service, worker) so each component deploys independently
via kubectl set image on merge to main. Replace the skeleton's
generic deploy step with a verify step that confirms deployments.
Add GET /templates/components endpoint for listing available component
templates with optional type filter. Simplify component API by merging
type+template into a single type field (e.g., "app-react" instead of
type="app" template="app-react").
Include ESLint configs and pnpm-workspace.yaml in app templates.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Short-form DNS names (e.g. postgres.databases.svc) fail to resolve in
new pods due to k8s DNS search domain limitations. Switch all service
hostnames to FQDNs (*.svc.cluster.local).
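A small sketch of expanding short-form service names to FQDNs. It assumes the default `cluster.local` cluster domain, and the helper name is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// clusterDomain assumes the Kubernetes default; clusters can override it.
const clusterDomain = "cluster.local"

// toFQDN expands "<svc>.<ns>.svc" to "<svc>.<ns>.svc.cluster.local" and
// leaves any other hostname untouched.
func toFQDN(host string) string {
	if strings.HasSuffix(host, ".svc") {
		return host + "." + clusterDomain
	}
	return host
}

func main() {
	fmt.Println(toFQDN("postgres.databases.svc"))
	fmt.Println(toFQDN("postgres.databases.svc.cluster.local"))
}
```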
Remove commonLabels from kustomization.yaml — it injected labels into
all selectors including NetworkPolicy egress rules (blocking DNS to
CoreDNS) and Deployment selectors (causing immutability errors).
Add OTEL_EXPORTER_OTLP_ENDPOINT env var to deployment YAML so the
telemetry collector endpoint uses the FQDN without requiring a binary
rebuild.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Adds the composable monorepo template system that generates project skeletons
with pluggable components (service, worker, app-react, app-astro, cli).
Key changes:
- Monorepo skeleton templates with shared pkg/, scripts/, and git hooks
- Component templates (service, worker, app-react, app-astro, cli) with
  Dockerfiles, CI steps, and component.yaml manifests
- Component domain model with validation and dependency resolution
- Component handler endpoints for CRUD and composition
- Template provider extended with BuildComposableProject and component assembly
- Deployer extended with composable project deployment support
- Handler timeout constants (TimeoutFastLookup through TimeoutLongRunning)
- envutil package for centralized env var reads with defaults
- api.DecodeJSON helper for standardized request body decoding
- Standardized response helpers (WriteBadRequest, WriteNotFound, etc.)
- Replaced fullstack-app cookbook with composable-app cookbook
- Hardened handler timeouts, logging, and error responses across all handlers
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When a worker dies mid-build, queue maintenance now updates both
work_queue and build_audit tables when requeuing stale tasks.
This prevents builds from showing "running" forever in the API.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When Claude fails to execute, error messages now include:
- Captured stderr output from the failed command
- Troubleshooting commands to exec into pod and run `claude login`
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add session_id, model, allowed_tools to Claude request handler
- Update OpenAPI spec for Claude endpoint
- Fix BuildExecutor constructor call sites
- Rewrite landing-test.sh for agent-driven flow
- Fix cookbook documentation for correct API format
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Root cause: WorkerService.ClaimTask() was modifying the audit entry
in memory but never persisting it to the database. This caused build
tasks to remain stuck at "pending" status even after being claimed.
Changes:
- Add UpdateStatus method to port.BuildAudit interface
- Implement UpdateStatus in postgres.BuildAuditRepository
- Fix ClaimTask to call audit.UpdateStatus() to persist status
- Add test coverage for audit update during task claim
- Update all mock implementations
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Initial K8s deployment auto-creation during project creation
- DNS record upsert support (create or update existing records)
- Ingress host management for domain aliases (AddIngressHost/RemoveIngressHost)
- Woodpecker deployer RBAC manifest for CI deploy steps
- Single-commit template seeding via Gitea bulk file API
Closes automation gaps exposed during www.threesix.ai launch:
- Projects now auto-create K8s Deployment/Service/Ingress on creation
- Domain aliases automatically update both DNS and K8s ingress
- CI deploy steps work without manual RBAC setup
- Template seeding triggers only one CI pipeline (not per-file)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The deployer was using cert-manager.io/issuer (namespace-scoped)
referencing letsencrypt-threesix, which only exists in the threesix
namespace. Projects deploy to the projects namespace, so the deployer
now uses cert-manager.io/cluster-issuer with letsencrypt-prod.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The destinations format caused Kaniko to push images with the full
registry URL as part of the repo path (registry.threesix.ai/name
instead of just name). Using registry + repo + tags format pushes
to the correct path.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The templates referenced zot.orchard9.ai which has no DNS record.
The actual zot registry is at registry.threesix.ai. Also updated
static templates to use Kaniko plugin instead of docker:24-dind.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The deployer was blindly calling Namespaces().Create() which triggered
cluster-scope RBAC checks even when the namespace already existed.
Now checks with Get() first and only creates if NotFound.
Also adds namespace get/create and secrets create/update/patch
permissions to the rdev-api-deployer ClusterRole.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Switch from raw gcr.io/kaniko-project/executor:debug to
woodpeckerci/plugin-kaniko with destinations setting. Also use
npm install instead of npm ci (no lock file in templates) and
skip-tls-verify for self-signed registry certs.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Zot is configured without authentication, so remove the auth
configuration step from templates. Add the --insecure flag for
internal registry access.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace Docker-in-Docker (privileged mode) with Kaniko for container
builds. This allows CI pipelines to run without requiring trusted
repo status in Woodpecker.
- astro-landing: Use Kaniko with from_secret for registry auth
- go-api: Use Kaniko with from_secret for registry auth
- default: Use Kaniko with from_secret for registry auth
Kaniko builds and pushes images without requiring privileged mode,
making it compatible with Woodpecker's default security settings.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add PipelineErrorResponse struct to handler
- Add Errors field to PipelineResponse struct
- Add mapPipelineErrors helper function
- Include errors in both ListPipelines and GetPipeline responses
Root cause of CI failures: Woodpecker trust level doesn't allow privileged mode
for docker steps. Errors were being returned by Woodpecker but not exposed.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add CIPipelineError struct to domain with Type, Message, IsWarning fields
- Map Woodpecker Pipeline.Errors to domain.CIPipeline.Errors
- Fix migration 013: UUID type for project_id, cast id to text for MD5
- Remove invalid domain data migration (columns don't exist)
- Update release.sh with --deploy flag and migration support
- Fix test nil pointer: check errors in TestAPIKeyRepository_ProjectIDArrayHandling
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Landing page cookbook implementation (Weeks 1-4):
Domain Infrastructure:
- Add project_domains table with migration (013_project_domains.sql)
- Add ProjectDomain model with domain types (primary_auto, primary_custom, alias)
- Add SlugGenerator and ProjectDomainRepository interfaces
- Implement postgres adapters for domain and slug management
Service Layer:
- Add domain CRUD methods to ProjectInfraService
- Generate 8-char random slugs for auto-domains
- Support custom subdomains during project creation
- Add site_live health check to project status
- Trigger CI build after template seeding
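A sketch of generating an 8-character random slug for auto-domains. The alphabet and function name are assumptions; the commit only states the length.

```go
package main

import (
	"crypto/rand"
	"fmt"
)

// slugAlphabet is a hypothetical DNS-safe alphabet (lowercase letters, digits).
const slugAlphabet = "abcdefghijklmnopqrstuvwxyz0123456789"

// newSlug returns n random characters from slugAlphabet. Mapping bytes with
// a modulo introduces a slight bias, which is acceptable for non-secret
// identifiers like subdomain slugs.
func newSlug(n int) (string, error) {
	buf := make([]byte, n)
	if _, err := rand.Read(buf); err != nil {
		return "", fmt.Errorf("read random bytes: %w", err)
	}
	for i, b := range buf {
		buf[i] = slugAlphabet[int(b)%len(slugAlphabet)]
	}
	return string(buf), nil
}

func main() {
	slug, err := newSlug(8)
	if err != nil {
		panic(err)
	}
	fmt.Println(slug)
}
```

Random slugs can collide, so a caller would still need a uniqueness check against stored domains before accepting one.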
Handler Updates:
- Add DomainService interface and adapter pattern
- Rewrite domain handlers to use database-backed service
- Add proper error handling for duplicate/missing domains
CI Integration:
- Add TriggerBuild to CIProvider interface
- Implement TriggerBuild in Woodpecker adapter
- Manually trigger initial build after template seed
Cookbook & Scripts:
- Add landing-test.sh script for E2E testing
- Add release.sh for version releases
- Add logs.sh for quick log access
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>