The fatal push+retry logic added in the git flow hardening broke tests
that use local-only git repos without an origin remote. Check for the
origin remote before attempting to push, preserving the fatal behavior
in production (where origin always exists) while allowing local/test
contexts to proceed without a remote.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
5 fixes from stress test analysis:
1. CRITICAL: Add pull-before-push to claudebox GitOperations.CommitAndPush,
matching the fix already in PodGitOperations (prevents push rejections
when concurrent builds advance the remote).
2. HIGH: Extract ResetToMain into PodGitOperations as a shared public method.
Wire into BuildExecutor after CloneRepo and update SDLCTaskExecutor to
use the shared method. Prevents builds from running on wrong branch when
worker pods are reused across tasks.
3. HIGH: Make branch create push failure fatal with retry+rollback in
cmd/sdlc/cmd_branch.go. Prevents orphaned .sdlc/ state that causes
merge failures after completing all 10 SDLC phases.
4. MEDIUM: Shell-escape token in credential helpers (both PodGitOperations
and claudebox GitOperations) to prevent shell injection via tokens
containing special characters.
5. MEDIUM: Add GitResetToMain to claudebox sidecar (git.go implementation,
server.go endpoint, client.go HTTP method) and wire into
HTTPSDLCTaskExecutor for the HTTP sidecar path.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The foundary cookbook was using template: "default" which seeds a flat
CI pipeline without the COMPONENT_STEPS_BELOW marker. When components
were added via batch API, updateWoodpeckerYml couldn't find the marker
and silently returned the file unchanged — component build/deploy steps
were never inserted. This caused component images to never be built,
leaving pods at 0 replicas indefinitely.
The skeleton template has the correct DAG-mode pipeline with markers
for component step insertion and build-complete dependency wiring.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Auth errors like "OAuth token has expired" were lost because Claude writes
them to stdout, not stderr. The error message only showed kubectl's generic
"command terminated with exit code 1". Now includes both stdout and stderr
in the error, making failures immediately diagnosable.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Components are scaffolded before CI builds their images. Previously deployments
started with 1 replica, causing ImagePullBackOff until the first build completed.
Now deployments start at 0 replicas; CI deploy steps scale to 1 after verifying
the image exists in the registry.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Registry was full (29.4GB/30GB) causing DIGEST_INVALID on all image pushes.
Cleaned old test project repos, expanded PVC from 30Gi to 80Gi, re-enabled cache.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Corrupt cache blobs in rdev/claudebox/cache cause DIGEST_INVALID on
push. Disable cache temporarily to force a clean build. Can re-enable
once a good image is pushed and the cache is rebuilt.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add --connect-timeout 10 and --max-time 15 to all verify step curl
calls to prevent hanging on registry health checks
- Fix cli template: depends_on [deps] -> [preflight] for consistency
- Add cross-reference comment to service template about verify logic
being replicated across all 5 component templates
- Document component CI step rules in composable-monorepo.md
- Compile regexes at package level instead of per-call in
component_updates.go
- Add component_updates_test.go
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The CI deploy step runs `kubectl set image statefulset/claudebox` but
the woodpecker-deployer Role only included `deployments`. Add
`statefulsets` to the allowed resources.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add UndeployAll() using label selectors to clean up monorepo components
on project deletion (replaces name-based Undeploy in DeleteProject and
the direct undeploy handler)
- Add ResourceGC background worker that periodically finds K8s resources
whose project label has no matching DB record, deletes after 1h safety
window
- Widen deployer client type from *kubernetes.Clientset to
kubernetes.Interface for testability
- UndeployAll accumulates errors via errors.Join instead of failing fast
- Add checkout/checkin sidecar dev flow: temporary git tokens, branch
checkout, review on checkin with cleanup workers
- Add interactive sessions: pod binding, command execution, SSE streaming,
ephemeral preview URLs with session cleanup workers
- Add GET /workers/pool endpoint for aggregate capacity and queue depth
- Add sessions:read and sessions:execute auth scopes
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Site verification may fail when component images haven't built yet.
The SDLC lifecycle completes regardless of site availability.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The sdlc merge command already transitions features to released
internally. The cookbook's transition step was running after archive,
which moved the feature and caused "feature not found". Fixed by:
- Reordering: transition before archive
- Adding on_error: continue to both (merge handles transition)
- Simplifying verification (no longer depends on transition outputs)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Both main and the feature branch modify .sdlc/state.yaml during the
SDLC lifecycle. Use -X theirs to auto-resolve conflicts in favor of the
feature branch, whose state is more current.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
After resetToMain in the executor, only remote refs exist for feature
branches. The merge command now checks if the local branch exists and
falls back to origin/<branch> when it doesn't.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The logging middleware's responseWriter wrapped http.ResponseWriter but
only implemented WriteHeader, Write, and Unwrap. The missing Flush()
method caused w.(http.Flusher) type assertions to fail in the claudebox
sidecar's streaming endpoint, returning 500 "streaming not supported".
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The `sdlc merge` command reads the Branch field from the feature manifest
on main, but `sdlc branch create` was only committing that state to the
feature branch (via the executor's CommitAndPush). This caused merge to
fail with "feature has no branch".
Two changes:
1. cmd/sdlc/cmd_branch.go: commit .sdlc/ state to main before
`git checkout -b`, ensuring Branch metadata is on main where merge
reads it.
2. internal/worker/sdlc_executor.go: reset workspace to main
(`git fetch && git checkout main && git reset --hard origin/main`)
before each SDLC task, preventing cross-task branch contamination
from commands that switch branches.
Also updates foundary cookbook with architect fallback pattern and
on_error: continue for steps that may fail during early lifecycle.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Woodpecker CI was timing out when watching deployment rollout status
due to missing RBAC permissions. The deployments were succeeding but
CI couldn't verify completion.
Changes:
- Add 'watch' verb to woodpecker-deployer Role
- Add threesix/default service account to RoleBinding
- Consolidate woodpecker-deployer RBAC into base/rbac.yaml
This resolves the "Failed to watch: deployments.apps is forbidden"
errors in CI logs while maintaining successful deployment rollouts.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Implements all 5 phases of Foundary Studio backend:
Phase 1: Chat Persistence (8 API endpoints)
- Conversations and messages with proper cascading deletes
- PostgreSQL schema with auto-update triggers
- Full CRUD operations with structured logging
Phase 2: Blueprint Entity (5 API endpoints)
- JSONB spec storage with GIN indexes
- Flexible structured data for project specifications
- Version-controlled blueprint management
Phase 3: Architect Service (3 API endpoints)
- Conversational AI orchestration with Claude
- Multi-turn dialogue with context building
- Blueprint spec extraction from conversations
Phase 4: Work Queue Integration
- Verified existing endpoint compatibility
Phase 5: Structured Questions (6 API endpoints)
- Four question types: text, choice, multichoice, yesno
- Answer validation with proper constraints
- Conversation-linked Q&A flow
Architecture:
- Textbook hexagonal architecture (domain → port → adapter → service → handler)
- Zero external dependencies in domain layer
- Consistent error handling with proper wrapping
- Auth scopes on all routes (projects:read, projects:execute)
- Structured logging with operation context and duration tracking
- NULL-safe DTO converters throughout
Database:
- 3 new migrations (019, 020, 021)
- UUIDs for all primary keys
- Proper foreign key constraints with ON DELETE CASCADE
- Optimized indexes including partial index for unanswered questions
- Auto-update triggers for timestamps
OpenAPI Documentation:
- Complete API documentation under 'Foundary' tag
- 22 new endpoints documented with examples
- Request/response schemas for all operations
Logging Improvements:
- Added operation field to all service logs
- Added duration_ms tracking for performance monitoring
- Log response_length instead of full response content
- Consistent use of logging field constants
- Execute-then-log pattern for delete operations
Files: 32 changed, 2800+ lines added
- 7 domain models
- 3 database migrations
- 3 port interfaces
- 3 postgres adapters
- 4 services (conversation, blueprint, question, architect)
- 4 handlers with DTOs
- OpenAPI documentation
- Integration in main.go
🤖 Generated with Claude Code
Co-Authored-By: Claude <noreply@anthropic.com>
Allow transitioning to the current phase (no-op success) instead of
rejecting it as a "backward" transition. This fixes issues where
external systems retry transition commands.
Before: draft -> draft returned error
After: draft -> draft returns nil (already there)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
SDLCGenerateHandler was using r.Route() to create a sub-router at
/projects/{id}/sdlc/features/{slug}, which shadowed SDLCHandler's
nested routes like /features/{slug}/artifacts/{type}/approve.
Changed to direct route registration to avoid chi route conflicts.
This fixes 404 errors on SDLC feature and artifact endpoints.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Root cause of DIGEST_INVALID errors was registry disk exhaustion.
Project teardown wasn't cleaning up container images, causing the
registry PVC to fill up over time.
Changes:
- Add RegistryProvider port interface for registry operations
- Extend zot.Client with DeleteProjectRepositories method
- Wire registry provider into ProjectInfraService
- Delete images during DeleteProject cleanup (step 4)
The zot client uses the OCI distribution API:
- Lists all repos, filters by project prefix
- Gets manifest digests via HEAD request
- Deletes manifests by digest to trigger GC
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Missed the 3 wait_pipeline steps (CI deploys) - now consistent with
wait_build steps at 720 attempts × 5s = 1hr.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Agent tasks (spec, design, implementation, review, etc.) can take significant
time. Increased all wait_build steps from 5-10 min to 720 attempts × 5s = 1hr.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The wait-init step was timing out because it waited for the entire pipeline
including docs build steps. The service (preferences-api) deploys successfully
before docs. Added on_error: continue so the tree proceeds after service deploy.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Agents were generating `:id` (Echo/Gin style) instead of `{id}` (chi style),
causing routes to not match. Updated api-designer, go-specialist agents and
skeleton CLAUDE.md with explicit CRITICAL notes about brace syntax.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When req.Template is empty, it defaults to 'skeleton' but the check
in createInitialDeployment only matched 'skeleton' explicitly, not
empty string. This caused a broken deployment to be created for
monorepo projects with a non-existent image.
Root cause: slackpath-5 creates project with empty template, which
defaults to skeleton, but createInitialDeployment was still creating
a root deployment that references registry.threesix.ai/{project}:latest
which never gets built (skeleton has no root Dockerfile).
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When docs infrastructure doesn't exist, the docs build steps should
gracefully skip without failing the entire pipeline.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The raw gcr.io/kaniko-project/executor with commands: doesn't work
properly in Woodpecker. Switch to woodpeckerci/plugin-kaniko with
settings: to match other component builds.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The registry.threesix.ai uses a self-signed certificate.
Service builds use plugin-kaniko with skip-tls-verify, but docs
build used raw kaniko executor without TLS bypass, causing exit 128.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When --context=docs is set, the --dockerfile path should be relative
to the context directory. Changed from docs/Dockerfile.nginx to
Dockerfile.nginx since kaniko already looks in the docs/ directory.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
middleman-syntax ~> 3.2 requires rouge ~> 3.2, but Gemfile had rouge ~> 4.0
causing bundle install to fail with version resolution error.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The claudebox sidecar was using api.WriteJSON which wraps responses in
{data: ..., meta: ...} format. The claudebox HTTP client expects raw
JSON responses without wrapping.
This caused git clone to appear to fail - the HTTP request succeeded
and returned {data: {success: true, cloned: true}, meta: {...}}, but
the client decoded success=false because it couldn't find the fields
at the top level.
Added writeRawJSON helper and replaced all api.WriteJSON calls with it
for actual responses. Error responses still use api.WriteBadRequest
which returns proper error format.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Adds complete Slate documentation infrastructure to generated projects:
- docs/ directory with Gemfile, config.rb, and source templates
- Dockerfile for building docs site
- Dockerfile.nginx for serving static docs
- generate-docs.sh script for CI integration
- Claude command for AI-assisted docs generation
- OpenAPI → Slate markdown conversion via widdershins
Also includes:
- --export-openapi flag for service binaries
- DNS provisioning for docs.{domain} subdomain
- Updated project_infra for docs DNS records
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The export-openapi step was running in parallel with component builds
because it had no explicit dependency. This could cause docs generation
to run before component services were fully built.
Changes:
- Add build-complete step with NO depends_on (waits for ALL prior steps)
- Make export-openapi depend on build-complete
- Complete docs pipeline: export-openapi → generate-docs → build-docs →
build-docs-image → deploy-docs
- Update verify step label selector to use project= instead of app=
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When a task is retried (dequeued again after failure), the previous
error message was persisting in the work_queue table. This caused the
API to return confusing responses with status="running" but also
containing an error message from the previous attempt.
Now clears error and completed_at when claiming a task, matching the
fix already applied to build_audit.UpdateStatus.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Two critical fixes for build retry behavior:
1. pod_git_operations.go: Normalize remote URL before comparison
- Clone stores URL with token (https://token:x@host/...)
- Subsequent retry compares against URL without token
- Without normalization, URLs never match, so workspace is always
cleared and re-cloned, losing all code from previous attempt
2. build_audit.go: Clear stale result data when task transitions to running
- When a failed task is retried, UpdateStatus only updated status/worker_id
- Result and completed_at from previous failure remained, causing
API to return stale failure data even while retry was running
- Now clears result, completed_at and resets started_at when
status is set to "running"
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Log clone request with work_dir, URL, and token presence
- Log workspace state (is_git_repo, existing remote)
- Log all decision points (pull vs clone, clear workspace)
- Detect and clear non-empty non-git directories before clone
- Capture both stdout and stderr for clone failures
- Include exit code in error messages
Empty go.sum files were causing Docker builds to fail because
Go couldn't verify dependencies. Added go mod download steps
for both pkg and component directories before building.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add OpenTelemetry environment variables to export Claude Code logs
and metrics to the existing OTEL collector. Provides visibility into
long-running builds.
- claudebox-worker: sidecar in rdev-worker deployment
- claudebox-standalone: StatefulSet for direct access
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add longhorn-rwx StorageClass for RWX volume support
- Add slackpath-5-full-lifecycle.yaml cookbook tree (all 10 SDLC phases)
- Update worker-pool.md documentation
- Consolidate PVC configuration, remove separate pvc-shared-claude.yaml
- Update rdev-worker and kustomization for new PVC structure
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Critical fix: WorkersHandler was missing workService dependency, causing
500 errors when workers tried to fail tasks. This caused tasks to get
stuck in "running" state permanently.
Also adds:
- /work/tasks endpoint for debugging all tasks across projects
- List method to WorkQueue interface for admin views
- HTTP client tests for api_client.go and claudebox/client.go (48 tests)
- Split work.go DTOs into work_dto.go to stay under 500 lines
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add WaitGroup for graceful shutdown of in-flight tasks
- Change replicas to 1 with Recreate strategy (RWO PVC limitation)
- Optimize Dockerfile: combine RUN commands for smaller layers
- Add compiled binaries to .gitignore
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>