Commit Graph

17 Commits

Author SHA1 Message Date
jordan
9226454b85 feat: label-based undeploy, GC reconciliation, checkout/sessions, pool status
Some checks failed
ci/woodpecker/push/woodpecker Pipeline failed
- Add UndeployAll() using label selectors to clean up monorepo components
  on project deletion (replaces name-based Undeploy in DeleteProject and
  the direct undeploy handler)
- Add ResourceGC background worker that periodically finds K8s resources
  whose project label has no matching DB record, deletes after 1h safety
  window
- Widen deployer client type from *kubernetes.Clientset to
  kubernetes.Interface for testability
- UndeployAll accumulates errors via errors.Join instead of failing fast
- Add checkout/checkin sidecar dev flow: temporary git tokens, branch
  checkout, review on checkin with cleanup workers
- Add interactive sessions: pod binding, command execution, SSE streaming,
  ephemeral preview URLs with session cleanup workers
- Add GET /workers/pool endpoint for aggregate capacity and queue depth
- Add sessions:read and sessions:execute auth scopes

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-09 19:11:28 -07:00
jordan
a69eb7e587 feat(foundary): implement complete backend for conversational project design
Implements all 5 phases of Foundary Studio backend:

Phase 1: Chat Persistence (8 API endpoints)
- Conversations and messages with proper cascading deletes
- PostgreSQL schema with auto-update triggers
- Full CRUD operations with structured logging

Phase 2: Blueprint Entity (5 API endpoints)
- JSONB spec storage with GIN indexes
- Flexible structured data for project specifications
- Version-controlled blueprint management

Phase 3: Architect Service (3 API endpoints)
- Conversational AI orchestration with Claude
- Multi-turn dialogue with context building
- Blueprint spec extraction from conversations

Phase 4: Work Queue Integration
- Verified existing endpoint compatibility

Phase 5: Structured Questions (6 API endpoints)
- Four question types: text, choice, multichoice, yesno
- Answer validation with proper constraints
- Conversation-linked Q&A flow

Architecture:
- Textbook hexagonal architecture (domain → port → adapter → service → handler)
- Zero external dependencies in domain layer
- Consistent error handling with proper wrapping
- Auth scopes on all routes (projects:read, projects:execute)
- Structured logging with operation context and duration tracking
- NULL-safe DTO converters throughout

Database:
- 3 new migrations (019, 020, 021)
- UUIDs for all primary keys
- Proper foreign key constraints with ON DELETE CASCADE
- Optimized indexes including partial index for unanswered questions
- Auto-update triggers for timestamps

OpenAPI Documentation:
- Complete API documentation under 'Foundary' tag
- 22 new endpoints documented with examples
- Request/response schemas for all operations

Logging Improvements:
- Added operation field to all service logs
- Added duration_ms tracking for performance monitoring
- Log response_length instead of full response content
- Consistent use of logging field constants
- Execute-then-log pattern for delete operations

Files: 32 changed, 2800+ lines added
- 7 domain models
- 3 database migrations
- 3 port interfaces
- 3 postgres adapters
- 4 services (conversation, blueprint, question, architect)
- 4 handlers with DTOs
- OpenAPI documentation
- Integration in main.go

🤖 Generated with Claude Code
Co-Authored-By: Claude <noreply@anthropic.com>
2026-02-09 00:50:46 -07:00
jordan
f20fc6c51c feat(saga): implement enterprise-grade resilience architecture
Some checks failed
ci/woodpecker/push/woodpecker Pipeline failed
Fixes issues from code review of resilience implementation:

- Wire saga system in main.go (SagaRepository, SagaExecutor, SagaHandler)
- Fix CompletedSteps() to include skipped steps for dependency resolution
- Fix reverse loop bug in saga compensation (use standard swap pattern)
- Add circuit breaker state change callbacks for Prometheus metrics

Phase 1 (Build Resilience):
- Add failure:retry to all component Kaniko build steps
- Add preflight registry health check before builds
- Add services-deployed sync point to decouple docs from critical path

Phase 2 (API Resilience):
- Add pipeline retry endpoint (POST /projects/{id}/pipelines/{number}/retry)
- Wire circuit breakers with metrics callbacks
- Add /health/circuits endpoint for circuit breaker status

Phase 3 (Saga Engine):
- Full domain model (Saga, SagaStep, RetryPolicy, BackoffType)
- PostgreSQL saga repository with CRUD and step management
- Saga executor with retry, compensation, skip step support
- Saga API handlers with CRUD and control operations

Phase 4 (Observability):
- Add saga metrics (total, step_duration, retry, circuit_breaker_state)
- Add logging fields (saga_id, saga_name, step_name)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-08 01:58:02 -07:00
jordan
59aa173384 fix: clear stale error when dequeuing work tasks
Some checks failed
ci/woodpecker/push/woodpecker Pipeline failed
When a task is retried (dequeued again after failure), the previous
error message was persisting in the work_queue table. This caused the
API to return confusing responses with status="running" but also
containing an error message from the previous attempt.

Now clears error and completed_at when claiming a task, matching the
fix already applied to build_audit.UpdateStatus.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 08:51:34 -07:00
jordan
9833725f31 fix: preserve work on build retry, clear stale audit data
Two critical fixes for build retry behavior:

1. pod_git_operations.go: Normalize remote URL before comparison
   - Clone stores URL with token (https://token:x@host/...)
   - Subsequent retry compares against URL without token
   - Without normalization, URLs never match, so workspace is always
     cleared and re-cloned, losing all code from previous attempt

2. build_audit.go: Clear stale result data when task transitions to running
   - When a failed task is retried, UpdateStatus only updated status/worker_id
   - Result and completed_at from previous failure remained, causing
     API to return stale failure data even while retry was running
   - Now clears result, completed_at and resets started_at when
     status is set to "running"

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-07 08:40:36 -07:00
jordan
d74efb75ff fix: wire workService to WorkersHandler and add /work/tasks endpoint
Critical fix: WorkersHandler was missing workService dependency, causing
500 errors when workers tried to fail tasks. This caused tasks to get
stuck in "running" state permanently.

Also adds:
- /work/tasks endpoint for debugging all tasks across projects
- List method to WorkQueue interface for admin views
- HTTP client tests for api_client.go and claudebox/client.go (48 tests)
- Split work.go DTOs into work_dto.go to stay under 500 lines

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 10:35:39 -07:00
jordan
cfba724f8a feat: add work task error classification and user-facing error codes
- Add WorkErrorCode type with RATE_LIMITED, AUTH_FAILED, TIMEOUT, STALE_WORKER, AGENT_ERROR, INVALID_SPEC
- Add ClassifyAgentError function to detect error patterns from stderr
- Add error_code column to work_queue table (migration 016)
- Add FailWithCode method to WorkQueue interface and implementations
- Update RequeueStaleWithIDs to mark permanently failed tasks with STALE_WORKER
- Add ErrorCode to BuildResult for API responses
- Update work executor to classify errors before failing tasks

This enables users to see actual failure reasons (e.g., "RATE_LIMITED") instead of
builds stuck in "running" state forever when Claude hits rate limits.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 00:07:34 -07:00
jordan
c280a92012 feat: add operations audit system and template improvements
Operations Audit (new feature):
- Add Operation domain model with status tracking (pending, running, completed, failed, cancelled)
- Add OperationRepository with PostgreSQL implementation
- Add OperationService for CRUD and lifecycle management
- Add operations handlers (list, get, cancel endpoints)
- Add migration 015_operations.sql for operations table
- Add operation cleanup worker for stale operation handling
- Add ErrOperationNotFound to domain errors

Template Improvements:
- Add CLAUDE.md configuration files to astro-landing, default, and go-api templates
- Fix PORT template variable usage in nginx configs for app templates
- Add replace directives for local pkg module in Go templates
- Simplify Go service/worker Dockerfiles for workspace builds
- Fix TypeScript error in logger template

Other:
- Refactor landing-test.sh cookbook script
- Update CLAUDE.md version reference

Note: Some files exceed 500-line limit (pre-existing debt + new feature)
- component.go: 550 lines (unchanged, pre-existing)
- main.go: 522 lines (added operations wiring)
- operation_repo.go: 569 lines (new, needs splitting)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-01 19:08:57 -07:00
jordan
c59d348040 chore: prepare for composable monorepo template implementation
This commit captures the current state before implementing the composable
monorepo template system. Key changes included:

Infrastructure:
- Add CockroachDB provisioner adapter for database provisioning
- Add Redis provisioner adapter for cache provisioning
- Add build events system with PostgreSQL storage
- Add WebSocket endpoint for real-time build progress

Code agent improvements:
- Fix Claude Code adapter to use default allowed tools instead of dangerously-skip-permissions
- Add context-aware stream closing for cancellation support
- Improve parser tests for edge cases

Build system:
- Add build event constants and metrics
- Remove deprecated git_operations.go (replaced by pod_git_operations.go)
- Add rollback logic for multi-step provisioning operations

Documentation:
- Add composable-monorepo feature documentation
- Add DNS/Cloudflare service documentation
- Update deployment and troubleshooting guides

Cookbooks:
- Add fullstack-app cookbook
- Refactor landing-test with shared library

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-31 11:39:28 -07:00
jordan
910bcb62e1 fix: Sync build audit with work queue when stale tasks are requeued
When a worker dies mid-build, queue maintenance now updates both
work_queue and build_audit tables when requeuing stale tasks.
This prevents builds from showing "running" forever in the API.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-31 02:07:52 -07:00
jordan
4a18b1cd07 fix: Persist build audit status when worker claims task
Root cause: WorkerService.ClaimTask() was modifying the audit entry
in memory but never persisting it to the database. This caused build
tasks to remain stuck at "pending" status even after being claimed.

Changes:
- Add UpdateStatus method to port.BuildAudit interface
- Implement UpdateStatus in postgres.BuildAuditRepository
- Fix ClaimTask to call audit.UpdateStatus() to persist status
- Add test coverage for audit update during task claim
- Update all mock implementations

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-29 21:25:04 -07:00
jordan
1ac8efa4c7 feat: Expose Woodpecker pipeline errors in API response
- Add CIPipelineError struct to domain with Type, Message, IsWarning fields
- Map Woodpecker Pipeline.Errors to domain.CIPipeline.Errors
- Fix migration 013: UUID type for project_id, cast id to text for MD5
- Remove invalid domain data migration (columns don't exist)
- Update release.sh with --deploy flag and migration support
- Fix test nil pointer: check errors in TestAPIKeyRepository_ProjectIDArrayHandling

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-28 16:16:36 -07:00
jordan
c86516c53a feat: Add multi-domain support with auto-generated slugs for landing page cookbook
Landing page cookbook implementation (Weeks 1-4):

Domain Infrastructure:
- Add project_domains table with migration (013_project_domains.sql)
- Add ProjectDomain model with domain types (primary_auto, primary_custom, alias)
- Add SlugGenerator and ProjectDomainRepository interfaces
- Implement postgres adapters for domain and slug management

Service Layer:
- Add domain CRUD methods to ProjectInfraService
- Generate 8-char random slugs for auto-domains
- Support custom subdomains during project creation
- Add site_live health check to project status
- Trigger CI build after template seeding

Handler Updates:
- Add DomainService interface and adapter pattern
- Rewrite domain handlers to use database-backed service
- Add proper error handling for duplicate/missing domains

CI Integration:
- Add TriggerBuild to CIProvider interface
- Implement TriggerBuild in Woodpecker adapter
- Manually trigger initial build after template seed

Cookbook & Scripts:
- Add landing-test.sh script for E2E testing
- Add release.sh for version releases
- Add logs.sh for quick log access

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-28 12:55:59 -07:00
jordan
bc47e426b0 feat: Add CI pipeline proxy, DNS alias management, and worker executor system
- Add ListPipelines/GetPipeline to CIProvider port with Woodpecker adapter
- Add DNS alias endpoints: GET/POST/DELETE /projects/{id}/domains
- Implement worker executor daemon, build executor, and git operations
- Add build service, worker service, and build audit tracking
- Add worker registry with PostgreSQL adapter and migration
- Add multi-provider code agent interface (Claude Code + OpenCode)
- Add create-and-build combo endpoint
- Update landing-page cookbook to reflect all gaps closed
- Fix tech debt: unified validation, auth scopes, error wrapping, slog patterns

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-27 21:05:28 -07:00
jordan
39df51defd feat: Add multi-provider code agent interface with Claude Code and OpenCode adapters
Implements weeks 1-4 of the multi-provider architecture:

Week 1 - Foundation:
- Add domain models (AgentProvider, AgentRequest, AgentEvent, AgentResult)
- Define CodeAgent port interface with Execute, Cancel, Capabilities
- Create thread-safe provider registry with first-registered default

Week 2 - Claude Code Adapter:
- Extract kubectl exec logic into CodeAgent implementation
- Parse stream-json output format (init, message, tool_use, result)
- Support session continuation via --resume flag

Week 3 - OpenCode Adapter:
- HTTP/SSE client for opencode serve API
- Session management (create, send message, abort)
- Event streaming with documented buffer rationale

Week 4 - Quality & Polish:
- Fix race condition in OpenCode Cancel method
- Add AgentRequest.Validate() with ErrPromptRequired, ErrInvalidTimeout
- Document DefaultAvailabilityTimeout constants
- Add HTTP error context for debugging

Also includes:
- Work queue system with PostgreSQL adapter
- Credential store for infrastructure secrets
- Project templates with Woodpecker CI integration
- Comprehensive test coverage

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-27 09:25:51 -07:00
jordan
72d16929ca feat: Implement hexagonal architecture with services, webhooks, queue, and telemetry
Major refactoring to hexagonal (ports & adapters) architecture:

- Add service layer (apikey_service, project_service) for business logic
- Add webhook system with dispatcher and delivery tracking
- Add command queue with priority-based processing
- Add rate limiting with sliding window algorithm
- Add audit logging for command execution
- Add OpenTelemetry integration (traces, metrics, spans)
- Add circuit breaker for fault tolerance
- Add cached repository wrapper for performance
- Add comprehensive validation package
- Add Kubernetes client integration for pod management
- Add database migrations (allowed_ips, audit_log, rate_limiting, queue, webhooks)
- Add network policy and PodDisruptionBudget for k8s
- Remove legacy executor and projects/registry packages
- Untrack secrets.yaml (now managed via envault)
- Add coverage.out to .gitignore
- Add e2e test infrastructure with docker-compose
- Add comprehensive documentation (API, architecture, operations, plans)
- Add golangci-lint config and pre-commit hook

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-25 19:57:46 -07:00
jordan
538ea57ed4 feat: Add claude-config API, security hardening, and testing infrastructure
Claude Config API (v0.6):
- Add CRUD endpoints for commands, skills, and agents
- Commands/skills/agents stored in /workspace/.claude/ (per-project, in git)
- Credentials shared via PVC at /root/.claude/ (shared across pods)
- Use base64 encoding for file writes (prevents shell injection)
- Add content size limits (1MB max)

Security Hardening:
- Add sanitize package for command/prompt validation
- Add rate limiting middleware (token bucket algorithm)
- Add concurrent command limiting
- Add input sanitization to all command handlers
- Gitignore secrets.yaml and credentials.yaml
- Add *.example templates for secrets

Testing Infrastructure:
- Add testutil package with mocks and fixtures
- Add unit tests for auth package (63% coverage)
- Add unit tests for executor (47% coverage)
- Add handler integration tests (40% coverage)
- Add 100% coverage for sanitize, cmdlimit packages
- Add 96% coverage for ratelimit package

Infrastructure:
- Shared Claude credentials PVC (ReadWriteMany)
- Reduced workspace PVC size from 20Gi to 5Gi
- Add init container cleanup before git clone
- Document Longhorn RWX requirements

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-25 01:29:13 -07:00