feat: label-based undeploy, GC reconciliation, checkout/sessions, pool status

- Add UndeployAll() using label selectors to clean up monorepo components
  on project deletion (replaces name-based Undeploy in DeleteProject and
  the direct undeploy handler)
- Add ResourceGC background worker that periodically finds K8s resources
  whose project label has no matching DB record, deletes after 1h safety
  window
- Widen deployer client type from *kubernetes.Clientset to
  kubernetes.Interface for testability
- UndeployAll accumulates errors via errors.Join instead of failing fast
- Add checkout/checkin sidecar dev flow: temporary git tokens, branch
  checkout, review on checkin with cleanup workers
- Add interactive sessions: pod binding, command execution, SSE streaming,
  ephemeral preview URLs with session cleanup workers
- Add GET /workers/pool endpoint for aggregate capacity and queue depth
- Add sessions:read and sessions:execute auth scopes

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-02-09 19:11:28 -07:00

15 KiB

rdev - Remote Developer

Run Claude Code instances in isolated Kubernetes pods with REST API control. Enables bots, CI/CD systems, and external orchestrators to dispatch agentive development work to isolated environments.

Platform: threesix.ai - Agent-driven development at scale with shared worker pools.

Terminology

| Term | Meaning | Location |
|------|---------|----------|
| platform | rdev itself (orchestrator API, handlers, workers) | `cmd/rdev-api/`, `internal/`, `pkg/api/` |
| skeleton | Code that ships in generated projects | `internal/adapter/templates/templates/skeleton/` |
| component templates | Service/worker/app/cli templates added to skeleton | `templates/components/{service,worker,cli,app-*}/` |

When discussing code: "add to platform" = edit rdev; "add to skeleton" = edit project templates.

Find Your Guide

| If you need to... | Read this |
|-------------------|-----------|
| Set up local dev | local/setup.md |
| Run tests | local/testing.md |
| Write Go code / handlers | backend/go-guidelines.md |
| Understand pkg/api | packages/api-framework.md |
| Add a new handler/endpoint | backend/adding-handlers.md |
| Understand hexagonal architecture | backend/hexagonal.md |
| Deploy to k3s | ops/deploying.md |
| Release a new version | ops/releasing.md |
| Work with Kubernetes adapters | services/kubernetes.md |
| Database / migrations | ops/database.md |
| Manage credentials | ops/credentials.md |
| Work queue system | services/work-queue.md |
| Worker pool management | services/worker-pool.md |
| Project templates | services/templates.md |
| Composable monorepo templates | services/composable-monorepo.md |
| E2E testing strategy | services/e2e-testing-strategy.md |
| Cookbook tree system (commands) | services/cookbook-trees.md |
| Slackpath reference architectures | services/cookbook-trees.md |
| Write cookbook trees | cookbook-trees/SKILL.md |
| Build orchestration | services/build-orchestration.md |
| Build event streaming | services/build-streaming.md |
| Resource provisioning plan | services/resource-provisioning-plan.md |
| Database provisioning | services/database-provisioning.md |
| Cache provisioning | services/cache-provisioning.md |
| CockroachDB operations | services/cockroachdb.md |
| Redis operations | services/redis.md |
| DNS / Cloudflare | services/dns-cloudflare.md |
| Network policies / internal routing | ops/networking.md |
| Debug external system health | ops/external-health-diagnostics.md |
| SDLC orchestration | services/sdlc.md |
| Visual verification (Playwright) | services/visual-verification.md |
| Interactive remote development | services/interactive-remote-dev.md |
| Structured logging | `internal/logging/` - field constants, context propagation, redaction |

Critical Rules

Root cause fixes: When diagnosing failures in generated projects, NEVER patch the project directly. Find the systemic root cause in: (1) platform - rdev handlers/services that create resources, (2) skeleton - templates that ship in generated projects, or (3) cookbook - test scripts with wrong assumptions. Fix the source, not the symptom. Every project-specific fix is technical debt that will recur.
LLM vs rdev: LLMs generate code; rdev executes deterministic operations (git, lint, deploy). Never rely on LLMs for runbook tasks.
Pod git ops: Git operations run inside pods via PodGitOperations (kubectl exec), never locally.
No dead code: Delete unused code immediately. Don't leave "might use later" exports.
KUBECONFIG: ALWAYS run export KUBECONFIG=~/.kube/orchard9-k3sf.yaml before kubectl commands.
Container builds: NEVER build Docker images locally. ALWAYS use git push origin main to trigger Woodpecker CI, which builds via in-cluster Kaniko. Local Docker builds produce the wrong architecture (arm64 instead of the cluster's amd64). If an image is missing from registry.threesix.ai, push to origin; don't improvise.
Hexagonal: Domain models in internal/domain/ must have ZERO external dependencies
Ports: All adapters implement interfaces from internal/port/
Migrations: NEVER modify committed migrations. Create NEW ones.
500-line limit: Files exceeding 500 lines must be split
Tests: All handlers and services require tests
Multi-step ops: NEVER log-and-continue after partial failure. Rollback or document partial state.
Logging: Use logging.FromContext(ctx) or an injected *slog.Logger. NEVER use fmt.Println, log.Fatal, log.Printf, or bare slog.Info(). The error key is ALWAYS "error" (not "err"). Use field constants from internal/logging/fields.go (e.g., logging.FieldProjectID, logging.FieldError). Log once at the boundary (handlers/workers log, services return errors). Sensitive data (passwords, tokens, keys) is auto-redacted.
HTTP clients: NEVER create &http.Client{} without a Timeout field. All HTTP clients must have explicit timeouts (30s standard, 5s for health checks). A bare client can hang indefinitely.
Config: Use envutil.GetEnv() / GetEnvInt() / GetEnvBool() from internal/envutil for all env var reads with defaults. NEVER define local getEnv helpers — they duplicate and drift. Raw os.Getenv() is fine for required values with no default (secrets, passwords).
Handler timeouts: NEVER use inline time.Duration in context.WithTimeout inside handlers. Use constants from internal/handlers/timeouts.go: TimeoutFastLookup (5s), TimeoutLookup (10s), TimeoutStandard (30s), TimeoutHeavyWrite (60s), TimeoutOrchestration (90s), TimeoutLongRunning (10m).
Worker timeouts: NEVER use inline time.Duration in context.WithTimeout inside worker code. Use constants from internal/worker/timeouts.go: TimeoutQuickOp (5s), TimeoutHealthCheck (10s), TimeoutMaintenance (30s), TimeoutWorkExecution (10m).
Response helpers: Use api.WriteUnauthorized, api.WriteForbidden, api.WriteBadRequest, api.WriteNotFound, api.WriteInternalError instead of bare api.WriteError with status codes. Only use api.WriteError directly for custom error codes (e.g., KEY_REVOKED, IP_NOT_ALLOWED).
Auth scopes: EVERY route in a handler's Mount() function MUST use r.With(auth.RequireScope(...)). Use ScopeProjectsRead for GET endpoints, ScopeProjectsExecute for mutation endpoints. Use the appropriate domain scope (e.g., ScopeQueueRead, ScopeBuildWrite) when available. Admin-only endpoints use auth.ScopeAdmin alone. See internal/handlers/builds.go for the canonical pattern.
JSON decoding: ALWAYS use api.DecodeJSON(r, &req) to decode request bodies. NEVER use raw json.NewDecoder(r.Body).Decode(). The helper handles nil body, EOF, and returns typed errors. Decode error message is always "invalid request body".
Validation: Use the validate.New() accumulator for 2+ field checks in handlers: v := validate.New(); v.Required(req.Name, "name"); v.Required(req.Type, "type"); if err := v.Error(); err != nil { ... }. Single-field checks can stay inline. NEVER duplicate validation logic that exists in internal/validate.
Error wrapping: ALWAYS use %w (not %v) when wrapping errors in fmt.Errorf. Using %v stringifies the error and breaks errors.Is/errors.As chains. For non-error types (structs, slices), create a typed error implementing error instead of stringifying with %v.
Context propagation: NEVER use context.Background() in handlers, services, or adapters that receive a context parameter. Always derive from parent context. Use context.WithoutCancel(ctx) for fire-and-forget goroutines that need tracing but independent cancellation.
Cookbooks: Load .claude/skills/cookbook-trees/SKILL.md before writing/modifying any cookbook tree.
Version alignment: Skeleton templates MUST use consistent versions across all files: Go 1.25 (go.work, go.mod, Dockerfiles, CI images), Node 20, Alpine 3.19. When updating a version, grep the entire templates/ tree and update ALL occurrences to prevent drift.

Quick Reference

# Environment (should already be in ~/.zshrc)
export KUBECONFIG=~/.kube/orchard9-k3sf.yaml
export RDEV_API_URL="https://rdev.masq-ops.orchard9.ai"
export RDEV_API_KEY="<your-api-key>"  # Already set in ~/.zshrc

# Verify environment is loaded
echo $RDEV_API_KEY  # Should print a base64 string
# If empty: source ~/.zshrc

# For scripts: use cookbooks/scripts/common.sh library
# Provides: api_call(), wait_for_build(), wait_for_pipeline(), wait_for_site()
# Example: source "$(dirname "$0")/common.sh" && api_call GET "/health"

# Run locally
go run ./cmd/rdev-api

# Run tests
go test ./...

# Automated deploy (push triggers Woodpecker CI)
git push origin main  # Builds and deploys automatically via Woodpecker

# Manual deploy (if Woodpecker unavailable)
kubectl apply -f deployments/k8s/base/rdev-api.yaml
kubectl rollout restart -n rdev deployment/rdev-api

# Images are at registry.threesix.ai/rdev/{api,worker,claudebox}

# Verify pods
kubectl get pods -n rdev

# View logs
./scripts/logs.sh           # Last 100 lines
./scripts/logs.sh -f        # Follow/stream
./scripts/logs.sh -n 500    # Last 500 lines
./scripts/logs.sh -e        # Errors only
./scripts/logs.sh -p        # Previous crashed container

# Shell aliases (after source ~/.zshrc)
rdev-logs                   # Last 100 lines
rdev-logs-f                 # Follow/stream
rdev-pods                   # List pods

# API calls - use cookbook test scripts (they handle auth via common.sh)
./cookbooks/scripts/landing-test.sh run|status|teardown <name>
./cookbooks/scripts/tree-runner.sh run <tree-name> --project-name <name>

# Or direct API calls (requires env vars above)
curl -s -H "X-API-Key: $RDEV_API_KEY" $RDEV_API_URL/health | jq
curl -s -H "X-API-Key: $RDEV_API_KEY" $RDEV_API_URL/projects | jq

Architecture Overview

cmd/rdev-api/          # Entry point, DI, OpenAPI spec
cmd/sdlc/              # SDLC CLI binary (runs inside project pods)
internal/
├── sdlc/              # SDLC library (types, classifier, state I/O)
├── domain/            # Pure business models (no deps)
├── port/              # Interface contracts
├── service/           # Business logic orchestration
├── handlers/          # HTTP handlers (REST endpoints)
├── adapter/           # Infrastructure implementations
│   ├── kubernetes/    # K8s client, pod executor
│   ├── postgres/      # Audit, queue, webhooks, credentials
│   ├── cockroach/     # Database provisioning (project DBs)
│   ├── redis/         # Cache provisioning via ACLs
│   ├── gitea/         # Git repository management
│   ├── cloudflare/    # DNS provider
│   └── woodpecker/    # CI provider
├── auth/              # API key auth, scopes
├── middleware/        # Rate limiting
├── worker/            # Background queue processor
└── webhook/           # Event dispatcher
pkg/api/               # HTTP framework (app, responses)
deployments/k8s/       # Kustomize manifests
  └── base/templates/  # Project templates
scripts/               # Operational scripts
  ├── load-credentials.sh  # Load secrets to rdev-api
  ├── release.sh           # Build, tag, push releases
  └── logs.sh              # View rdev-api logs
cookbooks/             # End-to-end workflow guides
  ├── landing-page.md      # Landing page deployment flow
  └── scripts/             # Executable cookbook scripts

Key Concepts

Projects: Kubernetes pods with Claude Code, discovered by label rdev.orchard9.ai/project=true
Workers: Shared claudebox pods that execute any project's tasks, labeled rdev.orchard9.ai/role=worker
Work Queue: Async task queue for build/test/deploy jobs
Credentials: Infrastructure secrets (tokens, keys) stored encrypted in PostgreSQL
Commands: Claude/shell/git commands executed via kubectl exec, streamed via SSE
API Keys: Scoped auth with project restrictions, IP filtering, expiration
Webhooks: Event subscriptions with retry delivery
Templates: Project scaffolding with .woodpecker.yml, .claude/, and stack files

threesix.ai Platform Status

| Feature | Status | Description |
|---------|--------|-------------|
| Woodpecker Auto-Activation | Done | CI enabled on project creation via SDK |
| Project Templates | Done | Embedded templates (astro-landing, go-api, default) |
| Work Queue | Done | PostgreSQL with atomic dequeue, retry logic |
| Multi-Provider Agents | Done | Claude Code + OpenCode via registry |
| Webhooks | Done | Event dispatcher with retry delivery |
| Embedded Worker | Done | Goroutine in rdev-api, polls queue |
| Multi-Domain Support | Done | Auto-slugs, custom subdomains, DNS aliases |
| Build Event Streaming | Done | Real-time SSE/WebSocket for build output |
| Database Provisioning | Done | CockroachDB adapter with auto-provisioning |
| Cache Provisioning | Done | Redis ACL-based adapter with auto-provisioning |
| Build Orchestration | Planned | Structured build specs via API |
| SDLC Orchestration | Done | Deterministic feature lifecycle with classifier engine, API, orchestrator, and 15 skeleton commands |
| Composable Monorepo Templates | Done | Monorepo skeleton + component templates (service, worker, app-astro, app-react, cli) |
| Visual Verification | Planned | Playwright screenshots/video + AI evaluation for feature completeness |
| Checkout/Checkin | Done | Sidecar dev flow: temporary git tokens, branch checkout, review on checkin |
| Interactive Remote Dev | Done | Sessions with pod binding, command execution, SSE streaming, ephemeral preview URLs |

Current Version: v0.10.25

Constraints

ON-PREM k3s - not GKE, always set KUBECONFIG
Kustomize only - no ArgoCD
chi/v5 router - no gin, echo, or other frameworks
sqlx for DB - no GORM
slog for logging - no logrus, zap
