Cookbook Tree System

Checkpoint-based cookbook execution with YAML tree definitions. Enables resumable, debuggable E2E test workflows.

Quick Reference

# Validate tree and show execution plan (safe preview)
./cookbooks/scripts/tree-runner.sh run landing-page --project-name my-test --dry-run

# Run a tree (creates checkpoint on each step)
./cookbooks/scripts/tree-runner.sh run landing-page --project-name my-test

# Run with auto-cleanup on exit
./cookbooks/scripts/tree-runner.sh run landing-page --project-name my-test --auto-teardown

# Resume from last checkpoint after failure
./cookbooks/scripts/tree-runner.sh resume landing-page

# Run only a specific step (debugging)
./cookbooks/scripts/tree-runner.sh only landing-page wait-pipeline

# Check status of a tree run
./cookbooks/scripts/tree-runner.sh status landing-page

# Teardown resources (runs tree's teardown section)
./cookbooks/scripts/tree-runner.sh teardown landing-page

# List all available trees
./cookbooks/scripts/tree-runner.sh list

# Clean checkpoint (discard state)
./cookbooks/scripts/tree-runner.sh clean landing-page

Global Flags

| Flag | Description |
|------|-------------|
| --dry-run | Validate tree and show execution plan without running |
| --auto-teardown | Run teardown steps on exit (success or failure) |

Dependencies

Required tools (pre-flight checks verify these):

  • yq - YAML parser (brew install yq)
  • jq - JSON parser (brew install jq)
  • curl - HTTP client (usually pre-installed)

Required environment variables:

  • RDEV_API_URL - API endpoint (e.g., https://rdev.masq-ops.orchard9.ai)
  • RDEV_API_KEY - API key for authentication

Optional:

  • API_TIMEOUT - Seconds before an API call times out (default: 60)
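
The pre-flight checks listed above can be approximated as follows. This is a minimal sketch in Python, not the runner's own code (the runner is a set of shell scripts); the function names preflight and api_timeout and their return shapes are assumptions made for illustration. Only the tool names, variable names, and the 60-second default come from this guide.

```python
import os
import shutil

REQUIRED_TOOLS = ["yq", "jq", "curl"]
REQUIRED_ENV = ["RDEV_API_URL", "RDEV_API_KEY"]


def preflight(env=None, which=shutil.which):
    """Return a list of problems; an empty list means every check passed.

    `env` and `which` are injectable so the check can be exercised
    without touching the real environment (hypothetical design choice).
    """
    env = os.environ if env is None else env
    problems = []
    for tool in REQUIRED_TOOLS:
        if which(tool) is None:
            problems.append(f"{tool} is required but not installed")
    for name in REQUIRED_ENV:
        if not env.get(name):
            problems.append(f"{name} environment variable is not set")
    return problems


def api_timeout(env=None):
    """API_TIMEOUT is optional; fall back to the documented default of 60."""
    env = os.environ if env is None else env
    return int(env.get("API_TIMEOUT", "60"))
```

Calling preflight() with no arguments inspects the real PATH and environment, which makes it a quick way to confirm a shell is ready before running a tree.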

Tree YAML Format

Tree definitions live in cookbooks/trees/ and define workflow steps as a DAG.

name: landing-page
description: Deploy a landing page
version: 1

# Variables (override on the CLI as --<var-name>, e.g. --project-name for project_name)
vars:
  project_name: ""  # Required, no default
  template: "app-astro"  # Optional, has default

steps:
  create-project:
    description: Create the project skeleton
    action: api
    method: POST
    endpoint: /project
    body:
      name: "{{ .vars.project_name }}"
      description: "Landing page E2E test"
    outputs:
      - project_id: .data.name
      - domain: .data.domain

  add-component:
    description: Add landing page component
    depends_on: [create-project]
    action: api
    method: POST
    endpoint: "/projects/{{ .outputs.create-project.project_id }}/components"
    body:
      type: "{{ .vars.template }}"
      name: landing
      template: "{{ .vars.template }}"

  wait-pipeline:
    description: Wait for CI pipeline to complete
    depends_on: [add-component]
    action: wait_pipeline
    project_id: "{{ .outputs.create-project.project_id }}"
    on_error: continue  # Don't fail the whole tree

  verify-site:
    description: Verify site is accessible
    depends_on: [wait-pipeline]
    action: wait_site
    domain: "{{ .outputs.create-project.domain }}"
    project_id: "{{ .outputs.create-project.project_id }}"

# Teardown runs in reverse order on failure or explicit teardown
teardown:
  - description: Delete project
    action: api
    method: DELETE
    endpoint: "/project/{{ .outputs.create-project.project_id }}"

Step Properties

| Property | Required | Description |
|----------|----------|-------------|
| description | No | Human-readable description |
| action | Yes | Action type: api, wait_pipeline, wait_build, wait_site, diagnose, shell |
| depends_on | No | Array of step names that must complete first |
| on_error | No | fail (default) or continue |
| outputs | No | Extract values from response (jq paths) |

Action Types

api

Make an authenticated API call.

action: api
method: POST  # GET, POST, DELETE, PUT, PATCH
endpoint: "/projects/{{ .outputs.create-project.project_id }}/components"
body:         # Optional, for POST/PUT/PATCH
  type: service
  name: api

wait_pipeline

Wait for a CI pipeline to complete.

action: wait_pipeline
project_id: "{{ .outputs.create-project.project_id }}"
max_attempts: 60    # Optional, default 60
poll_interval: 5    # Optional, default 5 seconds
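
The max_attempts and poll_interval settings describe a bounded polling loop. The sketch below shows that pattern in Python under stated assumptions: the poll function, its check callback, and its return shape are hypothetical and are not taken from the runner's implementation. With the documented defaults (60 attempts, 5 seconds apart) such a loop gives up after roughly five minutes.

```python
import time


def poll(check, max_attempts=60, poll_interval=5, sleep=time.sleep):
    """Call check() until it returns a non-None result or attempts run out.

    Returns (result, attempts_used); result is None when the loop times out.
    `sleep` is injectable so the loop can be tested without real delays.
    """
    for attempt in range(1, max_attempts + 1):
        result = check()
        if result is not None:
            return result, attempt
        # Do not sleep after the final attempt; there is nothing left to wait for.
        if attempt < max_attempts:
            sleep(poll_interval)
    return None, max_attempts
```

The same shape applies to wait_build and wait_site; only the defaults and the condition being checked differ.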

wait_build

Wait for a build/agent task to complete. Replaces shell-based polling loops.

action: wait_build
build_id: "{{ .outputs.implement-feature.build_id }}"
max_attempts: 120   # Optional, default 120
poll_interval: 5    # Optional, default 5 seconds

wait_site

Wait for a site to be accessible.

action: wait_site
domain: "{{ .outputs.create-project.domain }}"
project_id: "{{ .outputs.create-project.project_id }}"  # For diagnostics
max_attempts: 30
poll_interval: 5

diagnose

Run diagnostic checks.

action: diagnose
type: pipeline  # or 'site'
project_id: "{{ .outputs.create-project.project_id }}"
domain: "{{ .outputs.create-project.domain }}"  # For site diagnostics

shell

Run a shell command.

action: shell
command: "curl -s https://{{ .outputs.create-project.domain }}/api/health | jq ."
outputs:
  - health_status: .status

Template Variables

Variables are expanded using Go-template-style syntax ({{ .path }}):

  • .vars.<name> - Variables from CLI flags or tree defaults
  • .outputs.<step>.<key> - Outputs captured from previous steps
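
To make the two lookup paths concrete, here is a minimal sketch of placeholder expansion in Python. It is an illustration only: the expand function and the shape of the context dictionary are assumptions, and the runner's real expansion logic may differ. Note that step names such as create-project contain hyphens, so the pattern accepts them.

```python
import re

# Matches {{ .vars.name }} or {{ .outputs.step-name.key }}.
PLACEHOLDER = re.compile(r"\{\{\s*\.([A-Za-z0-9_.-]+)\s*\}\}")


def expand(text, context):
    """Replace each placeholder with the value found by walking its dotted path."""
    def lookup(match):
        value = context
        for part in match.group(1).split("."):
            if not isinstance(value, dict) or part not in value:
                raise KeyError(f"unresolved template path: .{match.group(1)}")
            value = value[part]
        return str(value)

    return PLACEHOLDER.sub(lookup, text)


# Hypothetical context: vars from the CLI, outputs captured from earlier steps.
context = {
    "vars": {"project_name": "my-test"},
    "outputs": {"create-project": {"project_id": "my-test"}},
}
endpoint = expand("/projects/{{ .outputs.create-project.project_id }}/components", context)
```

Raising on an unresolved path, rather than substituting an empty string, keeps a typo in a step name from silently producing a malformed endpoint.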

Checkpoint Format

Checkpoints are stored in cookbooks/.checkpoints/ (gitignored) as JSON:

{
  "tree": "landing-page",
  "run_id": "landing-page-1706889600",
  "status": "partial",
  "vars": {
    "project_name": "test-landing"
  },
  "steps": {
    "create-project": {
      "status": "completed",
      "started_at": "2025-02-01T10:00:00Z",
      "completed_at": "2025-02-01T10:00:05Z",
      "output": {
        "project_id": "test-landing",
        "domain": "test-landing.threesix.ai"
      }
    },
    "wait-pipeline": {
      "status": "failed",
      "started_at": "2025-02-01T10:00:05Z",
      "completed_at": "2025-02-01T10:05:00Z",
      "error": "Pipeline #3 failed with status: failure"
    }
  },
  "last_completed_step": "create-project"
}
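
A checkpoint in this shape carries enough information to decide what a resume still has to do. The sketch below reads one in Python; it assumes that a resume reruns every step not recorded as completed, which this guide does not state explicitly. The function names are hypothetical.

```python
import json


def load_checkpoint(path):
    """Read a checkpoint file such as cookbooks/.checkpoints/landing-page.json."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def remaining_steps(checkpoint, ordered_steps):
    """Return the steps still to run, in plan order.

    Assumption: a step counts as done only if its status is "completed";
    failed and unrecorded steps are run again.
    """
    recorded = checkpoint.get("steps", {})
    return [
        name for name in ordered_steps
        if recorded.get(name, {}).get("status") != "completed"
    ]


def collected_outputs(checkpoint):
    """Map step name -> captured outputs for completed steps."""
    return {
        name: step["output"]
        for name, step in checkpoint.get("steps", {}).items()
        if step.get("status") == "completed" and "output" in step
    }
```

For the example checkpoint above, remaining_steps would skip create-project and return every later step, while collected_outputs would supply project_id and domain for template expansion.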

Checkpoint Status Values

  • pending - Tree started but no steps completed
  • partial - Some steps completed, some pending/failed
  • completed - All steps completed successfully
  • failed - A step failed with on_error: fail
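
One way to read the four status values is as a function of the per-step statuses. The Python sketch below encodes that reading; the precedence between the rules (a hard failure wins over everything else) is an assumption, as are the function name and argument shapes.

```python
def tree_status(step_statuses, on_error=None):
    """Derive the checkpoint status from per-step statuses.

    step_statuses: step name -> "pending" | "completed" | "failed"
    on_error:      step name -> "fail" | "continue" (missing means "fail")
    """
    on_error = on_error or {}
    # A failed step fails the tree unless it is marked on_error: continue.
    for name, status in step_statuses.items():
        if status == "failed" and on_error.get(name, "fail") == "fail":
            return "failed"
    values = list(step_statuses.values())
    if values and all(status == "completed" for status in values):
        return "completed"
    if any(status == "completed" for status in values):
        return "partial"
    return "pending"
```

Under this reading, a tree whose only failure is a step marked on_error: continue is reported as partial rather than failed.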

Creating a New Tree

  1. Create cookbooks/trees/<name>.yaml
  2. Define steps with dependencies
  3. Add teardown section
  4. Test with tree-runner.sh run <name> --project-name test-$(date +%s)

Best Practices

  • Always include teardown - Clean up resources even if the tree fails
  • Use descriptive step names - They appear in status output
  • Set on_error: continue for non-critical steps - Pipeline failures shouldn't block site verification
  • Capture outputs - Pass data between steps via outputs, not hardcoded values
  • Use vars for inputs - Makes trees reusable with different parameters

Common Mistakes

1. YAML Indentation Errors

YAML requires consistent indentation with spaces only (no tabs). Steps must be indented under the steps: key:

# WRONG - tabs or inconsistent spacing
steps:
	create-project:    # Tab character - will fail
    action: api

# CORRECT - 2-space indent
steps:
  create-project:
    action: api

2. Missing Output Dependencies

If you reference {{ .outputs.step-name.key }}, the referencing step must have step-name in its depends_on array. Validation will catch this:

# WRONG - references create-project but doesn't depend on it
wait-pipeline:
  action: wait_pipeline
  project_id: "{{ .outputs.create-project.project_id }}"
  # Missing: depends_on: [create-project]

# CORRECT
wait-pipeline:
  depends_on: [create-project]
  action: wait_pipeline
  project_id: "{{ .outputs.create-project.project_id }}"

Error message: wait-pipeline: references outputs from "create-project" but does not depend on it (directly or transitively)

Note: Transitive dependencies are valid. If A depends on B, and B depends on C, then A can use outputs from C.
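
The rule above amounts to checking each output reference against the set of steps reachable through depends_on. The following Python sketch shows one way to do that; it is not the validator's actual code, and the function names and the dictionary shape for steps are assumptions. The error text mirrors the message quoted above.

```python
import re

# Captures the step name in {{ .outputs.<step>.<key> }}.
OUTPUT_REF = re.compile(r"\{\{\s*\.outputs\.([A-Za-z0-9_-]+)\.")


def strings_in(value):
    """Yield every string nested inside dicts and lists (e.g. inside body:)."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        for item in value.values():
            yield from strings_in(item)
    elif isinstance(value, list):
        for item in value:
            yield from strings_in(item)


def ancestors(step, steps):
    """All steps reachable from `step` by following depends_on edges."""
    seen, stack = set(), list(steps[step].get("depends_on", []))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(steps.get(dep, {}).get("depends_on", []))
    return seen


def missing_dependencies(steps):
    """Return one error string per output reference that lacks a dependency path."""
    errors = []
    for name, spec in steps.items():
        referenced = set()
        for text in strings_in(spec):
            referenced.update(OUTPUT_REF.findall(text))
        for target in sorted(referenced - ancestors(name, steps)):
            errors.append(
                f'{name}: references outputs from "{target}" but does not '
                "depend on it (directly or transitively)"
            )
    return errors
```

Because ancestors tracks visited steps, the walk terminates even if the tree also contains a dependency cycle.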

3. Template Escaping in Shell Commands

Shell commands with template variables need proper quoting to handle spaces and special characters:

# RISKY - unquoted expansion
action: shell
command: curl https://{{ .outputs.create-project.domain }}/api/health

# SAFER - quoted expansion
action: shell
command: 'curl "https://{{ .outputs.create-project.domain }}/api/health"'
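
To see why quoting matters, the sketch below builds the same curl command in Python and quotes the expanded URL with the standard library's shlex.quote. This only illustrates the effect of quoting; it is not a claim about how the runner expands or escapes values, and health_check_command is a hypothetical helper.

```python
import shlex


def health_check_command(domain):
    """Build a curl command line with the expanded URL safely quoted."""
    url = f"https://{domain}/api/health"
    return f"curl -s {shlex.quote(url)}"


# A well-formed domain needs no quoting and is passed through unchanged.
plain = health_check_command("my-test.example.com")

# A value containing spaces or shell metacharacters stays a single argument.
hostile = health_check_command("a b; rm -rf /")
```

Splitting the second command with shlex.split yields exactly three arguments (curl, -s, and the URL), so the embedded semicolon never reaches the shell as a command separator.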

4. Outputs Array Syntax

Outputs must be an array of single-key objects, not a flat object:

# WRONG - flat object
outputs:
  project_id: .data.name
  domain: .data.domain

# CORRECT - array of objects
outputs:
  - project_id: .data.name
  - domain: .data.domain
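
The array-of-objects shape can be checked and flattened in a few lines. The Python sketch below is an illustration under assumptions: parse_outputs and extract are hypothetical names, and extract handles only plain dotted paths such as .data.name, which is a small subset of what jq accepts.

```python
def parse_outputs(outputs):
    """Turn [{"project_id": ".data.name"}, ...] into {"project_id": ".data.name", ...}."""
    if not isinstance(outputs, list):
        raise ValueError("outputs must be an array of single-key objects")
    parsed = {}
    for entry in outputs:
        if not isinstance(entry, dict) or len(entry) != 1:
            raise ValueError(f"each outputs entry must have exactly one key: {entry!r}")
        name, path = next(iter(entry.items()))
        parsed[name] = path
    return parsed


def extract(response, path):
    """Resolve a simple dotted path like ".data.name" (no filters or indexing)."""
    value = response
    for key in path.strip(".").split("."):
        if key:
            value = value[key]
    return value


spec = parse_outputs([{"project_id": ".data.name"}, {"domain": ".data.domain"}])
# Hypothetical API response used only for illustration.
response = {"data": {"name": "my-test", "domain": "my-test.example.com"}}
captured = {name: extract(response, path) for name, path in spec.items()}
```

Passing the flat-object form from the WRONG example to parse_outputs raises ValueError instead of silently accepting it.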

5. Circular Dependencies

Dependencies form a DAG (directed acyclic graph). Cycles cause validation failures:

# WRONG - circular dependency
step-a:
  depends_on: [step-b]
step-b:
  depends_on: [step-a]  # Creates cycle!

# CORRECT - linear or fan-out dependencies
step-a:
  depends_on: []
step-b:
  depends_on: [step-a]
step-c:
  depends_on: [step-a]  # Fan-out OK

Error message: Dependency cycle detected
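
Cycle detection falls out of computing an execution order. The sketch below uses Kahn's algorithm in Python; the guide does not say which algorithm the validator uses, so treat this as one possible approach. The function name and input shape are assumptions, and every name in depends_on is assumed to refer to a defined step.

```python
from collections import deque


def execution_order(steps):
    """Return step names in dependency order using Kahn's algorithm.

    steps: step name -> list of step names it depends on.
    Raises ValueError if the dependencies contain a cycle.
    """
    # Number of unmet dependencies per step.
    pending = {name: len(set(deps)) for name, deps in steps.items()}
    # Reverse edges: which steps are waiting on each step.
    dependents = {name: [] for name in steps}
    for name, deps in steps.items():
        for dep in set(deps):
            dependents[dep].append(name)

    ready = deque(name for name, count in pending.items() if count == 0)
    order = []
    while ready:
        current = ready.popleft()
        order.append(current)
        for child in dependents[current]:
            pending[child] -= 1
            if pending[child] == 0:
                ready.append(child)

    # Steps caught in a cycle never reach zero unmet dependencies.
    if len(order) != len(steps):
        raise ValueError("Dependency cycle detected")
    return order
```

For the fan-out example above, the order is step-a followed by step-b and step-c; for the circular example, no step is ever ready and the function raises.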

6. Hardcoded Values Instead of Outputs

Avoid hardcoding values that should come from previous steps:

# WRONG - hardcoded project name
wait-pipeline:
  depends_on: [create-project]
  action: wait_pipeline
  project_id: "my-test-project"  # Should use output!

# CORRECT - use captured output
wait-pipeline:
  depends_on: [create-project]
  action: wait_pipeline
  project_id: "{{ .outputs.create-project.project_id }}"

Migrating from Script to Tree

Compare script steps to tree steps:

| Script Pattern | Tree Equivalent |
|----------------|-----------------|
| api_call POST /project "$json" | action: api, method: POST |
| wait_for_pipeline "$project" | action: wait_pipeline |
| wait_for_site "$domain" 30 5 "$project" | action: wait_site |
| diagnose_pipeline_failure "$project" | action: diagnose, type: pipeline |
| curl ... \| jq ... | action: shell, command: "..." |

Troubleshooting

Pre-flight check failures

Pre-flight checks failed:
  ✗ RDEV_API_URL environment variable is not set
  ✗ RDEV_API_KEY environment variable is not set

Set the required environment variables before running trees.

Tree not found

Error: Tree 'foo' not found
Available trees: landing-page, composable-app, sdlc-flow

Check that cookbooks/trees/foo.yaml exists.

yq not found

Error: yq is required but not installed

Install with brew install yq.

Resume finds no checkpoint

No checkpoint found for tree 'landing-page'

Run tree-runner.sh run landing-page ... first.

Step failed but outputs missing

Error: Output 'project_id' not found in step 'create-project'

The step may have failed silently. Check the checkpoint file:

jq '.steps["create-project"]' cookbooks/.checkpoints/landing-page.json

API timeout

curl: (28) Operation timed out

Increase the timeout with API_TIMEOUT=120 ./cookbooks/scripts/tree-runner.sh run ...

Available Trees

Basic Trees

| Tree | Description |
|------|-------------|
| landing-page | Single-page landing site with Astro |
| composable-app | Multi-component monorepo with service + app |
| sdlc-flow | Feature lifecycle with SDLC orchestration |

Aeries Trees (Multi-Phase Game Development)

Multi-phase workflow demonstrating progressive complexity for an AI agent simulation game:

| Tree | Description | Infrastructure |
|------|-------------|----------------|
| aeries-1-genesis | Monolith: Core API + React app for agent creation | Postgres |
| aeries-2-simulation | Extraction: Simulation service via strangler pattern | - |
| aeries-3-society | Social layer: Spatial service + Redis pub/sub | Redis |

Running the Aeries sequence:

# Phase 1: Create the monolith
./cookbooks/scripts/tree-runner.sh run aeries-1-genesis --project-name aeries-test

# Phase 2: Extract simulation service (operates on existing project)
./cookbooks/scripts/tree-runner.sh run aeries-2-simulation --project-id aeries-test

# Phase 3: Add social layer
./cookbooks/scripts/tree-runner.sh run aeries-3-society --project-id aeries-test

These trees demonstrate:

  • Multi-phase patterns - Later phases take project_id not project_name
  • Build polling - Shell-based waits for long-running SDLC builds
  • Service extraction - Strangler pattern via /extract-service command
  • No teardown in phases 2+ - Project lifecycle owned by Phase 1

Slackpath Trees (Reference Architectures)

Progressive complexity paths for building Slack-like platforms:

| Tree | Description | Infrastructure |
|------|-------------|----------------|
| slackpath-1-authenticated-service | Identity layer: User auth, JWT, protected routes | CockroachDB |
| slackpath-2-async-worker-pipeline | Background jobs: Producer/consumer with Redis | Redis |
| slackpath-3-realtime-chat | WebSockets: Pub/sub broadcasting | Redis |
| slackpath-4-microservice-constellation | Service mesh: Auth + Chat + Worker coordination | CockroachDB + Redis |

Running a slackpath:

./cookbooks/scripts/tree-runner.sh run slackpath-1-authenticated-service \
  --project-name auth-test-$(date +%s)

These trees demonstrate:

  • Infrastructure provisioning (type: postgres, type: redis)
  • Automatic credential injection (DATABASE_URL, REDIS_URL)
  • SDLC-driven implementation via /implement-feature prompts
  • End-to-end verification scripts

Files

cookbooks/
├── .checkpoints/           # Checkpoint storage (gitignored)
│   └── landing-page.json
├── scripts/
│   ├── lib/
│   │   ├── checkpoint.sh   # Checkpoint I/O
│   │   └── tree-parser.sh  # YAML parsing
│   └── tree-runner.sh      # Main executable
└── trees/
    ├── landing-page.yaml
    ├── composable-app.yaml
    ├── sdlc-flow.yaml
    ├── aeries-1-genesis.yaml           # Multi-phase: monolith
    ├── aeries-2-simulation.yaml        # Multi-phase: extraction
    ├── aeries-3-society.yaml           # Multi-phase: social layer
    ├── slackpath-1-authenticated-service.yaml
    ├── slackpath-2-async-worker-pipeline.yaml
    ├── slackpath-3-realtime-chat.yaml
    └── slackpath-4-microservice-constellation.yaml