# Cookbook Tree System

Checkpoint-based cookbook execution with YAML tree definitions. Enables resumable, debuggable E2E test workflows.

## Quick Reference

```bash
# Validate tree and show execution plan (safe preview)
./cookbooks/scripts/tree-runner.sh run landing-page --project-name my-test --dry-run

# Run a tree (creates checkpoint on each step)
./cookbooks/scripts/tree-runner.sh run landing-page --project-name my-test

# Run with auto-cleanup on exit
./cookbooks/scripts/tree-runner.sh run landing-page --project-name my-test --auto-teardown

# Resume from last checkpoint after failure
./cookbooks/scripts/tree-runner.sh resume landing-page

# Run only a specific step (debugging)
./cookbooks/scripts/tree-runner.sh only landing-page wait-pipeline

# Check status of a tree run
./cookbooks/scripts/tree-runner.sh status landing-page

# Teardown resources (runs tree's teardown section)
./cookbooks/scripts/tree-runner.sh teardown landing-page

# List all available trees
./cookbooks/scripts/tree-runner.sh list

# Clean checkpoint (discard state)
./cookbooks/scripts/tree-runner.sh clean landing-page
```

### Global Flags

| Flag | Description |
|------|-------------|
| `--dry-run` | Validate tree and show execution plan without running |
| `--auto-teardown` | Run teardown steps on exit (success or failure) |

## Dependencies

Required tools (pre-flight checks verify these):

- `yq` - YAML parser (`brew install yq`)
- `jq` - JSON parser (`brew install jq`)
- `curl` - HTTP client (usually pre-installed)

Required environment variables:

- `RDEV_API_URL` - API endpoint (e.g., `https://rdev.masq-ops.orchard9.ai`)
- `RDEV_API_KEY` - API key for authentication

Optional:

- `API_TIMEOUT` - Seconds before API calls timeout (default: 60)

## Tree YAML Format

Tree definitions live in `cookbooks/trees/` and define workflow steps as a DAG.

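Because steps form a DAG, a valid execution order is any topological sort of the `depends_on` edges. As a rough illustration only (not necessarily how `tree-runner.sh` builds its plan), the standard `tsort` utility can order the steps of the example tree in this section. The `execution_order` function name and the hard-coded edge list are assumptions made for this sketch.

```bash
# Sketch: order steps by their depends_on edges with tsort.
# Each input line is "<dependency> <step>", meaning the dependency runs first.
# The edge list is hard-coded for illustration; it is not read from the YAML.
execution_order() {
  tsort <<'EOF'
create-project add-component
add-component wait-pipeline
wait-pipeline verify-site
EOF
}

execution_order
```

A linear chain like this has only one valid order, so the steps are printed one per line in dependency order.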
```yaml
name: landing-page
description: Deploy a landing page
version: 1

# Variables (can be overridden via --var-name)
vars:
  project_name: ""        # Required, no default
  template: "app-astro"   # Optional, has default

steps:
  create-project:
    description: Create the project skeleton
    action: api
    method: POST
    endpoint: /project
    body:
      name: "{{ .vars.project_name }}"
      description: "Landing page E2E test"
    outputs:
      - project_id: .data.name
      - domain: .data.domain

  add-component:
    description: Add landing page component
    depends_on: [create-project]
    action: api
    method: POST
    endpoint: "/projects/{{ .outputs.create-project.project_id }}/components"
    body:
      type: "{{ .vars.template }}"
      name: landing
      template: "{{ .vars.template }}"

  wait-pipeline:
    description: Wait for CI pipeline to complete
    depends_on: [add-component]
    action: wait_pipeline
    project_id: "{{ .outputs.create-project.project_id }}"
    on_error: continue  # Don't fail the whole tree

  verify-site:
    description: Verify site is accessible
    depends_on: [wait-pipeline]
    action: wait_site
    domain: "{{ .outputs.create-project.domain }}"
    project_id: "{{ .outputs.create-project.project_id }}"

# Teardown runs in reverse order on failure or explicit teardown
teardown:
  - description: Delete project
    action: api
    method: DELETE
    endpoint: "/project/{{ .outputs.create-project.project_id }}"
```

### Step Properties

| Property | Required | Description |
|----------|----------|-------------|
| `description` | No | Human-readable description |
| `action` | Yes | Action type: `api`, `wait_pipeline`, `wait_build`, `wait_site`, `diagnose`, `shell` |
| `depends_on` | No | Array of step names that must complete first |
| `on_error` | No | `fail` (default) or `continue` |
| `outputs` | No | Extract values from response (jq paths) |

### Action Types

#### api

Make an authenticated API call.

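The `endpoint` and `body` fields of an `api` step usually carry template placeholders such as `{{ .outputs.create-project.project_id }}`. A minimal sketch of how one placeholder could be resolved, assuming plain string substitution (the real runner may expand templates differently); `expand_template` is a hypothetical helper, not part of `tree-runner.sh`.

```bash
# Sketch: replace every "{{ <path> }}" placeholder in a string with a value.
# expand_template is a hypothetical helper, not part of tree-runner.sh.
# The body runs in a subshell "( ... )" so its variables stay local.
expand_template() (
  text=$1
  placeholder="{{ $2 }}"
  value=$3
  result=
  while :; do
    case $text in
      *"$placeholder"*)
        # Keep everything before the first placeholder, then append the value.
        result=$result${text%%"$placeholder"*}$value
        # Continue with whatever follows that placeholder.
        text=${text#*"$placeholder"}
        ;;
      *) break ;;
    esac
  done
  printf '%s\n' "$result$text"
)

expand_template \
  '/projects/{{ .outputs.create-project.project_id }}/components' \
  '.outputs.create-project.project_id' \
  'my-test'
```

With these arguments the call prints `/projects/my-test/components`. The sketch uses only POSIX parameter expansion, so it does not depend on bash-specific features.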
```yaml
action: api
method: POST          # GET, POST, DELETE, PUT, PATCH
endpoint: "/projects/{{ .outputs.create-project.project_id }}/components"
body:                 # Optional, for POST/PUT/PATCH
  type: service
  name: api
```

#### wait_pipeline

Wait for a CI pipeline to complete.

```yaml
action: wait_pipeline
project_id: "{{ .outputs.create-project.project_id }}"
max_attempts: 60      # Optional, default 60
poll_interval: 5      # Optional, default 5 seconds
```

#### wait_build

Wait for a build/agent task to complete. Replaces shell-based polling loops.

```yaml
action: wait_build
build_id: "{{ .outputs.implement-feature.build_id }}"
max_attempts: 120     # Optional, default 120
poll_interval: 5      # Optional, default 5 seconds
```

#### wait_site

Wait for a site to be accessible.

```yaml
action: wait_site
domain: "{{ .outputs.create-project.domain }}"
project_id: "{{ .outputs.create-project.project_id }}"  # For diagnostics
max_attempts: 30
poll_interval: 5
```

#### diagnose

Run diagnostic checks.

```yaml
action: diagnose
type: pipeline        # or 'site'
project_id: "{{ .outputs.create-project.project_id }}"
domain: "{{ .outputs.create-project.domain }}"  # For site diagnostics
```

#### shell

Run a shell command.

```yaml
action: shell
command: "curl -s https://{{ .outputs.create-project.domain }}/api/health | jq ."
outputs:
  - health_status: .status
```

### Template Variables

Variables are expanded using Go template syntax (`{{ .path }}`):

- `.vars.<name>` - Variables from CLI flags or tree defaults
- `.outputs.<step-name>.<key>` - Outputs captured from previous steps

## Checkpoint Format

Checkpoints are stored in `cookbooks/.checkpoints/` (gitignored) as JSON:

```json
{
  "tree": "landing-page",
  "run_id": "landing-page-1706889600",
  "status": "partial",
  "vars": {
    "project_name": "test-landing"
  },
  "steps": {
    "create-project": {
      "status": "completed",
      "started_at": "2025-02-01T10:00:00Z",
      "completed_at": "2025-02-01T10:00:05Z",
      "output": {
        "project_id": "test-landing",
        "domain": "test-landing.threesix.ai"
      }
    },
    "wait-pipeline": {
      "status": "failed",
      "started_at": "2025-02-01T10:00:05Z",
      "completed_at": "2025-02-01T10:05:00Z",
      "error": "Pipeline #3 failed with status: failure"
    }
  },
  "last_completed_step": "create-project"
}
```

### Checkpoint Status Values

- `pending` - Tree started but no steps completed
- `partial` - Some steps completed, some pending/failed
- `completed` - All steps completed successfully
- `failed` - A step failed with `on_error: fail`

## Creating a New Tree

1. Create `cookbooks/trees/<tree-name>.yaml`
2. Define steps with dependencies
3. Add teardown section
4. Test with `tree-runner.sh run <tree-name> --project-name test-$(date +%s)`

### Best Practices

- **Always include teardown** - Clean up resources even if the tree fails
- **Use descriptive step names** - They appear in status output
- **Set on_error: continue for non-critical steps** - Pipeline failures shouldn't block site verification
- **Capture outputs** - Pass data between steps via outputs, not hardcoded values
- **Use vars for inputs** - Makes trees reusable with different parameters

### Common Mistakes

#### 1. YAML Indentation Errors

YAML requires consistent indentation with **spaces only** (no tabs).
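
One quick way to spot tab indentation before running a tree, sketched in shell. `has_tabs` is a hypothetical helper name and is not part of `tree-runner.sh`; the path in the usage comment is only an example.

```bash
# Sketch: list lines whose indentation contains a tab character.
# has_tabs is a hypothetical helper name; it is not part of tree-runner.sh.
has_tabs() {
  # printf emits a literal tab; the pattern matches optional leading spaces
  # followed by that tab, and grep -n prefixes each hit with its line number.
  grep -n "^ *$(printf '\t')" "$1"
}

# Usage (example path):
# has_tabs cookbooks/trees/landing-page.yaml && echo "tabs found, replace them with spaces"
```

`grep` exits with status 0 when at least one such line exists and 1 when there are none, so the function can be used directly in a condition.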
Steps must be indented under `steps:`:

```yaml
# WRONG - tabs or inconsistent spacing
steps:
	create-project:    # Tab character - will fail
  action: api

# CORRECT - 2-space indent
steps:
  create-project:
    action: api
```

#### 2. Missing Output Dependencies

If you reference `{{ .outputs.step-name.key }}`, the referencing step **must** have `step-name` in its `depends_on` array. Validation will catch this:

```yaml
# WRONG - references create-project but doesn't depend on it
wait-pipeline:
  action: wait_pipeline
  project_id: "{{ .outputs.create-project.project_id }}"
  # Missing: depends_on: [create-project]

# CORRECT
wait-pipeline:
  depends_on: [create-project]
  action: wait_pipeline
  project_id: "{{ .outputs.create-project.project_id }}"
```

**Error message:** `wait-pipeline: references outputs from "create-project" but does not depend on it (directly or transitively)`

**Note:** Transitive dependencies are valid. If A depends on B, and B depends on C, then A can use outputs from C.

#### 3. Template Escaping in Shell Commands

Shell commands with template variables need proper quoting to handle spaces and special characters:

```yaml
# RISKY - unquoted expansion
action: shell
command: curl https://{{ .outputs.create-project.domain }}/api/health

# SAFER - quoted expansion
action: shell
command: 'curl "https://{{ .outputs.create-project.domain }}/api/health"'
```

#### 4. Outputs Array Syntax

Outputs must be an array of single-key objects, not a flat object:

```yaml
# WRONG - flat object
outputs:
  project_id: .data.name
  domain: .data.domain

# CORRECT - array of objects
outputs:
  - project_id: .data.name
  - domain: .data.domain
```

#### 5. Circular Dependencies

Dependencies form a DAG (directed acyclic graph). Cycles cause validation failures:

```yaml
# WRONG - circular dependency
step-a:
  depends_on: [step-b]
step-b:
  depends_on: [step-a]  # Creates cycle!

# CORRECT - linear or fan-out dependencies
step-a:
  depends_on: []
step-b:
  depends_on: [step-a]
step-c:
  depends_on: [step-a]  # Fan-out OK
```

**Error message:** `Dependency cycle detected`

#### 6. Hardcoded Values Instead of Outputs

Avoid hardcoding values that should come from previous steps:

```yaml
# WRONG - hardcoded project name
wait-pipeline:
  depends_on: [create-project]
  action: wait_pipeline
  project_id: "my-test-project"  # Should use output!

# CORRECT - use captured output
wait-pipeline:
  depends_on: [create-project]
  action: wait_pipeline
  project_id: "{{ .outputs.create-project.project_id }}"
```

## Migrating from Script to Tree

Compare script steps to tree steps:

| Script Pattern | Tree Equivalent |
|----------------|-----------------|
| `api_call POST /project "$json"` | `action: api`, `method: POST` |
| `wait_for_pipeline "$project"` | `action: wait_pipeline` |
| `wait_for_site "$domain" 30 5 "$project"` | `action: wait_site` |
| `diagnose_pipeline_failure "$project"` | `action: diagnose`, `type: pipeline` |
| `curl ... \| jq ...` | `action: shell`, `command: "..."` |

## Troubleshooting

### Pre-flight check failures

```
Pre-flight checks failed:
  ✗ RDEV_API_URL environment variable is not set
  ✗ RDEV_API_KEY environment variable is not set
```

Set the required environment variables before running trees.

### Tree not found

```
Error: Tree 'foo' not found
Available trees: landing-page, composable-app, sdlc-flow
```

Check that `cookbooks/trees/foo.yaml` exists.

### yq not found

```
Error: yq is required but not installed
```

Install with `brew install yq`.

### Resume finds no checkpoint

```
No checkpoint found for tree 'landing-page'
```

Run `tree-runner.sh run landing-page ...` first.

### Step failed but outputs missing

```
Error: Output 'project_id' not found in step 'create-project'
```

The step may have failed silently.
Check the checkpoint file:

```bash
jq '.steps["create-project"]' cookbooks/.checkpoints/landing-page.json
```

### API timeout

```
curl: (28) Operation timed out
```

Increase the timeout with `API_TIMEOUT=120 ./cookbooks/scripts/tree-runner.sh run ...`

## Available Trees

### Basic Trees

| Tree | Description |
|------|-------------|
| `landing-page` | Single-page landing site with Astro |
| `composable-app` | Multi-component monorepo with service + app |
| `sdlc-flow` | Feature lifecycle with SDLC orchestration |

### Aeries Trees (Multi-Phase Game Development)

Multi-phase workflow demonstrating progressive complexity for an AI agent simulation game:

| Tree | Description | Infrastructure |
|------|-------------|----------------|
| `aeries-1-genesis` | Monolith: Core API + React app for agent creation | Postgres |
| `aeries-2-simulation` | Extraction: Simulation service via strangler pattern | - |
| `aeries-3-society` | Social layer: Spatial service + Redis pub/sub | Redis |

**Running the Aeries sequence:**

```bash
# Phase 1: Create the monolith
./cookbooks/scripts/tree-runner.sh run aeries-1-genesis --project-name aeries-test

# Phase 2: Extract simulation service (operates on existing project)
./cookbooks/scripts/tree-runner.sh run aeries-2-simulation --project-id aeries-test

# Phase 3: Add social layer
./cookbooks/scripts/tree-runner.sh run aeries-3-society --project-id aeries-test
```

These trees demonstrate:

- **Multi-phase patterns** - Later phases take `project_id` not `project_name`
- **Build polling** - Shell-based waits for long-running SDLC builds
- **Service extraction** - Strangler pattern via `/extract-service` command
- **No teardown in phases 2+** - Project lifecycle owned by Phase 1

### Slackpath Trees (Reference Architectures)

Progressive complexity paths for building Slack-like platforms:

| Tree | Description | Infrastructure |
|------|-------------|----------------|
| `slackpath-1-authenticated-service` | Identity layer: User auth, JWT, protected routes | CockroachDB |
| `slackpath-2-async-worker-pipeline` | Background jobs: Producer/consumer with Redis | Redis |
| `slackpath-3-realtime-chat` | WebSockets: Pub/sub broadcasting | Redis |
| `slackpath-4-microservice-constellation` | Service mesh: Auth + Chat + Worker coordination | CockroachDB + Redis |

**Running a slackpath:**

```bash
./cookbooks/scripts/tree-runner.sh run slackpath-1-authenticated-service \
  --project-name auth-test-$(date +%s)
```

These trees demonstrate:

- Infrastructure provisioning (`type: postgres`, `type: redis`)
- Automatic credential injection (`DATABASE_URL`, `REDIS_URL`)
- SDLC-driven implementation via `/implement-feature` prompts
- End-to-end verification scripts

## Files

```
cookbooks/
├── .checkpoints/                  # Checkpoint storage (gitignored)
│   └── landing-page.json
├── scripts/
│   ├── lib/
│   │   ├── checkpoint.sh          # Checkpoint I/O
│   │   └── tree-parser.sh         # YAML parsing
│   └── tree-runner.sh             # Main executable
└── trees/
    ├── landing-page.yaml
    ├── composable-app.yaml
    ├── sdlc-flow.yaml
    ├── aeries-1-genesis.yaml      # Multi-phase: monolith
    ├── aeries-2-simulation.yaml   # Multi-phase: extraction
    ├── aeries-3-society.yaml      # Multi-phase: social layer
    ├── slackpath-1-authenticated-service.yaml
    ├── slackpath-2-async-worker-pipeline.yaml
    ├── slackpath-3-realtime-chat.yaml
    └── slackpath-4-microservice-constellation.yaml
```

## Related

- [E2E Testing Strategy](./e2e-testing-strategy.md) - When to run trees, philosophy, history tracking
- [Composable Monorepo Templates](./composable-monorepo.md) - Template structure tested by trees