# Cookbook Tree System

Runs cookbooks as checkpointed workflows defined in YAML tree files, so an E2E test run can be resumed after a failure and debugged one step at a time.

## Quick Reference

```bash
# Run a tree (creates a checkpoint on each step)
./cookbooks/scripts/tree-runner.sh run landing-page --project-name my-test

# Resume from last checkpoint after failure
./cookbooks/scripts/tree-runner.sh resume landing-page

# Run only a specific step (debugging)
./cookbooks/scripts/tree-runner.sh only landing-page wait-pipeline

# Check status of a tree run
./cookbooks/scripts/tree-runner.sh status landing-page

# Teardown resources (runs tree's teardown section)
./cookbooks/scripts/tree-runner.sh teardown landing-page

# List all available trees
./cookbooks/scripts/tree-runner.sh list

# Clean checkpoint (discard state)
./cookbooks/scripts/tree-runner.sh clean landing-page
```

## Dependencies

- `yq` - YAML parser (`brew install yq`)
- `jq` - JSON parser (already required by common.sh)
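
A runner script could verify both tools up front, before parsing any tree. The following is a minimal sketch, not the actual check in `tree-runner.sh`; the `require_cmd` helper name is hypothetical, and the error text simply mirrors the message shown under Troubleshooting.

```bash
# Hypothetical helper: succeeds if the named command is on PATH,
# otherwise prints an error to stderr and returns non-zero.
require_cmd() {
  if ! command -v "$1" >/dev/null 2>&1; then
    echo "Error: $1 is required but not installed" >&2
    return 1
  fi
}

# Typical use at the top of a script:
#   require_cmd yq || exit 1
#   require_cmd jq || exit 1
```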

## Tree YAML Format

Tree definitions live in `cookbooks/trees/` and define workflow steps as a DAG.

```yaml
name: landing-page
description: Deploy a landing page
version: 1

# Variables (override on the CLI with --<var-name>, e.g. --project-name for project_name)
vars:
  project_name: ""  # Required, no default
  template: "app-astro"  # Optional, has default

steps:
  create-project:
    description: Create the project skeleton
    action: api
    method: POST
    endpoint: /project
    body:
      name: "{{ .vars.project_name }}"
      description: "Landing page E2E test"
    outputs:
      - project_id: .data.name
      - domain: .data.domain

  add-component:
    description: Add landing page component
    depends_on: [create-project]
    action: api
    method: POST
    endpoint: "/projects/{{ .outputs.create-project.project_id }}/components"
    body:
      type: "{{ .vars.template }}"
      name: landing
      template: "{{ .vars.template }}"

  wait-pipeline:
    description: Wait for CI pipeline to complete
    depends_on: [add-component]
    action: wait_pipeline
    project_id: "{{ .outputs.create-project.project_id }}"
    on_error: continue  # Don't fail the whole tree

  verify-site:
    description: Verify site is accessible
    depends_on: [wait-pipeline]
    action: wait_site
    domain: "{{ .outputs.create-project.domain }}"
    project_id: "{{ .outputs.create-project.project_id }}"

# Teardown runs in reverse order on failure or explicit teardown
teardown:
  - description: Delete project
    action: api
    method: DELETE
    endpoint: "/project/{{ .outputs.create-project.project_id }}"
```

### Step Properties

| Property | Required | Description |
|----------|----------|-------------|
| `description` | No | Human-readable description |
| `action` | Yes | Action type: `api`, `wait_pipeline`, `wait_site`, `diagnose`, `shell` |
| `depends_on` | No | Array of step names that must complete first |
| `on_error` | No | `fail` (default) or `continue` |
| `outputs` | No | List of `name: <jq path>` pairs; each path is evaluated against the step's response and stored under `name` |
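
Because `depends_on` turns the steps into a DAG, a valid execution order is any topological sort of that graph. As an illustration only (the runner's real scheduling logic is not shown in this guide), the standard `tsort` utility can compute the order for the `landing-page` example when given one `dependency step` pair per line. The `execution_order` function name is hypothetical.

```bash
# Hypothetical illustration: one "<dependency> <step>" edge per depends_on entry.
execution_order() {
  tsort <<'EOF'
create-project add-component
add-component wait-pipeline
wait-pipeline verify-site
EOF
}

execution_order
# Prints, one per line:
#   create-project
#   add-component
#   wait-pipeline
#   verify-site
```

This chain has only one valid order; in a tree with independent branches, several orders are valid and `tsort` picks one of them.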

### Action Types

#### api
Make an authenticated API call.

```yaml
action: api
method: POST  # GET, POST, DELETE, PUT, PATCH
endpoint: "/projects/{{ .outputs.create-project.project_id }}/components"
body:         # Optional, for POST/PUT/PATCH
  type: service
  name: api
```

#### wait_pipeline
Wait for a CI pipeline to complete.

```yaml
action: wait_pipeline
project_id: "{{ .outputs.create-project.project_id }}"
max_attempts: 60    # Optional, default 60
poll_interval: 5    # Optional, default 5 seconds
```

#### wait_site
Wait for a site to be accessible.

```yaml
action: wait_site
domain: "{{ .outputs.create-project.domain }}"
project_id: "{{ .outputs.create-project.project_id }}"  # For diagnostics
max_attempts: 30
poll_interval: 5
```
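
Both wait actions follow the same pattern: retry a check up to `max_attempts` times, sleeping `poll_interval` seconds between tries. A minimal sketch of that loop, assuming a hypothetical `poll_until` helper (the runner's own implementation may differ):

```bash
# Hypothetical helper, not part of tree-runner.sh.
# $1: max attempts, $2: seconds between attempts, rest: command to retry.
poll_until() {
  local max_attempts="$1" poll_interval="$2" attempt=1
  shift 2
  while :; do
    # Succeed as soon as the check passes.
    "$@" && return 0
    # Give up once the attempt budget is spent (always tries at least once).
    [ "$attempt" -ge "$max_attempts" ] && return 1
    attempt=$((attempt + 1))
    sleep "$poll_interval"
  done
}
```

For example, `poll_until 30 5 curl -fsS -o /dev/null "https://$domain"` would mirror the `wait_site` values shown above, with `$domain` standing in for the captured `domain` output.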

#### diagnose
Run diagnostic checks.

```yaml
action: diagnose
type: pipeline  # or 'site'
project_id: "{{ .outputs.create-project.project_id }}"
domain: "{{ .outputs.create-project.domain }}"  # For site diagnostics
```

#### shell
Run a shell command.

```yaml
action: shell
command: "curl -s https://{{ .outputs.create-project.domain }}/api/health | jq ."
outputs:
  - health_status: .status
```

### Template Variables

Placeholders in step fields are expanded using Go template syntax (`{{ .path }}`). Two namespaces are available:

- `.vars.<name>` - Variables from CLI flags or tree defaults
- `.outputs.<step>.<key>` - Outputs captured from previous steps
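
To make the lookup rules concrete, here is a rough sketch of how a placeholder could be resolved against a JSON context holding `vars` and `outputs`. It uses `jq` (already a dependency) and a hypothetical `expand_template` function; the real expansion code in the runner is not documented here and may work differently.

```bash
# Hypothetical helper, not part of tree-runner.sh.
# $1: template text, $2: JSON object with "vars" and "outputs" keys.
expand_template() {
  jq -rn --arg text "$1" --argjson ctx "$2" '
    $text
    | gsub("\\{\\{\\s*\\.(?<path>[A-Za-z0-9_.-]+)\\s*\\}\\}";
           # Split "outputs.create-project.project_id" into keys and look it up.
           .path as $p | $ctx | getpath($p | split(".")) | tostring)
  '
}

ctx='{"vars":{"project_name":"my-test"},"outputs":{"create-project":{"project_id":"my-test"}}}'
expand_template '/projects/{{ .outputs.create-project.project_id }}/components' "$ctx"
# Prints: /projects/my-test/components
```

In this sketch a path that does not exist in the context expands to the string `null` rather than raising an error.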

## Checkpoint Format

Checkpoints are stored in `cookbooks/.checkpoints/` (gitignored) as JSON:

```json
{
  "tree": "landing-page",
  "run_id": "landing-page-1706889600",
  "status": "partial",
  "vars": {
    "project_name": "test-landing"
  },
  "steps": {
    "create-project": {
      "status": "completed",
      "started_at": "2025-02-01T10:00:00Z",
      "completed_at": "2025-02-01T10:00:05Z",
      "output": {
        "project_id": "test-landing",
        "domain": "test-landing.threesix.ai"
      }
    },
    "wait-pipeline": {
      "status": "failed",
      "started_at": "2025-02-01T10:00:05Z",
      "completed_at": "2025-02-01T10:05:00Z",
      "error": "Pipeline #3 failed with status: failure"
    }
  },
  "last_completed_step": "create-project"
}
```
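
Since each step records its own `status`, a resume can work out what to skip by reading the checkpoint. The sketch below lists the steps already marked `completed`; the `completed_steps` helper is hypothetical and only illustrates the idea.

```bash
# Hypothetical helper, not part of tree-runner.sh.
# $1: path to a checkpoint JSON file. Prints one completed step name per line.
completed_steps() {
  jq -r '.steps
         | to_entries[]
         | select(.value.status == "completed")
         | .key' "$1"
}
```

Run against the checkpoint above, `completed_steps cookbooks/.checkpoints/landing-page.json` would print `create-project`, leaving `wait-pipeline` and any later steps to be executed.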

### Checkpoint Status Values

- `pending` - Tree started but no steps completed
- `partial` - Some steps completed, some pending/failed
- `completed` - All steps completed successfully
- `failed` - A step failed with `on_error: fail`

## Creating a New Tree

1. Create `cookbooks/trees/<name>.yaml`
2. Define steps with dependencies
3. Add teardown section
4. Test with `tree-runner.sh run <name> --project-name test-$(date +%s)`
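
The steps above can be bootstrapped with a small scaffold. This is a sketch under assumptions: `scaffold_tree` is a hypothetical helper, and the generated YAML reuses only the fields and endpoints from the `landing-page` example, so it needs editing before it is useful for a different workflow.

```bash
# Hypothetical helper, not part of the cookbook scripts.
# $1: tree name, $2: target directory (defaults to cookbooks/trees).
scaffold_tree() {
  local name="$1" dir="${2:-cookbooks/trees}"
  cat > "$dir/$name.yaml" <<EOF
name: $name
description: TODO describe this tree
version: 1

vars:
  project_name: ""  # Required, no default

steps:
  create-project:
    description: Create the project skeleton
    action: api
    method: POST
    endpoint: /project
    body:
      name: "{{ .vars.project_name }}"
    outputs:
      - project_id: .data.name

teardown:
  - description: Delete project
    action: api
    method: DELETE
    endpoint: "/project/{{ .outputs.create-project.project_id }}"
EOF
}

# Usage: scaffold_tree my-tree
```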

### Best Practices

- **Always include teardown** - Clean up resources even if the tree fails
- **Use descriptive step names** - They appear in status output
- **Set `on_error: continue` for non-critical steps** - For example, a pipeline failure shouldn't block site verification
- **Capture outputs** - Pass data between steps via outputs, not hardcoded values
- **Use vars for inputs** - Makes trees reusable with different parameters
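
To illustrate the difference between the two `on_error` values, here is a minimal sketch of how a runner might treat a failing step. The `run_step` function and its messages are hypothetical and do not describe the actual `tree-runner.sh` internals.

```bash
# Hypothetical helper, not part of tree-runner.sh.
# $1: step name, $2: on_error policy ("fail" or "continue"), rest: command to run.
run_step() {
  local name="$1" on_error="$2"
  shift 2
  if "$@"; then
    echo "completed: $name"
  elif [ "$on_error" = "continue" ]; then
    # Record the failure but let later steps run.
    echo "failed (continuing): $name"
  else
    # Default policy: stop the tree here.
    echo "failed (stopping): $name"
    return 1
  fi
}
```

With this shape, `run_step wait-pipeline continue false` reports the failure and returns success, while `run_step create-project fail false` returns non-zero so the caller can stop and run teardown.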

## Migrating from Script to Tree

Each helper call in an existing cookbook script maps to a tree action:

| Script Pattern | Tree Equivalent |
|----------------|-----------------|
| `api_call POST /project "$json"` | `action: api`, `method: POST` |
| `wait_for_pipeline "$project"` | `action: wait_pipeline` |
| `wait_for_site "$domain" 30 5 "$project"` | `action: wait_site` |
| `diagnose_pipeline_failure "$project"` | `action: diagnose`, `type: pipeline` |
| `curl ... \| jq ...` | `action: shell`, `command: "..."` |

## Troubleshooting

### Tree not found
```
Error: Tree 'foo' not found
Available trees: landing-page, composable-app, sdlc-flow
```
Check that `cookbooks/trees/foo.yaml` exists.

### yq not found
```
Error: yq is required but not installed
```
Install with `brew install yq`.

### Resume finds no checkpoint
```
No checkpoint found for tree 'landing-page'
```
Run `tree-runner.sh run landing-page ...` first.

### Step failed but outputs missing
```
Error: Output 'project_id' not found in step 'create-project'
```
The step may have failed silently. Check the checkpoint file:
```bash
jq '.steps["create-project"]' cookbooks/.checkpoints/landing-page.json
```

## Available Trees

### Basic Trees

| Tree | Description |
|------|-------------|
| `landing-page` | Single-page landing site built with Astro |
| `composable-app` | Multi-component monorepo with service + app |
| `sdlc-flow` | Feature lifecycle with SDLC orchestration |

### Slackpath Trees (Reference Architectures)

Four trees of increasing complexity that build toward a Slack-like platform:

| Tree | Description | Infrastructure |
|------|-------------|----------------|
| `slackpath-1-authenticated-service` | Identity layer: User auth, JWT, protected routes | CockroachDB |
| `slackpath-2-async-worker-pipeline` | Background jobs: Producer/consumer with Redis | Redis |
| `slackpath-3-realtime-chat` | WebSockets: Pub/sub broadcasting | Redis |
| `slackpath-4-microservice-constellation` | Service mesh: Auth + Chat + Worker coordination | CockroachDB + Redis |

**Running a slackpath:**
```bash
./cookbooks/scripts/tree-runner.sh run slackpath-1-authenticated-service \
  --project-name auth-test-$(date +%s)
```

These trees demonstrate:
- Infrastructure provisioning (`type: postgres`, `type: redis`)
- Automatic credential injection (`DATABASE_URL`, `REDIS_URL`)
- SDLC-driven implementation via `/implement-feature` prompts
- End-to-end verification scripts

## Files

```
cookbooks/
├── .checkpoints/           # Checkpoint storage (gitignored)
│   └── landing-page.json
├── scripts/
│   ├── lib/
│   │   ├── checkpoint.sh   # Checkpoint I/O
│   │   └── tree-parser.sh  # YAML parsing
│   └── tree-runner.sh      # Main executable
└── trees/
    ├── landing-page.yaml
    ├── composable-app.yaml
    ├── sdlc-flow.yaml
    ├── slackpath-1-authenticated-service.yaml
    ├── slackpath-2-async-worker-pipeline.yaml
    ├── slackpath-3-realtime-chat.yaml
    └── slackpath-4-microservice-constellation.yaml
```

## Related

- [E2E Testing Strategy](./e2e-testing-strategy.md) - When to run trees, philosophy, history tracking
- [Composable Monorepo Templates](./composable-monorepo.md) - Template structure tested by trees