stemedb/.claude/skills/orchard9-deploy/SKILL.md
jordan 1e5ba8b946
feat: wire auth bootstrap, cluster gateway, k8s deploy skill, and ops docs
- Wire auth bootstrap (root API key, startup guard, auth-first router) in main.rs
- Add cluster gateway handlers with proper error handling
- Update Dockerfile with optimized multi-stage build and .dockerignore
- Add orchard9-deploy skill for CI/CD pipeline (Gitea/Woodpecker/Kaniko/Zot)
- Add k8s deployment roadmap and provision-project-keys script
- Document production infrastructure in CLAUDE.md
- Update three-node-cluster reference architecture
- Trim hosted.rs doc comments to stay under 800-line limit
2026-03-07 00:56:31 -07:00

---
name: orchard9-deploy
description: Deploy services through the orchard9 CI/CD pipeline (Gitea + Woodpecker CI + Kaniko + Zot Registry + k3s). Handles pushing code, triggering builds, monitoring pipelines, and verifying deployments.
---
# Orchard9 Deploy
You are an orchard9 deployment operator who executes deployments through the on-prem CI/CD pipeline. You push code to Gitea, trigger and monitor Woodpecker CI builds, verify images land in the Zot registry, and confirm pods are running on the k3s cluster.
## Environment Variables
These env vars provide API access to the deployment infrastructure:
| Variable | Purpose |
|----------|---------|
| `THREE_SIX_GITEA` | Gitea admin API token for `git.threesix.ai` |
| `THREE_SIX_WOODPECKER` | Woodpecker CI API token for `ci.threesix.ai` |
| `THREESIX_CLOUDFLARE_API_TOKEN` | Cloudflare API token for `threesix.ai` DNS |
| `THREESIX_CLOUDFLARE_ZONE_ID` | Cloudflare zone ID for `threesix.ai` |
Verify they exist before any operation:
```bash
[[ -z "$THREE_SIX_GITEA" ]] && echo "MISSING: THREE_SIX_GITEA" && exit 1
[[ -z "$THREE_SIX_WOODPECKER" ]] && echo "MISSING: THREE_SIX_WOODPECKER" && exit 1
```
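The two checks above stop at the first missing variable. A minimal sketch of a reusable check that reports every missing variable in one pass; `require_env` is a hypothetical helper name, not part of the existing tooling:

```bash
# require_env NAME...: succeeds only if every named variable is set and non-empty.
# Prints one "MISSING: NAME" line per absent variable to stderr.
require_env() {
  missing=0
  for name in "$@"; do
    # Indirect lookup via eval so the helper also works in plain POSIX sh.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "MISSING: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Typical use before a deploy: `require_env THREE_SIX_GITEA THREE_SIX_WOODPECKER || exit 1`. Add the two Cloudflare variables only when the operation touches DNS.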
## Service Endpoints
| Service | Internal (cluster) | External |
|---------|--------------------|----------|
| Gitea | `gitea.threesix.svc.cluster.local:3000` | `https://git.threesix.ai` |
| Woodpecker | `woodpecker-server.threesix.svc.cluster.local:8000` | `https://ci.threesix.ai` |
| Zot Registry | `zot.threesix.svc.cluster.local:5000` | `https://registry.threesix.ai` |
| Traefik LB | — | `208.122.204.172` |
## Cluster Access
```bash
# ALWAYS set before ANY kubectl command
export KUBECONFIG=~/.kube/orchard9-k3sf.yaml
```
Nodes are amd64 (Rocky Linux). Local Mac is arm64. NEVER build Docker images locally.
## Principles
### 1. Push, Don't Build
Deployments happen by pushing code to Gitea. Kaniko builds natively on the cluster's amd64 nodes. A local build on an arm64 Mac either produces an arm64 image that cannot run on the cluster, or cross-builds for amd64 under QEMU emulation, which is far slower than a native build.
### 2. API-First Operations
Use Gitea and Woodpecker REST APIs for all operations. The env var tokens provide full access. Do not ask the user to open web UIs.
### 3. Verify Every Step
After each pipeline stage, verify the output before proceeding. Check Woodpecker build status, check Zot for the image, check k8s for the running pod.
### 4. Commit SHA Tags
Tag images with 8-char commit SHA (`${CI_COMMIT_SHA:0:8}`) plus `latest`. Never rely on `latest` alone for production deployments.
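To reproduce the same tag outside the pipeline (for example when checking the registry or rolling back), the 8-character prefix can be derived in the shell. A small sketch; `short_sha` and `image_ref` are hypothetical helper names, and the SHA in the usage note is a made-up value:

```bash
REGISTRY="registry.threesix.ai"

# short_sha FULL_SHA: prints the first 8 characters, matching ${CI_COMMIT_SHA:0:8}.
short_sha() {
  printf '%.8s\n' "$1"
}

# image_ref PROJECT FULL_SHA: prints the image reference the pipeline should have pushed.
image_ref() {
  printf '%s/%s:%s\n' "$REGISTRY" "$1" "$(short_sha "$2")"
}
```

For example, `image_ref myservice "$(git rev-parse HEAD)"` prints the reference to expect in Zot for the current commit.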
### 5. Namespace Discipline
Each service has its own namespace. Set `KUBECONFIG` before every kubectl call. Never assume the default context is correct.
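One way to make the kubeconfig rule hard to forget is a thin wrapper that pins it on every call. This is a sketch only; the function name `k` is an assumption, and nothing in the existing setup defines it:

```bash
# k ARGS...: runs kubectl against the orchard9 cluster, regardless of the
# current KUBECONFIG or default context.
k() {
  kubectl --kubeconfig "$HOME/.kube/orchard9-k3sf.yaml" "$@"
}
```

With the wrapper loaded, `k get pods -n <NAMESPACE>` always targets the orchard9 cluster. The namespace still has to be passed explicitly on every call.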
## Protocol: Deploy a Service
### Phase 1: Pre-Flight
1. Verify env vars exist
2. Verify kubeconfig works:
```bash
kubectl --kubeconfig ~/.kube/orchard9-k3sf.yaml get nodes
```
3. Check Gitea is reachable:
```bash
curl -sf -H "Authorization: token ${THREE_SIX_GITEA}" \
"https://git.threesix.ai/api/v1/user" | jq '.login'
```
4. Check Woodpecker is reachable:
```bash
curl -sf -H "Authorization: Bearer ${THREE_SIX_WOODPECKER}" \
"https://ci.threesix.ai/api/user" | jq '.login'
```
### Phase 2: Gitea Repository Setup
**Create repo (if new):**
```bash
curl -X POST "https://git.threesix.ai/api/v1/user/repos" \
-H "Authorization: token ${THREE_SIX_GITEA}" \
-H "Content-Type: application/json" \
-d '{"name":"<REPO>","private":false,"auto_init":false}'
```
**List existing repos:**
```bash
curl -sf -H "Authorization: token ${THREE_SIX_GITEA}" \
"https://git.threesix.ai/api/v1/user/repos?limit=50" | jq '.[].full_name'
```
**Add or update git remote:**
```bash
# Check if gitea remote exists
if git remote get-url gitea >/dev/null 2>&1; then
  git remote set-url gitea "https://jordan:${THREE_SIX_GITEA}@git.threesix.ai/jordan/<REPO>.git"
else
  git remote add gitea "https://jordan:${THREE_SIX_GITEA}@git.threesix.ai/jordan/<REPO>.git"
fi
```
**Push code to Gitea:**
```bash
git push gitea main
```
### Phase 3: Woodpecker CI Activation
**List repos Woodpecker knows about:**
```bash
curl -sf -H "Authorization: Bearer ${THREE_SIX_WOODPECKER}" \
"https://ci.threesix.ai/api/repos?all=true" | jq '.[].full_name'
```
**Activate repo in Woodpecker (creates webhook on Gitea):**
```bash
# First, find the Gitea repo ID
FORGE_ID=$(curl -sf -H "Authorization: token ${THREE_SIX_GITEA}" \
  "https://git.threesix.ai/api/v1/repos/jordan/<REPO>" | jq -r '.id')
# Woodpecker reads forge_remote_id from the query string, not from a JSON body
curl -X POST "https://ci.threesix.ai/api/repos?forge_remote_id=${FORGE_ID}" \
  -H "Authorization: Bearer ${THREE_SIX_WOODPECKER}"
```
**Trigger a build manually via API:**
```bash
curl -X POST "https://ci.threesix.ai/api/repos/jordan/<REPO>/pipelines" \
-H "Authorization: Bearer ${THREE_SIX_WOODPECKER}" \
-H "Content-Type: application/json" \
-d '{"branch":"main"}'
```
### Phase 4: Monitor Build
**List recent pipelines:**
```bash
curl -sf -H "Authorization: Bearer ${THREE_SIX_WOODPECKER}" \
"https://ci.threesix.ai/api/repos/jordan/<REPO>/pipelines?page=1&per_page=5" | \
jq '.[] | {number, status, event, branch, created_at}'
```
**Get pipeline status:**
```bash
curl -sf -H "Authorization: Bearer ${THREE_SIX_WOODPECKER}" \
"https://ci.threesix.ai/api/repos/jordan/<REPO>/pipelines/<NUMBER>" | \
jq '{number, status, started_at, finished_at, workflows: [.workflows[]? | {name, state, children: [.children[]? | {name, state}]}]}'
```
**Get step logs:**
```bash
curl -sf -H "Authorization: Bearer ${THREE_SIX_WOODPECKER}" \
"https://ci.threesix.ai/api/repos/jordan/<REPO>/logs/<PIPELINE>/<STEP>" | \
jq -r '.[].data'
```
**Poll until complete (use sparingly):**
```bash
while true; do
  STATUS=$(curl -sf -H "Authorization: Bearer ${THREE_SIX_WOODPECKER}" \
    "https://ci.threesix.ai/api/repos/jordan/<REPO>/pipelines/<NUMBER>" | jq -r '.status')
  echo "Pipeline status: $STATUS"
  # Stop on any final state, including killed/declined, so the loop cannot spin forever
  case "$STATUS" in
    success|failure|error|killed|declined) break ;;
  esac
  sleep 30
done
```
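The loop above has no upper bound, so a failing API call or an unexpected status would keep it running. A sketch of a bounded variant, assuming the final states of interest are `success`, `failure`, `error`, `killed`, and `declined` (check this list against the Woodpecker version in use); `is_terminal_status` and `poll_until_done` are hypothetical helper names:

```bash
# is_terminal_status STATUS: succeeds when the pipeline will not change state again.
is_terminal_status() {
  case "$1" in
    success|failure|error|killed|declined) return 0 ;;
    *) return 1 ;;
  esac
}

# poll_until_done MAX_ATTEMPTS INTERVAL_SECONDS CMD [ARGS...]
# Runs CMD (which must print the current status) until the status is final or
# the attempt budget is spent. Prints the last status seen; returns 1 on timeout.
poll_until_done() {
  max_attempts=$1
  interval=$2
  shift 2
  attempt=0
  status=""
  while [ "$attempt" -lt "$max_attempts" ]; do
    if [ "$attempt" -gt 0 ]; then
      sleep "$interval"
    fi
    # A failed fetch counts as "not final yet" rather than aborting the poll.
    status=$("$@") || status=""
    if is_terminal_status "$status"; then
      printf '%s\n' "$status"
      return 0
    fi
    attempt=$((attempt + 1))
  done
  printf '%s\n' "${status:-unknown}"
  return 1
}
```

Wrap the `curl ... | jq -r '.status'` call from the loop above in a function and pass it as `CMD`, for example `poll_until_done 40 30 fetch_status` for a 20-minute budget. Keeping the fetch separate from the loop also makes the polling logic easy to exercise without touching the API.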
### Phase 5: Verify Image in Registry
```bash
# List repos in Zot
curl -sf "https://registry.threesix.ai/v2/_catalog" | jq '.repositories'
# List tags for an image
curl -sf "https://registry.threesix.ai/v2/<REPO>/tags/list" | jq '.tags'
```
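Reading the tag list by eye is error-prone when several SHAs share a prefix. A minimal sketch of an exact-match check that turns the verification into an exit code; `tag_present` is a hypothetical helper name:

```bash
# tag_present TAG: reads one tag per line on stdin and succeeds only if a
# line equals TAG exactly (fixed string, whole line, no output).
tag_present() {
  grep -Fxq -- "$1"
}
```

Feed it the tag list from the registry: `curl -sf "https://registry.threesix.ai/v2/<REPO>/tags/list" | jq -r '.tags[]' | tag_present "<SHA8>"`. A non-zero exit means the image for that commit is not in Zot yet.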
### Phase 6: Verify Deployment
```bash
export KUBECONFIG=~/.kube/orchard9-k3sf.yaml
# Check pod status
kubectl get pods -n <NAMESPACE> -l app=<APP>
# Check deployment rollout
kubectl rollout status deployment/<APP> -n <NAMESPACE> --timeout=120s
# Check logs
kubectl logs -n <NAMESPACE> -l app=<APP> --tail=50
# Describe pod (for scheduling/pull errors)
kubectl describe pod -n <NAMESPACE> -l app=<APP>
```
### Phase 7: Verify External Access (if ingress exists)
```bash
# Health check
curl -sf "https://<APP>.threesix.ai/health" || curl -sf "https://<APP>.threesix.ai/v1/health"
# Check TLS cert
echo | openssl s_client -connect <APP>.threesix.ai:443 -servername <APP>.threesix.ai 2>/dev/null | \
openssl x509 -noout -dates -subject
```
## .woodpecker.yml Templates
### Rust Project (cargo-chef multi-stage)
```yaml
when:
  branch: main
  event: push
steps:
  build:
    image: woodpeckerci/plugin-kaniko
    settings:
      registry: registry.threesix.ai
      repo: registry.threesix.ai/<PROJECT>
      tags:
        - latest
        - ${CI_COMMIT_SHA:0:8}
      context: .
      dockerfile: Dockerfile
      cache: true
      cache_repo: registry.threesix.ai/<PROJECT>/cache
      skip_tls_verify: true
      build_args:
        - CARGO_FEATURES=<optional-features>
  deploy:
    image: bitnami/kubectl:latest
    commands:
      - kubectl set image deployment/<APP> <CONTAINER>=registry.threesix.ai/<PROJECT>:${CI_COMMIT_SHA:0:8} -n <NAMESPACE>
      - kubectl rollout status deployment/<APP> -n <NAMESPACE> --timeout=300s
    depends_on: [build]
```
### Go Project
```yaml
when:
  branch: main
  event: push
steps:
  test:
    image: golang:1.25-alpine
    commands:
      - go test ./...
  build:
    image: woodpeckerci/plugin-kaniko
    settings:
      registry: registry.threesix.ai
      repo: registry.threesix.ai/<PROJECT>
      tags:
        - latest
        - ${CI_COMMIT_SHA:0:8}
      context: .
      dockerfile: Dockerfile
      cache: true
      skip_tls_verify: true
    depends_on: [test]
  deploy:
    image: bitnami/kubectl:latest
    commands:
      - kubectl set image deployment/<APP> <CONTAINER>=registry.threesix.ai/<PROJECT>:${CI_COMMIT_SHA:0:8} -n <NAMESPACE>
      - kubectl rollout status deployment/<APP> -n <NAMESPACE> --timeout=120s
    depends_on: [build]
```
## DNS Management
**Create A record:**
```bash
curl -X POST "https://api.cloudflare.com/client/v4/zones/${THREESIX_CLOUDFLARE_ZONE_ID}/dns_records" \
-H "Authorization: Bearer ${THREESIX_CLOUDFLARE_API_TOKEN}" \
-H "Content-Type: application/json" \
-d '{"type":"A","name":"<SUBDOMAIN>","content":"208.122.204.172","ttl":1,"proxied":false}'
```
**List records:**
```bash
curl -sf "https://api.cloudflare.com/client/v4/zones/${THREESIX_CLOUDFLARE_ZONE_ID}/dns_records" \
-H "Authorization: Bearer ${THREESIX_CLOUDFLARE_API_TOKEN}" | \
jq '.result[] | {name, type, content}'
```
**Update existing record:**
```bash
# Get record ID first
RECORD_ID=$(curl -sf "https://api.cloudflare.com/client/v4/zones/${THREESIX_CLOUDFLARE_ZONE_ID}/dns_records?name=<SUBDOMAIN>.threesix.ai" \
-H "Authorization: Bearer ${THREESIX_CLOUDFLARE_API_TOKEN}" | jq -r '.result[0].id')
curl -X PATCH "https://api.cloudflare.com/client/v4/zones/${THREESIX_CLOUDFLARE_ZONE_ID}/dns_records/${RECORD_ID}" \
-H "Authorization: Bearer ${THREESIX_CLOUDFLARE_API_TOKEN}" \
-H "Content-Type: application/json" \
-d '{"content":"208.122.204.172"}'
```
## Step Back: Before Deploying
Before executing a deployment, challenge:
### 1. Is the Code Ready?
> "Has this been tested locally? Does `cargo check` / `go build` pass?"
- Pushing broken code wastes CI time (Rust builds take 10-15 min on Kaniko)
- Run local checks first, push only compilable code
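The local check can be made a gate in front of the push. A sketch that picks the check from the project files; `run_local_check` is a hypothetical helper, and it assumes a Rust project is identified by `Cargo.toml` and a Go project by `go.mod`:

```bash
# run_local_check [DIR]: runs the compile check that matches the project in DIR
# (default: current directory). Fails if the check fails or no project is found.
run_local_check() {
  dir=${1:-.}
  if [ -f "$dir/Cargo.toml" ]; then
    (cd "$dir" && cargo check)
  elif [ -f "$dir/go.mod" ]; then
    (cd "$dir" && go build ./...)
  else
    echo "no Cargo.toml or go.mod in $dir" >&2
    return 1
  fi
}
```

Chained as `run_local_check && git push gitea main`, the push only happens when the code compiles locally.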
### 2. Is This the Right Target?
> "Am I deploying to the right namespace, with the right image name?"
- Verify the k8s manifest matches the Woodpecker pipeline output
- Check the image reference in the Deployment matches what Kaniko pushes
### 3. Is the Dockerfile Correct?
> "Does the Dockerfile produce a working amd64 binary?"
- Multi-stage builds must produce either a statically linked binary or one whose shared-library dependencies are all present in the runtime stage
- Runtime stage must have required system libs (ca-certificates, libssl, etc.)
- Rust: use `rust:bookworm` build stage + `debian:bookworm-slim` runtime (not alpine — glibc deps)
### 4. Will the Deploy Step Have Access?
> "Does the Woodpecker agent have RBAC to deploy to the target namespace?"
- Default RBAC only covers `threesix` namespace
- Other namespaces need explicit RoleBinding for the `woodpecker-agent` ServiceAccount
**After step back:** Proceed with deployment if code compiles, targets are correct, and RBAC is in place.
## Do
1. Set `KUBECONFIG=~/.kube/orchard9-k3sf.yaml` before every kubectl operation
2. Use the Gitea API token from `THREE_SIX_GITEA` env var directly
3. Use the Woodpecker API token from `THREE_SIX_WOODPECKER` env var directly
4. Verify each phase completes before proceeding to the next
5. Use `skip_tls_verify: true` for Kaniko pushing to the internal Zot registry
6. Tag images with commit SHA + latest
7. Use `git remote add gitea` (not origin) to avoid overwriting GitHub remotes
8. Run `cargo check` or `go build` locally before pushing to CI
## Do Not
1. Build Docker images locally — QEMU arm64-to-amd64 emulation takes hours
2. Use `gcloud` commands — this is k3s on-prem, not GKE
3. Assume kubectl context is correct — always set KUBECONFIG explicitly
4. Push to GitHub expecting CI to trigger — Woodpecker only watches Gitea
5. Hardcode tokens in commands — always reference env vars
6. Skip the registry verification step — silent image push failures are common
7. Use alpine base images for Rust binaries — glibc linking issues
## Decision Points
**Pipeline stuck in "pending"?**
Stop. Check: Are Woodpecker agents running?
```bash
kubectl --kubeconfig ~/.kube/orchard9-k3sf.yaml get pods -n threesix -l app=woodpecker-agent
```
**Image not appearing in Zot after successful build?**
Stop. Check: Did Kaniko push to the right registry path?
```bash
curl -sf "https://registry.threesix.ai/v2/_catalog" | jq '.repositories'
```
**Pod in ImagePullBackOff?**
Stop. Check:
- Is the image reference correct? (`registry.threesix.ai/<path>:<tag>`)
- Can the node reach the registry? (internal DNS: `zot.threesix.svc.cluster.local:5000`)
- Is the image the right architecture? (`docker manifest inspect` or check Kaniko build logs)
**Deploy step fails with "forbidden" or "unauthorized"?**
Stop. Check: Woodpecker agent ServiceAccount needs RBAC in the target namespace.
```bash
kubectl --kubeconfig ~/.kube/orchard9-k3sf.yaml get rolebinding -n <NAMESPACE> | grep woodpecker
```
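Grepping RoleBindings shows that a binding exists, not that it grants the right verbs. A sketch that asks the API server directly with `kubectl auth can-i`, assuming the agent runs as the `woodpecker-agent` ServiceAccount in the `threesix` namespace and that the operator's kubeconfig is allowed to impersonate it; `sa_subject` and `can_deploy` are hypothetical helper names:

```bash
# sa_subject NAMESPACE NAME: prints the RBAC user name of a ServiceAccount.
sa_subject() {
  printf 'system:serviceaccount:%s:%s\n' "$1" "$2"
}

# can_deploy NAMESPACE: asks whether the agent may patch Deployments in
# NAMESPACE, which is the permission `kubectl set image` relies on.
can_deploy() {
  kubectl --kubeconfig "$HOME/.kube/orchard9-k3sf.yaml" auth can-i patch deployments \
    -n "$1" --as "$(sa_subject threesix woodpecker-agent)"
}
```

`can_deploy <NAMESPACE>` prints `yes` or `no`. A `no` means the target namespace still needs a RoleBinding for the agent.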
## Constraints
- NEVER build Docker images locally for k3s deployment
- NEVER use `gcloud` — this is on-prem k3s, not GKE
- NEVER run `kubectl` without `--kubeconfig ~/.kube/orchard9-k3sf.yaml` or `KUBECONFIG` set
- NEVER push credentials to git — use env vars for all tokens
- ALWAYS verify the image exists in Zot before expecting a pod to start
- ALWAYS use `registry.threesix.ai` (external) in Woodpecker pipelines; in k8s manifests, use either `zot.threesix.svc.cluster.local:5000` or `registry.threesix.ai`
## Recovery
### Rebuild Without Code Change
```bash
curl -X POST "https://ci.threesix.ai/api/repos/jordan/<REPO>/pipelines" \
-H "Authorization: Bearer ${THREE_SIX_WOODPECKER}" \
-H "Content-Type: application/json" \
-d '{"branch":"main"}'
```
### Force Pod Restart
```bash
kubectl --kubeconfig ~/.kube/orchard9-k3sf.yaml rollout restart deployment/<APP> -n <NAMESPACE>
```
### Rollback to Previous Image
```bash
# List available tags
curl -sf "https://registry.threesix.ai/v2/<REPO>/tags/list" | jq '.tags'
# Set specific tag
kubectl --kubeconfig ~/.kube/orchard9-k3sf.yaml set image deployment/<APP> \
<CONTAINER>=registry.threesix.ai/<REPO>:<PREVIOUS_SHA> -n <NAMESPACE>
```
### Delete and Reapply (nuclear option — confirm with user first)
```bash
kubectl --kubeconfig ~/.kube/orchard9-k3sf.yaml delete deployment/<APP> -n <NAMESPACE>
kubectl --kubeconfig ~/.kube/orchard9-k3sf.yaml apply -f <MANIFEST>
```