stemedb/docs/operations/deployment/k8s-deploy-roadmap.md

# k3s Deploy Roadmap: StemeDB + Aphoria → 100 Projects

**Target:** Production deployment on k3s-fleet with Longhorn, cert-manager, External Secrets, Prometheus/Grafana, Traefik.
**Timeline:** 3 weeks to ship-ready for 100 projects.

---

## Ship Blockers (P0) — Must Fix Before Any Project Onboards

### ~~1. Auth router not wired in production~~ ✅ RESOLVED (2026-03-02)

`create_router_full_protection_full_config` is now called when `STEMEDB_AUTH_ENABLED=true`.
Router dispatch checks `bootstrap::is_auth_enabled()` first — full protection stack activates
in production. Metering-only path still available when auth is disabled (local dev).

**Resolution:** `crates/stemedb-api/src/main.rs` updated.

---

### ~~2. `STEMEDB_UNSAFE_SKIP_SIGNATURES` startup guard missing~~ ✅ RESOLVED (2026-03-02)

Startup guard added: if `STEMEDB_UNSAFE_SKIP_SIGNATURES=true` and `STEMEDB_AUTH_ENABLED=true`,
server logs a fatal error and exits with code 1. Misconfiguration is caught at boot, not silently.

**Resolution:** `crates/stemedb-api/src/main.rs` updated.
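
The guard itself lives in `main.rs`. The same invariant can also be checked before a rollout, for example in CI. The following is a minimal shell sketch of that idea, assuming both variables are compared against the literal string `true`; the function name `check_signature_guard` is hypothetical and not part of the StemeDB codebase.

```bash
#!/usr/bin/env bash
# Hypothetical pre-deploy check mirroring the startup guard described above.
# Returns 1 when signature skipping and auth are both enabled, 0 otherwise.
check_signature_guard() {
  local skip="${STEMEDB_UNSAFE_SKIP_SIGNATURES:-false}"
  local auth="${STEMEDB_AUTH_ENABLED:-false}"
  if [ "$skip" = "true" ] && [ "$auth" = "true" ]; then
    echo "FATAL: STEMEDB_UNSAFE_SKIP_SIGNATURES=true is not allowed with STEMEDB_AUTH_ENABLED=true" >&2
    return 1
  fi
  return 0
}
```

Calling `check_signature_guard || exit 1` at the top of a deploy script would stop a misconfigured rollout before the pod ever starts.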

---

### ~~3. Bootstrap key not seeded from env on fresh PVC~~ ✅ RESOLVED (2026-03-02)

`bootstrap::bootstrap_root_api_key()` is now called at startup (after IngestWorker spawn).
Reads `STEMEDB_ROOT_API_KEY`, idempotent — no-op if key already exists in the store. Fatal
error on failure.

**Resolution:** `crates/stemedb-api/src/main.rs` updated.

---

### ~~4. No k8s manifests — StemeDB cannot be deployed to k3s~~ ✅ RESOLVED (2026-03-02)

Manifests deployed to `k3s-fleet/deployments/k8s/base/stemedb/` (single `stemedb.yaml` following
`tidaldb/` pattern). Includes ExternalSecret, PVC (50Gi Longhorn), Deployment (Recreate, non-root,
all probes), ClusterIP Service, Traefik Ingress at `stemedb.threesix.ai`.

**Remaining manual step:** Build + push image, create GCP secret, add DNS record (see Pre-Deploy section below).

---

### ~~5. Image registry — k3s cannot pull without a registry~~ ✅ RESOLVED (2026-03-07)

Registry: `registry.threesix.ai` (Zot OCI registry on k3s). Woodpecker CI pipeline (`.woodpecker.yml`)
builds via Kaniko and pushes automatically on every merge to main. No manual docker build needed.

**Image:** `registry.threesix.ai/stemedb-api:latest` (also tagged with short commit SHA)

---

## Pre-Deploy Checklist (Manual Steps Before `kubectl apply`)

> **Note:** Image builds are now automated via Woodpecker CI. Push to `main` → Kaniko builds →
> pushes to `registry.threesix.ai` → `kubectl set image` on StatefulSet. Manual steps below are
> only needed for first-time setup.

```bash
# 1. Image builds are automatic (Woodpecker CI). For manual builds:
docker build --platform linux/amd64 -t registry.threesix.ai/stemedb-api:latest .
docker push registry.threesix.ai/stemedb-api:latest

# 2. Create root API key in GCP Secret Manager (first deploy only)
ROOT_KEY="steme_live_$(openssl rand -hex 24)"
echo "Root key: $ROOT_KEY"   # Save this — needed for provision-project-keys.sh
echo -n "$ROOT_KEY" | gcloud secrets create stemedb-root-api-key \
  --project=orchard9 --replication-policy=automatic --data-file=-

# 3. Add DNS: stemedb.threesix.ai → Traefik LB IP (Cloudflare) — already done
```
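
Before storing the key, it can be worth confirming that it has the shape step 2 produces. This is a small sketch under that assumption only: `openssl rand -hex 24` yields 48 lowercase hex characters, so the check looks for `steme_live_` followed by exactly 48 of them. The function name `is_valid_root_key` is hypothetical, and the server may accept other formats.

```bash
# Hypothetical sanity check for a freshly generated root key.
# Assumes the format from step 2: "steme_live_" + 48 lowercase hex characters.
is_valid_root_key() {
  printf '%s' "$1" | grep -Eq '^steme_live_[0-9a-f]{48}$'
}
```

Usage: `is_valid_root_key "$ROOT_KEY" || echo "unexpected key format" >&2`.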

---

## Original Manifest Spec (archived for reference)

The following was the original spec. Actual implementation is in `k3s-fleet/deployments/k8s/base/stemedb/stemedb.yaml`.

Create `deployments/k8s/base/stemedb/` with the following files:

**`namespace.yaml`**
```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: stemedb
```

**`pvc.yaml`** — Two PVCs to isolate WAL fsync from LSM compaction I/O
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: stemedb-wal
  namespace: stemedb
  annotations:
    volumeType: longhorn
spec:
  accessModes: [ReadWriteOnce]
  storageClassName: longhorn
  resources:
    requests:
      storage: 20Gi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: stemedb-db
  namespace: stemedb
  annotations:
    volumeType: longhorn
spec:
  accessModes: [ReadWriteOnce]
  storageClassName: longhorn
  resources:
    requests:
      storage: 50Gi
```

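> Set `numberOfReplicas: "2"` under `parameters` in the Longhorn StorageClass (the default is 3). Longhorn replicates every write synchronously, so if one replica sits on the pod's node this halves the remote replicas each fsync must reach (from two to one).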
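
Because StorageClass `parameters` cannot be changed after creation, the replica count is usually set on a new class rather than by editing the existing `longhorn` one. The sketch below prints such a manifest. The class name `longhorn-2r` is a hypothetical choice, and the PVCs above would need their `storageClassName` pointed at it.

```bash
# Prints a StorageClass manifest that asks Longhorn for two replicas per volume.
# "longhorn-2r" is a hypothetical name chosen for this example.
render_longhorn_storageclass() {
  cat <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-2r
provisioner: driver.longhorn.io
allowVolumeExpansion: true
parameters:
  numberOfReplicas: "2"
EOF
}
```

Usage: `render_longhorn_storageclass | kubectl apply -f -`.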

**`deployment.yaml`** — Critical spec decisions annotated
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: stemedb-api
  namespace: stemedb
spec:
  replicas: 1           # Non-negotiable. Embedded KV requires exclusive volume access.
  strategy:
    type: Recreate      # NOT RollingUpdate. RWO PVC + 2 pods = deadlock.
  selector:
    matchLabels:
      app: stemedb-api
  template:
    metadata:
      labels:
        app: stemedb-api
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "18180"
        prometheus.io/path: "/metrics"
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        fsGroup: 1000
      terminationGracePeriodSeconds: 30  # Let in-flight WAL writes complete.
      containers:
        - name: stemedb-api
          image: <REGISTRY>/stemedb-api:latest
          securityContext:
            readOnlyRootFilesystem: false  # Container-level field (not valid on the pod). WAL writes go to /data.
          ports:
            - containerPort: 18180
          env:
            - name: STEMEDB_BIND_ADDR
              value: "0.0.0.0:18180"
            - name: STEMEDB_WAL_DIR
              value: /data/wal
            - name: STEMEDB_DB_DIR
              value: /data/db
            - name: STEMEDB_METER_ENABLED
              value: "true"
            - name: STEMEDB_ROOT_API_KEY
              valueFrom:
                secretKeyRef:
                  name: stemedb-secrets
                  key: root-api-key
          resources:
            requests:
              cpu: "500m"
              memory: "1Gi"
            limits:
              cpu: "2000m"
              memory: "4Gi"
          startupProbe:       # WAL replay can take 60s after crash — do not skip this.
            httpGet:
              path: /v1/health
              port: 18180
            periodSeconds: 5
            failureThreshold: 12   # 60s total window before k8s kills pod
          livenessProbe:
            httpGet:
              path: /v1/health
              port: 18180
            periodSeconds: 15
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /v1/health
              port: 18180
            periodSeconds: 5
            failureThreshold: 3
          volumeMounts:
            - name: wal
              mountPath: /data/wal
            - name: db
              mountPath: /data/db
      volumes:
        - name: wal
          persistentVolumeClaim:
            claimName: stemedb-wal
        - name: db
          persistentVolumeClaim:
            claimName: stemedb-db
```
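
The startup probe's time budget is `periodSeconds × failureThreshold` (plus `initialDelaySeconds`, which defaults to 0). With the values above that is exactly 60s, which leaves no headroom if WAL replay really takes 60s. The helpers below are an illustrative sketch for sizing the threshold; the function names and the 30s headroom in the usage example are assumptions, not measured values.

```bash
# Total seconds the startupProbe allows before the kubelet restarts the container.
startup_budget_seconds() {
  local period="$1" threshold="$2"
  echo $(( period * threshold ))
}

# Smallest failureThreshold that covers the replay time plus headroom (rounds up).
min_failure_threshold() {
  local replay="$1" headroom="$2" period="$3"
  echo $(( (replay + headroom + period - 1) / period ))
}
```

For example, `min_failure_threshold 60 30 5` suggests a threshold of 18, which `startup_budget_seconds 5 18` confirms is a 90s window.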

**`service.yaml`**
```yaml
apiVersion: v1
kind: Service
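metadata:
  name: stemedb-api
  namespace: stemedb
  labels:
    app: stemedb-api       # lets a ServiceMonitor select this Service by label
spec:
  selector:
    app: stemedb-api
  ports:
    - name: http           # named so a ServiceMonitor can reference the port
      port: 18180
      targetPort: 18180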
  type: ClusterIP
```

**`ingress.yaml`** — Traefik terminates TLS; do NOT set `STEMEDB_TLS_CERT_PATH`
```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: stemedb-api
  namespace: stemedb
  annotations:
    traefik.ingress.kubernetes.io/router.entrypoints: websecure
    traefik.ingress.kubernetes.io/router.middlewares: stemedb-ratelimit@kubernetescrd
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  ingressClassName: traefik
  rules:
    - host: stemedb.yourdomain.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: stemedb-api
                port:
                  number: 18180
  tls:
    - hosts:
        - stemedb.yourdomain.com
      secretName: stemedb-tls
```

**`middleware.yaml`** — Traefik rate limit (global, before app-level limits)
```yaml
apiVersion: traefik.containo.us/v1alpha1   # Traefik v2.10+ also serves traefik.io/v1alpha1; Traefik v3 requires it
kind: Middleware
metadata:
  name: ratelimit
  namespace: stemedb
spec:
  rateLimit:
    average: 500
    burst: 1000
    period: 1s
```

**`external-secret.yaml`** — Pull from GCP Secret Manager via External Secrets Operator
```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: stemedb-secrets
  namespace: stemedb
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: gcp-secret-manager    # adjust to your cluster's SecretStore name
    kind: ClusterSecretStore
  target:
    name: stemedb-secrets
  data:
    - secretKey: root-api-key
      remoteRef:
        key: stemedb-root-api-key
```

**`kustomization.yaml`**
```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - namespace.yaml
  - pvc.yaml
  - deployment.yaml
  - service.yaml
  - ingress.yaml
  - middleware.yaml
  - external-secret.yaml
```

**Deploy:**
```bash
kubectl apply -k deployments/k8s/base/stemedb/
kubectl rollout status deployment/stemedb-api -n stemedb
curl https://stemedb.yourdomain.com/v1/health
```

---

## Phase 1 Checklist (Week 1 — Gate: First Project Can Connect) ✅ COMPLETE

| # | Task | File(s) | Status |
|---|------|---------|--------|
| 1 | Wire auth router in `main.rs` | `crates/stemedb-api/src/main.rs` | ✅ Done |
| 2 | Add `STEMEDB_UNSAFE_SKIP_SIGNATURES` startup guard | `crates/stemedb-api/src/main.rs` | ✅ Done |
| 3 | Add bootstrap key seed from `STEMEDB_ROOT_API_KEY` | `crates/stemedb-api/src/main.rs` | ✅ Done |
| 4 | Add `--features aphoria` to Dockerfile | `Dockerfile` | ✅ Done |
| 5 | Create k8s manifests | `k3s-fleet/.../stemedb/` | ✅ Done |
| 6 | Write `scripts/provision-project-keys.sh` | `scripts/` | ✅ Done |
| 7 | Build + push Docker image | Woodpecker CI → Zot registry | ✅ Done (automated) |
| 8 | Store root API key in GCP Secret Manager | GCP Console | ✅ Done |
| 9 | Add DNS record: `stemedb.threesix.ai` | Cloudflare | ✅ Done |
| 10 | Deploy to k3s + smoke test | k3s-fleet | ✅ Done |
| 11 | Upgrade to 3-node StatefulSet | `stemedb.yaml` | ✅ Done (2026-03-07) |
| 12 | Woodpecker CI/CD pipeline | `.woodpecker.yml` | ✅ Done |

**Gate test (run after deploy):**
```bash
# Health check (routes through Gateway on :18181)
curl https://stemedb.threesix.ai/v1/health

# Direct API health on each pod (port-forward to :18180)
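kubectl port-forward pod/stemedb-0 18180:18180 -n stemedb &
PF_PID=$!
sleep 2                      # give the port-forward time to bind
curl http://127.0.0.1:18180/v1/health
kill "$PF_PID"               # stop the background port-forward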

# Unauthenticated write → 401
curl -s -o /dev/null -w "%{http_code}" -X POST \
  https://stemedb.threesix.ai/v1/assert -H "Content-Type: application/json" -d '{}'

# Cluster status
curl https://stemedb.threesix.ai/v1/cluster/status

# Confirm pods survive rolling restart
kubectl rollout restart statefulset/stemedb -n stemedb
kubectl rollout status statefulset/stemedb -n stemedb --timeout=300s
curl https://stemedb.threesix.ai/v1/health
```

---

## Phase 2: Production Hardening (Week 2 — Gate: 10 Projects)

### Backup CronJob

Create `deployments/k8s/base/stemedb/backup-cronjob.yaml`:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: stemedb-backup
  namespace: stemedb
spec:
  schedule: "0 */6 * * *"   # Every 6 hours
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: backup
              image: rclone/rclone:latest
              command:
                - /bin/sh
                - -c
                - |
                  set -eu   # fail the Job if any step fails, so restartPolicy can retry it
                  # WAL: copy all completed segments (every segment except the newest, which is locked)
                  COUNT=$(ls /data/wal/*.wal 2>/dev/null | wc -l)
                  if [ "$COUNT" -gt 1 ]; then
                    ACTIVE=$(ls /data/wal/*.wal | sort | tail -n 1 | xargs basename)
                    # rclone filter rules are first-match-wins: exclude the active segment before including *.wal
                    rclone copy /data/wal/ "gcs:$BACKUP_BUCKET/wal/" \
                      --filter "- $ACTIVE" --filter "+ *.wal" --filter "- *"
                  fi
                  # DB snapshot
                  rclone copy /data/db/ "gcs:$BACKUP_BUCKET/db/$(date -u +%Y%m%dT%H%M%SZ)/"
                  echo "Backup complete"
              env:
                - name: BACKUP_BUCKET
                  value: stemedb-backups    # your GCS bucket name
              volumeMounts:
                - name: wal
                  mountPath: /data/wal
                  readOnly: true
                - name: db
                  mountPath: /data/db
                  readOnly: true
                - name: rclone-config
                  mountPath: /config/rclone
          volumes:
            - name: wal
              persistentVolumeClaim:
                claimName: stemedb-wal
            - name: db
              persistentVolumeClaim:
                claimName: stemedb-db
            - name: rclone-config
              secret:
                secretName: rclone-gcs-config
```

**Test backup manually:**
```bash
kubectl create job --from=cronjob/stemedb-backup backup-test -n stemedb
kubectl logs -l job-name=backup-test -n stemedb -f
```
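
The segment-selection step of the backup can be tried locally without a cluster. This sketch lists every `*.wal` file except the last one in sort order, assuming segment names sort lexicographically in creation order (for example zero-padded sequence numbers). The function name `completed_wal_segments` is hypothetical.

```bash
# Lists completed WAL segments in a directory: every *.wal file except the
# lexicographically last one, which is assumed to be the active segment.
# Prints nothing when the directory holds zero or one segment.
completed_wal_segments() {
  local dir="$1"
  ls "$dir"/*.wal 2>/dev/null | sort | sed '$d'
}
```

`sed '$d'` drops the final line and is used here instead of `head -n -1`, which not every `head` implementation accepts.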

### Monitoring — Wire into Prometheus

**`service-monitor.yaml`**
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: stemedb-api
  namespace: stemedb
  labels:
    release: prometheus    # must match your Prometheus Operator label selector
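spec:
  selector:
    matchLabels:
      app: stemedb-api     # matches labels on the Service object, not on the pods
  endpoints:
    - port: http           # must be the name of a Service port, not its number; the Service has to name port 18180 "http"
      path: /metrics
      interval: 15s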
```

**`alert-rules.yaml`** — 6 alerts that fire first at 100-project scale
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: stemedb-alerts
  namespace: stemedb
  labels:
    release: prometheus
spec:
  groups:
    - name: stemedb.rules
      rules:
        - alert: StemeDBPodNotRunning
          expr: absent(up{job="stemedb-api"} == 1)   # also fires when the target exists but scrapes fail (up == 0)
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "StemeDB pod is not running"

        - alert: StemeDBWALLatencyHigh
          expr: histogram_quantile(0.99, rate(stemedb_wal_fsync_latency_seconds_bucket[5m])) > 0.05
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "WAL fsync p99 > 50ms — Longhorn I/O degradation likely"

        - alert: StemeDBDataVolumeNearlyFull
          expr: |
            kubelet_volume_stats_used_bytes{persistentvolumeclaim=~"stemedb-.*"}
            / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"stemedb-.*"}
            > 0.75
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "StemeDB PVC usage > 75% — resize requires downtime"

        - alert: StemeDBRateLimitSaturating
          expr: rate(stemedb_http_requests_total{status="429"}[5m]) > 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "429 rate > 1/s — projects hitting rate limits"

        - alert: StemeDBErrorRateHigh
          expr: |
            sum(rate(stemedb_http_requests_total{status=~"5.."}[5m]))
            / sum(rate(stemedb_http_requests_total[5m]))
            > 0.01
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "5xx error rate > 1%"

        - alert: StemeDBOOMKilled
          expr: |
            kube_pod_container_status_last_terminated_reason{
              container="stemedb-api",
              reason="OOMKilled"
            } > 0
          labels:
            severity: critical
          annotations:
            summary: "StemeDB container OOM killed — increase memory limit or find leak"
```

### NetworkPolicy + PDB

**`network-policy.yaml`**
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: stemedb-api
  namespace: stemedb
spec:
  podSelector:
    matchLabels:
      app: stemedb-api
  policyTypes: [Ingress, Egress]
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system   # Traefik
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: monitoring    # Prometheus
      ports:
        - port: 18180
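  egress:
    - ports:
        - port: 53     # DNS over UDP (protocol defaults to TCP when omitted)
          protocol: UDP
        - port: 53     # DNS over TCP (large responses)
          protocol: TCP
        - port: 443    # GCP APIs (backup, secrets)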
```

**`pdb.yaml`**
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: stemedb-api
  namespace: stemedb
spec:
  maxUnavailable: 1   # 0 would block every voluntary eviction, including node drains
  selector:
    matchLabels:
      app: stemedb-api
```

### Phase 2 Checklist

| # | Task | File(s) | Est |
|---|------|---------|-----|
| 1 | Deploy backup CronJob | `deployments/k8s/base/stemedb/backup-cronjob.yaml` | 2h |
| 2 | Create GCS bucket + rclone Secret | GCP Console | 1h |
| 3 | Wire ServiceMonitor into Prometheus | `service-monitor.yaml` | 1h |
| 4 | Deploy 6 alert rules | `alert-rules.yaml` | 1h |
| 5 | Add NetworkPolicy + PDB | `network-policy.yaml`, `pdb.yaml` | 1h |
| 6 | Fix Longhorn PVC reclaim policy in DR runbook | `docs/operations/runbooks/disaster-recovery.md` | 30m |

**Gate test:** Keep the pod down for longer than 2 min (for example, scale to zero replicas) → `StemeDBPodNotRunning` fires once the `for: 2m` window elapses. Run backup job manually → GCS has files.

---

## Phase 3: Scale to 100 Projects (Week 3)

### Per-project key provisioning script

Create `scripts/provision-project-keys.sh`:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Usage: ./provision-project-keys.sh projects.txt
# projects.txt: one project name per line

STEMEDB_URL="${STEMEDB_URL:-https://stemedb.threesix.ai}"
ADMIN_KEY="${STEMEDB_ADMIN_KEY:?Set STEMEDB_ADMIN_KEY}"
PROJECTS_FILE="${1:?Usage: $0 <projects-file>}"

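# "|| [[ -n ... ]]" also processes a final line that lacks a trailing newline
while IFS= read -r project || [[ -n "$project" ]]; do
  [[ -z "$project" ]] && continue

  echo "Provisioning key for: $project"

  # Build the JSON body with jq so the project name is escaped correctly
  payload=$(jq -n --arg label "project-$project" '{label: $label, role: "write_agent"}')

  response=$(curl -sf -X POST "$STEMEDB_URL/v1/admin/api-keys" \
    -H "X-API-Key: $ADMIN_KEY" \
    -H "Content-Type: application/json" \
    -d "$payload")

  # -e makes jq exit non-zero when .key is missing or null, which aborts the run under set -e
  key=$(echo "$response" | jq -er '.key')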

  # Store in GCP Secret Manager
  echo -n "$key" | gcloud secrets create "stemedb-key-$project" \
    --data-file=- \
    --replication-policy=automatic 2>/dev/null \
  || echo -n "$key" | gcloud secrets versions add "stemedb-key-$project" --data-file=-

  echo "  Key stored: stemedb-key-$project"
done < "$PROJECTS_FILE"

echo "Done."
```

**Onboarding runbook for each project:**
```bash
# 1. Retrieve key from Secret Manager
gcloud secrets versions access latest --secret="stemedb-key-<project>"

# 2. Append the hosted section to the project's .aphoria/config.toml
cat >> .aphoria/config.toml <<EOF
[hosted]
url = "https://stemedb.threesix.ai"
api_key_env = "STEMEDB_API_KEY"
EOF

# 3. Export key in CI/CD env
# STEMEDB_API_KEY=steme_live_<value>
```

### Aphoria retry logic (P1)

Projects run `aphoria scan --persist` locally and call the remote StemeDB. During StemeDB pod
restarts (a rolling StatefulSet restart replaces one pod at a time, so requests can briefly fail), Aphoria should retry rather than fail the commit.

> This is a change to the `aphoria` binary, not to StemeDB. Add 3-attempt exponential backoff
> (2s, 4s, 8s) on HTTP 502/503 responses in the Aphoria HTTP client.
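
The real change belongs in the Aphoria HTTP client, but the policy can be sketched in shell to make the timing concrete. This sketch assumes "3-attempt" means up to three retries after the first request, with waits of 2s, 4s and 8s, and that only 502 and 503 are retried. The function name `retry_on_unavailable` and the `SLEEP_CMD` override are hypothetical.

```bash
# Runs a command that prints an HTTP status code and retries it while the
# status is 502 or 503, waiting 2s, 4s, then 8s between tries.
# Prints the final status; returns 1 if it is still 502/503 after three retries.
# SLEEP_CMD is a hypothetical override so delays can be stubbed out.
retry_on_unavailable() {
  local delay=2 retries=0 code
  while :; do
    code=$("$@") || code=000            # command failure is reported as 000
    case "$code" in
      502|503) ;;                       # retryable: continue to the backoff below
      *) echo "$code"; return 0 ;;      # any other status goes back to the caller
    esac
    if [ "$retries" -ge 3 ]; then
      echo "$code"
      return 1
    fi
    ${SLEEP_CMD:-sleep} "$delay"
    delay=$((delay * 2))
    retries=$((retries + 1))
  done
}
```

Usage: `retry_on_unavailable curl -s -o /dev/null -w '%{http_code}' "$STEMEDB_URL/v1/health"`. Worst case, the caller waits 14s in total before the failure is reported.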

### Phase 3 Checklist

| # | Task | File(s) | Est |
|---|------|---------|-----|
| 1 | Run provision script for all 100 projects | `scripts/provision-project-keys.sh` | 2h |
| 2 | Write per-project onboarding runbook | `docs/operations/onboarding-project.md` | 1h |
| 3 | Add retry logic to `aphoria` HTTP client | `applications/aphoria/` | 2h |
| 4 | Split WAL + DB into two PVCs (migration) | `deployments/k8s/base/stemedb/` | 2h |

**Gate test:** 5 projects scan simultaneously with their own keys → each isolated → one rate-limited → others unaffected.

---

## What NOT to Build Yet

| Item | Why not |
|------|---------|
| HPA | StemeDB is stateful (embedded KV). StatefulSet replicas are fixed at 3. |
| mTLS between pods | Internal cluster traffic is on private network. Add when exposing cross-cluster. |
| WAF | Body limits + Traefik rate limit + circuit breaker is sufficient for 100 known projects. |
| Per-tenant namespaces | Multiplies operational surface 100x. API key isolation is the right model. |
| ~~Multi-region / clustering~~ | ✅ 3-node cluster deployed. Next: full SWIM inter-node connectivity. |
| PITR with WAL timestamps | 6-hour backup RPO is acceptable for pilot. Improve later. |
| Secrets rotation automation | Manual rotation via `/v1/admin/api-keys/:hash/rotate` is fine for 100 projects. |
| Distributed tracing | You have one service. WAL fsync histogram covers what you need. |

---

## Open Questions (Partially Resolved)

1. ~~**Image registry**~~: ✅ Zot OCI registry at `registry.threesix.ai` on k3s. Woodpecker CI pushes automatically.
2. ~~**Bootstrap key API**~~: ✅ `bootstrap::bootstrap_root_api_key()` wired in main.rs.
3. **Aphoria scan model**: Projects run `aphoria scan --persist` locally, calling remote StemeDB. Retry logic lives in Aphoria binary.
4. **GCS bucket**: Needs to be created for backups (Phase 2).
5. **CORS**: All router variants use `allow_origin(Any)`. Restrict before public launch.

---

## Risk Register

| Risk | Likelihood | Mitigation |
|------|-----------|-----------|
| Longhorn fsync latency at 100-project burst | Medium | Pin pod + volume to same node (Phase 3), `dataLocality: best-effort`; monitor WAL p99 from day 1 |
| Rolling restart brief downtime | Medium (StatefulSet rolls one pod at a time) | 3 replicas + readiness probe; Gateway routes to healthy pods |
| Fresh PVC after disaster = 100 project keys lost | Low but catastrophic | Bootstrap key seed in `main.rs` + `provision-project-keys.sh` idempotent re-run |
| ~~Image registry blocker~~ | ✅ Resolved | Zot registry on k3s, Woodpecker CI automates builds |
| CORS vulnerability | Medium | `allow_origin(Any)` in all router variants; fix before public launch |

---

## Directory Structure (Current)

```
# k3s-fleet repo
k3s-fleet/deployments/k8s/base/stemedb/
└── stemedb.yaml          # All-in-one: ExternalSecret, headless Service,
                          #   gateway Service, 3-replica StatefulSet, Ingress

# stemedb repo
scripts/
├── entrypoint.sh             # Dual-binary launcher (cluster mode)
└── provision-project-keys.sh

.woodpecker.yml               # CI/CD: Kaniko → Zot registry → kubectl deploy
Dockerfile                    # Multi-stage: builds stemedb-api + stemedb-node
```

After Phase 2 hardening, add to `k3s-fleet/.../stemedb/`:
- `backup-cronjob.yaml`
- `service-monitor.yaml`
- `alert-rules.yaml`
- `network-policy.yaml`
- `pdb.yaml`

---

*Last updated: 2026-03-07 — Phase 1 complete, 3-node StatefulSet deployed with Woodpecker CI/CD*