stemedb/docs/operations/deployment/k8s-deploy-roadmap.md
jordan 1e5ba8b946
feat: wire auth bootstrap, cluster gateway, k8s deploy skill, and ops docs
- Wire auth bootstrap (root API key, startup guard, auth-first router) in main.rs
- Add cluster gateway handlers with proper error handling
- Update Dockerfile with optimized multi-stage build and .dockerignore
- Add orchard9-deploy skill for CI/CD pipeline (Gitea/Woodpecker/Kaniko/Zot)
- Add k8s deployment roadmap and provision-project-keys script
- Document production infrastructure in CLAUDE.md
- Update three-node-cluster reference architecture
- Trim hosted.rs doc comments to stay under 800-line limit
2026-03-07 00:56:31 -07:00

k3s Deploy Roadmap: StemeDB + Aphoria → 100 Projects

Target: Production deployment on k3s-fleet with Longhorn, cert-manager, External Secrets, Prometheus/Grafana, Traefik.

Timeline: 3 weeks to ship-ready for 100 projects.


Ship Blockers (P0) — Must Fix Before Any Project Onboards

1. Auth router not wired in production [RESOLVED 2026-03-02]

create_router_full_protection_full_config is now called when STEMEDB_AUTH_ENABLED=true. Router dispatch checks bootstrap::is_auth_enabled() first — full protection stack activates in production. Metering-only path still available when auth is disabled (local dev).

Resolution: crates/stemedb-api/src/main.rs updated.
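
The dispatch rule above can be sketched as a small pure function. This is a minimal illustration, not the code in main.rs: `RouterMode`, `flag_enabled`, and `select_router_mode` are hypothetical names, and treating only the string "true" (trimmed, case-insensitive) as enabled is an assumption about how bootstrap::is_auth_enabled() reads the variable.

```rust
/// Which middleware stack the server mounts. Variant names are illustrative only.
#[derive(Debug, PartialEq, Eq)]
pub enum RouterMode {
    /// Full protection stack (production, STEMEDB_AUTH_ENABLED=true).
    FullProtection,
    /// Metering-only path (local dev, auth disabled).
    MeteringOnly,
}

/// Interprets an env-style flag. Assumption: only "true" enables it;
/// an unset variable or any other value counts as disabled.
pub fn flag_enabled(raw: Option<&str>) -> bool {
    raw.map(|v| v.trim().eq_ignore_ascii_case("true"))
        .unwrap_or(false)
}

/// Auth is checked first, so a production config can never fall through
/// to the metering-only router.
pub fn select_router_mode(auth_enabled_raw: Option<&str>) -> RouterMode {
    if flag_enabled(auth_enabled_raw) {
        RouterMode::FullProtection
    } else {
        RouterMode::MeteringOnly
    }
}
```

A caller would pass `std::env::var("STEMEDB_AUTH_ENABLED").ok().as_deref()` and build the matching router from the returned mode.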


2. STEMEDB_UNSAFE_SKIP_SIGNATURES startup guard missing [RESOLVED 2026-03-02]

Startup guard added: if STEMEDB_UNSAFE_SKIP_SIGNATURES=true and STEMEDB_AUTH_ENABLED=true, the server logs a fatal error and exits with code 1. The misconfiguration is caught at boot instead of passing silently.

Resolution: crates/stemedb-api/src/main.rs updated.
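
A minimal sketch of such a guard, assuming both flags have already been parsed into booleans. The function names are hypothetical and the error text is illustrative; the actual implementation in main.rs may be structured differently.

```rust
/// Returns Err with a fatal message when signature checks are skipped
/// while auth is enabled. Every other combination is allowed.
pub fn check_signature_guard(skip_signatures: bool, auth_enabled: bool) -> Result<(), String> {
    if skip_signatures && auth_enabled {
        return Err(
            "STEMEDB_UNSAFE_SKIP_SIGNATURES=true is not allowed when STEMEDB_AUTH_ENABLED=true"
                .to_string(),
        );
    }
    Ok(())
}

/// Boot-time wrapper: log the problem and exit with code 1 so the pod
/// fails fast instead of serving with signatures disabled.
pub fn enforce_signature_guard(skip_signatures: bool, auth_enabled: bool) {
    if let Err(msg) = check_signature_guard(skip_signatures, auth_enabled) {
        eprintln!("FATAL: {msg}");
        std::process::exit(1);
    }
}
```

Keeping the check separate from the exit makes the rule testable without terminating the test process.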


3. Bootstrap key not seeded from env on fresh PVC [RESOLVED 2026-03-02]

bootstrap::bootstrap_root_api_key() is now called at startup (after the IngestWorker spawn). It reads STEMEDB_ROOT_API_KEY and is idempotent: a no-op if the key already exists in the store. Any failure is fatal.

Resolution: crates/stemedb-api/src/main.rs updated.
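
The idempotent seeding behaviour can be illustrated with an in-memory stand-in for the key store. Everything here is hypothetical: `InMemoryKeyStore`, `BootstrapOutcome`, and `bootstrap_root_key` are not the real ApiKeyStore API. Two assumptions are baked in: an existing key wins over the env value, and a missing env value on an empty store is an error.

```rust
use std::collections::HashMap;

const ROOT_LABEL: &str = "root";

/// Hypothetical in-memory stand-in for the persistent key store.
/// A real store would likely persist a hash rather than the raw key.
#[derive(Default)]
pub struct InMemoryKeyStore {
    keys: HashMap<String, String>,
}

impl InMemoryKeyStore {
    pub fn root_key(&self) -> Option<&str> {
        self.keys.get(ROOT_LABEL).map(String::as_str)
    }
}

#[derive(Debug, PartialEq, Eq)]
pub enum BootstrapOutcome {
    Seeded,
    AlreadyPresent,
}

/// Seeds the root key from the env value on a fresh store and does
/// nothing on later boots, so restarts never overwrite an existing key.
pub fn bootstrap_root_key(
    store: &mut InMemoryKeyStore,
    env_value: Option<&str>,
) -> Result<BootstrapOutcome, String> {
    if store.keys.contains_key(ROOT_LABEL) {
        return Ok(BootstrapOutcome::AlreadyPresent);
    }
    let key = env_value
        .map(str::trim)
        .filter(|k| !k.is_empty())
        .ok_or_else(|| "STEMEDB_ROOT_API_KEY is not set".to_string())?;
    store.keys.insert(ROOT_LABEL.to_string(), key.to_string());
    Ok(BootstrapOutcome::Seeded)
}
```

At startup the caller would treat an `Err` as fatal, matching the behaviour described above.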


4. No k8s manifests: StemeDB cannot be deployed to k3s [RESOLVED 2026-03-02]

Manifests deployed to k3s-fleet/deployments/k8s/base/stemedb/ (single stemedb.yaml following tidaldb/ pattern). Includes ExternalSecret, PVC (50Gi Longhorn), Deployment (Recreate, non-root, all probes), ClusterIP Service, Traefik Ingress at stemedb.threesix.ai.

Remaining manual steps: build and push the image, create the GCP secret, and add the DNS record (see the Pre-Deploy Checklist below).


5. Image registry: k3s cannot pull without a registry [RESOLVED 2026-03-02]

Registry confirmed: us-central1-docker.pkg.dev/orchard9/docker-images/ (GAR). imagePullSecrets: gcr-secret wired in Deployment. Dockerfile updated with --features aphoria.

Remaining manual step: docker build && docker push to populate the image.


Pre-Deploy Checklist (Manual Steps Before kubectl apply)

# 1. Build and push image (from stemedb repo root)
docker build -t us-central1-docker.pkg.dev/orchard9/docker-images/stemedb-api:latest .
docker push us-central1-docker.pkg.dev/orchard9/docker-images/stemedb-api:latest

# 2. Create root API key in GCP Secret Manager
ROOT_KEY="steme_live_$(openssl rand -hex 24)"
echo "Root key: $ROOT_KEY"   # Save this — needed for provision-project-keys.sh
echo -n "$ROOT_KEY" | gcloud secrets create stemedb-root-api-key \
  --project=orchard9 --replication-policy=automatic --data-file=-

# 3. Add DNS: stemedb.threesix.ai → Traefik LB IP (Cloudflare)

Original Manifest Spec (archived for reference)

The following was the original spec. Actual implementation is in k3s-fleet/deployments/k8s/base/stemedb/stemedb.yaml.

Create deployments/k8s/base/stemedb/ with the following files:

namespace.yaml

apiVersion: v1
kind: Namespace
metadata:
  name: stemedb

pvc.yaml — Two PVCs to isolate WAL fsync from LSM compaction I/O

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: stemedb-wal
  namespace: stemedb
  annotations:
    volumeType: longhorn
spec:
  accessModes: [ReadWriteOnce]
  storageClassName: longhorn
  resources:
    requests:
      storage: 20Gi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: stemedb-db
  namespace: stemedb
  annotations:
    volumeType: longhorn
spec:
  accessModes: [ReadWriteOnce]
  storageClassName: longhorn
  resources:
    requests:
      storage: 50Gi

Set numberOfReplicas: "2" in the Longhorn StorageClass parameters (the default is 3) to cut cross-node fsync amplification.

deployment.yaml — Critical spec decisions annotated

apiVersion: apps/v1
kind: Deployment
metadata:
  name: stemedb-api
  namespace: stemedb
spec:
  replicas: 1           # Non-negotiable. Embedded KV requires exclusive volume access.
  strategy:
    type: Recreate      # NOT RollingUpdate. RWO PVC + 2 pods = deadlock.
  selector:
    matchLabels:
      app: stemedb-api
  template:
    metadata:
      labels:
        app: stemedb-api
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "18180"
        prometheus.io/path: "/metrics"
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        fsGroup: 1000
      terminationGracePeriodSeconds: 30  # Let in-flight WAL writes complete.
      containers:
        - name: stemedb-api
          image: <REGISTRY>/stemedb-api:latest
          securityContext:
            readOnlyRootFilesystem: false  # Container-level field (invalid at pod level). WAL writes to /data
          ports:
            - containerPort: 18180
          env:
            - name: STEMEDB_BIND_ADDR
              value: "0.0.0.0:18180"
            - name: STEMEDB_WAL_DIR
              value: /data/wal
            - name: STEMEDB_DB_DIR
              value: /data/db
            - name: STEMEDB_METER_ENABLED
              value: "true"
            - name: STEMEDB_ROOT_API_KEY
              valueFrom:
                secretKeyRef:
                  name: stemedb-secrets
                  key: root-api-key
          resources:
            requests:
              cpu: "500m"
              memory: "1Gi"
            limits:
              cpu: "2000m"
              memory: "4Gi"
          startupProbe:       # WAL replay can take 60s after crash — do not skip this.
            httpGet:
              path: /v1/health
              port: 18180
            periodSeconds: 5
            failureThreshold: 12   # 60s total window before k8s kills pod
          livenessProbe:
            httpGet:
              path: /v1/health
              port: 18180
            periodSeconds: 15
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /v1/health
              port: 18180
            periodSeconds: 5
            failureThreshold: 3
          volumeMounts:
            - name: wal
              mountPath: /data/wal
            - name: db
              mountPath: /data/db
      volumes:
        - name: wal
          persistentVolumeClaim:
            claimName: stemedb-wal
        - name: db
          persistentVolumeClaim:
            claimName: stemedb-db

service.yaml

apiVersion: v1
kind: Service
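metadata:
  name: stemedb-api
  namespace: stemedb
  labels:
    app: stemedb-api    # ServiceMonitor selectors match Service labels, not pod labels
spec:
  selector:
    app: stemedb-api
  ports:
    - name: http
      port: 18180
      targetPort: 18180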
  type: ClusterIP

ingress.yaml — Traefik terminates TLS; do NOT set STEMEDB_TLS_CERT_PATH

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: stemedb-api
  namespace: stemedb
  annotations:
    traefik.ingress.kubernetes.io/router.entrypoints: websecure
    traefik.ingress.kubernetes.io/router.middlewares: stemedb-ratelimit@kubernetescrd
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  ingressClassName: traefik
  rules:
    - host: stemedb.yourdomain.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: stemedb-api
                port:
                  number: 18180
  tls:
    - hosts:
        - stemedb.yourdomain.com
      secretName: stemedb-tls

middleware.yaml — Traefik rate limit (global, before app-level limits)

apiVersion: traefik.containo.us/v1alpha1   # Traefik v2 API group; Traefik v3 uses traefik.io/v1alpha1
kind: Middleware
metadata:
  name: ratelimit
  namespace: stemedb
spec:
  rateLimit:
    average: 500
    burst: 1000
    period: 1s

external-secret.yaml — Pull from GCP Secret Manager via External Secrets Operator

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: stemedb-secrets
  namespace: stemedb
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: gcp-secret-manager    # adjust to your cluster's SecretStore name
    kind: ClusterSecretStore
  target:
    name: stemedb-secrets
  data:
    - secretKey: root-api-key
      remoteRef:
        key: stemedb-root-api-key

kustomization.yaml

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - namespace.yaml
  - pvc.yaml
  - deployment.yaml
  - service.yaml
  - ingress.yaml
  - middleware.yaml
  - external-secret.yaml

Deploy:

kubectl apply -k deployments/k8s/base/stemedb/
kubectl rollout status deployment/stemedb-api -n stemedb
curl https://stemedb.yourdomain.com/v1/health

Phase 1 Checklist (Week 1 — Gate: First Project Can Connect)

| # | Task | File(s) | Status |
|---|------|---------|--------|
| 1 | Wire auth router in main.rs | crates/stemedb-api/src/main.rs | Done |
| 2 | Add STEMEDB_UNSAFE_SKIP_SIGNATURES startup guard | crates/stemedb-api/src/main.rs | Done |
| 3 | Add bootstrap key seed from STEMEDB_ROOT_API_KEY | crates/stemedb-api/src/main.rs | Done |
| 4 | Add --features aphoria to Dockerfile | Dockerfile | Done |
| 5 | Create k8s manifests | k3s-fleet/.../stemedb/ | Done |
| 6 | Write scripts/provision-project-keys.sh | scripts/ | Done |
| 7 | Build + push Docker image | GAR | Manual |
| 8 | Store root API key in GCP Secret Manager | GCP Console | Manual |
| 9 | Add DNS record: stemedb.threesix.ai | Cloudflare | Manual |
| 10 | Deploy to k3s + smoke test | k3s-fleet | Pending |

Gate test (run after deploy):

# Health check
curl https://stemedb.threesix.ai/v1/health

# Unauthenticated write → 401
curl -s -o /dev/null -w "%{http_code}" -X POST \
  https://stemedb.threesix.ai/v1/assert -H "Content-Type: application/json" -d '{}'

# Authenticated write → 200/201
curl -X POST https://stemedb.threesix.ai/v1/assert \
  -H "X-API-Key: $ROOT_KEY" -H "Content-Type: application/json" \
  -d '{"subject":"test/ping","predicate":"alive","value":true,"agent_id":"test"}'

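# Confirm key persists across restart
kubectl rollout restart deployment/stemedb-api -n stemedb
kubectl rollout status deployment/stemedb-api -n stemedb --timeout=120s
curl https://stemedb.threesix.ai/v1/health
# Repeat the authenticated write; a 401 here means the root key did not survive the restart
curl -s -o /dev/null -w "%{http_code}" -X POST https://stemedb.threesix.ai/v1/assert \
  -H "X-API-Key: $ROOT_KEY" -H "Content-Type: application/json" \
  -d '{"subject":"test/ping","predicate":"alive","value":true,"agent_id":"test"}'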

Phase 2: Production Hardening (Week 2 — Gate: 10 Projects)

Backup CronJob

Create deployments/k8s/base/stemedb/backup-cronjob.yaml:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: stemedb-backup
  namespace: stemedb
spec:
  schedule: "0 */6 * * *"   # Every 6 hours
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: backup
              image: rclone/rclone:latest
              command:
                - /bin/sh
                - -c
                - |
                  set -eu
                  # WAL: copy completed segments only; the newest segment is still being written
                  ACTIVE=$(ls /data/wal/*.wal 2>/dev/null | sort | tail -n 1)
                  if [ -n "$ACTIVE" ]; then
                    # rclone applies filter rules in order (first match wins), so the
                    # exclusion of the active segment must come before the *.wal include
                    rclone copy /data/wal/ "gcs:$BACKUP_BUCKET/wal/" \
                      --filter "- $(basename "$ACTIVE")" --filter "+ *.wal" --filter "- *"
                  fi
                  # DB snapshot
                  rclone copy /data/db/ "gcs:$BACKUP_BUCKET/db/$(date -u +%Y%m%dT%H%M%SZ)/"
                  echo "Backup complete"
              env:
                - name: BACKUP_BUCKET
                  value: stemedb-backups    # your GCS bucket name
              volumeMounts:
                - name: wal
                  mountPath: /data/wal
                  readOnly: true
                - name: db
                  mountPath: /data/db
                  readOnly: true
                - name: rclone-config
                  mountPath: /config/rclone
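          volumes:
            # Both claims are ReadWriteOnce, so this pod must land on the node running stemedb-api.
            # Claim names assume the two-PVC layout; adjust if the deployed manifest still uses a single PVC.
            - name: wal
              persistentVolumeClaim:
                claimName: stemedb-wal
            - name: db
              persistentVolumeClaim:
                claimName: stemedb-db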
            - name: rclone-config
              secret:
                secretName: rclone-gcs-config

Test backup manually:

kubectl create job --from=cronjob/stemedb-backup backup-test -n stemedb
kubectl logs -l job-name=backup-test -n stemedb -f

Monitoring — Wire into Prometheus

service-monitor.yaml

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: stemedb-api
  namespace: stemedb
  labels:
    release: prometheus    # must match your Prometheus Operator label selector
spec:
  selector:
    matchLabels:
      app: stemedb-api
  endpoints:
    - port: "18180"
      path: /metrics
      interval: 15s

alert-rules.yaml — 6 alerts that fire first at 100-project scale

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: stemedb-alerts
  namespace: stemedb
  labels:
    release: prometheus
spec:
  groups:
    - name: stemedb.rules
      rules:
        - alert: StemeDBPodNotRunning
          expr: absent(up{job="stemedb-api"} == 1)   # fires when the target is missing or is scraped as down
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "StemeDB pod is not running"

        - alert: StemeDBWALLatencyHigh
          expr: histogram_quantile(0.99, rate(stemedb_wal_fsync_latency_seconds_bucket[5m])) > 0.05
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "WAL fsync p99 > 50ms — Longhorn I/O degradation likely"

        - alert: StemeDBDataVolumeNearlyFull
          expr: |
            kubelet_volume_stats_used_bytes{persistentvolumeclaim=~"stemedb-.*"}
            / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"stemedb-.*"}
            > 0.75            
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "StemeDB PVC usage > 75% — resize requires downtime"

        - alert: StemeDBRateLimitSaturating
          expr: sum(rate(stemedb_http_requests_total{status="429"}[5m])) > 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "429 rate > 1/s — projects hitting rate limits"

        - alert: StemeDBErrorRateHigh
          expr: |
            sum(rate(stemedb_http_requests_total{status=~"5.."}[5m]))
            / sum(rate(stemedb_http_requests_total[5m]))
            > 0.01
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "5xx error rate > 1%"

        - alert: StemeDBOOMKilled
          expr: |
            kube_pod_container_status_last_terminated_reason{
              container="stemedb-api",
              reason="OOMKilled"
            } > 0            
          labels:
            severity: critical
          annotations:
            summary: "StemeDB container OOM killed — increase memory limit or find leak"

NetworkPolicy + PDB

network-policy.yaml

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: stemedb-api
  namespace: stemedb
spec:
  podSelector:
    matchLabels:
      app: stemedb-api
  policyTypes: [Ingress, Egress]
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system   # Traefik
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: monitoring    # Prometheus
      ports:
        - port: 18180
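  egress:
    - ports:
        - port: 53     # DNS over UDP (protocol defaults to TCP if omitted)
          protocol: UDP
        - port: 53     # DNS over TCP (large responses)
          protocol: TCP
        - port: 443    # GCP APIs (backup, secrets)
          protocol: TCP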

pdb.yaml

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: stemedb-api
  namespace: stemedb
spec:
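  maxUnavailable: 0   # With replicas: 1 this blocks every voluntary eviction; node drains wait until the pod is deleted manually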
  selector:
    matchLabels:
      app: stemedb-api

Phase 2 Checklist

| # | Task | File(s) | Est |
|---|------|---------|-----|
| 1 | Deploy backup CronJob | deployments/k8s/base/stemedb/backup-cronjob.yaml | 2h |
| 2 | Create GCS bucket + rclone Secret | GCP Console | 1h |
| 3 | Wire ServiceMonitor into Prometheus | service-monitor.yaml | 1h |
| 4 | Deploy 6 alert rules | alert-rules.yaml | 1h |
| 5 | Add NetworkPolicy + PDB | network-policy.yaml, pdb.yaml | 1h |
| 6 | Fix Longhorn PVC reclaim policy in DR runbook | docs/operations/runbooks/disaster-recovery.md | 30m |

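Gate test: Scale the Deployment to 0 replicas → StemeDBPodNotRunning fires after roughly 2 min (the rule's for: 2m window; a killed pod is recreated too quickly to trip it). Run backup job manually → GCS has files.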


Phase 3: Scale to 100 Projects (Week 3)

Per-project key provisioning script

Create scripts/provision-project-keys.sh:

#!/usr/bin/env bash
set -euo pipefail

# Usage: ./provision-project-keys.sh projects.txt
# projects.txt: one project name per line

STEMEDB_URL="${STEMEDB_URL:-https://stemedb.yourdomain.com}"
ADMIN_KEY="${STEMEDB_ADMIN_KEY:?Set STEMEDB_ADMIN_KEY}"
PROJECTS_FILE="${1:?Usage: $0 <projects-file>}"

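# The extra test handles a final line that lacks a trailing newline
while IFS= read -r project || [[ -n "$project" ]]; do
  [[ -z "$project" ]] && continue

  echo "Provisioning key for: $project"

  # Build the JSON body with jq so project names are escaped correctly
  body=$(jq -n --arg label "project-$project" '{label: $label, role: "write_agent"}')

  response=$(curl -sf -X POST "$STEMEDB_URL/v1/admin/api-keys" \
    -H "X-API-Key: $ADMIN_KEY" \
    -H "Content-Type: application/json" \
    -d "$body")

  # -e makes jq exit non-zero when .key is missing or null, so "null" is never stored as a key
  key=$(echo "$response" | jq -er '.key')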

  # Store in GCP Secret Manager
  echo -n "$key" | gcloud secrets create "stemedb-key-$project" \
    --data-file=- \
    --replication-policy=automatic 2>/dev/null \
  || echo -n "$key" | gcloud secrets versions add "stemedb-key-$project" --data-file=-

  echo "  Key stored: stemedb-key-$project"
done < "$PROJECTS_FILE"

echo "Done."

Onboarding runbook for each project:

# 1. Retrieve key from Secret Manager
gcloud secrets versions access latest --secret="stemedb-key-<project>"

# 2. Append the hosted section to the project's .aphoria/config.toml
cat >> .aphoria/config.toml <<EOF
[hosted]
url = "https://stemedb.yourdomain.com"
api_key_env = "STEMEDB_API_KEY"
EOF

# 3. Export key in CI/CD env
# STEMEDB_API_KEY=steme_live_<value>

Aphoria retry logic (P1)

Projects run aphoria scan --persist locally and call the remote StemeDB. During StemeDB pod restarts (Recreate strategy = brief downtime), Aphoria should retry rather than fail the commit.

This is a change to the aphoria binary, not to StemeDB. Add 3-attempt exponential backoff (2s, 4s, 8s) on HTTP 502/503 responses in the Aphoria HTTP client.
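
A minimal sketch of that retry policy, independent of any HTTP client. The names `backoff_delay`, `is_retryable`, and `send_with_retry` are hypothetical, and the sketch assumes "3-attempt" means up to three retries after the initial request, which is what yields the three delays 2s, 4s, 8s. The sleep function is injected so the logic can be tested without waiting.

```rust
use std::time::Duration;

/// Delay before retry number `retry` (1-based): 2s, 4s, 8s.
pub fn backoff_delay(retry: u32) -> Duration {
    Duration::from_secs(2u64.pow(retry))
}

/// Only gateway-style failures seen during a pod restart are retried.
pub fn is_retryable(status: u16) -> bool {
    matches!(status, 502 | 503)
}

/// Calls `send` until it returns a non-retryable status or the retry
/// budget is spent, and returns the last status seen.
pub fn send_with_retry<F, S>(mut send: F, mut sleep: S) -> u16
where
    F: FnMut() -> u16,
    S: FnMut(Duration),
{
    const MAX_RETRIES: u32 = 3;
    let mut status = send();
    let mut retry = 0;
    while is_retryable(status) && retry < MAX_RETRIES {
        retry += 1;
        sleep(backoff_delay(retry));
        status = send();
    }
    status
}
```

In the real client, `send` would wrap the HTTP call and `sleep` would be `std::thread::sleep` or an async timer. The worst case waits 14 seconds in total, which fits inside the 60-second startup probe window only for fast restarts, so connection-level errors may need the same treatment.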

Phase 3 Checklist

| # | Task | File(s) | Est |
|---|------|---------|-----|
| 1 | Run provision script for all 100 projects | scripts/provision-project-keys.sh | 2h |
| 2 | Write per-project onboarding runbook | docs/operations/onboarding-project.md | 1h |
| 3 | Add retry logic to aphoria HTTP client | applications/aphoria/ | 2h |
| 4 | Split WAL + DB into two PVCs (migration) | deployments/k8s/base/stemedb/ | 2h |

Gate test: Five projects scan simultaneously, each with its own key → data stays isolated per project → when one project is rate-limited, the others are unaffected.


What NOT to Build Yet

| Item | Why not |
|------|---------|
| HPA | StemeDB is stateful (embedded KV). Cannot scale horizontally. |
| mTLS between pods | Single service. Add when you have a second service. |
| WAF | Body limits + Traefik rate limit + circuit breaker is sufficient for 100 known projects. |
| Per-tenant namespaces | Multiplies operational surface 100x. API key isolation is the right model. |
| Multi-region / clustering | 3-node k3s + Longhorn 2-replica is your HA story. P6 in roadmap. |
| PITR with WAL timestamps | 6-hour backup RPO is acceptable for pilot. Improve later. |
| Secrets rotation automation | Manual rotation via /v1/admin/api-keys/:hash/rotate is fine for 100 projects. |
| Distributed tracing | You have one service. WAL fsync histogram covers what you need. |

Open Questions (Resolve Week 1)

  1. Image registry: Which registry does k3s-fleet already use? Check get_service_config() in deploy-stack.sh.
  2. Bootstrap key API: Verify exact method signatures on ApiKeyStore before writing the seed logic in main.rs.
  3. Aphoria scan model: Do projects run aphoria scan locally (calling remote StemeDB) or as a k8s Job? Determines where retry logic lives.
  4. GCS bucket: Does one exist for backups, or does it need to be created?
  5. CORS: All router variants in routers.rs use allow_origin(Any). Production needs this restricted to Traefik's internal domain. Add STEMEDB_ALLOWED_ORIGINS env var support.
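
For the CORS question, one possible shape for STEMEDB_ALLOWED_ORIGINS support is sketched below. The comma-separated format, the type `AllowedOrigins`, and both function names are assumptions made for illustration. Falling back to "any origin" when the variable is unset preserves today's behaviour for local dev; a production build might prefer to refuse to start instead.

```rust
/// CORS policy derived from the environment. Hypothetical type.
#[derive(Debug, PartialEq, Eq)]
pub enum AllowedOrigins {
    /// Current behaviour: every origin is accepted.
    Any,
    /// Only the listed origins are accepted (exact string match).
    List(Vec<String>),
}

/// Parses a comma-separated origin list, ignoring blanks and padding.
/// An unset or empty value yields `Any` (assumption, see above).
pub fn parse_allowed_origins(raw: Option<&str>) -> AllowedOrigins {
    let origins: Vec<String> = raw
        .unwrap_or("")
        .split(',')
        .map(str::trim)
        .filter(|o| !o.is_empty())
        .map(str::to_string)
        .collect();
    if origins.is_empty() {
        AllowedOrigins::Any
    } else {
        AllowedOrigins::List(origins)
    }
}

/// Decides whether a request's Origin header passes the policy.
pub fn is_origin_allowed(policy: &AllowedOrigins, origin: &str) -> bool {
    match policy {
        AllowedOrigins::Any => true,
        AllowedOrigins::List(list) => list.iter().any(|o| o == origin),
    }
}
```

The router variants in routers.rs would then build their CORS layer from the parsed policy instead of hard-coding allow_origin(Any).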

Risk Register

| Risk | Likelihood | Mitigation |
|------|------------|------------|
| Longhorn fsync latency at 100-project burst | Medium | Pin pod + volume to same node (Phase 3), dataLocality: best-effort; monitor WAL p99 from day 1 |
| Single-instance downtime during deploys | High (Recreate strategy) | Startup probe + maintenance window policy + Aphoria retry logic |
| Fresh PVC after disaster = 100 project keys lost | Low but catastrophic | Bootstrap key seed in main.rs + provision-project-keys.sh idempotent re-run |
| Image registry blocker | High if unresolved | Resolve Day 1; entire deployment depends on it |
| CORS vulnerability | Medium | allow_origin(Any) in all router variants; fix before public launch |

Directory Structure After Phase 1

deployments/
└── k8s/
    └── base/
        └── stemedb/
            ├── kustomization.yaml
            ├── namespace.yaml
            ├── pvc.yaml
            ├── deployment.yaml
            ├── service.yaml
            ├── ingress.yaml
            ├── middleware.yaml
            └── external-secret.yaml

scripts/
└── provision-project-keys.sh   (new)

After Phase 2, add to deployments/k8s/base/stemedb/:

  • backup-cronjob.yaml
  • service-monitor.yaml
  • alert-rules.yaml
  • network-policy.yaml
  • pdb.yaml

Last updated: 2026-03-02 — Week 1 code changes complete; 3 manual steps remain before deploy