stemedb/docs/operations/deployment/k8s-deploy-roadmap.md
jordan 6c6ee04e9c
feat: complete cluster integration — SWIM gossip, gRPC server, shard rebalancing, single binary
8-task cluster completion bringing 3-replica StatefulSet from isolated
nodes to fully functional cluster:

1. Fix Gateway /metrics 500 (wire PrometheusHandle)
2. gRPC server + SWIM background tasks (probe, suspicion, gossip dissemination)
3. join() registers peers in membership table via PingResponse fields
4. Shard rebalancing on membership changes (deterministic round-robin)
5. API cluster wiring (DNS resolution, Gateway, gRPC, gossip broadcaster)
6. Single binary merge (stemedb-api --features cluster replaces stemedb-node)
7. Auth header forwarding (X-API-Key passed through Gateway to backends)
8. CORS restriction (STEMEDB_ALLOWED_ORIGINS env var, permissive fallback)
2026-03-07 15:09:29 -07:00


k3s Deploy Roadmap: StemeDB + Aphoria → 100 Projects

Target: Production deployment on k3s-fleet with Longhorn, cert-manager, External Secrets, Prometheus/Grafana, Traefik.

Timeline: 3 weeks to ship-ready for 100 projects.


Ship Blockers (P0) — Must Fix Before Any Project Onboards

1. Auth router not wired in production (RESOLVED 2026-03-02)

create_router_full_protection_full_config is now called when STEMEDB_AUTH_ENABLED=true. Router dispatch checks bootstrap::is_auth_enabled() first — full protection stack activates in production. Metering-only path still available when auth is disabled (local dev).

Resolution: crates/stemedb-api/src/main.rs updated.


2. STEMEDB_UNSAFE_SKIP_SIGNATURES startup guard missing (RESOLVED 2026-03-02)

Startup guard added: if STEMEDB_UNSAFE_SKIP_SIGNATURES=true and STEMEDB_AUTH_ENABLED=true, the server logs a fatal error and exits with code 1. The misconfiguration is caught at boot instead of passing silently.
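
The guard itself lives in main.rs. As a rough illustration, the same condition can be sketched in shell, for example as a pre-deploy lint over the environment a pod will receive. The function name `check_signature_guard` is hypothetical, and treating only the literal string `true` as "enabled" is an assumption about how the server parses these variables.

```shell
#!/usr/bin/env bash
# Hypothetical pre-deploy lint mirroring the startup guard described above.
# Assumption: only the literal value "true" enables either flag.
check_signature_guard() {
  local skip="${STEMEDB_UNSAFE_SKIP_SIGNATURES:-false}"
  local auth="${STEMEDB_AUTH_ENABLED:-false}"
  if [[ "$skip" == "true" && "$auth" == "true" ]]; then
    echo "FATAL: STEMEDB_UNSAFE_SKIP_SIGNATURES=true is not allowed with STEMEDB_AUTH_ENABLED=true" >&2
    return 1
  fi
  return 0
}
```

A deploy script could call `check_signature_guard || exit 1` before applying manifests, so the bad combination is rejected before a pod ever starts.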

Resolution: crates/stemedb-api/src/main.rs updated.


3. Bootstrap key not seeded from env on fresh PVC (RESOLVED 2026-03-02)

bootstrap::bootstrap_root_api_key() is now called at startup (after IngestWorker spawn). Reads STEMEDB_ROOT_API_KEY, idempotent — no-op if key already exists in the store. Fatal error on failure.

Resolution: crates/stemedb-api/src/main.rs updated.


4. No k8s manifests: StemeDB cannot be deployed to k3s (RESOLVED 2026-03-02)

Manifests deployed to k3s-fleet/deployments/k8s/base/stemedb/ (single stemedb.yaml following tidaldb/ pattern). Includes ExternalSecret, PVC (50Gi Longhorn), Deployment (Recreate, non-root, all probes), ClusterIP Service, Traefik Ingress at stemedb.threesix.ai.

Remaining manual step: Build + push image, create GCP secret, add DNS record (see Pre-Deploy section below).


5. Image registry: k3s cannot pull without a registry (RESOLVED 2026-03-07)

Registry: registry.threesix.ai (Zot OCI registry on k3s). Woodpecker CI pipeline (.woodpecker.yml) builds via Kaniko and pushes automatically on every merge to main. No manual docker build needed.

Image: registry.threesix.ai/stemedb-api:latest (also tagged with short commit SHA)


Pre-Deploy Checklist (Manual Steps Before kubectl apply)

Note: Image builds are now automated via Woodpecker CI. Push to main → Kaniko builds → pushes to registry.threesix.ai → kubectl set image on StatefulSet. Manual steps below are only needed for first-time setup.

# 1. Image builds are automatic (Woodpecker CI). For manual builds:
docker build --platform linux/amd64 -t registry.threesix.ai/stemedb-api:latest .
docker push registry.threesix.ai/stemedb-api:latest

# 2. Create root API key in GCP Secret Manager (first deploy only)
ROOT_KEY="steme_live_$(openssl rand -hex 24)"
echo "Root key: $ROOT_KEY"   # Save this — needed for provision-project-keys.sh
echo -n "$ROOT_KEY" | gcloud secrets create stemedb-root-api-key \
  --project=orchard9 --replication-policy=automatic --data-file=-

# 3. Add DNS: stemedb.threesix.ai → Traefik LB IP (Cloudflare) — already done

Original Manifest Spec (archived for reference)

The following was the original spec. Actual implementation is in k3s-fleet/deployments/k8s/base/stemedb/stemedb.yaml.

Create deployments/k8s/base/stemedb/ with the following files:

namespace.yaml

apiVersion: v1
kind: Namespace
metadata:
  name: stemedb

pvc.yaml — Two PVCs to isolate WAL fsync from LSM compaction I/O

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: stemedb-wal
  namespace: stemedb
  annotations:
    volumeType: longhorn
spec:
  accessModes: [ReadWriteOnce]
  storageClassName: longhorn
  resources:
    requests:
      storage: 20Gi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: stemedb-db
  namespace: stemedb
  annotations:
    volumeType: longhorn
spec:
  accessModes: [ReadWriteOnce]
  storageClassName: longhorn
  resources:
    requests:
      storage: 50Gi

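Set numberOfReplicas: 2 in the Longhorn StorageClass (not the default 3) so each write is replicated to two nodes instead of three. When one replica is local to the pod, this halves the cross-node fsync amplification.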
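
A minimal sketch of such a StorageClass, emitted from a shell function so it can be piped to kubectl. The class name `longhorn-2r` and the helper name are assumptions made for illustration; `driver.longhorn.io` and the `numberOfReplicas` parameter are Longhorn's own. StorageClass parameters cannot be edited after creation, so this sketch defines a new class rather than modifying the existing `longhorn` one, and the PVCs above would need to reference the new name.

```shell
#!/usr/bin/env bash
# Prints a Longhorn StorageClass manifest with a configurable replica count.
# Assumptions: the class name "longhorn-2r" and this helper are illustrative only.
longhorn_storageclass() {
  local name="${1:-longhorn-2r}"
  local replicas="${2:-2}"
  cat <<EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ${name}
provisioner: driver.longhorn.io
allowVolumeExpansion: true
parameters:
  numberOfReplicas: "${replicas}"
EOF
}
```

Usage: `longhorn_storageclass | kubectl apply -f -`.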

deployment.yaml — Critical spec decisions annotated

apiVersion: apps/v1
kind: Deployment
metadata:
  name: stemedb-api
  namespace: stemedb
spec:
  replicas: 1           # Non-negotiable. Embedded KV requires exclusive volume access.
  strategy:
    type: Recreate      # NOT RollingUpdate. RWO PVC + 2 pods = deadlock.
  selector:
    matchLabels:
      app: stemedb-api
  template:
    metadata:
      labels:
        app: stemedb-api
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "18180"
        prometheus.io/path: "/metrics"
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        fsGroup: 1000
      terminationGracePeriodSeconds: 30  # Let in-flight WAL writes complete.
      containers:
        - name: stemedb-api
          image: <REGISTRY>/stemedb-api:latest
          securityContext:
            readOnlyRootFilesystem: false  # Container-level field (not valid at pod level). WAL writes to /data
          ports:
            - containerPort: 18180
          env:
            - name: STEMEDB_BIND_ADDR
              value: "0.0.0.0:18180"
            - name: STEMEDB_WAL_DIR
              value: /data/wal
            - name: STEMEDB_DB_DIR
              value: /data/db
            - name: STEMEDB_METER_ENABLED
              value: "true"
            - name: STEMEDB_ROOT_API_KEY
              valueFrom:
                secretKeyRef:
                  name: stemedb-secrets
                  key: root-api-key
          resources:
            requests:
              cpu: "500m"
              memory: "1Gi"
            limits:
              cpu: "2000m"
              memory: "4Gi"
          startupProbe:       # WAL replay can take 60s after crash — do not skip this.
            httpGet:
              path: /v1/health
              port: 18180
            periodSeconds: 5
            failureThreshold: 12   # 60s total window before k8s kills pod
          livenessProbe:
            httpGet:
              path: /v1/health
              port: 18180
            periodSeconds: 15
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /v1/health
              port: 18180
            periodSeconds: 5
            failureThreshold: 3
          volumeMounts:
            - name: wal
              mountPath: /data/wal
            - name: db
              mountPath: /data/db
      volumes:
        - name: wal
          persistentVolumeClaim:
            claimName: stemedb-wal
        - name: db
          persistentVolumeClaim:
            claimName: stemedb-db

service.yaml

apiVersion: v1
kind: Service
metadata:
  name: stemedb-api
  namespace: stemedb
  labels:
    app: stemedb-api   # ServiceMonitor selects Services by label
spec:
  selector:
    app: stemedb-api
  ports:
    - port: 18180
      targetPort: 18180
  type: ClusterIP

ingress.yaml — Traefik terminates TLS; do NOT set STEMEDB_TLS_CERT_PATH

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: stemedb-api
  namespace: stemedb
  annotations:
    traefik.ingress.kubernetes.io/router.entrypoints: websecure
    traefik.ingress.kubernetes.io/router.middlewares: stemedb-ratelimit@kubernetescrd
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  ingressClassName: traefik
  rules:
    - host: stemedb.yourdomain.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: stemedb-api
                port:
                  number: 18180
  tls:
    - hosts:
        - stemedb.yourdomain.com
      secretName: stemedb-tls

middleware.yaml — Traefik rate limit (global, before app-level limits)

apiVersion: traefik.containo.us/v1alpha1
kind: Middleware
metadata:
  name: ratelimit
  namespace: stemedb
spec:
  rateLimit:
    average: 500
    burst: 1000
    period: 1s

external-secret.yaml — Pull from GCP Secret Manager via External Secrets Operator

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: stemedb-secrets
  namespace: stemedb
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: gcp-secret-manager    # adjust to your cluster's SecretStore name
    kind: ClusterSecretStore
  target:
    name: stemedb-secrets
  data:
    - secretKey: root-api-key
      remoteRef:
        key: stemedb-root-api-key

kustomization.yaml

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - namespace.yaml
  - pvc.yaml
  - deployment.yaml
  - service.yaml
  - ingress.yaml
  - middleware.yaml
  - external-secret.yaml

Deploy:

kubectl apply -k deployments/k8s/base/stemedb/
kubectl rollout status deployment/stemedb-api -n stemedb
curl https://stemedb.yourdomain.com/v1/health

Phase 1 Checklist (Week 1 — Gate: First Project Can Connect) COMPLETE

| # | Task | File(s) | Status |
|---|------|---------|--------|
| 1 | Wire auth router in main.rs | crates/stemedb-api/src/main.rs | Done |
| 2 | Add STEMEDB_UNSAFE_SKIP_SIGNATURES startup guard | crates/stemedb-api/src/main.rs | Done |
| 3 | Add bootstrap key seed from STEMEDB_ROOT_API_KEY | crates/stemedb-api/src/main.rs | Done |
| 4 | Add --features aphoria to Dockerfile | Dockerfile | Done |
| 5 | Create k8s manifests | k3s-fleet/.../stemedb/ | Done |
| 6 | Write scripts/provision-project-keys.sh | scripts/ | Done |
| 7 | Build + push Docker image | Woodpecker CI → Zot registry | Done (automated) |
| 8 | Store root API key in GCP Secret Manager | GCP Console | Done |
| 9 | Add DNS record: stemedb.threesix.ai | Cloudflare | Done |
| 10 | Deploy to k3s + smoke test | k3s-fleet | Done |
| 11 | Upgrade to 3-node StatefulSet | stemedb.yaml | Done (2026-03-07) |
| 12 | Woodpecker CI/CD pipeline | .woodpecker.yml | Done |

Gate test (run after deploy):

# Health check (routes through Gateway on :18181)
curl https://stemedb.threesix.ai/v1/health

# Direct API health on each pod (port-forward to :18180)
kubectl port-forward pod/stemedb-0 18180:18180 -n stemedb &
PF_PID=$!
sleep 2   # give the port-forward time to bind before probing
curl http://127.0.0.1:18180/v1/health
kill "$PF_PID"

# Unauthenticated write → 401
curl -s -o /dev/null -w "%{http_code}" -X POST \
  https://stemedb.threesix.ai/v1/assert -H "Content-Type: application/json" -d '{}'

# Cluster status
curl https://stemedb.threesix.ai/v1/cluster/status

# Confirm pods survive rolling restart
kubectl rollout restart statefulset/stemedb -n stemedb
kubectl rollout status statefulset/stemedb -n stemedb --timeout=300s
curl https://stemedb.threesix.ai/v1/health
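
The status-code checks in the gate test can be made self-asserting with a small helper. This is a sketch only: `expect_status` is a hypothetical name, and the helper relies solely on standard curl flags (`-s`, `-o`, `-w '%{http_code}'`).

```shell
#!/usr/bin/env bash
# Hypothetical helper: fail loudly when an endpoint returns an unexpected status.
# The first argument is the expected HTTP status; the rest are passed to curl.
expect_status() {
  local want="$1"; shift
  local got
  # curl prints 000 as the status when no HTTP response was received.
  got="$(curl -s -o /dev/null -w '%{http_code}' "$@" || true)"
  if [[ "$got" != "$want" ]]; then
    echo "FAIL: expected $want, got $got ($*)" >&2
    return 1
  fi
  echo "ok: $want ($*)"
}

# Example, matching the "unauthenticated write → 401" check above:
# expect_status 401 -X POST https://stemedb.threesix.ai/v1/assert \
#   -H "Content-Type: application/json" -d '{}'
```

Because the helper returns non-zero on a mismatch, it can gate a CI step or a post-deploy script without anyone reading the output by eye.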

Phase 2: Production Hardening (Week 2 — Gate: 10 Projects)

Backup CronJob

Create deployments/k8s/base/stemedb/backup-cronjob.yaml:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: stemedb-backup
  namespace: stemedb
spec:
  schedule: "0 */6 * * *"   # Every 6 hours
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: backup
              image: rclone/rclone:latest
              command:
                - /bin/sh
                - -c
                - |
                  # WAL: copy all completed segments (all except the last, which is locked)
                  ACTIVE=$(ls /data/wal/*.wal 2>/dev/null | sort | tail -n 1)
                  if [ -n "$ACTIVE" ]; then
                    # Ordered --filter rules: skip the active segment, keep other *.wal, drop the rest.
                    # (rclone advises against mixing --include and --exclude in one command.)
                    rclone copy /data/wal/ "gcs:$BACKUP_BUCKET/wal/" \
                      --filter "- /$(basename "$ACTIVE")" --filter "+ *.wal" --filter "- *"
                  fi
                  # DB snapshot
                  rclone copy /data/db/ "gcs:$BACKUP_BUCKET/db/$(date -u +%Y%m%dT%H%M%SZ)/"
                  echo "Backup complete"
              env:
                - name: BACKUP_BUCKET
                  value: stemedb-backups    # your GCS bucket name
              volumeMounts:
                - name: wal
                  mountPath: /data/wal
                  readOnly: true
                - name: db
                  mountPath: /data/db
                  readOnly: true
                - name: rclone-config
                  mountPath: /config/rclone
          volumes:
            - name: wal
              persistentVolumeClaim:
                claimName: stemedb-wal
            - name: db
              persistentVolumeClaim:
                claimName: stemedb-db
            - name: rclone-config
              secret:
                secretName: rclone-gcs-config

Test backup manually:

kubectl create job --from=cronjob/stemedb-backup backup-test -n stemedb
kubectl logs -l job-name=backup-test -n stemedb -f
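
The segment-selection rule used by the CronJob (every WAL segment except the newest, which is still being written) can be checked in isolation. The sketch below assumes segment file names sort lexically in creation order, which the source does not state; `completed_wal_segments` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Lists completed WAL segments: every *.wal file except the lexically last one.
# Assumption: segment names sort in creation order and contain no newlines.
completed_wal_segments() {
  local dir="$1"
  # find prints nothing (rather than erroring) when no *.wal files exist;
  # sed '$d' drops the final line, i.e. the active segment.
  find "$dir" -maxdepth 1 -type f -name '*.wal' | sort | sed '$d'
}
```

Running `completed_wal_segments /data/wal` inside a debug pod shows which files a backup run would consider safe to copy.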

Monitoring — Wire into Prometheus

service-monitor.yaml

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: stemedb-api
  namespace: stemedb
  labels:
    release: prometheus    # must match your Prometheus Operator label selector
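spec:
  selector:
    matchLabels:
      app: stemedb-api     # matched against labels on the Service object, not the pod
  endpoints:
    - targetPort: 18180    # `port` expects a Service port name; the Service port is unnamed, so target the pod port
      path: /metrics
      interval: 15s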

alert-rules.yaml — 6 alerts that fire first at 100-project scale

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: stemedb-alerts
  namespace: stemedb
  labels:
    release: prometheus
spec:
  groups:
    - name: stemedb.rules
      rules:
        - alert: StemeDBPodNotRunning
          expr: absent(up{job="stemedb-api"} == 1)   # also fires when targets exist but all report up == 0
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "StemeDB pod is not running"

        - alert: StemeDBWALLatencyHigh
          expr: histogram_quantile(0.99, rate(stemedb_wal_fsync_latency_seconds_bucket[5m])) > 0.05
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "WAL fsync p99 > 50ms — Longhorn I/O degradation likely"

        - alert: StemeDBDataVolumeNearlyFull
          expr: |
            kubelet_volume_stats_used_bytes{persistentvolumeclaim=~"stemedb-.*"}
            / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"stemedb-.*"}
            > 0.75            
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "StemeDB PVC usage > 75% — resize requires downtime"

        - alert: StemeDBRateLimitSaturating
          expr: sum(rate(stemedb_http_requests_total{status="429"}[5m])) > 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "429 rate > 1/s — projects hitting rate limits"

        - alert: StemeDBErrorRateHigh
          expr: |
            sum(rate(stemedb_http_requests_total{status=~"5.."}[5m]))
            / sum(rate(stemedb_http_requests_total[5m]))
            > 0.01
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "5xx error rate > 1%"

        - alert: StemeDBOOMKilled
          expr: |
            kube_pod_container_status_last_terminated_reason{
              container="stemedb-api",
              reason="OOMKilled"
            } > 0            
          labels:
            severity: critical
          annotations:
            summary: "StemeDB container OOM killed — increase memory limit or find leak"

NetworkPolicy + PDB

network-policy.yaml

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: stemedb-api
  namespace: stemedb
spec:
  podSelector:
    matchLabels:
      app: stemedb-api
  policyTypes: [Ingress, Egress]
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system   # Traefik
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: monitoring    # Prometheus
      ports:
        - port: 18180
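  egress:
    - ports:
        - port: 53     # DNS (protocol defaults to TCP, so UDP must be listed explicitly)
          protocol: UDP
        - port: 53     # DNS over TCP
          protocol: TCP
        - port: 443    # GCP APIs (backup, secrets)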

pdb.yaml

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: stemedb-api
  namespace: stemedb
spec:
  maxUnavailable: 0
  selector:
    matchLabels:
      app: stemedb-api

Phase 2 Checklist

| # | Task | File(s) | Est |
|---|------|---------|-----|
| 1 | Deploy backup CronJob | deployments/k8s/base/stemedb/backup-cronjob.yaml | 2h |
| 2 | Create GCS bucket + rclone Secret | GCP Console | 1h |
| 3 | Wire ServiceMonitor into Prometheus | service-monitor.yaml | 1h |
| 4 | Deploy 6 alert rules | alert-rules.yaml | 1h |
| 5 | Add NetworkPolicy + PDB | network-policy.yaml, pdb.yaml | 1h |
| 6 | Fix Longhorn PVC reclaim policy in DR runbook | docs/operations/runbooks/disaster-recovery.md | 30m |

Gate test: Kill pod → StemeDBPodNotRunning fires within 2 min. Run backup job manually → GCS has files.


Phase 3: Scale to 100 Projects (Week 3)

Per-project key provisioning script

Create scripts/provision-project-keys.sh:

#!/usr/bin/env bash
set -euo pipefail

# Usage: ./provision-project-keys.sh projects.txt
# projects.txt: one project name per line

STEMEDB_URL="${STEMEDB_URL:-https://stemedb.yourdomain.com}"
ADMIN_KEY="${STEMEDB_ADMIN_KEY:?Set STEMEDB_ADMIN_KEY}"
PROJECTS_FILE="${1:?Usage: $0 <projects-file>}"

while IFS= read -r project; do
  [[ -z "$project" ]] && continue

  echo "Provisioning key for: $project"

  response=$(curl -sf -X POST "$STEMEDB_URL/v1/admin/api-keys" \
    -H "X-API-Key: $ADMIN_KEY" \
    -H "Content-Type: application/json" \
    -d "{\"label\":\"project-$project\",\"role\":\"write_agent\"}")

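  # '// empty' yields an empty string instead of the literal "null" when .key is missing
  key=$(echo "$response" | jq -r '.key // empty')
  if [[ -z "$key" ]]; then
    echo "  ERROR: no key in response for $project" >&2
    exit 1
  fi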

  # Store in GCP Secret Manager
  echo -n "$key" | gcloud secrets create "stemedb-key-$project" \
    --data-file=- \
    --replication-policy=automatic 2>/dev/null \
  || echo -n "$key" | gcloud secrets versions add "stemedb-key-$project" --data-file=-

  echo "  Key stored: stemedb-key-$project"
done < "$PROJECTS_FILE"

echo "Done."

Onboarding runbook for each project:

# 1. Retrieve key from Secret Manager
gcloud secrets versions access latest --secret="stemedb-key-<project>"

# 2. Update the project's .aphoria/config.toml
cat >> .aphoria/config.toml <<EOF
[hosted]
url = "https://stemedb.yourdomain.com"
api_key_env = "STEMEDB_API_KEY"
EOF

# 3. Export key in CI/CD env
# STEMEDB_API_KEY=steme_live_<value>

Aphoria retry logic (P1)

Projects run aphoria scan --persist locally and call the remote StemeDB. During StemeDB pod restarts (full downtime under the original Recreate strategy; brief unavailability is still possible during a StatefulSet rolling restart), Aphoria should retry rather than fail the commit.

This is a change to the aphoria binary, not to StemeDB. Add 3-attempt exponential backoff (2s, 4s, 8s) on HTTP 502/503 responses in the Aphoria HTTP client.
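
The backoff schedule can be sketched in shell to make the behaviour concrete. This is not the Aphoria client: it only illustrates the retry loop. It assumes "3-attempt" means up to three retries after the first request, with delays of 2s, 4s and 8s. `BACKOFF_BASE`, `post_status` and `retry_on_unavailable` are hypothetical names.

```shell
#!/usr/bin/env bash
# Sketch of retry-with-exponential-backoff on HTTP 502/503.
# Assumptions: up to 3 retries after the first attempt, delays 2s, 4s, 8s;
# BACKOFF_BASE and both function names are illustrative.

# Prints only the HTTP status code of a POST; curl prints 000 if no response arrives.
post_status() {
  curl -s -o /dev/null -w '%{http_code}' -X POST "$1" \
    -H "X-API-Key: ${STEMEDB_API_KEY:-}" \
    -H 'Content-Type: application/json' -d "$2" || true
}

# Runs "$@" (a command that prints an HTTP status) until the status is neither
# 502 nor 503, or the retries are exhausted. Echoes the final status.
retry_on_unavailable() {
  local delay="${BACKOFF_BASE:-2}" max_retries=3 attempt=0 status
  while :; do
    status="$("$@")"
    if [[ "$status" != "502" && "$status" != "503" ]]; then
      echo "$status"
      return 0
    fi
    if (( attempt >= max_retries )); then
      echo "$status"
      return 1
    fi
    sleep "$delay"
    delay=$(( delay * 2 ))      # 2s, then 4s, then 8s
    attempt=$(( attempt + 1 ))
  done
}
```

Usage: `retry_on_unavailable post_status "$STEMEDB_URL/v1/assert" '{}'`. Only 502/503 are retried, so a 401 or 429 is reported immediately and retries do not hide an auth or rate-limit problem.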

Phase 3 Checklist

| # | Task | File(s) | Est |
|---|------|---------|-----|
| 1 | Run provision script for all 100 projects | scripts/provision-project-keys.sh | 2h |
| 2 | Write per-project onboarding runbook | docs/operations/onboarding-project.md | 1h |
| 3 | Add retry logic to aphoria HTTP client | applications/aphoria/ | 2h |
| 4 | Split WAL + DB into two PVCs (migration) | deployments/k8s/base/stemedb/ | 2h |

Gate test: 5 projects scan simultaneously with their own keys → each isolated → one rate-limited → others unaffected.


What NOT to Build Yet

| Item | Why not |
|------|---------|
| HPA | StemeDB is stateful (embedded KV). StatefulSet replicas are fixed at 3. |
| mTLS between pods | Internal cluster traffic is on private network. Add when exposing cross-cluster. |
| WAF | Body limits + Traefik rate limit + circuit breaker is sufficient for 100 known projects. |
| Per-tenant namespaces | Multiplies operational surface 100x. API key isolation is the right model. |
| Multi-region / clustering | 3-node cluster deployed. Next: full SWIM inter-node connectivity. |
| PITR with WAL timestamps | 6-hour backup RPO is acceptable for pilot. Improve later. |
| Secrets rotation automation | Manual rotation via /v1/admin/api-keys/:hash/rotate is fine for 100 projects. |
| Distributed tracing | You have one service. WAL fsync histogram covers what you need. |

Open Questions (Resolved)

  1. Image registry: Zot OCI registry at registry.threesix.ai on k3s. Woodpecker CI pushes automatically.
  2. Bootstrap key API: bootstrap::bootstrap_root_api_key() wired in main.rs.
  3. Aphoria scan model: Projects run aphoria scan --persist locally, calling remote StemeDB. Retry logic lives in Aphoria binary.
  4. GCS bucket: Needs to be created for backups (Phase 2).
  5. CORS: All router variants use allow_origin(Any). Restrict before public launch.

Risk Register

| Risk | Likelihood | Mitigation |
|------|------------|------------|
| Longhorn fsync latency at 100-project burst | Medium | Pin pod + volume to same node (Phase 3), dataLocality: best-effort; monitor WAL p99 from day 1 |
| Rolling restart brief downtime | Medium (StatefulSet rolls one pod at a time) | 3 replicas + readiness probe; Gateway routes to healthy pods |
| Fresh PVC after disaster = 100 project keys lost | Low but catastrophic | Bootstrap key seed in main.rs + provision-project-keys.sh idempotent re-run |
| Image registry blocker | Resolved | Zot registry on k3s, Woodpecker CI automates builds |
| CORS vulnerability | Medium | allow_origin(Any) in all router variants; fix before public launch |

Directory Structure (Current)

# k3s-fleet repo
k3s-fleet/deployments/k8s/base/stemedb/
└── stemedb.yaml          # All-in-one: ExternalSecret, headless Service,
                          #   gateway Service, 3-replica StatefulSet, Ingress

# stemedb repo
scripts/
├── entrypoint.sh             # Dual-binary launcher (cluster mode)
└── provision-project-keys.sh

.woodpecker.yml               # CI/CD: Kaniko → Zot registry → kubectl deploy
Dockerfile                    # Multi-stage: builds stemedb-api + stemedb-node

After Phase 2 hardening, add to k3s-fleet/.../stemedb/:

  • backup-cronjob.yaml
  • service-monitor.yaml
  • alert-rules.yaml
  • network-policy.yaml
  • pdb.yaml

Last updated: 2026-03-07 — Phase 1 complete, 3-node StatefulSet deployed with Woodpecker CI/CD