# k3s Deploy Roadmap: StemeDB + Aphoria → 100 Projects

**Target:** Production deployment on k3s-fleet with Longhorn, cert-manager, External Secrets, Prometheus/Grafana, Traefik.
**Timeline:** 3 weeks to ship-ready for 100 projects.

---

## Ship Blockers (P0) — Must Fix Before Any Project Onboards

### ~~1. Auth router not wired in production~~ ✅ RESOLVED (2026-03-02)

`create_router_full_protection_full_config` is now called when `STEMEDB_AUTH_ENABLED=true`.
Router dispatch checks `bootstrap::is_auth_enabled()` first — full protection stack activates
in production. Metering-only path still available when auth is disabled (local dev).

**Resolution:** `crates/stemedb-api/src/main.rs` updated.

---

### ~~2. `STEMEDB_UNSAFE_SKIP_SIGNATURES` startup guard missing~~ ✅ RESOLVED (2026-03-02)

Startup guard added: if `STEMEDB_UNSAFE_SKIP_SIGNATURES=true` and `STEMEDB_AUTH_ENABLED=true`,
the server logs a fatal error and exits with code 1. The misconfiguration is caught at boot
instead of passing silently.

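
The guard's rule is small enough to sketch in shell, for example as a pre-flight check in a deploy script. This is an illustration only: the real guard lives in the Rust startup path in `main.rs`, `check_unsafe_flags` is a hypothetical name, and treating only the literal string `true` as "enabled" is an assumption.

```bash
# Hypothetical pre-flight helper mirroring the documented rule:
# skipping signatures is never allowed while auth is enabled.
check_unsafe_flags() {
  skip="${STEMEDB_UNSAFE_SKIP_SIGNATURES:-false}"
  auth="${STEMEDB_AUTH_ENABLED:-false}"
  if [ "$skip" = "true" ] && [ "$auth" = "true" ]; then
    echo "FATAL: STEMEDB_UNSAFE_SKIP_SIGNATURES=true is not allowed with STEMEDB_AUTH_ENABLED=true" >&2
    return 1
  fi
  return 0
}

# Usage in a deploy script (exit code 1 matches the server's behavior):
# check_unsafe_flags || exit 1
```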

**Resolution:** `crates/stemedb-api/src/main.rs` updated.

---

### ~~3. Bootstrap key not seeded from env on fresh PVC~~ ✅ RESOLVED (2026-03-02)

`bootstrap::bootstrap_root_api_key()` is now called at startup (after IngestWorker spawn).
It reads `STEMEDB_ROOT_API_KEY` and is idempotent: a no-op if the key already exists in the
store. Any failure is a fatal error.

**Resolution:** `crates/stemedb-api/src/main.rs` updated.

---

### ~~4. No k8s manifests — StemeDB cannot be deployed to k3s~~ ✅ RESOLVED (2026-03-02)

Manifests deployed to `k3s-fleet/deployments/k8s/base/stemedb/` (single `stemedb.yaml` following
`tidaldb/` pattern). Includes ExternalSecret, PVC (50Gi Longhorn), Deployment (Recreate, non-root,
all probes), ClusterIP Service, Traefik Ingress at `stemedb.threesix.ai`.

**Remaining manual steps:** Build + push image, create GCP secret, add DNS record (see Pre-Deploy section below).

---

### ~~5. Image registry — k3s cannot pull without a registry~~ ✅ RESOLVED (2026-03-02)

Registry confirmed: `us-central1-docker.pkg.dev/orchard9/docker-images/` (GAR).
`imagePullSecrets: gcr-secret` wired in Deployment. Dockerfile updated with `--features aphoria`.

**Remaining manual step:** `docker build && docker push` to populate the image.

---

## Pre-Deploy Checklist (Manual Steps Before `kubectl apply`)

```bash
# 1. Build and push image (from stemedb repo root)
docker build -t us-central1-docker.pkg.dev/orchard9/docker-images/stemedb-api:latest .
docker push us-central1-docker.pkg.dev/orchard9/docker-images/stemedb-api:latest

# 2. Create root API key in GCP Secret Manager
ROOT_KEY="steme_live_$(openssl rand -hex 24)"
echo "Root key: $ROOT_KEY" # Save this — needed for provision-project-keys.sh
echo -n "$ROOT_KEY" | gcloud secrets create stemedb-root-api-key \
  --project=orchard9 --replication-policy=automatic --data-file=-

# 3. Add DNS: stemedb.threesix.ai → Traefik LB IP (Cloudflare)
```
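
Before storing the key, a quick format check can catch copy/paste damage. The sketch below only encodes what the generation command in step 2 implies (the `steme_live_` prefix followed by 48 lowercase hex characters from `openssl rand -hex 24`); `is_valid_root_key` is a hypothetical helper, and whether the server itself enforces this format is not stated here.

```bash
# Hypothetical helper: succeeds only for keys shaped like the output of
#   "steme_live_$(openssl rand -hex 24)"
is_valid_root_key() {
  printf '%s' "$1" | grep -Eq '^steme_live_[0-9a-f]{48}$'
}

# Usage: refuse to upload a malformed key.
# is_valid_root_key "$ROOT_KEY" || { echo "unexpected key format" >&2; exit 1; }
```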

---

## Original Manifest Spec (archived for reference)

The following was the original spec. Actual implementation is in `k3s-fleet/deployments/k8s/base/stemedb/stemedb.yaml`.

Create `deployments/k8s/base/stemedb/` with the following files:

**`namespace.yaml`**
```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: stemedb
```

**`pvc.yaml`** — Two PVCs to isolate WAL fsync from LSM compaction I/O
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: stemedb-wal
  namespace: stemedb
  annotations:
    volumeType: longhorn
spec:
  accessModes: [ReadWriteOnce]
  storageClassName: longhorn
  resources:
    requests:
      storage: 20Gi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: stemedb-db
  namespace: stemedb
  annotations:
    volumeType: longhorn
spec:
  accessModes: [ReadWriteOnce]
  storageClassName: longhorn
  resources:
    requests:
      storage: 50Gi
```

> Set `numberOfReplicas: 2` in Longhorn StorageClass (not default 3) to halve cross-node fsync amplification.

**`deployment.yaml`** — Critical spec decisions annotated
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: stemedb-api
  namespace: stemedb
spec:
  replicas: 1 # Non-negotiable. Embedded KV requires exclusive volume access.
  strategy:
    type: Recreate # NOT RollingUpdate. RWO PVC + 2 pods = deadlock.
  selector:
    matchLabels:
      app: stemedb-api
  template:
    metadata:
      labels:
        app: stemedb-api
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "18180"
        prometheus.io/path: "/metrics"
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        fsGroup: 1000
      terminationGracePeriodSeconds: 30 # Let in-flight WAL writes complete.
      containers:
        - name: stemedb-api
          image: <REGISTRY>/stemedb-api:latest
          securityContext:
            # Container-level field (not valid in the pod-level securityContext).
            readOnlyRootFilesystem: false # WAL writes to /data
          ports:
            - containerPort: 18180
          env:
            - name: STEMEDB_BIND_ADDR
              value: "0.0.0.0:18180"
            - name: STEMEDB_WAL_DIR
              value: /data/wal
            - name: STEMEDB_DB_DIR
              value: /data/db
            - name: STEMEDB_METER_ENABLED
              value: "true"
            - name: STEMEDB_ROOT_API_KEY
              valueFrom:
                secretKeyRef:
                  name: stemedb-secrets
                  key: root-api-key
          resources:
            requests:
              cpu: "500m"
              memory: "1Gi"
            limits:
              cpu: "2000m"
              memory: "4Gi"
          startupProbe: # WAL replay can take 60s after crash — do not skip this.
            httpGet:
              path: /v1/health
              port: 18180
            periodSeconds: 5
            failureThreshold: 12 # 60s total window before k8s kills pod
          livenessProbe:
            httpGet:
              path: /v1/health
              port: 18180
            periodSeconds: 15
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /v1/health
              port: 18180
            periodSeconds: 5
            failureThreshold: 3
          volumeMounts:
            - name: wal
              mountPath: /data/wal
            - name: db
              mountPath: /data/db
      volumes:
        - name: wal
          persistentVolumeClaim:
            claimName: stemedb-wal
        - name: db
          persistentVolumeClaim:
            claimName: stemedb-db
```
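
The 60s figure in the `startupProbe` comment is `periodSeconds × failureThreshold`. A small helper makes it easy to re-check that budget if WAL replay times grow. The function name is hypothetical, and the optional third argument stands for `initialDelaySeconds`, which the spec above leaves unset (0).

```bash
# Hypothetical helper: worst-case seconds before the kubelet gives up on startup.
# Usage: startup_window_seconds PERIOD_SECONDS FAILURE_THRESHOLD [INITIAL_DELAY_SECONDS]
startup_window_seconds() {
  echo $(( ${3:-0} + $1 * $2 ))
}

startup_window_seconds 5 12    # values from the spec above: prints 60
startup_window_seconds 5 24    # doubling failureThreshold: prints 120
```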

**`service.yaml`**
```yaml
apiVersion: v1
kind: Service
metadata:
  name: stemedb-api
  namespace: stemedb
spec:
  selector:
    app: stemedb-api
  ports:
    - port: 18180
      targetPort: 18180
  type: ClusterIP
```

**`ingress.yaml`** — Traefik terminates TLS; do NOT set `STEMEDB_TLS_CERT_PATH`
```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: stemedb-api
  namespace: stemedb
  annotations:
    traefik.ingress.kubernetes.io/router.entrypoints: websecure
    traefik.ingress.kubernetes.io/router.middlewares: stemedb-ratelimit@kubernetescrd
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  ingressClassName: traefik
  rules:
    - host: stemedb.yourdomain.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: stemedb-api
                port:
                  number: 18180
  tls:
    - hosts:
        - stemedb.yourdomain.com
      secretName: stemedb-tls
```

**`middleware.yaml`** — Traefik rate limit (global, before app-level limits)
```yaml
apiVersion: traefik.containo.us/v1alpha1
kind: Middleware
metadata:
  name: ratelimit
  namespace: stemedb
spec:
  rateLimit:
    average: 500
    burst: 1000
    period: 1s
```

**`external-secret.yaml`** — Pull from GCP Secret Manager via External Secrets Operator
```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: stemedb-secrets
  namespace: stemedb
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: gcp-secret-manager # adjust to your cluster's SecretStore name
    kind: ClusterSecretStore
  target:
    name: stemedb-secrets
  data:
    - secretKey: root-api-key
      remoteRef:
        key: stemedb-root-api-key
```

**`kustomization.yaml`**
```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - namespace.yaml
  - pvc.yaml
  - deployment.yaml
  - service.yaml
  - ingress.yaml
  - middleware.yaml
  - external-secret.yaml
```

**Deploy:**
```bash
kubectl apply -k deployments/k8s/base/stemedb/
kubectl rollout status deployment/stemedb-api -n stemedb
curl https://stemedb.yourdomain.com/v1/health
```

---

## Phase 1 Checklist (Week 1 — Gate: First Project Can Connect)

| # | Task | File(s) | Status |
|---|------|---------|--------|
| 1 | Wire auth router in `main.rs` | `crates/stemedb-api/src/main.rs` | ✅ Done |
| 2 | Add `STEMEDB_UNSAFE_SKIP_SIGNATURES` startup guard | `crates/stemedb-api/src/main.rs` | ✅ Done |
| 3 | Add bootstrap key seed from `STEMEDB_ROOT_API_KEY` | `crates/stemedb-api/src/main.rs` | ✅ Done |
| 4 | Add `--features aphoria` to Dockerfile | `Dockerfile` | ✅ Done |
| 5 | Create k8s manifests | `k3s-fleet/.../stemedb/` | ✅ Done |
| 6 | Write `scripts/provision-project-keys.sh` | `scripts/` | ✅ Done |
| 7 | Build + push Docker image | GAR | ⏳ Manual |
| 8 | Store root API key in GCP Secret Manager | GCP Console | ⏳ Manual |
| 9 | Add DNS record: `stemedb.threesix.ai` | Cloudflare | ⏳ Manual |
| 10 | Deploy to k3s + smoke test | k3s-fleet | ⏳ Pending |

**Gate test (run after deploy):**
```bash
# Health check
curl https://stemedb.threesix.ai/v1/health

# Unauthenticated write → 401
curl -s -o /dev/null -w "%{http_code}" -X POST \
  https://stemedb.threesix.ai/v1/assert -H "Content-Type: application/json" -d '{}'

# Authenticated write → 200/201
curl -X POST https://stemedb.threesix.ai/v1/assert \
  -H "X-API-Key: $ROOT_KEY" -H "Content-Type: application/json" \
  -d '{"subject":"test/ping","predicate":"alive","value":true,"agent_id":"test"}'

# Confirm key persists across restart
kubectl rollout restart deployment/stemedb-api -n stemedb
kubectl rollout status deployment/stemedb-api -n stemedb --timeout=120s
curl https://stemedb.threesix.ai/v1/health
```
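
To script the gate rather than reading status codes by eye, the expectations noted in the comments above can be captured in a small helper. `gate_status_ok` is a hypothetical name; the accepted codes are exactly the ones listed above (401 for the unauthenticated write, 200 or 201 for the authenticated one).

```bash
# Hypothetical helper. Usage: gate_status_ok KIND HTTP_CODE
#   KIND=unauth expects 401; KIND=auth expects 200 or 201.
gate_status_ok() {
  case "$1:$2" in
    unauth:401|auth:200|auth:201) return 0 ;;
    *) echo "gate check failed: $1 request returned $2" >&2; return 1 ;;
  esac
}

# Usage with the unauthenticated request from the gate test:
# code=$(curl -s -o /dev/null -w "%{http_code}" -X POST \
#   https://stemedb.threesix.ai/v1/assert -H "Content-Type: application/json" -d '{}')
# gate_status_ok unauth "$code" || exit 1
```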

---

## Phase 2: Production Hardening (Week 2 — Gate: 10 Projects)

### Backup CronJob

Create `deployments/k8s/base/stemedb/backup-cronjob.yaml`:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: stemedb-backup
  namespace: stemedb
spec:
  schedule: "0 */6 * * *" # Every 6 hours
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: backup
              image: rclone/rclone:latest
              command:
                - /bin/sh
                - -c
                - |
                  set -e # fail the Job if any copy fails
                  # WAL: copy all completed segments (all except the last, which is locked)
                  SEGMENTS=$(ls /data/wal/*.wal 2>/dev/null | sort | head -n -1)
                  if [ -n "$SEGMENTS" ]; then
                    ACTIVE=$(ls /data/wal/*.wal | sort | tail -n 1 | xargs basename)
                    # Ordered --filter rules; rclone advises against mixing --include and --exclude.
                    rclone copy /data/wal/ "gcs:$BACKUP_BUCKET/wal/" \
                      --filter "- $ACTIVE" --filter "+ *.wal" --filter "- *"
                  fi
                  # DB snapshot
                  rclone copy /data/db/ "gcs:$BACKUP_BUCKET/db/$(date -u +%Y%m%dT%H%M%SZ)/"
                  echo "Backup complete"
              env:
                - name: BACKUP_BUCKET
                  value: stemedb-backups # your GCS bucket name
              volumeMounts:
                - name: wal
                  mountPath: /data/wal
                  readOnly: true
                - name: db
                  mountPath: /data/db
                  readOnly: true
                - name: rclone-config
                  mountPath: /config/rclone
          # RWO volumes: the Job pod must be scheduled on the same node as the stemedb-api pod.
          volumes:
            - name: wal
              persistentVolumeClaim:
                claimName: stemedb-wal
            - name: db
              persistentVolumeClaim:
                claimName: stemedb-db
            - name: rclone-config
              secret:
                secretName: rclone-gcs-config
```

**Test backup manually:**
```bash
kubectl create job --from=cronjob/stemedb-backup backup-test -n stemedb
kubectl logs -l job-name=backup-test -n stemedb -f
```
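
The segment-selection step of the backup job can be tried locally without touching the cluster. The sketch below assumes, as the job does, that segment file names sort in creation order so the last one is the active segment; `completed_wal_segments` is a hypothetical helper, and it uses `sed '$d'` instead of `head -n -1` only for portability across GNU, BusyBox, and BSD tools.

```bash
# Hypothetical helper. Usage: completed_wal_segments DIR
# Prints every *.wal file in DIR except the lexicographically last one.
# Assumes file names contain no whitespace or newlines.
completed_wal_segments() {
  # sed '$d' drops the final line, i.e. the newest (active) segment.
  find "$1" -maxdepth 1 -type f -name '*.wal' | sort | sed '$d'
}

# Usage: list what the backup job would upload from a local copy of the WAL dir.
# completed_wal_segments ./wal
```

An empty directory, or one holding only the active segment, yields no output, which matches the `if [ -n "$SEGMENTS" ]` guard in the job.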
### Monitoring — Wire into Prometheus

**`service-monitor.yaml`**
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: stemedb-api
  namespace: stemedb
  labels:
    release: prometheus # must match your Prometheus Operator label selector
spec:
  selector:
    matchLabels:
      app: stemedb-api # matched against labels on the Service object, not the pods
  endpoints:
    # `port` takes a Service port *name*; the Service in this spec leaves its port
    # unnamed, so select by number via targetPort instead.
    - targetPort: 18180
      path: /metrics
      interval: 15s
```

**`alert-rules.yaml`** — 6 alerts that fire first at 100-project scale
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: stemedb-alerts
  namespace: stemedb
  labels:
    release: prometheus
spec:
  groups:
    - name: stemedb.rules
      rules:
        - alert: StemeDBPodNotRunning
          # Fires when the target is missing *or* present but down (up == 0).
          expr: absent(up{job="stemedb-api"} == 1)
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "StemeDB pod is not running"

        - alert: StemeDBWALLatencyHigh
          expr: histogram_quantile(0.99, rate(stemedb_wal_fsync_latency_seconds_bucket[5m])) > 0.05
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "WAL fsync p99 > 50ms — Longhorn I/O degradation likely"

        - alert: StemeDBDataVolumeNearlyFull
          expr: |
            kubelet_volume_stats_used_bytes{persistentvolumeclaim=~"stemedb-.*"}
              / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"stemedb-.*"}
              > 0.75
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "StemeDB PVC usage > 75% — resize requires downtime"

        - alert: StemeDBRateLimitSaturating
          # sum() aggregates across label sets so the threshold applies to the total 429 rate.
          expr: sum(rate(stemedb_http_requests_total{status="429"}[5m])) > 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "429 rate > 1/s — projects hitting rate limits"

        - alert: StemeDBErrorRateHigh
          # sum() on both sides; without it the division matches each 5xx series
          # against itself and always yields 1.
          expr: |
            sum(rate(stemedb_http_requests_total{status=~"5.."}[5m]))
              / sum(rate(stemedb_http_requests_total[5m]))
              > 0.01
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "5xx error rate > 1%"

        - alert: StemeDBOOMKilled
          expr: |
            kube_pod_container_status_last_terminated_reason{
              container="stemedb-api",
              reason="OOMKilled"
            } > 0
          labels:
            severity: critical
          annotations:
            summary: "StemeDB container OOM killed — increase memory limit or find leak"
```

### NetworkPolicy + PDB

**`network-policy.yaml`**
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: stemedb-api
  namespace: stemedb
spec:
  podSelector:
    matchLabels:
      app: stemedb-api
  policyTypes: [Ingress, Egress]
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system # Traefik
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: monitoring # Prometheus
      ports:
        - port: 18180
  egress:
    - ports:
        - port: 53 # DNS (protocol defaults to TCP, so UDP must be listed explicitly)
          protocol: UDP
        - port: 53 # DNS over TCP
          protocol: TCP
        - port: 443 # GCP APIs (backup, secrets)
```

**`pdb.yaml`**
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: stemedb-api
  namespace: stemedb
spec:
  maxUnavailable: 0 # with replicas: 1 this blocks voluntary evictions, including node drains
  selector:
    matchLabels:
      app: stemedb-api
```

### Phase 2 Checklist

| # | Task | File(s) | Est |
|---|------|---------|-----|
| 1 | Deploy backup CronJob | `deployments/k8s/base/stemedb/backup-cronjob.yaml` | 2h |
| 2 | Create GCS bucket + rclone Secret | GCP Console | 1h |
| 3 | Wire ServiceMonitor into Prometheus | `service-monitor.yaml` | 1h |
| 4 | Deploy 6 alert rules | `alert-rules.yaml` | 1h |
| 5 | Add NetworkPolicy + PDB | `network-policy.yaml`, `pdb.yaml` | 1h |
| 6 | Fix Longhorn PVC reclaim policy in DR runbook | `docs/operations/runbooks/disaster-recovery.md` | 30m |

**Gate test:** Kill pod → `StemeDBPodNotRunning` fires within 2 min. Run backup job manually → GCS has files.

---

## Phase 3: Scale to 100 Projects (Week 3)

### Per-project key provisioning script

Create `scripts/provision-project-keys.sh`:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Usage: ./provision-project-keys.sh projects.txt
# projects.txt: one project name per line

STEMEDB_URL="${STEMEDB_URL:-https://stemedb.yourdomain.com}"
ADMIN_KEY="${STEMEDB_ADMIN_KEY:?Set STEMEDB_ADMIN_KEY}"
PROJECTS_FILE="${1:?Usage: $0 <projects-file>}"

while IFS= read -r project; do
  [[ -z "$project" ]] && continue

  echo "Provisioning key for: $project"

  # Build the JSON body with jq so project names are escaped correctly.
  payload=$(jq -n --arg label "project-$project" '{label: $label, role: "write_agent"}')

  response=$(curl -sf -X POST "$STEMEDB_URL/v1/admin/api-keys" \
    -H "X-API-Key: $ADMIN_KEY" \
    -H "Content-Type: application/json" \
    -d "$payload")

  # Stop rather than store the literal string "null" if the response has no key.
  key=$(echo "$response" | jq -r '.key // empty')
  if [[ -z "$key" ]]; then
    echo "  No key in response for: $project" >&2
    exit 1
  fi

  # Store in GCP Secret Manager (create the secret, or add a version if it already exists)
  echo -n "$key" | gcloud secrets create "stemedb-key-$project" \
    --data-file=- \
    --replication-policy=automatic 2>/dev/null \
    || echo -n "$key" | gcloud secrets versions add "stemedb-key-$project" --data-file=-

  echo "  Key stored: stemedb-key-$project"
done < "$PROJECTS_FILE"

echo "Done."
```
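
Because the script derives a Secret Manager ID from each project name, a name with spaces or other punctuation fails only at the `gcloud` step, after the API key has already been minted. A pre-check along these lines could run at the top of the loop. `secret_id_for_project` is a hypothetical helper, and the character set assumes Secret Manager's rule for secret IDs (letters, digits, hyphens, and underscores, up to 255 characters).

```bash
# Hypothetical helper. Usage: secret_id_for_project PROJECT
# Prints the secret ID the script would use, or fails if the ID would be invalid.
secret_id_for_project() {
  id="stemedb-key-$1"
  case "$1" in
    ''|*[!A-Za-z0-9_-]*)
      echo "invalid project name for a secret ID: '$1'" >&2
      return 1 ;;
  esac
  if [ "${#id}" -gt 255 ]; then
    echo "secret ID too long for project: '$1'" >&2
    return 1
  fi
  printf '%s\n' "$id"
}

# Usage inside the provisioning loop:
# secret_id=$(secret_id_for_project "$project") || continue
```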

**Onboarding runbook for each project:**
```bash
# 1. Retrieve key from Secret Manager
gcloud secrets versions access latest --secret="stemedb-key-<project>"

# 2. Update the project's Aphoria config (.aphoria/config.toml)
cat >> .aphoria/config.toml <<EOF
[hosted]
url = "https://stemedb.yourdomain.com"
api_key_env = "STEMEDB_API_KEY"
EOF

# 3. Export key in CI/CD env
# STEMEDB_API_KEY=steme_live_<value>
```

### Aphoria retry logic (P1)

Projects run `aphoria scan --persist` locally and call the remote StemeDB. During StemeDB pod
restarts (Recreate strategy = brief downtime), Aphoria should retry rather than fail the commit.

> This is a change to the `aphoria` binary, not to StemeDB. Add 3-attempt exponential backoff
> (2s, 4s, 8s) on HTTP 502/503 responses in the Aphoria HTTP client.

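
The backoff schedule can be sketched in shell to pin down the intended behavior before it is implemented in the Aphoria client. This is not the Aphoria code: `retry_http` and the `SLEEP_CMD` hook are hypothetical, and reading "3-attempt" as three retries after the initial request (waiting 2s, 4s, then 8s) is an assumption.

```bash
# Hypothetical helper. Usage: retry_http CMD [ARGS...]
# CMD must print an HTTP status code on stdout. 502/503 are retried after
# 2s, 4s, and 8s; any other status is returned immediately.
retry_http() {
  for delay in 0 2 4 8; do
    # No wait before the first attempt; SLEEP_CMD lets tests replace sleep.
    if [ "$delay" -gt 0 ]; then "${SLEEP_CMD:-sleep}" "$delay"; fi
    status="$("$@")" || status="000"   # 000: the command itself failed
    case "$status" in
      502|503) ;;                      # transient: move on to the next attempt
      *) echo "$status"; return 0 ;;
    esac
  done
  echo "$status"                       # still 502/503 after the last retry
  return 1
}

# Usage (endpoint taken from the gate test; payload file is a placeholder):
# retry_http curl -s -o /dev/null -w '%{http_code}' -X POST \
#   "$STEMEDB_URL/v1/assert" -H "X-API-Key: $STEMEDB_API_KEY" \
#   -H "Content-Type: application/json" -d @claim.json
```

With this schedule a client waits at most 14s in total, which is worth comparing against the 60s startup-probe window when deciding whether three retries are enough.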

### Phase 3 Checklist

| # | Task | File(s) | Est |
|---|------|---------|-----|
| 1 | Run provision script for all 100 projects | `scripts/provision-project-keys.sh` | 2h |
| 2 | Write per-project onboarding runbook | `docs/operations/onboarding-project.md` | 1h |
| 3 | Add retry logic to `aphoria` HTTP client | `applications/aphoria/` | 2h |
| 4 | Split WAL + DB into two PVCs (migration) | `deployments/k8s/base/stemedb/` | 2h |

**Gate test:** 5 projects scan simultaneously with their own keys → each isolated → one rate-limited → others unaffected.

---

## What NOT to Build Yet

| Item | Why not |
|------|---------|
| HPA | StemeDB is stateful (embedded KV). Cannot scale horizontally. |
| mTLS between pods | Single service. Add when you have a second service. |
| WAF | Body limits + Traefik rate limit + circuit breaker is sufficient for 100 known projects. |
| Per-tenant namespaces | Multiplies operational surface 100x. API key isolation is the right model. |
| Multi-region / clustering | 3-node k3s + Longhorn 2-replica is your HA story. P6 in roadmap. |
| PITR with WAL timestamps | 6-hour backup RPO is acceptable for pilot. Improve later. |
| Secrets rotation automation | Manual rotation via `/v1/admin/api-keys/:hash/rotate` is fine for 100 projects. |
| Distributed tracing | You have one service. WAL fsync histogram covers what you need. |

---

## Open Questions (Resolve Week 1)

1. **Image registry**: Which registry does k3s-fleet already use? Check `get_service_config()` in `deploy-stack.sh`.
2. **Bootstrap key API**: Verify exact method signatures on `ApiKeyStore` before writing the seed logic in `main.rs`.
3. **Aphoria scan model**: Do projects run `aphoria scan` locally (calling remote StemeDB) or as a k8s Job? Determines where retry logic lives.
4. **GCS bucket**: Does one exist for backups, or does it need to be created?
5. **CORS**: All router variants in `routers.rs` use `allow_origin(Any)`. Production needs this restricted to Traefik's internal domain. Add `STEMEDB_ALLOWED_ORIGINS` env var support.

---

## Risk Register

| Risk | Likelihood | Mitigation |
|------|-----------|-----------|
| Longhorn fsync latency at 100-project burst | Medium | Pin pod + volume to same node (Phase 3), `dataLocality: bestEffort`; monitor WAL p99 from day 1 |
| Single-instance downtime during deploys | High (Recreate strategy) | Startup probe + maintenance window policy + Aphoria retry logic |
| Fresh PVC after disaster = 100 project keys lost | Low but catastrophic | Bootstrap key seed in `main.rs` + `provision-project-keys.sh` idempotent re-run |
| Image registry blocker | High if unresolved | Resolve Day 1; entire deployment depends on it |
| CORS vulnerability | Medium | `allow_origin(Any)` in all router variants; fix before public launch |

---

## Directory Structure After Phase 1

```
deployments/
└── k8s/
    └── base/
        └── stemedb/
            ├── kustomization.yaml
            ├── namespace.yaml
            ├── pvc.yaml
            ├── deployment.yaml
            ├── service.yaml
            ├── ingress.yaml
            ├── middleware.yaml
            └── external-secret.yaml

scripts/
└── provision-project-keys.sh   (new)
```

After Phase 2, add to `deployments/k8s/base/stemedb/`:
- `backup-cronjob.yaml`
- `service-monitor.yaml`
- `alert-rules.yaml`
- `network-policy.yaml`
- `pdb.yaml`

---

*Last updated: 2026-03-02 — Week 1 code changes complete; 3 manual steps remain before deploy*