# k3s Deploy Roadmap: StemeDB + Aphoria → 100 Projects

**Target:** Production deployment on k3s-fleet with Longhorn, cert-manager, External Secrets, Prometheus/Grafana, Traefik.
**Timeline:** 3 weeks to ship-ready for 100 projects.

---

## Ship Blockers (P0): Must Fix Before Any Project Onboards

### ~~1. Auth router not wired in production~~ ✅ RESOLVED (2026-03-02)

`create_router_full_protection_full_config` is now called when `STEMEDB_AUTH_ENABLED=true`.
Router dispatch checks `bootstrap::is_auth_enabled()` first, so the full protection stack activates
in production. Metering-only path still available when auth is disabled (local dev).

**Resolution:** `crates/stemedb-api/src/main.rs` updated.

---

### ~~2. `STEMEDB_UNSAFE_SKIP_SIGNATURES` startup guard missing~~ ✅ RESOLVED (2026-03-02)

Startup guard added: if `STEMEDB_UNSAFE_SKIP_SIGNATURES=true` and `STEMEDB_AUTH_ENABLED=true`,
the server logs a fatal error and exits with code 1. Misconfiguration is caught at boot instead of being silently accepted.
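
The guard itself lives in `main.rs`, but the same rule can be checked before a rollout. A minimal shell sketch, assuming both variables are enabled by the literal string `true`; the function name `check_signature_guard` is hypothetical and not part of StemeDB:

```bash
#!/usr/bin/env bash
# Hypothetical preflight that mirrors the startup guard described above.
# Assumption: both variables are switched on with the literal string "true".
check_signature_guard() {
  local skip="${STEMEDB_UNSAFE_SKIP_SIGNATURES:-false}"
  local auth="${STEMEDB_AUTH_ENABLED:-false}"
  if [[ "$skip" == "true" && "$auth" == "true" ]]; then
    echo "FATAL: STEMEDB_UNSAFE_SKIP_SIGNATURES=true is not allowed with STEMEDB_AUTH_ENABLED=true" >&2
    return 1
  fi
  return 0
}

# Example: run before `kubectl apply` and stop on a bad combination.
# check_signature_guard || exit 1
```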

**Resolution:** `crates/stemedb-api/src/main.rs` updated.

---

### ~~3. Bootstrap key not seeded from env on fresh PVC~~ ✅ RESOLVED (2026-03-02)

`bootstrap::bootstrap_root_api_key()` is now called at startup (after IngestWorker spawn).
It reads `STEMEDB_ROOT_API_KEY` and is idempotent: a no-op if the key already exists in the store.
A failure is fatal.

**Resolution:** `crates/stemedb-api/src/main.rs` updated.

---

### ~~4. No k8s manifests: StemeDB cannot be deployed to k3s~~ ✅ RESOLVED (2026-03-02)

Manifests deployed to `k3s-fleet/deployments/k8s/base/stemedb/` (single `stemedb.yaml` following the
`tidaldb/` pattern). Includes ExternalSecret, PVC (50Gi Longhorn), Deployment (Recreate, non-root,
all probes), ClusterIP Service, Traefik Ingress at `stemedb.threesix.ai`.

**Remaining manual steps:** Build + push image, create GCP secret, add DNS record (see Pre-Deploy section below).

---

### ~~5. Image registry: k3s cannot pull without a registry~~ ✅ RESOLVED (2026-03-07)

Registry: `registry.threesix.ai` (Zot OCI registry on k3s). Woodpecker CI pipeline (`.woodpecker.yml`)
builds via Kaniko and pushes automatically on every merge to main. No manual docker build needed.

**Image:** `registry.threesix.ai/stemedb-api:latest` (also tagged with short commit SHA)

---

## Pre-Deploy Checklist (Manual Steps Before `kubectl apply`)

> **Note:** Image builds are now automated via Woodpecker CI. Push to `main` → Kaniko builds →
> pushes to `registry.threesix.ai` → `kubectl set image` on StatefulSet. Manual steps below are
> only needed for first-time setup.

```bash
# 1. Image builds are automatic (Woodpecker CI). For manual builds:
docker build --platform linux/amd64 -t registry.threesix.ai/stemedb-api:latest .
docker push registry.threesix.ai/stemedb-api:latest

# 2. Create root API key in GCP Secret Manager (first deploy only)
ROOT_KEY="steme_live_$(openssl rand -hex 24)"
echo "Root key: $ROOT_KEY" # Save this; needed for provision-project-keys.sh
echo -n "$ROOT_KEY" | gcloud secrets create stemedb-root-api-key \
  --project=orchard9 --replication-policy=automatic --data-file=-

# 3. Add DNS: stemedb.threesix.ai → Traefik LB IP (Cloudflare). Already done.
```
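
Step 2 produces a key of the form `steme_live_` followed by 48 hex characters (24 random bytes). As a minimal sketch, a check like the following could catch a truncated or mis-pasted key before it is stored. The name `is_valid_root_key` is hypothetical, and the pattern only reflects what the `openssl` command emits; whether StemeDB itself enforces this shape is not stated here.

```bash
#!/usr/bin/env bash
# Hypothetical helper: checks that a key matches the shape produced by
#   "steme_live_$(openssl rand -hex 24)"  (prefix + 48 lowercase hex chars).
# It mirrors the generation command only; it is not a StemeDB API.
is_valid_root_key() {
  [[ "$1" =~ ^steme_live_[0-9a-f]{48}$ ]]
}

# Example: verify the shape before uploading to Secret Manager.
# ROOT_KEY="steme_live_$(openssl rand -hex 24)"
# is_valid_root_key "$ROOT_KEY" || { echo "malformed key" >&2; exit 1; }
```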

---

## Original Manifest Spec (archived for reference)

The following was the original spec. The actual implementation is in `k3s-fleet/deployments/k8s/base/stemedb/stemedb.yaml`.

Create `deployments/k8s/base/stemedb/` with the following files:

**`namespace.yaml`**
```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: stemedb
```

**`pvc.yaml`**: Two PVCs to isolate WAL fsync from LSM compaction I/O
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: stemedb-wal
  namespace: stemedb
  annotations:
    volumeType: longhorn
spec:
  accessModes: [ReadWriteOnce]
  storageClassName: longhorn
  resources:
    requests:
      storage: 20Gi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: stemedb-db
  namespace: stemedb
  annotations:
    volumeType: longhorn
spec:
  accessModes: [ReadWriteOnce]
  storageClassName: longhorn
  resources:
    requests:
      storage: 50Gi
```

> Set `numberOfReplicas: 2` in the Longhorn StorageClass (not the default 3) to halve cross-node fsync amplification.

**`deployment.yaml`**: Critical spec decisions annotated
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: stemedb-api
  namespace: stemedb
spec:
  replicas: 1 # Non-negotiable. Embedded KV requires exclusive volume access.
  strategy:
    type: Recreate # NOT RollingUpdate. RWO PVC + 2 pods = deadlock.
  selector:
    matchLabels:
      app: stemedb-api
  template:
    metadata:
      labels:
        app: stemedb-api
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "18180"
        prometheus.io/path: "/metrics"
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        fsGroup: 1000
      terminationGracePeriodSeconds: 30 # Let in-flight WAL writes complete.
      containers:
        - name: stemedb-api
          image: <REGISTRY>/stemedb-api:latest
          securityContext:
            # Container-level field; it is not valid under the pod securityContext.
            readOnlyRootFilesystem: false # WAL writes to /data
          ports:
            - containerPort: 18180
          env:
            - name: STEMEDB_BIND_ADDR
              value: "0.0.0.0:18180"
            - name: STEMEDB_WAL_DIR
              value: /data/wal
            - name: STEMEDB_DB_DIR
              value: /data/db
            - name: STEMEDB_METER_ENABLED
              value: "true"
            - name: STEMEDB_ROOT_API_KEY
              valueFrom:
                secretKeyRef:
                  name: stemedb-secrets
                  key: root-api-key
          resources:
            requests:
              cpu: "500m"
              memory: "1Gi"
            limits:
              cpu: "2000m"
              memory: "4Gi"
          startupProbe: # WAL replay can take 60s after crash; do not skip this.
            httpGet:
              path: /v1/health
              port: 18180
            periodSeconds: 5
            failureThreshold: 12 # 60s total window before k8s kills pod
          livenessProbe:
            httpGet:
              path: /v1/health
              port: 18180
            periodSeconds: 15
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /v1/health
              port: 18180
            periodSeconds: 5
            failureThreshold: 3
          volumeMounts:
            - name: wal
              mountPath: /data/wal
            - name: db
              mountPath: /data/db
      volumes:
        - name: wal
          persistentVolumeClaim:
            claimName: stemedb-wal
        - name: db
          persistentVolumeClaim:
            claimName: stemedb-db
```
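
The `startupProbe` comment quotes a 60s window; that figure is `periodSeconds × failureThreshold`. A small sketch of the arithmetic, useful when tuning the probe for longer WAL replays. The helper name `probe_window_seconds` is made up for illustration, and the estimate ignores `initialDelaySeconds` and `timeoutSeconds`, which the spec leaves at their defaults. If replay is expected to exceed the startup window, raising `failureThreshold` widens it without slowing detection once the pod is up.

```bash
#!/usr/bin/env bash
# Hypothetical helper: approximate seconds before Kubernetes gives up on a
# probe, computed as periodSeconds * failureThreshold.
probe_window_seconds() {
  local period="$1" threshold="$2"
  echo $(( period * threshold ))
}

probe_window_seconds 5 12   # startupProbe:  prints 60
probe_window_seconds 15 3   # livenessProbe: prints 45
```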

**`service.yaml`**
```yaml
apiVersion: v1
kind: Service
metadata:
  name: stemedb-api
  namespace: stemedb
spec:
  selector:
    app: stemedb-api
  ports:
    - port: 18180
      targetPort: 18180
  type: ClusterIP
```

**`ingress.yaml`**: Traefik terminates TLS; do NOT set `STEMEDB_TLS_CERT_PATH`
```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: stemedb-api
  namespace: stemedb
  annotations:
    traefik.ingress.kubernetes.io/router.entrypoints: websecure
    traefik.ingress.kubernetes.io/router.middlewares: stemedb-ratelimit@kubernetescrd
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  ingressClassName: traefik
  rules:
    - host: stemedb.yourdomain.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: stemedb-api
                port:
                  number: 18180
  tls:
    - hosts:
        - stemedb.yourdomain.com
      secretName: stemedb-tls
```

**`middleware.yaml`**: Traefik rate limit (global, before app-level limits)
```yaml
apiVersion: traefik.containo.us/v1alpha1
kind: Middleware
metadata:
  name: ratelimit
  namespace: stemedb
spec:
  rateLimit:
    average: 500
    burst: 1000
    period: 1s
```

**`external-secret.yaml`**: Pull from GCP Secret Manager via External Secrets Operator
```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: stemedb-secrets
  namespace: stemedb
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: gcp-secret-manager # adjust to your cluster's SecretStore name
    kind: ClusterSecretStore
  target:
    name: stemedb-secrets
  data:
    - secretKey: root-api-key
      remoteRef:
        key: stemedb-root-api-key
```

**`kustomization.yaml`**
```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - namespace.yaml
  - pvc.yaml
  - deployment.yaml
  - service.yaml
  - ingress.yaml
  - middleware.yaml
  - external-secret.yaml
```

**Deploy:**
```bash
kubectl apply -k deployments/k8s/base/stemedb/
kubectl rollout status deployment/stemedb-api -n stemedb
curl https://stemedb.yourdomain.com/v1/health
```

---

## Phase 1 Checklist (Week 1, Gate: First Project Can Connect) ✅ COMPLETE

| # | Task | File(s) | Status |
|---|------|---------|--------|
| 1 | Wire auth router in `main.rs` | `crates/stemedb-api/src/main.rs` | ✅ Done |
| 2 | Add `STEMEDB_UNSAFE_SKIP_SIGNATURES` startup guard | `crates/stemedb-api/src/main.rs` | ✅ Done |
| 3 | Add bootstrap key seed from `STEMEDB_ROOT_API_KEY` | `crates/stemedb-api/src/main.rs` | ✅ Done |
| 4 | Add `--features aphoria` to Dockerfile | `Dockerfile` | ✅ Done |
| 5 | Create k8s manifests | `k3s-fleet/.../stemedb/` | ✅ Done |
| 6 | Write `scripts/provision-project-keys.sh` | `scripts/` | ✅ Done |
| 7 | Build + push Docker image | Woodpecker CI → Zot registry | ✅ Done (automated) |
| 8 | Store root API key in GCP Secret Manager | GCP Console | ✅ Done |
| 9 | Add DNS record: `stemedb.threesix.ai` | Cloudflare | ✅ Done |
| 10 | Deploy to k3s + smoke test | k3s-fleet | ✅ Done |
| 11 | Upgrade to 3-node StatefulSet | `stemedb.yaml` | ✅ Done (2026-03-07) |
| 12 | Woodpecker CI/CD pipeline | `.woodpecker.yml` | ✅ Done |

**Gate test (run after deploy):**
```bash
# Health check (routes through Gateway on :18181)
curl https://stemedb.threesix.ai/v1/health

# Direct API health on each pod (port-forward to :18180)
kubectl port-forward pod/stemedb-0 18180:18180 -n stemedb &
sleep 2 # give the port-forward time to bind before curling
curl http://127.0.0.1:18180/v1/health

# Unauthenticated write → 401
curl -s -o /dev/null -w "%{http_code}" -X POST \
  https://stemedb.threesix.ai/v1/assert -H "Content-Type: application/json" -d '{}'

# Cluster status
curl https://stemedb.threesix.ai/v1/cluster/status

# Confirm pods survive rolling restart
kubectl rollout restart statefulset/stemedb -n stemedb
kubectl rollout status statefulset/stemedb -n stemedb --timeout=300s
curl https://stemedb.threesix.ai/v1/health
```
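
The gate checks print status codes for a human to read. As a sketch, they could be wrapped so a script fails on a mismatch. `expect_status`, `http_status`, and the `CURL` override are illustrative names; the 200 expected from `/v1/health` is an assumption, while the 401 for an unauthenticated write comes from the gate test itself.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper around the gate-test curl calls.
# CURL can be overridden (for example with a stub) for dry runs.
http_status() {
  "${CURL:-curl}" -s -o /dev/null -w '%{http_code}' "$@"
}

# expect_status <label> <expected-code> <curl args...>
expect_status() {
  local label="$1" expected="$2" actual
  shift 2
  actual="$(http_status "$@")" || true   # curl errors surface as a code mismatch
  if [[ "$actual" == "$expected" ]]; then
    echo "PASS $label ($actual)"
  else
    echo "FAIL $label: expected $expected, got ${actual:-none}" >&2
    return 1
  fi
}

# Example (not executed here):
# expect_status health 200 https://stemedb.threesix.ai/v1/health
# expect_status unauth-write 401 -X POST https://stemedb.threesix.ai/v1/assert \
#   -H "Content-Type: application/json" -d '{}'
```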

---

## Phase 2: Production Hardening (Week 2, Gate: 10 Projects)

### Backup CronJob

Create `deployments/k8s/base/stemedb/backup-cronjob.yaml`:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: stemedb-backup
  namespace: stemedb
spec:
  schedule: "0 */6 * * *" # Every 6 hours
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: backup
              image: rclone/rclone:latest
              command:
                - /bin/sh
                - -c
                - |
                  set -eu # fail the job instead of printing "Backup complete" after an error
                  # WAL: copy all completed segments (all except the last, which is locked)
                  LAST=$(ls /data/wal/*.wal 2>/dev/null | sort | tail -n 1)
                  if [ -n "$LAST" ]; then
                    # Ordered --filter rules: skip the active segment, keep other *.wal, skip the rest.
                    # (rclone advises against mixing --include and --exclude.)
                    rclone copy /data/wal/ "gcs:$BACKUP_BUCKET/wal/" \
                      --filter "- $(basename "$LAST")" --filter "+ *.wal" --filter "- *"
                  fi
                  # DB snapshot
                  rclone copy /data/db/ "gcs:$BACKUP_BUCKET/db/$(date -u +%Y%m%dT%H%M%SZ)/"
                  echo "Backup complete"
              env:
                - name: BACKUP_BUCKET
                  value: stemedb-backups # your GCS bucket name
              volumeMounts:
                - name: wal
                  mountPath: /data/wal
                  readOnly: true
                - name: db
                  mountPath: /data/db
                  readOnly: true
                - name: rclone-config
                  mountPath: /config/rclone
          volumes:
            - name: wal
              persistentVolumeClaim:
                claimName: stemedb-wal
            - name: db
              persistentVolumeClaim:
                claimName: stemedb-db
            - name: rclone-config
              secret:
                secretName: rclone-gcs-config
```

**Test backup manually:**
```bash
kubectl create job --from=cronjob/stemedb-backup backup-test -n stemedb
kubectl logs -l job-name=backup-test -n stemedb -f
```
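
The WAL step in the CronJob relies on one rule: every segment except the newest is complete and safe to copy. A minimal sketch of that selection, assuming segment file names sort lexicographically in creation order (the naming scheme is not stated here) and contain no newlines; `completed_wal_segments` is a hypothetical helper:

```bash
#!/usr/bin/env bash
# Hypothetical helper: list every *.wal file in a directory except the last
# one in sort order, which is assumed to be the active (locked) segment.
completed_wal_segments() {
  local dir="$1"
  # sed '$d' drops the final line, i.e. the newest segment.
  find "$dir" -maxdepth 1 -type f -name '*.wal' | sort | sed '$d'
}

# Example: completed_wal_segments /data/wal
```

With zero or one segment present the function prints nothing, so a backup run would simply skip the WAL copy.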

### Monitoring: Wire into Prometheus

**`service-monitor.yaml`**
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: stemedb-api
  namespace: stemedb
  labels:
    release: prometheus # must match your Prometheus Operator label selector
spec:
  selector:
    matchLabels:
      app: stemedb-api # matched against Service labels, so the Service must carry this label
  endpoints:
    # `port` expects the *name* of a Service port; the Service port here is unnamed,
    # so reference the pod port by number instead.
    - targetPort: 18180
      path: /metrics
      interval: 15s
```

**`alert-rules.yaml`**: 6 alerts that fire first at 100-project scale
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: stemedb-alerts
  namespace: stemedb
  labels:
    release: prometheus
spec:
  groups:
    - name: stemedb.rules
      rules:
        - alert: StemeDBPodNotRunning
          # Fires when no target reports up == 1 (covers missing targets and failing scrapes).
          expr: absent(up{job="stemedb-api"} == 1)
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "StemeDB pod is not running"

        - alert: StemeDBWALLatencyHigh
          expr: histogram_quantile(0.99, rate(stemedb_wal_fsync_latency_seconds_bucket[5m])) > 0.05
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "WAL fsync p99 > 50ms; Longhorn I/O degradation likely"

        - alert: StemeDBDataVolumeNearlyFull
          expr: |
            kubelet_volume_stats_used_bytes{persistentvolumeclaim=~"stemedb-.*"}
              / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~"stemedb-.*"}
              > 0.75
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "StemeDB PVC usage > 75%; resize requires downtime"

        - alert: StemeDBRateLimitSaturating
          expr: rate(stemedb_http_requests_total{status="429"}[5m]) > 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "429 rate > 1/s; projects hitting rate limits"

        - alert: StemeDBErrorRateHigh
          # sum() both sides: dividing the raw vectors only pairs series with identical
          # labels, which would compare each 5xx series with itself.
          expr: |
            sum(rate(stemedb_http_requests_total{status=~"5.."}[5m]))
              / sum(rate(stemedb_http_requests_total[5m]))
              > 0.01
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "5xx error rate > 1%"

        - alert: StemeDBOOMKilled
          expr: |
            kube_pod_container_status_last_terminated_reason{
              container="stemedb-api",
              reason="OOMKilled"
            } > 0
          labels:
            severity: critical
          annotations:
            summary: "StemeDB container OOM killed; increase memory limit or find leak"
```

### NetworkPolicy + PDB

**`network-policy.yaml`**
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: stemedb-api
  namespace: stemedb
spec:
  podSelector:
    matchLabels:
      app: stemedb-api
  policyTypes: [Ingress, Egress]
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system # Traefik
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: monitoring # Prometheus
      ports:
        - port: 18180
  egress:
    - ports:
        - port: 53 # DNS (protocol defaults to TCP, so UDP must be listed explicitly)
          protocol: UDP
        - port: 53 # DNS over TCP
          protocol: TCP
        - port: 443 # GCP APIs (backup, secrets)
```

**`pdb.yaml`**
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: stemedb-api
  namespace: stemedb
spec:
  maxUnavailable: 0 # blocks all voluntary evictions; a node drain waits until this is relaxed
  selector:
    matchLabels:
      app: stemedb-api
```

### Phase 2 Checklist

| # | Task | File(s) | Est |
|---|------|---------|-----|
| 1 | Deploy backup CronJob | `deployments/k8s/base/stemedb/backup-cronjob.yaml` | 2h |
| 2 | Create GCS bucket + rclone Secret | GCP Console | 1h |
| 3 | Wire ServiceMonitor into Prometheus | `service-monitor.yaml` | 1h |
| 4 | Deploy 6 alert rules | `alert-rules.yaml` | 1h |
| 5 | Add NetworkPolicy + PDB | `network-policy.yaml`, `pdb.yaml` | 1h |
| 6 | Fix Longhorn PVC reclaim policy in DR runbook | `docs/operations/runbooks/disaster-recovery.md` | 30m |

**Gate test:** Kill pod → `StemeDBPodNotRunning` fires once the outage has lasted 2 min (the rule's `for: 2m`). Run backup job manually → GCS has files.

---

## Phase 3: Scale to 100 Projects (Week 3)

### Per-project key provisioning script

Create `scripts/provision-project-keys.sh`:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Usage: ./provision-project-keys.sh projects.txt
# projects.txt: one project name per line

STEMEDB_URL="${STEMEDB_URL:-https://stemedb.yourdomain.com}"
ADMIN_KEY="${STEMEDB_ADMIN_KEY:?Set STEMEDB_ADMIN_KEY}"
PROJECTS_FILE="${1:?Usage: $0 <projects-file>}"

# The `|| [[ -n ... ]]` keeps a final line that lacks a trailing newline.
while IFS= read -r project || [[ -n "$project" ]]; do
  [[ -z "$project" ]] && continue

  echo "Provisioning key for: $project"

  response=$(curl -sf -X POST "$STEMEDB_URL/v1/admin/api-keys" \
    -H "X-API-Key: $ADMIN_KEY" \
    -H "Content-Type: application/json" \
    -d "{\"label\":\"project-$project\",\"role\":\"write_agent\"}")

  key=$(echo "$response" | jq -r '.key')
  # Abort rather than store the string "null" if the response carried no key.
  if [[ -z "$key" || "$key" == "null" ]]; then
    echo "  ERROR: no key returned for $project" >&2
    exit 1
  fi

  # Store in GCP Secret Manager (create, or add a version if the secret exists)
  echo -n "$key" | gcloud secrets create "stemedb-key-$project" \
    --data-file=- \
    --replication-policy=automatic 2>/dev/null \
    || echo -n "$key" | gcloud secrets versions add "stemedb-key-$project" --data-file=-

  echo "  Key stored: stemedb-key-$project"
done < "$PROJECTS_FILE"

echo "Done."
```

**Onboarding runbook for each project:**
```bash
# 1. Retrieve key from Secret Manager
gcloud secrets versions access latest --secret="stemedb-key-<project>"

# 2. Append the hosted settings to the project's .aphoria/config.toml
cat >> .aphoria/config.toml <<EOF
[hosted]
url = "https://stemedb.yourdomain.com"
api_key_env = "STEMEDB_API_KEY"
EOF

# 3. Export key in CI/CD env
# STEMEDB_API_KEY=steme_live_<value>
```

### Aphoria retry logic (P1)

Projects run `aphoria scan --persist` locally and call the remote StemeDB. During StemeDB pod
restarts (Recreate strategy = brief downtime), Aphoria should retry rather than fail the commit.

> This is a change to the `aphoria` binary, not to StemeDB. Add 3-attempt exponential backoff
> (2s, 4s, 8s) on HTTP 502/503 responses in the Aphoria HTTP client.
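
The retry policy belongs in the Aphoria HTTP client, whose code is not shown here. Purely to illustrate the schedule, the same policy can be sketched in shell. This reads "3-attempt" as one initial request plus up to three retries (delays of 2s, 4s, 8s), which is an interpretation; `request_with_retry`, `backoff_delay`, `is_retryable`, and the `SLEEP_CMD` override are hypothetical names.

```bash
#!/usr/bin/env bash
# Delay in seconds before retry N (1-based): 2, 4, 8.
backoff_delay() {
  echo $(( 2 ** $1 ))
}

# Only 502/503 are retried, per the note above.
is_retryable() {
  [[ "$1" == "502" || "$1" == "503" ]]
}

# request_with_retry <command...>
# The command must print an HTTP status code on stdout.
# Prints the final status; returns 1 if it is still 502/503 after all retries.
request_with_retry() {
  local max_retries=3 attempt=0 status
  while true; do
    status="$("$@" || true)"
    if ! is_retryable "$status"; then
      echo "$status"
      return 0
    fi
    if (( attempt >= max_retries )); then
      echo "$status"
      return 1
    fi
    attempt=$(( attempt + 1 ))
    # SLEEP_CMD lets a dry run skip the real wait; defaults to sleep.
    "${SLEEP_CMD:-sleep}" "$(backoff_delay "$attempt")"
  done
}
```

A call might look like `request_with_retry curl -s -o /dev/null -w '%{http_code}' -X POST "$STEMEDB_URL/v1/assert" ...`. Codes other than 502/503, such as 401, are returned immediately so the caller can handle them.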

### Phase 3 Checklist

| # | Task | File(s) | Est |
|---|------|---------|-----|
| 1 | Run provision script for all 100 projects | `scripts/provision-project-keys.sh` | 2h |
| 2 | Write per-project onboarding runbook | `docs/operations/onboarding-project.md` | 1h |
| 3 | Add retry logic to `aphoria` HTTP client | `applications/aphoria/` | 2h |
| 4 | Split WAL + DB into two PVCs (migration) | `deployments/k8s/base/stemedb/` | 2h |

**Gate test:** 5 projects scan simultaneously with their own keys → each isolated → one rate-limited → others unaffected.

---

## What NOT to Build Yet

| Item | Why not |
|------|---------|
| HPA | StemeDB is stateful (embedded KV). StatefulSet replicas are fixed at 3. |
| mTLS between pods | Internal cluster traffic is on private network. Add when exposing cross-cluster. |
| WAF | Body limits + Traefik rate limit + circuit breaker is sufficient for 100 known projects. |
| Per-tenant namespaces | Multiplies operational surface 100x. API key isolation is the right model. |
| ~~Multi-region / clustering~~ | ✅ 3-node cluster deployed. Next: full SWIM inter-node connectivity. |
| PITR with WAL timestamps | 6-hour backup RPO is acceptable for pilot. Improve later. |
| Secrets rotation automation | Manual rotation via `/v1/admin/api-keys/:hash/rotate` is fine for 100 projects. |
| Distributed tracing | You have one service. WAL fsync histogram covers what you need. |

---

## Open Questions (Resolved)

1. ~~**Image registry**~~: ✅ Zot OCI registry at `registry.threesix.ai` on k3s. Woodpecker CI pushes automatically.
2. ~~**Bootstrap key API**~~: ✅ `bootstrap::bootstrap_root_api_key()` wired in main.rs.
3. **Aphoria scan model**: Projects run `aphoria scan --persist` locally, calling remote StemeDB. Retry logic lives in Aphoria binary.
4. **GCS bucket**: Needs to be created for backups (Phase 2).
5. **CORS**: All router variants use `allow_origin(Any)`. Restrict before public launch.

---

## Risk Register

| Risk | Likelihood | Mitigation |
|------|-----------|-----------|
| Longhorn fsync latency at 100-project burst | Medium | Pin pod + volume to same node (Phase 3), `dataLocality: best-effort`; monitor WAL p99 from day 1 |
| Rolling restart brief downtime | Medium (StatefulSet rolls one pod at a time) | 3 replicas + readiness probe; Gateway routes to healthy pods |
| Fresh PVC after disaster = 100 project keys lost | Low but catastrophic | Bootstrap key seed in `main.rs` + `provision-project-keys.sh` idempotent re-run |
| ~~Image registry blocker~~ | ✅ Resolved | Zot registry on k3s, Woodpecker CI automates builds |
| CORS vulnerability | Medium | `allow_origin(Any)` in all router variants; fix before public launch |

---

## Directory Structure (Current)

```
# k3s-fleet repo
k3s-fleet/deployments/k8s/base/stemedb/
└── stemedb.yaml                # All-in-one: ExternalSecret, headless Service,
                                # gateway Service, 3-replica StatefulSet, Ingress

# stemedb repo
scripts/
├── entrypoint.sh               # Dual-binary launcher (cluster mode)
└── provision-project-keys.sh

.woodpecker.yml                 # CI/CD: Kaniko → Zot registry → kubectl deploy
Dockerfile                      # Multi-stage: builds stemedb-api + stemedb-node
```

After Phase 2 hardening, add to `k3s-fleet/.../stemedb/`:
- `backup-cronjob.yaml`
- `service-monitor.yaml`
- `alert-rules.yaml`
- `network-policy.yaml`
- `pdb.yaml`

---

*Last updated: 2026-03-07. Phase 1 complete, 3-node StatefulSet deployed with Woodpecker CI/CD.*