# Production Readiness Verification Systematic verification checklist for deploying StemeDB in production environments. ## Quick Reference | Category | Status | Last Verified | |----------|--------|---------------| | Crash Recovery | ✅ Pass | 2026-02-05 | | Signature Verification | ✅ Pass | 2026-02-05 | | End-to-End Pipeline | ✅ Pass | 2026-02-05 | | Load Testing | ✅ Tooling ready | Run `./scripts/run-load-test.sh` | | API Security | ✅ Pass | 2026-02-08 | | Backup/Restore | ✅ Pass | 2026-02-08 | | Observability | ✅ Pass | 2026-02-08 | ## Verification Areas ### 1. Data Integrity & Durability | Check | Command | Expected | |-------|---------|----------| | WAL crash recovery | `cargo test -p stemedb-ingest test_crash_recovery` | All pass | | No duplicate assertions | `cargo test -p stemedb-ingest test_p0_crash_recovery` | All pass | | Cursor checkpoint | `cargo test -p stemedb-ingest test_cursor` | All pass | ### 2. Signature Verification | Check | Command | Expected | |-------|---------|----------| | v1 signatures (legacy) | `cargo test -p stemedb-ingest test_ingest_assertion` | Pass | | v2 signatures (enterprise) | Pharma-ingest with real keys | All assertions accepted | | Invalid signature rejection | `cargo test -p stemedb-ingest test_rejects_invalid` | Pass | | Unsigned assertion rejection | `cargo test -p stemedb-ingest test_rejects_unsigned` | Pass | ### 3. End-to-End Pipeline | Check | Command | Expected | |-------|---------|----------| | API server starts | `cargo run --bin stemedb-api` | Binds to :18180 | | Assertion ingestion | `POST /v1/assert` | Returns hash | | Query returns data | `GET /v1/query?subject=X` | Returns assertions | | Skeptic conflict analysis | `GET /v1/skeptic?subject=X&predicate=Y` | Returns conflict_score | | Health check | `GET /v1/health` | assertions_count > 0 | ### 4. Load Testing **Tool:** `cmd/load-test` (Go-based with native Ed25519 signing) | Scenario | Command | Target | Status | |----------|---------|--------|--------| | Baseline latency | `--scenario baseline` | 10K assertions, p99 < 200ms | ✅ Ready | | Sustained writes | `--scenario sustained` | 1K/sec for 1 hour, p99 < 200ms | ✅ Ready | | Concurrent readers | `--scenario concurrent` | 100 readers, <2x degradation | ✅ Ready | **Quick Start:** ```bash # Run all scenarios (5 min sustained by default) ./scripts/run-load-test.sh # Run full 1-hour sustained test LOAD_TEST_DURATION=1h ./scripts/run-load-test.sh # Run specific scenario ./scripts/run-load-test.sh --scenario baseline go run ./cmd/load-test --api-url http://localhost:18180 --scenario sustained --duration 10m ``` **Prerequisites:** - Set `STEMEDB_METER_ENABLED=false` for accurate sustained test results - Ensure ~10-20GB disk space for 1-hour tests (~3.6M assertions) - Results saved to `uat/production-readiness/results/` ### 5. API Security | Check | Implementation | Status | |-------|----------------|--------| | Authentication | `X-API-Key` header, BLAKE3-hashed storage | ✅ Implemented | | RBAC | 3 roles: admin, write, read-only | ✅ Implemented | | Per-key rate limiting | Token bucket per API key (separate from per-agent quota) | ✅ Implemented | | Key management | 5 CRUD endpoints: create, list, revoke, rotate, update | ✅ Implemented | | Bootstrap | `STEMEDB_ROOT_API_KEY` env var creates admin key on first start | ✅ Implemented | | Input validation | Oversized payloads rejected | ✅ Partial | | TLS in transit | HTTPS termination | External (nginx/LB) | **Key files:** - `crates/stemedb-api/src/middleware/api_key.rs` — X-API-Key middleware + RBAC + per-key rate limiting - `crates/stemedb-storage/src/api_key_store/` — Storage trait + implementation - `crates/stemedb-api/src/handlers/api_keys.rs` — Create, list, revoke, rotate, update endpoints - `crates/stemedb-api/src/bootstrap.rs` — STEMEDB_ROOT_API_KEY bootstrap ### 6. Backup & Restore | Check | Procedure | Status | |-------|-----------|--------| | WAL backup | `./scripts/backup-stemedb.sh` — rsync WAL + DB, timestamped | ✅ Implemented | | WAL-only backup | `./scripts/backup-stemedb.sh --wal-only` — faster, WAL only | ✅ Implemented | | Restore | `./scripts/restore-stemedb.sh ` | ✅ Implemented | | Safety checks | Server-not-running check, non-empty dir protection | ✅ Implemented | | Force restore | `--force` renames existing dirs (never deletes) | ✅ Implemented | | WAL verification | Magic bytes (`STEM`) checked on restore | ✅ Implemented | | Metadata | `backup-metadata.json` with timestamp, file counts, sizes | ✅ Implemented | **Usage:** ```bash # Create backup ./scripts/backup-stemedb.sh --output /mnt/backups # Restore from backup (stop server first) ./scripts/restore-stemedb.sh backups/stemedb-backup-20260208-120000/ --force ``` ### 7. Observability | Check | Implementation | Status | |-------|----------------|--------| | Structured logs | `tracing` crate | ✅ Implemented | | Metrics endpoint | `GET /metrics` Prometheus text format | ✅ Implemented | | Application metrics | 6 metrics: assertions, queries, latency, quarantine, circuit breakers | ✅ Implemented | | Sync/cluster metrics | 10 metrics: sync cycles, lag, convergence, node counts | ✅ Implemented | | Grafana dashboard | `docs/grafana/stemedb-overview.json` (4 rows, 12 panels) | ✅ Implemented | | Distributed tracing | OpenTelemetry | Planned (Phase 8B) | | Alerting | WAL lag, errors | Planned (Phase 9E) | **Application Metrics:** | Metric | Type | Description | |--------|------|-------------| | `stemedb_assertions_total` | Gauge | Total assertions in database (updated on health check) | | `stemedb_assertions_ingested_total` | Counter | Assertions ingested via API | | `stemedb_queries_total` | Counter | Queries by endpoint (query, skeptic, layered, constraints) | | `stemedb_query_latency_seconds` | Histogram | Query latency by endpoint | | `stemedb_quarantine_pending` | Gauge | Pending quarantine events | | `stemedb_circuit_breakers_open` | Gauge | Open circuit breakers | ## Running Full Verification ```bash # 1. Run all unit tests cargo test --workspace --lib # 2. Start fresh API server rm -rf /tmp/stemedb-prod-test && mkdir -p /tmp/stemedb-prod-test STEMEDB_DATA_DIR=/tmp/stemedb-prod-test cargo run --bin stemedb-api & sleep 5 # 3. Run pharma-ingest (tests v2 signatures) cargo run -p stemedb-ontology --bin pharma-ingest -- --with-conflicts # 4. Verify endpoints curl http://localhost:18180/v1/health curl "http://localhost:18180/v1/query?subject=Semaglutide" curl "http://localhost:18180/v1/skeptic?subject=Semaglutide&predicate=nausea_rate" # 5. Kill and restart (crash recovery test) pkill -9 -f stemedb-api STEMEDB_DATA_DIR=/tmp/stemedb-prod-test cargo run --bin stemedb-api & sleep 3 curl http://localhost:18180/v1/health # Should show same assertion count ``` ## Results Archive Date-stamped verification results: | Date | Report | Summary | |------|--------|---------| | 2026-02-05 | [wal-sync-fix.md](./results/2026-02-05-wal-sync-fix.md) | WAL segment cache fix, all tests pass | ## Next Steps **After passing verification**, follow these steps to deploy to production: 1. **Choose Architecture:** Review [Reference Architectures](../../docs/operations/reference-architecture/README.md) to select single-node pilot or three-node cluster based on scale and availability requirements. 2. **Set Up Monitoring:** Deploy metrics collection and dashboards per your chosen architecture: - Single-node: [Docker Compose with Monitoring](../../docs/operations/deployment/docker-compose/pilot-with-monitoring.yml) - Three-node: Configure Prometheus to scrape all nodes 3. **Review Runbooks:** Familiarize on-call team with [Operational Runbooks](../../docs/operations/runbooks/): - [Server Won't Start](../../docs/operations/runbooks/server-wont-start.md) - [High Query Latency](../../docs/operations/runbooks/high-query-latency.md) - [Quarantine Overflow](../../docs/operations/runbooks/quarantine-overflow.md) - [Restore from Backup](../../docs/operations/runbooks/restore-from-backup.md) - [Add Node to Cluster](../../docs/operations/runbooks/add-node.md) (cluster only) 4. **Validate Pilot:** Run [Pilot Success Criteria](../../docs/operations/pilot-success-criteria.md) validation suite: - All 15 "Must Pass" criteria - At least 4/6 "Should Pass" criteria - All 5 "Amazement Moments" demonstrable 5. **Deploy:** Follow deployment guide for your chosen architecture: - [Single-Node Pilot](../../docs/operations/reference-architecture/single-node-pilot.md) - [Three-Node Cluster](../../docs/operations/reference-architecture/three-node-cluster.md) 6. **Monitor:** Set up alerts based on [Resource Sizing Guide](../../docs/operations/reference-architecture/resource-sizing.md) thresholds (disk >80%, CPU >70%, latency p99 >1s). --- ## Related - [UAT Report Template](../how-to.md) - [Consumer Health UAT](../consumer-health/README.md) - [Production Readiness Facts](../../ai-lookup/features/production-readiness.md)