# Production Readiness Verification
Systematic verification checklist for deploying StemeDB in production environments.
## Quick Reference
| Category | Status | Last Verified |
|----------|--------|---------------|
| Crash Recovery | ✅ Pass | 2026-02-05 |
| Signature Verification | ✅ Pass | 2026-02-05 |
| End-to-End Pipeline | ✅ Pass | 2026-02-05 |
| Load Testing | ✅ Tooling ready | Run `./scripts/run-load-test.sh` |
| API Security | ✅ Pass | 2026-02-08 |
| Backup/Restore | ✅ Pass | 2026-02-08 |
| Observability | ✅ Pass | 2026-02-08 |
## Verification Areas
### 1. Data Integrity & Durability
| Check | Command | Expected |
|-------|---------|----------|
| WAL crash recovery | `cargo test -p stemedb-ingest test_crash_recovery` | All pass |
| No duplicate assertions | `cargo test -p stemedb-ingest test_p0_crash_recovery` | All pass |
| Cursor checkpoint | `cargo test -p stemedb-ingest test_cursor` | All pass |
### 2. Signature Verification
| Check | Command | Expected |
|-------|---------|----------|
| v1 signatures (legacy) | `cargo test -p stemedb-ingest test_ingest_assertion` | Pass |
| v2 signatures (enterprise) | `cargo run -p stemedb-ontology --bin pharma-ingest -- --with-conflicts` (real keys, API server running) | All assertions accepted |
| Invalid signature rejection | `cargo test -p stemedb-ingest test_rejects_invalid` | Pass |
| Unsigned assertion rejection | `cargo test -p stemedb-ingest test_rejects_unsigned` | Pass |
### 3. End-to-End Pipeline
| Check | Command | Expected |
|-------|---------|----------|
| API server starts | `cargo run --bin stemedb-api` | Binds to :18180 |
| Assertion ingestion | `POST /v1/assert` | Returns hash |
| Query returns data | `GET /v1/query?subject=X` | Returns assertions |
| Skeptic conflict analysis | `GET /v1/skeptic?subject=X&predicate=Y` | Returns conflict_score |
| Health check | `GET /v1/health` | assertions_count > 0 |
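
The health-check row can be automated. The following is a minimal sketch, assuming `GET /v1/health` returns JSON containing a numeric `assertions_count` field (the field name comes from the table above; the exact response shape is an assumption). The helper name `assertions_count` is hypothetical and not part of StemeDB.

```bash
# assertions_count: read a health response on stdin and print the integer
# value of its "assertions_count" field (assumed JSON shape).
# Exits non-zero if the field is missing.
assertions_count() {
  grep -o '"assertions_count"[[:space:]]*:[[:space:]]*[0-9][0-9]*' \
    | head -n 1 \
    | grep -o '[0-9][0-9]*$'
}

# Example against a running server (not executed here):
#   count=$(curl -fsS http://localhost:18180/v1/health | assertions_count)
#   [ "$count" -gt 0 ] && echo "health check passed: $count assertions"
```

If `jq` is installed, `jq .assertions_count` is a more robust way to read the same field.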
### 4. Load Testing
**Tool:** `cmd/load-test` (Go-based with native Ed25519 signing)
| Scenario | Command | Target | Status |
|----------|---------|--------|--------|
| Baseline latency | `--scenario baseline` | 10K assertions, p99 < 200ms | Ready |
| Sustained writes | `--scenario sustained` | 1K/sec for 1 hour, p99 < 200ms | Ready |
| Concurrent readers | `--scenario concurrent` | 100 readers, <2x degradation | Ready |

**Quick Start:**
```bash
# Run all scenarios (5 min sustained by default)
./scripts/run-load-test.sh
# Run full 1-hour sustained test
LOAD_TEST_DURATION=1h ./scripts/run-load-test.sh
# Run specific scenario
./scripts/run-load-test.sh --scenario baseline
# Or invoke the Go tool directly against a running API server
go run ./cmd/load-test --api-url http://localhost:18180 --scenario sustained --duration 10m
```
**Prerequisites:**
- Set `STEMEDB_METER_ENABLED=false` for accurate sustained test results
- Ensure ~10-20GB disk space for 1-hour tests (~3.6M assertions)
- Results saved to `uat/production-readiness/results/`
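
The p99 targets in the scenario table can be checked from raw samples. This is a minimal sketch, assuming a hypothetical text file with one latency sample in milliseconds per line; the actual format of the files under `uat/production-readiness/results/` is not specified here. `p99_ms` is an illustrative helper that uses the nearest-rank method.

```bash
# p99_ms FILE: print the nearest-rank 99th percentile of FILE, which is
# assumed to hold one latency sample (milliseconds) per line.
# Exits non-zero if FILE contains no samples.
p99_ms() {
  sort -n "$1" | awk '
    { v[NR] = $1 }                          # samples arrive in ascending order
    END {
      if (NR == 0) exit 1
      rank = int((NR * 99 + 99) / 100)      # ceil(0.99 * NR)
      print v[rank]
    }'
}

# Example (latencies.txt is a hypothetical file name):
#   [ "$(p99_ms latencies.txt)" -lt 200 ] && echo "p99 target met"
```

Nearest-rank returns an observed sample rather than an interpolated value, which keeps the comparison against the 200ms target easy to reason about.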
### 5. API Security
| Check | Implementation | Status |
|-------|----------------|--------|
| Authentication | `X-API-Key` header, BLAKE3-hashed storage | Implemented |
| RBAC | 3 roles: admin, write, read-only | Implemented |
| Per-key rate limiting | Token bucket per API key (separate from per-agent quota) | Implemented |
| Key management | 5 CRUD endpoints: create, list, revoke, rotate, update | Implemented |
| Bootstrap | `STEMEDB_ROOT_API_KEY` env var creates admin key on first start | Implemented |
| Input validation | Oversized payloads rejected | Partial |
| TLS in transit | HTTPS termination | External (nginx/LB) |

**Key files:**
- `crates/stemedb-api/src/middleware/api_key.rs`: X-API-Key middleware + RBAC + per-key rate limiting
- `crates/stemedb-storage/src/api_key_store/`: storage trait + implementation
- `crates/stemedb-api/src/handlers/api_keys.rs`: create, list, revoke, rotate, update endpoints
- `crates/stemedb-api/src/bootstrap.rs`: `STEMEDB_ROOT_API_KEY` bootstrap
### 6. Backup & Restore
| Check | Procedure | Status |
|-------|-----------|--------|
| Full backup | `./scripts/backup-stemedb.sh` (rsync of WAL + DB, timestamped) | Implemented |
| WAL-only backup | `./scripts/backup-stemedb.sh --wal-only` (faster, WAL only) | Implemented |
| Restore | `./scripts/restore-stemedb.sh <backup-path>` | Implemented |
| Safety checks | Server-not-running check, non-empty dir protection | Implemented |
| Force restore | `--force` renames existing dirs (never deletes) | Implemented |
| WAL verification | Magic bytes (`STEM`) checked on restore | Implemented |
| Metadata | `backup-metadata.json` with timestamp, file counts, sizes | Implemented |

**Usage:**
```bash
# Create backup
./scripts/backup-stemedb.sh --output /mnt/backups
# Restore from backup (stop server first); --force renames existing dirs, never deletes them
./scripts/restore-stemedb.sh backups/stemedb-backup-20260208-120000/ --force
```
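
The restore script's WAL verification checks for the `STEM` magic bytes. The same idea can be sketched by hand, assuming the magic occupies the first four bytes of each WAL file (the offset is an assumption). Both helper names are hypothetical, and choosing which files to check is left to the caller.

```bash
# has_stem_magic FILE: succeed if FILE starts with the ASCII bytes "STEM"
# (hex 53 54 45 4d). Comparing hex avoids problems with binary data.
has_stem_magic() {
  [ "$(head -c 4 "$1" 2>/dev/null | od -An -tx1 | tr -d ' \n')" = "5354454d" ]
}

# check_wal_files FILE...: report every file that lacks the magic bytes and
# return non-zero if any file failed the check.
check_wal_files() {
  local status=0 f
  for f in "$@"; do
    if ! has_stem_magic "$f"; then
      echo "bad magic: $f" >&2
      status=1
    fi
  done
  return "$status"
}
```

Running `check_wal_files` over the WAL files of a backup before restoring gives an early signal that a copy is truncated or is not a WAL file at all.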
### 7. Observability
| Check | Implementation | Status |
|-------|----------------|--------|
| Structured logs | `tracing` crate | Implemented |
| Metrics endpoint | `GET /metrics` (Prometheus text format) | Implemented |
| Application metrics | 6 metrics: assertions (total and ingested), queries, latency, quarantine, circuit breakers | Implemented |
| Sync/cluster metrics | 10 metrics: sync cycles, lag, convergence, node counts | Implemented |
| Grafana dashboard | `docs/grafana/stemedb-overview.json` (4 rows, 12 panels) | Implemented |
| Distributed tracing | OpenTelemetry | Planned (Phase 8B) |
| Alerting | WAL lag, errors | Planned (Phase 9E) |

**Application Metrics:**
| Metric | Type | Description |
|--------|------|-------------|
| `stemedb_assertions_total` | Gauge | Total assertions in database (updated on health check) |
| `stemedb_assertions_ingested_total` | Counter | Assertions ingested via API |
| `stemedb_queries_total` | Counter | Queries by endpoint (query, skeptic, layered, constraints) |
| `stemedb_query_latency_seconds` | Histogram | Query latency by endpoint |
| `stemedb_quarantine_pending` | Gauge | Pending quarantine events |
| `stemedb_circuit_breakers_open` | Gauge | Open circuit breakers |
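
The metrics above can be read from a script as well as from Grafana. This is a minimal sketch, assuming `GET /metrics` emits the standard Prometheus text exposition format without per-sample timestamps. `metric_sum` is a hypothetical helper that adds up all samples of one metric across its label sets.

```bash
# metric_sum NAME: read Prometheus text format on stdin and print the sum of
# all samples named NAME, with or without labels.
# Exits non-zero if no sample with that name is present.
metric_sum() {
  awk -v name="$1" '
    /^#/ || NF < 2 { next }                 # skip HELP/TYPE comments and blank lines
    {
      brace = index($1, "{")                # labels, if any, start at the first brace
      key = brace ? substr($1, 1, brace - 1) : $1
    }
    key == name { sum += $NF; found = 1 }   # value is the last field (no timestamp assumed)
    END { if (!found) exit 1; print sum }'
}

# Example against a running server (not executed here):
#   curl -fsS http://localhost:18180/metrics | metric_sum stemedb_queries_total
```

Histogram series such as `stemedb_query_latency_seconds` expose separate `_bucket`, `_sum`, and `_count` sample names in this format, so those names have to be queried explicitly.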
## Running Full Verification
```bash
# 1. Run all unit tests
cargo test --workspace --lib
# 2. Start fresh API server
rm -rf /tmp/stemedb-prod-test && mkdir -p /tmp/stemedb-prod-test
STEMEDB_DATA_DIR=/tmp/stemedb-prod-test cargo run --bin stemedb-api &
sleep 5
# 3. Run pharma-ingest (tests v2 signatures)
cargo run -p stemedb-ontology --bin pharma-ingest -- --with-conflicts
# 4. Verify endpoints
curl http://localhost:18180/v1/health
curl "http://localhost:18180/v1/query?subject=Semaglutide"
curl "http://localhost:18180/v1/skeptic?subject=Semaglutide&predicate=nausea_rate"
# 5. Kill and restart (crash recovery test)
pkill -9 -f stemedb-api
STEMEDB_DATA_DIR=/tmp/stemedb-prod-test cargo run --bin stemedb-api &
sleep 3
curl http://localhost:18180/v1/health # Should show same assertion count
```
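
The fixed `sleep` calls in the script above can be too short when `cargo run` has to compile first. A minimal sketch of polling instead, assuming `/v1/health` answers with a success status once the server is ready; `wait_until` is a hypothetical helper, not part of the repository scripts.

```bash
# wait_until ATTEMPTS DELAY CMD...: run CMD until it succeeds, sleeping DELAY
# seconds after each failed try. Returns non-zero after ATTEMPTS failures.
wait_until() {
  local attempts=$1 delay=$2
  shift 2
  while [ "$attempts" -gt 0 ]; do
    "$@" >/dev/null 2>&1 && return 0
    attempts=$((attempts - 1))
    sleep "$delay"
  done
  return 1
}

# Example replacing "sleep 5" (not executed here):
#   wait_until 60 1 curl -fsS http://localhost:18180/v1/health \
#     || { echo "server did not become healthy" >&2; exit 1; }
```

Polling makes the crash-recovery step more reliable too, since the restarted server may need a variable amount of time to replay the WAL before it answers.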
## Results Archive
Date-stamped verification results:
| Date | Report | Summary |
|------|--------|---------|
| 2026-02-05 | [wal-sync-fix.md](./results/2026-02-05-wal-sync-fix.md) | WAL segment cache fix, all tests pass |
## Related
- [UAT Report Template](../how-to.md)
- [Consumer Health UAT](../consumer-health/README.md)
- [Production Readiness Facts](../../ai-lookup/features/production-readiness.md)