This commit implements comprehensive production hardening across multiple layers to prepare StemeDB for enterprise pilot deployments: ## API Layer - Add rate limiting middleware with configurable limits per endpoint - Enhance error handling with detailed context and proper HTTP status codes - Add security hardening tests for input validation and boundary conditions - Create store_helpers module for defensive storage access patterns ## Storage & WAL - Optimize group commit batching for higher throughput - Add defensive error handling in hybrid backend with proper fallbacks - Enhance WAL journal durability guarantees with fsync validation - Improve index store query performance with better caching ## Operations & Deployment - Add comprehensive operations documentation (deployment, monitoring, DR) - Create systemd units for backup, WAL archival, and verification - Add monitoring configs (Prometheus alerts, metrics exporters) - Implement backup/restore scripts with verification and S3 archival - Add DR drill automation and runbook procedures - Create load balancer configs (nginx, envoy) with health checks ## Documentation - Update CLAUDE.md with operations and troubleshooting guides - Expand roadmap with production readiness milestones - Add pilot success criteria and deployment reference architecture - Document TLS setup, monitoring integration, and incident response ## Configuration - Add .env.example with all required environment variables - Document resource sizing for different deployment scales - Add configuration examples for various deployment topologies This positions StemeDB for successful enterprise pilots with proper operational discipline, monitoring, backup/DR, and security hardening. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
205 lines
8.8 KiB
Markdown
205 lines
8.8 KiB
Markdown
# Production Readiness Verification
|
|
|
|
Systematic verification checklist for deploying StemeDB in production environments.
|
|
|
|
## Quick Reference
|
|
|
|
| Category | Status | Last Verified |
|
|
|----------|--------|---------------|
|
|
| Crash Recovery | ✅ Pass | 2026-02-05 |
|
|
| Signature Verification | ✅ Pass | 2026-02-05 |
|
|
| End-to-End Pipeline | ✅ Pass | 2026-02-05 |
|
|
| Load Testing | ✅ Tooling ready | Run `./scripts/run-load-test.sh` |
|
|
| API Security | ✅ Pass | 2026-02-08 |
|
|
| Backup/Restore | ✅ Pass | 2026-02-08 |
|
|
| Observability | ✅ Pass | 2026-02-08 |
|
|
|
|
## Verification Areas
|
|
|
|
### 1. Data Integrity & Durability
|
|
|
|
| Check | Command | Expected |
|
|
|-------|---------|----------|
|
|
| WAL crash recovery | `cargo test -p stemedb-ingest test_crash_recovery` | All pass |
|
|
| No duplicate assertions | `cargo test -p stemedb-ingest test_p0_crash_recovery` | All pass |
|
|
| Cursor checkpoint | `cargo test -p stemedb-ingest test_cursor` | All pass |
|
|
|
|
### 2. Signature Verification
|
|
|
|
| Check | Command | Expected |
|
|
|-------|---------|----------|
|
|
| v1 signatures (legacy) | `cargo test -p stemedb-ingest test_ingest_assertion` | Pass |
|
|
| v2 signatures (enterprise) | Pharma-ingest with real keys | All assertions accepted |
|
|
| Invalid signature rejection | `cargo test -p stemedb-ingest test_rejects_invalid` | Pass |
|
|
| Unsigned assertion rejection | `cargo test -p stemedb-ingest test_rejects_unsigned` | Pass |
|
|
|
|
### 3. End-to-End Pipeline
|
|
|
|
| Check | Command | Expected |
|
|
|-------|---------|----------|
|
|
| API server starts | `cargo run --bin stemedb-api` | Binds to :18180 |
|
|
| Assertion ingestion | `POST /v1/assert` | Returns hash |
|
|
| Query returns data | `GET /v1/query?subject=X` | Returns assertions |
|
|
| Skeptic conflict analysis | `GET /v1/skeptic?subject=X&predicate=Y` | Returns conflict_score |
|
|
| Health check | `GET /v1/health` | assertions_count > 0 |
|
|
|
|
### 4. Load Testing
|
|
|
|
**Tool:** `cmd/load-test` (Go-based with native Ed25519 signing)
|
|
|
|
| Scenario | Command | Target | Status |
|
|
|----------|---------|--------|--------|
|
|
| Baseline latency | `--scenario baseline` | 10K assertions, p99 < 200ms | ✅ Ready |
|
|
| Sustained writes | `--scenario sustained` | 1K/sec for 1 hour, p99 < 200ms | ✅ Ready |
|
|
| Concurrent readers | `--scenario concurrent` | 100 readers, <2x degradation | ✅ Ready |
|
|
|
|
**Quick Start:**
|
|
```bash
|
|
# Run all scenarios (5 min sustained by default)
|
|
./scripts/run-load-test.sh
|
|
|
|
# Run full 1-hour sustained test
|
|
LOAD_TEST_DURATION=1h ./scripts/run-load-test.sh
|
|
|
|
# Run specific scenario
|
|
./scripts/run-load-test.sh --scenario baseline
|
|
go run ./cmd/load-test --api-url http://localhost:18180 --scenario sustained --duration 10m
|
|
```
|
|
|
|
**Prerequisites:**
|
|
- Set `STEMEDB_METER_ENABLED=false` for accurate sustained test results
|
|
- Ensure ~10-20GB disk space for 1-hour tests (~3.6M assertions)
|
|
- Results saved to `uat/production-readiness/results/`
|
|
|
|
### 5. API Security
|
|
|
|
| Check | Implementation | Status |
|
|
|-------|----------------|--------|
|
|
| Authentication | `X-API-Key` header, BLAKE3-hashed storage | ✅ Implemented |
|
|
| RBAC | 3 roles: admin, write, read-only | ✅ Implemented |
|
|
| Per-key rate limiting | Token bucket per API key (separate from per-agent quota) | ✅ Implemented |
|
|
| Key management | 5 CRUD endpoints: create, list, revoke, rotate, update | ✅ Implemented |
|
|
| Bootstrap | `STEMEDB_ROOT_API_KEY` env var creates admin key on first start | ✅ Implemented |
|
|
| Input validation | Oversized payloads rejected | ✅ Partial |
|
|
| TLS in transit | HTTPS termination | External (nginx/LB) |
|
|
|
|
**Key files:**
|
|
- `crates/stemedb-api/src/middleware/api_key.rs` — X-API-Key middleware + RBAC + per-key rate limiting
|
|
- `crates/stemedb-storage/src/api_key_store/` — Storage trait + implementation
|
|
- `crates/stemedb-api/src/handlers/api_keys.rs` — Create, list, revoke, rotate, update endpoints
|
|
- `crates/stemedb-api/src/bootstrap.rs` — STEMEDB_ROOT_API_KEY bootstrap
|
|
|
|
### 6. Backup & Restore
|
|
|
|
| Check | Procedure | Status |
|
|
|-------|-----------|--------|
|
|
| WAL backup | `./scripts/backup-stemedb.sh` — rsync WAL + DB, timestamped | ✅ Implemented |
|
|
| WAL-only backup | `./scripts/backup-stemedb.sh --wal-only` — faster, WAL only | ✅ Implemented |
|
|
| Restore | `./scripts/restore-stemedb.sh <backup-path>` | ✅ Implemented |
|
|
| Safety checks | Server-not-running check, non-empty dir protection | ✅ Implemented |
|
|
| Force restore | `--force` renames existing dirs (never deletes) | ✅ Implemented |
|
|
| WAL verification | Magic bytes (`STEM`) checked on restore | ✅ Implemented |
|
|
| Metadata | `backup-metadata.json` with timestamp, file counts, sizes | ✅ Implemented |
|
|
|
|
**Usage:**
|
|
```bash
|
|
# Create backup
|
|
./scripts/backup-stemedb.sh --output /mnt/backups
|
|
|
|
# Restore from backup (stop server first)
|
|
./scripts/restore-stemedb.sh backups/stemedb-backup-20260208-120000/ --force
|
|
```
|
|
|
|
### 7. Observability
|
|
|
|
| Check | Implementation | Status |
|
|
|-------|----------------|--------|
|
|
| Structured logs | `tracing` crate | ✅ Implemented |
|
|
| Metrics endpoint | `GET /metrics` Prometheus text format | ✅ Implemented |
|
|
| Application metrics | 6 metrics: assertions, queries, latency, quarantine, circuit breakers | ✅ Implemented |
|
|
| Sync/cluster metrics | 10 metrics: sync cycles, lag, convergence, node counts | ✅ Implemented |
|
|
| Grafana dashboard | `docs/grafana/stemedb-overview.json` (4 rows, 12 panels) | ✅ Implemented |
|
|
| Distributed tracing | OpenTelemetry | Planned (Phase 8B) |
|
|
| Alerting | WAL lag, errors | Planned (Phase 9E) |
|
|
|
|
**Application Metrics:**
|
|
|
|
| Metric | Type | Description |
|
|
|--------|------|-------------|
|
|
| `stemedb_assertions_total` | Gauge | Total assertions in database (updated on health check) |
|
|
| `stemedb_assertions_ingested_total` | Counter | Assertions ingested via API |
|
|
| `stemedb_queries_total` | Counter | Queries by endpoint (query, skeptic, layered, constraints) |
|
|
| `stemedb_query_latency_seconds` | Histogram | Query latency by endpoint |
|
|
| `stemedb_quarantine_pending` | Gauge | Pending quarantine events |
|
|
| `stemedb_circuit_breakers_open` | Gauge | Open circuit breakers |
|
|
|
|
## Running Full Verification
|
|
|
|
```bash
|
|
# 1. Run all unit tests
|
|
cargo test --workspace --lib
|
|
|
|
# 2. Start fresh API server
|
|
rm -rf /tmp/stemedb-prod-test && mkdir -p /tmp/stemedb-prod-test
|
|
STEMEDB_DATA_DIR=/tmp/stemedb-prod-test cargo run --bin stemedb-api &
|
|
sleep 5
|
|
|
|
# 3. Run pharma-ingest (tests v2 signatures)
|
|
cargo run -p stemedb-ontology --bin pharma-ingest -- --with-conflicts
|
|
|
|
# 4. Verify endpoints
|
|
curl http://localhost:18180/v1/health
|
|
curl "http://localhost:18180/v1/query?subject=Semaglutide"
|
|
curl "http://localhost:18180/v1/skeptic?subject=Semaglutide&predicate=nausea_rate"
|
|
|
|
# 5. Kill and restart (crash recovery test)
|
|
pkill -9 -f stemedb-api
|
|
STEMEDB_DATA_DIR=/tmp/stemedb-prod-test cargo run --bin stemedb-api &
|
|
sleep 3
|
|
curl http://localhost:18180/v1/health # Should show same assertion count
|
|
```
|
|
|
|
## Results Archive
|
|
|
|
Date-stamped verification results:
|
|
|
|
| Date | Report | Summary |
|
|
|------|--------|---------|
|
|
| 2026-02-05 | [wal-sync-fix.md](./results/2026-02-05-wal-sync-fix.md) | WAL segment cache fix, all tests pass |
|
|
|
|
## Next Steps
|
|
|
|
**After passing verification**, follow these steps to deploy to production:
|
|
|
|
1. **Choose Architecture:** Review [Reference Architectures](../../docs/operations/reference-architecture/README.md) to select single-node pilot or three-node cluster based on scale and availability requirements.
|
|
|
|
2. **Set Up Monitoring:** Deploy metrics collection and dashboards per your chosen architecture:
|
|
- Single-node: [Docker Compose with Monitoring](../../docs/operations/deployment/docker-compose/pilot-with-monitoring.yml)
|
|
- Three-node: Configure Prometheus to scrape all nodes
|
|
|
|
3. **Review Runbooks:** Familiarize on-call team with [Operational Runbooks](../../docs/operations/runbooks/):
|
|
- [Server Won't Start](../../docs/operations/runbooks/server-wont-start.md)
|
|
- [High Query Latency](../../docs/operations/runbooks/high-query-latency.md)
|
|
- [Quarantine Overflow](../../docs/operations/runbooks/quarantine-overflow.md)
|
|
- [Restore from Backup](../../docs/operations/runbooks/restore-from-backup.md)
|
|
- [Add Node to Cluster](../../docs/operations/runbooks/add-node.md) (cluster only)
|
|
|
|
4. **Validate Pilot:** Run [Pilot Success Criteria](../../docs/operations/pilot-success-criteria.md) validation suite:
|
|
- All 15 "Must Pass" criteria
|
|
- At least 4/6 "Should Pass" criteria
|
|
- All 5 "Amazement Moments" demonstrable
|
|
|
|
5. **Deploy:** Follow deployment guide for your chosen architecture:
|
|
- [Single-Node Pilot](../../docs/operations/reference-architecture/single-node-pilot.md)
|
|
- [Three-Node Cluster](../../docs/operations/reference-architecture/three-node-cluster.md)
|
|
|
|
6. **Monitor:** Set up alerts based on [Resource Sizing Guide](../../docs/operations/reference-architecture/resource-sizing.md) thresholds (disk >80%, CPU >70%, latency p99 >1s).
|
|
|
|
---
|
|
|
|
## Related
|
|
|
|
- [UAT Report Template](../how-to.md)
|
|
- [Consumer Health UAT](../consumer-health/README.md)
|
|
- [Production Readiness Facts](../../ai-lookup/features/production-readiness.md)
|