stemedb/uat/production-readiness
jml 3e7eddc074 feat: add enterprise production readiness infrastructure
This commit implements comprehensive production hardening across multiple
layers to prepare StemeDB for enterprise pilot deployments:

## API Layer
- Add rate limiting middleware with configurable limits per endpoint
- Enhance error handling with detailed context and proper HTTP status codes
- Add security hardening tests for input validation and boundary conditions
- Create store_helpers module for defensive storage access patterns

## Storage & WAL
- Optimize group commit batching for higher throughput
- Add defensive error handling in hybrid backend with proper fallbacks
- Enhance WAL journal durability guarantees with fsync validation
- Improve index store query performance with better caching

## Operations & Deployment
- Add comprehensive operations documentation (deployment, monitoring, DR)
- Create systemd units for backup, WAL archival, and verification
- Add monitoring configs (Prometheus alerts, metrics exporters)
- Implement backup/restore scripts with verification and S3 archival
- Add DR drill automation and runbook procedures
- Create load balancer configs (nginx, envoy) with health checks

## Documentation
- Update CLAUDE.md with operations and troubleshooting guides
- Expand roadmap with production readiness milestones
- Add pilot success criteria and deployment reference architecture
- Document TLS setup, monitoring integration, and incident response

## Configuration
- Add .env.example with all required environment variables
- Document resource sizing for different deployment scales
- Add configuration examples for various deployment topologies

This positions StemeDB for successful enterprise pilots with proper
operational discipline, monitoring, backup/DR, and security hardening.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-12 06:08:15 +00:00
..
results feat: Complete Aphoria Phase 8-9 + UAT suite (90/90 tests passing) 2026-02-06 22:50:55 -07:00
backup-dr-tests-simple.sh feat: add enterprise production readiness infrastructure 2026-02-12 06:08:15 +00:00
backup-dr-tests.sh feat: add enterprise production readiness infrastructure 2026-02-12 06:08:15 +00:00
README.md feat: add enterprise production readiness infrastructure 2026-02-12 06:08:15 +00:00

Production Readiness Verification

Systematic verification checklist for deploying StemeDB in production environments.

Quick Reference

Category Status Last Verified
Crash Recovery Pass 2026-02-05
Signature Verification Pass 2026-02-05
End-to-End Pipeline Pass 2026-02-05
Load Testing Tooling ready Run ./scripts/run-load-test.sh
API Security Pass 2026-02-08
Backup/Restore Pass 2026-02-08
Observability Pass 2026-02-08

Verification Areas

1. Data Integrity & Durability

Check Command Expected
WAL crash recovery cargo test -p stemedb-ingest test_crash_recovery All pass
No duplicate assertions cargo test -p stemedb-ingest test_p0_crash_recovery All pass
Cursor checkpoint cargo test -p stemedb-ingest test_cursor All pass

2. Signature Verification

Check Command Expected
v1 signatures (legacy) cargo test -p stemedb-ingest test_ingest_assertion Pass
v2 signatures (enterprise) Pharma-ingest with real keys All assertions accepted
Invalid signature rejection cargo test -p stemedb-ingest test_rejects_invalid Pass
Unsigned assertion rejection cargo test -p stemedb-ingest test_rejects_unsigned Pass

3. End-to-End Pipeline

Check Command Expected
API server starts cargo run --bin stemedb-api Binds to :18180
Assertion ingestion POST /v1/assert Returns hash
Query returns data GET /v1/query?subject=X Returns assertions
Skeptic conflict analysis GET /v1/skeptic?subject=X&predicate=Y Returns conflict_score
Health check GET /v1/health assertions_count > 0

4. Load Testing

Tool: cmd/load-test (Go-based with native Ed25519 signing)

Scenario Command Target Status
Baseline latency --scenario baseline 10K assertions, p99 < 200ms Ready
Sustained writes --scenario sustained 1K/sec for 1 hour, p99 < 200ms Ready
Concurrent readers --scenario concurrent 100 readers, <2x degradation Ready

Quick Start:

# Run all scenarios (5 min sustained by default)
./scripts/run-load-test.sh

# Run full 1-hour sustained test
LOAD_TEST_DURATION=1h ./scripts/run-load-test.sh

# Run specific scenario
./scripts/run-load-test.sh --scenario baseline
go run ./cmd/load-test --api-url http://localhost:18180 --scenario sustained --duration 10m

Prerequisites:

  • Set STEMEDB_METER_ENABLED=false for accurate sustained test results
  • Ensure ~10-20GB disk space for 1-hour tests (~3.6M assertions)
  • Results saved to uat/production-readiness/results/

5. API Security

Check Implementation Status
Authentication X-API-Key header, BLAKE3-hashed storage Implemented
RBAC 3 roles: admin, write, read-only Implemented
Per-key rate limiting Token bucket per API key (separate from per-agent quota) Implemented
Key management 5 CRUD endpoints: create, list, revoke, rotate, update Implemented
Bootstrap STEMEDB_ROOT_API_KEY env var creates admin key on first start Implemented
Input validation Oversized payloads rejected Partial
TLS in transit HTTPS termination External (nginx/LB)

Key files:

  • crates/stemedb-api/src/middleware/api_key.rs — X-API-Key middleware + RBAC + per-key rate limiting
  • crates/stemedb-storage/src/api_key_store/ — Storage trait + implementation
  • crates/stemedb-api/src/handlers/api_keys.rs — Create, list, revoke, rotate, update endpoints
  • crates/stemedb-api/src/bootstrap.rs — STEMEDB_ROOT_API_KEY bootstrap

6. Backup & Restore

Check Procedure Status
WAL backup ./scripts/backup-stemedb.sh — rsync WAL + DB, timestamped Implemented
WAL-only backup ./scripts/backup-stemedb.sh --wal-only — faster, WAL only Implemented
Restore ./scripts/restore-stemedb.sh <backup-path> Implemented
Safety checks Server-not-running check, non-empty dir protection Implemented
Force restore --force renames existing dirs (never deletes) Implemented
WAL verification Magic bytes (STEM) checked on restore Implemented
Metadata backup-metadata.json with timestamp, file counts, sizes Implemented

Usage:

# Create backup
./scripts/backup-stemedb.sh --output /mnt/backups

# Restore from backup (stop server first)
./scripts/restore-stemedb.sh backups/stemedb-backup-20260208-120000/ --force

7. Observability

Check Implementation Status
Structured logs tracing crate Implemented
Metrics endpoint GET /metrics Prometheus text format Implemented
Application metrics 6 metrics: assertions, queries, latency, quarantine, circuit breakers Implemented
Sync/cluster metrics 10 metrics: sync cycles, lag, convergence, node counts Implemented
Grafana dashboard docs/grafana/stemedb-overview.json (4 rows, 12 panels) Implemented
Distributed tracing OpenTelemetry Planned (Phase 8B)
Alerting WAL lag, errors Planned (Phase 9E)

Application Metrics:

Metric Type Description
stemedb_assertions_total Gauge Total assertions in database (updated on health check)
stemedb_assertions_ingested_total Counter Assertions ingested via API
stemedb_queries_total Counter Queries by endpoint (query, skeptic, layered, constraints)
stemedb_query_latency_seconds Histogram Query latency by endpoint
stemedb_quarantine_pending Gauge Pending quarantine events
stemedb_circuit_breakers_open Gauge Open circuit breakers

Running Full Verification

# 1. Run all unit tests
cargo test --workspace --lib

# 2. Start fresh API server
rm -rf /tmp/stemedb-prod-test && mkdir -p /tmp/stemedb-prod-test
STEMEDB_DATA_DIR=/tmp/stemedb-prod-test cargo run --bin stemedb-api &
sleep 5

# 3. Run pharma-ingest (tests v2 signatures)
cargo run -p stemedb-ontology --bin pharma-ingest -- --with-conflicts

# 4. Verify endpoints
curl http://localhost:18180/v1/health
curl "http://localhost:18180/v1/query?subject=Semaglutide"
curl "http://localhost:18180/v1/skeptic?subject=Semaglutide&predicate=nausea_rate"

# 5. Kill and restart (crash recovery test)
pkill -9 -f stemedb-api
STEMEDB_DATA_DIR=/tmp/stemedb-prod-test cargo run --bin stemedb-api &
sleep 3
curl http://localhost:18180/v1/health  # Should show same assertion count

Results Archive

Date-stamped verification results:

Date Report Summary
2026-02-05 wal-sync-fix.md WAL segment cache fix, all tests pass

Next Steps

After passing verification, follow these steps to deploy to production:

  1. Choose Architecture: Review Reference Architectures to select single-node pilot or three-node cluster based on scale and availability requirements.

  2. Set Up Monitoring: Deploy metrics collection and dashboards per your chosen architecture:

  3. Review Runbooks: Familiarize on-call team with Operational Runbooks:

  4. Validate Pilot: Run Pilot Success Criteria validation suite:

    • All 15 "Must Pass" criteria
    • At least 4/6 "Should Pass" criteria
    • All 5 "Amazement Moments" demonstrable
  5. Deploy: Follow deployment guide for your chosen architecture:

  6. Monitor: Set up alerts based on Resource Sizing Guide thresholds (disk >80%, CPU >70%, latency p99 >1s).