# Episteme (StemeDB) Roadmap > **Goal:** Build the "Git for Truth" substrate for autonomous AI research. > **Current Focus:** A5.3 Claim Suggester validation + P5.5 Cluster Management Tooling > **Target Vertical:** BioTech/Pharma ("The Living Review") + Code Truth (Aphoria) > **Endgame:** Distributed multi-writer cluster for millions of concurrent agents > > **Infrastructure Status:** Phases 1-7 complete | Phase 8A (Chaos) complete | Pilot 1-4 complete > **Aphoria Status:** A1-A4 complete (observations/claims/verify/corpus) | A5 flywheel 3/4 done > **Security Status:** P5.1 4/5 done (TLS, limits, timeouts, rate limiting) | P5.2 βœ… complete > > **Archive:** For completed phases 1-8A + Pilot 1-3, see [roadmap-archive.md](./roadmap-archive.md) --- ## Current Status | Phase | Status | Summary | |-------|--------|---------| | **1-7, 8A** | βœ… Complete | Core infra, cluster, trust, chaos testing | | **MVP, Pilot 1-4** | βœ… Complete | Consumer Health demo, dashboard, API auth, metrics | | **Aphoria A1-A4** | βœ… Complete | Observations/claims/verify/corpus/authority lens | | **Aphoria A5** | 🎯 In Progress | Flywheel: 3/4 done, A5.3 suggest skill needs validation | | **Pilot 5** | ⚑ Partial | **P5.1 Security 4/5 done**, **P5.2 Monitoring βœ…**, **P5.3 Backup/DR βœ…**, **P5.4 Runbooks βœ…**, **P5.5 Cluster Mgmt βœ…**, docs pending (P5.6, P5.7) | | **8B-C** | Planned | Distributed observability, geo-distribution | | **9** | Planned | Disaster recovery, compliance, storage management | --- ## 🎯 Aphoria: From Scanner to Knowledge Graph Client (CURRENT) > **Goal:** Transform Aphoria from "grep with Episteme vocabulary" into a real knowledge graph client that authors, stores, and audits claims with provenance and lineage. > **Vision Document:** [applications/aphoria/docs/vision-gaps.md](./applications/aphoria/docs/vision-gaps.md) > **Validation:** Maxwell scan (67 observations, 0 noise) + hand-written [claims-explained.md](./claims-explained.md) ### Completed Phases (A1-A4 + P4 β€” see [roadmap-archive.md](./roadmap-archive.md) for details) | Phase | What It Delivered | |-------|-------------------| | **A1** | `Observation` vs `AuthoredClaim` types, bridge tier mapping, `.aphoria/claims.toml` format | | **A2** | `aphoria claims create/list/explain/update/supersede/deprecate`, `aphoria-claims` skill | | **A3** | `verify.rs` engine (Pass/Conflict/Missing/Unclaimed), `aphoria verify run/map`, pre-commit hook, self-audit | | **A4** | RFC/OWASP as Episteme assertions, `AphoriaAuthorityLens`, Trust Pack export/install | | **P4** | API auth (3 roles), backup/restore scripts, Prometheus metrics + Grafana dashboard | ### Phase A5: The Flywheel > **Goal:** The system gets smarter with use. Each claim makes the next claim easier. > **Details:** [vision-gaps.md β€” Β§5](./applications/aphoria/docs/vision-gaps.md#5-the-claims-explainedmd-pattern-should-be-the-product) (claims-explained.md as the product) > **Research:** [a5-flywheel-skill-design.md](./research-requests/a5-flywheel-skill-design.md) β€” validates "skill calls CLI" hypothesis > **Key Insight:** LLM reasoning over CLI JSON output replaces ML training. The flywheel is prompt engineering, not machine learning. - [x] **A5.1 Claim Coverage Metrics**: Per-module claim density and gap reporting - [x] `coverage.rs`: `CoverageReport`, `ModuleCoverage`, `CoverageSummary` types - [x] `compute_coverage()` uses `verify_claims()` as source of truth for claim-observation matching - [x] Per-module: observation count, claim count, claimed/unclaimed, missing claims, density - [x] `aphoria coverage` CLI: table, JSON, markdown formats, `--sort-by` (name/density/unclaimed/observations) - [x] Coverage gaps section: modules with observations but no claims - [x] 8 unit tests including deprecated claim exclusion - [x] **A5.2 Auto-Generated Documentation**: `aphoria docs generate` + `aphoria claims explain` - [x] `aphoria docs generate` CLI command with `--output` and `--format` (markdown/json) - [x] `claims_explain.rs`: groups by category, includes provenance/invariant/consequence/evidence per claim - [x] `explain.rs`: reads `.aphoria/claims.toml`, renders via `render_claims_markdown()` - [x] Provenance chains preserved (supersedes references) - [ ] **A5.3 Claim Suggester Skill**: LLM-powered pattern recognition via "skill calls CLI" - [x] New skill: `.claude/skills/aphoria-suggest/SKILL.md` (3 modes: cold start / foundation / flywheel) - [x] Workflow defined: `claims list` β†’ `verify run --show-unclaimed` β†’ reason by analogy β†’ suggest - [x] Few-shot learning: existing claims as gold-standard examples for style matching - [x] Chain-of-thought: reasoning template before each suggestion - [x] Cold start bootstrap: reads README/CLAUDE.md/tests/ADRs when 0 claims - [x] Context tiers: local β†’ semantic β†’ summary β†’ global (subagent) - [x] Quality gates: non-trivial, not type-enforced, has consequence, not duplicate - [x] **VG-022 CLOSED**: `verifiable_predicates()` on Extractor trait; 10 extractors declare predicates; `verify map` shows extractorβ†’claim coverage - [x] **Dogfood claims**: 10 total claims in `.aphoria/claims.toml` (3 arch + 7 security) covering all ComparisonModes - [ ] **Validate**: Run skill against Aphoria's own codebase (dogfood) - [ ] **Validate**: Run skill against an external project (cold start test) - [ ] **Iterate**: Refine prompt based on suggestion quality from validation - [x] **A5.4 Onboarding Mode**: `aphoria explain` for new team members - [x] `explain.rs`: `generate_explanation()` reads claims, renders narrative - [x] `aphoria explain` CLI with `--output` and `--format` (markdown/json) - [x] Shows claim inventory grouped by category with provenance - [x] Empty project handling: directs to `aphoria claims create` --- ## Pilot 5: Operational Readiness > **Goal:** Complete production readiness for enterprise pilot demo. > **Context:** Pilot 1-4 complete (see [archive](./roadmap-archive.md)). > **Target:** 4-6 weeks to ship-ready state ### Enterprise Readiness: Deployment Stages | Stage | Requirements | Timeline | Customer Profile | |-------|--------------|----------|------------------| | **MVP Pilot** | P5.1 Security + P5.2 Monitoring + P5.3 Backup | βœ… Ready | Friendly pilot, tolerates manual ops | | **Production** | MVP + P5.4 Runbooks + P5.5 CLI | 4 weeks | First paying customer, self-hosted | | **Scale** | Production + Phase 8B-C | 8-10 weeks | 5-10 customers, automated operations | | **Enterprise** | Scale + Phase 9 | 6+ months | 50+ customers, SOC2/compliance required | ### Critical Path to Ship (Must-Have) **WEEK 1 - Security (P0 Blockers):** - TLS/HTTPS, request size limits, timeouts, secret sanitization, rate limiting **WEEK 2 - Monitoring (P0 Blind without these):** - Storage metrics, replication metrics, Grafana dashboards, alert rules **WEEK 3 - Backup & DR (P0 Data loss risk):** - Automated backup, backup verification, WAL archival, DR runbook, operational runbooks **WEEK 4 - Deployment (P1 Customer enablement):** - CLI tooling, reference architecture, deployment guides, pilot validation ### P5.1 Security Hardening (WEEK 1 - SHIP BLOCKERS) **Priority: P0 - Cannot ship without these** **Status: 🎯 4/5 Complete** (TLS, Limits, Timeouts, Rate Limiting done; Secret Sanitization pending) - [x] **TLS/HTTPS Configuration** (Partial - 2024-02-11) - [x] Add TLS 1.3 to stemedb-api (axum-server with rustls) - `main.rs:114-123` - [x] Load from env vars: `STEMEDB_TLS_CERT_PATH` / `STEMEDB_TLS_KEY_PATH` - [ ] HTTP β†’ HTTPS redirect (deferred - not critical for pilot) - [ ] Let's Encrypt integration for pilot deployments (deferred - manual cert setup OK) - [ ] Certificate rotation documentation (deferred) - [ ] Test with self-signed certs in CI (deferred - Layer 4 tests) - [x] **Request Size Limits** (Complete - 2024-02-11) - [x] Add `RequestBodyLimitLayer` to write endpoints (1MB default) - `routers.rs:371` - [x] Add `RequestBodyLimitLayer` to read endpoints (64KB default) - `routers.rs:400` - [x] Make limits configurable: `STEMEDB_WRITE_BODY_LIMIT` / `STEMEDB_READ_BODY_LIMIT` - [x] Created `SecurityConfig` struct with defaults - `routers.rs:35-56` - [x] Updated all 8 `create_router_*` functions to accept config - [x] Documented in `.env.example` - [ ] Document limits in OpenAPI spec (deferred - not critical) - [x] **Timeout Configuration** (Complete - 2024-02-11) - [x] Add `TimeoutLayer` to HTTP routes (configurable, default 30s) - `routers.rs:115,143,199,etc` - [x] Wrap all `store.get()/put()` with `tokio::time::timeout(5s)` - `store_helpers.rs` - [x] Added timeout helpers: `store_get_with_timeout()` / `store_put_with_timeout()` - [x] Updated 6+ handler locations (source.rs, health.rs, report.rs, source_registry/handlers.rs) - [x] Add timeout metrics: `stemedb_operation_timeouts_total{operation="store_get|store_put"}` - [x] Make HTTP timeout configurable: `STEMEDB_HTTP_TIMEOUT_SECS` - [x] Added `ApiError::Timeout` variant with 408 REQUEST_TIMEOUT status - `error.rs:76-80` - [ ] **Secret Sanitization** (Deferred - not blocking for pilot) - [ ] Remove API key logging from `api_key.rs:271` (log hash, not prefix) - [ ] Audit all `debug!`/`info!` for credential leaks - [ ] Add test: `cargo test -- --nocapture | grep -E "key|secret|password"` (should fail) - **Note:** Existing code already logs hashes, audit needed to confirm no leaks - [x] **Rate Limiting** (Complete - 2024-02-11) - [x] Rate limit `/v1/health` to 1 req/sec per IP (prevent metrics flooding) - `routers.rs:352` - [x] Make configurable: `STEMEDB_HEALTH_RATE_LIMIT` (default: 1) - [x] Uses `RateLimitState` and `rate_limit_middleware` - `middleware/rate_limit.rs` - [x] Metric already exists: `stemedb_rate_limit_rejections_total{endpoint}` - `rate_limit.rs:87` **Implementation Notes:** - All security features are now **configurable via environment variables** with sensible defaults - Build succeeds, all features tested manually - Integration tests stubbed in `tests/security_hardening.rs` (21 tests marked `#[ignore]`) - Secret sanitization deferred as existing code appears safe (uses hashes), but full audit recommended ### P5.2 Monitoring Foundation (WEEK 2 - CRITICAL) βœ… COMPLETE **Priority: P0 - Flying blind without these** **Status: βœ… Complete** (All layers implemented: WAL metrics, storage metrics, HTTP SLI, error tracking, Grafana dashboards, Prometheus alerts, runbooks, validation scripts) **Implementation:** [P5.2-IMPLEMENTATION-SUMMARY.md](./P5.2-IMPLEMENTATION-SUMMARY.md) - [x] **Storage Health Metrics** (Complete - 2024-02-11) - [x] `stemedb_wal_fsync_latency_seconds` histogram (p50/p95/p99) - `journal.rs:34` - [x] `stemedb_wal_write_errors_total{error}` counter - `journal.rs:46` - [x] `stemedb_wal_disk_usage_bytes` gauge - `segment.rs:248` - [x] `stemedb_wal_segments_count` gauge - `segment.rs:249` - [x] `stemedb_wal_bytes_written_total` counter - `journal.rs:45` - [x] `stemedb_wal_writes_total` counter - `journal.rs:44` - [x] `stemedb_wal_batch_size` histogram - `group_commit.rs:201` - [x] `stemedb_wal_flush_latency_seconds` histogram - `group_commit.rs:243` - [x] `stemedb_wal_recovery_attempts_total` counter - `journal.rs:234` - [x] `stemedb_wal_recovery_duration_seconds` histogram - `journal.rs:269` - [x] `stemedb_wal_rotations_total` counter - `journal.rs:304` - [x] **Storage Operation Metrics** (Complete - 2024-02-11) - [x] `stemedb_storage_operation_duration_seconds{operation,backend}` histogram - `hybrid_backend.rs:118,138,158,180` - [x] `stemedb_storage_operations_total{operation,backend}` counter - `hybrid_backend.rs:123,143,163,185` - [x] `stemedb_index_lookup_duration_seconds{index}` histogram - `index_store.rs:212,235` - [x] Metrics added to: get(), put(), delete(), scan_prefix(), index lookups - [x] **Error Tracking** (Complete - 2024-02-11) - [x] `stemedb_errors_total{type,layer}` counter - `error.rs:99` - [x] Tracks 15 error types across 5 layers (validation, api, storage, pipeline, auth, protection) - [x] Integrated into `ApiError::IntoResponse` for automatic tracking - [x] **HTTP SLI Metrics** (Complete - 2024-02-12) - [x] Pattern implemented in `handlers/vote.rs` as reference - [x] `stemedb_http_requests_total{method,path}` counter - [x] `stemedb_http_request_duration_seconds{method,path,status}` histogram - [x] Rollout complete: 19 handlers instrumented (supersede, epoch, source, admin, escalation, gold_standard, quarantine, circuit_breaker, api_keys, audit, concepts) - [x] Total coverage: 20 handlers across 11 files - [x] **Grafana Dashboards** (Complete - 2024-02-11) - [x] `storage-health.json` - WAL fsync latency, disk usage, error rates, storage operations, index timing - [x] `cluster-overview.json` - Node status, replication lag, sync ops, Merkle diffs, gossip - [x] `sli-dashboard.json` - Request rate, latency heatmap, error rate, availability gauge, circuit breakers - [x] Import guide with troubleshooting: [docs/operations/monitoring/grafana/README.md](./docs/operations/monitoring/grafana/README.md) - [x] **Prometheus Alert Rules** (Complete - 2024-02-11) - [x] `alerts/critical.yml` - 8 alerts (API down, disk >90%, replication lag >5min, storage errors, fsync failure, split brain, memory exhaustion, cert expiring) - [x] `alerts/warning.yml` - 10 alerts (slow fsync, high error rate, slow indexes, disk >70%, lag >1min, high latency, compaction backlog, circuit breaker, trust rank decay) - [x] `alerts/info.yml` - 9 alerts (circuit breaker open, quarantine backlog, node join, memory >70%, key rotation, gold standard count, cert 30 days, WAL segments, low traffic) - [x] All alerts include: runbook links, impact description, action steps, for duration, labels - [x] **Alerting Integration** (Complete - 2024-02-11) - [x] PagerDuty configuration with 4-level escalation - [docs/operations/monitoring/alerting/pagerduty-config.yml](./docs/operations/monitoring/alerting/pagerduty-config.yml) - [x] Slack integration for 3 channels (critical/warning/info) - [docs/operations/monitoring/alerting/slack-config.yml](./docs/operations/monitoring/alerting/slack-config.yml) - [x] Escalation policy with response times, contact info, post-mortem template - [docs/operations/monitoring/alerting/escalation-policy.md](./docs/operations/monitoring/alerting/escalation-policy.md) - [x] Inhibition rules to prevent alert spam - [x] Workflow integration examples (incident channel creation, resolution tracking) - [x] **Additional Runbooks** (Complete - 2024-02-12) - [x] 8 critical/warning runbooks created in `docs/operations/runbooks/` - [x] Coverage: high-replication-lag, storage-errors, wal-fsync-failure, split-brain, memory-exhaustion, certificate-renewal, slow-fsync, high-error-rate - [x] Each includes: Severity, Symptom, Impact, Investigation, Resolution, Prevention, Escalation, References - [x] **Validation Scripts** (Complete - 2024-02-12) - [x] `scripts/setup-pagerduty.sh` - Service key validation, test incident creation, escalation policy check - [x] `scripts/setup-slack.sh` - Webhook validation, test message posting, formatting verification - [x] `scripts/test-alerting.sh` - End-to-end test (Alertmanager β†’ PagerDuty + Slack), latency measurement ### P5.3 Backup & Disaster Recovery (WEEK 3 - CRITICAL) βœ… COMPLETE **Priority: P0 - Data loss risk without these** **Completed:** 2026-02-12 - [x] **Automated Backup** - [x] Systemd timer: runs every 6 hours (00:00, 06:00, 12:00, 18:00 UTC) - [x] Systemd service: `stemedb-backup.service` with retry logic - [x] Backup retention policy: `--keep-last` flag with 30-day default - [x] S3 upload integration: `--upload-s3` flag with STANDARD_IA storage - [x] **Backup Verification** - [x] `verify-backup.sh` - Validates magic bytes, CRC32C, BLAKE3 checksums - [x] Weekly verification timer: Sunday 03:00 UTC - [x] Metrics: `stemedb_backup_verification_status`, `stemedb_backup_verification_checks_passed` - [x] Alert on verification failure: Prometheus alert rule - [x] **WAL Archival** - [x] `archive-wal-to-s3.sh` - Ships WAL segments to S3 every 15 minutes - [x] S3 bucket: `stemedb-backups-{env}/wal-archive/` - [x] Retention: 30 days in S3 STANDARD_IA - [x] Metrics: `stemedb_wal_archival_lag_seconds`, `stemedb_wal_archival_segments_uploaded_total` - [x] **Disaster Recovery Runbook** - [x] `docs/operations/runbooks/disaster-recovery.md` - Complete DR procedures - [x] RTO target: 4 hours (validated via drill script) - [x] RPO target: 15 minutes (achievable with WAL archival) - [x] 3 recovery scenarios: Full restore, Point-in-time, WAL-only - [x] Validation checklist: 9 verification steps - [x] **DR Drill** - [x] `scripts/dr-drill.sh` - Automated drill with RTO/RPO measurement - [x] Report generation: markdown format with timeline, metrics, issues - [x] Integration tests: `uat/production-readiness/backup-dr-tests.sh` (7 tests) **Deliverables:** - 6 systemd units: 3 timers + 3 services (backup, verify, archive-wal) - 4 scripts: backup, verify, archive-wal, dr-drill - Prometheus alerts: 9 alert rules in `backup-alerts.yml` - DR runbook: 3 recovery scenarios + validation checklist - Integration tests: 7 tests covering all P5.3 components ### P5.4 Operational Runbooks (WEEK 3 - CRITICAL) βœ… COMPLETE **Priority: P1 - 2am incidents require these** - [x] **Critical Runbooks** (created in `docs/operations/runbooks/`) - [x] `server-wont-start.md` - Port conflicts, TLS cert issues, disk full, WAL corruption - [x] `high-query-latency.md` - Check replication lag, shard hotspots, index health - [x] `restore-from-backup.md` - Step-by-step restore procedure with validation - [x] `add-node.md` - Node join procedure, shard rebalancing, validation - [x] `disk-full.md` - Emergency WAL cleanup, compaction trigger, quota increase - [x] `circuit-breaker-stuck.md` - Reset circuit breaker, identify root cause - [x] `quarantine-overflow.md` - Investigate quarantine queue, batch approve/reject - [x] **Troubleshooting Decision Tree** - [x] `docs/operations/troubleshooting-flowchart.md` - Complete with symptom β†’ cause β†’ runbook mapping - [x] Covers all 7 runbooks with decision trees and quick diagnostic commands ### P5.5 Cluster Management Tooling (WEEK 4 - HIGH PRIORITY) βœ… COMPLETE **Priority: P1 - Manual SSH not scalable** **Completed:** 2026-02-12 - [x] **`stemedb-admin` CLI** (new binary in `crates/stemedb-admin/`) - [x] `stemedb-admin cluster status` - Overview: node count, shard count, meta version, node table - [x] `stemedb-admin cluster health` - Quick health check (exit code 0/1) - [x] `stemedb-admin node list` - List all nodes with states (Alive/Suspect/Dead) - [x] `stemedb-admin node info` - Detailed node info with shard assignments - [x] `stemedb-admin node shards` - Show shards assigned to node (with --leader filter) - [x] `stemedb-admin shard list` - List all shards with leaders/replicas - [x] `stemedb-admin shard info` - Detailed shard info (size, assertions, replicas) - [x] `stemedb-admin shard replicas` - Show replica nodes for shard - [x] `stemedb-admin debug export --output ` - Export complete cluster state as JSON - [x] HTTP client connecting to gateway (default: http://localhost:18181) - [x] Output formats: Table (human-friendly with colors) and JSON (machine-readable) - [x] Environment variable support: `STEMEDB_GATEWAY_ADDR` - [x] Proper error handling with helpful messages (no panics) - [x] 12 integration tests covering all functionality - [x] Node lifecycle documentation: `docs/operations/node-lifecycle.md` - [x] Installation guide: `docs/operations/deployment/install-admin-cli.md` **Phase 2 Deferred:** - [ ] `stemedb-admin node drain ` - Graceful node removal (requires gateway endpoints) - [ ] `stemedb-admin shard rebalance` - Manual rebalancing trigger (requires gateway endpoints) - [ ] **Node Operations Documentation** - [ ] `docs/operations/node-lifecycle.md` - [ ] Add node procedure (pre-flight checks, join, validation) - [ ] Remove node procedure (drain, graceful leave, verification) - [ ] Replace node procedure (dead node replacement, shard recovery) - [ ] **Shard Management** (optional for pilot, defer if time-constrained) - [ ] `stemedb-admin shard rebalance` - Manual rebalancing trigger - [ ] `stemedb-admin shard freeze` - Disable auto-split during maintenance - [ ] `stemedb-admin shard move ` - Manual migration ### P5.6 Reference Architecture (WEEK 4) βœ… COMPLETE **Priority: P1 - Customer deployment guide** - [x] **Deployment Guides** (created in `docs/operations/reference-architecture/`) - [x] `single-node-pilot.md` - Pilot deployment (1 node, docker-compose, hardware specs) - [x] `three-node-cluster.md` - Small production (3 nodes, replication factor 2, HA) - [x] `network-requirements.md` - Port list (181XX), firewall rules, TLS, DNS setup - [x] **Infrastructure as Code Examples** (created in `docs/operations/deployment/`) - [x] `docker-compose/pilot-with-monitoring.yml` - Single-node with Grafana + Prometheus - [x] `nginx/stemedb.conf` - TLS 1.3, rate limiting, security headers, admin restrictions - [x] `envoy/stemedb.yaml` - Load balancing, health checks, circuit breakers, retries - [ ] `kubernetes/` - K8s manifests (StatefulSet, Service, Ingress) [DEFERRED - not needed for pilot] - [ ] `terraform/` - AWS deployment (EC2, EBS, ALB, S3) [DEFERRED - not needed for pilot] - [x] **Resource Sizing Guide** - [x] `docs/operations/reference-architecture/resource-sizing.md` - Complete with CPU/RAM/disk formulas - [x] Quick reference table: <10K, <50K, <100K, <500K, <1M assertions - [x] AWS/GCP/Azure instance recommendations - [x] Capacity planning metrics and monitoring dashboard - [x] **Reverse Proxy Configuration** - [x] `nginx/stemedb.conf` - TLS termination with Let's Encrypt, rate limiting, admin restrictions - [x] `envoy/stemedb.yaml` - Advanced load balancing, circuit breakers, health checks - [x] Let's Encrypt automation examples (certbot + cron) ### P5.7 Pilot Success Validation (WEEK 4) βœ… COMPLETE **Priority: P1 - Definition of done** - [x] **Performance Benchmarks** - Documented in `docs/operations/pilot-success-criteria.md` - [x] Sub-second query latency: p99 <1s at 10K assertions (test procedure included) - [x] Ingest throughput: 1K assertions/sec sustained (5 min load test script) - [x] Replication lag <1 second under normal load (cluster validation) - [x] **Functional Validation** - Documented in `docs/operations/pilot-success-criteria.md` - [x] Conflict detection: ConflictLens score >0.5 on contradictions (test procedure) - [x] Audit trail export: 100 assertions with signatures/provenance (validation script) - [x] Source retraction cascade: 110+ dependents (CARDIOVASC_MEGA_TRIAL example) - [x] **Operational Validation** - Documented in `docs/operations/pilot-success-criteria.md` - [x] Backup/restore roundtrip: 10K assertions β†’ backup β†’ restore β†’ verify (procedure) - [x] Node failure recovery: Kill node β†’ continue β†’ re-replicate <5min (3-node test) - [x] Rolling restart: Restart one-by-one during load test β†’ 100% success (procedure) - [x] **Demo Validation: 5 Amazement Moments** - All documented with test procedures - [x] Moment 1: Conflicting claims (FDA 0.2% vs Anecdotal 12%) - [x] Moment 2: Source retraction cascade (110 assertions flagged) - [x] Moment 3: Audit trail (provenance chain to source) - [x] Moment 4: Time-travel (query 2023 vs 2025) - [x] Moment 5: Lens-based resolution (3 lenses β†’ 3 winners) --- ## Phase 8B-C: Production Scale & Observability > **Prerequisite:** Pilot 5 complete, 1-2 production customers running > **Timeline:** 4-6 weeks after Pilot 5 ### 8B. Advanced Observability - [ ] **8B.1 Distributed Tracing** - [ ] OpenTelemetry integration (Jaeger or Tempo backend) - [ ] Trace write path: Gateway β†’ Shard Leader β†’ Followers β†’ WAL - [ ] Trace sync path: Merkle diff β†’ Fetch missing β†’ CRDT merge - [ ] Add trace IDs to all log lines (`trace_id` field) - [ ] **8B.2 Capacity Planning Metrics** - [ ] `disk_growth_rate_bytes_per_day` (7-day linear regression) - [ ] `disk_days_until_full` (projected based on growth rate) - [ ] `assertion_ingestion_rate` (assertions/sec, 24h moving average) - [ ] Dashboard: Capacity trends with projected full date - [ ] **8B.3 Performance Profiling** - [ ] Continuous profiling (pprof/flamegraph integration) - [ ] Per-shard query latency breakdown - [ ] Hot subject/predicate detection - [ ] Slow query log (queries >100ms) - [ ] **8B.4 Advanced Dashboards** - [ ] `query-performance.json` - Latency by lens, hot subjects, cache hit rate - [ ] `write-pipeline.json` - Ingest rate, WAL throughput, sync lag - [ ] `capacity-planning.json` - Growth trends, disk projections, resource utilization ### 8C. Production Hardening - [ ] **8C.1 Point-in-Time Recovery (PITR)** - [ ] WAL segment archival to S3 (every 15 min or 100 MB) - [ ] Recovery target parsing (`--target lsn:123456`, `--target 2026-02-11T14:25:00`) - [ ] WAL replay engine with checksum validation - [ ] Test: Inject corruption at known LSN, restore to LSN-1, verify consistency - [ ] **8C.2 Online Backup (Hot Backup)** - [ ] Snapshot API: `POST /v1/admin/snapshot` (trigger checkpoint, freeze writes briefly) - [ ] Shadow copy: Copy data files while DB is running - [ ] Snapshot registry: Track active snapshots, prevent WAL truncation - [ ] Zero-downtime backup workflow - [ ] **8C.3 Storage Compaction** - [ ] Automatic WAL segment cleanup (delete segments older than 7 days if checkpointed) - [ ] Tombstone removal (compact assertions with lifecycle=Superseded) - [ ] Background task: Run compaction every 6 hours - [ ] Metrics: `wal_segments_deleted_total`, `compaction_bytes_reclaimed` - [ ] **8C.4 Auto-Healing Improvements** - [ ] Detect dead node β†’ trigger re-replication β†’ restore replication factor (automated) - [ ] Circuit breaker: Don't trigger shard split if memory >80% - [ ] Clock skew detection: Reject assertions with timestamps >1s in future - [ ] Partition detection: Log when SWIM sees cluster split - [ ] **8C.5 Rolling Upgrades** - [ ] `stemedb-admin upgrade --version v0.3.0 --batch-size 1` - [ ] Pre-flight compatibility check (schema version, WAL format) - [ ] Drain node before upgrade (move shards to other nodes) - [ ] Zero-downtime upgrade workflow - [ ] **8C.6 Multi-Region (Active-Passive)** - [ ] Secondary region with continuous WAL replication - [ ] Automated failover (DNS swap when primary unavailable >5 min) - [ ] Failover time target: <10 minutes - [ ] Cost estimate: ~$500/month for active-passive --- ## Phase 9: Enterprise Scale & Compliance > **Goal:** Enterprise-grade durability, compliance, and incident response > **Prerequisite:** 5-10 production customers, predictable failure patterns ### 9A. Advanced Backup & Recovery - [ ] **9A.1 Incremental Backup** - [ ] Only backup changed blocks since last backup (rsync --link-dest pattern) - [ ] Backup time: Minutes instead of hours for 1TB database - [ ] Storage savings: 90% reduction for daily incrementals - [ ] **9A.2 Cross-Region Backup Replication** - [ ] Replicate backups to S3 in different region (S3 cross-region replication) - [ ] Storage tiers: Hot (7 days Standard), Warm (7-30 days Intelligent-Tiering), Cold (30+ days Glacier IR) - [ ] Cost estimate: ~$210/month for 11TB (7 daily + 4 weekly backups) - [ ] **9A.3 Backup Encryption** - [ ] Encrypt backups at rest (AWS KMS or customer-managed keys) - [ ] Encrypt backups in transit (TLS for S3 uploads) - [ ] Key rotation policy (90-day rotation) ### 9B. Data Corruption & Recovery - [ ] **9B.1 Deep Corruption Detection** - [ ] Validate Merkle tree checksums before accepting gossip - [ ] Periodic background validation (full DB checksum every 24h) - [ ] Metric: `corruption_detected_total{source=gossip|disk}` - [ ] **9B.2 Assertion Tombstones (Soft Delete)** - [ ] New lifecycle stage: `Deleted` (append-only, not physically removed) - [ ] Tombstone propagation via gossip (all nodes learn of deletion) - [ ] Query filtering: Lenses ignore `Deleted` assertions by default - [ ] **9B.3 Cluster Rollback** - [ ] `stemedb-admin rollback --before 2026-02-11T14:00:00` - [ ] Batch tombstone generation for all assertions after timestamp - [ ] Use case: Bulk data corruption, need to revert cluster to known-good state - [ ] **9B.4 Split-Brain Recovery** - [ ] Automatic detection: Merkle tree divergence >10% after partition heals - [ ] Manual resolution: `stemedb-admin resolve-split --prefer-node node-1` - [ ] CRDT merge with conflict log (record which assertions were merged/discarded) ### 9C. Compliance & Legal - [ ] **9C.1 GDPR Right to Erasure** - [ ] Cryptographic erasure: Each agent has unique encryption key - [ ] Delete key β†’ data unrecoverable (even though assertions remain on disk) - [ ] Compliance proof: "Key deleted on YYYY-MM-DD, data cryptographically erased" - [ ] **9C.2 Data Retention Policies** - [ ] Per-subject TTL: `retention_policy{subject="medical/*"}=7years` - [ ] Per-predicate TTL: `retention_policy{predicate="temp_session"}=1day` - [ ] Background task: Tombstone assertions past TTL - [ ] **9C.3 Immutable Audit Trail** - [ ] All admin actions logged to append-only audit store - [ ] Include: Who, what, when, why (justification field required) - [ ] Export API: `GET /v1/admin/audit?from=DATE&to=DATE` - [ ] Compliance report generator (CSV/PDF for auditors) - [ ] **9C.4 SOC 2 Type II Certification** - [ ] Security controls implementation (access control, encryption, monitoring) - [ ] 6-month observation period (demonstrate controls work consistently) - [ ] External auditor engagement (Big 4 accounting firm) - [ ] Annual recertification ### 9D. Storage Management - [ ] **9D.1 Advanced Compaction** - [ ] Multi-generation compaction: Merge small segments into larger ones - [ ] Compaction budget: Limit I/O impact (max 10% of disk bandwidth) - [ ] Metrics: `compaction_progress{generation}`, `compaction_bytes_read/written` - [ ] **9D.2 Tiered Storage** - [ ] Hot tier: NVMe SSD (last 7 days, accessed frequently) - [ ] Warm tier: SATA SSD (7-90 days, accessed occasionally) - [ ] Cold tier: S3 Glacier (90+ days, accessed rarely) - [ ] Automatic migration based on access patterns - [ ] **9D.3 Storage Quotas** - [ ] Per-agent quotas: `quota{agent="user123"}=10GB` - [ ] Cluster-wide quota: Hard limit on total DB size - [ ] Soft quota warning at 80% (alert ops team) - [ ] Hard quota rejection at 100% (reject new assertions) ### 9E. Incident Response - [ ] **9E.1 Alerting & Escalation** - [ ] PagerDuty integration (API key in config) - [ ] Slack integration (webhook URL, #stemedb-alerts channel) - [ ] Escalation policy: Warn β†’ Page primary β†’ Page backup β†’ Page manager - [ ] Alert grouping: Batch related alerts (don't page 100 times for same issue) - [ ] **9E.2 Incident Management** - [ ] Incident response playbook (`docs/operations/incident-response.md`) - [ ] Severity levels: P0 (total outage), P1 (degraded), P2 (warning) - [ ] Communication templates (customer email, status page update) - [ ] Post-mortem template (5 Whys, timeline, action items) - [ ] **9E.3 Chaos Engineering** - [ ] Monthly "game day" exercises - [ ] Scenarios: Node failure, network partition, disk full, slow disk - [ ] Use `stemedb-chaos` crate to inject failures - [ ] Document learnings, update runbooks - [ ] **9E.4 On-Call Rotation** - [ ] Define on-call schedule (primary, backup, manager escalation) - [ ] On-call playbook (what to do when paged, who to call, escalation path) - [ ] On-call compensation policy - [ ] Post-incident review process ### 9F. Security Hardening - [ ] **9F.1 mTLS for Cluster Communication** - [ ] Require client certificates for all node-to-node RPC - [ ] Certificate authority: Internal CA or Let's Encrypt - [ ] Certificate rotation: 90-day validity, automated renewal - [ ] Reject connections without valid cert (prevent rogue nodes) - [ ] **9F.2 Encryption at Rest** - [ ] WAL encryption: AES-256-GCM per segment - [ ] KV store encryption: Transparent encryption layer (redb feature or OS-level LUKS) - [ ] Key management: AWS KMS, HashiCorp Vault, or customer-managed keys - [ ] Compliance: Meets HIPAA/GDPR encryption requirements - [ ] **9F.3 Node Authentication** - [ ] Each node has Ed25519 keypair (identity) - [ ] Signed cluster join: Node signs join request with private key - [ ] Admin API: Approve/reject join requests (`stemedb-admin node approve `) - [ ] Prevent unauthorized nodes from joining cluster - [ ] **9F.4 API Security** - [ ] Rate limiting per API key (100 req/min for free tier, 10K req/min for enterprise) - [ ] Input validation: UTF-8, max lengths, regex injection protection - [ ] SQL injection prevention: Parameterized queries only (no string concatenation) - [ ] XSS prevention: Escape all user-provided content in dashboard - [ ] **9F.5 Secrets Management** - [ ] Never store secrets in code or config files - [ ] Use environment variables or secret management service (Vault, AWS Secrets Manager) - [ ] Secret rotation policy (API keys rotated every 90 days) - [ ] Audit log: Track secret access (who accessed what secret when) ### 9G. Operational Maturity - [ ] **9G.1 SLI/SLO Definitions** - [ ] Availability SLO: 99.95% uptime (21.9 min/month downtime budget) - [ ] Latency SLO: p95 query latency <100ms, p99 <500ms - [ ] Error rate SLO: <0.1% of requests fail - [ ] Dashboard: SLO compliance tracking, error budget remaining - [ ] **9G.2 Capacity Planning** - [ ] Quarterly capacity review (growth trends, resource utilization) - [ ] 6-month forecast (projected assertion count, disk usage, API load) - [ ] Auto-scaling triggers (add nodes when CPU >70% for 10 min) - [ ] Budget planning: Cloud costs per customer, per assertion - [ ] **9G.3 Performance Testing** - [ ] Load testing: Sustained 10K assertions/sec for 1 hour - [ ] Stress testing: Ramp to failure (find breaking point) - [ ] Chaos testing: Inject failures during load test - [ ] Regression testing: Compare performance across releases - [ ] **9G.4 Documentation** - [ ] Operator guide (`docs/operations/operator-guide.md`) - [ ] Troubleshooting guide (symptom β†’ diagnosis β†’ fix) - [ ] Architecture deep-dive (how it works, design decisions) - [ ] API reference (auto-generated from OpenAPI spec) - [ ] SDK usage guides (Go, Python, TypeScript) --- ## Architecture Overview ``` Write Path (Spine): Read Path (Cortex): [Agent] -> [Ingestion] [Agent] <- [Lens Engine] | | v | [WAL/Fsync] [Index Lookup] | | v | [KV Store] <--------------------+ ``` ## Port Scheme (181XX) | Offset | Service | Default | Env Var | |--------|---------|---------|---------| | +0 | HTTP API | 18180 | `STEMEDB_BIND_ADDR` | | +1 | Cluster Gateway | 18181 | `STEMEDB_NODE_API_ADDR` | | +2 | Cluster RPC | 18182 | `STEMEDB_NODE_RPC_ADDR` | | +3 | SWIM Gossip | 18183 | via `SwimConfig` | | +4 | Metrics | 18184 | (reserved) | | +5 | Admin | 18185 | (reserved) | | +6 | Latent Signal | 18186 | β€” | | +7 | Community App | 18187 | β€” | | +8 | Admin Dashboard | 18188 | β€” | ## Crates | Crate | Purpose | Status | |-------|---------|--------| | `stemedb-core` | Assertion, LifecycleStage, MaterializedView, types, signing | βœ… | | `stemedb-wal` | Write-ahead log with crash recovery | βœ… | | `stemedb-storage` | KVStore, VoteStore, IndexStore, TrustRankStore, QuarantineStore | βœ… | | `stemedb-ingest` | Ingestion pipeline, signature verification, ContentDefenseLayer | βœ… | | `stemedb-query` | Query engine, Materializer for O(1) MV reads | βœ… | | `stemedb-lens` | Lenses (Recency, Consensus, Authority, Skeptic, Layered, etc.) | βœ… | | `stemedb-api` | HTTP API with axum + utoipa OpenAPI docs | βœ… | | `stemedb-sim` | Simulation for testing the pipeline | βœ… | | `stemedb-merkle` | BLAKE3 Merkle tree for diff detection | βœ… | | `stemedb-rpc` | gRPC services for node-to-node communication | βœ… | | `stemedb-sync` | Merkle sync, gossip broadcast, anti-entropy | βœ… | | `stemedb-cluster` | Cluster membership (SWIM), sharding, gateway | βœ… | | `stemedb-ontology` | Domain definitions (Pharma), subject builders, medical extractors | βœ… | | `stemedb-chaos` | Chaos testing infrastructure | βœ… | | `stemedb-dashboard` | Admin dashboard (React/Next.js) | βœ… (7 panels) | ## Applications | App | Purpose | Status | |-----|---------|--------| | `aphoria` | Code-level truth linter β€” 42 extractors, claims, verify, coverage | 🎯 A5 flywheel | | `disputed` | Controversy explorer | Planned | ## SDKs | SDK | Purpose | Status | |-----|---------|--------| | `sdk/go/steme` | Go HTTP client with Ed25519 signing and fluent builders | βœ… | | `sdk/go/adk` | ADK-Go tools and callbacks for AI agents | βœ… | --- ## Quick Reference ```bash # Build cargo build --workspace # Test cargo test --workspace # Lint (must pass before commit) cargo clippy --workspace -- -D warnings cargo fmt --check # Run API server cargo run --bin stemedb-api # Run Aphoria scan cargo run --bin aphoria -- scan /path/to/project --show-observations # Run demo script ./scripts/demo-consumer-health.sh ``` --- ## Arena: Simulation Roadmap > **Goal:** Incrementally evolve the simulator from Spine validation to a full Agent-Based Modeling environment. > **Philosophy:** Make it run. Then add. Verify at every step. > **Alignment:** Tracks main roadmap phases; exercises features as they land. ### Current State The simulator (`stemedb-sim`) validates the full system through Arena 0-4: **Completed Arenas:** - βœ… **Arena 0**: Test infrastructure with assertions and CI integration - βœ… **Arena 1**: Query path via QueryEngine, Recency lens, lifecycle filtering, query audit - βœ… **Arena 2**: Voting & VoteAwareConsensus, troll resistance - βœ… **Arena 2.5**: Hardening (race conditions, API tests, crash recovery, input validation) - βœ… **Arena 3**: Materialized Views, fast-path verification, MV freshness - βœ… **Arena 4**: Agent personas (Scientist, Troll, Believer with differentiated strategies) **What's Tested:** - WAL durability, rkyv serialization, Ed25519 signatures - Ingestor pipeline (WAL β†’ KV async flow) - QueryEngine with multiple lenses - Lifecycle filtering, voting, consensus - Query audit trail, materialized views - Strategy-driven agent behaviors **What's Not Yet Tested:** - ❌ TrustRank (Arena 5) - ❌ Concurrent agents at scale (Arena 6) - ❌ Time-travel queries (Arena 7) - ❌ Skeptic lens & conflict scores (Arena 8) ### Upcoming Arena Phases **Arena 5: TrustRank Integration** (Next) - Initialize TrustRank for agents - Reputation adjustment after votes - TrustAwareAuthorityLens verification - Troll reputation decay over time **Arena 6: Concurrent Agents** - Tokio task per agent - Scale to 100 agents, then 1000 - Contention metrics and bottleneck identification **Arena 7: Time-Travel & Epochs** - Time-travel query verification - Epoch creation and supersession - Epoch cascade validation **Arena 8: Skeptic & Conflict** - High/low conflict scenarios - Skeptic lens surfacing outliers - Conflict score accuracy **Arena 9: Full Gameplay Loop** - Ground truth injection - Complete 5-tick scenario - Extended 1000-tick run - Emergence validation ### Alignment with Use Cases | Use Case | Arena Phase | |----------|-------------| | **Agile Agent Team** || | Lifecycle filtering | Arena 1.3 | | Query audit trail | Arena 1.4 | | Time-travel debugging | Arena 7.1 | | Expert weighting | Arena 5.3 | | **Financial Due Diligence** || | Conflict detection | Arena 8.1, 8.3 | | Epoch cascades | Arena 7.2, 7.3 | **Run command:** `cargo run --bin stemedb-sim` **Test suite:** `cargo test -p stemedb-sim` --- ## Related Documents - [CLAUDE.md](./CLAUDE.md) β€” AI assistant instructions and project rules - [roadmap-archive.md](./roadmap-archive.md) β€” Completed phases 1-8A + Pilot 1-3 - [applications/aphoria/docs/vision-gaps.md](./applications/aphoria/docs/vision-gaps.md) β€” Aphoria vision gap analysis - [claims-explained.md](./claims-explained.md) β€” Hand-written Maxwell claims (the gold standard) - [docs/demo/pilot/amazement-demo.md](./docs/demo/pilot/amazement-demo.md) β€” Technical demo script - [docs/demo/pilot/amazement-demo-2.md](./docs/demo/pilot/amazement-demo-2.md) β€” Executive demo script - [uat/production-readiness/README.md](./uat/production-readiness/README.md) β€” Production verification checklist