stemedb/roadmap.md
jordan 157dbbb9eb feat: Complete Aphoria Phase 8-9 + UAT suite (90/90 tests passing)
2026-02-06 22:50:55 -07:00


Episteme (StemeDB) Roadmap

Goal: Build the "Git for Truth" substrate for autonomous AI research.
Current Focus: Enterprise Pilot Preparation
Target Vertical: BioTech/Pharma ("The Living Review")
Endgame: Distributed multi-writer cluster for millions of concurrent agents

Infrastructure Status: Phases 1-7 complete | Phase 8A (Chaos) complete
Pilot Status: Consumer Health MVP complete | Enterprise Demo in progress

Archive: For completed phases 1-7, see roadmap-archive.md


Current Status

| Phase | Status | Summary |
|-------|--------|---------|
| 1-7 | Complete | Core infrastructure, distributed cluster, trust & safety |
| 8A | Complete | Chaos testing, Jepsen-style verification |
| MVP | Complete | Consumer Health demo with real FDA data |
| Pilot Prep | 🎯 In Progress | Dashboard, impact analysis, production hardening |
| 8B-C | Planned | Observability, geo-distribution |
| 9 | Planned | Disaster recovery, compliance, storage management |

🎯 Phase: Enterprise Pilot Preparation (CURRENT)

Goal: Make the pilot bulletproof. Amaze enterprise decision makers.
Timeline: 5 weeks
Success Criteria: Dr. Sarah Chen (skeptical VP of Data Infrastructure) fights her CFO for budget

The 5 Amazement Moments We Must Deliver

| # | Moment | Current State | Gap |
|---|--------|---------------|-----|
| 1 | Contradictions visible with confidence scores | Complete | Dashboard scaffold + Skeptic Query UI |
| 2 | Cascade invalidation when source retracted | Complete | Full UI: Sources page + impact dialog (P3.1-3.3) |
| 3 | Full FDA-ready audit trail | Complete | Audit Trail Browser (P1.6) |
| 4 | Point-in-time queries + decay | API ready | No timeline UI |
| 5 | Malicious agent blocked by circuit breaker | Complete | Circuit Breaker Status (P1.5) |

Pilot-1: Demo Dashboard (Week 1-2)

Deliverable: React admin dashboard that makes the API visual

  • P1.1 Dashboard Scaffold: Next.js + shadcn/ui project setup

    • Project structure: applications/stemedb-dashboard/
    • API client for StemeDB endpoints (src/lib/api/client.ts)
    • Authentication scaffold (API key header)
    • Dark mode (default), responsive layout with collapsible sidebar
    • shadcn/ui components: button, card, badge, input, separator, tabs
    • Live API status indicator (polls /health every 30s)
    • Port 18188, builds and runs successfully
  • P1.2 Skeptic Query Visualization: Show contradictions graphically

    • Query builder: subject, predicate inputs
    • Conflict score gauge (0.0-1.0 with color coding)
    • Claims table with weight bars, source tier badges
    • "CONTESTED" / "AGREED" / "UNANIMOUS" status badges
    • Expandable claim rows with source details, agents, provenance hashes
    • Loading skeleton, empty state, error state with retry
  • P1.3 Layered Consensus View: Per-tier breakdown

    • Tier accordion showing each source class (T0→T5, empty tiers hidden)
    • Within-tier conflict score (compact gauge in accordion header)
    • Cross-tier conflict visualization (full gauge with stats)
    • Extended ConflictGauge with variant prop for reuse
  • P1.4 Quarantine Admin Panel: Content defense visibility

    • Pending queue with reason, timestamp, quality score
      • quarantine-panel.tsx, quarantine-list.tsx, quarantine-row.tsx
    • Approve/Reject buttons with confirmation
      • ConfirmationDialog with restore/delete actions
    • Filter by reason (duplicate, spam, untrusted high-confidence)
      • quarantine-filters.tsx with dropdown selector
    • Metrics: pending count, approved/rejected today
      • quarantine-metrics.tsx with MetricCard grid
  • P1.5 Circuit Breaker Status: Trust & safety dashboard

    • Blocked agents list with failure count, retry time
      • circuit-list.tsx, circuit-card.tsx with full details
    • State badges: OPEN (red), HALF_OPEN (yellow), CLOSED (green)
      • state-badge.tsx with color-coded variants
    • Manual reset button for admin override
      • circuit-panel.tsx - handleReset calls API
    • Summary with state counts
      • circuit-summary.tsx replaces historical events (more useful)
    • Auto-refresh every 10 seconds
  • P1.6 Audit Trail Browser: Query provenance explorer

    • Recent queries list with agent, timestamp, subject
      • audit-list.tsx, audit-row.tsx with pagination
    • Drilldown: contributing assertions, weights, winner
      • Expandable row details in audit-row.tsx
    • Filter by agent, time range, subject
      • audit-filters.tsx with 1h/24h/7d/30d/all options
    • Export to JSON/CSV
      • audit-export.tsx with proper escaping
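
The OPEN / HALF_OPEN / CLOSED badges in P1.5 correspond to the usual circuit breaker pattern. The following is a minimal sketch of that state machine, not the StemeDB implementation: the `CircuitBreaker` type, the failure threshold, and the fixed retry delay are all assumptions made for illustration.

```rust
// Hypothetical sketch of a three-state circuit breaker; names and
// thresholds are illustrative and not taken from the StemeDB code.
#[derive(Debug, Clone, Copy, PartialEq)]
enum CircuitState {
    Closed,   // agent is trusted; requests flow normally
    Open,     // agent is blocked until the retry time has passed
    HalfOpen, // retry time has passed; probe requests are allowed
}

struct CircuitBreaker {
    state: CircuitState,
    failure_count: u32,
    failure_threshold: u32, // assumed: failures needed to open the circuit
    retry_after_secs: u64,  // assumed: fixed cool-down before probing
    opened_at_secs: u64,
}

impl CircuitBreaker {
    fn new(failure_threshold: u32, retry_after_secs: u64) -> Self {
        CircuitBreaker {
            state: CircuitState::Closed,
            failure_count: 0,
            failure_threshold,
            retry_after_secs,
            opened_at_secs: 0,
        }
    }

    /// Returns true if a request may proceed at time `now_secs`.
    fn allow(&mut self, now_secs: u64) -> bool {
        match self.state {
            CircuitState::Closed | CircuitState::HalfOpen => true,
            CircuitState::Open => {
                if now_secs >= self.opened_at_secs + self.retry_after_secs {
                    self.state = CircuitState::HalfOpen;
                    true
                } else {
                    false
                }
            }
        }
    }

    /// A failure while probing, or reaching the threshold, opens the circuit.
    fn record_failure(&mut self, now_secs: u64) {
        self.failure_count += 1;
        if self.state == CircuitState::HalfOpen
            || self.failure_count >= self.failure_threshold
        {
            self.state = CircuitState::Open;
            self.opened_at_secs = now_secs;
        }
    }

    /// A success clears the failure count and closes the circuit.
    fn record_success(&mut self) {
        self.failure_count = 0;
        self.state = CircuitState::Closed;
    }

    /// Manual admin override, analogous to the dashboard reset button.
    fn reset(&mut self) {
        self.record_success();
    }
}

fn main() {
    let mut cb = CircuitBreaker::new(3, 30);
    for t in 0..3 {
        cb.record_failure(t);
    }
    println!("after 3 failures: {:?}", cb.state);
    println!("allowed at t=10: {}", cb.allow(10));
    println!("allowed at t=32: {} ({:?})", cb.allow(32), cb.state);
    cb.reset();
    println!("after manual reset: {:?}", cb.state);
}
```

Taking the clock as a parameter (`now_secs`) keeps the sketch deterministic and easy to test; a real service would more likely read a monotonic clock.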

Pilot-2: Demo Data Seeder (Week 2)

Deliverable: Pre-signed realistic demo data using Go SDK
Status: All complete

  • P2.1 Demo Keypair Management: Reproducible demo keys

    • 5 demo agents with realistic naming convention:
      • fda:drug-label-ingestor (Tier 0 - Regulatory)
      • pubmed:abstract-indexer (Tier 1 - Clinical)
      • clinicaltrials:study-importer (Tier 1 - Clinical)
      • reddit:health-discussion-scraper (Tier 5 - Anecdotal)
      • internal:clinical-ops-reviewer (Tier 3 - Expert)
    • Keys stored in demo/keys/ with README documenting each agent's role/scope
      • demo/keys/agents.json with seeds, public keys, tiers, descriptions
      • demo/keys/README.md with full documentation
      • demo/keys/keygen.go for deterministic regeneration
    • Go SDK script: cmd/demo-seed/main.go
      • Loads keys from agents.json
      • Creates 260+ assertions with realistic data
    • One-command setup: ./scripts/run-demo.sh (start DB → seed → open dashboard)
      • Build detection, health check, auto-cleanup on exit
      • --clean flag for fresh start, --no-open to skip browser
  • P2.2 Conflict Scenarios: Pre-built disagreements with real data

    • 3 drugs: semaglutide (45), tirzepatide (38), liraglutide (32) assertions
    • 150+ assertions total using real FDA label excerpts
    • ClinicalTrials.gov summaries (STEP, SURMOUNT, SELECT, LEADER trials)
    • Killer conflicts: Weight loss (FDA 14.9% vs STEP UP 20.7% vs Reddit variable), Gastroparesis (FDA 0.2% vs UBC 3x risk)
    • 4 genuine conflicts per drug (weight loss, nausea, gastroparesis, CV benefit)
    • Source registry with 30+ deterministic hash sources across T0-T5 tiers
  • P2.3 Retractable Sources: Set up cascade demo

    • New CARDIOVASC_MEGA_TRIAL source in sources.go (landmark multi-drug CV outcomes study)
    • 110 assertions citing this source across 8 categories (visceral cascade effect)
      • Primary/Secondary CV Outcomes (30), Biomarkers (15), Subgroup Analyses (20)
      • Expert Guidelines (15), Real-World Evidence (15), Comparative Efficacy (10), Community (5)
    • 5 agents represented: T0 (FDA), T1 (ClinicalTrials), T2 (PubMed), T3 (Internal), T5 (Reddit)
    • printCascadeDemoCommands() outputs curl commands for demo flow
    • Demo documentation updated in amazement-demo.md
    • Note: API endpoints (P3.1) complete; live demo ready
  • P2.4 Historical Data: Time-travel via lifecycle evolution

    • Approach: Use lifecycle states (Proposed → Approved → Deprecated), not fake timestamps
    • Each lifecycle transition auditable with real timestamps (signature timestamps)
    • Demo scenario: Wegovy CV indication change (pre-March 2024 vs post-SELECT)
    • 8 historical scenarios: CV indication, SELECT trial evolution, ADA guidelines, Tirzepatide expansion
    • 17 historical assertions showing lifecycle progression
    • Demo commands for as_of queries
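
The lifecycle approach in P2.4 (Proposed → Approved → Deprecated, each transition carrying a real timestamp) can be illustrated with a small point-in-time lookup. This is a sketch under assumptions: the `Transition` struct, the `stage_as_of` function, and the timestamps are hypothetical and only show how an as_of query could resolve against a transition history.

```rust
// Hypothetical sketch: resolve the lifecycle stage in effect at a given time.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Lifecycle {
    Proposed,
    Approved,
    Deprecated,
}

/// One recorded transition: the stage entered and when (Unix seconds).
struct Transition {
    stage: Lifecycle,
    at_secs: u64,
}

/// Returns the stage in effect at `as_of_secs`, or None if the assertion
/// had not been recorded yet. Assumes `history` is sorted by ascending time.
fn stage_as_of(history: &[Transition], as_of_secs: u64) -> Option<Lifecycle> {
    history
        .iter()
        .take_while(|t| t.at_secs <= as_of_secs)
        .last()
        .map(|t| t.stage)
}

fn main() {
    // Illustrative timestamps only; not real demo data.
    let history = vec![
        Transition { stage: Lifecycle::Proposed, at_secs: 100 },
        Transition { stage: Lifecycle::Approved, at_secs: 200 },
        Transition { stage: Lifecycle::Deprecated, at_secs: 300 },
    ];
    for t in [50, 150, 250, 300] {
        println!("as_of={} -> {:?}", t, stage_as_of(&history, t));
    }
}
```

Because no stored record is rewritten, the same history answers both "what is true now" and "what was true then", which is the property the time-travel demo relies on.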

Pilot-3: Impact Analysis (Week 3)

Deliverable: Automatic cascade when source is retracted
Critical: This unblocks P2.3 (retractable sources demo data)

  • P3.1 Impact Analysis Endpoint: GET /v1/sources/{hash}/impact

    • Returns all assertions citing this source (verified: 110 assertions for CARDIOVASC_MEGA_TRIAL)
    • Returns count of queries that used those assertions
    • Returns list of affected agents/recommendations (verified: 4 agents)
    • Implementation in stemedb-api/src/handlers/source_registry/handlers.rs:237-439
    • POST /v1/sources/{hash}/quarantine with preview mode (preview=true shows impact without changes)
    • Preview response: "This will affect X assertions and Y agent recommendations"
    • Undo capability: POST /v1/sources/{hash}/restore (verified: restores 110 assertions)
    • 17 unit/integration tests passing
  • P3.2 Cascade Flagging: Automatic downstream impact

    • When source status → quarantined, flag citing assertions
      • Implemented query-time lookup (not index mutation) to preserve append-only immutability
      • SourceStatusEnricher service batch-lookups source statuses from SourceRegistry
      • SourceWarningDto attached to assertions with warning_type, message, source_label, status_updated_at
    • New field on assertion index: superseded by query-time enrichment (preserves immutability)
    • Queries can filter by exclude_quarantined_sources=true
      • Added to QueryParams in dto/query_params.rs
      • Post-retrieval filter applied after query execution
    • Define query behavior: quarantined sources show with warning (not silently omitted)
      • source_warning field added to AssertionResponse and ClaimSummaryDto
      • Skeptic endpoint enriches claims with warnings
    • Export affected items list for regulatory documentation (CSV/JSON)
      • GET /v1/sources/{hash}/impact/export?format=csv|json
      • Returns ImpactExportRow with assertion_hash, subject, predicate, agent_id, timestamp, lifecycle, confidence
      • CSV includes proper escaping, JSON returns array of objects
  • P3.3 Impact Dashboard Widget: Visualize the cascade

    • Source status change UI (Active → Quarantined)
      • components/sources/status-badge.tsx with color-coded badges
      • components/sources/tier-badge.tsx with T0-T5 labels
    • Confirmation dialog: "This will affect 234 downstream assertions and 12 recommendations"
      • components/sources/quarantine-dialog.tsx with impact preview
      • Warning box shows exact affected counts from API
    • Choice: "Quarantine immediately" or "Review affected items first"
      • Dual action buttons in dialog
      • "Review first" opens ImpactDetailPanel with full assertion list
    • Animated "impact ripple" showing affected count
      • components/sources/impact-ripple.tsx with Tailwind animate-ping
      • Triggers on dialog open, counts pulse with amber styling
    • List of impacted queries with timestamp
      • components/sources/impact-detail-panel.tsx shows affected assertions table
      • Affected agents shown as chips
    • "Remediation status" tracking
      • Source status visible in list, metrics show quarantined count
      • components/sources/sources-metrics.tsx with Active/Deprecated/Quarantined counts
    • Audit trail: WHO retracted, WHEN, and WHY
      • RestoreDialog and QuarantineDialog capture reason field
      • Export to CSV/JSON for regulatory documentation
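
The query-time approach described in P3.2 (look up source status when reading, attach a warning, optionally filter) can be sketched as follows. The types here are simplified stand-ins rather than the actual `SourceStatusEnricher` or `SourceWarningDto`, and treating an unregistered source as active is an assumption of this sketch.

```rust
use std::collections::HashMap;

// Hypothetical, simplified stand-ins for the types named in P3.2.
#[derive(Debug, Clone, Copy, PartialEq)]
enum SourceStatus {
    Active,
    Deprecated,
    Quarantined,
}

#[derive(Debug, Clone)]
struct Assertion {
    hash: String,
    source_hash: String,
}

#[derive(Debug, Clone, PartialEq)]
struct EnrichedAssertion {
    hash: String,
    source_warning: Option<String>,
}

/// Enriches assertions at query time without mutating stored data.
/// Quarantined sources produce a warning, or are dropped when
/// `exclude_quarantined_sources` is set.
fn enrich(
    assertions: &[Assertion],
    registry: &HashMap<String, SourceStatus>,
    exclude_quarantined_sources: bool,
) -> Vec<EnrichedAssertion> {
    assertions
        .iter()
        .filter_map(|a| {
            // Assumption: sources missing from the registry count as active.
            let status = registry
                .get(&a.source_hash)
                .copied()
                .unwrap_or(SourceStatus::Active);
            match status {
                SourceStatus::Quarantined if exclude_quarantined_sources => None,
                SourceStatus::Quarantined => Some(EnrichedAssertion {
                    hash: a.hash.clone(),
                    source_warning: Some(format!(
                        "source {} is quarantined",
                        a.source_hash
                    )),
                }),
                _ => Some(EnrichedAssertion {
                    hash: a.hash.clone(),
                    source_warning: None,
                }),
            }
        })
        .collect()
}

fn main() {
    let mut registry = HashMap::new();
    registry.insert("src-a".to_string(), SourceStatus::Active);
    registry.insert("src-b".to_string(), SourceStatus::Quarantined);
    registry.insert("src-c".to_string(), SourceStatus::Deprecated);

    let assertions = vec![
        Assertion { hash: "a1".into(), source_hash: "src-a".into() },
        Assertion { hash: "a2".into(), source_hash: "src-b".into() },
        Assertion { hash: "a3".into(), source_hash: "src-c".into() },
    ];

    println!("{:#?}", enrich(&assertions, &registry, false));
    println!("{:#?}", enrich(&assertions, &registry, true));
}
```

Restoring a source then only requires changing its registry entry; the next query sees the assertions without warnings, and nothing in the append-only store needs to be rewritten.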

Pilot-4: Production Hardening (Week 4)

Deliverable: Load testing, authentication, backup documentation

  • P4.1 Load Testing: Prove performance claims

    • Go-based load tester with native Ed25519 signing (cmd/load-test/)
    • Benchmark: 10K assertions baseline latency (p99 < 200ms target)
    • Benchmark: 1K writes/sec sustained for configurable duration
    • Benchmark: 100 concurrent readers, <2x degradation target
    • Markdown report generator with pass/fail status
    • One-command runner: ./scripts/run-load-test.sh
    • Results saved to uat/production-readiness/results/
  • P4.2 API Authentication: Basic security for pilot

    • API key middleware (X-API-Key header)
    • Per-key rate limiting (separate from per-agent quota)
    • Admin keys vs read-only keys
    • Key management: POST /v1/admin/api-keys
  • P4.3 Backup/Restore Documentation: DR story

    • Document WAL-based recovery procedure
    • Script: scripts/backup-stemedb.sh (snapshot + WAL archive)
    • Script: scripts/restore-stemedb.sh (restore from backup)
    • Test restore procedure, document in UAT
  • P4.4 Prometheus Metrics: Observability baseline

    • GET /metrics endpoint in Prometheus text format
    • Key metrics: assertions_total, queries_total, query_latency_seconds
    • Trust metrics: quarantine_pending, circuit_breakers_open
    • Basic Grafana dashboard template
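
For P4.4, the body returned by GET /metrics would be in the Prometheus text exposition format. The sketch below renders two of the metric names listed above by hand to show the shape of that output; the `Metric` struct, the help strings, and the values are placeholders, and the roadmap itself names the `metrics` and `metrics-exporter-prometheus` crates for the later distributed work (8B.1) rather than hand-rolled rendering.

```rust
// Hypothetical sketch: hand-rendered Prometheus text exposition output.
enum Kind {
    Counter, // monotonically increasing total
    Gauge,   // value that can go up and down
}

struct Metric {
    name: &'static str,
    help: &'static str,
    kind: Kind,
    value: u64,
}

/// Renders each metric as a HELP line, a TYPE line, and a sample line.
fn render(metrics: &[Metric]) -> String {
    let mut out = String::new();
    for m in metrics {
        let kind = match m.kind {
            Kind::Counter => "counter",
            Kind::Gauge => "gauge",
        };
        out.push_str(&format!(
            "# HELP {} {}\n# TYPE {} {}\n{} {}\n",
            m.name, m.help, m.name, kind, m.name, m.value
        ));
    }
    out
}

fn main() {
    // Placeholder help text and values, for illustration only.
    let metrics = [
        Metric {
            name: "assertions_total",
            help: "Total assertions ingested",
            kind: Kind::Counter,
            value: 260,
        },
        Metric {
            name: "quarantine_pending",
            help: "Assertions awaiting quarantine review",
            kind: Kind::Gauge,
            value: 3,
        },
    ];
    print!("{}", render(&metrics));
}
```

A latency metric such as query_latency_seconds would normally be exposed as a histogram or summary, which adds bucket or quantile lines and is left out of this sketch.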

Pilot-5: Operational Readiness (Week 5)

Deliverable: Runbooks, monitoring, reference architecture

  • P5.1 Operational Runbooks: Common procedures documented

    • "Server won't start" troubleshooting
    • "High query latency" investigation
    • "Quarantine queue overflow" handling
    • "Circuit breaker stuck open" resolution
    • "Restore from backup" step-by-step
  • P5.2 Reference Architecture: Deployment guide

    • Single-node pilot deployment diagram
    • Network requirements (ports, firewall rules)
    • Reverse proxy configuration (nginx/envoy with TLS)
    • Resource sizing guide (CPU, memory, disk)
  • P5.3 Pilot Success Criteria Document: Definition of done

    • Sub-second query latency at 10K assertions: measured
    • Successful conflict detection on known contradictory studies: demonstrated
    • Complete audit trail export for mock regulatory review: tested
    • Source retraction workflow: exercised
  • P5.4 Executive Demo Script Validation: End-to-end rehearsal

    • Run through amazement-demo-2.md with real dashboard
    • Time each segment (target: 20 minutes total)
    • Record demo video for async sharing (backup if live demo fails)
    • All 5 Aha Moments demonstrable with real data (not mockups)
    • Enterprise Skeptic Questions (must have documented answers):
      • What's the data ingestion latency? (FDA update → queryable)
      • What happens when agents disagree on interpretation?
      • Can I export an audit report for regulators? (PDF/CSV)
      • What's the failure mode if service goes down mid-demo?
      • How do I verify demo data is representative of my real data?
      • If I retract a source, what happens to queries that would have used it?

Pilot Prep Deliverables Summary

| Week | Deliverable | Owner | Acceptance Criteria |
|------|-------------|-------|---------------------|
| 1-2 | stemedb-dashboard | Frontend | 6 functional panels, connects to API (P1.1-P1.6) |
| 2 | demo-seed (P2.1-P2.4) | SDK | 260+ assertions, 3 drugs, real FDA content, lifecycle history, cascade data |
| 3 | Impact Analysis (P3.1) | Backend | /v1/sources/{hash}/impact + quarantine/restore endpoints |
| 3 | Cascade Flagging (P3.2) | Backend | Source warnings, exclude filter, impact export |
| 3 | Impact Dashboard (P3.3) | Frontend | Sources page, quarantine dialog, impact ripple, export |
| 3 | demo-seed (P2.3) | SDK | Retractable source with 110 cascade assertions |
| 4 | Load Test Results | QA | cmd/load-test/ + scripts/run-load-test.sh |
| 4 | API Authentication | Backend | API keys work, rate limiting functional |
| 4 | Backup/Restore | Ops | Documented and tested procedure |
| 4 | Metrics Endpoint | Backend | /metrics returns Prometheus format |
| 5 | Runbooks | Ops | 5 runbooks in docs/runbooks/ |
| 5 | Reference Architecture | Docs | Deployment guide complete |
| 5 | Demo Rehearsal | All | 20-minute demo runs smoothly |
| 5 | One-Command Demo | Ops | ./scripts/run-demo.sh works (P2.1) |

Demo Data Quality Checklist (from Enterprise Skeptic Review)

  • Real FDA label excerpts (public domain) - not synthetic
  • ClinicalTrials.gov summaries for plausibility
  • Agent names map to real-world roles (fda:drug-label-ingestor) - P2.1
  • Conflicts are genuine (not "100% vs 0%" manufactured disagreements) - P2.2
  • Cascade demo shows 100+ affected items (visceral impact) - 110 assertions
  • Export capability for regulatory documentation (CSV/JSON) - P3.2
  • Recovery story: what happens if demo breaks mid-presentation?

Phase 8B-C: Production Observability (Planned)

Blocked by: Pilot Prep (need real production deployment first)

8B. Observability

  • 8B.1 Distributed Metrics: Per-node, per-range, per-agent metrics.

    • sync_lag_seconds{peer}, merkle_diff_size{peer}, convergence_latency_p99
    • assertions_total{node}, writes_per_second{node}
    • Crate: metrics + metrics-exporter-prometheus
  • 8B.2 Admin Dashboard: Cluster health visibility.

    • GET /v1/admin/cluster → node list, range assignments, leader locations
    • GET /v1/admin/ranges → range sizes, split/merge history
    • POST /v1/admin/sync → force anti-entropy sync

8C. Production Hardening

  • 8C.1 Snapshot/Restore: Fast replica bootstrap.

    • Serialize full node state as snapshot
    • New nodes join by restoring snapshot + replaying recent WAL
  • 8C.2 Backpressure: Don't overwhelm slow nodes.

    • Track per-peer sync queue depth
    • Throttle gossip to slow peers
  • 8C.3 Geo-Distribution: Multi-region deployment.

    • Regional clusters with CRDT federation
    • Locality-aware reads

Phase 9: The Bunker (Disaster Planning)

Goal: Survive the worst. Backup, restore, recover from corruption, comply with regulations.

9A. Backup & Cold Storage

  • 9A.1 Full Cluster Backup: Point-in-time snapshot to S3/GCS.
  • 9A.2 Point-in-Time Recovery (PITR): Restore to any HLC timestamp.
  • 9A.3 Backup Verification: Weekly automated restore tests.

9B. Data Corruption & Rollback

  • 9B.1 Corruption Detection: Deep validation before accepting gossip.
  • 9B.2 Assertion Tombstones: "Delete" in an append-only world.
  • 9B.3 Cluster Rollback: Batch tombstone generation for time ranges.
  • 9B.4 Fork Recovery: Heal split-brain after extended partition.

9C. Compliance

  • 9C.1 GDPR Right to Erasure: Cryptographic erasure via per-agent keys.
  • 9C.2 Data Retention Policies: Per-subject/predicate retention rules.
  • 9C.3 Audit Trail for Compliance: Immutable admin action log.
  • 9C.4 SOC 2 Type II Certification: External audit and certification.
    • Gap assessment and remediation
    • Evidence collection automation
    • Auditor engagement
    • Target: Q3 2026

9D. Storage Management

  • 9D.1 Compaction: Reclaim space from tombstoned data.
  • 9D.2 Tiered Storage: Hot/warm/cold based on access patterns.
  • 9D.3 Storage Quotas: Per-agent and cluster-wide limits.
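
Items 9B.2 (tombstones) and 9D.1 (compaction) fit together: in an append-only log a "delete" is itself an appended record, and space is only reclaimed later. The sketch below illustrates that idea under stated assumptions; the `Record` enum and both functions are hypothetical, and keeping tombstones after compaction is a design choice of this example, not something the roadmap specifies.

```rust
use std::collections::HashSet;

// Hypothetical sketch of tombstones in an append-only log.
#[derive(Debug, Clone, PartialEq)]
enum Record {
    Assertion { hash: String },
    Tombstone { target_hash: String },
}

/// Collects the hashes that have been tombstoned anywhere in the log.
fn tombstoned(log: &[Record]) -> HashSet<&str> {
    log.iter()
        .filter_map(|r| match r {
            Record::Tombstone { target_hash } => Some(target_hash.as_str()),
            _ => None,
        })
        .collect()
}

/// Read path: assertions are visible unless a tombstone targets them.
fn live_hashes(log: &[Record]) -> Vec<String> {
    let dead = tombstoned(log);
    log.iter()
        .filter_map(|r| match r {
            Record::Assertion { hash } if !dead.contains(hash.as_str()) => {
                Some(hash.clone())
            }
            _ => None,
        })
        .collect()
}

/// Compaction: produce a new log without tombstoned assertions.
/// Tombstones are kept (an assumption of this sketch) so that a copy of
/// the deleted assertion arriving later is still suppressed.
fn compact(log: &[Record]) -> Vec<Record> {
    let dead = tombstoned(log);
    log.iter()
        .filter(|r| match r {
            Record::Assertion { hash } => !dead.contains(hash.as_str()),
            Record::Tombstone { .. } => true,
        })
        .cloned()
        .collect()
}

fn main() {
    let log = vec![
        Record::Assertion { hash: "a1".into() },
        Record::Assertion { hash: "a2".into() },
        Record::Tombstone { target_hash: "a1".into() },
    ];
    println!("live: {:?}", live_hashes(&log));
    println!("compacted: {:?}", compact(&log));
}
```

The original log is never edited in place; `compact` returns a new log, which mirrors how compaction typically writes fresh files and swaps them in.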

9E. Incident Response

  • 9E.1 Alerting & Escalation: PagerDuty/Slack integration.
  • 9E.2 Operational Runbooks: Documented procedures for common failures.
  • 9E.3 Chaos Engineering: Monthly "game days" with controlled failures.

9F. Security Hardening

  • 9F.1 TLS Everywhere: mTLS for node-to-node traffic.
  • 9F.2 Encryption at Rest: WAL and KV store encryption.
  • 9F.3 Node Authentication: Ed25519 keypair identity, signed cluster join.

Architecture Overview

Write Path (Spine):           Read Path (Cortex):
[Agent] -> [Ingestion]        [Agent] <- [Lens Engine]
              |                              |
              v                              |
         [WAL/Fsync]                  [Index Lookup]
              |                              |
              v                              |
         [KV Store] <--------------------+

Port Scheme (181XX)

| Offset | Service | Default | Env Var |
|--------|---------|---------|---------|
| +0 | HTTP API | 18180 | STEMEDB_BIND_ADDR |
| +1 | Cluster Gateway | 18181 | STEMEDB_NODE_API_ADDR |
| +2 | Cluster RPC | 18182 | STEMEDB_NODE_RPC_ADDR |
| +3 | SWIM Gossip | 18183 | via SwimConfig |
| +4 | Metrics | 18184 | (reserved) |
| +5 | Admin | 18185 | (reserved) |
| +6 | Latent Signal | 18186 | |
| +7 | Community App | 18187 | |
| +8 | Admin Dashboard | 18188 | |
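
The scheme above is simply a base port plus a fixed offset per service. A minimal sketch of that arithmetic follows; the `Service` enum and `port_for` function are hypothetical names, and whether the real services derive their ports from a shared base (rather than from the individual defaults and env vars) is an assumption.

```rust
// Hypothetical sketch of the 181XX scheme: port = base + service offset.
const BASE_PORT: u16 = 18180;

#[allow(dead_code)] // not every service is used in this small example
#[derive(Debug, Clone, Copy)]
enum Service {
    HttpApi = 0,
    ClusterGateway = 1,
    ClusterRpc = 2,
    SwimGossip = 3,
    Metrics = 4,
    Admin = 5,
    LatentSignal = 6,
    CommunityApp = 7,
    AdminDashboard = 8,
}

/// Adds the service's offset (its enum discriminant) to the base port.
fn port_for(base: u16, service: Service) -> u16 {
    base + service as u16
}

fn main() {
    for service in [Service::HttpApi, Service::SwimGossip, Service::AdminDashboard] {
        println!("{:?} -> {}", service, port_for(BASE_PORT, service));
    }
}
```

Passing the base as a parameter makes it easy to run a second local node on a shifted range (for example 18280), if the deployment allows that.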

Crates

| Crate | Purpose | Status |
|-------|---------|--------|
| stemedb-core | Assertion, LifecycleStage, MaterializedView, types, signing | |
| stemedb-wal | Write-ahead log with crash recovery | |
| stemedb-storage | KVStore, VoteStore, IndexStore, TrustRankStore, QuarantineStore | |
| stemedb-ingest | Ingestion pipeline, signature verification, ContentDefenseLayer | |
| stemedb-query | Query engine, Materializer for O(1) MV reads | |
| stemedb-lens | Lenses (Recency, Consensus, Authority, Skeptic, Layered, etc.) | |
| stemedb-api | HTTP API with axum + utoipa OpenAPI docs | |
| stemedb-sim | Simulation for testing the pipeline | |
| stemedb-merkle | BLAKE3 Merkle tree for diff detection | |
| stemedb-rpc | gRPC services for node-to-node communication | |
| stemedb-sync | Merkle sync, gossip broadcast, anti-entropy | |
| stemedb-cluster | Cluster membership (SWIM), sharding, gateway | |
| stemedb-ontology | Domain definitions (Pharma), subject builders, medical extractors | |
| stemedb-chaos | Chaos testing infrastructure | |
| stemedb-dashboard | Admin dashboard (React/Next.js) | 🎯 In Progress (7 panels complete) |

SDKs

| SDK | Purpose | Status |
|-----|---------|--------|
| sdk/go/steme | Go HTTP client with Ed25519 signing and fluent builders | |
| sdk/go/adk | ADK-Go tools and callbacks for AI agents | |

Specialized Agents

| Domain | Agent | When to use |
|--------|-------|-------------|
| Product Vision | episteme-product-visionary | Use cases, "why not Postgres?", product-market fit |
| Pilot Prep | enterprise-skeptic-buyer | Pressure-test demos, find gaps, prepare for tough questions |
| General Rust | primary-developer | Feature implementation, refactoring |
| Code Quality | rust-quality-engineer | Reviews, test coverage, clippy |
| Storage | storage-engine-architect | WAL, LSM, crash recovery |
| Graph Engine | rust-graph-engine-architect | Lock-free structures, cache optimization |
| Defensive | defensive-systems-architect | Rate limiting, circuit breakers, hostile input |
| Distributed | distributed-systems-engineer | CRDT replication, Raft coordination, Merkle sync |
| Lenses | stemedb-lens-architect | Query resolution, ranking algorithms |
| Planning | stemedb-planner | Milestone planning, roadmap |

Quick Reference

# Build
cargo build --workspace

# Test
cargo test --workspace

# Lint (must pass before commit)
cargo clippy --workspace -- -D warnings
cargo fmt --check

# Run API server
cargo run --bin stemedb-api

# Run demo script
./scripts/demo-consumer-health.sh