jordan 157dbbb9eb feat: Complete Aphoria Phase 8-9 + UAT suite (90/90 tests passing)

## Phase 8: Enterprise Extractor Improvements ✅
- 14 security extractors (TLS, JWT, SQL injection, XSS, etc.)
- 10 framework-specific extractors (Spring, Django, Rails, etc.)
- Config file security detection (YAML, TOML)

## Phase 9: Autonomous Extractor Generation ✅
- Shadow mode executor with TP/FP tracking
- Graduation pipeline with confidence thresholds
- Auto-rollback on regression detection
- Cross-project pattern syncing

## UAT Suite Complete (14 scripts, 90 tests)
- test-core-detection.sh (6 tests)
- test-declarative-extractors.sh (5 tests)
- test-domain-frameworks.sh (5 tests)
- test-domain-unreal.sh (3 tests)
- test-llm-extraction.sh (6 tests)
- test-eval-harness.sh (5 tests)
- test-cross-language.sh (3 tests)
- test-precommit-performance.sh (4 tests)
- test-output-formats.sh (8 tests)
- test-drift-detection.sh (6 tests)
- test-exit-codes.sh (12 tests)
+ 3 more scripts

## Other Changes
- Updated roadmap to mark Phase 8-9 complete
- Added .gitignore entries for build artifacts
- Updated pre-commit: 800 line limit, exclude tests/data/cmd

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2026-02-06 22:50:55 -07:00

24 KiB

Episteme (StemeDB) Roadmap

Goal: Build the "Git for Truth" substrate for autonomous AI research.
Current Focus: Enterprise Pilot Preparation
Target Vertical: BioTech/Pharma ("The Living Review")
Endgame: Distributed multi-writer cluster for millions of concurrent agents

Infrastructure Status: Phases 1-7 complete ✅ | Phase 8A (Chaos) complete ✅
Pilot Status: Consumer Health MVP complete ✅ | Enterprise Demo in progress

Archive: For completed phases 1-7, see roadmap-archive.md

Current Status

Phase	Status	Summary
1-7	✅ Complete	Core infrastructure, distributed cluster, trust & safety
8A	✅ Complete	Chaos testing, Jepsen-style verification
MVP	✅ Complete	Consumer Health demo with real FDA data
Pilot Prep	🎯 In Progress	Dashboard, impact analysis, production hardening
8B-C	Planned	Observability, geo-distribution
9	Planned	Disaster recovery, compliance, storage management

🎯 Phase: Enterprise Pilot Preparation (CURRENT)

Goal: Make the pilot bulletproof. Amaze enterprise decision makers.
Timeline: 5 weeks
Success Criteria: Dr. Sarah Chen (skeptical VP of Data Infrastructure) fights her CFO for budget

The 5 Amazement Moments We Must Deliver

#	Moment	Current State	Gap
1	Contradictions visible with confidence scores	✅ Complete	Dashboard scaffold + Skeptic Query UI ✅
2	Cascade invalidation when source retracted	✅ Complete	Full UI: Sources page + impact dialog (P3.1-3.3) ✅
3	Full FDA-ready audit trail	✅ Complete	Audit Trail Browser (P1.6) ✅
4	Point-in-time queries + decay	✅ API ready	No timeline UI
5	Malicious agent blocked by circuit breaker	✅ Complete	Circuit Breaker Status (P1.5) ✅

Pilot-1: Demo Dashboard (Week 1-2)

Deliverable: React admin dashboard that makes the API visual

P1.1 Dashboard Scaffold: Next.js + shadcn/ui project setup ✅
- Project structure: applications/stemedb-dashboard/
- API client for StemeDB endpoints (src/lib/api/client.ts)
- Authentication scaffold (API key header)
- Dark mode (default), responsive layout with collapsible sidebar
- shadcn/ui components: button, card, badge, input, separator, tabs
- Live API status indicator (polls /health every 30s)
- Port 18188, builds and runs successfully
P1.2 Skeptic Query Visualization: Show contradictions graphically ✅
- Query builder: subject, predicate inputs
- Conflict score gauge (0.0-1.0 with color coding)
- Claims table with weight bars, source tier badges
- "CONTESTED" / "AGREED" / "UNANIMOUS" status badges
- Expandable claim rows with source details, agents, provenance hashes
- Loading skeleton, empty state, error state with retry
P1.3 Layered Consensus View: Per-tier breakdown ✅
- Tier accordion showing each source class (T0→T5, empty tiers hidden)
- Within-tier conflict score (compact gauge in accordion header)
- Cross-tier conflict visualization (full gauge with stats)
- Extended ConflictGauge with variant prop for reuse
P1.4 Quarantine Admin Panel: Content defense visibility ✅
- Pending queue with reason, timestamp, quality score
  - quarantine-panel.tsx, quarantine-list.tsx, quarantine-row.tsx
- Approve/Reject buttons with confirmation
  - ConfirmationDialog with restore/delete actions
- Filter by reason (duplicate, spam, untrusted high-confidence)
  - quarantine-filters.tsx with dropdown selector
- Metrics: pending count, approved/rejected today
  - quarantine-metrics.tsx with MetricCard grid
P1.5 Circuit Breaker Status: Trust & safety dashboard ✅
- Blocked agents list with failure count, retry time
  - circuit-list.tsx, circuit-card.tsx with full details
- State badges: OPEN (red), HALF_OPEN (yellow), CLOSED (green)
  - state-badge.tsx with color-coded variants
- Manual reset button for admin override
  - circuit-panel.tsx - handleReset calls API
- Summary with state counts
  - circuit-summary.tsx, which replaces the originally planned historical events list (judged more useful)
- Auto-refresh every 10 seconds
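
The three badge states above follow the usual circuit-breaker pattern. The Go sketch below illustrates how the transitions and the manual reset could fit together. It is not the StemeDB implementation: the `Breaker` type, the failure threshold, and the retry window are assumptions made for illustration, and timestamps are plain Unix seconds to keep the example small.

```go
package main

import "fmt"

// State mirrors the badge names shown in the dashboard.
type State string

const (
	Closed   State = "CLOSED"
	Open     State = "OPEN"
	HalfOpen State = "HALF_OPEN"
)

// Breaker is a hypothetical per-agent circuit breaker.
// maxFailures and retryAfterSec are illustrative, not StemeDB defaults.
type Breaker struct {
	maxFailures   int
	retryAfterSec int64
	state         State
	failures      int
	openedAt      int64
}

func NewBreaker(maxFailures int, retryAfterSec int64) *Breaker {
	return &Breaker{maxFailures: maxFailures, retryAfterSec: retryAfterSec, state: Closed}
}

// Allow reports whether a request may proceed at time now (Unix seconds).
// An OPEN breaker moves to HALF_OPEN once the retry window has elapsed.
func (b *Breaker) Allow(now int64) bool {
	if b.state == Open && now-b.openedAt >= b.retryAfterSec {
		b.state = HalfOpen
	}
	return b.state != Open
}

// Record updates the breaker with the outcome of one request.
// Any failure while HALF_OPEN reopens the breaker immediately.
func (b *Breaker) Record(ok bool, now int64) {
	if ok {
		b.state, b.failures = Closed, 0
		return
	}
	b.failures++
	if b.state == HalfOpen || b.failures >= b.maxFailures {
		b.state, b.openedAt = Open, now
	}
}

// Reset is the manual admin override.
func (b *Breaker) Reset() { b.state, b.failures = Closed, 0 }

// State returns the current state for display.
func (b *Breaker) State() State { return b.state }

func main() {
	b := NewBreaker(3, 30)
	for i := 0; i < 3; i++ {
		b.Record(false, 100)
	}
	fmt.Println("after 3 failures:", b.State())
	fmt.Println("allowed after retry window:", b.Allow(130), b.State())
	b.Reset()
	fmt.Println("after manual reset:", b.State())
}
```

Moving to HALF_OPEN lazily inside `Allow` avoids a background timer; a dashboard that polls every 10 seconds would see the change on the next request rather than at the exact retry time.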
P1.6 Audit Trail Browser: Query provenance explorer ✅
- Recent queries list with agent, timestamp, subject
  - audit-list.tsx, audit-row.tsx with pagination
- Drilldown: contributing assertions, weights, winner
  - Expandable row details in audit-row.tsx
- Filter by agent, time range, subject
  - audit-filters.tsx with 1h/24h/7d/30d/all options
- Export to JSON/CSV
  - audit-export.tsx with proper escaping
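
"Proper escaping" for CSV means quoting any field that contains a comma, a double quote, or a line break, and doubling embedded quotes. The dashboard export itself is TypeScript; the Go sketch below only illustrates the escaping rule using the standard `encoding/csv` package. The `AuditRow` type and its three columns are assumptions taken from the list columns named above.

```go
package main

import (
	"bytes"
	"encoding/csv"
	"fmt"
)

// AuditRow is a hypothetical flattened audit entry. The field set is assumed
// from the list columns above (agent, timestamp, subject).
type AuditRow struct {
	Agent     string
	Timestamp string
	Subject   string
}

// ToCSV renders rows with a header line. encoding/csv quotes fields that
// contain commas, quotes, or line breaks, and doubles embedded quotes.
func ToCSV(rows []AuditRow) (string, error) {
	var buf bytes.Buffer
	w := csv.NewWriter(&buf)
	if err := w.Write([]string{"agent", "timestamp", "subject"}); err != nil {
		return "", err
	}
	for _, r := range rows {
		if err := w.Write([]string{r.Agent, r.Timestamp, r.Subject}); err != nil {
			return "", err
		}
	}
	w.Flush()
	return buf.String(), w.Error()
}

func main() {
	out, err := ToCSV([]AuditRow{
		{"fda:drug-label-ingestor", "2026-02-06T22:50:55Z", `semaglutide, "Wegovy"`},
	})
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```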

Pilot-2: Demo Data Seeder (Week 2)

Deliverable: Pre-signed realistic demo data using Go SDK
Status: All complete ✅

P2.1 Demo Keypair Management: Reproducible demo keys ✅
- 5 demo agents with realistic naming convention:
  - fda:drug-label-ingestor (Tier 0 - Regulatory)
  - pubmed:abstract-indexer (Tier 1 - Clinical)
  - clinicaltrials:study-importer (Tier 1 - Clinical)
  - reddit:health-discussion-scraper (Tier 5 - Anecdotal)
  - internal:clinical-ops-reviewer (Tier 3 - Expert)
- Keys stored in demo/keys/ with README documenting each agent's role/scope
  - demo/keys/agents.json with seeds, public keys, tiers, descriptions
  - demo/keys/README.md with full documentation
  - demo/keys/keygen.go for deterministic regeneration
- Go SDK script: cmd/demo-seed/main.go
  - Loads keys from agents.json
  - Creates 260+ assertions with realistic data
- One-command setup: ./scripts/run-demo.sh (start DB → seed → open dashboard)
  - Build detection, health check, auto-cleanup on exit
  - --clean flag for fresh start, --no-open to skip browser
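
Deterministic regeneration works because an Ed25519 keypair is a pure function of its 32-byte seed. The sketch below shows the idea with Go's standard `crypto/ed25519`. Deriving the seed by hashing the agent name is an assumption made to keep the example self-contained; per the list above, the real seeds are stored in demo/keys/agents.json, and this is not the contents of keygen.go.

```go
package main

import (
	"crypto/ed25519"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// seedFor derives a 32-byte seed from an agent name.
// Assumption for this sketch only: the demo loads seeds from agents.json.
func seedFor(agent string) []byte {
	sum := sha256.Sum256([]byte("demo-seed:" + agent))
	return sum[:]
}

// KeyFor regenerates the agent's private key from its seed.
// The same seed always yields the same keypair.
func KeyFor(agent string) ed25519.PrivateKey {
	return ed25519.NewKeyFromSeed(seedFor(agent))
}

// PublicKeyHex returns the hex-encoded public key for an agent.
func PublicKeyHex(agent string) string {
	pub := KeyFor(agent).Public().(ed25519.PublicKey)
	return hex.EncodeToString(pub)
}

// SignAndVerify signs msg with the regenerated key and checks the signature
// against the regenerated public key.
func SignAndVerify(agent, msg string) bool {
	priv := KeyFor(agent)
	sig := ed25519.Sign(priv, []byte(msg))
	return ed25519.Verify(priv.Public().(ed25519.PublicKey), []byte(msg), sig)
}

func main() {
	for _, a := range []string{"fda:drug-label-ingestor", "reddit:health-discussion-scraper"} {
		fmt.Println(a, PublicKeyHex(a))
	}
	fmt.Println("signature valid:", SignAndVerify("fda:drug-label-ingestor", "demo assertion"))
}
```

Fixed seeds are only appropriate for demo data: anyone with the repository can sign as these agents.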
P2.2 Conflict Scenarios: Pre-built disagreements with real data ✅
- 3 drugs: semaglutide (45), tirzepatide (38), liraglutide (32) assertions
- 150+ assertions total using real FDA label excerpts
- ClinicalTrials.gov summaries (STEP, SURMOUNT, SELECT, LEADER trials)
- Killer conflicts: Weight loss (FDA 14.9% vs STEP UP 20.7% vs Reddit variable), Gastroparesis (FDA 0.2% vs UBC 3x risk)
- 4 genuine conflicts per drug (weight loss, nausea, gastroparesis, CV benefit)
- Source registry with 30+ deterministic hash sources across T0-T5 tiers
P2.3 Retractable Sources: Set up cascade demo ✅
- New CARDIOVASC_MEGA_TRIAL source in sources.go (landmark multi-drug CV outcomes study)
- 110 assertions citing this source across 8 categories (visceral cascade effect)
  - Primary/Secondary CV Outcomes (30), Biomarkers (15), Subgroup Analyses (20)
  - Expert Guidelines (15), Real-World Evidence (15), Comparative Efficacy (10), Community (5)
- 5 agents represented: T0 (FDA), T1 (ClinicalTrials), T2 (PubMed), T3 (Internal), T5 (Reddit)
- printCascadeDemoCommands() outputs curl commands for demo flow
- Demo documentation updated in amazement-demo.md
- Note: API endpoints (P3.1) complete ✅, live demo ready
P2.4 Historical Data: Time-travel via lifecycle evolution ✅
- Approach: Use lifecycle states (Proposed → Approved → Deprecated), not fake timestamps
- Each lifecycle transition auditable with real timestamps (signature timestamps)
- Demo scenario: Wegovy CV indication change (pre-March 2024 vs post-SELECT)
- 8 historical scenarios: CV indication, SELECT trial evolution, ADA guidelines, Tirzepatide expansion
- 17 historical assertions showing lifecycle progression
- Demo commands for as_of queries
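
One way to read the lifecycle approach: an as_of query picks, for each assertion, the latest lifecycle transition whose timestamp is not after the requested time. The Go sketch below illustrates that rule under stated assumptions. The `Transition` type and `StageAsOf` function are hypothetical names, timestamps are abstract integers, and the actual query engine may resolve this differently.

```go
package main

import (
	"fmt"
	"sort"
)

// Transition is a hypothetical record of one signed lifecycle change.
type Transition struct {
	At    int64  // timestamp of the transition (abstract units here)
	Stage string // "Proposed", "Approved", or "Deprecated"
}

// StageAsOf returns the lifecycle stage in effect at asOf.
// The second result is false if the assertion did not exist yet.
func StageAsOf(history []Transition, asOf int64) (string, bool) {
	// Copy before sorting so the caller's slice is left untouched.
	sorted := append([]Transition(nil), history...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].At < sorted[j].At })

	stage, found := "", false
	for _, t := range sorted {
		if t.At > asOf {
			break
		}
		stage, found = t.Stage, true
	}
	return stage, found
}

func main() {
	history := []Transition{
		{At: 100, Stage: "Proposed"},
		{At: 200, Stage: "Approved"},
		{At: 300, Stage: "Deprecated"},
	}
	for _, asOf := range []int64{50, 150, 250, 350} {
		stage, ok := StageAsOf(history, asOf)
		fmt.Println(asOf, stage, ok)
	}
}
```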

Pilot-3: Impact Analysis (Week 3)

Deliverable: Automatic cascade when source is retracted
Critical: This unblocks P2.3 (retractable sources demo data)

P3.1 Impact Analysis Endpoint: GET /v1/sources/{hash}/impact ✅
- Returns all assertions citing this source (verified: 110 assertions for CARDIOVASC_MEGA_TRIAL)
- Returns count of queries that used those assertions
- Returns list of affected agents/recommendations (verified: 4 agents)
- Implementation in stemedb-api/src/handlers/source_registry/handlers.rs:237-439
- POST /v1/sources/{hash}/quarantine with preview mode (preview=true shows impact without changes)
- Preview response: "This will affect X assertions and Y agent recommendations"
- Undo capability: POST /v1/sources/{hash}/restore (verified: restores 110 assertions)
- 17 unit/integration tests passing
P3.2 Cascade Flagging: Automatic downstream impact ✅
- When source status → quarantined, flag citing assertions
  - Implemented query-time lookup (not index mutation) to preserve append-only immutability
  - SourceStatusEnricher service batch-lookups source statuses from SourceRegistry
  - SourceWarningDto attached to assertions with warning_type, message, source_label, status_updated_at
- ~~New field on assertion index~~ → Query-time enrichment instead (preserves immutability)
- Queries can filter by exclude_quarantined_sources=true
  - Added to QueryParams in dto/query_params.rs
  - Post-retrieval filter applied after query execution
- Define query behavior: quarantined sources show with warning (not silently omitted)
  - source_warning field added to AssertionResponse and ClaimSummaryDto
  - Skeptic endpoint enriches claims with warnings
- Export affected items list for regulatory documentation (CSV/JSON)
  - GET /v1/sources/{hash}/impact/export?format=csv|json
  - Returns ImpactExportRow with assertion_hash, subject, predicate, agent_id, timestamp, lifecycle, confidence
  - CSV includes proper escaping, JSON returns array of objects
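
The query-time approach described above can be sketched as follows: stored assertions are never modified, and the source status is looked up and attached only when results are returned. This is an illustration in Go, not the Rust `SourceStatusEnricher`. The type names, the `"quarantined"` status string, and the warning text are assumptions.

```go
package main

import "fmt"

// Assertion is a minimal stand-in for a stored, immutable assertion.
type Assertion struct {
	Hash       string
	SourceHash string
}

// SourceWarning is a hypothetical, reduced mirror of SourceWarningDto.
type SourceWarning struct {
	WarningType string
	Message     string
}

// Enriched is what a query returns: the assertion plus an optional warning.
type Enriched struct {
	Assertion
	Warning *SourceWarning // nil when the source is not quarantined
}

// Enrich looks up each assertion's source status at query time.
// Quarantined sources produce a warning, or are dropped when
// excludeQuarantined is set. The stored assertions are never changed.
func Enrich(results []Assertion, status map[string]string, excludeQuarantined bool) []Enriched {
	out := make([]Enriched, 0, len(results))
	for _, a := range results {
		e := Enriched{Assertion: a}
		if status[a.SourceHash] == "quarantined" {
			if excludeQuarantined {
				continue
			}
			e.Warning = &SourceWarning{
				WarningType: "source_quarantined",
				Message:     "source " + a.SourceHash + " is quarantined",
			}
		}
		out = append(out, e)
	}
	return out
}

func main() {
	results := []Assertion{{"a1", "src-ok"}, {"a2", "src-bad"}}
	status := map[string]string{"src-ok": "active", "src-bad": "quarantined"}

	for _, e := range Enrich(results, status, false) {
		fmt.Println(e.Hash, e.Warning != nil)
	}
	fmt.Println(len(Enrich(results, status, true)))
}
```

Because the status map is consulted on every query, restoring a source takes effect immediately without rewriting any index entries.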
P3.3 Impact Dashboard Widget: Visualize the cascade ✅
- Source status change UI (Active → Quarantined)
  - components/sources/status-badge.tsx with color-coded badges
  - components/sources/tier-badge.tsx with T0-T5 labels
- Confirmation dialog: "This will affect 234 downstream assertions and 12 recommendations"
  - components/sources/quarantine-dialog.tsx with impact preview
  - Warning box shows exact affected counts from API
- Choice: "Quarantine immediately" or "Review affected items first"
  - Dual action buttons in dialog
  - "Review first" opens ImpactDetailPanel with full assertion list
- Animated "impact ripple" showing affected count
  - components/sources/impact-ripple.tsx with Tailwind animate-ping
  - Triggers on dialog open, counts pulse with amber styling
- List of impacted queries with timestamp
  - components/sources/impact-detail-panel.tsx shows affected assertions table
  - Affected agents shown as chips
- "Remediation status" tracking
  - Source status visible in list, metrics show quarantined count
  - components/sources/sources-metrics.tsx with Active/Deprecated/Quarantined counts
- Audit trail: WHO retracted, WHEN, and WHY
  - RestoreDialog and QuarantineDialog capture reason field
  - Export to CSV/JSON for regulatory documentation

Pilot-4: Production Hardening (Week 4)

Deliverable: Load testing, authentication, backup documentation

P4.1 Load Testing: Prove performance claims ✅
- Go-based load tester with native Ed25519 signing (cmd/load-test/)
- Benchmark: 10K assertions baseline latency (p99 < 200ms target)
- Benchmark: 1K writes/sec sustained for configurable duration
- Benchmark: 100 concurrent readers, <2x degradation target
- Markdown report generator with pass/fail status
- One-command runner: ./scripts/run-load-test.sh
- Results saved to uat/production-readiness/results/
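
For reference, a p99 target such as the one above is checked by sorting the recorded latencies and reading the value at the 99th-percentile rank. The Go sketch below uses the nearest-rank method, which is an assumption: the roadmap does not say how cmd/load-test/ computes percentiles, and `Percentile` and `MeetsTarget` are hypothetical names.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// Percentile returns the p-th percentile (0 < p <= 100) of samples using the
// nearest-rank method. Samples are latencies in milliseconds.
func Percentile(samples []float64, p float64) float64 {
	if len(samples) == 0 {
		return 0
	}
	// Copy before sorting so the caller's slice is left untouched.
	sorted := append([]float64(nil), samples...)
	sort.Float64s(sorted)

	rank := int(math.Ceil(p * float64(len(sorted)) / 100))
	if rank < 1 {
		rank = 1
	}
	return sorted[rank-1]
}

// MeetsTarget reports whether the p99 latency is strictly below targetMs.
func MeetsTarget(samples []float64, targetMs float64) bool {
	return Percentile(samples, 99) < targetMs
}

func main() {
	// 98 fast requests and 2 slow ones: with 100 samples the p99 rank is 99,
	// so the second-slowest request decides the result.
	samples := make([]float64, 0, 100)
	for i := 0; i < 98; i++ {
		samples = append(samples, 12)
	}
	samples = append(samples, 450, 500)

	fmt.Println("p50:", Percentile(samples, 50))
	fmt.Println("p99:", Percentile(samples, 99))
	fmt.Println("meets 200ms target:", MeetsTarget(samples, 200))
}
```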
P4.2 API Authentication: Basic security for pilot
- API key middleware (X-API-Key header)
- Per-key rate limiting (separate from per-agent quota)
- Admin keys vs read-only keys
- Key management: POST /v1/admin/api-keys
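
The header check and the admin versus read-only split planned above could be sketched as the middleware below. The real API is the Rust `stemedb-api` crate, so this Go version with `net/http` only illustrates the intended behavior. The key values, the `/v1/query` path, and the rule that read-only keys may only issue GET requests are assumptions; per-key rate limiting is left out.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// Role distinguishes admin keys from read-only keys.
type Role int

const (
	ReadOnly Role = iota
	Admin
)

// RequireAPIKey rejects requests whose X-API-Key header is missing or unknown
// (401) and rejects non-GET requests made with a read-only key (403).
func RequireAPIKey(keys map[string]Role, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		role, ok := keys[r.Header.Get("X-API-Key")]
		if !ok {
			http.Error(w, "invalid or missing API key", http.StatusUnauthorized)
			return
		}
		if role == ReadOnly && r.Method != http.MethodGet {
			http.Error(w, "read-only key", http.StatusForbidden)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// StatusFor sends one in-memory request through the middleware and returns
// the resulting status code. The keys and path are made up for the example.
func StatusFor(method, key string) int {
	keys := map[string]Role{"demo-admin": Admin, "demo-reader": ReadOnly}
	ok := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	req := httptest.NewRequest(method, "/v1/query", nil)
	if key != "" {
		req.Header.Set("X-API-Key", key)
	}
	rec := httptest.NewRecorder()
	RequireAPIKey(keys, ok).ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	fmt.Println("no key, GET:", StatusFor("GET", ""))
	fmt.Println("reader, GET:", StatusFor("GET", "demo-reader"))
	fmt.Println("reader, POST:", StatusFor("POST", "demo-reader"))
	fmt.Println("admin, POST:", StatusFor("POST", "demo-admin"))
}
```

A production version would store hashed keys rather than raw strings in memory; the plain map lookup keeps the sketch short.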
P4.3 Backup/Restore Documentation: DR story
- Document WAL-based recovery procedure
- Script: scripts/backup-stemedb.sh (snapshot + WAL archive)
- Script: scripts/restore-stemedb.sh (restore from backup)
- Test restore procedure, document in UAT
P4.4 Prometheus Metrics: Observability baseline
- GET /metrics endpoint in Prometheus text exposition format
- Key metrics: assertions_total, queries_total, query_latency_seconds
- Trust metrics: quarantine_pending, circuit_breakers_open
- Basic Grafana dashboard template
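
In the Prometheus text format, each metric is a `# TYPE` line followed by a `name value` sample line. The Go sketch below renders a few of the metric names listed above by hand to show what the endpoint output would look like. It is only an illustration with made-up values and a hypothetical `Metric` type: a real endpoint would normally use a Prometheus client library, and histogram metrics such as query_latency_seconds need additional bucket lines that are omitted here.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Metric is a hypothetical single-sample metric without labels.
type Metric struct {
	Name  string
	Type  string // "counter" or "gauge"
	Value float64
}

// Render writes metrics in Prometheus text exposition format, sorted by name
// so that the output is stable between scrapes.
func Render(metrics []Metric) string {
	sorted := append([]Metric(nil), metrics...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Name < sorted[j].Name })

	var b strings.Builder
	for _, m := range sorted {
		fmt.Fprintf(&b, "# TYPE %s %s\n", m.Name, m.Type)
		fmt.Fprintf(&b, "%s %g\n", m.Name, m.Value)
	}
	return b.String()
}

func main() {
	// Values are made up for the example.
	fmt.Print(Render([]Metric{
		{"quarantine_pending", "gauge", 4},
		{"assertions_total", "counter", 260},
		{"circuit_breakers_open", "gauge", 1},
		{"queries_total", "counter", 1500},
	}))
}
```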

Pilot-5: Operational Readiness (Week 5)

Deliverable: Runbooks, monitoring, reference architecture

P5.1 Operational Runbooks: Common procedures documented
- "Server won't start" troubleshooting
- "High query latency" investigation
- "Quarantine queue overflow" handling
- "Circuit breaker stuck open" resolution
- "Restore from backup" step-by-step
P5.2 Reference Architecture: Deployment guide
- Single-node pilot deployment diagram
- Network requirements (ports, firewall rules)
- Reverse proxy configuration (nginx/envoy with TLS)
- Resource sizing guide (CPU, memory, disk)
P5.3 Pilot Success Criteria Document: Definition of done
- Sub-second query latency at 10K assertions: measured
- Successful conflict detection on known contradictory studies: demonstrated
- Complete audit trail export for mock regulatory review: tested
- Source retraction workflow: exercised
P5.4 Executive Demo Script Validation: End-to-end rehearsal
- Run through amazement-demo-2.md with real dashboard
- Time each segment (target: 20 minutes total)
- Record demo video for async sharing (backup if live demo fails)
- All 5 Aha Moments demonstrable with real data (not mockups)
- Enterprise Skeptic Questions (must have documented answers):
  - What's the data ingestion latency? (FDA update → queryable)
  - What happens when agents disagree on interpretation?
  - Can I export an audit report for regulators? (PDF/CSV)
  - What's the failure mode if service goes down mid-demo?
  - How do I verify demo data is representative of my real data?
  - If I retract a source, what happens to queries that would have used it?

Pilot Prep Deliverables Summary

Week	Deliverable	Owner	Acceptance Criteria
1-2	`stemedb-dashboard`	Frontend	✅ 6 functional panels, connects to API (P1.1-P1.6)
2	`demo-seed` (P2.1-P2.4)	SDK	✅ 260+ assertions, 3 drugs, real FDA content, lifecycle history, cascade data
3	Impact Analysis (P3.1)	Backend	✅ `/v1/sources/{hash}/impact` + quarantine/restore endpoints
3	Cascade Flagging (P3.2)	Backend	✅ Source warnings, exclude filter, impact export
3	Impact Dashboard (P3.3)	Frontend	✅ Sources page, quarantine dialog, impact ripple, export
3	`demo-seed` (P2.3)	SDK	✅ Retractable source with 110 cascade assertions
4	Load Test Results	QA	✅ `cmd/load-test/` + `scripts/run-load-test.sh`
4	API Authentication	Backend	API keys work, rate limiting functional
4	Backup/Restore	Ops	Documented and tested procedure
4	Metrics Endpoint	Backend	`/metrics` returns Prometheus format
5	Runbooks	Ops	5 runbooks in `docs/runbooks/`
5	Reference Architecture	Docs	Deployment guide complete
5	Demo Rehearsal	All	20-minute demo runs smoothly
5	One-Command Demo	Ops	✅ `./scripts/run-demo.sh` works (P2.1)

Demo Data Quality Checklist (from Enterprise Skeptic Review)

Real FDA label excerpts (public domain) - not synthetic ✅
ClinicalTrials.gov summaries for plausibility ✅
Agent names map to real-world roles (fda:drug-label-ingestor) - P2.1 ✅
Conflicts are genuine (not "100% vs 0%" manufactured disagreements) - P2.2 ✅
Cascade demo shows 100+ affected items (visceral impact) - 110 assertions ✅
Export capability for regulatory documentation (CSV/JSON) - P3.2 ✅
Recovery story: what happens if demo breaks mid-presentation?

Phase 8B-C: Production Observability (Planned)

Blocked by: Pilot Prep (need real production deployment first)

8B. Observability

8B.1 Distributed Metrics: Per-node, per-range, per-agent metrics.
- sync_lag_seconds{peer}, merkle_diff_size{peer}, convergence_latency_p99
- assertions_total{node}, writes_per_second{node}
- Crate: metrics + metrics-exporter-prometheus
8B.2 Admin Dashboard: Cluster health visibility.
- GET /v1/admin/cluster → node list, range assignments, leader locations
- GET /v1/admin/ranges → range sizes, split/merge history
- POST /v1/admin/sync → force anti-entropy sync

8C. Production Hardening

8C.1 Snapshot/Restore: Fast replica bootstrap.
- Serialize full node state as snapshot
- New nodes join by restoring snapshot + replaying recent WAL
8C.2 Backpressure: Don't overwhelm slow nodes.
- Track per-peer sync queue depth
- Throttle gossip to slow peers
8C.3 Geo-Distribution: Multi-region deployment.
- Regional clusters with CRDT federation
- Locality-aware reads

Phase 9: The Bunker (Disaster Planning)

Goal: Survive the worst. Backup, restore, recover from corruption, comply with regulations.

9A. Backup & Cold Storage

9A.1 Full Cluster Backup: Point-in-time snapshot to S3/GCS.
9A.2 Point-in-Time Recovery (PITR): Restore to any HLC timestamp.
9A.3 Backup Verification: Weekly automated restore tests.

9B. Data Corruption & Rollback

9B.1 Corruption Detection: Deep validation before accepting gossip.
9B.2 Assertion Tombstones: "Delete" in an append-only world.
9B.3 Cluster Rollback: Batch tombstone generation for time ranges.
9B.4 Fork Recovery: Heal split-brain after extended partition.

9C. Compliance & Legal

9C.1 GDPR Right to Erasure: Cryptographic erasure via per-agent keys.
9C.2 Data Retention Policies: Per-subject/predicate retention rules.
9C.3 Audit Trail for Compliance: Immutable admin action log.
9C.4 SOC 2 Type II Certification: External audit and certification.
- Gap assessment and remediation
- Evidence collection automation
- Auditor engagement
- Target: Q3 2026

9D. Storage Management

9D.1 Compaction: Reclaim space from tombstoned data.
9D.2 Tiered Storage: Hot/warm/cold based on access patterns.
9D.3 Storage Quotas: Per-agent and cluster-wide limits.

9E. Incident Response

9E.1 Alerting & Escalation: PagerDuty/Slack integration.
9E.2 Operational Runbooks: Documented procedures for common failures.
9E.3 Chaos Engineering: Monthly "game days" with controlled failures.

9F. Security Hardening

9F.1 TLS Everywhere: mTLS for node-to-node traffic.
9F.2 Encryption at Rest: WAL and KV store encryption.
9F.3 Node Authentication: Ed25519 keypair identity, signed cluster join.

Architecture Overview

Write Path (Spine):           Read Path (Cortex):
[Agent] -> [Ingestion]        [Agent] <- [Lens Engine]
              |                              |
              v                              |
         [WAL/Fsync]                  [Index Lookup]
              |                              |
              v                              |
         [KV Store] <------------------------+

Port Scheme (181XX)

Offset	Service	Default	Env Var
+0	HTTP API	18180	`STEMEDB_BIND_ADDR`
+1	Cluster Gateway	18181	`STEMEDB_NODE_API_ADDR`
+2	Cluster RPC	18182	`STEMEDB_NODE_RPC_ADDR`
+3	SWIM Gossip	18183	via `SwimConfig`
+4	Metrics	18184	(reserved)
+5	Admin	18185	(reserved)
+6	Latent Signal	18186	—
+7	Community App	18187	—
+8	Admin Dashboard	18188	—
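
The scheme is simply base port plus offset. A small Go sketch of that arithmetic, assuming a base of 18180 as in the Default column; the service keys and the `PortFor` helper are hypothetical names, not identifiers from the codebase.

```go
package main

import "fmt"

// BasePort is the start of the 181XX range from the table above.
const BasePort = 18180

// offsets maps illustrative service keys to their offsets in the table.
var offsets = map[string]int{
	"http-api":        0,
	"cluster-gateway": 1,
	"cluster-rpc":     2,
	"swim-gossip":     3,
	"metrics":         4,
	"admin":           5,
	"latent-signal":   6,
	"community-app":   7,
	"admin-dashboard": 8,
}

// PortFor returns base+offset for a known service, or false if unknown.
// Passing a different base shifts the whole block, e.g. for a second node.
func PortFor(base int, service string) (int, bool) {
	off, ok := offsets[service]
	if !ok {
		return 0, false
	}
	return base + off, true
}

func main() {
	for _, s := range []string{"http-api", "swim-gossip", "admin-dashboard"} {
		port, _ := PortFor(BasePort, s)
		fmt.Println(s, port)
	}
}
```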

Crates

Crate	Purpose	Status
`stemedb-core`	Assertion, LifecycleStage, MaterializedView, types, signing	✅
`stemedb-wal`	Write-ahead log with crash recovery	✅
`stemedb-storage`	KVStore, VoteStore, IndexStore, TrustRankStore, QuarantineStore	✅
`stemedb-ingest`	Ingestion pipeline, signature verification, ContentDefenseLayer	✅
`stemedb-query`	Query engine, Materializer for O(1) MV reads	✅
`stemedb-lens`	Lenses (Recency, Consensus, Authority, Skeptic, Layered, etc.)	✅
`stemedb-api`	HTTP API with axum + utoipa OpenAPI docs	✅
`stemedb-sim`	Simulation for testing the pipeline	✅
`stemedb-merkle`	BLAKE3 Merkle tree for diff detection	✅
`stemedb-rpc`	gRPC services for node-to-node communication	✅
`stemedb-sync`	Merkle sync, gossip broadcast, anti-entropy	✅
`stemedb-cluster`	Cluster membership (SWIM), sharding, gateway	✅
`stemedb-ontology`	Domain definitions (Pharma), subject builders, medical extractors	✅
`stemedb-chaos`	Chaos testing infrastructure	✅
`stemedb-dashboard`	Admin dashboard (React/Next.js)	🎯 In Progress (7 panels complete)

SDKs

SDK	Purpose	Status
`sdk/go/steme`	Go HTTP client with Ed25519 signing and fluent builders	✅
`sdk/go/adk`	ADK-Go tools and callbacks for AI agents	✅

Specialized Agents

Domain	Agent	When to use
Product Vision	`episteme-product-visionary`	Use cases, "why not Postgres?", product-market fit
Pilot Prep	`enterprise-skeptic-buyer`	Pressure-test demos, find gaps, prepare for tough questions
General Rust	`primary-developer`	Feature implementation, refactoring
Code Quality	`rust-quality-engineer`	Reviews, test coverage, clippy
Storage	`storage-engine-architect`	WAL, LSM, crash recovery
Graph Engine	`rust-graph-engine-architect`	Lock-free structures, cache optimization
Defensive	`defensive-systems-architect`	Rate limiting, circuit breakers, hostile input
Distributed	`distributed-systems-engineer`	CRDT replication, Raft coordination, Merkle sync
Lenses	`stemedb-lens-architect`	Query resolution, ranking algorithms
Planning	`stemedb-planner`	Milestone planning, roadmap

Quick Reference

# Build
cargo build --workspace

# Test
cargo test --workspace

# Lint (must pass before commit)
cargo clippy --workspace -- -D warnings
cargo fmt --check

# Run API server
cargo run --bin stemedb-api

# Run demo script
./scripts/demo-consumer-health.sh

Related Documents

- CLAUDE.md: AI assistant instructions and project rules
- roadmap-archive.md: Completed phases 1-7 detail
- docs/demo/pilot/amazement-demo.md: Technical demo script
- docs/demo/pilot/amazement-demo-2.md: Executive demo script
- uat/production-readiness/README.md: Production verification checklist
- .claude/agents/enterprise-skeptic-buyer.md: Dr. Sarah Chen persona
