stemedb/latent/roadmap.md
jordan b3e8a9a058 feat: Multi-application expansion with chaos testing and community UI
2026-02-04 01:24:14 -07:00

Latent: Implementation Roadmap (Solo Engineer)

Status: Phase 1 Complete. Phase 2 In Progress.


Phase 1: The "Semaglutide" Vertical (Weeks 1-4, COMPLETED)

Goal: End-to-end signal detection for one drug family.

Week 1: Tier 0 (Regulatory) Ground Truth

  • Infrastructure: Set up a local StemeDB instance.
  • Source: OpenFDA API (Free, JSON).
  • Ingestor: Build latent-ingest-fda (Rust/Python) to fetch labels for: Semaglutide, Tirzepatide, Liraglutide.
  • Extract: Parse "Adverse Reactions" section into Assertions.
  • Output: A graph with the "Official Truth" for 3 drugs.
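
A minimal sketch of the Week 1 fetch-and-extract step, assuming the openFDA drug label endpoint and its `adverse_reactions` field. The `Assertion` shape, the predicate name, and the function names are hypothetical placeholders, not StemeDB's actual schema or the real latent-ingest-fda code.

```python
import json
import urllib.parse
import urllib.request
from dataclasses import dataclass

OPENFDA_LABEL_URL = "https://api.fda.gov/drug/label.json"
DRUGS = ["semaglutide", "tirzepatide", "liraglutide"]


@dataclass(frozen=True)
class Assertion:
    """Hypothetical assertion shape; StemeDB's real schema may differ."""
    subject: str
    predicate: str
    text: str
    tier: int


def label_query_url(generic_name: str, limit: int = 1) -> str:
    """Build an openFDA label search URL for one generic drug name."""
    params = {"search": f'openfda.generic_name:"{generic_name}"', "limit": limit}
    return f"{OPENFDA_LABEL_URL}?{urllib.parse.urlencode(params)}"


def fetch_labels(generic_name: str) -> dict:
    """Fetch the raw label JSON over the network."""
    with urllib.request.urlopen(label_query_url(generic_name), timeout=30) as resp:
        return json.load(resp)


def extract_adverse_reactions(generic_name: str, payload: dict) -> list[Assertion]:
    """Turn each label's "adverse_reactions" section into a Tier 0 assertion."""
    assertions = []
    for label in payload.get("results", []):
        # Labels without an "adverse_reactions" section are skipped.
        for section in label.get("adverse_reactions", []):
            assertions.append(
                Assertion(generic_name, "has_adverse_reactions_section", section.strip(), 0)
            )
    return assertions
```

Looping `fetch_labels` over `DRUGS` and passing each payload to `extract_adverse_reactions` would yield one coarse assertion per label section; splitting a section into individual reactions is left out of this sketch.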

Week 2: Tier 5 (Social) Noise

  • Source: Reddit (Manual API script or Apify if budget permits).
  • Targets: /r/Ozempic, /r/Mounjaro.
  • Ingestor: Build latent-ingest-reddit to fetch last 30 days of posts.
  • Filter: Simple keyword matching: stomach, paralysis, vomit, hair loss.
  • Extract: Use gpt-4o-mini to turn matched posts into Assertions.
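
The keyword filter above could look roughly like the following sketch. The keyword list comes from the roadmap; matching on word stems (so "vomiting" matches "vomit") and the `title`/`body` post fields are assumptions made for illustration.

```python
import re

# Keywords from the roadmap. Stem matching is an assumption of this sketch.
KEYWORDS = ["stomach", "paralysis", "vomit", "hair loss"]

# \b anchors the match at a word start; \w* lets "vomit" also match "vomiting".
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in KEYWORDS) + r")\w*",
    re.IGNORECASE,
)


def matched_keywords(text: str) -> set[str]:
    """Return the lower-cased keywords found in the text."""
    return {m.group(1).lower() for m in _PATTERN.finditer(text)}


def filter_posts(posts: list[dict]) -> list[dict]:
    """Keep only posts that mention a keyword, tagging each with its hits."""
    kept = []
    for post in posts:
        text = f"{post.get('title', '')} {post.get('body', '')}"
        hits = matched_keywords(text)
        if hits:
            kept.append({**post, "keywords": sorted(hits)})
    return kept
```

Filtering before the LLM extraction step keeps the number of model calls proportional to matched posts rather than to everything fetched.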

Week 3: The Divergence Engine

  • Logic: Implement the "Skeptic Lens" query in StemeDB.
  • Algorithm: Compare Tier 0 (Official) vs Tier 5 (Social).
  • Scoring: Calculate a divergence score from how often a side-effect cluster appears in Social posts versus whether that side effect is present in Tier 0.
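
One way the scoring could work, as a sketch only: the roadmap does not specify a formula, so the one below (social frequency multiplied by zero when the term is on the label, one otherwise) is an assumption, as are the function and parameter names.

```python
from collections import Counter


def divergence_scores(
    social_clusters: list[str],
    official_terms: set[str],
    total_posts: int,
) -> dict[str, float]:
    """Score each social cluster by its frequency, zeroed if Tier 0 lists it.

    social_clusters holds one cluster label per matched post.
    official_terms holds the side effects named in the Tier 0 label.
    """
    if total_posts <= 0:
        raise ValueError("total_posts must be positive")
    official = {term.lower() for term in official_terms}
    scores = {}
    for cluster, count in Counter(social_clusters).items():
        frequency = count / total_posts
        # Assumed rule: presence in Tier 0 cancels the divergence entirely.
        present = 1.0 if cluster.lower() in official else 0.0
        scores[cluster] = frequency * (1.0 - present)
    return scores
```

A binary presence flag is the simplest choice; a graded version (for example, weighting by how prominently the label mentions the effect) would fit the same interface.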

Week 4: The Minimal Dashboard

  • UI: Simple Next.js page showing the "Semaglutide Conflict Heatmap".
  • Ship: Deployed local prototype with sample data.
  • Milestone: A working URL showing "Reddit hates Ozempic's side effects more than the FDA does."

Phase 2: Expansion & Hardening (Weeks 5-8)

Goal: Add credibility and history.

Week 5: Tier 1 (Clinical) Context

  • Source: ClinicalTrials.gov API (Free).
  • Ingestor: Fetch completed trials for target drugs.
  • Data: Extract "Serious Adverse Events" tables.
  • Value: Now you can show "Reddit vs. Trials" conflicts (stronger than just Reddit vs. Label).
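
A sketch of the extraction step for the "Serious Adverse Events" tables. The field names (`resultsSection`, `adverseEventsModule`, `seriousEvents`, `stats`, `numAffected`, `numAtRisk`) reflect my understanding of the ClinicalTrials.gov API v2 study JSON and should be checked against the official API documentation; the function name and output row shape are hypothetical.

```python
def serious_event_rates(study: dict) -> list[dict]:
    """Flatten one study's serious adverse events into per-group rate rows."""
    module = study.get("resultsSection", {}).get("adverseEventsModule", {})
    rows = []
    for event in module.get("seriousEvents", []):
        for stat in event.get("stats", []):
            at_risk = stat.get("numAtRisk") or 0
            affected = stat.get("numAffected") or 0
            rows.append(
                {
                    "term": event.get("term", ""),
                    "group_id": stat.get("groupId", ""),
                    "affected": affected,
                    "at_risk": at_risk,
                    # Rate is undefined when nobody was at risk in the group.
                    "rate": affected / at_risk if at_risk else None,
                }
            )
    return rows
```

Studies without posted results simply yield an empty list, so the function can be mapped over every completed trial without a separate check.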

Week 6: Time Travel (Backfilling)

  • Backfill: Scrape Reddit back to 2021.
  • History: Ingest historical FDA labels (from DailyMed archives).
  • Analysis: Generate the "Knowledge Lag" chart. Prove that Latent would have predicted the gastroparesis warning.
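
The "Knowledge Lag" figure could be computed along these lines. This is a sketch under stated assumptions: the lag is defined here as the number of days between the first day social mentions reach a threshold and the date the label was updated, and both the threshold and the function names are hypothetical.

```python
from datetime import date
from typing import Optional


def first_signal_date(daily_counts: dict[date, int], threshold: int) -> Optional[date]:
    """Return the earliest day whose mention count reaches the threshold."""
    for day in sorted(daily_counts):
        if daily_counts[day] >= threshold:
            return day
    return None


def knowledge_lag_days(
    daily_counts: dict[date, int],
    label_update: date,
    threshold: int = 5,
) -> Optional[int]:
    """Days by which the social signal preceded the label update.

    Positive means social posts led the label; None means no signal was found.
    """
    signal = first_signal_date(daily_counts, threshold)
    if signal is None:
        return None
    return (label_update - signal).days
```

A negative result would mean the label changed before the social signal crossed the threshold, which is the case the chart needs to rule out.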

Week 7: The Daily Cron

  • Automation: Move the scripts to a cron job or a Temporal workflow.
  • Alerting: Simple email/Discord alert when Divergence Score spikes.
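
A sketch of the alerting step, assuming a "spike" means the latest Divergence Score exceeds the trailing mean by `k` standard deviations; that rule, the default values, and the function names are illustrative assumptions. The Discord part relies only on the webhook convention of POSTing a JSON body with a `content` field.

```python
import json
import statistics
import urllib.request


def is_spike(history: list[float], latest: float, k: float = 3.0, min_points: int = 7) -> bool:
    """Flag the latest score if it sits k standard deviations above the mean."""
    if len(history) < min_points:
        return False  # Not enough history to judge.
    mean = statistics.fmean(history)
    spread = statistics.pstdev(history)
    # With a perfectly flat history, spread is 0 and any increase counts.
    return latest > mean + k * spread


def build_alert(drug: str, cluster: str, latest: float) -> bytes:
    """Encode a Discord webhook payload describing the spike."""
    message = f"Divergence spike for {drug} / {cluster}: score {latest:.3f}"
    return json.dumps({"content": message}).encode("utf-8")


def send_discord_alert(webhook_url: str, payload: bytes) -> None:
    """POST the payload to a Discord webhook URL (network call)."""
    request = urllib.request.Request(
        webhook_url,
        data=payload,
        # An explicit User-Agent is set in case the default one is rejected.
        headers={"Content-Type": "application/json", "User-Agent": "latent-alerts/0.1"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        response.read()
```

The daily job would call `is_spike` per drug and cluster, and only build and send an alert when it returns `True`.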

Week 8: Marketing The Signal

  • Artifact: Write a blog post: "How Latent predicted the Ozempic warnings 6 months early."
  • Outreach: Send the report to 10 biotech hedge funds.

Phase 3: Commercialization (Weeks 9-12)

Goal: First paying customer.

  • Expansion: Add five more high-volatility drugs (e.g., Alzheimer's and new oncology therapies).
  • Polish: Clean up the UI. Add export to CSV.
  • Sales: Demo the "Alpha Signal" to investors.

"Solo Scraper" Tech Stack

Cheap, resilient, manageable.

  • Language: Python (for scraping/NLP), Rust (for StemeDB).
  • Database: SQLite (local cache) -> StemeDB (Graph).
  • Proxies: BrightData (Pay-as-you-go) or ScraperAPI. Only use when strictly necessary.
  • Orchestration: Simple systemd timers or a lightweight Go scheduler. No Kubernetes.
  • Compute: One robust server (64GB RAM, plenty of cores) running everything.