stemedb/latent/sources.md

Latent: Data Sources Catalog

Latent maps the world's drug safety data into the StemeDB Source Class hierarchy.

Tier 0: Regulatory (The Ground Truth)

Static, authoritative, legally mandated.

| Source | Access Method | Update Frequency | Data Format |
| --- | --- | --- | --- |
| FDA Labels | OpenFDA API | Weekly | Structured JSON |
| EMA Post-Auth | Web Scraper / RSS | Monthly | PDF / HTML |
| DailyMed | NIH API / Bulk | Daily | SPL (XML) |
| PMDA (Japan) | Web Scraper | Quarterly | HTML (Japanese) |
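
As a rough illustration of the OpenFDA access method, the sketch below builds a query URL for the public openFDA drug labeling endpoint. The helper name `build_label_query` and the choice to search by `openfda.generic_name` are assumptions made for this example; they are not taken from the Latent ingester itself.

```python
from urllib.parse import urlencode

# Public openFDA drug labeling endpoint.
OPENFDA_LABEL_URL = "https://api.fda.gov/drug/label.json"


def build_label_query(generic_name: str, limit: int = 1) -> str:
    """Return an openFDA label search URL for one generic drug name.

    Hypothetical helper: the real ingester may query different fields.
    """
    params = {
        # openFDA search syntax is field:"term"; the quotes force a phrase match.
        "search": f'openfda.generic_name:"{generic_name}"',
        "limit": limit,
    }
    # urlencode percent-escapes the colon and quotes in the search expression.
    return f"{OPENFDA_LABEL_URL}?{urlencode(params)}"
```

Keeping URL construction separate from the HTTP call makes the query easy to test without network access.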

Tier 1: Clinical (The Science)

Rigorous, peer-reviewed, baseline statistics.

| Source | Access Method | Data Points |
| --- | --- | --- |
| ClinicalTrials.gov | CT.gov API v2 | Adverse Event Tables, Trial Status |
| EudraCT | Web Scraper | European Clinical Trial results |
| Registry Metadata | Crossref API | Publication status of completed trials |

Tier 2: Observational & Expert (The Narrative)

Case reports, specialist guidelines, real-world studies.

| Source | Access Method | Role |
| --- | --- | --- |
| PubMed / MEDLINE | Entrez E-utilities | Case reports of rare adverse events |
| bioRxiv / medRxiv | API | Pre-print signals (Fast but unverified) |
| NICE Guidelines | Web Scraper | Standard of care changes |

Tier 4: Aggregated Community (The Volume)

Structured reports from non-regulatory sources.

| Source | Access Method | Role |
| --- | --- | --- |
| FAERS | OpenFDA API | Public side-effect reporting (Noisy) |
| VAERS | OpenFDA API | Vaccine-specific adverse events |
| PatientsLikeMe | Web Scraper | Structured patient-reported outcomes |

Tier 5: Anecdotal (The Early Warning)

Unstructured, high-velocity, messy.

| Source | Access Method | Target Channels |
| --- | --- | --- |
| Reddit | Apify / Reddit API | r/Ozempic, r/Medicine, r/Biohackers |
| Twitter / X | Apify | #MedTwitter, #PharmaSafety |
| TikTok | Web Scraper | Trending side-effect "storytimes" |

Ingestion Strategy

1. The "Golden Path" (High Confidence)

Tier 0 and Tier 1 data is ingested automatically. These sources are treated as permanent and override all other tiers in the Authority Lens.
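
The override rule can be pictured as a tier-ordered pick among conflicting assertions. The following is a minimal sketch, assuming that a lower tier number always wins (including Tier 0 over Tier 1, which the text does not state explicitly). The `Assertion` type and `resolve_by_authority` function are hypothetical names, not StemeDB's actual Authority Lens API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assertion:
    """Hypothetical record of one claim and where it came from."""
    source: str  # e.g. "FDA Labels"
    tier: int    # 0 = Regulatory ... 5 = Anecdotal
    claim: str


def resolve_by_authority(assertions: list[Assertion]) -> Assertion:
    """Pick the winning assertion: the lowest tier number wins (assumed rule)."""
    if not assertions:
        raise ValueError("no assertions to resolve")
    # min() returns the first minimal element, so ties keep input order.
    return min(assertions, key=lambda a: a.tier)
```

A real lens would likely also weigh recency and evidence strength within a tier; this sketch only shows the tier ordering.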

2. The "Signal Path" (Predictive)

Clustering of Tier 5 data.

  • Individual reports are ignored.
  • Clusters (e.g., 50+ mentions of a symptom in 7 days) are promoted to "Latent Signals" and flagged for comparison against Tier 0.
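
The promotion rule above can be sketched as a sliding-window count per symptom. This is an illustrative sketch only: the function name `find_latent_signals`, the `(symptom, timestamp)` input shape, and the defaults (50 mentions, 7 days, taken from the example in the text) are assumptions, and it presumes symptom names have already been extracted and normalized upstream.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta


def find_latent_signals(mentions, threshold=50, window=timedelta(days=7)):
    """Return the set of symptoms with at least `threshold` mentions inside any `window`.

    `mentions` is an iterable of (symptom, timestamp) pairs (hypothetical shape).
    """
    by_symptom = defaultdict(list)
    for symptom, ts in mentions:
        by_symptom[symptom].append(ts)

    signals = set()
    for symptom, stamps in by_symptom.items():
        recent = deque()  # timestamps inside the window ending at the current mention
        for ts in sorted(stamps):
            recent.append(ts)
            # Drop mentions that are a full window or more older than `ts`.
            while ts - recent[0] >= window:
                recent.popleft()
            if len(recent) >= threshold:
                signals.add(symptom)
                break  # one qualifying window is enough to promote
    return signals
```

Because individual reports never leave this function, a single post cannot become a signal on its own; only the cluster is promoted for comparison against Tier 0.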

3. Language Translation

Latent uses google-cloud-translate or local marian-nmt models to normalize Tier 0 data from the PMDA (Japan) and EMA (EU) into English assertions for global conflict detection.
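
One way to combine a cloud translator with a local model is an ordered fallback chain, sketched below. This is an assumption about how the two options might be wired together, not a description of Latent's code. The backends are passed in as plain callables, so the sketch makes no claims about the google-cloud-translate or marian-nmt APIs, and `normalize_to_english` is a hypothetical name.

```python
from typing import Callable, Sequence

# A backend takes (text, source_lang) and returns English text.
Translator = Callable[[str, str], str]


def normalize_to_english(text: str, source_lang: str,
                         backends: Sequence[Translator]) -> str:
    """Translate `text` to English using the first backend that succeeds."""
    if source_lang == "en":
        return text  # already English; no backend is called
    errors = []
    for backend in backends:
        try:
            return backend(text, source_lang)
        except Exception as exc:  # a failed backend falls through to the next one
            errors.append(exc)
    raise RuntimeError(
        f"all {len(backends)} translation backends failed: {errors}"
    )
```

Ordering the list as cloud first, local second would keep ingestion running when the cloud service is unreachable, at the cost of possibly lower translation quality.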