stemedb/latent/sources.md
jordan b3e8a9a058 feat: Multi-application expansion with chaos testing and community UI
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-04 01:24:14 -07:00

# Latent: Data Sources Catalog
Latent maps the world's drug safety data into the StemeDB Source Class hierarchy.
## Tier 0: Regulatory (The Ground Truth)
*Static, authoritative, legally mandated.*
| Source | Access Method | Update Frequency | Data Format |
| :--- | :--- | :--- | :--- |
| **FDA Labels** | OpenFDA API | Weekly | Structured JSON |
| **EMA Post-Auth** | Web Scraper / RSS | Monthly | PDF / HTML |
| **DailyMed** | NIH API / Bulk | Daily | SPL (XML) |
| **PMDA (Japan)** | Web Scraper | Quarterly | HTML (Japanese) |
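
As an illustration of the OpenFDA access method listed for FDA Labels, the sketch below builds a query URL for the public openFDA drug label endpoint. The helper name `build_label_query_url` is hypothetical and not part of Latent, and the sketch only constructs the URL; fetching, paging, and rate limiting are left out.

```python
from urllib.parse import urlencode

# Public openFDA endpoint for drug labeling.
OPENFDA_LABEL_ENDPOINT = "https://api.fda.gov/drug/label.json"


def build_label_query_url(generic_name: str, limit: int = 1) -> str:
    """Build an openFDA label search URL for one generic drug name.

    Hypothetical helper for illustration; Latent's real ingester may differ.
    """
    if limit < 1:
        raise ValueError("limit must be at least 1")
    params = {
        # openFDA search syntax: field:"exact phrase"
        "search": f'openfda.generic_name:"{generic_name}"',
        "limit": limit,
    }
    # urlencode percent-encodes the colon and the quotes in the search term.
    return f"{OPENFDA_LABEL_ENDPOINT}?{urlencode(params)}"
```

Keeping URL construction separate from the HTTP call makes the query logic easy to unit test without network access.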
## Tier 1: Clinical (The Science)
*Rigorous, peer-reviewed, baseline statistics.*
| Source | Access Method | Data Points |
| :--- | :--- | :--- |
| **ClinicalTrials.gov** | CT.gov API v2 | Adverse Event Tables, Trial Status |
| **EudraCT** | Web Scraper | European Clinical Trial results |
| **Registry Metadata** | Crossref API | Publication status of completed trials |
## Tier 2: Observational & Expert (The Narrative)
*Case reports, specialist guidelines, real-world studies.*
| Source | Access Method | Role |
| :--- | :--- | :--- |
| **PubMed / MEDLINE** | Entrez E-utilities | Case reports of rare adverse events |
| **bioRxiv / medRxiv** | API | Pre-print signals (Fast but unverified) |
| **NICE Guidelines** | Web Scraper | Standard of care changes |
## Tier 4: Aggregated Community (The Volume)
*Structured reports from non-regulatory sources.*
| Source | Access Method | Role |
| :--- | :--- | :--- |
| **FAERS** | OpenFDA API | Public side-effect reporting (Noisy) |
| **VAERS** | CDC WONDER / Bulk CSV download | Vaccine-specific adverse events |
| **PatientsLikeMe** | Web Scraper | Structured patient-reported outcomes |
## Tier 5: Anecdotal (The Early Warning)
*Unstructured, high-velocity, messy.*
| Source | Access Method | Target Channels |
| :--- | :--- | :--- |
| **Reddit** | Apify / Reddit API | r/Ozempic, r/Medicine, r/Biohackers |
| **Twitter / X** | Apify | #MedTwitter, #PharmaSafety |
| **TikTok** | Web Scraper | Trending side-effect "storytimes" |
## Ingestion Strategy
### 1. The "Golden Path" (High Confidence)
**Tier 0 and Tier 1** data is ingested automatically. These sources are treated as permanent records and override all other tiers in the **Authority Lens**.
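
The precedence rule can be sketched as a small tier enum plus a resolver. This is a minimal sketch, not StemeDB's actual Authority Lens: the names `SourceTier`, `Claim`, `is_golden_path`, and `authoritative_claim` are hypothetical, and the rule that a lower tier number wins (including Tier 0 over Tier 1) is an assumption, since the catalog only says that Tier 0 and Tier 1 override the rest.

```python
from dataclasses import dataclass
from enum import IntEnum


class SourceTier(IntEnum):
    """Tiers as numbered in this catalog (which lists no Tier 3)."""
    REGULATORY = 0
    CLINICAL = 1
    OBSERVATIONAL = 2
    AGGREGATED_COMMUNITY = 4
    ANECDOTAL = 5


# Tiers ingested automatically on the Golden Path.
GOLDEN_PATH_TIERS = frozenset({SourceTier.REGULATORY, SourceTier.CLINICAL})


@dataclass(frozen=True)
class Claim:
    source: str
    tier: SourceTier
    assertion: str


def is_golden_path(claim: Claim) -> bool:
    """True when the claim comes from a Tier 0 or Tier 1 source."""
    return claim.tier in GOLDEN_PATH_TIERS


def authoritative_claim(claims: list[Claim]) -> Claim:
    """Return the claim from the most authoritative tier.

    Assumption: a lower tier number wins; ties keep the first claim seen.
    """
    if not claims:
        raise ValueError("no claims to resolve")
    return min(claims, key=lambda c: c.tier)
```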
### 2. The "Signal Path" (Predictive)
**Tier 5** data is ingested only as clusters, never as individual reports.
- Individual reports are ignored.
- **Clusters** (e.g., 50+ mentions of a symptom in 7 days) are promoted to "Latent Signals" and flagged for comparison against Tier 0.
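
The cluster rule can be sketched with a sliding time window per (drug, symptom) pair. This is a minimal sketch under stated assumptions: the function name `find_latent_signals`, the tuple layout of a mention, and keying clusters on the drug as well as the symptom are all hypothetical. The 50-mention threshold and 7-day window come from the example above and would presumably be configurable.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta

# A mention is (timestamp, drug, symptom); this layout is an assumption.
Mention = tuple[datetime, str, str]

WINDOW = timedelta(days=7)
THRESHOLD = 50


def find_latent_signals(
    mentions: list[Mention],
    window: timedelta = WINDOW,
    threshold: int = THRESHOLD,
) -> set[tuple[str, str]]:
    """Return (drug, symptom) pairs with at least `threshold` mentions
    inside any span shorter than `window`."""
    recent: dict[tuple[str, str], deque] = defaultdict(deque)
    signals: set[tuple[str, str]] = set()
    for ts, drug, symptom in sorted(mentions, key=lambda m: m[0]):
        key = (drug, symptom)
        timestamps = recent[key]
        timestamps.append(ts)
        # Drop mentions that have fallen out of the window ending at ts.
        while ts - timestamps[0] >= window:
            timestamps.popleft()
        if len(timestamps) >= threshold:
            signals.add(key)
    return signals
```

Because only the pairs that cross the threshold are returned, a single post never surfaces on its own, which matches the rule that individual reports are ignored.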
### 3. Language Translation
Latent uses `google-cloud-translate` or local `marian-nmt` models to normalize Tier 0 data from the PMDA (Japan) and EMA (EU) into English assertions for global conflict detection.
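
One way to keep the translation backend swappable is to hide it behind a plain callable. The sketch below is an assumption about structure, not Latent's actual code: `Translator`, `RawRecord`, `EnglishAssertion`, and `normalize_to_english` are hypothetical names, and no real `google-cloud-translate` or Marian call is shown. A wrapper around either backend would be passed in as `translate`.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical seam: (text, source_language_code) -> English text.
# A cloud or local model wrapper would be adapted to this signature.
Translator = Callable[[str, str], str]


@dataclass(frozen=True)
class RawRecord:
    source: str    # e.g. "PMDA (Japan)"
    language: str  # ISO 639-1 code, e.g. "ja"
    text: str


@dataclass(frozen=True)
class EnglishAssertion:
    source: str
    original_language: str
    text: str


def normalize_to_english(record: RawRecord, translate: Translator) -> EnglishAssertion:
    """Translate a record to English, skipping the backend for English input."""
    if record.language == "en":
        english = record.text
    else:
        english = translate(record.text, record.language)
    # Keep the original language so reviewers can trace translated assertions.
    return EnglishAssertion(record.source, record.language, english.strip())
```

Recording the original language on each assertion lets conflict detection treat machine-translated text with extra caution if needed.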