# Ontology Layer + Medical Vertical Implementation Plan

> **Goal:** Build the ontology layer that defines how claims are structured, extracted, and resolved — using Pharma/Medical as the first vertical to pressure-test every decision.
>
> **Philosophy:** Build medical end-to-end first, extract the reusable ontology layer as patterns emerge. Don't abstract prematurely.
>
> **Outcome:** A working claim extraction pipeline that can ingest FDA labels, clinical trial data, and eventually social media — producing properly structured assertions that conflict-detect correctly.

---

## Context: What Exists Today

| Component | Status | Gap |
|-----------|--------|-----|
| Core StemeDB | ✅ Complete | Storage, lenses, conflict detection work |
| SkepticLens | ✅ Complete | Can detect conflicts if subjects/predicates match |
| `latent/ingest-fda` | 🟡 Prototype | Fetches FDA labels, outputs flat JSONL (not integrated) |
| Aphoria extractors | ✅ Complete | Pattern-based extraction for code (14 extractors) |
| Disputed LLM extraction | 🟡 Early | Generic SPO extraction, no domain schema |
| Ontology definitions | ❌ Missing | No formal way to define subject patterns, predicate schemas |

**The gap:** We can store claims. We cannot yet *structure* claims for a domain in a way that guarantees conflicts will collide correctly.

---

## The Core Insight

When Journal A says "Semaglutide reduced HbA1c by 1.5%" and Journal B says "Semaglutide reduced HbA1c by 0.8%", we need:

```
Subject:   Semaglutide:Type2Diabetes   # Drug:Indication compound key
Predicate: hba1c_change_percent        # Reusable across drug:indication pairs
Object:    -1.5                        # The conflicting value
```

The **subject granularity depends on the predicate type**:

- Efficacy predicates → Subject is `{Drug}:{Indication}`
- Safety predicates → Subject is `{Drug}` (indication-agnostic)
- Mechanism predicates → Subject is `{Drug}` or `{Drug}:{Pathway}`

This is what the ontology layer needs to express.

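
The subject rule above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the planned implementation: `SUBJECT_PATTERNS`, `build_subject`, and `conflict_key` are hypothetical names, and the mechanism schema is simplified to its `{Drug}` form only.

```python
import re

# Hypothetical mapping from predicate type to subject pattern, mirroring the
# three rules listed above (mechanism simplified to {Drug} for this sketch).
SUBJECT_PATTERNS = {
    "efficacy": "{Drug}:{Indication}",
    "safety": "{Drug}",
    "mechanism": "{Drug}",
}


def build_subject(predicate_type: str, entities: dict) -> str:
    """Fill the subject pattern for a predicate type with entity values."""
    pattern = SUBJECT_PATTERNS[predicate_type]
    required = re.findall(r"\{(\w+)\}", pattern)
    missing = [name for name in required if name not in entities]
    if missing:
        raise ValueError(f"{predicate_type} subject needs {missing}")
    # Only entities named in the pattern are used; extras are ignored, which
    # is what keeps safety subjects indication-agnostic.
    return pattern.format(**{name: entities[name] for name in required})


def conflict_key(predicate_type: str, predicate: str, entities: dict) -> tuple:
    """Claims can only collide if they share this (subject, predicate) key."""
    return (build_subject(predicate_type, entities), predicate)


entities = {"Drug": "Semaglutide", "Indication": "Type2Diabetes"}
journal_a = conflict_key("efficacy", "hba1c_change_percent", entities)
journal_b = conflict_key("efficacy", "hba1c_change_percent", entities)
safety = conflict_key("safety", "nausea_incidence", entities)
```

Here `journal_a` and `journal_b` are the same key, so the two HbA1c claims meet at one place and their differing objects (-1.5 and -0.8) can be compared. The safety key drops the indication even though one was supplied.
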

---

## High-Level Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                      Ontology Layer (New)                       │
├─────────────────────────────────────────────────────────────────┤
│  Domain Definition (YAML/TOML)                                  │
│  ├── Entity Types (Drug, Condition, Biomarker...)               │
│  ├── Predicate Schemas (subject pattern → predicates)           │
│  ├── Source Hierarchy (Tier 0-5)                                │
│  └── Default Lens per predicate type                            │
├─────────────────────────────────────────────────────────────────┤
│                       Extraction Pipeline                       │
│  ├── Source Adapters (FDA API, PubMed, Reddit...)               │
│  ├── Claim Extractor (LLM-based, schema-guided)                 │
│  ├── Normalizer (maps raw → ontology subjects/predicates)       │
│  └── Validator (checks claim conforms to schema)                │
├─────────────────────────────────────────────────────────────────┤
│                     StemeDB Core (Existing)                     │
│  ├── Assertion Storage                                          │
│  ├── Conflict Detection (SkepticLens)                           │
│  └── Query/Resolution                                           │
└─────────────────────────────────────────────────────────────────┘
```

---

## Week-by-Week Implementation Plan

### Week 1: Domain Definition Schema

**Goals:**

- Define how domains express their ontology
- Create the pharma domain definition as the first instance
- No extraction yet — just the schema that extraction will target

**Tasks:**

1. **Design domain definition format** (`applications/ontology/`)
   - Choose format: YAML or TOML (TOML aligns with Rust ecosystem)
   - Define schema for entity types, predicate schemas, source tiers

2. **Create pharma domain definition** (`applications/ontology/domains/pharma.toml`)

   ```toml
   [domain]
   name = "pharma"
   version = "0.1.0"

   [entity_types]
   Drug = { aliases = ["medication", "compound", "molecule"] }
   Indication = { aliases = ["condition", "disease", "disorder"] }
   Biomarker = { aliases = ["endpoint", "measure"] }
   Population = { aliases = ["cohort", "patient_group"] }

   [predicate_schemas.efficacy]
   subject_pattern = "{Drug}:{Indication}"
   predicates = ["hba1c_change_percent", "weight_change_percent", "remission_rate"]
   default_lens = "Skeptic"

   [predicate_schemas.safety]
   subject_pattern = "{Drug}"
   predicates = ["nausea_incidence", "discontinuation_rate", "has_boxed_warning"]
   default_lens = "Authority"

   [predicate_schemas.mechanism]
   subject_pattern = "{Drug}"
   predicates = ["target_receptor", "half_life_hours", "bioavailability_percent"]
   default_lens = "Recency"

   [source_hierarchy]
   tier0 = ["FDA_Label", "EMA_Approval"]
   tier1 = ["Phase3_RCT", "Meta_Analysis"]
   tier2 = ["Observational_Study", "Real_World_Evidence"]
   tier3 = ["Case_Report", "Expert_Opinion"]
   tier4 = ["Patient_Forum", "Social_Media"]
   ```

3. **Implement domain parser** (`applications/ontology/src/domain.rs`)
   - Parse TOML into Rust structs
   - Validate schema consistency (no circular refs, valid patterns)
   - Unit tests for parsing

4. **Subject builder utility**
   - Given entity values + predicate schema, build correct subject string
   - `build_subject("efficacy", {"Drug": "Semaglutide", "Indication": "Type2Diabetes"})` → `"Semaglutide:Type2Diabetes"`

**Deliverables:**

- `applications/ontology/` crate with domain definition parsing
- `domains/pharma.toml` as the reference implementation
- Subject builder that enforces schema compliance

**Foundation this enables:**

- Extractors know what shape claims should have
- Validators can check claims against schema
- Future domains just add a new `.toml` file

---

### Week 2: FDA Label Extraction (Tier 0)

**Goals:**

- Upgrade `latent/ingest-fda` to produce schema-compliant assertions
- Extract *structured* claims, not raw text blobs
- Write directly to StemeDB (not JSONL files)

**Tasks:**

1. **Refactor FDA ingestor** (`latent/ingest-fda/`)
   - Load pharma domain definition
   - Use LLM (Claude) to extract structured claims from label sections
   - Map extracted claims to ontology predicates

2. **LLM extraction prompt for FDA labels**

   ```
   Given this FDA label section for {drug_name}, extract claims as structured data.

   For ADVERSE REACTIONS sections, extract:
   - Predicate: {symptom}_incidence
   - Object: decimal (0.0-1.0)
   - Quote: exact text supporting this

   For BOXED WARNINGS, extract:
   - Predicate: has_boxed_warning
   - Object: boolean (true)
   - Quote: warning text

   Return JSON array of claims.
   ```

3. **Normalizer module** (`latent/ingest-fda/normalizer.py`)
   - Map drug names to canonical form (Ozempic → semaglutide)
   - Map symptom names to canonical predicates (nausea, vomiting → distinct)
   - Use drug synonym database (RxNorm or similar)

4. **StemeDB client integration**
   - Replace JSONL output with HTTP calls to StemeDB API
   - Sign assertions with ingestor's Ed25519 key
   - Set `source_class: Regulatory` (Tier 0)

5. **Integration test**
   - Ingest semaglutide FDA label
   - Query StemeDB for `Semaglutide` safety predicates
   - Verify structured claims exist with correct schema

**Deliverables:**

- FDA ingestor writes schema-compliant assertions to StemeDB
- At least 3 drugs (semaglutide, tirzepatide, liraglutide) ingested
- Integration test proving round-trip

**Foundation this enables:**

- Tier 0 (regulatory) baseline established
- Lower-tier sources can now conflict with FDA data
- Pattern for other structured source adapters

---

### Week 3: Clinical Trial Extraction (Tier 1)

**Goals:**

- Extract efficacy claims from clinical trial publications
- These will conflict with each other (the whole point of Episteme)
- Demonstrate SkepticLens showing real disagreement

**Tasks:**

1. **PubMed/PMC source adapter** (`latent/ingest-pubmed/`)
   - Fetch abstracts + full text for GLP-1 trials
   - Filter by clinical trial registration (NCT numbers)
   - Extract study metadata (sample size, design, journal)

2. **LLM extraction prompt for efficacy claims**

   ```
   Given this clinical trial abstract/results for {drug_name} in {indication}:

   Extract efficacy claims:
   - Subject: {drug}:{indication}
   - Predicate: {biomarker}_change_percent | remission_rate | etc.
   - Object: numeric value
   - Confidence: based on sample size and study design
   - Quote: exact text

   Include comparator arm if mentioned (vs placebo, vs {other_drug}).
   ```

3. **Confidence scoring heuristics**
   - Phase 3 RCT: base confidence 0.9
   - Phase 2: base confidence 0.7
   - Observational: base confidence 0.5
   - Adjust by sample size, blinding, journal impact factor

4. **Conflict demonstration**
   - Ingest 5+ trials with varying HbA1c results for semaglutide
   - Query with SkepticLens
   - Show `conflict_score > 0` and multiple competing claims

**Deliverables:**

- PubMed ingestor producing Tier 1 assertions
- At least 10 clinical trial papers ingested
- SkepticLens query showing real conflict in GLP-1 efficacy data

**Foundation this enables:**

- Proves the ontology enables conflict detection
- Real "Trust but Verify" data for demos
- Pattern for other publication sources (biorxiv, medrxiv)

---

### Week 4: Ontology Validation & Query Patterns

**Goals:**

- Validate that extracted claims conform to ontology
- Build query helpers that leverage domain knowledge
- CLI tool for exploring the pharma knowledge graph

**Tasks:**

1. **Claim validator** (`applications/ontology/src/validator.rs`)
   - Check subject matches predicate schema's subject_pattern
   - Check predicate is defined for its schema
   - Check object type is valid for predicate
   - Return validation errors or Ok

2. **Query builder with domain awareness**
   - `query_efficacy("Semaglutide", "Type2Diabetes")` → correct subject + all efficacy predicates
   - `query_safety("Semaglutide")` → correct subject + all safety predicates
   - Uses domain definition to know which predicates to include

3. **CLI exploration tool** (`applications/ontology/src/bin/steme-pharma.rs`)

   ```bash
   steme-pharma efficacy semaglutide type2-diabetes
   # Shows all efficacy claims with conflict scores

   steme-pharma safety semaglutide
   # Shows all safety claims by tier

   steme-pharma compare semaglutide tirzepatide --indication type2-diabetes
   # Side-by-side comparison
   ```

4. **Domain-aware API endpoints** (optional, if time)
   - `GET /v1/pharma/efficacy?drug=semaglutide&indication=type2_diabetes`
   - `GET /v1/pharma/safety?drug=semaglutide`
   - Thin wrapper over existing query API with domain knowledge

**Deliverables:**

- Validator catches schema-violating claims before ingestion
- CLI tool for exploring pharma data
- Query helpers that make domain queries easy

**Foundation this enables:**

- Quality gate for extraction pipelines
- Developer experience for exploring data
- Pattern for domain-specific APIs

---

### Week 5: Social Signal Extraction (Tier 4-5)

**Goals:**

- Extract anecdotal claims from Reddit/social media
- These are low-confidence but high-volume
- Demonstrate the full tier spectrum working together

**Tasks:**

1. **Reddit source adapter** (`latent/ingest-reddit/`)
   - Fetch posts from r/Ozempic, r/Semaglutide, r/diabetes
   - Extract post text, upvotes, comment count
   - Rate limit appropriately

2. **LLM extraction for anecdotal claims**

   ```
   Given this Reddit post about {drug_name}:

   Extract personal experience claims:
   - Predicate: reported_{symptom} | reported_effectiveness
   - Object: Text (the claim) or Boolean
   - Confidence: 0.2-0.4 based on detail and specificity
   - Quote: relevant excerpt

   Skip if: asking questions, discussing others' experiences, promotional.
   ```

3. **Anecdote clustering** (stretch goal)
   - Group similar anecdotal claims
   - Escalate if cluster size exceeds threshold
   - "50+ users reporting gastroparesis" becomes a signal

4. **Full-tier query demonstration**

   ```bash
   steme-pharma safety semaglutide --all-tiers
   # Shows:
   # Tier 0 (FDA): has_boxed_warning = true (thyroid)
   # Tier 1 (RCTs): nausea_incidence = 0.44
   # Tier 4 (Reddit): 127 reports of "gastroparesis" (pre-label)
   ```

**Deliverables:**

- Reddit ingestor producing Tier 4-5 assertions
- At least 100 anecdotal claims ingested
- Query showing full tier spectrum for a safety topic

**Foundation this enables:**

- "Early signal detection" — social media flags issues before FDA
- Real consumer health use case data
- Validates source_class hierarchy is working

---

### Week 6: Ontology Layer Extraction & Documentation

**Goals:**

- Extract reusable patterns from pharma implementation
- Document how to create a new domain
- Prepare for second vertical (SEC/financial or code/security)

**Tasks:**

1. **Generalize domain definition schema**
   - Review what was pharma-specific vs reusable
   - Document the schema formally
   - Create `domains/template.toml` with comments

2. **Extraction pipeline abstraction**
   - Common interface for source adapters
   - Common LLM prompt patterns
   - Common normalization utilities

3. **"Add a Domain" guide** (`docs/guides/adding-a-domain.md`)
   - Step-by-step: define entities, predicates, sources
   - Example: how pharma was built
   - Checklist: what you need before first extraction

4. **Second domain sketch** (no implementation)
   - Draft `domains/sec.toml` or `domains/security.toml`
   - Identify where patterns differ from pharma
   - Document what would need to change

5. **Integration with CLAUDE.md**
   - Add ontology layer to "Find Your Guide" table
   - Document new crates and tools
   - Update specialized agents if needed

**Deliverables:**

- Reusable ontology crate extracted from pharma-specific code
- Documentation for adding new domains
- Draft of second domain showing generalization

**Foundation this enables:**

- Other teams can add domains without core changes
- Clear separation of domain logic from storage logic
- Path to "platform" if that's the direction

---

## Risks & Mitigations

| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| LLM extraction quality varies | High | Medium | Start with structured sources (FDA), add human review loop |
| Ontology schema needs revision | High | Low | Keep Week 1 scope small, iterate based on Week 2-3 learnings |
| Drug name normalization is hard | Medium | Medium | Use existing resources (RxNorm), accept some manual mapping |
| Reddit rate limits | Medium | Low | Cache aggressively, use Reddit API properly |
| Scope creep into full consumer app | Medium | High | Stay focused on extraction + storage, not UX |

---

## Success Criteria

**Week 3 checkpoint:** Can we ingest FDA labels + clinical trials and see real conflicts via SkepticLens?

**Week 6 checkpoint:** Is the ontology layer reusable enough that sketching a second domain feels like "filling in a template" rather than "starting over"?

**End state:**

```bash
# This should work and show meaningful data:
steme-pharma efficacy semaglutide type2_diabetes

# Output:
Subject: Semaglutide:Type2Diabetes
Predicate: hba1c_change_percent
Status: Contested (conflict_score: 0.42)

Claims:
  -1.5% (45% weight) — NEJM Phase 3, n=1847
  -1.2% (35% weight) — Lancet Phase 3, n=892
  -0.8% (20% weight) — JAMA Observational, n=12000

Sources span Tier 0-2. 47 anecdotal reports (Tier 4) mention efficacy.
```

---

## Open Questions

1. **Where does the ontology crate live?**
   - Option A: `applications/ontology/` (domain-agnostic utility)
   - Option B: `crates/stemedb-ontology/` (core crate)
   - Recommendation: Start in `applications/`, promote to `crates/` if it proves domain-agnostic

2. **Python vs Rust for extractors?**
   - Python: faster iteration, better LLM libraries, existing `latent/` code
   - Rust: type safety, integration with core
   - Recommendation: Python for extractors (they're ETL jobs), Rust for ontology validation

3. **How do we handle ontology versioning?**
   - Predicates may change as we learn
   - Old assertions might not conform to new schema
   - Recommendation: Version in domain definition, allow "legacy" predicates with deprecation flags

---

## Next Steps

1. Review this plan and adjust scope/timeline
2. Decide on Question 1 (crate location)
3. Start Week 1: domain definition schema
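
As one possible input to the Week 1 schema work, the versioning recommendation from Open Question 3 could be sketched as follows. Everything here is an assumption for illustration: the `deprecated` field, the `check_predicate` helper, the `0.2.0` version, and the legacy predicate name `hba1c_delta` do not appear in the plan.

```python
from dataclasses import dataclass, field


@dataclass
class PredicateSchema:
    subject_pattern: str
    predicates: set
    # Hypothetical deprecation flags: names still accepted on old assertions
    # but no longer emitted by extractors.
    deprecated: set = field(default_factory=set)


@dataclass
class Domain:
    name: str
    version: str
    schemas: dict


def check_predicate(domain: Domain, schema_name: str, predicate: str) -> str:
    """Classify a predicate as 'ok', 'legacy', or 'unknown' for a schema."""
    schema = domain.schemas.get(schema_name)
    if schema is None:
        return "unknown"
    if predicate in schema.deprecated:
        return "legacy"
    if predicate in schema.predicates:
        return "ok"
    return "unknown"


# A made-up later revision of the pharma domain with one retired predicate.
pharma_next = Domain(
    name="pharma",
    version="0.2.0",
    schemas={
        "efficacy": PredicateSchema(
            subject_pattern="{Drug}:{Indication}",
            predicates={"hba1c_change_percent", "weight_change_percent", "remission_rate"},
            deprecated={"hba1c_delta"},
        )
    },
)

current = check_predicate(pharma_next, "efficacy", "hba1c_change_percent")
legacy = check_predicate(pharma_next, "efficacy", "hba1c_delta")
unknown = check_predicate(pharma_next, "efficacy", "not_a_predicate")
```

A validator built this way could accept `legacy` predicates with a warning and reject `unknown` ones, so old assertions stay readable while new extractions are held to the current schema.
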