stemedb/docs/specs/ontology-layer-medical-vertical.md

# Ontology Layer + Medical Vertical Implementation Plan

> **Goal:** Build the ontology layer that defines how claims are structured, extracted, and resolved — using Pharma/Medical as the first vertical to pressure-test every decision.
>
> **Philosophy:** Build medical end-to-end first, extract the reusable ontology layer as patterns emerge. Don't abstract prematurely.
>
> **Outcome:** A working claim extraction pipeline that can ingest FDA labels, clinical trial data, and eventually social media — producing properly structured assertions that conflict-detect correctly.

---

## Context: What Exists Today

| Component | Status | Notes / Gap |
|-----------|--------|-----|
| Core StemeDB | ✅ Complete | Storage, lenses, conflict detection work |
| SkepticLens | ✅ Complete | Can detect conflicts if subjects/predicates match |
| `latent/ingest-fda` | 🟡 Prototype | Fetches FDA labels, outputs flat JSONL (not integrated) |
| Aphoria extractors | ✅ Complete | Pattern-based extraction for code (14 extractors) |
| Disputed LLM extraction | 🟡 Early | Generic SPO extraction, no domain schema |
| Ontology definitions | ❌ Missing | No formal way to define subject patterns, predicate schemas |

**The gap:** We can store claims. We cannot yet *structure* claims for a domain in a way that guarantees conflicts will collide correctly.

---

## The Core Insight

When Journal A says "Semaglutide reduced HbA1c by 1.5%" and Journal B says "Semaglutide reduced HbA1c by 0.8%", both claims must be stored under the same subject and predicate so that they collide. Each needs the shape:

```
Subject:   Semaglutide:Type2Diabetes   # Drug:Indication compound key
Predicate: hba1c_change_percent        # Reusable across drug:indication pairs
Object:    -1.5                         # The conflicting value
```

The **subject granularity depends on the predicate type**:
- Efficacy predicates → Subject is `{Drug}:{Indication}`
- Safety predicates → Subject is `{Drug}` (indication-agnostic)
- Mechanism predicates → Subject is `{Drug}` or `{Drug}:{Pathway}`

This is what the ontology layer needs to express.
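
To make the collision requirement concrete, here is a minimal Python sketch. It is not how SkepticLens is implemented; `Claim` and `find_collisions` are hypothetical names used only for illustration. It shows that two claims can only be compared when they share the exact `(subject, predicate)` key, and that a claim keyed at the wrong granularity silently escapes comparison.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    """Illustrative claim shape; not the actual StemeDB assertion type."""
    subject: str
    predicate: str
    value: float
    source: str


def find_collisions(claims):
    """Group claims by (subject, predicate); keep groups whose values differ."""
    groups = defaultdict(list)
    for claim in claims:
        groups[(claim.subject, claim.predicate)].append(claim)
    return {
        key: group
        for key, group in groups.items()
        if len({c.value for c in group}) > 1
    }


claims = [
    Claim("Semaglutide:Type2Diabetes", "hba1c_change_percent", -1.5, "Journal A"),
    Claim("Semaglutide:Type2Diabetes", "hba1c_change_percent", -0.8, "Journal B"),
    # Keyed on the bare drug name, so it never meets the two claims above.
    Claim("Semaglutide", "hba1c_change_percent", -1.2, "Journal C"),
]

collisions = find_collisions(claims)
```

Only the first two claims end up in a contested group. The third is an efficacy claim stored with a safety-style subject, which is exactly the failure the ontology layer is meant to prevent.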

---

## High-Level Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                     Ontology Layer (New)                        │
├─────────────────────────────────────────────────────────────────┤
│  Domain Definition (YAML/TOML)                                  │
│  ├── Entity Types (Drug, Condition, Biomarker...)               │
│  ├── Predicate Schemas (subject pattern → predicates)           │
│  ├── Source Hierarchy (Tier 0-5)                                │
│  └── Default Lens per predicate type                            │
├─────────────────────────────────────────────────────────────────┤
│  Extraction Pipeline                                            │
│  ├── Source Adapters (FDA API, PubMed, Reddit...)               │
│  ├── Claim Extractor (LLM-based, schema-guided)                 │
│  ├── Normalizer (maps raw → ontology subjects/predicates)       │
│  └── Validator (checks claim conforms to schema)                │
├─────────────────────────────────────────────────────────────────┤
│  StemeDB Core (Existing)                                        │
│  ├── Assertion Storage                                          │
│  ├── Conflict Detection (SkepticLens)                           │
│  └── Query/Resolution                                           │
└─────────────────────────────────────────────────────────────────┘
```

---

## Week-by-Week Implementation Plan

### Week 1: Domain Definition Schema

**Goals:**
- Define how domains express their ontology
- Create the pharma domain definition as the first instance
- No extraction yet — just the schema that extraction will target

**Tasks:**

1. **Design domain definition format** (`applications/ontology/`)
   - Choose format: YAML or TOML (TOML aligns with Rust ecosystem)
   - Define schema for entity types, predicate schemas, source tiers

2. **Create pharma domain definition** (`applications/ontology/domains/pharma.toml`)
   ```toml
   [domain]
   name = "pharma"
   version = "0.1.0"

   [entity_types]
   Drug = { aliases = ["medication", "compound", "molecule"] }
   Indication = { aliases = ["condition", "disease", "disorder"] }
   Biomarker = { aliases = ["endpoint", "measure"] }
   Population = { aliases = ["cohort", "patient_group"] }

   [predicate_schemas.efficacy]
   subject_pattern = "{Drug}:{Indication}"
   predicates = ["hba1c_change_percent", "weight_change_percent", "remission_rate"]
   default_lens = "Skeptic"

   [predicate_schemas.safety]
   subject_pattern = "{Drug}"
   predicates = ["nausea_incidence", "discontinuation_rate", "has_boxed_warning"]
   default_lens = "Authority"

   [predicate_schemas.mechanism]
   subject_pattern = "{Drug}"
   predicates = ["target_receptor", "half_life_hours", "bioavailability_percent"]
   default_lens = "Recency"

   [source_hierarchy]
   tier0 = ["FDA_Label", "EMA_Approval"]
   tier1 = ["Phase3_RCT", "Meta_Analysis"]
   tier2 = ["Observational_Study", "Real_World_Evidence"]
   tier3 = ["Case_Report", "Expert_Opinion"]
   tier4 = ["Patient_Forum", "Social_Media"]
   ```

3. **Implement domain parser** (`applications/ontology/src/domain.rs`)
   - Parse TOML into Rust structs
   - Validate schema consistency (no circular refs, valid patterns)
   - Unit tests for parsing

4. **Subject builder utility**
   - Given entity values + predicate schema, build correct subject string
   - `build_subject("efficacy", {"Drug": "Semaglutide", "Indication": "Type2Diabetes"})` → `"Semaglutide:Type2Diabetes"`
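
The subject builder in Task 4 could behave roughly as follows. This is a Python sketch for discussion only; the patterns are copied from `pharma.toml` above, while the error handling and the rule that entity values may not contain `:` are assumptions, not decided behavior.

```python
import re

# Subject patterns copied from the pharma domain definition above.
SUBJECT_PATTERNS = {
    "efficacy": "{Drug}:{Indication}",
    "safety": "{Drug}",
    "mechanism": "{Drug}",
}

PLACEHOLDER = re.compile(r"\{(\w+)\}")


def build_subject(schema_name, entities):
    """Fill a schema's subject pattern, rejecting missing or unexpected entities."""
    if schema_name not in SUBJECT_PATTERNS:
        raise ValueError(f"unknown predicate schema: {schema_name!r}")
    pattern = SUBJECT_PATTERNS[schema_name]
    required = PLACEHOLDER.findall(pattern)
    missing = [name for name in required if not entities.get(name)]
    extra = sorted(set(entities) - set(required))
    if missing or extra:
        raise ValueError(f"{schema_name}: missing={missing}, unexpected={extra}")
    for name in required:
        # Assumption: ':' is reserved as the separator inside subjects.
        if ":" in entities[name]:
            raise ValueError(f"{name} value must not contain ':'")
    return PLACEHOLDER.sub(lambda m: entities[m.group(1)], pattern)


subject = build_subject(
    "efficacy", {"Drug": "Semaglutide", "Indication": "Type2Diabetes"}
)
```

Rejecting unexpected entities matters as much as rejecting missing ones: passing an `Indication` to a safety schema is a sign the caller has picked the wrong schema, and silently dropping it would hide that.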

**Deliverables:**
- `applications/ontology/` crate with domain definition parsing
- `domains/pharma.toml` as the reference implementation
- Subject builder that enforces schema compliance

**Foundation this enables:**
- Extractors know what shape claims should have
- Validators can check claims against schema
- Future domains just add a new `.toml` file
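
The "validate schema consistency" step from Task 3 might include checks like the ones below. The plan puts the real parser in Rust; this Python sketch only illustrates the checks, operating on a plain dict shaped like the parsed TOML. `check_domain` and the one-schema-per-predicate rule are assumptions.

```python
import re

PLACEHOLDER = re.compile(r"\{(\w+)\}")
KNOWN_LENSES = {"Skeptic", "Authority", "Recency"}  # lens names used in pharma.toml


def check_domain(domain):
    """Return a list of consistency problems; an empty list means none were found."""
    problems = []
    entity_types = set(domain.get("entity_types", {}))
    seen = {}  # predicate name -> schema that first declared it
    for name, schema in domain.get("predicate_schemas", {}).items():
        for placeholder in PLACEHOLDER.findall(schema.get("subject_pattern", "")):
            if placeholder not in entity_types:
                problems.append(f"{name}: unknown entity type {{{placeholder}}}")
        if schema.get("default_lens") not in KNOWN_LENSES:
            problems.append(f"{name}: unknown lens {schema.get('default_lens')!r}")
        for predicate in schema.get("predicates", []):
            # Assumption: a predicate belongs to exactly one schema, because
            # the schema is what fixes the subject granularity.
            if predicate in seen:
                problems.append(
                    f"{predicate}: declared in both {seen[predicate]} and {name}"
                )
            seen.setdefault(predicate, name)
    return problems


domain = {
    "entity_types": {"Drug": {}, "Indication": {}},
    "predicate_schemas": {
        "efficacy": {
            "subject_pattern": "{Drug}:{Indication}",
            "predicates": ["hba1c_change_percent"],
            "default_lens": "Skeptic",
        },
        "mechanism": {
            "subject_pattern": "{Drug}:{Pathway}",  # Pathway is not declared above
            "predicates": ["target_receptor"],
            "default_lens": "Recency",
        },
    },
}

problems = check_domain(domain)
```

Here `problems` contains a single entry flagging `{Pathway}`, which is the kind of mistake that would otherwise surface only when an extractor tries to build a subject.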

---

### Week 2: FDA Label Extraction (Tier 0)

**Goals:**
- Upgrade `latent/ingest-fda` to produce schema-compliant assertions
- Extract *structured* claims, not raw text blobs
- Write directly to StemeDB (not JSONL files)

**Tasks:**

1. **Refactor FDA ingestor** (`latent/ingest-fda/`)
   - Load pharma domain definition
   - Use LLM (Claude) to extract structured claims from label sections
   - Map extracted claims to ontology predicates

2. **LLM extraction prompt for FDA labels**
   ```
   Given this FDA label section for {drug_name}, extract claims as structured data.

   For ADVERSE REACTIONS sections, extract:
   - Predicate: {symptom}_incidence
   - Object: decimal (0.0-1.0)
   - Quote: exact text supporting this

   For BOXED WARNINGS, extract:
   - Predicate: has_boxed_warning
   - Object: boolean (true)
   - Quote: warning text

   Return JSON array of claims.
   ```

3. **Normalizer module** (`latent/ingest-fda/normalizer.py`)
   - Map drug names to canonical form (Ozempic → Semaglutide)
   - Map symptom names to canonical predicates, keeping related symptoms distinct (nausea and vomiting map to separate predicates)
   - Use drug synonym database (RxNorm or similar)

4. **StemeDB client integration**
   - Replace JSONL output with HTTP calls to StemeDB API
   - Sign assertions with ingestor's Ed25519 key
   - Set `source_class: Regulatory` (Tier 0)

5. **Integration test**
   - Ingest semaglutide FDA label
   - Query StemeDB for `Semaglutide` safety predicates
   - Verify structured claims exist with correct schema
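
A first cut of the normalizer from Task 3 could be as small as the sketch below. The lookup tables are hand-written stand-ins for a real synonym source such as RxNorm, and the function names are hypothetical. Returning `None` for unknown drugs is an assumption, chosen so that unmapped names can be queued for manual review rather than guessed.

```python
import re

# Tiny hand-written tables for illustration; the plan calls for RxNorm or similar.
BRAND_TO_CANONICAL = {
    "ozempic": "Semaglutide",
    "wegovy": "Semaglutide",
    "rybelsus": "Semaglutide",
    "mounjaro": "Tirzepatide",
    "victoza": "Liraglutide",
}
CANONICAL_NAMES = {"semaglutide", "tirzepatide", "liraglutide"}


def normalize_drug(name):
    """Map a brand or generic name to its canonical form, or None if unknown."""
    key = name.strip().lower()
    if key in BRAND_TO_CANONICAL:
        return BRAND_TO_CANONICAL[key]
    if key in CANONICAL_NAMES:
        return key.capitalize()
    return None  # caller should route this to manual mapping


def symptom_to_predicate(symptom):
    """Turn a raw symptom label into a `{symptom}_incidence` predicate name."""
    slug = re.sub(r"[^a-z0-9]+", "_", symptom.strip().lower()).strip("_")
    if not slug:
        raise ValueError("empty symptom label")
    return f"{slug}_incidence"


drug = normalize_drug(" Ozempic ")
predicate = symptom_to_predicate("Abdominal Pain")
```

Note that `symptom_to_predicate` only canonicalizes spelling. It deliberately does not merge symptoms, so "Nausea" and "Vomiting" stay separate predicates.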

**Deliverables:**
- FDA ingestor writes schema-compliant assertions to StemeDB
- At least 3 drugs (semaglutide, tirzepatide, liraglutide) ingested
- Integration test proving round-trip
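
Before anything is written to StemeDB, the LLM's JSON response needs a sanity pass. The sketch below assumes the response is a JSON array of objects with `predicate`, `object`, and `quote` keys; the prompt above does not fix those key names, so treat them as placeholders. It enforces the two constraints the prompt states (incidence in 0.0-1.0, boxed warning is `true`) and additionally requires the quote to appear verbatim in the label text, which is an assumed guard against invented quotes.

```python
import json


def parse_label_claims(raw_json, section_text):
    """Split an LLM response into accepted claims and (claim, reason) rejections."""
    accepted, rejected = [], []
    for claim in json.loads(raw_json):
        predicate = claim.get("predicate", "")
        obj = claim.get("object")
        quote = claim.get("quote", "")
        if not quote or quote not in section_text:
            rejected.append((claim, "quote not found verbatim in label section"))
        elif predicate == "has_boxed_warning":
            if obj is True:
                accepted.append(claim)
            else:
                rejected.append((claim, "has_boxed_warning must be true"))
        elif predicate.endswith("_incidence"):
            # bool is a subclass of int in Python, so exclude it explicitly.
            is_number = isinstance(obj, (int, float)) and not isinstance(obj, bool)
            if is_number and 0.0 <= obj <= 1.0:
                accepted.append(claim)
            else:
                rejected.append((claim, "incidence must be a decimal in [0.0, 1.0]"))
        else:
            rejected.append((claim, f"unexpected predicate {predicate!r}"))
    return accepted, rejected


# Made-up text and response for illustration; not taken from a real label.
section_text = "EXAMPLE ONLY: nausea occurred in 20% of patients."
raw_json = json.dumps([
    {"predicate": "nausea_incidence", "object": 0.2,
     "quote": "nausea occurred in 20% of patients"},
    {"predicate": "nausea_incidence", "object": 20,
     "quote": "nausea occurred in 20% of patients"},
    {"predicate": "vomiting_incidence", "object": 0.1,
     "quote": "vomiting was common"},
])

accepted, rejected = parse_label_claims(raw_json, section_text)
```

The second claim shows a likely failure mode: the model returns `20` (a percentage) where the schema expects `0.2` (a fraction). Rejecting it is safer than guessing the intended scale.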

**Foundation this enables:**
- Tier 0 (regulatory) baseline established
- Lower-tier sources can now conflict with FDA data
- Pattern for other structured source adapters

---

### Week 3: Clinical Trial Extraction (Tier 1)

**Goals:**
- Extract efficacy claims from clinical trial publications
- These will conflict with each other (the whole point of Episteme)
- Demonstrate SkepticLens showing real disagreement

**Tasks:**

1. **PubMed/PMC source adapter** (`latent/ingest-pubmed/`)
   - Fetch abstracts + full text for GLP-1 trials
   - Filter by clinical trial registration (NCT numbers)
   - Extract study metadata (sample size, design, journal)

2. **LLM extraction prompt for efficacy claims**
   ```
   Given this clinical trial abstract/results for {drug_name} in {indication}:

   Extract efficacy claims:
   - Subject: {drug}:{indication}
   - Predicate: {biomarker}_change_percent | remission_rate | etc.
   - Object: numeric value
   - Confidence: based on sample size and study design
   - Quote: exact text

   Include comparator arm if mentioned (vs placebo, vs {other_drug}).
   ```

3. **Confidence scoring heuristics**
   - Phase 3 RCT: base confidence 0.9
   - Phase 2: base confidence 0.7
   - Observational: base confidence 0.5
   - Adjust by sample size, blinding, journal impact factor

4. **Conflict demonstration**
   - Ingest 5+ trials with varying HbA1c results for semaglutide
   - Query with SkepticLens
   - Show `conflict_score > 0` and multiple competing claims
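
The confidence heuristics in Task 3 could start from something like this. The base values come from the list above; the size of each adjustment is an invented placeholder that would need tuning, and journal impact factor is left out entirely to keep the sketch short.

```python
import math

# Base values from the heuristics listed above.
BASE_CONFIDENCE = {"phase3_rct": 0.9, "phase2": 0.7, "observational": 0.5}


def score_confidence(design, sample_size, blinded):
    """Base confidence by study design, reduced for small or unblinded studies."""
    base = BASE_CONFIDENCE[design]
    # Placeholder adjustment: full penalty (0.1) at n <= 10, none at n >= 1000,
    # scaled linearly in log10(n) in between.
    size_factor = (math.log10(max(sample_size, 1)) - 1.0) / 2.0
    size_factor = min(1.0, max(0.0, size_factor))
    size_penalty = 0.1 * (1.0 - size_factor)
    # Placeholder adjustment: flat penalty for unblinded designs.
    blinding_penalty = 0.0 if blinded else 0.05
    score = base - size_penalty - blinding_penalty
    return round(min(1.0, max(0.0, score)), 3)


large_rct = score_confidence("phase3_rct", sample_size=1847, blinded=True)
small_obs = score_confidence("observational", sample_size=100, blinded=False)
```

Keeping the adjustments as penalties means a study can never score above the base value for its design, which preserves the ordering between designs at the top end.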

**Deliverables:**
- PubMed ingestor producing Tier 1 assertions
- At least 10 clinical trial papers ingested
- SkepticLens query showing real conflict in GLP-1 efficacy data

**Foundation this enables:**
- Proves the ontology enables conflict detection
- Real "Trust but Verify" data for demos
- Pattern for other publication sources (bioRxiv, medRxiv)

---

### Week 4: Ontology Validation & Query Patterns

**Goals:**
- Validate that extracted claims conform to ontology
- Build query helpers that leverage domain knowledge
- CLI tool for exploring the pharma knowledge graph

**Tasks:**

1. **Claim validator** (`applications/ontology/src/validator.rs`)
   - Check subject matches predicate schema's subject_pattern
   - Check predicate is defined for its schema
   - Check object type is valid for predicate
   - Return validation errors or Ok

2. **Query builder with domain awareness**
   - `query_efficacy("Semaglutide", "Type2Diabetes")` → correct subject + all efficacy predicates
   - `query_safety("Semaglutide")` → correct subject + all safety predicates
   - Uses domain definition to know which predicates to include

3. **CLI exploration tool** (`applications/ontology/src/bin/steme-pharma.rs`)
   ```bash
   steme-pharma efficacy semaglutide type2-diabetes
   # Shows all efficacy claims with conflict scores

   steme-pharma safety semaglutide
   # Shows all safety claims by tier

   steme-pharma compare semaglutide tirzepatide --indication type2-diabetes
   # Side-by-side comparison
   ```

4. **Domain-aware API endpoints** (optional, if time)
   - `GET /v1/pharma/efficacy?drug=semaglutide&indication=type2_diabetes`
   - `GET /v1/pharma/safety?drug=semaglutide`
   - Thin wrapper over existing query API with domain knowledge
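
The three validator checks in Task 1 can be sketched as below. The plan places the validator in Rust (`validator.rs`); this Python version only illustrates the logic. The per-predicate object types are an assumption, since `pharma.toml` as drafted does not yet declare them.

```python
import re

# Subject patterns from pharma.toml; object types per predicate are assumed.
SCHEMAS = {
    "efficacy": {
        "subject_pattern": "{Drug}:{Indication}",
        "predicates": {"hba1c_change_percent": float, "remission_rate": float},
    },
    "safety": {
        "subject_pattern": "{Drug}",
        "predicates": {"nausea_incidence": float, "has_boxed_warning": bool},
    },
}


def pattern_to_regex(pattern):
    """Compile '{Drug}:{Indication}' into a regex: one colon-free part per slot."""
    literals = re.split(r"\{\w+\}", pattern)
    return re.compile("[^:]+".join(re.escape(lit) for lit in literals))


def validate_claim(subject, predicate, obj):
    """Return a list of validation errors; an empty list means the claim conforms."""
    schema = next(
        (s for s in SCHEMAS.values() if predicate in s["predicates"]), None
    )
    if schema is None:
        return [f"predicate {predicate!r} is not defined in any schema"]
    errors = []
    if not pattern_to_regex(schema["subject_pattern"]).fullmatch(subject):
        errors.append(
            f"subject {subject!r} does not match {schema['subject_pattern']!r}"
        )
    expected = schema["predicates"][predicate]
    if expected is float:
        # Accept ints as numbers, but not bools (bool is a subclass of int).
        type_ok = isinstance(obj, (int, float)) and not isinstance(obj, bool)
    else:
        type_ok = isinstance(obj, expected)
    if not type_ok:
        errors.append(f"object {obj!r} is not a valid {expected.__name__}")
    return errors


ok = validate_claim("Semaglutide:Type2Diabetes", "hba1c_change_percent", -1.5)
wrong_subject = validate_claim("Semaglutide", "hba1c_change_percent", -1.5)
```

`ok` is an empty list, while `wrong_subject` reports that an efficacy predicate was attached to a drug-only subject.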

**Deliverables:**
- Validator catches schema-violating claims before ingestion
- CLI tool for exploring pharma data
- Query helpers that make domain queries easy
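
The query helpers could be thin functions over the domain definition, as in this sketch. The returned dict is a hypothetical query description, not the shape of the existing StemeDB query API, which this plan does not specify.

```python
# Copied from the efficacy and safety schemas in pharma.toml.
PREDICATE_SCHEMAS = {
    "efficacy": {
        "subject_pattern": "{Drug}:{Indication}",
        "predicates": ["hba1c_change_percent", "weight_change_percent", "remission_rate"],
        "default_lens": "Skeptic",
    },
    "safety": {
        "subject_pattern": "{Drug}",
        "predicates": ["nausea_incidence", "discontinuation_rate", "has_boxed_warning"],
        "default_lens": "Authority",
    },
}


def build_query(schema_name, **entities):
    """Describe a query: subject, predicates, and lens all come from the schema."""
    schema = PREDICATE_SCHEMAS[schema_name]
    return {
        # str.format raises KeyError if a required entity is missing.
        "subject": schema["subject_pattern"].format(**entities),
        "predicates": list(schema["predicates"]),
        "lens": schema["default_lens"],
    }


def query_efficacy(drug, indication):
    return build_query("efficacy", Drug=drug, Indication=indication)


def query_safety(drug):
    return build_query("safety", Drug=drug)


efficacy_query = query_efficacy("Semaglutide", "Type2Diabetes")
safety_query = query_safety("Semaglutide")
```

Because the helpers read everything from the schema, adding a predicate to `pharma.toml` would widen the queries without touching the helper code.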

**Foundation this enables:**
- Quality gate for extraction pipelines
- Developer experience for exploring data
- Pattern for domain-specific APIs

---

### Week 5: Social Signal Extraction (Tier 4-5)

**Goals:**
- Extract anecdotal claims from Reddit/social media
- These are low-confidence but high-volume
- Demonstrate the full tier spectrum working together

**Tasks:**

1. **Reddit source adapter** (`latent/ingest-reddit/`)
   - Fetch posts from r/Ozempic, r/Semaglutide, r/diabetes
   - Extract post text, upvotes, comment count
   - Rate limit appropriately

2. **LLM extraction for anecdotal claims**
   ```
   Given this Reddit post about {drug_name}:

   Extract personal experience claims:
   - Predicate: reported_{symptom} | reported_effectiveness
   - Object: Text (the claim) or Boolean
   - Confidence: 0.2-0.4 based on detail and specificity
   - Quote: relevant excerpt

   Skip if: asking questions, discussing others' experiences, promotional.
   ```

3. **Anecdote clustering** (stretch goal)
   - Group similar anecdotal claims
   - Escalate if cluster size exceeds threshold
   - "50+ users reporting gastroparesis" becomes a signal

4. **Full-tier query demonstration**
   ```bash
   steme-pharma safety semaglutide --all-tiers
   # Shows:
   # Tier 0 (FDA): has_boxed_warning = true (thyroid)
   # Tier 1 (RCTs): nausea_incidence = 0.44
   # Tier 4 (Reddit): 127 reports of "gastroparesis" (pre-label)
   ```
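
The escalation rule from Task 3 can be prototyped without real clustering by grouping on the normalized predicate. In this sketch, counting distinct authors rather than raw posts is an assumption (so one prolific poster cannot trigger a signal), the default threshold of 50 follows the "50+ users" example above, and the input records are synthetic.

```python
from collections import defaultdict


def escalate_clusters(anecdotes, threshold=50):
    """Count distinct authors per (subject, predicate); return groups at or above threshold."""
    authors = defaultdict(set)
    for anecdote in anecdotes:
        key = (anecdote["subject"], anecdote["predicate"])
        authors[key].add(anecdote["author"])
    return {key: len(who) for key, who in authors.items() if len(who) >= threshold}


# Synthetic records for illustration only.
anecdotes = [
    {"subject": "Semaglutide", "predicate": "reported_gastroparesis", "author": f"user_{i}"}
    for i in range(60)
]
anecdotes += [
    {"subject": "Semaglutide", "predicate": "reported_hair_loss", "author": f"user_{i}"}
    for i in range(3)
]
# A repeat post by the same author does not inflate the count.
anecdotes.append(
    {"subject": "Semaglutide", "predicate": "reported_gastroparesis", "author": "user_0"}
)

escalated = escalate_clusters(anecdotes)
```

Only the gastroparesis group is escalated, with a count of 60. Grouping by exact predicate depends on the extractor normalizing symptom names first; fuzzier clustering would be the actual stretch goal.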

**Deliverables:**
- Reddit ingestor producing Tier 4-5 assertions
- At least 100 anecdotal claims ingested
- Query showing full tier spectrum for a safety topic

**Foundation this enables:**
- "Early signal detection" — social media flags issues before FDA
- Real consumer health use case data
- Validates source_class hierarchy is working

---

### Week 6: Ontology Layer Extraction & Documentation

**Goals:**
- Extract reusable patterns from pharma implementation
- Document how to create a new domain
- Prepare for second vertical (SEC/financial or code/security)

**Tasks:**

1. **Generalize domain definition schema**
   - Review what was pharma-specific vs reusable
   - Document the schema formally
   - Create `domains/template.toml` with comments

2. **Extraction pipeline abstraction**
   - Common interface for source adapters
   - Common LLM prompt patterns
   - Common normalization utilities

3. **"Add a Domain" guide** (`docs/guides/adding-a-domain.md`)
   - Step-by-step: define entities, predicates, sources
   - Example: how pharma was built
   - Checklist: what you need before first extraction

4. **Second domain sketch** (no implementation)
   - Draft `domains/sec.toml` or `domains/security.toml`
   - Identify where patterns differ from pharma
   - Document what would need to change

5. **Integration with CLAUDE.md**
   - Add ontology layer to "Find Your Guide" table
   - Document new crates and tools
   - Update specialized agents if needed
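
One possible shape for the common source adapter interface in Task 2 is sketched below. All names (`RawDocument`, `SourceAdapter`, `run_pipeline`) are hypothetical, and the in-memory adapter exists only so the pipeline can be exercised without network access.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Callable, Iterator, List


@dataclass(frozen=True)
class RawDocument:
    source_id: str
    text: str
    tier: int


class SourceAdapter(ABC):
    """Anything that can turn a query into raw documents tagged with a tier."""

    tier: int

    @abstractmethod
    def fetch(self, query: str) -> Iterator[RawDocument]:
        ...


class InMemoryAdapter(SourceAdapter):
    """Stand-in adapter for tests; real adapters would call FDA, PubMed, or Reddit."""

    tier = 4

    def __init__(self, posts: List[str]):
        self._posts = posts

    def fetch(self, query: str) -> Iterator[RawDocument]:
        for index, text in enumerate(self._posts):
            if query.lower() in text.lower():
                yield RawDocument(f"mem-{index}", text, self.tier)


def run_pipeline(adapter: SourceAdapter, query: str,
                 extract: Callable[[str], List[dict]]) -> List[dict]:
    """Fetch documents, extract claims from each, and stamp provenance on every claim."""
    claims = []
    for doc in adapter.fetch(query):
        for claim in extract(doc.text):
            claims.append({**claim, "source_id": doc.source_id, "tier": doc.tier})
    return claims


adapter = InMemoryAdapter(["Ozempic made me nauseous", "Unrelated post"])
claims = run_pipeline(
    adapter, "ozempic", lambda text: [{"predicate": "reported_nausea", "object": True}]
)
```

Passing the extractor in as a callable keeps the adapter interface independent of whether extraction is LLM-based or pattern-based.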

**Deliverables:**
- Reusable ontology crate extracted from pharma-specific code
- Documentation for adding new domains
- Draft of second domain showing generalization

**Foundation this enables:**
- Other teams can add domains without core changes
- Clear separation of domain logic from storage logic
- Path to "platform" if that's the direction

---

## Risks & Mitigations

| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| LLM extraction quality varies | High | Medium | Start with structured sources (FDA), add human review loop |
| Ontology schema needs revision | High | Low | Keep Week 1 scope small, iterate based on Week 2-3 learnings |
| Drug name normalization is hard | Medium | Medium | Use existing resources (RxNorm), accept some manual mapping |
| Reddit rate limits | Medium | Low | Cache aggressively; use the official Reddit API and stay within its rate limits |
| Scope creep into full consumer app | Medium | High | Stay focused on extraction + storage, not UX |

---

## Success Criteria

**Week 3 checkpoint:** Can we ingest FDA labels + clinical trials and see real conflicts via SkepticLens?

**Week 6 checkpoint:** Is the ontology layer reusable enough that sketching a second domain feels like "filling in a template" rather than "starting over"?

**End state:**
```bash
# This should work and show meaningful data:
steme-pharma efficacy semaglutide type2-diabetes

# Output:
Subject: Semaglutide:Type2Diabetes
Predicate: hba1c_change_percent
Status: Contested (conflict_score: 0.42)

Claims:
  -1.5% (45% weight) — NEJM Phase 3, n=1847
  -1.2% (35% weight) — Lancet Phase 3, n=892
  -0.8% (20% weight) — JAMA Observational, n=12000

Sources span Tier 1-2. 47 anecdotal reports (Tier 4) mention efficacy.
```

---

## Open Questions

1. **Where does the ontology crate live?**
   - Option A: `applications/ontology/` (domain-agnostic utility)
   - Option B: `crates/stemedb-ontology/` (core crate)
   - Recommendation: Start in `applications/`, promote to `crates/` if it proves domain-agnostic

2. **Python vs Rust for extractors?**
   - Python: faster iteration, better LLM libraries, existing `latent/` code
   - Rust: type safety, integration with core
   - Recommendation: Python for extractors (they're ETL jobs), Rust for ontology validation

3. **How do we handle ontology versioning?**
   - Predicates may change as we learn
   - Old assertions might not conform to new schema
   - Recommendation: Version in domain definition, allow "legacy" predicates with deprecation flags
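
The deprecation-flag recommendation for Question 3 might work along these lines. This is a sketch under stated assumptions: the registry fields (`deprecated_in`, `replaced_by`) and the legacy predicate `hba1c_delta` are invented for illustration and appear nowhere in the current domain definition.

```python
# Hypothetical registry; field names and the legacy predicate are invented.
PREDICATES = {
    "hba1c_change_percent": {"since": "0.1.0"},
    "hba1c_delta": {
        "since": "0.1.0",
        "deprecated_in": "0.2.0",
        "replaced_by": "hba1c_change_percent",
    },
}


def resolve_predicate(name, registry=PREDICATES):
    """Follow replaced_by links; return (current_name, was_deprecated)."""
    seen = set()
    current = name
    while True:
        if current not in registry:
            raise KeyError(f"unknown predicate: {current!r}")
        if current in seen:
            raise ValueError(f"replacement cycle involving {current!r}")
        seen.add(current)
        replacement = registry[current].get("replaced_by")
        if replacement is None:
            return current, current != name
        current = replacement


resolved, was_deprecated = resolve_predicate("hba1c_delta")
```

With this approach old assertions stay readable under their original predicate, while queries and new writes are steered to the current name.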

---

## Next Steps

1. Review this plan and adjust scope/timeline
2. Decide on Open Question 1 (crate location)
3. Start Week 1: domain definition schema