# Ontology Layer + Medical Vertical Implementation Plan
> **Goal:** Build the ontology layer that defines how claims are structured, extracted, and resolved — using Pharma/Medical as the first vertical to pressure-test every decision.
>
> **Philosophy:** Build medical end-to-end first, extract the reusable ontology layer as patterns emerge. Don't abstract prematurely.
>
> **Outcome:** A working claim extraction pipeline that can ingest FDA labels, clinical trial data, and eventually social media — producing properly structured assertions that conflict-detect correctly.
---
## Context: What Exists Today
| Component | Status | Notes / Gap |
|-----------|--------|-----|
| Core StemeDB | ✅ Complete | Storage, lenses, conflict detection work |
| SkepticLens | ✅ Complete | Can detect conflicts if subjects/predicates match |
| `latent/ingest-fda` | 🟡 Prototype | Fetches FDA labels, outputs flat JSONL (not integrated) |
| Aphoria extractors | ✅ Complete | Pattern-based extraction for code (14 extractors) |
| Disputed LLM extraction | 🟡 Early | Generic SPO extraction, no domain schema |
| Ontology definitions | ❌ Missing | No formal way to define subject patterns, predicate schemas |

**The gap:** We can store claims. We cannot yet *structure* claims for a domain in a way that guarantees conflicts will collide correctly.

---
## The Core Insight
When Journal A says "Semaglutide reduced HbA1c by 1.5%" and Journal B says "Semaglutide reduced HbA1c by 0.8%", we need:
```
Subject: Semaglutide:Type2Diabetes # Drug:Indication compound key
Predicate: hba1c_change_percent # Reusable across drug:indication pairs
Object: -1.5 # The conflicting value
```
The **subject granularity depends on the predicate type**:
- Efficacy predicates → Subject is `{Drug}:{Indication}`
- Safety predicates → Subject is `{Drug}` (indication-agnostic)
- Mechanism predicates → Subject is `{Drug}` or `{Drug}:{Pathway}`

This is what the ontology layer needs to express.

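
To make the collision property concrete, here is a minimal Python sketch. It is not StemeDB's API: the `Claim` tuple and the `find_collisions` helper are hypothetical names, and the third claim is invented purely to show a non-colliding key. The sketch groups claims by `(subject, predicate)` and reports every key that carries more than one distinct value, so the two journal claims above collide only because they share the same compound subject.

```python
from collections import defaultdict
from typing import NamedTuple


class Claim(NamedTuple):
    """Hypothetical claim shape for illustration; not StemeDB's assertion type."""
    subject: str
    predicate: str
    value: float
    source: str


def find_collisions(claims):
    """Return {(subject, predicate): sorted distinct values} for keys that disagree."""
    values_by_key = defaultdict(set)
    for claim in claims:
        values_by_key[(claim.subject, claim.predicate)].add(claim.value)
    return {key: sorted(vals) for key, vals in values_by_key.items() if len(vals) > 1}


claims = [
    Claim("Semaglutide:Type2Diabetes", "hba1c_change_percent", -1.5, "Journal A"),
    Claim("Semaglutide:Type2Diabetes", "hba1c_change_percent", -0.8, "Journal B"),
    # Hypothetical claim keyed without the indication: it never meets the two above.
    Claim("Semaglutide", "hba1c_change_percent", -1.0, "Journal C"),
]
collisions = find_collisions(claims)
```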
---
## High-Level Architecture
```
┌─────────────────────────────────────────────────────────────────┐
│ Ontology Layer (New) │
├─────────────────────────────────────────────────────────────────┤
│ Domain Definition (YAML/TOML) │
│ ├── Entity Types (Drug, Condition, Biomarker...) │
│ ├── Predicate Schemas (subject pattern → predicates) │
│ ├── Source Hierarchy (Tier 0-5) │
│ └── Default Lens per predicate type │
├─────────────────────────────────────────────────────────────────┤
│ Extraction Pipeline │
│ ├── Source Adapters (FDA API, PubMed, Reddit...) │
│ ├── Claim Extractor (LLM-based, schema-guided) │
│ ├── Normalizer (maps raw → ontology subjects/predicates) │
│ └── Validator (checks claim conforms to schema) │
├─────────────────────────────────────────────────────────────────┤
│ StemeDB Core (Existing) │
│ ├── Assertion Storage │
│ ├── Conflict Detection (SkepticLens) │
│ └── Query/Resolution │
└─────────────────────────────────────────────────────────────────┘
```
---
## Week-by-Week Implementation Plan
### Week 1: Domain Definition Schema
**Goals:**
- Define how domains express their ontology
- Create the pharma domain definition as the first instance
- No extraction yet — just the schema that extraction will target

**Tasks:**
1. **Design domain definition format** (`applications/ontology/`)
- Choose format: YAML or TOML (TOML aligns with Rust ecosystem)
- Define schema for entity types, predicate schemas, source tiers
2. **Create pharma domain definition** (`applications/ontology/domains/pharma.toml`)
```toml
[domain]
name = "pharma"
version = "0.1.0"
[entity_types]
Drug = { aliases = ["medication", "compound", "molecule"] }
Indication = { aliases = ["condition", "disease", "disorder"] }
Biomarker = { aliases = ["endpoint", "measure"] }
Population = { aliases = ["cohort", "patient_group"] }
[predicate_schemas.efficacy]
subject_pattern = "{Drug}:{Indication}"
predicates = ["hba1c_change_percent", "weight_change_percent", "remission_rate"]
default_lens = "Skeptic"
[predicate_schemas.safety]
subject_pattern = "{Drug}"
predicates = ["nausea_incidence", "discontinuation_rate", "has_boxed_warning"]
default_lens = "Authority"
[predicate_schemas.mechanism]
subject_pattern = "{Drug}"
predicates = ["target_receptor", "half_life_hours", "bioavailability_percent"]
default_lens = "Recency"
[source_hierarchy]
tier0 = ["FDA_Label", "EMA_Approval"]
tier1 = ["Phase3_RCT", "Meta_Analysis"]
tier2 = ["Observational_Study", "Real_World_Evidence"]
tier3 = ["Case_Report", "Expert_Opinion"]
tier4 = ["Patient_Forum", "Social_Media"]
```
3. **Implement domain parser** (`applications/ontology/src/domain.rs`)
- Parse TOML into Rust structs
- Validate schema consistency (no circular refs, valid patterns)
- Unit tests for parsing
4. **Subject builder utility**
- Given entity values + predicate schema, build correct subject string
- `build_subject("efficacy", {"Drug": "Semaglutide", "Indication": "Type2Diabetes"})` → `"Semaglutide:Type2Diabetes"`
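
The subject builder in task 4 could look roughly like the following. This is a Python sketch for brevity, whereas the plan places the real utility in the Rust ontology crate. The `SUBJECT_PATTERNS` table is hard-coded here to mirror `pharma.toml`, and rejecting extra entities is an assumption about what "enforces schema compliance" should mean.

```python
import re

# Mirrors the subject_pattern values in pharma.toml; a real implementation
# would read them from the parsed domain definition instead of hard-coding.
SUBJECT_PATTERNS = {
    "efficacy": "{Drug}:{Indication}",
    "safety": "{Drug}",
    "mechanism": "{Drug}",
}

PLACEHOLDER = re.compile(r"\{(\w+)\}")


def build_subject(schema, entities):
    """Fill a schema's subject pattern, rejecting missing or extra entities."""
    if schema not in SUBJECT_PATTERNS:
        raise KeyError(f"unknown predicate schema: {schema}")
    pattern = SUBJECT_PATTERNS[schema]
    required = PLACEHOLDER.findall(pattern)
    missing = [name for name in required if name not in entities]
    extra = [name for name in entities if name not in required]
    if missing or extra:
        # Assumption: an unused entity signals a caller error, so it is rejected.
        raise ValueError(f"schema {schema!r}: missing={missing}, extra={extra}")
    return PLACEHOLDER.sub(lambda match: entities[match.group(1)], pattern)
```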

**Deliverables:**
- `applications/ontology/` crate with domain definition parsing
- `domains/pharma.toml` as the reference implementation
- Subject builder that enforces schema compliance

**Foundation this enables:**
- Extractors know what shape claims should have
- Validators can check claims against schema
- Future domains just add a new `.toml` file
---
### Week 2: FDA Label Extraction (Tier 0)
**Goals:**
- Upgrade `latent/ingest-fda` to produce schema-compliant assertions
- Extract *structured* claims, not raw text blobs
- Write directly to StemeDB (not JSONL files)

**Tasks:**
1. **Refactor FDA ingestor** (`latent/ingest-fda/`)
- Load pharma domain definition
- Use LLM (Claude) to extract structured claims from label sections
- Map extracted claims to ontology predicates
2. **LLM extraction prompt for FDA labels**
```
Given this FDA label section for {drug_name}, extract claims as structured data.
For ADVERSE REACTIONS sections, extract:
- Predicate: {symptom}_incidence
- Object: decimal (0.0-1.0)
- Quote: exact text supporting this
For BOXED WARNINGS, extract:
- Predicate: has_boxed_warning
- Object: boolean (true)
- Quote: warning text
Return JSON array of claims.
```
3. **Normalizer module** (`latent/ingest-fda/normalizer.py`)
- Map drug names to canonical form (Ozempic → semaglutide)
- Map symptom names to canonical predicates, keeping related symptoms (e.g. nausea and vomiting) as distinct predicates
- Use drug synonym database (RxNorm or similar)
4. **StemeDB client integration**
- Replace JSONL output with HTTP calls to StemeDB API
- Sign assertions with ingestor's Ed25519 key
- Set `source_class: Regulatory` (Tier 0)
5. **Integration test**
- Ingest semaglutide FDA label
- Query StemeDB for `Semaglutide` safety predicates
- Verify structured claims exist with correct schema
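
One way the normalizer and the prompt's JSON output might fit together is sketched below. Several things are assumptions: the lowercase JSON keys (`predicate`, `object`, `quote`), the tiny `DRUG_SYNONYMS` table standing in for RxNorm, the capitalized canonical name (chosen to match the `Semaglutide` subjects used elsewhere in this plan), and the helper names. The sketch drops any claim that lacks a quote or whose object falls outside the type or range the prompt asks for.

```python
import json

# Illustrative stand-in for a synonym database such as RxNorm.
DRUG_SYNONYMS = {"ozempic": "semaglutide", "wegovy": "semaglutide"}


def normalize_drug(name):
    """Map a brand or generic name to a canonical, capitalized subject."""
    key = name.strip().lower()
    return DRUG_SYNONYMS.get(key, key).capitalize()


def _valid_object(predicate, obj):
    """Check the object against the type and range requested in the prompt."""
    if predicate == "has_boxed_warning":
        return isinstance(obj, bool)
    if predicate.endswith("_incidence"):
        is_number = isinstance(obj, (int, float)) and not isinstance(obj, bool)
        return is_number and 0.0 <= obj <= 1.0
    return False  # predicates outside the prompt are rejected in this sketch


def parse_safety_claims(drug_name, llm_json):
    """Turn the LLM's JSON array into assertion dicts, dropping invalid entries."""
    subject = normalize_drug(drug_name)
    assertions = []
    for raw in json.loads(llm_json):
        predicate, quote = raw.get("predicate"), raw.get("quote")
        if not predicate or not quote:
            continue  # every claim needs a predicate and a supporting quote
        if not _valid_object(predicate, raw.get("object")):
            continue
        assertions.append({
            "subject": subject,
            "predicate": predicate,
            "object": raw["object"],
            "quote": quote,
            "source_class": "Regulatory",  # Tier 0, as set in task 4
        })
    return assertions
```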

**Deliverables:**
- FDA ingestor writes schema-compliant assertions to StemeDB
- At least 3 drugs (semaglutide, tirzepatide, liraglutide) ingested
- Integration test proving round-trip

**Foundation this enables:**
- Tier 0 (regulatory) baseline established
- Lower-tier sources can now conflict with FDA data
- Pattern for other structured source adapters
---
### Week 3: Clinical Trial Extraction (Tier 1)
**Goals:**
- Extract efficacy claims from clinical trial publications
- These will conflict with each other (the whole point of Episteme)
- Demonstrate SkepticLens showing real disagreement

**Tasks:**
1. **PubMed/PMC source adapter** (`latent/ingest-pubmed/`)
- Fetch abstracts + full text for GLP-1 trials
- Filter by clinical trial registration (NCT numbers)
- Extract study metadata (sample size, design, journal)
2. **LLM extraction prompt for efficacy claims**
```
Given this clinical trial abstract/results for {drug_name} in {indication}:
Extract efficacy claims:
- Subject: {drug}:{indication}
- Predicate: {biomarker}_change_percent | remission_rate | etc.
- Object: numeric value
- Confidence: based on sample size and study design
- Quote: exact text
Include comparator arm if mentioned (vs placebo, vs {other_drug}).
```
3. **Confidence scoring heuristics**
- Phase 3 RCT: base confidence 0.9
- Phase 2: base confidence 0.7
- Observational: base confidence 0.5
- Adjust by sample size, blinding, journal impact factor
4. **Conflict demonstration**
- Ingest 5+ trials with varying HbA1c results for semaglutide
- Query with SkepticLens
- Show `conflict_score > 0` and multiple competing claims
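
The confidence heuristics in task 3 can be sketched as a small scoring function. The base values come from the list above; the size of each adjustment (and the sample-size cut-offs) are placeholders chosen for illustration, and journal impact factor is left out because the plan does not say how it should be weighted.

```python
# Base confidence per study design, taken from the heuristics above.
BASE_CONFIDENCE = {"phase3_rct": 0.9, "phase2": 0.7, "observational": 0.5}


def score_confidence(design, sample_size, double_blind=False):
    """Base confidence by design, nudged by sample size and blinding, clamped to [0, 1]."""
    score = BASE_CONFIDENCE[design]
    # The adjustment sizes and thresholds below are placeholders, not plan values.
    if sample_size < 100:
        score -= 0.10
    elif sample_size >= 1000:
        score += 0.03
    if double_blind:
        score += 0.02
    return round(min(1.0, max(0.0, score)), 2)
```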

**Deliverables:**
- PubMed ingestor producing Tier 1 assertions
- At least 10 clinical trial papers ingested
- SkepticLens query showing real conflict in GLP-1 efficacy data

**Foundation this enables:**
- Proves the ontology enables conflict detection
- Real "Trust but Verify" data for demos
- Pattern for other publication sources (bioRxiv, medRxiv)
---
### Week 4: Ontology Validation & Query Patterns
**Goals:**
- Validate that extracted claims conform to ontology
- Build query helpers that leverage domain knowledge
- CLI tool for exploring the pharma knowledge graph

**Tasks:**
1. **Claim validator** (`applications/ontology/src/validator.rs`)
- Check subject matches predicate schema's subject_pattern
- Check predicate is defined for its schema
- Check object type is valid for predicate
- Return validation errors or Ok
2. **Query builder with domain awareness**
- `query_efficacy("Semaglutide", "Type2Diabetes")` → correct subject + all efficacy predicates
- `query_safety("Semaglutide")` → correct subject + all safety predicates
- Uses domain definition to know which predicates to include
3. **CLI exploration tool** (`applications/ontology/src/bin/steme-pharma.rs`)
```bash
steme-pharma efficacy semaglutide type2-diabetes
# Shows all efficacy claims with conflict scores
steme-pharma safety semaglutide
# Shows all safety claims by tier
steme-pharma compare semaglutide tirzepatide --indication type2-diabetes
# Side-by-side comparison
```
4. **Domain-aware API endpoints** (optional, if time permits)
- `GET /v1/pharma/efficacy?drug=semaglutide&indication=type2_diabetes`
- `GET /v1/pharma/safety?drug=semaglutide`
- Thin wrapper over existing query API with domain knowledge
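
The three checks listed for the claim validator in task 1 could be prototyped as follows. This is a Python sketch, not the planned `validator.rs`. `SCHEMAS` mirrors two of the schemas in `pharma.toml`, while `OBJECT_TYPES` is an assumption, since the domain definition shown earlier does not yet declare an object type per predicate.

```python
import re

# Mirrors part of pharma.toml.
SCHEMAS = {
    "efficacy": {
        "subject_pattern": "{Drug}:{Indication}",
        "predicates": {"hba1c_change_percent", "weight_change_percent", "remission_rate"},
    },
    "safety": {
        "subject_pattern": "{Drug}",
        "predicates": {"nausea_incidence", "discontinuation_rate", "has_boxed_warning"},
    },
}
# Assumption: predicates not listed here expect a numeric object.
OBJECT_TYPES = {"has_boxed_warning": bool}


def validate_claim(schema_name, subject, predicate, obj):
    """Return a list of validation errors; an empty list means the claim is valid."""
    schema = SCHEMAS.get(schema_name)
    if schema is None:
        return [f"unknown schema: {schema_name}"]
    errors = []
    # Each {Entity} placeholder must be filled by one non-empty, colon-free segment.
    expected_segments = len(re.findall(r"\{\w+\}", schema["subject_pattern"]))
    parts = subject.split(":")
    if len(parts) != expected_segments or not all(parts):
        errors.append(f"subject {subject!r} does not match {schema['subject_pattern']}")
    if predicate not in schema["predicates"]:
        errors.append(f"predicate {predicate!r} is not defined for {schema_name}")
    if OBJECT_TYPES.get(predicate) is bool:
        type_ok = isinstance(obj, bool)
    else:
        type_ok = isinstance(obj, (int, float)) and not isinstance(obj, bool)
    if not type_ok:
        errors.append(f"object {obj!r} has the wrong type for {predicate}")
    return errors
```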

**Deliverables:**
- Validator catches schema-violating claims before ingestion
- CLI tool for exploring pharma data
- Query helpers that make domain queries easy
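
The query helpers and the CLI both take lowercase slugs (`semaglutide`, `type2-diabetes`) while subjects are stored in the `Semaglutide:Type2Diabetes` form, so some slug-to-entity mapping is implied. A hedged Python sketch of that mapping follows. The `to_entity` helper and the returned dict shape are hypothetical and are not StemeDB's query API; note that `str.capitalize` would mangle acronyms such as `GLP1`, which a real normalizer would need to handle.

```python
import re

# Predicate lists mirror the efficacy and safety schemas in pharma.toml.
EFFICACY_PREDICATES = ["hba1c_change_percent", "weight_change_percent", "remission_rate"]
SAFETY_PREDICATES = ["nausea_incidence", "discontinuation_rate", "has_boxed_warning"]


def to_entity(slug):
    """Convert 'type2-diabetes' or 'type2_diabetes' to 'Type2Diabetes'."""
    parts = re.split(r"[-_\s]+", slug.strip())
    return "".join(part.capitalize() for part in parts if part)


def query_efficacy(drug, indication):
    """Build the subject and predicate list for an efficacy query."""
    subject = f"{to_entity(drug)}:{to_entity(indication)}"
    return {"subject": subject, "predicates": list(EFFICACY_PREDICATES)}


def query_safety(drug):
    """Build the subject and predicate list for a safety query."""
    return {"subject": to_entity(drug), "predicates": list(SAFETY_PREDICATES)}
```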

**Foundation this enables:**
- Quality gate for extraction pipelines
- Developer experience for exploring data
- Pattern for domain-specific APIs
---
### Week 5: Social Signal Extraction (Tier 4-5)
**Goals:**
- Extract anecdotal claims from Reddit/social media
- These are low-confidence but high-volume
- Demonstrate the full tier spectrum working together

**Tasks:**
1. **Reddit source adapter** (`latent/ingest-reddit/`)
- Fetch posts from r/Ozempic, r/Semaglutide, r/diabetes
- Extract post text, upvotes, comment count
- Rate limit appropriately
2. **LLM extraction for anecdotal claims**
```
Given this Reddit post about {drug_name}:
Extract personal experience claims:
- Predicate: reported_{symptom} | reported_effectiveness
- Object: Text (the claim) or Boolean
- Confidence: 0.2-0.4 based on detail and specificity
- Quote: relevant excerpt
Skip if: asking questions, discussing others' experiences, promotional.
```
3. **Anecdote clustering** (stretch goal)
- Group similar anecdotal claims
- Escalate if cluster size exceeds threshold
- "50+ users reporting gastroparesis" becomes a signal
4. **Full-tier query demonstration**
```bash
steme-pharma safety semaglutide --all-tiers
# Shows:
# Tier 0 (FDA): has_boxed_warning = true (thyroid)
# Tier 1 (RCTs): nausea_incidence = 0.44
# Tier 4 (Reddit): 127 reports of "gastroparesis" (pre-label)
```
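
The anecdote clustering stretch goal (task 3) could start as simply as counting reports per drug and symptom. The sketch below is a deliberate simplification: it treats "similar" as an exact match after lowercasing, whereas real clustering would need synonym handling or embeddings. The function name and the default threshold of 50 (read from the "50+ users" example) are assumptions.

```python
from collections import Counter


def escalate_clusters(reports, threshold=50):
    """Count reports per (drug, symptom) and return clusters at or above the threshold."""
    counts = Counter((drug.lower(), symptom.lower()) for drug, symptom in reports)
    return {key: n for key, n in counts.items() if n >= threshold}
```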

**Deliverables:**
- Reddit ingestor producing Tier 4-5 assertions
- At least 100 anecdotal claims ingested
- Query showing full tier spectrum for a safety topic

**Foundation this enables:**
- "Early signal detection" — social media flags issues before FDA
- Real consumer health use case data
- Validates source_class hierarchy is working
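
A quick way to check that the source hierarchy behaves as intended is to invert the `[source_hierarchy]` table into a source-to-tier lookup and group claims by tier. The sketch below hard-codes the table from `pharma.toml`; `group_by_tier` and the `(source_type, description)` input shape are hypothetical.

```python
# Mirrors [source_hierarchy] in pharma.toml (tier number -> source types).
SOURCE_HIERARCHY = {
    0: ["FDA_Label", "EMA_Approval"],
    1: ["Phase3_RCT", "Meta_Analysis"],
    2: ["Observational_Study", "Real_World_Evidence"],
    3: ["Case_Report", "Expert_Opinion"],
    4: ["Patient_Forum", "Social_Media"],
}

# Inverted lookup: source type -> tier number.
TIER_BY_SOURCE = {
    source: tier for tier, sources in SOURCE_HIERARCHY.items() for source in sources
}


def group_by_tier(claims):
    """Group (source_type, description) pairs by tier, lowest tier number first."""
    grouped = {}
    for source_type, description in claims:
        grouped.setdefault(TIER_BY_SOURCE[source_type], []).append(description)
    return dict(sorted(grouped.items()))
```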
---
### Week 6: Ontology Layer Extraction & Documentation
**Goals:**
- Extract reusable patterns from pharma implementation
- Document how to create a new domain
- Prepare for second vertical (SEC/financial or code/security)

**Tasks:**
1. **Generalize domain definition schema**
- Review what was pharma-specific vs reusable
- Document the schema formally
- Create `domains/template.toml` with comments
2. **Extraction pipeline abstraction**
- Common interface for source adapters
- Common LLM prompt patterns
- Common normalization utilities
3. **"Add a Domain" guide** (`docs/guides/adding-a-domain.md`)
- Step-by-step: define entities, predicates, sources
- Example: how pharma was built
- Checklist: what you need before first extraction
4. **Second domain sketch** (no implementation)
- Draft `domains/sec.toml` or `domains/security.toml`
- Identify where patterns differ from pharma
- Document what would need to change
5. **Integration with CLAUDE.md**
- Add ontology layer to "Find Your Guide" table
- Document new crates and tools
- Update specialized agents if needed
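
For task 2, a common source-adapter interface might be as small as one abstract method plus a shared pipeline function. Everything named here (`SourceAdapter`, `fetch`, `run_pipeline`, the dict-based claim shape) is hypothetical and only sketches the fetch, extract, validate flow from the architecture diagram; the extractor and validator are passed in as plain callables.

```python
from abc import ABC, abstractmethod
from typing import Callable, Iterable, Iterator


class SourceAdapter(ABC):
    """Hypothetical base class: one subclass per source (FDA, PubMed, Reddit)."""

    source_class: str  # e.g. "Regulatory" for the FDA adapter

    @abstractmethod
    def fetch(self) -> Iterable[dict]:
        """Return raw documents, each a dict such as {'id': ..., 'text': ...}."""


def run_pipeline(
    adapter: SourceAdapter,
    extract: Callable[[dict], Iterable[dict]],
    validate: Callable[[dict], bool],
) -> Iterator[dict]:
    """Fetch, extract, tag with the adapter's source class, and keep valid claims."""
    for document in adapter.fetch():
        for claim in extract(document):
            tagged = {**claim, "source_class": adapter.source_class}
            if validate(tagged):
                yield tagged
```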

**Deliverables:**
- Reusable ontology crate extracted from pharma-specific code
- Documentation for adding new domains
- Draft of second domain showing generalization

**Foundation this enables:**
- Other teams can add domains without core changes
- Clear separation of domain logic from storage logic
- Path to "platform" if that's the direction
---
## Risks & Mitigations
| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| LLM extraction quality varies | High | Medium | Start with structured sources (FDA), add human review loop |
| Ontology schema needs revision | High | Low | Keep Week 1 scope small, iterate based on Week 2-3 learnings |
| Drug name normalization is hard | Medium | Medium | Use existing resources (RxNorm), accept some manual mapping |
| Reddit rate limits | Medium | Low | Cache aggressively, stay within the Reddit API's rate limits |
| Scope creep into full consumer app | Medium | High | Stay focused on extraction + storage, not UX |
---
## Success Criteria
**Week 3 checkpoint:** Can we ingest FDA labels + clinical trials and see real conflicts via SkepticLens?

**Week 6 checkpoint:** Is the ontology layer reusable enough that sketching a second domain feels like "filling in a template" rather than "starting over"?

**End state:**
```bash
# This should work and show meaningful data:
steme-pharma efficacy semaglutide type2-diabetes
# Output:
Subject: Semaglutide:Type2Diabetes
Predicate: hba1c_change_percent
Status: Contested (conflict_score: 0.42)
Claims:
-1.5% (45% weight) — NEJM Phase 3, n=1847
-1.2% (35% weight) — Lancet Phase 3, n=892
-0.8% (20% weight) — JAMA Observational, n=12000
Sources span Tier 0-2. 47 anecdotal reports (Tier 4) mention efficacy.
```
---
## Open Questions
1. **Where does the ontology crate live?**
- Option A: `applications/ontology/` (domain-agnostic utility)
- Option B: `crates/stemedb-ontology/` (core crate)
- Recommendation: Start in `applications/`, promote to `crates/` if it proves domain-agnostic
2. **Python vs Rust for extractors?**
- Python: faster iteration, better LLM libraries, existing `latent/` code
- Rust: type safety, integration with core
- Recommendation: Python for extractors (they're ETL jobs), Rust for ontology validation
3. **How do we handle ontology versioning?**
- Predicates may change as we learn
- Old assertions might not conform to new schema
- Recommendation: Version in domain definition, allow "legacy" predicates with deprecation flags
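
The versioning recommendation in question 3 could be prototyped with a predicate registry that carries a deprecation flag. The sketch below is purely illustrative: the registry layout, the `replaced_by` field, and the legacy predicate name `hba1c_reduction` are all invented for the example and do not appear in the current domain definition.

```python
# Hypothetical registry for a later schema version: predicate -> metadata.
PREDICATE_REGISTRY = {
    "hba1c_change_percent": {"deprecated": False},
    # Invented legacy name, kept so that old assertions still validate.
    "hba1c_reduction": {"deprecated": True, "replaced_by": "hba1c_change_percent"},
}


def check_predicate(name, registry=PREDICATE_REGISTRY):
    """Return (accepted, note); deprecated predicates pass with a migration note."""
    entry = registry.get(name)
    if entry is None:
        return False, f"unknown predicate: {name}"
    if entry["deprecated"]:
        return True, f"deprecated; use {entry['replaced_by']}"
    return True, ""
```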
---
## Next Steps
1. Review this plan and adjust scope/timeline
2. Decide on Open Question 1 (crate location)
3. Start Week 1: domain definition schema