# Ontology Layer + Medical Vertical Implementation Plan

> **Goal:** Build the ontology layer that defines how claims are structured, extracted, and resolved, using Pharma/Medical as the first vertical to pressure-test every decision.
>
> **Philosophy:** Build medical end-to-end first, extract the reusable ontology layer as patterns emerge. Don't abstract prematurely.
>
> **Outcome:** A working claim extraction pipeline that can ingest FDA labels, clinical trial data, and eventually social media, producing properly structured assertions on which conflict detection works correctly.

---

## Context: What Exists Today

| Component | Status | Notes |
|-----------|--------|-------|
| Core StemeDB | ✅ Complete | Storage, lenses, conflict detection work |
| SkepticLens | ✅ Complete | Can detect conflicts if subjects/predicates match |
| `latent/ingest-fda` | 🟡 Prototype | Fetches FDA labels, outputs flat JSONL (not integrated) |
| Aphoria extractors | ✅ Complete | Pattern-based extraction for code (14 extractors) |
| Disputed LLM extraction | 🟡 Early | Generic SPO extraction, no domain schema |
| Ontology definitions | ❌ Missing | No formal way to define subject patterns, predicate schemas |

**The gap:** We can store claims. We cannot yet *structure* claims for a domain in a way that guarantees conflicts will collide correctly.

---

## The Core Insight

When Journal A says "Semaglutide reduced HbA1c by 1.5%" and Journal B says "Semaglutide reduced HbA1c by 0.8%", both claims need to be stored under the same subject and predicate so that only the object differs:

```
Subject:   Semaglutide:Type2Diabetes   # Drug:Indication compound key
Predicate: hba1c_change_percent        # Reusable across drug:indication pairs
Object:    -1.5                        # The conflicting value
```

The **subject granularity depends on the predicate type**:
- Efficacy predicates → Subject is `{Drug}:{Indication}`
- Safety predicates → Subject is `{Drug}` (indication-agnostic)
- Mechanism predicates → Subject is `{Drug}` or `{Drug}:{Pathway}`

This is what the ontology layer needs to express.
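
To make the collision requirement concrete, here is a minimal Python sketch. It is not how SkepticLens works internally; it only assumes that conflicts are found by comparing claims that share a `(subject, predicate)` key. The `Claim` class, `find_collisions`, and the third "Journal C" claim are hypothetical and exist only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    subject: str    # e.g. "Semaglutide:Type2Diabetes"
    predicate: str  # e.g. "hba1c_change_percent"
    value: float
    source: str


def find_collisions(claims):
    """Group claims by (subject, predicate); keep groups with more than one distinct value."""
    groups = defaultdict(list)
    for claim in claims:
        groups[(claim.subject, claim.predicate)].append(claim)
    return {
        key: group
        for key, group in groups.items()
        if len({c.value for c in group}) > 1
    }


claims = [
    Claim("Semaglutide:Type2Diabetes", "hba1c_change_percent", -1.5, "Journal A"),
    Claim("Semaglutide:Type2Diabetes", "hba1c_change_percent", -0.8, "Journal B"),
    # Hypothetical claim keyed at the wrong granularity: it never meets the two above.
    Claim("Semaglutide", "hba1c_change_percent", -1.2, "Journal C"),
]

collisions = find_collisions(claims)
```

Only the first two claims collide. The third is silently missed because its subject lacks the indication, which is the failure mode a per-predicate subject pattern is meant to rule out.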

---

## High-Level Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                      Ontology Layer (New)                       │
├─────────────────────────────────────────────────────────────────┤
│  Domain Definition (YAML/TOML)                                  │
│  ├── Entity Types (Drug, Condition, Biomarker...)               │
│  ├── Predicate Schemas (subject pattern → predicates)           │
│  ├── Source Hierarchy (Tier 0-5)                                │
│  └── Default Lens per predicate type                            │
├─────────────────────────────────────────────────────────────────┤
│                       Extraction Pipeline                       │
│  ├── Source Adapters (FDA API, PubMed, Reddit...)              │
│  ├── Claim Extractor (LLM-based, schema-guided)                 │
│  ├── Normalizer (maps raw → ontology subjects/predicates)       │
│  └── Validator (checks claim conforms to schema)                │
├─────────────────────────────────────────────────────────────────┤
│                     StemeDB Core (Existing)                     │
│  ├── Assertion Storage                                          │
│  ├── Conflict Detection (SkepticLens)                           │
│  └── Query/Resolution                                           │
└─────────────────────────────────────────────────────────────────┘
```

---

## Week-by-Week Implementation Plan

### Week 1: Domain Definition Schema

**Goals:**
- Define how domains express their ontology
- Create the pharma domain definition as the first instance
- No extraction yet, just the schema that extraction will target

**Tasks:**

1. **Design domain definition format** (`applications/ontology/`)
   - Choose format: YAML or TOML (TOML aligns with Rust ecosystem)
   - Define schema for entity types, predicate schemas, source tiers

2. **Create pharma domain definition** (`applications/ontology/domains/pharma.toml`)
   ```toml
   [domain]
   name = "pharma"
   version = "0.1.0"

   [entity_types]
   Drug = { aliases = ["medication", "compound", "molecule"] }
   Indication = { aliases = ["condition", "disease", "disorder"] }
   Biomarker = { aliases = ["endpoint", "measure"] }
   Population = { aliases = ["cohort", "patient_group"] }

   [predicate_schemas.efficacy]
   subject_pattern = "{Drug}:{Indication}"
   predicates = ["hba1c_change_percent", "weight_change_percent", "remission_rate"]
   default_lens = "Skeptic"

   [predicate_schemas.safety]
   subject_pattern = "{Drug}"
   predicates = ["nausea_incidence", "discontinuation_rate", "has_boxed_warning"]
   default_lens = "Authority"

   [predicate_schemas.mechanism]
   subject_pattern = "{Drug}"
   predicates = ["target_receptor", "half_life_hours", "bioavailability_percent"]
   default_lens = "Recency"

   [source_hierarchy]
   tier0 = ["FDA_Label", "EMA_Approval"]
   tier1 = ["Phase3_RCT", "Meta_Analysis"]
   tier2 = ["Observational_Study", "Real_World_Evidence"]
   tier3 = ["Case_Report", "Expert_Opinion"]
   tier4 = ["Patient_Forum", "Social_Media"]
   ```

3. **Implement domain parser** (`applications/ontology/src/domain.rs`)
   - Parse TOML into Rust structs
   - Validate schema consistency (no circular refs, valid patterns)
   - Unit tests for parsing

4. **Subject builder utility**
   - Given entity values + predicate schema, build correct subject string
   - `build_subject("efficacy", {"Drug": "Semaglutide", "Indication": "Type2Diabetes"})` → `"Semaglutide:Type2Diabetes"`

**Deliverables:**
- `applications/ontology/` crate with domain definition parsing
- `domains/pharma.toml` as the reference implementation
- Subject builder that enforces schema compliance

**Foundation this enables:**
- Extractors know what shape claims should have
- Validators can check claims against schema
- Future domains just add a new `.toml` file

---

### Week 2: FDA Label Extraction (Tier 0)

**Goals:**
- Upgrade `latent/ingest-fda` to produce schema-compliant assertions
- Extract *structured* claims, not raw text blobs
- Write directly to StemeDB (not JSONL files)

**Tasks:**

1. **Refactor FDA ingestor** (`latent/ingest-fda/`)
   - Load pharma domain definition
   - Use LLM (Claude) to extract structured claims from label sections
   - Map extracted claims to ontology predicates

2. **LLM extraction prompt for FDA labels**
   ```
   Given this FDA label section for {drug_name}, extract claims as structured data.

   For ADVERSE REACTIONS sections, extract:
   - Predicate: {symptom}_incidence
   - Object: decimal (0.0-1.0)
   - Quote: exact text supporting this

   For BOXED WARNINGS, extract:
   - Predicate: has_boxed_warning
   - Object: boolean (true)
   - Quote: warning text

   Return JSON array of claims.
   ```

3. **Normalizer module** (`latent/ingest-fda/normalizer.py`)
   - Map drug names to canonical form (Ozempic → semaglutide)
   - Map symptom names to canonical predicates (nausea and vomiting remain distinct predicates)
   - Use drug synonym database (RxNorm or similar)

4. **StemeDB client integration**
   - Replace JSONL output with HTTP calls to StemeDB API
   - Sign assertions with ingestor's Ed25519 key
   - Set `source_class: Regulatory` (Tier 0)

5. **Integration test**
   - Ingest semaglutide FDA label
   - Query StemeDB for `Semaglutide` safety predicates
   - Verify structured claims exist with correct schema
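
The step between the LLM response and the StemeDB write could look roughly like the Python sketch below. It assumes the model returns a JSON array of objects with `predicate`, `object`, and `quote` keys, which is one reading of the prompt above rather than a fixed contract. `DRUG_SYNONYMS` is a hypothetical stand-in for an RxNorm-backed lookup, the capitalised canonical form follows the subject examples in this plan, and the sample values and quotes are placeholders, not label data.

```python
import json

# Hypothetical stand-in for a real synonym database such as RxNorm.
DRUG_SYNONYMS = {"ozempic": "Semaglutide", "wegovy": "Semaglutide", "semaglutide": "Semaglutide"}


def canonical_drug(name):
    """Map a brand or generic name to its canonical form, or fail loudly."""
    try:
        return DRUG_SYNONYMS[name.strip().lower()]
    except KeyError:
        raise ValueError(f"unmapped drug name: {name!r}") from None


def parse_safety_claims(drug_name, llm_output):
    """Turn the LLM's JSON array into assertion dicts, dropping malformed claims."""
    subject = canonical_drug(drug_name)  # safety subjects are indication-agnostic
    accepted = []
    for raw in json.loads(llm_output):
        if not isinstance(raw, dict):
            continue
        predicate, value, quote = raw.get("predicate"), raw.get("object"), raw.get("quote")
        if not predicate or not quote:
            continue
        if predicate.endswith("_incidence"):
            # bool is a subclass of int in Python, so exclude it explicitly.
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                continue
            if not 0.0 <= value <= 1.0:
                continue
        elif predicate == "has_boxed_warning":
            if value is not True:
                continue
        else:
            continue
        accepted.append({
            "subject": subject, "predicate": predicate, "object": value,
            "quote": quote, "source_class": "Regulatory",
        })
    return accepted


# Placeholder model output; the second claim reports a percentage, not a fraction.
sample = """[
  {"predicate": "nausea_incidence", "object": 0.2, "quote": "<label excerpt>"},
  {"predicate": "nausea_incidence", "object": 20, "quote": "<label excerpt>"},
  {"predicate": "has_boxed_warning", "object": true, "quote": "<warning text>"}
]"""
claims = parse_safety_claims("Ozempic", sample)
```

Dropping the out-of-range incidence instead of dividing it by 100 keeps the ingestor from guessing at units; a rejected claim can be sent to the human review loop mentioned under Risks.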

**Deliverables:**
- FDA ingestor writes schema-compliant assertions to StemeDB
- At least 3 drugs (semaglutide, tirzepatide, liraglutide) ingested
- Integration test proving round-trip

**Foundation this enables:**
- Tier 0 (regulatory) baseline established
- Lower-tier sources can now conflict with FDA data
- Pattern for other structured source adapters

---

### Week 3: Clinical Trial Extraction (Tier 1)

**Goals:**
- Extract efficacy claims from clinical trial publications
- These will conflict with each other (the whole point of Episteme)
- Demonstrate SkepticLens showing real disagreement

**Tasks:**

1. **PubMed/PMC source adapter** (`latent/ingest-pubmed/`)
   - Fetch abstracts + full text for GLP-1 trials
   - Filter by clinical trial registration (NCT numbers)
   - Extract study metadata (sample size, design, journal)

2. **LLM extraction prompt for efficacy claims**
   ```
   Given this clinical trial abstract/results for {drug_name} in {indication}:

   Extract efficacy claims:
   - Subject: {drug}:{indication}
   - Predicate: {biomarker}_change_percent | remission_rate | etc.
   - Object: numeric value
   - Confidence: based on sample size and study design
   - Quote: exact text

   Include comparator arm if mentioned (vs placebo, vs {other_drug}).
   ```

3. **Confidence scoring heuristics**
   - Phase 3 RCT: base confidence 0.9
   - Phase 2: base confidence 0.7
   - Observational: base confidence 0.5
   - Adjust by sample size, blinding, journal impact factor

4. **Conflict demonstration**
   - Ingest 5+ trials with varying HbA1c results for semaglutide
   - Query with SkepticLens
   - Show `conflict_score > 0` and multiple competing claims
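
The confidence heuristics in task 3 can be sketched as a small Python function. The base values come from the list above; the size of each adjustment and the sample-size thresholds are placeholders chosen for illustration and would need to be decided during Week 3. Journal impact factor is left out of this sketch.

```python
# Base confidence per study design, as listed in the plan.
BASE_CONFIDENCE = {"phase3_rct": 0.9, "phase2": 0.7, "observational": 0.5}


def score_confidence(design, sample_size, double_blind):
    """Start from the design's base confidence, then nudge by sample size and blinding.

    The +/- amounts and the n thresholds below are assumptions of this sketch,
    not values taken from the plan.
    """
    score = BASE_CONFIDENCE[design]
    if sample_size < 100:
        score -= 0.10
    elif sample_size >= 1000:
        score += 0.05
    if not double_blind:
        score -= 0.05
    # Clamp to [0, 1] and round so float noise does not leak into stored assertions.
    return round(min(max(score, 0.0), 1.0), 2)


large_rct = score_confidence("phase3_rct", sample_size=1847, double_blind=True)
small_phase2 = score_confidence("phase2", sample_size=60, double_blind=False)
```

Keeping the base table separate from the adjustments makes it easy to move the table into the domain definition later if other verticals need different study designs.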

**Deliverables:**
- PubMed ingestor producing Tier 1 assertions
- At least 10 clinical trial papers ingested
- SkepticLens query showing real conflict in GLP-1 efficacy data

**Foundation this enables:**
- Proves the ontology enables conflict detection
- Real "Trust but Verify" data for demos
- Pattern for other publication sources (biorxiv, medrxiv)

---

### Week 4: Ontology Validation & Query Patterns

**Goals:**
- Validate that extracted claims conform to ontology
- Build query helpers that leverage domain knowledge
- CLI tool for exploring the pharma knowledge graph

**Tasks:**

1. **Claim validator** (`applications/ontology/src/validator.rs`)
   - Check subject matches predicate schema's subject_pattern
   - Check predicate is defined for its schema
   - Check object type is valid for predicate
   - Return validation errors or Ok

2. **Query builder with domain awareness**
   - `query_efficacy("Semaglutide", "Type2Diabetes")` → correct subject + all efficacy predicates
   - `query_safety("Semaglutide")` → correct subject + all safety predicates
   - Uses domain definition to know which predicates to include

3. **CLI exploration tool** (`applications/ontology/src/bin/steme-pharma.rs`)
   ```bash
   steme-pharma efficacy semaglutide type2-diabetes
   # Shows all efficacy claims with conflict scores

   steme-pharma safety semaglutide
   # Shows all safety claims by tier

   steme-pharma compare semaglutide tirzepatide --indication type2-diabetes
   # Side-by-side comparison
   ```

4. **Domain-aware API endpoints** (optional, if time)
   - `GET /v1/pharma/efficacy?drug=semaglutide&indication=type2_diabetes`
   - `GET /v1/pharma/safety?drug=semaglutide`
   - Thin wrapper over existing query API with domain knowledge
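
The validator is planned in Rust; the checks themselves are simple enough to sketch in Python first. The version below covers the subject-pattern and predicate checks and skips the object-type check, because the domain definition does not yet declare object types. `SCHEMAS`, `pattern_to_regex`, and the return shapes are assumptions of this sketch.

```python
import re

# Hypothetical in-memory form of two schemas from the pharma domain definition.
SCHEMAS = {
    "efficacy": {
        "subject_pattern": "{Drug}:{Indication}",
        "predicates": ["hba1c_change_percent", "weight_change_percent", "remission_rate"],
    },
    "safety": {
        "subject_pattern": "{Drug}",
        "predicates": ["nausea_incidence", "discontinuation_rate", "has_boxed_warning"],
    },
}


def pattern_to_regex(pattern):
    """Turn "{Drug}:{Indication}" into a regex with one colon-free segment per placeholder."""
    parts = re.split(r"\{\w+\}", pattern)
    return re.compile("[^:]+".join(re.escape(p) for p in parts))


def validate_claim(schema_name, subject, predicate):
    """Return a list of validation errors; an empty list means the claim conforms."""
    schema = SCHEMAS.get(schema_name)
    if schema is None:
        return [f"unknown schema: {schema_name}"]
    errors = []
    if not pattern_to_regex(schema["subject_pattern"]).fullmatch(subject):
        errors.append(f"subject {subject!r} does not match {schema['subject_pattern']}")
    if predicate not in schema["predicates"]:
        errors.append(f"predicate {predicate!r} is not defined for {schema_name}")
    return errors


def query_efficacy(drug, indication):
    """Build the subject and predicate list for an efficacy query."""
    return {"subject": f"{drug}:{indication}", "predicates": SCHEMAS["efficacy"]["predicates"]}


ok = validate_claim("efficacy", "Semaglutide:Type2Diabetes", "hba1c_change_percent")
bad = validate_claim("efficacy", "Semaglutide", "nausea_incidence")
```

Returning every error at once, rather than stopping at the first, gives an extraction pipeline enough detail to log why a claim was rejected.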

**Deliverables:**
- Validator catches schema-violating claims before ingestion
- CLI tool for exploring pharma data
- Query helpers that make domain queries easy

**Foundation this enables:**
- Quality gate for extraction pipelines
- Developer experience for exploring data
- Pattern for domain-specific APIs

---

### Week 5: Social Signal Extraction (Tier 4-5)

**Goals:**
- Extract anecdotal claims from Reddit/social media
- These are low-confidence but high-volume
- Demonstrate the full tier spectrum working together

**Tasks:**

1. **Reddit source adapter** (`latent/ingest-reddit/`)
   - Fetch posts from r/Ozempic, r/Semaglutide, r/diabetes
   - Extract post text, upvotes, comment count
   - Rate limit appropriately

2. **LLM extraction for anecdotal claims**
   ```
   Given this Reddit post about {drug_name}:

   Extract personal experience claims:
   - Predicate: reported_{symptom} | reported_effectiveness
   - Object: Text (the claim) or Boolean
   - Confidence: 0.2-0.4 based on detail and specificity
   - Quote: relevant excerpt

   Skip if: asking questions, discussing others' experiences, promotional.
   ```

3. **Anecdote clustering** (stretch goal)
   - Group similar anecdotal claims
   - Escalate if cluster size exceeds threshold
   - "50+ users reporting gastroparesis" becomes a signal

4. **Full-tier query demonstration**
   ```bash
   steme-pharma safety semaglutide --all-tiers
   # Shows:
   # Tier 0 (FDA): has_boxed_warning = true (thyroid)
   # Tier 1 (RCTs): nausea_incidence = 0.44
   # Tier 4 (Reddit): 127 reports of "gastroparesis" (pre-label)
   ```
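
The escalation rule in the clustering stretch goal can be sketched in a few lines of Python. Exact-key counting stands in for real similarity clustering here, the threshold of 50 is lifted from the "50+ users" example rather than a decided value, and the anecdote list is synthetic test data.

```python
from collections import Counter

ESCALATION_THRESHOLD = 50  # from the "50+ users" example; the real cutoff is undecided


def escalated_signals(anecdotes, threshold=ESCALATION_THRESHOLD):
    """Count anecdotal claims per (subject, predicate); return clusters at or above threshold."""
    counts = Counter((a["subject"], a["predicate"]) for a in anecdotes)
    return {key: n for key, n in counts.items() if n >= threshold}


# Synthetic data: one large cluster and one small one.
anecdotes = (
    [{"subject": "Semaglutide", "predicate": "reported_gastroparesis"}] * 127
    + [{"subject": "Semaglutide", "predicate": "reported_hair_loss"}] * 3
)
signals = escalated_signals(anecdotes)
```

This only works if the normalizer has already mapped free-text symptoms to canonical predicates; without that step, near-duplicate wordings would split one signal across many small clusters.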

**Deliverables:**
- Reddit ingestor producing Tier 4-5 assertions
- At least 100 anecdotal claims ingested
- Query showing full tier spectrum for a safety topic

**Foundation this enables:**
- "Early signal detection": social media flags issues before the FDA does
- Real consumer health use case data
- Validates source_class hierarchy is working

---

### Week 6: Ontology Layer Extraction & Documentation

**Goals:**
- Extract reusable patterns from pharma implementation
- Document how to create a new domain
- Prepare for second vertical (SEC/financial or code/security)

**Tasks:**

1. **Generalize domain definition schema**
   - Review what was pharma-specific vs reusable
   - Document the schema formally
   - Create `domains/template.toml` with comments

2. **Extraction pipeline abstraction**
   - Common interface for source adapters
   - Common LLM prompt patterns
   - Common normalization utilities

3. **"Add a Domain" guide** (`docs/guides/adding-a-domain.md`)
   - Step-by-step: define entities, predicates, sources
   - Example: how pharma was built
   - Checklist: what you need before first extraction

4. **Second domain sketch** (no implementation)
   - Draft `domains/sec.toml` or `domains/security.toml`
   - Identify where patterns differ from pharma
   - Document what would need to change

5. **Integration with CLAUDE.md**
   - Add ontology layer to "Find Your Guide" table
   - Document new crates and tools
   - Update specialized agents if needed
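
One possible shape for the "common interface for source adapters" in task 2 is a structural protocol that the FDA, PubMed, and Reddit ingestors would each satisfy. Everything named below (`RawDocument`, `SourceAdapter`, `InMemoryAdapter`, `run_pipeline`) is hypothetical; the sketch shows only how a shared pipeline could stay ignorant of where documents come from.

```python
from dataclasses import dataclass
from typing import Iterator, Protocol


@dataclass
class RawDocument:
    source_id: str     # e.g. a label ID, PMID, or post ID
    source_class: str  # e.g. "Regulatory" for Tier 0
    text: str


class SourceAdapter(Protocol):
    """What every ingestor would expose to the shared pipeline."""

    def fetch(self, query: str) -> Iterator[RawDocument]: ...


class InMemoryAdapter:
    """Test double that satisfies SourceAdapter without any network access."""

    def __init__(self, source_class, texts):
        self.source_class = source_class
        self.texts = texts

    def fetch(self, query):
        for i, text in enumerate(self.texts):
            if query.lower() in text.lower():
                yield RawDocument(f"mem-{i}", self.source_class, text)


def run_pipeline(adapter: SourceAdapter, query, extract):
    """Fetch documents, run the extractor on each, and flatten the resulting claims."""
    return [claim for doc in adapter.fetch(query) for claim in extract(doc)]


adapter = InMemoryAdapter("Regulatory", ["Semaglutide label text", "Unrelated text"])
claims = run_pipeline(adapter, "semaglutide", lambda doc: [(doc.source_id, doc.source_class)])
```

An in-memory adapter like this also gives the integration tests from Weeks 2 and 3 a way to run without calling external APIs.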

**Deliverables:**
- Reusable ontology crate extracted from pharma-specific code
- Documentation for adding new domains
- Draft of second domain showing generalization

**Foundation this enables:**
- Other teams can add domains without core changes
- Clear separation of domain logic from storage logic
- Path to "platform" if that's the direction

---

## Risks & Mitigations

| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| LLM extraction quality varies | High | Medium | Start with structured sources (FDA), add human review loop |
| Ontology schema needs revision | High | Low | Keep Week 1 scope small, iterate based on Week 2-3 learnings |
| Drug name normalization is hard | Medium | Medium | Use existing resources (RxNorm), accept some manual mapping |
| Reddit rate limits | Medium | Low | Cache aggressively, use Reddit API properly |
| Scope creep into full consumer app | Medium | High | Stay focused on extraction + storage, not UX |

---

## Success Criteria

**Week 3 checkpoint:** Can we ingest FDA labels + clinical trials and see real conflicts via SkepticLens?

**Week 6 checkpoint:** Is the ontology layer reusable enough that sketching a second domain feels like "filling in a template" rather than "starting over"?

**End state:**
```bash
# This should work and show meaningful data:
steme-pharma efficacy semaglutide type2-diabetes

# Output:
# Subject: Semaglutide:Type2Diabetes
# Predicate: hba1c_change_percent
# Status: Contested (conflict_score: 0.42)
#
# Claims:
#   -1.5% (45% weight) - NEJM Phase 3, n=1847
#   -1.2% (35% weight) - Lancet Phase 3, n=892
#   -0.8% (20% weight) - JAMA Observational, n=12000
#
# Sources span Tier 0-2. 47 anecdotal reports (Tier 4) mention efficacy.
```

---

## Open Questions

1. **Where does the ontology crate live?**
   - Option A: `applications/ontology/` (domain-agnostic utility)
   - Option B: `crates/stemedb-ontology/` (core crate)
   - Recommendation: Start in `applications/`, promote to `crates/` if it proves domain-agnostic

2. **Python vs Rust for extractors?**
   - Python: faster iteration, better LLM libraries, existing `latent/` code
   - Rust: type safety, integration with core
   - Recommendation: Python for extractors (they're ETL jobs), Rust for ontology validation

3. **How do we handle ontology versioning?**
   - Predicates may change as we learn
   - Old assertions might not conform to new schema
   - Recommendation: Version in domain definition, allow "legacy" predicates with deprecation flags

---

## Next Steps

1. Review this plan and adjust scope/timeline
2. Decide on Question 1 (crate location)
3. Start Week 1: domain definition schema