Ontology Layer + Medical Vertical Implementation Plan

Goal: Build the ontology layer that defines how claims are structured, extracted, and resolved — using Pharma/Medical as the first vertical to pressure-test every decision.

Philosophy: Build medical end-to-end first, extract the reusable ontology layer as patterns emerge. Don't abstract prematurely.

Outcome: A working claim extraction pipeline that can ingest FDA labels, clinical trial data, and eventually social media — producing properly structured assertions that conflict-detect correctly.


Context: What Exists Today

| Component | Status | Gap |
|-----------|--------|-----|
| Core StemeDB | Complete | Storage, lenses, conflict detection work |
| SkepticLens | Complete | Can detect conflicts if subjects/predicates match |
| latent/ingest-fda | 🟡 Prototype | Fetches FDA labels, outputs flat JSONL (not integrated) |
| Aphoria extractors | Complete | Pattern-based extraction for code (14 extractors) |
| Disputed LLM extraction | 🟡 Early | Generic SPO extraction, no domain schema |
| Ontology definitions | Missing | No formal way to define subject patterns, predicate schemas |

The gap: We can store claims. We cannot yet structure claims for a domain in a way that guarantees conflicts will collide correctly.


The Core Insight

When Journal A says "Semaglutide reduced HbA1c by 1.5%" and Journal B says "Semaglutide reduced HbA1c by 0.8%", we need:

Subject:   Semaglutide:Type2Diabetes   # Drug:Indication compound key
Predicate: hba1c_change_percent        # Reusable across drug:indication pairs
Object:    -1.5                         # The conflicting value

The subject granularity depends on the predicate type:

  • Efficacy predicates → Subject is {Drug}:{Indication}
  • Safety predicates → Subject is {Drug} (indication-agnostic)
  • Mechanism predicates → Subject is {Drug} or {Drug}:{Pathway}

This is what the ontology layer needs to express.
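
To make the collision requirement concrete, here is a minimal Python sketch. It is not StemeDB's actual conflict detection; `Claim` and `find_conflicts` are hypothetical names used only for illustration, and the third claim is an invented example of a badly keyed subject. The sketch groups claims by their (subject, predicate) key and reports keys that carry more than one distinct value.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    """Hypothetical, simplified claim; real assertions carry more fields."""
    subject: str
    predicate: str
    value: float
    source: str


def find_conflicts(claims):
    """Group claims by (subject, predicate); keep keys with >1 distinct value."""
    groups = defaultdict(list)
    for claim in claims:
        groups[(claim.subject, claim.predicate)].append(claim)
    return {
        key: group
        for key, group in groups.items()
        if len({c.value for c in group}) > 1
    }


claims = [
    Claim("Semaglutide:Type2Diabetes", "hba1c_change_percent", -1.5, "Journal A"),
    Claim("Semaglutide:Type2Diabetes", "hba1c_change_percent", -0.8, "Journal B"),
    # Same drug and indication, but a differently spelled subject key.
    Claim("semaglutide (T2D)", "hba1c_change_percent", -1.2, "Journal C"),
]

conflicts = find_conflicts(claims)
for (subject, predicate), group in conflicts.items():
    print(subject, predicate, sorted(c.value for c in group))
```

Only the first two claims collide. The third is keyed under a different subject string and is silently missed, which is the failure mode a shared subject pattern is meant to rule out.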


High-Level Architecture

┌─────────────────────────────────────────────────────────────────┐
│                     Ontology Layer (New)                        │
├─────────────────────────────────────────────────────────────────┤
│  Domain Definition (YAML/TOML)                                  │
│  ├── Entity Types (Drug, Condition, Biomarker...)               │
│  ├── Predicate Schemas (subject pattern → predicates)           │
│  ├── Source Hierarchy (Tier 0-5)                                │
│  └── Default Lens per predicate type                            │
├─────────────────────────────────────────────────────────────────┤
│  Extraction Pipeline                                            │
│  ├── Source Adapters (FDA API, PubMed, Reddit...)               │
│  ├── Claim Extractor (LLM-based, schema-guided)                 │
│  ├── Normalizer (maps raw → ontology subjects/predicates)       │
│  └── Validator (checks claim conforms to schema)                │
├─────────────────────────────────────────────────────────────────┤
│  StemeDB Core (Existing)                                        │
│  ├── Assertion Storage                                          │
│  ├── Conflict Detection (SkepticLens)                           │
│  └── Query/Resolution                                           │
└─────────────────────────────────────────────────────────────────┘

Week-by-Week Implementation Plan

Week 1: Domain Definition Schema

Goals:

  • Define how domains express their ontology
  • Create the pharma domain definition as the first instance
  • No extraction yet — just the schema that extraction will target

Tasks:

  1. Design domain definition format (applications/ontology/)

    • Choose format: YAML or TOML (TOML aligns with Rust ecosystem)
    • Define schema for entity types, predicate schemas, source tiers
  2. Create pharma domain definition (applications/ontology/domains/pharma.toml)

    [domain]
    name = "pharma"
    version = "0.1.0"
    
    [entity_types]
    Drug = { aliases = ["medication", "compound", "molecule"] }
    Indication = { aliases = ["condition", "disease", "disorder"] }
    Biomarker = { aliases = ["endpoint", "measure"] }
    Population = { aliases = ["cohort", "patient_group"] }
    
    [predicate_schemas.efficacy]
    subject_pattern = "{Drug}:{Indication}"
    predicates = ["hba1c_change_percent", "weight_change_percent", "remission_rate"]
    default_lens = "Skeptic"
    
    [predicate_schemas.safety]
    subject_pattern = "{Drug}"
    predicates = ["nausea_incidence", "discontinuation_rate", "has_boxed_warning"]
    default_lens = "Authority"
    
    [predicate_schemas.mechanism]
    subject_pattern = "{Drug}"
    predicates = ["target_receptor", "half_life_hours", "bioavailability_percent"]
    default_lens = "Recency"
    
    [source_hierarchy]
    tier0 = ["FDA_Label", "EMA_Approval"]
    tier1 = ["Phase3_RCT", "Meta_Analysis"]
    tier2 = ["Observational_Study", "Real_World_Evidence"]
    tier3 = ["Case_Report", "Expert_Opinion"]
    tier4 = ["Patient_Forum", "Social_Media"]
    
  3. Implement domain parser (applications/ontology/src/domain.rs)

    • Parse TOML into Rust structs
    • Validate schema consistency (no circular refs, valid patterns)
    • Unit tests for parsing
  4. Subject builder utility

    • Given entity values + predicate schema, build correct subject string
    • build_subject("efficacy", {"Drug": "Semaglutide", "Indication": "Type2Diabetes"}) → "Semaglutide:Type2Diabetes"
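
The subject builder in task 4 could look roughly like the Python sketch below. The patterns are copied from the pharma.toml draft above; the error handling (rejecting both missing and unexpected entities) is an assumption about what "enforces schema compliance" should mean, not a settled design.

```python
import re

# Subject patterns copied from the pharma.toml draft above.
SUBJECT_PATTERNS = {
    "efficacy": "{Drug}:{Indication}",
    "safety": "{Drug}",
    "mechanism": "{Drug}",
}

PLACEHOLDER = re.compile(r"\{(\w+)\}")


def build_subject(schema, entities):
    """Fill a schema's subject pattern, rejecting missing or extra entities."""
    if schema not in SUBJECT_PATTERNS:
        raise KeyError(f"unknown predicate schema: {schema}")
    pattern = SUBJECT_PATTERNS[schema]
    required = PLACEHOLDER.findall(pattern)
    missing = [name for name in required if name not in entities]
    extra = [name for name in entities if name not in required]
    if missing or extra:
        raise ValueError(f"{schema}: missing={missing}, unexpected={extra}")
    # Substitute each {Placeholder} with the matching entity value.
    return PLACEHOLDER.sub(lambda m: entities[m.group(1)], pattern)


print(build_subject("efficacy", {"Drug": "Semaglutide", "Indication": "Type2Diabetes"}))
print(build_subject("safety", {"Drug": "Semaglutide"}))
```

Rejecting unexpected entities is deliberate: passing an indication to a safety predicate would otherwise be dropped silently, hiding an extractor bug.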

Deliverables:

  • applications/ontology/ crate with domain definition parsing
  • domains/pharma.toml as the reference implementation
  • Subject builder that enforces schema compliance

Foundation this enables:

  • Extractors know what shape claims should have
  • Validators can check claims against schema
  • Future domains just add a new .toml file

Week 2: FDA Label Extraction (Tier 0)

Goals:

  • Upgrade latent/ingest-fda to produce schema-compliant assertions
  • Extract structured claims, not raw text blobs
  • Write directly to StemeDB (not JSONL files)

Tasks:

  1. Refactor FDA ingestor (latent/ingest-fda/)

    • Load pharma domain definition
    • Use LLM (Claude) to extract structured claims from label sections
    • Map extracted claims to ontology predicates
  2. LLM extraction prompt for FDA labels

    Given this FDA label section for {drug_name}, extract claims as structured data.
    
    For ADVERSE REACTIONS sections, extract:
    - Predicate: {symptom}_incidence
    - Object: decimal (0.0-1.0)
    - Quote: exact text supporting this
    
    For BOXED WARNINGS, extract:
    - Predicate: has_boxed_warning
    - Object: boolean (true)
    - Quote: warning text
    
    Return JSON array of claims.
    
  3. Normalizer module (latent/ingest-fda/normalizer.py)

    • Map drug names to canonical form (Ozempic → semaglutide)
    • Map symptom names to canonical predicates (e.g., nausea and vomiting map to distinct predicates rather than one merged predicate)
    • Use drug synonym database (RxNorm or similar)
  4. StemeDB client integration

    • Replace JSONL output with HTTP calls to StemeDB API
    • Sign assertions with ingestor's Ed25519 key
    • Set source_class: Regulatory (Tier 0)
  5. Integration test

    • Ingest semaglutide FDA label
    • Query StemeDB for Semaglutide safety predicates
    • Verify structured claims exist with correct schema
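
As a rough illustration of the normalizer step, the Python sketch below maps one raw LLM-extracted claim onto a canonical subject and predicate. The lookup tables are hypothetical stand-ins for RxNorm or a similar resource, the raw claim's field names are assumptions about the extractor's JSON output, and the sample values are placeholders, not real label data.

```python
# Hypothetical lookup tables; a real normalizer would draw on RxNorm or similar.
DRUG_SYNONYMS = {
    "ozempic": "Semaglutide",
    "wegovy": "Semaglutide",
    "semaglutide": "Semaglutide",
}
SYMPTOM_PREDICATES = {
    "nausea": "nausea_incidence",
    "vomiting": "vomiting_incidence",  # kept distinct from nausea
}


def normalize_claim(raw):
    """Map one raw extracted claim onto a canonical subject and predicate.

    Returns None when the drug or symptom is unknown, so the caller can
    route the claim to manual review instead of guessing.
    """
    drug = DRUG_SYNONYMS.get(raw["drug"].strip().lower())
    predicate = SYMPTOM_PREDICATES.get(raw["symptom"].strip().lower())
    if drug is None or predicate is None:
        return None
    value = float(raw["value"])
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"incidence out of range: {value}")
    return {
        "subject": drug,  # safety predicates are keyed by drug alone
        "predicate": predicate,
        "object": value,
        "quote": raw["quote"],
        "source_class": "Regulatory",  # Tier 0, per the task list above
    }


# Placeholder input for illustration only; not taken from a real label.
raw_claim = {"drug": "Ozempic", "symptom": "Nausea", "value": "0.2",
             "quote": "(placeholder quote)"}
print(normalize_claim(raw_claim))
```

Returning `None` for unknown names, rather than passing them through, keeps unnormalized subjects out of the store, where they would never collide with anything.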

Deliverables:

  • FDA ingestor writes schema-compliant assertions to StemeDB
  • At least 3 drugs (semaglutide, tirzepatide, liraglutide) ingested
  • Integration test proving round-trip

Foundation this enables:

  • Tier 0 (regulatory) baseline established
  • Lower-tier sources can now conflict with FDA data
  • Pattern for other structured source adapters

Week 3: Clinical Trial Extraction (Tier 1)

Goals:

  • Extract efficacy claims from clinical trial publications
  • These will conflict with each other (the whole point of Episteme)
  • Demonstrate SkepticLens showing real disagreement

Tasks:

  1. PubMed/PMC source adapter (latent/ingest-pubmed/)

    • Fetch abstracts + full text for GLP-1 trials
    • Filter by clinical trial registration (NCT numbers)
    • Extract study metadata (sample size, design, journal)
  2. LLM extraction prompt for efficacy claims

    Given this clinical trial abstract/results for {drug_name} in {indication}:
    
    Extract efficacy claims:
    - Subject: {drug}:{indication}
    - Predicate: {biomarker}_change_percent | remission_rate | etc.
    - Object: numeric value
    - Confidence: based on sample size and study design
    - Quote: exact text
    
    Include comparator arm if mentioned (vs placebo, vs {other_drug}).
    
  3. Confidence scoring heuristics

    • Phase 3 RCT: base confidence 0.9
    • Phase 2: base confidence 0.7
    • Observational: base confidence 0.5
    • Adjust by sample size, blinding, journal impact factor
  4. Conflict demonstration

    • Ingest 5+ trials with varying HbA1c results for semaglutide
    • Query with SkepticLens
    • Show conflict_score > 0 and multiple competing claims
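
The confidence heuristics in task 3 might be prototyped as below. The base values come from the list above; the sample-size and blinding adjustments are assumptions chosen only to show the shape of the calculation, and journal impact factor is left out entirely.

```python
import math

# Base confidences from the heuristics above.
BASE_CONFIDENCE = {"phase3_rct": 0.9, "phase2": 0.7, "observational": 0.5}


def score_confidence(design, sample_size, double_blind):
    """Combine base confidence with assumed sample-size and blinding adjustments."""
    score = BASE_CONFIDENCE[design]
    # Assumed adjustment: scale by log10(n) relative to n=1000, capped at 1.0,
    # so small studies are discounted and large ones never exceed the base.
    size_factor = min(1.0, math.log10(max(sample_size, 10)) / 3.0)
    score *= size_factor
    # Assumed adjustment: studies that are not double-blind lose 10%.
    if not double_blind:
        score *= 0.9
    return round(score, 3)


for design, n, blind in [("phase3_rct", 1847, True),
                         ("phase2", 100, True),
                         ("observational", 12000, False)]:
    print(design, n, score_confidence(design, n, blind))
```

Capping the size factor at 1.0 means the adjustments can only lower a score, so study design stays the dominant term.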

Deliverables:

  • PubMed ingestor producing Tier 1 assertions
  • At least 10 clinical trial papers ingested
  • SkepticLens query showing real conflict in GLP-1 efficacy data

Foundation this enables:

  • Proves the ontology enables conflict detection
  • Real "Trust but Verify" data for demos
  • Pattern for other publication sources (biorxiv, medrxiv)

Week 4: Ontology Validation & Query Patterns

Goals:

  • Validate that extracted claims conform to ontology
  • Build query helpers that leverage domain knowledge
  • CLI tool for exploring the pharma knowledge graph

Tasks:

  1. Claim validator (applications/ontology/src/validator.rs)

    • Check subject matches predicate schema's subject_pattern
    • Check predicate is defined for its schema
    • Check object type is valid for predicate
    • Return validation errors or Ok
  2. Query builder with domain awareness

    • query_efficacy("Semaglutide", "Type2Diabetes") → correct subject + all efficacy predicates
    • query_safety("Semaglutide") → correct subject + all safety predicates
    • Uses domain definition to know which predicates to include
  3. CLI exploration tool (applications/ontology/src/bin/steme-pharma.rs)

    steme-pharma efficacy semaglutide type2-diabetes
    # Shows all efficacy claims with conflict scores
    
    steme-pharma safety semaglutide
    # Shows all safety claims by tier
    
    steme-pharma compare semaglutide tirzepatide --indication type2-diabetes
    # Side-by-side comparison
    
  4. Domain-aware API endpoints (optional, if time)

    • GET /v1/pharma/efficacy?drug=semaglutide&indication=type2_diabetes
    • GET /v1/pharma/safety?drug=semaglutide
    • Thin wrapper over existing query API with domain knowledge
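
The three validator checks in task 1 can be sketched in Python before committing to the Rust implementation. `SCHEMAS` is a hypothetical in-memory slice of the pharma domain; the per-predicate object types are an assumed extension, since the TOML draft does not declare them yet.

```python
import re

# Hypothetical slice of the pharma domain. Object types per predicate are an
# assumption; the TOML draft does not declare them yet.
SCHEMAS = {
    "efficacy": {
        "subject_pattern": "{Drug}:{Indication}",
        "predicates": {"hba1c_change_percent": float, "remission_rate": float},
    },
    "safety": {
        "subject_pattern": "{Drug}",
        "predicates": {"nausea_incidence": float, "has_boxed_warning": bool},
    },
}


def pattern_to_regex(pattern):
    """Turn '{Drug}:{Indication}' into a regex with one colon-free part per placeholder."""
    parts = re.split(r"\{\w+\}", pattern)
    return re.compile("^" + "[^:]+".join(re.escape(p) for p in parts) + "$")


def type_matches(obj, expected):
    """Accept ints for float predicates, but never treat a bool as a number."""
    if expected is float:
        return isinstance(obj, (int, float)) and not isinstance(obj, bool)
    return isinstance(obj, expected)


def validate_claim(subject, predicate, obj):
    """Return a list of validation errors; an empty list means the claim is valid."""
    schema = next((s for s in SCHEMAS.values() if predicate in s["predicates"]), None)
    if schema is None:
        return [f"unknown predicate: {predicate}"]
    errors = []
    if not pattern_to_regex(schema["subject_pattern"]).match(subject):
        errors.append(f"subject {subject!r} does not match {schema['subject_pattern']}")
    expected = schema["predicates"][predicate]
    if not type_matches(obj, expected):
        errors.append(f"object {obj!r} is not a {expected.__name__}")
    return errors


print(validate_claim("Semaglutide:Type2Diabetes", "hba1c_change_percent", -1.5))
print(validate_claim("Semaglutide", "hba1c_change_percent", -1.5))
```

Returning a list instead of raising on the first failure lets an ingestor log every problem with a claim in one pass.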

Deliverables:

  • Validator catches schema-violating claims before ingestion
  • CLI tool for exploring pharma data
  • Query helpers that make domain queries easy
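
A query helper could be as thin as the sketch below, which expands a drug (and optional indication) into the full list of (subject, predicate) lookup keys. The predicate lists are copied from the pharma.toml draft; how each key is then sent to the existing query API is left out, since that interface is not described here.

```python
# Predicate lists copied from the pharma.toml draft.
PREDICATES = {
    "efficacy": ["hba1c_change_percent", "weight_change_percent", "remission_rate"],
    "safety": ["nausea_incidence", "discontinuation_rate", "has_boxed_warning"],
}


def query_efficacy(drug, indication):
    """Expand a drug/indication pair into every efficacy (subject, predicate) key."""
    subject = f"{drug}:{indication}"
    return [(subject, predicate) for predicate in PREDICATES["efficacy"]]


def query_safety(drug):
    """Safety subjects are indication-agnostic, so the drug alone is the subject."""
    return [(drug, predicate) for predicate in PREDICATES["safety"]]


for key in query_efficacy("Semaglutide", "Type2Diabetes"):
    print(key)
```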

Foundation this enables:

  • Quality gate for extraction pipelines
  • Developer experience for exploring data
  • Pattern for domain-specific APIs

Week 5: Social Signal Extraction (Tier 4-5)

Goals:

  • Extract anecdotal claims from Reddit/social media
  • These are low-confidence but high-volume
  • Demonstrate the full tier spectrum working together

Tasks:

  1. Reddit source adapter (latent/ingest-reddit/)

    • Fetch posts from r/Ozempic, r/Semaglutide, r/diabetes
    • Extract post text, upvotes, comment count
    • Rate limit appropriately
  2. LLM extraction for anecdotal claims

    Given this Reddit post about {drug_name}:
    
    Extract personal experience claims:
    - Predicate: reported_{symptom} | reported_effectiveness
    - Object: Text (the claim) or Boolean
    - Confidence: 0.2-0.4 based on detail and specificity
    - Quote: relevant excerpt
    
    Skip if: asking questions, discussing others' experiences, promotional.
    
  3. Anecdote clustering (stretch goal)

    • Group similar anecdotal claims
    • Escalate if cluster size exceeds threshold
    • "50+ users reporting gastroparesis" becomes a signal
  4. Full-tier query demonstration

    steme-pharma safety semaglutide --all-tiers
    # Shows:
    # Tier 0 (FDA): has_boxed_warning = true (thyroid)
    # Tier 1 (RCTs): nausea_incidence = 0.44
    # Tier 4 (Reddit): 127 reports of "gastroparesis" (pre-label)
    

Deliverables:

  • Reddit ingestor producing Tier 4-5 assertions
  • At least 100 anecdotal claims ingested
  • Query showing full tier spectrum for a safety topic
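
The anecdote-escalation stretch goal can be approximated by counting distinct authors per (subject, predicate) key, as in the sketch below. Exact-key grouping is a stand-in for real clustering, which would also have to merge paraphrases of the same symptom. The threshold value, the `author` field, and the input data are all assumptions; the data is synthetic.

```python
from collections import Counter

ESCALATION_THRESHOLD = 50  # assumed; the plan only says "exceeds threshold"


def escalate_signals(claims, threshold=ESCALATION_THRESHOLD):
    """Count distinct authors per (subject, predicate); return keys above threshold."""
    seen = set()
    counts = Counter()
    for claim in claims:
        key = (claim["subject"], claim["predicate"])
        fingerprint = (key, claim["author"])
        if fingerprint in seen:
            continue  # one author repeating a claim counts once
        seen.add(fingerprint)
        counts[key] += 1
    return {key: n for key, n in counts.items() if n > threshold}


# Synthetic data: 60 distinct authors report one symptom, 3 report another.
synthetic = [
    {"subject": "Semaglutide", "predicate": "reported_gastroparesis", "author": f"user{i}"}
    for i in range(60)
] + [
    {"subject": "Semaglutide", "predicate": "reported_hair_loss", "author": f"user{i}"}
    for i in range(3)
]
print(escalate_signals(synthetic))
```

Counting authors rather than posts keeps one prolific poster from crossing the threshold alone.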

Foundation this enables:

  • "Early signal detection" — social media flags issues before FDA
  • Real consumer health use case data
  • Validates source_class hierarchy is working

Week 6: Ontology Layer Extraction & Documentation

Goals:

  • Extract reusable patterns from pharma implementation
  • Document how to create a new domain
  • Prepare for second vertical (SEC/financial or code/security)

Tasks:

  1. Generalize domain definition schema

    • Review what was pharma-specific vs reusable
    • Document the schema formally
    • Create domains/template.toml with comments
  2. Extraction pipeline abstraction

    • Common interface for source adapters
    • Common LLM prompt patterns
    • Common normalization utilities
  3. "Add a Domain" guide (docs/guides/adding-a-domain.md)

    • Step-by-step: define entities, predicates, sources
    • Example: how pharma was built
    • Checklist: what you need before first extraction
  4. Second domain sketch (no implementation)

    • Draft domains/sec.toml or domains/security.toml
    • Identify where patterns differ from pharma
    • Document what would need to change
  5. Integration with CLAUDE.md

    • Add ontology layer to "Find Your Guide" table
    • Document new crates and tools
    • Update specialized agents if needed
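
One possible shape for the common source-adapter interface in task 2 is sketched below. The class and method names are hypothetical; `InMemoryAdapter` is a toy stand-in for a real API client and skips the LLM extraction step.

```python
from abc import ABC, abstractmethod
from typing import Iterator


class SourceAdapter(ABC):
    """Hypothetical common interface; one subclass per source (FDA, PubMed, Reddit)."""

    source_class: str  # subclasses set this, e.g. "Regulatory"

    @abstractmethod
    def fetch(self, query: str) -> Iterator[dict]:
        """Yield raw documents for a query."""

    @abstractmethod
    def extract(self, document: dict) -> list[dict]:
        """Turn one raw document into zero or more unnormalized claims."""

    def run(self, query: str) -> list[dict]:
        """Shared pipeline: fetch, extract, and stamp each claim with its source class."""
        claims = []
        for document in self.fetch(query):
            for claim in self.extract(document):
                claims.append({**claim, "source_class": self.source_class})
        return claims


class InMemoryAdapter(SourceAdapter):
    """Toy adapter over a fixed list, standing in for a real API client."""

    source_class = "Regulatory"

    def __init__(self, documents):
        self.documents = documents

    def fetch(self, query):
        return (d for d in self.documents if query.lower() in d["text"].lower())

    def extract(self, document):
        # A real adapter would call the LLM extractor here.
        return [{"quote": document["text"]}]


adapter = InMemoryAdapter([{"text": "Semaglutide label section"}, {"text": "Unrelated"}])
print(adapter.run("semaglutide"))
```

Keeping `run` in the base class means each new source supplies only `fetch` and `extract`, while source-class stamping stays consistent across adapters.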

Deliverables:

  • Reusable ontology crate extracted from pharma-specific code
  • Documentation for adding new domains
  • Draft of second domain showing generalization

Foundation this enables:

  • Other teams can add domains without core changes
  • Clear separation of domain logic from storage logic
  • Path to "platform" if that's the direction

Risks & Mitigations

| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| LLM extraction quality varies | High | Medium | Start with structured sources (FDA), add human review loop |
| Ontology schema needs revision | High | Low | Keep Week 1 scope small, iterate based on Week 2-3 learnings |
| Drug name normalization is hard | Medium | Medium | Use existing resources (RxNorm), accept some manual mapping |
| Reddit rate limits | Medium | Low | Cache aggressively, use Reddit API properly |
| Scope creep into full consumer app | Medium | High | Stay focused on extraction + storage, not UX |

Success Criteria

Week 3 checkpoint: Can we ingest FDA labels + clinical trials and see real conflicts via SkepticLens?

Week 6 checkpoint: Is the ontology layer reusable enough that sketching a second domain feels like "filling in a template" rather than "starting over"?

End state:

# This should work and show meaningful data:
steme-pharma efficacy semaglutide type2-diabetes

# Output:
Subject: Semaglutide:Type2Diabetes
Predicate: hba1c_change_percent
Status: Contested (conflict_score: 0.42)

Claims:
  -1.5% (45% weight) — NEJM Phase 3, n=1847
  -1.2% (35% weight) — Lancet Phase 3, n=892
  -0.8% (20% weight) — JAMA Observational, n=12000

Sources span Tier 0-2. 47 anecdotal reports (Tier 4) mention efficacy.

Open Questions

  1. Where does the ontology crate live?

    • Option A: applications/ontology/ (domain-agnostic utility)
    • Option B: crates/stemedb-ontology/ (core crate)
    • Recommendation: Start in applications/, promote to crates/ if it proves domain-agnostic
  2. Python vs Rust for extractors?

    • Python: faster iteration, better LLM libraries, existing latent/ code
    • Rust: type safety, integration with core
    • Recommendation: Python for extractors (they're ETL jobs), Rust for ontology validation
  3. How do we handle ontology versioning?

    • Predicates may change as we learn
    • Old assertions might not conform to new schema
    • Recommendation: Version in domain definition, allow "legacy" predicates with deprecation flags
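
The deprecation-flag recommendation could work along the lines of the sketch below. The field names (`since`, `deprecated_in`, `replaced_by`) and the legacy predicate `hba1c_delta` are hypothetical, invented only to show how an old assertion's predicate might resolve to its current name.

```python
# Hypothetical versioned predicate table; field names are assumptions.
PREDICATES = {
    "hba1c_change_percent": {"since": "0.1.0"},
    # Deprecated predicates stay resolvable so old assertions remain queryable.
    "hba1c_delta": {
        "since": "0.1.0",
        "deprecated_in": "0.2.0",
        "replaced_by": "hba1c_change_percent",
    },
}


def resolve_predicate(name):
    """Return (canonical_name, warning_or_None); raise KeyError for unknown predicates."""
    if name not in PREDICATES:
        raise KeyError(f"unknown predicate: {name}")
    entry = PREDICATES[name]
    if "deprecated_in" in entry:
        canonical = entry.get("replaced_by", name)
        warning = f"{name} is deprecated since {entry['deprecated_in']}; use {canonical}"
        return canonical, warning
    return name, None


print(resolve_predicate("hba1c_change_percent"))
print(resolve_predicate("hba1c_delta"))
```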

Next Steps

  1. Review this plan and adjust scope/timeline
  2. Decide on Question 1 (crate location)
  3. Start Week 1: domain definition schema