Ontology Layer + Medical Vertical Implementation Plan
Goal: Build the ontology layer that defines how claims are structured, extracted, and resolved — using Pharma/Medical as the first vertical to pressure-test every decision.
Philosophy: Build medical end-to-end first, extract the reusable ontology layer as patterns emerge. Don't abstract prematurely.
Outcome: A working claim extraction pipeline that can ingest FDA labels, clinical trial data, and eventually social media — producing properly structured assertions that conflict-detect correctly.
Context: What Exists Today
| Component | Status | Gap |
|---|---|---|
| Core StemeDB | ✅ Complete | Storage, lenses, conflict detection work |
| SkepticLens | ✅ Complete | Can detect conflicts if subjects/predicates match |
| `latent/ingest-fda` | 🟡 Prototype | Fetches FDA labels, outputs flat JSONL (not integrated) |
| Aphoria extractors | ✅ Complete | Pattern-based extraction for code (14 extractors) |
| Disputed LLM extraction | 🟡 Early | Generic SPO extraction, no domain schema |
| Ontology definitions | ❌ Missing | No formal way to define subject patterns, predicate schemas |
The gap: We can store claims. We cannot yet structure claims for a domain in a way that guarantees conflicts will collide correctly.
The Core Insight
When Journal A says "Semaglutide reduced HbA1c by 1.5%" and Journal B says "Semaglutide reduced HbA1c by 0.8%", we need:
Subject: Semaglutide:Type2Diabetes # Drug:Indication compound key
Predicate: hba1c_change_percent # Reusable across drug:indication pairs
Object: -1.5 # The conflicting value
The subject granularity depends on the predicate type:
- Efficacy predicates → Subject is `{Drug}:{Indication}`
- Safety predicates → Subject is `{Drug}` (indication-agnostic)
- Mechanism predicates → Subject is `{Drug}` or `{Drug}:{Pathway}`
This is what the ontology layer needs to express.
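To make the collision requirement concrete, here is a minimal Python sketch of why the compound key matters. The claim tuples and the `find_collisions` helper are hypothetical illustrations, not part of StemeDB or SkepticLens; the only point is that two claims can conflict only if they land on the same (subject, predicate) pair.

```python
from collections import defaultdict

# Hypothetical claim tuples: (subject, predicate, object, source).
claims = [
    ("Semaglutide:Type2Diabetes", "hba1c_change_percent", -1.5, "Journal A"),
    ("Semaglutide:Type2Diabetes", "hba1c_change_percent", -0.8, "Journal B"),
    ("Semaglutide", "has_boxed_warning", True, "FDA label"),
]


def find_collisions(claims):
    """Group claims by (subject, predicate) and keep keys with more than one distinct object."""
    grouped = defaultdict(set)
    for subject, predicate, obj, _source in claims:
        grouped[(subject, predicate)].add(obj)
    return {key: sorted(values) for key, values in grouped.items() if len(values) > 1}


collisions = find_collisions(claims)
```

If Journal B's claim had been stored under a different subject string (say `Ozempic` instead of `Semaglutide:Type2Diabetes`), the two values would never share a key and no conflict would surface.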
High-Level Architecture
┌─────────────────────────────────────────────────────────────────┐
│ Ontology Layer (New) │
├─────────────────────────────────────────────────────────────────┤
│ Domain Definition (YAML/TOML) │
│ ├── Entity Types (Drug, Condition, Biomarker...) │
│ ├── Predicate Schemas (subject pattern → predicates) │
│ ├── Source Hierarchy (Tier 0-5) │
│ └── Default Lens per predicate type │
├─────────────────────────────────────────────────────────────────┤
│ Extraction Pipeline │
│ ├── Source Adapters (FDA API, PubMed, Reddit...) │
│ ├── Claim Extractor (LLM-based, schema-guided) │
│ ├── Normalizer (maps raw → ontology subjects/predicates) │
│ └── Validator (checks claim conforms to schema) │
├─────────────────────────────────────────────────────────────────┤
│ StemeDB Core (Existing) │
│ ├── Assertion Storage │
│ ├── Conflict Detection (SkepticLens) │
│ └── Query/Resolution │
└─────────────────────────────────────────────────────────────────┘
Week-by-Week Implementation Plan
Week 1: Domain Definition Schema
Goals:
- Define how domains express their ontology
- Create the pharma domain definition as the first instance
- No extraction yet — just the schema that extraction will target
Tasks:
1. Design domain definition format (`applications/ontology/`)
   - Choose format: YAML or TOML (TOML aligns with Rust ecosystem)
   - Define schema for entity types, predicate schemas, source tiers
2. Create pharma domain definition (`applications/ontology/domains/pharma.toml`)

   ```toml
   [domain]
   name = "pharma"
   version = "0.1.0"

   [entity_types]
   Drug = { aliases = ["medication", "compound", "molecule"] }
   Indication = { aliases = ["condition", "disease", "disorder"] }
   Biomarker = { aliases = ["endpoint", "measure"] }
   Population = { aliases = ["cohort", "patient_group"] }

   [predicate_schemas.efficacy]
   subject_pattern = "{Drug}:{Indication}"
   predicates = ["hba1c_change_percent", "weight_change_percent", "remission_rate"]
   default_lens = "Skeptic"

   [predicate_schemas.safety]
   subject_pattern = "{Drug}"
   predicates = ["nausea_incidence", "discontinuation_rate", "has_boxed_warning"]
   default_lens = "Authority"

   [predicate_schemas.mechanism]
   subject_pattern = "{Drug}"
   predicates = ["target_receptor", "half_life_hours", "bioavailability_percent"]
   default_lens = "Recency"

   [source_hierarchy]
   tier0 = ["FDA_Label", "EMA_Approval"]
   tier1 = ["Phase3_RCT", "Meta_Analysis"]
   tier2 = ["Observational_Study", "Real_World_Evidence"]
   tier3 = ["Case_Report", "Expert_Opinion"]
   tier4 = ["Patient_Forum", "Social_Media"]
   ```
3. Implement domain parser (`applications/ontology/src/domain.rs`)
   - Parse TOML into Rust structs
   - Validate schema consistency (no circular refs, valid patterns)
   - Unit tests for parsing
4. Subject builder utility
   - Given entity values + predicate schema, build correct subject string
   - `build_subject("efficacy", {"Drug": "Semaglutide", "Indication": "Type2Diabetes"})` → `"Semaglutide:Type2Diabetes"`
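The subject builder task can be sketched in a few lines. This is an illustrative Python version, not the planned Rust implementation: `PREDICATE_SCHEMAS` is an assumed in-memory form of the predicate schemas in `pharma.toml`, and the strictness about missing or extra entities is a design choice of this sketch.

```python
import re

# Assumed in-memory form of the predicate schemas from pharma.toml.
PREDICATE_SCHEMAS = {
    "efficacy": {"subject_pattern": "{Drug}:{Indication}"},
    "safety": {"subject_pattern": "{Drug}"},
    "mechanism": {"subject_pattern": "{Drug}"},
}

PLACEHOLDER = re.compile(r"\{(\w+)\}")


def build_subject(schema_name, entities, schemas=PREDICATE_SCHEMAS):
    """Fill a schema's subject_pattern, rejecting missing or extra entities."""
    if schema_name not in schemas:
        raise KeyError(f"unknown predicate schema: {schema_name}")
    pattern = schemas[schema_name]["subject_pattern"]
    required = PLACEHOLDER.findall(pattern)
    missing = [name for name in required if name not in entities]
    extra = [name for name in entities if name not in required]
    if missing or extra:
        raise ValueError(f"missing={missing} extra={extra} for pattern {pattern}")
    # Substitute each {Placeholder} with the supplied entity value.
    return PLACEHOLDER.sub(lambda match: entities[match.group(1)], pattern)


efficacy_subject = build_subject(
    "efficacy", {"Drug": "Semaglutide", "Indication": "Type2Diabetes"}
)
safety_subject = build_subject("safety", {"Drug": "Semaglutide"})
```

Rejecting extra entities (for example an `Indication` passed to a safety predicate) is what keeps safety claims indication-agnostic, so they collide on the drug alone.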
Deliverables:
- `applications/ontology/` crate with domain definition parsing
- `domains/pharma.toml` as the reference implementation
- Subject builder that enforces schema compliance
Foundation this enables:
- Extractors know what shape claims should have
- Validators can check claims against schema
- Future domains just add a new `.toml` file
Week 2: FDA Label Extraction (Tier 0)
Goals:
- Upgrade `latent/ingest-fda` to produce schema-compliant assertions
- Extract structured claims, not raw text blobs
- Write directly to StemeDB (not JSONL files)
Tasks:
1. Refactor FDA ingestor (`latent/ingest-fda/`)
   - Load pharma domain definition
   - Use LLM (Claude) to extract structured claims from label sections
   - Map extracted claims to ontology predicates
2. LLM extraction prompt for FDA labels

   ```
   Given this FDA label section for {drug_name}, extract claims as structured data.

   For ADVERSE REACTIONS sections, extract:
   - Predicate: {symptom}_incidence
   - Object: decimal (0.0-1.0)
   - Quote: exact text supporting this

   For BOXED WARNINGS, extract:
   - Predicate: has_boxed_warning
   - Object: boolean (true)
   - Quote: warning text

   Return JSON array of claims.
   ```
3. Normalizer module (`latent/ingest-fda/normalizer.py`)
   - Map drug names to canonical form (Ozempic → semaglutide)
   - Map symptom names to canonical predicates (nausea, vomiting → distinct)
   - Use drug synonym database (RxNorm or similar)
4. StemeDB client integration
   - Replace JSONL output with HTTP calls to StemeDB API
   - Sign assertions with ingestor's Ed25519 key
   - Set `source_class: Regulatory` (Tier 0)
5. Integration test
   - Ingest semaglutide FDA label
   - Query StemeDB for `Semaglutide` safety predicates
   - Verify structured claims exist with correct schema
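A rough sketch of what the normalizer step might look like, assuming the LLM returns one dict per claim with `drug`, `symptom`, `object`, and `quote` keys. The field names, the tiny `BRAND_TO_GENERIC` table, and the output dict shape are all assumptions for illustration; a real normalizer would draw on RxNorm or a similar synonym source rather than a hard-coded map.

```python
import re

# Hypothetical stand-in for a drug synonym database such as RxNorm.
BRAND_TO_GENERIC = {
    "ozempic": "semaglutide",
    "wegovy": "semaglutide",
    "mounjaro": "tirzepatide",
    "victoza": "liraglutide",
}


def canonical_drug(name):
    """Map a brand or generic name to its lowercase generic form."""
    key = name.strip().lower()
    return BRAND_TO_GENERIC.get(key, key)


def symptom_predicate(symptom):
    """Turn a raw symptom label into a `{symptom}_incidence` predicate."""
    slug = re.sub(r"[^a-z0-9]+", "_", symptom.strip().lower()).strip("_")
    return f"{slug}_incidence"


def normalize_claim(raw):
    """Map one raw extracted claim (assumed dict shape) onto a safety assertion dict."""
    incidence = float(raw["object"])
    if not 0.0 <= incidence <= 1.0:
        raise ValueError(f"incidence out of range: {incidence}")
    return {
        # Capitalized to match the subject style used elsewhere in this plan.
        "subject": canonical_drug(raw["drug"]).capitalize(),
        "predicate": symptom_predicate(raw["symptom"]),
        "object": incidence,
        "quote": raw["quote"],
        "source_class": "Regulatory",
    }
```

Because each symptom gets its own slug, nausea and vomiting end up as distinct predicates (`nausea_incidence`, `vomiting_incidence`) rather than being merged.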
Deliverables:
- FDA ingestor writes schema-compliant assertions to StemeDB
- At least 3 drugs (semaglutide, tirzepatide, liraglutide) ingested
- Integration test proving round-trip
Foundation this enables:
- Tier 0 (regulatory) baseline established
- Lower-tier sources can now conflict with FDA data
- Pattern for other structured source adapters
Week 3: Clinical Trial Extraction (Tier 1)
Goals:
- Extract efficacy claims from clinical trial publications
- These will conflict with each other (the whole point of Episteme)
- Demonstrate SkepticLens showing real disagreement
Tasks:
1. PubMed/PMC source adapter (`latent/ingest-pubmed/`)
   - Fetch abstracts + full text for GLP-1 trials
   - Filter by clinical trial registration (NCT numbers)
   - Extract study metadata (sample size, design, journal)
2. LLM extraction prompt for efficacy claims

   ```
   Given this clinical trial abstract/results for {drug_name} in {indication}:

   Extract efficacy claims:
   - Subject: {drug}:{indication}
   - Predicate: {biomarker}_change_percent | remission_rate | etc.
   - Object: numeric value
   - Confidence: based on sample size and study design
   - Quote: exact text

   Include comparator arm if mentioned (vs placebo, vs {other_drug}).
   ```
3. Confidence scoring heuristics
   - Phase 3 RCT: base confidence 0.9
   - Phase 2: base confidence 0.7
   - Observational: base confidence 0.5
   - Adjust by sample size, blinding, journal impact factor
4. Conflict demonstration
   - Ingest 5+ trials with varying HbA1c results for semaglutide
   - Query with SkepticLens
   - Show `conflict_score > 0` and multiple competing claims
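The confidence heuristics could start as something like the sketch below. The base values come from the task list above; the size thresholds and the adjustment magnitudes are placeholders invented for this example and would need tuning, and journal impact factor is left out entirely.

```python
# Base confidences from the plan, keyed by hypothetical design labels.
BASE_CONFIDENCE = {"phase3_rct": 0.9, "phase2": 0.7, "observational": 0.5}


def score_confidence(design, sample_size, double_blind):
    """Return a confidence in [0, 1] from study design, sample size, and blinding."""
    base = BASE_CONFIDENCE[design]
    adjustment = 0.0
    # Assumed adjustments (not specified in the plan).
    if sample_size < 100:
        adjustment -= 0.10  # small studies are penalized
    elif sample_size >= 1000:
        adjustment += 0.05  # large studies get a modest boost
    if not double_blind:
        adjustment -= 0.05  # open-label or unblinded designs are penalized
    return round(min(1.0, max(0.0, base + adjustment)), 2)


large_rct = score_confidence("phase3_rct", sample_size=1847, double_blind=True)
small_phase2 = score_confidence("phase2", sample_size=50, double_blind=False)
```

Keeping the adjustments additive and clamped makes the scores easy to audit: every confidence can be explained as the base value plus a short list of named penalties or boosts.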
Deliverables:
- PubMed ingestor producing Tier 1 assertions
- At least 10 clinical trial papers ingested
- SkepticLens query showing real conflict in GLP-1 efficacy data
Foundation this enables:
- Proves the ontology enables conflict detection
- Real "Trust but Verify" data for demos
- Pattern for other publication sources (biorxiv, medrxiv)
Week 4: Ontology Validation & Query Patterns
Goals:
- Validate that extracted claims conform to ontology
- Build query helpers that leverage domain knowledge
- CLI tool for exploring the pharma knowledge graph
Tasks:
1. Claim validator (`applications/ontology/src/validator.rs`)
   - Check subject matches predicate schema's `subject_pattern`
   - Check predicate is defined for its schema
   - Check object type is valid for predicate
   - Return validation errors or Ok
2. Query builder with domain awareness
   - `query_efficacy("Semaglutide", "Type2Diabetes")` → correct subject + all efficacy predicates
   - `query_safety("Semaglutide")` → correct subject + all safety predicates
   - Uses domain definition to know which predicates to include
3. CLI exploration tool (`applications/ontology/src/bin/steme-pharma.rs`)

   ```bash
   steme-pharma efficacy semaglutide type2-diabetes
   # Shows all efficacy claims with conflict scores

   steme-pharma safety semaglutide
   # Shows all safety claims by tier

   steme-pharma compare semaglutide tirzepatide --indication type2-diabetes
   # Side-by-side comparison
   ```
4. Domain-aware API endpoints (optional, if time)
   - `GET /v1/pharma/efficacy?drug=semaglutide&indication=type2_diabetes`
   - `GET /v1/pharma/safety?drug=semaglutide`
   - Thin wrapper over existing query API with domain knowledge
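The three validator checks can be prototyped before committing to the Rust design. The sketch below is in Python for brevity. Subject patterns and predicate names follow `pharma.toml`, but the per-predicate object types are an assumption of this example, since the domain definition shown earlier does not declare them.

```python
import re

# Object types per predicate are assumed here; pharma.toml does not define them.
SCHEMAS = {
    "efficacy": {
        "subject_pattern": "{Drug}:{Indication}",
        "predicates": {
            "hba1c_change_percent": float,
            "weight_change_percent": float,
            "remission_rate": float,
        },
    },
    "safety": {
        "subject_pattern": "{Drug}",
        "predicates": {
            "nausea_incidence": float,
            "discontinuation_rate": float,
            "has_boxed_warning": bool,
        },
    },
}


def pattern_to_regex(pattern):
    """Convert '{Drug}:{Indication}' into a regex with one colon-free segment per placeholder."""
    literals = re.split(r"\{\w+\}", pattern)
    return re.compile("[^:]+".join(re.escape(part) for part in literals))


def validate_claim(schema_name, subject, predicate, obj, schemas=SCHEMAS):
    """Return a list of validation errors; an empty list means the claim conforms."""
    schema = schemas.get(schema_name)
    if schema is None:
        return [f"unknown schema: {schema_name}"]
    errors = []
    if not pattern_to_regex(schema["subject_pattern"]).fullmatch(subject):
        errors.append(f"subject {subject!r} does not match {schema['subject_pattern']}")
    expected = schema["predicates"].get(predicate)
    if expected is None:
        errors.append(f"predicate {predicate!r} not defined for schema {schema_name!r}")
    elif expected is float:
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(obj, bool) or not isinstance(obj, (int, float)):
            errors.append(f"object for {predicate!r} must be numeric")
    elif not isinstance(obj, expected):
        errors.append(f"object for {predicate!r} must be {expected.__name__}")
    return errors
```

Returning a list of errors rather than raising on the first one lets an ingestor log every problem with a claim in a single pass, which is useful when tuning extraction prompts.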
Deliverables:
- Validator catches schema-violating claims before ingestion
- CLI tool for exploring pharma data
- Query helpers that make domain queries easy
Foundation this enables:
- Quality gate for extraction pipelines
- Developer experience for exploring data
- Pattern for domain-specific APIs
Week 5: Social Signal Extraction (Tier 4-5)
Goals:
- Extract anecdotal claims from Reddit/social media
- These are low-confidence but high-volume
- Demonstrate the full tier spectrum working together
Tasks:
1. Reddit source adapter (`latent/ingest-reddit/`)
   - Fetch posts from r/Ozempic, r/Semaglutide, r/diabetes
   - Extract post text, upvotes, comment count
   - Rate limit appropriately
2. LLM extraction for anecdotal claims

   ```
   Given this Reddit post about {drug_name}:

   Extract personal experience claims:
   - Predicate: reported_{symptom} | reported_effectiveness
   - Object: Text (the claim) or Boolean
   - Confidence: 0.2-0.4 based on detail and specificity
   - Quote: relevant excerpt

   Skip if: asking questions, discussing others' experiences, promotional.
   ```
3. Anecdote clustering (stretch goal)
   - Group similar anecdotal claims
   - Escalate if cluster size exceeds threshold
   - "50+ users reporting gastroparesis" becomes a signal
4. Full-tier query demonstration

   ```bash
   steme-pharma safety semaglutide --all-tiers
   # Shows:
   # Tier 0 (FDA): has_boxed_warning = true (thyroid)
   # Tier 1 (RCTs): nausea_incidence = 0.44
   # Tier 4 (Reddit): 127 reports of "gastroparesis" (pre-label)
   ```
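For the clustering stretch goal, the simplest possible version is exact-match grouping, sketched below. It assumes each anecdote has already been normalized to a dict with `drug`, `symptom`, and `author` keys (hypothetical field names) and treats the "50+" threshold as inclusive. Real clustering of "similar" claims would need fuzzier matching than this.

```python
from collections import defaultdict


def escalate_clusters(anecdotes, threshold=50):
    """Count distinct authors per (drug, symptom); return clusters at or above threshold."""
    authors = defaultdict(set)
    for anecdote in anecdotes:
        key = (anecdote["drug"], anecdote["symptom"])
        authors[key].add(anecdote["author"])
    return {key: len(users) for key, users in authors.items() if len(users) >= threshold}
```

Counting distinct authors rather than posts keeps one prolific poster from triggering an escalation on their own.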
Deliverables:
- Reddit ingestor producing Tier 4-5 assertions
- At least 100 anecdotal claims ingested
- Query showing full tier spectrum for a safety topic
Foundation this enables:
- "Early signal detection" — social media flags issues before FDA
- Real consumer health use case data
- Validates source_class hierarchy is working
Week 6: Ontology Layer Extraction & Documentation
Goals:
- Extract reusable patterns from pharma implementation
- Document how to create a new domain
- Prepare for second vertical (SEC/financial or code/security)
Tasks:
1. Generalize domain definition schema
   - Review what was pharma-specific vs reusable
   - Document the schema formally
   - Create `domains/template.toml` with comments
2. Extraction pipeline abstraction
   - Common interface for source adapters
   - Common LLM prompt patterns
   - Common normalization utilities
3. "Add a Domain" guide (`docs/guides/adding-a-domain.md`)
   - Step-by-step: define entities, predicates, sources
   - Example: how pharma was built
   - Checklist: what you need before first extraction
4. Second domain sketch (no implementation)
   - Draft `domains/sec.toml` or `domains/security.toml`
   - Identify where patterns differ from pharma
   - Document what would need to change
5. Integration with CLAUDE.md
   - Add ontology layer to "Find Your Guide" table
   - Document new crates and tools
   - Update specialized agents if needed
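One possible shape for the common source adapter interface is sketched below. `RawDocument`, `SourceAdapter`, `InMemoryAdapter`, and `collect` are hypothetical names chosen for this example; the actual abstraction should fall out of whatever the FDA, PubMed, and Reddit ingestors turn out to share.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Iterator


@dataclass(frozen=True)
class RawDocument:
    source_id: str
    text: str
    tier: int


class SourceAdapter(ABC):
    """Common interface: every adapter yields raw documents tagged with its tier."""

    tier: int

    @abstractmethod
    def fetch(self, query: str) -> Iterator[RawDocument]:
        ...


class InMemoryAdapter(SourceAdapter):
    """Test double standing in for the FDA, PubMed, or Reddit adapters."""

    def __init__(self, tier, documents):
        self.tier = tier
        self._documents = documents

    def fetch(self, query):
        # Case-insensitive substring match is a placeholder for a real API call.
        for source_id, text in self._documents.items():
            if query.lower() in text.lower():
                yield RawDocument(source_id, text, self.tier)


def collect(adapters, query):
    """Fetch from every adapter and order the results by tier, most authoritative first."""
    documents = [doc for adapter in adapters for doc in adapter.fetch(query)]
    return sorted(documents, key=lambda doc: doc.tier)
```

With an interface like this, the extraction and normalization stages never need to know which source a document came from, only its tier.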
Deliverables:
- Reusable ontology crate extracted from pharma-specific code
- Documentation for adding new domains
- Draft of second domain showing generalization
Foundation this enables:
- Other teams can add domains without core changes
- Clear separation of domain logic from storage logic
- Path to "platform" if that's the direction
Risks & Mitigations
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| LLM extraction quality varies | High | Medium | Start with structured sources (FDA), add human review loop |
| Ontology schema needs revision | High | Low | Keep Week 1 scope small, iterate based on Week 2-3 learnings |
| Drug name normalization is hard | Medium | Medium | Use existing resources (RxNorm), accept some manual mapping |
| Reddit rate limits | Medium | Low | Cache aggressively, use Reddit API properly |
| Scope creep into full consumer app | Medium | High | Stay focused on extraction + storage, not UX |
Success Criteria
Week 3 checkpoint: Can we ingest FDA labels + clinical trials and see real conflicts via SkepticLens?
Week 6 checkpoint: Is the ontology layer reusable enough that sketching a second domain feels like "filling in a template" rather than "starting over"?
End state:
# This should work and show meaningful data:
steme-pharma efficacy semaglutide type2-diabetes
# Output:
Subject: Semaglutide:Type2Diabetes
Predicate: hba1c_change_percent
Status: Contested (conflict_score: 0.42)
Claims:
-1.5% (45% weight) — NEJM Phase 3, n=1847
-1.2% (35% weight) — Lancet Phase 3, n=892
-0.8% (20% weight) — JAMA Observational, n=12000
Sources span Tier 0-2. 47 anecdotal reports (Tier 4) mention efficacy.
Open Questions
1. Where does the ontology crate live?
   - Option A: `applications/ontology/` (domain-agnostic utility)
   - Option B: `crates/stemedb-ontology/` (core crate)
   - Recommendation: Start in `applications/`, promote to `crates/` if it proves domain-agnostic
2. Python vs Rust for extractors?
   - Python: faster iteration, better LLM libraries, existing `latent/` code
   - Rust: type safety, integration with core
   - Recommendation: Python for extractors (they're ETL jobs), Rust for ontology validation
3. How do we handle ontology versioning?
   - Predicates may change as we learn
   - Old assertions might not conform to new schema
   - Recommendation: Version in domain definition, allow "legacy" predicates with deprecation flags
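The versioning recommendation could work roughly as follows. This is a sketch under assumptions: the `deprecated` and `replaced_by` fields and the legacy predicate name `a1c_delta` are invented for illustration and do not appear in the current domain definition.

```python
# Hypothetical future domain definition with one deprecated predicate.
DOMAIN = {
    "version": "0.2.0",
    "predicates": {
        "hba1c_change_percent": {"deprecated": False},
        "a1c_delta": {"deprecated": True, "replaced_by": "hba1c_change_percent"},
    },
}


def resolve_predicate(name, domain=DOMAIN):
    """Follow deprecation links; return (canonical_name, was_legacy)."""
    seen = set()
    was_legacy = False
    while True:
        if name in seen:
            raise ValueError(f"deprecation cycle at {name!r}")
        seen.add(name)
        entry = domain["predicates"].get(name)
        if entry is None:
            raise KeyError(f"unknown predicate: {name}")
        if not entry.get("deprecated"):
            return name, was_legacy
        was_legacy = True
        name = entry["replaced_by"]
```

Resolving legacy names at query time means old assertions stay untouched in storage while still colliding with claims written under the current predicate name.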
Next Steps
- Review this plan and adjust scope/timeline
- Decide on Question 1 (crate location)
- Start Week 1: domain definition schema