---
name: ontology-dev
description: Development guidelines for stemedb-ontology - the domain modeling layer for Episteme's claim extraction pipeline
---

# Ontology Development Skill

You are an expert stemedb-ontology developer. This crate defines **domain ontologies** that ensure claims from different sources collide correctly in Episteme. It handles the critical path from raw source data (FDA labels, clinical trials, etc.) to properly-structured assertions.

## Core Concept

Ontology defines how subjects are built based on predicate type, ensuring conflicts collide:

| Category | Subject Pattern | Example | Why It Collides |
|----------|-----------------|---------|-----------------|
| Efficacy | `{Drug}:{Indication}` | `Semaglutide:Type2Diabetes` | Same drug+indication efficacy claims collide |
| Safety | `{Drug}` | `Semaglutide` | Safety applies to drug regardless of indication |
| Mechanism | `{Drug}:{Target}` | `Semaglutide:GLP1R` | Drug+target mechanism claims collide |
| Comparison | `{Drug}:{Comparator}:{Indication}` | `Semaglutide:Tirzepatide:T2D` | Head-to-head comparisons collide |

**Source Tiers (Pharma Domain):**

| Tier | SourceClass | Label | Example |
|------|-------------|-------|---------|
| 0 | Regulatory | FDA/EMA | Drug labels, approval letters |
| 1 | Clinical | Phase 3 trials | SUSTAIN, STEP trials |
| 2 | Observational | Real-world | Claims databases, EHR studies |
| 3 | Expert | KOL opinion | Conference presentations |
| 4 | Informal | Social/anecdotal | Patient forums |

## Principles

### 1. Subject Patterns Determine Collision

Different predicates require different subject structures. Efficacy claims need `Drug:Indication` so that multiple sources reporting on the same drug+indication collide. Safety claims need just `Drug` so all safety info for a drug collides regardless of indication.

### 2. Extractors Are Fallible

External APIs fail, return malformed data, or rate limit. Every extractor must:

- Return `Result<Vec<MedicalClaim>, ExtractError>`
- Handle HTTP timeouts, 429s, and parsing errors gracefully
- Include provenance (source_url, source_section, quote)

### 3. Domains Are Compiled-In

Type safety matters. Domains define entity types, predicate schemas, and source hierarchies at compile time. Adding a new domain means adding code, not configuration.

### 4. Validate Before Ingestion

Claims must be validated against the domain schema before becoming assertions. The `Validator` catches subject/predicate mismatches early.

### 5. Confidence Is Required

Every claim needs a confidence score (0.0-1.0). Extractors must estimate extraction quality based on source clarity, parsing confidence, and quote specificity.

## Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                  stemedb-ontology Pipeline                  │
├─────────────────────────────────────────────────────────────┤
│  1. DEFINE DOMAIN                                           │
│     Domain::new("Pharma")                                   │
│       .with_entity_type("Drug", ...)                        │
│       .with_predicate_schema("efficacy", ...)               │
│       .with_source_hierarchy(...)                           │
│                                                             │
│  2. EXTRACT CLAIMS                                          │
│     FdaLabelExtractor::extract(&SourceInput::DrugName(...)) │
│     → Vec<MedicalClaim>                                     │
│                                                             │
│  3. BUILD SUBJECTS                                          │
│     SubjectBuilder::build(&schema, &entities)               │
│     → "Semaglutide:Type2Diabetes"                           │
│                                                             │
│  4. VALIDATE                                                │
│     Validator::new(&domain).validate(pred, subj, conf)      │
│     → Ok(()) or ValidationError                             │
│                                                             │
│  5. TO ASSERTION                                            │
│     claim.to_assertion(&signing_key, agent_id, &hlc)        │
│     → Assertion (signed, ready for ingestion)               │
│                                                             │
│  6. SUBMIT TO STEMEDB                                       │
│     StemeClient::assert(&assertion)                         │
│     → AssertionHash                                         │
└─────────────────────────────────────────────────────────────┘
```

## Key Modules

| Module | Purpose | Key Types |
|--------|---------|-----------|
| `domain.rs` | Domain, entity types, predicate schemas | `Domain`, `PredicateSchema`, `SourceTier` |
| `subject.rs` | Subject construction and parsing | `SubjectBuilder`, `SubjectError` |
| `validator.rs` | Claim validation against schemas | `Validator`, `ValidationError` |
| `pharma/` | Pharmaceutical domain definition | `definition()`, `GLP1_DRUGS` |
| `pharma/extractors/` | Medical data extractors | `MedicalExtractor`, `FdaLabelExtractor` |
| `client.rs` | HTTP client for StemeDB | `StemeClient`, `ClientError` |

## Key Types

```rust
/// A predicate schema defines subject patterns for a category
pub struct PredicateSchema {
    pub description: String,
    pub subject_pattern: String,        // e.g., "{Drug}:{Indication}"
    pub predicates: Vec<String>,        // e.g., ["hba1c_reduction", "weight_loss"]
    pub default_lens: DefaultLens,
    pub required_entities: Vec<String>, // Extracted from pattern
}

/// Trait for medical data extractors
#[async_trait]
pub trait MedicalExtractor: Send + Sync {
    fn name(&self) -> &str;
    fn source_class(&self) -> SourceClass;
    async fn extract(&self, source: &SourceInput) -> Result<Vec<MedicalClaim>, ExtractError>;
    fn can_handle(&self, source: &SourceInput) -> bool;
}

/// Intermediate format between raw source and assertions
pub struct MedicalClaim {
    pub subject: String,
    pub predicate: String,
    pub value: ObjectValue,
    pub confidence: f32,
    pub source_url: String,
    pub source_section: String,
    pub quote: String,
    pub source_class: SourceClass,
    pub metadata: Option<serde_json::Value>, // inner type is an assumption; check domain code
}

/// Build subjects from schemas and entities
impl SubjectBuilder {
    // Build: SubjectBuilder::build(&schema, &entities) → "Semaglutide:Type2Diabetes"
    pub fn build(
        schema: &PredicateSchema,
        entities: &HashMap<String, String>,
    ) -> Result<String, SubjectError>;

    // Parse: SubjectBuilder::parse(&schema, subject) → {Drug: "Semaglutide", ...}
    pub fn parse(
        schema: &PredicateSchema,
        subject: &str,
    ) -> Result<HashMap<String, String>, SubjectError>;
}
```

## Step Back: Before Implementing

Before writing code, challenge your assumptions:

### 1. Is This a New Domain or Extending Pharma?

> "Am I defining entity types/predicates for a new vertical, or adding to pharma?"

- New domains: Create `src/{domain}/mod.rs`, `src/{domain}/definition.rs`
- Pharma extensions: Add to `src/pharma/definition.rs`
- New extractors for pharma: Add to `src/pharma/extractors/`

### 2. Does My Subject Pattern Enable Correct Collision?

> "Will claims that SHOULD conflict have the same subject?"

- Efficacy claims for same drug+indication MUST collide
- Safety claims for same drug MUST collide regardless of indication
- Think about what "same thing" means for your predicate category

### 3. What Source Class Is This?

> "What tier is my data source in the authority hierarchy?"

- Regulatory (FDA, EMA): `SourceClass::Regulatory` - authoritative, long decay
- Clinical trials: `SourceClass::Clinical` - high quality, moderate decay
- Observational: `SourceClass::Observational` - real-world, faster decay
- Don't over-rank - observational data is NOT clinical-grade

### 4. Will Extraction Fail Gracefully?

> "What happens when the FDA API is down or returns garbage?"

- HTTP errors → `ExtractError::Http`
- No results → `ExtractError::NotFound`
- Rate limiting → `ExtractError::RateLimited`
- Never panic on external data

**After step back:** Trace through the pipeline in `pharma/extractors/fda.rs` to see how it handles edge cases.

## Do

1. **Use `SubjectBuilder` for all subject construction.** Never concatenate strings manually.
2. **Include provenance in every `MedicalClaim`.** `source_url`, `source_section`, `quote` are required.
3. **Estimate confidence honestly.** Exact quotes = high (0.9+), inferred = medium (0.6-0.8), uncertain = low (<0.6).
4. **Handle all HTTP error cases.** Timeout, 429, 500, malformed JSON.
5. **Normalize entity names via alias tables.** "Ozempic" → "Semaglutide".
6. **Validate claims before assertion conversion.** Use `Validator::new(&domain).validate()`.
7. **Use `#[instrument]` on extractor methods.** Critical for debugging failed extractions.
8. **Add tests for new extractors with mock HTTP.** Use `mockito` or similar.
9. **Document source API quirks.** Rate limits, pagination, field meanings.
10. **Keep extractors focused.** One source = one extractor. Don't combine FDA + PubMed.

## Do Not

1. **Use `unwrap()` or `expect()` in extractors.** External data is untrusted.
2. **Hardcode subject strings.** Always use `SubjectBuilder::build()`.
3. **Skip the source_hash.** Provenance must be hashable for deduplication.
4. **Mix source classes.** FDA labels are Regulatory, not Clinical.
5. **Forget async/Send+Sync bounds.** Extractors run in async contexts.
6. **Trust external API field names.** APIs change; handle missing fields.
7. **Over-promise confidence.** If you're parsing prose, confidence < 0.9.
8. **Block on external APIs without timeout.** Default 30s, configurable.
9. **Skip the metadata field.** Store API version, date accessed, NDC codes.
10. **Commit without running `cargo test -p stemedb-ontology`.** Extractor tests catch regressions.

## Decision Points

### Adding a New Extractor

Stop. Questions:

- What source class does this data belong to?
- What predicates can you reliably extract?
- What subject pattern do those predicates use?
- How do you handle rate limiting and failures?
- What provenance fields can you populate?

### Adding a New Predicate Category

Stop. Questions:

- What entities make up the subject? (`{Drug}` vs `{Drug}:{Indication}`)
- What existing predicates belong in this category?
- What's the default lens (Recency, Consensus, Authority)?
- Do subjects built with this pattern collide correctly?

### Adding a New Domain

Stop. Questions:

- What entity types exist? (Drug, Gene, Company, Security, etc.)
- What are the predicate categories and their subject patterns?
- What's the source hierarchy for this vertical?
- Who will build extractors for this domain?

## Constraints

**NEVER:**

- Use `unwrap()` or `expect()` in production code
- Manually concatenate subject strings
- Trust external API responses without validation
- Block indefinitely on HTTP requests
- Skip provenance fields (source_url, quote)
- Mutate existing assertions (append-only)

**ALWAYS:**

- Use `SubjectBuilder::build()` for subjects
- Include meaningful error context in `ExtractError`
- Run `cargo clippy --workspace -- -D warnings` before commit
- Add tests for new extractor logic
- Document API rate limits and quirks
- Use `#[instrument]` on public extractor methods

## Testing Commands

```bash
# Full ontology test suite
cargo test -p stemedb-ontology

# Run specific extractor tests
cargo test -p stemedb-ontology fda

# Run with logging
RUST_LOG=stemedb_ontology=debug cargo test -p stemedb-ontology

# Lint check
cargo clippy -p stemedb-ontology -- -D warnings

# Format check
cargo fmt -p stemedb-ontology --check

# Run the FDA CLI tool (if available)
cargo run -p stemedb-ontology --bin fda-lookup -- semaglutide
```

## Common Workflows

### Adding a New Extractor

1. Create `src/pharma/extractors/{source}.rs`
2. Implement `MedicalExtractor` trait:
   - `name()` → human-readable identifier
   - `source_class()` → tier for this source
   - `can_handle()` → which `SourceInput` variants work
   - `extract()` → async extraction with proper error handling
3. Add HTTP client with timeout and retry logic
4. Re-export from `src/pharma/extractors/mod.rs`
5. Add tests with mock HTTP responses

### Adding a New Predicate to Pharma

1. Determine which schema category it belongs to (efficacy, safety, mechanism)
2. Add to the `predicates` list in `src/pharma/definition.rs`
3. If new category needed:
   - Create new `PredicateSchema` with appropriate `subject_pattern`
   - Add via `.with_predicate_schema()`
4. Update extractor to populate the new predicate
5. Add test case validating subject pattern

### Debugging Extraction Failures

1. Run with `RUST_LOG=stemedb_ontology=debug`
2. Check for HTTP errors (timeout, 429, 500)
3. Verify API response matches expected schema
4. Check if provenance fields are populated
5. Validate subject pattern matches schema
6. Inspect `ExtractError` variant and context

### Using StemeClient

```rust
use stemedb_ontology::client::StemeClient;
use stemedb_ontology::pharma::extractors::{FdaLabelExtractor, MedicalExtractor, SourceInput};

// Runs inside an async fn; `signing_key`, `agent_id`, and `hlc` are assumed to be in scope.
let client = StemeClient::new("http://localhost:18180");
let extractor = FdaLabelExtractor::new();

// Extract claims from FDA
let claims = extractor
    .extract(&SourceInput::DrugName("semaglutide".into()))
    .await?;

// Convert and submit
for claim in claims {
    let assertion = claim.to_assertion(&signing_key, agent_id, &hlc);
    let hash = client.assert(&assertion).await?;
}

// Query with skeptic lens (shows all conflicts)
let response = client
    .skeptic("Semaglutide:Type2Diabetes", "hba1c_change_percent")
    .await?;
```

## Output Format

When implementing features or fixing bugs, provide:

```
## Summary
[One-line description]

## Changes
- [File]: [What changed]

## Testing
- [How to verify]

## Impact
- [Subject patterns affected, if any]
- [Extractors affected, if any]
```

## Domain Status Reference

| Domain | Status | Extractors |
|--------|--------|------------|
| Pharma | Active | FDA Labels |
| Finance | Planned | - |
| Consumer Health | Planned | - |
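
## Illustrative Sketch: Pattern-Based Subject Construction

The collision rule from Core Concept (claims collide when their subjects are identical strings) can be illustrated with a small, standalone sketch. This is not the crate's `SubjectBuilder`: the names `build_subject`, `parse_subject`, `required_entities`, and `SketchSubjectError` are hypothetical, and the rule that entity values may not contain `:` is an assumption made here so that parsing stays unambiguous.

```rust
use std::collections::HashMap;

/// Hypothetical error type for this sketch; the crate's `SubjectError` may differ.
#[derive(Debug, PartialEq)]
pub enum SketchSubjectError {
    /// The pattern names an entity that was not supplied.
    MissingEntity(String),
    /// The supplied value is empty or contains the ':' delimiter (assumed rule).
    InvalidValue(String),
    /// The subject has a different number of segments than the pattern.
    ArityMismatch { expected: usize, found: usize },
}

/// Extract placeholder names from a pattern such as "{Drug}:{Indication}".
pub fn required_entities(pattern: &str) -> Vec<String> {
    pattern
        .split(':')
        .map(|part| {
            part.trim_start_matches('{')
                .trim_end_matches('}')
                .to_string()
        })
        .collect()
}

/// Expand a pattern into a subject string using the supplied entities.
/// Entities not named by the pattern are ignored.
pub fn build_subject(
    pattern: &str,
    entities: &HashMap<String, String>,
) -> Result<String, SketchSubjectError> {
    let mut parts: Vec<&str> = Vec::new();
    for name in required_entities(pattern) {
        let value = entities
            .get(&name)
            .ok_or_else(|| SketchSubjectError::MissingEntity(name.clone()))?;
        if value.is_empty() || value.contains(':') {
            return Err(SketchSubjectError::InvalidValue(name));
        }
        parts.push(value.as_str());
    }
    Ok(parts.join(":"))
}

/// Split a subject back into named entities according to the pattern.
pub fn parse_subject(
    pattern: &str,
    subject: &str,
) -> Result<HashMap<String, String>, SketchSubjectError> {
    let names = required_entities(pattern);
    let values: Vec<&str> = subject.split(':').collect();
    if names.len() != values.len() {
        return Err(SketchSubjectError::ArityMismatch {
            expected: names.len(),
            found: values.len(),
        });
    }
    Ok(names
        .into_iter()
        .zip(values.into_iter().map(String::from))
        .collect())
}
```

With entities `Drug = "Semaglutide"` and `Indication = "Type2Diabetes"`, the efficacy pattern `{Drug}:{Indication}` yields `Semaglutide:Type2Diabetes`, while the safety pattern `{Drug}` yields `Semaglutide` and ignores the indication. Two sources that supply the same entities for the same category therefore produce the same subject string, which is what allows their claims to collide. In real code, use `SubjectBuilder::build()` rather than a helper like this.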