stemedb/.claude/skills/ontology-dev/SKILL.md

---
name: ontology-dev
description: Development guidelines for stemedb-ontology - the domain modeling layer for Episteme's claim extraction pipeline
---

# Ontology Development Skill

You are an expert stemedb-ontology developer. This crate defines **domain ontologies** that ensure claims from different sources collide correctly in Episteme. It handles the critical path from raw source data (FDA labels, clinical trials, etc.) to properly-structured assertions.

## Core Concept

The ontology defines how a claim's subject is built from its predicate category, so that claims about the same thing share a subject and any conflict between them surfaces as a collision:

| Category | Subject Pattern | Example | Why It Collides |
|----------|-----------------|---------|-----------------|
| Efficacy | `{Drug}:{Indication}` | `Semaglutide:Type2Diabetes` | Same drug+indication efficacy claims collide |
| Safety | `{Drug}` | `Semaglutide` | Safety applies to drug regardless of indication |
| Mechanism | `{Drug}:{Target}` | `Semaglutide:GLP1R` | Drug+target mechanism claims collide |
| Comparison | `{Drug}:{Comparator}:{Indication}` | `Semaglutide:Tirzepatide:Type2Diabetes` | Head-to-head comparisons collide |

**Source Tiers (Pharma Domain):**

| Tier | SourceClass | Label | Example |
|------|-------------|-------|---------|
| 0 | Regulatory | FDA/EMA | Drug labels, approval letters |
| 1 | Clinical | Phase 3 trials | SUSTAIN, STEP trials |
| 2 | Observational | Real-world | Claims databases, EHR studies |
| 3 | Expert | KOL opinion | Conference presentations |
| 4 | Informal | Social/anecdotal | Patient forums |
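
The tier table can be mirrored as a small enum. The sketch below is only an assumption about how `SourceClass` might look (the variant names follow the table, while `tier()` and `outranks()` are hypothetical helpers); the crate's real definition may differ.

```rust
/// Stand-in for the crate's `SourceClass`; variant names follow the tier table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SourceClass {
    Regulatory,
    Clinical,
    Observational,
    Expert,
    Informal,
}

impl SourceClass {
    /// Hypothetical helper: tier number from the table (0 = most authoritative).
    pub fn tier(self) -> u8 {
        match self {
            SourceClass::Regulatory => 0,
            SourceClass::Clinical => 1,
            SourceClass::Observational => 2,
            SourceClass::Expert => 3,
            SourceClass::Informal => 4,
        }
    }

    /// Hypothetical helper: true when `self` sits higher in the hierarchy than `other`.
    pub fn outranks(self, other: SourceClass) -> bool {
        self.tier() < other.tier()
    }
}
```

With this shape, `SourceClass::Regulatory.outranks(SourceClass::Clinical)` is true, which matches the ordering in the table.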

## Principles

### 1. Subject Patterns Determine Collision
Different predicates require different subject structures. Efficacy claims need `Drug:Indication` so that multiple sources reporting on the same drug+indication collide. Safety claims need just `Drug` so all safety info for a drug collides regardless of indication.
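
To make the pattern mechanics concrete, here is a minimal sketch of what filling a subject pattern could look like. `build_subject` is a hypothetical stand-in written for illustration, with `String` errors in place of `SubjectError`; production code should still call `SubjectBuilder::build`.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for `SubjectBuilder::build`: fills `{Name}` placeholders
/// in a pattern such as "{Drug}:{Indication}" from an entity map.
pub fn build_subject(
    pattern: &str,
    entities: &HashMap<String, String>,
) -> Result<String, String> {
    let mut parts = Vec::new();
    for segment in pattern.split(':') {
        // Each segment is expected to look like "{Drug}".
        let name = segment
            .strip_prefix('{')
            .and_then(|s| s.strip_suffix('}'))
            .ok_or_else(|| format!("malformed segment: {segment}"))?;
        // A missing entity is an error, never a silently shortened subject.
        let value = entities
            .get(name)
            .ok_or_else(|| format!("missing entity: {name}"))?;
        parts.push(value.as_str());
    }
    Ok(parts.join(":"))
}
```

Because the same entity map produces `Semaglutide:Type2Diabetes` under the efficacy pattern and `Semaglutide` under the safety pattern, the pattern alone decides which claims end up sharing a subject.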

### 2. Extractors Are Fallible
External APIs fail, return malformed data, or rate limit. Every extractor must:
- Return `Result<Vec<MedicalClaim>, ExtractError>`
- Handle HTTP timeouts, 429s, and parsing errors gracefully
- Include provenance (source_url, source_section, quote)

### 3. Domains Are Compiled-In
Type safety matters. Domains define entity types, predicate schemas, and source hierarchies at compile time. Adding a new domain means adding code, not configuration.

### 4. Validate Before Ingestion
Claims must be validated against the domain schema before becoming assertions. The `Validator` catches subject/predicate mismatches early.
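
The kind of check involved can be sketched as follows. Everything here is a simplified assumption: `Schema` is a cut-down mirror of `PredicateSchema`, and the error variants are illustrative rather than the crate's actual `ValidationError`.

```rust
/// Hypothetical, simplified mirror of `PredicateSchema` (only the fields used here).
pub struct Schema {
    pub subject_pattern: &'static str,
    pub predicates: &'static [&'static str],
}

/// Illustrative error type; the crate's `ValidationError` may have other variants.
#[derive(Debug, PartialEq)]
pub enum ValidationError {
    UnknownPredicate(String),
    SubjectMismatch { expected_parts: usize, found_parts: usize },
    EmptySubjectSegment,
    ConfidenceOutOfRange(f32),
}

/// Checks predicate membership, subject shape, and confidence range.
pub fn validate(
    schema: &Schema,
    predicate: &str,
    subject: &str,
    confidence: f32,
) -> Result<(), ValidationError> {
    if !schema.predicates.iter().any(|p| *p == predicate) {
        return Err(ValidationError::UnknownPredicate(predicate.to_string()));
    }
    // The subject must have as many ':'-separated parts as the pattern.
    let expected_parts = schema.subject_pattern.split(':').count();
    let parts: Vec<&str> = subject.split(':').collect();
    if parts.len() != expected_parts {
        return Err(ValidationError::SubjectMismatch {
            expected_parts,
            found_parts: parts.len(),
        });
    }
    if parts.iter().any(|p| p.is_empty()) {
        return Err(ValidationError::EmptySubjectSegment);
    }
    // NaN fails this check as well, since it is not contained in any range.
    if !(0.0..=1.0).contains(&confidence) {
        return Err(ValidationError::ConfidenceOutOfRange(confidence));
    }
    Ok(())
}
```

An efficacy predicate paired with a bare `Semaglutide` subject fails with `SubjectMismatch`, which is exactly the class of error that should be caught before a claim is signed.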

### 5. Confidence Is Required
Every claim needs a confidence score (0.0-1.0). Extractors must estimate extraction quality based on source clarity, parsing confidence, and quote specificity.
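
One possible way to turn this into code is a small heuristic. The sketch below is an assumption, not the crate's method: `ExtractionKind` and `estimate_confidence` are hypothetical names, and the base scores are starting points chosen to fall inside the bands given in the Do list (0.9+ for exact quotes, 0.6-0.8 for inferred values, below 0.6 when uncertain).

```rust
/// Hypothetical classification of how a value was obtained from the source.
#[derive(Debug, Clone, Copy)]
pub enum ExtractionKind {
    /// Value copied verbatim from a structured field or direct quote.
    ExactQuote,
    /// Value derived from prose.
    Inferred,
    /// Ambiguous wording or a partial match.
    Uncertain,
}

/// Returns a confidence score in 0.0..=1.0 for one extracted claim.
pub fn estimate_confidence(kind: ExtractionKind, quote: &str) -> f32 {
    let base: f32 = match kind {
        ExtractionKind::ExactQuote => 0.9,
        ExtractionKind::Inferred => 0.7,
        ExtractionKind::Uncertain => 0.4,
    };
    // Assumption: a claim without a supporting quote cannot be better than "uncertain".
    let score = if quote.trim().is_empty() { base.min(0.5) } else { base };
    score.clamp(0.0, 1.0)
}
```

Keeping the heuristic in one function makes it easy to review and adjust per extractor as source quirks are discovered.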

## Architecture

```
┌─────────────────────────────────────────────────────────────┐
│              stemedb-ontology Pipeline                      │
├─────────────────────────────────────────────────────────────┤
│  1. DEFINE DOMAIN                                           │
│     Domain::new("Pharma")                                   │
│       .with_entity_type("Drug", ...)                        │
│       .with_predicate_schema("efficacy", ...)               │
│       .with_source_hierarchy(...)                           │
│                                                             │
│  2. EXTRACT CLAIMS                                          │
│     FdaLabelExtractor::extract(&SourceInput::DrugName(...)) │
│     → Vec<MedicalClaim>                                     │
│                                                             │
│  3. BUILD SUBJECTS                                          │
│     SubjectBuilder::build(&schema, &entities)               │
│     → "Semaglutide:Type2Diabetes"                           │
│                                                             │
│  4. VALIDATE                                                │
│     Validator::new(&domain).validate(pred, subj, conf)      │
│     → Ok(()) or ValidationError                             │
│                                                             │
│  5. TO ASSERTION                                            │
│     claim.to_assertion(&signing_key, agent_id, &hlc)        │
│     → Assertion (signed, ready for ingestion)               │
│                                                             │
│  6. SUBMIT TO STEMEDB                                       │
│     StemeClient::assert(&assertion)                         │
│     → AssertionHash                                         │
└─────────────────────────────────────────────────────────────┘
```

## Key Modules

| Module | Purpose | Key Types |
|--------|---------|-----------|
| `domain.rs` | Domain, entity types, predicate schemas | `Domain`, `PredicateSchema`, `SourceTier` |
| `subject.rs` | Subject construction and parsing | `SubjectBuilder`, `SubjectError` |
| `validator.rs` | Claim validation against schemas | `Validator`, `ValidationError` |
| `pharma/` | Pharmaceutical domain definition | `definition()`, `GLP1_DRUGS` |
| `pharma/extractors/` | Medical data extractors | `MedicalExtractor`, `FdaLabelExtractor` |
| `client.rs` | HTTP client for StemeDB | `StemeClient`, `ClientError` |

## Key Types

```rust
/// A predicate schema defines subject patterns for a category
pub struct PredicateSchema {
    pub description: String,
    pub subject_pattern: String,  // e.g., "{Drug}:{Indication}"
    pub predicates: Vec<String>,  // e.g., ["hba1c_reduction", "weight_loss"]
    pub default_lens: DefaultLens,
    pub required_entities: Vec<String>,  // Extracted from pattern
}

/// Trait for medical data extractors
#[async_trait]
pub trait MedicalExtractor: Send + Sync {
    fn name(&self) -> &str;
    fn source_class(&self) -> SourceClass;
    async fn extract(&self, source: &SourceInput) -> Result<Vec<MedicalClaim>, ExtractError>;
    fn can_handle(&self, source: &SourceInput) -> bool;
}

/// Intermediate format between raw source and assertions
pub struct MedicalClaim {
    pub subject: String,
    pub predicate: String,
    pub value: ObjectValue,
    pub confidence: f32,
    pub source_url: String,
    pub source_section: String,
    pub quote: String,
    pub source_class: SourceClass,
    pub metadata: Option<serde_json::Value>,
}

/// Build subjects from schemas and entities
impl SubjectBuilder {
    // Build: SubjectBuilder::build(&schema, &entities) → "Semaglutide:Type2Diabetes"
    pub fn build(schema: &PredicateSchema, entities: &HashMap<String, String>)
        -> Result<String, SubjectError>;

    // Parse: SubjectBuilder::parse(&schema, subject) → {Drug: "Semaglutide", ...}
    pub fn parse(schema: &PredicateSchema, subject: &str)
        -> Result<HashMap<String, String>, SubjectError>;
}
```
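
The `required_entities` field is described as extracted from the pattern. A minimal sketch of such an extraction, assuming placeholders are always written as `{Name}`, might look like this; the free function `required_entities` is hypothetical and not part of the crate's API.

```rust
/// Hypothetical helper: lists the placeholder names in a subject pattern,
/// e.g. "{Drug}:{Indication}" yields ["Drug", "Indication"].
pub fn required_entities(pattern: &str) -> Vec<String> {
    let mut names = Vec::new();
    let mut rest = pattern;
    while let Some(open) = rest.find('{') {
        let after_open = &rest[open + 1..];
        match after_open.find('}') {
            Some(close) => {
                names.push(after_open[..close].to_string());
                rest = &after_open[close + 1..];
            }
            // Unterminated placeholder: stop scanning rather than panic.
            None => break,
        }
    }
    names
}
```

Deriving the list from the pattern, instead of maintaining it by hand, keeps the two fields from drifting apart when a pattern changes.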

## Step Back: Before Implementing

Before writing code, challenge your assumptions:

### 1. Is This a New Domain or Extending Pharma?
> "Am I defining entity types/predicates for a new vertical, or adding to pharma?"

- New domains: Create `src/{domain}/mod.rs`, `src/{domain}/definition.rs`
- Pharma extensions: Add to `src/pharma/definition.rs`
- New extractors for pharma: Add to `src/pharma/extractors/`

### 2. Does My Subject Pattern Enable Correct Collision?
> "Will claims that SHOULD conflict have the same subject?"

- Efficacy claims for same drug+indication MUST collide
- Safety claims for same drug MUST collide regardless of indication
- Think about what "same thing" means for your predicate category
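
A quick way to test a pattern is to group sample claims by subject and predicate and see which groups disagree. The sketch below uses a hypothetical, trimmed-down `Claim` type and a hypothetical `find_conflicts` function; it only illustrates the collision idea and is not how StemeDB resolves conflicts.

```rust
use std::collections::BTreeMap;

/// Hypothetical, trimmed-down claim: only the fields needed to show collision.
pub struct Claim {
    pub subject: String,
    pub predicate: String,
    pub value: String,
    pub source: String,
}

impl Claim {
    pub fn new(subject: &str, predicate: &str, value: &str, source: &str) -> Self {
        Claim {
            subject: subject.to_string(),
            predicate: predicate.to_string(),
            value: value.to_string(),
            source: source.to_string(),
        }
    }
}

/// Groups claims by (subject, predicate) and returns the keys that carry
/// more than one distinct value, i.e. the places where sources disagree.
pub fn find_conflicts(claims: &[Claim]) -> Vec<(String, String)> {
    let mut groups: BTreeMap<(String, String), Vec<&str>> = BTreeMap::new();
    for c in claims {
        let values = groups
            .entry((c.subject.clone(), c.predicate.clone()))
            .or_default();
        if !values.contains(&c.value.as_str()) {
            values.push(c.value.as_str());
        }
    }
    groups
        .into_iter()
        .filter(|(_, values)| values.len() > 1)
        .map(|(key, _)| key)
        .collect()
}
```

If two sources report different values for the same drug and indication but one subject was built as `Semaglutide` and the other as `Semaglutide:Type2Diabetes`, they land in different groups and the disagreement goes unnoticed. That is the failure a correct subject pattern prevents.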

### 3. What Source Class Is This?
> "What tier is my data source in the authority hierarchy?"

- Regulatory (FDA, EMA): `SourceClass::Regulatory` - authoritative, long decay
- Clinical trials: `SourceClass::Clinical` - high quality, moderate decay
- Observational: `SourceClass::Observational` - real-world, faster decay
- Don't over-rank - observational data is NOT clinical-grade

### 4. Will Extraction Fail Gracefully?
> "What happens when the FDA API is down or returns garbage?"

- HTTP errors → `ExtractError::Http`
- No results → `ExtractError::NotFound`
- Rate limiting → `ExtractError::RateLimited`
- Never panic on external data
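
The mapping above can be sketched without any HTTP library. The variant names `Http`, `NotFound`, and `RateLimited` come from the list; their payloads, the `Parse` variant, and both helper functions are assumptions made for illustration, as is treating a 404 as "no results".

```rust
use std::time::Duration;

/// Hypothetical shape for `ExtractError`; payloads are assumptions.
#[derive(Debug, PartialEq)]
pub enum ExtractError {
    Http(u16),
    NotFound,
    RateLimited { retry_after: Option<Duration> },
    Parse(String),
}

/// Maps an HTTP status (plus an optional Retry-After value in seconds)
/// to success or a specific error variant.
pub fn classify_status(status: u16, retry_after_secs: Option<u64>) -> Result<(), ExtractError> {
    match status {
        200..=299 => Ok(()),
        // Assumption: the API signals "no results" with a 404.
        404 => Err(ExtractError::NotFound),
        429 => Err(ExtractError::RateLimited {
            retry_after: retry_after_secs.map(Duration::from_secs),
        }),
        other => Err(ExtractError::Http(other)),
    }
}

/// Parses a percentage field that may be missing or malformed, without panicking.
pub fn parse_percent(raw: Option<&str>) -> Result<f32, ExtractError> {
    let text = raw.ok_or_else(|| ExtractError::Parse("missing field".to_string()))?;
    text.trim()
        .trim_end_matches('%')
        .parse::<f32>()
        .map_err(|e| ExtractError::Parse(format!("bad number {text:?}: {e}")))
}
```

Both helpers return errors instead of calling `unwrap()`, so a missing or garbled field becomes an `ExtractError` the caller can log and skip.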

**After stepping back:** Trace through the pipeline in `pharma/extractors/fda.rs` to see how it handles edge cases.

## Do

1. **Use `SubjectBuilder` for all subject construction.** Never concatenate strings manually.
2. **Include provenance in every `MedicalClaim`.** `source_url`, `source_section`, `quote` are required.
3. **Estimate confidence honestly.** Exact quotes = high (0.9+), inferred = medium (0.6-0.8), uncertain = low (<0.6).
4. **Handle all HTTP error cases.** Timeout, 429, 500, malformed JSON.
5. **Normalize entity names via alias tables.** "Ozempic" → "Semaglutide".
6. **Validate claims before assertion conversion.** Use `Validator::new(&domain).validate()`.
7. **Use `#[instrument]` on extractor methods.** Critical for debugging failed extractions.
8. **Add tests for new extractors with mock HTTP.** Use `mockito` or similar.
9. **Document source API quirks.** Rate limits, pagination, field meanings.
10. **Keep extractors focused.** One source = one extractor. Don't combine FDA + PubMed.

## Do Not

1. **Use `unwrap()` or `expect()` in extractors.** External data is untrusted.
2. **Hardcode subject strings.** Always use `SubjectBuilder::build()`.
3. **Skip the source_hash.** Provenance must be hashable for deduplication.
4. **Mix source classes.** FDA labels are Regulatory, not Clinical.
5. **Forget async/Send+Sync bounds.** Extractors run in async contexts.
6. **Trust external API field names.** APIs change; handle missing fields.
7. **Over-promise confidence.** If you're parsing prose, confidence < 0.9.
8. **Block on external APIs without timeout.** Default 30s, configurable.
9. **Skip the metadata field.** Store API version, date accessed, NDC codes.
10. **Commit without running `cargo test -p stemedb-ontology`.** Extractor tests catch regressions.

## Decision Points

### Adding a New Extractor

Stop. Questions:
- What source class does this data belong to?
- What predicates can you reliably extract?
- What subject pattern do those predicates use?
- How do you handle rate limiting and failures?
- What provenance fields can you populate?

### Adding a New Predicate Category

Stop. Questions:
- What entities make up the subject? (`{Drug}` vs `{Drug}:{Indication}`)
- What existing predicates belong in this category?
- What's the default lens (Recency, Consensus, Authority)?
- Do subjects built with this pattern collide correctly?

### Adding a New Domain

Stop. Questions:
- What entity types exist? (Drug, Gene, Company, Security, etc.)
- What are the predicate categories and their subject patterns?
- What's the source hierarchy for this vertical?
- Who will build extractors for this domain?

## Constraints

**NEVER:**
- Use `unwrap()` or `expect()` in production code
- Manually concatenate subject strings
- Trust external API responses without validation
- Block indefinitely on HTTP requests
- Skip provenance fields (source_url, quote)
- Mutate existing assertions (append-only)

**ALWAYS:**
- Use `SubjectBuilder::build()` for subjects
- Include meaningful error context in `ExtractError`
- Run `cargo clippy --workspace -- -D warnings` before commit
- Add tests for new extractor logic
- Document API rate limits and quirks
- Use `#[instrument]` on public extractor methods

## Testing Commands

```bash
# Full ontology test suite
cargo test -p stemedb-ontology

# Run specific extractor tests
cargo test -p stemedb-ontology fda

# Run with logging
RUST_LOG=stemedb_ontology=debug cargo test -p stemedb-ontology

# Lint check
cargo clippy -p stemedb-ontology -- -D warnings

# Format check
cargo fmt -p stemedb-ontology --check

# Run the FDA CLI tool (if available)
cargo run -p stemedb-ontology --bin fda-lookup -- semaglutide
```

## Common Workflows

### Adding a New Extractor

1. Create `src/pharma/extractors/{source}.rs`
2. Implement `MedicalExtractor` trait:
   - `name()` → human-readable identifier
   - `source_class()` → tier for this source
   - `can_handle()` → which `SourceInput` variants work
   - `extract()` → async extraction with proper error handling
3. Add HTTP client with timeout and retry logic
4. Re-export from `src/pharma/extractors/mod.rs`
5. Add tests with mock HTTP responses
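
The retry logic in step 3 can be sketched independently of any HTTP client. This is a synchronous, std-only illustration under stated assumptions: `FetchError` and `retry_with_backoff` are hypothetical names, and a real extractor would use an async sleep and the crate's own `ExtractError` instead.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical error split: transient failures (timeouts, 429, 5xx) are
/// worth retrying, permanent ones (404, malformed payloads) are not.
#[derive(Debug, PartialEq)]
pub enum FetchError {
    Transient(String),
    Permanent(String),
}

/// Calls `op` until it succeeds, fails permanently, or `max_attempts` is
/// reached. `op` receives the zero-based attempt number and always runs
/// at least once.
pub fn retry_with_backoff<T, F>(
    max_attempts: u32,
    base_delay: Duration,
    mut op: F,
) -> Result<T, FetchError>
where
    F: FnMut(u32) -> Result<T, FetchError>,
{
    let mut attempt = 0;
    loop {
        match op(attempt) {
            Ok(value) => return Ok(value),
            Err(FetchError::Transient(_)) if attempt + 1 < max_attempts => {
                // Exponential backoff: base, 2 * base, 4 * base, ...
                // (fine for small attempt counts; large ones would overflow).
                sleep(base_delay * 2u32.pow(attempt));
                attempt += 1;
            }
            // Permanent errors and exhausted retries are returned as-is.
            Err(e) => return Err(e),
        }
    }
}
```

Passing the operation as a closure keeps the retry policy testable with a fake that fails a fixed number of times, which fits the mock-HTTP tests in step 5.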

### Adding a New Predicate to Pharma

1. Determine which schema category it belongs to (efficacy, safety, mechanism)
2. Add to the `predicates` list in `src/pharma/definition.rs`
3. If new category needed:
   - Create new `PredicateSchema` with appropriate `subject_pattern`
   - Add via `.with_predicate_schema()`
4. Update extractor to populate the new predicate
5. Add test case validating subject pattern

### Debugging Extraction Failures

1. Run with `RUST_LOG=stemedb_ontology=debug`
2. Check for HTTP errors (timeout, 429, 500)
3. Verify API response matches expected schema
4. Check if provenance fields are populated
5. Validate subject pattern matches schema
6. Inspect `ExtractError` variant and context

### Using StemeClient

```rust
use stemedb_ontology::client::StemeClient;
use stemedb_ontology::pharma::extractors::{FdaLabelExtractor, MedicalExtractor, SourceInput};

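// Assumes an async context that returns a `Result`, with `signing_key`,
// `agent_id`, and `hlc` already in scope.
let client = StemeClient::new("http://localhost:18180");
let extractor = FdaLabelExtractor::new();

// Extract claims from FDA
let claims = extractor.extract(&SourceInput::DrugName("semaglutide".into())).await?;

// Convert and submit (validate each claim against the domain first)
for claim in claims {
    let assertion = claim.to_assertion(&signing_key, agent_id, &hlc);
    // `_hash` is the `AssertionHash` returned by StemeDB
    let _hash = client.assert(&assertion).await?;
}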

// Query with skeptic lens (shows all conflicts)
let response = client.skeptic("Semaglutide:Type2Diabetes", "hba1c_change_percent").await?;
```

## Output Format

When implementing features or fixing bugs, provide:

```
## Summary
[One-line description]

## Changes
- [File]: [What changed]

## Testing
- [How to verify]

## Impact
- [Subject patterns affected, if any]
- [Extractors affected, if any]
```

## Domain Status Reference

| Domain | Status | Extractors |
|--------|--------|------------|
| Pharma | Active | FDA Labels |
| Finance | Planned | - |
| Consumer Health | Planned | - |