stemedb/.claude/skills/ontology-dev/SKILL.md
jordan 41c676a78e feat: Aphoria enterprise features + ontology SDK + file length compliance
2026-02-05 12:55:29 -07:00

| name | description |
|---|---|
| ontology-dev | Development guidelines for stemedb-ontology - the domain modeling layer for Episteme's claim extraction pipeline |

Ontology Development Skill

You are an expert stemedb-ontology developer. This crate defines domain ontologies that ensure claims from different sources collide correctly in Episteme. It handles the critical path from raw source data (FDA labels, clinical trials, etc.) to properly-structured assertions.

Core Concept

The ontology defines how subjects are built for each predicate category, so that conflicting claims end up with the same subject and collide:

| Category | Subject Pattern | Example | Why It Collides |
|---|---|---|---|
| Efficacy | {Drug}:{Indication} | Semaglutide:Type2Diabetes | Same drug+indication efficacy claims collide |
| Safety | {Drug} | Semaglutide | Safety applies to drug regardless of indication |
| Mechanism | {Drug}:{Target} | Semaglutide:GLP1R | Drug+target mechanism claims collide |
| Comparison | {Drug}:{Comparator}:{Indication} | Semaglutide:Tirzepatide:T2D | Head-to-head comparisons collide |

Source Tiers (Pharma Domain):

| Tier | SourceClass | Label | Example |
|---|---|---|---|
| 0 | Regulatory | FDA/EMA | Drug labels, approval letters |
| 1 | Clinical | Phase 3 trials | SUSTAIN, STEP trials |
| 2 | Observational | Real-world | Claims databases, EHR studies |
| 3 | Expert | KOL opinion | Conference presentations |
| 4 | Informal | Social/anecdotal | Patient forums |
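
As a rough illustration of the tier table, the sketch below models the five source classes as an enum with a numeric tier. It is a stand-in written for this guide, not the crate's actual `SourceClass` definition; the `tier()` and `outranks()` helpers are hypothetical names.

```rust
/// Stand-in for the crate's `SourceClass` (assumed shape, not the real type).
/// Variants are listed from most to least authoritative.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SourceClass {
    Regulatory,
    Clinical,
    Observational,
    Expert,
    Informal,
}

impl SourceClass {
    /// Numeric tier as listed in the table (0 = most authoritative).
    pub fn tier(self) -> u8 {
        match self {
            SourceClass::Regulatory => 0,
            SourceClass::Clinical => 1,
            SourceClass::Observational => 2,
            SourceClass::Expert => 3,
            SourceClass::Informal => 4,
        }
    }

    /// Hypothetical helper: true when `self` sits in a more authoritative tier.
    pub fn outranks(self, other: SourceClass) -> bool {
        self.tier() < other.tier()
    }
}
```

Keeping the tier as a method rather than relying on variant order makes the ranking explicit if variants are ever reordered.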

Principles

1. Subject Patterns Determine Collision

Different predicates require different subject structures. Efficacy claims need Drug:Indication so that multiple sources reporting on the same drug+indication collide. Safety claims need just Drug so all safety info for a drug collides regardless of indication.
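
To make "collide" concrete, here is a minimal sketch that groups claims by subject and predicate and reports groups fed by more than one source. The `Claim` struct and `collisions` function are hypothetical simplifications, and keying on the (subject, predicate) pair is an assumption about how Episteme matches claims.

```rust
use std::collections::HashMap;

/// Simplified stand-in for a claim: only the fields needed to show collision.
#[derive(Debug, Clone)]
pub struct Claim {
    pub subject: String,
    pub predicate: String,
    pub source: String,
}

/// Group claims by (subject, predicate). Any group with more than one
/// source is a collision: several sources speaking about the same thing.
pub fn collisions(claims: &[Claim]) -> Vec<((String, String), Vec<String>)> {
    let mut groups: HashMap<(String, String), Vec<String>> = HashMap::new();
    for claim in claims {
        groups
            .entry((claim.subject.clone(), claim.predicate.clone()))
            .or_default()
            .push(claim.source.clone());
    }
    let mut out: Vec<_> = groups
        .into_iter()
        .filter(|(_, sources)| sources.len() > 1)
        .collect();
    // Sort so the output order does not depend on HashMap iteration order.
    out.sort();
    out
}
```

If an efficacy claim were filed under just `Semaglutide` instead of `Semaglutide:Type2Diabetes`, it would land in a different group and the disagreement would go unnoticed, which is why the pattern matters.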

2. Extractors Are Fallible

External APIs fail, return malformed data, or rate limit. Every extractor must:

  • Return Result<Vec<MedicalClaim>, ExtractError>
  • Handle HTTP timeouts, 429s, and parsing errors gracefully
  • Include provenance (source_url, source_section, quote)
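
The provenance requirement can be checked mechanically. The sketch below is one possible shape for such a check; the `Provenance` struct mirrors the three field names listed above, while `missing_provenance` is a hypothetical helper, not part of the crate.

```rust
/// The three provenance fields named above, pulled out of `MedicalClaim`
/// into a small stand-in struct for illustration.
pub struct Provenance {
    pub source_url: String,
    pub source_section: String,
    pub quote: String,
}

/// Hypothetical check: returns the names of provenance fields that are
/// empty or whitespace-only, in declaration order.
pub fn missing_provenance(p: &Provenance) -> Vec<&'static str> {
    let fields = [
        ("source_url", &p.source_url),
        ("source_section", &p.source_section),
        ("quote", &p.quote),
    ];
    fields
        .iter()
        .filter(|(_, value)| value.trim().is_empty())
        .map(|(name, _)| *name)
        .collect()
}
```

An extractor could call a check like this before returning and turn a non-empty result into an error instead of emitting a claim that cannot be traced.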

3. Domains Are Compiled-In

Type safety matters. Domains define entity types, predicate schemas, and source hierarchies at compile time. Adding a new domain means adding code, not configuration.
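
A compiled-in domain might look roughly like the fluent builder below. This is a simplified sketch: the method names `new`, `with_entity_type`, and `with_predicate_schema` appear in this guide, but their argument lists here are assumptions, and `subject_pattern()` and `pharma()` are hypothetical helpers.

```rust
/// Simplified stand-in for the crate's `Domain` (assumed fields).
#[derive(Debug)]
pub struct Domain {
    name: String,
    entity_types: Vec<String>,
    /// (category, subject pattern) pairs, e.g. ("efficacy", "{Drug}:{Indication}").
    schemas: Vec<(String, String)>,
}

impl Domain {
    pub fn new(name: &str) -> Self {
        Domain { name: name.to_string(), entity_types: Vec::new(), schemas: Vec::new() }
    }

    /// Assumed signature: the real method may take more than a name.
    pub fn with_entity_type(mut self, entity: &str) -> Self {
        self.entity_types.push(entity.to_string());
        self
    }

    /// Assumed signature: the real method may take a full schema value.
    pub fn with_predicate_schema(mut self, category: &str, subject_pattern: &str) -> Self {
        self.schemas.push((category.to_string(), subject_pattern.to_string()));
        self
    }

    /// Hypothetical lookup of the subject pattern for a category.
    pub fn subject_pattern(&self, category: &str) -> Option<&str> {
        self.schemas
            .iter()
            .find(|(c, _)| c == category)
            .map(|(_, pattern)| pattern.as_str())
    }

    pub fn name(&self) -> &str {
        &self.name
    }
}

/// The domain is assembled in code, so a typo in a method name fails to compile.
pub fn pharma() -> Domain {
    Domain::new("Pharma")
        .with_entity_type("Drug")
        .with_entity_type("Indication")
        .with_predicate_schema("efficacy", "{Drug}:{Indication}")
        .with_predicate_schema("safety", "{Drug}")
}
```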

4. Validate Before Ingestion

Claims must be validated against the domain schema before becoming assertions. The Validator catches subject/predicate mismatches early.
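
The kind of mismatch the Validator is meant to catch can be sketched as follows. This is an illustrative approximation, not the crate's `Validator`: the `Schema` struct, the error variants, and the rule of comparing colon-separated segment counts are all assumptions, and `boxed_warning` is a made-up safety predicate.

```rust
#[derive(Debug, PartialEq)]
pub enum ValidationError {
    UnknownPredicate(String),
    SubjectMismatch { expected_parts: usize, found_parts: usize },
    ConfidenceOutOfRange(f32),
}

/// Minimal stand-in for a predicate schema.
pub struct Schema {
    pub subject_pattern: &'static str,
    pub predicates: &'static [&'static str],
}

/// Hypothetical schema set for the examples.
pub fn pharma_schemas() -> Vec<Schema> {
    vec![
        Schema { subject_pattern: "{Drug}:{Indication}", predicates: &["hba1c_reduction", "weight_loss"] },
        Schema { subject_pattern: "{Drug}", predicates: &["boxed_warning"] },
    ]
}

pub fn validate(
    schemas: &[Schema],
    predicate: &str,
    subject: &str,
    confidence: f32,
) -> Result<(), ValidationError> {
    // NaN fails this range check too, so it is rejected as out of range.
    if !(0.0..=1.0).contains(&confidence) {
        return Err(ValidationError::ConfidenceOutOfRange(confidence));
    }
    let schema = schemas
        .iter()
        .find(|s| s.predicates.iter().any(|p| *p == predicate))
        .ok_or_else(|| ValidationError::UnknownPredicate(predicate.to_string()))?;
    let expected = schema.subject_pattern.split(':').count();
    let parts: Vec<&str> = subject.split(':').collect();
    let found = parts.iter().filter(|p| !p.is_empty()).count();
    // Both the segment count and the non-empty segment count must match.
    if parts.len() != expected || found != expected {
        return Err(ValidationError::SubjectMismatch { expected_parts: expected, found_parts: found });
    }
    Ok(())
}
```

Here an efficacy predicate paired with a drug-only subject is rejected before it can become an assertion that would never collide with its peers.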

5. Confidence Is Required

Every claim needs a confidence score (0.0-1.0). Extractors must estimate extraction quality based on source clarity, parsing confidence, and quote specificity.
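
One way to keep confidence estimates consistent across extractors is a small scoring function. The sketch below follows the bands this guide gives in its Do and Do Not lists (exact quotes 0.9+, inferred 0.6-0.8, uncertain below 0.6, prose parsing below 0.9); the `Evidence` enum and the specific values 0.95, 0.7, 0.4, and 0.85 are illustrative assumptions.

```rust
/// Hypothetical classification of how a value was obtained from the source.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Evidence {
    /// The value appears verbatim in the quoted text.
    ExactQuote,
    /// The value was derived from the text but not stated directly.
    Inferred,
    /// The extractor is unsure the text supports the value.
    Uncertain,
}

/// Returns a confidence in [0.0, 1.0]. The numbers are illustrative picks
/// inside each band, not values used by the crate.
pub fn estimate_confidence(evidence: Evidence, parsed_from_prose: bool) -> f32 {
    let base: f32 = match evidence {
        Evidence::ExactQuote => 0.95,
        Evidence::Inferred => 0.7,
        Evidence::Uncertain => 0.4,
    };
    // Free-text parsing is capped below 0.9, whatever the evidence kind.
    let capped = if parsed_from_prose { base.min(0.85) } else { base };
    capped.clamp(0.0, 1.0)
}
```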

Architecture

┌─────────────────────────────────────────────────────────────┐
│              stemedb-ontology Pipeline                      │
├─────────────────────────────────────────────────────────────┤
│  1. DEFINE DOMAIN                                           │
│     Domain::new("Pharma")                                   │
│       .with_entity_type("Drug", ...)                        │
│       .with_predicate_schema("efficacy", ...)               │
│       .with_source_hierarchy(...)                           │
│                                                             │
│  2. EXTRACT CLAIMS                                          │
│     FdaLabelExtractor::extract(&SourceInput::DrugName(...)) │
│     → Vec<MedicalClaim>                                     │
│                                                             │
│  3. BUILD SUBJECTS                                          │
│     SubjectBuilder::build(&schema, &entities)               │
│     → "Semaglutide:Type2Diabetes"                           │
│                                                             │
│  4. VALIDATE                                                │
│     Validator::new(&domain).validate(pred, subj, conf)      │
│     → Ok(()) or ValidationError                             │
│                                                             │
│  5. TO ASSERTION                                            │
│     claim.to_assertion(&signing_key, agent_id, &hlc)        │
│     → Assertion (signed, ready for ingestion)               │
│                                                             │
│  6. SUBMIT TO STEMEDB                                       │
│     StemeClient::assert(&assertion)                         │
│     → AssertionHash                                         │
└─────────────────────────────────────────────────────────────┘

Key Modules

| Module | Purpose | Key Types |
|---|---|---|
| domain.rs | Domain, entity types, predicate schemas | Domain, PredicateSchema, SourceTier |
| subject.rs | Subject construction and parsing | SubjectBuilder, SubjectError |
| validator.rs | Claim validation against schemas | Validator, ValidationError |
| pharma/ | Pharmaceutical domain definition | definition(), GLP1_DRUGS |
| pharma/extractors/ | Medical data extractors | MedicalExtractor, FdaLabelExtractor |
| client.rs | HTTP client for StemeDB | StemeClient, ClientError |

Key Types

/// A predicate schema defines subject patterns for a category
pub struct PredicateSchema {
    pub description: String,
    pub subject_pattern: String,  // e.g., "{Drug}:{Indication}"
    pub predicates: Vec<String>,  // e.g., ["hba1c_reduction", "weight_loss"]
    pub default_lens: DefaultLens,
    pub required_entities: Vec<String>,  // Extracted from pattern
}

/// Trait for medical data extractors
#[async_trait]
pub trait MedicalExtractor: Send + Sync {
    fn name(&self) -> &str;
    fn source_class(&self) -> SourceClass;
    async fn extract(&self, source: &SourceInput) -> Result<Vec<MedicalClaim>, ExtractError>;
    fn can_handle(&self, source: &SourceInput) -> bool;
}

/// Intermediate format between raw source and assertions
pub struct MedicalClaim {
    pub subject: String,
    pub predicate: String,
    pub value: ObjectValue,
    pub confidence: f32,
    pub source_url: String,
    pub source_section: String,
    pub quote: String,
    pub source_class: SourceClass,
    pub metadata: Option<serde_json::Value>,
}

/// Build subjects from schemas and entities (signatures only; bodies omitted)
impl SubjectBuilder {
    // Build: SubjectBuilder::build(&schema, &entities) → "Semaglutide:Type2Diabetes"
    pub fn build(schema: &PredicateSchema, entities: &HashMap<String, String>)
        -> Result<String, SubjectError>;

    // Parse: SubjectBuilder::parse(&schema, subject) → {Drug: "Semaglutide", ...}
    pub fn parse(schema: &PredicateSchema, subject: &str)
        -> Result<HashMap<String, String>, SubjectError>;
}
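
The `build` and `parse` signatures above leave the bodies out. The following is a minimal sketch of what the round trip could look like, assuming patterns are colon-separated `{Entity}` placeholders. It takes the pattern string directly instead of a `PredicateSchema`, and the `SubjectError` variants shown are assumptions.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
pub enum SubjectError {
    MissingEntity(String),
    InvalidEntity(String),
    PartCountMismatch { expected: usize, found: usize },
}

/// Entity names in a pattern: "{Drug}:{Indication}" -> ["Drug", "Indication"].
fn pattern_entities(pattern: &str) -> Vec<&str> {
    pattern
        .split(':')
        .map(|part| part.trim_matches(|c: char| c == '{' || c == '}'))
        .collect()
}

/// Fill each placeholder from `entities`, in pattern order.
pub fn build(pattern: &str, entities: &HashMap<String, String>) -> Result<String, SubjectError> {
    let mut parts = Vec::new();
    for name in pattern_entities(pattern) {
        let value = entities
            .get(name)
            .ok_or_else(|| SubjectError::MissingEntity(name.to_string()))?;
        // A ':' inside a value would make the subject ambiguous to parse.
        if value.is_empty() || value.contains(':') {
            return Err(SubjectError::InvalidEntity(name.to_string()));
        }
        parts.push(value.as_str());
    }
    Ok(parts.join(":"))
}

/// Split a subject back into named entities using the same pattern.
pub fn parse(pattern: &str, subject: &str) -> Result<HashMap<String, String>, SubjectError> {
    let names = pattern_entities(pattern);
    let values: Vec<&str> = subject.split(':').collect();
    if names.len() != values.len() {
        return Err(SubjectError::PartCountMismatch { expected: names.len(), found: values.len() });
    }
    Ok(names
        .into_iter()
        .zip(values)
        .map(|(name, value)| (name.to_string(), value.to_string()))
        .collect())
}
```

Routing every subject through one function is what lets a rule like "no colons inside entity values" be enforced in a single place.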

Step Back: Before Implementing

Before writing code, challenge your assumptions:

1. Is This a New Domain or Extending Pharma?

"Am I defining entity types/predicates for a new vertical, or adding to pharma?"

  • New domains: Create src/{domain}/mod.rs, src/{domain}/definition.rs
  • Pharma extensions: Add to src/pharma/definition.rs
  • New extractors for pharma: Add to src/pharma/extractors/

2. Does My Subject Pattern Enable Correct Collision?

"Will claims that SHOULD conflict have the same subject?"

  • Efficacy claims for same drug+indication MUST collide
  • Safety claims for same drug MUST collide regardless of indication
  • Think about what "same thing" means for your predicate category

3. What Source Class Is This?

"What tier is my data source in the authority hierarchy?"

  • Regulatory (FDA, EMA): SourceClass::Regulatory - authoritative, long decay
  • Clinical trials: SourceClass::Clinical - high quality, moderate decay
  • Observational: SourceClass::Observational - real-world, faster decay
  • Don't over-rank - observational data is NOT clinical-grade

4. Will Extraction Fail Gracefully?

"What happens when the FDA API is down or returns garbage?"

  • HTTP errors → ExtractError::Http
  • No results → ExtractError::NotFound
  • Rate limiting → ExtractError::RateLimited
  • Never panic on external data
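
The mapping above can be expressed as a single classification step. In this sketch the variant names `Http`, `NotFound`, and `RateLimited` come from the list, but their payloads and the `classify_response` function are assumptions made for illustration.

```rust
use std::time::Duration;

/// Stand-in for the crate's `ExtractError`; payload shapes are assumed.
#[derive(Debug, PartialEq)]
pub enum ExtractError {
    Http { status: u16, context: String },
    NotFound(String),
    RateLimited { retry_after: Option<Duration> },
}

/// Hypothetical helper: decide what an HTTP response means before parsing it.
/// `retry_after_secs` would come from a Retry-After header when present.
pub fn classify_response(
    status: u16,
    retry_after_secs: Option<u64>,
    result_count: usize,
    query: &str,
) -> Result<(), ExtractError> {
    match status {
        429 => Err(ExtractError::RateLimited {
            retry_after: retry_after_secs.map(Duration::from_secs),
        }),
        404 => Err(ExtractError::NotFound(query.to_string())),
        // A successful response with zero results is still "not found".
        200..=299 if result_count == 0 => Err(ExtractError::NotFound(query.to_string())),
        200..=299 => Ok(()),
        other => Err(ExtractError::Http {
            status: other,
            context: format!("unexpected status for query {:?}", query),
        }),
    }
}
```

Every branch returns a value, so there is no path on which malformed or unexpected input can panic.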

After stepping back, trace through the pipeline in pharma/extractors/fda.rs to see how it handles edge cases.

Do

  1. Use SubjectBuilder for all subject construction. Never concatenate strings manually.
  2. Include provenance in every MedicalClaim. source_url, source_section, quote are required.
  3. Estimate confidence honestly. Exact quotes = high (0.9+), inferred = medium (0.6-0.8), uncertain = low (<0.6).
  4. Handle all HTTP error cases. Timeout, 429, 500, malformed JSON.
  5. Normalize entity names via alias tables. "Ozempic" → "Semaglutide".
  6. Validate claims before assertion conversion. Use Validator::new(&domain).validate().
  7. Use #[instrument] on extractor methods. Critical for debugging failed extractions.
  8. Add tests for new extractors with mock HTTP. Use mockito or similar.
  9. Document source API quirks. Rate limits, pagination, field meanings.
  10. Keep extractors focused. One source = one extractor. Don't combine FDA + PubMed.
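
Item 5 (alias normalization) can be sketched as a simple lookup. The table below is a hypothetical example, not the crate's alias data; `DRUG_ALIASES` and `normalize_drug` are made-up names.

```rust
/// Hypothetical alias table: lowercase alias -> canonical entity name.
/// Canonical names are listed too, so they normalize to themselves.
const DRUG_ALIASES: &[(&str, &str)] = &[
    ("semaglutide", "Semaglutide"),
    ("ozempic", "Semaglutide"),
    ("wegovy", "Semaglutide"),
    ("rybelsus", "Semaglutide"),
    ("tirzepatide", "Tirzepatide"),
    ("mounjaro", "Tirzepatide"),
];

/// Case-insensitive lookup. Returns `None` for unknown names so the caller
/// must decide what to do, instead of guessing a canonical form.
pub fn normalize_drug(raw: &str) -> Option<&'static str> {
    let key = raw.trim().to_lowercase();
    DRUG_ALIASES
        .iter()
        .find(|(alias, _)| *alias == key)
        .map(|(_, canonical)| *canonical)
}
```

Normalizing before subject construction matters because "Ozempic:Type2Diabetes" and "Semaglutide:Type2Diabetes" would otherwise never collide.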

Do Not

  1. Use unwrap() or expect() in extractors. External data is untrusted.
  2. Hardcode subject strings. Always use SubjectBuilder::build().
  3. Skip the source_hash. Provenance must be hashable for deduplication.
  4. Mix source classes. FDA labels are Regulatory, not Clinical.
  5. Forget async/Send+Sync bounds. Extractors run in async contexts.
  6. Trust external API field names. APIs change; handle missing fields.
  7. Over-promise confidence. If you're parsing prose, confidence < 0.9.
  8. Block on external APIs without timeout. Default 30s, configurable.
  9. Skip the metadata field. Store API version, date accessed, NDC codes.
  10. Commit without running cargo test -p stemedb-ontology. Extractor tests catch regressions.

Decision Points

Adding a New Extractor

Stop. Questions:

  • What source class does this data belong to?
  • What predicates can you reliably extract?
  • What subject pattern do those predicates use?
  • How do you handle rate limiting and failures?
  • What provenance fields can you populate?

Adding a New Predicate Category

Stop. Questions:

  • What entities make up the subject? ({Drug} vs {Drug}:{Indication})
  • What existing predicates belong in this category?
  • What's the default lens (Recency, Consensus, Authority)?
  • Do subjects built with this pattern collide correctly?

Adding a New Domain

Stop. Questions:

  • What entity types exist? (Drug, Gene, Company, Security, etc.)
  • What are the predicate categories and their subject patterns?
  • What's the source hierarchy for this vertical?
  • Who will build extractors for this domain?

Constraints

NEVER:

  • Use unwrap() or expect() in production code
  • Manually concatenate subject strings
  • Trust external API responses without validation
  • Block indefinitely on HTTP requests
  • Skip provenance fields (source_url, quote)
  • Mutate existing assertions (append-only)

ALWAYS:

  • Use SubjectBuilder::build() for subjects
  • Include meaningful error context in ExtractError
  • Run cargo clippy --workspace -- -D warnings before commit
  • Add tests for new extractor logic
  • Document API rate limits and quirks
  • Use #[instrument] on public extractor methods

Testing Commands

# Full ontology test suite
cargo test -p stemedb-ontology

# Run specific extractor tests
cargo test -p stemedb-ontology fda

# Run with logging
RUST_LOG=stemedb_ontology=debug cargo test -p stemedb-ontology

# Lint check
cargo clippy -p stemedb-ontology -- -D warnings

# Format check
cargo fmt -p stemedb-ontology --check

# Run the FDA CLI tool (if available)
cargo run -p stemedb-ontology --bin fda-lookup -- semaglutide

Common Workflows

Adding a New Extractor

  1. Create src/pharma/extractors/{source}.rs
  2. Implement MedicalExtractor trait:
    • name() → human-readable identifier
    • source_class() → tier for this source
    • can_handle() → which SourceInput variants work
    • extract() → async extraction with proper error handling
  3. Add HTTP client with timeout and retry logic
  4. Re-export from src/pharma/extractors/mod.rs
  5. Add tests with mock HTTP responses
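
Step 3 (timeout and retry logic) could take a shape like the one below. This is a synchronous, std-only sketch with an injected sleep function so it can be tested without waiting; real extractors are async, and `RetryPolicy`, `delay_for`, and `retry` are hypothetical names. The 30-second default timeout follows this guide, while the attempt count and base delay are assumptions.

```rust
use std::time::Duration;

#[derive(Debug, Clone)]
pub struct RetryPolicy {
    /// Per-request timeout; also used here as the cap on backoff delay.
    pub timeout: Duration,
    pub max_attempts: u32,
    pub base_delay: Duration,
}

impl Default for RetryPolicy {
    fn default() -> Self {
        RetryPolicy {
            timeout: Duration::from_secs(30),
            max_attempts: 3,
            base_delay: Duration::from_millis(500),
        }
    }
}

impl RetryPolicy {
    /// Exponential backoff: base * 2^(attempt - 1), capped at `timeout`.
    pub fn delay_for(&self, attempt: u32) -> Duration {
        let factor = 2u32.saturating_pow(attempt.saturating_sub(1));
        self.base_delay.saturating_mul(factor).min(self.timeout)
    }
}

/// Call `op` until it succeeds, fails with a non-retryable error, or the
/// attempt budget runs out. `op` receives the 1-based attempt number.
pub fn retry<T, E>(
    policy: &RetryPolicy,
    mut op: impl FnMut(u32) -> Result<T, E>,
    is_retryable: impl Fn(&E) -> bool,
    mut sleep: impl FnMut(Duration),
) -> Result<T, E> {
    let mut attempt = 1;
    loop {
        match op(attempt) {
            Ok(value) => return Ok(value),
            Err(err) if attempt < policy.max_attempts && is_retryable(&err) => {
                sleep(policy.delay_for(attempt));
                attempt += 1;
            }
            Err(err) => return Err(err),
        }
    }
}
```

Passing `is_retryable` as a closure keeps the policy reusable: a rate-limit error would typically be retried, while a not-found result would not.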

Adding a New Predicate to Pharma

  1. Determine which schema category it belongs to (efficacy, safety, mechanism)
  2. Add to the predicates list in src/pharma/definition.rs
  3. If new category needed:
    • Create new PredicateSchema with appropriate subject_pattern
    • Add via .with_predicate_schema()
  4. Update extractor to populate the new predicate
  5. Add test case validating subject pattern

Debugging Extraction Failures

  1. Run with RUST_LOG=stemedb_ontology=debug
  2. Check for HTTP errors (timeout, 429, 500)
  3. Verify API response matches expected schema
  4. Check if provenance fields are populated
  5. Validate subject pattern matches schema
  6. Inspect ExtractError variant and context

Using StemeClient

use stemedb_ontology::client::StemeClient;
use stemedb_ontology::pharma::extractors::{FdaLabelExtractor, MedicalExtractor, SourceInput};

let client = StemeClient::new("http://localhost:18180");
let extractor = FdaLabelExtractor::new();

// Extract claims from FDA
let claims = extractor.extract(&SourceInput::DrugName("semaglutide".into())).await?;

// Convert and submit (`signing_key`, `agent_id`, and `hlc` are assumed to be in scope)
for claim in claims {
    let assertion = claim.to_assertion(&signing_key, agent_id, &hlc);
    let hash = client.assert(&assertion).await?;
}

// Query with skeptic lens (shows all conflicts)
let response = client.skeptic("Semaglutide:Type2Diabetes", "hba1c_change_percent").await?;

Output Format

When implementing features or fixing bugs, provide:

## Summary
[One-line description]

## Changes
- [File]: [What changed]

## Testing
- [How to verify]

## Impact
- [Subject patterns affected, if any]
- [Extractors affected, if any]

Domain Status Reference

| Domain | Status | Extractors |
|---|---|---|
| Pharma | Active | FDA Labels |
| Finance | Planned | - |
| Consumer Health | Planned | - |