| name | description |
|---|---|
| ontology-dev | Development guidelines for stemedb-ontology - the domain modeling layer for Episteme's claim extraction pipeline |
Ontology Development Skill
You are an expert stemedb-ontology developer. This crate defines domain ontologies that ensure claims from different sources collide correctly in Episteme. It handles the critical path from raw source data (FDA labels, clinical trials, etc.) to properly-structured assertions.
Core Concept
The ontology defines how subjects are built for each predicate category, so that conflicting claims land on the same subject and collide:
| Category | Subject Pattern | Example | Why It Collides |
|---|---|---|---|
| Efficacy | `{Drug}:{Indication}` | `Semaglutide:Type2Diabetes` | Same drug+indication efficacy claims collide |
| Safety | `{Drug}` | `Semaglutide` | Safety applies to drug regardless of indication |
| Mechanism | `{Drug}:{Target}` | `Semaglutide:GLP1R` | Drug+target mechanism claims collide |
| Comparison | `{Drug}:{Comparator}:{Indication}` | `Semaglutide:Tirzepatide:T2D` | Head-to-head comparisons collide |
Source Tiers (Pharma Domain):
| Tier | SourceClass | Label | Example |
|---|---|---|---|
| 0 | Regulatory | FDA/EMA | Drug labels, approval letters |
| 1 | Clinical | Phase 3 trials | SUSTAIN, STEP trials |
| 2 | Observational | Real-world | Claims databases, EHR studies |
| 3 | Expert | KOL opinion | Conference presentations |
| 4 | Informal | Social/anecdotal | Patient forums |
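As a rough illustration of the tier table above, the hierarchy can be modeled as an enum with a numeric tier. Only the five `SourceClass` names and their tier numbers come from the table; the `tier()` and `outranks()` helpers are hypothetical and the crate may expose this differently.

```rust
/// Source classes from the pharma tier table, most authoritative first.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SourceClass {
    Regulatory,
    Clinical,
    Observational,
    Expert,
    Informal,
}

impl SourceClass {
    /// Tier number as listed in the table (0 = most authoritative).
    /// Hypothetical helper, not a confirmed part of the crate's API.
    pub fn tier(self) -> u8 {
        match self {
            SourceClass::Regulatory => 0,
            SourceClass::Clinical => 1,
            SourceClass::Observational => 2,
            SourceClass::Expert => 3,
            SourceClass::Informal => 4,
        }
    }

    /// True when `self` sits strictly higher in the hierarchy than `other`.
    pub fn outranks(self, other: SourceClass) -> bool {
        self.tier() < other.tier()
    }
}
```

With this shape, `SourceClass::Clinical.outranks(SourceClass::Observational)` is true, which mirrors the "don't over-rank" rule later in this guide.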
Principles
1. Subject Patterns Determine Collision
Different predicates require different subject structures. Efficacy claims need Drug:Indication so that multiple sources reporting on the same drug+indication collide. Safety claims need just Drug so all safety info for a drug collides regardless of indication.
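A minimal sketch of how a subject pattern could turn entities into a colliding subject string. `build_subject` is a hypothetical stand-in for `SubjectBuilder::build`, written against plain strings so it runs on its own; the real function takes a `PredicateSchema` and returns `SubjectError`.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for SubjectBuilder::build: fills each `{Entity}`
/// placeholder in `pattern` from `entities`, or reports the first problem.
pub fn build_subject(
    pattern: &str,
    entities: &HashMap<String, String>,
) -> Result<String, String> {
    let mut parts = Vec::new();
    for segment in pattern.split(':') {
        // Each segment is expected to look like "{Drug}".
        let name = segment
            .strip_prefix('{')
            .and_then(|s| s.strip_suffix('}'))
            .ok_or_else(|| format!("malformed segment: {}", segment))?;
        let value = entities
            .get(name)
            .ok_or_else(|| format!("missing entity: {}", name))?;
        parts.push(value.as_str());
    }
    Ok(parts.join(":"))
}
```

Given `Drug = "Semaglutide"` and `Indication = "Type2Diabetes"`, the efficacy pattern `{Drug}:{Indication}` yields `Semaglutide:Type2Diabetes`, while the safety pattern `{Drug}` yields `Semaglutide`, so two sources reporting on the same drug and indication end up on the same subject.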
2. Extractors Are Fallible
External APIs fail, return malformed data, or rate limit. Every extractor must:
- Return `Result<Vec<MedicalClaim>, ExtractError>`
- Handle HTTP timeouts, 429s, and parsing errors gracefully
- Include provenance (source_url, source_section, quote)
3. Domains Are Compiled-In
Type safety matters. Domains define entity types, predicate schemas, and source hierarchies at compile time. Adding a new domain means adding code, not configuration.
4. Validate Before Ingestion
Claims must be validated against the domain schema before becoming assertions. The Validator catches subject/predicate mismatches early.
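The kind of check this implies can be sketched as follows. This is an assumption-laden illustration, not the crate's `Validator`: the `validate` free function and the `ValidationError` variants shown here are hypothetical, and the real validator works against a `Domain` rather than a raw pattern string.

```rust
/// Hypothetical error type; the crate's ValidationError may differ.
#[derive(Debug, PartialEq)]
pub enum ValidationError {
    SubjectShape { expected: usize, found: usize },
    EmptySegment,
    ConfidenceOutOfRange,
}

/// Checks a subject against a pattern such as "{Drug}:{Indication}" and
/// confirms the confidence score lies in 0.0..=1.0.
pub fn validate(
    pattern: &str,
    subject: &str,
    confidence: f32,
) -> Result<(), ValidationError> {
    let expected = pattern.split(':').count();
    let segments: Vec<&str> = subject.split(':').collect();
    if segments.len() != expected {
        return Err(ValidationError::SubjectShape {
            expected,
            found: segments.len(),
        });
    }
    if segments.iter().any(|s| s.trim().is_empty()) {
        return Err(ValidationError::EmptySegment);
    }
    // NaN is not contained in any range, so it is rejected here as well.
    if !(0.0..=1.0).contains(&confidence) {
        return Err(ValidationError::ConfidenceOutOfRange);
    }
    Ok(())
}
```

A safety-shaped subject such as `Semaglutide` checked against the efficacy pattern `{Drug}:{Indication}` fails with `SubjectShape`, which is the subject/predicate mismatch the principle describes.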
5. Confidence Is Required
Every claim needs a confidence score (0.0-1.0). Extractors must estimate extraction quality based on source clarity, parsing confidence, and quote specificity.
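One way to keep scores honest is to classify them into the bands this guide uses elsewhere (high at 0.9 and above, medium from 0.6, low below 0.6). The sketch below is hypothetical: `ConfidenceBand` and `confidence_band` are not part of the crate, and treating the unspecified 0.8 to 0.9 range as medium is an assumption.

```rust
/// Hypothetical bands matching the guide's confidence guidance.
#[derive(Debug, PartialEq)]
pub enum ConfidenceBand {
    High,   // exact quotes, 0.9 and above
    Medium, // inferred values; assumed here to cover 0.6 up to 0.9
    Low,    // uncertain, below 0.6
}

/// Classifies a score, returning None for values outside 0.0..=1.0 (or NaN).
pub fn confidence_band(score: f32) -> Option<ConfidenceBand> {
    if !(0.0..=1.0).contains(&score) {
        return None;
    }
    Some(if score >= 0.9 {
        ConfidenceBand::High
    } else if score >= 0.6 {
        ConfidenceBand::Medium
    } else {
        ConfidenceBand::Low
    })
}
```

An extractor that parses prose could assert that its output never lands in `High`, which makes over-promising visible in tests.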
Architecture
┌─────────────────────────────────────────────────────────────┐
│ stemedb-ontology Pipeline │
├─────────────────────────────────────────────────────────────┤
│ 1. DEFINE DOMAIN │
│ Domain::new("Pharma") │
│ .with_entity_type("Drug", ...) │
│ .with_predicate_schema("efficacy", ...) │
│ .with_source_hierarchy(...) │
│ │
│ 2. EXTRACT CLAIMS │
│ FdaLabelExtractor::extract(&SourceInput::DrugName(...)) │
│ → Vec<MedicalClaim> │
│ │
│ 3. BUILD SUBJECTS │
│ SubjectBuilder::build(&schema, &entities) │
│ → "Semaglutide:Type2Diabetes" │
│ │
│ 4. VALIDATE │
│ Validator::new(&domain).validate(pred, subj, conf) │
│ → Ok(()) or ValidationError │
│ │
│ 5. TO ASSERTION │
│ claim.to_assertion(&signing_key, agent_id, &hlc) │
│ → Assertion (signed, ready for ingestion) │
│ │
│ 6. SUBMIT TO STEMEDB │
│ StemeClient::assert(&assertion) │
│ → AssertionHash │
└─────────────────────────────────────────────────────────────┘
Key Modules
| Module | Purpose | Key Types |
|---|---|---|
| `domain.rs` | Domain, entity types, predicate schemas | Domain, PredicateSchema, SourceTier |
| `subject.rs` | Subject construction and parsing | SubjectBuilder, SubjectError |
| `validator.rs` | Claim validation against schemas | Validator, ValidationError |
| `pharma/` | Pharmaceutical domain definition | definition(), GLP1_DRUGS |
| `pharma/extractors/` | Medical data extractors | MedicalExtractor, FdaLabelExtractor |
| `client.rs` | HTTP client for StemeDB | StemeClient, ClientError |
Key Types
/// A predicate schema defines subject patterns for a category
pub struct PredicateSchema {
    pub description: String,
    pub subject_pattern: String,        // e.g., "{Drug}:{Indication}"
    pub predicates: Vec<String>,        // e.g., ["hba1c_reduction", "weight_loss"]
    pub default_lens: DefaultLens,
    pub required_entities: Vec<String>, // Extracted from pattern
}

/// Trait for medical data extractors
#[async_trait]
pub trait MedicalExtractor: Send + Sync {
    fn name(&self) -> &str;
    fn source_class(&self) -> SourceClass;
    async fn extract(&self, source: &SourceInput) -> Result<Vec<MedicalClaim>, ExtractError>;
    fn can_handle(&self, source: &SourceInput) -> bool;
}

/// Intermediate format between raw source and assertions
pub struct MedicalClaim {
    pub subject: String,
    pub predicate: String,
    pub value: ObjectValue,
    pub confidence: f32,
    pub source_url: String,
    pub source_section: String,
    pub quote: String,
    pub source_class: SourceClass,
    pub metadata: Option<serde_json::Value>,
}

/// Build subjects from schemas and entities (signatures only)
impl SubjectBuilder {
    // Build: SubjectBuilder::build(&schema, &entities) → "Semaglutide:Type2Diabetes"
    pub fn build(schema: &PredicateSchema, entities: &HashMap<String, String>)
        -> Result<String, SubjectError>;

    // Parse: SubjectBuilder::parse(&schema, subject) → {Drug: "Semaglutide", ...}
    pub fn parse(schema: &PredicateSchema, subject: &str)
        -> Result<HashMap<String, String>, SubjectError>;
}
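To make the `required_entities` field and the parse direction concrete, here is a small standalone sketch. Both functions are hypothetical stand-ins that operate on a raw pattern string rather than a `PredicateSchema`, and they use `String` errors in place of `SubjectError`.

```rust
use std::collections::HashMap;

/// Pulls entity names out of a pattern:
/// "{Drug}:{Indication}" becomes ["Drug", "Indication"].
pub fn required_entities(pattern: &str) -> Vec<String> {
    pattern
        .split(':')
        .map(|seg| seg.trim_matches(|c: char| c == '{' || c == '}').to_string())
        .collect()
}

/// Hypothetical stand-in for SubjectBuilder::parse: splits a subject on ':'
/// and pairs each segment with the entity name at the same position.
pub fn parse_subject(
    pattern: &str,
    subject: &str,
) -> Result<HashMap<String, String>, String> {
    let names = required_entities(pattern);
    let values: Vec<&str> = subject.split(':').collect();
    if names.len() != values.len() {
        return Err(format!(
            "expected {} segments, found {}",
            names.len(),
            values.len()
        ));
    }
    Ok(names
        .into_iter()
        .zip(values.into_iter().map(String::from))
        .collect())
}
```

Parsing is the inverse of building, so a round trip through both is a cheap property to test for every new pattern.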
Step Back: Before Implementing
Before writing code, challenge your assumptions:
1. Is This a New Domain or Extending Pharma?
"Am I defining entity types/predicates for a new vertical, or adding to pharma?"
- New domains: Create `src/{domain}/mod.rs`, `src/{domain}/definition.rs`
- Pharma extensions: Add to `src/pharma/definition.rs`
- New extractors for pharma: Add to `src/pharma/extractors/`
2. Does My Subject Pattern Enable Correct Collision?
"Will claims that SHOULD conflict have the same subject?"
- Efficacy claims for same drug+indication MUST collide
- Safety claims for same drug MUST collide regardless of indication
- Think about what "same thing" means for your predicate category
3. What Source Class Is This?
"What tier is my data source in the authority hierarchy?"
- Regulatory (FDA, EMA): `SourceClass::Regulatory` - authoritative, long decay
- Clinical trials: `SourceClass::Clinical` - high quality, moderate decay
- Observational: `SourceClass::Observational` - real-world, faster decay
- Don't over-rank - observational data is NOT clinical-grade
4. Will Extraction Fail Gracefully?
"What happens when the FDA API is down or returns garbage?"
- HTTP errors → `ExtractError::Http`
- No results → `ExtractError::NotFound`
- Rate limiting → `ExtractError::RateLimited`
- Never panic on external data
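The mapping above could look roughly like this. Only the variant names `Http`, `NotFound`, and `RateLimited` come from this guide; the payload fields, the `classify_status` helper, and the choice to treat a 404 as "no results" are assumptions for illustration.

```rust
/// Hypothetical shape for the error type; only the variant names are
/// taken from the guide, the fields are illustrative.
#[derive(Debug, PartialEq)]
pub enum ExtractError {
    Http { status: u16 },
    NotFound,
    RateLimited { retry_after_secs: Option<u64> },
}

/// Maps an HTTP status to Ok(()) or the matching error without panicking.
/// Assumes the upstream API signals "no results" with a 404.
pub fn classify_status(
    status: u16,
    retry_after_secs: Option<u64>,
) -> Result<(), ExtractError> {
    match status {
        200..=299 => Ok(()),
        404 => Err(ExtractError::NotFound),
        429 => Err(ExtractError::RateLimited { retry_after_secs }),
        other => Err(ExtractError::Http { status: other }),
    }
}
```

Keeping the classification in one function means every extractor reports failures the same way, and the caller can decide whether a `RateLimited` error is worth retrying.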
After step back: Trace through the pipeline in pharma/extractors/fda.rs to see how it handles edge cases.
Do
- Use `SubjectBuilder` for all subject construction. Never concatenate strings manually.
- Include provenance in every `MedicalClaim`. `source_url`, `source_section`, `quote` are required.
- Estimate confidence honestly. Exact quotes = high (0.9+), inferred = medium (0.6-0.8), uncertain = low (<0.6).
- Handle all HTTP error cases. Timeout, 429, 500, malformed JSON.
- Normalize entity names via alias tables. "Ozempic" → "Semaglutide".
- Validate claims before assertion conversion. Use `Validator::new(&domain).validate()`.
- Use `#[instrument]` on extractor methods. Critical for debugging failed extractions.
- Add tests for new extractors with mock HTTP. Use `mockito` or similar.
- Document source API quirks. Rate limits, pagination, field meanings.
- Keep extractors focused. One source = one extractor. Don't combine FDA + PubMed.
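Alias normalization from the list above can be sketched with a small lookup table. The table contents and the `normalize_drug` helper are illustrative assumptions; the crate's own alias data (for example whatever backs `GLP1_DRUGS`) may be structured differently.

```rust
/// Hypothetical alias table: lowercase alias mapped to canonical name.
const DRUG_ALIASES: &[(&str, &str)] = &[
    ("ozempic", "Semaglutide"),
    ("wegovy", "Semaglutide"),
    ("semaglutide", "Semaglutide"),
];

/// Returns the canonical drug name for a raw label string, ignoring case
/// and surrounding whitespace, or None when the name is unknown.
pub fn normalize_drug(raw: &str) -> Option<&'static str> {
    let key = raw.trim().to_lowercase();
    DRUG_ALIASES
        .iter()
        .find(|(alias, _)| *alias == key.as_str())
        .map(|(_, canonical)| *canonical)
}
```

Returning `None` for unknown names, rather than passing the raw string through, keeps unnormalized brand names from silently creating subjects that never collide.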
Do Not
- Use `unwrap()` or `expect()` in extractors. External data is untrusted.
- Hardcode subject strings. Always use `SubjectBuilder::build()`.
- Skip the source_hash. Provenance must be hashable for deduplication.
- Mix source classes. FDA labels are Regulatory, not Clinical.
- Forget async/Send+Sync bounds. Extractors run in async contexts.
- Trust external API field names. APIs change; handle missing fields.
- Over-promise confidence. If you're parsing prose, confidence < 0.9.
- Block on external APIs without timeout. Default 30s, configurable.
- Skip the metadata field. Store API version, date accessed, NDC codes.
- Commit without running `cargo test -p stemedb-ontology`. Extractor tests catch regressions.
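For the "handle missing fields" rule, a sketch of field access that returns an error instead of calling `unwrap()`. A `HashMap<String, String>` stands in for a parsed API record here, and `FieldError` and `required_field` are hypothetical names; a real extractor would more likely read from `serde_json::Value` and convert into `ExtractError`.

```rust
use std::collections::HashMap;

/// Hypothetical error for absent or blank fields in an upstream record.
#[derive(Debug, PartialEq)]
pub enum FieldError {
    Missing(String),
    Empty(String),
}

/// Looks up a field without panicking; blank values count as errors so
/// that empty provenance never reaches a claim.
pub fn required_field<'a>(
    record: &'a HashMap<String, String>,
    name: &str,
) -> Result<&'a str, FieldError> {
    match record.get(name) {
        None => Err(FieldError::Missing(name.to_string())),
        Some(v) if v.trim().is_empty() => Err(FieldError::Empty(name.to_string())),
        Some(v) => Ok(v.trim()),
    }
}
```

The error carries the field name, which gives the "meaningful error context" that the constraints below ask for.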
Decision Points
Adding a New Extractor
Stop. Questions:
- What source class does this data belong to?
- What predicates can you reliably extract?
- What subject pattern do those predicates use?
- How do you handle rate limiting and failures?
- What provenance fields can you populate?
Adding a New Predicate Category
Stop. Questions:
- What entities make up the subject? (`{Drug}` vs `{Drug}:{Indication}`)
- What existing predicates belong in this category?
- What's the default lens (Recency, Consensus, Authority)?
- Do subjects built with this pattern collide correctly?
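A quick way to answer the last question is to group sample claims by subject and predicate and see which groups hold disagreeing values. The `Claim` struct and `conflicts` function below are a simplified, hypothetical model (values are plain strings rather than `ObjectValue`), intended only to show what "collide" means in practice.

```rust
use std::collections::HashMap;

/// Simplified claim for illustration; the real MedicalClaim has more fields.
#[derive(Debug, Clone)]
pub struct Claim {
    pub subject: String,
    pub predicate: String,
    pub value: String,
}

/// Returns the (subject, predicate) keys that hold more than one distinct
/// value, sorted for stable output.
pub fn conflicts(claims: &[Claim]) -> Vec<(String, String)> {
    let mut groups: HashMap<(String, String), Vec<&str>> = HashMap::new();
    for c in claims {
        let values = groups
            .entry((c.subject.clone(), c.predicate.clone()))
            .or_default();
        if !values.contains(&c.value.as_str()) {
            values.push(c.value.as_str());
        }
    }
    let mut out: Vec<_> = groups
        .into_iter()
        .filter(|(_, values)| values.len() > 1)
        .map(|(key, _)| key)
        .collect();
    out.sort();
    out
}
```

If two sources that should disagree do not show up in the result, their subjects were built differently, which usually points at a wrong pattern or a missed alias.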
Adding a New Domain
Stop. Questions:
- What entity types exist? (Drug, Gene, Company, Security, etc.)
- What are the predicate categories and their subject patterns?
- What's the source hierarchy for this vertical?
- Who will build extractors for this domain?
Constraints
NEVER:
- Use `unwrap()` or `expect()` in production code
- Manually concatenate subject strings
- Trust external API responses without validation
- Block indefinitely on HTTP requests
- Skip provenance fields (source_url, quote)
- Mutate existing assertions (append-only)
ALWAYS:
- Use `SubjectBuilder::build()` for subjects
- Include meaningful error context in `ExtractError`
- Run `cargo clippy --workspace -- -D warnings` before commit
- Add tests for new extractor logic
- Document API rate limits and quirks
- Use `#[instrument]` on public extractor methods
Testing Commands
# Full ontology test suite
cargo test -p stemedb-ontology
# Run specific extractor tests
cargo test -p stemedb-ontology fda
# Run with logging
RUST_LOG=stemedb_ontology=debug cargo test -p stemedb-ontology
# Lint check
cargo clippy -p stemedb-ontology -- -D warnings
# Format check
cargo fmt -p stemedb-ontology --check
# Run the FDA CLI tool (if available)
cargo run -p stemedb-ontology --bin fda-lookup -- semaglutide
Common Workflows
Adding a New Extractor
- Create `src/pharma/extractors/{source}.rs`
- Implement `MedicalExtractor` trait:
  - `name()` → human-readable identifier
  - `source_class()` → tier for this source
  - `can_handle()` → which `SourceInput` variants work
  - `extract()` → async extraction with proper error handling
- Add HTTP client with timeout and retry logic
- Re-export from `src/pharma/extractors/mod.rs`
- Add tests with mock HTTP responses
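The "retry logic" step can be sketched as a bounded retry with exponential backoff. This is a synchronous, std-only illustration with a hypothetical `retry_with_backoff` helper; real extractors are async and would use an async sleep plus the HTTP client's own timeout setting instead of `std::thread::sleep`.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Runs `op` up to `max_attempts` times (always at least once), doubling
/// the delay after each failure. `op` receives the zero-based attempt index.
pub fn retry_with_backoff<T, E>(
    max_attempts: u32,
    base_delay: Duration,
    mut op: impl FnMut(u32) -> Result<T, E>,
) -> Result<T, E> {
    let mut attempt: u32 = 0;
    loop {
        match op(attempt) {
            Ok(value) => return Ok(value),
            // Out of attempts: surface the last error to the caller.
            Err(err) if attempt + 1 >= max_attempts => return Err(err),
            Err(_) => {
                // Delay grows as base, 2x, 4x, ... capped at 64x.
                let factor = 1u32 << attempt.min(6);
                sleep(base_delay * factor);
                attempt += 1;
            }
        }
    }
}
```

In practice only transient failures (timeouts, 429s, 5xx) should be retried; a not-found result should be returned immediately rather than passed through a loop like this.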
Adding a New Predicate to Pharma
- Determine which schema category it belongs to (efficacy, safety, mechanism)
- Add to the `predicates` list in `src/pharma/definition.rs`
- If new category needed:
  - Create new `PredicateSchema` with appropriate `subject_pattern`
  - Add via `.with_predicate_schema()`
- Update extractor to populate the new predicate
- Add test case validating subject pattern
Debugging Extraction Failures
- Run with `RUST_LOG=stemedb_ontology=debug`
- Check for HTTP errors (timeout, 429, 500)
- Verify API response matches expected schema
- Check if provenance fields are populated
- Validate subject pattern matches schema
- Inspect `ExtractError` variant and context
Using StemeClient
use stemedb_ontology::client::StemeClient;
use stemedb_ontology::pharma::extractors::{FdaLabelExtractor, MedicalExtractor, SourceInput};
let client = StemeClient::new("http://localhost:18180");
let extractor = FdaLabelExtractor::new();
// Extract claims from FDA
let claims = extractor.extract(&SourceInput::DrugName("semaglutide".into())).await?;
// Convert and submit (signing_key, agent_id, and hlc must already be in scope)
for claim in claims {
    let assertion = claim.to_assertion(&signing_key, agent_id, &hlc);
    let hash = client.assert(&assertion).await?;
}
// Query with skeptic lens (shows all conflicts)
let response = client.skeptic("Semaglutide:Type2Diabetes", "hba1c_change_percent").await?;
Output Format
When implementing features or fixing bugs, provide:
## Summary
[One-line description]
## Changes
- [File]: [What changed]
## Testing
- [How to verify]
## Impact
- [Subject patterns affected, if any]
- [Extractors affected, if any]
Domain Status Reference
| Domain | Status | Extractors |
|---|---|---|
| Pharma | Active | FDA Labels |
| Finance | Planned | - |
| Consumer Health | Planned | - |