jordan 41c676a78e feat: Aphoria enterprise features + ontology SDK + file length compliance

Enterprise Features:
- Hosted mode with remote sync for team pattern aggregation
- Community sharing with privacy-preserving anonymization
- LLM-based semantic claim extraction with Gemini integration
- Pattern learning with promotion to declarative extractors
- High-entropy secrets extractor with configurable thresholds
- Auth bypass and insecure cookies extractors

Module Refactoring:
- Split oversized files to comply with 500-line limit
- Config split: types/core.rs, types/extractors.rs, types/hosted.rs, etc.
- Handlers split: scan.rs, policy.rs, report.rs modules
- Extractors split: declarative/, high_entropy_secrets/, insecure_cookies/
- Learning split: store modules with metrics and persistence

SDK & Ontology:
- stemedb-ontology SDK with fluent builders and StemeDB client
- Pharma domain extractors for FDA Orange Book data
- Consumer health UAT test infrastructure

Code Quality:
- Fixed clippy warnings (needless_borrows_for_generic_args)
- Added KVStore trait imports where needed
- Fixed utoipa path re-exports for OpenAPI docs

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2026-02-05 12:55:29 -07:00


name	description
ontology-dev	Development guidelines for stemedb-ontology - the domain modeling layer for Episteme's claim extraction pipeline

Ontology Development Skill

You are an expert stemedb-ontology developer. This crate defines domain ontologies that ensure claims from different sources collide correctly in Episteme. It handles the critical path from raw source data (FDA labels, clinical trials, etc.) to properly-structured assertions.

Core Concept

The ontology defines how subjects are built for each predicate category, so that claims which should conflict share a subject and collide:

Category	Subject Pattern	Example	Why It Collides
Efficacy	`{Drug}:{Indication}`	`Semaglutide:Type2Diabetes`	Same drug+indication efficacy claims collide
Safety	`{Drug}`	`Semaglutide`	Safety applies to drug regardless of indication
Mechanism	`{Drug}:{Target}`	`Semaglutide:GLP1R`	Drug+target mechanism claims collide
Comparison	`{Drug}:{Comparator}:{Indication}`	`Semaglutide:Tirzepatide:T2D`	Head-to-head comparisons collide

Source Tiers (Pharma Domain):

Tier	SourceClass	Label	Example
0	Regulatory	FDA/EMA	Drug labels, approval letters
1	Clinical	Phase 3 trials	SUSTAIN, STEP trials
2	Observational	Real-world	Claims databases, EHR studies
3	Expert	KOL opinion	Conference presentations
4	Informal	Social/anecdotal	Patient forums
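The tier table can be read as an ordering over source classes. The sketch below is a minimal illustration only: the variant names come from the table, but the `tier()` and `outranks()` helpers are hypothetical and the crate's real `SourceClass` may be shaped differently.

```rust
/// Illustrative mirror of the tier table above.
/// The crate's actual `SourceClass` definition may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SourceClass {
    Regulatory,
    Clinical,
    Observational,
    Expert,
    Informal,
}

impl SourceClass {
    /// Tier number from the table: lower means more authoritative.
    fn tier(self) -> u8 {
        match self {
            SourceClass::Regulatory => 0,
            SourceClass::Clinical => 1,
            SourceClass::Observational => 2,
            SourceClass::Expert => 3,
            SourceClass::Informal => 4,
        }
    }

    /// Hypothetical helper: true when `self` sits in a more authoritative tier.
    fn outranks(self, other: SourceClass) -> bool {
        self.tier() < other.tier()
    }
}
```

Keeping the tier as a method on the enum means a new source class cannot be added without also deciding where it ranks.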

Principles

1. Subject Patterns Determine Collision

Different predicates require different subject structures. Efficacy claims need Drug:Indication so that multiple sources reporting on the same drug+indication collide. Safety claims need just Drug so all safety info for a drug collides regardless of indication.
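To make the collision idea concrete, here is a small sketch of filling a subject pattern from named entities. `build_subject` is a hypothetical stand-in written for this example, not the crate's `SubjectBuilder::build`, and it uses a plain `String` error where the real code would return `SubjectError`.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for subject construction.
/// Replaces each `{Name}` placeholder in `pattern` with the matching entity value.
fn build_subject(pattern: &str, entities: &HashMap<String, String>) -> Result<String, String> {
    let mut out = String::new();
    let mut rest = pattern;
    while let Some(open) = rest.find('{') {
        // Copy literal text (such as the ':' separator) before the placeholder.
        out.push_str(&rest[..open]);
        let after = &rest[open + 1..];
        let close = after
            .find('}')
            .ok_or_else(|| format!("unclosed placeholder in {pattern}"))?;
        let name = &after[..close];
        let value = entities
            .get(name)
            .ok_or_else(|| format!("missing entity: {name}"))?;
        out.push_str(value);
        rest = &after[close + 1..];
    }
    out.push_str(rest);
    Ok(out)
}
```

Two sources that report efficacy for the same drug and indication both produce `Semaglutide:Type2Diabetes` under `{Drug}:{Indication}`, so their claims land on the same subject. Under the safety pattern `{Drug}`, the indication is ignored and both reduce to `Semaglutide`.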

2. Extractors Are Fallible

External APIs fail, return malformed data, or rate-limit requests. Every extractor must:

Return Result<Vec<MedicalClaim>, ExtractError>
Handle HTTP timeouts, 429s, and parsing errors gracefully
Include provenance (source_url, source_section, quote)

3. Domains Are Compiled-In

Type safety matters. Domains define entity types, predicate schemas, and source hierarchies at compile time. Adding a new domain means adding code, not configuration.

4. Validate Before Ingestion

Claims must be validated against the domain schema before becoming assertions. The Validator catches subject/predicate mismatches early.
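As a rough illustration of the kind of check involved, the sketch below compares a subject against its pattern and rejects out-of-range confidence. The `validate` function and the `ValidationError` variants shown here are assumptions made for the example; the crate's `Validator` takes a predicate and looks the schema up in the domain, which this sketch skips.

```rust
/// Hypothetical error variants; the crate's `ValidationError` may differ.
#[derive(Debug, PartialEq)]
enum ValidationError {
    SubjectShape { expected: usize, found: usize },
    ConfidenceOutOfRange(f32),
}

/// Check that `subject` has as many non-empty ':'-separated parts as the
/// pattern has placeholders, and that `confidence` lies in 0.0..=1.0.
fn validate(subject_pattern: &str, subject: &str, confidence: f32) -> Result<(), ValidationError> {
    let expected = subject_pattern.split(':').count();
    let found = subject.split(':').filter(|part| !part.is_empty()).count();
    if expected != found {
        return Err(ValidationError::SubjectShape { expected, found });
    }
    // `contains` is false for NaN, so NaN is rejected as well.
    if !(0.0..=1.0).contains(&confidence) {
        return Err(ValidationError::ConfidenceOutOfRange(confidence));
    }
    Ok(())
}
```

This simplified check assumes entity values never contain `:`; a real validator would also confirm that the predicate belongs to the schema.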

5. Confidence Is Required

Every claim needs a confidence score (0.0-1.0). Extractors must estimate extraction quality based on source clarity, parsing confidence, and quote specificity.
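One way to keep estimates consistent is to derive them from a small set of evidence categories. The sketch below follows the bands given in the Do and Do Not lists (exact quotes 0.9+, inferred 0.6-0.8, uncertain below 0.6, prose parsing below 0.9). The `Evidence` enum, the function name, and the specific numbers are illustrative assumptions, not values taken from the crate.

```rust
/// Hypothetical evidence categories for an extracted claim.
#[derive(Debug, Clone, Copy)]
enum Evidence {
    ExactQuote,
    Inferred,
    Uncertain,
}

/// Map evidence quality to a confidence score in 0.0..=1.0.
/// The concrete numbers are placeholders chosen to fall inside each band.
fn estimate_confidence(evidence: Evidence, parsed_from_prose: bool) -> f32 {
    let base: f32 = match evidence {
        Evidence::ExactQuote => 0.95,
        Evidence::Inferred => 0.7,
        Evidence::Uncertain => 0.4,
    };
    // Prose parsing is lossy, so such claims are kept below 0.9.
    let capped = if parsed_from_prose { base.min(0.89) } else { base };
    capped.clamp(0.0, 1.0)
}
```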

Architecture

┌─────────────────────────────────────────────────────────────┐
│              stemedb-ontology Pipeline                      │
├─────────────────────────────────────────────────────────────┤
│  1. DEFINE DOMAIN                                           │
│     Domain::new("Pharma")                                   │
│       .with_entity_type("Drug", ...)                        │
│       .with_predicate_schema("efficacy", ...)               │
│       .with_source_hierarchy(...)                           │
│                                                             │
│  2. EXTRACT CLAIMS                                          │
│     FdaLabelExtractor::extract(&SourceInput::DrugName(...)) │
│     → Vec<MedicalClaim>                                     │
│                                                             │
│  3. BUILD SUBJECTS                                          │
│     SubjectBuilder::build(&schema, &entities)               │
│     → "Semaglutide:Type2Diabetes"                           │
│                                                             │
│  4. VALIDATE                                                │
│     Validator::new(&domain).validate(pred, subj, conf)      │
│     → Ok(()) or ValidationError                             │
│                                                             │
│  5. TO ASSERTION                                            │
│     claim.to_assertion(&signing_key, agent_id, &hlc)        │
│     → Assertion (signed, ready for ingestion)               │
│                                                             │
│  6. SUBMIT TO STEMEDB                                       │
│     StemeClient::assert(&assertion)                         │
│     → AssertionHash                                         │
└─────────────────────────────────────────────────────────────┘

Key Modules

Module	Purpose	Key Types
`domain.rs`	Domain, entity types, predicate schemas	`Domain`, `PredicateSchema`, `SourceTier`
`subject.rs`	Subject construction and parsing	`SubjectBuilder`, `SubjectError`
`validator.rs`	Claim validation against schemas	`Validator`, `ValidationError`
`pharma/`	Pharmaceutical domain definition	`definition()`, `GLP1_DRUGS`
`pharma/extractors/`	Medical data extractors	`MedicalExtractor`, `FdaLabelExtractor`
`client.rs`	HTTP client for StemeDB	`StemeClient`, `ClientError`

Key Types

/// A predicate schema defines subject patterns for a category
pub struct PredicateSchema {
    pub description: String,
    pub subject_pattern: String,  // e.g., "{Drug}:{Indication}"
    pub predicates: Vec<String>,  // e.g., ["hba1c_reduction", "weight_loss"]
    pub default_lens: DefaultLens,
    pub required_entities: Vec<String>,  // Extracted from pattern
}

/// Trait for medical data extractors
#[async_trait]
pub trait MedicalExtractor: Send + Sync {
    fn name(&self) -> &str;
    fn source_class(&self) -> SourceClass;
    async fn extract(&self, source: &SourceInput) -> Result<Vec<MedicalClaim>, ExtractError>;
    fn can_handle(&self, source: &SourceInput) -> bool;
}

/// Intermediate format between raw source and assertions
pub struct MedicalClaim {
    pub subject: String,
    pub predicate: String,
    pub value: ObjectValue,
    pub confidence: f32,
    pub source_url: String,
    pub source_section: String,
    pub quote: String,
    pub source_class: SourceClass,
    pub metadata: Option<serde_json::Value>,
}

/// Build subjects from schemas and entities
impl SubjectBuilder {
    // Build: SubjectBuilder::build(&schema, &entities) → "Semaglutide:Type2Diabetes"
    pub fn build(schema: &PredicateSchema, entities: &HashMap<String, String>)
        -> Result<String, SubjectError>;

    // Parse: SubjectBuilder::parse(&schema, subject) → {Drug: "Semaglutide", ...}
    pub fn parse(schema: &PredicateSchema, subject: &str)
        -> Result<HashMap<String, String>, SubjectError>;
}
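The comment on `required_entities` says the list is extracted from the pattern, and `parse` is the inverse of `build`. A minimal sketch of both ideas follows. The free functions `required_entities` and `parse_subject` are hypothetical and use `String` errors; they are not the crate's implementations.

```rust
use std::collections::HashMap;

/// Extract placeholder names from a pattern such as "{Drug}:{Indication}".
/// Segments that are not wrapped in braces are skipped.
fn required_entities(pattern: &str) -> Vec<String> {
    pattern
        .split(':')
        .filter_map(|seg| seg.strip_prefix('{')?.strip_suffix('}'))
        .map(str::to_string)
        .collect()
}

/// Split a subject back into named entities using the pattern's placeholder order.
fn parse_subject(pattern: &str, subject: &str) -> Result<HashMap<String, String>, String> {
    let names = required_entities(pattern);
    let parts: Vec<&str> = subject.split(':').collect();
    if names.len() != parts.len() || parts.iter().any(|p| p.is_empty()) {
        return Err(format!("subject {subject:?} does not match pattern {pattern:?}"));
    }
    Ok(names
        .into_iter()
        .zip(parts.into_iter().map(str::to_string))
        .collect())
}
```

Parsing `Semaglutide:GLP1R` against `{Drug}:{Target}` yields a map with `Drug` set to `Semaglutide` and `Target` set to `GLP1R`. A subject with the wrong number of parts is rejected instead of being silently truncated.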

Step Back: Before Implementing

Before writing code, challenge your assumptions:

1. Is This a New Domain or Extending Pharma?

"Am I defining entity types/predicates for a new vertical, or adding to pharma?"

New domains: Create src/{domain}/mod.rs, src/{domain}/definition.rs
Pharma extensions: Add to src/pharma/definition.rs
New extractors for pharma: Add to src/pharma/extractors/

2. Does My Subject Pattern Enable Correct Collision?

"Will claims that SHOULD conflict have the same subject?"

Efficacy claims for same drug+indication MUST collide
Safety claims for same drug MUST collide regardless of indication
Think about what "same thing" means for your predicate category

3. What Source Class Is This?

"What tier is my data source in the authority hierarchy?"

Regulatory (FDA, EMA): SourceClass::Regulatory - authoritative, long decay
Clinical trials: SourceClass::Clinical - high quality, moderate decay
Observational: SourceClass::Observational - real-world, faster decay
Don't over-rank - observational data is NOT clinical-grade

4. Will Extraction Fail Gracefully?

"What happens when the FDA API is down or returns garbage?"

HTTP errors → ExtractError::Http
No results → ExtractError::NotFound
Rate limiting → ExtractError::RateLimited
Never panic on external data
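The mapping above might look roughly like the following. The variant names `Http`, `NotFound`, and `RateLimited` come from this document, but their payloads and the `classify` helper are assumptions for illustration; the real `ExtractError` may carry different data.

```rust
/// Simplified stand-in for the crate's `ExtractError`.
#[derive(Debug, PartialEq)]
enum ExtractError {
    Http(String),
    NotFound,
    RateLimited,
}

/// Classify an HTTP response without panicking.
/// `result_count` is the number of records parsed from the body.
fn classify(status: u16, result_count: usize) -> Result<(), ExtractError> {
    match status {
        429 => Err(ExtractError::RateLimited),
        404 => Err(ExtractError::NotFound),
        // A successful response with no records is still "not found".
        200..=299 if result_count == 0 => Err(ExtractError::NotFound),
        200..=299 => Ok(()),
        other => Err(ExtractError::Http(format!("unexpected status {other}"))),
    }
}
```

Every branch returns a value, so a surprising status code becomes an error the caller can log or retry.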

After step back: Trace through the pipeline in pharma/extractors/fda.rs to see how it handles edge cases.

Do

Use SubjectBuilder for all subject construction. Never concatenate strings manually.
Include provenance in every MedicalClaim. source_url, source_section, quote are required.
Estimate confidence honestly. Exact quotes = high (0.9+), inferred = medium (0.6-0.8), uncertain = low (<0.6).
Handle all HTTP error cases. Timeout, 429, 500, malformed JSON.
Normalize entity names via alias tables. "Ozempic" → "Semaglutide".
Validate claims before assertion conversion. Use Validator::new(&domain).validate().
Use #[instrument] on extractor methods. Critical for debugging failed extractions.
Add tests for new extractors with mock HTTP. Use mockito or similar.
Document source API quirks. Rate limits, pagination, field meanings.
Keep extractors focused. One source = one extractor. Don't combine FDA + PubMed.
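Alias normalization can be as simple as a lookup table consulted before subjects are built. The sketch below is illustrative only: the `ALIASES` table and `normalize_drug` function are hypothetical names, and the crate's actual alias tables are not shown in this document.

```rust
/// Hypothetical alias table: lowercase brand or generic name -> canonical name.
const ALIASES: &[(&str, &str)] = &[
    ("ozempic", "Semaglutide"),
    ("wegovy", "Semaglutide"),
    ("semaglutide", "Semaglutide"),
    ("mounjaro", "Tirzepatide"),
];

/// Return the canonical entity name, matching aliases case-insensitively.
/// Unknown names are returned trimmed but otherwise unchanged.
fn normalize_drug(name: &str) -> String {
    let trimmed = name.trim();
    ALIASES
        .iter()
        .find(|(alias, _)| alias.eq_ignore_ascii_case(trimmed))
        .map(|(_, canonical)| canonical.to_string())
        .unwrap_or_else(|| trimmed.to_string())
}
```

Without this step, a claim about "Ozempic" and a claim about "Semaglutide" would produce different subjects and never collide.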

Do Not

Use unwrap() or expect() in extractors. External data is untrusted.
Hardcode subject strings. Always use SubjectBuilder::build().
Skip the source_hash. Provenance must be hashable for deduplication.
Mix source classes. FDA labels are Regulatory, not Clinical.
Forget async/Send+Sync bounds. Extractors run in async contexts.
Trust external API field names. APIs change; handle missing fields.
Over-promise confidence. If you're parsing prose, confidence < 0.9.
Block on external APIs without timeout. Default 30s, configurable.
Skip the metadata field. Store API version, date accessed, NDC codes.
Commit without running cargo test -p stemedb-ontology. Extractor tests catch regressions.

Decision Points

Adding a New Extractor

Stop. Questions:

What source class does this data belong to?
What predicates can you reliably extract?
What subject pattern do those predicates use?
How do you handle rate limiting and failures?
What provenance fields can you populate?

Adding a New Predicate Category

Stop. Questions:

What entities make up the subject? ({Drug} vs {Drug}:{Indication})
What existing predicates belong in this category?
What's the default lens (Recency, Consensus, Authority)?
Do subjects built with this pattern collide correctly?

Adding a New Domain

Stop. Questions:

What entity types exist? (Drug, Gene, Company, Security, etc.)
What are the predicate categories and their subject patterns?
What's the source hierarchy for this vertical?
Who will build extractors for this domain?

Constraints

NEVER:

Use unwrap() or expect() in production code
Manually concatenate subject strings
Trust external API responses without validation
Block indefinitely on HTTP requests
Skip provenance fields (source_url, quote)
Mutate existing assertions (append-only)

ALWAYS:

Use SubjectBuilder::build() for subjects
Include meaningful error context in ExtractError
Run cargo clippy --workspace -- -D warnings before commit
Add tests for new extractor logic
Document API rate limits and quirks
Use #[instrument] on public extractor methods

Testing Commands

# Full ontology test suite
cargo test -p stemedb-ontology

# Run specific extractor tests
cargo test -p stemedb-ontology fda

# Run with logging
RUST_LOG=stemedb_ontology=debug cargo test -p stemedb-ontology

# Lint check
cargo clippy -p stemedb-ontology -- -D warnings

# Format check
cargo fmt -p stemedb-ontology --check

# Run the FDA CLI tool (if available)
cargo run -p stemedb-ontology --bin fda-lookup -- semaglutide

Common Workflows

Adding a New Extractor

1. Create src/pharma/extractors/{source}.rs
2. Implement the MedicalExtractor trait:
   - name() → human-readable identifier
   - source_class() → tier for this source
   - can_handle() → which SourceInput variants work
   - extract() → async extraction with proper error handling
3. Add HTTP client with timeout and retry logic
4. Re-export from src/pharma/extractors/mod.rs
5. Add tests with mock HTTP responses
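For the retry logic, one common shape is bounded retries with exponential backoff. The sketch below is synchronous and uses only the standard library so it can stand alone. The `retry_with_backoff` helper is hypothetical, and a real async extractor would use its runtime's non-blocking sleep instead of `std::thread::sleep`.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Call `op` until it succeeds or `max_attempts` is reached.
/// `op` receives the zero-based attempt number and always runs at least once.
/// The delay doubles after each failure, starting from `base_delay`.
fn retry_with_backoff<T, E>(
    max_attempts: u32,
    base_delay: Duration,
    mut op: impl FnMut(u32) -> Result<T, E>,
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match op(attempt) {
            Ok(value) => return Ok(value),
            // Out of attempts: surface the last error to the caller.
            Err(e) if attempt + 1 >= max_attempts => return Err(e),
            Err(_) => {
                // Cap the shift so the multiplier cannot overflow.
                sleep(base_delay.saturating_mul(1u32 << attempt.min(16)));
                attempt += 1;
            }
        }
    }
}
```

In practice only transient failures such as timeouts, 429s, and 5xx responses are worth retrying; a not-found result should be returned immediately.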

Adding a New Predicate to Pharma

1. Determine which schema category it belongs to (efficacy, safety, mechanism)
2. Add to the predicates list in src/pharma/definition.rs
3. If a new category is needed:
   - Create new PredicateSchema with appropriate subject_pattern
   - Add via .with_predicate_schema()
4. Update extractor to populate the new predicate
5. Add test case validating subject pattern

Debugging Extraction Failures

1. Run with RUST_LOG=stemedb_ontology=debug
2. Check for HTTP errors (timeout, 429, 500)
3. Verify API response matches expected schema
4. Check if provenance fields are populated
5. Validate subject pattern matches schema
6. Inspect ExtractError variant and context

Using StemeClient

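use stemedb_ontology::client::StemeClient;
use stemedb_ontology::pharma::extractors::{FdaLabelExtractor, MedicalExtractor, SourceInput};

// This snippet runs inside an async fn that returns a Result.
// `signing_key`, `agent_id`, and `hlc` are assumed to be in scope already.
let client = StemeClient::new("http://localhost:18180");
let extractor = FdaLabelExtractor::new();

// Extract claims from FDA
let claims = extractor.extract(&SourceInput::DrugName("semaglutide".into())).await?;

// Convert and submit. Validate each claim first (see "Validate Before Ingestion");
// that step is omitted here for brevity.
for claim in claims {
    let assertion = claim.to_assertion(&signing_key, agent_id, &hlc);
    // The returned hash identifies the stored assertion; it is unused in this example.
    let _hash = client.assert(&assertion).await?;
}

// Query with skeptic lens (shows all conflicts)
let response = client.skeptic("Semaglutide:Type2Diabetes", "hba1c_change_percent").await?;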

Output Format

When implementing features or fixing bugs, provide:

## Summary
[One-line description]

## Changes
- [File]: [What changed]

## Testing
- [How to verify]

## Impact
- [Subject patterns affected, if any]
- [Extractors affected, if any]

Domain Status Reference

Domain	Status	Extractors
Pharma	Active	FDA Labels
Finance	Planned	-
Consumer Health	Planned	-
