stemedb/.claude/skills/ontology-dev/SKILL.md
jordan 41c676a78e feat: Aphoria enterprise features + ontology SDK + file length compliance
2026-02-05 12:55:29 -07:00

---
name: ontology-dev
description: Development guidelines for stemedb-ontology - the domain modeling layer for Episteme's claim extraction pipeline
---
# Ontology Development Skill
You are an expert stemedb-ontology developer. This crate defines **domain ontologies** that ensure claims from different sources collide correctly in Episteme. It handles the critical path from raw source data (FDA labels, clinical trials, etc.) to properly-structured assertions.
## Core Concept
The ontology defines how subjects are built for each predicate category, so that conflicting claims share a subject and collide:
| Category | Subject Pattern | Example | Why It Collides |
|----------|-----------------|---------|-----------------|
| Efficacy | `{Drug}:{Indication}` | `Semaglutide:Type2Diabetes` | Same drug+indication efficacy claims collide |
| Safety | `{Drug}` | `Semaglutide` | Safety applies to drug regardless of indication |
| Mechanism | `{Drug}:{Target}` | `Semaglutide:GLP1R` | Drug+target mechanism claims collide |
| Comparison | `{Drug}:{Comparator}:{Indication}` | `Semaglutide:Tirzepatide:Type2Diabetes` | Head-to-head comparisons collide |
**Source Tiers (Pharma Domain):**
| Tier | SourceClass | Label | Example |
|------|-------------|-------|---------|
| 0 | Regulatory | FDA/EMA | Drug labels, approval letters |
| 1 | Clinical | Phase 3 trials | SUSTAIN, STEP trials |
| 2 | Observational | Real-world | Claims databases, EHR studies |
| 3 | Expert | KOL opinion | Conference presentations |
| 4 | Informal | Social/anecdotal | Patient forums |
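
The tier table can be read as an ordering over source classes. The sketch below is illustrative only: the variant names follow the `SourceClass` column, while `tier()` and `outranks()` are hypothetical helpers and may not match the crate's actual API.

```rust
/// Source tiers from the table above. The variant names follow the
/// `SourceClass` column; `tier()` and `outranks()` are hypothetical helpers.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SourceClass {
    Regulatory,
    Clinical,
    Observational,
    Expert,
    Informal,
}

impl SourceClass {
    /// Numeric tier: 0 is the most authoritative.
    pub fn tier(self) -> u8 {
        match self {
            SourceClass::Regulatory => 0,
            SourceClass::Clinical => 1,
            SourceClass::Observational => 2,
            SourceClass::Expert => 3,
            SourceClass::Informal => 4,
        }
    }

    /// True when `self` sits strictly higher in the hierarchy than `other`.
    pub fn outranks(self, other: SourceClass) -> bool {
        self.tier() < other.tier()
    }
}
```

Keeping the tier as a derived number rather than a stored field means a source class cannot drift out of sync with its rank.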
## Principles
### 1. Subject Patterns Determine Collision
Different predicates require different subject structures. Efficacy claims need `Drug:Indication` so that multiple sources reporting on the same drug+indication collide. Safety claims need just `Drug` so all safety info for a drug collides regardless of indication.
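
To make the collision rule concrete, here is a minimal sketch of pattern-driven subject construction. `build_subject` and `entities` are hypothetical stand-ins written for illustration; in the crate this job belongs to `SubjectBuilder::build`, whose internals may differ.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for `SubjectBuilder::build`: fills each `{Entity}`
/// placeholder in a colon-separated pattern from the entity map.
pub fn build_subject(
    pattern: &str,
    entities: &HashMap<String, String>,
) -> Result<String, String> {
    let mut parts = Vec::new();
    for segment in pattern.split(':') {
        // Each segment is expected to look like `{Drug}`.
        let name = segment
            .strip_prefix('{')
            .and_then(|s| s.strip_suffix('}'))
            .ok_or_else(|| format!("malformed segment: {}", segment))?;
        let value = entities
            .get(name)
            .ok_or_else(|| format!("missing entity: {}", name))?;
        parts.push(value.as_str());
    }
    Ok(parts.join(":"))
}

/// Convenience helper (also hypothetical) to build the entity map from pairs.
pub fn entities(pairs: &[(&str, &str)]) -> HashMap<String, String> {
    pairs
        .iter()
        .map(|(k, v)| (k.to_string(), v.to_string()))
        .collect()
}
```

With the same entity map, the efficacy pattern `{Drug}:{Indication}` yields `Semaglutide:Type2Diabetes` while the safety pattern `{Drug}` yields `Semaglutide`, so safety claims collide across indications and efficacy claims do not.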
### 2. Extractors Are Fallible
External APIs fail, return malformed data, or rate limit. Every extractor must:
- Return `Result<Vec<MedicalClaim>, ExtractError>`
- Handle HTTP timeouts, 429s, and parsing errors gracefully
- Include provenance (source_url, source_section, quote)
### 3. Domains Are Compiled-In
Type safety matters. Domains define entity types, predicate schemas, and source hierarchies at compile time. Adding a new domain means adding code, not configuration.
### 4. Validate Before Ingestion
Claims must be validated against the domain schema before becoming assertions. The `Validator` catches subject/predicate mismatches early.
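
As a rough illustration of what such a check involves, the sketch below validates a predicate, subject, and confidence against one schema. The `Schema` struct, the `validate` function, and the `ValidationError` variants shown here are assumptions made for the example; the real `Validator` and `ValidationError` may be shaped differently.

```rust
/// Hypothetical error variants; the crate's `ValidationError` may differ.
#[derive(Debug, PartialEq)]
pub enum ValidationError {
    UnknownPredicate(String),
    SubjectMismatch { expected: usize, found: usize },
    ConfidenceOutOfRange(f32),
}

/// Simplified stand-in for `PredicateSchema`.
pub struct Schema {
    pub subject_pattern: &'static str,
    pub predicates: &'static [&'static str],
}

/// Checks that the predicate belongs to the schema, that the subject has as
/// many non-empty segments as the pattern, and that confidence is in 0.0..=1.0.
pub fn validate(
    schema: &Schema,
    predicate: &str,
    subject: &str,
    confidence: f32,
) -> Result<(), ValidationError> {
    if !schema.predicates.iter().any(|p| *p == predicate) {
        return Err(ValidationError::UnknownPredicate(predicate.to_string()));
    }
    let expected = schema.subject_pattern.split(':').count();
    let found = subject.split(':').filter(|s| !s.is_empty()).count();
    if expected != found {
        return Err(ValidationError::SubjectMismatch { expected, found });
    }
    if !(0.0..=1.0).contains(&confidence) {
        return Err(ValidationError::ConfidenceOutOfRange(confidence));
    }
    Ok(())
}
```

A safety-style subject such as `Semaglutide` submitted with an efficacy predicate fails here with a segment-count mismatch, which is the kind of subject/predicate mismatch that should be caught before ingestion.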
### 5. Confidence Is Required
Every claim needs a confidence score (0.0-1.0). Extractors must estimate extraction quality based on source clarity, parsing confidence, and quote specificity.
## Architecture
```
┌─────────────────────────────────────────────────────────────┐
│                  stemedb-ontology Pipeline                  │
├─────────────────────────────────────────────────────────────┤
│  1. DEFINE DOMAIN                                           │
│     Domain::new("Pharma")                                   │
│       .with_entity_type("Drug", ...)                        │
│       .with_predicate_schema("efficacy", ...)               │
│       .with_source_hierarchy(...)                           │
│                                                             │
│  2. EXTRACT CLAIMS                                          │
│     FdaLabelExtractor::extract(&SourceInput::DrugName(...)) │
│     → Vec<MedicalClaim>                                     │
│                                                             │
│  3. BUILD SUBJECTS                                          │
│     SubjectBuilder::build(&schema, &entities)               │
│     → "Semaglutide:Type2Diabetes"                           │
│                                                             │
│  4. VALIDATE                                                │
│     Validator::new(&domain).validate(pred, subj, conf)      │
│     → Ok(()) or ValidationError                             │
│                                                             │
│  5. TO ASSERTION                                            │
│     claim.to_assertion(&signing_key, agent_id, &hlc)        │
│     → Assertion (signed, ready for ingestion)               │
│                                                             │
│  6. SUBMIT TO STEMEDB                                       │
│     StemeClient::assert(&assertion)                         │
│     → AssertionHash                                         │
└─────────────────────────────────────────────────────────────┘
```
## Key Modules
| Module | Purpose | Key Types |
|--------|---------|-----------|
| `domain.rs` | Domain, entity types, predicate schemas | `Domain`, `PredicateSchema`, `SourceTier` |
| `subject.rs` | Subject construction and parsing | `SubjectBuilder`, `SubjectError` |
| `validator.rs` | Claim validation against schemas | `Validator`, `ValidationError` |
| `pharma/` | Pharmaceutical domain definition | `definition()`, `GLP1_DRUGS` |
| `pharma/extractors/` | Medical data extractors | `MedicalExtractor`, `FdaLabelExtractor` |
| `client.rs` | HTTP client for StemeDB | `StemeClient`, `ClientError` |
## Key Types
```rust
/// A predicate schema defines subject patterns for a category
pub struct PredicateSchema {
    pub description: String,
    pub subject_pattern: String,        // e.g., "{Drug}:{Indication}"
    pub predicates: Vec<String>,        // e.g., ["hba1c_reduction", "weight_loss"]
    pub default_lens: DefaultLens,
    pub required_entities: Vec<String>, // Extracted from pattern
}

/// Trait for medical data extractors
#[async_trait]
pub trait MedicalExtractor: Send + Sync {
    fn name(&self) -> &str;
    fn source_class(&self) -> SourceClass;
    async fn extract(&self, source: &SourceInput) -> Result<Vec<MedicalClaim>, ExtractError>;
    fn can_handle(&self, source: &SourceInput) -> bool;
}

/// Intermediate format between raw source and assertions
pub struct MedicalClaim {
    pub subject: String,
    pub predicate: String,
    pub value: ObjectValue,
    pub confidence: f32,
    pub source_url: String,
    pub source_section: String,
    pub quote: String,
    pub source_class: SourceClass,
    pub metadata: Option<serde_json::Value>,
}

/// Build subjects from schemas and entities (signatures only; bodies omitted)
impl SubjectBuilder {
    // Build: SubjectBuilder::build(&schema, &entities) → "Semaglutide:Type2Diabetes"
    pub fn build(schema: &PredicateSchema, entities: &HashMap<String, String>)
        -> Result<String, SubjectError>;

    // Parse: SubjectBuilder::parse(&schema, subject) → {Drug: "Semaglutide", ...}
    pub fn parse(schema: &PredicateSchema, subject: &str)
        -> Result<HashMap<String, String>, SubjectError>;
}
```
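
The `parse` direction listed above is the inverse of `build`. As a self-contained illustration, the sketch below maps a subject string back to named entities; `parse_subject` is a hypothetical stand-in that takes the pattern as a plain string and reports errors as `String` rather than `SubjectError`.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for `SubjectBuilder::parse`: pairs each `{Entity}`
/// name in the pattern with the matching colon-separated subject segment.
pub fn parse_subject(pattern: &str, subject: &str) -> Result<HashMap<String, String>, String> {
    // "{Drug}:{Indication}" -> ["Drug", "Indication"]
    let names: Vec<&str> = pattern
        .split(':')
        .map(|seg| seg.trim_start_matches('{').trim_end_matches('}'))
        .collect();
    let values: Vec<&str> = subject.split(':').collect();

    // Reject subjects with the wrong number of segments or empty segments.
    if names.len() != values.len() || values.iter().any(|v| v.is_empty()) {
        return Err(format!(
            "subject '{}' does not match pattern '{}'",
            subject, pattern
        ));
    }

    Ok(names
        .into_iter()
        .zip(values)
        .map(|(n, v)| (n.to_string(), v.to_string()))
        .collect())
}
```

Round-tripping a subject through parse and build is a cheap test that a new pattern is unambiguous.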
## Step Back: Before Implementing
Before writing code, challenge your assumptions:
### 1. Is This a New Domain or Extending Pharma?
> "Am I defining entity types/predicates for a new vertical, or adding to pharma?"
- New domains: Create `src/{domain}/mod.rs`, `src/{domain}/definition.rs`
- Pharma extensions: Add to `src/pharma/definition.rs`
- New extractors for pharma: Add to `src/pharma/extractors/`
### 2. Does My Subject Pattern Enable Correct Collision?
> "Will claims that SHOULD conflict have the same subject?"
- Efficacy claims for same drug+indication MUST collide
- Safety claims for same drug MUST collide regardless of indication
- Think about what "same thing" means for your predicate category
### 3. What Source Class Is This?
> "What tier is my data source in the authority hierarchy?"
- Regulatory (FDA, EMA): `SourceClass::Regulatory` - authoritative, long decay
- Clinical trials: `SourceClass::Clinical` - high quality, moderate decay
- Observational: `SourceClass::Observational` - real-world, faster decay
- Don't over-rank - observational data is NOT clinical-grade
### 4. Will Extraction Fail Gracefully?
> "What happens when the FDA API is down or returns garbage?"
- HTTP errors → `ExtractError::Http`
- No results → `ExtractError::NotFound`
- Rate limiting → `ExtractError::RateLimited`
- Never panic on external data
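
The mapping above can be sketched as a pure function over the response status. The variant names `Http`, `NotFound`, and `RateLimited` come from the list; their payloads, the `classify_response` function, and its parameters are assumptions made for this example and do not describe the crate's actual `ExtractError`.

```rust
/// Variant names follow the list above; the payload shapes are assumed.
#[derive(Debug, PartialEq)]
pub enum ExtractError {
    Http(String),
    NotFound(String),
    RateLimited { retry_after_secs: Option<u64> },
}

/// Hypothetical helper: turns an HTTP status, an optional `Retry-After`
/// header value, and the number of parsed results into a typed outcome.
pub fn classify_response(
    status: u16,
    retry_after: Option<&str>,
    result_count: usize,
) -> Result<(), ExtractError> {
    match status {
        // A successful response with nothing in it is still "not found".
        200..=299 if result_count == 0 => {
            Err(ExtractError::NotFound("response contained no results".to_string()))
        }
        200..=299 => Ok(()),
        404 => Err(ExtractError::NotFound("resource not found".to_string())),
        // A missing or unparseable header becomes `None` instead of a panic.
        429 => Err(ExtractError::RateLimited {
            retry_after_secs: retry_after.and_then(|v| v.trim().parse().ok()),
        }),
        other => Err(ExtractError::Http(format!("unexpected status {}", other))),
    }
}
```

Keeping the classification free of I/O makes every branch testable without a mock server.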
**After step back:** Trace through the pipeline in `pharma/extractors/fda.rs` to see how it handles edge cases.
## Do
1. **Use `SubjectBuilder` for all subject construction.** Never concatenate strings manually.
2. **Include provenance in every `MedicalClaim`.** `source_url`, `source_section`, `quote` are required.
3. **Estimate confidence honestly.** Exact quotes = high (0.9+), inferred = medium (0.6-0.8), uncertain = low (<0.6).
4. **Handle all HTTP error cases.** Timeout, 429, 500, malformed JSON.
5. **Normalize entity names via alias tables.** "Ozempic" → "Semaglutide".
6. **Validate claims before assertion conversion.** Use `Validator::new(&domain).validate()`.
7. **Use `#[instrument]` on extractor methods.** Critical for debugging failed extractions.
8. **Add tests for new extractors with mock HTTP.** Use `mockito` or similar.
9. **Document source API quirks.** Rate limits, pagination, field meanings.
10. **Keep extractors focused.** One source = one extractor. Don't combine FDA + PubMed.
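
For item 5, a minimal alias table could look like the sketch below. `AliasTable` and its methods are hypothetical names chosen for illustration; the crate's own alias handling may be structured differently.

```rust
use std::collections::HashMap;

/// Hypothetical alias table: maps lowercase aliases to canonical entity names.
pub struct AliasTable {
    canonical: HashMap<String, String>,
}

impl AliasTable {
    /// Builds the table from `(alias, canonical)` pairs. Aliases are stored
    /// lowercase so lookups are case-insensitive.
    pub fn new(pairs: &[(&str, &str)]) -> Self {
        let canonical = pairs
            .iter()
            .map(|(alias, name)| (alias.to_lowercase(), name.to_string()))
            .collect();
        AliasTable { canonical }
    }

    /// Returns the canonical name, or `None` for unknown input so the caller
    /// can decide whether to skip the claim or lower its confidence.
    pub fn normalize(&self, raw: &str) -> Option<&str> {
        self.canonical
            .get(&raw.trim().to_lowercase())
            .map(String::as_str)
    }
}
```

Returning `None` for unknown names, rather than passing the raw string through, keeps unnormalized entities from producing subjects that never collide with anything.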
## Do Not
1. **Use `unwrap()` or `expect()` in extractors.** External data is untrusted.
2. **Hardcode subject strings.** Always use `SubjectBuilder::build()`.
3. **Skip the source_hash.** Provenance must be hashable for deduplication.
4. **Mix source classes.** FDA labels are Regulatory, not Clinical.
5. **Forget async/Send+Sync bounds.** Extractors run in async contexts.
6. **Trust external API field names.** APIs change; handle missing fields.
7. **Over-promise confidence.** If you're parsing prose, confidence < 0.9.
8. **Block on external APIs without timeout.** Default 30s, configurable.
9. **Skip the metadata field.** Store API version, date accessed, NDC codes.
10. **Commit without running `cargo test -p stemedb-ontology`.** Extractor tests catch regressions.
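
For item 3, the sketch below shows one way provenance could be hashed and used for deduplication. Everything here is an assumption for illustration: the `Provenance` struct, `source_hash`, and `dedup_by_source` are hypothetical, and `DefaultHasher` is used only because it ships with the standard library. Its output is not guaranteed stable across Rust releases, so a persisted hash would need a fixed algorithm instead.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Hypothetical provenance record built from the `MedicalClaim` fields.
#[derive(Debug, Clone, Hash)]
pub struct Provenance {
    pub source_url: String,
    pub source_section: String,
    pub quote: String,
}

impl Provenance {
    pub fn new(url: &str, section: &str, quote: &str) -> Self {
        Provenance {
            source_url: url.to_string(),
            source_section: section.to_string(),
            quote: quote.to_string(),
        }
    }
}

/// In-process hash over all provenance fields (illustrative, not persistent).
pub fn source_hash(p: &Provenance) -> u64 {
    let mut hasher = DefaultHasher::new();
    p.hash(&mut hasher);
    hasher.finish()
}

/// Keeps the first record for each distinct provenance hash.
pub fn dedup_by_source(items: Vec<Provenance>) -> Vec<Provenance> {
    let mut seen = HashSet::new();
    items
        .into_iter()
        .filter(|p| seen.insert(source_hash(p)))
        .collect()
}
```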
## Decision Points
### Adding a New Extractor
Stop. Questions:
- What source class does this data belong to?
- What predicates can you reliably extract?
- What subject pattern do those predicates use?
- How do you handle rate limiting and failures?
- What provenance fields can you populate?
### Adding a New Predicate Category
Stop. Questions:
- What entities make up the subject? (`{Drug}` vs `{Drug}:{Indication}`)
- What existing predicates belong in this category?
- What's the default lens (Recency, Consensus, Authority)?
- Do subjects built with this pattern collide correctly?
### Adding a New Domain
Stop. Questions:
- What entity types exist? (Drug, Gene, Company, Security, etc.)
- What are the predicate categories and their subject patterns?
- What's the source hierarchy for this vertical?
- Who will build extractors for this domain?
## Constraints
**NEVER:**
- Use `unwrap()` or `expect()` in production code
- Manually concatenate subject strings
- Trust external API responses without validation
- Block indefinitely on HTTP requests
- Skip provenance fields (source_url, quote)
- Mutate existing assertions (append-only)
**ALWAYS:**
- Use `SubjectBuilder::build()` for subjects
- Include meaningful error context in `ExtractError`
- Run `cargo clippy --workspace -- -D warnings` before commit
- Add tests for new extractor logic
- Document API rate limits and quirks
- Use `#[instrument]` on public extractor methods
## Testing Commands
```bash
# Full ontology test suite
cargo test -p stemedb-ontology
# Run specific extractor tests
cargo test -p stemedb-ontology fda
# Run with logging
RUST_LOG=stemedb_ontology=debug cargo test -p stemedb-ontology
# Lint check
cargo clippy -p stemedb-ontology -- -D warnings
# Format check
cargo fmt -p stemedb-ontology --check
# Run the FDA CLI tool (if available)
cargo run -p stemedb-ontology --bin fda-lookup -- semaglutide
```
## Common Workflows
### Adding a New Extractor
1. Create `src/pharma/extractors/{source}.rs`
2. Implement `MedicalExtractor` trait:
   - `name()`: human-readable identifier
   - `source_class()`: tier for this source
   - `can_handle()`: which `SourceInput` variants work
   - `extract()`: async extraction with proper error handling
3. Add HTTP client with timeout and retry logic
4. Re-export from `src/pharma/extractors/mod.rs`
5. Add tests with mock HTTP responses
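
For the retry logic in step 3, the sketch below shows a bounded retry loop with exponential backoff. It is synchronous and uses only the standard library to stay self-contained; `Attempt` and `with_retry` are hypothetical names, and a real extractor would use the async sleep of its runtime rather than `std::thread::sleep`.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Outcome of a single attempt (hypothetical type for this sketch).
pub enum Attempt<T> {
    Done(T),
    /// Transient failure such as a timeout or 429; worth retrying.
    Retry(String),
    /// Permanent failure such as malformed input; stop immediately.
    Fatal(String),
}

/// Runs `op` up to `max_attempts` times, doubling the delay after each
/// retryable failure. Returns the last error if every attempt fails.
pub fn with_retry<T>(
    max_attempts: u32,
    base_delay: Duration,
    mut op: impl FnMut(u32) -> Attempt<T>,
) -> Result<T, String> {
    let mut last_error = String::from("no attempts made");
    for attempt in 0..max_attempts {
        match op(attempt) {
            Attempt::Done(value) => return Ok(value),
            Attempt::Fatal(err) => return Err(err),
            Attempt::Retry(err) => {
                last_error = err;
                // Do not sleep after the final attempt; cap the exponent so
                // the multiplier stays small.
                if attempt + 1 < max_attempts {
                    sleep(base_delay * 2u32.pow(attempt.min(10)));
                }
            }
        }
    }
    Err(last_error)
}
```

Separating retryable from fatal outcomes avoids hammering an API with requests that can never succeed.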
### Adding a New Predicate to Pharma
1. Determine which schema category it belongs to (efficacy, safety, mechanism)
2. Add to the `predicates` list in `src/pharma/definition.rs`
3. If a new category is needed:
   - Create a new `PredicateSchema` with the appropriate `subject_pattern`
   - Add it via `.with_predicate_schema()`
4. Update extractor to populate the new predicate
5. Add test case validating subject pattern
### Debugging Extraction Failures
1. Run with `RUST_LOG=stemedb_ontology=debug`
2. Check for HTTP errors (timeout, 429, 500)
3. Verify API response matches expected schema
4. Check if provenance fields are populated
5. Validate subject pattern matches schema
6. Inspect `ExtractError` variant and context
### Using StemeClient
```rust
use stemedb_ontology::client::StemeClient;
use stemedb_ontology::pharma::extractors::{FdaLabelExtractor, MedicalExtractor, SourceInput};

// Assumes an async context that returns `Result`, with `signing_key`,
// `agent_id`, and `hlc` already in scope.
let client = StemeClient::new("http://localhost:18180");
let extractor = FdaLabelExtractor::new();

// Extract claims from FDA
let claims = extractor.extract(&SourceInput::DrugName("semaglutide".into())).await?;

// Convert and submit
for claim in claims {
    let assertion = claim.to_assertion(&signing_key, agent_id, &hlc);
    let hash = client.assert(&assertion).await?;
}

// Query with skeptic lens (shows all conflicts)
let response = client.skeptic("Semaglutide:Type2Diabetes", "hba1c_change_percent").await?;
```
## Output Format
When implementing features or fixing bugs, provide:
```
## Summary
[One-line description]
## Changes
- [File]: [What changed]
## Testing
- [How to verify]
## Impact
- [Subject patterns affected, if any]
- [Extractors affected, if any]
```
## Domain Status Reference
| Domain | Status | Extractors |
|--------|--------|------------|
| Pharma | Active | FDA Labels |
| Finance | Planned | - |
| Consumer Health | Planned | - |