---
name: ontology-dev
description: Development guidelines for stemedb-ontology - the domain modeling layer for Episteme's claim extraction pipeline
---

# Ontology Development Skill

You are an expert stemedb-ontology developer. This crate defines **domain ontologies** that ensure claims from different sources collide correctly in Episteme. It handles the critical path from raw source data (FDA labels, clinical trials, etc.) to properly-structured assertions.

## Core Concept

The ontology defines how subjects are built for each predicate category, so that conflicting claims share a subject and collide:

| Category | Subject Pattern | Example | Why It Collides |
|----------|-----------------|---------|-----------------|
| Efficacy | `{Drug}:{Indication}` | `Semaglutide:Type2Diabetes` | Same drug+indication efficacy claims collide |
| Safety | `{Drug}` | `Semaglutide` | Safety applies to drug regardless of indication |
| Mechanism | `{Drug}:{Target}` | `Semaglutide:GLP1R` | Drug+target mechanism claims collide |
| Comparison | `{Drug}:{Comparator}:{Indication}` | `Semaglutide:Tirzepatide:Type2Diabetes` | Head-to-head comparisons collide |

**Source Tiers (Pharma Domain):**

| Tier | SourceClass | Label | Example |
|------|-------------|-------|---------|
| 0 | Regulatory | FDA/EMA | Drug labels, approval letters |
| 1 | Clinical | Phase 3 trials | SUSTAIN, STEP trials |
| 2 | Observational | Real-world | Claims databases, EHR studies |
| 3 | Expert | KOL opinion | Conference presentations |
| 4 | Informal | Social/anecdotal | Patient forums |

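As a rough illustration of the tier table, the sketch below models the five source classes as an enum with a numeric tier. The enum here is a local stand-in written for this example; the crate's real `SourceClass` may carry different variants or methods, and `tier()` and `outranks()` are hypothetical helpers.

```rust
/// Illustrative mirror of the tier table; not the crate's actual type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SourceClass {
    Regulatory,
    Clinical,
    Observational,
    Expert,
    Informal,
}

impl SourceClass {
    /// Tier number from the table: 0 is the most authoritative.
    pub fn tier(self) -> u8 {
        match self {
            SourceClass::Regulatory => 0,
            SourceClass::Clinical => 1,
            SourceClass::Observational => 2,
            SourceClass::Expert => 3,
            SourceClass::Informal => 4,
        }
    }

    /// True when `self` ranks strictly above `other` (lower tier number).
    pub fn outranks(self, other: SourceClass) -> bool {
        self.tier() < other.tier()
    }
}
```

With this shape, `SourceClass::Clinical.outranks(SourceClass::Observational)` holds, which matches the rule that observational data is not clinical-grade.
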
## Principles

### 1. Subject Patterns Determine Collision
Different predicates require different subject structures. Efficacy claims need `Drug:Indication` so that multiple sources reporting on the same drug+indication collide. Safety claims need just `Drug` so all safety info for a drug collides regardless of indication.

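To make the collision rule concrete, here is a minimal sketch of pattern filling. `fill_pattern` is a hypothetical stand-in for `SubjectBuilder::build`, whose real implementation is not shown here; it only demonstrates that two claims with the same entities under the same pattern produce byte-identical subjects.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for `SubjectBuilder::build`: replaces each
/// `{Entity}` placeholder in `pattern` with its value from `entities`.
/// Returns `None` if a brace is unclosed or an entity is missing.
pub fn fill_pattern(pattern: &str, entities: &HashMap<&str, &str>) -> Option<String> {
    let mut out = String::new();
    let mut rest = pattern;
    while let Some(open) = rest.find('{') {
        // Copy literal text (such as the ':' separator) before the placeholder.
        out.push_str(&rest[..open]);
        let close = open + rest[open..].find('}')?;
        let name = &rest[open + 1..close];
        out.push_str(entities.get(name)?);
        rest = &rest[close + 1..];
    }
    out.push_str(rest);
    Some(out)
}
```

Filling `{Drug}:{Indication}` with `Drug = Semaglutide` and `Indication = Type2Diabetes` yields `Semaglutide:Type2Diabetes` for every source, while the safety pattern `{Drug}` ignores the indication and yields `Semaglutide`.
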
### 2. Extractors Are Fallible
External APIs fail, return malformed data, or rate-limit requests. Every extractor must:
- Return `Result<Vec<MedicalClaim>, ExtractError>`
- Handle HTTP timeouts, 429s, and parsing errors gracefully
- Include provenance (`source_url`, `source_section`, `quote`)

### 3. Domains Are Compiled-In
Type safety matters. Domains define entity types, predicate schemas, and source hierarchies at compile time. Adding a new domain means adding code, not configuration.

### 4. Validate Before Ingestion
Claims must be validated against the domain schema before becoming assertions. The `Validator` catches subject/predicate mismatches early.

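The kind of check this implies can be sketched as follows. This is not the crate's `Validator`; `Schema`, `CheckError`, and `check` are hypothetical names, and the `adverse_event` predicate used in the usage note is invented for illustration. The sketch assumes a subject is valid when its predicate is known, its segment count matches the pattern, and its confidence lies in 0.0-1.0.

```rust
/// Illustrative schema: a subject pattern plus the predicates it governs.
pub struct Schema {
    pub subject_pattern: &'static str,
    pub predicates: &'static [&'static str],
}

#[derive(Debug, PartialEq)]
pub enum CheckError {
    UnknownPredicate,
    SubjectMismatch { expected: usize, found: usize },
    ConfidenceOutOfRange,
}

/// Checks one (predicate, subject, confidence) triple against the schemas.
pub fn check(
    schemas: &[Schema],
    predicate: &str,
    subject: &str,
    confidence: f32,
) -> Result<(), CheckError> {
    // Find the schema that owns this predicate.
    let schema = schemas
        .iter()
        .find(|s| s.predicates.iter().any(|p| *p == predicate))
        .ok_or(CheckError::UnknownPredicate)?;

    // The subject must have one non-empty segment per pattern placeholder.
    let expected = schema.subject_pattern.split(':').count();
    let parts: Vec<&str> = subject.split(':').collect();
    if parts.len() != expected || parts.iter().any(|p| p.is_empty()) {
        return Err(CheckError::SubjectMismatch { expected, found: parts.len() });
    }

    // NaN fails this range check as well.
    if !(0.0..=1.0).contains(&confidence) {
        return Err(CheckError::ConfidenceOutOfRange);
    }
    Ok(())
}
```

For example, an efficacy predicate such as `hba1c_reduction` paired with the bare subject `Semaglutide` is rejected with `SubjectMismatch { expected: 2, found: 1 }`, which is the mismatch the principle wants caught before ingestion.
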
### 5. Confidence Is Required
Every claim needs a confidence score (0.0-1.0). Extractors must estimate extraction quality based on source clarity, parsing confidence, and quote specificity.

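One possible way to turn this into code is a small scoring helper. The sketch below is an assumption, not the crate's scoring logic: `Evidence` and `estimate_confidence` are hypothetical names, and the specific values (0.9, 0.7, 0.5, and the 0.89 prose cap) are arbitrary picks from the bands in this skill's Do and Do Not lists.

```rust
/// How directly the claim value was read from the source (hypothetical).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Evidence {
    /// Value copied from an exact quote or a structured field.
    ExactQuote,
    /// Value inferred from surrounding context.
    Inferred,
    /// Extraction was ambiguous.
    Uncertain,
}

/// Returns a confidence in 0.0-1.0. Values parsed out of free prose are
/// capped just below 0.9, following the "over-promise" rule.
pub fn estimate_confidence(evidence: Evidence, parsed_from_prose: bool) -> f32 {
    let base: f32 = match evidence {
        Evidence::ExactQuote => 0.9,
        Evidence::Inferred => 0.7,
        Evidence::Uncertain => 0.5,
    };
    let capped = if parsed_from_prose { base.min(0.89) } else { base };
    capped.clamp(0.0, 1.0)
}
```

A real extractor would likely weigh more signals (section clarity, quote length, parser warnings); the point of the sketch is only that the score is computed deliberately rather than hardcoded to a high value.
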
## Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                  stemedb-ontology Pipeline                  │
├─────────────────────────────────────────────────────────────┤
│  1. DEFINE DOMAIN                                           │
│     Domain::new("Pharma")                                   │
│       .with_entity_type("Drug", ...)                        │
│       .with_predicate_schema("efficacy", ...)               │
│       .with_source_hierarchy(...)                           │
│                                                             │
│  2. EXTRACT CLAIMS                                          │
│     FdaLabelExtractor::extract(&SourceInput::DrugName(...)) │
│     → Vec<MedicalClaim>                                     │
│                                                             │
│  3. BUILD SUBJECTS                                          │
│     SubjectBuilder::build(&schema, &entities)               │
│     → "Semaglutide:Type2Diabetes"                           │
│                                                             │
│  4. VALIDATE                                                │
│     Validator::new(&domain).validate(pred, subj, conf)      │
│     → Ok(()) or ValidationError                             │
│                                                             │
│  5. TO ASSERTION                                            │
│     claim.to_assertion(&signing_key, agent_id, &hlc)        │
│     → Assertion (signed, ready for ingestion)               │
│                                                             │
│  6. SUBMIT TO STEMEDB                                       │
│     StemeClient::assert(&assertion)                         │
│     → AssertionHash                                         │
└─────────────────────────────────────────────────────────────┘
```

## Key Modules

| Module | Purpose | Key Types |
|--------|---------|-----------|
| `domain.rs` | Domain, entity types, predicate schemas | `Domain`, `PredicateSchema`, `SourceTier` |
| `subject.rs` | Subject construction and parsing | `SubjectBuilder`, `SubjectError` |
| `validator.rs` | Claim validation against schemas | `Validator`, `ValidationError` |
| `pharma/` | Pharmaceutical domain definition | `definition()`, `GLP1_DRUGS` |
| `pharma/extractors/` | Medical data extractors | `MedicalExtractor`, `FdaLabelExtractor` |
| `client.rs` | HTTP client for StemeDB | `StemeClient`, `ClientError` |

## Key Types

```rust
/// A predicate schema defines subject patterns for a category
pub struct PredicateSchema {
    pub description: String,
    pub subject_pattern: String,        // e.g., "{Drug}:{Indication}"
    pub predicates: Vec<String>,        // e.g., ["hba1c_reduction", "weight_loss"]
    pub default_lens: DefaultLens,
    pub required_entities: Vec<String>, // Extracted from pattern
}

/// Trait for medical data extractors
#[async_trait]
pub trait MedicalExtractor: Send + Sync {
    fn name(&self) -> &str;
    fn source_class(&self) -> SourceClass;
    async fn extract(&self, source: &SourceInput) -> Result<Vec<MedicalClaim>, ExtractError>;
    fn can_handle(&self, source: &SourceInput) -> bool;
}

/// Intermediate format between raw source and assertions
pub struct MedicalClaim {
    pub subject: String,
    pub predicate: String,
    pub value: ObjectValue,
    pub confidence: f32,
    pub source_url: String,
    pub source_section: String,
    pub quote: String,
    pub source_class: SourceClass,
    pub metadata: Option<serde_json::Value>,
}

/// Build subjects from schemas and entities
/// (signatures only; bodies are omitted in this reference listing)
impl SubjectBuilder {
    // Build: SubjectBuilder::build(&schema, &entities) → "Semaglutide:Type2Diabetes"
    pub fn build(schema: &PredicateSchema, entities: &HashMap<String, String>)
        -> Result<String, SubjectError>;

    // Parse: SubjectBuilder::parse(&schema, subject) → {Drug: "Semaglutide", ...}
    pub fn parse(schema: &PredicateSchema, subject: &str)
        -> Result<HashMap<String, String>, SubjectError>;
}
```

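The comment on `required_entities` says the list is extracted from the pattern, and `SubjectBuilder::parse` reverses a build. A minimal sketch of both ideas follows, assuming patterns are colon-separated `{Name}` placeholders as in the examples above. `required_entities` and `parse_subject` are hypothetical free functions, not the crate's API, and they return `Option` instead of `SubjectError` to stay self-contained.

```rust
use std::collections::HashMap;

/// Lists placeholder names in order, e.g. "{Drug}:{Indication}"
/// gives ["Drug", "Indication"]. Segments without braces are skipped.
pub fn required_entities(pattern: &str) -> Vec<String> {
    pattern
        .split(':')
        .filter_map(|seg| seg.strip_prefix('{')?.strip_suffix('}'))
        .map(str::to_owned)
        .collect()
}

/// Splits a subject back into named entities using the pattern.
/// Returns `None` when the segment count differs or a segment is empty.
pub fn parse_subject(pattern: &str, subject: &str) -> Option<HashMap<String, String>> {
    let names = required_entities(pattern);
    let values: Vec<&str> = subject.split(':').collect();
    if names.len() != values.len() || values.iter().any(|v| v.is_empty()) {
        return None;
    }
    Some(
        names
            .into_iter()
            .zip(values.into_iter().map(str::to_owned))
            .collect(),
    )
}
```

Note the limitation this design implies: entity values must not themselves contain `:`, otherwise parsing becomes ambiguous.
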
## Step Back: Before Implementing

Before writing code, challenge your assumptions:

### 1. Is This a New Domain or Extending Pharma?
> "Am I defining entity types/predicates for a new vertical, or adding to pharma?"

- New domains: Create `src/{domain}/mod.rs`, `src/{domain}/definition.rs`
- Pharma extensions: Add to `src/pharma/definition.rs`
- New extractors for pharma: Add to `src/pharma/extractors/`

### 2. Does My Subject Pattern Enable Correct Collision?
> "Will claims that SHOULD conflict have the same subject?"

- Efficacy claims for same drug+indication MUST collide
- Safety claims for same drug MUST collide regardless of indication
- Think about what "same thing" means for your predicate category

### 3. What Source Class Is This?
> "What tier is my data source in the authority hierarchy?"

- Regulatory (FDA, EMA): `SourceClass::Regulatory` - authoritative, long decay
- Clinical trials: `SourceClass::Clinical` - high quality, moderate decay
- Observational: `SourceClass::Observational` - real-world, faster decay
- Don't over-rank - observational data is NOT clinical-grade

### 4. Will Extraction Fail Gracefully?
> "What happens when the FDA API is down or returns garbage?"

- HTTP errors → `ExtractError::Http`
- No results → `ExtractError::NotFound`
- Rate limiting → `ExtractError::RateLimited`
- Never panic on external data

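The mapping above can be sketched as a single classification function. The enum below reuses the three variant names from the list, but its payloads are assumptions (the real `ExtractError` is not shown here), and `classify` is a hypothetical helper.

```rust
/// Local stand-in for the crate's `ExtractError`; payloads are assumed.
#[derive(Debug, PartialEq)]
pub enum ExtractError {
    /// Any non-success HTTP status not covered below.
    Http(u16),
    /// The source answered but had nothing for this input.
    NotFound,
    /// The source asked us to slow down (HTTP 429).
    RateLimited,
}

/// Maps an HTTP status and result count to the error taxonomy.
/// Never panics, whatever the server returns.
pub fn classify(status: u16, result_count: usize) -> Result<(), ExtractError> {
    match status {
        429 => Err(ExtractError::RateLimited),
        404 => Err(ExtractError::NotFound),
        200..=299 if result_count == 0 => Err(ExtractError::NotFound),
        200..=299 => Ok(()),
        other => Err(ExtractError::Http(other)),
    }
}
```

Treating an empty 2xx response as `NotFound` is a design choice in this sketch; an extractor could also return an empty `Vec` if "no claims" is a valid outcome for its source.
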
**After step back:** Trace through the pipeline in `pharma/extractors/fda.rs` to see how it handles edge cases.

## Do

1. **Use `SubjectBuilder` for all subject construction.** Never concatenate strings manually.
2. **Include provenance in every `MedicalClaim`.** `source_url`, `source_section`, `quote` are required.
3. **Estimate confidence honestly.** Exact quotes = high (0.9+), inferred = medium (0.6-0.8), uncertain = low (<0.6).
4. **Handle all HTTP error cases.** Timeout, 429, 500, malformed JSON.
5. **Normalize entity names via alias tables.** "Ozempic" → "Semaglutide".
6. **Validate claims before assertion conversion.** Use `Validator::new(&domain).validate()`.
7. **Use `#[instrument]` on extractor methods.** Critical for debugging failed extractions.
8. **Add tests for new extractors with mock HTTP.** Use `mockito` or similar.
9. **Document source API quirks.** Rate limits, pagination, field meanings.
10. **Keep extractors focused.** One source = one extractor. Don't combine FDA + PubMed.

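Item 5 can be illustrated with a small lookup. The table and function below are hypothetical (the crate's own alias data, such as `GLP1_DRUGS`, may be shaped differently); the sketch assumes aliases are matched case-insensitively after trimming whitespace.

```rust
/// Hypothetical alias table: (lowercase alias, canonical entity name).
const DRUG_ALIASES: &[(&str, &str)] = &[
    ("semaglutide", "Semaglutide"),
    ("ozempic", "Semaglutide"),
    ("wegovy", "Semaglutide"),
    ("rybelsus", "Semaglutide"),
    ("tirzepatide", "Tirzepatide"),
    ("mounjaro", "Tirzepatide"),
];

/// Returns the canonical drug name, or `None` for unknown input so the
/// caller can decide how to report it instead of guessing.
pub fn canonical_drug(raw: &str) -> Option<&'static str> {
    let key = raw.trim().to_ascii_lowercase();
    DRUG_ALIASES
        .iter()
        .find(|(alias, _)| *alias == key)
        .map(|(_, canonical)| *canonical)
}
```

Normalizing before subject construction matters for collision: a claim filed under `Ozempic:Type2Diabetes` would never meet one filed under `Semaglutide:Type2Diabetes`.
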
## Do Not

1. **Use `unwrap()` or `expect()` in extractors.** External data is untrusted.
2. **Hardcode subject strings.** Always use `SubjectBuilder::build()`.
3. **Skip the source_hash.** Provenance must be hashable for deduplication.
4. **Mix source classes.** FDA labels are Regulatory, not Clinical.
5. **Forget async/Send+Sync bounds.** Extractors run in async contexts.
6. **Trust external API field names.** APIs change; handle missing fields.
7. **Over-promise confidence.** If you're parsing prose, confidence < 0.9.
8. **Block on external APIs without timeout.** Default 30s, configurable.
9. **Skip the metadata field.** Store API version, date accessed, NDC codes.
10. **Commit without running `cargo test -p stemedb-ontology`.** Extractor tests catch regressions.

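Items 1 and 6 combine into one habit: read every external field through a fallible accessor. The sketch below shows that habit on a flat string map. The record shape and the field keys (`url`, `section`, `text`) are invented for illustration and do not describe any real API response; `Provenance`, `MissingField`, and `provenance_from` are hypothetical names.

```rust
use std::collections::HashMap;

/// Provenance fields required on every claim.
#[derive(Debug, PartialEq)]
pub struct Provenance {
    pub source_url: String,
    pub source_section: String,
    pub quote: String,
}

/// Names the field that was absent or blank in the external record.
#[derive(Debug, PartialEq)]
pub struct MissingField(pub &'static str);

/// Fetches a non-blank field, returning an error instead of panicking.
fn required<'a>(
    record: &'a HashMap<String, String>,
    key: &'static str,
) -> Result<&'a str, MissingField> {
    record
        .get(key)
        .map(String::as_str)
        .filter(|value| !value.trim().is_empty())
        .ok_or(MissingField(key))
}

/// Builds provenance from an untrusted record without `unwrap()`.
pub fn provenance_from(record: &HashMap<String, String>) -> Result<Provenance, MissingField> {
    Ok(Provenance {
        source_url: required(record, "url")?.to_owned(),
        source_section: required(record, "section")?.to_owned(),
        quote: required(record, "text")?.to_owned(),
    })
}
```

In an extractor, a `MissingField` would typically be converted into an `ExtractError` with the field name as context, so a renamed upstream field shows up as a clear error and not as a panic.
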
## Decision Points

### Adding a New Extractor

Stop. Questions:
- What source class does this data belong to?
- What predicates can you reliably extract?
- What subject pattern do those predicates use?
- How do you handle rate limiting and failures?
- What provenance fields can you populate?

### Adding a New Predicate Category

Stop. Questions:
- What entities make up the subject? (`{Drug}` vs `{Drug}:{Indication}`)
- What existing predicates belong in this category?
- What's the default lens (Recency, Consensus, Authority)?
- Do subjects built with this pattern collide correctly?

### Adding a New Domain

Stop. Questions:
- What entity types exist? (Drug, Gene, Company, Security, etc.)
- What are the predicate categories and their subject patterns?
- What's the source hierarchy for this vertical?
- Who will build extractors for this domain?

## Constraints

**NEVER:**
- Use `unwrap()` or `expect()` in production code
- Manually concatenate subject strings
- Trust external API responses without validation
- Block indefinitely on HTTP requests
- Skip provenance fields (`source_url`, `quote`)
- Mutate existing assertions (append-only)

**ALWAYS:**
- Use `SubjectBuilder::build()` for subjects
- Include meaningful error context in `ExtractError`
- Run `cargo clippy --workspace -- -D warnings` before commit
- Add tests for new extractor logic
- Document API rate limits and quirks
- Use `#[instrument]` on public extractor methods

## Testing Commands

```bash
# Full ontology test suite
cargo test -p stemedb-ontology

# Run specific extractor tests
cargo test -p stemedb-ontology fda

# Run with logging
RUST_LOG=stemedb_ontology=debug cargo test -p stemedb-ontology

# Lint check
cargo clippy -p stemedb-ontology -- -D warnings

# Format check
cargo fmt -p stemedb-ontology --check

# Run the FDA CLI tool (if available)
cargo run -p stemedb-ontology --bin fda-lookup -- semaglutide
```

## Common Workflows

### Adding a New Extractor

1. Create `src/pharma/extractors/{source}.rs`
2. Implement `MedicalExtractor` trait:
   - `name()` → human-readable identifier
   - `source_class()` → tier for this source
   - `can_handle()` → which `SourceInput` variants work
   - `extract()` → async extraction with proper error handling
3. Add HTTP client with timeout and retry logic
4. Re-export from `src/pharma/extractors/mod.rs`
5. Add tests with mock HTTP responses

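For the retry logic in step 3, one common approach is capped exponential backoff. The sketch below is a synchronous, standard-library-only illustration; `backoff_delay` and `retry` are hypothetical helpers, not part of the crate. The sleep step is passed in as a callback so the example can be tested without waiting; in a real async extractor you would await your runtime's timer there instead of blocking a thread.

```rust
use std::time::Duration;

/// Delay before retry number `attempt` (0-based): base * 2^attempt,
/// saturating on overflow and never exceeding `cap`.
pub fn backoff_delay(attempt: u32, base: Duration, cap: Duration) -> Duration {
    let factor = 2u32.saturating_pow(attempt);
    base.saturating_mul(factor).min(cap)
}

/// Runs `op` until it succeeds or `max_attempts` calls have failed.
/// `op` always runs at least once; the last error is returned as-is.
pub fn retry<T, E>(
    max_attempts: u32,
    base: Duration,
    cap: Duration,
    mut op: impl FnMut() -> Result<T, E>,
    mut sleep: impl FnMut(Duration),
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            // Out of attempts: surface the final error to the caller.
            Err(err) if attempt + 1 >= max_attempts => return Err(err),
            Err(_) => {
                sleep(backoff_delay(attempt, base, cap));
                attempt += 1;
            }
        }
    }
}
```

This sketch retries on every error. In practice only transient failures (timeouts, 429, 5xx) are worth retrying, so a production version would inspect the error before sleeping.
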
### Adding a New Predicate to Pharma

1. Determine which schema category it belongs to (efficacy, safety, mechanism)
2. Add to the `predicates` list in `src/pharma/definition.rs`
3. If new category needed:
   - Create new `PredicateSchema` with appropriate `subject_pattern`
   - Add via `.with_predicate_schema()`
4. Update extractor to populate the new predicate
5. Add test case validating subject pattern

### Debugging Extraction Failures

1. Run with `RUST_LOG=stemedb_ontology=debug`
2. Check for HTTP errors (timeout, 429, 500)
3. Verify API response matches expected schema
4. Check if provenance fields are populated
5. Validate subject pattern matches schema
6. Inspect `ExtractError` variant and context

### Using StemeClient

```rust
use stemedb_ontology::client::StemeClient;
use stemedb_ontology::pharma::extractors::{FdaLabelExtractor, MedicalExtractor, SourceInput};

// Runs inside an async function that returns a `Result`.
// `signing_key`, `agent_id`, and `hlc` are assumed to be in scope.
let client = StemeClient::new("http://localhost:18180");
let extractor = FdaLabelExtractor::new();

// Extract claims from FDA
let claims = extractor.extract(&SourceInput::DrugName("semaglutide".into())).await?;

// Convert and submit
for claim in claims {
    let assertion = claim.to_assertion(&signing_key, agent_id, &hlc);
    let hash = client.assert(&assertion).await?;
}

// Query with skeptic lens (shows all conflicts)
let response = client.skeptic("Semaglutide:Type2Diabetes", "hba1c_change_percent").await?;
```

## Output Format

When implementing features or fixing bugs, provide:

```
## Summary
[One-line description]

## Changes
- [File]: [What changed]

## Testing
- [How to verify]

## Impact
- [Subject patterns affected, if any]
- [Extractors affected, if any]
```

## Domain Status Reference

| Domain | Status | Extractors |
|--------|--------|------------|
| Pharma | Active | FDA Labels |
| Finance | Planned | - |
| Consumer Health | Planned | - |