# Adding a New Domain to stemedb-ontology

This guide walks you through implementing a new domain (vertical) in the stemedb-ontology crate. By the end, you'll have a working domain with entity types, predicate schemas, and optional extractors.

**Time:** ~30 minutes
**Prerequisites:** Rust knowledge, familiarity with StemeDB concepts

## Overview

A domain in stemedb-ontology defines:

1. **Entity Types** - The kinds of things in your domain (e.g., Drug, Company, Asset)
2. **Predicate Schemas** - How subjects are built for different predicate categories
3. **Source Hierarchy** - How to weight different source authorities
4. **Extractors (optional)** - Code that extracts claims from external sources

## Step 1: Plan Your Domain Model

Before writing code, answer these questions:

### What entities exist in your domain?

| Entity | Description | Example Values |
|--------|-------------|----------------|
| ? | ? | ? |

**Pharma example:**

| Entity | Description | Example Values |
|--------|-------------|----------------|
| Drug | Pharmaceutical compound | Semaglutide, Tirzepatide |
| Indication | Medical condition | Type2Diabetes, Obesity |
| Target | Molecular target | GLP1R, GIPR |

### What predicates will you track?

Group predicates by category. The category determines the subject pattern:

| Category | Subject Pattern | Example Predicates |
|----------|-----------------|-------------------|
| ? | ? | ? |

**Pharma example:**

| Category | Subject Pattern | Example Predicates |
|----------|-----------------|-------------------|
| Efficacy | `{Drug}:{Indication}` | hba1c_reduction_percent, weight_loss_percent |
| Safety | `{Drug}` | nausea_rate, has_boxed_warning |
| Mechanism | `{Drug}:{Target}` | binding_affinity, mechanism_of_action |

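To make the subject patterns concrete, here is a minimal sketch of how a pattern such as `{Drug}:{Indication}` could be filled in with entity values. The `render_subject` helper is hypothetical and written only for illustration; it is not part of the stemedb-ontology API, and the crate may build subjects differently.

```rust
use std::collections::HashMap;

/// Hypothetical helper: fill a subject pattern such as "{Drug}:{Indication}"
/// with entity values keyed by entity type name.
/// Returns `None` if a placeholder is unclosed or has no value.
fn render_subject(pattern: &str, values: &HashMap<&str, &str>) -> Option<String> {
    let mut out = String::new();
    let mut rest = pattern;
    while let Some(start) = rest.find('{') {
        // Copy literal text before the placeholder (e.g. the ':' separator).
        out.push_str(&rest[..start]);
        let end = start + rest[start..].find('}')?;
        let key = &rest[start + 1..end];
        out.push_str(values.get(key)?);
        rest = &rest[end + 1..];
    }
    out.push_str(rest);
    Some(out)
}
```

With `Drug = "Semaglutide"` and `Indication = "Type2Diabetes"`, the efficacy pattern yields `Semaglutide:Type2Diabetes`, while the safety pattern `{Drug}` yields just `Semaglutide`. Claims about the same predicate only compete when their rendered subjects are identical, which is why the choice of pattern matters.
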
### What sources will provide data?

Order from most to least authoritative:

| Tier | Source Class | Examples | Weight |
|------|--------------|----------|--------|
| 0 | Regulatory | ? | 1.0 |
| 1 | Clinical | ? | 0.9 |
| ... | ... | ... | ... |

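The templates in this guide pair each tier weight with an optional decay in days, described there as a half-life. The sketch below shows one way such a half-life could be applied to a weight. It assumes plain exponential decay; `decayed_weight` is a hypothetical function, and the formula the crate actually uses is not shown in this guide.

```rust
/// Hypothetical helper: weight of a source after `age_days`, assuming the
/// weight halves every `half_life_days`. `None` means the tier never decays.
fn decayed_weight(base_weight: f64, half_life_days: Option<u32>, age_days: f64) -> f64 {
    match half_life_days {
        Some(half_life) if half_life > 0 => {
            // Exponential decay: one half-life elapsed halves the weight.
            base_weight * 0.5_f64.powf(age_days / f64::from(half_life))
        }
        // No decay configured (or a zero half-life): keep the base weight.
        _ => base_weight,
    }
}
```

Under this assumption, a Tier 1 source with weight 0.9 and a 730-day half-life would count as 0.45 after two years, while a Tier 0 source with no decay keeps its full weight indefinitely.
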
## Step 2: Create Domain Module

Create the directory structure:

```
crates/stemedb-ontology/src/
  {domain}/
    mod.rs           # Re-exports
    definition.rs    # Domain::new() builder
```

### Template: `{domain}/mod.rs`

```rust
//! {Domain} domain ontology.
//!
//! This module defines the {domain} vertical with:
//! - Entity types (...)
//! - Predicate schemas (...)
//! - Source hierarchy (...)

pub mod definition;

pub use definition::definition;

// Re-export domain-specific types if any
// pub use definition::{...};
```

### Template: `{domain}/definition.rs`

```rust
//! Compiled-in {domain} domain definition.

use crate::domain::{
    DefaultLens, Domain, EntityType, NamingConvention, PredicateSchema, SourceTier,
};
use stemedb_core::types::SourceClass;

/// Build the {domain} domain definition.
pub fn definition() -> Domain {
    let mut domain = Domain::new(
        "{Domain}",
        "Description of what this domain covers",
    );

    // -------------------------------------------------------------------------
    // Entity Types
    // -------------------------------------------------------------------------

    // Primary entity (e.g., the main subject of claims)
    domain = domain.with_entity_type(
        "{PrimaryEntity}",
        EntityType::required("Description")
            .with_naming(NamingConvention::CamelCase)
            // Add aliases for common variations
            .with_alias("ALIAS", "Canonical"),
    );

    // Secondary entity (for compound subjects)
    domain = domain.with_entity_type(
        "{SecondaryEntity}",
        EntityType::required("Description")
            .with_naming(NamingConvention::CamelCase),
    );

    // -------------------------------------------------------------------------
    // Predicate Schemas
    // -------------------------------------------------------------------------

    // Category 1: Primary predicates (single entity subject)
    domain = domain.with_predicate_schema(
        "category1",
        PredicateSchema::new(
            "Description of this predicate category",
            "{PrimaryEntity}",
        )
        .with_predicates(vec![
            "predicate_one",
            "predicate_two",
        ])
        .with_default_lens(DefaultLens::Recency),
    );

    // Category 2: Compound predicates (multi-entity subject)
    domain = domain.with_predicate_schema(
        "category2",
        PredicateSchema::new(
            "Description",
            "{PrimaryEntity}:{SecondaryEntity}",
        )
        .with_predicates(vec![
            "compound_predicate",
        ])
        .with_default_lens(DefaultLens::LayeredConsensus),
    );

    // -------------------------------------------------------------------------
    // Source Hierarchy
    // -------------------------------------------------------------------------

    domain = domain.with_source_hierarchy(vec![
        SourceTier::new(SourceClass::Regulatory, "Tier 0: Official Sources")
            .with_examples(vec!["Government agencies", "Standards bodies"])
            .with_weight(1.0),
        SourceTier::new(SourceClass::Clinical, "Tier 1: Primary Research")
            .with_examples(vec!["Peer-reviewed journals", "Research institutions"])
            .with_weight(0.9)
            .with_decay(730), // 2 year half-life
        SourceTier::new(SourceClass::Observational, "Tier 2: Secondary Analysis")
            .with_examples(vec!["Industry reports", "Analyst research"])
            .with_weight(0.7)
            .with_decay(365),
        SourceTier::new(SourceClass::Expert, "Tier 3: Expert Opinion")
            .with_examples(vec!["Industry experts", "Consultants"])
            .with_weight(0.5)
            .with_decay(180),
        SourceTier::new(SourceClass::Community, "Tier 4: Community")
            .with_examples(vec!["Professional forums", "Curated discussions"])
            .with_weight(0.3)
            .with_decay(90),
        SourceTier::new(SourceClass::Anecdotal, "Tier 5: Anecdotal")
            .with_examples(vec!["Social media", "Blog posts"])
            .with_weight(0.1)
            .with_decay(30),
    ]);

    domain
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_definition_builds() {
        let domain = definition();
        assert_eq!(domain.name, "{Domain}");
        assert!(!domain.entity_types.is_empty());
        assert!(!domain.predicate_schemas.is_empty());
        assert!(!domain.source_hierarchy.is_empty());
    }

    #[test]
    fn test_entity_normalization() {
        let domain = definition();
        let entity = domain.get_entity_type("{PrimaryEntity}").expect("entity exists");

        // Test alias normalization
        assert_eq!(entity.normalize("ALIAS"), "Canonical");
        assert_eq!(entity.normalize("Canonical"), "Canonical");
    }

    #[test]
    fn test_predicate_schema_lookup() {
        let domain = definition();

        // Direct lookup
        let schema = domain.get_schema("category1").expect("schema exists");
        assert_eq!(schema.subject_pattern, "{PrimaryEntity}");

        // Lookup by predicate
        let schema = domain.schema_for_predicate("predicate_one").expect("found");
        assert!(schema.predicates.contains(&"predicate_one".to_string()));
    }
}
```

## Step 3: Implement Extractors (Optional)

If your domain has external data sources, implement the `MedicalExtractor` trait.

### Directory Structure

```
crates/stemedb-ontology/src/
  {domain}/
    mod.rs
    definition.rs
    extractors/
      mod.rs
      {source}.rs
```

### Template: `{domain}/extractors/mod.rs`

```rust
//! Data extractors for {domain}.

mod {source};

pub use {source}::{Source}Extractor;

// Re-export the shared extractor trait and types, which live in the pharma module
pub use crate::pharma::extractors::{
    ExtractError, MedicalClaim, MedicalExtractor, RetryConfig, SourceInput,
};
```

### Template: `{domain}/extractors/{source}.rs`

```rust
//! {Source} data extractor.

use super::{ExtractError, MedicalClaim, MedicalExtractor, SourceInput};
use async_trait::async_trait;
use stemedb_core::types::{ObjectValue, SourceClass};

/// Extractor for {Source} data.
pub struct {Source}Extractor {
    http_client: reqwest::Client,
    base_url: String,
}

impl {Source}Extractor {
    /// Create a new extractor.
    pub fn new() -> Self {
        Self {
            http_client: reqwest::Client::new(),
            base_url: "https://api.example.com".to_string(),
        }
    }
}

impl Default for {Source}Extractor {
    fn default() -> Self {
        Self::new()
    }
}

#[async_trait]
impl MedicalExtractor for {Source}Extractor {
    fn name(&self) -> &str {
        "{Source} Extractor"
    }

    fn source_class(&self) -> SourceClass {
        SourceClass::Regulatory // Adjust based on source authority
    }

    fn can_handle(&self, source: &SourceInput) -> bool {
        matches!(source, SourceInput::DrugName(_) | SourceInput::Url(_))
    }

    async fn extract(&self, source: &SourceInput) -> Result<Vec<MedicalClaim>, ExtractError> {
        let query = match source {
            SourceInput::DrugName(name) => name.clone(),
            SourceInput::Url(url) => url.clone(),
            _ => return Err(ExtractError::NotFound("Unsupported input type".into())),
        };

        // Fetch data from source
        let url = format!("{}/search?q={}", self.base_url, urlencoding::encode(&query));
        let response = self.http_client.get(&url).send().await?;

        if !response.status().is_success() {
            return Err(ExtractError::ApiError(format!(
                "HTTP {}", response.status()
            )));
        }

        // Parse response and extract claims.
        // Example claim; replace with values parsed from `response`.
        // Built with `vec![]` rather than `Vec::new()` followed by `push`,
        // which clippy flags when run with `-D warnings`.
        let claims = vec![
            MedicalClaim::new(
                "Subject",
                "predicate_name",
                ObjectValue::Float(42.0),
            )
            .with_confidence(0.9)
            .with_source_url(&url)
            .with_source_section("Section Name")
            .with_quote("Supporting quote from source")
            .with_source_class(self.source_class()),
        ];

        Ok(claims)
    }
}
```

## Step 4: Create CLI Binary (Optional)

For user-facing domains, create a CLI tool.

### Template: `src/bin/steme_{domain}.rs`

```rust
//! CLI for {domain} domain operations.

use clap::Parser;
use stemedb_ontology::client::StemeClient;
use stemedb_ontology::{domain}::definition;

mod cli;
mod commands;

#[derive(Parser)]
#[command(name = "steme-{domain}")]
#[command(about = "{Domain} data operations for StemeDB")]
struct Cli {
    #[arg(long, default_value = "http://localhost:18180")]
    server: String,

    #[command(subcommand)]
    command: Commands,
}

#[derive(clap::Subcommand)]
enum Commands {
    /// Ingest data
    Ingest { /* args */ },
    /// Query data
    Query { /* args */ },
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let cli = Cli::parse();
    let client = StemeClient::new(&cli.server);

    match cli.command {
        Commands::Ingest { /* args */ } => {
            // Implementation
        }
        Commands::Query { /* args */ } => {
            // Implementation
        }
    }

    Ok(())
}
```

## Step 5: Testing Checklist

Before considering your domain complete:

- [ ] `cargo build -p stemedb-ontology` succeeds
- [ ] `definition()` returns a valid Domain
- [ ] All entity types have meaningful descriptions
- [ ] All predicate schemas have correct subject patterns
- [ ] Entity normalization works (aliases resolve correctly)
- [ ] `schema_for_predicate()` finds the right schema
- [ ] Source hierarchy has 6 tiers with decreasing weights
- [ ] (If extractors) `cargo test -p stemedb-ontology` passes

Run the tests:

```bash
cargo test -p stemedb-ontology
cargo clippy -p stemedb-ontology -- -D warnings
```

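The checklist item about the source hierarchy can also be checked mechanically. The sketch below works on a plain slice of weights so that it stays self-contained; the function names are hypothetical, and in a real test you would first collect the weights from `definition().source_hierarchy` (the name of the weight field on a tier is not shown in this guide).

```rust
/// Hypothetical helper: true if every weight is strictly lower than the one before it.
fn weights_strictly_decrease(weights: &[f64]) -> bool {
    weights.windows(2).all(|pair| pair[0] > pair[1])
}

/// Hypothetical helper: the checklist rule of 6 tiers with decreasing weights.
fn hierarchy_looks_valid(weights: &[f64]) -> bool {
    weights.len() == 6 && weights_strictly_decrease(weights)
}
```

The weights used in the templates, `[1.0, 0.9, 0.7, 0.5, 0.3, 0.1]`, pass this check. A hierarchy with two equal neighbouring weights or a missing tier does not.
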
## Step 6: Integration

### Export from lib.rs

Edit `crates/stemedb-ontology/src/lib.rs`:

```rust
// Add your domain module
pub mod {domain};

// Re-export for convenience
pub use {domain}::definition as {domain}_domain;
```

### Update ai-lookup

Add an entry to `ai-lookup/index.md` under the Domain Ontology section.

### Update CLAUDE.md routing (if significant)

If your domain is frequently used, add a routing entry in the Find Your Guide table.

## Complete Example: Cardiology Domain (Skeleton)

Here's a minimal working example for a cardiology domain:

```rust
// crates/stemedb-ontology/src/cardiology/mod.rs
//! Cardiology domain ontology.

pub mod definition;
pub use definition::definition;
```

```rust
// crates/stemedb-ontology/src/cardiology/definition.rs
use crate::domain::{DefaultLens, Domain, EntityType, NamingConvention, PredicateSchema, SourceTier};
use stemedb_core::types::SourceClass;

pub fn definition() -> Domain {
    let mut domain = Domain::new(
        "Cardiology",
        "Cardiovascular conditions, procedures, and outcomes",
    );

    // Entities
    domain = domain
        .with_entity_type(
            "Condition",
            EntityType::required("Cardiovascular condition")
                .with_naming(NamingConvention::CamelCase)
                .with_alias("MI", "MyocardialInfarction")
                .with_alias("CHF", "CongestiveHeartFailure")
                .with_alias("AF", "AtrialFibrillation"),
        )
        .with_entity_type(
            "Procedure",
            EntityType::required("Medical procedure")
                .with_naming(NamingConvention::CamelCase)
                .with_alias("CABG", "CoronaryArteryBypassGraft")
                .with_alias("PCI", "PercutaneousCoronaryIntervention"),
        )
        .with_entity_type(
            "Biomarker",
            EntityType::required("Diagnostic biomarker")
                .with_naming(NamingConvention::CamelCase),
        );

    // Schemas
    domain = domain
        .with_predicate_schema(
            "diagnosis",
            PredicateSchema::new("Diagnostic criteria", "{Condition}")
                .with_predicates(vec![
                    "diagnostic_criteria",
                    "staging_system",
                    "severity_classification",
                ])
                .with_default_lens(DefaultLens::Authority),
        )
        .with_predicate_schema(
            "outcome",
            PredicateSchema::new("Treatment outcomes", "{Condition}:{Procedure}")
                .with_predicates(vec![
                    "mortality_rate",
                    "complication_rate",
                    "readmission_rate",
                    "length_of_stay_days",
                ])
                .with_default_lens(DefaultLens::LayeredConsensus),
        )
        .with_predicate_schema(
            "biomarker",
            PredicateSchema::new("Biomarker thresholds", "{Biomarker}")
                .with_predicates(vec![
                    "normal_range",
                    "diagnostic_threshold",
                    "prognostic_value",
                ])
                .with_default_lens(DefaultLens::Consensus),
        );

    // Source hierarchy
    domain = domain.with_source_hierarchy(vec![
        SourceTier::new(SourceClass::Regulatory, "Tier 0: Guidelines")
            .with_examples(vec!["ACC/AHA Guidelines", "ESC Guidelines"])
            .with_weight(1.0),
        SourceTier::new(SourceClass::Clinical, "Tier 1: Clinical Trials")
            .with_examples(vec!["Landmark RCTs", "Meta-analyses"])
            .with_weight(0.9)
            .with_decay(730),
        SourceTier::new(SourceClass::Observational, "Tier 2: Registries")
            .with_examples(vec!["NCDR", "Get With The Guidelines"])
            .with_weight(0.7)
            .with_decay(365),
        SourceTier::new(SourceClass::Expert, "Tier 3: Expert Consensus")
            .with_examples(vec!["Consensus statements", "Textbooks"])
            .with_weight(0.5)
            .with_decay(180),
        SourceTier::new(SourceClass::Community, "Tier 4: Community")
            .with_examples(vec!["Medical forums", "CME discussions"])
            .with_weight(0.3)
            .with_decay(90),
        SourceTier::new(SourceClass::Anecdotal, "Tier 5: Anecdotal")
            .with_examples(vec!["Case reports", "Social media"])
            .with_weight(0.1)
            .with_decay(30),
    ]);

    domain
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_cardiology_domain() {
        let domain = definition();
        assert_eq!(domain.name, "Cardiology");

        // Check entity aliases
        let condition = domain.get_entity_type("Condition").unwrap();
        assert_eq!(condition.normalize("MI"), "MyocardialInfarction");

        // Check schema lookup
        let schema = domain.schema_for_predicate("mortality_rate").unwrap();
        assert_eq!(schema.subject_pattern, "{Condition}:{Procedure}");
    }
}
```

## Troubleshooting

### "Unknown predicate" errors

Your predicate isn't in any schema. Add it to the appropriate `with_predicates()` call.

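To see why a missing predicate fails, it can help to picture the lookup as a scan over every schema's predicate list. The following is a simplified stand-in, not the crate's implementation: the `Schema` struct and `find_schema` function are hypothetical and only mirror the `subject_pattern` and `predicates` fields used in the tests earlier in this guide.

```rust
/// Hypothetical, simplified stand-in for a predicate schema.
struct Schema {
    category: &'static str,
    subject_pattern: &'static str,
    predicates: Vec<&'static str>,
}

/// Hypothetical lookup: return the first schema that lists `predicate`,
/// or `None` if no schema does (the "unknown predicate" case).
fn find_schema<'a>(schemas: &'a [Schema], predicate: &str) -> Option<&'a Schema> {
    schemas
        .iter()
        .find(|schema| schema.predicates.iter().any(|p| *p == predicate))
}
```

A predicate that appears in no list has no schema, and therefore no subject pattern to build its subject from. Adding the name to one `with_predicates()` call is what makes the lookup succeed.
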
### Subject collision issues

If claims that should conflict aren't conflicting, check that:

1. The subject pattern matches your intent
2. Entity values are being normalized consistently
3. The predicate is in the right schema category

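Inconsistent normalization is the easiest of these to miss, so here is a minimal sketch of alias-based normalization. `AliasTable` is a hypothetical type written for illustration; the real `EntityType::normalize` may apply additional rules, such as the configured naming convention.

```rust
use std::collections::HashMap;

/// Hypothetical alias table mapping variants to one canonical entity value.
struct AliasTable {
    aliases: HashMap<String, String>,
}

impl AliasTable {
    /// Build the table from (alias, canonical) pairs.
    fn new(pairs: &[(&str, &str)]) -> Self {
        let aliases = pairs
            .iter()
            .map(|(alias, canonical)| (alias.to_string(), canonical.to_string()))
            .collect();
        Self { aliases }
    }

    /// Resolve an alias to its canonical form; unknown values pass through
    /// unchanged apart from trimming surrounding whitespace.
    fn normalize(&self, value: &str) -> String {
        let trimmed = value.trim();
        self.aliases
            .get(trimmed)
            .cloned()
            .unwrap_or_else(|| trimmed.to_string())
    }
}
```

If one extractor emits `MI` and another emits `MyocardialInfarction`, both need to pass through the same normalization step before the subject is built. Otherwise the two claims end up under different subjects and never collide.
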
### Extractor not finding data

1. Check the API URL is correct
2. Verify the query parameters match the API's expectations
3. Add debug logging to see raw responses

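For the first two checks, it often helps to build the request URL in isolation and inspect it before any network call is made. The sketch below uses only the standard library; `percent_encode` and `build_search_url` are hypothetical helpers, and the `/search?q=` path simply mirrors the extractor template above rather than any real API.

```rust
/// Hypothetical helper: percent-encode everything except RFC 3986 unreserved characters.
fn percent_encode(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for byte in input.bytes() {
        match byte {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'_' | b'.' | b'~' => {
                out.push(byte as char)
            }
            // Every other byte becomes %XX in uppercase hex.
            other => out.push_str(&format!("%{:02X}", other)),
        }
    }
    out
}

/// Hypothetical helper: build the search URL the way the template extractor does,
/// tolerating a trailing slash on the base URL.
fn build_search_url(base_url: &str, query: &str) -> String {
    format!(
        "{}/search?q={}",
        base_url.trim_end_matches('/'),
        percent_encode(query)
    )
}
```

Logging the result of `build_search_url` next to the raw response body makes it easier to tell whether a miss comes from a malformed query or from the source having no data.
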
## Next Steps

- Run the Consumer Health UAT to see the pharma domain in action
- Read the [Lens documentation](../services/lens.md) to understand conflict resolution
- Check the [SDK guide](../../ai-lookup/services/sdk.md) for Go integration