Adding a New Domain to stemedb-ontology
This guide walks you through implementing a new domain (vertical) in the stemedb-ontology crate. By the end, you'll have a working domain with entity types, predicate schemas, and optional extractors.
Time: ~30 minutes
Prerequisites: Rust knowledge, familiarity with StemeDB concepts
Overview
A domain in stemedb-ontology defines:
- Entity Types - The kinds of things in your domain (e.g., Drug, Company, Asset)
- Predicate Schemas - How subjects are built for different predicate categories
- Source Hierarchy - How to weight different source authorities
- Extractors (optional) - Code that extracts claims from external sources
Step 1: Plan Your Domain Model
Before writing code, answer these questions:
What entities exist in your domain?
| Entity | Description | Example Values |
|---|---|---|
| ? | ? | ? |
Pharma example:
| Entity | Description | Example Values |
|---|---|---|
| Drug | Pharmaceutical compound | Semaglutide, Tirzepatide |
| Indication | Medical condition | Type2Diabetes, Obesity |
| Target | Molecular target | GLP1R, GIPR |
What predicates will you track?
Group predicates by category. The category determines the subject pattern:
| Category | Subject Pattern | Example Predicates |
|---|---|---|
| ? | ? | ? |
Pharma example:
| Category | Subject Pattern | Example Predicates |
|---|---|---|
| Efficacy | {Drug}:{Indication} | hba1c_reduction_percent, weight_loss_percent |
| Safety | {Drug} | nausea_rate, has_boxed_warning |
| Mechanism | {Drug}:{Target} | binding_affinity, mechanism_of_action |
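To make the Subject Pattern column concrete, here is a minimal sketch of how a pattern such as {Drug}:{Indication} could expand into a subject string. The render_subject helper is hypothetical and uses only the standard library. It is not part of the stemedb-ontology API, which may build subjects differently.

```rust
use std::collections::HashMap;

/// Hypothetical helper (not part of stemedb-ontology): fill a subject
/// pattern such as "{Drug}:{Indication}" with concrete entity values.
/// Returns None if the pattern names an entity that has no binding
/// or contains an unclosed placeholder.
fn render_subject(pattern: &str, bindings: &HashMap<&str, &str>) -> Option<String> {
    let mut out = String::new();
    let mut rest = pattern;
    while let Some(open) = rest.find('{') {
        // Copy literal text before the placeholder (e.g. the ':' separator).
        out.push_str(&rest[..open]);
        let close = open + rest[open..].find('}')?;
        let entity = &rest[open + 1..close];
        out.push_str(bindings.get(entity)?);
        rest = &rest[close + 1..];
    }
    // Copy any literal text after the last placeholder.
    out.push_str(rest);
    Some(out)
}
```

With Drug bound to Semaglutide and Indication bound to Type2Diabetes, the Efficacy pattern renders as Semaglutide:Type2Diabetes, while the Safety pattern renders as Semaglutide alone. Two claims can only conflict when they resolve to the same subject, which is why the pattern choice matters.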
What sources will provide data?
Order from most to least authoritative:
| Tier | Source Class | Examples | Weight |
|---|---|---|---|
| 0 | Regulatory | ? | 1.0 |
| 1 | Clinical | ? | 0.9 |
| ... | ... | ... | ... |
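When choosing weights, it can help to picture how a weight and a decay period might interact. The sketch below assumes exponential decay with a half-life in days, which matches the "2 year half-life" comment on with_decay(730) in the definition template. The TierWeight type, its fields, and the formula are assumptions for illustration only. This guide does not show how StemeDB actually applies decay.

```rust
/// Hypothetical illustration of tier weighting. The field names and the
/// decay formula are assumptions, not the stemedb-ontology implementation.
struct TierWeight {
    base_weight: f64,
    /// Half-life in days. None means the tier does not decay.
    half_life_days: Option<f64>,
}

impl TierWeight {
    /// Weight of a claim that is `age_days` old, assuming exponential decay:
    /// the weight halves every `half_life_days`.
    fn effective_weight(&self, age_days: f64) -> f64 {
        match self.half_life_days {
            Some(half_life) => self.base_weight * 0.5_f64.powf(age_days / half_life),
            None => self.base_weight,
        }
    }
}
```

Under these assumptions, a Tier 1 source with weight 0.9 and a 730-day half-life contributes 0.45 after two years, while a Tier 0 source with no decay keeps its full weight of 1.0.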
Step 2: Create Domain Module
Create the directory structure:
crates/stemedb-ontology/src/
{domain}/
mod.rs # Re-exports
definition.rs # Domain::new() builder
Template: {domain}/mod.rs
//! {Domain} domain ontology.
//!
//! This module defines the {domain} vertical with:
//! - Entity types (...)
//! - Predicate schemas (...)
//! - Source hierarchy (...)
pub mod definition;
pub use definition::definition;
// Re-export domain-specific types if any
// pub use definition::{...};
Template: {domain}/definition.rs
//! Compiled-in {domain} domain definition.
use crate::domain::{
DefaultLens, Domain, EntityType, NamingConvention, PredicateSchema, SourceTier,
};
use stemedb_core::types::SourceClass;
/// Build the {domain} domain definition.
pub fn definition() -> Domain {
let mut domain = Domain::new(
"{Domain}",
"Description of what this domain covers",
);
// -------------------------------------------------------------------------
// Entity Types
// -------------------------------------------------------------------------
// Primary entity (e.g., the main subject of claims)
domain = domain.with_entity_type(
"{PrimaryEntity}",
EntityType::required("Description")
.with_naming(NamingConvention::CamelCase)
// Add aliases for common variations
.with_alias("ALIAS", "Canonical"),
);
// Secondary entity (for compound subjects)
domain = domain.with_entity_type(
"{SecondaryEntity}",
EntityType::required("Description")
.with_naming(NamingConvention::CamelCase),
);
// -------------------------------------------------------------------------
// Predicate Schemas
// -------------------------------------------------------------------------
// Category 1: Primary predicates (single entity subject)
domain = domain.with_predicate_schema(
"category1",
PredicateSchema::new(
"Description of this predicate category",
"{PrimaryEntity}",
)
.with_predicates(vec![
"predicate_one",
"predicate_two",
])
.with_default_lens(DefaultLens::Recency),
);
// Category 2: Compound predicates (multi-entity subject)
domain = domain.with_predicate_schema(
"category2",
PredicateSchema::new(
"Description",
"{PrimaryEntity}:{SecondaryEntity}",
)
.with_predicates(vec![
"compound_predicate",
])
.with_default_lens(DefaultLens::LayeredConsensus),
);
// -------------------------------------------------------------------------
// Source Hierarchy
// -------------------------------------------------------------------------
domain = domain.with_source_hierarchy(vec![
SourceTier::new(SourceClass::Regulatory, "Tier 0: Official Sources")
.with_examples(vec!["Government agencies", "Standards bodies"])
.with_weight(1.0),
SourceTier::new(SourceClass::Clinical, "Tier 1: Primary Research")
.with_examples(vec!["Peer-reviewed journals", "Research institutions"])
.with_weight(0.9)
.with_decay(730), // 2 year half-life
SourceTier::new(SourceClass::Observational, "Tier 2: Secondary Analysis")
.with_examples(vec!["Industry reports", "Analyst research"])
.with_weight(0.7)
.with_decay(365),
SourceTier::new(SourceClass::Expert, "Tier 3: Expert Opinion")
.with_examples(vec!["Industry experts", "Consultants"])
.with_weight(0.5)
.with_decay(180),
SourceTier::new(SourceClass::Community, "Tier 4: Community")
.with_examples(vec!["Professional forums", "Curated discussions"])
.with_weight(0.3)
.with_decay(90),
SourceTier::new(SourceClass::Anecdotal, "Tier 5: Anecdotal")
.with_examples(vec!["Social media", "Blog posts"])
.with_weight(0.1)
.with_decay(30),
]);
domain
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn test_definition_builds() {
let domain = definition();
assert_eq!(domain.name, "{Domain}");
assert!(!domain.entity_types.is_empty());
assert!(!domain.predicate_schemas.is_empty());
assert!(!domain.source_hierarchy.is_empty());
}
#[test]
fn test_entity_normalization() {
let domain = definition();
let entity = domain.get_entity_type("{PrimaryEntity}").expect("entity exists");
// Test alias normalization
assert_eq!(entity.normalize("ALIAS"), "Canonical");
assert_eq!(entity.normalize("Canonical"), "Canonical");
}
#[test]
fn test_predicate_schema_lookup() {
let domain = definition();
// Direct lookup
let schema = domain.get_schema("category1").expect("schema exists");
assert_eq!(schema.subject_pattern, "{PrimaryEntity}");
// Lookup by predicate
let schema = domain.schema_for_predicate("predicate_one").expect("found");
assert!(schema.predicates.contains(&"predicate_one".to_string()));
}
}
Step 3: Implement Extractors (Optional)
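If your domain has external data sources, implement the MedicalExtractor trait. Despite its name, the trait is reused by other domains: the templates below re-export it from crate::pharma::extractors.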
Directory Structure
crates/stemedb-ontology/src/
{domain}/
mod.rs
definition.rs
extractors/
mod.rs
{source}.rs
Template: {domain}/extractors/mod.rs
//! Data extractors for {domain}.
mod {source};
pub use {source}::{Source}Extractor;
// Re-export common traits from parent
pub use crate::pharma::extractors::{
ExtractError, MedicalClaim, MedicalExtractor, RetryConfig, SourceInput,
};
Template: {domain}/extractors/{source}.rs
//! {Source} data extractor.
use super::{ExtractError, MedicalClaim, MedicalExtractor, SourceInput};
use async_trait::async_trait;
use stemedb_core::types::{ObjectValue, SourceClass};
/// Extractor for {Source} data.
pub struct {Source}Extractor {
http_client: reqwest::Client,
base_url: String,
}
impl {Source}Extractor {
/// Create a new extractor.
pub fn new() -> Self {
Self {
http_client: reqwest::Client::new(),
base_url: "https://api.example.com".to_string(),
}
}
}
impl Default for {Source}Extractor {
fn default() -> Self {
Self::new()
}
}
#[async_trait]
impl MedicalExtractor for {Source}Extractor {
fn name(&self) -> &str {
"{Source} Extractor"
}
fn source_class(&self) -> SourceClass {
SourceClass::Regulatory // Adjust based on source authority
}
fn can_handle(&self, source: &SourceInput) -> bool {
matches!(source, SourceInput::DrugName(_) | SourceInput::Url(_))
}
async fn extract(&self, source: &SourceInput) -> Result<Vec<MedicalClaim>, ExtractError> {
let query = match source {
SourceInput::DrugName(name) => name.clone(),
SourceInput::Url(url) => url.clone(),
_ => return Err(ExtractError::NotFound("Unsupported input type".into())),
};
// Fetch data from source
let url = format!("{}/search?q={}", self.base_url, urlencoding::encode(&query));
let response = self.http_client.get(&url).send().await?;
if !response.status().is_success() {
return Err(ExtractError::ApiError(format!(
"HTTP {}", response.status()
)));
}
// Parse response and extract claims
let mut claims = Vec::new();
// Example claim
claims.push(
MedicalClaim::new(
"Subject",
"predicate_name",
ObjectValue::Float(42.0),
)
.with_confidence(0.9)
.with_source_url(&url)
.with_source_section("Section Name")
.with_quote("Supporting quote from source")
.with_source_class(self.source_class())
);
Ok(claims)
}
}
Step 4: Create CLI Binary (Optional)
For user-facing domains, create a CLI tool.
Template: src/bin/steme_{domain}.rs
//! CLI for {domain} domain operations.
use clap::Parser;
use stemedb_ontology::client::StemeClient;
use stemedb_ontology::{domain}::definition;
mod cli;
mod commands;
#[derive(Parser)]
#[command(name = "steme-{domain}")]
#[command(about = "{Domain} data operations for StemeDB")]
struct Cli {
#[arg(long, default_value = "http://localhost:18180")]
server: String,
#[command(subcommand)]
command: Commands,
}
#[derive(clap::Subcommand)]
enum Commands {
/// Ingest data
Ingest { /* args */ },
/// Query data
Query { /* args */ },
}
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
let cli = Cli::parse();
let client = StemeClient::new(&cli.server);
match cli.command {
Commands::Ingest { /* args */ } => {
// Implementation
}
Commands::Query { /* args */ } => {
// Implementation
}
}
Ok(())
}
Step 5: Testing Checklist
Before considering your domain complete:
- cargo build -p stemedb-ontology succeeds
- definition() returns a valid Domain
- All entity types have meaningful descriptions
- All predicate schemas have correct subject patterns
- Entity normalization works (aliases resolve correctly)
- schema_for_predicate() finds the right schema
- Source hierarchy has 6 tiers with decreasing weights
- (If extractors) cargo test -p stemedb-ontology passes
Run the tests:
cargo test -p stemedb-ontology
cargo clippy -p stemedb-ontology -- -D warnings
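The "decreasing weights" item in the checklist is easy to assert mechanically. Below is a small sketch of such a check. The weights_strictly_decreasing helper is hypothetical, and how you read the weights off each SourceTier depends on fields or accessors that this guide does not show.

```rust
/// Hypothetical helper: true when each weight is strictly lower than the
/// one before it, as the checklist expects for the source hierarchy.
/// An empty or single-element slice counts as decreasing.
fn weights_strictly_decreasing(weights: &[f64]) -> bool {
    weights.windows(2).all(|pair| pair[0] > pair[1])
}
```

The weights used in the definition template (1.0, 0.9, 0.7, 0.5, 0.3, 0.1) pass this check. A hierarchy with two tiers at the same weight does not.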
Step 6: Integration
Export from lib.rs
Edit crates/stemedb-ontology/src/lib.rs:
// Add your domain module
pub mod {domain};
// Re-export for convenience
pub use {domain}::definition as {domain}_domain;
Update ai-lookup
Add an entry to ai-lookup/index.md under the Domain Ontology section.
Update CLAUDE.md routing (if significant)
If your domain is frequently used, add a routing entry in the Find Your Guide table.
Complete Example: Cardiology Domain (Skeleton)
Here's a minimal working example for a cardiology domain:
// crates/stemedb-ontology/src/cardiology/mod.rs
//! Cardiology domain ontology.
pub mod definition;
pub use definition::definition;
// crates/stemedb-ontology/src/cardiology/definition.rs
use crate::domain::{DefaultLens, Domain, EntityType, NamingConvention, PredicateSchema, SourceTier};
use stemedb_core::types::SourceClass;
pub fn definition() -> Domain {
let mut domain = Domain::new(
"Cardiology",
"Cardiovascular conditions, procedures, and outcomes",
);
// Entities
domain = domain
.with_entity_type(
"Condition",
EntityType::required("Cardiovascular condition")
.with_naming(NamingConvention::CamelCase)
.with_alias("MI", "MyocardialInfarction")
.with_alias("CHF", "CongestiveHeartFailure")
.with_alias("AF", "AtrialFibrillation"),
)
.with_entity_type(
"Procedure",
EntityType::required("Medical procedure")
.with_naming(NamingConvention::CamelCase)
.with_alias("CABG", "CoronaryArteryBypassGraft")
.with_alias("PCI", "PercutaneousCoronaryIntervention"),
)
.with_entity_type(
"Biomarker",
EntityType::required("Diagnostic biomarker")
.with_naming(NamingConvention::CamelCase),
);
// Schemas
domain = domain
.with_predicate_schema(
"diagnosis",
PredicateSchema::new("Diagnostic criteria", "{Condition}")
.with_predicates(vec![
"diagnostic_criteria",
"staging_system",
"severity_classification",
])
.with_default_lens(DefaultLens::Authority),
)
.with_predicate_schema(
"outcome",
PredicateSchema::new("Treatment outcomes", "{Condition}:{Procedure}")
.with_predicates(vec![
"mortality_rate",
"complication_rate",
"readmission_rate",
"length_of_stay_days",
])
.with_default_lens(DefaultLens::LayeredConsensus),
)
.with_predicate_schema(
"biomarker",
PredicateSchema::new("Biomarker thresholds", "{Biomarker}")
.with_predicates(vec![
"normal_range",
"diagnostic_threshold",
"prognostic_value",
])
.with_default_lens(DefaultLens::Consensus),
);
// Source hierarchy
domain = domain.with_source_hierarchy(vec![
SourceTier::new(SourceClass::Regulatory, "Tier 0: Guidelines")
.with_examples(vec!["ACC/AHA Guidelines", "ESC Guidelines"])
.with_weight(1.0),
SourceTier::new(SourceClass::Clinical, "Tier 1: Clinical Trials")
.with_examples(vec!["Landmark RCTs", "Meta-analyses"])
.with_weight(0.9)
.with_decay(730),
SourceTier::new(SourceClass::Observational, "Tier 2: Registries")
.with_examples(vec!["NCDR", "Get With The Guidelines"])
.with_weight(0.7)
.with_decay(365),
SourceTier::new(SourceClass::Expert, "Tier 3: Expert Consensus")
.with_examples(vec!["Consensus statements", "Textbooks"])
.with_weight(0.5)
.with_decay(180),
SourceTier::new(SourceClass::Community, "Tier 4: Community")
.with_examples(vec!["Medical forums", "CME discussions"])
.with_weight(0.3)
.with_decay(90),
SourceTier::new(SourceClass::Anecdotal, "Tier 5: Anecdotal")
.with_examples(vec!["Case reports", "Social media"])
.with_weight(0.1)
.with_decay(30),
]);
domain
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn test_cardiology_domain() {
let domain = definition();
assert_eq!(domain.name, "Cardiology");
// Check entity aliases
let condition = domain.get_entity_type("Condition").unwrap();
assert_eq!(condition.normalize("MI"), "MyocardialInfarction");
// Check schema lookup
let schema = domain.schema_for_predicate("mortality_rate").unwrap();
assert_eq!(schema.subject_pattern, "{Condition}:{Procedure}");
}
}
Troubleshooting
"Unknown predicate" errors
Your predicate isn't in any schema. Add it to the appropriate with_predicates() call.
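As a mental model, a lookup by predicate scans every schema's predicate list and fails when none contains the name. The sketch below mirrors that idea with a hypothetical Schema struct and category_for function. It is a stand-in for illustration, not the crate's schema_for_predicate() implementation.

```rust
/// Hypothetical stand-in for a predicate schema: a category name plus
/// the predicates registered under it.
struct Schema {
    category: &'static str,
    predicates: Vec<&'static str>,
}

/// Return the category whose predicate list contains `predicate`.
/// A `None` result corresponds to the "unknown predicate" situation.
fn category_for<'a>(schemas: &'a [Schema], predicate: &str) -> Option<&'a str> {
    schemas
        .iter()
        .find(|s| s.predicates.iter().any(|p| *p == predicate))
        .map(|s| s.category)
}
```

A typo such as mortality_rates instead of mortality_rate produces the same failure as a predicate that was never registered, so compare the exact spelling against your with_predicates() lists first.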
Subject collision issues
If claims that should conflict aren't conflicting, check that:
- The subject pattern matches your intent
- Entity values are being normalized consistently
- The predicate is in the right schema category
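Inconsistent normalization is a common cause: if one ingestion path writes the alias and another writes the canonical name, the two claims land on different subjects and never meet. The sketch below illustrates this with hypothetical normalize and subject helpers built on a plain alias map. The crate's own normalization may differ in detail.

```rust
use std::collections::HashMap;

/// Hypothetical alias lookup: map an alias to its canonical entity name
/// and leave values without an alias unchanged.
fn normalize(aliases: &HashMap<&str, &str>, value: &str) -> String {
    aliases.get(value).copied().unwrap_or(value).to_string()
}

/// Build a "{Condition}:{Procedure}" subject, normalizing each part first.
fn subject(aliases: &HashMap<&str, &str>, condition: &str, procedure: &str) -> String {
    format!(
        "{}:{}",
        normalize(aliases, condition),
        normalize(aliases, procedure)
    )
}
```

With the cardiology aliases MI and PCI registered, subject(&aliases, "MI", "PCI") and the fully spelled-out form both produce MyocardialInfarction:PercutaneousCoronaryIntervention. The raw string MI:PCI would not match either, so a claim stored under it would silently avoid the conflict.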
Extractor not finding data
- Check that the API URL is correct
- Verify the query parameters match the API's expectations
- Add debug logging to see raw responses
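When checking query parameters, it helps to see the exact URL the extractor would request. The sketch below rebuilds the template's /search?q=... URL using a hypothetical, standard-library-only percent-encoder. It is similar in spirit to the urlencoding::encode call in the extractor template but is not a replacement for it, and the encode_query_value and search_url names are illustrative.

```rust
/// Hypothetical std-only percent-encoder for query values.
/// Everything except RFC 3986 unreserved characters is escaped.
fn encode_query_value(input: &str) -> String {
    let mut out = String::new();
    for byte in input.bytes() {
        match byte {
            // Unreserved characters pass through unchanged.
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'_' | b'.' | b'~' => {
                out.push(byte as char)
            }
            // Every other byte becomes %XX in uppercase hex.
            other => out.push_str(&format!("%{:02X}", other)),
        }
    }
    out
}

/// Build the search URL in the same shape as the extractor template.
fn search_url(base_url: &str, query: &str) -> String {
    format!("{}/search?q={}", base_url, encode_query_value(query))
}
```

Logging the result of search_url before sending the request makes it easy to paste the URL into a browser or curl and compare the raw response with what the parser expects.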
Next Steps
- Run the Consumer Health UAT to see the pharma domain in action
- Read the Lens documentation to understand conflict resolution
- Check the SDK guide for Go integration