stemedb/docs/guides/adding-a-domain.md
jordan bbe6aedc40 feat: Aphoria security extractors + LLM evaluation architecture + ontology docs
2026-02-05 15:22:55 -07:00


# Adding a New Domain to stemedb-ontology
This guide walks you through implementing a new domain (vertical) in the stemedb-ontology crate. By the end, you'll have a working domain with entity types, predicate schemas, and optional extractors.

**Time:** ~30 minutes

**Prerequisites:** Rust knowledge, familiarity with StemeDB concepts

## Overview
A domain in stemedb-ontology defines:
1. **Entity Types** - The kinds of things in your domain (e.g., Drug, Company, Asset)
2. **Predicate Schemas** - How subjects are built for different predicate categories
3. **Source Hierarchy** - How to weight different source authorities
4. **Extractors (optional)** - Code that extracts claims from external sources
## Step 1: Plan Your Domain Model
Before writing code, answer these questions:
### What entities exist in your domain?
| Entity | Description | Example Values |
|--------|-------------|----------------|
| ? | ? | ? |

**Pharma example:**

| Entity | Description | Example Values |
|--------|-------------|----------------|
| Drug | Pharmaceutical compound | Semaglutide, Tirzepatide |
| Indication | Medical condition | Type2Diabetes, Obesity |
| Target | Molecular target | GLP1R, GIPR |
### What predicates will you track?
Group predicates by category; the category determines the subject pattern:

| Category | Subject Pattern | Example Predicates |
|----------|-----------------|-------------------|
| ? | ? | ? |

**Pharma example:**

| Category | Subject Pattern | Example Predicates |
|----------|-----------------|-------------------|
| Efficacy | `{Drug}:{Indication}` | hba1c_reduction_percent, weight_loss_percent |
| Safety | `{Drug}` | nausea_rate, has_boxed_warning |
| Mechanism | `{Drug}:{Target}` | binding_affinity, mechanism_of_action |
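
To make the subject patterns concrete, here is a minimal sketch of how a pattern such as `{Drug}:{Indication}` could be filled with entity values. The `render_subject` helper is hypothetical and only illustrates the idea; it is not part of stemedb-ontology, and the crate may build subjects differently.

```rust
use std::collections::HashMap;

/// Fill a subject pattern such as "{Drug}:{Indication}" with entity values.
/// Returns None if the pattern names an entity that has no binding, or if a
/// placeholder is missing its closing brace.
/// Hypothetical helper for illustration; not part of stemedb-ontology.
fn render_subject(pattern: &str, bindings: &HashMap<&str, &str>) -> Option<String> {
    let mut out = String::new();
    let mut rest = pattern;
    while let Some(open) = rest.find('{') {
        // Copy literal text (for example the ':' separator) before the placeholder.
        out.push_str(&rest[..open]);
        let close = open + rest[open..].find('}')?;
        let name = &rest[open + 1..close];
        out.push_str(bindings.get(name)?);
        rest = &rest[close + 1..];
    }
    // Copy whatever literal text follows the last placeholder.
    out.push_str(rest);
    Some(out)
}
```

With `Drug` bound to `Semaglutide` and `Indication` bound to `Type2Diabetes`, the Efficacy pattern renders as `Semaglutide:Type2Diabetes`, while the Safety pattern `{Drug}` renders as `Semaglutide`. Two claims can only be compared when their rendered subjects are identical, which is why the choice of pattern matters.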
### What sources will provide data?
Order from most to least authoritative:
| Tier | Source Class | Examples | Weight |
|------|--------------|----------|--------|
| 0 | Regulatory | ? | 1.0 |
| 1 | Clinical | ? | 0.9 |
| ... | ... | ... | ... |
## Step 2: Create Domain Module
Create the directory structure:
```
crates/stemedb-ontology/src/
  {domain}/
    mod.rs           # Re-exports
    definition.rs    # Domain::new() builder
```
### Template: `{domain}/mod.rs`
```rust
//! {Domain} domain ontology.
//!
//! This module defines the {domain} vertical with:
//! - Entity types (...)
//! - Predicate schemas (...)
//! - Source hierarchy (...)
pub mod definition;
pub use definition::definition;
// Re-export domain-specific types if any
// pub use definition::{...};
```
### Template: `{domain}/definition.rs`
```rust
//! Compiled-in {domain} domain definition.

use crate::domain::{
    DefaultLens, Domain, EntityType, NamingConvention, PredicateSchema, SourceTier,
};
use stemedb_core::types::SourceClass;

/// Build the {domain} domain definition.
pub fn definition() -> Domain {
    let mut domain = Domain::new(
        "{Domain}",
        "Description of what this domain covers",
    );

    // -------------------------------------------------------------------------
    // Entity Types
    // -------------------------------------------------------------------------

    // Primary entity (e.g., the main subject of claims)
    domain = domain.with_entity_type(
        "{PrimaryEntity}",
        EntityType::required("Description")
            .with_naming(NamingConvention::CamelCase)
            // Add aliases for common variations
            .with_alias("ALIAS", "Canonical"),
    );

    // Secondary entity (for compound subjects)
    domain = domain.with_entity_type(
        "{SecondaryEntity}",
        EntityType::required("Description")
            .with_naming(NamingConvention::CamelCase),
    );

    // -------------------------------------------------------------------------
    // Predicate Schemas
    // -------------------------------------------------------------------------

    // Category 1: Primary predicates (single entity subject)
    domain = domain.with_predicate_schema(
        "category1",
        PredicateSchema::new(
            "Description of this predicate category",
            "{PrimaryEntity}",
        )
        .with_predicates(vec![
            "predicate_one",
            "predicate_two",
        ])
        .with_default_lens(DefaultLens::Recency),
    );

    // Category 2: Compound predicates (multi-entity subject)
    domain = domain.with_predicate_schema(
        "category2",
        PredicateSchema::new(
            "Description",
            "{PrimaryEntity}:{SecondaryEntity}",
        )
        .with_predicates(vec![
            "compound_predicate",
        ])
        .with_default_lens(DefaultLens::LayeredConsensus),
    );

    // -------------------------------------------------------------------------
    // Source Hierarchy
    // -------------------------------------------------------------------------
    domain = domain.with_source_hierarchy(vec![
        SourceTier::new(SourceClass::Regulatory, "Tier 0: Official Sources")
            .with_examples(vec!["Government agencies", "Standards bodies"])
            .with_weight(1.0),
        SourceTier::new(SourceClass::Clinical, "Tier 1: Primary Research")
            .with_examples(vec!["Peer-reviewed journals", "Research institutions"])
            .with_weight(0.9)
            .with_decay(730), // 2 year half-life
        SourceTier::new(SourceClass::Observational, "Tier 2: Secondary Analysis")
            .with_examples(vec!["Industry reports", "Analyst research"])
            .with_weight(0.7)
            .with_decay(365),
        SourceTier::new(SourceClass::Expert, "Tier 3: Expert Opinion")
            .with_examples(vec!["Industry experts", "Consultants"])
            .with_weight(0.5)
            .with_decay(180),
        SourceTier::new(SourceClass::Community, "Tier 4: Community")
            .with_examples(vec!["Professional forums", "Curated discussions"])
            .with_weight(0.3)
            .with_decay(90),
        SourceTier::new(SourceClass::Anecdotal, "Tier 5: Anecdotal")
            .with_examples(vec!["Social media", "Blog posts"])
            .with_weight(0.1)
            .with_decay(30),
    ]);

    domain
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_definition_builds() {
        let domain = definition();
        assert_eq!(domain.name, "{Domain}");
        assert!(!domain.entity_types.is_empty());
        assert!(!domain.predicate_schemas.is_empty());
        assert!(!domain.source_hierarchy.is_empty());
    }

    #[test]
    fn test_entity_normalization() {
        let domain = definition();
        let entity = domain.get_entity_type("{PrimaryEntity}").expect("entity exists");

        // Test alias normalization
        assert_eq!(entity.normalize("ALIAS"), "Canonical");
        assert_eq!(entity.normalize("Canonical"), "Canonical");
    }

    #[test]
    fn test_predicate_schema_lookup() {
        let domain = definition();

        // Direct lookup
        let schema = domain.get_schema("category1").expect("schema exists");
        assert_eq!(schema.subject_pattern, "{PrimaryEntity}");

        // Lookup by predicate
        let schema = domain.schema_for_predicate("predicate_one").expect("found");
        assert!(schema.predicates.contains(&"predicate_one".to_string()));
    }
}
```
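
The template annotates `.with_decay(730)` as a 2 year half-life. As a rough illustration of what a half-life implies, the sketch below assumes the decay is exponential and that the argument is a number of days; neither detail is confirmed by this guide, and `effective_weight` is a hypothetical helper rather than a StemeDB function.

```rust
/// Effective weight of a source after `age_days`, assuming exponential decay
/// with the given half-life in days. `half_life_days = None` models a tier
/// with no decay configured (like Tier 0 in the template).
/// Hypothetical helper: the real lens logic in StemeDB may differ.
fn effective_weight(base_weight: f64, half_life_days: Option<u32>, age_days: u32) -> f64 {
    match half_life_days {
        Some(half_life) if half_life > 0 => {
            // Each elapsed half-life halves the weight.
            let halvings = f64::from(age_days) / f64::from(half_life);
            base_weight * 0.5_f64.powf(halvings)
        }
        // No decay configured (or a zero half-life): keep the base weight.
        _ => base_weight,
    }
}
```

Under these assumptions, a Tier 1 claim with weight 0.9 and a 730 day half-life would count for about 0.45 after two years, while a Tier 0 claim without decay keeps its full weight regardless of age. Shorter half-lives on lower tiers make stale low-authority claims fade fastest.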
## Step 3: Implement Extractors (Optional)
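If your domain pulls data from external sources, implement the `MedicalExtractor` trait. The trait and its supporting types live in `crate::pharma::extractors` and are re-exported by the `mod.rs` template in this step.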
### Directory Structure
```
crates/stemedb-ontology/src/
  {domain}/
    mod.rs
    definition.rs
    extractors/
      mod.rs
      {source}.rs
```
### Template: `{domain}/extractors/mod.rs`
```rust
//! Data extractors for {domain}.
mod {source};
pub use {source}::{Source}Extractor;
// Re-export the shared extractor trait and types from the pharma domain
pub use crate::pharma::extractors::{
    ExtractError, MedicalClaim, MedicalExtractor, RetryConfig, SourceInput,
};
```
### Template: `{domain}/extractors/{source}.rs`
```rust
//! {Source} data extractor.

use super::{ExtractError, MedicalClaim, MedicalExtractor, SourceInput};
use async_trait::async_trait;
use stemedb_core::types::{ObjectValue, SourceClass};

/// Extractor for {Source} data.
pub struct {Source}Extractor {
    http_client: reqwest::Client,
    base_url: String,
}

impl {Source}Extractor {
    /// Create a new extractor.
    pub fn new() -> Self {
        Self {
            http_client: reqwest::Client::new(),
            base_url: "https://api.example.com".to_string(),
        }
    }
}

impl Default for {Source}Extractor {
    fn default() -> Self {
        Self::new()
    }
}

#[async_trait]
impl MedicalExtractor for {Source}Extractor {
    fn name(&self) -> &str {
        "{Source} Extractor"
    }

    fn source_class(&self) -> SourceClass {
        SourceClass::Regulatory // Adjust based on source authority
    }

    fn can_handle(&self, source: &SourceInput) -> bool {
        matches!(source, SourceInput::DrugName(_) | SourceInput::Url(_))
    }

    async fn extract(&self, source: &SourceInput) -> Result<Vec<MedicalClaim>, ExtractError> {
        let query = match source {
            SourceInput::DrugName(name) => name.clone(),
            SourceInput::Url(url) => url.clone(),
            _ => return Err(ExtractError::NotFound("Unsupported input type".into())),
        };

        // Fetch data from source
        let url = format!("{}/search?q={}", self.base_url, urlencoding::encode(&query));
        let response = self.http_client.get(&url).send().await?;

        if !response.status().is_success() {
            return Err(ExtractError::ApiError(format!(
                "HTTP {}", response.status()
            )));
        }

        // Parse response and extract claims
        let mut claims = Vec::new();

        // Example claim
        claims.push(
            MedicalClaim::new(
                "Subject",
                "predicate_name",
                ObjectValue::Float(42.0),
            )
            .with_confidence(0.9)
            .with_source_url(&url)
            .with_source_section("Section Name")
            .with_quote("Supporting quote from source")
            .with_source_class(self.source_class())
        );

        Ok(claims)
    }
}
```
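
The `can_handle` method exists so that a caller can route each input to an extractor that understands it. The sketch below shows one way such routing could look, using simplified synchronous stand-ins for the guide's types. `Extractor`, `LabelExtractor`, `WebExtractor`, and `pick_extractor` are all hypothetical names invented for this illustration; the real trait is async and has more methods.

```rust
/// Simplified stand-in for the guide's `SourceInput`. Hypothetical mirror.
#[derive(Debug)]
enum SourceInput {
    DrugName(String),
    Url(String),
}

/// Simplified, synchronous stand-in for the extractor trait.
trait Extractor {
    fn name(&self) -> &str;
    fn can_handle(&self, source: &SourceInput) -> bool;
}

struct LabelExtractor;
struct WebExtractor;

impl Extractor for LabelExtractor {
    fn name(&self) -> &str {
        "Label Extractor"
    }
    fn can_handle(&self, source: &SourceInput) -> bool {
        matches!(source, SourceInput::DrugName(_))
    }
}

impl Extractor for WebExtractor {
    fn name(&self) -> &str {
        "Web Extractor"
    }
    fn can_handle(&self, source: &SourceInput) -> bool {
        matches!(source, SourceInput::Url(_))
    }
}

/// Return the first registered extractor that accepts the input, if any.
fn pick_extractor<'a>(
    extractors: &'a [Box<dyn Extractor>],
    source: &SourceInput,
) -> Option<&'a dyn Extractor> {
    for extractor in extractors {
        if extractor.can_handle(source) {
            return Some(&**extractor);
        }
    }
    None
}
```

Keeping `can_handle` narrow (only the input variants the extractor really supports) means an unsupported input is rejected before any network call is made, rather than surfacing later as a `NotFound` error from `extract`.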
## Step 4: Create CLI Binary (Optional)
For user-facing domains, create a CLI tool.
### Template: `src/bin/steme_{domain}.rs`
```rust
//! CLI for {domain} domain operations.

use clap::Parser;
use stemedb_ontology::client::StemeClient;
use stemedb_ontology::{domain}::definition;

mod cli;
mod commands;

#[derive(Parser)]
#[command(name = "steme-{domain}")]
#[command(about = "{Domain} data operations for StemeDB")]
struct Cli {
    #[arg(long, default_value = "http://localhost:18180")]
    server: String,

    #[command(subcommand)]
    command: Commands,
}

#[derive(clap::Subcommand)]
enum Commands {
    /// Ingest data
    Ingest { /* args */ },
    /// Query data
    Query { /* args */ },
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let cli = Cli::parse();
    let client = StemeClient::new(&cli.server);

    match cli.command {
        Commands::Ingest { /* args */ } => {
            // Implementation
        }
        Commands::Query { /* args */ } => {
            // Implementation
        }
    }

    Ok(())
}
```
## Step 5: Testing Checklist
Before considering your domain complete:
- [ ] `cargo build -p stemedb-ontology` succeeds
- [ ] `definition()` returns a valid Domain
- [ ] All entity types have meaningful descriptions
- [ ] All predicate schemas have correct subject patterns
- [ ] Entity normalization works (aliases resolve correctly)
- [ ] `schema_for_predicate()` finds the right schema
- [ ] Source hierarchy has 6 tiers with decreasing weights
- [ ] (If you added extractors) `cargo test -p stemedb-ontology` passes

Run the tests and lints:
```bash
cargo test -p stemedb-ontology
cargo clippy -p stemedb-ontology -- -D warnings
```
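
The checklist item about six tiers with decreasing weights is easy to get wrong when tiers are reordered, so it can be worth encoding as a test. The sketch below checks a plain list of weights; `weights_strictly_decreasing` is a hypothetical helper, and how you collect the weights from `domain.source_hierarchy` depends on the `SourceTier` fields, which this guide does not show.

```rust
/// True when every tier weight is strictly lower than the one before it.
/// `weights` is assumed to be listed from Tier 0 downward.
/// Hypothetical helper for illustration.
fn weights_strictly_decreasing(weights: &[f64]) -> bool {
    // Compare each adjacent pair; an empty or single-element list passes.
    weights.windows(2).all(|pair| pair[0] > pair[1])
}
```

For the template hierarchy (1.0, 0.9, 0.7, 0.5, 0.3, 0.1) the check passes. It fails if two tiers share a weight or a lower tier outranks a higher one. Pair it with an assertion that the hierarchy has six tiers.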
## Step 6: Integration
### Export from lib.rs
Edit `crates/stemedb-ontology/src/lib.rs`:
```rust
// Add your domain module
pub mod {domain};
// Re-export for convenience
pub use {domain}::definition as {domain}_domain;
```
### Update ai-lookup
Add an entry to `ai-lookup/index.md` under the Domain Ontology section.
### Update CLAUDE.md routing (if significant)
If your domain is frequently used, add a routing entry in the Find Your Guide table.
## Complete Example: Cardiology Domain (Skeleton)
Here's a minimal working example for a cardiology domain:
```rust
// crates/stemedb-ontology/src/cardiology/mod.rs
//! Cardiology domain ontology.
pub mod definition;
pub use definition::definition;
```
```rust
// crates/stemedb-ontology/src/cardiology/definition.rs

use crate::domain::{DefaultLens, Domain, EntityType, NamingConvention, PredicateSchema, SourceTier};
use stemedb_core::types::SourceClass;

pub fn definition() -> Domain {
    let mut domain = Domain::new(
        "Cardiology",
        "Cardiovascular conditions, procedures, and outcomes",
    );

    // Entities
    domain = domain
        .with_entity_type(
            "Condition",
            EntityType::required("Cardiovascular condition")
                .with_naming(NamingConvention::CamelCase)
                .with_alias("MI", "MyocardialInfarction")
                .with_alias("CHF", "CongestiveHeartFailure")
                .with_alias("AF", "AtrialFibrillation"),
        )
        .with_entity_type(
            "Procedure",
            EntityType::required("Medical procedure")
                .with_naming(NamingConvention::CamelCase)
                .with_alias("CABG", "CoronaryArteryBypassGraft")
                .with_alias("PCI", "PercutaneousCoronaryIntervention"),
        )
        .with_entity_type(
            "Biomarker",
            EntityType::required("Diagnostic biomarker")
                .with_naming(NamingConvention::CamelCase),
        );

    // Schemas
    domain = domain
        .with_predicate_schema(
            "diagnosis",
            PredicateSchema::new("Diagnostic criteria", "{Condition}")
                .with_predicates(vec![
                    "diagnostic_criteria",
                    "staging_system",
                    "severity_classification",
                ])
                .with_default_lens(DefaultLens::Authority),
        )
        .with_predicate_schema(
            "outcome",
            PredicateSchema::new("Treatment outcomes", "{Condition}:{Procedure}")
                .with_predicates(vec![
                    "mortality_rate",
                    "complication_rate",
                    "readmission_rate",
                    "length_of_stay_days",
                ])
                .with_default_lens(DefaultLens::LayeredConsensus),
        )
        .with_predicate_schema(
            "biomarker",
            PredicateSchema::new("Biomarker thresholds", "{Biomarker}")
                .with_predicates(vec![
                    "normal_range",
                    "diagnostic_threshold",
                    "prognostic_value",
                ])
                .with_default_lens(DefaultLens::Consensus),
        );

    // Source hierarchy
    domain = domain.with_source_hierarchy(vec![
        SourceTier::new(SourceClass::Regulatory, "Tier 0: Guidelines")
            .with_examples(vec!["ACC/AHA Guidelines", "ESC Guidelines"])
            .with_weight(1.0),
        SourceTier::new(SourceClass::Clinical, "Tier 1: Clinical Trials")
            .with_examples(vec!["Landmark RCTs", "Meta-analyses"])
            .with_weight(0.9)
            .with_decay(730),
        SourceTier::new(SourceClass::Observational, "Tier 2: Registries")
            .with_examples(vec!["NCDR", "Get With The Guidelines"])
            .with_weight(0.7)
            .with_decay(365),
        SourceTier::new(SourceClass::Expert, "Tier 3: Expert Consensus")
            .with_examples(vec!["Consensus statements", "Textbooks"])
            .with_weight(0.5)
            .with_decay(180),
        SourceTier::new(SourceClass::Community, "Tier 4: Community")
            .with_examples(vec!["Medical forums", "CME discussions"])
            .with_weight(0.3)
            .with_decay(90),
        SourceTier::new(SourceClass::Anecdotal, "Tier 5: Anecdotal")
            .with_examples(vec!["Case reports", "Social media"])
            .with_weight(0.1)
            .with_decay(30),
    ]);

    domain
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_cardiology_domain() {
        let domain = definition();
        assert_eq!(domain.name, "Cardiology");

        // Check entity aliases
        let condition = domain.get_entity_type("Condition").unwrap();
        assert_eq!(condition.normalize("MI"), "MyocardialInfarction");

        // Check schema lookup
        let schema = domain.schema_for_predicate("mortality_rate").unwrap();
        assert_eq!(schema.subject_pattern, "{Condition}:{Procedure}");
    }
}
```
## Troubleshooting
### "Unknown predicate" errors
Your predicate isn't in any schema. Add it to the appropriate `with_predicates()` call.
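
To see why a missing entry produces this error, consider a minimal sketch of a predicate lookup. The `Schema` struct and the free function `schema_for_predicate` below are hypothetical stand-ins that mirror the names used in the tests in this guide; the crate's actual implementation may differ.

```rust
use std::collections::BTreeMap;

/// Minimal stand-in for a predicate schema: a subject pattern plus the
/// predicates registered under it. Hypothetical mirror of the guide's types.
struct Schema {
    subject_pattern: String,
    predicates: Vec<String>,
}

/// Find the schema whose predicate list contains `predicate`.
/// Returns None for a predicate that no schema lists, which is the
/// situation behind an "unknown predicate" error.
fn schema_for_predicate<'a>(
    schemas: &'a BTreeMap<String, Schema>,
    predicate: &str,
) -> Option<&'a Schema> {
    schemas
        .values()
        .find(|schema| schema.predicates.iter().any(|p| p == predicate))
}
```

In this sketch the lookup is an exact string match, so a typo such as `nausea-rate` instead of `nausea_rate` would also come back as `None`. Check spelling before adding a new predicate to a schema.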
### Subject collision issues
If claims that should conflict aren't conflicting, check that:
1. The subject pattern matches your intent
2. Entity values are being normalized consistently
3. The predicate is in the right schema category
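
The second point is the most common cause: if one claim uses an alias and another uses the canonical name, their subjects only match after normalization. The sketch below illustrates this with a plain alias map. `normalize` and `outcome_subject` are hypothetical helpers, and passing unknown values through unchanged is an assumption about how the crate's `normalize` behaves.

```rust
use std::collections::HashMap;

/// Resolve an alias to its canonical entity name.
/// Assumption: values without an alias pass through unchanged.
/// Hypothetical stand-in for the `normalize` method used in the tests above.
fn normalize(aliases: &HashMap<&str, &str>, value: &str) -> String {
    aliases.get(value).copied().unwrap_or(value).to_string()
}

/// Build a `{Condition}:{Procedure}` subject from normalized entity values.
fn outcome_subject(aliases: &HashMap<&str, &str>, condition: &str, procedure: &str) -> String {
    format!(
        "{}:{}",
        normalize(aliases, condition),
        normalize(aliases, procedure)
    )
}
```

With the cardiology aliases registered, `MI` plus `CABG` and `MyocardialInfarction` plus `CoronaryArteryBypassGraft` produce the same subject, so claims about them can conflict. With an empty alias map the two subjects differ and the claims silently never meet.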
### Extractor not finding data
1. Check that the API URL is correct
2. Verify the query parameters match the API's expectations
3. Add debug logging to see raw responses
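
One way to rule out URL construction problems is to move it into a pure function that can be tested without network access. The sketch below is a standard-library-only stand-in: the extractor template uses `urlencoding::encode`, whereas `encode_query_value` here is a hand-written approximation for illustration, and `search_url` is a hypothetical helper built around the template's `/search?q=` path.

```rust
/// Percent-encode a query value, leaving only RFC 3986 unreserved
/// characters (letters, digits, '-', '.', '_', '~') unescaped.
/// Illustrative stand-in for a URL-encoding library call.
fn encode_query_value(value: &str) -> String {
    let mut out = String::with_capacity(value.len());
    for byte in value.bytes() {
        match byte {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'.' | b'_' | b'~' => {
                out.push(byte as char)
            }
            // Everything else, including spaces and '&', becomes %XX.
            _ => out.push_str(&format!("%{:02X}", byte)),
        }
    }
    out
}

/// Build the search URL the extractor would request.
/// Trims a trailing slash from `base_url` to avoid a double slash.
fn search_url(base_url: &str, query: &str) -> String {
    format!(
        "{}/search?q={}",
        base_url.trim_end_matches('/'),
        encode_query_value(query)
    )
}
```

Printing the result of `search_url` before sending the request, and pasting it into a browser or `curl`, quickly shows whether the problem is in the URL or in the response parsing.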
## Next Steps
- Run the Consumer Health UAT to see the pharma domain in action
- Read the [Lens documentation](../services/lens.md) to understand conflict resolution
- Check the [SDK guide](../../ai-lookup/services/sdk.md) for Go integration