stemedb/docs/guides/adding-a-domain.md
jordan bbe6aedc40 feat: Aphoria security extractors + LLM evaluation architecture + ontology docs
2026-02-05 15:22:55 -07:00

Adding a New Domain to stemedb-ontology

This guide walks you through implementing a new domain (vertical) in the stemedb-ontology crate. By the end, you'll have a working domain with entity types, predicate schemas, and optional extractors.

Time: ~30 minutes
Prerequisites: Rust knowledge, familiarity with StemeDB concepts

Overview

A domain in stemedb-ontology defines:

  1. Entity Types - The kinds of things in your domain (e.g., Drug, Company, Asset)
  2. Predicate Schemas - How subjects are built for different predicate categories
  3. Source Hierarchy - How to weight different source authorities
  4. Extractors (optional) - Code that extracts claims from external sources

Step 1: Plan Your Domain Model

Before writing code, answer these questions:

What entities exist in your domain?

| Entity | Description | Example Values |
|--------|-------------|----------------|
| ? | ? | ? |

Pharma example:

| Entity | Description | Example Values |
|--------|-------------|----------------|
| Drug | Pharmaceutical compound | Semaglutide, Tirzepatide |
| Indication | Medical condition | Type2Diabetes, Obesity |
| Target | Molecular target | GLP1R, GIPR |

What predicates will you track?

Group predicates by category (determines subject pattern):

| Category | Subject Pattern | Example Predicates |
|----------|-----------------|--------------------|
| ? | ? | ? |

Pharma example:

| Category | Subject Pattern | Example Predicates |
|----------|-----------------|--------------------|
| Efficacy | {Drug}:{Indication} | hba1c_reduction_percent, weight_loss_percent |
| Safety | {Drug} | nausea_rate, has_boxed_warning |
| Mechanism | {Drug}:{Target} | binding_affinity, mechanism_of_action |
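To make the subject patterns concrete, here is a minimal sketch of how a pattern such as {Drug}:{Indication} could be filled in from entity values. The build_subject helper is hypothetical and exists only for illustration; it is not part of stemedb-ontology, and the crate may build subjects differently.

```rust
use std::collections::HashMap;

/// Fill a subject pattern such as "{Drug}:{Indication}" with entity values.
/// Returns None if the pattern names an entity that `values` does not supply,
/// or if a placeholder is left unclosed.
/// Hypothetical helper for illustration; not part of stemedb-ontology.
fn build_subject(pattern: &str, values: &HashMap<&str, &str>) -> Option<String> {
    let mut subject = String::new();
    let mut rest = pattern;
    while let Some(open) = rest.find('{') {
        // Copy literal text (for example the ':' separator) before the placeholder.
        subject.push_str(&rest[..open]);
        let close = open + rest[open..].find('}')?;
        let entity = &rest[open + 1..close];
        subject.push_str(values.get(entity)?);
        rest = &rest[close + 1..];
    }
    // Copy any literal text after the last placeholder.
    subject.push_str(rest);
    Some(subject)
}
```

With Drug = Semaglutide and Indication = Type2Diabetes, the Efficacy pattern yields the subject Semaglutide:Type2Diabetes, while the Safety pattern yields just Semaglutide. Two claims can only conflict when they share both subject and predicate, which is why the choice of pattern per category matters.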

What sources will provide data?

Order from most to least authoritative:

| Tier | Source Class | Examples | Weight |
|------|--------------|----------|--------|
| 0 | Regulatory | ? | 1.0 |
| 1 | Clinical | ? | 0.9 |
| ... | ... | ... | ... |

Step 2: Create Domain Module

Create the directory structure:

crates/stemedb-ontology/src/
  {domain}/
    mod.rs          # Re-exports
    definition.rs   # Domain::new() builder

Template: {domain}/mod.rs

//! {Domain} domain ontology.
//!
//! This module defines the {domain} vertical with:
//! - Entity types (...)
//! - Predicate schemas (...)
//! - Source hierarchy (...)

pub mod definition;

pub use definition::definition;

// Re-export domain-specific types if any
// pub use definition::{...};

Template: {domain}/definition.rs

//! Compiled-in {domain} domain definition.

use crate::domain::{
    DefaultLens, Domain, EntityType, NamingConvention, PredicateSchema, SourceTier,
};
use stemedb_core::types::SourceClass;

/// Build the {domain} domain definition.
pub fn definition() -> Domain {
    let mut domain = Domain::new(
        "{Domain}",
        "Description of what this domain covers",
    );

    // -------------------------------------------------------------------------
    // Entity Types
    // -------------------------------------------------------------------------

    // Primary entity (e.g., the main subject of claims)
    domain = domain.with_entity_type(
        "{PrimaryEntity}",
        EntityType::required("Description")
            .with_naming(NamingConvention::CamelCase)
            // Add aliases for common variations
            .with_alias("ALIAS", "Canonical"),
    );

    // Secondary entity (for compound subjects)
    domain = domain.with_entity_type(
        "{SecondaryEntity}",
        EntityType::required("Description")
            .with_naming(NamingConvention::CamelCase),
    );

    // -------------------------------------------------------------------------
    // Predicate Schemas
    // -------------------------------------------------------------------------

    // Category 1: Primary predicates (single entity subject)
    domain = domain.with_predicate_schema(
        "category1",
        PredicateSchema::new(
            "Description of this predicate category",
            "{PrimaryEntity}",
        )
        .with_predicates(vec![
            "predicate_one",
            "predicate_two",
        ])
        .with_default_lens(DefaultLens::Recency),
    );

    // Category 2: Compound predicates (multi-entity subject)
    domain = domain.with_predicate_schema(
        "category2",
        PredicateSchema::new(
            "Description",
            "{PrimaryEntity}:{SecondaryEntity}",
        )
        .with_predicates(vec![
            "compound_predicate",
        ])
        .with_default_lens(DefaultLens::LayeredConsensus),
    );

    // -------------------------------------------------------------------------
    // Source Hierarchy
    // -------------------------------------------------------------------------

    domain = domain.with_source_hierarchy(vec![
        SourceTier::new(SourceClass::Regulatory, "Tier 0: Official Sources")
            .with_examples(vec!["Government agencies", "Standards bodies"])
            .with_weight(1.0),
        SourceTier::new(SourceClass::Clinical, "Tier 1: Primary Research")
            .with_examples(vec!["Peer-reviewed journals", "Research institutions"])
            .with_weight(0.9)
            .with_decay(730), // 2 year half-life
        SourceTier::new(SourceClass::Observational, "Tier 2: Secondary Analysis")
            .with_examples(vec!["Industry reports", "Analyst research"])
            .with_weight(0.7)
            .with_decay(365),
        SourceTier::new(SourceClass::Expert, "Tier 3: Expert Opinion")
            .with_examples(vec!["Industry experts", "Consultants"])
            .with_weight(0.5)
            .with_decay(180),
        SourceTier::new(SourceClass::Community, "Tier 4: Community")
            .with_examples(vec!["Professional forums", "Curated discussions"])
            .with_weight(0.3)
            .with_decay(90),
        SourceTier::new(SourceClass::Anecdotal, "Tier 5: Anecdotal")
            .with_examples(vec!["Social media", "Blog posts"])
            .with_weight(0.1)
            .with_decay(30),
    ]);

    domain
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_definition_builds() {
        let domain = definition();
        assert_eq!(domain.name, "{Domain}");
        assert!(!domain.entity_types.is_empty());
        assert!(!domain.predicate_schemas.is_empty());
        assert!(!domain.source_hierarchy.is_empty());
    }

    #[test]
    fn test_entity_normalization() {
        let domain = definition();
        let entity = domain.get_entity_type("{PrimaryEntity}").expect("entity exists");

        // Test alias normalization
        assert_eq!(entity.normalize("ALIAS"), "Canonical");
        assert_eq!(entity.normalize("Canonical"), "Canonical");
    }

    #[test]
    fn test_predicate_schema_lookup() {
        let domain = definition();

        // Direct lookup
        let schema = domain.get_schema("category1").expect("schema exists");
        assert_eq!(schema.subject_pattern, "{PrimaryEntity}");

        // Lookup by predicate
        let schema = domain.schema_for_predicate("predicate_one").expect("found");
        assert!(schema.predicates.contains(&"predicate_one".to_string()));
    }
}
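The template comments describe with_decay(730) as a two-year half-life. As a rough sketch of what that could mean for a source's weight over time, the function below applies plain exponential half-life decay. Both the formula and the effective_weight name are assumptions made for illustration; this guide does not show how the crate actually applies decay. Tiers without a decay setting, such as the Regulatory tier above, are assumed to keep their full weight.

```rust
/// Effective weight of a source after `age_days`, assuming exponential decay
/// where `half_life_days` is the value passed to `with_decay`.
/// ASSUMPTION: the formula is illustrative; the crate may compute decay differently.
/// A half-life of zero is not handled here.
fn effective_weight(base_weight: f64, half_life_days: Option<u32>, age_days: u32) -> f64 {
    match half_life_days {
        // Tiers without a decay setting keep their full weight.
        None => base_weight,
        Some(half_life) => {
            let half_lives = f64::from(age_days) / f64::from(half_life);
            base_weight * 0.5_f64.powf(half_lives)
        }
    }
}
```

Under these assumptions, a Tier 1 source (weight 0.9, decay 730) that is two years old would count for about 0.45, which is below a fresh Tier 3 source at 0.5. Keep that interaction in mind when choosing decay values.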

Step 3: Implement Extractors (Optional)

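If your domain has external data sources, implement the MedicalExtractor trait for each one. The templates below re-export this trait, along with its supporting types, from the pharma domain's extractors module.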

Directory Structure

crates/stemedb-ontology/src/
  {domain}/
    mod.rs
    definition.rs
    extractors/
      mod.rs
      {source}.rs

Template: {domain}/extractors/mod.rs

//! Data extractors for {domain}.

mod {source};

pub use {source}::{Source}Extractor;

// Re-export the shared extractor trait and supporting types from the pharma domain
pub use crate::pharma::extractors::{
    ExtractError, MedicalClaim, MedicalExtractor, RetryConfig, SourceInput,
};

Template: {domain}/extractors/{source}.rs

//! {Source} data extractor.

use super::{ExtractError, MedicalClaim, MedicalExtractor, SourceInput};
use async_trait::async_trait;
use stemedb_core::types::{ObjectValue, SourceClass};

/// Extractor for {Source} data.
pub struct {Source}Extractor {
    http_client: reqwest::Client,
    base_url: String,
}

impl {Source}Extractor {
    /// Create a new extractor.
    pub fn new() -> Self {
        Self {
            http_client: reqwest::Client::new(),
            base_url: "https://api.example.com".to_string(),
        }
    }
}

impl Default for {Source}Extractor {
    fn default() -> Self {
        Self::new()
    }
}

#[async_trait]
impl MedicalExtractor for {Source}Extractor {
    fn name(&self) -> &str {
        "{Source} Extractor"
    }

    fn source_class(&self) -> SourceClass {
        SourceClass::Regulatory  // Adjust based on source authority
    }

    fn can_handle(&self, source: &SourceInput) -> bool {
        matches!(source, SourceInput::DrugName(_) | SourceInput::Url(_))
    }

    async fn extract(&self, source: &SourceInput) -> Result<Vec<MedicalClaim>, ExtractError> {
        let query = match source {
            SourceInput::DrugName(name) => name.clone(),
            SourceInput::Url(url) => url.clone(),
            _ => return Err(ExtractError::NotFound("Unsupported input type".into())),
        };

        // Fetch data from source
        let url = format!("{}/search?q={}", self.base_url, urlencoding::encode(&query));
        let response = self.http_client.get(&url).send().await?;

        if !response.status().is_success() {
            return Err(ExtractError::ApiError(format!(
                "HTTP {}", response.status()
            )));
        }

        // Parse response and extract claims
        let mut claims = Vec::new();

        // Example claim
        claims.push(
            MedicalClaim::new(
                "Subject",
                "predicate_name",
                ObjectValue::Float(42.0),
            )
            .with_confidence(0.9)
            .with_source_url(&url)
            .with_source_section("Section Name")
            .with_quote("Supporting quote from source")
            .with_source_class(self.source_class())
        );

        Ok(claims)
    }
}
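The can_handle method lets calling code choose an extractor for a given input. The sketch below shows one way that dispatch could work. It uses simplified, synchronous stand-ins for SourceInput and MedicalExtractor so that it compiles on its own; the pick_extractor function and the two extractor structs are hypothetical and are not part of the crate.

```rust
/// Stand-in for the crate's SourceInput; only the two variants used in the template.
enum SourceInput {
    DrugName(String),
    Url(String),
}

/// Simplified, synchronous stand-in for the MedicalExtractor trait.
trait Extractor {
    fn name(&self) -> &str;
    fn can_handle(&self, source: &SourceInput) -> bool;
}

struct LabelExtractor; // hypothetical: accepts drug names only
struct WebExtractor; // hypothetical: accepts URLs only

impl Extractor for LabelExtractor {
    fn name(&self) -> &str {
        "Label Extractor"
    }
    fn can_handle(&self, source: &SourceInput) -> bool {
        matches!(source, SourceInput::DrugName(_))
    }
}

impl Extractor for WebExtractor {
    fn name(&self) -> &str {
        "Web Extractor"
    }
    fn can_handle(&self, source: &SourceInput) -> bool {
        matches!(source, SourceInput::Url(_))
    }
}

/// Return the first registered extractor that accepts the input, if any.
fn pick_extractor<'a>(
    extractors: &'a [Box<dyn Extractor>],
    source: &SourceInput,
) -> Option<&'a dyn Extractor> {
    extractors
        .iter()
        .find(|e| e.can_handle(source))
        .map(|e| e.as_ref())
}
```

Because the first match wins, registration order matters when two extractors accept the same input type. Keeping can_handle narrow avoids surprises.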

Step 4: Create CLI Binary (Optional)

For user-facing domains, create a CLI tool.

Template: src/bin/steme_{domain}.rs

//! CLI for {domain} domain operations.

use clap::Parser;
use stemedb_ontology::client::StemeClient;
use stemedb_ontology::{domain}::definition;

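// Uncomment once these module files exist; a `mod` declaration without a
// matching file fails to compile.
// mod cli;
// mod commands;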

#[derive(Parser)]
#[command(name = "steme-{domain}")]
#[command(about = "{Domain} data operations for StemeDB")]
struct Cli {
    #[arg(long, default_value = "http://localhost:18180")]
    server: String,

    #[command(subcommand)]
    command: Commands,
}

#[derive(clap::Subcommand)]
enum Commands {
    /// Ingest data
    Ingest { /* args */ },
    /// Query data
    Query { /* args */ },
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let cli = Cli::parse();
    let client = StemeClient::new(&cli.server);

    match cli.command {
        Commands::Ingest { /* args */ } => {
            // Implementation
        }
        Commands::Query { /* args */ } => {
            // Implementation
        }
    }

    Ok(())
}

Step 5: Testing Checklist

Before considering your domain complete:

  • cargo build -p stemedb-ontology succeeds
  • definition() returns a valid Domain
  • All entity types have meaningful descriptions
  • All predicate schemas have correct subject patterns
  • Entity normalization works (aliases resolve correctly)
  • schema_for_predicate() finds the right schema
  • Source hierarchy has 6 tiers with decreasing weights
  • cargo test -p stemedb-ontology passes, including extractor tests if you added any

Run the tests and the linter:

cargo test -p stemedb-ontology
cargo clippy -p stemedb-ontology -- -D warnings
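The checklist item about six tiers with decreasing weights can also be checked mechanically. The sketch below uses a minimal stand-in Tier struct instead of the crate's SourceTier, whose fields are not shown in this guide; check_tiers and template_tiers are hypothetical names.

```rust
/// Minimal stand-in for a source tier: a label plus its weight.
struct Tier {
    label: &'static str,
    weight: f64,
}

/// Verify that there are exactly six tiers and that each weight is strictly
/// lower than the one before it. Returns a description of the first problem.
fn check_tiers(tiers: &[Tier]) -> Result<(), String> {
    if tiers.len() != 6 {
        return Err(format!("expected 6 tiers, found {}", tiers.len()));
    }
    for pair in tiers.windows(2) {
        if pair[1].weight >= pair[0].weight {
            return Err(format!(
                "{} does not weigh less than {}",
                pair[1].label, pair[0].label
            ));
        }
    }
    Ok(())
}

/// The tier weights used in this guide's template.
fn template_tiers() -> Vec<Tier> {
    vec![
        Tier { label: "Regulatory", weight: 1.0 },
        Tier { label: "Clinical", weight: 0.9 },
        Tier { label: "Observational", weight: 0.7 },
        Tier { label: "Expert", weight: 0.5 },
        Tier { label: "Community", weight: 0.3 },
        Tier { label: "Anecdotal", weight: 0.1 },
    ]
}
```

A similar assertion against the real source_hierarchy could sit next to test_definition_builds, adapted to whatever accessors SourceTier actually exposes.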

Step 6: Integration

Export from lib.rs

Edit crates/stemedb-ontology/src/lib.rs:

// Add your domain module
pub mod {domain};

// Re-export for convenience
pub use {domain}::definition as {domain}_domain;

Update ai-lookup

Add an entry to ai-lookup/index.md under the Domain Ontology section.

Update CLAUDE.md routing (if significant)

If your domain is frequently used, add a routing entry to the Find Your Guide table in CLAUDE.md.

Complete Example: Cardiology Domain (Skeleton)

Here's a minimal working example for a cardiology domain:

// crates/stemedb-ontology/src/cardiology/mod.rs
//! Cardiology domain ontology.

pub mod definition;
pub use definition::definition;

// crates/stemedb-ontology/src/cardiology/definition.rs
use crate::domain::{DefaultLens, Domain, EntityType, NamingConvention, PredicateSchema, SourceTier};
use stemedb_core::types::SourceClass;

pub fn definition() -> Domain {
    let mut domain = Domain::new(
        "Cardiology",
        "Cardiovascular conditions, procedures, and outcomes",
    );

    // Entities
    domain = domain
        .with_entity_type(
            "Condition",
            EntityType::required("Cardiovascular condition")
                .with_naming(NamingConvention::CamelCase)
                .with_alias("MI", "MyocardialInfarction")
                .with_alias("CHF", "CongestiveHeartFailure")
                .with_alias("AF", "AtrialFibrillation"),
        )
        .with_entity_type(
            "Procedure",
            EntityType::required("Medical procedure")
                .with_naming(NamingConvention::CamelCase)
                .with_alias("CABG", "CoronaryArteryBypassGraft")
                .with_alias("PCI", "PercutaneousCoronaryIntervention"),
        )
        .with_entity_type(
            "Biomarker",
            EntityType::required("Diagnostic biomarker")
                .with_naming(NamingConvention::CamelCase),
        );

    // Schemas
    domain = domain
        .with_predicate_schema(
            "diagnosis",
            PredicateSchema::new("Diagnostic criteria", "{Condition}")
                .with_predicates(vec![
                    "diagnostic_criteria",
                    "staging_system",
                    "severity_classification",
                ])
                .with_default_lens(DefaultLens::Authority),
        )
        .with_predicate_schema(
            "outcome",
            PredicateSchema::new("Treatment outcomes", "{Condition}:{Procedure}")
                .with_predicates(vec![
                    "mortality_rate",
                    "complication_rate",
                    "readmission_rate",
                    "length_of_stay_days",
                ])
                .with_default_lens(DefaultLens::LayeredConsensus),
        )
        .with_predicate_schema(
            "biomarker",
            PredicateSchema::new("Biomarker thresholds", "{Biomarker}")
                .with_predicates(vec![
                    "normal_range",
                    "diagnostic_threshold",
                    "prognostic_value",
                ])
                .with_default_lens(DefaultLens::Consensus),
        );

    // Source hierarchy
    domain = domain.with_source_hierarchy(vec![
        SourceTier::new(SourceClass::Regulatory, "Tier 0: Guidelines")
            .with_examples(vec!["ACC/AHA Guidelines", "ESC Guidelines"])
            .with_weight(1.0),
        SourceTier::new(SourceClass::Clinical, "Tier 1: Clinical Trials")
            .with_examples(vec!["Landmark RCTs", "Meta-analyses"])
            .with_weight(0.9)
            .with_decay(730),
        SourceTier::new(SourceClass::Observational, "Tier 2: Registries")
            .with_examples(vec!["NCDR", "Get With The Guidelines"])
            .with_weight(0.7)
            .with_decay(365),
        SourceTier::new(SourceClass::Expert, "Tier 3: Expert Consensus")
            .with_examples(vec!["Consensus statements", "Textbooks"])
            .with_weight(0.5)
            .with_decay(180),
        SourceTier::new(SourceClass::Community, "Tier 4: Community")
            .with_examples(vec!["Medical forums", "CME discussions"])
            .with_weight(0.3)
            .with_decay(90),
        SourceTier::new(SourceClass::Anecdotal, "Tier 5: Anecdotal")
            .with_examples(vec!["Case reports", "Social media"])
            .with_weight(0.1)
            .with_decay(30),
    ]);

    domain
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_cardiology_domain() {
        let domain = definition();
        assert_eq!(domain.name, "Cardiology");

        // Check entity aliases
        let condition = domain.get_entity_type("Condition").unwrap();
        assert_eq!(condition.normalize("MI"), "MyocardialInfarction");

        // Check schema lookup
        let schema = domain.schema_for_predicate("mortality_rate").unwrap();
        assert_eq!(schema.subject_pattern, "{Condition}:{Procedure}");
    }
}

Troubleshooting

"Unknown predicate" errors

Your predicate isn't in any schema. Add it to the appropriate with_predicates() call.
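To see why a missing predicate produces this error, it helps to picture the lookup as a search over each schema's predicate list. The sketch below mirrors the cardiology example with a minimal stand-in Schema struct. It is an assumption about how schema_for_predicate behaves, not the crate's implementation.

```rust
/// Minimal stand-in for a predicate schema.
struct Schema {
    category: &'static str,
    subject_pattern: &'static str,
    predicates: Vec<&'static str>,
}

/// Find the schema whose predicate list contains `predicate`.
/// Returns None when no schema lists it (the "unknown predicate" case).
fn schema_for_predicate<'a>(schemas: &'a [Schema], predicate: &str) -> Option<&'a Schema> {
    schemas
        .iter()
        .find(|s| s.predicates.iter().any(|p| *p == predicate))
}

/// The three schemas from the cardiology example in this guide.
fn cardiology_schemas() -> Vec<Schema> {
    vec![
        Schema {
            category: "diagnosis",
            subject_pattern: "{Condition}",
            predicates: vec!["diagnostic_criteria", "staging_system", "severity_classification"],
        },
        Schema {
            category: "outcome",
            subject_pattern: "{Condition}:{Procedure}",
            predicates: vec![
                "mortality_rate",
                "complication_rate",
                "readmission_rate",
                "length_of_stay_days",
            ],
        },
        Schema {
            category: "biomarker",
            subject_pattern: "{Biomarker}",
            predicates: vec!["normal_range", "diagnostic_threshold", "prognostic_value"],
        },
    ]
}
```

In this sketch, looking up mortality_rate finds the outcome schema, while a predicate such as ejection_fraction, which no schema lists, returns None.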

Subject collision issues

If claims that should conflict aren't conflicting, check that:

  1. The subject pattern matches your intent
  2. Entity values are being normalized consistently
  3. The predicate is in the right schema category
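The second point is the most common cause. The sketch below shows how alias normalization makes two spellings of the same entities land on one subject. The normalize and outcome_subject functions are hypothetical stand-ins; in particular, passing unknown values through unchanged is an assumption, not confirmed crate behavior.

```rust
use std::collections::HashMap;

/// Map a raw entity value to its canonical name.
/// ASSUMPTION: values without an alias entry pass through unchanged.
fn normalize<'a>(aliases: &HashMap<&'a str, &'a str>, raw: &'a str) -> &'a str {
    aliases.get(raw).copied().unwrap_or(raw)
}

/// Build a "{Condition}:{Procedure}" subject from raw values,
/// normalizing each part with its own alias table first.
fn outcome_subject<'a>(
    condition_aliases: &HashMap<&'a str, &'a str>,
    procedure_aliases: &HashMap<&'a str, &'a str>,
    condition: &'a str,
    procedure: &'a str,
) -> String {
    format!(
        "{}:{}",
        normalize(condition_aliases, condition),
        normalize(procedure_aliases, procedure)
    )
}
```

With the cardiology aliases, the raw pair MI and CABG and the canonical pair MyocardialInfarction and CoronaryArteryBypassGraft both produce the same subject. If one ingestion path skips normalization, the two claims end up under different subjects and never conflict.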

Extractor not finding data

  1. Check the API URL is correct
  2. Verify the query parameters match the API's expectations
  3. Add debug logging to see raw responses

Next Steps

  • Run the Consumer Health UAT to see the pharma domain in action
  • Read the Lens documentation to understand conflict resolution
  • Check the SDK guide for Go integration