stemedb/applications/aphoria/docs/architecture/concept-matching-analysis.md
Concept Matching Architecture Analysis

Date: 2026-02-04
Status: Critical Gap Identified
Priority: High (Enterprise Blocker)


Executive Summary

The current tail-path matching system (ConceptIndex) enables cross-scheme concept matching but has fundamental limitations for enterprise policy enforcement. While it works well for bundled RFC/OWASP corpus matching, it fails when security teams create custom policies that don't align with extractor output paths.

Recommendation: Implement a three-tier matching system with explicit policy aliasing.


Current Architecture

1. Tail-Path Matching (ConceptIndex)

Algorithm:

// Both produce key: "tls/cert_verification::enabled"
"rfc://5246/tls/cert_verification"           // RFC corpus
"code://rust/myapp/tls/cert_verification"    // Code extractor

How it works:

  1. Strip scheme (rfc://, code://)
  2. Take last 2 path segments
  3. Append predicate
  4. Key = {seg[-2]}/{seg[-1]}::{predicate}
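
The four steps above can be sketched as a small standalone function. This is an illustration only: the name tail_key and the choice to return None for paths with fewer than two segments are assumptions, not the actual make_key in concept_index.rs.

```rust
/// Hypothetical stand-in for the key derivation described above.
/// Returns None when the path has fewer than two segments (an assumption).
fn tail_key(subject: &str, predicate: &str) -> Option<String> {
    // 1. Strip the scheme ("rfc://", "code://", ...) if one is present.
    let path = subject.split_once("://").map_or(subject, |(_, rest)| rest);
    // 2. Collect the non-empty path segments.
    let segments: Vec<&str> = path.split('/').filter(|s| !s.is_empty()).collect();
    if segments.len() < 2 {
        return None;
    }
    let n = segments.len();
    // 3-4. Join the last two segments and append the predicate.
    Some(format!("{}/{}::{}", segments[n - 2], segments[n - 1], predicate))
}
```

With the predicate "enabled", both rfc://5246/tls/cert_verification and code://rust/myapp/tls/cert_verification reduce to the same key, which is why the bundled corpus matches without configuration.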

Scan Flow:

scan.rs:210 → ConceptIndex::build(&corpus)
  ↓
local.rs:273 → index.lookup(&claim.concept_path, &claim.predicate)
  ↓
concept_index.rs:54 → make_key(subject, predicate)

2. Trust Pack Import

Current State:

  • Assertions stored in KV
  • Indexed under predicates::AUTHORITATIVE
  • Loaded into corpus at scan time (scan.rs:201)
  • Included in ConceptIndex (scan.rs:210)

The Gap: Trust Pack assertions use paths defined by security teams, which may not match extractor conventions.


The Problem

Scenario: Enterprise Policy Mismatch

Security Team's Intent:

# They create a "blessed" standard
subject = "code://standards/tls/cert_verification"
predicate = "enabled"
object = true

What Code Extractors Produce:

// Rust extractor output
concept_path: "code://rust/myapp/tls/cert_verification"

// Go extractor output
concept_path: "code://go/myapp/tls/cert_verification"

// Python extractor output
concept_path: "code://python/myapp/tls/cert_verification"

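Current Behavior:

The three extractor paths above share the tail tls/cert_verification with the standard, so the two-segment rule gives all of them the same key. The mismatch appears as soon as the depths differ, for example when an extractor emits the claim one level shallower:

Security standard:     "code://standards/tls/cert_verification" → key: "tls/cert_verification::enabled"
Rust code:             "code://rust/myapp/tls"                  → key: "myapp/tls::enabled"  ❌ MISMATCH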

Root Cause

Tail-path matching assumes:

  1. Uniform Depth: All sources use similar path hierarchies
  2. Language Agnostic: The "tls/cert_verification" pattern is universal

But enterprise policies violate these assumptions:

  • Security teams think in domains (standards/tls)
  • Extractors output language-qualified paths (rust/myapp/tls)

Analysis: Is Tail-Path Matching Sufficient?

What Works Well

  1. RFC ↔ Code Matching

    • RFCs use domain concepts: rfc://5246/tls/cert_verification
    • Code extractors intentionally align: code://rust/.../tls/cert_verification
    • This was designed to work
  2. Zero Configuration

    • No manual alias mapping required
    • "Just works" for bundled corpus
  3. Cross-Language Matching

    • code://rust/.../tls/cert_verification
    • code://python/.../tls/cert_verification
    • Both match the same RFC

What Breaks

  1. Enterprise Policy Hierarchies

    • Security teams use logical groupings: standards/, internal/, exceptions/
    • These don't map to extractor output
  2. Vendor-Specific Patterns

    • Unreal Engine: unreal://engine/rendering/synchronous_loading
    • Code: code://cpp/mygame/rendering/assets/load_sync
    • Different semantic levels
  3. Domain-Specific Abstractions

    • Healthcare: hipaa://patient_data/encryption
    • Finance: pci://cardholder_data/storage
    • Code may not mirror these hierarchies

Solution Options

Option 1: Normalize Extractor Output (Rejected)

Idea: Make extractors output "canonical" paths that match standards.

Why it fails:

  • Extractors need language context (rust/myapp)
  • Path structure conveys information (file location, module hierarchy)
  • Breaks existing aliases and observations

Option 2: Flexible Tail-Path Length (Partial)

Idea: Try matching with N=1, N=2, N=3 segments.

// Try multiple keys
"cert_verification::enabled"           // N=1
"tls/cert_verification::enabled"       // N=2
"myapp/tls/cert_verification::enabled" // N=3

Pros:

  • Handles some depth mismatches
  • Backward compatible

Cons:

  • Ambiguous matches (which key wins?)
  • Still doesn't solve semantic differences
  • Performance impact (3x index lookups)
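
A minimal sketch of the candidate-key generation this option implies, assuming keys are built from the last N non-empty path segments. The helper name candidate_keys is hypothetical.

```rust
/// Hypothetical helper: build lookup keys for tail lengths 1..=max_n,
/// shortest first. Paths shorter than max_n yield fewer keys.
fn candidate_keys(subject: &str, predicate: &str, max_n: usize) -> Vec<String> {
    // Strip the scheme, then keep the non-empty segments.
    let path = subject.split_once("://").map_or(subject, |(_, rest)| rest);
    let segments: Vec<&str> = path.split('/').filter(|s| !s.is_empty()).collect();
    (1..=max_n.min(segments.len()))
        .map(|n| {
            let tail = segments[segments.len() - n..].join("/");
            format!("{}::{}", tail, predicate)
        })
        .collect()
}
```

The ambiguity is easy to reproduce: at N=1, hipaa://patient_data/encryption and an unrelated path such as code://rust/myapp/disk/encryption (a made-up example) both produce "encryption::enabled", so the shortest key cannot tell them apart.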

Option 3: Explicit Policy Aliases (Recommended)

Idea: Add an alias layer in Trust Packs.

Trust Pack Schema Extension:

pub struct TrustPack {
    pub header: PackHeader,
    pub assertions: Vec<Assertion>,
    pub aliases: Vec<ConceptAlias>,  // Already exists!
    pub policy_aliases: Vec<PolicyAlias>,  // NEW
    pub signature: [u8; 64],
}

pub struct PolicyAlias {
    /// The policy path used in assertions
    pub policy_path: String,
    /// Glob patterns that should match this policy
    pub target_patterns: Vec<String>,
}

Example:

PolicyAlias {
    policy_path: "code://standards/tls/cert_verification",
    target_patterns: vec![
        "code://rust/*/tls/cert_verification",
        "code://go/*/tls/cert_verification",
        "code://python/*/tls/cert_verification",
    ],
}

Matching Algorithm:

impl ConceptIndex {
    pub fn lookup_with_aliases(
        &self,
        subject: &str,
        predicate: &str,
        policy_aliases: &[PolicyAlias],
    ) -> Option<&Vec<Assertion>> {
        // 1. Try direct tail-path match (existing)
        if let Some(result) = self.lookup(subject, predicate) {
            return Some(result);
        }

        // 2. Try policy alias expansion
        for alias in policy_aliases {
            if subject_matches_pattern(subject, &alias.target_patterns) {
                if let Some(result) = self.lookup(&alias.policy_path, predicate) {
                    return Some(result);
                }
            }
        }

        None
    }
}
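
The two-tier lookup can be exercised end to end with a toy index. The sketch below is not the real ConceptIndex: ToyIndex, tail_key, and matches are hypothetical stand-ins, assertions are reduced to string ids, and `*` is assumed to match exactly one path segment.

```rust
use std::collections::HashMap;

struct PolicyAlias {
    policy_path: String,
    target_patterns: Vec<String>,
}

/// Toy index: tail key -> assertion ids (stand-ins for real assertions).
struct ToyIndex {
    by_key: HashMap<String, Vec<String>>,
}

/// Last two path segments plus predicate, as in the tail-path rule.
fn tail_key(subject: &str, predicate: &str) -> Option<String> {
    let path = subject.split_once("://").map_or(subject, |(_, rest)| rest);
    let segs: Vec<&str> = path.split('/').filter(|s| !s.is_empty()).collect();
    let n = segs.len();
    (n >= 2).then(|| format!("{}/{}::{}", segs[n - 2], segs[n - 1], predicate))
}

/// Segment-wise wildcard match; `*` stands for one whole segment (assumption).
fn matches(pattern: &str, subject: &str) -> bool {
    let p: Vec<&str> = pattern.split('/').collect();
    let s: Vec<&str> = subject.split('/').collect();
    p.len() == s.len() && p.iter().zip(&s).all(|(a, b)| *a == "*" || a == b)
}

impl ToyIndex {
    fn insert(&mut self, subject: &str, predicate: &str, id: &str) {
        if let Some(key) = tail_key(subject, predicate) {
            self.by_key.entry(key).or_default().push(id.to_string());
        }
    }

    fn lookup(&self, subject: &str, predicate: &str) -> Option<&Vec<String>> {
        self.by_key.get(&tail_key(subject, predicate)?)
    }

    /// Tier 1: tail-path match. Tier 2: first alias whose pattern matches
    /// the subject and whose policy path resolves in the index.
    fn lookup_with_aliases(
        &self,
        subject: &str,
        predicate: &str,
        aliases: &[PolicyAlias],
    ) -> Option<&Vec<String>> {
        self.lookup(subject, predicate).or_else(|| {
            aliases
                .iter()
                .filter(|a| a.target_patterns.iter().any(|p| matches(p, subject)))
                .find_map(|a| self.lookup(&a.policy_path, predicate))
        })
    }
}
```

Using the Unreal Engine paths from "What Breaks": with an assertion indexed under unreal://engine/rendering/synchronous_loading and an alias targeting code://cpp/*/rendering/assets/load_sync, a plain lookup for code://cpp/mygame/rendering/assets/load_sync misses, while the alias-aware lookup resolves to the policy assertion. The predicate used to try this out is arbitrary.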

Implementation Plan

Phase 1: Extend Trust Pack Schema

Files:

  • applications/aphoria/src/policy.rs

Changes:

#[derive(Archive, Deserialize, Serialize, Debug, Clone)]
pub struct PolicyAlias {
    pub policy_path: String,
    pub target_patterns: Vec<String>,
}

pub struct TrustPack {
    // ... existing fields
    pub policy_aliases: Vec<PolicyAlias>,
    // ...
}

Phase 2: Add Pattern Matching

Files:

  • applications/aphoria/src/episteme/concept_index.rs

New Functions:

impl ConceptIndex {
    /// Extended lookup that tries policy aliases after tail-path match
    pub fn lookup_with_aliases(
        &self,
        subject: &str,
        predicate: &str,
        aliases: &[PolicyAlias],
    ) -> Option<&Vec<Assertion>> { ... }
}

/// Check if a subject matches a glob pattern
fn subject_matches_pattern(subject: &str, patterns: &[String]) -> bool {
    // Use glob crate or simple wildcard matching
    patterns.iter().any(|p| glob_match(p, subject))
}

Phase 3: Integrate into Scan Flow

Files:

  • applications/aphoria/src/scan.rs
  • applications/aphoria/src/episteme/local.rs

Changes:

// scan.rs:210 - Load policy aliases from Trust Packs
let policy_manager = PolicyManager::new(&config.corpus.cache_dir);
let policies = policy_manager.load_policies(&config.policies)?;
let policy_aliases: Vec<PolicyAlias> = policies
    .iter()
    .flat_map(|p| &p.policy_aliases)
    .cloned()
    .collect();

// local.rs:273 - Use extended lookup
let auth_assertions = match index.lookup_with_aliases(
    &claim.concept_path,
    &claim.predicate,
    &policy_aliases,
) {
    Some(assertions) => assertions,
    None => continue,
};

Phase 4: CLI Tooling

New Command:

# Generate policy aliases from existing assertions
aphoria policy generate-aliases \
  --policy-path "code://standards/tls/cert_verification" \
  --target "code://rust/*/tls/cert_verification" \
  --target "code://go/*/tls/cert_verification"

Output: Adds PolicyAlias to Trust Pack before signing.


Extension Points

1. Dynamic Alias Discovery

Future Enhancement: Auto-generate aliases during scan if code paths differ from policy paths.

// If tail-path matches but full paths differ, suggest alias
if tail_match && !full_match {
    suggestions.push(PolicyAlias {
        policy_path: assertion.subject.clone(),
        target_patterns: vec![claim.concept_path.clone()],
    });
}

2. Semantic Equivalence

Future Enhancement: Use embedding similarity for fuzzy matching.

pub struct SemanticAlias {
    pub policy_path: String,
    pub similarity_threshold: f32,
}

// Match if embedding similarity >= similarity_threshold
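
One way the threshold check could look, assuming cosine similarity as the measure and embeddings supplied as plain f32 slices. How embeddings are produced is out of scope here, and cosine_similarity and semantic_match are hypothetical names.

```rust
pub struct SemanticAlias {
    pub policy_path: String,
    pub similarity_threshold: f32,
}

/// Cosine similarity of two embeddings. Returns None when the lengths
/// differ or either vector has zero norm.
fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}

/// A claim matches the alias when similarity meets the alias threshold.
fn semantic_match(alias: &SemanticAlias, policy_vec: &[f32], claim_vec: &[f32]) -> bool {
    cosine_similarity(policy_vec, claim_vec)
        .map_or(false, |s| s >= alias.similarity_threshold)
}
```

Treating undefined similarity (mismatched lengths, zero vectors) as "no match" keeps the fuzzy tier from producing false policy hits on bad input.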

3. Hierarchical Policy Inheritance

Future Enhancement: Support policy hierarchies.

// Match "code://standards/tls/*" against any TLS assertion
pub struct HierarchyAlias {
    pub policy_prefix: String,  // "code://standards/tls"
    pub target_prefix: String,  // "code://rust/*/tls"
}
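
A sketch of how a prefix alias might be applied: if the subject starts with target_prefix (with `*` assumed to match one whole segment), the matched prefix is swapped for policy_prefix and the remaining segments are kept. The function rewrite_to_policy is hypothetical.

```rust
pub struct HierarchyAlias {
    pub policy_prefix: String,
    pub target_prefix: String,
}

/// Map a code path into the policy hierarchy, or return None when the
/// subject does not fall under the alias's target prefix.
fn rewrite_to_policy(alias: &HierarchyAlias, subject: &str) -> Option<String> {
    let prefix: Vec<&str> = alias.target_prefix.split('/').collect();
    let subj: Vec<&str> = subject.split('/').collect();
    if subj.len() < prefix.len() {
        return None;
    }
    // Compare segment by segment; `*` accepts any single segment.
    let prefix_ok = prefix.iter().zip(&subj).all(|(p, s)| *p == "*" || p == s);
    if !prefix_ok {
        return None;
    }
    // Re-attach whatever follows the matched prefix.
    let mut out = alias.policy_prefix.clone();
    for seg in &subj[prefix.len()..] {
        out.push('/');
        out.push_str(seg);
    }
    Some(out)
}
```

With the prefixes from the struct comments, code://rust/myapp/tls/cert_verification rewrites to code://standards/tls/cert_verification, which can then go through the normal lookup.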

Migration Path

Backward Compatibility

Zero Breaking Changes:

  • Tail-path matching still works for existing use cases
  • PolicyAlias is optional (empty vec = current behavior)
  • Existing Trust Packs that lack the policy_aliases field must still deserialize (give the field a default of an empty vec)

Adoption Strategy

  • Week 1: Implement core functionality (Phases 1-2)
  • Week 2: Integrate into scan flow (Phase 3)
  • Week 3: Add CLI tooling (Phase 4)
  • Week 4: Document + UAT with enterprise scenario


Metrics for Success

Functional

  • Security team can create code://standards/* assertions
  • Dev team code (code://rust/myapp/*) matches standards
  • Conflicts are detected and reported
  • Trust Pack signature verification passes

Performance

  • Scan time increase < 5% (alias lookup is O(A*P) per claim that misses the tail-path match, where A = aliases and P = patterns per alias)
  • Memory overhead < 10KB per Trust Pack (policy aliases are small)

Usability

  • Security team can export Trust Pack with aliases in < 5 commands
  • Dev team imports Trust Pack with policies = ["security.pack"] (no code changes)

Open Questions

  1. Wildcard Syntax: Use glob (*) or regex (.*)?

    • Recommendation: Start with glob (simpler, more intuitive)
  2. Alias Priority: If multiple aliases match, which wins?

    • Recommendation: First match wins (deterministic order in Trust Pack)
  3. Alias Storage: Persist discovered aliases back to local store?

    • Recommendation: No (keep Trust Pack as source of truth)
  4. Alias Validation: Check patterns at Trust Pack creation time?

    • Recommendation: Yes (fail fast if invalid glob pattern)
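
For question 4, a creation-time check could be as small as the sketch below. The rules are assumptions made for illustration: a pattern must carry a scheme, must not contain empty segments, and may use `*` only as a whole segment. The name validate_pattern is hypothetical.

```rust
/// Hypothetical creation-time check for a PolicyAlias target pattern.
/// Returns a human-readable error so pack creation can fail fast.
fn validate_pattern(pattern: &str) -> Result<(), String> {
    let (scheme, path) = pattern
        .split_once("://")
        .ok_or_else(|| format!("missing scheme in {pattern:?}"))?;
    if scheme.is_empty() || scheme.contains('*') {
        return Err(format!("invalid scheme in {pattern:?}"));
    }
    for seg in path.split('/') {
        if seg.is_empty() {
            return Err(format!("empty path segment in {pattern:?}"));
        }
        // Partial wildcards such as "my*app" are rejected in this sketch.
        if seg.contains('*') && seg != "*" {
            return Err(format!("`*` must be a whole segment in {pattern:?}"));
        }
    }
    Ok(())
}
```

Rejecting partial wildcards keeps the matcher a simple segment comparison; if richer glob syntax is chosen for question 1, this check would need to be relaxed accordingly.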

Conclusion

Current State: Tail-path matching works for bundled corpus but breaks for enterprise policies.

Root Cause: Semantic mismatch between policy hierarchies and extractor output.

Solution: Add explicit PolicyAlias layer in Trust Packs.

Impact: Unblocks enterprise adoption without breaking existing functionality.

Effort: ~2-3 days (schema extension + pattern matching + integration)

Risk: Low (additive change, backward compatible)


Next Steps

  1. Review this analysis with team
  2. Validate glob pattern syntax choice
  3. Implement Phase 1 (schema extension)
  4. Write UAT scenario mimicking enterprise use case
  5. Iterate based on feedback

Questions or feedback? Discuss in #aphoria-architecture.