Concept Matching Architecture Analysis
Date: 2026-02-04
Status: Critical Gap Identified
Priority: High (Enterprise Blocker)
Executive Summary
The current tail-path matching system (ConceptIndex) enables cross-scheme concept matching but has fundamental limitations for enterprise policy enforcement. While it works well for bundled RFC/OWASP corpus matching, it fails when security teams create custom policies that don't align with extractor output paths.
Recommendation: Implement a three-tier matching system with explicit policy aliasing.
Current Architecture
1. Tail-Path Matching (ConceptIndex)
Algorithm:
// Both produce key: "tls/cert_verification::enabled"
"rfc://5246/tls/cert_verification" // RFC corpus
"code://rust/myapp/tls/cert_verification" // Code extractor
How it works:
- Strip scheme (rfc://, code://)
- Take last 2 path segments
- Append predicate
- Key = {seg[-2]}/{seg[-1]}::{predicate}
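The four steps above can be sketched as a small function. This is a minimal illustration, not the code in concept_index.rs: the name `make_key` is borrowed from the scan flow, while the signature and the handling of short or empty paths are assumptions.

```rust
/// Builds a tail-path key: the last two path segments plus the predicate.
/// Hypothetical stand-in for `make_key`; the real signature may differ.
pub fn make_key(subject: &str, predicate: &str) -> String {
    // Drop the scheme prefix ("rfc://", "code://", ...) when present.
    let path = subject.split_once("://").map_or(subject, |(_, rest)| rest);
    // Collect non-empty segments so stray slashes do not create empty parts.
    let segments: Vec<&str> = path.split('/').filter(|s| !s.is_empty()).collect();
    // Keep at most the last two segments (assumption: shorter paths use what they have).
    let start = segments.len().saturating_sub(2);
    format!("{}::{}", segments[start..].join("/"), predicate)
}
```

Under this sketch, the RFC subject and the Rust code subject from the example both map to `tls/cert_verification::enabled`, which is why they land in the same index bucket.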
Scan Flow:
scan.rs:210 → ConceptIndex::build(&corpus)
↓
local.rs:273 → index.lookup(&claim.concept_path, &claim.predicate)
↓
concept_index.rs:54 → make_key(subject, predicate)
2. Trust Pack Import
Current State:
- ✅ Assertions stored in KV
- ✅ Indexed under predicates::AUTHORITATIVE
- ✅ Loaded into corpus at scan time (scan.rs:201)
- ✅ Included in ConceptIndex (scan.rs:210)
The Gap: Trust Pack assertions use paths defined by security teams, which may not match extractor conventions.
The Problem
Scenario: Enterprise Policy Mismatch
Security Team's Intent:
# They create a "blessed" standard
subject = "code://standards/tls/cert_verification"
predicate = "enabled"
object = true
What Code Extractors Produce:
// Rust extractor output
concept_path: "code://rust/myapp/tls/cert_verification"
// Go extractor output
concept_path: "code://go/myapp/tls/cert_verification"
// Python extractor output
concept_path: "code://python/myapp/tls/cert_verification"
Current Behavior:
Security standard: "standards/tls" → key: "tls/cert_verification::enabled"
Rust code: "rust/myapp/tls" → key: "myapp/tls::enabled" ❌ MISMATCH
Root Cause
Tail-path matching assumes:
- Uniform Depth: All sources use similar path hierarchies
- Language Agnostic: The "tls/cert_verification" pattern is universal
But enterprise policies violate these assumptions:
- Security teams think in domains (standards/tls)
- Extractors output language-qualified paths (rust/myapp/tls)
Analysis: Is Tail-Path Matching Sufficient?
What Works Well
- RFC ↔ Code Matching
  - RFCs use domain concepts: rfc://5246/tls/cert_verification
  - Code extractors intentionally align: code://rust/.../tls/cert_verification
  - This was designed to work
- Zero Configuration
  - No manual alias mapping required
  - "Just works" for bundled corpus
- Cross-Language Matching
  - code://rust/.../tls/cert_verification
  - code://python/.../tls/cert_verification
  - Both match the same RFC
What Breaks
- Enterprise Policy Hierarchies
  - Security teams use logical groupings: standards/, internal/, exceptions/
  - These don't map to extractor output
- Vendor-Specific Patterns
  - Unreal Engine: unreal://engine/rendering/synchronous_loading
  - Code: code://cpp/mygame/rendering/assets/load_sync
  - Different semantic levels
- Domain-Specific Abstractions
  - Healthcare: hipaa://patient_data/encryption
  - Finance: pci://cardholder_data/storage
  - Code may not mirror these hierarchies
Solution Options
Option 1: Normalize Extractor Output (Rejected)
Idea: Make extractors output "canonical" paths that match standards.
Why it fails:
- Extractors need language context (rust/myapp)
- Path structure conveys information (file location, module hierarchy)
- Breaks existing aliases and observations
Option 2: Flexible Tail-Path Length (Partial)
Idea: Try matching with N=1, N=2, N=3 segments.
// Try multiple keys
"cert_verification::enabled" // N=1
"tls/cert_verification::enabled" // N=2
"myapp/tls/cert_verification::enabled" // N=3
Pros:
- Handles some depth mismatches
- Backward compatible
Cons:
- Ambiguous matches (which key wins?)
- Still doesn't solve semantic differences
- Performance impact (3x index lookups)
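To make the cost and ambiguity of Option 2 concrete, the sketch below generates the candidate keys for N = 1 up to a chosen maximum. `candidate_keys` and its `max_n` parameter are hypothetical names introduced here for illustration only.

```rust
/// Returns one candidate key per tail length 1..=max_n, shortest tail first.
/// Hypothetical helper; not part of the existing ConceptIndex API.
pub fn candidate_keys(subject: &str, predicate: &str, max_n: usize) -> Vec<String> {
    // Same scheme stripping and segment splitting as the tail-path key.
    let path = subject.split_once("://").map_or(subject, |(_, rest)| rest);
    let segments: Vec<&str> = path.split('/').filter(|s| !s.is_empty()).collect();
    // Never ask for more segments than the path actually has.
    (1..=max_n.min(segments.len()))
        .map(|n| format!("{}::{}", segments[segments.len() - n..].join("/"), predicate))
        .collect()
}
```

Each returned key costs one index lookup, and the caller still has to decide which hit takes precedence when more than one key matches, which is the ambiguity listed under Cons.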
Option 3: Explicit Policy Aliases (Recommended)
Idea: Add an alias layer in Trust Packs.
Trust Pack Schema Extension:
pub struct TrustPack {
    pub header: PackHeader,
    pub assertions: Vec<Assertion>,
    pub aliases: Vec<ConceptAlias>,       // Already exists!
    pub policy_aliases: Vec<PolicyAlias>, // NEW
    pub signature: [u8; 64],
}

pub struct PolicyAlias {
    /// The policy path used in assertions
    pub policy_path: String,
    /// Glob patterns that should match this policy
    pub target_patterns: Vec<String>,
}
Example:
PolicyAlias {
    policy_path: "code://standards/tls/cert_verification",
    target_patterns: vec![
        "code://rust/*/tls/cert_verification",
        "code://go/*/tls/cert_verification",
        "code://python/*/tls/cert_verification",
    ],
}
Matching Algorithm:
impl ConceptIndex {
    /// Extended lookup: direct tail-path match first, then policy aliases.
    pub fn lookup_with_aliases(
        &self,
        subject: &str,
        predicate: &str,
        policy_aliases: &[PolicyAlias],
    ) -> Option<&Vec<Assertion>> {
        // 1. Try direct tail-path match (existing)
        if let Some(result) = self.lookup(subject, predicate) {
            return Some(result);
        }

        // 2. Try policy alias expansion, in Trust Pack order
        for alias in policy_aliases {
            if subject_matches_pattern(subject, &alias.target_patterns) {
                if let Some(result) = self.lookup(&alias.policy_path, predicate) {
                    return Some(result);
                }
            }
        }

        None
    }
}
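A self-contained toy version of this two-step lookup is sketched below, using the Unreal Engine paths from the "What Breaks" list. Everything here is an assumption made for illustration: the `HashMap`-backed index, the reduced `Assertion` type, the `allowed` predicate used in the usage note, and the placeholder `pattern_matches` helper, which supports only a single `*` and is not a real glob implementation.

```rust
use std::collections::HashMap;

/// Reduced stand-in for the real assertion type (assumption).
#[derive(Debug, Clone, PartialEq)]
pub struct Assertion {
    pub subject: String,
    pub predicate: String,
}

pub struct PolicyAlias {
    pub policy_path: String,
    pub target_patterns: Vec<String>,
}

/// Toy index keyed by tail-path key (assumption: the real index is richer).
#[derive(Default)]
pub struct ConceptIndex {
    by_key: HashMap<String, Vec<Assertion>>,
}

// Tail-path key: last two path segments plus the predicate.
fn make_key(subject: &str, predicate: &str) -> String {
    let path = subject.split_once("://").map_or(subject, |(_, rest)| rest);
    let segs: Vec<&str> = path.split('/').filter(|s| !s.is_empty()).collect();
    format!("{}::{}", segs[segs.len().saturating_sub(2)..].join("/"), predicate)
}

// Placeholder matcher: at most one `*`, which matches any run of characters.
fn pattern_matches(pattern: &str, subject: &str) -> bool {
    match pattern.split_once('*') {
        Some((head, tail)) => {
            subject.len() >= head.len() + tail.len()
                && subject.starts_with(head)
                && subject.ends_with(tail)
        }
        None => pattern == subject,
    }
}

impl ConceptIndex {
    pub fn build(corpus: &[Assertion]) -> Self {
        let mut index = ConceptIndex::default();
        for a in corpus {
            let key = make_key(&a.subject, &a.predicate);
            index.by_key.entry(key).or_default().push(a.clone());
        }
        index
    }

    pub fn lookup(&self, subject: &str, predicate: &str) -> Option<&Vec<Assertion>> {
        self.by_key.get(&make_key(subject, predicate))
    }

    pub fn lookup_with_aliases(
        &self,
        subject: &str,
        predicate: &str,
        aliases: &[PolicyAlias],
    ) -> Option<&Vec<Assertion>> {
        // 1. Direct tail-path match.
        if let Some(hit) = self.lookup(subject, predicate) {
            return Some(hit);
        }
        // 2. Aliases in declaration order; the first alias that yields a hit wins.
        aliases
            .iter()
            .filter(|a| a.target_patterns.iter().any(|p| pattern_matches(p, subject)))
            .find_map(|a| self.lookup(&a.policy_path, predicate))
    }
}
```

With a corpus holding one assertion on `unreal://engine/rendering/synchronous_loading` and an alias whose target pattern is `code://cpp/*/rendering/assets/load_sync`, a direct lookup for `code://cpp/mygame/rendering/assets/load_sync` misses because its tail key is `assets/load_sync::allowed`, while the alias step resolves it to the policy assertion.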
Recommended Implementation Plan
Phase 1: Extend Trust Pack Schema
Files:
applications/aphoria/src/policy.rs
Changes:
#[derive(Archive, Deserialize, Serialize, Debug, Clone)]
pub struct PolicyAlias {
    pub policy_path: String,
    pub target_patterns: Vec<String>,
}

pub struct TrustPack {
    // ... existing fields
    pub policy_aliases: Vec<PolicyAlias>,
    // ...
}
Phase 2: Add Pattern Matching
Files:
applications/aphoria/src/episteme/concept_index.rs
New Functions:
impl ConceptIndex {
    /// Extended lookup that tries policy aliases after tail-path match
    pub fn lookup_with_aliases(
        &self,
        subject: &str,
        predicate: &str,
        aliases: &[PolicyAlias],
    ) -> Option<&Vec<Assertion>> { ... }
}

/// Check if a subject matches a glob pattern
fn subject_matches_pattern(subject: &str, patterns: &[String]) -> bool {
    // Use glob crate or simple wildcard matching
    patterns.iter().any(|p| glob_match(p, subject))
}
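The `glob_match` helper called above is left undefined in this analysis. One possible shape for the "simple wildcard matching" route is sketched here, under the assumption that `*` stands for exactly one non-empty path segment. This is a hand-rolled stand-in, not the glob crate, and its wildcard semantics are a design choice that would need to be confirmed.

```rust
/// Returns true when `subject` matches `pattern` segment by segment.
/// Assumption: `*` matches exactly one whole, non-empty path segment.
pub fn glob_match(pattern: &str, subject: &str) -> bool {
    let pat: Vec<&str> = pattern.split('/').collect();
    let sub: Vec<&str> = subject.split('/').collect();
    // Same number of segments, and each one is equal or covered by a wildcard.
    pat.len() == sub.len()
        && pat
            .iter()
            .zip(&sub)
            .all(|(p, s)| (*p == "*" && !s.is_empty()) || p == s)
}

/// Check if a subject matches any of the given patterns.
pub fn subject_matches_pattern(subject: &str, patterns: &[String]) -> bool {
    patterns.iter().any(|p| glob_match(p, subject))
}
```

Restricting `*` to a single segment keeps a pattern such as `code://rust/*/tls/cert_verification` from matching deeper paths like `code://rust/a/b/tls/cert_verification`; whether that strictness is desirable depends on how varied extractor paths turn out to be.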
Phase 3: Integrate into Scan Flow
Files:
applications/aphoria/src/scan.rs
applications/aphoria/src/episteme/local.rs
Changes:
// scan.rs:210 - Load policy aliases from Trust Packs
let policy_manager = PolicyManager::new(&config.corpus.cache_dir);
let policies = policy_manager.load_policies(&config.policies)?;
let policy_aliases: Vec<PolicyAlias> = policies
    .iter()
    .flat_map(|p| &p.policy_aliases)
    .cloned()
    .collect();

// local.rs:273 - Use extended lookup
let auth_assertions = match index.lookup_with_aliases(
    &claim.concept_path,
    &claim.predicate,
    &policy_aliases,
) {
    Some(assertions) => assertions,
    None => continue,
};
Phase 4: CLI Tooling
New Command:
# Generate policy aliases from existing assertions
aphoria policy generate-aliases \
--policy-path "code://standards/tls/cert_verification" \
--target "code://rust/*/tls/cert_verification" \
--target "code://go/*/tls/cert_verification"
Output: Adds PolicyAlias to Trust Pack before signing.
Extension Points
1. Dynamic Alias Discovery
Future Enhancement: Auto-generate aliases during scan if code paths differ from policy paths.
// If tail-path matches but full paths differ, suggest alias
if tail_match && !full_match {
    suggestions.push(PolicyAlias {
        policy_path: assertion.subject.clone(),
        target_patterns: vec![claim.concept_path.clone()],
    });
}
2. Semantic Equivalence
Future Enhancement: Use embedding similarity for fuzzy matching.
pub struct SemanticAlias {
    pub policy_path: String,
    pub similarity_threshold: f32,
}
// Match if embedding distance < threshold
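The matching rule in the comment above could look like the following sketch. It assumes cosine distance (1 minus cosine similarity) as the metric and treats the threshold as a maximum distance; the analysis does not specify either, and `cosine_distance` and `is_semantic_match` are hypothetical names. How embeddings are produced is out of scope here.

```rust
/// Cosine distance between two embeddings: 1 - cosine similarity.
/// Returns None for mismatched lengths, empty input, or zero-length vectors.
pub fn cosine_distance(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(1.0 - dot / (norm_a * norm_b))
}

/// Match if the embedding distance is below the threshold (assumed semantics).
pub fn is_semantic_match(a: &[f32], b: &[f32], threshold: f32) -> bool {
    cosine_distance(a, b).map_or(false, |d| d < threshold)
}
```

Vectors that cannot be compared are treated as non-matching rather than as errors, so a missing or malformed embedding never produces a policy match by accident.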
3. Hierarchical Policy Inheritance
Future Enhancement: Support policy hierarchies.
// Match "code://standards/tls/*" against any TLS assertion
pub struct HierarchyAlias {
    pub policy_prefix: String, // "code://standards/tls"
    pub target_prefix: String, // "code://rust/*/tls"
}
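One way such a prefix alias might be applied is to rewrite a matching code path onto the policy hierarchy before lookup. The sketch below assumes `*` in `target_prefix` matches exactly one non-empty segment; `rewrite_to_policy` is a hypothetical helper, not an existing function.

```rust
pub struct HierarchyAlias {
    pub policy_prefix: String, // "code://standards/tls"
    pub target_prefix: String, // "code://rust/*/tls"
}

/// Rewrites `subject` onto the policy hierarchy when it starts with
/// `target_prefix`; returns None when the prefix does not match.
pub fn rewrite_to_policy(alias: &HierarchyAlias, subject: &str) -> Option<String> {
    let prefix: Vec<&str> = alias.target_prefix.split('/').collect();
    let subj: Vec<&str> = subject.split('/').collect();
    if subj.len() < prefix.len() {
        return None;
    }
    // Compare the leading segments; `*` accepts any single non-empty segment.
    let matches = prefix
        .iter()
        .zip(&subj)
        .all(|(p, s)| (*p == "*" && !s.is_empty()) || p == s);
    if !matches {
        return None;
    }
    // Re-attach whatever follows the matched prefix to the policy prefix.
    let rest = &subj[prefix.len()..];
    if rest.is_empty() {
        Some(alias.policy_prefix.clone())
    } else {
        Some(format!("{}/{}", alias.policy_prefix, rest.join("/")))
    }
}
```

With the prefixes shown in the struct comments, `code://rust/myapp/tls/cert_verification` would be rewritten to `code://standards/tls/cert_verification`, so one alias covers every concept under the TLS subtree.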
Migration Path
Backward Compatibility
✅ Zero Breaking Changes:
- Tail-path matching still works for existing use cases
- PolicyAlias is optional (empty vec = current behavior)
- Existing Trust Packs without policy_aliases field deserialize fine (add default)
Adoption Strategy
Week 1: Implement core functionality (Phase 1-2)
Week 2: Integrate into scan flow (Phase 3)
Week 3: Add CLI tooling (Phase 4)
Week 4: Document + UAT with enterprise scenario
Metrics for Success
Functional
- Security team can create code://standards/* assertions
- Dev team code (code://rust/myapp/*) matches standards
- Conflicts are detected and reported
- Trust Pack signature verification passes
Performance
- Scan time increase < 5% (alias lookup is O(A*P) per claim that misses the direct lookup, where A = aliases and P = target patterns per alias)
- Memory overhead < 10KB per Trust Pack (policy aliases are small)
Usability
- Security team can export Trust Pack with aliases in < 5 commands
- Dev team imports Trust Pack with policies = ["security.pack"] (no code changes)
Open Questions
- Wildcard Syntax: Use glob (*) or regex (.*)?
  - Recommendation: Start with glob (simpler, more intuitive)
- Alias Priority: If multiple aliases match, which wins?
  - Recommendation: First match wins (deterministic order in Trust Pack)
- Alias Storage: Persist discovered aliases back to local store?
  - Recommendation: No (keep Trust Pack as source of truth)
- Alias Validation: Check patterns at Trust Pack creation time?
  - Recommendation: Yes (fail fast if invalid glob pattern)
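For the validation question, a fail-fast check at Trust Pack creation time might look like the sketch below. The specific rules (a scheme is required, no empty segments, `*` only as a whole segment) are assumptions chosen for illustration, and `validate_pattern` is a hypothetical name.

```rust
/// Validates a target pattern before it is written into a Trust Pack.
/// The rules here are illustrative assumptions, not a settled specification.
pub fn validate_pattern(pattern: &str) -> Result<(), String> {
    // A pattern must carry a scheme such as "code://".
    let (scheme, path) = pattern
        .split_once("://")
        .ok_or_else(|| format!("pattern `{}` has no scheme", pattern))?;
    if scheme.is_empty() || scheme.contains('*') {
        return Err(format!("pattern `{}` has an empty or wildcard scheme", pattern));
    }
    for segment in path.split('/') {
        // An empty path or a doubled slash yields an empty segment.
        if segment.is_empty() {
            return Err(format!("pattern `{}` has an empty path segment", pattern));
        }
        // Wildcards are only accepted as a whole segment.
        if segment.contains('*') && segment != "*" {
            return Err(format!(
                "pattern `{}` uses `*` inside segment `{}`",
                pattern, segment
            ));
        }
    }
    Ok(())
}
```

Rejecting bad patterns before signing means a malformed alias surfaces for the security team at export time rather than as a silent non-match during a developer's scan.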
Conclusion
Current State: Tail-path matching works for bundled corpus but breaks for enterprise policies.
Root Cause: Semantic mismatch between policy hierarchies and extractor output.
Solution: Add explicit PolicyAlias layer in Trust Packs.
Impact: Unblocks enterprise adoption without breaking existing functionality.
Effort: ~2-3 days (schema extension + pattern matching + integration)
Risk: Low (additive change, backward compatible)
Next Steps
- Review this analysis with team
- Validate glob pattern syntax choice
- Implement Phase 1 (schema extension)
- Write UAT scenario mimicking enterprise use case
- Iterate based on feedback
Questions or feedback? Discuss in #aphoria-architecture.