# Concept Matching Architecture Analysis

**Date:** 2026-02-04
**Status:** Critical Gap Identified
**Priority:** High (Enterprise Blocker)

---

## Executive Summary

The current tail-path matching system (ConceptIndex) enables cross-scheme concept matching but has fundamental limitations for enterprise policy enforcement. While it works well for bundled RFC/OWASP corpus matching, it fails when security teams create custom policies that don't align with extractor output paths.

**Recommendation:** Implement a three-tier matching system with explicit policy aliasing.

---

## Current Architecture

### 1. Tail-Path Matching (ConceptIndex)

**Algorithm:**

```rust
// Both produce key: "tls/cert_verification::enabled"
"rfc://5246/tls/cert_verification"        // RFC corpus
"code://rust/myapp/tls/cert_verification" // Code extractor
```

**How it works:**

1. Strip scheme (`rfc://`, `code://`)
2. Take last 2 path segments
3. Append predicate
4. Key = `{seg[-2]}/{seg[-1]}::{predicate}`

**Scan Flow:**

```
scan.rs:210 → ConceptIndex::build(&corpus)
    ↓
local.rs:273 → index.lookup(&claim.concept_path, &claim.predicate)
    ↓
concept_index.rs:54 → make_key(subject, predicate)
```

### 2. Trust Pack Import

**Current State:**

- ✅ Assertions stored in KV
- ✅ Indexed under `predicates::AUTHORITATIVE`
- ✅ Loaded into corpus at scan time (scan.rs:201)
- ✅ Included in ConceptIndex (scan.rs:210)

**The Gap:** Trust Pack assertions use paths defined by security teams, which may not match extractor conventions.

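
As a rough illustration of the four tail-path steps listed under "How it works" in section 1, the sketch below derives a key from a subject and a predicate. The function name `tail_path_key` and the `None` result for paths with fewer than two segments are assumptions made for this example; the actual `make_key` in `concept_index.rs` may behave differently.

```rust
/// Hypothetical sketch of tail-path key derivation (not the real `make_key`).
/// Builds a key of the form `{seg[-2]}/{seg[-1]}::{predicate}`.
/// Assumption: paths with fewer than two segments produce no key.
pub fn tail_path_key(subject: &str, predicate: &str) -> Option<String> {
    // 1. Strip the scheme (`rfc://`, `code://`, ...) if one is present.
    let path = subject.split_once("://").map_or(subject, |(_, rest)| rest);

    // 2. Collect the non-empty path segments.
    let segments: Vec<&str> = path.split('/').filter(|s| !s.is_empty()).collect();

    // 3. Take the last two segments and 4. append the predicate.
    match segments.as_slice() {
        [.., parent, leaf] => Some(format!("{parent}/{leaf}::{predicate}")),
        _ => None,
    }
}
```

Under these assumptions, `rfc://5246/tls/cert_verification` and `code://rust/myapp/tls/cert_verification` with predicate `enabled` both yield `tls/cert_verification::enabled`, which is the cross-scheme match the bundled corpus relies on.
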

---

## The Problem

### Scenario: Enterprise Policy Mismatch

**Security Team's Intent:**

```toml
# They create a "blessed" standard
subject = "code://standards/tls/cert_verification"
predicate = "enabled"
object = true
```

**What Code Extractors Produce:**

```rust
// Rust extractor output
concept_path: "code://rust/myapp/tls/cert_verification"

// Go extractor output
concept_path: "code://go/myapp/tls/cert_verification"

// Python extractor output
concept_path: "code://python/myapp/tls/cert_verification"
```

**Current Behavior:**

```
Security standard: "standards/tls"  → key: "tls/cert_verification::enabled"
Rust code:         "rust/myapp/tls" → key: "myapp/tls::enabled" ❌ MISMATCH
```

### Root Cause

Tail-path matching assumes:

1. **Uniform Depth:** All sources use similar path hierarchies
2. **Language Agnostic:** The "tls/cert_verification" pattern is universal

But enterprise policies violate these assumptions:

- Security teams think in **domains** (`standards/tls`)
- Extractors output **language-qualified** paths (`rust/myapp/tls`)

---

## Analysis: Is Tail-Path Matching Sufficient?

### What Works Well

1. **RFC ↔ Code Matching**
   - RFCs use domain concepts: `rfc://5246/tls/cert_verification`
   - Code extractors intentionally align: `code://rust/.../tls/cert_verification`
   - This was designed to work

2. **Zero Configuration**
   - No manual alias mapping required
   - "Just works" for bundled corpus

3. **Cross-Language Matching**
   - `code://rust/.../tls/cert_verification`
   - `code://python/.../tls/cert_verification`
   - Both match the same RFC

### What Breaks

1. **Enterprise Policy Hierarchies**
   - Security teams use logical groupings: `standards/`, `internal/`, `exceptions/`
   - These don't map to extractor output

2. **Vendor-Specific Patterns**
   - Unreal Engine: `unreal://engine/rendering/synchronous_loading`
   - Code: `code://cpp/mygame/rendering/assets/load_sync`
   - Different semantic levels

3. **Domain-Specific Abstractions**
   - Healthcare: `hipaa://patient_data/encryption`
   - Finance: `pci://cardholder_data/storage`
   - Code may not mirror these hierarchies

---

## Solution Options

### Option 1: Normalize Extractor Output (Rejected)

**Idea:** Make extractors output "canonical" paths that match standards.

**Why it fails:**

- Extractors need language context (`rust/myapp`)
- Path structure conveys information (file location, module hierarchy)
- Breaks existing aliases and observations

### Option 2: Flexible Tail-Path Length (Partial)

**Idea:** Try matching with N=1, N=2, N=3 segments.

```rust
// Try multiple keys
"cert_verification::enabled"           // N=1
"tls/cert_verification::enabled"       // N=2
"myapp/tls/cert_verification::enabled" // N=3
```

**Pros:**

- Handles some depth mismatches
- Backward compatible

**Cons:**

- Ambiguous matches (which key wins?)
- Still doesn't solve semantic differences
- Performance impact (3x index lookups)

### Option 3: Explicit Policy Aliases (Recommended)

**Idea:** Add an alias layer in Trust Packs.

**Trust Pack Schema Extension:**

```rust
pub struct TrustPack {
    pub header: PackHeader,
    pub assertions: Vec<Assertion>,       // element type name assumed
    pub aliases: Vec<Alias>,              // Already exists! (element type name assumed)
    pub policy_aliases: Vec<PolicyAlias>, // NEW
    pub signature: [u8; 64],
}

pub struct PolicyAlias {
    /// The policy path used in assertions
    pub policy_path: String,
    /// Glob patterns that should match this policy
    pub target_patterns: Vec<String>,
}
```

**Example:**

```rust
PolicyAlias {
    policy_path: "code://standards/tls/cert_verification".to_string(),
    target_patterns: vec![
        "code://rust/*/tls/cert_verification".to_string(),
        "code://go/*/tls/cert_verification".to_string(),
        "code://python/*/tls/cert_verification".to_string(),
    ],
}
```

**Matching Algorithm:**

```rust
impl ConceptIndex {
    pub fn lookup_with_aliases(
        &self,
        subject: &str,
        predicate: &str,
        policy_aliases: &[PolicyAlias],
    ) -> Option<&Vec<Assertion>> {
        // 1. Try direct tail-path match (existing)
        if let Some(result) = self.lookup(subject, predicate) {
            return Some(result);
        }

        // 2. Try policy alias expansion
        for alias in policy_aliases {
            if subject_matches_pattern(subject, &alias.target_patterns) {
                if let Some(result) = self.lookup(&alias.policy_path, predicate) {
                    return Some(result);
                }
            }
        }

        None
    }
}
```

---

## Recommended Implementation Plan

### Phase 1: Extend Trust Pack Schema

**Files:**

- `applications/aphoria/src/policy.rs`

**Changes:**

```rust
#[derive(Archive, Deserialize, Serialize, Debug, Clone)]
pub struct PolicyAlias {
    pub policy_path: String,
    pub target_patterns: Vec<String>,
}

pub struct TrustPack {
    // ... existing fields
    pub policy_aliases: Vec<PolicyAlias>,
    // ...
}
```

### Phase 2: Add Pattern Matching

**Files:**

- `applications/aphoria/src/episteme/concept_index.rs`

**New Functions:**

```rust
impl ConceptIndex {
    /// Extended lookup that tries policy aliases after tail-path match
    pub fn lookup_with_aliases(
        &self,
        subject: &str,
        predicate: &str,
        aliases: &[PolicyAlias],
    ) -> Option<&Vec<Assertion>> {
        ...
    }
}

/// Check if a subject matches any of the given glob patterns
fn subject_matches_pattern(subject: &str, patterns: &[String]) -> bool {
    // Use glob crate or simple wildcard matching
    patterns.iter().any(|p| glob_match(p, subject))
}
```

### Phase 3: Integrate into Scan Flow

**Files:**

- `applications/aphoria/src/scan.rs`
- `applications/aphoria/src/episteme/local.rs`

**Changes:**

```rust
// scan.rs:210 - Load policy aliases from Trust Packs
let policy_manager = PolicyManager::new(&config.corpus.cache_dir);
let policies = policy_manager.load_policies(&config.policies)?;
let policy_aliases: Vec<PolicyAlias> = policies
    .iter()
    .flat_map(|p| &p.policy_aliases)
    .cloned()
    .collect();

// local.rs:273 - Use extended lookup
let auth_assertions = match index.lookup_with_aliases(
    &claim.concept_path,
    &claim.predicate,
    &policy_aliases,
) {
    Some(assertions) => assertions,
    None => continue,
};
```

### Phase 4: CLI Tooling

**New Command:**

```bash
# Generate policy aliases from existing assertions
aphoria policy generate-aliases \
  --policy-path "code://standards/tls/cert_verification" \
  --target "code://rust/*/tls/cert_verification" \
  --target "code://go/*/tls/cert_verification"
```

**Output:** Adds `PolicyAlias` to Trust Pack before signing.

---

## Extension Points

### 1. Dynamic Alias Discovery

**Future Enhancement:** Auto-generate aliases during scan if code paths differ from policy paths.

```rust
// If tail-path matches but full paths differ, suggest alias
if tail_match && !full_match {
    suggestions.push(PolicyAlias {
        policy_path: assertion.subject.clone(),
        target_patterns: vec![claim.concept_path.clone()],
    });
}
```

### 2. Semantic Equivalence

**Future Enhancement:** Use embedding similarity for fuzzy matching.

```rust
pub struct SemanticAlias {
    pub policy_path: String,
    pub similarity_threshold: f32,
}
// Match if embedding distance < threshold
```

### 3. Hierarchical Policy Inheritance

**Future Enhancement:** Support policy hierarchies.

```rust
// Match "code://standards/tls/*" against any TLS assertion
pub struct HierarchyAlias {
    pub policy_prefix: String, // "code://standards/tls"
    pub target_prefix: String, // "code://rust/*/tls"
}
```

---

## Migration Path

### Backward Compatibility

✅ **Zero Breaking Changes:**

- Tail-path matching still works for existing use cases
- `PolicyAlias` is optional (empty vec = current behavior)
- Existing Trust Packs without a `policy_aliases` field deserialize fine (add default)

### Adoption Strategy

- **Week 1:** Implement core functionality (Phase 1-2)
- **Week 2:** Integrate into scan flow (Phase 3)
- **Week 3:** Add CLI tooling (Phase 4)
- **Week 4:** Document + UAT with enterprise scenario

---

## Metrics for Success

### Functional

- [ ] Security team can create `code://standards/*` assertions
- [ ] Dev team code (`code://rust/myapp/*`) matches standards
- [ ] Conflicts are detected and reported
- [ ] Trust Pack signature verification passes

### Performance

- [ ] Scan time increase < 5% (alias lookup is O(P*A) per claim that misses the direct tail-path match, where P = patterns per alias and A = aliases)
- [ ] Memory overhead < 10KB per Trust Pack (policy aliases are small)

### Usability

- [ ] Security team can export Trust Pack with aliases in < 5 commands
- [ ] Dev team imports Trust Pack with `policies = ["security.pack"]` (no code changes)

---

## Open Questions

1. **Wildcard Syntax:** Use glob (`*`) or regex (`.*`)?
   - **Recommendation:** Start with glob (simpler, more intuitive)

2. **Alias Priority:** If multiple aliases match, which wins?
   - **Recommendation:** First match wins (deterministic order in Trust Pack)

3. **Alias Storage:** Persist discovered aliases back to local store?
   - **Recommendation:** No (keep Trust Pack as source of truth)

4. **Alias Validation:** Check patterns at Trust Pack creation time?
   - **Recommendation:** Yes (fail fast if invalid glob pattern)

---

## Conclusion

**Current State:** Tail-path matching works for bundled corpus but breaks for enterprise policies.

**Root Cause:** Semantic mismatch between policy hierarchies and extractor output.

**Solution:** Add explicit `PolicyAlias` layer in Trust Packs.

**Impact:** Unblocks enterprise adoption without breaking existing functionality.

**Effort:** ~2-3 days (schema extension + pattern matching + integration)

**Risk:** Low (additive change, backward compatible)

---

## Next Steps

1. Review this analysis with team
2. Validate glob pattern syntax choice
3. Implement Phase 1 (schema extension)
4. Write UAT scenario mimicking enterprise use case
5. Iterate based on feedback

---

**Questions or feedback?** Discuss in `#aphoria-architecture`.
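
The glob and first-match-wins recommendations from the Open Questions could look roughly like the sketch below. It assumes `*` matches exactly one non-empty path segment, which this analysis does not specify; `glob_match` and `resolve_policy_path` are hypothetical helpers, and `PolicyAlias` is redeclared here only to keep the example self-contained.

```rust
/// Mirrors the `PolicyAlias` shape proposed in Option 3 (redeclared for the example).
#[derive(Debug, Clone)]
pub struct PolicyAlias {
    pub policy_path: String,
    pub target_patterns: Vec<String>,
}

/// Hypothetical segment-wise glob.
/// Assumption: `*` matches exactly one non-empty path segment.
pub fn glob_match(pattern: &str, subject: &str) -> bool {
    let mut pat = pattern.split('/');
    let mut sub = subject.split('/');
    loop {
        match (pat.next(), sub.next()) {
            // Both exhausted at the same time: every segment matched.
            (None, None) => return true,
            // Wildcard consumes one non-empty segment.
            (Some("*"), Some(s)) if !s.is_empty() => continue,
            // Literal segments must be identical.
            (Some(p), Some(s)) if p == s => continue,
            // Length mismatch or differing literal segment.
            _ => return false,
        }
    }
}

/// Returns the policy path of the first alias with a matching pattern
/// (first match wins, in the order the aliases are given).
pub fn resolve_policy_path<'a>(subject: &str, aliases: &'a [PolicyAlias]) -> Option<&'a str> {
    aliases
        .iter()
        .find(|alias| alias.target_patterns.iter().any(|p| glob_match(p, subject)))
        .map(|alias| alias.policy_path.as_str())
}
```

With an alias mapping `code://rust/*/tls/cert_verification` to `code://standards/tls/cert_verification`, the subject `code://rust/myapp/tls/cert_verification` resolves to the standards path, while a subject with an extra directory level does not match. A prefix-style or multi-segment wildcard would need a different matcher.
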