# Concept Matching Philosophy

**Context:** Aphoria's policy enforcement depends on matching code extractors to authoritative sources.

**Question:** How do we enable flexible matching without over-engineering?

---

## Core Design Principles

### 1. Semantic Over Syntactic

**Bad:** Match exact string paths

```
"code://rust/myapp/tls/cert_verification" != "rfc://5246/tls/cert_verification"
```

**Good:** Match semantic tail paths

```
Both produce key: "tls/cert_verification::enabled"
```

**Principle:** Concepts should match across schemes if they represent the same idea.

---

### 2. Progressive Precision

- **Layer 1:** Tail-path matching (works 80% of the time)
- **Layer 2:** Policy aliases (handles enterprise hierarchies)
- **Layer 3:** Semantic embeddings (future: fuzzy matching)

**Principle:** Start with simple heuristics, add precision layers as needed.

---

### 3. Explicit Over Implicit

**Bad:** Auto-generate aliases behind the scenes

- Hard to debug ("why did this match?")
- Fragile (breaks with refactoring)
- Opaque (security teams lose control)

**Good:** Require explicit policy aliases

- Clear provenance (alias is in Trust Pack)
- Auditable (signature covers aliases)
- Controllable (security team decides matches)

**Principle:** Matching logic should be transparent and intentional.

---

## Why Tail-Path Matching Works

### Design Insight

Code extractors are **intentionally designed** to align with RFC/OWASP paths:

**RFC Structure:**

```
rfc://5246/tls/cert_verification
rfc://7519/jwt/audience_validation
rfc://8996/tls/min_version
```

**Extractor Output:**

```
code://rust/myapp/tls/cert_verification
code://python/myapp/jwt/audience_validation
code://go/myapp/tls/min_version
```

**Key Insight:** The last 2 segments (`tls/cert_verification`) are the **concept name**. Language prefix (`rust/myapp`) provides **context** but not **identity**.

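The key derivation described above can be sketched in a few lines. This is a minimal illustration, not Aphoria's actual implementation: the function name `tail_path_key` is hypothetical, and it assumes the key is exactly the last two path segments joined to the predicate with `::`, as in the `"tls/cert_verification::enabled"` example.

```rust
/// Hypothetical helper (not from the Aphoria codebase): derive a tail-path
/// key from a concept URI such as "code://rust/myapp/tls/cert_verification".
/// Assumption: the key is the last two path segments plus "::" and the predicate.
fn tail_path_key(uri: &str, predicate: &str) -> Option<String> {
    // Drop the scheme ("code://", "rfc://"); reject input without one.
    let (_scheme, path) = uri.split_once("://")?;

    // Collect non-empty segments so stray slashes are not counted.
    let segments: Vec<&str> = path.split('/').filter(|s| !s.is_empty()).collect();
    if segments.len() < 2 {
        return None; // too short to carry a two-segment concept name
    }

    // The last two segments identify the concept; everything before is context.
    let tail = &segments[segments.len() - 2..];
    Some(format!("{}::{}", tail.join("/"), predicate))
}
```

Under these assumptions, `rfc://5246/tls/cert_verification` and `code://rust/myapp/tls/cert_verification` produce the same key for the predicate `enabled`, which is what lets a single hash lookup match across schemes.
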

---

## Why Tail-Path Matching Breaks

### Enterprise Hierarchies

Security teams think in **logical domains**, not **RFC hierarchies**:

```
code://standards/tls/cert_verification      (Security team's mental model)
code://internal/exceptions/md5_allowed      (Policy exceptions)
code://vendor/aws/s3/public_access          (Cloud-specific rules)
```

These don't map to extractor output:

```
code://rust/myapp/tls/cert_verification     (Extractor output)
```

**Problem:** `standards/tls` (2 segments) vs. `rust/myapp/tls` (3 segments)

Tail-path key mismatch:

- Policy: `"tls/cert_verification::enabled"`
- Code: `"myapp/tls::enabled"` (extracts wrong segments!)

---

## Why Policy Aliases Are the Right Solution

### 1. Preserves Tail-Path Matching

Most cases (bundled corpus) still use the fast path:

```rust
// 1. Try direct tail-path match (O(1) hash lookup)
if let Some(result) = self.lookup(subject, predicate) {
    return Some(result);
}
```

### 2. Adds Flexibility Without Complexity

Aliases are tried only when the direct match fails:

```rust
// 2. Try policy alias patterns (O(P*A), small P and A)
for alias in policy_aliases {
    // An alias applies if any of its target patterns matches the subject
    if alias.target_patterns.iter().any(|p| glob_match(p, subject)) {
        return self.lookup(&alias.policy_path, predicate);
    }
}
```

### 3. Keeps Control with Policy Authors

The security team states the mapping explicitly:

```rust
PolicyAlias {
    policy_path: "code://standards/tls/cert_verification",
    target_patterns: vec![
        "code://rust/*/tls/cert_verification",
        "code://go/*/tls/cert_verification",
    ],
}
```

This is both **documentation** (what matches what) and **enforcement** (signed in Trust Pack).

---

## Extension Points: Future Matching Layers

### Layer 3: Semantic Equivalence (Future)

**Idea:** Use embeddings to match concepts with different names.

**Example:**

```
Policy: "code://standards/tls/certificate_validation"
Code:   "code://rust/myapp/tls/cert_verification"
```

Embedding similarity: 0.92 → match

**When to add:** If alias management becomes too manual.

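As a rough sketch of what such a similarity check might look like, the snippet below computes cosine similarity between two embedding vectors and compares it to a threshold. Everything here is an assumption for illustration: the document does not specify the similarity metric, the embedding model, or the cutoff, and both function names are hypothetical.

```rust
/// Cosine similarity between two equal-length embedding vectors.
/// Returns None for mismatched lengths, empty input, or a zero vector.
fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None; // similarity is undefined for a zero vector
    }
    Some(dot / (norm_a * norm_b))
}

/// Hypothetical Layer 3 check: treat two concepts as equivalent when their
/// embeddings are at least `threshold` similar. The threshold is an assumed,
/// caller-supplied value; the source does not define one.
fn semantically_equivalent(a: &[f32], b: &[f32], threshold: f32) -> bool {
    cosine_similarity(a, b).map_or(false, |s| s >= threshold)
}
```

With an assumed threshold of 0.9, the 0.92 similarity in the example above would count as a match. In keeping with the explicit-over-implicit principle, a result like this would probably be better surfaced as a suggested alias than applied silently.
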

---

### Layer 4: Ontology Mapping (Future)

**Idea:** Define semantic relationships between concepts.

**Example:**

```yaml
ontology:
  "tls/cert_verification":
    equivalent_to:
      - "tls/certificate_validation"
      - "ssl/verify_certs"
    broader_than:
      - "security/transport_layer"
```

**When to add:** If multiple industries need cross-domain mapping.

---

## Comparison: Alternative Approaches

### Alt 1: Variable Tail Length

**Idea:** Try N=1, N=2, N=3 segment keys.

**Problems:**

- Ambiguous matches (which key wins?)
- Performance hit (3x lookups)
- Doesn't solve semantic differences

**Verdict:** Rejected (complexity without solving root cause)

---

### Alt 2: Normalize All Paths

**Idea:** Extractors output "canonical" paths that match standards.

**Problems:**

- Loses language context (`rust/myapp`)
- Breaks existing aliases/observations
- Forces extractors to know about ALL standards

**Verdict:** Rejected (breaks modularity)

---

### Alt 3: Dynamic Alias Discovery

**Idea:** Auto-create aliases during scan when tail-path matches but full path differs.

**Problems:**

- Implicit behavior (hard to debug)
- No security team approval (bypasses policy control)
- May create false positives

**Verdict:** Future enhancement (as suggestions, not automatic)

---

## Architectural Trade-offs

### Chosen: Explicit Policy Aliases

**Pros:**

- Clear provenance (aliases are in Trust Pack)
- Auditable (covered by signature)
- Flexible (glob patterns support many cases)
- Backward compatible (empty aliases = current behavior)

**Cons:**

- Requires manual alias creation
- Adds cognitive overhead (security teams must think about patterns)
- Another field in Trust Pack schema

**Why this trade-off wins:**

- Enterprise adoption requires **auditability**
- Security teams WANT explicit control
- Manual work is one-time (create pack once, reuse everywhere)

---

## Recommended Patterns

### Pattern 1: Language Wildcards

**Use Case:** Standard applies to all languages.

```rust
PolicyAlias {
    policy_path: "code://standards/tls/cert_verification",
    target_patterns: vec![
        "code://*/*/tls/cert_verification", // any language, any project
    ],
}
```

### Pattern 2: Project-Specific

**Use Case:** Internal policy for specific service.

```rust
PolicyAlias {
    policy_path: "code://internal/auth/jwt_validation",
    target_patterns: vec![
        "code://rust/auth-service/jwt/validation",
        "code://go/auth-service/jwt/validation",
    ],
}
```

### Pattern 3: Domain-Scoped

**Use Case:** Cloud-specific rules.

```rust
PolicyAlias {
    policy_path: "code://vendor/aws/s3/public_access",
    target_patterns: vec![
        "code://*/*/aws/s3/bucket/public",
        "code://*/*/cloud/storage/s3/public_access",
    ],
}
```

---

## Open Questions for Long-Term Evolution

### Q1: Should we support recursive wildcards?

**Current:** `code://rust/*/tls` (single segment wildcard)

**Proposed:** `code://rust/**/tls` (any depth)

**Trade-off:** More flexible, but harder to reason about matches.

**Decision:** Start with single-segment, add recursive if needed.

---

### Q2: Should aliases be bidirectional?

**Current:** Policy path → Code patterns (one direction)

**Proposed:** Allow code path → Policy path mapping

**Use Case:** "This code path is an exception to standard X."

**Decision:** Defer until use case emerges.

---

### Q3: Should we cache pattern matches?

**Current:** Recompute glob match on every lookup

**Proposed:** Cache subject → policy_path map per scan

**Trade-off:** Faster (O(1) after first match) vs. memory overhead

**Decision:** Benchmark first, optimize only if needed (avoid premature optimization).

---

### Q4: Should policy aliases be mergeable?

**Current:** Each Trust Pack has independent aliases

**Proposed:** Allow Trust Pack B to "extend" Trust Pack A's aliases

**Use Case:** Company-wide base pack + team-specific extensions

**Decision:** Future enhancement (Trust Pack composition system).

---

## Guiding Heuristic

**When adding matching features, ask:**

1. **Does this preserve tail-path matching for the common case?**
   - Yes → Maintains performance
   - No → Reconsider
2. **Is the behavior explicit and auditable?**
   - Yes → Security teams can reason about it
   - No → Will cause trust issues
3. **Can it be disabled or overridden?**
   - Yes → Progressive adoption
   - No → May block some use cases
4. **Does it add cognitive overhead?**
   - Minimal → Worth the flexibility
   - Significant → Document heavily or defer

---

## Conclusion

**Current State:** Tail-path matching works for bundled corpus but breaks for enterprise policies.

**Root Cause:** Semantic mismatch between policy hierarchies and extractor output.

**Solution:** Add explicit policy aliases as a second matching layer.

**Philosophy:** Start simple (tail-path), add precision layers (aliases), keep it auditable.

**Future:** Semantic embeddings and ontology mapping if manual aliases become burdensome.

---

**This document should guide future matching feature discussions.** Always return to the core principles: semantic matching, progressive precision, explicit control.