Concept Matching Philosophy
Context: Aphoria's policy enforcement depends on matching code extractors to authoritative sources.
Question: How do we enable flexible matching without over-engineering?
Core Design Principles
1. Semantic Over Syntactic
Bad: Match exact string paths
"code://rust/myapp/tls/cert_verification" != "rfc://5246/tls/cert_verification"
Good: Match semantic tail paths
Both produce key: "tls/cert_verification::enabled"
Principle: Concepts should match across schemes if they represent the same idea.
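A minimal sketch of how such a key could be derived, assuming the key is simply the last two path segments joined with the predicate. The function name `tail_path_key` is hypothetical and not taken from the Aphoria codebase.

```rust
/// Build a matching key from the last two path segments plus a predicate.
/// Returns None when the URI has no "://" separator or fewer than two segments.
/// NOTE: hypothetical helper for illustration, not Aphoria's actual API.
fn tail_path_key(uri: &str, predicate: &str) -> Option<String> {
    // Drop the scheme ("code", "rfc"): it carries context, not identity.
    let (_, path) = uri.split_once("://")?;
    let segments: Vec<&str> = path.split('/').filter(|s| !s.is_empty()).collect();
    if segments.len() < 2 {
        return None;
    }
    // Keep only the last two segments, e.g. "tls/cert_verification".
    let tail = &segments[segments.len() - 2..];
    Some(format!("{}::{}", tail.join("/"), predicate))
}
```

Under this assumption, `code://rust/myapp/tls/cert_verification` and `rfc://5246/tls/cert_verification` both yield `tls/cert_verification::enabled` for the predicate `enabled`.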
2. Progressive Precision
Layer 1: Tail-path matching (works 80% of the time)
Layer 2: Policy aliases (handles enterprise hierarchies)
Layer 3: Semantic embeddings (future: fuzzy matching)
Principle: Start with simple heuristics, add precision layers as needed.
3. Explicit Over Implicit
Bad: Auto-generate aliases behind the scenes
- Hard to debug ("why did this match?")
- Fragile (breaks with refactoring)
- Opaque (security teams lose control)
Good: Require explicit policy aliases
- Clear provenance (alias is in Trust Pack)
- Auditable (signature covers aliases)
- Controllable (security team decides matches)
Principle: Matching logic should be transparent and intentional.
Why Tail-Path Matching Works
Design Insight
Code extractors are intentionally designed to align with RFC/OWASP paths:
RFC Structure:
rfc://5246/tls/cert_verification
rfc://7519/jwt/audience_validation
rfc://8996/tls/min_version
Extractor Output:
code://rust/myapp/tls/cert_verification
code://python/myapp/jwt/audience_validation
code://go/myapp/tls/min_version
Key Insight: The last 2 segments (tls/cert_verification) are the concept name.
Language prefix (rust/myapp) provides context but not identity.
Why Tail-Path Matching Breaks
Enterprise Hierarchies
Security teams think in logical domains, not RFC hierarchies:
code://standards/tls/cert_verification (Security team's mental model)
code://internal/exceptions/md5_allowed (Policy exceptions)
code://vendor/aws/s3/public_access (Cloud-specific rules)
These don't map to extractor output:
code://rust/myapp/tls/cert_verification (Extractor output)
Problem: standards/tls (2 segments) vs. rust/myapp/tls (3 segments)
Tail-path key mismatch:
- Policy: "tls/cert_verification::enabled"
- Code: "myapp/tls::enabled" (extracts wrong segments!)
Why Policy Aliases Are the Right Solution
1. Preserves Tail-Path Matching
Most cases (the bundled corpus) still take the fast path:
// 1. Try direct tail-path match (O(1) hash lookup)
if let Some(result) = self.lookup(subject, predicate) {
    return Some(result);
}
2. Adds Flexibility Without Complexity
Aliases are tried only when the direct match fails:
// 2. Try policy alias patterns (O(P*A), small P and A)
for alias in policy_aliases {
    // An alias applies if any one of its target patterns matches the subject.
    if alias.target_patterns.iter().any(|p| glob_match(p, subject)) {
        return self.lookup(&alias.policy_path, predicate);
    }
}
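The two steps above can be put together as a small self-contained sketch. Everything here is an assumption made for illustration: the `Matcher` type, the `HashMap<String, String>` index, the `resolve` method, and the segment-wise `glob_match` in which `*` matches exactly one path segment. The `PolicyAlias` fields mirror the ones used in this document.

```rust
use std::collections::HashMap;

// Hypothetical shape; fields mirror the PolicyAlias examples in this document.
struct PolicyAlias {
    policy_path: String,
    target_patterns: Vec<String>,
}

/// Segment-wise glob: `*` matches exactly one path segment (assumed semantics).
fn glob_match(pattern: &str, subject: &str) -> bool {
    let p: Vec<&str> = pattern.split('/').collect();
    let s: Vec<&str> = subject.split('/').collect();
    p.len() == s.len() && p.iter().zip(&s).all(|(a, b)| *a == "*" || a == b)
}

/// Key = last two path segments + "::" + predicate (assumed key format).
fn tail_key(subject: &str, predicate: &str) -> Option<String> {
    let (_, path) = subject.split_once("://")?;
    let segs: Vec<&str> = path.split('/').collect();
    if segs.len() < 2 {
        return None;
    }
    Some(format!("{}::{}", segs[segs.len() - 2..].join("/"), predicate))
}

struct Matcher {
    index: HashMap<String, String>, // tail-path key -> authoritative value
    aliases: Vec<PolicyAlias>,
}

impl Matcher {
    /// Layer 1: direct tail-path lookup (one hash lookup).
    fn lookup(&self, subject: &str, predicate: &str) -> Option<&String> {
        let key = tail_key(subject, predicate)?;
        self.index.get(&key)
    }

    /// Layer 1 first, then Layer 2 (aliases) only if the direct lookup misses.
    fn resolve(&self, subject: &str, predicate: &str) -> Option<&String> {
        if let Some(hit) = self.lookup(subject, predicate) {
            return Some(hit);
        }
        self.aliases
            .iter()
            .find(|a| a.target_patterns.iter().any(|p| glob_match(p, subject)))
            .and_then(|a| self.lookup(&a.policy_path, predicate))
    }
}
```

In this sketch the first alias whose pattern matches wins; whether alias order should be significant is a design decision the sketch does not settle.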
3. Keeps Control with Policy Authors
Security team explicitly states:
PolicyAlias {
    policy_path: "code://standards/tls/cert_verification",
    target_patterns: vec![
        "code://rust/*/tls/cert_verification",
        "code://go/*/tls/cert_verification",
    ],
}
This is documentation (what matches what) and enforcement (signed in Trust Pack).
Extension Points: Future Matching Layers
Layer 3: Semantic Equivalence (Future)
Idea: Use embeddings to match concepts with different names.
Example:
Policy: "code://standards/tls/certificate_validation"
Code: "code://rust/myapp/tls/cert_verification"
Embedding similarity: 0.92 → match
When to add: If alias management becomes too manual.
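A rough sketch of what the comparison step might look like, assuming each concept path has already been turned into an embedding vector by some model (not shown, and not specified in this document). The function names and the caller-supplied threshold are hypothetical. Cosine similarity is used here as one common choice, since the document does not say which similarity measure is intended.

```rust
/// Cosine similarity between two equal-length vectors.
/// Returns None if the lengths differ, the vectors are empty,
/// or either vector has zero magnitude.
fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}

/// Treat two concepts as matching when similarity reaches the threshold.
/// The threshold is a policy choice; no value is prescribed by this document.
fn is_semantic_match(a: &[f32], b: &[f32], threshold: f32) -> bool {
    cosine_similarity(a, b).map_or(false, |s| s >= threshold)
}
```

Because a threshold-based match is implicit by nature, a layer like this would probably need to surface its score alongside the match to stay consistent with the "Explicit Over Implicit" principle.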
Layer 4: Ontology Mapping (Future)
Idea: Define semantic relationships between concepts.
Example:
ontology:
  "tls/cert_verification":
    equivalent_to:
      - "tls/certificate_validation"
      - "ssl/verify_certs"
    broader_than:
      - "security/transport_layer"
When to add: If multiple industries need cross-domain mapping.
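One possible in-memory form of the `equivalent_to` relation is sketched below, assuming every synonym resolves to a single canonical concept name. The `Ontology` type and its methods are hypothetical, and `broader_than` is deliberately left out because it is not a symmetric relation and would need separate handling.

```rust
use std::collections::HashMap;

/// Maps each known concept name (including synonyms) to its canonical name.
/// Hypothetical structure for illustration only.
struct Ontology {
    canonical: HashMap<String, String>,
}

impl Ontology {
    fn new() -> Self {
        Self { canonical: HashMap::new() }
    }

    /// Register `concept` together with the names declared `equivalent_to` it.
    fn add_equivalents(&mut self, concept: &str, equivalents: &[&str]) {
        self.canonical.insert(concept.to_string(), concept.to_string());
        for name in equivalents {
            self.canonical.insert(name.to_string(), concept.to_string());
        }
    }

    /// Resolve a name to its canonical form; unknown names pass through unchanged.
    fn canonicalize<'a>(&'a self, name: &'a str) -> &'a str {
        self.canonical.get(name).map(String::as_str).unwrap_or(name)
    }

    /// Two names are equivalent when they share a canonical form.
    fn equivalent(&self, a: &str, b: &str) -> bool {
        self.canonicalize(a) == self.canonicalize(b)
    }
}
```

With the example ontology above loaded, `ssl/verify_certs` and `tls/certificate_validation` would both canonicalize to `tls/cert_verification`, so the existing tail-path lookup could run on the canonical name.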
Comparison: Alternative Approaches
Alt 1: Variable Tail Length
Idea: Try N=1, N=2, N=3 segment keys.
Problems:
- Ambiguous matches (which key wins?)
- Performance hit (3x lookups)
- Doesn't solve semantic differences
Verdict: Rejected (complexity without solving root cause)
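To make the ambiguity concrete, here is a small sketch of what generating keys for N=1..3 could look like. The `candidate_keys` helper is hypothetical and exists only to illustrate the rejected alternative.

```rust
/// Generate candidate keys from the last 1..=max_n path segments.
/// Hypothetical helper illustrating the rejected "variable tail length" idea.
fn candidate_keys(uri: &str, predicate: &str, max_n: usize) -> Vec<String> {
    // Strip the scheme if present; otherwise treat the whole string as the path.
    let path = uri.split_once("://").map_or(uri, |(_, p)| p);
    let segs: Vec<&str> = path.split('/').filter(|s| !s.is_empty()).collect();
    (1..=max_n.min(segs.len()))
        .map(|n| format!("{}::{}", segs[segs.len() - n..].join("/"), predicate))
        .collect()
}
```

For `code://rust/myapp/tls/cert_verification` this yields three keys, and the N=1 key `cert_verification::enabled` would also be produced by an unrelated path such as a hypothetical `code://rust/myapp/smtp/cert_verification`. Each lookup now has to decide which of several candidate hits is authoritative.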
Alt 2: Normalize All Paths
Idea: Extractors output "canonical" paths that match standards.
Problems:
- Loses language context (rust/myapp)
- Breaks existing aliases/observations
- Forces extractors to know about ALL standards
Verdict: Rejected (breaks modularity)
Alt 3: Dynamic Alias Discovery
Idea: Auto-create aliases during scan when tail-path matches but full path differs.
Problems:
- Implicit behavior (hard to debug)
- No security team approval (bypasses policy control)
- May create false positives
Verdict: Future enhancement (as suggestions, not automatic)
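If this is ever built as a suggestion mechanism, it might look roughly like the sketch below: the scan emits a proposal for a human to review and never installs an alias itself. The `AliasSuggestion` type and the `suggest_alias` function are hypothetical, and the two-segment tail comparison is an assumption carried over from the tail-path discussion.

```rust
/// A proposed alias for a security team to review; never applied automatically.
/// Hypothetical type for illustration only.
#[derive(Debug, PartialEq)]
struct AliasSuggestion {
    policy_path: String,
    target_pattern: String,
}

/// Last two path segments of a URI, or None if there are fewer than two.
fn tail2(uri: &str) -> Option<String> {
    let (_, path) = uri.split_once("://")?;
    let segs: Vec<&str> = path.split('/').filter(|s| !s.is_empty()).collect();
    if segs.len() < 2 {
        return None;
    }
    Some(segs[segs.len() - 2..].join("/"))
}

/// Suggest an alias when the tails agree but the full paths differ.
fn suggest_alias(policy_path: &str, code_path: &str) -> Option<AliasSuggestion> {
    if policy_path == code_path {
        return None; // identical paths need no alias
    }
    if tail2(policy_path)? != tail2(code_path)? {
        return None; // tails disagree: nothing to suggest
    }
    Some(AliasSuggestion {
        policy_path: policy_path.to_string(),
        target_pattern: code_path.to_string(),
    })
}
```

Returning a value rather than mutating alias state keeps the approval step with the policy author, which is the concern that made the automatic variant unattractive.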
Architectural Trade-offs
Chosen: Explicit Policy Aliases
Pros:
- Clear provenance (aliases are in Trust Pack)
- Auditable (covered by signature)
- Flexible (glob patterns support many cases)
- Backward compatible (empty aliases = current behavior)
Cons:
- Requires manual alias creation
- Adds cognitive overhead (security teams must think about patterns)
- Another field in Trust Pack schema
Why this trade-off wins:
- Enterprise adoption requires auditability
- Security teams WANT explicit control
- Manual work is one-time (create pack once, reuse everywhere)
Recommended Patterns
Pattern 1: Language Wildcards
Use Case: Standard applies to all languages.
PolicyAlias {
    policy_path: "code://standards/tls/cert_verification",
    target_patterns: vec![
        "code://*/*/tls/cert_verification", // any language, any project
    ],
}
Pattern 2: Project-Specific
Use Case: Internal policy for specific service.
PolicyAlias {
    policy_path: "code://internal/auth/jwt_validation",
    target_patterns: vec![
        "code://rust/auth-service/jwt/validation",
        "code://go/auth-service/jwt/validation",
    ],
}
Pattern 3: Domain-Scoped
Use Case: Cloud-specific rules.
PolicyAlias {
    policy_path: "code://vendor/aws/s3/public_access",
    target_patterns: vec![
        "code://*/*/aws/s3/bucket/public",
        "code://*/*/cloud/storage/s3/public_access",
    ],
}
Open Questions for Long-Term Evolution
Q1: Should we support recursive wildcards?
Current: code://rust/*/tls (single segment wildcard)
Proposed: code://rust/**/tls (any depth)
Trade-off: More flexible, but harder to reason about matches.
Decision: Start with single-segment, add recursive if needed.
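For reference, a recursive wildcard could be layered on a segment-wise matcher along these lines. This is only a sketch under assumed semantics: `*` matches exactly one segment and `**` matches zero or more segments. The function names are hypothetical.

```rust
/// Segment-wise matcher: `*` = exactly one segment, `**` = zero or more segments.
/// Assumed semantics for illustration; not a committed design.
fn segments_match(pattern: &[&str], subject: &[&str]) -> bool {
    let (head, rest) = match pattern.split_first() {
        Some(parts) => parts,
        // Pattern exhausted: match only if the subject is exhausted too.
        None => return subject.is_empty(),
    };
    if *head == "**" {
        // Let `**` absorb 0, 1, 2, ... segments and try the rest each time.
        return (0..=subject.len()).any(|skip| segments_match(rest, &subject[skip..]));
    }
    match subject.split_first() {
        Some((seg, tail)) => (*head == "*" || head == seg) && segments_match(rest, tail),
        None => false,
    }
}

fn path_matches(pattern: &str, subject: &str) -> bool {
    let p: Vec<&str> = pattern.split('/').collect();
    let s: Vec<&str> = subject.split('/').collect();
    segments_match(&p, &s)
}
```

The backtracking in the `**` branch is where the "harder to reason about" cost shows up: a single pattern can match paths of any depth, so reviewing what an alias covers is no longer a matter of counting segments.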
Q2: Should aliases be bidirectional?
Current: Policy path → Code patterns (one direction)
Proposed: Allow code path → Policy path mapping
Use Case: "This code path is an exception to standard X."
Decision: Defer until use case emerges.
Q3: Should we cache pattern matches?
Current: Recompute glob match on every lookup
Proposed: Cache subject → policy_path map per scan
Trade-off: Faster (O(1) after first match) vs. memory overhead
Decision: Benchmark first and optimize only if needed (caching now would be premature optimization).
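Should benchmarking justify it, a per-scan cache could be as small as the sketch below. The `ScanCache` type, its `misses` counter, and the single-segment `glob_match` are all assumptions made for illustration.

```rust
use std::collections::HashMap;

// Hypothetical shape; fields mirror the PolicyAlias examples in this document.
struct PolicyAlias {
    policy_path: String,
    target_patterns: Vec<String>,
}

/// Segment-wise glob: `*` matches exactly one path segment (assumed semantics).
fn glob_match(pattern: &str, subject: &str) -> bool {
    let p: Vec<&str> = pattern.split('/').collect();
    let s: Vec<&str> = subject.split('/').collect();
    p.len() == s.len() && p.iter().zip(&s).all(|(a, b)| *a == "*" || a == b)
}

/// Per-scan cache of subject -> policy_path. A cached `None` records that
/// no alias matched, so negative results are not recomputed either.
struct ScanCache {
    resolved: HashMap<String, Option<String>>,
    misses: usize, // how many times the glob patterns were actually evaluated
}

impl ScanCache {
    fn new() -> Self {
        Self { resolved: HashMap::new(), misses: 0 }
    }

    fn policy_path_for(&mut self, aliases: &[PolicyAlias], subject: &str) -> Option<String> {
        if let Some(cached) = self.resolved.get(subject) {
            return cached.clone();
        }
        self.misses += 1;
        let found = aliases
            .iter()
            .find(|a| a.target_patterns.iter().any(|p| glob_match(p, subject)))
            .map(|a| a.policy_path.clone());
        self.resolved.insert(subject.to_string(), found.clone());
        found
    }
}
```

The memory cost is one map entry per distinct subject seen in the scan, which is the overhead the trade-off above refers to.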
Q4: Should policy aliases be mergeable?
Current: Each Trust Pack has independent aliases
Proposed: Allow Trust Pack B to "extend" Trust Pack A's aliases
Use Case: Company-wide base pack + team-specific extensions
Decision: Future enhancement (Trust Pack composition system).
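As a starting point for that discussion, one possible merge rule is sketched here: patterns for the same `policy_path` are unioned, and policy paths that appear only in the extension are appended. The `extend_aliases` function and this rule are assumptions. The sketch does not address how Trust Pack signatures would cover a merged set.

```rust
// Hypothetical shape; fields mirror the PolicyAlias examples in this document.
#[derive(Clone, Debug, PartialEq)]
struct PolicyAlias {
    policy_path: String,
    target_patterns: Vec<String>,
}

/// Extend `base` with `extension` without mutating either input.
/// Same policy_path: union the patterns (base order first, no duplicates).
/// New policy_path: append the alias as-is.
fn extend_aliases(base: &[PolicyAlias], extension: &[PolicyAlias]) -> Vec<PolicyAlias> {
    let mut merged: Vec<PolicyAlias> = base.to_vec();
    for ext in extension {
        if let Some(i) = merged.iter().position(|a| a.policy_path == ext.policy_path) {
            let existing = &mut merged[i];
            for pattern in &ext.target_patterns {
                if !existing.target_patterns.contains(pattern) {
                    existing.target_patterns.push(pattern.clone());
                }
            }
        } else {
            merged.push(ext.clone());
        }
    }
    merged
}
```

This rule only ever adds patterns. Whether a team pack should also be able to remove or narrow a base pack's aliases is a separate question that the composition system would need to answer.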
Guiding Heuristic
When adding matching features, ask:
- Does this preserve tail-path matching for the common case?
  - Yes → Maintains performance
  - No → Reconsider
- Is the behavior explicit and auditable?
  - Yes → Security teams can reason about it
  - No → Will cause trust issues
- Can it be disabled or overridden?
  - Yes → Progressive adoption
  - No → May block some use cases
- Does it add cognitive overhead?
  - Minimal → Worth the flexibility
  - Significant → Document heavily or defer
Conclusion
Current State: Tail-path matching works for the bundled corpus but breaks for enterprise policies.
Root Cause: Semantic mismatch between policy hierarchies and extractor output.
Solution: Add explicit policy aliases as a second matching layer.
Philosophy: Start simple (tail-path), add precision layers (aliases), keep it auditable.
Future: Semantic embeddings and ontology mapping if manual aliases become burdensome.
This document should guide future matching feature discussions. Always return to the core principles: semantic matching, progressive precision, explicit control.