# Concept Matching Philosophy

**Context:** Aphoria's policy enforcement depends on matching code extractors to authoritative sources.
**Question:** How do we enable flexible matching without over-engineering?

---

## Core Design Principles

### 1. Semantic Over Syntactic

**Bad:** Match exact string paths
```
"code://rust/myapp/tls/cert_verification" != "rfc://5246/tls/cert_verification"
```

**Good:** Match semantic tail paths
```
Both produce key: "tls/cert_verification::enabled"
```

**Principle:** Concepts should match across schemes if they represent the same idea.

---

### 2. Progressive Precision

**Layer 1:** Tail-path matching (works 80% of the time)
**Layer 2:** Policy aliases (handles enterprise hierarchies)
**Layer 3:** Semantic embeddings (future: fuzzy matching)

**Principle:** Start with simple heuristics, add precision layers as needed.

---

### 3. Explicit Over Implicit

**Bad:** Auto-generate aliases behind the scenes
- Hard to debug ("why did this match?")
- Fragile (breaks with refactoring)
- Opaque (security teams lose control)

**Good:** Require explicit policy aliases
- Clear provenance (alias is in Trust Pack)
- Auditable (signature covers aliases)
- Controllable (security team decides matches)

**Principle:** Matching logic should be transparent and intentional.

---

## Why Tail-Path Matching Works

### Design Insight

Code extractors are **intentionally designed** to align with RFC/OWASP paths:

**RFC Structure:**
```
rfc://5246/tls/cert_verification
rfc://7519/jwt/audience_validation
rfc://8996/tls/min_version
```

**Extractor Output:**
```
code://rust/myapp/tls/cert_verification
code://python/myapp/jwt/audience_validation
code://go/myapp/tls/min_version
```

**Key Insight:** The last two segments (`tls/cert_verification`) are the **concept name**.

The language and project prefix (`rust/myapp`) provides **context** but not **identity**.
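
The key derivation described above can be sketched in a few lines. This is a minimal illustration, assuming the key is simply the last two path segments joined to the predicate with `::` (the format shown in the key example earlier). The function name `concept_key` is hypothetical and is not taken from Aphoria's code. Under that assumption, `rfc://5246/tls/cert_verification` and `code://rust/myapp/tls/cert_verification` produce the same key, which is why the scheme and the language prefix do not affect identity.

```rust
/// Hypothetical helper: derive a scheme-independent concept key from a path.
/// Assumes the concept name is always the last two `/`-separated segments.
fn concept_key(path: &str, predicate: &str) -> Option<String> {
    // Drop the scheme ("rfc://", "code://") if one is present.
    let rest = path.split_once("://").map_or(path, |(_, rest)| rest);

    // Walk segments from the right: the last segment, then its parent.
    let mut tail = rest.rsplit('/');
    let last = tail.next().filter(|s| !s.is_empty())?;
    let parent = tail.next().filter(|s| !s.is_empty())?;

    Some(format!("{parent}/{last}::{predicate}"))
}
```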

---

## Why Tail-Path Matching Breaks

### Enterprise Hierarchies

Security teams think in **logical domains**, not **RFC hierarchies**:

```
code://standards/tls/cert_verification    (Security team's mental model)
code://internal/exceptions/md5_allowed    (Policy exceptions)
code://vendor/aws/s3/public_access        (Cloud-specific rules)
```

These don't map to extractor output:

```
code://rust/myapp/tls/cert_verification   (Extractor output)
```

**Problem:** The paths differ in depth: `standards/tls` (2 segments) vs. `rust/myapp/tls` (3 segments)

Tail-path key mismatch:
- Policy: `"tls/cert_verification::enabled"`
- Code: `"myapp/tls::enabled"` (extracts wrong segments!)

---

## Why Policy Aliases Are the Right Solution

### 1. Preserves Tail-Path Matching

Most cases (bundled corpus) still use the fast path:
```rust
// 1. Try direct tail-path match (O(1) hash lookup)
if let Some(result) = self.lookup(subject, predicate) {
    return Some(result);
}
```

### 2. Adds Flexibility Without Complexity

Only when the direct match fails do we try aliases:
```rust
// 2. Try policy alias patterns (O(P*A), small P and A)
for alias in policy_aliases {
    // An alias applies if any of its target patterns matches the subject.
    if alias.target_patterns.iter().any(|pattern| glob_match(pattern, subject)) {
        return self.lookup(&alias.policy_path, predicate);
    }
}
```

### 3. Keeps Control with Policy Authors

The security team explicitly states which code paths a policy covers:
```rust
PolicyAlias {
    policy_path: "code://standards/tls/cert_verification",
    target_patterns: vec![
        "code://rust/*/tls/cert_verification",
        "code://go/*/tls/cert_verification",
    ],
}
```

This is both **documentation** (what matches what) and **enforcement** (signed in the Trust Pack).
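
Putting the two layers together, the lookup order can be sketched as below. This is an illustration under stated assumptions, not Aphoria's implementation: `PolicyIndex`, its `entries` map (tail-path key to a placeholder policy value), the `String` field types on `PolicyAlias`, and the single-segment `glob_match` stand-in are all hypothetical. `resolve` tries the direct tail-path lookup first and consults aliases only on a miss, returning the lookup result for the first alias whose pattern matches. Because aliases are only reached after a miss, an empty alias list behaves exactly like plain tail-path matching.

```rust
use std::collections::HashMap;

/// Hypothetical alias shape, mirroring the struct literal above.
struct PolicyAlias {
    policy_path: String,
    target_patterns: Vec<String>,
}

/// Hypothetical index: tail-path key -> placeholder policy value.
struct PolicyIndex {
    entries: HashMap<String, String>,
    aliases: Vec<PolicyAlias>,
}

/// Build "<parent>/<last>::<predicate>" from the last two path segments.
fn tail_key(path: &str, predicate: &str) -> Option<String> {
    let rest = path.split_once("://").map_or(path, |(_, rest)| rest);
    let segments: Vec<&str> = rest.split('/').collect();
    match segments.as_slice() {
        [.., parent, last] if !parent.is_empty() && !last.is_empty() => {
            Some(format!("{parent}/{last}::{predicate}"))
        }
        _ => None,
    }
}

/// Stand-in glob: `*` matches exactly one whole segment, nothing else is special.
fn glob_match(pattern: &str, path: &str) -> bool {
    let pattern: Vec<&str> = pattern.split('/').collect();
    let path: Vec<&str> = path.split('/').collect();
    pattern.len() == path.len()
        && pattern.iter().zip(&path).all(|(p, s)| *p == "*" || p == s)
}

impl PolicyIndex {
    /// Layer 1: direct tail-path lookup (one hash lookup).
    fn lookup(&self, subject: &str, predicate: &str) -> Option<&str> {
        let key = tail_key(subject, predicate)?;
        self.entries.get(&key).map(String::as_str)
    }

    /// Layer 1 first; Layer 2 (aliases) only if the direct lookup misses.
    fn resolve(&self, subject: &str, predicate: &str) -> Option<&str> {
        if let Some(hit) = self.lookup(subject, predicate) {
            return Some(hit);
        }
        self.aliases
            .iter()
            .find(|alias| alias.target_patterns.iter().any(|p| glob_match(p, subject)))
            .and_then(|alias| self.lookup(&alias.policy_path, predicate))
    }
}
```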

---

## Extension Points: Future Matching Layers

### Layer 3: Semantic Equivalence (Future)

**Idea:** Use embeddings to match concepts with different names.

**Example:**
```
Policy: "code://standards/tls/certificate_validation"
Code:   "code://rust/myapp/tls/cert_verification"
```

Embedding similarity: 0.92 → match

**When to add:** If alias management becomes too manual.
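
The document does not say which similarity metric Layer 3 would use. As one plausible sketch, the following assumes cosine similarity over fixed-length embedding vectors and a caller-supplied threshold. Both function names and the threshold parameter are hypothetical, and how the embeddings are produced is out of scope here.

```rust
/// Cosine similarity of two embedding vectors.
/// Returns None for mismatched lengths, empty input, or an all-zero vector.
fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}

/// Hypothetical Layer 3 check: two concepts match when their embeddings
/// are at least `threshold` similar. Undefined similarity never matches.
fn is_semantic_match(a: &[f32], b: &[f32], threshold: f32) -> bool {
    cosine_similarity(a, b).map_or(false, |similarity| similarity >= threshold)
}
```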

---

### Layer 4: Ontology Mapping (Future)

**Idea:** Define semantic relationships between concepts.

**Example:**
```yaml
ontology:
  "tls/cert_verification":
    equivalent_to:
      - "tls/certificate_validation"
      - "ssl/verify_certs"
    broader_than:
      - "security/transport_layer"
```

**When to add:** If multiple industries need cross-domain mapping.
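
To make the `equivalent_to` relation concrete, here is a small sketch of how such a mapping might be consulted before key lookup. It covers only equivalence and ignores `broader_than`. The in-memory shape and both function names are assumptions made for illustration, since the document defines no ontology API.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory form of the `equivalent_to` lists above:
/// every listed name maps back to the concept it is declared equivalent to.
fn build_equivalence_index(ontology: &[(&str, Vec<&str>)]) -> HashMap<String, String> {
    let mut index = HashMap::new();
    for (canonical, equivalents) in ontology {
        for name in equivalents {
            index.insert(name.to_string(), canonical.to_string());
        }
    }
    index
}

/// Resolve a concept name to its canonical form.
/// Names with no declared equivalence are returned unchanged.
fn canonical_concept<'a>(index: &'a HashMap<String, String>, concept: &'a str) -> &'a str {
    index.get(concept).map_or(concept, String::as_str)
}
```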

---

## Comparison: Alternative Approaches

### Alt 1: Variable Tail Length

**Idea:** Try N=1, N=2, N=3 segment keys.

**Problems:**
- Ambiguous matches (which key wins?)
- Performance hit (3x lookups)
- Doesn't solve semantic differences

**Verdict:** Rejected (complexity without solving root cause)
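
The ambiguity is easy to see once the candidate keys are written out. The sketch below uses a hypothetical helper, `candidate_keys`, which is not part of the design, to generate the N=1..=3 keys for a path. With N=1, unrelated concepts such as `tls/min_version` and a made-up `ssh/min_version` collapse to the same `min_version` key, and each extra N costs one more lookup.

```rust
/// Hypothetical helper: build candidate keys from the last 1..=max_n segments,
/// shortest tail first.
fn candidate_keys(path: &str, predicate: &str, max_n: usize) -> Vec<String> {
    let rest = path.split_once("://").map_or(path, |(_, rest)| rest);
    let segments: Vec<&str> = rest.split('/').filter(|s| !s.is_empty()).collect();

    (1..=max_n.min(segments.len()))
        .map(|n| {
            // Join the last `n` segments back into a tail path.
            let tail = segments[segments.len() - n..].join("/");
            format!("{tail}::{predicate}")
        })
        .collect()
}
```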

---

### Alt 2: Normalize All Paths

**Idea:** Extractors output "canonical" paths that match standards.

**Problems:**
- Loses language context (`rust/myapp`)
- Breaks existing aliases/observations
- Forces extractors to know about ALL standards

**Verdict:** Rejected (breaks modularity)

---

### Alt 3: Dynamic Alias Discovery

**Idea:** Auto-create aliases during scan when tail-path matches but full path differs.

**Problems:**
- Implicit behavior (hard to debug)
- No security team approval (bypasses policy control)
- May create false positives

**Verdict:** Future enhancement (as suggestions, not automatic)

---

## Architectural Trade-offs

### Chosen: Explicit Policy Aliases

**Pros:**
- Clear provenance (aliases are in Trust Pack)
- Auditable (covered by signature)
- Flexible (glob patterns support many cases)
- Backward compatible (empty aliases = current behavior)

**Cons:**
- Requires manual alias creation
- Adds cognitive overhead (security teams must think about patterns)
- Another field in Trust Pack schema

**Why this trade-off wins:**
- Enterprise adoption requires **auditability**
- Security teams WANT explicit control
- Manual work is one-time (create pack once, reuse everywhere)

---

## Recommended Patterns

### Pattern 1: Language Wildcards

**Use Case:** Standard applies to all languages.

```rust
PolicyAlias {
    policy_path: "code://standards/tls/cert_verification",
    target_patterns: vec![
        "code://*/*/tls/cert_verification", // any language, any project
    ],
}
```

### Pattern 2: Project-Specific

**Use Case:** Internal policy for a specific service.

```rust
PolicyAlias {
    policy_path: "code://internal/auth/jwt_validation",
    target_patterns: vec![
        "code://rust/auth-service/jwt/validation",
        "code://go/auth-service/jwt/validation",
    ],
}
```

### Pattern 3: Domain-Scoped

**Use Case:** Cloud-specific rules.

```rust
PolicyAlias {
    policy_path: "code://vendor/aws/s3/public_access",
    target_patterns: vec![
        "code://*/*/aws/s3/bucket/public",
        "code://*/*/cloud/storage/s3/public_access",
    ],
}
```

---

## Open Questions for Long-Term Evolution

### Q1: Should we support recursive wildcards?

**Current:** `code://rust/*/tls` (single segment wildcard)
**Proposed:** `code://rust/**/tls` (any depth)

**Trade-off:** More flexible, but harder to reason about matches.

**Decision:** Start with single-segment, add recursive if needed.
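
To show what "harder to reason about" means in practice, here is a sketch of a segment matcher that supports both forms. It is an illustration only: the function names are hypothetical, and letting `**` match zero segments is an assumption the document does not settle. Under that assumption `code://rust/**/tls` also matches `code://rust/tls`, which a policy author may not expect.

```rust
/// Match pattern segments against path segments.
/// `*` consumes exactly one segment; `**` consumes zero or more (an assumption).
fn segments_match(pattern: &[&str], path: &[&str]) -> bool {
    match pattern.split_first() {
        // Pattern exhausted: succeed only if the path is exhausted too.
        None => path.is_empty(),
        // Recursive wildcard: try every possible number of skipped segments.
        Some((head, rest)) if *head == "**" => {
            (0..=path.len()).any(|skip| segments_match(rest, &path[skip..]))
        }
        // Literal or single-segment wildcard: consume one path segment.
        Some((head, rest)) => match path.split_first() {
            Some((segment, tail)) => {
                (*head == "*" || head == segment) && segments_match(rest, tail)
            }
            None => false,
        },
    }
}

/// Split both strings on `/` and compare segment by segment.
fn glob_match(pattern: &str, path: &str) -> bool {
    let pattern: Vec<&str> = pattern.split('/').collect();
    let path: Vec<&str> = path.split('/').collect();
    segments_match(&pattern, &path)
}
```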

---

### Q2: Should aliases be bidirectional?

**Current:** Policy path → Code patterns (one direction)
**Proposed:** Allow code path → Policy path mapping

**Use Case:** "This code path is an exception to standard X."

**Decision:** Defer until use case emerges.

---

### Q3: Should we cache pattern matches?

**Current:** Recompute glob match on every lookup
**Proposed:** Cache subject → policy_path map per scan

**Trade-off:** Faster (O(1) after first match) vs. memory overhead

**Decision:** Benchmark first and optimize only if needed; caching before measuring would be premature optimization.
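
If benchmarks ever justify it, the proposed cache could look roughly like the sketch below. `AliasCache`, its fields, and the plain function-pointer resolver are hypothetical names chosen for illustration. The cache stores misses as well as hits, so a subject with no matching alias is also globbed only once per scan, and the `resolver_calls` counter exists only to make the saving observable.

```rust
use std::collections::HashMap;

/// Hypothetical per-scan cache in front of alias resolution.
struct AliasCache {
    /// The uncached resolver: subject -> matching policy path, if any.
    resolve: fn(&str) -> Option<String>,
    /// Memoized results, including misses (`None`).
    seen: HashMap<String, Option<String>>,
    /// How many times the uncached resolver actually ran.
    resolver_calls: usize,
}

impl AliasCache {
    fn new(resolve: fn(&str) -> Option<String>) -> Self {
        Self { resolve, seen: HashMap::new(), resolver_calls: 0 }
    }

    /// Return the policy path for `subject`, running the resolver at most once
    /// per distinct subject.
    fn policy_path_for(&mut self, subject: &str) -> Option<&str> {
        if !self.seen.contains_key(subject) {
            self.resolver_calls += 1;
            let resolved = (self.resolve)(subject);
            self.seen.insert(subject.to_string(), resolved);
        }
        self.seen.get(subject).and_then(|path| path.as_deref())
    }
}
```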

---

### Q4: Should policy aliases be mergeable?

**Current:** Each Trust Pack has independent aliases
**Proposed:** Allow Trust Pack B to "extend" Trust Pack A's aliases

**Use Case:** Company-wide base pack + team-specific extensions

**Decision:** Future enhancement (Trust Pack composition system).
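
One possible merge semantics, shown purely as a starting point for that discussion, is sketched below: an extension pack may add target patterns to an alias that already exists in the base pack, or introduce aliases for new policy paths, but never removes anything. The `extend_aliases` function and the `String` field types are assumptions. The sketch also sidesteps how signatures would cover a composed pack, which a real design would have to settle.

```rust
use std::collections::BTreeMap;

/// Hypothetical alias shape with owned fields.
#[derive(Debug, Clone, PartialEq)]
struct PolicyAlias {
    policy_path: String,
    target_patterns: Vec<String>,
}

/// Merge `extension` into `base`: patterns are unioned per policy path,
/// base patterns first, duplicates dropped. Output is sorted by policy path.
fn extend_aliases(base: &[PolicyAlias], extension: &[PolicyAlias]) -> Vec<PolicyAlias> {
    let mut merged: BTreeMap<String, Vec<String>> = BTreeMap::new();
    for alias in base.iter().chain(extension) {
        let patterns = merged.entry(alias.policy_path.clone()).or_default();
        for pattern in &alias.target_patterns {
            if !patterns.contains(pattern) {
                patterns.push(pattern.clone());
            }
        }
    }
    merged
        .into_iter()
        .map(|(policy_path, target_patterns)| PolicyAlias { policy_path, target_patterns })
        .collect()
}
```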

---

## Guiding Heuristic

**When adding matching features, ask:**

1. **Does this preserve tail-path matching for the common case?**
   - Yes → Maintains performance
   - No → Reconsider

2. **Is the behavior explicit and auditable?**
   - Yes → Security teams can reason about it
   - No → Will cause trust issues

3. **Can it be disabled or overridden?**
   - Yes → Progressive adoption
   - No → May block some use cases

4. **Does it add cognitive overhead?**
   - Minimal → Worth the flexibility
   - Significant → Document heavily or defer

---

## Conclusion

**Current State:** Tail-path matching works for the bundled corpus but breaks for enterprise policies.

**Root Cause:** Semantic mismatch between policy hierarchies and extractor output.

**Solution:** Add explicit policy aliases as a second matching layer.

**Philosophy:** Start simple (tail-path), add precision layers (aliases), keep it auditable.

**Future:** Semantic embeddings and ontology mapping if manual aliases become burdensome.

---

**This document should guide future matching feature discussions.** Always return to the core principles: semantic matching, progressive precision, explicit control.