# Concept Matching Philosophy
**Context:** Aphoria's policy enforcement depends on matching code extractor output to authoritative sources.
**Question:** How do we enable flexible matching without over-engineering?
---
## Core Design Principles
### 1. Semantic Over Syntactic
**Bad:** Match exact string paths
```
"code://rust/myapp/tls/cert_verification" != "rfc://5246/tls/cert_verification"
```
**Good:** Match semantic tail paths
```
Both produce key: "tls/cert_verification::enabled"
```
**Principle:** Concepts should match across schemes if they represent the same idea.
---
### 2. Progressive Precision
**Layer 1:** Tail-path matching (works 80% of the time)
**Layer 2:** Policy aliases (handles enterprise hierarchies)
**Layer 3:** Semantic embeddings (future: fuzzy matching)
**Principle:** Start with simple heuristics, add precision layers as needed.
---
### 3. Explicit Over Implicit
**Bad:** Auto-generate aliases behind the scenes
- Hard to debug ("why did this match?")
- Fragile (breaks with refactoring)
- Opaque (security teams lose control)
**Good:** Require explicit policy aliases
- Clear provenance (alias is in Trust Pack)
- Auditable (signature covers aliases)
- Controllable (security team decides matches)
**Principle:** Matching logic should be transparent and intentional.
---
## Why Tail-Path Matching Works
### Design Insight
Code extractors are **intentionally designed** to align with RFC/OWASP paths:
**RFC Structure:**
```
rfc://5246/tls/cert_verification
rfc://7519/jwt/audience_validation
rfc://8996/tls/min_version
```
**Extractor Output:**
```
code://rust/myapp/tls/cert_verification
code://python/myapp/jwt/audience_validation
code://go/myapp/tls/min_version
```
**Key Insight:** The last 2 segments (`tls/cert_verification`) are the **concept name**.
Language prefix (`rust/myapp`) provides **context** but not **identity**.
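
The key derivation described above can be sketched as follows. This is a minimal illustration, not Aphoria's actual implementation: the function name `tail_path_key`, its signature, and the `"<tail>::<predicate>"` key format are assumptions inferred from the examples in this document.

```rust
/// Builds a tail-path key such as "tls/cert_verification::enabled".
/// Hypothetical helper: name, signature, and key format are assumptions.
fn tail_path_key(uri: &str, predicate: &str) -> Option<String> {
    // Drop the scheme ("rfc://", "code://") so only path segments remain.
    let (_scheme, path) = uri.split_once("://")?;
    let segments: Vec<&str> = path.split('/').filter(|s| !s.is_empty()).collect();
    // A concept name needs at least two segments.
    if segments.len() < 2 {
        return None;
    }
    // The last two segments identify the concept; the prefix is only context.
    let tail = &segments[segments.len() - 2..];
    Some(format!("{}::{}", tail.join("/"), predicate))
}
```

With this sketch, `rfc://5246/tls/cert_verification` and `code://rust/myapp/tls/cert_verification` both yield `tls/cert_verification::enabled` for the predicate `enabled`, which is why the two schemes meet at the same key.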
---
## Why Tail-Path Matching Breaks
### Enterprise Hierarchies
Security teams think in **logical domains**, not **RFC hierarchies**:
```
code://standards/tls/cert_verification (Security team's mental model)
code://internal/exceptions/md5_allowed (Policy exceptions)
code://vendor/aws/s3/public_access (Cloud-specific rules)
```
These don't map to extractor output:
```
code://rust/myapp/tls/cert_verification (Extractor output)
```
**Problem:** `standards/tls` (2 segments) vs. `rust/myapp/tls` (3 segments)
Tail-path key mismatch:
- Policy: `"tls/cert_verification::enabled"`
- Code: `"myapp/tls::enabled"` (extracts wrong segments!)
---
## Why Policy Aliases Are the Right Solution
### 1. Preserves Tail-Path Matching
Most cases (the bundled corpus) still take the fast path:
```rust
// 1. Try direct tail-path match (O(1) hash lookup)
if let Some(result) = self.lookup(subject, predicate) {
    return Some(result);
}
```
### 2. Adds Flexibility Without Complexity
Aliases are tried only when the direct match fails:
```rust
// 2. Try policy alias patterns (O(P*A), small P and A)
for alias in policy_aliases {
    // An alias applies if any of its target patterns matches the subject.
    if alias.target_patterns.iter().any(|p| glob_match(p, subject)) {
        return self.lookup(&alias.policy_path, predicate);
    }
}
```
### 3. Keeps Control with Policy Authors
The security team explicitly states which paths match:
```rust
PolicyAlias {
    policy_path: "code://standards/tls/cert_verification",
    target_patterns: vec![
        "code://rust/*/tls/cert_verification",
        "code://go/*/tls/cert_verification",
    ],
}
```
This is **documentation** (what matches what) and **enforcement** (signed in Trust Pack).
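
Putting the two layers together, a self-contained sketch might look like the following. Everything here is illustrative: the `Matcher` type, its `index` map, the `tail_key` helper, and the segment-wise semantics of `glob_match` (where `*` matches exactly one segment) are assumptions, not the real Aphoria types.

```rust
use std::collections::HashMap;

/// Hypothetical alias shape mirroring the `PolicyAlias` example above.
struct PolicyAlias {
    policy_path: String,
    target_patterns: Vec<String>,
}

/// Assumed glob semantics: `*` matches exactly one path segment.
fn glob_match(pattern: &str, subject: &str) -> bool {
    let p: Vec<&str> = pattern.split('/').collect();
    let s: Vec<&str> = subject.split('/').collect();
    p.len() == s.len() && p.iter().zip(&s).all(|(p, s)| *p == "*" || p == s)
}

/// Assumed key format: last two path segments plus the predicate.
fn tail_key(uri: &str, predicate: &str) -> Option<String> {
    let (_, path) = uri.split_once("://")?;
    let segs: Vec<&str> = path.split('/').collect();
    if segs.len() < 2 {
        return None;
    }
    Some(format!("{}::{}", segs[segs.len() - 2..].join("/"), predicate))
}

/// Hypothetical matcher holding the tail-path index and the signed aliases.
struct Matcher {
    index: HashMap<String, String>, // tail key -> authoritative value
    aliases: Vec<PolicyAlias>,
}

impl Matcher {
    /// Layer 1: direct tail-path lookup.
    fn lookup(&self, subject: &str, predicate: &str) -> Option<&String> {
        self.index.get(&tail_key(subject, predicate)?)
    }

    /// Layer 1 first, then Layer 2 (explicit aliases) only on a miss.
    fn resolve(&self, subject: &str, predicate: &str) -> Option<&String> {
        if let Some(hit) = self.lookup(subject, predicate) {
            return Some(hit);
        }
        self.aliases
            .iter()
            .find(|a| a.target_patterns.iter().any(|p| glob_match(p, subject)))
            .and_then(|a| self.lookup(&a.policy_path, predicate))
    }
}
```

For example, if the index holds `auth/jwt_validation::enabled` and an alias maps `code://internal/auth/jwt_validation` to the pattern `code://*/auth-service/jwt/validation`, then `code://rust/auth-service/jwt/validation` misses on the direct lookup (its tail key is `jwt/validation::enabled`) but resolves through the alias.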
---
## Extension Points: Future Matching Layers
### Layer 3: Semantic Equivalence (Future)
**Idea:** Use embeddings to match concepts with different names.
**Example:**
```
Policy: "code://standards/tls/certificate_validation"
Code: "code://rust/myapp/tls/cert_verification"
```
Embedding similarity: 0.92 → match
**When to add:** If alias management becomes too manual.
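
As a rough illustration of what such a layer could compute, the sketch below compares two embedding vectors with cosine similarity and applies a threshold. The function names, the use of cosine similarity, and the threshold value are all assumptions; this document does not specify how embeddings would be produced or compared.

```rust
/// Cosine similarity between two equal-length embedding vectors.
/// Returns None for mismatched lengths, empty input, or zero vectors.
fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}

/// Hypothetical Layer 3 check: treat two concepts as equivalent when their
/// embeddings are at least `threshold` similar (the threshold is an assumption).
fn is_semantic_match(a: &[f32], b: &[f32], threshold: f32) -> bool {
    cosine_similarity(a, b).map_or(false, |s| s >= threshold)
}
```

Because a similarity score is not an explicit, signed statement, a layer like this would sit uneasily with the "Explicit Over Implicit" principle unless its matches were surfaced for review.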
---
### Layer 4: Ontology Mapping (Future)
**Idea:** Define semantic relationships between concepts.
**Example:**
```yaml
ontology:
  "tls/cert_verification":
    equivalent_to:
      - "tls/certificate_validation"
      - "ssl/verify_certs"
    broader_than:
      - "security/transport_layer"
```
**When to add:** If multiple industries need cross-domain mapping.
---
## Comparison: Alternative Approaches
### Alt 1: Variable Tail Length
**Idea:** Try N=1, N=2, N=3 segment keys.
**Problems:**
- Ambiguous matches (which key wins?)
- Performance hit (3x lookups)
- Doesn't solve semantic differences
**Verdict:** Rejected (complexity without solving root cause)
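
To make the ambiguity concrete, here is a small sketch of what variable tail lengths would generate. The helper `candidate_keys` is hypothetical and reuses the `"<tail>::<predicate>"` key format assumed from the examples in this document.

```rust
/// Generates one candidate key per tail length N = 1..=max_n.
/// Hypothetical helper illustrating the rejected approach.
fn candidate_keys(uri: &str, predicate: &str, max_n: usize) -> Vec<String> {
    let path = match uri.split_once("://") {
        Some((_scheme, path)) => path,
        None => return Vec::new(),
    };
    let segs: Vec<&str> = path.split('/').collect();
    (1..=max_n.min(segs.len()))
        .map(|n| format!("{}::{}", segs[segs.len() - n..].join("/"), predicate))
        .collect()
}
```

For `code://rust/myapp/tls/cert_verification` with `max_n = 3`, this produces `cert_verification::enabled`, `tls/cert_verification::enabled`, and `myapp/tls/cert_verification::enabled`. Each is a separate lookup, and the N=1 key drops the `tls` domain entirely, so any other concept that ends in `cert_verification` would collide with it.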
---
### Alt 2: Normalize All Paths
**Idea:** Extractors output "canonical" paths that match standards.
**Problems:**
- Loses language context (`rust/myapp`)
- Breaks existing aliases/observations
- Forces extractors to know about ALL standards
**Verdict:** Rejected (breaks modularity)
---
### Alt 3: Dynamic Alias Discovery
**Idea:** Auto-create aliases during scan when tail-path matches but full path differs.
**Problems:**
- Implicit behavior (hard to debug)
- No security team approval (bypasses policy control)
- May create false positives
**Verdict:** Future enhancement (as suggestions, not automatic)
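
A suggestion-only variant could look roughly like this: the scan reports candidate aliases and never applies them. The `AliasSuggestion` type and both helper functions are hypothetical, and proposing the observed subject verbatim as the target pattern is just one simple choice.

```rust
/// A proposed alias for human review. Hypothetical type; nothing here is
/// applied automatically.
#[derive(Debug, PartialEq)]
struct AliasSuggestion {
    policy_path: String,
    target_pattern: String,
}

/// Last two path segments of a URI (assumed concept name).
fn tail(uri: &str) -> Option<String> {
    let (_, path) = uri.split_once("://")?;
    let segs: Vec<&str> = path.split('/').collect();
    if segs.len() < 2 {
        return None;
    }
    Some(segs[segs.len() - 2..].join("/"))
}

/// Suggests an alias for every policy path whose tail equals the subject's
/// tail while the full path differs.
fn suggest_aliases(policy_paths: &[&str], subject: &str) -> Vec<AliasSuggestion> {
    let Some(subject_tail) = tail(subject) else {
        return Vec::new();
    };
    let mut suggestions = Vec::new();
    for &policy in policy_paths {
        if policy != subject && tail(policy).as_deref() == Some(subject_tail.as_str()) {
            suggestions.push(AliasSuggestion {
                policy_path: policy.to_string(),
                target_pattern: subject.to_string(),
            });
        }
    }
    suggestions
}
```

The output is only a list for the security team to accept or reject, so provenance and signing stay under the policy author's control.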
---
## Architectural Trade-offs
### Chosen: Explicit Policy Aliases
**Pros:**
- Clear provenance (aliases are in Trust Pack)
- Auditable (covered by signature)
- Flexible (glob patterns support many cases)
- Backward compatible (empty aliases = current behavior)
**Cons:**
- Requires manual alias creation
- Adds cognitive overhead (security teams must think about patterns)
- Another field in Trust Pack schema
**Why this trade-off wins:**
- Enterprise adoption requires **auditability**
- Security teams WANT explicit control
- Manual work is one-time (create pack once, reuse everywhere)
---
## Recommended Patterns
### Pattern 1: Language Wildcards
**Use Case:** Standard applies to all languages.
```rust
PolicyAlias {
    policy_path: "code://standards/tls/cert_verification",
    target_patterns: vec![
        "code://*/*/tls/cert_verification", // any language, any project
    ],
}
```
### Pattern 2: Project-Specific
**Use Case:** Internal policy for specific service.
```rust
PolicyAlias {
    policy_path: "code://internal/auth/jwt_validation",
    target_patterns: vec![
        "code://rust/auth-service/jwt/validation",
        "code://go/auth-service/jwt/validation",
    ],
}
```
### Pattern 3: Domain-Scoped
**Use Case:** Cloud-specific rules.
```rust
PolicyAlias {
    policy_path: "code://vendor/aws/s3/public_access",
    target_patterns: vec![
        "code://*/*/aws/s3/bucket/public",
        "code://*/*/cloud/storage/s3/public_access",
    ],
}
```
---
## Open Questions for Long-Term Evolution
### Q1: Should we support recursive wildcards?
**Current:** `code://rust/*/tls` (single segment wildcard)
**Proposed:** `code://rust/**/tls` (any depth)
**Trade-off:** More flexible, but harder to reason about matches.
**Decision:** Start with single-segment, add recursive if needed.
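
The difference between the two wildcard forms can be sketched with a segment-wise matcher. This is an illustration under assumed semantics (`*` matches exactly one segment, `**` matches zero or more); the function names are hypothetical and the real matcher may behave differently.

```rust
/// Segment-wise glob match. Assumed semantics: `*` matches exactly one
/// segment, `**` matches zero or more segments.
fn glob_segments(pattern: &[&str], subject: &[&str]) -> bool {
    let Some((&seg, rest)) = pattern.split_first() else {
        // Pattern exhausted: match only if the subject is exhausted too.
        return subject.is_empty();
    };
    if seg == "**" {
        // Try consuming 0, 1, 2, ... subject segments with the recursive wildcard.
        return (0..=subject.len()).any(|i| glob_segments(rest, &subject[i..]));
    }
    match subject.split_first() {
        Some((&head, tail)) => (seg == "*" || seg == head) && glob_segments(rest, tail),
        None => false,
    }
}

/// Convenience wrapper that splits both paths on '/'.
fn path_matches(pattern: &str, subject: &str) -> bool {
    let p: Vec<&str> = pattern.split('/').collect();
    let s: Vec<&str> = subject.split('/').collect();
    glob_segments(&p, &s)
}
```

Under these assumptions, `code://rust/*/tls` matches `code://rust/myapp/tls` but not `code://rust/org/myapp/tls`, while `code://rust/**/tls` matches both and also `code://rust/tls`. That last case is the kind of surprise that makes recursive wildcards harder to reason about.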
---
### Q2: Should aliases be bidirectional?
**Current:** Policy path → Code patterns (one direction)
**Proposed:** Allow code path → Policy path mapping
**Use Case:** "This code path is an exception to standard X."
**Decision:** Defer until use case emerges.
---
### Q3: Should we cache pattern matches?
**Current:** Recompute glob match on every lookup
**Proposed:** Cache subject → policy_path map per scan
**Trade-off:** Faster (O(1) after first match) vs. memory overhead
**Decision:** Benchmark first and optimize only if needed (caching now would be premature optimization).
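
If benchmarking does justify it, the proposed cache could be as small as the sketch below. The `AliasCache` type and its closure-based `resolve` method are hypothetical; the closure stands in for the glob scan over policy aliases.

```rust
use std::collections::HashMap;

/// Per-scan memo of subject -> resolved policy path. Hypothetical type.
/// `None` is cached too, so subjects with no alias are not rescanned.
struct AliasCache {
    resolved: HashMap<String, Option<String>>,
}

impl AliasCache {
    fn new() -> Self {
        AliasCache { resolved: HashMap::new() }
    }

    /// Returns the cached result, or runs `slow_path` (the glob scan) once
    /// and remembers its answer.
    fn resolve<F>(&mut self, subject: &str, slow_path: F) -> Option<String>
    where
        F: FnOnce(&str) -> Option<String>,
    {
        if let Some(hit) = self.resolved.get(subject) {
            return hit.clone();
        }
        let result = slow_path(subject);
        self.resolved.insert(subject.to_string(), result.clone());
        result
    }
}
```

The memory cost grows with the number of distinct subjects in a scan, which is the overhead the trade-off above refers to. Creating one cache per scan also avoids stale entries when a Trust Pack changes between scans.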
---
### Q4: Should policy aliases be mergeable?
**Current:** Each Trust Pack has independent aliases
**Proposed:** Allow Trust Pack B to "extend" Trust Pack A's aliases
**Use Case:** Company-wide base pack + team-specific extensions
**Decision:** Future enhancement (Trust Pack composition system).
---
## Guiding Heuristic
**When adding matching features, ask:**
1. **Does this preserve tail-path matching for the common case?**
- Yes → Maintains performance
- No → Reconsider
2. **Is the behavior explicit and auditable?**
- Yes → Security teams can reason about it
- No → Will cause trust issues
3. **Can it be disabled or overridden?**
- Yes → Progressive adoption
- No → May block some use cases
4. **Does it add cognitive overhead?**
- Minimal → Worth the flexibility
- Significant → Document heavily or defer
---
## Conclusion
**Current State:** Tail-path matching works for the bundled corpus but breaks for enterprise policies.
**Root Cause:** Semantic mismatch between policy hierarchies and extractor output.
**Solution:** Add explicit policy aliases as a second matching layer.
**Philosophy:** Start simple (tail-path), add precision layers (aliases), keep it auditable.
**Future:** Semantic embeddings and ontology mapping if manual aliases become burdensome.
---
**This document should guide future matching feature discussions.** Always return to the core principles: semantic matching, progressive precision, explicit control.