# Concept Matching Philosophy

**Context:** Aphoria's policy enforcement depends on matching code extractor output to authoritative sources.

**Question:** How do we enable flexible matching without over-engineering?

---

## Core Design Principles

### 1. Semantic Over Syntactic

**Bad:** Match exact string paths
```
"code://rust/myapp/tls/cert_verification" != "rfc://5246/tls/cert_verification"
```

**Good:** Match semantic tail paths
```
Both produce key: "tls/cert_verification::enabled"
```

**Principle:** Concepts should match across schemes if they represent the same idea.

---

### 2. Progressive Precision

- **Layer 1:** Tail-path matching (works 80% of the time)
- **Layer 2:** Policy aliases (handles enterprise hierarchies)
- **Layer 3:** Semantic embeddings (future: fuzzy matching)

**Principle:** Start with simple heuristics, add precision layers as needed.

---

### 3. Explicit Over Implicit

**Bad:** Auto-generate aliases behind the scenes
- Hard to debug ("why did this match?")
- Fragile (breaks with refactoring)
- Opaque (security teams lose control)

**Good:** Require explicit policy aliases
- Clear provenance (alias is in Trust Pack)
- Auditable (signature covers aliases)
- Controllable (security team decides matches)

**Principle:** Matching logic should be transparent and intentional.

---

## Why Tail-Path Matching Works

### Design Insight

Code extractors are **intentionally designed** to align with RFC/OWASP paths:

**RFC Structure:**
```
rfc://5246/tls/cert_verification
rfc://7519/jwt/audience_validation
rfc://8996/tls/min_version
```

**Extractor Output:**
```
code://rust/myapp/tls/cert_verification
code://python/myapp/jwt/audience_validation
code://go/myapp/tls/min_version
```

**Key Insight:** The last two segments (`tls/cert_verification`) form the **concept name**.

The language and project prefix (`rust/myapp`) provides **context** but not **identity**.
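
A minimal sketch of that key derivation, assuming the key is simply the last two path segments joined to the predicate with `::`, as in the example key shown earlier. The helper name `tail_key` and its signature are hypothetical and not taken from the Aphoria codebase.

```rust
/// Builds a matching key from the last two path segments plus the predicate.
/// Hypothetical helper: name and signature are illustrative only.
fn tail_key(uri: &str, predicate: &str) -> Option<String> {
    // Drop the scheme ("code://", "rfc://") so only the path remains.
    let (_scheme, path) = uri.split_once("://")?;
    let segments: Vec<&str> = path.split('/').filter(|s| !s.is_empty()).collect();
    if segments.len() < 2 {
        return None; // too short to carry a two-segment concept name
    }
    // Keep the concept name (last two segments); discard the context prefix.
    let tail = &segments[segments.len() - 2..];
    Some(format!("{}::{}", tail.join("/"), predicate))
}
```

Under this assumption, `tail_key("code://rust/myapp/tls/cert_verification", "enabled")` and `tail_key("rfc://5246/tls/cert_verification", "enabled")` yield the same key, which is what lets a single hash lookup connect the two schemes.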

---

## Why Tail-Path Matching Breaks

### Enterprise Hierarchies

Security teams think in **logical domains**, not **RFC hierarchies**:

```
code://standards/tls/cert_verification    (Security team's mental model)
code://internal/exceptions/md5_allowed    (Policy exceptions)
code://vendor/aws/s3/public_access        (Cloud-specific rules)
```

These don't map to extractor output:

```
code://rust/myapp/tls/cert_verification   (Extractor output)
```

**Problem:** `standards/tls` (2 segments) vs. `rust/myapp/tls` (3 segments)

Tail-path key mismatch:
- Policy: `"tls/cert_verification::enabled"`
- Code: `"myapp/tls::enabled"` (extracts wrong segments!)

---

## Why Policy Aliases Are the Right Solution

### 1. Preserves Tail-Path Matching

Most cases (the bundled corpus) still take the fast path:
```rust
// 1. Try direct tail-path match (O(1) hash lookup)
if let Some(result) = self.lookup(subject, predicate) {
    return Some(result);
}
```

### 2. Adds Flexibility Without Complexity

Aliases are tried only when the direct match fails:
```rust
// 2. Try policy alias patterns (O(P*A), small P and A)
for alias in policy_aliases {
    if glob_match(alias.target_patterns, subject) {
        return self.lookup(&alias.policy_path, predicate);
    }
}
```

### 3. Keeps Control with Policy Authors

The security team states the mapping explicitly:
```rust
PolicyAlias {
    policy_path: "code://standards/tls/cert_verification",
    target_patterns: vec![
        "code://rust/*/tls/cert_verification",
        "code://go/*/tls/cert_verification",
    ],
}
```

This serves as both **documentation** (what matches what) and **enforcement** (signed in the Trust Pack).
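
The two-step lookup and the alias shape can be sketched together as one self-contained unit. Everything here is an assumption made for illustration: the `PolicyAlias` struct mirrors the literal above but uses owned strings, `glob_match` treats `*` as exactly one path segment, and the `lookup` closure stands in for `self.lookup`. The real matcher may differ.

```rust
/// Hypothetical mirror of the `PolicyAlias` literal, with owned strings.
struct PolicyAlias {
    policy_path: String,
    target_patterns: Vec<String>,
}

/// Compares pattern and subject segment by segment. `*` stands for exactly
/// one non-empty segment (an assumed reading of the glob syntax).
fn glob_match(pattern: &str, subject: &str) -> bool {
    let mut p = pattern.split('/');
    let mut s = subject.split('/');
    loop {
        match (p.next(), s.next()) {
            (None, None) => return true, // both exhausted together
            (Some(ps), Some(ss)) if ps == ss || (ps == "*" && !ss.is_empty()) => continue,
            _ => return false, // literal mismatch or different depth
        }
    }
}

/// Layer 1 first, Layer 2 only on a miss. `lookup` is a stand-in for the
/// policy store; it receives a path and a predicate.
fn resolve<T>(
    subject: &str,
    predicate: &str,
    aliases: &[PolicyAlias],
    lookup: impl Fn(&str, &str) -> Option<T>,
) -> Option<T> {
    // 1. Direct match on the subject itself.
    if let Some(hit) = lookup(subject, predicate) {
        return Some(hit);
    }
    // 2. The first alias with any matching pattern decides the outcome.
    for alias in aliases {
        if alias.target_patterns.iter().any(|p| glob_match(p, subject)) {
            return lookup(&alias.policy_path, predicate);
        }
    }
    None
}
```

Because the first matching alias wins, alias order matters in this sketch. Whether the real implementation should keep searching after an alias whose policy path yields no result is a design choice the sketch does not settle.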

---

## Extension Points: Future Matching Layers

### Layer 3: Semantic Equivalence (Future)

**Idea:** Use embeddings to match concepts with different names.

**Example:**
```
Policy: "code://standards/tls/certificate_validation"
Code:   "code://rust/myapp/tls/cert_verification"
```

Embedding similarity: 0.92 → match

**When to add:** If alias management becomes too manual.
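
As a rough illustration of how such a layer might decide a match, the sketch below computes cosine similarity between two embedding vectors and compares it to a threshold. The function names, the use of cosine similarity, and the threshold are all assumptions; the document does not specify an embedding model or a similarity measure.

```rust
/// Cosine similarity of two equal-length vectors. Returns `None` when the
/// lengths differ, the input is empty, or either vector has zero norm.
fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}

/// Hypothetical decision rule: two concepts match when their embeddings are
/// at least `threshold` similar. Invalid input never matches.
fn is_semantic_match(a: &[f32], b: &[f32], threshold: f32) -> bool {
    cosine_similarity(a, b).map_or(false, |sim| sim >= threshold)
}
```

A fuzzy layer like this would sit after aliases, and its matches would likely need to be surfaced for review to stay consistent with the explicit-over-implicit principle.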

---

### Layer 4: Ontology Mapping (Future)

**Idea:** Define semantic relationships between concepts.

**Example:**
```yaml
ontology:
  "tls/cert_verification":
    equivalent_to:
      - "tls/certificate_validation"
      - "ssl/verify_certs"
    broader_than:
      - "security/transport_layer"
```

**When to add:** If multiple industries need cross-domain mapping.
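
One possible way to consume the `equivalent_to` entries is to flatten them into an index from every known name to its canonical concept. The sketch below assumes that shape; the function names are hypothetical, and `broader_than` is deliberately ignored because its matching semantics are not defined here.

```rust
use std::collections::HashMap;

/// Flattens (canonical, equivalents) pairs into a name -> canonical index.
/// The canonical name maps to itself.
fn build_equivalence_index(ontology: &[(&str, Vec<&str>)]) -> HashMap<String, String> {
    let mut index = HashMap::new();
    for (canonical, equivalents) in ontology {
        index.insert(canonical.to_string(), canonical.to_string());
        for name in equivalents {
            index.insert(name.to_string(), canonical.to_string());
        }
    }
    index
}

/// Two concept names are the same concept when both resolve to the same
/// canonical entry. Names missing from the index fall back to string equality.
fn same_concept(index: &HashMap<String, String>, a: &str, b: &str) -> bool {
    match (index.get(a), index.get(b)) {
        (Some(x), Some(y)) => x == y,
        _ => a == b,
    }
}
```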

---

## Comparison: Alternative Approaches

### Alt 1: Variable Tail Length

**Idea:** Try N=1, N=2, N=3 segment keys.

**Problems:**
- Ambiguous matches (which key wins?)
- Performance hit (3x lookups)
- Doesn't solve semantic differences

**Verdict:** Rejected (complexity without solving root cause)
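
To make the ambiguity concrete, the sketch below generates the candidate keys this alternative would have to probe. The `candidate_keys` helper is hypothetical and exists only to illustrate the rejected design.

```rust
/// Produces one key per tail length from 1 up to `max_tail`, shortest first.
/// Hypothetical helper illustrating the rejected variable-length design.
fn candidate_keys(uri: &str, predicate: &str, max_tail: usize) -> Vec<String> {
    // Strip the scheme if present; otherwise treat the whole input as a path.
    let path = uri.split_once("://").map_or(uri, |(_, rest)| rest);
    let segments: Vec<&str> = path.split('/').filter(|s| !s.is_empty()).collect();
    (1..=max_tail.min(segments.len()))
        .map(|n| format!("{}::{}", segments[segments.len() - n..].join("/"), predicate))
        .collect()
}
```

Every subject now costs up to `max_tail` lookups, and the shortest key is shared by unrelated concepts: in this sketch a `tls/min_version` path and an invented `ssh/min_version` path both produce `min_version::enabled` at N=1, so some tie-breaking rule would be needed.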

---

### Alt 2: Normalize All Paths

**Idea:** Extractors output "canonical" paths that match standards.

**Problems:**
- Loses language context (`rust/myapp`)
- Breaks existing aliases/observations
- Forces extractors to know about ALL standards

**Verdict:** Rejected (breaks modularity)

---

### Alt 3: Dynamic Alias Discovery

**Idea:** Auto-create aliases during scan when tail-path matches but full path differs.

**Problems:**
- Implicit behavior (hard to debug)
- No security team approval (bypasses policy control)
- May create false positives

**Verdict:** Future enhancement (as suggestions, not automatic)

---

## Architectural Trade-offs

### Chosen: Explicit Policy Aliases

**Pros:**
- Clear provenance (aliases are in Trust Pack)
- Auditable (covered by signature)
- Flexible (glob patterns support many cases)
- Backward compatible (empty aliases = current behavior)

**Cons:**
- Requires manual alias creation
- Adds cognitive overhead (security teams must think about patterns)
- Another field in Trust Pack schema

**Why this trade-off wins:**
- Enterprise adoption requires **auditability**
- Security teams WANT explicit control
- Manual work is one-time (create pack once, reuse everywhere)

---

## Recommended Patterns

### Pattern 1: Language Wildcards

**Use Case:** Standard applies to all languages.

```rust
PolicyAlias {
    policy_path: "code://standards/tls/cert_verification",
    target_patterns: vec![
        "code://*/*/tls/cert_verification",  // any language, any project
    ],
}
```

### Pattern 2: Project-Specific

**Use Case:** Internal policy for specific service.

```rust
PolicyAlias {
    policy_path: "code://internal/auth/jwt_validation",
    target_patterns: vec![
        "code://rust/auth-service/jwt/validation",
        "code://go/auth-service/jwt/validation",
    ],
}
```

### Pattern 3: Domain-Scoped

**Use Case:** Cloud-specific rules.

```rust
PolicyAlias {
    policy_path: "code://vendor/aws/s3/public_access",
    target_patterns: vec![
        "code://*/*/aws/s3/bucket/public",
        "code://*/*/cloud/storage/s3/public_access",
    ],
}
```

---

## Open Questions for Long-Term Evolution

### Q1: Should we support recursive wildcards?

**Current:** `code://rust/*/tls` (single-segment wildcard)

**Proposed:** `code://rust/**/tls` (any depth)

**Trade-off:** More flexible, but harder to reason about matches.

**Decision:** Start with single-segment, add recursive if needed.
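
If recursive wildcards are added later, the matcher could look roughly like the sketch below. It assumes `**` absorbs zero or more segments while `*` still matches exactly one non-empty segment. Both function names are hypothetical, and whether `**` should be allowed to match zero segments is itself an open choice.

```rust
/// Recursive segment matcher supporting `*` (one segment) and `**`
/// (zero or more segments, by assumption).
fn segments_match(pattern: &[&str], subject: &[&str]) -> bool {
    let Some((&first, rest)) = pattern.split_first() else {
        // Pattern exhausted: match only if the subject is exhausted too.
        return subject.is_empty();
    };
    if first == "**" {
        // Try every possible number of segments for `**` to absorb.
        return (0..=subject.len()).any(|skip| segments_match(rest, &subject[skip..]));
    }
    match subject.split_first() {
        Some((&head, tail)) if first == head || (first == "*" && !head.is_empty()) => {
            segments_match(rest, tail)
        }
        _ => false,
    }
}

/// Convenience wrapper that splits both inputs on `/`.
fn glob_match_deep(pattern: &str, subject: &str) -> bool {
    let p: Vec<&str> = pattern.split('/').collect();
    let s: Vec<&str> = subject.split('/').collect();
    segments_match(&p, &s)
}
```

The backtracking in the `**` branch is what makes matches harder to reason about: one pattern can match a subject in several different ways, and the worst-case cost grows with the number of `**` tokens.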

---

### Q2: Should aliases be bidirectional?

**Current:** Policy path → Code patterns (one direction)

**Proposed:** Allow code path → Policy path mapping

**Use Case:** "This code path is an exception to standard X."

**Decision:** Defer until use case emerges.

---

### Q3: Should we cache pattern matches?

**Current:** Recompute glob match on every lookup

**Proposed:** Cache subject → policy_path map per scan

**Trade-off:** Faster (O(1) after first match) vs. memory overhead

**Decision:** Benchmark first and optimize only if needed; adding a cache now would be premature optimization.
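
Should benchmarking justify it, a per-scan cache could be as small as the sketch below. The `AliasCache` type and its method are hypothetical. The `resolve` closure stands in for the glob-matching pass, and misses are cached too, so a subject with no alias is only scanned once.

```rust
use std::collections::HashMap;

/// Per-scan memo of subject -> resolved policy path (`None` = no alias matched).
struct AliasCache {
    resolved: HashMap<String, Option<String>>,
}

impl AliasCache {
    fn new() -> Self {
        Self { resolved: HashMap::new() }
    }

    /// Returns the cached answer, or runs `resolve` once and remembers it.
    fn policy_path_for(
        &mut self,
        subject: &str,
        resolve: impl FnOnce(&str) -> Option<String>,
    ) -> Option<String> {
        if let Some(cached) = self.resolved.get(subject) {
            return cached.clone();
        }
        let result = resolve(subject);
        self.resolved.insert(subject.to_string(), result.clone());
        result
    }
}
```

The memory cost is one entry per distinct subject seen in a scan, which is the overhead the trade-off above refers to.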

---

### Q4: Should policy aliases be mergeable?

**Current:** Each Trust Pack has independent aliases

**Proposed:** Allow Trust Pack B to "extend" Trust Pack A's aliases

**Use Case:** Company-wide base pack + team-specific extensions

**Decision:** Future enhancement (Trust Pack composition system).
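
One possible merge rule, sketched under the assumption that an extension may only add patterns to a policy path and never remove them. The `merge_aliases` function is hypothetical, the struct repeats the `PolicyAlias` shape with owned strings, and signature handling across packs is out of scope here.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, PartialEq)]
struct PolicyAlias {
    policy_path: String,
    target_patterns: Vec<String>,
}

/// Unions patterns per policy path. Base patterns come first, duplicates are
/// dropped, and output is ordered by policy path for deterministic results.
fn merge_aliases(base: &[PolicyAlias], extension: &[PolicyAlias]) -> Vec<PolicyAlias> {
    let mut merged: BTreeMap<String, Vec<String>> = BTreeMap::new();
    for alias in base.iter().chain(extension) {
        let patterns = merged.entry(alias.policy_path.clone()).or_default();
        for pattern in &alias.target_patterns {
            if !patterns.contains(pattern) {
                patterns.push(pattern.clone());
            }
        }
    }
    merged
        .into_iter()
        .map(|(policy_path, target_patterns)| PolicyAlias { policy_path, target_patterns })
        .collect()
}
```

An additive-only rule keeps the base pack's guarantees intact, but a real composition system would also have to decide how a merged alias set is signed and audited.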

---

## Guiding Heuristic

**When adding matching features, ask:**

1. **Does this preserve tail-path matching for the common case?**
   - Yes → Maintains performance
   - No → Reconsider

2. **Is the behavior explicit and auditable?**
   - Yes → Security teams can reason about it
   - No → Will cause trust issues

3. **Can it be disabled or overridden?**
   - Yes → Progressive adoption
   - No → May block some use cases

4. **Does it add cognitive overhead?**
   - Minimal → Worth the flexibility
   - Significant → Document heavily or defer

---

## Conclusion

**Current State:** Tail-path matching works for the bundled corpus but breaks for enterprise policies.

**Root Cause:** Semantic mismatch between policy hierarchies and extractor output.

**Solution:** Add explicit policy aliases as a second matching layer.

**Philosophy:** Start simple (tail-path), add precision layers (aliases), keep it auditable.

**Future:** Semantic embeddings and ontology mapping if manual aliases become burdensome.

---

**This document should guide future matching feature discussions.** Always return to the core principles: semantic matching, progressive precision, explicit control.