stemedb/applications/aphoria/docs/architecture/concept-matching-analysis.md

# Concept Matching Architecture Analysis

**Date:** 2026-02-04
**Status:** Critical Gap Identified
**Priority:** High (Enterprise Blocker)

---

## Executive Summary

The current tail-path matching system (ConceptIndex) enables cross-scheme concept matching but has fundamental limitations for enterprise policy enforcement. While it works well for bundled RFC/OWASP corpus matching, it fails when security teams create custom policies that don't align with extractor output paths.

**Recommendation:** Keep tail-path matching as the first step and add an explicit policy alias layer in Trust Packs as a fallback when it finds no match.

---

## Current Architecture

### 1. Tail-Path Matching (ConceptIndex)

**Algorithm:**
```rust
// Both produce key: "tls/cert_verification::enabled"
"rfc://5246/tls/cert_verification"           // RFC corpus
"code://rust/myapp/tls/cert_verification"    // Code extractor
```

**How it works:**
1. Strip scheme (`rfc://`, `code://`)
2. Take last 2 path segments
3. Append predicate
4. Key = `{seg[-2]}/{seg[-1]}::{predicate}`
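
The four steps above can be sketched as a small function. This is a minimal illustration, not the code in `concept_index.rs`: the signature of `make_key` and its handling of paths with fewer than two segments are assumptions.

```rust
/// Build a tail-path key: strip the scheme, keep the last two path
/// segments, and append the predicate. (Sketch; the real `make_key`
/// may differ in signature and edge-case handling.)
fn make_key(subject: &str, predicate: &str) -> String {
    // Step 1: drop everything up to and including "://", if present.
    let path = subject.split_once("://").map_or(subject, |(_, rest)| rest);
    // Step 2: keep the last two non-empty segments (assumption: shorter
    // paths are used whole).
    let segments: Vec<&str> = path.split('/').filter(|s| !s.is_empty()).collect();
    let start = segments.len().saturating_sub(2);
    // Steps 3-4: join the tail and append the predicate.
    format!("{}::{}", segments[start..].join("/"), predicate)
}

fn main() {
    let rfc = make_key("rfc://5246/tls/cert_verification", "enabled");
    let code = make_key("code://rust/myapp/tls/cert_verification", "enabled");
    // Both sources collapse to the same key, so they match.
    assert_eq!(rfc, "tls/cert_verification::enabled");
    assert_eq!(rfc, code);
    println!("{rfc}");
}
```

Because only the last two segments survive, everything before them (RFC number, language, application name) is ignored by the match.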

**Scan Flow:**
```
scan.rs:210 → ConceptIndex::build(&corpus)
  ↓
local.rs:273 → index.lookup(&claim.concept_path, &claim.predicate)
  ↓
concept_index.rs:54 → make_key(subject, predicate)
```

### 2. Trust Pack Import

**Current State:**
- ✅ Assertions stored in KV
- ✅ Indexed under `predicates::AUTHORITATIVE`
- ✅ Loaded into corpus at scan time (scan.rs:201)
- ✅ Included in ConceptIndex (scan.rs:210)

**The Gap:**
Trust Pack assertions use paths defined by security teams, which may not match extractor conventions.

---

## The Problem

### Scenario: Enterprise Policy Mismatch

**Security Team's Intent:**
```toml
# They create a "blessed" standard
subject = "code://standards/tls/cert_verification"
predicate = "enabled"
object = true
```

**What Code Extractors Produce:**
```rust
// Rust extractor output
concept_path: "code://rust/myapp/tls/cert_verification"

// Go extractor output
concept_path: "code://go/myapp/tls/cert_verification"

// Python extractor output
concept_path: "code://python/myapp/tls/cert_verification"
```

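**Current Behavior:**
```
Security standard:  "code://standards/tls/cert_verification"          → key: "tls/cert_verification::enabled"
Rust code:          "code://rust/myapp/tls/cert_verification"         → key: "tls/cert_verification::enabled"     ✅ match (tails coincide)
Rust code (nested): "code://rust/myapp/tls/client/cert_verification"  → key: "client/cert_verification::enabled"  ❌ MISMATCH
```

The first pair matches only because both paths happen to end in `tls/cert_verification`. The nested path is a hypothetical module layout: any extractor path whose last two segments differ from the policy's produces a different key and silently fails to match.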

### Root Cause

Tail-path matching assumes:
1. **Uniform Depth:** All sources use similar path hierarchies
2. **Language Agnostic:** The "tls/cert_verification" pattern is universal

But enterprise policies violate these assumptions:
- Security teams think in **domains** (`standards/tls`)
- Extractors output **language-qualified** paths (`rust/myapp/tls`)

---

## Analysis: Is Tail-Path Matching Sufficient?

### What Works Well

1. **RFC ↔ Code Matching**
   - RFCs use domain concepts: `rfc://5246/tls/cert_verification`
   - Code extractors intentionally align: `code://rust/.../tls/cert_verification`
   - This was designed to work

2. **Zero Configuration**
   - No manual alias mapping required
   - "Just works" for bundled corpus

3. **Cross-Language Matching**
   - `code://rust/.../tls/cert_verification`
   - `code://python/.../tls/cert_verification`
   - Both match the same RFC

### What Breaks

1. **Enterprise Policy Hierarchies**
   - Security teams use logical groupings: `standards/`, `internal/`, `exceptions/`
   - These don't map to extractor output

2. **Vendor-Specific Patterns**
   - Unreal Engine: `unreal://engine/rendering/synchronous_loading`
   - Code: `code://cpp/mygame/rendering/assets/load_sync`
   - Different semantic levels

3. **Domain-Specific Abstractions**
   - Healthcare: `hipaa://patient_data/encryption`
   - Finance: `pci://cardholder_data/storage`
   - Code may not mirror these hierarchies

---

## Solution Options

### Option 1: Normalize Extractor Output (Rejected)

**Idea:** Make extractors output "canonical" paths that match standards.

**Why it fails:**
- Extractors need language context (`rust/myapp`)
- Path structure conveys information (file location, module hierarchy)
- Breaks existing aliases and observations

### Option 2: Flexible Tail-Path Length (Partial)

**Idea:** Try matching with N=1, N=2, N=3 segments.

```rust
// Try multiple keys
"cert_verification::enabled"           // N=1
"tls/cert_verification::enabled"       // N=2
"myapp/tls/cert_verification::enabled" // N=3
```

**Pros:**
- Handles some depth mismatches
- Backward compatible

**Cons:**
- Ambiguous matches (which key wins?)
- Still doesn't solve semantic differences
- Performance impact (3x index lookups)
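
A rough sketch of the multi-length key generation makes the ambiguity concrete. The function name `candidate_keys` and the `ssh` path are hypothetical and used only for illustration.

```rust
/// Candidate tail-path keys for N = 1..=max_n, shortest tail first.
/// (Hypothetical helper; not part of the existing `ConceptIndex`.)
fn candidate_keys(subject: &str, predicate: &str, max_n: usize) -> Vec<String> {
    let path = subject.split_once("://").map_or(subject, |(_, rest)| rest);
    let segments: Vec<&str> = path.split('/').filter(|s| !s.is_empty()).collect();
    (1..=max_n.min(segments.len()))
        .map(|n| format!("{}::{}", segments[segments.len() - n..].join("/"), predicate))
        .collect()
}

fn main() {
    let tls = candidate_keys("code://rust/myapp/tls/cert_verification", "enabled", 3);
    assert_eq!(
        tls,
        vec![
            "cert_verification::enabled",
            "tls/cert_verification::enabled",
            "myapp/tls/cert_verification::enabled",
        ]
    );

    // An unrelated (hypothetical) SSH concept shares the N=1 key, so a
    // lookup that accepts N=1 would treat the two as the same concept.
    let ssh = candidate_keys("code://rust/myapp/ssh/cert_verification", "enabled", 3);
    assert_eq!(tls[0], ssh[0]);
    assert_ne!(tls[1], ssh[1]);
}
```

The shorter the tail, the more unrelated concepts collide, which is why this option needs a tie-breaking rule before it can be used safely.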

### Option 3: Explicit Policy Aliases (Recommended)

**Idea:** Add an alias layer in Trust Packs.

**Trust Pack Schema Extension:**
```rust
pub struct TrustPack {
    pub header: PackHeader,
    pub assertions: Vec<Assertion>,
    pub aliases: Vec<ConceptAlias>,  // Already exists!
    pub policy_aliases: Vec<PolicyAlias>,  // NEW
    pub signature: [u8; 64],
}

pub struct PolicyAlias {
    /// The policy path used in assertions
    pub policy_path: String,
    /// Glob patterns that should match this policy
    pub target_patterns: Vec<String>,
}
```

**Example:**
```rust
PolicyAlias {
    policy_path: "code://standards/tls/cert_verification",
    target_patterns: vec![
        "code://rust/*/tls/cert_verification",
        "code://go/*/tls/cert_verification",
        "code://python/*/tls/cert_verification",
    ],
}
```

**Matching Algorithm:**
```rust
impl ConceptIndex {
    pub fn lookup_with_aliases(
        &self,
        subject: &str,
        predicate: &str,
        policy_aliases: &[PolicyAlias],
    ) -> Option<&Vec<Assertion>> {
        // 1. Try direct tail-path match (existing)
        if let Some(result) = self.lookup(subject, predicate) {
            return Some(result);
        }

        // 2. Try policy alias expansion
        for alias in policy_aliases {
            if subject_matches_pattern(subject, &alias.target_patterns) {
                if let Some(result) = self.lookup(&alias.policy_path, predicate) {
                    return Some(result);
                }
            }
        }

        None
    }
}
```
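
To see the fallback end to end, here is a self-contained toy version. Everything in it is an assumption made for illustration: `ToyIndex` stands in for `ConceptIndex` and stores plain strings instead of `Assertion` values, `pattern_matches` treats `*` as exactly one path segment, and the `allowed` predicate is invented. The paths are the Unreal Engine pair from "What Breaks".

```rust
use std::collections::HashMap;

/// Mirrors the proposed `PolicyAlias` (no serialization derives here).
struct PolicyAlias {
    policy_path: String,
    target_patterns: Vec<String>,
}

/// Toy stand-in for `ConceptIndex`: tail-path key -> authoritative values.
struct ToyIndex {
    by_key: HashMap<String, Vec<String>>,
}

/// Tail-path key: last two path segments plus predicate.
fn make_key(subject: &str, predicate: &str) -> String {
    let path = subject.split_once("://").map_or(subject, |(_, rest)| rest);
    let segs: Vec<&str> = path.split('/').filter(|s| !s.is_empty()).collect();
    let start = segs.len().saturating_sub(2);
    format!("{}::{}", segs[start..].join("/"), predicate)
}

/// Simplified glob: `*` stands for exactly one whole path segment.
fn pattern_matches(pattern: &str, subject: &str) -> bool {
    let p: Vec<&str> = pattern.split('/').collect();
    let s: Vec<&str> = subject.split('/').collect();
    p.len() == s.len() && p.iter().zip(&s).all(|(p, s)| *p == "*" || p == s)
}

impl ToyIndex {
    fn lookup(&self, subject: &str, predicate: &str) -> Option<&Vec<String>> {
        self.by_key.get(&make_key(subject, predicate))
    }

    fn lookup_with_aliases(
        &self,
        subject: &str,
        predicate: &str,
        aliases: &[PolicyAlias],
    ) -> Option<&Vec<String>> {
        // 1. Direct tail-path match.
        if let Some(hit) = self.lookup(subject, predicate) {
            return Some(hit);
        }
        // 2. First alias whose pattern matches and whose policy path resolves.
        aliases
            .iter()
            .filter(|a| a.target_patterns.iter().any(|p| pattern_matches(p, subject)))
            .find_map(|a| self.lookup(&a.policy_path, predicate))
    }
}

/// Builds an index holding one policy assertion plus one alias for it.
fn demo() -> (ToyIndex, Vec<PolicyAlias>) {
    let policy = "unreal://engine/rendering/synchronous_loading";
    let mut by_key = HashMap::new();
    by_key.insert(make_key(policy, "allowed"), vec!["false".to_string()]);
    let aliases = vec![PolicyAlias {
        policy_path: policy.to_string(),
        target_patterns: vec!["code://cpp/*/rendering/assets/load_sync".to_string()],
    }];
    (ToyIndex { by_key }, aliases)
}

fn main() {
    let (index, aliases) = demo();
    let code = "code://cpp/mygame/rendering/assets/load_sync";
    // Tail-path alone misses: "assets/load_sync" != "rendering/synchronous_loading".
    assert!(index.lookup(code, "allowed").is_none());
    // The alias bridges the two paths.
    let hit = index.lookup_with_aliases(code, "allowed", &aliases);
    assert_eq!(hit, Some(&vec!["false".to_string()]));
}
```

Note that the direct match runs first, so aliases cost nothing for claims that tail-path matching already resolves.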

---

## Recommended Implementation Plan

### Phase 1: Extend Trust Pack Schema

**Files:**
- `applications/aphoria/src/policy.rs`

**Changes:**
```rust
#[derive(Archive, Deserialize, Serialize, Debug, Clone)]
pub struct PolicyAlias {
    pub policy_path: String,
    pub target_patterns: Vec<String>,
}

pub struct TrustPack {
    // ... existing fields
    pub policy_aliases: Vec<PolicyAlias>,
    // ...
}
```

### Phase 2: Add Pattern Matching

**Files:**
- `applications/aphoria/src/episteme/concept_index.rs`

**New Functions:**
```rust
impl ConceptIndex {
    /// Extended lookup that tries policy aliases after tail-path match
    pub fn lookup_with_aliases(
        &self,
        subject: &str,
        predicate: &str,
        aliases: &[PolicyAlias],
    ) -> Option<&Vec<Assertion>> { ... }
}

/// Check if a subject matches a glob pattern
fn subject_matches_pattern(subject: &str, patterns: &[String]) -> bool {
    // Use glob crate or simple wildcard matching
    patterns.iter().any(|p| glob_match(p, subject))
}
```
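
If the "simple wildcard matching" route is taken, `glob_match` could look roughly like the sketch below. It assumes one specific semantics that the design has not yet fixed: `*` matches any run of characters within a single path segment and never crosses a `/`. There is no support for `**`, `?`, or character classes.

```rust
/// Minimal glob: `*` matches any run of bytes that does not contain `/`.
/// All other bytes must match literally. (Assumed semantics.)
fn glob_match(pattern: &str, subject: &str) -> bool {
    fn go(p: &[u8], s: &[u8]) -> bool {
        match p.split_first() {
            // Pattern exhausted: succeed only if the subject is too.
            None => s.is_empty(),
            Some((&b'*', rest)) => {
                // Let `*` consume 0, 1, 2, ... bytes, stopping at `/`
                // or at the end of the subject.
                let mut i = 0;
                loop {
                    if go(rest, &s[i..]) {
                        return true;
                    }
                    if i == s.len() || s[i] == b'/' {
                        return false;
                    }
                    i += 1;
                }
            }
            // Literal byte: must equal the next subject byte.
            Some((c, rest)) => s.first() == Some(c) && go(rest, &s[1..]),
        }
    }
    go(pattern.as_bytes(), subject.as_bytes())
}

/// True if any pattern in the list matches the subject.
fn subject_matches_pattern(subject: &str, patterns: &[String]) -> bool {
    patterns.iter().any(|p| glob_match(p, subject))
}

fn main() {
    let pat = "code://rust/*/tls/cert_verification";
    assert!(glob_match(pat, "code://rust/myapp/tls/cert_verification"));
    // Different language segment: no match.
    assert!(!glob_match(pat, "code://go/myapp/tls/cert_verification"));
    // `*` does not span two segments.
    assert!(!glob_match(pat, "code://rust/a/b/tls/cert_verification"));

    let patterns = vec![pat.to_string()];
    assert!(subject_matches_pattern("code://rust/svc/tls/cert_verification", &patterns));
}
```

Keeping `*` within one segment makes patterns predictable for policy authors, at the cost of needing one pattern per nesting depth.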

### Phase 3: Integrate into Scan Flow

**Files:**
- `applications/aphoria/src/scan.rs`
- `applications/aphoria/src/episteme/local.rs`

**Changes:**
```rust
// scan.rs:210 - Load policy aliases from Trust Packs
let policy_manager = PolicyManager::new(&config.corpus.cache_dir);
let policies = policy_manager.load_policies(&config.policies)?;
let policy_aliases: Vec<PolicyAlias> = policies
    .iter()
    .flat_map(|p| &p.policy_aliases)
    .cloned()
    .collect();

// local.rs:273 - Use extended lookup
let auth_assertions = match index.lookup_with_aliases(
    &claim.concept_path,
    &claim.predicate,
    &policy_aliases,
) {
    Some(assertions) => assertions,
    None => continue,
};
```

### Phase 4: CLI Tooling

**New Command:**
```bash
# Generate policy aliases from existing assertions
aphoria policy generate-aliases \
  --policy-path "code://standards/tls/cert_verification" \
  --target "code://rust/*/tls/cert_verification" \
  --target "code://go/*/tls/cert_verification"
```

**Output:** Adds a `PolicyAlias` entry to the Trust Pack before it is signed.

---

## Extension Points

### 1. Dynamic Alias Discovery

**Future Enhancement:** Auto-generate aliases during scan if code paths differ from policy paths.

```rust
// If tail-path matches but full paths differ, suggest alias
if tail_match && !full_match {
    suggestions.push(PolicyAlias {
        policy_path: assertion.subject.clone(),
        target_patterns: vec![claim.concept_path.clone()],
    });
}
```

### 2. Semantic Equivalence

**Future Enhancement:** Use embedding similarity for fuzzy matching.

```rust
pub struct SemanticAlias {
    pub policy_path: String,
    pub similarity_threshold: f32,
}

// Match if embedding distance < threshold
```

### 3. Hierarchical Policy Inheritance

**Future Enhancement:** Support policy hierarchies.

```rust
// Match "code://standards/tls/*" against any TLS assertion
pub struct HierarchyAlias {
    pub policy_prefix: String,  // "code://standards/tls"
    pub target_prefix: String,  // "code://rust/*/tls"
}
```

---

## Migration Path

### Backward Compatibility

✅ **Zero Breaking Changes:**
- Tail-path matching still works for existing use cases
- `PolicyAlias` is optional (empty vec = current behavior)
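- Existing Trust Packs without a `policy_aliases` field must still load: the new field needs a default (empty vec), and this should be verified against the pack's serialization format and signature check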

### Adoption Strategy

- **Week 1:** Implement core functionality (Phases 1-2)
- **Week 2:** Integrate into scan flow (Phase 3)
- **Week 3:** Add CLI tooling (Phase 4)
- **Week 4:** Document + UAT with enterprise scenario

---

## Metrics for Success

### Functional
- [ ] Security team can create `code://standards/*` assertions
- [ ] Dev team code (`code://rust/myapp/*`) matches standards
- [ ] Conflicts are detected and reported
- [ ] Trust Pack signature verification passes

### Performance
- [ ] Scan time increase < 5% (alias matching costs O(A·P) pattern checks per claim that misses the tail-path lookup, where A = aliases and P = patterns per alias)
- [ ] Memory overhead < 10KB per Trust Pack (policy aliases are small)

### Usability
- [ ] Security team can export Trust Pack with aliases in < 5 commands
- [ ] Dev team imports Trust Pack with `policies = ["security.pack"]` (no code changes)

---

## Open Questions

1. **Wildcard Syntax:** Use glob (`*`) or regex (`.*`)?
   - **Recommendation:** Start with glob (simpler, more intuitive)

2. **Alias Priority:** If multiple aliases match, which wins?
   - **Recommendation:** First match wins (deterministic order in Trust Pack)

3. **Alias Storage:** Persist discovered aliases back to local store?
   - **Recommendation:** No (keep Trust Pack as source of truth)

4. **Alias Validation:** Check patterns at Trust Pack creation time?
   - **Recommendation:** Yes (fail fast if invalid glob pattern)
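
For question 4, a fail-fast check at Trust Pack creation time might look like the sketch below. The rules it enforces (literal scheme, no empty segments, no `**`) are assumptions chosen for illustration; the actual rules depend on the wildcard syntax that is finally adopted.

```rust
/// Hypothetical validation for one `target_patterns` entry.
/// Returns a human-readable error describing the first problem found.
fn validate_pattern(pattern: &str) -> Result<(), String> {
    // A pattern must look like "<scheme>://<path>".
    let (scheme, path) = pattern
        .split_once("://")
        .ok_or_else(|| format!("missing scheme in {pattern:?}"))?;
    // Assumption: wildcards are only allowed in the path, not the scheme.
    if scheme.is_empty() || scheme.contains('*') {
        return Err(format!("scheme must be a literal in {pattern:?}"));
    }
    // Reject "code://", "code://a//b", and trailing slashes.
    if path.is_empty() || path.split('/').any(|seg| seg.is_empty()) {
        return Err(format!("empty path segment in {pattern:?}"));
    }
    // Assumption: recursive wildcards are not part of the syntax.
    if path.contains("**") {
        return Err(format!("`**` is not supported in {pattern:?}"));
    }
    Ok(())
}

fn main() {
    assert!(validate_pattern("code://rust/*/tls/cert_verification").is_ok());
    assert!(validate_pattern("rust/*/tls/cert_verification").is_err());
    assert!(validate_pattern("code://rust//tls").is_err());
    assert!(validate_pattern("code://rust/**/tls").is_err());
}
```

Running this over every alias before signing means a malformed pattern is rejected by the pack author's tooling and never surfaces as a silent non-match during a scan.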

---

## Conclusion

**Current State:** Tail-path matching works for bundled corpus but breaks for enterprise policies.

**Root Cause:** Semantic mismatch between policy hierarchies and extractor output.

**Solution:** Add explicit `PolicyAlias` layer in Trust Packs.

**Impact:** Unblocks enterprise adoption without breaking existing functionality.

**Effort:** ~2-3 days of implementation work (schema extension + pattern matching + integration), excluding CLI tooling, documentation, and UAT

**Risk:** Low (additive change, backward compatible)

---

## Next Steps

1. Review this analysis with team
2. Validate glob pattern syntax choice
3. Implement Phase 1 (schema extension)
4. Write UAT scenario mimicking enterprise use case
5. Iterate based on feedback

---

**Questions or feedback?** Discuss in `#aphoria-architecture`.