# Concept Matching Architecture Analysis

**Date:** 2026-02-04

**Status:** Critical Gap Identified

**Priority:** High (Enterprise Blocker)

---
## Executive Summary
The current tail-path matching system (ConceptIndex) enables cross-scheme concept matching but has fundamental limitations for enterprise policy enforcement. While it works well for bundled RFC/OWASP corpus matching, it fails when security teams create custom policies that don't align with extractor output paths.

**Recommendation:** Implement a three-tier matching system with explicit policy aliasing.

---
## Current Architecture
### 1. Tail-Path Matching (ConceptIndex)
**Algorithm:**
```rust
// Both produce key: "tls/cert_verification::enabled"
"rfc://5246/tls/cert_verification" // RFC corpus
"code://rust/myapp/tls/cert_verification" // Code extractor
```
**How it works:**
1. Strip scheme (`rfc://`, `code://`)
2. Take last 2 path segments
3. Append predicate
4. Key = `{seg[-2]}/{seg[-1]}::{predicate}`
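The four steps above can be sketched as a small key-building function. This is a minimal illustration, not the actual `make_key` from `concept_index.rs`; the handling of empty segments and of paths shorter than two segments is an assumption.

```rust
/// Builds a tail-path key of the form `{seg[-2]}/{seg[-1]}::{predicate}`.
/// Hypothetical sketch; the real `make_key` may treat edge cases differently.
pub fn make_key(subject: &str, predicate: &str) -> String {
    // Step 1: strip the scheme (`rfc://`, `code://`) if one is present.
    let path = subject.split_once("://").map_or(subject, |(_, rest)| rest);

    // Step 2: keep the last two non-empty path segments
    // (assumption: shorter paths simply use whatever segments exist).
    let segments: Vec<&str> = path.split('/').filter(|s| !s.is_empty()).collect();
    let start = segments.len().saturating_sub(2);
    let tail = segments[start..].join("/");

    // Steps 3-4: append the predicate to form the final key.
    format!("{tail}::{predicate}")
}
```

With this sketch, both `rfc://5246/tls/cert_verification` and `code://rust/myapp/tls/cert_verification` reduce to `tls/cert_verification::enabled` for the predicate `enabled`.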
**Scan Flow:**
```
scan.rs:210 → ConceptIndex::build(&corpus)
local.rs:273 → index.lookup(&claim.concept_path, &claim.predicate)
concept_index.rs:54 → make_key(subject, predicate)
```
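To make the build-then-lookup flow concrete, here is a rough sketch of an index keyed by tail path. The `Assertion` fields and the `by_key` map are assumptions for illustration; the real `ConceptIndex` may store different data.

```rust
use std::collections::HashMap;

/// Simplified stand-in for the real assertion type (fields are assumptions).
#[derive(Debug, Clone, PartialEq)]
pub struct Assertion {
    pub subject: String,
    pub predicate: String,
    pub object: String,
}

/// Tail-path key: last two path segments plus the predicate.
fn make_key(subject: &str, predicate: &str) -> String {
    let path = subject.split_once("://").map_or(subject, |(_, rest)| rest);
    let segs: Vec<&str> = path.split('/').filter(|s| !s.is_empty()).collect();
    let tail = segs[segs.len().saturating_sub(2)..].join("/");
    format!("{tail}::{predicate}")
}

/// Hypothetical index: groups corpus assertions by their tail-path key.
#[derive(Default)]
pub struct ConceptIndex {
    by_key: HashMap<String, Vec<Assertion>>,
}

impl ConceptIndex {
    /// Indexes every corpus assertion once, at scan start.
    pub fn build(corpus: &[Assertion]) -> Self {
        let mut index = ConceptIndex::default();
        for a in corpus {
            index
                .by_key
                .entry(make_key(&a.subject, &a.predicate))
                .or_default()
                .push(a.clone());
        }
        index
    }

    /// Returns the authoritative assertions sharing the claim's tail-path key.
    pub fn lookup(&self, subject: &str, predicate: &str) -> Option<&Vec<Assertion>> {
        self.by_key.get(&make_key(subject, predicate))
    }
}
```

Each lookup is a single hash-map probe, which is why the zero-configuration path stays cheap.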
### 2. Trust Pack Import
**Current State:**
- ✅ Assertions stored in KV
- ✅ Indexed under `predicates::AUTHORITATIVE`
- ✅ Loaded into corpus at scan time (scan.rs:201)
- ✅ Included in ConceptIndex (scan.rs:210)
**The Gap:**
Trust Pack assertions use paths defined by security teams, which may not match extractor conventions.

---
## The Problem
### Scenario: Enterprise Policy Mismatch
**Security Team's Intent:**
```toml
# They create a "blessed" standard
subject = "code://standards/tls/cert_verification"
predicate = "enabled"
object = true
```
**What Code Extractors Produce:**
```rust
// Rust extractor output
concept_path: "code://rust/myapp/tls/cert_verification"
// Go extractor output
concept_path: "code://go/myapp/tls/cert_verification"
// Python extractor output
concept_path: "code://python/myapp/tls/cert_verification"
```
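**Current Behavior:**
The full-depth paths above share the tail `tls/cert_verification`, so they reduce to the same key and match. The mismatch appears once the two sides differ in depth or naming, for example when both paths stop at the `tls` domain:
```
Security standard: "standards/tls" → key: "standards/tls::enabled"
Rust code: "rust/myapp/tls" → key: "myapp/tls::enabled" ❌ MISMATCH
```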
### Root Cause
Tail-path matching assumes:
1. **Uniform Depth:** All sources use similar path hierarchies
2. **Language Agnostic:** The "tls/cert_verification" pattern is universal
But enterprise policies violate these assumptions:
- Security teams think in **domains** (`standards/tls`)
- Extractors output **language-qualified** paths (`rust/myapp/tls`)
---
## Analysis: Is Tail-Path Matching Sufficient?
### What Works Well
1. **RFC ↔ Code Matching**
   - RFCs use domain concepts: `rfc://5246/tls/cert_verification`
   - Code extractors intentionally align: `code://rust/.../tls/cert_verification`
   - This was designed to work
2. **Zero Configuration**
   - No manual alias mapping required
   - "Just works" for bundled corpus
3. **Cross-Language Matching**
   - `code://rust/.../tls/cert_verification`
   - `code://python/.../tls/cert_verification`
   - Both match the same RFC
### What Breaks
1. **Enterprise Policy Hierarchies**
   - Security teams use logical groupings: `standards/`, `internal/`, `exceptions/`
   - These don't map to extractor output
2. **Vendor-Specific Patterns**
   - Unreal Engine: `unreal://engine/rendering/synchronous_loading`
   - Code: `code://cpp/mygame/rendering/assets/load_sync`
   - Different semantic levels
3. **Domain-Specific Abstractions**
   - Healthcare: `hipaa://patient_data/encryption`
   - Finance: `pci://cardholder_data/storage`
   - Code may not mirror these hierarchies
---
## Solution Options
### Option 1: Normalize Extractor Output (Rejected)
**Idea:** Make extractors output "canonical" paths that match standards.
**Why it fails:**
- Extractors need language context (`rust/myapp`)
- Path structure conveys information (file location, module hierarchy)
- Breaks existing aliases and observations
### Option 2: Flexible Tail-Path Length (Partial)
**Idea:** Try matching with N=1, N=2, N=3 segments.
```rust
// Try multiple keys
"cert_verification::enabled" // N=1
"tls/cert_verification::enabled" // N=2
"myapp/tls/cert_verification::enabled" // N=3
```
**Pros:**
- Handles some depth mismatches
- Backward compatible
**Cons:**
- Ambiguous matches (which key wins?)
- Still doesn't solve semantic differences
- Performance impact (3x index lookups)
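A sketch of how Option 2 might generate its candidate keys is shown below. The function name `candidate_keys` and the shortest-first ordering are assumptions; the ordering is exactly the "which key wins?" decision that this option leaves open.

```rust
/// Returns tail-path keys for the last 1..=max_n segments, shortest first.
/// Hypothetical helper; the ordering here is an arbitrary choice.
pub fn candidate_keys(subject: &str, predicate: &str, max_n: usize) -> Vec<String> {
    // Strip the scheme, then split into non-empty path segments.
    let path = subject.split_once("://").map_or(subject, |(_, rest)| rest);
    let segs: Vec<&str> = path.split('/').filter(|s| !s.is_empty()).collect();

    // One key per tail length, capped at the number of available segments.
    (1..=max_n.min(segs.len()))
        .map(|n| format!("{}::{}", segs[segs.len() - n..].join("/"), predicate))
        .collect()
}
```

Each returned key costs one index probe, so `max_n = 3` means up to three lookups per claim, and the N=1 key (`cert_verification::enabled`) is the one most likely to collide across unrelated domains.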
### Option 3: Explicit Policy Aliases (Recommended)
**Idea:** Add an alias layer in Trust Packs.
**Trust Pack Schema Extension:**
```rust
pub struct TrustPack {
    pub header: PackHeader,
    pub assertions: Vec<Assertion>,
    pub aliases: Vec<ConceptAlias>, // Already exists!
    pub policy_aliases: Vec<PolicyAlias>, // NEW
    pub signature: [u8; 64],
}

pub struct PolicyAlias {
    /// The policy path used in assertions
    pub policy_path: String,
    /// Glob patterns that should match this policy
    pub target_patterns: Vec<String>,
}
```
**Example:**
```rust
PolicyAlias {
    policy_path: "code://standards/tls/cert_verification".to_string(),
    target_patterns: vec![
        "code://rust/*/tls/cert_verification".to_string(),
        "code://go/*/tls/cert_verification".to_string(),
        "code://python/*/tls/cert_verification".to_string(),
    ],
}
```
**Matching Algorithm:**
```rust
impl ConceptIndex {
    pub fn lookup_with_aliases(
        &self,
        subject: &str,
        predicate: &str,
        policy_aliases: &[PolicyAlias],
    ) -> Option<&Vec<Assertion>> {
        // 1. Try direct tail-path match (existing)
        if let Some(result) = self.lookup(subject, predicate) {
            return Some(result);
        }
        // 2. Try policy alias expansion (the first alias that yields a hit wins)
        for alias in policy_aliases {
            if subject_matches_pattern(subject, &alias.target_patterns) {
                if let Some(result) = self.lookup(&alias.policy_path, predicate) {
                    return Some(result);
                }
            }
        }
        None
    }
}
```
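The two-step lookup can be exercised end to end with a self-contained sketch. Everything here is simplified and hypothetical: the index stores assertion subjects instead of full `Assertion` values, `glob_match` treats `*` as exactly one path segment, and the predicate `allowed` is invented for the example. It uses the Unreal Engine paths from "What Breaks", where the tail paths genuinely differ.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone)]
pub struct PolicyAlias {
    pub policy_path: String,
    pub target_patterns: Vec<String>,
}

/// Tail-path key: last two path segments plus the predicate.
fn make_key(subject: &str, predicate: &str) -> String {
    let path = subject.split_once("://").map_or(subject, |(_, rest)| rest);
    let segs: Vec<&str> = path.split('/').filter(|s| !s.is_empty()).collect();
    let tail = segs[segs.len().saturating_sub(2)..].join("/");
    format!("{tail}::{predicate}")
}

/// Assumed glob semantics: `*` matches exactly one whole path segment.
fn glob_match(pattern: &str, subject: &str) -> bool {
    let p: Vec<&str> = pattern.split('/').collect();
    let s: Vec<&str> = subject.split('/').collect();
    p.len() == s.len() && p.iter().zip(&s).all(|(p, s)| *p == "*" || p == s)
}

/// Simplified index: maps tail-path keys to the subjects of matching assertions.
pub struct ConceptIndex {
    by_key: HashMap<String, Vec<String>>,
}

impl ConceptIndex {
    /// Builds the index from `(subject, predicate)` pairs.
    pub fn build(assertions: &[(&str, &str)]) -> Self {
        let mut by_key: HashMap<String, Vec<String>> = HashMap::new();
        for &(subject, predicate) in assertions {
            by_key
                .entry(make_key(subject, predicate))
                .or_default()
                .push(subject.to_string());
        }
        ConceptIndex { by_key }
    }

    pub fn lookup(&self, subject: &str, predicate: &str) -> Option<&Vec<String>> {
        self.by_key.get(&make_key(subject, predicate))
    }

    /// Step 1: direct tail-path match. Step 2: fall back to policy aliases,
    /// returning the first alias whose policy path is present in the index.
    pub fn lookup_with_aliases(
        &self,
        subject: &str,
        predicate: &str,
        aliases: &[PolicyAlias],
    ) -> Option<&Vec<String>> {
        self.lookup(subject, predicate).or_else(|| {
            aliases
                .iter()
                .filter(|a| a.target_patterns.iter().any(|p| glob_match(p, subject)))
                .find_map(|a| self.lookup(&a.policy_path, predicate))
        })
    }
}
```

Because the alias step only runs when the direct lookup misses, claims that already match through tail paths pay no extra cost.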
---
## Recommended Implementation Plan
### Phase 1: Extend Trust Pack Schema
**Files:**
- `applications/aphoria/src/policy.rs`
**Changes:**
```rust
#[derive(Archive, Deserialize, Serialize, Debug, Clone)]
pub struct PolicyAlias {
    pub policy_path: String,
    pub target_patterns: Vec<String>,
}

pub struct TrustPack {
    // ... existing fields
    pub policy_aliases: Vec<PolicyAlias>,
    // ...
}
```
### Phase 2: Add Pattern Matching
**Files:**
- `applications/aphoria/src/episteme/concept_index.rs`
**New Functions:**
```rust
impl ConceptIndex {
    /// Extended lookup that tries policy aliases after tail-path match
    pub fn lookup_with_aliases(
        &self,
        subject: &str,
        predicate: &str,
        aliases: &[PolicyAlias],
    ) -> Option<&Vec<Assertion>> { ... }
}

/// Check if a subject matches a glob pattern
fn subject_matches_pattern(subject: &str, patterns: &[String]) -> bool {
    // Use glob crate or simple wildcard matching
    patterns.iter().any(|p| glob_match(p, subject))
}
```
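If the "simple wildcard matching" route is taken, one possible dependency-free matcher is sketched below. The semantics are an assumption: `*` matches any run of characters within a single path segment and never crosses `/`, which keeps `code://rust/*/tls/...` from matching deeper paths by accident.

```rust
/// Hypothetical matcher: `*` matches zero or more characters but never `/`.
pub fn glob_match(pattern: &str, subject: &str) -> bool {
    let (p, s) = (pattern.as_bytes(), subject.as_bytes());
    let (mut pi, mut si) = (0, 0);
    // (pattern index after the last `*`, subject index where that `*` currently ends)
    let mut backtrack: Option<(usize, usize)> = None;

    while si < s.len() {
        if pi < p.len() && p[pi] == b'*' {
            // Start by letting `*` match the empty string.
            backtrack = Some((pi + 1, si));
            pi += 1;
        } else if pi < p.len() && p[pi] == s[si] {
            pi += 1;
            si += 1;
        } else if let Some((bp, bs)) = backtrack {
            // Grow the last `*` by one byte, unless that byte is a separator.
            if s[bs] == b'/' {
                return false;
            }
            backtrack = Some((bp, bs + 1));
            pi = bp;
            si = bs + 1;
        } else {
            return false;
        }
    }
    // Any pattern remainder must consist solely of `*`.
    while pi < p.len() && p[pi] == b'*' {
        pi += 1;
    }
    pi == p.len()
}

/// Check if a subject matches any of the glob patterns.
pub fn subject_matches_pattern(subject: &str, patterns: &[String]) -> bool {
    patterns.iter().any(|p| glob_match(p, subject))
}
```

If the `glob` crate is used instead, note that its `Pattern::matches` lets `*` cross `/` by default, so separator handling would need to be configured explicitly to get the behavior sketched here.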
### Phase 3: Integrate into Scan Flow
**Files:**
- `applications/aphoria/src/scan.rs`
- `applications/aphoria/src/episteme/local.rs`
**Changes:**
```rust
// scan.rs:210 - Load policy aliases from Trust Packs
let policy_manager = PolicyManager::new(&config.corpus.cache_dir);
let policies = policy_manager.load_policies(&config.policies)?;
let policy_aliases: Vec<PolicyAlias> = policies
    .iter()
    .flat_map(|p| &p.policy_aliases)
    .cloned()
    .collect();

// local.rs:273 - Use extended lookup
let auth_assertions = match index.lookup_with_aliases(
    &claim.concept_path,
    &claim.predicate,
    &policy_aliases,
) {
    Some(assertions) => assertions,
    None => continue,
};
```
### Phase 4: CLI Tooling
**New Command:**
```bash
# Generate policy aliases from existing assertions
aphoria policy generate-aliases \
--policy-path "code://standards/tls/cert_verification" \
--target "code://rust/*/tls/cert_verification" \
--target "code://go/*/tls/cert_verification"
```
**Output:** Adds the generated `PolicyAlias` to the Trust Pack before signing.

---
## Extension Points
### 1. Dynamic Alias Discovery
**Future Enhancement:** Auto-generate aliases during scan if code paths differ from policy paths.
```rust
// If tail-path matches but full paths differ, suggest alias
if tail_match && !full_match {
    suggestions.push(PolicyAlias {
        policy_path: assertion.subject.clone(),
        target_patterns: vec![claim.concept_path.clone()],
    });
}
```
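A suggested alias that lists one concrete path only covers one application. As a rough sketch, a suggestion could be generalized by replacing the application segment with `*`. This assumes the `scheme://<language>/<app>/...` layout seen in the extractor examples; the helper name `generalize` is hypothetical.

```rust
/// Turns `code://rust/myapp/tls/cert_verification` into
/// `code://rust/*/tls/cert_verification`.
/// Assumes the second path segment is the application name.
pub fn generalize(concept_path: &str) -> Option<String> {
    let (scheme, path) = concept_path.split_once("://")?;
    let mut segs: Vec<&str> = path.split('/').collect();
    // Need at least <language>/<app>/<concept> for the rewrite to be meaningful.
    if segs.len() < 3 {
        return None;
    }
    segs[1] = "*";
    Some(format!("{}://{}", scheme, segs.join("/")))
}
```

Generalized suggestions would still need human review before they are added to a Trust Pack, since a wildcard widens what the policy applies to.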
### 2. Semantic Equivalence
**Future Enhancement:** Use embedding similarity for fuzzy matching.
```rust
pub struct SemanticAlias {
    pub policy_path: String,
    pub similarity_threshold: f32,
}
// Match if embedding similarity >= similarity_threshold
```
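One way the threshold check might look, assuming cosine similarity over precomputed embeddings and a match when the similarity is at or above the threshold. How embeddings are produced is out of scope here; the `matches` method and `cosine_similarity` helper are hypothetical.

```rust
/// Cosine similarity of two vectors; `None` if lengths differ or a vector is zero.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}

pub struct SemanticAlias {
    pub policy_path: String,
    pub similarity_threshold: f32,
}

impl SemanticAlias {
    /// True when the claim embedding is at least as similar to the policy
    /// embedding as the configured threshold requires.
    pub fn matches(&self, policy_embedding: &[f32], claim_embedding: &[f32]) -> bool {
        cosine_similarity(policy_embedding, claim_embedding)
            .map_or(false, |sim| sim >= self.similarity_threshold)
    }
}
```

Unlike glob aliases, this kind of match is probabilistic, so it is probably better suited to suggesting aliases than to enforcing policy directly.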
### 3. Hierarchical Policy Inheritance
**Future Enhancement:** Support policy hierarchies.
```rust
// Match "code://standards/tls/*" against any TLS assertion
pub struct HierarchyAlias {
    pub policy_prefix: String, // "code://standards/tls"
    pub target_prefix: String, // "code://rust/*/tls"
}
```
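A prefix alias could work by rewriting a claim path into the policy namespace before the normal lookup. The sketch below assumes `*` in `target_prefix` matches exactly one segment; the `rewrite` method is hypothetical.

```rust
pub struct HierarchyAlias {
    pub policy_prefix: String, // e.g. "code://standards/tls"
    pub target_prefix: String, // e.g. "code://rust/*/tls"
}

impl HierarchyAlias {
    /// Maps a claim path under `target_prefix` to the equivalent path under
    /// `policy_prefix`, or returns `None` if the prefix does not match.
    pub fn rewrite(&self, subject: &str) -> Option<String> {
        let prefix: Vec<&str> = self.target_prefix.split('/').collect();
        let segs: Vec<&str> = subject.split('/').collect();
        if segs.len() < prefix.len() {
            return None;
        }
        // Compare segment by segment; `*` accepts any single segment.
        let prefix_matches = prefix.iter().zip(&segs).all(|(p, s)| *p == "*" || p == s);
        if !prefix_matches {
            return None;
        }
        // Keep everything after the matched prefix and graft it onto the policy prefix.
        let mut out = self.policy_prefix.clone();
        for seg in &segs[prefix.len()..] {
            out.push('/');
            out.push_str(seg);
        }
        Some(out)
    }
}
```

One `HierarchyAlias` would then cover every concept under the TLS domain, instead of one `PolicyAlias` per concept.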
---
## Migration Path
### Backward Compatibility
**Zero Breaking Changes:**
- Tail-path matching still works for existing use cases
- `PolicyAlias` is optional (empty vec = current behavior)
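- Existing Trust Packs without a `policy_aliases` field should still deserialize, provided the new field is given a default (empty vec); this needs to be verified against the pack's serialization format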
### Adoption Strategy
- **Week 1:** Implement core functionality (Phases 1-2)
- **Week 2:** Integrate into scan flow (Phase 3)
- **Week 3:** Add CLI tooling (Phase 4)
- **Week 4:** Document + UAT with enterprise scenario

---
## Metrics for Success
### Functional
- [ ] Security team can create `code://standards/*` assertions
- [ ] Dev team code (`code://rust/myapp/*`) matches standards
- [ ] Conflicts are detected and reported
- [ ] Trust Pack signature verification passes
### Performance
- [ ] Scan time increase < 5% (the alias fallback costs O(A*P) glob matches per claim that misses the direct lookup, where A = aliases and P = patterns per alias)
- [ ] Memory overhead < 10KB per Trust Pack (policy aliases are small)
### Usability
- [ ] Security team can export Trust Pack with aliases in < 5 commands
- [ ] Dev team imports Trust Pack with `policies = ["security.pack"]` (no code changes)
---
## Open Questions
1. **Wildcard Syntax:** Use glob (`*`) or regex (`.*`)?
   - **Recommendation:** Start with glob (simpler, more intuitive)
2. **Alias Priority:** If multiple aliases match, which wins?
   - **Recommendation:** First match wins (deterministic order in Trust Pack)
3. **Alias Storage:** Persist discovered aliases back to local store?
   - **Recommendation:** No (keep Trust Pack as source of truth)
4. **Alias Validation:** Check patterns at Trust Pack creation time?
   - **Recommendation:** Yes (fail fast if invalid glob pattern)
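Fail-fast validation at pack creation time might look like the sketch below. The rule set is an assumption, not a settled spec: a pattern needs a literal `scheme://` prefix, no empty segments, and no `**` (cross-segment wildcards are not part of the proposal above).

```rust
/// Validates a target pattern before it is written into a Trust Pack.
/// Hypothetical rule set; adjust once the wildcard syntax is finalized.
pub fn validate_pattern(pattern: &str) -> Result<(), String> {
    let (scheme, path) = pattern
        .split_once("://")
        .ok_or_else(|| format!("`{pattern}`: missing `scheme://` prefix"))?;

    // Wildcards in the scheme would let one alias span unrelated corpora.
    if scheme.is_empty() || scheme.contains('*') {
        return Err(format!("`{pattern}`: scheme must be a literal, non-empty name"));
    }
    for segment in path.split('/') {
        if segment.is_empty() {
            return Err(format!("`{pattern}`: empty path segment"));
        }
        if segment.contains("**") {
            return Err(format!("`{pattern}`: `**` is not supported"));
        }
    }
    Ok(())
}
```

Running this over every `target_patterns` entry before signing means a malformed pattern is rejected by the security team's tooling instead of silently never matching during a scan.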
---
## Conclusion
- **Current State:** Tail-path matching works for the bundled corpus but breaks for enterprise policies.
- **Root Cause:** Semantic mismatch between policy hierarchies and extractor output.
- **Solution:** Add an explicit `PolicyAlias` layer in Trust Packs.
- **Impact:** Unblocks enterprise adoption without breaking existing functionality.
- **Effort:** ~2-3 days for the core change (schema extension + pattern matching + integration); CLI tooling, documentation, and UAT are scheduled separately under Adoption Strategy.
- **Risk:** Low (additive change, backward compatible)

---
## Next Steps
1. Review this analysis with team
2. Validate glob pattern syntax choice
3. Implement Phase 1 (schema extension)
4. Write UAT scenario mimicking enterprise use case
5. Iterate based on feedback
---
**Questions or feedback?** Discuss in `#aphoria-architecture`.