# Concept Matching Architecture Analysis

**Date:** 2026-02-04
**Status:** Critical Gap Identified
**Priority:** High (Enterprise Blocker)

---

## Executive Summary

The current tail-path matching system (`ConceptIndex`) enables cross-scheme concept matching but has fundamental limitations for enterprise policy enforcement. It works well for matching against the bundled RFC/OWASP corpus, but it fails when security teams create custom policies whose paths don't align with extractor output paths.

**Recommendation:** Implement a three-tier matching system with explicit policy aliasing.

---

## Current Architecture

### 1. Tail-Path Matching (ConceptIndex)

**Algorithm:**
```rust
// Both produce key: "tls/cert_verification::enabled"
"rfc://5246/tls/cert_verification"          // RFC corpus
"code://rust/myapp/tls/cert_verification"   // Code extractor
```

**How it works:**
1. Strip scheme (`rfc://`, `code://`)
2. Take last 2 path segments
3. Append predicate
4. Key = `{seg[-2]}/{seg[-1]}::{predicate}`

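The four steps above can be sketched as a small function. This is a minimal illustration, not the actual `make_key` in `concept_index.rs`: the `Option` return type and the treatment of paths with fewer than two segments are assumptions.

```rust
/// Hypothetical sketch of tail-path key derivation (the real `make_key` may differ).
fn make_key(subject: &str, predicate: &str) -> Option<String> {
    // Step 1: strip the scheme ("rfc://", "code://", ...) if one is present.
    let path = subject.split_once("://").map_or(subject, |(_, rest)| rest);

    // Step 2: keep the last two non-empty path segments.
    let segments: Vec<&str> = path.split('/').filter(|s| !s.is_empty()).collect();
    if segments.len() < 2 {
        // Assumption: paths shorter than two segments produce no key.
        return None;
    }
    let tail = &segments[segments.len() - 2..];

    // Steps 3-4: append the predicate to form `{seg[-2]}/{seg[-1]}::{predicate}`.
    Some(format!("{}/{}::{}", tail[0], tail[1], predicate))
}
```
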
**Scan Flow:**
```
scan.rs:210          → ConceptIndex::build(&corpus)
    ↓
local.rs:273         → index.lookup(&claim.concept_path, &claim.predicate)
    ↓
concept_index.rs:54  → make_key(subject, predicate)
```

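To make the build/lookup flow concrete, here is a minimal sketch of an index keyed by tail-path key. The `Assertion` fields, the `HashMap` storage, and the function bodies are assumptions for illustration; only the names `ConceptIndex`, `build`, `lookup`, and `make_key` come from the flow above.

```rust
use std::collections::HashMap;

/// Simplified stand-in for the real assertion type (fields are assumptions).
#[derive(Debug, Clone, PartialEq)]
pub struct Assertion {
    pub subject: String,
    pub predicate: String,
    pub object: String,
}

/// Hypothetical index: tail-path key -> authoritative assertions.
pub struct ConceptIndex {
    by_key: HashMap<String, Vec<Assertion>>,
}

/// `{seg[-2]}/{seg[-1]}::{predicate}`, or `None` for paths with fewer than two segments.
fn make_key(subject: &str, predicate: &str) -> Option<String> {
    let path = subject.split_once("://").map_or(subject, |(_, rest)| rest);
    let segs: Vec<&str> = path.split('/').filter(|s| !s.is_empty()).collect();
    match segs.as_slice() {
        [.., a, b] => Some(format!("{a}/{b}::{predicate}")),
        _ => None,
    }
}

impl ConceptIndex {
    /// Index every corpus assertion under its tail-path key.
    pub fn build(corpus: &[Assertion]) -> Self {
        let mut by_key: HashMap<String, Vec<Assertion>> = HashMap::new();
        for assertion in corpus {
            if let Some(key) = make_key(&assertion.subject, &assertion.predicate) {
                by_key.entry(key).or_default().push(assertion.clone());
            }
        }
        Self { by_key }
    }

    /// Find assertions whose tail-path key equals the claim's key.
    pub fn lookup(&self, subject: &str, predicate: &str) -> Option<&Vec<Assertion>> {
        let key = make_key(subject, predicate)?;
        self.by_key.get(&key)
    }
}
```

In this sketch a claim at `code://rust/myapp/tls/cert_verification` finds an assertion stored under `rfc://5246/tls/cert_verification`, because both reduce to the same key.
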
### 2. Trust Pack Import

**Current State:**
- ✅ Assertions stored in KV
- ✅ Indexed under `predicates::AUTHORITATIVE`
- ✅ Loaded into corpus at scan time (scan.rs:201)
- ✅ Included in ConceptIndex (scan.rs:210)

**The Gap:**
Trust Pack assertions use paths defined by security teams, which may not match extractor conventions.

---

## The Problem

### Scenario: Enterprise Policy Mismatch

**Security Team's Intent:**
```toml
# They create a "blessed" standard
subject = "code://standards/tls/cert_verification"
predicate = "enabled"
object = true
```

**What Code Extractors Produce:**
```rust
// Rust extractor output
concept_path: "code://rust/myapp/tls/cert_verification"

// Go extractor output
concept_path: "code://go/myapp/tls/cert_verification"

// Python extractor output
concept_path: "code://python/myapp/tls/cert_verification"
```

**Current Behavior:**
```
Security standard: "standards/tls" → key: "tls/cert_verification::enabled"
Rust code:         "rust/myapp/tls" → key: "myapp/tls::enabled"  ❌ MISMATCH
```

### Root Cause

Tail-path matching assumes:
1. **Uniform Depth:** All sources use similar path hierarchies
2. **Language Agnostic:** The "tls/cert_verification" pattern is universal

But enterprise policies violate these assumptions:
- Security teams think in **domains** (`standards/tls`)
- Extractors output **language-qualified** paths (`rust/myapp/tls`)

---

## Analysis: Is Tail-Path Matching Sufficient?

### What Works Well

1. **RFC ↔ Code Matching**
   - RFCs use domain concepts: `rfc://5246/tls/cert_verification`
   - Code extractors intentionally align: `code://rust/.../tls/cert_verification`
   - This was designed to work

2. **Zero Configuration**
   - No manual alias mapping required
   - "Just works" for bundled corpus

3. **Cross-Language Matching**
   - `code://rust/.../tls/cert_verification`
   - `code://python/.../tls/cert_verification`
   - Both match the same RFC

### What Breaks

1. **Enterprise Policy Hierarchies**
   - Security teams use logical groupings: `standards/`, `internal/`, `exceptions/`
   - These don't map to extractor output

2. **Vendor-Specific Patterns**
   - Unreal Engine: `unreal://engine/rendering/synchronous_loading`
   - Code: `code://cpp/mygame/rendering/assets/load_sync`
   - Different semantic levels

3. **Domain-Specific Abstractions**
   - Healthcare: `hipaa://patient_data/encryption`
   - Finance: `pci://cardholder_data/storage`
   - Code may not mirror these hierarchies

---

## Solution Options

### Option 1: Normalize Extractor Output (Rejected)

**Idea:** Make extractors output "canonical" paths that match standards.

**Why it fails:**
- Extractors need language context (`rust/myapp`)
- Path structure conveys information (file location, module hierarchy)
- Breaks existing aliases and observations

### Option 2: Flexible Tail-Path Length (Partial)

**Idea:** Try matching with N=1, N=2, N=3 segments.

```rust
// Try multiple keys
"cert_verification::enabled"              // N=1
"tls/cert_verification::enabled"          // N=2
"myapp/tls/cert_verification::enabled"    // N=3
```

**Pros:**
- Handles some depth mismatches
- Backward compatible

**Cons:**
- Ambiguous matches (which key wins?)
- Still doesn't solve semantic differences
- Performance impact (3x index lookups)

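A short sketch of how the N=1..3 candidate keys might be generated, and why N=1 is ambiguous. The function name `candidate_keys` and the `ssh/cert_verification` path are hypothetical, chosen only to show a collision.

```rust
/// Hypothetical helper: build tail-path keys for N = 1..=max_n segments,
/// shortest first. Paths with fewer segments simply yield fewer keys.
fn candidate_keys(subject: &str, predicate: &str, max_n: usize) -> Vec<String> {
    // Strip the scheme, then split into non-empty segments.
    let path = subject.split_once("://").map_or(subject, |(_, rest)| rest);
    let segments: Vec<&str> = path.split('/').filter(|s| !s.is_empty()).collect();

    (1..=max_n.min(segments.len()))
        .map(|n| {
            let tail = segments[segments.len() - n..].join("/");
            format!("{tail}::{predicate}")
        })
        .collect()
}
```

With this sketch, `code://rust/myapp/tls/cert_verification` and a hypothetical `code://rust/myapp/ssh/cert_verification` share the N=1 key `cert_verification::enabled`, which is the ambiguity listed under Cons; they only diverge at N=2.
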
### Option 3: Explicit Policy Aliases (Recommended)

**Idea:** Add an alias layer in Trust Packs.

**Trust Pack Schema Extension:**
```rust
pub struct TrustPack {
    pub header: PackHeader,
    pub assertions: Vec<Assertion>,
    pub aliases: Vec<ConceptAlias>,       // Already exists!
    pub policy_aliases: Vec<PolicyAlias>, // NEW
    pub signature: [u8; 64],
}

pub struct PolicyAlias {
    /// The policy path used in assertions
    pub policy_path: String,
    /// Glob patterns that should match this policy
    pub target_patterns: Vec<String>,
}
```

**Example:**
```rust
PolicyAlias {
    // Fields are `String`, so string literals need an explicit conversion.
    policy_path: "code://standards/tls/cert_verification".to_string(),
    target_patterns: vec![
        "code://rust/*/tls/cert_verification".to_string(),
        "code://go/*/tls/cert_verification".to_string(),
        "code://python/*/tls/cert_verification".to_string(),
    ],
}
```

**Matching Algorithm:**
```rust
impl ConceptIndex {
    pub fn lookup_with_aliases(
        &self,
        subject: &str,
        predicate: &str,
        policy_aliases: &[PolicyAlias],
    ) -> Option<&Vec<Assertion>> {
        // 1. Try direct tail-path match (existing)
        if let Some(result) = self.lookup(subject, predicate) {
            return Some(result);
        }

        // 2. Try policy alias expansion: if the subject matches one of an
        //    alias's target patterns, look up the alias's policy path instead
        for alias in policy_aliases {
            if subject_matches_pattern(subject, &alias.target_patterns) {
                if let Some(result) = self.lookup(&alias.policy_path, predicate) {
                    return Some(result);
                }
            }
        }

        None
    }
}
```

---

## Recommended Implementation Plan

### Phase 1: Extend Trust Pack Schema

**Files:**
- `applications/aphoria/src/policy.rs`

**Changes:**
```rust
#[derive(Archive, Deserialize, Serialize, Debug, Clone)]
pub struct PolicyAlias {
    pub policy_path: String,
    pub target_patterns: Vec<String>,
}

pub struct TrustPack {
    // ... existing fields
    pub policy_aliases: Vec<PolicyAlias>,
    // ...
}
```

### Phase 2: Add Pattern Matching

**Files:**
- `applications/aphoria/src/episteme/concept_index.rs`

**New Functions:**
```rust
impl ConceptIndex {
    /// Extended lookup that tries policy aliases after tail-path match
    pub fn lookup_with_aliases(
        &self,
        subject: &str,
        predicate: &str,
        aliases: &[PolicyAlias],
    ) -> Option<&Vec<Assertion>> { ... }
}

/// Check if a subject matches a glob pattern
fn subject_matches_pattern(subject: &str, patterns: &[String]) -> bool {
    // Use glob crate or simple wildcard matching
    patterns.iter().any(|p| glob_match(p, subject))
}
```

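The "simple wildcard matching" route could look like the sketch below. It is one possible `glob_match`, written under the assumption that `*` stands for exactly one whole path segment; the `glob` crate's semantics are different and are not modeled here.

```rust
/// Hypothetical segment-wise matcher: `*` matches exactly one path segment,
/// every other segment must be equal. Assumption, not the `glob` crate's rules.
fn glob_match(pattern: &str, subject: &str) -> bool {
    let pat: Vec<&str> = pattern.split('/').collect();
    let sub: Vec<&str> = subject.split('/').collect();

    // Same depth, and each segment is either a wildcard or an exact match.
    pat.len() == sub.len() && pat.iter().zip(&sub).all(|(p, s)| *p == "*" || p == s)
}

/// True if the subject matches any of the alias's target patterns.
fn subject_matches_pattern(subject: &str, patterns: &[String]) -> bool {
    patterns.iter().any(|p| glob_match(p, subject))
}
```

Restricting `*` to a single segment keeps matching linear in the number of segments and avoids a pattern such as `code://rust/*/tls/cert_verification` accidentally matching deeper, unrelated paths.
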
### Phase 3: Integrate into Scan Flow

**Files:**
- `applications/aphoria/src/scan.rs`
- `applications/aphoria/src/episteme/local.rs`

**Changes:**
```rust
// scan.rs:210 - Load policy aliases from Trust Packs
let policy_manager = PolicyManager::new(&config.corpus.cache_dir);
let policies = policy_manager.load_policies(&config.policies)?;
let policy_aliases: Vec<PolicyAlias> = policies
    .iter()
    .flat_map(|p| &p.policy_aliases)
    .cloned()
    .collect();

// local.rs:273 - Use extended lookup
let auth_assertions = match index.lookup_with_aliases(
    &claim.concept_path,
    &claim.predicate,
    &policy_aliases,
) {
    Some(assertions) => assertions,
    None => continue,
};
```

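The following self-contained sketch exercises the whole path: aliases are collected from loaded packs, and a claim whose tail-path key does not match falls back to alias expansion. All types are cut-down stand-ins, `collect_aliases` is a hypothetical helper, and `*` is assumed to match exactly one path segment.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone)]
pub struct Assertion { pub subject: String, pub predicate: String }

#[derive(Debug, Clone)]
pub struct PolicyAlias { pub policy_path: String, pub target_patterns: Vec<String> }

/// Cut-down Trust Pack holding only the fields this sketch needs.
pub struct TrustPack { pub assertions: Vec<Assertion>, pub policy_aliases: Vec<PolicyAlias> }

pub struct ConceptIndex { by_key: HashMap<String, Vec<Assertion>> }

/// Tail-path key: last two segments plus predicate.
fn make_key(subject: &str, predicate: &str) -> Option<String> {
    let path = subject.split_once("://").map_or(subject, |(_, rest)| rest);
    let segs: Vec<&str> = path.split('/').filter(|s| !s.is_empty()).collect();
    match segs.as_slice() {
        [.., a, b] => Some(format!("{a}/{b}::{predicate}")),
        _ => None,
    }
}

/// Assumption: `*` matches exactly one path segment.
fn glob_match(pattern: &str, subject: &str) -> bool {
    let (pat, sub): (Vec<&str>, Vec<&str>) =
        (pattern.split('/').collect(), subject.split('/').collect());
    pat.len() == sub.len() && pat.iter().zip(&sub).all(|(p, s)| *p == "*" || p == s)
}

/// Gather aliases from all loaded packs, preserving pack order.
pub fn collect_aliases(packs: &[TrustPack]) -> Vec<PolicyAlias> {
    packs.iter().flat_map(|p| p.policy_aliases.iter().cloned()).collect()
}

impl ConceptIndex {
    pub fn build(packs: &[TrustPack]) -> Self {
        let mut by_key: HashMap<String, Vec<Assertion>> = HashMap::new();
        for assertion in packs.iter().flat_map(|p| &p.assertions) {
            if let Some(key) = make_key(&assertion.subject, &assertion.predicate) {
                by_key.entry(key).or_default().push(assertion.clone());
            }
        }
        Self { by_key }
    }

    pub fn lookup(&self, subject: &str, predicate: &str) -> Option<&Vec<Assertion>> {
        self.by_key.get(&make_key(subject, predicate)?)
    }

    /// Direct tail-path match first; otherwise the first alias whose pattern
    /// matches the subject and whose policy path resolves in the index.
    pub fn lookup_with_aliases(
        &self,
        subject: &str,
        predicate: &str,
        aliases: &[PolicyAlias],
    ) -> Option<&Vec<Assertion>> {
        if let Some(hit) = self.lookup(subject, predicate) {
            return Some(hit);
        }
        aliases
            .iter()
            .filter(|a| a.target_patterns.iter().any(|p| glob_match(p, subject)))
            .find_map(|a| self.lookup(&a.policy_path, predicate))
    }
}
```

Using the vendor example from earlier, a pack that asserts on `unreal://engine/rendering/synchronous_loading` and aliases it to `code://cpp/*/rendering/assets/load_sync` lets a claim at `code://cpp/mygame/rendering/assets/load_sync` resolve, even though the plain `lookup` finds nothing for it.
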
### Phase 4: CLI Tooling

**New Command:**
```bash
# Generate policy aliases from existing assertions
aphoria policy generate-aliases \
  --policy-path "code://standards/tls/cert_verification" \
  --target "code://rust/*/tls/cert_verification" \
  --target "code://go/*/tls/cert_verification"
```

**Output:** Adds `PolicyAlias` to Trust Pack before signing.

---

## Extension Points

### 1. Dynamic Alias Discovery

**Future Enhancement:** Auto-generate aliases during scan if code paths differ from policy paths.

```rust
// If tail-path matches but full paths differ, suggest alias
if tail_match && !full_match {
    suggestions.push(PolicyAlias {
        policy_path: assertion.subject.clone(),
        target_patterns: vec![claim.concept_path.clone()],
    });
}
```

### 2. Semantic Equivalence

**Future Enhancement:** Use embedding similarity for fuzzy matching.

```rust
pub struct SemanticAlias {
    pub policy_path: String,
    pub similarity_threshold: f32,
}

// Match if embedding similarity >= similarity_threshold
```

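As a rough sketch of the comparison step, the code below assumes the threshold is a minimum cosine similarity and that embeddings for the policy path and the claim path are already available as `f32` slices. Where the embeddings come from is out of scope, and `cosine_similarity` and `semantic_match` are hypothetical names.

```rust
pub struct SemanticAlias {
    pub policy_path: String,
    pub similarity_threshold: f32,
}

/// Cosine similarity of two vectors. Returns `None` when the lengths differ,
/// the vectors are empty, or either vector has zero norm.
fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|y| y * y).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}

/// Assumption: a claim matches when its similarity to the policy embedding
/// is at least the alias's threshold. Incomparable vectors never match.
fn semantic_match(alias: &SemanticAlias, policy_emb: &[f32], claim_emb: &[f32]) -> bool {
    cosine_similarity(policy_emb, claim_emb)
        .map_or(false, |s| s >= alias.similarity_threshold)
}
```

Treating incomparable vectors as a non-match keeps the fuzzy tier conservative: a missing or malformed embedding can never produce a policy hit.
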
### 3. Hierarchical Policy Inheritance

**Future Enhancement:** Support policy hierarchies.

```rust
// Match "code://standards/tls/*" against any TLS assertion
pub struct HierarchyAlias {
    pub policy_prefix: String,  // "code://standards/tls"
    pub target_prefix: String,  // "code://rust/*/tls"
}
```

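One way to read the prefix pair is as a rewrite rule: a subject under `target_prefix` is mapped to the same relative path under `policy_prefix`, and the rewritten path is then looked up as usual. The sketch below illustrates that reading only; `rewrite_to_policy` is a hypothetical name, and `*` is assumed to match exactly one segment.

```rust
pub struct HierarchyAlias {
    pub policy_prefix: String, // "code://standards/tls"
    pub target_prefix: String, // "code://rust/*/tls"
}

/// If `subject` sits under `target_prefix` (segment-wise, `*` = one segment),
/// return the equivalent path under `policy_prefix`; otherwise `None`.
fn rewrite_to_policy(alias: &HierarchyAlias, subject: &str) -> Option<String> {
    let prefix: Vec<&str> = alias.target_prefix.split('/').collect();
    let segs: Vec<&str> = subject.split('/').collect();
    if segs.len() < prefix.len() {
        return None;
    }
    let under_prefix = prefix.iter().zip(&segs).all(|(p, s)| *p == "*" || p == s);
    if !under_prefix {
        return None;
    }
    // Re-attach whatever follows the matched prefix to the policy prefix.
    let mut out = alias.policy_prefix.clone();
    for seg in &segs[prefix.len()..] {
        out.push('/');
        out.push_str(seg);
    }
    Some(out)
}
```
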
---
## Migration Path

### Backward Compatibility

✅ **Zero Breaking Changes:**
- Tail-path matching still works for existing use cases
- `PolicyAlias` is optional (empty vec = current behavior)
- Existing Trust Packs without a `policy_aliases` field should still deserialize, provided the field is given a default (to be verified against the pack's serialization format)

### Adoption Strategy

- **Week 1:** Implement core functionality (Phases 1-2)
- **Week 2:** Integrate into scan flow (Phase 3)
- **Week 3:** Add CLI tooling (Phase 4)
- **Week 4:** Document + UAT with enterprise scenario

---

## Metrics for Success

### Functional
- [ ] Security team can create `code://standards/*` assertions
- [ ] Dev team code (`code://rust/myapp/*`) matches standards
- [ ] Conflicts are detected and reported
- [ ] Trust Pack signature verification passes

### Performance
- [ ] Scan time increase < 5% (alias fallback runs only when the direct tail-path lookup misses; each fallback is O(A*P), where A = number of aliases and P = target patterns per alias)
- [ ] Memory overhead < 10KB per Trust Pack (policy aliases are small)

### Usability
- [ ] Security team can export Trust Pack with aliases in < 5 commands
- [ ] Dev team imports Trust Pack with `policies = ["security.pack"]` (no code changes)

---

## Open Questions

1. **Wildcard Syntax:** Use glob (`*`) or regex (`.*`)?
   - **Recommendation:** Start with glob (simpler, more intuitive)

2. **Alias Priority:** If multiple aliases match, which wins?
   - **Recommendation:** First match wins (deterministic order in Trust Pack)

3. **Alias Storage:** Persist discovered aliases back to local store?
   - **Recommendation:** No (keep Trust Pack as source of truth)

4. **Alias Validation:** Check patterns at Trust Pack creation time?
   - **Recommendation:** Yes (fail fast if invalid glob pattern)

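For question 4, a fail-fast check at pack creation time might look like the sketch below. The specific rules (a scheme is required, no empty segments, `*` only as a whole segment) are assumptions chosen for illustration, not a decided syntax, and `validate_pattern` is a hypothetical name.

```rust
/// Hypothetical validation for a target pattern, run before a pack is signed.
/// Returns a human-readable error for the first rule the pattern violates.
fn validate_pattern(pattern: &str) -> Result<(), String> {
    // Assumed rule 1: patterns are fully qualified, e.g. "code://...".
    let (scheme, path) = pattern
        .split_once("://")
        .ok_or_else(|| format!("missing scheme in pattern: {pattern}"))?;
    if scheme.is_empty() || scheme.contains('*') {
        return Err(format!("invalid scheme in pattern: {pattern}"));
    }

    for segment in path.split('/') {
        // Assumed rule 2: no empty segments (also rejects an empty path).
        if segment.is_empty() {
            return Err(format!("empty path segment in pattern: {pattern}"));
        }
        // Assumed rule 3: `*` must stand alone as a whole segment.
        if segment.contains('*') && segment != "*" {
            return Err(format!("partial wildcard `{segment}` in pattern: {pattern}"));
        }
    }
    Ok(())
}
```
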
---
## Conclusion

**Current State:** Tail-path matching works for bundled corpus but breaks for enterprise policies.

**Root Cause:** Semantic mismatch between policy hierarchies and extractor output.

**Solution:** Add explicit `PolicyAlias` layer in Trust Packs.

**Impact:** Unblocks enterprise adoption without breaking existing functionality.

**Effort:** ~2-3 days for Phases 1-3 (schema extension + pattern matching + integration)

**Risk:** Low (additive change, backward compatible)

---

## Next Steps

1. Review this analysis with team
2. Validate glob pattern syntax choice
3. Implement Phase 1 (schema extension)
4. Write UAT scenario mimicking enterprise use case
5. Iterate based on feedback

---

**Questions or feedback?** Discuss in `#aphoria-architecture`.