jordan 41c676a78e feat: Aphoria enterprise features + ontology SDK + file length compliance

Enterprise Features:
- Hosted mode with remote sync for team pattern aggregation
- Community sharing with privacy-preserving anonymization
- LLM-based semantic claim extraction with Gemini integration
- Pattern learning with promotion to declarative extractors
- High-entropy secrets extractor with configurable thresholds
- Auth bypass and insecure cookies extractors

Module Refactoring:
- Split oversized files to comply with 500-line limit
- Config split: types/core.rs, types/extractors.rs, types/hosted.rs, etc.
- Handlers split: scan.rs, policy.rs, report.rs modules
- Extractors split: declarative/, high_entropy_secrets/, insecure_cookies/
- Learning split: store modules with metrics and persistence

SDK & Ontology:
- stemedb-ontology SDK with fluent builders and StemeDB client
- Pharma domain extractors for FDA Orange Book data
- Consumer health UAT test infrastructure

Code Quality:
- Fixed clippy warnings (needless_borrows_for_generic_args)
- Added KVStore trait imports where needed
- Fixed utoipa path re-exports for OpenAPI docs

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2026-02-05 12:55:29 -07:00


Concept Matching Architecture Analysis

Date: 2026-02-04
Status: Critical Gap Identified
Priority: High (Enterprise Blocker)

Executive Summary

The current tail-path matching system (ConceptIndex) enables cross-scheme concept matching but has fundamental limitations for enterprise policy enforcement. While it works well for bundled RFC/OWASP corpus matching, it fails when security teams create custom policies that don't align with extractor output paths.

Recommendation: Implement a three-tier matching system with explicit policy aliasing.

Current Architecture

1. Tail-Path Matching (ConceptIndex)

Algorithm:

// Both produce key: "tls/cert_verification::enabled"
"rfc://5246/tls/cert_verification"           // RFC corpus
"code://rust/myapp/tls/cert_verification"    // Code extractor

How it works:

Strip scheme (rfc://, code://)
Take last 2 path segments
Append predicate
Key = {seg[-2]}/{seg[-1]}::{predicate}
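The four steps above can be sketched as a small function. This is an illustration under assumptions, not the code at concept_index.rs:54: the name make_key comes from the scan flow, but the Option return type and the handling of paths with fewer than two segments are guesses.

```rust
/// Build a tail-path key of the form `{seg[-2]}/{seg[-1]}::{predicate}`.
/// Sketch only: the real `make_key` may differ in signature and edge cases.
fn make_key(subject: &str, predicate: &str) -> Option<String> {
    // 1. Strip the scheme ("rfc://", "code://", ...) if one is present.
    let path = subject.split_once("://").map_or(subject, |(_, rest)| rest);

    // 2. Take the last two non-empty path segments.
    let segments: Vec<&str> = path.split('/').filter(|s| !s.is_empty()).collect();
    if segments.len() < 2 {
        // Assumption: paths shorter than two segments are not indexed.
        return None;
    }
    let tail = &segments[segments.len() - 2..];

    // 3. Append the predicate.
    Some(format!("{}/{}::{}", tail[0], tail[1], predicate))
}
```

With this sketch, "rfc://5246/tls/cert_verification" and "code://rust/myapp/tls/cert_verification" both yield the key "tls/cert_verification::enabled" for the predicate "enabled".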

Scan Flow:

scan.rs:210 → ConceptIndex::build(&corpus)
  ↓
local.rs:273 → index.lookup(&claim.concept_path, &claim.predicate)
  ↓
concept_index.rs:54 → make_key(subject, predicate)
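To make the build/lookup flow concrete, here is a minimal in-memory sketch. The Assertion fields and the HashMap layout are assumptions for illustration; the real ConceptIndex and Assertion types are not shown in this document.

```rust
use std::collections::HashMap;

/// Simplified stand-in for an authoritative assertion (hypothetical fields).
#[derive(Debug, Clone, PartialEq)]
struct Assertion {
    subject: String,
    predicate: String,
    object: String,
}

/// Minimal index: tail-path key -> assertions sharing that key.
struct ConceptIndex {
    by_key: HashMap<String, Vec<Assertion>>,
}

impl ConceptIndex {
    /// Tail-path key: last two path segments plus the predicate.
    fn make_key(subject: &str, predicate: &str) -> String {
        let path = subject.split_once("://").map_or(subject, |(_, rest)| rest);
        let segs: Vec<&str> = path.split('/').filter(|s| !s.is_empty()).collect();
        let start = segs.len().saturating_sub(2);
        format!("{}::{}", segs[start..].join("/"), predicate)
    }

    /// Index every corpus assertion under its tail-path key.
    fn build(corpus: &[Assertion]) -> Self {
        let mut by_key: HashMap<String, Vec<Assertion>> = HashMap::new();
        for a in corpus {
            by_key
                .entry(Self::make_key(&a.subject, &a.predicate))
                .or_default()
                .push(a.clone());
        }
        ConceptIndex { by_key }
    }

    /// Look up a claim by computing the same key from its path and predicate.
    fn lookup(&self, subject: &str, predicate: &str) -> Option<&Vec<Assertion>> {
        self.by_key.get(&Self::make_key(subject, predicate))
    }
}
```

A claim is found only when its computed key equals the key of some indexed assertion, so any difference in the last two segments results in a miss.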

2. Trust Pack Import

Current State:

✅ Assertions stored in KV
✅ Indexed under predicates::AUTHORITATIVE
✅ Loaded into corpus at scan time (scan.rs:201)
✅ Included in ConceptIndex (scan.rs:210)

The Gap: Trust Pack assertions use paths defined by security teams, which may not match extractor conventions.

The Problem

Scenario: Enterprise Policy Mismatch

Security Team's Intent:

# They create a "blessed" standard
subject = "code://standards/tls/cert_verification"
predicate = "enabled"
object = true

What Code Extractors Produce:

// Rust extractor output
concept_path: "code://rust/myapp/tls/cert_verification"

// Go extractor output
concept_path: "code://go/myapp/tls/cert_verification"

// Python extractor output
concept_path: "code://python/myapp/tls/cert_verification"

Current Behavior:

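Security standard:     "code://standards/tls/cert_verification" → key: "tls/cert_verification::enabled"
Rust code:             "code://rust/myapp/tls" → key: "myapp/tls::enabled"  ❌ MISMATCH

The keys diverge as soon as the extractor path ends at a different depth, or uses different final segments, than the policy path. Only paths that share the same last two segments produce the same key.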

Root Cause

Tail-path matching assumes:

Uniform Depth: All sources use similar path hierarchies
Language Agnostic: The "tls/cert_verification" pattern is universal

But enterprise policies violate these assumptions:

Security teams think in domains (standards/tls)
Extractors output language-qualified paths (rust/myapp/tls)

Analysis: Is Tail-Path Matching Sufficient?

What Works Well

RFC ↔ Code Matching
- RFCs use domain concepts: rfc://5246/tls/cert_verification
- Code extractors intentionally align: code://rust/.../tls/cert_verification
- This was designed to work
Zero Configuration
- No manual alias mapping required
- "Just works" for bundled corpus
Cross-Language Matching
- code://rust/.../tls/cert_verification
- code://python/.../tls/cert_verification
- Both match the same RFC

What Breaks

Enterprise Policy Hierarchies
- Security teams use logical groupings: standards/, internal/, exceptions/
- These don't map to extractor output
Vendor-Specific Patterns
- Unreal Engine: unreal://engine/rendering/synchronous_loading
- Code: code://cpp/mygame/rendering/assets/load_sync
- Different semantic levels
Domain-Specific Abstractions
- Healthcare: hipaa://patient_data/encryption
- Finance: pci://cardholder_data/storage
- Code may not mirror these hierarchies

Solution Options

Option 1: Normalize Extractor Output (Rejected)

Idea: Make extractors output "canonical" paths that match standards.

Why it fails:

Extractors need language context (rust/myapp)
Path structure conveys information (file location, module hierarchy)
Breaks existing aliases and observations

Option 2: Flexible Tail-Path Length (Partial)

Idea: Try matching with N=1, N=2, N=3 segments.

// Try multiple keys
"cert_verification::enabled"           // N=1
"tls/cert_verification::enabled"       // N=2
"myapp/tls/cert_verification::enabled" // N=3

Pros:

Handles some depth mismatches
Backward compatible

Cons:

Ambiguous matches (which key wins?)
Still doesn't solve semantic differences
Performance impact (3x index lookups)
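A rough sketch of the multi-key generation follows, to show where the ambiguity comes from. The function name candidate_keys and the shortest-first ordering are assumptions, not part of the existing code.

```rust
/// Generate tail-path keys for the last 1..=max_n segments, shortest first.
/// Hypothetical helper; ordering and naming are illustrative only.
fn candidate_keys(subject: &str, predicate: &str, max_n: usize) -> Vec<String> {
    let path = subject.split_once("://").map_or(subject, |(_, rest)| rest);
    let segs: Vec<&str> = path.split('/').filter(|s| !s.is_empty()).collect();
    (1..=max_n.min(segs.len()))
        .map(|n| format!("{}::{}", segs[segs.len() - n..].join("/"), predicate))
        .collect()
}
```

At N=1 every path ending in cert_verification collapses to the same key regardless of domain, which is the ambiguity listed under Cons.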

Option 3: Explicit Policy Aliases (Recommended)

Idea: Add an alias layer in Trust Packs.

Trust Pack Schema Extension:

pub struct TrustPack {
    pub header: PackHeader,
    pub assertions: Vec<Assertion>,
    pub aliases: Vec<ConceptAlias>,  // Already exists!
    pub policy_aliases: Vec<PolicyAlias>,  // NEW
    pub signature: [u8; 64],
}

pub struct PolicyAlias {
    /// The policy path used in assertions
    pub policy_path: String,
    /// Glob patterns that should match this policy
    pub target_patterns: Vec<String>,
}

Example:

PolicyAlias {
    policy_path: "code://standards/tls/cert_verification",
    target_patterns: vec![
        "code://rust/*/tls/cert_verification",
        "code://go/*/tls/cert_verification",
        "code://python/*/tls/cert_verification",
    ],
}

Matching Algorithm:

impl ConceptIndex {
    pub fn lookup_with_aliases(
        &self,
        subject: &str,
        predicate: &str,
        policy_aliases: &[PolicyAlias],
    ) -> Option<&Vec<Assertion>> {
        // 1. Try direct tail-path match (existing)
        if let Some(result) = self.lookup(subject, predicate) {
            return Some(result);
        }

        // 2. Try policy alias expansion
        for alias in policy_aliases {
            if subject_matches_pattern(subject, &alias.target_patterns) {
                if let Some(result) = self.lookup(&alias.policy_path, predicate) {
                    return Some(result);
                }
            }
        }

        None
    }
}
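The two-step fallback can be exercised end to end with simplified types. In this sketch the index maps keys to plain string labels rather than Assertion values, `*` is assumed to match exactly one path segment, and the predicate "allowed" in the usage note is a placeholder; none of these choices are confirmed by the existing code.

```rust
use std::collections::HashMap;

struct PolicyAlias {
    policy_path: String,
    target_patterns: Vec<String>,
}

/// Tail-path key: last two path segments plus the predicate.
fn make_key(subject: &str, predicate: &str) -> String {
    let path = subject.split_once("://").map_or(subject, |(_, rest)| rest);
    let segs: Vec<&str> = path.split('/').filter(|s| !s.is_empty()).collect();
    format!("{}::{}", segs[segs.len().saturating_sub(2)..].join("/"), predicate)
}

/// Assumed wildcard semantics: `*` stands for exactly one non-empty segment.
fn glob_match(pattern: &str, subject: &str) -> bool {
    let p: Vec<&str> = pattern.split('/').collect();
    let s: Vec<&str> = subject.split('/').collect();
    p.len() == s.len()
        && p.iter().zip(&s).all(|(a, b)| (*a == "*" && !b.is_empty()) || a == b)
}

/// Index values are labels here; the real index stores assertions.
fn lookup_with_aliases<'a>(
    index: &'a HashMap<String, Vec<String>>,
    subject: &str,
    predicate: &str,
    aliases: &[PolicyAlias],
) -> Option<&'a Vec<String>> {
    // 1. Direct tail-path match.
    if let Some(hit) = index.get(&make_key(subject, predicate)) {
        return Some(hit);
    }
    // 2. First alias whose pattern matches and whose policy path resolves wins.
    aliases
        .iter()
        .filter(|a| a.target_patterns.iter().any(|p| glob_match(p, subject)))
        .find_map(|a| index.get(&make_key(&a.policy_path, predicate)))
}
```

For example, an alias from "unreal://engine/rendering/synchronous_loading" to the pattern "code://cpp/*/rendering/assets/load_sync" lets the claim path "code://cpp/mygame/rendering/assets/load_sync" resolve to the Unreal assertion even though the two tail-path keys differ.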

Recommended Implementation Plan

Phase 1: Extend Trust Pack Schema

Files:

applications/aphoria/src/policy.rs

Changes:

#[derive(Archive, Deserialize, Serialize, Debug, Clone)]
pub struct PolicyAlias {
    pub policy_path: String,
    pub target_patterns: Vec<String>,
}

pub struct TrustPack {
    // ... existing fields
    pub policy_aliases: Vec<PolicyAlias>,
    // ...
}

Phase 2: Add Pattern Matching

Files:

applications/aphoria/src/episteme/concept_index.rs

New Functions:

impl ConceptIndex {
    /// Extended lookup that tries policy aliases after tail-path match
    pub fn lookup_with_aliases(
        &self,
        subject: &str,
        predicate: &str,
        aliases: &[PolicyAlias],
    ) -> Option<&Vec<Assertion>> { ... }
}

/// Check if a subject matches a glob pattern
fn subject_matches_pattern(subject: &str, patterns: &[String]) -> bool {
    // Use glob crate or simple wildcard matching
    patterns.iter().any(|p| glob_match(p, subject))
}
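The glob_match helper referenced above is not defined in this document. One possible dependency-free version is sketched below, assuming `*` matches exactly one non-empty path segment and everything else is compared literally. The glob crate uses different, broader wildcard rules, so this is a deliberate simplification rather than a drop-in equivalent.

```rust
/// Segment-wise wildcard match.
/// Assumption: `*` matches exactly one non-empty path segment.
fn glob_match(pattern: &str, subject: &str) -> bool {
    let mut pat = pattern.split('/');
    let mut sub = subject.split('/');
    loop {
        match (pat.next(), sub.next()) {
            // Both exhausted at the same time: every segment matched.
            (None, None) => return true,
            // Wildcard consumes one non-empty segment.
            (Some("*"), Some(s)) if !s.is_empty() => continue,
            // Literal segments must be identical.
            (Some(p), Some(s)) if p == s => continue,
            // Length mismatch or literal mismatch.
            _ => return false,
        }
    }
}

/// Check if a subject matches any of the glob patterns.
fn subject_matches_pattern(subject: &str, patterns: &[String]) -> bool {
    patterns.iter().any(|p| glob_match(p, subject))
}
```

Under these rules "code://rust/*/tls/cert_verification" matches "code://rust/myapp/tls/cert_verification" but not a deeper path such as "code://rust/myapp/net/tls/cert_verification"; whether that strictness is desirable is part of the wildcard syntax question.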

Phase 3: Integrate into Scan Flow

Files:

applications/aphoria/src/scan.rs
applications/aphoria/src/episteme/local.rs

Changes:

// scan.rs:210 - Load policy aliases from Trust Packs
let policy_manager = PolicyManager::new(&config.corpus.cache_dir);
let policies = policy_manager.load_policies(&config.policies)?;
let policy_aliases: Vec<PolicyAlias> = policies
    .iter()
    .flat_map(|p| &p.policy_aliases)
    .cloned()
    .collect();

// local.rs:273 - Use extended lookup
let auth_assertions = match index.lookup_with_aliases(
    &claim.concept_path,
    &claim.predicate,
    &policy_aliases,
) {
    Some(assertions) => assertions,
    None => continue,
};

Phase 4: CLI Tooling

New Command:

# Generate policy aliases from existing assertions
aphoria policy generate-aliases \
  --policy-path "code://standards/tls/cert_verification" \
  --target "code://rust/*/tls/cert_verification" \
  --target "code://go/*/tls/cert_verification"

Output: Adds PolicyAlias to Trust Pack before signing.

Extension Points

1. Dynamic Alias Discovery

Future Enhancement: Auto-generate aliases during scan if code paths differ from policy paths.

// If tail-path matches but full paths differ, suggest alias
if tail_match && !full_match {
    suggestions.push(PolicyAlias {
        policy_path: assertion.subject.clone(),
        target_patterns: vec![claim.concept_path.clone()],
    });
}
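As a self-contained illustration of the suggestion rule, the sketch below derives tail_match and full_match from the two paths. The helper names tail and suggest_alias are hypothetical, and the suggested pattern is simply the literal claim path, as in the snippet above.

```rust
#[derive(Debug, PartialEq)]
struct PolicyAlias {
    policy_path: String,
    target_patterns: Vec<String>,
}

/// Last two path segments after the scheme, or None for shorter paths.
fn tail(path: &str) -> Option<String> {
    let rest = path.split_once("://").map_or(path, |(_, r)| r);
    let segs: Vec<&str> = rest.split('/').filter(|s| !s.is_empty()).collect();
    if segs.len() < 2 {
        return None;
    }
    Some(segs[segs.len() - 2..].join("/"))
}

/// Suggest an alias when the tail paths agree but the full paths differ.
fn suggest_alias(policy_subject: &str, claim_path: &str) -> Option<PolicyAlias> {
    let policy_tail = tail(policy_subject);
    let tail_match = policy_tail.is_some() && policy_tail == tail(claim_path);
    let full_match = policy_subject == claim_path;
    if tail_match && !full_match {
        Some(PolicyAlias {
            policy_path: policy_subject.to_string(),
            target_patterns: vec![claim_path.to_string()],
        })
    } else {
        None
    }
}
```

Suggestions produced this way would still need review before being added to a Trust Pack, since a shared tail does not guarantee that the two paths mean the same thing.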

2. Semantic Equivalence

Future Enhancement: Use embedding similarity for fuzzy matching.

pub struct SemanticAlias {
    pub policy_path: String,
    pub similarity_threshold: f32,
}

// Match if embedding similarity >= similarity_threshold
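A minimal sketch of the comparison step, assuming embeddings are plain f32 vectors, that cosine similarity is the metric, and that similarity_threshold is a minimum similarity. How the embeddings would be produced is out of scope here, and semantic_match is a hypothetical helper.

```rust
struct SemanticAlias {
    policy_path: String,
    similarity_threshold: f32,
}

/// Cosine similarity of two equal-length, non-zero vectors.
/// Returns None when the inputs cannot be compared.
fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.is_empty() || a.len() != b.len() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}

/// A claim matches when its similarity to the policy embedding
/// reaches the alias threshold (assumed to be a minimum similarity).
fn semantic_match(alias: &SemanticAlias, policy_emb: &[f32], claim_emb: &[f32]) -> bool {
    cosine_similarity(policy_emb, claim_emb)
        .map_or(false, |sim| sim >= alias.similarity_threshold)
}
```

Treating incomparable vectors as a non-match keeps the fuzzy path from ever producing a match by accident.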

3. Hierarchical Policy Inheritance

Future Enhancement: Support policy hierarchies.

// Match "code://standards/tls/*" against any TLS assertion
pub struct HierarchyAlias {
    pub policy_prefix: String,  // "code://standards/tls"
    pub target_prefix: String,  // "code://rust/*/tls"
}
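One way prefix-based inheritance could work is to rewrite a claim path onto the policy hierarchy before the normal lookup. The sketch below assumes `*` in target_prefix matches exactly one segment; rewrite_to_policy is a hypothetical helper, not an existing function.

```rust
struct HierarchyAlias {
    policy_prefix: String, // e.g. "code://standards/tls"
    target_prefix: String, // e.g. "code://rust/*/tls"
}

/// If the claim path starts with `target_prefix`, replace that prefix
/// with `policy_prefix` and keep the remaining segments unchanged.
fn rewrite_to_policy(alias: &HierarchyAlias, claim_path: &str) -> Option<String> {
    let prefix: Vec<&str> = alias.target_prefix.split('/').collect();
    let claim: Vec<&str> = claim_path.split('/').collect();
    if claim.len() < prefix.len() {
        return None;
    }
    // Compare only the leading segments; `*` accepts any non-empty segment.
    let matches = prefix
        .iter()
        .zip(&claim)
        .all(|(p, c)| (*p == "*" && !c.is_empty()) || p == c);
    if !matches {
        return None;
    }
    let mut out = alias.policy_prefix.trim_end_matches('/').to_string();
    for seg in &claim[prefix.len()..] {
        out.push('/');
        out.push_str(seg);
    }
    Some(out)
}
```

With the prefixes shown in the struct comments, "code://rust/myapp/tls/cert_verification" is rewritten to "code://standards/tls/cert_verification", which can then be looked up like any other policy path.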

Migration Path

Backward Compatibility

✅ Zero Breaking Changes:

Tail-path matching still works for existing use cases
PolicyAlias is optional (empty vec = current behavior)
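Existing Trust Packs without a policy_aliases field must still load, with the field defaulting to an empty vec. If Trust Packs are stored in a fixed-layout archive format, as the Archive derive suggests, a simple field default may not be enough and a versioned read path may be needed.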

Adoption Strategy

Week 1: Implement core functionality (Phase 1-2)
Week 2: Integrate into scan flow (Phase 3)
Week 3: Add CLI tooling (Phase 4)
Week 4: Document + UAT with enterprise scenario

Metrics for Success

Functional

Security team can create code://standards/* assertions
Dev team code (code://rust/myapp/*) matches standards
Conflicts are detected and reported
Trust Pack signature verification passes

Performance

Scan time increase < 5% (the alias fallback costs O(P*A) pattern checks per claim that misses the tail-path lookup, where P = patterns per alias and A = aliases)
Memory overhead < 10KB per Trust Pack (policy aliases are small)

Usability

Security team can export Trust Pack with aliases in < 5 commands
Dev team imports Trust Pack with policies = ["security.pack"] (no code changes)

Open Questions

Wildcard Syntax: Use glob (*) or regex (.*)?
- Recommendation: Start with glob (simpler, more intuitive)
Alias Priority: If multiple aliases match, which wins?
- Recommendation: First match wins (deterministic order in Trust Pack)
Alias Storage: Persist discovered aliases back to local store?
- Recommendation: No (keep Trust Pack as source of truth)
Alias Validation: Check patterns at Trust Pack creation time?
- Recommendation: Yes (fail fast if invalid glob pattern)
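A fail-fast validation pass at Trust Pack creation time might look like the sketch below. The rules it enforces (a scheme prefix, no empty segments, `*` only as a whole segment) are assumptions chosen for illustration; the actual pattern grammar is still one of the open questions.

```rust
#[derive(Debug, PartialEq)]
enum PatternError {
    MissingScheme,
    EmptySegment,
    PartialWildcard(String),
}

/// Validate a target pattern under the assumed rules described above.
fn validate_pattern(pattern: &str) -> Result<(), PatternError> {
    let (scheme, path) = pattern
        .split_once("://")
        .ok_or(PatternError::MissingScheme)?;
    if scheme.is_empty() {
        return Err(PatternError::MissingScheme);
    }
    for seg in path.split('/') {
        if seg.is_empty() {
            return Err(PatternError::EmptySegment);
        }
        // Reject wildcards mixed with literal text, e.g. "my*".
        if seg.contains('*') && seg != "*" {
            return Err(PatternError::PartialWildcard(seg.to_string()));
        }
    }
    Ok(())
}
```

Running a check like this over every target_patterns entry before signing would surface malformed patterns to the security team instead of at scan time.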

Conclusion

Current State: Tail-path matching works for bundled corpus but breaks for enterprise policies.

Root Cause: Semantic mismatch between policy hierarchies and extractor output.

Solution: Add explicit PolicyAlias layer in Trust Packs.

Impact: Unblocks enterprise adoption without breaking existing functionality.

Effort: ~2-3 days (schema extension + pattern matching + integration)

Risk: Low (additive change, backward compatible)

Next Steps

Review this analysis with team
Validate glob pattern syntax choice
Implement Phase 1 (schema extension)
Write UAT scenario mimicking enterprise use case
Iterate based on feedback

Questions or feedback? Discuss in #aphoria-architecture.
