jordan 41c676a78e feat: Aphoria enterprise features + ontology SDK + file length compliance

2026-02-05 12:55:29 -07:00

Concept Matching Philosophy

Context: Aphoria's policy enforcement depends on matching code extractors to authoritative sources.

Question: How do we enable flexible matching without over-engineering?

Core Design Principles

1. Semantic Over Syntactic

Bad: Match exact string paths

"code://rust/myapp/tls/cert_verification" != "rfc://5246/tls/cert_verification"

Good: Match semantic tail paths

Both produce key: "tls/cert_verification::enabled"

Principle: Concepts should match across schemes if they represent the same idea.

2. Progressive Precision

Layer 1: Tail-path matching (works 80% of the time)
Layer 2: Policy aliases (handles enterprise hierarchies)
Layer 3: Semantic embeddings (future: fuzzy matching)

Principle: Start with simple heuristics, add precision layers as needed.

3. Explicit Over Implicit

Bad: Auto-generate aliases behind the scenes

Hard to debug ("why did this match?")
Fragile (breaks with refactoring)
Opaque (security teams lose control)

Good: Require explicit policy aliases

Clear provenance (alias is in Trust Pack)
Auditable (signature covers aliases)
Controllable (security team decides matches)

Principle: Matching logic should be transparent and intentional.

Why Tail-Path Matching Works

Design Insight

Code extractors are intentionally designed to align with RFC/OWASP paths:

RFC Structure:

rfc://5246/tls/cert_verification
rfc://7519/jwt/audience_validation
rfc://8996/tls/min_version

Extractor Output:

code://rust/myapp/tls/cert_verification
code://python/myapp/jwt/audience_validation
code://go/myapp/tls/min_version

Key Insight: The last 2 segments (tls/cert_verification) are the concept name.

The language and project prefix (rust/myapp) provides context but not identity.
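A minimal sketch of how such a key could be derived, assuming the key format shown earlier ("tls/cert_verification::enabled"). The function name `tail_path_key` is hypothetical and is not taken from the Aphoria codebase.

```rust
/// Builds a concept key from the last two path segments of a URI plus a predicate.
/// Returns `None` when the URI has no "://" separator or fewer than two segments.
fn tail_path_key(uri: &str, predicate: &str) -> Option<String> {
    // Drop the scheme ("code", "rfc") so only the path remains.
    let (_scheme, path) = uri.split_once("://")?;
    let segments: Vec<&str> = path.split('/').filter(|s| !s.is_empty()).collect();
    if segments.len() < 2 {
        return None;
    }
    // The last two segments are treated as the concept name.
    let tail = &segments[segments.len() - 2..];
    Some(format!("{}::{}", tail.join("/"), predicate))
}
```

Under this sketch, the RFC path and the extractor path for certificate verification reduce to the same key, which is what lets them match across schemes.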

Why Tail-Path Matching Breaks

Enterprise Hierarchies

Security teams think in logical domains, not RFC hierarchies:

code://standards/tls/cert_verification    (Security team's mental model)
code://internal/exceptions/md5_allowed    (Policy exceptions)
code://vendor/aws/s3/public_access        (Cloud-specific rules)

These don't map to extractor output:

code://rust/myapp/tls/cert_verification   (Extractor output)

Problem: standards/tls (2 segments) vs. rust/myapp/tls (3 segments)

Tail-path key mismatch:

Policy: "tls/cert_verification::enabled"
Code: "myapp/tls::enabled" (extracts wrong segments!)

Why Policy Aliases Are the Right Solution

1. Preserves Tail-Path Matching

Most cases (bundled corpus) still use fast path:

// 1. Try direct tail-path match (O(1) hash lookup)
if let Some(result) = self.lookup(subject, predicate) {
    return Some(result);
}

2. Adds Flexibility Without Complexity

Only when direct match fails, try aliases:

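// 2. Try policy alias patterns (O(P*A), small P and A)
for alias in policy_aliases {
    // An alias applies if any one of its target patterns matches the subject.
    if alias.target_patterns.iter().any(|pattern| glob_match(pattern, subject)) {
        return self.lookup(&alias.policy_path, predicate);
    }
}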

3. Keeps Control with Policy Authors

Security team explicitly states:

PolicyAlias {
    policy_path: "code://standards/tls/cert_verification",
    target_patterns: vec![
        "code://rust/*/tls/cert_verification",
        "code://go/*/tls/cert_verification",
    ],
}

The alias serves as both documentation (what matches what) and enforcement (signed in Trust Pack).
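The two layers can be put together in one self-contained sketch. Everything here is an assumption made for illustration: the `Matcher` type, the string-valued index, the `tail_key` helper, and a `glob_match` in which `*` matches exactly one segment. The real types in Aphoria may differ.

```rust
use std::collections::HashMap;

/// Hypothetical alias shape, mirroring the fields shown above.
struct PolicyAlias {
    policy_path: String,
    target_patterns: Vec<String>,
}

/// Hypothetical matcher: authoritative values indexed by tail-path key.
struct Matcher {
    index: HashMap<String, String>,
    aliases: Vec<PolicyAlias>,
}

/// Last two path segments plus predicate, e.g. "tls/cert_verification::enabled".
fn tail_key(uri: &str, predicate: &str) -> Option<String> {
    let (_, path) = uri.split_once("://")?;
    let segs: Vec<&str> = path.split('/').filter(|s| !s.is_empty()).collect();
    if segs.len() < 2 {
        return None;
    }
    Some(format!("{}::{}", segs[segs.len() - 2..].join("/"), predicate))
}

/// Single-segment glob: `*` matches exactly one path segment.
fn glob_match(pattern: &str, subject: &str) -> bool {
    let p: Vec<&str> = pattern.split('/').collect();
    let s: Vec<&str> = subject.split('/').collect();
    p.len() == s.len() && p.iter().zip(&s).all(|(a, b)| *a == "*" || a == b)
}

impl Matcher {
    /// Layer 1: direct tail-path lookup.
    fn lookup(&self, subject: &str, predicate: &str) -> Option<&str> {
        self.index
            .get(&tail_key(subject, predicate)?)
            .map(String::as_str)
    }

    /// Layer 1 first, then Layer 2 (explicit aliases) only on a miss.
    fn resolve(&self, subject: &str, predicate: &str) -> Option<&str> {
        if let Some(hit) = self.lookup(subject, predicate) {
            return Some(hit);
        }
        for alias in &self.aliases {
            if alias.target_patterns.iter().any(|p| glob_match(p, subject)) {
                return self.lookup(&alias.policy_path, predicate);
            }
        }
        None
    }
}
```

With an alias from "code://vendor/aws/s3/public_access" to "code://*/*/aws/s3/bucket/public", a subject such as "code://go/myapp/aws/s3/bucket/public" misses the direct lookup (its tail is bucket/public) and is resolved through the alias instead.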

Extension Points: Future Matching Layers

Layer 3: Semantic Equivalence (Future)

Idea: Use embeddings to match concepts with different names.

Example:

Policy: "code://standards/tls/certificate_validation"
Code:   "code://rust/myapp/tls/cert_verification"

Embedding similarity: 0.92 → match

When to add: If alias management becomes too manual.
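A rough sketch of the comparison step, assuming concept embeddings are already available as `f32` vectors. How the embeddings are produced is out of scope here, and the function names and the threshold parameter are hypothetical.

```rust
/// Cosine similarity of two vectors.
/// Returns `None` for empty, mismatched-length, or zero-norm inputs.
fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.is_empty() || a.len() != b.len() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}

/// Treats two concepts as equivalent when similarity reaches the threshold.
/// Invalid inputs never match.
fn is_semantic_match(a: &[f32], b: &[f32], threshold: f32) -> bool {
    cosine_similarity(a, b).map_or(false, |score| score >= threshold)
}
```

Because a threshold decides the outcome, a layer like this would sit behind the explicit layers and, in keeping with the "Explicit Over Implicit" principle, would likely need to report its score alongside any match.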

Layer 4: Ontology Mapping (Future)

Idea: Define semantic relationships between concepts.

Example:

ontology:
  "tls/cert_verification":
    equivalent_to:
      - "tls/certificate_validation"
      - "ssl/verify_certs"
    broader_than:
      - "security/transport_layer"

When to add: If multiple industries need cross-domain mapping.
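One way the `equivalent_to` relation could be consumed is as a reverse index from every known name to a canonical concept. This is only a sketch: the function names are hypothetical, the ontology is passed in as plain tuples rather than parsed from a file, and `broader_than` is left out.

```rust
use std::collections::HashMap;

/// Builds a reverse index: every canonical name and every equivalent name
/// maps to the canonical concept.
fn build_equivalence_index(ontology: &[(&str, Vec<&str>)]) -> HashMap<String, String> {
    let mut index = HashMap::new();
    for (canonical, equivalents) in ontology {
        index.insert(canonical.to_string(), canonical.to_string());
        for name in equivalents.iter() {
            index.insert(name.to_string(), canonical.to_string());
        }
    }
    index
}

/// Resolves a concept name to its canonical form, if the ontology knows it.
fn canonical_concept<'a>(index: &'a HashMap<String, String>, name: &str) -> Option<&'a str> {
    index.get(name).map(String::as_str)
}
```

Canonicalizing both the policy concept and the extracted concept before building the tail-path key would let the existing hash lookup stay unchanged.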

Comparison: Alternative Approaches

Alt 1: Variable Tail Length

Idea: Try N=1, N=2, N=3 segment keys.

Problems:

Ambiguous matches (which key wins?)
Performance hit (3x lookups)
Doesn't solve semantic differences

Verdict: Rejected (complexity without solving root cause)
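To make the cost concrete, here is a small sketch of what key generation would look like under this rejected approach. The function `candidate_keys` is hypothetical and exists only to illustrate the problem.

```rust
/// Generates one candidate key per tail length, from 1 up to `max_segments`.
/// Every candidate would need its own index lookup.
fn candidate_keys(uri: &str, predicate: &str, max_segments: usize) -> Vec<String> {
    let path = match uri.split_once("://") {
        Some((_scheme, path)) => path,
        None => return Vec::new(),
    };
    let segments: Vec<&str> = path.split('/').filter(|s| !s.is_empty()).collect();
    (1..=max_segments.min(segments.len()))
        .map(|n| format!("{}::{}", segments[segments.len() - n..].join("/"), predicate))
        .collect()
}
```

A single extractor path yields three different keys, and if more than one of them hits the index there is no principled way to pick a winner.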

Alt 2: Normalize All Paths

Idea: Extractors output "canonical" paths that match standards.

Problems:

Loses language context (rust/myapp)
Breaks existing aliases/observations
Forces extractors to know about ALL standards

Verdict: Rejected (breaks modularity)

Alt 3: Dynamic Alias Discovery

Idea: Auto-create aliases during scan when tail-path matches but full path differs.

Problems:

Implicit behavior (hard to debug)
No security team approval (bypasses policy control)
May create false positives

Verdict: Future enhancement (as suggestions, not automatic)
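A sketch of the "suggestions, not automatic" variant: the scan only reports candidate aliases, and nothing is applied until a policy author adds them to a Trust Pack. The `AliasSuggestion` type and both function names are hypothetical.

```rust
/// A candidate alias for human review. Nothing here is applied automatically.
#[derive(Debug, PartialEq)]
struct AliasSuggestion {
    policy_path: String,
    observed_path: String,
}

/// Last two path segments of a URI, e.g. "tls/cert_verification".
fn tail(uri: &str) -> Option<String> {
    let (_, path) = uri.split_once("://")?;
    let segs: Vec<&str> = path.split('/').filter(|s| !s.is_empty()).collect();
    if segs.len() < 2 {
        return None;
    }
    Some(segs[segs.len() - 2..].join("/"))
}

/// Suggests an alias wherever tails agree but the full paths differ.
fn suggest_aliases(policy_paths: &[&str], observed_paths: &[&str]) -> Vec<AliasSuggestion> {
    let mut suggestions = Vec::new();
    for &observed in observed_paths {
        let Some(observed_tail) = tail(observed) else { continue };
        for &policy in policy_paths {
            if observed != policy && tail(policy).as_ref() == Some(&observed_tail) {
                suggestions.push(AliasSuggestion {
                    policy_path: policy.to_string(),
                    observed_path: observed.to_string(),
                });
            }
        }
    }
    suggestions
}
```

Keeping the output as plain data preserves the approval step: the security team still decides which suggestions become signed aliases.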

Architectural Trade-offs

Chosen: Explicit Policy Aliases

Pros:

Clear provenance (aliases are in Trust Pack)
Auditable (covered by signature)
Flexible (glob patterns support many cases)
Backward compatible (empty aliases = current behavior)

Cons:

Requires manual alias creation
Adds cognitive overhead (security teams must think about patterns)
Another field in Trust Pack schema

Why this trade-off wins:

Enterprise adoption requires auditability
Security teams WANT explicit control
Manual work is one-time (create pack once, reuse everywhere)

Recommended Patterns

Pattern 1: Language Wildcards

Use Case: Standard applies to all languages.

PolicyAlias {
    policy_path: "code://standards/tls/cert_verification",
    target_patterns: vec![
        "code://*/*/tls/cert_verification",  // any language, any project
    ],
}

Pattern 2: Project-Specific

Use Case: Internal policy for specific service.

PolicyAlias {
    policy_path: "code://internal/auth/jwt_validation",
    target_patterns: vec![
        "code://rust/auth-service/jwt/validation",
        "code://go/auth-service/jwt/validation",
    ],
}

Pattern 3: Domain-Scoped

Use Case: Cloud-specific rules.

PolicyAlias {
    policy_path: "code://vendor/aws/s3/public_access",
    target_patterns: vec![
        "code://*/*/aws/s3/bucket/public",
        "code://*/*/cloud/storage/s3/public_access",
    ],
}
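Before signing a pack, it can help to check which extractor paths a set of patterns would actually cover. The helper below is a sketch under the assumption that `*` matches exactly one path segment; `glob_match` and `matching_paths` are hypothetical names, not Aphoria APIs.

```rust
/// Single-segment glob: `*` matches exactly one path segment.
fn glob_match(pattern: &str, subject: &str) -> bool {
    let p: Vec<&str> = pattern.split('/').collect();
    let s: Vec<&str> = subject.split('/').collect();
    p.len() == s.len() && p.iter().zip(&s).all(|(a, b)| *a == "*" || a == b)
}

/// Returns the paths covered by at least one of the given patterns,
/// preserving input order.
fn matching_paths<'a>(patterns: &[&str], paths: &[&'a str]) -> Vec<&'a str> {
    paths
        .iter()
        .copied()
        .filter(|path| patterns.iter().any(|pattern| glob_match(pattern, path)))
        .collect()
}
```

With the Pattern 1 wildcard, "code://rust/myapp/tls/cert_verification" and "code://go/svc/tls/cert_verification" are covered, while a deeper path such as "code://rust/myapp/core/tls/cert_verification" is not, because each `*` consumes exactly one segment.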

Open Questions for Long-Term Evolution

Q1: Should we support recursive wildcards?

Current: code://rust/*/tls (single segment wildcard)
Proposed: code://rust/**/tls (any depth)

Trade-off: More flexible, but harder to reason about matches.

Decision: Start with single-segment, add recursive if needed.
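For comparison, a sketch of what recursive matching might look like if it were added. It assumes `**` matches zero or more segments, which is one possible semantics and not a decided one; the function names are hypothetical.

```rust
/// Segment-wise matcher: `*` matches exactly one segment,
/// `**` matches zero or more segments (assumed semantics).
fn match_segments(pattern: &[&str], subject: &[&str]) -> bool {
    match pattern.split_first() {
        // Pattern exhausted: match only if the subject is exhausted too.
        None => subject.is_empty(),
        // `**` may absorb any number of leading subject segments.
        Some((seg, rest)) if *seg == "**" => {
            (0..=subject.len()).any(|k| match_segments(rest, &subject[k..]))
        }
        // Literal or `*`: consume exactly one subject segment.
        Some((seg, rest)) => match subject.split_first() {
            Some((head, tail)) => (*seg == "*" || seg == head) && match_segments(rest, tail),
            None => false,
        },
    }
}

/// Convenience wrapper that splits both inputs on '/'.
fn glob_match_recursive(pattern: &str, subject: &str) -> bool {
    let p: Vec<&str> = pattern.split('/').collect();
    let s: Vec<&str> = subject.split('/').collect();
    match_segments(&p, &s)
}
```

The backtracking in the `**` arm is the source of the "harder to reason about" concern: one pattern can match a subject in several ways, and the set of matched paths is no longer bounded by depth.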

Q2: Should aliases be bidirectional?

Current: Policy path → Code patterns (one direction)
Proposed: Allow code path → Policy path mapping

Use Case: "This code path is an exception to standard X."

Decision: Defer until use case emerges.

Q3: Should we cache pattern matches?

Current: Recompute glob match on every lookup
Proposed: Cache subject → policy_path map per scan

Trade-off: Faster (O(1) after first match) vs. memory overhead

Decision: Benchmark first; optimize only if alias lookups prove to be a bottleneck (caching now would be premature optimization).
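If benchmarks do justify it, a per-scan cache could be as small as the sketch below. The `ScanCache` type, its `alias_scans` counter, and the helper names are hypothetical; negative results are cached as well so that unmatched subjects do not rescan the alias list.

```rust
use std::collections::HashMap;

/// Hypothetical alias shape, mirroring the fields used in this document.
struct PolicyAlias {
    policy_path: String,
    target_patterns: Vec<String>,
}

/// Single-segment glob: `*` matches exactly one path segment.
fn glob_match(pattern: &str, subject: &str) -> bool {
    let p: Vec<&str> = pattern.split('/').collect();
    let s: Vec<&str> = subject.split('/').collect();
    p.len() == s.len() && p.iter().zip(&s).all(|(a, b)| *a == "*" || a == b)
}

/// Per-scan cache of subject -> resolved policy path (or `None` for no alias).
struct ScanCache {
    resolved: HashMap<String, Option<String>>,
    /// Number of full passes over the alias list, kept for benchmarking.
    alias_scans: usize,
}

impl ScanCache {
    fn new() -> Self {
        ScanCache { resolved: HashMap::new(), alias_scans: 0 }
    }

    /// Returns the aliased policy path for `subject`, scanning aliases at most once.
    fn policy_path_for(&mut self, subject: &str, aliases: &[PolicyAlias]) -> Option<String> {
        if let Some(cached) = self.resolved.get(subject) {
            return cached.clone();
        }
        self.alias_scans += 1;
        let found = aliases
            .iter()
            .find(|alias| alias.target_patterns.iter().any(|p| glob_match(p, subject)))
            .map(|alias| alias.policy_path.clone());
        self.resolved.insert(subject.to_string(), found.clone());
        found
    }
}
```

The cache is only valid for one alias set, so it would need to be dropped whenever a different Trust Pack is loaded.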

Q4: Should policy aliases be mergeable?

Current: Each Trust Pack has independent aliases
Proposed: Allow Trust Pack B to "extend" Trust Pack A's aliases

Use Case: Company-wide base pack + team-specific extensions

Decision: Future enhancement (Trust Pack composition system).
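One possible merge rule, sketched purely as an assumption: aliases with the same policy path have their patterns unioned, and new policy paths are appended. The function `merge_aliases` is hypothetical, and how a merged set would interact with pack signatures is left open.

```rust
/// Hypothetical alias shape, mirroring the fields used in this document.
#[derive(Clone, Debug, PartialEq)]
struct PolicyAlias {
    policy_path: String,
    target_patterns: Vec<String>,
}

/// Merges an extension pack's aliases into a base pack's aliases.
/// Same policy path: patterns are unioned (duplicates skipped, order kept).
/// New policy path: the alias is appended.
fn merge_aliases(base: &[PolicyAlias], extension: &[PolicyAlias]) -> Vec<PolicyAlias> {
    let mut merged: Vec<PolicyAlias> = base.to_vec();
    for ext in extension {
        if let Some(pos) = merged.iter().position(|a| a.policy_path == ext.policy_path) {
            for pattern in &ext.target_patterns {
                if !merged[pos].target_patterns.contains(pattern) {
                    merged[pos].target_patterns.push(pattern.clone());
                }
            }
        } else {
            merged.push(ext.clone());
        }
    }
    merged
}
```

This rule only ever widens what a base alias matches; whether an extension should also be able to narrow or remove base patterns is a separate design question.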

Guiding Heuristic

When adding matching features, ask:

Does this preserve tail-path matching for the common case?
- Yes → Maintains performance
- No → Reconsider
Is the behavior explicit and auditable?
- Yes → Security teams can reason about it
- No → Will cause trust issues
Can it be disabled or overridden?
- Yes → Progressive adoption
- No → May block some use cases
Does it add cognitive overhead?
- Minimal → Worth the flexibility
- Significant → Document heavily or defer

Conclusion

Current State: Tail-path matching works for bundled corpus but breaks for enterprise policies.

Root Cause: Semantic mismatch between policy hierarchies and extractor output.

Solution: Add explicit policy aliases as a second matching layer.

Philosophy: Start simple (tail-path), add precision layers (aliases), keep it auditable.

Future: Semantic embeddings and ontology mapping if manual aliases become burdensome.

This document should guide future matching feature discussions. Always return to the core principles: semantic matching, progressive precision, explicit control.
