stemedb/applications/aphoria/docs/architecture/matching-philosophy.md

Concept Matching Philosophy

Context: Aphoria's policy enforcement depends on matching code extractors to authoritative sources.

Question: How do we enable flexible matching without over-engineering?


Core Design Principles

1. Semantic Over Syntactic

Bad: Match exact string paths

"code://rust/myapp/tls/cert_verification" != "rfc://5246/tls/cert_verification"

Good: Match semantic tail paths

Both produce key: "tls/cert_verification::enabled"

Principle: Concepts should match across schemes if they represent the same idea.


2. Progressive Precision

  • Layer 1: Tail-path matching (works 80% of the time)
  • Layer 2: Policy aliases (handles enterprise hierarchies)
  • Layer 3: Semantic embeddings (future: fuzzy matching)

Principle: Start with simple heuristics, add precision layers as needed.
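The layering can be pictured as an ordered list of matchers tried in turn, stopping at the first hit. The sketch below is illustrative only: `Layer` and `resolve` are hypothetical names, not Aphoria's API, and each layer is reduced to a plain function from a subject path to an optional key.

```rust
/// A named matching layer. Hypothetical type for illustration:
/// the name records which layer produced a match, so results stay explainable.
type Layer = (&'static str, fn(&str) -> Option<String>);

/// Try each layer in order and return the first match together with
/// the name of the layer that produced it.
fn resolve(layers: &[Layer], subject: &str) -> Option<(&'static str, String)> {
    layers
        .iter()
        .find_map(|(name, matcher)| matcher(subject).map(|key| (*name, key)))
}
```

Because later layers run only when earlier ones return nothing, adding a layer does not change the result for any subject an earlier layer already matches.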


3. Explicit Over Implicit

Bad: Auto-generate aliases behind the scenes

  • Hard to debug ("why did this match?")
  • Fragile (breaks with refactoring)
  • Opaque (security teams lose control)

Good: Require explicit policy aliases

  • Clear provenance (alias is in Trust Pack)
  • Auditable (signature covers aliases)
  • Controllable (security team decides matches)

Principle: Matching logic should be transparent and intentional.


Why Tail-Path Matching Works

Design Insight

Code extractors are intentionally designed to align with RFC/OWASP paths:

RFC Structure:

rfc://5246/tls/cert_verification
rfc://7519/jwt/audience_validation
rfc://8996/tls/min_version

Extractor Output:

code://rust/myapp/tls/cert_verification
code://python/myapp/jwt/audience_validation
code://go/myapp/tls/min_version

Key Insight: The last 2 segments (tls/cert_verification) are the concept name.

Language prefix (rust/myapp) provides context but not identity.
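The key derivation described here can be sketched as follows. This is a minimal illustration under stated assumptions, not Aphoria's implementation: `tail_key` is a hypothetical name, and the `::` separator is inferred from the key format shown earlier in this document.

```rust
/// Build a tail-path key from the last two path segments of a URI plus a
/// predicate, e.g. "tls/cert_verification::enabled".
/// Hypothetical helper; the name and the "::" join are assumptions.
fn tail_key(uri: &str, predicate: &str) -> Option<String> {
    // Drop the scheme ("code://", "rfc://") if one is present.
    let path = uri.split_once("://").map_or(uri, |(_, rest)| rest);
    let segments: Vec<&str> = path.split('/').filter(|s| !s.is_empty()).collect();
    if segments.len() < 2 {
        return None; // not enough segments to name a concept
    }
    let tail = &segments[segments.len() - 2..];
    Some(format!("{}::{}", tail.join("/"), predicate))
}
```

With this rule, `rfc://5246/tls/cert_verification` and `code://rust/myapp/tls/cert_verification` yield the same key for a given predicate, regardless of scheme or language prefix.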


Why Tail-Path Matching Breaks

Enterprise Hierarchies

Security teams think in logical domains, not RFC hierarchies:

code://standards/tls/cert_verification    (Security team's mental model)
code://internal/exceptions/md5_allowed    (Policy exceptions)
code://vendor/aws/s3/public_access        (Cloud-specific rules)

These don't map to extractor output:

code://rust/myapp/tls/cert_verification   (Extractor output)

Problem: standards/tls (2 segments) vs. rust/myapp/tls (3 segments)

Tail-path key mismatch:

  • Policy: "tls/cert_verification::enabled"
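  • Code: "myapp/tls::enabled" (wrong segments: this is the key that a fixed offset from the start of the path yields; counting two segments from the end of this path gives "tls/cert_verification")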

Why Policy Aliases Are the Right Solution

1. Preserves Tail-Path Matching

Most cases (the bundled corpus) still take the fast path:

// 1. Try direct tail-path match (O(1) hash lookup)
if let Some(result) = self.lookup(subject, predicate) {
    return Some(result);
}

2. Adds Flexibility Without Complexity

Aliases are tried only when the direct match fails:

// 2. Try policy alias patterns (O(P*A), small P and A)
for alias in policy_aliases {
    // An alias applies if any one of its target patterns matches the subject.
    if alias.target_patterns.iter().any(|p| glob_match(p, subject)) {
        return self.lookup(&alias.policy_path, predicate);
    }
}
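A self-contained sketch of the two steps together may help. Everything here beyond the field names `policy_path` and `target_patterns` is an assumption: the `String` field types, the `glob_match` semantics (`*` matches exactly one segment), and `lookup_with_aliases`, which uses a plain map keyed by full path as a stand-in for the real tail-path index.

```rust
use std::collections::HashMap;

/// Alias from one policy path to the code paths it should cover.
/// Field names follow this document; the String types are an assumption.
struct PolicyAlias {
    policy_path: String,
    target_patterns: Vec<String>,
}

/// Single-segment glob: `*` matches exactly one path segment.
/// Hypothetical helper; the real matcher may behave differently.
fn glob_match(pattern: &str, subject: &str) -> bool {
    let pat: Vec<&str> = pattern.split('/').collect();
    let sub: Vec<&str> = subject.split('/').collect();
    pat.len() == sub.len() && pat.iter().zip(&sub).all(|(p, s)| *p == "*" || p == s)
}

/// Two-step lookup: direct hit first, then alias patterns.
/// `index` is a stand-in keyed by full path to keep the sketch short.
fn lookup_with_aliases<'a>(
    index: &'a HashMap<String, String>,
    aliases: &[PolicyAlias],
    subject: &str,
) -> Option<&'a String> {
    if let Some(hit) = index.get(subject) {
        return Some(hit); // fast path: no pattern matching needed
    }
    aliases
        .iter()
        .find(|a| a.target_patterns.iter().any(|p| glob_match(p, subject)))
        .and_then(|a| index.get(&a.policy_path))
}
```

The first alias whose pattern matches wins, so alias order matters in this sketch; whether the real system defines a precedence rule is not stated here.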

3. Keeps Control with Policy Authors

Security team explicitly states:

PolicyAlias {
    policy_path: "code://standards/tls/cert_verification",
    target_patterns: vec![
        "code://rust/*/tls/cert_verification",
        "code://go/*/tls/cert_verification",
    ],
}

This serves as both documentation (what matches what) and enforcement (the aliases are signed as part of the Trust Pack).


Extension Points: Future Matching Layers

Layer 3: Semantic Equivalence (Future)

Idea: Use embeddings to match concepts with different names.

Example:

Policy: "code://standards/tls/certificate_validation"
Code:   "code://rust/myapp/tls/cert_verification"

Embedding similarity: 0.92 → match

When to add: If alias management becomes too manual.
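If this layer were built, the comparison step might look like the sketch below. It assumes embeddings are already available as `f32` vectors (how they are produced is out of scope), and `MATCH_THRESHOLD` is a hypothetical cutoff: the example above treats 0.92 as a match, but no threshold is specified.

```rust
/// Cosine similarity between two embedding vectors.
/// Returns None for mismatched lengths, empty input, or a zero vector.
fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}

/// Hypothetical cutoff, chosen only for illustration.
const MATCH_THRESHOLD: f32 = 0.9;

/// Treat two concepts as matching when their similarity reaches the cutoff.
fn is_semantic_match(a: &[f32], b: &[f32]) -> bool {
    cosine_similarity(a, b).map_or(false, |s| s >= MATCH_THRESHOLD)
}
```

A threshold-based match is inherently less auditable than an explicit alias, which is one reason to keep this layer behind the first two.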


Layer 4: Ontology Mapping (Future)

Idea: Define semantic relationships between concepts.

Example:

ontology:
  "tls/cert_verification":
    equivalent_to:
      - "tls/certificate_validation"
      - "ssl/verify_certs"
    broader_than:
      - "security/transport_layer"

When to add: If multiple industries need cross-domain mapping.
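The `equivalent_to` relation could be held in memory as a map from each synonym to a canonical concept. The sketch below is an assumption about one possible shape, not a planned design: `Ontology`, `add_equivalents`, and `equivalent` are hypothetical names, and `broader_than` is left out.

```rust
use std::collections::HashMap;

/// Maps every known concept name to its canonical name.
/// Hypothetical structure mirroring the `equivalent_to` lists.
struct Ontology {
    canonical: HashMap<String, String>,
}

impl Ontology {
    fn new() -> Self {
        Ontology { canonical: HashMap::new() }
    }

    /// Register `concept` as canonical for itself and for each equivalent.
    fn add_equivalents(&mut self, concept: &str, equivalents: &[&str]) {
        self.canonical.insert(concept.to_string(), concept.to_string());
        for e in equivalents {
            self.canonical.insert(e.to_string(), concept.to_string());
        }
    }

    /// Two names are equivalent if they resolve to the same canonical name.
    /// Names unknown to the ontology are equivalent only to themselves.
    fn equivalent(&self, a: &str, b: &str) -> bool {
        match (self.canonical.get(a), self.canonical.get(b)) {
            (Some(x), Some(y)) => x == y,
            _ => a == b,
        }
    }
}
```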


Comparison: Alternative Approaches

Alt 1: Variable Tail Length

Idea: Try N=1, N=2, N=3 segment keys.

Problems:

  • Ambiguous matches (which key wins?)
  • Performance hit (3x lookups)
  • Doesn't solve semantic differences

Verdict: Rejected (complexity without solving root cause)
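To make the ambiguity concrete, here is a sketch of the keys this alternative would have to try for a single path. `candidate_keys` is a hypothetical helper written for this illustration, using the same assumed `::` key format as elsewhere in this document.

```rust
/// Candidate keys for tail lengths 1..=max_n, shortest tail first.
/// Hypothetical helper illustrating the rejected alternative.
fn candidate_keys(uri: &str, predicate: &str, max_n: usize) -> Vec<String> {
    // Drop the scheme if one is present.
    let path = uri.split_once("://").map_or(uri, |(_, rest)| rest);
    let segments: Vec<&str> = path.split('/').filter(|s| !s.is_empty()).collect();
    (1..=max_n.min(segments.len()))
        .map(|n| format!("{}::{}", segments[segments.len() - n..].join("/"), predicate))
        .collect()
}
```

For `code://rust/myapp/tls/cert_verification` with `max_n = 3`, this produces three keys (`cert_verification`, `tls/cert_verification`, and `myapp/tls/cert_verification`, each with the predicate appended), and any of them could hit a different policy entry.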


Alt 2: Normalize All Paths

Idea: Extractors output "canonical" paths that match standards.

Problems:

  • Loses language context (rust/myapp)
  • Breaks existing aliases/observations
  • Forces extractors to know about ALL standards

Verdict: Rejected (breaks modularity)


Alt 3: Dynamic Alias Discovery

Idea: Auto-create aliases during scan when tail-path matches but full path differs.

Problems:

  • Implicit behavior (hard to debug)
  • No security team approval (bypasses policy control)
  • May create false positives

Verdict: Future enhancement (as suggestions, not automatic)
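A suggestion-only variant could look like the sketch below: it reports candidate aliases for human review and never applies them. All names (`AliasSuggestion`, `suggest_aliases`, `tail`) are hypothetical, and the last-two-segments rule is the one described earlier in this document.

```rust
/// Last two path segments of a URI, or None if there are fewer than two.
fn tail(uri: &str) -> Option<String> {
    let path = uri.split_once("://").map_or(uri, |(_, rest)| rest);
    let segments: Vec<&str> = path.split('/').filter(|s| !s.is_empty()).collect();
    if segments.len() < 2 {
        return None;
    }
    Some(segments[segments.len() - 2..].join("/"))
}

/// A proposed alias for a human to review; never applied automatically.
#[derive(Debug, PartialEq)]
struct AliasSuggestion {
    policy_path: String,
    target_pattern: String,
}

/// Suggest an alias wherever a policy path and an observed code path share
/// a tail but differ as full paths.
fn suggest_aliases(policy_paths: &[&str], observed: &[&str]) -> Vec<AliasSuggestion> {
    let mut out = Vec::new();
    for p in policy_paths {
        for o in observed {
            if p != o && tail(p).is_some() && tail(p) == tail(o) {
                out.push(AliasSuggestion {
                    policy_path: p.to_string(),
                    target_pattern: o.to_string(),
                });
            }
        }
    }
    out
}
```

Keeping the output as data rather than mutating the alias table preserves the explicit-approval property: a suggestion only takes effect once someone adds it to a signed Trust Pack.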


Architectural Trade-offs

Chosen: Explicit Policy Aliases

Pros:

  • Clear provenance (aliases are in Trust Pack)
  • Auditable (covered by signature)
  • Flexible (glob patterns support many cases)
  • Backward compatible (empty aliases = current behavior)

Cons:

  • Requires manual alias creation
  • Adds cognitive overhead (security teams must think about patterns)
  • Another field in Trust Pack schema

Why this trade-off wins:

  • Enterprise adoption requires auditability
  • Security teams WANT explicit control
  • Manual work is one-time (create pack once, reuse everywhere)

Pattern 1: Language Wildcards

Use Case: Standard applies to all languages.

PolicyAlias {
    policy_path: "code://standards/tls/cert_verification",
    target_patterns: vec![
        "code://*/*/tls/cert_verification",  // any language, any project
    ],
}

Pattern 2: Project-Specific

Use Case: Internal policy for specific service.

PolicyAlias {
    policy_path: "code://internal/auth/jwt_validation",
    target_patterns: vec![
        "code://rust/auth-service/jwt/validation",
        "code://go/auth-service/jwt/validation",
    ],
}

Pattern 3: Domain-Scoped

Use Case: Cloud-specific rules.

PolicyAlias {
    policy_path: "code://vendor/aws/s3/public_access",
    target_patterns: vec![
        "code://*/*/aws/s3/bucket/public",
        "code://*/*/cloud/storage/s3/public_access",
    ],
}

Open Questions for Long-Term Evolution

Q1: Should we support recursive wildcards?

  • Current: code://rust/*/tls (single-segment wildcard)
  • Proposed: code://rust/**/tls (any depth)

Trade-off: More flexible, but harder to reason about matches.

Decision: Start with single-segment, add recursive if needed.
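For reference, recursive matching is a small extension of segment-wise matching. The sketch below assumes `*` matches exactly one segment and `**` matches zero or more; those semantics, and the names `match_segments` and `glob_match_recursive`, are assumptions for illustration rather than a committed design.

```rust
/// Match pattern segments against subject segments.
/// `*` matches exactly one segment; `**` matches zero or more (assumed semantics).
fn match_segments(pat: &[&str], sub: &[&str]) -> bool {
    let (first, rest) = match pat.split_first() {
        Some(parts) => parts,
        None => return sub.is_empty(), // pattern exhausted: subject must be too
    };
    if *first == "**" {
        // Let `**` absorb 0, 1, 2, ... leading segments of the subject.
        return (0..=sub.len()).any(|skip| match_segments(rest, &sub[skip..]));
    }
    match sub.split_first() {
        Some((s, sub_rest)) => (*first == "*" || first == s) && match_segments(rest, sub_rest),
        None => false,
    }
}

/// Split both strings on '/' and match segment by segment.
fn glob_match_recursive(pattern: &str, subject: &str) -> bool {
    let pat: Vec<&str> = pattern.split('/').collect();
    let sub: Vec<&str> = subject.split('/').collect();
    match_segments(&pat, &sub)
}
```

The backtracking over `**` is what makes matches harder to reason about: one pattern can cover paths of any depth, including ones the policy author never had in mind.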


Q2: Should aliases be bidirectional?

  • Current: Policy path → Code patterns (one direction)
  • Proposed: Allow code path → Policy path mapping

Use Case: "This code path is an exception to standard X."

Decision: Defer until use case emerges.


Q3: Should we cache pattern matches?

  • Current: Recompute glob match on every lookup
  • Proposed: Cache subject → policy_path map per scan

Trade-off: Faster (O(1) after first match) vs. memory overhead

Decision: Benchmark first and optimize only if needed (caching now would be premature optimization).
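Should benchmarking justify it, a per-scan cache could be as small as the sketch below. `AliasCache` is a hypothetical type; its `resolve` closure stands in for the glob-based alias search, and the `misses` counter exists only so the effect is measurable.

```rust
use std::collections::HashMap;

/// Per-scan memo of subject -> resolved policy path (None = no alias matched).
/// Hypothetical type; `resolve` stands in for the glob-based alias search.
struct AliasCache<F: Fn(&str) -> Option<String>> {
    resolve: F,
    memo: HashMap<String, Option<String>>,
    misses: usize, // how many times `resolve` actually ran
}

impl<F: Fn(&str) -> Option<String>> AliasCache<F> {
    fn new(resolve: F) -> Self {
        AliasCache { resolve, memo: HashMap::new(), misses: 0 }
    }

    /// Return the cached answer if present; otherwise resolve and remember it.
    /// Negative results are cached too, so repeated misses stay cheap.
    fn get(&mut self, subject: &str) -> Option<String> {
        if let Some(cached) = self.memo.get(subject) {
            return cached.clone();
        }
        self.misses += 1;
        let result = (self.resolve)(subject);
        self.memo.insert(subject.to_string(), result.clone());
        result
    }
}
```

Scoping the cache to a single scan sidesteps invalidation: aliases cannot change mid-scan, and the memory is released when the scan ends.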


Q4: Should policy aliases be mergeable?

  • Current: Each Trust Pack has independent aliases
  • Proposed: Allow Trust Pack B to "extend" Trust Pack A's aliases

Use Case: Company-wide base pack + team-specific extensions

Decision: Future enhancement (Trust Pack composition system).
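One possible merge rule, sketched under assumptions: patterns for the same `policy_path` are unioned, and aliases for new policy paths are appended. `merge_aliases` is a hypothetical function, the `String` field types are assumed, and how signatures would cover a composed pack is not addressed here.

```rust
/// Alias from one policy path to the code paths it should cover.
/// Field names follow this document; the String types are an assumption.
#[derive(Debug, Clone, PartialEq)]
struct PolicyAlias {
    policy_path: String,
    target_patterns: Vec<String>,
}

/// Merge an extension pack's aliases into a base pack's aliases.
/// Patterns for an existing policy_path are unioned (base order first,
/// duplicates skipped); aliases for new policy paths are appended.
fn merge_aliases(base: &[PolicyAlias], extension: &[PolicyAlias]) -> Vec<PolicyAlias> {
    let mut merged: Vec<PolicyAlias> = base.to_vec();
    for ext in extension {
        if let Some(i) = merged.iter().position(|a| a.policy_path == ext.policy_path) {
            for p in &ext.target_patterns {
                if !merged[i].target_patterns.contains(p) {
                    merged[i].target_patterns.push(p.clone());
                }
            }
        } else {
            merged.push(ext.clone());
        }
    }
    merged
}
```

This rule is additive only: an extension can widen what a base alias covers but cannot remove a base pattern. Whether that is the right policy is itself an open design question.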


Guiding Heuristic

When adding matching features, ask:

  1. Does this preserve tail-path matching for the common case?
    • Yes → Maintains performance
    • No → Reconsider
  2. Is the behavior explicit and auditable?
    • Yes → Security teams can reason about it
    • No → Will cause trust issues
  3. Can it be disabled or overridden?
    • Yes → Progressive adoption
    • No → May block some use cases
  4. Does it add cognitive overhead?
    • Minimal → Worth the flexibility
    • Significant → Document heavily or defer

Conclusion

Current State: Tail-path matching works for bundled corpus but breaks for enterprise policies.

Root Cause: Semantic mismatch between policy hierarchies and extractor output.

Solution: Add explicit policy aliases as a second matching layer.

Philosophy: Start simple (tail-path), add precision layers (aliases), keep it auditable.

Future: Semantic embeddings and ontology mapping if manual aliases become burdensome.


This document should guide future matching feature discussions. Always return to the core principles: semantic matching, progressive precision, explicit control.