# Aphoria Architecture Documentation

This directory contains architectural decision records, analysis, and design philosophy for Aphoria.

---

## System Overview

Aphoria is a **code-level truth linter** that validates code against authoritative sources (RFCs, OWASP, vendor docs). It extracts implicit claims from code and configs, then checks them against a tiered authority system.

### High-Level Architecture

```
┌──────────────────────────────────────────────────────────────────────────┐
│                           Aphoria CLI Pipeline                           │
├──────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  ┌──────────────┐                                                        │
│  │   CLI/Args   │ ──▶ handlers.rs dispatches to scan, policy, research   │
│  └──────────────┘                                                        │
│         │                                                                │
│         ▼                                                                │
│  ┌──────────────┐   ┌────────────────┐   ┌──────────────┐                │
│  │   Walker     │──▶│   Extractors   │──▶│    Bridge    │                │
│  │ (walk files) │   │ (14 built-in)  │   │ (claim→assn) │                │
│  └──────────────┘   └────────────────┘   └──────────────┘                │
│         │                   │                   │                        │
│         │                   │                   ▼                        │
│         │                   │          ┌──────────────────┐              │
│         │                   │          │  Episteme Layer  │              │
│         │                   │          │                  │              │
│         │                   │          │ ┌──────────────┐ │              │
│         │                   │          │ │  Ephemeral   │ │ ◀─ Fast path │
│         │                   │          │ │  Detector    │ │    (~0.25s)  │
│         │                   │          │ └──────────────┘ │              │
│         │                   │          │        OR        │              │
│         │                   │          │ ┌──────────────┐ │              │
│         │                   │          │ │    Local     │ │ ◀─ Full path │
│         │                   │          │ │   Episteme   │ │    (~1-2s)   │
│         │                   │          │ └──────────────┘ │              │
│         │                   │          └──────────────────┘              │
│         │                   │                   │                        │
│         ▼                   ▼                   ▼                        │
│  ┌────────────────────────────────────────────────────────────────┐      │
│  │                       Conflict Detection                       │      │
│  │  ConceptIndex (tail-path) + Aliases + Policy Source Tracking   │      │
│  └────────────────────────────────────────────────────────────────┘      │
│                             │                                            │
│                             ▼                                            │
│  ┌──────────────┐   ┌────────────────┐   ┌──────────────┐                │
│  │   Report     │   │  Drift Check   │   │ Observation  │                │
│  │ (table/json/ │   │ (self-conflict)│   │  Write-back  │                │
│  │  sarif/md)   │   │                │   │   (--sync)   │                │
│  └──────────────┘   └────────────────┘   └──────────────┘                │
│                                                                          │
└──────────────────────────────────────────────────────────────────────────┘
```

### Data Flow

1. **WALK** - Traverse project directory (respects `.gitignore`, supports `--staged` for git-staged files only)
2. **EXTRACT** - Run 14 built-in extractors + declarative extractors to find implicit claims
3. **INGEST** - Convert claims to Episteme assertions (BLAKE3 hash + Ed25519 signature)
4. **CONFLICT** - Query ConceptIndex for authority matches using tail-path matching
5. **DRIFT** - Compare against prior observations (self-conflict detection)
6. **REPORT** - Output in table, JSON, SARIF 2.1.0, or Markdown format
7. **SYNC** - (Optional) Write-back novel observations to local store or hosted server

---

## Key Modules

| Module | Purpose | Key Files |
|--------|---------|-----------|
| `cli.rs` | Clap-based CLI argument parsing | Command definitions |
| `handlers.rs` | Command dispatch, validation | `--sync requires --persist` |
| `scan.rs` | Main scan orchestrator | Mode dispatch, observation flow |
| `walker/` | Project traversal | `mod.rs`, `git.rs`, `path_mapper.rs`, `language.rs` |
| `extractors/` | 14 pattern-based claim extractors | `mod.rs`, individual extractors |
| `bridge.rs` | ExtractedClaim → Assertion conversion | BLAKE3 hashing, Ed25519 signing |
| `episteme/` | Conflict detection core | `ephemeral.rs`, `local.rs`, `concept_index.rs` |
| `policy.rs` | Trust Pack management | Load/save/verify signed packs |
| `policy_ops.rs` | `bless`, `ack`, `update`, `export/import` | CLI policy operations |
| `report/` | Output formatting | `table.rs`, `json.rs`, `sarif.rs`, `markdown.rs` |
| `hosted.rs` | HTTP client for team aggregation | Push observations to remote server |
| `community/` | Anonymous pattern contribution | `anonymizer.rs`, `types.rs` |
| `research/` | Gap detection and auto-research | `gap_detector.rs`, `researcher.rs` |
| `config/` | `aphoria.toml` parsing | All configuration types |
| `types/` | Domain types | `claim.rs`, `verdict.rs`, `result.rs`, `command.rs` |
| `corpus/` | Authoritative source builders | `rfc/`, `owasp/`, `vendor.rs`, `hardcoded.rs` |

---

## Scan Modes

| Mode | Storage | Performance | Features |
|------|---------|-------------|----------|
| **Ephemeral** (default) | None | ~0.25s | Conflict detection only |
| **Persistent** (`--persist`) | WAL + KV | ~1-2s | Baseline, diff, aliases, drift, observation write-back |

### Ephemeral Mode (`EphemeralDetector`)

- Builds corpus + ConceptIndex entirely in-memory
- No disk I/O during scan
- Perfect for CI/pre-commit hooks
- Cannot detect drift (no prior state)
- Cannot write observations (no storage)

### Persistent Mode (`LocalEpisteme`)

- Full Episteme stack initialization
- WAL recovery on startup
- Enables: baseline tracking, diff, auto-alias creation, drift detection, `--sync`

---

## Authority Tiers

| Tier | Source | Example | Weight |
|------|--------|---------|--------|
| 0 | Regulatory | RFC 7519: "JWT audience validation is mandatory" | 1.0 |
| 1 | Clinical | OWASP: "TLS certificate verification required" | 0.9 |
| 2 | Observational | Vendor docs: "Redis timeout should be > 0" | 0.7 |
| 3 | Expert | Team policy: "Our pool size is 50" | 0.5 |
| 4 | Community | Prior observations from this codebase | 0.3 |

**Conflict Score Formula:**

```
score = Σ(tier_weight × assertion_confidence × value_difference)
```

---

## Concept Matching

### Tail-Path Matching (ConceptIndex)

The primary matching algorithm uses the last 2 path segments to enable cross-scheme matching:

```
RFC assertion:  rfc://5246/tls/cert_verification
Code claim:     code://rust/myapp/tls/cert_verification

Both produce key: "tls/cert_verification::enabled"
```

**Algorithm:**

1. Strip scheme (`rfc://`, `code://`)
2. Take last 2 non-empty path segments
3. Append predicate
4. Key = `{seg[-2]}/{seg[-1]}::{predicate}`

### Alias Resolution

When tail-path matching fails, the system checks registered aliases.
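The lookup order described so far (tail-path key first, registered aliases as a fallback) can be sketched in Rust as follows. This is a minimal illustration, not the code in `src/episteme/concept_index.rs`: the function names `tail_path_key` and `lookup`, the `HashMap`-based index, the alias map keyed by tail-path key, and the choice to return `None` for paths with fewer than two segments are all assumptions made for the example.

```rust
use std::collections::HashMap;

/// Build the key `{seg[-2]}/{seg[-1]}::{predicate}` from a concept URI.
/// Assumption: URIs with fewer than two non-empty segments yield `None`.
fn tail_path_key(uri: &str, predicate: &str) -> Option<String> {
    // Step 1: strip the scheme (`rfc://`, `code://`, ...).
    let path = uri.split_once("://").map_or(uri, |(_, rest)| rest);
    // Step 2: keep only non-empty segments and take the last two.
    let segs: Vec<&str> = path.split('/').filter(|s| !s.is_empty()).collect();
    if segs.len() < 2 {
        return None;
    }
    let n = segs.len();
    // Steps 3-4: append the predicate to form the index key.
    Some(format!("{}/{}::{}", segs[n - 2], segs[n - 1], predicate))
}

/// Hypothetical lookup: try the direct tail-path key, then fall back to an
/// alias that maps one key to another. Values are authority URIs.
fn lookup<'a>(
    index: &'a HashMap<String, Vec<String>>,
    aliases: &HashMap<String, String>,
    uri: &str,
    predicate: &str,
) -> Option<&'a Vec<String>> {
    let key = tail_path_key(uri, predicate)?;
    if let Some(hit) = index.get(&key) {
        return Some(hit);
    }
    let alias_key = aliases.get(&key)?;
    index.get(alias_key)
}
```

With this sketch, `rfc://5246/tls/cert_verification` and `code://rust/myapp/tls/cert_verification` both map to the key `tls/cert_verification::enabled`, so a single hash lookup connects the code claim to the RFC assertion.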
Aliases can be:

- **Auto-created** - When conflicts are detected, persist the relationship (persistent mode)
- **Manual** - Created via `aphoria bless` or Trust Pack import
- **Policy aliases** - (Planned) From Trust Packs for enterprise policy enforcement - see [Policy Alias Implementation](./policy-alias-implementation.md)

---

## Extractors

### Built-in Extractors (14)

| Extractor | Languages | Detects |
|-----------|-----------|---------|
| `tls_verify` | 8 | TLS certificate verification disabled |
| `tls_version` | 8 | Deprecated TLS 1.0/1.1 per RFC 8996 |
| `jwt_config` | 8 | JWT alg:none, skip signature verification |
| `hardcoded_secrets` | 8 | API keys, passwords in code |
| `timeout_config` | 8 | HTTP/DB/Redis timeout values |
| `dep_versions` | 3 | Dependency versions for advisory lookup |
| `cors_config` | 8 | CORS wildcard + credentials |
| `rate_limit` | 8 | Rate limiting configuration |
| `weak_crypto` | 5 | MD5, SHA1, DES, RC4 usage |
| `sql_injection` | 5 | SQL string interpolation |
| `command_injection` | 5 | Shell exec, os.system |
| `unreal_cpp` | C++ | Unreal Engine Exec functions |
| `unreal_config` | INI | Unreal Engine INI patterns |
| `unreal_performance` | C++ | Synchronous asset loading |

### Declarative Extractors

Users can define custom extractors in `aphoria.toml`:

```toml
[[extractors.declarative]]
name = "deprecated_api_v1"
description = "Detects usage of deprecated v1 API endpoints"
languages = ["go", "rust", "python"]
pattern = '/api/v1/\w+'
claim.subject = "api/deprecated_endpoint"
claim.predicate = "version"
claim.value = "v1"
confidence = 1.0
```

---

## Verdicts

| Verdict | Score Range | Exit Code | Action |
|---------|-------------|-----------|--------|
| `Block` | ≥ 0.7 | 2 | Must fix before commit |
| `Flag` | ≥ 0.4 and < 0.7 | 1 | Should review |
| `Pass` | < 0.4 | 0 | No conflict |
| `Ack` | N/A | 0 | Acknowledged intentional |
| `Drift` | N/A | 1 | Changed from prior value |

---

## Trust Packs (Phase 6)

Signed bundles of assertions and aliases for federated policy distribution.

**Schema:**

```rust
pub struct TrustPack {
    pub header: PackHeader,          // name, version, issuer_id, timestamp
    pub assertions: Vec<Assertion>,  // element type name assumed
    pub aliases: Vec<ConceptAlias>,  // element type name assumed
    pub signature: [u8; 64],         // Ed25519 signature
}
```

**Operations:**

- `aphoria policy export` - Create signed pack from local decisions
- `aphoria policy import` - Load pack, verify signature, ingest assertions
- `aphoria.toml` - Auto-load policies from `policies = [...]` list

---

## Hosted Mode (Phase 4E)

Team aggregation via central StemeDB server.

```toml
[hosted]
url = "https://episteme.acme.corp"
project_id = "billing-service"
team_id = "platform-team"
sync_mode = "remote-only"        # or "local-and-remote"
offline_fallback = "skip"        # or "fail" or "queue"
api_key_env = "APHORIA_API_KEY"
```

**Flow:**

```
Developer scans → HostedClient → POST /v1/aphoria/observations → Team Server
```

---

## Community Sharing (Phase 5.6)

Opt-in anonymous pattern contribution.

**Privacy Model:**

- Project names wildcarded: `code://rust/myapp/tls` → `code://rust/*/tls`
- File paths, line numbers, matched text NEVER shared
- Timestamps rounded to hour (k-anonymity)
- `enabled` defaults to `false` (explicit opt-in)

```toml
[community]
enabled = true
anonymize = true
min_confidence = 0.8
exclude = ["vendor://acme/internal/*"]
```

---

## Key Documents

### Concept Matching System

**Problem:** How do we match code extractors to authoritative policies across different hierarchies?

1. **[Concept Matching Analysis](./concept-matching-analysis.md)**
   - Identifies the gap: tail-path matching works for RFCs but breaks for enterprise policies
   - Analyzes root cause: semantic mismatch between policy hierarchies and extractor output
   - Proposes solution: explicit policy aliases in Trust Packs

2. **[Policy Alias Implementation Guide](./policy-alias-implementation.md)**
   - Day-by-day implementation plan (5 phases over 3 days)
   - Code sketches with exact file locations
   - Test strategies and success criteria
   - Migration and rollout plan

3. **[Matching Philosophy](./matching-philosophy.md)**
   - Core design principles: semantic over syntactic, progressive precision, explicit control
   - Why tail-path matching works (by design for RFC/OWASP corpus)
   - Why it breaks (enterprise hierarchies violate assumptions)
   - Future extension points (semantic embeddings, ontology mapping)

4. **[Enterprise Validation](./enterprise-validation.md)**
   - End-to-end scenario walkthrough
   - Validates that policy aliases solve the enterprise use case
   - Edge case analysis
   - Real-world adoption path

---

## Quick Reference

### When to Read What

| If you need to... | Read this |
|-------------------|-----------|
| Understand the problem | [Concept Matching Analysis](./concept-matching-analysis.md) |
| Implement the solution | [Policy Alias Implementation](./policy-alias-implementation.md) |
| Understand design philosophy | [Matching Philosophy](./matching-philosophy.md) |
| Validate enterprise scenarios | [Enterprise Validation](./enterprise-validation.md) |
| Add a new extractor | `src/extractors/mod.rs` |
| Understand scan flow | `src/scan.rs` |
| Modify conflict detection | `src/episteme/conflict.rs` |
| Work with Trust Packs | `src/policy.rs`, `src/policy_ops.rs` |

---

## Architecture Decisions

### AD-001: Explicit Policy Aliases

**Status:** Approved (2026-02-04) - **Not Yet Implemented**

**Context:** Security teams need to create policies using logical hierarchies (`code://standards/*`) that don't align with extractor output (`code://rust/myapp/*`).

**Decision:** Add `PolicyAlias` type to Trust Packs with glob pattern matching.

**Implementation:** See [Policy Alias Implementation Guide](./policy-alias-implementation.md) for the detailed implementation plan.
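To make the decision concrete, here is a minimal sketch of what glob-based alias resolution could look like. Since AD-001 is not yet implemented, everything here is hypothetical: the `PolicyAlias` fields (`pattern`, `target`), the functions `glob_match` and `resolve`, and the glob semantics (a `*` matches exactly one path segment, except that a trailing `*` matches any non-empty remainder) are assumptions for illustration. The implementation guide remains the authoritative design.

```rust
/// Hypothetical alias: claims whose URI matches `pattern` are checked
/// against the policy concept named by `target`.
struct PolicyAlias {
    pattern: String,
    target: String,
}

/// Segment-wise glob match (assumed semantics, see lead-in).
fn glob_match(pattern: &str, uri: &str) -> bool {
    let pat: Vec<&str> = pattern.split('/').collect();
    let segs: Vec<&str> = uri.split('/').collect();
    for (i, p) in pat.iter().enumerate() {
        let is_last = i + 1 == pat.len();
        match segs.get(i) {
            // Trailing `*`: accept any non-empty remainder of the URI.
            Some(s) if *p == "*" && is_last => return !s.is_empty(),
            // Inner `*` matches one segment; other segments must be equal.
            Some(s) if *p == "*" || p == s => continue,
            _ => return false,
        }
    }
    // Without a trailing `*`, both sides must have the same length.
    segs.len() == pat.len()
}

/// Return the target of the first alias whose pattern matches `uri`.
/// This is a linear scan over the patterns.
fn resolve<'a>(aliases: &'a [PolicyAlias], uri: &str) -> Option<&'a str> {
    aliases
        .iter()
        .find(|a| glob_match(&a.pattern, uri))
        .map(|a| a.target.as_str())
}
```

Under these assumptions, an alias with pattern `code://*/*/tls/cert_verification` and target `code://standards/tls/cert_verification` would route the extractor output `code://rust/myapp/tls/cert_verification` to the security team's policy concept.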
**Consequences:**

- ✅ Enables enterprise policy enforcement
- ✅ Maintains backward compatibility
- ✅ Keeps security teams in control (explicit aliases)
- ⚠️ Requires manual alias creation
- ⚠️ Adds cognitive overhead (pattern syntax)

### AD-002: Ephemeral Mode Default

**Status:** Implemented (2026-01-28)

**Context:** Full Episteme initialization took ~1-2s, too slow for pre-commit hooks.

**Decision:** Default to ephemeral mode (in-memory only), opt-in to persistent with `--persist`.

**Consequences:**

- ✅ Roughly 4-8x faster scans (~0.25s vs. ~1-2s)
- ✅ No storage pollution for quick checks
- ⚠️ Drift detection requires `--persist`
- ⚠️ Observation write-back requires `--persist --sync`

### AD-003: Tail-Path Matching

**Status:** Implemented

**Context:** Need to match code claims against RFC/OWASP assertions with different URI schemes.

**Decision:** Use last 2 path segments + predicate as index key.

**Consequences:**

- ✅ O(1) lookup via HashMap
- ✅ Works for RFC/OWASP corpus by design
- ⚠️ Breaks for enterprise policies with different hierarchies (solved by AD-001)

---

## Design Principles

### 1. Semantic Over Syntactic

Match concepts by meaning, not exact string paths.

### 2. Progressive Precision

Start with simple heuristics (tail-path), add layers (aliases, embeddings) as needed.

### 3. Explicit Over Implicit

Matching logic should be transparent, auditable, and controllable.

### 4. Zero Configuration (for common cases)

Bundled corpus (RFCs, OWASP) should "just work" with tail-path matching.

### 5. Cryptographic Trust

All policies are signed (Ed25519) and verified before use.

### 6. Privacy by Default

Community sharing is opt-in with anonymization enabled by default.
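As an illustration of principle 6, two of the anonymization rules listed under Community Sharing (project names wildcarded, timestamps rounded to the hour) could look like the sketch below. It is not the code in `community/anonymizer.rs`: the function names, the assumption that the project name is the second path segment of a `code://` URI, and the choice to round timestamps down are all hypothetical.

```rust
/// Replace the project segment of a `code://{lang}/{project}/...` URI with `*`.
/// Assumption: the project name is always the second path segment, and
/// URIs with other schemes are returned unchanged.
fn wildcard_project(uri: &str) -> String {
    let Some((scheme, path)) = uri.split_once("://") else {
        return uri.to_string();
    };
    let mut segs: Vec<&str> = path.split('/').collect();
    if scheme == "code" && segs.len() >= 2 {
        segs[1] = "*";
    }
    format!("{scheme}://{}", segs.join("/"))
}

/// Truncate a Unix timestamp (seconds) down to the start of its hour.
/// Assumption: "rounded to hour" means rounding down.
fn round_to_hour(unix_secs: u64) -> u64 {
    unix_secs - unix_secs % 3600
}
```

Here `wildcard_project("code://rust/myapp/tls")` yields `code://rust/*/tls`, matching the example in the privacy model.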
---

## Extension Points

### Current (2026-02-05)

- Tail-path matching (O(1) hash lookup)
- Concept aliases (auto-created on conflict detection)
- Declarative extractors (user-defined in TOML)
- Hosted mode (team aggregation)
- Community corpus (anonymous sharing)

### In Progress

- **Policy aliases** - Enterprise policy matching via glob patterns ([AD-001](./policy-alias-implementation.md))

### Planned (Q1 2026)

- Semantic embeddings (fuzzy matching via vector similarity)
- Alias auto-discovery (suggest aliases during scan)
- High-entropy secret detection
- Framework-specific extractors (Spring, Django, Express)

### Future (Q2+ 2026)

- Ontology mapping (define semantic relationships)
- Trust Pack composition (packs can extend other packs)
- LLM-assisted extraction (semantic code understanding)
- Config file deep parsing (structured YAML/JSON/TOML)

---

## Performance Targets

### Scan Time

- **Ephemeral:** < 0.3s for typical project
- **Persistent:** < 2s for typical project
- **With Policy Aliases:** < 5% increase

### Memory Overhead

- **Policy Alias Storage:** ~100 bytes per alias
- **Typical Trust Pack:** < 10 KB (10 aliases)
- **Corpus in memory:** ~2-5 MB (varies by sources enabled)

### Lookup Complexity

- **Direct tail-path:** O(1)
- **Concept alias resolution:** O(A) where A=aliases
- **Policy alias fallback (planned):** O(P * A) where P=patterns, A=aliases

---

## Testing Strategy

### Unit Tests

- Extractor pattern matching
- ConceptIndex key generation
- Conflict score calculation
- Trust Pack serialization/verification

### Integration Tests

- Full scan flow with corpus
- Trust Pack import/export
- Drift detection
- Observation write-back

### UAT Scenarios

- Enterprise security team workflow
- Multi-language policy enforcement
- CI/CD integration
- Hosted mode aggregation

---

## Related Documentation

### Product

- [Product Overview](../../product.md) - What Aphoria does
- [Roadmap](../../roadmap.md) - Implementation status and plans

### Guides

- [Enterprise Quick Start](../guides/enterprise-quick-start.md) - Getting started
- [Federating Truth](../guides/federating-truth.md) - Trust Pack workflows

### Implementation

- [Policy Ops](../../src/policy_ops.rs) - Trust Pack CLI handlers
- [Concept Index](../../src/episteme/concept_index.rs) - Matching algorithm
- [Local Episteme](../../src/episteme/local.rs) - Conflict detection
- [Ephemeral Detector](../../src/episteme/ephemeral.rs) - Fast path

---

## Questions or Feedback?

Discuss in:

- `#aphoria-architecture` (internal Slack)
- GitHub Issues (public feedback)
- Architecture review meetings (Fridays 2pm PT)

---

**This directory is the source of truth for architectural decisions.** All major changes should be documented here before implementation.

---

*Last updated: 2026-02-05*