Aphoria Architecture Documentation

This directory contains architectural decision records, analysis, and design philosophy for Aphoria.


System Overview

Aphoria is a code-level truth linter that validates code against authoritative sources (RFCs, OWASP, vendor docs). It extracts implicit claims from code and configs, then checks them against a tiered authority system.

High-Level Architecture

┌──────────────────────────────────────────────────────────────────────────┐
│                        Aphoria CLI Pipeline                               │
├──────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  ┌──────────────┐                                                        │
│  │   CLI/Args   │ ──▶ handlers.rs dispatches to scan, policy, research   │
│  └──────────────┘                                                        │
│         │                                                                │
│         ▼                                                                │
│  ┌──────────────┐  ┌────────────────┐  ┌──────────────┐                  │
│  │    Walker    │─▶│  Extractors    │─▶│    Bridge    │                  │
│  │ (walk files) │  │ (14 built-in)  │  │ (claim→assn) │                  │
│  └──────────────┘  └────────────────┘  └──────────────┘                  │
│         │                 │                    │                         │
│         │                 │                    ▼                         │
│         │                 │         ┌──────────────────┐                 │
│         │                 │         │  Episteme Layer  │                 │
│         │                 │         │                  │                 │
│         │                 │         │ ┌──────────────┐ │                 │
│         │                 │         │ │  Ephemeral   │ │ ◀─ Fast path    │
│         │                 │         │ │  Detector    │ │    (~0.25s)     │
│         │                 │         │ └──────────────┘ │                 │
│         │                 │         │        OR        │                 │
│         │                 │         │ ┌──────────────┐ │                 │
│         │                 │         │ │   Local      │ │ ◀─ Full path    │
│         │                 │         │ │  Episteme    │ │    (~1-2s)      │
│         │                 │         │ └──────────────┘ │                 │
│         │                 │         └──────────────────┘                 │
│         │                 │                    │                         │
│         ▼                 ▼                    ▼                         │
│  ┌────────────────────────────────────────────────────────────────┐      │
│  │                      Conflict Detection                         │      │
│  │  ConceptIndex (tail-path) + Aliases + Policy Source Tracking   │      │
│  └────────────────────────────────────────────────────────────────┘      │
│                                    │                                     │
│                                    ▼                                     │
│  ┌──────────────┐  ┌────────────────┐  ┌──────────────┐                  │
│  │    Report    │  │  Drift Check   │  │  Observation │                  │
│  │ (table/json/ │  │ (self-conflict)│  │  Write-back  │                  │
│  │  sarif/md)   │  │                │  │  (--sync)    │                  │
│  └──────────────┘  └────────────────┘  └──────────────┘                  │
│                                                                          │
└──────────────────────────────────────────────────────────────────────────┘

Data Flow

  1. WALK - Traverse project directory (respects .gitignore, supports --staged for git-staged files only)
  2. EXTRACT - Run 14 built-in extractors + declarative extractors to find implicit claims
  3. INGEST - Convert claims to Episteme assertions (BLAKE3 hash + Ed25519 signature)
  4. CONFLICT - Query ConceptIndex for authority matches using tail-path matching
  5. DRIFT - Compare against prior observations (self-conflict detection)
  6. REPORT - Output in table, JSON, SARIF 2.1.0, or Markdown format
  7. SYNC - (Optional) Write-back novel observations to local store or hosted server

Key Modules

| Module | Purpose | Key Files |
|--------|---------|-----------|
| cli.rs | Clap-based CLI argument parsing | Command definitions |
| handlers.rs | Command dispatch, validation | --sync requires --persist |
| scan.rs | Main scan orchestrator | Mode dispatch, observation flow |
| walker/ | Project traversal | mod.rs, git.rs, path_mapper.rs, language.rs |
| extractors/ | 14 pattern-based claim extractors | mod.rs, individual extractors |
| bridge.rs | Observation → Assertion conversion | BLAKE3 hashing, Ed25519 signing |
| episteme/ | Conflict detection core | ephemeral.rs, local.rs, concept_index.rs |
| policy.rs | Trust Pack management | Load/save/verify signed packs |
| policy_ops.rs | bless, ack, update, export/import | CLI policy operations |
| report/ | Output formatting | table.rs, json.rs, sarif.rs, markdown.rs |
| hosted.rs | HTTP client for team aggregation | Push observations to remote server |
| community/ | Anonymous pattern contribution | anonymizer.rs, types.rs |
| research/ | Gap detection and auto-research | gap_detector.rs, researcher.rs |
| config/ | aphoria.toml parsing | All configuration types |
| types/ | Domain types | claim.rs, verdict.rs, result.rs, command.rs |
| corpus/ | Authoritative source builders | community.rs, rfc/, owasp/, vendor.rs, enricher.rs |

Scan Modes

| Mode | Storage | Performance | Features |
|------|---------|-------------|----------|
| Ephemeral (default) | None | ~0.25s | Conflict detection only |
| Persistent (--persist) | WAL + KV | ~1-2s | Baseline, diff, aliases, drift, observation write-back |

Ephemeral Mode (EphemeralDetector)

  • Builds corpus + ConceptIndex entirely in-memory
  • No disk I/O during scan
  • Perfect for CI/pre-commit hooks
  • Cannot detect drift (no prior state)
  • Cannot write observations (no storage)

Persistent Mode (LocalEpisteme)

  • Full Episteme stack initialization
  • WAL recovery on startup
  • Enables: baseline tracking, diff, auto-alias creation, drift detection, --sync
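
The mode rules above (ephemeral by default, --persist to opt in, --sync only valid with --persist) can be sketched as a small validation step. This is a minimal illustration, not the actual handlers.rs code: `ScanFlags`, `ScanMode`, and `resolve_mode` are hypothetical names introduced here.

```rust
/// Hypothetical flag set; the real CLI struct may look different.
#[derive(Debug, Clone, Copy)]
pub struct ScanFlags {
    pub persist: bool,
    pub sync: bool,
}

/// The two scan modes described in this section.
#[derive(Debug, PartialEq, Eq)]
pub enum ScanMode {
    /// In-memory only: no drift detection, no observation write-back.
    Ephemeral,
    /// Full stack; `write_back` is true when --sync was also passed.
    Persistent { write_back: bool },
}

/// Map CLI flags to a scan mode, rejecting --sync without --persist.
pub fn resolve_mode(flags: ScanFlags) -> Result<ScanMode, String> {
    match (flags.persist, flags.sync) {
        (false, true) => Err("--sync requires --persist".to_string()),
        (false, false) => Ok(ScanMode::Ephemeral),
        (true, sync) => Ok(ScanMode::Persistent { write_back: sync }),
    }
}
```

Rejecting the invalid combination up front keeps the ephemeral path free of any storage concerns.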

Authority Tiers

| Tier | Source | Example | Weight |
|------|--------|---------|--------|
| 0 | Regulatory | RFC 7519: "JWT audience validation is mandatory" | 1.0 |
| 1 | Clinical | OWASP: "TLS certificate verification required" | 0.9 |
| 2 | Observational | Vendor docs: "Redis timeout should be > 0" | 0.7 |
| 3 | Expert | Team policy: "Our pool size is 50" | 0.5 |
| 4 | Community | Prior observations from this codebase | 0.3 |

Conflict Score Formula:

score = Σ(tier_weight × assertion_confidence × value_difference)
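
As a rough illustration of the formula, the sketch below sums the three factors over every authority assertion matched by a claim. The tier weights come from the table above. The `AuthorityMatch` struct, the [0, 1] range for `confidence`, and the meaning of `value_difference` (0.0 = identical, 1.0 = contradictory) are assumptions made for this example, not details confirmed by the source.

```rust
/// Tier weights copied from the Authority Tiers table.
/// Returns None for tiers the table does not define.
pub fn tier_weight(tier: u8) -> Option<f64> {
    match tier {
        0 => Some(1.0), // Regulatory
        1 => Some(0.9), // Clinical
        2 => Some(0.7), // Observational
        3 => Some(0.5), // Expert
        4 => Some(0.3), // Community
        _ => None,
    }
}

/// One authority assertion matched against a code claim (hypothetical shape).
pub struct AuthorityMatch {
    pub tier: u8,
    /// assertion_confidence, assumed to lie in [0, 1].
    pub confidence: f64,
    /// Assumed: 0.0 when values agree, 1.0 when they contradict.
    pub value_difference: f64,
}

/// score = Σ(tier_weight × assertion_confidence × value_difference).
/// Matches with an unknown tier are skipped.
pub fn conflict_score(matches: &[AuthorityMatch]) -> f64 {
    matches
        .iter()
        .filter_map(|m| tier_weight(m.tier).map(|w| w * m.confidence * m.value_difference))
        .sum()
}
```

With several matches the sum can exceed 1.0. This document does not say whether the real score is clamped or normalized, so the sketch leaves it unbounded.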

Concept Matching

Tail-Path Matching (ConceptIndex)

The primary matching algorithm uses the last 2 path segments to enable cross-scheme matching:

RFC assertion:  rfc://5246/tls/cert_verification
Code claim:     code://rust/myapp/tls/cert_verification

Both produce key: "tls/cert_verification::enabled"

Algorithm:

  1. Strip scheme (rfc://, code://)
  2. Take last 2 non-empty path segments
  3. Append predicate
  4. Key = {seg[-2]}/{seg[-1]}::{predicate}
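
The four steps above can be sketched as a single function. This is an illustrative version, not the ConceptIndex source: the name `tail_path_key` is hypothetical, and returning `None` for paths with fewer than two segments is an assumption, since the document does not say how short paths are handled.

```rust
/// Build a tail-path key from a concept URI and a predicate.
/// Returns None when the path has fewer than two non-empty segments
/// (assumed behavior; not specified in the source).
pub fn tail_path_key(uri: &str, predicate: &str) -> Option<String> {
    // 1. Strip the scheme ("rfc://", "code://", ...) if one is present.
    let path = uri.split_once("://").map_or(uri, |(_, rest)| rest);

    // 2. Keep only non-empty path segments, then take the last two.
    let segments: Vec<&str> = path.split('/').filter(|s| !s.is_empty()).collect();
    if segments.len() < 2 {
        return None;
    }
    let n = segments.len();

    // 3-4. Append the predicate: {seg[-2]}/{seg[-1]}::{predicate}
    Some(format!("{}/{}::{}", segments[n - 2], segments[n - 1], predicate))
}
```

Both URIs from the example, `rfc://5246/tls/cert_verification` and `code://rust/myapp/tls/cert_verification`, map to the same key when passed the predicate `enabled`, which is what makes the cross-scheme lookup a single hash probe.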

Alias Resolution

When tail-path matching fails, the system checks registered aliases. Aliases can be:

  • Auto-created - When a conflict is detected, the matched relationship is persisted as an alias (persistent mode only)
  • Manual - Created via aphoria bless or Trust Pack import
  • Policy aliases - (Planned) From Trust Packs for enterprise policy enforcement - see Policy Alias Implementation
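
The fallback order (direct tail-path hit first, registered aliases second) might look like the sketch below. `ConceptLookup` and its methods are hypothetical names, and assertions are represented by plain string labels for brevity. Aliases are scanned linearly, which is an assumption chosen to match the O(A) cost listed under Lookup Complexity.

```rust
use std::collections::HashMap;

/// Hypothetical lookup structure: a tail-path index plus a list of aliases.
pub struct ConceptLookup {
    /// tail-path key -> labels of authority assertions stored under it.
    index: HashMap<String, Vec<String>>,
    /// (claim key, authority key) pairs, scanned linearly.
    aliases: Vec<(String, String)>,
}

impl ConceptLookup {
    pub fn new() -> Self {
        ConceptLookup { index: HashMap::new(), aliases: Vec::new() }
    }

    /// Register an authority assertion under its tail-path key.
    pub fn add_authority(&mut self, key: &str, label: &str) {
        self.index.entry(key.to_string()).or_default().push(label.to_string());
    }

    /// Register an alias from a claim key to an authority key.
    pub fn add_alias(&mut self, from: &str, to: &str) {
        self.aliases.push((from.to_string(), to.to_string()));
    }

    /// Try the direct key first; fall back to the first matching alias.
    pub fn resolve(&self, key: &str) -> Option<&Vec<String>> {
        self.index.get(key).or_else(|| {
            self.aliases
                .iter()
                .find(|(from, _)| from == key)
                .and_then(|(_, to)| self.index.get(to))
        })
    }
}
```

Keeping the alias scan behind the hash lookup means the common case (bundled RFC/OWASP corpus) never pays for alias resolution.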

Extractors

Built-in Extractors (14)

| Extractor | Languages | Detects |
|-----------|-----------|---------|
| tls_verify | 8 | TLS certificate verification disabled |
| tls_version | 8 | Deprecated TLS 1.0/1.1 per RFC 8996 |
| jwt_config | 8 | JWT alg:none, skip signature verification |
| hardcoded_secrets | 8 | API keys, passwords in code |
| timeout_config | 8 | HTTP/DB/Redis timeout values |
| dep_versions | 3 | Dependency versions for advisory lookup |
| cors_config | 8 | CORS wildcard + credentials |
| rate_limit | 8 | Rate limiting configuration |
| weak_crypto | 5 | MD5, SHA1, DES, RC4 usage |
| sql_injection | 5 | SQL string interpolation |
| command_injection | 5 | Shell exec, os.system |
| unreal_cpp | C++ | Unreal Engine Exec functions |
| unreal_config | INI | Unreal Engine INI patterns |
| unreal_performance | C++ | Synchronous asset loading |

Declarative Extractors

Users can define custom extractors in aphoria.toml:

[[extractors.declarative]]
name = "deprecated_api_v1"
description = "Detects usage of deprecated v1 API endpoints"
languages = ["go", "rust", "python"]
pattern = '/api/v1/\w+'
claim.subject = "api/deprecated_endpoint"
claim.predicate = "version"
claim.value = "v1"
confidence = 1.0

Verdicts

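| Verdict | Score Range | Exit Code | Action |
|---------|-------------|-----------|--------|
| Block | ≥ 0.7 | 2 | Must fix before commit |
| Flag | ≥ 0.4 | 1 | Should review |
| Pass | < 0.4 | 0 | No conflict |
| Ack | N/A | 0 | Acknowledged intentional |
| Drift | N/A | 1 | Changed from prior value |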
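
The thresholds and exit codes in the table translate directly into code. The sketch below is illustrative: the enum mirrors the table, but `from_score`, `exit_code`, and `process_exit_code` are hypothetical names, and taking the highest exit code across all findings is an assumption about how a multi-finding scan is summarized.

```rust
/// Verdicts as listed in the table above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Verdict {
    Block,
    Flag,
    Pass,
    Ack,
    Drift,
}

impl Verdict {
    /// Score-based verdicts only; Ack and Drift are assigned elsewhere.
    pub fn from_score(score: f64) -> Verdict {
        if score >= 0.7 {
            Verdict::Block
        } else if score >= 0.4 {
            Verdict::Flag
        } else {
            Verdict::Pass
        }
    }

    /// Exit code for a single verdict, per the table.
    pub fn exit_code(self) -> i32 {
        match self {
            Verdict::Block => 2,
            Verdict::Flag | Verdict::Drift => 1,
            Verdict::Pass | Verdict::Ack => 0,
        }
    }
}

/// Assumed aggregation: the process exits with the most severe code seen.
pub fn process_exit_code(verdicts: &[Verdict]) -> i32 {
    verdicts.iter().map(|v| v.exit_code()).max().unwrap_or(0)
}
```

Checking the Block threshold before the Flag threshold matters because both rows use "≥", so a score of 0.8 satisfies both conditions.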

Trust Packs (Phase 6)

Signed bundles of assertions and aliases for federated policy distribution.

Schema:

pub struct TrustPack {
    pub header: PackHeader,     // name, version, issuer_id, timestamp
    pub assertions: Vec<Assertion>,
    pub aliases: Vec<ConceptAlias>,
    pub signature: [u8; 64],    // Ed25519 signature
}

Operations:

  • aphoria policy export - Create signed pack from local decisions
  • aphoria policy import - Load pack, verify signature, ingest assertions
  • aphoria.toml - Auto-loads the packs listed under policies = [...]

Hosted Mode (Phase 4E)

Team aggregation via central StemeDB server.

[hosted]
url = "https://episteme.acme.corp"
project_id = "billing-service"
team_id = "platform-team"
sync_mode = "remote-only"       # or "local-and-remote"
offline_fallback = "skip"       # or "fail" or "queue"
api_key_env = "APHORIA_API_KEY"

Flow:

Developer scans → HostedClient → POST /v1/aphoria/observations → Team Server

Community Sharing (Phase 5.6)

Opt-in anonymous pattern contribution.

Privacy Model:

  • Project names wildcarded: code://rust/myapp/tls → code://rust/*/tls
  • File paths, line numbers, matched text NEVER shared
  • Timestamps rounded to hour (k-anonymity)
  • enabled defaults to false (explicit opt-in)

[community]
enabled = true
anonymize = true
min_confidence = 0.8
exclude = ["vendor://acme/internal/*"]
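
Two of the privacy rules above, project wildcarding and hour-level timestamps, are simple enough to sketch. This is not the code in anonymizer.rs: the function names are hypothetical, the subject layout `code://<language>/<project>/...` is inferred from the example, and "rounded" is interpreted here as truncation to the start of the hour.

```rust
/// Replace the project segment of a `code://<language>/<project>/...`
/// subject with `*`. Returns None for other schemes or for subjects
/// with no concept segment after the project (assumed behavior).
pub fn wildcard_project(subject: &str) -> Option<String> {
    let rest = subject.strip_prefix("code://")?;
    let mut parts: Vec<&str> = rest.split('/').collect();
    if parts.len() < 3 {
        return None;
    }
    parts[1] = "*"; // parts[0] is the language, parts[1] the project name
    Some(format!("code://{}", parts.join("/")))
}

/// Truncate a Unix timestamp (seconds) to the start of its hour.
pub fn truncate_to_hour(unix_secs: u64) -> u64 {
    unix_secs - unix_secs % 3600
}
```

Coarsening timestamps this way makes observations from the same hour indistinguishable by time, which is the property the k-anonymity note relies on.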

Key Documents

Concept Matching System

Problem: How do we match code extractors to authoritative policies across different hierarchies?

  1. Concept Matching Analysis

    • Identifies the gap: tail-path matching works for RFCs but breaks for enterprise policies
    • Analyzes root cause: semantic mismatch between policy hierarchies and extractor output
    • Proposes solution: explicit policy aliases in Trust Packs
  2. Policy Alias Implementation Guide

    • Day-by-day implementation plan (5 phases over 3 days)
    • Code sketches with exact file locations
    • Test strategies and success criteria
    • Migration and rollout plan
  3. Matching Philosophy

    • Core design principles: semantic over syntactic, progressive precision, explicit control
    • Why tail-path matching works (by design for RFC/OWASP corpus)
    • Why it breaks (enterprise hierarchies violate assumptions)
    • Future extension points (semantic embeddings, ontology mapping)
  4. Enterprise Validation

    • End-to-end scenario walkthrough
    • Validates that policy aliases solve the enterprise use case
    • Edge case analysis
    • Real-world adoption path

LLM Extraction Quality

Problem: How do we ensure LLM prompts produce consistent, high-quality extraction results?

  1. LLM Prompt Evaluation - Vision

    • Problem statement and enterprise requirements
    • Architecture overview and core components
    • Fixture format design
    • CI/CD integration patterns
  2. LLM Prompt Evaluation - Implementation ← START HERE

    • Actionable implementation spec
    • Code snippets and file locations
    • 5-phase implementation plan (11 days)
    • Seed fixture list

Quick Reference

When to Read What

| If you need to... | Read this |
|-------------------|-----------|
| Understand concept matching | Concept Matching Analysis |
| Implement policy aliases | Policy Alias Implementation |
| Understand design philosophy | Matching Philosophy |
| Validate enterprise scenarios | Enterprise Validation |
| Test/evaluate LLM prompts | LLM Eval Implementation |
| Add a new extractor | src/extractors/mod.rs |
| Understand scan flow | src/scan.rs |
| Modify conflict detection | src/episteme/conflict.rs |
| Work with Trust Packs | src/policy.rs, src/policy_ops.rs |
| Work with LLM extraction | src/llm/ |

Architecture Decisions

AD-001: Explicit Policy Aliases

Status: Approved (2026-02-04) - Not Yet Implemented

Context: Security teams need to create policies using logical hierarchies (code://standards/*) that don't align with extractor output (code://rust/myapp/*).

Decision: Add PolicyAlias type to Trust Packs with glob pattern matching.

Implementation: See Policy Alias Implementation Guide for detailed implementation plan.

Consequences:

  • Enables enterprise policy enforcement
  • Maintains backward compatibility
  • Keeps security teams in control (explicit aliases)
  • ⚠️ Requires manual alias creation
  • ⚠️ Adds cognitive overhead (pattern syntax)
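
Since AD-001 is not yet implemented, the following is only a sketch of what glob-based alias resolution could look like. The `PolicyAlias` fields, the `*` semantics (any run of characters, including `/`), and first-match-wins ordering are all assumptions for illustration; the Policy Alias Implementation Guide is the authority on the real design.

```rust
/// Hypothetical alias: subjects matching `pattern` are treated as `target`.
pub struct PolicyAlias {
    pub pattern: String,
    pub target: String,
}

/// Minimal glob matcher supporting only `*` (matches any run of characters).
pub fn glob_match(pattern: &str, text: &str) -> bool {
    let p: Vec<char> = pattern.chars().collect();
    let t: Vec<char> = text.chars().collect();
    let (mut pi, mut ti) = (0usize, 0usize);
    // Position just after the last `*` seen, and the text index it has consumed up to.
    let mut star: Option<(usize, usize)> = None;

    while ti < t.len() {
        if pi < p.len() && p[pi] == '*' {
            star = Some((pi + 1, ti));
            pi += 1;
        } else if pi < p.len() && p[pi] == t[ti] {
            pi += 1;
            ti += 1;
        } else if let Some((after_star, consumed)) = star {
            // Backtrack: let the last `*` absorb one more character.
            pi = after_star;
            ti = consumed + 1;
            star = Some((after_star, consumed + 1));
        } else {
            return false;
        }
    }
    // Any remaining pattern must consist solely of `*`.
    while pi < p.len() && p[pi] == '*' {
        pi += 1;
    }
    pi == p.len()
}

/// Return the target of the first alias whose pattern matches `subject`.
pub fn resolve_policy_alias<'a>(aliases: &'a [PolicyAlias], subject: &str) -> Option<&'a str> {
    aliases
        .iter()
        .find(|a| glob_match(&a.pattern, subject))
        .map(|a| a.target.as_str())
}
```

For example, an alias with pattern `code://*/*/tls/cert_verification` and target `code://standards/tls/cert_verification` would let a policy written against the logical hierarchy apply to `code://rust/myapp/tls/cert_verification`.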

AD-002: Ephemeral Mode Default

Status: Implemented (2026-01-28)

Context: Full Episteme initialization took ~1-2s, too slow for pre-commit hooks.

Decision: Default to ephemeral mode (in-memory only), opt-in to persistent with --persist.

Consequences:

  • Roughly 4-8x faster scans (~0.25s versus ~1-2s)
  • No storage pollution for quick checks
  • ⚠️ Drift detection requires --persist
  • ⚠️ Observation write-back requires --persist --sync

AD-003: Tail-Path Matching

Status: Implemented

Context: Need to match code claims against RFCs/OWASP assertions with different URI schemes.

Decision: Use last 2 path segments + predicate as index key.

Consequences:

  • O(1) lookup via HashMap
  • Works for RFC/OWASP corpus by design
  • ⚠️ Breaks for enterprise policies with different hierarchies (solved by AD-001)

AD-004: LLM Prompt Evaluation System

Status: Proposed (2026-02-05)

Context: LLM prompts that drive claim extraction are code, but we don't treat them like code. No tests, no metrics, no regression detection. When prompts change, we don't know if quality improved or degraded.

Decision: Build a comprehensive prompt evaluation system with:

  • Golden corpus of test fixtures with expected outcomes
  • Observation logging for every extraction
  • Metrics computation (precision, recall, F1, cost)
  • Regression detection against baselines
  • CI integration (smoke tests per-PR, full eval nightly)

Implementation: See LLM Prompt Evaluation Spec

Consequences:

  • Prompt changes are validated before deployment
  • Regressions are caught automatically
  • Quality is measurable over time
  • Enterprise confidence in extraction reliability
  • ⚠️ Requires maintaining golden corpus
  • ⚠️ Live evaluation has token cost

Design Principles

1. Semantic Over Syntactic

Match concepts by meaning, not exact string paths.

2. Progressive Precision

Start with simple heuristics (tail-path), add layers (aliases, embeddings) as needed.

3. Explicit Over Implicit

Matching logic should be transparent, auditable, and controllable.

4. Zero Configuration (for common cases)

Bundled corpus (RFCs, OWASP) should "just work" with tail-path matching.

5. Cryptographic Trust

All policies are signed (Ed25519) and verified before use.

6. Privacy by Default

Community sharing is opt-in with anonymization enabled by default.


Extension Points

Current (2026-02-05)

  • Tail-path matching (O(1) hash lookup)
  • Concept aliases (auto-created on conflict detection)
  • Declarative extractors (user-defined in TOML)
  • Hosted mode (team aggregation)
  • Community corpus (anonymous sharing)
  • LLM-in-the-loop extraction (Gemini semantic claims)
  • Pattern learning (LLM-extracted patterns remembered)

In Progress

  • LLM Prompt Evaluation - Testing, metrics, and regression detection for prompts (Spec)
  • Policy aliases - Enterprise policy matching via glob patterns (AD-001)

Planned (Q1 2026)

  • Semantic embeddings (fuzzy matching via vector similarity)
  • Alias auto-discovery (suggest aliases during scan)
  • High-entropy secret detection
  • Framework-specific extractors (Spring, Django, Express)

Future (Q2+ 2026)

  • Ontology mapping (define semantic relationships)
  • Trust Pack composition (packs can extend other packs)
  • LLM-assisted extraction (semantic code understanding)
  • Config file deep parsing (structured YAML/JSON/TOML)

Performance Targets

Scan Time

  • Ephemeral: < 0.3s for typical project
  • Persistent: < 2s for typical project
  • With Policy Aliases: < 5% increase

Memory Overhead

  • Policy Alias Storage: ~100 bytes per alias
  • Typical Trust Pack: < 10 KB (10 aliases)
  • Corpus in memory: ~2-5 MB (varies by sources enabled)

Lookup Complexity

  • Direct tail-path: O(1)
  • Concept alias resolution: O(A) where A=aliases
  • Policy alias fallback (planned): O(P * A) where P=patterns, A=aliases

Testing Strategy

Unit Tests

  • Extractor pattern matching
  • ConceptIndex key generation
  • Conflict score calculation
  • Trust Pack serialization/verification

Integration Tests

  • Full scan flow with corpus
  • Trust Pack import/export
  • Drift detection
  • Observation write-back

UAT Scenarios

  • Enterprise security team workflow
  • Multi-language policy enforcement
  • CI/CD integration
  • Hosted mode aggregation

Corpus Architecture

Aphoria's corpus is emergent, not hardcoded. Best practices come from community usage and external sources.

Community Corpus (Primary)

  • Source: StemeDB pattern aggregates
  • Builder: CommunityCorpusBuilder queries PatternAggregateStore
  • Promotion: Patterns with 95%+ adoption + RFC/OWASP match auto-promote to corpus
  • Storage: StemeDB (graph database), indexed as AUTHORITATIVE predicate

Example:

Pattern: tls/cert_verification:enabled=true
Adoption: 847/892 projects (~95%)
Authority: RFC 5246
→ Auto-promoted to corpus (Tier 0: Regulatory)
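
The promotion rule can be sketched as a predicate over a pattern aggregate. The struct and function names below are hypothetical. Note that 847/892 is about 94.96%, so for the example above to qualify, this sketch compares the adoption rate rounded to a whole percent; whether the real promotion logic rounds before comparing is an assumption, not something this document states.

```rust
/// Hypothetical aggregate for one pattern across scanned projects.
pub struct PatternAggregate {
    pub adopting_projects: u32,
    pub total_projects: u32,
    /// Matching authority, e.g. Some("RFC 5246"); None if no RFC/OWASP match.
    pub authority: Option<String>,
}

/// Adoption as a whole percent, rounded to nearest (half rounds up).
/// Returns None when no projects have been observed.
pub fn adoption_percent(p: &PatternAggregate) -> Option<u64> {
    if p.total_projects == 0 {
        return None;
    }
    let adopting = u64::from(p.adopting_projects);
    let total = u64::from(p.total_projects);
    Some((adopting * 100 + total / 2) / total)
}

/// Auto-promote when adoption is 95%+ and an authority match exists.
pub fn should_auto_promote(p: &PatternAggregate) -> bool {
    p.authority.is_some() && adoption_percent(p).map_or(false, |pct| pct >= 95)
}
```

Requiring both conditions reflects the stated philosophy: the community discovers the pattern, and an authority validates it.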

Bootstrap Options

New projects need baseline assertions.

Option 1: Wiki Import

aphoria corpus import --from-wiki ~/docs
# Parses markdown for MUST/SHOULD patterns
# Creates assertions, stores in StemeDB

Option 2: Trust Pack

aphoria trust-pack install rfc-owasp-baseline
# Imports curated assertions
# Stores in StemeDB

Option 3: Skill Cold Start

# aphoria-suggest analyzes project
# Suggests 3-5 foundation claims
# User approves → CLI creates assertions

No More Hardcoded Corpus

hardcoded.rs has been deleted. The 19 original assertions remain available as the rfc-owasp-baseline Trust Pack, for bootstrap only.

Philosophy: The corpus isn't written by experts. It's discovered by the community and validated by authorities.



Questions or Feedback?

Discuss in:

  • #aphoria-architecture (internal Slack)
  • GitHub Issues (public feedback)
  • Architecture review meetings (Fridays 2pm PT)

This directory is the source of truth for architectural decisions. All major changes should be documented here before implementation.


Last updated: 2026-02-05