jml 65065f3d8f feat(aphoria): implement community corpus with wiki import and pattern aggregation

Implements Phase 4 (A4) - Community corpus as first-class citizens:

- **Community Corpus Builder** - Queries StemeDB pattern aggregates
- **Wiki Import** - Bootstrap corpus from markdown docs (aphoria corpus import wiki)
- **Pattern Aggregation** - Automatic learning from local scans (--sync flag)
- **Storage Layer** - StemeDBPatternStore with content-addressed deduplication
- **Promotion Logic** - Multi-tier thresholds (95%/80%/50% adoption rates)
- **Corpus Build** - Unified registry for RFC/OWASP/Vendor/Community sources
- **Trust Packs** - Export corpus as signed, distributable artifacts
- **Documentation** - bootstrap-corpus.md guide + CLI reference updates

Technical details:
- Pattern aggregates stored as assertions with predicate "pattern_aggregate"
- Content-addressed subjects via BLAKE3(subject:predicate:value)
- PatternAggregator handles write path (observations → patterns)
- StemeDBPatternStore handles read path (pattern queries)
- Integration tests + fixtures in tests/wiki_import_test.rs

Deleted hardcoded.rs (368 lines) - corpus now fully emergent from StemeDB.
Deleted enriched-corpus-patterns.md (677 lines) - feature shipped.

Closes VG-026 (community corpus), part of A4 milestone.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

2026-02-09 00:12:31 +00:00


Aphoria Architecture Documentation

This directory contains architectural decision records, analysis, and design philosophy for Aphoria.

System Overview

Aphoria is a code-level truth linter that validates code against authoritative sources (RFCs, OWASP, vendor docs). It extracts implicit claims from code and configs, then checks them against a tiered authority system.

High-Level Architecture

┌──────────────────────────────────────────────────────────────────────────┐
│                        Aphoria CLI Pipeline                               │
├──────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  ┌──────────────┐                                                        │
│  │   CLI/Args   │ ──▶ handlers.rs dispatches to scan, policy, research   │
│  └──────────────┘                                                        │
│         │                                                                │
│         ▼                                                                │
│  ┌──────────────┐  ┌────────────────┐  ┌──────────────┐                  │
│  │    Walker    │──▶│  Extractors    │──▶│    Bridge    │                │
│  │ (walk files) │  │ (14 built-in)  │  │ (claim→assn) │                  │
│  └──────────────┘  └────────────────┘  └──────────────┘                  │
│         │                 │                    │                         │
│         │                 │                    ▼                         │
│         │                 │         ┌──────────────────┐                 │
│         │                 │         │  Episteme Layer  │                 │
│         │                 │         │                  │                 │
│         │                 │         │ ┌──────────────┐ │                 │
│         │                 │         │ │  Ephemeral   │ │ ◀─ Fast path    │
│         │                 │         │ │  Detector    │ │    (~0.25s)     │
│         │                 │         │ └──────────────┘ │                 │
│         │                 │         │        OR        │                 │
│         │                 │         │ ┌──────────────┐ │                 │
│         │                 │         │ │   Local      │ │ ◀─ Full path    │
│         │                 │         │ │  Episteme    │ │    (~1-2s)      │
│         │                 │         │ └──────────────┘ │                 │
│         │                 │         └──────────────────┘                 │
│         │                 │                    │                         │
│         ▼                 ▼                    ▼                         │
│  ┌────────────────────────────────────────────────────────────────┐      │
│  │                      Conflict Detection                         │      │
│  │  ConceptIndex (tail-path) + Aliases + Policy Source Tracking   │      │
│  └────────────────────────────────────────────────────────────────┘      │
│                                    │                                     │
│                                    ▼                                     │
│  ┌──────────────┐  ┌────────────────┐  ┌──────────────┐                  │
│  │    Report    │  │  Drift Check   │  │  Observation │                  │
│  │ (table/json/ │  │ (self-conflict)│  │  Write-back  │                  │
│  │  sarif/md)   │  │                │  │  (--sync)    │                  │
│  └──────────────┘  └────────────────┘  └──────────────┘                  │
│                                                                          │
└──────────────────────────────────────────────────────────────────────────┘

Data Flow

WALK - Traverse project directory (respects .gitignore, supports --staged for git-staged files only)
EXTRACT - Run 14 built-in extractors + declarative extractors to find implicit claims
INGEST - Convert claims to Episteme assertions (BLAKE3 hash + Ed25519 signature)
CONFLICT - Query ConceptIndex for authority matches using tail-path matching
DRIFT - Compare against prior observations (self-conflict detection)
REPORT - Output in table, JSON, SARIF 2.1.0, or Markdown format
SYNC - (Optional) Write novel observations back to the local store or a hosted server

Key Modules

Module	Purpose	Key Files
`cli.rs`	Clap-based CLI argument parsing	Command definitions
`handlers.rs`	Command dispatch, validation	`--sync requires --persist`
`scan.rs`	Main scan orchestrator	Mode dispatch, observation flow
`walker/`	Project traversal	`mod.rs`, `git.rs`, `path_mapper.rs`, `language.rs`
`extractors/`	14 pattern-based claim extractors	`mod.rs`, individual extractors
`bridge.rs`	Observation → Assertion conversion	BLAKE3 hashing, Ed25519 signing
`episteme/`	Conflict detection core	`ephemeral.rs`, `local.rs`, `concept_index.rs`
`policy.rs`	Trust Pack management	Load/save/verify signed packs
`policy_ops.rs`	`bless`, `ack`, `update`, `export/import`	CLI policy operations
`report/`	Output formatting	`table.rs`, `json.rs`, `sarif.rs`, `markdown.rs`
`hosted.rs`	HTTP client for team aggregation	Push observations to remote server
`community/`	Anonymous pattern contribution	`anonymizer.rs`, `types.rs`
`research/`	Gap detection and auto-research	`gap_detector.rs`, `researcher.rs`
`config/`	`aphoria.toml` parsing	All configuration types
`types/`	Domain types	`claim.rs`, `verdict.rs`, `result.rs`, `command.rs`
`corpus/`	Authoritative source builders	`community.rs`, `rfc/`, `owasp/`, `vendor.rs`, `enricher.rs`

Scan Modes

Mode	Storage	Performance	Features
Ephemeral (default)	None	~0.25s	Conflict detection only
Persistent (`--persist`)	WAL + KV	~1-2s	Baseline, diff, aliases, drift, observation write-back

Ephemeral Mode (`EphemeralDetector`)

Builds corpus + ConceptIndex entirely in-memory
No disk I/O during scan
Perfect for CI/pre-commit hooks
Cannot detect drift (no prior state)
Cannot write observations (no storage)

Persistent Mode (`LocalEpisteme`)

Full Episteme stack initialization
WAL recovery on startup
Enables: baseline tracking, diff, auto-alias creation, drift detection, --sync
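
The mode rules above (ephemeral by default, `--sync` only valid together with `--persist`) can be sketched as a small flag check. This is an illustrative sketch, not the code in `handlers.rs`; `ScanMode` and `select_mode` are hypothetical names.

```rust
/// Scan mode chosen from CLI flags (variants mirror the Scan Modes table).
#[derive(Debug, PartialEq)]
pub enum ScanMode {
    /// In-memory only: conflict detection, no drift, no write-back.
    Ephemeral,
    /// WAL + KV backed; `sync` enables observation write-back.
    Persistent { sync: bool },
}

/// Hypothetical validation helper: rejects `--sync` without `--persist`.
pub fn select_mode(persist: bool, sync: bool) -> Result<ScanMode, String> {
    match (persist, sync) {
        (false, true) => Err("--sync requires --persist".to_string()),
        (false, false) => Ok(ScanMode::Ephemeral),
        (true, sync) => Ok(ScanMode::Persistent { sync }),
    }
}
```

Under these assumptions, `select_mode(false, false)` yields `ScanMode::Ephemeral`, and `select_mode(false, true)` returns an error instead of silently dropping observations.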

Authority Tiers

Tier	Source	Example	Weight
0	Regulatory	RFC 7519: "JWT audience validation is mandatory"	1.0
1	Clinical	OWASP: "TLS certificate verification required"	0.9
2	Observational	Vendor docs: "Redis timeout should be > 0"	0.7
3	Expert	Team policy: "Our pool size is 50"	0.5
4	Community	Prior observations from this codebase	0.3

Conflict Score Formula:

score = Σ(tier_weight × assertion_confidence × value_difference)
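
A minimal sketch of the formula, assuming `value_difference` is normalized to the range 0.0 (values agree) to 1.0 (values contradict) and that one term is summed per matching authority assertion. The `Match` struct and `conflict_score` function are hypothetical; only the tier weights come from the table above.

```rust
/// Authority tier weights, copied from the Authority Tiers table (index = tier).
const TIER_WEIGHTS: [f64; 5] = [1.0, 0.9, 0.7, 0.5, 0.3];

/// One authority assertion that matched a code claim (hypothetical shape).
pub struct Match {
    pub tier: usize,           // 0 = Regulatory ... 4 = Community
    pub confidence: f64,       // assertion confidence, assumed in [0, 1]
    pub value_difference: f64, // assumed 0.0 (agrees) to 1.0 (contradicts)
}

/// score = Σ(tier_weight × assertion_confidence × value_difference).
/// Matches with an unknown tier are skipped rather than causing a panic.
pub fn conflict_score(matches: &[Match]) -> f64 {
    matches
        .iter()
        .filter_map(|m| {
            TIER_WEIGHTS
                .get(m.tier)
                .map(|w| w * m.confidence * m.value_difference)
        })
        .sum()
}
```

With these assumptions, a single Tier 0 match at full confidence and full disagreement scores 1.0, while a Tier 4 match under the same conditions scores only 0.3.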

Concept Matching

Tail-Path Matching (ConceptIndex)

The primary matching algorithm uses the last 2 path segments to enable cross-scheme matching:

RFC assertion:  rfc://5246/tls/cert_verification
Code claim:     code://rust/myapp/tls/cert_verification

With predicate "enabled", both produce the key: "tls/cert_verification::enabled"

Algorithm:

Strip scheme (rfc://, code://)
Take last 2 non-empty path segments
Append predicate
Key = {seg[-2]}/{seg[-1]}::{predicate}
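
The four steps above can be sketched as a single function. This is an illustrative sketch rather than the code in `concept_index.rs`; `tail_path_key` is a hypothetical name, and returning `None` for URIs with fewer than two segments is an assumption, since this document does not say how that case is handled.

```rust
/// Builds the ConceptIndex key `{seg[-2]}/{seg[-1]}::{predicate}`.
/// Returns None when the URI has fewer than two non-empty path segments
/// (an assumption; the real index may treat that case differently).
pub fn tail_path_key(uri: &str, predicate: &str) -> Option<String> {
    // 1. Strip the scheme (`rfc://`, `code://`, ...), if present.
    let path = uri.split_once("://").map_or(uri, |(_, rest)| rest);
    // 2. Collect the non-empty path segments.
    let segments: Vec<&str> = path.split('/').filter(|s| !s.is_empty()).collect();
    // 3. Take the last two segments and append the predicate.
    match segments.as_slice() {
        [.., a, b] => Some(format!("{a}/{b}::{predicate}")),
        _ => None,
    }
}
```

Both `rfc://5246/tls/cert_verification` and `code://rust/myapp/tls/cert_verification` map to the same key for a given predicate, which is what makes cross-scheme matching a plain hash lookup.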

Alias Resolution

When tail-path matching fails, the system checks registered aliases. Aliases can be:

Auto-created - When a conflict is detected, the relationship is persisted as an alias (persistent mode only)
Manual - Created via aphoria bless or Trust Pack import
Policy aliases - (Planned) From Trust Packs for enterprise policy enforcement - see Policy Alias Implementation

Extractors

Built-in Extractors (14)

Extractor	Languages	Detects
`tls_verify`	8	TLS certificate verification disabled
`tls_version`	8	Deprecated TLS 1.0/1.1 per RFC 8996
`jwt_config`	8	JWT alg:none, skip signature verification
`hardcoded_secrets`	8	API keys, passwords in code
`timeout_config`	8	HTTP/DB/Redis timeout values
`dep_versions`	3	Dependency versions for advisory lookup
`cors_config`	8	CORS wildcard + credentials
`rate_limit`	8	Rate limiting configuration
`weak_crypto`	5	MD5, SHA1, DES, RC4 usage
`sql_injection`	5	SQL string interpolation
`command_injection`	5	Shell exec, os.system
`unreal_cpp`	C++	Unreal Engine Exec functions
`unreal_config`	INI	Unreal Engine INI patterns
`unreal_performance`	C++	Synchronous asset loading

Declarative Extractors

Users can define custom extractors in aphoria.toml:

[[extractors.declarative]]
name = "deprecated_api_v1"
description = "Detects usage of deprecated v1 API endpoints"
languages = ["go", "rust", "python"]
pattern = '/api/v1/\w+'
claim.subject = "api/deprecated_endpoint"
claim.predicate = "version"
claim.value = "v1"
confidence = 1.0

Verdicts

Verdict	Score Range	Exit Code	Action
`Block`	≥ 0.7	2	Must fix before commit
`Flag`	≥ 0.4 and < 0.7	1	Should review
`Pass`	< 0.4	0	No conflict
`Ack`	N/A	0	Acknowledged intentional
`Drift`	N/A	1	Changed from prior value
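
The table can be read as a threshold function plus an exit-code mapping. The sketch below is illustrative; the `Verdict` variants follow the table, while the function names and the "worst verdict wins" rule for a whole scan are assumptions.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum Verdict {
    Block,
    Flag,
    Pass,
    Ack,
    Drift,
}

/// Maps a conflict score to a score-based verdict (thresholds from the table).
pub fn verdict_for_score(score: f64) -> Verdict {
    if score >= 0.7 {
        Verdict::Block
    } else if score >= 0.4 {
        Verdict::Flag
    } else {
        Verdict::Pass
    }
}

/// Exit code for a single verdict, as listed in the table.
pub fn exit_code(verdict: Verdict) -> i32 {
    match verdict {
        Verdict::Block => 2,
        Verdict::Flag | Verdict::Drift => 1,
        Verdict::Pass | Verdict::Ack => 0,
    }
}

/// Exit code for a whole scan: the highest code wins (an assumption).
pub fn scan_exit_code(verdicts: &[Verdict]) -> i32 {
    verdicts.iter().map(|v| exit_code(*v)).max().unwrap_or(0)
}
```

`Ack` and `Drift` have no score range in the table, so they are not produced by `verdict_for_score`; they would come from policy acknowledgements and drift checks respectively.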

Trust Packs (Phase 6)

Signed bundles of assertions and aliases for federated policy distribution.

Schema:

pub struct TrustPack {
    pub header: PackHeader,     // name, version, issuer_id, timestamp
    pub assertions: Vec<Assertion>,
    pub aliases: Vec<ConceptAlias>,
    pub signature: [u8; 64],    // Ed25519 signature
}

Operations:

aphoria policy export - Create signed pack from local decisions
aphoria policy import - Load pack, verify signature, ingest assertions
aphoria.toml - Auto-load policies from policies = [...] list

Hosted Mode (Phase 4E)

Team aggregation via central StemeDB server.

[hosted]
url = "https://episteme.acme.corp"
project_id = "billing-service"
team_id = "platform-team"
sync_mode = "remote-only"       # or "local-and-remote"
offline_fallback = "skip"       # or "fail" or "queue"
api_key_env = "APHORIA_API_KEY"

Flow:

Developer scans → HostedClient → POST /v1/aphoria/observations → Team Server

Community Sharing (Phase 5.6)

Opt-in anonymous pattern contribution.

Privacy Model:

Project names wildcarded: code://rust/myapp/tls → code://rust/*/tls
File paths, line numbers, matched text NEVER shared
Timestamps rounded to hour (k-anonymity)
enabled defaults to false (explicit opt-in)

[community]
enabled = true
anonymize = true
min_confidence = 0.8
exclude = ["vendor://acme/internal/*"]
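
Two of the privacy rules above, project-name wildcarding and hour-level timestamps, can be sketched as follows. This is not the code in `anonymizer.rs`; the function names are hypothetical, the project is assumed to be the second segment of a `code://<lang>/<project>/...` subject, and rounding is assumed to truncate down to the start of the hour.

```rust
/// Replaces the project segment of a `code://<lang>/<project>/...` subject
/// with `*`. Subjects with another scheme or too few segments are returned
/// unchanged (an assumption; the real anonymizer may handle them differently).
pub fn wildcard_project(subject: &str) -> String {
    let Some(rest) = subject.strip_prefix("code://") else {
        return subject.to_string();
    };
    let mut segments: Vec<&str> = rest.split('/').collect();
    if segments.len() < 2 {
        return subject.to_string();
    }
    segments[1] = "*";
    format!("code://{}", segments.join("/"))
}

/// Rounds a Unix timestamp (seconds) down to the start of its hour.
pub fn round_to_hour(unix_secs: u64) -> u64 {
    unix_secs - unix_secs % 3600
}
```

For example, `wildcard_project("code://rust/myapp/tls")` returns `code://rust/*/tls`, matching the rule in the Privacy Model list.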

Key Documents

Concept Matching System

Problem: How do we match code extractors to authoritative policies across different hierarchies?

Concept Matching Analysis
- Identifies the gap: tail-path matching works for RFCs but breaks for enterprise policies
- Analyzes root cause: semantic mismatch between policy hierarchies and extractor output
- Proposes solution: explicit policy aliases in Trust Packs
Policy Alias Implementation Guide
- Day-by-day implementation plan (5 phases over 3 days)
- Code sketches with exact file locations
- Test strategies and success criteria
- Migration and rollout plan
Matching Philosophy
- Core design principles: semantic over syntactic, progressive precision, explicit control
- Why tail-path matching works (by design for RFC/OWASP corpus)
- Why it breaks (enterprise hierarchies violate assumptions)
- Future extension points (semantic embeddings, ontology mapping)
Enterprise Validation
- End-to-end scenario walkthrough
- Validates that policy aliases solve the enterprise use case
- Edge case analysis
- Real-world adoption path

LLM Extraction Quality

Problem: How do we ensure LLM prompts produce consistent, high-quality extraction results?

LLM Prompt Evaluation - Vision
- Problem statement and enterprise requirements
- Architecture overview and core components
- Fixture format design
- CI/CD integration patterns
LLM Prompt Evaluation - Implementation ← START HERE
- Actionable implementation spec
- Code snippets and file locations
- 5-phase implementation plan (11 days)
- Seed fixture list

Quick Reference

When to Read What

If you need to...	Read this
Understand concept matching	Concept Matching Analysis
Implement policy aliases	Policy Alias Implementation
Understand design philosophy	Matching Philosophy
Validate enterprise scenarios	Enterprise Validation
Test/evaluate LLM prompts	LLM Eval Implementation
Add a new extractor	`src/extractors/mod.rs`
Understand scan flow	`src/scan.rs`
Modify conflict detection	`src/episteme/conflict.rs`
Work with Trust Packs	`src/policy.rs`, `src/policy_ops.rs`
Work with LLM extraction	`src/llm/`

Architecture Decisions

AD-001: Explicit Policy Aliases

Status: Approved (2026-02-04) - Not Yet Implemented

Context: Security teams need to create policies using logical hierarchies (code://standards/*) that don't align with extractor output (code://rust/myapp/*).

Decision: Add PolicyAlias type to Trust Packs with glob pattern matching.

Implementation: See Policy Alias Implementation Guide for detailed implementation plan.

Consequences:

✅ Enables enterprise policy enforcement
✅ Maintains backward compatibility
✅ Keeps security teams in control (explicit aliases)
⚠️ Requires manual alias creation
⚠️ Adds cognitive overhead (pattern syntax)

AD-002: Ephemeral Mode Default

Status: Implemented (2026-01-28)

Context: Full Episteme initialization took ~1-2s, too slow for pre-commit hooks.

Decision: Default to ephemeral mode (in-memory only), opt-in to persistent with --persist.

Consequences:

✅ Roughly 4-8x faster scans (~0.25s vs ~1-2s)
✅ No storage pollution for quick checks
⚠️ Drift detection requires --persist
⚠️ Observation write-back requires --persist --sync

AD-003: Tail-Path Matching

Status: Implemented

Context: Need to match code claims against RFCs/OWASP assertions with different URI schemes.

Decision: Use last 2 path segments + predicate as index key.

Consequences:

✅ O(1) lookup via HashMap
✅ Works for RFC/OWASP corpus by design
⚠️ Breaks for enterprise policies with different hierarchies (to be addressed by AD-001, not yet implemented)

AD-004: LLM Prompt Evaluation System

Status: Proposed (2026-02-05)

Context: LLM prompts that drive claim extraction are code, but we don't treat them like code. No tests, no metrics, no regression detection. When prompts change, we don't know if quality improved or degraded.

Decision: Build a comprehensive prompt evaluation system with:

Golden corpus of test fixtures with expected outcomes
Observation logging for every extraction
Metrics computation (precision, recall, F1, cost)
Regression detection against baselines
CI integration (smoke tests per-PR, full eval nightly)

Implementation: See LLM Prompt Evaluation Spec

Consequences:

✅ Prompt changes are validated before deployment
✅ Regressions are caught automatically
✅ Quality is measurable over time
✅ Enterprise confidence in extraction reliability
⚠️ Requires maintaining golden corpus
⚠️ Live evaluation has token cost

Design Principles

1. Semantic Over Syntactic

Match concepts by meaning, not exact string paths.

2. Progressive Precision

Start with simple heuristics (tail-path), add layers (aliases, embeddings) as needed.

3. Explicit Over Implicit

Matching logic should be transparent, auditable, and controllable.

4. Zero Configuration (for common cases)

Bundled corpus (RFCs, OWASP) should "just work" with tail-path matching.

5. Cryptographic Trust

All policies are signed (Ed25519) and verified before use.

6. Privacy by Default

Community sharing is opt-in with anonymization enabled by default.

Extension Points

Current (2026-02-05)

Tail-path matching (O(1) hash lookup)
Concept aliases (auto-created on conflict detection)
Declarative extractors (user-defined in TOML)
Hosted mode (team aggregation)
Community corpus (anonymous sharing)
LLM-in-the-loop extraction (Gemini semantic claims)
Pattern learning (LLM-extracted patterns remembered)

In Progress

LLM Prompt Evaluation - Testing, metrics, and regression detection for prompts (Spec)
Policy aliases - Enterprise policy matching via glob patterns (AD-001)

Planned (Q1 2026)

Semantic embeddings (fuzzy matching via vector similarity)
Alias auto-discovery (suggest aliases during scan)
High-entropy secret detection
Framework-specific extractors (Spring, Django, Express)

Future (Q2+ 2026)

Ontology mapping (define semantic relationships)
Trust Pack composition (packs can extend other packs)
LLM-assisted extraction (semantic code understanding)
Config file deep parsing (structured YAML/JSON/TOML)

Performance Targets

Scan Time

Ephemeral: < 0.3s for typical project
Persistent: < 2s for typical project
With Policy Aliases: < 5% increase

Memory Overhead

Policy Alias Storage: ~100 bytes per alias
Typical Trust Pack: < 10 KB (10 aliases)
Corpus in memory: ~2-5 MB (varies by sources enabled)

Lookup Complexity

Direct tail-path: O(1)
Concept alias resolution: O(A) where A=aliases
Policy alias fallback (planned): O(P * A) where P=patterns, A=aliases
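
The first two complexity figures can be illustrated with a minimal stand-in for the index: a direct hit costs one hash lookup, and a miss falls back to a linear scan over registered aliases. `ConceptLookup` and its fields are hypothetical and much simpler than the real `ConceptIndex`; the planned policy alias fallback is not covered.

```rust
use std::collections::HashMap;

/// Minimal stand-in for the index: tail-path key -> authority assertion ids.
pub struct ConceptLookup {
    pub by_key: HashMap<String, Vec<u32>>,
    /// (alias key, canonical key) pairs, scanned linearly on a miss.
    pub aliases: Vec<(String, String)>,
}

impl ConceptLookup {
    /// Direct hit is one hash lookup, O(1); on a miss, scan aliases, O(A).
    pub fn find(&self, key: &str) -> Option<&Vec<u32>> {
        if let Some(hit) = self.by_key.get(key) {
            return Some(hit);
        }
        self.aliases
            .iter()
            .filter(|(alias, _)| alias.as_str() == key)
            .find_map(|(_, canonical)| self.by_key.get(canonical))
    }
}
```

Keeping the alias scan as a fallback means the common case (bundled RFC/OWASP corpus) never pays the O(A) cost.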

Testing Strategy

Unit Tests

Extractor pattern matching
ConceptIndex key generation
Conflict score calculation
Trust Pack serialization/verification

Integration Tests

Full scan flow with corpus
Trust Pack import/export
Drift detection
Observation write-back

UAT Scenarios

Enterprise security team workflow
Multi-language policy enforcement
CI/CD integration
Hosted mode aggregation

Corpus Architecture

Aphoria's corpus is emergent, not hardcoded. Best practices come from community usage and external sources.

Community Corpus (Primary)

Source: StemeDB pattern aggregates
Builder: CommunityCorpusBuilder queries PatternAggregateStore
Promotion: Patterns with 95%+ adoption + RFC/OWASP match auto-promote to corpus
Storage: StemeDB (graph database), indexed as AUTHORITATIVE predicate

Example:

Pattern: tls/cert_verification:enabled=true
Adoption: 847/892 projects (95%)
Authority: RFC 5246
→ Auto-promoted to corpus (Tier 0: Regulatory)
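
The promotion rule in this example can be sketched as a predicate over aggregate counts. This is a hedged sketch, not the shipped promotion logic: `PatternStats` and `should_auto_promote` are hypothetical, and adoption is assumed to be compared after rounding to the nearest whole percent, which is how 847/892 reads as 95% above.

```rust
/// Aggregate statistics for one pattern (hypothetical shape).
pub struct PatternStats {
    pub adopters: u32,             // projects where the pattern was observed
    pub total_projects: u32,       // projects scanned
    pub has_authority_match: bool, // an RFC/OWASP assertion backs the pattern
}

/// Adoption as a whole percent, rounded to nearest (assumed rounding rule).
pub fn adoption_percent(p: &PatternStats) -> Option<u64> {
    if p.total_projects == 0 {
        return None;
    }
    let total = u64::from(p.total_projects);
    Some((u64::from(p.adopters) * 100 + total / 2) / total)
}

/// True when rounded adoption is at least 95% and an authority match exists.
pub fn should_auto_promote(p: &PatternStats) -> bool {
    p.has_authority_match && adoption_percent(p).map_or(false, |pct| pct >= 95)
}
```

Integer arithmetic keeps the boundary case deterministic; a pattern with high adoption but no RFC/OWASP backing is never promoted by this rule.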

Bootstrap Options

New projects need baseline assertions.

Option 1: Wiki Import

aphoria corpus import --from-wiki ~/docs
# Parses markdown for MUST/SHOULD patterns
# Creates assertions, stores in StemeDB

Option 2: Trust Pack

aphoria trust-pack install rfc-owasp-baseline
# Imports curated assertions
# Stores in StemeDB

Option 3: Skill Cold Start

# aphoria-suggest analyzes project
# Suggests 3-5 foundation claims
# User approves → CLI creates assertions

No More Hardcoded Corpus

hardcoded.rs has been deleted. The 19 original assertions are available as the rfc-owasp-baseline Trust Pack, for bootstrap only.

Philosophy: The corpus isn't written by experts. It's discovered by the community and validated by authorities.

Related Documentation

Product

Product Overview - What Aphoria does
Roadmap - Implementation status and plans

Guides

Enterprise Quick Start - Getting started
Federating Truth - Trust Pack workflows

Implementation

Policy Ops - Trust Pack CLI handlers
Concept Index - Matching algorithm
Local Episteme - Conflict detection
Ephemeral Detector - Fast path

Questions or Feedback?

Discuss in:

#aphoria-architecture (internal Slack)
GitHub Issues (public feedback)
Architecture review meetings (Fridays 2pm PT)

This directory is the source of truth for architectural decisions. All major changes should be documented here before implementation.

Last updated: 2026-02-05
