# Aphoria Architecture Documentation
This directory contains architectural decision records, analysis, and design philosophy for Aphoria.

---
## System Overview
Aphoria is a **code-level truth linter** that validates code against authoritative sources (RFCs, OWASP, vendor docs). It extracts implicit claims from code and configs, then checks them against a tiered authority system.
### High-Level Architecture
```
┌──────────────────────────────────────────────────────────────────────────┐
│ Aphoria CLI Pipeline │
├──────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ │
│ │ CLI/Args │ ──▶ handlers.rs dispatches to scan, policy, research │
│ └──────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────┐ ┌────────────────┐ ┌──────────────┐ │
│ │ Walker │──▶│ Extractors │──▶│ Bridge │ │
│ │ (walk files) │ │ (14 built-in) │ │ (claim→assn) │ │
│ └──────────────┘ └────────────────┘ └──────────────┘ │
│ │ │ │ │
│ │ │ ▼ │
│ │ │ ┌──────────────────┐ │
│ │ │ │ Episteme Layer │ │
│ │ │ │ │ │
│ │ │ │ ┌──────────────┐ │ │
│ │ │ │ │ Ephemeral │ │ ◀─ Fast path │
│ │ │ │ │ Detector │ │ (~0.25s) │
│ │ │ │ └──────────────┘ │ │
│ │ │ │ OR │ │
│ │ │ │ ┌──────────────┐ │ │
│ │ │ │ │ Local │ │ ◀─ Full path │
│ │ │ │ │ Episteme │ │ (~1-2s) │
│ │ │ │ └──────────────┘ │ │
│ │ │ └──────────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌────────────────────────────────────────────────────────────────┐ │
│ │ Conflict Detection │ │
│ │ ConceptIndex (tail-path) + Aliases + Policy Source Tracking │ │
│ └────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────┐ ┌────────────────┐ ┌──────────────┐ │
│ │ Report │ │ Drift Check │ │ Observation │ │
│ │ (table/json/ │ │ (self-conflict)│ │ Write-back │ │
│ │ sarif/md) │ │ │ │ (--sync) │ │
│ └──────────────┘ └────────────────┘ └──────────────┘ │
│ │
└──────────────────────────────────────────────────────────────────────────┘
```
### Data Flow
1. **WALK** - Traverse project directory (respects `.gitignore`, supports `--staged` for git-staged files only)
2. **EXTRACT** - Run 14 built-in extractors + declarative extractors to find implicit claims
3. **INGEST** - Convert claims to Episteme assertions (BLAKE3 hash + Ed25519 signature)
4. **CONFLICT** - Query ConceptIndex for authority matches using tail-path matching
5. **DRIFT** - Compare against prior observations (self-conflict detection)
6. **REPORT** - Output in table, JSON, SARIF 2.1.0, or Markdown format
7. **SYNC** - (Optional) Write novel observations back to the local store or a hosted server (requires `--persist`)
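
Step 5 (DRIFT) can be pictured as a comparison between the values recorded by an earlier scan and the values seen now. The following is a minimal sketch, not Aphoria's implementation: the `Drift` struct, the `detect_drift` function, and the use of a plain string map keyed by concept are all hypothetical names chosen for illustration.

```rust
use std::collections::BTreeMap;

/// A value that changed between two scans (hypothetical type).
#[derive(Debug, PartialEq)]
pub struct Drift {
    pub key: String,
    pub previous: String,
    pub current: String,
}

/// Report every key whose value differs from the prior observation.
/// Keys with no prior observation are novel, not drifted, so they are skipped.
pub fn detect_drift(
    prior: &BTreeMap<String, String>,
    current: &BTreeMap<String, String>,
) -> Vec<Drift> {
    current
        .iter()
        .filter_map(|(key, value)| {
            let previous = prior.get(key)?;
            if previous != value {
                Some(Drift {
                    key: key.clone(),
                    previous: previous.clone(),
                    current: value.clone(),
                })
            } else {
                None
            }
        })
        .collect()
}
```

Because the sketch needs a prior map, it also shows why drift detection is unavailable without stored state.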
---
## Key Modules
| Module | Purpose | Key Files |
|--------|---------|-----------|
| `cli.rs` | Clap-based CLI argument parsing | Command definitions |
| `handlers.rs` | Command dispatch, validation | `--sync requires --persist` |
| `scan.rs` | Main scan orchestrator | Mode dispatch, observation flow |
| `walker/` | Project traversal | `mod.rs`, `git.rs`, `path_mapper.rs`, `language.rs` |
| `extractors/` | 14 pattern-based claim extractors | `mod.rs`, individual extractors |
| `bridge.rs` | Observation → Assertion conversion | BLAKE3 hashing, Ed25519 signing |
| `episteme/` | Conflict detection core | `ephemeral.rs`, `local.rs`, `concept_index.rs` |
| `policy.rs` | Trust Pack management | Load/save/verify signed packs |
| `policy_ops.rs` | CLI policy operations | `bless`, `ack`, `update`, `export`/`import` |
| `report/` | Output formatting | `table.rs`, `json.rs`, `sarif.rs`, `markdown.rs` |
| `hosted.rs` | HTTP client for team aggregation | Push observations to remote server |
| `community/` | Anonymous pattern contribution | `anonymizer.rs`, `types.rs` |
| `research/` | Gap detection and auto-research | `gap_detector.rs`, `researcher.rs` |
| `config/` | `aphoria.toml` parsing | All configuration types |
| `types/` | Domain types | `claim.rs`, `verdict.rs`, `result.rs`, `command.rs` |
| `corpus/` | Authoritative source builders | `community.rs`, `rfc/`, `owasp/`, `vendor.rs`, `enricher.rs` |
---
## Scan Modes
| Mode | Storage | Performance | Features |
|------|---------|-------------|----------|
| **Ephemeral** (default) | None | ~0.25s | Conflict detection only |
| **Persistent** (`--persist`) | WAL + KV | ~1-2s | Baseline, diff, aliases, drift, observation write-back |
### Ephemeral Mode (`EphemeralDetector`)
- Builds corpus + ConceptIndex entirely in-memory
- No disk I/O during scan
- Perfect for CI/pre-commit hooks
- Cannot detect drift (no prior state)
- Cannot write observations (no storage)
### Persistent Mode (`LocalEpisteme`)
- Full Episteme stack initialization
- WAL recovery on startup
- Enables: baseline tracking, diff, auto-alias creation, drift detection, `--sync`
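
The relationship between the two modes and the `--sync` flag can be sketched as a small validation step. This is an illustration under assumptions: `ScanFlags`, `ScanMode`, and `resolve_mode` are hypothetical names, and only the rule "`--sync` requires `--persist`" comes from this document.

```rust
/// Which detector a scan would use (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum ScanMode {
    Ephemeral,
    Persistent,
}

/// The two flags that matter for mode selection (hypothetical type).
pub struct ScanFlags {
    pub persist: bool,
    pub sync: bool,
}

/// Pick the scan mode, rejecting `--sync` without `--persist`
/// because ephemeral mode has no storage to write observations to.
pub fn resolve_mode(flags: &ScanFlags) -> Result<ScanMode, String> {
    if flags.sync && !flags.persist {
        return Err("--sync requires --persist".to_string());
    }
    Ok(if flags.persist {
        ScanMode::Persistent
    } else {
        ScanMode::Ephemeral
    })
}
```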
---
## Authority Tiers
| Tier | Source | Example | Weight |
|------|--------|---------|--------|
| 0 | Regulatory | RFC 7519: "JWT audience validation is mandatory" | 1.0 |
| 1 | Clinical | OWASP: "TLS certificate verification required" | 0.9 |
| 2 | Observational | Vendor docs: "Redis timeout should be > 0" | 0.7 |
| 3 | Expert | Team policy: "Our pool size is 50" | 0.5 |
| 4 | Community | Prior observations from this codebase | 0.3 |

**Conflict Score Formula:**
```
score = Σ(tier_weight × assertion_confidence × value_difference)
```
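
As a worked illustration of the formula, the sketch below sums the three factors over a set of matched authority assertions. The `Match` struct and function names are hypothetical, and the scale of `value_difference` (0.0 for identical values, 1.0 for fully different) is an assumption; only the tier weights are taken from the table above.

```rust
/// One authority assertion matched against a code claim (hypothetical type).
pub struct Match {
    pub tier: u8,              // 0..=4, per the Authority Tiers table
    pub confidence: f64,       // assertion confidence
    pub value_difference: f64, // assumed scale: 0.0 = same, 1.0 = fully different
}

/// Tier weights from the Authority Tiers table.
pub fn tier_weight(tier: u8) -> f64 {
    match tier {
        0 => 1.0,
        1 => 0.9,
        2 => 0.7,
        3 => 0.5,
        4 => 0.3,
        _ => 0.0, // assumption: unknown tiers contribute nothing
    }
}

/// score = sum of tier_weight * assertion_confidence * value_difference
pub fn conflict_score(matches: &[Match]) -> f64 {
    matches
        .iter()
        .map(|m| tier_weight(m.tier) * m.confidence * m.value_difference)
        .sum()
}
```

For example, a single Tier 1 assertion with confidence 1.0 and a fully different value contributes 0.9 to the score.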
---
## Concept Matching
### Tail-Path Matching (ConceptIndex)
The primary matching algorithm uses the last 2 path segments to enable cross-scheme matching:
```
RFC assertion: rfc://5246/tls/cert_verification
Code claim: code://rust/myapp/tls/cert_verification
Both produce key: "tls/cert_verification::enabled"
```
**Algorithm:**
1. Strip scheme (`rfc://`, `code://`)
2. Take last 2 non-empty path segments
3. Append predicate
4. Key = `{seg[-2]}/{seg[-1]}::{predicate}`
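
The four steps above can be sketched as a single function. This is a minimal illustration rather than the code in `concept_index.rs`: the name `tail_path_key` is hypothetical, and returning `None` for paths with fewer than two segments is an assumption the document does not specify.

```rust
/// Build a tail-path key from a concept URI and a predicate.
/// Returns None when fewer than two non-empty segments exist (assumed behavior).
pub fn tail_path_key(uri: &str, predicate: &str) -> Option<String> {
    // 1. Strip the scheme ("rfc://", "code://", ...) if present.
    let path = match uri.split_once("://") {
        Some((_scheme, rest)) => rest,
        None => uri,
    };
    // 2. Keep non-empty segments and take the last two.
    let segments: Vec<&str> = path.split('/').filter(|s| !s.is_empty()).collect();
    if segments.len() < 2 {
        return None;
    }
    let tail = &segments[segments.len() - 2..];
    // 3-4. Append the predicate: "{seg[-2]}/{seg[-1]}::{predicate}".
    Some(format!("{}/{}::{}", tail[0], tail[1], predicate))
}
```

Applied to the two URIs in the example, both calls yield `tls/cert_verification::enabled`, which is what lets an RFC assertion and a code claim meet in the same index bucket.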
### Alias Resolution
When tail-path matching fails, the system checks registered aliases. Aliases can be:
- **Auto-created** - When conflicts are detected, persist the relationship (persistent mode)
- **Manual** - Created via `aphoria bless` or Trust Pack import
- **Policy aliases** - (Planned) From Trust Packs for enterprise policy enforcement - see [Policy Alias Implementation](./policy-alias-implementation.md)
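
The fallback order (direct tail-path key first, then registered aliases) might look like the sketch below. The struct layout and method names are hypothetical, and aliases are modeled simply as pairs of tail-path keys scanned linearly, which mirrors the O(A) cost listed under Lookup Complexity but is not confirmed as the real data structure.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory index: tail-path key -> authority assertion ids.
#[derive(Default)]
pub struct ConceptIndex {
    by_key: HashMap<String, Vec<String>>,
    /// (from_key, to_key) pairs; scanned linearly on a direct-lookup miss.
    aliases: Vec<(String, String)>,
}

impl ConceptIndex {
    pub fn insert(&mut self, key: &str, assertion_id: &str) {
        self.by_key
            .entry(key.to_string())
            .or_default()
            .push(assertion_id.to_string());
    }

    pub fn add_alias(&mut self, from_key: &str, to_key: &str) {
        self.aliases.push((from_key.to_string(), to_key.to_string()));
    }

    /// Try the direct key (O(1)), then fall back to the first matching alias.
    pub fn lookup(&self, key: &str) -> &[String] {
        if let Some(hits) = self.by_key.get(key) {
            return hits;
        }
        self.aliases
            .iter()
            .find(|(from, _)| from.as_str() == key)
            .and_then(|(_, target)| self.by_key.get(target))
            .map(|hits| hits.as_slice())
            .unwrap_or(&[])
    }
}
```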
---
## Extractors
### Built-in Extractors (14)
| Extractor | Languages | Detects |
|-----------|-----------|---------|
| `tls_verify` | 8 | TLS certificate verification disabled |
| `tls_version` | 8 | Deprecated TLS 1.0/1.1 per RFC 8996 |
| `jwt_config` | 8 | JWT alg:none, skip signature verification |
| `hardcoded_secrets` | 8 | API keys, passwords in code |
| `timeout_config` | 8 | HTTP/DB/Redis timeout values |
| `dep_versions` | 3 | Dependency versions for advisory lookup |
| `cors_config` | 8 | CORS wildcard + credentials |
| `rate_limit` | 8 | Rate limiting configuration |
| `weak_crypto` | 5 | MD5, SHA1, DES, RC4 usage |
| `sql_injection` | 5 | SQL string interpolation |
| `command_injection` | 5 | Shell exec, os.system |
| `unreal_cpp` | C++ | Unreal Engine Exec functions |
| `unreal_config` | INI | Unreal Engine INI patterns |
| `unreal_performance` | C++ | Synchronous asset loading |
### Declarative Extractors
Users can define custom extractors in `aphoria.toml`:
```toml
[[extractors.declarative]]
name = "deprecated_api_v1"
description = "Detects usage of deprecated v1 API endpoints"
languages = ["go", "rust", "python"]
pattern = '/api/v1/\w+'
claim.subject = "api/deprecated_endpoint"
claim.predicate = "version"
claim.value = "v1"
confidence = 1.0
```
---
## Verdicts
| Verdict | Score Range | Exit Code | Action |
|---------|-------------|-----------|--------|
| `Block` | ≥ 0.7 | 2 | Must fix before commit |
| `Flag` | ≥ 0.4 and < 0.7 | 1 | Should review |
| `Pass` | < 0.4 | 0 | No conflict |
| `Ack` | N/A | 0 | Acknowledged intentional |
| `Drift` | N/A | 1 | Changed from prior value |
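
The table can be read as a mapping from score to verdict to exit code. The sketch below encodes it directly; the enum and method names are illustrative, and taking the most severe exit code across all findings in a scan is an assumption rather than documented behavior.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Verdict {
    Block,
    Flag,
    Pass,
    Ack,
    Drift,
}

impl Verdict {
    /// Score-based verdicts only; Ack and Drift are assigned by other checks.
    pub fn from_score(score: f64) -> Verdict {
        if score >= 0.7 {
            Verdict::Block
        } else if score >= 0.4 {
            Verdict::Flag
        } else {
            Verdict::Pass
        }
    }

    /// Exit codes from the Verdicts table.
    pub fn exit_code(self) -> i32 {
        match self {
            Verdict::Block => 2,
            Verdict::Flag | Verdict::Drift => 1,
            Verdict::Pass | Verdict::Ack => 0,
        }
    }
}

/// Assumption: a scan exits with the most severe code among its findings.
pub fn scan_exit_code(verdicts: &[Verdict]) -> i32 {
    verdicts.iter().map(|v| v.exit_code()).max().unwrap_or(0)
}
```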
---
## Trust Packs (Phase 6)
Signed bundles of assertions and aliases for federated policy distribution.
**Schema:**
```rust
pub struct TrustPack {
    pub header: PackHeader,         // name, version, issuer_id, timestamp
    pub assertions: Vec<Assertion>,
    pub aliases: Vec<ConceptAlias>,
    pub signature: [u8; 64],        // Ed25519 signature
}
```
**Operations:**
- `aphoria policy export` - Create signed pack from local decisions
- `aphoria policy import` - Load pack, verify signature, ingest assertions
- `aphoria.toml` - Auto-load policies from `policies = [...]` list
---
## Hosted Mode (Phase 4E)
Team aggregation via central StemeDB server.
```toml
[hosted]
url = "https://episteme.acme.corp"
project_id = "billing-service"
team_id = "platform-team"
sync_mode = "remote-only" # or "local-and-remote"
offline_fallback = "skip" # or "fail" or "queue"
api_key_env = "APHORIA_API_KEY"
```
**Flow:**
```
Developer scans → HostedClient → POST /v1/aphoria/observations → Team Server
```
---
## Community Sharing (Phase 5.6)
Opt-in anonymous pattern contribution.
**Privacy Model:**
- Project names wildcarded: `code://rust/myapp/tls` → `code://rust/*/tls`
- File paths, line numbers, matched text NEVER shared
- Timestamps rounded to hour (k-anonymity)
- `enabled` defaults to `false` (explicit opt-in)
```toml
[community]
enabled = true
anonymize = true
min_confidence = 0.8
exclude = ["vendor://acme/internal/*"]
```
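
Two of the privacy rules above lend themselves to a short sketch. This is not the code in `anonymizer.rs`: the function names are hypothetical, the project name is assumed to be the second segment after `code://` (as in the example `code://rust/myapp/tls`), and "rounded to hour" is interpreted here as truncating to the start of the hour.

```rust
/// Replace the project segment of a `code://<language>/<project>/...` subject with `*`.
/// Subjects with other schemes are returned unchanged (assumption).
pub fn wildcard_project(subject: &str) -> String {
    let rest = match subject.strip_prefix("code://") {
        Some(rest) => rest,
        None => return subject.to_string(),
    };
    let mut parts: Vec<&str> = rest.split('/').collect();
    if parts.len() >= 2 {
        parts[1] = "*"; // assumed position of the project name
    }
    format!("code://{}", parts.join("/"))
}

/// Truncate a Unix timestamp (seconds) to the start of its hour.
pub fn round_to_hour(unix_seconds: u64) -> u64 {
    unix_seconds - unix_seconds % 3600
}
```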
---
## Key Documents
### Concept Matching System
**Problem:** How do we match code extractors to authoritative policies across different hierarchies?
1. **[Concept Matching Analysis](./concept-matching-analysis.md)**
   - Identifies the gap: tail-path matching works for RFCs but breaks for enterprise policies
   - Analyzes root cause: semantic mismatch between policy hierarchies and extractor output
   - Proposes solution: explicit policy aliases in Trust Packs
2. **[Policy Alias Implementation Guide](./policy-alias-implementation.md)**
   - Day-by-day implementation plan (5 phases over 3 days)
   - Code sketches with exact file locations
   - Test strategies and success criteria
   - Migration and rollout plan
3. **[Matching Philosophy](./matching-philosophy.md)**
   - Core design principles: semantic over syntactic, progressive precision, explicit control
   - Why tail-path matching works (by design for RFC/OWASP corpus)
   - Why it breaks (enterprise hierarchies violate assumptions)
   - Future extension points (semantic embeddings, ontology mapping)
4. **[Enterprise Validation](./enterprise-validation.md)**
   - End-to-end scenario walkthrough
   - Validates that policy aliases solve the enterprise use case
   - Edge case analysis
   - Real-world adoption path
### LLM Extraction Quality
**Problem:** How do we ensure LLM prompts produce consistent, high-quality extraction results?

5. **[LLM Prompt Evaluation - Vision](./llm-prompt-evaluation.md)**
   - Problem statement and enterprise requirements
   - Architecture overview and core components
   - Fixture format design
   - CI/CD integration patterns
6. **[LLM Prompt Evaluation - Implementation](./llm-eval-implementation.md)** (start here)
   - Actionable implementation spec
   - Code snippets and file locations
   - 5-phase implementation plan (11 days)
   - Seed fixture list
---
## Quick Reference
### When to Read What
| If you need to... | Read this |
|-------------------|-----------|
| Understand concept matching | [Concept Matching Analysis](./concept-matching-analysis.md) |
| Implement policy aliases | [Policy Alias Implementation](./policy-alias-implementation.md) |
| Understand design philosophy | [Matching Philosophy](./matching-philosophy.md) |
| Validate enterprise scenarios | [Enterprise Validation](./enterprise-validation.md) |
| Test/evaluate LLM prompts | [LLM Eval Implementation](./llm-eval-implementation.md) |
| Add a new extractor | `src/extractors/mod.rs` |
| Understand scan flow | `src/scan.rs` |
| Modify conflict detection | `src/episteme/conflict.rs` |
| Work with Trust Packs | `src/policy.rs`, `src/policy_ops.rs` |
| Work with LLM extraction | `src/llm/` |
---
## Architecture Decisions
### AD-001: Explicit Policy Aliases
**Status:** Approved (2026-02-04) - **Not Yet Implemented**

**Context:** Security teams need to create policies using logical hierarchies (`code://standards/*`) that don't align with extractor output (`code://rust/myapp/*`).

**Decision:** Add `PolicyAlias` type to Trust Packs with glob pattern matching.

**Implementation:** See [Policy Alias Implementation Guide](./policy-alias-implementation.md) for detailed implementation plan.

**Consequences:**
- Enables enterprise policy enforcement
- Maintains backward compatibility
- Keeps security teams in control (explicit aliases)
- Requires manual alias creation
- Adds cognitive overhead (pattern syntax)
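
Since AD-001 is not yet implemented, the following is only a sketch of what glob-based alias resolution could look like. The `PolicyAlias` fields, the `resolve` function, and the glob semantics (a `*` matches any run of characters, including `/`) are all assumptions made for illustration; the implementation guide is the place to look for the intended design.

```rust
/// Hypothetical shape of a policy alias: a glob over code subjects and a policy subject.
pub struct PolicyAlias {
    pub pattern: String,
    pub target: String,
}

/// Minimal glob matcher: `*` matches any run of characters (assumed semantics).
pub fn glob_match(pattern: &str, text: &str) -> bool {
    let p: Vec<char> = pattern.chars().collect();
    let t: Vec<char> = text.chars().collect();
    let (mut pi, mut ti) = (0, 0);
    // Position to resume from after the most recent `*`: (pattern idx, text idx).
    let mut star: Option<(usize, usize)> = None;
    while ti < t.len() {
        if pi < p.len() && p[pi] == '*' {
            star = Some((pi + 1, ti));
            pi += 1;
        } else if pi < p.len() && p[pi] == t[ti] {
            pi += 1;
            ti += 1;
        } else if let Some((sp, st)) = star {
            // Backtrack: let the last `*` absorb one more character.
            pi = sp;
            ti = st + 1;
            star = Some((sp, st + 1));
        } else {
            return false;
        }
    }
    // Any trailing `*` can match the empty string.
    while pi < p.len() && p[pi] == '*' {
        pi += 1;
    }
    pi == p.len()
}

/// Return the target of the first alias whose pattern matches the subject.
pub fn resolve<'a>(aliases: &'a [PolicyAlias], subject: &str) -> Option<&'a str> {
    aliases
        .iter()
        .find(|alias| glob_match(&alias.pattern, subject))
        .map(|alias| alias.target.as_str())
}
```

A linear scan over aliases keeps the matching logic explicit and auditable, at the cost of work proportional to the number of patterns.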
### AD-002: Ephemeral Mode Default
**Status:** Implemented (2026-01-28)

**Context:** Full Episteme initialization took ~1-2s, too slow for pre-commit hooks.

**Decision:** Default to ephemeral mode (in-memory only), opt-in to persistent with `--persist`.

**Consequences:**
- Roughly 4-8x faster scans (~0.25s versus ~1-2s)
- No storage pollution for quick checks
- Drift detection requires `--persist`
- Observation write-back requires `--persist --sync`
### AD-003: Tail-Path Matching
**Status:** Implemented

**Context:** Need to match code claims against RFCs/OWASP assertions with different URI schemes.

**Decision:** Use last 2 path segments + predicate as index key.

**Consequences:**
- O(1) lookup via HashMap
- Works for RFC/OWASP corpus by design
- Breaks for enterprise policies with different hierarchies (to be addressed by AD-001, which is approved but not yet implemented)
### AD-004: LLM Prompt Evaluation System
**Status:** Proposed (2026-02-05)

**Context:** LLM prompts that drive claim extraction are code, but we don't treat them like code. No tests, no metrics, no regression detection. When prompts change, we don't know if quality improved or degraded.

**Decision:** Build a comprehensive prompt evaluation system with:
- Golden corpus of test fixtures with expected outcomes
- Observation logging for every extraction
- Metrics computation (precision, recall, F1, cost)
- Regression detection against baselines
- CI integration (smoke tests per-PR, full eval nightly)

**Implementation:** See [LLM Prompt Evaluation Spec](./llm-prompt-evaluation.md)

**Consequences:**
- Prompt changes are validated before deployment
- Regressions are caught automatically
- Quality is measurable over time
- Enterprise confidence in extraction reliability
- Requires maintaining golden corpus
- Live evaluation has token cost
---
## Design Principles
### 1. Semantic Over Syntactic
Match concepts by meaning, not exact string paths.
### 2. Progressive Precision
Start with simple heuristics (tail-path), add layers (aliases, embeddings) as needed.
### 3. Explicit Over Implicit
Matching logic should be transparent, auditable, and controllable.
### 4. Zero Configuration (for common cases)
Bundled corpus (RFCs, OWASP) should "just work" with tail-path matching.
### 5. Cryptographic Trust
All policies are signed (Ed25519) and verified before use.
### 6. Privacy by Default
Community sharing is opt-in with anonymization enabled by default.

---
## Extension Points
### Current (2026-02-05)
- Tail-path matching (O(1) hash lookup)
- Concept aliases (auto-created on conflict detection)
- Declarative extractors (user-defined in TOML)
- Hosted mode (team aggregation)
- Community corpus (anonymous sharing)
- LLM-in-the-loop extraction (Gemini semantic claims)
- Pattern learning (LLM-extracted patterns remembered)
### In Progress
- **LLM Prompt Evaluation** - Testing, metrics, and regression detection for prompts ([Spec](./llm-prompt-evaluation.md))
- **Policy aliases** - Enterprise policy matching via glob patterns ([AD-001](./policy-alias-implementation.md))
### Planned (Q1 2026)
- Semantic embeddings (fuzzy matching via vector similarity)
- Alias auto-discovery (suggest aliases during scan)
- High-entropy secret detection
- Framework-specific extractors (Spring, Django, Express)
### Future (Q2+ 2026)
- Ontology mapping (define semantic relationships)
- Trust Pack composition (packs can extend other packs)
- LLM-assisted extraction (semantic code understanding)
- Config file deep parsing (structured YAML/JSON/TOML)
---
## Performance Targets
### Scan Time
- **Ephemeral:** < 0.3s for typical project
- **Persistent:** < 2s for typical project
- **With Policy Aliases:** < 5% increase
### Memory Overhead
- **Policy Alias Storage:** ~100 bytes per alias
- **Typical Trust Pack:** < 10 KB (10 aliases)
- **Corpus in memory:** ~2-5 MB (varies by sources enabled)
### Lookup Complexity
- **Direct tail-path:** O(1)
- **Concept alias resolution:** O(A) where A=aliases
- **Policy alias fallback (planned):** O(P * A) where P=patterns, A=aliases
---
## Testing Strategy
### Unit Tests
- Extractor pattern matching
- ConceptIndex key generation
- Conflict score calculation
- Trust Pack serialization/verification
### Integration Tests
- Full scan flow with corpus
- Trust Pack import/export
- Drift detection
- Observation write-back
### UAT Scenarios
- Enterprise security team workflow
- Multi-language policy enforcement
- CI/CD integration
- Hosted mode aggregation
---
## Corpus Architecture
Aphoria's corpus is **emergent**, not hardcoded. Best practices come from community usage and external sources.
### Community Corpus (Primary)
- **Source:** StemeDB pattern aggregates
- **Builder:** `CommunityCorpusBuilder` queries `PatternAggregateStore`
- **Promotion:** Patterns with 95%+ adoption + RFC/OWASP match auto-promote to corpus
- **Storage:** StemeDB (graph database), indexed as `AUTHORITATIVE` predicate

Example:
```
Pattern: tls/cert_verification:enabled=true
Adoption: 847/892 projects (95%)
Authority: RFC 5246
→ Auto-promoted to corpus (Tier 0: Regulatory)
```
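
The promotion rule can be sketched as a predicate over adoption counts. The function names are hypothetical, and one detail is an assumption: the sketch compares the adoption percentage rounded to the nearest whole percent, which matches how the example above reports 847/892 as 95%. A check against the exact ratio would be slightly stricter.

```rust
/// Adoption as a whole percentage, rounded to nearest; None when there are no projects.
pub fn adoption_percent(adopters: u64, total: u64) -> Option<u64> {
    if total == 0 {
        return None;
    }
    Some((adopters * 100 + total / 2) / total)
}

/// Auto-promote when adoption is at least 95% and an RFC/OWASP authority matches.
pub fn auto_promotes(adopters: u64, total: u64, has_authority_match: bool) -> bool {
    match adoption_percent(adopters, total) {
        Some(percent) => percent >= 95 && has_authority_match,
        None => false,
    }
}
```

High adoption alone is not enough in this sketch: without a matching authority, the pattern stays out of the corpus.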
### Bootstrap Options
**New projects need baseline assertions.**

**Option 1: Wiki Import**
```bash
aphoria corpus import --from-wiki ~/docs
# Parses markdown for MUST/SHOULD patterns
# Creates assertions, stores in StemeDB
```
**Option 2: Trust Pack**
```bash
aphoria trust-pack install rfc-owasp-baseline
# Imports curated assertions
# Stores in StemeDB
```
**Option 3: Skill Cold Start**
```bash
# aphoria-suggest analyzes project
# Suggests 3-5 foundation claims
# User approves → CLI creates assertions
```
### No More Hardcoded Corpus
~~`hardcoded.rs`~~ deleted. The 19 original assertions are available as `rfc-owasp-baseline` Trust Pack for bootstrap only.
**Philosophy:** The corpus isn't written by experts. It's discovered by the community and validated by authorities.

---
## Related Documentation
### Product
- [Product Overview](../../product.md) - What Aphoria does
- [Roadmap](../../roadmap.md) - Implementation status and plans
### Guides
- [Enterprise Quick Start](../guides/enterprise-quick-start.md) - Getting started
- [Federating Truth](../guides/federating-truth.md) - Trust Pack workflows
### Implementation
- [Policy Ops](../../src/policy_ops.rs) - Trust Pack CLI handlers
- [Concept Index](../../src/episteme/concept_index.rs) - Matching algorithm
- [Local Episteme](../../src/episteme/local.rs) - Conflict detection
- [Ephemeral Detector](../../src/episteme/ephemeral.rs) - Fast path
---
## Questions or Feedback?
Discuss in:
- `#aphoria-architecture` (internal Slack)
- GitHub Issues (public feedback)
- Architecture review meetings (Fridays 2pm PT)
---
**This directory is the source of truth for architectural decisions.** All major changes should be documented here before implementation.

---
*Last updated: 2026-02-05*