# Aphoria Architecture Documentation

This directory contains architectural decision records, analysis, and design philosophy for Aphoria.

---

## System Overview

Aphoria is a **code-level truth linter** that validates code against authoritative sources (RFCs, OWASP, vendor docs). It extracts implicit claims from code and configs, then checks them against a tiered authority system.

### High-Level Architecture

```
┌──────────────────────────────────────────────────────────────────────────┐
│                           Aphoria CLI Pipeline                           │
├──────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  ┌──────────────┐                                                        │
│  │   CLI/Args   │ ──▶ handlers.rs dispatches to scan, policy, research   │
│  └──────────────┘                                                        │
│         │                                                                │
│         ▼                                                                │
│  ┌──────────────┐   ┌────────────────┐   ┌──────────────┐                │
│  │    Walker    │──▶│   Extractors   │──▶│    Bridge    │                │
│  │ (walk files) │   │ (14 built-in)  │   │ (claim→assn) │                │
│  └──────────────┘   └────────────────┘   └──────────────┘                │
│         │                   │                    │                       │
│         │                   │                    ▼                       │
│         │                   │          ┌──────────────────┐              │
│         │                   │          │  Episteme Layer  │              │
│         │                   │          │                  │              │
│         │                   │          │ ┌──────────────┐ │              │
│         │                   │          │ │  Ephemeral   │ │ ◀─ Fast path │
│         │                   │          │ │   Detector   │ │    (~0.25s)  │
│         │                   │          │ └──────────────┘ │              │
│         │                   │          │        OR        │              │
│         │                   │          │ ┌──────────────┐ │              │
│         │                   │          │ │    Local     │ │ ◀─ Full path │
│         │                   │          │ │   Episteme   │ │    (~1-2s)   │
│         │                   │          │ └──────────────┘ │              │
│         │                   │          └──────────────────┘              │
│         │                   │                    │                       │
│         ▼                   ▼                    ▼                       │
│  ┌────────────────────────────────────────────────────────────────┐      │
│  │                       Conflict Detection                       │      │
│  │  ConceptIndex (tail-path) + Aliases + Policy Source Tracking   │      │
│  └────────────────────────────────────────────────────────────────┘      │
│                                  │                                       │
│                                  ▼                                       │
│  ┌──────────────┐   ┌────────────────┐   ┌──────────────┐                │
│  │    Report    │   │  Drift Check   │   │ Observation  │                │
│  │ (table/json/ │   │ (self-conflict)│   │  Write-back  │                │
│  │  sarif/md)   │   │                │   │   (--sync)   │                │
│  └──────────────┘   └────────────────┘   └──────────────┘                │
│                                                                          │
└──────────────────────────────────────────────────────────────────────────┘
```

### Data Flow

1. **WALK** - Traverse project directory (respects `.gitignore`, supports `--staged` for git-staged files only)
2. **EXTRACT** - Run 14 built-in extractors + declarative extractors to find implicit claims
3. **INGEST** - Convert claims to Episteme assertions (BLAKE3 hash + Ed25519 signature)
4. **CONFLICT** - Query ConceptIndex for authority matches using tail-path matching
5. **DRIFT** - Compare against prior observations (self-conflict detection)
6. **REPORT** - Output in table, JSON, SARIF 2.1.0, or Markdown format
7. **SYNC** - (Optional) Write-back novel observations to local store or hosted server

---
## Key Modules

| Module | Purpose | Key Files / Notes |
|--------|---------|-------------------|
| `cli.rs` | Clap-based CLI argument parsing | Command definitions |
| `handlers.rs` | Command dispatch, validation | Enforces `--sync` requires `--persist` |
| `scan.rs` | Main scan orchestrator | Mode dispatch, observation flow |
| `walker/` | Project traversal | `mod.rs`, `git.rs`, `path_mapper.rs`, `language.rs` |
| `extractors/` | 14 pattern-based claim extractors | `mod.rs`, individual extractors |
| `bridge.rs` | ExtractedClaim → Assertion conversion | BLAKE3 hashing, Ed25519 signing |
| `episteme/` | Conflict detection core | `ephemeral.rs`, `local.rs`, `concept_index.rs` |
| `policy.rs` | Trust Pack management | Load/save/verify signed packs |
| `policy_ops.rs` | CLI policy operations | `bless`, `ack`, `update`, `export/import` |
| `report/` | Output formatting | `table.rs`, `json.rs`, `sarif.rs`, `markdown.rs` |
| `hosted.rs` | HTTP client for team aggregation | Push observations to remote server |
| `community/` | Anonymous pattern contribution | `anonymizer.rs`, `types.rs` |
| `research/` | Gap detection and auto-research | `gap_detector.rs`, `researcher.rs` |
| `config/` | `aphoria.toml` parsing | All configuration types |
| `types/` | Domain types | `claim.rs`, `verdict.rs`, `result.rs`, `command.rs` |
| `corpus/` | Authoritative source builders | `rfc/`, `owasp/`, `vendor.rs`, `hardcoded.rs` |

---
## Scan Modes

| Mode | Storage | Performance | Features |
|------|---------|-------------|----------|
| **Ephemeral** (default) | None | ~0.25s | Conflict detection only |
| **Persistent** (`--persist`) | WAL + KV | ~1-2s | Baseline, diff, aliases, drift, observation write-back |

### Ephemeral Mode (`EphemeralDetector`)

- Builds corpus + ConceptIndex entirely in-memory
- No disk I/O during scan
- Perfect for CI/pre-commit hooks
- Cannot detect drift (no prior state)
- Cannot write observations (no storage)

### Persistent Mode (`LocalEpisteme`)

- Full Episteme stack initialization
- WAL recovery on startup
- Enables: baseline tracking, diff, auto-alias creation, drift detection, `--sync`

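
The relationship between the two modes and the `--sync` flag can be sketched as a small validation step. This is only an illustration of the rule that `--sync` requires `--persist`; `ScanMode` and `select_mode` are hypothetical names, not Aphoria's actual types.

```rust
/// Hypothetical mode enum mirroring the two scan modes described above.
#[derive(Debug, PartialEq)]
pub enum ScanMode {
    /// In-memory only: no drift detection, no observation write-back.
    Ephemeral,
    /// WAL + KV backed; `sync` turns on observation write-back.
    Persistent { sync: bool },
}

/// Maps the `--persist` and `--sync` flags to a mode. `--sync` without
/// `--persist` is rejected because an ephemeral scan has no storage to
/// write observations to.
pub fn select_mode(persist: bool, sync: bool) -> Result<ScanMode, String> {
    match (persist, sync) {
        (false, true) => Err("--sync requires --persist".to_string()),
        (false, false) => Ok(ScanMode::Ephemeral),
        (true, sync) => Ok(ScanMode::Persistent { sync }),
    }
}
```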

---

## Authority Tiers

| Tier | Source | Example | Weight |
|------|--------|---------|--------|
| 0 | Regulatory | RFC 7519: "JWT audience validation is mandatory" | 1.0 |
| 1 | Clinical | OWASP: "TLS certificate verification required" | 0.9 |
| 2 | Observational | Vendor docs: "Redis timeout should be > 0" | 0.7 |
| 3 | Expert | Team policy: "Our pool size is 50" | 0.5 |
| 4 | Community | Prior observations from this codebase | 0.3 |

**Conflict Score Formula:**

```
score = Σ(tier_weight × assertion_confidence × value_difference)
```
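
A minimal sketch of the formula, using the weights from the Authority Tiers table. The `AuthorityMatch` struct is hypothetical, and treating `value_difference` as a number between 0.0 (agrees) and 1.0 (contradicts) is an assumption made for illustration; the actual normalization is not specified here.

```rust
/// One authority assertion that matched a code claim (hypothetical shape).
pub struct AuthorityMatch {
    pub tier: u8,              // 0..=4, as in the Authority Tiers table
    pub confidence: f64,       // assertion confidence, assumed in [0, 1]
    pub value_difference: f64, // assumed 0.0 (agrees) to 1.0 (contradicts)
}

/// Weight per tier, copied from the Authority Tiers table.
pub fn tier_weight(tier: u8) -> Option<f64> {
    match tier {
        0 => Some(1.0),
        1 => Some(0.9),
        2 => Some(0.7),
        3 => Some(0.5),
        4 => Some(0.3),
        _ => None,
    }
}

/// score = Σ(tier_weight × assertion_confidence × value_difference).
/// Matches with an unknown tier contribute nothing to the sum.
pub fn conflict_score(matches: &[AuthorityMatch]) -> f64 {
    matches
        .iter()
        .filter_map(|m| tier_weight(m.tier).map(|w| w * m.confidence * m.value_difference))
        .sum()
}
```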

---

## Concept Matching

### Tail-Path Matching (ConceptIndex)

The primary matching algorithm keys each assertion on its last two path segments plus its predicate, so the same concept matches across URI schemes:

```
RFC assertion: rfc://5246/tls/cert_verification
Code claim:    code://rust/myapp/tls/cert_verification

Both produce key: "tls/cert_verification::enabled"
```

**Algorithm:**

1. Strip scheme (`rfc://`, `code://`)
2. Take last 2 non-empty path segments
3. Append predicate
4. Key = `{seg[-2]}/{seg[-1]}::{predicate}`

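
The four steps above can be sketched as a single function. `tail_path_key` is a hypothetical name, and returning `None` for URIs with fewer than two segments is an assumption; the real ConceptIndex may handle short paths differently.

```rust
/// Builds a tail-path key from a concept URI and a predicate, following
/// the four algorithm steps. Returns `None` when fewer than two non-empty
/// path segments remain after the scheme is stripped (an assumption).
pub fn tail_path_key(uri: &str, predicate: &str) -> Option<String> {
    // 1. Strip the scheme (`rfc://`, `code://`, ...), if present.
    let path = uri.split_once("://").map_or(uri, |(_, rest)| rest);

    // 2. Keep only non-empty segments, then take the last two.
    let segments: Vec<&str> = path.split('/').filter(|s| !s.is_empty()).collect();
    let [.., second_last, last] = segments.as_slice() else {
        return None;
    };

    // 3 + 4. Append the predicate: `{seg[-2]}/{seg[-1]}::{predicate}`.
    Some(format!("{second_last}/{last}::{predicate}"))
}
```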
### Alias Resolution

When tail-path matching fails, the system checks registered aliases. Aliases can be:

- **Auto-created** - Created when a conflict is detected; the relationship is persisted (persistent mode only)
- **Manual** - Created via `aphoria bless` or Trust Pack import
- **Policy aliases** - (Planned) From Trust Packs for enterprise policy enforcement - see [Policy Alias Implementation](./policy-alias-implementation.md)

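
A minimal sketch of the fallback order: direct tail-path lookup first, then a linear scan over registered aliases. `ConceptLookup`, its fields, and the single-hop alias behavior are hypothetical and chosen for illustration; assertion IDs are plain `u64` values here.

```rust
use std::collections::HashMap;

/// Hypothetical lookup structure: tail-path key -> authority assertion IDs,
/// plus a list of (from_key, to_key) aliases.
pub struct ConceptLookup {
    pub index: HashMap<String, Vec<u64>>,
    pub aliases: Vec<(String, String)>,
}

impl ConceptLookup {
    /// Tries the direct tail-path key first (hash lookup). On a miss,
    /// scans the aliases for `key` and retries with the alias target.
    /// Only one alias hop is followed in this sketch.
    pub fn resolve(&self, key: &str) -> Option<&[u64]> {
        if let Some(ids) = self.index.get(key) {
            return Some(ids.as_slice());
        }
        self.aliases
            .iter()
            .filter(|(from, _)| from.as_str() == key)
            .find_map(|(_, to)| self.index.get(to).map(Vec::as_slice))
    }
}
```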

---

## Extractors

### Built-in Extractors (14)

| Extractor | Languages | Detects |
|-----------|-----------|---------|
| `tls_verify` | 8 | TLS certificate verification disabled |
| `tls_version` | 8 | Deprecated TLS 1.0/1.1 per RFC 8996 |
| `jwt_config` | 8 | JWT alg:none, skip signature verification |
| `hardcoded_secrets` | 8 | API keys, passwords in code |
| `timeout_config` | 8 | HTTP/DB/Redis timeout values |
| `dep_versions` | 3 | Dependency versions for advisory lookup |
| `cors_config` | 8 | CORS wildcard + credentials |
| `rate_limit` | 8 | Rate limiting configuration |
| `weak_crypto` | 5 | MD5, SHA1, DES, RC4 usage |
| `sql_injection` | 5 | SQL string interpolation |
| `command_injection` | 5 | Shell exec, os.system |
| `unreal_cpp` | C++ | Unreal Engine Exec functions |
| `unreal_config` | INI | Unreal Engine INI patterns |
| `unreal_performance` | C++ | Synchronous asset loading |

### Declarative Extractors

Users can define custom extractors in `aphoria.toml`:

```toml
[[extractors.declarative]]
name = "deprecated_api_v1"
description = "Detects usage of deprecated v1 API endpoints"
languages = ["go", "rust", "python"]
pattern = '/api/v1/\w+'
claim.subject = "api/deprecated_endpoint"
claim.predicate = "version"
claim.value = "v1"
confidence = 1.0
```

---
## Verdicts

| Verdict | Score Range | Exit Code | Action |
|---------|-------------|-----------|--------|
| `Block` | score ≥ 0.7 | 2 | Must fix before commit |
| `Flag` | 0.4 ≤ score < 0.7 | 1 | Should review |
| `Pass` | score < 0.4 | 0 | No conflict |
| `Ack` | N/A | 0 | Acknowledged intentional |
| `Drift` | N/A | 1 | Changed from prior value |

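
The thresholds and exit codes in the table can be sketched as follows. The `Verdict` enum here is illustrative rather than the definition in `types/verdict.rs`, and `scan_exit_code` assumes the process exits with the most severe code among all findings, which this document does not state.

```rust
/// Verdicts from the table above (illustrative definition).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Verdict {
    Block,
    Flag,
    Pass,
    Ack,
    Drift,
}

impl Verdict {
    /// Score-based verdicts only. `Ack` and `Drift` carry no score range
    /// and are assumed to be decided elsewhere.
    pub fn from_score(score: f64) -> Verdict {
        if score >= 0.7 {
            Verdict::Block
        } else if score >= 0.4 {
            Verdict::Flag
        } else {
            Verdict::Pass
        }
    }

    /// Exit code per the Verdicts table.
    pub fn exit_code(self) -> i32 {
        match self {
            Verdict::Block => 2,
            Verdict::Flag | Verdict::Drift => 1,
            Verdict::Pass | Verdict::Ack => 0,
        }
    }
}

/// Assumption: the scan exits with the highest code among its verdicts.
pub fn scan_exit_code(verdicts: &[Verdict]) -> i32 {
    verdicts.iter().map(|v| v.exit_code()).max().unwrap_or(0)
}
```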

---

## Trust Packs (Phase 6)

Signed bundles of assertions and aliases for federated policy distribution.

**Schema:**

```rust
pub struct TrustPack {
    pub header: PackHeader,        // name, version, issuer_id, timestamp
    pub assertions: Vec<Assertion>,
    pub aliases: Vec<ConceptAlias>,
    pub signature: [u8; 64],       // Ed25519 signature
}
```

**Operations:**

- `aphoria policy export` - Create signed pack from local decisions
- `aphoria policy import` - Load pack, verify signature, ingest assertions
- `aphoria.toml` - Auto-load policies from `policies = [...]` list

---
## Hosted Mode (Phase 4E)

Team aggregation via central StemeDB server.

```toml
[hosted]
url = "https://episteme.acme.corp"
project_id = "billing-service"
team_id = "platform-team"
sync_mode = "remote-only"        # or "local-and-remote"
offline_fallback = "skip"        # or "fail" or "queue"
api_key_env = "APHORIA_API_KEY"
```

**Flow:**

```
Developer scans → HostedClient → POST /v1/aphoria/observations → Team Server
```

---
## Community Sharing (Phase 5.6)

Opt-in anonymous pattern contribution.

**Privacy Model:**

- Project names wildcarded: `code://rust/myapp/tls` → `code://rust/*/tls`
- File paths, line numbers, and matched text are NEVER shared
- Timestamps rounded to the hour (coarsened for k-anonymity)
- `enabled` defaults to `false` (explicit opt-in)

```toml
[community]
enabled = true
anonymize = true
min_confidence = 0.8
exclude = ["vendor://acme/internal/*"]
```
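
Two of the privacy rules above (project-name wildcarding and hour-level timestamps) can be sketched in a few lines. The function names are hypothetical and not taken from `anonymizer.rs`. The sketch assumes subjects follow the layout `code://{language}/{project}/...` and that "rounded to the hour" means truncating down to the start of the hour.

```rust
/// Wildcards the project segment of a `code://` subject, e.g.
/// `code://rust/myapp/tls` -> `code://rust/*/tls`.
/// Assumes the layout `code://{language}/{project}/...`; any other
/// input returns `None` rather than being shared unmodified.
pub fn anonymize_subject(subject: &str) -> Option<String> {
    let rest = subject.strip_prefix("code://")?;
    let mut parts: Vec<&str> = rest.split('/').collect();
    if parts.len() < 2 || parts[1].is_empty() {
        return None;
    }
    parts[1] = "*";
    Some(format!("code://{}", parts.join("/")))
}

/// Truncates a Unix timestamp (seconds) down to the start of its hour.
pub fn round_to_hour(unix_secs: u64) -> u64 {
    unix_secs - unix_secs % 3600
}
```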

---

## Key Documents

### Concept Matching System

**Problem:** How do we match code extractors to authoritative policies across different hierarchies?

1. **[Concept Matching Analysis](./concept-matching-analysis.md)**
   - Identifies the gap: tail-path matching works for RFCs but breaks for enterprise policies
   - Analyzes root cause: semantic mismatch between policy hierarchies and extractor output
   - Proposes solution: explicit policy aliases in Trust Packs

2. **[Policy Alias Implementation Guide](./policy-alias-implementation.md)**
   - Day-by-day implementation plan (5 phases over 3 days)
   - Code sketches with exact file locations
   - Test strategies and success criteria
   - Migration and rollout plan

3. **[Matching Philosophy](./matching-philosophy.md)**
   - Core design principles: semantic over syntactic, progressive precision, explicit control
   - Why tail-path matching works (by design for RFC/OWASP corpus)
   - Why it breaks (enterprise hierarchies violate assumptions)
   - Future extension points (semantic embeddings, ontology mapping)

4. **[Enterprise Validation](./enterprise-validation.md)**
   - End-to-end scenario walkthrough
   - Validates that policy aliases solve the enterprise use case
   - Edge case analysis
   - Real-world adoption path

---
## Quick Reference

### When to Read What

| If you need to... | Read this |
|-------------------|-----------|
| Understand the problem | [Concept Matching Analysis](./concept-matching-analysis.md) |
| Implement the solution | [Policy Alias Implementation](./policy-alias-implementation.md) |
| Understand design philosophy | [Matching Philosophy](./matching-philosophy.md) |
| Validate enterprise scenarios | [Enterprise Validation](./enterprise-validation.md) |
| Add a new extractor | `src/extractors/mod.rs` |
| Understand scan flow | `src/scan.rs` |
| Modify conflict detection | `src/episteme/conflict.rs` |
| Work with Trust Packs | `src/policy.rs`, `src/policy_ops.rs` |

---
## Architecture Decisions

### AD-001: Explicit Policy Aliases

**Status:** Approved (2026-02-04) - **Not Yet Implemented**

**Context:** Security teams need to create policies using logical hierarchies (`code://standards/*`) that don't align with extractor output (`code://rust/myapp/*`).

**Decision:** Add `PolicyAlias` type to Trust Packs with glob pattern matching.

**Implementation:** See [Policy Alias Implementation Guide](./policy-alias-implementation.md) for detailed implementation plan.

**Consequences:**

- ✅ Enables enterprise policy enforcement
- ✅ Maintains backward compatibility
- ✅ Keeps security teams in control (explicit aliases)
- ⚠️ Requires manual alias creation
- ⚠️ Adds cognitive overhead (pattern syntax)

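
Since AD-001 is not yet implemented, the following is only a rough sketch of what glob-based alias matching could look like. The `PolicyAlias` fields, the direction of the mapping (extractor subject pattern to policy subject), and the glob semantics (`*` matches any run of characters, including `/`) are all assumptions; the implementation guide linked above is the authoritative design.

```rust
/// Hypothetical shape of a policy alias: a glob over extractor subjects
/// and the policy subject that matching claims should be checked against.
pub struct PolicyAlias {
    pub pattern: String,
    pub target: String,
}

/// Minimal glob matcher: `*` matches any run of bytes (including `/`),
/// every other byte matches literally. Uses single-star backtracking.
pub fn glob_match(pattern: &str, text: &str) -> bool {
    let (p, t) = (pattern.as_bytes(), text.as_bytes());
    let (mut pi, mut ti) = (0, 0);
    // (pattern index just after the last `*`, text index it was tried at)
    let mut backtrack: Option<(usize, usize)> = None;
    while ti < t.len() {
        if pi < p.len() && p[pi] == b'*' {
            backtrack = Some((pi + 1, ti));
            pi += 1;
        } else if pi < p.len() && p[pi] == t[ti] {
            pi += 1;
            ti += 1;
        } else if let Some((bp, bt)) = backtrack {
            // Let the last `*` absorb one more byte and retry.
            pi = bp;
            ti = bt + 1;
            backtrack = Some((bp, bt + 1));
        } else {
            return false;
        }
    }
    // Any trailing `*` in the pattern may match the empty string.
    while pi < p.len() && p[pi] == b'*' {
        pi += 1;
    }
    pi == p.len()
}

/// Returns the target of the first alias whose pattern matches `subject`.
pub fn resolve_policy_alias<'a>(aliases: &'a [PolicyAlias], subject: &str) -> Option<&'a str> {
    aliases
        .iter()
        .find(|a| glob_match(&a.pattern, subject))
        .map(|a| a.target.as_str())
}
```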
### AD-002: Ephemeral Mode Default

**Status:** Implemented (2026-01-28)

**Context:** Full Episteme initialization took ~1-2s, too slow for pre-commit hooks.

**Decision:** Default to ephemeral mode (in-memory only), opt-in to persistent with `--persist`.

**Consequences:**

- ✅ Roughly 4-8x faster scans (~0.25s vs. ~1-2s)
- ✅ No storage pollution for quick checks
- ⚠️ Drift detection requires `--persist`
- ⚠️ Observation write-back requires `--persist --sync`

### AD-003: Tail-Path Matching

**Status:** Implemented

**Context:** Need to match code claims against RFC/OWASP assertions that use different URI schemes.

**Decision:** Use last 2 path segments + predicate as index key.

**Consequences:**

- ✅ O(1) lookup via HashMap
- ✅ Works for RFC/OWASP corpus by design
- ⚠️ Breaks for enterprise policies with different hierarchies (addressed by AD-001)

---
## Design Principles

### 1. Semantic Over Syntactic

Match concepts by meaning, not exact string paths.

### 2. Progressive Precision

Start with simple heuristics (tail-path), add layers (aliases, embeddings) as needed.

### 3. Explicit Over Implicit

Matching logic should be transparent, auditable, and controllable.

### 4. Zero Configuration (for common cases)

Bundled corpus (RFCs, OWASP) should "just work" with tail-path matching.

### 5. Cryptographic Trust

All policies are signed (Ed25519) and verified before use.

### 6. Privacy by Default

Community sharing is opt-in with anonymization enabled by default.

---
## Extension Points

### Current (2026-02-05)

- Tail-path matching (O(1) hash lookup)
- Concept aliases (auto-created on conflict detection)
- Declarative extractors (user-defined in TOML)
- Hosted mode (team aggregation)
- Community corpus (anonymous sharing)

### In Progress

- **Policy aliases** - Enterprise policy matching via glob patterns ([AD-001](./policy-alias-implementation.md))

### Planned (Q1 2026)

- Semantic embeddings (fuzzy matching via vector similarity)
- Alias auto-discovery (suggest aliases during scan)
- High-entropy secret detection
- Framework-specific extractors (Spring, Django, Express)

### Future (Q2+ 2026)

- Ontology mapping (define semantic relationships)
- Trust Pack composition (packs can extend other packs)
- LLM-assisted extraction (semantic code understanding)
- Config file deep parsing (structured YAML/JSON/TOML)

---
## Performance Targets

### Scan Time

- **Ephemeral:** < 0.3s for typical project
- **Persistent:** < 2s for typical project
- **With Policy Aliases:** < 5% increase

### Memory Overhead

- **Policy Alias Storage:** ~100 bytes per alias
- **Typical Trust Pack:** < 10 KB (10 aliases)
- **Corpus in memory:** ~2-5 MB (varies by sources enabled)

### Lookup Complexity

- **Direct tail-path:** O(1)
- **Concept alias resolution:** O(A) where A=aliases
- **Policy alias fallback (planned):** O(P * A) where P=patterns, A=aliases

---
## Testing Strategy

### Unit Tests

- Extractor pattern matching
- ConceptIndex key generation
- Conflict score calculation
- Trust Pack serialization/verification

### Integration Tests

- Full scan flow with corpus
- Trust Pack import/export
- Drift detection
- Observation write-back

### UAT Scenarios

- Enterprise security team workflow
- Multi-language policy enforcement
- CI/CD integration
- Hosted mode aggregation

---
## Related Documentation

### Product

- [Product Overview](../../product.md) - What Aphoria does
- [Roadmap](../../roadmap.md) - Implementation status and plans

### Guides

- [Enterprise Quick Start](../guides/enterprise-quick-start.md) - Getting started
- [Federating Truth](../guides/federating-truth.md) - Trust Pack workflows

### Implementation

- [Policy Ops](../../src/policy_ops.rs) - Trust Pack CLI handlers
- [Concept Index](../../src/episteme/concept_index.rs) - Matching algorithm
- [Local Episteme](../../src/episteme/local.rs) - Conflict detection
- [Ephemeral Detector](../../src/episteme/ephemeral.rs) - Fast path

---
## Questions or Feedback?

Discuss in:

- `#aphoria-architecture` (internal Slack)
- GitHub Issues (public feedback)
- Architecture review meetings (Fridays 2pm PT)

---

**This directory is the source of truth for architectural decisions.** All major changes should be documented here before implementation.

---

*Last updated: 2026-02-05*