
# Aphoria Architecture Documentation

This directory contains architectural decision records, analysis, and design philosophy for Aphoria.

---

## System Overview

Aphoria is a **code-level truth linter** that validates code against authoritative sources (RFCs, OWASP, vendor docs). It extracts implicit claims from code and configs, then checks them against a tiered authority system.

### High-Level Architecture

```
┌──────────────────────────────────────────────────────────────────────────┐
│                        Aphoria CLI Pipeline                               │
├──────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  ┌──────────────┐                                                        │
│  │   CLI/Args   │ ──▶ handlers.rs dispatches to scan, policy, research   │
│  └──────────────┘                                                        │
│         │                                                                │
│         ▼                                                                │
│  ┌──────────────┐  ┌────────────────┐  ┌──────────────┐                  │
│  │    Walker    │─▶│  Extractors    │─▶│    Bridge    │                  │
│  │ (walk files) │  │ (14 built-in)  │  │ (claim→assn) │                  │
│  └──────────────┘  └────────────────┘  └──────────────┘                  │
│         │                 │                    │                         │
│         │                 │                    ▼                         │
│         │                 │         ┌──────────────────┐                 │
│         │                 │         │  Episteme Layer  │                 │
│         │                 │         │                  │                 │
│         │                 │         │ ┌──────────────┐ │                 │
│         │                 │         │ │  Ephemeral   │ │ ◀─ Fast path    │
│         │                 │         │ │  Detector    │ │    (~0.25s)     │
│         │                 │         │ └──────────────┘ │                 │
│         │                 │         │        OR        │                 │
│         │                 │         │ ┌──────────────┐ │                 │
│         │                 │         │ │   Local      │ │ ◀─ Full path    │
│         │                 │         │ │  Episteme    │ │    (~1-2s)      │
│         │                 │         │ └──────────────┘ │                 │
│         │                 │         └──────────────────┘                 │
│         │                 │                    │                         │
│         ▼                 ▼                    ▼                         │
│  ┌────────────────────────────────────────────────────────────────┐      │
│  │                      Conflict Detection                         │      │
│  │  ConceptIndex (tail-path) + Aliases + Policy Source Tracking   │      │
│  └────────────────────────────────────────────────────────────────┘      │
│                                    │                                     │
│                                    ▼                                     │
│  ┌──────────────┐  ┌────────────────┐  ┌──────────────┐                  │
│  │    Report    │  │  Drift Check   │  │  Observation │                  │
│  │ (table/json/ │  │ (self-conflict)│  │  Write-back  │                  │
│  │  sarif/md)   │  │                │  │  (--sync)    │                  │
│  └──────────────┘  └────────────────┘  └──────────────┘                  │
│                                                                          │
└──────────────────────────────────────────────────────────────────────────┘
```

### Data Flow

1. **WALK** - Traverse project directory (respects `.gitignore`, supports `--staged` for git-staged files only)
2. **EXTRACT** - Run 14 built-in extractors + declarative extractors to find implicit claims
3. **INGEST** - Convert claims to Episteme assertions (BLAKE3 hash + Ed25519 signature)
4. **CONFLICT** - Query ConceptIndex for authority matches using tail-path matching
5. **DRIFT** - Compare against prior observations (self-conflict detection)
6. **REPORT** - Output in table, JSON, SARIF 2.1.0, or Markdown format
7. **SYNC** - (Optional) Write-back novel observations to local store or hosted server

---

## Key Modules

| Module | Purpose | Key Files / Notes |
|--------|---------|-----------|
| `cli.rs` | Clap-based CLI argument parsing | Command definitions |
| `handlers.rs` | Command dispatch, validation | Enforces flag rules such as `--sync` requires `--persist` |
| `scan.rs` | Main scan orchestrator | Mode dispatch, observation flow |
| `walker/` | Project traversal | `mod.rs`, `git.rs`, `path_mapper.rs`, `language.rs` |
| `extractors/` | 14 pattern-based claim extractors | `mod.rs`, individual extractors |
| `bridge.rs` | ExtractedClaim → Assertion conversion | BLAKE3 hashing, Ed25519 signing |
| `episteme/` | Conflict detection core | `ephemeral.rs`, `local.rs`, `concept_index.rs` |
| `policy.rs` | Trust Pack management | Load/save/verify signed packs |
| `policy_ops.rs` | `bless`, `ack`, `update`, `export/import` | CLI policy operations |
| `report/` | Output formatting | `table.rs`, `json.rs`, `sarif.rs`, `markdown.rs` |
| `hosted.rs` | HTTP client for team aggregation | Push observations to remote server |
| `community/` | Anonymous pattern contribution | `anonymizer.rs`, `types.rs` |
| `research/` | Gap detection and auto-research | `gap_detector.rs`, `researcher.rs` |
| `config/` | `aphoria.toml` parsing | All configuration types |
| `types/` | Domain types | `claim.rs`, `verdict.rs`, `result.rs`, `command.rs` |
| `corpus/` | Authoritative source builders | `rfc/`, `owasp/`, `vendor.rs`, `hardcoded.rs` |

---

## Scan Modes

| Mode | Storage | Performance | Features |
|------|---------|-------------|----------|
| **Ephemeral** (default) | None | ~0.25s | Conflict detection only |
| **Persistent** (`--persist`) | WAL + KV | ~1-2s | Baseline, diff, aliases, drift, observation write-back |

### Ephemeral Mode (`EphemeralDetector`)
- Builds corpus + ConceptIndex entirely in-memory
- No disk I/O during scan
- Well suited to CI and pre-commit hooks, where scan latency matters most
- Cannot detect drift (no prior state)
- Cannot write observations (no storage)

### Persistent Mode (`LocalEpisteme`)
- Full Episteme stack initialization
- WAL recovery on startup
- Enables: baseline tracking, diff, auto-alias creation, drift detection, `--sync`
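
The mode choice and the `--sync requires --persist` rule can be pictured as a small validation step. The following is a minimal sketch, not the code in `handlers.rs`; `ScanMode` and `select_mode` are hypothetical names introduced for illustration.

```rust
/// Hypothetical representation of the two scan modes.
#[derive(Debug, PartialEq)]
enum ScanMode {
    /// Default: in-memory only, no drift detection, no write-back.
    Ephemeral,
    /// `--persist`: WAL + KV storage; `sync` mirrors the `--sync` flag.
    Persistent { sync: bool },
}

/// Map the two CLI flags to a scan mode, rejecting `--sync` without `--persist`.
fn select_mode(persist: bool, sync: bool) -> Result<ScanMode, String> {
    match (persist, sync) {
        // Write-back needs storage, so this combination is invalid.
        (false, true) => Err("--sync requires --persist".to_string()),
        (false, false) => Ok(ScanMode::Ephemeral),
        (true, sync) => Ok(ScanMode::Persistent { sync }),
    }
}
```

Under these assumptions, `select_mode(false, false)` yields `ScanMode::Ephemeral`, and `select_mode(false, true)` returns an error before any scanning starts.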

---

## Authority Tiers

| Tier | Source | Example | Weight |
|------|--------|---------|--------|
| 0 | Regulatory | RFC 7519: "JWT audience validation is mandatory" | 1.0 |
| 1 | Clinical | OWASP: "TLS certificate verification required" | 0.9 |
| 2 | Observational | Vendor docs: "Redis timeout should be > 0" | 0.7 |
| 3 | Expert | Team policy: "Our pool size is 50" | 0.5 |
| 4 | Community | Prior observations from this codebase | 0.3 |

**Conflict Score Formula:**
```
score = Σ(tier_weight × assertion_confidence × value_difference)
```
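
As a rough illustration of the formula, the sketch below sums the weighted contributions of every disagreeing authority assertion. The `Disagreement` struct, its field names, and the assumption that `value_difference` lies in `[0, 1]` are illustrative choices, not Aphoria's actual types; the tier weights come from the table above.

```rust
/// One authority assertion that disagrees with a code claim.
/// Field names are illustrative, not Aphoria's actual types.
struct Disagreement {
    tier: u8,              // authority tier, 0..=4
    confidence: f64,       // assertion confidence, assumed to be in [0, 1]
    value_difference: f64, // assumed: 0.0 = same value, 1.0 = fully contradictory
}

/// Tier weights from the Authority Tiers table.
fn tier_weight(tier: u8) -> Option<f64> {
    match tier {
        0 => Some(1.0),
        1 => Some(0.9),
        2 => Some(0.7),
        3 => Some(0.5),
        4 => Some(0.3),
        _ => None,
    }
}

/// score = sum of (tier_weight * assertion_confidence * value_difference).
/// Assertions with an unknown tier contribute nothing in this sketch.
fn conflict_score(items: &[Disagreement]) -> f64 {
    items
        .iter()
        .filter_map(|d| tier_weight(d.tier).map(|w| w * d.confidence * d.value_difference))
        .sum()
}
```

For example, a single Tier 1 assertion with confidence 0.5 that fully contradicts the code claim contributes 0.9 × 0.5 × 1.0 = 0.45.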

---

## Concept Matching

### Tail-Path Matching (ConceptIndex)

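The primary matching algorithm builds its index key from the last 2 path segments plus the predicate. Because the scheme and any leading segments are ignored, an authority assertion and a code claim map to the same key even when their URIs differ in scheme and prefix: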

```
RFC assertion:  rfc://5246/tls/cert_verification
Code claim:     code://rust/myapp/tls/cert_verification

Both produce key (with predicate "enabled"): "tls/cert_verification::enabled"
```

**Algorithm:**
1. Strip scheme (`rfc://`, `code://`)
2. Take last 2 non-empty path segments
3. Append predicate
4. Key = `{seg[-2]}/{seg[-1]}::{predicate}`
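
The four steps above can be sketched in a few lines. This is a minimal illustration rather than the code in `concept_index.rs`; the function name `tail_path_key` and the choice to return `None` for URIs with fewer than two segments are assumptions.

```rust
/// Build a tail-path index key of the form `{seg[-2]}/{seg[-1]}::{predicate}`.
/// Returns `None` when the URI has fewer than two non-empty path segments
/// (an assumption; the real index may handle short paths differently).
fn tail_path_key(uri: &str, predicate: &str) -> Option<String> {
    // 1. Strip the scheme (`rfc://`, `code://`, ...), if present.
    let path = uri.split_once("://").map_or(uri, |(_, rest)| rest);

    // 2. Keep only non-empty segments, then take the last two.
    let segments: Vec<&str> = path.split('/').filter(|s| !s.is_empty()).collect();
    let n = segments.len();
    if n < 2 {
        return None;
    }

    // 3 + 4. Append the predicate to form the key.
    Some(format!("{}/{}::{}", segments[n - 2], segments[n - 1], predicate))
}
```

With this sketch, both `rfc://5246/tls/cert_verification` and `code://rust/myapp/tls/cert_verification` with predicate `enabled` produce `tls/cert_verification::enabled`, matching the example above.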

### Alias Resolution

When tail-path matching fails, the system checks registered aliases. Aliases can be:
- **Auto-created** - When a conflict is detected, the matched relationship is persisted as an alias (persistent mode only)
- **Manual** - Created via `aphoria bless` or Trust Pack import
- **Policy aliases** - (Planned) From Trust Packs for enterprise policy enforcement - see [Policy Alias Implementation](./policy-alias-implementation.md)
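
The fallback order (direct tail-path hit first, then aliases) might look like the sketch below. The `Index` struct, its fields, and the use of `u32` assertion ids are hypothetical stand-ins; the linear alias scan mirrors the O(A) cost listed under Performance Targets.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the authority index.
struct Index {
    /// Tail-path key -> ids of authority assertions stored under that key.
    by_key: HashMap<String, Vec<u32>>,
    /// (claim key, authority key) pairs, scanned linearly: O(A).
    aliases: Vec<(String, String)>,
}

impl Index {
    /// Return matching assertion ids, or an empty slice when nothing matches.
    fn lookup(&self, key: &str) -> &[u32] {
        // Fast path: direct tail-path hit, O(1).
        if let Some(ids) = self.by_key.get(key) {
            return ids;
        }
        // Fallback: first alias whose source matches and whose target is indexed.
        for (from, to) in &self.aliases {
            if from.as_str() == key {
                if let Some(ids) = self.by_key.get(to) {
                    return ids;
                }
            }
        }
        &[]
    }
}
```

In this sketch an alias is only consulted when the direct lookup misses, so claims covered by the bundled corpus never pay the alias-scan cost.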

---

## Extractors

### Built-in Extractors (14)

| Extractor | Languages | Detects |
|-----------|-----------|---------|
| `tls_verify` | 8 | TLS certificate verification disabled |
| `tls_version` | 8 | Deprecated TLS 1.0/1.1 per RFC 8996 |
| `jwt_config` | 8 | JWT alg:none, skip signature verification |
| `hardcoded_secrets` | 8 | API keys, passwords in code |
| `timeout_config` | 8 | HTTP/DB/Redis timeout values |
| `dep_versions` | 3 | Dependency versions for advisory lookup |
| `cors_config` | 8 | CORS wildcard + credentials |
| `rate_limit` | 8 | Rate limiting configuration |
| `weak_crypto` | 5 | MD5, SHA1, DES, RC4 usage |
| `sql_injection` | 5 | SQL string interpolation |
| `command_injection` | 5 | Shell exec, os.system |
| `unreal_cpp` | C++ | Unreal Engine Exec functions |
| `unreal_config` | INI | Unreal Engine INI patterns |
| `unreal_performance` | C++ | Synchronous asset loading |

### Declarative Extractors

Users can define custom extractors in `aphoria.toml`:

```toml
[[extractors.declarative]]
name = "deprecated_api_v1"
description = "Detects usage of deprecated v1 API endpoints"
languages = ["go", "rust", "python"]
pattern = '/api/v1/\w+'
claim.subject = "api/deprecated_endpoint"
claim.predicate = "version"
claim.value = "v1"
confidence = 1.0
```

---

## Verdicts

| Verdict | Score Range | Exit Code | Action |
|---------|-------------|-----------|--------|
| `Block` | ≥ 0.7 | 2 | Must fix before commit |
| `Flag` | ≥ 0.4 and < 0.7 | 1 | Should review |
| `Pass` | < 0.4 | 0 | No conflict |
| `Ack` | N/A | 0 | Acknowledged intentional |
| `Drift` | N/A | 1 | Changed from prior value |
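
The table can be read as a threshold function plus an exit-code mapping. The sketch below uses the thresholds and exit codes listed above; the function names are hypothetical, and taking the highest exit code across all findings is an assumption about how multiple verdicts combine.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Verdict {
    Block,
    Flag,
    Pass,
    Ack,
    Drift,
}

/// Score-based verdicts only; `Ack` and `Drift` are not derived from a score.
fn verdict_for_score(score: f64) -> Verdict {
    if score >= 0.7 {
        Verdict::Block
    } else if score >= 0.4 {
        Verdict::Flag
    } else {
        Verdict::Pass
    }
}

/// Exit code for a single verdict, as listed in the table.
fn exit_code(verdict: Verdict) -> i32 {
    match verdict {
        Verdict::Block => 2,
        Verdict::Flag | Verdict::Drift => 1,
        Verdict::Pass | Verdict::Ack => 0,
    }
}

/// Assumption: the process exits with the highest code among all findings.
fn overall_exit_code(verdicts: &[Verdict]) -> i32 {
    verdicts.iter().map(|v| exit_code(*v)).max().unwrap_or(0)
}
```

Under that assumption, a scan with one `Flag` and one `Block` exits with code 2, and a scan with no findings exits with code 0.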

---

## Trust Packs (Phase 6)

Signed bundles of assertions and aliases for federated policy distribution.

**Schema:**
```rust
pub struct TrustPack {
    pub header: PackHeader,     // name, version, issuer_id, timestamp
    pub assertions: Vec<Assertion>,
    pub aliases: Vec<ConceptAlias>,
    pub signature: [u8; 64],    // Ed25519 signature
}
```

**Operations:**
- `aphoria policy export` - Create signed pack from local decisions
- `aphoria policy import` - Load pack, verify signature, ingest assertions
- `aphoria.toml` - Auto-load policies from `policies = [...]` list

---

## Hosted Mode (Phase 4E)

Team aggregation via central StemeDB server.

```toml
[hosted]
url = "https://episteme.acme.corp"
project_id = "billing-service"
team_id = "platform-team"
sync_mode = "remote-only"       # or "local-and-remote"
offline_fallback = "skip"       # or "fail" or "queue"
api_key_env = "APHORIA_API_KEY"
```

**Flow:**
```
Developer scans → HostedClient → POST /v1/aphoria/observations → Team Server
```

---

## Community Sharing (Phase 5.6)

Opt-in anonymous pattern contribution.

**Privacy Model:**
- Project names wildcarded: `code://rust/myapp/tls` → `code://rust/*/tls`
- File paths, line numbers, matched text NEVER shared
- Timestamps rounded to the hour (coarsened to support k-anonymity)
- `enabled` defaults to `false` (explicit opt-in)

```toml
[community]
enabled = true
anonymize = true
min_confidence = 0.8
exclude = ["vendor://acme/internal/*"]
```
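
Two of the privacy rules above, project wildcarding and timestamp coarsening, are simple enough to sketch. This is an illustration only, not the code in `anonymizer.rs`: the function names are hypothetical, the URI layout `code://<language>/<project>/...` is inferred from the example, and the sketch truncates timestamps down to the start of the hour, which may differ from the actual rounding rule.

```rust
/// Replace the project segment of a `code://<language>/<project>/...` URI with `*`.
/// URIs with another scheme are returned unchanged in this sketch.
fn wildcard_project(uri: &str) -> String {
    match uri.strip_prefix("code://") {
        Some(rest) => {
            let mut segments: Vec<&str> = rest.split('/').collect();
            // Assumed layout: segments[0] = language, segments[1] = project.
            if segments.len() >= 2 {
                segments[1] = "*";
            }
            format!("code://{}", segments.join("/"))
        }
        None => uri.to_string(),
    }
}

/// Coarsen a Unix timestamp (seconds) to the start of its hour.
fn round_to_hour(unix_secs: u64) -> u64 {
    unix_secs - unix_secs % 3600
}
```

Here `wildcard_project("code://rust/myapp/tls")` returns `code://rust/*/tls`, matching the example in the privacy model.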

---

## Key Documents

### Concept Matching System

**Problem:** How do we match code extractors to authoritative policies across different hierarchies?

1. **[Concept Matching Analysis](./concept-matching-analysis.md)**
   - Identifies the gap: tail-path matching works for RFCs but breaks for enterprise policies
   - Analyzes root cause: semantic mismatch between policy hierarchies and extractor output
   - Proposes solution: explicit policy aliases in Trust Packs

2. **[Policy Alias Implementation Guide](./policy-alias-implementation.md)**
   - Day-by-day implementation plan (5 phases over 3 days)
   - Code sketches with exact file locations
   - Test strategies and success criteria
   - Migration and rollout plan

3. **[Matching Philosophy](./matching-philosophy.md)**
   - Core design principles: semantic over syntactic, progressive precision, explicit control
   - Why tail-path matching works (by design for RFC/OWASP corpus)
   - Why it breaks (enterprise hierarchies violate assumptions)
   - Future extension points (semantic embeddings, ontology mapping)

4. **[Enterprise Validation](./enterprise-validation.md)**
   - End-to-end scenario walkthrough
   - Validates that policy aliases solve the enterprise use case
   - Edge case analysis
   - Real-world adoption path

---

## Quick Reference

### When to Read What

| If you need to... | Read this |
|-------------------|-----------|
| Understand the problem | [Concept Matching Analysis](./concept-matching-analysis.md) |
| Implement the solution | [Policy Alias Implementation](./policy-alias-implementation.md) |
| Understand design philosophy | [Matching Philosophy](./matching-philosophy.md) |
| Validate enterprise scenarios | [Enterprise Validation](./enterprise-validation.md) |
| Add a new extractor | `src/extractors/mod.rs` |
| Understand scan flow | `src/scan.rs` |
| Modify conflict detection | `src/episteme/conflict.rs` |
| Work with Trust Packs | `src/policy.rs`, `src/policy_ops.rs` |

---

## Architecture Decisions

### AD-001: Explicit Policy Aliases

**Status:** Approved (2026-02-04) - **Not Yet Implemented**

**Context:** Security teams need to create policies using logical hierarchies (`code://standards/*`) that don't align with extractor output (`code://rust/myapp/*`).

**Decision:** Add `PolicyAlias` type to Trust Packs with glob pattern matching.

**Implementation:** See [Policy Alias Implementation Guide](./policy-alias-implementation.md) for detailed implementation plan.

**Consequences:**
- ✅ Enables enterprise policy enforcement
- ✅ Maintains backward compatibility
- ✅ Keeps security teams in control (explicit aliases)
- ⚠️ Requires manual alias creation
- ⚠️ Adds cognitive overhead (pattern syntax)

### AD-002: Ephemeral Mode Default

**Status:** Implemented (2026-01-28)

**Context:** Full Episteme initialization took ~1-2s, too slow for pre-commit hooks.

**Decision:** Default to ephemeral mode (in-memory only), opt-in to persistent with `--persist`.

**Consequences:**
- ✅ Roughly 4-8x faster scans (~0.25s vs. ~1-2s for full initialization)
- ✅ No storage pollution for quick checks
- ⚠️ Drift detection requires `--persist`
- ⚠️ Observation write-back requires `--persist --sync`

### AD-003: Tail-Path Matching

**Status:** Implemented

**Context:** Need to match code claims against RFCs/OWASP assertions with different URI schemes.

**Decision:** Use last 2 path segments + predicate as index key.

**Consequences:**
- ✅ O(1) lookup via HashMap
- ✅ Works for RFC/OWASP corpus by design
- ⚠️ Breaks for enterprise policies with different hierarchies (addressed by AD-001, not yet implemented)

---

## Design Principles

### 1. Semantic Over Syntactic
Match concepts by meaning, not exact string paths.

### 2. Progressive Precision
Start with simple heuristics (tail-path), add layers (aliases, embeddings) as needed.

### 3. Explicit Over Implicit
Matching logic should be transparent, auditable, and controllable.

### 4. Zero Configuration (for common cases)
Bundled corpus (RFCs, OWASP) should "just work" with tail-path matching.

### 5. Cryptographic Trust
All policies are signed (Ed25519) and verified before use.

### 6. Privacy by Default
Community sharing is opt-in with anonymization enabled by default.

---

## Extension Points

### Current (2026-02-05)
- Tail-path matching (O(1) hash lookup)
- Concept aliases (auto-created on conflict detection)
- Declarative extractors (user-defined in TOML)
- Hosted mode (team aggregation)
- Community corpus (anonymous sharing)

### In Progress
- **Policy aliases** - Enterprise policy matching via glob patterns ([AD-001](./policy-alias-implementation.md))

### Planned (Q1 2026)
- Semantic embeddings (fuzzy matching via vector similarity)
- Alias auto-discovery (suggest aliases during scan)
- High-entropy secret detection
- Framework-specific extractors (Spring, Django, Express)

### Future (Q2+ 2026)
- Ontology mapping (define semantic relationships)
- Trust Pack composition (packs can extend other packs)
- LLM-assisted extraction (semantic code understanding)
- Config file deep parsing (structured YAML/JSON/TOML)

---

## Performance Targets

### Scan Time
- **Ephemeral:** < 0.3s for typical project
- **Persistent:** < 2s for typical project
- **With Policy Aliases:** < 5% increase

### Memory Overhead
- **Policy Alias Storage:** ~100 bytes per alias
- **Typical Trust Pack:** < 10 KB (10 aliases)
- **Corpus in memory:** ~2-5 MB (varies by sources enabled)

### Lookup Complexity
- **Direct tail-path:** O(1)
- **Concept alias resolution:** O(A), where A is the number of registered aliases
- **Policy alias fallback (planned):** O(P × A), where P is the number of policy patterns and A is the number of aliases

---

## Testing Strategy

### Unit Tests
- Extractor pattern matching
- ConceptIndex key generation
- Conflict score calculation
- Trust Pack serialization/verification

### Integration Tests
- Full scan flow with corpus
- Trust Pack import/export
- Drift detection
- Observation write-back

### UAT Scenarios
- Enterprise security team workflow
- Multi-language policy enforcement
- CI/CD integration
- Hosted mode aggregation

---

## Related Documentation

### Product
- [Product Overview](../../product.md) - What Aphoria does
- [Roadmap](../../roadmap.md) - Implementation status and plans

### Guides
- [Enterprise Quick Start](../guides/enterprise-quick-start.md) - Getting started
- [Federating Truth](../guides/federating-truth.md) - Trust Pack workflows

### Implementation
- [Policy Ops](../../src/policy_ops.rs) - Trust Pack CLI handlers
- [Concept Index](../../src/episteme/concept_index.rs) - Matching algorithm
- [Local Episteme](../../src/episteme/local.rs) - Conflict detection
- [Ephemeral Detector](../../src/episteme/ephemeral.rs) - Fast path

---

## Questions or Feedback?

Discuss in:
- `#aphoria-architecture` (internal Slack)
- GitHub Issues (public feedback)
- Architecture review meetings (Fridays 2pm PT)

---

**This directory is the source of truth for architectural decisions.** All major changes should be documented here before implementation.

---

*Last updated: 2026-02-05*