stemedb/applications/aphoria/uat/comprehensive-vision-uat.md

# Aphoria Comprehensive Vision UAT Plan

**Date:** 2026-02-06
**Status:** Complete (90/90 tests passing)
**Purpose:** Verify that Aphoria delivers on its complete vision across all user personas and use cases

---

## Vision Summary

Aphoria's complete vision encompasses three layers:

1. **Core Value:** A "code-level truth linter" that validates code against authoritative sources (RFCs, OWASP, vendor docs)
2. **Enterprise Value:** Federated policy management via Trust Packs — "turn your decisions into enforceable standards"
3. **Protocol Vision:** The Epistemic Assertion Protocol (EAP) — a universal standard for truth publishing, making Aphoria the "DNS for Truth"

---

## User Personas

| Persona | Role | Primary Use Cases |
|---------|------|-------------------|
| **Solo Developer** | Individual contributor | Pre-commit checks, RFC compliance, avoiding common mistakes |
| **Security Engineer** | AppSec team member | Scan projects for security misconfigurations, create org-wide policies |
| **Platform Lead** | Staff engineer | Define "Golden Path" patterns, distribute standards to teams |
| **Compliance Officer** | GRC team member | Audit multiple projects, trace conflicts to authoritative sources |
| **AI Agent** | Autonomous code agent | Pre-flight check before commits, query authority before implementing |

---

## UAT Categories

### Category 1: Core Detection (The "Linter" Value)

> **Vision claim:** "Aphoria scans a codebase, extracts the decisions embedded in config and code, and checks them against authoritative sources."

#### 1.1 Authoritative Source Conflict Detection

| Test ID | Scenario | Expected Outcome | Priority |
|---------|----------|------------------|----------|
| 1.1.1 | TLS verification disabled (Python `verify=False`) | Conflict with RFC 5246, BLOCK verdict | P0 |
| 1.1.2 | TLS verification disabled (Rust `danger_accept_invalid_certs`) | Conflict with RFC 5246, BLOCK verdict | P0 |
| 1.1.3 | TLS verification disabled (Go `InsecureSkipVerify`) | Conflict with RFC 5246, BLOCK verdict | P0 |
| 1.1.4 | JWT audience validation disabled | Conflict with RFC 7519, BLOCK verdict | P0 |
| 1.1.5 | Hardcoded secrets in source | Conflict with OWASP Secrets Cheatsheet, BLOCK verdict | P0 |
| 1.1.6 | CORS allow-all-origins | Conflict with OWASP Headers Cheatsheet, FLAG verdict | P0 |
| 1.1.7 | Zero timeout configuration | Conflict with vendor best practices, FLAG verdict | P1 |
| 1.1.8 | SQL injection pattern (string concat) | Conflict with OWASP Input Validation, BLOCK verdict | P0 |
| 1.1.9 | Command injection pattern | Conflict with OWASP Input Validation, BLOCK verdict | P0 |
| 1.1.10 | Weak crypto (MD5/SHA1 for security) | Conflict with OWASP Crypto Cheatsheet, BLOCK verdict | P0 |

**Success Criteria:**
- [ ] All P0 tests pass with correct verdict
- [ ] Precision ≥95% (minimal false positives)
- [ ] Every BLOCK verdict has an RFC/OWASP citation
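
To illustrate what tests 1.1.1-1.1.3 look for, the sketch below flags the three TLS-bypass idioms from the table with plain regular expressions. It is not Aphoria's extractor: `TLS_BYPASS_PATTERNS` and `find_tls_bypass` are hypothetical names, and the expressions are deliberately simplified (no handling of comments, multi-line calls, or aliases).

```python
import re

# Hypothetical, simplified patterns for the scenarios in tests 1.1.1-1.1.3.
TLS_BYPASS_PATTERNS = {
    "python": re.compile(r"\bverify\s*=\s*False\b"),
    "rust": re.compile(r"\bdanger_accept_invalid_certs\s*\(\s*true\s*\)"),
    "go": re.compile(r"\bInsecureSkipVerify\s*:\s*true\b"),
}


def find_tls_bypass(language: str, source: str) -> list[int]:
    """Return 1-based line numbers where TLS verification appears disabled."""
    pattern = TLS_BYPASS_PATTERNS[language]
    return [
        lineno
        for lineno, line in enumerate(source.splitlines(), start=1)
        if pattern.search(line)
    ]
```

A fixture for test 1.1.1 could then be as small as a two-line Python file containing `requests.get(url, verify=False)`, with the expectation that the finding lands on that line.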

#### 1.2 Cross-Language Consistency

| Test ID | Scenario | Expected Outcome | Priority |
|---------|----------|------------------|----------|
| 1.2.1 | Same TLS issue detected in Rust, Go, Python, JS | Same conflict, same verdict across languages | P0 |
| 1.2.2 | Same JWT issue detected across languages | Same conflict, same verdict | P0 |
| 1.2.3 | YAML/TOML config file detection | Config issues detected regardless of language | P0 |

**Success Criteria:**
- [ ] Language parity: same issue → same verdict in all supported languages
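
The parity criterion can be checked mechanically once each language fixture has been scanned. The sketch below assumes the scan results have already been reduced to one `(source, verdict)` pair per language; the `parity_groups` and `has_parity` helpers and that data shape are illustrative assumptions, not part of Aphoria.

```python
from collections import defaultdict


def parity_groups(results: dict) -> dict:
    """Group languages by the (source, verdict) pair they reported."""
    groups = defaultdict(list)
    for language, outcome in results.items():
        groups[outcome].append(language)
    return dict(groups)


def has_parity(results: dict) -> bool:
    """True when every language reported the same (source, verdict) pair."""
    return len(parity_groups(results)) <= 1
```

When parity fails, `parity_groups` shows which languages disagree, which is usually the first thing a tester needs to know.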

#### 1.3 Precision and Recall

| Test ID | Scenario | Expected Outcome | Priority |
|---------|----------|------------------|----------|
| 1.3.1 | VulnBank benchmark (intentionally vulnerable) | ≥50 findings, 100% precision | P0 |
| 1.3.2 | Real-world project scan (Citadel/Masq) | Findings with ≥95% precision | P0 |
| 1.3.3 | False positive rate on clean codebase | <5% false positive rate | P0 |
| 1.3.4 | Test file handling | Lower confidence, not flagged as BLOCK | P1 |

**Success Criteria:**
- [ ] VulnBank: 100% precision (every finding is real)
- [ ] Real-world: ≥95% precision, ≥5 distinct issues
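
Precision here means the share of reported findings that are real: true positives divided by all reported findings. A minimal sketch of the threshold check, with hypothetical helper names and manually triaged counts as input:

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Fraction of reported findings that were confirmed as real."""
    total = true_positives + false_positives
    if total == 0:
        raise ValueError("precision is undefined when there are no findings")
    return true_positives / total


def meets_threshold(true_positives: int, false_positives: int, threshold: float) -> bool:
    """Compare measured precision against a required minimum (e.g. 0.95)."""
    return precision(true_positives, false_positives) >= threshold
```

For example, 19 confirmed findings and 1 false positive gives a precision of 0.95, which just meets the real-world threshold but would fail the 100% VulnBank requirement.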

---

### Category 2: Enterprise Policy (The "Trust Pack" Value)

> **Vision claim:** "Organizations often have internal rules that override or extend public standards. Aphoria allows you to export these decisions as Trust Packs."

#### 2.1 Policy Creation Workflow

| Test ID | Scenario | Expected Outcome | Priority |
|---------|----------|------------------|----------|
| 2.1.1 | `aphoria bless` creates policy assertion | Assertion stored with reason, signed | P0 |
| 2.1.2 | `aphoria ack` creates acknowledgment | Acknowledgment stored with reason | P0 |
| 2.1.3 | `aphoria policy export` creates .pack file | Signed binary pack with assertions | P0 |
| 2.1.4 | Export includes both blessed and acked assertions | All policy decisions exported | P0 |

**Success Criteria:**
- [ ] Complete round-trip: bless → export → import → conflict

#### 2.2 Policy Distribution

| Test ID | Scenario | Expected Outcome | Priority |
|---------|----------|------------------|----------|
| 2.2.1 | Local `.pack` file import | Assertions imported, conflicts detected | P0 |
| 2.2.2 | HTTP URL policy import | Remote pack downloaded, cached | P0 |
| 2.2.3 | Multiple packs, no conflict | Both policies enforced | P0 |
| 2.2.4 | Multiple packs, same concept, different values | Conflict visible, user can choose | P1 |
| 2.2.5 | Pack version update (v1 → v2) | v2 supersedes v1 | P1 |

**Success Criteria:**
- [ ] Enterprise workflow script passes (12/12)
- [ ] Multi-pack import works without data loss
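
The expected behaviour for tests 2.2.3-2.2.5 can be summarised in a small model. The sketch below is an assumption-laden illustration, not the pack format: packs are plain dictionaries with hypothetical `name`, `version`, and `assertions` keys, and versions are assumed to be dotted integers.

```python
def parse_version(version: str) -> tuple:
    """Turn '1.10' into (1, 10) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def active_packs(packs: list) -> dict:
    """Keep only the highest version of each pack (v2 supersedes v1)."""
    latest = {}
    for pack in packs:
        current = latest.get(pack["name"])
        if current is None or parse_version(pack["version"]) > parse_version(current["version"]):
            latest[pack["name"]] = pack
    return latest


def cross_pack_conflicts(packs: list) -> dict:
    """Return concepts that different packs assert with different values."""
    seen = {}
    for pack in packs:
        for concept, value in pack["assertions"].items():
            seen.setdefault(concept, {})[pack["name"]] = value
    return {
        concept: by_pack
        for concept, by_pack in seen.items()
        if len(set(by_pack.values())) > 1
    }
```

Returning the per-pack values for a disputed concept, rather than silently picking a winner, mirrors the "conflict visible, user can choose" outcome in test 2.2.4.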

#### 2.3 Policy Attribution

| Test ID | Scenario | Expected Outcome | Priority |
|---------|----------|------------------|----------|
| 2.3.1 | Conflict shows pack name | `policy_source.pack_name` in JSON | P0 |
| 2.3.2 | Conflict shows pack version | `policy_source.pack_version` in JSON | P0 |
| 2.3.3 | Conflict shows issuer | `policy_source.issuer_hex` in JSON | P0 |
| 2.3.4 | Attribution in all formats | JSON, table, markdown, SARIF | P0 |

**Success Criteria:**
- [ ] Developer can trace any conflict to "who decided this"
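
A tester can verify attribution by reading the three `policy_source` fields from the JSON report. In the sketch below only `pack_name`, `pack_version`, and `issuer_hex` come from the tests above; the surrounding envelope (a top-level `conflicts` array with `file` and `verdict` keys) and all sample values are assumptions made for illustration.

```python
import json

# Hypothetical report shape; only the policy_source field names are taken
# from tests 2.3.1-2.3.3.
SAMPLE_REPORT = """
{
  "conflicts": [
    {"file": "app.py", "verdict": "BLOCK",
     "policy_source": {"pack_name": "acme-security", "pack_version": "2",
                       "issuer_hex": "ab12cd34"}},
    {"file": "db.py", "verdict": "FLAG"}
  ]
}
"""


def attribution(conflict: dict):
    """Describe who decided a policy, or None for non-policy conflicts."""
    source = conflict.get("policy_source")
    if source is None:
        return None
    return f'{source["pack_name"]} v{source["pack_version"]} (issuer {source["issuer_hex"]})'


report = json.loads(SAMPLE_REPORT)
attributions = [attribution(conflict) for conflict in report["conflicts"]]
```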

#### 2.4 Predicate Aliases

| Test ID | Scenario | Expected Outcome | Priority |
|---------|----------|------------------|----------|
| 2.4.1 | `enabled` matches `required` | Same-meaning predicates conflict | P1 |
| 2.4.2 | Pack-defined aliases | Custom alias sets work | P2 |

**Success Criteria:**
- [ ] Semantic predicate matching prevents bypasses through synonymous predicates (e.g. `enabled` vs. `required`)
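
One way to picture alias matching is to map every predicate to a canonical representative of its alias set before comparing. This is a sketch under that assumption; `ALIAS_SETS`, `canonical`, and `same_predicate` are hypothetical names, and only the `enabled`/`required` pairing comes from test 2.4.1.

```python
# Built-in alias sets; a pack could supply additional sets (test 2.4.2).
ALIAS_SETS = [frozenset({"enabled", "required"})]


def canonical(predicate: str, alias_sets=ALIAS_SETS) -> str:
    """Map a predicate to a stable representative of its alias set."""
    for aliases in alias_sets:
        if predicate in aliases:
            return min(aliases)  # alphabetical minimum keeps the choice deterministic
    return predicate


def same_predicate(a: str, b: str, alias_sets=ALIAS_SETS) -> bool:
    """True when two predicates are identical or declared aliases."""
    return canonical(a, alias_sets) == canonical(b, alias_sets)
```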

---

### Category 3: Pre-Commit Integration (The "Full Cycle" Value)

> **Vision claim:** "The pre-commit hook is a bidirectional knowledge sync, not just a read-only linter."

#### 3.1 Fast Scanning

| Test ID | Scenario | Expected Outcome | Priority |
|---------|----------|------------------|----------|
| 3.1.1 | Ephemeral scan (default) | <0.5s for typical project | P0 |
| 3.1.2 | Staged-only scan (`--staged`) | <0.5s, only staged files scanned | P0 |
| 3.1.3 | No storage created in ephemeral mode | No WAL/store directories | P0 |

**Success Criteria:**
- [ ] Pre-commit hook stays within the 0.5s budget, so it doesn't slow down the development workflow

#### 3.2 Observation Recording

| Test ID | Scenario | Expected Outcome | Priority |
|---------|----------|------------------|----------|
| 3.2.1 | `--sync` records observations | Novel claims stored as Tier 4 | P1 |
| 3.2.2 | Observations survive across commits | Persistent local knowledge | P1 |
| 3.2.3 | `--sync` requires `--persist` | Validation error otherwise | P0 |

**Success Criteria:**
- [ ] Project builds local memory over time

#### 3.3 Drift Detection

| Test ID | Scenario | Expected Outcome | Priority |
|---------|----------|------------------|----------|
| 3.3.1 | Value changed from prior observation | DRIFT verdict shown | P1 |
| 3.3.2 | Drift in table/json/markdown output | All formats show drift | P1 |
| 3.3.3 | `--exit-code` returns 1 for drift | CI can catch unintentional changes | P1 |

**Success Criteria:**
- [ ] Accidental configuration changes are caught
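
Conceptually, drift is a difference between a previously recorded observation and the value seen in the current scan. The sketch below models observations as flat dictionaries keyed by a concept name, which is an assumption for illustration; `detect_drift` is a hypothetical helper.

```python
def detect_drift(prior: dict, current: dict) -> dict:
    """Return {concept: (old, new)} for values that changed since the prior scan.

    Concepts seen for the first time are not drift; they are new observations.
    """
    return {
        concept: (prior[concept], value)
        for concept, value in current.items()
        if concept in prior and prior[concept] != value
    }
```

A non-empty result corresponds to the DRIFT verdict in test 3.3.1.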

#### 3.4 Exit Codes

| Test ID | Scenario | Expected Outcome | Priority |
|---------|----------|------------------|----------|
| 3.4.1 | No conflicts → exit 0 | Clean scan passes | P0 |
| 3.4.2 | FLAG only → exit 1 | Review recommended | P0 |
| 3.4.3 | BLOCK → exit 2 | Build should fail | P0 |
| 3.4.4 | Without `--exit-code` → always exit 0 | Interactive mode | P0 |

**Success Criteria:**
- [ ] CI/CD pipelines can gate on the exit code (0 = clean, 1 = review recommended, 2 = build should fail)
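
The exit code rules in tests 3.4.1-3.4.4, together with the drift rule from test 3.3.3, reduce to a small decision function. The sketch below is a model of the expected behaviour rather than the actual implementation; it assumes that verdicts other than BLOCK, FLAG, and DRIFT (such as ACK) do not raise the exit code.

```python
def expected_exit_code(verdicts: list, exit_code_flag: bool) -> int:
    """Model of the exit code a scan should return for a set of verdicts."""
    if not exit_code_flag:
        return 0  # interactive mode: always succeed (test 3.4.4)
    if "BLOCK" in verdicts:
        return 2  # build should fail (test 3.4.3)
    if "FLAG" in verdicts or "DRIFT" in verdicts:
        return 1  # review recommended (tests 3.4.2 and 3.3.3)
    return 0  # clean scan (test 3.4.1); ACK is assumed not to raise the code
```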

---

### Category 4: LLM Extraction (The "Intelligent" Value)

> **Vision claim:** "Use LLM to extract claims semantically during persistent scans. This fills gaps that regex extractors can't catch."

#### 4.1 LLM Triggering

| Test ID | Scenario | Expected Outcome | Priority |
|---------|----------|------------------|----------|
| 4.1.1 | High-value file (auth/, crypto/) | LLM extraction runs | P1 |
| 4.1.2 | Non-high-value file | LLM extraction skipped | P1 |
| 4.1.3 | File already covered by regex extractors | LLM extraction skipped | P1 |
| 4.1.4 | Token budget exceeded | Graceful stop, no crash | P1 |

**Success Criteria:**
- [ ] LLM runs only on high-value files not already covered by regex extractors, and stays within the token budget
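
The four triggering tests describe a gate that can be sketched as a single predicate. Everything below is illustrative: the function name, its parameters, and the way a "high-value" file is recognised (a parent directory named `auth` or `crypto`) are assumptions based only on the examples in test 4.1.1.

```python
from pathlib import PurePosixPath

# Directory names treated as high-value, taken from the examples in test 4.1.1.
HIGH_VALUE_DIRS = {"auth", "crypto"}


def should_run_llm(path: str, covered_by_regex: bool, tokens_used: int,
                   estimated_tokens: int, token_budget: int) -> bool:
    """Decide whether LLM extraction should run for one file."""
    parent_dirs = PurePosixPath(path).parts[:-1]
    if not HIGH_VALUE_DIRS.intersection(parent_dirs):
        return False  # test 4.1.2: not a high-value file
    if covered_by_regex:
        return False  # test 4.1.3: regex extractors already cover it
    # test 4.1.4: skip instead of failing once the budget would be exceeded
    return tokens_used + estimated_tokens <= token_budget
```

Returning `False` on budget exhaustion, rather than raising, matches the "graceful stop, no crash" expectation.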

#### 4.2 LLM Quality

| Test ID | Scenario | Expected Outcome | Priority |
|---------|----------|------------------|----------|
| 4.2.1 | Evaluation fixtures pass | Baseline quality maintained | P1 |
| 4.2.2 | No regressions from prompt changes | Regression tests pass | P2 |
| 4.2.3 | Response parsing handles edge cases | No crashes on malformed JSON | P1 |

**Success Criteria:**
- [ ] LLM extraction quality is measurable and stable

#### 4.3 Pattern Learning

| Test ID | Scenario | Expected Outcome | Priority |
|---------|----------|------------------|----------|
| 4.3.1 | LLM-extracted claim → pattern stored | LocalPatternStore updated | P2 |
| 4.3.2 | Similar pattern → merged, not duplicated | Deduplication works | P2 |
| 4.3.3 | Pattern seen in 5+ projects → promotion candidate | Threshold triggers | P2 |

**Success Criteria:**
- [ ] Learning system builds knowledge over time
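
A toy model of the learning loop makes the three tests concrete. `PatternStore` below is a hypothetical stand-in for `LocalPatternStore`, and "similar" is narrowed to "equal after lowercasing and whitespace normalisation", which is a simplifying assumption; only the 5-project threshold comes from test 4.3.3.

```python
PROMOTION_THRESHOLD = 5  # "seen in 5+ projects" from test 4.3.3


class PatternStore:
    """Hypothetical in-memory stand-in for LocalPatternStore."""

    def __init__(self):
        self._projects_by_pattern = {}

    @staticmethod
    def _normalize(pattern: str) -> str:
        # Simplified similarity: case-insensitive, whitespace-insensitive.
        return " ".join(pattern.lower().split())

    def record(self, pattern: str, project: str) -> None:
        """Store a pattern, merging it with an equivalent existing one."""
        key = self._normalize(pattern)
        self._projects_by_pattern.setdefault(key, set()).add(project)

    def __len__(self) -> int:
        return len(self._projects_by_pattern)

    def candidates(self) -> list:
        """Patterns seen in enough distinct projects to be promoted."""
        return sorted(
            pattern
            for pattern, projects in self._projects_by_pattern.items()
            if len(projects) >= PROMOTION_THRESHOLD
        )
```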

---

### Category 5: Declarative Extractors (The "Extensibility" Value)

> **Vision claim:** "Enable users to define new extractors in config/policy files (TOML) without writing Rust code."

#### 5.1 Custom Extractors

| Test ID | Scenario | Expected Outcome | Priority |
|---------|----------|------------------|----------|
| 5.1.1 | TOML-defined extractor runs | Claims extracted using custom regex | P0 |
| 5.1.2 | Invalid regex rejected at load time | Clear error, doesn't block others | P0 |
| 5.1.3 | ReDoS-vulnerable regex rejected | Security protection | P0 |
| 5.1.4 | `value_from_match` captures groups | Dynamic claim values | P1 |

**Success Criteria:**
- [ ] Users can add extractors without recompiling
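
Tests 5.1.2 and 5.1.4 can be pictured with a small loader that compiles each definition independently. The sketch works on dictionaries that stand in for parsed TOML entries; the `name`, `pattern`, and `value` keys and the boolean reading of `value_from_match` are assumptions, and ReDoS screening (test 5.1.3) is out of scope here.

```python
import re


def load_extractors(definitions: list) -> tuple:
    """Compile each definition; collect errors so one bad regex doesn't block the rest."""
    loaded, errors = [], []
    for definition in definitions:
        try:
            compiled = re.compile(definition["pattern"])
        except re.error as exc:
            errors.append(f'{definition["name"]}: invalid regex ({exc})')
            continue
        loaded.append({**definition, "compiled": compiled})
    return loaded, errors


def extract(extractor: dict, text: str) -> list:
    """Return claim values found in text by one loaded extractor."""
    values = []
    for match in extractor["compiled"].finditer(text):
        if extractor.get("value_from_match"):
            # Use the first capture group when present, else the whole match.
            values.append(match.group(1) if match.re.groups else match.group(0))
        else:
            values.append(extractor.get("value"))
    return values
```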

#### 5.2 Extractor Promotion

| Test ID | Scenario | Expected Outcome | Priority |
|---------|----------|------------------|----------|
| 5.2.1 | `aphoria extractors candidates` lists promotable patterns | Threshold-meeting patterns shown | P2 |
| 5.2.2 | `aphoria extractors promote` generates YAML | Extractor file created | P2 |
| 5.2.3 | Interactive review workflow | Approve/reject/skip options | P2 |

**Success Criteria:**
- [ ] Learning → promotion pipeline is functional

---

### Category 6: Output Formats (The "Integration" Value)

> **Vision claim:** "SARIF for CI integration... structured JSON/SARIF for dashboard integration."

#### 6.1 Format Correctness

| Test ID | Scenario | Expected Outcome | Priority |
|---------|----------|------------------|----------|
| 6.1.1 | JSON output is valid JSON | Parses correctly | P0 |
| 6.1.2 | SARIF output is valid SARIF 2.1.0 | Schema validates | P0 |
| 6.1.3 | Markdown output is valid markdown | Renders correctly | P0 |
| 6.1.4 | Table output is human-readable | Aligned, clear | P0 |

**Success Criteria:**
- [ ] All formats pass validation
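
For tests 6.1.1 and 6.1.2, a lightweight check can catch the most common failures before a full schema validation is run. The sketch below only verifies that the output parses as JSON and has the top-level SARIF shape (`version` equal to `2.1.0`, a `runs` array, and a `tool.driver.name` in each run); it is not a substitute for validating against the official SARIF 2.1.0 schema, and the helper names are hypothetical.

```python
import json


def is_valid_json(text: str) -> bool:
    """True when the text parses as JSON."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


def sarif_shape_problems(text: str) -> list:
    """Return a list of structural problems; an empty list means the shape looks right."""
    try:
        doc = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(doc, dict):
        return ["top level is not an object"]
    problems = []
    if doc.get("version") != "2.1.0":
        problems.append('version is not "2.1.0"')
    runs = doc.get("runs")
    if not isinstance(runs, list):
        problems.append("runs is not an array")
        return problems
    for index, run in enumerate(runs):
        tool = run.get("tool") if isinstance(run, dict) else None
        driver = tool.get("driver") if isinstance(tool, dict) else None
        if not isinstance(driver, dict) or "name" not in driver:
            problems.append(f"runs[{index}].tool.driver.name is missing")
    return problems
```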

#### 6.2 Format Completeness

| Test ID | Scenario | Expected Outcome | Priority |
|---------|----------|------------------|----------|
| 6.2.1 | All formats show file location | File + line for each conflict | P0 |
| 6.2.2 | All formats show conflict score | Score visible | P0 |
| 6.2.3 | All formats show verdict | BLOCK/FLAG/ACK/DRIFT visible | P0 |
| 6.2.4 | All formats show policy source (if applicable) | Attribution visible | P0 |

**Success Criteria:**
- [ ] No information loss between formats

---

### Category 7: Domain-Specific Audits (The "Vertical" Value)

> **Vision claim:** "Aphoria is not limited to web security. It includes specialized corpora for different domains."

#### 7.1 Unreal Engine

| Test ID | Scenario | Expected Outcome | Priority |
|---------|----------|------------------|----------|
| 7.1.1 | `LoadSynchronous()` on game thread detected | BLOCK verdict | P1 |
| 7.1.2 | Hardcoded asset paths detected | FLAG verdict | P2 |
| 7.1.3 | Exposed console commands detected | FLAG verdict | P2 |

**Success Criteria:**
- [ ] Masq UAT passes (7 findings, 100% precision)

#### 7.2 Framework-Specific Security

| Test ID | Scenario | Expected Outcome | Priority |
|---------|----------|------------------|----------|
| 7.2.1 | Spring Security misconfiguration | Conflict detected | P2 |
| 7.2.2 | Django ALLOWED_HOSTS = ["*"] | Conflict detected | P2 |
| 7.2.3 | Flask DEBUG=True in production | Conflict detected | P2 |

**Success Criteria:**
- [ ] Framework extractors detect common misconfigurations

---

### Category 8: The "Protocol Vision" (Long-Term)

> **Vision claim:** "Aphoria is not just a linter; it is the Reference Implementation (Browser) for this new web of data."

#### 8.1 EAP Readiness (Future)

| Test ID | Scenario | Expected Outcome | Priority |
|---------|----------|------------------|----------|
| 8.1.1 | Consume `.eap.json` manifest | EAP format supported | P3 |
| 8.1.2 | Publish project observations as EAP | Export to EAP format | P3 |
| 8.1.3 | Multi-source aggregation | RFC + OWASP + Vendor + Policy unified | P3 |

**Success Criteria:**
- [ ] Foundation for "DNS for Truth" is laid

---

## UAT Execution Plan

### Phase 1: Core Detection (Week 1)

**Goal:** Prove the core value proposition works across languages.

| Day | Focus | Tests |
|-----|-------|-------|
| 1 | VulnBank benchmark | 1.3.1 |
| 2 | Cross-language TLS/JWT | 1.1.1-1.1.4, 1.2.1-1.2.3 |
| 3 | OWASP and vendor patterns | 1.1.5-1.1.10 |
| 4 | False positive analysis | 1.3.3-1.3.4 |
| 5 | Report validation | 6.1.1-6.2.4 |

**Deliverable:** UAT report with precision/recall metrics

### Phase 2: Enterprise Policy (Week 2)

**Goal:** Prove the Trust Pack workflow is production-ready.

| Day | Focus | Tests |
|-----|-------|-------|
| 1 | Policy creation | 2.1.1-2.1.4 |
| 2 | Policy distribution | 2.2.1-2.2.5 |
| 3 | Policy attribution | 2.3.1-2.3.4 |
| 4 | Multi-pack scenarios | 2.2.3-2.2.4 |
| 5 | End-to-end workflow | Full enterprise script |

**Deliverable:** UAT report with enterprise workflow validation

### Phase 3: Pre-Commit Integration (Week 3)

**Goal:** Prove Aphoria works seamlessly in the development workflow.

| Day | Focus | Tests |
|-----|-------|-------|
| 1 | Performance | 3.1.1-3.1.3 |
| 2 | Exit codes | 3.4.1-3.4.4 |
| 3 | Observation recording | 3.2.1-3.2.3 |
| 4 | Drift detection | 3.3.1-3.3.3 |
| 5 | CI/CD integration | GitHub Actions, pre-commit hook |

**Deliverable:** UAT report with performance benchmarks

### Phase 4: Advanced Features (Week 4)

**Goal:** Prove that LLM extraction, pattern learning, and extensibility work.

| Day | Focus | Tests |
|-----|-------|-------|
| 1 | LLM triggering | 4.1.1-4.1.4 |
| 2 | LLM quality | 4.2.1-4.2.3 |
| 3 | Declarative extractors | 5.1.1-5.1.4 |
| 4 | Domain-specific | 7.1.1-7.2.3 |
| 5 | End-to-end user journey | All personas |

**Deliverable:** UAT report with feature completeness matrix

---

## Automated Test Scripts

### All Scripts

| Script | Purpose | Tests | Status |
|--------|---------|-------|--------|
| `test-core-detection.sh` | Category 1: Core detection tests | 10 | PASS (10/10) |
| `test-cross-language.sh` | Category 1.2: Cross-language parity | 3 | PASS (3/3) |
| `test-declarative-extractors.sh` | Category 5: Custom extractor loading | 6 | PASS (6/6) |
| `test-domain-frameworks.sh` | Category 7.2: Framework security | 11 | PASS (11/11) |
| `test-domain-unreal.sh` | Category 7.1: Unreal Engine | 4 | PASS (4/4) |
| `test-drift-detection.sh` | Category 3.2-3.3: Observation/drift | 6 | PASS (6/6) |
| `test-enterprise-workflow.sh` | Category 2: Trust Pack round-trip | 12 | PASS (12/12) |
| `test-eval-harness.sh` | Category 4.2: LLM evaluation harness | 4 | PASS (4/4) |
| `test-exit-codes.sh` | Category 3.4: Exit code validation | 4 | PASS (4/4) |
| `test-llm-extraction.sh` | Category 4.1: LLM quality gates | 5 | PASS (5/5) |
| `test-multi-pack-conflict.sh` | Category 2.2: Multiple pack behavior | 7 | PASS (7/7) |
| `test-output-formats.sh` | Category 6: Format validation | 8 | PASS (8/8) |
| `test-pack-version-update.sh` | Category 2.2.5: Version supersession | 6 | PASS (6/6) |
| `test-precommit-performance.sh` | Category 3.1: Performance benchmarks | 4 | PASS (4/4) |

**Total: 14 scripts, 90 tests**

### Summary by Category

| Category | Scripts | Tests | Status |
|----------|---------|-------|--------|
| 1. Core Detection | 2 | 13 | PASS |
| 2. Enterprise Policy | 3 | 25 | PASS |
| 3. Pre-Commit | 3 | 14 | PASS |
| 4. LLM Extraction | 2 | 9 | PASS |
| 5. Declarative Extractors | 1 | 6 | PASS |
| 6. Output Formats | 1 | 8 | PASS |
| 7. Domain-Specific | 2 | 15 | PASS |

---

## Success Criteria Summary

### Minimum Viable UAT (MVP)

| Criterion | Threshold | Measured By |
|-----------|-----------|-------------|
| Core precision | ≥95% real-world, 100% VulnBank | VulnBank + real-world scan |
| Cross-language parity | 100% | Same issue → same verdict |
| Enterprise workflow | 12/12 pass | test-enterprise-workflow.sh |
| Ephemeral scan time | <0.5s | Performance benchmark |
| Exit code correctness | 4/4 pass | test-exit-codes.sh |
| Format validity | 4/4 valid | test-output-formats.sh |

### Full Vision UAT

| Criterion | Threshold | Measured By |
|-----------|-----------|-------------|
| All P0 tests pass | 100% | Test matrix |
| All P1 tests pass | ≥90% | Test matrix |
| User journey complete | All 5 personas | End-to-end walkthrough |
| Drift detection works | DRIFT shown, exit 1 | test-drift-detection.sh |
| LLM extraction quality | Baseline maintained | Eval fixtures |
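
The first two rows of this table amount to a pass-rate gate per priority level. A minimal sketch, assuming the test matrix has been flattened into `(priority, passed)` pairs; the helper names are hypothetical:

```python
def pass_rate(results: list, priority: str) -> float:
    """Share of tests at the given priority that passed."""
    outcomes = [passed for level, passed in results if level == priority]
    if not outcomes:
        return 1.0  # no tests at this priority: treated as vacuously passing
    return sum(outcomes) / len(outcomes)


def full_vision_gate(results: list) -> bool:
    """All P0 tests must pass, and at least 90% of P1 tests."""
    return pass_rate(results, "P0") == 1.0 and pass_rate(results, "P1") >= 0.90
```

With this rule, a single failing P0 test fails the gate regardless of how the P1 tests fare.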

---

## Appendix: Test Fixtures

### Fixture: VulnBank

- **Location:** External (clone separately)
- **Purpose:** Intentionally vulnerable polyglot codebase for precision testing

### Fixture: Citadel/Masq

- **Location:** Real customer project (NDA)
- **Purpose:** Real-world precision testing

### Fixture: Clean Codebase

- **Location:** `uat/fixtures/clean-project/`
- **Purpose:** False positive rate testing

### Fixture: LLM Evaluation

- **Location:** `applications/aphoria/tests/fixtures/` (via eval harness)
- **Purpose:** LLM extraction quality regression

---

## Change Log

| Date | Version | Changes |
|------|---------|---------|
| 2026-02-06 | 1.0 | Initial comprehensive UAT plan |
| 2026-02-06 | 2.0 | All 14 test scripts implemented, 90/90 tests passing |