# Aphoria Comprehensive Vision UAT Plan

**Date:** 2026-02-06
**Status:** Complete (90/90 tests passing)
**Purpose:** Verify that Aphoria delivers on its complete vision across all user personas and use cases

---

## Vision Summary

Aphoria's complete vision encompasses three layers:

1. **Core Value:** A "code-level truth linter" that validates code against authoritative sources (RFCs, OWASP, vendor docs)
2. **Enterprise Value:** Federated policy management via Trust Packs: "turn your decisions into enforceable standards"
3. **Protocol Vision:** The Epistemic Assertion Protocol (EAP), a universal standard for truth publishing that makes Aphoria the "DNS for Truth"

---

## User Personas

| Persona | Role | Primary Use Cases |
|---------|------|-------------------|
| **Solo Developer** | Individual contributor | Pre-commit checks, RFC compliance, avoiding common mistakes |
| **Security Engineer** | AppSec team member | Scan projects for security misconfigurations, create org-wide policies |
| **Platform Lead** | Staff engineer | Define "Golden Path" patterns, distribute standards to teams |
| **Compliance Officer** | GRC team member | Audit multiple projects, trace conflicts to authoritative sources |
| **AI Agent** | Autonomous code agent | Pre-flight check before commits, query authority before implementing |

---

## UAT Categories

### Category 1: Core Detection (The "Linter" Value)

> **Vision claim:** "Aphoria scans a codebase, extracts the decisions embedded in config and code, and checks them against authoritative sources."

#### 1.1 Authoritative Source Conflict Detection

| Test ID | Scenario | Expected Outcome | Priority |
|---------|----------|------------------|----------|
| 1.1.1 | TLS verification disabled (Python `verify=False`) | Conflict with RFC 5246, BLOCK verdict | P0 |
| 1.1.2 | TLS verification disabled (Rust `danger_accept_invalid_certs`) | Conflict with RFC 5246, BLOCK verdict | P0 |
| 1.1.3 | TLS verification disabled (Go `InsecureSkipVerify`) | Conflict with RFC 5246, BLOCK verdict | P0 |
| 1.1.4 | JWT audience validation disabled | Conflict with RFC 7519, BLOCK verdict | P0 |
| 1.1.5 | Hardcoded secrets in source | Conflict with OWASP Secrets Cheatsheet, BLOCK verdict | P0 |
| 1.1.6 | CORS allow-all-origins | Conflict with OWASP Headers Cheatsheet, FLAG verdict | P0 |
| 1.1.7 | Zero timeout configuration | Conflict with vendor best practices, FLAG verdict | P1 |
| 1.1.8 | SQL injection pattern (string concatenation) | Conflict with OWASP Input Validation, BLOCK verdict | P0 |
| 1.1.9 | Command injection pattern | Conflict with OWASP Input Validation, BLOCK verdict | P0 |
| 1.1.10 | Weak crypto (MD5/SHA1 for security) | Conflict with OWASP Crypto Cheatsheet, BLOCK verdict | P0 |

**Success Criteria:**
- [ ] All P0 tests pass with the correct verdict
- [ ] Precision ≥95% (minimal false positives)
- [ ] Every BLOCK verdict has an RFC/OWASP citation

#### 1.2 Cross-Language Consistency

| Test ID | Scenario | Expected Outcome | Priority |
|---------|----------|------------------|----------|
| 1.2.1 | Same TLS issue detected in Rust, Go, Python, JS | Same conflict, same verdict across languages | P0 |
| 1.2.2 | Same JWT issue detected across languages | Same conflict, same verdict | P0 |
| 1.2.3 | YAML/TOML config file detection | Config issues detected regardless of language | P0 |

**Success Criteria:**
- [ ] Language parity: same issue → same verdict in all supported languages

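
The parity criterion above can be read as a grouping check: collect the verdict reported for each issue in each language and report any issue whose verdicts differ. The following is a minimal sketch under stated assumptions; the `(issue_id, language, verdict)` tuple shape, the issue identifiers, and the function name `parity_violations` are hypothetical and do not describe Aphoria's actual output format.

```python
from collections import defaultdict


def parity_violations(findings):
    """Return issues whose verdict differs between languages.

    `findings` is an iterable of (issue_id, language, verdict) tuples.
    This shape is assumed for illustration only.
    """
    verdicts = defaultdict(dict)
    for issue_id, language, verdict in findings:
        verdicts[issue_id][language] = verdict
    # An issue violates parity when more than one distinct verdict was seen.
    return {
        issue: by_language
        for issue, by_language in verdicts.items()
        if len(set(by_language.values())) > 1
    }


# Hypothetical scan results: the Go run disagrees on the TLS issue.
sample_findings = [
    ("tls-verify-disabled", "python", "BLOCK"),
    ("tls-verify-disabled", "rust", "BLOCK"),
    ("tls-verify-disabled", "go", "FLAG"),
    ("jwt-aud-disabled", "python", "BLOCK"),
    ("jwt-aud-disabled", "go", "BLOCK"),
]
violations = parity_violations(sample_findings)
```

An empty result means every issue received the same verdict in every language that reported it.
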
#### 1.3 Precision and Recall

| Test ID | Scenario | Expected Outcome | Priority |
|---------|----------|------------------|----------|
| 1.3.1 | VulnBank benchmark (intentionally vulnerable) | ≥50 findings, 100% precision | P0 |
| 1.3.2 | Real-world project scan (Citadel/Masq) | Findings with ≥95% precision | P0 |
| 1.3.3 | False positive rate on clean codebase | <5% false positive rate | P0 |
| 1.3.4 | Test file handling | Lower confidence, not flagged as BLOCK | P1 |

**Success Criteria:**
- [ ] VulnBank: 100% precision (every finding is real)
- [ ] Real-world: ≥95% precision, ≥5 distinct issues

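
Precision in these criteria is the share of reported findings that are real, TP / (TP + FP). A minimal sketch of the threshold check follows, assuming findings have already been triaged by hand into true and false positives. The function names are hypothetical, and the counts in the usage lines are made-up inputs, not measured results.

```python
def precision(true_positives, false_positives):
    """Share of reported findings that are real: TP / (TP + FP)."""
    total = true_positives + false_positives
    if total == 0:
        raise ValueError("no findings to evaluate")
    return true_positives / total


def meets_threshold(true_positives, false_positives, threshold):
    """True when precision is at or above the required threshold."""
    return precision(true_positives, false_positives) >= threshold


# Made-up triage counts, for illustration only.
real_world_ok = meets_threshold(48, 2, 0.95)   # 48 of 50 findings are real
benchmark_ok = meets_threshold(50, 0, 1.0)     # every finding is real
```
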

---

### Category 2: Enterprise Policy (The "Trust Pack" Value)

> **Vision claim:** "Organizations often have internal rules that override or extend public standards. Aphoria allows you to export these decisions as Trust Packs."

#### 2.1 Policy Creation Workflow

| Test ID | Scenario | Expected Outcome | Priority |
|---------|----------|------------------|----------|
| 2.1.1 | `aphoria bless` creates policy assertion | Assertion stored with reason, signed | P0 |
| 2.1.2 | `aphoria ack` creates acknowledgment | Acknowledgment stored with reason | P0 |
| 2.1.3 | `aphoria policy export` creates .pack file | Signed binary pack with assertions | P0 |
| 2.1.4 | Export includes both blessed and acked assertions | All policy decisions exported | P0 |

**Success Criteria:**
- [ ] Complete round-trip: bless → export → import → conflict

#### 2.2 Policy Distribution

| Test ID | Scenario | Expected Outcome | Priority |
|---------|----------|------------------|----------|
| 2.2.1 | Local `.pack` file import | Assertions imported, conflicts detected | P0 |
| 2.2.2 | HTTP URL policy import | Remote pack downloaded, cached | P0 |
| 2.2.3 | Multiple packs, no conflict | Both policies enforced | P0 |
| 2.2.4 | Multiple packs, same concept, different values | Conflict visible, user can choose | P1 |
| 2.2.5 | Pack version update (v1 → v2) | v2 supersedes v1 | P1 |

**Success Criteria:**
- [ ] Enterprise workflow script passes (12/12)
- [ ] Multi-pack import works without data loss

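
Tests 2.2.3 to 2.2.5 describe two behaviours that can be modelled separately: a newer version of a pack replaces the older one, and two different packs that assert different values for the same concept produce a visible conflict. The sketch below illustrates both with plain dictionaries. The pack layout (`name`, `version`, `assertions`), the concept keys, and the function names are assumptions made for this example, not the real `.pack` format.

```python
def parse_version(text):
    """Turn 'v2' or 'v1.3' into a comparable tuple such as (2,) or (1, 3)."""
    return tuple(int(part) for part in text.lstrip("v").split("."))


def active_packs(packs):
    """Keep only the highest version of each pack name (v2 supersedes v1)."""
    latest = {}
    for pack in packs:
        current = latest.get(pack["name"])
        if current is None or parse_version(pack["version"]) > parse_version(current["version"]):
            latest[pack["name"]] = pack
    return latest


def cross_pack_conflicts(packs):
    """Return concepts that different packs assert with different values."""
    seen = {}
    for pack in packs:
        for concept, value in pack["assertions"].items():
            seen.setdefault(concept, {})[pack["name"]] = value
    return {c: by_pack for c, by_pack in seen.items() if len(set(by_pack.values())) > 1}


# Hypothetical packs: two versions of one pack plus a second, disagreeing pack.
imported = [
    {"name": "security", "version": "v1", "assertions": {"tls.min_version": "1.2"}},
    {"name": "security", "version": "v2", "assertions": {"tls.min_version": "1.3"}},
    {"name": "platform", "version": "v1", "assertions": {"tls.min_version": "1.2"}},
]
active = active_packs(imported)
conflicts = cross_pack_conflicts(active.values())
```

Resolving versions before comparing packs keeps a superseded v1 assertion from showing up as a conflict with its own v2.
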
#### 2.3 Policy Attribution

| Test ID | Scenario | Expected Outcome | Priority |
|---------|----------|------------------|----------|
| 2.3.1 | Conflict shows pack name | `policy_source.pack_name` in JSON | P0 |
| 2.3.2 | Conflict shows pack version | `policy_source.pack_version` in JSON | P0 |
| 2.3.3 | Conflict shows issuer | `policy_source.issuer_hex` in JSON | P0 |
| 2.3.4 | Attribution in all formats | JSON, table, Markdown, SARIF | P0 |

**Success Criteria:**
- [ ] Developer can trace any conflict to "who decided this"

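
The attribution tests above reduce to checking that each policy-derived conflict in the JSON output carries the three `policy_source` fields. A minimal sketch of that check follows. Only the field names `pack_name`, `pack_version`, and `issuer_hex` come from the test table; the surrounding JSON structure, the sample values, and the helper name `missing_attribution` are assumptions.

```python
import json

REQUIRED_FIELDS = ("pack_name", "pack_version", "issuer_hex")


def missing_attribution(conflict):
    """List the policy_source fields that are absent or empty in one conflict."""
    source = conflict.get("policy_source") or {}
    return [field for field in REQUIRED_FIELDS if not source.get(field)]


# Hypothetical conflict record; every value here is made up.
sample = json.loads("""
{
  "file": "src/client.py",
  "verdict": "BLOCK",
  "policy_source": {
    "pack_name": "security-baseline",
    "pack_version": "v2",
    "issuer_hex": "ab12cd34"
  }
}
""")
gaps = missing_attribution(sample)
```
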
#### 2.4 Predicate Aliases

| Test ID | Scenario | Expected Outcome | Priority |
|---------|----------|------------------|----------|
| 2.4.1 | `enabled` matches `required` | Same-meaning predicates conflict | P1 |
| 2.4.2 | Pack-defined aliases | Custom alias sets work | P2 |

**Success Criteria:**
- [ ] Semantic predicate matching prevents bypasses

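
One way to picture predicate aliasing is canonicalization: every predicate in an alias set maps to the same representative, so `enabled` and `required` compare equal before conflict detection runs. The sketch below is an assumption about how such matching could work, not a description of Aphoria's implementation. Only the `enabled`/`required` pair comes from test 2.4.1; the second alias set is a hypothetical pack-defined example.

```python
BUILTIN_ALIASES = [{"enabled", "required"}]  # pair taken from test 2.4.1


def canonical(predicate, alias_sets):
    """Map a predicate to a stable representative of its alias set."""
    for aliases in alias_sets:
        if predicate in aliases:
            return min(aliases)  # any deterministic choice works
    return predicate


def predicates_match(left, right, alias_sets=BUILTIN_ALIASES):
    """True when two predicates are identical or belong to the same alias set."""
    return canonical(left, alias_sets) == canonical(right, alias_sets)


# Hypothetical pack-defined alias set layered on top of the built-in one.
pack_aliases = BUILTIN_ALIASES + [{"off", "disabled"}]
```
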

---

### Category 3: Pre-Commit Integration (The "Full Cycle" Value)

> **Vision claim:** "The pre-commit hook is a bidirectional knowledge sync, not just a read-only linter."

#### 3.1 Fast Scanning

| Test ID | Scenario | Expected Outcome | Priority |
|---------|----------|------------------|----------|
| 3.1.1 | Ephemeral scan (default) | <0.5s for typical project | P0 |
| 3.1.2 | Staged-only scan (`--staged`) | <0.5s, only staged files scanned | P0 |
| 3.1.3 | No storage created in ephemeral mode | No WAL/store directories | P0 |

**Success Criteria:**
- [ ] The pre-commit hook doesn't slow down the development workflow

#### 3.2 Observation Recording

| Test ID | Scenario | Expected Outcome | Priority |
|---------|----------|------------------|----------|
| 3.2.1 | `--sync` records observations | Novel claims stored as Tier 4 | P1 |
| 3.2.2 | Observations survive across commits | Persistent local knowledge | P1 |
| 3.2.3 | `--sync` requires `--persist` | Validation error otherwise | P0 |

**Success Criteria:**
- [ ] Project builds local memory over time

#### 3.3 Drift Detection

| Test ID | Scenario | Expected Outcome | Priority |
|---------|----------|------------------|----------|
| 3.3.1 | Value changed from prior observation | DRIFT verdict shown | P1 |
| 3.3.2 | Drift in table/JSON/Markdown output | All formats show drift | P1 |
| 3.3.3 | `--exit-code` returns 1 for drift | CI can catch unintentional changes | P1 |

**Success Criteria:**
- [ ] Accidental configuration changes are caught

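
Drift, as test 3.3.1 describes it, is a value that differs from a prior observation of the same claim. A minimal sketch of that comparison follows, assuming observations can be represented as a flat mapping from claim key to value. The keys, values, and function name are hypothetical. Claims seen for the first time are treated as new observations rather than drift, which is also an assumption.

```python
def detect_drift(prior, current):
    """Return (key, old_value, new_value) for claims whose value changed.

    Keys present in only one of the two mappings are ignored: a new key
    is a novel observation, not a change to an existing one.
    """
    shared = sorted(prior.keys() & current.keys())
    return [(key, prior[key], current[key]) for key in shared if prior[key] != current[key]]


# Hypothetical observations from two consecutive scans.
previous_scan = {"http.timeout": "30s", "tls.verify": "true"}
latest_scan = {"http.timeout": "0s", "tls.verify": "true", "cors.origin": "*"}
drift = detect_drift(previous_scan, latest_scan)
```
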
#### 3.4 Exit Codes

| Test ID | Scenario | Expected Outcome | Priority |
|---------|----------|------------------|----------|
| 3.4.1 | No conflicts → exit 0 | Clean scan passes | P0 |
| 3.4.2 | FLAG only → exit 1 | Review recommended | P0 |
| 3.4.3 | BLOCK → exit 2 | Build should fail | P0 |
| 3.4.4 | Without `--exit-code` → always exit 0 | Interactive mode | P0 |

**Success Criteria:**
- [ ] CI/CD integration works correctly

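
The exit-code contract in the table above, combined with test 3.3.3 (drift returns 1), can be written as a small decision function. This is a sketch of the documented behaviour, not the tool's source. Treating ACK as non-failing is an assumption, since the plan does not state an exit code for acknowledged conflicts.

```python
def exit_code(verdicts, exit_code_flag):
    """Map the verdicts of one scan to a process exit code.

    verdicts: iterable of strings such as "BLOCK", "FLAG", "DRIFT", "ACK".
    exit_code_flag: True when the scan was run with --exit-code.
    """
    if not exit_code_flag:
        return 0  # interactive mode always exits 0 (test 3.4.4)
    verdicts = set(verdicts)
    if "BLOCK" in verdicts:
        return 2  # build should fail (test 3.4.3)
    if verdicts & {"FLAG", "DRIFT"}:
        return 1  # review recommended (tests 3.4.2 and 3.3.3)
    return 0      # clean scan; ACK-only is assumed to pass
```

BLOCK is checked first so that a scan containing both FLAG and BLOCK verdicts reports the more severe code.
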

---

### Category 4: LLM Extraction (The "Intelligent" Value)

> **Vision claim:** "Use LLM to extract claims semantically during persistent scans. This fills gaps that regex extractors can't catch."

#### 4.1 LLM Triggering

| Test ID | Scenario | Expected Outcome | Priority |
|---------|----------|------------------|----------|
| 4.1.1 | High-value file (auth/, crypto/) | LLM extraction runs | P1 |
| 4.1.2 | Non-high-value file | LLM extraction skipped | P1 |
| 4.1.3 | File already covered by regex extractors | LLM extraction skipped | P1 |
| 4.1.4 | Token budget exceeded | Graceful stop, no crash | P1 |

**Success Criteria:**
- [ ] LLM only runs when valuable, stays within budget

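
The four triggering tests above describe a gate with three conditions: the file lives in a high-value directory, no regex extractor already covers it, and enough token budget remains. A minimal sketch of such a gate follows. The `auth/` and `crypto/` directories come from test 4.1.1; the function name, parameters, and the way the budget is counted are assumptions for illustration.

```python
from pathlib import PurePosixPath

HIGH_VALUE_DIRS = {"auth", "crypto"}  # directories named in test 4.1.1


def should_run_llm(path, covered_by_regex, tokens_needed, tokens_remaining):
    """Decide whether LLM extraction should run for one file."""
    directories = set(PurePosixPath(path).parts[:-1])  # drop the file name
    if not directories & HIGH_VALUE_DIRS:
        return False  # not a high-value file (4.1.2)
    if covered_by_regex:
        return False  # regex extractors already cover it (4.1.3)
    # Skip instead of failing when the budget is exhausted (4.1.4).
    return tokens_needed <= tokens_remaining
```

Returning `False` on an exhausted budget, rather than raising, is what keeps the scan from crashing in the 4.1.4 scenario.
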
#### 4.2 LLM Quality

| Test ID | Scenario | Expected Outcome | Priority |
|---------|----------|------------------|----------|
| 4.2.1 | Evaluation fixtures pass | Baseline quality maintained | P1 |
| 4.2.2 | No regressions from prompt changes | Regression tests pass | P2 |
| 4.2.3 | Response parsing handles edge cases | No crashes on malformed JSON | P1 |

**Success Criteria:**
- [ ] LLM extraction quality is measurable and stable

#### 4.3 Pattern Learning

| Test ID | Scenario | Expected Outcome | Priority |
|---------|----------|------------------|----------|
| 4.3.1 | LLM-extracted claim → pattern stored | LocalPatternStore updated | P2 |
| 4.3.2 | Similar pattern → merged, not duplicated | Deduplication works | P2 |
| 4.3.3 | Pattern seen in 5+ projects → promotion candidate | Threshold triggers | P2 |

**Success Criteria:**
- [ ] Learning system builds knowledge over time

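
The learning loop in 4.3 has two moving parts: similar patterns merge into one entry, and an entry becomes a promotion candidate once it has been seen in five or more projects. The sketch below models both with an in-memory class. It is a stand-in written for this plan, not the real `LocalPatternStore`. Merging by case and whitespace normalization is an assumed, deliberately simple notion of "similar".

```python
class PatternStore:
    """In-memory stand-in for a pattern store (hypothetical API)."""

    def __init__(self, promotion_threshold=5):
        self.promotion_threshold = promotion_threshold  # "5+ projects" from 4.3.3
        self._projects = {}  # normalized pattern -> set of project names

    @staticmethod
    def _normalize(pattern):
        # Assumed similarity rule: ignore case and runs of whitespace.
        return " ".join(pattern.lower().split())

    def record(self, pattern, project):
        """Store a pattern sighting, merging it with similar patterns."""
        self._projects.setdefault(self._normalize(pattern), set()).add(project)

    def candidates(self):
        """Patterns seen in at least `promotion_threshold` distinct projects."""
        return sorted(
            pattern
            for pattern, projects in self._projects.items()
            if len(projects) >= self.promotion_threshold
        )


store = PatternStore()
for name in ["alpha", "beta", "gamma", "delta"]:
    store.record("verify = False", name)
store.record("Verify  =  FALSE", "epsilon")  # merges with the entries above
store.record("timeout = 0", "alpha")         # seen in one project only
```

Counting distinct projects, not raw sightings, prevents a single noisy repository from pushing a pattern over the threshold.
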

---

### Category 5: Declarative Extractors (The "Extensibility" Value)

> **Vision claim:** "Enable users to define new extractors in config/policy files (TOML) without writing Rust code."

#### 5.1 Custom Extractors

| Test ID | Scenario | Expected Outcome | Priority |
|---------|----------|------------------|----------|
| 5.1.1 | TOML-defined extractor runs | Claims extracted using custom regex | P0 |
| 5.1.2 | Invalid regex rejected at load time | Clear error, doesn't block others | P0 |
| 5.1.3 | ReDoS-vulnerable regex rejected | Security protection | P0 |
| 5.1.4 | `value_from_match` captures groups | Dynamic claim values | P1 |

**Success Criteria:**
- [ ] Users can add extractors without recompiling

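
The loading behaviour in tests 5.1.1, 5.1.2, and 5.1.4 can be illustrated with Python's `re` module: compile each user-supplied pattern up front, collect a clear error for any that fail, and keep going with the rest. In the sketch below, the definitions are plain dictionaries standing in for parsed TOML. The field names `name`, `pattern`, and `value` are assumptions, and `value_from_match` is assumed to hold a capture-group index. ReDoS screening (5.1.3) is not shown.

```python
import re


def load_extractors(definitions):
    """Compile each definition; return (loaded, errors) without aborting."""
    loaded, errors = [], []
    for definition in definitions:
        try:
            compiled = re.compile(definition["pattern"])
        except re.error as exc:
            # Reject this extractor only; the others still load (5.1.2).
            errors.append(f"{definition['name']}: invalid regex ({exc})")
            continue
        loaded.append({**definition, "compiled": compiled})
    return loaded, errors


def extract_claims(extractors, text):
    """Run every extractor over the text and return (name, value) claims."""
    claims = []
    for extractor in extractors:
        for match in extractor["compiled"].finditer(text):
            group = extractor.get("value_from_match")
            value = match.group(group) if group is not None else extractor.get("value")
            claims.append((extractor["name"], value))
    return claims


# Hypothetical definitions, as they might look after parsing a TOML file.
DEFINITIONS = [
    {"name": "http-timeout", "pattern": r"timeout\s*=\s*(\d+)", "value_from_match": 1},
    {"name": "broken", "pattern": r"timeout\s*=\s*(\d+"},  # unbalanced parenthesis
    {"name": "tls-verify-disabled", "pattern": r"verify\s*=\s*False", "value": "disabled"},
]
extractors, load_errors = load_extractors(DEFINITIONS)
claims = extract_claims(extractors, "requests.get(url, verify=False, timeout=0)")
```
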
#### 5.2 Extractor Promotion

| Test ID | Scenario | Expected Outcome | Priority |
|---------|----------|------------------|----------|
| 5.2.1 | `aphoria extractors candidates` lists promotable patterns | Threshold-meeting patterns shown | P2 |
| 5.2.2 | `aphoria extractors promote` generates YAML | Extractor file created | P2 |
| 5.2.3 | Interactive review workflow | Approve/reject/skip options | P2 |

**Success Criteria:**
- [ ] Learning → promotion pipeline is functional


---

### Category 6: Output Formats (The "Integration" Value)

> **Vision claim:** "SARIF for CI integration... structured JSON/SARIF for dashboard integration."

#### 6.1 Format Correctness

| Test ID | Scenario | Expected Outcome | Priority |
|---------|----------|------------------|----------|
| 6.1.1 | JSON output is valid JSON | Parses correctly | P0 |
| 6.1.2 | SARIF output is valid SARIF 2.1.0 | Schema validates | P0 |
| 6.1.3 | Markdown output is valid Markdown | Renders correctly | P0 |
| 6.1.4 | Table output is human-readable | Aligned, clear | P0 |

**Success Criteria:**
- [ ] All formats pass validation

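
Tests 6.1.1 and 6.1.2 can be approximated without a schema validator by parsing the output as JSON and checking the handful of properties SARIF 2.1.0 requires at the top level: `version` equal to `"2.1.0"`, a `runs` array, and a `tool.driver.name` in each run. The sketch below performs only this structural smoke check and is not a substitute for validating against the published SARIF schema. The function name and the sample document are hypothetical.

```python
import json


def sarif_problems(text):
    """Return a list of structural problems; an empty list means the check passed."""
    try:
        document = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(document, dict):
        return ["top level is not an object"]

    problems = []
    if document.get("version") != "2.1.0":
        problems.append("version must be '2.1.0'")
    runs = document.get("runs")
    if not isinstance(runs, list):
        problems.append("runs must be an array")
        return problems
    for index, run in enumerate(runs):
        tool = run.get("tool") if isinstance(run, dict) else None
        driver = tool.get("driver") if isinstance(tool, dict) else None
        if not isinstance(driver, dict) or not driver.get("name"):
            problems.append(f"runs[{index}].tool.driver.name is required")
    return problems


# Minimal hand-written SARIF document used as a sample input.
minimal_sarif = json.dumps(
    {"version": "2.1.0", "runs": [{"tool": {"driver": {"name": "aphoria"}}, "results": []}]}
)
```
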
#### 6.2 Format Completeness

| Test ID | Scenario | Expected Outcome | Priority |
|---------|----------|------------------|----------|
| 6.2.1 | All formats show file location | File + line for each conflict | P0 |
| 6.2.2 | All formats show conflict score | Score visible | P0 |
| 6.2.3 | All formats show verdict | BLOCK/FLAG/ACK/DRIFT visible | P0 |
| 6.2.4 | All formats show policy source (if applicable) | Attribution visible | P0 |

**Success Criteria:**
- [ ] No information loss between formats


---

### Category 7: Domain-Specific Audits (The "Vertical" Value)

> **Vision claim:** "Aphoria is not limited to web security. It includes specialized corpora for different domains."

#### 7.1 Unreal Engine

| Test ID | Scenario | Expected Outcome | Priority |
|---------|----------|------------------|----------|
| 7.1.1 | `LoadSynchronous()` on game thread detected | BLOCK verdict | P1 |
| 7.1.2 | Hardcoded asset paths detected | FLAG verdict | P2 |
| 7.1.3 | Exposed console commands detected | FLAG verdict | P2 |

**Success Criteria:**
- [ ] Masq UAT passes (7 findings, 100% precision)

#### 7.2 Framework-Specific Security

| Test ID | Scenario | Expected Outcome | Priority |
|---------|----------|------------------|----------|
| 7.2.1 | Spring Security misconfiguration | Conflict detected | P2 |
| 7.2.2 | Django `ALLOWED_HOSTS = ["*"]` | Conflict detected | P2 |
| 7.2.3 | Flask `DEBUG=True` in production | Conflict detected | P2 |

**Success Criteria:**
- [ ] Framework extractors detect common misconfigurations


---

### Category 8: The "Protocol Vision" (Long-Term)

> **Vision claim:** "Aphoria is not just a linter; it is the Reference Implementation (Browser) for this new web of data."

#### 8.1 EAP Readiness (Future)

| Test ID | Scenario | Expected Outcome | Priority |
|---------|----------|------------------|----------|
| 8.1.1 | Consume `.eap.json` manifest | EAP format supported | P3 |
| 8.1.2 | Publish project observations as EAP | Export to EAP format | P3 |
| 8.1.3 | Multi-source aggregation | RFC + OWASP + Vendor + Policy unified | P3 |

**Success Criteria:**
- [ ] Foundation for "DNS for Truth" is laid


---

## UAT Execution Plan

### Phase 1: Core Detection (Week 1)

**Goal:** Prove the core value proposition works across languages.

| Day | Focus | Tests |
|-----|-------|-------|
| 1 | VulnBank benchmark | 1.3.1 |
| 2 | Cross-language TLS/JWT | 1.1.1-1.1.5, 1.2.1-1.2.3 |
| 3 | OWASP patterns | 1.1.6-1.1.10 |
| 4 | False positive analysis | 1.3.3-1.3.4 |
| 5 | Report validation | 6.1.1-6.2.4 |

**Deliverable:** UAT report with precision/recall metrics

### Phase 2: Enterprise Policy (Week 2)

**Goal:** Prove the Trust Pack workflow is production-ready.

| Day | Focus | Tests |
|-----|-------|-------|
| 1 | Policy creation | 2.1.1-2.1.4 |
| 2 | Policy distribution | 2.2.1-2.2.5 |
| 3 | Policy attribution | 2.3.1-2.3.4 |
| 4 | Multi-pack scenarios | 2.2.3-2.2.4 |
| 5 | End-to-end workflow | Full enterprise script |

**Deliverable:** UAT report with enterprise workflow validation

### Phase 3: Pre-Commit Integration (Week 3)

**Goal:** Prove Aphoria works seamlessly in the development workflow.

| Day | Focus | Tests |
|-----|-------|-------|
| 1 | Performance | 3.1.1-3.1.3 |
| 2 | Exit codes | 3.4.1-3.4.4 |
| 3 | Observation recording | 3.2.1-3.2.3 |
| 4 | Drift detection | 3.3.1-3.3.3 |
| 5 | CI/CD integration | GitHub Actions, pre-commit hook |

**Deliverable:** UAT report with performance benchmarks

### Phase 4: Advanced Features (Week 4)

**Goal:** Prove LLM, learning, and extensibility work.

| Day | Focus | Tests |
|-----|-------|-------|
| 1 | LLM triggering | 4.1.1-4.1.4 |
| 2 | LLM quality | 4.2.1-4.2.3 |
| 3 | Declarative extractors | 5.1.1-5.1.4 |
| 4 | Domain-specific | 7.1.1-7.2.3 |
| 5 | End-to-end user journey | All personas |

**Deliverable:** UAT report with feature completeness matrix


---

## Automated Test Scripts

### All Scripts

| Script | Purpose | Tests | Status |
|--------|---------|-------|--------|
| `test-core-detection.sh` | Category 1: Core detection tests | 10 | PASS (10/10) |
| `test-cross-language.sh` | Category 1.2: Cross-language parity | 3 | PASS (3/3) |
| `test-declarative-extractors.sh` | Category 5: Custom extractor loading | 6 | PASS (6/6) |
| `test-domain-frameworks.sh` | Category 7.2: Framework security | 11 | PASS (11/11) |
| `test-domain-unreal.sh` | Category 7.1: Unreal Engine | 4 | PASS (4/4) |
| `test-drift-detection.sh` | Category 3.2-3.3: Observation/drift | 6 | PASS (6/6) |
| `test-enterprise-workflow.sh` | Category 2: Trust Pack round-trip | 12 | PASS (12/12) |
| `test-eval-harness.sh` | Category 4.2: LLM evaluation harness | 4 | PASS (4/4) |
| `test-exit-codes.sh` | Category 3.4: Exit code validation | 4 | PASS (4/4) |
| `test-llm-extraction.sh` | Category 4.1: LLM quality gates | 5 | PASS (5/5) |
| `test-multi-pack-conflict.sh` | Category 2.2: Multiple pack behavior | 7 | PASS (7/7) |
| `test-output-formats.sh` | Category 6: Format validation | 8 | PASS (8/8) |
| `test-pack-version-update.sh` | Category 2.2.5: Version supersession | 6 | PASS (6/6) |
| `test-precommit-performance.sh` | Category 3.1: Performance benchmarks | 4 | PASS (4/4) |

**Total: 14 scripts, 90 tests**

### Summary by Category

| Category | Scripts | Tests | Status |
|----------|---------|-------|--------|
| 1. Core Detection | 2 | 13 | PASS |
| 2. Enterprise Policy | 3 | 25 | PASS |
| 3. Pre-Commit | 3 | 14 | PASS |
| 4. LLM Extraction | 2 | 9 | PASS |
| 5. Declarative Extractors | 1 | 6 | PASS |
| 6. Output Formats | 1 | 8 | PASS |
| 7. Domain-Specific | 2 | 15 | PASS |


---

## Success Criteria Summary

### Minimum Viable UAT (MVP)

| Criterion | Threshold | Measured By |
|-----------|-----------|-------------|
| Core precision | ≥95% | VulnBank + real-world scan |
| Cross-language parity | 100% | Same issue → same verdict |
| Enterprise workflow | 12/12 pass | `test-enterprise-workflow.sh` |
| Ephemeral scan time | <0.5s | Performance benchmark |
| Exit code correctness | 4/4 pass | `test-exit-codes.sh` |
| Format validity | 4/4 valid | `test-output-formats.sh` |

### Full Vision UAT

| Criterion | Threshold | Measured By |
|-----------|-----------|-------------|
| All P0 tests pass | 100% | Test matrix |
| All P1 tests pass | ≥90% | Test matrix |
| User journey complete | All 5 personas | End-to-end walkthrough |
| Drift detection works | DRIFT shown, exit 1 | `test-drift-detection.sh` |
| LLM extraction quality | Baseline maintained | Eval fixtures |

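
The two pass-rate thresholds in the table above (100% of P0, at least 90% of P1) can be computed directly from a test matrix. The sketch below assumes the matrix is available as `(priority, passed)` pairs. That representation, the function names, and the sample matrix are hypothetical, and the remaining criteria in the table are not modelled.

```python
def pass_rate(results, priority):
    """Fraction of tests at the given priority that passed."""
    outcomes = [passed for prio, passed in results if prio == priority]
    if not outcomes:
        return 1.0  # assumed: no tests at this priority means nothing failed
    return sum(outcomes) / len(outcomes)


def full_vision_gate(results):
    """True when all P0 tests pass and at least 90% of P1 tests pass."""
    return pass_rate(results, "P0") == 1.0 and pass_rate(results, "P1") >= 0.9


# Made-up matrix: three P0 tests passing, nine of ten P1 tests passing.
sample_matrix = [("P0", True)] * 3 + [("P1", True)] * 9 + [("P1", False)]
gate_open = full_vision_gate(sample_matrix)
```
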

---

## Appendix: Test Fixtures

### Fixture: VulnBank

Location: External (clone separately)
Purpose: Intentionally vulnerable polyglot codebase for precision testing

### Fixture: Citadel/Masq

Location: Real customer project (NDA)
Purpose: Real-world precision testing

### Fixture: Clean Codebase

Location: `uat/fixtures/clean-project/`
Purpose: False positive rate testing

### Fixture: LLM Evaluation

Location: `applications/aphoria/tests/fixtures/` (via eval harness)
Purpose: LLM extraction quality regression


---

## Change Log

| Date | Version | Changes |
|------|---------|---------|
| 2026-02-06 | 1.0 | Initial comprehensive UAT plan |
| 2026-02-06 | 2.0 | All 14 test scripts implemented, 90/90 tests passing |