stemedb/applications/aphoria/uat/comprehensive-vision-uat.md
jordan 157dbbb9eb feat: Complete Aphoria Phase 8-9 + UAT suite (90/90 tests passing)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-06 22:50:55 -07:00


# Aphoria Comprehensive Vision UAT Plan

**Date:** 2026-02-06

**Status:** Complete (90/90 tests passing)

**Purpose:** Verify that Aphoria delivers on its complete vision across all user personas and use cases

---
## Vision Summary
Aphoria's complete vision encompasses three layers:
1. **Core Value:** A "code-level truth linter" that validates code against authoritative sources (RFCs, OWASP, vendor docs)
2. **Enterprise Value:** Federated policy management via Trust Packs — "turn your decisions into enforceable standards"
3. **Protocol Vision:** The Epistemic Assertion Protocol (EAP) — a universal standard for truth publishing, making Aphoria the "DNS for Truth"
---
## User Personas
| Persona | Role | Primary Use Cases |
|---------|------|-------------------|
| **Solo Developer** | Individual contributor | Pre-commit checks, RFC compliance, avoiding common mistakes |
| **Security Engineer** | AppSec team member | Scan projects for security misconfigurations, create org-wide policies |
| **Platform Lead** | Staff engineer | Define "Golden Path" patterns, distribute standards to teams |
| **Compliance Officer** | GRC team member | Audit multiple projects, trace conflicts to authoritative sources |
| **AI Agent** | Autonomous code agent | Pre-flight check before commits, query authority before implementing |
---
## UAT Categories
### Category 1: Core Detection (The "Linter" Value)
> **Vision claim:** "Aphoria scans a codebase, extracts the decisions embedded in config and code, and checks them against authoritative sources."
#### 1.1 Authoritative Source Conflict Detection
| Test ID | Scenario | Expected Outcome | Priority |
|---------|----------|------------------|----------|
| 1.1.1 | TLS verification disabled (Python `verify=False`) | Conflict with RFC 5246, BLOCK verdict | P0 |
| 1.1.2 | TLS verification disabled (Rust `danger_accept_invalid_certs`) | Conflict with RFC 5246, BLOCK verdict | P0 |
| 1.1.3 | TLS verification disabled (Go `InsecureSkipVerify`) | Conflict with RFC 5246, BLOCK verdict | P0 |
| 1.1.4 | JWT audience validation disabled | Conflict with RFC 7519, BLOCK verdict | P0 |
| 1.1.5 | Hardcoded secrets in source | Conflict with OWASP Secrets Cheatsheet, BLOCK verdict | P0 |
| 1.1.6 | CORS allow-all-origins | Conflict with OWASP Headers Cheatsheet, FLAG verdict | P0 |
| 1.1.7 | Zero timeout configuration | Conflict with vendor best practices, FLAG verdict | P1 |
| 1.1.8 | SQL injection pattern (string concat) | Conflict with OWASP Input Validation, BLOCK verdict | P0 |
| 1.1.9 | Command injection pattern | Conflict with OWASP Input Validation, BLOCK verdict | P0 |
| 1.1.10 | Weak crypto (MD5/SHA1 for security) | Conflict with OWASP Crypto Cheatsheet, BLOCK verdict | P0 |

**Success Criteria:**
- [ ] All P0 tests pass with correct verdict
- [ ] Precision ≥95% (minimal false positives)
- [ ] Every BLOCK verdict has an RFC/OWASP citation
#### 1.2 Cross-Language Consistency
| Test ID | Scenario | Expected Outcome | Priority |
|---------|----------|------------------|----------|
| 1.2.1 | Same TLS issue detected in Rust, Go, Python, JS | Same conflict, same verdict across languages | P0 |
| 1.2.2 | Same JWT issue detected across languages | Same conflict, same verdict | P0 |
| 1.2.3 | YAML/TOML config file detection | Config issues detected regardless of language | P0 |

**Success Criteria:**
- [ ] Language parity: same issue → same verdict in all supported languages
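
The parity criterion can be checked mechanically once findings are grouped by issue. The following is a minimal Python sketch, not Aphoria's implementation; the `(issue, language, verdict)` tuple shape and the issue identifiers are assumptions made for illustration.

```python
from collections import defaultdict

def parity_violations(findings):
    """Return issues whose verdict differs between languages.

    `findings` is an iterable of (issue, language, verdict) tuples.
    """
    verdicts = defaultdict(set)
    for issue, _language, verdict in findings:
        verdicts[issue].add(verdict)
    return {issue: seen for issue, seen in verdicts.items() if len(seen) > 1}

# Hypothetical findings; the last entry is a deliberate mismatch.
findings = [
    ("tls_verify_disabled", "python", "BLOCK"),
    ("tls_verify_disabled", "rust", "BLOCK"),
    ("tls_verify_disabled", "go", "BLOCK"),
    ("jwt_audience_disabled", "python", "BLOCK"),
    ("jwt_audience_disabled", "js", "FLAG"),
]
violations = parity_violations(findings)
```

An empty result means every issue received the same verdict in every language that reported it.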
#### 1.3 Precision and Recall
| Test ID | Scenario | Expected Outcome | Priority |
|---------|----------|------------------|----------|
| 1.3.1 | VulnBank benchmark (intentionally vulnerable) | ≥50 findings, 100% precision | P0 |
| 1.3.2 | Real-world project scan (Citadel/Masq) | Findings with ≥95% precision | P0 |
| 1.3.3 | False positive rate on clean codebase | <5% false positive rate | P0 |
| 1.3.4 | Test file handling | Lower confidence, not flagged as BLOCK | P1 |

**Success Criteria:**
- [ ] VulnBank: 100% precision (every finding is real)
- [ ] Real-world: ≥95% precision, 5 distinct issues
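
The precision thresholds above reduce to simple arithmetic over triaged findings. A minimal Python sketch, assuming each finding has been manually labeled as a true or false positive; the `precision` helper and the sample counts are illustrative and are not measured results.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Fraction of reported findings that are real issues."""
    total = true_positives + false_positives
    if total == 0:
        raise ValueError("no findings to evaluate")
    return true_positives / total

# Hypothetical triage counts, for illustration only.
vulnbank = precision(true_positives=50, false_positives=0)
real_world = precision(true_positives=19, false_positives=1)

meets_vulnbank_bar = vulnbank == 1.0       # 100% precision required
meets_real_world_bar = real_world >= 0.95  # at least 95% required
```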
---
### Category 2: Enterprise Policy (The "Trust Pack" Value)
> **Vision claim:** "Organizations often have internal rules that override or extend public standards. Aphoria allows you to export these decisions as Trust Packs."
#### 2.1 Policy Creation Workflow
| Test ID | Scenario | Expected Outcome | Priority |
|---------|----------|------------------|----------|
| 2.1.1 | `aphoria bless` creates policy assertion | Assertion stored with reason, signed | P0 |
| 2.1.2 | `aphoria ack` creates acknowledgment | Acknowledgment stored with reason | P0 |
| 2.1.3 | `aphoria policy export` creates .pack file | Signed binary pack with assertions | P0 |
| 2.1.4 | Export includes both blessed and acked assertions | All policy decisions exported | P0 |

**Success Criteria:**
- [ ] Complete round-trip: bless → export → import → conflict
#### 2.2 Policy Distribution
| Test ID | Scenario | Expected Outcome | Priority |
|---------|----------|------------------|----------|
| 2.2.1 | Local `.pack` file import | Assertions imported, conflicts detected | P0 |
| 2.2.2 | HTTP URL policy import | Remote pack downloaded, cached | P0 |
| 2.2.3 | Multiple packs, no conflict | Both policies enforced | P0 |
| 2.2.4 | Multiple packs, same concept, different values | Conflict visible, user can choose | P1 |
| 2.2.5 | Pack version update (v1 → v2) | v2 supersedes v1 | P1 |

**Success Criteria:**
- [ ] Enterprise workflow script passes (12/12)
- [ ] Multi-pack import works without data loss
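
Version supersession (test 2.2.5) can be pictured as keeping only the newest version of each pack. The sketch below is a hedged illustration in Python: the pack names are made up, and the assumption that versions are dotted integers with an optional `v` prefix is not confirmed by the pack format.

```python
def parse_version(text: str) -> tuple:
    """Turn '1.10.0' or 'v2' into a comparable tuple of ints."""
    return tuple(int(part) for part in text.lstrip("v").split("."))

def active_packs(packs):
    """Keep only the highest version seen for each pack name."""
    newest = {}
    for name, version in packs:
        current = newest.get(name)
        if current is None or parse_version(version) > parse_version(current):
            newest[name] = version
    return newest

# Hypothetical imports, in arrival order.
imported = [
    ("acme-security", "v1"),
    ("acme-security", "v2"),
    ("payments", "1.9.0"),
    ("payments", "1.10.0"),
]
active = active_packs(imported)
```

Comparing integer tuples rather than strings avoids ranking `1.9.0` above `1.10.0`.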
#### 2.3 Policy Attribution
| Test ID | Scenario | Expected Outcome | Priority |
|---------|----------|------------------|----------|
| 2.3.1 | Conflict shows pack name | `policy_source.pack_name` in JSON | P0 |
| 2.3.2 | Conflict shows pack version | `policy_source.pack_version` in JSON | P0 |
| 2.3.3 | Conflict shows issuer | `policy_source.issuer_hex` in JSON | P0 |
| 2.3.4 | Attribution in all formats | JSON, table, markdown, SARIF | P0 |

**Success Criteria:**
- [ ] Developer can trace any conflict to "who decided this"
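
A test for attribution might simply assert that the three `policy_source` fields are present in each conflict. The Python sketch below assumes a conflict record shape for illustration; only the field names `pack_name`, `pack_version`, and `issuer_hex` come from the matrix above, and the sample values are invented.

```python
import json

# Hypothetical conflict record; everything except the three
# policy_source field names is an assumed shape.
raw = """
{
  "verdict": "BLOCK",
  "policy_source": {
    "pack_name": "acme-security",
    "pack_version": "1.2.0",
    "issuer_hex": "ab12cd34"
  }
}
"""

REQUIRED = ("pack_name", "pack_version", "issuer_hex")

def missing_attribution(conflict: dict) -> list:
    """List the attribution fields absent or empty in a conflict record."""
    source = conflict.get("policy_source") or {}
    return [field for field in REQUIRED if not source.get(field)]

conflict = json.loads(raw)
missing = missing_attribution(conflict)
```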
#### 2.4 Predicate Aliases
| Test ID | Scenario | Expected Outcome | Priority |
|---------|----------|------------------|----------|
| 2.4.1 | `enabled` matches `required` | Same-meaning predicates conflict | P1 |
| 2.4.2 | Pack-defined aliases | Custom alias sets work | P2 |

**Success Criteria:**
- [ ] Semantic predicate matching prevents bypasses
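
One way to picture alias matching is to canonicalize predicates before comparing claims. This is a minimal Python sketch under stated assumptions: claims are modeled as `(subject, predicate, value)` triples and the subject names are hypothetical; the document only confirms that `enabled` and `required` are treated as equivalent.

```python
# Hypothetical alias sets; only the enabled/required pairing is from the plan.
ALIAS_SETS = [{"enabled", "required"}]

def canonical(predicate: str) -> str:
    """Map a predicate to a stable representative of its alias set."""
    for aliases in ALIAS_SETS:
        if predicate in aliases:
            return min(aliases)
    return predicate

def conflicts(claim, policy) -> bool:
    """True when subject and (aliased) predicate match but values differ."""
    same_subject = claim[0] == policy[0]
    same_predicate = canonical(claim[1]) == canonical(policy[1])
    return same_subject and same_predicate and claim[2] != policy[2]

code_claim = ("tls.verification", "enabled", False)
pack_rule = ("tls.verification", "required", True)
is_conflict = conflicts(code_claim, pack_rule)
```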
---
### Category 3: Pre-Commit Integration (The "Full Cycle" Value)
> **Vision claim:** "The pre-commit hook is a bidirectional knowledge sync, not just a read-only linter."
#### 3.1 Fast Scanning
| Test ID | Scenario | Expected Outcome | Priority |
|---------|----------|------------------|----------|
| 3.1.1 | Ephemeral scan (default) | <0.5s for typical project | P0 |
| 3.1.2 | Staged-only scan (`--staged`) | <0.5s, only staged files scanned | P0 |
| 3.1.3 | No storage created in ephemeral mode | No WAL/store directories | P0 |

**Success Criteria:**
- [ ] Pre-commit hook doesn't slow down development workflow
#### 3.2 Observation Recording
| Test ID | Scenario | Expected Outcome | Priority |
|---------|----------|------------------|----------|
| 3.2.1 | `--sync` records observations | Novel claims stored as Tier 4 | P1 |
| 3.2.2 | Observations survive across commits | Persistent local knowledge | P1 |
| 3.2.3 | `--sync` requires `--persist` | Validation error otherwise | P0 |

**Success Criteria:**
- [ ] Project builds local memory over time
#### 3.3 Drift Detection
| Test ID | Scenario | Expected Outcome | Priority |
|---------|----------|------------------|----------|
| 3.3.1 | Value changed from prior observation | DRIFT verdict shown | P1 |
| 3.3.2 | Drift in table/json/markdown output | All formats show drift | P1 |
| 3.3.3 | `--exit-code` returns 1 for drift | CI can catch unintentional changes | P1 |

**Success Criteria:**
- [ ] Accidental configuration changes are caught
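
Conceptually, drift detection compares the current scan against previously recorded observations. The following Python sketch illustrates the idea only; the flat key/value model and the `NEW` and `UNCHANGED` labels are assumptions, with `DRIFT` being the only verdict name taken from this plan.

```python
def classify(prior: dict, current: dict) -> dict:
    """Label each currently observed key as DRIFT, NEW, or UNCHANGED."""
    result = {}
    for key, value in current.items():
        if key not in prior:
            result[key] = "NEW"
        elif prior[key] != value:
            result[key] = "DRIFT"
        else:
            result[key] = "UNCHANGED"
    return result

# Hypothetical observations from an earlier commit and from the current scan.
prior = {"http.timeout_s": 30, "tls.verify": True}
current = {"http.timeout_s": 0, "tls.verify": True, "cors.origins": "*"}
labels = classify(prior, current)
```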
#### 3.4 Exit Codes
| Test ID | Scenario | Expected Outcome | Priority |
|---------|----------|------------------|----------|
| 3.4.1 | No conflicts → exit 0 | Clean scan passes | P0 |
| 3.4.2 | FLAG only → exit 1 | Review recommended | P0 |
| 3.4.3 | BLOCK → exit 2 | Build should fail | P0 |
| 3.4.4 | Without `--exit-code` → always exit 0 | Interactive mode | P0 |

**Success Criteria:**
- [ ] CI/CD integration works correctly
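
Taken together with test 3.3.3, the exit-code matrix implies a small decision table. A hedged Python sketch of that mapping follows; the function name is hypothetical, and treating ACK as non-failing is an assumption rather than confirmed Aphoria behavior.

```python
def exit_code(verdicts, exit_code_flag: bool) -> int:
    """Map scan verdicts to a process exit status.

    BLOCK -> 2, FLAG or DRIFT -> 1, otherwise 0. Without the
    --exit-code flag the scan always exits 0.
    """
    if not exit_code_flag:
        return 0
    if "BLOCK" in verdicts:
        return 2
    if "FLAG" in verdicts or "DRIFT" in verdicts:
        return 1
    return 0  # clean scan, or only ACK verdicts (assumed non-failing)

clean = exit_code([], exit_code_flag=True)
flag_only = exit_code(["FLAG"], exit_code_flag=True)
blocked = exit_code(["FLAG", "BLOCK"], exit_code_flag=True)
interactive = exit_code(["BLOCK"], exit_code_flag=False)
```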
---
### Category 4: LLM Extraction (The "Intelligent" Value)
> **Vision claim:** "Use LLM to extract claims semantically during persistent scans. This fills gaps that regex extractors can't catch."
#### 4.1 LLM Triggering
| Test ID | Scenario | Expected Outcome | Priority |
|---------|----------|------------------|----------|
| 4.1.1 | High-value file (auth/, crypto/) | LLM extraction runs | P1 |
| 4.1.2 | Non-high-value file | LLM extraction skipped | P1 |
| 4.1.3 | File already covered by regex extractors | LLM extraction skipped | P1 |
| 4.1.4 | Token budget exceeded | Graceful stop, no crash | P1 |

**Success Criteria:**
- [ ] LLM only runs when valuable, stays within budget
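
The four triggering tests describe a gate with three conditions. The Python sketch below shows one possible shape of that gate; the function, its parameters, and the directory list are assumptions for illustration (only `auth/` and `crypto/` are named in test 4.1.1).

```python
from pathlib import PurePosixPath

HIGH_VALUE_DIRS = {"auth", "crypto"}  # from test 4.1.1; the real list may differ

def should_run_llm(path: str, covered_by_regex: bool,
                   tokens_used: int, estimated_tokens: int, budget: int) -> bool:
    """Decide whether a file is worth sending to LLM extraction."""
    directories = set(PurePosixPath(path).parts[:-1])
    in_high_value_dir = bool(HIGH_VALUE_DIRS & directories)
    within_budget = tokens_used + estimated_tokens <= budget
    return in_high_value_dir and not covered_by_regex and within_budget

run_auth = should_run_llm("src/auth/login.py", False, 0, 500, 10_000)
skip_ui = should_run_llm("src/ui/button.py", False, 0, 500, 10_000)
skip_covered = should_run_llm("src/auth/login.py", True, 0, 500, 10_000)
skip_budget = should_run_llm("src/crypto/kdf.py", False, 9_800, 500, 10_000)
```

Returning `False` when the budget would be exceeded, rather than raising, matches the "graceful stop" expectation in test 4.1.4.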
#### 4.2 LLM Quality
| Test ID | Scenario | Expected Outcome | Priority |
|---------|----------|------------------|----------|
| 4.2.1 | Evaluation fixtures pass | Baseline quality maintained | P1 |
| 4.2.2 | No regressions from prompt changes | Regression tests pass | P2 |
| 4.2.3 | Response parsing handles edge cases | No crashes on malformed JSON | P1 |

**Success Criteria:**
- [ ] LLM extraction quality is measurable and stable
#### 4.3 Pattern Learning
| Test ID | Scenario | Expected Outcome | Priority |
|---------|----------|------------------|----------|
| 4.3.1 | LLM-extracted claim pattern stored | LocalPatternStore updated | P2 |
| 4.3.2 | Similar pattern merged, not duplicated | Deduplication works | P2 |
| 4.3.3 | Pattern seen in 5+ projects → promotion candidate | Threshold triggers | P2 |

**Success Criteria:**
- [ ] Learning system builds knowledge over time
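
Deduplication and the 5-project threshold can be illustrated with a toy store. This Python sketch is a stand-in and not the real `LocalPatternStore`; in particular, treating patterns as "similar" when they match after case and whitespace folding is an assumption.

```python
from collections import defaultdict

PROMOTION_THRESHOLD = 5  # "5+ projects" from test 4.3.3

class PatternStore:
    """Toy pattern store that tracks which projects exhibit each pattern."""

    def __init__(self):
        self._projects = defaultdict(set)

    @staticmethod
    def _normalize(pattern: str) -> str:
        # Assumed similarity rule: fold case and collapse whitespace.
        return " ".join(pattern.lower().split())

    def record(self, pattern: str, project: str) -> None:
        self._projects[self._normalize(pattern)].add(project)

    def candidates(self) -> list:
        return sorted(p for p, seen in self._projects.items()
                      if len(seen) >= PROMOTION_THRESHOLD)

store = PatternStore()
for i in range(5):
    store.record("verify = False", f"project-{i}")
store.record("VERIFY  =  false", "project-0")  # merged, not duplicated
store.record("debug = true", "project-0")      # seen in one project only
promotable = store.candidates()
```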
---
### Category 5: Declarative Extractors (The "Extensibility" Value)
> **Vision claim:** "Enable users to define new extractors in config/policy files (TOML) without writing Rust code."
#### 5.1 Custom Extractors
| Test ID | Scenario | Expected Outcome | Priority |
|---------|----------|------------------|----------|
| 5.1.1 | TOML-defined extractor runs | Claims extracted using custom regex | P0 |
| 5.1.2 | Invalid regex rejected at load time | Clear error, doesn't block others | P0 |
| 5.1.3 | ReDoS-vulnerable regex rejected | Security protection | P0 |
| 5.1.4 | `value_from_match` captures groups | Dynamic claim values | P1 |

**Success Criteria:**
- [ ] Users can add extractors without recompiling
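
Tests 5.1.2 and 5.1.3 describe load-time validation that rejects bad patterns without blocking the rest. The Python sketch below illustrates that behavior only: Aphoria itself is described as Rust with TOML-defined extractors, and the nested-quantifier check here is a deliberately naive heuristic, not a complete ReDoS analysis. The extractor names and patterns are hypothetical.

```python
import re

# Naive heuristic: a group that contains a quantifier and is itself
# quantified, such as (a+)+ or (\w*)*.
NESTED_QUANTIFIER = re.compile(r"\([^()]*[+*][^()]*\)[+*]")

def load_patterns(patterns: dict) -> tuple:
    """Compile what is safe; collect errors without aborting the batch."""
    compiled, errors = {}, {}
    for name, source in patterns.items():
        if NESTED_QUANTIFIER.search(source):
            errors[name] = "rejected: nested quantifier (possible ReDoS)"
            continue
        try:
            compiled[name] = re.compile(source)
        except re.error as exc:
            errors[name] = f"rejected: invalid regex ({exc})"
    return compiled, errors

compiled, errors = load_patterns({
    "tls_verify": r"verify\s*=\s*False",
    "broken": r"(unclosed",
    "redos": r"(a+)+$",
})
```

Collecting errors per extractor, instead of raising on the first failure, is what lets one invalid definition fail clearly while the others still load.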
#### 5.2 Extractor Promotion
| Test ID | Scenario | Expected Outcome | Priority |
|---------|----------|------------------|----------|
| 5.2.1 | `aphoria extractors candidates` lists promotable patterns | Threshold-meeting patterns shown | P2 |
| 5.2.2 | `aphoria extractors promote` generates YAML | Extractor file created | P2 |
| 5.2.3 | Interactive review workflow | Approve/reject/skip options | P2 |

**Success Criteria:**
- [ ] Learning promotion pipeline is functional
---
### Category 6: Output Formats (The "Integration" Value)
> **Vision claim:** "SARIF for CI integration... structured JSON/SARIF for dashboard integration."
#### 6.1 Format Correctness
| Test ID | Scenario | Expected Outcome | Priority |
|---------|----------|------------------|----------|
| 6.1.1 | JSON output is valid JSON | Parses correctly | P0 |
| 6.1.2 | SARIF output is valid SARIF 2.1.0 | Schema validates | P0 |
| 6.1.3 | Markdown output is valid markdown | Renders correctly | P0 |
| 6.1.4 | Table output is human-readable | Aligned, clear | P0 |

**Success Criteria:**
- [ ] All formats pass validation
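
Before running full schema validation, a test can apply cheap structural checks to SARIF output. The Python sketch below checks only that the text is JSON and that the top-level `version` and `runs` properties required by SARIF 2.1.0 are present; it is not a substitute for validating against the schema, and the sample documents are invented.

```python
import json

def check_sarif_shape(text: str) -> list:
    """Return a list of structural problems; empty means the shape looks right."""
    try:
        doc = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(doc, dict):
        return ["top level must be an object"]
    problems = []
    if doc.get("version") != "2.1.0":
        problems.append("version must be '2.1.0'")
    if not isinstance(doc.get("runs"), list):
        problems.append("'runs' must be an array")
    return problems

# Hypothetical minimal SARIF log.
good = '{"version": "2.1.0", "runs": [{"tool": {"driver": {"name": "aphoria"}}, "results": []}]}'
good_problems = check_sarif_shape(good)
bad_problems = check_sarif_shape('{"version": "2.0.0"}')
```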
#### 6.2 Format Completeness
| Test ID | Scenario | Expected Outcome | Priority |
|---------|----------|------------------|----------|
| 6.2.1 | All formats show file location | File + line for each conflict | P0 |
| 6.2.2 | All formats show conflict score | Score visible | P0 |
| 6.2.3 | All formats show verdict | BLOCK/FLAG/ACK/DRIFT visible | P0 |
| 6.2.4 | All formats show policy source (if applicable) | Attribution visible | P0 |

**Success Criteria:**
- [ ] No information loss between formats
---
### Category 7: Domain-Specific Audits (The "Vertical" Value)
> **Vision claim:** "Aphoria is not limited to web security. It includes specialized corpora for different domains."
#### 7.1 Unreal Engine
| Test ID | Scenario | Expected Outcome | Priority |
|---------|----------|------------------|----------|
| 7.1.1 | `LoadSynchronous()` on game thread detected | BLOCK verdict | P1 |
| 7.1.2 | Hardcoded asset paths detected | FLAG verdict | P2 |
| 7.1.3 | Exposed console commands detected | FLAG verdict | P2 |

**Success Criteria:**
- [ ] Masq UAT passes (7 findings, 100% precision)
#### 7.2 Framework-Specific Security
| Test ID | Scenario | Expected Outcome | Priority |
|---------|----------|------------------|----------|
| 7.2.1 | Spring Security misconfiguration | Conflict detected | P2 |
| 7.2.2 | Django `ALLOWED_HOSTS = ["*"]` | Conflict detected | P2 |
| 7.2.3 | Flask `DEBUG=True` in production | Conflict detected | P2 |

**Success Criteria:**
- [ ] Framework extractors detect common misconfigurations
---
### Category 8: The "Protocol Vision" (Long-Term)
> **Vision claim:** "Aphoria is not just a linter; it is the Reference Implementation (Browser) for this new web of data."
#### 8.1 EAP Readiness (Future)
| Test ID | Scenario | Expected Outcome | Priority |
|---------|----------|------------------|----------|
| 8.1.1 | Consume `.eap.json` manifest | EAP format supported | P3 |
| 8.1.2 | Publish project observations as EAP | Export to EAP format | P3 |
| 8.1.3 | Multi-source aggregation | RFC + OWASP + Vendor + Policy unified | P3 |

**Success Criteria:**
- [ ] Foundation for "DNS for Truth" is laid
---
## UAT Execution Plan
### Phase 1: Core Detection (Week 1)
**Goal:** Prove the core value proposition works across languages.
| Day | Focus | Tests |
|-----|-------|-------|
| 1 | VulnBank benchmark | 1.3.1 |
| 2 | Cross-language TLS/JWT | 1.1.1-1.1.5, 1.2.1-1.2.3 |
| 3 | OWASP patterns | 1.1.6-1.1.10 |
| 4 | False positive analysis | 1.3.3-1.3.4 |
| 5 | Report validation | 6.1.1-6.2.4 |

**Deliverable:** UAT report with precision/recall metrics
### Phase 2: Enterprise Policy (Week 2)
**Goal:** Prove Trust Pack workflow is production-ready.
| Day | Focus | Tests |
|-----|-------|-------|
| 1 | Policy creation | 2.1.1-2.1.4 |
| 2 | Policy distribution | 2.2.1-2.2.5 |
| 3 | Policy attribution | 2.3.1-2.3.4 |
| 4 | Multi-pack scenarios | 2.2.3-2.2.4 |
| 5 | End-to-end workflow | Full enterprise script |

**Deliverable:** UAT report with enterprise workflow validation
### Phase 3: Pre-Commit Integration (Week 3)
**Goal:** Prove Aphoria works seamlessly in development workflow.
| Day | Focus | Tests |
|-----|-------|-------|
| 1 | Performance | 3.1.1-3.1.3 |
| 2 | Exit codes | 3.4.1-3.4.4 |
| 3 | Observation recording | 3.2.1-3.2.3 |
| 4 | Drift detection | 3.3.1-3.3.3 |
| 5 | CI/CD integration | GitHub Actions, pre-commit hook |

**Deliverable:** UAT report with performance benchmarks
### Phase 4: Advanced Features (Week 4)
**Goal:** Prove LLM, learning, and extensibility work.
| Day | Focus | Tests |
|-----|-------|-------|
| 1 | LLM triggering | 4.1.1-4.1.4 |
| 2 | LLM quality | 4.2.1-4.2.3 |
| 3 | Declarative extractors | 5.1.1-5.1.4 |
| 4 | Domain-specific | 7.1.1-7.2.3 |
| 5 | End-to-end user journey | All personas |

**Deliverable:** UAT report with feature completeness matrix
---
## Automated Test Scripts
### All Scripts
| Script | Purpose | Tests | Status |
|--------|---------|-------|--------|
| `test-core-detection.sh` | Category 1: Core detection tests | 10 | PASS (10/10) |
| `test-cross-language.sh` | Category 1.2: Cross-language parity | 3 | PASS (3/3) |
| `test-declarative-extractors.sh` | Category 5: Custom extractor loading | 6 | PASS (6/6) |
| `test-domain-frameworks.sh` | Category 7.2: Framework security | 11 | PASS (11/11) |
| `test-domain-unreal.sh` | Category 7.1: Unreal Engine | 4 | PASS (4/4) |
| `test-drift-detection.sh` | Category 3.2-3.3: Observation/drift | 6 | PASS (6/6) |
| `test-enterprise-workflow.sh` | Category 2: Trust Pack round-trip | 12 | PASS (12/12) |
| `test-eval-harness.sh` | Category 4.2: LLM evaluation harness | 4 | PASS (4/4) |
| `test-exit-codes.sh` | Category 3.4: Exit code validation | 4 | PASS (4/4) |
| `test-llm-extraction.sh` | Category 4.1: LLM quality gates | 5 | PASS (5/5) |
| `test-multi-pack-conflict.sh` | Category 2.2: Multiple pack behavior | 7 | PASS (7/7) |
| `test-output-formats.sh` | Category 6: Format validation | 8 | PASS (8/8) |
| `test-pack-version-update.sh` | Category 2.2.5: Version supersession | 6 | PASS (6/6) |
| `test-precommit-performance.sh` | Category 3.1: Performance benchmarks | 4 | PASS (4/4) |

**Total: 14 scripts, 90 tests**
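
The totals can be cross-checked directly from the table. A small Python check, using only the per-script counts listed above:

```python
# Per-script test counts copied from the "All Scripts" table.
tests_per_script = {
    "test-core-detection.sh": 10,
    "test-cross-language.sh": 3,
    "test-declarative-extractors.sh": 6,
    "test-domain-frameworks.sh": 11,
    "test-domain-unreal.sh": 4,
    "test-drift-detection.sh": 6,
    "test-enterprise-workflow.sh": 12,
    "test-eval-harness.sh": 4,
    "test-exit-codes.sh": 4,
    "test-llm-extraction.sh": 5,
    "test-multi-pack-conflict.sh": 7,
    "test-output-formats.sh": 8,
    "test-pack-version-update.sh": 6,
    "test-precommit-performance.sh": 4,
}

script_count = len(tests_per_script)
test_count = sum(tests_per_script.values())
print(script_count, test_count)  # 14 90
```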
### Summary by Category
| Category | Scripts | Tests | Status |
|----------|---------|-------|--------|
| 1. Core Detection | 2 | 13 | PASS |
| 2. Enterprise Policy | 3 | 25 | PASS |
| 3. Pre-Commit | 3 | 14 | PASS |
| 4. LLM Extraction | 2 | 9 | PASS |
| 5. Declarative Extractors | 1 | 6 | PASS |
| 6. Output Formats | 1 | 8 | PASS |
| 7. Domain-Specific | 2 | 15 | PASS |
---
## Success Criteria Summary
### Minimum Viable UAT (MVP)
| Criterion | Threshold | Measured By |
|-----------|-----------|-------------|
| Core precision | ≥95% | VulnBank + real-world scan |
| Cross-language parity | 100% | Same issue → same verdict |
| Enterprise workflow | 12/12 pass | test-enterprise-workflow.sh |
| Ephemeral scan time | <0.5s | Performance benchmark |
| Exit code correctness | 4/4 pass | test-exit-codes.sh |
| Format validity | 4/4 valid | test-output-formats.sh |
### Full Vision UAT
| Criterion | Threshold | Measured By |
|-----------|-----------|-------------|
| P0 test pass rate | 100% | Test matrix |
| P1 test pass rate | ≥90% | Test matrix |
| User journey complete | All 5 personas | End-to-end walkthrough |
| Drift detection works | DRIFT shown, exit 1 | test-drift-detection.sh |
| LLM extraction quality | Baseline maintained | Eval fixtures |
---
## Appendix: Test Fixtures
### Fixture: VulnBank
- Location: External (clone separately)
- Purpose: Intentionally vulnerable polyglot codebase for precision testing
### Fixture: Citadel/Masq
- Location: Real customer project (NDA)
- Purpose: Real-world precision testing
### Fixture: Clean Codebase
- Location: `uat/fixtures/clean-project/`
- Purpose: False positive rate testing
### Fixture: LLM Evaluation
- Location: `applications/aphoria/tests/fixtures/` (via eval harness)
- Purpose: LLM extraction quality regression

---
## Change Log
| Date | Version | Changes |
|------|---------|---------|
| 2026-02-06 | 1.0 | Initial comprehensive UAT plan |
| 2026-02-06 | 2.0 | All 14 test scripts implemented, 90/90 tests passing |