Aphoria CLI UAT Report

Date: 2026-02-08
Binary: aphoria 0.1.0 (release build, 13MB)
Target: StemeDB codebase (~573 files, 112K LoC)
Claims file: applications/aphoria/.aphoria/claims.toml (10 claims)


Executive Summary

| Metric | Value |
|---|---|
| Commands tested | 46 |
| Pass (exit 0, correct output) | 43 |
| Partial (works with caveats) | 2 |
| Fail (exit != 0 or wrong output) | 1 |
| Weighted overall score | 88.5 / 100 |
| Verdict | PASS |

Group 1: Smoke Tests (4 commands)

| ID | Command | Exit | Time | Grade | Notes |
|---|---|---|---|---|---|
| 1.1 | `--help` | 0 | <1s | 97 | Lists all 27 subcommands, clean formatting. Missing examples section. |
| 1.2 | `scan` (table) | 0 | 10.9s | 78 | Works correctly. 2 BLOCKs found. Slightly over 10s target. Parallel extraction using all cores. |
| 1.3 | `status` | 0 | <1s | 92 | Shows data dir, project root, baseline, agent key. Clean. |
| 1.4 | `scan --format json` | 0 | ~11s | 90 | Valid JSON with keys: conflicts, deprecated_usages, drifts, project, scan_id, summary. |

Group 1 average: 89.3


Group 2: Scan Variants (7 commands)

| ID | Command | Exit | Grade | Notes |
|---|---|---|---|---|
| 2.1 | `scan --format markdown` | 0 | 95 | Clean markdown with table and detail sections. Ready for CI integration. |
| 2.2 | `scan --format sarif` | 0 | 95 | Valid SARIF 2.1.0 with schema URL, 1 run, 2 results. IDE-ready. |
| 2.3 | `scan --show-claims` | 0 | 90 | Shows all 2288 observations in 4607 lines. Table format with concept/value/file/line/confidence. |
| 2.4 | `scan --benchmark` | 0 | 93 | Shows timing breakdown: discovery 18ms, extraction 11243ms, conflict 1ms. Very useful. |
| 2.5 | `scan --staged` | 0 | 92 | Scans 13 staged files, 99 claims, 0 conflicts. Fast. |
| 2.6 | `scan --strict` | 0 | 60 | Output identical to default scan. No visible difference in thresholds or behavior. Either strict is a no-op or thresholds only matter when scores are marginal. |
| 2.7 | `scan --debug` | 0 | 65 | Adds "Authority: Tier X" line per finding. No conflict resolution traces, scoring breakdown, or query plan. Name implies more depth. |

Group 2 average: 84.3


Group 3: Claims (6 commands)

Note: Claims commands require cwd = directory containing .aphoria/claims.toml. From project root, claims list shows "No claims found." This is a discoverability issue.

| ID | Command | Exit | Grade | Notes |
|---|---|---|---|---|
| 3.1 | `claims list` | 0 | 88 | Shows 10 claims in table: ID, Category, Tier, Status, Invariant. Clean formatting. |
| 3.2 | `claims list --format json` | 0 | 92 | Valid JSON array of 10 claims. |
| 3.3 | `claims explain` | 0 | 95 | Detailed markdown with concept, predicate, invariant, consequence, provenance, authority, evidence, status, author. Grouped by category. |
| 3.4 | `claims explain --format json` | 0 | 78 | Valid JSON but returns a flat array, not a structured object with a type field. Inconsistent with `explain --format json`, which has type: "onboarding". |
| 3.5 | `claims explain --claim <id>` | 0 | 95 | Single claim detail, clean markdown. |
| 3.6 | `claims list --category security` | 0 | 95 | Filtered to 6 security claims. Works correctly. |

Group 3 average: 90.5


Group 4: Verification (5 commands)

| ID | Command | Exit | Grade | Notes |
|---|---|---|---|---|
| 4.1 | `verify run` | 0 | 95 | Shows 1 PASS, 6 CONFLICT, 3 MISSING with observation evidence and consequences. Rich, actionable output. |
| 4.2 | `verify run --format json` | 0 | 92 | Valid JSON: {results: [...], summary: {pass:1, conflict:6, missing:3, unclaimed:1239}}. |
| 4.3 | `verify run --show-unclaimed` | 0 | 90 | Appends 1239 unclaimed observations. Long but correct. |
| 4.4 | `verify map` | 0 | 97 | Shows claim→extractor mapping. 7/10 have extractors, 3 have "NO EXTRACTOR". Lists 2 extractors with predicates but no matching claims. Excellent. |
| 4.5 | `verify run --format table` | 0 | 85 | Same as default (table is default). Flag accepted, no error. |

Group 4 average: 91.8
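
The summary object reported for 4.2 lends itself to a CI gate. The following is a minimal sketch, not part of Aphoria: the `gate` function and its `max_conflicts` parameter are hypothetical, and the sample payload only assumes the key names shown in the 4.2 notes (with `results` left empty for brevity).

```python
import json


def gate(verify_json: str, max_conflicts: int = 0) -> int:
    """Return an exit code for a CI step: 0 to pass, 1 to fail.

    Hypothetical helper; it only reads summary.conflict from the payload.
    """
    summary = json.loads(verify_json)["summary"]
    return 0 if summary["conflict"] <= max_conflicts else 1


# Sample payload mirroring the counts reported in 4.2 (results elided).
sample = (
    '{"results": [], '
    '"summary": {"pass": 1, "conflict": 6, "missing": 3, "unclaimed": 1239}}'
)

print(gate(sample))                   # 1: six conflicts exceed the default limit
print(gate(sample, max_conflicts=6))  # 0: within the allowed limit
```

In a pipeline, the captured stdout of `verify run --format json` would take the place of `sample`.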


Group 5: Coverage & Docs (9 commands)

| ID | Command | Exit | Grade | Notes |
|---|---|---|---|---|
| 5.1 | `coverage` | 0 | 93 | Per-module table: Claims, Observations, Claimed, Unclaimed, Missing, Density. 33 modules. Summary at bottom. |
| 5.2 | `coverage --format json` | 0 | 95 | Valid JSON: {modules, project, summary}. |
| 5.3 | `coverage --format markdown` | 0 | 95 | Clean markdown with summary section and table. |
| 5.4 | `coverage --sort-by density` | 0 | 85 | Sorts but many modules show 0.0% density, so ordering among zeroes is arbitrary. Works for non-zero modules. |
| 5.5 | `coverage --sort-by unclaimed` | 0 | 90 | Correctly sorts by unclaimed count descending. Extractors (355) first. |
| 5.6 | `explain` | 0 | 97 | Onboarding summary: categories table, verification health, coverage snapshot, top uncovered modules. Excellent first-touch UX. |
| 5.7 | `explain --format json` | 0 | 97 | Valid JSON: type: "onboarding", with categories, coverage, verification. |
| 5.8 | `docs generate` | 0 | 90 | Full 224-line reference doc combining claims explain + verification + coverage. Comprehensive. |
| 5.9 | `docs generate --format json` | 0 | 92 | Valid JSON: type: "full_docs", with claims, coverage, verification. |

Group 5 average: 92.7


Group 6: Corpus & Trust Packs (3 commands)

| ID | Command | Exit | Grade | Notes |
|---|---|---|---|---|
| 6.1 | `corpus list` | 0 | 90 | Shows 4 source types: hardcoded (Tier 0), RFC (Tier 0), OWASP (Tier 1), Vendor (Tier 2). Lists specific sources. |
| 6.2 | `corpus build --offline` | 0 | 90 | Builds 30 assertions (19 hardcoded + 11 vendor). Cleanly skips network sources. |
| 6.3 | `trust-pack list` | 0 | 92 | Lists 3 packs: security-hardening, rfc-compliance, owasp-top10. Shows install command. |

Group 6 average: 90.7


Group 7: Learning & Extractors (4 commands)

| ID | Command | Exit | Grade | Notes |
|---|---|---|---|---|
| 7.1 | `extractors stats` | 0 | 88 | Shows zero counts and promotion thresholds. Helpful even when empty. |
| 7.2 | `extractors candidates` | 0 | 88 | "No patterns eligible" with explanation of eligibility criteria. Good empty state. |
| 7.3 | `extractors shadow-status` | 0 | 88 | "No shadow tests" with config hint for enabling. Good guidance. |
| 7.4 | `patterns status` | 0 | 90 | Shows store location, cross-project config, hosted server status. Comprehensive. |

Group 7 average: 88.5


Group 8: Advanced Features (8 commands)

| ID | Command | Exit | Grade | Notes |
|---|---|---|---|---|
| 8.1 | `scope status` | 0 | 88 | Shows hierarchy, inheritance chain, overrides. Config hint provided. |
| 8.2 | `lifecycle list` | 0 | 82 | "No patterns found." Terse. Could show what statuses are available. |
| 8.3 | `governance pending` | 1 | 55 | Exits non-zero with "Governance is not enabled" message. Should exit 0 with an informative message; a non-zero exit suggests an error. |
| 8.4 | `governance pending --format json` | 1 | 45 | Same non-zero exit. No JSON output; prints a plain-text error. Format flag silently ignored on error path. |
| 8.5 | `audit summary` | 0 | 88 | Shows request counts and total audit events (638). Works without governance enabled. |
| 8.6 | `audit summary --format json` | 0 | 90 | Valid JSON with 8 fields including approval_rate and avg_approval_days. |
| 8.7 | `migrations status` | 0 | 82 | "No deprecated patterns found." Correct but minimal. |
| 8.8 | `research status` | 0 | 82 | "Gap store: not initialized" with guidance. |

Group 8 average: 76.5


Scoring Summary

| Group | Weight | Average | Weighted |
|---|---|---|---|
| G1: Smoke Tests | 15% | 89.3 | 13.4 |
| G2: Scan Variants | 15% | 84.3 | 12.6 |
| G3: Claims | 12.5% | 90.5 | 11.3 |
| G4: Verification | 12.5% | 91.8 | 11.5 |
| G5: Coverage & Docs | 20% | 92.7 | 18.5 |
| G6: Corpus & Trust Packs | 8.3% | 90.7 | 7.5 |
| G7: Learning & Extractors | 8.3% | 88.5 | 7.3 |
| G8: Advanced Features | 8.3% | 76.5 | 6.3 |
| Total | 100% | | 88.5 |

Weighted overall: 88.5 / 100 — PASS
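
For readers who want to reproduce the arithmetic, here is a small sketch that recomputes the weighted total from the weights and group averages listed in the Scoring Summary. It assumes the total is a plain weighted sum of the rounded group averages; the report does not state the exact rounding procedure, so treat the result as a cross-check only.

```python
# (weight in percent, group average), copied from the Scoring Summary table.
GROUPS = {
    "G1": (15.0, 89.3),
    "G2": (15.0, 84.3),
    "G3": (12.5, 90.5),
    "G4": (12.5, 91.8),
    "G5": (20.0, 92.7),
    "G6": (8.3, 90.7),
    "G7": (8.3, 88.5),
    "G8": (8.3, 76.5),
}

# The listed weights are rounded, so they need not add up to exactly 100.
weight_sum = sum(weight for weight, _ in GROUPS.values())

# Plain weighted sum: each average scaled by its percentage weight.
weighted_total = sum(weight * avg for weight, avg in GROUPS.values()) / 100

print(f"weights sum to {weight_sum:.1f}%")
print(f"weighted total: {weighted_total:.1f}")
```

Under these assumptions the recomputed total comes out at about 88.6, within a tenth of the reported 88.5. The three 8.3% weights sum to 24.9% rather than 25%, so a small rounding gap of this size is expected.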


Top Issues Found

P1 — Critical (fix before next release)

  1. Governance exits non-zero when disabled (8.3, 8.4): governance pending returns exit code 1 when governance isn't enabled. CI/scripts checking exit codes will treat this as a failure. Should exit 0 with an informative message, or return empty JSON with --format json.

  2. --format json ignored on error path (8.4): When governance pending --format json fails, it prints plain text instead of JSON. Any format flag should produce structured error output: {"error": "governance_not_enabled", "message": "..."}.
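
The two P1 items can be illustrated together. This is a sketch of the desired behavior under stated assumptions, not Aphoria's actual code: the `governance_pending` function, its parameters, and the `{"pending": [...]}` success shape are hypothetical, while the error payload follows the shape proposed in item 2.

```python
import json
from typing import Tuple


def governance_pending(enabled: bool, fmt: str = "table") -> Tuple[int, str]:
    """Hypothetical handler returning (exit_code, stdout_text)."""
    if not enabled:
        message = "Governance is not enabled"
        if fmt == "json":
            # Structured error so JSON consumers never see plain text.
            body = json.dumps(
                {"error": "governance_not_enabled", "message": message}
            )
        else:
            body = message
        # A disabled feature is a state, not a failure: exit 0.
        return 0, body

    pending = []  # placeholder: a real handler would load pending requests here
    if fmt == "json":
        return 0, json.dumps({"pending": pending})
    return 0, "No pending requests."


code, out = governance_pending(enabled=False, fmt="json")
print(code, out)
```

The key design choice is that the format flag is honored on every path, so a script can always parse stdout as JSON when it asked for JSON.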

P2 — Important (fix soon)

  1. Claims commands require specific cwd (3.1): claims list from project root shows "No claims found" even though .aphoria/claims.toml exists in applications/aphoria/. Should search upward or use --project flag. This confuses users who run from their repo root.

  2. --strict has no visible effect (2.6): scan --strict produces output identical to scan. Either the strict thresholds are too similar to defaults, or the flag isn't applied correctly. Users who opt into strict mode expect stricter behavior.

  3. --debug is underwhelming (2.7): Only adds "Authority: Tier X" per finding. No conflict resolution trace, scoring breakdown, or query plan. Rename to --show-authority or add actual debug output (concept matching attempts, score calculation, index lookups).

  4. claims explain --format json inconsistent (3.4): Returns flat array while explain --format json returns {type: "onboarding", ...}. Should wrap in {type: "claims_explain", claims: [...]} for consistency.
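
For item 1, two lookup strategies are worth distinguishing. The sketch below is illustrative only; both function names are hypothetical and nothing here reflects how Aphoria currently resolves its claims file. Upward search helps when the command runs from a directory nested inside the claims project, while running from the repo root (the case observed in 3.1) needs either an explicit project path or a downward scan.

```python
from pathlib import Path
from typing import List, Optional

CLAIMS_RELATIVE = Path(".aphoria") / "claims.toml"


def find_claims_upward(start: Path) -> Optional[Path]:
    """Walk from `start` toward the filesystem root; return the first match."""
    start = start.resolve()
    for directory in (start, *start.parents):
        candidate = directory / CLAIMS_RELATIVE
        if candidate.is_file():
            return candidate
    return None


def discover_claims_below(root: Path) -> List[Path]:
    """Scan downward from `root` for every .aphoria/claims.toml."""
    return sorted(root.rglob(".aphoria/claims.toml"))
```

From the repo root, `discover_claims_below(Path("."))` would be expected to surface `applications/aphoria/.aphoria/claims.toml`, whereas `find_claims_upward` would only find it when started from inside `applications/aphoria/`.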

P3 — Polish (improve when convenient)

  1. Scan takes ~11s on 573 files (1.2): Extraction dominates at 11.2s. Discovery and conflict are fast (<20ms). This is acceptable but could be improved with better parallelism or caching.

  2. Coverage density sorting among zeroes (5.4): Most modules show 0.0% density, making sort-by-density less useful until more claims are authored.

  3. Empty state messages vary in helpfulness (8.2, 8.7): lifecycle list and migrations status just say "no X found" without guidance. Compare with extractors candidates which explains how to become eligible.


Recommendations

  1. Standardize error handling: All commands should exit 0 for "nothing to show" and reserve non-zero for actual errors. --format json must always produce JSON, even for errors.

  2. Add --project flag: Allow aphoria claims list --project ./applications/aphoria or auto-discover .aphoria/ directories.

  3. Improve debug output: Add --trace for detailed resolution traces (concept matching, score calculation, tier comparison). Keep --debug for general verbosity.

  4. Document --strict behavior: If it works, show what threshold changed and what would pass under default but fails under strict.

  5. Consistent JSON envelopes: All --format json outputs should use {type: "...", ...} pattern. The explain and docs generate commands do this well; extend to claims explain.
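
Recommendation 5 can be sketched as a single helper that every JSON formatter routes through. The `envelope` function is hypothetical, and the `"claims_explain"` type name is the one suggested in P2 item 4 rather than an existing output; the claim record in the usage line is a made-up placeholder.

```python
import json
from typing import Any


def envelope(kind: str, **payload: Any) -> str:
    """Serialize a payload as {"type": kind, ...payload}."""
    if "type" in payload:
        # Keep "type" reserved so the envelope field is never overwritten.
        raise ValueError("'type' is reserved for the envelope")
    return json.dumps({"type": kind, **payload})


# Wrapping what is currently a flat array in a typed object.
print(envelope("claims_explain", claims=[{"id": "example-claim"}]))
```

With this shape, consumers can dispatch on `type` before inspecting the rest of the object, the same way they already can for `explain` and `docs generate`.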


Commands Skipped (46 tested / ~85 total)

State-modifying commands intentionally excluded: init, baseline, diff, bless, update, ack, claims create/update/supersede/deprecate, extractors review/promote/auto-promote/feedback/graduate/rollback, governance approve/reject/escalate/create, lifecycle deprecate/archive/reactivate, scope override/remove, patterns sync/pull-community, policy export/import/resign, corpus export-pack, trust-pack install, eval run/baseline/update-baseline, research run/gaps.


Report generated by Aphoria CLI UAT, 2026-02-08