Complete Aphoria claims system overhaul:
- A1: Rename ExtractedClaim to Observation (extractors produce observations, not claims)
- A2: Add AuthoredClaim with full provenance, invariants, and authority tiers
- A3: Verify engine comparing observations against authored claims, CLI + formatters
- A4: Corpus as first-class assertions with predicate indexing, authority lens, trust packs
- A5: Coverage analysis, explain/docs generation, self-audit extractor, claim suggester skill

Also includes: 42 extractors updated for Observation type, verifiable_predicates trait, conflict detection with comparison modes, claims TOML persistence, Grafana dashboard, backup/restore scripts, and comprehensive test coverage.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Aphoria CLI UAT Report
Date: 2026-02-08
Binary: aphoria 0.1.0 (release build, 13MB)
Target: StemeDB codebase (~573 files, 112K LoC)
Claims file: applications/aphoria/.aphoria/claims.toml (10 claims)
Executive Summary
| Metric | Value |
|---|---|
| Commands tested | 46 |
| Pass (exit 0, correct output) | 43 |
| Partial (works with caveats) | 2 |
| Fail (exit != 0 or wrong output) | 1 |
| Weighted overall score | 88.5 / 100 |
| Verdict | PASS |
Group 1: Smoke Tests (4 commands)
| ID | Command | Exit | Time | Grade | Notes |
|---|---|---|---|---|---|
| 1.1 | `--help` | 0 | <1s | 97 | Lists all 27 subcommands, clean formatting. Missing examples section. |
| 1.2 | `scan` (table) | 0 | 10.9s | 78 | Works correctly. 2 BLOCKs found. Slightly over 10s target. Parallel extraction using all cores. |
| 1.3 | `status` | 0 | <1s | 92 | Shows data dir, project root, baseline, agent key. Clean. |
| 1.4 | `scan --format json` | 0 | ~11s | 90 | Valid JSON with keys: `conflicts`, `deprecated_usages`, `drifts`, `project`, `scan_id`, `summary`. |
Group 1 average: 89.3
Group 2: Scan Variants (7 commands)
| ID | Command | Exit | Grade | Notes |
|---|---|---|---|---|
| 2.1 | `scan --format markdown` | 0 | 95 | Clean markdown with table and detail sections. Ready for CI integration. |
| 2.2 | `scan --format sarif` | 0 | 95 | Valid SARIF 2.1.0 with schema URL, 1 run, 2 results. IDE-ready. |
| 2.3 | `scan --show-claims` | 0 | 90 | Shows all 2288 observations in 4607 lines. Table format with concept/value/file/line/confidence. |
| 2.4 | `scan --benchmark` | 0 | 93 | Shows timing breakdown: discovery 18ms, extraction 11243ms, conflict 1ms. Very useful. |
| 2.5 | `scan --staged` | 0 | 92 | Scans 13 staged files, 99 claims, 0 conflicts. Fast. |
| 2.6 | `scan --strict` | 0 | 60 | Output identical to default scan. No visible difference in thresholds or behavior. Either strict is a no-op or thresholds only matter when scores are marginal. |
| 2.7 | `scan --debug` | 0 | 65 | Adds "Authority: Tier X" line per finding. No conflict resolution traces, scoring breakdown, or query plan. Name implies more depth. |
Group 2 average: 84.3
Group 3: Claims (6 commands)
Note: Claims commands require the working directory to be the one containing `.aphoria/claims.toml`. From the project root, `claims list` shows "No claims found." This is a discoverability issue.
| ID | Command | Exit | Grade | Notes |
|---|---|---|---|---|
| 3.1 | `claims list` | 0 | 88 | Shows 10 claims in table: ID, Category, Tier, Status, Invariant. Clean formatting. |
| 3.2 | `claims list --format json` | 0 | 92 | Valid JSON array of 10 claims. |
| 3.3 | `claims explain` | 0 | 95 | Detailed markdown with concept, predicate, invariant, consequence, provenance, authority, evidence, status, author. Grouped by category. |
| 3.4 | `claims explain --format json` | 0 | 78 | Valid JSON but returns flat array, not structured object with `type` field. Inconsistent with `explain --format json`, which has `type: "onboarding"`. |
| 3.5 | `claims explain --claim <id>` | 0 | 95 | Single claim detail, clean markdown. |
| 3.6 | `claims list --category security` | 0 | 95 | Filtered to 6 security claims. Works correctly. |
Group 3 average: 90.5
Group 4: Verification (5 commands)
| ID | Command | Exit | Grade | Notes |
|---|---|---|---|---|
| 4.1 | `verify run` | 0 | 95 | Shows 1 PASS, 6 CONFLICT, 3 MISSING with observation evidence and consequences. Rich, actionable output. |
| 4.2 | `verify run --format json` | 0 | 92 | Valid JSON: `{results: [...], summary: {pass:1, conflict:6, missing:3, unclaimed:1239}}`. |
| 4.3 | `verify run --show-unclaimed` | 0 | 90 | Appends 1239 unclaimed observations. Long but correct. |
| 4.4 | `verify map` | 0 | 97 | Shows claim→extractor mapping. 7/10 have extractors, 3 have "NO EXTRACTOR". Lists 2 extractors with predicates but no matching claims. Excellent. |
| 4.5 | `verify run --format table` | 0 | 85 | Same as default (table is default). Flag accepted, no error. |
Group 4 average: 91.8
Group 5: Coverage & Docs (9 commands)
| ID | Command | Exit | Grade | Notes |
|---|---|---|---|---|
| 5.1 | `coverage` | 0 | 93 | Per-module table: Claims, Observations, Claimed, Unclaimed, Missing, Density. 33 modules. Summary at bottom. |
| 5.2 | `coverage --format json` | 0 | 95 | Valid JSON: `{modules, project, summary}`. |
| 5.3 | `coverage --format markdown` | 0 | 95 | Clean markdown with summary section and table. |
| 5.4 | `coverage --sort-by density` | 0 | 85 | Sorts but many modules show 0.0% density, so ordering among zeroes is arbitrary. Works for non-zero modules. |
| 5.5 | `coverage --sort-by unclaimed` | 0 | 90 | Correctly sorts by unclaimed count descending. Extractors (355) first. |
| 5.6 | `explain` | 0 | 97 | Onboarding summary: categories table, verification health, coverage snapshot, top uncovered modules. Excellent first-touch UX. |
| 5.7 | `explain --format json` | 0 | 97 | Valid JSON: `type: "onboarding"`, with categories, coverage, verification. |
| 5.8 | `docs generate` | 0 | 90 | Full 224-line reference doc combining claims explain + verification + coverage. Comprehensive. |
| 5.9 | `docs generate --format json` | 0 | 92 | Valid JSON: `type: "full_docs"`, with claims, coverage, verification. |
Group 5 average: 92.7
Group 6: Corpus & Trust Packs (3 commands)
| ID | Command | Exit | Grade | Notes |
|---|---|---|---|---|
| 6.1 | `corpus list` | 0 | 90 | Shows 4 source types: hardcoded (Tier 0), RFC (Tier 0), OWASP (Tier 1), Vendor (Tier 2). Lists specific sources. |
| 6.2 | `corpus build --offline` | 0 | 90 | Builds 30 assertions (19 hardcoded + 11 vendor). Cleanly skips network sources. |
| 6.3 | `trust-pack list` | 0 | 92 | Lists 3 packs: security-hardening, rfc-compliance, owasp-top10. Shows install command. |
Group 6 average: 90.7
Group 7: Learning & Extractors (4 commands)
| ID | Command | Exit | Grade | Notes |
|---|---|---|---|---|
| 7.1 | `extractors stats` | 0 | 88 | Shows zero counts and promotion thresholds. Helpful even when empty. |
| 7.2 | `extractors candidates` | 0 | 88 | "No patterns eligible" with explanation of eligibility criteria. Good empty state. |
| 7.3 | `extractors shadow-status` | 0 | 88 | "No shadow tests" with config hint for enabling. Good guidance. |
| 7.4 | `patterns status` | 0 | 90 | Shows store location, cross-project config, hosted server status. Comprehensive. |
Group 7 average: 88.5
Group 8: Advanced Features (8 commands)
| ID | Command | Exit | Grade | Notes |
|---|---|---|---|---|
| 8.1 | `scope status` | 0 | 88 | Shows hierarchy, inheritance chain, overrides. Config hint provided. |
| 8.2 | `lifecycle list` | 0 | 82 | "No patterns found." Terse. Could show what statuses are available. |
| 8.3 | `governance pending` | 1 | 55 | Exits non-zero with "Governance is not enabled" message. Should exit 0 with an informative message; a non-zero exit suggests an error. |
| 8.4 | `governance pending --format json` | 1 | 45 | Same non-zero exit. No JSON output; prints a plain text error. Format flag silently ignored on error path. |
| 8.5 | `audit summary` | 0 | 88 | Shows request counts and total audit events (638). Works without governance enabled. |
| 8.6 | `audit summary --format json` | 0 | 90 | Valid JSON with 8 fields including `approval_rate` and `avg_approval_days`. |
| 8.7 | `migrations status` | 0 | 82 | "No deprecated patterns found." Correct but minimal. |
| 8.8 | `research status` | 0 | 82 | "Gap store: not initialized" with guidance. |
Group 8 average: 76.5
Scoring Summary
| Group | Weight | Average | Weighted |
|---|---|---|---|
| G1: Smoke Tests | 15% | 89.3 | 13.4 |
| G2: Scan Variants | 15% | 84.3 | 12.6 |
| G3: Claims | 12.5% | 90.5 | 11.3 |
| G4: Verification | 12.5% | 91.8 | 11.5 |
| G5: Coverage & Docs | 20% | 92.7 | 18.5 |
| G6: Corpus & Trust Packs | 8.3% | 90.7 | 7.5 |
| G7: Learning & Extractors | 8.3% | 88.5 | 7.3 |
| G8: Advanced Features | 8.3% | 76.5 | 6.3 |
| Total | 100% | | 88.5 |
Weighted overall: 88.5 / 100 — PASS
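The weighted total can be re-derived from the table as a sanity check. The sketch below is a minimal illustration, not part of the UAT tooling: it copies the weights and group averages from the Scoring Summary and normalizes by the sum of the weights, because the three 8.3% entries are rounded and the listed weights add up to 99.9% rather than 100%. Depending on where rounding is applied, the result lands within a few tenths of the reported 88.5.

```python
# Weights (percent) and group averages copied from the Scoring Summary table.
GROUPS = {
    "G1: Smoke Tests": (15.0, 89.3),
    "G2: Scan Variants": (15.0, 84.3),
    "G3: Claims": (12.5, 90.5),
    "G4: Verification": (12.5, 91.8),
    "G5: Coverage & Docs": (20.0, 92.7),
    "G6: Corpus & Trust Packs": (8.3, 90.7),
    "G7: Learning & Extractors": (8.3, 88.5),
    "G8: Advanced Features": (8.3, 76.5),
}


def weighted_overall(groups):
    """Weighted mean of group averages, normalized by the total weight."""
    total_weight = sum(weight for weight, _ in groups.values())
    weighted_sum = sum(weight * average for weight, average in groups.values())
    return weighted_sum / total_weight


total_weight = sum(weight for weight, _ in GROUPS.values())
score = weighted_overall(GROUPS)
print(f"total weight: {total_weight:.1f}%  weighted overall: {score:.1f}")
```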
Top Issues Found
P1 — Critical (fix before next release)
- Governance exits non-zero when disabled (8.3, 8.4): `governance pending` returns exit code 1 when governance isn't enabled. CI/scripts checking exit codes will treat this as a failure. Should exit 0 with an informative message, or return empty JSON with `--format json`.
- `--format json` ignored on error path (8.4): When `governance pending --format json` fails, it prints plain text instead of JSON. Any format flag should produce structured error output: `{"error": "governance_not_enabled", "message": "..."}`.
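The two P1 items describe one output contract: a disabled feature is not a failure, and a JSON format request yields JSON on every path. The sketch below illustrates that contract in Python. It is not Aphoria's implementation; the function name `render_governance_pending`, the `pending` key, and the "No pending requests." text are hypothetical, and only the error payload shape comes from the report.

```python
import json


def render_governance_pending(enabled, fmt="table", pending=None):
    """Return (exit_code, stdout_text) for a hypothetical `governance pending`.

    Exit code 0 covers both "nothing to show" and "feature disabled";
    non-zero would be reserved for real errors (not modeled here).
    """
    pending = pending or []
    if not enabled:
        message = "Governance is not enabled"
        if fmt == "json":
            # Structured payload in the shape proposed by the report.
            return 0, json.dumps({"error": "governance_not_enabled", "message": message})
        return 0, message
    if fmt == "json":
        # "pending" is an assumed key name, used only for illustration.
        return 0, json.dumps({"pending": pending})
    return 0, "\n".join(pending) or "No pending requests."


exit_code, output = render_governance_pending(enabled=False, fmt="json")
print(exit_code, output)
```

With this shape, a CI script can branch on the `error` field instead of treating the disabled state as a failed step.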
P2 — Important (fix soon)
- Claims commands require specific cwd (3.1): `claims list` from project root shows "No claims found" even though `.aphoria/claims.toml` exists in `applications/aphoria/`. Should search upward or use `--project` flag. This confuses users who run from their repo root.
- `--strict` has no visible effect (2.6): `scan --strict` produces output identical to `scan`. Either the strict thresholds are too similar to defaults, or the flag isn't applied correctly. Users who opt into strict mode expect stricter behavior.
- `--debug` is underwhelming (2.7): Only adds "Authority: Tier X" per finding. No conflict resolution trace, scoring breakdown, or query plan. Rename to `--show-authority` or add actual debug output (concept matching attempts, score calculation, index lookups).
- `claims explain --format json` inconsistent (3.4): Returns flat array while `explain --format json` returns `{type: "onboarding", ...}`. Should wrap in `{type: "claims_explain", claims: [...]}` for consistency.
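The cwd issue has two distinct cases. An upward search helps when the command runs from inside the claims directory's subtree, but from the repo root the file sits below the working directory, so some form of downward discovery (or an explicit project path) is needed as well. The following is a minimal sketch of both lookups under that assumption; the helper names are hypothetical, and only the `.aphoria/claims.toml` location is taken from the report.

```python
from pathlib import Path

# Relative location of the claims file, as given in the report.
CLAIMS_RELATIVE = Path(".aphoria") / "claims.toml"


def find_claims_upward(start):
    """Walk from `start` toward the filesystem root; return the first claims file found."""
    start = Path(start).resolve()
    for directory in (start, *start.parents):
        candidate = directory / CLAIMS_RELATIVE
        if candidate.is_file():
            return candidate
    return None


def discover_claims_below(root):
    """Find every .aphoria/claims.toml under `root` (the repo-root case from 3.1)."""
    root = Path(root).resolve()
    return sorted(
        path for path in root.rglob("claims.toml") if path.parent.name == ".aphoria"
    )
```

A command could try the upward search first and fall back to downward discovery, asking the user to choose when more than one claims file is found.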
P3 — Polish (improve when convenient)
- Scan takes ~11s on 573 files (1.2): Extraction dominates at 11.2s. Discovery and conflict are fast (<20ms). This is acceptable but could be improved with better parallelism or caching.
- Coverage density sorting among zeroes (5.4): Most modules show 0.0% density, making sort-by-density less useful until more claims are authored.
- Empty state messages vary in helpfulness (8.2, 8.7): `lifecycle list` and `migrations status` just say "no X found" without guidance. Compare with `extractors candidates`, which explains how to become eligible.
Recommendations
- Standardize error handling: All commands should exit 0 for "nothing to show" and reserve non-zero for actual errors. `--format json` must always produce JSON, even for errors.
- Add `--project` flag: Allow `aphoria claims list --project ./applications/aphoria` or auto-discover `.aphoria/` directories.
- Improve debug output: Add `--trace` for detailed resolution traces (concept matching, score calculation, tier comparison). Keep `--debug` for general verbosity.
- Document `--strict` behavior: If it works, show what threshold changed and what would pass under default but fails under strict.
- Consistent JSON envelopes: All `--format json` outputs should use `{type: "...", ...}` pattern. The `explain` and `docs generate` commands do this well; extend to `claims explain`.
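The envelope recommendation can be expressed as a small normalization step applied before serialization. The sketch below is illustrative only: `ensure_envelope` and its `list_key` parameter are hypothetical names, while the `type` field and the `claims_explain` value follow the shapes quoted in this report.

```python
def ensure_envelope(kind, payload, list_key="items"):
    """Return `payload` wrapped in a {"type": ..., ...} envelope.

    - dicts that already carry a "type" field pass through unchanged
    - other dicts gain a "type" field
    - bare lists are placed under `list_key`
    """
    if isinstance(payload, dict):
        if "type" in payload:
            return payload
        return {"type": kind, **payload}
    if isinstance(payload, list):
        return {"type": kind, list_key: payload}
    raise TypeError(f"cannot wrap payload of type {type(payload).__name__}")


# The flat array from `claims explain --format json` becomes a typed object.
wrapped = ensure_envelope("claims_explain", [{"id": "example-claim"}], list_key="claims")
print(wrapped)
```

Consumers can then dispatch on `type` for every command instead of special-casing outputs that return a bare array.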
Commands Skipped (46 tested / ~85 total)
State-modifying commands intentionally excluded: init, baseline, diff, bless, update, ack, claims create/update/supersede/deprecate, extractors review/promote/auto-promote/feedback/graduate/rollback, governance approve/reject/escalate/create, lifecycle deprecate/archive/reactivate, scope override/remove, patterns sync/pull-community, policy export/import/resign, corpus export-pack, trust-pack install, eval run/baseline/update-baseline, research run/gaps.
Report generated by Aphoria CLI UAT, 2026-02-08