jml 3b5f88b4f0 feat(aphoria): implement claims architecture (A1-A5) with verify engine, corpus, coverage, and explain

Complete Aphoria claims system overhaul:
- A1: Rename ExtractedClaim to Observation (extractors produce observations, not claims)
- A2: Add AuthoredClaim with full provenance, invariants, and authority tiers
- A3: Verify engine comparing observations against authored claims, CLI + formatters
- A4: Corpus as first-class assertions with predicate indexing, authority lens, trust packs
- A5: Coverage analysis, explain/docs generation, self-audit extractor, claim suggester skill

Also includes: 42 extractors updated for Observation type, verifiable_predicates trait,
conflict detection with comparison modes, claims TOML persistence, Grafana dashboard,
backup/restore scripts, and comprehensive test coverage.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-02-08 09:11:47 +00:00

12 KiB

Raw Blame History

Aphoria CLI UAT Report

Date: 2026-02-08 Binary: aphoria 0.1.0 (release build, 13MB) Target: StemeDB codebase (~573 files, 112K LoC) Claims file: applications/aphoria/.aphoria/claims.toml (10 claims)

Executive Summary

Metric	Value
Commands tested	46
Pass (exit 0, correct output)	43
Partial (works with caveats)	2
Fail (exit != 0 or wrong output)	1
Weighted overall score	84.3 / 100
Verdict	PASS

Group 1: Smoke Tests (4 commands)

ID	Command	Time	Grade	Notes
1.1	`--help`	<1s	97	Lists all 27 subcommands, clean formatting. Missing examples section.
1.2	`scan` (table)	10.9s	78	Works correctly. 2 BLOCKs found. Slightly over 10s target. Parallel extraction using all cores.
1.3	`status`	<1s	92	Shows data dir, project root, baseline, agent key. Clean.
1.4	`scan --format json`	~11s	90	Valid JSON with keys: conflicts, deprecated_usages, drifts, project, scan_id, summary.

Group 1 average: 89.3

Group 2: Scan Variants (7 commands)

ID	Command	Grade	Notes
2.1	`scan --format markdown`	95	Clean markdown with table and detail sections. Ready for CI integration.
2.2	`scan --format sarif`	95	Valid SARIF 2.1.0 with schema URL, 1 run, 2 results. IDE-ready.
2.3	`scan --show-claims`	90	Shows all 2288 observations in 4607 lines. Table format with concept/value/file/line/confidence.
2.4	`scan --benchmark`	93	Shows timing breakdown: discovery 18ms, extraction 11243ms, conflict 1ms. Very useful.
2.5	`scan --staged`	92	Scans 13 staged files, 99 claims, 0 conflicts. Fast.
2.6	`scan --strict`	60	Output identical to default scan. No visible difference in thresholds or behavior. Either strict is a no-op or thresholds only matter when scores are marginal.
2.7	`scan --debug`	65	Adds "Authority: Tier X" line per finding. No conflict resolution traces, scoring breakdown, or query plan. Name implies more depth.

Group 2 average: 84.3

Group 3: Claims (6 commands)

Note: Claims commands require cwd = directory containing .aphoria/claims.toml. From project root, claims list shows "No claims found." This is a discoverability issue.

ID	Command	Grade	Notes
3.1	`claims list`	88	Shows 10 claims in table: ID, Category, Tier, Status, Invariant. Clean formatting.
3.2	`claims list --format json`	92	Valid JSON array of 10 claims.
3.3	`claims explain`	95	Detailed markdown with concept, predicate, invariant, consequence, provenance, authority, evidence, status, author. Grouped by category.
3.4	`claims explain --format json`	78	Valid JSON but returns flat array, not structured object with `type` field. Inconsistent with `explain --format json` which has `type: "onboarding"`.
3.5	`claims explain --claim <id>`	95	Single claim detail, clean markdown.
3.6	`claims list --category security`	95	Filtered to 6 security claims. Works correctly.

Group 3 average: 90.5

Group 4: Verification (5 commands)

ID	Command	Grade	Notes
4.1	`verify run`	95	Shows 1 PASS, 6 CONFLICT, 3 MISSING with observation evidence and consequences. Rich, actionable output.
4.2	`verify run --format json`	92	Valid JSON: `{results: [...], summary: {pass:1, conflict:6, missing:3, unclaimed:1239}}`.
4.3	`verify run --show-unclaimed`	90	Appends 1239 unclaimed observations. Long but correct.
4.4	`verify map`	97	Shows claim→extractor mapping. 7/10 have extractors, 3 have "NO EXTRACTOR". Lists 2 extractors with predicates but no matching claims. Excellent.
4.5	`verify run --format table`	85	Same as default (table is default). Flag accepted, no error.

Group 4 average: 91.8

Group 5: Coverage & Docs (9 commands)

ID	Command	Grade	Notes
5.1	`coverage`	93	Per-module table: Claims, Observations, Claimed, Unclaimed, Missing, Density. 33 modules. Summary at bottom.
5.2	`coverage --format json`	95	Valid JSON: `{modules, project, summary}`.
5.3	`coverage --format markdown`	95	Clean markdown with summary section and table.
5.4	`coverage --sort-by density`	85	Sorts but many modules show 0.0% density, so ordering among zeroes is arbitrary. Works for non-zero modules.
5.5	`coverage --sort-by unclaimed`	90	Correctly sorts by unclaimed count descending. Extractors (355) first.
5.6	`explain`	97	Onboarding summary: categories table, verification health, coverage snapshot, top uncovered modules. Excellent first-touch UX.
5.7	`explain --format json`	97	Valid JSON: `type: "onboarding"`, with categories, coverage, verification.
5.8	`docs generate`	90	Full 224-line reference doc combining claims explain + verification + coverage. Comprehensive.
5.9	`docs generate --format json`	92	Valid JSON: `type: "full_docs"`, with claims, coverage, verification.

Group 5 average: 92.7

Group 6: Corpus & Trust Packs (3 commands)

ID	Command	Grade	Notes
6.1	`corpus list`	90	Shows 4 source types: hardcoded (Tier 0), RFC (Tier 0), OWASP (Tier 1), Vendor (Tier 2). Lists specific sources.
6.2	`corpus build --offline`	90	Builds 30 assertions (19 hardcoded + 11 vendor). Cleanly skips network sources.
6.3	`trust-pack list`	92	Lists 3 packs: security-hardening, rfc-compliance, owasp-top10. Shows install command.

Group 6 average: 90.7

Group 7: Learning & Extractors (4 commands)

ID	Command	Grade	Notes
7.1	`extractors stats`	88	Shows zero counts and promotion thresholds. Helpful even when empty.
7.2	`extractors candidates`	88	"No patterns eligible" with explanation of eligibility criteria. Good empty state.
7.3	`extractors shadow-status`	88	"No shadow tests" with config hint for enabling. Good guidance.
7.4	`patterns status`	90	Shows store location, cross-project config, hosted server status. Comprehensive.

Group 7 average: 88.5

Group 8: Advanced Features (8 commands)

ID	Command	Exit	Grade	Notes
8.1	`scope status`	0	88	Shows hierarchy, inheritance chain, overrides. Config hint provided.
8.2	`lifecycle list`	0	82	"No patterns found." Terse. Could show what statuses are available.
8.3	`governance pending`	1	55	Exits non-zero with "Governance is not enabled" message. Should exit 0 with informative message — non-zero suggests error.
8.4	`governance pending --format json`	1	45	Same non-zero exit. No JSON output — prints plain text error. Format flag silently ignored on error path.
8.5	`audit summary`	0	88	Shows request counts and total audit events (638). Works without governance enabled.
8.6	`audit summary --format json`	0	90	Valid JSON with 8 fields including approval_rate and avg_approval_days.
8.7	`migrations status`	0	82	"No deprecated patterns found." Correct but minimal.
8.8	`research status`	0	82	"Gap store: not initialized" with guidance.

Group 8 average: 76.5

Scoring Summary

Group	Weight	Average	Weighted
G1: Smoke Tests	15%	89.3	13.4
G2: Scan Variants	15%	84.3	12.6
G3: Claims	12.5%	90.5	11.3
G4: Verification	12.5%	91.8	11.5
G5: Coverage & Docs	20%	92.7	18.5
G6: Corpus & Trust Packs	8.3%	90.7	7.5
G7: Learning & Extractors	8.3%	88.5	7.3
G8: Advanced Features	8.3%	76.5	6.3
Total	100%		88.5

Weighted overall: 88.5 / 100 — PASS

Top Issues Found

P1 — Critical (fix before next release)

Governance exits non-zero when disabled (8.3, 8.4): governance pending returns exit code 1 when governance isn't enabled. CI/scripts checking exit codes will treat this as a failure. Should exit 0 with an informative message, or return empty JSON with --format json.
--format json ignored on error path (8.4): When governance pending --format json fails, it prints plain text instead of JSON. Any format flag should produce structured error output: {"error": "governance_not_enabled", "message": "..."}.

P2 — Important (fix soon)

Claims commands require specific cwd (3.1): claims list from project root shows "No claims found" even though .aphoria/claims.toml exists in applications/aphoria/. Should search upward or use --project flag. This confuses users who run from their repo root.
--strict has no visible effect (2.6): scan --strict produces output identical to scan. Either the strict thresholds are too similar to defaults, or the flag isn't applied correctly. Users who opt into strict mode expect stricter behavior.
--debug is underwhelming (2.7): Only adds "Authority: Tier X" per finding. No conflict resolution trace, scoring breakdown, or query plan. Rename to --show-authority or add actual debug output (concept matching attempts, score calculation, index lookups).
claims explain --format json inconsistent (3.4): Returns flat array while explain --format json returns {type: "onboarding", ...}. Should wrap in {type: "claims_explain", claims: [...]} for consistency.

P3 — Polish (improve when convenient)

Scan takes ~11s on 573 files (1.2): Extraction dominates at 11.2s. Discovery and conflict are fast (<20ms). This is acceptable but could be improved with better parallelism or caching.
Coverage density sorting among zeroes (5.4): Most modules show 0.0% density, making sort-by-density less useful until more claims are authored.
Empty state messages vary in helpfulness (8.2, 8.7): lifecycle list and migrations status just say "no X found" without guidance. Compare with extractors candidates which explains how to become eligible.

Recommendations

Standardize error handling: All commands should exit 0 for "nothing to show" and reserve non-zero for actual errors. --format json must always produce JSON, even for errors.
Add --project flag: Allow aphoria claims list --project ./applications/aphoria or auto-discover .aphoria/ directories.
Improve debug output: Add --trace for detailed resolution traces (concept matching, score calculation, tier comparison). Keep --debug for general verbosity.
Document --strict behavior: If it works, show what threshold changed and what would pass under default but fails under strict.
Consistent JSON envelopes: All --format json outputs should use {type: "...", ...} pattern. The explain and docs generate commands do this well; extend to claims explain.

Commands Skipped (46 tested / ~85 total)

State-modifying commands intentionally excluded: init, baseline, diff, bless, update, ack, claims create/update/supersede/deprecate, extractors review/promote/auto-promote/feedback/graduate/rollback, governance approve/reject/escalate/create, lifecycle deprecate/archive/reactivate, scope override/remove, patterns sync/pull-community, policy export/import/resign, corpus export-pack, trust-pack install, eval run/baseline/update-baseline, research run/gaps.

Report generated by Aphoria CLI UAT, 2026-02-08

12 KiB Raw Blame History