# Aphoria CLI UAT Report

**Date:** 2026-02-08
**Binary:** `aphoria 0.1.0` (release build, 13 MB)
**Target:** StemeDB codebase (~573 files, 112K LoC)
**Claims file:** `applications/aphoria/.aphoria/claims.toml` (10 claims)

---

## Executive Summary

| Metric | Value |
|--------|-------|
| Commands tested | 46 |
| Pass (exit 0, correct output) | 43 |
| Partial (works with caveats) | 2 |
| Fail (exit != 0 or wrong output) | 1 |
| **Weighted overall score** | **88.5 / 100** |
| **Verdict** | **PASS** |

---

## Group 1: Smoke Tests (4 commands)

| ID | Command | Exit | Time | Grade | Notes |
|----|---------|------|------|-------|-------|
| 1.1 | `--help` | 0 | <1s | **97** | Lists all 27 subcommands, clean formatting. Missing examples section. |
| 1.2 | `scan` (table) | 0 | 10.9s | **78** | Works correctly. 2 BLOCKs found. Slightly over 10s target. Parallel extraction using all cores. |
| 1.3 | `status` | 0 | <1s | **92** | Shows data dir, project root, baseline, agent key. Clean. |
| 1.4 | `scan --format json` | 0 | ~11s | **90** | Valid JSON with keys: conflicts, deprecated_usages, drifts, project, scan_id, summary. |

**Group 1 average: 89.3**

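
The report does not say how the Exit and Time columns were collected. A minimal sketch of one way to record them, assuming a small Python harness: the `run_case` helper and `Result` type are hypothetical, and the sample command is a stand-in for a real `aphoria` invocation so the sketch runs anywhere.

```python
import subprocess
import sys
import time
from dataclasses import dataclass


@dataclass
class Result:
    command: list[str]
    exit_code: int
    seconds: float
    stdout: str


def run_case(command: list[str], timeout: float = 60.0) -> Result:
    """Run one CLI invocation and capture exit code, wall time, and stdout."""
    start = time.perf_counter()
    proc = subprocess.run(command, capture_output=True, text=True, timeout=timeout)
    elapsed = time.perf_counter() - start
    return Result(command, proc.returncode, elapsed, proc.stdout)


# Stand-in command (assumption); a real run would pass e.g. ["aphoria", "status"].
result = run_case([sys.executable, "-c", "print('ok')"])
print(result.exit_code, f"{result.seconds:.2f}s", result.stdout.strip())
```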

---

## Group 2: Scan Variants (7 commands)

| ID | Command | Exit | Grade | Notes |
|----|---------|------|-------|-------|
| 2.1 | `scan --format markdown` | 0 | **95** | Clean markdown with table and detail sections. Ready for CI integration. |
| 2.2 | `scan --format sarif` | 0 | **95** | Valid SARIF 2.1.0 with schema URL, 1 run, 2 results. IDE-ready. |
| 2.3 | `scan --show-claims` | 0 | **90** | Shows all 2288 observations in 4607 lines. Table format with concept/value/file/line/confidence. |
| 2.4 | `scan --benchmark` | 0 | **93** | Shows timing breakdown: discovery 18ms, extraction 11243ms, conflict 1ms. Very useful. |
| 2.5 | `scan --staged` | 0 | **92** | Scans 13 staged files, 99 claims, 0 conflicts. Fast. |
| 2.6 | `scan --strict` | 0 | **60** | Output identical to default scan. No visible difference in thresholds or behavior. Either strict is a no-op or thresholds only matter when scores are marginal. |
| 2.7 | `scan --debug` | 0 | **65** | Adds "Authority: Tier X" line per finding. No conflict resolution traces, scoring breakdown, or query plan. Name implies more depth. |

**Group 2 average: 84.3**

---

## Group 3: Claims (6 commands)

**Note:** Claims commands only find claims when the current working directory contains `.aphoria/claims.toml`. Run from the project root, `claims list` shows "No claims found." This is a discoverability issue.

| ID | Command | Exit | Grade | Notes |
|----|---------|------|-------|-------|
| 3.1 | `claims list` | 0 | **88** | Shows 10 claims in table: ID, Category, Tier, Status, Invariant. Clean formatting. |
| 3.2 | `claims list --format json` | 0 | **92** | Valid JSON array of 10 claims. |
| 3.3 | `claims explain` | 0 | **95** | Detailed markdown with concept, predicate, invariant, consequence, provenance, authority, evidence, status, author. Grouped by category. |
| 3.4 | `claims explain --format json` | 0 | **78** | Valid JSON but returns flat array, not structured object with `type` field. Inconsistent with `explain --format json` which has `type: "onboarding"`. |
| 3.5 | `claims explain --claim <id>` | 0 | **95** | Single claim detail, clean markdown. |
| 3.6 | `claims list --category security` | 0 | **95** | Filtered to 6 security claims. Works correctly. |

**Group 3 average: 90.5**

---

## Group 4: Verification (5 commands)

| ID | Command | Exit | Grade | Notes |
|----|---------|------|-------|-------|
| 4.1 | `verify run` | 0 | **95** | Shows 1 PASS, 6 CONFLICT, 3 MISSING with observation evidence and consequences. Rich, actionable output. |
| 4.2 | `verify run --format json` | 0 | **92** | Valid JSON: `{results: [...], summary: {pass:1, conflict:6, missing:3, unclaimed:1239}}`. |
| 4.3 | `verify run --show-unclaimed` | 0 | **90** | Appends 1239 unclaimed observations. Long but correct. |
| 4.4 | `verify map` | 0 | **97** | Shows claim→extractor mapping. 7/10 have extractors, 3 have "NO EXTRACTOR". Lists 2 extractors with predicates but no matching claims. Excellent. |
| 4.5 | `verify run --format table` | 0 | **85** | Same as default (table is default). Flag accepted, no error. |

**Group 4 average: 91.8**

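
Because `verify run --format json` (4.2) exposes a `summary` object, a CI job could gate on those counts. A minimal sketch, assuming the payload shape quoted in 4.2: the `gate` function and its `max_conflicts` / `max_missing` budgets are hypothetical and not part of Aphoria.

```python
import json


def gate(verify_json: str, max_conflicts: int = 0, max_missing: int = 0) -> bool:
    """Return True when the verify summary stays within the allowed budgets."""
    summary = json.loads(verify_json)["summary"]
    return summary["conflict"] <= max_conflicts and summary["missing"] <= max_missing


# Sample payload shaped like the summary reported in 4.2 (results elided).
sample = json.dumps(
    {"results": [], "summary": {"pass": 1, "conflict": 6, "missing": 3, "unclaimed": 1239}}
)
print(gate(sample))                                  # zero-tolerance default
print(gate(sample, max_conflicts=6, max_missing=3))  # budgets matching this run
```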

---

## Group 5: Coverage & Docs (9 commands)

| ID | Command | Exit | Grade | Notes |
|----|---------|------|-------|-------|
| 5.1 | `coverage` | 0 | **93** | Per-module table: Claims, Observations, Claimed, Unclaimed, Missing, Density. 33 modules. Summary at bottom. |
| 5.2 | `coverage --format json` | 0 | **95** | Valid JSON: `{modules, project, summary}`. |
| 5.3 | `coverage --format markdown` | 0 | **95** | Clean markdown with summary section and table. |
| 5.4 | `coverage --sort-by density` | 0 | **85** | Sorts but many modules show 0.0% density, so ordering among zeroes is arbitrary. Works for non-zero modules. |
| 5.5 | `coverage --sort-by unclaimed` | 0 | **90** | Correctly sorts by unclaimed count descending. Extractors (355) first. |
| 5.6 | `explain` | 0 | **97** | Onboarding summary: categories table, verification health, coverage snapshot, top uncovered modules. Excellent first-touch UX. |
| 5.7 | `explain --format json` | 0 | **97** | Valid JSON: `type: "onboarding"`, with categories, coverage, verification. |
| 5.8 | `docs generate` | 0 | **90** | Full 224-line reference doc combining claims explain + verification + coverage. Comprehensive. |
| 5.9 | `docs generate --format json` | 0 | **92** | Valid JSON: `type: "full_docs"`, with claims, coverage, verification. |

**Group 5 average: 92.7**

---

## Group 6: Corpus & Trust Packs (3 commands)

| ID | Command | Exit | Grade | Notes |
|----|---------|------|-------|-------|
| 6.1 | `corpus list` | 0 | **90** | Shows 4 source types: hardcoded (Tier 0), RFC (Tier 0), OWASP (Tier 1), Vendor (Tier 2). Lists specific sources. |
| 6.2 | `corpus build --offline` | 0 | **90** | Builds 30 assertions (19 hardcoded + 11 vendor). Cleanly skips network sources. |
| 6.3 | `trust-pack list` | 0 | **92** | Lists 3 packs: security-hardening, rfc-compliance, owasp-top10. Shows install command. |

**Group 6 average: 90.7**

---

## Group 7: Learning & Extractors (4 commands)

| ID | Command | Exit | Grade | Notes |
|----|---------|------|-------|-------|
| 7.1 | `extractors stats` | 0 | **88** | Shows zero counts and promotion thresholds. Helpful even when empty. |
| 7.2 | `extractors candidates` | 0 | **88** | "No patterns eligible" with explanation of eligibility criteria. Good empty state. |
| 7.3 | `extractors shadow-status` | 0 | **88** | "No shadow tests" with config hint for enabling. Good guidance. |
| 7.4 | `patterns status` | 0 | **90** | Shows store location, cross-project config, hosted server status. Comprehensive. |

**Group 7 average: 88.5**

---

## Group 8: Advanced Features (8 commands)

| ID | Command | Exit | Grade | Notes |
|----|---------|------|-------|-------|
| 8.1 | `scope status` | 0 | **88** | Shows hierarchy, inheritance chain, overrides. Config hint provided. |
| 8.2 | `lifecycle list` | 0 | **82** | "No patterns found." Terse. Could show what statuses are available. |
| 8.3 | `governance pending` | 1 | **55** | Exits non-zero with "Governance is not enabled" message. Should exit 0 with informative message; non-zero suggests an error. |
| 8.4 | `governance pending --format json` | 1 | **45** | Same non-zero exit. No JSON output; prints plain text error. Format flag silently ignored on error path. |
| 8.5 | `audit summary` | 0 | **88** | Shows request counts and total audit events (638). Works without governance enabled. |
| 8.6 | `audit summary --format json` | 0 | **90** | Valid JSON with 8 fields including approval_rate and avg_approval_days. |
| 8.7 | `migrations status` | 0 | **82** | "No deprecated patterns found." Correct but minimal. |
| 8.8 | `research status` | 0 | **82** | "Gap store: not initialized" with guidance. |

**Group 8 average: 76.5**

---

## Scoring Summary

| Group | Weight | Average | Weighted |
|-------|--------|---------|----------|
| G1: Smoke Tests | 15% | 89.3 | 13.4 |
| G2: Scan Variants | 15% | 84.3 | 12.6 |
| G3: Claims | 12.5% | 90.5 | 11.3 |
| G4: Verification | 12.5% | 91.8 | 11.5 |
| G5: Coverage & Docs | 20% | 92.7 | 18.5 |
| G6: Corpus & Trust Packs | 8.3% | 90.7 | 7.5 |
| G7: Learning & Extractors | 8.3% | 88.5 | 7.3 |
| G8: Advanced Features | 8.3% | 76.5 | 6.3 |
| **Total** | **100%** | | **88.5** |

**Weighted overall: 88.5 / 100 (PASS)**

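
The total is a weighted sum of the group averages. As a rough cross-check (a sketch only, using the rounded weights and averages exactly as printed in the table), recomputing from the averages gives about 88.6, while summing the rounded Weighted column gives 88.4, so the reported 88.5 sits inside the rounding spread. The variable and function names below are illustrative.

```python
# (weight, group average) pairs copied from the Scoring Summary table.
GROUPS = {
    "G1": (0.15, 89.3),
    "G2": (0.15, 84.3),
    "G3": (0.125, 90.5),
    "G4": (0.125, 91.8),
    "G5": (0.20, 92.7),
    "G6": (0.083, 90.7),
    "G7": (0.083, 88.5),
    "G8": (0.083, 76.5),
}


def weighted_total(groups: dict) -> float:
    """Sum weight * average over all groups."""
    return sum(weight * average for weight, average in groups.values())


total = weighted_total(GROUPS)
weight_sum = sum(weight for weight, _ in GROUPS.values())
# The three 8.3% weights make the printed weights sum to 99.9%, not 100%.
print(f"weights sum to {weight_sum:.3f}; weighted total = {total:.2f}")
```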

---

## Top Issues Found

### P1: Critical (fix before next release)

1. **Governance exits non-zero when disabled (8.3, 8.4):** `governance pending` returns exit code 1 when governance isn't enabled. CI/scripts checking exit codes will treat this as a failure. Should exit 0 with an informative message, or return empty JSON with `--format json`.

2. **`--format json` ignored on error path (8.4):** When `governance pending --format json` fails, it prints plain text instead of JSON. Any format flag should produce structured error output: `{"error": "governance_not_enabled", "message": "..."}`.

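
The two P1 items can be illustrated together. The sketch below is not Aphoria's code: `render_disabled` is a hypothetical helper that treats a disabled feature as a state rather than a failure (exit 0) and honors the requested format, reusing the error shape suggested in item 2.

```python
import json


def render_disabled(feature: str, fmt: str) -> tuple[int, str]:
    """Return (exit_code, output) for a command whose feature is switched off."""
    message = f"{feature.capitalize()} is not enabled"
    if fmt == "json":
        # Structured body so JSON consumers never have to parse plain text.
        body = json.dumps({"error": f"{feature}_not_enabled", "message": message})
    else:
        body = message
    # A disabled feature is a state, not a failure, so the exit code stays 0.
    return 0, body


code, out = render_disabled("governance", "json")
print(code, out)
```

Whether a disabled feature should carry an `error` key or an empty result set is a design choice the report leaves open; the point of the sketch is only that the output format never changes on this path.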
### P2: Important (fix soon)

3. **Claims commands require specific cwd (3.1):** `claims list` from project root shows "No claims found" even though `.aphoria/claims.toml` exists in `applications/aphoria/`. The command should discover `.aphoria/` directories (searching upward from the cwd and, when run from a repo root, into subdirectories) or accept a `--project` flag. The current behavior confuses users who run from their repo root.

4. **`--strict` has no visible effect (2.6):** `scan --strict` produces output identical to `scan`. Either the strict thresholds are too similar to defaults, or the flag isn't applied correctly. Users who opt into strict mode expect stricter behavior.

5. **`--debug` is underwhelming (2.7):** Only adds "Authority: Tier X" per finding. No conflict resolution trace, scoring breakdown, or query plan. Rename to `--show-authority` or add actual debug output (concept matching attempts, score calculation, index lookups).

6. **`claims explain --format json` inconsistent (3.4):** Returns flat array while `explain --format json` returns `{type: "onboarding", ...}`. Should wrap in `{type: "claims_explain", claims: [...]}` for consistency.

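
For item 3, one possible discovery strategy is sketched below. It is an assumption, not Aphoria's behavior: `find_claims_file` is a hypothetical helper that first walks up from the starting directory (covering runs from inside the project) and then searches downward (covering runs from a repo root above the project, as in 3.1). When several claims files exist, it simply takes the first in sorted order, which a real tool would likely replace with an explicit `--project` choice.

```python
from pathlib import Path
from typing import Optional

CLAIMS_RELATIVE = ".aphoria/claims.toml"


def find_claims_file(start: Path) -> Optional[Path]:
    """Locate .aphoria/claims.toml by walking up from start, then down."""
    start = start.resolve()
    # Upward pass: covers running from a subdirectory of the project.
    for directory in (start, *start.parents):
        candidate = directory / CLAIMS_RELATIVE
        if candidate.is_file():
            return candidate
    # Downward pass: covers running from a repo root above the project.
    matches = sorted(start.rglob(CLAIMS_RELATIVE))
    return matches[0] if matches else None


print(find_claims_file(Path.cwd()))
```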
### P3: Polish (improve when convenient)

7. **Scan takes ~11s on 573 files (1.2):** Extraction dominates at 11.2s. Discovery and conflict are fast (<20ms). This is acceptable but could be improved with better parallelism or caching.

8. **Coverage density sorting among zeroes (5.4):** Most modules show 0.0% density, making sort-by-density less useful until more claims are authored.

9. **Empty state messages vary in helpfulness (8.2, 8.7):** `lifecycle list` and `migrations status` just say "no X found" without guidance. Compare with `extractors candidates`, which explains how to become eligible.

---

## Recommendations

1. **Standardize error handling:** All commands should exit 0 for "nothing to show" and reserve non-zero for actual errors. `--format json` must always produce JSON, even for errors.

2. **Add `--project` flag:** Allow `aphoria claims list --project ./applications/aphoria` or auto-discover `.aphoria/` directories.

3. **Improve debug output:** Add `--trace` for detailed resolution traces (concept matching, score calculation, tier comparison). Keep `--debug` for general verbosity.

4. **Document `--strict` behavior:** If it works, show what threshold changed and what would pass under default but fails under strict.

5. **Consistent JSON envelopes:** All `--format json` outputs should use `{type: "...", ...}` pattern. The `explain` and `docs generate` commands do this well; extend to `claims explain`.

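
Recommendation 5 could be met by routing every JSON formatter through one wrapper. A minimal sketch, assuming the `{type: "...", ...}` pattern described above: the `envelope` helper and the sample claim records are hypothetical, and only the `claims_explain` type name comes from the report.

```python
import json
from typing import Any


def envelope(kind: str, **payload: Any) -> str:
    """Serialize a payload as {"type": kind, ...} so every command shares one shape."""
    if "type" in payload:
        raise ValueError("'type' is reserved for the envelope")
    return json.dumps({"type": kind, **payload})


# Hypothetical claim records; real output would carry the full claim fields.
claims = [{"id": "claim-1"}, {"id": "claim-2"}]
print(envelope("claims_explain", claims=claims))
```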

---

## Commands Skipped (46 tested / ~85 total)

State-modifying commands were intentionally excluded: `init`, `baseline`, `diff`, `bless`, `update`, `ack`, `claims create/update/supersede/deprecate`, `extractors review/promote/auto-promote/feedback/graduate/rollback`, `governance approve/reject/escalate/create`, `lifecycle deprecate/archive/reactivate`, `scope override/remove`, `patterns sync/pull-community`, `policy export/import/resign`, `corpus export-pack`, `trust-pack install`, `eval run/baseline/update-baseline`, `research run/gaps`.

---

*Report generated by Aphoria CLI UAT, 2026-02-08*