stemedb/research-requests/a5-flywheel-skill-design.md
jml 3b5f88b4f0 feat(aphoria): implement claims architecture (A1-A5) with verify engine, corpus, coverage, and explain
Complete Aphoria claims system overhaul:
- A1: Rename ExtractedClaim to Observation (extractors produce observations, not claims)
- A2: Add AuthoredClaim with full provenance, invariants, and authority tiers
- A3: Verify engine comparing observations against authored claims, CLI + formatters
- A4: Corpus as first-class assertions with predicate indexing, authority lens, trust packs
- A5: Coverage analysis, explain/docs generation, self-audit extractor, claim suggester skill

Also includes: 42 extractors updated for Observation type, verifiable_predicates trait,
conflict detection with comparison modes, claims TOML persistence, Grafana dashboard,
backup/restore scripts, and comprehensive test coverage.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-08 09:11:47 +00:00


# A5 Flywheel Skill Design Research Directive
You are an expert in AI-assisted developer tooling, specifically in designing Claude Code skills (prompt-based tool orchestration). You understand how LLM-based coding assistants work with CLI tools to create intelligent developer workflows.
You are going to research how to design a Claude Code skill that creates a **flywheel effect** for a code-level claim system: the more claims developers author, the smarter the skill gets at identifying what needs claims and suggesting them.
---
## Context
### What We Have
**Aphoria** is a code truth linter. It has:
1. **42 regex extractors** that scan code and produce **observations** — structured grep results like `imported = true` or `ordering = SeqCst` at a specific file:line.
2. **Authored claims** stored in `.aphoria/claims.toml` — human-written assertions with provenance, invariants, consequences, and evidence. Example:
```toml
[[claim]]
id = "wallet-seqcst-001"
concept_path = "maxwell/wallet/atomics/ordering"
predicate = "required_ordering"
value = "SeqCst"
provenance = "Safety analysis by lead developer"
invariant = "All wallet atomics MUST use SeqCst"
consequence = "Double-spend race condition"
authority_tier = "expert"
evidence = ["wallet ADR-003", "Intel SDM Vol 4"]
category = "safety"
```
3. **A verification engine** (`aphoria verify run`) that matches claims against observations and produces verdicts: PASS, CONFLICT, MISSING, UNCLAIMED.
4. **An existing claims skill** (`.claude/skills/aphoria-claims/SKILL.md`) that reviews diffs, identifies claimable patterns, checks existing claims, and creates new ones via CLI.
5. **CLI commands**:
- `aphoria claims list --format json` — all authored claims
- `aphoria claims create --id X --invariant "..." ...` — create a claim
- `aphoria verify run --format json --show-unclaimed` — verify claims vs code
- `aphoria scan . --show-claims --format json` — extract observations
- `aphoria claims explain` — render claims-explained markdown
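
To make the four verdicts concrete, here is a minimal Python sketch of how claims might be matched against observations. The `Claim` and `Observation` shapes and the matching rule (same concept path and predicate, then compare values) are assumptions for illustration, not Aphoria's actual engine. In particular, the example claim above uses the predicate `required_ordering` while the observation example uses `ordering`, so the real engine presumably maps one to the other; this sketch skips that step by using a single predicate name for both.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    id: str
    concept_path: str
    predicate: str
    value: str


@dataclass(frozen=True)
class Observation:
    concept_path: str
    predicate: str
    value: str
    location: str  # "file:line"


def verify(claims, observations):
    """Return a list of (verdict, subject, detail) tuples.

    Assumed semantics (hypothetical, not taken from Aphoria's source):
      PASS      - every matching observation agrees with the claim
      CONFLICT  - a matching observation disagrees with the claim
      MISSING   - the claim has no matching observation at all
      UNCLAIMED - an observation is covered by no claim
    """
    results = []
    claimed_keys = set()
    for claim in claims:
        key = (claim.concept_path, claim.predicate)
        claimed_keys.add(key)
        matches = [o for o in observations if (o.concept_path, o.predicate) == key]
        if not matches:
            results.append(("MISSING", claim.id, "no observation found"))
            continue
        disagreeing = [o for o in matches if o.value != claim.value]
        if disagreeing:
            for o in disagreeing:
                detail = f"{o.location}: {o.value} != {claim.value}"
                results.append(("CONFLICT", claim.id, detail))
        else:
            results.append(("PASS", claim.id, f"{len(matches)} observation(s) agree"))
    # Anything no claim talks about is reported as a coverage gap.
    for o in observations:
        if (o.concept_path, o.predicate) not in claimed_keys:
            results.append(("UNCLAIMED", o.location, f"{o.predicate} = {o.value}"))
    return results
```

The UNCLAIMED rows are the raw material for A5.3: they are the observations the skill would be asked to reason about.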
### What We Need (A5 — The Flywheel)
Four features that make the system get smarter with use:
**A5.1 Coverage Metrics**: "What percentage of this codebase has claims?" Per-module density, unclaimed observation gaps. Probably a new CLI command: `aphoria coverage`.
**A5.2 Auto-Generated Claims-Explained**: `aphoria docs generate` produces full project documentation from the knowledge graph. Groups by module, includes provenance chains (claim A supersedes claim B), highlights coverage gaps.
**A5.3 Skill Learning**: The skill reads existing claims, recognizes patterns, and when it sees analogous code, suggests new claims. "You have 3 claims about SeqCst ordering in wallet code. This new `Ordering::Relaxed` in sync.rs: should that also be SeqCst?"
**A5.4 Onboarding Mode**: `aphoria explain` produces a narrative: "This codebase enforces 12 safety invariants. The wallet uses SeqCst everywhere because of double-spend risk (ADR-003)..."
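
One way to pin down "per-module density" for A5.1 is claimed observations divided by total observations, per module. The Python sketch below assumes that definition, and assumes a module is the first path component of the observation's file. Neither choice is specified above; both are placeholders for whatever the research recommends.

```python
from collections import defaultdict


def coverage_by_module(observations):
    """Compute claim coverage per module.

    observations: iterable of (file_path, is_claimed) pairs.
    Returns {module: (claimed, total, percent)}. The module is taken to be
    the first path component, which is a hypothetical grouping rule.
    """
    counts = defaultdict(lambda: [0, 0])  # module -> [claimed, total]
    for file_path, is_claimed in observations:
        module = file_path.split("/", 1)[0]
        counts[module][1] += 1
        if is_claimed:
            counts[module][0] += 1
    return {
        module: (claimed, total, round(100.0 * claimed / total, 1))
        for module, (claimed, total) in counts.items()
    }
```

Sorting the result by ascending percentage would put the largest gaps first, which is probably the view a developer acts on.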
### The Hypothesis
A5.1, A5.2, A5.4 are new Rust CLI commands. Straightforward engineering.
A5.3 is the interesting one. The hypothesis is: **the skill just calls the CLI**. The "learning" is the LLM reasoning over the JSON output of `aphoria claims list` and `aphoria verify run --show-unclaimed`. There's no ML model, no vector embeddings, no training loop. Claude reads the claims, understands the patterns, and applies that understanding to new code. The flywheel is prompt engineering, not machine learning.
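
The data flow behind this hypothesis can be sketched in a few lines of Python. In a real Claude Code skill the model runs the commands itself from the skill prompt, so this is only an illustration of what gets collected. The two commands are the ones listed above; the helper names `run_aphoria` and `gather_context` are hypothetical, and the JSON is passed through unparsed beyond `json.loads` because its schema is not specified here.

```python
import json
import subprocess


def run_aphoria(args):
    """Run the aphoria CLI with the given arguments and parse stdout as JSON."""
    completed = subprocess.run(
        ["aphoria", *args], capture_output=True, text=True, check=True
    )
    return json.loads(completed.stdout)


def gather_context(run=run_aphoria):
    """Collect everything the skill reasons over.

    `run` is injectable so the flow can be exercised without the CLI installed.
    """
    claims = run(["claims", "list", "--format", "json"])
    verdicts = run(["verify", "run", "--format", "json", "--show-unclaimed"])
    # No model, embeddings, or training state: the "learning" input is just this dict.
    return {"claims": claims, "verify": verdicts}
```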
---
## Research Questions
### 1. Is the "skill calls CLI" pattern sufficient for claim suggestion?
Given that Claude can read JSON output of `aphoria claims list` (typically 5-50 claims) and `aphoria verify run --show-unclaimed` (typically 10-200 unclaimed observations), can it effectively:
- Group existing claims by semantic pattern (not string matching — understanding that "SeqCst in wallet" and "SeqCst in sync" are the same kind of safety invariant)
- Identify unclaimed observations that match existing claim patterns
- Generate suggested claims with meaningful invariant/consequence text (not template garbage)
- Prioritize suggestions by coverage impact
What are the known limits? At what number of claims and observations does the context window become a problem? What's the fallback?
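
One candidate fallback, sketched in Python under stated assumptions: if the combined JSON exceeds a token budget, keep only the claims that share a key with some unclaimed observation, then truncate the observation list. The four-characters-per-token estimate is a rough heuristic, not a tokenizer, and the `predicate` field used as the default key is hypothetical. In the examples above a claim's predicate (`required_ordering`) differs from an observation's (`ordering`), so a real version would need a mapping or a different key.

```python
import json


def estimate_tokens(payload):
    """Very rough size estimate: about 4 characters per token (heuristic only)."""
    return len(json.dumps(payload)) // 4


def fit_to_budget(claims, unclaimed, max_tokens, key=lambda record: record["predicate"]):
    """Return (claims, unclaimed, trimmed) where trimmed says whether anything was dropped.

    Step 1: if everything fits, send everything.
    Step 2: otherwise keep only claims whose key matches some unclaimed observation.
    Step 3: if that is still too large, drop observations from the end until it fits.
    """
    if estimate_tokens([claims, unclaimed]) <= max_tokens:
        return claims, unclaimed, False
    wanted = {key(o) for o in unclaimed}
    relevant = [c for c in claims if key(c) in wanted]
    kept = list(unclaimed)
    while kept and estimate_tokens([relevant, kept]) > max_tokens:
        kept.pop()
    return relevant, kept, True
```

Returning the `trimmed` flag lets the skill tell the developer that suggestions were based on a subset, rather than silently ignoring part of the codebase.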
### 2. What skill prompt patterns produce the best "learning from examples" behavior?
When you give an LLM N examples of authored claims and ask it to suggest new ones for unclaimed observations, what prompt structure works best? Specifically:
- Few-shot with existing claims as examples vs. summarized patterns?
- Should the skill prompt include ALL claims or only those in relevant categories?
- Does chain-of-thought ("this observation at sync.rs:42 is analogous to claim wallet-seqcst-001 because both involve atomic ordering") improve suggestion quality?
- How should the skill handle the case where 0 claims exist yet (cold start)?
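
As a starting point for comparing these options, the Python sketch below builds one possible prompt: existing claims as few-shot examples, an explicit instruction to name the analogous claim before suggesting anything, and a separate branch for the cold-start case. The field names (`id`, `invariant`, `consequence`, `location`, `predicate`, `value`) are assumed rather than taken from the real JSON output, and the cold-start wording is one guess at a reasonable behavior, not a recommendation the research has validated.

```python
def build_prompt(claims, unclaimed):
    """Assemble a suggestion prompt from authored claims and unclaimed observations."""
    lines = []
    if claims:
        lines.append("Existing authored claims (examples of style and intent):")
        for c in claims:
            lines.append(f"- [{c['id']}] {c['invariant']} (consequence: {c['consequence']})")
        lines.append("")
        # Chain-of-thought step: force an explicit analogy before any suggestion.
        lines.append(
            "For each observation below, first name the most analogous claim and "
            "explain why, then propose a new claim or say 'no claim needed'."
        )
    else:
        # Cold start: there are no examples to learn from, so ask instead of guessing.
        lines.append("No claims exist yet. For each observation below, ask the developer")
        lines.append("what invariant it protects before proposing any claim.")
    lines.append("")
    lines.append("Unclaimed observations:")
    for o in unclaimed:
        lines.append(f"- {o['location']}: {o['predicate']} = {o['value']}")
    return "\n".join(lines)
```

Swapping the per-claim loop for a summarized pattern list would give the alternative arm of the few-shot versus summary comparison with the rest of the prompt unchanged.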
### 3. What's the right CLI surface for A5.1/A5.2/A5.4?
For the Rust CLI commands that support the flywheel:
- **Coverage** (`aphoria coverage`): What metrics actually matter? Per-module claim density? Unclaimed-to-observation ratio? Something else? What do existing code quality/coverage tools (SonarQube, CodeClimate) show that developers actually act on?
- **Docs generation** (`aphoria docs generate`): What documentation formats do teams actually read? Is claims-explained.md (grouped by category, full provenance per claim) the right format, or should it be structured differently for consumption?
- **Onboarding** (`aphoria explain`): What information do new team members need in their first day/week? Is a narrative format better than structured? Should it be interactive (ask questions) or static?
### 4. How do other tools create "flywheel" effects in developer workflows?
Are there examples of developer tools where:
- The tool gets more useful as developers invest in it
- The "investment" is structured metadata (not just code)
- The tool surfaces that metadata in context-appropriate ways
Examples might include: Conventional Commits + release automation, ADR tools that surface decisions during review, type systems that get stronger with more annotations, architecture fitness functions. Which patterns work, and which get abandoned?
### 5. Cross-project claim patterns: what's the realistic scope?
A5.3 mentions "confidence grows with consistency across projects." Is this realistic for v1?
- How do tools like ESLint shared configs, Prettier presets, or security policy bundles handle cross-project knowledge?
- Can Trust Packs (our existing export/import format for claim bundles) serve as the cross-project learning mechanism?
- Or should v1 be single-project only, with cross-project deferred?
---
## Methodology
### Phase 1: Prompt Engineering for "Learning from Claims"
- Look at research on few-shot learning with structured data in LLM contexts
- Find examples of LLM-based tools that improve with user-provided examples
- Study Claude Code skill patterns that demonstrate "reads context, applies judgment"
- Look at GitHub Copilot Workspace, Cursor rules, and similar tools that use project-specific context
### Phase 2: Developer Tool Flywheel Patterns
- Study adoption curves of tools that require upfront investment (TypeScript, linting configs, ADRs)
- Find data on what makes developers continue investing vs. abandon tools
- Look at code quality dashboard UX (SonarQube, CodeClimate, Codacy) for coverage metric design
### Phase 3: CLI UX for Knowledge Graph Outputs
- Study how tools present "code knowledge" (Sourcegraph insights, CodeScene, NDepend)
- Find examples of auto-generated documentation that teams actually use
- Look at onboarding documentation formats (READMEs vs. guided tours vs. interactive)
### Phase 4: Synthesis
- Compare findings across sources
- Identify which A5 features are "just engineering" vs. "need careful design"
- Validate or refute the "skill just calls CLI" hypothesis
---
## Deliverables
Produce a report with:
1. **Executive Summary** — Is the "skill calls CLI" hypothesis correct? What's the biggest risk?
2. **Skill Prompt Design** — Concrete recommendations for how the A5.3 skill prompt should be structured, with examples
3. **CLI Command Recommendations** — What `aphoria coverage`, `aphoria docs generate`, and `aphoria explain` should output and why
4. **Flywheel Mechanics** — What specifically creates the positive feedback loop, and what could break it
5. **Cold Start Strategy** — How does A5 work when a project has 0 claims? What's the bootstrapping UX?
6. **Cross-Project Scope** — Include in v1 or defer? With evidence.
7. **Open Questions** — What still needs testing/prototyping
---
## Success Criteria
Research is complete when:
- [ ] The "skill calls CLI" hypothesis has a clear verdict with supporting evidence
- [ ] There's a concrete skill prompt structure (not abstract — actual prompt text or template)
- [ ] CLI output formats are specified with rationale (what developers act on vs. ignore)
- [ ] At least 3 real-world flywheel examples from developer tools are analyzed
- [ ] Cold start problem has a specific solution
- [ ] Cross-project scope has a recommendation with evidence