stemedb/research-requests/a5-flywheel-skill-design.md

# A5 Flywheel Skill Design Research Directive

You are an expert in AI-assisted developer tooling, specifically in designing Claude Code skills (prompt-based tool orchestration). You understand how LLM-based coding assistants work with CLI tools to create intelligent developer workflows.

You are going to research how to design a Claude Code skill that creates a **flywheel effect** for a code-level claim system: the more claims developers author, the smarter the skill gets at identifying what needs claims and suggesting them.

---

## Context

### What We Have

**Aphoria** is a code truth linter. It has:

1. **42 regex extractors** that scan code and produce **observations** — structured grep results like `imported = true` or `ordering = SeqCst` at a specific file:line.

2. **Authored claims** stored in `.aphoria/claims.toml` — human-written assertions with provenance, invariants, consequences, and evidence. Example:
   ```toml
   [[claim]]
   id = "wallet-seqcst-001"
   concept_path = "maxwell/wallet/atomics/ordering"
   predicate = "required_ordering"
   value = "SeqCst"
   provenance = "Safety analysis by lead developer"
   invariant = "All wallet atomics MUST use SeqCst"
   consequence = "Double-spend race condition"
   authority_tier = "expert"
   evidence = ["wallet ADR-003", "Intel SDM Vol 4"]
   category = "safety"
   ```

3. **A verification engine** (`aphoria verify run`) that matches claims against observations and produces verdicts: PASS, CONFLICT, MISSING, UNCLAIMED.

4. **An existing claims skill** (`.claude/skills/aphoria-claims/SKILL.md`) that reviews diffs, identifies claimable patterns, checks existing claims, and creates new ones via CLI.

5. **CLI commands**:
   - `aphoria claims list --format json` — all authored claims
   - `aphoria claims create --id X --invariant "..." ...` — create a claim
   - `aphoria verify run --format json --show-unclaimed` — verify claims vs code
   - `aphoria scan . --show-claims --format json` — extract observations
   - `aphoria claims explain` — render claims-explained markdown
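
The verdict semantics of the verification engine (item 3 above) are not spelled out in this directive. As a rough mental model only, the sketch below assumes PASS means an observation agrees with a claim, CONFLICT means it disagrees, MISSING means a claim has no matching observation, and UNCLAIMED means an observation has no covering claim. The `Observation` fields and the matching key `(concept_path, predicate)` are assumptions for illustration, not Aphoria's actual data model.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Claim:
    id: str
    concept_path: str
    predicate: str
    value: str

@dataclass(frozen=True)
class Observation:
    concept_path: str  # assumed: observations carry a concept path like claims do
    predicate: str
    value: str
    location: str      # "file:line"

def verify(claims, observations):
    """Return (verdict, claim_id_or_None, location_or_None) tuples."""
    results = []
    claimed_keys = {(c.concept_path, c.predicate) for c in claims}
    for claim in claims:
        key = (claim.concept_path, claim.predicate)
        matches = [o for o in observations if (o.concept_path, o.predicate) == key]
        if not matches:
            # Assumed meaning of MISSING: the claim has nothing in the code to check.
            results.append(("MISSING", claim.id, None))
        for obs in matches:
            verdict = "PASS" if obs.value == claim.value else "CONFLICT"
            results.append((verdict, claim.id, obs.location))
    for obs in observations:
        if (obs.concept_path, obs.predicate) not in claimed_keys:
            results.append(("UNCLAIMED", None, obs.location))
    return results
```

Under these assumptions, the UNCLAIMED rows are the raw material for A5.3: they are observations the skill could propose claims for.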

### What We Need (A5 — The Flywheel)

Four features that make the system get smarter with use:

**A5.1 Coverage Metrics** — "What percentage of this codebase has claims?" Per-module density, unclaimed observation gaps. Probably a new CLI command: `aphoria coverage`.
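
To make "per-module density" concrete, here is a minimal sketch of one possible metric: the fraction of observations in each module that are covered by a claim. It assumes each observation has already been reduced to a dict with a `file` path and a `claimed` flag, and it treats the top-level directory as the module. Both are illustrative assumptions; the real command would be written in Rust and may define modules differently.

```python
from collections import defaultdict

def coverage_by_module(observations):
    """observations: iterable of dicts with 'file' (str) and 'claimed' (bool).

    Returns {module: (claimed, total, ratio)}. The module is the file's
    top-level directory, or "." for files at the repository root.
    """
    counts = defaultdict(lambda: [0, 0])  # module -> [claimed, total]
    for obs in observations:
        parts = obs["file"].split("/")
        module = parts[0] if len(parts) > 1 else "."
        counts[module][1] += 1
        if obs["claimed"]:
            counts[module][0] += 1
    return {m: (c, t, c / t) for m, (c, t) in sorted(counts.items())}
```

Whether a ratio like this is a metric developers act on is exactly what Research Question 3 asks.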

**A5.2 Auto-Generated Claims-Explained** — `aphoria docs generate` produces full project documentation from the knowledge graph. Groups by module, includes provenance chains (claim A supersedes claim B), highlights coverage gaps.

**A5.3 Skill Learning** — The skill reads existing claims, recognizes patterns, and when it sees analogous code, suggests new claims. "You have 3 claims about SeqCst ordering in wallet code. This new `Ordering::Relaxed` in sync.rs — should that also be SeqCst?"

**A5.4 Onboarding Mode** — `aphoria explain` produces a narrative: "This codebase enforces 12 safety invariants. The wallet uses SeqCst everywhere because of double-spend risk (ADR-003)..."

### The Hypothesis

A5.1, A5.2, and A5.4 are new Rust CLI commands: straightforward engineering.

A5.3 is the interesting one. The hypothesis is: **the skill just calls the CLI**. The "learning" is the LLM reasoning over the JSON output of `aphoria claims list` and `aphoria verify run --show-unclaimed`. There's no ML model, no vector embeddings, no training loop. Claude reads the claims, understands the patterns, and applies that understanding to new code. The flywheel is prompt engineering, not machine learning.
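
The data flow behind the hypothesis can be sketched as follows. In a real Claude Code skill the model would run the CLI itself from instructions in `SKILL.md`; this Python version only illustrates what goes in and what comes out. The two CLI invocations are the ones listed above, but the JSON shape (for example, an `unclaimed` key in the verify report) and the prompt wording are assumptions.

```python
import json
import subprocess

def run_json(args):
    """Run a CLI command and parse its stdout as JSON."""
    out = subprocess.run(args, capture_output=True, text=True, check=True).stdout
    return json.loads(out)

def build_prompt(claims, unclaimed):
    """Assemble the text the LLM would reason over. No model call happens here."""
    return "\n".join([
        "Existing authored claims (JSON):",
        json.dumps(claims, indent=2),
        "",
        "Unclaimed observations (JSON):",
        json.dumps(unclaimed, indent=2),
        "",
        "For each unclaimed observation that is analogous to an existing claim, "
        "name the claim it resembles, explain why, and draft a new claim.",
    ])

def collect_inputs():
    """Gather both inputs from the CLI (requires aphoria on PATH)."""
    claims = run_json(["aphoria", "claims", "list", "--format", "json"])
    report = run_json(["aphoria", "verify", "run", "--format", "json", "--show-unclaimed"])
    # Assumption: the report is an object with unclaimed observations under
    # an "unclaimed" key. The actual schema is not specified in this document.
    return claims, report.get("unclaimed", [])
```

Nothing in this flow stores state between runs: any "learning" comes from the claims file growing, which is the point of the hypothesis.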

---

## Research Questions

### 1. Is the "skill calls CLI" pattern sufficient for claim suggestion?

Given that Claude can read JSON output of `aphoria claims list` (typically 5-50 claims) and `aphoria verify run --show-unclaimed` (typically 10-200 unclaimed observations), can it effectively:

- Group existing claims by semantic pattern (not string matching — understanding that "SeqCst in wallet" and "SeqCst in sync" are the same kind of safety invariant)
- Identify unclaimed observations that match existing claim patterns
- Generate suggested claims with meaningful invariant/consequence text (not template garbage)
- Prioritize suggestions by coverage impact

What are the known limits? At what number of claims/observations does the context window become a problem? What's the fallback?
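
One candidate fallback, offered as a sketch to evaluate rather than a recommendation: keep every claim, rank unclaimed observations so that those sharing a predicate with an existing claim come first, and stop adding observations once a token budget is reached. The four-characters-per-token estimate is a rough rule of thumb, not a tokenizer, and the `predicate` field on observations is an assumed shape.

```python
import json

def estimate_tokens(obj):
    """Crude size estimate: roughly 4 characters per token (not a real tokenizer)."""
    return len(json.dumps(obj)) // 4

def fit_to_budget(claims, unclaimed, max_tokens):
    """Return (kept_observations, dropped_count) that fit alongside all claims.

    Observations whose predicate already appears in some claim are tried
    first, since they are the likeliest to match an existing pattern.
    """
    known = {c["predicate"] for c in claims}
    # sorted() is stable, so original order is preserved within each group.
    ranked = sorted(unclaimed, key=lambda o: o["predicate"] not in known)
    used = estimate_tokens(claims)
    kept = []
    for obs in ranked:
        cost = estimate_tokens(obs)
        if used + cost > max_tokens:
            break
        kept.append(obs)
        used += cost
    return kept, len(unclaimed) - len(kept)
```

Reporting the dropped count lets the skill tell the developer that suggestions are based on a partial view.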

### 2. What skill prompt patterns produce the best "learning from examples" behavior?

When you give an LLM N examples of authored claims and ask it to suggest new ones for unclaimed observations, what prompt structure works best? Specifically:

- Few-shot with existing claims as examples vs. summarized patterns?
- Should the skill prompt include ALL claims or only those in relevant categories?
- Does chain-of-thought ("this observation at sync.rs:42 is analogous to claim wallet-seqcst-001 because both involve atomic ordering") improve suggestion quality?
- How should the skill handle the case where no claims exist yet (cold start)?
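
The alternatives in this list can be made concrete with a small sketch that switches between three prompt sections: a cold-start placeholder, verbatim few-shot examples, and summarized patterns. The field names come from the claim example above; the threshold of 10, the placeholder wording, and the grouping key `(category, predicate, value)` are arbitrary assumptions to be tested, not findings.

```python
from collections import defaultdict

def summarize_patterns(claims):
    """Collapse claims sharing (category, predicate, value) into one line each."""
    groups = defaultdict(list)
    for c in claims:
        groups[(c["category"], c["predicate"], c["value"])].append(c["id"])
    return [
        f"{cat}: {pred} = {val} ({len(ids)} claim(s), e.g. {ids[0]})"
        for (cat, pred, val), ids in sorted(groups.items())
    ]

def examples_section(claims, few_shot_limit=10):
    """Choose a prompt section based on how many claims exist."""
    if not claims:
        # Cold start: placeholder text only; the real strategy is an open question.
        return "No authored claims exist yet (cold start)."
    if len(claims) <= few_shot_limit:
        lines = [
            f"- {c['id']}: {c['invariant']} (consequence: {c['consequence']})"
            for c in claims
        ]
        return "Existing claims:\n" + "\n".join(lines)
    patterns = summarize_patterns(claims)
    return "Claim patterns:\n" + "\n".join(f"- {p}" for p in patterns)
```

Comparing suggestion quality between the few-shot and summarized branches on the same observations would be one way to answer the first bullet empirically.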

### 3. What's the right CLI surface for A5.1/A5.2/A5.4?

For the Rust CLI commands that support the flywheel:

- **Coverage** (`aphoria coverage`): What metrics actually matter? Per-module claim density? Unclaimed-to-observation ratio? Something else? What do existing code quality/coverage tools (SonarQube, CodeClimate) show that developers actually act on?
- **Docs generation** (`aphoria docs generate`): What documentation formats do teams actually read? Is claims-explained.md (grouped by category, full provenance per claim) the right format, or should it be structured differently for consumption?
- **Onboarding** (`aphoria explain`): What information do new team members need in their first day/week? Is a narrative format better than structured? Should it be interactive (ask questions) or static?

### 4. How do other tools create "flywheel" effects in developer workflows?

Are there examples of developer tools where:
- The tool gets more useful as developers invest in it
- The "investment" is structured metadata (not just code)
- The tool surfaces that metadata in context-appropriate ways

Examples might include: Conventional Commits + release automation, ADR tools that surface decisions during review, type systems that get stronger with more annotations, architecture fitness functions. Which patterns work, and which get abandoned?

### 5. Cross-project claim patterns: what's the realistic scope?

A5.3 mentions "confidence grows with consistency across projects." Is this realistic for v1?

- How do tools like ESLint shared configs, Prettier presets, or security policy bundles handle cross-project knowledge?
- Can Trust Packs (our existing export/import format for claim bundles) serve as the cross-project learning mechanism?
- Or should v1 be single-project only, with cross-project deferred?
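
If cross-project scope is explored, "consistency across projects" could be measured as simply as counting how many distinct projects assert the same pattern. The sketch below assumes claim bundles have already been loaded into plain dicts keyed by project name; it says nothing about the actual Trust Pack format, and using `(predicate, value)` as the pattern key is an illustrative choice.

```python
from collections import Counter

def cross_project_support(projects):
    """projects: {project_name: [claim dicts]}.

    Returns a Counter mapping (predicate, value) to the number of distinct
    projects that contain at least one claim with that pattern.
    """
    support = Counter()
    for claims in projects.values():
        # Use a set so a project with many identical claims counts once.
        for key in {(c["predicate"], c["value"]) for c in claims}:
            support[key] += 1
    return support
```

A count like this is only a proxy for confidence; whether it is meaningful for v1 is part of the question above.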

---

## Methodology

### Phase 1: Prompt Engineering for "Learning from Claims"

- Look at research on few-shot learning with structured data in LLM contexts
- Find examples of LLM-based tools that improve with user-provided examples
- Study Claude Code skill patterns that demonstrate "reads context, applies judgment"
- Look at GitHub Copilot Workspace, Cursor rules, and similar tools that use project-specific context

### Phase 2: Developer Tool Flywheel Patterns

- Study adoption curves of tools that require upfront investment (TypeScript, linting configs, ADRs)
- Find data on what makes developers continue investing vs. abandon tools
- Look at code quality dashboard UX (SonarQube, CodeClimate, Codacy) for coverage metric design

### Phase 3: CLI UX for Knowledge Graph Outputs

- Study how tools present "code knowledge" (Sourcegraph insights, CodeScene, NDepend)
- Find examples of auto-generated documentation that teams actually use
- Look at onboarding documentation formats (READMEs vs. guided tours vs. interactive)

### Phase 4: Synthesis

- Compare findings across sources
- Identify which A5 features are "just engineering" vs. "need careful design"
- Validate or refute the "skill just calls CLI" hypothesis

---

## Deliverables

Produce a report with:

1. **Executive Summary** — Is the "skill calls CLI" hypothesis correct? What's the biggest risk?
2. **Skill Prompt Design** — Concrete recommendations for how the A5.3 skill prompt should be structured, with examples
3. **CLI Command Recommendations** — What `aphoria coverage`, `aphoria docs generate`, and `aphoria explain` should output and why
4. **Flywheel Mechanics** — What specifically creates the positive feedback loop, and what could break it
5. **Cold Start Strategy** — How does A5 work when a project has 0 claims? What's the bootstrapping UX?
6. **Cross-Project Scope** — Include in v1 or defer? With evidence.
7. **Open Questions** — What still needs testing/prototyping

---

## Success Criteria

Research is complete when:

- [ ] The "skill calls CLI" hypothesis has a clear verdict with supporting evidence
- [ ] There's a concrete skill prompt structure (not abstract — actual prompt text or template)
- [ ] CLI output formats are specified with rationale (what developers act on vs. ignore)
- [ ] At least 3 real-world flywheel examples from developer tools are analyzed
- [ ] Cold start problem has a specific solution
- [ ] Cross-project scope has a recommendation with evidence