# A5 Flywheel Skill Design Research Directive

You are an expert in AI-assisted developer tooling, specifically in designing Claude Code skills (prompt-based tool orchestration). You understand how LLM-based coding assistants work with CLI tools to create intelligent developer workflows.

You are going to research how to design a Claude Code skill that creates a **flywheel effect** for a code-level claim system: the more claims developers author, the smarter the skill gets at identifying what needs claims and suggesting them.

---

## Context

### What We Have

**Aphoria** is a code truth linter. It has:

1. **42 regex extractors** that scan code and produce **observations**: structured grep results like `imported = true` or `ordering = SeqCst` at a specific file:line.

2. **Authored claims** stored in `.aphoria/claims.toml`: human-written assertions with provenance, invariants, consequences, and evidence. Example:

   ```toml
   [[claim]]
   id = "wallet-seqcst-001"
   concept_path = "maxwell/wallet/atomics/ordering"
   predicate = "required_ordering"
   value = "SeqCst"
   provenance = "Safety analysis by lead developer"
   invariant = "All wallet atomics MUST use SeqCst"
   consequence = "Double-spend race condition"
   authority_tier = "expert"
   evidence = ["wallet ADR-003", "Intel SDM Vol 4"]
   category = "safety"
   ```

3. **A verification engine** (`aphoria verify run`) that matches claims against observations and produces verdicts: PASS, CONFLICT, MISSING, UNCLAIMED.

4. **An existing claims skill** (`.claude/skills/aphoria-claims/SKILL.md`) that reviews diffs, identifies claimable patterns, checks existing claims, and creates new ones via CLI.

5. **CLI commands**:
   - `aphoria claims list --format json`: all authored claims
   - `aphoria claims create --id X --invariant "..." ...`: create a claim
   - `aphoria verify run --format json --show-unclaimed`: verify claims vs code
   - `aphoria scan . --show-claims --format json`: extract observations
   - `aphoria claims explain`: render claims-explained markdown

### What We Need (A5: The Flywheel)

Four features that make the system get smarter with use:

**A5.1 Coverage Metrics**: "What percentage of this codebase has claims?" Per-module density, unclaimed observation gaps. Probably a new CLI command: `aphoria coverage`.

**A5.2 Auto-Generated Claims-Explained**: `aphoria docs generate` produces full project documentation from the knowledge graph. Groups by module, includes provenance chains (claim A supersedes claim B), highlights coverage gaps.

**A5.3 Skill Learning**: The skill reads existing claims, recognizes patterns, and when it sees analogous code, suggests new claims. "You have 3 claims about SeqCst ordering in wallet code. This new `Ordering::Relaxed` in sync.rs: should that also be SeqCst?"

**A5.4 Onboarding Mode**: `aphoria explain` produces a narrative: "This codebase enforces 12 safety invariants. The wallet uses SeqCst everywhere because of double-spend risk (ADR-003)..."

### The Hypothesis

A5.1, A5.2, A5.4 are new Rust CLI commands. Straightforward engineering.

A5.3 is the interesting one. The hypothesis is: **the skill just calls the CLI**. The "learning" is the LLM reasoning over the JSON output of `aphoria claims list` and `aphoria verify run --show-unclaimed`. There's no ML model, no vector embeddings, no training loop. Claude reads the claims, understands the patterns, and applies that understanding to new code. The flywheel is prompt engineering, not machine learning.

---

## Research Questions

### 1. Is the "skill calls CLI" pattern sufficient for claim suggestion?

Given that Claude can read JSON output of `aphoria claims list` (typically 5-50 claims) and `aphoria verify run --show-unclaimed` (typically 10-200 unclaimed observations), can it effectively:

- Group existing claims by semantic pattern (not string matching, but understanding that "SeqCst in wallet" and "SeqCst in sync" are the same kind of safety invariant)
- Identify unclaimed observations that match existing claim patterns
- Generate suggested claims with meaningful invariant/consequence text (not template garbage)
- Prioritize suggestions by coverage impact

What are the known limits? At what number of claims/observations does context window become a problem? What's the fallback?

### 2. What skill prompt patterns produce the best "learning from examples" behavior?

When you give an LLM N examples of authored claims and ask it to suggest new ones for unclaimed observations, what prompt structure works best? Specifically:

- Few-shot with existing claims as examples vs. summarized patterns?
- Should the skill prompt include ALL claims or only those in relevant categories?
- Does chain-of-thought ("this observation at sync.rs:42 is analogous to claim wallet-seqcst-001 because both involve atomic ordering") improve suggestion quality?
- How should the skill handle the case where 0 claims exist yet (cold start)?

### 3. What's the right CLI surface for A5.1/A5.2/A5.4?

For the Rust CLI commands that support the flywheel:

- **Coverage** (`aphoria coverage`): What metrics actually matter? Per-module claim density? Unclaimed-to-observation ratio? Something else? What do existing code quality/coverage tools (SonarQube, CodeClimate) show that developers actually act on?
- **Docs generation** (`aphoria docs generate`): What documentation formats do teams actually read? Is claims-explained.md (grouped by category, full provenance per claim) the right format, or should it be structured differently for consumption?
- **Onboarding** (`aphoria explain`): What information do new team members need in their first day/week? Is a narrative format better than structured? Should it be interactive (ask questions) or static?

### 4. How do other tools create "flywheel" effects in developer workflows?

Are there examples of developer tools where:

- The tool gets more useful as developers invest in it
- The "investment" is structured metadata (not just code)
- The tool surfaces that metadata in context-appropriate ways

Examples might include: Conventional Commits + release automation, ADR tools that surface decisions during review, type systems that get stronger with more annotations, architecture fitness functions. What patterns work and which are abandoned?

### 5. Cross-project claim patterns: what's the realistic scope?

A5.3 mentions "confidence grows with consistency across projects." Is this realistic for v1?

- How do tools like ESLint shared configs, Prettier presets, or security policy bundles handle cross-project knowledge?
- Can Trust Packs (our existing export/import format for claim bundles) serve as the cross-project learning mechanism?
- Or should v1 be single-project only, with cross-project deferred?

---

## Methodology

### Phase 1: Prompt Engineering for "Learning from Claims"

- Look at research on few-shot learning with structured data in LLM contexts
- Find examples of LLM-based tools that improve with user-provided examples
- Study Claude Code skill patterns that demonstrate "reads context, applies judgment"
- Look at GitHub Copilot Workspace, Cursor rules, and similar tools that use project-specific context

### Phase 2: Developer Tool Flywheel Patterns

- Study adoption curves of tools that require upfront investment (TypeScript, linting configs, ADRs)
- Find data on what makes developers continue investing vs. abandon tools
- Look at code quality dashboard UX (SonarQube, CodeClimate, Codacy) for coverage metric design

### Phase 3: CLI UX for Knowledge Graph Outputs

- Study how tools present "code knowledge" (Sourcegraph insights, CodeScene, NDepend)
- Find examples of auto-generated documentation that teams actually use
- Look at onboarding documentation formats (READMEs vs. guided tours vs. interactive)

### Phase 4: Synthesis

- Compare findings across sources
- Identify which A5 features are "just engineering" vs. "need careful design"
- Validate or refute the "skill just calls CLI" hypothesis

---

## Deliverables

Produce a report with:

1. **Executive Summary**: Is the "skill calls CLI" hypothesis correct? What's the biggest risk?
2. **Skill Prompt Design**: Concrete recommendations for how the A5.3 skill prompt should be structured, with examples
3. **CLI Command Recommendations**: What `aphoria coverage`, `aphoria docs generate`, and `aphoria explain` should output and why
4. **Flywheel Mechanics**: What specifically creates the positive feedback loop, and what could break it
5. **Cold Start Strategy**: How does A5 work when a project has 0 claims? What's the bootstrapping UX?
6. **Cross-Project Scope**: Include in v1 or defer? With evidence.
7. **Open Questions**: What still needs testing/prototyping

---

## Success Criteria

Research is complete when:

- [ ] The "skill calls CLI" hypothesis has a clear verdict with supporting evidence
- [ ] There's a concrete skill prompt structure (not abstract: actual prompt text or template)
- [ ] CLI output formats are specified with rationale (what developers act on vs. ignore)
- [ ] At least 3 real-world flywheel examples from developer tools are analyzed
- [ ] Cold start problem has a specific solution
- [ ] Cross-project scope has a recommendation with evidence
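
As a concrete reference point for the coverage question (A5.1, research question 3), here is a minimal Python sketch of one candidate metric, the unclaimed-to-observation ratio per module. It is illustrative only: the JSON schema of `aphoria verify run` is not specified in this directive, so the record fields (`file`, `line`, `verdict`), the sample data, and the rule that a module is the first path component are all assumptions. It also does not decide how MISSING verdicts should count.

```python
from collections import defaultdict

# Hypothetical records: the real output schema of `aphoria verify run`
# is not defined in this directive, so these fields are assumptions.
SAMPLE_RESULTS = [
    {"file": "wallet/atomics.rs", "line": 12, "verdict": "PASS"},
    {"file": "wallet/atomics.rs", "line": 40, "verdict": "PASS"},
    {"file": "wallet/ledger.rs", "line": 7, "verdict": "CONFLICT"},
    {"file": "sync/sync.rs", "line": 42, "verdict": "UNCLAIMED"},
    {"file": "sync/sync.rs", "line": 88, "verdict": "UNCLAIMED"},
    {"file": "sync/peer.rs", "line": 5, "verdict": "PASS"},
]


def module_of(path):
    """Treat the first path component as the module (an assumption)."""
    return path.split("/", 1)[0]


def coverage_by_module(results):
    """Return per-module observation counts and the claimed fraction."""
    totals = defaultdict(int)
    unclaimed = defaultdict(int)
    for record in results:
        module = module_of(record["file"])
        totals[module] += 1
        if record["verdict"] == "UNCLAIMED":
            unclaimed[module] += 1
    return {
        module: {
            "observations": totals[module],
            "unclaimed": unclaimed[module],
            # Fraction of observations that some claim speaks to.
            "coverage": (totals[module] - unclaimed[module]) / totals[module],
        }
        for module in totals
    }


def worst_first(report):
    """Order modules by ascending coverage so the largest gaps come first."""
    return sorted(report, key=lambda module: (report[module]["coverage"], module))


report = coverage_by_module(SAMPLE_RESULTS)
for module in worst_first(report):
    print(module, report[module])
```

Sorting worst-first reflects one possible answer to "what developers act on": a short list of the least-covered modules rather than a single project-wide percentage. Whether that ordering is the right default is part of what the research should establish.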