stemedb/research-requests/a5-flywheel-skill-design.md
jml 3b5f88b4f0 feat(aphoria): implement claims architecture (A1-A5) with verify engine, corpus, coverage, and explain
2026-02-08 09:11:47 +00:00


A5 Flywheel Skill Design Research Directive

You are an expert in AI-assisted developer tooling, specifically in designing Claude Code skills (prompt-based tool orchestration). You understand how LLM-based coding assistants work with CLI tools to create intelligent developer workflows.

You are going to research how to design a Claude Code skill that creates a flywheel effect for a code-level claim system: the more claims developers author, the smarter the skill gets at identifying what needs claims and suggesting them.


Context

What We Have

Aphoria is a code truth linter. It has:

  1. 42 regex extractors that scan code and produce observations — structured grep results like imported = true or ordering = SeqCst at a specific file:line.

  2. Authored claims stored in .aphoria/claims.toml — human-written assertions with provenance, invariants, consequences, and evidence. Example:

    [[claim]]
    id = "wallet-seqcst-001"
    concept_path = "maxwell/wallet/atomics/ordering"
    predicate = "required_ordering"
    value = "SeqCst"
    provenance = "Safety analysis by lead developer"
    invariant = "All wallet atomics MUST use SeqCst"
    consequence = "Double-spend race condition"
    authority_tier = "expert"
    evidence = ["wallet ADR-003", "Intel SDM Vol 4"]
    category = "safety"
    
  3. A verification engine (aphoria verify run) that matches claims against observations and produces verdicts: PASS, CONFLICT, MISSING, UNCLAIMED.

  4. An existing claims skill (.claude/skills/aphoria-claims/SKILL.md) that reviews diffs, identifies claimable patterns, checks existing claims, and creates new ones via CLI.

  5. CLI commands:

    • aphoria claims list --format json — all authored claims
    • aphoria claims create --id X --invariant "..." ... — create a claim
    • aphoria verify run --format json --show-unclaimed — verify claims vs code
    • aphoria scan . --show-claims --format json — extract observations
    • aphoria claims explain — render claims-explained markdown
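The four verdicts in item 3 can be illustrated with a small sketch. This is not Aphoria's engine. It assumes that a claim and an observation match when they share the same (concept_path, predicate) pair, which the source does not confirm (the example claim uses required_ordering while the example observation uses ordering). It also assumes that MISSING means a claim with no matching observation and UNCLAIMED means an observation with no matching claim. The Claim and Observation classes, the verify function, the wallet.rs locations, and the sync concept path are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    id: str
    concept_path: str
    predicate: str
    value: str


@dataclass(frozen=True)
class Observation:
    concept_path: str
    predicate: str
    value: str
    location: str  # "file:line"


def verify(claims, observations):
    """Return a list of (verdict, subject, detail) tuples."""
    # ASSUMPTION: claims and observations match on (concept_path, predicate).
    by_key = {}
    for obs in observations:
        by_key.setdefault((obs.concept_path, obs.predicate), []).append(obs)

    results = []
    claimed_keys = set()
    for claim in claims:
        key = (claim.concept_path, claim.predicate)
        claimed_keys.add(key)
        matches = by_key.get(key, [])
        if not matches:
            # The claim exists but nothing in the code was observed for it.
            results.append(("MISSING", claim.id, "no observation found"))
            continue
        for obs in matches:
            verdict = "PASS" if obs.value == claim.value else "CONFLICT"
            results.append((verdict, claim.id, obs.location))

    # Observations that no claim covers.
    for key, group in by_key.items():
        if key not in claimed_keys:
            for obs in group:
                results.append(("UNCLAIMED", obs.location, f"{obs.predicate} = {obs.value}"))
    return results


WALLET = "maxwell/wallet/atomics/ordering"
SYNC = "maxwell/sync/atomics/ordering"  # hypothetical path
sample_claims = [Claim("wallet-seqcst-001", WALLET, "required_ordering", "SeqCst")]
sample_observations = [
    Observation(WALLET, "required_ordering", "SeqCst", "wallet.rs:10"),
    Observation(WALLET, "required_ordering", "Relaxed", "wallet.rs:88"),
    Observation(SYNC, "required_ordering", "Relaxed", "sync.rs:42"),
]
sample_results = verify(sample_claims, sample_observations)
```

With this sample data the wallet claim passes at one location, conflicts at another, and the sync.rs observation is left unclaimed, which is the kind of gap the flywheel features below are meant to surface.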

What We Need (A5 — The Flywheel)

Four features that make the system get smarter with use:

A5.1 Coverage Metrics — "What percentage of this codebase has claims?" Per-module density, unclaimed observation gaps. Probably a new CLI command: aphoria coverage.

A5.2 Auto-Generated Claims-Explained: aphoria docs generate produces full project documentation from the knowledge graph. It groups claims by module, includes provenance chains (claim A supersedes claim B), and highlights coverage gaps.

A5.3 Skill Learning — The skill reads existing claims, recognizes patterns, and when it sees analogous code, suggests new claims. "You have 3 claims about SeqCst ordering in wallet code. This new Ordering::Relaxed in sync.rs — should that also be SeqCst?"

A5.4 Onboarding Mode: aphoria explain produces a narrative such as "This codebase enforces 12 safety invariants. The wallet uses SeqCst everywhere because of double-spend risk (ADR-003)..."
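To make the A5.1 metrics concrete, here is a minimal sketch of per-module coverage. It is not the aphoria coverage command, which does not exist yet. It assumes verification results arrive as records with a verdict and a file field (a hypothetical schema, since the source does not specify the JSON shape), treats the first path component as the module, and defines coverage as the share of observations that some claim covers.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def module_of(path: str) -> str:
    """ASSUMPTION: a module is the first directory component of the file path."""
    parts = PurePosixPath(path).parts
    return parts[0] if len(parts) > 1 else "(root)"


def coverage_by_module(results):
    """Summarize claimed vs. unclaimed observations per module.

    `results` is a list of dicts with hypothetical keys "verdict" and "file".
    """
    counts = defaultdict(lambda: {"claimed": 0, "unclaimed": 0})
    for record in results:
        if record["verdict"] == "MISSING":
            continue  # a claim without an observation has no file to attribute
        bucket = "unclaimed" if record["verdict"] == "UNCLAIMED" else "claimed"
        counts[module_of(record["file"])][bucket] += 1

    report = {}
    for module in sorted(counts):
        c = counts[module]
        total = c["claimed"] + c["unclaimed"]  # always >= 1 for a listed module
        report[module] = {**c, "coverage": c["claimed"] / total}
    return report
```

Whether a CONFLICT should count as covered is a design decision; this sketch counts it, on the reasoning that a claim exists for that observation even though the code violates it.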

The Hypothesis

A5.1, A5.2, A5.4 are new Rust CLI commands. Straightforward engineering.

A5.3 is the interesting one. The hypothesis is: the skill just calls the CLI. The "learning" is the LLM reasoning over the JSON output of aphoria claims list and aphoria verify run --show-unclaimed. There's no ML model, no vector embeddings, no training loop. Claude reads the claims, understands the patterns, and applies that understanding to new code. The flywheel is prompt engineering, not machine learning.
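The "skill just calls the CLI" hypothesis amounts to very little glue code. The sketch below shows the data-gathering step only, using the two commands and flags listed earlier in this document. The shape of the JSON they emit is not specified in the source, so the result is passed through untouched. The runner parameter is a hypothetical seam added so the function can be exercised without the aphoria binary installed.

```python
import json
import subprocess

# Commands and flags exactly as listed under "CLI commands".
CLAIMS_CMD = ["aphoria", "claims", "list", "--format", "json"]
VERIFY_CMD = ["aphoria", "verify", "run", "--format", "json", "--show-unclaimed"]


def run_json(cmd, runner=subprocess.run):
    """Run a command and parse its stdout as JSON."""
    # check=False: the source does not say whether `verify run` exits non-zero
    # when it finds a CONFLICT, so the exit code is deliberately not enforced.
    proc = runner(cmd, capture_output=True, text=True, check=False)
    return json.loads(proc.stdout)


def gather_context(runner=subprocess.run):
    """Collect everything the model reasons over; there is no other state."""
    return {
        "claims": run_json(CLAIMS_CMD, runner),
        "verification": run_json(VERIFY_CMD, runner),
    }
```

Everything that could be called "learning" then happens in how the returned dictionary is placed in the prompt, which is what the research questions below probe.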


Research Questions

1. Is the "skill calls CLI" pattern sufficient for claim suggestion?

Given that Claude can read JSON output of aphoria claims list (typically 5-50 claims) and aphoria verify run --show-unclaimed (typically 10-200 unclaimed observations), can it effectively:

  • Group existing claims by semantic pattern (not string matching — understanding that "SeqCst in wallet" and "SeqCst in sync" are the same kind of safety invariant)
  • Identify unclaimed observations that match existing claim patterns
  • Generate suggested claims with meaningful invariant/consequence text (not template garbage)
  • Prioritize suggestions by coverage impact

What are the known limits? At what number of claims and observations does the context window become a problem? What's the fallback?
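One candidate fallback for the context-window question is to split unclaimed observations into batches that each fit a token budget. The sketch below is an assumption, not a finding: it uses a crude four-characters-per-token estimate instead of a real tokenizer, and the function names and the budget value are hypothetical.

```python
import json


def estimate_tokens(obj) -> int:
    """Rough size estimate. ASSUMPTION: about 4 characters per token."""
    return len(json.dumps(obj)) // 4 + 1


def batch_by_budget(items, budget_tokens):
    """Greedily pack items into batches whose estimated size fits the budget.

    An item larger than the budget still gets a batch of its own, so nothing
    is silently dropped.
    """
    batches, current, used = [], [], 0
    for item in items:
        cost = estimate_tokens(item)
        if current and used + cost > budget_tokens:
            batches.append(current)
            current, used = [], 0
        current.append(item)
        used += cost
    if current:
        batches.append(current)
    return batches
```

Batching trades one large call for several smaller ones, each of which sees the claims but only a slice of the observations; whether that hurts pattern recognition across slices is one of the things the research should establish.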

2. What skill prompt patterns produce the best "learning from examples" behavior?

When you give an LLM N examples of authored claims and ask it to suggest new ones for unclaimed observations, what prompt structure works best? Specifically:

  • Few-shot with existing claims as examples vs. summarized patterns?
  • Should the skill prompt include ALL claims or only those in relevant categories?
  • Does chain-of-thought ("this observation at sync.rs:42 is analogous to claim wallet-seqcst-001 because both involve atomic ordering") improve suggestion quality?
  • How should the skill handle the case where 0 claims exist yet (cold start)?
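The options in this question can be compared more easily with a concrete prompt builder in hand. The sketch below takes one position on each bullet purely for illustration: few-shot examples taken verbatim from authored claims, an optional category filter, an explicit request for analogy-based reasoning, and a separate instruction for the cold-start case. The field names follow the claims.toml example above; the observation fields (file, line, predicate, value), the function name, and the wording of the instructions are assumptions.

```python
def build_prompt(claims, unclaimed, category=None, max_examples=8):
    """Assemble a suggestion prompt from authored claims and unclaimed observations."""
    relevant = [c for c in claims if category is None or c.get("category") == category]

    lines = []
    if relevant:
        lines.append("Existing authored claims (examples of style and intent):")
        for c in relevant[:max_examples]:
            lines.append(f"- [{c['id']}] {c['invariant']} Consequence: {c['consequence']}")
        task = (
            "For each observation, name the most analogous existing claim and "
            "explain the analogy, then suggest a new claim or state that none is warranted."
        )
    else:
        # Cold start (or a filter that matched nothing): no examples to learn from.
        lines.append("No relevant authored claims are available.")
        task = (
            "Propose at most three candidate claims and ask the developer to "
            "confirm each invariant and consequence before anything is created."
        )

    lines += ["", "Unclaimed observations:"]
    for o in unclaimed:
        lines.append(f"- {o['file']}:{o['line']} {o['predicate']} = {o['value']}")
    lines += ["", task]
    return "\n".join(lines)
```

Swapping the verbatim examples for summarized patterns, or dropping the analogy instruction, changes only a few lines here, which makes this a convenient harness for A/B comparisons of the prompt structures listed above.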

3. What's the right CLI surface for A5.1/A5.2/A5.4?

For the Rust CLI commands that support the flywheel:

  • Coverage (aphoria coverage): What metrics actually matter? Per-module claim density? Unclaimed-to-observation ratio? Something else? What do existing code quality/coverage tools (SonarQube, CodeClimate) show that developers actually act on?
  • Docs generation (aphoria docs generate): What documentation formats do teams actually read? Is claims-explained.md (grouped by category, full provenance per claim) the right format, or should it be structured differently for consumption?
  • Onboarding (aphoria explain): What information do new team members need in their first day/week? Is a narrative format better than structured? Should it be interactive (ask questions) or static?
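As a baseline for the docs-generation question, the format named above (grouped by category, full provenance per claim) can be rendered in a few lines. This is a sketch of that one candidate layout, not the output of aphoria docs generate. It uses the field names from the claims.toml example; the function name, the plain-text layout, and the "uncategorized" fallback are assumptions.

```python
from collections import defaultdict


def render_claims_explained(claims):
    """Render claims grouped by category, with provenance for each claim."""
    by_category = defaultdict(list)
    for claim in claims:
        by_category[claim.get("category", "uncategorized")].append(claim)

    lines = ["Claims Explained", ""]
    for category in sorted(by_category):
        group = sorted(by_category[category], key=lambda c: c["id"])
        lines.append(f"{category} ({len(group)})")
        for claim in group:
            lines.append(f"  {claim['id']}: {claim['invariant']}")
            lines.append(f"    consequence: {claim['consequence']}")
            lines.append(f"    provenance: {claim['provenance']} [{claim['authority_tier']}]")
            evidence = ", ".join(claim.get("evidence", [])) or "none recorded"
            lines.append(f"    evidence: {evidence}")
        lines.append("")
    return "\n".join(lines)
```

Having a trivial renderer like this makes it cheap to put alternative structures (by module, by consequence severity, narrative) in front of real readers and see which one they actually use.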

4. How do other tools create "flywheel" effects in developer workflows?

Are there examples of developer tools where:

  • The tool gets more useful as developers invest in it
  • The "investment" is structured metadata (not just code)
  • The tool surfaces that metadata in context-appropriate ways

Examples might include: Conventional Commits + release automation, ADR tools that surface decisions during review, type systems that get stronger with more annotations, architecture fitness functions. Which patterns work, and which get abandoned?

5. Cross-project claim patterns: what's the realistic scope?

A5.3 mentions "confidence grows with consistency across projects." Is this realistic for v1?

  • How do tools like ESLint shared configs, Prettier presets, or security policy bundles handle cross-project knowledge?
  • Can Trust Packs (our existing export/import format for claim bundles) serve as the cross-project learning mechanism?
  • Or should v1 be single-project only, with cross-project deferred?

Methodology

Phase 1: Prompt Engineering for "Learning from Claims"

  • Look at research on few-shot learning with structured data in LLM contexts
  • Find examples of LLM-based tools that improve with user-provided examples
  • Study Claude Code skill patterns that demonstrate "reads context, applies judgment"
  • Look at GitHub Copilot Workspace, Cursor rules, and similar tools that use project-specific context

Phase 2: Developer Tool Flywheel Patterns

  • Study adoption curves of tools that require upfront investment (TypeScript, linting configs, ADRs)
  • Find data on what makes developers continue investing vs. abandon tools
  • Look at code quality dashboard UX (SonarQube, CodeClimate, Codacy) for coverage metric design

Phase 3: CLI UX for Knowledge Graph Outputs

  • Study how tools present "code knowledge" (Sourcegraph insights, CodeScene, NDepend)
  • Find examples of auto-generated documentation that teams actually use
  • Look at onboarding documentation formats (READMEs vs. guided tours vs. interactive)

Phase 4: Synthesis

  • Compare findings across sources
  • Identify which A5 features are "just engineering" vs. "need careful design"
  • Validate or refute the "skill just calls CLI" hypothesis

Deliverables

Produce a report with:

  1. Executive Summary — Is the "skill calls CLI" hypothesis correct? What's the biggest risk?
  2. Skill Prompt Design — Concrete recommendations for how the A5.3 skill prompt should be structured, with examples
  3. CLI Command Recommendations — What aphoria coverage, aphoria docs generate, and aphoria explain should output and why
  4. Flywheel Mechanics — What specifically creates the positive feedback loop, and what could break it
  5. Cold Start Strategy — How does A5 work when a project has 0 claims? What's the bootstrapping UX?
  6. Cross-Project Scope — Include in v1 or defer? With evidence.
  7. Open Questions — What still needs testing/prototyping

Success Criteria

Research is complete when:

  • The "skill calls CLI" hypothesis has a clear verdict with supporting evidence
  • There's a concrete skill prompt structure (not abstract — actual prompt text or template)
  • CLI output formats are specified with rationale (what developers act on vs. ignore)
  • At least 3 real-world flywheel examples from developer tools are analyzed
  • Cold start problem has a specific solution
  • Cross-project scope has a recommendation with evidence