stemedb/.claude/skills/extract-wiki-corpus/SKILL.md
jml bb0c33f8d3 fix(api): enable querying of CLI-created community corpus items
## Problem
CLI-created community corpus items (tier 3) were stored correctly but
invisible via API queries. Two issues blocked discoverability:

1. **Prefix mismatch**: API hardcoded 'community://pattern/' for
   aggregated patterns, but CLI creates 'community://rust/http/...' URIs
2. **Query parameter parsing**: Axum's default parser doesn't support
   bracket notation (?sources[]=value) used by the dashboard

Result: 0/22 CLI-created items were queryable.

## Solution

### Fix 1: Broaden Community Prefix
- Changed: 'community://pattern/' → 'community://' in corpus handler
- Impact: Now matches both aggregated patterns AND CLI-created items
- Backward compatible: Broader prefix includes narrower results

### Fix 2: Add QsQuery Extractor
- Added: serde_qs dependency + custom QsQuery extractor
- Supports: Bracket notation for array parameters (?sources[]=a&sources[]=b)
- Compatible: Works with JavaScript URLSearchParams standard
- Tested: 3 new unit tests for extractor behavior

## Verification
- All 22 CLI-created community items now queryable (was 0)
- Source filtering works: community (22), RFC (2), vendor (5)
- Multi-source queries work: ?sources[]=community&sources[]=rfc → 24
- All 89 API tests pass + 3 new extractor tests
- Clippy clean (0 warnings)
- No regressions in existing functionality

## Files Changed
- crates/stemedb-api/Cargo.toml: Add serde_qs dependency
- crates/stemedb-api/src/extractors.rs: New QsQuery extractor (117 lines)
- crates/stemedb-api/src/handlers/aphoria/corpus.rs: Use QsQuery, broaden prefix
- crates/stemedb-api/src/lib.rs: Export extractors module

Also includes: Scale-adaptive thresholds, wiki corpus extraction,
documentation updates, and dashboard UI improvements from prior work.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-09 15:54:35 +00:00

name: extract-wiki-corpus
description: Extract structured claims from wiki documentation using LLM reasoning. Use when importing technical wikis, research docs, or compatibility guides into Aphoria corpus.

Wiki Corpus Extraction Skill

Identity

You are an intelligent claim extraction engine that reads technical documentation and extracts factual, verifiable claims for the Aphoria knowledge corpus.

Your job is to:

  1. Read wiki markdown files
  2. Extract factual claims using LLM reasoning
  3. Generate CLI commands to persist claims in the corpus database
  4. Report comprehensive results with success/failure breakdown

Core Principles

  1. Factual over Normative: Extract what IS (not what SHOULD BE)
  2. Context-Aware Authority: Infer sources from GitHub URLs, paper citations, official docs
  3. Hierarchical Subjects: Build semantic paths (ml/dependencies/basicsr/version)
  4. Intelligent Chunking: Break at headings when possible, ~4K token chunks
  5. Batch Processing: Extract all claims, then execute CLI commands
  6. Bundle Errors: Collect all errors and report them together

Workflow Overview

Phase 1: Discover & Read
    ↓
Step 1.2 (within Phase 1): Verify Commands
    ↓
Phase 2: Intelligent Chunking
    ↓
Phase 3: Claim Extraction (Per Chunk)
    ↓
Phase 4: Validation
    ↓
Phase 5: CLI Execution
    ↓
Phase 6: Summary Report

Phase 1: Discover & Read

Step 1.1: Check Input

  • If file passed via CLI args: use that file
  • If directory passed: walk to find all .md files
  • Use Read tool to get full content of each file

Step 1.2: Verify Aphoria Binary and Commands

Before proceeding, verify that the required commands exist:

# Check Aphoria version
aphoria --version

# Verify corpus create command exists
if ! aphoria corpus --help 2>&1 | grep -q "create"; then
    echo "❌ ERROR: 'aphoria corpus create' command not available"
    echo ""
    echo "This suggests the aphoria binary is out of date."
    echo ""
    echo "Fix options:"
    echo "  1. Rebuild: cargo build --release -p aphoria"
    echo "  2. Check git status: git status"
    echo "  3. Pull latest: git pull && cargo build --release -p aphoria"
    echo ""
    exit 1
fi

echo "✅ Aphoria binary up to date (corpus create available)"

Decision Gate: Command exists? → Proceed to token estimation

Step 1.3: Estimate Token Count

Rough estimate: 1 token ≈ 4 characters

token_count = len(content) / 4

If token_count > 4000, proceed to Phase 2 (chunking). If token_count <= 4000, treat as single chunk.
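
The heuristic and the 4,000-token gate above can be sketched in Python as follows. The names `estimate_tokens`, `needs_chunking`, and `CHUNK_TOKEN_LIMIT` are illustrative and not part of Aphoria. Integer division is used here so the estimate is a whole number.

```python
CHUNK_TOKEN_LIMIT = 4000  # threshold from Step 1.3
CHARS_PER_TOKEN = 4       # rough heuristic: 1 token is about 4 characters


def estimate_tokens(content: str) -> int:
    """Approximate the token count of a document from its character length."""
    return len(content) // CHARS_PER_TOKEN


def needs_chunking(content: str) -> bool:
    """Return True when the document should go through Phase 2 (chunking)."""
    return estimate_tokens(content) > CHUNK_TOKEN_LIMIT
```

A document of exactly 16,000 characters estimates to 4,000 tokens and is therefore still treated as a single chunk.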


Phase 2: Intelligent Chunking

Goal

Split content into ~4K token chunks, preferring heading boundaries.

Algorithm

  1. Try splitting on ## (level 2 headings)

    • Sections should be roughly 4K tokens each
    • If a section is still > 4K, split on ### (level 3 headings)
  2. Include context in each chunk

    • Document title (from # heading)
    • Section path (breadcrumb of headings)
    • Example: "Document: ML Dependencies Guide / Section: Critical Compatibility Solutions / Subsection: BasicSR Fix"
  3. Maintain overlap

    • Include previous heading for context
    • This helps LLM understand relationships

Chunk Metadata Format

{
  "chunk_id": 1,
  "total_chunks": 3,
  "document_title": "ML Dependencies Guide",
  "section_path": "Critical Compatibility Solutions / BasicSR Fix",
  "content": "..."
}
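
One way the chunking algorithm could look in Python is sketched below. It is a simplified illustration under stated assumptions, not the skill's actual implementation: the function names are hypothetical, small sections are not merged, sections that stay oversized after the ### split are left as they are, and headings inside fenced code blocks are not treated specially.

```python
import re

CHUNK_TOKEN_LIMIT = 4000  # ~4K tokens, using 1 token ≈ 4 characters


def estimate_tokens(text: str) -> int:
    return len(text) // 4


def split_on_heading(text: str, level: int) -> list[tuple[str, str]]:
    """Split text at headings of the given level; return (heading, body) pairs.

    Text before the first matching heading is returned with an empty heading.
    """
    pattern = re.compile(rf"^{'#' * level} +(.+)$", re.MULTILINE)
    matches = list(pattern.finditer(text))
    if not matches:
        return [("", text)]
    sections = []
    preamble = text[: matches[0].start()]
    if preamble.strip():
        sections.append(("", preamble))
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections.append((match.group(1).strip(), text[match.start():end]))
    return sections


def chunk_document(markdown: str) -> list[dict]:
    """Build chunk metadata dicts, preferring ## boundaries, then ###."""
    title_match = re.search(r"^# +(.+)$", markdown, re.MULTILINE)
    title = title_match.group(1).strip() if title_match else "Untitled"
    pieces = []  # (section_path, content) pairs
    for h2, body in split_on_heading(markdown, 2):
        if estimate_tokens(body) <= CHUNK_TOKEN_LIMIT:
            pieces.append((h2, body))
            continue
        # Oversized section: fall back to level 3 headings and keep the
        # parent heading in the path so each chunk carries its context.
        for h3, sub in split_on_heading(body, 3):
            pieces.append((" / ".join(p for p in (h2, h3) if p), sub))
    return [
        {"chunk_id": i, "total_chunks": len(pieces), "document_title": title,
         "section_path": path, "content": content}
        for i, (path, content) in enumerate(pieces, start=1)
    ]
```

Each returned dict has the same keys as the chunk metadata format above, so it can be passed directly into the extraction prompt.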

Phase 3: Claim Extraction (Per Chunk)

Prompt the LLM

For each chunk, use a structured extraction prompt:

You are extracting factual claims from technical documentation for a knowledge corpus.

**Context:**
- Document: {document_title}
- Section: {section_path}
- Chunk: {chunk_id}/{total_chunks}

**Content:**
{chunk_content}

**Task:**
Extract all factual claims as JSON array. Each claim must be:
1. Factual (not opinion or speculation)
2. Verifiable from the text
3. Useful for developers

**Authority Inference Rules:**
- GitHub URLs/commits → "Repository/Project@hash"
- Research papers → "Author et al. (Year)"
- Official documentation → "Project Documentation"
- Empirical observation → "Community consensus"

**Tier Assignment:**
- 0: RFC, W3C spec, ISO standard (regulatory)
- 1: OWASP, CWE, security advisory (clinical)
- 2: Project docs, compatibility notes (observational)
- 3: Blog posts, forum consensus (community)

**Output Format:**
```json
[
  {
    "subject": "hierarchical/path/to/concept",
    "predicate": "relationship_type",
    "value": "constraint_or_value",
    "explanation": "full sentence with context",
    "authority": "inferred_source",
    "category": "compatibility|performance|security|architecture|quality",
    "confidence": 0.95,
    "tier": 2
  }
]
```

Return ONLY the JSON array, no additional text.

Expected Output Structure

[
  {
    "subject": "ml/dependencies/basicsr/torchvision",
    "predicate": "incompatible_with",
    "value": ">=0.15",
    "explanation": "basicsr 1.4.2 imports from torchvision.transforms.functional_tensor which was removed in torchvision 0.15+",
    "authority": "XPixelGroup/BasicSR GitHub",
    "category": "compatibility",
    "confidence": 0.95,
    "tier": 2
  }
]

Phase 4: Validation

Step 4.1: Filter by Confidence

Only keep claims where confidence >= 0.7

Step 4.2: Check Required Fields

Each claim must have:

  • subject (non-empty string)
  • predicate (non-empty string)
  • value (any type)
  • explanation (non-empty string)
  • authority (non-empty string)
  • category (one of: compatibility, performance, security, architecture, quality)
  • tier (0-3)

Step 4.3: Validate Tier

Tier must be 0, 1, 2, or 3. If invalid, record error and skip claim.

Step 4.4: Check for Duplicates

Important: The corpus database is append-only. Multiple sources can create the same subject+predicate pair. This is allowed and expected. Do NOT filter duplicates — just warn about them in the report.
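
Steps 4.1 through 4.3 could be implemented along these lines. This is a minimal sketch with hypothetical function names. It assumes a claim with no confidence value counts as below the threshold, which the steps above do not specify.

```python
REQUIRED_STRING_FIELDS = ("subject", "predicate", "explanation", "authority")
CATEGORIES = ("compatibility", "performance", "security", "architecture", "quality")
MIN_CONFIDENCE = 0.7


def validate_claim(claim: dict) -> list[str]:
    """Return every validation error for one claim (empty list if valid)."""
    errors = []
    for field in REQUIRED_STRING_FIELDS:
        value = claim.get(field)
        if not isinstance(value, str) or not value.strip():
            errors.append(f"Missing {field} field")
    if "value" not in claim:  # value may be any type, but must be present
        errors.append("Missing value field")
    if claim.get("category") not in CATEGORIES:
        errors.append(f"Invalid category {claim.get('category')!r}")
    tier = claim.get("tier")
    if isinstance(tier, bool) or tier not in (0, 1, 2, 3):
        errors.append(f"Invalid tier {tier!r} (must be 0-3)")
    confidence = claim.get("confidence", 0.0)  # assumption: missing means 0.0
    if not isinstance(confidence, (int, float)) or confidence < MIN_CONFIDENCE:
        errors.append(f"Confidence below threshold (< {MIN_CONFIDENCE})")
    return errors


def validate_all(claims: list[dict]) -> tuple[list[dict], list[str]]:
    """Split claims into valid ones and bundled error messages; never stop early."""
    valid, errors = [], []
    for claim in claims:
        problems = validate_claim(claim)
        if problems:
            subject = claim.get("subject") or "<no subject>"
            errors.extend(f"{subject} - {problem}" for problem in problems)
        else:
            valid.append(claim)
    return valid, errors
```

Because `validate_all` keeps going after a bad claim, every problem reaches the summary report in one pass.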


Phase 5: CLI Execution

Step 5.1: Construct CLI Commands

For each validated claim, construct:

aphoria corpus create \
  --subject "{subject}" \
  --predicate "{predicate}" \
  --value "{value}" \
  --explanation "{explanation}" \
  --authority "{authority}" \
  --category "{category}" \
  --tier {tier}

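Important: Shell-escape every value that may contain quotes, $, backticks, or other special characters. If a quoting helper such as shlex.quote produces the argument, substitute its output without the surrounding double quotes shown in the template, since the helper adds its own quoting.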
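
A sketch of command construction in Python follows. Building an argument list sidesteps most quoting problems, and `shlex.join` renders a copy-pastable shell line when one is needed. The function names are hypothetical; the flags are the ones shown in the template above.

```python
import shlex


def build_create_command(claim: dict) -> list[str]:
    """Build the argv list for one `aphoria corpus create` call."""
    return [
        "aphoria", "corpus", "create",
        "--subject", str(claim["subject"]),
        "--predicate", str(claim["predicate"]),
        "--value", str(claim["value"]),  # value may be any type; stringify it
        "--explanation", str(claim["explanation"]),
        "--authority", str(claim["authority"]),
        "--category", str(claim["category"]),
        "--tier", str(claim["tier"]),
    ]


def render_for_shell(argv: list[str]) -> str:
    """Render argv as a single shell-safe command line (for logs or the Bash tool)."""
    return shlex.join(argv)
```

Splitting the rendered line with `shlex.split` gives back the original argument list, which is a quick way to confirm that an explanation containing quotes or `$` survived the round trip.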

Step 5.2: Execute Commands

Use the Bash tool to execute each command.

Step 5.3: Collect Results

For each execution:

  • Success: Record the corpus ID (e.g., "corpus://ml/foo/bar/predicate")
  • Failure: Record the full error message
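
Steps 5.2 and 5.3 can be sketched as a loop that never aborts on failure. The sketch assumes the CLI prints the created corpus ID on stdout and exits non-zero on error; neither behavior is confirmed here, so adjust the parsing to the real output. `run_commands` is a hypothetical name.

```python
import subprocess


def run_commands(commands: list[list[str]]) -> tuple[list[str], list[str]]:
    """Run every command; return (stdout of successes, messages of failures)."""
    successes, failures = [], []
    for argv in commands:
        try:
            result = subprocess.run(argv, capture_output=True, text=True)
        except OSError as exc:  # e.g. the binary is not on PATH
            failures.append(f"{argv[0]}: {exc}")
            continue
        if result.returncode == 0:
            successes.append(result.stdout.strip())
        else:
            # Keep the full error text for the summary report.
            failures.append(result.stderr.strip() or f"exit code {result.returncode}")
    return successes, failures
```

Both lists feed straight into the "Stored Claims" and "Storage Errors" parts of the Phase 6 report.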

Phase 6: Summary Report

Report Structure

# Wiki Corpus Extraction Report

**File:** /path/to/wiki/article.md
**Chunks Processed:** 3
**Claims Extracted:** 23
**Claims Stored:** 20
**Errors:** 3

## Stored Claims

| Subject | Predicate | Value | Authority | Tier |
|---------|-----------|-------|-----------|------|
| ml/basicsr/torchvision | incompatible_with | >=0.15 | XPixelGroup/BasicSR | 2 |
| ... | ... | ... | ... | ... |

## Errors

### Validation Errors (2)

1. **ml/foo/bar** - Invalid tier '5' (must be 0-3)
2. **api/rest/foo** - Missing explanation field

### Storage Errors (1)

1. **net/http/timeout** - Database write failed: connection refused

## Next Steps

View corpus items: http://localhost:3000/corpus
Query API: curl 'http://localhost:18180/v1/aphoria/corpus?sources[]=community&limit=100'

Predicate Naming Conventions

Use consistent predicate names to enable effective querying:

| Relationship | Predicate |
|--------------|-----------|
| Version constraint | requires, incompatible_with, compatible_with |
| Recommendation | recommends, discourages |
| Performance | faster_than, slower_than, optimal_for |
| Security | vulnerable_to, mitigates, exposes |
| Configuration | default_value, max_value, required_for |

Subject Path Guidelines

Build hierarchical paths that reflect the domain structure:

Examples

  • ml/dependencies/{package}/{aspect}
    • Example: ml/dependencies/basicsr/torchvision
  • api/{protocol}/{feature}
    • Example: api/rest/authentication
  • security/{category}/{vuln_type}
    • Example: security/input-validation/xss
  • performance/{component}/{metric}
    • Example: performance/database/connection-pool

Principles

  • Start general, get specific
  • Use lowercase with forward slashes
  • Use hyphens for multi-word segments
  • Keep paths under 6-7 levels
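
The principles above can be checked mechanically. The sketch below is illustrative only: `check_subject_path` is a hypothetical helper, the 7-level cap is one reading of "6-7 levels", and the segment pattern (lowercase letters, digits, single hyphens) is an assumption drawn from the examples.

```python
import re

SEGMENT = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")  # e.g. "connection-pool"
MAX_LEVELS = 7  # assumption: upper end of the "6-7 levels" guideline


def check_subject_path(path: str) -> list[str]:
    """Return a list of guideline violations for a subject path (empty if clean)."""
    problems = []
    segments = path.split("/")
    if len(segments) > MAX_LEVELS:
        problems.append(f"too deep: {len(segments)} levels (max {MAX_LEVELS})")
    for segment in segments:
        if not SEGMENT.fullmatch(segment):
            problems.append(f"bad segment {segment!r}")
    return problems
```

For example, `check_subject_path("ml/dependencies/basicsr/torchvision")` returns an empty list, while a path with uppercase letters or spaces reports each offending segment.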

Category Guidelines

Choose the most appropriate category:

| Category | Use When |
|----------|----------|
| compatibility | Version constraints, breaking changes, API compatibility |
| performance | Optimization, resource usage, latency, throughput |
| security | Vulnerabilities, mitigations, attack vectors |
| architecture | Design patterns, module structure, dependencies |
| quality | Code quality, maintainability, best practices |

Authority Tier Guidelines

| Tier | Name | Examples | When to Use |
|------|------|----------|-------------|
| 0 | Regulatory | RFC 7231, W3C spec, ISO 27001 | Official standards bodies |
| 1 | Clinical | OWASP Top 10, CWE-79, NVD | Security advisories, vulnerability databases |
| 2 | Observational | PyTorch docs, GitHub project READMEs | Official project documentation |
| 3 | Community | Blog posts, Stack Overflow, forum threads | Community wisdom, empirical observations |

Error Handling

Validation Errors

Collect all validation errors and report them together. Do NOT stop on the first error.

Example validation errors:

  • Invalid tier (not 0-3)
  • Missing required field
  • Confidence below threshold (< 0.7)

Storage Errors

If a CLI command fails:

  • Capture the full error message
  • Continue with remaining commands
  • Report all failures at the end

LLM Extraction Errors

If the LLM returns invalid JSON:

  • Log the chunk that failed
  • Continue with remaining chunks
  • Report the parsing error in summary
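
A tolerant parser makes this recovery path concrete. The sketch below strips a surrounding markdown code fence if the LLM added one, then records a per-chunk error instead of raising. The function names and the `replies` mapping (chunk ID to raw LLM reply) are hypothetical.

```python
import json
import re

FENCE = "`" * 3  # markdown code fence, built indirectly to keep this block readable
FENCED_BLOCK = re.compile(
    re.escape(FENCE) + r"(?:json)?\s*(.*?)" + re.escape(FENCE), re.DOTALL
)


def parse_claims(raw: str) -> list[dict]:
    """Parse one LLM reply into a list of claims; raise ValueError on failure."""
    text = raw.strip()
    fenced = FENCED_BLOCK.search(text)
    if fenced:  # the model wrapped the array in a code fence
        text = fenced.group(1).strip()
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"invalid JSON: {exc}") from exc
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of claims")
    return data


def extract_all(replies: dict[int, str]) -> tuple[list[dict], list[str]]:
    """Parse every chunk's reply, collecting errors rather than stopping."""
    claims, errors = [], []
    for chunk_id, raw in replies.items():
        try:
            claims.extend(parse_claims(raw))
        except ValueError as exc:
            errors.append(f"chunk {chunk_id}: {exc}")
    return claims, errors
```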

Do's and Don'ts

Do

  • Extract factual claims (not opinions)
  • Verify command availability before execution
  • Infer authority from context
  • Generate semantic subject paths
  • Include full explanation context
  • Bundle errors for batch reporting
  • Use Read tool to get file content
  • Use Bash tool to execute CLI commands
  • Filter by confidence >= 0.7
  • Allow duplicate subject+predicate (append-only DB)

Do Not

  • Extract opinions or speculative claims
  • Assume binary is up to date
  • Lose source attribution
  • Hardcode authority (infer from content)
  • Stop on first error (collect all errors)
  • Modify files (read-only skill)
  • Use placeholder values
  • Skip validation
  • Filter duplicates (append-only allows them)

Example Extraction

Input Text

## BasicSR and Torchvision Compatibility

The BasicSR library (v1.4.2) has a critical compatibility issue with torchvision >= 0.15.
The library imports from `torchvision.transforms.functional_tensor`, which was removed
in torchvision 0.15+.

Source: https://github.com/XPixelGroup/BasicSR/issues/123

Recommended workaround: Pin torchvision to 0.14.1 or earlier.

Extracted Claims

[
  {
    "subject": "ml/dependencies/basicsr/torchvision",
    "predicate": "incompatible_with",
    "value": ">=0.15",
    "explanation": "basicsr 1.4.2 imports from torchvision.transforms.functional_tensor which was removed in torchvision 0.15+",
    "authority": "XPixelGroup/BasicSR#123",
    "category": "compatibility",
    "confidence": 0.95,
    "tier": 2
  },
  {
    "subject": "ml/dependencies/basicsr/torchvision",
    "predicate": "recommends",
    "value": "<=0.14.1",
    "explanation": "Workaround for basicsr compatibility issue: pin torchvision to 0.14.1 or earlier",
    "authority": "XPixelGroup/BasicSR#123",
    "category": "compatibility",
    "confidence": 0.9,
    "tier": 3
  }
]

CLI Commands

aphoria corpus create \
  --subject "ml/dependencies/basicsr/torchvision" \
  --predicate "incompatible_with" \
  --value ">=0.15" \
  --explanation "basicsr 1.4.2 imports from torchvision.transforms.functional_tensor which was removed in torchvision 0.15+" \
  --authority "XPixelGroup/BasicSR#123" \
  --category "compatibility" \
  --tier 2

aphoria corpus create \
  --subject "ml/dependencies/basicsr/torchvision" \
  --predicate "recommends" \
  --value "<=0.14.1" \
  --explanation "Workaround for basicsr compatibility issue: pin torchvision to 0.14.1 or earlier" \
  --authority "XPixelGroup/BasicSR#123" \
  --category "compatibility" \
  --tier 3

Related Skills

  • extract-claims: Entity-level extraction from prose (for StemeDB ingestion)
  • aphoria-suggest: Suggest claims from existing patterns
  • aphoria-claims: Author claims from diffs

Implementation Notes

Token Counting

Use rough heuristic: token_count = len(content) / 4

This is approximate but good enough for chunking decisions.

Shell Escaping

When constructing CLI commands, properly escape strings:

import shlex

escaped_explanation = shlex.quote(explanation)

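Or in bash, let the printf builtin quote the whole string. Escaping only double quotes would leave $, backticks, and backslashes live inside a double-quoted argument:

printf -v escaped_explanation '%q' "$explanation"  # %q emits a shell-safe form of the string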

Confidence Threshold

Only extract claims with confidence >= 0.7. This filters out:

  • Speculative statements
  • Uncertain inferences
  • Low-quality extractions

Append-Only Semantics

The corpus database is append-only. Multiple sources can contribute claims for the same subject+predicate. This enables:

  • Cross-validation from multiple sources
  • Community consensus building
  • Evolving knowledge over time

Do NOT filter duplicates. Just warn about them in the report.
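
The warning step might look like the sketch below. It only sees duplicates within the current batch, not against claims already in the database, and `duplicate_warnings` is a hypothetical helper name.

```python
from collections import Counter


def duplicate_warnings(claims: list[dict]) -> list[str]:
    """List subject+predicate pairs that occur more than once; nothing is removed."""
    counts = Counter((claim["subject"], claim["predicate"]) for claim in claims)
    return [
        f"{subject} + {predicate} appears {n} times (kept; corpus is append-only)"
        for (subject, predicate), n in counts.items()
        if n > 1
    ]
```

The input list is left untouched, so every claim still goes on to Phase 5.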


Success Criteria

A successful extraction should:

  1. Read every markdown file in the input (a single file, or all .md files in a directory)
  2. Extract factual claims with proper structure
  3. Infer authority from context (GitHub URLs, docs, etc.)
  4. Assign appropriate tiers (0-3)
  5. Execute CLI commands successfully
  6. Report comprehensive summary with errors bundled
  7. Handle validation errors gracefully
  8. Handle storage errors gracefully
  9. Generate semantic subject paths
  10. Use consistent predicate naming

Troubleshooting

"Command not found" or "unrecognized subcommand 'create'" Errors

If you see error: unrecognized subcommand 'create' or similar errors:

Diagnosis:

  1. Check binary date: ls -lh target/release/aphoria
  2. Check CLI code date: ls -lh applications/aphoria/src/cli/mod.rs
  3. If CLI is newer: The binary is out of date

Solutions:

# Option 1: Rebuild the binary
cargo build --release -p aphoria

# Option 2: Pull latest changes and rebuild
git pull && cargo build --release -p aphoria

# Option 3: Check if there are uncommitted changes
git status

Prevention: See Fix #1 for setting up git hooks that automatically rebuild binaries on pull.

"Database already open" error

The corpus database at ~/.aphoria/corpus-db is locked by another process (probably the API server).

Solution: Stop the API server temporarily:

pkill -f stemedb-api

"Invalid tier" error

Tier must be 0, 1, 2, or 3.

Solution: Review tier assignment rules and fix the extracted tier value.

"Missing required field" error

All claims must have: subject, predicate, value, explanation, authority, category, tier.

Solution: Review the LLM extraction prompt and ensure all fields are present.

LLM returns invalid JSON

The LLM might return markdown formatting or extra text.

Solution: Update the extraction prompt to be more explicit about returning ONLY the JSON array.