---
name: extract-wiki-corpus
description: Extract structured claims from wiki documentation using LLM reasoning. Use when importing technical wikis, research docs, or compatibility guides into Aphoria corpus.
---

# Wiki Corpus Extraction Skill

## Identity

You are an intelligent claim extraction engine that reads technical documentation and extracts factual, verifiable claims for the Aphoria knowledge corpus.

Your job is to:

1. Read wiki markdown files
2. Extract factual claims using LLM reasoning
3. Generate CLI commands to persist claims in the corpus database
4. Report comprehensive results with success/failure breakdown

## Core Principles

1. **Factual over Normative**: Extract what IS (not what SHOULD BE)
2. **Context-Aware Authority**: Infer sources from GitHub URLs, paper citations, official docs
3. **Hierarchical Subjects**: Build semantic paths (ml/dependencies/basicsr/version)
4. **Intelligent Chunking**: Break at headings when possible, ~4K token chunks
5. **Batch Processing**: Extract all claims, then execute CLI commands
6. **Bundle Errors**: Collect all errors and report them together

## Workflow Overview

```
Phase 1: Discover & Read
    ↓
Step 1.2: Verify Commands
    ↓
Phase 2: Intelligent Chunking
    ↓
Phase 3: Claim Extraction (Per Chunk)
    ↓
Phase 4: Validation
    ↓
Phase 5: CLI Execution
    ↓
Phase 6: Summary Report
```

---

## Phase 1: Discover & Read

### Step 1.1: Check Input

- If file passed via CLI args: use that file
- If directory passed: walk to find all `.md` files
- Use Read tool to get full content of each file

### Step 1.2: Verify Aphoria Binary and Commands

Before proceeding, verify that the required commands exist:

```bash
# Check Aphoria version
aphoria --version

# Verify corpus create command exists
if ! aphoria corpus --help 2>&1 | grep -q "create"; then
  echo "❌ ERROR: 'aphoria corpus create' command not available"
  echo ""
  echo "This suggests the aphoria binary is out of date."
  echo ""
  echo "Fix options:"
  echo "  1. Rebuild: cargo build --release -p aphoria"
  echo "  2. Check git status: git status"
  echo "  3. Pull latest: git pull && cargo build --release -p aphoria"
  echo ""
  exit 1
fi

echo "✅ Aphoria binary up to date (corpus create available)"
```

**Decision Gate:** Command exists? → Proceed to token estimation

### Step 1.3: Estimate Token Count

Rough estimate: **1 token ≈ 4 characters**

```
token_count = len(content) / 4
```

If `token_count > 4000`, proceed to Phase 2 (chunking). If `token_count <= 4000`, treat as single chunk.

---

## Phase 2: Intelligent Chunking

### Goal

Split content into ~4K token chunks, preferring heading boundaries.

### Algorithm

1. **Try splitting on `## ` (level 2 headings)**
   - Sections should be roughly 4K tokens each
   - If a section is still > 4K, split on `### ` (level 3 headings)
2. **Include context in each chunk**
   - Document title (from `# ` heading)
   - Section path (breadcrumb of headings)
   - Example: "Document: ML Dependencies Guide / Section: Critical Compatibility Solutions / Subsection: BasicSR Fix"
3. **Maintain overlap**
   - Include previous heading for context
   - This helps LLM understand relationships

### Chunk Metadata Format

```json
{
  "chunk_id": 1,
  "total_chunks": 3,
  "document_title": "ML Dependencies Guide",
  "section_path": "Critical Compatibility Solutions / BasicSR Fix",
  "content": "..."
}
```

---

## Phase 3: Claim Extraction (Per Chunk)

### Prompt the LLM

For each chunk, use a structured extraction prompt:

````
You are extracting factual claims from technical documentation for a knowledge corpus.

**Context:**
- Document: {document_title}
- Section: {section_path}
- Chunk: {chunk_id}/{total_chunks}

**Content:**
{chunk_content}

**Task:**
Extract all factual claims as JSON array. Each claim must be:
1. Factual (not opinion or speculation)
2. Verifiable from the text
3. Useful for developers

**Authority Inference Rules:**
- GitHub URLs/commits → "Repository/Project@hash"
- Research papers → "Author et al. (Year)"
- Official documentation → "Project Documentation"
- Empirical observation → "Community consensus"

**Tier Assignment:**
- 0: RFC, W3C spec, ISO standard (regulatory)
- 1: OWASP, CWE, security advisory (clinical)
- 2: Project docs, compatibility notes (observational)
- 3: Blog posts, forum consensus (community)

**Output Format:**
```json
[
  {
    "subject": "hierarchical/path/to/concept",
    "predicate": "relationship_type",
    "value": "constraint_or_value",
    "explanation": "full sentence with context",
    "authority": "inferred_source",
    "category": "compatibility|performance|security|architecture|quality",
    "confidence": 0.95,
    "tier": 2
  }
]
```

Return ONLY the JSON array, no additional text.
````

### Expected Output Structure

```json
[
  {
    "subject": "ml/dependencies/basicsr/torchvision",
    "predicate": "incompatible_with",
    "value": ">=0.15",
    "explanation": "basicsr 1.4.2 imports from torchvision.transforms.functional_tensor which was removed in torchvision 0.15+",
    "authority": "XPixelGroup/BasicSR GitHub",
    "category": "compatibility",
    "confidence": 0.95,
    "tier": 2
  }
]
```

---

## Phase 4: Validation

### Step 4.1: Filter by Confidence

Only keep claims where `confidence >= 0.7`

### Step 4.2: Check Required Fields

Each claim must have:

- `subject` (non-empty string)
- `predicate` (non-empty string)
- `value` (any type)
- `explanation` (non-empty string)
- `authority` (non-empty string)
- `category` (one of: compatibility, performance, security, architecture, quality)
- `tier` (0-3)

### Step 4.3: Validate Tier

Tier must be 0, 1, 2, or 3. If invalid, record error and skip claim.

### Step 4.4: Check for Duplicates

**Important**: The corpus database is **append-only**. Multiple sources can create the same `subject+predicate` pair. This is **allowed and expected**.

Do NOT filter duplicates; just warn about them in the report.
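
The four validation steps above can be sketched as a single pass that collects every problem instead of stopping at the first one. This is a minimal illustration, not the skill's actual implementation: `validate_claims`, the constant names, and the `(subject, message)` error shape are hypothetical, and treating a missing `confidence` as below threshold is an assumption.

```python
from collections import Counter

REQUIRED_FIELDS = ("subject", "predicate", "value", "explanation",
                   "authority", "category", "tier")
NON_EMPTY_STRINGS = ("subject", "predicate", "explanation", "authority")
CATEGORIES = {"compatibility", "performance", "security", "architecture", "quality"}
MIN_CONFIDENCE = 0.7


def validate_claims(claims):
    """Return (valid, errors, warnings); every claim is checked, none aborts the run."""
    valid, errors = [], []
    for claim in claims:
        subject = claim.get("subject") or "<no subject>"

        # Step 4.1: confidence filter (assumption: missing confidence counts as too low)
        confidence = claim.get("confidence")
        if not isinstance(confidence, (int, float)) or confidence < MIN_CONFIDENCE:
            errors.append((subject, f"confidence below threshold (< {MIN_CONFIDENCE})"))
            continue

        # Step 4.2: required fields, with non-empty strings where the list demands them
        missing = [f for f in REQUIRED_FIELDS if f not in claim]
        missing += [f for f in NON_EMPTY_STRINGS
                    if f in claim and not (isinstance(claim[f], str) and claim[f].strip())]
        if missing:
            errors.append((subject, "missing or empty field(s): " + ", ".join(missing)))
            continue
        category = claim["category"]
        if not isinstance(category, str) or category not in CATEGORIES:
            errors.append((subject, f"invalid category {category!r}"))
            continue

        # Step 4.3: tier must be the integer 0, 1, 2, or 3
        tier = claim["tier"]
        if type(tier) is not int or not 0 <= tier <= 3:
            errors.append((subject, f"invalid tier {tier!r} (must be 0-3)"))
            continue

        valid.append(claim)

    # Step 4.4: duplicates are kept (append-only corpus); they only produce warnings
    counts = Counter((c["subject"], c["predicate"]) for c in valid)
    warnings = [f"{s} {p}: {n} claims share this subject+predicate"
                for (s, p), n in counts.items() if n > 1]
    return valid, errors, warnings
```

The `errors` and `warnings` lists can be carried unchanged into the Phase 6 report, which keeps the "bundle errors" principle intact.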
---

## Phase 5: CLI Execution

### Step 5.1: Construct CLI Commands

For each validated claim, construct:

```bash
aphoria corpus create \
  --subject "{subject}" \
  --predicate "{predicate}" \
  --value "{value}" \
  --explanation "{explanation}" \
  --authority "{authority}" \
  --category "{category}" \
  --tier {tier}
```

**Important**: Use proper shell escaping for strings with quotes or special characters.

### Step 5.2: Execute Commands

Use the Bash tool to execute each command.

### Step 5.3: Collect Results

For each execution:

- **Success**: Record the corpus ID (e.g., "corpus://ml/foo/bar/predicate")
- **Failure**: Record the full error message

---

## Phase 6: Summary Report

### Report Structure

```markdown
# Wiki Corpus Extraction Report

**File:** /path/to/wiki/article.md
**Chunks Processed:** 3
**Claims Extracted:** 23
**Claims Stored:** 20
**Errors:** 3

## Stored Claims

| Subject | Predicate | Value | Authority | Tier |
|---------|-----------|-------|-----------|------|
| ml/basicsr/torchvision | incompatible_with | >=0.15 | XPixelGroup/BasicSR | 2 |
| ... | ... | ... | ... | ... |

## Errors

### Validation Errors (2)

1. **ml/foo/bar** - Invalid tier '5' (must be 0-3)
2. **api/rest/foo** - Missing explanation field

### Storage Errors (1)

1. **net/http/timeout** - Database write failed: connection refused

## Next Steps

View corpus items: http://localhost:3000/corpus
Query API: curl 'http://localhost:18180/v1/aphoria/corpus?sources[]=community&limit=100'
```

---

## Predicate Naming Conventions

Use consistent predicate names to enable effective querying:

| Relationship | Predicate |
|--------------|-----------|
| Version constraint | `requires`, `incompatible_with`, `compatible_with` |
| Recommendation | `recommends`, `discourages` |
| Performance | `faster_than`, `slower_than`, `optimal_for` |
| Security | `vulnerable_to`, `mitigates`, `exposes` |
| Configuration | `default_value`, `max_value`, `required_for` |

---

## Subject Path Guidelines

Build hierarchical paths that reflect the domain structure:

### Examples

- `ml/dependencies/{package}/{aspect}`
  - Example: `ml/dependencies/basicsr/torchvision`
- `api/{protocol}/{feature}`
  - Example: `api/rest/authentication`
- `security/{category}/{vuln_type}`
  - Example: `security/input-validation/xss`
- `performance/{component}/{metric}`
  - Example: `performance/database/connection-pool`

### Principles

- Start general, get specific
- Use lowercase with forward slashes
- Use hyphens for multi-word segments
- Keep paths under 6-7 levels

---

## Category Guidelines

Choose the most appropriate category:

| Category | Use When |
|----------|----------|
| `compatibility` | Version constraints, breaking changes, API compatibility |
| `performance` | Optimization, resource usage, latency, throughput |
| `security` | Vulnerabilities, mitigations, attack vectors |
| `architecture` | Design patterns, module structure, dependencies |
| `quality` | Code quality, maintainability, best practices |

---

## Authority Tier Guidelines

| Tier | Name | Examples | When to Use |
|------|------|----------|-------------|
| 0 | Regulatory | RFC 7231, W3C spec, ISO 27001 | Official standards bodies |
| 1 | Clinical | OWASP Top 10, CWE-79, NVD | Security advisories, vulnerability databases |
| 2 | Observational | PyTorch docs, GitHub project READMEs | Official project documentation |
| 3 | Community | Blog posts, Stack Overflow, forum threads | Community wisdom, empirical observations |

---

## Error Handling

### Validation Errors

Collect all validation errors and report them together. Do NOT stop on the first error.

Example validation errors:

- Invalid tier (not 0-3)
- Missing required field
- Confidence below threshold (< 0.7)

### Storage Errors

If a CLI command fails:

- Capture the full error message
- Continue with remaining commands
- Report all failures at the end

### LLM Extraction Errors

If the LLM returns invalid JSON:

- Log the chunk that failed
- Continue with remaining chunks
- Report the parsing error in summary

---

## Do's and Don'ts

### Do

- ✅ Extract factual claims (not opinions)
- ✅ Verify command availability before execution
- ✅ Infer authority from context
- ✅ Generate semantic subject paths
- ✅ Include full explanation context
- ✅ Bundle errors for batch reporting
- ✅ Use Read tool to get file content
- ✅ Use Bash tool to execute CLI commands
- ✅ Filter by confidence >= 0.7
- ✅ Allow duplicate subject+predicate (append-only DB)

### Do Not

- ❌ Extract opinions or speculative claims
- ❌ Assume binary is up to date
- ❌ Lose source attribution
- ❌ Hardcode authority (infer from content)
- ❌ Stop on first error (collect all errors)
- ❌ Modify files (read-only skill)
- ❌ Use placeholder values
- ❌ Skip validation
- ❌ Filter duplicates (append-only allows them)

---

## Example Extraction

### Input Text

```markdown
## BasicSR and Torchvision Compatibility

The BasicSR library (v1.4.2) has a critical compatibility issue with torchvision >= 0.15.
The library imports from `torchvision.transforms.functional_tensor`, which was removed in torchvision 0.15+.

Source: https://github.com/XPixelGroup/BasicSR/issues/123

Recommended workaround: Pin torchvision to 0.14.1 or earlier.
```

### Extracted Claims

```json
[
  {
    "subject": "ml/dependencies/basicsr/torchvision",
    "predicate": "incompatible_with",
    "value": ">=0.15",
    "explanation": "basicsr 1.4.2 imports from torchvision.transforms.functional_tensor which was removed in torchvision 0.15+",
    "authority": "XPixelGroup/BasicSR#123",
    "category": "compatibility",
    "confidence": 0.95,
    "tier": 2
  },
  {
    "subject": "ml/dependencies/basicsr/torchvision",
    "predicate": "recommends",
    "value": "<=0.14.1",
    "explanation": "Workaround for basicsr compatibility issue: pin torchvision to 0.14.1 or earlier",
    "authority": "XPixelGroup/BasicSR#123",
    "category": "compatibility",
    "confidence": 0.9,
    "tier": 3
  }
]
```

### CLI Commands

```bash
aphoria corpus create \
  --subject "ml/dependencies/basicsr/torchvision" \
  --predicate "incompatible_with" \
  --value ">=0.15" \
  --explanation "basicsr 1.4.2 imports from torchvision.transforms.functional_tensor which was removed in torchvision 0.15+" \
  --authority "XPixelGroup/BasicSR#123" \
  --category "compatibility" \
  --tier 2

aphoria corpus create \
  --subject "ml/dependencies/basicsr/torchvision" \
  --predicate "recommends" \
  --value "<=0.14.1" \
  --explanation "Workaround for basicsr compatibility issue: pin torchvision to 0.14.1 or earlier" \
  --authority "XPixelGroup/BasicSR#123" \
  --category "compatibility" \
  --tier 3
```

---

## Related Skills

- **extract-claims**: Entity-level extraction from prose (for StemeDB ingestion)
- **aphoria-suggest**: Suggest claims from existing patterns
- **aphoria-claims**: Author claims from diffs

---

## Implementation Notes

### Token Counting

Use a rough heuristic: `token_count = len(content) / 4`

This is approximate but good enough for chunking decisions.
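
As an illustration of how the token heuristic drives the chunking decision, the sketch below estimates tokens and splits an oversized document on level 2 headings, falling back to level 3 headings for sections that are still too large. The function names are hypothetical, and the sketch deliberately omits the chunk metadata and heading overlap described in Phase 2. It also assumes heading markers never appear at the start of a line inside fenced code blocks.

```python
import re

MAX_TOKENS = 4000


def estimate_tokens(text: str) -> int:
    """Rough heuristic: one token per four characters."""
    return len(text) // 4


def split_on_heading(text: str, level: int) -> list[str]:
    """Split before each heading with exactly `level` hashes, keeping heading and body together."""
    pattern = re.compile(rf"^(?={'#' * level} )", flags=re.MULTILINE)
    return [part for part in pattern.split(text) if part.strip()]


def chunk_markdown(content: str) -> list[str]:
    """Return one chunk for small documents, heading-aligned chunks otherwise."""
    if estimate_tokens(content) <= MAX_TOKENS:
        return [content]
    chunks = []
    for section in split_on_heading(content, 2):
        if estimate_tokens(section) > MAX_TOKENS:
            # Section is still too large: fall back to level 3 headings
            chunks.extend(split_on_heading(section, 3))
        else:
            chunks.append(section)
    return chunks
```

A section with no level 3 headings is returned whole even if it exceeds the limit, so a production version would need a further fallback such as paragraph splitting.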
### Shell Escaping

When constructing CLI commands, properly escape strings:

```python
import shlex

escaped_explanation = shlex.quote(explanation)
```

Or in bash:

```bash
printf -v escaped_explanation '%q' "$explanation"  # Quote the value as one safe shell word
```

Escaping only double quotes is not enough: backticks, `$`, and backslashes remain active inside a double-quoted string, and wiki text often contains them. Both forms above produce a complete shell word, so do not wrap the result in additional quotes.

### Confidence Threshold

Only extract claims with `confidence >= 0.7`. This filters out:

- Speculative statements
- Uncertain inferences
- Low-quality extractions

### Append-Only Semantics

The corpus database is append-only. Multiple sources can contribute claims for the same `subject+predicate`. This enables:

- Cross-validation from multiple sources
- Community consensus building
- Evolving knowledge over time

Do NOT filter duplicates. Just warn about them in the report.

---

## Success Criteria

A successful extraction should:

1. ✅ Read all markdown files in the input directory
2. ✅ Extract factual claims with proper structure
3. ✅ Infer authority from context (GitHub URLs, docs, etc.)
4. ✅ Assign appropriate tiers (0-3)
5. ✅ Execute CLI commands successfully
6. ✅ Report comprehensive summary with errors bundled
7. ✅ Handle validation errors gracefully
8. ✅ Handle storage errors gracefully
9. ✅ Generate semantic subject paths
10. ✅ Use consistent predicate naming

---

## Troubleshooting

### "Command not found" or "unrecognized subcommand 'create'" Errors

If you see `error: unrecognized subcommand 'create'` or similar errors:

**Diagnosis:**

1. **Check binary date**: `ls -lh target/release/aphoria`
2. **Check CLI code date**: `ls -lh applications/aphoria/src/cli/mod.rs`
3. **If CLI is newer**: The binary is out of date

**Solutions:**

```bash
# Option 1: Rebuild the binary
cargo build --release -p aphoria

# Option 2: Pull latest changes and rebuild
git pull && cargo build --release -p aphoria

# Option 3: Check if there are uncommitted changes
git status
```

**Prevention:** See Fix #1 for setting up git hooks that automatically rebuild binaries on pull.
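
The date comparison in the diagnosis steps can also be done programmatically. The sketch below compares modification times of the two paths named above; `binary_is_stale` is a hypothetical helper, and treating a missing binary as stale is an assumption.

```python
from pathlib import Path


def binary_is_stale(binary: Path, source: Path) -> bool:
    """True when the CLI source was modified after the binary was built, or no binary exists."""
    if not binary.exists():
        return True
    return source.stat().st_mtime > binary.stat().st_mtime


if __name__ == "__main__":
    binary = Path("target/release/aphoria")
    source = Path("applications/aphoria/src/cli/mod.rs")
    if source.exists() and binary_is_stale(binary, source):
        print("Binary looks out of date; rebuild with: cargo build --release -p aphoria")
```

Modification times only approximate build freshness: a checkout can touch the source file without changing the CLI, so a positive result is a hint to rebuild rather than proof of a mismatch.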
### "Database already open" error

The corpus database at `~/.aphoria/corpus-db` is locked by another process (probably the API server).

**Solution**: Stop the API server temporarily:

```bash
pkill -f stemedb-api
```

### "Invalid tier" error

Tier must be 0, 1, 2, or 3.

**Solution**: Review tier assignment rules and fix the extracted tier value.

### "Missing required field" error

All claims must have: subject, predicate, value, explanation, authority, category, tier.

**Solution**: Review the LLM extraction prompt and ensure all fields are present.

### LLM returns invalid JSON

The LLM might return markdown formatting or extra text.

**Solution**: Update the extraction prompt to be more explicit about returning ONLY the JSON array.
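
Alongside a stricter prompt, the parsing step itself can tolerate the two most common deviations: a markdown fence around the array and explanatory text before or after it. The following is a minimal sketch under that assumption; `parse_claims` is a hypothetical helper, not part of Aphoria, and it can be misled by a reply whose string values themselves contain triple backticks.

```python
import json
import re

# Matches the body of the first fenced block, with or without a "json" tag
FENCE_RE = re.compile(r"```(?:json)?\s*(.*?)```", re.DOTALL)


def parse_claims(reply: str) -> list:
    """Parse a JSON array from an LLM reply; raise ValueError if none can be recovered."""
    text = reply.strip()
    fenced = FENCE_RE.search(text)
    if fenced:
        text = fenced.group(1).strip()
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        # Fall back to the outermost [...] span to drop surrounding prose
        start, end = text.find("["), text.rfind("]")
        if start == -1 or end <= start:
            raise ValueError("no JSON array found in LLM reply")
        data = json.loads(text[start:end + 1])
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of claims")
    return data
```

Because `json.JSONDecodeError` is a subclass of `ValueError`, a caller can catch `ValueError` once per chunk, log the failing chunk, and continue with the remaining chunks as the error-handling rules require.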