# LLM-Based Wiki Corpus Extraction

Extract factual claims from technical documentation using an LLM skill that intelligently chunks, analyzes, and persists to the corpus database.

## Quick Start

```bash
# Extract claims from a wiki article
cd ~/Workspace/stemedb
claude -p ~/path/to/wiki/article.md --skill extract-wiki-corpus

# Example with actual file
claude -p ~/Workspace/orchard9/wiki/intakes/REQUEST_FOR_RESEARCH_ANSWERS.md \
  --skill extract-wiki-corpus
```

Expected output:

```
Reading article: REQUEST_FOR_RESEARCH_ANSWERS.md (12,450 tokens)
Chunked into 3 segments (by ## headings)

Chunk 1/3: "Critical Compatibility Solutions"
  Extracted 8 claims
  ✓ ml/basicsr/torchvision/incompatible_with = ">=0.15"
  ✓ ml/gpen/gfpgan/outperforms = "eye_enhancement"
  ...

Chunk 2/3: "CUDA 12.9 Compatibility"
  Extracted 5 claims
  ...

Summary: 23 claims extracted, 23 stored successfully
```

## How It Works

### 1. Intelligent Chunking

The skill chunks large articles to fit LLM context limits:

**Strategy:**
- Target: ~4K tokens per chunk
- Break at `##` headings when possible
- Preserve context: Include document title + section path in each chunk

**Example:**

```markdown
# Python Dependency Stack

## Critical Solutions

### BasicSR Fix
[content...]
```

Becomes 3 chunks:
1. `"Python Dependency Stack / Critical Solutions / BasicSR Fix"` + content
2. `"Python Dependency Stack / Critical Solutions / GPEN vs GFPGAN"` + content
3. `"Python Dependency Stack / CUDA Compatibility"` + content

### 2. LLM Claim Extraction

For each chunk, Claude extracts factual assertions as structured JSON:

**Extraction Criteria:**
- Factual (verifiable from text)
- Useful for developers
- Has clear subject/predicate/value

**Example extraction:**

Input text:
```markdown
### BasicSR/Torchvision Fix

The core issue is that basicsr 1.4.2 imports from `torchvision.transforms.functional_tensor`
which was removed in torchvision 0.15+.

**Primary Solution:** git+https://github.com/XPixelGroup/BasicSR@8d56e3a
```

Extracted claim:
```json
{
  "subject": "ml/dependencies/basicsr/torchvision",
  "predicate": "incompatible_with",
  "value": ">=0.15",
  "explanation": "basicsr 1.4.2 imports from torchvision.transforms.functional_tensor which was removed in torchvision 0.15+",
  "authority": "XPixelGroup/BasicSR@8d56e3a",
  "category": "compatibility"
}
```

### 3. Authority Inference

The LLM infers authority sources from context:

| Pattern | Authority Format | Example |
|---------|-----------------|---------|
| GitHub URL | `repo@commit` | `XPixelGroup/BasicSR@8d56e3a` |
| Research paper | `Author et al. (Year)` | `Smith et al. (2023)` |
| Official docs | `Product Documentation` | `PyTorch Documentation` |
| Empirical | `Community consensus` | `Community best practice` |

### 4. Tier Assignment

The skill assigns tiers based on authority source:

| Tier | Authority Type | Examples |
|------|---------------|----------|
| 0 | Regulatory specs | RFC, W3C standards |
| 1 | Authoritative sources | Official docs, research papers |
| 2 | Observational | GitHub repos, community consensus |
| 3 | Empirical | Unverified claims |

**Guidance to LLM:**
- Official standards (RFC, W3C) → Tier 0
- Official documentation, published research → Tier 1
- GitHub repos, maintainer statements → Tier 2
- Community reports, unverified → Tier 3

### 5. Persistence via CLI

Each extracted claim is stored using:

```bash
aphoria corpus create \
  --subject "ml/dependencies/basicsr/torchvision" \
  --predicate "incompatible_with" \
  --value ">=0.15" \
  --explanation "basicsr 1.4.2 imports from torchvision.transforms.functional_tensor which was removed in 0.15+" \
  --authority "XPixelGroup/BasicSR@8d56e3a" \
  --category "compatibility" \
  --tier 2
```

## CLI Reference: `aphoria corpus create`

Create a corpus assertion from structured claim data.
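
To connect the extracted JSON from step 2 with this command, here is a minimal sketch of how a claim could be turned into an argument list. It assumes the claim is available as a Python dict with the six fields shown above and that the tier has already been chosen. `build_create_command` is a hypothetical helper for illustration, not part of Aphoria; it uses only the flags documented in this section.

```python
import shlex

# Fields of an extracted claim, in the order the CLI example passes them.
CLAIM_FIELDS = ("subject", "predicate", "value", "explanation", "authority", "category")


def build_create_command(claim: dict, tier: int) -> list[str]:
    """Build the argv for `aphoria corpus create` (hypothetical helper)."""
    argv = ["aphoria", "corpus", "create"]
    for field in CLAIM_FIELDS:
        # Each claim field maps to a flag of the same name.
        argv += [f"--{field}", str(claim[field])]
    argv += ["--tier", str(tier)]
    return argv


claim = {
    "subject": "ml/dependencies/basicsr/torchvision",
    "predicate": "incompatible_with",
    "value": ">=0.15",
    "explanation": "basicsr 1.4.2 imports from torchvision.transforms.functional_tensor "
                   "which was removed in torchvision 0.15+",
    "authority": "XPixelGroup/BasicSR@8d56e3a",
    "category": "compatibility",
}

argv = build_create_command(claim, tier=2)
# shlex.join quotes each argument so the line can be pasted into a shell.
print(shlex.join(argv))
```

Passing the list to `subprocess.run(argv)` instead of a shell string avoids quoting problems with values such as `>=0.15`.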
**Usage:**
```bash
aphoria corpus create \
  --subject <subject> \
  --predicate <predicate> \
  --value <value> \
  --explanation <explanation> \
  --authority <authority> \
  --category <category> \
  --tier <0-3>
```

**Arguments:**

| Flag | Required | Description | Example |
|------|----------|-------------|---------|
| `--subject` | Yes | Hierarchical path to concept | `ml/basicsr/torchvision` |
| `--predicate` | Yes | Relationship type | `incompatible_with` |
| `--value` | Yes | Value or constraint | `">=0.15"` |
| `--explanation` | Yes | Full context sentence | `"basicsr 1.4.2 imports from..."` |
| `--authority` | Yes | Source citation | `XPixelGroup/BasicSR@8d56e3a` |
| `--category` | Yes | Category tag | `compatibility` |
| `--tier` | Yes | Authority tier (0-3) | `2` |

**Categories:**
- `compatibility` - Dependency constraints, version requirements
- `performance` - Performance characteristics, benchmarks
- `security` - Security properties, vulnerabilities
- `architecture` - Design patterns, structure
- `behavior` - Functional behavior, side effects

**Behavior:**

**Deduplication:** Stores ALL claims, even if the same subject+predicate already exists. Storage is append-only; sourced differing claims are the whole point of Episteme.

**Error Handling:** Bundles all validation errors and presents them together:

```
Error creating corpus assertion:

Validation errors:
  1. --subject: Must be non-empty hierarchical path (got: "")
  2. --tier: Must be 0-3 (got: 5)
  3. --category: Must be one of: compatibility, performance, security, architecture, behavior (got: "random")

Fix all errors and retry.
```

**Example:**

```bash
$ aphoria corpus create \
  --subject "ml/pytorch/version" \
  --predicate "requires" \
  --value ">=2.0" \
  --explanation "Uses torch.compile which requires PyTorch 2.0+" \
  --authority "PyTorch 2.0 Release Notes" \
  --category "compatibility" \
  --tier 1

✓ Created corpus assertion: ml/pytorch/version
  Stored in: ~/.aphoria/corpus-db
```

## Skill Output Format

The `extract-wiki-corpus` skill produces structured output:

```
Reading article: REQUEST_FOR_RESEARCH_ANSWERS.md (12,450 tokens)
Chunked into 3 segments (by ## headings)

Chunk 1/3: "Critical Compatibility Solutions"
  Extracted 8 claims

  1. ml/dependencies/basicsr/torchvision
     incompatible_with = ">=0.15"
     Authority: XPixelGroup/BasicSR@8d56e3a
     ✓ Stored

  2. ml/enhancements/gpen/gfpgan
     outperforms = "eye_enhancement"
     Authority: Research comparison (2023)
     ✓ Stored

  [... 6 more claims ...]

Chunk 2/3: "CUDA 12.9 Compatibility"
  Extracted 5 claims

  9. ml/face_detection/mediapipe/dlib
     preferred_over = "CUDA 12 support"
     Authority: Community consensus
     ✓ Stored

  [... 4 more claims ...]

Chunk 3/3: "Optimized Requirements"
  Extracted 10 claims
  [... all claims ...]

Summary:
  Total claims: 23
  Successfully stored: 23
  Failed: 0

Corpus database: ~/.aphoria/corpus-db
Query: curl 'http://localhost:18180/v1/aphoria/corpus?category=compatibility'
```

**If errors occur:**

```
Summary:
  Total claims: 23
  Successfully stored: 18
  Failed: 5

Errors:
  1. Claim #7 (ml/torch/cuda/version)
     - --tier: Must be 0-3 (got: 5)
     - Fix: LLM assigned invalid tier

  2. Claim #12 (ml/xformers/optional)
     - --subject: Empty subject path
     - Fix: LLM extraction failed

  [... 3 more errors with details ...]

Fix these issues and re-run extraction.
```

## Verification

After extraction, verify claims appear in the corpus:

```bash
# Query all compatibility claims
curl -s 'http://localhost:18180/v1/aphoria/corpus?category=compatibility' | jq '.total_matching'
# Expected: 23 (or however many were extracted)

# Query specific subject
curl -s 'http://localhost:18180/v1/aphoria/corpus' | \
  jq '.items[] | select(.subject | contains("basicsr"))'

# Expected output:
{
  "subject": "ml/dependencies/basicsr/torchvision",
  "predicate": "incompatible_with",
  "value": ">=0.15",
  "source": "ml://",
  "tier": 2,
  "category": "compatibility",
  "explanation": "basicsr 1.4.2 imports from torchvision.transforms.functional_tensor which was removed in 0.15+",
  "authority_source": "XPixelGroup/BasicSR@8d56e3a"
}
```

## Dashboard View

Extracted claims appear in the Aphoria dashboard at `/corpus`:

**Filters:**
- By category: compatibility, performance, security, architecture, behavior
- By tier: 0 (Regulatory), 1 (Authoritative), 2 (Observational), 3 (Empirical)
- By source: ml://, security://, etc.

**Display:**
- Subject path as breadcrumbs: `ml > dependencies > basicsr > torchvision`
- Tier badge with color coding
- Full explanation text
- Authority citation as link (if URL)

## Troubleshooting

**Problem:** Skill chunks too aggressively, loses context

**Solution:** Adjust chunk size in skill configuration (target 4K tokens, can go up to 8K for complex articles)

---

**Problem:** LLM assigns wrong tiers

**Solution:** Refine tier guidance in skill prompt:
- Official standards (RFC, IEEE) → Tier 0
- Official docs, peer-reviewed papers → Tier 1
- GitHub repos, maintainer statements → Tier 2
- Blog posts, community forums → Tier 3

---

**Problem:** Too many failed claims (validation errors)

**Solution:** Check common error patterns:

```bash
# Review failed claims
grep "Failed:" /tmp/extraction-output.log

# Common issues:
# 1. Empty subjects - LLM extraction failed
# 2. Invalid tiers - LLM assigned tier > 3
# 3. Missing required fields - Incomplete extraction
```

Fix by refining the LLM extraction prompt.

---

**Problem:** Duplicate claims (same subject+predicate)

**This is expected behavior.** Episteme stores ALL claims, even duplicates from different sources. This enables:
- Sourced differing opinions (PyTorch docs say X, community says Y)
- Conflict detection (authority says A, codebase does B)
- Historical tracking (claim evolved over time)

To query all claims for a subject:
```bash
curl -s 'http://localhost:18180/v1/aphoria/corpus' | \
  jq '.items[] | select(.subject == "ml/dependencies/basicsr/torchvision")'
```

## Integration with Other Features

**With Scans:**
- Corpus claims act as authority sources
- Aphoria compares scanned observations against corpus
- Conflicts trigger violations

**With Claims Management:**
- Can supersede corpus claims: `aphoria claims supersede `
- Can deprecate outdated corpus: `aphoria claims deprecate `
- Corpus claims have same structure as project claims

**With Dashboard:**
- All corpus claims visible at `/corpus`
- Filterable by category, tier, source
- Click through to see full explanation

## Best Practices

**DO:**
- Extract from authoritative sources (official docs, research)
- Verify claims appear in dashboard after extraction
- Review tier assignments for accuracy
- Include full context in explanations

**DON'T:**
- Extract from opinion pieces or blogs (or use tier 3)
- Skip authority citations (always provide source)
- Use vague subjects ("thing" → "ml/pytorch/feature/specific")
- Ignore validation errors (fix all before considering extraction complete)

## Examples

### Example 1: ML Dependencies

**Input:** `~/wiki/ml-stack.md`
```markdown
## PyTorch CUDA Compatibility

PyTorch 2.6.0 with CUDA 12.6 builds are forward compatible with CUDA 12.9.

Source: PyTorch 2.6 Release Notes
```

**Extraction:**
```bash
claude -p ~/wiki/ml-stack.md --skill extract-wiki-corpus

# Output:
Extracted 1 claim:
  ✓ ml/pytorch/cuda/compatibility
    predicate: forward_compatible_with
    value: "CUDA 12.9"
    tier: 1 (PyTorch 2.6 Release Notes)
```

### Example 2: Security Best Practices

**Input:** `~/wiki/security.md`
```markdown
## Password Hashing

Research shows Argon2 consistently outperforms bcrypt and scrypt for
password hashing in modern environments.

Source: OWASP Password Storage Cheat Sheet (2023)
```

**Extraction:**
```bash
claude -p ~/wiki/security.md --skill extract-wiki-corpus

# Output:
Extracted 1 claim:
  ✓ security/password/hashing/algorithm
    predicate: recommended
    value: "Argon2"
    tier: 1 (OWASP Password Storage Cheat Sheet)
```

### Example 3: Large Article

**Input:** `~/wiki/complete-stack.md` (15,000 tokens)
```markdown
# Complete Python Stack for SDXL

## Critical Solutions
[4,000 tokens]

## Enhancement Libraries
[5,000 tokens]

## CUDA Compatibility
[6,000 tokens]
```

**Extraction:**
```bash
claude -p ~/wiki/complete-stack.md --skill extract-wiki-corpus

# Output:
Reading article: complete-stack.md (15,234 tokens)
Chunked into 3 segments (by ## headings)

Chunk 1/3: "Critical Solutions"
  Extracted 12 claims
  ...

Chunk 2/3: "Enhancement Libraries"
  Extracted 8 claims
  ...

Chunk 3/3: "CUDA Compatibility"
  Extracted 7 claims
  ...

Summary: 27 claims extracted, 27 stored successfully
```

## See Also

- [CLI Reference](../reference/cli-reference.md) - All `aphoria corpus` commands
- [Corpus API](../api-reference.md) - Query corpus programmatically
- [Claims vs Observations](../../README.md#claims-vs-observations) - Key concepts
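
The heading-based split shown in Example 3 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the skill's actual implementation: `chunk_by_headings` is a hypothetical function, it splits only at `##` headings, it does not count tokens or enforce the ~4K target, and it drops any text that appears before the first `##` heading.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_headings(markdown: str) -> list[tuple[str, str]]:
    """Split a document at `##` headings into (section_path, text) pairs."""
    title = ""
    chunks = []
    current_heading = None
    current_lines = []
    in_fence = False

    def flush():
        # Store the section collected so far, prefixed with the document title.
        if current_heading is not None:
            text = "\n".join(current_lines).strip()
            chunks.append((f"{title} / {current_heading}", text))

    for line in markdown.splitlines():
        if line.startswith("```"):
            in_fence = not in_fence  # ignore '#' lines inside code fences
        match = None if in_fence else HEADING.match(line)
        if match and len(match.group(1)) == 1:
            title = match.group(2).strip()
        elif match and len(match.group(1)) == 2:
            flush()
            current_heading = match.group(2).strip()
            current_lines = []
        elif current_heading is not None:
            current_lines.append(line)  # deeper headings stay in the chunk
    flush()
    return chunks


doc = """# Complete Python Stack for SDXL

## Critical Solutions
alpha

## Enhancement Libraries
beta

## CUDA Compatibility
gamma
"""

chunks = chunk_by_headings(doc)
for path, text in chunks:
    print(path, "->", len(text), "chars")
```

Carrying the title and heading in each section path is what lets a chunk be analyzed on its own without losing where it came from.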