stemedb/applications/aphoria/docs/guides/llm-wiki-extraction.md
jml 9bfa626203 docs: reorganize documentation structure for clarity
2026-02-11 07:33:40 +00:00


LLM-Based Wiki Corpus Extraction

Extract factual claims from technical documentation using an LLM skill that chunks each article, extracts claims from every chunk, and persists them to the corpus database.

Quick Start

# Extract claims from a wiki article
cd ~/Workspace/stemedb
claude -p ~/path/to/wiki/article.md --skill extract-wiki-corpus

# Example with actual file
claude -p ~/Workspace/orchard9/wiki/intakes/REQUEST_FOR_RESEARCH_ANSWERS.md \
  --skill extract-wiki-corpus

Expected output:

Reading article: REQUEST_FOR_RESEARCH_ANSWERS.md (12,450 tokens)
Chunked into 3 segments (by ## headings)

Chunk 1/3: "Critical Compatibility Solutions"
  Extracted 8 claims
  ✓ ml/dependencies/basicsr/torchvision/incompatible_with = ">=0.15"
  ✓ ml/enhancements/gpen/gfpgan/outperforms = "eye_enhancement"
  ...

Chunk 2/3: "CUDA 12.9 Compatibility"
  Extracted 5 claims
  ...

Summary: 23 claims extracted, 23 stored successfully

How It Works

1. Intelligent Chunking

The skill chunks large articles to fit LLM context limits:

Strategy:

  • Target: ~4K tokens per chunk
  • Break at ## headings when possible
  • Preserve context: Include document title + section path in each chunk

Example:

# Python Dependency Stack
## Critical Solutions
### BasicSR Fix
[content...]

Becomes 3 chunks once the sibling sections elided above (GPEN vs GFPGAN, CUDA Compatibility) are included:

  1. "Python Dependency Stack / Critical Solutions / BasicSR Fix" + content
  2. "Python Dependency Stack / Critical Solutions / GPEN vs GFPGAN" + content
  3. "Python Dependency Stack / CUDA Compatibility" + content
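
The chunking strategy above can be sketched roughly as follows. This is an illustration, not the skill's actual implementation: the function name `chunk_by_headings` is hypothetical, it breaks at every heading level, it does not enforce the ~4K token target, and it would mistake `#` lines inside code fences for headings.

```python
import re

# Matches ATX headings such as "## Critical Solutions".
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def chunk_by_headings(markdown: str) -> list:
    """Split markdown at headings, tagging each chunk with its section path."""
    chunks = []
    path = []   # current heading trail, e.g. ["Title", "Section", "Subsection"]
    body = []   # lines collected under the current heading

    def flush():
        text = "\n".join(body).strip()
        if text:  # headings without content of their own produce no chunk
            chunks.append({"path": " / ".join(path), "text": text})
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            del path[level - 1:]        # drop headings at this depth or deeper
            path.append(match.group(2))
        else:
            body.append(line)
    flush()
    return chunks
```

Each chunk carries the full title and section path, so the LLM sees where a passage sits in the article even when it only receives one segment.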

2. LLM Claim Extraction

For each chunk, Claude extracts factual assertions as structured JSON:

Extraction Criteria:

  • Factual (verifiable from text)
  • Useful for developers
  • Has clear subject/predicate/value

Example extraction:

Input text:

### BasicSR/Torchvision Fix
The core issue is that basicsr 1.4.2 imports from
`torchvision.transforms.functional_tensor` which was removed in
torchvision 0.15+.

**Primary Solution:**
git+https://github.com/XPixelGroup/BasicSR@8d56e3a

Extracted claim:

{
  "subject": "ml/dependencies/basicsr/torchvision",
  "predicate": "incompatible_with",
  "value": ">=0.15",
  "explanation": "basicsr 1.4.2 imports from torchvision.transforms.functional_tensor which was removed in torchvision 0.15+",
  "authority": "XPixelGroup/BasicSR@8d56e3a",
  "category": "compatibility"
}
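
If you post-process the extracted JSON yourself, a small typed wrapper helps catch incomplete claims early. The sketch below assumes the six fields shown in the example above; the `Claim` class and `parse_claim` function are hypothetical names, not part of Aphoria.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    """One extracted claim, mirroring the JSON fields shown above."""
    subject: str
    predicate: str
    value: str
    explanation: str
    authority: str
    category: str


def parse_claim(raw: str) -> Claim:
    """Parse one claim from JSON, rejecting objects with missing fields."""
    data = json.loads(raw)
    fields = list(Claim.__dataclass_fields__)
    missing = [name for name in fields if name not in data]
    if missing:
        raise ValueError("claim is missing fields: " + ", ".join(missing))
    # Ignore any extra keys the LLM may have added.
    return Claim(**{name: data[name] for name in fields})
```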

3. Authority Inference

The LLM infers authority sources from context:

| Pattern | Authority Format | Example |
|---|---|---|
| GitHub URL | repo@commit | XPixelGroup/BasicSR@8d56e3a |
| Research paper | Author et al. (Year) | Smith et al. (2023) |
| Official docs | Product Documentation | PyTorch Documentation |
| Empirical | Community consensus | Community best practice |
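
The skill relies on the LLM for this inference, but the GitHub URL pattern is regular enough to show mechanically. The sketch below is an assumption about how the `repo@commit` format could be derived from a pinned URL; `github_authority` is a hypothetical helper, not an Aphoria function.

```python
import re
from typing import Optional

# Matches "github.com/<owner>/<repo>@<commit>", with an optional ".git" suffix.
GITHUB_REF = re.compile(
    r"github\.com/(?P<owner>[\w.-]+)/(?P<repo>[\w.-]+?)(?:\.git)?"
    r"@(?P<commit>[0-9a-fA-F]{7,40})\b"
)


def github_authority(text: str) -> Optional[str]:
    """Return 'owner/repo@commit' for the first pinned GitHub URL in text."""
    match = GITHUB_REF.search(text)
    if match is None:
        return None
    return f"{match['owner']}/{match['repo']}@{match['commit']}"
```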

4. Tier Assignment

The skill assigns tiers based on authority source:

| Tier | Authority Type | Examples |
|---|---|---|
| 0 | Regulatory specs | RFC, W3C standards |
| 1 | Authoritative sources | Official docs, research papers |
| 2 | Observational | GitHub repos, community consensus |
| 3 | Empirical | Unverified claims |

Guidance to LLM:

  • Official standards (RFC, W3C) → Tier 0
  • Official documentation, published research → Tier 1
  • GitHub repos, maintainer statements → Tier 2
  • Community reports, unverified → Tier 3
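
Expressed as a lookup, the guidance above looks roughly like this. The kind labels (`standard`, `official_docs`, and so on) are hypothetical names chosen for illustration, and falling back to tier 3 for unknown kinds is an assumption rather than documented behavior.

```python
# Hypothetical authority-kind labels mapped to the tiers described above.
TIER_BY_AUTHORITY_KIND = {
    "standard": 0,              # RFC, W3C
    "official_docs": 1,
    "research_paper": 1,
    "github_repo": 2,
    "maintainer_statement": 2,
    "community_report": 3,
}


def assign_tier(kind: str) -> int:
    """Map an authority kind to a tier, treating unknown kinds as unverified."""
    return TIER_BY_AUTHORITY_KIND.get(kind, 3)
```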

5. Persistence via CLI

Each extracted claim is stored using:

aphoria corpus create \
  --subject "ml/dependencies/basicsr/torchvision" \
  --predicate "incompatible_with" \
  --value ">=0.15" \
  --explanation "basicsr 1.4.2 imports from torchvision.transforms.functional_tensor which was removed in 0.15+" \
  --authority "XPixelGroup/BasicSR@8d56e3a" \
  --category "compatibility" \
  --tier 2
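
When driving the CLI from a script, building the argument list programmatically avoids shell-quoting mistakes with values such as `>=0.15`. The sketch below uses only the flags shown above; `corpus_create_argv` is a hypothetical helper name.

```python
import shlex

CLAIM_FLAGS = ("subject", "predicate", "value", "explanation", "authority", "category")


def corpus_create_argv(claim: dict, tier: int) -> list:
    """Build the argument list for `aphoria corpus create` from a claim dict."""
    argv = ["aphoria", "corpus", "create"]
    for flag in CLAIM_FLAGS:
        argv += ["--" + flag, claim[flag]]
    argv += ["--tier", str(tier)]
    return argv


def as_shell_command(argv: list) -> str:
    """Render argv as a copy-pasteable, safely quoted shell command."""
    return shlex.join(argv)
```

Passing the list to `subprocess.run(argv, check=True)` would execute the command without going through a shell at all.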

CLI Reference: aphoria corpus create

Create a corpus assertion from structured claim data.

Usage:

aphoria corpus create \
  --subject <hierarchical/path> \
  --predicate <relationship> \
  --value <value> \
  --explanation <full-context> \
  --authority <source> \
  --category <category> \
  --tier <0-3>

Arguments:

| Flag | Required | Description | Example |
|---|---|---|---|
| --subject | Yes | Hierarchical path to concept | ml/basicsr/torchvision |
| --predicate | Yes | Relationship type | incompatible_with |
| --value | Yes | Value or constraint | ">=0.15" |
| --explanation | Yes | Full context sentence | "basicsr 1.4.2 imports from..." |
| --authority | Yes | Source citation | XPixelGroup/BasicSR@8d56e3a |
| --category | Yes | Category tag | compatibility |
| --tier | Yes | Authority tier (0-3) | 2 |

Categories:

  • compatibility - Dependency constraints, version requirements
  • performance - Performance characteristics, benchmarks
  • security - Security properties, vulnerabilities
  • architecture - Design patterns, structure
  • behavior - Functional behavior, side effects

Behavior:

Deduplication: None. The corpus is append-only, so every claim is stored even when an assertion with the same subject+predicate already exists. Keeping differing claims alongside their sources is the whole point of Episteme.

Error Handling: Bundles all validation errors and presents them together:

Error creating corpus assertion:

Validation errors:
  1. --subject: Must be non-empty hierarchical path (got: "")
  2. --tier: Must be 0-3 (got: 5)
  3. --category: Must be one of: compatibility, performance, security, architecture, behavior (got: "random")

Fix all errors and retry.
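
The bundling behavior amounts to collecting every failure before reporting, instead of stopping at the first. The sketch below illustrates that pattern for pre-checking claims before calling the CLI. It is not Aphoria's validation code: the `validate` function is hypothetical and covers only the three checks shown in the sample message.

```python
CATEGORIES = ("compatibility", "performance", "security", "architecture", "behavior")


def validate(args: dict) -> list:
    """Return every validation error at once instead of failing on the first."""
    errors = []

    subject = args.get("subject", "")
    if not subject:
        errors.append(f'--subject: Must be non-empty hierarchical path (got: "{subject}")')

    tier = args.get("tier")
    if tier not in (0, 1, 2, 3):
        errors.append(f"--tier: Must be 0-3 (got: {tier})")

    category = args.get("category", "")
    if category not in CATEGORIES:
        allowed = ", ".join(CATEGORIES)
        errors.append(f'--category: Must be one of: {allowed} (got: "{category}")')

    return errors
```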

Example:

$ aphoria corpus create \
  --subject "ml/pytorch/version" \
  --predicate "requires" \
  --value ">=2.0" \
  --explanation "Uses torch.compile which requires PyTorch 2.0+" \
  --authority "PyTorch 2.0 Release Notes" \
  --category "compatibility" \
  --tier 1

✓ Created corpus assertion: ml/pytorch/version
  Stored in: ~/.aphoria/corpus-db

Skill Output Format

The extract-wiki-corpus skill produces structured output:

Reading article: REQUEST_FOR_RESEARCH_ANSWERS.md (12,450 tokens)
Chunked into 3 segments (by ## headings)

Chunk 1/3: "Critical Compatibility Solutions"
  Extracted 8 claims

  1. ml/dependencies/basicsr/torchvision
     incompatible_with = ">=0.15"
     Authority: XPixelGroup/BasicSR@8d56e3a
     ✓ Stored

  2. ml/enhancements/gpen/gfpgan
     outperforms = "eye_enhancement"
     Authority: Research comparison (2023)
     ✓ Stored

  [... 6 more claims ...]

Chunk 2/3: "CUDA 12.9 Compatibility"
  Extracted 5 claims

  9. ml/face_detection/mediapipe/dlib
     preferred_over = "CUDA 12 support"
     Authority: Community consensus
     ✓ Stored

  [... 4 more claims ...]

Chunk 3/3: "Optimized Requirements"
  Extracted 10 claims

  [... all claims ...]

Summary:
  Total claims: 23
  Successfully stored: 23
  Failed: 0

Corpus database: ~/.aphoria/corpus-db
Query: curl 'http://localhost:18180/v1/aphoria/corpus?category=compatibility'

If errors occur:

Summary:
  Total claims: 23
  Successfully stored: 18
  Failed: 5

Errors:
  1. Claim #7 (ml/torch/cuda/version)
     - --tier: Must be 0-3 (got: 5)
     - Fix: LLM assigned invalid tier

  2. Claim #12 (ml/xformers/optional)
     - --subject: Empty subject path
     - Fix: LLM extraction failed

  [... 3 more errors with details ...]

Fix these issues and re-run extraction.
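
For scripted runs, the summary block can be turned into numbers so a wrapper can fail when any claim was rejected. This sketch assumes the multi-line summary layout shown above, which may change between skill versions; `parse_summary` is a hypothetical helper.

```python
import re

# Matches lines such as "  Successfully stored: 18" in the summary block.
SUMMARY_LINE = re.compile(
    r"^[ \t]*(Total claims|Successfully stored|Failed):[ \t]*(\d+)[ \t]*$",
    re.MULTILINE,
)


def parse_summary(output: str) -> dict:
    """Extract the summary counters from captured skill output."""
    return {label: int(count) for label, count in SUMMARY_LINE.findall(output)}
```

A wrapper script could then exit with a non-zero status whenever `parse_summary(output).get("Failed", 0)` is greater than zero.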

Verification

After extraction, verify claims appear in the corpus:

# Query all compatibility claims
curl -s 'http://localhost:18180/v1/aphoria/corpus?category=compatibility' | jq '.total_matching'
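# Expected: the number of compatibility claims in the corpus (23 only if all extracted claims are in that category and the corpus was empty before)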

# Query specific subject
curl -s 'http://localhost:18180/v1/aphoria/corpus' | \
  jq '.items[] | select(.subject | contains("basicsr"))'

# Expected output:
{
  "subject": "ml/dependencies/basicsr/torchvision",
  "predicate": "incompatible_with",
  "value": ">=0.15",
  "source": "ml://",
  "tier": 2,
  "category": "compatibility",
  "explanation": "basicsr 1.4.2 imports from torchvision.transforms.functional_tensor which was removed in 0.15+",
  "authority_source": "XPixelGroup/BasicSR@8d56e3a"
}
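
The same check can be scripted without `jq`. The sketch below assumes the response body is a JSON object with an `items` array, as the `jq` filters above imply; `claims_for_subject` is a hypothetical helper, and fetching the payload from the endpoint is left out.

```python
import json


def claims_for_subject(payload: str, needle: str) -> list:
    """Return corpus items whose subject contains needle (mirrors the jq filter)."""
    items = json.loads(payload).get("items", [])
    return [item for item in items if needle in item.get("subject", "")]
```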

Dashboard View

Extracted claims appear in the Aphoria dashboard at /corpus:

Filters:

  • By category: compatibility, performance, security, architecture, behavior
  • By tier: 0 (Regulatory), 1 (Authoritative), 2 (Observational), 3 (Empirical)
  • By source: ml://, security://, etc.

Display:

  • Subject path as breadcrumbs: ml > dependencies > basicsr > torchvision
  • Tier badge with color coding
  • Full explanation text
  • Authority citation as link (if URL)

Troubleshooting

Problem: Skill chunks too aggressively, loses context

Solution: Raise the chunk size in the skill configuration. The default target is ~4K tokens; complex articles can use up to 8K.


Problem: LLM assigns wrong tiers

Solution: Refine tier guidance in skill prompt:

  • Official standards (RFC, IEEE) → Tier 0
  • Official docs, peer-reviewed papers → Tier 1
  • GitHub repos, maintainer statements → Tier 2
  • Blog posts, community forums → Tier 3

Problem: Too many failed claims (validation errors)

Solution: Check common error patterns:

# Review failed claims (assumes the skill output was saved to this log file)
grep "Failed:" /tmp/extraction-output.log

# Common issues:
# 1. Empty subjects - LLM extraction failed
# 2. Invalid tiers - LLM assigned tier > 3
# 3. Missing required fields - Incomplete extraction

Fix these by refining the LLM extraction prompt.


Problem: Duplicate claims (same subject+predicate)

This is expected behavior. Episteme stores ALL claims, even duplicates from different sources. This enables:

  • Sourced differing opinions (PyTorch docs say X, community says Y)
  • Conflict detection (authority says A, codebase does B)
  • Historical tracking (claim evolved over time)

To query all claims for a subject:

curl -s 'http://localhost:18180/v1/aphoria/corpus' | \
  jq '.items[] | select(.subject == "ml/dependencies/basicsr/torchvision")'
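
To see where sources actually disagree, group the returned items by subject and predicate and keep only the groups with more than one distinct value. This is a client-side sketch over the item fields shown in the verification output; `differing_claims` is a hypothetical helper, not an Aphoria feature.

```python
from collections import defaultdict


def differing_claims(items: list) -> dict:
    """Group claims by (subject, predicate); keep groups whose values differ."""
    grouped = defaultdict(list)
    for item in items:
        grouped[(item["subject"], item["predicate"])].append(item)
    return {
        key: group
        for key, group in grouped.items()
        if len({claim["value"] for claim in group}) > 1
    }
```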

Integration with Other Features

With Scans:

  • Corpus claims act as authority sources
  • Aphoria compares scanned observations against corpus
  • Conflicts trigger violations

With Claims Management:

  • Can supersede corpus claims: aphoria claims supersede <id>
  • Can deprecate outdated corpus: aphoria claims deprecate <id>
  • Corpus claims have same structure as project claims

With Dashboard:

  • All corpus claims visible at /corpus
  • Filterable by category, tier, source
  • Click through to see full explanation

Best Practices

DO:

  • Extract from authoritative sources (official docs, research)
  • Verify claims appear in dashboard after extraction
  • Review tier assignments for accuracy
  • Include full context in explanations

DON'T:

  • Extract from opinion pieces or blogs (if you must, assign tier 3)
  • Skip authority citations (always provide source)
  • Use vague subjects (write "ml/pytorch/feature/specific", not "thing")
  • Ignore validation errors (fix all before considering extraction complete)

Examples

Example 1: ML Dependencies

Input: ~/wiki/ml-stack.md

## PyTorch CUDA Compatibility

PyTorch 2.6.0 with CUDA 12.6 builds are forward compatible with CUDA 12.9.

Source: PyTorch 2.6 Release Notes

Extraction:

claude -p ~/wiki/ml-stack.md --skill extract-wiki-corpus

# Output:
Extracted 1 claim:
✓ ml/pytorch/cuda/compatibility
  predicate: forward_compatible_with
  value: "CUDA 12.9"
  tier: 1 (PyTorch 2.6 Release Notes)

Example 2: Security Best Practices

Input: ~/wiki/security.md

## Password Hashing

Research shows Argon2 consistently outperforms bcrypt and scrypt for
password hashing in modern environments.

Source: OWASP Password Storage Cheat Sheet (2023)

Extraction:

claude -p ~/wiki/security.md --skill extract-wiki-corpus

# Output:
Extracted 1 claim:
✓ security/password/hashing/algorithm
  predicate: recommended
  value: "Argon2"
  tier: 1 (OWASP Password Storage Cheat Sheet)

Example 3: Large Article

Input: ~/wiki/complete-stack.md (~15,000 tokens)

# Complete Python Stack for SDXL

## Critical Solutions
[4,000 tokens]

## Enhancement Libraries
[5,000 tokens]

## CUDA Compatibility
[6,000 tokens]

Extraction:

claude -p ~/wiki/complete-stack.md --skill extract-wiki-corpus

# Output:
Reading article: complete-stack.md (15,234 tokens)
Chunked into 3 segments (by ## headings)

Chunk 1/3: "Critical Solutions"
  Extracted 12 claims
  ...

Chunk 2/3: "Enhancement Libraries"
  Extracted 8 claims
  ...

Chunk 3/3: "CUDA Compatibility"
  Extracted 7 claims
  ...

Summary: 27 claims extracted, 27 stored successfully

See Also