stemedb/applications/aphoria/docs/guides/llm-wiki-extraction.md

# LLM-Based Wiki Corpus Extraction

Extract factual claims from technical documentation with an LLM skill that chunks each article, extracts claims from each chunk, and persists them to the corpus database.

## Quick Start

```bash
# Extract claims from a wiki article
cd ~/Workspace/stemedb
claude -p ~/path/to/wiki/article.md --skill extract-wiki-corpus

# Example with actual file
claude -p ~/Workspace/orchard9/wiki/intakes/REQUEST_FOR_RESEARCH_ANSWERS.md \
  --skill extract-wiki-corpus
```

Expected output:
```
Reading article: REQUEST_FOR_RESEARCH_ANSWERS.md (12,450 tokens)
Chunked into 3 segments (by ## headings)

Chunk 1/3: "Critical Compatibility Solutions"
  Extracted 8 claims
  ✓ ml/dependencies/basicsr/torchvision/incompatible_with = ">=0.15"
  ✓ ml/enhancements/gpen/gfpgan/outperforms = "eye_enhancement"
  ...

Chunk 2/3: "CUDA 12.9 Compatibility"
  Extracted 5 claims
  ...

Summary: 23 claims extracted, 23 stored successfully
```

## How It Works

### 1. Intelligent Chunking

The skill chunks large articles to fit LLM context limits:

**Strategy:**
- Target: ~4K tokens per chunk
- Break at `##` headings when possible
- Preserve context: Include document title + section path in each chunk

**Example:**
```markdown
# Python Dependency Stack
## Critical Solutions
### BasicSR Fix
[content...]
```

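With its sibling sections included (elided above), the article becomes 3 chunks, each prefixed with its full section path: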
1. `"Python Dependency Stack / Critical Solutions / BasicSR Fix"` + content
2. `"Python Dependency Stack / Critical Solutions / GPEN vs GFPGAN"` + content
3. `"Python Dependency Stack / CUDA Compatibility"` + content
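The chunking strategy above can be sketched in a few lines of Python. This is a minimal illustration, not the skill's actual implementation: the function name `chunk_by_headings` and the `max_level` parameter are hypothetical, the sketch splits at every heading up to `###` to reproduce the section paths shown above, and it does not enforce the ~4K token budget.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.+?)\s*$")

def chunk_by_headings(markdown, max_level=3):
    """Split markdown at headings; return (section_path, content) pairs."""
    path = []       # current heading stack, e.g. [title, section, subsection]
    body = []       # lines collected since the last heading
    chunks = []
    in_fence = False

    def flush():
        text = "\n".join(body).strip()
        if text:
            chunks.append((" / ".join(path), text))
        body.clear()

    for line in markdown.splitlines():
        # Ignore '#' lines inside fenced code blocks (e.g. shell comments).
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match and len(match.group(1)) <= max_level:
            flush()
            level = len(match.group(1))
            del path[level - 1:]          # drop headings at this depth or deeper
            path.append(match.group(2))
        else:
            body.append(line)
    flush()
    return chunks

article = """# Python Dependency Stack
## Critical Solutions
### BasicSR Fix
Pin the patched fork.
### GPEN vs GFPGAN
Comparison notes.
## CUDA Compatibility
Compatibility notes.
"""

for section_path, content in chunk_by_headings(article):
    print(f"{section_path!r} + {content!r}")
```

Carrying the full section path with every chunk is what lets the LLM interpret a fragment such as "BasicSR Fix" without seeing the rest of the article.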

### 2. LLM Claim Extraction

For each chunk, Claude extracts factual assertions as structured JSON:

**Extraction Criteria:**
- Factual (verifiable from text)
- Useful for developers
- Has clear subject/predicate/value

**Example extraction:**

Input text:
```markdown
### BasicSR/Torchvision Fix
The core issue is that basicsr 1.4.2 imports from
`torchvision.transforms.functional_tensor` which was removed in
torchvision 0.15+.

**Primary Solution:**
git+https://github.com/XPixelGroup/BasicSR@8d56e3a
```

Extracted claim:
```json
{
  "subject": "ml/dependencies/basicsr/torchvision",
  "predicate": "incompatible_with",
  "value": ">=0.15",
  "explanation": "basicsr 1.4.2 imports from torchvision.transforms.functional_tensor which was removed in torchvision 0.15+",
  "authority": "XPixelGroup/BasicSR@8d56e3a",
  "category": "compatibility"
}
```
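As a rough illustration of how the extracted JSON might be consumed, the sketch below parses a response and drops entries that lack a clear subject, predicate, or value. It assumes the LLM returns a JSON array of claim objects shaped like the one above; the `Claim` class and `parse_claims` function are hypothetical names, not part of the skill.

```python
import json
from dataclasses import dataclass

@dataclass(frozen=True)
class Claim:
    subject: str
    predicate: str
    value: str
    explanation: str = ""
    authority: str = ""
    category: str = ""

def parse_claims(raw):
    """Parse a JSON array of claims, skipping any without subject/predicate/value."""
    claims = []
    for item in json.loads(raw):
        core = [str(item.get(key, "")).strip() for key in ("subject", "predicate", "value")]
        if not all(core):
            continue  # fails the "clear subject/predicate/value" criterion
        claims.append(Claim(
            subject=core[0],
            predicate=core[1],
            value=core[2],
            explanation=item.get("explanation", ""),
            authority=item.get("authority", ""),
            category=item.get("category", ""),
        ))
    return claims

raw_response = json.dumps([
    {
        "subject": "ml/dependencies/basicsr/torchvision",
        "predicate": "incompatible_with",
        "value": ">=0.15",
        "authority": "XPixelGroup/BasicSR@8d56e3a",
        "category": "compatibility",
    },
    {"subject": "ml/xformers/optional", "predicate": "", "value": ""},
])

for claim in parse_claims(raw_response):
    print(claim.subject, claim.predicate, claim.value)
```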

### 3. Authority Inference

The LLM infers authority sources from context:

| Pattern | Authority Format | Example |
|---------|-----------------|---------|
| GitHub URL | `repo@commit` | `XPixelGroup/BasicSR@8d56e3a` |
| Research paper | `Author et al. (Year)` | `Smith et al. (2023)` |
| Official docs | `Product Documentation` | `PyTorch Documentation` |
| Empirical | `Community consensus` | `Community best practice` |
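The first row of the table is mechanical enough to illustrate in code. In the skill this inference is done by the LLM, so the regular expression below is only a sketch of the `repo@commit` format; the function name `github_authority` is hypothetical.

```python
import re
from typing import Optional

# owner/repo followed by @<7-40 hex chars>, with an optional ".git" suffix
GITHUB_REF = re.compile(
    r"github\.com/([\w.-]+)/([\w.-]+?)(?:\.git)?@([0-9a-fA-F]{7,40})\b"
)

def github_authority(text: str) -> Optional[str]:
    """Return 'owner/repo@commit' for the first pinned GitHub URL in text, if any."""
    match = GITHUB_REF.search(text)
    if match is None:
        return None
    owner, repo, commit = match.groups()
    return f"{owner}/{repo}@{commit}"

print(github_authority("git+https://github.com/XPixelGroup/BasicSR@8d56e3a"))
```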

### 4. Tier Assignment

The skill assigns tiers based on authority source:

| Tier | Authority Type | Examples |
|------|---------------|----------|
| 0 | Regulatory specs | RFC, W3C standards |
| 1 | Authoritative sources | Official docs, research papers |
| 2 | Observational | GitHub repos, community consensus |
| 3 | Empirical | Unverified claims |

**Guidance to LLM:**
- Official standards (RFC, W3C) → Tier 0
- Official documentation, published research → Tier 1
- GitHub repos, maintainer statements → Tier 2
- Community reports, unverified → Tier 3
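The guidance above amounts to a lookup from source kind to tier. The sketch below shows that mapping as data; the source-kind labels are hypothetical names chosen for illustration, and falling back to tier 3 for unknown kinds is an assumption rather than documented behavior.

```python
# Hypothetical source-kind labels mapped to the tiers described above.
TIER_BY_SOURCE_KIND = {
    "official_standard": 0,      # RFC, W3C
    "official_docs": 1,
    "published_research": 1,
    "github_repo": 2,
    "maintainer_statement": 2,
    "community_report": 3,
    "unverified": 3,
}

def assign_tier(source_kind: str) -> int:
    """Return the tier for a source kind; unknown kinds get tier 3 (assumption)."""
    return TIER_BY_SOURCE_KIND.get(source_kind, 3)

print(assign_tier("github_repo"))
```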

### 5. Persistence via CLI

Each extracted claim is stored using:

```bash
aphoria corpus create \
  --subject "ml/dependencies/basicsr/torchvision" \
  --predicate "incompatible_with" \
  --value ">=0.15" \
  --explanation "basicsr 1.4.2 imports from torchvision.transforms.functional_tensor which was removed in torchvision 0.15+" \
  --authority "XPixelGroup/BasicSR@8d56e3a" \
  --category "compatibility" \
  --tier 2
```
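If you drive the CLI from a script, building the argument list programmatically avoids shell-quoting mistakes (an unquoted `>=0.15` would be read as a redirect). The helper below is a hypothetical sketch that uses only the flags documented in this guide; it builds the command and prints it rather than running it.

```python
import shlex

CLAIM_FIELDS = ("subject", "predicate", "value", "explanation", "authority", "category")

def corpus_create_argv(claim: dict, tier: int) -> list:
    """Build the argv for `aphoria corpus create` from a claim dict."""
    argv = ["aphoria", "corpus", "create"]
    for field in CLAIM_FIELDS:
        argv += [f"--{field}", str(claim[field])]
    argv += ["--tier", str(tier)]
    return argv

claim = {
    "subject": "ml/dependencies/basicsr/torchvision",
    "predicate": "incompatible_with",
    "value": ">=0.15",
    "explanation": "basicsr 1.4.2 imports from torchvision.transforms.functional_tensor "
                   "which was removed in torchvision 0.15+",
    "authority": "XPixelGroup/BasicSR@8d56e3a",
    "category": "compatibility",
}

argv = corpus_create_argv(claim, tier=2)
print(shlex.join(argv))  # a copy-pasteable, safely quoted command line
```

The resulting list can be passed directly to `subprocess.run(argv, check=True)`, which bypasses the shell entirely.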

## CLI Reference: `aphoria corpus create`

Create a corpus assertion from structured claim data.

**Usage:**
```bash
aphoria corpus create \
  --subject <hierarchical/path> \
  --predicate <relationship> \
  --value <value> \
  --explanation <full-context> \
  --authority <source> \
  --category <category> \
  --tier <0-3>
```

**Arguments:**

| Flag | Required | Description | Example |
|------|----------|-------------|---------|
| `--subject` | Yes | Hierarchical path to concept | `ml/dependencies/basicsr/torchvision` |
| `--predicate` | Yes | Relationship type | `incompatible_with` |
| `--value` | Yes | Value or constraint | `">=0.15"` |
| `--explanation` | Yes | Full context sentence | `"basicsr 1.4.2 imports from..."` |
| `--authority` | Yes | Source citation | `XPixelGroup/BasicSR@8d56e3a` |
| `--category` | Yes | Category tag | `compatibility` |
| `--tier` | Yes | Authority tier (0-3) | `2` |

**Categories:**
- `compatibility` - Dependency constraints, version requirements
- `performance` - Performance characteristics, benchmarks
- `security` - Security properties, vulnerabilities
- `architecture` - Design patterns, structure
- `behavior` - Functional behavior, side effects

**Behavior:**

**Deduplication:** None. Every claim is stored, even if the same subject+predicate already exists. The corpus is append-only: keeping differing claims alongside their sources is the whole point of Episteme.

**Error Handling:** Bundles all validation errors and presents them together:

```
Error creating corpus assertion:

Validation errors:
  1. --subject: Must be non-empty hierarchical path (got: "")
  2. --tier: Must be 0-3 (got: 5)
  3. --category: Must be one of: compatibility, performance, security, architecture, behavior (got: "random")

Fix all errors and retry.
```
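The "collect everything, then report once" pattern can be sketched as follows. This is not the CLI's actual implementation; the `validate` function is hypothetical and covers only the three checks shown in the sample output above.

```python
CATEGORIES = ("compatibility", "performance", "security", "architecture", "behavior")

def validate(args: dict) -> list:
    """Return every validation error instead of stopping at the first one."""
    errors = []
    subject = str(args.get("subject", ""))
    if not subject.strip():
        errors.append(f'--subject: Must be non-empty hierarchical path (got: "{subject}")')
    tier = args.get("tier")
    if not (isinstance(tier, int) and 0 <= tier <= 3):
        errors.append(f"--tier: Must be 0-3 (got: {tier})")
    category = args.get("category")
    if category not in CATEGORIES:
        allowed = ", ".join(CATEGORIES)
        errors.append(f'--category: Must be one of: {allowed} (got: "{category}")')
    return errors

problems = validate({"subject": "", "tier": 5, "category": "random"})
if problems:
    print("Validation errors:")
    for number, message in enumerate(problems, start=1):
        print(f"  {number}. {message}")
```

Reporting all errors at once lets a failed extraction be corrected in a single pass rather than one rerun per mistake.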

**Example:**
```bash
$ aphoria corpus create \
  --subject "ml/pytorch/version" \
  --predicate "requires" \
  --value ">=2.0" \
  --explanation "Uses torch.compile which requires PyTorch 2.0+" \
  --authority "PyTorch 2.0 Release Notes" \
  --category "compatibility" \
  --tier 1

✓ Created corpus assertion: ml/pytorch/version
  Stored in: ~/.aphoria/corpus-db
```

## Skill Output Format

The `extract-wiki-corpus` skill produces structured output:

```
Reading article: REQUEST_FOR_RESEARCH_ANSWERS.md (12,450 tokens)
Chunked into 3 segments (by ## headings)

Chunk 1/3: "Critical Compatibility Solutions"
  Extracted 8 claims

  1. ml/dependencies/basicsr/torchvision
     incompatible_with = ">=0.15"
     Authority: XPixelGroup/BasicSR@8d56e3a
     ✓ Stored

  2. ml/enhancements/gpen/gfpgan
     outperforms = "eye_enhancement"
     Authority: Research comparison (2023)
     ✓ Stored

  [... 6 more claims ...]

Chunk 2/3: "CUDA 12.9 Compatibility"
  Extracted 5 claims

  9. ml/face_detection/mediapipe/dlib
     preferred_over = "CUDA 12 support"
     Authority: Community consensus
     ✓ Stored

  [... 4 more claims ...]

Chunk 3/3: "Optimized Requirements"
  Extracted 10 claims

  [... all claims ...]

Summary:
  Total claims: 23
  Successfully stored: 23
  Failed: 0

Corpus database: ~/.aphoria/corpus-db
Query: curl 'http://localhost:18180/v1/aphoria/corpus?category=compatibility'
```

**If errors occur:**
```
Summary:
  Total claims: 23
  Successfully stored: 18
  Failed: 5

Errors:
  1. Claim #7 (ml/torch/cuda/version)
     - --tier: Must be 0-3 (got: 5)
     - Fix: LLM assigned invalid tier

  2. Claim #12 (subject missing)
     - --subject: Must be non-empty hierarchical path (got: "")
     - Fix: LLM extraction failed

  [... 3 more errors with details ...]

Fix these issues and re-run extraction.
```

## Verification

After extraction, verify claims appear in the corpus:

```bash
# Query all compatibility claims
curl -s 'http://localhost:18180/v1/aphoria/corpus?category=compatibility' | jq '.total_matching'
# Expected: the number of stored claims in the compatibility category (at most 23 for the run above)

# Query specific subject
curl -s 'http://localhost:18180/v1/aphoria/corpus' | \
  jq '.items[] | select(.subject | contains("basicsr"))'

# Expected output:
{
  "subject": "ml/dependencies/basicsr/torchvision",
  "predicate": "incompatible_with",
  "value": ">=0.15",
  "source": "ml://",
  "tier": 2,
  "category": "compatibility",
  "explanation": "basicsr 1.4.2 imports from torchvision.transforms.functional_tensor which was removed in torchvision 0.15+",
  "authority_source": "XPixelGroup/BasicSR@8d56e3a"
}
```
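The same check can be scripted without `jq`. The sketch below assumes the response shape shown above (a top-level `items` list whose entries carry a `subject` field); the function names are hypothetical, and `fetch_corpus` only works while the Aphoria server is running.

```python
import json
from urllib.request import urlopen

CORPUS_URL = "http://localhost:18180/v1/aphoria/corpus"  # endpoint used in this guide

def fetch_corpus(url: str = CORPUS_URL) -> dict:
    """Fetch the corpus listing as parsed JSON (requires a running server)."""
    with urlopen(url, timeout=10) as response:
        return json.load(response)

def claims_matching(corpus: dict, needle: str) -> list:
    """Return items whose subject contains needle, like jq's contains()."""
    return [
        item for item in corpus.get("items", [])
        if needle in item.get("subject", "")
    ]

# Offline example using the response shape shown above.
sample = {
    "total_matching": 2,
    "items": [
        {"subject": "ml/dependencies/basicsr/torchvision", "tier": 2},
        {"subject": "ml/pytorch/version", "tier": 1},
    ],
}
print(claims_matching(sample, "basicsr"))
```

Against a live server, `claims_matching(fetch_corpus(), "basicsr")` would perform the same filter on real data.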

## Dashboard View

Extracted claims appear in the Aphoria dashboard at `/corpus`:

**Filters:**
- By category: compatibility, performance, security, architecture, behavior
- By tier: 0 (Regulatory), 1 (Authoritative), 2 (Observational), 3 (Empirical)
- By source: ml://, security://, etc.

**Display:**
- Subject path as breadcrumbs: `ml > dependencies > basicsr > torchvision`
- Tier badge with color coding
- Full explanation text
- Authority citation as link (if URL)

## Troubleshooting

**Problem:** Skill chunks too aggressively, loses context

**Solution:** Raise the chunk size in the skill configuration. The default target is ~4K tokens; complex articles can go up to 8K.

---

**Problem:** LLM assigns wrong tiers

**Solution:** Refine tier guidance in skill prompt:
- Official standards (RFC, IEEE) → Tier 0
- Official docs, peer-reviewed papers → Tier 1
- GitHub repos, maintainer statements → Tier 2
- Blog posts, community forums → Tier 3

---

**Problem:** Too many failed claims (validation errors)

**Solution:** Check common error patterns:
```bash
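# Review failed claims (assumes the skill output was saved to this file,
# e.g. by piping it through `tee /tmp/extraction-output.log`)
grep "Failed:" /tmp/extraction-output.log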

# Common issues:
# 1. Empty subjects - LLM extraction failed
# 2. Invalid tiers - LLM assigned tier > 3
# 3. Missing required fields - Incomplete extraction
```

Fix by refining the LLM extraction prompt.

---

**Problem:** Duplicate claims (same subject+predicate)

**This is expected behavior.** Episteme stores ALL claims, even duplicates from different sources. This enables:
- Sourced differing opinions (PyTorch docs say X, community says Y)
- Conflict detection (authority says A, codebase does B)
- Historical tracking (claim evolved over time)

To query all claims for a subject:
```bash
curl -s 'http://localhost:18180/v1/aphoria/corpus' | \
  jq '.items[] | select(.subject == "ml/dependencies/basicsr/torchvision")'
```
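To see where sources actually disagree, the query results can be grouped by subject and predicate. The following is a minimal sketch, assuming items shaped like the API response shown under Verification (`subject`, `predicate`, `value`, `authority_source`); the function name `differing_claims` is hypothetical.

```python
from collections import defaultdict

def differing_claims(items: list) -> dict:
    """Group claims by (subject, predicate); keep groups whose values disagree."""
    groups = defaultdict(list)
    for item in items:
        groups[(item["subject"], item["predicate"])].append(item)
    return {
        key: claims
        for key, claims in groups.items()
        if len({claim["value"] for claim in claims}) > 1
    }

# Illustrative data only; these are not real corpus entries.
items = [
    {"subject": "ml/pytorch/version", "predicate": "requires",
     "value": ">=2.0", "authority_source": "PyTorch 2.0 Release Notes"},
    {"subject": "ml/pytorch/version", "predicate": "requires",
     "value": ">=2.1", "authority_source": "Community consensus"},
    {"subject": "ml/dependencies/basicsr/torchvision", "predicate": "incompatible_with",
     "value": ">=0.15", "authority_source": "XPixelGroup/BasicSR@8d56e3a"},
]

for (subject, predicate), claims in differing_claims(items).items():
    print(subject, predicate)
    for claim in claims:
        print(f"  {claim['value']}  ({claim['authority_source']})")
```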

## Integration with Other Features

**With Scans:**
- Corpus claims act as authority sources
- Aphoria compares scanned observations against corpus
- Conflicts trigger violations

**With Claims Management:**
- Can supersede corpus claims: `aphoria claims supersede <id>`
- Can deprecate outdated corpus: `aphoria claims deprecate <id>`
- Corpus claims have same structure as project claims

**With Dashboard:**
- All corpus claims visible at `/corpus`
- Filterable by category, tier, source
- Click through to see full explanation

## Best Practices

**DO:**
- Extract from authoritative sources (official docs, research)
- Verify claims appear in dashboard after extraction
- Review tier assignments for accuracy
- Include full context in explanations

**DON'T:**
- Extract from opinion pieces or blogs (if you must, assign tier 3)
- Skip authority citations (always provide a source)
- Use vague subjects (write `ml/pytorch/feature/specific`, not `thing`)
- Ignore validation errors (fix all before considering extraction complete)

## Examples

### Example 1: ML Dependencies

**Input:** `~/wiki/ml-stack.md`
```markdown
## PyTorch CUDA Compatibility

PyTorch 2.6.0 with CUDA 12.6 builds are forward compatible with CUDA 12.9.

Source: PyTorch 2.6 Release Notes
```

**Extraction:**
```bash
claude -p ~/wiki/ml-stack.md --skill extract-wiki-corpus

# Output:
Extracted 1 claim:
✓ ml/pytorch/cuda/compatibility
  predicate: forward_compatible_with
  value: "CUDA 12.9"
  tier: 1 (PyTorch 2.6 Release Notes)
```

### Example 2: Security Best Practices

**Input:** `~/wiki/security.md`
```markdown
## Password Hashing

Research shows Argon2 consistently outperforms bcrypt and scrypt for
password hashing in modern environments.

Source: OWASP Password Storage Cheat Sheet (2023)
```

**Extraction:**
```bash
claude -p ~/wiki/security.md --skill extract-wiki-corpus

# Output:
Extracted 1 claim:
✓ security/password/hashing/algorithm
  predicate: recommended
  value: "Argon2"
  tier: 1 (OWASP Password Storage Cheat Sheet)
```

### Example 3: Large Article

**Input:** `~/wiki/complete-stack.md` (~15,000 tokens)
```markdown
# Complete Python Stack for SDXL

## Critical Solutions
[4,000 tokens]

## Enhancement Libraries
[5,000 tokens]

## CUDA Compatibility
[6,000 tokens]
```

**Extraction:**
```bash
claude -p ~/wiki/complete-stack.md --skill extract-wiki-corpus

# Output:
Reading article: complete-stack.md (15,234 tokens)
Chunked into 3 segments (by ## headings)

Chunk 1/3: "Critical Solutions"
  Extracted 12 claims
  ...

Chunk 2/3: "Enhancement Libraries"
  Extracted 8 claims
  ...

Chunk 3/3: "CUDA Compatibility"
  Extracted 7 claims
  ...

Summary: 27 claims extracted, 27 stored successfully
```

## See Also

- [CLI Reference](../reference/cli-reference.md) - All `aphoria corpus` commands
- [Corpus API](../api-reference.md) - Query corpus programmatically
- [Claims vs Observations](../../README.md#claims-vs-observations) - Key concepts