# LLM-Based Wiki Corpus Extraction

Extract factual claims from technical documentation using an LLM skill that chunks each article, extracts claims from every chunk, and persists them to the corpus database.

## Quick Start

```bash
# Extract claims from a wiki article
cd ~/Workspace/stemedb
claude -p ~/path/to/wiki/article.md --skill extract-wiki-corpus

# Example with actual file
claude -p ~/Workspace/orchard9/wiki/intakes/REQUEST_FOR_RESEARCH_ANSWERS.md \
  --skill extract-wiki-corpus
```

Expected output:
```
Reading article: REQUEST_FOR_RESEARCH_ANSWERS.md (12,450 tokens)
Chunked into 3 segments (by ## headings)

Chunk 1/3: "Critical Compatibility Solutions"
  Extracted 8 claims
  ✓ ml/basicsr/torchvision/incompatible_with = ">=0.15"
  ✓ ml/gpen/gfpgan/outperforms = "eye_enhancement"
  ...

Chunk 2/3: "CUDA 12.9 Compatibility"
  Extracted 5 claims
  ...

Summary: 23 claims extracted, 23 stored successfully
```

## How It Works

### 1. Intelligent Chunking

The skill chunks large articles to fit LLM context limits:

**Strategy:**
- Target: ~4K tokens per chunk
- Break at `##` headings when possible
- Preserve context: Include document title + section path in each chunk

**Example:**
```markdown
# Python Dependency Stack
## Critical Solutions
### BasicSR Fix
[content...]
### GPEN vs GFPGAN
[content...]
## CUDA Compatibility
[content...]
```

Becomes 3 chunks:
1. `"Python Dependency Stack / Critical Solutions / BasicSR Fix"` + content
2. `"Python Dependency Stack / Critical Solutions / GPEN vs GFPGAN"` + content
3. `"Python Dependency Stack / CUDA Compatibility"` + content

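
The section-path idea can be sketched in a few lines of Python. This is a minimal illustration, not the skill's actual implementation: the function name `chunk_by_headings` is hypothetical, it splits at every heading rather than preferring `##`, and it does not enforce the ~4K token target.

```python
import re

# Matches ATX headings such as "## Critical Solutions".
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_headings(markdown: str) -> list[tuple[str, str]]:
    """Split markdown at headings and return (section_path, content) pairs."""
    path: list[str] = []   # heading stack, e.g. [title, section, subsection]
    body: list[str] = []   # lines collected under the current heading
    chunks: list[tuple[str, str]] = []

    def flush() -> None:
        # Emit the collected lines under the current path, skipping empty sections.
        text = "\n".join(body).strip()
        if text:
            chunks.append((" / ".join(path), text))
        body.clear()

    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # ignore '#' lines inside code fences
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            del path[level - 1:]     # drop headings at this depth or deeper
            path.append(match.group(2).strip())
        else:
            body.append(line)
    flush()
    return chunks
```

Running it over the example document above yields three pairs whose first elements are the three section paths listed. A production chunker would additionally merge small sections or split oversized ones to stay near the token target.
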
### 2. LLM Claim Extraction

For each chunk, Claude extracts factual assertions as structured JSON:

**Extraction Criteria:**
- Factual (verifiable from text)
- Useful for developers
- Has clear subject/predicate/value

**Example extraction:**

Input text:
```markdown
### BasicSR/Torchvision Fix
The core issue is that basicsr 1.4.2 imports from
`torchvision.transforms.functional_tensor` which was removed in
torchvision 0.15+.

**Primary Solution:**
git+https://github.com/XPixelGroup/BasicSR@8d56e3a
```

Extracted claim:
```json
{
  "subject": "ml/dependencies/basicsr/torchvision",
  "predicate": "incompatible_with",
  "value": ">=0.15",
  "explanation": "basicsr 1.4.2 imports from torchvision.transforms.functional_tensor which was removed in torchvision 0.15+",
  "authority": "XPixelGroup/BasicSR@8d56e3a",
  "category": "compatibility"
}
```

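
To show how a consumer might load such a claim, here is a small Python sketch. The `Claim` dataclass and `parse_claim` helper are hypothetical names introduced for illustration; only the six field names come from the JSON above.

```python
import json
from dataclasses import dataclass

# Field names taken from the extracted-claim JSON shown above.
REQUIRED_FIELDS = ("subject", "predicate", "value", "explanation", "authority", "category")


@dataclass(frozen=True)
class Claim:
    subject: str
    predicate: str
    value: str
    explanation: str
    authority: str
    category: str


def parse_claim(raw: str) -> Claim:
    """Parse one JSON object into a Claim, rejecting missing or empty fields."""
    data = json.loads(raw)
    missing = [name for name in REQUIRED_FIELDS if not str(data.get(name, "")).strip()]
    if missing:
        raise ValueError(f"claim is missing fields: {', '.join(missing)}")
    return Claim(**{name: str(data[name]) for name in REQUIRED_FIELDS})
```

Rejecting empty fields at parse time mirrors the "clear subject/predicate/value" criterion: a claim without all three is not usable downstream.
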
### 3. Authority Inference

The LLM infers authority sources from context:

| Pattern | Authority Format | Example |
|---------|-----------------|---------|
| GitHub URL | `repo@commit` | `XPixelGroup/BasicSR@8d56e3a` |
| Research paper | `Author et al. (Year)` | `Smith et al. (2023)` |
| Official docs | `Product Documentation` | `PyTorch Documentation` |
| Empirical | `Community consensus` | `Community best practice` |

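
The GitHub row of the table can be approximated deterministically. The sketch below is an assumption-laden illustration (the skill itself relies on the LLM, and `github_authority` is a hypothetical helper): it looks for a `github.com/<owner>/<repo>@<commit>` reference and formats it as `repo@commit`.

```python
import re
from typing import Optional

# owner/repo followed by '@' and a 7-40 character hex commit id.
GITHUB_REF = re.compile(
    r"github\.com/([\w.-]+)/([\w.-]+?)(?:\.git)?@([0-9a-fA-F]{7,40})\b"
)


def github_authority(text: str) -> Optional[str]:
    """Return 'owner/repo@commit' for the first pinned GitHub reference, if any."""
    match = GITHUB_REF.search(text)
    if match is None:
        return None
    owner, repo, commit = match.groups()
    return f"{owner}/{repo}@{commit}"
```

Applied to the input text from the previous section, `github_authority("git+https://github.com/XPixelGroup/BasicSR@8d56e3a")` returns `XPixelGroup/BasicSR@8d56e3a`. The other rows (papers, official docs, community consensus) need judgment that a regular expression cannot supply.
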
### 4. Tier Assignment

The skill assigns tiers based on authority source:

| Tier | Authority Type | Examples |
|------|---------------|----------|
| 0 | Regulatory specs | RFC, W3C standards |
| 1 | Authoritative sources | Official docs, research papers |
| 2 | Observational | GitHub repos, community consensus |
| 3 | Empirical | Unverified claims |

**Guidance to LLM:**
- Official standards (RFC, W3C) → Tier 0
- Official documentation, published research → Tier 1
- GitHub repos, maintainer statements → Tier 2
- Community reports, unverified → Tier 3

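
The guidance above amounts to a lookup from source kind to tier. The following sketch encodes it in Python; the source-kind labels (`"standard"`, `"official_docs"`, and so on) are hypothetical names chosen for this example, and falling back to tier 3 for unknown kinds is an assumption.

```python
from enum import IntEnum


class Tier(IntEnum):
    REGULATORY = 0
    AUTHORITATIVE = 1
    OBSERVATIONAL = 2
    EMPIRICAL = 3


# Hypothetical source-kind labels mapped per the guidance list above.
TIER_BY_SOURCE_KIND = {
    "standard": Tier.REGULATORY,              # RFC, W3C
    "official_docs": Tier.AUTHORITATIVE,
    "research_paper": Tier.AUTHORITATIVE,
    "github_repo": Tier.OBSERVATIONAL,
    "maintainer_statement": Tier.OBSERVATIONAL,
    "community_report": Tier.EMPIRICAL,
}


def assign_tier(source_kind: str) -> Tier:
    """Map a source kind to a tier; unknown kinds are treated as unverified."""
    return TIER_BY_SOURCE_KIND.get(source_kind, Tier.EMPIRICAL)
```

Defaulting to the lowest-trust tier keeps an unrecognized source from being promoted by accident.
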
### 5. Persistence via CLI

Each extracted claim is stored using:

```bash
aphoria corpus create \
  --subject "ml/dependencies/basicsr/torchvision" \
  --predicate "incompatible_with" \
  --value ">=0.15" \
  --explanation "basicsr 1.4.2 imports from torchvision.transforms.functional_tensor which was removed in 0.15+" \
  --authority "XPixelGroup/BasicSR@8d56e3a" \
  --category "compatibility" \
  --tier 2
```

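
When driving the CLI from a script, building the argument list programmatically avoids shell-quoting mistakes with values such as `>=0.15`. A minimal sketch, assuming the claim is held in a plain dict; `corpus_create_argv` is a hypothetical helper, and only the flags shown above are used.

```python
import shlex

# Flags documented for `aphoria corpus create`, in the order shown above.
CLAIM_FLAGS = ("subject", "predicate", "value", "explanation", "authority", "category")


def corpus_create_argv(claim: dict, tier: int) -> list[str]:
    """Build the argument list for one `aphoria corpus create` invocation."""
    argv = ["aphoria", "corpus", "create"]
    for flag in CLAIM_FLAGS:
        argv += [f"--{flag}", str(claim[flag])]
    argv += ["--tier", str(tier)]
    return argv


def as_shell_command(argv: list[str]) -> str:
    """Render argv as a copy-pasteable, safely quoted shell command."""
    return shlex.join(argv)
```

Passing the list to `subprocess.run(argv)` executes the command without invoking a shell, so the `>` in `>=0.15` is never interpreted as a redirect.
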
## CLI Reference: `aphoria corpus create`

Create a corpus assertion from structured claim data.

**Usage:**
```bash
aphoria corpus create \
  --subject <hierarchical/path> \
  --predicate <relationship> \
  --value <value> \
  --explanation <full-context> \
  --authority <source> \
  --category <category> \
  --tier <0-3>
```

**Arguments:**

| Flag | Required | Description | Example |
|------|----------|-------------|---------|
| `--subject` | Yes | Hierarchical path to concept | `ml/basicsr/torchvision` |
| `--predicate` | Yes | Relationship type | `incompatible_with` |
| `--value` | Yes | Value or constraint | `">=0.15"` |
| `--explanation` | Yes | Full context sentence | `"basicsr 1.4.2 imports from..."` |
| `--authority` | Yes | Source citation | `XPixelGroup/BasicSR@8d56e3a` |
| `--category` | Yes | Category tag | `compatibility` |
| `--tier` | Yes | Authority tier (0-3) | `2` |

**Categories:**
- `compatibility` - Dependency constraints, version requirements
- `performance` - Performance characteristics, benchmarks
- `security` - Security properties, vulnerabilities
- `architecture` - Design patterns, structure
- `behavior` - Functional behavior, side effects

**Behavior:**

**Deduplication:** None. The command stores every claim, even when a claim with the same subject and predicate already exists. The corpus is append-only; keeping differing claims alongside their sources is the whole point of Episteme.

**Error Handling:** Bundles all validation errors and presents them together:

```
Error creating corpus assertion:

Validation errors:
  1. --subject: Must be non-empty hierarchical path (got: "")
  2. --tier: Must be 0-3 (got: 5)
  3. --category: Must be one of: compatibility, performance, security, architecture, behavior (got: "random")

Fix all errors and retry.
```

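
The "collect everything, then report" pattern can be sketched as follows. This is an illustration of the pattern only, not the CLI's source: `validate_claim` is a hypothetical function, and it covers just the three checks that appear in the sample error output.

```python
CATEGORIES = ("compatibility", "performance", "security", "architecture", "behavior")


def validate_claim(subject: str, tier: int, category: str) -> list[str]:
    """Return every validation error at once instead of stopping at the first."""
    errors: list[str] = []
    if not subject.strip():
        errors.append(f'--subject: Must be non-empty hierarchical path (got: "{subject}")')
    if tier not in range(4):
        errors.append(f"--tier: Must be 0-3 (got: {tier})")
    if category not in CATEGORIES:
        errors.append(
            f'--category: Must be one of: {", ".join(CATEGORIES)} (got: "{category}")'
        )
    return errors
```

Returning a list rather than raising on the first failure lets the caller print all problems in one pass, which saves a retry cycle per error.
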
**Example:**
```bash
$ aphoria corpus create \
  --subject "ml/pytorch/version" \
  --predicate "requires" \
  --value ">=2.0" \
  --explanation "Uses torch.compile which requires PyTorch 2.0+" \
  --authority "PyTorch 2.0 Release Notes" \
  --category "compatibility" \
  --tier 1

✓ Created corpus assertion: ml/pytorch/version
  Stored in: ~/.aphoria/corpus-db
```

## Skill Output Format

The `extract-wiki-corpus` skill produces structured output:

```
Reading article: REQUEST_FOR_RESEARCH_ANSWERS.md (12,450 tokens)
Chunked into 3 segments (by ## headings)

Chunk 1/3: "Critical Compatibility Solutions"
  Extracted 8 claims

  1. ml/dependencies/basicsr/torchvision
     incompatible_with = ">=0.15"
     Authority: XPixelGroup/BasicSR@8d56e3a
     ✓ Stored

  2. ml/enhancements/gpen/gfpgan
     outperforms = "eye_enhancement"
     Authority: Research comparison (2023)
     ✓ Stored

  [... 6 more claims ...]

Chunk 2/3: "CUDA 12.9 Compatibility"
  Extracted 5 claims

  9. ml/face_detection/mediapipe/dlib
     preferred_over = "CUDA 12 support"
     Authority: Community consensus
     ✓ Stored

  [... 4 more claims ...]

Chunk 3/3: "Optimized Requirements"
  Extracted 10 claims

  [... all claims ...]

Summary:
  Total claims: 23
  Successfully stored: 23
  Failed: 0

Corpus database: ~/.aphoria/corpus-db
Query: curl 'http://localhost:18180/v1/aphoria/corpus?category=compatibility'
```

**If errors occur:**
```
Summary:
  Total claims: 23
  Successfully stored: 18
  Failed: 5

Errors:
  1. Claim #7 (ml/torch/cuda/version)
     - --tier: Must be 0-3 (got: 5)
     - Fix: LLM assigned invalid tier

  2. Claim #12 (ml/xformers/optional)
     - --subject: Empty subject path
     - Fix: LLM extraction failed

  [... 3 more errors with details ...]

Fix these issues and re-run extraction.
```

## Verification

After extraction, verify claims appear in the corpus:

```bash
# Query all compatibility claims
curl -s 'http://localhost:18180/v1/aphoria/corpus?category=compatibility' | jq '.total_matching'
# Expected: the number of compatibility claims stored (23 if every extracted claim is in that category)

# Query specific subject
curl -s 'http://localhost:18180/v1/aphoria/corpus' | \
  jq '.items[] | select(.subject | contains("basicsr"))'

# Expected output:
{
  "subject": "ml/dependencies/basicsr/torchvision",
  "predicate": "incompatible_with",
  "value": ">=0.15",
  "source": "ml://",
  "tier": 2,
  "category": "compatibility",
  "explanation": "basicsr 1.4.2 imports from torchvision.transforms.functional_tensor which was removed in 0.15+",
  "authority_source": "XPixelGroup/BasicSR@8d56e3a"
}
```

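
The same check can be scripted without `jq`. The sketch below assumes the response shape implied by the `jq` filters above (a JSON object with an `items` array whose entries carry a `subject`); `fetch_corpus` and `claims_matching` are hypothetical helper names.

```python
import json
from urllib.request import urlopen

CORPUS_URL = "http://localhost:18180/v1/aphoria/corpus"


def fetch_corpus(url: str = CORPUS_URL) -> dict:
    """Fetch and decode the corpus listing (requires the local server to be running)."""
    with urlopen(url) as response:
        return json.load(response)


def claims_matching(payload: dict, needle: str) -> list[dict]:
    """Return items whose subject contains `needle`, like jq's select(... | contains(...))."""
    return [
        item for item in payload.get("items", [])
        if needle in item.get("subject", "")
    ]
```

With the server up, `claims_matching(fetch_corpus(), "basicsr")` returns the same entries as the second `curl` command.
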
## Dashboard View

Extracted claims appear in the Aphoria dashboard at `/corpus`:

**Filters:**
- By category: compatibility, performance, security, architecture, behavior
- By tier: 0 (Regulatory), 1 (Authoritative), 2 (Observational), 3 (Empirical)
- By source: ml://, security://, etc.

**Display:**
- Subject path as breadcrumbs: `ml > dependencies > basicsr > torchvision`
- Tier badge with color coding
- Full explanation text
- Authority citation as link (if URL)

## Troubleshooting

**Problem:** Skill chunks too aggressively, loses context

**Solution:** Raise the chunk size in the skill configuration (the target is 4K tokens; up to 8K works for complex articles).

---

**Problem:** LLM assigns wrong tiers

**Solution:** Refine tier guidance in skill prompt:
- Official standards (RFC, IEEE) → Tier 0
- Official docs, peer-reviewed papers → Tier 1
- GitHub repos, maintainer statements → Tier 2
- Blog posts, community forums → Tier 3

---

**Problem:** Too many failed claims (validation errors)

**Solution:** Check common error patterns:
```bash
# Review failed claims
grep "Failed:" /tmp/extraction-output.log

# Common issues:
# 1. Empty subjects - LLM extraction failed
# 2. Invalid tiers - LLM assigned tier > 3
# 3. Missing required fields - Incomplete extraction
```

Fix by refining the LLM extraction prompt.

---

**Problem:** Duplicate claims (same subject+predicate)

**This is expected behavior.** Episteme stores ALL claims, even duplicates from different sources. This enables:
- Sourced differing opinions (PyTorch docs say X, community says Y)
- Conflict detection (authority says A, codebase does B)
- Historical tracking (claim evolved over time)

To query all claims for a subject:
```bash
curl -s 'http://localhost:18180/v1/aphoria/corpus' | \
  jq '.items[] | select(.subject == "ml/dependencies/basicsr/torchvision")'
```

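
Because duplicates are kept, finding disagreements becomes a grouping problem. A minimal sketch, assuming each item is a dict with `subject`, `predicate`, and `value` keys as in the query output shown earlier; `conflicting_claims` is a hypothetical helper, not part of Aphoria.

```python
from collections import defaultdict


def conflicting_claims(items: list[dict]) -> dict[tuple[str, str], list[dict]]:
    """Group claims by (subject, predicate) and keep groups whose values differ."""
    grouped: dict[tuple[str, str], list[dict]] = defaultdict(list)
    for item in items:
        grouped[(item["subject"], item["predicate"])].append(item)
    return {
        key: claims
        for key, claims in grouped.items()
        if len({claim["value"] for claim in claims}) > 1
    }
```

Groups with a single distinct value are plain repeats from different sources; groups with several values are the "docs say X, community says Y" cases worth reviewing.
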
## Integration with Other Features

**With Scans:**
- Corpus claims act as authority sources
- Aphoria compares scanned observations against corpus
- Conflicts trigger violations

**With Claims Management:**
- Can supersede corpus claims: `aphoria claims supersede <id>`
- Can deprecate outdated corpus: `aphoria claims deprecate <id>`
- Corpus claims have same structure as project claims

**With Dashboard:**
- All corpus claims visible at `/corpus`
- Filterable by category, tier, source
- Click through to see full explanation

## Best Practices

**DO:**
- Extract from authoritative sources (official docs, research)
- Verify claims appear in dashboard after extraction
- Review tier assignments for accuracy
- Include full context in explanations

**DON'T:**
- Extract from opinion pieces or blogs (if you must, assign tier 3)
- Skip authority citations (always provide a source)
- Use vague subjects (write `ml/pytorch/feature/specific`, not `thing`)
- Ignore validation errors (fix all before considering extraction complete)

## Examples

### Example 1: ML Dependencies

**Input:** `~/wiki/ml-stack.md`
```markdown
## PyTorch CUDA Compatibility

PyTorch 2.6.0 with CUDA 12.6 builds are forward compatible with CUDA 12.9.

Source: PyTorch 2.6 Release Notes
```

**Extraction:**
```bash
claude -p ~/wiki/ml-stack.md --skill extract-wiki-corpus

# Output:
Extracted 1 claim:
  ✓ ml/pytorch/cuda/compatibility
    predicate: forward_compatible_with
    value: "CUDA 12.9"
    tier: 1 (PyTorch 2.6 Release Notes)
```

### Example 2: Security Best Practices

**Input:** `~/wiki/security.md`
```markdown
## Password Hashing

Research shows Argon2 consistently outperforms bcrypt and scrypt for
password hashing in modern environments.

Source: OWASP Password Storage Cheat Sheet (2023)
```

**Extraction:**
```bash
claude -p ~/wiki/security.md --skill extract-wiki-corpus

# Output:
Extracted 1 claim:
  ✓ security/password/hashing/algorithm
    predicate: recommended
    value: "Argon2"
    tier: 1 (OWASP Password Storage Cheat Sheet)
```

### Example 3: Large Article

**Input:** `~/wiki/complete-stack.md` (15,000 tokens)
```markdown
# Complete Python Stack for SDXL

## Critical Solutions
[4,000 tokens]

## Enhancement Libraries
[5,000 tokens]

## CUDA Compatibility
[6,000 tokens]
```

**Extraction:**
```bash
claude -p ~/wiki/complete-stack.md --skill extract-wiki-corpus

# Output:
Reading article: complete-stack.md (15,234 tokens)
Chunked into 3 segments (by ## headings)

Chunk 1/3: "Critical Solutions"
  Extracted 12 claims
  ...

Chunk 2/3: "Enhancement Libraries"
  Extracted 8 claims
  ...

Chunk 3/3: "CUDA Compatibility"
  Extracted 7 claims
  ...

Summary: 27 claims extracted, 27 stored successfully
```

## See Also

- [CLI Reference](../reference/cli-reference.md) - All `aphoria corpus` commands
- [Corpus API](../api-reference.md) - Query corpus programmatically
- [Claims vs Observations](../../README.md#claims-vs-observations) - Key concepts