LLM-Based Wiki Corpus Extraction
Extract factual claims from technical documentation using an LLM skill that chunks each article, extracts claims from every chunk, and persists them to the corpus database.
Quick Start
# Extract claims from a wiki article
cd ~/Workspace/stemedb
claude -p ~/path/to/wiki/article.md --skill extract-wiki-corpus
# Example with actual file
claude -p ~/Workspace/orchard9/wiki/intakes/REQUEST_FOR_RESEARCH_ANSWERS.md \
--skill extract-wiki-corpus
Expected output:
Reading article: REQUEST_FOR_RESEARCH_ANSWERS.md (12,450 tokens)
Chunked into 3 segments (by ## headings)
Chunk 1/3: "Critical Compatibility Solutions"
Extracted 8 claims
✓ ml/basicsr/torchvision/incompatible_with = ">=0.15"
✓ ml/gpen/gfpgan/outperforms = "eye_enhancement"
...
Chunk 2/3: "CUDA 12.9 Compatibility"
Extracted 5 claims
...
Summary: 23 claims extracted, 23 stored successfully
How It Works
1. Intelligent Chunking
The skill chunks large articles to fit LLM context limits:
Strategy:
- Target: ~4K tokens per chunk
- Break at ## headings when possible
- Preserve context: Include document title + section path in each chunk
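The strategy above can be sketched in a few lines of Python. This is a minimal illustration, not the skill's actual implementation: the function names, the 4-characters-per-token estimate, and the paragraph-level fallback for oversized sections are all assumptions made for the example.

```python
import re

CHARS_PER_TOKEN = 4  # assumption: crude size estimate, not a real tokenizer
HEADING = re.compile(r"^(#{1,6})\s+(.+)$")


def split_oversized(content, target_tokens):
    """Greedily pack paragraphs so each piece stays near the token target."""
    limit = target_tokens * CHARS_PER_TOKEN
    pieces, current = [], ""
    for paragraph in content.split("\n\n"):
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if current and len(candidate) > limit:
            pieces.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        pieces.append(current)
    return pieces


def chunk_markdown(text, target_tokens=4000):
    """Split at headings and label every chunk with its full section path."""
    chunks, path, body = [], [], []

    def flush():
        content = "\n".join(body).strip()
        body.clear()
        if content:
            for piece in split_oversized(content, target_tokens):
                chunks.append({"path": " / ".join(path), "content": piece})

    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            flush()  # close the previous section before the path changes
            level = len(match.group(1))
            del path[level - 1:]  # drop headings at this depth or deeper
            path.append(match.group(2).strip())
        else:
            body.append(line)
    flush()
    return chunks
```

One known limitation of this sketch: a line starting with `#` inside a code block in the article would be mistaken for a heading, so a real chunker would need to track fenced regions.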
Example:
# Python Dependency Stack
## Critical Solutions
### BasicSR Fix
[content...]
Becomes 3 chunks:
- "Python Dependency Stack / Critical Solutions / BasicSR Fix" + content
- "Python Dependency Stack / Critical Solutions / GPEN vs GFPGAN" + content
- "Python Dependency Stack / CUDA Compatibility" + content
2. LLM Claim Extraction
For each chunk, Claude extracts factual assertions as structured JSON:
Extraction Criteria:
- Factual (verifiable from text)
- Useful for developers
- Has clear subject/predicate/value
Example extraction:
Input text:
### BasicSR/Torchvision Fix
The core issue is that basicsr 1.4.2 imports from
`torchvision.transforms.functional_tensor` which was removed in
torchvision 0.15+.
**Primary Solution:**
git+https://github.com/XPixelGroup/BasicSR@8d56e3a
Extracted claim:
{
"subject": "ml/dependencies/basicsr/torchvision",
"predicate": "incompatible_with",
"value": ">=0.15",
"explanation": "basicsr 1.4.2 imports from torchvision.transforms.functional_tensor which was removed in torchvision 0.15+",
"authority": "XPixelGroup/BasicSR@8d56e3a",
"category": "compatibility"
}
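To make the shape of an extracted claim concrete, here is a hedged sketch of how the JSON above could be loaded into typed records. The `Claim` class and `parse_claims` helper are hypothetical names for this example; the field names are the ones shown in the extracted claim.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    """One extracted claim, using the field names from the JSON above."""
    subject: str
    predicate: str
    value: str
    explanation: str
    authority: str
    category: str


def parse_claims(raw):
    """Parse LLM output (one JSON object or an array of them) into Claims.

    Records missing a clear subject/predicate/value are dropped, mirroring
    the extraction criteria. Unknown or missing keys raise TypeError.
    """
    data = json.loads(raw)
    records = data if isinstance(data, list) else [data]
    claims = [Claim(**record) for record in records]
    return [c for c in claims if c.subject and c.predicate and c.value]
```

Failing loudly on unexpected keys is a design choice for the sketch: it surfaces malformed LLM output early instead of storing partial claims.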
3. Authority Inference
The LLM infers authority sources from context:
| Pattern | Authority Format | Example |
|---|---|---|
| GitHub URL | repo@commit | XPixelGroup/BasicSR@8d56e3a |
| Research paper | Author et al. (Year) | Smith et al. (2023) |
| Official docs | Product Documentation | PyTorch Documentation |
| Empirical | Community consensus | Community best practice |
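In the skill this inference is done by the LLM, but the GitHub row is mechanical enough to illustrate with a regular expression. The sketch below is an assumption-laden example (the `github_authority` name and the 7 to 40 hex-digit commit rule are mine), showing how a pinned GitHub URL maps to the repo@commit format.

```python
import re

# Matches github.com/<owner>/<repo>[.git]@<commit>, where the commit is a
# 7-40 character hex string (an assumption about how commits are pinned).
GITHUB_REF = re.compile(
    r"github\.com/([\w.-]+)/([\w.-]+?)(?:\.git)?@([0-9a-fA-F]{7,40})\b"
)


def github_authority(text):
    """Return 'owner/repo@commit' for the first pinned GitHub URL, else None."""
    match = GITHUB_REF.search(text)
    if match is None:
        return None
    owner, repo, commit = match.groups()
    return f"{owner}/{repo}@{commit}"


print(github_authority("git+https://github.com/XPixelGroup/BasicSR@8d56e3a"))
# prints: XPixelGroup/BasicSR@8d56e3a
```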
4. Tier Assignment
The skill assigns tiers based on authority source:
| Tier | Authority Type | Examples |
|---|---|---|
| 0 | Regulatory specs | RFC, W3C standards |
| 1 | Authoritative sources | Official docs, research papers |
| 2 | Observational | GitHub repos, community consensus |
| 3 | Empirical | Unverified claims |
Guidance to LLM:
- Official standards (RFC, W3C) → Tier 0
- Official documentation, published research → Tier 1
- GitHub repos, maintainer statements → Tier 2
- Community reports, unverified → Tier 3
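The guidance above amounts to a lookup from source kind to tier. A minimal sketch follows; the source-kind labels are hypothetical strings chosen for this example, and defaulting unknown kinds to tier 3 is my assumption, chosen because tier 3 is the "unverified" bucket.

```python
from enum import IntEnum


class Tier(IntEnum):
    REGULATORY = 0      # RFC, W3C standards
    AUTHORITATIVE = 1   # official docs, published research
    OBSERVATIONAL = 2   # GitHub repos, maintainer statements
    EMPIRICAL = 3       # community reports, unverified claims


# Hypothetical source-kind labels; the skill itself infers these via the LLM.
SOURCE_KIND_TO_TIER = {
    "standard": Tier.REGULATORY,
    "official_docs": Tier.AUTHORITATIVE,
    "research_paper": Tier.AUTHORITATIVE,
    "github_repo": Tier.OBSERVATIONAL,
    "maintainer_statement": Tier.OBSERVATIONAL,
    "community_report": Tier.EMPIRICAL,
}


def assign_tier(source_kind):
    """Map a source kind to a tier, treating anything unknown as unverified."""
    return SOURCE_KIND_TO_TIER.get(source_kind, Tier.EMPIRICAL)
```

Because `Tier` is an `IntEnum`, `int(assign_tier("github_repo"))` yields the numeric value expected by a 0-3 tier field.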
5. Persistence via CLI
Each extracted claim is stored using:
aphoria corpus create \
--subject "ml/dependencies/basicsr/torchvision" \
--predicate "incompatible_with" \
--value ">=0.15" \
--explanation "basicsr 1.4.2 imports from torchvision.transforms.functional_tensor which was removed in 0.15+" \
--authority "XPixelGroup/BasicSR@8d56e3a" \
--category "compatibility" \
--tier 2
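When driving this command from a script, values such as `>=0.15` are easy to mangle with shell quoting. The sketch below builds the argument list in Python and passes it without a shell. The flags are the ones shown above; the helper names are hypothetical, and `run_create` assumes the `aphoria` binary is on `PATH`.

```python
import shlex
import subprocess

# Flags in the order used by the command above.
FLAGS = ("subject", "predicate", "value", "explanation",
         "authority", "category", "tier")


def build_create_command(claim):
    """Turn a claim dict into an argv list for `aphoria corpus create`."""
    argv = ["aphoria", "corpus", "create"]
    for flag in FLAGS:
        argv += [f"--{flag}", str(claim[flag])]
    return argv


def preview(argv):
    """Render argv as a copy-pasteable, correctly quoted shell line."""
    return shlex.join(argv)


def run_create(claim):
    """Execute the command; raises CalledProcessError on a non-zero exit."""
    return subprocess.run(build_create_command(claim), check=True)
```

Passing a list to `subprocess.run` bypasses the shell entirely, so `>` in a value is never interpreted as a redirect.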
CLI Reference: aphoria corpus create
Create a corpus assertion from structured claim data.
Usage:
aphoria corpus create \
--subject <hierarchical/path> \
--predicate <relationship> \
--value <value> \
--explanation <full-context> \
--authority <source> \
--category <category> \
--tier <0-3>
Arguments:
| Flag | Required | Description | Example |
|---|---|---|---|
| --subject | Yes | Hierarchical path to concept | ml/basicsr/torchvision |
| --predicate | Yes | Relationship type | incompatible_with |
| --value | Yes | Value or constraint | ">=0.15" |
| --explanation | Yes | Full context sentence | "basicsr 1.4.2 imports from..." |
| --authority | Yes | Source citation | XPixelGroup/BasicSR@8d56e3a |
| --category | Yes | Category tag | compatibility |
| --tier | Yes | Authority tier (0-3) | 2 |
Categories:
- compatibility - Dependency constraints, version requirements
- performance - Performance characteristics, benchmarks
- security - Security properties, vulnerabilities
- architecture - Design patterns, structure
- behavior - Functional behavior, side effects
Behavior:
Deduplication: None. The store is append-only and keeps ALL claims, even when the same subject+predicate already exists; preserving differing claims together with their sources is the whole point of Episteme.
Error Handling: Bundles all validation errors and presents them together:
Error creating corpus assertion:
Validation errors:
1. --subject: Must be non-empty hierarchical path (got: "")
2. --tier: Must be 0-3 (got: 5)
3. --category: Must be one of: compatibility, performance, security, architecture, behavior (got: "random")
Fix all errors and retry.
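The bundling behavior means validation collects every problem before reporting, instead of stopping at the first one. A minimal Python sketch of that pattern is below. It is illustrative only: the real CLI's implementation is not shown in this document, and only the three checks from the sample output are modeled.

```python
CATEGORIES = ("compatibility", "performance", "security",
              "architecture", "behavior")


def validate(args):
    """Collect all validation errors instead of failing on the first."""
    errors = []
    subject = args.get("subject", "")
    if not subject.strip("/"):
        errors.append(
            f'--subject: Must be non-empty hierarchical path (got: "{subject}")'
        )
    tier = args.get("tier")
    if not (isinstance(tier, int) and 0 <= tier <= 3):
        errors.append(f"--tier: Must be 0-3 (got: {tier})")
    category = args.get("category", "")
    if category not in CATEGORIES:
        allowed = ", ".join(CATEGORIES)
        errors.append(
            f'--category: Must be one of: {allowed} (got: "{category}")'
        )
    return errors


def format_errors(errors):
    """Render the bundled errors in the layout shown above."""
    lines = ["Error creating corpus assertion:", "Validation errors:"]
    lines += [f"{n}. {error}" for n, error in enumerate(errors, start=1)]
    lines.append("Fix all errors and retry.")
    return "\n".join(lines)
```

Reporting everything at once saves a round trip per mistake, which matters when an LLM generates many claims in one run.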
Example:
$ aphoria corpus create \
--subject "ml/pytorch/version" \
--predicate "requires" \
--value ">=2.0" \
--explanation "Uses torch.compile which requires PyTorch 2.0+" \
--authority "PyTorch 2.0 Release Notes" \
--category "compatibility" \
--tier 1
✓ Created corpus assertion: ml/pytorch/version
Stored in: ~/.aphoria/corpus-db
Skill Output Format
The extract-wiki-corpus skill produces structured output:
Reading article: REQUEST_FOR_RESEARCH_ANSWERS.md (12,450 tokens)
Chunked into 3 segments (by ## headings)
Chunk 1/3: "Critical Compatibility Solutions"
Extracted 8 claims
1. ml/dependencies/basicsr/torchvision
incompatible_with = ">=0.15"
Authority: XPixelGroup/BasicSR@8d56e3a
✓ Stored
2. ml/enhancements/gpen/gfpgan
outperforms = "eye_enhancement"
Authority: Research comparison (2023)
✓ Stored
[... 6 more claims ...]
Chunk 2/3: "CUDA 12.9 Compatibility"
Extracted 5 claims
9. ml/face_detection/mediapipe/dlib
preferred_over = "CUDA 12 support"
Authority: Community consensus
✓ Stored
[... 4 more claims ...]
Chunk 3/3: "Optimized Requirements"
Extracted 10 claims
[... all claims ...]
Summary:
Total claims: 23
Successfully stored: 23
Failed: 0
Corpus database: ~/.aphoria/corpus-db
Query: curl 'http://localhost:18180/v1/aphoria/corpus?category=compatibility'
If errors occur:
Summary:
Total claims: 23
Successfully stored: 18
Failed: 5
Errors:
1. Claim #7 (ml/torch/cuda/version)
- --tier: Must be 0-3 (got: 5)
- Fix: LLM assigned invalid tier
2. Claim #12 (ml/xformers/optional)
- --subject: Empty subject path
- Fix: LLM extraction failed
[... 3 more errors with details ...]
Fix these issues and re-run extraction.
Verification
After extraction, verify claims appear in the corpus:
# Query all compatibility claims
curl -s 'http://localhost:18180/v1/aphoria/corpus?category=compatibility' | jq '.total_matching'
# Expected: 23 (or however many were extracted)
# Query specific subject
curl -s 'http://localhost:18180/v1/aphoria/corpus' | \
jq '.items[] | select(.subject | contains("basicsr"))'
# Expected output:
{
"subject": "ml/dependencies/basicsr/torchvision",
"predicate": "incompatible_with",
"value": ">=0.15",
"source": "ml://",
"tier": 2,
"category": "compatibility",
"explanation": "basicsr 1.4.2 imports from torchvision.transforms.functional_tensor which was removed in 0.15+",
"authority_source": "XPixelGroup/BasicSR@8d56e3a"
}
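The same checks can be scripted without jq. The sketch below assumes only what the commands above show: the endpoint URL, the `category` query parameter, and a response containing `total_matching` and `items`. The helper names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://localhost:18180/v1/aphoria/corpus"


def fetch_corpus(**params):
    """GET the corpus endpoint, e.g. fetch_corpus(category="compatibility")."""
    url = f"{BASE_URL}?{urlencode(params)}" if params else BASE_URL
    with urlopen(url) as response:
        return json.load(response)


def subjects_containing(payload, needle):
    """Mirror of the jq filter: items whose subject contains `needle`."""
    return [
        item for item in payload.get("items", [])
        if needle in item.get("subject", "")
    ]


def verify_extraction(payload, expected_total):
    """True when the API reports exactly the number of claims extracted."""
    return payload.get("total_matching") == expected_total
```

For example, `verify_extraction(fetch_corpus(category="compatibility"), 23)` would confirm the count from the extraction summary, provided all 23 claims share that category.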
Dashboard View
Extracted claims appear in the Aphoria dashboard at /corpus:
Filters:
- By category: compatibility, performance, security, architecture, behavior
- By tier: 0 (Regulatory), 1 (Authoritative), 2 (Observational), 3 (Empirical)
- By source: ml://, security://, etc.
Display:
- Subject path as breadcrumbs: ml > dependencies > basicsr > torchvision
- Tier badge with color coding
- Full explanation text
- Authority citation as link (if URL)
Troubleshooting
Problem: Skill chunks too aggressively, loses context
Solution: Adjust chunk size in skill configuration (target 4K tokens, can go up to 8K for complex articles)
Problem: LLM assigns wrong tiers
Solution: Refine tier guidance in skill prompt:
- Official standards (RFC, IEEE) → Tier 0
- Official docs, peer-reviewed papers → Tier 1
- GitHub repos, maintainer statements → Tier 2
- Blog posts, community forums → Tier 3
Problem: Too many failed claims (validation errors)
Solution: Check common error patterns:
# Review failed claims
grep "Failed:" /tmp/extraction-output.log
# Common issues:
# 1. Empty subjects - LLM extraction failed
# 2. Invalid tiers - LLM assigned tier > 3
# 3. Missing required fields - Incomplete extraction
Fix by refining the LLM extraction prompt.
Problem: Duplicate claims (same subject+predicate)
This is expected behavior. Episteme stores ALL claims, even duplicates from different sources. This enables:
- Sourced differing opinions (PyTorch docs say X, community says Y)
- Conflict detection (authority says A, codebase does B)
- Historical tracking (claim evolved over time)
To query all claims for a subject:
curl -s 'http://localhost:18180/v1/aphoria/corpus' | \
jq '.items[] | select(.subject == "ml/dependencies/basicsr/torchvision")'
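Since duplicates are kept on purpose, a useful follow-up is to find where sources actually disagree. The sketch below groups items by subject+predicate and keeps only groups with more than one distinct value. It operates on the `items` array from the API response; the function name is hypothetical.

```python
from collections import defaultdict


def differing_claims(items):
    """Group claims by (subject, predicate); keep groups whose values differ."""
    groups = defaultdict(list)
    for item in items:
        groups[(item["subject"], item["predicate"])].append(item)
    return {
        key: group
        for key, group in groups.items()
        if len({claim["value"] for claim in group}) > 1
    }
```

Each surviving group is a candidate for the "sourced differing opinions" case: same subject and predicate, different values, each with its own authority.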
Integration with Other Features
With Scans:
- Corpus claims act as authority sources
- Aphoria compares scanned observations against corpus
- Conflicts trigger violations
With Claims Management:
- Can supersede corpus claims: aphoria claims supersede <id>
- Can deprecate outdated corpus: aphoria claims deprecate <id>
- Corpus claims have same structure as project claims
With Dashboard:
- All corpus claims visible at /corpus
- Filterable by category, tier, source
- Click through to see full explanation
Best Practices
DO:
- Extract from authoritative sources (official docs, research)
- Verify claims appear in dashboard after extraction
- Review tier assignments for accuracy
- Include full context in explanations
DON'T:
- Extract from opinion pieces or blogs (if you must, assign tier 3)
- Skip authority citations (always provide source)
- Use vague subjects (write "ml/pytorch/feature/specific", not "thing")
- Ignore validation errors (fix all before considering extraction complete)
Examples
Example 1: ML Dependencies
Input: ~/wiki/ml-stack.md
## PyTorch CUDA Compatibility
PyTorch 2.6.0 with CUDA 12.6 builds are forward compatible with CUDA 12.9.
Source: PyTorch 2.6 Release Notes
Extraction:
claude -p ~/wiki/ml-stack.md --skill extract-wiki-corpus
# Output:
Extracted 1 claim:
✓ ml/pytorch/cuda/compatibility
predicate: forward_compatible_with
value: "CUDA 12.9"
tier: 1 (PyTorch 2.6 Release Notes)
Example 2: Security Best Practices
Input: ~/wiki/security.md
## Password Hashing
Research shows Argon2 consistently outperforms bcrypt and scrypt for
password hashing in modern environments.
Source: OWASP Password Storage Cheat Sheet (2023)
Extraction:
claude -p ~/wiki/security.md --skill extract-wiki-corpus
# Output:
Extracted 1 claim:
✓ security/password/hashing/algorithm
predicate: recommended
value: "Argon2"
tier: 1 (OWASP Password Storage Cheat Sheet)
Example 3: Large Article
Input: ~/wiki/complete-stack.md (15,000 tokens)
# Complete Python Stack for SDXL
## Critical Solutions
[4,000 tokens]
## Enhancement Libraries
[5,000 tokens]
## CUDA Compatibility
[6,000 tokens]
Extraction:
claude -p ~/wiki/complete-stack.md --skill extract-wiki-corpus
# Output:
Reading article: complete-stack.md (15,234 tokens)
Chunked into 3 segments (by ## headings)
Chunk 1/3: "Critical Solutions"
Extracted 12 claims
...
Chunk 2/3: "Enhancement Libraries"
Extracted 8 claims
...
Chunk 3/3: "CUDA Compatibility"
Extracted 7 claims
...
Summary: 27 claims extracted, 27 stored successfully
See Also
- CLI Reference - All aphoria corpus commands
- Corpus API - Query corpus programmatically
- Claims vs Observations - Key concepts