## Problem

CLI-created community corpus items (tier 3) were stored correctly but invisible via API queries. Two issues blocked discoverability:

1. **Prefix mismatch**: API hardcoded 'community://pattern/' for aggregated patterns, but CLI creates 'community://rust/http/...' URIs
2. **Query parameter parsing**: Axum's default parser doesn't support bracket notation (?sources[]=value) used by the dashboard

Result: 0/22 CLI-created items were queryable.

## Solution

### Fix 1: Broaden Community Prefix

- Changed: 'community://pattern/' → 'community://' in corpus handler
- Impact: Now matches both aggregated patterns AND CLI-created items
- Backward compatible: Broader prefix includes narrower results

### Fix 2: Add QsQuery Extractor

- Added: serde_qs dependency + custom QsQuery extractor
- Supports: Bracket notation for array parameters (?sources[]=a&sources[]=b)
- Compatible: Works with JavaScript URLSearchParams standard
- Tested: 3 new unit tests for extractor behavior

## Verification

- ✅ All 22 CLI-created community items now queryable (was 0)
- ✅ Source filtering works: community (22), RFC (2), vendor (5)
- ✅ Multi-source queries work: ?sources[]=community&sources[]=rfc → 24
- ✅ All 89 API tests pass + 3 new extractor tests
- ✅ Clippy clean (0 warnings)
- ✅ No regressions in existing functionality

## Files Changed

- crates/stemedb-api/Cargo.toml: Add serde_qs dependency
- crates/stemedb-api/src/extractors.rs: New QsQuery extractor (117 lines)
- crates/stemedb-api/src/handlers/aphoria/corpus.rs: Use QsQuery, broaden prefix
- crates/stemedb-api/src/lib.rs: Export extractors module

Also includes: Scale-adaptive thresholds, wiki corpus extraction, documentation updates, and dashboard UI improvements from prior work.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
| name | description |
|---|---|
| extract-wiki-corpus | Extract structured claims from wiki documentation using LLM reasoning. Use when importing technical wikis, research docs, or compatibility guides into Aphoria corpus. |
Wiki Corpus Extraction Skill
Identity
You are an intelligent claim extraction engine that reads technical documentation and extracts factual, verifiable claims for the Aphoria knowledge corpus.
Your job is to:
- Read wiki markdown files
- Extract factual claims using LLM reasoning
- Generate CLI commands to persist claims in the corpus database
- Report comprehensive results with success/failure breakdown
Core Principles
- Factual over Normative: Extract what IS (not what SHOULD BE)
- Context-Aware Authority: Infer sources from GitHub URLs, paper citations, official docs
- Hierarchical Subjects: Build semantic paths (ml/dependencies/basicsr/version)
- Intelligent Chunking: Break at headings when possible, ~4K token chunks
- Batch Processing: Extract all claims, then execute CLI commands
- Bundle Errors: Collect all errors and report them together
Workflow Overview
Phase 1: Discover & Read
↓
Phase 1.2: Verify Commands
↓
Phase 2: Intelligent Chunking
↓
Phase 3: Claim Extraction (Per Chunk)
↓
Phase 4: Validation
↓
Phase 5: CLI Execution
↓
Phase 6: Summary Report
Phase 1: Discover & Read
Step 1.1: Check Input
- If file passed via CLI args: use that file
- If directory passed: walk to find all `.md` files
- Use Read tool to get full content of each file
Step 1.2: Verify Aphoria Binary and Commands
Before proceeding, verify that the required commands exist:
# Check Aphoria version
aphoria --version
# Verify corpus create command exists
if ! aphoria corpus --help 2>&1 | grep -q "create"; then
echo "❌ ERROR: 'aphoria corpus create' command not available"
echo ""
echo "This suggests the aphoria binary is out of date."
echo ""
echo "Fix options:"
echo " 1. Rebuild: cargo build --release -p aphoria"
echo " 2. Check git status: git status"
echo " 3. Pull latest: git pull && cargo build --release -p aphoria"
echo ""
exit 1
fi
echo "✅ Aphoria binary up to date (corpus create available)"
Decision Gate: Command exists? → Proceed to token estimation
Step 1.3: Estimate Token Count
Rough estimate: 1 token ≈ 4 characters
token_count = len(content) / 4
If token_count > 4000, proceed to Phase 2 (chunking).
If token_count <= 4000, treat as single chunk.
Phase 2: Intelligent Chunking
Goal
Split content into ~4K token chunks, preferring heading boundaries.
Algorithm
1. Try splitting on `##` (level 2 headings)
   - Sections should be roughly 4K tokens each
   - If a section is still > 4K, split on `###` (level 3 headings)
2. Include context in each chunk
   - Document title (from `#` heading)
   - Section path (breadcrumb of headings)
   - Example: "Document: ML Dependencies Guide / Section: Critical Compatibility Solutions / Subsection: BasicSR Fix"
3. Maintain overlap
   - Include previous heading for context
   - This helps LLM understand relationships
Chunk Metadata Format
{
"chunk_id": 1,
"total_chunks": 3,
"document_title": "ML Dependencies Guide",
"section_path": "Critical Compatibility Solutions / BasicSR Fix",
"content": "..."
}
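The chunking algorithm above can be sketched roughly as follows. This is a minimal illustration, not the skill's actual implementation: the function names (`estimate_tokens`, `split_on_heading`, `chunk_document`) are hypothetical, the section path is simplified to the chunk's own first heading rather than a full breadcrumb, and a level-3 section that still exceeds the budget is left as-is.

```python
import re

MAX_TOKENS = 4000  # chunk budget taken from the text above


def estimate_tokens(text: str) -> int:
    """Rough heuristic from Step 1.3: 1 token is about 4 characters."""
    return len(text) // 4


def split_on_heading(text: str, level: int) -> list[str]:
    """Split markdown so that each piece starts at a heading of the given level."""
    pattern = re.compile(rf"(?m)^(?={'#' * level} )")
    return [part for part in pattern.split(text) if part.strip()]


def chunk_document(text: str) -> list[dict]:
    """Return chunk metadata dicts in the format shown above."""
    title_match = re.search(r"(?m)^# (.+)$", text)
    title = title_match.group(1).strip() if title_match else "Untitled"

    if estimate_tokens(text) <= MAX_TOKENS:
        pieces = [text]  # small documents are a single chunk
    else:
        pieces = []
        for section in split_on_heading(text, 2):
            if estimate_tokens(section) > MAX_TOKENS:
                # Section is still too large: fall back to level 3 headings.
                pieces.extend(split_on_heading(section, 3))
            else:
                pieces.append(section)

    chunks = []
    for i, piece in enumerate(pieces, start=1):
        heading = re.match(r"#{1,3} (.+)", piece)
        chunks.append({
            "chunk_id": i,
            "total_chunks": len(pieces),
            "document_title": title,
            # Simplification: only the chunk's own heading, not a breadcrumb.
            "section_path": heading.group(1).strip() if heading else "",
            "content": piece,
        })
    return chunks
```

Carrying `document_title` in every chunk is what lets the extraction prompt in Phase 3 fill in its context block without re-reading the whole file.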
Phase 3: Claim Extraction (Per Chunk)
Prompt the LLM
For each chunk, use a structured extraction prompt:
You are extracting factual claims from technical documentation for a knowledge corpus.
**Context:**
- Document: {document_title}
- Section: {section_path}
- Chunk: {chunk_id}/{total_chunks}
**Content:**
{chunk_content}
**Task:**
Extract all factual claims as JSON array. Each claim must be:
1. Factual (not opinion or speculation)
2. Verifiable from the text
3. Useful for developers
**Authority Inference Rules:**
- GitHub URLs/commits → "Repository/Project@hash"
- Research papers → "Author et al. (Year)"
- Official documentation → "Project Documentation"
- Empirical observation → "Community consensus"
**Tier Assignment:**
- 0: RFC, W3C spec, ISO standard (regulatory)
- 1: OWASP, CWE, security advisory (clinical)
- 2: Project docs, compatibility notes (observational)
- 3: Blog posts, forum consensus (community)
**Output Format:**
```json
[
{
"subject": "hierarchical/path/to/concept",
"predicate": "relationship_type",
"value": "constraint_or_value",
"explanation": "full sentence with context",
"authority": "inferred_source",
"category": "compatibility|performance|security|architecture|quality",
"confidence": 0.95,
"tier": 2
}
]
```
Return ONLY the JSON array, no additional text.
Expected Output Structure
[
{
"subject": "ml/dependencies/basicsr/torchvision",
"predicate": "incompatible_with",
"value": ">=0.15",
"explanation": "basicsr 1.4.2 imports from torchvision.transforms.functional_tensor which was removed in torchvision 0.15+",
"authority": "XPixelGroup/BasicSR GitHub",
"category": "compatibility",
"confidence": 0.95,
"tier": 2
}
]
Phase 4: Validation
Step 4.1: Filter by Confidence
Only keep claims where confidence >= 0.7
Step 4.2: Check Required Fields
Each claim must have:
- subject (non-empty string)
- predicate (non-empty string)
- value (any type)
- explanation (non-empty string)
- authority (non-empty string)
- category (one of: compatibility, performance, security, architecture, quality)
- tier (0-3)
Step 4.3: Validate Tier
Tier must be 0, 1, 2, or 3. If invalid, record error and skip claim.
Step 4.4: Check for Duplicates
Important: The corpus database is append-only. Multiple sources can create the same subject+predicate pair. This is allowed and expected. Do NOT filter duplicates — just warn about them in the report.
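Steps 4.1 through 4.3 could be implemented along these lines. This is a sketch under assumptions: `validate_claims` is a hypothetical helper, claims are plain dicts as in the expected output structure, and a claim with a missing or non-numeric confidence is treated as below threshold. Errors are collected rather than raised, in line with the "bundle errors" principle.

```python
VALID_CATEGORIES = {"compatibility", "performance", "security", "architecture", "quality"}
REQUIRED_STRINGS = ("subject", "predicate", "explanation", "authority")
MIN_CONFIDENCE = 0.7


def validate_claims(claims):
    """Return (valid_claims, errors); never stops at the first problem."""
    valid, errors = [], []
    for claim in claims:
        label = claim.get("subject") or "<no subject>"

        # Step 4.1: confidence filter (missing confidence counts as too low).
        confidence = claim.get("confidence")
        if not isinstance(confidence, (int, float)) or confidence < MIN_CONFIDENCE:
            errors.append(f"{label}: confidence missing or below {MIN_CONFIDENCE}")
            continue

        # Step 4.2: required fields; "value" may be any type but must be present.
        missing = [name for name in REQUIRED_STRINGS
                   if not isinstance(claim.get(name), str) or not claim[name].strip()]
        if "value" not in claim:
            missing.append("value")
        if missing:
            errors.append(f"{label}: missing {', '.join(missing)}")
            continue
        if claim.get("category") not in VALID_CATEGORIES:
            errors.append(f"{label}: invalid category {claim.get('category')!r}")
            continue

        # Step 4.3: tier must be the integer 0, 1, 2, or 3.
        tier = claim.get("tier")
        if type(tier) is not int or not 0 <= tier <= 3:
            errors.append(f"{label}: invalid tier {tier!r} (must be 0-3)")
            continue

        valid.append(claim)
    return valid, errors
```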
Phase 5: CLI Execution
Step 5.1: Construct CLI Commands
For each validated claim, construct:
aphoria corpus create \
--subject "{subject}" \
--predicate "{predicate}" \
--value "{value}" \
--explanation "{explanation}" \
--authority "{authority}" \
--category "{category}" \
--tier {tier}
Important: Use proper shell escaping for strings with quotes or special characters.
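One way to sidestep most escaping problems is to build each command as an argument list and only render it to a shell string for display. The sketch below assumes claims are dicts with the fields listed in Phase 4; `build_command` and `render_command` are hypothetical names, while the flags are the ones shown in the template above.

```python
import shlex


def build_command(claim: dict) -> list[str]:
    """Build the argv list for one claim; no shell quoting is needed here."""
    return [
        "aphoria", "corpus", "create",
        "--subject", claim["subject"],
        "--predicate", claim["predicate"],
        "--value", str(claim["value"]),
        "--explanation", claim["explanation"],
        "--authority", claim["authority"],
        "--category", claim["category"],
        "--tier", str(claim["tier"]),
    ]


def render_command(claim: dict) -> str:
    """Render a copy-pasteable shell string with every argument quoted safely."""
    return shlex.join(build_command(claim))
```

Passing the list form directly to a process runner means quotes, `$`, and backticks inside an explanation are never interpreted by a shell.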
Step 5.2: Execute Commands
Use the Bash tool to execute each command.
Step 5.3: Collect Results
For each execution:
- Success: Record the corpus ID (e.g., "corpus://ml/foo/bar/predicate")
- Failure: Record the full error message
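The execute-and-collect loop might look like the following sketch. It assumes each command is an argument list, that a zero exit code means the claim was stored, and that any corpus ID appears on stdout; none of these details are confirmed by the text above, and `run_commands` is a hypothetical helper.

```python
import subprocess


def run_commands(commands):
    """Run each argv list, continuing past failures.

    Returns (successes, failures) as lists of (argv, message) tuples.
    """
    successes, failures = [], []
    for argv in commands:
        try:
            proc = subprocess.run(argv, capture_output=True, text=True)
        except OSError as exc:  # e.g. the binary is not on PATH
            failures.append((argv, str(exc)))
            continue
        if proc.returncode == 0:
            # Assumption: stdout carries the corpus ID on success.
            successes.append((argv, proc.stdout.strip()))
        else:
            failures.append((argv, proc.stderr.strip()))
    return successes, failures
```

Both lists feed directly into the "Stored Claims" and "Storage Errors" sections of the summary report.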
Phase 6: Summary Report
Report Structure
# Wiki Corpus Extraction Report
**File:** /path/to/wiki/article.md
**Chunks Processed:** 3
**Claims Extracted:** 23
**Claims Stored:** 20
**Errors:** 3
## Stored Claims
| Subject | Predicate | Value | Authority | Tier |
|---------|-----------|-------|-----------|------|
| ml/basicsr/torchvision | incompatible_with | >=0.15 | XPixelGroup/BasicSR | 2 |
| ... | ... | ... | ... | ... |
## Errors
### Validation Errors (2)
1. **ml/foo/bar** - Invalid tier '5' (must be 0-3)
2. **api/rest/foo** - Missing explanation field
### Storage Errors (1)
1. **net/http/timeout** - Database write failed: connection refused
## Next Steps
View corpus items: http://localhost:3000/corpus
Query API: curl 'http://localhost:18180/v1/aphoria/corpus?sources[]=community&limit=100'
Predicate Naming Conventions
Use consistent predicate names to enable effective querying:
| Relationship | Predicate |
|---|---|
| Version constraint | requires, incompatible_with, compatible_with |
| Recommendation | recommends, discourages |
| Performance | faster_than, slower_than, optimal_for |
| Security | vulnerable_to, mitigates, exposes |
| Configuration | default_value, max_value, required_for |
Subject Path Guidelines
Build hierarchical paths that reflect the domain structure:
Examples
- `ml/dependencies/{package}/{aspect}`
  - Example: `ml/dependencies/basicsr/torchvision`
- `api/{protocol}/{feature}`
  - Example: `api/rest/authentication`
- `security/{category}/{vuln_type}`
  - Example: `security/input-validation/xss`
- `performance/{component}/{metric}`
  - Example: `performance/database/connection-pool`
Principles
- Start general, get specific
- Use lowercase with forward slashes
- Use hyphens for multi-word segments
- Keep paths under 6-7 levels
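These principles can be checked mechanically before a claim is stored. The sketch below is illustrative only: `check_subject_path` is a hypothetical helper, the segment pattern (lowercase letters and digits, hyphen-separated) is an interpretation of the rules above, and the limit of 7 levels is an assumed reading of "6-7 levels".

```python
import re

# Assumed segment shape: lowercase alphanumerics, with hyphens between words.
SEGMENT = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def check_subject_path(path: str, max_levels: int = 7) -> list[str]:
    """Return a list of problems; an empty list means the path looks fine."""
    problems = []
    segments = path.split("/")
    if len(segments) > max_levels:
        problems.append(f"{len(segments)} levels exceeds the limit of {max_levels}")
    for segment in segments:
        if not SEGMENT.fullmatch(segment):
            problems.append(f"bad segment {segment!r}")
    return problems
```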
Category Guidelines
Choose the most appropriate category:
| Category | Use When |
|---|---|
| compatibility | Version constraints, breaking changes, API compatibility |
| performance | Optimization, resource usage, latency, throughput |
| security | Vulnerabilities, mitigations, attack vectors |
| architecture | Design patterns, module structure, dependencies |
| quality | Code quality, maintainability, best practices |
Authority Tier Guidelines
| Tier | Name | Examples | When to Use |
|---|---|---|---|
| 0 | Regulatory | RFC 7231, W3C spec, ISO 27001 | Official standards bodies |
| 1 | Clinical | OWASP Top 10, CWE-79, NVD | Security advisories, vulnerability databases |
| 2 | Observational | PyTorch docs, GitHub project READMEs | Official project documentation |
| 3 | Community | Blog posts, Stack Overflow, forum threads | Community wisdom, empirical observations |
Error Handling
Validation Errors
Collect all validation errors and report them together. Do NOT stop on the first error.
Example validation errors:
- Invalid tier (not 0-3)
- Missing required field
- Confidence below threshold (< 0.7)
Storage Errors
If a CLI command fails:
- Capture the full error message
- Continue with remaining commands
- Report all failures at the end
LLM Extraction Errors
If the LLM returns invalid JSON:
- Log the chunk that failed
- Continue with remaining chunks
- Report the parsing error in summary
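A tolerant parser for the LLM response could look like this sketch. It assumes the most common failure is a response wrapped in a markdown code fence, which it strips before parsing; `parse_claims` and `strip_code_fence` are hypothetical names, and any other surrounding text is simply reported as a parse error.

```python
import json

FENCE_MARK = "`" * 3  # a markdown code fence marker


def strip_code_fence(text: str) -> str:
    """Remove a surrounding markdown code fence, if the response has one."""
    text = text.strip()
    if text.startswith(FENCE_MARK) and text.endswith(FENCE_MARK):
        lines = text.splitlines()
        # Drop the opening fence (with optional language tag) and the closing fence.
        return "\n".join(lines[1:-1])
    return text


def parse_claims(raw: str):
    """Return (claims, error); exactly one of the two is None."""
    try:
        data = json.loads(strip_code_fence(raw))
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON: {exc}"
    if not isinstance(data, list):
        return None, "expected a JSON array of claims"
    return data, None
```

When `error` is not `None`, the caller can log the chunk ID alongside it and move on to the next chunk.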
Do's and Don'ts
Do
- ✅ Extract factual claims (not opinions)
- ✅ Verify command availability before execution
- ✅ Infer authority from context
- ✅ Generate semantic subject paths
- ✅ Include full explanation context
- ✅ Bundle errors for batch reporting
- ✅ Use Read tool to get file content
- ✅ Use Bash tool to execute CLI commands
- ✅ Filter by confidence >= 0.7
- ✅ Allow duplicate subject+predicate (append-only DB)
Do Not
- ❌ Extract opinions or speculative claims
- ❌ Assume binary is up to date
- ❌ Lose source attribution
- ❌ Hardcode authority (infer from content)
- ❌ Stop on first error (collect all errors)
- ❌ Modify files (read-only skill)
- ❌ Use placeholder values
- ❌ Skip validation
- ❌ Filter duplicates (append-only allows them)
Example Extraction
Input Text
## BasicSR and Torchvision Compatibility
The BasicSR library (v1.4.2) has a critical compatibility issue with torchvision >= 0.15.
The library imports from `torchvision.transforms.functional_tensor`, which was removed
in torchvision 0.15+.
Source: https://github.com/XPixelGroup/BasicSR/issues/123
Recommended workaround: Pin torchvision to 0.14.1 or earlier.
Extracted Claims
[
{
"subject": "ml/dependencies/basicsr/torchvision",
"predicate": "incompatible_with",
"value": ">=0.15",
"explanation": "basicsr 1.4.2 imports from torchvision.transforms.functional_tensor which was removed in torchvision 0.15+",
"authority": "XPixelGroup/BasicSR#123",
"category": "compatibility",
"confidence": 0.95,
"tier": 2
},
{
"subject": "ml/dependencies/basicsr/torchvision",
"predicate": "recommends",
"value": "<=0.14.1",
"explanation": "Workaround for basicsr compatibility issue: pin torchvision to 0.14.1 or earlier",
"authority": "XPixelGroup/BasicSR#123",
"category": "compatibility",
"confidence": 0.9,
"tier": 3
}
]
CLI Commands
aphoria corpus create \
--subject "ml/dependencies/basicsr/torchvision" \
--predicate "incompatible_with" \
--value ">=0.15" \
--explanation "basicsr 1.4.2 imports from torchvision.transforms.functional_tensor which was removed in torchvision 0.15+" \
--authority "XPixelGroup/BasicSR#123" \
--category "compatibility" \
--tier 2
aphoria corpus create \
--subject "ml/dependencies/basicsr/torchvision" \
--predicate "recommends" \
--value "<=0.14.1" \
--explanation "Workaround for basicsr compatibility issue: pin torchvision to 0.14.1 or earlier" \
--authority "XPixelGroup/BasicSR#123" \
--category "compatibility" \
--tier 3
Related Skills
- extract-claims: Entity-level extraction from prose (for StemeDB ingestion)
- aphoria-suggest: Suggest claims from existing patterns
- aphoria-claims: Author claims from diffs
Implementation Notes
Token Counting
Use rough heuristic: token_count = len(content) / 4
This is approximate but good enough for chunking decisions.
Shell Escaping
When constructing CLI commands, properly escape strings:
import shlex
escaped_explanation = shlex.quote(explanation)
Or in bash:
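# printf %q (a bash builtin format) quotes the whole string, not just double quotes.
# Use the result unquoted, e.g. --explanation $escaped_explanation
escaped_explanation=$(printf '%q' "$explanation")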
Confidence Threshold
Only extract claims with confidence >= 0.7. This filters out:
- Speculative statements
- Uncertain inferences
- Low-quality extractions
Append-Only Semantics
The corpus database is append-only. Multiple sources can contribute claims for the same subject+predicate. This enables:
- Cross-validation from multiple sources
- Community consensus building
- Evolving knowledge over time
Do NOT filter duplicates. Just warn about them in the report.
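Warning about duplicates without dropping them can be as simple as counting subject+predicate pairs. This is a minimal sketch assuming claims are dicts with `subject` and `predicate` keys; `duplicate_warnings` is a hypothetical helper, and it only sees duplicates within the current batch, not ones already in the database.

```python
from collections import Counter


def duplicate_warnings(claims):
    """Return one warning per repeated subject+predicate pair; claims are untouched."""
    counts = Counter((c["subject"], c["predicate"]) for c in claims)
    return [
        f"{subject} / {predicate}: {n} claims share this subject+predicate"
        for (subject, predicate), n in counts.items()
        if n > 1
    ]
```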
Success Criteria
A successful extraction should:
- ✅ Read all markdown files in the input directory
- ✅ Extract factual claims with proper structure
- ✅ Infer authority from context (GitHub URLs, docs, etc.)
- ✅ Assign appropriate tiers (0-3)
- ✅ Execute CLI commands successfully
- ✅ Report comprehensive summary with errors bundled
- ✅ Handle validation errors gracefully
- ✅ Handle storage errors gracefully
- ✅ Generate semantic subject paths
- ✅ Use consistent predicate naming
Troubleshooting
"Command not found" or "unrecognized subcommand 'create'" Errors
If you see error: unrecognized subcommand 'create' or similar errors:
Diagnosis:
- Check binary date: `ls -lh target/release/aphoria`
- Check CLI code date: `ls -lh applications/aphoria/src/cli/mod.rs`
- If CLI is newer: The binary is out of date
Solutions:
# Option 1: Rebuild the binary
cargo build --release -p aphoria
# Option 2: Pull latest changes and rebuild
git pull && cargo build --release -p aphoria
# Option 3: Check if there are uncommitted changes
git status
Prevention: See Fix #1 for setting up git hooks that automatically rebuild binaries on pull.
"Database already open" error
The corpus database at ~/.aphoria/corpus-db is locked by another process (probably the API server).
Solution: Stop the API server temporarily:
pkill -f stemedb-api
"Invalid tier" error
Tier must be 0, 1, 2, or 3.
Solution: Review tier assignment rules and fix the extracted tier value.
"Missing required field" error
All claims must have: subject, predicate, value, explanation, authority, category, tier.
Solution: Review the LLM extraction prompt and ensure all fields are present.
LLM returns invalid JSON
The LLM might return markdown formatting or extra text.
Solution: Update the extraction prompt to be more explicit about returning ONLY the JSON array.