---
name: extract-wiki-corpus
description: Extract structured claims from wiki documentation using LLM reasoning. Use when importing technical wikis, research docs, or compatibility guides into Aphoria corpus.
---

# Wiki Corpus Extraction Skill

## Identity

You are an intelligent claim extraction engine that reads technical documentation and extracts factual, verifiable claims for the Aphoria knowledge corpus.

Your job is to:
1. Read wiki markdown files
2. Extract factual claims using LLM reasoning
3. Generate CLI commands to persist claims in the corpus database
4. Report comprehensive results with success/failure breakdown

## Core Principles

1. **Factual over Normative**: Extract what IS (not what SHOULD BE)
2. **Context-Aware Authority**: Infer sources from GitHub URLs, paper citations, official docs
3. **Hierarchical Subjects**: Build semantic paths (ml/dependencies/basicsr/version)
4. **Intelligent Chunking**: Break at headings when possible, ~4K token chunks
5. **Batch Processing**: Extract all claims, then execute CLI commands
6. **Bundle Errors**: Collect all errors and report them together

## Workflow Overview

```
Phase 1: Discover & Read
    ↓
Phase 1.2: Verify Commands
    ↓
Phase 2: Intelligent Chunking
    ↓
Phase 3: Claim Extraction (Per Chunk)
    ↓
Phase 4: Validation
    ↓
Phase 5: CLI Execution
    ↓
Phase 6: Summary Report
```

---

## Phase 1: Discover & Read

### Step 1.1: Check Input

- If file passed via CLI args: use that file
- If directory passed: walk to find all `.md` files
- Use Read tool to get full content of each file

### Step 1.2: Verify Aphoria Binary and Commands

Before proceeding, verify that the required commands exist:

```bash
# Check Aphoria version
aphoria --version

# Verify corpus create command exists
if ! aphoria corpus --help 2>&1 | grep -q "create"; then
    echo "❌ ERROR: 'aphoria corpus create' command not available"
    echo ""
    echo "This suggests the aphoria binary is out of date."
    echo ""
    echo "Fix options:"
    echo "  1. Rebuild: cargo build --release -p aphoria"
    echo "  2. Check git status: git status"
    echo "  3. Pull latest: git pull && cargo build --release -p aphoria"
    echo ""
    exit 1
fi

echo "✅ Aphoria binary up to date (corpus create available)"
```

**Decision Gate:** Command exists? → Proceed to token estimation

### Step 1.3: Estimate Token Count

Rough estimate: **1 token ≈ 4 characters**

```
token_count = len(content) / 4
```

If `token_count > 4000`, proceed to Phase 2 (chunking).
If `token_count <= 4000`, treat as single chunk.

---

## Phase 2: Intelligent Chunking

### Goal
Split content into ~4K token chunks, preferring heading boundaries.

### Algorithm

1. **Try splitting on `## ` (level 2 headings)**
   - Sections should be roughly 4K tokens each
   - If a section is still > 4K, split on `### ` (level 3 headings)

2. **Include context in each chunk**
   - Document title (from `# ` heading)
   - Section path (breadcrumb of headings)
   - Example: "Document: ML Dependencies Guide / Section: Critical Compatibility Solutions / Subsection: BasicSR Fix"

3. **Maintain overlap**
   - Include previous heading for context
   - This helps LLM understand relationships

### Chunk Metadata Format

```json
{
  "chunk_id": 1,
  "total_chunks": 3,
  "document_title": "ML Dependencies Guide",
  "section_path": "Critical Compatibility Solutions / BasicSR Fix",
  "content": "..."
}
```

---

## Phase 3: Claim Extraction (Per Chunk)

### Prompt the LLM

For each chunk, use a structured extraction prompt:

````
You are extracting factual claims from technical documentation for a knowledge corpus.

**Context:**
- Document: {document_title}
- Section: {section_path}
- Chunk: {chunk_id}/{total_chunks}

**Content:**
{chunk_content}

**Task:**
Extract all factual claims as JSON array. Each claim must be:
1. Factual (not opinion or speculation)
2. Verifiable from the text
3. Useful for developers

**Authority Inference Rules:**
- GitHub URLs/commits → "Repository/Project@hash"
- Research papers → "Author et al. (Year)"
- Official documentation → "Project Documentation"
- Empirical observation → "Community consensus"

**Tier Assignment:**
- 0: RFC, W3C spec, ISO standard (regulatory)
- 1: OWASP, CWE, security advisory (clinical)
- 2: Project docs, compatibility notes (observational)
- 3: Blog posts, forum consensus (community)

**Output Format:**
```json
[
  {
    "subject": "hierarchical/path/to/concept",
    "predicate": "relationship_type",
    "value": "constraint_or_value",
    "explanation": "full sentence with context",
    "authority": "inferred_source",
    "category": "compatibility|performance|security|architecture|quality",
    "confidence": 0.95,
    "tier": 2
  }
]
```

Return ONLY the JSON array, no additional text.
````

### Expected Output Structure

```json
[
  {
    "subject": "ml/dependencies/basicsr/torchvision",
    "predicate": "incompatible_with",
    "value": ">=0.15",
    "explanation": "basicsr 1.4.2 imports from torchvision.transforms.functional_tensor which was removed in torchvision 0.15+",
    "authority": "XPixelGroup/BasicSR GitHub",
    "category": "compatibility",
    "confidence": 0.95,
    "tier": 2
  }
]
```

---

## Phase 4: Validation

### Step 4.1: Filter by Confidence

Only keep claims where `confidence >= 0.7`

### Step 4.2: Check Required Fields

Each claim must have:
- `subject` (non-empty string)
- `predicate` (non-empty string)
- `value` (any type)
- `explanation` (non-empty string)
- `authority` (non-empty string)
- `category` (one of: compatibility, performance, security, architecture, quality)
- `tier` (0-3)

### Step 4.3: Validate Tier

Tier must be 0, 1, 2, or 3. If invalid, record error and skip claim.

### Step 4.4: Check for Duplicates

**Important**: The corpus database is **append-only**. Multiple sources can create the same `subject+predicate` pair. This is **allowed and expected**. Do NOT filter duplicates — just warn about them in the report.

---

## Phase 5: CLI Execution

### Step 5.1: Construct CLI Commands

For each validated claim, construct:

```bash
aphoria corpus create \
  --subject "{subject}" \
  --predicate "{predicate}" \
  --value "{value}" \
  --explanation "{explanation}" \
  --authority "{authority}" \
  --category "{category}" \
  --tier {tier}
```

**Important**: Use proper shell escaping for strings with quotes or special characters.

### Step 5.2: Execute Commands

Use the Bash tool to execute each command.

### Step 5.3: Collect Results

For each execution:
- **Success**: Record the corpus ID (e.g., "corpus://ml/foo/bar/predicate")
- **Failure**: Record the full error message

---

## Phase 6: Summary Report

### Report Structure

```markdown
# Wiki Corpus Extraction Report

**File:** /path/to/wiki/article.md
**Chunks Processed:** 3
**Claims Extracted:** 23
**Claims Stored:** 20
**Errors:** 3

## Stored Claims

| Subject | Predicate | Value | Authority | Tier |
|---------|-----------|-------|-----------|------|
| ml/basicsr/torchvision | incompatible_with | >=0.15 | XPixelGroup/BasicSR | 2 |
| ... | ... | ... | ... | ... |

## Errors

### Validation Errors (2)

1. **ml/foo/bar** - Invalid tier '5' (must be 0-3)
2. **api/rest/foo** - Missing explanation field

### Storage Errors (1)

1. **net/http/timeout** - Database write failed: connection refused

## Next Steps

View corpus items: http://localhost:3000/corpus
Query API: curl 'http://localhost:18180/v1/aphoria/corpus?sources[]=community&limit=100'
```

---

## Predicate Naming Conventions

Use consistent predicate names to enable effective querying:

| Relationship | Predicate |
|--------------|-----------|
| Version constraint | `requires`, `incompatible_with`, `compatible_with` |
| Recommendation | `recommends`, `discourages` |
| Performance | `faster_than`, `slower_than`, `optimal_for` |
| Security | `vulnerable_to`, `mitigates`, `exposes` |
| Configuration | `default_value`, `max_value`, `required_for` |

---

## Subject Path Guidelines

Build hierarchical paths that reflect the domain structure:

### Examples

- `ml/dependencies/{package}/{aspect}`
  - Example: `ml/dependencies/basicsr/torchvision`
- `api/{protocol}/{feature}`
  - Example: `api/rest/authentication`
- `security/{category}/{vuln_type}`
  - Example: `security/input-validation/xss`
- `performance/{component}/{metric}`
  - Example: `performance/database/connection-pool`

### Principles

- Start general, get specific
- Use lowercase with forward slashes
- Use hyphens for multi-word segments
- Keep paths under 6-7 levels

---

## Category Guidelines

Choose the most appropriate category:

| Category | Use When |
|----------|----------|
| `compatibility` | Version constraints, breaking changes, API compatibility |
| `performance` | Optimization, resource usage, latency, throughput |
| `security` | Vulnerabilities, mitigations, attack vectors |
| `architecture` | Design patterns, module structure, dependencies |
| `quality` | Code quality, maintainability, best practices |

---

## Authority Tier Guidelines

| Tier | Name | Examples | When to Use |
|------|------|----------|-------------|
| 0 | Regulatory | RFC 7231, W3C spec, ISO 27001 | Official standards bodies |
| 1 | Clinical | OWASP Top 10, CWE-79, NVD | Security advisories, vulnerability databases |
| 2 | Observational | PyTorch docs, GitHub project READMEs | Official project documentation |
| 3 | Community | Blog posts, Stack Overflow, forum threads | Community wisdom, empirical observations |

---

## Error Handling

### Validation Errors

Collect all validation errors and report them together. Do NOT stop on the first error.

Example validation errors:
- Invalid tier (not 0-3)
- Missing required field
- Confidence below threshold (< 0.7)

### Storage Errors

If a CLI command fails:
- Capture the full error message
- Continue with remaining commands
- Report all failures at the end

### LLM Extraction Errors

If the LLM returns invalid JSON:
- Log the chunk that failed
- Continue with remaining chunks
- Report the parsing error in summary

---

## Do's and Don'ts

### Do

- ✅ Extract factual claims (not opinions)
- ✅ Verify command availability before execution
- ✅ Infer authority from context
- ✅ Generate semantic subject paths
- ✅ Include full explanation context
- ✅ Bundle errors for batch reporting
- ✅ Use Read tool to get file content
- ✅ Use Bash tool to execute CLI commands
- ✅ Filter by confidence >= 0.7
- ✅ Allow duplicate subject+predicate (append-only DB)

### Do Not

- ❌ Extract opinions or speculative claims
- ❌ Assume binary is up to date
- ❌ Lose source attribution
- ❌ Hardcode authority (infer from content)
- ❌ Stop on first error (collect all errors)
- ❌ Modify files (read-only skill)
- ❌ Use placeholder values
- ❌ Skip validation
- ❌ Filter duplicates (append-only allows them)

---

## Example Extraction

### Input Text

```markdown
## BasicSR and Torchvision Compatibility

The BasicSR library (v1.4.2) has a critical compatibility issue with torchvision >= 0.15.
The library imports from `torchvision.transforms.functional_tensor`, which was removed
in torchvision 0.15+.

Source: https://github.com/XPixelGroup/BasicSR/issues/123

Recommended workaround: Pin torchvision to 0.14.1 or earlier.
```

### Extracted Claims

```json
[
  {
    "subject": "ml/dependencies/basicsr/torchvision",
    "predicate": "incompatible_with",
    "value": ">=0.15",
    "explanation": "basicsr 1.4.2 imports from torchvision.transforms.functional_tensor which was removed in torchvision 0.15+",
    "authority": "XPixelGroup/BasicSR#123",
    "category": "compatibility",
    "confidence": 0.95,
    "tier": 2
  },
  {
    "subject": "ml/dependencies/basicsr/torchvision",
    "predicate": "recommends",
    "value": "<=0.14.1",
    "explanation": "Workaround for basicsr compatibility issue: pin torchvision to 0.14.1 or earlier",
    "authority": "XPixelGroup/BasicSR#123",
    "category": "compatibility",
    "confidence": 0.9,
    "tier": 3
  }
]
```

### CLI Commands

```bash
aphoria corpus create \
  --subject "ml/dependencies/basicsr/torchvision" \
  --predicate "incompatible_with" \
  --value ">=0.15" \
  --explanation "basicsr 1.4.2 imports from torchvision.transforms.functional_tensor which was removed in torchvision 0.15+" \
  --authority "XPixelGroup/BasicSR#123" \
  --category "compatibility" \
  --tier 2

aphoria corpus create \
  --subject "ml/dependencies/basicsr/torchvision" \
  --predicate "recommends" \
  --value "<=0.14.1" \
  --explanation "Workaround for basicsr compatibility issue: pin torchvision to 0.14.1 or earlier" \
  --authority "XPixelGroup/BasicSR#123" \
  --category "compatibility" \
  --tier 3
```

---

## Related Skills

- **extract-claims**: Entity-level extraction from prose (for StemeDB ingestion)
- **aphoria-suggest**: Suggest claims from existing patterns
- **aphoria-claims**: Author claims from diffs

---

## Implementation Notes

### Token Counting

Use rough heuristic: `token_count = len(content) / 4`

This is approximate but good enough for chunking decisions.

### Shell Escaping

When constructing CLI commands, properly escape strings:

```python
import shlex

escaped_explanation = shlex.quote(explanation)
```

Or in bash:
```bash
explanation="${explanation//\"/\\\"}"  # Escape quotes
```

### Confidence Threshold

Only extract claims with `confidence >= 0.7`. This filters out:
- Speculative statements
- Uncertain inferences
- Low-quality extractions

### Append-Only Semantics

The corpus database is append-only. Multiple sources can contribute claims for the same `subject+predicate`. This enables:
- Cross-validation from multiple sources
- Community consensus building
- Evolving knowledge over time

Do NOT filter duplicates. Just warn about them in the report.

---

## Success Criteria

A successful extraction should:

1. ✅ Read all markdown files in the input directory
2. ✅ Extract factual claims with proper structure
3. ✅ Infer authority from context (GitHub URLs, docs, etc.)
4. ✅ Assign appropriate tiers (0-3)
5. ✅ Execute CLI commands successfully
6. ✅ Report comprehensive summary with errors bundled
7. ✅ Handle validation errors gracefully
8. ✅ Handle storage errors gracefully
9. ✅ Generate semantic subject paths
10. ✅ Use consistent predicate naming

---

## Troubleshooting

### "Command not found" or "unrecognized subcommand 'create'" Errors

If you see `error: unrecognized subcommand 'create'` or similar errors:

**Diagnosis:**
1. **Check binary date**: `ls -lh target/release/aphoria`
2. **Check CLI code date**: `ls -lh applications/aphoria/src/cli/mod.rs`
3. **If CLI is newer**: The binary is out of date

**Solutions:**
```bash
# Option 1: Rebuild the binary
cargo build --release -p aphoria

# Option 2: Pull latest changes and rebuild
git pull && cargo build --release -p aphoria

# Option 3: Check if there are uncommitted changes
git status
```

**Prevention:**
See Fix #1 for setting up git hooks that automatically rebuild binaries on pull.

### "Database already open" error

The corpus database at `~/.aphoria/corpus-db` is locked by another process (probably the API server).

**Solution**: Stop the API server temporarily:
```bash
pkill -f stemedb-api
```

### "Invalid tier" error

Tier must be 0, 1, 2, or 3.

**Solution**: Review tier assignment rules and fix the extracted tier value.

### "Missing required field" error

All claims must have: subject, predicate, value, explanation, authority, category, tier.

**Solution**: Review the LLM extraction prompt and ensure all fields are present.

### LLM returns invalid JSON

The LLM might return markdown formatting or extra text.

**Solution**: Update the extraction prompt to be more explicit about returning ONLY the JSON array.