jml bb0c33f8d3 fix(api): enable querying of CLI-created community corpus items

## Problem
CLI-created community corpus items (tier 3) were stored correctly but
invisible via API queries. Two issues blocked discoverability:

1. **Prefix mismatch**: API hardcoded 'community://pattern/' for
   aggregated patterns, but CLI creates 'community://rust/http/...' URIs
2. **Query parameter parsing**: Axum's default parser doesn't support
   bracket notation (?sources[]=value) used by the dashboard

Result: 0/22 CLI-created items were queryable.

## Solution

### Fix 1: Broaden Community Prefix
- Changed: 'community://pattern/' → 'community://' in corpus handler
- Impact: Now matches both aggregated patterns AND CLI-created items
- Backward compatible: Broader prefix includes narrower results

### Fix 2: Add QsQuery Extractor
- Added: serde_qs dependency + custom QsQuery extractor
- Supports: Bracket notation for array parameters (?sources[]=a&sources[]=b)
- Compatible: Works with JavaScript URLSearchParams standard
- Tested: 3 new unit tests for extractor behavior

## Verification
- ✅ All 22 CLI-created community items now queryable (was 0)
- ✅ Source filtering works: community (22), RFC (2), vendor (5)
- ✅ Multi-source queries work: ?sources[]=community&sources[]=rfc → 24
- ✅ All 89 API tests pass + 3 new extractor tests
- ✅ Clippy clean (0 warnings)
- ✅ No regressions in existing functionality

## Files Changed
- crates/stemedb-api/Cargo.toml: Add serde_qs dependency
- crates/stemedb-api/src/extractors.rs: New QsQuery extractor (117 lines)
- crates/stemedb-api/src/handlers/aphoria/corpus.rs: Use QsQuery, broaden prefix
- crates/stemedb-api/src/lib.rs: Export extractors module

Also includes: Scale-adaptive thresholds, wiki corpus extraction,
documentation updates, and dashboard UI improvements from prior work.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

2026-02-09 15:54:35 +00:00


name	description
extract-wiki-corpus	Extract structured claims from wiki documentation using LLM reasoning. Use when importing technical wikis, research docs, or compatibility guides into Aphoria corpus.

Wiki Corpus Extraction Skill

Identity

You are an intelligent claim extraction engine that reads technical documentation and extracts factual, verifiable claims for the Aphoria knowledge corpus.

Your job is to:

Read wiki markdown files
Extract factual claims using LLM reasoning
Generate CLI commands to persist claims in the corpus database
Report comprehensive results with success/failure breakdown

Core Principles

Factual over Normative: Extract what IS (not what SHOULD BE)
Context-Aware Authority: Infer sources from GitHub URLs, paper citations, official docs
Hierarchical Subjects: Build semantic paths (ml/dependencies/basicsr/version)
Intelligent Chunking: Break at headings when possible, ~4K token chunks
Batch Processing: Extract all claims, then execute CLI commands
Bundle Errors: Collect all errors and report them together

Workflow Overview

Phase 1: Discover & Read
    ↓
Step 1.2: Verify Commands
    ↓
Phase 2: Intelligent Chunking
    ↓
Phase 3: Claim Extraction (Per Chunk)
    ↓
Phase 4: Validation
    ↓
Phase 5: CLI Execution
    ↓
Phase 6: Summary Report

Phase 1: Discover & Read

Step 1.1: Check Input

If file passed via CLI args: use that file
If directory passed: walk to find all .md files
Use Read tool to get full content of each file
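The input check above can be sketched as a small helper. This is a minimal illustration, assuming the input arrives as a filesystem path; the function name `discover_markdown` is hypothetical and not part of Aphoria.

```python
from pathlib import Path


def discover_markdown(path: str) -> list[Path]:
    """Return the markdown files named by `path` (a single file or a directory)."""
    root = Path(path)
    if root.is_file():
        # A single file was passed: use it as-is.
        return [root]
    # A directory was passed: walk it recursively; sorting keeps the order stable.
    return sorted(p for p in root.rglob("*.md") if p.is_file())
```

Each returned path would then be read in full before token estimation.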

Step 1.2: Verify Aphoria Binary and Commands

Before proceeding, verify that the required commands exist:

# Check Aphoria version
aphoria --version

# Verify corpus create command exists
if ! aphoria corpus --help 2>&1 | grep -q "create"; then
    echo "❌ ERROR: 'aphoria corpus create' command not available"
    echo ""
    echo "This suggests the aphoria binary is out of date."
    echo ""
    echo "Fix options:"
    echo "  1. Rebuild: cargo build --release -p aphoria"
    echo "  2. Check git status: git status"
    echo "  3. Pull latest: git pull && cargo build --release -p aphoria"
    echo ""
    exit 1
fi

echo "✅ Aphoria binary up to date (corpus create available)"

Decision Gate: Command exists? → Proceed to token estimation

Step 1.3: Estimate Token Count

Rough estimate: 1 token ≈ 4 characters

token_count = len(content) / 4

If token_count > 4000, proceed to Phase 2 (chunking). If token_count <= 4000, treat the content as a single chunk and go straight to Phase 3.
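A minimal sketch of the estimate and the chunking decision, using the 4-characters-per-token heuristic and the 4000-token threshold stated above. The names `estimate_tokens`, `needs_chunking`, and `CHUNK_TOKEN_LIMIT` are illustrative, and integer division is used here so the count is a whole number.

```python
CHUNK_TOKEN_LIMIT = 4000  # threshold from the step above


def estimate_tokens(content: str) -> int:
    """Rough estimate: 1 token is about 4 characters."""
    return len(content) // 4


def needs_chunking(content: str) -> bool:
    """True when the content should go through Phase 2 instead of being one chunk."""
    return estimate_tokens(content) > CHUNK_TOKEN_LIMIT
```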

Phase 2: Intelligent Chunking

Goal

Split content into ~4K token chunks, preferring heading boundaries.

Algorithm

1. Try splitting on ## (level 2 headings)
   - Sections should be roughly 4K tokens each
   - If a section is still > 4K tokens, split it on ### (level 3 headings)
2. Include context in each chunk
   - Document title (from the # heading)
   - Section path (breadcrumb of headings)
   - Example: "Document: ML Dependencies Guide / Section: Critical Compatibility Solutions / Subsection: BasicSR Fix"
3. Maintain overlap
   - Include the previous heading for context
   - This helps the LLM understand relationships between sections

Chunk Metadata Format

{
  "chunk_id": 1,
  "total_chunks": 3,
  "document_title": "ML Dependencies Guide",
  "section_path": "Critical Compatibility Solutions / BasicSR Fix",
  "content": "..."
}
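One way the heading-based split could look, producing records in the metadata format above. This is a sketch under assumptions: headings are ATX-style (`#`, `##`, `###`), the helper names are hypothetical, and a level 3 section that is still over the limit is left as-is rather than split further. Overlap is approximated only by carrying the heading breadcrumb in `section_path`.

```python
import re

TOKEN_LIMIT = 4000


def estimate_tokens(text: str) -> int:
    return len(text) // 4


def split_on_heading(text: str, level: int) -> list[tuple[str, str]]:
    """Split text at headings of exactly `level`; return (heading, body) pairs."""
    pattern = re.compile("^" + "#" * level + " +(.+)$", re.MULTILINE)
    matches = list(pattern.finditer(text))
    if not matches:
        return [("", text)]
    sections = []
    preamble = text[: matches[0].start()]
    if preamble.strip():
        sections.append(("", preamble))
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections.append((m.group(1).strip(), text[m.start():end]))
    return sections


def chunk_document(text: str) -> list[dict]:
    """Split on ## first, then on ### for sections that are still too large."""
    title_match = re.search(r"^# +(.+)$", text, re.MULTILINE)
    title = title_match.group(1).strip() if title_match else ""
    pieces = []  # (section_path, content)
    for h2, body in split_on_heading(text, 2):
        if estimate_tokens(body) > TOKEN_LIMIT:
            for h3, sub in split_on_heading(body, 3):
                pieces.append((" / ".join(p for p in (h2, h3) if p), sub))
        else:
            pieces.append((h2, body))
    return [
        {
            "chunk_id": i,
            "total_chunks": len(pieces),
            "document_title": title,
            "section_path": path,
            "content": content,
        }
        for i, (path, content) in enumerate(pieces, start=1)
    ]
```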

Phase 3: Claim Extraction (Per Chunk)

Prompt the LLM

For each chunk, use a structured extraction prompt:

You are extracting factual claims from technical documentation for a knowledge corpus.

**Context:**
- Document: {document_title}
- Section: {section_path}
- Chunk: {chunk_id}/{total_chunks}

**Content:**
{chunk_content}

**Task:**
Extract all factual claims as JSON array. Each claim must be:
1. Factual (not opinion or speculation)
2. Verifiable from the text
3. Useful for developers

**Authority Inference Rules:**
- GitHub URLs/commits → "Repository/Project@hash"
- Research papers → "Author et al. (Year)"
- Official documentation → "Project Documentation"
- Empirical observation → "Community consensus"

**Tier Assignment:**
- 0: RFC, W3C spec, ISO standard (regulatory)
- 1: OWASP, CWE, security advisory (clinical)
- 2: Project docs, compatibility notes (observational)
- 3: Blog posts, forum consensus (community)

**Output Format:**
```json
[
  {
    "subject": "hierarchical/path/to/concept",
    "predicate": "relationship_type",
    "value": "constraint_or_value",
    "explanation": "full sentence with context",
    "authority": "inferred_source",
    "category": "compatibility|performance|security|architecture|quality",
    "confidence": 0.95,
    "tier": 2
  }
]
```

Return ONLY the JSON array, no additional text.

Expected Output Structure

[
  {
    "subject": "ml/dependencies/basicsr/torchvision",
    "predicate": "incompatible_with",
    "value": ">=0.15",
    "explanation": "basicsr 1.4.2 imports from torchvision.transforms.functional_tensor which was removed in torchvision 0.15+",
    "authority": "XPixelGroup/BasicSR GitHub",
    "category": "compatibility",
    "confidence": 0.95,
    "tier": 2
  }
]

Phase 4: Validation

Step 4.1: Filter by Confidence

Only keep claims where confidence >= 0.7

Step 4.2: Check Required Fields

Each claim must have:

subject (non-empty string)
predicate (non-empty string)
value (any type)
explanation (non-empty string)
authority (non-empty string)
category (one of: compatibility, performance, security, architecture, quality)
tier (0-3)

Step 4.3: Validate Tier

Tier must be 0, 1, 2, or 3. If invalid, record error and skip claim.
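Steps 4.1 through 4.3 could be combined into one pass that collects every problem instead of stopping at the first. The sketch below is illustrative: the function names are hypothetical, and treating a missing `confidence` as 0.0 is an assumption, since the required-field list above does not mention it.

```python
CATEGORIES = {"compatibility", "performance", "security", "architecture", "quality"}
REQUIRED_TEXT_FIELDS = ("subject", "predicate", "explanation", "authority")
MIN_CONFIDENCE = 0.7


def validate_claim(claim: dict) -> list[str]:
    """Return every validation problem found in one claim (empty list means valid)."""
    errors = []
    for field in REQUIRED_TEXT_FIELDS:
        value = claim.get(field)
        if not isinstance(value, str) or not value.strip():
            errors.append(f"Missing {field} field")
    if "value" not in claim:  # value may be any type, but must be present
        errors.append("Missing value field")
    category = claim.get("category")
    if not isinstance(category, str) or category not in CATEGORIES:
        errors.append(f"Invalid category {category!r}")
    tier = claim.get("tier")
    if not (type(tier) is int and 0 <= tier <= 3):
        errors.append(f"Invalid tier {tier!r} (must be 0-3)")
    confidence = claim.get("confidence", 0.0)  # assumption: missing means 0.0
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        errors.append(f"Invalid confidence {confidence!r}")
    elif confidence < MIN_CONFIDENCE:
        errors.append(f"Confidence {confidence} below threshold (< {MIN_CONFIDENCE})")
    return errors


def validate_all(claims: list[dict]) -> tuple[list[dict], list[tuple[str, str]]]:
    """Split claims into valid ones and (subject, message) error pairs."""
    valid, errors = [], []
    for claim in claims:
        problems = validate_claim(claim)
        if problems:
            subject = str(claim.get("subject") or "<unknown>")
            errors.extend((subject, message) for message in problems)
        else:
            valid.append(claim)
    return valid, errors
```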

Step 4.4: Check for Duplicates

Important: The corpus database is append-only. Multiple sources can create the same subject+predicate pair. This is allowed and expected. Do NOT filter duplicates — just warn about them in the report.
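A short sketch of the warn-but-keep behavior described above: repeated subject+predicate pairs are counted and reported, and the claim list itself is left untouched. The function name and the warning wording are illustrative.

```python
from collections import Counter


def duplicate_warnings(claims: list[dict]) -> list[str]:
    """Describe subject+predicate pairs that occur more than once; nothing is removed."""
    counts = Counter((c["subject"], c["predicate"]) for c in claims)
    return [
        f"{subject} [{predicate}] appears {n} times"
        for (subject, predicate), n in counts.items()
        if n > 1
    ]
```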

Phase 5: CLI Execution

Step 5.1: Construct CLI Commands

For each validated claim, construct:

aphoria corpus create \
  --subject "{subject}" \
  --predicate "{predicate}" \
  --value "{value}" \
  --explanation "{explanation}" \
  --authority "{authority}" \
  --category "{category}" \
  --tier {tier}

Important: Use proper shell escaping for strings with quotes or special characters.
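One hedged way to sidestep most escaping problems is to build the command as an argument list and only quote it when a printable string is needed. The flags below are the ones shown above; `build_command` and `render_command` are hypothetical helper names, and converting non-string values with `str()` is an assumption.

```python
import shlex


def build_command(claim: dict) -> list[str]:
    """Build the argv for one claim; as a list, no shell quoting is needed."""
    return [
        "aphoria", "corpus", "create",
        "--subject", str(claim["subject"]),
        "--predicate", str(claim["predicate"]),
        "--value", str(claim["value"]),  # assumption: non-string values are stringified
        "--explanation", str(claim["explanation"]),
        "--authority", str(claim["authority"]),
        "--category", str(claim["category"]),
        "--tier", str(claim["tier"]),
    ]


def render_command(claim: dict) -> str:
    """Render the same command as a single shell-safe string (for logs or the Bash tool)."""
    return shlex.join(build_command(claim))
```

`shlex.join` quotes each argument so that quotes, `$`, and backticks inside an explanation survive a round trip through the shell.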

Step 5.2: Execute Commands

Use the Bash tool to execute each command.

Step 5.3: Collect Results

For each execution:

Success: Record the corpus ID (e.g., "corpus://ml/foo/bar/predicate")
Failure: Record the full error message
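Execution and result collection might be sketched as below. It assumes, without confirmation from the source, that a successful `aphoria corpus create` exits with status 0 and prints the corpus ID on stdout; the function name is hypothetical. A failed command is recorded and the loop continues.

```python
import subprocess


def run_commands(commands: list[list[str]]) -> tuple[list[str], list[str]]:
    """Run each argv in turn; return (successes, failures) without stopping early."""
    successes, failures = [], []
    for argv in commands:
        try:
            proc = subprocess.run(argv, capture_output=True, text=True)
        except OSError as exc:  # e.g. the binary is not installed
            failures.append(f"{argv[0]}: {exc}")
            continue
        if proc.returncode == 0:
            # Assumption: stdout carries the corpus ID on success.
            successes.append(proc.stdout.strip())
        else:
            failures.append(proc.stderr.strip() or f"exit code {proc.returncode}")
    return successes, failures
```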

Phase 6: Summary Report

Report Structure

# Wiki Corpus Extraction Report

**File:** /path/to/wiki/article.md
**Chunks Processed:** 3
**Claims Extracted:** 23
**Claims Stored:** 20
**Errors:** 3

## Stored Claims

| Subject | Predicate | Value | Authority | Tier |
|---------|-----------|-------|-----------|------|
| ml/basicsr/torchvision | incompatible_with | >=0.15 | XPixelGroup/BasicSR | 2 |
| ... | ... | ... | ... | ... |

## Errors

### Validation Errors (2)

1. **ml/foo/bar** - Invalid tier '5' (must be 0-3)
2. **api/rest/foo** - Missing explanation field

### Storage Errors (1)

1. **net/http/timeout** - Database write failed: connection refused

## Next Steps

View corpus items: http://localhost:3000/corpus
Query API: curl 'http://localhost:18180/v1/aphoria/corpus?sources[]=community&limit=100'
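The report structure above could be generated with a helper along these lines. It is an illustrative sketch: `render_report` and its parameters are hypothetical, errors are passed as (subject, message) pairs, and cell values containing `|` are not escaped.

```python
def render_report(source, chunk_count, extracted_count, stored, validation_errors, storage_errors):
    """Render the summary report as markdown text."""
    error_count = len(validation_errors) + len(storage_errors)
    lines = [
        "# Wiki Corpus Extraction Report",
        "",
        f"**File:** {source}",
        f"**Chunks Processed:** {chunk_count}",
        f"**Claims Extracted:** {extracted_count}",
        f"**Claims Stored:** {len(stored)}",
        f"**Errors:** {error_count}",
        "",
        "## Stored Claims",
        "",
        "| Subject | Predicate | Value | Authority | Tier |",
        "|---------|-----------|-------|-----------|------|",
    ]
    for c in stored:
        lines.append(
            f"| {c['subject']} | {c['predicate']} | {c['value']} | {c['authority']} | {c['tier']} |"
        )
    lines += ["", "## Errors", ""]
    groups = (("Validation Errors", validation_errors), ("Storage Errors", storage_errors))
    for title, errors in groups:
        lines += [f"### {title} ({len(errors)})", ""]
        lines += [f"{i}. **{subject}** - {message}" for i, (subject, message) in enumerate(errors, 1)]
        lines.append("")
    return "\n".join(lines)
```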

Predicate Naming Conventions

Use consistent predicate names to enable effective querying:

Relationship	Predicate
Version constraint	`requires`, `incompatible_with`, `compatible_with`
Recommendation	`recommends`, `discourages`
Performance	`faster_than`, `slower_than`, `optimal_for`
Security	`vulnerable_to`, `mitigates`, `exposes`
Configuration	`default_value`, `max_value`, `required_for`

Subject Path Guidelines

Build hierarchical paths that reflect the domain structure:

Examples

ml/dependencies/{package}/{aspect}
- Example: ml/dependencies/basicsr/torchvision
api/{protocol}/{feature}
- Example: api/rest/authentication
security/{category}/{vuln_type}
- Example: security/input-validation/xss
performance/{component}/{metric}
- Example: performance/database/connection-pool

Principles

Start general, get specific
Use lowercase with forward slashes
Use hyphens for multi-word segments
Keep paths shallow (at most 6-7 levels)
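The principles above can be checked mechanically. The sketch below assumes a hard cap of 7 levels (one reading of "6-7") and allows only lowercase letters, digits, and single hyphens inside a segment; both choices and the name `subject_path_problems` are assumptions for illustration.

```python
import re

MAX_DEPTH = 7  # assumption: upper end of the "6-7 levels" guideline
SEGMENT = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def subject_path_problems(path: str) -> list[str]:
    """Return a list of guideline violations for a subject path (empty means OK)."""
    problems = []
    segments = path.split("/")
    if len(segments) > MAX_DEPTH:
        problems.append(f"too deep: {len(segments)} levels (max {MAX_DEPTH})")
    for segment in segments:
        if not SEGMENT.fullmatch(segment):
            problems.append(f"bad segment {segment!r}")
    return problems
```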

Category Guidelines

Choose the most appropriate category:

Category	Use When
`compatibility`	Version constraints, breaking changes, API compatibility
`performance`	Optimization, resource usage, latency, throughput
`security`	Vulnerabilities, mitigations, attack vectors
`architecture`	Design patterns, module structure, dependencies
`quality`	Code quality, maintainability, best practices

Authority Tier Guidelines

Tier	Name	Examples	When to Use
0	Regulatory	RFC 7231, W3C spec, ISO 27001	Official standards bodies
1	Clinical	OWASP Top 10, CWE-79, NVD	Security advisories, vulnerability databases
2	Observational	PyTorch docs, GitHub project READMEs	Official project documentation
3	Community	Blog posts, Stack Overflow, forum threads	Community wisdom, empirical observations

Error Handling

Validation Errors

Collect all validation errors and report them together. Do NOT stop on the first error.

Example validation errors:

Invalid tier (not 0-3)
Missing required field
Confidence below threshold (< 0.7)

Storage Errors

If a CLI command fails:

Capture the full error message
Continue with remaining commands
Report all failures at the end

LLM Extraction Errors

If the LLM returns invalid JSON:

Log the chunk that failed
Continue with remaining chunks
Report the parsing error in summary
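A minimal sketch of that behavior, assuming the raw LLM output for each chunk is available as a string keyed by chunk ID. The function name is hypothetical. A chunk that fails to parse is logged and skipped, and the remaining chunks are still processed.

```python
import json


def parse_chunk_outputs(outputs: dict[int, str]) -> tuple[list[dict], list[str]]:
    """Parse each chunk's raw output; return (claims, errors) without stopping early."""
    claims, errors = [], []
    for chunk_id, raw in outputs.items():
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError as exc:
            errors.append(f"chunk {chunk_id}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(parsed, list):
            errors.append(f"chunk {chunk_id}: expected a JSON array")
            continue
        claims.extend(parsed)
    return claims, errors
```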

Do's and Don'ts

Do

✅ Extract factual claims (not opinions)
✅ Verify command availability before execution
✅ Infer authority from context
✅ Generate semantic subject paths
✅ Include full explanation context
✅ Bundle errors for batch reporting
✅ Use Read tool to get file content
✅ Use Bash tool to execute CLI commands
✅ Filter by confidence >= 0.7
✅ Allow duplicate subject+predicate (append-only DB)

Do Not

❌ Extract opinions or speculative claims
❌ Assume binary is up to date
❌ Lose source attribution
❌ Hardcode authority (infer from content)
❌ Stop on first error (collect all errors)
❌ Modify files (read-only skill)
❌ Use placeholder values
❌ Skip validation
❌ Filter duplicates (append-only allows them)

Example Extraction

Input Text

## BasicSR and Torchvision Compatibility

The BasicSR library (v1.4.2) has a critical compatibility issue with torchvision >= 0.15.
The library imports from `torchvision.transforms.functional_tensor`, which was removed
in torchvision 0.15+.

Source: https://github.com/XPixelGroup/BasicSR/issues/123

Recommended workaround: Pin torchvision to 0.14.1 or earlier.

Extracted Claims

[
  {
    "subject": "ml/dependencies/basicsr/torchvision",
    "predicate": "incompatible_with",
    "value": ">=0.15",
    "explanation": "basicsr 1.4.2 imports from torchvision.transforms.functional_tensor which was removed in torchvision 0.15+",
    "authority": "XPixelGroup/BasicSR#123",
    "category": "compatibility",
    "confidence": 0.95,
    "tier": 2
  },
  {
    "subject": "ml/dependencies/basicsr/torchvision",
    "predicate": "recommends",
    "value": "<=0.14.1",
    "explanation": "Workaround for basicsr compatibility issue: pin torchvision to 0.14.1 or earlier",
    "authority": "XPixelGroup/BasicSR#123",
    "category": "compatibility",
    "confidence": 0.9,
    "tier": 3
  }
]

CLI Commands

aphoria corpus create \
  --subject "ml/dependencies/basicsr/torchvision" \
  --predicate "incompatible_with" \
  --value ">=0.15" \
  --explanation "basicsr 1.4.2 imports from torchvision.transforms.functional_tensor which was removed in torchvision 0.15+" \
  --authority "XPixelGroup/BasicSR#123" \
  --category "compatibility" \
  --tier 2

aphoria corpus create \
  --subject "ml/dependencies/basicsr/torchvision" \
  --predicate "recommends" \
  --value "<=0.14.1" \
  --explanation "Workaround for basicsr compatibility issue: pin torchvision to 0.14.1 or earlier" \
  --authority "XPixelGroup/BasicSR#123" \
  --category "compatibility" \
  --tier 3

Related Skills

extract-claims: Entity-level extraction from prose (for StemeDB ingestion)
aphoria-suggest: Suggest claims from existing patterns
aphoria-claims: Author claims from diffs

Implementation Notes

Token Counting

Use rough heuristic: token_count = len(content) / 4

This is approximate but good enough for chunking decisions.

Shell Escaping

When constructing CLI commands, properly escape strings:

import shlex

escaped_explanation = shlex.quote(explanation)

Or in bash:

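# Quote the value for reuse as shell input (covers quotes, $, backticks, spaces).
# Use the result unquoted when assembling the command string.
printf -v escaped_explanation '%q' "$explanation"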

Confidence Threshold

Only keep claims with confidence >= 0.7. This filters out:

Speculative statements
Uncertain inferences
Low-quality extractions

Append-Only Semantics

The corpus database is append-only. Multiple sources can contribute claims for the same subject+predicate. This enables:

Cross-validation from multiple sources
Community consensus building
Evolving knowledge over time

Do NOT filter duplicates. Just warn about them in the report.

Success Criteria

A successful extraction should:

✅ Read every markdown file in the input (a single file or a directory)
✅ Extract factual claims with proper structure
✅ Infer authority from context (GitHub URLs, docs, etc.)
✅ Assign appropriate tiers (0-3)
✅ Execute CLI commands successfully
✅ Report comprehensive summary with errors bundled
✅ Handle validation errors gracefully
✅ Handle storage errors gracefully
✅ Generate semantic subject paths
✅ Use consistent predicate naming

Troubleshooting

"Command not found" or "unrecognized subcommand 'create'" Errors

If you see error: unrecognized subcommand 'create' or similar errors:

Diagnosis:

Check binary date: ls -lh target/release/aphoria
Check CLI code date: ls -lh applications/aphoria/src/cli/mod.rs
If CLI is newer: The binary is out of date

Solutions:

# Option 1: Rebuild the binary
cargo build --release -p aphoria

# Option 2: Pull latest changes and rebuild
git pull && cargo build --release -p aphoria

# Option 3: Check if there are uncommitted changes
git status

Prevention: See Fix #1 for setting up git hooks that automatically rebuild binaries on pull.

"Database already open" error

The corpus database at ~/.aphoria/corpus-db is locked by another process (probably the API server).

Solution: Stop the API server temporarily:

pkill -f stemedb-api

"Invalid tier" error

Tier must be 0, 1, 2, or 3.

Solution: Review tier assignment rules and fix the extracted tier value.

"Missing required field" error

All claims must have: subject, predicate, value, explanation, authority, category, tier.

Solution: Review the LLM extraction prompt and ensure all fields are present.

LLM returns invalid JSON

The LLM might return markdown formatting or extra text.

Solution: Update the extraction prompt to be more explicit about returning ONLY the JSON array.
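Alongside tightening the prompt, a tolerant parser can serve as a fallback. The sketch below takes the text between the first `[` and the last `]`, which covers output wrapped in a markdown fence or surrounded by a sentence or two. It is a heuristic, not a guarantee: surrounding prose that itself contains brackets will defeat it. The function name is hypothetical.

```python
import json


def extract_json_array(raw: str) -> list:
    """Pull a JSON array out of LLM output that may include fences or extra text."""
    start, end = raw.find("["), raw.rfind("]")
    if start == -1 or end < start:
        raise ValueError("no JSON array found in LLM output")
    parsed = json.loads(raw[start : end + 1])
    if not isinstance(parsed, list):
        raise ValueError("expected a JSON array")
    return parsed
```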
