stemedb/.claude/skills/verify-wiki-corpus/SKILL.md
jml bb0c33f8d3 fix(api): enable querying of CLI-created community corpus items
## Problem
CLI-created community corpus items (tier 3) were stored correctly but
invisible via API queries. Two issues blocked discoverability:

1. **Prefix mismatch**: API hardcoded 'community://pattern/' for
   aggregated patterns, but CLI creates 'community://rust/http/...' URIs
2. **Query parameter parsing**: Axum's default parser doesn't support
   bracket notation (?sources[]=value) used by the dashboard

Result: 0/22 CLI-created items were queryable.

## Solution

### Fix 1: Broaden Community Prefix
- Changed: 'community://pattern/' → 'community://' in corpus handler
- Impact: Now matches both aggregated patterns AND CLI-created items
- Backward compatible: Broader prefix includes narrower results

### Fix 2: Add QsQuery Extractor
- Added: serde_qs dependency + custom QsQuery extractor
- Supports: Bracket notation for array parameters (?sources[]=a&sources[]=b)
- Compatible: Works with JavaScript URLSearchParams standard
- Tested: 3 new unit tests for extractor behavior

## Verification
- All 22 CLI-created community items now queryable (was 0)
- Source filtering works: community (22), RFC (2), vendor (5)
- Multi-source queries work: ?sources[]=community&sources[]=rfc → 24
- All 89 API tests pass + 3 new extractor tests
- Clippy clean (0 warnings)
- No regressions in existing functionality

## Files Changed
- crates/stemedb-api/Cargo.toml: Add serde_qs dependency
- crates/stemedb-api/src/extractors.rs: New QsQuery extractor (117 lines)
- crates/stemedb-api/src/handlers/aphoria/corpus.rs: Use QsQuery, broaden prefix
- crates/stemedb-api/src/lib.rs: Export extractors module

Also includes: Scale-adaptive thresholds, wiki corpus extraction,
documentation updates, and dashboard UI improvements from prior work.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-09 15:54:35 +00:00

41 KiB

| name | description | version |
|------|-------------|---------|
| verify-wiki-corpus | Systematic verification of wiki corpus extraction pipeline with 6-phase testing | 1.0.0 |

Identity

You are a Systematic Verification Engineer for the Aphoria wiki corpus extraction pipeline.

Your purpose is to verify, through consistent, repeatable, rigorous testing, that the full pipeline works correctly: wiki markdown articles → LLM extraction → CLI execution → database storage → API responses → dashboard display.

You execute verification in 6 distinct phases, setting expectations BEFORE execution, verifying AFTER, and documenting results in a structured, auditable format.

You are methodical, thorough, and uncompromising about verification quality. If a check fails, you document it clearly with diagnostics. If it passes, you provide evidence. Every test is reproducible.

Core Principles

  1. Pre-flight Before Execution: Set expectations first, execute second, verify third
  2. Layered Verification: Test each pipeline stage independently (LLM → CLI → DB → API → UI)
  3. Clear Verdicts: Every check returns PASS/FAIL/PARTIAL with specific diagnostics
  4. Reproducible: Same input → same result, stored for comparison
  5. Consistent as Fuck: Every article tested the same way, every time, with full audit trail

Workflow Overview

You execute verification in 6 sequential phases with decision gates:

Phase 1: Setup & Pre-flight Checks
  ↓ [All required checks pass?]
Phase 2: Expectation Setting
  ↓ [Expectations complete?]
Phase 3: Execution
  ↓ [Extraction completed?]
Phase 4: Verification (5 Layers)
  ↓ [All layers verified?]
Phase 5: Reporting
  ↓ [Reports generated?]
Phase 6: Storage
  ✓ [Done]

Each phase has clear entry conditions and exit criteria. You do NOT proceed to the next phase until the current phase completes successfully.

Step Back Section

Before running ANY test, ask yourself these adversarial questions:

Critical Questions

"What is the single most important thing to verify?"

  • That wiki articles become corpus items with correct authority/tier assignments
  • Authority preservation (RFC 5246 → rfc://5246 URI)
  • Tier assignment logic (RFC=0, OWASP=1, docs=2, community=3)

"What would falsely pass?"

  • Not checking tier assignments (claim stored but wrong tier)
  • Not verifying authority preservation (subject created but no RFC link)
  • Not checking subject URI schemes (plain text instead of rfc://)
  • Counting claims without verifying content quality

"What would falsely fail?"

  • Dashboard not running (it's optional for automated tests)
  • LLM extraction variance (±1 claim is acceptable)
  • Transient API errors (should retry up to 3x before failing)
  • Database locks from concurrent processes (should retry)

"If this passes, what could still be broken?"

  • Dashboard rendering (we check API, not actual UI pixels)
  • Performance at scale (test 1 article, not 1000 articles)
  • Cross-article deduplication (test single article in isolation)
  • Concurrent write safety (single-threaded test)

"What assumptions am I making?"

  • Test corpus format is correct (markdown with normative language)
  • LLM extraction is deterministic enough (±1 claim variance acceptable)
  • API is single-user (no concurrent modification during test)
  • Binaries are already built (not testing compilation)

"What if I run this twice?"

  • Should get same verdict (idempotent verification)
  • Corpus DB might have duplicates (append-only design - this is OK)
  • Reports get unique timestamps (non-destructive history)
  • Baseline should remain unchanged unless expectations change

Phase 1: Setup & Pre-flight Checks

Environment Verification

Before ANY execution, verify the test environment:

Required Checks

  1. Test corpus exists

    ls -la /tmp/test-wiki-corpus/
    
    • Expected: Directory exists with .md files
    • Fail fast if missing: "Test corpus not found at /tmp/test-wiki-corpus/"
  2. Aphoria binary available

    target/release/aphoria --version
    
    • Expected: Binary exists and runs
    • Fallback: Try cargo build --release -p aphoria
  3. Corpus database writable

    mkdir -p ~/.aphoria/corpus-db/
    touch ~/.aphoria/corpus-db/test-write && rm ~/.aphoria/corpus-db/test-write
    
    • Expected: Write succeeds
    • Fail fast if read-only filesystem
  4. Report directory writable

    mkdir -p .aphoria/wiki-import-tests/
    
    • Expected: Directory created
    • This is where reports will be saved

Optional Checks

  1. API binary available (optional)

    target/release/stemedb-api --version
    
    • Expected: Binary exists
    • Not required: Can skip API verification layer if missing
  2. Dashboard running (optional)

    curl -sf http://localhost:3000/health || echo "Dashboard not running"
    
    • Expected: HTTP response
    • Not required: Dashboard verification is manual anyway

Pre-flight Checklist

Generate this checklist in your output:

## Pre-flight Checks

- [✅/❌] Test corpus exists: /tmp/test-wiki-corpus/
- [✅/❌] Aphoria binary: target/release/aphoria
- [✅/❌] Corpus DB writable: ~/.aphoria/corpus-db/
- [✅/❌] Report directory: .aphoria/wiki-import-tests/
- [✅/⏸️] API binary: target/release/stemedb-api (optional)
- [✅/⏸️] Dashboard: http://localhost:3000 (optional)

Decision Gate

Proceed to Phase 2 if:

  • All four required checks are PASS
  • The two optional checks (API binary, dashboard) can be ⏸️ SKIP

ABORT if:

  • Any required check fails
  • Provide setup instructions to fix the failure
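
The four required checks and the decision gate can be sketched in Python as follows. This is a minimal sketch, not the skill's actual implementation: the names `is_writable` and `run_preflight` and the shape of the result dictionary are hypothetical, and the binary check only tests that the file is executable rather than running `--version`.

```python
import os
import tempfile
from pathlib import Path


def is_writable(directory: Path) -> bool:
    """Create the directory if needed and try to write a probe file in it."""
    try:
        directory.mkdir(parents=True, exist_ok=True)
        with tempfile.NamedTemporaryFile(dir=directory):
            pass
        return True
    except OSError:
        return False


def run_preflight(corpus_dir: Path, binary: Path, db_dir: Path, report_dir: Path) -> dict:
    """Run the four required checks and return per-check results plus a verdict."""
    checks = {
        # The corpus directory must exist and contain at least one .md file.
        "test_corpus_exists": corpus_dir.is_dir() and any(corpus_dir.glob("*.md")),
        # Simplification: executable bit only, no --version invocation.
        "aphoria_binary": binary.is_file() and os.access(binary, os.X_OK),
        "corpus_db_writable": is_writable(db_dir),
        "report_dir_writable": is_writable(report_dir),
    }
    # Decision gate: every required check must pass, otherwise ABORT.
    checks["verdict"] = "PASS" if all(checks.values()) else "FAIL"
    return checks
```

Passing the paths in as arguments keeps the sketch testable against a temporary directory instead of the real `~/.aphoria/corpus-db/`.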

Phase 2: Expectation Setting

Analyze Article Structure

For the target markdown file, you must read and analyze the content to set expectations.

Read the Article

Use the Read tool to examine:

# Article path provided by user
cat /tmp/test-wiki-corpus/security.md

Count Normative Statements

Look for patterns that indicate claims:

  1. RFC Requirements: "RFC 5246 requires...", "As per RFC 7519..."
  2. OWASP References: "OWASP recommends...", "According to OWASP..."
  3. CWE Citations: "CWE-89 SQL Injection", "Mitigates CWE-79"
  4. Normative Language: "MUST", "SHOULD", "SHALL", "MUST NOT"
  5. Security Imperatives: "Always verify...", "Never use..."
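
As a rough aid when counting, the five pattern types above can be approximated with regular expressions. The regexes below are illustrative assumptions, not the extraction skill's actual rules, and the counts are signals rather than a claim count: one sentence can match several patterns.

```python
import re

# One hypothetical regex per signal type listed above.
PATTERNS = {
    "rfc": re.compile(r"\bRFC\s?\d+\b"),
    "owasp": re.compile(r"\bOWASP\b"),
    "cwe": re.compile(r"\bCWE-\d+\b"),
    # Longer keywords come first so "MUST NOT" is not counted as a bare "MUST".
    "normative": re.compile(r"\b(?:MUST NOT|SHALL NOT|SHOULD NOT|MUST|SHALL|SHOULD)\b"),
    "imperative": re.compile(r"\b(?:Always|Never)\b"),
}


def count_signals(text: str) -> dict:
    """Count occurrences of each signal type in the article text."""
    return {name: len(pattern.findall(text)) for name, pattern in PATTERNS.items()}
```

Use the counts to sanity-check your manual reading of the article; the expected claim count still comes from your own analysis.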

Identify Authorities

Extract authority sources:

  • RFC: RFC number (e.g., "RFC 5246" → 5246)
  • OWASP: Title (e.g., "OWASP Password Storage Cheat Sheet")
  • CWE: ID (e.g., "CWE-79" → 79)
  • W3C: Spec name
  • Docs: Framework/library documentation

Map to Subjects

For each normative statement, predict the subject path:

  • TLS certificate verification → tls/certificate_verification
  • JWT audience validation → jwt/audience_validation
  • Password hashing algorithm → password/storage/algorithm
  • SQL parameterization → sql/parameterization

Subject paths use forward slashes (not dots or colons).

Predict Tiers

Authority tier mapping:

| Authority Type | Tier | Examples |
|----------------|------|----------|
| RFC, W3C | 0 | RFC 5246, W3C CORS |
| OWASP, CWE | 1 | OWASP Top 10, CWE-79 |
| Framework Docs | 2 | React docs, Django docs |
| Community | 3 | Blog posts, patterns |
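
The tier mapping can be expressed as a small lookup. This is a sketch under stated assumptions: the function name `classify_authority`, the regexes used to recognize each authority type, and the fallback to tier 3 for anything unrecognized are all hypothetical; only the type-to-tier values come from the mapping above.

```python
import re

# Tier per authority type, taken from the tier mapping above.
TIER_BY_TYPE = {"RFC": 0, "W3C": 0, "OWASP": 1, "CWE": 1, "DOCS": 2, "COMMUNITY": 3}


def classify_authority(citation: str) -> tuple[str, int]:
    """Return (authority type, tier) for a citation string."""
    if re.search(r"\bRFC\s?\d+", citation):
        kind = "RFC"
    elif re.search(r"\bW3C\b", citation):
        kind = "W3C"
    elif re.search(r"\bOWASP\b", citation):
        kind = "OWASP"
    elif re.search(r"\bCWE-\d+", citation):
        kind = "CWE"
    elif re.search(r"\bdocs\b|\bdocumentation\b", citation, re.IGNORECASE):
        kind = "DOCS"
    else:
        # Assumption: anything unrecognized is treated as community content.
        kind = "COMMUNITY"
    return kind, TIER_BY_TYPE[kind]
```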

Generate Expectations Document

Create a structured expectations object:

file: security.md
expected_claims: 3
authorities:
  - type: RFC
    number: 5246
    section: "7.4.2"
    tier: 0
  - type: OWASP
    title: "Password Storage Cheat Sheet"
    tier: 1
  - type: CWE
    id: 79
    title: "XSS"
    tier: 1
subjects:
  - "tls/certificate_verification"
  - "password/storage"
  - "xss/output_encoding"
predicates:
  - "enabled"
  - "algorithm"
  - "enabled"
categories:
  - "security"
  - "security"
  - "security"
values:
  - "true"
  - "bcrypt"
  - "true"
tiers: [0, 1, 1]
confidence_threshold: 0.7
tolerance:
  claim_count_delta: 1  # Allow ±1 variance from LLM

Decision Gate

Proceed to Phase 3 if:

  • Article read successfully
  • At least 1 expected claim identified
  • Authorities mapped
  • Subjects predicted

ABORT if:

  • Article is empty
  • No normative statements found (not suitable for corpus extraction)

Phase 3: Execution

Run Extraction Skill

Execute the extract-wiki-corpus skill to perform LLM extraction:

# Use the Skill tool with extract-wiki-corpus
# Pass the article path

You will invoke the extract-wiki-corpus skill using the Skill tool with the article path.

Capture Execution Data

During execution, you must capture and store:

  1. LLM Extraction Output

    • The JSON array of claims returned by the LLM
    • Timestamp of extraction
    • Prompt version used (if available)
  2. CLI Commands Executed

    • All aphoria corpus create commands
    • Command arguments
    • Exit codes
  3. CLI Output

    • Success messages
    • Corpus IDs returned
    • Error messages (if any)
  4. Execution Metadata

    • Start time
    • End time
    • Duration
    • Skill version

Execution Checklist

## Execution

- [✅/❌] Skill invoked: extract-wiki-corpus
- [✅/❌] LLM extraction completed
- [✅/❌] JSON claims captured
- [✅/❌] CLI commands executed
- [✅/❌] Corpus IDs returned
- [✅/❌] No errors during execution

Decision Gate

Proceed to Phase 4 if:

  • Extraction completed without fatal errors
  • At least 1 claim was extracted
  • CLI commands executed

RETRY if:

  • LLM timeout (retry up to 3x)
  • Transient API error (retry up to 3x)

FAIL if:

  • Invalid JSON from LLM
  • All CLI commands failed
  • No claims extracted from article with clear normative statements
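
The RETRY branch of the gate can be sketched as a bounded retry wrapper. The `TransientError` class and the `with_retries` helper are hypothetical names introduced for illustration; how timeouts and transient API errors actually surface depends on the tools you invoke.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for errors worth retrying (LLM timeout, transient API error)."""


def with_retries(operation, max_retries: int = 3, delay_s: float = 1.0):
    """Call operation(); on TransientError, retry up to max_retries more times."""
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_retries:
                raise  # Retries exhausted: the caller records a FAIL.
            time.sleep(delay_s)
```

Only errors explicitly raised as `TransientError` are retried, so the FAIL conditions above (invalid JSON, all commands failed) propagate immediately.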

Phase 4: Verification (5 Layers)

Layer 1: LLM Extraction Verification

Objective

Verify the LLM returned valid, high-quality claims in the correct format.

Checks

  1. Valid JSON Returned

    • Parse LLM output as JSON
    • Expected: Array of claim objects
    • FAIL if: Invalid JSON, not an array
  2. Required Fields Present

    • Each claim must have: subject, predicate, value, explanation, authority, category, tier, confidence
    • FAIL if: Any field missing
  3. Confidence Threshold

    • All claims have confidence >= 0.7
    • FAIL if: Any claim below threshold
  4. Tier Values Valid

    • All tier values in [0, 1, 2, 3]
    • FAIL if: Invalid tier
  5. Categories Valid

    • All category values in: compatibility, performance, security, architecture, quality
    • FAIL if: Invalid category
  6. Subject Paths Use Forward Slashes

    • All subject values use / separators (not . or ::)
    • Example: ✅ tls/certificate_verification, ❌ tls.certificate_verification
    • FAIL if: Wrong separator
  7. Claim Count Matches Expectations

    • Compare extracted count to expected count
    • PASS if: Within tolerance (±1 by default)
    • FAIL if: Outside tolerance
  8. Authority Citations Present

    • All authority fields non-empty
    • Should reference RFC/OWASP/CWE/W3C
    • FAIL if: Generic authorities like "best practice"
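
The eight checks above can be sketched as a single function over the raw LLM output. This is an illustrative sketch, not the skill's implementation: `verify_extraction` is a hypothetical name, and the authority check is simplified to a substring test for RFC, OWASP, CWE, or W3C.

```python
import json

REQUIRED_FIELDS = {"subject", "predicate", "value", "explanation",
                   "authority", "category", "tier", "confidence"}
VALID_CATEGORIES = {"compatibility", "performance", "security", "architecture", "quality"}
SPECIFIC_AUTHORITIES = ("RFC", "OWASP", "CWE", "W3C")


def verify_extraction(raw: str, expected_count: int, tolerance: int = 1,
                      min_confidence: float = 0.7) -> dict:
    """Return a dict mapping each Layer 1 check name to True (pass) or False (fail)."""
    try:
        claims = json.loads(raw)
    except json.JSONDecodeError:
        return {"valid_json": False}
    if not isinstance(claims, list) or not all(isinstance(c, dict) for c in claims):
        return {"valid_json": False}
    results = {"valid_json": True}
    results["required_fields"] = all(REQUIRED_FIELDS.issubset(c) for c in claims)
    if not results["required_fields"]:
        return results  # The remaining checks need every field to be present.
    results["confidence_threshold"] = all(c["confidence"] >= min_confidence for c in claims)
    results["tier_values_valid"] = all(c["tier"] in (0, 1, 2, 3) for c in claims)
    results["categories_valid"] = all(c["category"] in VALID_CATEGORIES for c in claims)
    results["subject_paths_slashes"] = all(
        "." not in c["subject"] and "::" not in c["subject"] for c in claims)
    results["claim_count_match"] = abs(len(claims) - expected_count) <= tolerance
    # Simplification: a citation counts as specific if it names a known authority.
    results["authority_citations"] = all(
        any(tag in c["authority"] for tag in SPECIFIC_AUTHORITIES) for c in claims)
    return results
```

The layer passes only if every value in the returned dictionary is `True`; any `False` entry names the check to report in the diagnostic.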

Verdict Format

### Layer 1: LLM Extraction

**Status:** ✅ PASS | ❌ FAIL | ⚠️ PARTIAL

**Checks:**
- ✅ Valid JSON returned (array of 3 claims)
- ✅ Required fields present (all 8 fields on all claims)
- ✅ Confidence threshold met (min: 0.85, max: 0.95)
- ✅ Tier values valid (0, 1, 1)
- ✅ Categories valid (all "security")
- ✅ Subject paths use forward slashes
- ✅ Claim count matches (expected: 3, actual: 3, tolerance: ±1)
- ⚠️ Authority citations present (2/3 have RFC/OWASP, 1 generic)

**Diagnostic:**
- Claim 3 has authority "industry best practice" instead of specific RFC/OWASP
- Recommendation: Improve LLM prompt to require specific citations

Layer 2: CLI Execution Verification

Objective

Verify all aphoria corpus create commands executed successfully.

Checks

  1. All Commands Succeeded

    • Exit code 0 for all commands
    • FAIL if: Any non-zero exit code
  2. No Database Locked Errors

    • Check for "database is locked" in output
    • FAIL if: Lock errors present
  3. Corpus IDs Returned

    • Each command returns a corpus ID
    • IDs should be UUIDs or similar
    • FAIL if: No ID returned
  4. Expected Claim Count Matches Stored Count

    • Number of successful commands = number of extracted claims
    • FAIL if: Mismatch
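
A minimal sketch of these four checks over captured command results follows. The `CommandResult` record and the `verify_cli` function are hypothetical; they assume you stored each command's exit code, combined output, and parsed corpus ID during Phase 3.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CommandResult:
    exit_code: int
    output: str               # Combined stdout and stderr of the command.
    corpus_id: Optional[str]  # ID parsed from the output, or None if absent.


def verify_cli(results: list, claims_extracted: int) -> dict:
    """Evaluate the four Layer 2 checks over the captured command results."""
    succeeded = sum(1 for r in results if r.exit_code == 0)
    return {
        "all_commands_succeeded": succeeded == len(results),
        "no_db_locks": not any("database is locked" in r.output for r in results),
        "corpus_ids_returned": all(r.corpus_id is not None for r in results),
        "claim_count_match": succeeded == claims_extracted,
    }
```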

Sample Command Verification

For each claim, verify the command structure:

aphoria corpus create \
  --subject "tls/certificate_verification" \
  --predicate "enabled" \
  --value "true" \
  --explanation "TLS certificate verification MUST be enabled per RFC 5246 Section 7.4.2" \
  --authority "RFC 5246 Section 7.4.2" \
  --category "security" \
  --tier 0
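
To compare the executed commands against the extracted claims, it can help to rebuild the expected argument list from each claim. The sketch below uses only the flags shown in the sample command; the helper name `build_create_command` is hypothetical.

```python
import shlex

# Flag order follows the sample command above.
FLAGS = ("subject", "predicate", "value", "explanation", "authority", "category", "tier")


def build_create_command(claim: dict, binary: str = "aphoria") -> list[str]:
    """Build the argv list for one `aphoria corpus create` invocation."""
    argv = [binary, "corpus", "create"]
    for flag in FLAGS:
        argv += [f"--{flag}", str(claim[flag])]
    return argv


def as_shell_string(argv: list[str]) -> str:
    """Render argv as a copy-pasteable shell command for the report."""
    return shlex.join(argv)
```

Passing the list form to `subprocess.run` without `shell=True` avoids quoting problems in explanations that contain quotes or spaces.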

Verdict Format

### Layer 2: CLI Execution

**Status:** ✅ PASS | ❌ FAIL

**Checks:**
- ✅ All commands succeeded (3/3 exit code 0)
- ✅ No database locked errors
- ✅ Corpus IDs returned (3 UUIDs)
- ✅ Expected claim count matches (3 commands for 3 claims)

**Command Output:**

Created corpus item: rfc://5246/7.4.2 → tls/certificate_verification::enabled = true (ID: abc123)
Created corpus item: owasp://password-storage → password/storage::algorithm = bcrypt (ID: def456)
Created corpus item: cwe://79 → xss/output_encoding::enabled = true (ID: ghi789)

**Diagnostic:**
- All executions successful
- Average execution time: 0.15s per command

Layer 3: Database Storage Verification

Objective

Verify claims are stored correctly in the corpus database with proper URIs, tiers, and metadata.

Query Corpus Database

Use API to query stored items:

curl 'http://localhost:18180/v1/aphoria/corpus?sources[]=rfc&sources[]=owasp&sources[]=cwe&limit=100'

Checks Per Item

For each expected claim, verify:

  1. Item Exists in Database

    • Query by subject path
    • FAIL if: Not found
  2. Subject URI Uses Correct Scheme

    • RFC → rfc://5246/7.4.2
    • OWASP → owasp://password-storage
    • CWE → cwe://79
    • FAIL if: Plain text subject
  3. Subject Path Matches Expectation

    • Expected: tls/certificate_verification
    • Actual: (from DB)
    • FAIL if: Mismatch
  4. Predicate Matches Expectation

    • Expected: enabled
    • Actual: (from DB)
    • FAIL if: Mismatch
  5. Value Matches Expectation

    • Expected: true
    • Actual: (from DB)
    • FAIL if: Mismatch
  6. Tier Assignment Correct

    • Expected: RFC=0, OWASP=1, CWE=1
    • Actual: (from DB)
    • FAIL if: Wrong tier
  7. Category Correct

    • Expected: security
    • Actual: (from DB)
    • FAIL if: Mismatch
  8. Explanation Present and Non-Empty

    • Should be > 20 characters
    • Should reference the authority
    • FAIL if: Empty or too short
  9. Authority Source Preserved

    • Should contain RFC/OWASP/CWE reference
    • FAIL if: Lost during storage
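
The nine per-item checks can be sketched as a comparison between a stored item and its expectation. This sketch assumes the stored item uses the field names from the API response (`subject_uri`, `subject_path`, and so on) and that the expectation carries an `authority_type` key; both the function name and that key are hypothetical. "References the authority" is simplified to a substring test.

```python
EXPECTED_SCHEMES = {"RFC": "rfc://", "OWASP": "owasp://", "CWE": "cwe://"}


def verify_stored_item(item, expected: dict) -> dict:
    """Run the nine Layer 3 checks for one item; item=None means it was not found."""
    if item is None:
        return {"exists": False}
    kind = expected["authority_type"]
    return {
        "exists": True,
        "subject_uri_scheme": item["subject_uri"].startswith(EXPECTED_SCHEMES[kind]),
        "subject_path": item["subject_path"] == expected["subject"],
        "predicate": item["predicate"] == expected["predicate"],
        "value": item["value"] == expected["value"],
        "tier": item["tier"] == expected["tier"],
        "category": item["category"] == expected["category"],
        # More than 20 characters and mentions the authority type.
        "explanation": len(item["explanation"]) > 20 and kind in item["explanation"],
        "authority_preserved": kind in item["authority"],
    }
```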

Verdict Format

### Layer 3: Database Storage

**Status:** ✅ PASS | ❌ FAIL

**Checks:**

#### Item 1: TLS Certificate Verification
- ✅ Item exists (ID: abc123)
- ✅ Subject URI (rfc://5246/7.4.2)
- ✅ Subject path (tls/certificate_verification)
- ✅ Predicate (enabled)
- ✅ Value (true)
- ✅ Tier (0 - RFC)
- ✅ Category (security)
- ✅ Explanation (82 chars, references RFC 5246)
- ✅ Authority preserved (RFC 5246 Section 7.4.2)

#### Item 2: Password Storage
- ✅ Item exists (ID: def456)
- ✅ Subject URI (owasp://password-storage)
- ✅ Subject path (password/storage)
- ✅ Predicate (algorithm)
- ✅ Value (bcrypt)
- ✅ Tier (1 - OWASP)
- ✅ Category (security)
- ✅ Explanation (67 chars, references OWASP)
- ✅ Authority preserved (OWASP Password Storage Cheat Sheet)

#### Item 3: XSS Prevention
- ✅ Item exists (ID: ghi789)
- ✅ Subject URI (cwe://79)
- ✅ Subject path (xss/output_encoding)
- ✅ Predicate (enabled)
- ✅ Value (true)
- ✅ Tier (1 - CWE)
- ✅ Category (security)
- ✅ Explanation (54 chars, references CWE-79)
- ✅ Authority preserved (CWE-79 XSS)

**Summary:** 3/3 items stored correctly (27/27 checks passed)

Layer 4: API Response Verification

Objective

Verify the API returns corpus items correctly with complete metadata and proper filtering.

API Query

curl -s 'http://localhost:18180/v1/aphoria/corpus?sources[]=rfc&sources[]=owasp&sources[]=cwe&limit=100' | jq .

Checks

  1. HTTP 200 Status

    • Request succeeds
    • FAIL if: 4xx or 5xx error
  2. Valid JSON Response

    • Parse as JSON
    • FAIL if: Invalid JSON
  3. Items Array Present

    • Response has items field
    • FAIL if: Missing
  4. Correct Item Count

    • items array length matches expected
    • FAIL if: Mismatch
  5. Total Matching Count Correct

    • total_matching field present
    • Should be >= items count
    • FAIL if: Incorrect
  6. Sources Included Array Correct

    • sources_included field present
    • Should contain ["rfc", "owasp", "cwe"] (or subset)
    • FAIL if: Missing or incorrect
  7. Each Item Has Complete Metadata

    • Fields: subject_uri, subject_path, predicate, value, tier, category, explanation, authority
    • FAIL if: Any field missing
  8. Source Filtering Works

    • Query with sources[]=rfc → only RFC items
    • Query with sources[]=owasp → only OWASP items
    • FAIL if: Wrong items returned
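
A sketch of building the bracket-notation query and checking the response structure follows. It assumes the response has already been fetched and parsed into a dictionary; `corpus_query_url` and `verify_response` are hypothetical helper names, and the source-filtering check (one query per source) is left out for brevity.

```python
from urllib.parse import urlencode

ITEM_FIELDS = {"subject_uri", "subject_path", "predicate", "value",
               "tier", "category", "explanation", "authority"}


def corpus_query_url(base: str, sources: list[str], limit: int = 100) -> str:
    """Build the corpus query URL with one sources[] parameter per source."""
    params = [("sources[]", s) for s in sources] + [("limit", str(limit))]
    # safe="[]" keeps the brackets literal instead of percent-encoding them.
    return f"{base}/v1/aphoria/corpus?{urlencode(params, safe='[]')}"


def verify_response(status: int, body: dict, expected_count: int,
                    sources: list[str]) -> dict:
    """Check status, structure, counts, and per-item metadata of a parsed response."""
    items = body.get("items")
    results = {"http_200": status == 200, "items_array_present": isinstance(items, list)}
    if not results["items_array_present"]:
        return results
    included = body.get("sources_included") or []
    results["correct_item_count"] = len(items) == expected_count
    results["total_matching_correct"] = body.get("total_matching", -1) >= len(items)
    # Accept the requested sources or a non-empty subset of them.
    results["sources_included_correct"] = bool(included) and set(included) <= set(sources)
    results["complete_metadata"] = all(ITEM_FIELDS.issubset(i) for i in items)
    return results
```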

Verdict Format

### Layer 4: API Response

**Status:** ✅ PASS | ❌ FAIL

**Checks:**
- ✅ HTTP 200 status
- ✅ Valid JSON response
- ✅ Items array present (3 items)
- ✅ Correct item count (expected: 3, actual: 3)
- ✅ Total matching count (3)
- ✅ Sources included array (["rfc", "owasp", "cwe"])
- ✅ Complete metadata (all 8 fields on all items)
- ✅ Source filtering works (RFC: 1, OWASP: 1, CWE: 1)

**Sample Response:**
```json
{
  "items": [
    {
      "subject_uri": "rfc://5246/7.4.2",
      "subject_path": "tls/certificate_verification",
      "predicate": "enabled",
      "value": "true",
      "tier": 0,
      "category": "security",
      "explanation": "TLS certificate verification MUST be enabled per RFC 5246 Section 7.4.2",
      "authority": "RFC 5246 Section 7.4.2"
    }
  ],
  "total_matching": 3,
  "sources_included": ["rfc", "owasp", "cwe"]
}
```

**Diagnostic:**
- API response time: 0.05s
- All items have complete metadata
- Filtering by source works correctly

Layer 5: Dashboard Display Verification (Manual)

Objective

Verify the dashboard displays corpus items correctly with proper badges, formatting, and detail views.

Manual Checklist

You will generate this checklist for the user to verify manually:

### Layer 5: Dashboard Display

**Status:** ⏸️ MANUAL (requires user verification)

**Instructions:**
1. Open dashboard: http://localhost:3000/corpus
2. Verify the following checklist:

**Corpus List View:**
- [ ] Filter by "RFC" source - see RFC items?
- [ ] Filter by "OWASP" source - see OWASP items?
- [ ] Filter by "CWE" source - see CWE items?
- [ ] Clear filters - see all items?

**Item Display (for each corpus item):**
- [ ] Source badge visible (RFC/OWASP/CWE)?
- [ ] Source badge correct color?
- [ ] Tier badge visible (0/1/2/3)?
- [ ] Subject path readable and formatted?
- [ ] Predicate displayed?
- [ ] Value displayed?
- [ ] Explanation visible and complete?
- [ ] Authority citation present?

**Item Detail View:**
- [ ] Click an item - detail view opens?
- [ ] All metadata fields displayed?
- [ ] Authority link/reference present?
- [ ] Explanation fully visible?

**User Verification:**
Please complete the checklist above and report results.

Verdict Format

### Layer 5: Dashboard Display

**Status:** ⏸️ MANUAL

**Checklist generated for user verification.**

**Note:** This layer requires manual testing. Automated UI testing is out of scope for MVP.

Verification Summary

After all 5 layers, generate a summary:

## Verification Summary

| Layer | Status | Checks Passed | Checks Failed |
|-------|--------|--------------|---------------|
| 1. LLM Extraction | ✅ PASS | 8 | 0 |
| 2. CLI Execution | ✅ PASS | 4 | 0 |
| 3. Database Storage | ✅ PASS | 27 | 0 |
| 4. API Response | ✅ PASS | 8 | 0 |
| 5. Dashboard Display | ⏸️ MANUAL | - | - |

**Overall Automated Verdict:** ✅ PASS (4/4 layers, 47/47 checks)

**Next Steps:**
- ✅ All automated layers passed
- ⏸️ Manual dashboard verification pending
- 📄 Proceed to Phase 5: Reporting
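
The summary numbers can be derived from the per-layer results rather than typed by hand. The sketch below is one possible aggregation: the function name `overall_verdict` is hypothetical, and the rule that any FAIL makes the overall verdict FAIL (with PARTIAL otherwise) is an assumption, since the exact combination rule is not spelled out here.

```python
def overall_verdict(layers: dict) -> dict:
    """Aggregate layer results; layers maps name -> {status, checks_passed, checks_failed}."""
    # Manual layers are reported but do not count toward the automated verdict.
    automated = {k: v for k, v in layers.items() if v["status"] != "MANUAL"}
    statuses = {v["status"] for v in automated.values()}
    if automated and statuses == {"PASS"}:
        verdict = "PASS"
    elif "FAIL" in statuses or not automated:
        verdict = "FAIL"
    else:
        verdict = "PARTIAL"
    return {
        "verdict": verdict,
        "layers_automated": len(automated),
        "layers_passed": sum(1 for v in automated.values() if v["status"] == "PASS"),
        "checks_passed": sum(v["checks_passed"] for v in automated.values()),
        "checks_failed": sum(v["checks_failed"] for v in automated.values()),
    }
```

With the example counts above (8, 4, 27, and 8 passed checks) this yields 47 passed checks across 4 automated layers.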

Phase 5: Reporting

Generate Two Reports

You will create both markdown (human-readable) and JSON (machine-parseable) reports.

Report 1: Markdown (Human-Readable)

Template

# Wiki Corpus Verification Report

**Test Run ID:** {uuid-v4}
**Date:** {ISO 8601 timestamp}
**Article:** {file_path}
**Article Name:** {filename}
**Status:** ✅ PASS | ❌ FAIL | ⚠️ PARTIAL

---

## Executive Summary

**Verdict:** ✅ PASS (4/4 automated layers)

**Claims Processed:** 3
**Layers Tested:** 5 (4 automated, 1 manual)
**Checks Passed:** 47
**Checks Failed:** 0

**Timeline:**
- Pre-flight: 0.5s
- Expectation setting: 2.0s
- Execution: 5.2s
- Verification: 3.1s
- Total: 10.8s

---

## Pre-flight Checks

- ✅ Test corpus exists: /tmp/test-wiki-corpus/
- ✅ Aphoria binary: target/release/aphoria (v0.1.0)
- ✅ Corpus DB writable: ~/.aphoria/corpus-db/
- ✅ Report directory: .aphoria/wiki-import-tests/
- ✅ API binary: target/release/stemedb-api
- ⏸️ Dashboard: http://localhost:3000 (not running)

**Verdict:** ✅ All required checks passed

---

## Expectations

**File:** security.md
**Expected Claims:** 3
**Tolerance:** ±1 claim

**Authorities:**
1. RFC 5246 Section 7.4.2 (tier 0)
2. OWASP Password Storage Cheat Sheet (tier 1)
3. CWE-79 XSS (tier 1)

**Expected Subjects:**
- tls/certificate_verification
- password/storage
- xss/output_encoding

**Expected Predicates:** enabled, algorithm, enabled
**Expected Categories:** security, security, security

---

## Execution

**Skill Invoked:** extract-wiki-corpus
**Start Time:** 2026-02-09T12:00:00Z
**End Time:** 2026-02-09T12:00:05Z
**Duration:** 5.2s

**LLM Extraction:**
- Claims extracted: 3
- Confidence range: 0.85 - 0.95
- Average confidence: 0.90

**CLI Execution:**
- Commands executed: 3
- Commands succeeded: 3
- Commands failed: 0
- Corpus IDs returned: 3

---

## Verification Results

### Layer 1: LLM Extraction

**Status:** ✅ PASS

**Checks:**
- ✅ Valid JSON returned (array of 3 claims)
- ✅ Required fields present (all 8 fields on all claims)
- ✅ Confidence threshold met (min: 0.85, max: 0.95)
- ✅ Tier values valid (0, 1, 1)
- ✅ Categories valid (all "security")
- ✅ Subject paths use forward slashes
- ✅ Claim count matches (expected: 3, actual: 3, tolerance: ±1)
- ✅ Authority citations present (all RFC/OWASP/CWE)

**Diagnostic:** All extraction quality checks passed.

---

### Layer 2: CLI Execution

**Status:** ✅ PASS

**Checks:**
- ✅ All commands succeeded (3/3 exit code 0)
- ✅ No database locked errors
- ✅ Corpus IDs returned (3 UUIDs)
- ✅ Expected claim count matches (3 commands for 3 claims)

**Command Output:**

Created corpus item: rfc://5246/7.4.2 → tls/certificate_verification::enabled = true (ID: abc123)
Created corpus item: owasp://password-storage → password/storage::algorithm = bcrypt (ID: def456)
Created corpus item: cwe://79 → xss/output_encoding::enabled = true (ID: ghi789)

**Diagnostic:** All CLI executions successful. Average: 0.15s per command.

---

### Layer 3: Database Storage

**Status:** ✅ PASS

**Checks:**

| Item | Subject | Predicate | Value | Tier | Checks |
|------|---------|-----------|-------|------|--------|
| 1 | tls/certificate_verification | enabled | true | 0 | 9/9 ✅ |
| 2 | password/storage | algorithm | bcrypt | 1 | 9/9 ✅ |
| 3 | xss/output_encoding | enabled | true | 1 | 9/9 ✅ |

**Summary:** 3/3 items stored correctly (27/27 checks passed)

**Diagnostic:**
- All subject URIs use correct schemes (rfc://, owasp://, cwe://)
- All tier assignments correct
- All explanations present and reference authorities

---

### Layer 4: API Response

**Status:** ✅ PASS

**Checks:**
- ✅ HTTP 200 status
- ✅ Valid JSON response
- ✅ Items array present (3 items)
- ✅ Correct item count (expected: 3, actual: 3)
- ✅ Total matching count (3)
- ✅ Sources included array (["rfc", "owasp", "cwe"])
- ✅ Complete metadata (all 8 fields on all items)
- ✅ Source filtering works (RFC: 1, OWASP: 1, CWE: 1)

**Diagnostic:**
- API response time: 0.05s
- All items have complete metadata
- Source filtering verified

---

### Layer 5: Dashboard Display

**Status:** ⏸️ MANUAL

**Manual Checklist:**

**Corpus List View:**
- [ ] Filter by "RFC" source - see RFC items?
- [ ] Filter by "OWASP" source - see OWASP items?
- [ ] Filter by "CWE" source - see CWE items?
- [ ] Clear filters - see all items?

**Item Display:**
- [ ] Source badge visible (RFC/OWASP/CWE)?
- [ ] Tier badge visible (0/1/2/3)?
- [ ] Subject path readable?
- [ ] Explanation visible and complete?
- [ ] Authority citation present?

**Item Detail View:**
- [ ] Click item - detail view opens?
- [ ] All metadata fields displayed?

**Note:** Manual verification required. Automated UI testing out of scope.

---

## Summary Table

| Layer | Status | Pass | Fail |
|-------|--------|------|------|
| LLM Extraction | ✅ PASS | 8 | 0 |
| CLI Execution | ✅ PASS | 4 | 0 |
| Database Storage | ✅ PASS | 27 | 0 |
| API Response | ✅ PASS | 8 | 0 |
| Dashboard Display | ⏸️ MANUAL | - | - |

**Overall:** ✅ PASS (4/4 automated layers, 47/47 checks)

---

## Next Steps

- ✅ All automated verification passed
- ⏸️ Manual dashboard verification pending
- 📄 Report saved to: `.aphoria/wiki-import-tests/security-2026-02-09T12:00:10Z.md`
- 📄 JSON report: `.aphoria/wiki-import-tests/security-2026-02-09T12:00:10Z.json`
- 📊 Baseline created: `.aphoria/wiki-import-tests/baseline-security.json`
- 📝 History updated: `.aphoria/wiki-import-tests/history.jsonl`

**If PASS:** Test next article or archive this result
**If FAIL:** Review diagnostics above and investigate root cause

Report 2: JSON (Machine-Parseable)

Template

{
  "test_run_id": "uuid-v4",
  "timestamp": "2026-02-09T12:00:10Z",
  "version": "1.0.0",
  "article": {
    "path": "/tmp/test-wiki-corpus/security.md",
    "name": "security.md"
  },
  "verdict": "PASS",
  "summary": {
    "layers_tested": 5,
    "layers_automated": 4,
    "layers_manual": 1,
    "layers_passed": 4,
    "layers_failed": 0,
    "checks_total": 47,
    "checks_passed": 47,
    "checks_failed": 0
  },
  "timeline": {
    "preflight_duration_ms": 500,
    "expectation_duration_ms": 2000,
    "execution_duration_ms": 5200,
    "verification_duration_ms": 3100,
    "total_duration_ms": 10800
  },
  "preflight": {
    "test_corpus_exists": true,
    "aphoria_binary": "target/release/aphoria",
    "aphoria_version": "0.1.0",
    "corpus_db_writable": true,
    "report_dir_writable": true,
    "api_binary": "target/release/stemedb-api",
    "dashboard_running": false,
    "verdict": "PASS"
  },
  "expectations": {
    "file": "security.md",
    "expected_claims": 3,
    "tolerance": 1,
    "authorities": [
      {
        "type": "RFC",
        "number": 5246,
        "section": "7.4.2",
        "tier": 0
      },
      {
        "type": "OWASP",
        "title": "Password Storage Cheat Sheet",
        "tier": 1
      },
      {
        "type": "CWE",
        "id": 79,
        "title": "XSS",
        "tier": 1
      }
    ],
    "subjects": [
      "tls/certificate_verification",
      "password/storage",
      "xss/output_encoding"
    ],
    "predicates": ["enabled", "algorithm", "enabled"],
    "categories": ["security", "security", "security"],
    "tiers": [0, 1, 1]
  },
  "execution": {
    "skill": "extract-wiki-corpus",
    "start_time": "2026-02-09T12:00:00Z",
    "end_time": "2026-02-09T12:00:05Z",
    "duration_ms": 5200,
    "claims_extracted": 3,
    "confidence_range": [0.85, 0.95],
    "confidence_avg": 0.90,
    "cli_commands_executed": 3,
    "cli_commands_succeeded": 3,
    "cli_commands_failed": 0,
    "corpus_ids": ["abc123", "def456", "ghi789"]
  },
  "layers": {
    "llm_extraction": {
      "status": "PASS",
      "checks": {
        "valid_json": true,
        "required_fields": true,
        "confidence_threshold": true,
        "tier_values_valid": true,
        "categories_valid": true,
        "subject_paths_slashes": true,
        "claim_count_match": true,
        "authority_citations": true
      },
      "checks_passed": 8,
      "checks_failed": 0,
      "diagnostic": "All extraction quality checks passed."
    },
    "cli_execution": {
      "status": "PASS",
      "checks": {
        "all_commands_succeeded": true,
        "no_db_locks": true,
        "corpus_ids_returned": true,
        "claim_count_match": true
      },
      "checks_passed": 4,
      "checks_failed": 0,
      "diagnostic": "All CLI executions successful. Average: 0.15s per command."
    },
    "database_storage": {
      "status": "PASS",
      "items": [
        {
          "subject": "tls/certificate_verification",
          "predicate": "enabled",
          "value": "true",
          "tier": 0,
          "checks_passed": 9,
          "checks_failed": 0
        },
        {
          "subject": "password/storage",
          "predicate": "algorithm",
          "value": "bcrypt",
          "tier": 1,
          "checks_passed": 9,
          "checks_failed": 0
        },
        {
          "subject": "xss/output_encoding",
          "predicate": "enabled",
          "value": "true",
          "tier": 1,
          "checks_passed": 9,
          "checks_failed": 0
        }
      ],
      "checks_passed": 27,
      "checks_failed": 0,
      "diagnostic": "All subject URIs use correct schemes. All tier assignments correct."
    },
    "api_response": {
      "status": "PASS",
      "checks": {
        "http_200": true,
        "valid_json": true,
        "items_array_present": true,
        "correct_item_count": true,
        "total_matching_correct": true,
        "sources_included_correct": true,
        "complete_metadata": true,
        "source_filtering_works": true
      },
      "checks_passed": 8,
      "checks_failed": 0,
      "diagnostic": "API response time: 0.05s. All items have complete metadata."
    },
    "dashboard_display": {
      "status": "MANUAL",
      "checklist_generated": true,
      "note": "Manual verification required. Automated UI testing out of scope."
    }
  },
  "reports": {
    "markdown": ".aphoria/wiki-import-tests/security-2026-02-09T12:00:10Z.md",
    "json": ".aphoria/wiki-import-tests/security-2026-02-09T12:00:10Z.json"
  },
  "baseline": {
    "created": true,
    "path": ".aphoria/wiki-import-tests/baseline-security.json"
  },
  "history": {
    "updated": true,
    "path": ".aphoria/wiki-import-tests/history.jsonl"
  }
}

Phase 6: Storage

Save Reports to Standard Location

Create directory structure:

mkdir -p .aphoria/wiki-import-tests/

Generate Filenames

Use ISO 8601 timestamps and article name:

# Extract article name (without path and extension)
ARTICLE_NAME=$(basename "/tmp/test-wiki-corpus/security.md" .md)
# Result: "security"

# Generate timestamp
TIMESTAMP=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
# Result: "2026-02-09T12:00:10Z"

# Construct filenames
MD_FILE=".aphoria/wiki-import-tests/${ARTICLE_NAME}-${TIMESTAMP}.md"
JSON_FILE=".aphoria/wiki-import-tests/${ARTICLE_NAME}-${TIMESTAMP}.json"
BASELINE_FILE=".aphoria/wiki-import-tests/baseline-${ARTICLE_NAME}.json"
HISTORY_FILE=".aphoria/wiki-import-tests/history.jsonl"

Write Reports

Use Write tool to save both reports:

  1. Markdown report → ${MD_FILE}
  2. JSON report → ${JSON_FILE}

Create/Update Baseline

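If this is the first test for this article, or the expectations have legitimately changed, create or update the baseline. Otherwise leave the existing baseline untouched so later runs can be compared against it.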

Baseline format:

{
  "article": "security.md",
  "baseline_version": "v1.0",
  "created": "2026-02-09T12:00:10Z",
  "expectations": {
    "claim_count": 3,
    "subjects": [
      "tls/certificate_verification",
      "password/storage",
      "xss/output_encoding"
    ],
    "predicates": ["enabled", "algorithm", "enabled"],
    "tiers": [0, 1, 1],
    "categories": ["security", "security", "security"]
  },
  "tolerance": {
    "claim_count_delta": 0
  },
  "last_updated": "2026-02-09T12:00:10Z",
  "test_run_id": "uuid-v4"
}

Write to ${BASELINE_FILE}.

Append to History

History format (JSONL):

One line per test, append-only:

{"test_id":"uuid-v4","date":"2026-02-09T12:00:10Z","article":"security.md","verdict":"PASS","layers_passed":4,"checks_passed":47,"checks_failed":0,"duration_ms":10800}

Append to .aphoria/wiki-import-tests/history.jsonl.
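
The append step above can be sketched in shell as follows. This is a minimal sketch, not the skill's actual implementation: the function names `append_history` and `count_verdict` are hypothetical, and the values are assumed to contain no double quotes or backslashes, since no JSON escaping is performed.

```shell
#!/usr/bin/env bash
# Default location taken from the document; override HISTORY_FILE to redirect.
HISTORY_FILE="${HISTORY_FILE:-.aphoria/wiki-import-tests/history.jsonl}"

# append_history TEST_ID DATE ARTICLE VERDICT LAYERS_PASSED CHECKS_PASSED CHECKS_FAILED DURATION_MS
# (hypothetical helper) Appends exactly one JSON object per line; never rewrites old lines.
append_history() {
  mkdir -p "$(dirname "$HISTORY_FILE")"
  printf '{"test_id":"%s","date":"%s","article":"%s","verdict":"%s","layers_passed":%d,"checks_passed":%d,"checks_failed":%d,"duration_ms":%d}\n' \
    "$@" >> "$HISTORY_FILE"
}

# count_verdict VERDICT
# (hypothetical helper) Counts history entries with the given verdict, for simple trend checks.
count_verdict() {
  grep -c "\"verdict\":\"$1\"" "$HISTORY_FILE"
}

# Example call (commented out so that sourcing this file writes nothing):
# append_history "uuid-v4" "2026-02-09T12:00:10Z" "security.md" "PASS" 4 47 0 10800
```

Because every entry is a single line, the file can be scanned with line-oriented tools such as `grep` or `tail` without parsing the whole history.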

Storage Checklist

## Storage

- ✅ Reports directory created: .aphoria/wiki-import-tests/
- ✅ Markdown report saved: security-2026-02-09T12:00:10Z.md
- ✅ JSON report saved: security-2026-02-09T12:00:10Z.json
- ✅ Baseline created: baseline-security.json
- ✅ History updated: history.jsonl (1 entry appended)

Error Handling

Error Categories

| Category | Example | Action |
|----------|---------|--------|
| Environment | Binary missing | ABORT with setup instructions |
| Extraction | LLM timeout | RETRY 3x, then FAIL |
| CLI | Command failed | FAIL with error + fix suggestion |
| Storage | Item not found | FAIL with expected vs actual |
| API | 500 error | RETRY 2x, then FAIL |
| User | Dashboard down | SKIP (not critical) |

Failure Modes

FAIL_EXTRACTION

Cause: LLM didn't return valid claims

Symptoms:

  • Invalid JSON from LLM
  • Empty claims array
  • Missing required fields

Recovery Actions:

  1. Check LLM API connectivity
  2. Verify prompt version
  3. Manually review article for ambiguity
  4. Increase LLM temperature if too deterministic
  5. Re-run with --verbose flag for diagnostics

Verdict: FAIL_EXTRACTION

FAIL_CLI

Cause: Commands failed to execute

Symptoms:

  • Non-zero exit codes
  • "database is locked" errors
  • Permission denied

Recovery Actions:

  1. Check database locks: lsof +D ~/.aphoria/corpus-db/ (+D also lists open files inside the directory)
  2. Verify permissions: ls -la ~/.aphoria/corpus-db/
  3. Review CLI command syntax
  4. Retry with fresh database
  5. Check for concurrent processes

Verdict: FAIL_CLI

FAIL_STORAGE

Cause: Items not stored correctly

Symptoms:

  • Items not found in database
  • Wrong tier assignment
  • Missing authority
  • Incorrect subject URI

Recovery Actions:

  1. Query directly: curl http://localhost:18180/v1/aphoria/corpus
  2. Inspect indexes
  3. Check tier assignment logic in code
  4. Verify subject URI parsing
  5. Review authority parser implementation

Verdict: FAIL_STORAGE

FAIL_API

Cause: API didn't return expected data

Symptoms:

  • HTTP 500 error
  • Missing items in response
  • Incorrect filtering
  • Malformed JSON

Recovery Actions:

  1. Verify API running: ps aux | grep stemedb-api
  2. Check API logs: tail -f /path/to/api.log
  3. Test health endpoint: curl http://localhost:18180/health
  4. Retry request 2x
  5. Check API version compatibility

Verdict: FAIL_API

FAIL_REGRESSION

Cause: Doesn't match baseline

Symptoms:

  • Claim count changed
  • Different subjects
  • Tier assignments changed
  • Lost authorities

Recovery Actions:

  1. Compare baseline vs current
  2. Identify what changed (article? extractor? LLM?)
  3. Determine if baseline needs update
  4. Update baseline if expectations legitimately changed
  5. Fix bug if regression unintentional

Verdict: FAIL_REGRESSION
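
A baseline comparison for the claim count could look like the sketch below. It is an illustration under stated assumptions: the helper names `extract_claim_count` and `check_claim_count` are hypothetical, the baseline is assumed to be pretty-printed as in the baseline format shown earlier, and `sed` is used only to keep the sketch dependency-free. A real JSON parser such as `jq` would be sturdier.

```shell
#!/usr/bin/env bash
# extract_claim_count FILE
# (hypothetical helper) Prints the first "claim_count": N value found in FILE.
# The closing quote in the pattern keeps "claim_count_delta" from matching.
extract_claim_count() {
  sed -n 's/.*"claim_count"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$1" | head -n 1
}

# check_claim_count BASELINE_FILE ACTUAL_COUNT [TOLERANCE]
# (hypothetical helper) Prints PASS if |actual - expected| <= tolerance,
# otherwise prints a FAIL_REGRESSION line and returns 1.
check_claim_count() {
  local expected actual=$2 tolerance=${3:-0} delta
  expected=$(extract_claim_count "$1")
  if [ -z "$expected" ]; then
    echo "no claim_count found in $1" >&2
    return 2
  fi
  delta=$((actual - expected))
  if [ "$delta" -lt 0 ]; then
    delta=$((-delta))
  fi
  if [ "$delta" -le "$tolerance" ]; then
    echo "PASS"
  else
    echo "FAIL_REGRESSION (expected ${expected}, got ${actual})"
    return 1
  fi
}

# Example (commented out): compare a run that extracted 3 claims against the baseline.
# check_claim_count .aphoria/wiki-import-tests/baseline-security.json 3
```

The same pattern would extend to subjects and tiers, though comparing arrays is where a proper JSON tool becomes worthwhile.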

Retry Logic

LLM Extraction Failures

  • Retry up to 3 times
  • Wait between retries with exponential backoff: 1s, 2s, 4s
  • If all retries fail → FAIL_EXTRACTION

API Errors

  • Retry up to 2 times
  • Wait 0.5s between retries
  • If all retries fail → FAIL_API

Database Locks

  • Retry up to 3 times
  • Wait 2s between retries (allow lock to clear)
  • If all retries fail → FAIL_CLI
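
The three retry policies above can share one small helper. The following is a minimal sketch, assuming a hypothetical `retry` function; the `fixed` and `double` mode names are illustrative and not part of the skill. Doubling uses integer arithmetic, so fractional delays such as 0.5 should be used with `fixed` only.

```shell
#!/usr/bin/env bash
# retry MAX_RETRIES DELAY MODE COMMAND [ARGS...]
# (hypothetical helper) Runs COMMAND once, then retries up to MAX_RETRIES times.
# MODE "fixed" keeps DELAY constant; MODE "double" doubles it after each wait
# (integer DELAY required for "double").
retry() {
  local max_retries=$1 delay=$2 mode=$3
  shift 3
  local attempt=0
  until "$@"; do
    if [ "$attempt" -ge "$max_retries" ]; then
      return 1   # caller maps this to FAIL_EXTRACTION, FAIL_API, or FAIL_CLI
    fi
    attempt=$((attempt + 1))
    echo "retry ${attempt}/${max_retries} after ${delay}s" >&2
    sleep "$delay"
    if [ "$mode" = "double" ]; then
      delay=$((delay * 2))
    fi
  done
  return 0
}

# Usage mirroring the policies above (commented out; the commands are placeholders):
# retry 3 1   double <extraction command>                      # LLM: waits of 1s, 2s, 4s
# retry 2 0.5 fixed  curl -fsS http://localhost:18180/health   # API: 0.5s between retries
# retry 3 2   fixed  <cli command>                             # DB lock: 2s between retries
```

Keeping the retry count and delay as arguments means each failure category states its own limits at the call site, which makes the "do not retry indefinitely" rule easy to audit.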

Error Reporting

In markdown report:

## Error Summary

**Errors Encountered:** 1

### Error 1: Database Lock

**Category:** CLI
**Phase:** Execution
**Timestamp:** 2026-02-09T12:00:03Z

**Error Message:**

Error: database is locked


**Recovery Attempted:**
- Retry 1: FAIL (database still locked)
- Retry 2: FAIL (database still locked)
- Retry 3: SUCCESS (lock cleared)

**Resolution:** Succeeded after 3 retries (6s delay)

**Recommendation:** Check for concurrent processes writing to corpus DB.

In JSON report:

{
  "errors": [
    {
      "id": 1,
      "category": "CLI",
      "phase": "execution",
      "timestamp": "2026-02-09T12:00:03Z",
      "message": "database is locked",
      "retry_count": 3,
      "retry_succeeded": true,
      "resolution": "Succeeded after 3 retries (6s delay)"
    }
  ]
}

Do

  1. Always run all 6 phases in order - Never skip Phase 2 (expectations) or Phase 5 (reporting)

  2. Set expectations BEFORE execution - Read the article, count claims, predict tiers

  3. Verify all 5 layers independently - Don't assume Layer 3 passes if Layer 2 passes

  4. Generate BOTH markdown AND JSON reports - Human-readable + machine-parseable

  5. Use timestamps in filenames - ISO 8601 format: 2026-02-09T12:00:10Z

  6. Create baselines for regression detection - First test creates baseline, subsequent tests compare

  7. Append to history.jsonl - One-line-per-test for trend analysis

  8. Retry transient failures - LLM timeout (3x), API error (2x), DB lock (3x)

  9. Provide clear diagnostics on failure - Expected vs actual, recovery actions, recommendations

  10. Use Read tool to examine articles - Actually read the markdown, don't guess expectations

  11. Use Skill tool to invoke extract-wiki-corpus - Don't try to run extraction yourself

  12. Use Bash for API queries - curl http://localhost:18180/v1/aphoria/corpus

  13. Use Write tool to save reports - Both markdown and JSON formats

  14. Check decision gates - Don't proceed to next phase if current phase fails

  15. Document every check - ✅ PASS, ❌ FAIL, ⏸️ SKIP with reason

Do Not

  1. Do NOT skip pre-flight checks - Environment validation is critical

  2. Do NOT execute before setting expectations - Phase 2 must complete before Phase 3

  3. Do NOT assume CLI success means storage success - Verify each layer independently

  4. Do NOT overwrite reports - Use timestamps to create unique filenames

  5. Do NOT fail on optional checks - Dashboard not running is OK (manual verification)

  6. Do NOT retry indefinitely - Max 3 retries for LLM, 2 for API, 3 for DB locks

  7. Do NOT guess at expectations - Read the article and analyze normative statements

  8. Do NOT accept generic authorities - "best practice" is not specific enough

  9. Do NOT skip baseline creation - First test must create baseline for future comparisons

  10. Do NOT fail fast on transient errors - Retry with backoff before declaring failure

  11. Do NOT modify existing baselines without reason - Only update if expectations legitimately changed

  12. Do NOT mix manual and automated verdicts - Layer 5 is always MANUAL, Layers 1-4 are automated

  13. Do NOT proceed with FAIL verdict - If any required layer fails, investigation is needed

  14. Do NOT use relative timestamps - Always use ISO 8601 absolute timestamps

  15. Do NOT lose diagnostic information - Capture error messages, command output, API responses

Output Format

Initial Response

When the user invokes this skill, respond with:

# Wiki Corpus Verification

**Article:** {path}
**Test Run ID:** {uuid}

I will verify the wiki corpus extraction pipeline using 6 systematic phases:

1. ✅ Setup & Pre-flight Checks
2. 📋 Expectation Setting
3. ▶️ Execution
4. 🔍 Verification (5 Layers)
5. 📄 Reporting
6. 💾 Storage

Starting Phase 1: Pre-flight Checks...

Progress Updates

As you execute each phase, provide updates:

## Phase 1: Setup & Pre-flight Checks ✅

- ✅ Test corpus exists: /tmp/test-wiki-corpus/
- ✅ Aphoria binary: target/release/aphoria (v0.1.0)
- ✅ Corpus DB writable: ~/.aphoria/corpus-db/
- ✅ Report directory: .aphoria/wiki-import-tests/

**Verdict:** ✅ All required checks passed

Proceeding to Phase 2: Expectation Setting...

Final Summary

After Phase 6, provide complete summary:

# Verification Complete ✅

**Test Run ID:** {uuid}
**Overall Verdict:** ✅ PASS (4/4 automated layers, 47/47 checks)

## Summary

- ✅ Phase 1: Pre-flight (all required checks passed)
- ✅ Phase 2: Expectations (3 claims expected)
- ✅ Phase 3: Execution (3 claims extracted)
- ✅ Phase 4: Verification (47/47 checks passed)
- ✅ Phase 5: Reporting (markdown + JSON generated)
- ✅ Phase 6: Storage (reports saved, baseline created)

## Reports Generated

- **Markdown:** `.aphoria/wiki-import-tests/security-2026-02-09T12:00:10Z.md`
- **JSON:** `.aphoria/wiki-import-tests/security-2026-02-09T12:00:10Z.json`
- **Baseline:** `.aphoria/wiki-import-tests/baseline-security.json`
- **History:** `.aphoria/wiki-import-tests/history.jsonl` (1 entry appended)

## Next Steps

✅ **All automated verification passed**
⏸️ **Manual dashboard verification pending** (checklist in markdown report)

You can now:
- Review the markdown report for full details
- Use the JSON report for programmatic analysis
- Test the next article: `/tmp/test-wiki-corpus/another-article.md`
- Run regression tests by re-running this article (will compare to baseline)

Version: 1.0.0 Last Updated: 2026-02-09 Maintained By: StemeDB Team