stemedb/applications/aphoria/docs/bootstrap-corpus.md
jml 65065f3d8f feat(aphoria): implement community corpus with wiki import and pattern aggregation
Implements Phase 4 (A4) - Community corpus as first-class citizens:

- **Community Corpus Builder** - Queries StemeDB pattern aggregates
- **Wiki Import** - Bootstrap corpus from markdown docs (aphoria corpus import wiki)
- **Pattern Aggregation** - Automatic learning from local scans (--sync flag)
- **Storage Layer** - StemeDBPatternStore with content-addressed deduplication
- **Promotion Logic** - Multi-tier thresholds (95%/80%/50% adoption rates)
- **Corpus Build** - Unified registry for RFC/OWASP/Vendor/Community sources
- **Trust Packs** - Export corpus as signed, distributable artifacts
- **Documentation** - bootstrap-corpus.md guide + CLI reference updates

Technical details:
- Pattern aggregates stored as assertions with predicate "pattern_aggregate"
- Content-addressed subjects via BLAKE3(subject:predicate:value)
- PatternAggregator handles write path (observations → patterns)
- StemeDBPatternStore handles read path (pattern queries)
- Integration tests + fixtures in tests/wiki_import_test.rs

Deleted hardcoded.rs (368 lines) - corpus now fully emergent from StemeDB.
Deleted enriched-corpus-patterns.md (677 lines) - feature shipped.

Closes VG-026 (community corpus), part of A4 milestone.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-09 00:12:31 +00:00


Bootstrap Corpus from External Sources

Overview

A fresh Aphoria installation has an empty community corpus because StemeDB does not yet contain any pattern aggregates. Phase 3 defines three bootstrap options for seeding the corpus. Only Option A is implemented so far:

Option A: Wiki Import (Implemented)

Parse markdown documentation to extract MUST/SHOULD patterns and store them as pattern aggregates.

Usage

aphoria corpus import wiki <path/to/wiki>

Example Wiki Format

## TLS Configuration

TLS certificate verification MUST be enabled. Disabling verification
opens the application to man-in-the-middle attacks.

Authority: RFC 5246 Section 7.4.2

This extracts:

  • Subject: code://*/tls
  • Predicate: enabled
  • Value: Boolean(true)
  • Authority: RFC 5246 Section 7.4.2

Pattern Extraction

The wiki parser uses regular expressions to match MUST/SHOULD requirement statements, each made up of these components:

  1. Subject identifier (e.g., "TLS", "JWT", "password")
  2. Modal verb (MUST, SHOULD, MUST NOT, SHOULD NOT)
  3. Action (enabled, disabled, required, verified, enforced)

The parser also looks for an Authority statement within 5 lines of the matched requirement. Recognized reference formats include:

  • RFC references: RFC 5246 Section 7.4.2
  • OWASP references: OWASP Transport Layer Protection Cheat Sheet
  • CWE references: CWE-256
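The extraction steps above can be sketched without any dependencies. This is a minimal illustration, not the code in wiki_importer.rs: it uses plain string searches where the real parser uses regex, the `Extracted` struct and the `extract` function are hypothetical names, and it keeps the whole phrase before the modal verb as the subject rather than mapping it to a `code://` subject.

```rust
/// One requirement pulled from a wiki line (hypothetical shape, illustration only).
#[derive(Debug, PartialEq)]
pub struct Extracted {
    pub subject: String,
    pub modal: &'static str,
    pub action: &'static str,
    pub authority: Option<String>,
}

const MODALS: [&str; 4] = ["MUST NOT", "SHOULD NOT", "MUST", "SHOULD"];
const ACTIONS: [&str; 5] = ["enabled", "disabled", "required", "verified", "enforced"];
/// The guide states that Authority lines are searched within 5 lines.
const AUTHORITY_WINDOW: usize = 5;

/// Returns the byte offset and text of the first modal verb in `line`.
/// On a tie ("MUST" vs "MUST NOT" at the same offset) the longer modal wins.
fn find_modal(line: &str) -> Option<(usize, &'static str)> {
    MODALS
        .iter()
        .filter_map(|m| line.find(*m).map(|pos| (pos, *m)))
        .min_by_key(|(pos, m)| (*pos, std::cmp::Reverse(m.len())))
}

/// Extracts every line of the form "<subject> <modal> ... <action>".
pub fn extract(text: &str) -> Vec<Extracted> {
    let lines: Vec<&str> = text.lines().collect();
    let mut out = Vec::new();
    for (i, line) in lines.iter().copied().enumerate() {
        let Some((pos, modal)) = find_modal(line) else { continue };
        // Subject: everything before the modal verb.
        let subject = line[..pos].trim();
        if subject.is_empty() {
            continue;
        }
        // Action: the first known action word after the modal verb.
        let rest = &line[pos + modal.len()..];
        let action = rest
            .split(|c: char| !c.is_alphanumeric())
            .find_map(|w| ACTIONS.iter().copied().find(|a| *a == w));
        let Some(action) = action else { continue };
        // Authority: first "Authority:" line within the window around the match.
        let lo = i.saturating_sub(AUTHORITY_WINDOW);
        let hi = (i + AUTHORITY_WINDOW + 1).min(lines.len());
        let authority = lines[lo..hi]
            .iter()
            .find_map(|l| l.trim().strip_prefix("Authority:"))
            .map(|a| a.trim().to_string());
        out.push(Extracted { subject: subject.to_string(), modal, action, authority });
    }
    out
}
```

Calling `extract` on the example wiki text from this guide yields a single entry with subject "TLS certificate verification", modal "MUST", action "enabled", and the RFC 5246 authority string. The sketch does not check word boundaries, so a production parser would need stricter matching.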

Storage

Patterns are stored as assertions in StemeDB with:

  • Predicate: pattern_aggregate
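  • Subject: community://pattern/{hash}, content-addressed so that identical patterns share one subject (deduplication)
  • Metadata: JSON encoding of the original subject and predicate, project_count, observation_count, and the first_seen/last_seen timestamps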

Bootstrap patterns have:

  • project_count = 1 (initial count, grows with real scans)
  • observation_count = 1
  • No signatures (unsigned bootstrap data)
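As a rough illustration of these bootstrap defaults, the sketch below builds an in-memory aggregate and serializes its metadata. The `PatternAggregate` struct here is a hypothetical shape whose fields mirror the metadata keys listed under Storage Schema; it is not Aphoria's actual type, the `signed` flag is an assumed way to model "no signatures", and the hand-written JSON encoder stands in for whatever JSON library the project uses.

```rust
/// Hypothetical in-memory aggregate; field names follow the metadata keys in this guide.
#[derive(Debug, Clone, PartialEq)]
pub struct PatternAggregate {
    pub subject: String,
    pub predicate: String,
    pub project_count: u64,
    pub observation_count: u64,
    pub first_seen: u64,
    pub last_seen: u64,
    /// Assumed flag: bootstrap data carries no signatures.
    pub signed: bool,
}

/// Escapes backslashes and double quotes so the value is safe inside a JSON string.
fn escape(s: &str) -> String {
    s.replace('\\', "\\\\").replace('"', "\\\"")
}

impl PatternAggregate {
    /// Creates a bootstrap aggregate: both counts start at 1 and the pattern is unsigned.
    pub fn bootstrap(subject: &str, predicate: &str, now: u64) -> Self {
        Self {
            subject: subject.to_string(),
            predicate: predicate.to_string(),
            project_count: 1,
            observation_count: 1,
            first_seen: now,
            last_seen: now,
            signed: false,
        }
    }

    /// Encodes the metadata fields as a compact JSON object.
    pub fn metadata_json(&self) -> String {
        format!(
            "{{\"subject\":\"{}\",\"predicate\":\"{}\",\"project_count\":{},\"observation_count\":{},\"first_seen\":{},\"last_seen\":{}}}",
            escape(&self.subject),
            escape(&self.predicate),
            self.project_count,
            self.observation_count,
            self.first_seen,
            self.last_seen
        )
    }
}
```

For example, `PatternAggregate::bootstrap("code://*/tls/cert_verification", "enabled", 1706832000)` produces counts of 1 and identical first_seen and last_seen values, matching the initial state described above.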

Examples

Create a wiki directory with markdown files:

mkdir -p .aphoria/wiki
cat > .aphoria/wiki/tls-best-practices.md <<'EOF'
# TLS Best Practices

## Certificate Verification

TLS certificate verification MUST be enabled. Disabling verification
opens the application to man-in-the-middle attacks.

Authority: RFC 5246 Section 7.4.2
EOF

Import the wiki:

aphoria corpus import wiki .aphoria/wiki
# Output: Imported 1 patterns from wiki at .aphoria/wiki

Option B: Trust Pack (Not Yet Implemented)

Import curated assertions from a Trust Pack that includes pattern aggregates.

aphoria trust-pack install rfc-owasp-baseline

Option C: Skill-Driven Cold Start (Not Yet Implemented)

Use the aphoria-suggest skill to analyze the project and suggest 3-5 foundation claims.

The skill will:

  1. Detect empty corpus
  2. Analyze project structure (Cargo.toml, package.json, etc.)
  3. Suggest baseline patterns
  4. Wait for the user to approve the suggestions
  5. Create the approved patterns via aphoria claims create

Architecture

Data Flow

Wiki Markdown Files
    |
    v
WikiParser (regex extraction)
    |
    v
PatternAggregate (in-memory)
    |
    v
PatternAggregator (write path)
    |
    v
StemeDB (KV Store + Predicate Index)
    |
    v
CommunityCorpusBuilder (read path)
    |
    v
Conflict Detection
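To make the "KV Store + Predicate Index" step more concrete, here is a toy in-memory store. Everything in it is an assumption for illustration: `MiniStore`, `put`, and `query_predicate` are hypothetical names, and StemeDB's real storage layout is not described in this guide. The point is only that a secondary index from predicate to subjects lets the read path fetch all pattern_aggregate assertions without scanning every key.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Toy store: a primary subject -> metadata map plus a predicate -> subjects index.
#[derive(Default)]
pub struct MiniStore {
    by_subject: BTreeMap<String, String>,
    by_predicate: BTreeMap<String, BTreeSet<String>>,
}

impl MiniStore {
    /// Write path: store the metadata and register the subject under its predicate.
    pub fn put(&mut self, subject: &str, predicate: &str, metadata: &str) {
        self.by_subject
            .insert(subject.to_string(), metadata.to_string());
        self.by_predicate
            .entry(predicate.to_string())
            .or_default()
            .insert(subject.to_string());
    }

    /// Read path: return (subject, metadata) pairs for one predicate via the index.
    pub fn query_predicate(&self, predicate: &str) -> Vec<(&str, &str)> {
        self.by_predicate
            .get(predicate)
            .into_iter()
            .flatten()
            .filter_map(|s| self.by_subject.get(s).map(|m| (s.as_str(), m.as_str())))
            .collect()
    }
}
```

In this sketch, a builder on the read path would call `query_predicate("pattern_aggregate")` and hand the results to conflict detection. Writing the same subject twice leaves a single index entry, because the index holds a set of subjects.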

Storage Schema

Pattern aggregates are stored as assertions:

Assertion {
    subject: "community://pattern/{blake3_hash}",
    predicate: "pattern_aggregate",
    object: ObjectValue::Boolean(true),
    source_metadata: JSON({
        "subject": "code://*/tls/cert_verification",
        "predicate": "enabled",
        "project_count": 1,
        "observation_count": 1,
        "first_seen": 1706832000,
        "last_seen": 1706832000
    }),
    // ... other fields
}

Content-Addressed Deduplication

The subject hash is computed as:

BLAKE3(subject + ":" + predicate + ":" + value)

This ensures:

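  • Same pattern → same hash → same subject
  • Re-importing the same pattern maps to the existing subject instead of creating a second one
  • Pattern counts are updated by appending new assertions under the same subject (append-only); existing assertions are not modified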
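The properties above can be demonstrated with a small sketch. Two loud assumptions: std's `DefaultHasher` stands in for BLAKE3 only so the example compiles without external crates (real code would hash the same key with the `blake3` crate, and the resulting digests differ), and the "latest appended assertion wins" read rule is a guess at how append-only count updates are resolved. `pattern_subject`, `CountAssertion`, and `AppendOnlyLog` are hypothetical names.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeMap;
use std::hash::{Hash, Hasher};

/// Builds the content-addressed subject from "subject:predicate:value".
/// STAND-IN: DefaultHasher replaces BLAKE3 here; it is not cryptographic.
pub fn pattern_subject(subject: &str, predicate: &str, value: &str) -> String {
    let key = format!("{subject}:{predicate}:{value}");
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    format!("community://pattern/{:016x}", hasher.finish())
}

/// A count update for one pattern subject (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
pub struct CountAssertion {
    pub subject: String,
    pub project_count: u64,
    pub observation_count: u64,
}

/// Append-only log: entries are only ever pushed, never edited or removed.
#[derive(Default)]
pub struct AppendOnlyLog {
    entries: Vec<CountAssertion>,
}

impl AppendOnlyLog {
    pub fn append(&mut self, assertion: CountAssertion) {
        self.entries.push(assertion);
    }

    pub fn len(&self) -> usize {
        self.entries.len()
    }

    /// Assumed read rule: the most recently appended assertion per subject wins.
    pub fn latest(&self) -> BTreeMap<&str, &CountAssertion> {
        let mut latest = BTreeMap::new();
        for entry in &self.entries {
            latest.insert(entry.subject.as_str(), entry);
        }
        latest
    }
}
```

Hashing the same (subject, predicate, value) triple twice returns the same `community://pattern/...` subject, so appending a second assertion for that triple grows the log by one entry while `latest()` still reports a single pattern with the newer counts.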

Implementation Files

  • Parser: applications/aphoria/src/corpus/wiki_importer.rs
  • Write Path: applications/aphoria/src/community/pattern_store.rs (PatternAggregator)
  • CLI Command: applications/aphoria/src/cli/mod.rs (CorpusCommands::Import)
  • Handler: applications/aphoria/src/handlers/corpus.rs
  • Public API: applications/aphoria/src/corpus_build.rs (import_corpus_from_wiki)
  • Tests: applications/aphoria/tests/wiki_import_test.rs
  • Fixtures: applications/aphoria/tests/fixtures/wiki/

Testing

Run the integration tests:

cargo test -p aphoria --test wiki_import_test

Tests cover:

  • Basic pattern extraction from wiki files
  • Storage round-trip (write → read)
  • Pattern deduplication via content-addressed subjects
  • Predicate indexing for efficient queries
  • Multiple pattern types (TLS, JWT, password, etc.)

Future Enhancements

  1. Improved Regex: Support more complex pattern structures
  2. Multi-language: Extract patterns from non-English documentation
  3. Incremental Updates: Update existing patterns instead of duplicating
  4. Authority Validation: Verify RFC/OWASP references are valid
  5. Trust Pack Integration: Package bootstrap patterns as distributable packs