# Bootstrap Corpus from External Sources ## Overview When starting fresh with Aphoria, the community corpus is empty because there are no pattern aggregates in StemeDB. Phase 3 provides three bootstrap options to seed the corpus: ## Option A: Wiki Import (Implemented) Parse markdown documentation to extract MUST/SHOULD patterns and store them as pattern aggregates. ### Usage ```bash aphoria corpus import wiki ``` ### Example Wiki Format ```markdown ## TLS Configuration TLS certificate verification MUST be enabled. Disabling verification opens the application to man-in-the-middle attacks. Authority: RFC 5246 Section 7.4.2 ``` This extracts: - **Subject**: `code://*/tls` - **Predicate**: `enabled` - **Value**: `Boolean(true)` - **Authority**: `RFC 5246 Section 7.4.2` ### Pattern Extraction The wiki parser uses regex to match MUST/SHOULD patterns with these components: 1. **Subject identifier** (e.g., "TLS", "JWT", "password") 2. **Modal verb** (MUST, SHOULD, MUST NOT, SHOULD NOT) 3. **Action** (enabled, disabled, required, verified, enforced) The parser also looks for Authority statements in nearby lines (within 5 lines): - RFC references: `RFC 5246 Section 7.4.2` - OWASP references: `OWASP Transport Layer Protection Cheat Sheet` - CWE references: `CWE-256` ### Storage Patterns are stored as assertions in StemeDB with: - **Predicate**: `pattern_aggregate` - **Subject**: Content-addressed `community://pattern/{hash}` (deduplication) - **Metadata**: JSON encoding of project_count, observation_count, timestamps Bootstrap patterns have: - `project_count = 1` (initial count, grows with real scans) - `observation_count = 1` - No signatures (unsigned bootstrap data) ### Examples Create a wiki directory with markdown files: ```bash mkdir -p .aphoria/wiki cat > .aphoria/wiki/tls-best-practices.md <<'EOF' # TLS Best Practices ## Certificate Verification TLS certificate verification MUST be enabled. Disabling verification opens the application to man-in-the-middle attacks. Authority: RFC 5246 Section 7.4.2 EOF ``` Import the wiki: ```bash aphoria corpus import wiki .aphoria/wiki # Output: Imported 1 patterns from wiki at .aphoria/wiki ``` ## Option B: Trust Pack (Not Yet Implemented) Import curated assertions from a Trust Pack that includes pattern aggregates. ```bash aphoria trust-pack install rfc-owasp-baseline ``` ## Option C: Skill-Driven Cold Start (Not Yet Implemented) Use the `aphoria-suggest` skill to analyze the project and suggest 3-5 foundation claims. The skill will: 1. Detect empty corpus 2. Analyze project structure (Cargo.toml, package.json, etc.) 3. Suggest baseline patterns 4. User approves 5. Skill creates patterns via `aphoria claims create` ## Architecture ### Data Flow ``` Wiki Markdown Files | v WikiParser (regex extraction) | v PatternAggregate (in-memory) | v PatternAggregator (write path) | v StemeDB (KV Store + Predicate Index) | v CommunityCorpusBuilder (read path) | v Conflict Detection ``` ### Storage Schema Pattern aggregates are stored as assertions: ```rust Assertion { subject: "community://pattern/{blake3_hash}", predicate: "pattern_aggregate", object: ObjectValue::Boolean(true), source_metadata: JSON({ "subject": "code://*/tls/cert_verification", "predicate": "enabled", "project_count": 1, "observation_count": 1, "first_seen": 1706832000, "last_seen": 1706832000 }), // ... other fields } ``` ### Content-Addressed Deduplication The subject hash is computed as: ```rust BLAKE3(subject + ":" + predicate + ":" + value) ``` This ensures: - Same pattern → same hash → same subject - Duplicate imports are deduplicated by content - Pattern counts can be updated by creating new assertions (append-only) ## Implementation Files - **Parser**: `applications/aphoria/src/corpus/wiki_importer.rs` - **Write Path**: `applications/aphoria/src/community/pattern_store.rs` (`PatternAggregator`) - **CLI Command**: `applications/aphoria/src/cli/mod.rs` (`CorpusCommands::Import`) - **Handler**: `applications/aphoria/src/handlers/corpus.rs` - **Public API**: `applications/aphoria/src/corpus_build.rs` (`import_corpus_from_wiki`) - **Tests**: `applications/aphoria/tests/wiki_import_test.rs` - **Fixtures**: `applications/aphoria/tests/fixtures/wiki/` ## Testing Run the integration tests: ```bash cargo test -p aphoria --test wiki_import_test ``` Tests cover: - Basic pattern extraction from wiki files - Storage round-trip (write → read) - Pattern deduplication via content-addressed subjects - Predicate indexing for efficient queries - Multiple pattern types (TLS, JWT, password, etc.) ## Future Enhancements 1. **Improved Regex**: Support more complex pattern structures 2. **Multi-language**: Extract patterns from non-English documentation 3. **Incremental Updates**: Update existing patterns instead of duplicating 4. **Authority Validation**: Verify RFC/OWASP references are valid 5. **Trust Pack Integration**: Package bootstrap patterns as distributable packs