Implements Phase 4 (A4) - Community corpus as first-class citizens:
- **Community Corpus Builder** - Queries StemeDB pattern aggregates
- **Wiki Import** - Bootstrap corpus from markdown docs (aphoria corpus import wiki)
- **Pattern Aggregation** - Automatic learning from local scans (--sync flag)
- **Storage Layer** - StemeDBPatternStore with content-addressed deduplication
- **Promotion Logic** - Multi-tier thresholds (95%/80%/50% adoption rates)
- **Corpus Build** - Unified registry for RFC/OWASP/Vendor/Community sources
- **Trust Packs** - Export corpus as signed, distributable artifacts
- **Documentation** - bootstrap-corpus.md guide + CLI reference updates
Technical details:
- Pattern aggregates stored as assertions with predicate "pattern_aggregate"
- Content-addressed subjects via BLAKE3(subject:predicate:value)
- PatternAggregator handles write path (observations → patterns)
- StemeDBPatternStore handles read path (pattern queries)
- Integration tests + fixtures in tests/wiki_import_test.rs
Deleted hardcoded.rs (368 lines) - corpus now fully emergent from StemeDB.
Deleted enriched-corpus-patterns.md (677 lines) - feature shipped.
Closes VG-026 (community corpus), part of A4 milestone.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Bootstrap Corpus from External Sources
Overview
When starting fresh with Aphoria, the community corpus is empty because there are no pattern aggregates in StemeDB. Phase 3 provides three bootstrap options to seed the corpus:
Option A: Wiki Import (Implemented)
Parse markdown documentation to extract MUST/SHOULD patterns and store them as pattern aggregates.
Usage
aphoria corpus import wiki <path/to/wiki>
Example Wiki Format
## TLS Configuration
TLS certificate verification MUST be enabled. Disabling verification
opens the application to man-in-the-middle attacks.
Authority: RFC 5246 Section 7.4.2
This extracts:
- Subject: code://*/tls
- Predicate: enabled
- Value: Boolean(true)
- Authority: RFC 5246 Section 7.4.2
Pattern Extraction
The wiki parser uses regex to match MUST/SHOULD patterns with these components:
- Subject identifier (e.g., "TLS", "JWT", "password")
- Modal verb (MUST, SHOULD, MUST NOT, SHOULD NOT)
- Action (enabled, disabled, required, verified, enforced)
The parser also looks for Authority statements in nearby lines (within 5 lines):
- RFC references: RFC 5246 Section 7.4.2
- OWASP references: OWASP Transport Layer Protection Cheat Sheet
- CWE references: CWE-256
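The extraction steps above can be sketched in a few dozen lines. This is a rough illustration only, not the code in wiki_importer.rs: it swaps the regex for plain standard-library string matching so it runs without extra crates, and the Extracted struct, the extract and nearby_authority functions, the first-word subject rule, and the "negated modal means false" rule are all assumptions made for the example.

```rust
use std::cmp::Reverse;

/// One requirement pulled out of a wiki sentence (hypothetical type).
#[derive(Debug, PartialEq)]
struct Extracted {
    subject: String,           // e.g. "code://*/tls"
    predicate: String,         // the action word, e.g. "enabled"
    value: bool,               // assumption: false when the modal is negated
    authority: Option<String>, // e.g. "RFC 5246 Section 7.4.2"
}

const MODALS: [&str; 4] = ["MUST NOT", "SHOULD NOT", "MUST", "SHOULD"];
const ACTIONS: [&str; 5] = ["enabled", "disabled", "required", "verified", "enforced"];
const AUTHORITY_WINDOW: usize = 5;

/// Look for an "Authority:" line within AUTHORITY_WINDOW lines of line `i`,
/// nearest first (below is checked before above at equal distance).
fn nearby_authority(lines: &[&str], i: usize) -> Option<String> {
    for d in 1..=AUTHORITY_WINDOW {
        let below = lines.get(i + d);
        let above = i.checked_sub(d).and_then(|j| lines.get(j));
        for cand in below.into_iter().chain(above) {
            if let Some(rest) = cand.trim().strip_prefix("Authority:") {
                return Some(rest.trim().to_string());
            }
        }
    }
    None
}

fn extract(text: &str) -> Vec<Extracted> {
    let lines: Vec<&str> = text.lines().collect();
    let mut out = Vec::new();
    for (i, &line) in lines.iter().enumerate() {
        // Earliest modal on the line; at the same position the longer form
        // wins, so "MUST NOT" is not misread as "MUST".
        let hit = MODALS
            .iter()
            .filter_map(|m| line.find(*m).map(|p| (p, *m)))
            .min_by_key(|(p, m)| (*p, Reverse(m.len())));
        let (pos, modal) = match hit {
            Some(h) => h,
            None => continue,
        };
        // Subject identifier: first word before the modal, lowercased (assumption).
        let subject = match line[..pos].split_whitespace().next() {
            Some(word) => format!("code://*/{}", word.to_lowercase()),
            None => continue,
        };
        // Action: first known action word after the modal.
        let rest = &line[pos + modal.len()..];
        let action = rest
            .split(|c: char| !c.is_alphabetic())
            .find(|w| ACTIONS.contains(w));
        if let Some(action) = action {
            out.push(Extracted {
                subject,
                predicate: action.to_string(),
                value: !modal.ends_with("NOT"),
                authority: nearby_authority(&lines, i),
            });
        }
    }
    out
}
```

Run against the example wiki text, this sketch yields a single record with subject code://*/tls, predicate enabled, value true, and authority RFC 5246 Section 7.4.2. It does not check word boundaries or classify RFC, OWASP, and CWE references separately; the real parser may do both.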
Storage
Patterns are stored as assertions in StemeDB with:
- Predicate: pattern_aggregate
- Subject: Content-addressed community://pattern/{hash} (deduplication)
- Metadata: JSON encoding of project_count, observation_count, timestamps
Bootstrap patterns have:
- project_count = 1 (initial count, grows with real scans)
- observation_count = 1
- No signatures (unsigned bootstrap data)
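As a minimal sketch of what a freshly imported bootstrap record might look like in memory, the following assumes a hypothetical PatternAggregate struct with bootstrap and metadata_json helpers. The counter and timestamp field names follow the metadata shown in this document; everything else, including the hand-rolled JSON formatting, is illustrative and may differ from the real type in pattern_store.rs.

```rust
/// In-memory view of a pattern aggregate (hypothetical layout).
#[derive(Debug, Clone, PartialEq)]
struct PatternAggregate {
    subject: String,
    predicate: String,
    project_count: u64,
    observation_count: u64,
    first_seen: u64, // Unix seconds
    last_seen: u64,  // Unix seconds
    signatures: Vec<String>,
}

impl PatternAggregate {
    /// Build a record with the bootstrap defaults: one project, one
    /// observation, and no signatures.
    fn bootstrap(subject: &str, predicate: &str, now: u64) -> Self {
        PatternAggregate {
            subject: subject.to_string(),
            predicate: predicate.to_string(),
            project_count: 1,
            observation_count: 1,
            first_seen: now,
            last_seen: now,
            signatures: Vec::new(),
        }
    }

    /// Encode the metadata as JSON by hand. Debug formatting of the strings
    /// matches JSON only for simple ASCII values; a real implementation
    /// would more likely use a JSON library.
    fn metadata_json(&self) -> String {
        format!(
            r#"{{"subject":{:?},"predicate":{:?},"project_count":{},"observation_count":{},"first_seen":{},"last_seen":{}}}"#,
            self.subject,
            self.predicate,
            self.project_count,
            self.observation_count,
            self.first_seen,
            self.last_seen
        )
    }
}
```

Calling PatternAggregate::bootstrap("code://*/tls/cert_verification", "enabled", 1706832000) produces a record whose counters are both 1 and whose first_seen and last_seen are equal, which is how a bootstrap pattern can be told apart from one that has accumulated real scans.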
Examples
Create a wiki directory with markdown files:
mkdir -p .aphoria/wiki
cat > .aphoria/wiki/tls-best-practices.md <<'EOF'
# TLS Best Practices
## Certificate Verification
TLS certificate verification MUST be enabled. Disabling verification
opens the application to man-in-the-middle attacks.
Authority: RFC 5246 Section 7.4.2
EOF
Import the wiki:
aphoria corpus import wiki .aphoria/wiki
# Output: Imported 1 patterns from wiki at .aphoria/wiki
Option B: Trust Pack (Not Yet Implemented)
Import curated assertions from a Trust Pack that includes pattern aggregates.
aphoria trust-pack install rfc-owasp-baseline
Option C: Skill-Driven Cold Start (Not Yet Implemented)
Use the aphoria-suggest skill to analyze the project and suggest 3-5 foundation claims.
The skill will:
- Detect empty corpus
- Analyze project structure (Cargo.toml, package.json, etc.)
- Suggest baseline patterns
- Wait for user approval
- Create patterns via aphoria claims create
Architecture
Data Flow
Wiki Markdown Files
|
v
WikiParser (regex extraction)
|
v
PatternAggregate (in-memory)
|
v
PatternAggregator (write path)
|
v
StemeDB (KV Store + Predicate Index)
|
v
CommunityCorpusBuilder (read path)
|
v
Conflict Detection
Storage Schema
Pattern aggregates are stored as assertions:
Assertion {
    subject: "community://pattern/{blake3_hash}",
    predicate: "pattern_aggregate",
    object: ObjectValue::Boolean(true),
    source_metadata: JSON({
        "subject": "code://*/tls/cert_verification",
        "predicate": "enabled",
        "project_count": 1,
        "observation_count": 1,
        "first_seen": 1706832000,
        "last_seen": 1706832000
    }),
    // ... other fields
}
Content-Addressed Deduplication
The subject hash is computed as:
BLAKE3(subject + ":" + predicate + ":" + value)
This ensures:
- Same pattern → same hash → same subject
- Duplicate imports are deduplicated by content
- Pattern counts can be updated by creating new assertions (append-only)
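The three properties above can be illustrated with a small, self-contained sketch. Note the substitution: to stay within the standard library it uses std's DefaultHasher (64-bit) in place of BLAKE3, so the digests are shorter than real ones and should not be compared with stored subjects. The pattern_subject function and the AppendOnlyLog type are hypothetical names for this example, not the StemeDB API.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Derive a content-addressed subject from "subject:predicate:value".
/// DefaultHasher is a stand-in here, NOT BLAKE3.
fn pattern_subject(subject: &str, predicate: &str, value: &str) -> String {
    let key = format!("{}:{}:{}", subject, predicate, value);
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    format!("community://pattern/{:016x}", hasher.finish())
}

/// Toy append-only log of (subject, observation_count) entries.
#[derive(Default)]
struct AppendOnlyLog {
    entries: Vec<(String, u64)>,
}

impl AppendOnlyLog {
    /// Updates never overwrite; they add a newer entry for the same subject.
    fn append(&mut self, subject: &str, observation_count: u64) {
        self.entries.push((subject.to_string(), observation_count));
    }

    /// The most recently appended count for a subject, if any.
    fn latest(&self, subject: &str) -> Option<u64> {
        self.entries
            .iter()
            .rev()
            .find(|e| e.0.as_str() == subject)
            .map(|e| e.1)
    }

    /// Number of distinct patterns, regardless of how many entries exist.
    fn distinct_patterns(&self) -> usize {
        self.entries
            .iter()
            .map(|e| e.0.as_str())
            .collect::<HashSet<_>>()
            .len()
    }
}
```

Importing the same pattern twice yields the same subject both times, so the log holds two entries but only one distinct pattern, and latest returns the count from the newer entry. With the blake3 crate, the hashing step would likely become blake3::hash(key.as_bytes()).to_hex(), giving a 64-character digest.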
Implementation Files
- Parser: applications/aphoria/src/corpus/wiki_importer.rs
- Write Path: applications/aphoria/src/community/pattern_store.rs (PatternAggregator)
- CLI Command: applications/aphoria/src/cli/mod.rs (CorpusCommands::Import)
- Handler: applications/aphoria/src/handlers/corpus.rs
- Public API: applications/aphoria/src/corpus_build.rs (import_corpus_from_wiki)
- Tests: applications/aphoria/tests/wiki_import_test.rs
- Fixtures: applications/aphoria/tests/fixtures/wiki/
Testing
Run the integration tests:
cargo test -p aphoria --test wiki_import_test
Tests cover:
- Basic pattern extraction from wiki files
- Storage round-trip (write → read)
- Pattern deduplication via content-addressed subjects
- Predicate indexing for efficient queries
- Multiple pattern types (TLS, JWT, password, etc.)
Future Enhancements
- Improved Regex: Support more complex pattern structures
- Multi-language: Extract patterns from non-English documentation
- Incremental Updates: Update existing patterns instead of duplicating
- Authority Validation: Verify RFC/OWASP references are valid
- Trust Pack Integration: Package bootstrap patterns as distributable packs