---
created: 2026-02-08
last_updated: 2026-02-08
status: Planning Document
feature: Phase 2-3 - LLM-Assisted Document Ingestion
timeline: 4 weeks estimated
---

# Ingest Best Practices Documentation - Executable Policy

## Problem Statement

**Current Reality:**
1. Teams write extensive architecture/security/style guides (50+ pages)
2. Developers are expected to read and remember all guidelines
3. Compliance is checked manually in code review
4. Guidelines drift out of sync with code over time
5. New team members miss context from old documents

**What Users Want:**
- Write documentation once (Markdown, PDF, Confluence)
- Have Aphoria automatically enforce the guidelines
- Get real-time feedback during development
- Maintain compliance without manual review

## Vision: Documentation That Enforces Itself

### Example: Hexagonal Architecture Guide

**Traditional Flow:**
```
1. Architect writes: "HTTP handlers MUST be in adapters/http/"
2. Developer reads guide (hopefully)
3. Developer writes code in wrong location
4. Code reviewer catches it (maybe)
5. Fix during review (wasted time)
```

**With Aphoria Ingestion:**
```
1. Architect writes: "HTTP handlers MUST be in adapters/http/"
2. Run: aphoria ingest-guide architecture.md
3. Developer writes code in wrong location
4. aphoria scan immediately shows:
   ❌ File location violation
   Expected: adapters/http/*_handler.go
   Found: adapters/handlers/user.go
   Fix: Move to adapters/http/user_handler.go
5. Developer fixes before commit (no review cycles wasted)
```

## User Experience

### 1. Ingest Phase

```bash
$ aphoria ingest-guide docs/architecture/hexagonal.md \
    --authority-tier team_policy \
    --category architecture \
    --dry-run

Analyzing: docs/architecture/hexagonal.md (15 KB, 342 lines)

📊 Extraction Summary:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Section                  Claims   Severity
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Directory Structure      8        MUST
Dependency Rules         6        MUST_NOT
Naming Conventions       5        MUST
Interface Definitions    4        SHOULD
Testing Strategy         3        SHOULD
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Total                    26 claims extracted

🔍 Preview of Extracted Claims:

1. [MUST] HTTP handlers in adapters/http/ directory
   Subject: code://go/*/adapters/http/**
   Predicate: directory_pattern
   Value: *_handler.go
   Source: hexagonal.md:45-47

2. [MUST_NOT] Core domain imports infrastructure
   Subject: code://go/*/core/domain/**
   Predicate: imports_forbidden
   Value: infrastructure/*
   Source: hexagonal.md:62-64

3. [MUST] Handler files end with _handler.go
   Subject: code://go/*/adapters/http/*.go
   Predicate: filename_pattern
   Value: *_handler.go
   Source: hexagonal.md:89-91

... (23 more)

Would add 26 claims to authoritative corpus.
Estimated scan coverage: ~65% of codebase

Proceed with ingestion? [y/N]
```

### 2. Compliance Checking

```bash
$ aphoria scan --check-policy hexagonal-arch

📋 Policy Compliance Report: Hexagonal Architecture
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

✅ Directory Structure (95% compliant)
  ✓ 45 files in correct locations
  ❌ 3 violations:
    • adapters/handlers/user.go → should be adapters/http/user_handler.go
    • adapters/db/user_repo.go → should be adapters/persistence/user_repo.go
    • domain/user_service.go → should be core/domain/user_service.go

✅ Dependency Rules (100% compliant)
  ✓ No forbidden imports detected
  ✓ Core domain is clean of infrastructure dependencies

⚠️ Naming Conventions (80% compliant)
  ✓ 35 files follow naming conventions
  ❌ 9 violations:
    • adapters/http/user.go → should be user_handler.go
    • adapters/http/order.go → should be order_handler.go
    ... (7 more)

✅ Interface Definitions (90% compliant)
  ✓ 18 interfaces properly named
  ⚠️ 2 warnings:
    • PostgresUserRepository → consider UserStore (behavior-based naming)
    • MySQLOrderRepository → consider OrderStore (behavior-based naming)

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Overall Compliance: 91% (237 of 260 checks passed)

📝 Recommendations:
1. Run: aphoria fix --policy hexagonal-arch --auto-safe
   This will automatically fix 8 file location issues
2. Manually review 2 interface naming suggestions
3. Update hexagonal.md if any rules need revision

Last policy update: hexagonal.md (modified 3 days ago)
```

### 3. Real-Time Feedback

```bash
$ git add adapters/handlers/user.go
$ git commit

⚠️ Pre-commit hook: Aphoria policy check

❌ Policy violations detected (hexagonal-arch):

adapters/handlers/user.go:
  ❌ File location violation
  Expected: adapters/http/*_handler.go
  Found: adapters/handlers/user.go
  Rule: HTTP handlers must be in adapters/http/
  Source: hexagonal.md:45 (team policy)

Suggested fix:
  git mv adapters/handlers/user.go adapters/http/user_handler.go

Commit blocked. Fix violations or use --no-verify to skip.
```

## Data Model

### Ingested Claim Structure

```rust
pub struct IngestedClaim {
    /// Unique claim ID
    pub id: String,

    /// Subject pattern (supports wildcards)
    pub subject_pattern: String,

    /// Predicate
    pub predicate: String,

    /// Expected value
    pub value: ClaimValue,

    /// Comparison mode
    pub comparison: ComparisonMode, // MUST, MUST_NOT, SHOULD, MAY

    /// Category
    pub category: String, // "architecture" | "security" | "style"

    /// Explanation (from doc)
    pub explanation: String,

    /// Authority tier
    pub authority_tier: AuthorityTier, // TeamPolicy (Tier 2.5)

    /// Source document tracking
    pub source: DocumentSource,

    /// When ingested
    pub ingested_at: u64,
}

pub struct DocumentSource {
    /// Path to source document
    pub file_path: String,

    /// Line numbers in source
    pub line_start: u32,
    pub line_end: u32,

    /// Section heading
    pub section: String, // "Directory Structure Rules"

    /// Document version (hash)
    pub document_hash: String,
}

pub enum ComparisonMode {
    Must,    // Value MUST match
    MustNot, // Value MUST NOT match
    Should,  // Warning if doesn't match
    May,     // Informational only
}
```

### Authority Tier Hierarchy

```
Tier 0: System (StemeDB internals, not user-facing)
Tier 1: Regulatory (RFCs, legal requirements)
Tier 2: Clinical (OWASP, NIST, industry standards)
Tier 2.5: TeamPolicy ← NEW (team-specific guidelines)
Tier 3: Expert (recognized authorities, vetted claims)
Tier 4: Community (project-specific observations)
```

TeamPolicy tier:
- Higher authority than community observations
- Lower authority than industry standards (OWASP)
- Can override community patterns
- Cannot override RFCs or security standards
- Scoped to project/team

## Implementation

### Phase 1: Manual Extraction (MVP - 2 days)

The user manually creates a claims TOML file from the guidelines, then imports it:

```bash
# User writes team-guidelines.toml manually after reading architecture.md
aphoria claims import team-guidelines.toml --authority-tier team_policy
```

**Pros:** Works immediately, no LLM needed
**Cons:** Manual work, doesn't scale

### Phase 2: LLM-Assisted Extraction (Week 1 - 3 days)

```bash
$ aphoria ingest-guide docs/architecture/hexagonal.md --preview

Processing: hexagonal.md
Using LLM to extract claims...
Found 26 potential claims. Review and edit before importing.

Opening editor...

# Generated claims (edit before importing)
[[claim]]
id = "hex-arch-http-handlers-001"
subject = "code://go/*/adapters/http/**"
predicate = "directory_pattern"
value = "*_handler.go"
comparison = "must"
category = "architecture"
explanation = "HTTP handlers must be in adapters/http/ directory"
source = "hexagonal.md:45-47"

# Edit to refine, then save and close to import
```

**LLM Prompt:**
```
Extract architectural claims from this document.

For each MUST/SHOULD/MUST NOT statement:
1. Identify the subject (what code element)
2. Identify the predicate (what property)
3. Identify the value (expected value)
4. Determine comparison mode (must/should/must_not)
5. Extract explanation

Format as TOML claims.

Example input:
"HTTP handlers MUST be in adapters/http/ directory and end with _handler.go"

Example output:
[[claim]]
subject = "code://go/*/adapters/http/**"
predicate = "directory_pattern"
value = "*_handler.go"
comparison = "must"
explanation = "HTTP handlers must be in adapters/http/ directory"
```

**Implementation:**
- File: `applications/aphoria/src/llm/document_ingestion.rs`
- Uses existing LLM infrastructure
- Outputs TOML for review before import
- User can edit/refine before committing

### Phase 3: Automated Extraction with Validation (Week 2 - 4 days)

Fully automated pipeline with confidence scoring:

```rust
pub struct DocumentIngester {
    llm: LlmClient,
    validator: ClaimValidator,
}

impl DocumentIngester {
    /// Ingest a document and extract claims.
    // NOTE: the `Vec` element types in this sketch are assumptions;
    // `RawClaim` is a placeholder name for the unvalidated LLM output type.
    pub async fn ingest(
        &self,
        doc_path: &Path,
        options: IngestionOptions,
    ) -> Result<Vec<IngestedClaim>, IngestionError> {
        // 1. Parse document (markdown/PDF/text)
        let sections = self.parse_document(doc_path)?;

        // 2. Extract claims using LLM
        let raw_claims = self.extract_claims_from_sections(sections).await?;

        // 3. Validate and score confidence
        let validated = self.validate_claims(raw_claims)?;

        // 4. Filter by confidence threshold
        let high_confidence: Vec<_> = validated
            .into_iter()
            .filter(|c| c.confidence >= options.min_confidence)
            .collect();

        // 5. Preview or auto-import
        if options.dry_run {
            self.preview_claims(&high_confidence)?;
            Ok(vec![])
        } else {
            self.import_claims(high_confidence).await
        }
    }

    /// Extract claims from a section using LLM.
    async fn extract_claims_from_section(
        &self,
        section: &DocumentSection,
    ) -> Result<Vec<RawClaim>, LlmError> {
        let prompt = format!(
            r#"Extract architectural claims from this section.

Section: {}
Content: {}

For each claim:
1. Identify subject pattern (supports wildcards)
2. Identify predicate
3. Identify expected value
4. Determine severity (MUST/SHOULD/MAY)
5. Extract explanation

Return as JSON array."#,
            section.heading,
            section.content,
        );

        self.llm.extract_structured(prompt).await
    }

    /// Validate extracted claims for quality.
    fn validate_claims(
        &self,
        claims: Vec<RawClaim>, // placeholder name for the unvalidated claim type
    ) -> Result<Vec<ValidatedClaim>, ValidationError> {
        let validated = claims
            .into_iter()
            .map(|claim| {
                // Check if subject pattern is valid
                let subject_valid = self.validator.validate_subject(&claim.subject);

                // Check if predicate is recognized
                let predicate_valid = self.validator.validate_predicate(&claim.predicate);

                // Compute confidence score
                let confidence =
                    self.compute_confidence(&claim, subject_valid, predicate_valid);

                ValidatedClaim {
                    claim,
                    confidence,
                    validation_issues: vec![],
                }
            })
            .collect();

        // The closure is infallible, so wrap the collected Vec explicitly.
        Ok(validated)
    }
}
```

## CLI Design

### Commands

```bash
# Ingest a document
aphoria ingest-guide <path> [options]

Options:
  --authority-tier <tier>   Authority tier (default: team_policy)
  --category <category>     Category (architecture|security|style)
  --min-confidence <n>      Min confidence to include (0.0-1.0, default: 0.7)
  --dry-run                 Preview without importing
  --edit                    Open editor to review/refine before importing
  --project <project>       Project scope (default: current project)

# List ingested guidelines
aphoria list-guides

# Check compliance against a guideline
aphoria check-compliance <guide-id>

# Update from changed document
aphoria update-guide <guide-id>

# Remove a guideline
aphoria remove-guide <guide-id>
```

### Examples

```bash
# Ingest with preview
aphoria ingest-guide docs/architecture/hexagonal.md --dry-run

# Ingest with manual review
aphoria ingest-guide docs/security/owasp-top-10.md --edit

# Check compliance
aphoria check-compliance hexagonal-arch

# Update when doc changes
aphoria update-guide hexagonal-arch --from docs/architecture/hexagonal.md

# List all active guidelines
aphoria list-guides
```

## Integration with Existing Features

### 1. Conflict Detection

Ingested claims are stored as authoritative assertions, so the existing conflict engine detects violations.

### 2. Scan Reports

Compliance is shown in standard scan reports:

```
Conflicts: 3
  ❌ hexagonal.md:45 - File in wrong directory
  ❌ hexagonal.md:62 - Forbidden import detected
  ❌ hexagonal.md:89 - Invalid filename pattern
```

### 3. Authority Lens

TeamPolicy tier (2.5) ranks between Clinical (2) and Expert (3):
- Overrides community observations
- Can be overridden by team-authored claims (explicit)
- Respects RFCs and security standards

### 4. Pre-commit Hooks

Compliance checking in pre-commit:

```bash
#!/bin/bash
# .git/hooks/pre-commit
aphoria scan --check-policy hexagonal-arch --exit-code
```

## Storage

### Claims Storage

Ingested claims are stored as regular `AuthoredClaim` instances:
- File: `.aphoria/claims.toml`
- Tagged with `ingested_from: "hexagonal.md"`
- Authority tier: `team_policy`

### Document Metadata

Track source documents:

```toml
# .aphoria/ingested_guides.toml
[[guide]]
id = "hexagonal-arch"
name = "Hexagonal Architecture Guidelines"
source_path = "docs/architecture/hexagonal.md"
document_hash = "blake3:abc123..."
ingested_at = 1234567890
claims_count = 26
authority_tier = "team_policy"
category = "architecture"

[[guide]]
id = "security-owasp"
name = "OWASP Top 10 Compliance"
source_path = "docs/security/owasp.md"
document_hash = "blake3:def456..."
ingested_at = 1234567900
claims_count = 15
authority_tier = "team_policy"
category = "security"
```

### Update Detection

```bash
$ aphoria scan

⚠️ Warning: Source document has changed
  Guide: hexagonal-arch
  Source: docs/architecture/hexagonal.md
  Last ingested: 3 days ago (hash: abc123...)
  Current hash: xyz789...

Run: aphoria update-guide hexagonal-arch
```

## Success Metrics

### Phase 1 (Manual Import)
- Users can manually create claims from guidelines
- Claims are enforced during scans
- Pre-commit hooks work

### Phase 2 (LLM-Assisted)
- LLM extracts 80%+ of claims correctly
- Users can review/edit before importing
- Saves >90% of manual effort

### Phase 3 (Automated)
- Confidence scoring filters noise
- Automatic updates when docs change
- Compliance reports show trends

## Open Questions

1. **How to handle ambiguous statements?**
   - "Handlers should generally be in adapters/http/"
   - Extract as SHOULD with low confidence?

2. **How to handle conflicting guidelines?**
   - Doc A says X, Doc B says Y
   - Use authority tier + recency?

3. **Should we support non-Markdown formats?**
   - PDF extraction (common for external standards)
   - Confluence/Google Docs (via export)
   - Word documents

4. **How to version guidelines?**
   - When doc changes, update claims or create new versions?
   - Show history of guideline changes?

5. **Should compliance be project-scoped or team-scoped?**
   - Team-level guidelines apply to all team projects?
   - Project-specific guidelines override team guidelines?

## Future Enhancements

### 1. Guided Onboarding

```bash
$ aphoria init --with-guides

No guidelines found. Would you like to:
  1. Import existing documentation
  2. Start from template (hexagonal/clean/ddd)
  3. Skip for now

Choice: 1

Enter path to architecture guide: docs/architecture.md
Enter path to security guide: docs/security.md
Enter path to style guide: (skip)

Extracting claims...
Found 42 claims across 2 documents.

Review before importing? [Y/n]
```

### 2. Compliance Dashboard

Visual compliance tracking over time:
- Trend graphs (improving/declining)
- Per-guideline compliance scores
- Team comparison (if multiple teams)

### 3. AI-Generated Fix Suggestions

```bash
❌ File location violation
  Expected: adapters/http/*_handler.go
  Found: adapters/handlers/user.go

Suggested fix:
  git mv adapters/handlers/user.go adapters/http/user_handler.go

Apply fix? [y/N]
```

### 4. Guideline Templates

Pre-built guidelines for common architectures:
- Hexagonal Architecture
- Clean Architecture
- Domain-Driven Design
- Microservices patterns
- Security baselines (OWASP, NIST)

```bash
$ aphoria init-guide --template hexagonal

Imported 35 hexagonal architecture claims.
Customize at: .aphoria/claims.toml
```

## Timeline

- **Week 1:** Manual import (MVP) + LLM extraction prototype
- **Week 2:** Automated pipeline + confidence scoring
- **Week 3:** CLI polish + documentation + examples
- **Week 4:** Dashboard integration + compliance reports

**Total:** 4 weeks for complete feature
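
As a closing illustration of the Authority Tier Hierarchy section, the ordering rules for the TeamPolicy tier could be encoded roughly as below. This is a minimal sketch, not the actual Aphoria implementation: the variant names mirror the tier list, while the integer ranks (tier number times 10, so that Tier 2.5 becomes 25) and the `rank` and `overrides` helpers are assumptions introduced here for illustration.

```rust
/// Authority tiers from the Authority Tier Hierarchy section.
/// Variant names follow the tier list; everything else is hypothetical.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AuthorityTier {
    System,
    Regulatory,
    Clinical,
    TeamPolicy,
    Expert,
    Community,
}

impl AuthorityTier {
    /// Hypothetical integer rank: the tier number multiplied by 10,
    /// so that "Tier 2.5" can be stored without floating point.
    /// A lower rank means higher authority.
    pub fn rank(self) -> u8 {
        match self {
            AuthorityTier::System => 0,
            AuthorityTier::Regulatory => 10,
            AuthorityTier::Clinical => 20,
            AuthorityTier::TeamPolicy => 25,
            AuthorityTier::Expert => 30,
            AuthorityTier::Community => 40,
        }
    }

    /// Returns true when a claim at `self` strictly outranks a claim at `other`.
    /// Claims in the same tier do not override each other in this sketch.
    pub fn overrides(self, other: AuthorityTier) -> bool {
        self.rank() < other.rank()
    }
}
```

Under these assumptions, `AuthorityTier::TeamPolicy.overrides(AuthorityTier::Community)` is true, while `TeamPolicy` cannot override `Clinical` or `Regulatory`, matching the rules that team policy outranks community observations but not RFCs or security standards. Resolving conflicts within a single tier (for example by recency) is left open, as noted under Open Questions.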