stemedb/applications/aphoria/docs/planning/ingest-best-practices-docs.md

---
created: 2026-02-08
last_updated: 2026-02-08
status: Planning Document
feature: Phase 2-3 - LLM-Assisted Document Ingestion
timeline: 4 weeks estimated
---

# Ingest Best Practices Documentation - Executable Policy

## Problem Statement

**Current Reality:**
1. Teams write extensive architecture/security/style guides (50+ pages)
2. Developers are expected to read and remember all guidelines
3. Compliance is checked manually in code review
4. Guidelines drift out of sync with code over time
5. New team members miss context from old documents

**What Users Want:**
- Write documentation once (Markdown, PDF, Confluence)
- Have Aphoria automatically enforce the guidelines
- Get real-time feedback during development
- Maintain compliance without manual review

## Vision: Documentation That Enforces Itself

### Example: Hexagonal Architecture Guide

**Traditional Flow:**
```
1. Architect writes: "HTTP handlers MUST be in adapters/http/"
2. Developer reads guide (hopefully)
3. Developer writes code in wrong location
4. Code reviewer catches it (maybe)
5. Fix during review (wasted time)
```

**With Aphoria Ingestion:**
```
1. Architect writes: "HTTP handlers MUST be in adapters/http/"
2. Run: aphoria ingest-guide architecture.md
3. Developer writes code in wrong location
4. aphoria scan immediately shows:
   ❌ File location violation
   Expected: adapters/http/*_handler.go
   Found: adapters/handlers/user.go
   Fix: Move to adapters/http/user_handler.go
5. Developer fixes before commit (no review cycles wasted)
```

## User Experience

### 1. Ingest Phase

```bash
$ aphoria ingest-guide docs/architecture/hexagonal.md \
  --authority-tier team_policy \
  --category architecture \
  --dry-run

Analyzing: docs/architecture/hexagonal.md (15 KB, 342 lines)

📊 Extraction Summary:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Section                   Claims    Severity
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Directory Structure       8         MUST
Dependency Rules          6         MUST_NOT
Naming Conventions        5         MUST
Interface Definitions     4         SHOULD
Testing Strategy          3         SHOULD
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Total                     26 claims extracted

🔍 Preview of Extracted Claims:

1. [MUST] HTTP handlers in adapters/http/ directory
   Subject: code://go/*/adapters/http/**
   Predicate: directory_pattern
   Value: *_handler.go
   Source: hexagonal.md:45-47

2. [MUST_NOT] Core domain imports infrastructure
   Subject: code://go/*/core/domain/**
   Predicate: imports_forbidden
   Value: infrastructure/*
   Source: hexagonal.md:62-64

3. [MUST] Handler files end with _handler.go
   Subject: code://go/*/adapters/http/*.go
   Predicate: filename_pattern
   Value: *_handler.go
   Source: hexagonal.md:89-91

... (23 more)

Would add 26 claims to authoritative corpus.
Estimated scan coverage: ~65% of codebase

Proceed with ingestion? [y/N]
```

### 2. Compliance Checking

```bash
$ aphoria scan --check-policy hexagonal-arch

📋 Policy Compliance Report: Hexagonal Architecture
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

✅ Directory Structure (94% compliant)
   ✓ 45 files in correct locations
   ❌ 3 violations:
      • adapters/handlers/user.go → should be adapters/http/user_handler.go
      • adapters/db/user_repo.go → should be adapters/persistence/user_repo.go
      • domain/user_service.go → should be core/domain/user_service.go

✅ Dependency Rules (100% compliant)
   ✓ No forbidden imports detected
   ✓ Core domain is clean of infrastructure dependencies

⚠️  Naming Conventions (80% compliant)
   ✓ 35 files follow naming conventions
   ❌ 9 violations:
      • adapters/http/user.go → should be user_handler.go
      • adapters/http/order.go → should be order_handler.go
      ... (7 more)

✅ Interface Definitions (90% compliant)
   ✓ 18 interfaces properly named
   ⚠️  2 warnings:
      • PostgresUserRepository → consider UserStore (behavior-based naming)
      • MySQLOrderRepository → consider OrderStore (behavior-based naming)

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Overall Compliance: 91% (237 of 260 checks passed)

📝 Recommendations:
  1. Run: aphoria fix --policy hexagonal-arch --auto-safe
     This will automatically fix the 3 file location issues

  2. Manually review 2 interface naming suggestions

  3. Update hexagonal.md if any rules need revision

Last policy update: hexagonal.md (modified 3 days ago)
```

### 3. Real-Time Feedback

```bash
$ git add adapters/handlers/user.go
$ git commit

⚠️  Pre-commit hook: Aphoria policy check

❌ Policy violations detected (hexagonal-arch):

  adapters/handlers/user.go:
    ❌ File location violation
       Expected: adapters/http/*_handler.go
       Found: adapters/handlers/user.go
       Rule: HTTP handlers must be in adapters/http/
       Source: hexagonal.md:45 (team policy)

       Suggested fix:
         git mv adapters/handlers/user.go adapters/http/user_handler.go

Commit blocked. Fix violations or use --no-verify to skip.
```

## Data Model

### Ingested Claim Structure

```rust
pub struct IngestedClaim {
    /// Unique claim ID
    pub id: String,

    /// Subject pattern (supports wildcards)
    pub subject_pattern: String,

    /// Predicate
    pub predicate: String,

    /// Expected value
    pub value: ClaimValue,

    /// Comparison mode
    pub comparison: ComparisonMode,  // MUST, MUST_NOT, SHOULD, MAY

    /// Category
    pub category: String,  // "architecture" | "security" | "style"

    /// Explanation (from doc)
    pub explanation: String,

    /// Authority tier
    pub authority_tier: AuthorityTier,  // TeamPolicy (Tier 2.5)

    /// Source document tracking
    pub source: DocumentSource,

    /// When ingested
    pub ingested_at: u64,
}

pub struct DocumentSource {
    /// Path to source document
    pub file_path: String,

    /// Line numbers in source
    pub line_start: u32,
    pub line_end: u32,

    /// Section heading
    pub section: String,  // "Directory Structure Rules"

    /// Document version (hash)
    pub document_hash: String,
}

pub enum ComparisonMode {
    Must,       // Value MUST match
    MustNot,    // Value MUST NOT match
    Should,     // Warning if doesn't match
    May,        // Informational only
}
```
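
The `subject_pattern` field is described as supporting wildcards. The following is a minimal sketch of how such matching might work, assuming `*` matches within a single path segment and `**` matches zero or more whole segments. The function names (`subject_matches`, `path_matches`, `segment_matches`) and the concrete subject strings are hypothetical, not part of the existing Aphoria API.

```rust
/// Matches one path segment against a pattern segment.
/// Assumption: `*` matches any run of characters inside a single segment.
fn segment_matches(pattern: &str, text: &str) -> bool {
    let p: Vec<char> = pattern.chars().collect();
    let t: Vec<char> = text.chars().collect();
    let (mut pi, mut ti) = (0, 0);
    // Position of the most recent `*` and the text index it was tried at.
    let mut star: Option<(usize, usize)> = None;
    while ti < t.len() {
        if pi < p.len() && p[pi] == '*' {
            star = Some((pi, ti));
            pi += 1;
        } else if pi < p.len() && p[pi] == t[ti] {
            pi += 1;
            ti += 1;
        } else if let Some((sp, st)) = star {
            // Backtrack: let the last `*` absorb one more character.
            pi = sp + 1;
            ti = st + 1;
            star = Some((sp, st + 1));
        } else {
            return false;
        }
    }
    // Any pattern characters left over must all be `*`.
    p[pi..].iter().all(|&c| c == '*')
}

/// Matches segment lists. Assumption: `**` absorbs zero or more segments.
fn path_matches(pattern: &[&str], path: &[&str]) -> bool {
    match pattern.split_first() {
        None => path.is_empty(),
        Some((seg, rest)) if *seg == "**" => {
            (0..=path.len()).any(|skip| path_matches(rest, &path[skip..]))
        }
        Some((seg, rest)) => match path.split_first() {
            Some((head, tail)) => segment_matches(seg, head) && path_matches(rest, tail),
            None => false,
        },
    }
}

/// Returns true when `subject` is covered by `pattern`.
pub fn subject_matches(pattern: &str, subject: &str) -> bool {
    let p: Vec<&str> = pattern.split('/').collect();
    let s: Vec<&str> = subject.split('/').collect();
    path_matches(&p, &s)
}
```

Under these assumptions, `code://go/*/adapters/http/**` would cover a subject such as `code://go/myapp/adapters/http/user_handler.go` but not `code://go/myapp/adapters/handlers/user.go`.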

### Authority Tier Hierarchy

```
Tier 0: System      (StemeDB internals, not user-facing)
Tier 1: Regulatory  (RFCs, legal requirements)
Tier 2: Clinical    (OWASP, NIST, industry standards)
Tier 2.5: TeamPolicy ← NEW (team-specific guidelines)
Tier 3: Expert      (recognized authorities, vetted claims)
Tier 4: Community   (project-specific observations)
```

TeamPolicy tier:
- Higher authority than community observations
- Lower authority than industry standards (OWASP)
- Can override community patterns
- Cannot override RFCs or security standards
- Scoped to project/team
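
The override rules above can be sketched as a simple rank comparison. This is an illustrative stand-alone enum, not the existing StemeDB `AuthorityTier` type; the `rank` and `can_override` methods are hypothetical, and ranks are scaled by 10 so that Tier 2.5 stays an integer.

```rust
/// Stand-alone illustration of the tier hierarchy (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AuthorityTier {
    System,
    Regulatory,
    Clinical,
    TeamPolicy,
    Expert,
    Community,
}

impl AuthorityTier {
    /// Tier number multiplied by 10; a lower rank means higher authority.
    pub fn rank(self) -> u32 {
        match self {
            AuthorityTier::System => 0,
            AuthorityTier::Regulatory => 10,
            AuthorityTier::Clinical => 20,
            AuthorityTier::TeamPolicy => 25,
            AuthorityTier::Expert => 30,
            AuthorityTier::Community => 40,
        }
    }

    /// A tier can override another only if it is strictly more authoritative.
    pub fn can_override(self, other: AuthorityTier) -> bool {
        self.rank() < other.rank()
    }
}
```

With this ordering, `TeamPolicy` overrides `Community` (and `Expert`), but not `Clinical` or `Regulatory`.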

## Implementation

### Phase 1: Manual Extraction (MVP - 2 days)

The user manually writes a claims TOML file from the guidelines, then imports it:

```bash
# User writes team-guidelines.toml manually from reading architecture.md
aphoria claims import team-guidelines.toml --authority-tier team_policy
```

- **Pros:** Works immediately, no LLM needed
- **Cons:** Manual work, doesn't scale

### Phase 2: LLM-Assisted Extraction (Week 1 - 3 days)

```bash
$ aphoria ingest-guide docs/architecture/hexagonal.md --edit

Processing: hexagonal.md
Using LLM to extract claims...

Found 26 potential claims. Review and edit before importing.
Opening editor...

# Generated claims (edit before importing)
[[claim]]
id = "hex-arch-http-handlers-001"
subject = "code://go/*/adapters/http/**"
predicate = "directory_pattern"
value = "*_handler.go"
comparison = "must"
category = "architecture"
explanation = "HTTP handlers must be in adapters/http/ directory"
source = "hexagonal.md:45-47"

# Edit to refine, then save and close to import
```

**LLM Prompt:**
```
Extract architectural claims from this document.

For each MUST/SHOULD/MUST NOT statement:
1. Identify the subject (what code element)
2. Identify the predicate (what property)
3. Identify the value (expected value)
4. Determine comparison mode (must/must_not/should/may)
5. Extract explanation

Format as TOML claims.

Example input:
"HTTP handlers MUST be in adapters/http/ directory and end with _handler.go"

Example output:
[[claim]]
subject = "code://go/*/adapters/http/**"
predicate = "directory_pattern"
value = "*_handler.go"
comparison = "must"
explanation = "HTTP handlers must be in adapters/http/ directory"
```

**Implementation:**
- File: `applications/aphoria/src/llm/document_ingestion.rs`
- Uses existing LLM infrastructure
- Outputs TOML for review before import
- User can edit/refine before committing

### Phase 3: Automated Extraction with Validation (Week 2 - 4 days)

Fully automated pipeline with confidence scoring:

```rust
use std::path::Path;

pub struct DocumentIngester {
    llm: LlmClient,
    validator: ClaimValidator,
}

impl DocumentIngester {
    /// Ingest a document and extract claims.
    pub async fn ingest(
        &self,
        doc_path: &Path,
        options: IngestionOptions,
    ) -> Result<Vec<IngestedClaim>, IngestionError> {
        // 1. Parse document (markdown/PDF/text)
        let sections = self.parse_document(doc_path)?;

        // 2. Extract claims from each section using the LLM
        let mut raw_claims = Vec::new();
        for section in &sections {
            raw_claims.extend(self.extract_claims_from_section(section).await?);
        }

        // 3. Validate and score confidence
        let validated = self.validate_claims(raw_claims)?;

        // 4. Filter by confidence threshold
        let high_confidence: Vec<_> = validated
            .into_iter()
            .filter(|c| c.confidence >= options.min_confidence)
            .collect();

        // 5. Preview or auto-import
        if options.dry_run {
            self.preview_claims(&high_confidence)?;
            Ok(vec![])
        } else {
            self.import_claims(high_confidence).await
        }
    }

    /// Extract claims from a section using LLM.
    async fn extract_claims_from_section(
        &self,
        section: &DocumentSection,
    ) -> Result<Vec<ExtractedClaim>, LlmError> {
        let prompt = format!(
            r#"Extract architectural claims from this section.

Section: {}

Content:
{}

For each claim:
1. Identify subject pattern (supports wildcards)
2. Identify predicate
3. Identify expected value
4. Determine severity (MUST/MUST_NOT/SHOULD/MAY)
5. Extract explanation

Return as JSON array."#,
            section.heading,
            section.content,
        );

        self.llm.extract_structured(prompt).await
    }

    /// Validate extracted claims for quality.
    fn validate_claims(
        &self,
        claims: Vec<ExtractedClaim>,
    ) -> Result<Vec<ValidatedClaim>, ValidationError> {
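        let validated: Vec<ValidatedClaim> = claims
            .into_iter()
            .map(|claim| {
                // Check if subject pattern is valid
                let subject_valid = self.validator.validate_subject(&claim.subject);

                // Check if predicate is recognized
                let predicate_valid = self.validator.validate_predicate(&claim.predicate);

                // Compute confidence score
                let confidence = self.compute_confidence(&claim, subject_valid, predicate_valid);

                // Record failed checks so reviewers can see why confidence is low
                // (assumes the validators return bool and validation_issues is Vec<String>)
                let mut validation_issues = Vec::new();
                if !subject_valid {
                    validation_issues.push("invalid subject pattern".to_string());
                }
                if !predicate_valid {
                    validation_issues.push("unrecognized predicate".to_string());
                }

                ValidatedClaim {
                    claim,
                    confidence,
                    validation_issues,
                }
            })
            .collect();

        // The mapping itself cannot fail, so wrap the collected Vec in Ok
        Ok(validated)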
    }
}
```
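
The pipeline above calls `compute_confidence` without defining it. One possible heuristic is sketched below, assuming the score is a weighted sum of validation signals. The `ConfidenceInputs` struct, the `has_explicit_keyword` signal, the `passes_threshold` helper, and the weights (0.4 / 0.4 / 0.2) are all hypothetical choices for illustration; real weights would need tuning against reviewed extractions.

```rust
/// Signals assumed to be available for one extracted claim (hypothetical).
pub struct ConfidenceInputs {
    /// Subject pattern parsed successfully.
    pub subject_valid: bool,
    /// Predicate is one the scanner recognizes.
    pub predicate_valid: bool,
    /// Source sentence contains an explicit MUST / MUST NOT / SHOULD / MAY.
    pub has_explicit_keyword: bool,
}

/// Weighted sum in the range 0.0..=1.0. The weights are illustrative only.
pub fn compute_confidence(inputs: &ConfidenceInputs) -> f64 {
    let mut score = 0.0;
    if inputs.subject_valid {
        score += 0.4;
    }
    if inputs.predicate_valid {
        score += 0.4;
    }
    if inputs.has_explicit_keyword {
        score += 0.2;
    }
    score
}

/// Mirrors the `--min-confidence` filter step of the pipeline.
pub fn passes_threshold(inputs: &ConfidenceInputs, min_confidence: f64) -> bool {
    compute_confidence(inputs) >= min_confidence
}
```

With these weights and a 0.7 threshold, a claim with a valid subject and predicate is kept even when the source wording is soft, while a claim with an unrecognized predicate is filtered out.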

## CLI Design

### Commands

```bash
# Ingest a document
aphoria ingest-guide <path> [options]

Options:
  --authority-tier <tier>    Authority tier (default: team_policy)
  --category <category>      Category (architecture|security|style)
  --min-confidence <float>   Min confidence to include (0.0-1.0, default: 0.7)
  --dry-run                  Preview without importing
  --edit                     Open editor to review/refine before importing
  --project <name>           Project scope (default: current project)

# List ingested guidelines
aphoria list-guides

# Check compliance against a guideline
aphoria check-compliance <guide-name>

# Update from changed document
aphoria update-guide <guide-name>

# Remove a guideline
aphoria remove-guide <guide-name>
```

### Examples

```bash
# Ingest with preview
aphoria ingest-guide docs/architecture/hexagonal.md --dry-run

# Ingest with manual review
aphoria ingest-guide docs/security/owasp-top-10.md --edit

# Check compliance
aphoria check-compliance hexagonal-arch

# Update when doc changes
aphoria update-guide hexagonal-arch --from docs/architecture/hexagonal.md

# List all active guidelines
aphoria list-guides
```

## Integration with Existing Features

### 1. Conflict Detection
Ingested claims are stored as authoritative assertions, so the existing conflict engine can detect violations against them.

### 2. Scan Reports
Compliance is shown in standard scan reports:
```
Conflicts: 3
  ❌ hexagonal.md:45 - File in wrong directory
  ❌ hexagonal.md:62 - Forbidden import detected
  ❌ hexagonal.md:89 - Invalid filename pattern
```

### 3. Authority Lens
TeamPolicy tier (2.5) ranks between Clinical (2) and Expert (3):
- Overrides community observations
- Can be overridden by explicit team-authored claims
- Cannot override RFCs or security standards

### 4. Pre-commit Hooks
Compliance checking runs in a pre-commit hook:
```bash
#!/bin/bash
# .git/hooks/pre-commit

aphoria scan --check-policy hexagonal-arch --exit-code
```
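
For the hook to block a commit, `--exit-code` has to translate check results into a process status. The sketch below shows one way this might work, assuming only `Must` and `MustNot` failures are blocking while `Should` produces a warning and `May` is informational. The `Outcome` enum and the `evaluate` and `exit_code` functions are hypothetical names introduced for illustration.

```rust
/// Mirrors the comparison modes from the data model.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ComparisonMode {
    Must,
    MustNot,
    Should,
    May,
}

/// Result of checking one claim against observed code (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Outcome {
    Pass,
    Violation,
    Warning,
    Info,
}

/// `matched` is true when the observed value matches the claim's value.
pub fn evaluate(mode: ComparisonMode, matched: bool) -> Outcome {
    match (mode, matched) {
        (ComparisonMode::Must, true) => Outcome::Pass,
        (ComparisonMode::Must, false) => Outcome::Violation,
        (ComparisonMode::MustNot, true) => Outcome::Violation,
        (ComparisonMode::MustNot, false) => Outcome::Pass,
        (ComparisonMode::Should, true) => Outcome::Pass,
        (ComparisonMode::Should, false) => Outcome::Warning,
        // MAY claims never block and never warn.
        (ComparisonMode::May, _) => Outcome::Info,
    }
}

/// Non-zero only when at least one blocking violation exists.
pub fn exit_code(outcomes: &[Outcome]) -> i32 {
    if outcomes.iter().any(|o| *o == Outcome::Violation) {
        1
    } else {
        0
    }
}
```

Keeping `Should` mismatches out of the exit status means style suggestions are shown without blocking commits; a stricter mode could be added later if teams want warnings to fail the hook.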

## Storage

### Claims Storage
Ingested claims are stored as regular `AuthoredClaim` instances:
- File: `.aphoria/claims.toml`
- Tagged with `ingested_from = "hexagonal.md"`
- Authority tier: `team_policy`

### Document Metadata
Track source documents:
```toml
# .aphoria/ingested_guides.toml

[[guide]]
id = "hexagonal-arch"
name = "Hexagonal Architecture Guidelines"
source_path = "docs/architecture/hexagonal.md"
document_hash = "blake3:abc123..."
ingested_at = 1234567890
claims_count = 26
authority_tier = "team_policy"
category = "architecture"

[[guide]]
id = "security-owasp"
name = "OWASP Top 10 Compliance"
source_path = "docs/security/owasp.md"
document_hash = "blake3:def456..."
ingested_at = 1234567900
claims_count = 15
authority_tier = "team_policy"
category = "security"
```

### Update Detection
```bash
$ aphoria scan

⚠️  Warning: Source document has changed
    Guide: hexagonal-arch
    Source: docs/architecture/hexagonal.md
    Last ingested: 3 days ago (hash: abc123...)
    Current hash: xyz789...

    Run: aphoria update-guide hexagonal-arch
```
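
The staleness warning above comes down to comparing the stored `document_hash` with a hash of the current file contents. The following is a minimal sketch of that comparison. To stay free of external crates it uses FNV-1a as a stand-in for BLAKE3, so the `fnv1a64:` prefix, the `GuideRecord` struct, and the `needs_update` function are assumptions for illustration rather than the planned implementation.

```rust
/// FNV-1a 64-bit hash. Stand-in for BLAKE3 so the sketch has no dependencies.
fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325; // FNV offset basis
    for &b in bytes {
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(0x100000001b3); // FNV prime
    }
    hash
}

/// Formats a content hash in the same "algorithm:hex" style as the metadata file.
pub fn document_hash(content: &[u8]) -> String {
    format!("fnv1a64:{:016x}", fnv1a64(content))
}

/// Subset of one `[[guide]]` entry (hypothetical struct).
pub struct GuideRecord {
    pub id: String,
    pub source_path: String,
    pub document_hash: String,
}

/// True when the source document no longer matches the ingested version.
pub fn needs_update(guide: &GuideRecord, current_content: &[u8]) -> bool {
    guide.document_hash != document_hash(current_content)
}
```

A scan could call `needs_update` for each guide after reading `source_path` and print the warning whenever it returns true.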

## Success Metrics

### Phase 1 (Manual Import)
- Users can manually create claims from guidelines
- Claims are enforced during scans
- Pre-commit hooks work

### Phase 2 (LLM-Assisted)
- LLM extracts 80%+ of claims correctly
- Users can review/edit before importing
- Reduces manual claim-writing effort by more than 90% relative to Phase 1

### Phase 3 (Automated)
- Confidence scoring filters noise
- Automatic updates when docs change
- Compliance reports show trends

## Open Questions

1. **How to handle ambiguous statements?**
   - "Handlers should generally be in adapters/http/"
   - Extract as SHOULD with low confidence?

2. **How to handle conflicting guidelines?**
   - Doc A says X, Doc B says Y
   - Use authority tier + recency?

3. **Should we support non-Markdown formats?**
   - PDF extraction (common for external standards)
   - Confluence/Google Docs (via export)
   - Word documents

4. **How to version guidelines?**
   - When doc changes, update claims or create new versions?
   - Show history of guideline changes?

5. **Should compliance be project-scoped or team-scoped?**
   - Team-level guidelines apply to all team projects?
   - Project-specific guidelines override team guidelines?
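
For question 2, the "authority tier + recency" option could look like the sketch below. This is only one candidate answer, not a decided design: the `Candidate` struct and `resolve` function are hypothetical, and `tier_rank` is assumed to be the tier number multiplied by 10 (so TeamPolicy is 25) with lower values being more authoritative.

```rust
use std::cmp::Ordering;

/// One of two conflicting claims (hypothetical struct).
#[derive(Debug, Clone, PartialEq)]
pub struct Candidate {
    pub source: String,
    /// Tier number x 10; lower means more authoritative.
    pub tier_rank: u32,
    /// Unix timestamp of ingestion.
    pub ingested_at: u64,
}

/// Picks the winning claim: higher authority first, then the newer one.
/// On a complete tie the first argument is kept.
pub fn resolve<'a>(a: &'a Candidate, b: &'a Candidate) -> &'a Candidate {
    match a.tier_rank.cmp(&b.tier_rank) {
        Ordering::Less => a,
        Ordering::Greater => b,
        Ordering::Equal => {
            if b.ingested_at > a.ingested_at {
                b
            } else {
                a
            }
        }
    }
}
```

Recency only breaks ties within a tier here, so a freshly ingested team guideline would still lose to an older industry standard.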

## Future Enhancements

### 1. Guided Onboarding
```bash
$ aphoria init --with-guides

No guidelines found. Would you like to:
1. Import existing documentation
2. Start from template (hexagonal/clean/ddd)
3. Skip for now

Choice: 1

Enter path to architecture guide: docs/architecture.md
Enter path to security guide: docs/security.md
Enter path to style guide: (skip)

Extracting claims...
Found 42 claims across 2 documents.
Review before importing? [Y/n]
```

### 2. Compliance Dashboard
Visual compliance tracking over time:
- Trend graphs (improving/declining)
- Per-guideline compliance scores
- Team comparison (if multiple teams)

### 3. AI-Generated Fix Suggestions
```bash
❌ File location violation
   Expected: adapters/http/*_handler.go
   Found: adapters/handlers/user.go

   Suggested fix:
     git mv adapters/handlers/user.go adapters/http/user_handler.go

   Apply fix? [y/N]
```

### 4. Guideline Templates
Pre-built guidelines for common architectures:
- Hexagonal Architecture
- Clean Architecture
- Domain-Driven Design
- Microservices patterns
- Security baselines (OWASP, NIST)

```bash
$ aphoria init-guide --template hexagonal

Imported 35 hexagonal architecture claims.
Customize at: .aphoria/claims.toml
```

## Timeline

- **Week 1:** Manual import (MVP) + LLM extraction prototype
- **Week 2:** Automated pipeline + confidence scoring
- **Week 3:** CLI polish + documentation + examples
- **Week 4:** Dashboard integration + compliance reports

**Total:** 4 weeks for complete feature