stemedb/applications/aphoria/docs/archive/ingest-best-practices-docs-2026-02-08.md
created: 2026-02-08
last_updated: 2026-02-08
status: Planning Document
feature: Phase 2-3 - LLM-Assisted Document Ingestion
timeline: 4 weeks estimated

Ingest Best Practices Documentation - Executable Policy

Problem Statement

Current Reality:

  1. Teams write extensive architecture/security/style guides (50+ pages)
  2. Developers are expected to read and remember all guidelines
  3. Compliance is checked manually in code review
  4. Guidelines drift out of sync with code over time
  5. New team members miss context from old documents

What Users Want:

  • Write documentation once (Markdown, PDF, Confluence)
  • Have Aphoria automatically enforce the guidelines
  • Get real-time feedback during development
  • Maintain compliance without manual review

Vision: Documentation That Enforces Itself

Example: Hexagonal Architecture Guide

Traditional Flow:

1. Architect writes: "HTTP handlers MUST be in adapters/http/"
2. Developer reads guide (hopefully)
3. Developer writes code in wrong location
4. Code reviewer catches it (maybe)
5. Fix during review (wasted time)

With Aphoria Ingestion:

1. Architect writes: "HTTP handlers MUST be in adapters/http/"
2. Run: aphoria ingest-guide architecture.md
3. Developer writes code in wrong location
4. aphoria scan immediately shows:
   ❌ File location violation
   Expected: adapters/http/*_handler.go
   Found: adapters/handlers/user.go
   Fix: Move to adapters/http/user_handler.go
5. Developer fixes before commit (no review cycles wasted)
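The location check and suggested fix in step 4 can be sketched in a few lines of Rust. This is only an illustration under stated assumptions: `suggest_handler_path` is a hypothetical helper, not an existing Aphoria function, and it assumes the caller already knows the file is an HTTP handler.

```rust
/// Hypothetical check for the rule "HTTP handlers MUST be in adapters/http/"
/// combined with the filename pattern `*_handler.go`.
/// Returns `None` when the path already complies, otherwise the suggested path.
pub fn suggest_handler_path(path: &str) -> Option<String> {
    const DIR: &str = "adapters/http/";
    const SUFFIX: &str = "_handler.go";

    // Last path component, e.g. "user.go" from "adapters/handlers/user.go".
    let file_name = path.rsplit('/').next().unwrap_or(path);

    // The file must sit directly inside adapters/http/ (no nested directory).
    let in_dir = path
        .strip_prefix(DIR)
        .map_or(false, |rest| !rest.contains('/'));

    if in_dir && file_name.ends_with(SUFFIX) && file_name.len() > SUFFIX.len() {
        return None; // already compliant
    }

    // Derive the stem: "user.go" -> "user", "user_handler.go" -> "user".
    let stem = file_name.strip_suffix(".go").unwrap_or(file_name);
    let stem = stem.strip_suffix("_handler").unwrap_or(stem);
    Some(format!("{}{}{}", DIR, stem, SUFFIX))
}
```

For `adapters/handlers/user.go` the sketch proposes `adapters/http/user_handler.go`, which is the same target shown in the scan output above.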

User Experience

1. Ingest Phase

$ aphoria ingest-guide docs/architecture/hexagonal.md \
  --authority-tier team_policy \
  --category architecture \
  --dry-run

Analyzing: docs/architecture/hexagonal.md (15 KB, 342 lines)

📊 Extraction Summary:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Section                   Claims    Severity
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Directory Structure       8         MUST
Dependency Rules          6         MUST_NOT
Naming Conventions        5         MUST
Interface Definitions     4         SHOULD
Testing Strategy          3         SHOULD
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Total                     26 claims extracted

🔍 Preview of Extracted Claims:

1. [MUST] HTTP handlers in adapters/http/ directory
   Subject: code://go/*/adapters/http/**
   Predicate: directory_pattern
   Value: *_handler.go
   Source: hexagonal.md:45-47

2. [MUST_NOT] Core domain imports infrastructure
   Subject: code://go/*/core/domain/**
   Predicate: imports_forbidden
   Value: infrastructure/*
   Source: hexagonal.md:62-64

3. [MUST] Handler files end with _handler.go
   Subject: code://go/*/adapters/http/*.go
   Predicate: filename_pattern
   Value: *_handler.go
   Source: hexagonal.md:89-91

... (23 more)

Would add 26 claims to authoritative corpus.
Estimated scan coverage: ~65% of codebase

Proceed with ingestion? [y/N]
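The subject patterns in the preview rely on wildcards. One possible way to match them is sketched below. The semantics are assumptions, not confirmed Aphoria behavior: `*` matches within a single path segment (at most one `*` per segment), `**` matches zero or more whole segments, and concrete subjects are written as `code://go/<project>/<path>`.

```rust
/// Match one pattern segment against one path segment.
/// Supports at most one `*` per segment (e.g. `*`, `*.go`, `*_handler.go`).
fn segment_matches(pattern: &str, segment: &str) -> bool {
    match pattern.split_once('*') {
        None => pattern == segment,
        Some((prefix, suffix)) => {
            segment.len() >= prefix.len() + suffix.len()
                && segment.starts_with(prefix)
                && segment.ends_with(suffix)
        }
    }
}

/// Recursively match pattern segments against path segments.
fn segments_match(pattern: &[&str], path: &[&str]) -> bool {
    match pattern.split_first() {
        None => path.is_empty(),
        // `**` consumes zero or more whole segments.
        Some((head, rest)) if *head == "**" => {
            (0..=path.len()).any(|skip| segments_match(rest, &path[skip..]))
        }
        Some((head, rest)) => match path.split_first() {
            Some((seg, tail)) => segment_matches(head, seg) && segments_match(rest, tail),
            None => false,
        },
    }
}

/// Hypothetical entry point: does `subject` fall under `pattern`?
pub fn subject_matches(pattern: &str, subject: &str) -> bool {
    let p: Vec<&str> = pattern.split('/').collect();
    let s: Vec<&str> = subject.split('/').collect();
    segments_match(&p, &s)
}
```

Under these assumptions, `code://go/*/adapters/http/**` covers every file below `adapters/http/` in any project, while `code://go/*/adapters/http/*.go` covers only `.go` files directly inside that directory.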

2. Compliance Checking

$ aphoria scan --check-policy hexagonal-arch

📋 Policy Compliance Report: Hexagonal Architecture
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

✅ Directory Structure (95% compliant)
   ✓ 45 files in correct locations
   ❌ 3 violations:
      • adapters/handlers/user.go → should be adapters/http/user_handler.go
      • adapters/db/user_repo.go → should be adapters/persistence/user_repo.go
      • domain/user_service.go → should be core/domain/user_service.go

✅ Dependency Rules (100% compliant)
   ✓ No forbidden imports detected
   ✓ Core domain is clean of infrastructure dependencies

⚠️  Naming Conventions (80% compliant)
   ✓ 35 files follow naming conventions
   ❌ 9 violations:
      • adapters/http/user.go → should be user_handler.go
      • adapters/http/order.go → should be order_handler.go
      ... (7 more)

✅ Interface Definitions (90% compliant)
   ✓ 18 interfaces properly named
   ⚠️  2 warnings:
      • PostgresUserRepository → consider UserStore (behavior-based naming)
      • MySQLOrderRepository → consider OrderStore (behavior-based naming)

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Overall Compliance: 91% (237 of 260 checks passed)

📝 Recommendations:
  1. Run: aphoria fix --policy hexagonal-arch --auto-safe
     This will automatically fix 8 file location issues

  2. Manually review 2 interface naming suggestions

  3. Update hexagonal.md if any rules need revision

Last policy update: hexagonal.md (modified 3 days ago)
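The overall compliance figure in the report is a rounded pass ratio. A minimal sketch of that arithmetic, assuming round-to-nearest on whole percentages (`compliance_percent` is a hypothetical helper name):

```rust
/// Percentage of checks that passed, rounded to the nearest whole percent.
/// Returns `None` when there are no checks, to avoid dividing by zero.
pub fn compliance_percent(passed: u32, total: u32) -> Option<u32> {
    if total == 0 {
        return None;
    }
    let passed = u64::from(passed);
    let total = u64::from(total);
    // Adding total / 2 before the integer division rounds to nearest.
    Some(((passed * 100 + total / 2) / total) as u32)
}
```

With the report's numbers, 237 of 260 checks is 91.15%, which rounds to the 91% shown.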

3. Real-Time Feedback

$ git add adapters/handlers/user.go
$ git commit

⚠️  Pre-commit hook: Aphoria policy check

❌ Policy violations detected (hexagonal-arch):

  adapters/handlers/user.go:
    ❌ File location violation
       Expected: adapters/http/*_handler.go
       Found: adapters/handlers/user.go
       Rule: HTTP handlers must be in adapters/http/
       Source: hexagonal.md:45 (team policy)

       Suggested fix:
         git mv adapters/handlers/user.go adapters/http/user_handler.go

Commit blocked. Fix violations or use --no-verify to skip.

Data Model

Ingested Claim Structure

pub struct IngestedClaim {
    /// Unique claim ID
    pub id: String,

    /// Subject pattern (supports wildcards)
    pub subject_pattern: String,

    /// Predicate
    pub predicate: String,

    /// Expected value
    pub value: ClaimValue,

    /// Comparison mode
    pub comparison: ComparisonMode,  // MUST, MUST_NOT, SHOULD, MAY

    /// Category
    pub category: String,  // "architecture" | "security" | "style"

    /// Explanation (from doc)
    pub explanation: String,

    /// Authority tier
    pub authority_tier: AuthorityTier,  // TeamPolicy (Tier 2.5)

    /// Source document tracking
    pub source: DocumentSource,

    /// When ingested
    pub ingested_at: u64,
}

pub struct DocumentSource {
    /// Path to source document
    pub file_path: String,

    /// Line numbers in source
    pub line_start: u32,
    pub line_end: u32,

    /// Section heading
    pub section: String,  // "Directory Structure Rules"

    /// Document version (hash)
    pub document_hash: String,
}

pub enum ComparisonMode {
    Must,       // Value MUST match
    MustNot,    // Value MUST NOT match
    Should,     // Warning if doesn't match
    May,        // Informational only
}
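To make the four comparison modes concrete, the sketch below maps a mode and a match result to a scan outcome. The `Outcome` enum and `evaluate` function are illustrative names that do not appear in the design above, and the enum is redeclared so the snippet compiles on its own.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ComparisonMode {
    Must,
    MustNot,
    Should,
    May,
}

/// Hypothetical result of evaluating one claim against one observed value.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Outcome {
    Pass,
    Violation, // hard failure, e.g. blocks a commit
    Warning,   // reported but not blocking
    Info,      // informational only
}

/// `matches` is true when the observed value matches the claim's expected value.
pub fn evaluate(mode: ComparisonMode, matches: bool) -> Outcome {
    match (mode, matches) {
        (ComparisonMode::Must, true) => Outcome::Pass,
        (ComparisonMode::Must, false) => Outcome::Violation,
        (ComparisonMode::MustNot, true) => Outcome::Violation,
        (ComparisonMode::MustNot, false) => Outcome::Pass,
        (ComparisonMode::Should, true) => Outcome::Pass,
        (ComparisonMode::Should, false) => Outcome::Warning,
        (ComparisonMode::May, _) => Outcome::Info,
    }
}
```

For example, an `imports_forbidden` claim with `MustNot` yields a violation exactly when the forbidden import is found.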

Authority Tier Hierarchy

Tier 0: System      (StemeDB internals, not user-facing)
Tier 1: Regulatory  (RFCs, legal requirements)
Tier 2: Clinical    (OWASP, NIST, industry standards)
Tier 2.5: TeamPolicy ← NEW (team-specific guidelines)
Tier 3: Expert      (recognized authorities, vetted claims)
Tier 4: Community   (project-specific observations)

TeamPolicy tier:

  • Higher authority than community observations
  • Lower authority than industry standards (OWASP)
  • Can override community patterns
  • Cannot override RFCs or security standards
  • Scoped to project/team
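One way to encode the fractional tier without floating-point numbers is an enum whose declaration order is the authority order. This is a sketch only: the real `AuthorityTier` type may be defined differently, and `can_override` is a hypothetical method.

```rust
/// Tiers listed from highest to lowest authority.
/// The derived `Ord` follows declaration order, so `System < Community`.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum AuthorityTier {
    System,     // Tier 0
    Regulatory, // Tier 1
    Clinical,   // Tier 2
    TeamPolicy, // Tier 2.5
    Expert,     // Tier 3
    Community,  // Tier 4
}

impl AuthorityTier {
    /// A tier can override another only if it ranks strictly higher.
    pub fn can_override(self, other: AuthorityTier) -> bool {
        self < other
    }
}
```

Under this ordering, `TeamPolicy` overrides `Community` but not `Regulatory` or `Clinical`, matching the rules listed above.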

Implementation

Phase 1: Manual Extraction (MVP - 2 days)

User manually creates claims TOML from guidelines, then imports:

# User writes team-guidelines.toml manually from reading architecture.md
aphoria claims import team-guidelines.toml --authority-tier team_policy

Pros: Works immediately, no LLM needed
Cons: Manual work, doesn't scale

Phase 2: LLM-Assisted Extraction (Week 1 - 3 days)

$ aphoria ingest-guide docs/architecture/hexagonal.md --edit

Processing: hexagonal.md
Using LLM to extract claims...

Found 26 potential claims. Review and edit before importing.
Opening editor...

# Generated claims (edit before importing)
[[claim]]
id = "hex-arch-http-handlers-001"
subject = "code://go/*/adapters/http/**"
predicate = "directory_pattern"
value = "*_handler.go"
comparison = "must"
category = "architecture"
explanation = "HTTP handlers must be in adapters/http/ directory"
source = "hexagonal.md:45-47"

# Edit to refine, then save and close to import
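The compact `source` string in the generated claim carries the same information as the `DocumentSource` line fields. A sketch of how it might be parsed, assuming the format is `<path>:<start>-<end>` or `<path>:<line>` (`SourceRef` and `parse_source_ref` are illustrative names):

```rust
#[derive(Debug, PartialEq, Eq)]
pub struct SourceRef {
    pub file_path: String,
    pub line_start: u32,
    pub line_end: u32,
}

/// Parse "hexagonal.md:45-47" or "hexagonal.md:45" into its parts.
/// Returns `None` for malformed input or an inverted/zero line range.
pub fn parse_source_ref(s: &str) -> Option<SourceRef> {
    // Split on the last ':' so a path that itself contains ':' still parses.
    let (file_path, lines) = s.rsplit_once(':')?;
    if file_path.is_empty() {
        return None;
    }

    // "45-47" is a range; a bare "45" refers to a single line.
    let (line_start, line_end): (u32, u32) = match lines.split_once('-') {
        Some((a, b)) => (a.parse().ok()?, b.parse().ok()?),
        None => {
            let n = lines.parse().ok()?;
            (n, n)
        }
    };

    if line_start == 0 || line_start > line_end {
        return None;
    }
    Some(SourceRef {
        file_path: file_path.to_string(),
        line_start,
        line_end,
    })
}
```

Rejecting inverted ranges at parse time keeps a hand-edited claim from pointing at a nonsensical span in the source document.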

LLM Prompt:

Extract architectural claims from this document.

For each MUST/SHOULD/MUST NOT statement:
1. Identify the subject (what code element)
2. Identify the predicate (what property)
3. Identify the value (expected value)
4. Determine comparison mode (must/should/must_not)
5. Extract explanation

Format as TOML claims.

Example input:
"HTTP handlers MUST be in adapters/http/ directory and end with _handler.go"

Example output:
[[claim]]
subject = "code://go/*/adapters/http/**"
predicate = "directory_pattern"
value = "*_handler.go"
comparison = "must"
explanation = "HTTP handlers must be in adapters/http/ directory"

Implementation:

  • File: applications/aphoria/src/llm/document_ingestion.rs
  • Uses existing LLM infrastructure
  • Outputs TOML for review before import
  • User can edit/refine before committing

Phase 3: Automated Extraction with Validation (Week 2 - 4 days)

Fully automated pipeline with confidence scoring:

pub struct DocumentIngester {
    llm: LlmClient,
    validator: ClaimValidator,
}

impl DocumentIngester {
    /// Ingest a document and extract claims.
    pub async fn ingest(
        &self,
        doc_path: &Path,
        options: IngestionOptions,
    ) -> Result<Vec<IngestedClaim>, IngestionError> {
        // 1. Parse document (markdown/PDF/text)
        let sections = self.parse_document(doc_path)?;

        // 2. Extract claims using LLM, one section at a time
        let mut raw_claims = Vec::new();
        for section in &sections {
            raw_claims.extend(self.extract_claims_from_section(section).await?);
        }

        // 3. Validate and score confidence
        let validated = self.validate_claims(raw_claims)?;

        // 4. Filter by confidence threshold
        let high_confidence: Vec<_> = validated
            .into_iter()
            .filter(|c| c.confidence >= options.min_confidence)
            .collect();

        // 5. Preview or auto-import
        if options.dry_run {
            self.preview_claims(&high_confidence)?;
            Ok(vec![])
        } else {
            self.import_claims(high_confidence).await
        }
    }

    /// Extract claims from a section using LLM.
    async fn extract_claims_from_section(
        &self,
        section: &DocumentSection,
    ) -> Result<Vec<Observation>, LlmError> {
        let prompt = format!(
            r#"Extract architectural claims from this section.

Section: {}

Content:
{}

For each claim:
1. Identify subject pattern (supports wildcards)
2. Identify predicate
3. Identify expected value
4. Determine severity (MUST/MUST_NOT/SHOULD/MAY)
5. Extract explanation

Return as JSON array."#,
            section.heading,
            section.content,
        );

        self.llm.extract_structured(prompt).await
    }

    /// Validate extracted claims for quality.
    fn validate_claims(
        &self,
        claims: Vec<Observation>,
    ) -> Result<Vec<ValidatedClaim>, ValidationError> {
        claims
            .into_iter()
            .map(|claim| {
                // Check if subject pattern is valid
                let subject_valid = self.validator.validate_subject(&claim.subject);

                // Check if predicate is recognized
                let predicate_valid = self.validator.validate_predicate(&claim.predicate);

                // Compute confidence score
                let confidence = self.compute_confidence(&claim, subject_valid, predicate_valid);

                // Wrap in Ok so the iterator collects into Result<Vec<_>, _>
                Ok(ValidatedClaim {
                    claim,
                    confidence,
                    validation_issues: vec![],
                })
            })
            .collect()
    }
}
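The pipeline calls `compute_confidence` without defining it. The sketch below shows one possible scoring scheme. The weights (0.4, 0.4, 0.2), the `hedged` signal, and the free-function signature are all assumptions made for illustration, not part of the design above.

```rust
/// Hypothetical confidence score in the range 0.0..=1.0.
/// `hedged` is true when the source sentence uses softening language
/// such as "generally" or "usually".
pub fn compute_confidence(subject_valid: bool, predicate_valid: bool, hedged: bool) -> f32 {
    let mut score = 0.0_f32;
    if subject_valid {
        score += 0.4; // subject pattern parsed and is well-formed
    }
    if predicate_valid {
        score += 0.4; // predicate is one the scanner recognizes
    }
    if !hedged {
        score += 0.2; // unambiguous MUST/SHOULD wording
    }
    score
}

/// Mirrors the filter step: keep a claim only at or above the threshold.
pub fn passes_threshold(confidence: f32, min_confidence: f32) -> bool {
    confidence >= min_confidence
}
```

With these illustrative weights and a threshold of 0.7, a hedged claim with a valid subject and predicate scores 0.8 and is kept, while a claim with an unrecognized predicate scores at most 0.6 and is dropped.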

CLI Design

Commands

# Ingest a document
aphoria ingest-guide <path> [options]

Options:
  --authority-tier <tier>    Authority tier (default: team_policy)
  --category <category>      Category (architecture|security|style)
  --min-confidence <float>   Min confidence to include (0.0-1.0, default: 0.7)
  --dry-run                  Preview without importing
  --edit                     Open editor to review/refine before importing
  --project <name>           Project scope (default: current project)

# List ingested guidelines
aphoria list-guides

# Check compliance against a guideline
aphoria check-compliance <guide-name>

# Update from changed document
aphoria update-guide <guide-name>

# Remove a guideline
aphoria remove-guide <guide-name>

Examples

# Ingest with preview
aphoria ingest-guide docs/architecture/hexagonal.md --dry-run

# Ingest with manual review
aphoria ingest-guide docs/security/owasp-top-10.md --edit

# Check compliance
aphoria check-compliance hexagonal-arch

# Update when doc changes
aphoria update-guide hexagonal-arch --from docs/architecture/hexagonal.md

# List all active guidelines
aphoria list-guides
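The `--min-confidence` option has a documented range and default, so it is worth validating before the pipeline runs. A minimal sketch, assuming the raw flag value arrives as an optional string (`parse_min_confidence` is a hypothetical helper, not an existing CLI function):

```rust
/// Parse the value of `--min-confidence`, falling back to the documented
/// default of 0.7 when the flag is absent.
pub fn parse_min_confidence(raw: Option<&str>) -> Result<f32, String> {
    const DEFAULT: f32 = 0.7;

    let Some(raw) = raw else {
        return Ok(DEFAULT);
    };

    let value: f32 = raw
        .trim()
        .parse()
        .map_err(|_| format!("invalid --min-confidence value: {raw:?}"))?;

    // `contains` is false for NaN, so NaN is rejected here as well.
    if (0.0..=1.0).contains(&value) {
        Ok(value)
    } else {
        Err(format!(
            "--min-confidence must be between 0.0 and 1.0, got {value}"
        ))
    }
}
```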

Integration with Existing Features

1. Conflict Detection

Ingested claims are stored as authoritative assertions, so the existing conflict engine can detect violations against them.

2. Scan Reports

Compliance shown in standard scan reports:

Conflicts: 3
  ❌ hexagonal.md:45 - File in wrong directory
  ❌ hexagonal.md:62 - Forbidden import detected
  ❌ hexagonal.md:89 - Invalid filename pattern

3. Authority Lens

TeamPolicy tier (2.5) ranks between Clinical (2) and Expert (3):

  • Overrides community observations
  • Can be overridden by team-authored claims (explicit)
  • Respects RFCs and security standards

4. Pre-commit Hooks

Compliance checking in pre-commit:

#!/bin/bash
# .git/hooks/pre-commit

aphoria scan --check-policy hexagonal-arch --exit-code

Storage

Claims Storage

Ingested claims are stored as regular AuthoredClaim instances:

  • File: .aphoria/claims.toml
  • Tagged with ingested_from: "hexagonal.md"
  • Authority tier: team_policy

Document Metadata

Track source documents:

# .aphoria/ingested_guides.toml

[[guide]]
id = "hexagonal-arch"
name = "Hexagonal Architecture Guidelines"
source_path = "docs/architecture/hexagonal.md"
document_hash = "blake3:abc123..."
ingested_at = 1234567890
claims_count = 26
authority_tier = "team_policy"
category = "architecture"

[[guide]]
id = "security-owasp"
name = "OWASP Top 10 Compliance"
source_path = "docs/security/owasp.md"
document_hash = "blake3:def456..."
ingested_at = 1234567900
claims_count = 15
authority_tier = "team_policy"
category = "security"

Update Detection

$ aphoria scan

⚠️  Warning: Source document has changed
    Guide: hexagonal-arch
    Source: docs/architecture/hexagonal.md
    Last ingested: 3 days ago (hash: abc123...)
    Current hash: xyz789...

    Run: aphoria update-guide hexagonal-arch
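The staleness warning comes down to comparing the recorded `document_hash` with a hash of the current file contents. The sketch below takes the hash function as a parameter so it stays independent of any particular crate. `GuideRecord`, `GuideStatus`, and `check_guide` are illustrative names, and a real implementation would pass a BLAKE3-based function that produces the same `blake3:...` string format.

```rust
/// Subset of one `[[guide]]` entry from `.aphoria/ingested_guides.toml`.
pub struct GuideRecord {
    pub id: String,
    pub source_path: String,
    pub document_hash: String,
}

#[derive(Debug, PartialEq, Eq)]
pub enum GuideStatus {
    UpToDate,
    /// The source document changed since ingestion.
    Changed { current_hash: String },
}

/// Compare the recorded hash against a fresh hash of `contents`.
/// `hash_fn` must produce strings in the same format as `document_hash`.
pub fn check_guide<F>(record: &GuideRecord, contents: &[u8], hash_fn: F) -> GuideStatus
where
    F: Fn(&[u8]) -> String,
{
    let current_hash = hash_fn(contents);
    if current_hash == record.document_hash {
        GuideStatus::UpToDate
    } else {
        GuideStatus::Changed { current_hash }
    }
}
```

Returning the current hash in the `Changed` variant gives the caller what it needs to print both hashes, as in the warning above.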

Success Metrics

Phase 1 (Manual Import)

  • Users can manually create claims from guidelines
  • Claims are enforced during scans
  • Pre-commit hooks work

Phase 2 (LLM-Assisted)

  • LLM extracts 80%+ of claims correctly
  • Users can review/edit before importing
  • Saves >90% of manual effort

Phase 3 (Automated)

  • Confidence scoring filters noise
  • Automatic updates when docs change
  • Compliance reports show trends

Open Questions

  1. How to handle ambiguous statements?

    • "Handlers should generally be in adapters/http/"
    • Extract as SHOULD with low confidence?
  2. How to handle conflicting guidelines?

    • Doc A says X, Doc B says Y
    • Use authority tier + recency?
  3. Should we support non-Markdown formats?

    • PDF extraction (common for external standards)
    • Confluence/Google Docs (via export)
    • Word documents
  4. How to version guidelines?

    • When doc changes, update claims or create new versions?
    • Show history of guideline changes?
  5. Should compliance be project-scoped or team-scoped?

    • Team-level guidelines apply to all team projects?
    • Project-specific guidelines override team guidelines?

Future Enhancements

1. Guided Onboarding

$ aphoria init --with-guides

No guidelines found. Would you like to:
1. Import existing documentation
2. Start from template (hexagonal/clean/ddd)
3. Skip for now

Choice: 1

Enter path to architecture guide: docs/architecture.md
Enter path to security guide: docs/security.md
Enter path to style guide: (skip)

Extracting claims...
Found 42 claims across 2 documents.
Review before importing? [Y/n]

2. Compliance Dashboard

Visual compliance tracking over time:

  • Trend graphs (improving/declining)
  • Per-guideline compliance scores
  • Team comparison (if multiple teams)

3. AI-Generated Fix Suggestions

❌ File location violation
   Expected: adapters/http/*_handler.go
   Found: adapters/handlers/user.go

   Suggested fix:
     git mv adapters/handlers/user.go adapters/http/user_handler.go

   Apply fix? [y/N]

4. Guideline Templates

Pre-built guidelines for common architectures:

  • Hexagonal Architecture
  • Clean Architecture
  • Domain-Driven Design
  • Microservices patterns
  • Security baselines (OWASP, NIST)

$ aphoria init-guide --template hexagonal

Imported 35 hexagonal architecture claims.
Customize at: .aphoria/claims.toml

Timeline

  • Week 1: Manual import (MVP) + LLM extraction prototype
  • Week 2: Automated pipeline + confidence scoring
  • Week 3: CLI polish + documentation + examples
  • Week 4: Dashboard integration + compliance reports

Total: 4 weeks for complete feature