---
created: 2026-02-08
last_updated: 2026-02-08
status: Planning Document
feature: Phase 2-3 - LLM-Assisted Document Ingestion
timeline: 4 weeks estimated
---
# Ingest Best Practices Documentation - Executable Policy
## Problem Statement
**Current Reality:**
1. Teams write extensive architecture/security/style guides (50+ pages)
2. Developers are expected to read and remember all guidelines
3. Compliance is checked manually in code review
4. Guidelines drift out of sync with code over time
5. New team members miss context from old documents
**What Users Want:**
- Write documentation once (Markdown, PDF, Confluence)
- Have Aphoria automatically enforce the guidelines
- Get real-time feedback during development
- Maintain compliance without manual review
## Vision: Documentation That Enforces Itself
### Example: Hexagonal Architecture Guide
**Traditional Flow:**
```
1. Architect writes: "HTTP handlers MUST be in adapters/http/"
2. Developer reads guide (hopefully)
3. Developer writes code in wrong location
4. Code reviewer catches it (maybe)
5. Fix during review (wasted time)
```
**With Aphoria Ingestion:**
```
1. Architect writes: "HTTP handlers MUST be in adapters/http/"
2. Run: aphoria ingest-guide architecture.md
3. Developer writes code in wrong location
4. aphoria scan immediately shows:
   ❌ File location violation
      Expected: adapters/http/*_handler.go
      Found:    adapters/handlers/user.go
      Fix:      Move to adapters/http/user_handler.go
5. Developer fixes before commit (no review cycles wasted)
```
## User Experience
### 1. Ingest Phase
```bash
$ aphoria ingest-guide docs/architecture/hexagonal.md \
    --authority-tier team_policy \
    --category architecture \
    --dry-run

Analyzing: docs/architecture/hexagonal.md (15 KB, 342 lines)

📊 Extraction Summary:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Section                  Claims   Severity
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Directory Structure      8        MUST
Dependency Rules         6        MUST_NOT
Naming Conventions       5        MUST
Interface Definitions    4        SHOULD
Testing Strategy         3        SHOULD
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Total                    26 claims extracted

🔍 Preview of Extracted Claims:

1. [MUST] HTTP handlers in adapters/http/ directory
   Subject:   code://go/*/adapters/http/**
   Predicate: directory_pattern
   Value:     *_handler.go
   Source:    hexagonal.md:45-47

2. [MUST_NOT] Core domain imports infrastructure
   Subject:   code://go/*/core/domain/**
   Predicate: imports_forbidden
   Value:     infrastructure/*
   Source:    hexagonal.md:62-64

3. [MUST] Handler files end with _handler.go
   Subject:   code://go/*/adapters/http/*.go
   Predicate: filename_pattern
   Value:     *_handler.go
   Source:    hexagonal.md:89-91

... (23 more)

Would add 26 claims to authoritative corpus.
Estimated scan coverage: ~65% of codebase

Proceed with ingestion? [y/N]
```
### 2. Compliance Checking
```bash
$ aphoria scan --check-policy hexagonal-arch
📋 Policy Compliance Report: Hexagonal Architecture
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
✅ Directory Structure (95% compliant)
   45 files in correct locations
   3 violations:
   • adapters/handlers/user.go → should be adapters/http/user_handler.go
   • adapters/db/user_repo.go → should be adapters/persistence/user_repo.go
   • domain/user_service.go → should be core/domain/user_service.go

✅ Dependency Rules (100% compliant)
   ✓ No forbidden imports detected
   ✓ Core domain is clean of infrastructure dependencies

⚠️ Naming Conventions (80% compliant)
   35 files follow naming conventions
   9 violations:
   • adapters/http/user.go → should be user_handler.go
   • adapters/http/order.go → should be order_handler.go
   ... (7 more)

✅ Interface Definitions (90% compliant)
   18 interfaces properly named
   ⚠️ 2 warnings:
   • PostgresUserRepository → consider UserStore (behavior-based naming)
   • MySQLOrderRepository → consider OrderStore (behavior-based naming)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Overall Compliance: 91% (237 of 260 checks passed)

📝 Recommendations:
1. Run: aphoria fix --policy hexagonal-arch --auto-safe
   This will automatically fix 8 file location issues
2. Manually review 2 interface naming suggestions
3. Update hexagonal.md if any rules need revision

Last policy update: hexagonal.md (modified 3 days ago)
```
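The percentages in the report are plain pass/total ratios. A minimal sketch of that arithmetic follows; `SectionResult` and both function names are illustrative assumptions, not existing Aphoria types, and rounding to the nearest whole percent is also an assumption.

```rust
/// Result of checking one policy section (hypothetical type for illustration).
pub struct SectionResult {
    pub name: &'static str,
    pub passed: u32,
    pub total: u32,
}

/// Percentage of passed checks, rounded to the nearest whole number.
/// A section with no checks is treated as fully compliant.
pub fn compliance_percent(passed: u32, total: u32) -> u32 {
    if total == 0 {
        return 100;
    }
    ((passed as f64 / total as f64) * 100.0).round() as u32
}

/// Overall compliance is computed over the summed checks, not by
/// averaging the per-section percentages.
pub fn overall_percent(sections: &[SectionResult]) -> u32 {
    let passed: u32 = sections.iter().map(|s| s.passed).sum();
    let total: u32 = sections.iter().map(|s| s.total).sum();
    compliance_percent(passed, total)
}
```

Summing checks before dividing keeps large sections from being outweighed by small ones, which is how the "237 of 260 checks" line reads.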
### 3. Real-Time Feedback
```bash
$ git add adapters/handlers/user.go
$ git commit
⚠️ Pre-commit hook: Aphoria policy check

❌ Policy violations detected (hexagonal-arch):

adapters/handlers/user.go:
  ❌ File location violation
     Expected: adapters/http/*_handler.go
     Found:    adapters/handlers/user.go
     Rule:     HTTP handlers must be in adapters/http/
     Source:   hexagonal.md:45 (team policy)

Suggested fix:
  git mv adapters/handlers/user.go adapters/http/user_handler.go

Commit blocked. Fix violations or use --no-verify to skip.
```
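The suggested `git mv` target can be derived mechanically from the rule. The sketch below shows one way to do that with the standard library only; `LocationRule` and `suggest_location` are hypothetical names introduced for illustration, and real pattern matching in Aphoria may work differently.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical rule: files of one kind must live in `expected_dir`
/// and their names must end with `required_suffix`.
pub struct LocationRule<'a> {
    pub expected_dir: &'a str,
    pub required_suffix: &'a str,
}

/// Returns `None` when `found` already satisfies the rule (or has no
/// usable file name), otherwise the path the file should be moved to.
pub fn suggest_location(rule: &LocationRule, found: &Path) -> Option<PathBuf> {
    let file_name = found.file_name()?.to_str()?;
    let in_expected_dir = found.parent() == Some(Path::new(rule.expected_dir));
    let has_suffix = file_name.ends_with(rule.required_suffix);
    if in_expected_dir && has_suffix {
        return None;
    }
    // Build a compliant name: keep the stem and append the required suffix.
    let fixed_name = if has_suffix {
        file_name.to_string()
    } else {
        let stem = found.file_stem()?.to_str()?;
        format!("{}{}", stem, rule.required_suffix)
    };
    Some(Path::new(rule.expected_dir).join(fixed_name))
}
```

With the rule `adapters/http` + `_handler.go`, the path `adapters/handlers/user.go` maps to `adapters/http/user_handler.go`, matching the hook output above.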
## Data Model
### Ingested Claim Structure
```rust
pub struct IngestedClaim {
    /// Unique claim ID
    pub id: String,
    /// Subject pattern (supports wildcards)
    pub subject_pattern: String,
    /// Predicate
    pub predicate: String,
    /// Expected value
    pub value: ClaimValue,
    /// Comparison mode
    pub comparison: ComparisonMode, // MUST, MUST_NOT, SHOULD, MAY
    /// Category
    pub category: String, // "architecture" | "security" | "style"
    /// Explanation (from doc)
    pub explanation: String,
    /// Authority tier
    pub authority_tier: AuthorityTier, // TeamPolicy (Tier 2.5)
    /// Source document tracking
    pub source: DocumentSource,
    /// When ingested
    pub ingested_at: u64,
}

pub struct DocumentSource {
    /// Path to source document
    pub file_path: String,
    /// Line numbers in source
    pub line_start: u32,
    pub line_end: u32,
    /// Section heading
    pub section: String, // "Directory Structure Rules"
    /// Document version (hash)
    pub document_hash: String,
}

pub enum ComparisonMode {
    Must,    // Value MUST match
    MustNot, // Value MUST NOT match
    Should,  // Warning if doesn't match
    May,     // Informational only
}
```
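To make the four comparison modes concrete, here is a minimal sketch of how a mode and a match result could map to a scan outcome. `ComparisonMode` is repeated so the snippet compiles on its own; the `Outcome` enum and `evaluate` function are assumptions for illustration and do not exist in the document's data model.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ComparisonMode {
    Must,
    MustNot,
    Should,
    May,
}

/// Hypothetical result of checking one observation against one claim.
#[derive(Debug, PartialEq, Eq)]
pub enum Outcome {
    Pass,
    Violation, // blocks, e.g. in a pre-commit hook
    Warning,   // reported but non-blocking
    Info,      // informational only
}

/// Map a comparison mode plus "did the observed value match?" to an outcome.
pub fn evaluate(mode: ComparisonMode, value_matches: bool) -> Outcome {
    match (mode, value_matches) {
        (ComparisonMode::Must, true) => Outcome::Pass,
        (ComparisonMode::Must, false) => Outcome::Violation,
        (ComparisonMode::MustNot, true) => Outcome::Violation,
        (ComparisonMode::MustNot, false) => Outcome::Pass,
        (ComparisonMode::Should, true) => Outcome::Pass,
        (ComparisonMode::Should, false) => Outcome::Warning,
        (ComparisonMode::May, _) => Outcome::Info,
    }
}
```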
### Authority Tier Hierarchy
```
Tier 0: System (StemeDB internals, not user-facing)
Tier 1: Regulatory (RFCs, legal requirements)
Tier 2: Clinical (OWASP, NIST, industry standards)
Tier 2.5: TeamPolicy ← NEW (team-specific guidelines)
Tier 3: Expert (recognized authorities, vetted claims)
Tier 4: Community (project-specific observations)
```
TeamPolicy tier:
- Higher authority than community observations
- Lower authority than industry standards (OWASP)
- Can override community patterns
- Cannot override RFCs or security standards
- Scoped to project/team
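One way to encode this ordering is an enum whose declaration order follows the tier numbers, so the derived `Ord` does the ranking. This is a sketch under that assumption; the variant list mirrors the hierarchy above, while `outranks` is a hypothetical helper name.

```rust
/// Declaration order matters: earlier variants carry higher authority,
/// and the derived `Ord` compares variants by that order.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum AuthorityTier {
    System,     // Tier 0
    Regulatory, // Tier 1
    Clinical,   // Tier 2
    TeamPolicy, // Tier 2.5
    Expert,     // Tier 3
    Community,  // Tier 4
}

impl AuthorityTier {
    /// True when `self` has strictly higher authority than `other`.
    pub fn outranks(self, other: AuthorityTier) -> bool {
        self < other
    }
}
```

Using declaration order avoids representing "2.5" as a float and leaves room to slot in further tiers later without renumbering.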
## Implementation
### Phase 1: Manual Extraction (MVP - 2 days)
User manually creates claims TOML from guidelines, then imports:
```bash
# User writes team-guidelines.toml by hand while reading architecture.md
aphoria claims import team-guidelines.toml --authority-tier team_policy
```
**Pros:** Works immediately, no LLM needed
**Cons:** Manual work, doesn't scale
### Phase 2: LLM-Assisted Extraction (Week 1 - 3 days)
```bash
$ aphoria ingest-guide docs/architecture/hexagonal.md --edit
Processing: hexagonal.md
Using LLM to extract claims...
Found 26 potential claims. Review and edit before importing.
Opening editor...
# Generated claims (edit before importing)
[[claim]]
id = "hex-arch-http-handlers-001"
subject = "code://go/*/adapters/http/**"
predicate = "directory_pattern"
value = "*_handler.go"
comparison = "must"
category = "architecture"
explanation = "HTTP handlers must be in adapters/http/ directory"
source = "hexagonal.md:45-47"
# Edit to refine, then save and close to import
```
**LLM Prompt:**
```
Extract architectural claims from this document.
For each MUST/SHOULD/MUST NOT statement:
1. Identify the subject (what code element)
2. Identify the predicate (what property)
3. Identify the value (expected value)
4. Determine comparison mode (must/should/must_not)
5. Extract explanation
Format as TOML claims.
Example input:
"HTTP handlers MUST be in adapters/http/ directory and end with _handler.go"
Example output:
[[claim]]
subject = "code://go/*/adapters/http/**"
predicate = "directory_pattern"
value = "*_handler.go"
comparison = "must"
explanation = "HTTP handlers must be in adapters/http/ directory"
```
**Implementation:**
- File: `applications/aphoria/src/llm/document_ingestion.rs`
- Uses existing LLM infrastructure
- Outputs TOML for review before import
- User can edit/refine before committing
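When the reviewed TOML is imported, the compact `source = "hexagonal.md:45-47"` string has to be split into the `file_path`, `line_start`, and `line_end` fields of `DocumentSource`. A minimal sketch of that parsing step follows; `SourceRef` and `parse_source_ref` are hypothetical names, and treating a single line number as a one-line range is an assumption.

```rust
/// Parsed form of a "file:start-end" or "file:line" reference (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub struct SourceRef {
    pub file_path: String,
    pub line_start: u32,
    pub line_end: u32,
}

/// Parse references such as "hexagonal.md:45-47" or "hexagonal.md:45".
/// Returns `None` for malformed input, line 0, or a reversed range.
pub fn parse_source_ref(s: &str) -> Option<SourceRef> {
    // Split on the last ':' so paths containing ':' still work.
    let (file, lines) = s.rsplit_once(':')?;
    if file.is_empty() {
        return None;
    }
    let (start, end) = match lines.split_once('-') {
        Some((a, b)) => (a.parse::<u32>().ok()?, b.parse::<u32>().ok()?),
        None => {
            let n = lines.parse::<u32>().ok()?;
            (n, n)
        }
    };
    if start == 0 || start > end {
        return None;
    }
    Some(SourceRef {
        file_path: file.to_string(),
        line_start: start,
        line_end: end,
    })
}
```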
### Phase 3: Automated Extraction with Validation (Week 2 - 4 days)
Fully automated pipeline with confidence scoring:
```rust
use std::path::Path;

pub struct DocumentIngester {
    llm: LlmClient,
    validator: ClaimValidator,
}

impl DocumentIngester {
    /// Ingest a document and extract claims.
    ///
    /// Assumes `IngestionError` implements `From` for the parser, LLM, and
    /// validation error types so that `?` can convert them.
    pub async fn ingest(
        &self,
        doc_path: &Path,
        options: IngestionOptions,
    ) -> Result<Vec<IngestedClaim>, IngestionError> {
        // 1. Parse document (markdown/PDF/text)
        let sections = self.parse_document(doc_path)?;

        // 2. Extract claims using LLM, one section at a time
        let mut raw_claims = Vec::new();
        for section in &sections {
            raw_claims.extend(self.extract_claims_from_section(section).await?);
        }

        // 3. Validate and score confidence
        let validated = self.validate_claims(raw_claims)?;

        // 4. Filter by confidence threshold
        let high_confidence: Vec<_> = validated
            .into_iter()
            .filter(|c| c.confidence >= options.min_confidence)
            .collect();

        // 5. Preview or auto-import
        if options.dry_run {
            self.preview_claims(&high_confidence)?;
            Ok(vec![])
        } else {
            self.import_claims(high_confidence).await
        }
    }

    /// Extract claims from a section using LLM.
    async fn extract_claims_from_section(
        &self,
        section: &DocumentSection,
    ) -> Result<Vec<ExtractedClaim>, LlmError> {
        let prompt = format!(
            r#"Extract architectural claims from this section.
Section: {}
Content:
{}
For each claim:
1. Identify subject pattern (supports wildcards)
2. Identify predicate
3. Identify expected value
4. Determine severity (MUST/MUST_NOT/SHOULD/MAY)
5. Extract explanation
Return as JSON array."#,
            section.heading,
            section.content,
        );
        self.llm.extract_structured(prompt).await
    }

    /// Validate extracted claims for quality.
    fn validate_claims(
        &self,
        claims: Vec<ExtractedClaim>,
    ) -> Result<Vec<ValidatedClaim>, ValidationError> {
        claims
            .into_iter()
            .map(|claim| {
                // Check if subject pattern is valid
                let subject_valid = self.validator.validate_subject(&claim.subject);
                // Check if predicate is recognized
                let predicate_valid = self.validator.validate_predicate(&claim.predicate);
                // Compute confidence score
                let confidence = self.compute_confidence(&claim, subject_valid, predicate_valid);
                // Wrap in Ok so the iterator collects into a Result
                Ok(ValidatedClaim {
                    claim,
                    confidence,
                    validation_issues: vec![],
                })
            })
            .collect()
    }
}
```
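`compute_confidence` is referenced above but not defined in this plan. The sketch below shows one possible scoring scheme. The weights (40/40/20) and the keyword check are illustrative assumptions only, and the free-function signature differs from the method call in the pipeline for the sake of a self-contained example.

```rust
/// Hypothetical scoring: validity of subject and predicate dominates, and an
/// explicit upper-case requirement keyword adds the remainder. Points are
/// summed as integers and divided once to avoid float accumulation error.
pub fn compute_confidence(
    subject_valid: bool,
    predicate_valid: bool,
    explicit_keyword: bool,
) -> f64 {
    let mut points: u32 = 0;
    if subject_valid {
        points += 40;
    }
    if predicate_valid {
        points += 40;
    }
    if explicit_keyword {
        points += 20;
    }
    f64::from(points) / 100.0
}

/// True when the sentence contains an upper-case requirement keyword.
/// "MUST NOT" is covered by the "MUST" check.
pub fn has_explicit_keyword(sentence: &str) -> bool {
    ["MUST", "SHOULD", "MAY"].iter().any(|kw| sentence.contains(kw))
}

/// Same comparison the pipeline applies in step 4.
pub fn passes_threshold(confidence: f64, min_confidence: f64) -> bool {
    confidence >= min_confidence
}
```

Under these assumed weights, a loosely worded sentence ("should generally be in ...") with a valid subject and predicate scores 0.8 and still clears the default 0.7 threshold, while a claim with an unrecognized predicate scores at most 0.6 and is filtered out.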
## CLI Design
### Commands
```bash
# Ingest a document
aphoria ingest-guide <path> [options]
Options:
  --authority-tier <tier>    Authority tier (default: team_policy)
  --category <category>      Category (architecture|security|style)
  --min-confidence <float>   Min confidence to include (0.0-1.0, default: 0.7)
  --dry-run                  Preview without importing
  --edit                     Open editor to review/refine before importing
  --project <name>           Project scope (default: current project)
# List ingested guidelines
aphoria list-guides
# Check compliance against a guideline
aphoria check-compliance <guide-name>
# Update from changed document
aphoria update-guide <guide-name>
# Remove a guideline
aphoria remove-guide <guide-name>
```
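The `ingest-guide` flags map naturally onto the `IngestionOptions` value that the Phase 3 pipeline receives. The plan does not define that struct, so the field layout below is an assumption derived from the option list, with defaults taken from the help text. `parse_min_confidence` is a hypothetical helper showing the range check for `--min-confidence`.

```rust
/// Hypothetical options struct mirroring the `ingest-guide` flags.
#[derive(Debug, Clone, PartialEq)]
pub struct IngestionOptions {
    pub authority_tier: String,
    pub category: Option<String>,
    pub min_confidence: f64,
    pub dry_run: bool,
    pub edit: bool,
    /// `None` means "current project".
    pub project: Option<String>,
}

impl Default for IngestionOptions {
    fn default() -> Self {
        Self {
            authority_tier: "team_policy".to_string(),
            category: None,
            min_confidence: 0.7,
            dry_run: false,
            edit: false,
            project: None,
        }
    }
}

/// Parse and range-check a `--min-confidence` argument.
/// NaN fails the range check and is rejected as well.
pub fn parse_min_confidence(raw: &str) -> Result<f64, String> {
    let value: f64 = raw
        .trim()
        .parse()
        .map_err(|_| format!("not a number: {:?}", raw))?;
    if (0.0..=1.0).contains(&value) {
        Ok(value)
    } else {
        Err(format!("min-confidence must be within 0.0-1.0, got {}", value))
    }
}
```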
### Examples
```bash
# Ingest with preview
aphoria ingest-guide docs/architecture/hexagonal.md --dry-run
# Ingest with manual review
aphoria ingest-guide docs/security/owasp-top-10.md --edit
# Check compliance
aphoria check-compliance hexagonal-arch
# Update when doc changes
aphoria update-guide hexagonal-arch --from docs/architecture/hexagonal.md
# List all active guidelines
aphoria list-guides
```
## Integration with Existing Features
### 1. Conflict Detection
Ingested claims are stored as authoritative assertions, so the existing conflict engine can detect violations against them.
### 2. Scan Reports
Compliance shown in standard scan reports:
```
Conflicts: 3
❌ hexagonal.md:45 - File in wrong directory
❌ hexagonal.md:62 - Forbidden import detected
❌ hexagonal.md:89 - Invalid filename pattern
```
### 3. Authority Lens
TeamPolicy tier (2.5) ranks between Clinical (2) and Expert (3):
- Overrides community observations
- Can be overridden by team-authored claims (explicit)
- Respects RFCs and security standards
### 4. Pre-commit Hooks
Compliance checking in pre-commit:
```bash
#!/bin/bash
# .git/hooks/pre-commit
aphoria scan --check-policy hexagonal-arch --exit-code
```
## Storage
### Claims Storage
Ingested claims are stored as regular AuthoredClaim instances:
- File: `.aphoria/claims.toml`
- Tagged with `ingested_from: "hexagonal.md"`
- Authority tier: `team_policy`
### Document Metadata
Track source documents:
```toml
# .aphoria/ingested_guides.toml
[[guide]]
id = "hexagonal-arch"
name = "Hexagonal Architecture Guidelines"
source_path = "docs/architecture/hexagonal.md"
document_hash = "blake3:abc123..."
ingested_at = 1234567890
claims_count = 26
authority_tier = "team_policy"
category = "architecture"
[[guide]]
id = "security-owasp"
name = "OWASP Top 10 Compliance"
source_path = "docs/security/owasp.md"
document_hash = "blake3:def456..."
ingested_at = 1234567900
claims_count = 15
authority_tier = "team_policy"
category = "security"
```
### Update Detection
```bash
$ aphoria scan
⚠️ Warning: Source document has changed
Guide: hexagonal-arch
Source: docs/architecture/hexagonal.md
Last ingested: 3 days ago (hash: abc123...)
Current hash: xyz789...
Run: aphoria update-guide hexagonal-arch
```
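The change warning boils down to comparing the stored `document_hash` with a hash of the file's current content. The sketch below illustrates that comparison using only the standard library. `DefaultHasher` is a stand-in for the BLAKE3 hash shown in `ingested_guides.toml`, and both function names are hypothetical.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Fingerprint of a document's content.
/// NOTE: `DefaultHasher` is a stand-in so the sketch needs no external crate.
/// Its output is not guaranteed stable across Rust releases, so a persisted
/// hash would need a fixed algorithm such as BLAKE3.
pub fn content_fingerprint(content: &str) -> String {
    let mut hasher = DefaultHasher::new();
    content.hash(&mut hasher);
    format!("std:{:016x}", hasher.finish())
}

/// True when the content no longer matches the hash recorded at ingestion.
pub fn needs_update(stored_hash: &str, current_content: &str) -> bool {
    stored_hash != content_fingerprint(current_content).as_str()
}
```

Prefixing the hash with its algorithm name, as the `blake3:` values in the metadata file do, lets a later version switch algorithms and still recognize old entries.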
## Success Metrics
### Phase 1 (Manual Import)
- Users can manually create claims from guidelines
- Claims are enforced during scans
- Pre-commit hooks work
### Phase 2 (LLM-Assisted)
- LLM extracts 80%+ of claims correctly
- Users can review/edit before importing
- Saves >90% of manual effort
### Phase 3 (Automated)
- Confidence scoring filters noise
- Automatic updates when docs change
- Compliance reports show trends
## Open Questions
1. **How to handle ambiguous statements?**
- "Handlers should generally be in adapters/http/"
- Extract as SHOULD with low confidence?
2. **How to handle conflicting guidelines?**
- Doc A says X, Doc B says Y
- Use authority tier + recency?
3. **Should we support non-Markdown formats?**
- PDF extraction (common for external standards)
- Confluence/Google Docs (via export)
- Word documents
4. **How to version guidelines?**
- When doc changes, update claims or create new versions?
- Show history of guideline changes?
5. **Should compliance be project-scoped or team-scoped?**
- Team-level guidelines apply to all team projects?
- Project-specific guidelines override team guidelines?
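For question 2, the "authority tier + recency" idea could look roughly like the sketch below. It is one possible answer rather than a decision: `Candidate` and `resolve` are hypothetical names, and the tier is encoded as the tier number times ten (so TeamPolicy, Tier 2.5, becomes 25) purely to keep it an integer.

```rust
use std::cmp::Ordering;

/// One of two conflicting claims (hypothetical type for illustration).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Candidate {
    pub guide_id: &'static str,
    /// Tier number times ten; lower means higher authority.
    pub tier_x10: u8,
    /// Ingestion timestamp, as in `ingested_guides.toml`.
    pub ingested_at: u64,
}

/// Pick the winning claim: higher authority first, then the newer one.
/// On a full tie the first argument wins.
pub fn resolve<'a>(a: &'a Candidate, b: &'a Candidate) -> &'a Candidate {
    match a.tier_x10.cmp(&b.tier_x10) {
        Ordering::Less => a,
        Ordering::Greater => b,
        Ordering::Equal => {
            if b.ingested_at > a.ingested_at {
                b
            } else {
                a
            }
        }
    }
}
```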
## Future Enhancements
### 1. Guided Onboarding
```bash
$ aphoria init --with-guides
No guidelines found. Would you like to:
1. Import existing documentation
2. Start from template (hexagonal/clean/ddd)
3. Skip for now
Choice: 1
Enter path to architecture guide: docs/architecture.md
Enter path to security guide: docs/security.md
Enter path to style guide: (skip)
Extracting claims...
Found 42 claims across 2 documents.
Review before importing? [Y/n]
```
### 2. Compliance Dashboard
Visual compliance tracking over time:
- Trend graphs (improving/declining)
- Per-guideline compliance scores
- Team comparison (if multiple teams)
### 3. AI-Generated Fix Suggestions
```bash
❌ File location violation
Expected: adapters/http/*_handler.go
Found: adapters/handlers/user.go
Suggested fix:
git mv adapters/handlers/user.go adapters/http/user_handler.go
Apply fix? [y/N]
```
### 4. Guideline Templates
Pre-built guidelines for common architectures:
- Hexagonal Architecture
- Clean Architecture
- Domain-Driven Design
- Microservices patterns
- Security baselines (OWASP, NIST)
```bash
$ aphoria init-guide --template hexagonal
Imported 35 hexagonal architecture claims.
Customize at: .aphoria/claims.toml
```
## Timeline
- **Week 1:** Manual import (MVP) + LLM extraction prototype
- **Week 2:** Automated pipeline + confidence scoring
- **Week 3:** CLI polish + documentation + examples
- **Week 4:** Dashboard integration + compliance reports
**Total:** 4 weeks for complete feature