LLM Extraction Optimization

A systematic approach to maximizing the quality of Aphoria's LLM extraction.

| Document | When to Use |
|----------|-------------|
| Quick Start | First time optimizing, want to get started fast |
| Full Playbook | Comprehensive optimization guide with decision trees |
| Baseline Template | Recording metrics after each optimization cycle |
| Research Template | Investigating unknown issues or new approaches |

Current Status

Latest Baseline: 2026-02-06

| Metric | Current | Target | Status |
|--------|---------|--------|--------|
| Precision | 0.93 | 0.80 | Exceeded |
| Recall | 1.00 | 0.75 | Exceeded |
| F1 | 0.96 | 0.77 | Exceeded |
| Parse Rate | 100% | 95% | |
| Fixtures Passing | 10/10 | - | All pass |

Verdict: PASS - All metrics exceed targets.
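
As a sanity check on the table above, the F1 value can be reproduced from the reported precision and recall. The sketch below assumes the eval harness uses the standard harmonic-mean definition of F1 and that a metric passes when it meets or exceeds its target; neither detail is confirmed here, and `f1_score` is an illustrative helper, not part of Aphoria.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the status table above.
current = {"precision": 0.93, "recall": 1.00}
targets = {"precision": 0.80, "recall": 0.75, "f1": 0.77}

current["f1"] = f1_score(current["precision"], current["recall"])

# Assumption: a metric passes when it meets or exceeds its target.
results = {name: current[name] >= target for name, target in targets.items()}
verdict = "PASS" if all(results.values()) else "FAIL"

print(f"F1 = {current['f1']:.2f}")  # F1 = 0.96
print(f"Verdict: {verdict}")        # Verdict: PASS
```

Because the table reports rounded values, the recomputed F1 only needs to match to two decimal places.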

Directory Structure

```
docs/llm-optimization/
├── index.md              # This file
├── quickstart.md         # 15-minute getting started
├── playbook.md           # Full optimization guide
├── baselines/            # Historical metrics
│   ├── template.md
│   └── YYYY-MM-DD.md     # One per baseline
└── research/             # Investigation notes
    ├── template.md
    └── [topic].md        # One per research topic
```

Key Commands

```bash
# Run evaluation
aphoria eval run --fixtures tests/llm_fixtures --mode live

# Check for regressions (CI)
aphoria eval run --mode cached --fail-on-regression

# Update baseline after improvements
aphoria eval update-baseline --force

# List fixtures
aphoria eval list-fixtures

# Validate fixtures
aphoria eval validate-fixtures
```

Optimization Flow

1. Run baseline evaluation
       ↓
2. Identify failure categories
       ↓
3. Apply targeted fixes (one at a time!)
       ↓
4. Validate: did metrics improve?
       ↓
   YES → Save new baseline, continue to next issue
   NO  → Revert, try different approach or research
       ↓
5. Repeat until targets met
       ↓
6. Set up CI to prevent regressions
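
The decision in step 4 can be sketched as a small comparison between the saved baseline and the metrics from a candidate run. This is only an illustration: `decide`, the `tolerance` parameter, and the rule that "improved" means at least one metric rises while none drops are assumptions, and the numbers in the usage lines are hypothetical, not measured results.

```python
from typing import Dict

METRICS = ("precision", "recall", "f1")


def decide(baseline: Dict[str, float], candidate: Dict[str, float],
           tolerance: float = 0.0) -> str:
    """Return "save" if the candidate improves on the baseline, else "revert".

    Assumption: "improved" means no metric drops by more than `tolerance`
    and at least one metric rises.
    """
    deltas = {m: candidate[m] - baseline[m] for m in METRICS}
    regressed = any(d < -tolerance for d in deltas.values())
    improved = any(d > 0 for d in deltas.values())
    return "save" if improved and not regressed else "revert"


# Hypothetical numbers, for illustration only.
baseline = {"precision": 0.85, "recall": 0.90, "f1": 0.87}
better = {"precision": 0.90, "recall": 0.90, "f1": 0.90}
worse = {"precision": 0.88, "recall": 0.80, "f1": 0.84}

print(decide(baseline, better))  # save
print(decide(baseline, worse))   # revert
```

Applying one fix at a time, as step 3 asks, is what makes this comparison meaningful: any change in the deltas can be attributed to a single edit.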

Fixture Locations

| Category | Path | Count |
|----------|------|-------|
| TLS | tests/llm_fixtures/tls/ | 2 |
| JWT | tests/llm_fixtures/jwt/ | 2 |
| Secrets | tests/llm_fixtures/secrets/ | 2 |
| Auth | tests/llm_fixtures/auth/ | 1 |
| Negative | tests/llm_fixtures/negative/ | 2 |
| Edge | tests/llm_fixtures/edge/ | 1 |
| Total | | 10 |

Related paths:

  • Prompt source: src/llm/prompts.rs
  • Extractor: src/llm/extractor.rs
  • Client: src/llm/client.rs
  • Eval harness: src/eval/harness.rs
  • Fixtures: tests/llm_fixtures/

Contributing Fixtures

See the Fixture Writing Guide in the playbook (playbook.md).

Quick checklist:

  • Create a TOML file in the appropriate category folder
  • Include both `must_contain` and `must_not_contain`
  • Run `aphoria eval validate-fixtures`
  • Test with `aphoria eval run --max-fixtures 1`
  • Update the category counts in `manifest.toml`
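
The last checklist item is easy to forget, so a quick count of fixture files per category can help catch stale numbers. The sketch below assumes each fixture is a single `.toml` file sitting directly inside its category folder; it does not read `manifest.toml`, whose layout is not described here. `count_fixtures` and `count_mismatches` are hypothetical helpers, and the demo builds a throwaway directory rather than touching the real fixtures.

```python
import tempfile
from pathlib import Path
from typing import Dict, Tuple


def count_fixtures(root: Path) -> Dict[str, int]:
    """Count *.toml fixture files in each category folder under `root`."""
    return {
        category.name: sum(1 for _ in category.glob("*.toml"))
        for category in sorted(root.iterdir())
        if category.is_dir()
    }


def count_mismatches(actual: Dict[str, int],
                     expected: Dict[str, int]) -> Dict[str, Tuple[int, int]]:
    """Return {category: (actual, expected)} for every count that differs."""
    categories = sorted(set(actual) | set(expected))
    return {
        c: (actual.get(c, 0), expected.get(c, 0))
        for c in categories
        if actual.get(c, 0) != expected.get(c, 0)
    }


# Counts taken from the Fixture Locations table.
EXPECTED = {"tls": 2, "jwt": 2, "secrets": 2, "auth": 1, "negative": 2, "edge": 1}

# Demo on a temporary tree: recreate the documented layout, then add one
# extra edge fixture without updating the expected counts.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for category, n in EXPECTED.items():
        (root / category).mkdir()
        for i in range(n):
            (root / category / f"example_{i}.toml").touch()
    (root / "edge" / "new_case.toml").touch()
    mismatches = count_mismatches(count_fixtures(root), EXPECTED)

print(mismatches)  # {'edge': (2, 1)}
```

Against the real repository, `root` would be `Path("tests/llm_fixtures")`, and the expected counts would come from whatever `manifest.toml` records.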