jordan 157dbbb9eb feat: Complete Aphoria Phase 8-9 + UAT suite (90/90 tests passing)

## Phase 8: Enterprise Extractor Improvements ✅
- 14 security extractors (TLS, JWT, SQL injection, XSS, etc.)
- 10 framework-specific extractors (Spring, Django, Rails, etc.)
- Config file security detection (YAML, TOML)

## Phase 9: Autonomous Extractor Generation ✅
- Shadow mode executor with TP/FP tracking
- Graduation pipeline with confidence thresholds
- Auto-rollback on regression detection
- Cross-project pattern syncing

## UAT Suite Complete (14 scripts, 90 tests)
- test-core-detection.sh (6 tests)
- test-declarative-extractors.sh (5 tests)
- test-domain-frameworks.sh (5 tests)
- test-domain-unreal.sh (3 tests)
- test-llm-extraction.sh (6 tests)
- test-eval-harness.sh (5 tests)
- test-cross-language.sh (3 tests)
- test-precommit-performance.sh (4 tests)
- test-output-formats.sh (8 tests)
- test-drift-detection.sh (6 tests)
- test-exit-codes.sh (12 tests)
+ 3 more scripts

## Other Changes
- Updated roadmap to mark Phase 8-9 complete
- Added .gitignore entries for build artifacts
- Updated pre-commit: 800 line limit, exclude tests/data/cmd

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2026-02-06 22:50:55 -07:00

LLM Optimization Quick Start

Get started with LLM extraction optimization in 15 minutes.

Prerequisites

Aphoria built and working
GEMINI_API_KEY set in environment
Fixtures exist in tests/llm_fixtures/

Step 1: Validate Setup (2 min)

# Check fixtures are valid
aphoria eval validate-fixtures --fixtures tests/llm_fixtures

# Expected: "All fixtures are valid."

Step 2: Run Baseline (5 min)

# Run live evaluation
aphoria eval run --fixtures tests/llm_fixtures --mode live --format table

Record these numbers:

Precision: ______
Recall: ______
F1: ______
Parse Rate: ______%
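
If it helps to sanity-check the numbers you record, the sketch below shows the standard definitions of these four metrics. The `Counts` struct and its field names are hypothetical and are not taken from Aphoria's source; the sketch assumes the harness counts true positives, false positives, and false negatives per run.

```rust
/// Raw counts from one eval run.
/// Hypothetical struct for illustration; not Aphoria's actual type.
#[derive(Debug, Clone, Copy)]
pub struct Counts {
    pub true_pos: u32,        // expected claims that were extracted
    pub false_pos: u32,       // extracted claims that were not expected
    pub false_neg: u32,       // expected claims that were missed
    pub parsed: u32,          // responses that parsed successfully
    pub total_responses: u32, // all responses received
}

/// Divide two counts, returning 0.0 when the denominator is zero.
fn ratio(num: u32, den: u32) -> f64 {
    if den == 0 {
        0.0
    } else {
        f64::from(num) / f64::from(den)
    }
}

/// Share of extracted claims that were correct.
pub fn precision(c: Counts) -> f64 {
    ratio(c.true_pos, c.true_pos + c.false_pos)
}

/// Share of expected claims that were found.
pub fn recall(c: Counts) -> f64 {
    ratio(c.true_pos, c.true_pos + c.false_neg)
}

/// Harmonic mean of precision and recall.
pub fn f1(c: Counts) -> f64 {
    let (p, r) = (precision(c), recall(c));
    if p + r == 0.0 {
        0.0
    } else {
        2.0 * p * r / (p + r)
    }
}

/// Percentage of responses that parsed.
pub fn parse_rate_percent(c: Counts) -> f64 {
    100.0 * ratio(c.parsed, c.total_responses)
}
```

A low precision means too many false positives; a low recall means too many misses. Step 3 uses exactly that split.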

Step 3: Identify Priority (3 min)

Look at the output and answer:

| Question | Answer | Action |
|----------|--------|--------|
| Parse Rate < 95%? | Y/N | Fix output structure first |
| Recall < 70%? | Y/N | Add few-shot examples |
| Precision < 70%? | Y/N | Add negative examples |
| Many subject mismatches? | Y/N | Standardize vocabulary |
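
The checklist above can be read as a small decision function. The sketch below is one possible encoding. It assumes the rows are checked top to bottom (the table only states that parse issues come first), and it takes "many subject mismatches" as a yes/no judgment from the caller because the table gives no number for it. The `Priority` enum and `pick_priority` function are hypothetical names.

```rust
/// The single change to make next, following the Step 3 table.
/// Hypothetical enum for illustration.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Priority {
    FixOutputStructure,
    AddFewShotExamples,
    AddNegativeExamples,
    StandardizeVocabulary,
    NoneNeeded,
}

/// Pick the first matching row of the table.
/// `parse_rate_percent` is 0..=100; `recall` and `precision` are 0.0..=1.0.
pub fn pick_priority(
    parse_rate_percent: f64,
    recall: f64,
    precision: f64,
    many_subject_mismatches: bool,
) -> Priority {
    if parse_rate_percent < 95.0 {
        // Unparseable output hides every other metric, so fix it first.
        Priority::FixOutputStructure
    } else if recall < 0.70 {
        Priority::AddFewShotExamples
    } else if precision < 0.70 {
        Priority::AddNegativeExamples
    } else if many_subject_mismatches {
        Priority::StandardizeVocabulary
    } else {
        Priority::NoneNeeded
    }
}
```

For example, a run with a 97% parse rate, 0.60 recall, and 0.90 precision maps to `AddFewShotExamples`.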

Step 4: Make ONE Change (5 min)

Pick the highest-priority issue and make a single change:

If Parse Issues:

Edit llm/extractor.rs - add response cleaning:

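/// Strip a Markdown code fence (with or without a `json` tag) that wraps
/// the model's reply, so the remainder can be handed to the JSON parser.
/// Only fences at the very start and end of the reply are removed.
fn clean_response(raw: &str) -> String {
    raw.trim()
        .trim_start_matches("```json")
        .trim_start_matches("```")
        .trim_end_matches("```")
        .trim()
        .to_string()
}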
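
Fence trimming only helps when the fence is the first and last thing in the reply. If the model also adds prose around the JSON, a fallback along the following lines may help. This is a sketch under that assumption; `extract_json_object` is a hypothetical helper, not an existing Aphoria function, and it does not validate that the returned slice is well-formed JSON.

```rust
/// Return the slice between the first '{' and the last '}' in `raw`,
/// or `None` if no such pair exists.
/// Hypothetical fallback helper; the caller still has to parse the result.
pub fn extract_json_object(raw: &str) -> Option<&str> {
    let start = raw.find('{')?;
    let end = raw.rfind('}')?;
    if end < start {
        return None;
    }
    // '}' is a single-byte ASCII character, so `..=end` ends on a char boundary.
    Some(&raw[start..=end])
}
```

The helper takes the outermost braces, so nested objects survive intact. A reply whose top-level value is an array would need the same logic with `[` and `]`.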

If Recall Issues:

Edit llm/prompts.rs - add examples:

const EXAMPLES: &str = r#"
Example: verify=False → {"subject": "tls/cert_verification", "predicate": "enabled", "value": false}
"#;

If Precision Issues:

Edit llm/prompts.rs - add what NOT to flag:

const NEGATIVE_EXAMPLES: &str = r#"
Do NOT flag:
- verify=certifi.where() (using CA bundle, this is safe)
- API_KEY = os.environ['KEY'] (from environment, not hardcoded)
"#;
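
How these constants reach the model depends on how llm/prompts.rs assembles its prompt, which this guide does not show. As a rough sketch only, assuming the prompt is a single string built from instructions, examples, and the code under analysis, the pieces might be joined like this. `build_prompt` and its parameters are hypothetical, and the negative examples are shortened here.

```rust
const EXAMPLES: &str = r#"
Example: verify=False → {"subject": "tls/cert_verification", "predicate": "enabled", "value": false}
"#;

const NEGATIVE_EXAMPLES: &str = r#"
Do NOT flag:
- verify=certifi.where() (using CA bundle, this is safe)
"#;

/// Join instructions, positive examples, negative examples, and the code
/// to analyze into one prompt string.
/// Hypothetical assembly; the real prompt layout may differ.
pub fn build_prompt(instructions: &str, source: &str) -> String {
    format!(
        "{}\n\n{}\n\n{}\n\nCode to analyze:\n{}\n",
        instructions.trim(),
        EXAMPLES.trim(),
        NEGATIVE_EXAMPLES.trim(),
        source
    )
}
```

Keeping the positive and negative examples in separate constants makes it easy to change one of them per iteration, which matches the "one change at a time" rule of this step.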

Step 5: Validate Change

# Run eval again
aphoria eval run --fixtures tests/llm_fixtures --mode live --fail-on-regression

If the metrics improved, save the new baseline:

aphoria eval update-baseline --fixtures tests/llm_fixtures --force

If the metrics regressed, revert the change and try a different approach.
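
To make "regressed" concrete, the sketch below compares a run against a baseline and reports every metric that dropped by more than a threshold. It assumes the threshold is an absolute drop (0.05 means five points of precision, recall, or F1); how `--fail-on-regression` actually computes this is not stated here. The `Metrics` struct and `regressions` function are hypothetical.

```rust
/// Headline metrics from one eval run, each in 0.0..=1.0.
/// Hypothetical struct for illustration.
#[derive(Debug, Clone, Copy)]
pub struct Metrics {
    pub precision: f64,
    pub recall: f64,
    pub f1: f64,
}

/// Names of the metrics that fell by more than `threshold` (absolute)
/// relative to the baseline. An empty result means no regression.
pub fn regressions(baseline: &Metrics, current: &Metrics, threshold: f64) -> Vec<&'static str> {
    let pairs = [
        ("precision", baseline.precision, current.precision),
        ("recall", baseline.recall, current.recall),
        ("f1", baseline.f1, current.f1),
    ];
    pairs
        .iter()
        .filter(|(_, base, cur)| base - cur > threshold)
        .map(|(name, _, _)| *name)
        .collect()
}
```

Checking each metric separately, rather than F1 alone, catches a change that trades recall for precision without moving F1 much.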

What's Next?

Read full playbook: playbook.md
Add more fixtures: playbook.md#fixture-writing-guide
Set up CI: playbook.md#ci-integration

Common Commands

# Evaluate all fixtures
aphoria eval run --mode live

# Evaluate one category
aphoria eval run --mode live --category tls

# Use cached responses (fast, deterministic)
aphoria eval run --mode cached

# List all fixtures
aphoria eval list-fixtures

# Check for regressions (CI mode)
aphoria eval run --mode cached --fail-on-regression --threshold 0.05

Troubleshooting

"No fixtures found"

ls tests/llm_fixtures/
# Should see: manifest.toml, tls/, jwt/, etc.

"API error"

[ -n "$GEMINI_API_KEY" ] && echo "set" || echo "missing"
# Should print "set" (this avoids echoing the key itself to the terminal)

"All fixtures failed"

# Run in mock mode to test harness
aphoria eval run --mode mock
# If this also fails, the problem is in the harness rather than the LLM

"Results differ between runs"

LLM output is non-deterministic, so live runs can differ
Use --mode cached for consistent results
Set temperature to 0 in config (if supported)
