jordan 157dbbb9eb feat: Complete Aphoria Phase 8-9 + UAT suite (90/90 tests passing)

2026-02-06 22:50:55 -07:00

LLM Extraction Optimization Playbook

A systematic approach to maximizing LLM extraction reliability and coverage.

Overview

This playbook guides you through the complete process of optimizing Aphoria's LLM extraction system. Follow the phases sequentially, using the decision trees to navigate conditional paths based on your findings.

┌─────────────────────────────────────────────────────────────────────────┐
│                    LLM OPTIMIZATION PATHWAY                              │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Phase 0: Baseline Assessment                                            │
│      ↓                                                                   │
│  Phase 1: Diagnostic Analysis ──────────────────────────────────────────┐│
│      ↓                                           ↓                      ││
│  Phase 2: Quick Wins              Research Required?                    ││
│      ↓                                           ↓                      ││
│  Phase 3: Systematic Improvements    Phase 2R: Research Sprint          ││
│      ↓                                           ↓                      ││
│  Phase 4: Edge Case Hardening    ←───────────────┘                      ││
│      ↓                                                                   │
│  Phase 5: CI Integration & Monitoring                                    │
│      ↓                                                                   │
│  Phase 6: Continuous Improvement Loop                                    │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Phase 0: Baseline Assessment

Goal: Establish current performance metrics before making any changes.

0.1 Run Initial Evaluation

# Ensure fixtures exist
aphoria eval validate-fixtures --fixtures tests/llm_fixtures

# Run live evaluation to get real metrics
aphoria eval run --fixtures tests/llm_fixtures --mode live --format json > baseline-$(date +%Y%m%d).json

# Review results
aphoria eval run --fixtures tests/llm_fixtures --mode live --format table

0.2 Record Baseline Metrics

Create docs/llm-optimization/baselines/YYYY-MM-DD.md:

# Baseline: YYYY-MM-DD

## Overall Metrics
- Precision: X.XX
- Recall: X.XX
- F1: X.XX
- Parse Success Rate: X.XX%

## Per-Category Breakdown
| Category | Fixtures | Precision | Recall | F1 |
|----------|----------|-----------|--------|-----|
| tls      | N        | X.XX      | X.XX   | X.XX |
| jwt      | N        | X.XX      | X.XX   | X.XX |
| secrets  | N        | X.XX      | X.XX   | X.XX |
| auth     | N        | X.XX      | X.XX   | X.XX |

## Known Issues
- [ ] Issue 1
- [ ] Issue 2
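
The metrics recorded in this template follow the standard definitions of precision, recall, and F1. The sketch below shows how they relate to raw match counts. The `Counts` and `Metrics` types and the `compute_metrics` function are hypothetical names for illustration, not Aphoria's actual API, and the playbook does not say whether Aphoria aggregates counts across fixtures or averages per fixture.

```rust
/// Aggregate match counts across fixtures (hypothetical type for illustration).
#[derive(Debug, Clone, Copy)]
pub struct Counts {
    pub true_positives: u32,
    pub false_positives: u32,
    pub false_negatives: u32,
}

/// Precision, recall, and F1 derived from `Counts`.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Metrics {
    pub precision: f64,
    pub recall: f64,
    pub f1: f64,
}

pub fn compute_metrics(c: Counts) -> Metrics {
    let tp = f64::from(c.true_positives);
    let fp = f64::from(c.false_positives);
    let fn_ = f64::from(c.false_negatives);
    // Assumed convention: an empty denominator yields 0.0 instead of NaN.
    let ratio = |num: f64, den: f64| if den == 0.0 { 0.0 } else { num / den };
    let precision = ratio(tp, tp + fp);
    let recall = ratio(tp, tp + fn_);
    // F1 is the harmonic mean of precision and recall.
    let f1 = ratio(2.0 * precision * recall, precision + recall);
    Metrics { precision, recall, f1 }
}
```

Because F1 is a harmonic mean, it is pulled toward the weaker of the two metrics, which is why a recall fix that costs precision can leave F1 flat.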

0.3 Save Official Baseline

aphoria eval update-baseline --fixtures tests/llm_fixtures --force

Decision Point: Is Baseline Acceptable?

IF F1 >= 0.85 AND parse_success_rate >= 0.95:
    → Skip to Phase 4 (Edge Case Hardening)

ELSE IF F1 < 0.50:
    → Major issues exist. Proceed to Phase 1 with priority flag.

ELSE:
    → Normal optimization flow. Proceed to Phase 1.
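
The decision point above can be written as a small function, which is handy if you want a script to print the recommended next step after each eval run. This is a minimal sketch: `NextStep` and `baseline_decision` are hypothetical names, and `parse_success_rate` is assumed to be a fraction in the range 0.0 to 1.0.

```rust
/// Where to go after the baseline run (hypothetical enum for illustration).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum NextStep {
    /// Baseline is already good: go to Phase 4 (Edge Case Hardening).
    SkipToPhase4,
    /// F1 below 0.50: Phase 1 with the priority flag.
    Phase1Priority,
    /// Normal optimization flow: Phase 1.
    Phase1Normal,
}

pub fn baseline_decision(f1: f64, parse_success_rate: f64) -> NextStep {
    if f1 >= 0.85 && parse_success_rate >= 0.95 {
        NextStep::SkipToPhase4
    } else if f1 < 0.50 {
        NextStep::Phase1Priority
    } else {
        NextStep::Phase1Normal
    }
}
```

Note that a high F1 with a parse success rate below 0.95 still falls through to the normal flow, since unreliable parsing undermines every other metric.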

Phase 1: Diagnostic Analysis

Goal: Identify root causes of extraction failures.

1.1 Categorize Failures

Run detailed analysis:

# Get verbose failure information
aphoria eval run --mode live --format json | jq '.fixture_results[] | select(.status == "Failed")'

1.2 Failure Classification Matrix

For each failed fixture, classify the root cause:

| Failure Type | Symptoms | Root Cause | Fix Category |
|--------------|----------|------------|--------------|
| Parse Failure | `parse_success: false` | LLM returned malformed JSON | Prompt/Schema |
| Missing Claim | `false_negatives > 0` | LLM didn't extract expected claim | Prompt/Examples |
| Wrong Subject | Claim exists but subject mismatch | Subject path inconsistency | Normalization |
| Wrong Value | Claim exists but value mismatch | Type coercion or interpretation | Prompt/Matcher |
| Wrong Predicate | Claim exists but predicate mismatch | Vocabulary inconsistency | Prompt/Glossary |
| False Positive | `violations > 0` | LLM extracted non-existent issue | Negative Examples |
| Low Confidence | Claim filtered by min_confidence | LLM under-confident | Calibration |

1.3 Tally Results

## Failure Analysis: YYYY-MM-DD

### By Failure Type
| Type | Count | % of Failures |
|------|-------|---------------|
| Parse Failure | N | X% |
| Missing Claim | N | X% |
| Wrong Subject | N | X% |
| Wrong Value | N | X% |
| Wrong Predicate | N | X% |
| False Positive | N | X% |
| Low Confidence | N | X% |

### Priority Order (fix highest-impact first)
1. [Type] - N failures
2. [Type] - N failures
3. [Type] - N failures

Decision Point: What's the Dominant Failure Mode?

IF parse_failures > 30% of total failures:
    → Proceed to Phase 2A: Output Structure Fixes

ELSE IF missing_claims > 50% of total failures:
    → Proceed to Phase 2B: Recall Improvements

ELSE IF false_positives > 30% of total failures:
    → Proceed to Phase 2C: Precision Improvements

ELSE IF subject/predicate mismatches > 40% of total failures:
    → Proceed to Phase 2D: Normalization Fixes

ELSE:
    → Mixed issues. Proceed through 2A → 2B → 2C → 2D sequentially.
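
As a sketch of how the tally feeds this decision, the function below applies the thresholds in the same order as the rules above. `FailureTally`, `QuickWinPath`, and `dominant_failure_mode` are hypothetical names, and grouping Wrong Value and Low Confidence under `other` is an assumption made for the example.

```rust
/// Failure counts by type (hypothetical struct mirroring the tally table).
#[derive(Debug, Default, Clone, Copy)]
pub struct FailureTally {
    pub parse_failures: u64,
    pub missing_claims: u64,
    pub false_positives: u64,
    /// Wrong Subject and Wrong Predicate rows combined.
    pub subject_predicate_mismatches: u64,
    /// Everything else (Wrong Value, Low Confidence).
    pub other: u64,
}

#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum QuickWinPath {
    OutputStructure, // Phase 2A
    Recall,          // Phase 2B
    Precision,       // Phase 2C
    Normalization,   // Phase 2D
    Sequential,      // 2A, then 2B, 2C, 2D in order
}

/// Returns `None` when there are no failures to classify.
pub fn dominant_failure_mode(t: FailureTally) -> Option<QuickWinPath> {
    let total = t.parse_failures
        + t.missing_claims
        + t.false_positives
        + t.subject_predicate_mismatches
        + t.other;
    if total == 0 {
        return None;
    }
    // Integer comparison avoids floating-point edge cases at exactly 30/40/50%.
    let exceeds = |count: u64, percent: u64| count * 100 > percent * total;
    let path = if exceeds(t.parse_failures, 30) {
        QuickWinPath::OutputStructure
    } else if exceeds(t.missing_claims, 50) {
        QuickWinPath::Recall
    } else if exceeds(t.false_positives, 30) {
        QuickWinPath::Precision
    } else if exceeds(t.subject_predicate_mismatches, 40) {
        QuickWinPath::Normalization
    } else {
        QuickWinPath::Sequential
    };
    Some(path)
}
```

The rules are order-sensitive: parse failures are checked first because no other fix can be measured reliably while responses fail to parse.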

Phase 2: Quick Wins

Phase 2A: Output Structure Fixes

When: Parse failures > 30%

2A.1 Diagnosis

# Check what the LLM is actually returning
# Add debug logging to llm/extractor.rs temporarily
RUST_LOG=debug aphoria scan . --persist 2>&1 | grep "LLM response"

Common parse issues:

Markdown code fences around JSON
Extra text before/after JSON
Nested JSON (JSON inside string)
Truncated response (token limit)
Wrong array structure

2A.2 Fixes by Issue

Issue: Markdown code fences

// In llm/extractor.rs - add response cleaning
// Strips a surrounding Markdown code fence (with or without a "json" tag), if present.
fn clean_response(raw: &str) -> String {
    let trimmed = raw.trim();
    // Try the more specific opening fence first.
    let body = match trimmed
        .strip_prefix("```json")
        .or_else(|| trimmed.strip_prefix("```"))
    {
        Some(body) => body,
        None => return trimmed.to_string(),
    };
    // A truncated response may lack the closing fence, so strip it only if present.
    body.strip_suffix("```")
        .unwrap_or(body)
        .trim()
        .to_string()
}

Issue: Extra text around JSON

// Find JSON array in response
fn extract_json_array(raw: &str) -> Option<&str> {
    let start = raw.find('[')?;
    let end = raw.rfind(']')?;
    if end > start {
        Some(&raw[start..=end])
    } else {
        None
    }
}
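
The `find`/`rfind` approach above can over-capture when the trailing text also contains a `]`. A possible refinement, sketched here under the assumption that the response holds a single top-level array, is to scan for the matching closing bracket while ignoring brackets inside JSON string literals. The name `extract_balanced_array` is hypothetical.

```rust
/// Returns the first balanced JSON array in `raw`, ignoring brackets that
/// appear inside string literals. Returns `None` if no array is found or the
/// array is never closed (for example, a truncated response).
pub fn extract_balanced_array(raw: &str) -> Option<&str> {
    let start = raw.find('[')?;
    let mut depth = 0usize;
    let mut in_string = false;
    let mut escaped = false;
    for (offset, ch) in raw[start..].char_indices() {
        if in_string {
            if escaped {
                escaped = false; // this character was escaped, skip it
            } else if ch == '\\' {
                escaped = true;
            } else if ch == '"' {
                in_string = false;
            }
            continue;
        }
        match ch {
            '"' => in_string = true,
            '[' => depth += 1,
            ']' => {
                depth -= 1;
                if depth == 0 {
                    // `]` is a single byte, so the inclusive end is a char boundary.
                    return Some(&raw[start..=start + offset]);
                }
            }
            _ => {}
        }
    }
    None
}
```

This still starts at the first `[` in the response, so a bracket in leading prose (such as "found [2] issues") would be picked up instead. Cleaning code fences and preamble first reduces that risk.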

Issue: Token limit truncation

Reduce input context
Request fewer claims per call
Implement chunking

Issue: Inconsistent structure

Add explicit schema to prompt:

Return ONLY a JSON array with this exact structure:
[
  {
    "subject": "category/specific_thing",
    "predicate": "property_name",
    "value": <boolean|number|string>,
    "confidence": <0.0-1.0>,
    "line": <line_number>,
    "rationale": "why this was extracted"
  }
]

2A.3 Validate Fix

aphoria eval run --mode live --fail-on-regression
# Parse success rate should improve

Checkpoint: If parse_success_rate >= 95%, proceed. Otherwise, consider:

IF still failing:
    → Research: Check Gemini API docs for structured-output options (response_mime_type, response_schema)
    → Research: Consider switching to function calling mode

Phase 2B: Recall Improvements

When: Missing claims (false negatives) > 50%

2B.1 Analyze What's Being Missed

# For each failed fixture, check what was expected vs extracted
aphoria eval run --mode live --format json | \
  jq '.fixture_results[] | select(.false_negatives > 0) | {id: .fixture_id, unmatched: .unmatched_expectations}'

2B.2 Common Recall Issues & Fixes

Issue: LLM doesn't recognize the pattern

Add few-shot examples to prompt:

// In llm/prompts.rs
const FEW_SHOT_EXAMPLES: &str = r#"
Example 1:
Input: `requests.get(url, verify=False)`
Output: [{"subject": "tls/cert_verification", "predicate": "enabled", "value": false, ...}]

Example 2:
Input: `jwt.decode(token, algorithms=['none', 'HS256'])`
Output: [{"subject": "jwt/algorithms", "predicate": "allows_none", "value": true, ...}]
"#;

Issue: LLM extracts with different subject path

Check actual vs expected subjects:

Expected: tls/cert_verification
Actual:   security/tls/verify_disabled

→ Either: Update fixtures to match LLM's vocabulary
→ Or: Add subject mapping instructions to prompt
→ Or: Improve matcher's tail-path matching

Issue: LLM only extracts first finding

Add explicit instruction:

IMPORTANT: Extract ALL security-relevant claims from the code.
Do not stop after finding one issue - scan the entire file.

Issue: LLM misses subtle patterns

Add pattern hints:

Look for these specific patterns:
- verify=False, VERIFY=False, ssl_verify=False → TLS verification disabled
- algorithms=['none'], algorithm='none' → JWT algorithm none attack
- API_KEY = "...", apiKey: "..." → Hardcoded secrets

2B.3 Validate Recall Improvements

aphoria eval run --mode live
# Check: recall should increase, precision should not drop significantly

Checkpoint:

IF recall improved AND precision dropped > 10%:
    → Proceed to Phase 2C immediately (balance precision)

ELSE IF recall still < 0.70:
    → Research Sprint: Analyze why specific patterns aren't recognized
    → Consider: Are fixtures too strict? Update matcher tolerance?

Phase 2C: Precision Improvements

When: False positives or violations > 30%

2C.1 Analyze False Positives

# What is the LLM incorrectly flagging?
aphoria eval run --mode live --format json | \
  jq '.fixture_results[] | select(.violations > 0) | {id: .fixture_id, violations: .violation_details}'

2C.2 Common Precision Issues & Fixes

Issue: Flagging safe patterns as issues

Add negative examples to prompt:

const NEGATIVE_EXAMPLES: &str = r#"
These are SAFE and should NOT be flagged:

- `verify=certifi.where()` → NOT a TLS issue (using CA bundle)
- `API_KEY = os.environ['API_KEY']` → NOT hardcoded (from environment)
- `algorithms=['HS256', 'RS256']` → NOT algorithm none (secure algorithms only)
"#;

Issue: Over-eager pattern matching

Add specificity requirements:

Only extract claims when you are CONFIDENT the code is problematic.
Do NOT flag:
- Comments discussing security issues (not actual code)
- Test files demonstrating vulnerabilities (not production code)
- Variables that MIGHT contain sensitive data but don't clearly do so

Issue: Misinterpreting context

Add context awareness:

Consider the full context:
- Is this in a test file? Test code may intentionally have "bad" patterns.
- Is this in a comment? Comments are not executable code.
- Is the value from a secure source (environment, secrets manager)?

2C.3 Validate Precision Improvements

aphoria eval run --mode live
# Check: precision should increase, recall should not drop significantly

Checkpoint:

IF precision improved AND recall dropped > 10%:
    → Go back to 2B, find balance

ELSE IF precision still < 0.70:
    → Research: Are fixtures correctly marked as "must_not_contain"?
    → Consider: Is the LLM's interpretation actually valid? Update fixtures?

Phase 2D: Normalization Fixes

When: Subject/predicate mismatches > 40%

2D.1 Identify Vocabulary Mismatches

# Compare LLM output subjects vs expected subjects
# Manual review of failures

Common mismatches:

| LLM Output | Expected | Fix |
|------------|----------|-----|
| `security/tls/disabled` | `tls/cert_verification` | Prompt vocabulary |
| `auth/jwt/none_allowed` | `jwt/algorithms` | Subject mapping |
| `credentials/hardcoded` | `secrets/api_key` | Standardize categories |
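
Where a mismatch is handled by subject mapping rather than by changing the prompt, a small alias table applied to LLM output before matching is one option. The sketch below uses the three example pairs from the table above. `SUBJECT_ALIASES` and `normalize_subject` are hypothetical names, not part of Aphoria.

```rust
/// Illustrative alias table: (LLM output, canonical subject).
const SUBJECT_ALIASES: &[(&str, &str)] = &[
    ("security/tls/disabled", "tls/cert_verification"),
    ("auth/jwt/none_allowed", "jwt/algorithms"),
    ("credentials/hardcoded", "secrets/api_key"),
];

/// Lowercases and trims the subject, then maps known aliases to their
/// canonical form. Unknown subjects pass through unchanged (after cleanup).
pub fn normalize_subject(raw: &str) -> String {
    let cleaned = raw.trim().trim_matches('/').to_ascii_lowercase();
    for (alias, canonical) in SUBJECT_ALIASES {
        if *alias == cleaned.as_str() {
            return (*canonical).to_string();
        }
    }
    cleaned
}
```

A mapping like `credentials/hardcoded` to `secrets/api_key` is lossy, since the original could equally refer to a password or token. Prefer fixing such cases in the prompt vocabulary and keep the alias table for unambiguous renames.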

2D.2 Fix Options

Option A: Update Prompt Vocabulary

Define exact subject paths in prompt:

Use these exact subject paths:
- TLS issues: tls/cert_verification, tls/min_version, tls/cipher_suites
- JWT issues: jwt/algorithms, jwt/validation, jwt/expiry
- Secrets: secrets/api_key, secrets/token, secrets/password
- Auth: auth/bypass, auth/session, auth/password_policy

Option B: Improve Matcher Tolerance

In eval/matcher.rs, enhance subject_matches():

fn subject_matches(&self, extracted: &str, expected: &str) -> bool {
    // Existing tail-path matching
    let ext_tail = self.tail_path(extracted, 2);
    let exp_tail = self.tail_path(expected, 2);
    if ext_tail == exp_tail {
        return true;
    }

    // Add synonym matching
    let synonyms = [
        ("cert_verification", "verify"),
        ("cert_verification", "ssl_verify"),
        ("api_key", "apikey"),
        ("api_key", "secret_key"),
    ];

    for (a, b) in synonyms {
        if (ext_tail.contains(&a) && exp_tail.contains(&b)) ||
           (ext_tail.contains(&b) && exp_tail.contains(&a)) {
            return true;
        }
    }

    false
}
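
The `subject_matches` function above relies on a `tail_path` helper that is not shown. A minimal sketch follows, assuming it returns the last `n` slash-separated segments as a lowercase `String`. It is written as a free function for clarity, and the real helper in `eval/matcher.rs` may differ.

```rust
/// Returns the last `n` `/`-separated segments of `subject`, lowercased and
/// re-joined with `/`. Empty segments (from leading, trailing, or doubled
/// slashes) are ignored.
pub fn tail_path(subject: &str, n: usize) -> String {
    let segments: Vec<&str> = subject.split('/').filter(|s| !s.is_empty()).collect();
    // saturating_sub keeps the whole path when it has fewer than `n` segments.
    let start = segments.len().saturating_sub(n);
    segments[start..].join("/").to_ascii_lowercase()
}
```

Tail matching alone resolves an extra prefix (`security/tls/cert_verification` against `tls/cert_verification`), but not a renamed leaf (`security/tls/verify_disabled`), which is where the synonym list comes in.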

Option C: Update Fixtures to Match LLM

If LLM's vocabulary is actually better/more consistent:

# Update fixture subjects to match what LLM produces
# This is valid if LLM's naming is more intuitive

2D.3 Validate

aphoria eval run --mode live
# Subject/predicate mismatches should decrease

Phase 2R: Research Sprint

When: Quick wins don't resolve issues, or unknown problems arise.

2R.1 Research Checklist

## Research Log: [Issue Description]

### Problem Statement
- What specifically is failing?
- What have we tried?
- Why didn't it work?

### Research Tasks
- [ ] Review Gemini API documentation for relevant features
- [ ] Check if other projects solved similar issues
- [ ] Experiment with different prompt structures
- [ ] Test alternative models (if available)
- [ ] Review academic papers on code analysis with LLMs

### Experiments
| Experiment | Hypothesis | Result |
|------------|------------|--------|
| | | |

### Conclusion
- What worked?
- What didn't?
- Recommended approach?

2R.2 Common Research Topics

Topic: Structured Output

Gemini supports response_mime_type: "application/json"
May need response_schema for strict typing
Test: Does this improve parse success rate?

Topic: Function Calling

Alternative to free-form JSON output
Define claims as function parameters
Test: More reliable structure?

Topic: Chain of Thought

Ask LLM to reason before extracting
May improve accuracy on subtle patterns
Test: Does CoT help without hurting speed?

Topic: Fine-Tuning

Create training dataset from fixtures
Fine-tune model on security extraction
Test: Significant quality improvement?

Topic: Multi-Pass Extraction

First pass: identify potential issues
Second pass: validate and detail each
Test: Higher precision with acceptable latency?

2R.3 Research Output

Document findings in docs/llm-optimization/research/:

research/
  structured-output.md
  chain-of-thought.md
  fine-tuning-feasibility.md

Phase 3: Systematic Improvements

Goal: Implement comprehensive prompt and system improvements.

3.1 Prompt Architecture

Restructure prompt for maximum clarity:

const SYSTEM_PROMPT: &str = r#"
You are a security code analyzer. Your task is to extract security-relevant claims from source code.

## Output Format
Return a JSON array. Each claim must have:
- subject: Category path (e.g., "tls/cert_verification")
- predicate: Property name (e.g., "enabled")
- value: The extracted value (boolean, number, or string)
- confidence: Your confidence 0.0-1.0
- line: Line number where found
- rationale: Brief explanation

## Subject Categories
- tls/*: TLS/SSL configuration
- jwt/*: JWT token handling
- secrets/*: Credentials and API keys
- auth/*: Authentication mechanisms
- crypto/*: Cryptographic settings

## What to Extract
[Positive examples here]

## What NOT to Extract
[Negative examples here]

## Important Rules
1. Extract ALL findings, not just the first one
2. Only flag actual code, not comments
3. Consider context (test files, environment variables)
4. Be conservative - only flag clear issues
"#;

3.2 Confidence Calibration

Improve confidence scoring:

const CONFIDENCE_GUIDANCE: &str = r#"
Set confidence based on:
- 0.95-1.0: Explicit, unambiguous code (verify=False)
- 0.80-0.94: Clear pattern, minor ambiguity
- 0.60-0.79: Likely issue, needs context
- 0.40-0.59: Possible issue, uncertain
- Below 0.40: Don't report (too uncertain)
"#;
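
To check whether the model actually follows this guidance, it can help to bucket reported confidences into the same bands and compare each band's hit rate against fixtures. The sketch below is one way to do the bucketing. `ConfidenceBand` and `confidence_band` are hypothetical names, and the bands are treated as contiguous (for example, 0.945 falls into the 0.80 band).

```rust
/// Bands mirroring the confidence guidance in the prompt.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum ConfidenceBand {
    Explicit,    // 0.95 to 1.0: explicit, unambiguous code
    Clear,       // 0.80 up to 0.95: clear pattern, minor ambiguity
    Likely,      // 0.60 up to 0.80: likely issue, needs context
    Possible,    // 0.40 up to 0.60: possible issue, uncertain
    DoNotReport, // below 0.40, or not a valid confidence at all
}

pub fn confidence_band(confidence: f64) -> ConfidenceBand {
    // NaN and out-of-range values are treated as unreportable.
    if !(0.0..=1.0).contains(&confidence) {
        return ConfidenceBand::DoNotReport;
    }
    if confidence >= 0.95 {
        ConfidenceBand::Explicit
    } else if confidence >= 0.80 {
        ConfidenceBand::Clear
    } else if confidence >= 0.60 {
        ConfidenceBand::Likely
    } else if confidence >= 0.40 {
        ConfidenceBand::Possible
    } else {
        ConfidenceBand::DoNotReport
    }
}
```

If claims in the `Explicit` band are wrong noticeably more often than claims in `Clear`, the model is over-confident and the guidance wording needs another pass.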

3.3 Language-Specific Handling

Add language-specific prompt sections:

fn get_language_hints(language: Language) -> &'static str {
    match language {
        Language::Python => r#"
Python-specific patterns:
- requests: verify=False
- urllib3: disable_warnings(), cert_reqs='CERT_NONE'
- ssl: CERT_NONE, check_hostname=False
"#,
        Language::JavaScript => r#"
JavaScript/Node.js patterns:
- https: rejectUnauthorized: false
- axios: httpsAgent with rejectUnauthorized: false
- NODE_TLS_REJECT_UNAUTHORIZED=0
"#,
        // ... other languages
        // Fallback keeps the match exhaustive until the remaining arms are written
        _ => "",
    }
}

3.4 Validate Comprehensive Changes

# Run full evaluation
aphoria eval run --mode live --format table

# Compare to baseline
# Expect: F1 improvement, no significant regressions

Phase 4: Edge Case Hardening

Goal: Handle unusual inputs gracefully.

4.1 Edge Case Categories

| Category | Example | Expected Behavior |
|----------|---------|-------------------|
| Empty file | 0 bytes | No claims, no error |
| Binary file | .exe, .png | Skip gracefully |
| Huge file | >100KB code | Chunk or skip with warning |
| Minified code | single-line JS | Best effort extraction |
| Mixed language | HTML with JS | Detect embedded languages |
| Unicode | Non-ASCII identifiers | Handle correctly |
| Syntax errors | Invalid code | Extract what's possible |

4.2 Add Edge Case Fixtures

# Create edge case fixtures
cat > tests/llm_fixtures/edge/edge-002-minified.toml << 'EOF'
[metadata]
id = "edge-002"
name = "Minified JavaScript"
category = "edge"
language = "javascript"

[input]
filename = "bundle.min.js"
content = "var a=require('https');a.get({rejectUnauthorized:false},function(r){});"

[expected]
must_contain = [
    { subject = "tls/cert_verification", predicate = "enabled", value = false }
]
EOF

4.3 Implement Defensive Code

// In llm/extractor.rs
impl LlmExtractor {
    pub fn extract(&self, content: &str, language: Language) -> Vec<ExtractedClaim> {
        // Edge case: empty content
        if content.trim().is_empty() {
            return Vec::new();
        }

        // Edge case: binary content. Checked before the size limit so that
        // large binary blobs are skipped rather than chunked and sent to the LLM.
        if content.bytes().any(|b| b == 0) {
            debug!("Binary content detected, skipping");
            return Vec::new();
        }

        // Edge case: too large
        if content.len() > MAX_CONTENT_SIZE {
            warn!(size = content.len(), "Content too large, chunking");
            return self.extract_chunked(content, language);
        }

        // Normal extraction
        self.extract_internal(content, language)
    }
}

4.4 Validate Edge Cases

aphoria eval run --mode live --category edge
# All edge cases should pass or gracefully skip

Phase 5: CI Integration & Monitoring

Goal: Prevent regressions and track quality over time.

5.1 CI Pipeline

# .github/workflows/llm-quality.yml
name: LLM Extraction Quality

on:
  pull_request:
    paths:
      - 'applications/aphoria/src/llm/**'
      - 'applications/aphoria/tests/llm_fixtures/**'
  schedule:
    - cron: '0 6 * * 1'  # Weekly, Monday 06:00 UTC

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup Rust
        uses: dtolnay/rust-toolchain@stable

      - name: Cache LLM responses
        uses: actions/cache@v4
        with:
          path: ~/.cache/aphoria/llm_cache
          key: llm-cache-${{ hashFiles('applications/aphoria/tests/llm_fixtures/**') }}

      - name: Run Evaluation
        env:
          GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}
        run: |
          cargo run -p aphoria -- eval run \
            --mode cached \
            --fail-on-regression \
            --threshold 0.05 \
            --format json > eval-results.json          

      - name: Upload Results
        uses: actions/upload-artifact@v4
        with:
          name: eval-results
          path: eval-results.json

      - name: Comment PR with Results
        if: github.event_name == 'pull_request'
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const results = JSON.parse(fs.readFileSync('eval-results.json'));
            const body = `## LLM Eval Results
            | Metric | Value |
            |--------|-------|
            | Precision | ${results.metrics.precision.toFixed(2)} |
            | Recall | ${results.metrics.recall.toFixed(2)} |
            | F1 | ${results.metrics.f1.toFixed(2)} |
            | Verdict | ${results.verdict} |`;
            await github.rest.issues.createComment({
              owner: context.repo.owner,
              repo: context.repo.repo,
              issue_number: context.issue.number,
              body: body
            });

5.2 Monitoring Dashboard

Track metrics over time:

# docs/llm-optimization/metrics-history.md

| Date | Precision | Recall | F1 | Notes |
|------|-----------|--------|-----|-------|
| 2026-02-05 | 0.78 | 0.72 | 0.75 | Initial baseline |
| 2026-02-06 | 0.82 | 0.74 | 0.78 | Added few-shot examples |
| 2026-02-07 | 0.85 | 0.76 | 0.80 | Fixed parse issues |

5.3 Alert Thresholds

# aphoria.toml
[eval]
# Fail CI if metrics drop below these
min_precision = 0.75
min_recall = 0.70
min_f1 = 0.72
regression_threshold = 0.05
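
The sketch below illustrates how these thresholds might be applied as a CI gate. It is not Aphoria's implementation: `Metrics`, `Thresholds`, and `gate_violations` are hypothetical names, and `regression_threshold` is assumed to mean the largest allowed absolute drop in any metric relative to the saved baseline.

```rust
#[derive(Debug, Clone, Copy)]
pub struct Metrics {
    pub precision: f64,
    pub recall: f64,
    pub f1: f64,
}

/// Mirrors the `[eval]` keys shown above.
#[derive(Debug, Clone, Copy)]
pub struct Thresholds {
    pub min_precision: f64,
    pub min_recall: f64,
    pub min_f1: f64,
    pub regression_threshold: f64,
}

/// Returns one message per violated rule; an empty list means the gate passes.
pub fn gate_violations(current: Metrics, baseline: Metrics, t: &Thresholds) -> Vec<String> {
    let checks = [
        ("precision", current.precision, baseline.precision, t.min_precision),
        ("recall", current.recall, baseline.recall, t.min_recall),
        ("f1", current.f1, baseline.f1, t.min_f1),
    ];
    let mut violations = Vec::new();
    for (name, now, before, floor) in checks {
        // Absolute floor: the metric must not fall below its configured minimum.
        if now < floor {
            violations.push(format!("{} {:.2} is below the minimum {:.2}", name, now, floor));
        }
        // Relative check: the metric must not drop too far below the baseline.
        if before - now > t.regression_threshold {
            violations.push(format!(
                "{} regressed by {:.2} (allowed {:.2})",
                name,
                before - now,
                t.regression_threshold
            ));
        }
    }
    violations
}
```

Keeping the floor and the regression check separate means a slow decline is still caught: each small drop may pass the regression check, but the floor eventually trips.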

Phase 6: Continuous Improvement Loop

Goal: Establish ongoing optimization process.

6.1 Weekly Cadence

## Monday: Review
- [ ] Check CI results from past week
- [ ] Review any failed fixtures
- [ ] Identify top 3 improvement opportunities

## Wednesday: Implement
- [ ] Pick one improvement from Monday's list
- [ ] Implement and test locally
- [ ] Run full eval suite

## Friday: Deploy
- [ ] If metrics improved, merge changes
- [ ] Update baseline
- [ ] Document what changed and why

6.2 Fixture Expansion

Add new fixtures when:

New vulnerability pattern discovered
False positive reported by user
New language/framework support added
Edge case found in production

# Fixture addition checklist
- [ ] Create fixture file in appropriate category
- [ ] Add both must_contain and must_not_contain
- [ ] Run validation: aphoria eval validate-fixtures
- [ ] Run eval to verify fixture works: aphoria eval run --max-fixtures 1
- [ ] Update manifest.toml category count

6.3 Prompt Version Control

Track prompt changes:

// llm/prompts.rs
pub const PROMPT_VERSION: &str = "1.2.0";

// In changelog:
// 1.2.0 - Added negative examples for safe patterns
// 1.1.0 - Added language-specific hints
// 1.0.0 - Initial structured prompt

6.4 Quarterly Review

## Quarterly LLM Optimization Review

### Metrics Trend
- Q1 Start: F1 = 0.XX
- Q1 End: F1 = 0.XX
- Improvement: +X%

### Major Changes
1. Change description
2. Change description

### Lessons Learned
- What worked well?
- What didn't work?
- What should we try next quarter?

### Next Quarter Goals
- [ ] Goal 1
- [ ] Goal 2

Appendix A: Common Issues Reference

A.1 "LLM returns empty array"

Symptoms: No claims extracted from files that clearly have issues.

Diagnosis:

# Check if LLM is being called
RUST_LOG=debug aphoria scan . --persist 2>&1 | grep -i llm

Possible Causes:

LLM disabled in config → Enable in aphoria.toml
File filtered out → Check file size/type filters
API error → Check API key and quota
Prompt too restrictive → Loosen "what to extract" section

A.2 "Parse failures spike after API update"

Symptoms: Suddenly many parse failures.

Diagnosis:

# Check raw responses
# Add temporary logging to see what API returns

Possible Causes:

API response format changed
Model version updated
Rate limiting affecting responses

Fix: Update response parsing to handle new format.

A.3 "Good local results, bad CI results"

Symptoms: Eval passes locally but fails in CI.

Possible Causes:

Cache inconsistency → Clear and rebuild cache
Environment differences → Check env vars
Timeout issues → Increase CI timeout
API key issues → Verify CI secrets

A.4 "Precision tanked after recall improvement"

Symptoms: Improved recall but precision dropped significantly.

Fix:

Add negative examples to balance
Increase confidence threshold
Add more must_not_contain fixtures
Make extraction criteria more specific

A.5 "Works for Python, fails for Go"

Symptoms: Language-specific extraction issues.

Fix:

Add language-specific prompt hints
Add fixtures for failing language
Check if patterns are language-specific
Consider language-specific extraction paths

Appendix B: Fixture Writing Guide

B.1 Good Fixture Characteristics

Minimal: Only include code necessary to demonstrate the issue
Clear: Obvious what the security issue is
Realistic: Resembles actual production code
Isolated: Tests one concept per fixture

B.2 Fixture Template

[metadata]
id = "category-NNN"
name = "Brief description of what this tests"
category = "tls|jwt|secrets|auth|crypto|negative|edge"
language = "python|javascript|go|rust|java|..."
difficulty = "easy|medium|hard"
source = "hand-curated|real-world|generated"
created = "YYYY-MM-DD"
notes = "Any additional context"

[input]
filename = "example.py"
content = """
# Minimal code that demonstrates the issue
actual_code_here()
"""

[expected]
# Inline tables must each stay on a single line in TOML 1.0
must_contain = [
    { subject = "category/specific_thing", predicate = "property", value = <expected_value>, rationale = "Why this should be extracted" },
]

must_not_contain = [
    { subject = "category/other_thing", predicate = "property", value = <wrong_value>, rationale = "Why this should NOT be extracted" },
]

[scoring]
weight = 1.0
min_confidence = 0.7

B.3 Category Guidelines

| Category | What to Include | Subject Prefix |
|----------|-----------------|----------------|
| tls | Certificate, protocol, cipher issues | `tls/` |
| jwt | Token validation, algorithm issues | `jwt/` |
| secrets | Hardcoded credentials, keys, tokens | `secrets/` |
| auth | Authentication bypass, weak auth | `auth/` |
| crypto | Weak algorithms, short keys | `crypto/` |
| negative | Safe patterns (no findings expected) | N/A |
| edge | Boundary conditions, unusual input | N/A |

Appendix C: Decision Tree Summary

START
  │
  ├─→ Run baseline eval
  │     │
  │     ├─→ F1 >= 0.85 and parse success >= 95%? ──→ Skip to Phase 4
  │     │
  │     └─→ Otherwise ──→ Continue to diagnosis
  │
  ├─→ Classify failures
  │     │
  │     ├─→ Parse failures > 30%? ──→ Phase 2A (output structure)
  │     │
  │     ├─→ Missing claims > 50%? ──→ Phase 2B (recall)
  │     │
  │     ├─→ False positives > 30%? ──→ Phase 2C (precision)
  │     │
  │     └─→ Subject mismatches > 40%? ──→ Phase 2D (normalization)
  │
  ├─→ After each phase:
  │     │
  │     ├─→ Improved? ──→ Continue to next phase
  │     │
  │     ├─→ Regressed? ──→ Revert, try different approach
  │     │
  │     └─→ Stuck? ──→ Phase 2R (research sprint)
  │
  ├─→ All phases complete?
  │     │
  │     ├─→ F1 >= target? ──→ Phase 4 (edge cases), Phase 5 (CI)
  │     │
  │     └─→ F1 < target? ──→ Research: model limits, fine-tuning, alternatives
  │
  └─→ Ongoing: Phase 6 (continuous improvement)

Quick Reference Commands

# Run evaluation
aphoria eval run --fixtures tests/llm_fixtures --mode live

# Run specific category
aphoria eval run --category tls --mode live

# Check for regressions
aphoria eval run --mode cached --fail-on-regression

# Update baseline
aphoria eval update-baseline --fixtures tests/llm_fixtures --force

# List fixtures
aphoria eval list-fixtures

# Validate fixtures
aphoria eval validate-fixtures

# Export JSON for analysis
aphoria eval run --mode live --format json > results.json
