stemedb/applications/aphoria/docs/llm-optimization/playbook.md
2026-02-06 22:50:55 -07:00


LLM Extraction Optimization Playbook

A systematic approach to maximizing LLM extraction reliability and coverage.

Overview

This playbook guides you through the complete process of optimizing Aphoria's LLM extraction system. Follow the phases sequentially, using the decision trees to navigate conditional paths based on your findings.

┌─────────────────────────────────────────────────────────────────────────┐
│                    LLM OPTIMIZATION PATHWAY                              │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Phase 0: Baseline Assessment                                            │
│      ↓                                                                   │
│  Phase 1: Diagnostic Analysis ──────────────────────────────────────────┐│
│      ↓                                           ↓                      ││
│  Phase 2: Quick Wins              Research Required?                    ││
│      ↓                                           ↓                      ││
│  Phase 3: Systematic Improvements    Phase 2R: Research Sprint          ││
│      ↓                                           ↓                      ││
│  Phase 4: Edge Case Hardening    ←───────────────┘                      ││
│      ↓                                                                   │
│  Phase 5: CI Integration & Monitoring                                    │
│      ↓                                                                   │
│  Phase 6: Continuous Improvement Loop                                    │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Phase 0: Baseline Assessment

Goal: Establish current performance metrics before making any changes.

0.1 Run Initial Evaluation

# Ensure fixtures exist
aphoria eval validate-fixtures --fixtures tests/llm_fixtures

# Run live evaluation to get real metrics
aphoria eval run --fixtures tests/llm_fixtures --mode live --format json > baseline-$(date +%Y%m%d).json

# Review results
aphoria eval run --fixtures tests/llm_fixtures --mode live --format table

0.2 Record Baseline Metrics

Create docs/llm-optimization/baselines/YYYY-MM-DD.md:

# Baseline: YYYY-MM-DD

## Overall Metrics
- Precision: X.XX
- Recall: X.XX
- F1: X.XX
- Parse Success Rate: X.XX%

## Per-Category Breakdown
| Category | Fixtures | Precision | Recall | F1 |
|----------|----------|-----------|--------|-----|
| tls      | N        | X.XX      | X.XX   | X.XX |
| jwt      | N        | X.XX      | X.XX   | X.XX |
| secrets  | N        | X.XX      | X.XX   | X.XX |
| auth     | N        | X.XX      | X.XX   | X.XX |

## Known Issues
- [ ] Issue 1
- [ ] Issue 2
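The playbook does not say how Aphoria derives the three headline numbers, so the following is only a sketch using the standard definitions of precision, recall, and F1. The `Counts` struct and `precision_recall_f1` function are hypothetical names introduced for illustration and are not part of Aphoria.

```rust
/// Aggregate match counts across all fixtures.
/// Hypothetical type for illustration; not Aphoria's API.
#[derive(Debug, Clone, Copy)]
struct Counts {
    true_positives: u32,
    false_positives: u32,
    false_negatives: u32,
}

/// Returns (precision, recall, f1) using the standard definitions.
/// Each value is 0.0 when its denominator would be zero.
fn precision_recall_f1(c: Counts) -> (f64, f64, f64) {
    let tp = f64::from(c.true_positives);
    let fp = f64::from(c.false_positives);
    let fn_ = f64::from(c.false_negatives);

    let precision = if tp + fp > 0.0 { tp / (tp + fp) } else { 0.0 };
    let recall = if tp + fn_ > 0.0 { tp / (tp + fn_) } else { 0.0 };
    let f1 = if precision + recall > 0.0 {
        2.0 * precision * recall / (precision + recall)
    } else {
        0.0
    };
    (precision, recall, f1)
}
```

With 8 true positives, 2 false positives, and 2 false negatives, all three values come out to 0.80, which would go straight into the "Overall Metrics" section of the template.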

0.3 Save Official Baseline

aphoria eval update-baseline --fixtures tests/llm_fixtures --force

Decision Point: Is Baseline Acceptable?

IF F1 >= 0.85 AND parse_success_rate >= 0.95:
    → Skip to Phase 4 (Edge Case Hardening)

ELSE IF F1 < 0.50:
    → Major issues exist. Proceed to Phase 1 with priority flag.

ELSE:
    → Normal optimization flow. Proceed to Phase 1.
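The decision point above can be written as a small function, which is handy if you want a script to print the recommended next step from the baseline JSON. This is a sketch only: `NextStep` and `baseline_next_step` are hypothetical names, and the thresholds are copied from the rules above.

```rust
/// Where to go after the baseline run. Hypothetical type for illustration.
#[derive(Debug, PartialEq)]
enum NextStep {
    /// Baseline is already good: go to Phase 4 (Edge Case Hardening).
    SkipToPhase4,
    /// F1 below 0.50: Phase 1 with the priority flag.
    Phase1Priority,
    /// Everything else: normal Phase 1 flow.
    Phase1Normal,
}

/// Both arguments are fractions in [0, 1] (0.95 means 95%).
fn baseline_next_step(f1: f64, parse_success_rate: f64) -> NextStep {
    if f1 >= 0.85 && parse_success_rate >= 0.95 {
        NextStep::SkipToPhase4
    } else if f1 < 0.50 {
        NextStep::Phase1Priority
    } else {
        NextStep::Phase1Normal
    }
}
```

Note that a high F1 with a parse success rate below 0.95 falls through to the normal flow, since both conditions must hold to skip ahead.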

Phase 1: Diagnostic Analysis

Goal: Identify root causes of extraction failures.

1.1 Categorize Failures

Run detailed analysis:

# Get verbose failure information
aphoria eval run --mode live --format json | jq '.fixture_results[] | select(.status == "Failed")'

1.2 Failure Classification Matrix

For each failed fixture, classify the root cause:

| Failure Type | Symptoms | Root Cause | Fix Category |
|--------------|----------|------------|--------------|
| Parse Failure | parse_success: false | LLM returned malformed JSON | Prompt/Schema |
| Missing Claim | false_negatives > 0 | LLM didn't extract expected claim | Prompt/Examples |
| Wrong Subject | Claim exists but subject mismatch | Subject path inconsistency | Normalization |
| Wrong Value | Claim exists but value mismatch | Type coercion or interpretation | Prompt/Matcher |
| Wrong Predicate | Claim exists but predicate mismatch | Vocabulary inconsistency | Prompt/Glossary |
| False Positive | violations > 0 | LLM extracted non-existent issue | Negative Examples |
| Low Confidence | Claim filtered by min_confidence | LLM under-confident | Calibration |

1.3 Tally Results

## Failure Analysis: YYYY-MM-DD

### By Failure Type
| Type | Count | % of Failures |
|------|-------|---------------|
| Parse Failure | N | X% |
| Missing Claim | N | X% |
| Wrong Subject | N | X% |
| Wrong Value | N | X% |
| Wrong Predicate | N | X% |
| False Positive | N | X% |
| Low Confidence | N | X% |

### Priority Order (fix highest-impact first)
1. [Type] - N failures
2. [Type] - N failures
3. [Type] - N failures

Decision Point: What's the Dominant Failure Mode?

IF parse_failures > 30% of total failures:
    → Proceed to Phase 2A: Output Structure Fixes

ELSE IF missing_claims > 50% of total failures:
    → Proceed to Phase 2B: Recall Improvements

ELSE IF false_positives > 30% of total failures:
    → Proceed to Phase 2C: Precision Improvements

ELSE IF subject/predicate mismatches > 40% of total failures:
    → Proceed to Phase 2D: Normalization Fixes

ELSE:
    → Mixed issues. Proceed through 2A → 2B → 2C → 2D sequentially.
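The branching above is order-sensitive: the first rule that fires wins, so parse failures take priority even when missing claims are also high. A minimal sketch of that logic follows. `FailureTally`, `Phase2Path`, and `choose_phase2_path` are hypothetical names, and the `other` bucket is an assumption added to hold failure types that have no rule of their own (for example low-confidence filtering).

```rust
/// Failure counts from the tally in 1.3. Hypothetical type for illustration.
#[derive(Debug, Default, Clone, Copy)]
struct FailureTally {
    parse_failures: u32,
    missing_claims: u32,
    false_positives: u32,
    subject_predicate_mismatches: u32,
    /// Assumption: anything not covered by the four rules.
    other: u32,
}

#[derive(Debug, PartialEq)]
enum Phase2Path {
    OutputStructure, // Phase 2A
    Recall,          // Phase 2B
    Precision,       // Phase 2C
    Normalization,   // Phase 2D
    Sequential,      // 2A → 2B → 2C → 2D
}

/// Returns None when there are no failures at all.
fn choose_phase2_path(t: FailureTally) -> Option<Phase2Path> {
    let total = t.parse_failures
        + t.missing_claims
        + t.false_positives
        + t.subject_predicate_mismatches
        + t.other;
    if total == 0 {
        return None;
    }
    // Share of all failures attributed to one failure type.
    let share = |n: u32| f64::from(n) / f64::from(total);

    // Rules are checked in the same order as the decision point.
    Some(if share(t.parse_failures) > 0.30 {
        Phase2Path::OutputStructure
    } else if share(t.missing_claims) > 0.50 {
        Phase2Path::Recall
    } else if share(t.false_positives) > 0.30 {
        Phase2Path::Precision
    } else if share(t.subject_predicate_mismatches) > 0.40 {
        Phase2Path::Normalization
    } else {
        Phase2Path::Sequential
    })
}
```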

Phase 2: Quick Wins

Phase 2A: Output Structure Fixes

When: Parse failures > 30%

2A.1 Diagnosis

# Check what the LLM is actually returning
# Add debug logging to llm/extractor.rs temporarily
RUST_LOG=debug aphoria scan . --persist 2>&1 | grep "LLM response"

Common parse issues:

  • Markdown code fences around JSON
  • Extra text before/after JSON
  • Nested JSON (JSON inside string)
  • Truncated response (token limit)
  • Wrong array structure

2A.2 Fixes by Issue

Issue: Markdown code fences

// In llm/extractor.rs - add response cleaning
fn clean_response(raw: &str) -> String {
    let trimmed = raw.trim();

    // No opening fence: nothing to strip.
    let Some(rest) = trimmed.strip_prefix("```") else {
        return trimmed.to_string();
    };

    // Drop an optional "json" language tag after the opening fence.
    let rest = rest.strip_prefix("json").unwrap_or(rest);

    // The closing fence may be missing if the response was truncated,
    // so strip it only when present instead of giving up entirely.
    let body = rest.strip_suffix("```").unwrap_or(rest);

    body.trim().to_string()
}

Issue: Extra text around JSON

// Find JSON array in response
fn extract_json_array(raw: &str) -> Option<&str> {
    let start = raw.find('[')?;
    let end = raw.rfind(']')?;
    if end > start {
        Some(&raw[start..=end])
    } else {
        None
    }
}

Issue: Token limit truncation

  • Reduce input context
  • Request fewer claims per call
  • Implement chunking
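One way to implement chunking is to split the input on line boundaries and keep track of each chunk's starting line, so that line numbers reported by the LLM can be mapped back to the original file. The function below is a sketch under that assumption. `chunk_by_lines` is a hypothetical helper, not an existing Aphoria function, and it budgets by bytes rather than tokens.

```rust
/// Splits `content` into chunks of at most `max_bytes` bytes, breaking only
/// at line boundaries. Each chunk is paired with the 1-based line number of
/// its first line. A single line longer than `max_bytes` is kept whole and
/// becomes its own oversized chunk.
fn chunk_by_lines(content: &str, max_bytes: usize) -> Vec<(usize, &str)> {
    let mut chunks = Vec::new();
    let mut chunk_start = 0; // byte offset where the current chunk begins
    let mut chunk_first_line = 1; // line number of the current chunk's first line
    let mut offset = 0; // byte offset of the line being visited
    let mut line_no = 1;

    for line in content.split_inclusive('\n') {
        let chunk_len = offset - chunk_start;
        // Close the current chunk if this line would push it over budget.
        if chunk_len > 0 && chunk_len + line.len() > max_bytes {
            chunks.push((chunk_first_line, &content[chunk_start..offset]));
            chunk_start = offset;
            chunk_first_line = line_no;
        }
        offset += line.len();
        line_no += 1;
    }

    // Flush whatever remains.
    if offset > chunk_start {
        chunks.push((chunk_first_line, &content[chunk_start..offset]));
    }
    chunks
}
```

When merging results, add `first_line - 1` to every `line` value returned for a chunk. Splitting on lines can still separate a pattern from its context, so some overlap between chunks may be worth testing.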

Issue: Inconsistent structure

Add explicit schema to prompt:

Return ONLY a JSON array with this exact structure:
[
  {
    "subject": "category/specific_thing",
    "predicate": "property_name",
    "value": <boolean|number|string>,
    "confidence": <0.0-1.0>,
    "line": <line_number>,
    "rationale": "why this was extracted"
  }
]

2A.3 Validate Fix

aphoria eval run --mode live --fail-on-regression
# Parse success rate should improve

Checkpoint: If parse_success_rate >= 95%, proceed. Otherwise, consider:

IF still failing:
    → Research: Check Gemini API docs for structured-output options (response_mime_type, response_schema)
    → Research: Consider switching to function calling mode

Phase 2B: Recall Improvements

When: Missing claims (false negatives) > 50%

2B.1 Analyze What's Being Missed

# For each failed fixture, check what was expected vs extracted
aphoria eval run --mode live --format json | \
  jq '.fixture_results[] | select(.false_negatives > 0) | {id: .fixture_id, unmatched: .unmatched_expectations}'

2B.2 Common Recall Issues & Fixes

Issue: LLM doesn't recognize the pattern

Add few-shot examples to prompt:

// In llm/prompts.rs
const FEW_SHOT_EXAMPLES: &str = r#"
Example 1:
Input: `requests.get(url, verify=False)`
Output: [{"subject": "tls/cert_verification", "predicate": "enabled", "value": false, ...}]

Example 2:
Input: `jwt.decode(token, algorithms=['none', 'HS256'])`
Output: [{"subject": "jwt/algorithms", "predicate": "allows_none", "value": true, ...}]
"#;

Issue: LLM extracts with different subject path

Check actual vs expected subjects:

Expected: tls/cert_verification
Actual:   security/tls/verify_disabled

→ Either: Update fixtures to match LLM's vocabulary
→ Or: Add subject mapping instructions to prompt
→ Or: Improve matcher's tail-path matching

Issue: LLM only extracts first finding

Add explicit instruction:

IMPORTANT: Extract ALL security-relevant claims from the code.
Do not stop after finding one issue - scan the entire file.

Issue: LLM misses subtle patterns

Add pattern hints:

Look for these specific patterns:
- verify=False, VERIFY=False, ssl_verify=False → TLS verification disabled
- algorithms=['none'], algorithm='none' → JWT algorithm none attack
- API_KEY = "...", apiKey: "..." → Hardcoded secrets

2B.3 Validate Recall Improvements

aphoria eval run --mode live
# Check: recall should increase, precision should not drop significantly

Checkpoint:

IF recall improved AND precision dropped > 10%:
    → Proceed to Phase 2C immediately (balance precision)

ELSE IF recall still < 0.70:
    → Research Sprint: Analyze why specific patterns aren't recognized
    → Consider: Are fixtures too strict? Update matcher tolerance?

Phase 2C: Precision Improvements

When: False positives or violations > 30%

2C.1 Analyze False Positives

# What is the LLM incorrectly flagging?
aphoria eval run --mode live --format json | \
  jq '.fixture_results[] | select(.violations > 0) | {id: .fixture_id, violations: .violation_details}'

2C.2 Common Precision Issues & Fixes

Issue: Flagging safe patterns as issues

Add negative examples to prompt:

const NEGATIVE_EXAMPLES: &str = r#"
These are SAFE and should NOT be flagged:

- `verify=certifi.where()` → NOT a TLS issue (using CA bundle)
- `API_KEY = os.environ['API_KEY']` → NOT hardcoded (from environment)
- `algorithms=['HS256', 'RS256']` → NOT algorithm none (secure algorithms only)
"#;

Issue: Over-eager pattern matching

Add specificity requirements:

Only extract claims when you are CONFIDENT the code is problematic.
Do NOT flag:
- Comments discussing security issues (not actual code)
- Test files demonstrating vulnerabilities (not production code)
- Variables that MIGHT contain sensitive data but don't clearly do so

Issue: Misinterpreting context

Add context awareness:

Consider the full context:
- Is this in a test file? Test code may intentionally have "bad" patterns.
- Is this in a comment? Comments are not executable code.
- Is the value from a secure source (environment, secrets manager)?

2C.3 Validate Precision Improvements

aphoria eval run --mode live
# Check: precision should increase, recall should not drop significantly

Checkpoint:

IF precision improved AND recall dropped > 10%:
    → Go back to 2B, find balance

ELSE IF precision still < 0.70:
    → Research: Are fixtures correctly marked as "must_not_contain"?
    → Consider: Is the LLM's interpretation actually valid? Update fixtures?

Phase 2D: Normalization Fixes

When: Subject/predicate mismatches > 40%

2D.1 Identify Vocabulary Mismatches

# Compare LLM output subjects vs expected subjects
# Manual review of failures

Common mismatches:

| LLM Output | Expected | Fix |
|------------|----------|-----|
| security/tls/disabled | tls/cert_verification | Prompt vocabulary |
| auth/jwt/none_allowed | jwt/algorithms | Subject mapping |
| credentials/hardcoded | secrets/api_key | Standardize categories |

2D.2 Fix Options

Option A: Update Prompt Vocabulary

Define exact subject paths in prompt:

Use these exact subject paths:
- TLS issues: tls/cert_verification, tls/min_version, tls/cipher_suites
- JWT issues: jwt/algorithms, jwt/validation, jwt/expiry
- Secrets: secrets/api_key, secrets/token, secrets/password
- Auth: auth/bypass, auth/session, auth/password_policy

Option B: Improve Matcher Tolerance

In eval/matcher.rs, enhance subject_matches():

fn subject_matches(&self, extracted: &str, expected: &str) -> bool {
    // Existing tail-path matching
    let ext_tail = self.tail_path(extracted, 2);
    let exp_tail = self.tail_path(expected, 2);
    if ext_tail == exp_tail {
        return true;
    }

    // Add synonym matching
    let synonyms = [
        ("cert_verification", "verify"),
        ("cert_verification", "ssl_verify"),
        ("api_key", "apikey"),
        ("api_key", "secret_key"),
    ];

    for (a, b) in synonyms {
        if (ext_tail.contains(&a) && exp_tail.contains(&b)) ||
           (ext_tail.contains(&b) && exp_tail.contains(&a)) {
            return true;
        }
    }

    false
}
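The snippet above relies on a `tail_path` helper that the playbook does not show. The following is a guess at what such a helper could look like, written as a free function returning a `String` rather than a method; the real implementation in eval/matcher.rs may differ.

```rust
/// Returns the last `n` `/`-separated segments of `path`, rejoined with `/`.
/// If the path has fewer than `n` segments, the whole path is returned.
/// Hypothetical sketch; the real matcher may differ.
fn tail_path(path: &str, n: usize) -> String {
    let segments: Vec<&str> = path.split('/').filter(|s| !s.is_empty()).collect();
    let start = segments.len().saturating_sub(n);
    segments[start..].join("/")
}
```

With `n = 2`, `security/tls/cert_verification` and `tls/cert_verification` share the same tail and match directly. A pair such as `security/tls/verify_disabled` versus `tls/cert_verification` does not, which is the case the synonym table is meant to catch.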

Option C: Update Fixtures to Match LLM

If LLM's vocabulary is actually better/more consistent:

# Update fixture subjects to match what LLM produces
# This is valid if LLM's naming is more intuitive

2D.3 Validate

aphoria eval run --mode live
# Subject/predicate mismatches should decrease

Phase 2R: Research Sprint

When: Quick wins don't resolve issues, or unknown problems arise.

2R.1 Research Checklist

## Research Log: [Issue Description]

### Problem Statement
- What specifically is failing?
- What have we tried?
- Why didn't it work?

### Research Tasks
- [ ] Review Gemini API documentation for relevant features
- [ ] Check if other projects solved similar issues
- [ ] Experiment with different prompt structures
- [ ] Test alternative models (if available)
- [ ] Review academic papers on code analysis with LLMs

### Experiments
| Experiment | Hypothesis | Result |
|------------|------------|--------|
| | | |

### Conclusion
- What worked?
- What didn't?
- Recommended approach?

2R.2 Common Research Topics

Topic: Structured Output

  • Gemini supports response_mime_type: "application/json"
  • May need response_schema for strict typing
  • Test: Does this improve parse success rate?

Topic: Function Calling

  • Alternative to free-form JSON output
  • Define claims as function parameters
  • Test: More reliable structure?

Topic: Chain of Thought

  • Ask LLM to reason before extracting
  • May improve accuracy on subtle patterns
  • Test: Does CoT help without hurting speed?

Topic: Fine-Tuning

  • Create training dataset from fixtures
  • Fine-tune model on security extraction
  • Test: Significant quality improvement?

Topic: Multi-Pass Extraction

  • First pass: identify potential issues
  • Second pass: validate and detail each
  • Test: Higher precision with acceptable latency?

2R.3 Research Output

Document findings in docs/llm-optimization/research/:

research/
  structured-output.md
  chain-of-thought.md
  fine-tuning-feasibility.md

Phase 3: Systematic Improvements

Goal: Implement comprehensive prompt and system improvements.

3.1 Prompt Architecture

Restructure prompt for maximum clarity:

const SYSTEM_PROMPT: &str = r#"
You are a security code analyzer. Your task is to extract security-relevant claims from source code.

## Output Format
Return a JSON array. Each claim must have:
- subject: Category path (e.g., "tls/cert_verification")
- predicate: Property name (e.g., "enabled")
- value: The extracted value (boolean, number, or string)
- confidence: Your confidence 0.0-1.0
- line: Line number where found
- rationale: Brief explanation

## Subject Categories
- tls/*: TLS/SSL configuration
- jwt/*: JWT token handling
- secrets/*: Credentials and API keys
- auth/*: Authentication mechanisms
- crypto/*: Cryptographic settings

## What to Extract
[Positive examples here]

## What NOT to Extract
[Negative examples here]

## Important Rules
1. Extract ALL findings, not just the first one
2. Only flag actual code, not comments
3. Consider context (test files, environment variables)
4. Be conservative - only flag clear issues
"#;

3.2 Confidence Calibration

Improve confidence scoring:

const CONFIDENCE_GUIDANCE: &str = r#"
Set confidence based on:
- 0.95-1.0: Explicit, unambiguous code (verify=False)
- 0.80-0.94: Clear pattern, minor ambiguity
- 0.60-0.79: Likely issue, needs context
- 0.40-0.59: Possible issue, uncertain
- Below 0.40: Don't report (too uncertain)
"#;
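On the consuming side, the guidance above pairs naturally with a threshold filter. The sketch below maps a confidence score to the bands listed in the prompt and drops claims under a minimum. The `Claim` struct, the band labels, and both function names are hypothetical and chosen only for illustration.

```rust
/// Minimal claim shape for illustration; not Aphoria's actual type.
#[derive(Debug, Clone, PartialEq)]
struct Claim {
    subject: String,
    predicate: String,
    confidence: f64,
}

/// Maps a confidence score to the bands used in the prompt guidance.
/// The labels themselves are made up for this sketch.
fn confidence_band(confidence: f64) -> &'static str {
    match confidence {
        c if c >= 0.95 => "explicit",
        c if c >= 0.80 => "clear",
        c if c >= 0.60 => "likely",
        c if c >= 0.40 => "possible",
        _ => "do-not-report",
    }
}

/// Keeps only claims at or above `min_confidence`.
fn filter_by_confidence(claims: Vec<Claim>, min_confidence: f64) -> Vec<Claim> {
    claims
        .into_iter()
        .filter(|c| c.confidence >= min_confidence)
        .collect()
}
```

Logging the band of every filtered-out claim during an eval run is one way to see whether the "Low Confidence" failure type comes from the model under-reporting or from a threshold that is set too high.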

3.3 Language-Specific Handling

Add language-specific prompt sections:

fn get_language_hints(language: Language) -> &'static str {
    match language {
        Language::Python => r#"
Python-specific patterns:
- requests: verify=False, session.verify = False
- urllib3: disable_warnings()
- ssl: CERT_NONE, check_hostname=False
"#,
        Language::JavaScript => r#"
JavaScript/Node.js patterns:
- https: rejectUnauthorized: false
- axios: httpsAgent with rejectUnauthorized
- NODE_TLS_REJECT_UNAUTHORIZED=0
"#,
        // ... other languages
        // Fallback so the match stays exhaustive until every language has hints
        _ => "",
    }
}

3.4 Validate Comprehensive Changes

# Run full evaluation
aphoria eval run --mode live --format table

# Compare to baseline
# Expect: F1 improvement, no significant regressions

Phase 4: Edge Case Hardening

Goal: Handle unusual inputs gracefully.

4.1 Edge Case Categories

| Category | Example | Expected Behavior |
|----------|---------|-------------------|
| Empty file | 0 bytes | No claims, no error |
| Binary file | .exe, .png | Skip gracefully |
| Huge file | >100KB code | Chunk or skip with warning |
| Minified code | single-line JS | Best effort extraction |
| Mixed language | HTML with JS | Detect embedded languages |
| Unicode | Non-ASCII identifiers | Handle correctly |
| Syntax errors | Invalid code | Extract what's possible |

4.2 Add Edge Case Fixtures

# Create edge case fixtures
cat > tests/llm_fixtures/edge/edge-002-minified.toml << 'EOF'
[metadata]
id = "edge-002"
name = "Minified JavaScript"
category = "edge"
language = "javascript"

[input]
filename = "bundle.min.js"
content = "var a=require('https');a.get({rejectUnauthorized:false},function(r){});"

[expected]
must_contain = [
    { subject = "tls/cert_verification", predicate = "enabled", value = false }
]
EOF

4.3 Implement Defensive Code

// In llm/extractor.rs
impl LlmExtractor {
    pub fn extract(&self, content: &str, language: Language) -> Vec<ExtractedClaim> {
        // Edge case: empty content
        if content.trim().is_empty() {
            return Vec::new();
        }

        // Edge case: binary content (checked before the size limit so a
        // large binary blob is never chunked and sent to the LLM)
        if content.bytes().any(|b| b == 0) {
            debug!("Binary content detected, skipping");
            return Vec::new();
        }

        // Edge case: too large
        if content.len() > MAX_CONTENT_SIZE {
            warn!(size = content.len(), "Content too large, chunking");
            return self.extract_chunked(content, language);
        }

        // Normal extraction
        self.extract_internal(content, language)
    }
}

4.4 Validate Edge Cases

aphoria eval run --mode live --category edge
# All edge cases should pass or gracefully skip

Phase 5: CI Integration & Monitoring

Goal: Prevent regressions and track quality over time.

5.1 CI Pipeline

# .github/workflows/llm-quality.yml
name: LLM Extraction Quality

on:
  pull_request:
    paths:
      - 'applications/aphoria/src/llm/**'
      - 'applications/aphoria/tests/llm_fixtures/**'
  schedule:
    - cron: '0 6 * * 1'  # Weekly Monday 6am

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup Rust
        uses: dtolnay/rust-toolchain@stable

      - name: Cache LLM responses
        uses: actions/cache@v4
        with:
          path: ~/.cache/aphoria/llm_cache
          key: llm-cache-${{ hashFiles('applications/aphoria/tests/llm_fixtures/**') }}

      - name: Run Evaluation
        env:
          GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}
        run: |
          cargo run -p aphoria -- eval run \
            --mode cached \
            --fail-on-regression \
            --threshold 0.05 \
            --format json > eval-results.json

      - name: Upload Results
        uses: actions/upload-artifact@v4
        with:
          name: eval-results
          path: eval-results.json

      - name: Comment PR with Results
        if: github.event_name == 'pull_request'
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const results = JSON.parse(fs.readFileSync('eval-results.json'));
            const body = `## LLM Eval Results
            | Metric | Value |
            |--------|-------|
            | Precision | ${results.metrics.precision.toFixed(2)} |
            | Recall | ${results.metrics.recall.toFixed(2)} |
            | F1 | ${results.metrics.f1.toFixed(2)} |
            | Verdict | ${results.verdict} |`;
            await github.rest.issues.createComment({
              owner: context.repo.owner,
              repo: context.repo.repo,
              issue_number: context.issue.number,
              body: body
            });

5.2 Monitoring Dashboard

Track metrics over time:

# docs/llm-optimization/metrics-history.md

| Date | Precision | Recall | F1 | Notes |
|------|-----------|--------|-----|-------|
| 2026-02-05 | 0.78 | 0.72 | 0.75 | Initial baseline |
| 2026-02-06 | 0.82 | 0.74 | 0.78 | Added few-shot examples |
| 2026-02-07 | 0.85 | 0.76 | 0.80 | Fixed parse issues |

5.3 Alert Thresholds

# aphoria.toml
[eval]
# Fail CI if metrics drop below these
min_precision = 0.75
min_recall = 0.70
min_f1 = 0.72
regression_threshold = 0.05
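The playbook does not spell out how these settings are evaluated. The sketch below is one plausible reading: each metric has an absolute floor, and `regression_threshold` is treated as the maximum allowed absolute drop in F1 relative to the saved baseline. All type and function names are hypothetical, and the F1-only regression check is an assumption.

```rust
#[derive(Debug, Clone, Copy)]
struct Metrics {
    precision: f64,
    recall: f64,
    f1: f64,
}

/// Mirrors the [eval] keys shown above.
#[derive(Debug, Clone, Copy)]
struct EvalThresholds {
    min_precision: f64,
    min_recall: f64,
    min_f1: f64,
    regression_threshold: f64,
}

/// Returns one message per violated rule; an empty Vec means CI can pass.
fn check_thresholds(current: Metrics, baseline: Metrics, t: EvalThresholds) -> Vec<String> {
    let mut failures = Vec::new();

    // Absolute floors.
    let floors = [
        ("precision", current.precision, t.min_precision),
        ("recall", current.recall, t.min_recall),
        ("f1", current.f1, t.min_f1),
    ];
    for (name, value, floor) in floors {
        if value < floor {
            failures.push(format!("{name} {value:.2} is below minimum {floor:.2}"));
        }
    }

    // Assumption: regression is measured as an absolute drop in F1.
    let drop = baseline.f1 - current.f1;
    if drop > t.regression_threshold {
        failures.push(format!(
            "f1 regressed by {drop:.2} (limit {:.2})",
            t.regression_threshold
        ));
    }
    failures
}
```

Under this reading, a run can clear every floor and still fail: an F1 of 0.73 is above `min_f1 = 0.72`, but it fails against a baseline of 0.80 because the drop of 0.07 exceeds 0.05.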

Phase 6: Continuous Improvement Loop

Goal: Establish ongoing optimization process.

6.1 Weekly Cadence

## Monday: Review
- [ ] Check CI results from past week
- [ ] Review any failed fixtures
- [ ] Identify top 3 improvement opportunities

## Wednesday: Implement
- [ ] Pick one improvement from Monday's list
- [ ] Implement and test locally
- [ ] Run full eval suite

## Friday: Deploy
- [ ] If metrics improved, merge changes
- [ ] Update baseline
- [ ] Document what changed and why

6.2 Fixture Expansion

Add new fixtures when:

  • New vulnerability pattern discovered
  • False positive reported by user
  • New language/framework support added
  • Edge case found in production

# Fixture addition checklist
- [ ] Create fixture file in appropriate category
- [ ] Add both must_contain and must_not_contain
- [ ] Run validation: aphoria eval validate-fixtures
- [ ] Run eval to verify fixture works: aphoria eval run --max-fixtures 1
- [ ] Update manifest.toml category count

6.3 Prompt Version Control

Track prompt changes:

// llm/prompts.rs
pub const PROMPT_VERSION: &str = "1.2.0";

// In changelog:
// 1.2.0 - Added negative examples for safe patterns
// 1.1.0 - Added language-specific hints
// 1.0.0 - Initial structured prompt

6.4 Quarterly Review

## Quarterly LLM Optimization Review

### Metrics Trend
- Q1 Start: F1 = 0.XX
- Q1 End: F1 = 0.XX
- Improvement: +X%

### Major Changes
1. Change description
2. Change description

### Lessons Learned
- What worked well?
- What didn't work?
- What should we try next quarter?

### Next Quarter Goals
- [ ] Goal 1
- [ ] Goal 2

Appendix A: Common Issues Reference

A.1 "LLM returns empty array"

Symptoms: No claims extracted from files that clearly have issues.

Diagnosis:

# Check if LLM is being called
RUST_LOG=debug aphoria scan . --persist 2>&1 | grep -i llm

Possible Causes:

  1. LLM disabled in config → Enable in aphoria.toml
  2. File filtered out → Check file size/type filters
  3. API error → Check API key and quota
  4. Prompt too restrictive → Loosen "what to extract" section

A.2 "Parse failures spike after API update"

Symptoms: Suddenly many parse failures.

Diagnosis:

# Check raw responses
# Add temporary logging to see what API returns

Possible Causes:

  1. API response format changed
  2. Model version updated
  3. Rate limiting affecting responses

Fix: Update response parsing to handle new format.

A.3 "Good local results, bad CI results"

Symptoms: Eval passes locally but fails in CI.

Possible Causes:

  1. Cache inconsistency → Clear and rebuild cache
  2. Environment differences → Check env vars
  3. Timeout issues → Increase CI timeout
  4. API key issues → Verify CI secrets

A.4 "Precision tanked after recall improvement"

Symptoms: Improved recall but precision dropped significantly.

Fix:

  1. Add negative examples to balance
  2. Increase confidence threshold
  3. Add more must_not_contain fixtures
  4. Make extraction criteria more specific

A.5 "Works for Python, fails for Go"

Symptoms: Language-specific extraction issues.

Fix:

  1. Add language-specific prompt hints
  2. Add fixtures for failing language
  3. Check if patterns are language-specific
  4. Consider language-specific extraction paths

Appendix B: Fixture Writing Guide

B.1 Good Fixture Characteristics

  • Minimal: Only include code necessary to demonstrate the issue
  • Clear: Obvious what the security issue is
  • Realistic: Resembles actual production code
  • Isolated: Tests one concept per fixture

B.2 Fixture Template

[metadata]
id = "category-NNN"
name = "Brief description of what this tests"
category = "tls|jwt|secrets|auth|crypto|negative|edge"
language = "python|javascript|go|rust|java|..."
difficulty = "easy|medium|hard"
source = "hand-curated|real-world|generated"
created = "YYYY-MM-DD"
notes = "Any additional context"

[input]
filename = "example.py"
content = """
# Minimal code that demonstrates the issue
actual_code_here()
"""

# TOML 1.0 inline tables must fit on one line, so multi-field
# expectations are written as arrays of tables instead.
[[expected.must_contain]]
subject = "category/specific_thing"
predicate = "property"
value = <expected_value>
rationale = "Why this should be extracted"

[[expected.must_not_contain]]
subject = "category/other_thing"
predicate = "property"
value = <wrong_value>
rationale = "Why this should NOT be extracted"

[scoring]
weight = 1.0
min_confidence = 0.7
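To make the roles of `must_contain` and `must_not_contain` concrete, here is a simplified sketch of how a fixture could be scored. It uses exact matching on subject, predicate, and value, whereas the real matcher is described elsewhere in this playbook as using tail-path matching, so treat it as an illustration only. The types and the `score_fixture` function are hypothetical; the field names `false_negatives` and `violations` follow the JSON keys used in the jq queries earlier.

```rust
/// Simplified claim: values are compared as strings for brevity.
#[derive(Debug, Clone, PartialEq)]
struct Claim {
    subject: String,
    predicate: String,
    value: String,
}

#[derive(Debug, PartialEq)]
struct FixtureScore {
    /// Expected claims the extractor did not produce.
    false_negatives: usize,
    /// Forbidden claims the extractor did produce.
    violations: usize,
}

fn score_fixture(
    extracted: &[Claim],
    must_contain: &[Claim],
    must_not_contain: &[Claim],
) -> FixtureScore {
    let false_negatives = must_contain
        .iter()
        .filter(|expected| !extracted.contains(expected))
        .count();
    let violations = must_not_contain
        .iter()
        .filter(|forbidden| extracted.contains(forbidden))
        .count();
    FixtureScore { false_negatives, violations }
}

/// Convenience constructor for the examples.
fn claim(subject: &str, predicate: &str, value: &str) -> Claim {
    Claim {
        subject: subject.to_string(),
        predicate: predicate.to_string(),
        value: value.to_string(),
    }
}
```

A fixture passes in this model only when both counts are zero, which is why the checklist in 6.2 asks for both kinds of expectation.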

B.3 Category Guidelines

| Category | What to Include | Subject Prefix |
|----------|-----------------|----------------|
| tls | Certificate, protocol, cipher issues | tls/ |
| jwt | Token validation, algorithm issues | jwt/ |
| secrets | Hardcoded credentials, keys, tokens | secrets/ |
| auth | Authentication bypass, weak auth | auth/ |
| crypto | Weak algorithms, short keys | crypto/ |
| negative | Safe patterns (no findings expected) | N/A |
| edge | Boundary conditions, unusual input | N/A |

Appendix C: Decision Tree Summary

START
  │
  ├─→ Run baseline eval
  │     │
  │     ├─→ F1 >= 0.85 and parse success >= 95%? ──→ Skip to Phase 4
  │     │
  │     └─→ Otherwise ──→ Continue to diagnosis
  │
  ├─→ Classify failures
  │     │
  │     ├─→ Parse failures > 30%? ──→ Phase 2A (output structure)
  │     │
  │     ├─→ Missing claims > 50%? ──→ Phase 2B (recall)
  │     │
  │     ├─→ False positives > 30%? ──→ Phase 2C (precision)
  │     │
  │     └─→ Subject/predicate mismatches > 40%? ──→ Phase 2D (normalization)
  │
  ├─→ After each phase:
  │     │
  │     ├─→ Improved? ──→ Continue to next phase
  │     │
  │     ├─→ Regressed? ──→ Revert, try different approach
  │     │
  │     └─→ Stuck? ──→ Phase 2R (research sprint)
  │
  ├─→ All phases complete?
  │     │
  │     ├─→ F1 >= target? ──→ Phase 4 (edge cases), Phase 5 (CI)
  │     │
  │     └─→ F1 < target? ──→ Research: model limits, fine-tuning, alternatives
  │
  └─→ Ongoing: Phase 6 (continuous improvement)

Quick Reference Commands

# Run evaluation
aphoria eval run --fixtures tests/llm_fixtures --mode live

# Run specific category
aphoria eval run --category tls --mode live

# Check for regressions
aphoria eval run --mode cached --fail-on-regression

# Update baseline
aphoria eval update-baseline --fixtures tests/llm_fixtures --force

# List fixtures
aphoria eval list-fixtures

# Validate fixtures
aphoria eval validate-fixtures

# Export JSON for analysis
aphoria eval run --mode live --format json > results.json