stemedb/applications/aphoria/docs/llm-optimization/playbook.md

# LLM Extraction Optimization Playbook

> A systematic approach to maximizing LLM extraction reliability and coverage.

## Overview

This playbook guides you through the complete process of optimizing Aphoria's LLM extraction system. Follow the phases sequentially, using the decision trees to navigate conditional paths based on your findings.

```
┌─────────────────────────────────────────────────────────────────────────┐
│                    LLM OPTIMIZATION PATHWAY                              │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  Phase 0: Baseline Assessment                                            │
│      ↓                                                                   │
│  Phase 1: Diagnostic Analysis ──────────────────────────────────────────┐│
│      ↓                                           ↓                      ││
│  Phase 2: Quick Wins              Research Required?                    ││
│      ↓                                           ↓                      ││
│  Phase 3: Systematic Improvements    Phase 2R: Research Sprint          ││
│      ↓                                           ↓                      ││
│  Phase 4: Edge Case Hardening    ←───────────────┘                      ││
│      ↓                                                                   │
│  Phase 5: CI Integration & Monitoring                                    │
│      ↓                                                                   │
│  Phase 6: Continuous Improvement Loop                                    │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘
```

---

## Phase 0: Baseline Assessment

**Goal:** Establish current performance metrics before making any changes.

### 0.1 Run Initial Evaluation

```bash
# Ensure fixtures exist
aphoria eval validate-fixtures --fixtures tests/llm_fixtures

# Run live evaluation to get real metrics
aphoria eval run --fixtures tests/llm_fixtures --mode live --format json > baseline-$(date +%Y%m%d).json

# Review results
aphoria eval run --fixtures tests/llm_fixtures --mode live --format table
```

### 0.2 Record Baseline Metrics

Create `docs/llm-optimization/baselines/YYYY-MM-DD.md`:

```markdown
# Baseline: YYYY-MM-DD

## Overall Metrics
- Precision: X.XX
- Recall: X.XX
- F1: X.XX
- Parse Success Rate: X.XX%

## Per-Category Breakdown
| Category | Fixtures | Precision | Recall | F1 |
|----------|----------|-----------|--------|-----|
| tls      | N        | X.XX      | X.XX   | X.XX |
| jwt      | N        | X.XX      | X.XX   | X.XX |
| secrets  | N        | X.XX      | X.XX   | X.XX |
| auth     | N        | X.XX      | X.XX   | X.XX |

## Known Issues
- [ ] Issue 1
- [ ] Issue 2
```

### 0.3 Save Official Baseline

```bash
aphoria eval update-baseline --fixtures tests/llm_fixtures --force
```

### Decision Point: Is Baseline Acceptable?

```
IF F1 >= 0.85 AND parse_success_rate >= 0.95:
    → Skip to Phase 4 (Edge Case Hardening)

ELSE IF F1 < 0.50:
    → Major issues exist. Proceed to Phase 1 with priority flag.

ELSE:
    → Normal optimization flow. Proceed to Phase 1.
```

---

## Phase 1: Diagnostic Analysis

**Goal:** Identify root causes of extraction failures.

### 1.1 Categorize Failures

Run detailed analysis:

```bash
# Get verbose failure information
aphoria eval run --mode live --format json | jq '.fixture_results[] | select(.status == "Failed")'
```

### 1.2 Failure Classification Matrix

For each failed fixture, classify the root cause:

| Failure Type | Symptoms | Root Cause | Fix Category |
|--------------|----------|------------|--------------|
| **Parse Failure** | `parse_success: false` | LLM returned malformed JSON | Prompt/Schema |
| **Missing Claim** | `false_negatives > 0` | LLM didn't extract expected claim | Prompt/Examples |
| **Wrong Subject** | Claim exists but subject mismatch | Subject path inconsistency | Normalization |
| **Wrong Value** | Claim exists but value mismatch | Type coercion or interpretation | Prompt/Matcher |
| **Wrong Predicate** | Claim exists but predicate mismatch | Vocabulary inconsistency | Prompt/Glossary |
| **False Positive** | `violations > 0` | LLM extracted non-existent issue | Negative Examples |
| **Low Confidence** | Claim filtered by min_confidence | LLM under-confident | Calibration |

### 1.3 Tally Results

```markdown
## Failure Analysis: YYYY-MM-DD

### By Failure Type
| Type | Count | % of Failures |
|------|-------|---------------|
| Parse Failure | N | X% |
| Missing Claim | N | X% |
| Wrong Subject | N | X% |
| Wrong Value | N | X% |
| False Positive | N | X% |

### Priority Order (fix highest-impact first)
1. [Type] - N failures
2. [Type] - N failures
3. [Type] - N failures
```

### Decision Point: What's the Dominant Failure Mode?

```
IF parse_failures > 30% of total failures:
    → Proceed to Phase 2A: Output Structure Fixes

ELSE IF missing_claims > 50% of total failures:
    → Proceed to Phase 2B: Recall Improvements

ELSE IF false_positives > 30% of total failures:
    → Proceed to Phase 2C: Precision Improvements

ELSE IF subject/predicate mismatches > 40%:
    → Proceed to Phase 2D: Normalization Fixes

ELSE:
    → Mixed issues. Proceed through 2A → 2B → 2C → 2D sequentially.
```

---

## Phase 2: Quick Wins

### Phase 2A: Output Structure Fixes

**When:** Parse failures > 30%

#### 2A.1 Diagnosis

```bash
# Check what the LLM is actually returning
# Add debug logging to llm/extractor.rs temporarily
RUST_LOG=debug aphoria scan . --persist 2>&1 | grep "LLM response"
```

Common parse issues:
- Markdown code fences around JSON
- Extra text before/after JSON
- Nested JSON (JSON inside string)
- Truncated response (token limit)
- Wrong array structure

#### 2A.2 Fixes by Issue

**Issue: Markdown code fences**
```rust
// In llm/extractor.rs - add response cleaning
fn clean_response(raw: &str) -> String {
    let trimmed = raw.trim();
    if trimmed.starts_with("```json") {
        trimmed
            .strip_prefix("```json")
            .and_then(|s| s.strip_suffix("```"))
            .unwrap_or(trimmed)
            .trim()
            .to_string()
    } else if trimmed.starts_with("```") {
        trimmed
            .strip_prefix("```")
            .and_then(|s| s.strip_suffix("```"))
            .unwrap_or(trimmed)
            .trim()
            .to_string()
    } else {
        trimmed.to_string()
    }
}
```

**Issue: Extra text around JSON**
```rust
// Find JSON array in response
fn extract_json_array(raw: &str) -> Option<&str> {
    let start = raw.find('[')?;
    let end = raw.rfind(']')?;
    if end > start {
        Some(&raw[start..=end])
    } else {
        None
    }
}
```

**Issue: Token limit truncation**
- Reduce input context
- Request fewer claims per call
- Implement chunking

**Issue: Inconsistent structure**
Add explicit schema to prompt:
```
Return ONLY a JSON array with this exact structure:
[
  {
    "subject": "category/specific_thing",
    "predicate": "property_name",
    "value": <boolean|number|string>,
    "confidence": <0.0-1.0>,
    "line": <line_number>,
    "rationale": "why this was extracted"
  }
]
```

#### 2A.3 Validate Fix

```bash
aphoria eval run --mode live --fail-on-regression
# Parse success rate should improve
```

**Checkpoint:** If parse_success_rate >= 95%, proceed. Otherwise, consider:
```
IF still failing:
    → Research: Check Gemini API docs for response_format options
    → Research: Consider switching to function calling mode
```

---

### Phase 2B: Recall Improvements

**When:** Missing claims (false negatives) > 50%

#### 2B.1 Analyze What's Being Missed

```bash
# For each failed fixture, check what was expected vs extracted
aphoria eval run --mode live --format json | \
  jq '.fixture_results[] | select(.false_negatives > 0) | {id: .fixture_id, unmatched: .unmatched_expectations}'
```

#### 2B.2 Common Recall Issues & Fixes

**Issue: LLM doesn't recognize the pattern**

Add few-shot examples to prompt:
```rust
// In llm/prompts.rs
const FEW_SHOT_EXAMPLES: &str = r#"
Example 1:
Input: `requests.get(url, verify=False)`
Output: [{"subject": "tls/cert_verification", "predicate": "enabled", "value": false, ...}]

Example 2:
Input: `jwt.decode(token, algorithms=['none', 'HS256'])`
Output: [{"subject": "jwt/algorithms", "predicate": "allows_none", "value": true, ...}]
"#;
```

**Issue: LLM extracts with different subject path**

Check actual vs expected subjects:
```
Expected: tls/cert_verification
Actual:   security/tls/verify_disabled

→ Either: Update fixtures to match LLM's vocabulary
→ Or: Add subject mapping instructions to prompt
→ Or: Improve matcher's tail-path matching
```

**Issue: LLM only extracts first finding**

Add explicit instruction:
```
IMPORTANT: Extract ALL security-relevant claims from the code.
Do not stop after finding one issue - scan the entire file.
```

**Issue: LLM misses subtle patterns**

Add pattern hints:
```
Look for these specific patterns:
- verify=False, VERIFY=False, ssl_verify=False → TLS verification disabled
- algorithms=['none'], algorithm='none' → JWT algorithm none attack
- API_KEY = "...", apiKey: "..." → Hardcoded secrets
```

#### 2B.3 Validate Recall Improvements

```bash
aphoria eval run --mode live
# Check: recall should increase, precision should not drop significantly
```

**Checkpoint:**
```
IF recall improved AND precision dropped > 10%:
    → Proceed to Phase 2C immediately (balance precision)

ELSE IF recall still < 0.70:
    → Research Sprint: Analyze why specific patterns aren't recognized
    → Consider: Are fixtures too strict? Update matcher tolerance?
```

---

### Phase 2C: Precision Improvements

**When:** False positives or violations > 30%

#### 2C.1 Analyze False Positives

```bash
# What is the LLM incorrectly flagging?
aphoria eval run --mode live --format json | \
  jq '.fixture_results[] | select(.violations > 0) | {id: .fixture_id, violations: .violation_details}'
```

#### 2C.2 Common Precision Issues & Fixes

**Issue: Flagging safe patterns as issues**

Add negative examples to prompt:
```rust
const NEGATIVE_EXAMPLES: &str = r#"
These are SAFE and should NOT be flagged:

- `verify=certifi.where()` → NOT a TLS issue (using CA bundle)
- `API_KEY = os.environ['API_KEY']` → NOT hardcoded (from environment)
- `algorithms=['HS256', 'RS256']` → NOT algorithm none (secure algorithms only)
"#;
```

**Issue: Over-eager pattern matching**

Add specificity requirements:
```
Only extract claims when you are CONFIDENT the code is problematic.
Do NOT flag:
- Comments discussing security issues (not actual code)
- Test files demonstrating vulnerabilities (not production code)
- Variables that MIGHT contain sensitive data but don't clearly
```

**Issue: Misinterpreting context**

Add context awareness:
```
Consider the full context:
- Is this in a test file? Test code may intentionally have "bad" patterns.
- Is this in a comment? Comments are not executable code.
- Is the value from a secure source (environment, secrets manager)?
```

#### 2C.3 Validate Precision Improvements

```bash
aphoria eval run --mode live
# Check: precision should increase, recall should not drop significantly
```

**Checkpoint:**
```
IF precision improved AND recall dropped > 10%:
    → Go back to 2B, find balance

ELSE IF precision still < 0.70:
    → Research: Are fixtures correctly marked as "must_not_contain"?
    → Consider: Is the LLM's interpretation actually valid? Update fixtures?
```

---

### Phase 2D: Normalization Fixes

**When:** Subject/predicate mismatches > 40%

#### 2D.1 Identify Vocabulary Mismatches

```bash
# Compare LLM output subjects vs expected subjects
# Manual review of failures
```

Common mismatches:
| LLM Output | Expected | Fix |
|------------|----------|-----|
| `security/tls/disabled` | `tls/cert_verification` | Prompt vocabulary |
| `auth/jwt/none_allowed` | `jwt/algorithms` | Subject mapping |
| `credentials/hardcoded` | `secrets/api_key` | Standardize categories |

#### 2D.2 Fix Options

**Option A: Update Prompt Vocabulary**

Define exact subject paths in prompt:
```
Use these exact subject paths:
- TLS issues: tls/cert_verification, tls/min_version, tls/cipher_suites
- JWT issues: jwt/algorithms, jwt/validation, jwt/expiry
- Secrets: secrets/api_key, secrets/token, secrets/password
- Auth: auth/bypass, auth/session, auth/password_policy
```

**Option B: Improve Matcher Tolerance**

In `eval/matcher.rs`, enhance `subject_matches()`:
```rust
fn subject_matches(&self, extracted: &str, expected: &str) -> bool {
    // Existing tail-path matching
    let ext_tail = self.tail_path(extracted, 2);
    let exp_tail = self.tail_path(expected, 2);
    if ext_tail == exp_tail {
        return true;
    }

    // Add synonym matching
    let synonyms = [
        ("cert_verification", "verify"),
        ("cert_verification", "ssl_verify"),
        ("api_key", "apikey"),
        ("api_key", "secret_key"),
    ];

    for (a, b) in synonyms {
        if (ext_tail.contains(&a) && exp_tail.contains(&b)) ||
           (ext_tail.contains(&b) && exp_tail.contains(&a)) {
            return true;
        }
    }

    false
}
```

**Option C: Update Fixtures to Match LLM**

If LLM's vocabulary is actually better/more consistent:
```bash
# Update fixture subjects to match what LLM produces
# This is valid if LLM's naming is more intuitive
```

#### 2D.3 Validate

```bash
aphoria eval run --mode live
# Subject/predicate mismatches should decrease
```

---

## Phase 2R: Research Sprint

**When:** Quick wins don't resolve issues, or unknown problems arise.

### 2R.1 Research Checklist

```markdown
## Research Log: [Issue Description]

### Problem Statement
- What specifically is failing?
- What have we tried?
- Why didn't it work?

### Research Tasks
- [ ] Review Gemini API documentation for relevant features
- [ ] Check if other projects solved similar issues
- [ ] Experiment with different prompt structures
- [ ] Test alternative models (if available)
- [ ] Review academic papers on code analysis with LLMs

### Experiments
| Experiment | Hypothesis | Result |
|------------|------------|--------|
| | | |

### Conclusion
- What worked?
- What didn't?
- Recommended approach?
```

### 2R.2 Common Research Topics

**Topic: Structured Output**
- Gemini supports `response_mime_type: "application/json"`
- May need `response_schema` for strict typing
- Test: Does this improve parse success rate?

**Topic: Function Calling**
- Alternative to free-form JSON output
- Define claims as function parameters
- Test: More reliable structure?

**Topic: Chain of Thought**
- Ask LLM to reason before extracting
- May improve accuracy on subtle patterns
- Test: Does CoT help without hurting speed?

**Topic: Fine-Tuning**
- Create training dataset from fixtures
- Fine-tune model on security extraction
- Test: Significant quality improvement?

**Topic: Multi-Pass Extraction**
- First pass: identify potential issues
- Second pass: validate and detail each
- Test: Higher precision with acceptable latency?

### 2R.3 Research Output

Document findings in `docs/llm-optimization/research/`:
```
research/
  structured-output.md
  chain-of-thought.md
  fine-tuning-feasibility.md
```

---

## Phase 3: Systematic Improvements

**Goal:** Implement comprehensive prompt and system improvements.

### 3.1 Prompt Architecture

Restructure prompt for maximum clarity:

```rust
const SYSTEM_PROMPT: &str = r#"
You are a security code analyzer. Your task is to extract security-relevant claims from source code.

## Output Format
Return a JSON array. Each claim must have:
- subject: Category path (e.g., "tls/cert_verification")
- predicate: Property name (e.g., "enabled")
- value: The extracted value (boolean, number, or string)
- confidence: Your confidence 0.0-1.0
- line: Line number where found
- rationale: Brief explanation

## Subject Categories
- tls/*: TLS/SSL configuration
- jwt/*: JWT token handling
- secrets/*: Credentials and API keys
- auth/*: Authentication mechanisms
- crypto/*: Cryptographic settings

## What to Extract
[Positive examples here]

## What NOT to Extract
[Negative examples here]

## Important Rules
1. Extract ALL findings, not just the first one
2. Only flag actual code, not comments
3. Consider context (test files, environment variables)
4. Be conservative - only flag clear issues
"#;
```

### 3.2 Confidence Calibration

Improve confidence scoring:

```rust
const CONFIDENCE_GUIDANCE: &str = r#"
Set confidence based on:
- 0.95-1.0: Explicit, unambiguous code (verify=False)
- 0.80-0.94: Clear pattern, minor ambiguity
- 0.60-0.79: Likely issue, needs context
- 0.40-0.59: Possible issue, uncertain
- Below 0.40: Don't report (too uncertain)
"#;
```

### 3.3 Language-Specific Handling

Add language-specific prompt sections:

```rust
fn get_language_hints(language: Language) -> &'static str {
    match language {
        Language::Python => r#"
Python-specific patterns:
- requests: verify=False, cert=False
- urllib3: disable_warnings()
- ssl: CERT_NONE, check_hostname=False
"#,
        Language::JavaScript => r#"
JavaScript/Node.js patterns:
- https: rejectUnauthorized: false
- axios: httpsAgent with rejectUnauthorized
- NODE_TLS_REJECT_UNAUTHORIZED=0
"#,
        // ... other languages
    }
}
```

### 3.4 Validate Comprehensive Changes

```bash
# Run full evaluation
aphoria eval run --mode live --format table

# Compare to baseline
# Expect: F1 improvement, no significant regressions
```

---

## Phase 4: Edge Case Hardening

**Goal:** Handle unusual inputs gracefully.

### 4.1 Edge Case Categories

| Category | Example | Expected Behavior |
|----------|---------|-------------------|
| Empty file | 0 bytes | No claims, no error |
| Binary file | .exe, .png | Skip gracefully |
| Huge file | >100KB code | Chunk or skip with warning |
| Minified code | single-line JS | Best effort extraction |
| Mixed language | HTML with JS | Detect embedded languages |
| Unicode | Non-ASCII identifiers | Handle correctly |
| Syntax errors | Invalid code | Extract what's possible |

### 4.2 Add Edge Case Fixtures

```bash
# Create edge case fixtures
cat > tests/llm_fixtures/edge/edge-002-minified.toml << 'EOF'
[metadata]
id = "edge-002"
name = "Minified JavaScript"
category = "edge"
language = "javascript"

[input]
filename = "bundle.min.js"
content = "var a=require('https');a.get({rejectUnauthorized:false},function(r){});"

[expected]
must_contain = [
    { subject = "tls/cert_verification", predicate = "enabled", value = false }
]
EOF
```

### 4.3 Implement Defensive Code

```rust
// In llm/extractor.rs
impl LlmExtractor {
    pub fn extract(&self, content: &str, language: Language) -> Vec<Observation> {
        // Edge case: empty content
        if content.trim().is_empty() {
            return Vec::new();
        }

        // Edge case: too large
        if content.len() > MAX_CONTENT_SIZE {
            warn!(size = content.len(), "Content too large, chunking");
            return self.extract_chunked(content, language);
        }

        // Edge case: binary content
        if content.bytes().any(|b| b == 0) {
            debug!("Binary content detected, skipping");
            return Vec::new();
        }

        // Normal extraction
        self.extract_internal(content, language)
    }
}
```

### 4.4 Validate Edge Cases

```bash
aphoria eval run --mode live --category edge
# All edge cases should pass or gracefully skip
```

---

## Phase 5: CI Integration & Monitoring

**Goal:** Prevent regressions and track quality over time.

### 5.1 CI Pipeline

```yaml
# .github/workflows/llm-quality.yml
name: LLM Extraction Quality

on:
  pull_request:
    paths:
      - 'applications/aphoria/src/llm/**'
      - 'applications/aphoria/tests/llm_fixtures/**'
  schedule:
    - cron: '0 6 * * 1'  # Weekly Monday 6am

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup Rust
        uses: dtolnay/rust-action@stable

      - name: Cache LLM responses
        uses: actions/cache@v4
        with:
          path: ~/.cache/aphoria/llm_cache
          key: llm-cache-${{ hashFiles('tests/llm_fixtures/**') }}

      - name: Run Evaluation
        env:
          GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}
        run: |
          cargo run -p aphoria -- eval run \
            --mode cached \
            --fail-on-regression \
            --threshold 0.05 \
            --format json > eval-results.json

      - name: Upload Results
        uses: actions/upload-artifact@v4
        with:
          name: eval-results
          path: eval-results.json

      - name: Comment PR with Results
        if: github.event_name == 'pull_request'
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const results = JSON.parse(fs.readFileSync('eval-results.json'));
            const body = `## LLM Eval Results
            | Metric | Value |
            |--------|-------|
            | Precision | ${results.metrics.precision.toFixed(2)} |
            | Recall | ${results.metrics.recall.toFixed(2)} |
            | F1 | ${results.metrics.f1.toFixed(2)} |
            | Verdict | ${results.verdict} |`;
            github.rest.issues.createComment({
              owner: context.repo.owner,
              repo: context.repo.repo,
              issue_number: context.issue.number,
              body: body
            });
```

### 5.2 Monitoring Dashboard

Track metrics over time:

```markdown
# docs/llm-optimization/metrics-history.md

| Date | Precision | Recall | F1 | Notes |
|------|-----------|--------|-----|-------|
| 2026-02-05 | 0.78 | 0.72 | 0.75 | Initial baseline |
| 2026-02-06 | 0.82 | 0.74 | 0.78 | Added few-shot examples |
| 2026-02-07 | 0.85 | 0.76 | 0.80 | Fixed parse issues |
```

### 5.3 Alert Thresholds

```toml
# aphoria.toml
[eval]
# Fail CI if metrics drop below these
min_precision = 0.75
min_recall = 0.70
min_f1 = 0.72
regression_threshold = 0.05
```

---

## Phase 6: Continuous Improvement Loop

**Goal:** Establish ongoing optimization process.

### 6.1 Weekly Cadence

```markdown
## Monday: Review
- [ ] Check CI results from past week
- [ ] Review any failed fixtures
- [ ] Identify top 3 improvement opportunities

## Wednesday: Implement
- [ ] Pick one improvement from Monday's list
- [ ] Implement and test locally
- [ ] Run full eval suite

## Friday: Deploy
- [ ] If metrics improved, merge changes
- [ ] Update baseline
- [ ] Document what changed and why
```

### 6.2 Fixture Expansion

Add new fixtures when:
- New vulnerability pattern discovered
- False positive reported by user
- New language/framework support added
- Edge case found in production

```bash
# Fixture addition checklist
- [ ] Create fixture file in appropriate category
- [ ] Add both must_contain and must_not_contain
- [ ] Run validation: aphoria eval validate-fixtures
- [ ] Run eval to verify fixture works: aphoria eval run --max-fixtures 1
- [ ] Update manifest.toml category count
```

### 6.3 Prompt Version Control

Track prompt changes:

```rust
// llm/prompts.rs
pub const PROMPT_VERSION: &str = "1.2.0";

// In changelog:
// 1.2.0 - Added negative examples for safe patterns
// 1.1.0 - Added language-specific hints
// 1.0.0 - Initial structured prompt
```

### 6.4 Quarterly Review

```markdown
## Quarterly LLM Optimization Review

### Metrics Trend
- Q1 Start: F1 = 0.XX
- Q1 End: F1 = 0.XX
- Improvement: +X%

### Major Changes
1. Change description
2. Change description

### Lessons Learned
- What worked well?
- What didn't work?
- What should we try next quarter?

### Next Quarter Goals
- [ ] Goal 1
- [ ] Goal 2
```

---

## Appendix A: Common Issues Reference

### A.1 "LLM returns empty array"

**Symptoms:** No claims extracted from files that clearly have issues.

**Diagnosis:**
```bash
# Check if LLM is being called
RUST_LOG=debug aphoria scan . --persist 2>&1 | grep -i llm
```

**Possible Causes:**
1. LLM disabled in config → Enable in aphoria.toml
2. File filtered out → Check file size/type filters
3. API error → Check API key and quota
4. Prompt too restrictive → Loosen "what to extract" section

### A.2 "Parse failures spike after API update"

**Symptoms:** Suddenly many parse failures.

**Diagnosis:**
```bash
# Check raw responses
# Add temporary logging to see what API returns
```

**Possible Causes:**
1. API response format changed
2. Model version updated
3. Rate limiting affecting responses

**Fix:** Update response parsing to handle new format.

### A.3 "Good local results, bad CI results"

**Symptoms:** Eval passes locally but fails in CI.

**Possible Causes:**
1. Cache inconsistency → Clear and rebuild cache
2. Environment differences → Check env vars
3. Timeout issues → Increase CI timeout
4. API key issues → Verify CI secrets

### A.4 "Precision tanked after recall improvement"

**Symptoms:** Improved recall but precision dropped significantly.

**Fix:**
1. Add negative examples to balance
2. Increase confidence threshold
3. Add more must_not_contain fixtures
4. Make extraction criteria more specific

### A.5 "Works for Python, fails for Go"

**Symptoms:** Language-specific extraction issues.

**Fix:**
1. Add language-specific prompt hints
2. Add fixtures for failing language
3. Check if patterns are language-specific
4. Consider language-specific extraction paths

---

## Appendix B: Fixture Writing Guide

### B.1 Good Fixture Characteristics

- **Minimal:** Only include code necessary to demonstrate the issue
- **Clear:** Obvious what the security issue is
- **Realistic:** Resembles actual production code
- **Isolated:** Tests one concept per fixture

### B.2 Fixture Template

```toml
[metadata]
id = "category-NNN"
name = "Brief description of what this tests"
category = "tls|jwt|secrets|auth|crypto|negative|edge"
language = "python|javascript|go|rust|java|..."
difficulty = "easy|medium|hard"
source = "hand-curated|real-world|generated"
created = "YYYY-MM-DD"
notes = "Any additional context"

[input]
filename = "example.py"
content = """
# Minimal code that demonstrates the issue
actual_code_here()
"""

[expected]
must_contain = [
    {
        subject = "category/specific_thing",
        predicate = "property",
        value = <expected_value>,
        rationale = "Why this should be extracted"
    }
]

must_not_contain = [
    {
        subject = "category/other_thing",
        predicate = "property",
        value = <wrong_value>,
        rationale = "Why this should NOT be extracted"
    }
]

[scoring]
weight = 1.0
min_confidence = 0.7
```

### B.3 Category Guidelines

| Category | What to Include | Subject Prefix |
|----------|-----------------|----------------|
| tls | Certificate, protocol, cipher issues | `tls/` |
| jwt | Token validation, algorithm issues | `jwt/` |
| secrets | Hardcoded credentials, keys, tokens | `secrets/` |
| auth | Authentication bypass, weak auth | `auth/` |
| crypto | Weak algorithms, short keys | `crypto/` |
| negative | Safe patterns (no findings expected) | N/A |
| edge | Boundary conditions, unusual input | N/A |

---

## Appendix C: Decision Tree Summary

```
START
  │
  ├─→ Run baseline eval
  │     │
  │     ├─→ F1 >= 0.85? ──→ Skip to Phase 4
  │     │
  │     └─→ F1 < 0.85? ──→ Continue to diagnosis
  │
  ├─→ Classify failures
  │     │
  │     ├─→ Parse failures > 30%? ──→ Phase 2A (output structure)
  │     │
  │     ├─→ Missing claims > 50%? ──→ Phase 2B (recall)
  │     │
  │     ├─→ False positives > 30%? ──→ Phase 2C (precision)
  │     │
  │     └─→ Subject mismatches > 40%? ──→ Phase 2D (normalization)
  │
  ├─→ After each phase:
  │     │
  │     ├─→ Improved? ──→ Continue to next phase
  │     │
  │     ├─→ Regressed? ──→ Revert, try different approach
  │     │
  │     └─→ Stuck? ──→ Phase 2R (research sprint)
  │
  ├─→ All phases complete?
  │     │
  │     ├─→ F1 >= target? ──→ Phase 4 (edge cases), Phase 5 (CI)
  │     │
  │     └─→ F1 < target? ──→ Research: model limits, fine-tuning, alternatives
  │
  └─→ Ongoing: Phase 6 (continuous improvement)
```

---

## Quick Reference Commands

```bash
# Run evaluation
aphoria eval run --fixtures tests/llm_fixtures --mode live

# Run specific category
aphoria eval run --category tls --mode live

# Check for regressions
aphoria eval run --mode cached --fail-on-regression

# Update baseline
aphoria eval update-baseline --fixtures tests/llm_fixtures --force

# List fixtures
aphoria eval list-fixtures

# Validate fixtures
aphoria eval validate-fixtures

# Export JSON for analysis
aphoria eval run --mode live --format json > results.json
```