# LLM Extraction Optimization Playbook

> A systematic approach to maximizing LLM extraction reliability and coverage.

## Overview

This playbook guides you through the complete process of optimizing Aphoria's LLM extraction system. Follow the phases sequentially, using the decision trees to navigate conditional paths based on your findings.

```
┌─────────────────────────────────────────────────────────────────────────┐
│                        LLM OPTIMIZATION PATHWAY                         │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  Phase 0: Baseline Assessment                                           │
│       ↓                                                                 │
│  Phase 1: Diagnostic Analysis ───────────────┐                          │
│       ↓                                      ↓                          │
│  Phase 2: Quick Wins                 Research Required?                 │
│       ↓                                      ↓                          │
│  Phase 3: Systematic Improvements    Phase 2R: Research Sprint          │
│       ↓                                      ↓                          │
│  Phase 4: Edge Case Hardening ←──────────────┘                          │
│       ↓                                                                 │
│  Phase 5: CI Integration & Monitoring                                   │
│       ↓                                                                 │
│  Phase 6: Continuous Improvement Loop                                   │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
```

---

## Phase 0: Baseline Assessment

**Goal:** Establish current performance metrics before making any changes.

### 0.1 Run Initial Evaluation

```bash
# Ensure fixtures exist
aphoria eval validate-fixtures --fixtures tests/llm_fixtures

# Run live evaluation to get real metrics
aphoria eval run --fixtures tests/llm_fixtures --mode live --format json > baseline-$(date +%Y%m%d).json

# Review results
aphoria eval run --fixtures tests/llm_fixtures --mode live --format table
```

### 0.2 Record Baseline Metrics

Create `docs/llm-optimization/baselines/YYYY-MM-DD.md`:

```markdown
# Baseline: YYYY-MM-DD

## Overall Metrics
- Precision: X.XX
- Recall: X.XX
- F1: X.XX
- Parse Success Rate: X.XX%

## Per-Category Breakdown
| Category | Fixtures | Precision | Recall | F1 |
|----------|----------|-----------|--------|-----|
| tls | N | X.XX | X.XX | X.XX |
| jwt | N | X.XX | X.XX | X.XX |
| secrets | N | X.XX | X.XX | X.XX |
| auth | N | X.XX | X.XX | X.XX |

## Known Issues
- [ ] Issue 1
- [ ] Issue 2
```

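For reference when filling in the template above, here is a minimal sketch of how precision, recall, and F1 relate to raw match counts. The `Counts` struct and `prf1` function are hypothetical names for illustration, not Aphoria's actual types, and the mapping of `violations` to false positives is an assumption about how the eval command scores fixtures.

```rust
/// Raw match counts for one evaluation run.
/// Hypothetical type for illustration; not part of Aphoria's API.
#[derive(Debug, Clone, Copy)]
pub struct Counts {
    /// Expected claims that were extracted.
    pub true_positives: u32,
    /// Extracted claims that should not exist (assumed to map to `violations`).
    pub false_positives: u32,
    /// Expected claims that were not extracted.
    pub false_negatives: u32,
}

/// Returns (precision, recall, f1). Each value is 0.0 when its denominator is zero.
pub fn prf1(c: Counts) -> (f64, f64, f64) {
    let tp = f64::from(c.true_positives);
    let fp = f64::from(c.false_positives);
    let fn_ = f64::from(c.false_negatives);

    let precision = if tp + fp > 0.0 { tp / (tp + fp) } else { 0.0 };
    let recall = if tp + fn_ > 0.0 { tp / (tp + fn_) } else { 0.0 };
    // F1 is the harmonic mean of precision and recall.
    let f1 = if precision + recall > 0.0 {
        2.0 * precision * recall / (precision + recall)
    } else {
        0.0
    };
    (precision, recall, f1)
}
```

Because F1 is a harmonic mean, it is pulled toward the weaker of the two metrics, which is why the later phases treat recall and precision fixes as a balancing exercise.
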
### 0.3 Save Official Baseline

```bash
aphoria eval update-baseline --fixtures tests/llm_fixtures --force
```

### Decision Point: Is Baseline Acceptable?

```
IF F1 >= 0.85 AND parse_success_rate >= 0.95:
    → Skip to Phase 4 (Edge Case Hardening)

ELSE IF F1 < 0.50:
    → Major issues exist. Proceed to Phase 1 with priority flag.

ELSE:
    → Normal optimization flow. Proceed to Phase 1.
```

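The decision point above can be written as a small function, which makes one detail explicit: a high F1 with a parse success rate below 0.95 does not qualify for the skip and falls through to the normal flow. This is a sketch only; `NextStep` and `after_baseline` are hypothetical names, and both inputs are assumed to be fractions in the range 0.0 to 1.0.

```rust
/// Where to go after the baseline run (hypothetical type for illustration).
#[derive(Debug, PartialEq, Eq)]
pub enum NextStep {
    /// Baseline is already good: go straight to Phase 4.
    SkipToPhase4,
    /// Major issues: Phase 1 with the priority flag.
    Phase1Priority,
    /// Normal optimization flow: Phase 1.
    Phase1Normal,
}

/// Applies the Phase 0 thresholds. Both arguments are fractions (0.0..=1.0).
pub fn after_baseline(f1: f64, parse_success_rate: f64) -> NextStep {
    if f1 >= 0.85 && parse_success_rate >= 0.95 {
        NextStep::SkipToPhase4
    } else if f1 < 0.50 {
        NextStep::Phase1Priority
    } else {
        NextStep::Phase1Normal
    }
}
```
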

---

## Phase 1: Diagnostic Analysis

**Goal:** Identify root causes of extraction failures.

### 1.1 Categorize Failures

Run detailed analysis:

```bash
# Get verbose failure information
aphoria eval run --mode live --format json | jq '.fixture_results[] | select(.status == "Failed")'
```

### 1.2 Failure Classification Matrix

For each failed fixture, classify the root cause:

| Failure Type | Symptoms | Root Cause | Fix Category |
|--------------|----------|------------|--------------|
| **Parse Failure** | `parse_success: false` | LLM returned malformed JSON | Prompt/Schema |
| **Missing Claim** | `false_negatives > 0` | LLM didn't extract expected claim | Prompt/Examples |
| **Wrong Subject** | Claim exists but subject mismatch | Subject path inconsistency | Normalization |
| **Wrong Value** | Claim exists but value mismatch | Type coercion or interpretation | Prompt/Matcher |
| **Wrong Predicate** | Claim exists but predicate mismatch | Vocabulary inconsistency | Prompt/Glossary |
| **False Positive** | `violations > 0` | LLM extracted non-existent issue | Negative Examples |
| **Low Confidence** | Claim filtered by min_confidence | LLM under-confident | Calibration |

### 1.3 Tally Results

```markdown
## Failure Analysis: YYYY-MM-DD

### By Failure Type
| Type | Count | % of Failures |
|------|-------|---------------|
| Parse Failure | N | X% |
| Missing Claim | N | X% |
| Wrong Subject | N | X% |
| Wrong Value | N | X% |
| Wrong Predicate | N | X% |
| False Positive | N | X% |
| Low Confidence | N | X% |

### Priority Order (fix highest-impact first)
1. [Type] - N failures
2. [Type] - N failures
3. [Type] - N failures
```

### Decision Point: What's the Dominant Failure Mode?

```
IF parse_failures > 30% of total failures:
    → Proceed to Phase 2A: Output Structure Fixes

ELSE IF missing_claims > 50% of total failures:
    → Proceed to Phase 2B: Recall Improvements

ELSE IF false_positives > 30% of total failures:
    → Proceed to Phase 2C: Precision Improvements

ELSE IF subject/predicate mismatches > 40%:
    → Proceed to Phase 2D: Normalization Fixes

ELSE:
    → Mixed issues. Proceed through 2A → 2B → 2C → 2D sequentially.
```

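As an illustration of the routing rules above, the sketch below turns a failure tally into a phase choice. The type and field names are hypothetical, and it assumes every percentage is a share of the total failure count. The branches are checked in the same order as the decision point, so the first matching rule wins.

```rust
/// Failure counts from the Phase 1.3 tally (hypothetical type for illustration).
#[derive(Debug, Default, Clone, Copy)]
pub struct FailureTally {
    pub parse_failures: u32,
    pub missing_claims: u32,
    pub false_positives: u32,
    /// Wrong Subject plus Wrong Predicate.
    pub subject_predicate_mismatches: u32,
    /// Everything else (wrong value, low confidence, ...).
    pub other: u32,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Route {
    Phase2A,
    Phase2B,
    Phase2C,
    Phase2D,
    /// Mixed issues: work through 2A, 2B, 2C, 2D in order.
    Sequential,
}

/// Returns `None` when there are no failures to route.
pub fn route(t: FailureTally) -> Option<Route> {
    let total = t.parse_failures
        + t.missing_claims
        + t.false_positives
        + t.subject_predicate_mismatches
        + t.other;
    if total == 0 {
        return None;
    }
    // Share of all failures attributable to one failure type.
    let share = |n: u32| f64::from(n) / f64::from(total);

    Some(if share(t.parse_failures) > 0.30 {
        Route::Phase2A
    } else if share(t.missing_claims) > 0.50 {
        Route::Phase2B
    } else if share(t.false_positives) > 0.30 {
        Route::Phase2C
    } else if share(t.subject_predicate_mismatches) > 0.40 {
        Route::Phase2D
    } else {
        Route::Sequential
    })
}
```
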

---

## Phase 2: Quick Wins

### Phase 2A: Output Structure Fixes

**When:** Parse failures > 30%

#### 2A.1 Diagnosis

```bash
# Check what the LLM is actually returning
# Add debug logging to llm/extractor.rs temporarily
RUST_LOG=debug aphoria scan . --persist 2>&1 | grep "LLM response"
```

Common parse issues:
- Markdown code fences around JSON
- Extra text before/after JSON
- Nested JSON (JSON inside string)
- Truncated response (token limit)
- Wrong array structure

#### 2A.2 Fixes by Issue

**Issue: Markdown code fences**
```rust
// In llm/extractor.rs - add response cleaning
fn clean_response(raw: &str) -> String {
    let trimmed = raw.trim();

    // Drop an opening fence, with or without a `json` language tag.
    let body = trimmed
        .strip_prefix("```json")
        .or_else(|| trimmed.strip_prefix("```"))
        .unwrap_or(trimmed);

    // Drop the closing fence if present. It may be missing when the
    // response was truncated, in which case the body is kept as-is.
    let body = body.trim_end();
    body.strip_suffix("```").unwrap_or(body).trim().to_string()
}
```

**Issue: Extra text around JSON**
```rust
// Find JSON array in response.
// Assumes the first '[' and the last ']' in the text delimit the array.
fn extract_json_array(raw: &str) -> Option<&str> {
    let start = raw.find('[')?;
    let end = raw.rfind(']')?;
    if end > start {
        Some(&raw[start..=end])
    } else {
        None
    }
}
```

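The first-`[`/last-`]` approach can over-capture when the text after the JSON also contains brackets (for example a trailing "see [1]"), or when a string value contains `]`. A possible alternative, sketched here with a hypothetical function name and only the standard library, is to scan for the first balanced array while tracking string literals. It still assumes that no stray `[` appears in prose before the array.

```rust
/// Returns the first balanced `[...]` span in `raw`, ignoring brackets that
/// appear inside JSON string literals. Returns `None` if no array is found or
/// if the array never closes (for example, a truncated response).
/// Hypothetical helper for illustration; not part of Aphoria.
pub fn extract_balanced_array(raw: &str) -> Option<&str> {
    let mut start = None; // byte offset of the opening '['
    let mut depth = 0usize; // current bracket nesting depth
    let mut in_string = false; // inside a "..." literal
    let mut escaped = false; // previous char was a backslash inside a string

    for (i, ch) in raw.char_indices() {
        if start.is_none() {
            if ch == '[' {
                start = Some(i);
                depth = 1;
            }
            continue;
        }
        if in_string {
            if escaped {
                escaped = false;
            } else if ch == '\\' {
                escaped = true;
            } else if ch == '"' {
                in_string = false;
            }
            continue;
        }
        match ch {
            '"' => in_string = true,
            '[' => depth += 1,
            ']' => {
                depth -= 1;
                if depth == 0 {
                    // ']' is one byte wide, so the inclusive range is valid.
                    return start.map(|s| &raw[s..=i]);
                }
            }
            _ => {}
        }
    }
    None
}
```

A `None` result for a response that clearly starts an array is a useful signal in its own right: it usually means the output was cut off.
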
**Issue: Token limit truncation**
- Reduce input context
- Request fewer claims per call
- Implement chunking

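One way to implement chunking is to split the input at line boundaries and remember where each chunk starts, so that `line` values reported by the LLM for a chunk can be mapped back to the original file. The sketch below uses hypothetical names (`Chunk`, `chunk_by_lines`) and a byte budget as a stand-in for a token budget, which is an assumption; real token counts depend on the model's tokenizer.

```rust
/// A slice of the input plus the 1-based line number where it starts.
/// Hypothetical type for illustration; not part of Aphoria.
#[derive(Debug, PartialEq, Eq)]
pub struct Chunk<'a> {
    pub first_line: usize,
    pub text: &'a str,
}

/// Splits `content` into chunks of at most `max_bytes`, breaking only at line
/// boundaries. A single line longer than `max_bytes` becomes its own chunk.
pub fn chunk_by_lines(content: &str, max_bytes: usize) -> Vec<Chunk<'_>> {
    let mut chunks = Vec::new();
    let mut chunk_start = 0; // byte offset where the current chunk begins
    let mut chunk_first_line = 1; // 1-based line number at `chunk_start`
    let mut offset = 0; // byte offset of the line being examined
    let mut line_no = 1; // 1-based number of the line being examined

    for line in content.split_inclusive('\n') {
        let chunk_len = offset - chunk_start;
        // Close the current chunk if adding this line would exceed the budget.
        if chunk_len > 0 && chunk_len + line.len() > max_bytes {
            chunks.push(Chunk {
                first_line: chunk_first_line,
                text: &content[chunk_start..offset],
            });
            chunk_start = offset;
            chunk_first_line = line_no;
        }
        offset += line.len();
        line_no += 1;
    }
    // Flush whatever remains after the last line.
    if offset > chunk_start {
        chunks.push(Chunk {
            first_line: chunk_first_line,
            text: &content[chunk_start..],
        });
    }
    chunks
}
```

To recover the original line for a claim found in a chunk, add `first_line - 1` to the line number the LLM reports for that chunk.
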
**Issue: Inconsistent structure**
Add explicit schema to prompt:
```
Return ONLY a JSON array with this exact structure:
[
  {
    "subject": "category/specific_thing",
    "predicate": "property_name",
    "value": <boolean|number|string>,
    "confidence": <0.0-1.0>,
    "line": <line_number>,
    "rationale": "why this was extracted"
  }
]
```

#### 2A.3 Validate Fix

```bash
aphoria eval run --mode live --fail-on-regression
# Parse success rate should improve
```

**Checkpoint:** If parse_success_rate >= 95%, proceed. Otherwise, consider:
```
IF still failing:
    → Research: Check Gemini API docs for response_format options
    → Research: Consider switching to function calling mode
```

---

### Phase 2B: Recall Improvements

**When:** Missing claims (false negatives) > 50%

#### 2B.1 Analyze What's Being Missed

```bash
# For each failed fixture, check what was expected vs extracted
aphoria eval run --mode live --format json | \
  jq '.fixture_results[] | select(.false_negatives > 0) | {id: .fixture_id, unmatched: .unmatched_expectations}'
```

#### 2B.2 Common Recall Issues & Fixes

**Issue: LLM doesn't recognize the pattern**

Add few-shot examples to prompt:
```rust
// In llm/prompts.rs
const FEW_SHOT_EXAMPLES: &str = r#"
Example 1:
Input: `requests.get(url, verify=False)`
Output: [{"subject": "tls/cert_verification", "predicate": "enabled", "value": false, ...}]

Example 2:
Input: `jwt.decode(token, algorithms=['none', 'HS256'])`
Output: [{"subject": "jwt/algorithms", "predicate": "allows_none", "value": true, ...}]
"#;
```

**Issue: LLM extracts with different subject path**

Check actual vs expected subjects:
```
Expected: tls/cert_verification
Actual:   security/tls/verify_disabled

→ Either: Update fixtures to match LLM's vocabulary
→ Or: Add subject mapping instructions to prompt
→ Or: Improve matcher's tail-path matching
```

**Issue: LLM only extracts first finding**

Add explicit instruction:
```
IMPORTANT: Extract ALL security-relevant claims from the code.
Do not stop after finding one issue - scan the entire file.
```

**Issue: LLM misses subtle patterns**

Add pattern hints:
```
Look for these specific patterns:
- verify=False, VERIFY=False, ssl_verify=False → TLS verification disabled
- algorithms=['none'], algorithm='none' → JWT algorithm none attack
- API_KEY = "...", apiKey: "..." → Hardcoded secrets
```

#### 2B.3 Validate Recall Improvements

```bash
aphoria eval run --mode live
# Check: recall should increase, precision should not drop significantly
```

**Checkpoint:**
```
IF recall improved AND precision dropped > 10%:
    → Proceed to Phase 2C immediately (balance precision)

ELSE IF recall still < 0.70:
    → Research Sprint: Analyze why specific patterns aren't recognized
    → Consider: Are fixtures too strict? Update matcher tolerance?
```

---

### Phase 2C: Precision Improvements

**When:** False positives or violations > 30%

#### 2C.1 Analyze False Positives

```bash
# What is the LLM incorrectly flagging?
aphoria eval run --mode live --format json | \
  jq '.fixture_results[] | select(.violations > 0) | {id: .fixture_id, violations: .violation_details}'
```

#### 2C.2 Common Precision Issues & Fixes

**Issue: Flagging safe patterns as issues**

Add negative examples to prompt:
```rust
const NEGATIVE_EXAMPLES: &str = r#"
These are SAFE and should NOT be flagged:

- `verify=certifi.where()` → NOT a TLS issue (using CA bundle)
- `API_KEY = os.environ['API_KEY']` → NOT hardcoded (from environment)
- `algorithms=['HS256', 'RS256']` → NOT algorithm none (secure algorithms only)
"#;
```

**Issue: Over-eager pattern matching**

Add specificity requirements:
```
Only extract claims when you are CONFIDENT the code is problematic.
Do NOT flag:
- Comments discussing security issues (not actual code)
- Test files demonstrating vulnerabilities (not production code)
- Variables that MIGHT contain sensitive data but do not clearly do so
```

**Issue: Misinterpreting context**

Add context awareness:
```
Consider the full context:
- Is this in a test file? Test code may intentionally have "bad" patterns.
- Is this in a comment? Comments are not executable code.
- Is the value from a secure source (environment, secrets manager)?
```

#### 2C.3 Validate Precision Improvements

```bash
aphoria eval run --mode live
# Check: precision should increase, recall should not drop significantly
```

**Checkpoint:**
```
IF precision improved AND recall dropped > 10%:
    → Go back to 2B, find balance

ELSE IF precision still < 0.70:
    → Research: Are fixtures correctly marked as "must_not_contain"?
    → Consider: Is the LLM's interpretation actually valid? Update fixtures?
```

---

### Phase 2D: Normalization Fixes

**When:** Subject/predicate mismatches > 40%

#### 2D.1 Identify Vocabulary Mismatches

```bash
# Compare LLM output subjects vs expected subjects
# Manual review of failures
```

Common mismatches:
| LLM Output | Expected | Fix |
|------------|----------|-----|
| `security/tls/disabled` | `tls/cert_verification` | Prompt vocabulary |
| `auth/jwt/none_allowed` | `jwt/algorithms` | Subject mapping |
| `credentials/hardcoded` | `secrets/api_key` | Standardize categories |

#### 2D.2 Fix Options

**Option A: Update Prompt Vocabulary**

Define exact subject paths in prompt:
```
Use these exact subject paths:
- TLS issues: tls/cert_verification, tls/min_version, tls/cipher_suites
- JWT issues: jwt/algorithms, jwt/validation, jwt/expiry
- Secrets: secrets/api_key, secrets/token, secrets/password
- Auth: auth/bypass, auth/session, auth/password_policy
```

**Option B: Improve Matcher Tolerance**

In `eval/matcher.rs`, enhance `subject_matches()`:
```rust
fn subject_matches(&self, extracted: &str, expected: &str) -> bool {
    // Existing tail-path matching
    let ext_tail = self.tail_path(extracted, 2);
    let exp_tail = self.tail_path(expected, 2);
    if ext_tail == exp_tail {
        return true;
    }

    // Add synonym matching
    let synonyms = [
        ("cert_verification", "verify"),
        ("cert_verification", "ssl_verify"),
        ("api_key", "apikey"),
        ("api_key", "secret_key"),
    ];

    for (a, b) in synonyms {
        if (ext_tail.contains(&a) && exp_tail.contains(&b)) ||
           (ext_tail.contains(&b) && exp_tail.contains(&a)) {
            return true;
        }
    }

    false
}
```

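The matcher above relies on a `tail_path` helper that is not shown. A minimal sketch of what tail-path extraction could look like follows, written as a free function that returns a `String`; the real method in `eval/matcher.rs` may have a different signature and return type, so treat this as an assumption.

```rust
/// Returns the last `n` slash-separated segments of `path`, joined by '/'.
/// Empty segments (from leading, trailing, or doubled slashes) are ignored.
/// Hypothetical helper for illustration; not Aphoria's implementation.
pub fn tail_path(path: &str, n: usize) -> String {
    let segments: Vec<&str> = path.split('/').filter(|s| !s.is_empty()).collect();
    // Keep everything if the path has fewer than `n` segments.
    let skip = segments.len().saturating_sub(n);
    segments[skip..].join("/")
}
```

With `n = 2`, `security/tls/cert_verification` and `tls/cert_verification` compare equal, which is the kind of prefix drift tail matching is meant to absorb. It does not help when the final segment itself differs (for example `verify_disabled` versus `cert_verification`); that case needs the synonym table or a prompt vocabulary fix.
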
**Option C: Update Fixtures to Match LLM**

If LLM's vocabulary is actually better/more consistent:
```bash
# Update fixture subjects to match what LLM produces
# This is valid if LLM's naming is more intuitive
```

#### 2D.3 Validate

```bash
aphoria eval run --mode live
# Subject/predicate mismatches should decrease
```

---

## Phase 2R: Research Sprint

**When:** Quick wins don't resolve issues, or unknown problems arise.

### 2R.1 Research Checklist

```markdown
## Research Log: [Issue Description]

### Problem Statement
- What specifically is failing?
- What have we tried?
- Why didn't it work?

### Research Tasks
- [ ] Review Gemini API documentation for relevant features
- [ ] Check if other projects solved similar issues
- [ ] Experiment with different prompt structures
- [ ] Test alternative models (if available)
- [ ] Review academic papers on code analysis with LLMs

### Experiments
| Experiment | Hypothesis | Result |
|------------|------------|--------|
| | | |

### Conclusion
- What worked?
- What didn't?
- Recommended approach?
```

### 2R.2 Common Research Topics

**Topic: Structured Output**
- Gemini supports `response_mime_type: "application/json"`
- May need `response_schema` for strict typing
- Test: Does this improve parse success rate?

**Topic: Function Calling**
- Alternative to free-form JSON output
- Define claims as function parameters
- Test: More reliable structure?

**Topic: Chain of Thought**
- Ask LLM to reason before extracting
- May improve accuracy on subtle patterns
- Test: Does CoT help without hurting speed?

**Topic: Fine-Tuning**
- Create training dataset from fixtures
- Fine-tune model on security extraction
- Test: Significant quality improvement?

**Topic: Multi-Pass Extraction**
- First pass: identify potential issues
- Second pass: validate and detail each
- Test: Higher precision with acceptable latency?

### 2R.3 Research Output

Document findings in `docs/llm-optimization/research/`:
```
research/
  structured-output.md
  chain-of-thought.md
  fine-tuning-feasibility.md
```

---

## Phase 3: Systematic Improvements

**Goal:** Implement comprehensive prompt and system improvements.

### 3.1 Prompt Architecture

Restructure prompt for maximum clarity:

```rust
const SYSTEM_PROMPT: &str = r#"
You are a security code analyzer. Your task is to extract security-relevant claims from source code.

## Output Format
Return a JSON array. Each claim must have:
- subject: Category path (e.g., "tls/cert_verification")
- predicate: Property name (e.g., "enabled")
- value: The extracted value (boolean, number, or string)
- confidence: Your confidence 0.0-1.0
- line: Line number where found
- rationale: Brief explanation

## Subject Categories
- tls/*: TLS/SSL configuration
- jwt/*: JWT token handling
- secrets/*: Credentials and API keys
- auth/*: Authentication mechanisms
- crypto/*: Cryptographic settings

## What to Extract
[Positive examples here]

## What NOT to Extract
[Negative examples here]

## Important Rules
1. Extract ALL findings, not just the first one
2. Only flag actual code, not comments
3. Consider context (test files, environment variables)
4. Be conservative - only flag clear issues
"#;
```

### 3.2 Confidence Calibration

Improve confidence scoring:

```rust
const CONFIDENCE_GUIDANCE: &str = r#"
Set confidence based on:
- 0.95-1.0: Explicit, unambiguous code (verify=False)
- 0.80-0.94: Clear pattern, minor ambiguity
- 0.60-0.79: Likely issue, needs context
- 0.40-0.59: Possible issue, uncertain
- Below 0.40: Don't report (too uncertain)
"#;
```

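The guidance above only instructs the LLM; the extractor can apply the same bands on its side as a safety net, for example to drop claims the model reported despite low confidence. The sketch below is illustrative only: `Band` and `band` are hypothetical names, and treating a non-numeric confidence as "suppress" is a design assumption.

```rust
/// Confidence bands mirroring the prompt guidance (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum Band {
    /// 0.95 and above: explicit, unambiguous code.
    Explicit,
    /// 0.80 up to 0.95: clear pattern, minor ambiguity.
    Clear,
    /// 0.60 up to 0.80: likely issue, needs context.
    Likely,
    /// 0.40 up to 0.60: possible issue, uncertain.
    Possible,
    /// Below 0.40, or not a number: do not report.
    Suppress,
}

/// Maps a reported confidence to its band.
/// NaN fails every `>=` comparison and therefore lands in `Suppress`.
pub fn band(confidence: f64) -> Band {
    if confidence >= 0.95 {
        Band::Explicit
    } else if confidence >= 0.80 {
        Band::Clear
    } else if confidence >= 0.60 {
        Band::Likely
    } else if confidence >= 0.40 {
        Band::Possible
    } else {
        Band::Suppress
    }
}
```

Using half-open ranges in code avoids the gaps that the two-decimal ranges in the prompt leave open (a value such as 0.945 still gets a band).
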
### 3.3 Language-Specific Handling

Add language-specific prompt sections:

```rust
fn get_language_hints(language: Language) -> &'static str {
    match language {
        Language::Python => r#"
Python-specific patterns:
- requests: verify=False, cert=False
- urllib3: disable_warnings()
- ssl: CERT_NONE, check_hostname=False
"#,
        Language::JavaScript => r#"
JavaScript/Node.js patterns:
- https: rejectUnauthorized: false
- axios: httpsAgent with rejectUnauthorized
- NODE_TLS_REJECT_UNAUTHORIZED=0
"#,
        // ... other languages
        // Fallback: no extra hints for languages without a dedicated section
        _ => "",
    }
}
```

### 3.4 Validate Comprehensive Changes

```bash
# Run full evaluation
aphoria eval run --mode live --format table

# Compare to baseline
# Expect: F1 improvement, no significant regressions
```

---

## Phase 4: Edge Case Hardening

**Goal:** Handle unusual inputs gracefully.

### 4.1 Edge Case Categories

| Category | Example | Expected Behavior |
|----------|---------|-------------------|
| Empty file | 0 bytes | No claims, no error |
| Binary file | .exe, .png | Skip gracefully |
| Huge file | >100KB code | Chunk or skip with warning |
| Minified code | single-line JS | Best effort extraction |
| Mixed language | HTML with JS | Detect embedded languages |
| Unicode | Non-ASCII identifiers | Handle correctly |
| Syntax errors | Invalid code | Extract what's possible |

### 4.2 Add Edge Case Fixtures

```bash
# Create edge case fixtures
cat > tests/llm_fixtures/edge/edge-002-minified.toml << 'EOF'
[metadata]
id = "edge-002"
name = "Minified JavaScript"
category = "edge"
language = "javascript"

[input]
filename = "bundle.min.js"
content = "var a=require('https');a.get({rejectUnauthorized:false},function(r){});"

[expected]
must_contain = [
    { subject = "tls/cert_verification", predicate = "enabled", value = false }
]
EOF
```

### 4.3 Implement Defensive Code

```rust
// In llm/extractor.rs
impl LlmExtractor {
    pub fn extract(&self, content: &str, language: Language) -> Vec<Observation> {
        // Edge case: empty content
        if content.trim().is_empty() {
            return Vec::new();
        }

        // Edge case: binary content. Checked before the size limit so that
        // large binary blobs are skipped rather than chunked.
        if content.bytes().any(|b| b == 0) {
            debug!("Binary content detected, skipping");
            return Vec::new();
        }

        // Edge case: too large
        if content.len() > MAX_CONTENT_SIZE {
            warn!(size = content.len(), "Content too large, chunking");
            return self.extract_chunked(content, language);
        }

        // Normal extraction
        self.extract_internal(content, language)
    }
}
```

### 4.4 Validate Edge Cases

```bash
aphoria eval run --mode live --category edge
# All edge cases should pass or gracefully skip
```

---

## Phase 5: CI Integration & Monitoring

**Goal:** Prevent regressions and track quality over time.

### 5.1 CI Pipeline

```yaml
# .github/workflows/llm-quality.yml
name: LLM Extraction Quality

on:
  pull_request:
    paths:
      - 'applications/aphoria/src/llm/**'
      - 'applications/aphoria/tests/llm_fixtures/**'
  schedule:
    - cron: '0 6 * * 1'  # Weekly, Mondays at 06:00 UTC

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup Rust
        uses: dtolnay/rust-toolchain@stable

      - name: Cache LLM responses
        uses: actions/cache@v4
        with:
          path: ~/.cache/aphoria/llm_cache
          # Path is relative to the repository root, matching the trigger paths above
          key: llm-cache-${{ hashFiles('applications/aphoria/tests/llm_fixtures/**') }}

      - name: Run Evaluation
        env:
          GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}
        run: |
          cargo run -p aphoria -- eval run \
            --mode cached \
            --fail-on-regression \
            --threshold 0.05 \
            --format json > eval-results.json

      - name: Upload Results
        uses: actions/upload-artifact@v4
        with:
          name: eval-results
          path: eval-results.json

      - name: Comment PR with Results
        if: github.event_name == 'pull_request'
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const results = JSON.parse(fs.readFileSync('eval-results.json', 'utf8'));
            const body = `## LLM Eval Results
            | Metric | Value |
            |--------|-------|
            | Precision | ${results.metrics.precision.toFixed(2)} |
            | Recall | ${results.metrics.recall.toFixed(2)} |
            | F1 | ${results.metrics.f1.toFixed(2)} |
            | Verdict | ${results.verdict} |`;
            await github.rest.issues.createComment({
              owner: context.repo.owner,
              repo: context.repo.repo,
              issue_number: context.issue.number,
              body: body
            });
```

### 5.2 Monitoring Dashboard

Track metrics over time:

```markdown
# docs/llm-optimization/metrics-history.md

| Date | Precision | Recall | F1 | Notes |
|------|-----------|--------|-----|-------|
| 2026-02-05 | 0.78 | 0.72 | 0.75 | Initial baseline |
| 2026-02-06 | 0.82 | 0.74 | 0.78 | Added few-shot examples |
| 2026-02-07 | 0.85 | 0.76 | 0.80 | Fixed parse issues |
```

### 5.3 Alert Thresholds

```toml
# aphoria.toml
[eval]
# Fail CI if metrics drop below these
min_precision = 0.75
min_recall = 0.70
min_f1 = 0.72
regression_threshold = 0.05
```

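The thresholds above combine two kinds of check: absolute floors per metric and a maximum allowed drop from the baseline. A minimal sketch of that logic follows. The struct and function names are hypothetical, and it assumes `regression_threshold` is an absolute difference (0.05 means five points of precision, recall, or F1) rather than a relative one; how Aphoria actually interprets the setting is not specified here.

```rust
/// Metric values for one run (hypothetical type for illustration).
#[derive(Debug, Clone, Copy)]
pub struct Metrics {
    pub precision: f64,
    pub recall: f64,
    pub f1: f64,
}

/// Mirrors the `[eval]` keys shown above.
#[derive(Debug, Clone, Copy)]
pub struct Thresholds {
    pub min_precision: f64,
    pub min_recall: f64,
    pub min_f1: f64,
    /// Assumed to be an absolute drop from baseline.
    pub regression_threshold: f64,
}

/// Returns one message per violated rule; an empty vector means CI may pass.
pub fn violations(current: Metrics, baseline: Metrics, t: Thresholds) -> Vec<String> {
    let checks = [
        ("precision", current.precision, baseline.precision, t.min_precision),
        ("recall", current.recall, baseline.recall, t.min_recall),
        ("f1", current.f1, baseline.f1, t.min_f1),
    ];

    let mut out = Vec::new();
    for (name, cur, base, floor) in checks {
        // Absolute floor check.
        if cur < floor {
            out.push(format!("{name} {cur:.2} is below the floor {floor:.2}"));
        }
        // Regression check against the saved baseline.
        if base - cur > t.regression_threshold {
            out.push(format!(
                "{name} regressed by {:.2} from baseline {base:.2}",
                base - cur
            ));
        }
    }
    out
}
```
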

---

## Phase 6: Continuous Improvement Loop

**Goal:** Establish ongoing optimization process.

### 6.1 Weekly Cadence

```markdown
## Monday: Review
- [ ] Check CI results from past week
- [ ] Review any failed fixtures
- [ ] Identify top 3 improvement opportunities

## Wednesday: Implement
- [ ] Pick one improvement from Monday's list
- [ ] Implement and test locally
- [ ] Run full eval suite

## Friday: Deploy
- [ ] If metrics improved, merge changes
- [ ] Update baseline
- [ ] Document what changed and why
```

### 6.2 Fixture Expansion

Add new fixtures when:
- New vulnerability pattern discovered
- False positive reported by user
- New language/framework support added
- Edge case found in production

```markdown
# Fixture addition checklist
- [ ] Create fixture file in appropriate category
- [ ] Add both must_contain and must_not_contain
- [ ] Run validation: aphoria eval validate-fixtures
- [ ] Run eval to verify fixture works: aphoria eval run --max-fixtures 1
- [ ] Update manifest.toml category count
```

### 6.3 Prompt Version Control

Track prompt changes:

```rust
// llm/prompts.rs
pub const PROMPT_VERSION: &str = "1.2.0";

// In changelog:
// 1.2.0 - Added negative examples for safe patterns
// 1.1.0 - Added language-specific hints
// 1.0.0 - Initial structured prompt
```

### 6.4 Quarterly Review

```markdown
## Quarterly LLM Optimization Review

### Metrics Trend
- Q1 Start: F1 = 0.XX
- Q1 End: F1 = 0.XX
- Improvement: +X%

### Major Changes
1. Change description
2. Change description

### Lessons Learned
- What worked well?
- What didn't work?
- What should we try next quarter?

### Next Quarter Goals
- [ ] Goal 1
- [ ] Goal 2
```

---

## Appendix A: Common Issues Reference

### A.1 "LLM returns empty array"

**Symptoms:** No claims extracted from files that clearly have issues.

**Diagnosis:**
```bash
# Check if LLM is being called
RUST_LOG=debug aphoria scan . --persist 2>&1 | grep -i llm
```

**Possible Causes:**
1. LLM disabled in config → Enable in aphoria.toml
2. File filtered out → Check file size/type filters
3. API error → Check API key and quota
4. Prompt too restrictive → Loosen "what to extract" section

### A.2 "Parse failures spike after API update"

**Symptoms:** Suddenly many parse failures.

**Diagnosis:**
```bash
# Check raw responses
# Add temporary logging to see what API returns
```

**Possible Causes:**
1. API response format changed
2. Model version updated
3. Rate limiting affecting responses

**Fix:** Update response parsing to handle new format.

### A.3 "Good local results, bad CI results"

**Symptoms:** Eval passes locally but fails in CI.

**Possible Causes:**
1. Cache inconsistency → Clear and rebuild cache
2. Environment differences → Check env vars
3. Timeout issues → Increase CI timeout
4. API key issues → Verify CI secrets

### A.4 "Precision tanked after recall improvement"

**Symptoms:** Improved recall but precision dropped significantly.

**Fix:**
1. Add negative examples to balance
2. Increase confidence threshold
3. Add more must_not_contain fixtures
4. Make extraction criteria more specific

### A.5 "Works for Python, fails for Go"

**Symptoms:** Language-specific extraction issues.

**Fix:**
1. Add language-specific prompt hints
2. Add fixtures for failing language
3. Check if patterns are language-specific
4. Consider language-specific extraction paths

---

## Appendix B: Fixture Writing Guide

### B.1 Good Fixture Characteristics

- **Minimal:** Only include code necessary to demonstrate the issue
- **Clear:** Obvious what the security issue is
- **Realistic:** Resembles actual production code
- **Isolated:** Tests one concept per fixture

### B.2 Fixture Template

```toml
[metadata]
id = "category-NNN"
name = "Brief description of what this tests"
category = "tls|jwt|secrets|auth|crypto|negative|edge"
language = "python|javascript|go|rust|java|..."
difficulty = "easy|medium|hard"
source = "hand-curated|real-world|generated"
created = "YYYY-MM-DD"
notes = "Any additional context"

[input]
filename = "example.py"
content = """
# Minimal code that demonstrates the issue
actual_code_here()
"""

[expected]
# Note: TOML 1.0 inline tables must stay on a single line.
must_contain = [
    { subject = "category/specific_thing", predicate = "property", value = <expected_value>, rationale = "Why this should be extracted" }
]

must_not_contain = [
    { subject = "category/other_thing", predicate = "property", value = <wrong_value>, rationale = "Why this should NOT be extracted" }
]

[scoring]
weight = 1.0
min_confidence = 0.7
```

### B.3 Category Guidelines

| Category | What to Include | Subject Prefix |
|----------|-----------------|----------------|
| tls | Certificate, protocol, cipher issues | `tls/` |
| jwt | Token validation, algorithm issues | `jwt/` |
| secrets | Hardcoded credentials, keys, tokens | `secrets/` |
| auth | Authentication bypass, weak auth | `auth/` |
| crypto | Weak algorithms, short keys | `crypto/` |
| negative | Safe patterns (no findings expected) | N/A |
| edge | Boundary conditions, unusual input | N/A |

---

## Appendix C: Decision Tree Summary

```
START
  │
  ├─→ Run baseline eval
  │     │
  │     ├─→ F1 >= 0.85? ──→ Skip to Phase 4
  │     │
  │     └─→ F1 < 0.85? ──→ Continue to diagnosis
  │
  ├─→ Classify failures
  │     │
  │     ├─→ Parse failures > 30%? ──→ Phase 2A (output structure)
  │     │
  │     ├─→ Missing claims > 50%? ──→ Phase 2B (recall)
  │     │
  │     ├─→ False positives > 30%? ──→ Phase 2C (precision)
  │     │
  │     └─→ Subject mismatches > 40%? ──→ Phase 2D (normalization)
  │
  ├─→ After each phase:
  │     │
  │     ├─→ Improved? ──→ Continue to next phase
  │     │
  │     ├─→ Regressed? ──→ Revert, try different approach
  │     │
  │     └─→ Stuck? ──→ Phase 2R (research sprint)
  │
  ├─→ All phases complete?
  │     │
  │     ├─→ F1 >= target? ──→ Phase 4 (edge cases), Phase 5 (CI)
  │     │
  │     └─→ F1 < target? ──→ Research: model limits, fine-tuning, alternatives
  │
  └─→ Ongoing: Phase 6 (continuous improvement)
```

---

## Quick Reference Commands

```bash
# Run evaluation
aphoria eval run --fixtures tests/llm_fixtures --mode live

# Run specific category
aphoria eval run --category tls --mode live

# Check for regressions
aphoria eval run --mode cached --fail-on-regression

# Update baseline
aphoria eval update-baseline --fixtures tests/llm_fixtures --force

# List fixtures
aphoria eval list-fixtures

# Validate fixtures
aphoria eval validate-fixtures

# Export JSON for analysis
aphoria eval run --mode live --format json > results.json
```