Scout & Judge: Hybrid Deterministic-Probabilistic Extraction Architecture
Status: Proposed (2026-02-05)
Phase: 7.9 (Replaces monolithic LLM extraction)
Context: Evolution of Phase 7.5 (LLM-in-the-Loop)
1. Problem Statement
The current LLM extraction pipeline ("Monolithic Mode") treats code files as unstructured text. It feeds entire files to the LLM to find security claims.
Issues with Monolithic Mode:
- Cost: 90% of a file is irrelevant to security (imports, UI logic, helpers), yet we pay for every token.
- Recall: LLMs struggle to find "needles in haystacks" (long context window degradation).
- Hallucination: Irrelevant code confuses the model, leading to false positives.
- Latency: Processing large files is slow and blocks the rest of the scan.
2. The Solution: Scout & Judge Architecture
We decouple the discovery of potential claims from the analysis of those claims.
- The Scout (Deterministic): Uses Abstract Syntax Trees (ASTs) via tree-sitter to find Regions of Interest (ROIs). It runs locally, so it is fast and incurs no LLM token cost.
- The Judge (Probabilistic): Uses the LLM to analyze only the specific ROI snippet to extract semantic meaning and confidence.
Architectural Diagram
graph TD
File[Source File] -->|Input| Scout["AST Scout (Tree-sitter)"]
subgraph "The Scout (Local/Fast)"
Scout -->|Parse| AST
AST -->|Query| Query[SCM Queries]
Query -->|Match| Candidate[Candidate Node]
Candidate -->|Expand| Snippet[Context Snippet]
end
Snippet -->|Input| Judge["LLM Judge (Gemini/Claude)"]
subgraph "The Judge (Remote/Smart)"
Judge -->|Prompt: Analyze this specific call| Claims[Structured Claims]
end
Claims -->|Output| Aggregator[Claim Aggregator]
3. Component Details
3.1 The Scout (Tree-sitter)
The Scout's job is High Recall. It should find anything that might be relevant. It does not need to be precise.
Technology: tree-sitter (Rust bindings)
Workflow:
- Detect Language: Identify file type (Python, Go, Rust, JS).
- Parse: Generate AST.
- Query: Run SCM (S-expression) queries to find patterns.
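The first workflow step can be sketched without any parser dependency. This is a minimal illustration that assumes language detection is driven purely by file extension; the `Language` enum and `detect_language` helper are hypothetical names, not existing code in the project.

```rust
use std::path::Path;

/// Languages the Scout is expected to handle (taken from the workflow list).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Language {
    Python,
    Go,
    Rust,
    JavaScript,
}

/// Hypothetical helper: maps a file extension to a `Language`.
/// Returns `None` for files the Scout should skip.
pub fn detect_language(path: &Path) -> Option<Language> {
    // Extensions are compared case-insensitively.
    let ext = path.extension()?.to_str()?.to_ascii_lowercase();
    match ext.as_str() {
        "py" => Some(Language::Python),
        "go" => Some(Language::Go),
        "rs" => Some(Language::Rust),
        "js" | "mjs" | "cjs" | "jsx" => Some(Language::JavaScript),
        _ => None,
    }
}
```

Files that return `None` never reach the parser, which keeps the Scout's cost proportional to the number of supported source files.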
Example Query (Python TLS):
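; tree-sitter-python names this node `call` (`call_expression` is the JavaScript name).
(call
  function: (attribute) @func
  arguments: (argument_list
    (keyword_argument
      name: (identifier) @arg_name
      value: (_) @value
    )
  )
  ; The doubled backslash keeps the dot literal after query-string unescaping.
  (#match? @func "requests\\.(get|post|put|delete)")
  (#eq? @arg_name "verify")
)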
Context Expansion: The Scout doesn't just grab the line. It grabs the Logical Context:
- The function call itself.
- Variable definitions referenced in the call (simple static analysis).
- Surrounding 5 lines for comments.
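The "surrounding lines" part of context expansion could look roughly like the sketch below. It assumes 1-based inclusive line numbers and a caller-supplied padding. `expand_context` is a hypothetical helper, and resolving referenced variable definitions is deliberately left out.

```rust
/// Hypothetical helper: widens a matched line range by `padding` lines on each
/// side, clamped to the file bounds. Line numbers are 1-based and inclusive.
/// Returns the widened range and the snippet text, or `None` for a bad range.
pub fn expand_context(
    source: &str,
    start_line: usize,
    end_line: usize,
    padding: usize,
) -> Option<(usize, usize, String)> {
    let lines: Vec<&str> = source.lines().collect();
    if start_line == 0 || start_line > end_line || end_line > lines.len() {
        return None; // invalid or out-of-range match
    }
    // Clamp the widened range so it never leaves the file.
    let first = start_line.saturating_sub(padding).max(1);
    let last = (end_line + padding).min(lines.len());
    let snippet = lines[first - 1..last].join("\n");
    Some((first, last, snippet))
}
```

Calling it with `padding = 5` would reproduce the five-line window described in the list.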
3.2 The Judge (LLM)
The Judge's job is High Precision. It receives a focused prompt and determines if a claim exists.
Input Prompt:
You are a security analyst.
Analyze this code snippet for TLS verification settings.
SNIPPET:
# Dev override
should_verify = False
requests.get(url, verify=should_verify)
CONTEXT:
Variable `should_verify` is defined on line 2.
TASK:
Does this snippet disable TLS verification?
Output JSON: { "subject": "tls/verification", "value": false, "confidence": 0.95 }
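As an illustration of how such a prompt might be assembled per snippet, here is a small sketch. `build_judge_prompt` and its parameters are hypothetical, and the section labels simply mirror the example prompt; the real template may differ.

```rust
/// Hypothetical helper: renders a Judge prompt for one snippet.
/// `topic` fills the "Analyze this code snippet for ..." line, and `context`
/// holds one note per referenced variable.
pub fn build_judge_prompt(topic: &str, snippet: &str, context: &[String], task: &str) -> String {
    let context_block = if context.is_empty() {
        "(none)".to_string()
    } else {
        context.join("\n")
    };
    // Trailing backslashes continue the literal and skip the next line's indentation.
    format!(
        "You are a security analyst.\n\
         Analyze this code snippet for {topic}.\n\
         SNIPPET:\n{snippet}\n\
         CONTEXT:\n{context_block}\n\
         TASK:\n{task}\n"
    )
}
```

Keeping the template in one function makes it straightforward to swap in a category-specific task line without touching the Scout.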
Why this wins:
- Token Efficiency: Input shrinks by an order of magnitude, e.g. from ~2000 tokens (file) to ~100 tokens (snippet).
- Accuracy: Model has no distractions.
- Speed: Parallelizable per-snippet.
4. Implementation Plan
Phase 1: Infrastructure (Dependencies)
Add tree-sitter support to Cargo.toml.
[dependencies]
tree-sitter = "0.20"
tree-sitter-python = "0.20"
tree-sitter-javascript = "0.20"
tree-sitter-go = "0.20"
tree-sitter-rust = "0.20"
Phase 2: The Scout Engine (src/scout/)
Create a new module applications/aphoria/src/scout/.
- mod.rs: Public interface.
- engine.rs: Orchestrates parsing and querying.
- queries/: Directory containing .scm query files for each category/language.
  - python/tls.scm
  - go/sql_injection.scm
Struct definition:
use std::collections::HashMap;

// `Language` is assumed here to be the Scout's own enum of supported languages.
pub struct CandidateSnippet {
    pub file_path: String,
    pub language: Language,
    pub start_line: usize,
    pub end_line: usize,
    pub code: String,
    pub context_variables: HashMap<String, String>, // Name -> Value/Definition
    pub query_id: String, // Which query found this
}
Phase 3: The Judge Engine (src/llm/judge.rs)
Refactor LlmExtractor to support "Judge Mode".
- Modify extract() to accept CandidateSnippet instead of full file content.
- Create specialized prompts for specific query IDs (e.g., if the Scout found a TLS pattern, use the specialized "TLS Judge" prompt, not the generic one).
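Routing a candidate to a specialized prompt could be as simple as the sketch below. It assumes query IDs follow a `<language>/<category>` shape that mirrors the `queries/` directory layout. The `JudgePrompt` enum and `select_prompt` function are illustrative names only.

```rust
/// Prompt families the Judge could choose between (names are illustrative).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum JudgePrompt {
    Tls,
    SqlInjection,
    Generic,
}

/// Hypothetical routing: assumes query IDs look like "<language>/<category>",
/// e.g. "python/tls". Unknown categories fall back to the generic prompt.
pub fn select_prompt(query_id: &str) -> JudgePrompt {
    // Take the part after the last '/', or the whole ID if there is none.
    let category = query_id.rsplit('/').next().unwrap_or(query_id);
    match category {
        "tls" => JudgePrompt::Tls,
        "sql_injection" => JudgePrompt::SqlInjection,
        _ => JudgePrompt::Generic,
    }
}
```

The generic fallback means a new query file can ship before its dedicated prompt exists.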
Phase 4: Integration
Modify the main scan loop:
- Regex Extractors run first (unchanged).
- Scout runs on all files (extremely fast).
- Deduplicate: If Scout finds a region already handled by Regex, drop it.
- Judge: Send remaining Candidates to LLM.
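The deduplication step can be sketched as a line-range overlap check. This assumes both regex findings and Scout candidates can be reduced to a file path plus an inclusive line range. `Region` and `drop_regex_handled` are hypothetical names introduced for illustration.

```rust
/// Inclusive 1-based line range within one file.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Region {
    pub file_path: String,
    pub start_line: usize,
    pub end_line: usize,
}

impl Region {
    /// True when both regions are in the same file and share at least one line.
    pub fn overlaps(&self, other: &Region) -> bool {
        self.file_path == other.file_path
            && self.start_line <= other.end_line
            && other.start_line <= self.end_line
    }
}

/// Keeps only the Scout candidates that no regex finding already covers.
pub fn drop_regex_handled(candidates: Vec<Region>, regex_hits: &[Region]) -> Vec<Region> {
    candidates
        .into_iter()
        .filter(|c| !regex_hits.iter().any(|hit| hit.overlaps(c)))
        .collect()
}
```

Only the regions returned here would be forwarded to the Judge, so every regex hit that overlaps a candidate saves one LLM call.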
5. Evaluation & Metrics
The "Prompt Evaluation System" (Phase 7.8) adapts to this model:
1. Scout Evaluation (Deterministic):
- Metric: Recall. "Did the Scout find the vulnerable line in fixtures/tls/bad.py?"
- Test: Unit tests using tree-sitter queries against code snippets. No LLM required.
2. Judge Evaluation (Probabilistic):
- Metric: Precision/Accuracy. "Given the snippet, did the LLM classify it correctly?"
- Fixture: tests/llm_fixtures now contains snippets derived from the Golden Corpus files.
3. Cost Efficiency Metric:
- Track tokens_per_claim.
- Goal: Reduce tokens/claim by >80% compared to the Monolithic approach.
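One way to compute the cost metric and check it against the 80% goal is sketched below. `RunStats` and `reduction` are hypothetical names, and any counts fed into them are illustrative inputs rather than measured results.

```rust
/// Token and claim counts for one extraction run.
#[derive(Debug, Clone, Copy)]
pub struct RunStats {
    pub tokens: u64,
    pub claims: u64,
}

impl RunStats {
    /// Tokens spent per extracted claim; `None` when no claims were produced.
    pub fn tokens_per_claim(&self) -> Option<f64> {
        if self.claims == 0 {
            None
        } else {
            Some(self.tokens as f64 / self.claims as f64)
        }
    }
}

/// Fractional reduction in tokens/claim relative to the baseline (0.8 == 80%).
/// Returns `None` when either run has no claims or the baseline spent no tokens.
pub fn reduction(baseline: RunStats, candidate: RunStats) -> Option<f64> {
    let base = baseline.tokens_per_claim()?;
    let cand = candidate.tokens_per_claim()?;
    if base == 0.0 {
        return None;
    }
    Some(1.0 - cand / base)
}
```

A run meets the goal when `reduction(monolithic, scout_judge)` returns a value above `0.8`.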
6. Migration Strategy
- Parallel Run: Run Scout logic alongside Regex logic in "shadow mode" (logging only) to tune queries.
- Incremental Rollout: Enable Scout & Judge for one category (e.g., TLS) while leaving the others in Monolithic mode (where still in use) or Regex mode.
- Full Switch: Deprecate "Monolithic Mode" prompts.
7. Comparison Summary
| Feature | Current (Monolithic) | Scout & Judge (Proposed) |
|---|---|---|
| Trigger | File name heuristic | AST Pattern Match |
| Input | Whole File | Relevant Snippet |
| Context | Noisy (imports, unrelated code) | Focused (local scope) |
| Cost | $ (Linear to file size) | ¢ (Linear to relevant code) |
| Reliability | Low (Lost in middle) | High (Forced focus) |
| Maintenance | Prompt Engineering | Query Engineering + Simple Prompts |