# Scout & Judge: Hybrid Deterministic-Probabilistic Extraction Architecture

> **Status:** Proposed (2026-02-05)
> **Phase:** 7.9 (Replaces monolithic LLM extraction)
> **Context:** Evolution of Phase 7.5 (LLM-in-the-Loop)

---

## 1. Problem Statement

The current LLM extraction pipeline ("Monolithic Mode") treats code files as unstructured text. It feeds entire files to the LLM to find security claims.

**Issues with Monolithic Mode:**

1. **Cost:** Roughly 90% of a file is irrelevant to security (imports, UI logic, helpers), yet we pay for every token.
2. **Recall:** LLMs struggle to find "needles in haystacks" (recall degrades over long context windows).
3. **Hallucination:** Irrelevant code confuses the model, leading to false positives.
4. **Latency:** Processing large files is slow and blocking.

## 2. The Solution: Scout & Judge Architecture

We decouple the **discovery** of potential claims from the **analysis** of those claims.

* **The Scout (Deterministic):** Uses Abstract Syntax Trees (AST) via `tree-sitter` to find *Regions of Interest* (ROIs). It runs locally, at native speed, and costs no LLM tokens.
* **The Judge (Probabilistic):** Uses the LLM to analyze *only* the specific ROI snippet to extract semantic meaning and confidence.

### Architectural Diagram

```mermaid
graph TD
    File[Source File] -->|Input| Scout["AST Scout (Tree-sitter)"]

    subgraph "The Scout (Local/Fast)"
        Scout -->|Parse| AST
        AST -->|Query| Query[SCM Queries]
        Query -->|Match| Candidate[Candidate Node]
        Candidate -->|Expand| Snippet[Context Snippet]
    end

    Snippet -->|Input| Judge["LLM Judge (Gemini/Claude)"]

    subgraph "The Judge (Remote/Smart)"
        Judge -->|"Prompt: Analyze this specific call"| Claims[Structured Claims]
    end

    Claims -->|Output| Aggregator[Claim Aggregator]
```

---

## 3. Component Details

### 3.1 The Scout (Tree-sitter)

The Scout's job is **High Recall**. It should find *anything* that *might* be relevant. It does not need to be precise.

**Technology:** `tree-sitter` (Rust bindings)

**Workflow:**

1. **Detect Language:** Identify file type (Python, Go, Rust, JS).
2. **Parse:** Generate AST.
3. **Query:** Run SCM (S-expression) queries to find patterns.

**Example Query (Python TLS):**

```scm
; Match requests.<method>(..., verify=<value>) calls.
; The Python grammar names a function call node `call`.
(call
  function: (attribute) @func
  arguments: (argument_list
    (keyword_argument
      name: (identifier) @arg_name
      value: (_) @value))
  ; The backslash is doubled so the regex sees a literal dot.
  (#match? @func "requests\\.(get|post|put|delete)")
  (#eq? @arg_name "verify"))
```

**Context Expansion:**

The Scout doesn't just grab the matching line. It grabs the **Logical Context**:

* The function call itself.
* Variable definitions referenced in the call (simple static analysis).
* The surrounding 5 lines, to capture comments.

### 3.2 The Judge (LLM)

The Judge's job is **High Precision**. It receives a focused prompt and determines whether a claim exists.

**Input Prompt:**

```text
You are a security analyst. Analyze this code snippet for TLS verification settings.

SNIPPET:
# Dev override
should_verify = False
requests.get(url, verify=should_verify)

CONTEXT:
Variable `should_verify` is defined on line 2.

TASK:
Does this snippet disable TLS verification?
Output JSON: { "subject": "tls/verification", "value": false, "confidence": 0.95 }
```

**Why this wins:**

* **Token Efficiency:** Input shrinks from roughly 2000 tokens (a whole file) to ~100 tokens (a snippet).
* **Accuracy:** The model sees only the relevant code, with no distractions.
* **Speed:** Snippets are independent, so Judge calls can run in parallel.

---

## 4. Implementation Plan

### Phase 1: Infrastructure (Dependencies)

Add `tree-sitter` support to `Cargo.toml`.

```toml
[dependencies]
tree-sitter = "0.20"
tree-sitter-python = "0.20"
tree-sitter-javascript = "0.20"
tree-sitter-go = "0.20"
tree-sitter-rust = "0.20"
```

### Phase 2: The Scout Engine (`src/scout/`)

Create a new module `applications/aphoria/src/scout/`.

* `mod.rs`: Public interface.
* `engine.rs`: Orchestrates parsing and querying.
* `queries/`: Directory containing `.scm` query files for each category/language.
    * `python/tls.scm`
    * `go/sql_injection.scm`

**Struct definition:**

```rust
use std::collections::HashMap;

pub struct CandidateSnippet {
    pub file_path: String,
    pub language: Language,
    pub start_line: usize,
    pub end_line: usize,
    pub code: String,
    pub context_variables: HashMap<String, String>, // Name -> Value/Definition
    pub query_id: String, // Which query found this
}
```

### Phase 3: The Judge Engine (`src/llm/judge.rs`)

Refactor `LlmExtractor` to support "Judge Mode".

* Modify `extract()` to accept a `CandidateSnippet` instead of full file content.
* Create specialized prompts for specific query IDs (e.g., if the Scout found a TLS pattern, use the specialized "TLS Judge" prompt, not the generic one).

### Phase 4: Integration

Modify the main `scan` loop:

1. **Regex Extractors** run first (unchanged).
2. **Scout** runs on all files (extremely fast).
3. **Deduplicate:** If the Scout finds a region already handled by Regex, drop it.
4. **Judge:** Send the remaining candidates to the LLM.

---

## 5. Evaluation & Metrics

The "Prompt Evaluation System" (Phase 7.8) adapts to this model:

**1. Scout Evaluation (Deterministic):**

* **Metric:** Recall. "Did the Scout find the vulnerable line in `fixtures/tls/bad.py`?"
* **Test:** Unit tests that run `tree-sitter` queries against code snippets. No LLM required.

**2. Judge Evaluation (Probabilistic):**

* **Metric:** Precision/Accuracy. "Given the snippet, did the LLM classify it correctly?"
* **Fixture:** `tests/llm_fixtures` now contains *snippets* derived from the Golden Corpus files.

**3. Cost Efficiency Metric:**

* Track `tokens_per_claim`.
* Goal: Reduce tokens/claim by >80% compared to the Monolithic approach.

## 6. Migration Strategy

1. **Parallel Run:** Run Scout logic alongside Regex logic in "shadow mode" (logging only) to tune queries.
2. **Incremental Rollout:** Enable Scout & Judge for **one category** (e.g., TLS) while leaving the others in Monolithic mode (where still in use) or Regex mode.
3. **Full Switch:** Deprecate "Monolithic Mode" prompts.

---

## 7. Comparison Summary

| Feature | Current (Monolithic) | Scout & Judge (Proposed) |
| :--- | :--- | :--- |
| **Trigger** | File name heuristic | AST Pattern Match |
| **Input** | Whole File | Relevant Snippet |
| **Context** | Noisy (imports, unrelated code) | Focused (local scope) |
| **Cost** | $$$ (Linear in file size) | ¢ (Linear in *relevant* code) |
| **Reliability** | Low (Lost in middle) | High (Forced focus) |
| **Maintenance** | Prompt Engineering | Query Engineering + Simple Prompts |
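
To make the Phase 4 deduplication step (section 4) more concrete, here is a minimal sketch using only the standard library. It rests on assumptions the proposal does not spell out: the `Region` type and the `dedupe_candidates` function are hypothetical names, and "already handled by Regex" is interpreted as "shares at least one line with a Regex hit in the same file". A real implementation would likely operate on `CandidateSnippet` and might use a different overlap rule.

```rust
/// Hypothetical helper type: a 1-based, inclusive line range inside one file.
/// It stands in for the location fields of `CandidateSnippet`.
#[derive(Debug, Clone, PartialEq)]
pub struct Region {
    pub file_path: String,
    pub start_line: usize,
    pub end_line: usize,
}

impl Region {
    /// True when both regions are in the same file and share at least one line.
    /// Assumption: any line overlap counts as "already handled".
    pub fn overlaps(&self, other: &Region) -> bool {
        self.file_path == other.file_path
            && self.start_line <= other.end_line
            && other.start_line <= self.end_line
    }
}

/// Keep only the Scout candidates that no Regex extractor has already covered.
/// The survivors are what would be sent on to the Judge.
pub fn dedupe_candidates(candidates: Vec<Region>, regex_handled: &[Region]) -> Vec<Region> {
    candidates
        .into_iter()
        .filter(|c| !regex_handled.iter().any(|r| r.overlaps(c)))
        .collect()
}
```

For example, given a Regex hit on line 10 of `app.py`, a Scout candidate spanning lines 8-12 of `app.py` is dropped, while candidates at lines 40-44 of `app.py` and at line 10 of `db.py` are kept for the Judge. The linear scan over `regex_handled` is quadratic in the worst case; that seems acceptable for a per-file list of hits, but an interval index would be an option if profiling says otherwise.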