stemedb/applications/aphoria/docs/architecture/scout-judge-extraction.md

# Scout & Judge: Hybrid Deterministic-Probabilistic Extraction Architecture

> **Status:** Proposed (2026-02-05)
> **Phase:** 7.9 (Replaces monolithic LLM extraction)
> **Context:** Evolution of Phase 7.5 (LLM-in-the-Loop)

---

## 1. Problem Statement

The current LLM extraction pipeline ("Monolithic Mode") treats code files as unstructured text. It feeds entire files to the LLM to find security claims.

**Issues with Monolithic Mode:**
1.  **Cost:** Roughly 90% of a typical file (imports, UI logic, helpers) is irrelevant to security, yet we pay for every token.
2.  **Recall:** LLMs struggle to find "needles in haystacks": recall degrades as the context window fills.
3.  **Hallucination:** Irrelevant code confuses the model, leading to false positives.
4.  **Latency:** Processing large files is slow and blocking.

## 2. The Solution: Scout & Judge Architecture

We decouple the **discovery** of potential claims from the **analysis** of those claims.

*   **The Scout (Deterministic):** Uses Abstract Syntax Trees (ASTs) via `tree-sitter` to find *Regions of Interest* (ROIs) locally, quickly, and at zero LLM cost.
*   **The Judge (Probabilistic):** Uses the LLM to analyze *only* the specific ROI snippet to extract semantic meaning and confidence.

### Architectural Diagram

```mermaid
graph TD
    File[Source File] -->|Input| Scout["AST Scout (Tree-sitter)"]

    subgraph "The Scout (Local/Fast)"
        Scout -->|Parse| AST
        AST -->|Query| Query[SCM Queries]
        Query -->|Match| Candidate[Candidate Node]
        Candidate -->|Expand| Snippet[Context Snippet]
    end

    Snippet -->|Input| Judge["LLM Judge (Gemini/Claude)"]

    subgraph "The Judge (Remote/Smart)"
        Judge -->|Prompt: Analyze this specific call| Claims[Structured Claims]
    end

    Claims -->|Output| Aggregator[Claim Aggregator]
```
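
The flow in the diagram can be sketched as two traits and one driver function. This is a minimal illustration, not the proposed implementation: `Snippet`, `Claim`, `Scout`, `Judge`, `extract_claims`, `SubstringScout`, and `StubJudge` are all hypothetical names, and the two stubs stand in for tree-sitter and the LLM so the sketch stays self-contained.

```rust
/// A region of interest produced by the Scout (simplified for this sketch).
#[derive(Debug, Clone, PartialEq)]
pub struct Snippet {
    pub start_line: usize,
    pub code: String,
}

/// A structured claim produced by the Judge.
#[derive(Debug, Clone, PartialEq)]
pub struct Claim {
    pub subject: String,
    pub value: bool,
    pub confidence: f64,
}

/// Deterministic stage: cheap, local, tuned for recall.
pub trait Scout {
    fn find_candidates(&self, source: &str) -> Vec<Snippet>;
}

/// Probabilistic stage: expensive, remote, tuned for precision.
pub trait Judge {
    fn analyze(&self, snippet: &Snippet) -> Option<Claim>;
}

/// Runs the Scout over the whole file, then the Judge over each candidate only.
pub fn extract_claims(scout: &dyn Scout, judge: &dyn Judge, source: &str) -> Vec<Claim> {
    scout
        .find_candidates(source)
        .iter()
        .filter_map(|snippet| judge.analyze(snippet))
        .collect()
}

/// Stand-in Scout (assumption): a substring match instead of a tree-sitter query.
pub struct SubstringScout(pub &'static str);

impl Scout for SubstringScout {
    fn find_candidates(&self, source: &str) -> Vec<Snippet> {
        source
            .lines()
            .enumerate()
            .filter(|(_, line)| line.contains(self.0))
            .map(|(i, line)| Snippet { start_line: i + 1, code: line.to_string() })
            .collect()
    }
}

/// Stand-in Judge (assumption): a fixed rule instead of an LLM call.
pub struct StubJudge;

impl Judge for StubJudge {
    fn analyze(&self, snippet: &Snippet) -> Option<Claim> {
        if snippet.code.contains("verify=False") {
            Some(Claim { subject: "tls/verification".into(), value: false, confidence: 1.0 })
        } else {
            None
        }
    }
}
```

The point of the split is that `extract_claims` never hands the Judge anything the Scout did not select, so LLM cost scales with the number of candidates rather than with file size.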

---

## 3. Component Details

### 3.1 The Scout (Tree-sitter)

The Scout's job is **High Recall**. It should find *anything* that *might* be relevant. It does not need to be precise.

**Technology:** `tree-sitter` (Rust bindings)

**Workflow:**
1.  **Detect Language:** Identify file type (Python, Go, Rust, JS).
2.  **Parse:** Generate AST.
3.  **Query:** Run SCM (S-expression) queries to find patterns.
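
Step 1 could be as simple as a file-extension lookup. The sketch below assumes a hypothetical `Language` enum and `detect_language` function; the extension list is illustrative, and a real implementation might also inspect shebang lines.

```rust
use std::path::Path;

/// Languages the Scout can parse (hypothetical enum for this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Language {
    Python,
    Go,
    Rust,
    JavaScript,
}

/// Maps a file extension to a supported language.
/// Returns `None` for files the Scout should skip.
pub fn detect_language(path: &Path) -> Option<Language> {
    match path.extension()?.to_str()? {
        "py" => Some(Language::Python),
        "go" => Some(Language::Go),
        "rs" => Some(Language::Rust),
        "js" | "mjs" | "cjs" | "jsx" => Some(Language::JavaScript),
        _ => None,
    }
}
```

Files that return `None` never reach the parser, which keeps unsupported formats out of the Scout entirely.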

**Example Query (Python TLS):**
```scm
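; `call` is the node type for calls in the tree-sitter-python grammar.
(call
  function: (attribute) @func
  arguments: (argument_list
    (keyword_argument
      name: (identifier) @arg_name
      value: (_) @value
    )
  )
  ; `\\.` matches a literal dot; a single backslash would be consumed by the query string parser.
  (#match? @func "^requests\\.(get|post|put|delete)$")
  (#eq? @arg_name "verify")
)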
```

**Context Expansion:**
The Scout does not grab only the matching line. It grabs the **Logical Context**:
*   The function call itself.
*   Variable definitions referenced in the call (simple static analysis).
*   The 5 surrounding lines, to pick up nearby comments.
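
The line-window part of this expansion can be sketched as follows. `expand_context` is a hypothetical helper, and applying the margin on both sides of the match is an assumption about what "surrounding" means here. Resolving referenced variable definitions is a separate step and is not shown.

```rust
/// Expands a 1-indexed, inclusive line range by `margin` lines on each side,
/// clamped to the file bounds. Returns the new range and the snippet text,
/// or `None` if the input range does not fit the file.
pub fn expand_context(
    source: &str,
    start_line: usize,
    end_line: usize,
    margin: usize,
) -> Option<(usize, usize, String)> {
    let lines: Vec<&str> = source.lines().collect();
    if start_line == 0 || start_line > end_line || end_line > lines.len() {
        return None;
    }
    // Clamp to the first and last line of the file.
    let start = start_line.saturating_sub(margin).max(1);
    let end = (end_line + margin).min(lines.len());
    let snippet = lines[start - 1..end].join("\n");
    Some((start, end, snippet))
}
```

Returning the expanded range alongside the text lets the caller fill in `start_line` and `end_line` on the candidate without recomputing them.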

### 3.2 The Judge (LLM)

The Judge's job is **High Precision**. It receives a focused prompt and determines if a claim exists.

**Input Prompt:**
```text
You are a security analyst.
Analyze this code snippet for TLS verification settings.

SNIPPET:
# Dev override
should_verify = False
requests.get(url, verify=should_verify)

CONTEXT:
Variable `should_verify` is defined on line 2.

TASK:
Does this snippet disable TLS verification?
Output JSON: { "subject": "tls/verification", "value": false, "confidence": 0.95 }
```
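
A prompt in this layout could be rendered from a candidate's fields roughly as below. `build_judge_prompt` and its parameters are hypothetical; the wording of the final output instruction is an assumption and differs slightly from the example above.

```rust
use std::collections::BTreeMap;

/// Renders a Judge prompt in the SNIPPET / CONTEXT / TASK layout.
/// `context` maps a variable name to the snippet line where it is defined;
/// a `BTreeMap` keeps the rendered order deterministic.
pub fn build_judge_prompt(
    topic: &str,
    question: &str,
    code: &str,
    context: &BTreeMap<String, usize>,
) -> String {
    let mut prompt = format!(
        "You are a security analyst.\nAnalyze this code snippet for {}.\n\nSNIPPET:\n{}\n\nCONTEXT:\n",
        topic, code
    );
    for (name, line) in context {
        prompt.push_str(&format!("Variable `{}` is defined on line {}.\n", name, line));
    }
    prompt.push_str(&format!("\nTASK:\n{}\n", question));
    // Assumed wording for the output instruction.
    prompt.push_str("Output JSON with keys \"subject\", \"value\", \"confidence\".\n");
    prompt
}
```

Deterministic rendering matters for evaluation: the same candidate should always produce the same prompt, so fixture results are comparable across runs.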

**Why this wins:**
*   **Token Efficiency:** In an example like this one, input shrinks from ~2,000 tokens (whole file) to ~100 tokens (snippet).
*   **Accuracy:** The model sees far less unrelated code.
*   **Speed:** Snippets are independent, so Judge calls can run in parallel.

---

## 4. Implementation Plan

### Phase 1: Infrastructure (Dependencies)

Add `tree-sitter` support to `Cargo.toml`.

```toml
[dependencies]
tree-sitter = "0.20"
tree-sitter-python = "0.20"
tree-sitter-javascript = "0.20"
tree-sitter-go = "0.20"
tree-sitter-rust = "0.20"
```

### Phase 2: The Scout Engine (`src/scout/`)

Create a new module `applications/aphoria/src/scout/`.

*   `mod.rs`: Public interface.
*   `engine.rs`: Orchestrates parsing and querying.
*   `queries/`: Directory containing `.scm` query files for each category/language.
    *   `python/tls.scm`
    *   `go/sql_injection.scm`

**Struct definition:**
```rust
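use std::collections::HashMap;

/// Assumed to be the Scout's own enum (see 3.1), not `tree_sitter::Language`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Language { Python, Go, Rust, JavaScript }

/// A Region of Interest found by the Scout, ready to hand to the Judge.
#[derive(Debug, Clone)]
pub struct CandidateSnippet {
    pub file_path: String,
    pub language: Language,
    pub start_line: usize,
    pub end_line: usize,
    pub code: String,
    pub context_variables: HashMap<String, String>, // Name -> Value/Definition
    pub query_id: String, // Which query found this
}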
```

### Phase 3: The Judge Engine (`src/llm/judge.rs`)

Refactor `LlmExtractor` to support "Judge Mode".

*   Modify `extract()` to accept `CandidateSnippet` instead of full file content.
*   Create specialized prompts for specific query IDs (e.g., if Scout found a TLS pattern, use the specialized "TLS Judge" prompt, not the generic one).
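
Prompt selection by query ID might look like the sketch below. It assumes query IDs follow the query file layout (`<language>/<category>`, e.g. `python/tls`), which the plan does not specify; the constants and `prompt_template_for` are hypothetical, and the template texts are placeholders.

```rust
/// Placeholder templates; the real prompt texts would live elsewhere.
pub const TLS_JUDGE_PROMPT: &str = "Analyze this code snippet for TLS verification settings.";
pub const SQL_INJECTION_JUDGE_PROMPT: &str = "Analyze this code snippet for SQL injection risks.";
pub const GENERIC_JUDGE_PROMPT: &str = "Analyze this code snippet for security-relevant settings.";

/// Picks a prompt template from the category part of a query ID.
/// Assumes IDs look like `<language>/<category>`; unknown categories
/// fall back to the generic prompt.
pub fn prompt_template_for(query_id: &str) -> &'static str {
    let category = query_id.rsplit('/').next().unwrap_or(query_id);
    match category {
        "tls" => TLS_JUDGE_PROMPT,
        "sql_injection" => SQL_INJECTION_JUDGE_PROMPT,
        _ => GENERIC_JUDGE_PROMPT,
    }
}
```

Keying on the category rather than the full ID lets `python/tls` and `go/tls` share one specialized prompt.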

### Phase 4: Integration

Modify the main `scan` loop:

1.  **Regex Extractors** run first (unchanged).
2.  **Scout** runs on all files (extremely fast).
3.  **Deduplicate:** If Scout finds a region already handled by Regex, drop it.
4.  **Judge:** Send remaining Candidates to LLM.
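
The deduplication step (3) can be sketched as an overlap filter. Treating "already handled" as "overlapping line range in the same file" is an assumption, and `Region` and `drop_handled` are hypothetical names.

```rust
/// A line range in a file (1-indexed, inclusive).
#[derive(Debug, Clone, PartialEq)]
pub struct Region {
    pub file_path: String,
    pub start_line: usize,
    pub end_line: usize,
}

impl Region {
    /// True if both regions are in the same file and share at least one line.
    pub fn overlaps(&self, other: &Region) -> bool {
        self.file_path == other.file_path
            && self.start_line <= other.end_line
            && other.start_line <= self.end_line
    }
}

/// Keeps only the Scout candidates that no Regex extractor has already covered.
pub fn drop_handled(candidates: Vec<Region>, handled_by_regex: &[Region]) -> Vec<Region> {
    candidates
        .into_iter()
        .filter(|c| !handled_by_regex.iter().any(|h| c.overlaps(h)))
        .collect()
}
```

Comparing the matched range rather than the context-expanded range is probably safer here, since expansion could make unrelated candidates overlap a Regex hit a few lines away.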

---

## 5. Evaluation & Metrics

The "Prompt Evaluation System" (Phase 7.8) adapts to this model:

**1. Scout Evaluation (Deterministic):**
*   **Metric:** Recall. "Did the Scout find the vulnerable line in `fixtures/tls/bad.py`?"
*   **Test:** Unit tests using `tree-sitter` queries against code snippets. No LLM required.

**2. Judge Evaluation (Probabilistic):**
*   **Metric:** Precision/Accuracy. "Given the snippet, did the LLM classify it correctly?"
*   **Fixture:** `tests/llm_fixtures` now contains *snippets* derived from the Golden Corpus files.

**3. Cost Efficiency Metric:**
*   Track `tokens_per_claim`.
*   Goal: Reduce tokens/claim by >80% compared to Monolithic approach.
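
The three metrics reduce to simple ratios. The sketch below uses hypothetical function names and returns `None` when a ratio is undefined (for example, no claims extracted).

```rust
/// Share of known-vulnerable sites the Scout found: TP / (TP + FN).
pub fn recall(true_positives: usize, false_negatives: usize) -> Option<f64> {
    ratio(true_positives, true_positives + false_negatives)
}

/// Share of Judge-reported claims that were correct: TP / (TP + FP).
pub fn precision(true_positives: usize, false_positives: usize) -> Option<f64> {
    ratio(true_positives, true_positives + false_positives)
}

/// Average number of LLM tokens spent per extracted claim.
pub fn tokens_per_claim(total_tokens: u64, claims: usize) -> Option<f64> {
    if claims == 0 {
        None
    } else {
        Some(total_tokens as f64 / claims as f64)
    }
}

/// Percentage reduction of `candidate` relative to `baseline`.
pub fn token_reduction_pct(baseline: f64, candidate: f64) -> Option<f64> {
    if baseline <= 0.0 {
        None
    } else {
        Some(100.0 * (baseline - candidate) / baseline)
    }
}

fn ratio(numerator: usize, denominator: usize) -> Option<f64> {
    if denominator == 0 {
        None
    } else {
        Some(numerator as f64 / denominator as f64)
    }
}
```

With the illustrative figures from Section 3.2 (about 2,000 tokens per file versus about 100 per snippet, one claim each), `token_reduction_pct(2000.0, 100.0)` gives a 95% reduction, which would clear the >80% goal.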

## 6. Migration Strategy

1.  **Parallel Run:** Run Scout logic alongside Regex logic in "shadow mode" (logging only) to tune queries.
2.  **Incremental Rollout:** Enable Scout & Judge for **one category** (e.g., TLS) while every other category stays on its existing path (Monolithic or Regex).
3.  **Full Switch:** Deprecate "Monolithic Mode" prompts.
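
One way to model these stages is a per-category mode switch. This is a sketch under assumptions: `ExtractionMode`, `RolloutConfig`, and the category keys are hypothetical, and the existing Regex and Monolithic paths are collapsed into a single `Legacy` mode for brevity.

```rust
use std::collections::HashMap;

/// Rollout stage for one claim category.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ExtractionMode {
    /// Existing path only (Regex or Monolithic).
    Legacy,
    /// Existing path stays authoritative; Scout runs and only logs candidates.
    Shadow,
    /// Scout candidates are sent to the Judge.
    ScoutJudge,
}

/// Per-category rollout settings with a default for unlisted categories.
pub struct RolloutConfig {
    pub default_mode: ExtractionMode,
    pub overrides: HashMap<String, ExtractionMode>,
}

impl RolloutConfig {
    pub fn mode_for(&self, category: &str) -> ExtractionMode {
        self.overrides.get(category).copied().unwrap_or(self.default_mode)
    }

    /// The Scout runs in both shadow mode and full mode.
    pub fn runs_scout(&self, category: &str) -> bool {
        self.mode_for(category) != ExtractionMode::Legacy
    }

    /// Only full mode spends LLM tokens on Scout candidates.
    pub fn sends_to_judge(&self, category: &str) -> bool {
        self.mode_for(category) == ExtractionMode::ScoutJudge
    }
}
```

Under this model, the full switch in step 3 amounts to changing `default_mode` to `ScoutJudge` and removing the overrides.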

---

## 7. Comparison Summary

| Feature | Current (Monolithic) | Scout & Judge (Proposed) |
| :--- | :--- | :--- |
| **Trigger** | File name heuristic | AST Pattern Match |
| **Input** | Whole File | Relevant Snippet |
| **Context** | Noisy (imports, unrelated code) | Focused (local scope) |
| **Cost** | $$$ (Linear in file size) | ¢ (Linear in amount of *relevant* code) |
| **Reliability** | Low ("lost in the middle") | High (forced focus) |
| **Maintenance** | Prompt Engineering | Query Engineering + Simple Prompts |