
# Scout & Judge: Hybrid Deterministic-Probabilistic Extraction Architecture
> **Status:** Proposed (2026-02-05)
> **Phase:** 7.9 (Replaces monolithic LLM extraction)
> **Context:** Evolution of Phase 7.5 (LLM-in-the-Loop)
---
## 1. Problem Statement
The current LLM extraction pipeline ("Monolithic Mode") treats code files as unstructured text. It feeds entire files to the LLM to find security claims.
**Issues with Monolithic Mode:**
1. **Cost:** Roughly 90% of a typical file is irrelevant to security (imports, UI logic, helpers), yet we pay for every token.
2. **Recall:** LLMs struggle to find "needles in haystacks"; accuracy degrades as the context window fills up.
3. **Hallucination:** Irrelevant code confuses the model, leading to false positives.
4. **Latency:** Processing large files is slow and blocking.
## 2. The Solution: Scout & Judge Architecture
We decouple the **discovery** of potential claims from the **analysis** of those claims.
* **The Scout (Deterministic):** Uses Abstract Syntax Trees (AST) via `tree-sitter` to find *Regions of Interest* (ROIs). It runs locally, makes no LLM calls, and therefore costs no tokens.
* **The Judge (Probabilistic):** Uses the LLM to analyze *only* the specific ROI snippet to extract semantic meaning and confidence.
### Architectural Diagram
```mermaid
graph TD
    File[Source File] -->|Input| Scout["AST Scout (Tree-sitter)"]
    subgraph ScoutStage["The Scout (Local/Fast)"]
        Scout -->|Parse| AST
        AST -->|Query| Query[SCM Queries]
        Query -->|Match| Candidate[Candidate Node]
        Candidate -->|Expand| Snippet[Context Snippet]
    end
    Snippet -->|Input| Judge["LLM Judge (Gemini/Claude)"]
    subgraph JudgeStage["The Judge (Remote/Smart)"]
        Judge -->|"Prompt: Analyze this specific call"| Claims[Structured Claims]
    end
    Claims -->|Output| Aggregator[Claim Aggregator]
```
---
## 3. Component Details
### 3.1 The Scout (Tree-sitter)
The Scout's job is **High Recall**. It should find *anything* that *might* be relevant. It does not need to be precise.
**Technology:** `tree-sitter` (Rust bindings)
**Workflow:**
1. **Detect Language:** Identify file type (Python, Go, Rust, JS).
2. **Parse:** Generate AST.
3. **Query:** Run SCM (S-expression) queries to find patterns.
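
Step 1 of the workflow can be sketched with the standard library alone. This is a minimal illustration, not the project's implementation: the `Language` enum, the `detect_language` name, and the extension mapping are all assumptions made for the example.

```rust
use std::path::Path;

/// Languages the Scout is expected to parse (taken from the workflow list above).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Language {
    Python,
    Go,
    Rust,
    JavaScript,
}

/// Hypothetical helper: map a file extension to a language.
/// Returns `None` for files the Scout should skip.
pub fn detect_language(path: &Path) -> Option<Language> {
    // Extensions are compared case-insensitively; the mapping is an assumption.
    let ext = path.extension()?.to_str()?.to_ascii_lowercase();
    match ext.as_str() {
        "py" => Some(Language::Python),
        "go" => Some(Language::Go),
        "rs" => Some(Language::Rust),
        "js" | "mjs" | "cjs" | "jsx" => Some(Language::JavaScript),
        _ => None,
    }
}
```

Returning `Option` lets the scan loop skip unsupported files without treating them as errors.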
**Example Query (Python TLS):**
```scm
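; Match requests.get/post/put/delete calls that pass a `verify=` keyword.
; Note: the Python grammar names this node `call`, not `call_expression`.
(call
  function: (attribute) @func
  arguments: (argument_list
    (keyword_argument
      name: (identifier) @arg_name
      value: (_) @value))
  (#match? @func "^requests\\.(get|post|put|delete)$")
  (#eq? @arg_name "verify"))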
```
**Context Expansion:**
The Scout doesn't just grab the line. It grabs the **Logical Context**:
* The function call itself.
* Variable definitions referenced in the call (simple static analysis).
* The 5 surrounding lines, to capture nearby comments.
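
The line-window part of context expansion could look roughly like the sketch below. It only widens a matched range by a fixed padding and clamps it to the file; resolving referenced variable definitions is out of scope here. `LineRange` and `expand_context` are hypothetical names introduced for illustration.

```rust
/// Inclusive, 1-based line range within a source file.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct LineRange {
    pub start: usize,
    pub end: usize,
}

/// Hypothetical helper: widen a matched range by `padding` lines on each side,
/// clamp it to the file bounds, and return the new range plus its text.
pub fn expand_context(source: &str, matched: LineRange, padding: usize) -> (LineRange, String) {
    let lines: Vec<&str> = source.lines().collect();
    // Treat an empty file as having one (empty) line so the range stays valid.
    let total = lines.len().max(1);
    let start = matched.start.saturating_sub(padding).clamp(1, total);
    let end = matched.end.saturating_add(padding).min(total).max(start);
    let text = lines
        .iter()
        .skip(start - 1)
        .take(end + 1 - start)
        .copied()
        .collect::<Vec<_>>()
        .join("\n");
    (LineRange { start, end }, text)
}
```

Calling it with `padding = 5` mirrors the "5 surrounding lines" rule; a match near the top or bottom of a file simply gets a shorter window.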
### 3.2 The Judge (LLM)
The Judge's job is **High Precision**. It receives a focused prompt and determines if a claim exists.
**Input Prompt:**
```text
You are a security analyst.
Analyze this code snippet for TLS verification settings.
SNIPPET:
# Dev override
should_verify = False
requests.get(url, verify=should_verify)
CONTEXT:
Variable `should_verify` is defined on line 2.
TASK:
Does this snippet disable TLS verification?
Output JSON: { "subject": "tls/verification", "value": false, "confidence": 0.95 }
```
**Why this wins:**
* **Token Efficiency:** Input shrinks from ~2000 tokens (whole file) to ~100 tokens (snippet).
* **Accuracy:** The model sees no unrelated code to distract it.
* **Speed:** Parallelizable per-snippet.
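
The prompt layout shown in this section can be assembled mechanically from a snippet and its context notes. The sketch below is one possible shape, assuming a hypothetical `build_judge_prompt` function; the real Judge may template its prompts differently.

```rust
/// Hypothetical helper: assemble a focused Judge prompt from a snippet,
/// optional context notes, and a single task question.
pub fn build_judge_prompt(topic: &str, snippet: &str, context: &[String], task: &str) -> String {
    let mut prompt = String::new();
    prompt.push_str("You are a security analyst.\n");
    prompt.push_str(&format!("Analyze this code snippet for {}.\n", topic));
    prompt.push_str("SNIPPET:\n");
    // Trailing whitespace is dropped so section headers stay on their own lines.
    prompt.push_str(snippet.trim_end());
    prompt.push('\n');
    // The CONTEXT section is omitted entirely when the Scout found nothing to add.
    if !context.is_empty() {
        prompt.push_str("CONTEXT:\n");
        for note in context {
            prompt.push_str(note);
            prompt.push('\n');
        }
    }
    prompt.push_str("TASK:\n");
    prompt.push_str(task);
    prompt.push('\n');
    prompt
}
```

Because each prompt depends only on its own snippet, prompts for different candidates can be built and sent independently.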
---
## 4. Implementation Plan
### Phase 1: Infrastructure (Dependencies)
Add `tree-sitter` support to `Cargo.toml`.
```toml
[dependencies]
tree-sitter = "0.20"
tree-sitter-python = "0.20"
tree-sitter-javascript = "0.20"
tree-sitter-go = "0.20"
tree-sitter-rust = "0.20"
```
### Phase 2: The Scout Engine (`src/scout/`)
Create a new module `applications/aphoria/src/scout/`.
* `mod.rs`: Public interface.
* `engine.rs`: Orchestrates parsing and querying.
* `queries/`: Directory containing `.scm` query files for each category/language.
    * `python/tls.scm`
    * `go/sql_injection.scm`
**Struct definition:**
```rust
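use std::collections::HashMap;

/// Languages the Scout can parse (assumed enum; mirrors the list in 3.1).
pub enum Language { Python, Go, Rust, JavaScript }

/// A Region of Interest found by a Scout query, ready to hand to the Judge.
pub struct CandidateSnippet {
    pub file_path: String,
    pub language: Language,
    pub start_line: usize,
    pub end_line: usize,
    pub code: String,
    pub context_variables: HashMap<String, String>, // Name -> Value/Definition
    pub query_id: String, // Which query found this
}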
```
### Phase 3: The Judge Engine (`src/llm/judge.rs`)
Refactor `LlmExtractor` to support "Judge Mode".
* Modify `extract()` to accept `CandidateSnippet` instead of full file content.
* Create specialized prompts for specific query IDs (e.g., if Scout found a TLS pattern, use the specialized "TLS Judge" prompt, not the generic one).
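
Routing a candidate to a specialized prompt can be a plain lookup on the query ID. The sketch below assumes query IDs follow a `<language>/<category>` shape that mirrors the `queries/` directory layout; that format, the `JudgePrompt` enum, and `select_prompt` are illustrative assumptions, not the existing `LlmExtractor` API.

```rust
/// Prompt families the Judge can use (names are illustrative).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum JudgePrompt {
    Tls,
    SqlInjection,
    Generic,
}

/// Hypothetical routing: pick a prompt from the category part of the query ID,
/// e.g. "python/tls" -> Tls. Unknown categories fall back to the generic prompt.
pub fn select_prompt(query_id: &str) -> JudgePrompt {
    // `rsplit` always yields at least one item, so the fallback is never hit
    // in practice; it only avoids an unwrap.
    let category = query_id.rsplit('/').next().unwrap_or(query_id);
    match category {
        "tls" => JudgePrompt::Tls,
        "sql_injection" => JudgePrompt::SqlInjection,
        _ => JudgePrompt::Generic,
    }
}
```

Keeping a generic fallback means a new query file can ship before its specialized prompt exists.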
### Phase 4: Integration
Modify the main `scan` loop:
1. **Regex Extractors** run first (unchanged).
2. **Scout** runs on all files (extremely fast).
3. **Deduplicate:** If Scout finds a region already handled by Regex, drop it.
4. **Judge:** Send remaining Candidates to LLM.
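
The deduplication step (3) might be implemented as an overlap check between line ranges. In this sketch, "already handled" is assumed to mean that a regex finding shares at least one line with the Scout candidate in the same file; `Region` and `drop_handled` are hypothetical names.

```rust
/// Inclusive, 1-based line span in one file.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Region {
    pub file_path: String,
    pub start_line: usize,
    pub end_line: usize,
}

impl Region {
    /// Two regions overlap when they are in the same file and share a line.
    pub fn overlaps(&self, other: &Region) -> bool {
        self.file_path == other.file_path
            && self.start_line <= other.end_line
            && other.start_line <= self.end_line
    }
}

/// Keep only the Scout candidates that no regex finding already covers.
pub fn drop_handled(candidates: Vec<Region>, regex_findings: &[Region]) -> Vec<Region> {
    candidates
        .into_iter()
        .filter(|c| !regex_findings.iter().any(|r| r.overlaps(c)))
        .collect()
}
```

This is quadratic in the number of regions, which is likely acceptable per file; sorting by start line would help if candidate counts grow.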
---
## 5. Evaluation & Metrics
The "Prompt Evaluation System" (Phase 7.8) adapts to this model:
**1. Scout Evaluation (Deterministic):**
* **Metric:** Recall. "Did the Scout find the vulnerable line in `fixtures/tls/bad.py`?"
* **Test:** Unit tests using `tree-sitter` queries against code snippets. No LLM required.
**2. Judge Evaluation (Probabilistic):**
* **Metric:** Precision/Accuracy. "Given the snippet, did the LLM classify it correctly?"
* **Fixture:** `tests/llm_fixtures` now contains *snippets* derived from the Golden Corpus files.
**3. Cost Efficiency Metric:**
* Track `tokens_per_claim`.
* Goal: Reduce tokens/claim by >80% compared to Monolithic approach.
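
The cost metric reduces to two small calculations. The sketch below uses hypothetical function names; the 2000-token and 100-token figures in the usage note are the illustrative numbers from section 3.2, not measurements.

```rust
/// Tokens spent per extracted claim; `None` when no claims were produced.
pub fn tokens_per_claim(total_tokens: u64, claims: u64) -> Option<f64> {
    if claims == 0 {
        None
    } else {
        Some(total_tokens as f64 / claims as f64)
    }
}

/// Percentage reduction of `candidate` relative to `baseline`.
/// Returns `None` when the baseline is not positive.
pub fn reduction_percent(baseline: f64, candidate: f64) -> Option<f64> {
    if baseline <= 0.0 {
        None
    } else {
        Some((baseline - candidate) / baseline * 100.0)
    }
}

/// The stated goal: tokens/claim reduced by more than 80%.
pub fn meets_cost_goal(reduction: f64) -> bool {
    reduction > 80.0
}
```

With one claim per run, going from 2000 tokens (Monolithic) to 100 tokens (Scout & Judge) is a reduction of about 95%, which would clear the 80% goal.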
## 6. Migration Strategy
1. **Parallel Run:** Run Scout logic alongside Regex logic in "shadow mode" (logging only) to tune queries.
2. **Incremental Rollout:** Enable Scout & Judge for **one category** (e.g., TLS) while the other categories stay on their current Monolithic or Regex path.
3. **Full Switch:** Deprecate "Monolithic Mode" prompts.
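
One way to express these stages in code is a per-category rollout mode. This is only a sketch under assumed names (`RolloutMode`, `Rollout`); how Aphoria actually stores scan configuration is not specified here.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RolloutMode {
    /// Scout runs and logs candidates, but nothing reaches the Judge.
    Shadow,
    /// Scout candidates are sent to the Judge and claims are emitted.
    Enabled,
    /// Category stays on its existing Monolithic or Regex path.
    Legacy,
}

/// Hypothetical per-category rollout table.
pub struct Rollout {
    modes: HashMap<String, RolloutMode>,
}

impl Rollout {
    pub fn new() -> Self {
        Rollout { modes: HashMap::new() }
    }

    pub fn set(&mut self, category: &str, mode: RolloutMode) {
        self.modes.insert(category.to_string(), mode);
    }

    /// Categories without an explicit entry stay on the legacy path.
    pub fn mode_for(&self, category: &str) -> RolloutMode {
        self.modes.get(category).copied().unwrap_or(RolloutMode::Legacy)
    }

    /// Only fully enabled categories spend LLM tokens.
    pub fn should_call_judge(&self, category: &str) -> bool {
        self.mode_for(category) == RolloutMode::Enabled
    }
}
```

Defaulting to `Legacy` keeps the rollout opt-in: enabling TLS first does not change behavior for any other category.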
---
## 7. Comparison Summary
| Feature | Current (Monolithic) | Scout & Judge (Proposed) |
| :--- | :--- | :--- |
| **Trigger** | File name heuristic | AST Pattern Match |
| **Input** | Whole File | Relevant Snippet |
| **Context** | Noisy (imports, unrelated code) | Focused (local scope) |
| **Cost** | $$$ (Linear in file size) | ¢ (Linear in *relevant* code) |
| **Reliability** | Low (Lost in the middle) | High (Forced focus) |
| **Maintenance** | Prompt Engineering | Query Engineering + Simple Prompts |