# Scout & Judge: Hybrid Deterministic-Probabilistic Extraction Architecture

> **Status:** Proposed (2026-02-05)
> **Phase:** 7.9 (Replaces monolithic LLM extraction)
> **Context:** Evolution of Phase 7.5 (LLM-in-the-Loop)

---

## 1. Problem Statement

The current LLM extraction pipeline ("Monolithic Mode") treats code files as unstructured text: it feeds entire files to the LLM and asks it to find security claims.

**Issues with Monolithic Mode:**
1. **Cost:** Roughly 90% of a typical file is irrelevant to security (imports, UI logic, helpers), yet we pay for every token.
2. **Recall:** LLMs struggle to find "needles in haystacks"; recall degrades as the context grows longer.
3. **Hallucination:** Irrelevant code confuses the model, leading to false positives.
4. **Latency:** Processing large files is slow and blocks the scan.

## 2. The Solution: Scout & Judge Architecture

We decouple the **discovery** of potential claims from the **analysis** of those claims.

* **The Scout (Deterministic):** Uses Abstract Syntax Trees (ASTs) via `tree-sitter` to find *Regions of Interest* (ROIs). It runs locally, is fast, and consumes no LLM tokens.
* **The Judge (Probabilistic):** Uses the LLM to analyze *only* the specific ROI snippet, extracting its semantic meaning along with a confidence score.

### Architectural Diagram

```mermaid
graph TD
    File[Source File] -->|Input| Scout["AST Scout (Tree-sitter)"]

    subgraph "The Scout (Local/Fast)"
        Scout -->|Parse| AST
        AST -->|Query| Query[SCM Queries]
        Query -->|Match| Candidate[Candidate Node]
        Candidate -->|Expand| Snippet[Context Snippet]
    end

    Snippet -->|Input| Judge["LLM Judge (Gemini/Claude)"]

    subgraph "The Judge (Remote/Smart)"
        Judge -->|"Prompt: Analyze this specific call"| Claims[Structured Claims]
    end

    Claims -->|Output| Aggregator[Claim Aggregator]
```

---

## 3. Component Details

### 3.1 The Scout (Tree-sitter)

The Scout's job is **High Recall**. It should find *anything* that *might* be relevant; it does not need to be precise, because the Judge filters out false positives afterwards.

**Technology:** `tree-sitter` (Rust bindings)

**Workflow:**
1. **Detect Language:** Identify the file type (Python, Go, Rust, JS).
2. **Parse:** Generate the AST.
3. **Query:** Run SCM (S-expression) queries to find patterns.

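
The first workflow step could be sketched as a plain extension lookup. This is a minimal sketch, not the project's implementation: the `SourceLanguage` enum, the `detect_language` function, and the extension list are all hypothetical names and assumptions chosen for illustration.

```rust
use std::path::Path;

/// Languages the Scout can parse (the four listed in the workflow).
/// Hypothetical type; the real module may reuse an existing `Language` type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SourceLanguage {
    Python,
    Go,
    Rust,
    JavaScript,
}

/// Map a file path to a language by its extension (assumed mapping).
/// Returns `None` for unsupported files so the Scout can skip them.
pub fn detect_language(path: &str) -> Option<SourceLanguage> {
    let ext = Path::new(path).extension()?.to_str()?.to_ascii_lowercase();
    match ext.as_str() {
        "py" => Some(SourceLanguage::Python),
        "go" => Some(SourceLanguage::Go),
        "rs" => Some(SourceLanguage::Rust),
        "js" | "mjs" | "cjs" | "jsx" => Some(SourceLanguage::JavaScript),
        _ => None,
    }
}
```

Returning `None` rather than an error keeps unsupported files cheap to skip, which matters because the Scout runs on every file.
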
**Example Query (Python TLS):**
```scm
; Matches requests.<method>(..., verify=<value>) in Python.
; tree-sitter-python names the call node `call` (not `call_expression`).
(call
  function: (attribute) @func
  arguments: (argument_list
    (keyword_argument
      name: (identifier) @arg_name
      value: (_) @value
    )
  )
  ; The backslash is doubled so the regex receives a literal `\.`.
  (#match? @func "requests\\.(get|post|put|delete)")
  (#eq? @arg_name "verify")
)
```

**Context Expansion:**
The Scout doesn't just grab the matching line. It grabs the **Logical Context**:
* The function call itself.
* Definitions of variables referenced in the call (simple static analysis).
* The surrounding 5 lines, to capture comments.

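
The line-window part of context expansion can be sketched with the standard library alone. This sketch covers only the "surrounding lines" rule; resolving variable definitions would need the AST and is not shown. The function name `expand_context` and its signature are assumptions, not existing code.

```rust
/// Expand a matched line range by `padding` lines on each side, clamped to
/// the file bounds. Line numbers are 1-indexed and inclusive.
/// Returns the expanded `(start_line, end_line, code)`, or `None` when the
/// input range is invalid or lies outside the file.
pub fn expand_context(
    source: &str,
    start_line: usize,
    end_line: usize,
    padding: usize,
) -> Option<(usize, usize, String)> {
    let lines: Vec<&str> = source.lines().collect();
    if start_line == 0 || start_line > end_line || end_line > lines.len() {
        return None;
    }
    // Clamp the window so it never runs past the first or last line.
    let first = start_line.saturating_sub(padding).max(1);
    let last = (end_line + padding).min(lines.len());
    let code = lines[first - 1..last].join("\n");
    Some((first, last, code))
}
```

With `padding = 5` this reproduces the five-line window described in the list; the returned line numbers could populate `start_line` and `end_line` on the candidate.
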
### 3.2 The Judge (LLM)

The Judge's job is **High Precision**. It receives a focused prompt and determines whether a claim exists.

**Input Prompt:**
```text
You are a security analyst.
Analyze this code snippet for TLS verification settings.

SNIPPET:
# Dev override
should_verify = False
requests.get(url, verify=should_verify)

CONTEXT:
Variable `should_verify` is defined on line 2.

TASK:
Does this snippet disable TLS verification?
Output JSON: { "subject": "tls/verification", "value": false, "confidence": 0.95 }
```

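
A prompt in the SNIPPET / CONTEXT / TASK layout could be assembled from a candidate along these lines. This is a sketch under assumptions: `build_judge_prompt` and its parameters are hypothetical, and a `BTreeMap` is used instead of a `HashMap` only so that the rendered variable order is stable between runs.

```rust
use std::collections::BTreeMap;

/// Render a Judge prompt in the SNIPPET / CONTEXT / TASK layout.
/// `context_variables` maps a variable name to a short note about it,
/// for example "is defined on line 2".
pub fn build_judge_prompt(
    topic: &str,
    snippet: &str,
    context_variables: &BTreeMap<String, String>,
    task: &str,
) -> String {
    let mut prompt = String::new();
    prompt.push_str("You are a security analyst.\n");
    prompt.push_str(&format!("Analyze this code snippet for {topic}.\n\n"));
    prompt.push_str("SNIPPET:\n");
    prompt.push_str(snippet.trim_end());
    prompt.push_str("\n\nCONTEXT:\n");
    if context_variables.is_empty() {
        prompt.push_str("(none)\n");
    }
    for (name, note) in context_variables {
        prompt.push_str(&format!("Variable `{name}` {note}.\n"));
    }
    prompt.push_str("\nTASK:\n");
    prompt.push_str(task);
    prompt.push('\n');
    prompt
}
```
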
**Why this wins:**
* **Token Efficiency:** Input shrinks from a whole file (e.g., ~2000 tokens) to a single snippet (~100 tokens).
* **Accuracy:** The model sees only the relevant code, so there is less to distract it.
* **Speed:** Snippets are independent, so Judge calls can run in parallel.

---

## 4. Implementation Plan

### Phase 1: Infrastructure (Dependencies)

Add `tree-sitter` support to `Cargo.toml`.

```toml
[dependencies]
tree-sitter = "0.20"
tree-sitter-python = "0.20"
tree-sitter-javascript = "0.20"
tree-sitter-go = "0.20"
tree-sitter-rust = "0.20"
```

### Phase 2: The Scout Engine (`src/scout/`)

Create a new module `applications/aphoria/src/scout/`.

* `mod.rs`: Public interface.
* `engine.rs`: Orchestrates parsing and querying.
* `queries/`: Directory containing `.scm` query files for each category/language.
    * `python/tls.scm`
    * `go/sql_injection.scm`

**Struct definition:**
```rust
use std::collections::HashMap;

/// A region of interest found by the Scout and handed to the Judge.
pub struct CandidateSnippet {
    pub file_path: String,
    pub language: Language,
    pub start_line: usize,
    pub end_line: usize,
    pub code: String,
    pub context_variables: HashMap<String, String>, // Name -> Value/Definition
    pub query_id: String, // Which query found this
}
```

### Phase 3: The Judge Engine (`src/llm/judge.rs`)

Refactor `LlmExtractor` to support "Judge Mode".

* Modify `extract()` to accept a `CandidateSnippet` instead of full file content.
* Create specialized prompts keyed by query ID (e.g., if the Scout matched a TLS pattern, use the specialized "TLS Judge" prompt rather than the generic one).

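
Routing a candidate to a specialized prompt can be a simple lookup on the query ID. In this sketch the prompt wording, the constant names, and the assumption that query IDs mirror the `.scm` paths (for example `python/tls`) are all hypothetical.

```rust
/// Hypothetical task prompts; real wording would come from the prompt
/// evaluation work, not from this sketch.
pub const TLS_JUDGE_TASK: &str = "Does this snippet disable TLS verification?";
pub const SQL_JUDGE_TASK: &str = "Does this snippet build a SQL query from untrusted input?";
pub const GENERIC_JUDGE_TASK: &str = "Does this snippet make a security-relevant claim?";

/// Pick the Judge task for the query that produced a candidate.
/// Query IDs are assumed to look like "<language>/<category>".
pub fn select_judge_task(query_id: &str) -> &'static str {
    // The category is the last path segment: "python/tls" -> "tls".
    let category = query_id.rsplit('/').next().unwrap_or(query_id);
    match category {
        "tls" => TLS_JUDGE_TASK,
        "sql_injection" => SQL_JUDGE_TASK,
        // Unknown categories fall back to the generic prompt.
        _ => GENERIC_JUDGE_TASK,
    }
}
```

Keying on the category rather than the full ID lets one prompt serve the same category across languages; whether that is desirable depends on how language-specific the prompts turn out to be.
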
### Phase 4: Integration

Modify the main `scan` loop:

1. **Regex Extractors** run first (unchanged).
2. **Scout** runs on all files (extremely fast).
3. **Deduplicate:** If the Scout finds a region already handled by a Regex extractor, drop it.
4. **Judge:** Send the remaining candidates to the LLM.

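
The deduplication step can be sketched as an overlap filter. Treating "already handled" as "shares at least one line in the same file" is an assumption, as are the `Region` type and the `drop_handled` function. Comparing the original match range rather than the context-expanded range would avoid dropping candidates that merely sit near a regex hit.

```rust
/// A 1-indexed, inclusive line range within one file (hypothetical type).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Region {
    pub file_path: String,
    pub start_line: usize,
    pub end_line: usize,
}

impl Region {
    /// True when both regions are in the same file and share at least one line.
    pub fn overlaps(&self, other: &Region) -> bool {
        self.file_path == other.file_path
            && self.start_line <= other.end_line
            && other.start_line <= self.end_line
    }
}

/// Keep only the Scout candidates that no regex extractor has already covered.
pub fn drop_handled(candidates: Vec<Region>, regex_hits: &[Region]) -> Vec<Region> {
    candidates
        .into_iter()
        .filter(|c| !regex_hits.iter().any(|hit| c.overlaps(hit)))
        .collect()
}
```
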

---

## 5. Evaluation & Metrics

The "Prompt Evaluation System" (Phase 7.8) adapts to this model:

**1. Scout Evaluation (Deterministic):**
* **Metric:** Recall. "Did the Scout find the vulnerable line in `fixtures/tls/bad.py`?"
* **Test:** Unit tests that run `tree-sitter` queries against code snippets. No LLM required.

**2. Judge Evaluation (Probabilistic):**
* **Metric:** Precision/Accuracy. "Given the snippet, did the LLM classify it correctly?"
* **Fixture:** `tests/llm_fixtures` now contains *snippets* derived from the Golden Corpus files.

**3. Cost Efficiency Metric:**
* Track `tokens_per_claim`.
* Goal: Reduce tokens/claim by >80% compared to the Monolithic approach.

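
The cost metric and the reduction goal reduce to two divisions. The sketch below uses hypothetical names (`RunStats`, `reduction`), and the figures in the usage note are illustrative, taken from the ~2000 versus ~100 token example earlier in this document rather than from measurements.

```rust
/// Token and claim totals for one extraction run (hypothetical type).
#[derive(Debug, Clone, Copy)]
pub struct RunStats {
    pub total_tokens: u64,
    pub claims: u64,
}

impl RunStats {
    /// Average tokens spent per extracted claim; `None` when no claims were found.
    pub fn tokens_per_claim(&self) -> Option<f64> {
        if self.claims == 0 {
            None
        } else {
            Some(self.total_tokens as f64 / self.claims as f64)
        }
    }
}

/// Fractional reduction in tokens/claim relative to a baseline run
/// (0.8 means 80% fewer tokens per claim). `None` if either run has no
/// claims or the baseline spent no tokens.
pub fn reduction(baseline: &RunStats, candidate: &RunStats) -> Option<f64> {
    let base = baseline.tokens_per_claim()?;
    let cand = candidate.tokens_per_claim()?;
    if base == 0.0 {
        return None;
    }
    Some(1.0 - cand / base)
}
```

For example, a Monolithic run at 2000 tokens/claim against a Scout & Judge run at 100 tokens/claim gives a reduction of 1 - 100/2000 = 0.95, which would clear the >80% goal.
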
## 6. Migration Strategy

1. **Parallel Run:** Run Scout logic alongside Regex logic in "shadow mode" (logging only) to tune queries.
2. **Incremental Rollout:** Enable Scout & Judge for **one category** (e.g., TLS) while leaving the others in their current mode (Monolithic or Regex).
3. **Full Switch:** Deprecate "Monolithic Mode" prompts.

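
One way to express the three migration steps is a per-category mode switch. Everything in this sketch (`ExtractionMode`, `RolloutPlan`, the category strings) is a hypothetical illustration of the idea, not an existing configuration surface.

```rust
use std::collections::HashMap;

/// How claims are produced for one category during the migration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ExtractionMode {
    /// Existing path only (Regex or Monolithic).
    Legacy,
    /// Existing path produces claims; Scout candidates are only logged.
    Shadow,
    /// Scout & Judge produces claims.
    ScoutJudge,
}

/// Per-category rollout state with a default for unlisted categories.
pub struct RolloutPlan {
    default_mode: ExtractionMode,
    per_category: HashMap<String, ExtractionMode>,
}

impl RolloutPlan {
    pub fn new(default_mode: ExtractionMode) -> Self {
        Self { default_mode, per_category: HashMap::new() }
    }

    /// Override the mode for a single category, e.g. "tls".
    pub fn set(&mut self, category: &str, mode: ExtractionMode) {
        self.per_category.insert(category.to_string(), mode);
    }

    pub fn mode_for(&self, category: &str) -> ExtractionMode {
        self.per_category.get(category).copied().unwrap_or(self.default_mode)
    }

    /// Whether the Judge (LLM) should be called for this category.
    pub fn calls_judge(&self, category: &str) -> bool {
        self.mode_for(category) == ExtractionMode::ScoutJudge
    }
}
```

Step 1 corresponds to a default of `Shadow`, step 2 to overriding a single category such as `"tls"` with `ScoutJudge`, and step 3 to making `ScoutJudge` the default.
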

---

## 7. Comparison Summary

| Feature | Current (Monolithic) | Scout & Judge (Proposed) |
| :--- | :--- | :--- |
| **Trigger** | File name heuristic | AST Pattern Match |
| **Input** | Whole File | Relevant Snippet |
| **Context** | Noisy (imports, unrelated code) | Focused (local scope) |
| **Cost** | $$$ (Linear in file size) | ¢ (Linear in *relevant* code) |
| **Reliability** | Low (Lost in the middle) | High (Forced focus) |
| **Maintenance** | Prompt Engineering | Query Engineering + Simple Prompts |