stemedb/applications/aphoria/docs/architecture/scout-judge-extraction.md

Scout & Judge: Hybrid Deterministic-Probabilistic Extraction Architecture

Status: Proposed (2026-02-05)
Phase: 7.9 (Replaces monolithic LLM extraction)
Context: Evolution of Phase 7.5 (LLM-in-the-Loop)


1. Problem Statement

The current LLM extraction pipeline ("Monolithic Mode") treats code files as unstructured text. It feeds entire files to the LLM to find security claims.

Issues with Monolithic Mode:

  1. Cost: Roughly 90% of a typical file is irrelevant to security (imports, UI logic, helpers), yet we pay for every token.
  2. Recall: LLM recall degrades over long contexts, so security-relevant lines buried in a large file ("needles in haystacks") are often missed.
  3. Hallucination: Irrelevant code confuses the model, leading to false positives.
  4. Latency: Processing large files is slow and blocks the scan.

2. The Solution: Scout & Judge Architecture

We decouple the discovery of potential claims from the analysis of those claims.

  • The Scout (Deterministic): Uses Abstract Syntax Trees (AST) via tree-sitter to find Regions of Interest (ROIs) locally, quickly, and with no LLM token cost.
  • The Judge (Probabilistic): Uses the LLM to analyze only the specific ROI snippet to extract semantic meaning and confidence.

Architectural Diagram

graph TD
    File[Source File] -->|Input| Scout["AST Scout (Tree-sitter)"]
    
    subgraph "The Scout (Local/Fast)"
        Scout -->|Parse| AST
        AST -->|Query| Query[SCM Queries]
        Query -->|Match| Candidate[Candidate Node]
        Candidate -->|Expand| Snippet[Context Snippet]
    end
    
    Snippet -->|Input| Judge["LLM Judge (Gemini/Claude)"]
    
    subgraph "The Judge (Remote/Smart)"
        Judge -->|Prompt: Analyze this specific call| Claims[Structured Claims]
    end
    
    Claims -->|Output| Aggregator[Claim Aggregator]
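
The split between discovery and analysis can be sketched as two traits wired into one pipeline. This is a minimal sketch, not the Aphoria implementation: `Candidate`, `Claim`, `Scout`, `Judge`, and `run_pipeline` are hypothetical names, `SubstringScout` stands in for a real tree-sitter query, and `StubJudge` stands in for a real LLM call.

```rust
/// A region of interest found by the Scout (hypothetical, simplified type).
#[derive(Debug, Clone, PartialEq)]
pub struct Candidate {
    pub line: usize, // 1-based line of the match
    pub snippet: String,
}

/// A structured claim produced by the Judge (hypothetical, simplified type).
#[derive(Debug, Clone, PartialEq)]
pub struct Claim {
    pub subject: String,
    pub value: bool,
    pub confidence: f64,
}

/// Deterministic discovery: cheap, local, tuned for recall.
pub trait Scout {
    fn find_candidates(&self, source: &str) -> Vec<Candidate>;
}

/// Probabilistic analysis: sees one snippet at a time.
pub trait Judge {
    fn judge(&self, candidate: &Candidate) -> Option<Claim>;
}

/// Runs the Scout over the whole file, then the Judge over each candidate only.
pub fn run_pipeline(scout: &dyn Scout, judge: &dyn Judge, source: &str) -> Vec<Claim> {
    scout
        .find_candidates(source)
        .iter()
        .filter_map(|c| judge.judge(c))
        .collect()
}

/// Stand-in Scout: a substring match instead of a real tree-sitter query.
pub struct SubstringScout;

impl Scout for SubstringScout {
    fn find_candidates(&self, source: &str) -> Vec<Candidate> {
        source
            .lines()
            .enumerate()
            .filter(|(_, l)| l.contains("verify="))
            .map(|(i, l)| Candidate { line: i + 1, snippet: l.trim().to_string() })
            .collect()
    }
}

/// Stand-in Judge: a fixed rule instead of a real LLM call.
pub struct StubJudge;

impl Judge for StubJudge {
    fn judge(&self, c: &Candidate) -> Option<Claim> {
        c.snippet.contains("verify=False").then(|| Claim {
            subject: "tls/verification".to_string(),
            value: false,
            confidence: 1.0,
        })
    }
}
```

The Judge never sees the full source in this shape; it only receives what the Scout hands over, which is the property the rest of this document relies on.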

3. Component Details

3.1 The Scout (Tree-sitter)

The Scout's job is High Recall. It should find anything that might be relevant. It does not need to be precise.

Technology: tree-sitter (Rust bindings)

Workflow:

  1. Detect Language: Identify file type (Python, Go, Rust, JS).
  2. Parse: Generate AST.
  3. Query: Run SCM (S-expression) queries to find patterns.
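
Step 1 of the workflow could be as simple as a file-extension lookup. The sketch below assumes extension-based detection is sufficient; the `Language` enum and `detect_language` function are hypothetical names, and the extension list is illustrative rather than complete.

```rust
use std::path::Path;

/// Languages named in the workflow above (hypothetical enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Language {
    Python,
    Go,
    Rust,
    JavaScript,
}

/// Maps a file path to a language by extension; returns None for unsupported files.
pub fn detect_language(path: &str) -> Option<Language> {
    let ext = Path::new(path).extension()?.to_str()?;
    match ext.to_ascii_lowercase().as_str() {
        "py" => Some(Language::Python),
        "go" => Some(Language::Go),
        "rs" => Some(Language::Rust),
        "js" | "mjs" | "cjs" => Some(Language::JavaScript),
        _ => None,
    }
}
```

Files that map to `None` would simply be skipped by the Scout and left to the existing extractors.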

Example Query (Python TLS):

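; Match requests.<method>(..., verify=<value>) calls.
; tree-sitter-python names the call node `call`, not `call_expression`.
(call
  function: (attribute) @func
  arguments: (argument_list
    (keyword_argument
      name: (identifier) @arg_name
      value: (_) @value))
  ; The backslash is doubled so the regex receives a literal dot.
  (#match? @func "^requests\\.(get|post|put|delete)$")
  (#eq? @arg_name "verify"))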

Context Expansion: The Scout doesn't just grab the line. It grabs the Logical Context:

  • The function call itself.
  • Variable definitions referenced in the call (simple static analysis).
  • Surrounding 5 lines for comments.
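
The "surrounding lines" part of context expansion is a clamped range computation. The sketch below covers only that part (it does not resolve variable definitions), and assumes 1-based, inclusive line numbers; `expand_range` and `extract_snippet` are hypothetical helper names.

```rust
/// Widens a 1-based inclusive line range by `margin` lines on each side,
/// clamped to the bounds of a file with `total_lines` lines.
pub fn expand_range(
    start_line: usize,
    end_line: usize,
    margin: usize,
    total_lines: usize,
) -> (usize, usize) {
    let start = start_line.saturating_sub(margin).max(1);
    let end = end_line.saturating_add(margin).min(total_lines);
    (start, end)
}

/// Returns the matched lines plus `margin` lines of surrounding context.
pub fn extract_snippet(source: &str, start_line: usize, end_line: usize, margin: usize) -> String {
    let lines: Vec<&str> = source.lines().collect();
    let (start, end) = expand_range(start_line, end_line, margin, lines.len());
    if start > end {
        return String::new(); // the match lies outside the file
    }
    lines[start - 1..end].join("\n")
}
```

With a margin of 5, a match on line 3 of a 10-line file yields lines 1 through 8, so a comment such as `# Dev override` directly above the call travels with the snippet.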

3.2 The Judge (LLM)

The Judge's job is High Precision. It receives a focused prompt and determines if a claim exists.

Input Prompt:

You are a security analyst.
Analyze this code snippet for TLS verification settings.

SNIPPET:
# Dev override
should_verify = False
requests.get(url, verify=should_verify)

CONTEXT:
Variable `should_verify` is defined on line 2.

TASK:
Does this snippet disable TLS verification?
Output JSON: { "subject": "tls/verification", "value": false, "confidence": 0.95 }
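
A prompt like the one above could be assembled from a snippet and its context notes with plain string formatting. This is a sketch under assumptions: `build_tls_prompt` is a hypothetical function, and the output line uses placeholders (`<true|false>`, `<0.0-1.0>`) instead of concrete example values, which is a variation on the prompt shown rather than the documented wording.

```rust
/// Builds a focused TLS prompt from one snippet and its context notes.
/// Hypothetical helper; the placeholder output schema is an assumption.
pub fn build_tls_prompt(snippet: &str, context_notes: &[&str]) -> String {
    let context = if context_notes.is_empty() {
        "(none)".to_string()
    } else {
        context_notes.join("\n")
    };
    format!(
        "You are a security analyst.\n\
         Analyze this code snippet for TLS verification settings.\n\
         \n\
         SNIPPET:\n{}\n\
         \n\
         CONTEXT:\n{}\n\
         \n\
         TASK:\n\
         Does this snippet disable TLS verification?\n\
         Output JSON: {{ \"subject\": \"tls/verification\", \"value\": <true|false>, \"confidence\": <0.0-1.0> }}\n",
        snippet, context
    )
}
```

Keeping the template in one function per category makes the prompt text easy to diff and to cover with fixture tests.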

Why this wins:

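  • Token Efficiency: Input shrinks from roughly 2000 tokens (a whole file) to ~100 tokens (a snippet), about a 95% reduction in this example.
  • Accuracy: The model sees only the relevant code, so there is less to distract it.
  • Speed: Snippets are independent, so Judge calls can run in parallel.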
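
Because each snippet is judged independently, the calls can be fanned out. The sketch below uses scoped threads from the standard library purely for illustration; `judge_snippet` is a hypothetical stand-in for the remote LLM call, and a real implementation would more likely use an async client with a concurrency limit.

```rust
use std::thread;

/// Hypothetical stand-in for a remote LLM call on one snippet.
pub fn judge_snippet(snippet: &str) -> bool {
    snippet.contains("verify=False")
}

/// Judges every snippet on its own thread and returns results in input order.
pub fn judge_all(snippets: &[String]) -> Vec<bool> {
    thread::scope(|s| {
        // Spawn all workers first so they run concurrently...
        let handles: Vec<_> = snippets
            .iter()
            .map(|snip| s.spawn(move || judge_snippet(snip)))
            .collect();
        // ...then join them in order to keep results aligned with inputs.
        handles
            .into_iter()
            .map(|h| h.join().expect("judge thread panicked"))
            .collect()
    })
}
```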

4. Implementation Plan

Phase 1: Infrastructure (Dependencies)

Add tree-sitter support to Cargo.toml.

[dependencies]
tree-sitter = "0.20"
tree-sitter-python = "0.20"
tree-sitter-javascript = "0.20"
tree-sitter-go = "0.20"
tree-sitter-rust = "0.20"

Phase 2: The Scout Engine (src/scout/)

Create a new module applications/aphoria/src/scout/.

  • mod.rs: Public interface.
  • engine.rs: Orchestrates parsing and querying.
  • queries/: Directory containing .scm query files for each category/language.
    • python/tls.scm
    • go/sql_injection.scm

Struct definition:

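use std::collections::HashMap;

/// A region of interest emitted by the Scout and handed to the Judge.
/// `Language` is assumed to be Aphoria's own language enum, not `tree_sitter::Language`.
pub struct CandidateSnippet {
    pub file_path: String,
    pub language: Language,
    pub start_line: usize,
    pub end_line: usize,
    pub code: String,
    pub context_variables: HashMap<String, String>, // Name -> Value/Definition
    pub query_id: String, // Which query found this
}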

Phase 3: The Judge Engine (src/llm/judge.rs)

Refactor LlmExtractor to support "Judge Mode".

  • Modify extract() to accept CandidateSnippet instead of full file content.
  • Create specialized prompts for specific query IDs (e.g., if Scout found a TLS pattern, use the specialized "TLS Judge" prompt, not the generic one).
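
Selecting a specialized prompt by query ID can be a simple lookup. The sketch below assumes query IDs follow a `<language>/<category>` scheme that mirrors the `queries/` directory layout; that scheme, the function name `prompt_for_query`, and the non-TLS prompt texts are all hypothetical.

```rust
/// Picks the Judge's instruction line for a given Scout query ID.
/// Assumes IDs look like "python/tls" (an assumption, not a documented format).
pub fn prompt_for_query(query_id: &str) -> &'static str {
    // Take the part after the last '/', or the whole ID if there is none.
    let category = query_id.rsplit('/').next().unwrap_or(query_id);
    match category {
        "tls" => "Analyze this code snippet for TLS verification settings.",
        "sql_injection" => "Analyze this code snippet for SQL built from untrusted input.",
        // Fallback for queries without a specialized prompt yet.
        _ => "Analyze this code snippet for security-relevant configuration.",
    }
}
```

Keying on the category rather than the full ID lets the Python and Go queries for the same category share one prompt.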

Phase 4: Integration

Modify the main scan loop:

  1. Regex Extractors run first (unchanged).
  2. Scout runs on all files (extremely fast).
  3. Deduplicate: If Scout finds a region already handled by Regex, drop it.
  4. Judge: Send remaining Candidates to LLM.
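
The deduplication step reduces to an interval-overlap check per file. This sketch assumes regions are identified by file path plus a 1-based inclusive line range, and that any overlap counts as "already handled"; `Region` and `drop_handled` are hypothetical names.

```rust
/// A line range in a file (hypothetical type; lines are 1-based, inclusive).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Region {
    pub file_path: String,
    pub start_line: usize,
    pub end_line: usize,
}

/// True if both regions are in the same file and share at least one line.
fn overlaps(a: &Region, b: &Region) -> bool {
    a.file_path == b.file_path && a.start_line <= b.end_line && b.start_line <= a.end_line
}

/// Keeps only Scout candidates that no Regex extractor has already covered.
pub fn drop_handled(candidates: Vec<Region>, handled_by_regex: &[Region]) -> Vec<Region> {
    candidates
        .into_iter()
        .filter(|c| !handled_by_regex.iter().any(|h| overlaps(c, h)))
        .collect()
}
```

Only the candidates that survive this filter are sent to the Judge, so regions the Regex extractors already cover cost no LLM tokens.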

5. Evaluation & Metrics

The "Prompt Evaluation System" (Phase 7.8) adapts to this model:

1. Scout Evaluation (Deterministic):

  • Metric: Recall. "Did the Scout find the vulnerable line in fixtures/tls/bad.py?"
  • Test: Unit tests using tree-sitter queries against code snippets. No LLM required.

2. Judge Evaluation (Probabilistic):

  • Metric: Precision/Accuracy. "Given the snippet, did the LLM classify it correctly?"
  • Fixture: tests/llm_fixtures now contains snippets derived from the Golden Corpus files.

3. Cost Efficiency Metric:

  • Track tokens_per_claim.
  • Goal: Reduce tokens/claim by >80% compared to Monolithic approach.
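
The three metrics above are small ratios and can be computed without any LLM in the loop. The helpers below are a sketch with hypothetical names; they assume 1-based inclusive line ranges for the Scout and `(expected, predicted)` boolean pairs for the Judge.

```rust
/// Scout recall: fraction of known-vulnerable lines covered by some candidate range.
pub fn scout_recall(expected_lines: &[usize], found_ranges: &[(usize, usize)]) -> f64 {
    if expected_lines.is_empty() {
        return 1.0; // nothing to find
    }
    let hits = expected_lines
        .iter()
        .filter(|&&line| found_ranges.iter().any(|&(s, e)| s <= line && line <= e))
        .count();
    hits as f64 / expected_lines.len() as f64
}

/// Judge precision over (expected, predicted) pairs; None if nothing was flagged.
pub fn judge_precision(results: &[(bool, bool)]) -> Option<f64> {
    let flagged = results.iter().filter(|&&(_, p)| p).count();
    if flagged == 0 {
        return None;
    }
    let true_pos = results.iter().filter(|&&(e, p)| e && p).count();
    Some(true_pos as f64 / flagged as f64)
}

/// Judge accuracy over (expected, predicted) pairs; None for an empty fixture set.
pub fn judge_accuracy(results: &[(bool, bool)]) -> Option<f64> {
    if results.is_empty() {
        return None;
    }
    let correct = results.iter().filter(|&&(e, p)| e == p).count();
    Some(correct as f64 / results.len() as f64)
}

/// Average tokens spent per extracted claim; None when no claims were produced.
pub fn tokens_per_claim(total_tokens: u64, claims: u64) -> Option<f64> {
    if claims == 0 {
        None
    } else {
        Some(total_tokens as f64 / claims as f64)
    }
}

/// Percentage reduction of `candidate` relative to `baseline`.
pub fn reduction_pct(baseline: f64, candidate: f64) -> f64 {
    (1.0 - candidate / baseline) * 100.0
}
```

Using the illustrative numbers from Section 3.2 (about 2000 tokens per file versus about 100 per snippet, one claim each), the reduction is roughly 95%, which would clear the 80% goal.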

6. Migration Strategy

  1. Parallel Run: Run Scout logic alongside Regex logic in "shadow mode" (logging only) to tune queries.
  2. Incremental Rollout: Enable Scout & Judge for one category (e.g., TLS) while leaving others in Monolithic mode (if any) or Regex mode.
  3. Full Switch: Deprecate "Monolithic Mode" prompts.
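
The three migration steps could be driven by a per-category mode switch. This is a sketch under assumptions: the `ExtractionMode` enum, the `Rollout` struct, and the category keys are hypothetical, not existing Aphoria configuration.

```rust
use std::collections::HashMap;

/// How a category is processed during migration (hypothetical enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ExtractionMode {
    /// Existing behavior: whole-file prompts.
    Monolithic,
    /// Scout runs and logs candidates, but nothing is sent to the Judge.
    Shadow,
    /// Scout candidates are sent to the Judge.
    ScoutJudge,
}

/// Per-category rollout state with a default for unlisted categories.
pub struct Rollout {
    pub default_mode: ExtractionMode,
    pub overrides: HashMap<String, ExtractionMode>,
}

impl Rollout {
    /// Returns the mode for a category, falling back to the default.
    pub fn mode_for(&self, category: &str) -> ExtractionMode {
        self.overrides.get(category).copied().unwrap_or(self.default_mode)
    }

    /// True only when the category is fully switched to Scout & Judge.
    pub fn sends_to_judge(&self, category: &str) -> bool {
        self.mode_for(category) == ExtractionMode::ScoutJudge
    }
}
```

Under this shape, step 1 sets a category to `Shadow`, step 2 promotes it to `ScoutJudge`, and step 3 changes `default_mode` once every category has been promoted.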

7. Comparison Summary

| Feature     | Current (Monolithic)            | Scout & Judge (Proposed)           |
|-------------|---------------------------------|------------------------------------|
| Trigger     | File name heuristic             | AST Pattern Match                  |
| Input       | Whole File                      | Relevant Snippet                   |
| Context     | Noisy (imports, unrelated code) | Focused (local scope)              |
| Cost        | $ (Linear to file size)         | ¢ (Linear to relevant code)        |
| Reliability | Low (Lost in middle)            | High (Forced focus)                |
| Maintenance | Prompt Engineering              | Query Engineering + Simple Prompts |