# Declarative Extractor Reference

Declarative extractors are pattern-based extractors defined in TOML configuration. They're ideal for detecting simple code patterns through regex matching.

---

## Quick Start

**Add to `.aphoria/config.toml`:**

```toml
[[extractors.declarative]]
name = "timeout_zero_detector"
pattern = 'timeout:\s*Duration::from_secs\(0\)'
languages = ["rust"]

[extractors.declarative.claim]
subject = "myapp/config/timeout"
predicate = "zero"
value = 0
confidence = 0.95
```

**Result:** Creates observations when pattern matches code, compares against claims with same `concept_path`.

---

## Field Reference

### Required Fields

#### `name` (String)

Unique identifier for this extractor.

**Format:** snake_case, descriptive

**Example:** `"timeout_zero_detector"`, `"unbounded_queue_size"`

---

#### `pattern` (String)

Regular expression matching the code pattern you want to detect.

**Format:** Valid regex (Rust regex crate syntax)

**Tips:**

- Use `\s*` for optional whitespace
- Escape special chars: `\(`, `\)`, `\.`
- Test with `grep -E "pattern" file.rs` before adding to config

**Examples:**

```toml
# Detect timeout = 0
pattern = 'timeout:\s*Duration::from_secs\(0\)'

# Detect None for max_size
pattern = 'max_queue_size:\s*None'

# Detect verify_certificates = false
pattern = 'verify_certificates:\s*false'
```

---

#### `languages` (Array of Strings)

File types this extractor should run on.

**Format:** Array of language names

**Supported:** `["rust", "python", "javascript", "typescript", "go", "java"]`

**Example:**

```toml
languages = ["rust"]
```

---

#### `[extractors.declarative.claim]` Section

This defines the observation that will be created when the pattern matches.

##### `subject` (String) - **CRITICAL FIELD**

The **concept path** for observations created by this extractor.

**⚠️ MOST COMMON MISTAKE:** Using partial path instead of full path.

**Format:** Full slash-separated path matching your claim's `concept_path` **EXACTLY**.


**Example (Correct):**

```toml
# Claim has:
concept_path = "msgqueue/queue/max_size"

# Extractor MUST use SAME path:
[extractors.declarative.claim]
subject = "msgqueue/queue/max_size"  # ✅ CORRECT
```

**Common Mistake (Wrong):**

```toml
# ❌ WRONG: Using only leaf segments
subject = "queue/max_size"  # Will NOT match claim!

# ❌ WRONG: Different prefix
subject = "myapp/queue/max_size"  # Will NOT match unless claim also uses "myapp"
```

**Why This Matters:** Observations match claims via **tail-path matching** (last 2 segments).

- **Claim:** `msgqueue/queue/max_size` → tail: `queue/max_size`
- **Observation:** `queue/max_size` → tail: `queue/max_size`
- **Match?** Only if observation path ENDS with same tail as claim

With `subject = "queue/max_size"`, the observation's path and tail are both `queue/max_size`, which is the same tail as the claim. Do not rely on that to bridge a missing prefix: partial subjects are the usual cause of a 0% detection rate (see Mistake 1), so treat the FULL paths as needing to align.

**Rule of Thumb:** Copy `concept_path` from your claim EXACTLY into the `subject` field.

---

##### `predicate` (String)

The attribute you're observing.

**Format:** snake_case identifier

**Common Values:**

- `"zero"` - For numeric zero checks
- `"bounded"` - For limit/size checks
- `"enabled"` - For boolean flags
- `"valid"` - For validation checks

**Must match:** The predicate in your claim.

**Example:**

```toml
# Claim has:
predicate = "bounded"

# Extractor must use:
predicate = "bounded"
```

---

##### `value` (Boolean, Number, or String)

The value observed when pattern matches.


**Type:** Must match claim's value type

**Typical Pattern:** The extractor records the VIOLATION value: the opposite of the claim's desired value for `equals` claims, or the forbidden value itself for `not_equals` claims (see Example 1).

**Example:**

```toml
# Claim says: max_size should be bounded (true)
concept_path = "msgqueue/queue/max_size"
predicate = "bounded"
value = true
comparison = "equals"

# Extractor detects: max_size is unbounded (None in code)
[extractors.declarative.claim]
subject = "msgqueue/queue/max_size"
predicate = "bounded"
value = false  # ← Opposite of claim (violation detected)
```

---

##### `confidence` (Float, Optional)

Confidence level (0.0 to 1.0). Defaults to 0.95.

**Format:** `0.0` (no confidence) to `1.0` (certain)

**Typical:** `0.95` (high confidence for regex matches)

---

## Complete Examples

### Example 1: Detecting timeout=0

**The Code (Violation):**

```rust
// src/config.rs:20
// Struct-literal syntax (`field: value`) is what the pattern below matches.
pub fn default_config() -> Config {
    Config {
        timeout: Duration::from_secs(0), // ❌ Violation
    }
}
```

**The Claim (`.aphoria/claims.toml`):**

```toml
[[claim]]
id = "msgqueue-001"
concept_path = "msgqueue/config/timeout"
predicate = "zero"
value = 0
comparison = "not_equals"  # Timeout MUST NOT be zero
invariant = "Timeout MUST be greater than zero"
consequence = "Zero timeout causes indefinite blocking"
```

**The Extractor (`.aphoria/config.toml`):**

```toml
[[extractors.declarative]]
name = "timeout_zero_detector"
pattern = 'timeout:\s*Duration::from_secs\(0\)'
languages = ["rust"]

[extractors.declarative.claim]
subject = "msgqueue/config/timeout"  # ← Matches claim concept_path exactly
predicate = "zero"
value = 0
confidence = 0.95
```

**How It Works:**

1. Extractor scans Rust files
2. Finds pattern `timeout: Duration::from_secs(0)` in `src/config.rs:20`
3. Creates observation: `msgqueue/config/timeout :: zero = 0`
4. Compares to claim: `msgqueue/config/timeout :: zero NOT_EQUALS 0`
5. **Result:** CONFLICT (observation says 0, claim says NOT 0)

---

### Example 2: Detecting Unbounded Queue

**The Code (Violation):**

```rust
// src/queue.rs:45
pub fn default_queue_config() -> QueueConfig {
    QueueConfig {
        max_queue_size: None, // ❌ Violation
    }
}
```

**The Claim:**

```toml
[[claim]]
id = "msgqueue-015"
concept_path = "msgqueue/queue/max_size"
predicate = "bounded"
value = true
comparison = "equals"  # Queue size MUST be bounded
invariant = "Queue size MUST have explicit limit"
consequence = "Unbounded queue causes OOM under sustained load"
```

**The Extractor:**

```toml
[[extractors.declarative]]
name = "queue_max_size_unbounded"
pattern = 'max_queue_size:\s*None'
languages = ["rust"]

[extractors.declarative.claim]
subject = "msgqueue/queue/max_size"  # ← Matches claim exactly
predicate = "bounded"
value = false  # ← Observing "NOT bounded" (violation)
confidence = 0.95
```

**Result:** CONFLICT (observation says NOT bounded, claim says MUST be bounded)

---

### Example 3: Detecting Disabled TLS Validation

**The Code (Violation):**

```rust
// src/tls.rs:12
pub fn default_tls_config() -> TlsConfig {
    TlsConfig {
        verify_certificates: false, // ❌ Violation
    }
}
```

**The Claim:**

```toml
[[claim]]
id = "msgqueue-002"
concept_path = "msgqueue/tls/certificate_validation"
predicate = "enabled"
value = true
comparison = "equals"  # Certificate validation MUST be enabled
invariant = "TLS certificate validation MUST be enabled"
consequence = "Disabled validation allows MITM attacks"
authority_tier = "expert"
category = "security"
```

**The Extractor:**

```toml
[[extractors.declarative]]
name = "tls_cert_validation_disabled"
pattern = 'verify_certificates:\s*false'
languages = ["rust"]

[extractors.declarative.claim]
subject = "msgqueue/tls/certificate_validation"  # ← Matches claim exactly
predicate = "enabled"
value = false  # ← Observing "disabled" (violation)
confidence = 0.95
```

**Result:** CONFLICT (observation says disabled, claim says MUST be enabled)

---

## Common Mistakes & Fixes

### Mistake 1: Subject Path Doesn't Match Claim

**Symptom:** Extractors run (+N observations), but 0% detection rate

**Example:**

```toml
# Claim has:
concept_path = "msgqueue/queue/max_size"

# Extractor uses (WRONG):
subject = "queue/max_size"  # ❌ Missing "msgqueue/" prefix
```

**Fix:** Copy `concept_path` from claim EXACTLY:

```toml
subject = "msgqueue/queue/max_size"  # ✅ Matches claim
```

**Debug Tip:**

```bash
# Compare subject fields vs concept paths
grep "subject =" .aphoria/config.toml
grep "concept_path =" .aphoria/claims.toml

# Subjects should be a subset of concept_paths
```

---

### Mistake 2: Pattern Doesn't Match Code

**Symptom:** 0 observations created, nothing detected

**Example:**

```toml
# Pattern (wrong):
pattern = 'timeout: 0'

# Code has:
#   timeout: Duration::from_secs(0)   ← Pattern too simplistic
```

**Fix:** Make pattern match actual code syntax:

```toml
pattern = 'timeout:\s*Duration::from_secs\(0\)'  # ✅ Matches code
```

**Debug Tip:**

```bash
# Test regex against code BEFORE adding to config
grep -rE 'timeout:\s*Duration::from_secs\(0\)' src/

# Should find the violation line
```

---

### Mistake 3: Wrong Value Type

**Symptom:** Extractors run, observations created, but no CONFLICT detected

**Example:**

```toml
# Claim expects boolean:
predicate = "enabled"
value = true  # Boolean

# Extractor uses string (WRONG):
value = "false"  # ❌ String doesn't match boolean
```

**Fix:** Match value types:

```toml
value = false  # ✅ Boolean matches claim type
```

---

### Mistake 4: Predicate Mismatch

**Symptom:** Observations don't match claims (different predicates)

**Example:**

```toml
# Claim has:
predicate = "bounded"

# Extractor uses (WRONG):
predicate = "unbounded"  # ❌ Different predicate
```

**Fix:** Use SAME predicate as claim:

```toml
predicate = "bounded"  # ✅ Matches claim
value = false          # ← Value indicates violation
```

---

## Validation Workflow

Before running a scan, validate your extractors:

### Step 1: Check Subject Paths Match Claims

```bash
# Extract all subjects from extractors
grep "subject =" .aphoria/config.toml

# Extract all concept_paths from claims
grep "concept_path =" .aphoria/claims.toml

# Verify: Every subject should match a concept_path EXACTLY
```

**Expected:** Each extractor's `subject` appears in a claim's `concept_path`

---

### Step 2: Test Regex Pattern Against Code

```bash
# For each extractor pattern, test against codebase
grep -rE 'timeout:\s*Duration::from_secs\(0\)' src/

# Should find the violation line(s) you're targeting
```

**Expected:** Pattern matches at least one line in code

---

### Step 3: Verify TOML Syntax

```bash
# Check for TOML syntax errors
cargo install taplo-cli  # Install TOML linter
taplo fmt --check .aphoria/config.toml

# Or: Try loading with aphoria
aphoria scan --dry-run  # (Feature request: VG-DAY3-003)
```

**Expected:** No syntax errors

---

## Debugging 0% Detection Rate

If your extractors run but the detection rate is still 0%:

### Step 1: Verify Observations Were Created

```bash
# Check scan output for observation count
jq '.observations | length' scan-results-v2.json

# Expected: > 0 (if 0, extractors didn't match any code)
```

**If 0 observations:**

- Problem: Pattern doesn't match code
- Fix: Test pattern with `grep -rE "pattern" src/`

**If >0 observations:**

- Problem: Observations don't match claims (path/predicate mismatch)
- Continue to Step 2

---

### Step 2: Compare Observation Paths vs Claim Paths

**⚠️ Workaround:** (Until VG-DAY3-001 `--show-observations` exists)

```bash
# Manual inspection of scan JSON
jq '.observations[].concept_path' scan-results-v2.json | sort -u

# Compare with claim paths
grep "concept_path =" .aphoria/claims.toml | sort -u

# Check: Do observation paths END with same tail as claim paths?
```

**Example:**

- Observation: `queue/max_size`
- Claim: `msgqueue/queue/max_size`
- Tail-path: Last 2 segments = `queue/max_size`
- **Issue:** Observation missing `msgqueue/` prefix

**Fix:** Update extractor `subject` to match claim's full path.

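
The subject-versus-claim comparison described above can also be automated. The following is a minimal Rust sketch and not part of Aphoria: `quoted_values` and `unmatched_subjects` are hypothetical helper names, and the line-based scan is only a stand-in for real TOML parsing (it mirrors what the `grep` commands do). It reports every `subject` that has no exactly equal `concept_path`.

```rust
use std::collections::BTreeSet;

/// Collect the quoted values of `key = "..."` lines from TOML-like text.
/// Line-based on purpose: this mirrors the grep commands, it is not a TOML parser.
fn quoted_values(text: &str, key: &str) -> BTreeSet<String> {
    let mut out = BTreeSet::new();
    for line in text.lines() {
        let line = line.trim_start();
        // Require `key`, optional spaces, `=`, optional spaces, then an opening quote.
        let Some(rest) = line.strip_prefix(key) else { continue };
        let Some(rest) = rest.trim_start().strip_prefix('=') else { continue };
        let Some(rest) = rest.trim_start().strip_prefix('"') else { continue };
        if let Some(end) = rest.find('"') {
            out.insert(rest[..end].to_string());
        }
    }
    out
}

/// Subjects (from the extractor config) with no exactly equal concept_path (from the claims).
fn unmatched_subjects(config: &str, claims: &str) -> Vec<String> {
    let subjects = quoted_values(config, "subject");
    let paths = quoted_values(claims, "concept_path");
    subjects.difference(&paths).cloned().collect()
}

fn main() {
    // Inline stand-ins for .aphoria/config.toml and .aphoria/claims.toml.
    let config = r#"
[extractors.declarative.claim]
subject = "queue/max_size"
[extractors.declarative.claim]
subject = "msgqueue/config/timeout"
"#;
    let claims = r#"
[[claim]]
concept_path = "msgqueue/queue/max_size"
[[claim]]
concept_path = "msgqueue/config/timeout"
"#;
    for subject in unmatched_subjects(config, claims) {
        println!("no claim has concept_path = \"{subject}\"");
    }
}
```

To run this against a real project, the two strings would be loaded with `std::fs::read_to_string`. Any subject it prints is a candidate for Mistake 1.
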

---

### Step 3: Check Predicate Alignment

```bash
# Extract predicates from observations (manual inspection)
jq '.observations[].predicate' scan-results-v2.json | sort -u

# Compare with claim predicates
grep "predicate =" .aphoria/claims.toml | sort -u

# Verify: Observation predicates match claim predicates
```

---

## Advanced: Tail-Path Matching Explained

Aphoria uses **tail-path matching** (last 2 segments) to allow observations from different namespaces to match claims.

### How It Works

**Claim:** `myapp/database/connection/pool_size`

- Full path: 4 segments
- Tail-path: Last 2 = `connection/pool_size`

**Observation:** `postgres/connection/pool_size`

- Full path: 3 segments
- Tail-path: Last 2 = `connection/pool_size`

**Match:** ✅ Tails match (`connection/pool_size`)

### Why This Matters for Extractors

Your extractor's `subject` becomes the observation's `concept_path`.

**If you use:**

```toml
subject = "connection/pool_size"  # 2 segments
```

**Observation will have:**

- Path: `connection/pool_size`
- Tail: `connection/pool_size` (last 2)

**This matches claims with tail:**

- `myapp/database/connection/pool_size` → tail: `connection/pool_size` ✅
- `postgres/connection/pool_size` → tail: `connection/pool_size` ✅

**But NOT:**

- `myapp/connection_pool_size` → tail: `myapp/connection_pool_size` (different tail, no match) ❌

### Best Practice

**Use full path matching your claim:**

- Claim: `msgqueue/queue/max_size`
- Extractor: `subject = "msgqueue/queue/max_size"` (exact copy)

This avoids tail-path confusion and ensures exact matching.

---

## When to Use Declarative Extractors

### ✅ Good Use Cases

1. **Simple regex patterns** - Detecting specific code constructs
   - `timeout = 0`
   - `max_size = None`
   - `verify_certificates = false`

2. **Known anti-patterns** - Common mistakes with clear regex
   - `std::thread::sleep` in async functions
   - `unwrap()` calls in production code
   - Hardcoded credentials patterns

3. **Configuration violations** - Specific config values
   - Port numbers
   - Timeouts
   - Buffer sizes

### ❌ When NOT to Use

1. **Complex logic** - Requires control flow analysis
   - "Function X must be called before function Y"
   - "Lock must be released in all code paths"
   - Use programmatic extractors instead

2. **Context-dependent patterns** - Depends on surrounding code
   - "Timeout must be > connection_timeout"
   - "Buffer size must match header size"
   - Use programmatic extractors with AST analysis

3. **Cross-file patterns** - Spans multiple files
   - "Config file must match CLI args"
   - "Database schema must match API types"
   - Use programmatic extractors with global analysis

---

## Related Documentation

- **Creating Extractors:** `.claude/skills/aphoria-custom-extractor-creator/SKILL.md`
- **Claims Reference:** `applications/aphoria/docs/claims-reference.md`
- **Scan Workflow:** `applications/aphoria/docs/scanning.md`
- **Product Gaps:** `VG-DAY3-001` (`--show-observations`), `VG-DAY3-003` (`aphoria extractors validate`)

---

## FAQ

**Q: What if my pattern never matches?**

A: Test with `grep -rE "pattern" src/` first. If grep finds nothing, your pattern is wrong.

**Q: What if observations are created but no conflicts detected?**

A: Check that the `subject` field matches the claim's `concept_path` EXACTLY. Use `grep "subject =" .aphoria/config.toml` vs `grep "concept_path =" .aphoria/claims.toml` to compare.

**Q: Can I use wildcards in subject paths?**

A: Not in declarative extractors. Use programmatic extractors for dynamic path generation.

**Q: How do I debug observation paths?**

A: Manually inspect the scan results JSON (for example `scan-results-v2.json`) with `jq '.observations[].concept_path'` until VG-DAY3-001 (`--show-observations` flag) is implemented.

**Q: Can one extractor create multiple observations?**

A: Yes! If the pattern matches multiple times in code, the extractor creates one observation per match (all with the same subject/predicate).

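
As a companion to the tail-path section above, here is a minimal Rust sketch of the "last 2 segments" rule as this document describes it. It is not Aphoria's implementation: `tail` and `tails_match` are hypothetical names, and treating paths shorter than two segments as non-matching is an assumption.

```rust
/// Last `n` slash-separated segments of a concept path, or None if the path is shorter.
fn tail(path: &str, n: usize) -> Option<Vec<&str>> {
    let segments: Vec<&str> = path.split('/').filter(|s| !s.is_empty()).collect();
    if segments.len() < n {
        return None;
    }
    Some(segments[segments.len() - n..].to_vec())
}

/// True when both paths have at least two segments and share the same last two.
/// Assumption: paths with fewer than two segments never match.
fn tails_match(observation: &str, claim: &str) -> bool {
    match (tail(observation, 2), tail(claim, 2)) {
        (Some(a), Some(b)) => a == b,
        _ => false,
    }
}

fn main() {
    let claim = "myapp/database/connection/pool_size";
    let observations = [
        "postgres/connection/pool_size",
        "connection/pool_size",
        "myapp/connection_pool_size",
    ];
    for obs in observations {
        println!("{obs} vs {claim}: {}", tails_match(obs, claim));
    }
}
```

Under this rule the first two observations share the claim's tail and the third does not. Because real matching behavior may differ from this sketch, copying the claim's full `concept_path` into `subject` remains the safer choice.
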

---

**Last Updated:** 2026-02-10 (after msgqueue Day 3 evaluation)

**Related Gaps:** VG-DAY3-001, VG-DAY3-002, VG-DAY3-003