# Declarative Extractor Reference

Declarative extractors are pattern-based extractors defined in TOML configuration. They're ideal for detecting simple code patterns through regex matching.

---

## Quick Start

**Add to `.aphoria/config.toml`:**

```toml
[[extractors.declarative]]
name = "timeout_zero_detector"
pattern = 'timeout:\s*Duration::from_secs\(0\)'
languages = ["rust"]

[extractors.declarative.claim]
subject = "myapp/config/timeout"
predicate = "zero"
value = 0
confidence = 0.95
```

**Result:** Each time the pattern matches code, the extractor creates an observation, which is then compared against claims with the same `concept_path`.

---

## Field Reference

### Required Fields

#### `name` (String)
Unique identifier for this extractor.

**Format:** snake_case, descriptive
**Example:** `"timeout_zero_detector"`, `"unbounded_queue_size"`

---

#### `pattern` (String)
Regular expression matching the code pattern you want to detect.

**Format:** Valid regex (Rust regex crate syntax)
**Tips:**
- Use `\s*` for optional whitespace
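- Escape special chars: `\(`, `\)`, `\.`
- Write the pattern as a TOML literal string (single quotes) so backslashes need no extra escaping
- Test with `grep -E "pattern" file.rs` before adding to config. This is a quick check only: grep's syntax is not identical to the Rust regex crate's (for example, `\s` is a GNU extension in grep)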

**Examples:**
```toml
# Detect timeout = 0
pattern = 'timeout:\s*Duration::from_secs\(0\)'

# Detect None for max_size
pattern = 'max_queue_size:\s*None'

# Detect verify_certificates = false
pattern = 'verify_certificates:\s*false'
```

---

#### `languages` (Array of Strings)
File types this extractor should run on.

**Format:** Array of language names
**Supported:** `["rust", "python", "javascript", "typescript", "go", "java"]`

**Example:**
```toml
languages = ["rust"]
```

---

#### `[extractors.declarative.claim]` Section

This defines the observation that will be created when the pattern matches.

##### `subject` (String) - **CRITICAL FIELD**

The **concept path** for observations created by this extractor.

**⚠️ MOST COMMON MISTAKE:** Using partial path instead of full path.

**Format:** Full slash-separated path matching your claim's `concept_path` **EXACTLY**.

**Example (Correct):**
```toml
# Claim has:
concept_path = "msgqueue/queue/max_size"

# Extractor MUST use SAME path:
[extractors.declarative.claim]
subject = "msgqueue/queue/max_size"  # ✅ CORRECT
```

**Common Mistake (Wrong):**
```toml
# ❌ WRONG: Using only leaf segments
subject = "queue/max_size"  # Will NOT match claim!

# ❌ WRONG: Different prefix
subject = "myapp/queue/max_size"  # Will NOT match unless claim also uses "myapp"
```

**Why This Matters:**

Observations match claims via **tail-path matching** (last 2 segments).

- **Claim:** `msgqueue/queue/max_size` → tail: `queue/max_size`
- **Observation:** `queue/max_size` → tail: `queue/max_size`
- **Match?** Only if observation path ENDS with same tail as claim

If you use `subject = "queue/max_size"`, the observation's path is `queue/max_size`, and its tail is also `queue/max_size`. Do not rely on the tail alone, though: as Mistake 1 below shows, a shortened subject leads to a 0% detection rate when the claim is stored as `msgqueue/queue/max_size`. The FULL paths should align.

**Rule of Thumb:** Copy `concept_path` from your claim EXACTLY into `subject` field.

---

##### `predicate` (String)
The attribute you're observing.

**Format:** snake_case identifier
**Common Values:**
- `"zero"` - For numeric zero checks
- `"bounded"` - For limit/size checks
- `"enabled"` - For boolean flags
- `"valid"` - For validation checks

**Must match:** The predicate in your claim.

**Example:**
```toml
# Claim has: predicate = "bounded"
# Extractor must use:
predicate = "bounded"
```

---

##### `value` (Boolean, Number, or String)
The value observed when pattern matches.

**Type:** Must match claim's value type
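**Typical Pattern:** Extractor observes the VIOLATION value. For an `equals` claim this is the opposite of the claim's value (as in the example below); for a `not_equals` claim it is the same forbidden value (see Example 1).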

**Example:**
```toml
# Claim says: max_size should be bounded (true)
concept_path = "msgqueue/queue/max_size"
predicate = "bounded"
value = true
comparison = "equals"

# Extractor detects: max_size is unbounded (None in code)
[extractors.declarative.claim]
subject = "msgqueue/queue/max_size"
predicate = "bounded"
value = false  # ← Opposite of claim (violation detected)
```

---

##### `confidence` (Float, Optional)
Confidence level (0.0 to 1.0). Defaults to 0.95.

**Format:** `0.0` (no confidence) to `1.0` (certain)
**Typical:** `0.95` (high confidence for regex matches)

---

## Complete Examples

### Example 1: Detecting timeout=0

**The Code (Violation):**
```rust
// src/config.rs:20
impl Default for Config {
    fn default() -> Self {
        Self { timeout: Duration::from_secs(0) }  // ❌ Violation
    }
}
```

**The Claim (`.aphoria/claims.toml`):**
```toml
[[claim]]
id = "msgqueue-001"
concept_path = "msgqueue/config/timeout"
predicate = "zero"
value = 0
comparison = "not_equals"  # Timeout MUST NOT be zero
invariant = "Timeout MUST be greater than zero"
consequence = "Zero timeout causes indefinite blocking"
```

**The Extractor (`.aphoria/config.toml`):**
```toml
[[extractors.declarative]]
name = "timeout_zero_detector"
pattern = 'timeout:\s*Duration::from_secs\(0\)'
languages = ["rust"]

[extractors.declarative.claim]
subject = "msgqueue/config/timeout"  # ← Matches claim concept_path exactly
predicate = "zero"
value = 0
confidence = 0.95
```

**How It Works:**
1. Extractor scans Rust files
2. Finds pattern `timeout: Duration::from_secs(0)` in `src/config.rs:20`
3. Creates observation: `msgqueue/config/timeout :: zero = 0`
4. Compares to claim: `msgqueue/config/timeout :: zero NOT_EQUALS 0`
5. **Result:** CONFLICT (observation says 0, claim says NOT 0)
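
Steps 3 to 5 can be sketched as a small shell helper. This is only an illustration of the comparison as the examples on this page describe it, not Aphoria's actual logic: the function name `evaluate` is hypothetical, values are compared as plain strings, and only the two comparisons shown on this page (`equals`, `not_equals`) are handled.

```bash
# evaluate <comparison> <claim_value> <observed_value>
# Prints OK when the observed value satisfies the claim, CONFLICT otherwise.
# Assumption: values are compared as strings; real value typing is ignored here.
evaluate() {
  comparison=$1
  claim_value=$2
  observed=$3
  case "$comparison" in
    equals)     [ "$observed" = "$claim_value" ] && echo OK || echo CONFLICT ;;
    not_equals) [ "$observed" != "$claim_value" ] && echo OK || echo CONFLICT ;;
    *)          echo "unknown comparison: $comparison" >&2; return 2 ;;
  esac
}
```

For this example, `evaluate not_equals 0 0` prints `CONFLICT`, because the observation reports exactly the value the claim forbids.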

---

### Example 2: Detecting Unbounded Queue

**The Code (Violation):**
```rust
// src/queue.rs:45
impl Default for QueueConfig {
    fn default() -> Self {
        Self { max_queue_size: None }  // ❌ Violation
    }
}
```

**The Claim:**
```toml
[[claim]]
id = "msgqueue-015"
concept_path = "msgqueue/queue/max_size"
predicate = "bounded"
value = true
comparison = "equals"  # Queue size MUST be bounded
invariant = "Queue size MUST have explicit limit"
consequence = "Unbounded queue causes OOM under sustained load"
```

**The Extractor:**
```toml
[[extractors.declarative]]
name = "queue_max_size_unbounded"
pattern = 'max_queue_size:\s*None'
languages = ["rust"]

[extractors.declarative.claim]
subject = "msgqueue/queue/max_size"  # ← Matches claim exactly
predicate = "bounded"
value = false  # ← Observing "NOT bounded" (violation)
confidence = 0.95
```

**Result:** CONFLICT (observation says NOT bounded, claim says MUST be bounded)

---

### Example 3: Detecting Disabled TLS Validation

**The Code (Violation):**
```rust
// src/tls.rs:12
impl Default for TlsConfig {
    fn default() -> Self {
        Self { verify_certificates: false }  // ❌ Violation
    }
}
```

**The Claim:**
```toml
[[claim]]
id = "msgqueue-002"
concept_path = "msgqueue/tls/certificate_validation"
predicate = "enabled"
value = true
comparison = "equals"  # Certificate validation MUST be enabled
invariant = "TLS certificate validation MUST be enabled"
consequence = "Disabled validation allows MITM attacks"
authority_tier = "expert"
category = "security"
```

**The Extractor:**
```toml
[[extractors.declarative]]
name = "tls_cert_validation_disabled"
pattern = 'verify_certificates:\s*false'
languages = ["rust"]

[extractors.declarative.claim]
subject = "msgqueue/tls/certificate_validation"  # ← Matches claim exactly
predicate = "enabled"
value = false  # ← Observing "disabled" (violation)
confidence = 0.95
```

**Result:** CONFLICT (observation says disabled, claim says MUST be enabled)

---

## Common Mistakes & Fixes

### Mistake 1: Subject Path Doesn't Match Claim

**Symptom:** Extractors run (+N observations), but 0% detection rate

**Example:**
```toml
# Claim has:
concept_path = "msgqueue/queue/max_size"

# Extractor uses (WRONG):
subject = "queue/max_size"  # ❌ Missing "msgqueue/" prefix
```

**Fix:** Copy `concept_path` from claim EXACTLY:
```toml
subject = "msgqueue/queue/max_size"  # ✅ Matches claim
```

**Debug Tip:**
```bash
# Compare subject fields vs concept paths
grep "subject =" .aphoria/config.toml
grep "concept_path =" .aphoria/claims.toml

# Subjects should be subset of concept_paths
```

---

### Mistake 2: Pattern Doesn't Match Code

**Symptom:** 0 observations created, nothing detected

**Example:**
```toml
# Pattern (wrong):
pattern = 'timeout: 0'

# Code has:
timeout: Duration::from_secs(0)  # ← Pattern too simplistic
```

**Fix:** Make pattern match actual code syntax:
```toml
pattern = 'timeout:\s*Duration::from_secs\(0\)'  # ✅ Matches code
```

**Debug Tip:**
```bash
# Test regex against code BEFORE adding to config
grep -rE 'timeout:\s*Duration::from_secs\(0\)' src/
# Should find the violation line
```

---

### Mistake 3: Wrong Value Type

**Symptom:** Extractors run, observations created, but no CONFLICT detected

**Example:**
```toml
# Claim expects boolean:
predicate = "enabled"
value = true  # Boolean

# Extractor uses string (WRONG):
value = "false"  # ❌ String doesn't match boolean
```

**Fix:** Match value types:
```toml
value = false  # ✅ Boolean matches claim type
```

---

### Mistake 4: Predicate Mismatch

**Symptom:** Observations don't match claims (different predicates)

**Example:**
```toml
# Claim has:
predicate = "bounded"

# Extractor uses (WRONG):
predicate = "unbounded"  # ❌ Different predicate
```

**Fix:** Use SAME predicate as claim:
```toml
predicate = "bounded"  # ✅ Matches claim
value = false          # ← Value indicates violation
```

---

## Validation Workflow

Before running a scan, validate your extractors:

### Step 1: Check Subject Paths Match Claims

```bash
# Extract all subjects from extractors
grep "subject =" .aphoria/config.toml

# Extract all concept_paths from claims
grep "concept_path =" .aphoria/claims.toml

# Verify: Every subject should match a concept_path EXACTLY
```

**Expected:** Each extractor's `subject` is identical to some claim's `concept_path`
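
The manual comparison can be automated with a short helper. The sketch below is a line-based approximation, not a TOML parser: it assumes each `subject = "..."` and `concept_path = "..."` sits on its own line, and the function names `extract_values` and `unmatched_subjects` are hypothetical.

```bash
# extract_values <key> <file>
# Print the quoted value of every `key = "..."` line, sorted and de-duplicated.
extract_values() {
  sed -n "s/^[[:space:]]*$1[[:space:]]*=[[:space:]]*\"\([^\"]*\)\".*/\1/p" "$2" | sort -u
}

# unmatched_subjects <config.toml> <claims.toml>
# Print each extractor subject that has no identical concept_path in the claims file.
unmatched_subjects() {
  claims=$(extract_values concept_path "$2")
  extract_values subject "$1" | while IFS= read -r subject; do
    # -F: fixed string, -x: whole line must match, -q: quiet
    printf '%s\n' "$claims" | grep -Fxq -- "$subject" || printf '%s\n' "$subject"
  done
}
```

Running `unmatched_subjects .aphoria/config.toml .aphoria/claims.toml` should print nothing when every subject has an exact counterpart; any path it prints is a candidate for the mismatch described in Mistake 1.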

---

### Step 2: Test Regex Pattern Against Code

```bash
# For each extractor pattern, test against codebase
grep -rE 'timeout:\s*Duration::from_secs\(0\)' src/

# Should find the violation line(s) you're targeting
```

**Expected:** Pattern matches at least one line in code
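
To run this check for every extractor at once, the patterns can be read out of the config and fed to grep. The helper below is a rough sketch under a few assumptions: patterns are written as single-quoted TOML literal strings on one line, `grep -E` accepts the same syntax as the pattern (not always true for Rust regex features), and the name `check_patterns` is hypothetical.

```bash
# check_patterns <config.toml> <source_dir>
# For every `pattern = '...'` line, print "<match count><TAB><pattern>".
check_patterns() {
  sed -n "s/^[[:space:]]*pattern[[:space:]]*=[[:space:]]*'\([^']*\)'.*/\1/p" "$1" |
  while IFS= read -r pattern; do
    # -o prints one line per match, so wc -l counts matches rather than matching lines
    count=$( { grep -roE -- "$pattern" "$2" || true; } | wc -l | tr -d ' ')
    printf '%s\t%s\n' "$count" "$pattern"
  done
}
```

A call such as `check_patterns .aphoria/config.toml src/` lists each pattern with its match count; a count of `0` points to the situation described in Mistake 2.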

---

### Step 3: Verify TOML Syntax

```bash
# Check for TOML syntax errors
cargo install taplo-cli --locked  # Install TOML linter
taplo lint .aphoria/config.toml   # Parses the file and reports syntax errors

# Or: Try loading with aphoria
aphoria scan --dry-run  # Not yet available (feature request: VG-DAY3-003)
```

**Expected:** No syntax errors

---

## Debugging 0% Detection Rate

If your extractors run but detection rate is still 0%:

### Step 1: Verify Observations Were Created

```bash
# Check scan output for observation count
jq '.observations | length' scan-results-v2.json

# Expected: > 0 (if 0, extractors didn't match any code)
```

**If 0 observations:**
- Problem: Pattern doesn't match code
- Fix: Test pattern with `grep -rE "pattern" src/`

**If >0 observations:**
- Problem: Observations don't match claims (path/predicate mismatch)
- Continue to Step 2

---

### Step 2: Compare Observation Paths vs Claim Paths

**⚠️ Workaround:** (Until VG-DAY3-001 `--show-observations` exists)

```bash
# Manual inspection of scan JSON
jq '.observations[].concept_path' scan-results-v2.json | sort -u

# Compare with claim paths
grep "concept_path =" .aphoria/claims.toml | sort -u

# Check: Do observation paths END with same tail as claim paths?
```

**Example:**
- Observation: `queue/max_size`
- Claim: `msgqueue/queue/max_size`
- Tail-path: Last 2 segments = `queue/max_size`
- **Issue:** Observation missing `msgqueue/` prefix

**Fix:** Update extractor `subject` to match claim's full path.
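
Near-misses like the one above can be listed mechanically. The following sketch assumes two plain text files with one path per line, for instance produced with `jq -r '.observations[].concept_path' scan-results-v2.json` and by stripping the `concept_path = "..."` lines of the claims file. The function name `near_misses` is hypothetical, and if several claims share a tail only the last one is reported.

```bash
# near_misses <observation_paths> <claim_paths>
# Print "observation -> claim" for every observation that shares its last
# two segments with a claim but is not the same full path.
near_misses() {
  awk -F/ '
    # Last two slash-separated segments of the current line (whole line if shorter).
    function tail() { return (NF >= 2) ? ($(NF-1) "/" $NF) : $0 }
    NR == FNR { claim_by_tail[tail()] = $0; next }   # first file: claim paths
    {
      t = tail()
      if ((t in claim_by_tail) && claim_by_tail[t] != $0)
        print $0 " -> " claim_by_tail[t]
    }
  ' "$2" "$1"
}
```

Each printed pair names an extractor `subject` that should probably be replaced by the claim path on the right.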

---

### Step 3: Check Predicate Alignment

```bash
# Extract predicates from observations (manual inspection)
jq '.observations[].predicate' scan-results-v2.json | sort -u

# Compare with claim predicates
grep "predicate =" .aphoria/claims.toml | sort -u

# Verify: Observation predicates match claim predicates
```

---

## Advanced: Tail-Path Matching Explained

Aphoria uses **tail-path matching** (last 2 segments) to allow observations from different namespaces to match claims.

### How It Works

**Claim:** `myapp/database/connection/pool_size`
- Full path: 4 segments
- Tail-path: Last 2 = `connection/pool_size`

**Observation:** `postgres/connection/pool_size`
- Full path: 3 segments
- Tail-path: Last 2 = `connection/pool_size`

**Match:** ✅ Tails match (`connection/pool_size`)
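
The rule can be written down as a pair of shell helpers. This is a sketch of tail-path comparison as this page describes it, not Aphoria's implementation; the names `tail_path` and `tails_match` are hypothetical.

```bash
# tail_path <path>
# Print the last two slash-separated segments (or the whole path if it has fewer).
tail_path() {
  printf '%s\n' "$1" | awk -F/ '{ if (NF >= 2) print $(NF-1) "/" $NF; else print $0 }'
}

# tails_match <path_a> <path_b>
# Exit status 0 when both paths share the same two-segment tail.
tails_match() {
  [ "$(tail_path "$1")" = "$(tail_path "$2")" ]
}
```

With these definitions, `tails_match myapp/database/connection/pool_size postgres/connection/pool_size` succeeds, since both tails are `connection/pool_size`.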

### Why This Matters for Extractors

Your extractor's `subject` becomes the observation's concept_path.

**If you use:**
```toml
subject = "connection/pool_size"  # 2 segments
```

**Observation will have:**
- Path: `connection/pool_size`
- Tail: `connection/pool_size` (last 2)

**This matches claims with tail:**
- `myapp/database/connection/pool_size` → tail: `connection/pool_size` ✅
- `postgres/connection/pool_size` → tail: `connection/pool_size` ✅

**But NOT:**
- `myapp/connection_pool_size` → tail: `myapp/connection_pool_size` (only 2 segments, so the whole path is the tail; no match) ❌

### Best Practice

**Use full path matching your claim:**
- Claim: `msgqueue/queue/max_size`
- Extractor: `subject = "msgqueue/queue/max_size"` (exact copy)

This avoids tail-path confusion and ensures exact matching.

---

## When to Use Declarative Extractors

### ✅ Good Use Cases

1. **Simple regex patterns** - Detecting specific code constructs
   - `timeout = 0`
   - `max_size = None`
   - `verify_certificates = false`

2. **Known anti-patterns** - Common mistakes with clear regex
   - `std::thread::sleep` in async functions
   - `unwrap()` calls in production code
   - Hardcoded credentials patterns

3. **Configuration violations** - Specific config values
   - Port numbers
   - Timeouts
   - Buffer sizes

### ❌ When NOT to Use

1. **Complex logic** - Requires control flow analysis
   - "Function X must be called before function Y"
   - "Lock must be released in all code paths"
   - Use programmatic extractors instead

2. **Context-dependent patterns** - Depends on surrounding code
   - "Timeout must be > connection_timeout"
   - "Buffer size must match header size"
   - Use programmatic extractors with AST analysis

3. **Cross-file patterns** - Spans multiple files
   - "Config file must match CLI args"
   - "Database schema must match API types"
   - Use programmatic extractors with global analysis

---

## Related Documentation

- **Creating Extractors:** `.claude/skills/aphoria-custom-extractor-creator/SKILL.md`
- **Claims Reference:** `applications/aphoria/docs/claims-reference.md`
- **Scan Workflow:** `applications/aphoria/docs/scanning.md`
- **Product Gaps:** `VG-DAY3-001` (`--show-observations`), `VG-DAY3-003` (`aphoria extractors validate`)

---

## FAQ

**Q: What if my pattern never matches?**
A: Test with `grep -rE "pattern" src/` first. If grep finds nothing, the pattern does not match your code as written; compare it against the target line character by character.

**Q: What if observations are created but no conflicts detected?**
A: Check `subject` field matches claim `concept_path` EXACTLY. Use `grep "subject =" .aphoria/config.toml` vs `grep "concept_path =" .aphoria/claims.toml` to compare.

**Q: Can I use wildcards in subject paths?**
A: Not in declarative extractors. Use programmatic extractors for dynamic path generation.

**Q: How do I debug observation paths?**
A: Manually inspect the scan results JSON (`scan-results-v2.json` in the examples above) with `jq '.observations[].concept_path' scan-results-v2.json` until VG-DAY3-001 (`--show-observations` flag) is implemented.

**Q: Can one extractor create multiple observations?**
A: Yes! If pattern matches multiple times in code, extractor creates one observation per match (all with same subject/predicate).

---

**Last Updated:** 2026-02-10 (after msgqueue Day 3 evaluation)
**Related Gaps:** VG-DAY3-001, VG-DAY3-002, VG-DAY3-003