
Custom Extractor Guide - Building Extractors for Library API Validation

Context: This guide was created during the dbpool dogfood exercise when we discovered that Aphoria's built-in extractors are designed for security patterns (TLS, JWT, secrets) but don't detect library API design violations like optional struct fields or missing configuration.

Problem: We created 27 corpus claims for database pool best practices, but scan returned 0 observations because no extractors could detect patterns like max_connections: Option<usize>.

Solution: Build custom extractors using Aphoria's declarative extractor system.


Table of Contents

  1. Understanding the Extractor Pipeline
  2. When Built-In Extractors Aren't Enough
  3. Declarative Extractors: Quick Pattern Matching
  4. Example: Detecting Optional Fields
  5. Complete Extractor Set for dbpool
  6. Testing and Verification
  7. Troubleshooting

Understanding the Extractor Pipeline

How Aphoria Detection Works

Step 1: EXTRACTORS scan your code
  ↓
  Look for patterns:
  - Struct fields (max_connections: Option<usize>)
  - Const values (Duration::from_secs(60))
  - Function calls (conn.is_valid().await)
  - Imports (use tokio::*)
  ↓
Step 2: OBSERVATIONS created
  ↓
  Each pattern becomes an observation:
  - Subject: "code://rust/dbpool/config/max_connections"
  - Predicate: "is_option"
  - Value: true
  - File: "src/config.rs"
  - Line: 25
  ↓
Step 3: COMPARISON against corpus
  ↓
  Observation compared to corpus claims:
  - Corpus claim: "dbpool/max_connections" required: true
  - Code observation: "max_connections" is_option: true
  - Conflict: YES (required field is optional)
  ↓
Step 4: VERDICT generated
  ↓
  Conflict score calculated:
  - Tier 2 (vendor) × 0.95 confidence = 0.95
  - 0.95 >= 0.7 threshold → BLOCK
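
The arithmetic in Step 4 can be sketched in a few lines of Rust. This is an illustration only, not Aphoria's implementation: the function names are hypothetical, the tier weight of 1.0 is inferred from the "× 0.95 = 0.95" line above, and the FLAG band between the two thresholds is an assumption based on the block_threshold and flag_threshold settings used later in this guide.

```rust
/// Verdict levels that appear in this guide's scan output.
#[derive(Debug, PartialEq)]
enum Verdict {
    Block,
    Flag,
    Pass,
}

/// Conflict score: authority tier weight times extractor confidence.
/// The weight is an assumption; the guide's arithmetic implies 1.0 for Tier 2.
fn conflict_score(tier_weight: f64, confidence: f64) -> f64 {
    tier_weight * confidence
}

/// Map a score to a verdict using the two thresholds (assumed semantics).
fn verdict(score: f64, block_threshold: f64, flag_threshold: f64) -> Verdict {
    if score >= block_threshold {
        Verdict::Block
    } else if score >= flag_threshold {
        Verdict::Flag
    } else {
        Verdict::Pass
    }
}

fn main() {
    // Values from the pipeline walkthrough: Tier 2 claim, 0.95 confidence.
    let score = conflict_score(1.0, 0.95);
    println!("score = {:.2}, verdict = {:?}", score, verdict(score, 0.7, 0.5));
}
```

Keeping the score and the threshold comparison in separate functions makes it easy to see which of the two inputs (extractor confidence or threshold) is responsible when a conflict lands on an unexpected verdict.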

Tail-Path Matching

Critical: Observations and corpus claims match via "tail path" (last 2 segments).

// Code observation:
Subject: "code://rust/dbpool/config/max_connections"
Tail path: "config/max_connections" (compared with "dbpool/max_connections")

// Corpus claim:
Subject: "dbpool/max_connections"

// Match: YES (tail path matches)

When Built-In Extractors Aren't Enough

Built-In Extractor Coverage

Aphoria ships with 42 built-in extractors focused on security patterns:

| Category | Extractors | Examples |
|---|---|---|
| Crypto/TLS | tls_verify, tls_version, weak_crypto | Detects weak TLS, missing verification |
| Authentication | jwt_config, hardcoded_secrets, cors_config | Detects plaintext credentials, weak JWT |
| Injection | sql_injection, command_injection | Detects unsafe query construction |
| Config | timeout_config, rate_limit | Detects missing timeouts, no rate limits |
| Dependencies | dep_versions, import_graph | Tracks dependency versions, import cycles |

What's NOT covered:

  • Struct field validation (Option when required)
  • Missing fields (no max_lifetime field exists)
  • Type mismatches (String when SecretString expected)
  • Library API design patterns

Recognizing the Gap

Symptoms:

$ aphoria scan --format json
{
  "observations_extracted": 0,  // ← No patterns found
  "files_scanned": 7,
  "summary": "No claims found"  // ← Misleading message
}

But corpus has claims:

$ curl 'http://localhost:18180/v1/aphoria/corpus' | \
    jq '[.items[] | select(.subject | contains("dbpool"))] | length'
27

Root cause: No extractors can detect your patterns.


Declarative Extractors

What Are Declarative Extractors?

Declarative extractors use regex patterns to find code patterns and emit observations. No code required - just configuration.

Advantages:

  • Fast to create (5-10 minutes per extractor)
  • No compilation or deployment needed
  • Pattern-based (regex), no AST parsing
  • Works for simple syntactic patterns

Disadvantages:

  • Fragile to code formatting changes
  • Limited to regex-matchable patterns
  • Cannot detect semantic patterns (e.g., "field is missing")

Configuration Format

Add to .aphoria/config.toml:

[[extractors.declarative]]
name = "extractor_name"
description = "Human-readable description"
languages = ["rust"]  # or ["python", "javascript", etc.]
pattern = 'regex_pattern_here'

[extractors.declarative.claim]
subject = "your/subject/path"
predicate = "predicate_name"
value = { boolean = true }  # or { string = "value" }, { number = 42 }

confidence = 0.9  # 0.0 to 1.0
source = "dogfood"  # or "custom", "project", etc.

Example: Detecting Optional Fields

Problem

We have a corpus claim:

Subject: dbpool/max_connections
Predicate: required
Value: true

Our code has:

pub struct PoolConfig {
    pub max_connections: Option<usize>,  // ← Should NOT be Option
}

No built-in extractor detects this.

Solution: Declarative Extractor

Add to .aphoria/config.toml:

[[extractors.declarative]]
name = "dbpool_max_connections_optional"
description = "Detects Option<usize> for max_connections (should be required)"
languages = ["rust"]

# Pattern: pub max_connections: Option<usize>
# Matches: field declaration with Option wrapper
pattern = 'pub\s+max_connections:\s+Option<(?:usize|u64|u32)>'

[extractors.declarative.claim]
subject = "dbpool/max_connections"
predicate = "is_option"
value = { boolean = true }

confidence = 0.92
source = "dogfood"

How It Works

  1. Pattern matches code:

    pub max_connections: Option<usize>,
    //  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ matches pattern
    
  2. Observation emitted:

    Subject: code://rust/dbpool/config/max_connections
    Predicate: is_option
    Value: true
    File: src/config.rs
    Line: 25
    
  3. Comparison against corpus:

    Corpus: dbpool/max_connections required: true
    Code:   dbpool/max_connections is_option: true
    Conflict: YES (required field is optional)
    
  4. Verdict:

    BLOCK: max_connections is Option<usize>, violates required claim
    Confidence: 0.92 (from extractor)
    

Complete Extractor Set for dbpool

All 7 Violations with Declarative Extractors

Add this complete set to .aphoria/config.toml:

# ============================================================================
# CUSTOM DECLARATIVE EXTRACTORS FOR DBPOOL DOGFOOD
# ============================================================================
# These extractors detect library API design violations that built-in
# extractors don't cover (struct fields, type patterns, missing configs).
# ============================================================================

# VIOLATION 1: Unbounded max_connections (Option<usize> instead of required)
[[extractors.declarative]]
name = "dbpool_max_connections_optional"
description = "Detects Option<usize> for max_connections (should be required field)"
languages = ["rust"]
pattern = 'pub\s+max_connections:\s+Option<(?:usize|u64|u32)>'

[extractors.declarative.claim]
subject = "dbpool/max_connections"
predicate = "is_option"
value = { boolean = true }

confidence = 0.92
source = "dogfood"

# VIOLATION 2: Plaintext password in connection string
# (Built-in hardcoded_secrets extractor may catch this - keep as backup)
[[extractors.declarative]]
name = "dbpool_plaintext_password"
description = "Detects plaintext passwords in connection strings"
languages = ["rust"]
pattern = 'postgres://[^:]+:([^@]+)@'  # Matches user:password@host

[extractors.declarative.claim]
subject = "dbpool/connection_string/password"
predicate = "plaintext"
value = { boolean = true }

confidence = 0.85
source = "dogfood"

# VIOLATION 3: Missing max_lifetime (Option<Duration> instead of required)
[[extractors.declarative]]
name = "dbpool_max_lifetime_optional"
description = "Detects Option<Duration> for max_lifetime (should be required)"
languages = ["rust"]
pattern = 'pub\s+max_lifetime:\s+Option<Duration>'

[extractors.declarative.claim]
subject = "dbpool/max_lifetime"
predicate = "is_option"
value = { boolean = true }

confidence = 0.90
source = "dogfood"

# VIOLATION 4: Excessive connection_timeout (60s exceeds 30s max)
[[extractors.declarative]]
name = "dbpool_excessive_timeout"
description = "Detects connection_timeout > 30 seconds"
languages = ["rust"]
# Matches 31-39, 40-99, and any value with three or more digits (i.e. > 30)
pattern = 'connection_timeout.*Duration::from_secs\((3[1-9]|[4-9][0-9]|[1-9][0-9]{2,})\)'

[extractors.declarative.claim]
subject = "dbpool/connection_timeout"
predicate = "exceeds_max"
value = { boolean = true }

confidence = 0.88
source = "dogfood"

# VIOLATION 5: Zero min_connections (should be >= 2)
[[extractors.declarative]]
name = "dbpool_min_connections_zero"
description = "Detects min_connections set to 0 (should be >= 2)"
languages = ["rust"]
pattern = 'min_connections:\s*(?:usize|u64|u32)\s*=\s*0'

[extractors.declarative.claim]
subject = "dbpool/min_connections"
predicate = "below_minimum"
value = { boolean = true }

confidence = 0.85
source = "dogfood"

# VIOLATION 6: No connection validation before checkout
[[extractors.declarative]]
name = "dbpool_missing_validation"
description = "Detects missing is_valid() call in get() method"
languages = ["rust"]
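# NOTE: relies on negative lookahead (?!...) and on `.` crossing line breaks. If Aphoria
# compiles patterns with the Rust regex crate, look-around is unsupported and `.` stops
# at newlines unless (?s) is set, so confirm this pattern actually matches.
pattern = 'pub\s+async\s+fn\s+get\(&self\).*?\{(?:(?!is_valid).)*?\}'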

[extractors.declarative.claim]
subject = "dbpool/validation/frequency"
predicate = "missing"
value = { boolean = true }

confidence = 0.75  # Lower confidence - pattern is complex
source = "dogfood"

# VIOLATION 7: No metrics field in ConnectionPool struct
[[extractors.declarative]]
name = "dbpool_missing_metrics"
description = "Detects ConnectionPool struct without metrics field"
languages = ["rust"]
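# NOTE: relies on negative lookahead (?!...) and on `.` crossing line breaks. If Aphoria
# compiles patterns with the Rust regex crate, look-around is unsupported and `.` stops
# at newlines unless (?s) is set, so confirm this pattern actually matches.
pattern = 'pub\s+struct\s+ConnectionPool\s*\{(?:(?!metrics).)*?\}'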

[extractors.declarative.claim]
subject = "dbpool/metrics/exposed"
predicate = "missing"
value = { boolean = true }

confidence = 0.70  # Lower confidence - detects absence, which is harder
source = "dogfood"
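
The Violation 4 extractor encodes a numeric comparison (timeout above 30 seconds) as a regex alternation, which is easy to get wrong at the boundaries. A std-only Rust sketch of the intended check can serve as a reference list of which values should and should not be flagged. The helper names from_secs_arg and exceeds_timeout are hypothetical and this is not how Aphoria evaluates patterns; unlike the regex, it also does not require connection_timeout to appear before Duration::from_secs on the line.

```rust
/// Extract N from the first `Duration::from_secs(N)` on a line, if any.
fn from_secs_arg(line: &str) -> Option<u64> {
    let marker = "Duration::from_secs(";
    let start = line.find(marker)? + marker.len();
    let rest = &line[start..];
    let end = rest.find(')')?;
    // Allow Rust digit separators such as 1_000.
    rest[..end].trim().replace('_', "").parse().ok()
}

/// True when a line mentions connection_timeout and sets it above `max_secs`.
fn exceeds_timeout(line: &str, max_secs: u64) -> bool {
    line.contains("connection_timeout")
        && from_secs_arg(line).map_or(false, |secs| secs > max_secs)
}

fn main() {
    // Boundary values on both sides of the 30-second limit.
    for secs in [5, 30, 31, 45, 60, 120] {
        let line = format!("connection_timeout: Duration::from_secs({}),", secs);
        println!("{:>3}s -> exceeds 30s max: {}", secs, exceeds_timeout(&line, 30));
    }
}
```

Running the same sample lines through `grep -P` with the extractor pattern and comparing the results against this output is a quick way to spot a range the alternation misses.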

Complete Configuration Example

Your full .aphoria/config.toml should look like:

[project]
name = "dbpool"
version = "0.1.0"

[scan]
include = ["src/**/*.rs"]
exclude = ["tests/**/*.rs", "target/**"]

[episteme]
mode = "persistent"
corpus_db = "/home/jml/.aphoria/corpus-db"

[corpus]
aggregation_enabled = true
include_rfc = true
include_owasp = true
include_vendor = true
use_community = true
cache_dir = "/home/jml/.aphoria/cache"

# DON'T USE enabled = [...] - let all built-in extractors run
[extractors.inline_markers]
enabled = true
sync_to_pending = true

[thresholds]
block_threshold = 0.7
flag_threshold = 0.5

# Add all 7 declarative extractors here
[[extractors.declarative]]
name = "dbpool_max_connections_optional"
# ... (see complete set above)

Testing and Verification

Step 1: Add Extractors

# Edit .aphoria/config.toml
# Add all 7 declarative extractors from above

Step 2: Run Scan

aphoria scan --format json | tee scan-results-v1.json

Step 3: Verify Observations

# Check observations extracted
jq '.summary.observations_extracted' scan-results-v1.json
# Expected: 7 (one per extractor)

# List observations
jq '.observations[] | {subject, predicate, value, file, line}' scan-results-v1.json

Step 4: Verify Conflicts

# Check conflicts detected
jq '.summary.authority_conflicts' scan-results-v1.json
# Expected: 7-8

# List conflicts with verdicts
jq '.conflicts[] | {file, line, verdict, explanation}' scan-results-v1.json

Step 5: Expected Output

Scan summary:

{
  "summary": {
    "observations_extracted": 7,
    "observations_recorded": 7,
    "authority_conflicts": 7,
    "blocks": 3,
    "flags": 3,
    "passes": 1,
    "files_scanned": 7
  }
}

Sample conflict:

{
  "file": "src/config.rs",
  "line": 25,
  "verdict": "BLOCK",
  "explanation": "max_connections is Option<usize>, violates required claim (HikariCP: Tier 2, confidence: 0.92)",
  "claim": {
    "subject": "dbpool/max_connections",
    "predicate": "required",
    "value": true
  },
  "observation": {
    "subject": "dbpool/max_connections",
    "predicate": "is_option",
    "value": true
  }
}

Troubleshooting

Issue 1: Extractor Pattern Doesn't Match

Symptom:

jq '.summary.observations_extracted' scan-results.json
0  # ← Should be 7

Diagnosis:

# Test pattern with grep
grep -P 'pub\s+max_connections:\s+Option<' src/config.rs

Solutions:

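  • Verify pattern syntax against the regex engine Aphoria uses: grep -P is PCRE, while the Rust regex crate (if that is what compiles the pattern) rejects look-around such as (?!...)
  • Check for whitespace differences (use \s+ rather than a single literal space)
  • Escape only regex metacharacters: < is a literal character, so write Option<, not Option\<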

Issue 2: Subject Path Doesn't Match Corpus

Symptom:

jq '.summary.observations_extracted' scan-results.json
7  # ← Extractors ran

jq '.summary.authority_conflicts' scan-results.json
0  # ← No conflicts (tail path mismatch)

Diagnosis:

# Check extractor subjects
jq '.observations[] | .subject' scan-results.json
# Example: "code://rust/dbpool/config/max_connections"

# Check corpus subjects
curl 'http://localhost:18180/v1/aphoria/corpus' | \
  jq '.items[] | select(.subject | contains("dbpool")) | .subject'
# Example: "vendor://dbpool/max_connections"

Solution:

  • Ensure the tail paths match: config/max_connections does not match dbpool/max_connections
  • Extractor subject should be: dbpool/max_connections (NOT dbpool/config/max_connections)
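
The mismatch can be reproduced with a small std-only Rust sketch of two-segment tail matching. This is an assumption about the matching rule based on the "last 2 segments" description in this guide, not Aphoria's actual code; tail_path and tails_match are hypothetical names.

```rust
/// Return the last `n` non-empty path segments of a subject, joined by '/'.
/// Filtering empty segments also drops the gap left by a "scheme://" prefix.
fn tail_path(subject: &str, n: usize) -> String {
    let segments: Vec<&str> = subject.split('/').filter(|s| !s.is_empty()).collect();
    let start = segments.len().saturating_sub(n);
    segments[start..].join("/")
}

/// Two subjects align when their two-segment tail paths are equal (assumed rule).
fn tails_match(observation: &str, claim: &str) -> bool {
    tail_path(observation, 2) == tail_path(claim, 2)
}

fn main() {
    let claim = "vendor://dbpool/max_connections";
    for subject in ["dbpool/max_connections", "dbpool/config/max_connections"] {
        println!(
            "{} -> tail {} (match: {})",
            subject,
            tail_path(subject, 2),
            tails_match(subject, claim)
        );
    }
}
```

Under this rule the extra config segment pushes dbpool out of the tail, which is why the shorter extractor subject is the one that aligns with the corpus claim.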

Issue 3: Low Confidence Conflicts

Symptom:

jq '.conflicts[] | select(.verdict == "PASS") | {subject, confidence}' scan-results.json
{
  "subject": "dbpool/metrics/exposed",
  "confidence": 0.70
}
# ← Should be BLOCK or FLAG, but confidence too low

Solution:

  • Increase extractor confidence: confidence = 0.95
  • Or lower the threshold in [thresholds] (flag_threshold, set to 0.5 in the example config above)

Issue 4: False Positives

Symptom:

# Extractor matches test files or generated code
grep -r "Option<usize>" tests/
tests/mock_config.rs:    pub max_connections: Option<usize>,

Solution: Add file path filtering to scan config:

[scan]
exclude = ["tests/**/*.rs", "target/**", "benches/**"]

Advanced: When to Build Programmatic Extractors

Limitations of Declarative Extractors

Declarative extractors (regex-based) cannot detect:

  1. Missing fields (absence requires semantic analysis)

    // How do you regex-detect that `max_lifetime` field is MISSING?
    pub struct PoolConfig {
        pub max_connections: usize,
        // ← max_lifetime should be here but isn't
    }
    
  2. Type mismatches (requires type system understanding)

    // How do you regex-detect that String should be SecretString?
    pub connection_string: String,  // ← Should be SecretString
    
  3. Control flow patterns (requires AST traversal)

    // How do you regex-detect that is_valid() is never called?
    pub async fn get(&self) -> Result<Connection> {
        self.connections.pop()  // ← Missing validation call
    }
    

When You Need Programmatic Extractors

If you need to detect:

  • Missing struct fields
  • Type system violations
  • Control flow patterns (missing validation calls)
  • Complex semantic patterns

You need programmatic extractors (Rust code using AST parsing).

Guide: See docs/PROGRAMMATIC-EXTRACTOR-GUIDE.md (TODO: create this)

Estimated effort: 1-2 days per extractor
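
To give a feel for what absence detection involves, here is a minimal std-only Rust sketch that checks whether a struct declares a given field. It is a brace-counting text scan, not the AST-based approach recommended above, and it is not part of Aphoria: struct_body and has_field are hypothetical helpers, only the first occurrence of the struct name is inspected, and braces inside comments or string literals would confuse it. A real programmatic extractor would more likely use a Rust parser such as the syn crate.

```rust
/// Return the text between the braces of `struct <name> { ... }`, if found.
fn struct_body<'a>(source: &'a str, name: &str) -> Option<&'a str> {
    let header = format!("struct {}", name);
    let after_header = source.find(&header)? + header.len();
    let rest = &source[after_header..];
    let open = rest.find('{')?;
    // Require `{` right after the name so `PoolConfigBuilder` is not taken for `PoolConfig`.
    if !rest[..open].trim().is_empty() {
        return None;
    }
    let mut depth = 0usize;
    for (i, ch) in rest.char_indices().skip_while(|(i, _)| *i < open) {
        match ch {
            '{' => depth += 1,
            '}' => {
                depth -= 1;
                if depth == 0 {
                    return Some(&rest[open + 1..i]);
                }
            }
            _ => {}
        }
    }
    None
}

/// True if some non-comment line in the body declares `field:` (optionally `pub`).
fn has_field(body: &str, field: &str) -> bool {
    body.lines().any(|line| {
        let decl = line.trim().trim_start_matches("pub ").trim_start();
        !decl.starts_with("//")
            && decl
                .strip_prefix(field)
                .map_or(false, |rest| rest.trim_start().starts_with(':'))
    })
}

fn main() {
    let source = r#"
pub struct PoolConfig {
    pub max_connections: usize,
    // max_lifetime should be here but isn't
}
"#;
    match struct_body(source, "PoolConfig") {
        Some(body) => {
            for field in ["max_connections", "max_lifetime"] {
                let status = if has_field(body, field) { "present" } else { "MISSING" };
                println!("{}: {}", field, status);
            }
        }
        None => println!("struct PoolConfig not found"),
    }
}
```

Even this simplified version needs to track nesting and skip comments, which is the kind of state a single regex cannot carry; that is the practical reason missing-field checks belong in programmatic extractors.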


Summary

What You Learned

  1. Extractor pipeline: Extractors → Observations → Comparison → Conflicts
  2. Built-in coverage: Security patterns (TLS, secrets, injection) but NOT struct validation
  3. Declarative extractors: Regex-based pattern matching for quick custom detection
  4. Tail-path matching: Last 2 segments must match between observation and corpus claim

What You Built

  • 7 declarative extractors for dbpool violations
  • Complete .aphoria/config.toml with custom extractors
  • Scan now detects all 7 violations
  • Dogfood demonstration complete

Next Steps

  1. Document findings: Add to docs/SUCCESS-STORY.md
  2. Evaluate quality: Check detection accuracy (7/7 = 100%)
  3. Iterate: Adjust confidence scores if needed
  4. Share: Contribute declarative extractors back to Aphoria examples

Appendix: Quick Reference

Declarative Extractor Template

[[extractors.declarative]]
name = "project_pattern_name"
description = "What this detects and why"
languages = ["rust"]
pattern = 'regex_pattern'

[extractors.declarative.claim]
subject = "project/component/property"
predicate = "predicate_name"
value = { boolean = true }  # or string/number

confidence = 0.90
source = "dogfood"

Testing Commands

# Run scan
aphoria scan --format json > scan.json

# Check observations
jq '.summary.observations_extracted' scan.json

# Check conflicts
jq '.summary.authority_conflicts' scan.json

# List violations
jq '.conflicts[] | {file, line, verdict}' scan.json

Debugging Commands

# Test regex pattern
grep -P 'your_pattern_here' src/file.rs

# Check tail paths
jq '.observations[] | .subject' scan.json

# Compare to corpus
curl '.../corpus' | jq '.items[] | .subject'

Created: 2026-02-09 (during dbpool dogfood)
Status: Production-ready
Maintainer: Aphoria dogfood team