# Claim Extraction Walkthrough

## Purpose

This document teaches you how to extract claims from prose documentation. You'll see a complete example: taking a paragraph from HikariCP's wiki and producing 3 structured claims with full reasoning.

By the end, you'll have a decision framework for identifying what deserves to be a claim vs. what's just background information.

---

## Source Material

From **HikariCP Wiki: "About Pool Sizing"** page:

> "You want a small pool, saturated with threads waiting for connections. As a general guideline, the pool should be somewhere around `((core_count * 2) + effective_spindle_count)`. A formula which has held up pretty well across a lot of benchmarks for years is that for optimal throughput the number of active connections should be somewhere near `((core_count * 2) + effective_spindle_count)`. A 4-core i7 with one hard disk should have a pool of around 9-10 connections."

---

## Extraction Process

### Step 1: Identify Claimable Statements

Read through and highlight statements that are:

- ✅ **Prescriptive** - tells you what MUST/SHOULD do
- ✅ **Have consequences** - explains why or what breaks if violated
- ✅ **Verifiable in code** - you can write an extractor to check it
- ❌ **Skip descriptive prose** - background, history, general opinions

**What we identified:**

1. ✅ "pool should be somewhere around `((core_count * 2) + effective_spindle_count)`" → Formula for sizing
2. ✅ "A 4-core i7 with one hard disk should have a pool of around 9-10 connections" → Concrete example
3. ✅ "You want a small pool" (implicit: NOT unbounded) → Pool must be bounded

### Step 2: Extract First Claim (The Formula)

**Raw statement:** "pool should be somewhere around `((core_count * 2) + effective_spindle_count)`"

**Reasoning:**

- This is a **FORMULA**, not a specific value
- It's prescriptive ("should be")
- Has a clear mathematical relationship
- **Consequence:** Deviating causes poor throughput
- **Verifiable:** Can check if code uses this formula or a constant

**Extracted Claim:**

```bash
aphoria corpus create \
  --subject "dbpool/max_connections/formula" \
  --predicate "recommended_formula" \
  --value "((core_count * 2) + effective_spindle_count)" \
  --explanation "Pool size SHOULD follow HikariCP formula: ((core_count * 2) + effective_spindle_count). This formula balances CPU availability with I/O blocking opportunities. If pool is too large, context-switching overhead degrades throughput. If too small, threads starve waiting for connections." \
  --authority "HikariCP Wiki: About Pool Sizing" \
  --category "performance" \
  --tier 2
```

**Why these choices:**

| Field | Value | Reasoning |
|-------|-------|-----------|
| `subject` | `dbpool/max_connections/formula` | Specific enough to be useful, not too generic |
| `predicate` | `recommended_formula` | Captures that it's a calculation, not a constant |
| `value` | `"((core_count * 2) + effective_spindle_count)"` | Exact formula as a string (not evaluated) |
| `explanation` | Full WHAT + WHY + CONSEQUENCE | Includes context for future maintainers |
| `authority` | `"HikariCP Wiki: About Pool Sizing"` | Specific page, not just "HikariCP" |
| `tier` | `2` | Vendor best practice (not regulatory/spec) |
| `category` | `performance` | Not safety/security, but performance guidance |

---

### Step 3: Extract Second Claim (Concrete Example)

**Raw statement:** "A 4-core i7 with one hard disk should have a pool of around 9-10 connections"

**Reasoning:**

- This is a **SPECIFIC EXAMPLE** of the formula
- Validates the formula: `(4*2)+1 = 9` ✓
- Provides a concrete **development default**
- More verifiable than abstract formula (can check if default is ~10)

**Extracted Claim:**

```bash
aphoria corpus create \
  --subject "dbpool/max_connections/development" \
  --predicate "default_value" \
  --value "10" \
  --explanation "Development pool size SHOULD default to 10 connections. This matches HikariCP recommendation for typical dev hardware (4-core + 1 disk). Formula: (4 cores × 2) + 1 spindle = 9, rounded to 10. If unbounded or excessively large in dev, it masks production sizing issues during testing." \
  --authority "HikariCP Wiki: About Pool Sizing" \
  --category "performance" \
  --tier 2
```

**Why these choices:**

| Field | Value | Reasoning |
|-------|-------|-----------|
| `subject` | `dbpool/max_connections/development` | Distinguishes this from production sizing |
| `predicate` | `default_value` | This is a concrete constant, not a formula |
| `value` | `"10"` | Specific number from the recommendation |
| `explanation` | Links back to formula + consequence | Shows how 10 was derived, what breaks if wrong |
| `consequence` (in explanation) | "masks production sizing issues" | Real problem: dev diverges from prod |

---

### Step 4: Extract Third Claim (Implicit Requirement)

**Raw statement:** "You want a small pool" (implies bounded, not infinite)

**Reasoning:**

- This is **IMPLICIT but CRITICAL**: pool MUST be bounded
- Opposite of what naive developers might do: `Option = None` (unbounded)
- Has **severe consequence**: unbounded growth exhausts DB connections
- This is actually a **safety** claim, not just performance

**Extracted Claim:**

```bash
aphoria corpus create \
  --subject "dbpool/max_connections" \
  --predicate "required" \
  --value "true" \
  --explanation "Pool max_connections MUST be explicitly configured. HikariCP emphasizes small, bounded pools. If unbounded (None/null), pool grows without limit under load, exhausting database max_connections and causing cascading failures across all clients. This is a safety requirement, not just performance." \
  --authority "HikariCP Wiki: About Pool Sizing" \
  --category "safety" \
  --tier 2
```

**Why these choices:**

| Field | Value | Reasoning |
|-------|-------|-----------|
| `subject` | `dbpool/max_connections` | The field itself, not a subpath |
| `predicate` | `required` | Boolean: this field MUST exist |
| `value` | `"true"` | The requirement is active |
| `category` | `safety` | This prevents outages, not just perf issues |
| `explanation` | Emphasizes MUST + severe consequence | Cascading failures = safety issue |

---

## Decision Framework

Use this table when deciding if something deserves to be a claim:

| Question | If YES | If NO |
|----------|--------|-------|
| Is it prescriptive (MUST/SHOULD)? | ✅ Candidate | ❌ Skip (just background) |
| Can you verify it in code? | ✅ Candidate | ❌ Skip (too abstract) |
| Does it have consequences? | ✅ Strong candidate | ⚠️ Weak claim (why care?) |
| Is it specific to this domain? | ✅ Good claim | ⚠️ Too generic (avoid noise) |
| Would violating it cause a real incident? | ✅ HIGH TIER | ⚠️ LOW TIER (style guide) |

---

## Anti-Patterns (What NOT to Extract)

### ❌ Too Generic

```bash
# BAD: "Code should be maintainable"
# This is vague advice, not a verifiable claim
# Aphoria can't check "maintainability"
```

### ❌ No Consequence

```bash
# BAD: "Use camelCase for variable names"
# This is a style guide, not a safety/security claim
# No one gets paged if you use snake_case
```

### ❌ Not Verifiable

```bash
# BAD: "Algorithm should be fast"
# "Fast" is subjective, can't write an extractor
# Need concrete thresholds: "p95 latency < 100ms"
```

### ❌ Background Information

```bash
# BAD: "HikariCP was created in 2013"
# Interesting history, but not a claim about code
# Skip descriptive prose, focus on requirements
```

---

## Good Claim Examples

✅ **Numeric Thresholds:**

```bash
--predicate "maximum"
--value "100"
--comparison "equals"
--explanation "Connection pool size MUST NOT exceed 100..."
```

✅ **Required Fields:**

```bash
--predicate "required"
--value "true"
--comparison "equals"
--explanation "max_lifetime MUST be set to prevent connection leaks..."
```

✅ **Forbidden Patterns:**

```bash
--predicate "forbidden_pattern"
--value "plaintext_password"
--comparison "present"
--explanation "Passwords MUST NOT be stored in plaintext. Use environment variables..."
```

✅ **Configuration Relationships:**

```bash
--predicate "minimum"
--value "2"
--comparison "equals"
--explanation "min_idle MUST be at least 2 to handle failover..."
```

---

## What You've Learned

After this walkthrough, you should be able to:

1. ✅ Read technical documentation and identify claimable statements
2. ✅ Distinguish prescriptive requirements from descriptive background
3. ✅ Structure claims with proper subject/predicate/value
4. ✅ Write explanations that include WHAT + WHY + CONSEQUENCE
5. ✅ Choose appropriate authority tiers and categories
6. ✅ Avoid extracting noise (generic advice, style guides)

---

## Next Steps

Now apply this process to your own domain:

1. **Find authoritative docs** - wikis, RFCs, vendor best practices
2. **Extract 3-5 claims** - start small, focus on high-impact rules
3. **Add to corpus** - use `aphoria corpus create` for each claim
4. **Scan your code** - see what violations Aphoria finds
5. **Iterate** - refine claims based on false positives/negatives

Remember: **Claims are products, not byproducts.** Invest time in writing clear explanations with consequences. Future maintainers (including yourself) will thank you.
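
When iterating on numeric claims, it can help to re-derive the number from the quoted source before trusting it. The sketch below evaluates the sizing guideline from the Source Material, `((core_count * 2) + effective_spindle_count)`, for given hardware. The `pool_size` function is a hypothetical helper written for this walkthrough; it is not part of HikariCP or Aphoria.

```shell
#!/usr/bin/env bash
# Hypothetical helper (an assumption for illustration, not a HikariCP or Aphoria command):
# evaluates ((core_count * 2) + effective_spindle_count) using integer arithmetic.
pool_size() {
  local core_count="$1"
  local effective_spindle_count="$2"
  echo $(( (core_count * 2) + effective_spindle_count ))
}

# The wiki's example hardware: 4 cores, one hard disk.
pool_size 4 1
```

For the 4-core, one-disk example this evaluates to 9, which the walkthrough rounds to 10 for the development default.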