# Claim Extraction Walkthrough

## Purpose

This document teaches you how to extract claims from prose documentation. You'll see a complete example: taking a paragraph from HikariCP's wiki and producing three structured claims, with the reasoning behind each.

By the end, you'll have a decision framework for identifying what deserves to be a claim vs. what's just background information.

---

## Source Material

From the HikariCP wiki page **"About Pool Sizing"**:

> "You want a small pool, saturated with threads waiting for connections. As a general guideline, the pool should be somewhere around `((core_count * 2) + effective_spindle_count)`. A formula which has held up pretty well across a lot of benchmarks for years is that for optimal throughput the number of active connections should be somewhere near `((core_count * 2) + effective_spindle_count)`. A 4-core i7 with one hard disk should have a pool of around 9-10 connections."

---

## Extraction Process

### Step 1: Identify Claimable Statements

Read through the source and sort each statement against these criteria:

- ✅ **Prescriptive** - tells you what you MUST/SHOULD do
- ✅ **Consequential** - explains why, or what breaks if violated
- ✅ **Verifiable in code** - you can write an extractor to check it
- ❌ **Descriptive prose** - background, history, general opinions (skip these)

**What we identified:**

1. ✅ "pool should be somewhere around `((core_count * 2) + effective_spindle_count)`" → Formula for sizing
2. ✅ "A 4-core i7 with one hard disk should have a pool of around 9-10 connections" → Concrete example
3. ✅ "You want a small pool" (implicit: NOT unbounded) → Pool must be bounded
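
The "prescriptive" criterion is the easiest one to mechanize. As a rough first pass, a keyword filter can surface candidate sentences. The sketch below assumes one sentence per line; `find_prescriptive` is a hypothetical helper, not an Aphoria command, and the keyword list is an illustrative choice.

```bash
# Hypothetical helper (not part of Aphoria): print the lines of stdin
# that contain a prescriptive keyword as a whole word, ignoring case.
find_prescriptive() {
  grep -iwE 'must|should|shall|required'
}

# Three sentences taken from this walkthrough, one per line.
printf '%s\n' \
  'You want a small pool, saturated with threads waiting for connections.' \
  'As a general guideline, the pool should be somewhere around ((core_count * 2) + effective_spindle_count).' \
  'A 4-core i7 with one hard disk should have a pool of around 9-10 connections.' \
  | find_prescriptive
```

Only the two "should" sentences are printed. The implicit claim ("You want a small pool") carries no keyword, so a filter like this can support careful reading but not replace it.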

### Step 2: Extract First Claim (The Formula)

**Raw statement:**
"pool should be somewhere around `((core_count * 2) + effective_spindle_count)`"

**Reasoning:**
- This is a **FORMULA**, not a specific value
- It's prescriptive ("should be")
- Has a clear mathematical relationship
- **Consequence:** A pool much larger or smaller than the formula suggests degrades throughput
- **Verifiable:** Can check if code uses this formula or a constant

**Extracted Claim:**

```bash
aphoria corpus create \
  --subject "dbpool/max_connections/formula" \
  --predicate "recommended_formula" \
  --value "((core_count * 2) + effective_spindle_count)" \
  --explanation "Pool size SHOULD follow HikariCP formula: ((core_count * 2) + effective_spindle_count). This formula balances CPU availability with I/O blocking opportunities. If pool is too large, context-switching overhead degrades throughput. If too small, threads starve waiting for connections." \
  --authority "HikariCP Wiki: About Pool Sizing" \
  --category "performance" \
  --tier 2
```

**Why these choices:**

| Field | Value | Reasoning |
|-------|-------|-----------|
| `subject` | `dbpool/max_connections/formula` | Specific enough to be useful, not too generic |
| `predicate` | `recommended_formula` | Captures that it's a calculation, not a constant |
| `value` | `"((core_count * 2) + effective_spindle_count)"` | Exact formula as a string (not evaluated) |
| `explanation` | Full WHAT + WHY + CONSEQUENCE | Includes context for future maintainers |
| `authority` | `"HikariCP Wiki: About Pool Sizing"` | Specific page, not just "HikariCP" |
| `tier` | `2` | Vendor best practice (not regulatory/spec) |
| `category` | `performance` | Not safety/security, but performance guidance |
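
To make the formula concrete, here is a minimal sketch that evaluates it for given hardware. `recommended_pool_size` is a hypothetical helper written for this walkthrough; it is not something HikariCP or Aphoria provides.

```bash
# Hypothetical helper: evaluate ((core_count * 2) + effective_spindle_count).
#   $1 = core_count
#   $2 = effective_spindle_count
recommended_pool_size() {
  echo $(( ($1 * 2) + $2 ))
}

recommended_pool_size 4 1   # the wiki's 4-core, one-disk example; prints 9
```

Storing the formula as a string in the claim, rather than a number like this, keeps the claim valid across different hardware.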

---

### Step 3: Extract Second Claim (Concrete Example)

**Raw statement:**
"A 4-core i7 with one hard disk should have a pool of around 9-10 connections"

**Reasoning:**
- This is a **SPECIFIC EXAMPLE** of the formula
- Consistent with the formula: `(4 * 2) + 1 = 9` ✓ (the source's "around 9-10" allows rounding up to 10)
- Provides a concrete **development default**
- More verifiable than abstract formula (can check if default is ~10)

**Extracted Claim:**

```bash
aphoria corpus create \
  --subject "dbpool/max_connections/development" \
  --predicate "default_value" \
  --value "10" \
  --explanation "Development pool size SHOULD default to 10 connections. This matches HikariCP recommendation for typical dev hardware (4-core + 1 disk). Formula: (4 cores × 2) + 1 spindle = 9, rounded to 10. If unbounded or excessively large in dev, it masks production sizing issues during testing." \
  --authority "HikariCP Wiki: About Pool Sizing" \
  --category "performance" \
  --tier 2
```

**Why these choices:**

| Field | Value | Reasoning |
|-------|-------|-----------|
| `subject` | `dbpool/max_connections/development` | Distinguishes this from production sizing |
| `predicate` | `default_value` | This is a concrete constant, not a formula |
| `value` | `"10"` | Specific number from the recommendation |
| `explanation` | Links back to formula + consequence | Shows how 10 was derived, what breaks if wrong |
| `consequence` (in explanation) | "masks production sizing issues" | Real problem: dev diverges from prod |
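
The "can check if default is ~10" idea can be sketched as a tolerance check against the formula. `near_recommended` and its default tolerance of 1 are illustrative assumptions, not Aphoria behavior.

```bash
# Hypothetical check: is a configured pool size close to the formula's result?
#   $1 = configured pool size
#   $2 = core_count
#   $3 = effective_spindle_count
#   $4 = allowed deviation (optional, defaults to 1)
near_recommended() {
  local expected=$(( ($2 * 2) + $3 ))
  local diff=$(( $1 - expected ))
  # Take the absolute value of the deviation.
  if [ "$diff" -lt 0 ]; then diff=$(( -diff )); fi
  [ "$diff" -le "${4:-1}" ]
}

near_recommended 10 4 1 && echo "10 is within tolerance of the formula"
near_recommended 50 4 1 || echo "50 deviates from the formula"
```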

---

### Step 4: Extract Third Claim (Implicit Requirement)

**Raw statement:**
"You want a small pool" (implies bounded, not infinite)

**Reasoning:**
- This is **IMPLICIT but CRITICAL**: pool MUST be bounded
- Opposite of what naive developers might do: leave the limit unset, e.g. a `max_connections: Option<usize>` field left as `None` (unbounded)
- Has **severe consequence**: unbounded growth exhausts DB connections
- This is actually a **safety** claim, not just performance

**Extracted Claim:**

```bash
aphoria corpus create \
  --subject "dbpool/max_connections" \
  --predicate "required" \
  --value "true" \
  --explanation "Pool max_connections MUST be explicitly configured. HikariCP emphasizes small, bounded pools. If unbounded (None/null), pool grows without limit under load, exhausting database max_connections and causing cascading failures across all clients. This is a safety requirement, not just performance." \
  --authority "HikariCP Wiki: About Pool Sizing" \
  --category "safety" \
  --tier 2
```

**Why these choices:**

| Field | Value | Reasoning |
|-------|-------|-----------|
| `subject` | `dbpool/max_connections` | The field itself, not a subpath |
| `predicate` | `required` | Boolean: this field MUST exist |
| `value` | `"true"` | The requirement is active |
| `category` | `safety` | This prevents outages, not just perf issues |
| `explanation` | Emphasizes MUST + severe consequence | Cascading failures = safety issue |
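
A boundedness requirement like this one is straightforward to verify. The sketch below assumes a simple `key = value` config format with one setting per line; `has_bounded_pool` is a hypothetical stand-in for a real extractor, not an Aphoria feature.

```bash
# Hypothetical check: succeed only if the config text on stdin sets
# max_connections to a positive integer. A missing key or a non-numeric
# value (e.g. "none") counts as unbounded.
has_bounded_pool() {
  grep -Eq '^[[:space:]]*max_connections[[:space:]]*=[[:space:]]*[1-9][0-9]*[[:space:]]*$'
}

printf 'min_idle = 2\nmax_connections = 10\n' | has_bounded_pool \
  && echo "bounded"
printf 'min_idle = 2\n' | has_bounded_pool \
  || echo "unbounded: max_connections is not set"
```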

---

## Decision Framework

Use this table when deciding if something deserves to be a claim:

| Question | If YES | If NO |
|----------|--------|-------|
| Is it prescriptive (MUST/SHOULD)? | ✅ Candidate | ❌ Skip (just background) |
| Can you verify it in code? | ✅ Candidate | ❌ Skip (too abstract) |
| Does it have consequences? | ✅ Strong candidate | ⚠️ Weak claim (why care?) |
| Is it specific to this domain? | ✅ Good claim | ⚠️ Too generic (avoid noise) |
| Would violating it cause a real incident? | ✅ High priority (likely `safety` category) | ⚠️ Low priority (style guide) |
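
The first three rows of the table can be read as a small triage procedure. The sketch below is one possible encoding; `triage_claim` and its output labels are hypothetical and only mirror the table's wording.

```bash
# Hypothetical triage helper: each argument is "yes" or "no".
#   $1 = is it prescriptive?
#   $2 = can you verify it in code?
#   $3 = does it have consequences?
triage_claim() {
  if [ "$1" != "yes" ] || [ "$2" != "yes" ]; then
    echo "skip"                # background, or too abstract to check
  elif [ "$3" = "yes" ]; then
    echo "strong candidate"
  else
    echo "weak candidate"      # verifiable, but why care?
  fi
}

triage_claim yes yes yes   # e.g. the pool sizing formula
triage_claim yes yes no    # e.g. a naming convention
triage_claim no  no  no    # e.g. "HikariCP was created in 2013"
```

The remaining two rows (domain specificity and incident severity) are judgment calls that refine a candidate rather than gate it, so they are left out of the sketch.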

---

## Anti-Patterns (What NOT to Extract)

### ❌ Too Generic

```bash
# BAD: "Code should be maintainable"
# This is vague advice, not a verifiable claim
# Aphoria can't check "maintainability"
```

### ❌ No Consequence

```bash
# BAD: "Use camelCase for variable names"
# This is a style guide, not a safety/security claim
# No one gets paged if you use snake_case
```

### ❌ Not Verifiable

```bash
# BAD: "Algorithm should be fast"
# "Fast" is subjective, can't write an extractor
# Need concrete thresholds: "p95 latency < 100ms"
```

### ❌ Background Information

```bash
# BAD: "HikariCP was created in 2013"
# Interesting history, but not a claim about code
# Skip descriptive prose, focus on requirements
```

---

## Good Claim Examples

✅ **Numeric Thresholds:**
```bash
--predicate "maximum"
--value "100"
--comparison "equals"
--explanation "Connection pool size MUST NOT exceed 100..."
```

✅ **Required Fields:**
```bash
--predicate "required"
--value "true"
--comparison "equals"
--explanation "max_lifetime MUST be set to prevent connection leaks..."
```

✅ **Forbidden Patterns:**
```bash
--predicate "forbidden_pattern"
--value "plaintext_password"
--comparison "present"
--explanation "Passwords MUST NOT be stored in plaintext. Use environment variables..."
```

✅ **Configuration Relationships:**
```bash
--predicate "minimum"
--value "2"
--comparison "equals"
--explanation "min_idle MUST be at least 2 to handle failover..."
```

---

## What You've Learned

After this walkthrough, you should be able to:

1. ✅ Read technical documentation and identify claimable statements
2. ✅ Distinguish prescriptive requirements from descriptive background
3. ✅ Structure claims with proper subject/predicate/value
4. ✅ Write explanations that include WHAT + WHY + CONSEQUENCE
5. ✅ Choose appropriate authority tiers and categories
6. ✅ Avoid extracting noise (generic advice, style guides)

---

## Next Steps

Now apply this process to your own domain:

1. **Find authoritative docs** - wikis, RFCs, vendor best practices
2. **Extract 3-5 claims** - start small, focus on high-impact rules
3. **Add to corpus** - use `aphoria corpus create` for each claim
4. **Scan your code** - see what violations Aphoria finds
5. **Iterate** - refine claims based on false positives/negatives

Remember: **Claims are products, not byproducts.** Invest time in writing clear explanations with consequences. Future maintainers (including yourself) will thank you.