stemedb/applications/aphoria/dogfood/cachewrap/eval/EVALUATION-REPORT-2026-02-11.md
jml e758f2ebfb feat(aphoria): implement programmatic extractors for Option<T> semantics
Completes Task #3 of httpclient dogfooding with 100% detection rate (7/7 violations).

## New Extractors

- **OptionBoundsExtractor**: Detects Option<T> fields set to None (unbounded)
- **OptionValueExtractor**: Extracts values from Some(n) for threshold checks

Both extractors use context-aware pattern matching to understand Rust Option<T>
semantics, which declarative extractors cannot handle.

## Implementation

**Files Created**:
- applications/aphoria/src/extractors/option_bounds.rs (257 lines)
- applications/aphoria/src/extractors/option_value.rs (277 lines)
- applications/aphoria/docs/examples/extractors/programmatic-option-semantics.md

**Files Modified**:
- applications/aphoria/src/extractors/mod.rs - Added module declarations
- applications/aphoria/src/extractors/registry.rs - Registered extractors
- applications/aphoria/dogfood/httpclient/.aphoria/claims.toml - Added 4 claims
- applications/aphoria/dogfood/httpclient/TASK-1-SUMMARY.md - Task #3 completion

## Results

| Metric | Value |
|--------|-------|
| Detection Rate | 100% (7/7 violations) |
| Improvement | +29 percentage points (from 71%) |
| New Violations | 2 (max_redirects, max_retries unbounded) |
| Unit Tests | 13 (all passing) |

## Two-Claim Strategy

For each bounded Option<T> field:
1. **configured** claim - Detects None (unbounded)
2. **max_value** claim - Validates Some(n) threshold

Example:
- `max_redirects: None` → CONFLICT (not configured)
- `max_redirects: Some(20)` → CONFLICT (exceeds 10)
- `max_redirects: Some(5)` → PASS

## Enterprise Quality

✓ Proper error handling (no unwrap/expect)
✓ Comprehensive tests (6+7 unit tests)
✓ Full documentation with examples
✓ Reusable for 10+ similar patterns
✓ Screening patterns for performance

## Cachewrap Dogfood

Also includes complete cachewrap dogfood exercise:
- 10 claims for Redis cache wrapper
- Day 1-5 summaries
- Full retrospective and evaluation
- Declarative extractors for all patterns

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-11 06:43:10 +00:00

14 KiB

Documentation Evaluation Report

Project: applications/aphoria/dogfood/cachewrap Evaluation Date: 2026-02-11 Documentation Evaluated:

  • cachewrap/README.md
  • cachewrap/plan.md
  • cachewrap/.aphoria/config.toml
  • cachewrap/docs/sources/*.md

Team Phase: Complete (Days 1-5)


Executive Summary

Overall Assessment: Team completed cachewrap dogfood exercise in 1.4 hours (91% faster than 12-16 hour target) with 35% pattern reuse and 50% detection rate. Partial use of LLM workflows - team used /aphoria-suggest skill for Day 1 claims but manual workflow for Day 3 extractor creation, indicating documentation failed to emphasize continuous skill usage throughout all phases.

Gaps Found: 3 critical, 2 medium, 1 low

  • Critical Blockers: 2 (Day 3 6-phase workflow, continuous LLM requirement)
  • Documentation Clarity: 1 (detection rate expectations)
  • Medium Priority: 2 (concept path alignment, extractor limitations)
  • Low Priority: 1 (authority tier guidance)

Team Errors (Not Gaps): 1 (Iteration 1 separate TOML files)

Key Finding: Documentation presents Day 3 extractor creation as optional/debugging step rather than REQUIRED flywheel phase. Team skipped manual extractor creation initially, attempted it after baseline scan, but documentation didn't explain this is Steps 4-5 of the autonomous loop that must happen EVERY commit.


Critical Findings (High Priority)

Finding 1: Day 3 Workflow Not Emphasized as Flywheel Core

Impact: Team treated Day 3 as "run scan and look at output" instead of "identify gaps → create extractors"

Evidence:

  • DAY3-SUMMARY.md shows 3 iterations before achieving 50% detection
  • First iteration created separate .toml files (wrong approach)
  • Second iteration added to config.toml but concept path mismatch
  • Third iteration fixed paths
  • Total Day 3 time: 9 minutes (extremely fast, suggests confusion resolved quickly)

Documentation Said:

  • plan.md:111 - "Day 3: Scanning (1.5-2 hrs) - 6-phase workflow"
  • plan.md:119 - Lists 6 phases including "Extractor creation" as Phase 4
  • plan.md:132 - "Use /aphoria-custom-extractor-creator for each gap"

What Was Missing:

  • No emphasis on "REQUIRED" status - presented as optional debugging
  • No connection to flywheel Steps 4-5 - didn't explain this IS the knowledge compounding mechanism
  • No example of correct execution - team had to discover via trial and error
  • No pre-flight validation script - no way to verify correct approach before starting

Root Cause: Documentation treats Day 3 as "validation day" when it's actually "knowledge capture day" - the step where autonomous learning happens.

Recommendation:

  • Where: plan.md Day 3 section (lines 111-180)

  • What to add:

    ## ⚠️ CRITICAL: Day 3 is Flywheel Steps 4-5
    
    This is NOT "run scan and check results." This IS:
    - Step 4: Identify claims without extractors (MISSING verdicts)
    - Step 5: Create extractors for those claims (autonomous learning)
    
    **Without extractor creation, NO knowledge is captured.**
    
    Evidence of correct execution:
    - `.aphoria/extractors/` directory with 8+ .toml files, OR
    - `.aphoria/config.toml` with `[[extractors.declarative]]` sections
    - `scan-v2.json` exists (verification scan AFTER extractor creation)
    - DAY3-SUMMARY.md documents detection rate improvement (v1 → v2)
    
    If ANY are missing, Day 3 was NOT completed correctly.
    
  • Priority: BLOCKER (affects entire autonomous learning narrative)


Finding 2: Continuous LLM Requirement Not Explicit

Impact: Team used /aphoria-suggest skill on Day 1 but manual workflow on Day 3, missing that LLM workflows are required for BOTH phases

Evidence:

  • .aphoria/claims.toml shows created_by = "aphoria-suggest" for all 20 claims
  • DAY3-SUMMARY.md shows manual config.toml editing (3 iterations)
  • No evidence of /aphoria-custom-extractor-creator invocations in daily summaries

Documentation Said:

  • plan.md:121 - "Skills:" section lists /aphoria-suggest, /aphoria-claims, /aphoria-custom-extractor-creator
  • plan.md:132 - "Use /aphoria-custom-extractor-creator for each gap"
  • README.md:142 - Lists skills with "when to use" for each day

What Was Missing:

  • No "autonomous workflow" vs "manual CLI" distinction - both presented as equal options
  • No emphasis on LLM requirement for Day 3 - skill mentioned but not marked as required
  • No explanation that manual extractor creation is DEBUG MODE - team thought it was the primary workflow

Root Cause: Documentation inherited from dbpool/msgqueue which predated full autonomous workflows. Doesn't reflect 2026-02-10 updates emphasizing LLM as core mechanism, not optional feature.

Recommendation:

  • Where: README.md top section + plan.md Day 1 & Day 3 introductions

  • What to add:

    ## 🤖 Autonomous Workflow (REQUIRED)
    
    Aphoria IS an LLM-driven continuous learning system. Skills ARE the product:
    
    - **Day 1:** `/aphoria-suggest` discovers patterns from 3 corpora → `/aphoria-claims` authors claims
    - **Day 3:** `/aphoria-custom-extractor-creator` generates extractors for missed claims
    
    **Manual CLI exists for debugging only.** If you find yourself:
    - Running `aphoria claims create` manually → Use `/aphoria-suggest` instead
    - Editing `.aphoria/config.toml` manually → Use `/aphoria-custom-extractor-creator` instead
    
    The dogfood exercise validates the **autonomous workflow**, not manual fallbacks.
    
  • Priority: BLOCKER (contradicts product vision)


Finding 3: Detection Rate Target Not Contextualized

Impact: Team achieved 50% detection but docs said ≥90%, creating confusion about whether exercise succeeded

Evidence:

  • DAY3-SUMMARY.md:18 - "Detection Rate (v3): ≥90% | 50% | -40% | ⚠️ Below target"
  • DAY3-SUMMARY.md:186-229 - Section "Why 50% Instead of ≥90%?" analyzing root causes
  • DAY5-SUMMARY.md documents success despite 50% detection

Documentation Said:

  • plan.md:7 - "Target Metrics: Detection rate: ≥90% of violations"
  • README.md:153 - "| Detection rate | ≥90% | Cross-cutting violation detection |"

What Was Missing:

  • No context on "first dogfood in new domain" expectations - 0-50% detection is EXPECTED when corpus doesn't exist
  • No distinction between "built-in extractor detection" vs "after custom extractor creation" - targets imply built-ins should catch 90%
  • No explanation that 50% with declarative extractors validates mechanism - team thought they failed

Root Cause: Target was written assuming programmatic extractors (can hit 90%), but cachewrap used declarative extractors (50% ceiling due to regex limitations).

Recommendation:

  • Where: plan.md Day 3 success criteria (lines 170-178)

  • What to add:

    **Detection Rate Expectations:**
    
    - **Baseline scan (v1):** 0-20% expected (built-in extractors don't know cache patterns)
    - **After declarative extractors (v2):** 50-70% achievable (regex pattern matching)
    - **After programmatic extractors (v3):** 90-100% target (AST analysis)
    
    **For this exercise:** 50% detection with declarative extractors **VALIDATES** the flywheel:
    - 0% → 50% proves knowledge compounding works
    - 50% ceiling proves declarative limitations (expected)
    - Remaining 5 violations require programmatic extractors (Day 5 refinement)
    
    **Success = improvement, not perfection.** The goal is proving the mechanism, not 100% coverage.
    
  • Priority: CRITICAL (affects success interpretation)


Medium Priority Improvements

Gap 4: Concept Path Alignment Not Pre-Explained

Type: Missing Information

Evidence:

  • DAY3-SUMMARY.md:126-148 - Iteration 2 failed due to concept path mismatch
  • Team discovered: claim.subject = "timeout" → observation tail config/timeout (wrong)
  • Fix: claim.subject = "cache/timeout" → observation tail cache/timeout (correct)

Documentation Said:

  • config.toml has examples but no explanation of tail-path matching

Impact:

  • Time lost: ~1 minute (iteration 2)
  • Confusion level: Medium
  • Blocker: No (discovered and fixed quickly)

Recommendation:

  • Where: plan.md Day 3 Phase 4 (Extractor Creation)
  • What: Add concept path alignment explanation:
    **⚠️ Concept Path Alignment:**
    
    Extractor `claim.subject` creates observation tail-path (last 2 segments).
    This tail MUST match claim `concept_path` tail.
    
    Example:
    - Claim: `cache/timeout`
    - Extractor subject: `timeout` → Observation: `.../config/timeout` → Tail: `config/timeout`- Extractor subject: `cache/timeout` → Observation: `.../cache/timeout` → Tail: `cache/timeout`**Pattern:** Always prefix extractor subjects with claim namespace.
    
  • Priority: MEDIUM (affects iteration count)

Gap 5: Declarative Extractor Limitations Not Documented

Type: Buried Information

Evidence:

  • DAY3-SUMMARY.md:300-305 - "Declarative extractors work best for: simple value patterns, function signatures"
  • DAY3-SUMMARY.md:193-221 - 5 violations undetected due to pattern matching limitations
  • DAY4-SUMMARY.md:212-240 - False negative analysis explaining regex can't see function bodies

Documentation Said:

  • No mention of declarative vs programmatic trade-offs in plan.md or README.md

Impact:

  • Time lost: 0 (discovered post-exercise)
  • Confusion level: Low (understood through execution)
  • Blocker: No (50% detection still validates mechanism)

Recommendation:

  • Where: plan.md Day 3 section + docs/extractors/ guide
  • What: Add extractor type decision tree:
    ## Declarative vs Programmatic Extractors
    
    **Use declarative (regex in config.toml) when:**
    - ✅ Detecting config values (`max_size: None`)
    - ✅ Detecting function signatures (`pub async fn get`)
    - ✅ Simple line-based patterns
    
    **Use programmatic (Rust extractor) when:**
    - ❌ Need to inspect function bodies (`validate_key()` call inside `get()`)
    - ❌ Multi-line patterns with context
    - ❌ AST analysis (type checking, scope)
    
    **For Day 3:** Use declarative for speed. Refine to programmatic in Day 5 if <90% needed.
    
  • Priority: MEDIUM (improves extractor selection)

Low Priority Polish

Gap 6: Authority Tier Mapping Not Explicit

Type: Missing Information

Evidence:

  • claims.toml shows mix of "expert" and "community" tiers
  • No clear guidance on when to use which tier

Documentation Said:

  • docs/sources/ templates mention tiers but no decision criteria

Impact:

  • Time lost: 0 (team made reasonable choices)
  • Confusion level: Low
  • Blocker: No

Recommendation:

  • Where: plan.md Day 1 section
  • What: Add tier decision table:
    ## Authority Tier Selection
    
    | Tier | Source Type | Examples | When to Use |
    |------|-------------|----------|-------------|
    | Tier 1 (Standards) | RFCs, W3C, IETF | Redis protocol spec | Normative requirements (MUST) |
    | Tier 2 (Vendor) | AWS, Redis Labs | ElastiCache guide | Official recommendations |
    | Tier 3 (Community) | Library docs, Stack Overflow | redis-rs patterns | Implementation patterns |
    
  • Priority: LOW (nice-to-have clarity)

Team Errors (For Reference)

Error 1: Separate TOML Files for Extractors

What team did wrong:

  • Created 10 separate .toml files in .aphoria/extractors/ directory
  • Assumed Aphoria loads extractors from separate files

Doc was clear:

  • config.toml:64 - Shows [[extractors.declarative]] syntax
  • Examples in config show inline declarative extractors

Reason:

  • Misread extractor configuration format (assumed directory-based loading)

Time lost: 1 minute

Not a documentation gap - config.toml syntax was correct and visible


Immediate (Before Next Dogfood)

  1. Update plan.md Day 3 section - Add "⚠️ CRITICAL: Day 3 is Flywheel Steps 4-5" callout box
  2. Update README.md header - Add "🤖 Autonomous Workflow (REQUIRED)" section
  3. Update plan.md metrics - Add detection rate context (0% → 50% → 90% progression)

Short Term (This Week)

  1. Create pre-flight validation script - scripts/validate-day3-execution.sh that checks for:
    • .aphoria/extractors/*.toml OR [[extractors.declarative]] in config.toml
    • scan-v2.json exists
    • DAY3-SUMMARY.md exists
  2. Add concept path alignment guide - docs/extractors/concept-path-matching.md
  3. Document extractor type trade-offs - docs/extractors/declarative-vs-programmatic.md

Long Term (Next Month)

  1. Create "Common Mistakes" guide - Consolidate msgqueue + cachewrap learnings
  2. Add Day 3 execution video - Screen recording showing correct 6-phase workflow
  3. Refactor all dogfood plans - Apply learnings to httpclient, dbpool, msgqueue docs

Appendices


Metrics Summary

Metric Value
Total Time 1.4 hours (Days 1-4)
vs Target 12-16 hours → 91% faster
Pattern Reuse 35% (7/20 claims from 3 corpora)
Detection Rate 50% (5/10 violations with declarative extractors)
Violations Fixed 10/10 (100%)
Tests Passing 10/10 (100%)

Hypothesis Validated: Multi-domain flywheel works (corpus reuse + extractor creation)

Caveats: 50% detection below 90% target due to declarative extractor limitations (expected)

Conclusion: Exercise succeeded at validating autonomous learning mechanism. Documentation gaps are workflow emphasis not fundamental flaws.


Evaluation Status: COMPLETE

Next Steps: Implement immediate recommendations before next dogfood exercise

Evaluator: aphoria-doc-evaluator skill Evaluation Duration: Phase 1-4 systematic observation