stemedb/applications/aphoria/docs/architecture/llm-eval-implementation.md

LLM Evaluation Implementation Spec

Status: Implementation Ready
Date: 2026-02-05
Scope: Aphoria Phase 7.8


What We Have

The current LLM extraction pipeline (src/llm/):

src/llm/
├── mod.rs           # Module exports
├── client.rs        # GeminiClient - HTTP client for API
├── extractor.rs     # LlmExtractor - orchestration, budget, filtering
├── prompt.rs        # build_system_prompt() with ontology
├── ontology.rs      # OntologyVocabulary from authority assertions
├── cache.rs         # LlmCache - BLAKE3 content hash caching
├── types.rs         # LlmClaim, LlmClaimsResponse
└── prompts.rs       # DEFAULT_SYSTEM_PROMPT, helpers

Key characteristics:

  • Uses Gemini API (configured via GEMINI_API_KEY)
  • Ontology-aware prompts constrain output to authority vocabulary
  • Caches by BLAKE3(prompt + content + model) (prompt hash included)
  • Token budget tracking (max_tokens_per_scan, max_tokens_per_file)
  • Selective triggering (high-value files only)
  • Temperature 0.1 for consistency
  • Structured decoding via Gemini Response Schema

What We Need

1. Observation Storage (SQLite)

Problem: We can't see what the LLM returned or how claims were scored. JSON files are inefficient for querying.

Solution: SQLite database with retention policies.

Location: ~/.aphoria/eval/observations.db

// src/eval/db.rs

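use std::path::Path;

use chrono::{Duration, Utc};
use rusqlite::{params, Connection};

// `Result` is assumed to be a crate-level alias whose error type converts from
// both `rusqlite::Error` and `serde_json::Error` (for example `anyhow::Result`).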

pub struct EvalDatabase {
    conn: Connection,
}

impl EvalDatabase {
    pub fn open(path: &Path) -> Result<Self> {
        let conn = Connection::open(path)?;
        conn.execute_batch(r#"
            CREATE TABLE IF NOT EXISTS observations (
                id TEXT PRIMARY KEY,
                timestamp TEXT NOT NULL,
                prompt_version TEXT NOT NULL,
                prompt_hash TEXT NOT NULL,
                model TEXT NOT NULL,
                input_hash TEXT NOT NULL,
                file_path TEXT NOT NULL,
                language TEXT NOT NULL,
                content_length INTEGER NOT NULL,
                raw_response TEXT NOT NULL,
                parsed_claims TEXT NOT NULL,  -- JSON
                final_claims TEXT NOT NULL,   -- JSON
                input_tokens INTEGER NOT NULL,
                output_tokens INTEGER NOT NULL,
                parse_success INTEGER NOT NULL,
                parse_error TEXT,
                cache_hit INTEGER NOT NULL,
                latency_ms INTEGER NOT NULL
            );
            CREATE INDEX IF NOT EXISTS idx_obs_timestamp ON observations(timestamp);
            CREATE INDEX IF NOT EXISTS idx_obs_prompt_hash ON observations(prompt_hash);
        "#)?;
        Ok(Self { conn })
    }

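    /// Enforce retention: delete rows older than 30 days unless they are among
    /// the 1000 most recent, so the last 1000 observations or the last 30 days
    /// are kept, whichever is larger. Returns the number of rows deleted.
    pub fn enforce_retention(&self) -> Result<usize> {
        let cutoff = Utc::now() - Duration::days(30);
        let deleted = self.conn.execute(
            "DELETE FROM observations
             WHERE timestamp < ?1
             AND id NOT IN (SELECT id FROM observations ORDER BY timestamp DESC LIMIT 1000)",
            params![cutoff.to_rfc3339()],
        )?;
        Ok(deleted)
    }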

    pub fn insert(&self, obs: &Observation) -> Result<()> {
        self.conn.execute(
            r#"INSERT INTO observations (
                id, timestamp, prompt_version, prompt_hash, model, input_hash,
                file_path, language, content_length, raw_response, parsed_claims,
                final_claims, input_tokens, output_tokens, parse_success,
                parse_error, cache_hit, latency_ms
            ) VALUES (?1, ?2, ?3, ?4, ?5, ?6, ?7, ?8, ?9, ?10, ?11, ?12, ?13, ?14, ?15, ?16, ?17, ?18)"#,
            params![
                obs.id.to_string(),
                obs.timestamp.to_rfc3339(),
                obs.prompt_version,
                obs.prompt_hash,
                obs.model,
                obs.input_hash,
                obs.file_path,
                obs.language,
                obs.content_length,
                obs.raw_response,
                serde_json::to_string(&obs.parsed_claims)?,
                serde_json::to_string(&obs.final_claims)?,
                obs.input_tokens,
                obs.output_tokens,
                obs.parse_success,
                obs.parse_error,
                obs.cache_hit,
                obs.latency_ms,
            ],
        )?;
        Ok(())
    }
}
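
The retention rule is easy to get backwards, so it can help to check it outside SQLite first. The sketch below is a dependency-free model of the `DELETE` statement above, not the database code itself: `Row`, `age_days`, and `surviving_ids` are hypothetical names, and row age in days stands in for the RFC 3339 timestamp comparison.

```rust
use std::collections::HashSet;

/// Hypothetical in-memory stand-in for one row of `observations`.
#[derive(Debug, Clone, PartialEq)]
pub struct Row {
    pub id: u32,
    /// Age of the row in days at the time retention runs.
    pub age_days: u32,
}

/// Returns the ids that survive retention: a row is kept if it is no older
/// than `max_age_days` OR it is among the `keep_latest` most recent rows.
pub fn surviving_ids(rows: &[Row], max_age_days: u32, keep_latest: usize) -> Vec<u32> {
    // Newest first, mirroring `ORDER BY timestamp DESC LIMIT keep_latest`.
    let mut newest_first: Vec<&Row> = rows.iter().collect();
    newest_first.sort_by_key(|r| r.age_days);
    let protected: HashSet<u32> = newest_first
        .iter()
        .take(keep_latest)
        .map(|r| r.id)
        .collect();

    // The SQL deletes rows that are both too old and not protected,
    // so a row survives when either condition fails.
    rows.iter()
        .filter(|r| r.age_days <= max_age_days || protected.contains(&r.id))
        .map(|r| r.id)
        .collect()
}
```

With rows aged 1, 10, 40, and 50 days, a 30-day window and `keep_latest = 3` keeps the first three rows: the 40-day-old row survives only because it is among the three newest.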

Observation struct:

// src/llm/observation.rs

use chrono::{DateTime, Utc};
use serde::{Deserialize, Serialize};
use uuid::Uuid;

/// A logged observation from an LLM extraction.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct Observation {
    /// Unique ID for this observation.
    pub id: Uuid,

    /// When this extraction occurred.
    pub timestamp: DateTime<Utc>,

    /// Prompt version (from PROMPT_VERSION constant).
    pub prompt_version: String,

    /// BLAKE3 hash of the prompt template.
    pub prompt_hash: String,

    /// Model used (e.g., "gemini-2.0-flash").
    pub model: String,

    /// BLAKE3 hash of input content.
    pub input_hash: String,

    /// File path (relative to scan root).
    pub file_path: String,

    /// Language detected.
    pub language: String,

    /// Content length in bytes.
    pub content_length: usize,

    /// Raw LLM response (JSON string).
    pub raw_response: String,

    /// Parsed claims (after confidence filter, before ontology validation).
    pub parsed_claims: Vec<ParsedClaim>,

    /// Final claims (after ontology validation).
    pub final_claims: Vec<FinalClaim>,

    /// Token usage.
    pub input_tokens: usize,
    pub output_tokens: usize,

    /// Whether parsing succeeded.
    pub parse_success: bool,

    /// Parse error if any.
    pub parse_error: Option<String>,

    /// Cache status.
    pub cache_hit: bool,

    /// Latency in milliseconds.
    pub latency_ms: u64,
}

/// A claim as parsed from LLM JSON (before validation).
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ParsedClaim {
    pub subject: String,
    pub predicate: String,
    pub value: serde_json::Value,
    pub confidence: f32,
    pub line: usize,
}

/// A claim after ontology validation.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct FinalClaim {
    pub concept_path: String,
    pub predicate: String,
    pub value: serde_json::Value,
    pub confidence: f32,
    pub matched_ontology: bool,
    pub fuzzy_matched: bool,
}

Integration point: Modify LlmExtractor::extract() to emit observations.


2. Cache Key Includes Prompt Hash

Problem: Cache doesn't invalidate when prompt changes.

Solution: Include prompt hash in cache key.

// src/llm/cache.rs

impl LlmCache {
    fn compute_key(content: &str, model: &str, prompt: &str) -> String {
        let mut hasher = blake3::Hasher::new();
        hasher.update(content.as_bytes());
        hasher.update(model.as_bytes());
        hasher.update(prompt.as_bytes());  // NEW: prompt included
        hasher.finalize().to_hex().to_string()
    }
}
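
One detail worth checking in `compute_key`: successive `update` calls hash the concatenated byte stream, so `("ab", "c")` and `("a", "bc")` would feed identical bytes. The sketch below shows a length-prefixed variant that keeps field boundaries in the hashed input. It is an illustration under assumptions: `std`'s `DefaultHasher` stands in for BLAKE3 only to stay dependency-free, and `write_field` is a hypothetical helper.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

/// Feeds one field into the hasher with a length prefix so that field
/// boundaries are part of the hashed input.
fn write_field(hasher: &mut DefaultHasher, field: &str) {
    hasher.write_u64(field.len() as u64);
    hasher.write(field.as_bytes());
}

/// Stand-in for `LlmCache::compute_key`. `DefaultHasher` replaces BLAKE3 here
/// purely for illustration; it is not a cryptographic hash.
pub fn compute_key(content: &str, model: &str, prompt: &str) -> String {
    let mut hasher = DefaultHasher::new();
    write_field(&mut hasher, content);
    write_field(&mut hasher, model);
    write_field(&mut hasher, prompt);
    format!("{:016x}", hasher.finish())
}
```

The same length-prefix idea carries over to `blake3::Hasher` by calling `update` with the length bytes before each field. Changing only the prompt changes the key, which is the invalidation behavior this section asks for.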

3. Bounded Concurrency

Problem: Sequential execution is slow; unbounded parallelism hits rate limits.

Solution: Tokio Semaphore with configurable concurrency.

// src/eval/harness.rs

use std::sync::Arc;
use tokio::sync::Semaphore;

pub struct EvalHarness {
    extractor: LlmExtractor,
    semaphore: Arc<Semaphore>,
}

impl EvalHarness {
    pub fn new(extractor: LlmExtractor, max_concurrent: usize) -> Self {
        Self {
            extractor,
            semaphore: Arc::new(Semaphore::new(max_concurrent)),
        }
    }

    pub async fn run(&self, fixtures: Vec<Fixture>) -> EvalResult {
        let handles: Vec<_> = fixtures
            .into_iter()
            .map(|fixture| {
                let sem = self.semaphore.clone();
                let extractor = self.extractor.clone();
                tokio::spawn(async move {
                    let _permit = sem.acquire().await?;
                    Self::run_fixture(&extractor, fixture).await
                })
            })
            .collect();

        let results = futures::future::join_all(handles).await;
        // aggregate...
    }
}

Default: 5 concurrent (configurable via eval.max_concurrent)
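
To see the bound in action without an async runtime, the sketch below uses a fixed pool of `std` threads pulling from a shared queue. This is an analog of the semaphore approach, not the Tokio harness: `run_bounded` is a hypothetical helper, and it also reports the peak number of jobs observed in flight so the limit can be asserted in a test.

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;

/// Runs `job` over every item using at most `max_concurrent` worker threads.
/// Returns the results (in completion order per worker, not input order)
/// and the peak number of jobs that were in flight at once.
pub fn run_bounded<T, R, F>(items: Vec<T>, max_concurrent: usize, job: F) -> (Vec<R>, usize)
where
    T: Send + 'static,
    R: Send + 'static,
    F: Fn(T) -> R + Send + Sync + 'static,
{
    let queue = Arc::new(Mutex::new(items.into_iter().collect::<VecDeque<T>>()));
    let job = Arc::new(job);
    let in_flight = Arc::new(AtomicUsize::new(0));
    let peak = Arc::new(AtomicUsize::new(0));

    let workers: Vec<_> = (0..max_concurrent.max(1))
        .map(|_| {
            let (queue, job) = (Arc::clone(&queue), Arc::clone(&job));
            let (in_flight, peak) = (Arc::clone(&in_flight), Arc::clone(&peak));
            thread::spawn(move || {
                let mut out = Vec::new();
                loop {
                    // Hold the lock only long enough to pop one item.
                    let next = queue.lock().unwrap().pop_front();
                    let Some(item) = next else { break };
                    let now = in_flight.fetch_add(1, Ordering::SeqCst) + 1;
                    peak.fetch_max(now, Ordering::SeqCst);
                    out.push(job(item));
                    in_flight.fetch_sub(1, Ordering::SeqCst);
                }
                out
            })
        })
        .collect();

    let results: Vec<R> = workers
        .into_iter()
        .flat_map(|w| w.join().unwrap())
        .collect();
    (results, peak.load(Ordering::SeqCst))
}
```

Because the number of workers is the limit, the peak can never exceed `max_concurrent`, which is the same guarantee the semaphore permit gives each spawned task.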


4. Rate Limit Resilience

Problem: 429 errors cause evaluation failures.

Solution: Exponential backoff with retries.

// src/llm/client.rs

use std::time::Duration;

impl GeminiClient {
    async fn call_with_retry(&self, request: &Request) -> Result<Response> {
        let mut delay = Duration::from_millis(500);
        let max_retries = 5;

        for attempt in 0..max_retries {
            match self.call(request).await {
                Ok(response) => return Ok(response),
                Err(e) if e.is_rate_limit() => {
                    if attempt == max_retries - 1 {
                        return Err(e);
                    }
                    tracing::warn!(
                        attempt,
                        delay_ms = delay.as_millis(),
                        "Rate limited, backing off"
                    );
                    tokio::time::sleep(delay).await;
                    delay *= 2;
                }
                Err(e) => return Err(e),
            }
        }
        unreachable!()
    }
}
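
The worst-case wait implied by `call_with_retry` follows directly from the loop: the last failed attempt returns the error instead of sleeping, so five attempts mean four sleeps. A small sketch of that schedule, where `backoff_schedule` is a hypothetical helper and not part of the client:

```rust
use std::time::Duration;

/// Delays slept between attempts, doubling each time. With `max_retries`
/// attempts there are at most `max_retries - 1` sleeps, because the final
/// failure is returned rather than retried.
pub fn backoff_schedule(initial: Duration, max_retries: u32) -> Vec<Duration> {
    let mut delay = initial;
    let mut schedule = Vec::new();
    for _ in 1..max_retries {
        schedule.push(delay);
        delay *= 2;
    }
    schedule
}
```

For the values above (500 ms initial delay, 5 attempts) the sleeps are 500 ms, 1 s, 2 s, and 4 s, so a fixture that is rate limited on every attempt fails after 7.5 s of waiting plus request time.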

5. Fixture Format

Problem: No standardized test cases to validate prompt changes.

Solution: TOML fixtures with input, expected output, and rationale.

# tests/llm_fixtures/tls/disabled_verification.toml

[metadata]
id = "tls-001"
name = "TLS verification disabled in Python requests"
category = "tls"
language = "python"
created = "2026-02-05"

[input]
# The code to analyze
content = '''
import requests

def fetch_data(url):
    # Disable SSL verification for internal services
    response = requests.get(url, verify=False)
    return response.json()
'''

[expected]
# What the LLM MUST extract (recall test)
# Inline tables must stay on a single line in TOML 1.0.
must_contain = [
    { subject = "tls/cert_verification", predicate = "enabled", value = false, rationale = "requests.get(verify=False) explicitly disables TLS verification" },
]

# What the LLM MUST NOT extract (precision test)
must_not_contain = [
    { subject = "tls/cert_verification", predicate = "enabled", value = true },
]

[scoring]
# How important is this fixture?
weight = 1.0
# Expected minimum confidence from LLM
min_confidence = 0.8

ExpectedClaim with rationale:

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ExpectedClaim {
    pub subject: String,
    pub predicate: String,
    pub value: serde_json::Value,
    /// Optional explanation for why this claim is expected (shown on failure)
    #[serde(default)]
    pub rationale: Option<String>,
}

Fixture categories:

tests/llm_fixtures/
├── manifest.toml              # Index + baseline metrics
├── tls/                       # TLS/SSL fixtures
│   ├── disabled_verification.toml
│   ├── deprecated_version.toml
│   └── pinning_bypass.toml
├── jwt/                       # JWT fixtures
├── secrets/                   # Hardcoded secrets fixtures
├── auth/                      # Auth bypass fixtures
├── negative/                  # Safe code (expect NO claims)
│   ├── safe_tls.toml
│   └── env_var_secrets.toml
└── edge/                      # Edge cases
    ├── empty_file.toml
    └── huge_file.toml

Manifest:

# tests/llm_fixtures/manifest.toml

[corpus]
version = "1.0.0"
total_fixtures = 35

[baseline]
# Known-good metrics from last successful run
precision = 0.85
recall = 0.78
f1 = 0.81
prompt_version = "1.0.0"
model = "gemini-2.0-flash"
measured_at = "2026-02-05T10:30:00Z"

6. Evaluation Harness

Problem: No way to run fixtures and compute metrics.

Solution: Evaluation engine in src/eval/.

// src/eval/mod.rs

mod corpus;
mod db;
mod fixture;
mod harness;
mod matcher;
mod metrics;
mod perturbation;
mod report;

pub use corpus::CorpusGenerator;
pub use db::EvalDatabase;
pub use fixture::{Fixture, FixtureLoader};
pub use harness::{EvalConfig, EvalHarness, EvalResult};
pub use metrics::{Metrics, CategoryMetrics};
pub use perturbation::Perturbator;
pub use report::{Report, ReportFormat};

Core types:

// src/eval/harness.rs

pub struct EvalConfig {
    /// Path to fixtures directory.
    pub fixtures_dir: PathBuf,

    /// Categories to run (None = all).
    pub categories: Option<Vec<String>>,

    /// Max fixtures to run (for smoke tests).
    pub max_fixtures: Option<usize>,

    /// Evaluation mode.
    pub mode: EvalMode,

    /// Baseline to compare against.
    pub baseline: Option<PathBuf>,

    /// Save observations to database.
    pub save_observations: bool,

    /// Maximum concurrent LLM calls.
    pub max_concurrent: usize,
}

pub enum EvalMode {
    /// Use real LLM API.
    Live,
    /// Use cached responses only (fails if not cached).
    Cached,
    /// Skip LLM, return empty claims (for testing harness).
    Mock,
    /// Perturbation testing for stability.
    Robust,
}

pub struct EvalResult {
    pub run_id: Uuid,
    pub started_at: DateTime<Utc>,
    pub completed_at: DateTime<Utc>,
    pub metrics: Metrics,
    pub fixture_results: Vec<FixtureResult>,
    pub baseline_comparison: Option<BaselineComparison>,
    pub verdict: EvalVerdict,
    /// Stability score (only in Robust mode)
    pub stability: Option<f64>,
}

pub enum EvalVerdict {
    Pass,
    Regression { regressions: Vec<String> },
    Error { message: String },
}

Metrics calculation:

// src/eval/metrics.rs

pub struct Metrics {
    /// True positives: expected claims that were extracted.
    pub true_positives: usize,
    /// False positives: extracted claims that weren't expected.
    pub false_positives: usize,
    /// False negatives: expected claims that weren't extracted.
    pub false_negatives: usize,

    /// Precision = TP / (TP + FP)
    pub precision: f64,
    /// Recall = TP / (TP + FN)
    pub recall: f64,
    /// F1 = 2 * (P * R) / (P + R)
    pub f1: f64,

    /// Total fixtures.
    pub total_fixtures: usize,
    /// Fixtures that passed.
    pub passed: usize,
    /// Fixtures that failed.
    pub failed: usize,

    /// Total tokens used.
    pub total_tokens: u64,
    /// Estimated cost (USD).
    pub estimated_cost: f64,

    /// By category.
    pub by_category: HashMap<String, CategoryMetrics>,
}

impl Metrics {
    pub fn compute(results: &[FixtureResult]) -> Self {
        let mut tp = 0;
        let mut fp = 0;
        let mut fn_ = 0;

        for result in results {
            tp += result.true_positives;
            fp += result.false_positives;
            fn_ += result.false_negatives;
        }

        let precision = if tp + fp > 0 { tp as f64 / (tp + fp) as f64 } else { 0.0 };
        let recall = if tp + fn_ > 0 { tp as f64 / (tp + fn_) as f64 } else { 0.0 };
        let f1 = if precision + recall > 0.0 {
            2.0 * precision * recall / (precision + recall)
        } else {
            0.0
        };

        // ... rest of computation
    }
}
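
As a worked check of the formulas, the sketch below recomputes precision, recall, and F1 from raw counts with the same zero-denominator convention as `Metrics::compute`. The function names are hypothetical. Because `compute` sums counts across fixtures before dividing, the result is micro-averaged: every claim counts equally regardless of which fixture it came from.

```rust
/// Harmonic mean of precision and recall; 0.0 when both are zero.
pub fn f1_score(precision: f64, recall: f64) -> f64 {
    if precision + recall > 0.0 {
        2.0 * precision * recall / (precision + recall)
    } else {
        0.0
    }
}

/// Precision, recall, and F1 from raw counts. An undefined ratio
/// (zero denominator) is reported as 0.0, as in `Metrics::compute`.
pub fn precision_recall_f1(tp: usize, fp: usize, fn_: usize) -> (f64, f64, f64) {
    let precision = if tp + fp > 0 { tp as f64 / (tp + fp) as f64 } else { 0.0 };
    let recall = if tp + fn_ > 0 { tp as f64 / (tp + fn_) as f64 } else { 0.0 };
    (precision, recall, f1_score(precision, recall))
}
```

Plugging in the manifest baseline, `f1_score(0.85, 0.78)` is 1.326 / 1.63, roughly 0.8135, which agrees with the recorded `f1 = 0.81`.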

7. Hybrid Type-Coercive Matching

Problem: Strict type matching misses semantically equivalent values.

Solution: Coerce strings to booleans/numbers when reasonable.

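// src/eval/matcher.rs
//
// Note: the `Observation` type used below must expose `concept_path`,
// `predicate`, and an `ObjectValue` `value`. The logging struct of the same
// name in src/llm/observation.rs has none of those fields, so the two must
// not be confused.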

pub struct ClaimMatcher {
    /// Tolerance for confidence comparison.
    pub confidence_tolerance: f32,
}

impl ClaimMatcher {
    /// Check if extracted claims satisfy must_contain requirements.
    pub fn check_must_contain(
        &self,
        extracted: &[Observation],
        expected: &[ExpectedClaim],
    ) -> MatchResult {
        let mut matched = vec![];
        let mut unmatched = vec![];

        for exp in expected {
            if let Some(claim) = self.find_matching_claim(extracted, exp) {
                matched.push((exp.clone(), claim.clone()));
            } else {
                unmatched.push(exp.clone());
            }
        }

        MatchResult { matched, unmatched }
    }

    /// Check if any extracted claim matches (for must_not_contain).
    pub fn check_must_not_contain(
        &self,
        extracted: &[Observation],
        forbidden: &[ExpectedClaim],
    ) -> Vec<(ExpectedClaim, Observation)> {
        let mut violations = vec![];

        for forbid in forbidden {
            if let Some(claim) = self.find_matching_claim(extracted, forbid) {
                violations.push((forbid.clone(), claim.clone()));
            }
        }

        violations
    }

    fn find_matching_claim(
        &self,
        extracted: &[Observation],
        expected: &ExpectedClaim,
    ) -> Option<&Observation> {
        extracted.iter().find(|claim| {
            self.subject_matches(&claim.concept_path, &expected.subject) &&
            claim.predicate == expected.predicate &&
            self.value_matches(&claim.value, &expected.value)
        })
    }

    fn subject_matches(&self, extracted: &str, expected: &str) -> bool {
        // Allow matching on tail path (last 2 segments)
        let ext_tail = extracted.split('/').rev().take(2).collect::<Vec<_>>();
        let exp_tail = expected.split('/').rev().take(2).collect::<Vec<_>>();
        ext_tail == exp_tail
    }

    fn value_matches(&self, extracted: &ObjectValue, expected: &serde_json::Value) -> bool {
        match (extracted, expected) {
            // Direct matches
            (ObjectValue::Boolean(e), serde_json::Value::Bool(x)) => e == x,
            (ObjectValue::Number(e), serde_json::Value::Number(x)) => {
                x.as_f64().map(|n| (e - n).abs() < 0.001).unwrap_or(false)
            }
            (ObjectValue::Text(e), serde_json::Value::String(x)) => e == x,

            // Coercion: string -> boolean
            (ObjectValue::Boolean(e), serde_json::Value::String(s)) => {
                self.coerce_to_bool(s).map(|b| *e == b).unwrap_or(false)
            }
            (ObjectValue::Text(e), serde_json::Value::Bool(x)) => {
                self.coerce_to_bool(e).map(|b| b == *x).unwrap_or(false)
            }

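            // Coercion: string -> number (in either direction)
            (ObjectValue::Number(e), serde_json::Value::String(s)) => {
                s.parse::<f64>().map(|n| (e - n).abs() < 0.001).unwrap_or(false)
            }
            (ObjectValue::Text(e), serde_json::Value::Number(x)) => {
                match (e.parse::<f64>(), x.as_f64()) {
                    (Ok(a), Some(b)) => (a - b).abs() < 0.001,
                    _ => false,
                }
            }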

            _ => false,
        }
    }

    fn coerce_to_bool(&self, s: &str) -> Option<bool> {
        match s.to_lowercase().as_str() {
            "true" | "yes" | "on" | "enabled" | "1" => Some(true),
            "false" | "no" | "off" | "disabled" | "0" => Some(false),
            _ => None,
        }
    }
}
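
Two behaviors of the matcher are worth pinning down with examples: tail matching compares the last two path segments, and boolean coercion is case-insensitive but accepts only a fixed word list. The sketch below restates both as free functions so they can be exercised in isolation; the longer concept path in the usage note is hypothetical.

```rust
/// Last two `/`-separated segments of a path, in reverse order.
fn tail_segments(path: &str) -> Vec<&str> {
    path.split('/').rev().take(2).collect()
}

/// True when both paths end in the same two segments.
pub fn subject_matches(extracted: &str, expected: &str) -> bool {
    tail_segments(extracted) == tail_segments(expected)
}

/// Maps common truthy and falsy words to a boolean, ignoring case.
pub fn coerce_to_bool(s: &str) -> Option<bool> {
    match s.to_lowercase().as_str() {
        "true" | "yes" | "on" | "enabled" | "1" => Some(true),
        "false" | "no" | "off" | "disabled" | "0" => Some(false),
        _ => None,
    }
}
```

A path such as `code/python/tls/cert_verification` matches the fixture subject `tls/cert_verification`, but a bare `cert_verification` does not, because a one-segment path never equals a two-segment tail. Fixture authors should therefore always write at least two segments.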

8. Perturbation Testing Mode

Problem: Need to verify LLM consistency across minor input variations.

Solution: Perturbation mode that tests stability.

// src/eval/perturbation.rs

use crate::Language;

pub struct Perturbator;

impl Perturbator {
    /// Generate perturbed variants of input content
    pub fn perturb(content: &str, language: Language) -> Vec<String> {
        let mut variants = vec![content.to_string()];
        variants.push(Self::add_trailing_whitespace(content));
        variants.push(Self::normalize_indentation(content));
        variants.push(Self::add_innocuous_comments(content, language));
        variants.push(Self::remove_comments(content, language));
        variants
    }

    fn add_trailing_whitespace(content: &str) -> String {
        content
            .lines()
            .map(|line| format!("{}  ", line))
            .collect::<Vec<_>>()
            .join("\n")
    }

    fn normalize_indentation(content: &str) -> String {
        // Convert each tab to four spaces (the reverse direction is not handled)
        content.replace('\t', "    ")
    }

    fn add_innocuous_comments(content: &str, language: Language) -> String {
        let comment_prefix = match language {
            Language::Python => "#",
            Language::JavaScript | Language::TypeScript | Language::Rust | Language::Go => "//",
            _ => "#",
        };
        format!("{} Auto-generated file\n{}", comment_prefix, content)
    }

    fn remove_comments(content: &str, language: Language) -> String {
        // Simple single-line comment removal
        let comment_prefix = match language {
            Language::Python => "#",
            Language::JavaScript | Language::TypeScript | Language::Rust | Language::Go => "//",
            _ => "#",
        };
        content
            .lines()
            .filter(|line| !line.trim().starts_with(comment_prefix))
            .collect::<Vec<_>>()
            .join("\n")
    }
}

Stability metric: % of perturbations producing identical claims.
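
A minimal sketch of how the stability score could be computed, assuming "identical" means the claim set of a perturbed variant equals the claim set of the unperturbed input. The `ClaimSet` alias, the `subject|predicate|value` key encoding, and the choice to return 1.0 for an empty variant list are all assumptions for illustration.

```rust
use std::collections::BTreeSet;

/// Claims reduced to comparable keys, e.g. "subject|predicate|value"
/// (a hypothetical encoding chosen for this sketch).
pub type ClaimSet = BTreeSet<String>;

/// Builds a claim set from string keys.
pub fn claims(keys: &[&str]) -> ClaimSet {
    keys.iter().map(|k| k.to_string()).collect()
}

/// Fraction of perturbed variants whose claim set equals the baseline's.
/// Returns 1.0 when there are no perturbed variants to compare.
pub fn stability(baseline: &ClaimSet, perturbed: &[ClaimSet]) -> f64 {
    if perturbed.is_empty() {
        return 1.0;
    }
    let identical = perturbed
        .iter()
        .filter(|variant| *variant == baseline)
        .count();
    identical as f64 / perturbed.len() as f64
}
```

With the four perturbations produced by `Perturbator::perturb` (the first element of its output is the original content and would serve as the baseline), three matching variants and one that loses the claim give a stability of 0.75.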

CLI: aphoria eval run --mode robust


9. Structured Decoding (Gemini Response Schema)

Problem: Free-form JSON parsing can fail.

Solution: Use Gemini's response_schema for guaranteed JSON structure.

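// src/llm/client.rs

use maplit::hashmap; // assumed source of the `hashmap!` macro used below

impl GeminiClient {
    // NOTE: `prompt` is not attached to the request in this sketch. It still
    // has to be sent (the Gemini API accepts a system instruction) for the
    // ontology prompt to take effect.
    fn build_request(&self, content: &str, prompt: &str) -> Request {
        Request {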
            contents: vec![Content {
                role: "user".to_string(),
                parts: vec![Part::Text(content.to_string())],
            }],
            generation_config: GenerationConfig {
                temperature: 0.1,
                response_mime_type: "application/json".to_string(),
                response_schema: Some(self.claims_schema()),
            },
        }
    }

    fn claims_schema(&self) -> Schema {
        Schema {
            type_: "object".to_string(),
            properties: hashmap! {
                "claims".to_string() => Schema {
                    type_: "array".to_string(),
                    items: Some(Box::new(Schema {
                        type_: "object".to_string(),
                        properties: hashmap! {
                            "subject".to_string() => Schema { type_: "string".to_string(), ..Default::default() },
                            "predicate".to_string() => Schema { type_: "string".to_string(), ..Default::default() },
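                            // NOTE: "any" is a placeholder. Confirm which type names the
                            // Gemini schema accepts for a value that may be a bool, number, or string.
                            "value".to_string() => Schema { type_: "any".to_string(), ..Default::default() },
                            "confidence".to_string() => Schema { type_: "number".to_string(), ..Default::default() },
                            "line".to_string() => Schema { type_: "integer".to_string(), ..Default::default() },
                        },
                        // Converted to `String` to match the outer `required` field.
                        required: vec!["subject", "predicate", "value", "confidence"]
                            .into_iter()
                            .map(String::from)
                            .collect(),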
                        ..Default::default()
                    })),
                    ..Default::default()
                },
            },
            required: vec!["claims".to_string()],
            ..Default::default()
        }
    }
}

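Benefit: Responses are constrained to the schema, which removes malformed-structure parse failures. A response truncated at the output token limit can still fail to parse, so parse_success and parse_error remain worth recording.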


10. Synthetic Corpus Generation

Problem: Manual fixture creation is slow.

Solution: Generate fixtures from real scans with human review.

aphoria eval generate-corpus \
    --scan-path /path/to/codebase \
    --output-dir tests/llm_fixtures/synthetic \
    --sample-size 50

// src/eval/corpus.rs

use std::collections::HashMap;
use std::path::{Path, PathBuf};

pub struct CorpusGenerator {
    extractor: LlmExtractor,
}

impl CorpusGenerator {
    pub async fn generate(
        &self,
        scan_path: &Path,
        output_dir: &Path,
        sample_size: usize,
    ) -> Result<Vec<PathBuf>> {
        let findings = self.extractor.scan(scan_path).await?;
        let sample = self.stratified_sample(&findings, sample_size);
        let mut fixtures = vec![];
        for finding in sample {
            fixtures.push(self.create_fixture(&finding, output_dir)?);
        }
        Ok(fixtures)
    }

    fn stratified_sample(&self, findings: &[Finding], size: usize) -> Vec<&Finding> {
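        // Take an equal share from each category (size / number of categories),
        // with a minimum of one per category so that a sample size smaller than
        // the number of categories does not produce an empty sample.
        let mut by_category: HashMap<&str, Vec<&Finding>> = HashMap::new();
        for f in findings {
            by_category.entry(&f.category).or_default().push(f);
        }

        let per_category = (size / by_category.len().max(1)).max(1);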
        let mut sample = vec![];
        for (_, items) in by_category {
            sample.extend(items.iter().take(per_category));
        }
        sample.truncate(size);
        sample
    }

    fn create_fixture(&self, finding: &Finding, output_dir: &Path) -> Result<PathBuf> {
        let fixture = Fixture {
            metadata: FixtureMetadata {
                id: format!("auto-{}", Uuid::new_v4()),
                name: finding.description.clone(),
                category: finding.category.clone(),
                language: finding.language.to_string(),
                created: Utc::now().date_naive().to_string(),
            },
            input: FixtureInput {
                content: finding.code_snippet.clone(),
            },
            expected: FixtureExpected {
                must_contain: vec![ExpectedClaim {
                    subject: finding.subject.clone(),
                    predicate: finding.predicate.clone(),
                    value: finding.value.clone(),
                    rationale: Some("Auto-generated - requires human review".to_string()),
                }],
                must_not_contain: vec![],
            },
            scoring: FixtureScoring {
                weight: 1.0,
                min_confidence: 0.7,
            },
        };

        let path = output_dir
            .join(&finding.category)
            .join(format!("{}.toml", fixture.metadata.id));
        std::fs::create_dir_all(path.parent().unwrap())?;
        std::fs::write(&path, toml::to_string_pretty(&fixture)?)?;
        Ok(path)
    }
}
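
`stratified_sample` above takes `size / categories` items from every category, so a category with many findings gets the same share as one with few. If quotas proportional to category size are wanted instead, one option is largest-remainder allocation. The sketch below is an alternative under that assumption, not the generator's current behavior; `proportional_quotas` is a hypothetical helper and uses a `BTreeMap` so the result is deterministic.

```rust
use std::collections::BTreeMap;

/// Splits `size` sample slots across categories in proportion to their
/// finding counts using the largest-remainder method. A category is never
/// assigned more slots than it has findings.
pub fn proportional_quotas(
    counts: &BTreeMap<String, usize>,
    size: usize,
) -> BTreeMap<String, usize> {
    let total: usize = counts.values().sum();
    if total == 0 {
        return BTreeMap::new();
    }
    let size = size.min(total);

    // Start from the floor of each exact share and remember the remainders.
    let mut quotas = BTreeMap::new();
    let mut remainders: Vec<(usize, &String)> = Vec::new();
    let mut assigned = 0;
    for (category, &count) in counts {
        let floor = count * size / total;
        quotas.insert(category.clone(), floor);
        remainders.push((count * size % total, category));
        assigned += floor;
    }

    // Hand leftover slots to the largest remainders; ties break by name.
    remainders.sort_by(|a, b| b.0.cmp(&a.0).then(a.1.cmp(b.1)));
    for (_, category) in remainders.into_iter().take(size - assigned) {
        *quotas.get_mut(category).unwrap() += 1;
    }
    quotas
}
```

For 60 `tls`, 30 `jwt`, and 10 `secrets` findings with a sample size of 10, the quotas come out as 6, 3, and 1. Each category's findings would then be sampled up to its quota.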

Workflow: Scan -> Human review -> Commit to corpus


11. CLI Commands

Problem: No way to run evaluations from command line.

Solution: Add aphoria eval subcommand.

// src/cli.rs additions

#[derive(Subcommand)]
pub enum Commands {
    // ... existing commands ...

    /// Evaluate LLM prompt effectiveness
    Eval {
        #[command(subcommand)]
        command: EvalCommands,
    },
}

#[derive(Subcommand)]
pub enum EvalCommands {
    /// Run evaluation against fixtures
    Run {
        /// Path to fixtures directory
        #[arg(long, default_value = "tests/llm_fixtures")]
        fixtures: PathBuf,

        /// Categories to run (comma-separated)
        #[arg(long)]
        categories: Option<String>,

        /// Maximum fixtures to run
        #[arg(long)]
        max_fixtures: Option<usize>,

        /// Evaluation mode: live, cached, mock, robust
        #[arg(long, default_value = "cached")]
        mode: String,

        /// Baseline file to compare against
        #[arg(long)]
        baseline: Option<PathBuf>,

        /// Exit with code 1 if regression detected
        #[arg(long)]
        fail_on_regression: bool,

        /// Regression threshold (default: 0.05 = 5%)
        #[arg(long, default_value = "0.05")]
        threshold: f64,

        /// Save observation logs
        #[arg(long)]
        save_observations: bool,

        /// Output format: table, json, markdown
        #[arg(long, default_value = "table")]
        format: String,
    },

    /// Show current baseline metrics
    Baseline {
        /// Path to fixtures directory
        #[arg(long, default_value = "tests/llm_fixtures")]
        fixtures: PathBuf,
    },

    /// Update baseline from latest run
    UpdateBaseline {
        /// Run ID to use as new baseline
        #[arg(long)]
        run_id: Option<String>,

        /// Path to fixtures directory
        #[arg(long, default_value = "tests/llm_fixtures")]
        fixtures: PathBuf,

        /// Must be passed to overwrite the baseline; without it the command
        /// only prints the current baseline and exits
        #[arg(long)]
        force: bool,
    },

    /// List fixtures
    ListFixtures {
        /// Path to fixtures directory
        #[arg(long, default_value = "tests/llm_fixtures")]
        fixtures: PathBuf,

        /// Filter by category
        #[arg(long)]
        category: Option<String>,
    },

    /// Validate fixture format
    ValidateFixtures {
        /// Path to fixtures directory
        #[arg(long, default_value = "tests/llm_fixtures")]
        fixtures: PathBuf,
    },

    /// Generate fixtures from real scans
    GenerateCorpus {
        /// Path to codebase to scan
        #[arg(long)]
        scan_path: PathBuf,

        /// Output directory for generated fixtures
        #[arg(long)]
        output_dir: PathBuf,

        /// Number of fixtures to generate
        #[arg(long, default_value = "50")]
        sample_size: usize,
    },
}
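
Since `mode` and `categories` arrive as plain strings, the handler has to turn them into `EvalMode` and `Option<Vec<String>>` before building `EvalConfig`. A minimal sketch of that conversion, assuming a `FromStr` implementation and a hypothetical `parse_categories` helper (neither appears in the spec):

```rust
use std::str::FromStr;

/// Mirrors the `EvalMode` enum from src/eval/harness.rs.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EvalMode {
    Live,
    Cached,
    Mock,
    Robust,
}

impl FromStr for EvalMode {
    type Err = String;

    /// Parses the `--mode` value, ignoring case and surrounding whitespace.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.trim().to_ascii_lowercase().as_str() {
            "live" => Ok(Self::Live),
            "cached" => Ok(Self::Cached),
            "mock" => Ok(Self::Mock),
            "robust" => Ok(Self::Robust),
            other => Err(format!(
                "unknown eval mode '{other}' (expected live, cached, mock, or robust)"
            )),
        }
    }
}

/// Splits a `--categories tls,jwt` value into trimmed, non-empty names.
pub fn parse_categories(raw: Option<&str>) -> Option<Vec<String>> {
    raw.map(|s| {
        s.split(',')
            .map(str::trim)
            .filter(|c| !c.is_empty())
            .map(String::from)
            .collect()
    })
}
```

An alternative is to declare `mode` with an enum type and clap's `ValueEnum` derive, which rejects unknown values at argument-parsing time and lists the valid ones in `--help`.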

Usage examples:

# Run smoke test (cached responses, fast)
aphoria eval run --mode cached --max-fixtures 10

# Run full evaluation (live API calls)
aphoria eval run --mode live --save-observations

# Run with baseline comparison
aphoria eval run --baseline tests/llm_fixtures/manifest.toml --fail-on-regression

# Run perturbation testing
aphoria eval run --mode robust --max-fixtures 5

# Show current baseline
aphoria eval baseline

# Update baseline (requires --force)
aphoria eval update-baseline --force

# List fixtures
aphoria eval list-fixtures --category tls

# Validate fixture format
aphoria eval validate-fixtures

# Generate fixtures from real codebase
aphoria eval generate-corpus --scan-path ./my-project --output-dir ./test-fixtures

Baseline safety: Without --force, update-baseline shows:

Current baseline: precision=0.85, recall=0.78, f1=0.81 (2026-02-05)
To update, re-run with --force

12. Report Output

Problem: Need human-readable and machine-readable output.

Solution: Multiple report formats.

Table format (default):

╭────────────────────────────────────────────────────────────────────╮
│                    LLM Prompt Evaluation Report                     │
├────────────────────────────────────────────────────────────────────┤
│ Run ID:     abc123-def456                                          │
│ Date:       2026-02-05 14:30:00 UTC                                │
│ Prompt:     v1.0.0                                                 │
│ Model:      gemini-2.0-flash                                       │
╰────────────────────────────────────────────────────────────────────╯

Summary
╭──────────┬─────────┬──────────┬────────┬────────╮
│ Metric   │ Current │ Baseline │ Delta  │ Status │
├──────────┼─────────┼──────────┼────────┼────────┤
│ Precision│ 0.87    │ 0.85     │ +0.02  │ ✓      │
│ Recall   │ 0.76    │ 0.78     │ -0.02  │ ⚠      │
│ F1       │ 0.81    │ 0.81     │ +0.00  │ ✓      │
╰──────────┴─────────┴──────────┴────────┴────────╯

Verdict: ⚠ REVIEW - Recall dropped by 2%

Category Breakdown
╭──────────┬──────────┬────────┬────────╮
│ Category │ Fixtures │ Passed │ Failed │
├──────────┼──────────┼────────┼────────┤
│ tls      │ 12       │ 11     │ 1      │
│ jwt      │ 8        │ 6      │ 2      │
│ secrets  │ 15       │ 14     │ 1      │
│ negative │ 10       │ 10     │ 0      │
╰──────────┴──────────┴────────┴────────╯

Fixture Changes (2)
- jwt-003: JWT algorithm none detection
  Expected: jwt/algorithm = "none"
  Rationale: alg:"none" bypasses signature verification entirely
  Got: Not extracted

- tls-007: TLS version in constants (IMPROVED)
  Previously: Not extracted
  Now: tls/min_version = "1.0" ✓

Cost: 125,430 tokens ($0.12)

JSON format:

{
  "run_id": "abc123-def456",
  "timestamp": "2026-02-05T14:30:00Z",
  "prompt_version": "1.0.0",
  "model": "gemini-2.0-flash",
  "metrics": {
    "precision": 0.87,
    "recall": 0.76,
    "f1": 0.81,
    "total_fixtures": 45,
    "passed": 41,
    "failed": 4
  },
  "baseline_comparison": {
    "precision_delta": 0.02,
    "recall_delta": -0.02,
    "has_regression": true,
    "regression_threshold": 0.05
  },
  "stability": 0.92,
  "verdict": "review",
  "fixture_results": [...]
}
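
One way the baseline_comparison deltas and the verdict could be derived is sketched below. The three-way verdict rule (no drop passes, a drop below regression_threshold asks for review, a larger drop fails) is an assumption inferred from the sample report; `Scores`, `Verdict`, `BaselineComparison`, and `compare_to_baseline` are hypothetical names.

```rust
/// Precision and recall for one evaluation run (hypothetical shape).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Scores {
    pub precision: f64,
    pub recall: f64,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Verdict {
    Pass,
    Review,
    Fail,
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub struct BaselineComparison {
    pub precision_delta: f64,
    pub recall_delta: f64,
    pub verdict: Verdict,
}

/// Compare a run against the baseline.
/// Assumed rule: no drop => Pass, drop below the threshold => Review,
/// drop at or above the threshold => Fail.
pub fn compare_to_baseline(
    current: Scores,
    baseline: Scores,
    regression_threshold: f64,
) -> BaselineComparison {
    // Tolerance so floating-point noise is not reported as a drop.
    const EPSILON: f64 = 1e-9;

    let precision_delta = current.precision - baseline.precision;
    let recall_delta = current.recall - baseline.recall;

    // Largest decrease across both metrics, clamped to zero when neither fell.
    let worst_drop = (-precision_delta).max(-recall_delta).max(0.0);

    let verdict = if worst_drop <= EPSILON {
        Verdict::Pass
    } else if worst_drop >= regression_threshold - EPSILON {
        Verdict::Fail
    } else {
        Verdict::Review
    };

    BaselineComparison { precision_delta, recall_delta, verdict }
}
```

With the sample numbers (precision 0.87 vs 0.85, recall 0.76 vs 0.78, threshold 0.05) this rule yields a review verdict, matching the report shown earlier.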

Implementation Plan

Phase 1: Core Infrastructure (2 days)

| Task | File | Description |
|------|------|-------------|
| 1.1 | src/eval/db.rs | SQLite database with retention |
| 1.2 | src/llm/cache.rs | Update cache key to include prompt hash |
| 1.3 | src/llm/client.rs | Exponential backoff for 429s |

Acceptance: Database stores observations, cache invalidates on prompt change, rate limits handled gracefully.
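
To illustrate task 1.2, the sketch below derives a cache key from both the prompt and the file content, so that editing the prompt invalidates earlier entries. It is only an illustration: the real cache uses BLAKE3, while this example substitutes the standard library's `DefaultHasher` to stay dependency-free, and `cache_key` is a hypothetical function name.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Derive a cache key from the system prompt and the scanned content.
///
/// Stand-in only: `DefaultHasher` replaces BLAKE3 here so the sketch needs no
/// external crate. Its algorithm is unspecified and may change between Rust
/// releases, so it is not suitable for a persistent on-disk cache.
pub fn cache_key(system_prompt: &str, content: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    // `str::hash` feeds a terminator after each string, so ("ab", "c") and
    // ("a", "bc") do not collide by simple concatenation.
    system_prompt.hash(&mut hasher);
    content.hash(&mut hasher);
    hasher.finish()
}
```

Because the prompt is part of the key, a changed prompt produces cache misses for every file without any explicit invalidation step.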

Phase 2: Fixture & Matching (2 days)

| Task | File | Description |
|------|------|-------------|
| 2.1 | src/eval/fixture.rs | Define Fixture, ExpectedClaim with rationale |
| 2.2 | src/eval/matcher.rs | Hybrid type-coercive matching |
| 2.3 | tests/llm_fixtures/ | Create 10 seed fixtures |
| 2.4 | src/eval/fixture.rs | Add FixtureLoader |

Acceptance: Can load fixtures from TOML, matching handles type coercion.
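
Type-coercive matching might look like the sketch below, which normalizes string values that spell a boolean or a finite number before comparing. `ClaimValue`, `canonical`, and `values_match` are hypothetical names, and the coercion rules are assumptions rather than the matcher's specified behavior.

```rust
/// A claim value as it may appear in a fixture or in LLM output
/// (hypothetical type for illustration).
#[derive(Debug, Clone, PartialEq)]
pub enum ClaimValue {
    Bool(bool),
    Number(f64),
    Text(String),
}

/// Normalize textual values that spell a boolean or a finite number.
fn canonical(value: &ClaimValue) -> ClaimValue {
    match value {
        ClaimValue::Text(raw) => {
            let trimmed = raw.trim();
            if trimmed.eq_ignore_ascii_case("true") {
                return ClaimValue::Bool(true);
            }
            if trimmed.eq_ignore_ascii_case("false") {
                return ClaimValue::Bool(false);
            }
            match trimmed.parse::<f64>() {
                // Reject NaN/inf: "nan" should stay text so it equals itself.
                Ok(number) if number.is_finite() => ClaimValue::Number(number),
                _ => ClaimValue::Text(trimmed.to_string()),
            }
        }
        other => other.clone(),
    }
}

/// Two values match when their canonical forms are equal,
/// so the string "true" matches the boolean true.
pub fn values_match(expected: &ClaimValue, actual: &ClaimValue) -> bool {
    canonical(expected) == canonical(actual)
}
```

One consequence of numeric coercion worth weighing: version-like strings such as "1.0" and "1" compare equal under this rule, so a real matcher may want to exempt fields like tls/min_version.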

Phase 3: Evaluation Harness (2 days)

| Task | File | Description |
|------|------|-------------|
| 3.1 | src/eval/harness.rs | Bounded concurrency with Semaphore |
| 3.2 | src/eval/metrics.rs | Implement Metrics::compute() |
| 3.3 | src/eval/harness.rs | Baseline comparison |
| 3.4 | src/eval/perturbation.rs | Perturbation testing |

Acceptance: Can run fixtures with bounded parallelism, compute precision/recall, measure stability.
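
As a reference point for task 3.2, here is a minimal sketch of the precision/recall/F1 computation from claim counts. The argument list of `compute` is an assumption (the spec names only `Metrics::compute()`), as is the convention of scoring an empty denominator as 1.0 so that negative fixtures with no expected and no extracted claims count as perfect.

```rust
/// Aggregate evaluation metrics (hypothetical field layout).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Metrics {
    pub precision: f64,
    pub recall: f64,
    pub f1: f64,
}

impl Metrics {
    /// Compute metrics from matched, spurious, and missed claim counts.
    ///
    /// Assumed convention: when nothing was extracted, precision is 1.0;
    /// when nothing was expected, recall is 1.0.
    pub fn compute(true_positives: u32, false_positives: u32, false_negatives: u32) -> Metrics {
        let tp = f64::from(true_positives);
        let extracted = tp + f64::from(false_positives);
        let expected = tp + f64::from(false_negatives);

        let precision = if extracted == 0.0 { 1.0 } else { tp / extracted };
        let recall = if expected == 0.0 { 1.0 } else { tp / expected };

        // Harmonic mean; defined as 0.0 when both components are zero.
        let f1 = if precision + recall == 0.0 {
            0.0
        } else {
            2.0 * precision * recall / (precision + recall)
        };

        Metrics { precision, recall, f1 }
    }
}
```

For example, 3 matched, 1 spurious, and 3 missed claims give precision 0.75, recall 0.5, and F1 0.6, which is easy to check by hand against the "within 0.01" success criterion.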

Phase 4: Structured Decoding (1 day)

| Task | File | Description |
|------|------|-------------|
| 4.1 | src/llm/client.rs | Gemini Response Schema integration |

Acceptance: LLM always returns valid JSON, no parse failures.

Phase 5: CLI & Corpus (2 days)

| Task | File | Description |
|------|------|-------------|
| 5.1 | src/cli.rs | Add EvalCommands with --force, --mode robust |
| 5.2 | src/handlers/eval.rs | Implement all eval command handlers |
| 5.3 | src/eval/corpus.rs | Corpus generation from scans |

Acceptance: aphoria eval run works end-to-end, corpus generation functional.

Phase 6: Reports & Polish (2 days)

| Task | File | Description |
|------|------|-------------|
| 6.1 | src/eval/report.rs | Table/JSON formats with rationale in failures |
| 6.2 | src/eval/report.rs | Stability metrics display |
| 6.3 | tests/llm_fixtures/ | Expand to 25+ fixtures |

Acceptance: Reports show rationale on missed claims, stability metrics visible.

Total: 11 days


Fixture Seed List

Initial 10 fixtures to create:

| ID | Category | Name | Tests |
|----|----------|------|-------|
| tls-001 | tls | Disabled verification (requests) | verify=False |
| tls-002 | tls | Deprecated TLS version | min_version="TLSv1" |
| jwt-001 | jwt | Algorithm none | alg: "none" |
| jwt-002 | jwt | Skip signature verification | verify=False |
| secrets-001 | secrets | Hardcoded API key | API_KEY = "sk_..." |
| secrets-002 | secrets | High entropy token | Shannon entropy > 4.5 |
| auth-001 | auth | Debug auth bypass | X-Debug-Auth header |
| negative-001 | negative | Safe TLS config | verify=True (no claims) |
| negative-002 | negative | Env var secrets | os.getenv() (no claims) |
| edge-001 | edge | Empty file | Empty content (no claims) |

Configuration

Add to aphoria.toml:

[eval]
# Save observations during scans
save_observations = false

# SQLite database path
database_path = "~/.aphoria/eval/observations.db"

# Default fixtures directory
fixtures_dir = "tests/llm_fixtures"

# Regression threshold (5% = 0.05)
regression_threshold = 0.05

# Maximum concurrent LLM calls
max_concurrent = 5

# Retention: days to keep observations
retention_days = 30

# Retention: max observations to keep regardless of age
retention_max_count = 1000

# Rate limit: initial backoff delay (ms)
rate_limit_initial_delay_ms = 500

# Rate limit: max retries before failing
rate_limit_max_retries = 5
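
To show how the two rate-limit settings could interact, the sketch below computes a retry schedule from rate_limit_initial_delay_ms and rate_limit_max_retries. Doubling the delay on each attempt is an assumption (the spec says only "exponential backoff"), jitter is omitted, and `backoff_delay` and `retry_schedule` are hypothetical helpers.

```rust
use std::time::Duration;

/// Delay before retry number `attempt` (0-based), doubling each time.
/// Assumption: base-2 growth with no jitter; saturates instead of overflowing.
pub fn backoff_delay(initial_delay_ms: u64, attempt: u32) -> Duration {
    let factor = 2u64.saturating_pow(attempt);
    Duration::from_millis(initial_delay_ms.saturating_mul(factor))
}

/// Full list of delays tried before a 429 is reported as a failure.
pub fn retry_schedule(initial_delay_ms: u64, max_retries: u32) -> Vec<Duration> {
    (0..max_retries)
        .map(|attempt| backoff_delay(initial_delay_ms, attempt))
        .collect()
}
```

Under these assumptions the defaults above (500 ms, 5 retries) wait 0.5 s, 1 s, 2 s, 4 s, and 8 s, so a request gives up after roughly 15.5 seconds of cumulative backoff.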

Success Criteria

| Metric | Target |
|--------|--------|
| Can run aphoria eval run | Works |
| Baseline comparison | Detects 5% regression |
| Fixtures load correctly | 100% valid fixtures load |
| Metrics match manual calculation | Within 0.01 |
| Report is readable | Human-verified |
| Type coercion works | "true" matches true |
| Perturbation mode | Stability metric computed |
| Rate limit handling | Survives 429 burst |

File Structure After Implementation

applications/aphoria/
├── src/
│   ├── eval/
│   │   ├── mod.rs
│   │   ├── db.rs          # SQLite storage
│   │   ├── corpus.rs      # Synthetic fixture generation
│   │   ├── fixture.rs     # Fixture loading
│   │   ├── harness.rs     # Evaluation engine
│   │   ├── matcher.rs     # Claim matching (type-coercive)
│   │   ├── metrics.rs     # Precision/recall
│   │   ├── perturbation.rs # Perturbation testing
│   │   └── report.rs      # Output formatting
│   ├── llm/
│   │   ├── observation.rs # Observation logging
│   │   └── ...
│   ├── handlers/
│   │   ├── eval.rs        # Eval command handlers
│   │   └── ...
│   └── ...
└── tests/
    └── llm_fixtures/
        ├── manifest.toml
        ├── tls/
        ├── jwt/
        ├── secrets/
        ├── auth/
        ├── negative/
        └── edge/

Verification

# Build
cargo build -p aphoria

# Test
cargo test -p aphoria

# SQLite retention check
sqlite3 ~/.aphoria/eval/observations.db "SELECT COUNT(*) FROM observations"

# Bounded concurrency (watch logs)
RUST_LOG=debug aphoria eval run --mode live 2>&1 | grep "permit"

# Perturbation mode
aphoria eval run --mode robust --max-fixtures 5

# Corpus generation
aphoria eval generate-corpus --scan-path ./test-project --output-dir ./test-fixtures

Open Questions Resolved

| Question | Decision |
|----------|----------|
| Baseline storage | In manifest.toml (simple, versioned with fixtures) |
| Observation storage | SQLite with 30-day/1000-count retention |
| Matching strictness | Tail-path + type-coercive matching |
| Mock vs Live in CI | Cached mode for PR, live for manual |
| Parallelism | Bounded (5 default) via Tokio Semaphore |
| Baseline safety | Requires --force flag |
| Structured output | Gemini Response Schema |

Ready for implementation.