# LLM Evaluation Implementation Spec

> **Status:** Implementation Ready
> **Date:** 2026-02-05
> **Scope:** Aphoria Phase 7.8

---

## What We Have

The current LLM extraction pipeline (`src/llm/`):

```
src/llm/
├── mod.rs          # Module exports
├── client.rs       # GeminiClient - HTTP client for API
├── extractor.rs    # LlmExtractor - orchestration, budget, filtering
├── prompt.rs       # build_system_prompt() with ontology
├── ontology.rs     # OntologyVocabulary from authority assertions
├── cache.rs        # LlmCache - BLAKE3 content hash caching
├── types.rs        # LlmClaim, LlmClaimsResponse
└── prompts.rs      # DEFAULT_SYSTEM_PROMPT, helpers
```

**Key characteristics:**

- Uses Gemini API (configured via `GEMINI_API_KEY`)
- Ontology-aware prompts constrain output to authority vocabulary
- Caches by `BLAKE3(prompt + content + model)` (prompt hash included)
- Token budget tracking (`max_tokens_per_scan`, `max_tokens_per_file`)
- Selective triggering (high-value files only)
- Temperature 0.1 for consistency
- Structured decoding via Gemini Response Schema

---

## What We Need

### 1. Observation Storage (SQLite)

**Problem:** We can't see what the LLM returned or how claims were scored. JSON files are inefficient for querying.

**Solution:** SQLite database with retention policies.
**Location:** `~/.aphoria/eval/observations.db`

```rust
// src/eval/db.rs
use std::path::Path;

use chrono::{Duration, Utc};
use rusqlite::{params, Connection};

pub struct EvalDatabase {
    conn: Connection,
}

impl EvalDatabase {
    pub fn open(path: &Path) -> Result<Self> {
        let conn = Connection::open(path)?;
        conn.execute_batch(r#"
            CREATE TABLE IF NOT EXISTS observations (
                id TEXT PRIMARY KEY,
                timestamp TEXT NOT NULL,
                prompt_version TEXT NOT NULL,
                prompt_hash TEXT NOT NULL,
                model TEXT NOT NULL,
                input_hash TEXT NOT NULL,
                file_path TEXT NOT NULL,
                language TEXT NOT NULL,
                content_length INTEGER NOT NULL,
                raw_response TEXT NOT NULL,
                parsed_claims TEXT NOT NULL,  -- JSON
                final_claims TEXT NOT NULL,   -- JSON
                input_tokens INTEGER NOT NULL,
                output_tokens INTEGER NOT NULL,
                parse_success INTEGER NOT NULL,
                parse_error TEXT,
                cache_hit INTEGER NOT NULL,
                latency_ms INTEGER NOT NULL
            );
            CREATE INDEX IF NOT EXISTS idx_obs_timestamp ON observations(timestamp);
            CREATE INDEX IF NOT EXISTS idx_obs_prompt_hash ON observations(prompt_hash);
        "#)?;
        Ok(Self { conn })
    }

    /// Enforce retention: keep everything from the last 30 days plus the
    /// newest 1000 rows, whichever set is larger. Returns the rows deleted.
    pub fn enforce_retention(&self) -> Result<usize> {
        let cutoff = Utc::now() - Duration::days(30);
        let deleted = self.conn.execute(
            "DELETE FROM observations
             WHERE timestamp < ?1
             AND id NOT IN (SELECT id FROM observations ORDER BY timestamp DESC LIMIT 1000)",
            params![cutoff.to_rfc3339()],
        )?;
        Ok(deleted)
    }

    pub fn insert(&self, obs: &Observation) -> Result<()> {
        self.conn.execute(
            r#"INSERT INTO observations (
                id, timestamp, prompt_version, prompt_hash, model, input_hash,
                file_path, language, content_length, raw_response,
                parsed_claims, final_claims, input_tokens, output_tokens,
                parse_success, parse_error, cache_hit, latency_ms
            ) VALUES (?1, ?2, ?3, ?4, ?5, ?6, ?7, ?8, ?9, ?10, ?11, ?12, ?13, ?14, ?15, ?16, ?17, ?18)"#,
            params![
                obs.id.to_string(),
                obs.timestamp.to_rfc3339(),
                obs.prompt_version,
                obs.prompt_hash,
                obs.model,
                obs.input_hash,
                obs.file_path,
                obs.language,
                obs.content_length,
                obs.raw_response,
                // Claim lists are stored as JSON text columns.
                serde_json::to_string(&obs.parsed_claims)?,
                serde_json::to_string(&obs.final_claims)?,
                obs.input_tokens,
                obs.output_tokens,
                obs.parse_success,
                obs.parse_error,
                obs.cache_hit,
                obs.latency_ms,
            ],
        )?;
        Ok(())
    }
}
```

**Observation struct:**

```rust
// src/llm/observation.rs
use chrono::{DateTime, Utc};
use serde::{Deserialize, Serialize};
use uuid::Uuid;

/// A logged observation from an LLM extraction.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct Observation {
    /// Unique ID for this observation.
    pub id: Uuid,
    /// When this extraction occurred.
    pub timestamp: DateTime<Utc>,
    /// Prompt version (from PROMPT_VERSION constant).
    pub prompt_version: String,
    /// BLAKE3 hash of the prompt template.
    pub prompt_hash: String,
    /// Model used (e.g., "gemini-2.0-flash").
    pub model: String,
    /// BLAKE3 hash of input content.
    pub input_hash: String,
    /// File path (relative to scan root).
    pub file_path: String,
    /// Language detected.
    pub language: String,
    /// Content length in bytes.
    pub content_length: usize,
    /// Raw LLM response (JSON string).
    pub raw_response: String,
    /// Parsed claims (after confidence filter, before ontology validation).
    pub parsed_claims: Vec<ParsedClaim>,
    /// Final claims (after ontology validation).
    pub final_claims: Vec<FinalClaim>,
    /// Token usage.
    pub input_tokens: usize,
    pub output_tokens: usize,
    /// Whether parsing succeeded.
    pub parse_success: bool,
    /// Parse error if any.
    pub parse_error: Option<String>,
    /// Cache status.
    pub cache_hit: bool,
    /// Latency in milliseconds.
    pub latency_ms: u64,
}

/// A claim as parsed from LLM JSON (before validation).
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ParsedClaim {
    pub subject: String,
    pub predicate: String,
    pub value: serde_json::Value,
    pub confidence: f32,
    pub line: usize,
}

/// A claim after ontology validation.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct FinalClaim {
    pub concept_path: String,
    pub predicate: String,
    pub value: serde_json::Value,
    pub confidence: f32,
    pub matched_ontology: bool,
    pub fuzzy_matched: bool,
}
```

**Integration point:** Modify `LlmExtractor::extract()` to emit observations.

---

### 2. Cache Key Includes Prompt Hash

**Problem:** Cache doesn't invalidate when prompt changes.

**Solution:** Include prompt hash in cache key.

```rust
// src/llm/cache.rs
impl LlmCache {
    fn compute_key(content: &str, model: &str, prompt: &str) -> String {
        let mut hasher = blake3::Hasher::new();
        hasher.update(content.as_bytes());
        hasher.update(model.as_bytes());
        hasher.update(prompt.as_bytes()); // NEW: prompt included
        hasher.finalize().to_hex().to_string()
    }
}
```

---

### 3. Bounded Concurrency

**Problem:** Sequential execution is slow; unbounded parallelism hits rate limits.

**Solution:** Tokio Semaphore with configurable concurrency.

```rust
// src/eval/harness.rs
use std::sync::Arc;
use tokio::sync::Semaphore;

pub struct EvalHarness {
    extractor: LlmExtractor,
    semaphore: Arc<Semaphore>,
}

impl EvalHarness {
    pub fn new(extractor: LlmExtractor, max_concurrent: usize) -> Self {
        Self {
            extractor,
            semaphore: Arc::new(Semaphore::new(max_concurrent)),
        }
    }

    pub async fn run(&self, fixtures: Vec<Fixture>) -> EvalResult {
        let handles: Vec<_> = fixtures
            .into_iter()
            .map(|fixture| {
                let sem = self.semaphore.clone();
                let extractor = self.extractor.clone();
                tokio::spawn(async move {
                    // The permit is held until the fixture finishes, so at most
                    // `max_concurrent` LLM calls are in flight at once.
                    let _permit = sem.acquire().await?;
                    Self::run_fixture(&extractor, fixture).await
                })
            })
            .collect();

        let results = futures::future::join_all(handles).await;
        // aggregate...
    }
}
```

**Default:** 5 concurrent calls (configurable via `eval.max_concurrent`)

---

### 4. Rate Limit Resilience

**Problem:** 429 errors cause evaluation failures.

**Solution:** Exponential backoff with retries.
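To gauge the worst-case wait these retries add, the sleep schedule can be computed up front. The sketch below assumes the delay doubles after each rate-limited attempt and that the final attempt returns the error instead of sleeping. `backoff_schedule` and `total_wait` are hypothetical helpers, not part of the existing client.

```rust
use std::time::Duration;

/// Sleeps performed between attempts for a doubling backoff.
/// Assumption: the last attempt fails fast, so there are `max_attempts - 1` sleeps.
fn backoff_schedule(initial: Duration, max_attempts: u32) -> Vec<Duration> {
    (0..max_attempts.saturating_sub(1))
        .map(|attempt| initial * 2u32.pow(attempt))
        .collect()
}

/// Total time spent sleeping if every attempt is rate limited.
fn total_wait(schedule: &[Duration]) -> Duration {
    schedule.iter().sum()
}
```

With the defaults listed under Configuration (500 ms initial delay, 5 attempts), `backoff_schedule(Duration::from_millis(500), 5)` yields sleeps of 500 ms, 1 s, 2 s and 4 s, so a request that is rate limited on every attempt fails after about 7.5 s of waiting plus request time.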
```rust
// src/llm/client.rs
use std::time::Duration;

impl GeminiClient {
    async fn call_with_retry(&self, request: &Request) -> Result<Response> {
        let mut delay = Duration::from_millis(500);
        let max_retries = 5;

        for attempt in 0..max_retries {
            match self.call(request).await {
                Ok(response) => return Ok(response),
                Err(e) if e.is_rate_limit() => {
                    // Last attempt: surface the error instead of sleeping again.
                    if attempt == max_retries - 1 {
                        return Err(e);
                    }
                    tracing::warn!(
                        attempt,
                        delay_ms = delay.as_millis(),
                        "Rate limited, backing off"
                    );
                    tokio::time::sleep(delay).await;
                    delay *= 2;
                }
                // Non-rate-limit errors are not retried.
                Err(e) => return Err(e),
            }
        }
        unreachable!()
    }
}
```

---

### 5. Fixture Format

**Problem:** No standardized test cases to validate prompt changes.

**Solution:** TOML fixtures with input, expected output, and rationale.

```toml
# tests/llm_fixtures/tls/disabled_verification.toml

[metadata]
id = "tls-001"
name = "TLS verification disabled in Python requests"
category = "tls"
language = "python"
created = "2026-02-05"

[input]
# The code to analyze
content = '''
import requests

def fetch_data(url):
    # Disable SSL verification for internal services
    response = requests.get(url, verify=False)
    return response.json()
'''

[expected]
# What the LLM MUST extract (recall test)
must_contain = [
    { subject = "tls/cert_verification", predicate = "enabled", value = false, rationale = "requests.get(verify=False) explicitly disables TLS verification" },
]

# What the LLM MUST NOT extract (precision test)
must_not_contain = [
    { subject = "tls/cert_verification", predicate = "enabled", value = true },
]

[scoring]
# How important is this fixture?
weight = 1.0
# Expected minimum confidence from LLM
min_confidence = 0.8
```

**ExpectedClaim with rationale:**

```rust
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ExpectedClaim {
    pub subject: String,
    pub predicate: String,
    pub value: serde_json::Value,
    /// Optional explanation for why this claim is expected (shown on failure)
    #[serde(default)]
    pub rationale: Option<String>,
}
```

**Fixture categories:**

```
tests/llm_fixtures/
├── manifest.toml              # Index + baseline metrics
├── tls/                       # TLS/SSL fixtures
│   ├── disabled_verification.toml
│   ├── deprecated_version.toml
│   └── pinning_bypass.toml
├── jwt/                       # JWT fixtures
├── secrets/                   # Hardcoded secrets fixtures
├── auth/                      # Auth bypass fixtures
├── negative/                  # Safe code (expect NO claims)
│   ├── safe_tls.toml
│   └── env_var_secrets.toml
└── edge/                      # Edge cases
    ├── empty_file.toml
    └── huge_file.toml
```

**Manifest:**

```toml
# tests/llm_fixtures/manifest.toml

[corpus]
version = "1.0.0"
total_fixtures = 35

[baseline]
# Known-good metrics from last successful run
precision = 0.85
recall = 0.78
f1 = 0.81
prompt_version = "1.0.0"
model = "gemini-2.0-flash"
measured_at = "2026-02-05T10:30:00Z"
```

---

### 6. Evaluation Harness

**Problem:** No way to run fixtures and compute metrics.

**Solution:** Evaluation engine in `src/eval/`.

```rust
// src/eval/mod.rs
mod db;
mod fixture;
mod harness;
mod matcher;
mod metrics;
mod perturbation;
mod report;

pub use db::EvalDatabase;
pub use fixture::{Fixture, FixtureLoader};
pub use harness::{EvalConfig, EvalHarness, EvalResult};
pub use metrics::{Metrics, CategoryMetrics};
pub use perturbation::Perturbator;
pub use report::{Report, ReportFormat};
```

**Core types:**

```rust
// src/eval/harness.rs
pub struct EvalConfig {
    /// Path to fixtures directory.
    pub fixtures_dir: PathBuf,
    /// Categories to run (None = all).
    pub categories: Option<Vec<String>>,
    /// Max fixtures to run (for smoke tests).
    pub max_fixtures: Option<usize>,
    /// Evaluation mode.
    pub mode: EvalMode,
    /// Baseline to compare against.
    pub baseline: Option<PathBuf>,
    /// Save observations to database.
    pub save_observations: bool,
    /// Maximum concurrent LLM calls.
    pub max_concurrent: usize,
}

pub enum EvalMode {
    /// Use real LLM API.
    Live,
    /// Use cached responses only (fails if not cached).
    Cached,
    /// Skip LLM, return empty claims (for testing harness).
    Mock,
    /// Perturbation testing for stability.
    Robust,
}

pub struct EvalResult {
    pub run_id: Uuid,
    pub started_at: DateTime<Utc>,
    pub completed_at: DateTime<Utc>,
    pub metrics: Metrics,
    pub fixture_results: Vec<FixtureResult>,
    pub baseline_comparison: Option<BaselineComparison>, // type name assumed
    pub verdict: EvalVerdict,
    /// Stability score (only in Robust mode)
    pub stability: Option<f64>,
}

pub enum EvalVerdict {
    Pass,
    Regression { regressions: Vec<String> }, // element type assumed (e.g. fixture IDs)
    Error { message: String },
}
```

**Metrics calculation:**

```rust
// src/eval/metrics.rs
use std::collections::HashMap;

pub struct Metrics {
    /// True positives: expected claims that were extracted.
    pub true_positives: usize,
    /// False positives: extracted claims that weren't expected.
    pub false_positives: usize,
    /// False negatives: expected claims that weren't extracted.
    pub false_negatives: usize,
    /// Precision = TP / (TP + FP)
    pub precision: f64,
    /// Recall = TP / (TP + FN)
    pub recall: f64,
    /// F1 = 2 * (P * R) / (P + R)
    pub f1: f64,
    /// Total fixtures.
    pub total_fixtures: usize,
    /// Fixtures that passed.
    pub passed: usize,
    /// Fixtures that failed.
    pub failed: usize,
    /// Total tokens used.
    pub total_tokens: u64,
    /// Estimated cost (USD).
    pub estimated_cost: f64,
    /// By category.
    pub by_category: HashMap<String, CategoryMetrics>,
}

impl Metrics {
    pub fn compute(results: &[FixtureResult]) -> Self {
        // Counts are pooled across fixtures (micro-averaged) before the ratios are taken.
        let mut tp = 0;
        let mut fp = 0;
        let mut fn_ = 0;

        for result in results {
            tp += result.true_positives;
            fp += result.false_positives;
            fn_ += result.false_negatives;
        }

        // Each ratio falls back to 0.0 when its denominator is zero.
        let precision = if tp + fp > 0 {
            tp as f64 / (tp + fp) as f64
        } else {
            0.0
        };
        let recall = if tp + fn_ > 0 {
            tp as f64 / (tp + fn_) as f64
        } else {
            0.0
        };
        let f1 = if precision + recall > 0.0 {
            2.0 * precision * recall / (precision + recall)
        } else {
            0.0
        };

        // ...
        // rest of computation
    }
}
```

---

### 7. Hybrid Type-Coercive Matching

**Problem:** Strict type matching misses semantically equivalent values.

**Solution:** Coerce strings to booleans/numbers when reasonable.

```rust
// src/eval/matcher.rs
pub struct ClaimMatcher {
    /// Tolerance for confidence comparison.
    pub confidence_tolerance: f32,
}

impl ClaimMatcher {
    /// Check if extracted claims satisfy must_contain requirements.
    pub fn check_must_contain(
        &self,
        extracted: &[Observation],
        expected: &[ExpectedClaim],
    ) -> MatchResult {
        let mut matched = vec![];
        let mut unmatched = vec![];

        for exp in expected {
            if let Some(claim) = self.find_matching_claim(extracted, exp) {
                matched.push((exp.clone(), claim.clone()));
            } else {
                unmatched.push(exp.clone());
            }
        }

        MatchResult { matched, unmatched }
    }

    /// Check if any extracted claim matches (for must_not_contain).
    pub fn check_must_not_contain(
        &self,
        extracted: &[Observation],
        forbidden: &[ExpectedClaim],
    ) -> Vec<(ExpectedClaim, Observation)> {
        let mut violations = vec![];

        for forbid in forbidden {
            if let Some(claim) = self.find_matching_claim(extracted, forbid) {
                violations.push((forbid.clone(), claim.clone()));
            }
        }

        violations
    }

    fn find_matching_claim(
        &self,
        extracted: &[Observation],
        expected: &ExpectedClaim,
    ) -> Option<&Observation> {
        // A match requires subject, predicate, and value to all agree.
        extracted.iter().find(|claim| {
            self.subject_matches(&claim.concept_path, &expected.subject)
                && claim.predicate == expected.predicate
                && self.value_matches(&claim.value, &expected.value)
        })
    }

    fn subject_matches(&self, extracted: &str, expected: &str) -> bool {
        // Allow matching on tail path (last 2 segments)
        let ext_tail = extracted.split('/').rev().take(2).collect::<Vec<_>>();
        let exp_tail = expected.split('/').rev().take(2).collect::<Vec<_>>();
        ext_tail == exp_tail
    }

    fn value_matches(&self, extracted: &ObjectValue, expected: &serde_json::Value) -> bool {
        match (extracted, expected) {
            // Direct matches
            (ObjectValue::Boolean(e), serde_json::Value::Bool(x)) => e == x,
            (ObjectValue::Number(e),
             serde_json::Value::Number(x)) => {
                x.as_f64().map(|n| (e - n).abs() < 0.001).unwrap_or(false)
            }
            (ObjectValue::Text(e), serde_json::Value::String(x)) => e == x,

            // Coercion: string -> boolean
            (ObjectValue::Boolean(e), serde_json::Value::String(s)) => {
                self.coerce_to_bool(s).map(|b| *e == b).unwrap_or(false)
            }
            (ObjectValue::Text(e), serde_json::Value::Bool(x)) => {
                self.coerce_to_bool(e).map(|b| b == *x).unwrap_or(false)
            }

            // Coercion: string -> number
            (ObjectValue::Number(e), serde_json::Value::String(s)) => {
                s.parse::<f64>().map(|n| (e - n).abs() < 0.001).ok().unwrap_or(false)
            }

            _ => false,
        }
    }

    /// Map common truthy/falsy spellings to a bool (case-insensitive).
    fn coerce_to_bool(&self, s: &str) -> Option<bool> {
        match s.to_lowercase().as_str() {
            "true" | "yes" | "on" | "enabled" | "1" => Some(true),
            "false" | "no" | "off" | "disabled" | "0" => Some(false),
            _ => None,
        }
    }
}
```

---

### 8. Perturbation Testing Mode

**Problem:** Need to verify LLM consistency across minor input variations.

**Solution:** Perturbation mode that tests stability.

```rust
// src/eval/perturbation.rs
use crate::Language;

pub struct Perturbator;

impl Perturbator {
    /// Generate perturbed variants of input content.
    /// The unmodified input is always the first variant.
    pub fn perturb(content: &str, language: Language) -> Vec<String> {
        let mut variants = vec![content.to_string()];
        variants.push(Self::add_trailing_whitespace(content));
        variants.push(Self::normalize_indentation(content));
        variants.push(Self::add_innocuous_comments(content, language));
        variants.push(Self::remove_comments(content, language));
        variants
    }

    fn add_trailing_whitespace(content: &str) -> String {
        content
            .lines()
            .map(|line| format!("{} ", line))
            .collect::<Vec<_>>()
            .join("\n")
    }

    fn normalize_indentation(content: &str) -> String {
        // Convert tabs to spaces (four per tab assumed here)
        content.replace('\t', "    ")
    }

    fn add_innocuous_comments(content: &str, language: Language) -> String {
        let comment_prefix = match language {
            Language::Python => "#",
            Language::JavaScript | Language::TypeScript | Language::Rust | Language::Go => "//",
            _ => "#",
        };
        format!("{} Auto-generated file\n{}",
            comment_prefix, content)
    }

    fn remove_comments(content: &str, language: Language) -> String {
        // Simple single-line comment removal (whole-line comments only)
        let comment_prefix = match language {
            Language::Python => "#",
            Language::JavaScript | Language::TypeScript | Language::Rust | Language::Go => "//",
            _ => "#",
        };
        content
            .lines()
            .filter(|line| !line.trim().starts_with(comment_prefix))
            .collect::<Vec<_>>()
            .join("\n")
    }
}
```

**Stability metric:** % of perturbations producing identical claims.

CLI: `aphoria eval run --mode robust`

---

### 9. Structured Decoding (Gemini Response Schema)

**Problem:** Free-form JSON parsing can fail.

**Solution:** Use Gemini's `response_schema` to constrain the JSON structure.

```rust
// src/llm/client.rs
use maplit::hashmap; // assumption: `hashmap!` comes from the maplit crate

impl GeminiClient {
    fn build_request(&self, content: &str, prompt: &str) -> Request {
        Request {
            contents: vec![Content {
                role: "user".to_string(),
                parts: vec![Part::Text(content.to_string())],
            }],
            generation_config: GenerationConfig {
                temperature: 0.1,
                response_mime_type: "application/json".to_string(),
                response_schema: Some(self.claims_schema()),
            },
        }
    }

    fn claims_schema(&self) -> Schema {
        Schema {
            type_: "object".to_string(),
            properties: hashmap! {
                "claims".to_string() => Schema {
                    type_: "array".to_string(),
                    items: Some(Box::new(Schema {
                        type_: "object".to_string(),
                        properties: hashmap! {
                            "subject".to_string() => Schema { type_: "string".to_string(), ..Default::default() },
                            "predicate".to_string() => Schema { type_: "string".to_string(), ..Default::default() },
                            // "any" is a placeholder: confirm which type name the Gemini
                            // schema accepts for untyped values before relying on it.
                            "value".to_string() => Schema { type_: "any".to_string(), ..Default::default() },
                            "confidence".to_string() => Schema { type_: "number".to_string(), ..Default::default() },
                            "line".to_string() => Schema { type_: "integer".to_string(), ..Default::default() },
                        },
                        // Same element type (String) as the outer `required` list.
                        required: ["subject", "predicate", "value", "confidence"]
                            .map(String::from)
                            .to_vec(),
                        ..Default::default()
                    })),
                    ..Default::default()
                },
            },
            required: vec!["claims".to_string()],
            ..Default::default()
        }
    }
}
```

**Benefit:** Output is constrained to the claims schema, which should eliminate malformed-JSON parse failures.

---

### 10. Synthetic Corpus Generation

**Problem:** Manual fixture creation is slow.

**Solution:** Generate fixtures from real scans with human review.

```bash
aphoria eval generate-corpus \
    --scan-path /path/to/codebase \
    --output-dir tests/llm_fixtures/synthetic \
    --sample-size 50
```

```rust
// src/eval/corpus.rs
use std::collections::HashMap;
use std::path::{Path, PathBuf};

pub struct CorpusGenerator {
    extractor: LlmExtractor,
}

impl CorpusGenerator {
    pub async fn generate(
        &self,
        scan_path: &Path,
        output_dir: &Path,
        sample_size: usize,
    ) -> Result<Vec<PathBuf>> {
        let findings = self.extractor.scan(scan_path).await?;
        let sample = self.stratified_sample(&findings, sample_size);

        let mut fixtures = vec![];
        for finding in sample {
            fixtures.push(self.create_fixture(&finding, output_dir)?);
        }
        Ok(fixtures)
    }

    fn stratified_sample(&self, findings: &[Finding], size: usize) -> Vec<&Finding> {
        // Take an equal share from each category
        let mut by_category: HashMap<&str, Vec<&Finding>> = HashMap::new();
        for f in findings {
            by_category.entry(&f.category).or_default().push(f);
        }

        // At least one per category, so a small `size` is not rounded down to zero;
        // the final truncate still caps the total at `size`.
        let per_category = (size / by_category.len().max(1)).max(1);
        let mut sample = vec![];
        for (_, items) in by_category {
            sample.extend(items.iter().take(per_category));
        }
        sample.truncate(size);
        sample
    }

    fn create_fixture(&self, finding: &Finding, output_dir: &Path) -> Result<PathBuf> {
        let fixture = Fixture {
            metadata: FixtureMetadata {
                id: format!("auto-{}", Uuid::new_v4()),
                name: finding.description.clone(),
                category: finding.category.clone(),
                language: finding.language.to_string(),
                created: Utc::now().date_naive().to_string(),
            },
            input: FixtureInput {
                content: finding.code_snippet.clone(),
            },
            expected: FixtureExpected {
                must_contain: vec![ExpectedClaim {
                    subject: finding.subject.clone(),
                    predicate: finding.predicate.clone(),
                    value: finding.value.clone(),
                    rationale: Some("Auto-generated - requires human review".to_string()),
                }],
                must_not_contain: vec![],
            },
            scoring: FixtureScoring {
                weight: 1.0,
                min_confidence: 0.7,
            },
        };

        // One TOML file per fixture, grouped by category directory.
        let path = output_dir
            .join(&finding.category)
            .join(format!("{}.toml",
                fixture.metadata.id));
        std::fs::create_dir_all(path.parent().unwrap())?;
        std::fs::write(&path, toml::to_string_pretty(&fixture)?)?;

        Ok(path)
    }
}
```

**Workflow:** Scan -> Human review -> Commit to corpus

---

### 11. CLI Commands

**Problem:** No way to run evaluations from command line.

**Solution:** Add `aphoria eval` subcommand.

```rust
// src/cli.rs additions

#[derive(Subcommand)]
pub enum Commands {
    // ... existing commands ...

    /// Evaluate LLM prompt effectiveness
    Eval {
        #[command(subcommand)]
        command: EvalCommands,
    },
}

#[derive(Subcommand)]
pub enum EvalCommands {
    /// Run evaluation against fixtures
    Run {
        /// Path to fixtures directory
        #[arg(long, default_value = "tests/llm_fixtures")]
        fixtures: PathBuf,

        /// Categories to run (comma-separated)
        #[arg(long)]
        categories: Option<String>,

        /// Maximum fixtures to run
        #[arg(long)]
        max_fixtures: Option<usize>,

        /// Evaluation mode: live, cached, mock, robust
        #[arg(long, default_value = "cached")]
        mode: String,

        /// Baseline file to compare against
        #[arg(long)]
        baseline: Option<PathBuf>,

        /// Exit with code 1 if regression detected
        #[arg(long)]
        fail_on_regression: bool,

        /// Regression threshold (default: 0.05 = 5%)
        #[arg(long, default_value = "0.05")]
        threshold: f64,

        /// Save observation logs
        #[arg(long)]
        save_observations: bool,

        /// Output format: table, json, markdown
        #[arg(long, default_value = "table")]
        format: String,
    },

    /// Show current baseline metrics
    Baseline {
        /// Path to fixtures directory
        #[arg(long, default_value = "tests/llm_fixtures")]
        fixtures: PathBuf,
    },

    /// Update baseline from latest run
    UpdateBaseline {
        /// Run ID to use as new baseline
        #[arg(long)]
        run_id: Option<String>,

        /// Path to fixtures directory
        #[arg(long, default_value = "tests/llm_fixtures")]
        fixtures: PathBuf,

        /// Must be set to overwrite the baseline. Not marked `required`, so the
        /// handler can print the current baseline when the flag is missing.
        #[arg(long)]
        force: bool,
    },

    /// List fixtures
    ListFixtures {
        /// Path to fixtures directory
        #[arg(long, default_value = "tests/llm_fixtures")]
        fixtures: PathBuf,

        /// Filter by category
        #[arg(long)]
        category: Option<String>,
    },

    /// Validate fixture format
    ValidateFixtures {
        /// Path to fixtures directory
        #[arg(long, default_value = "tests/llm_fixtures")]
        fixtures: PathBuf,
    },

    /// Generate fixtures from real scans
    GenerateCorpus {
        /// Path to codebase to scan
        #[arg(long)]
        scan_path: PathBuf,

        /// Output directory for generated fixtures
        #[arg(long)]
        output_dir: PathBuf,

        /// Number of fixtures to generate
        #[arg(long, default_value = "50")]
        sample_size: usize,
    },
}
```

**Usage examples:**

```bash
# Run smoke test (cached responses, fast)
aphoria eval run --mode cached --max-fixtures 10

# Run full evaluation (live API calls)
aphoria eval run --mode live --save-observations

# Run with baseline comparison
aphoria eval run --baseline tests/llm_fixtures/manifest.toml --fail-on-regression

# Run perturbation testing
aphoria eval run --mode robust --max-fixtures 5

# Show current baseline
aphoria eval baseline

# Update baseline (requires --force)
aphoria eval update-baseline --force

# List fixtures
aphoria eval list-fixtures --category tls

# Validate fixture format
aphoria eval validate-fixtures

# Generate fixtures from real codebase
aphoria eval generate-corpus --scan-path ./my-project --output-dir ./test-fixtures
```

**Baseline safety:** Without `--force`, `update-baseline` leaves the baseline untouched and shows:

```
Current baseline: precision=0.85, recall=0.78, f1=0.81 (2026-02-05)
To update, re-run with --force
```

---

### 12. Report Output

**Problem:** Need human-readable and machine-readable output.

**Solution:** Multiple report formats.
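The Delta and Status columns in these reports come from comparing each metric against the baseline. The sketch below shows one way to do that, assuming three states: unchanged or improved, a drop within `--threshold`, and a drop beyond it. `MetricStatus` and `classify` are hypothetical names, not part of the existing code.

```rust
/// Outcome of comparing one metric against its baseline value.
#[derive(Debug, PartialEq)]
enum MetricStatus {
    /// Metric is unchanged or improved.
    Stable,
    /// Metric dropped, but by no more than the threshold.
    Review,
    /// Metric dropped by more than the threshold.
    Regression,
}

/// Returns the signed delta (current - baseline) and its status.
fn classify(current: f64, baseline: f64, threshold: f64) -> (f64, MetricStatus) {
    let delta = current - baseline;
    let status = if delta >= 0.0 {
        MetricStatus::Stable
    } else if -delta <= threshold {
        MetricStatus::Review
    } else {
        MetricStatus::Regression
    };
    (delta, status)
}
```

With a threshold of 0.05, recall moving from 0.78 to 0.76 is classified as `Review`, while a drop to 0.70 would be a `Regression`. Values that sit exactly on the threshold are sensitive to floating-point rounding, so a real implementation may want a small epsilon.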
**Table format (default):**

```
╭────────────────────────────────────────────────────────────────────╮
│                    LLM Prompt Evaluation Report                    │
├────────────────────────────────────────────────────────────────────┤
│ Run ID: abc123-def456                                              │
│ Date:   2026-02-05 14:30:00 UTC                                    │
│ Prompt: v1.0.0                                                     │
│ Model:  gemini-2.0-flash                                           │
╰────────────────────────────────────────────────────────────────────╯

Summary
╭──────────┬─────────┬──────────┬────────┬────────╮
│ Metric   │ Current │ Baseline │ Delta  │ Status │
├──────────┼─────────┼──────────┼────────┼────────┤
│ Precision│ 0.87    │ 0.85     │ +0.02  │ ✓      │
│ Recall   │ 0.76    │ 0.78     │ -0.02  │ ⚠      │
│ F1       │ 0.81    │ 0.81     │ +0.00  │ ✓      │
╰──────────┴─────────┴──────────┴────────┴────────╯

Verdict: ⚠ REVIEW - Recall dropped by 2%

Category Breakdown
╭──────────┬──────────┬────────┬────────╮
│ Category │ Fixtures │ Passed │ Failed │
├──────────┼──────────┼────────┼────────┤
│ tls      │ 12       │ 11     │ 1      │
│ jwt      │ 8        │ 6      │ 2      │
│ secrets  │ 15       │ 14     │ 1      │
│ negative │ 10       │ 10     │ 0      │
╰──────────┴──────────┴────────┴────────╯

Regressions (2)
  - jwt-003: JWT algorithm none detection
    Expected: jwt/algorithm = "none"
    Rationale: alg:"none" bypasses signature verification entirely
    Got: Not extracted

  - tls-007: TLS version in constants (IMPROVED)
    Previously: Not extracted
    Now: tls/min_version = "1.0" ✓

Cost: 125,430 tokens ($0.12)
```

**JSON format:**

```json
{
  "run_id": "abc123-def456",
  "timestamp": "2026-02-05T14:30:00Z",
  "prompt_version": "1.0.0",
  "model": "gemini-2.0-flash",
  "metrics": {
    "precision": 0.87,
    "recall": 0.76,
    "f1": 0.81,
    "total_fixtures": 45,
    "passed": 41,
    "failed": 4
  },
  "baseline_comparison": {
    "precision_delta": 0.02,
    "recall_delta": -0.02,
    "has_regression": true,
    "regression_threshold": 0.05
  },
  "stability": 0.92,
  "verdict": "review",
  "fixture_results": [...]
}
```

---

## Implementation Plan

### Phase 1: Core Infrastructure (2 days)

| Task | File | Description |
|------|------|-------------|
| 1.1 | `src/eval/db.rs` | SQLite database with retention |
| 1.2 | `src/llm/cache.rs` | Update cache key to include prompt hash |
| 1.3 | `src/llm/client.rs` | Exponential backoff for 429s |

**Acceptance:** Database stores observations, cache invalidates on prompt change, rate limits handled gracefully.

### Phase 2: Fixture & Matching (2 days)

| Task | File | Description |
|------|------|-------------|
| 2.1 | `src/eval/fixture.rs` | Define `Fixture`, `ExpectedClaim` with rationale |
| 2.2 | `src/eval/matcher.rs` | Hybrid type-coercive matching |
| 2.3 | `tests/llm_fixtures/` | Create 10 seed fixtures |
| 2.4 | `src/eval/fixture.rs` | Add `FixtureLoader` |

**Acceptance:** Can load fixtures from TOML, matching handles type coercion.

### Phase 3: Evaluation Harness (2 days)

| Task | File | Description |
|------|------|-------------|
| 3.1 | `src/eval/harness.rs` | Bounded concurrency with Semaphore |
| 3.2 | `src/eval/metrics.rs` | Implement `Metrics::compute()` |
| 3.3 | `src/eval/harness.rs` | Baseline comparison |
| 3.4 | `src/eval/perturbation.rs` | Perturbation testing |

**Acceptance:** Can run fixtures with bounded parallelism, compute precision/recall, measure stability.

### Phase 4: Structured Decoding (1 day)

| Task | File | Description |
|------|------|-------------|
| 4.1 | `src/llm/client.rs` | Gemini Response Schema integration |

**Acceptance:** LLM always returns valid JSON, no parse failures.

### Phase 5: CLI & Corpus (2 days)

| Task | File | Description |
|------|------|-------------|
| 5.1 | `src/cli.rs` | Add `EvalCommands` with `--force`, `--mode robust` |
| 5.2 | `src/handlers/eval.rs` | Implement all eval command handlers |
| 5.3 | `src/eval/corpus.rs` | Corpus generation from scans |

**Acceptance:** `aphoria eval run` works end-to-end, corpus generation functional.
### Phase 6: Reports & Polish (2 days)

| Task | File | Description |
|------|------|-------------|
| 6.1 | `src/eval/report.rs` | Table/JSON formats with rationale in failures |
| 6.2 | `src/eval/report.rs` | Stability metrics display |
| 6.3 | `tests/llm_fixtures/` | Expand to 25+ fixtures |

**Acceptance:** Reports show rationale on missed claims, stability metrics visible.

**Total:** 11 days

---

## Fixture Seed List

Initial 10 fixtures to create:

| ID | Category | Name | Tests |
|----|----------|------|-------|
| tls-001 | tls | Disabled verification (requests) | `verify=False` |
| tls-002 | tls | Deprecated TLS version | `min_version="TLSv1"` |
| jwt-001 | jwt | Algorithm none | `alg: "none"` |
| jwt-002 | jwt | Skip signature verification | `verify=False` |
| secrets-001 | secrets | Hardcoded API key | `API_KEY = "sk_..."` |
| secrets-002 | secrets | High entropy token | Shannon entropy > 4.5 |
| auth-001 | auth | Debug auth bypass | `X-Debug-Auth` header |
| negative-001 | negative | Safe TLS config | `verify=True` (no claims) |
| negative-002 | negative | Env var secrets | `os.getenv()` (no claims) |
| edge-001 | edge | Empty file | Empty content (no claims) |

---

## Configuration

Add to `aphoria.toml`:

```toml
[eval]
# Save observations during scans
save_observations = false

# SQLite database path
database_path = "~/.aphoria/eval/observations.db"

# Default fixtures directory
fixtures_dir = "tests/llm_fixtures"

# Regression threshold (5% = 0.05)
regression_threshold = 0.05

# Maximum concurrent LLM calls
max_concurrent = 5

# Retention: days to keep observations
retention_days = 30

# Retention: max observations to keep regardless of age
retention_max_count = 1000

# Rate limit: initial backoff delay (ms)
rate_limit_initial_delay_ms = 500

# Rate limit: max retries before failing
rate_limit_max_retries = 5
```

---

## Success Criteria

| Metric | Target |
|--------|--------|
| Can run `aphoria eval run` | Works |
| Baseline comparison | Detects 5% regression |
| Fixtures load correctly | 100% valid fixtures load |
| Metrics match manual calculation | Within 0.01 |
| Report is readable | Human-verified |
| Type coercion works | "true" matches true |
| Perturbation mode | Stability metric computed |
| Rate limit handling | Survives 429 burst |

---

## File Structure After Implementation

```
applications/aphoria/
├── src/
│   ├── eval/
│   │   ├── mod.rs
│   │   ├── db.rs              # SQLite storage
│   │   ├── corpus.rs          # Synthetic fixture generation
│   │   ├── fixture.rs         # Fixture loading
│   │   ├── harness.rs         # Evaluation engine
│   │   ├── matcher.rs         # Claim matching (type-coercive)
│   │   ├── metrics.rs         # Precision/recall
│   │   ├── perturbation.rs    # Perturbation testing
│   │   └── report.rs          # Output formatting
│   ├── llm/
│   │   ├── observation.rs     # Observation logging
│   │   └── ...
│   ├── handlers/
│   │   ├── eval.rs            # Eval command handlers
│   │   └── ...
│   └── ...
└── tests/
    └── llm_fixtures/
        ├── manifest.toml
        ├── tls/
        ├── jwt/
        ├── secrets/
        ├── auth/
        ├── negative/
        └── edge/
```

---

## Verification

```bash
# Build
cargo build -p aphoria

# Test
cargo test -p aphoria

# SQLite retention check
sqlite3 ~/.aphoria/eval/observations.db "SELECT COUNT(*) FROM observations"

# Bounded concurrency (watch logs)
RUST_LOG=debug aphoria eval run --mode live 2>&1 | grep "permit"

# Perturbation mode
aphoria eval run --mode robust --max-fixtures 5

# Corpus generation
aphoria eval generate-corpus --scan-path ./test-project --output-dir ./test-fixtures
```

---

## Open Questions Resolved

| Question | Decision |
|----------|----------|
| Baseline storage | In `manifest.toml` (simple, versioned with fixtures) |
| Observation storage | SQLite with 30-day/1000-count retention |
| Matching strictness | Tail-path + type-coercive matching |
| Mock vs Live in CI | Cached mode for PR, live for manual |
| Parallelism | Bounded (5 default) via Tokio Semaphore |
| Baseline safety | Requires `--force` flag |
| Structured output | Gemini Response Schema |

---

*Ready for implementation.*