# LLM Evaluation Implementation Spec
> **Status:** Implementation Ready
> **Date:** 2026-02-05
> **Scope:** Aphoria Phase 7.8
---
## What We Have
The current LLM extraction pipeline (`src/llm/`):
```
src/llm/
├── mod.rs # Module exports
├── client.rs # GeminiClient - HTTP client for API
├── extractor.rs # LlmExtractor - orchestration, budget, filtering
├── prompt.rs # build_system_prompt() with ontology
├── ontology.rs # OntologyVocabulary from authority assertions
├── cache.rs # LlmCache - BLAKE3 content hash caching
├── types.rs # LlmClaim, LlmClaimsResponse
└── prompts.rs # DEFAULT_SYSTEM_PROMPT, helpers
```
**Key characteristics:**
- Uses Gemini API (configured via `GEMINI_API_KEY`)
- Ontology-aware prompts constrain output to authority vocabulary
- Caches by `BLAKE3(prompt + content + model)` (prompt hash included)
- Token budget tracking (`max_tokens_per_scan`, `max_tokens_per_file`)
- Selective triggering (high-value files only)
- Temperature 0.1 for consistency
- Structured decoding via Gemini Response Schema
---
## What We Need
### 1. Observation Storage (SQLite)
**Problem:** We can't see what the LLM returned or how claims were scored. JSON files are inefficient for querying.
**Solution:** SQLite database with retention policies.
**Location:** `~/.aphoria/eval/observations.db`
```rust
// src/eval/db.rs
use std::path::Path;
use chrono::{Duration, Utc};
use rusqlite::{params, Connection};
pub struct EvalDatabase {
conn: Connection,
}
impl EvalDatabase {
pub fn open(path: &Path) -> Result<Self> {
let conn = Connection::open(path)?;
conn.execute_batch(r#"
CREATE TABLE IF NOT EXISTS observations (
id TEXT PRIMARY KEY,
timestamp TEXT NOT NULL,
prompt_version TEXT NOT NULL,
prompt_hash TEXT NOT NULL,
model TEXT NOT NULL,
input_hash TEXT NOT NULL,
file_path TEXT NOT NULL,
language TEXT NOT NULL,
content_length INTEGER NOT NULL,
raw_response TEXT NOT NULL,
parsed_claims TEXT NOT NULL, -- JSON
final_claims TEXT NOT NULL, -- JSON
input_tokens INTEGER NOT NULL,
output_tokens INTEGER NOT NULL,
parse_success INTEGER NOT NULL,
parse_error TEXT,
cache_hit INTEGER NOT NULL,
latency_ms INTEGER NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_obs_timestamp ON observations(timestamp);
CREATE INDEX IF NOT EXISTS idx_obs_prompt_hash ON observations(prompt_hash);
"#)?;
Ok(Self { conn })
}
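/// Enforce retention: delete a row only if it is older than 30 days AND
/// outside the 1000 most recent, so the larger of the two sets is kept.
pub fn enforce_retention(&self) -> Result<usize> {
let cutoff = Utc::now() - Duration::days(30);
// `?` converts the rusqlite error into the crate-level error, as in `open`.
let deleted = self.conn.execute(
"DELETE FROM observations
WHERE timestamp < ?1
AND id NOT IN (SELECT id FROM observations ORDER BY timestamp DESC LIMIT 1000)",
params![cutoff.to_rfc3339()],
)?;
Ok(deleted)
}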
pub fn insert(&self, obs: &Observation) -> Result<()> {
self.conn.execute(
r#"INSERT INTO observations (
id, timestamp, prompt_version, prompt_hash, model, input_hash,
file_path, language, content_length, raw_response, parsed_claims,
final_claims, input_tokens, output_tokens, parse_success,
parse_error, cache_hit, latency_ms
) VALUES (?1, ?2, ?3, ?4, ?5, ?6, ?7, ?8, ?9, ?10, ?11, ?12, ?13, ?14, ?15, ?16, ?17, ?18)"#,
params![
obs.id.to_string(),
obs.timestamp.to_rfc3339(),
obs.prompt_version,
obs.prompt_hash,
obs.model,
obs.input_hash,
obs.file_path,
obs.language,
obs.content_length,
obs.raw_response,
serde_json::to_string(&obs.parsed_claims)?,
serde_json::to_string(&obs.final_claims)?,
obs.input_tokens,
obs.output_tokens,
obs.parse_success,
obs.parse_error,
obs.cache_hit,
obs.latency_ms,
],
)?;
Ok(())
}
}
```
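The retention rule is easier to check away from SQL. The following is a minimal sketch, not the shipped implementation: `rows_to_delete` is a hypothetical helper that takes row ages in days and mirrors the `DELETE` predicate (older than the cutoff and not among the newest N rows).

```rust
use std::collections::HashSet;

/// Returns the indices of rows the retention policy would delete.
/// `ages_days[i]` is the age of row `i` in days (hypothetical input shape).
pub fn rows_to_delete(ages_days: &[u32], max_age_days: u32, keep_newest: usize) -> Vec<usize> {
    // Rank rows from newest to oldest; the stable sort keeps input order on ties.
    let mut by_age: Vec<usize> = (0..ages_days.len()).collect();
    by_age.sort_by_key(|&i| ages_days[i]);

    // The newest `keep_newest` rows are protected regardless of age.
    let protected: HashSet<usize> = by_age.into_iter().take(keep_newest).collect();

    // Delete only rows that are both too old and unprotected.
    (0..ages_days.len())
        .filter(|i| ages_days[*i] > max_age_days && !protected.contains(i))
        .collect()
}
```

With ages `[1, 45, 10, 60, 90]`, a 30-day cutoff, and `keep_newest = 3`, the 45-day-old row survives because it is among the three newest; only the rows at indices 3 and 4 are deleted.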
**Observation struct:**
```rust
// src/llm/observation.rs
use chrono::{DateTime, Utc};
use serde::{Deserialize, Serialize};
use uuid::Uuid;
/// A logged observation from an LLM extraction.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct Observation {
/// Unique ID for this observation.
pub id: Uuid,
/// When this extraction occurred.
pub timestamp: DateTime<Utc>,
/// Prompt version (from PROMPT_VERSION constant).
pub prompt_version: String,
/// BLAKE3 hash of the prompt template.
pub prompt_hash: String,
/// Model used (e.g., "gemini-2.0-flash").
pub model: String,
/// BLAKE3 hash of input content.
pub input_hash: String,
/// File path (relative to scan root).
pub file_path: String,
/// Language detected.
pub language: String,
/// Content length in bytes.
pub content_length: usize,
/// Raw LLM response (JSON string).
pub raw_response: String,
/// Parsed claims (after confidence filter, before ontology validation).
pub parsed_claims: Vec<ParsedClaim>,
/// Final claims (after ontology validation).
pub final_claims: Vec<FinalClaim>,
/// Token usage.
pub input_tokens: usize,
pub output_tokens: usize,
/// Whether parsing succeeded.
pub parse_success: bool,
/// Parse error if any.
pub parse_error: Option<String>,
/// Cache status.
pub cache_hit: bool,
/// Latency in milliseconds.
pub latency_ms: u64,
}
/// A claim as parsed from LLM JSON (before validation).
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ParsedClaim {
pub subject: String,
pub predicate: String,
pub value: serde_json::Value,
pub confidence: f32,
pub line: usize,
}
/// A claim after ontology validation.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct FinalClaim {
pub concept_path: String,
pub predicate: String,
pub value: serde_json::Value,
pub confidence: f32,
pub matched_ontology: bool,
pub fuzzy_matched: bool,
}
```
**Integration point:** Modify `LlmExtractor::extract()` to emit observations.
---
### 2. Cache Key Includes Prompt Hash
**Problem:** Cache doesn't invalidate when prompt changes.
**Solution:** Include prompt hash in cache key.
```rust
// src/llm/cache.rs
impl LlmCache {
fn compute_key(content: &str, model: &str, prompt: &str) -> String {
let mut hasher = blake3::Hasher::new();
hasher.update(content.as_bytes());
hasher.update(model.as_bytes());
hasher.update(prompt.as_bytes()); // NEW: prompt included
hasher.finalize().to_hex().to_string()
}
}
```
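One detail worth checking when several fields feed one hasher: successive `update` calls hash the plain concatenation of their inputs, so field boundaries are not part of the key. The sketch below illustrates the difference using byte buffers only. `concat_key_material` and `frame_key_material` are hypothetical helpers, and length-prefixing is a suggested variation, not what `LlmCache` is stated to do.

```rust
/// Plain concatenation, equivalent to what successive `update` calls hash.
pub fn concat_key_material(content: &str, model: &str, prompt: &str) -> Vec<u8> {
    [content, model, prompt].concat().into_bytes()
}

/// Prefixes each field with its byte length (little-endian u64) so that
/// different field splits can never produce the same byte string.
pub fn frame_key_material(content: &str, model: &str, prompt: &str) -> Vec<u8> {
    let mut buf = Vec::new();
    for field in [content, model, prompt] {
        buf.extend_from_slice(&(field.len() as u64).to_le_bytes());
        buf.extend_from_slice(field.as_bytes());
    }
    buf
}
```

Under plain concatenation, `("ab", "c", p)` and `("a", "bc", p)` yield the same bytes and therefore the same key; the framed version keeps them distinct. Whether this matters in practice depends on how freely model names and content can vary.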
---
### 3. Bounded Concurrency
**Problem:** Sequential execution is slow; unbounded parallelism hits rate limits.
**Solution:** Tokio Semaphore with configurable concurrency.
```rust
// src/eval/harness.rs
use std::sync::Arc;
use tokio::sync::Semaphore;
pub struct EvalHarness {
extractor: LlmExtractor,
semaphore: Arc<Semaphore>,
}
impl EvalHarness {
pub fn new(extractor: LlmExtractor, max_concurrent: usize) -> Self {
Self {
extractor,
semaphore: Arc::new(Semaphore::new(max_concurrent)),
}
}
pub async fn run(&self, fixtures: Vec<Fixture>) -> EvalResult {
let handles: Vec<_> = fixtures
.into_iter()
.map(|fixture| {
let sem = self.semaphore.clone();
let extractor = self.extractor.clone();
tokio::spawn(async move {
let _permit = sem.acquire().await?;
Self::run_fixture(&extractor, fixture).await
})
})
.collect();
let results = futures::future::join_all(handles).await;
// aggregate...
}
}
```
**Default:** 5 concurrent (configurable via `eval.max_concurrent`)
---
### 4. Rate Limit Resilience
**Problem:** 429 errors cause evaluation failures.
**Solution:** Exponential backoff with retries.
```rust
// src/llm/client.rs
impl GeminiClient {
async fn call_with_retry(&self, request: &Request) -> Result<Response> {
let mut delay = Duration::from_millis(500);
let max_retries = 5;
for attempt in 0..max_retries {
match self.call(request).await {
Ok(response) => return Ok(response),
Err(e) if e.is_rate_limit() => {
if attempt == max_retries - 1 {
return Err(e);
}
tracing::warn!(
attempt,
delay_ms = delay.as_millis(),
"Rate limited, backing off"
);
tokio::time::sleep(delay).await;
delay *= 2;
}
Err(e) => return Err(e),
}
}
unreachable!()
}
}
```
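To make the retry budget concrete, the delays implied by the loop above can be listed without any network code. This is a sketch under the same parameters (500 ms initial delay, doubling, 5 attempts); `backoff_schedule` is a hypothetical helper and adds no jitter or cap, which a production client may want.

```rust
use std::time::Duration;

/// Delays slept between attempts. There is one fewer delay than attempts,
/// because the final failure is returned instead of retried.
pub fn backoff_schedule(initial: Duration, max_attempts: u32) -> Vec<Duration> {
    let mut delays = Vec::new();
    let mut delay = initial;
    for _ in 1..max_attempts {
        delays.push(delay);
        // Double the delay, saturating instead of overflowing.
        delay = delay.saturating_mul(2);
    }
    delays
}

/// Total time spent sleeping if every attempt is rate limited.
pub fn worst_case_wait(initial: Duration, max_attempts: u32) -> Duration {
    backoff_schedule(initial, max_attempts).into_iter().sum()
}
```

With the defaults, the schedule is 500 ms, 1 s, 2 s, 4 s, so a request that is rate limited on all five attempts fails after 7.5 s of waiting plus request time.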
---
### 5. Fixture Format
**Problem:** No standardized test cases to validate prompt changes.
**Solution:** TOML fixtures with input, expected output, and rationale.
```toml
# tests/llm_fixtures/tls/disabled_verification.toml
[metadata]
id = "tls-001"
name = "TLS verification disabled in Python requests"
category = "tls"
language = "python"
created = "2026-02-05"
[input]
# The code to analyze
content = '''
import requests
def fetch_data(url):
    # Disable SSL verification for internal services
    response = requests.get(url, verify=False)
    return response.json()
'''
[expected]
# What the LLM MUST extract (recall test)
# Inline tables must stay on one line in TOML 1.0, so each entry is left unwrapped.
must_contain = [
  { subject = "tls/cert_verification", predicate = "enabled", value = false, rationale = "requests.get(verify=False) explicitly disables TLS verification" },
]
# What the LLM MUST NOT extract (precision test)
must_not_contain = [
  { subject = "tls/cert_verification", predicate = "enabled", value = true },
]
[scoring]
# How important is this fixture?
weight = 1.0
# Expected minimum confidence from LLM
min_confidence = 0.8
```
**ExpectedClaim with rationale:**
```rust
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ExpectedClaim {
pub subject: String,
pub predicate: String,
pub value: serde_json::Value,
/// Optional explanation for why this claim is expected (shown on failure)
#[serde(default)]
pub rationale: Option<String>,
}
```
**Fixture categories:**
```
tests/llm_fixtures/
├── manifest.toml              # Index + baseline metrics
├── tls/                       # TLS/SSL fixtures
│   ├── disabled_verification.toml
│   ├── deprecated_version.toml
│   └── pinning_bypass.toml
├── jwt/                       # JWT fixtures
├── secrets/                   # Hardcoded secrets fixtures
├── auth/                      # Auth bypass fixtures
├── negative/                  # Safe code (expect NO claims)
│   ├── safe_tls.toml
│   └── env_var_secrets.toml
└── edge/                      # Edge cases
    ├── empty_file.toml
    └── huge_file.toml
```
**Manifest:**
```toml
# tests/llm_fixtures/manifest.toml
[corpus]
version = "1.0.0"
total_fixtures = 35
[baseline]
# Known-good metrics from last successful run
precision = 0.85
recall = 0.78
f1 = 0.81
prompt_version = "1.0.0"
model = "gemini-2.0-flash"
measured_at = "2026-02-05T10:30:00Z"
```
---
### 6. Evaluation Harness
**Problem:** No way to run fixtures and compute metrics.
**Solution:** Evaluation engine in `src/eval/`.
```rust
// src/eval/mod.rs
mod db;
mod fixture;
mod harness;
mod matcher;
mod metrics;
mod perturbation;
mod report;
pub use db::EvalDatabase;
pub use fixture::{Fixture, FixtureLoader};
pub use harness::{EvalConfig, EvalHarness, EvalResult};
pub use metrics::{Metrics, CategoryMetrics};
pub use perturbation::Perturbator;
pub use report::{Report, ReportFormat};
```
**Core types:**
```rust
// src/eval/harness.rs
pub struct EvalConfig {
/// Path to fixtures directory.
pub fixtures_dir: PathBuf,
/// Categories to run (None = all).
pub categories: Option<Vec<String>>,
/// Max fixtures to run (for smoke tests).
pub max_fixtures: Option<usize>,
/// Evaluation mode.
pub mode: EvalMode,
/// Baseline to compare against.
pub baseline: Option<PathBuf>,
/// Save observations to database.
pub save_observations: bool,
/// Maximum concurrent LLM calls.
pub max_concurrent: usize,
}
pub enum EvalMode {
/// Use real LLM API.
Live,
/// Use cached responses only (fails if not cached).
Cached,
/// Skip LLM, return empty claims (for testing harness).
Mock,
/// Perturbation testing for stability.
Robust,
}
pub struct EvalResult {
pub run_id: Uuid,
pub started_at: DateTime<Utc>,
pub completed_at: DateTime<Utc>,
pub metrics: Metrics,
pub fixture_results: Vec<FixtureResult>,
pub baseline_comparison: Option<BaselineComparison>,
pub verdict: EvalVerdict,
/// Stability score (only in Robust mode)
pub stability: Option<f64>,
}
pub enum EvalVerdict {
Pass,
Regression { regressions: Vec<String> },
Error { message: String },
}
```
**Metrics calculation:**
```rust
// src/eval/metrics.rs
pub struct Metrics {
/// True positives: expected claims that were extracted.
pub true_positives: usize,
/// False positives: extracted claims that weren't expected.
pub false_positives: usize,
/// False negatives: expected claims that weren't extracted.
pub false_negatives: usize,
/// Precision = TP / (TP + FP)
pub precision: f64,
/// Recall = TP / (TP + FN)
pub recall: f64,
/// F1 = 2 * (P * R) / (P + R)
pub f1: f64,
/// Total fixtures.
pub total_fixtures: usize,
/// Fixtures that passed.
pub passed: usize,
/// Fixtures that failed.
pub failed: usize,
/// Total tokens used.
pub total_tokens: u64,
/// Estimated cost (USD).
pub estimated_cost: f64,
/// By category.
pub by_category: HashMap<String, CategoryMetrics>,
}
impl Metrics {
pub fn compute(results: &[FixtureResult]) -> Self {
let mut tp = 0;
let mut fp = 0;
let mut fn_ = 0;
for result in results {
tp += result.true_positives;
fp += result.false_positives;
fn_ += result.false_negatives;
}
let precision = if tp + fp > 0 { tp as f64 / (tp + fp) as f64 } else { 0.0 };
let recall = if tp + fn_ > 0 { tp as f64 / (tp + fn_) as f64 } else { 0.0 };
let f1 = if precision + recall > 0.0 {
2.0 * precision * recall / (precision + recall)
} else {
0.0
};
// ... rest of computation
}
}
```
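A worked example helps when checking the "metrics match manual calculation" criterion. The sketch below repeats the same formulas as a free function with no project types; `precision_recall_f1` is a hypothetical name, and the counts in the usage note are illustrative, not measured results.

```rust
/// Computes (precision, recall, F1) from raw counts.
/// Each ratio falls back to 0.0 when its denominator is zero.
pub fn precision_recall_f1(tp: usize, fp: usize, fn_: usize) -> (f64, f64, f64) {
    let ratio = |num: usize, den: usize| {
        if den > 0 { num as f64 / den as f64 } else { 0.0 }
    };
    let precision = ratio(tp, tp + fp);
    let recall = ratio(tp, tp + fn_);
    let f1 = if precision + recall > 0.0 {
        2.0 * precision * recall / (precision + recall)
    } else {
        0.0
    };
    (precision, recall, f1)
}
```

For 40 true positives, 6 false positives, and 12 false negatives: precision = 40/46 ≈ 0.870, recall = 40/52 ≈ 0.769, and F1 = 80/98 ≈ 0.816. Note that counts are summed across fixtures before dividing (micro-averaging), so fixtures with many expected claims weigh more than fixtures with few.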
---
### 7. Hybrid Type-Coercive Matching
**Problem:** Strict type matching misses semantically equivalent values.
**Solution:** Coerce strings to booleans/numbers when reasonable.
```rust
// src/eval/matcher.rs
pub struct ClaimMatcher {
/// Tolerance for confidence comparison.
pub confidence_tolerance: f32,
}
impl ClaimMatcher {
/// Check if extracted claims satisfy must_contain requirements.
pub fn check_must_contain(
&self,
extracted: &[ExtractedClaim],
expected: &[ExpectedClaim],
) -> MatchResult {
let mut matched = vec![];
let mut unmatched = vec![];
for exp in expected {
if let Some(claim) = self.find_matching_claim(extracted, exp) {
matched.push((exp.clone(), claim.clone()));
} else {
unmatched.push(exp.clone());
}
}
MatchResult { matched, unmatched }
}
/// Check if any extracted claim matches (for must_not_contain).
pub fn check_must_not_contain(
&self,
extracted: &[ExtractedClaim],
forbidden: &[ExpectedClaim],
) -> Vec<(ExpectedClaim, ExtractedClaim)> {
let mut violations = vec![];
for forbid in forbidden {
if let Some(claim) = self.find_matching_claim(extracted, forbid) {
violations.push((forbid.clone(), claim.clone()));
}
}
violations
}
fn find_matching_claim(
&self,
extracted: &[ExtractedClaim],
expected: &ExpectedClaim,
) -> Option<&ExtractedClaim> {
extracted.iter().find(|claim| {
self.subject_matches(&claim.concept_path, &expected.subject) &&
claim.predicate == expected.predicate &&
self.value_matches(&claim.value, &expected.value)
})
}
fn subject_matches(&self, extracted: &str, expected: &str) -> bool {
// Allow matching on tail path (last 2 segments)
let ext_tail = extracted.split('/').rev().take(2).collect::<Vec<_>>();
let exp_tail = expected.split('/').rev().take(2).collect::<Vec<_>>();
ext_tail == exp_tail
}
fn value_matches(&self, extracted: &ObjectValue, expected: &serde_json::Value) -> bool {
match (extracted, expected) {
// Direct matches
(ObjectValue::Boolean(e), serde_json::Value::Bool(x)) => e == x,
(ObjectValue::Number(e), serde_json::Value::Number(x)) => {
x.as_f64().map(|n| (e - n).abs() < 0.001).unwrap_or(false)
}
(ObjectValue::Text(e), serde_json::Value::String(x)) => e == x,
// Coercion: string -> boolean
(ObjectValue::Boolean(e), serde_json::Value::String(s)) => {
self.coerce_to_bool(s).map(|b| *e == b).unwrap_or(false)
}
(ObjectValue::Text(e), serde_json::Value::Bool(x)) => {
self.coerce_to_bool(e).map(|b| b == *x).unwrap_or(false)
}
// Coercion: string -> number
(ObjectValue::Number(e), serde_json::Value::String(s)) => {
s.parse::<f64>().map(|n| (e - n).abs() < 0.001).ok().unwrap_or(false)
}
_ => false,
}
}
fn coerce_to_bool(&self, s: &str) -> Option<bool> {
match s.to_lowercase().as_str() {
"true" | "yes" | "on" | "enabled" | "1" => Some(true),
"false" | "no" | "off" | "disabled" | "0" => Some(false),
_ => None,
}
}
}
```
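The two matching rules can be exercised on their own. The following sketch restates tail-path matching and boolean coercion as free functions without the project types; the `code://python/app/...` path is a made-up example of a fully qualified concept path.

```rust
/// Compares the last two `/`-separated segments of two concept paths.
pub fn tail_matches(extracted: &str, expected: &str) -> bool {
    fn tail(path: &str) -> Vec<&str> {
        // `rsplit` walks segments from the end, so this takes the final two.
        path.rsplit('/').take(2).collect()
    }
    tail(extracted) == tail(expected)
}

/// Maps common truthy/falsy spellings to a boolean, case-insensitively.
pub fn coerce_to_bool(s: &str) -> Option<bool> {
    match s.to_lowercase().as_str() {
        "true" | "yes" | "on" | "enabled" | "1" => Some(true),
        "false" | "no" | "off" | "disabled" | "0" => Some(false),
        _ => None,
    }
}
```

`tail_matches("code://python/app/tls/cert_verification", "tls/cert_verification")` is true, while a bare `"cert_verification"` does not match `"tls/cert_verification"` because the parent segment is part of the comparison. Fixture subjects therefore need at least the same two trailing segments as the extractor emits.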
---
### 8. Perturbation Testing Mode
**Problem:** Need to verify LLM consistency across minor input variations.
**Solution:** Perturbation mode that tests stability.
```rust
// src/eval/perturbation.rs
use crate::Language;
pub struct Perturbator;
impl Perturbator {
/// Generate perturbed variants of input content
pub fn perturb(content: &str, language: Language) -> Vec<String> {
let mut variants = vec![content.to_string()];
variants.push(Self::add_trailing_whitespace(content));
variants.push(Self::normalize_indentation(content));
variants.push(Self::add_innocuous_comments(content, language));
variants.push(Self::remove_comments(content, language));
variants
}
fn add_trailing_whitespace(content: &str) -> String {
content
.lines()
.map(|line| format!("{} ", line))
.collect::<Vec<_>>()
.join("\n")
}
fn normalize_indentation(content: &str) -> String {
// Expand each tab to four spaces (the tab width is an assumption)
content.replace('\t', "    ")
}
fn add_innocuous_comments(content: &str, language: Language) -> String {
let comment_prefix = match language {
Language::Python => "#",
Language::JavaScript | Language::TypeScript | Language::Rust | Language::Go => "//",
_ => "#",
};
format!("{} Auto-generated file\n{}", comment_prefix, content)
}
fn remove_comments(content: &str, language: Language) -> String {
// Simple single-line comment removal
let comment_prefix = match language {
Language::Python => "#",
Language::JavaScript | Language::TypeScript | Language::Rust | Language::Go => "//",
_ => "#",
};
content
.lines()
.filter(|line| !line.trim().starts_with(comment_prefix))
.collect::<Vec<_>>()
.join("\n")
}
}
```
**Stability metric:** Share of perturbed variants whose extracted claims are identical to those from the unmodified input.
CLI: `aphoria eval run --mode robust`
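One way to compute the stability score is sketched below. It assumes the first result belongs to the unmodified input (matching the order returned by `perturb`) and that claims are compared as order-independent sets of `(subject, predicate, value)` text; `ClaimSet`, `claim_set`, and `stability` are hypothetical names.

```rust
use std::collections::BTreeSet;

/// One extraction result, reduced to (subject, predicate, value-as-text).
pub type ClaimSet = BTreeSet<(String, String, String)>;

/// Convenience constructor used only in this sketch.
pub fn claim_set(claims: &[(&str, &str, &str)]) -> ClaimSet {
    claims
        .iter()
        .map(|(s, p, v)| (s.to_string(), p.to_string(), v.to_string()))
        .collect()
}

/// Fraction of perturbed variants whose claims equal the baseline's.
/// Returns `None` when there is no baseline or no perturbed variant.
pub fn stability(results: &[ClaimSet]) -> Option<f64> {
    let (baseline, variants) = results.split_first()?;
    if variants.is_empty() {
        return None;
    }
    let stable = variants.iter().filter(|v| *v == baseline).count();
    Some(stable as f64 / variants.len() as f64)
}
```

With four perturbed variants of which three reproduce the baseline claims, the score is 0.75. Comparing sets rather than ordered lists keeps the score from dropping when the model merely reorders its output.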
---
### 9. Structured Decoding (Gemini Response Schema)
**Problem:** Free-form JSON parsing can fail.
**Solution:** Use Gemini's `response_schema` for guaranteed JSON structure.
```rust
// src/llm/client.rs
use maplit::hashmap; // assumption: `hashmap!` comes from the maplit crate

impl GeminiClient {
    fn build_request(&self, content: &str, prompt: &str) -> Request {
        // NOTE: `prompt` is not attached here. It still has to reach the model,
        // for example through Gemini's `systemInstruction` request field.
        Request {
            contents: vec![Content {
                role: "user".to_string(),
                parts: vec![Part::Text(content.to_string())],
            }],
            generation_config: GenerationConfig {
                temperature: 0.1,
                response_mime_type: "application/json".to_string(),
                response_schema: Some(self.claims_schema()),
            },
        }
    }

    fn claims_schema(&self) -> Schema {
        // Leaf schema for a scalar field of the given type.
        let scalar = |type_: &str| Schema { type_: type_.to_string(), ..Default::default() };
        Schema {
            type_: "object".to_string(),
            properties: hashmap! {
                "claims".to_string() => Schema {
                    type_: "array".to_string(),
                    items: Some(Box::new(Schema {
                        type_: "object".to_string(),
                        properties: hashmap! {
                            "subject".to_string() => scalar("string"),
                            "predicate".to_string() => scalar("string"),
                            // Gemini's schema subset has no "any" type, so `value` is
                            // requested as a string and coerced later by the matcher.
                            "value".to_string() => scalar("string"),
                            "confidence".to_string() => scalar("number"),
                            "line".to_string() => scalar("integer"),
                        },
                        // `required` holds owned strings, like the outer schema.
                        required: ["subject", "predicate", "value", "confidence"]
                            .iter()
                            .map(|s| s.to_string())
                            .collect(),
                        ..Default::default()
                    })),
                    ..Default::default()
                },
            },
            required: vec!["claims".to_string()],
            ..Default::default()
        }
    }
}
```
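**Benefit:** Removes most JSON parse failures, since responses follow the schema. A response truncated at the output-token limit can still fail to parse, so `parse_success` and `parse_error` remain worth recording.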
---
### 10. Synthetic Corpus Generation
**Problem:** Manual fixture creation is slow.
**Solution:** Generate fixtures from real scans with human review.
```bash
aphoria eval generate-corpus \
--scan-path /path/to/codebase \
--output-dir tests/llm_fixtures/synthetic \
--sample-size 50
```
```rust
// src/eval/corpus.rs
pub struct CorpusGenerator {
extractor: LlmExtractor,
}
impl CorpusGenerator {
pub async fn generate(
&self,
scan_path: &Path,
output_dir: &Path,
sample_size: usize,
) -> Result<Vec<PathBuf>> {
let findings = self.extractor.scan(scan_path).await?;
let sample = self.stratified_sample(&findings, sample_size);
let mut fixtures = vec![];
for finding in sample {
fixtures.push(self.create_fixture(&finding, output_dir)?);
}
Ok(fixtures)
}
// The explicit lifetime ties the returned references to `findings`, not to `&self`.
fn stratified_sample<'a>(&self, findings: &'a [Finding], size: usize) -> Vec<&'a Finding> {
// Take an equal share from each category (not weighted by category size)
let mut by_category: HashMap<&str, Vec<&'a Finding>> = HashMap::new();
for f in findings {
by_category.entry(f.category.as_str()).or_default().push(f);
}
// Round up so a sample smaller than the category count is not empty
let per_category = size.div_ceil(by_category.len().max(1));
let mut sample = vec![];
for (_, items) in by_category {
sample.extend(items.into_iter().take(per_category));
}
sample.truncate(size);
sample
}
fn create_fixture(&self, finding: &Finding, output_dir: &Path) -> Result<PathBuf> {
let fixture = Fixture {
metadata: FixtureMetadata {
id: format!("auto-{}", Uuid::new_v4()),
name: finding.description.clone(),
category: finding.category.clone(),
language: finding.language.to_string(),
created: Utc::now().date_naive().to_string(),
},
input: FixtureInput {
content: finding.code_snippet.clone(),
},
expected: FixtureExpected {
must_contain: vec![ExpectedClaim {
subject: finding.subject.clone(),
predicate: finding.predicate.clone(),
value: finding.value.clone(),
rationale: Some("Auto-generated - requires human review".to_string()),
}],
must_not_contain: vec![],
},
scoring: FixtureScoring {
weight: 1.0,
min_confidence: 0.7,
},
};
let path = output_dir
.join(&finding.category)
.join(format!("{}.toml", fixture.metadata.id));
std::fs::create_dir_all(path.parent().unwrap())?;
std::fs::write(&path, toml::to_string_pretty(&fixture)?)?;
Ok(path)
}
}
```
**Workflow:** Scan -> Human review -> Commit to corpus
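If the sample should follow the category distribution of the scan rather than split evenly, a largest-remainder allocation is one option. This is a sketch of that alternative, not the generator's stated behavior; `proportional_quotas` is a hypothetical helper that only decides how many fixtures to draw per category.

```rust
use std::collections::BTreeMap;

/// Splits `size` across categories in proportion to their finding counts.
/// Uses the largest-remainder method; ties go to the alphabetically first name.
pub fn proportional_quotas(
    counts: &BTreeMap<String, usize>,
    size: usize,
) -> BTreeMap<String, usize> {
    let total: usize = counts.values().sum();
    if total == 0 {
        return BTreeMap::new();
    }
    // Never ask for more fixtures than there are findings.
    let size = size.min(total);

    let mut quotas = BTreeMap::new();
    let mut remainders: Vec<(usize, &String)> = Vec::new();
    let mut assigned = 0;
    for (category, &n) in counts {
        // Integer share first; the remainder decides who gets the leftovers.
        let share = size * n / total;
        quotas.insert(category.clone(), share);
        remainders.push((size * n % total, category));
        assigned += share;
    }

    remainders.sort_by(|a, b| b.0.cmp(&a.0).then(a.1.cmp(b.1)));
    for (_, category) in remainders.into_iter().take(size - assigned) {
        *quotas.get_mut(category).unwrap() += 1;
    }
    quotas
}
```

For counts of 5 `tls`, 3 `jwt`, and 2 `secrets` findings and a sample size of 4, the quotas come out as 2, 1, and 1. Using `BTreeMap` keeps iteration order fixed, so the same scan yields the same quotas on every run.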
---
### 11. CLI Commands
**Problem:** No way to run evaluations from command line.
**Solution:** Add `aphoria eval` subcommand.
```rust
// src/cli.rs additions
#[derive(Subcommand)]
pub enum Commands {
// ... existing commands ...
/// Evaluate LLM prompt effectiveness
Eval {
#[command(subcommand)]
command: EvalCommands,
},
}
#[derive(Subcommand)]
pub enum EvalCommands {
/// Run evaluation against fixtures
Run {
/// Path to fixtures directory
#[arg(long, default_value = "tests/llm_fixtures")]
fixtures: PathBuf,
/// Categories to run (comma-separated)
#[arg(long)]
categories: Option<String>,
/// Maximum fixtures to run
#[arg(long)]
max_fixtures: Option<usize>,
/// Evaluation mode: live, cached, mock, robust
#[arg(long, default_value = "cached")]
mode: String,
/// Baseline file to compare against
#[arg(long)]
baseline: Option<PathBuf>,
/// Exit with code 1 if regression detected
#[arg(long)]
fail_on_regression: bool,
/// Regression threshold (default: 0.05 = 5%)
#[arg(long, default_value = "0.05")]
threshold: f64,
/// Save observation logs
#[arg(long)]
save_observations: bool,
/// Output format: table, json, markdown
#[arg(long, default_value = "table")]
format: String,
},
/// Show current baseline metrics
Baseline {
/// Path to fixtures directory
#[arg(long, default_value = "tests/llm_fixtures")]
fixtures: PathBuf,
},
/// Update baseline from latest run
UpdateBaseline {
/// Run ID to use as new baseline
#[arg(long)]
run_id: Option<String>,
/// Path to fixtures directory
#[arg(long, default_value = "tests/llm_fixtures")]
fixtures: PathBuf,
/// Confirms the overwrite; without it the handler prints the current baseline and exits
#[arg(long)]
force: bool,
},
/// List fixtures
ListFixtures {
/// Path to fixtures directory
#[arg(long, default_value = "tests/llm_fixtures")]
fixtures: PathBuf,
/// Filter by category
#[arg(long)]
category: Option<String>,
},
/// Validate fixture format
ValidateFixtures {
/// Path to fixtures directory
#[arg(long, default_value = "tests/llm_fixtures")]
fixtures: PathBuf,
},
/// Generate fixtures from real scans
GenerateCorpus {
/// Path to codebase to scan
#[arg(long)]
scan_path: PathBuf,
/// Output directory for generated fixtures
#[arg(long)]
output_dir: PathBuf,
/// Number of fixtures to generate
#[arg(long, default_value = "50")]
sample_size: usize,
},
}
```
**Usage examples:**
```bash
# Run smoke test (cached responses, fast)
aphoria eval run --mode cached --max-fixtures 10
# Run full evaluation (live API calls)
aphoria eval run --mode live --save-observations
# Run with baseline comparison
aphoria eval run --baseline tests/llm_fixtures/manifest.toml --fail-on-regression
# Run perturbation testing
aphoria eval run --mode robust --max-fixtures 5
# Show current baseline
aphoria eval baseline
# Update baseline (requires --force)
aphoria eval update-baseline --force
# List fixtures
aphoria eval list-fixtures --category tls
# Validate fixture format
aphoria eval validate-fixtures
# Generate fixtures from real codebase
aphoria eval generate-corpus --scan-path ./my-project --output-dir ./test-fixtures
```
**Baseline safety:** Without `--force`, update-baseline shows:
```
Current baseline: precision=0.85, recall=0.78, f1=0.81 (2026-02-05)
To update, re-run with --force
```
---
### 12. Report Output
**Problem:** Need human-readable and machine-readable output.
**Solution:** Multiple report formats.
**Table format (default):**
```
╭────────────────────────────────────────────────────────────────────╮
│ LLM Prompt Evaluation Report                                       │
├────────────────────────────────────────────────────────────────────┤
│ Run ID: abc123-def456                                              │
│ Date: 2026-02-05 14:30:00 UTC                                      │
│ Prompt: v1.0.0                                                     │
│ Model: gemini-2.0-flash                                            │
╰────────────────────────────────────────────────────────────────────╯

Summary
╭──────────┬─────────┬──────────┬────────┬────────╮
│ Metric   │ Current │ Baseline │ Delta  │ Status │
├──────────┼─────────┼──────────┼────────┼────────┤
│ Precision│ 0.87    │ 0.85     │ +0.02  │ ✓      │
│ Recall   │ 0.76    │ 0.78     │ -0.02  │ ⚠      │
│ F1       │ 0.81    │ 0.81     │ +0.00  │ ✓      │
╰──────────┴─────────┴──────────┴────────┴────────╯
Verdict: ⚠ REVIEW - Recall dropped by 0.02 (within the 0.05 regression threshold)

Category Breakdown
╭──────────┬──────────┬────────┬────────╮
│ Category │ Fixtures │ Passed │ Failed │
├──────────┼──────────┼────────┼────────┤
│ tls      │ 12       │ 11     │ 1      │
│ jwt      │ 8        │ 6      │ 2      │
│ secrets  │ 15       │ 14     │ 1      │
│ negative │ 10       │ 10     │ 0      │
╰──────────┴──────────┴────────┴────────╯

Changed fixtures (2)
  - jwt-003: JWT algorithm none detection (REGRESSED)
      Expected:  jwt/algorithm = "none"
      Rationale: alg:"none" bypasses signature verification entirely
      Got:       Not extracted
  - tls-007: TLS version in constants (IMPROVED)
      Previously: Not extracted
      Now:        tls/min_version = "1.0" ✓

Cost: 125,430 tokens ($0.12)
```
**JSON format:**
```json
{
"run_id": "abc123-def456",
"timestamp": "2026-02-05T14:30:00Z",
"prompt_version": "1.0.0",
"model": "gemini-2.0-flash",
"metrics": {
"precision": 0.87,
"recall": 0.76,
"f1": 0.81,
"total_fixtures": 45,
"passed": 41,
"failed": 4
},
"baseline_comparison": {
"precision_delta": 0.02,
"recall_delta": -0.02,
"has_regression": false,
"regression_threshold": 0.05
},
"stability": 0.92,
"verdict": "review",
"fixture_results": [...]
}
```
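The `baseline_comparison` fields can be derived with a small pure function. The sketch below assumes the threshold is an absolute drop in the metric (0.05 means five points of precision or recall); the document does not say whether it is absolute or relative, and `Scores`, `Comparison`, and `compare` are hypothetical names.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Scores {
    pub precision: f64,
    pub recall: f64,
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Comparison {
    pub precision_delta: f64,
    pub recall_delta: f64,
    pub has_regression: bool,
}

/// Flags a regression when either metric drops by more than `threshold`
/// (absolute difference, assumed) relative to the baseline.
pub fn compare(baseline: Scores, current: Scores, threshold: f64) -> Comparison {
    let precision_delta = current.precision - baseline.precision;
    let recall_delta = current.recall - baseline.recall;
    Comparison {
        precision_delta,
        recall_delta,
        has_regression: precision_delta < -threshold || recall_delta < -threshold,
    }
}
```

Under this reading, a recall change from 0.78 to 0.76 is a delta of -0.02 and stays inside a 0.05 threshold, whereas a drop to 0.70 would be flagged. A smaller drop can still be surfaced as a warning without failing the run.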
---
## Implementation Plan
### Phase 1: Core Infrastructure (2 days)
| Task | File | Description |
|------|------|-------------|
| 1.1 | `src/eval/db.rs` | SQLite database with retention |
| 1.2 | `src/llm/cache.rs` | Update cache key to include prompt hash |
| 1.3 | `src/llm/client.rs` | Exponential backoff for 429s |
**Acceptance:** Database stores observations, cache invalidates on prompt change, rate limits handled gracefully.
### Phase 2: Fixture & Matching (2 days)
| Task | File | Description |
|------|------|-------------|
| 2.1 | `src/eval/fixture.rs` | Define `Fixture`, `ExpectedClaim` with rationale |
| 2.2 | `src/eval/matcher.rs` | Hybrid type-coercive matching |
| 2.3 | `tests/llm_fixtures/` | Create 10 seed fixtures |
| 2.4 | `src/eval/fixture.rs` | Add `FixtureLoader` |
**Acceptance:** Can load fixtures from TOML, matching handles type coercion.
### Phase 3: Evaluation Harness (2 days)
| Task | File | Description |
|------|------|-------------|
| 3.1 | `src/eval/harness.rs` | Bounded concurrency with Semaphore |
| 3.2 | `src/eval/metrics.rs` | Implement `Metrics::compute()` |
| 3.3 | `src/eval/harness.rs` | Baseline comparison |
| 3.4 | `src/eval/perturbation.rs` | Perturbation testing |
**Acceptance:** Can run fixtures with bounded parallelism, compute precision/recall, measure stability.
### Phase 4: Structured Decoding (1 day)
| Task | File | Description |
|------|------|-------------|
| 4.1 | `src/llm/client.rs` | Gemini Response Schema integration |
**Acceptance:** LLM always returns valid JSON, no parse failures.
### Phase 5: CLI & Corpus (2 days)
| Task | File | Description |
|------|------|-------------|
| 5.1 | `src/cli.rs` | Add `EvalCommands` with `--force`, `--mode robust` |
| 5.2 | `src/handlers/eval.rs` | Implement all eval command handlers |
| 5.3 | `src/eval/corpus.rs` | Corpus generation from scans |
**Acceptance:** `aphoria eval run` works end-to-end, corpus generation functional.
### Phase 6: Reports & Polish (2 days)
| Task | File | Description |
|------|------|-------------|
| 6.1 | `src/eval/report.rs` | Table/JSON formats with rationale in failures |
| 6.2 | `src/eval/report.rs` | Stability metrics display |
| 6.3 | `tests/llm_fixtures/` | Expand to 25+ fixtures |
**Acceptance:** Reports show rationale on missed claims, stability metrics visible.
**Total:** 11 days
---
## Fixture Seed List
Initial 10 fixtures to create:
| ID | Category | Name | Tests |
|----|----------|------|-------|
| tls-001 | tls | Disabled verification (requests) | `verify=False` |
| tls-002 | tls | Deprecated TLS version | `min_version="TLSv1"` |
| jwt-001 | jwt | Algorithm none | `alg: "none"` |
| jwt-002 | jwt | Skip signature verification | `verify=False` |
| secrets-001 | secrets | Hardcoded API key | `API_KEY = "sk_..."` |
| secrets-002 | secrets | High entropy token | Shannon entropy > 4.5 |
| auth-001 | auth | Debug auth bypass | `X-Debug-Auth` header |
| negative-001 | negative | Safe TLS config | `verify=True` (no claims) |
| negative-002 | negative | Env var secrets | `os.getenv()` (no claims) |
| edge-001 | edge | Empty file | Empty content (no claims) |
---
## Configuration
Add to `aphoria.toml`:
```toml
[eval]
# Save observations during scans
save_observations = false
# SQLite database path
database_path = "~/.aphoria/eval/observations.db"
# Default fixtures directory
fixtures_dir = "tests/llm_fixtures"
# Regression threshold (5% = 0.05)
regression_threshold = 0.05
# Maximum concurrent LLM calls
max_concurrent = 5
# Retention: days to keep observations
retention_days = 30
# Retention: max observations to keep regardless of age
retention_max_count = 1000
# Rate limit: initial backoff delay (ms)
rate_limit_initial_delay_ms = 500
# Rate limit: max retries before failing
rate_limit_max_retries = 5
```
---
## Success Criteria
| Metric | Target |
|--------|--------|
| Can run `aphoria eval run` | Works |
| Baseline comparison | Detects 5% regression |
| Fixtures load correctly | 100% valid fixtures load |
| Metrics match manual calculation | Within 0.01 |
| Report is readable | Human-verified |
| Type coercion works | "true" matches true |
| Perturbation mode | Stability metric computed |
| Rate limit handling | Survives 429 burst |
---
## File Structure After Implementation
```
applications/aphoria/
├── src/
│   ├── eval/
│   │   ├── mod.rs
│   │   ├── db.rs            # SQLite storage
│   │   ├── corpus.rs        # Synthetic fixture generation
│   │   ├── fixture.rs       # Fixture loading
│   │   ├── harness.rs       # Evaluation engine
│   │   ├── matcher.rs       # Claim matching (type-coercive)
│   │   ├── metrics.rs       # Precision/recall
│   │   ├── perturbation.rs  # Perturbation testing
│   │   └── report.rs        # Output formatting
│   ├── llm/
│   │   ├── observation.rs   # Observation logging
│   │   └── ...
│   ├── handlers/
│   │   ├── eval.rs          # Eval command handlers
│   │   └── ...
│   └── ...
└── tests/
    └── llm_fixtures/
        ├── manifest.toml
        ├── tls/
        ├── jwt/
        ├── secrets/
        ├── auth/
        ├── negative/
        └── edge/
```
---
## Verification
```bash
# Build
cargo build -p aphoria
# Test
cargo test -p aphoria
# SQLite retention check
sqlite3 ~/.aphoria/eval/observations.db "SELECT COUNT(*) FROM observations"
# Bounded concurrency (watch logs)
RUST_LOG=debug aphoria eval run --mode live 2>&1 | grep "permit"
# Perturbation mode
aphoria eval run --mode robust --max-fixtures 5
# Corpus generation
aphoria eval generate-corpus --scan-path ./test-project --output-dir ./test-fixtures
```
---
## Open Questions Resolved
| Question | Decision |
|----------|----------|
| Baseline storage | In `manifest.toml` (simple, versioned with fixtures) |
| Observation storage | SQLite with 30-day/1000-count retention |
| Matching strictness | Tail-path + type-coercive matching |
| Mock vs Live in CI | Cached mode for PR, live for manual |
| Parallelism | Bounded (5 default) via Tokio Semaphore |
| Baseline safety | Requires `--force` flag |
| Structured output | Gemini Response Schema |
---
*Ready for implementation.*