# LLM Evaluation Implementation Spec

> **Status:** Implementation Ready
> **Date:** 2026-02-05
> **Scope:** Aphoria Phase 7.8

---

## What We Have

The current LLM extraction pipeline (`src/llm/`):

```
src/llm/
├── mod.rs          # Module exports
├── client.rs       # GeminiClient - HTTP client for API
├── extractor.rs    # LlmExtractor - orchestration, budget, filtering
├── prompt.rs       # build_system_prompt() with ontology
├── ontology.rs     # OntologyVocabulary from authority assertions
├── cache.rs        # LlmCache - BLAKE3 content hash caching
├── types.rs        # LlmClaim, LlmClaimsResponse
└── prompts.rs      # DEFAULT_SYSTEM_PROMPT, helpers
```

**Key characteristics:**

- Uses Gemini API (configured via `GEMINI_API_KEY`)
- Ontology-aware prompts constrain output to authority vocabulary
- Caches by `BLAKE3(prompt + content + model)` (prompt hash included)
- Token budget tracking (`max_tokens_per_scan`, `max_tokens_per_file`)
- Selective triggering (high-value files only)
- Temperature 0.1 for consistency
- Structured decoding via Gemini Response Schema

---

## What We Need

### 1. Observation Storage (SQLite)

**Problem:** We can't see what the LLM returned or how claims were scored. JSON files are inefficient for querying.

**Solution:** SQLite database with retention policies.

**Location:** `~/.aphoria/eval/observations.db`

```rust
// src/eval/db.rs

use std::path::Path;

use chrono::{Duration, Utc};
use rusqlite::{params, Connection};

// `Result` is the crate's own error alias and `Observation` is defined below.

pub struct EvalDatabase {
    conn: Connection,
}

impl EvalDatabase {
    pub fn open(path: &Path) -> Result<Self> {
        let conn = Connection::open(path)?;
        conn.execute_batch(r#"
            CREATE TABLE IF NOT EXISTS observations (
                id TEXT PRIMARY KEY,
                timestamp TEXT NOT NULL,
                prompt_version TEXT NOT NULL,
                prompt_hash TEXT NOT NULL,
                model TEXT NOT NULL,
                input_hash TEXT NOT NULL,
                file_path TEXT NOT NULL,
                language TEXT NOT NULL,
                content_length INTEGER NOT NULL,
                raw_response TEXT NOT NULL,
                parsed_claims TEXT NOT NULL, -- JSON
                final_claims TEXT NOT NULL, -- JSON
                input_tokens INTEGER NOT NULL,
                output_tokens INTEGER NOT NULL,
                parse_success INTEGER NOT NULL,
                parse_error TEXT,
                cache_hit INTEGER NOT NULL,
                latency_ms INTEGER NOT NULL
            );
            CREATE INDEX IF NOT EXISTS idx_obs_timestamp ON observations(timestamp);
            CREATE INDEX IF NOT EXISTS idx_obs_prompt_hash ON observations(prompt_hash);
        "#)?;
        Ok(Self { conn })
    }

    /// Enforce retention: keep last 1000 or 30 days, whichever is larger.
    /// Returns the number of rows deleted.
    pub fn enforce_retention(&self) -> Result<usize> {
        let cutoff = Utc::now() - Duration::days(30);
        // A row is deleted only if it is older than the cutoff AND not among
        // the 1000 most recent rows.
        let deleted = self.conn.execute(
            "DELETE FROM observations
             WHERE timestamp < ?1
             AND id NOT IN (SELECT id FROM observations ORDER BY timestamp DESC LIMIT 1000)",
            params![cutoff.to_rfc3339()],
        )?;
        Ok(deleted)
    }

    pub fn insert(&self, obs: &Observation) -> Result<()> {
        self.conn.execute(
            r#"INSERT INTO observations (
                id, timestamp, prompt_version, prompt_hash, model, input_hash,
                file_path, language, content_length, raw_response, parsed_claims,
                final_claims, input_tokens, output_tokens, parse_success,
                parse_error, cache_hit, latency_ms
            ) VALUES (?1, ?2, ?3, ?4, ?5, ?6, ?7, ?8, ?9, ?10, ?11, ?12, ?13, ?14, ?15, ?16, ?17, ?18)"#,
            params![
                obs.id.to_string(),
                obs.timestamp.to_rfc3339(),
                obs.prompt_version,
                obs.prompt_hash,
                obs.model,
                obs.input_hash,
                obs.file_path,
                obs.language,
                obs.content_length,
                obs.raw_response,
                serde_json::to_string(&obs.parsed_claims)?,
                serde_json::to_string(&obs.final_claims)?,
                obs.input_tokens,
                obs.output_tokens,
                obs.parse_success,
                obs.parse_error,
                obs.cache_hit,
                obs.latency_ms,
            ],
        )?;
        Ok(())
    }
}
```
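
The retention rule combines two conditions, which is easy to misread in SQL. The following is a minimal in-memory sketch of the same rule, not part of the spec: the function name `rows_to_delete` and the use of Unix-second timestamps are assumptions made for illustration (the table itself stores RFC 3339 text).

```rust
/// Returns the indices of rows the retention rule would delete.
///
/// A row is deleted only if it is BOTH older than `max_age_secs` AND not
/// among the `keep_last` most recent rows, mirroring the SQL `DELETE` above.
/// Timestamps are Unix seconds (an assumption made for this sketch).
pub fn rows_to_delete(
    timestamps: &[i64],
    now: i64,
    max_age_secs: i64,
    keep_last: usize,
) -> Vec<usize> {
    // Rank rows from newest to oldest, remembering their original index.
    let mut order: Vec<usize> = (0..timestamps.len()).collect();
    order.sort_by(|&a, &b| timestamps[b].cmp(&timestamps[a]));

    let cutoff = now - max_age_secs;
    let mut deleted: Vec<usize> = order
        .iter()
        .enumerate()
        // rank >= keep_last: outside the "most recent N" window
        .filter(|&(rank, &idx)| rank >= keep_last && timestamps[idx] < cutoff)
        .map(|(_, &idx)| idx)
        .collect();
    deleted.sort_unstable();
    deleted
}
```

With four rows at times 100, 200, 300 and 400, `now = 1000`, a maximum age of 650 and `keep_last = 2`, only the rows at 100 and 200 are removed: the row at 300 is older than the cutoff but is still one of the two most recent.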

**Observation struct:**

```rust
// src/llm/observation.rs

use chrono::{DateTime, Utc};
use serde::{Deserialize, Serialize};
use uuid::Uuid;

/// A logged observation from an LLM extraction.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct Observation {
    /// Unique ID for this observation.
    pub id: Uuid,

    /// When this extraction occurred.
    pub timestamp: DateTime<Utc>,

    /// Prompt version (from PROMPT_VERSION constant).
    pub prompt_version: String,

    /// BLAKE3 hash of the prompt template.
    pub prompt_hash: String,

    /// Model used (e.g., "gemini-2.0-flash").
    pub model: String,

    /// BLAKE3 hash of input content.
    pub input_hash: String,

    /// File path (relative to scan root).
    pub file_path: String,

    /// Language detected.
    pub language: String,

    /// Content length in bytes.
    pub content_length: usize,

    /// Raw LLM response (JSON string).
    pub raw_response: String,

    /// Parsed claims (after confidence filter, before ontology validation).
    pub parsed_claims: Vec<ParsedClaim>,

    /// Final claims (after ontology validation).
    pub final_claims: Vec<FinalClaim>,

    /// Token usage.
    pub input_tokens: usize,
    pub output_tokens: usize,

    /// Whether parsing succeeded.
    pub parse_success: bool,

    /// Parse error if any.
    pub parse_error: Option<String>,

    /// Cache status.
    pub cache_hit: bool,

    /// Latency in milliseconds.
    pub latency_ms: u64,
}

/// A claim as parsed from LLM JSON (before validation).
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ParsedClaim {
    pub subject: String,
    pub predicate: String,
    pub value: serde_json::Value,
    pub confidence: f32,
    pub line: usize,
}

/// A claim after ontology validation.
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct FinalClaim {
    pub concept_path: String,
    pub predicate: String,
    pub value: serde_json::Value,
    pub confidence: f32,
    pub matched_ontology: bool,
    pub fuzzy_matched: bool,
}
```

**Integration point:** Modify `LlmExtractor::extract()` to emit observations.

---

### 2. Cache Key Includes Prompt Hash

**Problem:** Cache doesn't invalidate when prompt changes.

**Solution:** Include prompt hash in cache key.

```rust
// src/llm/cache.rs

impl LlmCache {
    fn compute_key(content: &str, model: &str, prompt: &str) -> String {
        let mut hasher = blake3::Hasher::new();
        hasher.update(content.as_bytes());
        hasher.update(model.as_bytes());
        hasher.update(prompt.as_bytes()); // NEW: prompt included
        hasher.finalize().to_hex().to_string()
    }
}
```
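
One caveat worth noting: feeding fields to an incremental hasher back to back hashes their concatenation, so two different `(content, model, prompt)` triples whose bytes concatenate to the same string would share a key. The sketch below illustrates the problem and one possible remedy, length-prefixed framing. The helper names `frame_fields` and `concat_fields` are hypothetical; in the real cache the framed bytes would presumably be what gets passed to `blake3::Hasher::update`.

```rust
/// Serialises fields with an 8-byte little-endian length prefix each, so the
/// boundaries between fields are part of the bytes that get hashed.
/// (Hypothetical helper, shown only to illustrate the framing idea.)
pub fn frame_fields(fields: &[&str]) -> Vec<u8> {
    let mut out = Vec::new();
    for field in fields {
        out.extend_from_slice(&(field.len() as u64).to_le_bytes());
        out.extend_from_slice(field.as_bytes());
    }
    out
}

/// Plain concatenation: the byte stream a hasher sees when fields are fed to
/// it one after another with no separator.
pub fn concat_fields(fields: &[&str]) -> Vec<u8> {
    fields.iter().flat_map(|f| f.bytes()).collect()
}
```

`concat_fields(&["ab", "c"])` and `concat_fields(&["a", "bc"])` produce identical bytes, while the framed versions differ. Whether this matters in practice depends on how likely content, model name and prompt are to shift bytes across a boundary; the framing costs only a few bytes per field.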

---

### 3. Bounded Concurrency

**Problem:** Sequential execution is slow; unbounded parallelism hits rate limits.

**Solution:** Tokio Semaphore with configurable concurrency.

```rust
// src/eval/harness.rs

use std::sync::Arc;
use tokio::sync::Semaphore;

pub struct EvalHarness {
    extractor: LlmExtractor,
    semaphore: Arc<Semaphore>,
}

impl EvalHarness {
    pub fn new(extractor: LlmExtractor, max_concurrent: usize) -> Self {
        Self {
            extractor,
            semaphore: Arc::new(Semaphore::new(max_concurrent)),
        }
    }

    pub async fn run(&self, fixtures: Vec<Fixture>) -> EvalResult {
        let handles: Vec<_> = fixtures
            .into_iter()
            .map(|fixture| {
                let sem = self.semaphore.clone();
                let extractor = self.extractor.clone();
                tokio::spawn(async move {
                    let _permit = sem.acquire().await?;
                    Self::run_fixture(&extractor, fixture).await
                })
            })
            .collect();

        let results = futures::future::join_all(handles).await;
        // aggregate...
    }
}
```
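
To make the permit mechanism concrete without pulling in an async runtime, here is a standard-library analogue of the same idea using threads, a mutex and a condition variable. It is not the Tokio implementation from the spec: `CountingSemaphore` and `peak_concurrency` are hypothetical names, and the 10 ms sleep merely stands in for an LLM call.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

/// Minimal counting semaphore built from a mutex and a condition variable.
pub struct CountingSemaphore {
    permits: Mutex<usize>,
    available: Condvar,
}

impl CountingSemaphore {
    pub fn new(permits: usize) -> Self {
        Self { permits: Mutex::new(permits), available: Condvar::new() }
    }

    /// Blocks until a permit is free, then takes it.
    pub fn acquire(&self) {
        let mut permits = self.permits.lock().unwrap();
        while *permits == 0 {
            permits = self.available.wait(permits).unwrap();
        }
        *permits -= 1;
    }

    /// Returns a permit and wakes one waiter.
    pub fn release(&self) {
        *self.permits.lock().unwrap() += 1;
        self.available.notify_one();
    }
}

/// Runs `jobs` simulated calls with at most `max_concurrent` in flight and
/// returns the highest number of calls observed running at the same time.
pub fn peak_concurrency(jobs: usize, max_concurrent: usize) -> usize {
    let sem = Arc::new(CountingSemaphore::new(max_concurrent));
    // (currently running, peak seen so far)
    let stats = Arc::new(Mutex::new((0usize, 0usize)));

    let handles: Vec<_> = (0..jobs)
        .map(|_| {
            let sem = Arc::clone(&sem);
            let stats = Arc::clone(&stats);
            thread::spawn(move || {
                sem.acquire();
                {
                    let mut s = stats.lock().unwrap();
                    s.0 += 1;
                    s.1 = s.1.max(s.0);
                }
                thread::sleep(Duration::from_millis(10)); // stand-in for an API call
                stats.lock().unwrap().0 -= 1;
                sem.release();
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
    let peak = stats.lock().unwrap().1;
    peak
}
```

Calling `peak_concurrency(12, 3)` starts twelve workers but never reports more than three running at once, which is the property the harness relies on to stay under the API rate limit.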

**Default:** 5 concurrent (configurable via `eval.max_concurrent`)

---

### 4. Rate Limit Resilience

**Problem:** 429 errors cause evaluation failures.

**Solution:** Exponential backoff with retries.

```rust
// src/llm/client.rs

use std::time::Duration;

impl GeminiClient {
    async fn call_with_retry(&self, request: &Request) -> Result<Response> {
        let mut delay = Duration::from_millis(500);
        let max_retries = 5;

        for attempt in 0..max_retries {
            match self.call(request).await {
                Ok(response) => return Ok(response),
                // Only rate-limit errors are retried; the last attempt gives up.
                Err(e) if e.is_rate_limit() => {
                    if attempt == max_retries - 1 {
                        return Err(e);
                    }
                    tracing::warn!(
                        attempt,
                        delay_ms = delay.as_millis(),
                        "Rate limited, backing off"
                    );
                    tokio::time::sleep(delay).await;
                    delay *= 2;
                }
                // Any other error fails immediately.
                Err(e) => return Err(e),
            }
        }
        unreachable!()
    }
}
```
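
The retry loop implies a fixed worst-case wait, which is worth knowing when sizing evaluation timeouts. The sketch below computes the delays the loop above would sleep for; `backoff_schedule` and `total_backoff_ms` are hypothetical helpers, not functions from the codebase.

```rust
/// Delays (in milliseconds) slept between attempts, given the initial delay
/// and the total number of attempts. The last attempt is never followed by a
/// sleep, so `max_attempts` attempts produce `max_attempts - 1` delays.
pub fn backoff_schedule(initial_ms: u64, max_attempts: u32) -> Vec<u64> {
    let mut delays = Vec::new();
    let mut delay = initial_ms;
    for _ in 1..max_attempts {
        delays.push(delay);
        // Double each time, saturating instead of overflowing.
        delay = delay.saturating_mul(2);
    }
    delays
}

/// Worst-case time spent sleeping before the final error is returned.
pub fn total_backoff_ms(initial_ms: u64, max_attempts: u32) -> u64 {
    backoff_schedule(initial_ms, max_attempts).iter().sum()
}
```

For the values in the spec (500 ms initial delay, 5 attempts) the schedule is 500, 1000, 2000 and 4000 ms, so a call that is rate limited on every attempt spends 7.5 seconds sleeping before it fails. Adding random jitter to each delay is a common refinement when several concurrent calls are throttled together, though the spec does not call for it.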

---

### 5. Fixture Format

**Problem:** No standardized test cases to validate prompt changes.

**Solution:** TOML fixtures with input, expected output, and rationale.

```toml
# tests/llm_fixtures/tls/disabled_verification.toml

[metadata]
id = "tls-001"
name = "TLS verification disabled in Python requests"
category = "tls"
language = "python"
created = "2026-02-05"

[input]
# The code to analyze
content = '''
import requests

def fetch_data(url):
    # Disable SSL verification for internal services
    response = requests.get(url, verify=False)
    return response.json()
'''

[expected]
# What the LLM MUST extract (recall test)
# Note: TOML 1.0 inline tables must stay on a single line.
must_contain = [
    { subject = "tls/cert_verification", predicate = "enabled", value = false, rationale = "requests.get(verify=False) explicitly disables TLS verification" },
]

# What the LLM MUST NOT extract (precision test)
must_not_contain = [
    { subject = "tls/cert_verification", predicate = "enabled", value = true },
]

[scoring]
# How important is this fixture?
weight = 1.0
# Expected minimum confidence from LLM
min_confidence = 0.8
```

**ExpectedClaim with rationale:**

```rust
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ExpectedClaim {
    pub subject: String,
    pub predicate: String,
    pub value: serde_json::Value,
    /// Optional explanation for why this claim is expected (shown on failure)
    #[serde(default)]
    pub rationale: Option<String>,
}
```

**Fixture categories:**

```
tests/llm_fixtures/
├── manifest.toml              # Index + baseline metrics
├── tls/                       # TLS/SSL fixtures
│   ├── disabled_verification.toml
│   ├── deprecated_version.toml
│   └── pinning_bypass.toml
├── jwt/                       # JWT fixtures
├── secrets/                   # Hardcoded secrets fixtures
├── auth/                      # Auth bypass fixtures
├── negative/                  # Safe code (expect NO claims)
│   ├── safe_tls.toml
│   └── env_var_secrets.toml
└── edge/                      # Edge cases
    ├── empty_file.toml
    └── huge_file.toml
```

**Manifest:**

```toml
# tests/llm_fixtures/manifest.toml

[corpus]
version = "1.0.0"
total_fixtures = 35

[baseline]
# Known-good metrics from last successful run
precision = 0.85
recall = 0.78
f1 = 0.81
prompt_version = "1.0.0"
model = "gemini-2.0-flash"
measured_at = "2026-02-05T10:30:00Z"
```

---

### 6. Evaluation Harness

**Problem:** No way to run fixtures and compute metrics.

**Solution:** Evaluation engine in `src/eval/`.

```rust
// src/eval/mod.rs

mod corpus; // src/eval/corpus.rs, see "Synthetic Corpus Generation"
mod db;
mod fixture;
mod harness;
mod matcher;
mod metrics;
mod perturbation;
mod report;

pub use corpus::CorpusGenerator;
pub use db::EvalDatabase;
pub use fixture::{Fixture, FixtureLoader};
pub use harness::{EvalConfig, EvalHarness, EvalResult};
pub use metrics::{Metrics, CategoryMetrics};
pub use perturbation::Perturbator;
pub use report::{Report, ReportFormat};
```

**Core types:**

```rust
// src/eval/harness.rs

use std::path::PathBuf;

use chrono::{DateTime, Utc};
use uuid::Uuid;

pub struct EvalConfig {
    /// Path to fixtures directory.
    pub fixtures_dir: PathBuf,

    /// Categories to run (None = all).
    pub categories: Option<Vec<String>>,

    /// Max fixtures to run (for smoke tests).
    pub max_fixtures: Option<usize>,

    /// Evaluation mode.
    pub mode: EvalMode,

    /// Baseline to compare against.
    pub baseline: Option<PathBuf>,

    /// Save observations to database.
    pub save_observations: bool,

    /// Maximum concurrent LLM calls.
    pub max_concurrent: usize,
}

pub enum EvalMode {
    /// Use real LLM API.
    Live,
    /// Use cached responses only (fails if not cached).
    Cached,
    /// Skip LLM, return empty claims (for testing harness).
    Mock,
    /// Perturbation testing for stability.
    Robust,
}

pub struct EvalResult {
    pub run_id: Uuid,
    pub started_at: DateTime<Utc>,
    pub completed_at: DateTime<Utc>,
    pub metrics: Metrics,
    pub fixture_results: Vec<FixtureResult>,
    pub baseline_comparison: Option<BaselineComparison>,
    pub verdict: EvalVerdict,
    /// Stability score (only in Robust mode)
    pub stability: Option<f64>,
}

pub enum EvalVerdict {
    Pass,
    Regression { regressions: Vec<String> },
    Error { message: String },
}
```

**Metrics calculation:**

```rust
// src/eval/metrics.rs

use std::collections::HashMap;

pub struct Metrics {
    /// True positives: expected claims that were extracted.
    pub true_positives: usize,
    /// False positives: extracted claims that weren't expected.
    pub false_positives: usize,
    /// False negatives: expected claims that weren't extracted.
    pub false_negatives: usize,

    /// Precision = TP / (TP + FP)
    pub precision: f64,
    /// Recall = TP / (TP + FN)
    pub recall: f64,
    /// F1 = 2 * (P * R) / (P + R)
    pub f1: f64,

    /// Total fixtures.
    pub total_fixtures: usize,
    /// Fixtures that passed.
    pub passed: usize,
    /// Fixtures that failed.
    pub failed: usize,

    /// Total tokens used.
    pub total_tokens: u64,
    /// Estimated cost (USD).
    pub estimated_cost: f64,

    /// By category.
    pub by_category: HashMap<String, CategoryMetrics>,
}

impl Metrics {
    pub fn compute(results: &[FixtureResult]) -> Self {
        let mut tp = 0;
        let mut fp = 0;
        let mut fn_ = 0;

        for result in results {
            tp += result.true_positives;
            fp += result.false_positives;
            fn_ += result.false_negatives;
        }

        // Undefined ratios (zero denominator) are reported as 0.0.
        let precision = if tp + fp > 0 { tp as f64 / (tp + fp) as f64 } else { 0.0 };
        let recall = if tp + fn_ > 0 { tp as f64 / (tp + fn_) as f64 } else { 0.0 };
        let f1 = if precision + recall > 0.0 {
            2.0 * precision * recall / (precision + recall)
        } else {
            0.0
        };

        // ... rest of computation
    }
}
```
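
As a worked check of the formulas, the arithmetic can be pulled out into a free function and run on small counts. This is a sketch that mirrors the expressions in `Metrics::compute`; the name `precision_recall_f1` is hypothetical.

```rust
/// Precision, recall and F1 from raw counts, using the same zero-denominator
/// convention as `Metrics::compute` (an undefined ratio is reported as 0.0).
pub fn precision_recall_f1(tp: usize, fp: usize, fn_: usize) -> (f64, f64, f64) {
    let precision = if tp + fp > 0 { tp as f64 / (tp + fp) as f64 } else { 0.0 };
    let recall = if tp + fn_ > 0 { tp as f64 / (tp + fn_) as f64 } else { 0.0 };
    // F1 is the harmonic mean of precision and recall.
    let f1 = if precision + recall > 0.0 {
        2.0 * precision * recall / (precision + recall)
    } else {
        0.0
    };
    (precision, recall, f1)
}
```

With 3 true positives, 1 false positive and 3 false negatives, precision is 3/4 = 0.75, recall is 3/6 = 0.5, and F1 is 2 × 0.375 / 1.25 = 0.6. Note that a run consisting only of `negative/` fixtures that correctly yields no claims has all three counts at zero and therefore reports 0.0 across the board, so fixture pass/fail counts are the more useful signal in that case.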

---

### 7. Hybrid Type-Coercive Matching

**Problem:** Strict type matching misses semantically equivalent values.

**Solution:** Coerce strings to booleans/numbers when reasonable.

```rust
// src/eval/matcher.rs

pub struct ClaimMatcher {
    /// Tolerance for confidence comparison.
    pub confidence_tolerance: f32,
}

impl ClaimMatcher {
    /// Check if extracted claims satisfy must_contain requirements.
    pub fn check_must_contain(
        &self,
        extracted: &[Observation],
        expected: &[ExpectedClaim],
    ) -> MatchResult {
        let mut matched = vec![];
        let mut unmatched = vec![];

        for exp in expected {
            if let Some(claim) = self.find_matching_claim(extracted, exp) {
                matched.push((exp.clone(), claim.clone()));
            } else {
                unmatched.push(exp.clone());
            }
        }

        MatchResult { matched, unmatched }
    }

    /// Check if any extracted claim matches (for must_not_contain).
    pub fn check_must_not_contain(
        &self,
        extracted: &[Observation],
        forbidden: &[ExpectedClaim],
    ) -> Vec<(ExpectedClaim, Observation)> {
        let mut violations = vec![];

        for forbid in forbidden {
            if let Some(claim) = self.find_matching_claim(extracted, forbid) {
                violations.push((forbid.clone(), claim.clone()));
            }
        }

        violations
    }

    // The explicit lifetime ties the returned reference to `extracted`;
    // with elision it would be tied to `&self` and fail to compile.
    fn find_matching_claim<'a>(
        &self,
        extracted: &'a [Observation],
        expected: &ExpectedClaim,
    ) -> Option<&'a Observation> {
        extracted.iter().find(|claim| {
            self.subject_matches(&claim.concept_path, &expected.subject) &&
            claim.predicate == expected.predicate &&
            self.value_matches(&claim.value, &expected.value)
        })
    }

    fn subject_matches(&self, extracted: &str, expected: &str) -> bool {
        // Allow matching on tail path (last 2 segments)
        let ext_tail = extracted.split('/').rev().take(2).collect::<Vec<_>>();
        let exp_tail = expected.split('/').rev().take(2).collect::<Vec<_>>();
        ext_tail == exp_tail
    }

    fn value_matches(&self, extracted: &ObjectValue, expected: &serde_json::Value) -> bool {
        match (extracted, expected) {
            // Direct matches
            (ObjectValue::Boolean(e), serde_json::Value::Bool(x)) => e == x,
            (ObjectValue::Number(e), serde_json::Value::Number(x)) => {
                x.as_f64().map(|n| (e - n).abs() < 0.001).unwrap_or(false)
            }
            (ObjectValue::Text(e), serde_json::Value::String(x)) => e == x,

            // Coercion: string -> boolean
            (ObjectValue::Boolean(e), serde_json::Value::String(s)) => {
                self.coerce_to_bool(s).map(|b| *e == b).unwrap_or(false)
            }
            (ObjectValue::Text(e), serde_json::Value::Bool(x)) => {
                self.coerce_to_bool(e).map(|b| b == *x).unwrap_or(false)
            }

            // Coercion: string -> number
            (ObjectValue::Number(e), serde_json::Value::String(s)) => {
                s.parse::<f64>().map(|n| (e - n).abs() < 0.001).ok().unwrap_or(false)
            }

            _ => false,
        }
    }

    fn coerce_to_bool(&self, s: &str) -> Option<bool> {
        match s.to_lowercase().as_str() {
            "true" | "yes" | "on" | "enabled" | "1" => Some(true),
            "false" | "no" | "off" | "disabled" | "0" => Some(false),
            _ => None,
        }
    }
}
```
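
The two matching rules that are easiest to get wrong, tail-path comparison and boolean coercion, can be exercised in isolation. The sketch below restates them as free functions so the behaviour is visible without the project's `Observation` or `ObjectValue` types; `subject_tail_matches` is a hypothetical name for the logic in `subject_matches`.

```rust
/// Compares the last two `/`-separated segments of two concept paths,
/// following the logic of `ClaimMatcher::subject_matches`.
pub fn subject_tail_matches(extracted: &str, expected: &str) -> bool {
    // Last two segments, in reverse order (leaf first).
    fn tail(path: &str) -> Vec<&str> {
        path.split('/').rev().take(2).collect()
    }
    tail(extracted) == tail(expected)
}

/// Maps common truthy/falsy spellings to a boolean, case-insensitively.
pub fn coerce_to_bool(s: &str) -> Option<bool> {
    match s.to_lowercase().as_str() {
        "true" | "yes" | "on" | "enabled" | "1" => Some(true),
        "false" | "no" | "off" | "disabled" | "0" => Some(false),
        _ => None,
    }
}
```

A longer extracted path such as `code/python/tls/cert_verification` matches the fixture subject `tls/cert_verification`, because only the last two segments are compared. A bare `cert_verification` does not match, since its tail has one segment rather than two, and `jwt/cert_verification` is rejected because the parent segment differs.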

---

### 8. Perturbation Testing Mode

**Problem:** Need to verify LLM consistency across minor input variations.

**Solution:** Perturbation mode that tests stability.

```rust
// src/eval/perturbation.rs

use crate::Language;

pub struct Perturbator;

impl Perturbator {
    /// Generate perturbed variants of input content
    pub fn perturb(content: &str, language: Language) -> Vec<String> {
        let mut variants = vec![content.to_string()];
        variants.push(Self::add_trailing_whitespace(content));
        variants.push(Self::normalize_indentation(content));
        variants.push(Self::add_innocuous_comments(content, language));
        variants.push(Self::remove_comments(content, language));
        variants
    }

    fn add_trailing_whitespace(content: &str) -> String {
        content
            .lines()
            .map(|line| format!("{} ", line))
            .collect::<Vec<_>>()
            .join("\n")
    }

    fn normalize_indentation(content: &str) -> String {
        // Convert tabs to spaces or vice versa
        content.replace('\t', "    ")
    }

    fn add_innocuous_comments(content: &str, language: Language) -> String {
        let comment_prefix = match language {
            Language::Python => "#",
            Language::JavaScript | Language::TypeScript | Language::Rust | Language::Go => "//",
            _ => "#",
        };
        format!("{} Auto-generated file\n{}", comment_prefix, content)
    }

    fn remove_comments(content: &str, language: Language) -> String {
        // Simple single-line comment removal
        let comment_prefix = match language {
            Language::Python => "#",
            Language::JavaScript | Language::TypeScript | Language::Rust | Language::Go => "//",
            _ => "#",
        };
        content
            .lines()
            .filter(|line| !line.trim().starts_with(comment_prefix))
            .collect::<Vec<_>>()
            .join("\n")
    }
}
```

**Stability metric:** % of perturbations producing identical claims.

CLI: `aphoria eval run --mode robust`
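
The stability metric can be sketched as a set comparison. The function below is an illustration only: the name `stability`, the representation of a claim as a `String` key (for example `subject|predicate|value`), and the choice to return 1.0 for an empty variant list are all assumptions rather than details from the spec.

```rust
use std::collections::BTreeSet;

/// Fraction of perturbed variants whose claim set equals the claim set
/// extracted from the unmodified input. Returns 1.0 when there are no
/// variants to compare (an assumption made for this sketch).
pub fn stability(original: &BTreeSet<String>, variants: &[BTreeSet<String>]) -> f64 {
    if variants.is_empty() {
        return 1.0;
    }
    // Sets compare equal regardless of the order claims were extracted in.
    let identical = variants.iter().filter(|claims| *claims == original).count();
    identical as f64 / variants.len() as f64
}

/// Convenience constructor for building claim sets from string keys.
pub fn claim_set(keys: &[&str]) -> BTreeSet<String> {
    keys.iter().map(|k| k.to_string()).collect()
}
```

If the original input yields one claim and four perturbed variants yield, respectively, the same claim, the same claim, an extra claim and no claims, the score is 2/4 = 0.5. Since `Perturbator::perturb` returns the unmodified content as its first element, that element would serve as the reference here and be left out of `variants`.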

---

### 9. Structured Decoding (Gemini Response Schema)

**Problem:** Free-form JSON parsing can fail.

**Solution:** Use Gemini's `response_schema` to constrain responses to a fixed JSON structure.

```rust
// src/llm/client.rs

impl GeminiClient {
    fn build_request(&self, content: &str, prompt: &str) -> Request {
        Request {
            contents: vec![Content {
                role: "user".to_string(),
                parts: vec![Part::Text(content.to_string())],
            }],
            generation_config: GenerationConfig {
                temperature: 0.1,
                response_mime_type: "application/json".to_string(),
                response_schema: Some(self.claims_schema()),
            },
        }
    }

    fn claims_schema(&self) -> Schema {
        Schema {
            type_: "object".to_string(),
            properties: hashmap! {
                "claims".to_string() => Schema {
                    type_: "array".to_string(),
                    items: Some(Box::new(Schema {
                        type_: "object".to_string(),
                        properties: hashmap! {
                            "subject".to_string() => Schema { type_: "string".to_string(), ..Default::default() },
                            "predicate".to_string() => Schema { type_: "string".to_string(), ..Default::default() },
                            "value".to_string() => Schema { type_: "any".to_string(), ..Default::default() },
                            "confidence".to_string() => Schema { type_: "number".to_string(), ..Default::default() },
                            "line".to_string() => Schema { type_: "integer".to_string(), ..Default::default() },
                        },
                        // `required` is a Vec<String>, matching the outer schema below.
                        required: vec![
                            "subject".to_string(),
                            "predicate".to_string(),
                            "value".to_string(),
                            "confidence".to_string(),
                        ],
                        ..Default::default()
                    })),
                    ..Default::default()
                },
            },
            required: vec!["claims".to_string()],
            ..Default::default()
        }
    }
}
```

**Benefit:** Eliminates malformed-JSON parse failures. A response cut off at the output token limit may still fail to parse, so `parse_success` and `parse_error` remain worth recording.

---

### 10. Synthetic Corpus Generation

**Problem:** Manual fixture creation is slow.

**Solution:** Generate fixtures from real scans with human review.

```bash
aphoria eval generate-corpus \
    --scan-path /path/to/codebase \
    --output-dir tests/llm_fixtures/synthetic \
    --sample-size 50
```

```rust
// src/eval/corpus.rs

pub struct CorpusGenerator {
    extractor: LlmExtractor,
}

impl CorpusGenerator {
    pub async fn generate(
        &self,
        scan_path: &Path,
        output_dir: &Path,
        sample_size: usize,
    ) -> Result<Vec<PathBuf>> {
        let findings = self.extractor.scan(scan_path).await?;
        let sample = self.stratified_sample(&findings, sample_size);
        let mut fixtures = vec![];
        for finding in sample {
            fixtures.push(self.create_fixture(&finding, output_dir)?);
        }
        Ok(fixtures)
    }

    // The explicit lifetime ties the returned references to `findings`
    // rather than to `&self`.
    fn stratified_sample<'a>(&self, findings: &'a [Finding], size: usize) -> Vec<&'a Finding> {
        // Group findings by category
        let mut by_category: HashMap<&str, Vec<&Finding>> = HashMap::new();
        for f in findings {
            by_category.entry(&f.category).or_default().push(f);
        }

        // Take an equal share from each category (at least one each, so a
        // sample size smaller than the category count is not rounded to zero)
        let per_category = (size / by_category.len().max(1)).max(1);
        let mut sample = vec![];
        for (_, items) in by_category {
            sample.extend(items.iter().take(per_category));
        }
        sample.truncate(size);
        sample
    }

    fn create_fixture(&self, finding: &Finding, output_dir: &Path) -> Result<PathBuf> {
        let fixture = Fixture {
            metadata: FixtureMetadata {
                id: format!("auto-{}", Uuid::new_v4()),
                name: finding.description.clone(),
                category: finding.category.clone(),
                language: finding.language.to_string(),
                created: Utc::now().date_naive().to_string(),
            },
            input: FixtureInput {
                content: finding.code_snippet.clone(),
            },
            expected: FixtureExpected {
                must_contain: vec![ExpectedClaim {
                    subject: finding.subject.clone(),
                    predicate: finding.predicate.clone(),
                    value: finding.value.clone(),
                    rationale: Some("Auto-generated - requires human review".to_string()),
                }],
                must_not_contain: vec![],
            },
            scoring: FixtureScoring {
                weight: 1.0,
                min_confidence: 0.7,
            },
        };

        let path = output_dir
            .join(&finding.category)
            .join(format!("{}.toml", fixture.metadata.id));
        std::fs::create_dir_all(path.parent().unwrap())?;
        std::fs::write(&path, toml::to_string_pretty(&fixture)?)?;
        Ok(path)
    }
}
```

**Workflow:** Scan -> Human review -> Commit to corpus

---

### 11. CLI Commands

**Problem:** No way to run evaluations from command line.

**Solution:** Add `aphoria eval` subcommand.

```rust
// src/cli.rs additions

#[derive(Subcommand)]
pub enum Commands {
    // ... existing commands ...

    /// Evaluate LLM prompt effectiveness
    Eval {
        #[command(subcommand)]
        command: EvalCommands,
    },
}

#[derive(Subcommand)]
pub enum EvalCommands {
    /// Run evaluation against fixtures
    Run {
        /// Path to fixtures directory
        #[arg(long, default_value = "tests/llm_fixtures")]
        fixtures: PathBuf,

        /// Categories to run (comma-separated)
        #[arg(long)]
        categories: Option<String>,

        /// Maximum fixtures to run
        #[arg(long)]
        max_fixtures: Option<usize>,

        /// Evaluation mode: live, cached, mock, robust
        #[arg(long, default_value = "cached")]
        mode: String,

        /// Baseline file to compare against
        #[arg(long)]
        baseline: Option<PathBuf>,

        /// Exit with code 1 if regression detected
        #[arg(long)]
        fail_on_regression: bool,

        /// Regression threshold (default: 0.05 = 5%)
        #[arg(long, default_value = "0.05")]
        threshold: f64,

        /// Save observation logs
        #[arg(long)]
        save_observations: bool,

        /// Output format: table, json, markdown
        #[arg(long, default_value = "table")]
        format: String,
    },

    /// Show current baseline metrics
    Baseline {
        /// Path to fixtures directory
        #[arg(long, default_value = "tests/llm_fixtures")]
        fixtures: PathBuf,
    },

    /// Update baseline from latest run
    UpdateBaseline {
        /// Run ID to use as new baseline
        #[arg(long)]
        run_id: Option<String>,

        /// Path to fixtures directory
        #[arg(long, default_value = "tests/llm_fixtures")]
        fixtures: PathBuf,

        /// Must be passed to overwrite the baseline. Without it the handler
        /// prints the current baseline and exits, so the flag is checked in
        /// the handler rather than marked `required` in clap.
        #[arg(long)]
        force: bool,
    },

    /// List fixtures
    ListFixtures {
        /// Path to fixtures directory
        #[arg(long, default_value = "tests/llm_fixtures")]
        fixtures: PathBuf,

        /// Filter by category
        #[arg(long)]
        category: Option<String>,
    },

    /// Validate fixture format
    ValidateFixtures {
        /// Path to fixtures directory
        #[arg(long, default_value = "tests/llm_fixtures")]
        fixtures: PathBuf,
    },

    /// Generate fixtures from real scans
    GenerateCorpus {
        /// Path to codebase to scan
        #[arg(long)]
        scan_path: PathBuf,

        /// Output directory for generated fixtures
        #[arg(long)]
        output_dir: PathBuf,

        /// Number of fixtures to generate
        #[arg(long, default_value = "50")]
        sample_size: usize,
    },
}
```
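
Because `--mode` arrives as a `String`, the handler has to turn it into an `EvalMode` and reject typos. A minimal sketch of that conversion using the standard `FromStr` trait follows; the `EvalMode` enum is re-declared here (with extra derives) only to keep the example self-contained, and the error wording is an assumption.

```rust
use std::str::FromStr;

/// Local copy of the spec's `EvalMode`, re-declared so the sketch compiles
/// on its own.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EvalMode {
    Live,
    Cached,
    Mock,
    Robust,
}

impl FromStr for EvalMode {
    type Err = String;

    /// Parses the `--mode` value case-insensitively, ignoring surrounding
    /// whitespace, and reports the accepted values on failure.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.trim().to_ascii_lowercase().as_str() {
            "live" => Ok(Self::Live),
            "cached" => Ok(Self::Cached),
            "mock" => Ok(Self::Mock),
            "robust" => Ok(Self::Robust),
            other => Err(format!(
                "unknown eval mode '{other}' (expected live, cached, mock or robust)"
            )),
        }
    }
}
```

In the handler this would be used as `mode.parse::<EvalMode>()`. Deriving clap's `ValueEnum` on the enum and typing the field as `EvalMode` would be another option, moving the validation into argument parsing.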

**Usage examples:**

```bash
# Run smoke test (cached responses, fast)
aphoria eval run --mode cached --max-fixtures 10

# Run full evaluation (live API calls)
aphoria eval run --mode live --save-observations

# Run with baseline comparison
aphoria eval run --baseline tests/llm_fixtures/manifest.toml --fail-on-regression

# Run perturbation testing
aphoria eval run --mode robust --max-fixtures 5

# Show current baseline
aphoria eval baseline

# Update baseline (requires --force)
aphoria eval update-baseline --force

# List fixtures
aphoria eval list-fixtures --category tls

# Validate fixture format
aphoria eval validate-fixtures

# Generate fixtures from real codebase
aphoria eval generate-corpus --scan-path ./my-project --output-dir ./test-fixtures
```

**Baseline safety:** Without `--force`, `update-baseline` shows:
```
Current baseline: precision=0.85, recall=0.78, f1=0.81 (2026-02-05)
To update, re-run with --force
```

---

### 12. Report Output

**Problem:** Need human-readable and machine-readable output.

**Solution:** Multiple report formats.

**Table format (default):**

```
╭────────────────────────────────────────────────────────────────────╮
│ LLM Prompt Evaluation Report                                       │
├────────────────────────────────────────────────────────────────────┤
│ Run ID: abc123-def456                                              │
│ Date:   2026-02-05 14:30:00 UTC                                    │
│ Prompt: v1.0.0                                                     │
│ Model:  gemini-2.0-flash                                           │
╰────────────────────────────────────────────────────────────────────╯

Summary
╭──────────┬─────────┬──────────┬────────┬────────╮
│ Metric   │ Current │ Baseline │ Delta  │ Status │
├──────────┼─────────┼──────────┼────────┼────────┤
│ Precision│ 0.87    │ 0.85     │ +0.02  │ ✓      │
│ Recall   │ 0.76    │ 0.78     │ -0.02  │ ⚠      │
│ F1       │ 0.81    │ 0.81     │ +0.00  │ ✓      │
╰──────────┴─────────┴──────────┴────────┴────────╯

Verdict: ⚠ REVIEW - Recall dropped by 2%

Category Breakdown
╭──────────┬──────────┬────────┬────────╮
│ Category │ Fixtures │ Passed │ Failed │
├──────────┼──────────┼────────┼────────┤
│ tls      │ 12       │ 11     │ 1      │
│ jwt      │ 8        │ 6      │ 2      │
│ secrets  │ 15       │ 14     │ 1      │
│ negative │ 10       │ 10     │ 0      │
╰──────────┴──────────┴────────┴────────╯

Regressions (2)
  - jwt-003: JWT algorithm none detection
    Expected: jwt/algorithm = "none"
    Rationale: alg:"none" bypasses signature verification entirely
    Got: Not extracted

  - tls-007: TLS version in constants (IMPROVED)
    Previously: Not extracted
    Now: tls/min_version = "1.0" ✓

Cost: 125,430 tokens ($0.12)
```

**JSON format:**

```json
{
  "run_id": "abc123-def456",
  "timestamp": "2026-02-05T14:30:00Z",
  "prompt_version": "1.0.0",
  "model": "gemini-2.0-flash",
  "metrics": {
    "precision": 0.87,
    "recall": 0.76,
    "f1": 0.81,
    "total_fixtures": 45,
    "passed": 41,
    "failed": 4
  },
  "baseline_comparison": {
    "precision_delta": 0.02,
    "recall_delta": -0.02,
    "has_regression": true,
    "regression_threshold": 0.05
  },
  "stability": 0.92,
  "verdict": "review",
  "fixture_results": [...]
}
```
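
The `*_delta` fields and the threshold check can be sketched as below. This assumes the threshold is an absolute drop in the metric (0.05 meaning five points of precision or recall); the spec does not say whether it is absolute or relative, and `MetricDelta` and `compare_metric` are hypothetical names.

```rust
/// Result of comparing one metric against its baseline value.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct MetricDelta {
    /// current - baseline (negative means the metric got worse).
    pub delta: f64,
    /// True when the drop exceeds the threshold.
    pub regressed: bool,
}

/// Flags a regression when the metric fell by more than `threshold`
/// in absolute terms (an assumption made for this sketch).
pub fn compare_metric(current: f64, baseline: f64, threshold: f64) -> MetricDelta {
    let delta = current - baseline;
    MetricDelta {
        delta,
        regressed: -delta > threshold,
    }
}
```

Under this rule the sample run's recall (0.76 against a baseline of 0.78) is a drop of 0.02, which stays inside the 0.05 threshold, so a metric-level check alone would not flag it. The `has_regression` value in the sample output would then have to reflect some other signal, such as the per-fixture regressions listed in the table report.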

---

## Implementation Plan

### Phase 1: Core Infrastructure (2 days)

| Task | File | Description |
|------|------|-------------|
| 1.1 | `src/eval/db.rs` | SQLite database with retention |
| 1.2 | `src/llm/cache.rs` | Update cache key to include prompt hash |
| 1.3 | `src/llm/client.rs` | Exponential backoff for 429s |

**Acceptance:** Database stores observations, cache invalidates on prompt change, rate limits handled gracefully.
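
Task 1.3 can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the actual `GeminiClient` code: the names `backoff_delay` and `retry_on_429`, the 30-second cap, and the use of a bare `u16` status code as the error type are all assumptions made for the example. The sleep function is injected so the schedule can be checked without waiting.

```rust
use std::time::Duration;

/// Delay before retry number `attempt` (0-based): the initial delay doubled
/// once per attempt, capped at `max_delay`. Hypothetical helper.
pub fn backoff_delay(initial: Duration, attempt: u32, max_delay: Duration) -> Duration {
    let factor = 2u32.saturating_pow(attempt);
    initial.saturating_mul(factor).min(max_delay)
}

/// Calls `call` until it succeeds, fails with something other than HTTP 429,
/// or `max_retries` retries have been used. The error type is a bare status
/// code here purely for illustration.
pub fn retry_on_429<T>(
    max_retries: u32,
    initial: Duration,
    mut call: impl FnMut() -> Result<T, u16>,
    mut sleep: impl FnMut(Duration),
) -> Result<T, u16> {
    let mut attempt = 0;
    loop {
        match call() {
            // Rate limited and retries remain: wait, then try again.
            Err(429) if attempt < max_retries => {
                sleep(backoff_delay(initial, attempt, Duration::from_secs(30)));
                attempt += 1;
            }
            // Success, a non-429 error, or retries exhausted.
            other => return other,
        }
    }
}
```

With an initial delay of 500 ms, this schedule waits 500 ms, 1 s, 2 s, 4 s, and 8 s across five retries. A production version would likely add jitter and honor any `Retry-After` header the API returns.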

### Phase 2: Fixture & Matching (2 days)

| Task | File | Description |
|------|------|-------------|
| 2.1 | `src/eval/fixture.rs` | Define `Fixture`, `ExpectedClaim` with rationale |
| 2.2 | `src/eval/matcher.rs` | Hybrid type-coercive matching |
| 2.3 | `tests/llm_fixtures/` | Create 10 seed fixtures |
| 2.4 | `src/eval/fixture.rs` | Add `FixtureLoader` |

**Acceptance:** Can load fixtures from TOML, matching handles type coercion.
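
To make "type-coercive matching" concrete, here is one possible sketch. The `Value` enum and the `values_match` function are hypothetical stand-ins for whatever `src/eval/matcher.rs` ends up using. The assumed rule is that a string is coerced only when the other side is a boolean or a number, so two strings are always compared literally.

```rust
/// Hypothetical claim value type, for illustration only.
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Bool(bool),
    Number(f64),
    Text(String),
}

/// Parses "true"/"false" case-insensitively, ignoring surrounding whitespace.
fn parse_bool(s: &str) -> Option<bool> {
    let t = s.trim();
    if t.eq_ignore_ascii_case("true") {
        Some(true)
    } else if t.eq_ignore_ascii_case("false") {
        Some(false)
    } else {
        None
    }
}

/// Returns true when the two values are equal after coercing a string to the
/// type of the other side. Same-typed values are compared directly.
pub fn values_match(expected: &Value, actual: &Value) -> bool {
    match (expected, actual) {
        (Value::Bool(b), Value::Text(s)) | (Value::Text(s), Value::Bool(b)) => {
            parse_bool(s) == Some(*b)
        }
        (Value::Number(n), Value::Text(s)) | (Value::Text(s), Value::Number(n)) => {
            s.trim().parse::<f64>().map_or(false, |p| p == *n)
        }
        _ => expected == actual,
    }
}
```

Under this rule `"true"` matches `true` and `"443"` matches `443`, while version strings such as `"1.0"` and `"1"` stay distinct because neither side is numeric.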

### Phase 3: Evaluation Harness (2 days)

| Task | File | Description |
|------|------|-------------|
| 3.1 | `src/eval/harness.rs` | Bounded concurrency with Semaphore |
| 3.2 | `src/eval/metrics.rs` | Implement `Metrics::compute()` |
| 3.3 | `src/eval/harness.rs` | Baseline comparison |
| 3.4 | `src/eval/perturbation.rs` | Perturbation testing |

**Acceptance:** Can run fixtures with bounded parallelism, compute precision/recall, measure stability.
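
For task 3.2, the standard definitions are precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 = 2PR / (P + R). The sketch below shows one way `Metrics::compute()` might look. The signature, which takes raw counts, is an assumption, as is the convention of scoring an empty denominator as 1.0 so that a negative fixture with no expected and no extracted claims counts as perfect.

```rust
/// Aggregate extraction quality. Field names follow the JSON report;
/// the constructor signature is an assumption for this sketch.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Metrics {
    pub precision: f64,
    pub recall: f64,
    pub f1: f64,
}

/// F1 is the harmonic mean of precision and recall (0.0 when both are 0).
pub fn harmonic_mean(p: f64, r: f64) -> f64 {
    if p + r == 0.0 {
        0.0
    } else {
        2.0 * p * r / (p + r)
    }
}

impl Metrics {
    /// Computes metrics from true positives, false positives, and false negatives.
    pub fn compute(true_pos: u32, false_pos: u32, false_neg: u32) -> Self {
        let tp = f64::from(true_pos);
        let extracted = tp + f64::from(false_pos); // claims the LLM produced
        let expected = tp + f64::from(false_neg); // claims the fixtures expect
        // Assumed convention: nothing extracted / nothing expected scores 1.0.
        let precision = if extracted == 0.0 { 1.0 } else { tp / extracted };
        let recall = if expected == 0.0 { 1.0 } else { tp / expected };
        Self { precision, recall, f1: harmonic_mean(precision, recall) }
    }
}
```

As a sanity check against the sample report, precision 0.87 and recall 0.76 give F1 = 2(0.87)(0.76) / 1.63 ≈ 0.81, which matches the reported value.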

### Phase 4: Structured Decoding (1 day)

| Task | File | Description |
|------|------|-------------|
| 4.1 | `src/llm/client.rs` | Gemini Response Schema integration |

**Acceptance:** LLM always returns valid JSON, no parse failures.

### Phase 5: CLI & Corpus (2 days)

| Task | File | Description |
|------|------|-------------|
| 5.1 | `src/cli.rs` | Add `EvalCommands` with `--force`, `--mode robust` |
| 5.2 | `src/handlers/eval.rs` | Implement all eval command handlers |
| 5.3 | `src/eval/corpus.rs` | Corpus generation from scans |

**Acceptance:** `aphoria eval run` works end-to-end, corpus generation functional.

### Phase 6: Reports & Polish (2 days)

| Task | File | Description |
|------|------|-------------|
| 6.1 | `src/eval/report.rs` | Table/JSON formats with rationale in failures |
| 6.2 | `src/eval/report.rs` | Stability metrics display |
| 6.3 | `tests/llm_fixtures/` | Expand to 25+ fixtures |

**Acceptance:** Reports show rationale on missed claims, stability metrics visible.

**Total:** 11 days

---

## Fixture Seed List

Initial 10 fixtures to create:

| ID | Category | Name | Tests |
|----|----------|------|-------|
| tls-001 | tls | Disabled verification (requests) | `verify=False` |
| tls-002 | tls | Deprecated TLS version | `min_version="TLSv1"` |
| jwt-001 | jwt | Algorithm none | `alg: "none"` |
| jwt-002 | jwt | Skip signature verification | `verify=False` |
| secrets-001 | secrets | Hardcoded API key | `API_KEY = "sk_..."` |
| secrets-002 | secrets | High entropy token | Shannon entropy > 4.5 |
| auth-001 | auth | Debug auth bypass | `X-Debug-Auth` header |
| negative-001 | negative | Safe TLS config | `verify=True` (no claims) |
| negative-002 | negative | Env var secrets | `os.getenv()` (no claims) |
| edge-001 | edge | Empty file | Empty content (no claims) |
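
The `secrets-002` fixture relies on a Shannon entropy threshold. For reference, the entropy of a string is H = -Σ pᵢ log₂ pᵢ over its symbol frequencies. The sketch below computes it per byte; whether the scanner measures bytes or Unicode characters is not stated here, so treat that choice, and the function name `shannon_entropy`, as assumptions.

```rust
/// Shannon entropy of `s` in bits per byte. Returns 0.0 for an empty string.
pub fn shannon_entropy(s: &str) -> f64 {
    let bytes = s.as_bytes();
    if bytes.is_empty() {
        return 0.0;
    }
    // Count how often each byte value occurs.
    let mut counts = [0usize; 256];
    for &b in bytes {
        counts[b as usize] += 1;
    }
    let len = bytes.len() as f64;
    // Sum -p * log2(p) over the byte values that actually appear.
    counts
        .iter()
        .filter(|&&c| c > 0)
        .map(|&c| {
            let p = c as f64 / len;
            -p * p.log2()
        })
        .sum()
}
```

Because entropy cannot exceed log₂ of the number of distinct symbols, a token needs at least 23 distinct byte values to score above 4.5. A pure hexadecimal string tops out at 4.0, so a fixture for this case should use a token with a wider alphabet, such as base64.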

---

## Configuration

Add to `aphoria.toml`:

```toml
[eval]
# Save observations during scans
save_observations = false

# SQLite database path
database_path = "~/.aphoria/eval/observations.db"

# Default fixtures directory
fixtures_dir = "tests/llm_fixtures"

# Regression threshold (5% = 0.05)
regression_threshold = 0.05

# Maximum concurrent LLM calls
max_concurrent = 5

# Retention: days to keep observations
retention_days = 30

# Retention: max observations to keep regardless of age
retention_max_count = 1000

# Rate limit: initial backoff delay (ms)
rate_limit_initial_delay_ms = 500

# Rate limit: max retries before failing
rate_limit_max_retries = 5
```
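
The two retention settings interact, so a small in-memory sketch may help. It reads `retention_max_count` as a hard cap applied after the age cutoff, which is only one possible interpretation. The real implementation would presumably do this with SQL `DELETE` statements in `src/eval/db.rs`, and the `Observation` struct and `prune` function here are hypothetical.

```rust
/// Hypothetical stand-in for a stored observation row.
#[derive(Debug, Clone, PartialEq)]
pub struct Observation {
    pub id: u64,
    pub age_days: u32,
}

/// Applies both retention rules: drop anything older than `retention_days`,
/// then keep only the `retention_max_count` newest survivors.
pub fn prune(
    mut observations: Vec<Observation>,
    retention_days: u32,
    retention_max_count: usize,
) -> Vec<Observation> {
    // Rule 1: age cutoff.
    observations.retain(|o| o.age_days <= retention_days);
    // Rule 2: count cap, newest first (stable sort keeps insertion order on ties).
    observations.sort_by_key(|o| o.age_days);
    observations.truncate(retention_max_count);
    observations
}
```

With the defaults above, a call such as `prune(rows, 30, 1000)` would keep at most 1000 observations, none older than 30 days.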

---

## Success Criteria

| Metric | Target |
|--------|--------|
| Can run `aphoria eval run` | Works |
| Baseline comparison | Detects 5% regression |
| Fixtures load correctly | 100% valid fixtures load |
| Metrics match manual calculation | Within 0.01 |
| Report is readable | Human-verified |
| Type coercion works | "true" matches true |
| Perturbation mode | Stability metric computed |
| Rate limit handling | Survives 429 burst |

---

## File Structure After Implementation

```
applications/aphoria/
├── src/
│   ├── eval/
│   │   ├── mod.rs
│   │   ├── db.rs              # SQLite storage
│   │   ├── corpus.rs          # Synthetic fixture generation
│   │   ├── fixture.rs         # Fixture loading
│   │   ├── harness.rs         # Evaluation engine
│   │   ├── matcher.rs         # Claim matching (type-coercive)
│   │   ├── metrics.rs         # Precision/recall
│   │   ├── perturbation.rs    # Perturbation testing
│   │   └── report.rs          # Output formatting
│   ├── llm/
│   │   ├── observation.rs     # Observation logging
│   │   └── ...
│   ├── handlers/
│   │   ├── eval.rs            # Eval command handlers
│   │   └── ...
│   └── ...
└── tests/
    └── llm_fixtures/
        ├── manifest.toml
        ├── tls/
        ├── jwt/
        ├── secrets/
        ├── auth/
        ├── negative/
        └── edge/
```

---

## Verification

```bash
# Build
cargo build -p aphoria

# Test
cargo test -p aphoria

# SQLite retention check
sqlite3 ~/.aphoria/eval/observations.db "SELECT COUNT(*) FROM observations"

# Bounded concurrency (watch logs)
RUST_LOG=debug aphoria eval run --mode live 2>&1 | grep "permit"

# Perturbation mode
aphoria eval run --mode robust --max-fixtures 5

# Corpus generation
aphoria eval generate-corpus --scan-path ./test-project --output-dir ./test-fixtures
```

---

## Open Questions Resolved

| Question | Decision |
|----------|----------|
| Baseline storage | In `manifest.toml` (simple, versioned with fixtures) |
| Observation storage | SQLite with 30-day/1000-count retention |
| Matching strictness | Tail-path + type-coercive matching |
| Mock vs Live in CI | Cached mode for PR, live for manual |
| Parallelism | Bounded (5 default) via Tokio Semaphore |
| Baseline safety | Requires `--force` flag |
| Structured output | Gemini Response Schema |

---

*Ready for implementation.*