# LLM Prompt Evaluation System

> **Status:** Proposed (2026-02-05)
> **Phase:** 7.8 (extends Phase 7.5 LLM-in-the-Loop Extraction)
> **Author:** Architecture Team

---

## Problem Statement

Aphoria's LLM-powered claim extraction (Phase 7.5) uses Gemini to extract security claims from high-value code files. The prompts that drive this extraction are effectively **code that we don't treat like code**:

| Aspect               | Traditional Code       | Current Prompts                      |
| -------------------- | ---------------------- | ------------------------------------ |
| Version Control      | Git commits            | In files, but no semantic versioning |
| Testing              | Unit/integration tests | None                                 |
| Metrics              | Coverage, performance  | None                                 |
| Regression Detection | CI failures            | None                                 |
| Quality Gates        | Linting, review        | None                                 |

**The result:** When we change a prompt, we have no systematic way to know if we made things better or worse. We're flying blind.

### Enterprise Requirements

For enterprise adoption, customers need assurance that:

1. **Prompts produce consistent, high-quality results** - Not random outputs
2. **Changes are validated before deployment** - Regressions are caught
3. **Performance is measurable** - Precision, recall, cost are tracked
4. **The system improves over time** - With evidence, not hope

---

## Goals

### Primary Goals

1. **Observability** - Understand prompt effectiveness through metrics and logging
2. **Testability** - Validate prompts against known scenarios with expected outcomes
3. **Repeatability** - Run evaluations consistently across environments
4. **Automation** - Scheduled jobs that detect regressions without human intervention

### Non-Goals (Phase 7.8)

- Real-time prompt optimization (future: Phase 9)
- A/B testing in production (future: Phase 9)
- Multi-model comparison (future)
- Prompt compression/optimization (future)

---

## Architecture Overview

```
┌──────────────────────────────────────────────────────────────────────────────┐
│                         LLM Prompt Evaluation System                         │
├──────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ┌─────────────────┐                                                         │
│  │  Golden Corpus  │  Test fixtures with expected outcomes                   │
│  │  (fixtures/)    │  - Code snippets                                        │
│  │                 │  - Expected claims (must-contain, must-not-contain)     │
│  └────────┬────────┘  - Metadata (language, category, difficulty)            │
│           │                                                                  │
│           ▼                                                                  │
│  ┌─────────────────┐                                                         │
│  │   Evaluation    │  Orchestrates test runs                                 │
│  │    Harness      │  - Loads fixtures                                       │
│  │                 │  - Invokes LLM Extractor                                │
│  │                 │  - Compares outputs                                     │
│  │                 │  - Computes metrics                                     │
│  └────────┬────────┘                                                         │
│           │                                                                  │
│           ├──────────────────────┐                                           │
│           │                      │                                           │
│           ▼                      ▼                                           │
│  ┌─────────────────┐    ┌─────────────────┐                                  │
│  │  LLM Extractor  │    │   Observation   │                                  │
│  │  (instrumented) │    │       Log       │                                  │
│  │                 │    │                 │                                  │
│  │  - Same code    │    │  - Prompt ver   │                                  │
│  │    path as      │    │  - Input hash   │                                  │
│  │    production   │    │  - Output       │                                  │
│  │                 │    │  - Latency      │                                  │
│  │                 │    │  - Tokens       │                                  │
│  └─────────────────┘    │  - Model        │                                  │
│                         │  - Timestamp    │                                  │
│                         └────────┬────────┘                                  │
│                                  │                                           │
│                                  ▼                                           │
│ ┌──────────────────────────────────────────────────────────────────────────┐ │
│ │                            Metrics & Reports                             │ │
│ │                                                                          │ │
│ │  ┌─────────────┐   ┌─────────────┐   ┌─────────────┐   ┌─────────────┐   │ │
│ │  │  Precision  │   │   Recall    │   │  F1 Score   │   │    Cost     │   │ │
│ │  │ TP/(TP+FP)  │   │ TP/(TP+FN)  │   │  Harmonic   │   │  Tokens/$   │   │ │
│ │  └─────────────┘   └─────────────┘   └─────────────┘   └─────────────┘   │ │
│ │                                                                          │ │
│ │  ┌────────────────────────────────────────────────────────────────────┐  │ │
│ │  │                         Regression Report                          │  │ │
│ │  │  - Comparison against baseline                                     │  │ │
│ │  │  - Per-fixture deltas                                              │  │ │
│ │  │  - Category breakdown                                              │  │ │
│ │  │  - Recommendations                                                 │  │ │
│ │  └────────────────────────────────────────────────────────────────────┘  │ │
│ └──────────────────────────────────────────────────────────────────────────┘ │
│                                                                              │
└──────────────────────────────────────────────────────────────────────────────┘
```

---

## Core Components

### 1. Golden Corpus

A curated set of test fixtures with known expected outcomes.

#### Fixture Format

```toml
# fixtures/tls/disabled_verification.toml

[metadata]
id = "tls-001"
name = "TLS verification disabled in Python requests"
category = "tls"
subcategory = "certificate_verification"
language = "python"
difficulty = "easy"         # easy | medium | hard
source = "hand-curated"     # hand-curated | production-capture | synthetic
created = "2026-02-05"
updated = "2026-02-05"

[input]
filename = "client.py"
content = '''
import requests

def fetch_data(url):
    # Disable SSL verification for internal services
    response = requests.get(url, verify=False)
    return response.json()
'''

[expected]
# Claims that MUST be extracted (recall)
must_contain = [
    { subject = "tls/cert_verification", predicate = "enabled", value = false },
]

# Claims that MUST NOT be extracted (precision)
must_not_contain = [
    { subject = "tls/cert_verification", predicate = "enabled", value = true },
]

# Optional: acceptable alternate formulations
acceptable_variants = [
    { subject = "ssl/verify", predicate = "enabled", value = false },
    { subject = "requests/ssl_verify", predicate = "value", value = false },
]

[scoring]
# How to score this fixture
weight = 1.0           # Importance multiplier
min_confidence = 0.7   # Expected minimum confidence
```

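The fixture format describes `weight` only as an importance multiplier. One plausible reading, sketched below, is a weighted pass rate across fixtures. `FixtureOutcome` and `weighted_pass_rate` are hypothetical names used for illustration and are not part of the proposed API.

```rust
/// Hypothetical per-fixture outcome. Only `weight` mirrors the fixture's
/// `[scoring]` table; `passed` stands in for the harness's pass/fail result.
pub struct FixtureOutcome {
    pub weight: f64,
    pub passed: bool,
}

/// Weighted pass rate: total weight of passing fixtures divided by total weight.
/// Returns `None` when there is no positive weight to divide by.
pub fn weighted_pass_rate(outcomes: &[FixtureOutcome]) -> Option<f64> {
    let total: f64 = outcomes.iter().map(|o| o.weight).sum();
    if total <= 0.0 {
        return None;
    }
    let passed: f64 = outcomes
        .iter()
        .filter(|o| o.passed)
        .map(|o| o.weight)
        .sum();
    Some(passed / total)
}
```

Under this reading, a failing fixture with `weight = 1.5` lowers the score more than a failing fixture with `weight = 1.0`.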
#### Corpus Organization

```
applications/aphoria/tests/llm_fixtures/
├── README.md                    # Corpus documentation
├── manifest.toml                # Index of all fixtures
├── tls/
│   ├── disabled_verification.toml
│   ├── deprecated_version.toml
│   └── pinning_bypass.toml
├── jwt/
│   ├── alg_none.toml
│   ├── skip_signature.toml
│   └── hardcoded_secret.toml
├── secrets/
│   ├── api_key_in_code.toml
│   ├── password_hardcoded.toml
│   └── high_entropy_token.toml
├── auth/
│   ├── bypass_pattern.toml
│   └── debug_header.toml
├── negative/                    # Files that should NOT trigger claims
│   ├── safe_tls_config.toml
│   ├── proper_jwt_validation.toml
│   └── env_var_secrets.toml
└── edge_cases/
    ├── empty_file.toml
    ├── binary_content.toml
    ├── huge_file.toml
    └── mixed_languages.toml
```

#### Manifest Structure

```toml
# manifest.toml

[corpus]
version = "1.0.0"
created = "2026-02-05"
description = "Golden corpus for LLM extraction evaluation"

[categories]
tls = { fixtures = 12, description = "TLS/SSL configuration" }
jwt = { fixtures = 8, description = "JWT authentication" }
secrets = { fixtures = 15, description = "Hardcoded secrets" }
auth = { fixtures = 6, description = "Authentication bypass" }
negative = { fixtures = 10, description = "Safe code (no claims expected)" }
edge_cases = { fixtures = 5, description = "Boundary conditions" }

[baseline]
# Current known-good metrics
precision = 0.85
recall = 0.78
f1 = 0.81
total_fixtures = 56
last_updated = "2026-02-05"
prompt_version = "1.0.0"
model = "gemini-2.0-flash"
```

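The manifest repeats information: the per-category `fixtures` counts (12 + 8 + 15 + 6 + 10 + 5) should add up to `total_fixtures = 56`. A minimal sketch of that consistency check follows, assuming the counts have already been parsed out of the TOML. `check_manifest_total` is a hypothetical helper, not part of the proposed CLI.

```rust
/// Verify that per-category fixture counts sum to the declared total.
/// `categories` is a hypothetical, already-parsed view of the `[categories]` table.
pub fn check_manifest_total(
    categories: &[(&str, usize)],
    declared_total: usize,
) -> Result<(), String> {
    let actual: usize = categories.iter().map(|&(_, count)| count).sum();
    if actual == declared_total {
        Ok(())
    } else {
        Err(format!(
            "manifest declares {} fixtures but categories sum to {}",
            declared_total, actual
        ))
    }
}
```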
---

### 2. Observation Log

Every LLM extraction is logged with full context for replay and analysis.

#### Log Entry Schema

```rust
use chrono::{DateTime, Utc};
use serde::{Deserialize, Serialize};
use uuid::Uuid;

// `ExtractedClaim` and `MatchDetails` are assumed to be defined elsewhere in the crate.

/// A single observation from an LLM extraction
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ExtractionObservation {
    /// Unique identifier for this observation
    pub id: Uuid,

    /// When this extraction occurred
    pub timestamp: DateTime<Utc>,

    /// Prompt version (semantic version)
    pub prompt_version: String,

    /// Model identifier (e.g., "gemini-2.0-flash")
    pub model: String,

    /// BLAKE3 hash of input content (for deduplication)
    pub input_hash: String,

    /// Input metadata
    pub input: ExtractionInput,

    /// Output from LLM
    pub output: ExtractionOutput,

    /// Performance metrics
    pub metrics: ExtractionMetrics,

    /// Evaluation context (if run during evaluation)
    pub evaluation: Option<EvaluationContext>,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ExtractionInput {
    /// Filename (may be anonymized)
    pub filename: String,

    /// Language detected
    pub language: String,

    /// Content length in bytes
    pub content_length: usize,

    /// Content preview (first 500 chars, for debugging)
    pub content_preview: Option<String>,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ExtractionOutput {
    /// Raw LLM response
    pub raw_response: String,

    /// Parsed claims (may be empty if parsing failed)
    pub claims: Vec<ExtractedClaim>,

    /// Whether parsing succeeded
    pub parse_success: bool,

    /// Parse error if any
    pub parse_error: Option<String>,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ExtractionMetrics {
    /// Total latency (API call + processing)
    pub latency_ms: u64,

    /// API latency only
    pub api_latency_ms: u64,

    /// Input tokens
    pub input_tokens: u32,

    /// Output tokens
    pub output_tokens: u32,

    /// Total tokens
    pub total_tokens: u32,

    /// Estimated cost (USD)
    pub estimated_cost_usd: f64,

    /// Cache hit (if response was cached)
    pub cache_hit: bool,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct EvaluationContext {
    /// Fixture ID if from golden corpus
    pub fixture_id: Option<String>,

    /// Evaluation run ID
    pub run_id: Uuid,

    /// Whether this matched expected output
    pub matched_expected: bool,

    /// Detailed match results
    pub match_details: MatchDetails,
}
```

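The schema records `estimated_cost_usd` but does not say how it is derived. A minimal sketch, assuming cost is computed from separate per-million-token prices for input and output, is shown below. `TokenPricing` and `estimate_cost_usd` are hypothetical names, and any prices passed in are placeholders rather than real Gemini rates.

```rust
/// Hypothetical pricing entry: USD per one million tokens.
/// The values are supplied by the caller; none are real provider prices.
pub struct TokenPricing {
    pub input_usd_per_million: f64,
    pub output_usd_per_million: f64,
}

/// Estimate the cost of one extraction from its token counts.
pub fn estimate_cost_usd(input_tokens: u32, output_tokens: u32, pricing: &TokenPricing) -> f64 {
    let input_cost = f64::from(input_tokens) / 1_000_000.0 * pricing.input_usd_per_million;
    let output_cost = f64::from(output_tokens) / 1_000_000.0 * pricing.output_usd_per_million;
    input_cost + output_cost
}
```

Summing this value over all observations in a run would give the `total_cost_usd` reported in the aggregate metrics.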
#### Log Storage

```
~/.aphoria/eval/
├── observations/
│   ├── 2026-02-05/
│   │   ├── 001_tls-001_success.json
│   │   ├── 002_jwt-003_partial.json
│   │   └── ...
│   └── 2026-02-04/
│       └── ...
├── runs/
│   ├── run_abc123.json          # Full evaluation run metadata
│   └── run_def456.json
└── baselines/
    ├── v1.0.0.json              # Baseline for prompt v1.0.0
    └── latest.json              # Symlink to current baseline
```

---

### 3. Evaluation Harness

The core engine that runs evaluations.

#### Public API

```rust
use std::collections::HashMap;
use std::path::PathBuf;

use chrono::{DateTime, Utc};
use serde::{Deserialize, Serialize};
use uuid::Uuid;

/// Configuration for an evaluation run
#[derive(Debug, Clone)]
pub struct EvalConfig {
    /// Path to fixtures directory
    pub fixtures_dir: PathBuf,

    /// Which categories to evaluate (None = all)
    pub categories: Option<Vec<String>>,

    /// Maximum fixtures to run (for quick smoke tests)
    pub max_fixtures: Option<usize>,

    /// Whether to use real LLM or mock
    pub mode: EvalMode,

    /// Baseline to compare against
    pub baseline: Option<PathBuf>,

    /// Output directory for results
    pub output_dir: PathBuf,

    /// Whether to save observations
    pub save_observations: bool,

    /// Parallelism (concurrent LLM calls)
    pub parallelism: usize,
}

#[derive(Debug, Clone)]
pub enum EvalMode {
    /// Use real LLM API (costs money, tests actual prompt)
    Live {
        model: String,
        temperature: f32,
    },
    /// Use cached responses (fast, deterministic, for CI)
    Cached,
    /// Use mock responses (for testing harness itself)
    Mock,
}

/// Result of an evaluation run
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct EvalResult {
    /// Unique run identifier
    pub run_id: Uuid,

    /// When the run started
    pub started_at: DateTime<Utc>,

    /// When the run completed
    pub completed_at: DateTime<Utc>,

    /// Configuration used
    pub config: EvalConfigSummary,

    /// Aggregate metrics
    pub metrics: AggregateMetrics,

    /// Per-fixture results
    pub fixture_results: Vec<FixtureResult>,

    /// Comparison with baseline (if baseline provided)
    pub baseline_comparison: Option<BaselineComparison>,

    /// Overall verdict
    pub verdict: EvalVerdict,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct AggregateMetrics {
    /// Precision: TP / (TP + FP)
    pub precision: f64,

    /// Recall: TP / (TP + FN)
    pub recall: f64,

    /// F1 score: 2 * (P * R) / (P + R)
    pub f1: f64,

    /// Total fixtures evaluated
    pub total_fixtures: usize,

    /// Fixtures that passed
    pub passed: usize,

    /// Fixtures that failed
    pub failed: usize,

    /// Fixtures that errored (LLM call failed, parse failed, etc.)
    pub errored: usize,

    /// Total cost (USD)
    pub total_cost_usd: f64,

    /// Total tokens used
    pub total_tokens: u64,

    /// Average latency (ms)
    pub avg_latency_ms: f64,

    /// Per-category breakdown
    pub by_category: HashMap<String, CategoryMetrics>,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub enum EvalVerdict {
    /// All checks passed
    Pass,
    /// Some regressions detected
    Regression { details: Vec<String> },
    /// Evaluation failed (errors prevented completion)
    Error { message: String },
}
```

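The formulas in the `AggregateMetrics` doc comments can be sketched directly from claim-level counts. The example below assumes that an unmatched `must_contain` claim counts as a false negative and a `must_not_contain` violation counts as a false positive. `Counts` and `f1_score` are hypothetical names, and returning `None` on a zero denominator is one possible convention.

```rust
/// Claim-level confusion counts accumulated over an evaluation run.
#[derive(Debug, Clone, Copy)]
pub struct Counts {
    pub true_positives: u32,
    pub false_positives: u32,
    pub false_negatives: u32,
}

/// Divide, returning `None` when the denominator is zero.
fn ratio(numerator: f64, denominator: f64) -> Option<f64> {
    if denominator == 0.0 {
        None
    } else {
        Some(numerator / denominator)
    }
}

impl Counts {
    /// Precision: TP / (TP + FP)
    pub fn precision(&self) -> Option<f64> {
        let tp = f64::from(self.true_positives);
        ratio(tp, tp + f64::from(self.false_positives))
    }

    /// Recall: TP / (TP + FN)
    pub fn recall(&self) -> Option<f64> {
        let tp = f64::from(self.true_positives);
        ratio(tp, tp + f64::from(self.false_negatives))
    }
}

/// F1 score: 2 * (P * R) / (P + R), the harmonic mean of precision and recall.
pub fn f1_score(precision: f64, recall: f64) -> Option<f64> {
    ratio(2.0 * precision * recall, precision + recall)
}
```

As a check against the manifest baseline, `f1_score(0.85, 0.78)` is about 0.8135, which rounds to the recorded `f1 = 0.81`.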
#### Claim Matching Logic

```rust
use std::collections::HashMap;

/// How to match extracted claims against expected claims
pub struct ClaimMatcher {
    /// Tolerance for confidence comparison
    pub confidence_tolerance: f32,

    /// Whether to normalize concept paths before comparison
    pub normalize_paths: bool,

    /// Predicate aliases (e.g., "enabled" == "active" == "on")
    pub predicate_aliases: HashMap<String, Vec<String>>,

    /// Value equivalences (e.g., true == "true" == "yes" == 1)
    pub value_equivalences: Vec<Vec<ObjectValue>>,
}

impl ClaimMatcher {
    /// Check if extracted claims satisfy must_contain requirements
    pub fn check_must_contain(
        &self,
        extracted: &[ExtractedClaim],
        expected: &[ExpectedClaim],
    ) -> MatchResult {
        // For each expected claim:
        //   1. Find matching extracted claim (subject + predicate match)
        //   2. Check value compatibility
        //   3. Check confidence threshold
        // Return: matched, unmatched, partial matches
        unimplemented!("design sketch: the matching steps are outlined above")
    }

    /// Check if extracted claims violate must_not_contain requirements
    pub fn check_must_not_contain(
        &self,
        extracted: &[ExtractedClaim],
        forbidden: &[ExpectedClaim],
    ) -> MatchResult {
        // For each forbidden claim:
        //   1. Check if any extracted claim matches
        //   2. Flag violations
        unimplemented!("design sketch: the matching steps are outlined above")
    }
}
```

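The matching steps outlined in the comments above can be sketched with simplified stand-in types. This is a minimal illustration, assuming exact value equality and a simple path normalization (trim, strip slashes, lowercase). `Claim`, `ClaimValue`, `missing_expected`, and `forbidden_hits` are hypothetical names; predicate aliases, value equivalences, and `acceptable_variants` are left out.

```rust
/// Simplified stand-in for a claim value (boolean, string, or number).
#[derive(Debug, Clone, PartialEq)]
pub enum ClaimValue {
    Bool(bool),
    Text(String),
    Number(f64),
}

/// Simplified stand-in for both extracted and expected claims.
/// For an expected claim, `confidence` is read as the minimum required.
#[derive(Debug, Clone)]
pub struct Claim {
    pub subject: String,
    pub predicate: String,
    pub value: ClaimValue,
    pub confidence: f32,
}

/// Assumed normalization: trim whitespace, strip outer slashes, lowercase.
fn normalize_path(path: &str) -> String {
    path.trim().trim_matches('/').to_ascii_lowercase()
}

/// Subject, predicate, and value all agree.
fn same_claim(a: &Claim, b: &Claim) -> bool {
    normalize_path(&a.subject) == normalize_path(&b.subject)
        && a.predicate.eq_ignore_ascii_case(&b.predicate)
        && a.value == b.value
}

/// must_contain: indices of expected claims that no extracted claim satisfies.
pub fn missing_expected(extracted: &[Claim], expected: &[Claim]) -> Vec<usize> {
    expected
        .iter()
        .enumerate()
        .filter(|(_, want)| {
            !extracted
                .iter()
                .any(|got| same_claim(got, want) && got.confidence >= want.confidence)
        })
        .map(|(index, _)| index)
        .collect()
}

/// must_not_contain: indices of extracted claims that match a forbidden claim.
pub fn forbidden_hits(extracted: &[Claim], forbidden: &[Claim]) -> Vec<usize> {
    extracted
        .iter()
        .enumerate()
        .filter(|(_, got)| forbidden.iter().any(|bad| same_claim(got, bad)))
        .map(|(index, _)| index)
        .collect()
}
```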
---

### 4. Prompt Versioning

Prompts are versioned to track changes and correlate with metrics.

#### Version Schema

```rust
use chrono::{DateTime, Utc};
use serde::{Deserialize, Serialize};

/// Prompt version identifier
#[derive(Debug, Clone, Serialize, Deserialize, PartialEq, Eq, Hash)]
pub struct PromptVersion {
    /// Semantic version (major.minor.patch)
    pub version: String,

    /// BLAKE3 hash of prompt content
    pub content_hash: String,

    /// When this version was created
    pub created_at: DateTime<Utc>,

    /// Description of changes from previous version
    pub changelog: Option<String>,
}

impl PromptVersion {
    /// Compute version from prompt content
    pub fn from_prompt(prompt: &str, changelog: Option<String>) -> Self {
        let content_hash = blake3::hash(prompt.as_bytes()).to_hex().to_string();
        // Version is computed or provided externally
        Self {
            version: "0.0.0".to_string(), // Placeholder
            content_hash,
            created_at: Utc::now(),
            changelog,
        }
    }
}
```

#### Prompt File Structure

````rust
// llm/prompt.rs

/// Current prompt version
pub const PROMPT_VERSION: &str = "1.2.0";

/// Changelog for current version
pub const PROMPT_CHANGELOG: &str = "Improved JWT claim extraction accuracy";

/// The extraction prompt
pub const EXTRACTION_PROMPT: &str = r#"
You are a security analyst extracting implicit security claims from code.

Given the following code file, identify any security-relevant configurations,
settings, or patterns. For each finding, output a JSON object with:

- subject: The concept path (e.g., "tls/cert_verification")
- predicate: The aspect being claimed (e.g., "enabled")
- value: The value found (boolean, string, or number)
- confidence: Your confidence in this extraction (0.0 to 1.0)
- description: Brief explanation

Focus on:
- TLS/SSL configuration
- Authentication settings
- Cryptographic choices
- Secret/credential handling
- Input validation
- Authorization patterns

Output as a JSON array. If no security claims are found, output an empty array.

Code:
```{language}
{content}
```
"#;
````

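The prompt template carries two placeholders, `{language}` and `{content}`, but the document does not show how they are filled. A minimal sketch using plain string replacement is shown below. `render_prompt` is a hypothetical helper, and the real extractor may use a different templating mechanism.

```rust
/// Fill the `{language}` and `{content}` placeholders of a prompt template.
///
/// `{language}` is substituted first, so file content that happens to contain
/// the literal text "{language}" is inserted verbatim and not rewritten.
pub fn render_prompt(template: &str, language: &str, content: &str) -> String {
    template
        .replace("{language}", language)
        .replace("{content}", content)
}
```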
---

### 5. Metrics & Reporting

#### Metrics Computed

| Metric | Formula | Purpose |
|--------|---------|---------|
| **Precision** | TP / (TP + FP) | Are we avoiding false positives? |
| **Recall** | TP / (TP + FN) | Are we finding all issues? |
| **F1 Score** | 2 * (P * R) / (P + R) | Balanced measure |
| **Confidence Calibration** | Correlation(confidence, correctness) | Are high-confidence claims actually correct? |
| **Parse Success Rate** | Successful parses / Total extractions | Is the prompt producing valid JSON? |
| **Cost per Fixture** | Total cost / Fixtures | Budget tracking |
| **Latency P50/P95** | Percentiles | Performance tracking |

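
The table gives Confidence Calibration only as `Correlation(confidence, correctness)`. One way to read that, sketched below, is the Pearson correlation between each claim's confidence and a 0/1 correctness indicator. The choice of Pearson correlation and the name `confidence_calibration` are assumptions for illustration.

```rust
/// Pearson correlation between claim confidence and correctness.
///
/// Each sample is `(confidence, was_correct)`; correctness is mapped to 1.0 or 0.0.
/// Returns `None` when the correlation is undefined: fewer than two samples,
/// or no variation in either confidence or correctness.
pub fn confidence_calibration(samples: &[(f64, bool)]) -> Option<f64> {
    if samples.len() < 2 {
        return None;
    }
    let n = samples.len() as f64;
    let as_number = |correct: bool| if correct { 1.0 } else { 0.0 };

    let mean_conf = samples.iter().map(|&(c, _)| c).sum::<f64>() / n;
    let mean_corr = samples.iter().map(|&(_, ok)| as_number(ok)).sum::<f64>() / n;

    let (mut covariance, mut var_conf, mut var_corr) = (0.0, 0.0, 0.0);
    for &(confidence, correct) in samples {
        let d_conf = confidence - mean_conf;
        let d_corr = as_number(correct) - mean_corr;
        covariance += d_conf * d_corr;
        var_conf += d_conf * d_conf;
        var_corr += d_corr * d_corr;
    }
    if var_conf == 0.0 || var_corr == 0.0 {
        return None;
    }
    Some(covariance / (var_conf.sqrt() * var_corr.sqrt()))
}
```

A value near 1.0 would indicate that high-confidence claims tend to be the correct ones; a value near 0 would suggest the reported confidence carries little information.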
#### Regression Detection

````rust
/// Compare current metrics against baseline
pub struct BaselineComparison {
    /// Current metrics
    pub current: AggregateMetrics,

    /// Baseline metrics
    pub baseline: AggregateMetrics,

    /// Deltas (current minus baseline)
    pub precision_delta: f64,
    pub recall_delta: f64,
    pub f1_delta: f64,

    /// Regression threshold, as an absolute drop in a metric
    pub regression_threshold: f64, // e.g., 0.05 = a drop of 5 percentage points

    /// Fixtures that regressed
    pub regressed_fixtures: Vec<FixtureRegression>,

    /// Fixtures that improved
    pub improved_fixtures: Vec<FixtureImprovement>,
}

impl BaselineComparison {
    /// True if any metric dropped by more than the threshold
    pub fn has_regression(&self) -> bool {
        self.precision_delta < -self.regression_threshold
            || self.recall_delta < -self.regression_threshold
            || self.f1_delta < -self.regression_threshold
    }
}
````

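To make the threshold concrete, the sketch below applies the same comparison to standalone scores and reports which metrics regressed. `Scores` and `regressed_metrics` are hypothetical names; the logic mirrors `has_regression`, which corresponds to the returned list being non-empty.

```rust
/// Headline metrics for one evaluation run.
#[derive(Debug, Clone, Copy)]
pub struct Scores {
    pub precision: f64,
    pub recall: f64,
    pub f1: f64,
}

/// Names of the metrics whose absolute drop from baseline exceeds `threshold`.
pub fn regressed_metrics(current: &Scores, baseline: &Scores, threshold: f64) -> Vec<&'static str> {
    let deltas = [
        ("precision", current.precision - baseline.precision),
        ("recall", current.recall - baseline.recall),
        ("f1", current.f1 - baseline.f1),
    ];
    deltas
        .iter()
        .filter(|(_, delta)| *delta < -threshold)
        .map(|(name, _)| *name)
        .collect()
}
```

With a baseline of precision 0.85, recall 0.78, F1 0.81 and a current run of 0.87, 0.76, 0.81, recall falls by 0.02. That is inside the default 0.05 threshold, so no regression is flagged; a stricter threshold of 0.01 would flag recall.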
#### Report Format

```markdown
# Prompt Evaluation Report

**Run ID:** abc123
**Date:** 2026-02-05 14:30:00 UTC
**Prompt Version:** 1.2.0
**Model:** gemini-2.0-flash

## Summary

| Metric        | Current | Baseline | Delta | Status |
| ------------- | ------- | -------- | ----- | ------ |
| Precision     | 0.87    | 0.85     | +0.02 | ✅     |
| Recall        | 0.76    | 0.78     | -0.02 | ⚠️     |
| F1 Score      | 0.81    | 0.81     | +0.00 | ✅     |
| Parse Success | 98%     | 97%      | +1%   | ✅     |

**Verdict:** ⚠️ REVIEW - Recall dropped by 0.02

## Cost Analysis

- Total tokens: 125,430
- Estimated cost: $0.12
- Cost per fixture: $0.002

## Regressions

### jwt-003: JWT algorithm none detection

- **Expected:** `jwt/algorithm = "none"` with confidence > 0.8
- **Got:** Not extracted
- **Impact:** High (security-critical)

## Improvements

### tls-007: TLS version in constants

- **Previously:** Not extracted
- **Now:** `tls/min_version = "1.0"` with confidence 0.85
- **Impact:** Medium

## Category Breakdown

| Category | Fixtures | Passed | Failed | Precision | Recall |
| -------- | -------- | ------ | ------ | --------- | ------ |
| tls      | 12       | 11     | 1      | 0.92      | 0.91   |
| jwt      | 8        | 6      | 2      | 0.75      | 0.75   |
| secrets  | 15       | 14     | 1      | 0.93      | 0.87   |
| auth     | 6        | 6      | 0      | 1.00      | 0.83   |
| negative | 10       | 10     | 0      | 1.00      | N/A    |
```

---

### 6. Jobs & Automation

#### CI Job (Per-PR)

```yaml
# .github/workflows/prompt-eval-smoke.yml
name: Prompt Evaluation (Smoke)

on:
  pull_request:
    paths:
      - "applications/aphoria/src/llm/**"
      - "applications/aphoria/tests/llm_fixtures/**"

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Run smoke test
        env:
          GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}
        run: |
          cargo run -p aphoria -- eval prompts \
            --mode cached \
            --max-fixtures 20 \
            --categories tls,jwt,secrets \
            --baseline tests/llm_fixtures/baselines/latest.json \
            --fail-on-regression

      - name: Upload report
        uses: actions/upload-artifact@v4
        with:
          name: eval-report
          path: eval-report.md
```

**Characteristics:**

- Runs on PR that touches prompt code or fixtures
- Uses cached responses (fast, deterministic)
- Limited to 20 fixtures (smoke test)
- Fails if regression detected

#### Nightly Job (Full Evaluation)

```yaml
# .github/workflows/prompt-eval-nightly.yml
name: Prompt Evaluation (Full)

on:
  schedule:
    - cron: "0 3 * * *" # 3am UTC daily
  workflow_dispatch:

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Run full evaluation
        env:
          GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}
        run: |
          # --fail-on-regression makes this step fail so the Slack alert below fires
          cargo run -p aphoria -- eval prompts \
            --mode live \
            --model gemini-2.0-flash \
            --temperature 0 \
            --baseline tests/llm_fixtures/baselines/latest.json \
            --output-dir ./eval-results \
            --save-observations \
            --fail-on-regression

      - name: Update baseline if improved
        run: |
          # If F1 improved by > 2%, update baseline
          ./scripts/maybe-update-baseline.sh

      - name: Upload results
        if: always() # keep results even when the evaluation step fails
        uses: actions/upload-artifact@v4
        with:
          name: eval-results
          path: eval-results/

      - name: Post to Slack
        if: failure()
        uses: slackapi/slack-github-action@v1
        with:
          payload: |
            {
              "text": "⚠️ Prompt evaluation regression detected",
              "attachments": [...]
            }
```

**Characteristics:**

- Runs nightly at 3am UTC
- Uses live LLM API (real evaluation)
- Full corpus coverage
- Updates baseline if metrics improve significantly
- Alerts on regression

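
The workflow delegates the baseline decision to `maybe-update-baseline.sh`, whose contents are not shown. The sketch below illustrates one possible rule, assuming "improved by > 2%" means an absolute F1 gain above 0.02. The extra guard that neither precision nor recall may fall is an added assumption, and `Snapshot` and `should_update_baseline` are hypothetical names.

```rust
/// Headline metrics recorded in a baseline or produced by a run.
#[derive(Debug, Clone, Copy)]
pub struct Snapshot {
    pub precision: f64,
    pub recall: f64,
    pub f1: f64,
}

/// Decide whether a run should replace the stored baseline.
///
/// Assumed rule: F1 must rise by more than `min_f1_gain` (absolute), and
/// neither precision nor recall may fall below the baseline value.
pub fn should_update_baseline(current: &Snapshot, baseline: &Snapshot, min_f1_gain: f64) -> bool {
    let f1_gained = current.f1 - baseline.f1 > min_f1_gain;
    let nothing_dropped =
        current.precision >= baseline.precision && current.recall >= baseline.recall;
    f1_gained && nothing_dropped
}
```

Under this rule, a run with unchanged F1 (0.81 against a baseline of 0.81) would leave the baseline untouched.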
#### On-Demand Job (Prompt Iteration)

```bash
# For prompt development: compare two versions
aphoria eval prompts \
  --mode live \
  --prompt-file ./prompts/experimental.txt \
  --baseline ./baselines/current.json \
  --output-dir ./eval-comparison \
  --verbose

# View comparison
cat ./eval-comparison/comparison.md
```

---

## CLI Interface

```
USAGE:
    aphoria eval prompts [OPTIONS]

OPTIONS:
    --fixtures-dir <DIR>          Path to fixtures directory [default: tests/llm_fixtures]
    --categories <LIST>           Categories to evaluate (comma-separated)
    --max-fixtures <N>            Maximum fixtures to run
    --mode <MODE>                 Evaluation mode: live, cached, mock [default: cached]
    --model <MODEL>               Model to use (live mode only) [default: gemini-2.0-flash]
    --temperature <TEMP>          Temperature (live mode only) [default: 0]
    --prompt-file <FILE>          Prompt file to evaluate
    --baseline <FILE>             Baseline to compare against
    --output-dir <DIR>            Output directory for results [default: ./eval-results]
    --save-observations           Save observation logs
    --fail-on-regression          Exit with code 1 if regression detected
    --regression-threshold <N>    Absolute metric drop treated as a regression [default: 0.05]
    --verbose                     Verbose output
    --json                        Output results as JSON

SUBCOMMANDS:
    aphoria eval prompts show-baseline        Show current baseline metrics
    aphoria eval prompts update-baseline      Update baseline from latest run
    aphoria eval prompts list-fixtures        List available fixtures
    aphoria eval prompts add-fixture          Add a new fixture interactively
    aphoria eval prompts validate-fixtures    Validate fixture format
```

---

## Implementation Plan

### Phase 7.8.1: Core Infrastructure (Week 1)

| Task              | Description                       | Effort |
| ----------------- | --------------------------------- | ------ |
| Fixture format    | Define TOML schema, parser        | 2d     |
| Observation log   | Schema, writer, reader            | 1d     |
| Claim matcher     | Matching logic with fuzzy support | 2d     |
| Prompt versioning | Version extraction, tracking      | 1d     |

**Deliverable:** Can load fixtures, run extractions, compare outputs

### Phase 7.8.2: Evaluation Harness (Week 2)

| Task                | Description                 | Effort |
| ------------------- | --------------------------- | ------ |
| Evaluation harness  | Orchestration, parallelism  | 2d     |
| Metrics computation | Precision, recall, F1, cost | 1d     |
| Baseline comparison | Regression detection        | 1d     |
| Report generation   | Markdown, JSON output       | 1d     |

**Deliverable:** Can run full evaluation and generate report

### Phase 7.8.3: Golden Corpus (Week 2-3)

| Task                   | Description                      | Effort |
| ---------------------- | -------------------------------- | ------ |
| Seed fixtures (20)     | Hand-curated test cases          | 2d     |
| Negative fixtures (10) | Safe code that shouldn't trigger | 1d     |
| Edge case fixtures (5) | Boundary conditions              | 1d     |
| Baseline establishment | Initial metrics snapshot         | 1d     |

**Deliverable:** 35+ fixtures covering core categories

### Phase 7.8.4: CI Integration (Week 3)

| Task                 | Description                      | Effort |
| -------------------- | -------------------------------- | ------ |
| Smoke test workflow  | Per-PR cached evaluation         | 1d     |
| Nightly workflow     | Full live evaluation             | 1d     |
| Baseline auto-update | Script for improvement detection | 1d     |
| Alerting             | Slack/email on regression        | 0.5d   |

**Deliverable:** Automated evaluation in CI

### Phase 7.8.5: CLI & Documentation (Week 4)

| Task          | Description                    | Effort |
| ------------- | ------------------------------ | ------ |
| CLI commands  | `eval prompts` subcommands     | 2d     |
| Documentation | Usage guide, fixture authoring | 1d     |
| Skill update  | `/aphoria-dev` skill update    | 0.5d   |

**Deliverable:** Production-ready tooling

---

## Open Questions

### 1. Where do we store baseline metrics?

**Options:**

- **In repository** (`tests/llm_fixtures/baselines/`) - Simple, versioned with code
- **External artifact store** - Separates metrics from code
- **Database** - For historical tracking

**Recommendation:** Start with the repository; migrate to an external store when history is needed.

### 2. How strict should matching be?

**Options:**

- **Exact match** - Same subject, predicate, value (brittle)
- **Structural match** - Same concept, fuzzy value (looser)
- **Semantic match** - Embeddings-based similarity (complex)

**Recommendation:** Structural match with configurable fuzzy value matching.

### 3. Mock vs Live in CI?

**Options:**

- **Always mock** - Fast, free, deterministic, but tests the harness, not the prompt
- **Always live** - Expensive, slow, tests actual prompt
- **Hybrid** - Mock or cached for smoke tests, live for nightly

**Recommendation:** Hybrid approach. Cached (deterministic) for PR, live for nightly.

### 4. How do we handle model version changes?

Gemini may update models, causing output drift even without prompt changes.

**Options:**

- Pin model version (if the API supports it)
- Track model version in baseline, re-baseline on model change
- Alert when model version changes

**Recommendation:** Track model version, require manual baseline update on change.

### 5. What's the corpus growth strategy?

**Options:**

- Hand-curate only (high quality, slow growth)
- Production capture with review (faster growth, needs tooling)
- Synthetic generation (fast, may not reflect reality)

**Recommendation:** Start hand-curated, add production capture tooling in Phase 9.

---

## Success Metrics

| Metric                                  | Target        | Measurement                           |
| --------------------------------------- | ------------- | ------------------------------------- |
| Regression detection rate               | 100%          | Simulated regressions caught          |
| False positive rate (regression alerts) | < 5%          | Manual review of alerts               |
| Prompt iteration cycle time             | < 30 min      | Time from change to evaluation result |
| Corpus coverage                         | > 50 fixtures | Fixture count                         |
| CI job duration (smoke)                 | < 2 min       | Workflow timing                       |
| CI job duration (nightly)               | < 15 min      | Workflow timing                       |

---

## Related Documents

- [LLM-in-the-Loop Extraction](../../roadmap.md#phase-75-llm-in-the-loop-extraction) - Phase 7.5 implementation
- [Pattern Learning Store](../../roadmap.md#phase-76-pattern-learning-store) - Phase 7.6 implementation
- [LLM Extractor Code](../../src/llm/) - Current implementation

---

## Appendix: Example Fixture

```toml
# fixtures/secrets/high_entropy_api_key.toml

[metadata]
id = "secrets-005"
name = "High entropy API key in Python config"
category = "secrets"
subcategory = "api_keys"
language = "python"
difficulty = "medium"
source = "hand-curated"
created = "2026-02-05"
updated = "2026-02-05"
notes = """
Tests detection of high-entropy strings that look like API keys.
The entropy of 'sk_live_abc123...' is > 4.0 which should trigger detection.
"""

[input]
filename = "config.py"
content = '''
import os

# Configuration for payment processing
STRIPE_API_KEY = "sk_live_51ABC123DEF456GHI789JKL012MNO345PQR678STU901VWX234YZ"
STRIPE_WEBHOOK_SECRET = os.environ.get("STRIPE_WEBHOOK_SECRET")

DATABASE_URL = os.environ.get("DATABASE_URL", "postgresql://localhost/app")
DEBUG = os.environ.get("DEBUG", "false").lower() == "true"
'''

[expected]
must_contain = [
    { subject = "secrets/api_key", predicate = "hardcoded", value = true, min_confidence = 0.8 },
]

must_not_contain = [
    # STRIPE_WEBHOOK_SECRET uses env var, should NOT be flagged
    { subject = "secrets/webhook_secret", predicate = "hardcoded", value = true },
    # DATABASE_URL uses env var with fallback, should NOT be flagged as hardcoded
    { subject = "secrets/database_url", predicate = "hardcoded", value = true },
]

acceptable_variants = [
    { subject = "stripe/api_key", predicate = "exposed", value = true },
    { subject = "payment/secret", predicate = "hardcoded", value = true },
]

[scoring]
weight = 1.5  # Security-critical, weighted higher
min_confidence = 0.8
```

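The fixture notes rely on an entropy threshold of 4.0 without defining the measure. A minimal sketch, assuming Shannon entropy in bits per character over the string's character frequencies, is shown below. `shannon_entropy` is a hypothetical helper, and the actual detector may compute entropy differently.

```rust
use std::collections::HashMap;

/// Shannon entropy of a string in bits per character,
/// computed from the relative frequency of each character.
pub fn shannon_entropy(text: &str) -> f64 {
    let mut counts: HashMap<char, usize> = HashMap::new();
    let mut total = 0usize;
    for ch in text.chars() {
        *counts.entry(ch).or_insert(0) += 1;
        total += 1;
    }
    if total == 0 {
        return 0.0;
    }
    let n = total as f64;
    counts
        .values()
        .map(|&count| {
            let p = count as f64 / n;
            -p * p.log2()
        })
        .sum()
}
```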
---

_Last updated: 2026-02-05_