LLM Prompt Evaluation System
Status: Proposed (2026-02-05)
Phase: 7.8 (extends Phase 7.5 LLM-in-the-Loop Extraction)
Author: Architecture Team
Problem Statement
Aphoria's LLM-powered claim extraction (Phase 7.5) uses Gemini to extract security claims from high-value code files. The prompts that drive this extraction are effectively code that we don't treat like code:
| Aspect | Traditional Code | Current Prompts |
|---|---|---|
| Version Control | Git commits | In files, but no semantic versioning |
| Testing | Unit/integration tests | None |
| Metrics | Coverage, performance | None |
| Regression Detection | CI failures | None |
| Quality Gates | Linting, review | None |
The result: When we change a prompt, we have no systematic way to know if we made things better or worse. We're flying blind.
Enterprise Requirements
For enterprise adoption, customers need assurance that:
- Prompts produce consistent, high-quality results - Not random outputs
- Changes are validated before deployment - Regressions are caught
- Performance is measurable - Precision, recall, cost are tracked
- The system improves over time - With evidence, not hope
Goals
Primary Goals
- Observability - Understand prompt effectiveness through metrics and logging
- Testability - Validate prompts against known scenarios with expected outcomes
- Repeatability - Run evaluations consistently across environments
- Automation - Scheduled jobs that detect regressions without human intervention
Non-Goals (Phase 7.8)
- Real-time prompt optimization (future: Phase 9)
- A/B testing in production (future: Phase 9)
- Multi-model comparison (future)
- Prompt compression/optimization (future)
Architecture Overview
┌──────────────────────────────────────────────────────────────────────────────┐
│ LLM Prompt Evaluation System │
├──────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────┐ │
│ │ Golden Corpus │ Test fixtures with expected outcomes │
│ │ (fixtures/) │ - Code snippets │
│ │ │ - Expected claims (must-contain, must-not-contain) │
│ └────────┬────────┘ - Metadata (language, category, difficulty) │
│ │ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ Evaluation │ Orchestrates test runs │
│ │ Harness │ - Loads fixtures │
│ │ │ - Invokes LLM Extractor │
│ │ │ - Compares outputs │
│ │ │ - Computes metrics │
│ └────────┬────────┘ │
│ │ │
│ ├──────────────────────┐ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────────┐ ┌─────────────────┐ │
│ │ LLM Extractor │ │ Observation │ │
│ │ (instrumented) │ │ Log │ │
│ │ │ │ │ │
│ │ - Same code │ │ - Prompt ver │ │
│ │ path as │ │ - Input hash │ │
│ │ production │ │ - Output │ │
│ │ │ │ - Latency │ │
│ │ │ │ - Tokens │ │
│ └─────────────────┘ │ - Model │ │
│ │ - Timestamp │ │
│ └────────┬────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ Metrics & Reports │ │
│ │ │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │
│ │ │ Precision │ │ Recall │ │ F1 Score │ │ Cost │ │ │
│ │ │ TP/(TP+FP) │ │ TP/(TP+FN) │ │ Harmonic │ │ Tokens/$ │ │ │
│ │ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ │ │
│ │ │ │
│ │ ┌────────────────────────────────────────────────────────────────────┐ │ │
│ │ │ Regression Report │ │ │
│ │ │ - Comparison against baseline │ │ │
│ │ │ - Per-fixture deltas │ │ │
│ │ │ - Category breakdown │ │ │
│ │ │ - Recommendations │ │ │
│ │ └────────────────────────────────────────────────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
│ │
└──────────────────────────────────────────────────────────────────────────────┘
Core Components
1. Golden Corpus
A curated set of test fixtures with known expected outcomes.
Fixture Format
# fixtures/tls/disabled_verification.toml
[metadata]
id = "tls-001"
name = "TLS verification disabled in Python requests"
category = "tls"
subcategory = "certificate_verification"
language = "python"
difficulty = "easy" # easy | medium | hard
source = "hand-curated" # hand-curated | production-capture | synthetic
created = "2026-02-05"
updated = "2026-02-05"
[input]
filename = "client.py"
content = '''
import requests
def fetch_data(url):
    # Disable SSL verification for internal services
    response = requests.get(url, verify=False)
    return response.json()
'''
[expected]
# Claims that MUST be extracted (recall)
must_contain = [
{ subject = "tls/cert_verification", predicate = "enabled", value = false },
]
# Claims that MUST NOT be extracted (precision)
must_not_contain = [
{ subject = "tls/cert_verification", predicate = "enabled", value = true },
]
# Optional: acceptable alternate formulations
acceptable_variants = [
{ subject = "ssl/verify", predicate = "enabled", value = false },
{ subject = "requests/ssl_verify", predicate = "value", value = false },
]
[scoring]
# How to score this fixture
weight = 1.0 # Importance multiplier
min_confidence = 0.7 # Expected minimum confidence
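The pass/fail rule implied by the fixture format can be sketched as follows. This is a minimal illustration, not the harness implementation: the `Triple`, `Extracted`, and `Expected` types and the `fixture_passes` function are hypothetical names, values are simplified to strings, and an acceptable variant is assumed to satisfy any required claim in the fixture.

```rust
/// A (subject, predicate, value) triple as written in a fixture's `[expected]` section.
/// Values are kept as strings here for brevity (an assumption of this sketch).
#[derive(Debug, Clone, PartialEq)]
pub struct Triple {
    pub subject: String,
    pub predicate: String,
    pub value: String,
}

impl Triple {
    pub fn new(subject: &str, predicate: &str, value: &str) -> Self {
        Triple {
            subject: subject.to_string(),
            predicate: predicate.to_string(),
            value: value.to_string(),
        }
    }
}

/// A claim returned by the extractor, with its reported confidence.
#[derive(Debug, Clone)]
pub struct Extracted {
    pub triple: Triple,
    pub confidence: f32,
}

/// Hypothetical in-memory form of a fixture's `[expected]` and `[scoring]` tables.
pub struct Expected {
    pub must_contain: Vec<Triple>,
    pub must_not_contain: Vec<Triple>,
    pub acceptable_variants: Vec<Triple>,
    pub min_confidence: f32,
}

/// Returns true when every required claim (or an acceptable variant) was extracted
/// at or above `min_confidence`, and no forbidden claim was extracted at all.
pub fn fixture_passes(extracted: &[Extracted], expected: &Expected) -> bool {
    // Only sufficiently confident claims can satisfy a requirement.
    let confident: Vec<&Triple> = extracted
        .iter()
        .filter(|c| c.confidence >= expected.min_confidence)
        .map(|c| &c.triple)
        .collect();

    let required_ok = expected.must_contain.iter().all(|want| {
        confident
            .iter()
            .any(|got| *got == want || expected.acceptable_variants.contains(*got))
    });

    // A forbidden claim counts as a violation regardless of its confidence.
    let forbidden_hit = extracted
        .iter()
        .any(|c| expected.must_not_contain.contains(&c.triple));

    required_ok && !forbidden_hit
}
```

With an empty `must_contain` list the first check is trivially true, so a negative fixture passes as long as nothing forbidden is extracted.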
Corpus Organization
applications/aphoria/tests/llm_fixtures/
├── README.md                     # Corpus documentation
├── manifest.toml                 # Index of all fixtures
├── tls/
│   ├── disabled_verification.toml
│   ├── deprecated_version.toml
│   └── pinning_bypass.toml
├── jwt/
│   ├── alg_none.toml
│   ├── skip_signature.toml
│   └── hardcoded_secret.toml
├── secrets/
│   ├── api_key_in_code.toml
│   ├── password_hardcoded.toml
│   └── high_entropy_token.toml
├── auth/
│   ├── bypass_pattern.toml
│   └── debug_header.toml
├── negative/                     # Files that should NOT trigger claims
│   ├── safe_tls_config.toml
│   ├── proper_jwt_validation.toml
│   └── env_var_secrets.toml
└── edge_cases/
    ├── empty_file.toml
    ├── binary_content.toml
    ├── huge_file.toml
    └── mixed_languages.toml
Manifest Structure
# manifest.toml
[corpus]
version = "1.0.0"
created = "2026-02-05"
description = "Golden corpus for LLM extraction evaluation"
[categories]
tls = { fixtures = 12, description = "TLS/SSL configuration" }
jwt = { fixtures = 8, description = "JWT authentication" }
secrets = { fixtures = 15, description = "Hardcoded secrets" }
auth = { fixtures = 6, description = "Authentication bypass" }
negative = { fixtures = 10, description = "Safe code (no claims expected)" }
edge_cases = { fixtures = 5, description = "Boundary conditions" }
[baseline]
# Current known-good metrics
precision = 0.85
recall = 0.78
f1 = 0.81
total_fixtures = 56
last_updated = "2026-02-05"
prompt_version = "1.0.0"
model = "gemini-2.0-flash"
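Because the manifest duplicates information (per-category counts and a total, plus precision, recall, and F1), a validator can cross-check it. The sketch below is one possible check, assuming F1 is recorded to two decimal places; `f1` and `baseline_is_consistent` are hypothetical helper names, not part of the proposed API.

```rust
/// Harmonic mean of precision and recall; defined as 0.0 when both are zero.
pub fn f1(precision: f64, recall: f64) -> f64 {
    if precision + recall == 0.0 {
        0.0
    } else {
        2.0 * precision * recall / (precision + recall)
    }
}

/// Checks that the per-category fixture counts add up to `total_fixtures`
/// and that the recorded F1 equals the value derived from precision and
/// recall, rounded to two decimal places.
pub fn baseline_is_consistent(
    category_counts: &[usize],
    total_fixtures: usize,
    precision: f64,
    recall: f64,
    recorded_f1: f64,
) -> bool {
    let sum: usize = category_counts.iter().sum();
    let rounded = (f1(precision, recall) * 100.0).round() / 100.0;
    sum == total_fixtures && (rounded - recorded_f1).abs() < 1e-9
}
```

For the manifest above, the category counts 12 + 8 + 15 + 6 + 10 + 5 give 56, and precision 0.85 with recall 0.78 gives an F1 of about 0.813, which rounds to the recorded 0.81.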
2. Observation Log
Every LLM extraction is logged with full context for replay and analysis.
Log Entry Schema
/// A single observation from an LLM extraction
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ExtractionObservation {
/// Unique identifier for this observation
pub id: Uuid,
/// When this extraction occurred
pub timestamp: DateTime<Utc>,
/// Prompt version (semantic version)
pub prompt_version: String,
/// Model identifier (e.g., "gemini-2.0-flash")
pub model: String,
/// BLAKE3 hash of input content (for deduplication)
pub input_hash: String,
/// Input metadata
pub input: ExtractionInput,
/// Output from LLM
pub output: ExtractionOutput,
/// Performance metrics
pub metrics: ExtractionMetrics,
/// Evaluation context (if run during evaluation)
pub evaluation: Option<EvaluationContext>,
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ExtractionInput {
/// Filename (may be anonymized)
pub filename: String,
/// Language detected
pub language: String,
/// Content length in bytes
pub content_length: usize,
/// Content preview (first 500 chars, for debugging)
pub content_preview: Option<String>,
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ExtractionOutput {
/// Raw LLM response
pub raw_response: String,
/// Parsed claims (may be empty if parsing failed)
pub claims: Vec<ExtractedClaim>,
/// Whether parsing succeeded
pub parse_success: bool,
/// Parse error if any
pub parse_error: Option<String>,
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ExtractionMetrics {
/// Total latency (API call + processing)
pub latency_ms: u64,
/// API latency only
pub api_latency_ms: u64,
/// Input tokens
pub input_tokens: u32,
/// Output tokens
pub output_tokens: u32,
/// Total tokens
pub total_tokens: u32,
/// Estimated cost (USD)
pub estimated_cost_usd: f64,
/// Cache hit (if response was cached)
pub cache_hit: bool,
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct EvaluationContext {
/// Fixture ID if from golden corpus
pub fixture_id: Option<String>,
/// Evaluation run ID
pub run_id: Uuid,
/// Whether this matched expected output
pub matched_expected: bool,
/// Detailed match results
pub match_details: MatchDetails,
}
Log Storage
~/.aphoria/eval/
├── observations/
│   ├── 2026-02-05/
│   │   ├── 001_tls-001_success.json
│   │   ├── 002_jwt-003_partial.json
│   │   └── ...
│   └── 2026-02-04/
│       └── ...
├── runs/
│   ├── run_abc123.json           # Full evaluation run metadata
│   └── run_def456.json
└── baselines/
    ├── v1.0.0.json               # Baseline for prompt v1.0.0
    └── latest.json               # Symlink to current baseline
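The observation file names in the layout above appear to follow a `<sequence>_<fixture id>_<outcome>.json` pattern inside a per-day directory. A minimal sketch of building such a path is shown below; the `Outcome` enum and `observation_path` function are hypothetical, and the `failure` outcome is an assumption, since only `success` and `partial` appear in the layout.

```rust
use std::path::{Path, PathBuf};

/// Outcome suffix used in the file name. `Failure` is an assumed third state.
#[derive(Debug, Clone, Copy)]
pub enum Outcome {
    Success,
    Partial,
    Failure,
}

impl Outcome {
    fn as_str(self) -> &'static str {
        match self {
            Outcome::Success => "success",
            Outcome::Partial => "partial",
            Outcome::Failure => "failure",
        }
    }
}

/// Builds `<root>/observations/<date>/<seq>_<fixture_id>_<outcome>.json`,
/// with the sequence number zero-padded to three digits.
pub fn observation_path(
    root: &Path,
    date: &str,
    seq: u32,
    fixture_id: &str,
    outcome: Outcome,
) -> PathBuf {
    root.join("observations")
        .join(date)
        .join(format!("{:03}_{}_{}.json", seq, fixture_id, outcome.as_str()))
}
```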
3. Evaluation Harness
The core engine that runs evaluations.
Public API
/// Configuration for an evaluation run
#[derive(Debug, Clone)]
pub struct EvalConfig {
/// Path to fixtures directory
pub fixtures_dir: PathBuf,
/// Which categories to evaluate (None = all)
pub categories: Option<Vec<String>>,
/// Maximum fixtures to run (for quick smoke tests)
pub max_fixtures: Option<usize>,
/// Whether to use real LLM or mock
pub mode: EvalMode,
/// Baseline to compare against
pub baseline: Option<PathBuf>,
/// Output directory for results
pub output_dir: PathBuf,
/// Whether to save observations
pub save_observations: bool,
/// Parallelism (concurrent LLM calls)
pub parallelism: usize,
}
#[derive(Debug, Clone)]
pub enum EvalMode {
/// Use real LLM API (costs money, tests actual prompt)
Live {
model: String,
temperature: f32,
},
/// Use cached responses (fast, deterministic, for CI)
Cached,
/// Use mock responses (for testing harness itself)
Mock,
}
/// Result of an evaluation run
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct EvalResult {
/// Unique run identifier
pub run_id: Uuid,
/// When the run started
pub started_at: DateTime<Utc>,
/// When the run completed
pub completed_at: DateTime<Utc>,
/// Configuration used
pub config: EvalConfigSummary,
/// Aggregate metrics
pub metrics: AggregateMetrics,
/// Per-fixture results
pub fixture_results: Vec<FixtureResult>,
/// Comparison with baseline (if baseline provided)
pub baseline_comparison: Option<BaselineComparison>,
/// Overall verdict
pub verdict: EvalVerdict,
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct AggregateMetrics {
/// Precision: TP / (TP + FP)
pub precision: f64,
/// Recall: TP / (TP + FN)
pub recall: f64,
/// F1 score: 2 * (P * R) / (P + R)
pub f1: f64,
/// Total fixtures evaluated
pub total_fixtures: usize,
/// Fixtures that passed
pub passed: usize,
/// Fixtures that failed
pub failed: usize,
/// Fixtures that errored (LLM call failed, parse failed, etc.)
pub errored: usize,
/// Total cost (USD)
pub total_cost_usd: f64,
/// Total tokens used
pub total_tokens: u64,
/// Average latency (ms)
pub avg_latency_ms: f64,
/// Per-category breakdown
pub by_category: HashMap<String, CategoryMetrics>,
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub enum EvalVerdict {
/// All checks passed
Pass,
/// Some regressions detected
Regression { details: Vec<String> },
/// Evaluation failed (errors prevented completion)
Error { message: String },
}
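The `precision`, `recall`, and `f1` fields of `AggregateMetrics` follow from raw true-positive, false-positive, and false-negative counts. The sketch below shows one way to derive them; the `Counts` type is hypothetical, and returning `None` for an undefined ratio is an assumption, since the document does not say how a zero denominator should be reported.

```rust
/// Raw match counts accumulated over a set of fixtures.
#[derive(Debug, Clone, Copy, Default)]
pub struct Counts {
    /// Expected claims that were extracted.
    pub tp: u32,
    /// Extracted claims that were forbidden or unexpected.
    pub fp: u32,
    /// Expected claims that were not extracted.
    pub fn_: u32,
}

/// Returns `None` when the denominator is zero, so callers must decide
/// how to display an undefined ratio.
fn ratio(numerator: u32, denominator: u32) -> Option<f64> {
    if denominator == 0 {
        None
    } else {
        Some(f64::from(numerator) / f64::from(denominator))
    }
}

impl Counts {
    /// Precision: TP / (TP + FP).
    pub fn precision(&self) -> Option<f64> {
        ratio(self.tp, self.tp + self.fp)
    }

    /// Recall: TP / (TP + FN).
    pub fn recall(&self) -> Option<f64> {
        ratio(self.tp, self.tp + self.fn_)
    }

    /// F1: 2 * (P * R) / (P + R), or 0.0 when both P and R are zero.
    pub fn f1(&self) -> Option<f64> {
        let (p, r) = (self.precision()?, self.recall()?);
        if p + r == 0.0 {
            Some(0.0)
        } else {
            Some(2.0 * p * r / (p + r))
        }
    }
}
```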
Claim Matching Logic
/// How to match extracted claims against expected claims
pub struct ClaimMatcher {
/// Tolerance for confidence comparison
pub confidence_tolerance: f32,
/// Whether to normalize concept paths before comparison
pub normalize_paths: bool,
/// Predicate aliases (e.g., "enabled" == "active" == "on")
pub predicate_aliases: HashMap<String, Vec<String>>,
/// Value equivalences (e.g., true == "true" == "yes" == 1)
pub value_equivalences: Vec<Vec<ObjectValue>>,
}
impl ClaimMatcher {
/// Check if extracted claims satisfy must_contain requirements
pub fn check_must_contain(
&self,
extracted: &[ExtractedClaim],
expected: &[ExpectedClaim],
) -> MatchResult {
// For each expected claim:
// 1. Find matching extracted claim (subject + predicate match)
// 2. Check value compatibility
// 3. Check confidence threshold
// Return: matched, unmatched, partial matches
}
/// Check if extracted claims violate must_not_contain requirements
pub fn check_must_not_contain(
&self,
extracted: &[ExtractedClaim],
forbidden: &[ExpectedClaim],
) -> MatchResult {
// For each forbidden claim:
// 1. Check if any extracted claim matches
// 2. Flag violations
}
}
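The normalization steps that `ClaimMatcher` describes (path normalization, predicate aliases, value equivalences) can be sketched with the standard library alone. Everything below is illustrative: the function names are hypothetical, claims are reduced to string triples, and the set of boolean spellings is an assumption based on the examples in the field comments.

```rust
use std::collections::HashMap;

/// Builds a lookup from every alias (and the canonical name itself) to the
/// canonical predicate, e.g. "active" -> "enabled".
pub fn build_alias_index(aliases: &HashMap<String, Vec<String>>) -> HashMap<String, String> {
    let mut index = HashMap::new();
    for (canonical, alternates) in aliases {
        let canonical = canonical.to_lowercase();
        index.insert(canonical.clone(), canonical.clone());
        for alt in alternates {
            index.insert(alt.to_lowercase(), canonical.clone());
        }
    }
    index
}

/// Lowercases the predicate and maps it through the alias index if present.
pub fn canonical_predicate(index: &HashMap<String, String>, predicate: &str) -> String {
    let key = predicate.trim().to_lowercase();
    index.get(&key).cloned().unwrap_or(key)
}

/// Lowercases a concept path and drops empty segments ("tls//x/" -> "tls/x").
pub fn normalize_path(path: &str) -> String {
    path.trim()
        .to_lowercase()
        .split('/')
        .filter(|segment| !segment.is_empty())
        .collect::<Vec<_>>()
        .join("/")
}

/// Collapses common boolean spellings; other values are only lowercased.
pub fn canonical_value(value: &str) -> String {
    match value.trim().to_lowercase().as_str() {
        "true" | "yes" | "on" | "1" => "true".to_string(),
        "false" | "no" | "off" | "0" => "false".to_string(),
        other => other.to_string(),
    }
}

/// Structural match on (subject, predicate, value) after normalization.
pub fn claims_match(
    index: &HashMap<String, String>,
    a: (&str, &str, &str),
    b: (&str, &str, &str),
) -> bool {
    normalize_path(a.0) == normalize_path(b.0)
        && canonical_predicate(index, a.1) == canonical_predicate(index, b.1)
        && canonical_value(a.2) == canonical_value(b.2)
}
```

Mapping `1` and `0` to booleans is lossy for genuinely numeric values, so a real matcher would probably apply value equivalences per predicate rather than globally.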
4. Prompt Versioning
Prompts are versioned to track changes and correlate with metrics.
Version Schema
/// Prompt version identifier
#[derive(Debug, Clone, Serialize, Deserialize, PartialEq, Eq, Hash)]
pub struct PromptVersion {
/// Semantic version (major.minor.patch)
pub version: String,
/// BLAKE3 hash of prompt content
pub content_hash: String,
/// When this version was created
pub created_at: DateTime<Utc>,
/// Description of changes from previous version
pub changelog: Option<String>,
}
impl PromptVersion {
/// Compute version from prompt content
pub fn from_prompt(prompt: &str, changelog: Option<String>) -> Self {
let content_hash = blake3::hash(prompt.as_bytes()).to_hex().to_string();
// Version is computed or provided externally
Self {
version: "0.0.0".to_string(), // Placeholder
content_hash,
created_at: Utc::now(),
changelog,
}
}
}
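Pairing a semantic version with a content hash makes it possible to detect a prompt that was edited without a version bump. The sketch below shows one such check using only the standard library; `parse_semver`, `VersionCheck`, and `check_version_bump` are hypothetical names, and hashes are passed in as plain strings rather than computed.

```rust
/// Parses "major.minor.patch" into a tuple; returns None for anything else.
pub fn parse_semver(version: &str) -> Option<(u64, u64, u64)> {
    let mut parts = version.trim().split('.');
    let major: u64 = parts.next()?.parse().ok()?;
    let minor: u64 = parts.next()?.parse().ok()?;
    let patch: u64 = parts.next()?.parse().ok()?;
    if parts.next().is_some() {
        return None; // More than three components.
    }
    Some((major, minor, patch))
}

#[derive(Debug, PartialEq)]
pub enum VersionCheck {
    /// Prompt content is identical.
    Unchanged,
    /// Content changed and the version increased.
    Bumped,
    /// Content changed but the version did not increase.
    MissingBump,
    /// One of the version strings is not valid major.minor.patch.
    Invalid,
}

/// Compares an old and new (version, content hash) pair.
pub fn check_version_bump(
    old_version: &str,
    old_hash: &str,
    new_version: &str,
    new_hash: &str,
) -> VersionCheck {
    let (old, new) = match (parse_semver(old_version), parse_semver(new_version)) {
        (Some(old), Some(new)) => (old, new),
        _ => return VersionCheck::Invalid,
    };
    if old_hash == new_hash {
        VersionCheck::Unchanged
    } else if new > old {
        // Tuples compare component by component, so (1, 10, 0) > (1, 9, 0).
        VersionCheck::Bumped
    } else {
        VersionCheck::MissingBump
    }
}
```

A check like this could run in the per-PR job so that observation logs never attribute two different prompts to the same `prompt_version`.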
Prompt File Structure
// llm/prompt.rs
/// Current prompt version
pub const PROMPT_VERSION: &str = "1.2.0";
/// Changelog for current version
pub const PROMPT_CHANGELOG: &str = "Improved JWT claim extraction accuracy";
/// The extraction prompt
pub const EXTRACTION_PROMPT: &str = r#"
You are a security analyst extracting implicit security claims from code.
Given the following code file, identify any security-relevant configurations,
settings, or patterns. For each finding, output a JSON object with:
- subject: The concept path (e.g., "tls/cert_verification")
- predicate: The aspect being claimed (e.g., "enabled")
- value: The value found (boolean, string, or number)
- confidence: Your confidence in this extraction (0.0 to 1.0)
- description: Brief explanation
Focus on:
- TLS/SSL configuration
- Authentication settings
- Cryptographic choices
- Secret/credential handling
- Input validation
- Authorization patterns
Output as a JSON array. If no security claims are found, output an empty array.
Code:
```{language}
{content}
```
"#;
5. Metrics & Reporting
Metrics Computed
| Metric | Formula | Purpose |
|--------|---------|---------|
| **Precision** | TP / (TP + FP) | Are we avoiding false positives? |
| **Recall** | TP / (TP + FN) | Are we finding all issues? |
| **F1 Score** | 2 * (P * R) / (P + R) | Balanced measure |
| **Confidence Calibration** | Correlation(confidence, correctness) | Are high-confidence claims actually correct? |
| **Parse Success Rate** | Successful parses / Total extractions | Is the prompt producing valid JSON? |
| **Cost per Fixture** | Total cost / Fixtures | Budget tracking |
| **Latency P50/P95** | Percentiles | Performance tracking |
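The latency percentiles in the table can be computed in several ways; the document does not say which definition is intended. The sketch below uses the nearest-rank method as one reasonable choice, with `percentile` as a hypothetical helper operating on latencies in milliseconds.

```rust
/// Nearest-rank percentile: the smallest sample such that at least `p`
/// percent of all samples are less than or equal to it.
/// Returns None for an empty slice or a `p` outside 0..=100.
pub fn percentile(samples: &[u64], p: f64) -> Option<u64> {
    if samples.is_empty() || !(0.0..=100.0).contains(&p) {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_unstable();
    // rank = ceil(p / 100 * n), clamped to at least 1 so that p = 0 maps
    // to the smallest sample.
    let rank = ((p / 100.0) * sorted.len() as f64).ceil() as usize;
    Some(sorted[rank.max(1) - 1])
}
```

With ten samples, P50 is the fifth smallest value and P95 is the largest, which shows how sensitive P95 is to a single slow call on a small corpus.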
Regression Detection
/// Compare current metrics against baseline
pub struct BaselineComparison {
/// Current metrics
pub current: AggregateMetrics,
/// Baseline metrics
pub baseline: AggregateMetrics,
/// Deltas
pub precision_delta: f64,
pub recall_delta: f64,
pub f1_delta: f64,
/// Regression thresholds
pub regression_threshold: f64, // absolute drop, e.g., 0.05 = 5 percentage points
/// Fixtures that regressed
pub regressed_fixtures: Vec<FixtureRegression>,
/// Fixtures that improved
pub improved_fixtures: Vec<FixtureImprovement>,
}
impl BaselineComparison {
pub fn has_regression(&self) -> bool {
self.precision_delta < -self.regression_threshold ||
self.recall_delta < -self.regression_threshold ||
self.f1_delta < -self.regression_threshold
}
}
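As a worked example of the threshold logic, the sketch below recomputes the deltas from plain scores and applies the same comparison as `has_regression`. The `Scores` type and the free functions are simplified stand-ins for `AggregateMetrics` and `BaselineComparison`, introduced only for illustration.

```rust
/// Simplified stand-in for the precision/recall/F1 part of AggregateMetrics.
#[derive(Debug, Clone, Copy)]
pub struct Scores {
    pub precision: f64,
    pub recall: f64,
    pub f1: f64,
}

/// Current minus baseline, per metric. Negative values are drops.
pub fn deltas(current: Scores, baseline: Scores) -> Scores {
    Scores {
        precision: current.precision - baseline.precision,
        recall: current.recall - baseline.recall,
        f1: current.f1 - baseline.f1,
    }
}

/// Same rule as `BaselineComparison::has_regression`: any metric that
/// dropped by more than `threshold` counts as a regression.
pub fn has_regression(delta: Scores, threshold: f64) -> bool {
    delta.precision < -threshold || delta.recall < -threshold || delta.f1 < -threshold
}
```

With a baseline of 0.85 / 0.78 / 0.81 and a current run of 0.87 / 0.76 / 0.81, recall drops by 0.02. That is inside the default threshold of 0.05, so the run is not a regression, but it would be flagged under a stricter threshold of 0.01.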
Report Format
# Prompt Evaluation Report
**Run ID:** abc123
**Date:** 2026-02-05 14:30:00 UTC
**Prompt Version:** 1.2.0
**Model:** gemini-2.0-flash
## Summary
| Metric | Current | Baseline | Delta | Status |
| ------------- | ------- | -------- | ----- | ------ |
| Precision | 0.87 | 0.85 | +0.02 | ✅ |
| Recall | 0.76 | 0.78 | -0.02 | ⚠️ |
| F1 Score | 0.81 | 0.81 | +0.00 | ✅ |
| Parse Success | 98% | 97% | +1% | ✅ |
**Verdict:** ⚠️ REVIEW - Recall dropped by 2%
## Cost Analysis
- Total tokens: 125,430
- Estimated cost: $0.12
- Cost per fixture: $0.002
## Regressions
### jwt-003: JWT algorithm none detection
- **Expected:** `jwt/algorithm = "none"` with confidence > 0.8
- **Got:** Not extracted
- **Impact:** High (security-critical)
## Improvements
### tls-007: TLS version in constants
- **Previously:** Not extracted
- **Now:** `tls/min_version = "1.0"` with confidence 0.85
- **Impact:** Medium
## Category Breakdown
| Category | Fixtures | Passed | Failed | Precision | Recall |
| -------- | -------- | ------ | ------ | --------- | ------ |
| tls | 12 | 11 | 1 | 0.92 | 0.91 |
| jwt | 8 | 6 | 2 | 0.75 | 0.75 |
| secrets | 15 | 14 | 1 | 0.93 | 0.87 |
| auth | 6 | 6 | 0 | 1.00 | 0.83 |
| negative | 10 | 10 | 0 | 1.00 | N/A |
6. Jobs & Automation
CI Job (Per-PR)
# .github/workflows/prompt-eval-smoke.yml
name: Prompt Evaluation (Smoke)
on:
  pull_request:
    paths:
      - "applications/aphoria/src/llm/**"
      - "applications/aphoria/tests/llm_fixtures/**"
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run smoke test
        env:
          GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}
        run: |
          cargo run -p aphoria -- eval prompts \
            --mode cached \
            --max-fixtures 20 \
            --categories tls,jwt,secrets \
            --baseline tests/llm_fixtures/baselines/latest.json \
            --fail-on-regression
      - name: Upload report
        uses: actions/upload-artifact@v4
        with:
          name: eval-report
          path: eval-report.md
Characteristics:
- Runs on PR that touches prompt code or fixtures
- Uses cached responses (fast, deterministic)
- Limited to 20 fixtures (smoke test)
- Fails if regression detected
Nightly Job (Full Evaluation)
# .github/workflows/prompt-eval-nightly.yml
name: Prompt Evaluation (Full)
on:
  schedule:
    - cron: "0 3 * * *" # 3am UTC daily
  workflow_dispatch:
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run full evaluation
        env:
          GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}
        # --fail-on-regression makes this step exit non-zero on a regression,
        # which is what triggers the failure() alert below.
        run: |
          cargo run -p aphoria -- eval prompts \
            --mode live \
            --model gemini-2.0-flash \
            --temperature 0 \
            --baseline tests/llm_fixtures/baselines/latest.json \
            --output-dir ./eval-results \
            --save-observations \
            --fail-on-regression
      - name: Update baseline if improved
        run: |
          # If F1 improved by > 2%, update baseline
          ./scripts/maybe-update-baseline.sh
      - name: Upload results
        if: always() # Keep results even when the evaluation step fails
        uses: actions/upload-artifact@v4
        with:
          name: eval-results
          path: eval-results/
      - name: Post to Slack
        if: failure()
        uses: slackapi/slack-github-action@v1
        with:
          payload: |
            {
              "text": "⚠️ Prompt evaluation regression detected",
              "attachments": [...]
            }
Characteristics:
- Runs nightly at 3am UTC
- Uses live LLM API (real evaluation)
- Full corpus coverage
- Updates baseline if metrics improve significantly
- Alerts on regression
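The decision made by the baseline update step can be sketched as a small pure function. The contents of `maybe-update-baseline.sh` are not shown in this document, so the logic below is an assumption: the "2%" gain is read as an absolute F1 improvement of 0.02, and a model change is routed to manual review rather than an automatic update. `BaselineAction` and `decide_baseline_action` are hypothetical names.

```rust
#[derive(Debug, PartialEq)]
pub enum BaselineAction {
    /// Leave the stored baseline untouched.
    Keep,
    /// Replace the baseline with the current run.
    Update,
    /// The model changed; a human should re-baseline explicitly.
    ManualReview,
}

/// Decides what to do with the baseline after a nightly run.
/// `min_gain` is an absolute F1 improvement, e.g. 0.02.
pub fn decide_baseline_action(
    current_f1: f64,
    baseline_f1: f64,
    min_gain: f64,
    regression_detected: bool,
    current_model: &str,
    baseline_model: &str,
) -> BaselineAction {
    if current_model != baseline_model {
        // Scores from different models are not directly comparable.
        return BaselineAction::ManualReview;
    }
    if regression_detected {
        // Never promote a run in which any metric regressed.
        return BaselineAction::Keep;
    }
    if current_f1 - baseline_f1 > min_gain {
        BaselineAction::Update
    } else {
        BaselineAction::Keep
    }
}
```

Keeping the decision separate from file handling makes it easy to unit test the promotion rules without touching the stored baseline.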
On-Demand Job (Prompt Iteration)
# For prompt development: compare two versions
aphoria eval prompts \
--mode live \
--prompt-file ./prompts/experimental.txt \
--baseline ./baselines/current.json \
--output-dir ./eval-comparison \
--verbose
# View comparison
cat ./eval-comparison/comparison.md
CLI Interface
USAGE:
aphoria eval prompts [OPTIONS]
OPTIONS:
--fixtures-dir <DIR> Path to fixtures directory [default: tests/llm_fixtures]
--categories <LIST> Categories to evaluate (comma-separated)
--max-fixtures <N> Maximum fixtures to run
--mode <MODE> Evaluation mode: live, cached, mock [default: cached]
--model <MODEL> Model to use (live mode only) [default: gemini-2.0-flash]
--temperature <TEMP> Temperature (live mode only) [default: 0]
--baseline <FILE> Baseline to compare against
--output-dir <DIR> Output directory for results [default: ./eval-results]
--save-observations Save observation logs
--fail-on-regression Exit with code 1 if regression detected
--regression-threshold <N> Threshold for regression (default: 0.05 = 5%)
--verbose Verbose output
--json Output results as JSON
SUBCOMMANDS:
aphoria eval prompts show-baseline Show current baseline metrics
aphoria eval prompts update-baseline Update baseline from latest run
aphoria eval prompts list-fixtures List available fixtures
aphoria eval prompts add-fixture Add a new fixture interactively
aphoria eval prompts validate-fixtures Validate fixture format
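Validation of a few of these options can be sketched with the standard library. A real CLI would more likely use an argument-parsing crate; the `Mode` enum and the `parse_categories` and `parse_threshold` helpers below are hypothetical, and restricting the threshold to the range 0 to 1 is an assumption based on the `0.05 = 5%` example.

```rust
use std::str::FromStr;

/// Evaluation mode as accepted by `--mode`.
#[derive(Debug, PartialEq)]
pub enum Mode {
    Live,
    Cached,
    Mock,
}

impl FromStr for Mode {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "live" => Ok(Mode::Live),
            "cached" => Ok(Mode::Cached),
            "mock" => Ok(Mode::Mock),
            other => Err(format!(
                "unknown mode '{}'; expected live, cached, or mock",
                other
            )),
        }
    }
}

/// Splits a comma-separated `--categories` value, ignoring blanks.
pub fn parse_categories(list: &str) -> Vec<String> {
    list.split(',')
        .map(str::trim)
        .filter(|name| !name.is_empty())
        .map(String::from)
        .collect()
}

/// Parses `--regression-threshold` as a fraction between 0 and 1.
pub fn parse_threshold(s: &str) -> Result<f64, String> {
    let value: f64 = s
        .trim()
        .parse()
        .map_err(|_| format!("'{}' is not a number", s))?;
    if (0.0..=1.0).contains(&value) {
        Ok(value)
    } else {
        Err(format!("threshold {} is outside 0..=1", value))
    }
}
```

Rejecting a value such as `5` early avoids a silent misconfiguration where a threshold meant as 5% would disable regression detection entirely.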
Implementation Plan
Phase 7.8.1: Core Infrastructure (Week 1)
| Task | Description | Effort |
|---|---|---|
| Fixture format | Define TOML schema, parser | 2d |
| Observation log | Schema, writer, reader | 1d |
| Claim matcher | Matching logic with fuzzy support | 2d |
| Prompt versioning | Version extraction, tracking | 1d |
Deliverable: Can load fixtures, run extractions, compare outputs
Phase 7.8.2: Evaluation Harness (Week 2)
| Task | Description | Effort |
|---|---|---|
| Evaluation harness | Orchestration, parallelism | 2d |
| Metrics computation | Precision, recall, F1, cost | 1d |
| Baseline comparison | Regression detection | 1d |
| Report generation | Markdown, JSON output | 1d |
Deliverable: Can run full evaluation and generate report
Phase 7.8.3: Golden Corpus (Week 2-3)
| Task | Description | Effort |
|---|---|---|
| Seed fixtures (20) | Hand-curated test cases | 2d |
| Negative fixtures (10) | Safe code that shouldn't trigger | 1d |
| Edge case fixtures (5) | Boundary conditions | 1d |
| Baseline establishment | Initial metrics snapshot | 1d |
Deliverable: 35+ fixtures covering core categories
Phase 7.8.4: CI Integration (Week 3)
| Task | Description | Effort |
|---|---|---|
| Smoke test workflow | Per-PR cached evaluation | 1d |
| Nightly workflow | Full live evaluation | 1d |
| Baseline auto-update | Script for improvement detection | 1d |
| Alerting | Slack/email on regression | 0.5d |
Deliverable: Automated evaluation in CI
Phase 7.8.5: CLI & Documentation (Week 4)
| Task | Description | Effort |
|---|---|---|
| CLI commands | eval prompts subcommands | 2d |
| Documentation | Usage guide, fixture authoring | 1d |
| Skill update | /aphoria-dev skill update | 0.5d |
Deliverable: Production-ready tooling
Open Questions
1. Where do we store baseline metrics?
Options:
- In repository (tests/llm_fixtures/baselines/) - Simple, versioned with code
- External artifact store - Separates metrics from code
- Database - For historical tracking
Recommendation: Start with the repository; migrate to an external store once historical tracking is needed.
2. How strict should matching be?
Options:
- Exact match - Same subject, predicate, value (brittle)
- Structural match - Same concept, fuzzy value (looser)
- Semantic match - Embeddings-based similarity (complex)
Recommendation: Structural match with configurable fuzzy value matching.
3. Mock vs Live in CI?
Options:
- Always mock - Fast, free, deterministic, tests harness not prompt
- Always live - Expensive, slow, tests actual prompt
- Hybrid - Cached for PR smoke tests, live for nightly
Recommendation: Hybrid approach. Cached (deterministic) for PR, live for nightly.
4. How do we handle model version changes?
Google may update Gemini models, causing output drift even without prompt changes.
Options:
- Pin model version (if API supports)
- Track model version in baseline, re-baseline on model change
- Alert when model version changes
Recommendation: Track model version, require manual baseline update on change.
5. What's the corpus growth strategy?
Options:
- Hand-curate only (high quality, slow growth)
- Production capture with review (faster growth, needs tooling)
- Synthetic generation (fast, may not reflect reality)
Recommendation: Start hand-curated, add production capture tooling in Phase 9.
Success Metrics
| Metric | Target | Measurement |
|---|---|---|
| Regression detection rate | 100% | Simulated regressions caught |
| False positive rate (regression alerts) | < 5% | Manual review of alerts |
| Prompt iteration cycle time | < 30 min | Time from change to evaluation result |
| Corpus coverage | > 50 fixtures | Fixture count |
| CI job duration (smoke) | < 2 min | Workflow timing |
| CI job duration (nightly) | < 15 min | Workflow timing |
Related Documents
- LLM-in-the-Loop Extraction - Phase 7.5 implementation
- Pattern Learning Store - Phase 7.6 implementation
- LLM Extractor Code - Current implementation
Appendix: Example Fixture
# fixtures/secrets/high_entropy_api_key.toml
[metadata]
id = "secrets-005"
name = "High entropy API key in Python config"
category = "secrets"
subcategory = "api_keys"
language = "python"
difficulty = "medium"
source = "hand-curated"
created = "2026-02-05"
updated = "2026-02-05"
notes = """
Tests detection of high-entropy strings that look like API keys.
The entropy of 'sk_live_abc123...' is > 4.0 which should trigger detection.
"""
[input]
filename = "config.py"
content = '''
import os
# Configuration for payment processing
STRIPE_API_KEY = "sk_live_51ABC123DEF456GHI789JKL012MNO345PQR678STU901VWX234YZ"
STRIPE_WEBHOOK_SECRET = os.environ.get("STRIPE_WEBHOOK_SECRET")
DATABASE_URL = os.environ.get("DATABASE_URL", "postgresql://localhost/app")
DEBUG = os.environ.get("DEBUG", "false").lower() == "true"
'''
[expected]
must_contain = [
{ subject = "secrets/api_key", predicate = "hardcoded", value = true, min_confidence = 0.8 },
]
must_not_contain = [
# STRIPE_WEBHOOK_SECRET uses env var, should NOT be flagged
{ subject = "secrets/webhook_secret", predicate = "hardcoded", value = true },
# DATABASE_URL uses env var with fallback, should NOT be flagged as hardcoded
{ subject = "secrets/database_url", predicate = "hardcoded", value = true },
]
acceptable_variants = [
{ subject = "stripe/api_key", predicate = "exposed", value = true },
{ subject = "payment/secret", predicate = "hardcoded", value = true },
]
[scoring]
weight = 1.5 # Security-critical, weighted higher
min_confidence = 0.8
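The fixture notes refer to an entropy above 4.0. Assuming this means Shannon entropy in bits per character, the figure can be checked with a short function. `shannon_entropy` is a hypothetical helper, and the detector's actual entropy definition is not specified in this document.

```rust
use std::collections::HashMap;

/// Shannon entropy of a string in bits per character:
/// H = -sum(p_i * log2(p_i)) over the distinct characters.
pub fn shannon_entropy(s: &str) -> f64 {
    let mut counts: HashMap<char, usize> = HashMap::new();
    let mut total = 0usize;
    for ch in s.chars() {
        *counts.entry(ch).or_insert(0) += 1;
        total += 1;
    }
    if total == 0 {
        return 0.0;
    }
    counts
        .values()
        .map(|&count| {
            let p = count as f64 / total as f64;
            -p * p.log2()
        })
        .sum()
}
```

Under this definition the 60-character `STRIPE_API_KEY` value in the fixture scores roughly 5.26 bits per character, comfortably above 4.0, while a short repetitive string such as `aaaa` scores 0.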
Last updated: 2026-02-05