# LLM Prompt Evaluation System
> **Status:** Proposed (2026-02-05)
> **Phase:** 7.8 (extends Phase 7.5 LLM-in-the-Loop Extraction)
> **Author:** Architecture Team
---
## Problem Statement
Aphoria's LLM-powered claim extraction (Phase 7.5) uses Gemini to extract security claims from high-value code files. The prompts that drive this extraction are effectively **code that we don't treat like code**:
| Aspect | Traditional Code | Current Prompts |
| -------------------- | ---------------------- | ------------------------------------ |
| Version Control | Git commits | In files, but no semantic versioning |
| Testing | Unit/integration tests | None |
| Metrics | Coverage, performance | None |
| Regression Detection | CI failures | None |
| Quality Gates | Linting, review | None |

**The result:** When we change a prompt, we have no systematic way to know if we made things better or worse. We're flying blind.
### Enterprise Requirements
For enterprise adoption, customers need assurance that:
1. **Prompts produce consistent, high-quality results** - Not random outputs
2. **Changes are validated before deployment** - Regressions are caught
3. **Performance is measurable** - Precision, recall, cost are tracked
4. **The system improves over time** - With evidence, not hope
---
## Goals
### Primary Goals
1. **Observability** - Understand prompt effectiveness through metrics and logging
2. **Testability** - Validate prompts against known scenarios with expected outcomes
3. **Repeatability** - Run evaluations consistently across environments
4. **Automation** - Scheduled jobs that detect regressions without human intervention
### Non-Goals (Phase 7.8)
- Real-time prompt optimization (future: Phase 9)
- A/B testing in production (future: Phase 9)
- Multi-model comparison (future)
- Prompt compression/optimization (future)
---
## Architecture Overview
```
┌──────────────────────────────────────────────────────────────────────────────┐
│                         LLM Prompt Evaluation System                         │
├──────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ┌─────────────────┐                                                         │
│  │  Golden Corpus  │  Test fixtures with expected outcomes                   │
│  │  (fixtures/)    │  - Code snippets                                        │
│  │                 │  - Expected claims (must-contain, must-not-contain)     │
│  └────────┬────────┘  - Metadata (language, category, difficulty)            │
│           │                                                                  │
│           ▼                                                                  │
│  ┌─────────────────┐                                                         │
│  │   Evaluation    │  Orchestrates test runs                                 │
│  │    Harness      │  - Loads fixtures                                       │
│  │                 │  - Invokes LLM Extractor                                │
│  │                 │  - Compares outputs                                     │
│  │                 │  - Computes metrics                                     │
│  └────────┬────────┘                                                         │
│           │                                                                  │
│           ├──────────────────────┐                                           │
│           │                      │                                           │
│           ▼                      ▼                                           │
│  ┌─────────────────┐    ┌─────────────────┐                                  │
│  │  LLM Extractor  │    │   Observation   │                                  │
│  │ (instrumented)  │    │       Log       │                                  │
│  │                 │    │                 │                                  │
│  │  - Same code    │    │  - Prompt ver   │                                  │
│  │    path as      │    │  - Input hash   │                                  │
│  │    production   │    │  - Output       │                                  │
│  │                 │    │  - Latency      │                                  │
│  │                 │    │  - Tokens       │                                  │
│  └─────────────────┘    │  - Model        │                                  │
│                         │  - Timestamp    │                                  │
│                         └────────┬────────┘                                  │
│                                  │                                           │
│                                  ▼                                           │
│  ┌────────────────────────────────────────────────────────────────────────┐  │
│  │                           Metrics & Reports                            │  │
│  │                                                                        │  │
│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐    │  │
│  │  │  Precision  │  │   Recall    │  │  F1 Score   │  │    Cost     │    │  │
│  │  │ TP/(TP+FP)  │  │ TP/(TP+FN)  │  │  Harmonic   │  │  Tokens/$   │    │  │
│  │  └─────────────┘  └─────────────┘  └─────────────┘  └─────────────┘    │  │
│  │                                                                        │  │
│  │  ┌──────────────────────────────────────────────────────────────────┐  │  │
│  │  │                        Regression Report                         │  │  │
│  │  │  - Comparison against baseline                                   │  │  │
│  │  │  - Per-fixture deltas                                            │  │  │
│  │  │  - Category breakdown                                            │  │  │
│  │  │  - Recommendations                                               │  │  │
│  │  └──────────────────────────────────────────────────────────────────┘  │  │
│  └────────────────────────────────────────────────────────────────────────┘  │
│                                                                              │
└──────────────────────────────────────────────────────────────────────────────┘
```
---
## Core Components
### 1. Golden Corpus
A curated set of test fixtures with known expected outcomes.
#### Fixture Format
```toml
# fixtures/tls/disabled_verification.toml
[metadata]
id = "tls-001"
name = "TLS verification disabled in Python requests"
category = "tls"
subcategory = "certificate_verification"
language = "python"
difficulty = "easy" # easy | medium | hard
source = "hand-curated" # hand-curated | production-capture | synthetic
created = "2026-02-05"
updated = "2026-02-05"
[input]
filename = "client.py"
content = '''
import requests

def fetch_data(url):
    # Disable SSL verification for internal services
    response = requests.get(url, verify=False)
    return response.json()
'''
[expected]
# Claims that MUST be extracted (recall)
must_contain = [
{ subject = "tls/cert_verification", predicate = "enabled", value = false },
]
# Claims that MUST NOT be extracted (precision)
must_not_contain = [
{ subject = "tls/cert_verification", predicate = "enabled", value = true },
]
# Optional: acceptable alternate formulations
acceptable_variants = [
{ subject = "ssl/verify", predicate = "enabled", value = false },
{ subject = "requests/ssl_verify", predicate = "value", value = false },
]
[scoring]
# How to score this fixture
weight = 1.0 # Importance multiplier
min_confidence = 0.7 # Expected minimum confidence
```
#### Corpus Organization
```
applications/aphoria/tests/llm_fixtures/
├── README.md                    # Corpus documentation
├── manifest.toml                # Index of all fixtures
├── tls/
│   ├── disabled_verification.toml
│   ├── deprecated_version.toml
│   └── pinning_bypass.toml
├── jwt/
│   ├── alg_none.toml
│   ├── skip_signature.toml
│   └── hardcoded_secret.toml
├── secrets/
│   ├── api_key_in_code.toml
│   ├── password_hardcoded.toml
│   └── high_entropy_token.toml
├── auth/
│   ├── bypass_pattern.toml
│   └── debug_header.toml
├── negative/                    # Files that should NOT trigger claims
│   ├── safe_tls_config.toml
│   ├── proper_jwt_validation.toml
│   └── env_var_secrets.toml
└── edge_cases/
    ├── empty_file.toml
    ├── binary_content.toml
    ├── huge_file.toml
    └── mixed_languages.toml
```
#### Manifest Structure
```toml
# manifest.toml
[corpus]
version = "1.0.0"
created = "2026-02-05"
description = "Golden corpus for LLM extraction evaluation"
[categories]
tls = { fixtures = 12, description = "TLS/SSL configuration" }
jwt = { fixtures = 8, description = "JWT authentication" }
secrets = { fixtures = 15, description = "Hardcoded secrets" }
auth = { fixtures = 6, description = "Authentication bypass" }
negative = { fixtures = 10, description = "Safe code (no claims expected)" }
edge_cases = { fixtures = 5, description = "Boundary conditions" }
[baseline]
# Current known-good metrics
precision = 0.85
recall = 0.78
f1 = 0.81
total_fixtures = 56
last_updated = "2026-02-05"
prompt_version = "1.0.0"
model = "gemini-2.0-flash"
```
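Because the manifest states both per-category counts and a corpus-wide `total_fixtures`, the two can drift apart as fixtures are added. The sketch below shows one way such a consistency check might look. `manifest_total_matches` and `example_categories` are hypothetical names, and the counts are passed in as plain tuples rather than parsed from TOML.

```rust
/// Hypothetical helper: true when the per-category fixture counts add up
/// to the total declared in the manifest's [baseline] table.
fn manifest_total_matches(categories: &[(&str, usize)], declared_total: usize) -> bool {
    let summed: usize = categories.iter().map(|(_, count)| *count).sum();
    summed == declared_total
}

/// The category counts from the example manifest above.
fn example_categories() -> Vec<(&'static str, usize)> {
    vec![
        ("tls", 12),
        ("jwt", 8),
        ("secrets", 15),
        ("auth", 6),
        ("negative", 10),
        ("edge_cases", 5),
    ]
}
```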
---
### 2. Observation Log
Every LLM extraction is logged with full context for replay and analysis.
#### Log Entry Schema
```rust
use chrono::{DateTime, Utc};
use serde::{Deserialize, Serialize};
use uuid::Uuid;

// `Observation` and `MatchDetails` are not defined in this document.

/// A single observation from an LLM extraction
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ExtractionObservation {
    /// Unique identifier for this observation
    pub id: Uuid,
    /// When this extraction occurred
    pub timestamp: DateTime<Utc>,
    /// Prompt version (semantic version)
    pub prompt_version: String,
    /// Model identifier (e.g., "gemini-2.0-flash")
    pub model: String,
    /// BLAKE3 hash of input content (for deduplication)
    pub input_hash: String,
    /// Input metadata
    pub input: ExtractionInput,
    /// Output from LLM
    pub output: ExtractionOutput,
    /// Performance metrics
    pub metrics: ExtractionMetrics,
    /// Evaluation context (if run during evaluation)
    pub evaluation: Option<EvaluationContext>,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ExtractionInput {
    /// Filename (may be anonymized)
    pub filename: String,
    /// Language detected
    pub language: String,
    /// Content length in bytes
    pub content_length: usize,
    /// Content preview (first 500 chars, for debugging)
    pub content_preview: Option<String>,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ExtractionOutput {
    /// Raw LLM response
    pub raw_response: String,
    /// Parsed claims (may be empty if parsing failed)
    pub claims: Vec<Observation>,
    /// Whether parsing succeeded
    pub parse_success: bool,
    /// Parse error if any
    pub parse_error: Option<String>,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ExtractionMetrics {
    /// Total latency (API call + processing)
    pub latency_ms: u64,
    /// API latency only
    pub api_latency_ms: u64,
    /// Input tokens
    pub input_tokens: u32,
    /// Output tokens
    pub output_tokens: u32,
    /// Total tokens
    pub total_tokens: u32,
    /// Estimated cost (USD)
    pub estimated_cost_usd: f64,
    /// Cache hit (if response was cached)
    pub cache_hit: bool,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct EvaluationContext {
    /// Fixture ID if from golden corpus
    pub fixture_id: Option<String>,
    /// Evaluation run ID
    pub run_id: Uuid,
    /// Whether this matched expected output
    pub matched_expected: bool,
    /// Detailed match results
    pub match_details: MatchDetails,
}
```
#### Log Storage
```
~/.aphoria/eval/
├── observations/
│   ├── 2026-02-05/
│   │   ├── 001_tls-001_success.json
│   │   ├── 002_jwt-003_partial.json
│   │   └── ...
│   └── 2026-02-04/
│       └── ...
├── runs/
│   ├── run_abc123.json          # Full evaluation run metadata
│   └── run_def456.json
└── baselines/
    ├── v1.0.0.json              # Baseline for prompt v1.0.0
    └── latest.json              # Symlink to current baseline
```
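The layout above suggests that each observation file name combines a sequence number, a fixture ID, and an outcome. The following sketch builds such a path under that reading of the example names. `observation_path` is a hypothetical helper, and the meaning of the three name components is an assumption inferred from `001_tls-001_success.json`.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: builds
/// `<eval_root>/observations/<date>/<seq>_<fixture_id>_<outcome>.json`,
/// with the sequence number zero-padded to three digits.
fn observation_path(
    eval_root: &Path,
    date: &str,
    seq: u32,
    fixture_id: &str,
    outcome: &str,
) -> PathBuf {
    eval_root
        .join("observations")
        .join(date)
        .join(format!("{:03}_{}_{}.json", seq, fixture_id, outcome))
}
```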
---
### 3. Evaluation Harness
The core engine that runs evaluations.
#### Public API
```rust
use std::collections::HashMap;
use std::path::PathBuf;

use chrono::{DateTime, Utc};
use serde::{Deserialize, Serialize};
use uuid::Uuid;

/// Configuration for an evaluation run
#[derive(Debug, Clone)]
pub struct EvalConfig {
    /// Path to fixtures directory
    pub fixtures_dir: PathBuf,
    /// Which categories to evaluate (None = all)
    pub categories: Option<Vec<String>>,
    /// Maximum fixtures to run (for quick smoke tests)
    pub max_fixtures: Option<usize>,
    /// Whether to use real LLM or mock
    pub mode: EvalMode,
    /// Baseline to compare against
    pub baseline: Option<PathBuf>,
    /// Output directory for results
    pub output_dir: PathBuf,
    /// Whether to save observations
    pub save_observations: bool,
    /// Parallelism (concurrent LLM calls)
    pub parallelism: usize,
}

#[derive(Debug, Clone)]
pub enum EvalMode {
    /// Use real LLM API (costs money, tests actual prompt)
    Live {
        model: String,
        temperature: f32,
    },
    /// Use cached responses (fast, deterministic, for CI)
    Cached,
    /// Use mock responses (for testing harness itself)
    Mock,
}

/// Result of an evaluation run
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct EvalResult {
    /// Unique run identifier
    pub run_id: Uuid,
    /// When the run started
    pub started_at: DateTime<Utc>,
    /// When the run completed
    pub completed_at: DateTime<Utc>,
    /// Configuration used
    pub config: EvalConfigSummary,
    /// Aggregate metrics
    pub metrics: AggregateMetrics,
    /// Per-fixture results
    pub fixture_results: Vec<FixtureResult>,
    /// Comparison with baseline (if baseline provided)
    pub baseline_comparison: Option<BaselineComparison>,
    /// Overall verdict
    pub verdict: EvalVerdict,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct AggregateMetrics {
    /// Precision: TP / (TP + FP)
    pub precision: f64,
    /// Recall: TP / (TP + FN)
    pub recall: f64,
    /// F1 score: 2 * (P * R) / (P + R)
    pub f1: f64,
    /// Total fixtures evaluated
    pub total_fixtures: usize,
    /// Fixtures that passed
    pub passed: usize,
    /// Fixtures that failed
    pub failed: usize,
    /// Fixtures that errored (LLM call failed, parse failed, etc.)
    pub errored: usize,
    /// Total cost (USD)
    pub total_cost_usd: f64,
    /// Total tokens used
    pub total_tokens: u64,
    /// Average latency (ms)
    pub avg_latency_ms: f64,
    /// Per-category breakdown
    pub by_category: HashMap<String, CategoryMetrics>,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub enum EvalVerdict {
    /// All checks passed
    Pass,
    /// Some regressions detected
    Regression { details: Vec<String> },
    /// Evaluation failed (errors prevented completion)
    Error { message: String },
}
```
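As an illustration of how the `categories` and `max_fixtures` options in `EvalConfig` might narrow a run, the sketch below filters a fixture list by category and then truncates it. `FixtureRef` and `select_fixtures` are hypothetical names, and applying the cap after the category filter (in corpus order) is an assumption rather than documented harness behavior.

```rust
/// Hypothetical, minimal stand-in for a loaded fixture.
#[derive(Debug, Clone, PartialEq)]
struct FixtureRef {
    id: String,
    category: String,
}

/// Keeps fixtures whose category is in `categories` (None = all),
/// then caps the result at `max_fixtures` (None = no cap).
fn select_fixtures(
    all: &[FixtureRef],
    categories: Option<&[String]>,
    max_fixtures: Option<usize>,
) -> Vec<FixtureRef> {
    all.iter()
        .filter(|f| match categories {
            Some(wanted) => wanted.iter().any(|c| *c == f.category),
            None => true,
        })
        .take(max_fixtures.unwrap_or(usize::MAX))
        .cloned()
        .collect()
}
```

A smoke run corresponds to a small `max_fixtures` with a few categories, while a nightly run passes `None` for both.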
#### Claim Matching Logic
```rust
use std::collections::HashMap;

/// How to match extracted claims against expected claims
pub struct ClaimMatcher {
    /// Tolerance for confidence comparison
    pub confidence_tolerance: f32,
    /// Whether to normalize concept paths before comparison
    pub normalize_paths: bool,
    /// Predicate aliases (e.g., "enabled" == "active" == "on")
    pub predicate_aliases: HashMap<String, Vec<String>>,
    /// Value equivalences (e.g., true == "true" == "yes" == 1)
    pub value_equivalences: Vec<Vec<ObjectValue>>,
}

impl ClaimMatcher {
    /// Check if extracted claims satisfy must_contain requirements
    pub fn check_must_contain(
        &self,
        extracted: &[Observation],
        expected: &[ExpectedClaim],
    ) -> MatchResult {
        // For each expected claim:
        // 1. Find matching extracted claim (subject + predicate match)
        // 2. Check value compatibility
        // 3. Check confidence threshold
        // Return: matched, unmatched, partial matches
        todo!("design sketch: matching logic not yet implemented")
    }

    /// Check if extracted claims violate must_not_contain requirements
    pub fn check_must_not_contain(
        &self,
        extracted: &[Observation],
        forbidden: &[ExpectedClaim],
    ) -> MatchResult {
        // For each forbidden claim:
        // 1. Check if any extracted claim matches
        // 2. Flag violations
        todo!("design sketch: violation check not yet implemented")
    }
}
```
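The steps listed in the comments above can be sketched with simplified stand-in types. Everything here is illustrative: `Value`, `Claim`, `Expected`, `MatchSummary`, and `Matcher` are hypothetical substitutes for `ObjectValue`, `Observation`, `ExpectedClaim`, `MatchResult`, and `ClaimMatcher`, which this document does not define. The sketch covers predicate aliases and confidence thresholds only; path normalization and value equivalences are omitted, and each predicate is assumed to belong to at most one alias group.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
enum Value {
    Bool(bool),
    Text(String),
}

#[derive(Debug, Clone)]
struct Claim {
    subject: String,
    predicate: String,
    value: Value,
    confidence: f32,
}

#[derive(Debug, Clone)]
struct Expected {
    subject: String,
    predicate: String,
    value: Value,
    min_confidence: f32,
}

#[derive(Debug, Default, PartialEq)]
struct MatchSummary {
    matched: usize,
    low_confidence: usize,
    unmatched: usize,
}

struct Matcher {
    /// Canonical predicate -> accepted aliases.
    predicate_aliases: HashMap<String, Vec<String>>,
}

impl Matcher {
    /// Maps an alias to its canonical predicate; unknown predicates pass through.
    fn canonical<'a>(&'a self, predicate: &'a str) -> &'a str {
        self.predicate_aliases
            .iter()
            .find(|(canon, aliases)| {
                canon.as_str() == predicate || aliases.iter().any(|a| a == predicate)
            })
            .map(|(canon, _)| canon.as_str())
            .unwrap_or(predicate)
    }

    /// Subject, canonical predicate, and value must all agree.
    fn same_claim(&self, got: &Claim, want: &Expected) -> bool {
        got.subject == want.subject
            && self.canonical(&got.predicate) == self.canonical(&want.predicate)
            && got.value == want.value
    }

    /// Classifies each expected claim by its best-confidence match, if any.
    fn check_must_contain(&self, extracted: &[Claim], expected: &[Expected]) -> MatchSummary {
        let mut summary = MatchSummary::default();
        for want in expected {
            let best = extracted
                .iter()
                .filter(|got| self.same_claim(got, want))
                .map(|got| got.confidence)
                .fold(None, |best: Option<f32>, c| Some(best.map_or(c, |b| b.max(c))));
            match best {
                Some(c) if c >= want.min_confidence => summary.matched += 1,
                Some(_) => summary.low_confidence += 1,
                None => summary.unmatched += 1,
            }
        }
        summary
    }

    /// Counts forbidden claims that appear among the extracted claims.
    fn count_violations(&self, extracted: &[Claim], forbidden: &[Expected]) -> usize {
        forbidden
            .iter()
            .filter(|bad| extracted.iter().any(|got| self.same_claim(got, bad)))
            .count()
    }
}
```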
---
### 4. Prompt Versioning
Prompts are versioned to track changes and correlate with metrics.
#### Version Schema
```rust
use chrono::{DateTime, Utc};
use serde::{Deserialize, Serialize};

/// Prompt version identifier
#[derive(Debug, Clone, Serialize, Deserialize, PartialEq, Eq, Hash)]
pub struct PromptVersion {
    /// Semantic version (major.minor.patch)
    pub version: String,
    /// BLAKE3 hash of prompt content
    pub content_hash: String,
    /// When this version was created
    pub created_at: DateTime<Utc>,
    /// Description of changes from previous version
    pub changelog: Option<String>,
}

impl PromptVersion {
    /// Build a version record from prompt content.
    /// Only the content hash is derived here; the semantic version is a
    /// placeholder until it is computed or provided externally.
    pub fn from_prompt(prompt: &str, changelog: Option<String>) -> Self {
        let content_hash = blake3::hash(prompt.as_bytes()).to_hex().to_string();
        Self {
            version: "0.0.0".to_string(), // Placeholder
            content_hash,
            created_at: Utc::now(),
            changelog,
        }
    }
}
```
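Storing both a semantic version and a content hash makes it possible to flag a prompt whose text changed without a version bump. The sketch below shows one way to express that check. `parse_semver` and `changed_without_bump` are hypothetical helpers, hashes are passed in as opaque strings, and only plain `major.minor.patch` versions are handled (no pre-release or build suffixes).

```rust
/// Parses "major.minor.patch" into a tuple that compares numerically.
fn parse_semver(version: &str) -> Option<(u32, u32, u32)> {
    let mut parts = version.split('.').map(|p| p.parse::<u32>().ok());
    let parsed = (parts.next()??, parts.next()??, parts.next()??);
    if parts.next().is_some() {
        return None; // more than three components
    }
    Some(parsed)
}

/// Each argument is a (version, content_hash) pair.
/// Returns Some(true) when the content hash changed but the version did
/// not increase, and None when either version string is malformed.
fn changed_without_bump(prev: (&str, &str), cur: (&str, &str)) -> Option<bool> {
    let prev_version = parse_semver(prev.0)?;
    let cur_version = parse_semver(cur.0)?;
    Some(prev.1 != cur.1 && cur_version <= prev_version)
}
```

Comparing parsed tuples rather than raw strings avoids ordering `1.10.0` before `1.9.0`.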
#### Prompt File Structure
````rust
// llm/prompt.rs
/// Current prompt version
pub const PROMPT_VERSION: &str = "1.2.0";
/// Changelog for current version
pub const PROMPT_CHANGELOG: &str = "Improved JWT claim extraction accuracy";
/// The extraction prompt
pub const EXTRACTION_PROMPT: &str = r#"
You are a security analyst extracting implicit security claims from code.
Given the following code file, identify any security-relevant configurations,
settings, or patterns. For each finding, output a JSON object with:
- subject: The concept path (e.g., "tls/cert_verification")
- predicate: The aspect being claimed (e.g., "enabled")
- value: The value found (boolean, string, or number)
- confidence: Your confidence in this extraction (0.0 to 1.0)
- description: Brief explanation
Focus on:
- TLS/SSL configuration
- Authentication settings
- Cryptographic choices
- Secret/credential handling
- Input validation
- Authorization patterns
Output as a JSON array. If no security claims are found, output an empty array.
Code:
```{language}
{content}
```
"#;
````
---
### 5. Metrics & Reporting
#### Metrics Computed
| Metric | Formula | Purpose |
|--------|---------|---------|
| **Precision** | TP / (TP + FP) | Are we avoiding false positives? |
| **Recall** | TP / (TP + FN) | Are we finding all issues? |
| **F1 Score** | 2 * (P * R) / (P + R) | Balanced measure |
| **Confidence Calibration** | Correlation(confidence, correctness) | Are high-confidence claims actually correct? |
| **Parse Success Rate** | Successful parses / Total extractions | Is the prompt producing valid JSON? |
| **Cost per Fixture** | Total cost / Fixtures | Budget tracking |
| **Latency P50/P95** | Percentiles | Performance tracking |
#### Regression Detection
```rust
/// Compare current metrics against baseline
pub struct BaselineComparison {
    /// Current metrics
    pub current: AggregateMetrics,
    /// Baseline metrics
    pub baseline: AggregateMetrics,
    /// Deltas (current - baseline)
    pub precision_delta: f64,
    pub recall_delta: f64,
    pub f1_delta: f64,
    /// Regression threshold, as an absolute drop in the metric
    /// (e.g., 0.05 = a drop of 5 percentage points)
    pub regression_threshold: f64,
    /// Fixtures that regressed
    pub regressed_fixtures: Vec<FixtureRegression>,
    /// Fixtures that improved
    pub improved_fixtures: Vec<FixtureImprovement>,
}

impl BaselineComparison {
    /// True if any headline metric dropped by more than the threshold
    pub fn has_regression(&self) -> bool {
        self.precision_delta < -self.regression_threshold
            || self.recall_delta < -self.regression_threshold
            || self.f1_delta < -self.regression_threshold
    }
}
````
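To make the formulas and the threshold check concrete, the following sketch computes precision, recall, and F1 from raw counts and applies the same comparison as `has_regression`. `Prf`, `prf`, and the free function `has_regression` are hypothetical names. Returning 0.0 when a denominator is zero is an assumption; the report format shows "N/A" for such cases instead.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
struct Prf {
    precision: f64,
    recall: f64,
    f1: f64,
}

/// Computes precision, recall, and F1 from true positives, false positives,
/// and false negatives. Zero denominators yield 0.0 (an assumption).
fn prf(true_pos: u32, false_pos: u32, false_neg: u32) -> Prf {
    let ratio = |num: u32, den: u32| {
        if den == 0 {
            0.0
        } else {
            f64::from(num) / f64::from(den)
        }
    };
    let precision = ratio(true_pos, true_pos + false_pos);
    let recall = ratio(true_pos, true_pos + false_neg);
    let f1 = if precision + recall == 0.0 {
        0.0
    } else {
        2.0 * precision * recall / (precision + recall)
    };
    Prf { precision, recall, f1 }
}

/// True if any metric dropped by more than `threshold` (absolute difference).
fn has_regression(current: Prf, baseline: Prf, threshold: f64) -> bool {
    current.precision - baseline.precision < -threshold
        || current.recall - baseline.recall < -threshold
        || current.f1 - baseline.f1 < -threshold
}
```

With a baseline of P=0.85, R=0.78, F1=0.81 and a current run of P=0.87, R=0.76, F1=0.81, the recall drop of 0.02 stays within a 0.05 threshold, so this check reports no regression. A stricter threshold of 0.01 would flag it.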
#### Report Format
```markdown
# Prompt Evaluation Report
**Run ID:** abc123
**Date:** 2026-02-05 14:30:00 UTC
**Prompt Version:** 1.2.0
**Model:** gemini-2.0-flash
## Summary
| Metric | Current | Baseline | Delta | Status |
| ------------- | ------- | -------- | ----- | ------ |
| Precision | 0.87 | 0.85 | +0.02 | |
| Recall | 0.76 | 0.78 | -0.02 | ⚠️ |
| F1 Score | 0.81 | 0.81 | +0.00 | |
| Parse Success | 98% | 97% | +1% | |
**Verdict:** ⚠️ REVIEW - Recall dropped by 2%
## Cost Analysis
- Total tokens: 125,430
- Estimated cost: $0.12
- Cost per fixture: $0.002
## Regressions
### jwt-003: JWT algorithm none detection
- **Expected:** `jwt/algorithm = "none"` with confidence > 0.8
- **Got:** Not extracted
- **Impact:** High (security-critical)
## Improvements
### tls-007: TLS version in constants
- **Previously:** Not extracted
- **Now:** `tls/min_version = "1.0"` with confidence 0.85
- **Impact:** Medium
## Category Breakdown
| Category | Fixtures | Passed | Failed | Precision | Recall |
| -------- | -------- | ------ | ------ | --------- | ------ |
| tls | 12 | 11 | 1 | 0.92 | 0.91 |
| jwt | 8 | 6 | 2 | 0.75 | 0.75 |
| secrets | 15 | 14 | 1 | 0.93 | 0.87 |
| auth | 6 | 6 | 0 | 1.00 | 0.83 |
| negative | 10 | 10 | 0 | 1.00 | N/A |
```
---
### 6. Jobs & Automation
#### CI Job (Per-PR)
```yaml
# .github/workflows/prompt-eval-smoke.yml
name: Prompt Evaluation (Smoke)

on:
  pull_request:
    paths:
      - "applications/aphoria/src/llm/**"
      - "applications/aphoria/tests/llm_fixtures/**"

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Run smoke test
        env:
          GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}
        run: |
          cargo run -p aphoria -- eval prompts \
            --mode cached \
            --max-fixtures 20 \
            --categories tls,jwt,secrets \
            --baseline tests/llm_fixtures/baselines/latest.json \
            --fail-on-regression

      - name: Upload report
        uses: actions/upload-artifact@v4
        with:
          name: eval-report
          path: eval-report.md
```
**Characteristics:**
- Runs on PR that touches prompt code or fixtures
- Uses cached responses (fast, deterministic)
- Limited to 20 fixtures (smoke test)
- Fails if regression detected
#### Nightly Job (Full Evaluation)
```yaml
# .github/workflows/prompt-eval-nightly.yml
name: Prompt Evaluation (Full)

on:
  schedule:
    - cron: "0 3 * * *" # 3am UTC daily
  workflow_dispatch:

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Run full evaluation
        env:
          GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}
        # --fail-on-regression makes this step exit non-zero on regression,
        # which is what triggers the Slack step below.
        run: |
          cargo run -p aphoria -- eval prompts \
            --mode live \
            --model gemini-2.0-flash \
            --temperature 0 \
            --baseline tests/llm_fixtures/baselines/latest.json \
            --output-dir ./eval-results \
            --save-observations \
            --fail-on-regression

      - name: Update baseline if improved
        run: |
          # If F1 improved by > 2%, update baseline
          ./scripts/maybe-update-baseline.sh

      - name: Upload results
        if: always() # keep results even when the evaluation step fails
        uses: actions/upload-artifact@v4
        with:
          name: eval-results
          path: eval-results/

      - name: Post to Slack
        if: failure()
        uses: slackapi/slack-github-action@v1
        with:
          payload: |
            {
              "text": "⚠️ Prompt evaluation regression detected",
              "attachments": [...]
            }
```
**Characteristics:**
- Runs nightly at 3am UTC
- Uses live LLM API (real evaluation)
- Full corpus coverage
- Updates baseline if metrics improve significantly
- Alerts on regression
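
The nightly job's baseline update rule ("F1 improved by > 2%") can be read as either an absolute or a relative gain. The sketch below makes both readings explicit. `Gain` and `should_update_baseline` are hypothetical names, and this is not the logic of `maybe-update-baseline.sh`, which this document does not show.

```rust
/// Two possible readings of "improved by > 2%".
enum Gain {
    /// Absolute difference in F1, e.g. 0.02 means 0.81 -> above 0.83.
    Absolute(f64),
    /// Relative to the baseline F1, e.g. 0.02 means more than a 2% increase.
    Relative(f64),
}

/// Hypothetical decision rule for replacing the stored baseline.
fn should_update_baseline(current_f1: f64, baseline_f1: f64, min_gain: Gain) -> bool {
    let delta = current_f1 - baseline_f1;
    match min_gain {
        Gain::Absolute(min) => delta > min,
        Gain::Relative(min) => baseline_f1 > 0.0 && delta / baseline_f1 > min,
    }
}
```

The two readings diverge for low baselines: moving from 0.50 to 0.515 is a 3% relative gain but only a 0.015 absolute gain, so whichever rule is chosen should be stated alongside the threshold.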
#### On-Demand Job (Prompt Iteration)
```bash
# For prompt development: compare two versions
aphoria eval prompts \
  --mode live \
  --prompt-file ./prompts/experimental.txt \
  --baseline ./baselines/current.json \
  --output-dir ./eval-comparison \
  --verbose

# View comparison
cat ./eval-comparison/comparison.md
```
---
## CLI Interface
```
USAGE:
    aphoria eval prompts [OPTIONS]

OPTIONS:
    --fixtures-dir <DIR>          Path to fixtures directory [default: tests/llm_fixtures]
    --categories <LIST>           Categories to evaluate (comma-separated)
    --max-fixtures <N>            Maximum fixtures to run
    --mode <MODE>                 Evaluation mode: live, cached, mock [default: cached]
    --model <MODEL>               Model to use (live mode only) [default: gemini-2.0-flash]
    --temperature <TEMP>          Temperature (live mode only) [default: 0]
    --baseline <FILE>             Baseline to compare against
    --output-dir <DIR>            Output directory for results [default: ./eval-results]
    --save-observations           Save observation logs
    --fail-on-regression          Exit with code 1 if regression detected
    --regression-threshold <N>    Threshold for regression (default: 0.05 = 5%)
    --verbose                     Verbose output
    --json                        Output results as JSON

SUBCOMMANDS:
    aphoria eval prompts show-baseline        Show current baseline metrics
    aphoria eval prompts update-baseline      Update baseline from latest run
    aphoria eval prompts list-fixtures        List available fixtures
    aphoria eval prompts add-fixture          Add a new fixture interactively
    aphoria eval prompts validate-fixtures    Validate fixture format
```
---
## Implementation Plan
### Phase 7.8.1: Core Infrastructure (Week 1)
| Task | Description | Effort |
| ----------------- | --------------------------------- | ------ |
| Fixture format | Define TOML schema, parser | 2d |
| Observation log | Schema, writer, reader | 1d |
| Claim matcher | Matching logic with fuzzy support | 2d |
| Prompt versioning | Version extraction, tracking | 1d |

**Deliverable:** Can load fixtures, run extractions, compare outputs
### Phase 7.8.2: Evaluation Harness (Week 2)
| Task | Description | Effort |
| ------------------- | --------------------------- | ------ |
| Evaluation harness | Orchestration, parallelism | 2d |
| Metrics computation | Precision, recall, F1, cost | 1d |
| Baseline comparison | Regression detection | 1d |
| Report generation | Markdown, JSON output | 1d |

**Deliverable:** Can run full evaluation and generate report
### Phase 7.8.3: Golden Corpus (Week 2-3)
| Task | Description | Effort |
| ---------------------- | -------------------------------- | ------ |
| Seed fixtures (20) | Hand-curated test cases | 2d |
| Negative fixtures (10) | Safe code that shouldn't trigger | 1d |
| Edge case fixtures (5) | Boundary conditions | 1d |
| Baseline establishment | Initial metrics snapshot | 1d |

**Deliverable:** 35+ fixtures covering core categories
### Phase 7.8.4: CI Integration (Week 3)
| Task | Description | Effort |
| -------------------- | -------------------------------- | ------ |
| Smoke test workflow | Per-PR cached evaluation | 1d |
| Nightly workflow | Full live evaluation | 1d |
| Baseline auto-update | Script for improvement detection | 1d |
| Alerting | Slack/email on regression | 0.5d |

**Deliverable:** Automated evaluation in CI
### Phase 7.8.5: CLI & Documentation (Week 4)
| Task | Description | Effort |
| ------------- | ------------------------------ | ------ |
| CLI commands | `eval prompts` subcommands | 2d |
| Documentation | Usage guide, fixture authoring | 1d |
| Skill update | `/aphoria-dev` skill update | 0.5d |

**Deliverable:** Production-ready tooling

---
## Open Questions
### 1. Where do we store baseline metrics?
**Options:**
- **In repository** (`tests/llm_fixtures/baselines/`) - Simple, versioned with code
- **External artifact store** - Separates metrics from code
- **Database** - For historical tracking

**Recommendation:** Start with repository, migrate to external store when history needed.
### 2. How strict should matching be?
**Options:**
- **Exact match** - Same subject, predicate, value (brittle)
- **Structural match** - Same concept, fuzzy value (looser)
- **Semantic match** - Embeddings-based similarity (complex)

**Recommendation:** Structural match with configurable fuzzy value matching.
### 3. Mock vs Live in CI?
**Options:**
- **Always mock** - Fast, free, deterministic, tests harness not prompt
- **Always live** - Expensive, slow, tests actual prompt
- **Hybrid** - Mock for smoke, live for nightly

**Recommendation:** Hybrid approach. Cached (deterministic) for PR, live for nightly.
### 4. How do we handle model version changes?
Gemini may update models, causing output drift even without prompt changes.
**Options:**
- Pin model version (if API supports)
- Track model version in baseline, re-baseline on model change
- Alert when model version changes

**Recommendation:** Track model version, require manual baseline update on change.
### 5. What's the corpus growth strategy?
**Options:**
- Hand-curate only (high quality, slow growth)
- Production capture with review (faster growth, needs tooling)
- Synthetic generation (fast, may not reflect reality)

**Recommendation:** Start hand-curated, add production capture tooling in Phase 9.

---
## Success Metrics
| Metric | Target | Measurement |
| --------------------------------------- | ------------- | ------------------------------------- |
| Regression detection rate | 100% | Simulated regressions caught |
| False positive rate (regression alerts) | < 5% | Manual review of alerts |
| Prompt iteration cycle time | < 30 min | Time from change to evaluation result |
| Corpus coverage | > 50 fixtures | Fixture count |
| CI job duration (smoke) | < 2 min | Workflow timing |
| CI job duration (nightly) | < 15 min | Workflow timing |
---
## Related Documents
- [LLM-in-the-Loop Extraction](../../roadmap.md#phase-75-llm-in-the-loop-extraction) - Phase 7.5 implementation
- [Pattern Learning Store](../../roadmap.md#phase-76-pattern-learning-store) - Phase 7.6 implementation
- [LLM Extractor Code](../../src/llm/) - Current implementation
---
## Appendix: Example Fixture
```toml
# fixtures/secrets/high_entropy_api_key.toml
[metadata]
id = "secrets-005"
name = "High entropy API key in Python config"
category = "secrets"
subcategory = "api_keys"
language = "python"
difficulty = "medium"
source = "hand-curated"
created = "2026-02-05"
updated = "2026-02-05"
notes = """
Tests detection of high-entropy strings that look like API keys.
The entropy of 'sk_live_abc123...' is > 4.0 which should trigger detection.
"""
[input]
filename = "config.py"
content = '''
import os
# Configuration for payment processing
STRIPE_API_KEY = "sk_live_51ABC123DEF456GHI789JKL012MNO345PQR678STU901VWX234YZ"
STRIPE_WEBHOOK_SECRET = os.environ.get("STRIPE_WEBHOOK_SECRET")
DATABASE_URL = os.environ.get("DATABASE_URL", "postgresql://localhost/app")
DEBUG = os.environ.get("DEBUG", "false").lower() == "true"
'''
[expected]
must_contain = [
{ subject = "secrets/api_key", predicate = "hardcoded", value = true, min_confidence = 0.8 },
]
must_not_contain = [
# STRIPE_WEBHOOK_SECRET uses env var, should NOT be flagged
{ subject = "secrets/webhook_secret", predicate = "hardcoded", value = true },
# DATABASE_URL uses env var with fallback, should NOT be flagged as hardcoded
{ subject = "secrets/database_url", predicate = "hardcoded", value = true },
]
acceptable_variants = [
{ subject = "stripe/api_key", predicate = "exposed", value = true },
{ subject = "payment/secret", predicate = "hardcoded", value = true },
]
[scoring]
weight = 1.5 # Security-critical, weighted higher
min_confidence = 0.8
```
---
_Last updated: 2026-02-05_