# LLM Prompt Evaluation System

> **Status:** Proposed (2026-02-05)
> **Phase:** 7.8 (extends Phase 7.5 LLM-in-the-Loop Extraction)
> **Author:** Architecture Team

---

## Problem Statement

Aphoria's LLM-powered claim extraction (Phase 7.5) uses Gemini to extract security claims from high-value code files. The prompts that drive this extraction are effectively **code that we don't treat like code**:

| Aspect               | Traditional Code       | Current Prompts                      |
| -------------------- | ---------------------- | ------------------------------------ |
| Version Control      | Git commits            | In files, but no semantic versioning |
| Testing              | Unit/integration tests | None                                 |
| Metrics              | Coverage, performance  | None                                 |
| Regression Detection | CI failures            | None                                 |
| Quality Gates        | Linting, review        | None                                 |

**The result:** When we change a prompt, we have no systematic way to know if we made things better or worse. We're flying blind.

### Enterprise Requirements

For enterprise adoption, customers need assurance that:

1. **Prompts produce consistent, high-quality results** - Not random outputs
2. **Changes are validated before deployment** - Regressions are caught
3. **Performance is measurable** - Precision, recall, cost are tracked
4. **The system improves over time** - With evidence, not hope

---

## Goals

### Primary Goals

1. **Observability** - Understand prompt effectiveness through metrics and logging
2. **Testability** - Validate prompts against known scenarios with expected outcomes
3. **Repeatability** - Run evaluations consistently across environments
4. **Automation** - Scheduled jobs that detect regressions without human intervention

### Non-Goals (Phase 7.8)

- Real-time prompt optimization (future: Phase 9)
- A/B testing in production (future: Phase 9)
- Multi-model comparison (future)
- Prompt compression/optimization (future)

---

## Architecture Overview

```
┌──────────────────────────────────────────────────────────────────────────────┐
│                         LLM Prompt Evaluation System                         │
├──────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ┌─────────────────┐                                                         │
│  │  Golden Corpus  │  Test fixtures with expected outcomes                   │
│  │  (fixtures/)    │  - Code snippets                                        │
│  │                 │  - Expected claims (must-contain, must-not-contain)     │
│  └────────┬────────┘  - Metadata (language, category, difficulty)            │
│           │                                                                  │
│           ▼                                                                  │
│  ┌─────────────────┐                                                         │
│  │  Evaluation     │  Orchestrates test runs                                 │
│  │  Harness        │  - Loads fixtures                                       │
│  │                 │  - Invokes LLM Extractor                                │
│  │                 │  - Compares outputs                                     │
│  │                 │  - Computes metrics                                     │
│  └────────┬────────┘                                                         │
│           │                                                                  │
│           ├──────────────────────┐                                           │
│           │                      │                                           │
│           ▼                      ▼                                           │
│  ┌─────────────────┐    ┌─────────────────┐                                  │
│  │  LLM Extractor  │    │  Observation    │                                  │
│  │  (instrumented) │    │  Log            │                                  │
│  │                 │    │                 │                                  │
│  │  - Same code    │    │  - Prompt ver   │                                  │
│  │    path as      │    │  - Input hash   │                                  │
│  │    production   │    │  - Output       │                                  │
│  │                 │    │  - Latency      │                                  │
│  │                 │    │  - Tokens       │                                  │
│  └─────────────────┘    │  - Model        │                                  │
│                         │  - Timestamp    │                                  │
│                         └────────┬────────┘                                  │
│                                  │                                           │
│                                  ▼                                           │
│ ┌──────────────────────────────────────────────────────────────────────────┐ │
│ │                            Metrics & Reports                             │ │
│ │                                                                          │ │
│ │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐      │ │
│ │  │  Precision  │  │   Recall    │  │  F1 Score   │  │    Cost     │      │ │
│ │  │ TP/(TP+FP)  │  │ TP/(TP+FN)  │  │  Harmonic   │  │  Tokens/$   │      │ │
│ │  └─────────────┘  └─────────────┘  └─────────────┘  └─────────────┘      │ │
│ │                                                                          │ │
│ │  ┌────────────────────────────────────────────────────────────────────┐  │ │
│ │  │                         Regression Report                          │  │ │
│ │  │  - Comparison against baseline                                     │  │ │
│ │  │  - Per-fixture deltas                                              │  │ │
│ │  │  - Category breakdown                                              │  │ │
│ │  │  - Recommendations                                                 │  │ │
│ │  └────────────────────────────────────────────────────────────────────┘  │ │
│ └──────────────────────────────────────────────────────────────────────────┘ │
│                                                                              │
└──────────────────────────────────────────────────────────────────────────────┘
```

---

## Core Components

### 1. Golden Corpus

A curated set of test fixtures with known expected outcomes.

#### Fixture Format

```toml
# fixtures/tls/disabled_verification.toml
[metadata]
id = "tls-001"
name = "TLS verification disabled in Python requests"
category = "tls"
subcategory = "certificate_verification"
language = "python"
difficulty = "easy"           # easy | medium | hard
source = "hand-curated"       # hand-curated | production-capture | synthetic
created = "2026-02-05"
updated = "2026-02-05"

[input]
filename = "client.py"
content = '''
import requests

def fetch_data(url):
    # Disable SSL verification for internal services
    response = requests.get(url, verify=False)
    return response.json()
'''

[expected]
# Claims that MUST be extracted (recall)
must_contain = [
    { subject = "tls/cert_verification", predicate = "enabled", value = false },
]

# Claims that MUST NOT be extracted (precision)
must_not_contain = [
    { subject = "tls/cert_verification", predicate = "enabled", value = true },
]

# Optional: acceptable alternate formulations
acceptable_variants = [
    { subject = "ssl/verify", predicate = "enabled", value = false },
    { subject = "requests/ssl_verify", predicate = "value", value = false },
]

[scoring]
# How to score this fixture
weight = 1.0                  # Importance multiplier
min_confidence = 0.7          # Expected minimum confidence
```

#### Corpus Organization

```
applications/aphoria/tests/llm_fixtures/
├── README.md                    # Corpus documentation
├── manifest.toml                # Index of all fixtures
├── tls/
│   ├── disabled_verification.toml
│   ├── deprecated_version.toml
│   └── pinning_bypass.toml
├── jwt/
│   ├── alg_none.toml
│   ├── skip_signature.toml
│   └── hardcoded_secret.toml
├── secrets/
│   ├── api_key_in_code.toml
│   ├── password_hardcoded.toml
│   └── high_entropy_token.toml
├── auth/
│   ├── bypass_pattern.toml
│   └── debug_header.toml
├── negative/                    # Files that should NOT trigger claims
│   ├── safe_tls_config.toml
│   ├── proper_jwt_validation.toml
│   └── env_var_secrets.toml
└── edge_cases/
    ├── empty_file.toml
    ├── binary_content.toml
    ├── huge_file.toml
    └── mixed_languages.toml
```

#### Manifest Structure

```toml
# manifest.toml
[corpus]
version = "1.0.0"
created = "2026-02-05"
description = "Golden corpus for LLM extraction evaluation"

[categories]
tls = { fixtures = 12, description = "TLS/SSL configuration" }
jwt = { fixtures = 8, description = "JWT authentication" }
secrets = { fixtures = 15, description = "Hardcoded secrets" }
auth = { fixtures = 6, description = "Authentication bypass" }
negative = { fixtures = 10, description = "Safe code (no claims expected)" }
edge_cases = { fixtures = 5, description = "Boundary conditions" }

[baseline]
# Current known-good metrics
precision = 0.85
recall = 0.78
f1 = 0.81
total_fixtures = 56
last_updated = "2026-02-05"
prompt_version = "1.0.0"
model = "gemini-2.0-flash"
```

---

### 2. Observation Log

Every LLM extraction is logged with full context for replay and analysis.
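
The "replay" half of that sentence can be illustrated with a small sketch. It assumes a hypothetical `ReplayLog` keyed by prompt version, model, and input hash, so that a previously logged response can be served again without calling the LLM. All names here are illustrative rather than taken from the Aphoria codebase, and the standard library's `DefaultHasher` is only a dependency-free stand-in for the BLAKE3 content hash the design calls for.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Identifies one logged extraction (hypothetical shape, not the real schema).
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct ReplayKey {
    prompt_version: String,
    model: String,
    input_hash: String,
}

/// Stand-in for the BLAKE3 content hash: NOT cryptographic, and the exact
/// digest is not guaranteed to stay stable across Rust releases.
fn content_hash(content: &str) -> String {
    let mut hasher = DefaultHasher::new();
    content.hash(&mut hasher);
    format!("{:016x}", hasher.finish())
}

/// In-memory log mapping (prompt version, model, input) to the raw response.
#[derive(Default)]
struct ReplayLog {
    entries: HashMap<ReplayKey, String>,
}

impl ReplayLog {
    fn key(prompt_version: &str, model: &str, content: &str) -> ReplayKey {
        ReplayKey {
            prompt_version: prompt_version.to_string(),
            model: model.to_string(),
            input_hash: content_hash(content),
        }
    }

    /// Store the raw LLM response for this exact prompt/model/input triple.
    fn record(&mut self, prompt_version: &str, model: &str, content: &str, raw_response: &str) {
        self.entries
            .insert(Self::key(prompt_version, model, content), raw_response.to_string());
    }

    /// Return the logged response, if this exact triple was seen before.
    fn replay(&self, prompt_version: &str, model: &str, content: &str) -> Option<&str> {
        self.entries
            .get(&Self::key(prompt_version, model, content))
            .map(String::as_str)
    }
}

fn main() {
    let mut log = ReplayLog::default();
    let code = "requests.get(url, verify=False)";
    log.record("1.2.0", "gemini-2.0-flash", code, "[]");

    // Same prompt version, model, and input: the logged response is replayed.
    assert_eq!(log.replay("1.2.0", "gemini-2.0-flash", code), Some("[]"));
    // A different prompt version is a different key, so nothing is replayed.
    assert_eq!(log.replay("1.3.0", "gemini-2.0-flash", code), None);
    println!("replay lookups behaved as expected");
}
```

Including the prompt version in the key means a prompt change never silently reuses an old response, which is the property a deterministic replay path would need.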
#### Log Entry Schema

```rust
use chrono::{DateTime, Utc};
use serde::{Deserialize, Serialize};
use uuid::Uuid;

/// A single observation from an LLM extraction
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ExtractionObservation {
    /// Unique identifier for this observation
    pub id: Uuid,

    /// When this extraction occurred
    pub timestamp: DateTime<Utc>,

    /// Prompt version (semantic version)
    pub prompt_version: String,

    /// Model identifier (e.g., "gemini-2.0-flash")
    pub model: String,

    /// BLAKE3 hash of input content (for deduplication)
    pub input_hash: String,

    /// Input metadata
    pub input: ExtractionInput,

    /// Output from LLM
    pub output: ExtractionOutput,

    /// Performance metrics
    pub metrics: ExtractionMetrics,

    /// Evaluation context (if run during evaluation)
    pub evaluation: Option<EvaluationContext>,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ExtractionInput {
    /// Filename (may be anonymized)
    pub filename: String,

    /// Language detected
    pub language: String,

    /// Content length in bytes
    pub content_length: usize,

    /// Content preview (first 500 chars, for debugging)
    pub content_preview: Option<String>,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ExtractionOutput {
    /// Raw LLM response
    pub raw_response: String,

    /// Parsed claims (may be empty if parsing failed)
    pub claims: Vec<Observation>,

    /// Whether parsing succeeded
    pub parse_success: bool,

    /// Parse error if any
    pub parse_error: Option<String>,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ExtractionMetrics {
    /// Total latency (API call + processing)
    pub latency_ms: u64,

    /// API latency only
    pub api_latency_ms: u64,

    /// Input tokens
    pub input_tokens: u32,

    /// Output tokens
    pub output_tokens: u32,

    /// Total tokens
    pub total_tokens: u32,

    /// Estimated cost (USD)
    pub estimated_cost_usd: f64,

    /// Cache hit (if response was cached)
    pub cache_hit: bool,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct EvaluationContext {
    /// Fixture ID if from golden corpus
    pub fixture_id: Option<String>,

    /// Evaluation run ID
    pub run_id: Uuid,

    /// Whether this matched expected output
    pub matched_expected: bool,

    /// Detailed match results
    pub match_details: MatchDetails,
}
```

#### Log Storage

```
~/.aphoria/eval/
├── observations/
│   ├── 2026-02-05/
│   │   ├── 001_tls-001_success.json
│   │   ├── 002_jwt-003_partial.json
│   │   └── ...
│   └── 2026-02-04/
│       └── ...
├── runs/
│   ├── run_abc123.json          # Full evaluation run metadata
│   └── run_def456.json
└── baselines/
    ├── v1.0.0.json              # Baseline for prompt v1.0.0
    └── latest.json              # Symlink to current baseline
```

---

### 3. Evaluation Harness

The core engine that runs evaluations.

#### Public API

```rust
use std::collections::HashMap;
use std::path::PathBuf;

use chrono::{DateTime, Utc};
use serde::{Deserialize, Serialize};
use uuid::Uuid;

/// Configuration for an evaluation run
#[derive(Debug, Clone)]
pub struct EvalConfig {
    /// Path to fixtures directory
    pub fixtures_dir: PathBuf,

    /// Which categories to evaluate (None = all)
    pub categories: Option<Vec<String>>,

    /// Maximum fixtures to run (for quick smoke tests)
    pub max_fixtures: Option<usize>,

    /// Whether to use real LLM or mock
    pub mode: EvalMode,

    /// Baseline to compare against (path type assumed)
    pub baseline: Option<PathBuf>,

    /// Output directory for results
    pub output_dir: PathBuf,

    /// Whether to save observations
    pub save_observations: bool,

    /// Parallelism (concurrent LLM calls)
    pub parallelism: usize,
}

#[derive(Debug, Clone)]
pub enum EvalMode {
    /// Use real LLM API (costs money, tests actual prompt)
    Live {
        model: String,
        temperature: f32,
    },

    /// Use cached responses (fast, deterministic, for CI)
    Cached,

    /// Use mock responses (for testing harness itself)
    Mock,
}

/// Result of an evaluation run
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct EvalResult {
    /// Unique run identifier
    pub run_id: Uuid,

    /// When the run started
    pub started_at: DateTime<Utc>,

    /// When the run completed
    pub completed_at: DateTime<Utc>,

    /// Configuration used
    pub config: EvalConfigSummary,

    /// Aggregate metrics
    pub metrics: AggregateMetrics,

    /// Per-fixture results (element type name assumed)
    pub fixture_results: Vec<FixtureResult>,

    /// Comparison with baseline (if baseline provided)
    pub baseline_comparison: Option<BaselineComparison>,

    /// Overall verdict
    pub verdict: EvalVerdict,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct AggregateMetrics {
    /// Precision: TP / (TP + FP)
    pub precision: f64,

    /// Recall: TP / (TP + FN)
    pub recall: f64,

    /// F1 score: 2 * (P * R) / (P + R)
    pub f1: f64,

    /// Total fixtures evaluated
    pub total_fixtures: usize,

    /// Fixtures that passed
    pub passed: usize,

    /// Fixtures that failed
    pub failed: usize,

    /// Fixtures that errored (LLM call failed, parse failed, etc.)
    pub errored: usize,

    /// Total cost (USD)
    pub total_cost_usd: f64,

    /// Total tokens used
    pub total_tokens: u64,

    /// Average latency (ms)
    pub avg_latency_ms: f64,

    /// Per-category breakdown (value type name assumed)
    pub by_category: HashMap<String, CategoryMetrics>,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub enum EvalVerdict {
    /// All checks passed
    Pass,

    /// Some regressions detected (element type assumed)
    Regression { details: Vec<String> },

    /// Evaluation failed (errors prevented completion)
    Error { message: String },
}
```

#### Claim Matching Logic

```rust
use std::collections::HashMap;

/// How to match extracted claims against expected claims
pub struct ClaimMatcher {
    /// Tolerance for confidence comparison
    pub confidence_tolerance: f32,

    /// Whether to normalize concept paths before comparison
    pub normalize_paths: bool,

    /// Predicate aliases (e.g., "enabled" == "active" == "on")
    pub predicate_aliases: HashMap<String, Vec<String>>,

    /// Value equivalences (e.g., true == "true" == "yes" == 1)
    /// (element type assumed; groups mix booleans, strings, and numbers)
    pub value_equivalences: Vec<Vec<serde_json::Value>>,
}

impl ClaimMatcher {
    /// Check if extracted claims satisfy must_contain requirements
    pub fn check_must_contain(
        &self,
        extracted: &[Observation],
        expected: &[ExpectedClaim],
    ) -> MatchResult {
        // For each expected claim:
        // 1. Find matching extracted claim (subject + predicate match)
        // 2. Check value compatibility
        // 3. Check confidence threshold
        // Return: matched, unmatched, partial matches
        todo!("design sketch: not yet implemented")
    }

    /// Check if extracted claims violate must_not_contain requirements
    pub fn check_must_not_contain(
        &self,
        extracted: &[Observation],
        forbidden: &[ExpectedClaim],
    ) -> MatchResult {
        // For each forbidden claim:
        // 1. Check if any extracted claim matches
        // 2. Flag violations
        todo!("design sketch: not yet implemented")
    }
}
```

---

### 4. Prompt Versioning

Prompts are versioned to track changes and correlate with metrics.

#### Version Schema

```rust
use chrono::{DateTime, Utc};
use serde::{Deserialize, Serialize};

/// Prompt version identifier
#[derive(Debug, Clone, Serialize, Deserialize, PartialEq, Eq, Hash)]
pub struct PromptVersion {
    /// Semantic version (major.minor.patch)
    pub version: String,

    /// BLAKE3 hash of prompt content
    pub content_hash: String,

    /// When this version was created
    pub created_at: DateTime<Utc>,

    /// Description of changes from previous version
    pub changelog: Option<String>,
}

impl PromptVersion {
    /// Compute version from prompt content
    pub fn from_prompt(prompt: &str, changelog: Option<String>) -> Self {
        let content_hash = blake3::hash(prompt.as_bytes()).to_hex().to_string();

        // Version is computed or provided externally
        Self {
            version: "0.0.0".to_string(), // Placeholder
            content_hash,
            created_at: Utc::now(),
            changelog,
        }
    }
}
```

#### Prompt File Structure

````rust
// llm/prompt.rs

/// Current prompt version
pub const PROMPT_VERSION: &str = "1.2.0";

/// Changelog for current version
pub const PROMPT_CHANGELOG: &str = "Improved JWT claim extraction accuracy";

/// The extraction prompt
pub const EXTRACTION_PROMPT: &str = r#"
You are a security analyst extracting implicit security claims from code.

Given the following code file, identify any security-relevant configurations,
settings, or patterns. For each finding, output a JSON object with:

- subject: The concept path (e.g., "tls/cert_verification")
- predicate: The aspect being claimed (e.g., "enabled")
- value: The value found (boolean, string, or number)
- confidence: Your confidence in this extraction (0.0 to 1.0)
- description: Brief explanation

Focus on:
- TLS/SSL configuration
- Authentication settings
- Cryptographic choices
- Secret/credential handling
- Input validation
- Authorization patterns

Output as a JSON array. If no security claims are found, output an empty array.

Code:
```{language}
{content}
```
"#;
````

---

### 5. Metrics & Reporting

#### Metrics Computed

| Metric | Formula | Purpose |
|--------|---------|---------|
| **Precision** | TP / (TP + FP) | Are we avoiding false positives? |
| **Recall** | TP / (TP + FN) | Are we finding all issues? |
| **F1 Score** | 2 * (P * R) / (P + R) | Balanced measure |
| **Confidence Calibration** | Correlation(confidence, correctness) | Are high-confidence claims actually correct? |
| **Parse Success Rate** | Successful parses / Total extractions | Is the prompt producing valid JSON? |
| **Cost per Fixture** | Total cost / Fixtures | Budget tracking |
| **Latency P50/P95** | Percentiles | Performance tracking |

#### Regression Detection

```rust
/// Compare current metrics against baseline
pub struct BaselineComparison {
    /// Current metrics
    pub current: AggregateMetrics,

    /// Baseline metrics
    pub baseline: AggregateMetrics,

    /// Deltas (current minus baseline)
    pub precision_delta: f64,
    pub recall_delta: f64,
    pub f1_delta: f64,

    /// Regression threshold, compared against the absolute delta
    pub regression_threshold: f64, // e.g., 0.05 = 5% drop

    /// Fixtures that regressed (fixture IDs; element type assumed)
    pub regressed_fixtures: Vec<String>,

    /// Fixtures that improved (fixture IDs; element type assumed)
    pub improved_fixtures: Vec<String>,
}

impl BaselineComparison {
    pub fn has_regression(&self) -> bool {
        self.precision_delta < -self.regression_threshold
            || self.recall_delta < -self.regression_threshold
            || self.f1_delta < -self.regression_threshold
    }
}
```

#### Report Format

```markdown
# Prompt Evaluation Report

**Run ID:** abc123
**Date:** 2026-02-05 14:30:00 UTC
**Prompt Version:** 1.2.0
**Model:** gemini-2.0-flash

## Summary

| Metric        | Current | Baseline | Delta | Status |
| ------------- | ------- | -------- | ----- | ------ |
| Precision     | 0.87    | 0.85     | +0.02 | ✅     |
| Recall        | 0.76    | 0.78     | -0.02 | ⚠️     |
| F1 Score      | 0.81    | 0.81     | +0.00 | ✅     |
| Parse Success | 98%     | 97%      | +1%   | ✅     |

**Verdict:** ⚠️ REVIEW - Recall dropped by 2%

## Cost Analysis

- Total tokens: 125,430
- Estimated cost: $0.12
- Cost per fixture: $0.002

## Regressions

### jwt-003: JWT algorithm none detection

- **Expected:** `jwt/algorithm = "none"` with confidence > 0.8
- **Got:** Not extracted
- **Impact:** High (security-critical)

## Improvements

### tls-007: TLS version in constants

- **Previously:** Not extracted
- **Now:** `tls/min_version = "1.0"` with confidence 0.85
- **Impact:** Medium

## Category Breakdown

| Category | Fixtures | Passed | Failed | Precision | Recall |
| -------- | -------- | ------ | ------ | --------- | ------ |
| tls      | 12       | 11     | 1      | 0.92      | 0.91   |
| jwt      | 8        | 6      | 2      | 0.75      | 0.75   |
| secrets  | 15       | 14     | 1      | 0.93      | 0.87   |
| auth     | 6        | 6      | 0      | 1.00      | 0.83   |
| negative | 10       | 10     | 0      | 1.00      | N/A    |
```

---

### 6. Jobs & Automation

#### CI Job (Per-PR)

```yaml
# .github/workflows/prompt-eval-smoke.yml
name: Prompt Evaluation (Smoke)

on:
  pull_request:
    paths:
      - "applications/aphoria/src/llm/**"
      - "applications/aphoria/tests/llm_fixtures/**"

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Run smoke test
        env:
          GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}
        run: |
          cargo run -p aphoria -- eval prompts \
            --mode cached \
            --max-fixtures 20 \
            --categories tls,jwt,secrets \
            --baseline tests/llm_fixtures/baselines/latest.json \
            --fail-on-regression

      - name: Upload report
        uses: actions/upload-artifact@v4
        with:
          name: eval-report
          path: eval-report.md
```

**Characteristics:**

- Runs on PRs that touch prompt code or fixtures
- Uses cached responses (fast, deterministic)
- Limited to 20 fixtures (smoke test)
- Fails if a regression is detected

#### Nightly Job (Full Evaluation)

```yaml
# .github/workflows/prompt-eval-nightly.yml
name: Prompt Evaluation (Full)

on:
  schedule:
    - cron: "0 3 * * *" # 3am UTC daily
  workflow_dispatch:

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Run full evaluation
        env:
          GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}
        run: |
          cargo run -p aphoria -- eval prompts \
            --mode live \
            --model gemini-2.0-flash \
            --temperature 0 \
            --baseline tests/llm_fixtures/baselines/latest.json \
            --output-dir ./eval-results \
            --save-observations \
            --fail-on-regression

      - name: Update baseline if improved
        run: |
          # If F1 improved by > 2%, update baseline
          ./scripts/maybe-update-baseline.sh

      - name: Upload results
        if: always() # keep results even when the evaluation step fails
        uses: actions/upload-artifact@v4
        with:
          name: eval-results
          path: eval-results/

      - name: Post to Slack
        if: failure()
        uses: slackapi/slack-github-action@v1
        with:
          payload: |
            {
              "text": "⚠️ Prompt evaluation regression detected",
              "attachments": [...]
            }
```

**Characteristics:**

- Runs nightly at 3am UTC
- Uses live LLM API (real evaluation)
- Full corpus coverage
- Updates baseline if metrics improve significantly
- Alerts on regression (the evaluation step must exit non-zero via `--fail-on-regression` for the `failure()` condition to fire)

#### On-Demand Job (Prompt Iteration)

```bash
# For prompt development: compare two versions
aphoria eval prompts \
  --mode live \
  --prompt-file ./prompts/experimental.txt \
  --baseline ./baselines/current.json \
  --output-dir ./eval-comparison \
  --verbose

# View comparison
cat ./eval-comparison/comparison.md
```

---

## CLI Interface

```
USAGE:
    aphoria eval prompts [OPTIONS]

OPTIONS:
    --fixtures-dir            Path to fixtures directory [default: tests/llm_fixtures]
    --categories              Categories to evaluate (comma-separated)
    --max-fixtures            Maximum fixtures to run
    --mode                    Evaluation mode: live, cached, mock [default: cached]
    --model                   Model to use (live mode only) [default: gemini-2.0-flash]
    --temperature             Temperature (live mode only) [default: 0]
    --baseline                Baseline to compare against
    --output-dir              Output directory for results [default: ./eval-results]
    --save-observations       Save observation logs
    --fail-on-regression      Exit with code 1 if regression detected
    --regression-threshold    Threshold for regression (default: 0.05 = 5%)
    --verbose                 Verbose output
    --json                    Output results as JSON

SUBCOMMANDS:
    aphoria eval prompts show-baseline        Show current baseline metrics
    aphoria eval prompts update-baseline      Update baseline from latest run
    aphoria eval prompts list-fixtures        List available fixtures
    aphoria eval prompts add-fixture          Add a new fixture interactively
    aphoria eval prompts validate-fixtures    Validate fixture format
```

---

## Implementation Plan

### Phase 7.8.1: Core Infrastructure (Week 1)

| Task              | Description                       | Effort |
| ----------------- | --------------------------------- | ------ |
| Fixture format    | Define TOML schema, parser        | 2d     |
| Observation log   | Schema, writer, reader            | 1d     |
| Claim matcher     | Matching logic with fuzzy support | 2d     |
| Prompt versioning | Version extraction, tracking      | 1d     |

**Deliverable:** Can load fixtures, run extractions, compare outputs

### Phase 7.8.2: Evaluation Harness (Week 2)

| Task                | Description                 | Effort |
| ------------------- | --------------------------- | ------ |
| Evaluation harness  | Orchestration, parallelism  | 2d     |
| Metrics computation | Precision, recall, F1, cost | 1d     |
| Baseline comparison | Regression detection        | 1d     |
| Report generation   | Markdown, JSON output       | 1d     |

**Deliverable:** Can run full evaluation and generate report

### Phase 7.8.3: Golden Corpus (Week 2-3)

| Task                   | Description                      | Effort |
| ---------------------- | -------------------------------- | ------ |
| Seed fixtures (20)     | Hand-curated test cases          | 2d     |
| Negative fixtures (10) | Safe code that shouldn't trigger | 1d     |
| Edge case fixtures (5) | Boundary conditions              | 1d     |
| Baseline establishment | Initial metrics snapshot         | 1d     |

**Deliverable:** 35+ fixtures covering core categories

### Phase 7.8.4: CI Integration (Week 3)

| Task                 | Description                      | Effort |
| -------------------- | -------------------------------- | ------ |
| Smoke test workflow  | Per-PR cached evaluation         | 1d     |
| Nightly workflow     | Full live evaluation             | 1d     |
| Baseline auto-update | Script for improvement detection | 1d     |
| Alerting             | Slack/email on regression        | 0.5d   |

**Deliverable:** Automated evaluation in CI

### Phase 7.8.5: CLI & Documentation (Week 4)

| Task          | Description                    | Effort |
| ------------- | ------------------------------ | ------ |
| CLI commands  | `eval prompts` subcommands     | 2d     |
| Documentation | Usage guide, fixture authoring | 1d     |
| Skill update  | `/aphoria-dev` skill update    | 0.5d   |

**Deliverable:** Production-ready tooling

---

## Open Questions

### 1. Where do we store baseline metrics?

**Options:**

- **In repository** (`tests/llm_fixtures/baselines/`) - Simple, versioned with code
- **External artifact store** - Separates metrics from code
- **Database** - For historical tracking

**Recommendation:** Start with the repository; migrate to an external store when history is needed.

### 2. How strict should matching be?

**Options:**

- **Exact match** - Same subject, predicate, value (brittle)
- **Structural match** - Same concept, fuzzy value (looser)
- **Semantic match** - Embeddings-based similarity (complex)

**Recommendation:** Structural match with configurable fuzzy value matching.

### 3. Mock vs Live in CI?

**Options:**

- **Always mock** - Fast, free, deterministic, tests harness not prompt
- **Always live** - Expensive, slow, tests actual prompt
- **Hybrid** - Cached for PR smoke tests, live for nightly

**Recommendation:** Hybrid approach. Cached (deterministic) for PR, live for nightly.

### 4. How do we handle model version changes?

Gemini may update models, causing output drift even without prompt changes.

**Options:**

- Pin model version (if API supports)
- Track model version in baseline, re-baseline on model change
- Alert when model version changes

**Recommendation:** Track model version, require manual baseline update on change.

### 5. What's the corpus growth strategy?

**Options:**

- Hand-curate only (high quality, slow growth)
- Production capture with review (faster growth, needs tooling)
- Synthetic generation (fast, may not reflect reality)

**Recommendation:** Start hand-curated, add production capture tooling in Phase 9.
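
To make open question 2 more concrete, here is a minimal sketch of what "structural match with fuzzy value matching" might look like. Everything in it is an assumption for illustration: the `Claim`, `ClaimValue`, and `StructuralMatcher` types are hypothetical, the alias map is flattened to alias-to-canonical form, and the boolean-like spellings accepted are just one possible choice, not a decided rule.

```rust
use std::collections::HashMap;

/// Claim value; the variants mirror "boolean, string, or number" from the prompt.
#[derive(Debug, Clone, PartialEq)]
enum ClaimValue {
    Bool(bool),
    Text(String),
    Number(f64),
}

/// Hypothetical minimal claim shape (subject, predicate, value).
#[derive(Debug, Clone)]
struct Claim {
    subject: String,
    predicate: String,
    value: ClaimValue,
}

/// Exact on normalized subject, alias-aware on predicate, fuzzy on value.
struct StructuralMatcher {
    /// Maps an alias to its canonical predicate, e.g. "active" -> "enabled".
    predicate_aliases: HashMap<String, String>,
}

impl StructuralMatcher {
    /// Lowercase the concept path and drop surrounding whitespace and slashes.
    fn normalize_subject(subject: &str) -> String {
        subject.trim().trim_matches('/').to_lowercase()
    }

    /// Resolve a predicate to its canonical spelling, if an alias is known.
    fn canonical_predicate(&self, predicate: &str) -> String {
        let lowered = predicate.trim().to_lowercase();
        self.predicate_aliases.get(&lowered).cloned().unwrap_or(lowered)
    }

    /// Interpret a value as a boolean where that is unambiguous.
    fn as_bool(value: &ClaimValue) -> Option<bool> {
        match value {
            ClaimValue::Bool(b) => Some(*b),
            ClaimValue::Text(t) => match t.trim().to_lowercase().as_str() {
                "true" | "yes" | "on" => Some(true),
                "false" | "no" | "off" => Some(false),
                _ => None,
            },
            ClaimValue::Number(n) if *n == 1.0 => Some(true),
            ClaimValue::Number(n) if *n == 0.0 => Some(false),
            ClaimValue::Number(_) => None,
        }
    }

    /// Boolean-like values compare as booleans; otherwise compare like with like.
    fn values_equivalent(a: &ClaimValue, b: &ClaimValue) -> bool {
        if let (Some(x), Some(y)) = (Self::as_bool(a), Self::as_bool(b)) {
            return x == y;
        }
        match (a, b) {
            (ClaimValue::Text(x), ClaimValue::Text(y)) => x.trim().eq_ignore_ascii_case(y.trim()),
            (ClaimValue::Number(x), ClaimValue::Number(y)) => x == y,
            _ => false,
        }
    }

    fn matches(&self, extracted: &Claim, expected: &Claim) -> bool {
        Self::normalize_subject(&extracted.subject) == Self::normalize_subject(&expected.subject)
            && self.canonical_predicate(&extracted.predicate)
                == self.canonical_predicate(&expected.predicate)
            && Self::values_equivalent(&extracted.value, &expected.value)
    }
}

fn main() {
    let mut aliases = HashMap::new();
    aliases.insert("active".to_string(), "enabled".to_string());
    let matcher = StructuralMatcher { predicate_aliases: aliases };

    let expected = Claim {
        subject: "tls/cert_verification".to_string(),
        predicate: "enabled".to_string(),
        value: ClaimValue::Bool(false),
    };
    // Different casing, an aliased predicate, and "no" instead of false.
    let extracted = Claim {
        subject: "TLS/cert_verification/".to_string(),
        predicate: "Active".to_string(),
        value: ClaimValue::Text("no".to_string()),
    };
    assert!(matcher.matches(&extracted, &expected));
    println!("structural match succeeded");
}
```

A matcher along these lines sits between the brittle exact match and the embeddings-based option: it tolerates surface variation in how the model phrases a claim, while a flipped value still fails to match.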
---

## Success Metrics

| Metric                                  | Target        | Measurement                           |
| --------------------------------------- | ------------- | ------------------------------------- |
| Regression detection rate               | 100%          | Simulated regressions caught          |
| False positive rate (regression alerts) | < 5%          | Manual review of alerts               |
| Prompt iteration cycle time             | < 30 min      | Time from change to evaluation result |
| Corpus coverage                         | > 50 fixtures | Fixture count                         |
| CI job duration (smoke)                 | < 2 min       | Workflow timing                       |
| CI job duration (nightly)               | < 15 min      | Workflow timing                       |

---

## Related Documents

- [LLM-in-the-Loop Extraction](../../roadmap.md#phase-75-llm-in-the-loop-extraction) - Phase 7.5 implementation
- [Pattern Learning Store](../../roadmap.md#phase-76-pattern-learning-store) - Phase 7.6 implementation
- [LLM Extractor Code](../../src/llm/) - Current implementation

---

## Appendix: Example Fixture

```toml
# fixtures/secrets/high_entropy_api_key.toml
[metadata]
id = "secrets-005"
name = "High entropy API key in Python config"
category = "secrets"
subcategory = "api_keys"
language = "python"
difficulty = "medium"
source = "hand-curated"
created = "2026-02-05"
updated = "2026-02-05"
notes = """
Tests detection of high-entropy strings that look like API keys.
The entropy of 'sk_live_abc123...' is > 4.0 which should trigger detection.
"""

[input]
filename = "config.py"
content = '''
import os

# Configuration for payment processing
STRIPE_API_KEY = "sk_live_51ABC123DEF456GHI789JKL012MNO345PQR678STU901VWX234YZ"
STRIPE_WEBHOOK_SECRET = os.environ.get("STRIPE_WEBHOOK_SECRET")

DATABASE_URL = os.environ.get("DATABASE_URL", "postgresql://localhost/app")
DEBUG = os.environ.get("DEBUG", "false").lower() == "true"
'''

[expected]
must_contain = [
    { subject = "secrets/api_key", predicate = "hardcoded", value = true, min_confidence = 0.8 },
]

must_not_contain = [
    # STRIPE_WEBHOOK_SECRET uses env var, should NOT be flagged
    { subject = "secrets/webhook_secret", predicate = "hardcoded", value = true },
    # DATABASE_URL uses env var with fallback, should NOT be flagged as hardcoded
    { subject = "secrets/database_url", predicate = "hardcoded", value = true },
]

acceptable_variants = [
    { subject = "stripe/api_key", predicate = "exposed", value = true },
    { subject = "payment/secret", predicate = "hardcoded", value = true },
]

[scoring]
weight = 1.5                  # Security-critical, weighted higher
min_confidence = 0.8
```

---

_Last updated: 2026-02-05_