# LLM Prompt Evaluation System

> **Status:** Proposed (2026-02-05)
> **Phase:** 7.8 (extends Phase 7.5 LLM-in-the-Loop Extraction)
> **Author:** Architecture Team

---

## Problem Statement

Aphoria's LLM-powered claim extraction (Phase 7.5) uses Gemini to extract security claims from high-value code files. The prompts that drive this extraction are effectively **code that we don't treat like code**:

| Aspect               | Traditional Code       | Current Prompts                      |
| -------------------- | ---------------------- | ------------------------------------ |
| Version Control      | Git commits            | In files, but no semantic versioning |
| Testing              | Unit/integration tests | None                                 |
| Metrics              | Coverage, performance  | None                                 |
| Regression Detection | CI failures            | None                                 |
| Quality Gates        | Linting, review        | None                                 |

**The result:** When we change a prompt, we have no systematic way to know if we made things better or worse. We're flying blind.

### Enterprise Requirements

For enterprise adoption, customers need assurance that:

1. **Prompts produce consistent, high-quality results** - Not random outputs
2. **Changes are validated before deployment** - Regressions are caught
3. **Performance is measurable** - Precision, recall, cost are tracked
4. **The system improves over time** - With evidence, not hope

---

## Goals

### Primary Goals

1. **Observability** - Understand prompt effectiveness through metrics and logging
2. **Testability** - Validate prompts against known scenarios with expected outcomes
3. **Repeatability** - Run evaluations consistently across environments
4. **Automation** - Scheduled jobs that detect regressions without human intervention

### Non-Goals (Phase 7.8)

- Real-time prompt optimization (future: Phase 9)
- A/B testing in production (future: Phase 9)
- Multi-model comparison (future)
- Prompt compression/optimization (future)

---

## Architecture Overview

```
┌──────────────────────────────────────────────────────────────────────────────┐
│                     LLM Prompt Evaluation System                              │
├──────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ┌─────────────────┐                                                         │
│  │  Golden Corpus  │  Test fixtures with expected outcomes                   │
│  │  (fixtures/)    │  - Code snippets                                        │
│  │                 │  - Expected claims (must-contain, must-not-contain)     │
│  └────────┬────────┘  - Metadata (language, category, difficulty)            │
│           │                                                                  │
│           ▼                                                                  │
│  ┌─────────────────┐                                                         │
│  │   Evaluation    │  Orchestrates test runs                                 │
│  │   Harness       │  - Loads fixtures                                       │
│  │                 │  - Invokes LLM Extractor                                │
│  │                 │  - Compares outputs                                     │
│  │                 │  - Computes metrics                                     │
│  └────────┬────────┘                                                         │
│           │                                                                  │
│           ├──────────────────────┐                                           │
│           │                      │                                           │
│           ▼                      ▼                                           │
│  ┌─────────────────┐    ┌─────────────────┐                                  │
│  │  LLM Extractor  │    │  Observation    │                                  │
│  │  (instrumented) │    │  Log            │                                  │
│  │                 │    │                 │                                  │
│  │  - Same code    │    │  - Prompt ver   │                                  │
│  │    path as      │    │  - Input hash   │                                  │
│  │    production   │    │  - Output       │                                  │
│  │                 │    │  - Latency      │                                  │
│  │                 │    │  - Tokens       │                                  │
│  └─────────────────┘    │  - Model        │                                  │
│                         │  - Timestamp    │                                  │
│                         └────────┬────────┘                                  │
│                                  │                                           │
│                                  ▼                                           │
│  ┌─────────────────────────────────────────────────────────────────────────┐ │
│  │                         Metrics & Reports                               │ │
│  │                                                                         │ │
│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐     │ │
│  │  │  Precision  │  │   Recall    │  │  F1 Score   │  │    Cost     │     │ │
│  │  │  TP/(TP+FP) │  │ TP/(TP+FN)  │  │  Harmonic   │  │  Tokens/$   │     │ │
│  │  └─────────────┘  └─────────────┘  └─────────────┘  └─────────────┘     │ │
│  │                                                                         │ │
│  │  ┌────────────────────────────────────────────────────────────────────┐ │ │
│  │  │  Regression Report                                                 │ │ │
│  │  │  - Comparison against baseline                                     │ │ │
│  │  │  - Per-fixture deltas                                              │ │ │
│  │  │  - Category breakdown                                              │ │ │
│  │  │  - Recommendations                                                 │ │ │
│  │  └────────────────────────────────────────────────────────────────────┘ │ │
│  └─────────────────────────────────────────────────────────────────────────┘ │
│                                                                              │
└──────────────────────────────────────────────────────────────────────────────┘
```

---

## Core Components

### 1. Golden Corpus

A curated set of test fixtures with known expected outcomes.

#### Fixture Format

```toml
# fixtures/tls/disabled_verification.toml

[metadata]
id = "tls-001"
name = "TLS verification disabled in Python requests"
category = "tls"
subcategory = "certificate_verification"
language = "python"
difficulty = "easy"  # easy | medium | hard
source = "hand-curated"  # hand-curated | production-capture | synthetic
created = "2026-02-05"
updated = "2026-02-05"

[input]
filename = "client.py"
content = '''
import requests

def fetch_data(url):
    # Disable SSL verification for internal services
    response = requests.get(url, verify=False)
    return response.json()
'''

[expected]
# Claims that MUST be extracted (recall)
must_contain = [
    { subject = "tls/cert_verification", predicate = "enabled", value = false },
]

# Claims that MUST NOT be extracted (precision)
must_not_contain = [
    { subject = "tls/cert_verification", predicate = "enabled", value = true },
]

# Optional: acceptable alternate formulations
acceptable_variants = [
    { subject = "ssl/verify", predicate = "enabled", value = false },
    { subject = "requests/ssl_verify", predicate = "value", value = false },
]

[scoring]
# How to score this fixture
weight = 1.0  # Importance multiplier
min_confidence = 0.7  # Expected minimum confidence
```
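
Before a fixture is scored it helps to reject ones that contradict themselves. The following is a minimal sketch of such a check, assuming the TOML has already been parsed into plain structs; `FixtureSpec`, `ExpectedClaim`, and `validate_fixture` are illustrative names rather than part of the existing codebase, and values are held as strings to keep the example dependency-free.

```rust
/// One expected claim, mirroring an inline table under `[expected]`.
#[derive(Debug, Clone, PartialEq)]
pub struct ExpectedClaim {
    pub subject: String,
    pub predicate: String,
    pub value: String,
}

/// The parts of a fixture needed for a sanity check (hypothetical shape).
#[derive(Debug, Clone)]
pub struct FixtureSpec {
    pub id: String,
    pub must_contain: Vec<ExpectedClaim>,
    pub must_not_contain: Vec<ExpectedClaim>,
    pub weight: f64,
    pub min_confidence: f64,
}

/// Returns every problem found, so an author can fix them in one pass.
pub fn validate_fixture(fixture: &FixtureSpec) -> Vec<String> {
    let mut problems = Vec::new();
    if fixture.id.trim().is_empty() {
        problems.push("metadata.id must not be empty".to_string());
    }
    if fixture.weight.is_nan() || fixture.weight <= 0.0 {
        problems.push(format!("scoring.weight must be positive, got {}", fixture.weight));
    }
    if !(0.0..=1.0).contains(&fixture.min_confidence) {
        problems.push(format!(
            "scoring.min_confidence must be in [0, 1], got {}",
            fixture.min_confidence
        ));
    }
    // A claim cannot be both required and forbidden.
    for claim in &fixture.must_contain {
        if fixture.must_not_contain.contains(claim) {
            problems.push(format!(
                "{} {} = {} is listed in both must_contain and must_not_contain",
                claim.subject, claim.predicate, claim.value
            ));
        }
    }
    problems
}
```

An empty result means the fixture passed these particular checks; it does not validate the TOML syntax itself.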

#### Corpus Organization

```
applications/aphoria/tests/llm_fixtures/
├── README.md                    # Corpus documentation
├── manifest.toml                # Index of all fixtures
├── tls/
│   ├── disabled_verification.toml
│   ├── deprecated_version.toml
│   └── pinning_bypass.toml
├── jwt/
│   ├── alg_none.toml
│   ├── skip_signature.toml
│   └── hardcoded_secret.toml
├── secrets/
│   ├── api_key_in_code.toml
│   ├── password_hardcoded.toml
│   └── high_entropy_token.toml
├── auth/
│   ├── bypass_pattern.toml
│   └── debug_header.toml
├── negative/                    # Files that should NOT trigger claims
│   ├── safe_tls_config.toml
│   ├── proper_jwt_validation.toml
│   └── env_var_secrets.toml
└── edge_cases/
    ├── empty_file.toml
    ├── binary_content.toml
    ├── huge_file.toml
    └── mixed_languages.toml
```

#### Manifest Structure

```toml
# manifest.toml

[corpus]
version = "1.0.0"
created = "2026-02-05"
description = "Golden corpus for LLM extraction evaluation"

[categories]
tls = { fixtures = 12, description = "TLS/SSL configuration" }
jwt = { fixtures = 8, description = "JWT authentication" }
secrets = { fixtures = 15, description = "Hardcoded secrets" }
auth = { fixtures = 6, description = "Authentication bypass" }
negative = { fixtures = 10, description = "Safe code (no claims expected)" }
edge_cases = { fixtures = 5, description = "Boundary conditions" }

[baseline]
# Current known-good metrics
precision = 0.85
recall = 0.78
f1 = 0.81
total_fixtures = 56
last_updated = "2026-02-05"
prompt_version = "1.0.0"
model = "gemini-2.0-flash"
```
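
The manifest repeats information that can be derived: `total_fixtures` should equal the sum of the per-category counts, and `f1` should follow from `precision` and `recall`. A small consistency check might look like the sketch below; the function names and the rounding tolerance of 0.005 (two decimal places) are assumptions.

```rust
/// Sum of the per-category fixture counts from `[categories]`.
pub fn total_from_categories(counts: &[(&str, usize)]) -> usize {
    counts.iter().map(|(_, n)| n).sum()
}

/// F1 as the harmonic mean of precision and recall; 0.0 when both are zero.
pub fn f1_score(precision: f64, recall: f64) -> f64 {
    if precision + recall == 0.0 {
        0.0
    } else {
        2.0 * precision * recall / (precision + recall)
    }
}

/// True when the recorded totals agree with the values they derive from.
/// The tolerance allows for an F1 stored rounded to two decimals.
pub fn baseline_is_consistent(
    counts: &[(&str, usize)],
    total_fixtures: usize,
    precision: f64,
    recall: f64,
    f1: f64,
) -> bool {
    total_from_categories(counts) == total_fixtures
        && (f1_score(precision, recall) - f1).abs() < 0.005
}
```

With the values above, the category counts sum to 56 and the harmonic mean of 0.85 and 0.78 rounds to 0.81, so the example manifest is internally consistent.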

---

### 2. Observation Log

Every LLM extraction is logged with full context for replay and analysis.

#### Log Entry Schema

```rust
use chrono::{DateTime, Utc};
use serde::{Deserialize, Serialize};
use uuid::Uuid;

/// A single observation from an LLM extraction
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ExtractionObservation {
    /// Unique identifier for this observation
    pub id: Uuid,

    /// When this extraction occurred
    pub timestamp: DateTime<Utc>,

    /// Prompt version (semantic version)
    pub prompt_version: String,

    /// Model identifier (e.g., "gemini-2.0-flash")
    pub model: String,

    /// BLAKE3 hash of input content (for deduplication)
    pub input_hash: String,

    /// Input metadata
    pub input: ExtractionInput,

    /// Output from LLM
    pub output: ExtractionOutput,

    /// Performance metrics
    pub metrics: ExtractionMetrics,

    /// Evaluation context (if run during evaluation)
    pub evaluation: Option<EvaluationContext>,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ExtractionInput {
    /// Filename (may be anonymized)
    pub filename: String,

    /// Language detected
    pub language: String,

    /// Content length in bytes
    pub content_length: usize,

    /// Content preview (first 500 chars, for debugging)
    pub content_preview: Option<String>,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ExtractionOutput {
    /// Raw LLM response
    pub raw_response: String,

    /// Parsed claims (may be empty if parsing failed)
    pub claims: Vec<Observation>,

    /// Whether parsing succeeded
    pub parse_success: bool,

    /// Parse error if any
    pub parse_error: Option<String>,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ExtractionMetrics {
    /// Total latency (API call + processing)
    pub latency_ms: u64,

    /// API latency only
    pub api_latency_ms: u64,

    /// Input tokens
    pub input_tokens: u32,

    /// Output tokens
    pub output_tokens: u32,

    /// Total tokens
    pub total_tokens: u32,

    /// Estimated cost (USD)
    pub estimated_cost_usd: f64,

    /// Cache hit (if response was cached)
    pub cache_hit: bool,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct EvaluationContext {
    /// Fixture ID if from golden corpus
    pub fixture_id: Option<String>,

    /// Evaluation run ID
    pub run_id: Uuid,

    /// Whether this matched expected output
    pub matched_expected: bool,

    /// Detailed match results
    pub match_details: MatchDetails,
}
```

#### Log Storage

```
~/.aphoria/eval/
├── observations/
│   ├── 2026-02-05/
│   │   ├── 001_tls-001_success.json
│   │   ├── 002_jwt-003_partial.json
│   │   └── ...
│   └── 2026-02-04/
│       └── ...
├── runs/
│   ├── run_abc123.json        # Full evaluation run metadata
│   └── run_def456.json
└── baselines/
    ├── v1.0.0.json            # Baseline for prompt v1.0.0
    └── latest.json            # Symlink to current baseline
```
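
The file names in the layout follow a `<sequence>_<fixture id>_<outcome>.json` pattern inside a per-day directory. One way to build such a path is sketched here; `Outcome` and `observation_path` are hypothetical names, and the `failure` outcome is an assumption, since the layout only shows `success` and `partial`.

```rust
use std::path::{Path, PathBuf};

/// Outcome suffix used in the file name.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Outcome {
    Success,
    Partial,
    Failure, // assumed; not shown in the layout above
}

impl Outcome {
    fn as_str(self) -> &'static str {
        match self {
            Outcome::Success => "success",
            Outcome::Partial => "partial",
            Outcome::Failure => "failure",
        }
    }
}

/// Builds `<root>/observations/<date>/<seq>_<fixture>_<outcome>.json`.
/// `date` is expected as `YYYY-MM-DD`; the sequence is zero-padded to 3 digits.
pub fn observation_path(
    root: &Path,
    date: &str,
    seq: u32,
    fixture_id: &str,
    outcome: Outcome,
) -> PathBuf {
    root.join("observations")
        .join(date)
        .join(format!("{:03}_{}_{}.json", seq, fixture_id, outcome.as_str()))
}
```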

---

### 3. Evaluation Harness

The core engine that runs evaluations.

#### Public API

```rust
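use std::collections::HashMap;
use std::path::PathBuf;

use chrono::{DateTime, Utc};
use serde::{Deserialize, Serialize};
use uuid::Uuid;

/// Configuration for an evaluation run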
#[derive(Debug, Clone)]
pub struct EvalConfig {
    /// Path to fixtures directory
    pub fixtures_dir: PathBuf,

    /// Which categories to evaluate (None = all)
    pub categories: Option<Vec<String>>,

    /// Maximum fixtures to run (for quick smoke tests)
    pub max_fixtures: Option<usize>,

    /// Whether to use real LLM or mock
    pub mode: EvalMode,

    /// Baseline to compare against
    pub baseline: Option<PathBuf>,

    /// Output directory for results
    pub output_dir: PathBuf,

    /// Whether to save observations
    pub save_observations: bool,

    /// Parallelism (concurrent LLM calls)
    pub parallelism: usize,
}

#[derive(Debug, Clone)]
pub enum EvalMode {
    /// Use real LLM API (costs money, tests actual prompt)
    Live {
        model: String,
        temperature: f32,
    },
    /// Use cached responses (fast, deterministic, for CI)
    Cached,
    /// Use mock responses (for testing harness itself)
    Mock,
}

/// Result of an evaluation run
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct EvalResult {
    /// Unique run identifier
    pub run_id: Uuid,

    /// When the run started
    pub started_at: DateTime<Utc>,

    /// When the run completed
    pub completed_at: DateTime<Utc>,

    /// Configuration used
    pub config: EvalConfigSummary,

    /// Aggregate metrics
    pub metrics: AggregateMetrics,

    /// Per-fixture results
    pub fixture_results: Vec<FixtureResult>,

    /// Comparison with baseline (if baseline provided)
    pub baseline_comparison: Option<BaselineComparison>,

    /// Overall verdict
    pub verdict: EvalVerdict,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct AggregateMetrics {
    /// Precision: TP / (TP + FP)
    pub precision: f64,

    /// Recall: TP / (TP + FN)
    pub recall: f64,

    /// F1 score: 2 * (P * R) / (P + R)
    pub f1: f64,

    /// Total fixtures evaluated
    pub total_fixtures: usize,

    /// Fixtures that passed
    pub passed: usize,

    /// Fixtures that failed
    pub failed: usize,

    /// Fixtures that errored (LLM call failed, parse failed, etc.)
    pub errored: usize,

    /// Total cost (USD)
    pub total_cost_usd: f64,

    /// Total tokens used
    pub total_tokens: u64,

    /// Average latency (ms)
    pub avg_latency_ms: f64,

    /// Per-category breakdown
    pub by_category: HashMap<String, CategoryMetrics>,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub enum EvalVerdict {
    /// All checks passed
    Pass,
    /// Some regressions detected
    Regression { details: Vec<String> },
    /// Evaluation failed (errors prevented completion)
    Error { message: String },
}
```
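
The aggregate precision, recall, and F1 fields can be derived from true-positive, false-positive, and false-negative counts. The sketch below assumes micro-averaging (counts are summed across fixtures before dividing) and returns `None` when a ratio is undefined; `Counts` and the helper functions are illustrative names, and how an undefined value is rendered in a report is left open.

```rust
/// Claim-level counts for one fixture or for a whole run.
#[derive(Debug, Clone, Copy, Default, PartialEq)]
pub struct Counts {
    pub tp: u32,  // expected claims that were extracted
    pub fp: u32,  // forbidden or unexpected claims that were extracted
    pub fn_: u32, // expected claims that were missed
}

fn ratio(num: u32, den: u32) -> Option<f64> {
    if den == 0 {
        None
    } else {
        Some(f64::from(num) / f64::from(den))
    }
}

/// TP / (TP + FP); `None` when nothing was extracted.
pub fn precision(c: Counts) -> Option<f64> {
    ratio(c.tp, c.tp + c.fp)
}

/// TP / (TP + FN); `None` when nothing was expected.
pub fn recall(c: Counts) -> Option<f64> {
    ratio(c.tp, c.tp + c.fn_)
}

/// 2 * P * R / (P + R); `None` when either input is undefined or both are zero.
pub fn f1(c: Counts) -> Option<f64> {
    let (p, r) = (precision(c)?, recall(c)?);
    if p + r == 0.0 {
        None
    } else {
        Some(2.0 * p * r / (p + r))
    }
}

/// Sums per-fixture counts so the metrics above are micro-averaged.
pub fn total(per_fixture: &[Counts]) -> Counts {
    per_fixture.iter().fold(Counts::default(), |acc, c| Counts {
        tp: acc.tp + c.tp,
        fp: acc.fp + c.fp,
        fn_: acc.fn_ + c.fn_,
    })
}
```

Micro-averaging lets fixtures with many claims dominate the score; averaging per-fixture scores, optionally scaled by each fixture's `weight`, is the alternative if every fixture should count equally.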

#### Claim Matching Logic

```rust
use std::collections::HashMap;

/// How to match extracted claims against expected claims
pub struct ClaimMatcher {
    /// Tolerance for confidence comparison
    pub confidence_tolerance: f32,

    /// Whether to normalize concept paths before comparison
    pub normalize_paths: bool,

    /// Predicate aliases (e.g., "enabled" == "active" == "on")
    pub predicate_aliases: HashMap<String, Vec<String>>,

    /// Value equivalences (e.g., true == "true" == "yes" == 1)
    pub value_equivalences: Vec<Vec<ObjectValue>>,
}

impl ClaimMatcher {
    /// Check if extracted claims satisfy must_contain requirements
    pub fn check_must_contain(
        &self,
        extracted: &[Observation],
        expected: &[ExpectedClaim],
    ) -> MatchResult {
        // For each expected claim:
        // 1. Find matching extracted claim (subject + predicate match)
        // 2. Check value compatibility
        // 3. Check confidence threshold
        // Return: matched, unmatched, partial matches
        todo!("must_contain matching is specified above but not yet implemented")
    }

    /// Check if extracted claims violate must_not_contain requirements
    pub fn check_must_not_contain(
        &self,
        extracted: &[Observation],
        forbidden: &[ExpectedClaim],
    ) -> MatchResult {
        // For each forbidden claim:
        // 1. Check if any extracted claim matches
        // 2. Flag violations
        todo!("must_not_contain matching is specified above but not yet implemented")
    }
}
```
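
The steps in the comments above can be sketched with simplified types. This is an illustration under stated assumptions, not the planned implementation: `SimpleMatcher`, `ExtractedClaim`, and `ExpectedClaim` are stand-ins for `ClaimMatcher`, `Observation`, and the fixture claim type, values are compared as case-insensitive strings, and path normalization is reduced to trimming slashes and lowercasing.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
pub struct ExpectedClaim {
    pub subject: String,
    pub predicate: String,
    pub value: String,
}

#[derive(Debug, Clone, PartialEq)]
pub struct ExtractedClaim {
    pub subject: String,
    pub predicate: String,
    pub value: String,
    pub confidence: f32,
}

pub struct SimpleMatcher {
    /// Canonical predicate -> accepted aliases (e.g. "enabled" -> ["active", "on"]).
    pub predicate_aliases: HashMap<String, Vec<String>>,
}

impl SimpleMatcher {
    fn normalize_path(path: &str) -> String {
        path.trim().trim_matches('/').to_ascii_lowercase()
    }

    /// Maps an alias to its canonical predicate; unknown predicates pass through.
    fn canonical_predicate(&self, predicate: &str) -> String {
        let wanted = predicate.trim().to_ascii_lowercase();
        for (canonical, aliases) in &self.predicate_aliases {
            if *canonical == wanted || aliases.iter().any(|alias| *alias == wanted) {
                return canonical.clone();
            }
        }
        wanted
    }

    fn same_claim(&self, got: &ExtractedClaim, want: &ExpectedClaim) -> bool {
        Self::normalize_path(&got.subject) == Self::normalize_path(&want.subject)
            && self.canonical_predicate(&got.predicate) == self.canonical_predicate(&want.predicate)
            && got.value.trim().eq_ignore_ascii_case(want.value.trim())
    }

    /// Expected claims with no sufficiently confident match (false negatives).
    pub fn missing<'a>(
        &self,
        extracted: &[ExtractedClaim],
        expected: &'a [ExpectedClaim],
        min_confidence: f32,
    ) -> Vec<&'a ExpectedClaim> {
        expected
            .iter()
            .filter(|want| {
                !extracted
                    .iter()
                    .any(|got| got.confidence >= min_confidence && self.same_claim(got, want))
            })
            .collect()
    }

    /// Extracted claims that match a forbidden claim, at any confidence.
    pub fn violations<'a>(
        &self,
        extracted: &'a [ExtractedClaim],
        forbidden: &[ExpectedClaim],
    ) -> Vec<&'a ExtractedClaim> {
        extracted
            .iter()
            .filter(|got| forbidden.iter().any(|bad| self.same_claim(got, bad)))
            .collect()
    }
}
```

Ignoring confidence for `must_not_contain` is a deliberate choice in this sketch: a forbidden claim is treated as a false positive even when the model reports low confidence.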

---

### 4. Prompt Versioning

Prompts are versioned to track changes and correlate with metrics.

#### Version Schema

```rust
use chrono::{DateTime, Utc};
use serde::{Deserialize, Serialize};

/// Prompt version identifier
#[derive(Debug, Clone, Serialize, Deserialize, PartialEq, Eq, Hash)]
pub struct PromptVersion {
    /// Semantic version (major.minor.patch)
    pub version: String,

    /// BLAKE3 hash of prompt content
    pub content_hash: String,

    /// When this version was created
    pub created_at: DateTime<Utc>,

    /// Description of changes from previous version
    pub changelog: Option<String>,
}

impl PromptVersion {
    /// Build a version record for a prompt. The semantic version is supplied
    /// by the caller (e.g. `PROMPT_VERSION`); the hash is derived from the content.
    pub fn from_prompt(version: &str, prompt: &str, changelog: Option<String>) -> Self {
        let content_hash = blake3::hash(prompt.as_bytes()).to_hex().to_string();
        Self {
            version: version.to_string(),
            content_hash,
            created_at: Utc::now(),
            changelog,
        }
    }
}
```
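
Storing both a semantic version and a content hash makes it possible to catch a prompt that was edited without a version bump. A minimal sketch of that check follows, assuming hashes are already computed and passed in as hex strings; `parse_semver`, `check_bump`, and `VersionCheck` are hypothetical names.

```rust
/// Parses `major.minor.patch`; returns `None` for anything else.
pub fn parse_semver(version: &str) -> Option<(u64, u64, u64)> {
    let mut parts = version.trim().split('.').map(|part| part.parse::<u64>().ok());
    let parsed = (parts.next()??, parts.next()??, parts.next()??);
    if parts.next().is_some() {
        None
    } else {
        Some(parsed)
    }
}

#[derive(Debug, PartialEq)]
pub enum VersionCheck {
    /// Same content hash: nothing to re-evaluate.
    Unchanged,
    /// Content changed and the version increased.
    Bumped,
    /// Content changed but the version did not increase.
    ContentChangedWithoutBump,
    /// One of the version strings is not `major.minor.patch`.
    InvalidVersion,
}

pub fn check_bump(
    prev_version: &str,
    prev_hash: &str,
    cur_version: &str,
    cur_hash: &str,
) -> VersionCheck {
    let (prev, cur) = match (parse_semver(prev_version), parse_semver(cur_version)) {
        (Some(p), Some(c)) => (p, c),
        _ => return VersionCheck::InvalidVersion,
    };
    if prev_hash == cur_hash {
        VersionCheck::Unchanged
    } else if cur > prev {
        // Tuples compare lexicographically: major, then minor, then patch.
        VersionCheck::Bumped
    } else {
        VersionCheck::ContentChangedWithoutBump
    }
}
```

A check like this could run in CI alongside the smoke evaluation, so that baselines are never attributed to a version string that no longer matches the prompt text.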

#### Prompt File Structure

````rust
// llm/prompt.rs

/// Current prompt version
pub const PROMPT_VERSION: &str = "1.2.0";

/// Changelog for current version
pub const PROMPT_CHANGELOG: &str = "Improved JWT claim extraction accuracy";

/// The extraction prompt
pub const EXTRACTION_PROMPT: &str = r#"
You are a security analyst extracting implicit security claims from code.

Given the following code file, identify any security-relevant configurations,
settings, or patterns. For each finding, output a JSON object with:

- subject: The concept path (e.g., "tls/cert_verification")
- predicate: The aspect being claimed (e.g., "enabled")
- value: The value found (boolean, string, or number)
- confidence: Your confidence in this extraction (0.0 to 1.0)
- description: Brief explanation

Focus on:
- TLS/SSL configuration
- Authentication settings
- Cryptographic choices
- Secret/credential handling
- Input validation
- Authorization patterns

Output as a JSON array. If no security claims are found, output an empty array.

Code:
```{language}
{content}
```
"#;
````

---

### 5. Metrics & Reporting

#### Metrics Computed

| Metric | Formula | Purpose |
|--------|---------|---------|
| **Precision** | TP / (TP + FP) | Are we avoiding false positives? |
| **Recall** | TP / (TP + FN) | Are we finding all issues? |
| **F1 Score** | 2 * (P * R) / (P + R) | Balanced measure |
| **Confidence Calibration** | Correlation(confidence, correctness) | Are high-confidence claims actually correct? |
| **Parse Success Rate** | Successful parses / Total extractions | Is the prompt producing valid JSON? |
| **Cost per Fixture** | Total cost / Fixtures | Budget tracking |
| **Latency P50/P95** | Percentiles | Performance tracking |

#### Regression Detection

```rust
/// Compare current metrics against baseline
pub struct BaselineComparison {
    /// Current metrics
    pub current: AggregateMetrics,

    /// Baseline metrics
    pub baseline: AggregateMetrics,

    /// Deltas
    pub precision_delta: f64,
    pub recall_delta: f64,
    pub f1_delta: f64,

    /// Regression threshold, as an absolute drop in a metric
    pub regression_threshold: f64,  // e.g., 0.05 = a drop of 0.05 (5 percentage points)

    /// Fixtures that regressed
    pub regressed_fixtures: Vec<FixtureRegression>,

    /// Fixtures that improved
    pub improved_fixtures: Vec<FixtureImprovement>,
}

impl BaselineComparison {
    pub fn has_regression(&self) -> bool {
        self.precision_delta < -self.regression_threshold ||
        self.recall_delta < -self.regression_threshold ||
        self.f1_delta < -self.regression_threshold
    }
}
````
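
`has_regression` is a two-state check, while the sample report also shows a `REVIEW` verdict for a small drop. One way to express a three-state status per metric is sketched below; `MetricStatus` and its `Review` state are assumptions, since `EvalVerdict` as defined earlier has no such variant.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum MetricStatus {
    /// Metric is at or above the baseline.
    Pass,
    /// Metric dropped, but by no more than the regression threshold.
    Review,
    /// Metric dropped by more than the regression threshold.
    Regression,
}

/// Classifies one metric using the same strict comparison as `has_regression`.
pub fn metric_status(current: f64, baseline: f64, regression_threshold: f64) -> MetricStatus {
    let delta = current - baseline;
    if delta < -regression_threshold {
        MetricStatus::Regression
    } else if delta < 0.0 {
        MetricStatus::Review
    } else {
        MetricStatus::Pass
    }
}

/// The worst status across all metrics decides the overall result.
pub fn overall_status(statuses: &[MetricStatus]) -> MetricStatus {
    if statuses.contains(&MetricStatus::Regression) {
        MetricStatus::Regression
    } else if statuses.contains(&MetricStatus::Review) {
        MetricStatus::Review
    } else {
        MetricStatus::Pass
    }
}
```

With a threshold of 0.05, precision 0.87 against 0.85 passes, recall 0.76 against 0.78 needs review, and the run as a whole is flagged for review rather than failed.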

#### Report Format

```markdown
# Prompt Evaluation Report

**Run ID:** abc123
**Date:** 2026-02-05 14:30:00 UTC
**Prompt Version:** 1.2.0
**Model:** gemini-2.0-flash

## Summary

| Metric        | Current | Baseline | Delta | Status |
| ------------- | ------- | -------- | ----- | ------ |
| Precision     | 0.87    | 0.85     | +0.02 | ✅     |
| Recall        | 0.76    | 0.78     | -0.02 | ⚠️     |
| F1 Score      | 0.81    | 0.81     | +0.00 | ✅     |
| Parse Success | 98%     | 97%      | +1%   | ✅     |

**Verdict:** ⚠️ REVIEW - Recall dropped by 0.02 (2 percentage points), within the 0.05 regression threshold

## Cost Analysis

- Total tokens: 125,430
- Estimated cost: $0.12
- Cost per fixture: $0.002

## Regressions

### jwt-003: JWT algorithm none detection

- **Expected:** `jwt/algorithm = "none"` with confidence > 0.8
- **Got:** Not extracted
- **Impact:** High (security-critical)

## Improvements

### tls-007: TLS version in constants

- **Previously:** Not extracted
- **Now:** `tls/min_version = "1.0"` with confidence 0.85
- **Impact:** Medium

## Category Breakdown

| Category | Fixtures | Passed | Failed | Precision | Recall |
| -------- | -------- | ------ | ------ | --------- | ------ |
| tls      | 12       | 11     | 1      | 0.92      | 0.91   |
| jwt      | 8        | 6      | 2      | 0.75      | 0.75   |
| secrets  | 15       | 14     | 1      | 0.93      | 0.87   |
| auth     | 6        | 6      | 0      | 1.00      | 0.83   |
| negative | 10       | 10     | 0      | 1.00      | N/A    |
```

---

### 6. Jobs & Automation

#### CI Job (Per-PR)

```yaml
# .github/workflows/prompt-eval-smoke.yml
name: Prompt Evaluation (Smoke)

on:
    pull_request:
        paths:
            - "applications/aphoria/src/llm/**"
            - "applications/aphoria/tests/llm_fixtures/**"

jobs:
    eval:
        runs-on: ubuntu-latest
        steps:
            - uses: actions/checkout@v4

            - name: Run smoke test
              env:
                  GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}
              run: |
                  cargo run -p aphoria -- eval prompts \
                    --mode cached \
                    --max-fixtures 20 \
                    --categories tls,jwt,secrets \
                    --baseline tests/llm_fixtures/baselines/latest.json \
                    --fail-on-regression

            - name: Upload report
              uses: actions/upload-artifact@v4
              with:
                  name: eval-report
                  path: eval-report.md
```

**Characteristics:**

- Runs on every PR that touches prompt code or fixtures
- Uses cached responses (fast, deterministic)
- Limited to 20 fixtures (smoke test)
- Fails if a regression is detected
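
The "fails if a regression is detected" behaviour comes down to how the verdict maps to a process exit code. A sketch of that mapping, assuming a local copy of the verdict enum; exit code 1 for regressions follows the `--fail-on-regression` description, while exit code 2 for evaluation errors is an assumption.

```rust
/// Local stand-in for `EvalVerdict`.
#[derive(Debug)]
pub enum Verdict {
    Pass,
    Regression { details: Vec<String> },
    Error { message: String },
}

/// Maps a verdict to a process exit code.
pub fn exit_code(verdict: &Verdict, fail_on_regression: bool) -> i32 {
    match verdict {
        Verdict::Pass => 0,
        // Regressions only fail the process when the flag is set.
        Verdict::Regression { .. } if fail_on_regression => 1,
        Verdict::Regression { .. } => 0,
        // Assumed: errors always fail, with a code distinct from regressions.
        Verdict::Error { .. } => 2,
    }
}
```

Keeping errors distinct from regressions lets CI tell "the prompt got worse" apart from "the evaluation could not run".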

#### Nightly Job (Full Evaluation)

```yaml
# .github/workflows/prompt-eval-nightly.yml
name: Prompt Evaluation (Full)

on:
    schedule:
        - cron: "0 3 * * *" # 3am UTC daily
    workflow_dispatch:

jobs:
    eval:
        runs-on: ubuntu-latest
        steps:
            - uses: actions/checkout@v4

            - name: Run full evaluation
              env:
                  GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}
              # --fail-on-regression makes a regression fail this step, which is
              # what triggers the Slack step below (if: failure())
              run: |
                  cargo run -p aphoria -- eval prompts \
                    --mode live \
                    --model gemini-2.0-flash \
                    --temperature 0 \
                    --baseline tests/llm_fixtures/baselines/latest.json \
                    --output-dir ./eval-results \
                    --save-observations \
                    --fail-on-regression

            - name: Update baseline if improved
              run: |
                  # If F1 improved by > 2%, update baseline
                  ./scripts/maybe-update-baseline.sh

            - name: Upload results
              if: always() # keep the report even when the evaluation step fails
              uses: actions/upload-artifact@v4
              with:
                  name: eval-results
                  path: eval-results/

            - name: Post to Slack
              if: failure()
              uses: slackapi/slack-github-action@v1
              with:
                  payload: |
                      {
                        "text": "⚠️ Prompt evaluation regression detected",
                        "attachments": [...]
                      }
```

**Characteristics:**

- Runs nightly at 3am UTC
- Uses live LLM API (real evaluation)
- Full corpus coverage
- Updates the baseline if F1 improves by more than 2%
- Alerts on regression
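
The decision made by `maybe-update-baseline.sh` is not specified beyond "F1 improved by > 2%". A possible reading is sketched here in Rust for consistency with the rest of the document. It assumes the 2% is an absolute gain of 0.02 and adds a guard, not stated in the text, that no individual fixture regressed; both are assumptions to be settled when the script is written.

```rust
/// Decides whether the nightly run should replace the stored baseline.
///
/// * `min_gain` - required absolute F1 improvement (0.02 under the reading above)
/// * `regressed_fixtures` - number of fixtures that got worse (assumed guard)
pub fn should_update_baseline(
    current_f1: f64,
    baseline_f1: f64,
    min_gain: f64,
    regressed_fixtures: usize,
) -> bool {
    current_f1 - baseline_f1 > min_gain && regressed_fixtures == 0
}
```

Without a guard of this kind, a large gain in one category could hide a loss in another and lock it into the new baseline.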

#### On-Demand Job (Prompt Iteration)

```bash
# For prompt development: compare two versions
aphoria eval prompts \
  --mode live \
  --prompt-file ./prompts/experimental.txt \
  --baseline ./baselines/current.json \
  --output-dir ./eval-comparison \
  --verbose

# View comparison
cat ./eval-comparison/comparison.md
```

---

## CLI Interface

```
USAGE:
    aphoria eval prompts [OPTIONS]

OPTIONS:
    --fixtures-dir <DIR>       Path to fixtures directory [default: tests/llm_fixtures]
    --categories <LIST>        Categories to evaluate (comma-separated)
    --max-fixtures <N>         Maximum fixtures to run
    --mode <MODE>              Evaluation mode: live, cached, mock [default: cached]
    --model <MODEL>            Model to use (live mode only) [default: gemini-2.0-flash]
    --temperature <TEMP>       Temperature (live mode only) [default: 0]
    --baseline <FILE>          Baseline to compare against
    --output-dir <DIR>         Output directory for results [default: ./eval-results]
    --save-observations        Save observation logs
    --fail-on-regression       Exit with code 1 if regression detected
    --regression-threshold <N> Absolute metric drop treated as a regression [default: 0.05]
    --verbose                  Verbose output
    --json                     Output results as JSON

SUBCOMMANDS:
    aphoria eval prompts show-baseline    Show current baseline metrics
    aphoria eval prompts update-baseline  Update baseline from latest run
    aphoria eval prompts list-fixtures    List available fixtures
    aphoria eval prompts add-fixture      Add a new fixture interactively
    aphoria eval prompts validate-fixtures  Validate fixture format
```
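
Two of the options map directly onto `EvalConfig` fields: `--categories` becomes `Option<Vec<String>>` and `--mode` selects an evaluation mode with `cached` as the default. The parsing could be sketched as follows; `Mode`, `parse_categories`, and `parse_mode` are illustrative, and a real implementation would more likely delegate to an argument-parsing crate.

```rust
/// Simplified stand-in for `EvalMode` (live-mode parameters omitted).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Mode {
    Live,
    Cached,
    Mock,
}

/// Parses `--categories tls,jwt,secrets`. `None` means "all categories",
/// both when the flag is absent and when the list is empty.
pub fn parse_categories(arg: Option<&str>) -> Option<Vec<String>> {
    let list: Vec<String> = arg?
        .split(',')
        .map(str::trim)
        .filter(|name| !name.is_empty())
        .map(str::to_string)
        .collect();
    if list.is_empty() {
        None
    } else {
        Some(list)
    }
}

/// Parses `--mode`, defaulting to `cached` when the flag is absent.
pub fn parse_mode(arg: Option<&str>) -> Result<Mode, String> {
    match arg.unwrap_or("cached") {
        "live" => Ok(Mode::Live),
        "cached" => Ok(Mode::Cached),
        "mock" => Ok(Mode::Mock),
        other => Err(format!("unknown mode '{other}', expected live, cached, or mock")),
    }
}
```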

---

## Implementation Plan

### Phase 7.8.1: Core Infrastructure (Week 1)

| Task              | Description                       | Effort |
| ----------------- | --------------------------------- | ------ |
| Fixture format    | Define TOML schema, parser        | 2d     |
| Observation log   | Schema, writer, reader            | 1d     |
| Claim matcher     | Matching logic with fuzzy support | 2d     |
| Prompt versioning | Version extraction, tracking      | 1d     |

**Deliverable:** Can load fixtures, run extractions, compare outputs

### Phase 7.8.2: Evaluation Harness (Week 2)

| Task                | Description                 | Effort |
| ------------------- | --------------------------- | ------ |
| Evaluation harness  | Orchestration, parallelism  | 2d     |
| Metrics computation | Precision, recall, F1, cost | 1d     |
| Baseline comparison | Regression detection        | 1d     |
| Report generation   | Markdown, JSON output       | 1d     |

**Deliverable:** Can run full evaluation and generate report

### Phase 7.8.3: Golden Corpus (Week 2-3)

| Task                   | Description                      | Effort |
| ---------------------- | -------------------------------- | ------ |
| Seed fixtures (20)     | Hand-curated test cases          | 2d     |
| Negative fixtures (10) | Safe code that shouldn't trigger | 1d     |
| Edge case fixtures (5) | Boundary conditions              | 1d     |
| Baseline establishment | Initial metrics snapshot         | 1d     |

**Deliverable:** 35+ fixtures covering core categories

### Phase 7.8.4: CI Integration (Week 3)

| Task                 | Description                      | Effort |
| -------------------- | -------------------------------- | ------ |
| Smoke test workflow  | Per-PR cached evaluation         | 1d     |
| Nightly workflow     | Full live evaluation             | 1d     |
| Baseline auto-update | Script for improvement detection | 1d     |
| Alerting             | Slack/email on regression        | 0.5d   |

**Deliverable:** Automated evaluation in CI

### Phase 7.8.5: CLI & Documentation (Week 4)

| Task          | Description                    | Effort |
| ------------- | ------------------------------ | ------ |
| CLI commands  | `eval prompts` subcommands     | 2d     |
| Documentation | Usage guide, fixture authoring | 1d     |
| Skill update  | `/aphoria-dev` skill update    | 0.5d   |

**Deliverable:** Production-ready tooling

---

## Open Questions

### 1. Where do we store baseline metrics?

**Options:**

- **In repository** (`tests/llm_fixtures/baselines/`) - Simple, versioned with code
- **External artifact store** - Separates metrics from code
- **Database** - For historical tracking

**Recommendation:** Start with the repository; migrate to an external store when historical tracking is needed.

### 2. How strict should matching be?

**Options:**

- **Exact match** - Same subject, predicate, value (brittle)
- **Structural match** - Same concept, fuzzy value (looser)
- **Semantic match** - Embeddings-based similarity (complex)

**Recommendation:** Structural match with configurable fuzzy value matching.
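
To make "fuzzy value matching" concrete, the sketch below treats boolean-like values as equivalent (`true`, `"true"`, `"yes"`, `1`) and compares other strings case-insensitively. `Value` is a stand-in for `ObjectValue`, and the equivalence table is hard-coded here where the design has it configurable.

```rust
/// Simplified stand-in for `ObjectValue`.
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Bool(bool),
    Int(i64),
    Text(String),
}

/// Interprets a value as a boolean when it is one of the known equivalents.
fn as_bool(value: &Value) -> Option<bool> {
    match value {
        Value::Bool(b) => Some(*b),
        Value::Int(1) => Some(true),
        Value::Int(0) => Some(false),
        Value::Int(_) => None,
        Value::Text(s) => match s.trim().to_ascii_lowercase().as_str() {
            "true" | "yes" => Some(true),
            "false" | "no" => Some(false),
            _ => None,
        },
    }
}

/// Fuzzy comparison: boolean-like values compare by truthiness, text compares
/// case-insensitively, everything else must be exactly equal.
pub fn values_match(a: &Value, b: &Value) -> bool {
    match (as_bool(a), as_bool(b)) {
        (Some(x), Some(y)) => x == y,
        (None, None) => match (a, b) {
            (Value::Text(x), Value::Text(y)) => x.trim().eq_ignore_ascii_case(y.trim()),
            _ => a == b,
        },
        // One side is boolean-like and the other is not.
        _ => false,
    }
}
```

Treating `1` and `0` as booleans is the riskiest equivalence, since a numeric setting that happens to be 1 would match `true`; a per-fixture opt-out may be worth considering.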

### 3. Mock vs Live in CI?

**Options:**

- **Always mock** - Fast, free, and deterministic, but tests the harness rather than the prompt
- **Always live** - Expensive and slow, but tests the actual prompt
- **Hybrid** - Cached responses for PR smoke tests, live calls for the nightly run

**Recommendation:** Hybrid approach. Cached (deterministic) for PR, live for nightly.

### 4. How do we handle model version changes?

Gemini models may be updated by the provider, causing output drift even when the prompt is unchanged.

**Options:**

- Pin model version (if the API supports it)
- Track model version in baseline, re-baseline on model change
- Alert when model version changes

**Recommendation:** Track model version, require manual baseline update on change.
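
Since the baseline already records a `model` field, the recommended guard is a comparison before any metrics are diffed. A minimal sketch, with `BaselineCheck` and `check_model` as hypothetical names:

```rust
#[derive(Debug, PartialEq)]
pub enum BaselineCheck {
    /// Same model: metrics can be compared against the baseline.
    Comparable,
    /// Different model: deltas are not meaningful until a new baseline is recorded.
    RebaselineRequired {
        baseline_model: String,
        current_model: String,
    },
}

/// Compares the model recorded in the baseline with the one used for this run.
pub fn check_model(baseline_model: &str, current_model: &str) -> BaselineCheck {
    if baseline_model.trim() == current_model.trim() {
        BaselineCheck::Comparable
    } else {
        BaselineCheck::RebaselineRequired {
            baseline_model: baseline_model.to_string(),
            current_model: current_model.to_string(),
        }
    }
}
```

This only detects a change in the model identifier; a silent update behind an unchanged identifier would still surface as unexplained metric drift in the nightly run.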

### 5. What's the corpus growth strategy?

**Options:**

- Hand-curate only (high quality, slow growth)
- Production capture with review (faster growth, needs tooling)
- Synthetic generation (fast, may not reflect reality)

**Recommendation:** Start hand-curated, add production capture tooling in Phase 9.

---

## Success Metrics

| Metric                                  | Target        | Measurement                           |
| --------------------------------------- | ------------- | ------------------------------------- |
| Regression detection rate               | 100%          | Simulated regressions caught          |
| False positive rate (regression alerts) | < 5%          | Manual review of alerts               |
| Prompt iteration cycle time             | < 30 min      | Time from change to evaluation result |
| Corpus coverage                         | > 50 fixtures | Fixture count                         |
| CI job duration (smoke)                 | < 2 min       | Workflow timing                       |
| CI job duration (nightly)               | < 15 min      | Workflow timing                       |

---

## Related Documents

- [LLM-in-the-Loop Extraction](../../roadmap.md#phase-75-llm-in-the-loop-extraction) - Phase 7.5 implementation
- [Pattern Learning Store](../../roadmap.md#phase-76-pattern-learning-store) - Phase 7.6 implementation
- [LLM Extractor Code](../../src/llm/) - Current implementation

---

## Appendix: Example Fixture

```toml
# fixtures/secrets/high_entropy_api_key.toml

[metadata]
id = "secrets-005"
name = "High entropy API key in Python config"
category = "secrets"
subcategory = "api_keys"
language = "python"
difficulty = "medium"
source = "hand-curated"
created = "2026-02-05"
updated = "2026-02-05"
notes = """
Tests detection of high-entropy strings that look like API keys.
The entropy of 'sk_live_abc123...' is > 4.0 which should trigger detection.
"""

[input]
filename = "config.py"
content = '''
import os

# Configuration for payment processing
STRIPE_API_KEY = "sk_live_51ABC123DEF456GHI789JKL012MNO345PQR678STU901VWX234YZ"
STRIPE_WEBHOOK_SECRET = os.environ.get("STRIPE_WEBHOOK_SECRET")

DATABASE_URL = os.environ.get("DATABASE_URL", "postgresql://localhost/app")
DEBUG = os.environ.get("DEBUG", "false").lower() == "true"
'''

[expected]
must_contain = [
    { subject = "secrets/api_key", predicate = "hardcoded", value = true, min_confidence = 0.8 },
]

must_not_contain = [
    # STRIPE_WEBHOOK_SECRET uses env var, should NOT be flagged
    { subject = "secrets/webhook_secret", predicate = "hardcoded", value = true },
    # DATABASE_URL uses env var with fallback, should NOT be flagged as hardcoded
    { subject = "secrets/database_url", predicate = "hardcoded", value = true },
]

acceptable_variants = [
    { subject = "stripe/api_key", predicate = "exposed", value = true },
    { subject = "payment/secret", predicate = "hardcoded", value = true },
]

[scoring]
weight = 1.5  # Security-critical, weighted higher
min_confidence = 0.8
```
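
The fixture notes refer to an entropy above 4.0. Assuming this means Shannon entropy over characters, measured in bits per character, it can be computed as in the sketch below; `shannon_entropy` is an illustrative helper, not an existing function.

```rust
use std::collections::HashMap;

/// Shannon entropy of a string in bits per character.
/// Returns 0.0 for an empty string.
pub fn shannon_entropy(s: &str) -> f64 {
    let total = s.chars().count() as f64;
    if total == 0.0 {
        return 0.0;
    }
    // Count how often each character occurs.
    let mut counts: HashMap<char, usize> = HashMap::new();
    for c in s.chars() {
        *counts.entry(c).or_insert(0) += 1;
    }
    // H = -sum(p * log2(p)) over the observed characters.
    counts
        .values()
        .map(|&n| {
            let p = n as f64 / total;
            -p * p.log2()
        })
        .sum()
}
```

Under that definition the hardcoded key in this fixture scores about 5.26 bits per character, comfortably above the 4.0 threshold mentioned in the notes.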

---

_Last updated: 2026-02-05_