
LLM Prompt Evaluation System

Status: Proposed (2026-02-05)
Phase: 7.8 (extends Phase 7.5 LLM-in-the-Loop Extraction)
Author: Architecture Team


Problem Statement

Aphoria's LLM-powered claim extraction (Phase 7.5) uses Gemini to extract security claims from high-value code files. The prompts that drive this extraction are effectively code that we don't treat like code:

| Aspect | Traditional Code | Current Prompts |
|--------|------------------|-----------------|
| Version Control | Git commits | In files, but no semantic versioning |
| Testing | Unit/integration tests | None |
| Metrics | Coverage, performance | None |
| Regression Detection | CI failures | None |
| Quality Gates | Linting, review | None |

The result: When we change a prompt, we have no systematic way to know if we made things better or worse. We're flying blind.

Enterprise Requirements

For enterprise adoption, customers need assurance that:

  1. Prompts produce consistent, high-quality results - Not random outputs
  2. Changes are validated before deployment - Regressions are caught
  3. Performance is measurable - Precision, recall, cost are tracked
  4. The system improves over time - With evidence, not hope

Goals

Primary Goals

  1. Observability - Understand prompt effectiveness through metrics and logging
  2. Testability - Validate prompts against known scenarios with expected outcomes
  3. Repeatability - Run evaluations consistently across environments
  4. Automation - Scheduled jobs that detect regressions without human intervention

Non-Goals (Phase 7.8)

  • Real-time prompt optimization (future: Phase 9)
  • A/B testing in production (future: Phase 9)
  • Multi-model comparison (future)
  • Prompt compression/optimization (future)

Architecture Overview

┌──────────────────────────────────────────────────────────────────────────────┐
│                     LLM Prompt Evaluation System                              │
├──────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ┌─────────────────┐                                                         │
│  │  Golden Corpus  │  Test fixtures with expected outcomes                   │
│  │  (fixtures/)    │  - Code snippets                                        │
│  │                 │  - Expected claims (must-contain, must-not-contain)     │
│  └────────┬────────┘  - Metadata (language, category, difficulty)            │
│           │                                                                  │
│           ▼                                                                  │
│  ┌─────────────────┐                                                         │
│  │   Evaluation    │  Orchestrates test runs                                 │
│  │   Harness       │  - Loads fixtures                                       │
│  │                 │  - Invokes LLM Extractor                                │
│  │                 │  - Compares outputs                                     │
│  │                 │  - Computes metrics                                     │
│  └────────┬────────┘                                                         │
│           │                                                                  │
│           ├──────────────────────┐                                           │
│           │                      │                                           │
│           ▼                      ▼                                           │
│  ┌─────────────────┐    ┌─────────────────┐                                  │
│  │  LLM Extractor  │    │  Observation    │                                  │
│  │  (instrumented) │    │  Log            │                                  │
│  │                 │    │                 │                                  │
│  │  - Same code    │    │  - Prompt ver   │                                  │
│  │    path as      │    │  - Input hash   │                                  │
│  │    production   │    │  - Output       │                                  │
│  │                 │    │  - Latency      │                                  │
│  │                 │    │  - Tokens       │                                  │
│  └─────────────────┘    │  - Model        │                                  │
│                         │  - Timestamp    │                                  │
│                         └────────┬────────┘                                  │
│                                  │                                           │
│                                  ▼                                           │
│  ┌─────────────────────────────────────────────────────────────────────────┐ │
│  │                         Metrics & Reports                               │ │
│  │                                                                         │ │
│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐     │ │
│  │  │  Precision  │  │   Recall    │  │  F1 Score   │  │    Cost     │     │ │
│  │  │  TP/(TP+FP) │  │ TP/(TP+FN)  │  │  Harmonic   │  │  Tokens/$   │     │ │
│  │  └─────────────┘  └─────────────┘  └─────────────┘  └─────────────┘     │ │
│  │                                                                         │ │
│  │  ┌────────────────────────────────────────────────────────────────────┐ │ │
│  │  │  Regression Report                                                 │ │ │
│  │  │  - Comparison against baseline                                     │ │ │
│  │  │  - Per-fixture deltas                                              │ │ │
│  │  │  - Category breakdown                                              │ │ │
│  │  │  - Recommendations                                                 │ │ │
│  │  └────────────────────────────────────────────────────────────────────┘ │ │
│  └─────────────────────────────────────────────────────────────────────────┘ │
│                                                                              │
└──────────────────────────────────────────────────────────────────────────────┘

Core Components

1. Golden Corpus

A curated set of test fixtures with known expected outcomes.

Fixture Format

# fixtures/tls/disabled_verification.toml

[metadata]
id = "tls-001"
name = "TLS verification disabled in Python requests"
category = "tls"
subcategory = "certificate_verification"
language = "python"
difficulty = "easy"  # easy | medium | hard
source = "hand-curated"  # hand-curated | production-capture | synthetic
created = "2026-02-05"
updated = "2026-02-05"

[input]
filename = "client.py"
content = '''
import requests

def fetch_data(url):
    # Disable SSL verification for internal services
    response = requests.get(url, verify=False)
    return response.json()
'''

[expected]
# Claims that MUST be extracted (recall)
must_contain = [
    { subject = "tls/cert_verification", predicate = "enabled", value = false },
]

# Claims that MUST NOT be extracted (precision)
must_not_contain = [
    { subject = "tls/cert_verification", predicate = "enabled", value = true },
]

# Optional: acceptable alternate formulations
acceptable_variants = [
    { subject = "ssl/verify", predicate = "enabled", value = false },
    { subject = "requests/ssl_verify", predicate = "value", value = false },
]

[scoring]
# How to score this fixture
weight = 1.0  # Importance multiplier
min_confidence = 0.7  # Expected minimum confidence
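
A fixture loader will likely want to reject malformed metadata before any LLM call is made. The sketch below assumes the TOML has already been parsed into plain structs; `FixtureMeta`, `Scoring`, and `validate` are hypothetical names, and the checks cover only the enumerated values and numeric ranges shown in the format above.

```rust
/// Subset of the `[metadata]` table (hypothetical type for illustration).
pub struct FixtureMeta {
    pub id: String,
    pub category: String,
    pub difficulty: String,
    pub source: String,
}

/// The `[scoring]` table (hypothetical type for illustration).
pub struct Scoring {
    pub weight: f64,
    pub min_confidence: f64,
}

/// Returns human-readable problems; an empty list means the fixture is valid.
pub fn validate(meta: &FixtureMeta, scoring: &Scoring) -> Vec<String> {
    let mut problems = Vec::new();
    if meta.id.trim().is_empty() {
        problems.push("metadata.id must not be empty".to_string());
    }
    if !["easy", "medium", "hard"].contains(&meta.difficulty.as_str()) {
        problems.push(format!("unknown difficulty '{}'", meta.difficulty));
    }
    if !["hand-curated", "production-capture", "synthetic"].contains(&meta.source.as_str()) {
        problems.push(format!("unknown source '{}'", meta.source));
    }
    // Written as a negation so that NaN is rejected as well.
    if !(scoring.weight > 0.0) {
        problems.push("scoring.weight must be positive".to_string());
    }
    if !(0.0..=1.0).contains(&scoring.min_confidence) {
        problems.push("scoring.min_confidence must be within [0.0, 1.0]".to_string());
    }
    problems
}
```

Collecting all problems instead of stopping at the first one keeps fixture authoring feedback in a single pass.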

Corpus Organization

applications/aphoria/tests/llm_fixtures/
├── README.md                    # Corpus documentation
├── manifest.toml                # Index of all fixtures
├── tls/
│   ├── disabled_verification.toml
│   ├── deprecated_version.toml
│   └── pinning_bypass.toml
├── jwt/
│   ├── alg_none.toml
│   ├── skip_signature.toml
│   └── hardcoded_secret.toml
├── secrets/
│   ├── api_key_in_code.toml
│   ├── password_hardcoded.toml
│   └── high_entropy_token.toml
├── auth/
│   ├── bypass_pattern.toml
│   └── debug_header.toml
├── negative/                    # Files that should NOT trigger claims
│   ├── safe_tls_config.toml
│   ├── proper_jwt_validation.toml
│   └── env_var_secrets.toml
└── edge_cases/
    ├── empty_file.toml
    ├── binary_content.toml
    ├── huge_file.toml
    └── mixed_languages.toml

Manifest Structure

# manifest.toml

[corpus]
version = "1.0.0"
created = "2026-02-05"
description = "Golden corpus for LLM extraction evaluation"

[categories]
tls = { fixtures = 12, description = "TLS/SSL configuration" }
jwt = { fixtures = 8, description = "JWT authentication" }
secrets = { fixtures = 15, description = "Hardcoded secrets" }
auth = { fixtures = 6, description = "Authentication bypass" }
negative = { fixtures = 10, description = "Safe code (no claims expected)" }
edge_cases = { fixtures = 5, description = "Boundary conditions" }

[baseline]
# Current known-good metrics
precision = 0.85
recall = 0.78
f1 = 0.81
total_fixtures = 56
last_updated = "2026-02-05"
prompt_version = "1.0.0"
model = "gemini-2.0-flash"
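
Because the manifest duplicates information (per-category counts and a total, plus an F1 that is derivable from precision and recall), it can drift out of sync. A minimal consistency check, assuming the values have already been read from `manifest.toml`, might look like this; the function names are illustrative only.

```rust
/// F1 as the harmonic mean of precision and recall (0.0 when both are zero).
pub fn f1_score(precision: f64, recall: f64) -> f64 {
    if precision + recall == 0.0 {
        0.0
    } else {
        2.0 * precision * recall / (precision + recall)
    }
}

/// Checks that the per-category counts add up to `total_fixtures` and that
/// the recorded F1 agrees with precision/recall to two decimal places.
pub fn manifest_is_consistent(
    category_counts: &[usize],
    total_fixtures: usize,
    precision: f64,
    recall: f64,
    recorded_f1: f64,
) -> bool {
    let sum: usize = category_counts.iter().sum();
    let f1_matches = (f1_score(precision, recall) - recorded_f1).abs() < 0.005;
    sum == total_fixtures && f1_matches
}
```

With the example manifest above, the counts 12 + 8 + 15 + 6 + 10 + 5 add up to 56, and precision 0.85 with recall 0.78 gives an F1 of about 0.813, which rounds to the recorded 0.81.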

2. Observation Log

Every LLM extraction is logged with full context for replay and analysis.

Log Entry Schema

/// A single observation from an LLM extraction
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ExtractionObservation {
    /// Unique identifier for this observation
    pub id: Uuid,

    /// When this extraction occurred
    pub timestamp: DateTime<Utc>,

    /// Prompt version (semantic version)
    pub prompt_version: String,

    /// Model identifier (e.g., "gemini-2.0-flash")
    pub model: String,

    /// BLAKE3 hash of input content (for deduplication)
    pub input_hash: String,

    /// Input metadata
    pub input: ExtractionInput,

    /// Output from LLM
    pub output: ExtractionOutput,

    /// Performance metrics
    pub metrics: ExtractionMetrics,

    /// Evaluation context (if run during evaluation)
    pub evaluation: Option<EvaluationContext>,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ExtractionInput {
    /// Filename (may be anonymized)
    pub filename: String,

    /// Language detected
    pub language: String,

    /// Content length in bytes
    pub content_length: usize,

    /// Content preview (first 500 chars, for debugging)
    pub content_preview: Option<String>,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ExtractionOutput {
    /// Raw LLM response
    pub raw_response: String,

    /// Parsed claims (may be empty if parsing failed)
    pub claims: Vec<Observation>,

    /// Whether parsing succeeded
    pub parse_success: bool,

    /// Parse error if any
    pub parse_error: Option<String>,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ExtractionMetrics {
    /// Total latency (API call + processing)
    pub latency_ms: u64,

    /// API latency only
    pub api_latency_ms: u64,

    /// Input tokens
    pub input_tokens: u32,

    /// Output tokens
    pub output_tokens: u32,

    /// Total tokens
    pub total_tokens: u32,

    /// Estimated cost (USD)
    pub estimated_cost_usd: f64,

    /// Cache hit (if response was cached)
    pub cache_hit: bool,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct EvaluationContext {
    /// Fixture ID if from golden corpus
    pub fixture_id: Option<String>,

    /// Evaluation run ID
    pub run_id: Uuid,

    /// Whether this matched expected output
    pub matched_expected: bool,

    /// Detailed match results
    pub match_details: MatchDetails,
}
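
The `estimated_cost_usd` field has to be derived from token counts. The sketch below shows one way to do that; `TokenPricing` is a hypothetical type, and the actual per-token prices are deliberately left to the caller because they depend on the provider's current price list.

```rust
/// Per-million-token prices in USD. Real values are an input, not a constant:
/// they must come from the provider's current price list.
pub struct TokenPricing {
    pub input_usd_per_million: f64,
    pub output_usd_per_million: f64,
}

/// Estimates the cost of one extraction from its token counts.
pub fn estimate_cost_usd(input_tokens: u32, output_tokens: u32, pricing: &TokenPricing) -> f64 {
    let input = f64::from(input_tokens) * pricing.input_usd_per_million;
    let output = f64::from(output_tokens) * pricing.output_usd_per_million;
    (input + output) / 1_000_000.0
}

/// Sums input and output tokens in u64 so the addition cannot overflow.
pub fn total_tokens(input_tokens: u32, output_tokens: u32) -> u64 {
    u64::from(input_tokens) + u64::from(output_tokens)
}
```

The schema above stores `total_tokens` as `u32`, which is ample for a single extraction; the wider type here only matters when the same helper is reused for run-level aggregation.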

Log Storage

~/.aphoria/eval/
├── observations/
│   ├── 2026-02-05/
│   │   ├── 001_tls-001_success.json
│   │   ├── 002_jwt-003_partial.json
│   │   └── ...
│   └── 2026-02-04/
│       └── ...
├── runs/
│   ├── run_abc123.json        # Full evaluation run metadata
│   └── run_def456.json
└── baselines/
    ├── v1.0.0.json            # Baseline for prompt v1.0.0
    └── latest.json            # Symlink to current baseline
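
The observation file names in the layout above appear to follow a `<sequence>_<fixture-id>_<outcome>.json` pattern. A small helper that builds such a path could look like the following; the `Outcome` enum and the `failed` label are assumptions, since only `success` and `partial` appear in the example tree.

```rust
use std::path::PathBuf;

/// Outcome label used as the file-name suffix. `Success` and `Partial` appear
/// in the layout above; `Failed` is an assumed third label.
pub enum Outcome {
    Success,
    Partial,
    Failed,
}

/// Builds `<root>/observations/<date>/<seq>_<fixture_id>_<outcome>.json`,
/// zero-padding the sequence number to three digits.
pub fn observation_path(
    root: &str,
    date: &str,
    seq: u32,
    fixture_id: &str,
    outcome: &Outcome,
) -> PathBuf {
    let label = match outcome {
        Outcome::Success => "success",
        Outcome::Partial => "partial",
        Outcome::Failed => "failed",
    };
    let mut path = PathBuf::from(root);
    path.push("observations");
    path.push(date);
    path.push(format!("{:03}_{}_{}.json", seq, fixture_id, label));
    path
}
```

Three-digit padding keeps files in run order when listed alphabetically, up to 999 observations per day; a larger corpus would need a wider field.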

3. Evaluation Harness

The core engine that runs evaluations.

Public API

/// Configuration for an evaluation run
#[derive(Debug, Clone)]
pub struct EvalConfig {
    /// Path to fixtures directory
    pub fixtures_dir: PathBuf,

    /// Which categories to evaluate (None = all)
    pub categories: Option<Vec<String>>,

    /// Maximum fixtures to run (for quick smoke tests)
    pub max_fixtures: Option<usize>,

    /// Whether to use real LLM or mock
    pub mode: EvalMode,

    /// Baseline to compare against
    pub baseline: Option<PathBuf>,

    /// Output directory for results
    pub output_dir: PathBuf,

    /// Whether to save observations
    pub save_observations: bool,

    /// Parallelism (concurrent LLM calls)
    pub parallelism: usize,
}

#[derive(Debug, Clone)]
pub enum EvalMode {
    /// Use real LLM API (costs money, tests actual prompt)
    Live {
        model: String,
        temperature: f32,
    },
    /// Use cached responses (fast, deterministic, for CI)
    Cached,
    /// Use mock responses (for testing harness itself)
    Mock,
}

/// Result of an evaluation run
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct EvalResult {
    /// Unique run identifier
    pub run_id: Uuid,

    /// When the run started
    pub started_at: DateTime<Utc>,

    /// When the run completed
    pub completed_at: DateTime<Utc>,

    /// Configuration used
    pub config: EvalConfigSummary,

    /// Aggregate metrics
    pub metrics: AggregateMetrics,

    /// Per-fixture results
    pub fixture_results: Vec<FixtureResult>,

    /// Comparison with baseline (if baseline provided)
    pub baseline_comparison: Option<BaselineComparison>,

    /// Overall verdict
    pub verdict: EvalVerdict,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct AggregateMetrics {
    /// Precision: TP / (TP + FP)
    pub precision: f64,

    /// Recall: TP / (TP + FN)
    pub recall: f64,

    /// F1 score: 2 * (P * R) / (P + R)
    pub f1: f64,

    /// Total fixtures evaluated
    pub total_fixtures: usize,

    /// Fixtures that passed
    pub passed: usize,

    /// Fixtures that failed
    pub failed: usize,

    /// Fixtures that errored (LLM call failed, parse failed, etc.)
    pub errored: usize,

    /// Total cost (USD)
    pub total_cost_usd: f64,

    /// Total tokens used
    pub total_tokens: u64,

    /// Average latency (ms)
    pub avg_latency_ms: f64,

    /// Per-category breakdown
    pub by_category: HashMap<String, CategoryMetrics>,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub enum EvalVerdict {
    /// All checks passed
    Pass,
    /// Some regressions detected
    Regression { details: Vec<String> },
    /// Evaluation failed (errors prevented completion)
    Error { message: String },
}
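
The `precision`, `recall`, and `f1` fields of `AggregateMetrics` follow directly from true-positive, false-positive, and false-negative counts. The sketch below is one way to compute them while making the undefined cases explicit; `Counts` is a hypothetical helper type, not part of the proposed API.

```rust
/// Confusion counts accumulated over a set of fixtures (hypothetical type).
#[derive(Debug, Clone, Copy, Default)]
pub struct Counts {
    pub true_positives: u32,
    pub false_positives: u32,
    pub false_negatives: u32,
}

impl Counts {
    /// TP / (TP + FP); `None` when no claims were extracted at all.
    pub fn precision(&self) -> Option<f64> {
        ratio(self.true_positives, self.true_positives + self.false_positives)
    }

    /// TP / (TP + FN); `None` when no claims were expected.
    pub fn recall(&self) -> Option<f64> {
        ratio(self.true_positives, self.true_positives + self.false_negatives)
    }

    /// 2 * P * R / (P + R); `None` if either input is undefined.
    pub fn f1(&self) -> Option<f64> {
        let (p, r) = (self.precision()?, self.recall()?);
        if p + r == 0.0 {
            Some(0.0)
        } else {
            Some(2.0 * p * r / (p + r))
        }
    }
}

/// Division that reports an empty denominator instead of producing NaN.
fn ratio(numerator: u32, denominator: u32) -> Option<f64> {
    if denominator == 0 {
        None
    } else {
        Some(f64::from(numerator) / f64::from(denominator))
    }
}
```

Returning `Option` matters for categories such as `negative`, where no claims are expected and recall has no meaningful value. If a report should show a precision of 1.00 when nothing was extracted, the caller can apply that convention explicitly with `unwrap_or(1.0)`.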

Claim Matching Logic

/// How to match extracted claims against expected claims
pub struct ClaimMatcher {
    /// Tolerance for confidence comparison
    pub confidence_tolerance: f32,

    /// Whether to normalize concept paths before comparison
    pub normalize_paths: bool,

    /// Predicate aliases (e.g., "enabled" == "active" == "on")
    pub predicate_aliases: HashMap<String, Vec<String>>,

    /// Value equivalences (e.g., true == "true" == "yes" == 1)
    pub value_equivalences: Vec<Vec<ObjectValue>>,
}

impl ClaimMatcher {
    /// Check if extracted claims satisfy must_contain requirements
    pub fn check_must_contain(
        &self,
        extracted: &[Observation],
        expected: &[ExpectedClaim],
    ) -> MatchResult {
        // For each expected claim:
        // 1. Find matching extracted claim (subject + predicate match)
        // 2. Check value compatibility
        // 3. Check confidence threshold
        // Return: matched, unmatched, partial matches
        todo!("design sketch: must_contain matching not yet implemented")
    }

    /// Check if extracted claims violate must_not_contain requirements
    pub fn check_must_not_contain(
        &self,
        extracted: &[Observation],
        forbidden: &[ExpectedClaim],
    ) -> MatchResult {
        // For each forbidden claim:
        // 1. Check if any extracted claim matches
        // 2. Flag violations
        todo!("design sketch: must_not_contain matching not yet implemented")
    }
}
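
To make the matching steps above concrete, here is a simplified, self-contained sketch of structural matching. It uses stand-in types (`Value`, `ExpectedClaim`, `ExtractedClaim`, `Aliases`) rather than the real `Observation` and `ObjectValue`, implements path normalization and predicate aliases, and leaves out value equivalences and partial matches entirely.

```rust
use std::collections::HashMap;

/// Simplified claim value (stand-in for the document's `ObjectValue`).
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Bool(bool),
    Text(String),
}

/// A claim the fixture expects, with its minimum acceptable confidence.
pub struct ExpectedClaim {
    pub subject: String,
    pub predicate: String,
    pub value: Value,
    pub min_confidence: f32,
}

/// A claim produced by the extractor.
pub struct ExtractedClaim {
    pub subject: String,
    pub predicate: String,
    pub value: Value,
    pub confidence: f32,
}

/// Alias table: canonical predicate -> accepted alternatives.
pub type Aliases = HashMap<String, Vec<String>>;

/// Lowercases a concept path and trims whitespace and surrounding slashes.
fn normalize_path(path: &str) -> String {
    path.trim().trim_matches('/').to_lowercase()
}

/// Maps a predicate to its canonical name, or returns it unchanged.
fn canonical<'a>(predicate: &'a str, aliases: &'a Aliases) -> &'a str {
    aliases
        .iter()
        .find(|(_, alts)| alts.iter().any(|alt| alt == predicate))
        .map(|(name, _)| name.as_str())
        .unwrap_or(predicate)
}

/// Same normalized subject, same canonical predicate, equal value, and
/// confidence at or above the expected minimum.
pub fn claim_matches(expected: &ExpectedClaim, got: &ExtractedClaim, aliases: &Aliases) -> bool {
    normalize_path(&expected.subject) == normalize_path(&got.subject)
        && canonical(&expected.predicate, aliases) == canonical(&got.predicate, aliases)
        && expected.value == got.value
        && got.confidence >= expected.min_confidence
}

/// Returns the indices of expected claims as (matched, unmatched).
pub fn check_must_contain(
    extracted: &[ExtractedClaim],
    expected: &[ExpectedClaim],
    aliases: &Aliases,
) -> (Vec<usize>, Vec<usize>) {
    (0..expected.len())
        .partition(|&i| extracted.iter().any(|got| claim_matches(&expected[i], got, aliases)))
}
```

Under these assumptions, an extracted claim `TLS/Cert_Verification/ active = false` with confidence 0.9 satisfies the `tls-001` expectation when `active` is registered as an alias of `enabled`. The alias lookup assumes each alias belongs to at most one canonical predicate.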

4. Prompt Versioning

Prompts are versioned to track changes and correlate with metrics.

Version Schema

/// Prompt version identifier
#[derive(Debug, Clone, Serialize, Deserialize, PartialEq, Eq, Hash)]
pub struct PromptVersion {
    /// Semantic version (major.minor.patch)
    pub version: String,

    /// BLAKE3 hash of prompt content
    pub content_hash: String,

    /// When this version was created
    pub created_at: DateTime<Utc>,

    /// Description of changes from previous version
    pub changelog: Option<String>,
}

impl PromptVersion {
    /// Compute version from prompt content
    pub fn from_prompt(prompt: &str, changelog: Option<String>) -> Self {
        let content_hash = blake3::hash(prompt.as_bytes()).to_hex().to_string();
        // Version is computed or provided externally
        Self {
            version: "0.0.0".to_string(), // Placeholder
            content_hash,
            created_at: Utc::now(),
            changelog,
        }
    }
}
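
Since `version` is stored as a string, comparing two prompt versions requires parsing: a plain string comparison would rank "1.10.0" below "1.2.0". A minimal sketch, with `parse_semver` and `is_newer` as hypothetical helpers that ignore pre-release and build suffixes:

```rust
/// Parses "major.minor.patch" into a tuple that orders numerically.
/// Returns `None` for anything that is not exactly three numeric parts.
pub fn parse_semver(version: &str) -> Option<(u64, u64, u64)> {
    let mut parts = version.trim().split('.');
    let major: u64 = parts.next()?.parse().ok()?;
    let minor: u64 = parts.next()?.parse().ok()?;
    let patch: u64 = parts.next()?.parse().ok()?;
    if parts.next().is_some() {
        return None;
    }
    Some((major, minor, patch))
}

/// `Some(true)` when `candidate` is strictly newer than `current`;
/// `None` when either string is not a valid version.
pub fn is_newer(candidate: &str, current: &str) -> Option<bool> {
    Some(parse_semver(candidate)? > parse_semver(current)?)
}
```

Tuples compare lexicographically by component, which is exactly the ordering semantic versions need for the three numeric fields.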

Prompt File Structure

// llm/prompt.rs

/// Current prompt version
pub const PROMPT_VERSION: &str = "1.2.0";

/// Changelog for current version
pub const PROMPT_CHANGELOG: &str = "Improved JWT claim extraction accuracy";

/// The extraction prompt
pub const EXTRACTION_PROMPT: &str = r#"
You are a security analyst extracting implicit security claims from code.

Given the following code file, identify any security-relevant configurations,
settings, or patterns. For each finding, output a JSON object with:

- subject: The concept path (e.g., "tls/cert_verification")
- predicate: The aspect being claimed (e.g., "enabled")
- value: The value found (boolean, string, or number)
- confidence: Your confidence in this extraction (0.0 to 1.0)
- description: Brief explanation

Focus on:
- TLS/SSL configuration
- Authentication settings
- Cryptographic choices
- Secret/credential handling
- Input validation
- Authorization patterns

Output as a JSON array. If no security claims are found, output an empty array.

Code:
```{language}
{content}
```
"#;
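
Because `EXTRACTION_PROMPT` is a `const` rather than a string literal at the call site, it cannot be passed to `format!`, which requires a literal format string. The placeholders therefore need to be substituted at runtime. A minimal sketch, with `render_prompt` as a hypothetical helper:

```rust
/// Substitutes the `{language}` and `{content}` placeholders in a template.
pub fn render_prompt(template: &str, language: &str, content: &str) -> String {
    // `{language}` is replaced first on purpose: if `{content}` were replaced
    // first, a source file that happens to contain the text "{language}"
    // would be rewritten by the second replacement.
    template
        .replace("{language}", language)
        .replace("{content}", content)
}
```

The ordering comment matters for a code-analysis tool in particular, since the files being analyzed may themselves contain template placeholders.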


5. Metrics & Reporting

Metrics Computed

| Metric | Formula | Purpose |
|--------|---------|---------|
| **Precision** | TP / (TP + FP) | Are we avoiding false positives? |
| **Recall** | TP / (TP + FN) | Are we finding all issues? |
| **F1 Score** | 2 * (P * R) / (P + R) | Balanced measure |
| **Confidence Calibration** | Correlation(confidence, correctness) | Are high-confidence claims actually correct? |
| **Parse Success Rate** | Successful parses / Total extractions | Is the prompt producing valid JSON? |
| **Cost per Fixture** | Total cost / Fixtures | Budget tracking |
| **Latency P50/P95** | Percentiles | Performance tracking |
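
For the latency percentiles, the nearest-rank method is one simple option (other definitions interpolate between samples and give slightly different values). The sketch below uses integer arithmetic for the rank so that rounding error cannot shift the result; `percentile_ms` is a hypothetical helper name.

```rust
/// Nearest-rank percentile of latency samples, with `percent` in 1..=100.
/// Returns `None` for an empty sample set or an out-of-range percent.
pub fn percentile_ms(samples: &[u64], percent: u32) -> Option<u64> {
    if samples.is_empty() || percent == 0 || percent > 100 {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_unstable();
    // rank = ceil(percent / 100 * n), computed in integers.
    let n = sorted.len() as u64;
    let rank = (u64::from(percent) * n + 99) / 100;
    Some(sorted[(rank - 1) as usize])
}
```

For 20 samples, P50 picks the 10th smallest value and P95 the 19th.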

Regression Detection

/// Compare current metrics against baseline
pub struct BaselineComparison {
    /// Current metrics
    pub current: AggregateMetrics,

    /// Baseline metrics
    pub baseline: AggregateMetrics,

    /// Deltas
    pub precision_delta: f64,
    pub recall_delta: f64,
    pub f1_delta: f64,

    /// Regression thresholds
    pub regression_threshold: f64,  // absolute drop, e.g., 0.05 = 5 percentage points

    /// Fixtures that regressed
    pub regressed_fixtures: Vec<FixtureRegression>,

    /// Fixtures that improved
    pub improved_fixtures: Vec<FixtureImprovement>,
}

impl BaselineComparison {
    pub fn has_regression(&self) -> bool {
        self.precision_delta < -self.regression_threshold ||
        self.recall_delta < -self.regression_threshold ||
        self.f1_delta < -self.regression_threshold
    }
}
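
As a worked example of the threshold check, the sketch below computes the deltas itself and reports which metrics regressed. `Headline` and `regressed_metrics` are hypothetical names; the comparison is the same absolute-drop test as `has_regression` above.

```rust
/// The three headline metrics compared against a baseline (hypothetical type).
#[derive(Debug, Clone, Copy)]
pub struct Headline {
    pub precision: f64,
    pub recall: f64,
    pub f1: f64,
}

/// Names of the metrics whose absolute drop exceeds `threshold`.
pub fn regressed_metrics(current: Headline, baseline: Headline, threshold: f64) -> Vec<&'static str> {
    let deltas = [
        ("precision", current.precision - baseline.precision),
        ("recall", current.recall - baseline.recall),
        ("f1", current.f1 - baseline.f1),
    ];
    deltas
        .iter()
        .filter(|(_, delta)| *delta < -threshold)
        .map(|(name, _)| *name)
        .collect()
}
```

With a current run of 0.87 / 0.76 / 0.81 against a baseline of 0.85 / 0.78 / 0.81, the recall delta is -0.02. That is not a regression at the default threshold of 0.05, but it would be flagged at a stricter threshold of 0.01.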

Report Format

# Prompt Evaluation Report

**Run ID:** abc123
**Date:** 2026-02-05 14:30:00 UTC
**Prompt Version:** 1.2.0
**Model:** gemini-2.0-flash

## Summary

| Metric        | Current | Baseline | Delta | Status |
| ------------- | ------- | -------- | ----- | ------ |
| Precision     | 0.87    | 0.85     | +0.02 | ✅     |
| Recall        | 0.76    | 0.78     | -0.02 | ⚠️     |
| F1 Score      | 0.81    | 0.81     | +0.00 | ✅     |
| Parse Success | 98%     | 97%      | +1%   | ✅     |

**Verdict:** ⚠️ REVIEW - Recall dropped by 0.02 (within the 0.05 regression threshold)

## Cost Analysis

- Total tokens: 125,430
- Estimated cost: $0.12
- Cost per fixture: $0.002

## Regressions

### jwt-003: JWT algorithm none detection

- **Expected:** `jwt/algorithm = "none"` with confidence > 0.8
- **Got:** Not extracted
- **Impact:** High (security-critical)

## Improvements

### tls-007: TLS version in constants

- **Previously:** Not extracted
- **Now:** `tls/min_version = "1.0"` with confidence 0.85
- **Impact:** Medium

## Category Breakdown

| Category | Fixtures | Passed | Failed | Precision | Recall |
| -------- | -------- | ------ | ------ | --------- | ------ |
| tls      | 12       | 11     | 1      | 0.92      | 0.91   |
| jwt      | 8        | 6      | 2      | 0.75      | 0.75   |
| secrets  | 15       | 14     | 1      | 0.93      | 0.87   |
| auth     | 6        | 6      | 0      | 1.00      | 0.83   |
| negative | 10       | 10     | 0      | 1.00      | N/A    |

6. Jobs & Automation

CI Job (Per-PR)

# .github/workflows/prompt-eval-smoke.yml
name: Prompt Evaluation (Smoke)

on:
    pull_request:
        paths:
            - "applications/aphoria/src/llm/**"
            - "applications/aphoria/tests/llm_fixtures/**"

jobs:
    eval:
        runs-on: ubuntu-latest
        steps:
            - uses: actions/checkout@v4

            - name: Run smoke test
              env:
                  GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}
              run: |
                  cargo run -p aphoria -- eval prompts \
                    --mode cached \
                    --max-fixtures 20 \
                    --categories tls,jwt,secrets \
                    --baseline tests/llm_fixtures/baselines/latest.json \
                    --fail-on-regression

            - name: Upload report
              uses: actions/upload-artifact@v4
              with:
                  name: eval-report
                  path: eval-report.md

Characteristics:

  • Runs on PR that touches prompt code or fixtures
  • Uses cached responses (fast, deterministic)
  • Limited to 20 fixtures (smoke test)
  • Fails if regression detected
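
The combination of `--categories` and `--max-fixtures` implies a selection step before any fixture runs. One possible shape for it is sketched below; `FixtureRef` and `select_fixtures` are hypothetical, and sorting by id is an assumption made so that a capped run is reproducible.

```rust
/// Minimal fixture handle: identifier plus category (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub struct FixtureRef {
    pub id: String,
    pub category: String,
}

/// Keeps fixtures whose category is in `categories` (all if `None`), sorts
/// them by id so a capped run always picks the same subset, then truncates.
pub fn select_fixtures(
    mut all: Vec<FixtureRef>,
    categories: Option<&[&str]>,
    max_fixtures: Option<usize>,
) -> Vec<FixtureRef> {
    if let Some(wanted) = categories {
        all.retain(|f| wanted.contains(&f.category.as_str()));
    }
    all.sort_by(|a, b| a.id.cmp(&b.id));
    if let Some(max) = max_fixtures {
        all.truncate(max);
    }
    all
}
```

A plain cap after sorting can starve whole categories: with ids prefixed by category and the example manifest's counts, a cap of 20 over `tls,jwt,secrets` would take all 8 `jwt` fixtures and 12 `secrets` fixtures and no `tls` fixtures at all. A round-robin pick across categories may be a better fit for a smoke test.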

Nightly Job (Full Evaluation)

# .github/workflows/prompt-eval-nightly.yml
name: Prompt Evaluation (Full)

on:
    schedule:
        - cron: "0 3 * * *" # 3am UTC daily
    workflow_dispatch:

jobs:
    eval:
        runs-on: ubuntu-latest
        steps:
            - uses: actions/checkout@v4

            - name: Run full evaluation
              env:
                  GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}
              run: |
                  cargo run -p aphoria -- eval prompts \
                    --mode live \
                    --model gemini-2.0-flash \
                    --temperature 0 \
                    --baseline tests/llm_fixtures/baselines/latest.json \
                    --output-dir ./eval-results \
                    --save-observations \
                    --fail-on-regression

            - name: Update baseline if improved
              run: |
                  # If F1 improved by > 2%, update baseline
                  ./scripts/maybe-update-baseline.sh

            - name: Upload results
              if: always() # keep artifacts even when the evaluation step fails
              uses: actions/upload-artifact@v4
              with:
                  name: eval-results
                  path: eval-results/

            - name: Post to Slack
              if: failure()
              uses: slackapi/slack-github-action@v1
              with:
                  payload: |
                      {
                        "text": "⚠️ Prompt evaluation regression detected",
                        "attachments": [...]
                      }

Characteristics:

  • Runs nightly at 3am UTC
  • Uses live LLM API (real evaluation)
  • Full corpus coverage
  • Updates baseline if metrics improve significantly
  • Alerts on regression
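
The baseline auto-update rule can be expressed as a small predicate. The sketch below reads "improved by > 2%" as an absolute F1 gain of more than 0.02, which is an assumption, and additionally refuses to update when any metric regressed; `should_update_baseline` is a hypothetical function, not the contents of `maybe-update-baseline.sh`.

```rust
/// Decides whether a nightly run should replace the stored baseline.
/// `min_gain` is an absolute F1 difference (0.02 for the "2%" above, read as
/// two points of F1; that reading is an assumption).
pub fn should_update_baseline(
    current_f1: f64,
    baseline_f1: f64,
    min_gain: f64,
    has_regression: bool,
) -> bool {
    !has_regression && current_f1 - baseline_f1 > min_gain
}
```

Blocking the update on `has_regression` prevents a run that trades a large precision gain for a recall drop from silently becoming the new reference point.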

On-Demand Job (Prompt Iteration)

# For prompt development: compare two versions
aphoria eval prompts \
  --mode live \
  --prompt-file ./prompts/experimental.txt \
  --baseline ./baselines/current.json \
  --output-dir ./eval-comparison \
  --verbose

# View comparison
cat ./eval-comparison/comparison.md

CLI Interface

USAGE:
    aphoria eval prompts [OPTIONS]

OPTIONS:
    --fixtures-dir <DIR>       Path to fixtures directory [default: tests/llm_fixtures]
    --categories <LIST>        Categories to evaluate (comma-separated)
    --max-fixtures <N>         Maximum fixtures to run
    --mode <MODE>              Evaluation mode: live, cached, mock [default: cached]
    --model <MODEL>            Model to use (live mode only) [default: gemini-2.0-flash]
    --temperature <TEMP>       Temperature (live mode only) [default: 0]
    --prompt-file <FILE>       Prompt file to evaluate in place of the built-in prompt
    --baseline <FILE>          Baseline to compare against
    --output-dir <DIR>         Output directory for results [default: ./eval-results]
    --save-observations        Save observation logs
    --fail-on-regression       Exit with code 1 if regression detected
    --regression-threshold <N> Absolute metric drop treated as a regression [default: 0.05]
    --verbose                  Verbose output
    --json                     Output results as JSON

SUBCOMMANDS:
    aphoria eval prompts show-baseline      Show current baseline metrics
    aphoria eval prompts update-baseline    Update baseline from latest run
    aphoria eval prompts list-fixtures      List available fixtures
    aphoria eval prompts add-fixture        Add a new fixture interactively
    aphoria eval prompts validate-fixtures  Validate fixture format
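
How `--fail-on-regression` maps a verdict to a process exit code can be sketched as follows. `Verdict` is a simplified stand-in for `EvalVerdict` without payloads, and the exit code 2 for errors is an assumption; the options above only specify code 1 for regressions.

```rust
/// Simplified stand-in for `EvalVerdict` (payloads omitted).
pub enum Verdict {
    Pass,
    Regression,
    Error,
}

/// Maps a verdict to a process exit code.
pub fn exit_code(verdict: &Verdict, fail_on_regression: bool) -> i32 {
    match verdict {
        Verdict::Pass => 0,
        Verdict::Regression if fail_on_regression => 1,
        // Without the flag, a regression is reported but does not fail the run.
        Verdict::Regression => 0,
        // Assumption: a run that could not complete always fails, with a
        // code distinct from the regression case.
        Verdict::Error => 2,
    }
}
```

Keeping errors distinct from regressions lets CI tell "the prompt got worse" apart from "the evaluation could not run".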

Implementation Plan

Phase 7.8.1: Core Infrastructure (Week 1)

| Task | Description | Effort |
|------|-------------|--------|
| Fixture format | Define TOML schema, parser | 2d |
| Observation log | Schema, writer, reader | 1d |
| Claim matcher | Matching logic with fuzzy support | 2d |
| Prompt versioning | Version extraction, tracking | 1d |

Deliverable: Can load fixtures, run extractions, compare outputs

Phase 7.8.2: Evaluation Harness (Week 2)

| Task | Description | Effort |
|------|-------------|--------|
| Evaluation harness | Orchestration, parallelism | 2d |
| Metrics computation | Precision, recall, F1, cost | 1d |
| Baseline comparison | Regression detection | 1d |
| Report generation | Markdown, JSON output | 1d |

Deliverable: Can run full evaluation and generate report

Phase 7.8.3: Golden Corpus (Week 2-3)

| Task | Description | Effort |
|------|-------------|--------|
| Seed fixtures (20) | Hand-curated test cases | 2d |
| Negative fixtures (10) | Safe code that shouldn't trigger | 1d |
| Edge case fixtures (5) | Boundary conditions | 1d |
| Baseline establishment | Initial metrics snapshot | 1d |

Deliverable: 35+ fixtures covering core categories

Phase 7.8.4: CI Integration (Week 3)

| Task | Description | Effort |
|------|-------------|--------|
| Smoke test workflow | Per-PR cached evaluation | 1d |
| Nightly workflow | Full live evaluation | 1d |
| Baseline auto-update | Script for improvement detection | 1d |
| Alerting | Slack/email on regression | 0.5d |

Deliverable: Automated evaluation in CI

Phase 7.8.5: CLI & Documentation (Week 4)

| Task | Description | Effort |
|------|-------------|--------|
| CLI commands | eval prompts subcommands | 2d |
| Documentation | Usage guide, fixture authoring | 1d |
| Skill update | /aphoria-dev skill update | 0.5d |

Deliverable: Production-ready tooling


Open Questions

1. Where do we store baseline metrics?

Options:

  • In repository (tests/llm_fixtures/baselines/) - Simple, versioned with code
  • External artifact store - Separates metrics from code
  • Database - For historical tracking

Recommendation: Start with repository, migrate to external store when history needed.

2. How strict should matching be?

Options:

  • Exact match - Same subject, predicate, value (brittle)
  • Structural match - Same concept, fuzzy value (looser)
  • Semantic match - Embeddings-based similarity (complex)

Recommendation: Structural match with configurable fuzzy value matching.

3. Mock vs Live in CI?

Options:

  • Always mock - Fast, free, deterministic, tests harness not prompt
  • Always live - Expensive, slow, tests actual prompt
  • Hybrid - Cached responses for per-PR smoke tests, live for nightly

Recommendation: Hybrid approach. Cached (deterministic) for PR, live for nightly.

4. How do we handle model version changes?

The provider may update the Gemini models, causing output drift even without prompt changes.

Options:

  • Pin model version (if API supports)
  • Track model version in baseline, re-baseline on model change
  • Alert when model version changes

Recommendation: Track model version, require manual baseline update on change.
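
The recommended check is small enough to sketch: compare the model recorded in the baseline with the model used for the current run, and refuse to treat the two as comparable when they differ. `BaselineStatus` and `baseline_status` are hypothetical names.

```rust
/// Whether a baseline can be compared against the current run.
#[derive(Debug, PartialEq)]
pub enum BaselineStatus {
    /// Same model identifier: deltas are meaningful.
    Comparable,
    /// Different model: a manual re-baseline is required first.
    ModelChanged {
        baseline_model: String,
        current_model: String,
    },
}

/// Compares the model recorded in the baseline with the one in use.
pub fn baseline_status(baseline_model: &str, current_model: &str) -> BaselineStatus {
    if baseline_model == current_model {
        BaselineStatus::Comparable
    } else {
        BaselineStatus::ModelChanged {
            baseline_model: baseline_model.to_string(),
            current_model: current_model.to_string(),
        }
    }
}
```

This only catches changes that are visible in the model identifier. If the provider updates a model without changing its identifier, the drift would surface as an unexplained metric change instead.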

5. What's the corpus growth strategy?

Options:

  • Hand-curate only (high quality, slow growth)
  • Production capture with review (faster growth, needs tooling)
  • Synthetic generation (fast, may not reflect reality)

Recommendation: Start hand-curated, add production capture tooling in Phase 9.


Success Metrics

| Metric | Target | Measurement |
|--------|--------|-------------|
| Regression detection rate | 100% | Simulated regressions caught |
| False positive rate (regression alerts) | < 5% | Manual review of alerts |
| Prompt iteration cycle time | < 30 min | Time from change to evaluation result |
| Corpus coverage | > 50 fixtures | Fixture count |
| CI job duration (smoke) | < 2 min | Workflow timing |
| CI job duration (nightly) | < 15 min | Workflow timing |


Appendix: Example Fixture

# fixtures/secrets/high_entropy_api_key.toml

[metadata]
id = "secrets-005"
name = "High entropy API key in Python config"
category = "secrets"
subcategory = "api_keys"
language = "python"
difficulty = "medium"
source = "hand-curated"
created = "2026-02-05"
updated = "2026-02-05"
notes = """
Tests detection of high-entropy strings that look like API keys.
The entropy of 'sk_live_abc123...' is > 4.0 which should trigger detection.
"""

[input]
filename = "config.py"
content = '''
import os

# Configuration for payment processing
STRIPE_API_KEY = "sk_live_51ABC123DEF456GHI789JKL012MNO345PQR678STU901VWX234YZ"
STRIPE_WEBHOOK_SECRET = os.environ.get("STRIPE_WEBHOOK_SECRET")

DATABASE_URL = os.environ.get("DATABASE_URL", "postgresql://localhost/app")
DEBUG = os.environ.get("DEBUG", "false").lower() == "true"
'''

[expected]
must_contain = [
    { subject = "secrets/api_key", predicate = "hardcoded", value = true, min_confidence = 0.8 },
]

must_not_contain = [
    # STRIPE_WEBHOOK_SECRET uses env var, should NOT be flagged
    { subject = "secrets/webhook_secret", predicate = "hardcoded", value = true },
    # DATABASE_URL uses env var with fallback, should NOT be flagged as hardcoded
    { subject = "secrets/database_url", predicate = "hardcoded", value = true },
]

acceptable_variants = [
    { subject = "stripe/api_key", predicate = "exposed", value = true },
    { subject = "payment/secret", predicate = "hardcoded", value = true },
]

[scoring]
weight = 1.5  # Security-critical, weighted higher
min_confidence = 0.8
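
The fixture notes refer to an entropy above 4.0. Assuming this means Shannon entropy in bits per character (the notes do not say which measure is intended), it can be computed as in the sketch below; `shannon_entropy` is a hypothetical helper.

```rust
use std::collections::HashMap;

/// Shannon entropy of a string in bits per character.
/// Returns 0.0 for an empty string.
pub fn shannon_entropy(text: &str) -> f64 {
    let mut counts: HashMap<char, usize> = HashMap::new();
    for ch in text.chars() {
        *counts.entry(ch).or_insert(0) += 1;
    }
    let total = text.chars().count() as f64;
    if total == 0.0 {
        return 0.0;
    }
    counts
        .values()
        .map(|&count| {
            let p = count as f64 / total;
            -p * p.log2()
        })
        .sum()
}
```

Under this definition, the 60-character `STRIPE_API_KEY` value in the fixture comes out at roughly 5.3 bits per character, comfortably above the 4.0 mentioned in the notes, while a string of one repeated character scores 0.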

Last updated: 2026-02-05