LLM Prompt Evaluation System

Status: Proposed (2026-02-05)
Phase: 7.8 (extends Phase 7.5 LLM-in-the-Loop Extraction)
Author: Architecture Team

Problem Statement

Aphoria's LLM-powered claim extraction (Phase 7.5) uses Gemini to extract security claims from high-value code files. The prompts that drive this extraction are effectively code that we don't treat like code:

Aspect	Traditional Code	Current Prompts
Version Control	Git commits	In files, but no semantic versioning
Testing	Unit/integration tests	None
Metrics	Coverage, performance	None
Regression Detection	CI failures	None
Quality Gates	Linting, review	None

The result: when we change a prompt, we have no systematic way to tell whether extraction quality improved or regressed. We're flying blind.

Enterprise Requirements

For enterprise adoption, customers need assurance that:

Prompts produce consistent, high-quality results - Not random outputs
Changes are validated before deployment - Regressions are caught
Performance is measurable - Precision, recall, cost are tracked
The system improves over time - With evidence, not hope

Goals

Primary Goals

Observability - Understand prompt effectiveness through metrics and logging
Testability - Validate prompts against known scenarios with expected outcomes
Repeatability - Run evaluations consistently across environments
Automation - Scheduled jobs that detect regressions without human intervention

Non-Goals (Phase 7.8)

Real-time prompt optimization (future: Phase 9)
A/B testing in production (future: Phase 9)
Multi-model comparison (future)
Prompt compression/optimization (future)

Architecture Overview

┌──────────────────────────────────────────────────────────────────────────────┐
│                     LLM Prompt Evaluation System                              │
├──────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ┌─────────────────┐                                                         │
│  │  Golden Corpus  │  Test fixtures with expected outcomes                   │
│  │  (fixtures/)    │  - Code snippets                                        │
│  │                 │  - Expected claims (must-contain, must-not-contain)     │
│  └────────┬────────┘  - Metadata (language, category, difficulty)            │
│           │                                                                  │
│           ▼                                                                  │
│  ┌─────────────────┐                                                         │
│  │   Evaluation    │  Orchestrates test runs                                 │
│  │   Harness       │  - Loads fixtures                                       │
│  │                 │  - Invokes LLM Extractor                                │
│  │                 │  - Compares outputs                                     │
│  │                 │  - Computes metrics                                     │
│  └────────┬────────┘                                                         │
│           │                                                                  │
│           ├──────────────────────┐                                           │
│           │                      │                                           │
│           ▼                      ▼                                           │
│  ┌─────────────────┐    ┌─────────────────┐                                  │
│  │  LLM Extractor  │    │  Observation    │                                  │
│  │  (instrumented) │    │  Log            │                                  │
│  │                 │    │                 │                                  │
│  │  - Same code    │    │  - Prompt ver   │                                  │
│  │    path as      │    │  - Input hash   │                                  │
│  │    production   │    │  - Output       │                                  │
│  │                 │    │  - Latency      │                                  │
│  │                 │    │  - Tokens       │                                  │
│  └─────────────────┘    │  - Model        │                                  │
│                         │  - Timestamp    │                                  │
│                         └────────┬────────┘                                  │
│                                  │                                           │
│                                  ▼                                           │
│  ┌─────────────────────────────────────────────────────────────────────────┐ │
│  │                         Metrics & Reports                               │ │
│  │                                                                         │ │
│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐     │ │
│  │  │  Precision  │  │   Recall    │  │  F1 Score   │  │    Cost     │     │ │
│  │  │  TP/(TP+FP) │  │ TP/(TP+FN)  │  │  Harmonic   │  │  Tokens/$   │     │ │
│  │  └─────────────┘  └─────────────┘  └─────────────┘  └─────────────┘     │ │
│  │                                                                         │ │
│  │  ┌────────────────────────────────────────────────────────────────────┐ │ │
│  │  │  Regression Report                                                 │ │ │
│  │  │  - Comparison against baseline                                     │ │ │
│  │  │  - Per-fixture deltas                                              │ │ │
│  │  │  - Category breakdown                                              │ │ │
│  │  │  - Recommendations                                                 │ │ │
│  │  └────────────────────────────────────────────────────────────────────┘ │ │
│  └─────────────────────────────────────────────────────────────────────────┘ │
│                                                                              │
└──────────────────────────────────────────────────────────────────────────────┘

Core Components

1. Golden Corpus

A curated set of test fixtures with known expected outcomes.

Fixture Format

# fixtures/tls/disabled_verification.toml

[metadata]
id = "tls-001"
name = "TLS verification disabled in Python requests"
category = "tls"
subcategory = "certificate_verification"
language = "python"
difficulty = "easy"  # easy | medium | hard
source = "hand-curated"  # hand-curated | production-capture | synthetic
created = "2026-02-05"
updated = "2026-02-05"

[input]
filename = "client.py"
content = '''
import requests

def fetch_data(url):
    # Disable SSL verification for internal services
    response = requests.get(url, verify=False)
    return response.json()
'''

[expected]
# Claims that MUST be extracted (recall)
must_contain = [
    { subject = "tls/cert_verification", predicate = "enabled", value = false },
]

# Claims that MUST NOT be extracted (precision)
must_not_contain = [
    { subject = "tls/cert_verification", predicate = "enabled", value = true },
]

# Optional: acceptable alternate formulations
acceptable_variants = [
    { subject = "ssl/verify", predicate = "enabled", value = false },
    { subject = "requests/ssl_verify", predicate = "value", value = false },
]

[scoring]
# How to score this fixture
weight = 1.0  # Importance multiplier
min_confidence = 0.7  # Expected minimum confidence

Corpus Organization

applications/aphoria/tests/llm_fixtures/
├── README.md                    # Corpus documentation
├── manifest.toml                # Index of all fixtures
├── tls/
│   ├── disabled_verification.toml
│   ├── deprecated_version.toml
│   └── pinning_bypass.toml
├── jwt/
│   ├── alg_none.toml
│   ├── skip_signature.toml
│   └── hardcoded_secret.toml
├── secrets/
│   ├── api_key_in_code.toml
│   ├── password_hardcoded.toml
│   └── high_entropy_token.toml
├── auth/
│   ├── bypass_pattern.toml
│   └── debug_header.toml
├── negative/                    # Files that should NOT trigger claims
│   ├── safe_tls_config.toml
│   ├── proper_jwt_validation.toml
│   └── env_var_secrets.toml
└── edge_cases/
    ├── empty_file.toml
    ├── binary_content.toml
    ├── huge_file.toml
    └── mixed_languages.toml

Manifest Structure

# manifest.toml

[corpus]
version = "1.0.0"
created = "2026-02-05"
description = "Golden corpus for LLM extraction evaluation"

[categories]
tls = { fixtures = 12, description = "TLS/SSL configuration" }
jwt = { fixtures = 8, description = "JWT authentication" }
secrets = { fixtures = 15, description = "Hardcoded secrets" }
auth = { fixtures = 6, description = "Authentication bypass" }
negative = { fixtures = 10, description = "Safe code (no claims expected)" }
edge_cases = { fixtures = 5, description = "Boundary conditions" }

[baseline]
# Current known-good metrics
precision = 0.85
recall = 0.78
f1 = 0.81
total_fixtures = 56
last_updated = "2026-02-05"
prompt_version = "1.0.0"
model = "gemini-2.0-flash"
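The manifest repeats the corpus size in two places: the per-category counts and `[baseline].total_fixtures`. A fixture validation step could check that they agree. The sketch below uses the counts from the manifest above; the function name is hypothetical and TOML parsing is left out.

```rust
/// Per-category fixture counts, copied from the `[categories]` table above.
pub const CATEGORY_COUNTS: [(&str, usize); 6] = [
    ("tls", 12),
    ("jwt", 8),
    ("secrets", 15),
    ("auth", 6),
    ("negative", 10),
    ("edge_cases", 5),
];

/// Hypothetical check: the declared total must equal the per-category sum.
pub fn manifest_total_is_consistent(
    counts: &[(&str, usize)],
    declared_total: usize,
) -> Result<(), String> {
    let actual: usize = counts.iter().map(|(_, n)| n).sum();
    if actual == declared_total {
        Ok(())
    } else {
        Err(format!(
            "manifest declares {} fixtures but categories sum to {}",
            declared_total, actual
        ))
    }
}
```

With the values above the categories sum to 56, which matches `total_fixtures = 56`.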

2. Observation Log

Every LLM extraction is logged with full context for replay and analysis.

Log Entry Schema

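// Imports assumed by the derives and field types below. serde, chrono, and
// uuid are external crates (chrono and uuid with their serde features enabled).
use chrono::{DateTime, Utc};
use serde::{Deserialize, Serialize};
use uuid::Uuid;

/// A single observation from an LLM extraction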
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ExtractionObservation {
    /// Unique identifier for this observation
    pub id: Uuid,

    /// When this extraction occurred
    pub timestamp: DateTime<Utc>,

    /// Prompt version (semantic version)
    pub prompt_version: String,

    /// Model identifier (e.g., "gemini-2.0-flash")
    pub model: String,

    /// BLAKE3 hash of input content (for deduplication)
    pub input_hash: String,

    /// Input metadata
    pub input: ExtractionInput,

    /// Output from LLM
    pub output: ExtractionOutput,

    /// Performance metrics
    pub metrics: ExtractionMetrics,

    /// Evaluation context (if run during evaluation)
    pub evaluation: Option<EvaluationContext>,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ExtractionInput {
    /// Filename (may be anonymized)
    pub filename: String,

    /// Language detected
    pub language: String,

    /// Content length in bytes
    pub content_length: usize,

    /// Content preview (first 500 chars, for debugging)
    pub content_preview: Option<String>,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ExtractionOutput {
    /// Raw LLM response
    pub raw_response: String,

    /// Parsed claims (may be empty if parsing failed)
    pub claims: Vec<ExtractedClaim>,

    /// Whether parsing succeeded
    pub parse_success: bool,

    /// Parse error if any
    pub parse_error: Option<String>,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ExtractionMetrics {
    /// Total latency (API call + processing)
    pub latency_ms: u64,

    /// API latency only
    pub api_latency_ms: u64,

    /// Input tokens
    pub input_tokens: u32,

    /// Output tokens
    pub output_tokens: u32,

    /// Total tokens
    pub total_tokens: u32,

    /// Estimated cost (USD)
    pub estimated_cost_usd: f64,

    /// Cache hit (if response was cached)
    pub cache_hit: bool,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct EvaluationContext {
    /// Fixture ID if from golden corpus
    pub fixture_id: Option<String>,

    /// Evaluation run ID
    pub run_id: Uuid,

    /// Whether this matched expected output
    pub matched_expected: bool,

    /// Detailed match results
    pub match_details: MatchDetails,
}
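The schema stores `content_length` in bytes but limits `content_preview` to 500 characters, so the two must be computed differently. A minimal sketch of the preview, assuming that empty input produces no preview (the schema does not say):

```rust
/// Maximum preview length in characters, per the `content_preview` doc comment.
pub const PREVIEW_CHARS: usize = 500;

/// Build the optional preview. Slicing by byte index (`&content[..500]`) can
/// panic in the middle of a multi-byte UTF-8 character, so this counts chars.
pub fn content_preview(content: &str) -> Option<String> {
    if content.is_empty() {
        return None; // assumption: nothing to preview for empty input
    }
    Some(content.chars().take(PREVIEW_CHARS).collect())
}

/// `content_length` is a byte count, which is what `str::len` returns.
pub fn content_length(content: &str) -> usize {
    content.len()
}
```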

Log Storage

~/.aphoria/eval/
├── observations/
│   ├── 2026-02-05/
│   │   ├── 001_tls-001_success.json
│   │   ├── 002_jwt-003_partial.json
│   │   └── ...
│   └── 2026-02-04/
│       └── ...
├── runs/
│   ├── run_abc123.json        # Full evaluation run metadata
│   └── run_def456.json
└── baselines/
    ├── v1.0.0.json            # Baseline for prompt v1.0.0
    └── latest.json            # Symlink to current baseline
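The file names in the listing suggest the pattern `<sequence>_<fixture id>_<outcome>.json` under a per-day directory. The sketch below builds such a path. The pattern is inferred from the two example names, and the `Failure` outcome and the function name are assumptions.

```rust
use std::path::{Path, PathBuf};

/// Outcome suffix used in observation file names. `Success` and `Partial`
/// appear in the listing above; `Failure` is an assumed third case.
#[derive(Debug, Clone, Copy)]
pub enum Outcome {
    Success,
    Partial,
    Failure,
}

impl Outcome {
    fn as_str(self) -> &'static str {
        match self {
            Outcome::Success => "success",
            Outcome::Partial => "partial",
            Outcome::Failure => "failure",
        }
    }
}

/// Build `<root>/observations/<date>/<seq>_<fixture>_<outcome>.json`.
/// `date` is expected in `YYYY-MM-DD` form; `seq` is zero-padded to 3 digits.
pub fn observation_path(
    root: &Path,
    date: &str,
    seq: u32,
    fixture_id: &str,
    outcome: Outcome,
) -> PathBuf {
    root.join("observations")
        .join(date)
        .join(format!("{:03}_{}_{}.json", seq, fixture_id, outcome.as_str()))
}
```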

3. Evaluation Harness

The core engine that runs evaluations.

Public API

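// Imports assumed by the types below; serde, chrono, and uuid are external crates.
use std::collections::HashMap;
use std::path::PathBuf;

use chrono::{DateTime, Utc};
use serde::{Deserialize, Serialize};
use uuid::Uuid;

/// Configuration for an evaluation run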
#[derive(Debug, Clone)]
pub struct EvalConfig {
    /// Path to fixtures directory
    pub fixtures_dir: PathBuf,

    /// Which categories to evaluate (None = all)
    pub categories: Option<Vec<String>>,

    /// Maximum fixtures to run (for quick smoke tests)
    pub max_fixtures: Option<usize>,

    /// Whether to use real LLM or mock
    pub mode: EvalMode,

    /// Baseline to compare against
    pub baseline: Option<PathBuf>,

    /// Output directory for results
    pub output_dir: PathBuf,

    /// Whether to save observations
    pub save_observations: bool,

    /// Parallelism (concurrent LLM calls)
    pub parallelism: usize,
}

#[derive(Debug, Clone)]
pub enum EvalMode {
    /// Use real LLM API (costs money, tests actual prompt)
    Live {
        model: String,
        temperature: f32,
    },
    /// Use cached responses (fast, deterministic, for CI)
    Cached,
    /// Use mock responses (for testing harness itself)
    Mock,
}

/// Result of an evaluation run
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct EvalResult {
    /// Unique run identifier
    pub run_id: Uuid,

    /// When the run started
    pub started_at: DateTime<Utc>,

    /// When the run completed
    pub completed_at: DateTime<Utc>,

    /// Configuration used
    pub config: EvalConfigSummary,

    /// Aggregate metrics
    pub metrics: AggregateMetrics,

    /// Per-fixture results
    pub fixture_results: Vec<FixtureResult>,

    /// Comparison with baseline (if baseline provided)
    pub baseline_comparison: Option<BaselineComparison>,

    /// Overall verdict
    pub verdict: EvalVerdict,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct AggregateMetrics {
    /// Precision: TP / (TP + FP)
    pub precision: f64,

    /// Recall: TP / (TP + FN)
    pub recall: f64,

    /// F1 score: 2 * (P * R) / (P + R)
    pub f1: f64,

    /// Total fixtures evaluated
    pub total_fixtures: usize,

    /// Fixtures that passed
    pub passed: usize,

    /// Fixtures that failed
    pub failed: usize,

    /// Fixtures that errored (LLM call failed, parse failed, etc.)
    pub errored: usize,

    /// Total cost (USD)
    pub total_cost_usd: f64,

    /// Total tokens used
    pub total_tokens: u64,

    /// Average latency (ms)
    pub avg_latency_ms: f64,

    /// Per-category breakdown
    pub by_category: HashMap<String, CategoryMetrics>,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub enum EvalVerdict {
    /// All checks passed
    Pass,
    /// Some regressions detected
    Regression { details: Vec<String> },
    /// Evaluation failed (errors prevented completion)
    Error { message: String },
}
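The formulas in the `AggregateMetrics` doc comments can be sketched directly from true-positive, false-positive, and false-negative counts. The `Counts` type and the choice to return `None` for an undefined ratio are assumptions. An undefined recall, for example, arises for a category where no claims are expected.

```rust
/// Raw outcome counts accumulated over a set of fixtures.
#[derive(Debug, Clone, Copy, Default)]
pub struct Counts {
    pub true_pos: u32,
    pub false_pos: u32,
    pub false_neg: u32,
}

/// `None` when the denominator is zero, i.e. the metric is undefined.
fn ratio(num: u32, den: u32) -> Option<f64> {
    if den == 0 {
        None
    } else {
        Some(f64::from(num) / f64::from(den))
    }
}

/// Precision: TP / (TP + FP)
pub fn precision(c: Counts) -> Option<f64> {
    ratio(c.true_pos, c.true_pos + c.false_pos)
}

/// Recall: TP / (TP + FN)
pub fn recall(c: Counts) -> Option<f64> {
    ratio(c.true_pos, c.true_pos + c.false_neg)
}

/// F1: 2 * (P * R) / (P + R), the harmonic mean of precision and recall.
pub fn f1(precision: f64, recall: f64) -> Option<f64> {
    let sum = precision + recall;
    if sum == 0.0 {
        None
    } else {
        Some(2.0 * precision * recall / sum)
    }
}
```

As a check against the manifest baseline, `f1(0.85, 0.78)` is about 0.8135, which rounds to the recorded `f1 = 0.81`.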

Claim Matching Logic

/// How to match extracted claims against expected claims
pub struct ClaimMatcher {
    /// Tolerance for confidence comparison
    pub confidence_tolerance: f32,

    /// Whether to normalize concept paths before comparison
    pub normalize_paths: bool,

    /// Predicate aliases (e.g., "enabled" == "active" == "on")
    pub predicate_aliases: HashMap<String, Vec<String>>,

    /// Value equivalences (e.g., true == "true" == "yes" == 1)
    pub value_equivalences: Vec<Vec<ObjectValue>>,
}

impl ClaimMatcher {
    /// Check if extracted claims satisfy must_contain requirements
    pub fn check_must_contain(
        &self,
        extracted: &[ExtractedClaim],
        expected: &[ExpectedClaim],
    ) -> MatchResult {
        // For each expected claim:
        // 1. Find matching extracted claim (subject + predicate match)
        // 2. Check value compatibility
        // 3. Check confidence threshold
        // Return: matched, unmatched, partial matches
        todo!("matching logic not yet implemented; steps outlined above")
    }

    /// Check if extracted claims violate must_not_contain requirements
    pub fn check_must_not_contain(
        &self,
        extracted: &[ExtractedClaim],
        forbidden: &[ExpectedClaim],
    ) -> MatchResult {
        // For each forbidden claim:
        // 1. Check if any extracted claim matches
        // 2. Flag violations
        todo!("violation check not yet implemented; steps outlined above")
    }
}
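The two checks outlined above can be illustrated with a simplified, exact-match version. This is a sketch only: the types are stand-ins for `ExtractedClaim` and `ExpectedClaim`, whose real definitions are not shown here. It ignores predicate aliases, path normalization, and `acceptable_variants`.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Bool(bool),
    Text(String),
    Number(f64),
}

/// Stand-in for `ExtractedClaim` (hypothetical shape).
#[derive(Debug, Clone)]
pub struct Extracted {
    pub subject: String,
    pub predicate: String,
    pub value: Value,
    pub confidence: f32,
}

/// Stand-in for `ExpectedClaim` (hypothetical shape).
#[derive(Debug, Clone)]
pub struct Expected {
    pub subject: String,
    pub predicate: String,
    pub value: Value,
}

fn same_claim(e: &Extracted, x: &Expected) -> bool {
    e.subject == x.subject && e.predicate == x.predicate && e.value == x.value
}

/// must_contain: expected claims with no matching extraction at or above
/// `min_confidence`. Each entry returned is a recall miss.
pub fn unmatched<'a>(
    extracted: &[Extracted],
    expected: &'a [Expected],
    min_confidence: f32,
) -> Vec<&'a Expected> {
    expected
        .iter()
        .filter(|x| {
            !extracted
                .iter()
                .any(|e| same_claim(e, x) && e.confidence >= min_confidence)
        })
        .collect()
}

/// must_not_contain: extracted claims that match a forbidden claim.
/// Each entry returned is a precision violation.
pub fn violations<'a>(extracted: &'a [Extracted], forbidden: &[Expected]) -> Vec<&'a Extracted> {
    extracted
        .iter()
        .filter(|e| forbidden.iter().any(|x| same_claim(e, x)))
        .collect()
}
```

For the `tls-001` fixture, an extraction of `tls/cert_verification enabled = false` at confidence 0.9 leaves both result lists empty, so the fixture passes. The same claim at confidence 0.5 would appear in `unmatched`, because the fixture sets `min_confidence = 0.7`.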

4. Prompt Versioning

Prompts are versioned to track changes and correlate with metrics.

Version Schema

/// Prompt version identifier
#[derive(Debug, Clone, Serialize, Deserialize, PartialEq, Eq, Hash)]
pub struct PromptVersion {
    /// Semantic version (major.minor.patch)
    pub version: String,

    /// BLAKE3 hash of prompt content
    pub content_hash: String,

    /// When this version was created
    pub created_at: DateTime<Utc>,

    /// Description of changes from previous version
    pub changelog: Option<String>,
}

impl PromptVersion {
    /// Compute version from prompt content
    pub fn from_prompt(prompt: &str, changelog: Option<String>) -> Self {
        let content_hash = blake3::hash(prompt.as_bytes()).to_hex().to_string();
        // Version is computed or provided externally
        Self {
            version: "0.0.0".to_string(), // Placeholder
            content_hash,
            created_at: Utc::now(),
            changelog,
        }
    }
}
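Because a `PromptVersion` carries both a semantic version and a content hash, the harness could flag a prompt whose text changed without a version bump. Such a change would make metrics hard to correlate with versions. The sketch below is one way to do that; both function names are hypothetical.

```rust
/// Parse "major.minor.patch" into a tuple; `None` if the string is malformed.
pub fn parse_version(v: &str) -> Option<(u32, u32, u32)> {
    let mut parts = v.split('.').map(|p| p.parse::<u32>().ok());
    let major = parts.next()??;
    let minor = parts.next()??;
    let patch = parts.next()??;
    if parts.next().is_some() {
        return None; // more than three components
    }
    Some((major, minor, patch))
}

/// `Some(true)` when the prompt content changed (different hash) but the
/// version did not increase. `None` if either version string is malformed.
pub fn bump_missing(
    prev_version: &str,
    prev_hash: &str,
    cur_version: &str,
    cur_hash: &str,
) -> Option<bool> {
    let prev = parse_version(prev_version)?;
    let cur = parse_version(cur_version)?;
    // Tuples compare lexicographically, so (1, 10, 0) > (1, 9, 0).
    Some(prev_hash != cur_hash && cur <= prev)
}
```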

Prompt File Structure

// llm/prompt.rs

/// Current prompt version
pub const PROMPT_VERSION: &str = "1.2.0";

/// Changelog for current version
pub const PROMPT_CHANGELOG: &str = "Improved JWT claim extraction accuracy";

/// The extraction prompt
pub const EXTRACTION_PROMPT: &str = r#"
You are a security analyst extracting implicit security claims from code.

Given the following code file, identify any security-relevant configurations,
settings, or patterns. For each finding, output a JSON object with:

- subject: The concept path (e.g., "tls/cert_verification")
- predicate: The aspect being claimed (e.g., "enabled")
- value: The value found (boolean, string, or number)
- confidence: Your confidence in this extraction (0.0 to 1.0)
- description: Brief explanation

Focus on:
- TLS/SSL configuration
- Authentication settings
- Cryptographic choices
- Secret/credential handling
- Input validation
- Authorization patterns

Output as a JSON array. If no security claims are found, output an empty array.

Code:
```{language}
{content}
```
"#;


5. Metrics & Reporting

Metrics Computed

Metric	Formula	Purpose
Precision	TP / (TP + FP)	Are we avoiding false positives?
Recall	TP / (TP + FN)	Are we finding all issues?
F1 Score	2 * (P * R) / (P + R)	Balanced measure
Confidence Calibration	Correlation(confidence, correctness)	Are high-confidence claims actually correct?
Parse Success Rate	Successful parses / Total extractions	Is the prompt producing valid JSON?
Cost per Fixture	Total cost / Fixtures	Budget tracking
Latency P50/P95	Percentiles	Performance tracking

Regression Detection

/// Compare current metrics against baseline
pub struct BaselineComparison {
    /// Current metrics
    pub current: AggregateMetrics,

    /// Baseline metrics
    pub baseline: AggregateMetrics,

    /// Deltas
    pub precision_delta: f64,
    pub recall_delta: f64,
    pub f1_delta: f64,

    /// Regression threshold: absolute drop in a 0-1 metric that counts as a regression
    pub regression_threshold: f64,  // e.g., 0.05 = a drop of 5 percentage points

    /// Fixtures that regressed
    pub regressed_fixtures: Vec<FixtureRegression>,

    /// Fixtures that improved
    pub improved_fixtures: Vec<FixtureImprovement>,
}

impl BaselineComparison {
    pub fn has_regression(&self) -> bool {
        self.precision_delta < -self.regression_threshold ||
        self.recall_delta < -self.regression_threshold ||
        self.f1_delta < -self.regression_threshold
    }
}
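To make the threshold semantics concrete, the sketch below applies the same comparison as `has_regression` to plain scores and reports which metrics regressed. The `Scores` type and function name are illustrative, not part of the proposed API.

```rust
#[derive(Debug, Clone, Copy)]
pub struct Scores {
    pub precision: f64,
    pub recall: f64,
    pub f1: f64,
}

/// Names of the metrics whose absolute drop exceeds `threshold`.
/// An empty result corresponds to `has_regression() == false`.
pub fn regressed_metrics(current: Scores, baseline: Scores, threshold: f64) -> Vec<&'static str> {
    let deltas = [
        ("precision", current.precision - baseline.precision),
        ("recall", current.recall - baseline.recall),
        ("f1", current.f1 - baseline.f1),
    ];
    deltas
        .iter()
        .filter(|(_, delta)| *delta < -threshold)
        .map(|(name, _)| *name)
        .collect()
}
```

With current scores of 0.87 / 0.76 / 0.81 against a baseline of 0.85 / 0.78 / 0.81, the recall delta is -0.02. That is inside a 0.05 threshold, so nothing is flagged. A stricter threshold of 0.01 would flag recall.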

Report Format

# Prompt Evaluation Report

**Run ID:** abc123
**Date:** 2026-02-05 14:30:00 UTC
**Prompt Version:** 1.2.0
**Model:** gemini-2.0-flash

## Summary

| Metric        | Current | Baseline | Delta | Status |
| ------------- | ------- | -------- | ----- | ------ |
| Precision     | 0.87    | 0.85     | +0.02 | ✅     |
| Recall        | 0.76    | 0.78     | -0.02 | ⚠️     |
| F1 Score      | 0.81    | 0.81     | +0.00 | ✅     |
| Parse Success | 98%     | 97%      | +1%   | ✅     |

**Verdict:** ⚠️ REVIEW - Recall dropped by 0.02 (2 percentage points)

## Cost Analysis

- Total tokens: 125,430
- Estimated cost: $0.12
- Cost per fixture: $0.002

## Regressions

### jwt-003: JWT algorithm none detection

- **Expected:** `jwt/algorithm = "none"` with confidence > 0.8
- **Got:** Not extracted
- **Impact:** High (security-critical)

## Improvements

### tls-007: TLS version in constants

- **Previously:** Not extracted
- **Now:** `tls/min_version = "1.0"` with confidence 0.85
- **Impact:** Medium

## Category Breakdown

| Category | Fixtures | Passed | Failed | Precision | Recall |
| -------- | -------- | ------ | ------ | --------- | ------ |
| tls      | 12       | 11     | 1      | 0.92      | 0.91   |
| jwt      | 8        | 6      | 2      | 0.75      | 0.75   |
| secrets  | 15       | 14     | 1      | 0.93      | 0.87   |
| auth     | 6        | 6      | 0      | 1.00      | 0.83   |
| negative | 10       | 10     | 0      | 1.00      | N/A    |

6. Jobs & Automation

CI Job (Per-PR)

# .github/workflows/prompt-eval-smoke.yml
name: Prompt Evaluation (Smoke)

on:
    pull_request:
        paths:
            - "applications/aphoria/src/llm/**"
            - "applications/aphoria/tests/llm_fixtures/**"

jobs:
    eval:
        runs-on: ubuntu-latest
        steps:
            - uses: actions/checkout@v4

            - name: Run smoke test
              env:
                  GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}
              run: |
                  cargo run -p aphoria -- eval prompts \
                    --mode cached \
                    --max-fixtures 20 \
                    --categories tls,jwt,secrets \
                    --baseline tests/llm_fixtures/baselines/latest.json \
                    --fail-on-regression

            - name: Upload report
              uses: actions/upload-artifact@v4
              with:
                  name: eval-report
                  path: eval-report.md

Characteristics:

Runs on PR that touches prompt code or fixtures
Uses cached responses (fast, deterministic)
Limited to 20 fixtures (smoke test)
Fails if regression detected

Nightly Job (Full Evaluation)

# .github/workflows/prompt-eval-nightly.yml
name: Prompt Evaluation (Full)

on:
    schedule:
        - cron: "0 3 * * *" # 3am UTC daily
    workflow_dispatch:

jobs:
    eval:
        runs-on: ubuntu-latest
        steps:
            - uses: actions/checkout@v4

            - name: Run full evaluation
              env:
                  GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}
              run: |
                  cargo run -p aphoria -- eval prompts \
                    --mode live \
                    --model gemini-2.0-flash \
                    --temperature 0 \
                    --baseline tests/llm_fixtures/baselines/latest.json \
                    --output-dir ./eval-results \
                    --save-observations \
                    --fail-on-regression

            - name: Update baseline if improved
              run: |
                  # If F1 improved by > 2%, update baseline
                  ./scripts/maybe-update-baseline.sh

            - name: Upload results
              # Upload even when the evaluation step failed on a regression
              if: always()
              uses: actions/upload-artifact@v4
              with:
                  name: eval-results
                  path: eval-results/

            - name: Post to Slack
              if: failure()
              uses: slackapi/slack-github-action@v1
              with:
                  payload: |
                      {
                        "text": "⚠️ Prompt evaluation regression detected",
                        "attachments": [...]
                      }

Characteristics:

Runs nightly at 3am UTC
Uses live LLM API (real evaluation)
Full corpus coverage
Updates baseline if metrics improve significantly
Alerts on regression
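The baseline auto-update rule ("F1 improved by > 2%") could be sketched as follows. Two points are assumptions, because the workflow does not settle them: the 2% is read as an absolute gain of 0.02, and the update is refused if precision or recall dropped. The second condition keeps an F1 gain from hiding a trade-off.

```rust
#[derive(Debug, Clone, Copy)]
pub struct Scores {
    pub precision: f64,
    pub recall: f64,
    pub f1: f64,
}

/// Decide whether the nightly run should replace the stored baseline.
pub fn should_update_baseline(current: Scores, baseline: Scores, min_f1_gain: f64) -> bool {
    let f1_gain = current.f1 - baseline.f1;
    // Assumption: never promote a run that lost precision or recall.
    let no_tradeoff =
        current.precision >= baseline.precision && current.recall >= baseline.recall;
    f1_gain > min_f1_gain && no_tradeoff
}
```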

On-Demand Job (Prompt Iteration)

# For prompt development: compare two versions
aphoria eval prompts \
  --mode live \
  --prompt-file ./prompts/experimental.txt \
  --baseline ./baselines/current.json \
  --output-dir ./eval-comparison \
  --verbose

# View comparison
cat ./eval-comparison/comparison.md

CLI Interface

USAGE:
    aphoria eval prompts [OPTIONS]

OPTIONS:
    --fixtures-dir <DIR>       Path to fixtures directory [default: tests/llm_fixtures]
    --categories <LIST>        Categories to evaluate (comma-separated)
    --max-fixtures <N>         Maximum fixtures to run
    --mode <MODE>              Evaluation mode: live, cached, mock [default: cached]
    --model <MODEL>            Model to use (live mode only) [default: gemini-2.0-flash]
    --temperature <TEMP>       Temperature (live mode only) [default: 0]
    --prompt-file <FILE>       Prompt file to evaluate in place of the built-in prompt
    --baseline <FILE>          Baseline to compare against
    --output-dir <DIR>         Output directory for results [default: ./eval-results]
    --save-observations        Save observation logs
    --fail-on-regression       Exit with code 1 if regression detected
    --regression-threshold <N> Threshold for regression, as an absolute metric drop [default: 0.05]
    --verbose                  Verbose output
    --json                     Output results as JSON

SUBCOMMANDS:
    aphoria eval prompts show-baseline    Show current baseline metrics
    aphoria eval prompts update-baseline  Update baseline from latest run
    aphoria eval prompts list-fixtures    List available fixtures
    aphoria eval prompts add-fixture      Add a new fixture interactively
    aphoria eval prompts validate-fixtures  Validate fixture format
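One way `--fail-on-regression` could map an `EvalVerdict` to a process exit code is sketched below. The enum mirrors the one in the Public API section. Exit code 1 for regressions comes from the option description. Exit code 2 for evaluation errors is an assumption, since the option list does not specify it.

```rust
/// Mirrors the `EvalVerdict` enum from the Public API section.
#[derive(Debug, Clone)]
pub enum EvalVerdict {
    Pass,
    Regression { details: Vec<String> },
    Error { message: String },
}

/// Map a verdict to an exit code for the `eval prompts` command.
pub fn exit_code(verdict: &EvalVerdict, fail_on_regression: bool) -> i32 {
    match verdict {
        EvalVerdict::Pass => 0,
        // Only fail the process when the caller opted in.
        EvalVerdict::Regression { .. } if fail_on_regression => 1,
        EvalVerdict::Regression { .. } => 0,
        // Assumption: errors always fail, with a code distinct from regressions.
        EvalVerdict::Error { .. } => 2,
    }
}
```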

Implementation Plan

Phase 7.8.1: Core Infrastructure (Week 1)

Task	Description	Effort
Fixture format	Define TOML schema, parser	2d
Observation log	Schema, writer, reader	1d
Claim matcher	Matching logic with fuzzy support	2d
Prompt versioning	Version extraction, tracking	1d

Deliverable: Can load fixtures, run extractions, compare outputs

Phase 7.8.2: Evaluation Harness (Week 2)

Task	Description	Effort
Evaluation harness	Orchestration, parallelism	2d
Metrics computation	Precision, recall, F1, cost	1d
Baseline comparison	Regression detection	1d
Report generation	Markdown, JSON output	1d

Deliverable: Can run full evaluation and generate report

Phase 7.8.3: Golden Corpus (Week 2-3)

Task	Description	Effort
Seed fixtures (20)	Hand-curated test cases	2d
Negative fixtures (10)	Safe code that shouldn't trigger	1d
Edge case fixtures (5)	Boundary conditions	1d
Baseline establishment	Initial metrics snapshot	1d

Deliverable: 35+ fixtures covering core categories

Phase 7.8.4: CI Integration (Week 3)

Task	Description	Effort
Smoke test workflow	Per-PR cached evaluation	1d
Nightly workflow	Full live evaluation	1d
Baseline auto-update	Script for improvement detection	1d
Alerting	Slack/email on regression	0.5d

Deliverable: Automated evaluation in CI

Phase 7.8.5: CLI & Documentation (Week 4)

Task	Description	Effort
CLI commands	`eval prompts` subcommands	2d
Documentation	Usage guide, fixture authoring	1d
Skill update	`/aphoria-dev` skill update	0.5d

Deliverable: Production-ready tooling

Open Questions

1. Where do we store baseline metrics?

Options:

In repository (tests/llm_fixtures/baselines/) - Simple, versioned with code
External artifact store - Separates metrics from code
Database - For historical tracking

Recommendation: Start with the repository; migrate to an external store once historical tracking is needed.

2. How strict should matching be?

Options:

Exact match - Same subject, predicate, value (brittle)
Structural match - Same concept, fuzzy value (looser)
Semantic match - Embeddings-based similarity (complex)

Recommendation: Structural match with configurable fuzzy value matching.
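A minimal sketch of what configurable fuzzy value matching might look like is shown below. It normalizes loosely typed boolean spellings before comparison, following the `true == "true" == "yes" == 1` example given for `value_equivalences`. The falsy spellings are assumed by symmetry, and the `Value` type is a stand-in for `ObjectValue`.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Bool(bool),
    Text(String),
    Number(f64),
}

/// Map boolean-like spellings to a canonical `Value::Bool`; leave anything
/// else untouched so version strings such as "1.0" stay text.
pub fn normalize(v: &Value) -> Value {
    match v {
        Value::Text(s) => match s.trim().to_ascii_lowercase().as_str() {
            "true" | "yes" => Value::Bool(true),
            "false" | "no" => Value::Bool(false), // assumed by symmetry
            _ => v.clone(),
        },
        Value::Number(n) if *n == 1.0 => Value::Bool(true),
        Value::Number(n) if *n == 0.0 => Value::Bool(false),
        _ => v.clone(),
    }
}

/// Two values are equivalent when their normalized forms are equal.
pub fn values_equivalent(a: &Value, b: &Value) -> bool {
    normalize(a) == normalize(b)
}
```

Treating the numbers 1 and 0 as booleans is convenient for flags but wrong for numeric settings, which is one reason to keep the equivalence classes configurable per fixture or per predicate.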

3. Mock vs Live in CI?

Options:

Always mock - Fast, free, deterministic, tests harness not prompt
Always live - Expensive, slow, tests actual prompt
Hybrid - Cached responses for per-PR smoke tests, live for nightly

Recommendation: Hybrid approach. Cached (deterministic) for PR, live for nightly.

4. How do we handle model version changes?

The model behind a Gemini identifier may be updated by the provider, causing output drift even without prompt changes.

Options:

Pin model version (if the API supports it)
Track model version in baseline, re-baseline on model change
Alert when model version changes

Recommendation: Track model version, require manual baseline update on change.
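The recommendation amounts to a guard that runs before any baseline comparison. The sketch below compares the `model` recorded in the baseline with the model used for the current run. The type and function names are illustrative.

```rust
#[derive(Debug, PartialEq)]
pub enum BaselineCheck {
    /// Same model: deltas against the baseline are meaningful.
    Comparable,
    /// Different model: the baseline must be re-established manually.
    ModelChanged { baseline: String, current: String },
}

/// Compare the model recorded in the baseline with the one used for this run.
pub fn check_model(baseline_model: &str, current_model: &str) -> BaselineCheck {
    if baseline_model == current_model {
        BaselineCheck::Comparable
    } else {
        BaselineCheck::ModelChanged {
            baseline: baseline_model.to_string(),
            current: current_model.to_string(),
        }
    }
}
```

This only catches a change in the identifier string. If the provider changes the model behind an unchanged identifier, the check sees nothing, and drift would surface only through the nightly metrics.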

5. What's the corpus growth strategy?

Options:

Hand-curate only (high quality, slow growth)
Production capture with review (faster growth, needs tooling)
Synthetic generation (fast, may not reflect reality)

Recommendation: Start hand-curated, add production capture tooling in Phase 9.

Success Metrics

Metric	Target	Measurement
Regression detection rate	100%	Simulated regressions caught
False positive rate (regression alerts)	< 5%	Manual review of alerts
Prompt iteration cycle time	< 30 min	Time from change to evaluation result
Corpus coverage	> 50 fixtures	Fixture count
CI job duration (smoke)	< 2 min	Workflow timing
CI job duration (nightly)	< 15 min	Workflow timing

Related Documents

LLM-in-the-Loop Extraction - Phase 7.5 implementation
Pattern Learning Store - Phase 7.6 implementation
LLM Extractor Code - Current implementation

Appendix: Example Fixture

# fixtures/secrets/high_entropy_api_key.toml

[metadata]
id = "secrets-005"
name = "High entropy API key in Python config"
category = "secrets"
subcategory = "api_keys"
language = "python"
difficulty = "medium"
source = "hand-curated"
created = "2026-02-05"
updated = "2026-02-05"
notes = """
Tests detection of high-entropy strings that look like API keys.
The entropy of 'sk_live_abc123...' is > 4.0 which should trigger detection.
"""

[input]
filename = "config.py"
content = '''
import os

# Configuration for payment processing
STRIPE_API_KEY = "sk_live_51ABC123DEF456GHI789JKL012MNO345PQR678STU901VWX234YZ"
STRIPE_WEBHOOK_SECRET = os.environ.get("STRIPE_WEBHOOK_SECRET")

DATABASE_URL = os.environ.get("DATABASE_URL", "postgresql://localhost/app")
DEBUG = os.environ.get("DEBUG", "false").lower() == "true"
'''

[expected]
must_contain = [
    { subject = "secrets/api_key", predicate = "hardcoded", value = true, min_confidence = 0.8 },
]

must_not_contain = [
    # STRIPE_WEBHOOK_SECRET uses env var, should NOT be flagged
    { subject = "secrets/webhook_secret", predicate = "hardcoded", value = true },
    # DATABASE_URL uses env var with fallback, should NOT be flagged as hardcoded
    { subject = "secrets/database_url", predicate = "hardcoded", value = true },
]

acceptable_variants = [
    { subject = "stripe/api_key", predicate = "exposed", value = true },
    { subject = "payment/secret", predicate = "hardcoded", value = true },
]

[scoring]
weight = 1.5  # Security-critical, weighted higher
min_confidence = 0.8
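The fixture notes refer to an entropy above 4.0. The fixture does not say which measure is meant, so the sketch below assumes character-level Shannon entropy in bits per character and shows how that figure could be checked for the key in this fixture.

```rust
use std::collections::HashMap;

/// Shannon entropy of `s` in bits per character: -sum(p * log2(p)) over the
/// relative frequency `p` of each distinct character. Empty input yields 0.
pub fn shannon_entropy(s: &str) -> f64 {
    let mut counts: HashMap<char, usize> = HashMap::new();
    let mut total = 0usize;
    for c in s.chars() {
        *counts.entry(c).or_insert(0) += 1;
        total += 1;
    }
    if total == 0 {
        return 0.0;
    }
    counts
        .values()
        .map(|&n| {
            let p = n as f64 / total as f64;
            -p * p.log2()
        })
        .sum()
}
```

Applied to the `STRIPE_API_KEY` literal above (60 characters, most of them distinct), the result is comfortably above the 4.0 mark. A repetitive string such as `"aaaa"` scores 0.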

Last updated: 2026-02-05
