stemedb/batteries/pre-aphoria.md

# Pre-Aphoria Validation Battery

**Purpose:** Verify stemedb behaves as documented before building ConceptPath and Aphoria on top of it. Every test maps to a claim the product makes or a code path Aphoria depends on.

**Test file:** `crates/stemedb-query/tests/battery_pre_aphoria.rs`

---

## Battery 1: The Semaglutide Scenario

Reproduces the exact example from `what-is-episteme.md`. Four sources, three source classes (Regulatory, Clinical, Anecdotal), one subject, conflicting claims. If this doesn't work, the product demo fails.

### 1.1 `test_semaglutide_four_sources_ingest_and_query`

Setup:
- Agent A signs: subject=`Semaglutide`, predicate=`has_side_effect`, object=`Text("gastroparesis_warning")`, source_class=Regulatory, confidence=1.0, timestamp=T
- Agent B signs: subject=`Semaglutide`, predicate=`has_side_effect`, object=`Text("no_gastroparesis_signal")`, source_class=Clinical, confidence=0.9, timestamp=T+1
- Agent C signs: subject=`Semaglutide`, predicate=`has_side_effect`, object=`Text("gastroparesis")`, source_class=Anecdotal, confidence=0.2, timestamp=T+2
- Agent D signs: subject=`Semaglutide`, predicate=`has_side_effect`, object=`Text("no_gastroparesis_signal")`, source_class=Clinical, confidence=0.9, timestamp=T+3

Ingest all four through WAL + IngestWorker.

Assert:
- All four assertions are stored (query with no lens returns 4 results)
- Authority lens (TrustAwareAuthority) winner is the Regulatory assertion (FDA)
- Recency lens winner is Agent D (most recent)
- Consensus lens groups by object value: `"no_gastroparesis_signal"` has 2 assertions; `"gastroparesis_warning"` and `"gastroparesis"` are distinct values with 1 assertion each

### 1.2 `test_semaglutide_skeptic_analysis`

Using the same four assertions from 1.1:

Assert:
- Skeptic lens `analyze()` returns `ConflictAnalysis` with:
  - `candidates_count` = 4
  - `claims.len()` >= 2 (at least two distinct object values)
  - `status` = `Contested` (conflict_score >= 0.4)
  - `conflict_score` >= 0.4 (matches the `Contested` threshold; there is real disagreement between object values)
  - The claim with object `"no_gastroparesis_signal"` has `assertion_count` = 2
  - Claims are sorted descending by `weight_share`

### 1.3 `test_semaglutide_source_class_decay`

Using the same four assertions, all with timestamp 6 months (180 days) ago:

Query with `source_class_decay: true`:
- Regulatory assertion (Tier 0): confidence unchanged (no half-life)
- Clinical assertions (Tier 1, 730-day half-life): confidence decayed slightly (0.9 * 2^(-180/730) ≈ 0.76)
- Anecdotal assertion (Tier 5, 30-day half-life): confidence decayed to near zero (0.2 * 2^(-180/30) ≈ 0.003)

Assert:
- After decay, the Anecdotal assertion's effective confidence is < 0.01
- After decay, the Regulatory assertion's confidence is exactly 1.0
- After decay, Clinical assertions' confidence is between 0.7 and 0.85
- Authority lens after decay still picks Regulatory as winner
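
The arithmetic above can be checked with a minimal sketch of half-life decay. The function name and the use of `Option` to model "no half-life" are assumptions for illustration, not stemedb's actual API.

```rust
/// Illustrative half-life decay (hypothetical helper, not the stemedb implementation).
/// `half_life_days = None` models a tier that never decays (e.g. Regulatory).
fn decayed_confidence(confidence: f32, age_days: f32, half_life_days: Option<f32>) -> f32 {
    match half_life_days {
        // No half-life configured: confidence is returned unchanged.
        None => confidence,
        // confidence * 2^(-age / half_life)
        Some(half_life) => confidence * 2.0_f32.powf(-age_days / half_life),
    }
}
```

With an age of 180 days, this gives exactly 1.0 for the Regulatory assertion, roughly 0.76 for the Clinical assertions, and roughly 0.003 for the Anecdotal one, which is inside the ranges asserted above.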

### 1.4 `test_semaglutide_time_travel`

Using the same four assertions with staggered timestamps (T, T+100, T+200, T+300):

Query with `as_of: T+150`:
- Only assertions at T and T+100 are included
- Assert exactly 2 candidates
- Conflict landscape is different from the full query (only Agent A's Regulatory/FDA assertion and Agent B's Clinical/NEJM assertion remain)

---

## Battery 2: The JWT Conflict Scenario

Reproduces the JWT outage story. Validates escalation — the claim that Episteme is an "active safety system."

### 2.1 `test_jwt_conflict_escalation_fires`

Setup:
- RFC 7519 (Tier 0, confidence 1.0): predicate=`aud_validation`, object=`Boolean(true)`
- Internal wiki (Tier 3, confidence 0.8): predicate=`aud_validation`, object=`Boolean(false)`
- Stack Overflow (Tier 5, confidence 0.6): predicate=`aud_validation`, object=`Boolean(false)`
- Approved runbook (Tier 2, confidence 0.95): predicate=`aud_validation`, object=`Boolean(true)`

Configure escalation policy:
```
name: "security-config"
min_conflict_score: 0.5
level: High
predicate_pattern: None
```

Ingest all four. Run materializer with escalation policies.

Assert:
- Escalation event is created (query `ESC:` prefix, find at least one)
- Event has `level` = `High`
- Event has `conflict_score` >= 0.5
- Event has correct subject and predicate
- Event `resolved` = false

### 2.2 `test_jwt_escalation_predicate_filter`

Same four assertions as 2.1. Two policies:
- Policy A: `predicate_pattern: Some("aud")`, `min_conflict_score: 0.3`, `level: Critical`
- Policy B: `predicate_pattern: Some("revenue")`, `min_conflict_score: 0.3`, `level: Medium`

Assert:
- Policy A fires (predicate `aud_validation` contains "aud")
- Policy B does NOT fire (predicate doesn't contain "revenue")
- Only one escalation event exists, with level `Critical`
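
The matching rule these two tests rely on can be sketched as follows. The struct mirrors the policy fields shown in 2.1; the type names, the `evaluate` function, and the inclusive `>=` threshold are assumptions made for illustration. Substring matching follows the wording of the assertions above.

```rust
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum Level {
    Medium,
    High,
    Critical,
}

#[allow(dead_code)]
struct EscalationPolicy {
    name: String,
    min_conflict_score: f32,
    level: Level,
    /// `None` matches every predicate; `Some(p)` matches predicates containing `p`.
    predicate_pattern: Option<String>,
}

/// Hypothetical evaluation: returns the escalation level if the policy fires.
fn evaluate(policy: &EscalationPolicy, predicate: &str, conflict_score: f32) -> Option<Level> {
    let pattern_matches = match &policy.predicate_pattern {
        None => true,
        Some(pattern) => predicate.contains(pattern.as_str()),
    };
    // Assumed: the threshold is inclusive.
    if pattern_matches && conflict_score >= policy.min_conflict_score {
        Some(policy.level)
    } else {
        None
    }
}
```

Under this sketch, Policy A fires on `aud_validation` and Policy B does not, regardless of how high the conflict score is.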

### 2.3 `test_jwt_layered_lens_tier_agreement`

Same four assertions. Query with Layered Consensus lens.

Assert:
- Tier 0 result: winner object = `Boolean(true)` (RFC says validate)
- Tier 2 result: winner object = `Boolean(true)` (Runbook agrees)
- Tier 3 result: winner object = `Boolean(false)` (Wiki says skip)
- Tier 5 result: winner object = `Boolean(false)` (SO says skip)
- `overall_conflict_score` > 0.5 (cross-tier disagreement: Tiers 0 and 2 say `true`, Tiers 3 and 5 say `false`)
- `overall_winner` comes from Tier 0 (highest authority)

---

## Battery 3: Decay Math Precision

Aphoria computes conflict scores after decay. If decay is wrong, every conflict score is wrong.

### 3.1 `test_decay_tier0_never_decays`

Regulatory assertion, confidence 0.95, timestamp 10 years ago.
Query with `source_class_decay: true`.

Assert: effective confidence is exactly 0.95 (unchanged).

### 3.2 `test_decay_tier1_exact_halflife`

Clinical assertion, confidence 1.0, timestamp exactly 730 days ago.
Query with `source_class_decay: true`.

Assert: effective confidence is 0.5 (within tolerance of 0.02).

### 3.3 `test_decay_tier1_two_halflives`

Clinical assertion, confidence 1.0, timestamp exactly 1460 days ago.
Query with `source_class_decay: true`.

Assert: effective confidence is 0.25 (within tolerance of 0.02).

### 3.4 `test_decay_tier5_exact_halflife`

Anecdotal assertion, confidence 1.0, timestamp exactly 30 days ago.
Query with `source_class_decay: true`.

Assert: effective confidence is 0.5 (within tolerance of 0.02).

### 3.5 `test_decay_tier5_three_halflives`

Anecdotal assertion, confidence 1.0, timestamp exactly 90 days ago.
Query with `source_class_decay: true`.

Assert: effective confidence is 0.125 (within tolerance of 0.02).

### 3.6 `test_decay_zero_confidence_stays_zero`

Assertion with confidence 0.0, any tier, any age.

Assert: effective confidence is 0.0 after decay (0 * anything = 0).

### 3.7 `test_decay_never_goes_negative`

Anecdotal assertion, confidence 0.01, timestamp 365 days ago (12+ half-lives).

Assert: effective confidence >= 0.0.

### 3.8 `test_decay_uses_as_of_for_age_calculation`

Two assertions, both at timestamp T=1000:
- Assertion A: Clinical, confidence 0.9
- Assertion B: Anecdotal, confidence 0.9

Query with `as_of: T + 730*86400` (exactly 730 days after assertions) and `source_class_decay: true`.

Assert:
- A's effective confidence ~ 0.45 (Clinical, one half-life)
- B's effective confidence is near zero (Anecdotal, 24+ half-lives at the 30-day rate)
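
A minimal sketch of the behavior this test expects: both the inclusion filter and the age calculation use `as_of` as the reference time. The `Candidate` struct, the function name, and timestamps in seconds are assumptions for illustration.

```rust
const SECS_PER_DAY: i64 = 86_400;

/// Hypothetical candidate record; timestamps are assumed to be in seconds.
#[derive(Debug, Clone, Copy)]
struct Candidate {
    timestamp: i64,
    confidence: f32,
    /// `None` means the source class never decays.
    half_life_days: Option<f32>,
}

/// Effective confidences as seen from `as_of`.
fn effective_confidences(candidates: &[Candidate], as_of: i64) -> Vec<f32> {
    candidates
        .iter()
        // Time travel: drop anything asserted after the reference time.
        .filter(|c| c.timestamp <= as_of)
        .map(|c| {
            // Age is measured against `as_of`, never against the wall clock.
            let age_days = (as_of - c.timestamp) as f32 / SECS_PER_DAY as f32;
            match c.half_life_days {
                None => c.confidence,
                Some(half_life) => c.confidence * 2.0_f32.powf(-age_days / half_life),
            }
        })
        .collect()
}
```

If the age were computed from the current time instead, the same historical query would return different confidences depending on when it was run.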

---

## Battery 4: Conflict Score Calibration

Two conflict score implementations exist. `compute_conflict_score` in `traits.rs` uses confidence variance. `calculate_conflict_score` in `skeptic/analysis.rs` uses Shannon entropy over object value groups. Both need validation.

### 4.1 `test_variance_conflict_score_unanimous`

5 assertions, all confidence 0.8.
`compute_conflict_score()` returns 0.0 (no variance).

### 4.2 `test_variance_conflict_score_maximum`

2 assertions, confidence 0.0 and 1.0.
`compute_conflict_score()` returns 1.0 (maximum variance).

### 4.3 `test_variance_conflict_score_moderate`

3 assertions, confidence 0.2, 0.5, 0.8.
`compute_conflict_score()` returns a value between 0.2 and 0.8.

### 4.4 `test_variance_conflict_score_single`

1 assertion. Returns 0.0.

### 4.5 `test_variance_conflict_score_empty`

0 assertions. Returns 0.0.
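
For orientation, here is a sketch of a variance-based score that satisfies 4.1 through 4.5. It is not the code in `traits.rs`: the normalization by 0.25 (the maximum population variance for values in [0, 1]) is an assumption inferred from 4.2 expecting exactly 1.0, and the non-finite guard is an assumption about how a defensive implementation might behave.

```rust
/// Hypothetical variance-based conflict score over confidence values in [0, 1].
fn variance_conflict_score_sketch(confidences: &[f32]) -> f32 {
    // Fewer than two values cannot disagree; non-finite input is treated as "no signal".
    if confidences.len() < 2 || confidences.iter().any(|c| !c.is_finite()) {
        return 0.0;
    }
    let n = confidences.len() as f32;
    let mean = confidences.iter().sum::<f32>() / n;
    // Population variance.
    let variance = confidences.iter().map(|c| (c - mean).powi(2)).sum::<f32>() / n;
    // Assumed normalization: 0.25 is the largest variance possible on [0, 1].
    (variance / 0.25).clamp(0.0, 1.0)
}
```

Note that this score only looks at confidence values. It never sees the object values, which is why it cannot detect a "yes" versus "no" disagreement at equal confidence.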

### 4.6 `test_skeptic_entropy_same_confidence_different_objects` [POTENTIAL BUG DETECTOR]

Three assertions, ALL with confidence 0.9:
- Assertion A: object `Text("yes")`, confidence 0.9
- Assertion B: object `Text("no")`, confidence 0.9
- Assertion C: object `Text("no")`, confidence 0.9

Skeptic lens `analyze()`:
- Groups into 2 claims: "yes" (weight 0.9) and "no" (weight 1.8)
- Entropy is non-zero because the total weight is split across two distinct groups
- `conflict_score` > 0.0
- `status` is NOT `Unanimous`

**Note:** The variance-based `compute_conflict_score` would return 0.0 for these candidates (all same confidence). The Skeptic entropy-based score correctly detects the disagreement. This test validates the Skeptic lens is the correct tool for Aphoria's conflict detection, NOT the variance-based score.

### 4.7 `test_skeptic_entropy_unanimous_different_confidence`

Three assertions, all same object `Text("yes")`, but different confidences (0.3, 0.6, 0.9):

Skeptic lens `analyze()`:
- Groups into 1 claim (all same object)
- `conflict_score` = 0.0 (unanimous — no disagreement on the value)
- `status` = `Unanimous`

**Note:** Even though confidences differ, there's no actual conflict — all sources agree. The Skeptic lens correctly identifies this as unanimous.
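
The contrast between 4.6 and 4.7 can be illustrated with a sketch of an entropy-based score. This is not the code in `skeptic/analysis.rs`. Grouping by object value and summing confidence as weight follows the description in 4.6; dividing by `log2` of the group count to land in [0, 1] is an assumption, as are the function name and the `(&str, f32)` input shape.

```rust
use std::collections::BTreeMap;

/// Hypothetical entropy-based conflict score.
/// Each input is (object value, confidence); weight per group is the summed confidence.
fn entropy_conflict_score_sketch(assertions: &[(&str, f32)]) -> f32 {
    let mut weights: BTreeMap<&str, f32> = BTreeMap::new();
    for (object, confidence) in assertions {
        *weights.entry(*object).or_insert(0.0) += *confidence;
    }
    let total: f32 = weights.values().sum();
    // A single group (or no weight at all) means there is nothing to disagree about.
    if weights.len() < 2 || total <= 0.0 {
        return 0.0;
    }
    // Shannon entropy of the weight shares, in bits.
    let entropy: f32 = weights
        .values()
        .map(|w| w / total)
        .filter(|p| *p > 0.0)
        .map(|p| -p * p.log2())
        .sum();
    // Assumed normalization: divide by the maximum entropy for this many groups.
    entropy / (weights.len() as f32).log2()
}
```

For the 4.6 inputs the shares are 1/3 and 2/3, giving a score of about 0.92. For the 4.7 inputs there is one group, so the score is 0.0 even though the confidences differ.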

### 4.8 `test_variance_score_nan_defensive`

2 assertions with confidence `f32::NAN`.
`compute_conflict_score()` returns 0.0 (defensive, not NaN propagation).

---

## Battery 5: scan_prefix with ConceptPath-shaped Keys

Storage foundation for hierarchical queries.

### 5.1 `test_prefix_scan_concept_path_keys`

Store via IndexStore:
```
S:code://rust/citadeldb/auth/jwt/aud_validation  → [hash_a]
S:code://rust/citadeldb/auth/jwt/expiry          → [hash_b]
S:code://rust/citadeldb/net/tls/verify           → [hash_c]
S:code://rust/citadeldb/auth/oauth/scopes        → [hash_d]
```

Assert:
- `scan_prefix("S:code://rust/citadeldb/auth/jwt/")` → 2 keys (aud_validation, expiry)
- `scan_prefix("S:code://rust/citadeldb/auth/")` → 3 keys (jwt/aud_validation, jwt/expiry, oauth/scopes)
- `scan_prefix("S:code://rust/citadeldb/")` → 4 keys (all)
- `scan_prefix("S:code://")` → 4 keys (all)
- `scan_prefix("S:rfc://")` → 0 keys (different scheme)

### 5.2 `test_prefix_scan_no_false_positives`

Store:
```
S:code://rust/citadeldb/auth          → [hash_a]
S:code://rust/citadeldb/authentication → [hash_b]
```

Assert:
- `scan_prefix("S:code://rust/citadeldb/auth/")` → 0 keys (trailing slash prevents matching "auth" without children)
- `scan_prefix("S:code://rust/citadeldb/auth")` → 2 keys (both match the prefix "auth")

This validates that the trailing `/` in hierarchical queries is necessary to prevent `auth` from matching `authentication`.
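
The prefix semantics in 5.1 and 5.2 can be reproduced on any ordered map. The sketch below uses a standard-library `BTreeMap` as a stand-in for IndexStore; the function name and signature are assumptions, not the real `scan_prefix`.

```rust
use std::collections::BTreeMap;

/// Hypothetical prefix scan over an ordered key space.
/// Seeks to the first key >= prefix, then walks forward while keys still start with it.
fn scan_prefix_sketch<'a, V>(index: &'a BTreeMap<String, V>, prefix: &str) -> Vec<&'a str> {
    index
        .range(prefix.to_string()..)
        .take_while(|(key, _)| key.starts_with(prefix))
        .map(|(key, _)| key.as_str())
        .collect()
}
```

Because matching is purely byte-wise, `.../auth` matches `.../authentication`, while `.../auth/` matches only keys that have `auth` as a complete path segment with children.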

### 5.3 `test_prefix_scan_sp_keys_with_concept_paths`

Store via IndexStore (using SP: compound keys):
```
SP:code://rust/citadeldb/auth/jwt/aud_validation:config_value  → [hash_a]
SP:code://rust/citadeldb/auth/jwt/expiry:config_value          → [hash_b]
```

Assert:
- `scan_prefix("SP:code://rust/citadeldb/auth/jwt/")` → 2 keys
- The parsed SP key for hash_a correctly splits into subject=`code://rust/citadeldb/auth/jwt/aud_validation` and predicate=`config_value`. The subject itself contains `:` (in `code://`), so the split must happen at the last `:` (validates the rfind fix)
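
A sketch of why the split has to be taken from the right. The helper name is hypothetical, and it assumes predicates never contain `:`.

```rust
/// Hypothetical SP key parser: "SP:{subject}:{predicate}" -> (subject, predicate).
/// Splits at the LAST ':' because ConceptPath subjects contain ':' in their scheme.
fn parse_sp_key(key: &str) -> Option<(&str, &str)> {
    let rest = key.strip_prefix("SP:")?;
    rest.rsplit_once(':')
}
```

Splitting at the first `:` instead would yield subject `code` and predicate `//rust/citadeldb/auth/jwt/aud_validation:config_value`.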

---

## Battery 6: Signature Tamper Detection

Aphoria ingests signed assertions. If signature verification has gaps, tampered claims enter the graph.

### 6.1 `test_valid_signature_accepted`

Agent A signs an assertion. Ingest through IngestWorker.

Assert: assertion is stored, index entries exist.

### 6.2 `test_tampered_confidence_rejected`

Agent A signs assertion with confidence=0.8. Modify the serialized assertion bytes to change confidence to 1.0. Attempt to ingest.

Assert: `IngestError::InvalidSignature`. Assertion is NOT stored.

### 6.3 `test_tampered_subject_rejected`

Agent A signs assertion with subject="X". Clone the assertion, change subject to "Y", keep original signature.

Assert: ingestion fails with invalid signature.

### 6.4 `test_wrong_agent_id_rejected`

Agent A signs assertion. Replace `agent_id` in the `SignatureEntry` with Agent B's public key (but keep Agent A's signature bytes).

Assert: ingestion fails — the signature was made by A's private key but claims to be from B's public key.

### 6.5 `test_multi_sig_all_valid_accepted`

Agent A and Agent B both sign the same assertion (two valid SignatureEntries).

Assert: ingestion succeeds.

### 6.6 `test_multi_sig_one_invalid_rejected`

Agent A signs validly, Agent B's signature is invalid (tampered).

Assert: ingestion fails. ALL signatures must be valid.

---

## Battery 7: Materialized View Consistency

Aphoria queries MVs for fast conflict checks. Stale or inconsistent MVs produce wrong verdicts.

### 7.1 `test_mv_initial_materialization`

Ingest assertion A (confidence 0.9) for subject=S, predicate=P.
Run materializer `step()`.

Assert:
- MV exists at `MV:{S}:{P}`
- MV winner_hash matches A's content hash
- MV confidence = 0.9
- Changelog entry exists (first materialization)

### 7.2 `test_mv_winner_changes_on_update`

Ingest A (confidence 0.9), materialize. Then ingest B (same S/P, confidence 0.95), materialize again.

Assert:
- MV winner changes to B
- Changelog has 2 entries: initial (winner=A), update (previous=A, new=B)

### 7.3 `test_mv_no_changelog_when_winner_unchanged`

Ingest A (confidence 0.9), materialize. Ingest B (same S/P, confidence 0.5), materialize again.

Assert:
- MV winner stays A (B has lower confidence)
- No new changelog entry after second materialization
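
The changelog rule exercised by 7.1 through 7.3 can be sketched as: record an entry only when the winner hash actually changes. The types and the `apply` method below are hypothetical. Winner selection itself is left to the lens and passed in as a hash.

```rust
#[derive(Debug, Clone, PartialEq)]
struct ChangelogEntry {
    /// `None` for the first materialization.
    previous: Option<String>,
    current: String,
}

#[derive(Debug, Default)]
struct MaterializedView {
    winner_hash: Option<String>,
    changelog: Vec<ChangelogEntry>,
}

impl MaterializedView {
    /// Applies the winner chosen by a materializer step.
    fn apply(&mut self, winner_hash: &str) {
        // Same winner as before: the view is already correct, so no changelog entry.
        if self.winner_hash.as_deref() == Some(winner_hash) {
            return;
        }
        // Swap in the new winner and remember the old one for the changelog.
        let previous = self.winner_hash.replace(winner_hash.to_string());
        self.changelog.push(ChangelogEntry {
            previous,
            current: winner_hash.to_string(),
        });
    }
}
```

Applying A, then A again, then B produces two entries: the initial one with no previous winner, and the A to B transition.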

### 7.4 `test_mv_since_query_returns_changelog`

Ingest A at T=1000, materialize at T=1001. Ingest B at T=2000, materialize at T=2001.

Query with `since: 1500`:
- Returns changelog entries only from after T=1500
- Should include the B materialization but not the A materialization

### 7.5 `test_mv_max_stale_fast_path`

Ingest A, materialize. Query immediately with `max_stale: 60`.

Assert: fast path is used (MV is fresh).

### 7.6 `test_mv_max_stale_slow_path`

Ingest A, materialize. Wait (or mock time) so MV is 120 seconds old. Query with `max_stale: 60`.

Assert: slow path is used (MV is stale, falls through to index lookup).
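
The fast/slow decision in 7.5 and 7.6 reduces to an age comparison. A minimal sketch, assuming timestamps and `max_stale` are both in seconds and that the bound is inclusive (the function name is hypothetical):

```rust
/// Hypothetical staleness check: true means the materialized view may be served directly.
fn use_fast_path(materialized_at: u64, now: u64, max_stale_secs: u64) -> bool {
    // saturating_sub guards against clock skew where `now` < `materialized_at`.
    now.saturating_sub(materialized_at) <= max_stale_secs
}
```

Passing `now` as a parameter rather than reading the clock inside the function is what makes the "mock time" option in 7.6 straightforward.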

---

## Findings to Watch For

### Known Risk: Two Conflict Score Implementations

`compute_conflict_score` in `traits.rs` (line 89) uses **confidence variance**. It measures how much confidence values disagree, not how much object values disagree. Three sources saying "yes" at 0.9 and two sources saying "no" at 0.9 produces a conflict score of **0.0** because all confidences are identical.

`calculate_conflict_score` in `skeptic/analysis.rs` (line 36) uses **Shannon entropy over object value groups**. It correctly detects that "yes" vs "no" is a real conflict regardless of confidence values.

**Aphoria must use the Skeptic lens for conflict detection, not the standard lens conflict score.** Battery 4.6 validates this distinction explicitly. If Aphoria were to use `compute_conflict_score` from standard lenses, it would miss conflicts where sources disagree on values but agree on confidence levels.

### Known Risk: Decay + Time-Travel Interaction

When both `source_class_decay` and `as_of` are set, the age calculation must use `as_of` as the reference time, not `now`. Battery 3.8 validates this. If the implementation uses `now` for age but filters by `as_of` for inclusion, the decay amounts will be wrong for historical queries.

### ConceptPath Readiness

Battery 5 validates the storage layer works with ConceptPath-shaped keys before any type changes. If these tests pass, the `scan_prefix` foundation is solid and ConceptPath implementation can proceed with confidence.