stemedb/batteries/pre-aphoria.md
# Pre-Aphoria Validation Battery
**Purpose:** Verify stemedb behaves as documented before building ConceptPath and Aphoria on top of it. Every test maps to a claim the product makes or a code path Aphoria depends on.
**Test file:** `crates/stemedb-query/tests/battery_pre_aphoria.rs`
---
## Battery 1: The Semaglutide Scenario
Reproduces the exact example from `what-is-episteme.md`. Four sources, four tiers, one subject, conflicting claims. If this doesn't work, the product demo fails.
### 1.1 `test_semaglutide_four_sources_ingest_and_query`
Setup:
- Agent A signs: subject=`Semaglutide`, predicate=`has_side_effect`, object=`Text("gastroparesis_warning")`, source_class=Regulatory, confidence=1.0, timestamp=T
- Agent B signs: subject=`Semaglutide`, predicate=`has_side_effect`, object=`Text("no_gastroparesis_signal")`, source_class=Clinical, confidence=0.9, timestamp=T+1
- Agent C signs: subject=`Semaglutide`, predicate=`has_side_effect`, object=`Text("gastroparesis")`, source_class=Anecdotal, confidence=0.2, timestamp=T+2
- Agent D signs: subject=`Semaglutide`, predicate=`has_side_effect`, object=`Text("no_gastroparesis_signal")`, source_class=Clinical, confidence=0.9, timestamp=T+3
Ingest all four through WAL + IngestWorker.
Assert:
- All four assertions are stored (query with no lens returns 4 results)
- Authority lens (TrustAwareAuthority) winner is the Regulatory assertion (FDA)
- Recency lens winner is Agent D (most recent)
- Consensus lens groups by object value: `"no_gastroparesis_signal"` has 2 assertions; the two gastroparesis variants (`"gastroparesis_warning"` and `"gastroparesis"`) account for the other 2
### 1.2 `test_semaglutide_skeptic_analysis`
Using the same four assertions from 1.1:
Assert:
- Skeptic lens `analyze()` returns `ConflictAnalysis` with:
  - `candidates_count` = 4
  - `claims.len()` >= 2 (at least two distinct object values)
  - `status` = `Contested` (conflict_score >= 0.4)
  - `conflict_score` > 0.3 (there is real disagreement between object values)
  - The claim with object `"no_gastroparesis_signal"` has `assertion_count` = 2
  - Claims are sorted descending by `weight_share`
### 1.3 `test_semaglutide_source_class_decay`
Using the same four assertions, all with timestamps 180 days (about 6 months) ago:
Query with `source_class_decay: true`:
- Regulatory assertion (Tier 0): confidence unchanged (no half-life)
- Clinical assertions (Tier 1, 730-day half-life): confidence decayed slightly (~0.9 * 2^(-180/730) ~ 0.76)
- Anecdotal assertion (Tier 5, 30-day half-life): confidence decayed to near zero (~0.2 * 2^(-180/30) ~ 0.003)
Assert:
- After decay, the Anecdotal assertion's effective confidence is < 0.01
- After decay, the Regulatory assertion's confidence is exactly 1.0
- After decay, Clinical assertions' confidence is between 0.7 and 0.85
- Authority lens after decay still picks Regulatory as winner
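
The expected values above follow from plain half-life decay, `confidence * 2^(-age_days / half_life_days)`. The sketch below reproduces that arithmetic under stated assumptions: the `SourceClass` enum and `decayed_confidence` function are hypothetical stand-ins, not stemedb's actual types, and only the three tiers used in this battery are modeled.

```rust
/// Hypothetical stand-in for the source tiers used in this battery.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum SourceClass {
    Regulatory, // Tier 0: no half-life
    Clinical,   // Tier 1: 730-day half-life
    Anecdotal,  // Tier 5: 30-day half-life
}

impl SourceClass {
    /// Half-life in days, or `None` for a tier that never decays.
    pub fn half_life_days(self) -> Option<f32> {
        match self {
            SourceClass::Regulatory => None,
            SourceClass::Clinical => Some(730.0),
            SourceClass::Anecdotal => Some(30.0),
        }
    }
}

/// Exponential half-life decay: confidence * 2^(-age / half_life).
/// Tiers without a half-life return the confidence unchanged.
pub fn decayed_confidence(confidence: f32, class: SourceClass, age_days: f32) -> f32 {
    match class.half_life_days() {
        None => confidence,
        Some(half_life) => confidence * 2f32.powf(-age_days / half_life),
    }
}
```

With an age of 180 days this gives roughly 0.76 for a Clinical assertion at 0.9 and roughly 0.003 for an Anecdotal assertion at 0.2, which is why the assertion bounds above are set where they are.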
### 1.4 `test_semaglutide_time_travel`
Using the same four assertions with staggered timestamps (T, T+100, T+200, T+300):
Query with `as_of: T+150`:
- Only assertions at T and T+100 are included
- Assert exactly 2 candidates
- Conflict landscape is different from the full query (only the Regulatory/FDA and first Clinical/NEJM assertions are present)
---
## Battery 2: The JWT Conflict Scenario
Reproduces the JWT outage story. Validates escalation and the claim that Episteme is an "active safety system."
### 2.1 `test_jwt_conflict_escalation_fires`
Setup:
- RFC 7519 (Tier 0, confidence 1.0): predicate=`aud_validation`, object=`Boolean(true)`
- Internal wiki (Tier 3, confidence 0.8): predicate=`aud_validation`, object=`Boolean(false)`
- Stack Overflow (Tier 5, confidence 0.6): predicate=`aud_validation`, object=`Boolean(false)`
- Approved runbook (Tier 2, confidence 0.95): predicate=`aud_validation`, object=`Boolean(true)`
Configure escalation policy:
```
name: "security-config"
min_conflict_score: 0.5
level: High
predicate_pattern: None
```
Ingest all four. Run materializer with escalation policies.
Assert:
- Escalation event is created (query `ESC:` prefix, find at least one)
- Event has `level` = `High`
- Event has `conflict_score` >= 0.5
- Event has correct subject and predicate
- Event `resolved` = false
### 2.2 `test_jwt_escalation_predicate_filter`
Same four assertions as 2.1. Two policies:
- Policy A: `predicate_pattern: Some("aud")`, `min_conflict_score: 0.3`, `level: Critical`
- Policy B: `predicate_pattern: Some("revenue")`, `min_conflict_score: 0.3`, `level: Medium`
Assert:
- Policy A fires (predicate `aud_validation` contains "aud")
- Policy B does NOT fire (predicate doesn't contain "revenue")
- Only one escalation event exists, with level `Critical`
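
The matching rule that 2.1 and 2.2 exercise can be sketched as follows. This is an illustration only: the struct mirrors the four policy fields shown in 2.1, but the type names, the `fires` method, and the reading of `predicate_pattern: None` as "match every predicate" are assumptions rather than the materializer's confirmed behavior.

```rust
/// Hypothetical escalation levels; only the ones named in this battery.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum EscalationLevel {
    Medium,
    High,
    Critical,
}

/// Hypothetical policy shape mirroring the fields listed in 2.1.
#[derive(Clone, Debug)]
pub struct EscalationPolicy {
    pub name: String,
    pub min_conflict_score: f32,
    pub level: EscalationLevel,
    /// Assumed semantics: `None` matches any predicate,
    /// `Some(p)` matches predicates that contain `p` as a substring.
    pub predicate_pattern: Option<String>,
}

impl EscalationPolicy {
    /// True when the predicate matches and the score reaches the threshold.
    pub fn fires(&self, predicate: &str, conflict_score: f32) -> bool {
        let predicate_matches = self
            .predicate_pattern
            .as_deref()
            .map_or(true, |pattern| predicate.contains(pattern));
        predicate_matches && conflict_score >= self.min_conflict_score
    }
}

/// Levels of every policy that fires for one subject/predicate pair.
pub fn fired_levels(
    policies: &[EscalationPolicy],
    predicate: &str,
    conflict_score: f32,
) -> Vec<EscalationLevel> {
    policies
        .iter()
        .filter(|policy| policy.fires(predicate, conflict_score))
        .map(|policy| policy.level)
        .collect()
}
```

Under these assumptions, a policy with pattern `"aud"` fires for `aud_validation` while one with pattern `"revenue"` does not, so exactly one event results.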
### 2.3 `test_jwt_layered_lens_tier_agreement`
Same four assertions. Query with Layered Consensus lens.
Assert:
- Tier 0 result: winner object = `Boolean(true)` (RFC says validate)
- Tier 2 result: winner object = `Boolean(true)` (Runbook agrees)
- Tier 3 result: winner object = `Boolean(false)` (Wiki says skip)
- Tier 5 result: winner object = `Boolean(false)` (SO says skip)
- `overall_conflict_score` > 0.5 (cross-tier disagreement: Tiers 0 and 2 say `true`, Tiers 3 and 5 say `false`)
- `overall_winner` comes from Tier 0 (highest authority)
---
## Battery 3: Decay Math Precision
Aphoria computes conflict scores after decay. If decay is wrong, every conflict score is wrong.
### 3.1 `test_decay_tier0_never_decays`
Regulatory assertion, confidence 0.95, timestamp 10 years ago.
Query with `source_class_decay: true`.
Assert: effective confidence is exactly 0.95 (unchanged).
### 3.2 `test_decay_tier1_exact_halflife`
Clinical assertion, confidence 1.0, timestamp exactly 730 days ago.
Query with `source_class_decay: true`.
Assert: effective confidence is 0.5 (within tolerance of 0.02).
### 3.3 `test_decay_tier1_two_halflives`
Clinical assertion, confidence 1.0, timestamp exactly 1460 days ago.
Query with `source_class_decay: true`.
Assert: effective confidence is 0.25 (within tolerance of 0.02).
### 3.4 `test_decay_tier5_exact_halflife`
Anecdotal assertion, confidence 1.0, timestamp exactly 30 days ago.
Query with `source_class_decay: true`.
Assert: effective confidence is 0.5 (within tolerance of 0.02).
### 3.5 `test_decay_tier5_three_halflives`
Anecdotal assertion, confidence 1.0, timestamp exactly 90 days ago.
Query with `source_class_decay: true`.
Assert: effective confidence is 0.125 (within tolerance of 0.02).
### 3.6 `test_decay_zero_confidence_stays_zero`
Assertion with confidence 0.0, any tier, any age.
Assert: effective confidence is 0.0 after decay (0 * anything = 0).
### 3.7 `test_decay_never_goes_negative`
Anecdotal assertion, confidence 0.01, timestamp 365 days ago (12+ half-lives).
Assert: effective confidence >= 0.0.
### 3.8 `test_decay_uses_as_of_for_age_calculation`
Two assertions, both at timestamp T=1000:
- Assertion A: Clinical, confidence 0.9
- Assertion B: Anecdotal, confidence 0.9
Query with `as_of: T + 730*86400` (exactly 730 days after assertions) and `source_class_decay: true`.
Assert:
- A's effective confidence ~ 0.45 (Clinical, one half-life)
- B's effective confidence is near zero (Anecdotal, 24+ half-lives at the 30-day rate)
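
The point of 3.8 is which reference time feeds the age calculation. The sketch below shows one way to express that: age is measured against `as_of` when it is set and against `now` otherwise. The function names and the use of Unix-second timestamps are assumptions for illustration; the `86400` seconds-per-day factor comes from the query in 3.8. Passing `None` for `as_of` in the same scenario yields a much older age and a much smaller confidence, which is the failure mode this test is meant to catch.

```rust
const SECONDS_PER_DAY: f64 = 86_400.0;

/// Age in days, measured against `as_of` when present, otherwise `now`.
/// Timestamps are assumed to be seconds; future timestamps clamp to age 0.
pub fn age_days(timestamp: u64, as_of: Option<u64>, now: u64) -> f32 {
    let reference = as_of.unwrap_or(now);
    (reference.saturating_sub(timestamp) as f64 / SECONDS_PER_DAY) as f32
}

/// Half-life decay evaluated at the query's reference time.
pub fn decay_at(
    confidence: f32,
    half_life_days: f32,
    timestamp: u64,
    as_of: Option<u64>,
    now: u64,
) -> f32 {
    let age = age_days(timestamp, as_of, now);
    confidence * 0.5f32.powf(age / half_life_days)
}
```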
---
## Battery 4: Conflict Score Calibration
Two conflict score implementations exist. `compute_conflict_score` in `traits.rs` uses confidence variance. `calculate_conflict_score` in `skeptic/analysis.rs` uses Shannon entropy over object value groups. Both need validation.
### 4.1 `test_variance_conflict_score_unanimous`
5 assertions, all confidence 0.8.
`compute_conflict_score()` returns 0.0 (no variance).
### 4.2 `test_variance_conflict_score_maximum`
2 assertions, confidence 0.0 and 1.0.
`compute_conflict_score()` returns 1.0 (maximum variance).
### 4.3 `test_variance_conflict_score_moderate`
3 assertions, confidence 0.2, 0.5, 0.8.
`compute_conflict_score()` returns a value between 0.2 and 0.8.
### 4.4 `test_variance_conflict_score_single`
1 assertion. Returns 0.0.
### 4.5 `test_variance_conflict_score_empty`
0 assertions. Returns 0.0.
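
For orientation, here is one variance-based score that satisfies all five expectations above. It is a sketch, not the code in `traits.rs`: the function name is hypothetical, and the normalization (population variance multiplied by 4, since the variance of values in [0, 1] cannot exceed 0.25) is an assumption chosen so that confidences 0.0 and 1.0 map to exactly 1.0. The guard against non-finite inputs is likewise an assumed design choice.

```rust
/// Variance-based conflict score in [0, 1] over confidence values.
/// Assumed normalization: population variance * 4 (max variance on [0, 1] is 0.25).
pub fn variance_conflict_score(confidences: &[f32]) -> f32 {
    // Zero or one assertion cannot disagree with anything.
    if confidences.len() < 2 {
        return 0.0;
    }
    // Defensive: refuse to propagate NaN or infinity into the score.
    if confidences.iter().any(|c| !c.is_finite()) {
        return 0.0;
    }
    let n = confidences.len() as f32;
    let mean = confidences.iter().sum::<f32>() / n;
    let variance = confidences
        .iter()
        .map(|c| (c - mean).powi(2))
        .sum::<f32>()
        / n;
    (4.0 * variance).clamp(0.0, 1.0)
}
```

Note that the inputs are confidences only, so this score cannot see object values at all.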
### 4.6 `test_skeptic_entropy_same_confidence_different_objects` [POTENTIAL BUG DETECTOR]
Three assertions, ALL with confidence 0.9:
- Object A: `Text("yes")`, confidence 0.9
- Object B: `Text("no")`, confidence 0.9
- Object C: `Text("no")`, confidence 0.9
Skeptic lens `analyze()`:
- Groups into 2 claims: "yes" (weight 0.9) and "no" (weight 1.8)
- Entropy is non-zero because the total weight is split across two groups
- `conflict_score` > 0.0
- `status` is NOT `Unanimous`
**Note:** The variance-based `compute_conflict_score` would return 0.0 for these candidates (all same confidence). The Skeptic entropy-based score correctly detects the disagreement. This test validates the Skeptic lens is the correct tool for Aphoria's conflict detection, NOT the variance-based score.
### 4.7 `test_skeptic_entropy_unanimous_different_confidence`
Three assertions, all same object `Text("yes")`, but different confidences (0.3, 0.6, 0.9):
Skeptic lens `analyze()`:
- Groups into 1 claim (all same object)
- `conflict_score` = 0.0 (unanimous — no disagreement on the value)
- `status` = `Unanimous`
**Note:** Even though confidences differ, there's no actual conflict — all sources agree. The Skeptic lens correctly identifies this as unanimous.
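
The behavior in 4.6 and 4.7 can be illustrated with a small entropy calculation over object groups. This is a sketch under assumptions, not the code in `skeptic/analysis.rs`: the function name is hypothetical, objects are reduced to string labels, group weight is the sum of member confidences (matching the 0.9 and 1.8 weights in 4.6), and the entropy is normalized by `log2(group_count)` so the result stays in [0, 1].

```rust
use std::collections::BTreeMap;

/// Normalized Shannon entropy over object-value groups.
/// Input: (object label, confidence) pairs. Assumed weighting: sum of confidences.
pub fn entropy_conflict_score(assertions: &[(&str, f32)]) -> f32 {
    // Accumulate total confidence per distinct object value.
    let mut weights: BTreeMap<&str, f32> = BTreeMap::new();
    for (object, confidence) in assertions {
        *weights.entry(*object).or_insert(0.0) += *confidence;
    }
    let total: f32 = weights.values().sum();
    // A single group (or no weight at all) means there is nothing to contest.
    if weights.len() < 2 || total <= 0.0 {
        return 0.0;
    }
    let entropy: f32 = weights
        .values()
        .map(|w| w / total)
        .filter(|p| *p > 0.0) // skip empty groups; 0 * log2(0) is treated as 0
        .map(|p| -p * p.log2())
        .sum();
    // Divide by the maximum entropy for this many groups.
    entropy / (weights.len() as f32).log2()
}
```

For the 4.6 inputs the weight shares are 1/3 and 2/3, giving a score of about 0.92; for the 4.7 inputs there is one group and the score is 0.0.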
### 4.8 `test_variance_score_nan_defensive`
2 assertions with confidence `f32::NAN`.
`compute_conflict_score()` returns 0.0 (defensive, not NaN propagation).
---
## Battery 5: scan_prefix with ConceptPath-shaped Keys
Storage foundation for hierarchical queries.
### 5.1 `test_prefix_scan_concept_path_keys`
Store via IndexStore:
```
S:code://rust/citadeldb/auth/jwt/aud_validation → [hash_a]
S:code://rust/citadeldb/auth/jwt/expiry → [hash_b]
S:code://rust/citadeldb/net/tls/verify → [hash_c]
S:code://rust/citadeldb/auth/oauth/scopes → [hash_d]
```
Assert:
- `scan_prefix("S:code://rust/citadeldb/auth/jwt/")` → 2 keys (aud_validation, expiry)
- `scan_prefix("S:code://rust/citadeldb/auth/")` → 3 keys (jwt/aud, jwt/expiry, oauth/scopes)
- `scan_prefix("S:code://rust/citadeldb/")` → 4 keys (all)
- `scan_prefix("S:code://")` → 4 keys (all)
- `scan_prefix("S:rfc://")` → 0 keys (different scheme)
### 5.2 `test_prefix_scan_no_false_positives`
Store:
```
S:code://rust/citadeldb/auth → [hash_a]
S:code://rust/citadeldb/authentication → [hash_b]
```
Assert:
- `scan_prefix("S:code://rust/citadeldb/auth/")` → 0 keys (trailing slash prevents matching "auth" without children)
- `scan_prefix("S:code://rust/citadeldb/auth")` → 2 keys (both match the prefix "auth")
This validates that the trailing `/` in hierarchical queries is necessary to prevent `auth` from matching `authentication`.
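
The trailing-slash behavior is a property of plain string-prefix matching over sorted keys, which the sketch below demonstrates with a standard-library `BTreeSet`. It is a model of the expected semantics only: `scan_prefix_sketch` is a hypothetical helper and says nothing about how IndexStore implements `scan_prefix`.

```rust
use std::collections::BTreeSet;

/// Return every key that starts with `prefix`, in sorted order.
/// Seeks to the first key >= prefix, then stops at the first non-match.
pub fn scan_prefix_sketch<'a>(keys: &'a BTreeSet<String>, prefix: &str) -> Vec<&'a str> {
    keys.range(prefix.to_owned()..)
        .take_while(|key| key.starts_with(prefix))
        .map(|key| key.as_str())
        .collect()
}
```

With only `.../auth` and `.../authentication` stored, the prefix `.../auth/` matches nothing because neither key contains the slash, while `.../auth` matches both.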
### 5.3 `test_prefix_scan_sp_keys_with_concept_paths`
Store via IndexStore (using SP: compound keys):
```
SP:code://rust/citadeldb/auth/jwt/aud_validation:config_value → [hash_a]
SP:code://rust/citadeldb/auth/jwt/expiry:config_value → [hash_b]
```
Assert:
- `scan_prefix("SP:code://rust/citadeldb/auth/jwt/")` → 2 keys
- The parsed SP key for hash_a correctly splits into subject=`code://rust/citadeldb/auth/jwt/aud_validation` and predicate=`config_value` (validates the rfind fix)
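
Why the split must search from the right: a ConceptPath subject such as `code://...` contains a colon of its own, so splitting at the first colon after the `SP:` prefix would cut the subject at `code`. The sketch below splits at the last colon instead. `parse_sp_key` is a hypothetical helper, and it assumes predicates never contain a colon.

```rust
/// Split an `SP:{subject}:{predicate}` key into (subject, predicate).
/// Splits at the LAST colon so colons inside the subject (e.g. `code://`) survive.
/// Assumption: the predicate itself contains no colon.
pub fn parse_sp_key(key: &str) -> Option<(&str, &str)> {
    let rest = key.strip_prefix("SP:")?;
    let split_at = rest.rfind(':')?;
    let subject = &rest[..split_at];
    let predicate = &rest[split_at + 1..];
    // Reject keys where either half is missing.
    if subject.is_empty() || predicate.is_empty() {
        return None;
    }
    Some((subject, predicate))
}
```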
---
## Battery 6: Signature Tamper Detection
Aphoria ingests signed assertions. If signature verification has gaps, tampered claims enter the graph.
### 6.1 `test_valid_signature_accepted`
Agent A signs an assertion. Ingest through IngestWorker.
Assert: assertion is stored, index entries exist.
### 6.2 `test_tampered_confidence_rejected`
Agent A signs assertion with confidence=0.8. Modify the serialized assertion bytes to change confidence to 1.0. Attempt to ingest.
Assert: `IngestError::InvalidSignature`. Assertion is NOT stored.
### 6.3 `test_tampered_subject_rejected`
Agent A signs assertion with subject="X". Clone the assertion, change subject to "Y", keep original signature.
Assert: ingestion fails with invalid signature.
### 6.4 `test_wrong_agent_id_rejected`
Agent A signs assertion. Replace `agent_id` in the `SignatureEntry` with Agent B's public key (but keep Agent A's signature bytes).
Assert: ingestion fails — the signature was made by A's private key but claims to be from B's public key.
### 6.5 `test_multi_sig_all_valid_accepted`
Agent A and Agent B both sign the same assertion (two valid SignatureEntries).
Assert: ingestion succeeds.
### 6.6 `test_multi_sig_one_invalid_rejected`
Agent A signs validly, Agent B's signature is invalid (tampered).
Assert: ingestion fails. ALL signatures must be valid.
---
## Battery 7: Materialized View Consistency
Aphoria queries MVs for fast conflict checks. Stale or inconsistent MVs produce wrong verdicts.
### 7.1 `test_mv_initial_materialization`
Ingest assertion A (confidence 0.9) for subject=S, predicate=P.
Run materializer `step()`.
Assert:
- MV exists at `MV:{S}:{P}`
- MV winner_hash matches A's content hash
- MV confidence = 0.9
- Changelog entry exists (first materialization)
### 7.2 `test_mv_winner_changes_on_update`
Ingest A (confidence 0.9), materialize. Then ingest B (same S/P, confidence 0.95), materialize again.
Assert:
- MV winner changes to B
- Changelog has 2 entries: initial (winner=A), update (previous=A, new=B)
### 7.3 `test_mv_no_changelog_when_winner_unchanged`
Ingest A (confidence 0.9), materialize. Ingest B (same S/P, confidence 0.5), materialize again.
Assert:
- MV winner stays A (B has lower confidence)
- No new changelog entry after second materialization
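
The changelog rule tested in 7.1 through 7.3 (append an entry only when the winner changes) can be sketched as below. Everything here is a simplified model: the struct and method names are hypothetical, candidates are reduced to `(hash, confidence)` pairs, and "highest confidence wins" is assumed as the selection rule because that is what these three tests imply.

```rust
/// Hypothetical changelog record: which winner replaced which.
#[derive(Clone, Debug, PartialEq)]
pub struct ChangelogEntry {
    pub previous: Option<String>,
    pub new: String,
}

/// Hypothetical in-memory model of one `MV:{S}:{P}` entry.
#[derive(Debug, Default)]
pub struct MaterializedView {
    pub winner_hash: Option<String>,
    pub confidence: f32,
    pub changelog: Vec<ChangelogEntry>,
}

impl MaterializedView {
    /// Recompute the winner; log a changelog entry only if it changed.
    pub fn materialize(&mut self, candidates: &[(&str, f32)]) {
        // Assumed rule: the candidate with the highest confidence wins.
        let best = candidates
            .iter()
            .copied()
            .max_by(|a, b| a.1.total_cmp(&b.1));
        if let Some((hash, confidence)) = best {
            if self.winner_hash.as_deref() != Some(hash) {
                let previous = self.winner_hash.replace(hash.to_string());
                self.changelog.push(ChangelogEntry {
                    previous,
                    new: hash.to_string(),
                });
            }
            self.confidence = confidence;
        }
    }
}
```

In this model the first materialization logs an entry with `previous: None`, a lower-confidence newcomer leaves the changelog untouched, and a higher-confidence newcomer adds a second entry.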
### 7.4 `test_mv_since_query_returns_changelog`
Ingest A at T=1000, materialize at T=1001. Ingest B at T=2000, materialize at T=2001.
Query with `since: 1500`:
- Returns changelog entries only from after T=1500
- Should include the B materialization but not the A materialization
### 7.5 `test_mv_max_stale_fast_path`
Ingest A, materialize. Query immediately with `max_stale: 60`.
Assert: fast path is used (MV is fresh).
### 7.6 `test_mv_max_stale_slow_path`
Ingest A, materialize. Wait (or mock time) so MV is 120 seconds old. Query with `max_stale: 60`.
Assert: slow path is used (MV is stale, falls through to index lookup).
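
The fast/slow decision in 7.5 and 7.6 comes down to comparing the MV's age with `max_stale`. A minimal sketch follows, assuming second-granularity timestamps and an inclusive boundary (an MV exactly `max_stale` seconds old still counts as fresh); the `QueryPath` enum and `choose_path` function are hypothetical names, and the boundary choice is not confirmed by the source.

```rust
/// Hypothetical label for which read path a query takes.
#[derive(Debug, PartialEq)]
pub enum QueryPath {
    /// Serve directly from the materialized view.
    Fast,
    /// Fall through to the index lookup.
    Slow,
}

/// Pick the read path from the MV's age in seconds.
/// Assumption: age == max_stale_secs is still considered fresh.
pub fn choose_path(materialized_at: u64, now: u64, max_stale_secs: u64) -> QueryPath {
    let age = now.saturating_sub(materialized_at);
    if age <= max_stale_secs {
        QueryPath::Fast
    } else {
        QueryPath::Slow
    }
}
```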
---
## Findings to Watch For
### Known Risk: Two Conflict Score Implementations
`compute_conflict_score` in `traits.rs` (line 89) uses **confidence variance**. It measures how much confidence values disagree, not how much object values disagree. Three sources saying "yes" at 0.9 and two sources saying "no" at 0.9 produce a conflict score of **0.0** because all confidences are identical.
`calculate_conflict_score` in `skeptic/analysis.rs` (line 36) uses **Shannon entropy over object value groups**. It correctly detects that "yes" vs "no" is a real conflict regardless of confidence values.
**Aphoria must use the Skeptic lens for conflict detection, not the standard lens conflict score.** Battery 4.6 validates this distinction explicitly. If Aphoria were to use `compute_conflict_score` from standard lenses, it would miss conflicts where sources disagree on values but agree on confidence levels.
### Known Risk: Decay + Time-Travel Interaction
When both `source_class_decay` and `as_of` are set, the age calculation must use `as_of` as the reference time, not `now`. Battery 3.8 validates this. If the implementation uses `now` for age but filters by `as_of` for inclusion, the decay amounts will be wrong for historical queries.
### ConceptPath Readiness
Battery 5 validates the storage layer works with ConceptPath-shaped keys before any type changes. If these tests pass, the `scan_prefix` foundation is solid and ConceptPath implementation can proceed with confidence.