stemedb/batteries/pre-aphoria.md

Pre-Aphoria Validation Battery

Purpose: Verify stemedb behaves as documented before building ConceptPath and Aphoria on top of it. Every test maps to a claim the product makes or a code path Aphoria depends on.

Test file: crates/stemedb-query/tests/battery_pre_aphoria.rs


Battery 1: The Semaglutide Scenario

Reproduces the exact example from what-is-episteme.md. Four sources, four tiers, one subject, conflicting claims. If this doesn't work, the product demo fails.

1.1 test_semaglutide_four_sources_ingest_and_query

Setup:

  • Agent A signs: subject=Semaglutide, predicate=has_side_effect, object=Text("gastroparesis_warning"), source_class=Regulatory, confidence=1.0, timestamp=T
  • Agent B signs: subject=Semaglutide, predicate=has_side_effect, object=Text("no_gastroparesis_signal"), source_class=Clinical, confidence=0.9, timestamp=T+1
  • Agent C signs: subject=Semaglutide, predicate=has_side_effect, object=Text("gastroparesis"), source_class=Anecdotal, confidence=0.2, timestamp=T+2
  • Agent D signs: subject=Semaglutide, predicate=has_side_effect, object=Text("no_gastroparesis_signal"), source_class=Clinical, confidence=0.9, timestamp=T+3

Ingest all four through WAL + IngestWorker.

Assert:

  • All four assertions are stored (query with no lens returns 4 results)
  • Authority lens (TrustAwareAuthority) winner is the Regulatory assertion (FDA)
  • Recency lens winner is Agent D (most recent)
  • Consensus lens groups by object value: "no_gastroparesis_signal" has 2 assertions; "gastroparesis_warning" and "gastroparesis" have 1 each

1.2 test_semaglutide_skeptic_analysis

Using the same four assertions from 1.1:

Assert:

  • Skeptic lens analyze() returns ConflictAnalysis with:
    • candidates_count = 4
    • claims.len() >= 2 (at least two distinct object values)
    • status = Contested (conflict_score >= 0.4)
    • conflict_score > 0.3 (there is real disagreement between object values)
    • The claim with object "no_gastroparesis_signal" has assertion_count = 2
    • Claims are sorted descending by weight_share

1.3 test_semaglutide_source_class_decay

Using the same four assertions, all with timestamp 6 months (180 days) ago:

Query with source_class_decay: true:

  • Regulatory assertion (Tier 0): confidence unchanged (no half-life)
  • Clinical assertions (Tier 1, 730-day half-life): confidence decayed slightly (0.9 * 2^(-180/730) ≈ 0.76)
  • Anecdotal assertion (Tier 5, 30-day half-life): confidence decayed to near zero (0.2 * 2^(-180/30) ≈ 0.003)

Assert:

  • After decay, the Anecdotal assertion's effective confidence is < 0.01
  • After decay, the Regulatory assertion's confidence is exactly 1.0
  • After decay, Clinical assertions' confidence is between 0.7 and 0.85
  • Authority lens after decay still picks Regulatory as winner
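
The expected values above follow from exponential half-life decay. A minimal sketch of that arithmetic, assuming the formula effective = confidence * 2^(-age / half_life) implied by the worked numbers. The function name `decayed_confidence` and the use of `Option` to model "no half-life" are illustrative and not stemedb's actual API:

```rust
/// Hypothetical helper: exponential half-life decay of a confidence value.
/// `half_life_days = None` models a tier with no half-life (e.g. Regulatory),
/// in which case the confidence is returned unchanged.
fn decayed_confidence(confidence: f64, age_days: f64, half_life_days: Option<f64>) -> f64 {
    match half_life_days {
        None => confidence,
        // One half-life halves the value, two quarter it, and so on.
        Some(half_life) => confidence * 0.5_f64.powf(age_days / half_life),
    }
}
```

With an age of 180 days this gives exactly 1.0 for the Regulatory assertion, roughly 0.76 for the Clinical assertions, and roughly 0.003 for the Anecdotal assertion, which is where the assertion bounds in this test come from.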

1.4 test_semaglutide_time_travel

Using the same four assertions with staggered timestamps (T, T+100, T+200, T+300):

Query with as_of: T+150:

  • Only assertions at T and T+100 are included
  • Assert exactly 2 candidates
  • Conflict landscape is different from the full query (only FDA + NEJM)
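
The inclusion rule for `as_of` can be sketched as a simple timestamp filter. This is an illustration only: the `Assertion` struct is a stripped-down stand-in, `visible_as_of` is a hypothetical name, and treating the bound as inclusive is an assumption.

```rust
/// Hypothetical minimal assertion record; only the fields the filter needs.
#[derive(Debug, Clone, PartialEq)]
struct Assertion {
    agent: &'static str,
    timestamp: u64,
}

/// Keep only assertions that already existed at `as_of`.
/// The inclusive comparison (`<=`) is an assumption of this sketch.
fn visible_as_of(assertions: &[Assertion], as_of: u64) -> Vec<&Assertion> {
    assertions.iter().filter(|a| a.timestamp <= as_of).collect()
}
```

With timestamps T, T+100, T+200, T+300 and `as_of = T+150`, only the first two assertions survive the filter, so any lens run on the result sees two candidates instead of four.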

Battery 2: The JWT Conflict Scenario

Reproduces the JWT outage story. Validates escalation — the claim that Episteme is an "active safety system."

2.1 test_jwt_conflict_escalation_fires

Setup:

  • RFC 7519 (Tier 0, confidence 1.0): predicate=aud_validation, object=Boolean(true)
  • Internal wiki (Tier 3, confidence 0.8): predicate=aud_validation, object=Boolean(false)
  • Stack Overflow (Tier 5, confidence 0.6): predicate=aud_validation, object=Boolean(false)
  • Approved runbook (Tier 2, confidence 0.95): predicate=aud_validation, object=Boolean(true)

Configure escalation policy:

name: "security-config"
min_conflict_score: 0.5
level: High
predicate_pattern: None

Ingest all four. Run materializer with escalation policies.

Assert:

  • Escalation event is created (query ESC: prefix, find at least one)
  • Event has level = High
  • Event has conflict_score >= 0.5
  • Event has correct subject and predicate
  • Event resolved = false

2.2 test_jwt_escalation_predicate_filter

Same four assertions as 2.1. Two policies:

  • Policy A: predicate_pattern: Some("aud"), min_conflict_score: 0.3, level: Critical
  • Policy B: predicate_pattern: Some("revenue"), min_conflict_score: 0.3, level: Medium

Assert:

  • Policy A fires (predicate aud_validation contains "aud")
  • Policy B does NOT fire (predicate doesn't contain "revenue")
  • Only one escalation event exists, with level Critical
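
The policy matching that 2.1 and 2.2 exercise can be sketched as follows. The field names mirror the policy shown in 2.1, but the types, the `fires` and `fired_levels` helpers, and the policy names "policy-a" and "policy-b" used in the usage note are hypothetical. Substring matching is taken from the "contains" wording above.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Level {
    Medium,
    High,
    Critical,
}

/// Hypothetical policy shape mirroring the fields listed in 2.1.
struct EscalationPolicy {
    name: &'static str,
    min_conflict_score: f64,
    level: Level,
    /// `None` matches every predicate; `Some(p)` matches by substring.
    predicate_pattern: Option<&'static str>,
}

impl EscalationPolicy {
    /// A policy fires when the predicate matches and the score meets the threshold.
    fn fires(&self, predicate: &str, conflict_score: f64) -> bool {
        let predicate_ok = self
            .predicate_pattern
            .map_or(true, |pattern| predicate.contains(pattern));
        predicate_ok && conflict_score >= self.min_conflict_score
    }
}

/// Levels of all policies that fire for one (predicate, score) pair.
fn fired_levels(policies: &[EscalationPolicy], predicate: &str, conflict_score: f64) -> Vec<Level> {
    policies
        .iter()
        .filter(|policy| policy.fires(predicate, conflict_score))
        .map(|policy| policy.level)
        .collect()
}
```

For the predicate `aud_validation` and any score of at least 0.3, a policy with pattern "aud" fires and one with pattern "revenue" does not, so `fired_levels` returns a single `Critical` entry.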

2.3 test_jwt_layered_lens_tier_agreement

Same four assertions. Query with Layered Consensus lens.

Assert:

  • Tier 0 result: winner object = Boolean(true) (RFC says validate)
  • Tier 2 result: winner object = Boolean(true) (Runbook agrees)
  • Tier 3 result: winner object = Boolean(false) (Wiki says skip)
  • Tier 5 result: winner object = Boolean(false) (SO says skip)
  • overall_conflict_score > 0.5 (cross-tier disagreement between Tiers 0/2 and Tiers 3/5)
  • overall_winner comes from Tier 0 (highest authority)

Battery 3: Decay Math Precision

Aphoria computes conflict scores after decay. If decay is wrong, every conflict score is wrong.

3.1 test_decay_tier0_never_decays

Regulatory assertion, confidence 0.95, timestamp 10 years ago. Query with source_class_decay: true.

Assert: effective confidence is exactly 0.95 (unchanged).

3.2 test_decay_tier1_exact_halflife

Clinical assertion, confidence 1.0, timestamp exactly 730 days ago. Query with source_class_decay: true.

Assert: effective confidence is 0.5 (within tolerance of 0.02).

3.3 test_decay_tier1_two_halflives

Clinical assertion, confidence 1.0, timestamp exactly 1460 days ago. Query with source_class_decay: true.

Assert: effective confidence is 0.25 (within tolerance of 0.02).

3.4 test_decay_tier5_exact_halflife

Anecdotal assertion, confidence 1.0, timestamp exactly 30 days ago. Query with source_class_decay: true.

Assert: effective confidence is 0.5 (within tolerance of 0.02).

3.5 test_decay_tier5_three_halflives

Anecdotal assertion, confidence 1.0, timestamp exactly 90 days ago. Query with source_class_decay: true.

Assert: effective confidence is 0.125 (within tolerance of 0.02).

3.6 test_decay_zero_confidence_stays_zero

Assertion with confidence 0.0, any tier, any age.

Assert: effective confidence is 0.0 after decay (0 * anything = 0).

3.7 test_decay_never_goes_negative

Anecdotal assertion, confidence 0.01, timestamp 365 days ago (12+ half-lives).

Assert: effective confidence >= 0.0.

3.8 test_decay_uses_as_of_for_age_calculation

Two assertions, both at timestamp T=1000:

  • Assertion A: Clinical, confidence 0.9
  • Assertion B: Anecdotal, confidence 0.9

Query with as_of: T + 730*86400 (exactly 730 days after assertions) and source_class_decay: true.

Assert:

  • A's effective confidence ≈ 0.45 (Clinical, one half-life)
  • B's effective confidence is near zero (Anecdotal, 24+ half-lives at 30-day rate)
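
The point of this test is which clock the age is measured against. A minimal sketch, assuming timestamps are in seconds (as the `730*86400` offset suggests) and using a hypothetical `effective_confidence` helper that takes the reference time as an explicit parameter:

```rust
const SECONDS_PER_DAY: f64 = 86_400.0;

/// Hypothetical helper: decay measured from `asserted_at` to `reference_time`.
/// For a historical query the caller passes `as_of` here, not the wall clock.
fn effective_confidence(
    confidence: f64,
    asserted_at: u64,
    reference_time: u64,
    half_life_days: f64,
) -> f64 {
    // saturating_sub guards against a reference time before the assertion.
    let age_days = reference_time.saturating_sub(asserted_at) as f64 / SECONDS_PER_DAY;
    confidence * 0.5_f64.powf(age_days / half_life_days)
}
```

Passing `as_of = T + 730*86400` yields about 0.45 for the Clinical assertion. Passing a later wall-clock time for the same query would yield a smaller value, which is the bug this test is designed to catch.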

Battery 4: Conflict Score Calibration

Two conflict score implementations exist. compute_conflict_score in traits.rs uses confidence variance. calculate_conflict_score in skeptic/analysis.rs uses Shannon entropy over object value groups. Both need validation.

4.1 test_variance_conflict_score_unanimous

5 assertions, all confidence 0.8. compute_conflict_score() returns 0.0 (no variance).

4.2 test_variance_conflict_score_maximum

2 assertions, confidence 0.0 and 1.0. compute_conflict_score() returns 1.0 (maximum variance).

4.3 test_variance_conflict_score_moderate

3 assertions, confidence 0.2, 0.5, 0.8. compute_conflict_score() returns a value between 0.2 and 0.8.

4.4 test_variance_conflict_score_single

1 assertion. Returns 0.0.

4.5 test_variance_conflict_score_empty

0 assertions. Returns 0.0.
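
The expectations in 4.1 through 4.5 are consistent with a normalized variance. The sketch below is one such formula, not the code in traits.rs: it assumes population variance divided by 0.25 (the largest variance possible for values in [0, 1]), and the name `variance_conflict_score` is hypothetical. The NaN guard anticipates 4.8.

```rust
/// Sketch of a variance-based conflict score over confidences in [0, 1].
/// Assumption: population variance divided by 0.25, so the result is in [0, 1].
fn variance_conflict_score(confidences: &[f32]) -> f32 {
    // Zero or one value cannot disagree with anything.
    if confidences.len() < 2 {
        return 0.0;
    }
    let n = confidences.len() as f32;
    let mean = confidences.iter().sum::<f32>() / n;
    let variance = confidences.iter().map(|c| (c - mean).powi(2)).sum::<f32>() / n;
    let score = variance / 0.25;
    // Defensive: NaN inputs yield 0.0 instead of propagating.
    if score.is_nan() {
        0.0
    } else {
        score.clamp(0.0, 1.0)
    }
}
```

Under this normalization, confidences 0.0 and 1.0 have variance 0.25 and score 1.0, while 0.2, 0.5, 0.8 have variance 0.06 and score 0.24, which falls inside the 0.2 to 0.8 window of 4.3.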

4.6 test_skeptic_entropy_same_confidence_different_objects [POTENTIAL BUG DETECTOR]

Three assertions, ALL with confidence 0.9:

  • Object A: Text("yes"), confidence 0.9
  • Object B: Text("no"), confidence 0.9
  • Object C: Text("no"), confidence 0.9

Skeptic lens analyze():

  • Groups into 2 claims: "yes" (weight 0.9) and "no" (weight 1.8)
  • Entropy is non-zero because the total weight is split across two groups (shares 1/3 and 2/3)
  • conflict_score > 0.0
  • status is NOT Unanimous

Note: The variance-based compute_conflict_score would return 0.0 for these candidates (all same confidence). The Skeptic entropy-based score correctly detects the disagreement. This test validates the Skeptic lens is the correct tool for Aphoria's conflict detection, NOT the variance-based score.

4.7 test_skeptic_entropy_unanimous_different_confidence

Three assertions, all same object Text("yes"), but different confidences (0.3, 0.6, 0.9):

Skeptic lens analyze():

  • Groups into 1 claim (all same object)
  • conflict_score = 0.0 (unanimous — no disagreement on the value)
  • status = Unanimous

Note: Even though confidences differ, there's no actual conflict — all sources agree. The Skeptic lens correctly identifies this as unanimous.
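
The contrast between 4.6 and 4.7 can be reproduced with a small entropy calculation. This is a sketch, not the code in skeptic/analysis.rs: it assumes each group's weight is the sum of its confidences (as in the weights 0.9 and 1.8 above) and that the entropy is normalized by log2 of the group count, which is an assumption. The name `entropy_conflict_score` is hypothetical.

```rust
use std::collections::BTreeMap;

/// Sketch of an entropy-based conflict score over (object, confidence) pairs.
/// Groups by object, sums confidence per group, then returns the Shannon
/// entropy of the weight shares divided by log2(number of groups).
fn entropy_conflict_score(assertions: &[(&str, f64)]) -> f64 {
    let mut weights: BTreeMap<&str, f64> = BTreeMap::new();
    for &(object, confidence) in assertions {
        *weights.entry(object).or_insert(0.0) += confidence;
    }
    let total: f64 = weights.values().sum();
    // A single group (or no weight at all) means there is nothing to contest.
    if weights.len() < 2 || total <= 0.0 {
        return 0.0;
    }
    let entropy: f64 = weights
        .values()
        .filter(|w| **w > 0.0)
        .map(|w| {
            let share = w / total;
            -share * share.log2()
        })
        .sum();
    entropy / (weights.len() as f64).log2()
}
```

For 4.6 the shares are 1/3 and 2/3, giving a score of about 0.92 even though every confidence is 0.9. For 4.7 there is one group, so the score is 0.0 regardless of how the confidences differ.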

4.8 test_variance_score_nan_defensive

2 assertions with confidence f32::NAN. compute_conflict_score() returns 0.0 (defensive, not NaN propagation).


Battery 5: scan_prefix with ConceptPath-shaped Keys

Storage foundation for hierarchical queries.

5.1 test_prefix_scan_concept_path_keys

Store via IndexStore:

S:code://rust/citadeldb/auth/jwt/aud_validation  → [hash_a]
S:code://rust/citadeldb/auth/jwt/expiry          → [hash_b]
S:code://rust/citadeldb/net/tls/verify           → [hash_c]
S:code://rust/citadeldb/auth/oauth/scopes        → [hash_d]

Assert:

  • scan_prefix("S:code://rust/citadeldb/auth/jwt/") → 2 keys (aud_validation, expiry)
  • scan_prefix("S:code://rust/citadeldb/auth/") → 3 keys (jwt/aud_validation, jwt/expiry, oauth/scopes)
  • scan_prefix("S:code://rust/citadeldb/") → 4 keys (all)
  • scan_prefix("S:code://") → 4 keys (all)
  • scan_prefix("S:rfc://") → 0 keys (different scheme)

5.2 test_prefix_scan_no_false_positives

Store:

S:code://rust/citadeldb/auth          → [hash_a]
S:code://rust/citadeldb/authentication → [hash_b]

Assert:

  • scan_prefix("S:code://rust/citadeldb/auth/") → 0 keys (trailing slash prevents matching "auth" without children)
  • scan_prefix("S:code://rust/citadeldb/auth") → 2 keys (both match the prefix "auth")

This validates that the trailing / in hierarchical queries is necessary to prevent auth from matching authentication.
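
The behavior in 5.1 and 5.2 is what a byte-wise prefix scan over a sorted key space produces. A minimal sketch, using a standard-library `BTreeMap` as a stand-in for the real IndexStore (whose API is not shown here); this `scan_prefix` is a free function written for illustration:

```rust
use std::collections::BTreeMap;

/// Sketch of a prefix scan over an ordered key space. Keys sharing a prefix
/// are contiguous in sort order, so the scan seeks to the prefix and stops
/// at the first key that no longer starts with it.
fn scan_prefix<'a>(index: &'a BTreeMap<String, Vec<u8>>, prefix: &str) -> Vec<&'a str> {
    index
        .range(prefix.to_string()..)
        .take_while(|(key, _)| key.starts_with(prefix))
        .map(|(key, _)| key.as_str())
        .collect()
}
```

Because matching is purely textual, the prefix ".../auth" matches both ".../auth" and ".../authentication", while ".../auth/" matches neither. A hierarchical query for the children of a path therefore needs the trailing slash.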

5.3 test_prefix_scan_sp_keys_with_concept_paths

Store via IndexStore (using SP: compound keys):

SP:code://rust/citadeldb/auth/jwt/aud_validation:config_value  → [hash_a]
SP:code://rust/citadeldb/auth/jwt/expiry:config_value          → [hash_b]

Assert:

  • scan_prefix("SP:code://rust/citadeldb/auth/jwt/") → 2 keys
  • The parsed SP key for hash_a correctly splits into subject=code://rust/citadeldb/auth/jwt/aud_validation and predicate=config_value (validates the rfind fix)
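
The reason the split has to use the last colon can be shown in a few lines. This is a sketch with a hypothetical `parse_sp_key` helper, assuming the key layout `SP:{subject}:{predicate}` shown above and a predicate that contains no colon:

```rust
/// Sketch: split an `SP:` compound key into (subject, predicate).
/// Subjects such as `code://...` contain ':' themselves, so the split must
/// use the LAST colon (`rfind`), not the first one.
fn parse_sp_key(key: &str) -> Option<(&str, &str)> {
    let rest = key.strip_prefix("SP:")?;
    let split = rest.rfind(':')?;
    Some((&rest[..split], &rest[split + 1..]))
}
```

Splitting at the first colon instead would cut the subject down to `code` and push the rest of the path into the predicate.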

Battery 6: Signature Tamper Detection

Aphoria ingests signed assertions. If signature verification has gaps, tampered claims enter the graph.

6.1 test_valid_signature_accepted

Agent A signs an assertion. Ingest through IngestWorker.

Assert: assertion is stored, index entries exist.

6.2 test_tampered_confidence_rejected

Agent A signs assertion with confidence=0.8. Modify the serialized assertion bytes to change confidence to 1.0. Attempt to ingest.

Assert: IngestError::InvalidSignature. Assertion is NOT stored.

6.3 test_tampered_subject_rejected

Agent A signs assertion with subject="X". Clone the assertion, change subject to "Y", keep original signature.

Assert: ingestion fails with invalid signature.

6.4 test_wrong_agent_id_rejected

Agent A signs assertion. Replace agent_id in the SignatureEntry with Agent B's public key (but keep Agent A's signature bytes).

Assert: ingestion fails — the signature was made by A's private key but claims to be from B's public key.

6.5 test_multi_sig_all_valid_accepted

Agent A and Agent B both sign the same assertion (two valid SignatureEntries).

Assert: ingestion succeeds.

6.6 test_multi_sig_one_invalid_rejected

Agent A signs validly, Agent B's signature is invalid (tampered).

Assert: ingestion fails. ALL signatures must be valid.


Battery 7: Materialized View Consistency

Aphoria queries MVs for fast conflict checks. Stale or inconsistent MVs produce wrong verdicts.

7.1 test_mv_initial_materialization

Ingest assertion A (confidence 0.9) for subject=S, predicate=P. Run materializer step().

Assert:

  • MV exists at MV:{S}:{P}
  • MV winner_hash matches A's content hash
  • MV confidence = 0.9
  • Changelog entry exists (first materialization)

7.2 test_mv_winner_changes_on_update

Ingest A (confidence 0.9), materialize. Then ingest B (same S/P, confidence 0.95), materialize again.

Assert:

  • MV winner changes to B
  • Changelog has 2 entries: initial (winner=A), update (previous=A, new=B)

7.3 test_mv_no_changelog_when_winner_unchanged

Ingest A (confidence 0.9), materialize. Ingest B (same S/P, confidence 0.5), materialize again.

Assert:

  • MV winner stays A (B has lower confidence)
  • No new changelog entry after second materialization
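
The changelog rule in 7.1 through 7.3 (log the first materialization, log a winner change, stay silent otherwise) can be sketched like this. The types and the `materialize` method are hypothetical stand-ins, and picking the winner by highest confidence is an assumption based on the expectations above:

```rust
#[derive(Debug, Clone, PartialEq)]
struct ChangelogEntry {
    previous: Option<String>,
    new: String,
}

/// Hypothetical in-memory stand-in for one materialized view.
#[derive(Debug, Default)]
struct MaterializedView {
    winner_hash: Option<String>,
    confidence: f64,
    changelog: Vec<ChangelogEntry>,
}

impl MaterializedView {
    /// Re-pick the winner among (hash, confidence) candidates and append a
    /// changelog entry only when the winner actually changes.
    fn materialize(&mut self, candidates: &[(&str, f64)]) {
        let best = candidates
            .iter()
            .copied()
            .max_by(|a, b| a.1.total_cmp(&b.1));
        if let Some((hash, confidence)) = best {
            if self.winner_hash.as_deref() != Some(hash) {
                self.changelog.push(ChangelogEntry {
                    previous: self.winner_hash.clone(),
                    new: hash.to_string(),
                });
                self.winner_hash = Some(hash.to_string());
            }
            self.confidence = confidence;
        }
    }
}
```

Materializing A (0.9) and then A plus B (0.95) leaves two changelog entries, while materializing A and then A plus B (0.5) leaves one.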

7.4 test_mv_since_query_returns_changelog

Ingest A at T=1000, materialize at T=1001. Ingest B at T=2000, materialize at T=2001.

Query with since: 1500:

  • Returns changelog entries only from after T=1500
  • Should include the B materialization but not the A materialization

7.5 test_mv_max_stale_fast_path

Ingest A, materialize. Query immediately with max_stale: 60.

Assert: fast path is used (MV is fresh).

7.6 test_mv_max_stale_slow_path

Ingest A, materialize. Wait (or mock time) so MV is 120 seconds old. Query with max_stale: 60.

Assert: slow path is used (MV is stale, falls through to index lookup).
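
The fast/slow decision in 7.5 and 7.6 reduces to comparing the view's age with `max_stale`. A minimal sketch with hypothetical names, assuming both values are in seconds and that an age exactly equal to `max_stale` still counts as fresh:

```rust
#[derive(Debug, PartialEq)]
enum ReadPath {
    Fast,
    Slow,
}

/// Sketch: choose the read path from the materialized view's age.
/// Assumes seconds throughout; the inclusive bound is an assumption.
fn choose_path(materialized_at: u64, now: u64, max_stale: u64) -> ReadPath {
    let age = now.saturating_sub(materialized_at);
    if age <= max_stale {
        ReadPath::Fast
    } else {
        ReadPath::Slow
    }
}
```

Taking `now` as a parameter rather than reading the system clock is what makes the 120-second case testable without sleeping.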


Findings to Watch For

Known Risk: Two Conflict Score Implementations

compute_conflict_score in traits.rs (line 89) uses confidence variance. It measures how much confidence values disagree, not how much object values disagree. Three sources saying "yes" at 0.9 and two sources saying "no" at 0.9 produce a conflict score of 0.0 because all confidences are identical.

calculate_conflict_score in skeptic/analysis.rs (line 36) uses Shannon entropy over object value groups. It correctly detects that "yes" vs "no" is a real conflict regardless of confidence values.

Aphoria must use the Skeptic lens for conflict detection, not the standard lens conflict score. Battery 4.6 validates this distinction explicitly. If Aphoria were to use compute_conflict_score from standard lenses, it would miss conflicts where sources disagree on values but agree on confidence levels.

Known Risk: Decay + Time-Travel Interaction

When both source_class_decay and as_of are set, the age calculation must use as_of as the reference time, not now. Battery 3.8 validates this. If the implementation uses now for age but filters by as_of for inclusion, the decay amounts will be wrong for historical queries.

ConceptPath Readiness

Battery 5 validates the storage layer works with ConceptPath-shaped keys before any type changes. If these tests pass, the scan_prefix foundation is solid and ConceptPath implementation can proceed with confidence.