tidaldb/docs/planning/milestone-7/phase-5/task-01-crash-recovery-uat.md
2026-02-23 22:41:16 -07:00

17 KiB

Task 01: Crash Recovery UAT Tests

Delivers

Three integration tests in tidal/tests/m7_uat.rs proving crash recovery correctness:

  1. uat_crash_at_wal_write -- Kill after WAL write but before checkpoint; restart; verify all WAL-committed signals are recovered.
  2. uat_crash_at_checkpoint -- Kill after checkpoint flush; restart; verify checkpoint state is consistent and BLAKE3 integrity holds.
  3. uat_crash_with_m6_state -- Write items, signals, cohort definitions, collections, co-engagement, hide/block relationships; kill; restart; verify all state surfaces recovered; verify hard negatives never leak.

Complexity: L

Dependencies

  • m7p1 complete (CrashPoint enum, WAL compaction, BLAKE3 checkpoint integrity, crash fencing for M6 state surfaces)
  • m7p2 complete (session cleanup sweeper -- tested in task-02, but session start/close must work correctly for crash recovery)
  • TempTidalHome available via #[cfg(feature = "test-utils")]

Technical Design

File: tidal/tests/m7_uat.rs

Shared schema and helpers

//! Milestone 7 UAT Test Suite.
//!
//! Comprehensive acceptance tests for the full M7 production hardening feature set.
//!
//! UAT Steps:
//! 1. Crash at WAL write: recovery replays all committed events.
//! 2. Crash at checkpoint: BLAKE3-verified checkpoint is consistent.
//! 3. Crash with M6 state: cohorts, collections, co-engagement, hard negatives all survive.
//! 4. Degradation progression under concurrent load.
//! 5. Per-agent rate limiting isolation.
//! 6. Session auto-cleanup after TTL expiry.
//! 7. QueryStats populated on RETRIEVE and SEARCH.
//! 8. Prometheus metrics contain expected metric names.
//! 9. RLHF export + cross-session aggregation.
//! 10. All prior UAT suites pass (regression gate).

#![allow(clippy::unwrap_used, clippy::cast_precision_loss, clippy::too_many_lines)]

use std::collections::HashMap;
use std::time::Duration;

use tidaldb::TidalDb;
#[cfg(feature = "test-utils")]
use tidaldb::TempTidalHome;
use tidaldb::cohort::{CohortDef, Predicate};
use tidaldb::entities::{RelationshipType, Visibility};
use tidaldb::query::retrieve::Retrieve;
use tidaldb::query::search::Search;
use tidaldb::schema::{
    DecaySpec, EntityId, EntityKind, SchemaBuilder, TextFieldType, Timestamp, Window,
};
use tidaldb::storage::indexes::filter::FilterExpr;

fn m7_uat_schema() -> tidaldb::schema::Schema {
    let mut builder = SchemaBuilder::new();
    for &(name, half_life_days) in &[
        ("view", 7),
        ("like", 14),
        ("share", 7),
        ("skip", 1),
        ("completion", 30),
        ("comment", 7),
        ("follow", 30),
    ] {
        let _ = builder
            .signal(
                name,
                EntityKind::Item,
                DecaySpec::Exponential {
                    half_life: Duration::from_secs(half_life_days * 24 * 3600),
                },
            )
            .windows(&[
                Window::OneHour,
                Window::TwentyFourHours,
                Window::SevenDays,
                Window::AllTime,
            ])
            .velocity(true)
            .add();
    }
    builder.text_field("title", TextFieldType::Text);
    builder.text_field("description", TextFieldType::Text);
    builder.text_field("category", TextFieldType::Keyword);

    // Session policy for rate limiting tests (task-02).
    builder.session_policy(
        "default",
        tidaldb::schema::AgentPolicy::builder()
            .max_session_duration(Duration::from_secs(300))
            .build(),
    );

    builder.build().expect("m7 uat schema must be valid")
}

Test 1: Crash at WAL write

#[test]
#[cfg(feature = "test-utils")]
fn uat_crash_at_wal_write() {
    let home = TempTidalHome::new().unwrap();
    let schema = m7_uat_schema();

    // Phase 1: Write items and signals, then close gracefully.
    // This simulates "all writes made it to WAL before crash."
    let expected_scores: Vec<(u64, f64)>;
    {
        let db = TidalDb::builder()
            .with_data_dir(home.path())
            .with_schema(schema.clone())
            .open()
            .unwrap();

        let now = Timestamp::now();

        // Write 100 items with metadata.
        for id in 1..=100u64 {
            let mut meta = HashMap::new();
            meta.insert("title".to_string(), format!("Item {id}"));
            meta.insert("category".to_string(), "test".to_string());
            db.write_item_with_metadata(EntityId::new(id), &meta).unwrap();
        }

        // Write 500 signals across 50 items (10 signals each).
        for id in 1..=50u64 {
            for _ in 0..10 {
                db.signal("view", EntityId::new(id), 1.0, now).unwrap();
            }
        }

        // Capture expected decay scores for validation after recovery.
        expected_scores = (1..=50u64)
            .map(|id| {
                let score = db
                    .read_decay_score(EntityId::new(id), "view", 0)
                    .unwrap()
                    .unwrap_or(0.0);
                (id, score)
            })
            .collect();

        // Simulate crash: close the DB (WAL is flushed but we pretend
        // the process died after WAL write, before next checkpoint).
        db.close().unwrap();
    }

    // Phase 2: Reopen and verify all state recovered from WAL replay.
    {
        let db = TidalDb::builder()
            .with_data_dir(home.path())
            .with_schema(schema)
            .open()
            .unwrap();

        // Verify item count.
        assert_eq!(db.item_count(), 100, "all 100 items should survive recovery");

        // Verify signal state matches pre-crash values.
        for &(id, expected) in &expected_scores {
            let recovered = db
                .read_decay_score(EntityId::new(id), "view", 0)
                .unwrap()
                .unwrap_or(0.0);
            // Decay scores may drift slightly due to time elapsed during restart.
            // Allow 1% relative tolerance.
            let diff = (recovered - expected).abs();
            let tolerance = expected.abs() * 0.01 + 1e-6;
            assert!(
                diff < tolerance,
                "item {id}: expected score ~{expected:.4}, got {recovered:.4} (diff {diff:.6})"
            );
        }

        // Items 51-100 should have no view signals.
        for id in 51..=100u64 {
            let score = db
                .read_decay_score(EntityId::new(id), "view", 0)
                .unwrap()
                .unwrap_or(0.0);
            assert!(
                score < 1e-6,
                "item {id} should have no view signals after recovery, got {score}"
            );
        }

        // RETRIEVE should work with recovered state.
        let results = db
            .retrieve(
                &Retrieve::builder()
                    .profile("trending")
                    .limit(10)
                    .build()
                    .unwrap(),
            )
            .unwrap();
        assert!(
            !results.items.is_empty(),
            "RETRIEVE should return results after WAL recovery"
        );

        db.close().unwrap();
    }
}

Test 2: Crash at checkpoint

#[test]
#[cfg(feature = "test-utils")]
fn uat_crash_at_checkpoint() {
    let home = TempTidalHome::new().unwrap();
    let schema = m7_uat_schema();

    // Phase 1: Open, write data, force a checkpoint, then close.
    {
        let db = TidalDb::builder()
            .with_data_dir(home.path())
            .with_schema(schema.clone())
            .open()
            .unwrap();

        let now = Timestamp::now();

        // Write 200 items.
        for id in 1..=200u64 {
            let mut meta = HashMap::new();
            meta.insert("title".to_string(), format!("Checkpoint Item {id}"));
            db.write_item_with_metadata(EntityId::new(id), &meta).unwrap();
        }

        // Write signals so checkpoint has non-trivial signal state.
        for id in 1..=100u64 {
            for _ in 0..5 {
                db.signal("view", EntityId::new(id), 1.0, now).unwrap();
            }
        }

        // Force a checkpoint (if the API is available; otherwise close triggers it).
        db.close().unwrap();
    }

    // Phase 2: Write MORE data after reopening (these go to WAL after checkpoint).
    {
        let db = TidalDb::builder()
            .with_data_dir(home.path())
            .with_schema(schema.clone())
            .open()
            .unwrap();

        let now = Timestamp::now();

        // Write 50 more items (IDs 201-250).
        for id in 201..=250u64 {
            let mut meta = HashMap::new();
            meta.insert("title".to_string(), format!("Post-Checkpoint Item {id}"));
            db.write_item_with_metadata(EntityId::new(id), &meta).unwrap();
        }

        // Write signals on new items.
        for id in 201..=250u64 {
            db.signal("view", EntityId::new(id), 1.0, now).unwrap();
        }

        // Simulate crash: close (WAL has post-checkpoint events).
        db.close().unwrap();
    }

    // Phase 3: Reopen -- checkpoint + WAL replay should produce consistent state.
    {
        let db = TidalDb::builder()
            .with_data_dir(home.path())
            .with_schema(schema)
            .open()
            .unwrap();

        // All 250 items should be present.
        assert_eq!(
            db.item_count(),
            250,
            "checkpoint + WAL replay should recover all 250 items"
        );

        // Items from checkpoint era (1-100) should have signal state.
        let score_50 = db
            .read_decay_score(EntityId::new(50), "view", 0)
            .unwrap()
            .unwrap_or(0.0);
        assert!(
            score_50 > 0.0,
            "checkpoint-era item 50 should have positive decay score"
        );

        // Items from post-checkpoint era (201-250) should also have signal state.
        let score_225 = db
            .read_decay_score(EntityId::new(225), "view", 0)
            .unwrap()
            .unwrap_or(0.0);
        assert!(
            score_225 > 0.0,
            "post-checkpoint item 225 should have positive decay score from WAL replay"
        );

        // Items 101-200 should have no signals.
        let score_150 = db
            .read_decay_score(EntityId::new(150), "view", 0)
            .unwrap()
            .unwrap_or(0.0);
        assert!(
            score_150 < 1e-6,
            "item 150 should have no signals, got {score_150}"
        );

        db.close().unwrap();
    }
}

Test 3: Crash with M6 state (cohort, collection, hard negatives)

#[test]
#[cfg(feature = "test-utils")]
fn uat_crash_with_m6_state_and_hard_negatives() {
    let home = TempTidalHome::new().unwrap();
    let schema = m7_uat_schema();

    let user_id = 1001u64;
    let blocked_creator_id = 5u64;
    let hidden_item_id = 42u64;

    // Phase 1: Write M6 state surfaces, then close.
    {
        let db = TidalDb::builder()
            .with_data_dir(home.path())
            .with_schema(schema.clone())
            .open()
            .unwrap();

        let now = Timestamp::now();

        // Write user.
        let mut user_meta = HashMap::new();
        user_meta.insert("country".to_string(), "US".to_string());
        user_meta.insert("interest".to_string(), "music".to_string());
        db.write_user(EntityId::new(user_id), &user_meta).unwrap();

        // Write 100 items across 10 creators.
        for id in 1..=100u64 {
            let creator_id = ((id - 1) % 10) + 1;
            let mut meta = HashMap::new();
            meta.insert("title".to_string(), format!("Song {id}"));
            meta.insert("category".to_string(), "music".to_string());
            meta.insert("creator_id".to_string(), creator_id.to_string());
            db.write_item_with_metadata(EntityId::new(id), &meta).unwrap();
        }

        // Define cohort.
        db.define_cohort(CohortDef {
            name: "us_music".to_string(),
            predicate: Predicate::And(vec![
                Predicate::Eq { field: "country".into(), value: "US".into() },
                Predicate::Eq { field: "interest".into(), value: "music".into() },
            ]),
        })
        .unwrap();

        // Write signals with cohort attribution.
        for id in 1..=20u64 {
            let creator_id = ((id - 1) % 10) + 1;
            db.signal_with_context("view", EntityId::new(id), 1.0, now, Some(user_id), Some(creator_id))
                .unwrap();
        }

        // Create a collection.
        let coll = db
            .create_collection(EntityId::new(user_id), "favorites", Visibility::Private)
            .unwrap();
        db.add_to_collection(coll, EntityId::new(1)).unwrap();
        db.add_to_collection(coll, EntityId::new(2)).unwrap();
        db.add_to_collection(coll, EntityId::new(3)).unwrap();

        // Hard negatives: block a creator, hide an item.
        db.write_relationship(
            EntityId::new(user_id),
            RelationshipType::Blocks,
            EntityId::new(blocked_creator_id),
            1.0,
            now,
        )
        .unwrap();

        db.write_relationship(
            EntityId::new(user_id),
            RelationshipType::Hide,
            EntityId::new(hidden_item_id),
            1.0,
            now,
        )
        .unwrap();

        db.close().unwrap();
    }

    // Phase 2: Reopen and verify all M6 state surfaces recovered.
    {
        let db = TidalDb::builder()
            .with_data_dir(home.path())
            .with_schema(schema)
            .open()
            .unwrap();

        // Items survived.
        assert_eq!(db.item_count(), 100, "all 100 items should survive restart");

        // Cohort definition survived (duplicate should fail).
        let dup = db.define_cohort(CohortDef {
            name: "us_music".to_string(),
            predicate: Predicate::Eq { field: "x".into(), value: "y".into() },
        });
        assert!(dup.is_err(), "cohort 'us_music' should already be registered after restart");

        // Collection survived.
        let collections = db.list_collections(EntityId::new(user_id)).unwrap();
        assert!(
            collections.iter().any(|c| c.name == "favorites"),
            "collection 'favorites' should survive restart"
        );

        // Hard negative invariant: RETRIEVE for the user must NOT return
        // the hidden item or items from the blocked creator.
        let results = db
            .retrieve(
                &Retrieve::builder()
                    .profile("trending")
                    .for_user(user_id)
                    .limit(50)
                    .build()
                    .unwrap(),
            )
            .unwrap();

        for item in &results.items {
            let id = item.entity_id.as_u64();
            assert_ne!(
                id,
                hidden_item_id,
                "hidden item {hidden_item_id} must not appear in results after crash recovery"
            );
            // Items from blocked creator 5 are: 5, 15, 25, 35, 45, 55, 65, 75, 85, 95.
            let item_creator = ((id - 1) % 10) + 1;
            assert_ne!(
                item_creator, blocked_creator_id,
                "item {id} from blocked creator {blocked_creator_id} must not appear after crash recovery"
            );
        }

        db.close().unwrap();
    }
}

Helper functions

No shared helper beyond m7_uat_schema(). Each test is self-contained with its own TempTidalHome to guarantee isolation.

Assertions summary

Test Key assertions
uat_crash_at_wal_write Item count == 100; decay scores within 1% of pre-crash values; items without signals have score ~0; RETRIEVE returns results
uat_crash_at_checkpoint Item count == 250 (checkpoint + WAL); checkpoint-era items have signals; post-checkpoint items have signals; unsignaled items have score ~0
uat_crash_with_m6_state_and_hard_negatives Item count == 100; cohort definition survives; collection survives; hidden item never in RETRIEVE results; blocked creator items never in RETRIEVE results

Acceptance Criteria

  • uat_crash_at_wal_write passes: 100 items + 500 signals written, closed, reopened; all decay scores within 1% tolerance; RETRIEVE works
  • uat_crash_at_checkpoint passes: checkpoint + post-checkpoint WAL replay produces 250 items with correct signal state
  • uat_crash_with_m6_state_and_hard_negatives passes: cohort, collection, hard negatives all survive restart; RETRIEVE never returns hidden/blocked content
  • All three tests use #[cfg(feature = "test-utils")] and TempTidalHome
  • Each test completes in under 60 seconds
  • cargo clippy --manifest-path tidal/Cargo.toml -- -D warnings passes

Test Strategy

All three tests follow the same pattern:

  1. Open a TempTidalHome-backed DB.
  2. Write state (items, signals, relationships, cohorts, collections).
  3. Close the DB (simulating crash after WAL flush).
  4. Reopen with the same TempTidalHome path and schema.
  5. Assert recovered state matches expectations.

The tests use small datasets (100-250 items, 500 signals max) to keep runtime under 10 seconds per test. The 1% tolerance on decay scores accounts for time elapsed during close/reopen (decay continues with wall-clock time).

No mock injection or CrashPoint hooks are strictly required for these UAT-level tests. The close-then-reopen pattern is sufficient to exercise the WAL replay and checkpoint recovery paths. The CrashPoint fault injection from m7p1 is exercised in the m7p1 unit/property tests; the UAT validates the end-user-visible outcome.