# Task 01: Crash Recovery UAT Tests

## Delivers

Three integration tests in `tidal/tests/m7_uat.rs` proving crash recovery correctness:

1. **`uat_crash_at_wal_write`** -- Kill after WAL write but before checkpoint; restart; verify all WAL-committed signals are recovered.
2. **`uat_crash_at_checkpoint`** -- Kill after checkpoint flush; restart; verify checkpoint state is consistent and BLAKE3 integrity holds.
3. **`uat_crash_with_m6_state_and_hard_negatives`** -- Write items, signals, cohort definitions, collections, co-engagement, hide/block relationships; kill; restart; verify all state surfaces recovered; verify hard negatives never leak.

## Complexity: L

## Dependencies

- m7p1 complete (CrashPoint enum, WAL compaction, BLAKE3 checkpoint integrity, crash fencing for M6 state surfaces)
- m7p2 complete (session cleanup sweeper -- tested in task-02, but session start/close must work correctly for crash recovery)
- `TempTidalHome` available via `#[cfg(feature = "test-utils")]`

## Technical Design

### File: `tidal/tests/m7_uat.rs`

#### Shared schema and helpers

```rust
//! Milestone 7 UAT Test Suite.
//!
//! Comprehensive acceptance tests for the full M7 production hardening feature set.
//!
//! UAT Steps:
//! 1. Crash at WAL write: recovery replays all committed events.
//! 2. Crash at checkpoint: BLAKE3-verified checkpoint is consistent.
//! 3. Crash with M6 state: cohorts, collections, co-engagement, hard negatives all survive.
//! 4. Degradation progression under concurrent load.
//! 5. Per-agent rate limiting isolation.
//! 6. Session auto-cleanup after TTL expiry.
//! 7. QueryStats populated on RETRIEVE and SEARCH.
//! 8. Prometheus metrics contain expected metric names.
//! 9. RLHF export + cross-session aggregation.
//! 10. All prior UAT suites pass (regression gate).
#![allow(clippy::unwrap_used, clippy::cast_precision_loss, clippy::too_many_lines)]

use std::collections::HashMap;
use std::time::Duration;

use tidaldb::TidalDb;
#[cfg(feature = "test-utils")]
use tidaldb::TempTidalHome;
use tidaldb::cohort::{CohortDef, Predicate};
use tidaldb::entities::{RelationshipType, Visibility};
use tidaldb::query::retrieve::Retrieve;
use tidaldb::query::search::Search;
use tidaldb::schema::{
    DecaySpec, EntityId, EntityKind, SchemaBuilder, TextFieldType, Timestamp, Window,
};
use tidaldb::storage::indexes::filter::FilterExpr;

fn m7_uat_schema() -> tidaldb::schema::Schema {
    let mut builder = SchemaBuilder::new();

    for &(name, half_life_days) in &[
        ("view", 7),
        ("like", 14),
        ("share", 7),
        ("skip", 1),
        ("completion", 30),
        ("comment", 7),
        ("follow", 30),
    ] {
        let _ = builder
            .signal(
                name,
                EntityKind::Item,
                DecaySpec::Exponential {
                    half_life: Duration::from_secs(half_life_days * 24 * 3600),
                },
            )
            .windows(&[
                Window::OneHour,
                Window::TwentyFourHours,
                Window::SevenDays,
                Window::AllTime,
            ])
            .velocity(true)
            .add();
    }

    builder.text_field("title", TextFieldType::Text);
    builder.text_field("description", TextFieldType::Text);
    builder.text_field("category", TextFieldType::Keyword);

    // Session policy for rate limiting tests (task-02).
    builder.session_policy(
        "default",
        tidaldb::schema::AgentPolicy::builder()
            .max_session_duration(Duration::from_secs(300))
            .build(),
    );

    builder.build().expect("m7 uat schema must be valid")
}
```

#### Test 1: Crash at WAL write

```rust
#[test]
#[cfg(feature = "test-utils")]
fn uat_crash_at_wal_write() {
    let home = TempTidalHome::new().unwrap();
    let schema = m7_uat_schema();

    // Phase 1: Write items and signals, then close gracefully.
    // This simulates "all writes made it to WAL before crash."
    let expected_scores: Vec<(u64, f64)>;
    {
        let db = TidalDb::builder()
            .with_data_dir(home.path())
            .with_schema(schema.clone())
            .open()
            .unwrap();

        let now = Timestamp::now();

        // Write 100 items with metadata.
        for id in 1..=100u64 {
            let mut meta = HashMap::new();
            meta.insert("title".to_string(), format!("Item {id}"));
            meta.insert("category".to_string(), "test".to_string());
            db.write_item_with_metadata(EntityId::new(id), &meta).unwrap();
        }

        // Write 500 signals across 50 items (10 signals each).
        for id in 1..=50u64 {
            for _ in 0..10 {
                db.signal("view", EntityId::new(id), 1.0, now).unwrap();
            }
        }

        // Capture expected decay scores for validation after recovery.
        expected_scores = (1..=50u64)
            .map(|id| {
                let score = db
                    .read_decay_score(EntityId::new(id), "view", 0)
                    .unwrap()
                    .unwrap_or(0.0);
                (id, score)
            })
            .collect();

        // Simulate crash: close the DB (WAL is flushed but we pretend
        // the process died after WAL write, before next checkpoint).
        db.close().unwrap();
    }

    // Phase 2: Reopen and verify all state recovered from WAL replay.
    {
        let db = TidalDb::builder()
            .with_data_dir(home.path())
            .with_schema(schema)
            .open()
            .unwrap();

        // Verify item count.
        assert_eq!(db.item_count(), 100, "all 100 items should survive recovery");

        // Verify signal state matches pre-crash values.
        for &(id, expected) in &expected_scores {
            let recovered = db
                .read_decay_score(EntityId::new(id), "view", 0)
                .unwrap()
                .unwrap_or(0.0);

            // Decay scores may drift slightly due to time elapsed during restart.
            // Allow 1% relative tolerance.
            let diff = (recovered - expected).abs();
            let tolerance = expected.abs() * 0.01 + 1e-6;
            assert!(
                diff < tolerance,
                "item {id}: expected score ~{expected:.4}, got {recovered:.4} (diff {diff:.6})"
            );
        }

        // Items 51-100 should have no view signals.
        for id in 51..=100u64 {
            let score = db
                .read_decay_score(EntityId::new(id), "view", 0)
                .unwrap()
                .unwrap_or(0.0);
            assert!(
                score < 1e-6,
                "item {id} should have no view signals after recovery, got {score}"
            );
        }

        // RETRIEVE should work with recovered state.
        let results = db
            .retrieve(
                &Retrieve::builder()
                    .profile("trending")
                    .limit(10)
                    .build()
                    .unwrap(),
            )
            .unwrap();
        assert!(
            !results.items.is_empty(),
            "RETRIEVE should return results after WAL recovery"
        );

        db.close().unwrap();
    }
}
```

#### Test 2: Crash at checkpoint

```rust
#[test]
#[cfg(feature = "test-utils")]
fn uat_crash_at_checkpoint() {
    let home = TempTidalHome::new().unwrap();
    let schema = m7_uat_schema();

    // Phase 1: Open, write data, force a checkpoint, then close.
    {
        let db = TidalDb::builder()
            .with_data_dir(home.path())
            .with_schema(schema.clone())
            .open()
            .unwrap();

        let now = Timestamp::now();

        // Write 200 items.
        for id in 1..=200u64 {
            let mut meta = HashMap::new();
            meta.insert("title".to_string(), format!("Checkpoint Item {id}"));
            db.write_item_with_metadata(EntityId::new(id), &meta).unwrap();
        }

        // Write signals so checkpoint has non-trivial signal state.
        for id in 1..=100u64 {
            for _ in 0..5 {
                db.signal("view", EntityId::new(id), 1.0, now).unwrap();
            }
        }

        // Force a checkpoint (if the API is available; otherwise close triggers it).
        db.close().unwrap();
    }

    // Phase 2: Write MORE data after reopening (these go to WAL after checkpoint).
    {
        let db = TidalDb::builder()
            .with_data_dir(home.path())
            .with_schema(schema.clone())
            .open()
            .unwrap();

        let now = Timestamp::now();

        // Write 50 more items (IDs 201-250).
        for id in 201..=250u64 {
            let mut meta = HashMap::new();
            meta.insert("title".to_string(), format!("Post-Checkpoint Item {id}"));
            db.write_item_with_metadata(EntityId::new(id), &meta).unwrap();
        }

        // Write signals on new items.
        for id in 201..=250u64 {
            db.signal("view", EntityId::new(id), 1.0, now).unwrap();
        }

        // Simulate crash: close (WAL has post-checkpoint events).
        db.close().unwrap();
    }

    // Phase 3: Reopen -- checkpoint + WAL replay should produce consistent state.
    {
        let db = TidalDb::builder()
            .with_data_dir(home.path())
            .with_schema(schema)
            .open()
            .unwrap();

        // All 250 items should be present.
        assert_eq!(
            db.item_count(),
            250,
            "checkpoint + WAL replay should recover all 250 items"
        );

        // Items from checkpoint era (1-100) should have signal state.
        let score_50 = db
            .read_decay_score(EntityId::new(50), "view", 0)
            .unwrap()
            .unwrap_or(0.0);
        assert!(
            score_50 > 0.0,
            "checkpoint-era item 50 should have positive decay score"
        );

        // Items from post-checkpoint era (201-250) should also have signal state.
        let score_225 = db
            .read_decay_score(EntityId::new(225), "view", 0)
            .unwrap()
            .unwrap_or(0.0);
        assert!(
            score_225 > 0.0,
            "post-checkpoint item 225 should have positive decay score from WAL replay"
        );

        // Items 101-200 should have no signals.
        let score_150 = db
            .read_decay_score(EntityId::new(150), "view", 0)
            .unwrap()
            .unwrap_or(0.0);
        assert!(
            score_150 < 1e-6,
            "item 150 should have no signals, got {score_150}"
        );

        db.close().unwrap();
    }
}
```

#### Test 3: Crash with M6 state (cohort, collection, hard negatives)

```rust
#[test]
#[cfg(feature = "test-utils")]
fn uat_crash_with_m6_state_and_hard_negatives() {
    let home = TempTidalHome::new().unwrap();
    let schema = m7_uat_schema();

    let user_id = 1001u64;
    let blocked_creator_id = 5u64;
    let hidden_item_id = 42u64;

    // Phase 1: Write M6 state surfaces, then close.
    {
        let db = TidalDb::builder()
            .with_data_dir(home.path())
            .with_schema(schema.clone())
            .open()
            .unwrap();

        let now = Timestamp::now();

        // Write user.
        let mut user_meta = HashMap::new();
        user_meta.insert("country".to_string(), "US".to_string());
        user_meta.insert("interest".to_string(), "music".to_string());
        db.write_user(EntityId::new(user_id), &user_meta).unwrap();

        // Write 100 items across 10 creators.
        for id in 1..=100u64 {
            let creator_id = ((id - 1) % 10) + 1;
            let mut meta = HashMap::new();
            meta.insert("title".to_string(), format!("Song {id}"));
            meta.insert("category".to_string(), "music".to_string());
            meta.insert("creator_id".to_string(), creator_id.to_string());
            db.write_item_with_metadata(EntityId::new(id), &meta).unwrap();
        }

        // Define cohort.
        db.define_cohort(CohortDef {
            name: "us_music".to_string(),
            predicate: Predicate::And(vec![
                Predicate::Eq { field: "country".into(), value: "US".into() },
                Predicate::Eq { field: "interest".into(), value: "music".into() },
            ]),
        })
        .unwrap();

        // Write signals with cohort attribution.
        for id in 1..=20u64 {
            let creator_id = ((id - 1) % 10) + 1;
            db.signal_with_context(
                "view",
                EntityId::new(id),
                1.0,
                now,
                Some(user_id),
                Some(creator_id),
            )
            .unwrap();
        }

        // Create a collection.
        let coll = db
            .create_collection(EntityId::new(user_id), "favorites", Visibility::Private)
            .unwrap();
        db.add_to_collection(coll, EntityId::new(1)).unwrap();
        db.add_to_collection(coll, EntityId::new(2)).unwrap();
        db.add_to_collection(coll, EntityId::new(3)).unwrap();

        // Hard negatives: block a creator, hide an item.
        db.write_relationship(
            EntityId::new(user_id),
            RelationshipType::Blocks,
            EntityId::new(blocked_creator_id),
            1.0,
            now,
        )
        .unwrap();
        db.write_relationship(
            EntityId::new(user_id),
            RelationshipType::Hide,
            EntityId::new(hidden_item_id),
            1.0,
            now,
        )
        .unwrap();

        db.close().unwrap();
    }

    // Phase 2: Reopen and verify all M6 state surfaces recovered.
    {
        let db = TidalDb::builder()
            .with_data_dir(home.path())
            .with_schema(schema)
            .open()
            .unwrap();

        // Items survived.
        assert_eq!(db.item_count(), 100, "all 100 items should survive restart");

        // Cohort definition survived (duplicate should fail).
        let dup = db.define_cohort(CohortDef {
            name: "us_music".to_string(),
            predicate: Predicate::Eq { field: "x".into(), value: "y".into() },
        });
        assert!(dup.is_err(), "cohort 'us_music' should already be registered after restart");

        // Collection survived.
        let collections = db.list_collections(EntityId::new(user_id)).unwrap();
        assert!(
            collections.iter().any(|c| c.name == "favorites"),
            "collection 'favorites' should survive restart"
        );

        // Hard negative invariant: RETRIEVE for the user must NOT return
        // the hidden item or items from the blocked creator.
        let results = db
            .retrieve(
                &Retrieve::builder()
                    .profile("trending")
                    .for_user(user_id)
                    .limit(50)
                    .build()
                    .unwrap(),
            )
            .unwrap();

        for item in &results.items {
            let id = item.entity_id.as_u64();
            assert_ne!(
                id, hidden_item_id,
                "hidden item {hidden_item_id} must not appear in results after crash recovery"
            );
            // Items from blocked creator 5 are: 5, 15, 25, 35, 45, 55, 65, 75, 85, 95.
            let item_creator = ((id - 1) % 10) + 1;
            assert_ne!(
                item_creator, blocked_creator_id,
                "item {id} from blocked creator {blocked_creator_id} must not appear after crash recovery"
            );
        }

        db.close().unwrap();
    }
}
```

### Helper functions

No shared helper beyond `m7_uat_schema()`. Each test is self-contained with its own `TempTidalHome` to guarantee isolation.

### Assertions summary

| Test | Key assertions |
|------|---------------|
| `uat_crash_at_wal_write` | Item count == 100; decay scores within 1% of pre-crash values; items without signals have score ~0; RETRIEVE returns results |
| `uat_crash_at_checkpoint` | Item count == 250 (checkpoint + WAL); checkpoint-era items have signals; post-checkpoint items have signals; unsignaled items have score ~0 |
| `uat_crash_with_m6_state_and_hard_negatives` | Item count == 100; cohort definition survives; collection survives; hidden item never in RETRIEVE results; blocked creator items never in RETRIEVE results |

## Acceptance Criteria

- [ ] `uat_crash_at_wal_write` passes: 100 items + 500 signals written, closed, reopened; all decay scores within 1% tolerance; RETRIEVE works
- [ ] `uat_crash_at_checkpoint` passes: checkpoint + post-checkpoint WAL replay produces 250 items with correct signal state
- [ ] `uat_crash_with_m6_state_and_hard_negatives` passes: cohort, collection, hard negatives all survive restart; RETRIEVE never returns hidden/blocked content
- [ ] All three tests use `#[cfg(feature = "test-utils")]` and `TempTidalHome`
- [ ] Each test completes in under 60 seconds
- [ ] `cargo clippy --manifest-path tidal/Cargo.toml -- -D warnings`
passes

## Test Strategy

All three tests follow the same pattern:

1. Open a `TempTidalHome`-backed DB.
2. Write state (items, signals, relationships, cohorts, collections).
3. Close the DB (simulating crash after WAL flush).
4. Reopen with the same `TempTidalHome` path and schema.
5. Assert recovered state matches expectations.

The tests use small datasets (100-250 items, 500 signals max) to keep runtime under 10 seconds per test, well within the 60-second limit in the acceptance criteria. The 1% tolerance on decay scores accounts for time elapsed during close/reopen (decay continues with wall-clock time).

No mock injection or `CrashPoint` hooks are strictly required for these UAT-level tests. The close-then-reopen pattern is sufficient to exercise the WAL replay and checkpoint recovery paths. The `CrashPoint` fault injection from m7p1 is exercised in the m7p1 unit/property tests; the UAT validates the end-user-visible outcome.
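
To sanity-check the 1% tolerance, the sketch below estimates how far a score drifts during a restart. It assumes `DecaySpec::Exponential` follows the standard half-life form `score(t) = score(0) * 2^(-t / half_life)`, which this document does not confirm. The helper names (`relative_drift`, `time_to_drift`, `within_tolerance`) and the 5-second restart time are hypothetical and exist only for illustration; nothing here calls into tidaldb.

```rust
use std::time::Duration;

/// Fraction of a score lost after `elapsed`, ASSUMING standard half-life
/// decay: score(t) = score(0) * 2^(-t / half_life).
fn relative_drift(elapsed: Duration, half_life: Duration) -> f64 {
    1.0 - 0.5_f64.powf(elapsed.as_secs_f64() / half_life.as_secs_f64())
}

/// Elapsed time after which the score has lost the fraction `rel`
/// (inverse of `relative_drift` under the same assumption).
fn time_to_drift(rel: f64, half_life: Duration) -> Duration {
    half_life.mul_f64((1.0 / (1.0 - rel)).log2())
}

/// The comparison used in `uat_crash_at_wal_write`:
/// 1% relative tolerance plus a small absolute floor.
fn within_tolerance(expected: f64, recovered: f64) -> bool {
    (recovered - expected).abs() < expected.abs() * 0.01 + 1e-6
}

fn main() {
    // "view" half-life from m7_uat_schema(): 7 days.
    let half_life = Duration::from_secs(7 * 24 * 3600);
    // Hypothetical close/reopen time; the real value depends on the machine.
    let restart = Duration::from_secs(5);

    let drift = relative_drift(restart, half_life);
    println!("relative drift after {restart:?}: {drift:.3e}");
    println!(
        "time until 1% drift: {:.0} s",
        time_to_drift(0.01, half_life).as_secs_f64()
    );

    // A score of 10.0 that decayed for `restart` still passes the check.
    let expected = 10.0;
    let recovered = expected * (1.0 - drift);
    println!("within tolerance: {}", within_tolerance(expected, recovered));
}
```

Under that assumption, a 7-day half-life takes a little under two and a half hours to drift by 1%, so a restart measured in seconds sits far inside the tolerance. A failure of the 1% check would therefore point to lost or duplicated signals rather than to clock drift.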