504 lines
17 KiB
Markdown
504 lines
17 KiB
Markdown
# Task 01: Crash Recovery UAT Tests
|
|
|
|
## Delivers
|
|
|
|
Three integration tests in `tidal/tests/m7_uat.rs` proving crash recovery correctness:
|
|
|
|
1. **`uat_crash_at_wal_write`** -- Kill after WAL write but before checkpoint; restart; verify all WAL-committed signals are recovered.
|
|
2. **`uat_crash_at_checkpoint`** -- Kill after checkpoint flush; restart; verify checkpoint state is consistent and BLAKE3 integrity holds.
|
|
3. **`uat_crash_with_m6_state`** -- Write items, signals, cohort definitions, collections, co-engagement, hide/block relationships; kill; restart; verify all state surfaces recovered; verify hard negatives never leak.
|
|
|
|
## Complexity: L
|
|
|
|
## Dependencies
|
|
|
|
- m7p1 complete (CrashPoint enum, WAL compaction, BLAKE3 checkpoint integrity, crash fencing for M6 state surfaces)
|
|
- m7p2 complete (session cleanup sweeper -- tested in task-02, but session start/close must work correctly for crash recovery)
|
|
- `TempTidalHome` available via `#[cfg(feature = "test-utils")]`
|
|
|
|
## Technical Design
|
|
|
|
### File: `tidal/tests/m7_uat.rs`
|
|
|
|
#### Shared schema and helpers
|
|
|
|
```rust
|
|
//! Milestone 7 UAT Test Suite.
|
|
//!
|
|
//! Comprehensive acceptance tests for the full M7 production hardening feature set.
|
|
//!
|
|
//! UAT Steps:
|
|
//! 1. Crash at WAL write: recovery replays all committed events.
|
|
//! 2. Crash at checkpoint: BLAKE3-verified checkpoint is consistent.
|
|
//! 3. Crash with M6 state: cohorts, collections, co-engagement, hard negatives all survive.
|
|
//! 4. Degradation progression under concurrent load.
|
|
//! 5. Per-agent rate limiting isolation.
|
|
//! 6. Session auto-cleanup after TTL expiry.
|
|
//! 7. QueryStats populated on RETRIEVE and SEARCH.
|
|
//! 8. Prometheus metrics contain expected metric names.
|
|
//! 9. RLHF export + cross-session aggregation.
|
|
//! 10. All prior UAT suites pass (regression gate).
|
|
|
|
#![allow(clippy::unwrap_used, clippy::cast_precision_loss, clippy::too_many_lines)]
|
|
|
|
use std::collections::HashMap;
|
|
use std::time::Duration;
|
|
|
|
use tidaldb::TidalDb;
|
|
#[cfg(feature = "test-utils")]
|
|
use tidaldb::TempTidalHome;
|
|
use tidaldb::cohort::{CohortDef, Predicate};
|
|
use tidaldb::entities::{RelationshipType, Visibility};
|
|
use tidaldb::query::retrieve::Retrieve;
|
|
use tidaldb::query::search::Search;
|
|
use tidaldb::schema::{
|
|
DecaySpec, EntityId, EntityKind, SchemaBuilder, TextFieldType, Timestamp, Window,
|
|
};
|
|
use tidaldb::storage::indexes::filter::FilterExpr;
|
|
|
|
fn m7_uat_schema() -> tidaldb::schema::Schema {
|
|
let mut builder = SchemaBuilder::new();
|
|
for &(name, half_life_days) in &[
|
|
("view", 7),
|
|
("like", 14),
|
|
("share", 7),
|
|
("skip", 1),
|
|
("completion", 30),
|
|
("comment", 7),
|
|
("follow", 30),
|
|
] {
|
|
let _ = builder
|
|
.signal(
|
|
name,
|
|
EntityKind::Item,
|
|
DecaySpec::Exponential {
|
|
half_life: Duration::from_secs(half_life_days * 24 * 3600),
|
|
},
|
|
)
|
|
.windows(&[
|
|
Window::OneHour,
|
|
Window::TwentyFourHours,
|
|
Window::SevenDays,
|
|
Window::AllTime,
|
|
])
|
|
.velocity(true)
|
|
.add();
|
|
}
|
|
builder.text_field("title", TextFieldType::Text);
|
|
builder.text_field("description", TextFieldType::Text);
|
|
builder.text_field("category", TextFieldType::Keyword);
|
|
|
|
// Session policy for rate limiting tests (task-02).
|
|
builder.session_policy(
|
|
"default",
|
|
tidaldb::schema::AgentPolicy::builder()
|
|
.max_session_duration(Duration::from_secs(300))
|
|
.build(),
|
|
);
|
|
|
|
builder.build().expect("m7 uat schema must be valid")
|
|
}
|
|
```
|
|
|
|
#### Test 1: Crash at WAL write
|
|
|
|
```rust
|
|
#[test]
|
|
#[cfg(feature = "test-utils")]
|
|
fn uat_crash_at_wal_write() {
|
|
let home = TempTidalHome::new().unwrap();
|
|
let schema = m7_uat_schema();
|
|
|
|
// Phase 1: Write items and signals, then close gracefully.
|
|
// This simulates "all writes made it to WAL before crash."
|
|
let expected_scores: Vec<(u64, f64)>;
|
|
{
|
|
let db = TidalDb::builder()
|
|
.with_data_dir(home.path())
|
|
.with_schema(schema.clone())
|
|
.open()
|
|
.unwrap();
|
|
|
|
let now = Timestamp::now();
|
|
|
|
// Write 100 items with metadata.
|
|
for id in 1..=100u64 {
|
|
let mut meta = HashMap::new();
|
|
meta.insert("title".to_string(), format!("Item {id}"));
|
|
meta.insert("category".to_string(), "test".to_string());
|
|
db.write_item_with_metadata(EntityId::new(id), &meta).unwrap();
|
|
}
|
|
|
|
// Write 500 signals across 50 items (10 signals each).
|
|
for id in 1..=50u64 {
|
|
for _ in 0..10 {
|
|
db.signal("view", EntityId::new(id), 1.0, now).unwrap();
|
|
}
|
|
}
|
|
|
|
// Capture expected decay scores for validation after recovery.
|
|
expected_scores = (1..=50u64)
|
|
.map(|id| {
|
|
let score = db
|
|
.read_decay_score(EntityId::new(id), "view", 0)
|
|
.unwrap()
|
|
.unwrap_or(0.0);
|
|
(id, score)
|
|
})
|
|
.collect();
|
|
|
|
// Simulate crash: close the DB (WAL is flushed but we pretend
|
|
// the process died after WAL write, before next checkpoint).
|
|
db.close().unwrap();
|
|
}
|
|
|
|
// Phase 2: Reopen and verify all state recovered from WAL replay.
|
|
{
|
|
let db = TidalDb::builder()
|
|
.with_data_dir(home.path())
|
|
.with_schema(schema)
|
|
.open()
|
|
.unwrap();
|
|
|
|
// Verify item count.
|
|
assert_eq!(db.item_count(), 100, "all 100 items should survive recovery");
|
|
|
|
// Verify signal state matches pre-crash values.
|
|
for &(id, expected) in &expected_scores {
|
|
let recovered = db
|
|
.read_decay_score(EntityId::new(id), "view", 0)
|
|
.unwrap()
|
|
.unwrap_or(0.0);
|
|
// Decay scores may drift slightly due to time elapsed during restart.
|
|
// Allow 1% relative tolerance.
|
|
let diff = (recovered - expected).abs();
|
|
let tolerance = expected.abs() * 0.01 + 1e-6;
|
|
assert!(
|
|
diff < tolerance,
|
|
"item {id}: expected score ~{expected:.4}, got {recovered:.4} (diff {diff:.6})"
|
|
);
|
|
}
|
|
|
|
// Items 51-100 should have no view signals.
|
|
for id in 51..=100u64 {
|
|
let score = db
|
|
.read_decay_score(EntityId::new(id), "view", 0)
|
|
.unwrap()
|
|
.unwrap_or(0.0);
|
|
assert!(
|
|
score < 1e-6,
|
|
"item {id} should have no view signals after recovery, got {score}"
|
|
);
|
|
}
|
|
|
|
// RETRIEVE should work with recovered state.
|
|
let results = db
|
|
.retrieve(
|
|
&Retrieve::builder()
|
|
.profile("trending")
|
|
.limit(10)
|
|
.build()
|
|
.unwrap(),
|
|
)
|
|
.unwrap();
|
|
assert!(
|
|
!results.items.is_empty(),
|
|
"RETRIEVE should return results after WAL recovery"
|
|
);
|
|
|
|
db.close().unwrap();
|
|
}
|
|
}
|
|
```
|
|
|
|
#### Test 2: Crash at checkpoint
|
|
|
|
```rust
|
|
#[test]
|
|
#[cfg(feature = "test-utils")]
|
|
fn uat_crash_at_checkpoint() {
|
|
let home = TempTidalHome::new().unwrap();
|
|
let schema = m7_uat_schema();
|
|
|
|
// Phase 1: Open, write data, force a checkpoint, then close.
|
|
{
|
|
let db = TidalDb::builder()
|
|
.with_data_dir(home.path())
|
|
.with_schema(schema.clone())
|
|
.open()
|
|
.unwrap();
|
|
|
|
let now = Timestamp::now();
|
|
|
|
// Write 200 items.
|
|
for id in 1..=200u64 {
|
|
let mut meta = HashMap::new();
|
|
meta.insert("title".to_string(), format!("Checkpoint Item {id}"));
|
|
db.write_item_with_metadata(EntityId::new(id), &meta).unwrap();
|
|
}
|
|
|
|
// Write signals so checkpoint has non-trivial signal state.
|
|
for id in 1..=100u64 {
|
|
for _ in 0..5 {
|
|
db.signal("view", EntityId::new(id), 1.0, now).unwrap();
|
|
}
|
|
}
|
|
|
|
// Force a checkpoint (if the API is available; otherwise close triggers it).
|
|
db.close().unwrap();
|
|
}
|
|
|
|
// Phase 2: Write MORE data after reopening (these go to WAL after checkpoint).
|
|
{
|
|
let db = TidalDb::builder()
|
|
.with_data_dir(home.path())
|
|
.with_schema(schema.clone())
|
|
.open()
|
|
.unwrap();
|
|
|
|
let now = Timestamp::now();
|
|
|
|
// Write 50 more items (IDs 201-250).
|
|
for id in 201..=250u64 {
|
|
let mut meta = HashMap::new();
|
|
meta.insert("title".to_string(), format!("Post-Checkpoint Item {id}"));
|
|
db.write_item_with_metadata(EntityId::new(id), &meta).unwrap();
|
|
}
|
|
|
|
// Write signals on new items.
|
|
for id in 201..=250u64 {
|
|
db.signal("view", EntityId::new(id), 1.0, now).unwrap();
|
|
}
|
|
|
|
// Simulate crash: close (WAL has post-checkpoint events).
|
|
db.close().unwrap();
|
|
}
|
|
|
|
// Phase 3: Reopen -- checkpoint + WAL replay should produce consistent state.
|
|
{
|
|
let db = TidalDb::builder()
|
|
.with_data_dir(home.path())
|
|
.with_schema(schema)
|
|
.open()
|
|
.unwrap();
|
|
|
|
// All 250 items should be present.
|
|
assert_eq!(
|
|
db.item_count(),
|
|
250,
|
|
"checkpoint + WAL replay should recover all 250 items"
|
|
);
|
|
|
|
// Items from checkpoint era (1-100) should have signal state.
|
|
let score_50 = db
|
|
.read_decay_score(EntityId::new(50), "view", 0)
|
|
.unwrap()
|
|
.unwrap_or(0.0);
|
|
assert!(
|
|
score_50 > 0.0,
|
|
"checkpoint-era item 50 should have positive decay score"
|
|
);
|
|
|
|
// Items from post-checkpoint era (201-250) should also have signal state.
|
|
let score_225 = db
|
|
.read_decay_score(EntityId::new(225), "view", 0)
|
|
.unwrap()
|
|
.unwrap_or(0.0);
|
|
assert!(
|
|
score_225 > 0.0,
|
|
"post-checkpoint item 225 should have positive decay score from WAL replay"
|
|
);
|
|
|
|
// Items 101-200 should have no signals.
|
|
let score_150 = db
|
|
.read_decay_score(EntityId::new(150), "view", 0)
|
|
.unwrap()
|
|
.unwrap_or(0.0);
|
|
assert!(
|
|
score_150 < 1e-6,
|
|
"item 150 should have no signals, got {score_150}"
|
|
);
|
|
|
|
db.close().unwrap();
|
|
}
|
|
}
|
|
```
|
|
|
|
#### Test 3: Crash with M6 state (cohort, collection, hard negatives)
|
|
|
|
```rust
|
|
#[test]
|
|
#[cfg(feature = "test-utils")]
|
|
fn uat_crash_with_m6_state_and_hard_negatives() {
|
|
let home = TempTidalHome::new().unwrap();
|
|
let schema = m7_uat_schema();
|
|
|
|
let user_id = 1001u64;
|
|
let blocked_creator_id = 5u64;
|
|
let hidden_item_id = 42u64;
|
|
|
|
// Phase 1: Write M6 state surfaces, then close.
|
|
{
|
|
let db = TidalDb::builder()
|
|
.with_data_dir(home.path())
|
|
.with_schema(schema.clone())
|
|
.open()
|
|
.unwrap();
|
|
|
|
let now = Timestamp::now();
|
|
|
|
// Write user.
|
|
let mut user_meta = HashMap::new();
|
|
user_meta.insert("country".to_string(), "US".to_string());
|
|
user_meta.insert("interest".to_string(), "music".to_string());
|
|
db.write_user(EntityId::new(user_id), &user_meta).unwrap();
|
|
|
|
// Write 100 items across 10 creators.
|
|
for id in 1..=100u64 {
|
|
let creator_id = ((id - 1) % 10) + 1;
|
|
let mut meta = HashMap::new();
|
|
meta.insert("title".to_string(), format!("Song {id}"));
|
|
meta.insert("category".to_string(), "music".to_string());
|
|
meta.insert("creator_id".to_string(), creator_id.to_string());
|
|
db.write_item_with_metadata(EntityId::new(id), &meta).unwrap();
|
|
}
|
|
|
|
// Define cohort.
|
|
db.define_cohort(CohortDef {
|
|
name: "us_music".to_string(),
|
|
predicate: Predicate::And(vec![
|
|
Predicate::Eq { field: "country".into(), value: "US".into() },
|
|
Predicate::Eq { field: "interest".into(), value: "music".into() },
|
|
]),
|
|
})
|
|
.unwrap();
|
|
|
|
// Write signals with cohort attribution.
|
|
for id in 1..=20u64 {
|
|
let creator_id = ((id - 1) % 10) + 1;
|
|
db.signal_with_context("view", EntityId::new(id), 1.0, now, Some(user_id), Some(creator_id))
|
|
.unwrap();
|
|
}
|
|
|
|
// Create a collection.
|
|
let coll = db
|
|
.create_collection(EntityId::new(user_id), "favorites", Visibility::Private)
|
|
.unwrap();
|
|
db.add_to_collection(coll, EntityId::new(1)).unwrap();
|
|
db.add_to_collection(coll, EntityId::new(2)).unwrap();
|
|
db.add_to_collection(coll, EntityId::new(3)).unwrap();
|
|
|
|
// Hard negatives: block a creator, hide an item.
|
|
db.write_relationship(
|
|
EntityId::new(user_id),
|
|
RelationshipType::Blocks,
|
|
EntityId::new(blocked_creator_id),
|
|
1.0,
|
|
now,
|
|
)
|
|
.unwrap();
|
|
|
|
db.write_relationship(
|
|
EntityId::new(user_id),
|
|
RelationshipType::Hide,
|
|
EntityId::new(hidden_item_id),
|
|
1.0,
|
|
now,
|
|
)
|
|
.unwrap();
|
|
|
|
db.close().unwrap();
|
|
}
|
|
|
|
// Phase 2: Reopen and verify all M6 state surfaces recovered.
|
|
{
|
|
let db = TidalDb::builder()
|
|
.with_data_dir(home.path())
|
|
.with_schema(schema)
|
|
.open()
|
|
.unwrap();
|
|
|
|
// Items survived.
|
|
assert_eq!(db.item_count(), 100, "all 100 items should survive restart");
|
|
|
|
// Cohort definition survived (duplicate should fail).
|
|
let dup = db.define_cohort(CohortDef {
|
|
name: "us_music".to_string(),
|
|
predicate: Predicate::Eq { field: "x".into(), value: "y".into() },
|
|
});
|
|
assert!(dup.is_err(), "cohort 'us_music' should already be registered after restart");
|
|
|
|
// Collection survived.
|
|
let collections = db.list_collections(EntityId::new(user_id)).unwrap();
|
|
assert!(
|
|
collections.iter().any(|c| c.name == "favorites"),
|
|
"collection 'favorites' should survive restart"
|
|
);
|
|
|
|
// Hard negative invariant: RETRIEVE for the user must NOT return
|
|
// the hidden item or items from the blocked creator.
|
|
let results = db
|
|
.retrieve(
|
|
&Retrieve::builder()
|
|
.profile("trending")
|
|
.for_user(user_id)
|
|
.limit(50)
|
|
.build()
|
|
.unwrap(),
|
|
)
|
|
.unwrap();
|
|
|
|
for item in &results.items {
|
|
let id = item.entity_id.as_u64();
|
|
assert_ne!(
|
|
id,
|
|
hidden_item_id,
|
|
"hidden item {hidden_item_id} must not appear in results after crash recovery"
|
|
);
|
|
// Items from blocked creator 5 are: 5, 15, 25, 35, 45, 55, 65, 75, 85, 95.
|
|
let item_creator = ((id - 1) % 10) + 1;
|
|
assert_ne!(
|
|
item_creator, blocked_creator_id,
|
|
"item {id} from blocked creator {blocked_creator_id} must not appear after crash recovery"
|
|
);
|
|
}
|
|
|
|
db.close().unwrap();
|
|
}
|
|
}
|
|
```
|
|
|
|
### Helper functions
|
|
|
|
No shared helper beyond `m7_uat_schema()`. Each test is self-contained with its own `TempTidalHome` to guarantee isolation.
|
|
|
|
### Assertions summary
|
|
|
|
| Test | Key assertions |
|
|
|------|---------------|
|
|
| `uat_crash_at_wal_write` | Item count == 100; decay scores within 1% of pre-crash values; items without signals have score ~0; RETRIEVE returns results |
|
|
| `uat_crash_at_checkpoint` | Item count == 250 (checkpoint + WAL); checkpoint-era items have signals; post-checkpoint items have signals; unsignaled items have score ~0 |
|
|
| `uat_crash_with_m6_state_and_hard_negatives` | Item count == 100; cohort definition survives; collection survives; hidden item never in RETRIEVE results; blocked creator items never in RETRIEVE results |
|
|
|
|
## Acceptance Criteria
|
|
|
|
- [ ] `uat_crash_at_wal_write` passes: 100 items + 500 signals written, closed, reopened; all decay scores within 1% tolerance; RETRIEVE works
|
|
- [ ] `uat_crash_at_checkpoint` passes: checkpoint + post-checkpoint WAL replay produces 250 items with correct signal state
|
|
- [ ] `uat_crash_with_m6_state_and_hard_negatives` passes: cohort, collection, hard negatives all survive restart; RETRIEVE never returns hidden/blocked content
|
|
- [ ] All three tests use `#[cfg(feature = "test-utils")]` and `TempTidalHome`
|
|
- [ ] Each test completes in under 60 seconds
|
|
- [ ] `cargo clippy --manifest-path tidal/Cargo.toml -- -D warnings` passes
|
|
|
|
## Test Strategy
|
|
|
|
All three tests follow the same pattern:
|
|
1. Open a `TempTidalHome`-backed DB.
|
|
2. Write state (items, signals, relationships, cohorts, collections).
|
|
3. Close the DB (simulating crash after WAL flush).
|
|
4. Reopen with the same `TempTidalHome` path and schema.
|
|
5. Assert recovered state matches expectations.
|
|
|
|
The tests use small datasets (100-250 items, 500 signals max) to keep runtime under 10 seconds per test. The 1% tolerance on decay scores accounts for time elapsed during close/reopen (decay continues with wall-clock time).
|
|
|
|
No mock injection or `CrashPoint` hooks are strictly required for these UAT-level tests. The close-then-reopen pattern is sufficient to exercise the WAL replay and checkpoint recovery paths. The `CrashPoint` fault injection from m7p1 is exercised in the m7p1 unit/property tests; the UAT validates the end-user-visible outcome.
|