tidaldb/docs/planning/milestone-7/phase-5/task-01-crash-recovery-uat.md

# Task 01: Crash Recovery UAT Tests

## Delivers

Three integration tests in `tidal/tests/m7_uat.rs` proving crash recovery correctness:

1. **`uat_crash_at_wal_write`** -- Kill after WAL write but before checkpoint; restart; verify all WAL-committed signals are recovered.
2. **`uat_crash_at_checkpoint`** -- Kill after checkpoint flush; restart; verify checkpoint state is consistent and BLAKE3 integrity holds.
3. **`uat_crash_with_m6_state`** -- Write items, signals, cohort definitions, collections, co-engagement, hide/block relationships; kill; restart; verify all state surfaces recovered; verify hard negatives never leak.

## Complexity: L

## Dependencies

- m7p1 complete (CrashPoint enum, WAL compaction, BLAKE3 checkpoint integrity, crash fencing for M6 state surfaces)
- m7p2 complete (session cleanup sweeper -- tested in task-02, but session start/close must work correctly for crash recovery)
- `TempTidalHome` available via `#[cfg(feature = "test-utils")]`

## Technical Design

### File: `tidal/tests/m7_uat.rs`

#### Shared schema and helpers

```rust
//! Milestone 7 UAT Test Suite.
//!
//! Comprehensive acceptance tests for the full M7 production hardening feature set.
//!
//! UAT Steps:
//! 1. Crash at WAL write: recovery replays all committed events.
//! 2. Crash at checkpoint: BLAKE3-verified checkpoint is consistent.
//! 3. Crash with M6 state: cohorts, collections, co-engagement, hard negatives all survive.
//! 4. Degradation progression under concurrent load.
//! 5. Per-agent rate limiting isolation.
//! 6. Session auto-cleanup after TTL expiry.
//! 7. QueryStats populated on RETRIEVE and SEARCH.
//! 8. Prometheus metrics contain expected metric names.
//! 9. RLHF export + cross-session aggregation.
//! 10. All prior UAT suites pass (regression gate).

#![allow(clippy::unwrap_used, clippy::cast_precision_loss, clippy::too_many_lines)]

use std::collections::HashMap;
use std::time::Duration;

use tidaldb::TidalDb;
#[cfg(feature = "test-utils")]
use tidaldb::TempTidalHome;
use tidaldb::cohort::{CohortDef, Predicate};
use tidaldb::entities::{RelationshipType, Visibility};
use tidaldb::query::retrieve::Retrieve;
use tidaldb::query::search::Search;
use tidaldb::schema::{
    DecaySpec, EntityId, EntityKind, SchemaBuilder, TextFieldType, Timestamp, Window,
};
use tidaldb::storage::indexes::filter::FilterExpr;

fn m7_uat_schema() -> tidaldb::schema::Schema {
    let mut builder = SchemaBuilder::new();
    for &(name, half_life_days) in &[
        ("view", 7),
        ("like", 14),
        ("share", 7),
        ("skip", 1),
        ("completion", 30),
        ("comment", 7),
        ("follow", 30),
    ] {
        let _ = builder
            .signal(
                name,
                EntityKind::Item,
                DecaySpec::Exponential {
                    half_life: Duration::from_secs(half_life_days * 24 * 3600),
                },
            )
            .windows(&[
                Window::OneHour,
                Window::TwentyFourHours,
                Window::SevenDays,
                Window::AllTime,
            ])
            .velocity(true)
            .add();
    }
    builder.text_field("title", TextFieldType::Text);
    builder.text_field("description", TextFieldType::Text);
    builder.text_field("category", TextFieldType::Keyword);

    // Session policy for rate limiting tests (task-02).
    builder.session_policy(
        "default",
        tidaldb::schema::AgentPolicy::builder()
            .max_session_duration(Duration::from_secs(300))
            .build(),
    );

    builder.build().expect("m7 uat schema must be valid")
}
```

#### Test 1: Crash at WAL write

```rust
#[test]
#[cfg(feature = "test-utils")]
fn uat_crash_at_wal_write() {
    let home = TempTidalHome::new().unwrap();
    let schema = m7_uat_schema();

    // Phase 1: Write items and signals, then close gracefully.
    // This simulates "all writes made it to WAL before crash."
    let expected_scores: Vec<(u64, f64)>;
    {
        let db = TidalDb::builder()
            .with_data_dir(home.path())
            .with_schema(schema.clone())
            .open()
            .unwrap();

        let now = Timestamp::now();

        // Write 100 items with metadata.
        for id in 1..=100u64 {
            let mut meta = HashMap::new();
            meta.insert("title".to_string(), format!("Item {id}"));
            meta.insert("category".to_string(), "test".to_string());
            db.write_item_with_metadata(EntityId::new(id), &meta).unwrap();
        }

        // Write 500 signals across 50 items (10 signals each).
        for id in 1..=50u64 {
            for _ in 0..10 {
                db.signal("view", EntityId::new(id), 1.0, now).unwrap();
            }
        }

        // Capture expected decay scores for validation after recovery.
        expected_scores = (1..=50u64)
            .map(|id| {
                let score = db
                    .read_decay_score(EntityId::new(id), "view", 0)
                    .unwrap()
                    .unwrap_or(0.0);
                (id, score)
            })
            .collect();

        // Simulate crash: close the DB (WAL is flushed but we pretend
        // the process died after WAL write, before next checkpoint).
        db.close().unwrap();
    }

    // Phase 2: Reopen and verify all state recovered from WAL replay.
    {
        let db = TidalDb::builder()
            .with_data_dir(home.path())
            .with_schema(schema)
            .open()
            .unwrap();

        // Verify item count.
        assert_eq!(db.item_count(), 100, "all 100 items should survive recovery");

        // Verify signal state matches pre-crash values.
        for &(id, expected) in &expected_scores {
            let recovered = db
                .read_decay_score(EntityId::new(id), "view", 0)
                .unwrap()
                .unwrap_or(0.0);
            // Decay scores may drift slightly due to time elapsed during restart.
            // Allow 1% relative tolerance.
            let diff = (recovered - expected).abs();
            let tolerance = expected.abs() * 0.01 + 1e-6;
            assert!(
                diff < tolerance,
                "item {id}: expected score ~{expected:.4}, got {recovered:.4} (diff {diff:.6})"
            );
        }

        // Items 51-100 should have no view signals.
        for id in 51..=100u64 {
            let score = db
                .read_decay_score(EntityId::new(id), "view", 0)
                .unwrap()
                .unwrap_or(0.0);
            assert!(
                score < 1e-6,
                "item {id} should have no view signals after recovery, got {score}"
            );
        }

        // RETRIEVE should work with recovered state.
        let results = db
            .retrieve(
                &Retrieve::builder()
                    .profile("trending")
                    .limit(10)
                    .build()
                    .unwrap(),
            )
            .unwrap();
        assert!(
            !results.items.is_empty(),
            "RETRIEVE should return results after WAL recovery"
        );

        db.close().unwrap();
    }
}
```

#### Test 2: Crash at checkpoint

```rust
#[test]
#[cfg(feature = "test-utils")]
fn uat_crash_at_checkpoint() {
    let home = TempTidalHome::new().unwrap();
    let schema = m7_uat_schema();

    // Phase 1: Open, write data, force a checkpoint, then close.
    {
        let db = TidalDb::builder()
            .with_data_dir(home.path())
            .with_schema(schema.clone())
            .open()
            .unwrap();

        let now = Timestamp::now();

        // Write 200 items.
        for id in 1..=200u64 {
            let mut meta = HashMap::new();
            meta.insert("title".to_string(), format!("Checkpoint Item {id}"));
            db.write_item_with_metadata(EntityId::new(id), &meta).unwrap();
        }

        // Write signals so checkpoint has non-trivial signal state.
        for id in 1..=100u64 {
            for _ in 0..5 {
                db.signal("view", EntityId::new(id), 1.0, now).unwrap();
            }
        }

        // Force a checkpoint (if the API is available; otherwise close triggers it).
        db.close().unwrap();
    }

    // Phase 2: Write MORE data after reopening (these go to WAL after checkpoint).
    {
        let db = TidalDb::builder()
            .with_data_dir(home.path())
            .with_schema(schema.clone())
            .open()
            .unwrap();

        let now = Timestamp::now();

        // Write 50 more items (IDs 201-250).
        for id in 201..=250u64 {
            let mut meta = HashMap::new();
            meta.insert("title".to_string(), format!("Post-Checkpoint Item {id}"));
            db.write_item_with_metadata(EntityId::new(id), &meta).unwrap();
        }

        // Write signals on new items.
        for id in 201..=250u64 {
            db.signal("view", EntityId::new(id), 1.0, now).unwrap();
        }

        // Simulate crash: close (WAL has post-checkpoint events).
        db.close().unwrap();
    }

    // Phase 3: Reopen -- checkpoint + WAL replay should produce consistent state.
    {
        let db = TidalDb::builder()
            .with_data_dir(home.path())
            .with_schema(schema)
            .open()
            .unwrap();

        // All 250 items should be present.
        assert_eq!(
            db.item_count(),
            250,
            "checkpoint + WAL replay should recover all 250 items"
        );

        // Items from checkpoint era (1-100) should have signal state.
        let score_50 = db
            .read_decay_score(EntityId::new(50), "view", 0)
            .unwrap()
            .unwrap_or(0.0);
        assert!(
            score_50 > 0.0,
            "checkpoint-era item 50 should have positive decay score"
        );

        // Items from post-checkpoint era (201-250) should also have signal state.
        let score_225 = db
            .read_decay_score(EntityId::new(225), "view", 0)
            .unwrap()
            .unwrap_or(0.0);
        assert!(
            score_225 > 0.0,
            "post-checkpoint item 225 should have positive decay score from WAL replay"
        );

        // Items 101-200 should have no signals.
        let score_150 = db
            .read_decay_score(EntityId::new(150), "view", 0)
            .unwrap()
            .unwrap_or(0.0);
        assert!(
            score_150 < 1e-6,
            "item 150 should have no signals, got {score_150}"
        );

        db.close().unwrap();
    }
}
```

#### Test 3: Crash with M6 state (cohort, collection, hard negatives)

```rust
#[test]
#[cfg(feature = "test-utils")]
fn uat_crash_with_m6_state_and_hard_negatives() {
    let home = TempTidalHome::new().unwrap();
    let schema = m7_uat_schema();

    let user_id = 1001u64;
    let blocked_creator_id = 5u64;
    let hidden_item_id = 42u64;

    // Phase 1: Write M6 state surfaces, then close.
    {
        let db = TidalDb::builder()
            .with_data_dir(home.path())
            .with_schema(schema.clone())
            .open()
            .unwrap();

        let now = Timestamp::now();

        // Write user.
        let mut user_meta = HashMap::new();
        user_meta.insert("country".to_string(), "US".to_string());
        user_meta.insert("interest".to_string(), "music".to_string());
        db.write_user(EntityId::new(user_id), &user_meta).unwrap();

        // Write 100 items across 10 creators.
        for id in 1..=100u64 {
            let creator_id = ((id - 1) % 10) + 1;
            let mut meta = HashMap::new();
            meta.insert("title".to_string(), format!("Song {id}"));
            meta.insert("category".to_string(), "music".to_string());
            meta.insert("creator_id".to_string(), creator_id.to_string());
            db.write_item_with_metadata(EntityId::new(id), &meta).unwrap();
        }

        // Define cohort.
        db.define_cohort(CohortDef {
            name: "us_music".to_string(),
            predicate: Predicate::And(vec![
                Predicate::Eq { field: "country".into(), value: "US".into() },
                Predicate::Eq { field: "interest".into(), value: "music".into() },
            ]),
        })
        .unwrap();

        // Write signals with cohort attribution.
        for id in 1..=20u64 {
            let creator_id = ((id - 1) % 10) + 1;
            db.signal_with_context("view", EntityId::new(id), 1.0, now, Some(user_id), Some(creator_id))
                .unwrap();
        }

        // Create a collection.
        let coll = db
            .create_collection(EntityId::new(user_id), "favorites", Visibility::Private)
            .unwrap();
        db.add_to_collection(coll, EntityId::new(1)).unwrap();
        db.add_to_collection(coll, EntityId::new(2)).unwrap();
        db.add_to_collection(coll, EntityId::new(3)).unwrap();

        // Hard negatives: block a creator, hide an item.
        db.write_relationship(
            EntityId::new(user_id),
            RelationshipType::Blocks,
            EntityId::new(blocked_creator_id),
            1.0,
            now,
        )
        .unwrap();

        db.write_relationship(
            EntityId::new(user_id),
            RelationshipType::Hide,
            EntityId::new(hidden_item_id),
            1.0,
            now,
        )
        .unwrap();

        db.close().unwrap();
    }

    // Phase 2: Reopen and verify all M6 state surfaces recovered.
    {
        let db = TidalDb::builder()
            .with_data_dir(home.path())
            .with_schema(schema)
            .open()
            .unwrap();

        // Items survived.
        assert_eq!(db.item_count(), 100, "all 100 items should survive restart");

        // Cohort definition survived (duplicate should fail).
        let dup = db.define_cohort(CohortDef {
            name: "us_music".to_string(),
            predicate: Predicate::Eq { field: "x".into(), value: "y".into() },
        });
        assert!(dup.is_err(), "cohort 'us_music' should already be registered after restart");

        // Collection survived.
        let collections = db.list_collections(EntityId::new(user_id)).unwrap();
        assert!(
            collections.iter().any(|c| c.name == "favorites"),
            "collection 'favorites' should survive restart"
        );

        // Hard negative invariant: RETRIEVE for the user must NOT return
        // the hidden item or items from the blocked creator.
        let results = db
            .retrieve(
                &Retrieve::builder()
                    .profile("trending")
                    .for_user(user_id)
                    .limit(50)
                    .build()
                    .unwrap(),
            )
            .unwrap();

        for item in &results.items {
            let id = item.entity_id.as_u64();
            assert_ne!(
                id,
                hidden_item_id,
                "hidden item {hidden_item_id} must not appear in results after crash recovery"
            );
            // Items from blocked creator 5 are: 5, 15, 25, 35, 45, 55, 65, 75, 85, 95.
            let item_creator = ((id - 1) % 10) + 1;
            assert_ne!(
                item_creator, blocked_creator_id,
                "item {id} from blocked creator {blocked_creator_id} must not appear after crash recovery"
            );
        }

        db.close().unwrap();
    }
}
```

### Helper functions

No shared helper beyond `m7_uat_schema()`. Each test is self-contained with its own `TempTidalHome` to guarantee isolation.

### Assertions summary

| Test | Key assertions |
|------|---------------|
| `uat_crash_at_wal_write` | Item count == 100; decay scores within 1% of pre-crash values; items without signals have score ~0; RETRIEVE returns results |
| `uat_crash_at_checkpoint` | Item count == 250 (checkpoint + WAL); checkpoint-era items have signals; post-checkpoint items have signals; unsignaled items have score ~0 |
| `uat_crash_with_m6_state_and_hard_negatives` | Item count == 100; cohort definition survives; collection survives; hidden item never in RETRIEVE results; blocked creator items never in RETRIEVE results |

## Acceptance Criteria

- [ ] `uat_crash_at_wal_write` passes: 100 items + 500 signals written, closed, reopened; all decay scores within 1% tolerance; RETRIEVE works
- [ ] `uat_crash_at_checkpoint` passes: checkpoint + post-checkpoint WAL replay produces 250 items with correct signal state
- [ ] `uat_crash_with_m6_state_and_hard_negatives` passes: cohort, collection, hard negatives all survive restart; RETRIEVE never returns hidden/blocked content
- [ ] All three tests use `#[cfg(feature = "test-utils")]` and `TempTidalHome`
- [ ] Each test completes in under 60 seconds
- [ ] `cargo clippy --manifest-path tidal/Cargo.toml -- -D warnings` passes

## Test Strategy

All three tests follow the same pattern:
1. Open a `TempTidalHome`-backed DB.
2. Write state (items, signals, relationships, cohorts, collections).
3. Close the DB (simulating crash after WAL flush).
4. Reopen with the same `TempTidalHome` path and schema.
5. Assert recovered state matches expectations.

The tests use small datasets (100-250 items, 500 signals max) to keep runtime under 10 seconds per test. The 1% tolerance on decay scores accounts for time elapsed during close/reopen (decay continues with wall-clock time).

No mock injection or `CrashPoint` hooks are strictly required for these UAT-level tests. The close-then-reopen pattern is sufficient to exercise the WAL replay and checkpoint recovery paths. The `CrashPoint` fault injection from m7p1 is exercised in the m7p1 unit/property tests; the UAT validates the end-user-visible outcome.