tidaldb/docs/planning/milestone-7/phase-1/task-07-crash-fencing-m6-state.md
2026-02-23 22:41:16 -07:00

496 lines
17 KiB
Markdown

# Task 07: Crash Fencing for M6 State
## Delivers
Property tests proving crash recovery correctness for all M6 state surfaces: `CohortSignalLedger`, `CollectionIndex`, `CoEngagementIndex`, and active sessions. Each state surface is tested against its relevant crash points with random event sequences. After simulated crash + reopen, every invariant must hold.
## Complexity: L
## Dependencies
- Task 01 (CrashPoint enum + fault injection hooks)
## Technical Design
### 1. CohortSignalLedger crash recovery tests
The cohort ledger checkpoints atomically via `WriteBatch`. If the checkpoint is interrupted, the previous checkpoint (or empty state) is the restore point. The cohort ledger is best-effort (does not go through WAL), so signals written to the cohort ledger but not checkpointed are lost on crash. This is by design -- the acceptance criterion is that the restored state is consistent (no corrupt entries), not that every cohort signal survives.
```rust
// tidal/tests/m7_crash_m6.rs
use std::time::Duration;
use tidaldb::cohort::CohortSignalLedger;
use tidaldb::schema::{DecaySpec, EntityId, EntityKind, SchemaBuilder, Timestamp, Window};
use tidaldb::signals::SignalTypeId;
use tidaldb::storage::InMemoryBackend;
use tidaldb::signals::checkpoint::CheckpointMeta;
fn test_schema() -> tidaldb::schema::Schema {
let mut builder = SchemaBuilder::new();
let _ = builder
.signal("view", EntityKind::Item,
DecaySpec::Exponential { half_life: Duration::from_secs(7 * 24 * 3600) })
.windows(&[Window::AllTime])
.velocity(false)
.add();
builder.build().unwrap()
}
#[test]
fn cohort_ledger_checkpoint_restore_roundtrip() {
let schema = test_schema();
let ledger = CohortSignalLedger::new(&schema);
let type_id = ledger.resolve_signal_type("view").unwrap();
let ts_ns = Timestamp::now().as_nanos();
// Record signals across multiple cohorts.
for i in 1..=100u64 {
ledger.record("cohort_a", EntityId::new(i), type_id, 1.0, ts_ns);
if i % 3 == 0 {
ledger.record("cohort_b", EntityId::new(i), type_id, 2.0, ts_ns);
}
}
let storage = InMemoryBackend::new();
let meta = CheckpointMeta {
checkpoint_time_ns: ts_ns,
wal_sequence: 42,
payload_hash: [0u8; 32], // cohort uses same meta but no BLAKE3 yet
};
ledger.checkpoint(&storage, meta).unwrap();
// Restore into a fresh ledger.
let ledger2 = CohortSignalLedger::new(&schema);
let restored_meta = ledger2.restore(&storage).unwrap().unwrap();
assert_eq!(restored_meta.wal_sequence, 42);
assert_eq!(ledger2.entry_count(), ledger.entry_count());
// Verify specific counts.
let count_a = ledger2
.read_windowed_count("cohort_a", EntityId::new(1), "view", Window::AllTime)
.unwrap();
assert_eq!(count_a, 1);
let count_b = ledger2
.read_windowed_count("cohort_b", EntityId::new(3), "view", Window::AllTime)
.unwrap();
assert_eq!(count_b, 1);
// Cohort not in the ledger returns 0.
let count_c = ledger2
.read_windowed_count("cohort_c", EntityId::new(1), "view", Window::AllTime)
.unwrap();
assert_eq!(count_c, 0);
}
```
### 2. CohortSignalLedger crash during checkpoint
```rust
use proptest::prelude::*;
proptest! {
#![proptest_config(ProptestConfig::with_cases(500))]
#[test]
fn cohort_ledger_survives_checkpoint_crash(
cohort_count in 1usize..5,
entities_per_cohort in 1usize..20,
signals_per_entity in 1usize..10,
) {
let schema = test_schema();
let storage = InMemoryBackend::new();
let type_id = SignalTypeId::new(0); // "view" is first alphabetically
let base_ns = 1_000_000_000_000u64;
// Phase 1: Write signals and do a clean checkpoint.
let ledger = CohortSignalLedger::new(&schema);
for c in 0..cohort_count {
let cohort_name = format!("cohort_{c}");
for e in 1..=entities_per_cohort as u64 {
for s in 0..signals_per_entity {
let ts_ns = base_ns + (s as u64) * 1_000_000;
ledger.record(&cohort_name, EntityId::new(e), type_id, 1.0, ts_ns);
}
}
}
let meta = CheckpointMeta {
checkpoint_time_ns: base_ns,
wal_sequence: 100,
payload_hash: [0u8; 32],
};
ledger.checkpoint(&storage, meta).unwrap();
let checkpoint_entry_count = ledger.entry_count();
// Phase 2: Write more signals (these will be lost on crash).
for e in 1..=3u64 {
ledger.record("cohort_0", EntityId::new(e + 1000), type_id, 1.0, base_ns);
}
prop_assert!(ledger.entry_count() > checkpoint_entry_count);
// Phase 3: Simulate crash (don't checkpoint the new signals).
// Restore from the checkpoint.
let ledger2 = CohortSignalLedger::new(&schema);
let restored_meta = ledger2.restore(&storage).unwrap().unwrap();
prop_assert_eq!(restored_meta.wal_sequence, 100);
// Restored entry count should match the CHECKPOINT entry count,
// not the live entry count (post-checkpoint signals are lost).
prop_assert_eq!(ledger2.entry_count(), checkpoint_entry_count);
}
}
```
### 3. CollectionIndex crash recovery
Collections are persisted to fjall via `Tag::Collection` keys. The `rebuild_collections` function scans these keys on open. Test that after a crash during `add_to_collection`, the collection state on reopen is consistent (either the item is in the collection or it is not -- no partial state).
```rust
#[test]
fn collection_persists_across_restart() {
let dir = tempfile::tempdir().unwrap();
let schema = test_schema_with_collections(); // schema with text fields for items
// Phase 1: Create collection, add items, close.
{
let db = TidalDb::builder()
.with_data_dir(dir.path())
.with_schema(schema.clone())
.open()
.unwrap();
// Write items first (collections reference items).
for i in 1..=10u64 {
db.write_item(EntityId::new(i), &HashMap::new()).unwrap();
}
let cid = db.create_collection(42, "favorites", Visibility::Private).unwrap();
for i in 1..=5u64 {
db.add_to_collection(cid, EntityId::new(i)).unwrap();
}
db.close().unwrap();
}
// Phase 2: Reopen and verify collection state.
{
let db = TidalDb::builder()
.with_data_dir(dir.path())
.with_schema(schema)
.open()
.unwrap();
let collections = db.list_collections(42).unwrap();
assert_eq!(collections.len(), 1);
assert_eq!(collections[0].name, "favorites");
assert_eq!(collections[0].item_count, 5);
db.close().unwrap();
}
}
proptest! {
#![proptest_config(ProptestConfig::with_cases(200))]
#[test]
fn collection_items_survive_restart(
item_count in 1usize..50,
items_in_collection in 1usize..50,
) {
let n_items = item_count;
let n_in_coll = items_in_collection.min(n_items);
let dir = tempfile::tempdir().unwrap();
let schema = test_schema_with_collections();
{
let db = TidalDb::builder()
.with_data_dir(dir.path())
.with_schema(schema.clone())
.open()
.unwrap();
for i in 1..=n_items as u64 {
db.write_item(EntityId::new(i), &HashMap::new()).unwrap();
}
let cid = db.create_collection(1, "test", Visibility::Public).unwrap();
for i in 1..=n_in_coll as u64 {
db.add_to_collection(cid, EntityId::new(i)).unwrap();
}
db.close().unwrap();
}
{
let db = TidalDb::builder()
.with_data_dir(dir.path())
.with_schema(schema)
.open()
.unwrap();
let collections = db.list_collections(1).unwrap();
prop_assert_eq!(collections.len(), 1);
prop_assert_eq!(collections[0].item_count, n_in_coll);
db.close().unwrap();
}
}
}
```
### 4. CoEngagementIndex crash recovery
The co-engagement index checkpoints during shutdown. Test that edges survive a clean restart and that the weight-based eviction invariant (edge_count <= capacity) holds after recovery.
```rust
#[test]
fn co_engagement_survives_restart() {
let dir = tempfile::tempdir().unwrap();
let schema = test_schema_with_entities();
{
let db = TidalDb::builder()
.with_data_dir(dir.path())
.with_schema(schema.clone())
.open()
.unwrap();
// Write items and signals that trigger co-engagement.
for i in 1..=20u64 {
db.write_item(EntityId::new(i), &HashMap::new()).unwrap();
}
// Simulate user engagement: user 1 likes items 1-10 sequentially.
for i in 1..=10u64 {
let ts = Timestamp::from_nanos(1_000_000_000_000 + i * 1_000_000);
db.signal("like", EntityId::new(i), 1.0, ts).unwrap();
}
let edge_count_before = db.co_engagement().edge_count();
assert!(edge_count_before > 0, "co-engagement edges should exist");
db.close().unwrap();
}
{
let db = TidalDb::builder()
.with_data_dir(dir.path())
.with_schema(schema)
.open()
.unwrap();
let edge_count_after = db.co_engagement().edge_count();
// Edges should survive the restart (checkpoint in shutdown).
assert!(edge_count_after > 0, "co-engagement edges should survive restart");
db.close().unwrap();
}
}
proptest! {
#![proptest_config(ProptestConfig::with_cases(200))]
#[test]
fn co_engagement_capacity_invariant_after_restart(
edge_count in 1usize..100,
) {
let index = CoEngagementIndex::with_capacity(50);
let storage = InMemoryBackend::new();
// Insert edges directly.
for i in 0..edge_count {
index.insert_edge(i as u64, (i + 1) as u64, (i as f32) + 0.5);
}
// Checkpoint.
index.checkpoint(&storage).unwrap();
// Restore into a fresh index.
let index2 = CoEngagementIndex::with_capacity(50);
index2.restore(&storage).unwrap();
// Edge count after restore matches the original (capped by checkpoint,
// not by capacity -- capacity eviction happens on record_positive, not restore).
let expected = edge_count.min(100); // all inserted edges, up to what we wrote
prop_assert_eq!(index2.edge_count(), expected.min(index.edge_count()));
}
}
```
### 5. Session crash recovery
Sessions are journaled to `sessions.log` via the WAL writer thread. If a session-start is written but no session-close, the session should be restored as active on reopen. Test this with the existing `session_durability` test pattern.
```rust
#[test]
fn active_session_restored_after_crash() {
let dir = tempfile::tempdir().unwrap();
let schema = test_schema_with_sessions();
let mut session_id = 0u64;
{
let db = TidalDb::builder()
.with_data_dir(dir.path())
.with_schema(schema.clone())
.open()
.unwrap();
// Start a session but do NOT close it (simulating crash).
let handle = db.start_session(42, "test-agent", "default").unwrap();
session_id = handle.session_id().as_u64();
// Write some session signals.
handle.signal("view", EntityId::new(1), 1.0, Timestamp::now()).unwrap();
handle.signal("view", EntityId::new(2), 1.0, Timestamp::now()).unwrap();
// Graceful close (writes checkpoint, but session is still "open").
db.close().unwrap();
}
{
let db = TidalDb::builder()
.with_data_dir(dir.path())
.with_schema(schema)
.open()
.unwrap();
// The session should be restored with its signals.
// The exact API depends on session restore implementation.
// At minimum: the session's signal effects (view counts) should be present.
let count = db.read_windowed_count(
EntityId::new(1), "view", Window::AllTime
).unwrap();
assert!(count >= 1, "session signals should survive restart");
db.close().unwrap();
}
}
#[test]
fn session_metadata_restored_after_crash() {
let dir = tempfile::tempdir().unwrap();
let schema = test_schema_with_sessions();
{
let db = TidalDb::builder()
.with_data_dir(dir.path())
.with_schema(schema.clone())
.open()
.unwrap();
let handle = db.start_session(42, "test-agent", "default").unwrap();
handle.annotate(EntityId::new(1), "good recommendation").unwrap();
// Do NOT close the session -- simulate crash.
db.close().unwrap();
}
{
let db = TidalDb::builder()
.with_data_dir(dir.path())
.with_schema(schema)
.open()
.unwrap();
// Session journal recovery should have replayed the session-start
// and the annotation signal. The exact verification depends on
// whether session state is queryable after restore.
// At minimum: the database should open without error.
db.health_check().unwrap();
db.close().unwrap();
}
}
```
### 6. Cross-surface crash test
Verify that a crash during a mixed workload (signals + collections + co-engagement + cohort) does not corrupt any surface.
```rust
#[test]
fn mixed_workload_crash_recovery() {
let dir = tempfile::tempdir().unwrap();
let schema = test_schema_full();
{
let db = TidalDb::builder()
.with_data_dir(dir.path())
.with_schema(schema.clone())
.open()
.unwrap();
// Write items.
for i in 1..=100u64 {
let mut meta = HashMap::new();
meta.insert("title".to_string(), format!("Item {i}"));
meta.insert("creator_id".to_string(), format!("{}", i % 10));
db.write_item_with_metadata(EntityId::new(i), &meta).unwrap();
}
// Write signals.
for i in 1..=100u64 {
let ts = Timestamp::from_nanos(1_000_000_000_000 + i * 1_000_000);
db.signal("view", EntityId::new(i), 1.0, ts).unwrap();
}
// Create collection.
let cid = db.create_collection(1, "test", Visibility::Public).unwrap();
for i in 1..=10u64 {
db.add_to_collection(cid, EntityId::new(i)).unwrap();
}
db.close().unwrap();
}
{
let db = TidalDb::builder()
.with_data_dir(dir.path())
.with_schema(schema)
.open()
.unwrap();
// Verify all surfaces are consistent.
db.health_check().unwrap();
// Signals survived.
let count = db.read_windowed_count(
EntityId::new(50), "view", Window::AllTime
).unwrap();
assert!(count >= 1);
// Collections survived.
let collections = db.list_collections(1).unwrap();
assert_eq!(collections.len(), 1);
assert_eq!(collections[0].item_count, 10);
// Items survived.
let meta = db.get_item_metadata(EntityId::new(1)).unwrap();
assert!(meta.is_some());
db.close().unwrap();
}
}
```
## Acceptance Criteria
- [ ] `cohort_ledger_checkpoint_restore_roundtrip`: cohort signal state round-trips correctly through checkpoint/restore
- [ ] `cohort_ledger_survives_checkpoint_crash`: 500 proptest cases, interrupted checkpoint does not corrupt state
- [ ] `collection_persists_across_restart`: collection with items survives clean restart
- [ ] `collection_items_survive_restart`: 200 proptest cases, collection item counts match after restart
- [ ] `co_engagement_survives_restart`: co-engagement edges persist through shutdown/reopen
- [ ] `co_engagement_capacity_invariant_after_restart`: 200 proptest cases, edge count preserved through checkpoint/restore
- [ ] `active_session_restored_after_crash`: session signals survive when session is not closed before shutdown
- [ ] `session_metadata_restored_after_crash`: session journal events replayed on open
- [ ] `mixed_workload_crash_recovery`: all surfaces consistent after mixed write + restart
- [ ] All tests pass with `cargo test --test m7_crash_m6`
## Test Strategy
All tests described in the Technical Design section are the deliverable. The key principle: each test writes state through the public `TidalDb` API, closes the database, reopens it, and verifies the state through the public API. This tests the full persistence pipeline (in-memory -> checkpoint/fjall -> restore/rebuild) without relying on internal implementation details.
Property tests use proptest with varied workload sizes to catch edge cases in serialization, key encoding, and scan/rebuild logic.