# m7p1: Crash Recovery Hardening

## Delivers

Fault injection test harness, WAL compaction, checkpoint integrity verification, recovery time measurement, and crash fencing for all M6 state surfaces. Every write-path stage is tested for crash safety. Recovery completes in under 30 seconds with 1M items and 5 minutes of WAL backlog.

This phase transforms crash recovery from "it probably works" into "it is proven correct under all failure modes." The `CrashPoint` enum and `CrashInjector` harness give every future milestone a systematic way to verify durability invariants before shipping.

## Dependencies

- M6 complete (all 6 phases)
- `tidal/src/signals/checkpoint/` -- checkpoint format, meta serialization
- `tidal/src/wal/` -- WAL segments, reader/recovery, checkpoint manager
- `tidal/src/db/state_rebuild.rs` -- entity state rebuild on open
- `tidal/src/db/open.rs` -- schema-aware open with WAL replay
- `tidal/src/cohort/checkpoint.rs` -- cohort signal checkpoint/restore
- `tidal/src/entities/co_engagement.rs` -- co-engagement checkpoint/restore
- `tidal/src/entities/collection.rs` -- collection index serialization
- `tidal/src/db/session_restore.rs` -- session journal recovery

## Research References

- `docs/research/tidaldb_wal.md` -- WAL design, segment layout, crash recovery model
- `thoughts.md` Part V.5-6 -- lessons from Engram/Citadel on crash recovery
- `docs/research/tidaldb_signal_ledger.md` -- three-tier hybrid, checkpoint semantics

## Acceptance Criteria (Phase Level)

- [ ] Fault injection harness: `CrashPoint` enum covering WAL pre-write, WAL post-write, checkpoint pre-flush, checkpoint post-flush, signal aggregation update, cohort ledger update, collection index update, co-engagement update; configurable via `#[cfg(test)]` feature flag
- [ ] Property tests for each crash point: generate N random event sequences (N >= 1000), inject crash at random position, restart, verify state matches expected from WAL replay to 6 decimal places for decay scores
- [ ] WAL compaction: after successful checkpoint, WAL segments with seqno <= checkpoint seqno atomically deleted; write-new-then-delete-old pattern
- [ ] Checkpoint integrity: `CheckpointMeta` extended with BLAKE3 hash; verified on open; corrupt checkpoint triggers fallback to WAL-only replay with warning log
- [ ] Recovery time < 30 seconds for 1M items checkpoint + 5 minutes WAL backlog (Criterion benchmarked)
- [ ] `tidalctl recover --path <dir> --verify-only` dry-runs WAL replay; reports event count, last seqno, inconsistency count, estimated recovery time; no state written
- [ ] Crash fencing for cohort state: CohortSignalLedger checkpoint/restore roundtrips correctly under all crash points
- [ ] Crash fencing for collection state: CollectionIndex persisted bitmaps survive all crash points
- [ ] Crash fencing for co-engagement state: CoEngagementIndex weight-based eviction invariant preserved across restart
- [ ] Crash fencing for session state: active sessions with WAL session-start but no session-close correctly restored
- [ ] No phantom items after any crash scenario
- [ ] No lost signals after any crash scenario (WAL is the source of truth)
- [ ] No leaked hard negatives after crash recovery (hidden items remain hidden)

## Task Execution Order

```
task-01 (CrashPoint enum + fault injection hooks)
    |
    v
task-02 (signal ledger crash property tests)
    |
task-03 (WAL compaction) ----+
    |                        |
task-04 (checkpoint BLAKE3)  |
    |                        |
task-07 (M6 state fencing) --+
    |                        |
    v                        v
task-05 (recovery benchmark)
    |
task-06 (tidalctl recover --verify-only)
    |
    v
task-08 (hard negative crash invariant)
```

Tasks 02, 03, 04, and 07 can parallelize after task-01 completes. Task-05 depends on tasks 03 and 04 (compaction and BLAKE3 change recovery semantics). Task-06 depends on task-05 (needs the benchmark harness for estimated recovery time). Task-08 depends on task-07 (needs M6 crash fencing in place).

## Module Location

New and modified modules:

```
tidal/src/
  testing/
    mod.rs              -- new: #[cfg(test)] test utilities module
    crash_injector.rs   -- new: CrashPoint enum, CrashInjector struct
  signals/checkpoint/
    meta.rs             -- modified: add BLAKE3 hash field to CheckpointMeta
    integrity.rs        -- new: BLAKE3 hash computation and verification
  wal/
    compaction.rs       -- new: post-checkpoint WAL segment compaction
  db/
    state_rebuild.rs    -- modified: BLAKE3 verification on restore, fallback path
    open.rs             -- modified: integrate compaction after periodic checkpoint

tidal/tests/
  m7_crash_property.rs  -- new: property tests for signal ledger crash points
  m7_crash_m6.rs        -- new: property tests for M6 state crash fencing
  m7_crash_invariant.rs -- new: hard negative crash invariant test

tidal/benches/
  recovery.rs           -- new: Criterion benchmark for cold-start recovery time
```

## Notes

### CrashPoint design philosophy

The `CrashPoint` enum is not a simulation framework. It is a set of hooks at real write-path boundaries where `#[cfg(test)]` code can trigger a controlled panic or early return. The `CrashInjector` is configured per-test to fire at a specific point after N operations. This gives deterministic, reproducible crash scenarios without the complexity of process-level fault injection (which requires fork/exec and is brittle on macOS).

The trade-off: we test the recovery logic, not the OS-level crash semantics. For OS-level crash safety (torn pages, partial fsync), we rely on the WAL's two-phase batch validation (BLAKE3 checksums) and Tantivy's segment merge model. Both have been proven correct in their respective upstream projects.

### BLAKE3 for checkpoint integrity

BLAKE3 is already a dependency (used for WAL batch checksums and signal event deduplication). Adding a 32-byte hash to `CheckpointMeta` costs nothing at write time (hashing 983 bytes takes ~200ns) and catches silent corruption on read. The fallback path (WAL-only replay) is already implemented -- it is the first-boot path. We simply reuse it when the checkpoint hash does not match.

### Recovery time target

The 30-second target is conservative. At current checkpoint format sizes (983 bytes per entity-signal entry), 1M entries is ~940MB of checkpoint data. fjall bulk-scan throughput is ~500MB/s on NVMe. The WAL replay overhead for 5 minutes of backlog at 1000 signals/sec is ~5000 events at 21 bytes each, which is negligible. The bottleneck will be DashMap insertion during restore. Sharding (16 shards default) should keep this under 10 seconds. The benchmark will tell us where we actually stand.

### WAL compaction safety

WAL compaction follows the write-new-then-delete-old pattern used by the existing `CheckpointManager::write()`. The invariant: at no point during compaction is there a state where both the checkpoint and the WAL segments covering those events are absent. If we crash during compaction, the worst case is that old segments survive and get replayed redundantly on the next open -- which is correct (the ledger's DashMap insert is idempotent for identical events).

## Done When

All 13 acceptance criteria above pass. `cargo test --manifest-path tidal/Cargo.toml` passes including the new `m7_crash_property`, `m7_crash_m6`, and `m7_crash_invariant` integration test suites. The `recovery` Criterion benchmark shows cold-start recovery < 30 seconds for 1M items + 5 minutes WAL backlog.