tidaldb/docs/planning/milestone-8/phase-6/OVERVIEW.md
jordan f4cfd6c81f feat: complete M8 replication primitives + forage enhancements + docs
Milestone 8 (phases 1-4):
- Shard-aware WAL segment naming, BatchHeader v2, ShardRouter
- Transport trait, InProcessTransport, WalShipper, FollowerDb
- HLC, PNCounter, LWWRegister, CrdtSignalState, ReconciliationEngine
- Session replication bridge with SeqNo/HWM, idempotency store

Forage application:
- Multi-source discovery engine with MAB exploration
- Embedding-based label system, server handlers, UI refresh

Other:
- QUICKSTART.md, README.md, milestone-8 planning docs
- Hard negative union semantics, RLHF export enhancements
- Recovery benchmark and visibility test expansions
- Split 8 oversized source files per CODING_GUIDELINES §9

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-24 13:17:19 -07:00


# m8p6: End-to-End UAT
## Delivers
A comprehensive end-to-end test suite that exercises the complete UAT scenario:
3 regions, 5 shards per region, 25K signals/sec, network partition, failover,
partition heal, deterministic reconciliation, and tenant migration. This is the
gate for M8 completion.
Deliverables:
- `m8_uat` integration test suite matching all 5 UAT scenario steps
- `SimulatedCluster`: test harness that creates a multi-region tidalDB cluster using `InProcessTransport`
- `NetworkPartition`: injectable fault that blocks `Transport::send_segment` between specified regions
- `ShardCrash`: injectable fault that drops a shard primary and triggers follower promotion
- Performance assertions: cross-region replication < 2s p99, failover < 10s
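As a rough illustration of the scale the harness has to model, the sketch below enumerates a 3-region, 5-shard topology and splits the 25K signals/sec target evenly across shards. `ShardId`, `topology`, and `per_shard_rate` are hypothetical names, not part of `SimulatedCluster`, and the even split assumes the 25K figure is a cluster-wide total.

```rust
/// Hypothetical shard identifier; the real harness may key shards differently.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct ShardId {
    pub region: String,
    pub index: u32,
}

/// Enumerate every (region, shard index) pair in the simulated cluster.
pub fn topology(regions: &[&str], shards_per_region: u32) -> Vec<ShardId> {
    regions
        .iter()
        .flat_map(|r| {
            (0..shards_per_region).map(move |i| ShardId {
                region: r.to_string(),
                index: i,
            })
        })
        .collect()
}

/// Signals/sec each shard must absorb if load is spread evenly (rounded up).
/// Returns None when there are no shards. The even split is an assumption.
pub fn per_shard_rate(total_per_sec: u64, shards: u64) -> Option<u64> {
    if shards == 0 {
        return None;
    }
    Some((total_per_sec + shards - 1) / shards)
}
```

With `["us-east", "eu-west", "ap-south"]` and 5 shards per region this yields 15 shards, so each shard would need to sustain roughly 1,667 signals/sec under an even split.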
## Dependencies
- **Requires:** All phases 8.1-8.5
- **Files created:**
  - `tidal/tests/m8_uat.rs` -- integration test suite
  - `tidal/src/testing/cluster.rs` -- `SimulatedCluster` harness
  - `tidal/src/testing/faults.rs` -- `NetworkPartition`, `ShardCrash` fault injection
## Research References
- `docs/research/tidaldb_wal.md` -- invariant checklist for replication correctness
## Acceptance Criteria (Phase Level)
- [ ] **UAT Step 1:** Write signals for a user in us-east and read them in eu-west within 2 seconds; verified by a `ReplicationLagGauge` assertion and `read_decay_score` equivalence
- [ ] **UAT Step 2:** Crash an entire shard primary (simulated); follower is promoted within 10 seconds; all acknowledged signals are present on the promoted follower; no data loss
- [ ] **UAT Step 3:** Execute `RETRIEVE items COHORT locale:EU` while ap-south is partitioned; query succeeds using available shards; results include items from non-partitioned regions only; degradation flag set in `QueryStats`
- [ ] **UAT Step 4:** Heal the partition; `ReconciliationEngine` runs; after reconciliation: no duplicate signal counts (verified by sum of all events across all regions); hard negatives never leaked; decay scores on all shards match analytical formula to 6 decimal places
- [ ] **UAT Step 5:** Move a tenant to a new region by changing routing config; during migration: zero downtime, all queries succeed; after migration: tenant's data is on new region only; old region's copy is GC'd
- [ ] Invariant: no signal event is lost or double-counted across the entire test run (verified by WAL event count == materialized signal count on all shards)
- [ ] Invariant: hard negatives (hide/mute/block) are monotonically enforced -- once hidden, never visible during convergence
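The hard-negative invariant above can be checked mechanically over a recorded trace of visibility observations. The following is a minimal sketch under assumed types: `Visibility` and `first_leak` are hypothetical helpers, and the trace is assumed to be the ordered sequence of reads for one (user, item) pair during convergence.

```rust
/// Hypothetical observation of whether an item was served to a user.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Visibility {
    Visible,
    Hidden,
}

/// Returns the index of the first observation where an item became visible
/// again after having been hidden, or None if the trace is monotonic.
pub fn first_leak(trace: &[Visibility]) -> Option<usize> {
    let mut hidden = false;
    for (i, v) in trace.iter().enumerate() {
        match v {
            Visibility::Hidden => hidden = true,
            // Visible after Hidden violates the monotonic invariant.
            Visibility::Visible if hidden => return Some(i),
            Visibility::Visible => {}
        }
    }
    None
}
```

A UAT test could collect one such trace per (user, item) pair across all regions and fail on the first `Some(_)`.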
## Task Execution Order
```
Task 01: SimulatedCluster Harness ──────┐
                                        ├──> Task 03: UAT Scenario Tests (Steps 1-5)
Task 02: Fault Injection ───────────────┘                       │
                                                                v
                                              Task 04: Performance Assertions + CI
```
Tasks 01 and 02 are parallelizable. Task 03 depends on both. Task 04 depends on 03.
## Module Location
| File | Status | Contains |
|------|--------|----------|
| `tidal/tests/m8_uat.rs` | NEW | All UAT scenario tests |
| `tidal/src/testing/cluster.rs` | NEW | `SimulatedCluster` harness |
| `tidal/src/testing/faults.rs` | NEW | `NetworkPartition`, `ShardCrash` fault injection |
## Notes
### All tests use InProcessTransport
No test touches a real network. The `NetworkPartition` fault works by intercepting `send_segment` calls and dropping them for the specified region pair.
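One way to picture the interception is a wrapper that sits in front of the transport. The sketch below is illustrative only: the `Transport` trait shown is a stand-in with an assumed `send_segment` signature, `RecordingTransport` is a hypothetical substitute for `InProcessTransport`, and surfacing a dropped send as an `Err` (rather than dropping silently) is an assumption.

```rust
use std::collections::HashSet;
use std::sync::Mutex;

/// Stand-in for the real `Transport` trait; the signature is assumed.
pub trait Transport {
    fn send_segment(&self, from: &str, to: &str, segment: &[u8]) -> Result<(), String>;
}

/// Hypothetical in-process transport that records (from, to, byte length).
#[derive(Default)]
pub struct RecordingTransport {
    pub delivered: Mutex<Vec<(String, String, usize)>>,
}

impl Transport for RecordingTransport {
    fn send_segment(&self, from: &str, to: &str, segment: &[u8]) -> Result<(), String> {
        let entry = (from.to_string(), to.to_string(), segment.len());
        self.delivered.lock().unwrap().push(entry);
        Ok(())
    }
}

/// Wraps a transport and drops sends between blocked region pairs.
pub struct NetworkPartition<T: Transport> {
    inner: T,
    blocked: Mutex<HashSet<(String, String)>>,
}

impl<T: Transport> NetworkPartition<T> {
    pub fn new(inner: T) -> Self {
        Self { inner, blocked: Mutex::new(HashSet::new()) }
    }

    /// Order the pair so that a block applies in both directions.
    fn key(a: &str, b: &str) -> (String, String) {
        if a <= b {
            (a.to_string(), b.to_string())
        } else {
            (b.to_string(), a.to_string())
        }
    }

    pub fn block(&self, a: &str, b: &str) {
        self.blocked.lock().unwrap().insert(Self::key(a, b));
    }

    pub fn heal(&self, a: &str, b: &str) {
        self.blocked.lock().unwrap().remove(&Self::key(a, b));
    }

    pub fn inner(&self) -> &T {
        &self.inner
    }
}

impl<T: Transport> Transport for NetworkPartition<T> {
    fn send_segment(&self, from: &str, to: &str, segment: &[u8]) -> Result<(), String> {
        if self.blocked.lock().unwrap().contains(&Self::key(from, to)) {
            // Assumption: the drop is reported so the sender can retry later.
            return Err(format!("partitioned: {} -> {}", from, to));
        }
        self.inner.send_segment(from, to, segment)
    }
}
```

Because the wrapper implements the same trait, the rest of the cluster does not need to know a fault is installed; `heal` restores delivery without rebuilding anything.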
### Deterministic reconciliation verification
After partition heal, we replay all WAL segments from both sides of the partition through a single-node `TidalDb` (the ground truth). We then compare every signal count and decay score on every shard against this ground truth. Any divergence fails the test.
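The comparison can be sketched as a replay into plain counts followed by a per-shard diff. Everything here is hypothetical scaffolding: `SignalEvent`, its `event_id` dedup key, and the `owned_items` list stand in for whatever the WAL record and shard routing actually expose, and decay scores are left out for brevity.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical WAL event; the real record layout is not specified here.
#[derive(Clone, Debug, PartialEq)]
pub struct SignalEvent {
    pub event_id: u64, // assumed globally unique, used to drop duplicates
    pub item: String,
    pub weight: i64,
}

/// Replay events from both sides of the partition into ground-truth counts,
/// applying each event_id at most once.
pub fn replay(events: &[SignalEvent]) -> BTreeMap<String, i64> {
    let mut seen = BTreeSet::new();
    let mut counts = BTreeMap::new();
    for e in events {
        if seen.insert(e.event_id) {
            *counts.entry(e.item.clone()).or_insert(0) += e.weight;
        }
    }
    counts
}

/// Compare one shard against the ground truth for the items it owns.
/// Returns a human-readable line per divergence; empty means the shard agrees.
pub fn divergences(
    truth: &BTreeMap<String, i64>,
    shard: &BTreeMap<String, i64>,
    owned_items: &[&str],
) -> Vec<String> {
    let mut out = Vec::new();
    for item in owned_items {
        let expected = truth.get(*item).copied().unwrap_or(0);
        let actual = shard.get(*item).copied().unwrap_or(0);
        if expected != actual {
            out.push(format!("{}: expected {}, got {}", item, expected, actual));
        }
    }
    out
}
```

A test would call `divergences` once per shard and fail if any call returns a non-empty list.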
### Performance assertions are soft
The 2s p99 target is for in-process transport. Real network latency is additive. The test verifies that replication logic itself adds < 100ms overhead; the remaining budget is for network RTT.
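A percentile check of this kind might look like the sketch below. It uses the nearest-rank method over collected latency samples; `percentile` and `check_p99` are hypothetical helpers, and how the samples are gathered is left to the harness.

```rust
use std::time::Duration;

/// Nearest-rank percentile (pct in 0..=100) over latency samples.
/// Returns None for an empty slice or an out-of-range percentile.
pub fn percentile(samples: &[Duration], pct: usize) -> Option<Duration> {
    if samples.is_empty() || pct > 100 {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort();
    // rank = ceil(pct/100 * n), clamped to at least 1, converted to an index.
    let rank = (sorted.len() * pct + 99) / 100;
    Some(sorted[rank.max(1) - 1])
}

/// Ok(p99) when the 99th percentile is strictly below the budget,
/// otherwise an error message suitable for a test failure.
pub fn check_p99(samples: &[Duration], budget: Duration) -> Result<Duration, String> {
    let p99 = percentile(samples, 99).ok_or_else(|| "no samples".to_string())?;
    if p99 < budget {
        Ok(p99)
    } else {
        Err(format!("p99 {:?} is not under budget {:?}", p99, budget))
    }
}
```

For the in-process overhead check described above, the budget would be `Duration::from_millis(100)`; the same helper works for the 2s end-to-end target.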
## Done When
`cargo test --test m8_uat` passes all 5 UAT scenario steps with 25K signals/sec sustained throughput across 3 simulated regions, verifying no signal loss, no duplicate counts, no leaked hard negatives, and correct decay scores after partition heal and reconciliation.