# m8p6: End-to-End UAT

## Delivers

A comprehensive end-to-end test suite that exercises the complete UAT scenario:
3 regions, 5 shards per region, 25K signals/sec, network partition, failover,
partition heal, deterministic reconciliation, and tenant migration. This is the
gate for M8 completion.

Deliverables:

- `m8_uat` integration test suite matching all 5 UAT scenario steps
- `SimulatedCluster`: test harness that creates a multi-region tidalDB cluster using `InProcessTransport`
- `NetworkPartition`: injectable fault that blocks `Transport::send_segment` between specified regions
- `ShardCrash`: injectable fault that drops a shard primary and triggers follower promotion
- Performance assertions: cross-region replication < 2s p99, failover < 10s

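To make the `ShardCrash` deliverable concrete, here is a minimal sketch of the check a test could run after a primary is dropped. The `Follower` type, the `promote` policy (pick the follower that has applied the most entries), and `no_acked_loss` are hypothetical names introduced only for illustration; this plan does not specify the real promotion rule.

```rust
/// Hypothetical follower state: the WAL sequence numbers it has applied.
#[derive(Debug, Clone)]
pub struct Follower {
    pub name: String,
    pub applied: Vec<u64>,
}

/// Assumed promotion policy: choose the follower that has applied the most
/// WAL entries. Returns `None` when the shard has no followers.
pub fn promote(followers: &[Follower]) -> Option<&Follower> {
    followers.iter().max_by_key(|f| f.applied.len())
}

/// True if every acknowledged sequence number is present on the promoted
/// follower, i.e. the crash lost no acknowledged signal.
pub fn no_acked_loss(acked: &[u64], promoted: &Follower) -> bool {
    acked.iter().all(|seq| promoted.applied.contains(seq))
}
```

A test would record the acknowledged sequence numbers before injecting the crash, then assert `no_acked_loss` against whichever follower the cluster promoted.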
## Dependencies

- **Requires:** All phases 8.1-8.5
- **Files created:**
  - `tidal/tests/m8_uat.rs` -- integration test suite
  - `tidal/src/testing/cluster.rs` -- `SimulatedCluster` harness
  - `tidal/src/testing/faults.rs` -- `NetworkPartition`, `ShardCrash` fault injection

## Research References

- `docs/research/tidaldb_wal.md` -- invariant checklist for replication correctness

## Acceptance Criteria (Phase Level)

- [ ] **UAT Step 1:** Write signals for a user in us-east and read them in eu-west within 2 seconds; verified by `ReplicationLagGauge` assertion and `read_decay_score` equivalence
- [ ] **UAT Step 2:** Crash an entire shard primary (simulated); a follower is promoted within 10 seconds; all acknowledged signals are present on the promoted follower; no data loss
- [ ] **UAT Step 3:** Execute `RETRIEVE items COHORT locale:EU` while ap-south is partitioned; the query succeeds using available shards; results include items from non-partitioned regions only; degradation flag is set in `QueryStats`
- [ ] **UAT Step 4:** Heal the partition; `ReconciliationEngine` runs; after reconciliation: no duplicate signal counts (verified by sum of all events across all regions); hard negatives never leaked; decay scores on all shards match the analytical formula to 6 decimal places
- [ ] **UAT Step 5:** Move a tenant to a new region by changing routing config; during migration: zero downtime, all queries succeed; after migration: the tenant's data is on the new region only; the old region's copy is GC'd
- [ ] Invariant: no signal event is lost or double-counted across the entire test run (verified by WAL event count == materialized signal count on all shards)
- [ ] Invariant: hard negatives (hide/mute/block) are monotonically enforced -- once hidden, never visible during convergence

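The hard-negative invariant above can be checked mechanically if hard negatives are modeled as a grow-only set that merges by union. The sketch below assumes that model; `HardNegatives`, its `(user, item)` key shape, and `is_monotonic` are hypothetical names for illustration, not the actual tidalDB types.

```rust
use std::collections::BTreeSet;

/// Hypothetical grow-only set of (user_id, item_id) hard negatives.
#[derive(Clone, Default, Debug, PartialEq)]
pub struct HardNegatives {
    hidden: BTreeSet<(u64, u64)>,
}

impl HardNegatives {
    /// Record a hide/mute/block; entries are never removed.
    pub fn hide(&mut self, user: u64, item: u64) {
        self.hidden.insert((user, item));
    }

    pub fn is_visible(&self, user: u64, item: u64) -> bool {
        !self.hidden.contains(&(user, item))
    }

    /// Union merge: entries are only ever added, so merging state from
    /// another region can never re-expose a hidden item.
    pub fn merge(&mut self, other: &HardNegatives) {
        self.hidden.extend(other.hidden.iter().copied());
    }
}

/// True if everything hidden in `before` is still hidden in `after`.
pub fn is_monotonic(before: &HardNegatives, after: &HardNegatives) -> bool {
    before.hidden.is_subset(&after.hidden)
}
```

A convergence test could snapshot each region's set before reconciliation and assert `is_monotonic(&snapshot, &current)` after every merge step, in addition to checking that all regions end with equal sets.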
## Task Execution Order

```
Task 01: SimulatedCluster Harness ──────┐
                                        ├──> Task 03: UAT Scenario Tests (Steps 1-5)
Task 02: Fault Injection ───────────────┘            │
                                                     v
                                             Task 04: Performance Assertions + CI
```

Tasks 01 and 02 are parallelizable. Task 03 depends on both. Task 04 depends on 03.

## Module Location

| File | Status | Contains |
|------|--------|----------|
| `tidal/tests/m8_uat.rs` | NEW | All UAT scenario tests |
| `tidal/src/testing/cluster.rs` | NEW | `SimulatedCluster` harness |
| `tidal/src/testing/faults.rs` | NEW | `NetworkPartition`, `ShardCrash` fault injection |

## Notes

### All tests use InProcessTransport

No actual network. The `NetworkPartition` fault works by intercepting `send_segment` calls and dropping them for the specified region pair.

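One way to picture the interception is as a wrapper around the transport. The following is a minimal sketch under stated assumptions: the `Transport` trait here is a stand-in with an assumed `send_segment` signature, `InProcessTransport` is reduced to a recorder, and dropped sends return `Ok(())` so the sender sees a silent loss. None of this is taken from the real tidalDB API.

```rust
use std::collections::HashSet;

/// Stand-in for the real `Transport` trait; the signature is an assumption.
pub trait Transport {
    fn send_segment(&mut self, from: &str, to: &str, segment: Vec<u8>) -> Result<(), String>;
}

/// Stand-in in-process transport that just records delivered segments.
#[derive(Default)]
pub struct InProcessTransport {
    pub delivered: Vec<(String, String, Vec<u8>)>,
}

impl Transport for InProcessTransport {
    fn send_segment(&mut self, from: &str, to: &str, segment: Vec<u8>) -> Result<(), String> {
        self.delivered.push((from.to_string(), to.to_string(), segment));
        Ok(())
    }
}

/// Wraps a transport and drops segments between blocked region pairs.
pub struct NetworkPartition<T: Transport> {
    inner: T,
    blocked: HashSet<(String, String)>,
    pub dropped: usize,
}

impl<T: Transport> NetworkPartition<T> {
    pub fn new(inner: T) -> Self {
        Self { inner, blocked: HashSet::new(), dropped: 0 }
    }

    /// Block traffic in both directions between regions `a` and `b`.
    pub fn partition(&mut self, a: &str, b: &str) {
        self.blocked.insert((a.to_string(), b.to_string()));
        self.blocked.insert((b.to_string(), a.to_string()));
    }

    /// Remove the block in both directions.
    pub fn heal(&mut self, a: &str, b: &str) {
        self.blocked.remove(&(a.to_string(), b.to_string()));
        self.blocked.remove(&(b.to_string(), a.to_string()));
    }

    pub fn inner(&self) -> &T {
        &self.inner
    }
}

impl<T: Transport> Transport for NetworkPartition<T> {
    fn send_segment(&mut self, from: &str, to: &str, segment: Vec<u8>) -> Result<(), String> {
        if self.blocked.contains(&(from.to_string(), to.to_string())) {
            // Silent drop: count it and report success (assumed behavior).
            self.dropped += 1;
            return Ok(());
        }
        self.inner.send_segment(from, to, segment)
    }
}
```

Because the wrapper implements the same trait, a harness could swap it in without changing the code under test; whether a dropped send should surface as an error instead is a design choice this plan leaves open.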
### Deterministic reconciliation verification

After partition heal, we replay all WAL segments from both sides of the partition through a single-node `TidalDb` (the ground truth). We then compare every signal count and decay score on every shard against this ground truth. Any divergence fails the test.

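The comparison described above might look like the following sketch. It assumes each WAL event carries a globally unique id so that an event seen on both sides of the partition is applied once; `WalEvent`, `replay_ground_truth`, `diverging_keys`, and `scores_match` are hypothetical helpers, and the `1e-6` tolerance is one reading of "6 decimal places".

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical WAL event: a globally unique id plus the signal key it increments.
#[derive(Clone, Debug)]
pub struct WalEvent {
    pub id: u64,
    pub key: String,
}

/// Replay events from every side of the partition into one count map,
/// applying each event id at most once.
pub fn replay_ground_truth(sides: &[Vec<WalEvent>]) -> BTreeMap<String, u64> {
    let mut seen = BTreeSet::new();
    let mut counts = BTreeMap::new();
    for event in sides.iter().flatten() {
        if seen.insert(event.id) {
            *counts.entry(event.key.clone()).or_insert(0) += 1;
        }
    }
    counts
}

/// Keys whose count on a shard differs from the ground truth,
/// including keys present on only one side.
pub fn diverging_keys(
    truth: &BTreeMap<String, u64>,
    shard: &BTreeMap<String, u64>,
) -> Vec<String> {
    let keys: BTreeSet<&String> = truth.keys().chain(shard.keys()).collect();
    keys.into_iter()
        .filter(|k| truth.get(*k) != shard.get(*k))
        .cloned()
        .collect()
}

/// Decay scores are treated as equal when they differ by less than 1e-6.
pub fn scores_match(expected: f64, actual: f64) -> bool {
    (expected - actual).abs() < 1e-6
}
```

Returning the list of diverging keys rather than a bare boolean gives a failing test something specific to print.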
### Performance assertions are soft

The 2s p99 target is for in-process transport. Real network latency is additive. The test verifies that replication logic itself adds < 100ms overhead; the remaining budget is for network RTT.

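A p99 assertion needs a percentile definition. The sketch below uses the nearest-rank method over recorded latencies; `percentile` and `within_budget` are hypothetical helper names, and how the real suite collects its samples is not specified here.

```rust
use std::time::Duration;

/// Nearest-rank percentile over latency samples (`pct` in 0..=100).
/// Returns `None` when there are no samples.
pub fn percentile(samples: &[Duration], pct: f64) -> Option<Duration> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort();
    // Nearest rank: the smallest 1-based rank covering `pct` percent of samples.
    let rank = (pct * sorted.len() as f64 / 100.0).ceil() as usize;
    Some(sorted[rank.clamp(1, sorted.len()) - 1])
}

/// True if the given percentile exists and is strictly below `budget`.
pub fn within_budget(samples: &[Duration], pct: f64, budget: Duration) -> bool {
    matches!(percentile(samples, pct), Some(p) if p < budget)
}
```

With this helper, the in-process overhead check reads as `within_budget(&lags, 99.0, Duration::from_millis(100))`, and the end-to-end target as the same call with a 2-second budget.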
## Done When

`cargo test --test m8_uat` passes all 5 UAT scenario steps with 25K signals/sec sustained throughput across 3 simulated regions, verifying no signal loss, no duplicate counts, no leaked hard negatives, and correct decay scores after partition heal and reconciliation.