m8p6: End-to-End UAT
Delivers
A comprehensive end-to-end test suite that exercises the complete UAT scenario: 3 regions, 5 shards per region, 25K signals/sec, network partition, failover, partition heal, deterministic reconciliation, and tenant migration. This is the gate for M8 completion.
Deliverables:
- `m8_uat` integration test suite matching all 5 UAT scenario steps
- `SimulatedCluster`: test harness that creates a multi-region tidalDB cluster using `InProcessTransport`
- `NetworkPartition`: injectable fault that blocks `Transport::send_segment` between specified regions
- `ShardCrash`: injectable fault that drops a shard primary and triggers follower promotion
- Performance assertions: cross-region replication < 2s p99, failover < 10s
Dependencies
- Requires: All phases 8.1-8.5
- Files created:
  - `tidal/tests/m8_uat.rs` -- integration test suite
  - `tidal/src/testing/cluster.rs` -- `SimulatedCluster` harness
  - `tidal/src/testing/faults.rs` -- `NetworkPartition`, `ShardCrash` fault injection
Research References
- `docs/research/tidaldb_wal.md` -- invariant checklist for replication correctness
Acceptance Criteria (Phase Level)
- UAT Step 1: Write signals for a user in us-east, read in eu-west after < 2 seconds; verified by `ReplicationLagGauge` assertion and `read_decay_score` equivalence
- UAT Step 2: Crash an entire shard primary (simulated); follower is promoted within 10 seconds; all acknowledged signals are present on the promoted follower; no data loss
- UAT Step 3: Execute `RETRIEVE items COHORT locale:EU` while ap-south is partitioned; query succeeds using available shards; results include items from non-partitioned regions only; degradation flag set in `QueryStats`
- UAT Step 4: Heal the partition; `ReconciliationEngine` runs; after reconciliation: no duplicate signal counts (verified by sum of all events across all regions); hard negatives never leaked; decay scores on all shards match analytical formula to 6 decimal places
- UAT Step 5: Move a tenant to a new region by changing routing config; during migration: zero downtime, all queries succeed; after migration: tenant's data is on new region only; old region's copy is GC'd
- Invariant: no signal event is lost or double-counted across the entire test run (verified by WAL event count == materialized signal count on all shards)
- Invariant: hard negatives (hide/mute/block) are monotonically enforced -- once hidden, never visible during convergence
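The two invariants above can be checked mechanically over data collected during a test run. The sketch below is one possible shape for such checks, not the suite's actual code: the `Event` enum, the numeric user, item, and shard identifiers, and both function names are hypothetical and introduced only for illustration.

```rust
use std::collections::{BTreeSet, HashMap, HashSet};

// Hypothetical record of what the test observed, in observation order.
#[derive(Debug, Clone, Copy)]
enum Event {
    // The user hid, muted, or blocked the item.
    HardNegative { user: u64, item: u64 },
    // The item was returned to the user by a query.
    Served { user: u64, item: u64 },
}

// Returns the first (user, item) pair that was served after a hard negative
// had already been observed for it, or None if the timeline is monotonic.
fn first_hard_negative_leak(timeline: &[Event]) -> Option<(u64, u64)> {
    let mut hidden: HashSet<(u64, u64)> = HashSet::new();
    for event in timeline {
        match *event {
            Event::HardNegative { user, item } => {
                hidden.insert((user, item));
            }
            Event::Served { user, item } => {
                if hidden.contains(&(user, item)) {
                    return Some((user, item));
                }
            }
        }
    }
    None
}

// Returns, in ascending order, every shard whose WAL event count differs from
// its materialized signal count. A shard present on only one side is reported.
fn conservation_violations(
    wal_counts: &HashMap<u32, u64>,
    materialized_counts: &HashMap<u32, u64>,
) -> Vec<u32> {
    let shards: BTreeSet<u32> = wal_counts
        .keys()
        .chain(materialized_counts.keys())
        .copied()
        .collect();
    shards
        .into_iter()
        .filter(|shard| wal_counts.get(shard) != materialized_counts.get(shard))
        .collect()
}
```

A test would assert that `first_hard_negative_leak` returns `None` and that `conservation_violations` returns an empty vector at the end of the run.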
Task Execution Order
    Task 01: SimulatedCluster Harness ──────┐
                                            ├──> Task 03: UAT Scenario Tests (Steps 1-5)
    Task 02: Fault Injection ───────────────┘              │
                                                           v
                                                 Task 04: Performance Assertions + CI
Tasks 01 and 02 are parallelizable. Task 03 depends on both. Task 04 depends on 03.
Module Location
| File | Status | Contains |
|---|---|---|
| `tidal/tests/m8_uat.rs` | NEW | All UAT scenario tests |
| `tidal/src/testing/cluster.rs` | NEW | `SimulatedCluster` harness |
| `tidal/src/testing/faults.rs` | NEW | `NetworkPartition`, `ShardCrash` fault injection |
Notes
All tests use InProcessTransport
No actual network. The NetworkPartition fault works by intercepting send_segment calls and dropping them for the specified region pair.
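The interception described above can be sketched as a wrapper around a transport. This is a minimal illustration under stated assumptions, not the harness's actual code: the simplified `Transport` trait signature, the `RegionId` alias, the `RecordingTransport` stand-in, and the `partition` and `heal` method names are all hypothetical. Only the `send_segment` name and the drop-per-region-pair behavior come from this plan.

```rust
use std::cell::RefCell;
use std::collections::HashSet;

// Hypothetical region identifier; the real type is not specified here.
type RegionId = &'static str;

// Assumed, simplified trait shape. Returns true when the segment was delivered.
trait Transport {
    fn send_segment(&self, from: RegionId, to: RegionId, segment: &[u8]) -> bool;
}

// Stand-in for an in-process transport: records every delivered segment.
#[derive(Default)]
struct RecordingTransport {
    delivered: RefCell<Vec<(RegionId, RegionId, Vec<u8>)>>,
}

impl Transport for RecordingTransport {
    fn send_segment(&self, from: RegionId, to: RegionId, segment: &[u8]) -> bool {
        self.delivered.borrow_mut().push((from, to, segment.to_vec()));
        true
    }
}

// Fault wrapper: drops segments whose (from, to) pair is currently blocked
// and forwards everything else to the wrapped transport.
struct NetworkPartition<T: Transport> {
    inner: T,
    blocked: RefCell<HashSet<(RegionId, RegionId)>>,
}

impl<T: Transport> NetworkPartition<T> {
    fn new(inner: T) -> Self {
        Self { inner, blocked: RefCell::new(HashSet::new()) }
    }

    // Block traffic in both directions between two regions.
    fn partition(&self, a: RegionId, b: RegionId) {
        let mut blocked = self.blocked.borrow_mut();
        blocked.insert((a, b));
        blocked.insert((b, a));
    }

    // Restore traffic in both directions between two regions.
    fn heal(&self, a: RegionId, b: RegionId) {
        let mut blocked = self.blocked.borrow_mut();
        blocked.remove(&(a, b));
        blocked.remove(&(b, a));
    }
}

impl<T: Transport> Transport for NetworkPartition<T> {
    fn send_segment(&self, from: RegionId, to: RegionId, segment: &[u8]) -> bool {
        if self.blocked.borrow().contains(&(from, to)) {
            return false; // dropped: the segment never reaches the inner transport
        }
        self.inner.send_segment(from, to, segment)
    }
}
```

Wrapping the transport keeps the fault out of the code under test: the replication path calls `send_segment` as usual and observes only a failed delivery.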
Deterministic reconciliation verification
After partition heal, we replay all WAL segments from both sides of the partition through a single-node TidalDb (the ground truth). We then compare every signal count and decay score on every shard against this ground truth. Any divergence fails the test.
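The comparison step described above might look like the following sketch. It assumes both the ground-truth replay and the cluster have already been snapshotted into maps. The `ShardKey` and `SignalState` types, the `diff_against_ground_truth` name, and the reading of "6 decimal places" as an absolute tolerance of `1e-6` are all assumptions made for this example.

```rust
use std::collections::BTreeMap;

// Hypothetical key: (shard id, signal key).
type ShardKey = (u32, String);

// Hypothetical snapshot of one signal's materialized state.
#[derive(Debug, Clone, PartialEq)]
struct SignalState {
    count: u64,
    decay_score: f64,
}

// Assumed interpretation of "match to 6 decimal places".
const DECAY_TOLERANCE: f64 = 1e-6;

// Returns one message per divergence between the cluster snapshot and the
// ground truth. An empty result means the cluster converged.
fn diff_against_ground_truth(
    truth: &BTreeMap<ShardKey, SignalState>,
    observed: &BTreeMap<ShardKey, SignalState>,
) -> Vec<String> {
    let mut divergences = Vec::new();
    for (key, expected) in truth {
        match observed.get(key) {
            None => divergences.push(format!("{key:?}: missing on cluster")),
            Some(actual) => {
                if actual.count != expected.count {
                    divergences.push(format!(
                        "{key:?}: count {} != expected {}",
                        actual.count, expected.count
                    ));
                }
                if (actual.decay_score - expected.decay_score).abs() > DECAY_TOLERANCE {
                    divergences.push(format!(
                        "{key:?}: decay score {} != expected {}",
                        actual.decay_score, expected.decay_score
                    ));
                }
            }
        }
    }
    // Keys that exist on the cluster but not in the ground truth also diverge.
    for key in observed.keys() {
        if !truth.contains_key(key) {
            divergences.push(format!("{key:?}: unexpected on cluster"));
        }
    }
    divergences
}
```

Collecting every divergence, rather than failing on the first, gives a failed run a complete picture of which shards and signals disagree.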
Performance assertions are soft
The 2s p99 target is asserted against in-process transport, which has no network latency; in a real deployment, network latency is additive. The test therefore verifies that the replication logic itself adds < 100ms of overhead, leaving the remainder of the 2s budget for network RTT.
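A p99 assertion of this kind can be sketched as follows. The example uses the nearest-rank percentile definition, which is an assumption; the plan does not say how p99 is computed. The constant and function names are hypothetical, and the budget values are the figures quoted in this plan.

```rust
use std::time::Duration;

// Budgets taken from the figures in this plan; the names are hypothetical.
const REPLICATION_P99_BUDGET: Duration = Duration::from_secs(2);
const LOGIC_OVERHEAD_BUDGET: Duration = Duration::from_millis(100);

// Nearest-rank p99 over latency samples; returns None for an empty slice.
fn p99(samples: &[Duration]) -> Option<Duration> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort();
    // 1-based rank = ceil(0.99 * n), computed in integer arithmetic.
    let rank = (sorted.len() * 99 + 99) / 100;
    Some(sorted[rank - 1])
}

// True when there is at least one sample and the p99 is strictly below budget.
fn within_budget(samples: &[Duration], budget: Duration) -> bool {
    matches!(p99(samples), Some(p) if p < budget)
}
```

Treating an empty sample set as a failure guards against a test that passes because nothing was measured.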
Done When
cargo test --test m8_uat passes all 5 UAT scenario steps with 25K signals/sec sustained throughput across 3 simulated regions, verifying no signal loss, no duplicate counts, no leaked hard negatives, and correct decay scores after partition heal and reconciliation.