tidaldb/docs/planning/milestone-8/phase-6/OVERVIEW.md
jordan f4cfd6c81f feat: complete M8 replication primitives + forage enhancements + docs
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-24 13:17:19 -07:00


m8p6: End-to-End UAT

Delivers

A comprehensive end-to-end test suite that exercises the complete UAT scenario: 3 regions, 5 shards per region, 25K signals/sec, network partition, failover, partition heal, deterministic reconciliation, and tenant migration. This is the gate for M8 completion.

Deliverables:

  • m8_uat integration test suite matching all 5 UAT scenario steps
  • SimulatedCluster: test harness that creates a multi-region tidalDB cluster using InProcessTransport
  • NetworkPartition: injectable fault that blocks Transport::send_segment between specified regions
  • ShardCrash: injectable fault that drops a shard primary and triggers follower promotion
  • Performance assertions: cross-region replication < 2s p99, failover < 10s

Dependencies

  • Requires: All phases 8.1-8.5
  • Files created:
    • tidal/tests/m8_uat.rs -- integration test suite
    • tidal/src/testing/cluster.rs -- SimulatedCluster harness
    • tidal/src/testing/faults.rs -- NetworkPartition, ShardCrash fault injection

Research References

  • docs/research/tidaldb_wal.md -- invariant checklist for replication correctness

Acceptance Criteria (Phase Level)

  • UAT Step 1: Write signals for a user in us-east, read in eu-west after < 2 seconds; verified by ReplicationLagGauge assertion and read_decay_score equivalence
  • UAT Step 2: Crash an entire shard primary (simulated); follower is promoted within 10 seconds; all acknowledged signals are present on the promoted follower; no data loss
  • UAT Step 3: Execute RETRIEVE items COHORT locale:EU while ap-south is partitioned; query succeeds using available shards; results include items from non-partitioned regions only; degradation flag set in QueryStats
  • UAT Step 4: Heal the partition; ReconciliationEngine runs; after reconciliation: no duplicate signal counts (verified by sum of all events across all regions); hard negatives never leaked; decay scores on all shards match analytical formula to 6 decimal places
  • UAT Step 5: Move a tenant to a new region by changing routing config; during migration: zero downtime, all queries succeed; after migration: tenant's data is on new region only; old region's copy is GC'd
  • Invariant: no signal event is lost or double-counted across the entire test run (verified by WAL event count == materialized signal count on all shards)
  • Invariant: hard negatives (hide/mute/block) are monotonically enforced -- once hidden, never visible during convergence
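The hard-negative invariant above can be checked mechanically over a recorded event trace. The following is a minimal sketch, assuming the test harness can log a flat sequence of events per user; the `Event` enum and `first_leak` function are hypothetical names for illustration and are not part of the tidalDB API.

```rust
use std::collections::HashSet;

/// Hypothetical trace entry recorded by the test harness for one user.
pub enum Event {
    /// A hide/mute/block was applied to the item.
    HardNegative(String),
    /// The item was returned to the user by some query.
    Served(String),
}

/// Returns the index of the first event that serves an item after a hard
/// negative was recorded for it, or `None` if the trace never leaks.
pub fn first_leak(events: &[Event]) -> Option<usize> {
    let mut hidden: HashSet<&str> = HashSet::new();
    for (i, event) in events.iter().enumerate() {
        match event {
            Event::HardNegative(item) => {
                hidden.insert(item.as_str());
            }
            Event::Served(item) => {
                if hidden.contains(item.as_str()) {
                    return Some(i);
                }
            }
        }
    }
    None
}
```

A UAT test could run this over the trace collected during partition and heal and assert that the result is `None`.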

Task Execution Order

Task 01: SimulatedCluster Harness ──────┐
                                         ├──> Task 03: UAT Scenario Tests (Steps 1-5)
Task 02: Fault Injection ────────────────┘         │
                                                    v
                                          Task 04: Performance Assertions + CI

Tasks 01 and 02 are parallelizable. Task 03 depends on both. Task 04 depends on 03.

Module Location

| File | Status | Contains |
|------|--------|----------|
| tidal/tests/m8_uat.rs | NEW | All UAT scenario tests |
| tidal/src/testing/cluster.rs | NEW | SimulatedCluster harness |
| tidal/src/testing/faults.rs | NEW | NetworkPartition, ShardCrash fault injection |

Notes

All tests use InProcessTransport

No actual network is involved. The NetworkPartition fault intercepts send_segment calls and drops them for the specified region pair.
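One way to picture the interception is a wrapper around the transport. This is only a sketch under stated assumptions: the `Transport` trait below is a simplified stand-in (the real `send_segment` signature is not given in this document), and `RecordingTransport` and `PartitionedTransport` are hypothetical names. Returning `Ok(())` for a dropped segment models a silent black hole; the real fault may instead surface an error.

```rust
use std::collections::HashSet;
use std::sync::Mutex;

/// Simplified stand-in for the Transport trait (assumed signature).
pub trait Transport {
    fn send_segment(&self, from: &str, to: &str, segment: &[u8]) -> Result<(), String>;
}

/// In-memory transport that records every delivered segment.
#[derive(Default)]
pub struct RecordingTransport {
    pub delivered: Mutex<Vec<(String, String, Vec<u8>)>>,
}

impl Transport for RecordingTransport {
    fn send_segment(&self, from: &str, to: &str, segment: &[u8]) -> Result<(), String> {
        self.delivered
            .lock()
            .unwrap()
            .push((from.to_string(), to.to_string(), segment.to_vec()));
        Ok(())
    }
}

/// Wraps a transport and drops segments between partitioned region pairs.
pub struct PartitionedTransport<T: Transport> {
    inner: T,
    blocked: Mutex<HashSet<(String, String)>>,
}

impl<T: Transport> PartitionedTransport<T> {
    pub fn new(inner: T) -> Self {
        Self { inner, blocked: Mutex::new(HashSet::new()) }
    }

    /// Blocks traffic in both directions between regions `a` and `b`.
    pub fn partition(&self, a: &str, b: &str) {
        let mut blocked = self.blocked.lock().unwrap();
        blocked.insert((a.to_string(), b.to_string()));
        blocked.insert((b.to_string(), a.to_string()));
    }

    /// Restores traffic in both directions between regions `a` and `b`.
    pub fn heal(&self, a: &str, b: &str) {
        let mut blocked = self.blocked.lock().unwrap();
        blocked.remove(&(a.to_string(), b.to_string()));
        blocked.remove(&(b.to_string(), a.to_string()));
    }

    pub fn inner(&self) -> &T {
        &self.inner
    }
}

impl<T: Transport> Transport for PartitionedTransport<T> {
    fn send_segment(&self, from: &str, to: &str, segment: &[u8]) -> Result<(), String> {
        let pair = (from.to_string(), to.to_string());
        if self.blocked.lock().unwrap().contains(&pair) {
            // Dropped: the segment never reaches the inner transport.
            return Ok(());
        }
        self.inner.send_segment(from, to, segment)
    }
}
```

Keeping the fault in a wrapper means the code under test sees an ordinary transport, and healing the partition is just removing the pair from the blocked set.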

Deterministic reconciliation verification

After partition heal, we replay all WAL segments from both sides of the partition through a single-node TidalDb (the ground truth). We then compare every signal count and decay score on every shard against this ground truth. Any divergence fails the test.
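The comparison step could look roughly like the sketch below. It assumes each shard and the ground-truth node can be dumped into a map from item key to a count and a decay score; `SignalState`, `ShardView`, and `diff_cluster` are hypothetical names, and the tolerance passed by the caller (for example `1e-6` for "6 decimal places") is likewise an assumption.

```rust
use std::collections::BTreeMap;

/// Hypothetical materialized state for one item.
#[derive(Debug, Clone, PartialEq)]
pub struct SignalState {
    pub count: u64,
    pub decay_score: f64,
}

/// Hypothetical dump of one node: item key -> materialized state.
pub type ShardView = BTreeMap<String, SignalState>;

/// Compares every shard against the ground truth and returns one message
/// per divergence. An empty result means the cluster matches.
pub fn diff_cluster(truth: &ShardView, shards: &[ShardView], tol: f64) -> Vec<String> {
    let mut diffs = Vec::new();
    for (idx, shard) in shards.iter().enumerate() {
        for (key, actual) in shard {
            match truth.get(key) {
                None => diffs.push(format!("shard {}: {} not in ground truth", idx, key)),
                Some(expected) => {
                    if actual.count != expected.count {
                        diffs.push(format!(
                            "shard {}: {} count {} != {}",
                            idx, key, actual.count, expected.count
                        ));
                    }
                    // Written as a negated <= so that NaN counts as a divergence.
                    let delta = (actual.decay_score - expected.decay_score).abs();
                    if !(delta <= tol) {
                        diffs.push(format!(
                            "shard {}: {} decay score off by {}",
                            idx, key, delta
                        ));
                    }
                }
            }
        }
    }
    // Every ground-truth item must live on at least one shard (no lost signals).
    for key in truth.keys() {
        if !shards.iter().any(|s| s.contains_key(key)) {
            diffs.push(format!("{} missing from every shard", key));
        }
    }
    diffs
}
```

Returning all divergences rather than failing on the first one makes a failed run easier to diagnose, since the test can print the full list before asserting that it is empty.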

Performance assertions are soft

The 2s p99 target is measured over the in-process transport, so it includes no real network latency; in a deployment, network RTT is additive. The test verifies that the replication logic itself adds < 100ms of overhead; the remainder of the 2s budget is reserved for network RTT.
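A p99 check over collected latency samples might be written as follows. This is a sketch only: `percentile` and `check_replication_overhead` are hypothetical helpers, the nearest-rank percentile definition is one common choice among several, and the 100ms constant is taken from the note above.

```rust
use std::time::Duration;

/// Nearest-rank percentile for `pct` in 1..=100. Returns `None` for empty
/// input or an out-of-range percentile.
pub fn percentile(samples: &[Duration], pct: usize) -> Option<Duration> {
    if samples.is_empty() || pct == 0 || pct > 100 {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort();
    // 1-based rank = ceil(pct * n / 100), computed in integers to avoid
    // floating-point rounding surprises.
    let rank = (sorted.len() * pct + 99) / 100;
    Some(sorted[rank - 1])
}

/// Fails if the p99 of the measured in-process replication latencies is not
/// strictly below the 100ms overhead budget.
pub fn check_replication_overhead(samples: &[Duration]) -> Result<Duration, String> {
    const OVERHEAD_BUDGET: Duration = Duration::from_millis(100);
    let p99 = percentile(samples, 99).ok_or_else(|| "no latency samples".to_string())?;
    if p99 < OVERHEAD_BUDGET {
        Ok(p99)
    } else {
        Err(format!("p99 {:?} exceeds budget {:?}", p99, OVERHEAD_BUDGET))
    }
}
```

With few samples the nearest-rank p99 is simply the maximum, so a test using this approach would want enough samples per run for the percentile to be meaningful.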

Done When

cargo test --test m8_uat passes all 5 UAT scenario steps with 25K signals/sec sustained throughput across 3 simulated regions, verifying no signal loss, no duplicate counts, no leaked hard negatives, and correct decay scores after partition heal and reconciliation.