m8p2: WAL Shipping and Follower Replay
Delivers
One-way WAL replication from leader to followers. The leader ships sealed WAL
segments over an abstract transport trait. Followers receive segments, validate
checksums, and replay them idempotently through the existing signal ledger
apply_wal_event() path. A replication lag metric is emitted. A follower can
serve read queries (RETRIEVE, SEARCH) with bounded staleness.
This is the "read replicas" capability -- the foundation for multi-region deployment.
Deliverables:
- Transport trait: async fn send_segment(peer: ShardId, segment: &WalSegmentPayload) and async fn recv_segment() -> WalSegmentPayload
- InProcessTransport: for testing, uses tokio::sync::mpsc channels between co-located instances
- WalShipper: background task on leader that watches for sealed segments, ships them to registered followers
- SegmentReceiver: background task on follower that receives segments, validates BLAKE3, replays events
- ReplicationLagGauge: tracks the delta between leader's latest seqno and each follower's applied seqno
- FollowerDb: a TidalDb variant that does not accept writes, only replays segments; serves read queries from its local state
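The transport deliverables above could be sketched roughly as follows. This is a simplified, synchronous stand-in that uses only the standard library: the planned trait is async and InProcessTransport is meant to use tokio::sync::mpsc, so the blocking std::sync::mpsc::sync_channel here is a substitution made to keep the example dependency-free. The fields of WalSegmentPayload, the u32 inside ShardId, the TransportError type, and the pair constructor are all assumptions for illustration, not the actual tidal definitions.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender};

/// Hypothetical shard identifier; the real ShardId comes from Phase 8.1.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct ShardId(pub u32);

/// Assumed payload shape: a sealed segment's seqno range plus its raw bytes.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct WalSegmentPayload {
    pub first_seq: u64,
    pub last_seq: u64,
    pub bytes: Vec<u8>,
}

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum TransportError {
    Disconnected,
}

/// Synchronous stand-in for the planned async Transport trait.
pub trait Transport {
    fn send_segment(
        &self,
        peer: ShardId,
        segment: &WalSegmentPayload,
    ) -> Result<(), TransportError>;
    fn recv_segment(&self) -> Result<WalSegmentPayload, TransportError>;
}

/// One end of a bounded in-process link between two co-located instances.
pub struct InProcessTransport {
    tx: SyncSender<WalSegmentPayload>,
    rx: Receiver<WalSegmentPayload>,
}

impl InProcessTransport {
    /// Builds two connected endpoints; `capacity` bounds each direction.
    pub fn pair(capacity: usize) -> (Self, Self) {
        let (a_tx, b_rx) = sync_channel(capacity);
        let (b_tx, a_rx) = sync_channel(capacity);
        (Self { tx: a_tx, rx: a_rx }, Self { tx: b_tx, rx: b_rx })
    }
}

impl Transport for InProcessTransport {
    fn send_segment(
        &self,
        _peer: ShardId,
        segment: &WalSegmentPayload,
    ) -> Result<(), TransportError> {
        // A point-to-point link ignores `peer`; a networked transport would route on it.
        self.tx
            .send(segment.clone())
            .map_err(|_| TransportError::Disconnected)
    }

    fn recv_segment(&self) -> Result<WalSegmentPayload, TransportError> {
        // Blocks until a segment arrives or the sending side is dropped.
        self.rx.recv().map_err(|_| TransportError::Disconnected)
    }
}
```

In a test, a leader-side endpoint would call send_segment for each sealed segment while the follower-side endpoint loops on recv_segment; the bounded channel applies backpressure when the follower falls behind.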
Dependencies
- Requires: Phase 8.1 (ShardId, RegionId, WalSegmentId, BatchHeader v2, ReplicationState)
- Files modified:
  - tidal/src/wal/segment.rs -- sealed_segments_since(seqno) helper
  - tidal/src/db/open.rs -- support NodeRole::Follower startup
  - tidal/src/db/mod.rs -- TidalDb::is_follower() guard on write paths
  - tidal/src/signals/ledger/mod.rs -- ensure apply_wal_event() is idempotent when replaying duplicate segments
- Files created:
  - tidal/src/replication/transport.rs -- Transport trait, WalSegmentPayload
  - tidal/src/replication/in_process.rs -- InProcessTransport
  - tidal/src/replication/shipper.rs -- WalShipper
  - tidal/src/replication/receiver.rs -- SegmentReceiver
  - tidal/src/replication/lag.rs -- ReplicationLagGauge
Research References
- docs/research/tidaldb_wal.md -- Segment sealing, batch checksum validation
- thoughts.md -- Part V.5 (quarantine-first ingestion; WAL is source of truth)
Acceptance Criteria (Phase Level)
- Transport trait has send_segment and recv_segment async methods; InProcessTransport implements them via bounded mpsc channels
- WalShipper runs as a background tokio::task; polls for newly sealed segments every 2 seconds (configurable); ships segments to all registered followers in parallel
- SegmentReceiver validates BLAKE3 checksum of each received segment before replay; rejects corrupted segments with WalError::Corruption
- Follower replay is idempotent: replaying a segment with seqno <= follower's high-water-mark is a no-op (no duplicate signal counting)
- ReplicationLagGauge reports leader_seqno - follower_applied_seqno per follower; accessible via MetricsState
- Leader writes 1,000 signals -> follower replays all 1,000 -> read_decay_score on follower matches leader to 6 decimal places (analytical equivalence)
- Follower rejects write operations (db.signal(), db.write_item()) with TidalError::ReadOnly
- Replication lag converges to 0 within 5 seconds after leader quiesces (in-process transport)
- Leader crash and restart: follower continues serving reads from last replayed state; leader resumes shipping from last sealed segment
- FollowerDb serves db.retrieve() and db.search() queries against its local replayed state
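To make the lag criterion concrete, here is a minimal sketch of a gauge that reports leader_seqno - follower_applied_seqno per follower. The method names, the u32 follower key, and the single-threaded HashMap storage are assumptions for illustration; the real ReplicationLagGauge would presumably need thread-safe storage and would be exposed through MetricsState.

```rust
use std::collections::HashMap;

/// Sketch of a per-follower lag gauge. Follower ids are assumed to be u32 here.
#[derive(Default)]
pub struct ReplicationLagGauge {
    leader_seqno: u64,
    applied: HashMap<u32, u64>,
}

impl ReplicationLagGauge {
    /// Records the leader's latest seqno; never moves backwards.
    pub fn record_leader(&mut self, seqno: u64) {
        self.leader_seqno = self.leader_seqno.max(seqno);
    }

    /// Records a follower's applied seqno; stale reports are ignored.
    pub fn record_applied(&mut self, follower: u32, seqno: u64) {
        let entry = self.applied.entry(follower).or_insert(0);
        *entry = (*entry).max(seqno);
    }

    /// Lag in seqnos, or None for a follower that has never reported.
    pub fn lag(&self, follower: u32) -> Option<u64> {
        self.applied
            .get(&follower)
            .map(|applied| self.leader_seqno.saturating_sub(*applied))
    }
}
```

The saturating subtraction keeps the gauge at 0 if a follower report briefly arrives before the matching leader update, which avoids an underflow in the metric.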
Task Execution Order
Task 01: Transport Trait ──────┐
                               ├──> Task 03: WalShipper
Task 02: InProcessTransport ───┘           │
                                           v
                                    Task 04: SegmentReceiver
                                           │
                                           v
                                    Task 05: FollowerDb
                                           │
                                           v
                                    Task 06: ReplicationLagGauge
                                           │
                                           v
                                    Task 07: Integration Tests
Tasks 01 and 02 are parallelizable. Task 03 requires Task 01. Tasks 04-07 are sequential.
Module Location
| File | Status | Contains |
|---|---|---|
| tidal/src/replication/transport.rs | NEW | Transport trait, WalSegmentPayload |
| tidal/src/replication/in_process.rs | NEW | InProcessTransport (channel-based) |
| tidal/src/replication/shipper.rs | NEW | WalShipper background task |
| tidal/src/replication/receiver.rs | NEW | SegmentReceiver with checksum validation and replay |
| tidal/src/replication/lag.rs | NEW | ReplicationLagGauge |
| tidal/src/wal/segment.rs | MODIFIED | sealed_segments_since(seqno) |
| tidal/src/db/open.rs | MODIFIED | Follower startup path |
| tidal/src/db/mod.rs | MODIFIED | Write-rejection guard for followers |
| tidal/src/signals/ledger/mod.rs | MODIFIED | Idempotency guard on apply_wal_event |
Notes
In-process transport only in this phase
A TCP/gRPC transport is deferred to Phase 8.5. The Transport trait is async to support both in-process channels and future network transports.
Idempotency via seqno
Followers track their high-water-mark applied_seqno. Segments with first_seq <= applied_seqno are skipped entirely. This reuses the existing checkpoint format from M1.
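The skip rule can be illustrated with a small sketch. The type and method names here (SegmentRange, FollowerCursor, offer, ReplayOutcome) are hypothetical, and the apply closure stands in for the apply_wal_event() path; the sketch also assumes segments arrive in order and are applied whole, so the high-water-mark always lands on a segment boundary.

```rust
/// Assumed minimal view of a segment: only its seqno range matters here.
pub struct SegmentRange {
    pub first_seq: u64,
    pub last_seq: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum ReplayOutcome {
    Applied,
    Skipped,
}

/// Tracks the follower's high-water-mark of applied seqnos.
pub struct FollowerCursor {
    applied_seqno: u64,
}

impl FollowerCursor {
    pub fn new(applied_seqno: u64) -> Self {
        Self { applied_seqno }
    }

    pub fn applied_seqno(&self) -> u64 {
        self.applied_seqno
    }

    /// Replays `seg` through `apply` unless it starts at or below the
    /// high-water-mark, in which case it is skipped entirely.
    pub fn offer<F: FnMut(u64)>(&mut self, seg: &SegmentRange, mut apply: F) -> ReplayOutcome {
        if seg.first_seq <= self.applied_seqno {
            return ReplayOutcome::Skipped;
        }
        for seq in seg.first_seq..=seg.last_seq {
            apply(seq);
        }
        self.applied_seqno = seg.last_seq;
        ReplayOutcome::Applied
    }
}
```

Offering the same segment twice applies its events once: the second call sees first_seq <= applied_seqno and returns Skipped without invoking the closure, which is what prevents duplicate signal counting.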
Timer-based segment sealing
The existing WalHandle seals segments when they reach max_size. For replication, we add a timer-based seal: every wal_ship_interval (default 2s), the active segment is sealed even if not full. This bounds replication lag.
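One way to express the combined size-or-timer rule is as a pure decision function, sketched below. SealPolicy, SealReason, and should_seal are hypothetical names, not part of the existing WalHandle. Skipping the timer seal for an empty active segment is an added assumption of this sketch, chosen so an idle leader does not produce empty segments; the text above does not specify that case.

```rust
use std::time::Duration;

/// Hypothetical policy combining the existing size limit with the new timer.
pub struct SealPolicy {
    pub max_size: u64,
    pub wal_ship_interval: Duration,
}

#[derive(Debug, PartialEq, Eq)]
pub enum SealReason {
    SizeLimit,
    Timer,
}

impl SealPolicy {
    /// Decides whether the active segment should be sealed now.
    /// `active_bytes` is the size of the active segment and
    /// `since_last_seal` is the time elapsed since the previous seal.
    pub fn should_seal(&self, active_bytes: u64, since_last_seal: Duration) -> Option<SealReason> {
        if active_bytes >= self.max_size {
            Some(SealReason::SizeLimit)
        } else if active_bytes > 0 && since_last_seal >= self.wal_ship_interval {
            // Assumption: an empty segment is never sealed by the timer.
            Some(SealReason::Timer)
        } else {
            None
        }
    }
}
```

With the default interval of 2 seconds, a partially filled segment waits at most one interval before it becomes eligible for shipping, which is what bounds replication lag under light write load.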
No Raft, no consensus
This is primary-backup replication. One leader, N followers. Promotion is manual or triggered by the control plane (Phase 8.5).
Done When
A developer can start a leader and a follower using InProcessTransport, write 10,000 signals to the leader, observe the follower replay all events with lag < 5 seconds, and execute db.retrieve() on the follower with results matching the leader's state (modulo staleness of up to 1 batch).