tidaldb/docs/planning/milestone-8/phase-1/OVERVIEW.md
jordan f4cfd6c81f feat: complete M8 replication primitives + forage enhancements + docs
2026-02-24 13:17:19 -07:00


m8p1: Shard-Aware Foundations

Delivers

The identity types, WAL segment tagging, and shard routing table that make tidalDB distribution-aware without introducing any network code. After this phase, every WAL segment carries a globally unique ID (region_id:shard_id:seqno), every entity operation is routable through a ShardRouter, and the existing single-node deployment works identically with the default shard_id=0 / region_id=0 configuration. This is the "build the atoms right" phase -- no new runtime behavior, but every data structure is distribution-ready.

Deliverables:

  • ShardId(u16), RegionId(u16), WalSegmentId { region_id, shard_id, seqno } identity types
  • WAL batch header v2: adds shard_id and region_id fields (backward-compatible; v1 readers skip unknown fields)
  • ShardRouter: maps EntityId -> ShardId via configurable range boundaries
  • NodeConfig extending Config with cluster role, shard assignment, region assignment
  • ReplicationState tracking per-shard high-water-mark seqno for follower bookkeeping
  • All existing tests pass unchanged (shard_id=0 is the default; single-node is shard 0)
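
As an illustration of the identity types listed above, here is a minimal sketch using only the standard library. The derive lists and the Display format follow the acceptance criteria in this document; the public tuple fields and the Default derives are assumptions made for brevity, and the Serialize/Deserialize derives are omitted so the sketch does not depend on serde.

```rust
use std::fmt;

/// Shard identifier. ShardId(0) is the single-node default.
#[derive(Copy, Clone, Debug, Default, PartialEq, Eq, Hash, PartialOrd, Ord)]
pub struct ShardId(pub u16);

/// Region identifier. RegionId(0) is the single-node default.
#[derive(Copy, Clone, Debug, Default, PartialEq, Eq, Hash, PartialOrd, Ord)]
pub struct RegionId(pub u16);

/// Globally unique WAL segment identity.
/// Field order matters: the derived Ord compares fields top to bottom,
/// which yields the (region_id, shard_id, seqno) total ordering.
#[derive(Copy, Clone, Debug, PartialEq, Eq, Hash, PartialOrd, Ord)]
pub struct WalSegmentId {
    pub region_id: RegionId,
    pub shard_id: ShardId,
    pub seqno: u64,
}

impl fmt::Display for WalSegmentId {
    /// Renders as "r{region}:s{shard}:{seqno}", for example "r0:s0:42".
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "r{}:s{}:{}", self.region_id.0, self.shard_id.0, self.seqno)
    }
}
```

Relying on derived ordering keeps the comparison logic out of hand-written code, at the cost of making the field declaration order part of the contract.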

Dependencies

  • Requires: M7 complete (WAL format v1, BatchHeader, EventRecord, SegmentWriter, CheckpointManager, Config, StorageMode)
  • Files modified:
    • tidal/src/wal/format/batch.rs -- extend BatchHeader with shard/region fields
    • tidal/src/wal/segment.rs -- segment filename includes shard_id prefix for multi-shard directories
    • tidal/src/db/config.rs -- add NodeConfig with cluster fields
    • tidal/src/wal/checkpoint.rs -- checkpoint includes shard_id
  • Files created:
    • tidal/src/replication/mod.rs -- module root
    • tidal/src/replication/shard.rs -- ShardId, RegionId, ShardRouter
    • tidal/src/replication/segment_id.rs -- WalSegmentId
    • tidal/src/replication/state.rs -- ReplicationState

Research References

  • docs/research/tidaldb_wal.md -- WAL segment format, batch header layout
  • thoughts.md -- Part V.12 (subject-prefix key encoding for sharding)

Acceptance Criteria (Phase Level)

  • ShardId(u16) and RegionId(u16) are Copy + Clone + Debug + Eq + Hash + Ord + Serialize + Deserialize
  • WalSegmentId { region_id: RegionId, shard_id: ShardId, seqno: u64 } has total ordering by (region_id, shard_id, seqno) and a human-readable Display impl producing "r0:s0:42"
  • BatchHeader v2 adds shard_id: u16 and region_id: u16 at bytes 58-61 (within existing 64-byte header); FORMAT_VERSION bumped to 2; v1 batches decode as shard_id=0, region_id=0
  • ShardRouter::route(entity_id: EntityId) -> ShardId returns the correct shard under both hash-based and range-based routing; default single-shard config always returns ShardId(0)
  • ShardRouter is constructable from a Vec<(ShardId, EntityIdRange)> with validation that ranges are non-overlapping and cover the full u64 space
  • NodeConfig extends Config with role: NodeRole, shard_id: ShardId, region_id: RegionId, peer_shards: Vec<ShardId>; defaults produce a single-node config
  • ReplicationState tracks HashMap<ShardId, u64> (high-water-mark seqno per shard) with atomic reads/writes
  • All existing M0-M7 tests pass without modification (single-node = shard 0, region 0)
  • Segment filename format for multi-shard: wal-s{shard_id:05}-{first_seq:020}.seg; single-shard (shard_id=0) retains old format wal-{first_seq:020}.seg for backward compatibility
  • Property test: 10,000 random EntityIds always route to exactly one shard; routing is a pure function of entity_id and shard_ranges
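
Two of the criteria above, the segment filename rule and per-shard high-water-mark tracking, can be sketched as follows. This is an illustration under stated assumptions, not the actual tidalDB code: plain u16/u64 stand in for ShardId and seqno, the function and method names are hypothetical, "atomic reads/writes" is approximated with an RwLock around the map, and the rule that a high-water mark never moves backward is an assumption.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

/// Builds a segment filename. Shard 0 keeps the legacy single-shard name
/// so existing directories remain readable.
pub fn segment_filename(shard_id: u16, first_seq: u64) -> String {
    if shard_id == 0 {
        format!("wal-{first_seq:020}.seg")
    } else {
        format!("wal-s{shard_id:05}-{first_seq:020}.seg")
    }
}

/// Per-shard high-water-mark bookkeeping (hypothetical shape).
#[derive(Default)]
pub struct ReplicationState {
    // The RwLock makes each read or update observe a consistent map.
    hwm: RwLock<HashMap<u16, u64>>,
}

impl ReplicationState {
    /// Returns the recorded high-water mark for a shard, if any.
    pub fn high_water_mark(&self, shard_id: u16) -> Option<u64> {
        // Panics if a writer panicked while holding the lock (poisoning).
        self.hwm.read().unwrap().get(&shard_id).copied()
    }

    /// Records `seqno` for a shard and returns the mark after the call.
    /// Assumption: marks are monotonic, so a lower seqno is ignored.
    pub fn advance(&self, shard_id: u16, seqno: u64) -> u64 {
        let mut map = self.hwm.write().unwrap();
        let entry = map.entry(shard_id).or_insert(seqno);
        if seqno > *entry {
            *entry = seqno;
        }
        *entry
    }
}
```

The 20-digit zero padding is wide enough for any u64, so lexicographic filename order matches numeric sequence order within a shard.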

Task Execution Order

Task 01: Identity Types ─────────┐
                                  ├──> Task 03: BatchHeader v2
Task 02: ShardRouter ────────────┤
                                  ├──> Task 04: Segment Naming
                                  │
                                  └──> Task 05: NodeConfig
                                            │
                                            v
                                  Task 06: ReplicationState

Tasks 01 and 02 are fully parallelizable. Tasks 03 and 04 depend on Task 01. Task 05 depends on both 01 and 02. Task 06 depends on 05.

Module Location

| File | Status | Contains |
|------|--------|----------|
| tidal/src/replication/mod.rs | NEW | Module root, re-exports |
| tidal/src/replication/shard.rs | NEW | ShardId, RegionId, ShardRouter, EntityIdRange |
| tidal/src/replication/segment_id.rs | NEW | WalSegmentId, ordering, Display |
| tidal/src/replication/state.rs | NEW | ReplicationState, high-water-mark tracking |
| tidal/src/wal/format/batch.rs | MODIFIED | BatchHeader v2 with shard/region fields |
| tidal/src/wal/segment.rs | MODIFIED | Shard-aware segment filename |
| tidal/src/wal/checkpoint.rs | MODIFIED | Checkpoint includes shard_id |
| tidal/src/db/config.rs | MODIFIED | NodeConfig, NodeRole enum |
| tidal/src/lib.rs | MODIFIED | Add pub mod replication; |

Notes

Backward compatibility is non-negotiable

WAL v1 segments must be readable by v2 code. The 4 bytes at offsets 58-61 in the v1 header are currently zero-padding; v2 reinterprets them as shard_id and region_id. This is safe because v1 always wrote zeros there.
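
The reinterpretation of the padding bytes can be sketched as below. This is a hedged illustration, not the real BatchHeader code: the function names are hypothetical, and both the little-endian encoding and the field order (shard_id at bytes 58-59, region_id at bytes 60-61) are assumptions, since this document only fixes the combined 58-61 range.

```rust
/// Size of the batch header in bytes, per this document.
pub const HEADER_LEN: usize = 64;
/// Assumed offset of shard_id within the header.
const SHARD_OFFSET: usize = 58;
/// Assumed offset of region_id within the header.
const REGION_OFFSET: usize = 60;

/// Reads (shard_id, region_id) from a raw header.
/// A v1 header has zeros at these offsets, so it decodes as (0, 0)
/// without any version-specific branch.
pub fn read_shard_region(header: &[u8; HEADER_LEN]) -> (u16, u16) {
    let shard = u16::from_le_bytes([header[SHARD_OFFSET], header[SHARD_OFFSET + 1]]);
    let region = u16::from_le_bytes([header[REGION_OFFSET], header[REGION_OFFSET + 1]]);
    (shard, region)
}

/// Writes shard_id and region_id into the former padding bytes.
pub fn write_shard_region(header: &mut [u8; HEADER_LEN], shard_id: u16, region_id: u16) {
    header[SHARD_OFFSET..SHARD_OFFSET + 2].copy_from_slice(&shard_id.to_le_bytes());
    header[REGION_OFFSET..REGION_OFFSET + 2].copy_from_slice(&region_id.to_le_bytes());
}
```

Because the zeroed padding is itself the encoding of shard 0 and region 0, a v2 reader handles v1 batches with the same code path it uses for v2.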

Hash-based vs range-based routing

ShardRouter supports both modes: hash(entity_id) % num_shards for uniform distribution, and explicit range boundaries for production deployments. A trait abstracts the choice, so callers do not depend on which mode is configured.
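
A minimal sketch of the two routing modes follows. It makes several assumptions that are not confirmed by this document: EntityId is treated as a u64 alias, EntityIdRange is an inclusive [start, end] pair, the choice is modeled with a private enum rather than the trait mentioned above, and std's DefaultHasher stands in for the hash function (its output is not guaranteed stable across Rust releases, so persistent routing would need a fixed hash).

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

pub type EntityId = u64; // assumption: the real EntityId may be a newtype

#[derive(Copy, Clone, Debug, PartialEq, Eq)]
pub struct ShardId(pub u16);

/// Inclusive range of entity ids (assumed shape).
#[derive(Copy, Clone, Debug)]
pub struct EntityIdRange {
    pub start: u64,
    pub end: u64,
}

#[derive(Debug)]
enum Mode {
    Hash { num_shards: u16 },
    /// Sorted by start, contiguous, covering 0..=u64::MAX.
    Range(Vec<(ShardId, EntityIdRange)>),
}

#[derive(Debug)]
pub struct ShardRouter {
    mode: Mode,
}

impl ShardRouter {
    /// Hash routing over `num_shards` shards; 1 gives the single-shard default.
    pub fn hashed(num_shards: u16) -> Result<Self, String> {
        if num_shards == 0 {
            return Err("num_shards must be at least 1".to_string());
        }
        Ok(Self { mode: Mode::Hash { num_shards } })
    }

    /// Range routing; rejects gaps, overlaps, and incomplete coverage.
    pub fn from_ranges(mut ranges: Vec<(ShardId, EntityIdRange)>) -> Result<Self, String> {
        ranges.sort_by_key(|(_, r)| r.start);
        // Next start we expect; None once u64::MAX has been covered.
        let mut next = Some(0u64);
        for (_, r) in &ranges {
            if r.start > r.end {
                return Err(format!("empty range starting at {}", r.start));
            }
            match next {
                Some(n) if r.start == n => {}
                Some(n) if r.start > n => return Err(format!("gap before {}", r.start)),
                _ => return Err(format!("overlap at {}", r.start)),
            }
            next = r.end.checked_add(1);
        }
        if next.is_some() {
            return Err("ranges do not cover the full u64 space".to_string());
        }
        Ok(Self { mode: Mode::Range(ranges) })
    }

    /// Pure function of the entity id and the router configuration.
    pub fn route(&self, entity_id: EntityId) -> ShardId {
        match &self.mode {
            Mode::Hash { num_shards } => {
                let mut hasher = DefaultHasher::new();
                entity_id.hash(&mut hasher);
                ShardId((hasher.finish() % u64::from(*num_shards)) as u16)
            }
            Mode::Range(ranges) => {
                // Validation guarantees the first range starts at 0, so at
                // least one range has start <= entity_id.
                let idx = ranges.partition_point(|(_, r)| r.start <= entity_id) - 1;
                ranges[idx].0
            }
        }
    }
}
```

Validating coverage once at construction lets route stay infallible: every u64 falls in exactly one range, so lookup never needs an error path.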

No network code in this phase

Everything is in-process. The replication module defines data structures and routing logic only. The Transport trait is introduced in Phase 8.2.

Done When

A developer can construct a NodeConfig with 3 regions and 5 shards per region, create a ShardRouter from range boundaries, route EntityIds to shards, construct a WAL BatchHeader v2 with shard/region tags, and all existing single-node tests pass unchanged.