
# m8p1: Shard-Aware Foundations

## Delivers

The identity types, WAL segment tagging, and shard routing table that make
tidalDB distribution-aware without introducing any network code. After this
phase, every WAL segment carries a globally unique ID
(`region_id:shard_id:seqno`), every entity operation is routable through a
`ShardRouter`, and the existing single-node deployment works identically with
the default shard_id=0 / region_id=0 configuration. This is the "build the
atoms right" phase -- no new runtime behavior, but every data structure is
distribution-ready.

Deliverables:
- `ShardId(u16)`, `RegionId(u16)`, `WalSegmentId { region_id, shard_id, seqno }` identity types
- WAL batch header v2: adds `shard_id` and `region_id` fields (backward-compatible: the fields reuse bytes that v1 wrote as zero padding, so v1 batches decode as shard 0, region 0)
- `ShardRouter`: maps `EntityId -> ShardId` via hash-based routing or configurable range boundaries
- `NodeConfig` extending `Config` with cluster role, shard assignment, region assignment
- `ReplicationState` tracking per-shard high-water-mark seqno for follower bookkeeping
- All existing tests pass unchanged (shard_id=0 is the default; single-node is shard 0)
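
As a rough sketch of the identity types listed above, the following shows one way the ordering and `Display` requirements could be met with derives alone. The public tuple fields and the `Default` derive are assumptions made for illustration, and the `Serialize`/`Deserialize` derives are left out so the snippet compiles without `serde`.

```rust
use std::fmt;

/// Shard identifier; `ShardId(0)` is the single-node default.
#[derive(Copy, Clone, Debug, Default, PartialEq, Eq, Hash, PartialOrd, Ord)]
pub struct ShardId(pub u16);

/// Region identifier; `RegionId(0)` is the single-node default.
#[derive(Copy, Clone, Debug, Default, PartialEq, Eq, Hash, PartialOrd, Ord)]
pub struct RegionId(pub u16);

/// Globally unique WAL segment identity.
///
/// Field order matters here: the derived `Ord` compares fields from top to
/// bottom, which gives the (region_id, shard_id, seqno) total ordering.
#[derive(Copy, Clone, Debug, PartialEq, Eq, Hash, PartialOrd, Ord)]
pub struct WalSegmentId {
    pub region_id: RegionId,
    pub shard_id: ShardId,
    pub seqno: u64,
}

impl fmt::Display for WalSegmentId {
    /// Renders the id in the `r{region}:s{shard}:{seqno}` form.
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "r{}:s{}:{}", self.region_id.0, self.shard_id.0, self.seqno)
    }
}
```

Relying on the derived `Ord` keeps the comparison logic out of hand-written code, at the cost of making the field declaration order part of the contract.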

## Dependencies

- **Requires:** M7 complete (WAL format v1, `BatchHeader`, `EventRecord`, `SegmentWriter`, `CheckpointManager`, `Config`, `StorageMode`)
- **Files modified:**
  - `tidal/src/wal/format/batch.rs` -- extend `BatchHeader` with shard/region fields
  - `tidal/src/wal/segment.rs` -- segment filename includes shard_id prefix for multi-shard directories
  - `tidal/src/db/config.rs` -- add `NodeConfig` with cluster fields
  - `tidal/src/wal/checkpoint.rs` -- checkpoint includes shard_id
- **Files created:**
  - `tidal/src/replication/mod.rs` -- module root
  - `tidal/src/replication/shard.rs` -- `ShardId`, `RegionId`, `ShardRouter`
  - `tidal/src/replication/segment_id.rs` -- `WalSegmentId`
  - `tidal/src/replication/state.rs` -- `ReplicationState`

## Research References

- `docs/research/tidaldb_wal.md` -- WAL segment format, batch header layout
- `thoughts.md` -- Part V.12 (subject-prefix key encoding for sharding)

## Acceptance Criteria (Phase Level)

- [ ] `ShardId(u16)` and `RegionId(u16)` are `Copy + Clone + Debug + Eq + Hash + Ord + Serialize + Deserialize`
- [ ] `WalSegmentId { region_id: RegionId, shard_id: ShardId, seqno: u64 }` has total ordering by `(region_id, shard_id, seqno)` and a human-readable `Display` impl producing `"r0:s0:42"`
- [ ] `BatchHeader` v2 adds `shard_id: u16` and `region_id: u16` at bytes 58-61 (within existing 64-byte header); `FORMAT_VERSION` bumped to 2; v1 batches decode as shard_id=0, region_id=0
- [ ] `ShardRouter::route(entity_id: EntityId) -> ShardId` returns the correct shard for hash-based routing; default single-shard config always returns `ShardId(0)`
- [ ] `ShardRouter` is constructable from a `Vec<(ShardId, EntityIdRange)>` with validation that ranges are non-overlapping and cover the full u64 space
- [ ] `NodeConfig` extends `Config` with `role: NodeRole`, `shard_id: ShardId`, `region_id: RegionId`, `peer_shards: Vec<ShardId>`; defaults produce a single-node config
- [ ] `ReplicationState` tracks `HashMap<ShardId, u64>` (high-water-mark seqno per shard) with atomic reads/writes
- [ ] All existing M0-M7 tests pass without modification (single-node = shard 0, region 0)
- [ ] Segment filename format for multi-shard: `wal-s{shard_id:05}-{first_seq:020}.seg`; single-shard (shard_id=0) retains old format `wal-{first_seq:020}.seg` for backward compatibility
- [ ] Property test: 10,000 random EntityIds always route to exactly one shard; routing is a pure function of entity_id and shard_ranges
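
The segment filename rule in the criteria above can be sketched as a pair of small functions. Both function names are hypothetical, and the parser is an illustrative addition that this phase does not require; it assumes a legacy name never starts with `s` after the `wal-` prefix, which holds because the legacy stem is all digits.

```rust
/// Builds a WAL segment filename. Shard 0 keeps the legacy single-shard
/// format so existing directories stay readable.
pub fn segment_filename(shard_id: u16, first_seq: u64) -> String {
    if shard_id == 0 {
        format!("wal-{:020}.seg", first_seq)
    } else {
        format!("wal-s{:05}-{:020}.seg", shard_id, first_seq)
    }
}

/// Parses either filename form back into `(shard_id, first_seq)`.
/// Returns `None` for names that match neither form.
pub fn parse_segment_filename(name: &str) -> Option<(u16, u64)> {
    let stem = name.strip_prefix("wal-")?.strip_suffix(".seg")?;
    match stem.strip_prefix('s') {
        // Multi-shard form: "s{shard_id}-{first_seq}".
        Some(rest) => {
            let (shard, seq) = rest.split_once('-')?;
            Some((shard.parse::<u16>().ok()?, seq.parse::<u64>().ok()?))
        }
        // Legacy form: the stem is just the sequence number.
        None => Some((0, stem.parse::<u64>().ok()?)),
    }
}
```

Zero-padding both numbers to a fixed width means a plain lexicographic directory listing sorts segments by shard and then by sequence number.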

## Task Execution Order

```
Task 01: Identity Types ─────────┐
                                  ├──> Task 03: BatchHeader v2
Task 02: ShardRouter ────────────┤
                                  ├──> Task 04: Segment Naming
                                  │
                                  └──> Task 05: NodeConfig
                                            │
                                            v
                                  Task 06: ReplicationState
```

Tasks 01 and 02 are fully parallelizable. Tasks 03 and 04 depend on Task 01. Task 05 depends on both 01 and 02. Task 06 depends on 05.

## Module Location

| File | Status | Contains |
|------|--------|----------|
| `tidal/src/replication/mod.rs` | NEW | Module root, re-exports |
| `tidal/src/replication/shard.rs` | NEW | `ShardId`, `RegionId`, `ShardRouter`, `EntityIdRange` |
| `tidal/src/replication/segment_id.rs` | NEW | `WalSegmentId`, ordering, Display |
| `tidal/src/replication/state.rs` | NEW | `ReplicationState`, high-water-mark tracking |
| `tidal/src/wal/format/batch.rs` | MODIFIED | `BatchHeader` v2 with shard/region fields |
| `tidal/src/wal/segment.rs` | MODIFIED | Shard-aware segment filename |
| `tidal/src/wal/checkpoint.rs` | MODIFIED | Checkpoint includes shard_id |
| `tidal/src/db/config.rs` | MODIFIED | `NodeConfig`, `NodeRole` enum |
| `tidal/src/lib.rs` | MODIFIED | Add `pub mod replication;` |
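
For the `ReplicationState` entry in `state.rs`, one possible shape is sketched below. It assumes the set of tracked shards is fixed at construction, so the map itself is never mutated and each high-water mark can be a plain `AtomicU64`. The method names, the initial value of 0, and the monotonic `advance` semantics are assumptions rather than settled design.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};

#[derive(Copy, Clone, Debug, PartialEq, Eq, Hash)]
pub struct ShardId(pub u16);

/// Per-shard high-water-mark bookkeeping for followers.
pub struct ReplicationState {
    // The key set is fixed after `new`, so shared references are enough
    // for concurrent readers and writers.
    high_water: HashMap<ShardId, AtomicU64>,
}

impl ReplicationState {
    /// Starts every tracked shard at seqno 0 (assumed to mean "nothing yet").
    pub fn new(shards: impl IntoIterator<Item = ShardId>) -> Self {
        Self {
            high_water: shards
                .into_iter()
                .map(|shard| (shard, AtomicU64::new(0)))
                .collect(),
        }
    }

    /// Raises the mark to `seqno` if it is higher than the current value.
    /// Returns the previous mark, or `None` if the shard is not tracked.
    pub fn advance(&self, shard: ShardId, seqno: u64) -> Option<u64> {
        self.high_water
            .get(&shard)
            .map(|mark| mark.fetch_max(seqno, Ordering::AcqRel))
    }

    /// Reads the current mark, or `None` if the shard is not tracked.
    pub fn high_water_mark(&self, shard: ShardId) -> Option<u64> {
        self.high_water
            .get(&shard)
            .map(|mark| mark.load(Ordering::Acquire))
    }
}
```

Using `fetch_max` means a late or duplicated acknowledgement can never move a mark backwards. If shards need to be added at runtime, a lock around the map would be required instead.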

## Notes

### Backward compatibility is non-negotiable

WAL v1 segments must be readable by v2 code. The 4 bytes at offsets 58-61 in the v1 header are currently zero-padding; v2 reinterprets them as shard_id and region_id. This is safe because v1 always wrote zeros there.
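
The reinterpretation of bytes 58-61 can be illustrated with a minimal sketch that works on a raw 64-byte header buffer. The split into `shard_id` at 58-59 and `region_id` at 60-61, the little-endian encoding, and the function names are all assumptions; the rest of the `BatchHeader` layout is deliberately not modelled.

```rust
pub const HEADER_LEN: usize = 64;
// Assumed split of the 4 former padding bytes.
pub const SHARD_ID_OFFSET: usize = 58;
pub const REGION_ID_OFFSET: usize = 60;

/// Writes the v2 shard/region tags into the former padding bytes.
pub fn write_shard_region(header: &mut [u8; HEADER_LEN], shard_id: u16, region_id: u16) {
    header[SHARD_ID_OFFSET..SHARD_ID_OFFSET + 2].copy_from_slice(&shard_id.to_le_bytes());
    header[REGION_ID_OFFSET..REGION_ID_OFFSET + 2].copy_from_slice(&region_id.to_le_bytes());
}

/// Reads `(shard_id, region_id)` from a header. A v1 header has zeros in
/// these bytes, so it decodes as (0, 0) without any version branch.
pub fn read_shard_region(header: &[u8; HEADER_LEN]) -> (u16, u16) {
    let shard_id = u16::from_le_bytes([header[SHARD_ID_OFFSET], header[SHARD_ID_OFFSET + 1]]);
    let region_id = u16::from_le_bytes([header[REGION_ID_OFFSET], header[REGION_ID_OFFSET + 1]]);
    (shard_id, region_id)
}
```

Because the zero bytes already mean "shard 0, region 0", the v1 decode path needs no special case for these two fields.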

### Hash-based vs range-based routing

`ShardRouter` supports both: `hash(entity_id) % num_shards` for uniform distribution, and explicit range boundaries for production deployments. The routing trait abstracts the choice, so callers only call `route` and never depend on the strategy.
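
The two strategies might look roughly like the sketch below. It uses an enum rather than a trait purely to keep the example short. `EntityId` is assumed to be a `u64`, `EntityIdRange` is assumed to be an inclusive `start..=end` pair, `RouterError` is hypothetical, and `mix64` is a stand-in hash (the SplitMix64 finalizer), not necessarily the one tidalDB will use.

```rust
use std::num::NonZeroU16;

#[derive(Copy, Clone, Debug, PartialEq, Eq, Hash, PartialOrd, Ord)]
pub struct ShardId(pub u16);

/// Inclusive range of entity ids (assumed shape).
#[derive(Copy, Clone, Debug, PartialEq, Eq)]
pub struct EntityIdRange { pub start: u64, pub end: u64 }

#[derive(Debug, PartialEq, Eq)]
pub enum RouterError { NoShards, InvertedRange, GapOrOverlap, IncompleteCoverage }

#[derive(Debug)]
pub enum ShardRouter {
    Hash { num_shards: NonZeroU16 },
    /// Ranges sorted by `start`, validated to tile the whole u64 space.
    Range(Vec<(ShardId, EntityIdRange)>),
}

/// Stand-in 64-bit mixer (SplitMix64 finalizer); deterministic across runs.
fn mix64(mut x: u64) -> u64 {
    x ^= x >> 30;
    x = x.wrapping_mul(0xbf58_476d_1ce4_e5b9);
    x ^= x >> 27;
    x = x.wrapping_mul(0x94d0_49bb_1331_11eb);
    x ^ (x >> 31)
}

impl ShardRouter {
    pub fn hash(num_shards: u16) -> Result<Self, RouterError> {
        NonZeroU16::new(num_shards)
            .map(|num_shards| ShardRouter::Hash { num_shards })
            .ok_or(RouterError::NoShards)
    }

    pub fn from_ranges(mut ranges: Vec<(ShardId, EntityIdRange)>) -> Result<Self, RouterError> {
        if ranges.is_empty() {
            return Err(RouterError::NoShards);
        }
        if ranges.iter().any(|(_, r)| r.start > r.end) {
            return Err(RouterError::InvertedRange);
        }
        ranges.sort_by_key(|(_, r)| r.start);
        if ranges[0].1.start != 0 || ranges[ranges.len() - 1].1.end != u64::MAX {
            return Err(RouterError::IncompleteCoverage);
        }
        // Each range must begin exactly one past the previous range's end.
        for pair in ranges.windows(2) {
            if pair[0].1.end.checked_add(1) != Some(pair[1].1.start) {
                return Err(RouterError::GapOrOverlap);
            }
        }
        Ok(ShardRouter::Range(ranges))
    }

    /// Pure function of `entity_id` and the router's configuration.
    pub fn route(&self, entity_id: u64) -> ShardId {
        match self {
            ShardRouter::Hash { num_shards } => {
                ShardId((mix64(entity_id) % u64::from(num_shards.get())) as u16)
            }
            ShardRouter::Range(ranges) => {
                // Validation guarantees the first range starts at 0, so at
                // least one range satisfies the predicate.
                let idx = ranges.partition_point(|(_, r)| r.start <= entity_id) - 1;
                ranges[idx].0
            }
        }
    }
}
```

With validated ranges the lookup is a binary search, and a router built with `ShardRouter::hash(1)` always returns `ShardId(0)`, which matches the single-node default.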

### No network code in this phase

Everything is in-process. The `replication` module defines data structures and routing logic only. The `Transport` trait is introduced in Phase 8.2.

## Done When

A developer can construct a `NodeConfig` with 3 regions and 5 shards per region, create a `ShardRouter` from range boundaries, route EntityIds to shards, construct a WAL `BatchHeader` v2 with shard/region tags, and all existing single-node tests pass unchanged.