tidaldb/docs/planning/milestone-8/phase-1/OVERVIEW.md
jordan f4cfd6c81f feat: complete M8 replication primitives + forage enhancements + docs
Milestone 8 (phases 1-4):
- Shard-aware WAL segment naming, BatchHeader v2, ShardRouter
- Transport trait, InProcessTransport, WalShipper, FollowerDb
- HLC, PNCounter, LWWRegister, CrdtSignalState, ReconciliationEngine
- Session replication bridge with SeqNo/HWM, idempotency store

Forage application:
- Multi-source discovery engine with MAB exploration
- Embedding-based label system, server handlers, UI refresh

Other:
- QUICKSTART.md, README.md, milestone-8 planning docs
- Hard negative union semantics, RLHF export enhancements
- Recovery benchmark and visibility test expansions
- Split 8 oversized source files per CODING_GUIDELINES §9

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-24 13:17:19 -07:00


# m8p1: Shard-Aware Foundations
## Delivers
The identity types, WAL segment tagging, and shard routing table that make
tidalDB distribution-aware without introducing any network code. After this
phase, every WAL segment carries a globally unique ID
(`region_id:shard_id:seqno`), every entity operation is routable through a
`ShardRouter`, and the existing single-node deployment works identically with
the default shard_id=0 / region_id=0 configuration. This is the "build the
atoms right" phase -- no new runtime behavior, but every data structure is
distribution-ready.
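As a rough illustration of the identity types described above, a minimal sketch might look like the following. The public tuple fields and the `Default` derives are assumptions, and the `Serialize`/`Deserialize` derives from the acceptance criteria are omitted so the example needs only the standard library.

```rust
use std::fmt;

/// Shard identifier; 0 is the single-node default.
#[derive(Copy, Clone, Debug, Default, PartialEq, Eq, Hash, PartialOrd, Ord)]
pub struct ShardId(pub u16);

/// Region identifier; 0 is the single-node default.
#[derive(Copy, Clone, Debug, Default, PartialEq, Eq, Hash, PartialOrd, Ord)]
pub struct RegionId(pub u16);

/// Globally unique WAL segment identity.
/// Field order matters: the derived `Ord` compares fields top to bottom,
/// which gives the (region_id, shard_id, seqno) total order.
#[derive(Copy, Clone, Debug, PartialEq, Eq, Hash, PartialOrd, Ord)]
pub struct WalSegmentId {
    pub region_id: RegionId,
    pub shard_id: ShardId,
    pub seqno: u64,
}

impl fmt::Display for WalSegmentId {
    /// Renders the id in the `r{region}:s{shard}:{seqno}` form.
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "r{}:s{}:{}", self.region_id.0, self.shard_id.0, self.seqno)
    }
}
```

Deriving `Ord` instead of writing it by hand keeps the ordering tied to the field declaration order, so reordering the fields would silently change the sort order; a unit test on the ordering is a cheap guard.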
Deliverables:
- `ShardId(u16)`, `RegionId(u16)`, `WalSegmentId { region_id, shard_id, seqno }` identity types
- WAL batch header v2: adds `shard_id` and `region_id` fields in bytes that the v1 header left as zero padding (backward-compatible; v1 batches decode as shard_id=0, region_id=0)
- `ShardRouter`: maps `EntityId -> ShardId` via configurable range boundaries
- `NodeConfig` extending `Config` with cluster role, shard assignment, region assignment
- `ReplicationState` tracking per-shard high-water-mark seqno for follower bookkeeping
- All existing tests pass unchanged (shard_id=0 is the default; single-node is shard 0)
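One possible shape for the `ReplicationState` deliverable is sketched below. The method names `high_water_mark` and `advance`, the `RwLock` choice, and the rule that the mark only moves forward are all assumptions made for illustration; the plan itself only requires a per-shard seqno map with atomic reads and writes.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

#[derive(Copy, Clone, Debug, PartialEq, Eq, Hash)]
pub struct ShardId(pub u16);

/// Per-shard high-water-mark seqno, shared between threads.
#[derive(Debug, Default)]
pub struct ReplicationState {
    hwm: RwLock<HashMap<ShardId, u64>>,
}

impl ReplicationState {
    pub fn new() -> Self {
        Self::default()
    }

    /// Highest seqno recorded for `shard`, or `None` if nothing was recorded.
    pub fn high_water_mark(&self, shard: ShardId) -> Option<u64> {
        // A poisoned lock is treated as fatal in this sketch.
        self.hwm.read().unwrap().get(&shard).copied()
    }

    /// Records `seqno` for `shard` if it is higher than the current mark
    /// (assumed rule). Returns `true` when the mark moved.
    pub fn advance(&self, shard: ShardId, seqno: u64) -> bool {
        let mut map = self.hwm.write().unwrap();
        match map.get(&shard).copied() {
            Some(current) if current >= seqno => false,
            _ => {
                map.insert(shard, seqno);
                true
            }
        }
    }
}
```

Holding the write lock across the compare and the insert makes `advance` a single atomic step, so two followers racing on the same shard cannot move the mark backwards.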
## Dependencies
- **Requires:** M7 complete (WAL format v1, `BatchHeader`, `EventRecord`, `SegmentWriter`, `CheckpointManager`, `Config`, `StorageMode`)
- **Files modified:**
  - `tidal/src/wal/format/batch.rs` -- extend `BatchHeader` with shard/region fields
  - `tidal/src/wal/segment.rs` -- segment filename includes shard_id prefix for multi-shard directories
  - `tidal/src/db/config.rs` -- add `NodeConfig` with cluster fields
  - `tidal/src/wal/checkpoint.rs` -- checkpoint includes shard_id
- **Files created:**
  - `tidal/src/replication/mod.rs` -- module root
  - `tidal/src/replication/shard.rs` -- `ShardId`, `RegionId`, `ShardRouter`
  - `tidal/src/replication/segment_id.rs` -- `WalSegmentId`
  - `tidal/src/replication/state.rs` -- `ReplicationState`
## Research References
- `docs/research/tidaldb_wal.md` -- WAL segment format, batch header layout
- `thoughts.md` -- Part V.12 (subject-prefix key encoding for sharding)
## Acceptance Criteria (Phase Level)
- [ ] `ShardId(u16)` and `RegionId(u16)` are `Copy + Clone + Debug + Eq + Hash + Ord + Serialize + Deserialize`
- [ ] `WalSegmentId { region_id: RegionId, shard_id: ShardId, seqno: u64 }` has total ordering by `(region_id, shard_id, seqno)` and a human-readable `Display` impl producing `"r0:s0:42"`
- [ ] `BatchHeader` v2 adds `shard_id: u16` and `region_id: u16` at bytes 58-61 (within existing 64-byte header); `FORMAT_VERSION` bumped to 2; v1 batches decode as shard_id=0, region_id=0
- [ ] `ShardRouter::route(entity_id: EntityId) -> ShardId` returns the correct shard under both hash-based and range-based routing; the default single-shard config always returns `ShardId(0)`
- [ ] `ShardRouter` is constructable from a `Vec<(ShardId, EntityIdRange)>` with validation that ranges are non-overlapping and cover the full u64 space
- [ ] `NodeConfig` extends `Config` with `role: NodeRole`, `shard_id: ShardId`, `region_id: RegionId`, `peer_shards: Vec<ShardId>`; defaults produce a single-node config
- [ ] `ReplicationState` tracks `HashMap<ShardId, u64>` (high-water-mark seqno per shard) with atomic reads/writes
- [ ] All existing M0-M7 tests pass without modification (single-node = shard 0, region 0)
- [ ] Segment filename format for multi-shard: `wal-s{shard_id:05}-{first_seq:020}.seg`; single-shard (shard_id=0) retains old format `wal-{first_seq:020}.seg` for backward compatibility
- [ ] Property test: 10,000 random EntityIds always route to exactly one shard; routing is a pure function of entity_id and shard_ranges
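The segment naming criterion above can be sketched as a small helper. The function name `segment_file_name` is hypothetical, and taking a bare `u16` rather than `ShardId` is only to keep the example self-contained; the two format strings are taken directly from the criterion.

```rust
/// Builds the WAL segment filename for a shard (hypothetical helper).
/// Shard 0 keeps the pre-M8 name so existing single-node directories
/// remain readable without renaming files.
pub fn segment_file_name(shard_id: u16, first_seq: u64) -> String {
    if shard_id == 0 {
        format!("wal-{first_seq:020}.seg")
    } else {
        format!("wal-s{shard_id:05}-{first_seq:020}.seg")
    }
}
```

Zero-padding both numbers to a fixed width means a lexicographic directory listing sorts segments by shard and then by first sequence number, since 5 digits cover every `u16` and 20 digits cover every `u64`.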
## Task Execution Order
```
Task 01: Identity Types ─────────┐
                                 ├──> Task 03: BatchHeader v2
Task 02: ShardRouter ────────────┤
                                 ├──> Task 04: Segment Naming
                                 └──> Task 05: NodeConfig
                                          v
                                      Task 06: ReplicationState
```
Tasks 01 and 02 are fully parallelizable. Tasks 03 and 04 depend on Task 01. Task 05 depends on both 01 and 02. Task 06 depends on 05.
## Module Location
| File | Status | Contains |
|------|--------|----------|
| `tidal/src/replication/mod.rs` | NEW | Module root, re-exports |
| `tidal/src/replication/shard.rs` | NEW | `ShardId`, `RegionId`, `ShardRouter`, `EntityIdRange` |
| `tidal/src/replication/segment_id.rs` | NEW | `WalSegmentId`, ordering, Display |
| `tidal/src/replication/state.rs` | NEW | `ReplicationState`, high-water-mark tracking |
| `tidal/src/wal/format/batch.rs` | MODIFIED | `BatchHeader` v2 with shard/region fields |
| `tidal/src/wal/segment.rs` | MODIFIED | Shard-aware segment filename |
| `tidal/src/wal/checkpoint.rs` | MODIFIED | Checkpoint includes shard_id |
| `tidal/src/db/config.rs` | MODIFIED | `NodeConfig`, `NodeRole` enum |
| `tidal/src/lib.rs` | MODIFIED | Add `pub mod replication;` |
## Notes
### Backward compatibility is non-negotiable
WAL v1 segments must be readable by v2 code. The 4 bytes at offsets 58-61 in the v1 header are currently zero-padding; v2 reinterprets them as shard_id and region_id. This is safe because v1 always wrote zeros there.
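The reinterpretation of the padding bytes can be illustrated with a pair of helpers. This is a sketch under stated assumptions: the function names are hypothetical, the fields are taken to be little-endian, and `shard_id` is assumed to occupy bytes 58-59 with `region_id` at 60-61. The plan fixes only the overall 58-61 range.

```rust
/// Header size stated in the plan.
pub const HEADER_LEN: usize = 64;
// Assumed split of bytes 58-61: shard_id first, then region_id.
const SHARD_ID_OFFSET: usize = 58;
const REGION_ID_OFFSET: usize = 60;

/// Reads (shard_id, region_id) from a raw header (hypothetical helper).
/// A v1 header holds zeros at these offsets, so it decodes as (0, 0)
/// without any version check.
pub fn read_shard_region(header: &[u8; HEADER_LEN]) -> (u16, u16) {
    let shard_id = u16::from_le_bytes([
        header[SHARD_ID_OFFSET],
        header[SHARD_ID_OFFSET + 1],
    ]);
    let region_id = u16::from_le_bytes([
        header[REGION_ID_OFFSET],
        header[REGION_ID_OFFSET + 1],
    ]);
    (shard_id, region_id)
}

/// Writes the v2 tags into the former padding bytes (hypothetical helper).
pub fn write_shard_region(header: &mut [u8; HEADER_LEN], shard_id: u16, region_id: u16) {
    header[SHARD_ID_OFFSET..SHARD_ID_OFFSET + 2].copy_from_slice(&shard_id.to_le_bytes());
    header[REGION_ID_OFFSET..REGION_ID_OFFSET + 2].copy_from_slice(&region_id.to_le_bytes());
}
```

Because the decode path is identical for both versions, the compatibility guarantee rests entirely on v1 having written zeros at those offsets; a test that decodes a real v1 segment fixture would confirm that.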
### Hash-based vs range-based routing
`ShardRouter` supports both modes: `hash(entity_id) % num_shards` for uniform distribution, and explicit range boundaries for production deployments. Callers only see `ShardRouter::route`, which abstracts the choice.
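A minimal sketch of a router covering both modes follows. It assumes `EntityId` is a plain `u64`, models ranges as inclusive, and uses a SplitMix64-style mixer as a stand-in hash, since the plan does not name one. `RouterError`, `single_shard`, and `mix64` are hypothetical names.

```rust
use std::num::NonZeroU16;

#[derive(Copy, Clone, Debug, PartialEq, Eq, Hash)]
pub struct ShardId(pub u16);

/// Inclusive range of entity ids (EntityId assumed to be u64).
#[derive(Copy, Clone, Debug, PartialEq, Eq)]
pub struct EntityIdRange {
    pub start: u64,
    pub end: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum RouterError {
    Inverted, // a range has end < start
    Gap,      // ranges do not cover the full u64 space
    Overlap,  // two ranges share at least one id
}

#[derive(Debug)]
pub enum ShardRouter {
    Hash { num_shards: NonZeroU16 },
    Ranges(Vec<(ShardId, EntityIdRange)>), // sorted, contiguous, full cover
}

/// Stand-in hash (SplitMix64 finalizer); the real hash is not specified.
fn mix64(mut x: u64) -> u64 {
    x ^= x >> 30;
    x = x.wrapping_mul(0xbf58_476d_1ce4_e5b9);
    x ^= x >> 27;
    x = x.wrapping_mul(0x94d0_49bb_1331_11eb);
    x ^ (x >> 31)
}

impl ShardRouter {
    /// Default single-node router: everything maps to shard 0.
    pub fn single_shard() -> Self {
        ShardRouter::Hash { num_shards: NonZeroU16::new(1).expect("1 is non-zero") }
    }

    /// Validates that the ranges tile 0..=u64::MAX exactly once.
    pub fn from_ranges(mut ranges: Vec<(ShardId, EntityIdRange)>) -> Result<Self, RouterError> {
        ranges.sort_by_key(|(_, r)| r.start);
        // Next id that must be covered; None once u64::MAX is covered.
        let mut next_start = Some(0u64);
        for (_, r) in &ranges {
            if r.end < r.start {
                return Err(RouterError::Inverted);
            }
            match next_start {
                None => return Err(RouterError::Overlap),
                Some(expected) if r.start > expected => return Err(RouterError::Gap),
                Some(expected) if r.start < expected => return Err(RouterError::Overlap),
                Some(_) => next_start = r.end.checked_add(1),
            }
        }
        if next_start.is_some() {
            return Err(RouterError::Gap); // also rejects an empty list
        }
        Ok(ShardRouter::Ranges(ranges))
    }

    /// Pure function of `entity_id` and the router configuration.
    pub fn route(&self, entity_id: u64) -> ShardId {
        match self {
            ShardRouter::Hash { num_shards } => {
                ShardId((mix64(entity_id) % u64::from(num_shards.get())) as u16)
            }
            ShardRouter::Ranges(ranges) => {
                // First range whose end is >= entity_id; exists due to full cover.
                let idx = ranges.partition_point(|(_, r)| r.end < entity_id);
                ranges[idx].0
            }
        }
    }
}
```

Validating once in `from_ranges` lets `route` index without a fallible lookup, and `NonZeroU16` rules out a modulo by zero in hash mode at the type level.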
### No network code in this phase
Everything is in-process. The `replication` module defines data structures and routing logic only. The `Transport` trait is introduced in Phase 8.2.
## Done When
A developer can construct a `NodeConfig` for each node in a topology of 3 regions with 5 shards per region, create a `ShardRouter` from range boundaries, route EntityIds to shards, and construct a WAL `BatchHeader` v2 with shard/region tags, while all existing single-node tests pass unchanged.