tidaldb/docs/planning/milestone-1/phase-2/task-01-wal-format-and-segment-files.md
jordan 29400d48db feat: implement Milestone 1 phases 1-3 — schema, WAL, and storage layer
Implements the foundation of tidalDB's data pipeline:

**Phase 1 – Schema primitives**
- EntityId newtype (u64, big-endian ordering)
- SignalTypeDefinition with pre-computed decay λ, deduped/sorted windows
- SchemaBuilder with full constraint validation (duplicates, identifiers,
  half-life, windows, velocity)
- LumenError wrapping all subsystems with required From impls

**Phase 2 – Write-Ahead Log**
- Length-prefixed, BLAKE3-protected entry format
- Group-commit writer (batch up to 100 events / 10 ms)
- Double-buffered content-hash deduplication
- Checkpoint, truncation, and crash-recovery with full replay
- Integration, property, and UAT tests (incl. 5,500-event deterministic UAT)
- Proptest coverage scaled to 10 000 events/run (was ≤500) to meet
  acceptance criterion; cases reduced 100→10 to keep runtime comparable

**Phase 3 – Storage engine**
- StorageEngine trait (get/put/delete/scan/batch/flush)
- Key encoding: [EntityId][0x00][Tag][suffix] with ordering/prefix helpers
- InMemoryBackend (BTreeMap + RwLock)
- FjallStorage with three isolated keyspaces and atomic batch helper
- Property tests for key ordering and round-trip correctness

Also adds planning docs for phases 4-5, research docs, architecture
overview, and roadmap updates.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-20 16:43:24 -07:00


# Task 01: WAL Wire Format and Segment Files
## Context
**Milestone:** 1 -- Signal Engine
**Phase:** m1p2 -- Write-Ahead Log
**Status:** COMPLETE
**Depends On:** None
**Blocks:** Task 02 (Group Commit Writer), Task 03 (Crash Recovery and Replay)
**Complexity:** M
## Objective
Define the on-disk binary format for WAL batches and event records, implement the segment file writer that manages 16 MB rotating files, and define the `WalError` type. This is the foundation everything else builds on — the format dictates how writers produce batches, how readers parse them, and how crash recovery validates them.
The key design decision (already resolved in `docs/research/tidaldb_wal.md`) is batch-oriented framing: the WAL frames entire batches rather than individual events. Each batch starts with a 64-byte, cache-line-aligned header that carries a BLAKE3 checksum, followed by tightly packed 21-byte event records. This matches the group-commit write path exactly and amortizes both checksum and fsync cost across every event in a batch (up to 100 per group commit).
## Requirements
- `BatchHeader` is exactly 64 bytes (`#[repr(C)]`, compile-time assertion)
- Magic bytes `0x54494C44` at offset 0 for human-readable crash dumps (as written, these bytes read "TILD" in ASCII; "TIDL" would be `0x5449444C`)
- BLAKE3 hash at bytes [32..64] covers `header[0..32] || all_event_bytes` — NOT the hash field itself
- `EventRecord` is exactly 21 bytes, little-endian throughout: entity_id (u64), signal_type (u8), weight (f32), timestamp_nanos (u64)
- `SegmentWriter` opens or creates a segment file and appends batches
- Segment files named `wal-{first_seq:020}.seg` — zero-padded 20-digit, lexicographic = numeric order
- `list_segments(dir)` returns `Vec<(first_seq, PathBuf)>` sorted by first sequence number
- `WalError` covers: `Io(std::io::Error)`, `Corruption(String)`, `Closed`, `SendFailed`, `ShutdownFailed`
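The naming and listing requirements above can be sketched with the standard library alone. This is a minimal illustration, not the project's implementation: `parse_segment_name` is a hypothetical helper, and the sketch returns `std::io::Result` instead of `WalError` to stay self-contained.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Build `wal-{first_seq:020}.seg` inside `dir`.
fn segment_path(dir: &Path, first_seq: u64) -> PathBuf {
    dir.join(format!("wal-{:020}.seg", first_seq))
}

/// Hypothetical helper: recover the first sequence number from a file name.
/// Rejects anything that is not exactly `wal-` + 20 ASCII digits + `.seg`.
fn parse_segment_name(name: &str) -> Option<u64> {
    let digits = name.strip_prefix("wal-")?.strip_suffix(".seg")?;
    if digits.len() != 20 || !digits.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    digits.parse().ok()
}

/// List segment files in `dir`, sorted ascending by first sequence number.
/// Files that do not match the naming scheme are skipped (an assumption).
fn list_segments(dir: &Path) -> io::Result<Vec<(u64, PathBuf)>> {
    let mut found = Vec::new();
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        let seq = path
            .file_name()
            .and_then(|n| n.to_str())
            .and_then(parse_segment_name);
        if let Some(seq) = seq {
            found.push((seq, path));
        }
    }
    found.sort_by_key(|(seq, _)| *seq);
    Ok(found)
}
```

Twenty digits is the width of `u64::MAX` in decimal, so every sequence number fits without widening the name, and a plain directory sort already yields numeric order.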
## Technical Design
### Wire Format
```
BATCH FRAME:
+--------+------+---------------------+-----------------------+--------------+
| Offset | Size | Field               | Encoding              | Notes        |
+--------+------+---------------------+-----------------------+--------------+
|      0 |    4 | Magic               | [0x54,0x49,0x4C,0x44] | ASCII "TILD" |
|      4 |    1 | Version             | u8                    | Currently 1  |
|      5 |    1 | Flags               | u8                    | Reserved (0) |
|      6 |    2 | Event count         | u16 LE                | 1..=65535    |
|      8 |    8 | First sequence no.  | u64 LE                | Monotonic    |
|     16 |    8 | Batch timestamp     | u64 LE                | Nanos epoch  |
|     24 |    4 | Payload byte length | u32 LE                | count * 21   |
|     28 |    4 | Reserved            | [0u8; 4]              | Future use   |
|     32 |   32 | BLAKE3 checksum     | [u8; 32]              | See below    |
+--------+------+---------------------+-----------------------+--------------+
|     64 | N*21 | Event records       | packed structs        |              |
+--------+------+---------------------+-----------------------+--------------+

BLAKE3 INPUT: blake3(header[0..32] || event_bytes[..])
(hash covers magic through reserved; the hash field [32..64] is excluded)

EVENT RECORD (21 bytes each, tightly packed):
| Offset | Size | Field           | Encoding |
|--------|------|-----------------|----------|
|      0 |    8 | Entity ID       | u64 LE   |
|      8 |    1 | Signal type     | u8       |
|      9 |    4 | Weight          | f32 LE   |
|     13 |    8 | Timestamp nanos | u64 LE   |
```
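To make the offsets in the tables concrete, here is a minimal sketch of packing one event record and the checksummed header prefix. The free functions `encode_event` and `encode_header_prefix` are hypothetical names used only for illustration; the real code exposes `encode` methods on `BatchHeader` and `EventRecord`. The BLAKE3 step is left out so the sketch needs no external crate.

```rust
const HEADER_SIZE: usize = 64;
const EVENT_SIZE: usize = 21;
const MAGIC: [u8; 4] = [0x54, 0x49, 0x4C, 0x44];
const FORMAT_VERSION: u8 = 1;

// Compile-time check: 32-byte prefix plus 32-byte hash field fill the header.
const _: () = assert!(32 + 32 == HEADER_SIZE);

/// Pack one event into its 21-byte little-endian layout.
fn encode_event(entity_id: u64, signal_type: u8, weight: f32, timestamp_nanos: u64) -> [u8; EVENT_SIZE] {
    let mut out = [0u8; EVENT_SIZE];
    out[0..8].copy_from_slice(&entity_id.to_le_bytes());
    out[8] = signal_type;
    // Bit-exact f32 encoding via to_bits, as the implementation notes require.
    out[9..13].copy_from_slice(&weight.to_bits().to_le_bytes());
    out[13..21].copy_from_slice(&timestamp_nanos.to_le_bytes());
    out
}

/// Build header bytes [0..32], the part of the header the checksum covers.
fn encode_header_prefix(event_count: u16, first_seq: u64, batch_timestamp_nanos: u64) -> [u8; 32] {
    let mut out = [0u8; 32];
    out[0..4].copy_from_slice(&MAGIC);
    out[4] = FORMAT_VERSION;
    out[5] = 0; // flags: reserved
    out[6..8].copy_from_slice(&event_count.to_le_bytes());
    out[8..16].copy_from_slice(&first_seq.to_le_bytes());
    out[16..24].copy_from_slice(&batch_timestamp_nanos.to_le_bytes());
    // payload_len is derived, never supplied independently: count * 21.
    let payload_len = u32::from(event_count) * EVENT_SIZE as u32;
    out[24..28].copy_from_slice(&payload_len.to_le_bytes());
    // bytes 28..32 stay zero (reserved)
    out
}
```

A full batch frame would then be this prefix, the 32-byte hash of `prefix || events`, and the concatenated event records, in that order.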
### Module Structure
```
tidal/src/wal/
    format.rs   -- BatchHeader, EventRecord: encode/decode
    segment.rs  -- SegmentWriter, list_segments
    error.rs    -- WalError
```
### Public API Surface
```rust
// === format.rs ===
pub const MAGIC: [u8; 4] = [0x54, 0x49, 0x4C, 0x44]; // ASCII "TILD"
pub const HEADER_SIZE: usize = 64;
pub const EVENT_SIZE: usize = 21;
pub const FORMAT_VERSION: u8 = 1;

#[derive(Debug, Clone, PartialEq)]
pub struct BatchHeader {
    pub event_count: u16,
    pub first_seq: u64,
    pub batch_timestamp_nanos: u64,
    pub payload_len: u32,
    pub checksum: [u8; 32],
}

impl BatchHeader {
    pub fn encode(&self) -> [u8; HEADER_SIZE];
    pub fn decode(bytes: &[u8; HEADER_SIZE]) -> Result<Self, WalError>;
    pub fn compute_checksum(header_prefix: &[u8; 32], events: &[u8]) -> [u8; 32];
}

#[derive(Debug, Clone, PartialEq)]
pub struct EventRecord {
    pub entity_id: u64,
    pub signal_type: u8,
    pub weight: f32,
    pub timestamp_nanos: u64,
}

impl EventRecord {
    pub fn encode(&self) -> [u8; EVENT_SIZE];
    pub fn decode(bytes: &[u8; EVENT_SIZE]) -> Self;
}

// === segment.rs ===
pub struct SegmentWriter { /* file handle, current size, segment_size limit */ }

impl SegmentWriter {
    pub fn open(dir: &Path, first_seq: u64, segment_size: u64) -> Result<Self, WalError>;
    /// Append raw batch bytes. Returns true if segment is now full.
    pub fn append_batch(&mut self, bytes: &[u8]) -> Result<bool, WalError>;
    pub fn flush(&mut self) -> Result<(), WalError>;
    pub fn segment_size(&self) -> u64;
    pub fn current_size(&self) -> u64;
}

pub fn segment_path(dir: &Path, first_seq: u64) -> PathBuf;
pub fn list_segments(dir: &Path) -> Result<Vec<(u64, PathBuf)>, WalError>;
```
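As a rough illustration of how the `SegmentWriter` surface might behave, the sketch below implements it over a plain append-mode file. Several choices here are assumptions rather than documented behavior: errors are `std::io::Error` instead of `WalError`, "full" is treated as `current_size >= segment_size` (the acceptance criteria say "exceeded", which could equally mean a strict `>`), and `flush` is mapped to `File::sync_data`.

```rust
use std::fs::{File, OpenOptions};
use std::io::{self, Write};
use std::path::Path;

struct SegmentWriter {
    file: File,
    current_size: u64,
    segment_size: u64,
}

impl SegmentWriter {
    /// Open or create `wal-{first_seq:020}.seg` in `dir`, appending to any existing bytes.
    fn open(dir: &Path, first_seq: u64, segment_size: u64) -> io::Result<Self> {
        let path = dir.join(format!("wal-{:020}.seg", first_seq));
        let file = OpenOptions::new().create(true).append(true).open(path)?;
        // Resume from the existing length when the segment already exists.
        let current_size = file.metadata()?.len();
        Ok(Self { file, current_size, segment_size })
    }

    /// Append raw batch bytes; returns true once the size limit is reached.
    fn append_batch(&mut self, bytes: &[u8]) -> io::Result<bool> {
        self.file.write_all(bytes)?;
        self.current_size += bytes.len() as u64;
        Ok(self.current_size >= self.segment_size)
    }

    /// Force written data to stable storage (assumed meaning of `flush`).
    fn flush(&mut self) -> io::Result<()> {
        self.file.sync_data()
    }

    fn current_size(&self) -> u64 {
        self.current_size
    }
}
```

Because the check runs after the write, a batch is never split across two segments; the caller rotates to a new segment only when `append_batch` returns `true`, so a segment may end slightly past its nominal size.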
## Test Strategy
### Unit Tests
```rust
#[test]
fn batch_header_roundtrip() {
    let header = BatchHeader {
        event_count: 42,
        first_seq: 1000,
        batch_timestamp_nanos: 1_700_000_000_000_000_000,
        payload_len: 42 * 21,
        checksum: [0xAB; 32],
    };
    let encoded = header.encode();
    let decoded = BatchHeader::decode(&encoded).unwrap();
    assert_eq!(header, decoded);
}

#[test]
fn event_record_roundtrip() {
    let event = EventRecord {
        entity_id: 999,
        signal_type: 3,
        weight: 2.5,
        timestamp_nanos: 42_000_000_000,
    };
    let encoded = event.encode();
    let decoded = EventRecord::decode(&encoded);
    assert_eq!(decoded.entity_id, 999);
    assert_eq!(decoded.weight.to_bits(), 2.5_f32.to_bits());
}

#[test]
fn magic_bytes_in_header() {
    let header = BatchHeader {
        event_count: 1,
        first_seq: 1,
        batch_timestamp_nanos: 0,
        payload_len: 21,
        checksum: [0u8; 32],
    };
    let encoded = header.encode();
    assert_eq!(&encoded[0..4], &[0x54, 0x49, 0x4C, 0x44]);
}

#[test]
fn segment_naming_is_ordered() {
    let p1 = segment_path(Path::new("/tmp"), 1);
    let p2 = segment_path(Path::new("/tmp"), 1000);
    // Lexicographic order matches numeric order
    assert!(p1.file_name() < p2.file_name());
}

#[test]
fn list_segments_returns_sorted() {
    let dir = tempfile::tempdir().unwrap();
    // Create segment files out of order
    std::fs::write(segment_path(dir.path(), 200), b"").unwrap();
    std::fs::write(segment_path(dir.path(), 1), b"").unwrap();
    std::fs::write(segment_path(dir.path(), 100), b"").unwrap();
    let segments = list_segments(dir.path()).unwrap();
    assert_eq!(segments[0].0, 1);
    assert_eq!(segments[1].0, 100);
    assert_eq!(segments[2].0, 200);
}

#[test]
fn header_decode_rejects_wrong_magic() {
    let mut bytes = [0u8; 64];
    bytes[0] = 0xFF; // wrong magic
    assert!(BatchHeader::decode(&bytes).is_err());
}

#[test]
fn header_decode_rejects_wrong_version() {
    let mut bytes = [0u8; 64];
    bytes[0..4].copy_from_slice(&[0x54, 0x49, 0x4C, 0x44]); // correct magic
    bytes[4] = 99; // wrong version
    assert!(BatchHeader::decode(&bytes).is_err());
}
```
## Acceptance Criteria
- [x] `BatchHeader` encodes to exactly 64 bytes (compile-time assertion)
- [x] `EventRecord` encodes to exactly 21 bytes (compile-time assertion)
- [x] Magic bytes `0x54494C44` appear at bytes [0..4] of every encoded header
- [x] BLAKE3 checksum covers `header[0..32] || event_bytes` (excludes the hash field itself)
- [x] `BatchHeader::decode()` returns `WalError::Corruption` on wrong magic or unknown version
- [x] `EventRecord::encode`/`decode` roundtrip is lossless for all finite f32 weights
- [x] Segment files are named `wal-{seq:020}.seg`; `list_segments()` returns them sorted ascending
- [x] `SegmentWriter::append_batch()` writes raw bytes and returns `true` when the segment has exceeded its size limit
- [x] All encoding is little-endian, so there is no byte-swap cost on little-endian targets (x86-64 and typical ARM configurations)
- [x] `cargo clippy -- -D warnings` passes
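The decode criterion can be illustrated with a small sketch that checks magic and version before reading any other field. The `WalError` variants follow the list in Requirements; `DecodedPrefix`, `decode_prefix`, the `le_u64` helper, and the error message wording are hypothetical and exist only for this example.

```rust
#[allow(dead_code)]
#[derive(Debug)]
enum WalError {
    Io(std::io::Error),
    Corruption(String),
    Closed,
    SendFailed,
    ShutdownFailed,
}

const MAGIC: [u8; 4] = [0x54, 0x49, 0x4C, 0x44];
const FORMAT_VERSION: u8 = 1;

/// Fields carried in header bytes [0..32]; the checksum is handled separately.
struct DecodedPrefix {
    event_count: u16,
    first_seq: u64,
    batch_timestamp_nanos: u64,
    payload_len: u32,
}

/// Read a little-endian u64 from an 8-byte slice.
fn le_u64(b: &[u8]) -> u64 {
    let mut a = [0u8; 8];
    a.copy_from_slice(b);
    u64::from_le_bytes(a)
}

fn decode_prefix(bytes: &[u8; 64]) -> Result<DecodedPrefix, WalError> {
    // Reject anything that does not start with the magic bytes.
    if bytes[0..4] != MAGIC {
        return Err(WalError::Corruption(format!("bad magic {:02x?}", &bytes[0..4])));
    }
    // Unknown versions are treated as corruption rather than guessed at.
    if bytes[4] != FORMAT_VERSION {
        return Err(WalError::Corruption(format!("unknown version {}", bytes[4])));
    }
    Ok(DecodedPrefix {
        event_count: u16::from_le_bytes([bytes[6], bytes[7]]),
        first_seq: le_u64(&bytes[8..16]),
        batch_timestamp_nanos: le_u64(&bytes[16..24]),
        payload_len: u32::from_le_bytes([bytes[24], bytes[25], bytes[26], bytes[27]]),
    })
}
```

Checking magic and version first means a reader never interprets the remaining fields of a header whose layout it does not understand.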
## Research References
- [docs/research/tidaldb_wal.md](../../../research/tidaldb_wal.md) — Section 1 (Approach 3: batch-oriented framing with wire format table), Section 5 (segment rotation at 16 MB, naming convention)
## Implementation Notes
- `payload_len` is always `event_count * 21`. The redundancy allows Phase 1 crash validation (check bounds before computing BLAKE3) without reading the event data.
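- The hash field at `header[32..64]` is written AFTER the hash is computed. The hash input is exactly `header[0..32] || events`; the 32 bytes of the hash field are never fed to the hasher, not even as zeros, which matches the `compute_checksum(header_prefix, events)` signature.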
- `f32::to_bits()` / `f32::from_bits()` are used for weight encoding: safe, const, and exact. Never cast f32 to u32 via `as`, which performs a numeric conversion (2.5 becomes 2) rather than reinterpreting the bits.
- Segment files do not need pre-allocation in m1p2. Defer `fallocate` until disk write performance is a measured bottleneck.
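
Two of the notes above lend themselves to a short sketch: the cheap structural check that `payload_len` enables before any hashing, and the bit-exact weight encoding. The `precheck` function, its `PrecheckError` type, and the `bytes_after_header` parameter are hypothetical; the exact checks performed during crash validation are an assumption here.

```rust
const EVENT_SIZE: u32 = 21;

#[derive(Debug, PartialEq)]
enum PrecheckError {
    EmptyBatch,
    LengthMismatch,
    Truncated,
}

/// Structural checks that need only header fields and the remaining file length.
/// They run before the (more expensive) BLAKE3 pass over the event bytes.
fn precheck(event_count: u16, payload_len: u32, bytes_after_header: u64) -> Result<(), PrecheckError> {
    // The wire format allows 1..=65535 events per batch.
    if event_count == 0 {
        return Err(PrecheckError::EmptyBatch);
    }
    // payload_len must agree with the redundant event count.
    if payload_len != u32::from(event_count) * EVENT_SIZE {
        return Err(PrecheckError::LengthMismatch);
    }
    // A torn write leaves fewer bytes on disk than the header promises.
    if u64::from(payload_len) > bytes_after_header {
        return Err(PrecheckError::Truncated);
    }
    Ok(())
}

/// Encode a weight by reinterpreting its bits, never by numeric cast.
fn encode_weight(weight: f32) -> [u8; 4] {
    weight.to_bits().to_le_bytes()
}

/// Inverse of `encode_weight`; restores the exact bit pattern.
fn decode_weight(bytes: [u8; 4]) -> f32 {
    f32::from_bits(u32::from_le_bytes(bytes))
}
```

For comparison, `2.5_f32 as u32` evaluates to `2`, while `2.5_f32.to_bits()` is `0x40200000`, which is why the cast cannot be used for serialization.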