tidaldb/docs/planning/milestone-1/phase-2/task-01-wal-format-and-segment-files.md
jordan 29400d48db feat: implement Milestone 1 phases 1-3 — schema, WAL, and storage layer
2026-02-20 16:43:24 -07:00

Task 01: WAL Wire Format and Segment Files

Context

Milestone: 1 -- Signal Engine
Phase: m1p2 -- Write-Ahead Log
Status: COMPLETE
Depends On: None
Blocks: Task 02 (Group Commit Writer), Task 03 (Crash Recovery and Replay)
Complexity: M

Objective

Define the on-disk binary format for WAL batches and event records, implement the segment file writer that manages 16 MB rotating files, and define the WalError type. This is the foundation everything else builds on — the format dictates how writers produce batches, how readers parse them, and how crash recovery validates them.

The key design decision (already resolved in docs/research/tidaldb_wal.md) is batch-oriented framing: frame entire batches rather than individual events. Each batch is a 64-byte (cache-line-sized) header carrying a BLAKE3 checksum, followed by tightly packed 21-byte event records. This matches the group-commit write path exactly and amortizes both the checksum and the fsync cost across every event in the batch (up to 100 under the group-commit limit).
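To make the amortization concrete, the frame-size arithmetic can be sketched as below. This is only an illustration: `frame_size` and `frames_per_segment` are hypothetical helpers, and `SEGMENT_SIZE` assumes "16 MB" means 16 MiB.

```rust
// Sizes taken from the wire format in this task.
const HEADER_SIZE: usize = 64;
const EVENT_SIZE: usize = 21;

// Assumption: "16 MB" means 16 MiB; adjust if the limit is decimal megabytes.
const SEGMENT_SIZE: usize = 16 * 1024 * 1024;

/// Total on-disk size of one batch frame holding `event_count` events.
const fn frame_size(event_count: usize) -> usize {
    HEADER_SIZE + event_count * EVENT_SIZE
}

/// How many whole frames of `event_count` events fit below the segment limit.
const fn frames_per_segment(event_count: usize) -> usize {
    SEGMENT_SIZE / frame_size(event_count)
}
```

Under these assumptions a full 100-event batch is 2,164 bytes, so the header adds 0.64 bytes per event (about 3% of the frame), and roughly 7,752 such frames fit in one segment. A single-event batch is 85 bytes, of which the header is about 75%, which is why per-event framing was rejected.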

Requirements

  • BatchHeader is exactly 64 bytes (#[repr(C)], compile-time assertion)
  • Magic bytes [0x54, 0x49, 0x4C, 0x44] at offset 0 (these bytes read as "TILD" in ASCII) for human-readable crash dumps
  • BLAKE3 hash at bytes [32..64] covers header[0..32] || all_event_bytes — NOT the hash field itself
  • EventRecord is exactly 21 bytes, little-endian throughout: entity_id (u64), signal_type (u8), weight (f32), timestamp_nanos (u64)
  • SegmentWriter opens or creates a segment file and appends batches
  • Segment files named wal-{first_seq:020}.seg — zero-padded 20-digit, lexicographic = numeric order
  • list_segments(dir) returns Vec<(first_seq, PathBuf)> sorted by first sequence number
  • WalError covers: Io(std::io::Error), Corruption(String), Closed, SendFailed, ShutdownFailed
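The naming and listing requirements above could be met along these lines. It is a sketch, not the shipped implementation: `parse_segment_name` is a hypothetical helper, and the functions return `io::Result` instead of `Result<_, WalError>` so the block stays self-contained.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Builds `wal-{first_seq:020}.seg`. u64::MAX has 20 decimal digits, so
/// every sequence number fits and lexicographic order equals numeric order.
pub fn segment_path(dir: &Path, first_seq: u64) -> PathBuf {
    dir.join(format!("wal-{first_seq:020}.seg"))
}

/// Hypothetical helper: recovers `first_seq` from a file name, or `None`
/// for anything that is not a well-formed segment name.
fn parse_segment_name(name: &str) -> Option<u64> {
    let digits = name.strip_prefix("wal-")?.strip_suffix(".seg")?;
    if digits.len() != 20 || !digits.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    digits.parse().ok()
}

/// Lists segment files in `dir`, sorted ascending by first sequence number.
/// Files that do not match the naming scheme are skipped.
pub fn list_segments(dir: &Path) -> io::Result<Vec<(u64, PathBuf)>> {
    let mut segments = Vec::new();
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        let seq = path
            .file_name()
            .and_then(|n| n.to_str())
            .and_then(parse_segment_name);
        if let Some(seq) = seq {
            segments.push((seq, path));
        }
    }
    segments.sort_by_key(|(seq, _)| *seq);
    Ok(segments)
}
```

Sorting by the parsed number rather than by path keeps the result correct even if the directory holds stray files with similar names.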

Technical Design

Wire Format

BATCH FRAME:
+==============================================================================+
| Offset | Size | Field               | Encoding              | Notes          |
+--------+------+---------------------+-----------------------+----------------+
| 0      | 4    | Magic               | [0x54,0x49,0x4C,0x44] | ASCII "TILD"   |
| 4      | 1    | Version             | u8                    | Currently 1    |
| 5      | 1    | Flags               | u8                    | Reserved (0)   |
| 6      | 2    | Event count         | u16 LE                | 1..=65535      |
| 8      | 8    | First sequence no.  | u64 LE                | Monotonic      |
| 16     | 8    | Batch timestamp     | u64 LE                | Nanos epoch    |
| 24     | 4    | Payload byte length | u32 LE                | count * 21     |
| 28     | 4    | Reserved            | [0u8; 4]              | Future use     |
| 32     | 32   | BLAKE3 checksum     | [u8; 32]              | See below      |
+--------+------+---------------------+-----------------------+----------------+
| 64     | N*21 | Event records       | packed structs        |                |
+==============================================================================+

BLAKE3 INPUT: blake3(header[0..32] || event_bytes[..])
(hash covers magic through reserved; the hash field [32..64] is excluded)

EVENT RECORD (21 bytes each, tightly packed):
| Offset | Size | Field           | Encoding |
|--------|------|-----------------|----------|
| 0      | 8    | Entity ID       | u64 LE   |
| 8      | 1    | Signal type     | u8       |
| 9      | 4    | Weight          | f32 LE   |
| 13     | 8    | Timestamp nanos | u64 LE   |
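The event record layout maps directly onto fixed byte ranges. A minimal sketch of an encoder and decoder for it follows; the struct mirrors the fields in the table, while `take` is a hypothetical helper introduced only for this example.

```rust
pub const EVENT_SIZE: usize = 21;

#[derive(Debug, Clone, PartialEq)]
pub struct EventRecord {
    pub entity_id: u64,
    pub signal_type: u8,
    pub weight: f32,
    pub timestamp_nanos: u64,
}

/// Hypothetical helper: copies exactly N bytes out of a slice.
/// Panics if `src.len() != N`; every call below passes a fixed-width range.
fn take<const N: usize>(src: &[u8]) -> [u8; N] {
    let mut out = [0u8; N];
    out.copy_from_slice(src);
    out
}

impl EventRecord {
    pub fn encode(&self) -> [u8; EVENT_SIZE] {
        let mut buf = [0u8; EVENT_SIZE];
        buf[0..8].copy_from_slice(&self.entity_id.to_le_bytes());
        buf[8] = self.signal_type;
        // to_bits() reinterprets the float's bit pattern; no numeric conversion.
        buf[9..13].copy_from_slice(&self.weight.to_bits().to_le_bytes());
        buf[13..21].copy_from_slice(&self.timestamp_nanos.to_le_bytes());
        buf
    }

    pub fn decode(bytes: &[u8; EVENT_SIZE]) -> Self {
        EventRecord {
            entity_id: u64::from_le_bytes(take(&bytes[0..8])),
            signal_type: bytes[8],
            weight: f32::from_bits(u32::from_le_bytes(take(&bytes[9..13]))),
            timestamp_nanos: u64::from_le_bytes(take(&bytes[13..21])),
        }
    }
}
```

For example, a weight of 2.5 has the bit pattern 0x40200000, so it occupies offsets 9..13 as the bytes 00 00 20 40.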

Module Structure

tidal/src/wal/
  format.rs   -- BatchHeader, EventRecord: encode/decode
  segment.rs  -- SegmentWriter, list_segments
  error.rs    -- WalError
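As an illustration of what error.rs might contain, here is one possible shape for `WalError` using only the standard library. The five variants come from the requirements; the display messages and the `source` implementation are assumptions made for this sketch.

```rust
use std::fmt;
use std::io;

/// Sketch of the error type listed in the requirements.
#[derive(Debug)]
pub enum WalError {
    Io(io::Error),
    Corruption(String),
    Closed,
    SendFailed,
    ShutdownFailed,
}

impl fmt::Display for WalError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Message wording is illustrative, not taken from the codebase.
        match self {
            WalError::Io(e) => write!(f, "WAL I/O error: {e}"),
            WalError::Corruption(msg) => write!(f, "WAL corruption: {msg}"),
            WalError::Closed => write!(f, "WAL is closed"),
            WalError::SendFailed => write!(f, "failed to send to WAL writer"),
            WalError::ShutdownFailed => write!(f, "WAL shutdown failed"),
        }
    }
}

impl std::error::Error for WalError {
    fn source(&self) -> Option<&(dyn std::error::Error + 'static)> {
        match self {
            WalError::Io(e) => Some(e),
            _ => None,
        }
    }
}

/// Lets `?` convert std::io::Error into WalError inside WAL code.
impl From<io::Error> for WalError {
    fn from(e: io::Error) -> Self {
        WalError::Io(e)
    }
}
```

With the `From` impl in place, functions such as `SegmentWriter::open` can use `?` on file operations and still return `Result<_, WalError>`.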

Public API Surface

// === format.rs ===

pub const MAGIC: [u8; 4] = [0x54, 0x49, 0x4C, 0x44]; // ASCII "TILD"
pub const HEADER_SIZE: usize = 64;
pub const EVENT_SIZE: usize = 21;
pub const FORMAT_VERSION: u8 = 1;

#[derive(Debug, Clone, PartialEq)]
pub struct BatchHeader {
    pub event_count: u16,
    pub first_seq: u64,
    pub batch_timestamp_nanos: u64,
    pub payload_len: u32,
    pub checksum: [u8; 32],
}

impl BatchHeader {
    pub fn encode(&self) -> [u8; HEADER_SIZE];
    pub fn decode(bytes: &[u8; HEADER_SIZE]) -> Result<Self, WalError>;
    pub fn compute_checksum(header_prefix: &[u8; 32], events: &[u8]) -> [u8; 32];
}

#[derive(Debug, Clone, PartialEq)]
pub struct EventRecord {
    pub entity_id: u64,
    pub signal_type: u8,
    pub weight: f32,
    pub timestamp_nanos: u64,
}

impl EventRecord {
    pub fn encode(&self) -> [u8; EVENT_SIZE];
    pub fn decode(bytes: &[u8; EVENT_SIZE]) -> Self;
}

// === segment.rs ===

pub struct SegmentWriter { /* file handle, current size, segment_size limit */ }

impl SegmentWriter {
    pub fn open(dir: &Path, first_seq: u64, segment_size: u64) -> Result<Self, WalError>;
    /// Append raw batch bytes. Returns true if segment is now full.
    pub fn append_batch(&mut self, bytes: &[u8]) -> Result<bool, WalError>;
    pub fn flush(&mut self) -> Result<(), WalError>;
    pub fn segment_size(&self) -> u64;
    pub fn current_size(&self) -> u64;
}

pub fn segment_path(dir: &Path, first_seq: u64) -> PathBuf;
pub fn list_segments(dir: &Path) -> Result<Vec<(u64, PathBuf)>, WalError>;
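A minimal sketch of how `SegmentWriter` could sit behind this API is shown below. Several points are assumptions rather than facts from the task: the rotation signal fires at `current_size >= segment_size`, `flush` maps to `File::sync_data`, and errors are plain `io::Result` instead of `WalError` to keep the block self-contained.

```rust
use std::fs::{File, OpenOptions};
use std::io::{self, Write};
use std::path::Path;

pub struct SegmentWriter {
    file: File,
    current_size: u64,
    segment_size: u64,
}

impl SegmentWriter {
    /// Opens the segment for `first_seq`, creating it if it does not exist.
    /// An existing file is appended to, so its length seeds `current_size`.
    pub fn open(dir: &Path, first_seq: u64, segment_size: u64) -> io::Result<Self> {
        let path = dir.join(format!("wal-{first_seq:020}.seg"));
        let file = OpenOptions::new().create(true).append(true).open(path)?;
        let current_size = file.metadata()?.len();
        Ok(SegmentWriter { file, current_size, segment_size })
    }

    /// Appends raw batch bytes; returns true once the caller should rotate.
    /// Assumption: the threshold is "at or past the limit".
    pub fn append_batch(&mut self, bytes: &[u8]) -> io::Result<bool> {
        self.file.write_all(bytes)?;
        self.current_size += bytes.len() as u64;
        Ok(self.current_size >= self.segment_size)
    }

    /// Assumption: flush means "make the data durable", i.e. fsync the file data.
    pub fn flush(&mut self) -> io::Result<()> {
        self.file.sync_data()
    }

    pub fn segment_size(&self) -> u64 {
        self.segment_size
    }

    pub fn current_size(&self) -> u64 {
        self.current_size
    }
}
```

Because a batch is written whole before the size check, a segment can end slightly past its limit; the writer never splits a frame across two files.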

Test Strategy

Unit Tests

#[test]
fn batch_header_roundtrip() {
    let header = BatchHeader {
        event_count: 42,
        first_seq: 1000,
        batch_timestamp_nanos: 1_700_000_000_000_000_000,
        payload_len: 42 * 21,
        checksum: [0xAB; 32],
    };
    let encoded = header.encode();
    let decoded = BatchHeader::decode(&encoded).unwrap();
    assert_eq!(header, decoded);
}

#[test]
fn event_record_roundtrip() {
    let event = EventRecord { entity_id: 999, signal_type: 3, weight: 2.5, timestamp_nanos: 42_000_000_000 };
    let encoded = event.encode();
    let decoded = EventRecord::decode(&encoded);
    assert_eq!(decoded.entity_id, 999);
    assert_eq!(decoded.weight.to_bits(), 2.5_f32.to_bits());
}

#[test]
fn magic_bytes_in_header() {
    let header = BatchHeader { event_count: 1, first_seq: 1, batch_timestamp_nanos: 0, payload_len: 21, checksum: [0u8; 32] };
    let encoded = header.encode();
    assert_eq!(&encoded[0..4], &[0x54, 0x49, 0x4C, 0x44]);
}

#[test]
fn segment_naming_is_ordered() {
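    // 9 vs 10 is the case zero-padding exists for: unpadded, "wal-9.seg"
    // would sort after "wal-10.seg".
    let p1 = segment_path(Path::new("/tmp"), 9);
    let p2 = segment_path(Path::new("/tmp"), 10);
    // Lexicographic order matches numeric order
    assert!(p1.file_name() < p2.file_name());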
}

#[test]
fn list_segments_returns_sorted() {
    let dir = tempfile::tempdir().unwrap();
    // Create segment files out of order
    std::fs::write(segment_path(dir.path(), 200), b"").unwrap();
    std::fs::write(segment_path(dir.path(), 1), b"").unwrap();
    std::fs::write(segment_path(dir.path(), 100), b"").unwrap();
    let segments = list_segments(dir.path()).unwrap();
    assert_eq!(segments[0].0, 1);
    assert_eq!(segments[1].0, 100);
    assert_eq!(segments[2].0, 200);
}

#[test]
fn header_decode_rejects_wrong_magic() {
    let mut bytes = [0u8; 64];
    bytes[0] = 0xFF; // wrong magic
    assert!(BatchHeader::decode(&bytes).is_err());
}

#[test]
fn header_decode_rejects_wrong_version() {
    let mut bytes = [0u8; 64];
    bytes[0..4].copy_from_slice(&[0x54, 0x49, 0x4C, 0x44]); // correct magic
    bytes[4] = 99; // wrong version
    assert!(BatchHeader::decode(&bytes).is_err());
}

Acceptance Criteria

  • BatchHeader encodes to exactly 64 bytes (compile-time assertion)
  • EventRecord encodes to exactly 21 bytes (compile-time assertion)
  • Magic bytes 0x54494C44 appear at bytes [0..4] of every encoded header
  • BLAKE3 checksum covers header[0..32] || event_bytes (excludes the hash field itself)
  • BatchHeader::decode() returns WalError::Corruption on wrong magic or unknown version
  • EventRecord::encode/decode roundtrip is lossless for all finite f32 weights
  • Segment files are named wal-{seq:020}.seg; list_segments() returns them sorted ascending
  • SegmentWriter::append_batch() writes raw bytes and returns true when the segment has exceeded its size limit
  • All little-endian encoding — no byte-swap cost on x86/ARM
  • cargo clippy -- -D warnings passes
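The two compile-time size criteria could be enforced roughly as follows. This is one possible approach, not necessarily the one used in the codebase: the field-width tables and the `sum` helper are hypothetical names, with widths copied from the wire format tables.

```rust
pub const HEADER_SIZE: usize = 64;
pub const EVENT_SIZE: usize = 21;

// Field widths in wire order (hypothetical tables for this sketch).
// Header: magic, version, flags, count, first_seq, timestamp, payload_len,
// reserved, checksum. Event: entity_id, signal_type, weight, timestamp.
const HEADER_FIELD_WIDTHS: [usize; 9] = [4, 1, 1, 2, 8, 8, 4, 4, 32];
const EVENT_FIELD_WIDTHS: [usize; 4] = [8, 1, 4, 8];

/// Sums a slice in a const context (iterators are not usable in const fn).
const fn sum(widths: &[usize]) -> usize {
    let mut total = 0;
    let mut i = 0;
    while i < widths.len() {
        total += widths[i];
        i += 1;
    }
    total
}

// Evaluated at compile time: the build fails if the layout drifts.
const _: () = assert!(sum(&HEADER_FIELD_WIDTHS) == HEADER_SIZE);
const _: () = assert!(sum(&EVENT_FIELD_WIDTHS) == EVENT_SIZE);
```

Asserting on the sum of field widths checks the encoded layout itself, which is what the criteria describe, rather than the in-memory size of the Rust structs.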

Research References

  • docs/research/tidaldb_wal.md — Section 1 (Approach 3: batch-oriented framing with wire format table), Section 5 (segment rotation at 16 MB, naming convention)

Implementation Notes

  • payload_len is always event_count * 21. The redundancy allows Phase 1 crash validation (check bounds before computing BLAKE3) without reading the event data.
  • The hash field at header[32..64] is written AFTER computing the hash. The hash input is exactly header[0..32] || events: the 32 hash bytes are left out of the input entirely, not zeroed and hashed.
  • f32::to_bits() / f32::from_bits() are used for weight encoding: safe, const, and exact. Never convert f32 to u32 with as, which performs a numeric conversion (2.5_f32 as u32 == 2) instead of reinterpreting the bits.
  • Segment files do not need pre-allocation in m1p2. Defer fallocate until disk write performance is a measured bottleneck.
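The first two notes describe a two-step validation: a cheap bounds check from the header alone, then the checksum over header[0..32] || events. The sketch below illustrates that order under stated assumptions: all three function names are hypothetical, and the hash is passed in as a closure so the example needs no external crate. In real code the closure would wrap BLAKE3, for instance by feeding both slices to a `blake3::Hasher` with `update` and reading the result of `finalize`.

```rust
pub const HEADER_SIZE: usize = 64;
pub const EVENT_SIZE: usize = 21;

/// Step 1: structural check using only the header and the number of bytes
/// that follow it in the segment. No event data is read.
pub fn payload_bounds_ok(header: &[u8; HEADER_SIZE], bytes_after_header: u64) -> bool {
    let event_count = u16::from_le_bytes([header[6], header[7]]) as u64;
    let payload_len =
        u32::from_le_bytes([header[24], header[25], header[26], header[27]]) as u64;
    event_count >= 1
        && payload_len == event_count * EVENT_SIZE as u64
        && payload_len <= bytes_after_header
}

/// Builds the checksum input: header[0..32] followed by the event bytes.
/// The stored hash at header[32..64] is never part of the input.
pub fn checksum_input(header: &[u8; HEADER_SIZE], events: &[u8]) -> Vec<u8> {
    let mut input = Vec::with_capacity(32 + events.len());
    input.extend_from_slice(&header[..32]);
    input.extend_from_slice(events);
    input
}

/// Step 2: full verification. `rest` is everything after the header;
/// `hash` stands in for BLAKE3 in this sketch.
pub fn frame_is_valid(
    header: &[u8; HEADER_SIZE],
    rest: &[u8],
    hash: impl Fn(&[u8]) -> [u8; 32],
) -> bool {
    if !payload_bounds_ok(header, rest.len() as u64) {
        return false;
    }
    let payload_len =
        u32::from_le_bytes([header[24], header[25], header[26], header[27]]) as usize;
    let events = &rest[..payload_len];
    hash(&checksum_input(header, events))[..] == header[32..]
}
```

Running the bounds check first means a torn tail write is rejected before any hashing, and the slice taken in step 2 cannot go out of range.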