tidaldb/docs/planning/milestone-1/phase-2/task-02-group-commit-writer.md
jordan 29400d48db feat: implement Milestone 1 phases 1-3 — schema, WAL, and storage layer
Implements the foundation of tidalDB's data pipeline:

**Phase 1 – Schema primitives**
- EntityId newtype (u64, big-endian ordering)
- SignalTypeDefinition with pre-computed decay λ, deduped/sorted windows
- SchemaBuilder with full constraint validation (duplicates, identifiers,
  half-life, windows, velocity)
- LumenError wrapping all subsystems with required From impls

**Phase 2 – Write-Ahead Log**
- Length-prefixed, BLAKE3-protected entry format
- Group-commit writer (batch up to 100 events / 10 ms)
- Double-buffered content-hash deduplication
- Checkpoint, truncation, and crash-recovery with full replay
- Integration, property, and UAT tests (incl. 5,500-event deterministic UAT)
- Proptest coverage scaled to 10 000 events/run (was ≤500) to meet
  acceptance criterion; cases reduced 100→10 to keep runtime comparable

**Phase 3 – Storage engine**
- StorageEngine trait (get/put/delete/scan/batch/flush)
- Key encoding: [EntityId][0x00][Tag][suffix] with ordering/prefix helpers
- InMemoryBackend (BTreeMap + RwLock)
- FjallStorage with three isolated keyspaces and atomic batch helper
- Property tests for key ordering and round-trip correctness

Also adds planning docs for phases 4-5, research docs, architecture
overview, and roadmap updates.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-20 16:43:24 -07:00

6.9 KiB

Task 02: Group Commit Writer

Context

Milestone: 1 -- Signal Engine Phase: m1p2 -- Write-Ahead Log Status: COMPLETE Depends On: Task 01 (wire format types, SegmentWriter, WalError) Blocks: Task 04 (WalHandle public API depends on writer channel) Complexity: M

Objective

Implement the group commit writer thread that accumulates signal events from concurrent callers, forms batches by count or timeout, writes each batch to the current segment, fsyncs, and notifies all waiting callers with their assigned sequence numbers.

The group commit pattern amortizes fsync cost across concurrent writers. A dedicated thread owns the file handle — no concurrency on the write path. Callers send events through a bounded crossbeam channel and block on a per-caller reply channel until their batch is durably committed.

Requirements

  • Single writer thread owns the WAL file handle; no concurrent writes
  • crossbeam::channel::bounded(10_000) for the command channel
  • Batch accumulation: drain up to 100 events with recv_deadline(10ms) — batch fills at count limit or timeout
  • One fsync per batch (not per event); called after write_all
  • Each event receives a monotonically increasing u64 sequence number starting at 1
  • Sequence number monotonicity survives segment rotation
  • WalCommand::Append { event, reply } — reply channel receives Result<u64, WalError> (seq or Ok(0) for dedup)
  • WalCommand::TruncateBefore { before_seq, reply } — deletes eligible segments from the writer thread (no race with writes)
  • WalCommand::Shutdown — flush partial batch, fsync, exit cleanly
  • WriterConfig carries: dir, segment_size, batch_size, batch_timeout, dedup_window
  • run_writer is a free function taking &Receiver<WalCommand>, &WriterConfig, SegmentWriter, initial next_seq, and DedupWindow

Technical Design

Writer Loop

pub fn run_writer(
    rx: &Receiver<WalCommand>,
    config: &WriterConfig,
    mut segment: SegmentWriter,
    mut next_seq: u64,
    mut dedup: DedupWindow,
) -> Result<(), WalError> {
    let mut batch: Vec<(EventRecord, Sender<Result<u64, WalError>>)> = Vec::with_capacity(config.batch_size);

    loop {
        // Block until first command
        match rx.recv() {
            Ok(cmd) => handle_command(cmd, &mut batch, &mut dedup),
            Err(_) => break, // channel closed
        }

        // Drain up to batch_size - 1 more with timeout
        let deadline = Instant::now() + config.batch_timeout;
        while batch.len() < config.batch_size {
            match rx.recv_deadline(deadline) {
                Ok(cmd) => handle_command(cmd, &mut batch, &mut dedup),
                Err(RecvTimeoutError::Timeout) => break,
                Err(RecvTimeoutError::Disconnected) => { /* drain and exit */ break }
            }
        }

        // Flush batch if non-empty
        if !batch.is_empty() {
            flush_batch(&mut batch, &mut segment, config, &mut next_seq)?;
        }
    }

    // Final flush on shutdown
    if !batch.is_empty() {
        flush_batch(&mut batch, &mut segment, config, &mut next_seq)?;
    }
    segment.flush()?;
    Ok(())
}

Batch Flush

fn flush_batch(
    batch: &mut Vec<(EventRecord, Sender<Result<u64, WalError>>)>,
    segment: &mut SegmentWriter,
    config: &WriterConfig,
    next_seq: &mut u64,
) -> Result<(), WalError> {
    // Assign sequence numbers to non-dedup events
    // Encode all events
    // Encode batch header with BLAKE3
    // Write batch bytes to segment (handles rotation)
    // fsync
    // Notify all waiters with their sequence numbers
}

Segment Rotation

When SegmentWriter::append_batch() returns true (segment full), the writer:

  1. Calls segment.flush() on the current segment (fsync already done per batch)
  2. Creates a new SegmentWriter with first_seq = next_seq
  3. Continues writing

Deduplication Integration

Before adding an event to the batch, DedupWindow::check_and_insert() is called. If it returns true (duplicate), the event's reply channel gets Ok(0) immediately — the event does not join the batch.

Test Strategy

Integration Tests

#[test]
fn writer_fsyncs_per_batch() {
    // Write 10 events to a WAL
    // Verify the WAL file exists and is non-empty
    // Verify contents are readable by WalReader
}

#[test]
fn writer_sequence_numbers_monotonic() {
    // Spawn 4 threads, each appending 25 events concurrently
    // Collect all 100 sequence numbers
    // Assert they form a contiguous range [1..=100] with no duplicates
}

#[test]
fn writer_respects_segment_size() {
    // Configure segment_size = 1024 (tiny)
    // Write enough events to force multiple rotations
    // Assert multiple segment files exist in the WAL dir
    // Assert all events are readable in order across segments
}

#[test]
fn writer_shutdown_flushes_partial_batch() {
    // Append 5 events (less than batch_size=100)
    // Shutdown immediately
    // Reopen and verify all 5 events are replayed
}

#[test]
fn truncate_before_deletes_old_segments() {
    // Write events across 3 segments
    // Checkpoint at the end of segment 2
    // Truncate before segment 3's first seq
    // Assert segments 1 and 2 are deleted, segment 3 remains
}

Acceptance Criteria

  • Batch accumulates up to 100 events or 10ms, then writes and fsyncs
  • One fsync per batch — not per event
  • Sequence numbers are monotonically increasing across the lifetime of the WAL
  • Concurrent appenders each receive the correct, unique sequence number for their event
  • Duplicate events receive Ok(0) — deduplicated before joining the batch
  • Shutdown command flushes any partial batch before the thread exits
  • Segment rotation is transparent — sequence numbers continue without reset
  • TruncateBefore runs inside the writer thread to prevent races with active writes
  • Channel capacity 10,000 provides backpressure under load without deadlock

Research References

  • docs/research/tidaldb_wal.md — Section 3 (Pattern 4: crossbeam-channel with recv_deadline, full implementation sketch, comparison table), Section 5 (segment rotation strategy)
  • thoughts.md — Part V.6 (group commit: batch fsync amortization)

Implementation Notes

  • Use crossbeam::channel::bounded(1) as the per-event reply channel — bounded to 1 because each append waits for exactly one reply.
  • recv_deadline (not recv_timeout) — recv_deadline uses a fixed instant, so batch accumulation does not reset the timeout on each event.
  • Do NOT hold the reply channel senders in the batch after sending replies. Memory leak if the writer thread is slow and batch grows large.
  • The writer thread name is "tidaldb-wal-writer" — visible in top, htop, and crash backtraces.
  • std::thread::Builder::new().name(...).spawn(...) is used (not bare thread::spawn) so the name appears in panic messages.