# tidalDB WAL Design Research

## Question

What WAL entry format, group commit strategy, crash detection mechanism, checkpoint/truncation pattern, and deduplication approach should tidalDB use for its signal event write-ahead log?

This research directly informs the implementation of tidalDB's signal durability layer -- the component that sits between the public `signal()` API and the signal aggregation system. Every signal event (view, like, skip, completion) flows through the WAL before any derived state is updated. The WAL is the source of truth; signal aggregates, decay scores, and windowed counts are all derived from WAL replay.

## tidalDB Context

**Workload characteristics:**
- Signal events are small and uniform: `entity_id` (u64), `signal_type` (u8), `weight` (f32), `timestamp` (u64) = ~21 bytes of payload, padded to approximately 29-40 bytes with framing overhead
- Write velocity: 1K-100K events/sec, bursty (viral content causes 10-100x spikes)
- Group commit target: 100 events or 10ms, whichever first
- fsync per batch, not per event (amortized durability cost)
- Append-only, immutable -- events are never updated or deleted
- BLAKE3 checksums for content-addressing and deduplication (already decided)
- Single-node embedded Rust library (no network, no replication)
- `#![forbid(unsafe_code)]` where possible

**Already decided (not re-litigated here):**
- BLAKE3 for checksums (content-addressing, dedup)
- Group commit: 100 events or 10ms
- fsync per batch
- Append-only, immutable event log
- WAL is source of truth
- Signal ledger architecture: three-tier hybrid per `docs/research/tidaldb_signal_ledger.md`

**What the WAL must support:**
1. **Append**: write a batch of signal events atomically
2. **Checkpoint**: mark a sequence position as "all state through here has been materialized"
3. **Truncation**: delete WAL segments before the checkpoint
4. **Replay**: reconstruct state from any checkpoint forward
5. **Deduplication**: detect and skip duplicate events via content hash

---

## 1. WAL Entry Format

### Approach 1: Length-Prefix Framing (LevelDB/RocksDB Model)

**How it works:** Each record is a variable-length frame: `[checksum: 4B][length: 2B][type: 1B][data: length B]`. Records are packed into fixed-size pages (typically 32 KB). Records that span page boundaries are split into FIRST/MIDDLE/LAST fragments; records that fit within a page use the FULL type.

**Used by:** LevelDB, RocksDB, Pebble (CockroachDB), Prometheus TSDB. This is the most widely deployed WAL record format in production databases.

**Wire format (LevelDB/RocksDB):**
```
Offset  Size  Field
0       4     CRC32C of type + data
4       2     Length of data (little-endian, max 65535)
6       1     Type: FULL=1, FIRST=2, MIDDLE=3, LAST=4
7       N     Data bytes
```
Total header overhead: 7 bytes per record/fragment.

**Crash detection:** CRC32C validates each fragment independently. A partial write at the tail of the log is detected when: (a) the length field reads past the end of the file, (b) the CRC32C does not match the data, or (c) the type byte is invalid. Recovery stops at the first corrupted record and truncates.
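The three failure cases above can be sketched for a single FULL record. This is a minimal illustration, not LevelDB's actual code: the function names are hypothetical, the CRC is a slow bitwise implementation, and two details of the real format are deliberately omitted (LevelDB stores a masked CRC, and records are fragmented across 32 KB pages).

```rust
/// Record type for a record that fits in one page (from the wire format above).
const FULL: u8 = 1;
/// checksum (4) + length (2) + type (1)
const HEADER_LEN: usize = 7;

/// Bitwise CRC32C (Castagnoli, reflected polynomial 0x82F63B78).
/// Dependency-free but slow; real implementations use tables or hardware.
fn crc32c(data: &[u8]) -> u32 {
    let mut crc = !0u32;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            crc = if crc & 1 != 0 { (crc >> 1) ^ 0x82F6_3B78 } else { crc >> 1 };
        }
    }
    !crc
}

/// Frames `data` as one FULL record: [crc: 4B][len: 2B][type: 1B][data].
/// Returns None if the data does not fit the 2-byte length field.
fn encode_full_record(data: &[u8]) -> Option<Vec<u8>> {
    if data.len() > u16::MAX as usize {
        return None;
    }
    let mut out = vec![0u8; 4]; // CRC slot, filled in below
    out.extend_from_slice(&(data.len() as u16).to_le_bytes());
    out.push(FULL);
    out.extend_from_slice(data);
    // The CRC covers the type byte and the data, which are contiguous.
    let crc = crc32c(&out[6..]);
    out[0..4].copy_from_slice(&crc.to_le_bytes());
    Some(out)
}

/// Returns the payload if the record at the start of `buf` is intact.
fn decode_record(buf: &[u8]) -> Option<&[u8]> {
    if buf.len() < HEADER_LEN {
        return None; // header itself is truncated
    }
    let stored_crc = u32::from_le_bytes([buf[0], buf[1], buf[2], buf[3]]);
    let len = u16::from_le_bytes([buf[4], buf[5]]) as usize;
    if !(1..=4).contains(&buf[6]) {
        return None; // (c) invalid type byte
    }
    let end = HEADER_LEN + len;
    if end > buf.len() {
        return None; // (a) length reads past the end of the file
    }
    if crc32c(&buf[6..end]) != stored_crc {
        return None; // (b) checksum mismatch
    }
    Some(&buf[HEADER_LEN..end])
}
```

A recovery loop would call `decode_record` repeatedly and truncate the log at the first `None`.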

**Space efficiency:** 7 bytes overhead per record. For tidalDB's ~21-byte signal events, that is 25% of each 28-byte framed record. However, tidalDB writes batches, not individual events -- a batch of 100 events (2,100 bytes payload) has 7 bytes overhead = 0.3%. With BLAKE3 (32 bytes) replacing CRC32C (4 bytes), the per-record overhead rises to 35 bytes, which is significant for small records but negligible at the batch level.

**Decode cost:** Single memcpy + CRC check per fragment. No allocation required for sequential reads. The 32 KB page alignment enables efficient I/O since modern SSDs have 4 KB sectors and filesystem block sizes are typically 4 KB.

**Evidence:** The design originates in LevelDB's log format documentation (`doc/log_format.md` in the LevelDB repository; LevelDB was written by Ghemawat and Dean and released in 2011) rather than in a formal paper. RocksDB inherited the format from LevelDB. Prometheus TSDB uses the same page-based approach (32 KB pages, same record types). Long production use across these systems validates the approach.

**Strengths for tidalDB:**
- Proven crash detection across billions of production hours
- Fixed page size enables efficient sequential I/O
- Fragment spanning handles records of any size
- Simple recovery: scan forward, stop at first bad CRC

**Weaknesses for tidalDB:**
- CRC32C is not BLAKE3 -- tidalDB must substitute its chosen hash (addressed below)
- 2-byte length field caps individual records at 64 KB (sufficient for tidalDB's batch sizes, but worth noting)
- Page-boundary fragmentation adds complexity to the writer and reader

### Approach 2: Fixed-Size Records

**How it works:** Every WAL entry occupies exactly N bytes, padded as needed. No length prefix required; record boundaries are implicit from file offset.

**Used by:** TigerBeetle (fixed-size prepare messages of 8K transfer batches), some embedded systems with uniform record sizes.

**Wire format (tidalDB hypothetical):**
```
Offset  Size  Field
0       32    BLAKE3 hash of bytes [32..64]
32      8     Sequence number (u64 big-endian)
40      8     Entity ID (u64 big-endian)
48      1     Signal type (u8)
49      4     Weight (f32 big-endian)
53      8     Timestamp (u64 big-endian)
61      3     Padding to 64 bytes
```
Total: 64 bytes per event. Cache-line aligned.

**Crash detection:** Trivial -- if the file size is not a multiple of 64, the last record is partial. Validate BLAKE3 hash of every record on read.

**Space efficiency:** 64 bytes per event when the payload is ~21 bytes means 67% of each record is overhead. For 100K events/sec at 64 bytes = 6.4 MB/sec = 553 GB/day. Compare to variable-length at ~28 bytes/event = 2.8 MB/sec = 242 GB/day. The fixed-size approach uses 2.3x as much disk.

**Decode cost:** Zero-copy access by offset. No parsing required. Index into the file: `record_n = &mmap[n * 64..(n+1) * 64]`. Fastest possible random access.

**Evidence:** TigerBeetle uses fixed-size messages but at a much larger granularity (batches of 8K transfers). For single events, the padding waste is substantial. The Database of Databases catalogs no production WAL that uses fixed-size records for variable-length payloads at tidalDB's event size.

**Strengths for tidalDB:**
- Simplest possible implementation (~50 lines of code)
- O(1) random access by sequence number
- Cache-line alignment (64 bytes) for read performance
- Trivial crash detection

**Weaknesses for tidalDB:**
- 67% space waste for small events
- Cannot batch -- each event is a separate record (or batch must be a single large fixed-size block, wasting even more space)
- No schema evolution -- format change requires migration
- Does not leverage tidalDB's batch-oriented write path

### Approach 3: Batch-Oriented Length-Prefix Framing (Recommended)

**How it works:** Instead of framing individual events, frame entire batches. Each WAL entry is a batch header followed by N tightly-packed events. The batch is the unit of checksumming, fsyncing, and replay.

**Wire format (tidalDB-specific):**
```
BATCH HEADER (64 bytes):
Offset  Size  Field
0       4     Magic bytes: 0x54_49_44_4C ("TIDL")
4       1     Format version (u8, currently 1)
5       1     Batch flags (u8, reserved)
6       2     Event count (u16 little-endian, max 65535)
8       8     First sequence number (u64 little-endian)
16      8     Batch timestamp (u64 little-endian, nanoseconds)
24      4     Payload length in bytes (u32 little-endian)
28      4     Reserved (zeroed, for future use)
32      32    BLAKE3 hash of bytes [0..32] + all event bytes
--- Total: 64 bytes (cache-line aligned) ---

EVENT RECORD (21 bytes each, tightly packed):
Offset  Size  Field
0       8     Entity ID (u64 little-endian)
8       1     Signal type (u8)
9       4     Weight (f32 little-endian)
13      8     Timestamp (u64 little-endian)
--- Total: 21 bytes per event ---
```

A batch of 100 events: 64 + (100 * 21) = 2,164 bytes. Overhead: 64/2164 = 3.0%.
A batch of 10 events: 64 + (10 * 21) = 274 bytes. Overhead: 64/274 = 23.4%.
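The layout above can be sketched as an encoder. This is a minimal sketch under stated assumptions: `SignalEvent`, `encode_batch`, and `HashFn` are hypothetical names, and `toy_hash` is a stand-in that is NOT BLAKE3. It only keeps the example dependency-free; real code would pass a function that returns the BLAKE3 digest.

```rust
const MAGIC: [u8; 4] = *b"TIDL"; // 0x54 0x49 0x44 0x4C
const HEADER_LEN: usize = 64;
const EVENT_LEN: usize = 21;

#[derive(Debug, Clone, Copy, PartialEq)]
struct SignalEvent {
    entity_id: u64,
    signal_type: u8,
    weight: f32,
    timestamp: u64,
}

/// Any 32-byte digest function; BLAKE3 in the real design.
type HashFn = fn(&[u8]) -> [u8; 32];

/// Placeholder digest for illustration only. NOT BLAKE3 and not secure.
fn toy_hash(data: &[u8]) -> [u8; 32] {
    let mut out = [0u8; 32];
    for (i, b) in data.iter().enumerate() {
        out[i % 32] = out[i % 32].wrapping_mul(31).wrapping_add(*b);
    }
    out
}

/// Encodes one batch: 64-byte header followed by tightly packed events.
fn encode_batch(first_seq: u64, batch_ts_ns: u64, events: &[SignalEvent], hash: HashFn) -> Vec<u8> {
    assert!(events.len() <= u16::MAX as usize, "event count must fit in u16");
    let payload_len = events.len() * EVENT_LEN;
    let mut buf = Vec::with_capacity(HEADER_LEN + payload_len);
    buf.extend_from_slice(&MAGIC);                               // 0..4   magic
    buf.push(1);                                                 // 4      format version
    buf.push(0);                                                 // 5      flags (reserved)
    buf.extend_from_slice(&(events.len() as u16).to_le_bytes()); // 6..8   event count
    buf.extend_from_slice(&first_seq.to_le_bytes());             // 8..16  first sequence number
    buf.extend_from_slice(&batch_ts_ns.to_le_bytes());           // 16..24 batch timestamp
    buf.extend_from_slice(&(payload_len as u32).to_le_bytes());  // 24..28 payload length
    buf.extend_from_slice(&[0u8; 4]);                            // 28..32 reserved
    buf.extend_from_slice(&[0u8; 32]);                           // 32..64 hash slot
    for e in events {
        buf.extend_from_slice(&e.entity_id.to_le_bytes());
        buf.push(e.signal_type);
        buf.extend_from_slice(&e.weight.to_le_bytes());
        buf.extend_from_slice(&e.timestamp.to_le_bytes());
    }
    // The digest covers header bytes [0..32] followed by all event bytes.
    let mut hashed = Vec::with_capacity(32 + payload_len);
    hashed.extend_from_slice(&buf[..32]);
    hashed.extend_from_slice(&buf[HEADER_LEN..]);
    let digest = hash(&hashed);
    buf[32..64].copy_from_slice(&digest);
    buf
}
```

Encoding 100 events with this sketch yields 2,164 bytes and 10 events yields 274 bytes, matching the sizes above.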

**Crash detection:** BLAKE3 hash covers the header fields (bytes 0-31) concatenated with all event bytes. A partial write produces either: (a) an incomplete header (< 64 bytes at tail), (b) a header with payload_length that exceeds remaining file bytes, or (c) a BLAKE3 mismatch. Recovery scans forward from the last known good batch, stops at first failure, truncates.

**Space efficiency:** 21 bytes per event + amortized 0.64 bytes/event at batch size 100. Total ~21.6 bytes/event. At 100K events/sec = 2.16 MB/sec = 187 GB/day. This is roughly 3x more efficient than fixed-size (553 GB/day) and comparable to per-record length-prefix but simpler (no page fragmentation).
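The arithmetic behind these figures is simple enough to check mechanically. The helper names below are hypothetical; the constants come from the wire format above, and GB means 10^9 bytes.

```rust
const HEADER_LEN: f64 = 64.0; // batch header bytes
const EVENT_LEN: f64 = 21.0;  // packed event bytes

/// Event bytes plus the batch header amortized over the batch.
fn bytes_per_event(batch_size: u32) -> f64 {
    EVENT_LEN + HEADER_LEN / batch_size as f64
}

/// Daily WAL volume in GB (10^9 bytes) at a sustained event rate.
fn gb_per_day(events_per_sec: f64, bytes_per_event: f64) -> f64 {
    events_per_sec * bytes_per_event * 86_400.0 / 1e9
}

/// Fraction of a batch occupied by the header.
fn header_overhead(batch_size: u32) -> f64 {
    HEADER_LEN / (HEADER_LEN + EVENT_LEN * batch_size as f64)
}
```

For example, `gb_per_day(100_000.0, bytes_per_event(100))` gives roughly 187, while the 64-byte fixed-size format gives roughly 553 at the same rate.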

**Decode cost:** Read 64-byte header, validate magic + version, read `payload_length` bytes, validate BLAKE3 over header + payload, then iterate events at 21-byte stride. No allocation for sequential scan. Batch-level random access via an in-memory index of `(sequence_number -> file_offset)` built at startup.

**Evidence:** This design synthesizes:
- Citadel's quarantine journal: length-prefixed records with BLAKE3 checksums and batch fsync (from `thoughts.md`)
- Prometheus TSDB: batch-oriented WAL records (Series, Samples, Tombstones records each contain multiple items)
- RocksDB WriteBatch: the WAL writes entire WriteBatch objects as single records, not individual key-value pairs

**Strengths for tidalDB:**
- Matches the group-commit write path exactly (batch is the unit of write and the unit of WAL)
- BLAKE3 hash per batch, not per event (amortizes hash cost)
- Simple recovery: scan batch headers, no page fragmentation logic
- Cache-line aligned header for read performance
- Schema evolution via version byte and reserved fields
- Minimal space overhead at target batch sizes

**Weaknesses for tidalDB:**
- Individual event random access requires reading the containing batch
- Small batches (< 10 events) have higher relative overhead than per-event framing
- Custom format -- not reusing an existing library's format

---

## 2. Rust WAL Crates Survey

### Crate 1: OkayWAL (khonsulabs/okaywal)

**Repository:** https://github.com/khonsulabs/okaywal
**Last commit:** November 26, 2023 (v0.3.1). No commits in 26+ months.
**Downloads:** Low (niche crate from the BonsaiDB ecosystem).
**unsafe code:** `#![forbid(unsafe_code)]` -- fully safe Rust.

**Features:**
- Segment-based WAL with automatic rotation
- CRC-32 checksums per chunk
- fsync batching across threads
- Automatic checkpointing via `LogManager` trait
- Interactive recovery with basic versioning

**Record format:** Segments named `wal-{id}` with magic bytes "okw", version, then entries marked by control bytes (1=new entry, 2=chunk, 3=end). Each chunk: 4-byte length + data + CRC-32.

**Evaluation against tidalDB:**
- (+) Safe Rust, segment-based, checkpoint support
- (-) CRC-32 checksums, not BLAKE3 -- would require forking to replace
- (-) Last commit 26 months ago -- maintenance risk is severe
- (-) API requires `LogManager` trait with `recover()` semantics that assume a specific application structure
- (-) No batch-oriented write API -- chunks are individual records
- (-) Part of the BonsaiDB ecosystem which has uncertain maintenance status (BonsaiDB development appears stalled)

**Verdict:** Do not use. Maintenance abandoned, wrong checksum algorithm, wrong abstraction level.

### Crate 2: commitlog (zowens/commitlog)

**Repository:** https://github.com/zowens/commitlog
**Last commit:** Date not determined; 159 commits total.
**Popularity:** Moderate (117 GitHub stars).
**unsafe code:** Not documented as `forbid(unsafe_code)`.

**Features:**
- Segment-based append-only log
- Offset-based message addressing (monotonically increasing)
- Configurable segment size
- Read from arbitrary offset with limit

**Evaluation against tidalDB:**
- (+) Conceptually aligned: append-only, offset-based
- (-) No checksum support at all -- corruption detection is absent
- (-) No fsync control -- durability guarantees unclear
- (-) No checkpoint or truncation API
- (-) Designed for distributed log (Kafka-like) abstractions, not embedded WAL
- (-) Maintenance health unknown

**Verdict:** Do not use. Missing critical durability features (checksums, fsync control).

### Crate 3: walcraft

**Repository:** https://github.com/RustyFarmer101/walcraft
**Downloads:** Very low.

**Features:**
- In-memory buffer with append-only log files
- Configurable buffer size, storage size
- Optional fsync
- Older files auto-deleted to save space

**Evaluation against tidalDB:**
- (-) Very early-stage, minimal community adoption
- (-) No checksum support documented
- (-) No checkpoint/truncation API
- (-) "Write mode prevents switching back to read mode" -- unusual constraint

**Verdict:** Do not use. Too immature, missing critical features.

### Crate 4: walrus-rust

**Repository:** crates.io/crates/walrus-rust
**Last updated:** ~3 months ago (as of early 2026).

**Features:**
- FD backend (pread/pwrite) and mmap backend
- io_uring support for batch operations on Linux
- Topic-based organization
- Configurable consistency modes
- Persistent read offset tracking

**Evaluation against tidalDB:**
- (+) Active development, modern Rust (2024 edition)
- (+) Multiple backends including io_uring
- (-) Topic-based organization adds unwanted complexity
- (-) io_uring is Linux-only; tidalDB targets macOS + Linux
- (-) Unclear checksum strategy
- (-) 6.1K SLoC is substantial for a WAL crate -- large dependency surface

**Verdict:** Interesting but over-engineered for tidalDB's needs. Topic-based organization and io_uring are complexity without value for a single-node embedded database.

### Existing Database WAL Implementations

**fjall (v3):** fjall has its own internal WAL (called "journal") for memtable durability. It is not exposed as a standalone API. tidalDB already uses fjall for entity storage, but the signal WAL serves a different purpose -- it is the source of truth for signal events before they flow into fjall-backed storage. Using fjall's internal WAL would couple the signal durability path to the entity store, violating the architectural separation between WAL and storage engine documented in `thoughts.md`.

**sled:** Uses a log-structured approach with epoch-based GC. The WAL is deeply coupled to sled's page cache and cannot be extracted. Also, sled's maintenance status has been uncertain since 2022.

**redb:** Uses copy-on-write B-trees (LMDB-inspired). No WAL at all -- durability comes from the COW mechanism. Not applicable.

### Recommendation: Build a Custom WAL

**The evidence strongly favors building tidalDB's own WAL.** The reasons:

1. **No existing crate meets the requirements.** Every surveyed crate is missing at least two critical features: BLAKE3 checksums, batch-oriented writes, checkpoint/truncation, or adequate maintenance.

2. **The WAL is small.** A batch-oriented, append-only, segment-based WAL with BLAKE3 checksums is approximately 400-600 lines of Rust. That exceeds the "could we write this in 200 lines?" threshold from CODING_GUIDELINES.md, but the alternative -- forking and maintaining someone else's abandoned crate -- is worse.

3. **The WAL is load-bearing.** This is the durability primitive. Every signal event flows through it. Depending on an abandoned or under-maintained external crate for the single most critical component is unacceptable risk.

4. **The format must match the workload.** tidalDB's batch-oriented, BLAKE3-checksummed, fixed-event-size signal events are a specific enough format that a general-purpose WAL crate adds abstraction overhead without value.

5. **Precedent from sister projects.** Engram, Citadel, and StemeDB all built custom WALs (per `thoughts.md`). Each is under 1,000 lines. Each is tuned to its workload. This is the pattern that works.

---

## 3. Group Commit Implementation Patterns

### Pattern 1: Dedicated Writer Thread with Channel

**How it works:** A single background thread owns the WAL file handle. Writer threads send events through a channel (bounded MPSC). The writer thread loops: `recv_timeout(10ms)`, accumulating events into a batch buffer. When the buffer hits 100 events or the timeout fires, the writer thread writes the batch to the WAL file, fsyncs, and notifies all waiting writers via a shared `Condvar` or per-writer oneshot channels.

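**Used by:** Citadel's `GroupCommitQueue` (from `thoughts.md`), MariaDB Aria storage engine. MySQL's binary log group commit batches fsyncs in the same way but elects a leader among the committing threads instead of using a dedicated writer (see Pattern 2).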

**Implementation sketch (pseudocode):**
```
Writer threads:
  1. Send (event, oneshot::Sender<SeqNo>) to channel
  2. Block on oneshot::Receiver

WAL thread:
  loop {
    batch = drain_channel(max=100, timeout=10ms)
    first_seq = write_batch_to_file(&batch)
    fsync()
    for (i, (event, notifier)) in enumerate(batch) {
      notifier.send(first_seq + i)  // each event gets its own sequence number
    }
  }
```

**Latency:** Each event waits for its batch to fill or for the 10ms timeout, whichever comes first, and then for the fsync. At 10K events/sec, a batch of 100 fills in 10ms -- the timeout and batch size converge. At 100K events/sec, a batch fills in 1ms -- latency is dominated by fsync (~0.1-1ms on NVMe). Worst case p99: 10ms (timeout) + fsync time.

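**Throughput at 10K events/sec:** Easily sustained. 100 batches/sec * 1 fsync/batch = 100 fsyncs/sec. Against the 10K-50K fsyncs/sec figure used here for NVMe SSDs, that is a headroom of 100-500x. A single writer issuing fsyncs back to back at the ~0.1-1ms latency quoted above is limited to roughly 1K-10K fsyncs/sec, so the practical headroom for this design is closer to 10-100x.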

**Implementation complexity:** Moderate. ~150 lines for the writer thread, channel setup, and notification. Requires careful shutdown handling (poison the channel, drain remaining events, final fsync).

**Strengths:**
- Single writer eliminates all file-level concurrency concerns
- fsync is naturally batched by the channel drain
- Backpressure via bounded channel
- Clean separation of concerns

**Weaknesses:**
- Thread overhead (one dedicated OS thread)
- Minimum one channel hop latency
- Shutdown ordering must be explicit

### Pattern 2: Leader-Follower with Mutex + Condvar

**How it works:** All writer threads contend on a mutex protecting the batch buffer. The first thread to arrive after a flush becomes the "leader." Subsequent threads add their events and wait on a condvar. When the batch is full or the leader's timer expires, the leader writes the batch, fsyncs, and calls `condvar.notify_all()`.

**Used by:** MySQL's binary log group commit (WL#5223: "The first transaction that reaches a stage is elected leader and the others are followers").

**Implementation sketch (Rust):**
```
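fn append(&self, event: Event) -> SeqNo {
  let mut batch = self.batch.lock().unwrap();
  let seq_no = batch.next_seq;      // assign the sequence number under the lock
  batch.next_seq += 1;
  let my_gen = batch.generation;    // the flush that will cover this event
  batch.events.push(event);
  if batch.events.len() >= 100 || batch.timer_expired() {
    // I am the leader
    let events = std::mem::take(&mut batch.events);
    batch.generation += 1;          // later arrivals join the next batch
    drop(batch); // release lock before I/O
    self.write_and_fsync(&events);  // assumed to serialize leaders in order
    self.batch.lock().unwrap().durable_gen = my_gen + 1;
    self.condvar.notify_all();
    return seq_no;
  }
  // I am a follower -- wait for the leader, re-checking after spurious wakeups
  while batch.durable_gen <= my_gen {
    batch = self.condvar.wait(batch).unwrap();
  }
  seq_no
}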
```

**Latency:** Similar to Pattern 1. Leaders pay write + fsync cost. Followers wake up immediately after fsync completes. Slightly lower latency than channel-based because there is no channel hop -- the mutex is the synchronization point.

**Throughput at 10K events/sec:** Sustained easily. The mutex is held only to append an event (~50ns) and then released. The write + fsync happens outside the lock.

**Implementation complexity:** Lower than Pattern 1 (~100 lines). But the correctness reasoning is harder: spurious wakeups, timer management, and ensuring the leader's fsync is visible to all followers require careful `Condvar` usage.

**Strengths:**
- No dedicated thread -- uses caller threads
- Slightly lower latency (no channel hop)
- Simpler resource management

**Weaknesses:**
- Mutex contention under high concurrency
- `Condvar` correctness is subtle (spurious wakeups, notification ordering)
- Timer management is awkward (who checks the timer? the leader? a background thread?)
- Leader thread pays the full I/O cost, creating latency asymmetry

### Pattern 3: std::sync::mpsc with recv_timeout

**How it works:** Similar to Pattern 1 but using the standard library's `mpsc::channel` instead of crossbeam.

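**Evaluation:** `std::sync::mpsc::Receiver::recv_timeout()` had a long-standing bug (it could panic under some interleavings) until Rust 1.67 reimplemented `std::sync::mpsc` on top of crossbeam-channel's code. On current toolchains the correctness and performance gap has therefore largely closed, but crossbeam-channel still offers `recv_deadline`, `select!`, and multi-consumer receivers on stable Rust. There is no reason to prefer `std::sync::mpsc` over crossbeam for this use case.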

**Verdict:** Use crossbeam if choosing the channel-based pattern.

### Pattern 4: crossbeam-channel with recv_timeout (Recommended)

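**How it works:** Same as Pattern 1 but using `crossbeam::channel::bounded` with `recv_deadline` (rather than `recv_timeout`), so the 10ms window is measured once from the first event in the batch instead of restarting on every receive. This is the production-grade implementation of the channel-based group commit pattern.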

**Used by:** Effectively the Rust-idiomatic version of Pattern 1. crossbeam-channel is the de facto standard for high-performance synchronous channels in Rust (173M downloads, actively maintained, heavily audited).

**Implementation sketch (Rust):**
```rust
use crossbeam::channel::{bounded, Sender, Receiver};
use std::time::{Duration, Instant};

struct GroupCommitter {
    rx: Receiver<(SignalEvent, oneshot::Sender<u64>)>,
    wal: WalWriter,
    batch_size: usize,    // 100
    batch_timeout: Duration, // 10ms
}

impl GroupCommitter {
    fn run(&mut self) {
        let mut batch = Vec::with_capacity(self.batch_size);
        loop {
            // Block until first event arrives
            match self.rx.recv() {
                Ok(item) => batch.push(item),
                Err(_) => break, // channel closed, shut down
            }
            // Drain up to batch_size with timeout
            let deadline = Instant::now() + self.batch_timeout;
            while batch.len() < self.batch_size {
                match self.rx.recv_deadline(deadline) {
                    Ok(item) => batch.push(item),
                    Err(_) => break, // timeout or disconnected
                }
            }
            // Write and fsync the batch
            let seq_start = self.wal.write_batch(&batch);
            self.wal.fsync();
            // Notify all waiters
            for (i, (_, notifier)) in batch.drain(..).enumerate() {
                let _ = notifier.send(seq_start + i as u64);
            }
        }
    }
}
```

**Latency:** Same as Pattern 1. At 10K events/sec with batch_size=100 and timeout=10ms, batches fill in ~10ms. At 100K events/sec, batches fill in ~1ms. fsync adds ~0.1-1ms on NVMe, the same range assumed for Pattern 1.

**Throughput at 10K events/sec:** 100 batches/sec, each ~2.1 KB. Total: 210 KB/sec write + 100 fsyncs/sec. Trivial for any modern SSD.

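**Throughput at 100K events/sec:** 1,000 batches/sec, each ~2.1 KB. Total: 2.1 MB/sec + 1,000 fsyncs/sec. This is within NVMe capabilities (10K-50K IOPS for 4 KB random writes with fsync), but the single writer thread has a budget of only 1ms per batch at this rate, so fsync latency must stay below roughly 1ms; beyond that, the bounded channel applies backpressure to callers.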

**Implementation complexity:** ~100-150 lines. Clean, testable, well-understood.

**Strengths:**
- Proven channel implementation (crossbeam)
- `recv_deadline` provides exact timeout semantics
- Single writer thread -- no file concurrency
- Natural backpressure via bounded channel
- Easy to test: send events, assert batch sizes
- Clean shutdown: drop all senders, writer drains and exits

**Weaknesses:**
- One dedicated OS thread
- crossbeam is an additional dependency (but already widely used in the Rust ecosystem, and fjall likely already depends on it transitively)
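The batching behavior of this loop can be exercised without external crates. The sketch below swaps in `std::sync::mpsc` for crossbeam-channel and a plain `Sender<u64>` for the oneshot notifier, purely so the example is self-contained; it also replaces the WAL write and fsync with a comment. `run_group_commit`, `Event`, and `Ack` are hypothetical names.

```rust
use std::sync::mpsc::{Receiver, Sender};
use std::time::{Duration, Instant};

type Event = u64;       // stand-in for SignalEvent
type Ack = Sender<u64>; // stand-in for a oneshot sender carrying the sequence number

/// Runs until every sender is dropped. Returns the size of each committed
/// batch so tests can assert on the batching behavior.
fn run_group_commit(rx: Receiver<(Event, Ack)>, batch_size: usize, timeout: Duration) -> Vec<usize> {
    let mut committed = Vec::new();
    let mut next_seq = 0u64;
    let mut batch: Vec<(Event, Ack)> = Vec::with_capacity(batch_size);
    // Block until the first event of a batch arrives; Err means shutdown.
    while let Ok(first) = rx.recv() {
        batch.push(first);
        // The deadline is fixed once per batch, emulating recv_deadline.
        let deadline = Instant::now() + timeout;
        while batch.len() < batch_size {
            let remaining = deadline.saturating_duration_since(Instant::now());
            match rx.recv_timeout(remaining) {
                Ok(item) => batch.push(item),
                Err(_) => break, // timed out or all senders dropped
            }
        }
        // A real writer would append the encoded batch and fsync here,
        // before any waiter is notified.
        committed.push(batch.len());
        for (_event, ack) in batch.drain(..) {
            let _ = ack.send(next_seq); // the waiter may have given up; ignore
            next_seq += 1;
        }
    }
    committed
}
```

Queuing 250 events and then dropping the sender produces batches of 100, 100, and 50, which is the kind of assertion the "easy to test" point above refers to.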

### Comparison Table

| Criterion | Dedicated Thread (crossbeam) | Leader-Follower (Condvar) | std::sync::mpsc |
|---|---|---|---|
| **Latency (p50)** | ~5-10ms at 10K/s | ~5-10ms at 10K/s | Same (buggy before Rust 1.67) |
| **Latency (p99)** | 10ms + fsync | 10ms + fsync + wakeup jitter | Same on Rust 1.67+; unreliable before |
| **Throughput ceiling** | 100K+/sec | 100K+/sec (mutex contention at >1M) | 100K+/sec |
| **Implementation complexity** | Moderate (150 LoC) | Lower (100 LoC) but subtler | Same as crossbeam |
| **Correctness risk** | Low (single writer) | Moderate (condvar semantics) | Low on Rust 1.67+; high before (known `recv_timeout` bug) |
| **Testability** | High (channel-based) | Moderate (timing-dependent) | Same as crossbeam |
| **Shutdown cleanliness** | Clean (drop senders) | Requires poison flag | Clean (drop senders) |

**Recommendation: Pattern 4 (crossbeam-channel with recv_deadline).** It has the best correctness properties (single writer, no mutex reasoning), is the most testable, and crossbeam-channel is battle-tested.

---

## 4. Crash Detection: Partial Write Handling

### The Problem

When a process crashes or power is lost during a WAL write, the file may contain a partial batch at the tail. The WAL must detect this and recover cleanly without losing any previously committed data.

Partial writes can manifest as:
1. **Truncated header:** Fewer than 64 bytes written for the batch header
2. **Truncated payload:** Header is complete but the event data is incomplete
3. **Corrupted bytes:** The OS wrote garbage (filesystem metadata inconsistency)
4. **Torn write:** Part of the batch is correct, part is zeroed or garbage (sector-level atomicity failure)

### Approach 1: Checksum-Only Validation

**How it works:** Each batch has a BLAKE3 hash covering the header + payload. On recovery, scan batches sequentially. If the BLAKE3 hash does not match, the batch is invalid. Truncate the file to the end of the last valid batch.

**Used by:** LevelDB, RocksDB (with CRC32C), Prometheus TSDB, Citadel (with BLAKE3).

**Recovery algorithm:**
```
offset = 0
last_valid_offset = 0
while offset < file_length:
    if file_length - offset < 64:
        break  # incomplete header
    header = read(offset, 64)
    if header.magic != TIDL:
        break  # corruption
    payload_end = offset + 64 + header.payload_length
    if payload_end > file_length:
        break  # incomplete payload
    payload = read(offset + 64, header.payload_length)
    computed_hash = blake3(header[0..32] + payload)
    if computed_hash != header.blake3:
        break  # corrupted batch
    last_valid_offset = payload_end
    offset = payload_end
truncate(file, last_valid_offset)
```

**Strengths:** Simple. Deterministic. The BLAKE3 hash catches all corruption types (truncated, torn, garbage). No additional sentinel bytes or alignment tricks needed.

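**Weaknesses:** Requires reading and hashing every batch during recovery. For a 1 GB WAL, that is ~1 GB of I/O + BLAKE3 computation. Assuming BLAKE3 sustains ~8 GB/sec (an optimistic figure that depends on SIMD width and input size; ~2 KB batches will hash more slowly than large buffers), hashing 1 GB takes ~0.125 seconds -- acceptable.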
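The recovery algorithm above translates to Rust fairly directly. This sketch scans an in-memory byte slice and returns the offset to truncate to; it assumes the 64-byte batch header from Section 1. `last_valid_offset`, `frame`, and `HashFn` are hypothetical names, and `toy_hash` is a placeholder that is NOT BLAKE3 (real code would pass the BLAKE3 digest function).

```rust
const MAGIC: [u8; 4] = *b"TIDL";
const HEADER_LEN: usize = 64;

type HashFn = fn(&[u8]) -> [u8; 32];

/// Placeholder digest for illustration only. NOT BLAKE3 and not secure.
fn toy_hash(data: &[u8]) -> [u8; 32] {
    let mut out = [0u8; 32];
    for (i, b) in data.iter().enumerate() {
        out[i % 32] = out[i % 32].wrapping_mul(31).wrapping_add(*b);
    }
    out
}

/// Returns the byte offset just past the last valid batch in `log`.
/// The caller truncates the file to this length.
fn last_valid_offset(log: &[u8], hash: HashFn) -> usize {
    let mut offset = 0;
    while log.len() - offset >= HEADER_LEN {
        let header = &log[offset..offset + HEADER_LEN];
        if &header[0..4] != &MAGIC[..] {
            break; // corruption
        }
        let payload_len =
            u32::from_le_bytes([header[24], header[25], header[26], header[27]]) as usize;
        let payload_end = offset + HEADER_LEN + payload_len;
        if payload_end > log.len() {
            break; // incomplete payload
        }
        // Digest input: header bytes [0..32] followed by the payload.
        let mut hashed = Vec::with_capacity(32 + payload_len);
        hashed.extend_from_slice(&header[..32]);
        hashed.extend_from_slice(&log[offset + HEADER_LEN..payload_end]);
        let computed = hash(&hashed);
        if &computed[..] != &header[32..64] {
            break; // corrupted batch
        }
        offset = payload_end;
    }
    offset
}

/// Minimal writer used only to exercise the scan: fills in the magic,
/// version, payload length, and digest, leaving other header fields zero.
fn frame(payload: &[u8], hash: HashFn) -> Vec<u8> {
    let mut buf = vec![0u8; HEADER_LEN];
    buf[0..4].copy_from_slice(&MAGIC);
    buf[4] = 1;
    buf[24..28].copy_from_slice(&(payload.len() as u32).to_le_bytes());
    buf.extend_from_slice(payload);
    let mut hashed = buf[..32].to_vec();
    hashed.extend_from_slice(payload);
    let digest = hash(&hashed);
    buf[32..64].copy_from_slice(&digest);
    buf
}
```

A torn third batch or a flipped byte in the second batch both cause the scan to stop at the end of the last intact batch.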

### Approach 2: Sentinel Markers

**How it works:** Write a known sentinel value (e.g., `0xDEADBEEF`) at the end of each batch after the checksum. If the sentinel is missing, the batch is incomplete.

**Used by:** UnisonDB (Go, `0xDEADBEEFFEEDFACE` trailer).

**Evaluation:** The sentinel adds marginal value over checksum-only validation. The BLAKE3 hash already detects any corruption. The sentinel's only advantage is a fast pre-check (read 4-8 bytes at the expected end position) before computing the full hash. But for tidalDB's batch sizes (~2 KB), the hash is fast enough that the sentinel pre-check saves negligible time.

**Verdict:** Unnecessary given BLAKE3. Adds format complexity without meaningful benefit.

### Approach 3: Checksum + Length-Prefix Combination (Recommended)

**How it works:** The batch header contains both the payload length and the BLAKE3 hash. Recovery uses a two-phase check:

1. **Phase 1 (fast):** Read the 64-byte header. Verify magic bytes. Check that `offset + 64 + payload_length <= file_length`. This catches truncated writes without any hashing.
2. **Phase 2 (thorough):** Read the payload. Compute BLAKE3 over `header[0..32] || payload`. Compare to stored hash. This catches corruption and torn writes.

Phase 1 rejects most crash-induced damage instantly. Phase 2 catches the rest.

**Used by:** This is exactly the LevelDB/RocksDB model (length + CRC), upgraded to BLAKE3 and applied at batch granularity.

**Strengths:** Two-layer detection catches partial writes fast (Phase 1) and corruption thoroughly (Phase 2). The length prefix is essential anyway for parsing -- no additional cost.

**Weaknesses:** None meaningful. This is strictly better than checksum-only or sentinel-only.

### Approach 4: Tail Scanning with Zeroed Pages

**How it works:** Pre-allocate WAL segments filled with zeros. Scan backward from the end of the file looking for non-zero content. The first non-zero content from the end is the tail of the log.

**Used by:** Some older database implementations. OkayWAL pre-allocates segment files.

**Evaluation:** Pre-allocation improves write performance (avoids filesystem metadata updates) but zero-scanning for tail detection is fragile -- legitimate zero bytes in the data could cause false boundaries. Not suitable for tidalDB's event data.

**Verdict:** Pre-allocation is worth considering for performance, but not for crash detection. Stick with Approach 3.

### Recommendation

Use **Approach 3: BLAKE3 + Length-Prefix Combination.** The batch header already contains both the payload length and the BLAKE3 hash. Recovery is a simple forward scan:

1. Read 64-byte header. Verify magic. Verify `payload_length` fits.
2. Read payload. Verify BLAKE3 hash.
3. If either check fails, truncate at previous batch boundary.

This is the same proven pattern used by LevelDB, RocksDB, and Prometheus TSDB, with BLAKE3 substituted for CRC32C.

**Recovery time estimate:** At 100K events/sec with 100-event batches, the WAL grows at ~2.1 MB/sec. Between checkpoints (every 30 seconds per the signal ledger research), the WAL accumulates ~63 MB. Scanning 63 MB at BLAKE3's 8 GB/sec = ~8ms. Total recovery: read checkpoint metadata + scan ~63 MB of WAL = under 50ms. Excellent.

---

## 5. Checkpoint + Truncation Patterns

### Survey of Production Systems

**PostgreSQL:** Checkpoints write a special WAL record containing the "redo point" -- the LSN from which recovery must start. All WAL segments before the redo point's segment can be recycled. Checkpoints are triggered by time (default 5 min) or WAL size (default 1 GB). WAL segments are 16 MB files.

**SQLite (WAL mode):** The WAL is a single file of frames. A checkpoint copies dirty pages back to the main database file. After a complete checkpoint, the WAL is reset (overwritten from the beginning). The WAL-index (shm file) tracks which frames are valid.

**LevelDB/RocksDB:** The WAL is per-memtable. When the memtable is flushed to an SST file, the corresponding WAL file is deleted. There is no explicit "checkpoint" -- the SST flush is the checkpoint. Multiple WAL files can coexist (one per active memtable).

**Prometheus TSDB:** WAL segments are 128 MB files in a `wal/` directory. A checkpoint is a filtered copy of the WAL segments being truncated, stored in `checkpoint.NNNNNN/`. Truncation deletes the first 2/3 of segments. The checkpoint retains series definitions and recent samples that are still needed.

**TigerBeetle:** The WAL is a ring buffer. The superblock tracks which prepares have been applied to the state. Completed prepares can be overwritten by new ones. No segment files -- it is a fixed-size ring.

### tidalDB Checkpoint Design

tidalDB's signal WAL has a specific lifecycle:

1. Signal events arrive and are appended to the WAL in batches
2. A background thread reads WAL events and updates in-memory signal state (decay scores, windowed counts)
3. Periodically, the in-memory signal state is flushed to the entity store (fjall)
4. Once flushed, the WAL events up to that point are no longer needed

This maps directly to the **LevelDB/RocksDB model**: the "flush to entity store" is the checkpoint, and the WAL segments before the checkpoint can be deleted.

### Segment Rotation Strategy

**Recommendation: size-based rotation at 16 MB per segment.**

Rationale:
- PostgreSQL uses 16 MB segments by default (more than two decades of production experience with its WAL validates this size)
- At tidalDB's write rate of 2.1 MB/sec (100K events/sec), a 16 MB segment lasts ~7.6 seconds
- At 10K events/sec, a segment lasts ~76 seconds
- Truncation granularity is one segment -- smaller segments mean less wasted space after truncation
- 16 MB fits comfortably in the filesystem page cache

**Segment naming:** `wal-{first_sequence_number:020}.seg` (e.g., `wal-00000000000000000001.seg`). Zero-padded 20-digit sequence number ensures lexicographic ordering matches numeric ordering.
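
The naming scheme can be sketched with two small helpers. This is a minimal illustration; the function names `segment_file_name` and `parse_segment_file_name` are hypothetical and not part of any existing tidalDB API.

```rust
/// Builds a segment file name from the first sequence number it contains.
fn segment_file_name(first_seq: u64) -> String {
    // u64::MAX has 20 decimal digits, so a width of 20 never truncates.
    format!("wal-{:020}.seg", first_seq)
}

/// Parses the first sequence number back out of a segment file name.
/// Returns None for anything that does not match `wal-<20 digits>.seg`.
fn parse_segment_file_name(name: &str) -> Option<u64> {
    let digits = name.strip_prefix("wal-")?.strip_suffix(".seg")?;
    if digits.len() != 20 || !digits.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    // Fails (returns None) if the 20 digits exceed u64::MAX.
    digits.parse().ok()
}
```

Because every name has the same width, sorting a directory listing as plain strings yields segments in sequence order, which is what recovery and truncation rely on.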

### Checkpoint Implementation

```
checkpoint.meta file:
{
  "checkpoint_sequence": 1000000,  // all events through this seq are materialized
  "checkpoint_timestamp": 1708000000000000000,  // nanoseconds
  "segment_file": "wal-00000000000000950000.seg"
}
```

**Checkpoint process:**
1. Signal materializer flushes in-memory state to entity store (fjall)
2. fjall fsync completes
3. Write new `checkpoint.meta` with the last-materialized sequence number
4. fsync `checkpoint.meta`
5. Delete all WAL segments whose last sequence number < checkpoint_sequence

**Checkpoint frequency:** Every 30 seconds (matching the signal ledger research recommendation for entity_state flush interval). This bounds WAL size to ~63 MB at 100K events/sec.
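
Steps 3 and 4 of the checkpoint process leave open how `checkpoint.meta` is replaced without ever exposing a half-written file. One common approach, shown here as a sketch under assumptions rather than the decided design, is to write a temporary file, fsync it, and rename it over the old one. The helper name `write_checkpoint_meta` and the temporary file name `checkpoint.meta.tmp` are hypothetical.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

/// Atomically replaces `checkpoint.meta` in `dir` (hypothetical helper).
fn write_checkpoint_meta(
    dir: &Path,
    checkpoint_seq: u64,
    timestamp_nanos: u64,
    segment_file: &str,
) -> io::Result<()> {
    let tmp_path = dir.join("checkpoint.meta.tmp");
    let final_path = dir.join("checkpoint.meta");
    // segment_file is assumed to be a generated name that needs no escaping.
    let body = format!(
        "{{\n  \"checkpoint_sequence\": {},\n  \"checkpoint_timestamp\": {},\n  \"segment_file\": \"{}\"\n}}\n",
        checkpoint_seq, timestamp_nanos, segment_file
    );

    // 1. Write the new contents to a temporary file and make them durable.
    let mut tmp = File::create(&tmp_path)?;
    tmp.write_all(body.as_bytes())?;
    tmp.sync_all()?;
    drop(tmp);

    // 2. Rename over the old file: readers see the old or the new contents,
    //    never a partial write.
    fs::rename(&tmp_path, &final_path)?;

    // 3. On Unix, fsync the directory so the rename itself survives a crash.
    #[cfg(unix)]
    {
        File::open(dir)?.sync_all()?;
    }
    Ok(())
}
```

If the process crashes before the rename, the previous `checkpoint.meta` is still intact and recovery simply replays a little more of the WAL.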

### Truncation

**Truncation is segment deletion.** Once a checkpoint is recorded, all segments containing only events with sequence numbers less than the checkpoint sequence are safe to delete. The current active segment (being written to) is never deleted.

This is the Prometheus model: "files cannot be deleted at random -- deletion happens for first N files while not creating a gap in the sequence."

**Edge case:** If the checkpoint falls in the middle of a segment, that segment is retained until the next checkpoint advances past its last event. This wastes at most one segment (~16 MB) of space. Acceptable.
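
The truncation rule, including the mid-segment edge case, can be sketched as a pure function over segment start positions. The name `deletable_segments` is hypothetical. The sketch assumes that a segment's last sequence number is one less than the next segment's first sequence number, and it applies the strict `<` comparison from the checkpoint process above.

```rust
/// Given the first sequence number of every segment (sorted ascending, with
/// the active segment last), returns the first-sequence numbers of the
/// segments that are safe to delete for `checkpoint_seq`.
fn deletable_segments(first_seqs: &[u64], checkpoint_seq: u64) -> Vec<u64> {
    let mut deletable = Vec::new();
    // windows(2) pairs each segment with its successor, so the final
    // (active) segment is never a deletion candidate.
    for pair in first_seqs.windows(2) {
        let last_seq_in_segment = pair[1].saturating_sub(1);
        if last_seq_in_segment < checkpoint_seq {
            deletable.push(pair[0]);
        } else {
            // Stop at the first retained segment so deletion never leaves
            // a gap in the sequence.
            break;
        }
    }
    deletable
}
```

With segments starting at 1, 950,000 and 1,100,000 and a checkpoint at 1,000,000, only the first segment is returned: the checkpoint falls inside the second segment, so that segment is kept until a later checkpoint passes its last event.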

---

## 6. Deduplication via Content Hash

### The Dedup Problem

Webhook retries, client double-submissions, and network replays can deliver the same signal event multiple times. tidalDB must detect and skip duplicates. The content-addressing property of BLAKE3 (already decided) enables this: hash the event content, check if the hash has been seen.

The question is: **where and how to store the set of seen hashes?**

### Approach 1: In-Memory HashSet<[u8; 32]>

**How it works:** Maintain a `HashSet<[u8; 32]>` of all BLAKE3 hashes seen since the last truncation.

**Memory cost:** Each hash is 32 bytes. With 24 bytes of `HashSet` overhead per entry, that is ~56 bytes per event. At 100K events/sec for 30 seconds (one checkpoint interval): 3M entries * 56 bytes = 168 MB. At 10K events/sec: 300K entries * 56 bytes = 16.8 MB.

**Evaluation:** At 10K events/sec, 16.8 MB is acceptable. At 100K events/sec, 168 MB is substantial but within tidalDB's memory budget (the signal ledger already budgets 400-800 MB for in-memory entity state). The concern is that this grows linearly with write rate and checkpoint interval.

**Strengths:** Zero false positives. Exact deduplication. Simple implementation.
**Weaknesses:** Memory grows linearly. At sustained 100K/sec, could become problematic.

### Approach 2: Bloom Filter

**How it works:** A probabilistic set that reports "definitely not seen" or "possibly seen." Uses ~9.6 bits per element at 1% false positive rate.

**Memory cost:** At 100K events/sec for 30 seconds: 3M entries * 9.6 bits = 3.6 MB. At 10K events/sec: 300K * 9.6 bits = 360 KB.

**Evaluation:** Dramatically lower memory than HashSet. But: a false positive means a legitimate event is silently dropped. For a ranking system, dropping a real "like" event means the ranking is wrong. A 1% false positive rate means 1 in 100 legitimate events could be dropped. This is unacceptable for tidalDB's signal fidelity requirements.

A Bloom filter at 0.01% FPR requires ~19.2 bits per element: 7.2 MB for 3M entries. Better, but still has false positives. Unacceptable.

**Verdict:** Do not use Bloom filters for deduplication. False positives corrupt ranking data.
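
The bits-per-element figures above follow from the standard sizing formula for an optimally configured Bloom filter, m/n = -ln(p) / (ln 2)^2. A small sketch to reproduce them (the function names are illustrative only):

```rust
/// Optimal Bloom filter bits per element for a target false positive rate:
/// m/n = -ln(p) / (ln 2)^2.
fn bloom_bits_per_element(fpr: f64) -> f64 {
    -fpr.ln() / std::f64::consts::LN_2.powi(2)
}

/// Total filter size in bytes for `n` elements at the given rate.
fn bloom_bytes(n: u64, fpr: f64) -> f64 {
    n as f64 * bloom_bits_per_element(fpr) / 8.0
}
```

Calling `bloom_bits_per_element(0.01)` gives roughly 9.6 bits and `bloom_bits_per_element(0.0001)` roughly 19.2 bits, matching the estimates used in this section. Halving the false positive rate costs about 1.44 more bits per element, but no finite size reaches zero.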

### Approach 3: Bounded Sliding Window HashSet (Recommended)

**How it works:** Maintain a bounded `HashSet<[u8; 32]>` covering only the last N seconds of events. Webhook retries typically arrive within seconds, not minutes. With two sets rotated every 30 seconds, each hash is retained for 30 to 60 seconds, which captures virtually all retries while bounding memory.

**Implementation:** Two HashSets, alternating every 30 seconds (double-buffering):
```rust
use std::collections::HashSet;
use std::time::{Duration, Instant};

struct DedupWindow {
    current: HashSet<[u8; 32]>,
    previous: HashSet<[u8; 32]>,
    rotation_time: Instant,
    window_duration: Duration, // 30 seconds
}

impl DedupWindow {
    fn check_and_insert(&mut self, hash: [u8; 32]) -> bool {
        if Instant::now() - self.rotation_time > self.window_duration {
            std::mem::swap(&mut self.current, &mut self.previous);
            self.current.clear();
            self.rotation_time = Instant::now();
        }
        // Check both windows
        if self.current.contains(&hash) || self.previous.contains(&hash) {
            return true; // duplicate
        }
        self.current.insert(hash);
        false // new event
    }
}
```

**Memory cost:** Two windows of 30 seconds each. At 100K events/sec: 3M entries per window * 56 bytes * 2 = 336 MB worst case. At 10K events/sec: 300K * 56 * 2 = 33.6 MB. At the expected median of ~50K events/sec: ~168 MB.

**Hashing cost:** The per-batch BLAKE3 checksum covers the whole batch, so it cannot identify an individual duplicate event; dedup needs its own per-event hash. `blake3::hash(&event_bytes)` takes ~50ns for a 21-byte input. At 100K events/sec, that is 5ms/sec of hashing -- negligible.

**Further optimization:** Use a `HashSet<[u8; 16]>` with truncated hashes (first 16 bytes of BLAKE3). By the birthday bound (n^2 / 2^129), the collision probability for 16-byte hashes at 3M entries is ~1.3 * 10^-26. Effectively zero. Memory drops to ~40 bytes per entry: 3M * 40 * 2 = 240 MB at 100K/sec, or 24 MB at 10K/sec.

**Even further:** Use `HashSet<u128>` (the first 128 bits of the BLAKE3 hash). Each entry: 16 bytes + ~24 bytes HashSet overhead = 40 bytes. The set's internal hasher still re-hashes every key: a faster hasher such as `ahash` reduces that cost, and a pass-through (identity) hasher could avoid it entirely, since BLAKE3 output is already uniformly distributed. (Note: Rust's `HashSet` with the default `RandomState` also performs well with uniformly distributed keys like hash digests, so this is a tuning option rather than a requirement.)
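
The collision estimate for truncated hashes can be checked with the usual birthday approximation, p ≈ n^2 / 2^(b+1) for n entries and b-bit hashes. A minimal sketch (the function name is illustrative):

```rust
/// Birthday-bound approximation of the probability that any two of `n`
/// uniformly random `hash_bits`-bit values collide: n^2 / 2^(hash_bits + 1).
/// Only valid while the result is much smaller than 1.
fn collision_probability(n: u64, hash_bits: i32) -> f64 {
    let n = n as f64;
    n * n / 2f64.powi(hash_bits + 1)
}
```

For 3M entries and 128-bit hashes this evaluates to about 1.3 * 10^-26. Even with both windows full (6M entries) it stays near 5 * 10^-26, so truncation does not meaningfully weaken the zero-false-positive property.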

**Strengths:**
- Zero false positives (exact dedup)
- Bounded memory (double-buffer with rotation)
- Covers the retry window (webhook retries are seconds, not minutes)
- Natural alignment with checkpoint interval

**Weaknesses:**
- Duplicates arriving more than 30-60 seconds after the original (depending on where the original fell in the rotation cycle) will not be deduped
- Memory is still proportional to write rate (but bounded by window size)
- At 100K events/sec sustained, memory is non-trivial

### Approach 4: WAL Scan at Startup Only

**How it works:** On startup, scan the WAL from the last checkpoint and build the dedup set. During operation, maintain the in-memory set.

**Evaluation:** This is not an alternative to Approach 3 -- it is a complement. On startup, the dedup window must be reconstructed by scanning the WAL. The scan itself takes ~8ms for 63 MB of WAL (per the recovery time estimate above); re-hashing each replayed event to repopulate the window adds ~50ns per event, or roughly 150ms for 3M events. After startup, the in-memory set is maintained incrementally.

**Verdict:** Use WAL scan at startup to initialize the dedup window. This is required regardless of which in-memory approach is chosen.

### Comparison Table

| Criterion | Full HashSet | Bloom Filter (1%) | Bloom Filter (0.01%) | Bounded Window (Recommended) |
|---|---|---|---|---|
| **Memory at 10K/s** | 16.8 MB | 360 KB | 720 KB | 24 MB (two windows, truncated hash) |
| **Memory at 100K/s** | 168 MB | 3.6 MB | 7.2 MB | 240 MB (two windows, truncated hash) |
| **False positives** | Zero | 1% (unacceptable) | 0.01% (unacceptable) | Zero |
| **Late duplicates (>60s)** | Caught | Caught (within filter lifetime) | Caught | Missed (acceptable) |
| **Implementation complexity** | Low | Moderate (sizing, rotation) | Moderate | Low-moderate |
| **Startup cost** | WAL scan | WAL scan + filter rebuild | WAL scan + filter rebuild | WAL scan |

**Recommendation: Bounded Sliding Window HashSet (Approach 3)** with truncated 128-bit hashes and 30-second double-buffer rotation. Zero false positives, bounded memory, covers the webhook retry window. Initialize from WAL scan at startup.

---

## Comparison Table: WAL Entry Formats

| Criterion | Length-Prefix (LevelDB) | Fixed-Size (64B) | Batch-Oriented (Recommended) |
|---|---|---|---|
| **Space per event** | ~28 bytes | 64 bytes | ~21.6 bytes |
| **Disk at 100K events/sec** | 242 GB/day | 553 GB/day | 187 GB/day |
| **Crash detection** | CRC per fragment | File size + hash | Length + BLAKE3 per batch |
| **Decode cost** | Low (per fragment) | Zero (offset math) | Low (per batch) |
| **Batch alignment** | No (per record) | No (per record) | Yes (native) |
| **Schema evolution** | Via type byte | None | Via version byte + reserved |
| **Implementation complexity** | Moderate (page fragmentation) | Trivial | Low (no fragmentation) |
| **Random access** | Sequential scan | O(1) by offset | Batch-level index |
| **Production precedent** | LevelDB, RocksDB, Prometheus | TigerBeetle (different scale) | Prometheus batch records |
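
The space figures for the batch-oriented format come from amortizing the 64-byte header over the events in a batch. A short sketch of that arithmetic, with illustrative function names:

```rust
/// Bytes on disk per event when a 64-byte batch header is shared by
/// `events_per_batch` tightly packed 21-byte events.
fn bytes_per_event(events_per_batch: u32) -> f64 {
    21.0 + 64.0 / events_per_batch as f64
}

/// Daily WAL volume in GB (10^9 bytes) at a sustained event rate.
fn gb_per_day(events_per_sec: f64, bytes_per_event: f64) -> f64 {
    events_per_sec * bytes_per_event * 86_400.0 / 1e9
}
```

At the full batch size of 100 this gives about 21.6 bytes per event and about 187 GB/day at 100K events/sec. When the 10ms timeout closes batches early (for example, 10 events per batch at 1K events/sec), the per-event cost rises to about 27.4 bytes, still below the fixed-size format.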

---

## Recommendation

Build a custom, batch-oriented WAL with the following design:

1. **Entry format:** Batch-oriented length-prefix framing (Approach 3 from Section 1). 64-byte cache-aligned header with magic bytes, version, event count, sequence number, payload length, and BLAKE3 hash. Tightly-packed 21-byte events. No page-boundary fragmentation.

2. **Implementation:** Custom Rust crate, `#![forbid(unsafe_code)]`, ~400-600 lines. No external WAL crate -- none meet the requirements.

3. **Group commit:** Dedicated writer thread with `crossbeam::channel::bounded` and `recv_deadline` (Pattern 4 from Section 3). Batch size 100, timeout 10ms. Single writer thread eliminates all file-level concurrency concerns.

4. **Crash detection:** BLAKE3 + length-prefix two-phase validation (Approach 3 from Section 4). Phase 1: verify header magic and payload length. Phase 2: verify BLAKE3 hash. Truncate at first invalid batch.

5. **Segments and rotation:** 16 MB segment files, named by first sequence number. New segment when current segment exceeds 16 MB.

6. **Checkpoint:** Write `checkpoint.meta` with the last-materialized sequence number. Delete all segments before checkpoint. Frequency: every 30 seconds.

7. **Deduplication:** Bounded sliding window `HashSet<u128>` (first 128 bits of per-event BLAKE3 hash). 30-second double-buffer rotation. Initialize from WAL scan at startup.

---

## Implementation Blueprint for @tidal-engineer

### Wire Format (Exact Byte Layout)

```
BATCH FRAME:
+=======================================================================+
| Offset | Size | Field               | Encoding         | Notes       |
+--------+------+---------------------+------------------+-------------+
| 0      | 4    | Magic               | 0x5449444C       | "TIDL" (BE) |
| 4      | 1    | Version             | u8               | Currently 1 |
| 5      | 1    | Flags               | u8               | Reserved (0)|
| 6      | 2    | Event count         | u16 LE           | 1..65535    |
| 8      | 8    | First sequence no.  | u64 LE           | Monotonic   |
| 16     | 8    | Batch timestamp     | u64 LE           | Nanos epoch |
| 24     | 4    | Payload byte length | u32 LE           | count * 21  |
| 28     | 4    | Reserved            | [0u8; 4]         | Future use  |
| 32     | 32   | BLAKE3 checksum     | [u8; 32]         | See below   |
+--------+------+---------------------+------------------+-------------+
| 64     | N*21 | Event records       | packed structs   | See below   |
+=======================================================================+
Header: 64 bytes (1 cache line)
Total:  64 + (event_count * 21) bytes

BLAKE3 INPUT:
  blake3( header_bytes[0..32] || event_bytes[0..N*21] )
  (Hash covers magic through reserved, then all event data)
  (The hash field itself at [32..64] is NOT included in the hash input)

EVENT RECORD (21 bytes each):
+=======================================================================+
| Offset | Size | Field        | Encoding  | Notes                     |
+--------+------+--------------+-----------+---------------------------+
| 0      | 8    | Entity ID    | u64 LE    | Item/User/Creator ID      |
| 8      | 1    | Signal type  | u8        | Enum variant index        |
| 9      | 4    | Weight       | f32 LE    | IEEE 754                  |
| 13     | 8    | Timestamp    | u64 LE    | Nanos since Unix epoch    |
+=======================================================================+
```

**Endianness rationale:** Little-endian throughout for event records and header numerics. This matches x86/ARM native byte order, avoiding byte-swap costs on the write and read paths. (Note: the magic bytes "TIDL" are written in their natural big-endian character order for human readability in hex dumps, but this is a fixed constant, not a numeric encoding decision.)
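
The 21-byte event record layout can be encoded and decoded with safe, std-only code. The following is a sketch of the table above, not the final `format.rs`; the field name `timestamp_nanos` and the method names are assumptions.

```rust
use std::convert::TryInto;

const EVENT_SIZE: usize = 21;

#[derive(Debug, Clone, Copy, PartialEq)]
struct EventRecord {
    entity_id: u64,
    signal_type: u8,
    weight: f32,
    timestamp_nanos: u64,
}

impl EventRecord {
    /// Packs the record into its 21-byte little-endian wire form.
    fn encode(&self) -> [u8; EVENT_SIZE] {
        let mut buf = [0u8; EVENT_SIZE];
        buf[0..8].copy_from_slice(&self.entity_id.to_le_bytes());
        buf[8] = self.signal_type;
        buf[9..13].copy_from_slice(&self.weight.to_le_bytes());
        buf[13..21].copy_from_slice(&self.timestamp_nanos.to_le_bytes());
        buf
    }

    /// Unpacks a record. The fixed-size input makes the slice conversions
    /// below infallible, so the unwraps cannot panic.
    fn decode(buf: &[u8; EVENT_SIZE]) -> Self {
        Self {
            entity_id: u64::from_le_bytes(buf[0..8].try_into().unwrap()),
            signal_type: buf[8],
            weight: f32::from_le_bytes(buf[9..13].try_into().unwrap()),
            timestamp_nanos: u64::from_le_bytes(buf[13..21].try_into().unwrap()),
        }
    }
}
```

Explicit `to_le_bytes` / `from_le_bytes` calls keep the format identical on any host, and they avoid the `unsafe` transmutes that a packed-struct cast would need.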

### Module Structure

```
tidal/src/wal/
  mod.rs          -- public API: WalWriter, WalReader, WalConfig
  writer.rs       -- GroupCommitWriter: channel, batch loop, fsync
  reader.rs       -- WalReader: sequential scan, replay iterator
  segment.rs      -- Segment file management: create, rotate, delete
  format.rs       -- BatchHeader, EventRecord: encode/decode
  checkpoint.rs   -- CheckpointManager: write/read checkpoint.meta
  dedup.rs        -- DedupWindow: bounded sliding-window HashSet
  error.rs        -- WalError enum
```

Estimated total: 400-600 lines of Rust.

### Group Commit Writer Design

```rust
use std::path::PathBuf;
use std::time::Duration;

pub struct WalConfig {
    pub dir: PathBuf,
    pub batch_size: usize,          // default: 100
    pub batch_timeout: Duration,    // default: 10ms
    pub segment_size: u64,          // default: 16 MB
    pub checkpoint_interval: Duration, // default: 30s
}

pub struct WalHandle {
    tx: crossbeam::channel::Sender<WalCommand>,
    thread: Option<std::thread::JoinHandle<()>>,
}

enum WalCommand {
    Append {
        event: SignalEvent,
        // `oneshot` is a placeholder reply channel; see Open Question 1.
        result: oneshot::Sender<Result<u64, WalError>>,
    },
    Shutdown,
}

impl WalHandle {
    /// Append a signal event. Blocks until the event is durably committed.
    /// Returns the assigned sequence number.
    pub fn append(&self, event: SignalEvent) -> Result<u64, WalError> {
        let (tx, rx) = oneshot::channel();
        self.tx.send(WalCommand::Append { event, result: tx })?;
        rx.recv()?
    }

    /// Graceful shutdown: flush remaining events, fsync, close.
    pub fn shutdown(mut self) -> Result<(), WalError> {
        let _ = self.tx.send(WalCommand::Shutdown);
        if let Some(thread) = self.thread.take() {
            thread.join().map_err(|_| WalError::ShutdownFailed)?;
        }
        Ok(())
    }
}
```
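
The writer thread's "100 events or 10ms, whichever first" loop is not shown above. The sketch below illustrates it using `std::sync::mpsc` and `recv_timeout` as a std-only stand-in for `crossbeam::channel` and `recv_deadline`; the function name `collect_batch` is hypothetical, and the real writer would follow each batch with encode, write, fsync, and per-caller replies.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::{Duration, Instant};

/// Collects up to `batch_size` items, waiting at most `timeout` after the
/// first item arrives. Returns None once the channel is closed and drained.
fn collect_batch<T>(rx: &Receiver<T>, batch_size: usize, timeout: Duration) -> Option<Vec<T>> {
    // Block without a deadline for the first item: an idle WAL should sleep.
    let first = rx.recv().ok()?;
    // The timer starts when the batch opens, not when the loop was entered.
    let deadline = Instant::now() + timeout;
    let mut batch = Vec::with_capacity(batch_size);
    batch.push(first);

    while batch.len() < batch_size {
        let remaining = deadline.saturating_duration_since(Instant::now());
        match rx.recv_timeout(remaining) {
            Ok(item) => batch.push(item),
            // Deadline hit or all senders gone: commit what we have.
            Err(RecvTimeoutError::Timeout) | Err(RecvTimeoutError::Disconnected) => break,
        }
    }
    Some(batch)
}
```

Starting the deadline at the first event bounds the added latency for that event at `batch_timeout`, while bursts fill the batch and commit immediately without waiting for the timer.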

### Recovery Procedure

```
On startup:
1. Read checkpoint.meta -> last_checkpoint_seq
2. Identify WAL segments with events after last_checkpoint_seq
3. For each segment, in order:
   a. Read 64-byte batch header
   b. Verify magic bytes == "TIDL"
   c. Verify version == 1
   d. Verify offset + 64 + payload_length <= file_length
   e. Read payload
   f. Compute BLAKE3(header[0..32] || payload)
   g. If hash matches: yield events for replay, advance offset
   h. If hash fails: truncate file at previous batch boundary, stop
4. Populate DedupWindow from replayed events
5. Resume normal operation
```
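
Steps 3a through 3e (the structural checks that run before any hashing) might look like the sketch below. The names `parse_frame`, `Frame`, and `FrameError` are hypothetical. The BLAKE3 comparison in steps 3f and 3g is left to the caller, which would hash `hashed_header` followed by `payload` and compare the result with `expected_hash`.

```rust
use std::convert::TryInto;

const HEADER_SIZE: usize = 64;
const EVENT_SIZE: usize = 21;
const MAGIC: &[u8; 4] = b"TIDL";

#[derive(Debug, PartialEq)]
enum FrameError { Truncated, BadMagic, BadVersion(u8), BadLength }

/// A structurally valid frame; the hash check (phase 2) is still pending.
#[derive(Debug)]
struct Frame<'a> {
    event_count: u16,
    first_seq: u64,
    hashed_header: &'a [u8], // header[0..32], first part of the hash input
    expected_hash: &'a [u8], // header[32..64]
    payload: &'a [u8],
}

/// Phase 1: validate the batch frame at the start of `buf`, where `buf` is
/// the rest of the segment file from the current offset.
fn parse_frame(buf: &[u8]) -> Result<Frame<'_>, FrameError> {
    if buf.len() < HEADER_SIZE {
        return Err(FrameError::Truncated);
    }
    if buf[0..4] != MAGIC[..] {
        return Err(FrameError::BadMagic);
    }
    if buf[4] != 1 {
        return Err(FrameError::BadVersion(buf[4]));
    }
    let event_count = u16::from_le_bytes(buf[6..8].try_into().unwrap());
    let first_seq = u64::from_le_bytes(buf[8..16].try_into().unwrap());
    let payload_len = u32::from_le_bytes(buf[24..28].try_into().unwrap()) as usize;
    // The wire format fixes payload length at event_count * 21, count >= 1.
    if event_count == 0 || payload_len != event_count as usize * EVENT_SIZE {
        return Err(FrameError::BadLength);
    }
    let end = HEADER_SIZE + payload_len;
    if buf.len() < end {
        return Err(FrameError::Truncated); // torn write at the tail
    }
    Ok(Frame {
        event_count,
        first_seq,
        hashed_header: &buf[0..32],
        expected_hash: &buf[32..64],
        payload: &buf[HEADER_SIZE..end],
    })
}
```

Any error here is treated the same way as a hash mismatch in step 3h: replay stops and the file is truncated at the last batch boundary that passed both phases.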

### Dedup Window

```rust
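use std::collections::HashSet;
use std::time::{Duration, Instant};

pub struct DedupWindow {
    current: HashSet<u128>,
    previous: HashSet<u128>,
    rotation_time: Instant,
    window: Duration,
}

impl DedupWindow {
    pub fn new(window: Duration) -> Self {
        Self {
            current: HashSet::new(),
            previous: HashSet::new(),
            rotation_time: Instant::now(),
            window,
        }
    }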

    /// Returns true if the event is a duplicate.
    pub fn check_and_insert(&mut self, event_bytes: &[u8]) -> bool {
        self.maybe_rotate();
        let hash = u128::from_le_bytes(
            blake3::hash(event_bytes).as_bytes()[..16]
                .try_into().unwrap()
        );
        if self.current.contains(&hash) || self.previous.contains(&hash) {
            return true;
        }
        self.current.insert(hash);
        false
    }

    fn maybe_rotate(&mut self) {
        if self.rotation_time.elapsed() > self.window {
            std::mem::swap(&mut self.current, &mut self.previous);
            self.current.clear();
            self.rotation_time = Instant::now();
        }
    }
}
```

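**Memory at 10K events/sec:** ~300K entries per window * 16 bytes * 2 windows = 9.6 MB of keys, ~19 MB including HashSet overhead
**Memory at 100K events/sec:** ~3M entries per window * 16 bytes * 2 windows = 96 MB of keys, ~144 MB including HashSet overhead

These estimates assume less per-entry overhead than the conservative 40 bytes per entry used in Section 6 (24 MB and 240 MB). Measure actual usage before relying on either figure.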
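
One way to sanity-check these estimates is to model the table layout directly. The sketch below assumes the layout used by the `hashbrown` implementation behind `std::collections::HashSet`: a power-of-two bucket count, a maximum load factor of 7/8, and one control byte per bucket. These are implementation details that can change between Rust versions, so treat the result as an approximation; the function name is illustrative.

```rust
/// Approximate heap bytes for a HashSet holding `entries` keys of
/// `key_size` bytes, under the assumed hashbrown-style layout.
fn hashset_bytes(entries: u64, key_size: u64) -> u64 {
    // Smallest bucket count that keeps the load factor at or below 7/8...
    let min_buckets = (entries * 8 + 6) / 7; // ceil(entries * 8 / 7)
    // ...rounded up to a power of two.
    let buckets = min_buckets.next_power_of_two();
    // Each bucket stores one key plus one control byte.
    buckets * (key_size + 1)
}
```

Under these assumptions, 3M `u128` keys need about 71 MB per window (about 143 MB for two), and 300K keys about 9 MB per window (about 18 MB for two), which is in line with the figures above. Because `clear()` keeps the allocation, steady-state memory is set by the busiest window seen so far.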

### Dependencies Required

```toml
[dependencies]
blake3 = "1"                    # already planned per CODING_GUIDELINES.md
crossbeam = "0.8"               # default features include crossbeam::channel
```

No other new dependencies required. `crossbeam` is a widely-audited, actively-maintained crate (173M+ downloads). It uses some `unsafe` internally for lock-free data structures, but this is well-reviewed and the WAL code itself remains `#![forbid(unsafe_code)]`.

### Performance Estimates

| Metric | 10K events/sec | 100K events/sec |
|---|---|---|
| **Batch rate** | 100/sec | 1,000/sec |
| **WAL write throughput** | 210 KB/sec | 2.1 MB/sec |
| **fsync rate** | 100/sec | 1,000/sec |
| **WAL growth between checkpoints** | ~6.3 MB | ~63 MB |
| **Recovery time** | <5ms | <10ms |
| **Dedup memory** | ~19 MB | ~144 MB |
| **BLAKE3 hashing cost** | ~0.5ms/sec | ~5ms/sec |
| **Per-event amortized latency** | ~100us (dominated by fsync wait) | ~10us |

---

## Open Questions

1. **oneshot channel implementation.** The blueprint uses `oneshot::Sender` for per-event notification. Should this be `tokio::sync::oneshot` (adds tokio dependency), `crossbeam`'s internal mechanism, or a custom `Arc<(Mutex<Option<Result>>, Condvar)>`? The simplest zero-dependency option is a `std::sync::mpsc::sync_channel(1)` (a bounded channel with capacity 1; `mpsc::channel()` is unbounded), since each writer waits on exactly one response. Benchmark the overhead.

2. **Segment pre-allocation.** OkayWAL and PostgreSQL both pre-allocate segment files (`fallocate` / `ftruncate`) to avoid filesystem metadata updates during writes. This can improve write throughput by 10-30% on some filesystems. Should tidalDB pre-allocate 16 MB segments? Benchmark on macOS (APFS) and Linux (ext4, XFS).

3. **WAL compression.** At 100K events/sec, the WAL writes 2.1 MB/sec to disk. Signal events have low entropy (many entity IDs repeat, signal types are from a small enum). LZ4 or ZSTD compression could reduce this 2-4x. However, compression adds CPU cost and complicates partial-write recovery. Defer until disk bandwidth is a measured bottleneck.

4. **Multi-batch fsync.** The current design fsyncs once per batch. At 100K events/sec with batch_size=100, that is 1,000 fsyncs/sec. Some workloads may benefit from accumulating multiple batches before fsync (e.g., fsync every 5ms regardless of batch count). This is a tuning knob, not an architectural decision -- but it should be measured.

5. **DedupWindow memory at extreme write rates.** At 100K events/sec sustained, the dedup window uses ~144 MB. If this is too much, consider: (a) shorter window (10s instead of 30s), (b) sampling (only dedup the first N events per entity per second), or (c) a probabilistic approach with a very low FPR (0.001%) counting filter that flags "possible duplicate" for a slower exact check. Benchmark memory pressure under sustained 100K/sec.

6. **Sequence number overflow.** The WAL uses u64 sequence numbers. At 100K events/sec sustained, overflow occurs after ~5.8 million years. Not a concern in practice, but the implementation should still use checked arithmetic when advancing the sequence: a panic on overflow is better than silent wrapping.

7. **Batch append atomicity.** The WAL writer thread writes the batch header + payload in a single `write()` call. If `write()` returns a short write (possible on some systems under memory pressure), the batch is partially written. The implementation should use `write_all()`, which retries until the whole buffer is written, and rely on the length check and BLAKE3 hash to detect a torn batch if the process crashes mid-write.

8. **Interaction with fjall's internal WAL.** fjall has its own journal for memtable durability. tidalDB's signal WAL is a separate file. During crash recovery, both must be replayed consistently. The ordering is: replay signal WAL -> reconstruct in-memory signal state -> verify consistency with fjall's entity store. Document and test this interaction explicitly.
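
For open question 6, checked arithmetic makes the overflow case explicit at the one place where the sequence advances. The helper names below are hypothetical; the second function only reproduces the time-to-overflow arithmetic.

```rust
/// Returns the first sequence number of the next batch, or None if adding
/// `event_count` would overflow u64. The caller decides whether to panic
/// or surface an error; it must never wrap silently.
fn next_first_seq(first_seq: u64, event_count: u16) -> Option<u64> {
    first_seq.checked_add(u64::from(event_count))
}

/// Years until a u64 sequence space is exhausted at a sustained event rate.
fn years_until_overflow(events_per_sec: f64) -> f64 {
    const SECONDS_PER_YEAR: f64 = 365.25 * 24.0 * 3600.0;
    (u64::MAX as f64) / events_per_sec / SECONDS_PER_YEAR
}
```

At 100K events/sec, `years_until_overflow` evaluates to roughly 5.8 million years, so the `None` branch exists for correctness rather than as an expected runtime path.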

---

## Sources

### WAL Format Design
- Ghemawat, S. and Dean, J. "LevelDB Log Format." [github.com/google/leveldb/blob/main/doc/log_format.md](https://github.com/google/leveldb/blob/main/doc/log_format.md)
- Facebook. "RocksDB Write Ahead Log File Format." [github.com/facebook/rocksdb/wiki/Write-Ahead-Log-File-Format](https://github.com/facebook/rocksdb/wiki/Write-Ahead-Log-File-Format)
- Prometheus. "TSDB WAL Format." [github.com/prometheus/prometheus/blob/main/tsdb/docs/format/wal.md](https://github.com/prometheus/prometheus/blob/main/tsdb/docs/format/wal.md)
- Vernekar, G. "Prometheus TSDB (Part 2): WAL and Checkpoint." [ganeshvernekar.com/blog/prometheus-tsdb-wal-and-checkpoint/](https://ganeshvernekar.com/blog/prometheus-tsdb-wal-and-checkpoint/)

### Crash Recovery and Partial Writes
- UnisonDB. "Building a Corruption-Proof Write-Ahead Log in Go." [unisondb.io/blog/building-corruption-proof-write-ahead-log-in-go/](https://unisondb.io/blog/building-corruption-proof-write-ahead-log-in-go/)
- "Torn Write Detection and Protection." [transactional.blog/blog/2025-torn-writes](https://transactional.blog/blog/2025-torn-writes)
- PostgreSQL Documentation. "WAL Internals." [postgresql.org/docs/current/wal-internals.html](https://www.postgresql.org/docs/current/wal-internals.html)

### Checkpoint and Truncation
- PostgreSQL Documentation. "WAL Configuration." [postgresql.org/docs/current/wal-configuration.html](https://www.postgresql.org/docs/current/wal-configuration.html)
- SQLite. "WAL-mode File Format." [sqlite.org/walformat.html](https://sqlite.org/walformat.html)
- TigerBeetle. "Architecture (Internals)." [github.com/tigerbeetle/tigerbeetle/blob/main/docs/internals/ARCHITECTURE.md](https://github.com/tigerbeetle/tigerbeetle/blob/main/docs/internals/ARCHITECTURE.md)

### Group Commit
- MySQL. "WL#5223: Group Commit of Binary Log." [dev.mysql.com/worklog/task/?id=5223](https://dev.mysql.com/worklog/task/?id=5223)
- Percona. "Group Commit and Real fsync." [percona.com/blog/2006/05/03/group-commit-and-real-fsync/](https://www.percona.com/blog/2006/05/03/group-commit-and-real-fsync/)
- MariaDB. "Aria Group Commit." [mariadb.com/docs/server/server-usage/storage-engines/aria/aria-group-commit](https://mariadb.com/docs/server/server-usage/storage-engines/aria/aria-group-commit)
- MariaDB. "Group Commit for the Binary Log." [mariadb.com/kb/en/group-commit-for-the-binary-log](https://mariadb.com/kb/en/group-commit-for-the-binary-log)

### Rust WAL Crates
- OkayWAL. [github.com/khonsulabs/okaywal](https://github.com/khonsulabs/okaywal) (last commit Nov 2023)
- commitlog. [github.com/zowens/commitlog](https://github.com/zowens/commitlog)
- walcraft. [github.com/RustyFarmer101/walcraft](https://github.com/RustyFarmer101/walcraft)
- walrus-rust. [crates.io/crates/walrus-rust](https://crates.io/crates/walrus-rust)
- BonsaiDB Blog. "Introducing OkayWAL." [bonsaidb.io/blog/introducing-okaywal/](https://bonsaidb.io/blog/introducing-okaywal/)

### BLAKE3
- BLAKE3 Team. "BLAKE3: One function, fast everywhere." [github.com/BLAKE3-team/BLAKE3](https://github.com/BLAKE3-team/BLAKE3/)
- "BLAKE3 slower than SHA-256 for small inputs." [forum.solana.com/t/blake3-slower-than-sha-256-for-small-inputs/829](https://forum.solana.com/t/blake3-slower-than-sha-256-for-small-inputs/829)

### Deduplication
- "Advanced Bloom Filter Based Algorithms for Efficient Approximate Data De-Duplication in Streams." [arxiv.org/abs/1212.3964](https://arxiv.org/abs/1212.3964)
- "Sliding Bloom Filters." Springer, 2013. [link.springer.com/chapter/10.1007/978-3-642-45030-3_48](https://link.springer.com/chapter/10.1007/978-3-642-45030-3_48)
- Sinha et al. "Efficient Cloud Data Deduplication with Blake3." IEEE, 2024. [ieeexplore.ieee.org/document/10607693](https://ieeexplore.ieee.org/document/10607693/)

### Channel and Concurrency
- crossbeam. [github.com/crossbeam-rs/crossbeam](https://github.com/crossbeam-rs/crossbeam)
- "Rust Channel Comparison Table." [codeandbitters.com/rust-channel-comparison/](https://codeandbitters.com/rust-channel-comparison/)

### Database Internals
- Fjall. [github.com/fjall-rs/fjall](https://github.com/fjall-rs/fjall)
- Redb. [github.com/cberner/redb](https://github.com/cberner/redb)
- Comer, A. "Build a Database Pt. 3: Write Ahead Log." [adambcomer.com/blog/simple-database/wal/](https://adambcomer.com/blog/simple-database/wal/)
- Vieira, V.K. and G.M.D. "Design and Reliability of a User Space Write-Ahead Log in Rust." arXiv:2507.13062, 2025. [arxiv.org/abs/2507.13062](https://arxiv.org/abs/2507.13062)

### tidalDB Internal References
- `thoughts.md` -- Citadel quarantine journal, Engram WAL, StemeDB WAL patterns
- `docs/research/tidaldb_signal_ledger.md` -- Signal storage architecture, checkpoint intervals
- `CODING_GUIDELINES.md` -- Dependency policy, unsafe code policy, testing requirements