jordan 55349845d0 refactor: Split all files to enforce 500-line max

Break monolith source files into focused modules:
- stemedb-core/types.rs → types/ directory (assertion, source, gold_standard, etc.)
- stemedb-storage: audit_store, quota_store, trust_rank_store, vector_index, vote_store → module directories
- stemedb-ingest/worker.rs → worker/ with separate test modules
- stemedb-query: engine, materializer, query → module directories
- stemedb-lens: epoch_aware, skeptic → module directories
- stemedb-sim/lib.rs → agent, arenas/, helpers, runner, strategy, types
- stemedb-api/tests: integration_tests → http_basic, http_validation, http_epoch, http_pipeline
- stemedb-api/tests: e2e_flow_test → e2e_full_pipeline, e2e_lens_resolution
- stemedb-query/tests: e2e_pipeline → e2e_pipeline + e2e_decay

Also adds new features: gold standard verification, escalation handlers,
admin endpoints, concept hierarchy spec, arena roadmap, and Go SDK.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2026-02-02 01:13:45 -07:00

46 KiB

Raw Permalink Blame History

WAL + Crash Recovery Research: Production Best Practices for Distributed Rust Databases

Author: Martin Kleppmann persona Date: 2026-02-01 Context: Episteme (StemeDB) append-only knowledge graph transitioning to distributed CockroachDB/Spanner-style architecture Current: Custom WAL with BLAKE3, sled KV store, Immediate/Batched/Eventual durability

Executive Summary

Key Findings

WAL Design Patterns: Modern databases use specialized log-structured storage (Pebble, Raft Engine) with group commit batching for 4-25% throughput improvement over naive fsync-per-write.
Crash Recovery: ARIES remains the gold standard for mutable databases, but append-only systems like Episteme can use simpler recovery: validate checksums, truncate incomplete records, resume from last good offset.
Distributed WAL: In Raft-based systems, the Raft log IS the WAL for state machine operations. Separate application-level WAL creates write amplification. TiKV eliminated their duplicate WAL in 5.4 with Raft Engine.
Rust Ecosystem: redb (pure Rust, 1.0 stable) is the production-ready embedded KV. sled is abandoned. For distributed, use raft-rs + custom log-structured storage inspired by Raft Engine.
io_uring Readiness: Not production-ready for WAL writes. 2026 research shows fsync remains blocking in io_uring, requiring fallback worker threads. PostgreSQL still uses traditional fsync for WAL durability.

Recommendations for Episteme

Short Term (Pre-Distributed):

Replace sled with redb (pure Rust, stable, ACID, copy-on-write B-trees)
Implement group commit in existing WAL (batch fsync for 50%+ latency reduction)
Add CRC32C checksums alongside BLAKE3 (hardware-accelerated torn write detection)
Implement full crash recovery scan: checksum validation, truncate incomplete records

Long Term (Distributed):

Use Raft log AS the WAL - don't maintain separate application WAL
Build log-structured storage engine for Raft log inspired by TiKV's Raft Engine
Per-range Raft logs with snapshot-based truncation after compaction
Fuzzy checkpoints for background compaction without blocking writes

1. WAL Design Patterns in Production Databases

1.1 CockroachDB Pebble Storage Engine

Architecture:

RocksDB/LevelDB-inspired LSM tree written in Go
Each write: Raft log → Pebble WAL → Memtable → SSTable (2x write amplification)
Commit pipeline uses "commit queue" for concurrent memtable writes after WAL sync

WAL Features:

WAL recycling: Old log files repurposed instead of deleted (reduces allocation overhead)
WAL failover (2026): Secondary disk for WAL with 100ms stall detection, automatic failover
Group commit: Batches multiple writes into single fsync (standard in all modern databases)

Write Amplification:

"When values are committed, CockroachDB writes them to the write-ahead log (WAL) and then to SSTables during flushes, and compactions rewrite those SSTables multiple times." - CockroachDB Storage Layer

Lessons:

Separate Raft log and storage engine WAL is intentional: Raft log = replication, Pebble WAL = crash recovery for uncommitted memtable
WAL recycling reduces filesystem metadata overhead
WAL failover addresses the "disk stall death spiral" problem

1.2 TiKV Raft Engine

The Problem with RocksDB WAL:

"TiKV essentially has two WAL: one as raft log stored in raftdb, one as rocksdb WAL. Theoretically the rocksdb WAL could be removed." - TiKV Issue #9155

TiKV had double WAL: Raft log (consensus) + RocksDB WAL (crash recovery). Both writing to disk created massive write amplification.

Raft Engine Solution (TiDB 5.4, 2021):

Log-structured storage engine inspired by BitCask
Raft log IS the storage - no separate application WAL
Active log file for sequential writes, MemTable hash index for reads
Performance gains: +4% throughput, -20% tail latency, -25-40% write I/O, -12% CPU

Architecture:

"On the disk, write requests, both the key and the actual data, are sequentially written to the active append-only log file. When a configurable threshold is reached, a new file is created. MemTable is an in-memory hash table that contains log keys and the page location (offset) of that key." - Raft Engine: A Log-Structured Embedded Storage Engine

Lessons:

Eliminate redundant WALs - Raft log already provides durability and ordering
Log-structured storage with hash index is optimal for Raft logs (no range scans needed)
Write amplification reduced by not sorting Raft log entries into LSM tree

1.3 FoundationDB WAL

Unbundled Architecture:

Transaction System (TS): in-memory transaction processing
Log System (LS): distributed WAL for TS
Storage System (SS): separate, pulls from WAL

Unique Replication Approach:

"FoundationDB does not use consensus algorithms such as Paxos or Raft for replicating the transaction logs. Instead, it simply requires synchronously writing to ALL replicas of a transaction log." - Distributed Transactions in FoundationDB

This allows f+1 replicas instead of 2f+1 (tolerates f failures with fewer replicas than Raft). Requires fast failure detection and reconfiguration.

Commit Path:

Proxy appends transaction to LogServers
All LogServers fsync and acknowledge
Proxy returns success to client
StorageServers asynchronously pull from LogServers

Lessons:

Unbundled architecture separates concerns: WAL durability vs data serving
Synchronous all-replica writes work when failure detection is fast (<100ms)
Episteme's append-only model fits this pattern: WAL = source of truth, KV = index

1.4 SQLite WAL Mode

Traditional Journal Mode Problems:

Requires writing pages twice (journal + database file)
Write-ahead logging eliminates double writes

SQLite WAL Format:

"Every WAL file starts with either 0x377f0682 or 0x377f0683 which indicate whether checksums in the file are in little-endian or big-endian format." - SQLite WAL Mode File Format

Frame Structure:

Frame header: page number, commit marker, checksums (2x 32-bit)
8-byte checksum computed incrementally from WAL header + frame data
Checksum is not cryptographic - custom algorithm processing 32-bit chunks

Recovery Behavior:

"When a frame is found to have a missing or invalid checksum, SQLite drops that frame and all the subsequent frames. This is intentional behavior, though it can result in data loss even for frames that are not actually corrupt." - PSA: SQLite WAL checksums fail silently

Lessons:

Custom checksum algorithm achieves high performance but has collision vulnerabilities
Silent truncation on checksum failure is controversial but common (PostgreSQL does similar)
Multiple checkpoints (little-endian vs big-endian) enable cross-platform recovery

1.5 Group Commit / Batched fsync Patterns

The Problem:

"fsync() call is supposed to place data on the disk securely, which unless you have battery backed up cache would give you only 80-200 sequential fsync() calls per second depending on your hard drive speed." - Group commit and real fsync

SSDs: ~1000 fsync/sec. NVMe: ~10,000 fsync/sec. Still a bottleneck.

Group Commit Solution:

"Group commit allows multiple simultaneous transactions to fsync the log file once for all the transactions waiting." - How fsync and synchronous_commit Affect PostgreSQL Performance

Performance Impact:

fsync disabled: +58% TPS, -37% latency (unsafe)
Group commit: +40-50% TPS with durability guarantees
Async commit: similar performance, small window of data loss risk

Implementation Pattern:

Commit Queue:
1. Transaction T1 arrives, starts WAL write
2. Transactions T2, T3 arrive during T1's write
3. Single fsync covers T1, T2, T3
4. All three transactions acknowledged together

Lessons:

Group commit is mandatory for production WAL performance
Target: batch 10-100 writes into single fsync (adjust based on latency requirements)
Episteme's "Batched" durability mode should implement true group commit, not just delayed fsync

2. Crash Recovery State-of-the-Art

2.1 ARIES Recovery Algorithm

Still the Gold Standard (2026):

"ARIES has become a standard in modern DBMS for one simple reason: it's resilient, efficient, and battle-tested." - ARIES: A Transaction Recovery Method

Used in IBM Db2, Microsoft SQL Server, and many others.

Three Core Principles:

Write-ahead logging: Log must be written to stable storage before data pages
Repeating history during Redo: Bring system to exact state at crash, including uncommitted transactions
Logging changes during Undo: Undo operations are logged (CLRs - Compensation Log Records)

Three-Phase Recovery:

Analysis: Scan log to find dirty pages and active transactions
Redo: Replay all operations from checkpoint forward (including uncommitted)
Undo: Roll back uncommitted transactions

Why It's Complex: ARIES handles:

Fine-grained locking (page-level, row-level)
Steal policy (can evict uncommitted dirty pages)
No-force policy (don't flush dirty pages at commit)

Episteme Doesn't Need This:

Append-only: no in-place updates, no undo
No transactions spanning multiple operations
Single-writer per assertion (no concurrency within an assertion write)

Simplified Recovery for Append-Only:

1. Scan WAL from beginning (or last checkpoint)
2. For each record:
   a. Verify checksum
   b. If valid: apply to storage engine
   c. If invalid: truncate WAL here and stop
3. Resume normal operation from truncation point

2.2 Torn Write Protection

The Problem: Power loss during write can leave partial data on disk. 4KB page write may complete 2KB before crash.

Detection Strategies:

Checksum	Speed (GB/s)	Hardware Accelerated	Collision Resistance	Use Case
CRC32	1.17 cyc/8B	Yes (SSE 4.2)	Weak (easy to generate collisions)	Torn write detection
CRC32C	0.67 cyc/8B	Yes (SSE 4.2)	Better than CRC32	Database page checksums
xxHash	~10 GB/s	No	High (non-crypto)	Fast integrity checks
BLAKE3	~3 GB/s	Partial (SIMD)	Cryptographic	Content addressing + integrity

"CRC32c has served well for years but has some weaknesses - the collisions are easy to generate." - Selecting the next checksum for btrfs

Recommendation for Episteme:

Use CRC32C for torn write detection (hardware-accelerated, standard in databases)
Keep BLAKE3 for content addressing (assertion IDs, source hashes)
Write both checksums to WAL record: CRC32C for validation, BLAKE3 for identity

Record Format:

[magic:4][len:4][crc32c:4][blake3:32][payload:len]

Validate CRC32C during recovery, use BLAKE3 for deduplication.

2.3 Checkpoint Strategies

Sharp Checkpoint:

"A sharp checkpoint is accomplished by flushing all modified pages for committed transactions to disk, and writing down the log sequence number (LSN) of the most recent committed transaction - everything flushed to disk is consistent as of a single point in time." - MySQL InnoDB Checkpoints

Fuzzy Checkpoint:

"Fuzzy checkpoints, or non-blocking checkpoints, enable transactions to run against the database while the checkpoint is in progress and do not obtain locks, therefore having a minimal impact on other database activity." - Fuzzy Checkpoint

Trade-offs:

Sharp: Simple recovery, but blocks writes during checkpoint
Fuzzy: Complex recovery (need to replay log from fuzzy checkpoint start), but non-blocking

For Episteme:

Current approach is implicit sharp checkpoint: sled handles background flushing
Distributed version needs fuzzy checkpoints for per-range compaction without blocking

Raft Snapshot Pattern (fuzzy checkpoint for Raft logs):

1. Create snapshot of KV store at LSN N
2. Write snapshot to disk (non-blocking)
3. Once complete, truncate Raft log before LSN N
4. New replicas sync via snapshot + log tail

2.4 Recovery Time Objectives (RTO)

Industry Targets:

PostgreSQL: seconds to minutes (depends on WAL volume)
MySQL InnoDB: <10 seconds for warm restart
Cockroach/TiKV: <30 seconds per range (parallel recovery)

Factors Affecting RTO:

WAL size: larger log = longer replay
Checkpoint frequency: more frequent = shorter recovery
Parallelization: per-range recovery in distributed systems

Episteme Current State:

"Basic recovery: validate header and set offset to end" - journal.rs:90-103

Recovery Gaps:

❌ No checksum validation during recovery
❌ No truncation of partial records
❌ No replay of WAL entries (assumes KV store was flushed)
❌ No metrics on recovery time

Recommended Recovery Procedure:

fn recover(&mut self) -> Result<()> {
    let start = Instant::now();
    let mut valid_records = 0;
    let mut invalid_records = 0;
    let mut cursor = HEADER_SIZE as u64;

    loop {
        match Record::read_from(&mut reader) {
            Ok(record) => {
                // Verify checksum
                if !record.verify_checksum() {
                    warn!(offset = cursor, "Invalid checksum, truncating WAL");
                    file.set_len(cursor)?; // Truncate at last good record
                    break;
                }

                // Replay to KV store (idempotent)
                self.apply_to_kv_store(&record)?;

                cursor += record.disk_size();
                valid_records += 1;
            }
            Err(e) if e.is_eof() => break,
            Err(e) => {
                warn!(offset = cursor, error = %e, "Truncating at error");
                file.set_len(cursor)?;
                invalid_records += 1;
                break;
            }
        }
    }

    let duration = start.elapsed();
    info!(
        valid_records,
        invalid_records,
        recovery_time_ms = duration.as_millis(),
        "WAL recovery complete"
    );

    Ok(())
}

Metrics to Track:

wal_recovery_time_ms
wal_recovery_valid_records
wal_recovery_invalid_records
wal_recovery_truncated_bytes

3. Modern Rust Ecosystem

3.1 Embedded KV Stores

sled Status (2026):

"Sled has been abandoned for some years, the author was working on a new engine for it and in the last 5 days they started working again on sled proper." - Sled vs RocksDB discussion

The official README says:

"If reliability is your primary constraint, use SQLite. If storage price performance is your primary constraint, use RocksDB." - sled README

Production-Ready Alternatives:

Crate	Status	Pros	Cons	Recommendation
redb	1.0 stable (2023)	Pure Rust, ACID, copy-on-write B-trees	Newer, less battle-tested	✅ Best choice
RocksDB	Production (C++)	Battle-tested, used in TiKV/Cockroach	C++ dependency, complex API	Use for distributed
sled	Beta (abandoned?)	Pure Rust, familiar API	Unstable, no 1.0 release	❌ Avoid
lmdb	Production (C)	Mature, fast, memory-mapped	C dependency, licensing	Alternative to redb

redb Features:

"redb is stable and is an embedded key-value database written in pure Rust. It provides a similar interface to other embedded key-value stores such as rocksdb and lmdb, and it has comparable performance while still being memory-safe." - redb 1.0 release

ACID transactions
Copy-on-write B-trees (similar to LMDB)
Pure Rust (no unsafe C dependencies)
Single-file database (easy backup)
Stable API (1.0+ guarantees)

Migration Path:

Replace sled with redb in stemedb-storage/src/sled_backend.rs
Implement KVStore trait for RedbStore
Update Cargo.toml: remove sled, add redb = "2.1" (latest as of 2026)
Test crash recovery with redb (may have better durability guarantees than sled)

3.2 Rust WAL Crates

Available Libraries:

Crate	Status	Features	Notes
okaywal	Active	Generic WAL, fsync control	Not production-tested
walrus-rust	Active	Configurable consistency models	Limited adoption
wal-rs	Inactive	Basic WAL implementation	No recent updates
walr	Active	Tokio-based async WAL	io_uring dependency concerns
raft-wal (HashiCorp)	Production	Vault's Raft storage	Go, not Rust

None are production-proven in Rust databases.

Recommendation:

Keep custom WAL implementation in stemedb-wal
Improve existing code: group commit, CRC32C checksums, full recovery scan
For distributed version, study TiKV's Raft Engine (Rust) and implement similar log-structured storage

3.3 raft-rs from TiKV

The Core Raft Implementation:

"This Raft implementation in Rust includes the core Consensus Module only, not the other parts. You will need to build your own Log, State Machine and Transport components." - TiKV Raft Blog

What You Get:

Raft consensus algorithm
Leader election, log replication, safety guarantees
Well-tested (production use in TiKV since 2016)

What You Must Implement:

LogStore: Persistent storage for Raft log entries (this is the WAL)
StateStore: Persistent storage for Raft metadata (term, voted_for)
Transport: Network layer for sending messages between nodes

Integration Strategy:

Use raft-rs for consensus
Build log-structured storage for Raft log (inspired by TiKV's Raft Engine)
Raft log IS the WAL - don't maintain separate application-level WAL
KV store (redb) becomes the state machine

Per-Range Raft Logs: CockroachDB/TiKV pattern:

Each range (shard) has its own Raft group
Each Raft group has independent log
Log directory structure: {data_dir}/raft/{range_id}/{seq}.log
Compaction per-range after snapshot

3.4 io_uring for Async WAL Writes

Production Readiness (2026 Assessment):

"Write-ahead logging operations in database systems remain costly, and the standard fsync approach is blocking in io_uring and thus executed by fallback worker threads. Moreover, fsync cannot be issued by rings configured for IOPoll and must be used from a separate ring or as a traditional syscall." - io_uring for High-Performance DBMSs (PVLDB 2026)

PostgreSQL's Position:

"For WAL durability, PostgreSQL invokes fsync() directly (or via O_DATASYNC) rather than through io_uring." - Same paper

Rust io_uring Runtimes:

Runtime	Status	Production Use	Issues
tokio-uring	Inactive	Limited	Last update 2022, open issues
monoio (ByteDance)	Active	Internal at ByteDance	Maturity questions remain (2025)
rio (sled author)	Archived	None	No longer maintained

Key Problems:

fsync is blocking even in io_uring (requires worker threads)
Cancellation safety issues in async Rust + io_uring
Limited maturity of Rust io_uring libraries
"Good enough" epoll performance reduces motivation

Recommendation:

Do NOT use io_uring for WAL writes in Episteme
Use traditional blocking fsync with group commit (proven, simple, fast enough)
Reserve io_uring exploration for read path optimization (Phase 5+)

4. Distributed WAL Considerations

4.1 WAL per Range vs WAL per Node

WAL per Range (CockroachDB/TiKV pattern):

/data/raft/
  range-001/
    00000001.log
    00000002.log
  range-002/
    00000001.log

Pros:

Independent compaction per range
Range splits/merges don't affect other ranges
Parallel recovery (each range recovers independently)

Cons:

More file descriptors (3 ranges × 3 replicas = 9 open files)
More fsync calls (unless batched across ranges)

WAL per Node (single shared log):

/data/raft/
  00000001.log  (entries from all ranges, tagged with range_id)

Pros:

Single fsync covers all ranges (better group commit)
Fewer open file descriptors
Simpler log rotation

Cons:

Compaction must preserve entries for ranges with lagging replicas
Range-specific recovery is harder (must scan entire log)

Recommendation for Episteme:

Start with WAL per range for simplicity
Each range = one subject's assertions (natural partitioning)
Optimize to shared log in Phase 5 if fsync becomes bottleneck

4.2 Log Truncation After Raft Snapshot

The Problem: Raft log grows unbounded. Need to compact without breaking lagging replicas.

Snapshot-Based Truncation:

"When one replica falls too far behind, the raft log queue will truncate the raft log for the range in a way that forces that replica to be caught up via a snapshot." - CockroachDB Raft Snapshots

Process:

State machine reaches LSN 1000
Create snapshot of KV store state at LSN 1000
Write snapshot to disk
Once all replicas confirm LSN 1000 applied, truncate log before LSN 1000
New replicas sync via snapshot + log tail (LSN 1001+)

Truncation Strategy:

"Raft log truncation is controlled to ensure that if followers are online, the quota pool makes sure followers are always at comparable log positions, so a prefix of the log that can be truncated without causing a Raft snapshot should always be available." - CockroachDB Issue #12238

Episteme-Specific:

Snapshot = serialized Materialized Views + Indexes for range
Truncate Raft log after all replicas applied and snapshot stored
Lagging replica: send snapshot + recent log tail

Snapshot Frequency:

Target: every 10,000 log entries or 100 MB
More frequent = faster recovery, smaller log files
Less frequent = less compaction overhead

4.3 WAL as Raft Log (Shared vs Separate)

Anti-Pattern: Separate Application WAL + Raft Log

TiKV's original architecture:

Write → Raft Log (raftdb)  → fsync
     → RocksDB WAL (kvdb)   → fsync

Result: 2x write amplification, duplicate durability guarantees.

Correct Pattern: Raft Log IS the WAL

TiKV's Raft Engine architecture:

Write → Raft Log (Raft Engine) → fsync
     ↓
  State Machine (RocksDB, no separate WAL)

State machine applies committed Raft entries. No separate WAL needed.

Episteme Architecture (Distributed):

Current (Single Node):

POST /v1/assert
  ↓
WAL (stemedb-wal) → fsync
  ↓
IngestWorker → KV Store (sled)

Future (Distributed):

POST /v1/assert
  ↓
Raft Log (per-range) → fsync
  ↓
State Machine (redb KV Store, no separate WAL)
  ↓
Materialized Views

Implementation Notes:

Assertions are Raft log entries
Raft replicates to quorum before commit
State machine = IngestWorker applying committed entries to KV
No need for separate stemedb-wal in distributed mode

4.4 Raft Log Format for Episteme

Entry Structure:

struct RaftLogEntry {
    index: u64,          // Raft log index (LSN)
    term: u64,           // Raft term
    entry_type: EntryType,
    payload: Vec<u8>,    // Serialized Assertion/Vote/Epoch
}

enum EntryType {
    Assertion,
    Vote,
    Epoch,
    Snapshot,        // Raft snapshot marker
}

On-Disk Format (Log-Structured):

[magic:4][crc32c:4][index:8][term:8][type:1][len:4][payload:len]

Indexing: In-memory hash map: index → file_offset

Compaction: Rotate log file every 100 MB. After snapshot at index N:

// Delete files containing only entries < N
for file in log_files {
    if file.max_index < snapshot_index {
        fs::remove_file(file)?;
    }
}

5. What Episteme Should Change

5.1 Short-Term Improvements (Pre-Distributed)

Priority 1: Replace sled with redb

# Cargo.toml
[dependencies]
# sled = "0.34"  # Remove
redb = "2.1"     # Add

Benefits:

Pure Rust (no C dependencies)
Stable API (1.0+)
Better durability guarantees (copy-on-write B-trees)
Active maintenance

Migration:

Create crates/stemedb-storage/src/redb_backend.rs
Implement KVStore trait for RedbStore
Update lib.rs to export RedbStore
Test crash recovery (may need to adjust WAL replay logic)

Priority 2: Implement Group Commit

Current WAL code fsyncs after every write:

// journal.rs:58
guard.write(&buf)?;  // Implicit fsync if durability = Immediate

Group Commit Implementation:

struct GroupCommitBatcher {
    pending: Vec<(Record, oneshot::Sender<Result<u64>>)>,
    flush_interval: Duration,
    max_batch_size: usize,
}

impl GroupCommitBatcher {
    async fn append(&mut self, record: Record) -> Result<u64> {
        let (tx, rx) = oneshot::channel();
        self.pending.push((record, tx));

        if self.pending.len() >= self.max_batch_size {
            self.flush().await?;
        }

        rx.await?
    }

    async fn flush(&mut self) -> Result<()> {
        if self.pending.is_empty() {
            return Ok(());
        }

        // Write all records
        for (record, _) in &self.pending {
            self.file.write_all(&record.to_bytes())?;
        }

        // Single fsync for entire batch
        self.file.sync_all()?;

        // Notify all waiters
        for (record, tx) in self.pending.drain(..) {
            let _ = tx.send(Ok(record.offset));
        }

        Ok(())
    }
}

Tuning:

max_batch_size: 100 (batch up to 100 writes)
flush_interval: 10ms (flush every 10ms even if batch incomplete)
Adaptive batching: increase batch size under load

Expected Gains:

50-80% latency reduction (1 fsync per 10-100 writes)
40-60% throughput increase

Priority 3: Add CRC32C Checksums

Keep BLAKE3 for content addressing, add CRC32C for torn write detection:

struct Record {
    sequence: u64,
    crc32c: u32,        // Hardware-accelerated integrity check
    blake3: Hash,       // Content-addressed identity
    payload: Vec<u8>,
}

impl Record {
    fn new(payload: Vec<u8>) -> Self {
        let sequence = next_sequence();
        let blake3 = blake3::hash(&payload);
        let crc32c = crc32c_compute(&payload);  // Use crc32c crate

        Self { sequence, crc32c, blake3, payload }
    }

    fn verify_crc32c(&self) -> bool {
        crc32c_compute(&self.payload) == self.crc32c
    }
}

Priority 4: Implement Full Recovery Scan

Replace current stub recovery with checksum validation and truncation:

fn recover(&mut self) -> Result<()> {
    let path = self.current_file_path();
    if !path.exists() {
        return Ok(());
    }

    let mut file = OpenOptions::new().read(true).write(true).open(&path)?;
    let mut reader = BufReader::new(&file);

    // Validate header
    let header = FileHeader::read_from(&mut reader)?;
    if !header.verify_magic() {
        warn!("Invalid WAL magic, resetting file");
        file.set_len(0)?;
        return Ok(());
    }

    let mut cursor = HEADER_SIZE as u64;
    let mut valid_records = 0;

    loop {
        let offset_before = reader.stream_position()?;

        match Record::read_from(&mut reader) {
            Ok(record) => {
                // Verify CRC32C checksum
                if !record.verify_crc32c() {
                    warn!(
                        offset = offset_before,
                        "Invalid CRC32C checksum, truncating WAL"
                    );
                    file.set_len(offset_before)?;
                    self.current_offset = offset_before;
                    break;
                }

                // Optionally: replay to KV store for idempotency
                // self.replay_record(&record)?;

                cursor = reader.stream_position()?;
                valid_records += 1;
            }
            Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => {
                // Partial record at end of file, truncate
                warn!(
                    offset = offset_before,
                    "Incomplete record at EOF, truncating"
                );
                file.set_len(offset_before)?;
                self.current_offset = offset_before;
                break;
            }
            Err(e) => {
                return Err(QuarantineError::Io(path.clone(), e));
            }
        }
    }

    info!(
        valid_records,
        final_offset = self.current_offset,
        "WAL recovery complete"
    );

    Ok(())
}

Testing: Add fault injection tests:

#[test]
fn test_recovery_truncates_partial_record() {
    let dir = tempdir()?;
    let mut wal = Journal::open(&dir)?;

    // Write complete record
    wal.append(vec![1, 2, 3])?;

    // Manually write partial record (simulate crash)
    let mut file = OpenOptions::new()
        .write(true)
        .append(true)
        .open(dir.join("00000000.wal"))?;
    file.write_all(&[0xAA, 0xBB])?;  // Incomplete
    drop(file);

    // Recovery should truncate partial record
    let wal = Journal::open(&dir)?;
    assert_eq!(wal.current_offset, HEADER_SIZE + RECORD_1_SIZE);
}

5.2 Long-Term Architecture (Distributed)

Design Principles:

Raft log IS the WAL - no separate application-level WAL
Per-range Raft logs - each subject/shard has independent log
Log-structured storage - append-only files with hash index
Snapshot-based truncation - compact after state machine snapshot

Component Diagram:

┌─────────────────────────────────────────────────────────────────┐
│                        API Layer (axum)                         │
└─────────────────────────────────────────────────────────────────┘
                                ↓
┌─────────────────────────────────────────────────────────────────┐
│                    Raft Consensus (raft-rs)                     │
│  ┌───────────┐  ┌───────────┐  ┌───────────┐                   │
│  │ Range 001 │  │ Range 002 │  │ Range 003 │  ...              │
│  │ Leader    │  │ Follower  │  │ Leader    │                   │
│  └───────────┘  └───────────┘  └───────────┘                   │
└─────────────────────────────────────────────────────────────────┘
                                ↓
┌─────────────────────────────────────────────────────────────────┐
│              Raft Log Storage (log-structured)                  │
│  /data/raft/range-001/00001.log                                 │
│  /data/raft/range-002/00001.log                                 │
│  In-Memory: HashMap<LogIndex, FileOffset>                       │
└─────────────────────────────────────────────────────────────────┘
                                ↓
┌─────────────────────────────────────────────────────────────────┐
│        State Machine (redb KV + Indexes + MV cache)             │
│  Applies committed Raft log entries                             │
│  No separate WAL                                                │
└─────────────────────────────────────────────────────────────────┘

LogStore Implementation (for raft-rs):

/// Log-structured storage for Raft log entries
/// Inspired by TiKV's Raft Engine
pub struct RaftLogStore {
    data_dir: PathBuf,
    range_id: RangeId,

    // Active log file for writes
    active_log: File,
    active_log_seq: u64,

    // In-memory index: log_index → (file_seq, offset)
    index: HashMap<u64, (u64, u64)>,

    // Metadata
    first_index: u64,
    last_index: u64,
}

impl RaftLogStore {
    /// Append entries to active log file
    pub fn append(&mut self, entries: &[RaftLogEntry]) -> Result<()> {
        for entry in entries {
            let bytes = self.serialize_entry(entry)?;
            self.active_log.write_all(&bytes)?;

            let offset = self.active_log.stream_position()?;
            self.index.insert(entry.index, (self.active_log_seq, offset));
            self.last_index = entry.index;
        }

        // Group commit: single fsync for batch
        self.active_log.sync_all()?;

        // Rotate log file if over threshold
        if self.active_log.metadata()?.len() > 100_000_000 {
            self.rotate_log()?;
        }

        Ok(())
    }

    /// Read entry by log index
    pub fn get(&self, index: u64) -> Result<Option<RaftLogEntry>> {
        let (file_seq, offset) = match self.index.get(&index) {
            Some(loc) => *loc,
            None => return Ok(None),
        };

        let path = self.log_file_path(file_seq);
        let mut file = File::open(&path)?;
        file.seek(SeekFrom::Start(offset))?;

        let entry = self.deserialize_entry(&mut file)?;
        Ok(Some(entry))
    }

    /// Truncate log after creating snapshot
    pub fn compact(&mut self, snapshot_index: u64) -> Result<()> {
        // Remove all log files containing only entries < snapshot_index
        for seq in 0..self.active_log_seq {
            let path = self.log_file_path(seq);
            if self.file_max_index(seq)? < snapshot_index {
                fs::remove_file(&path)?;
                info!(file_seq = seq, "Deleted compacted log file");
            }
        }

        // Update in-memory index
        self.index.retain(|idx, _| *idx >= snapshot_index);
        self.first_index = snapshot_index;

        Ok(())
    }
}

State Machine Implementation:

/// Applies committed Raft log entries to KV store
pub struct EpistemeStateMachine {
    kv_store: RedbStore,
    index_store: GenericIndexStore<RedbStore>,
    materializer: Materializer,
}

impl StateMachine for EpistemeStateMachine {
    fn apply(&mut self, entry: RaftLogEntry) -> Result<()> {
        match entry.entry_type {
            EntryType::Assertion => {
                let assertion: Assertion = deserialize(&entry.payload)?;

                // Write to KV
                let hash = assertion.compute_hash();
                self.kv_store.put(&hash, &serialize(&assertion)?)?;

                // Update indexes
                self.index_store.index_assertion(&assertion, &hash)?;

                // Trigger materialization
                self.materializer.notify_new_assertion(&assertion)?;
            }
            EntryType::Vote => {
                let vote: Vote = deserialize(&entry.payload)?;
                self.vote_store.put_vote(&vote)?;
            }
            EntryType::Epoch => {
                let epoch: Epoch = deserialize(&entry.payload)?;
                self.epoch_store.put_epoch(&epoch)?;
            }
        }

        Ok(())
    }
}

Migration Path:

Phase 4.5: Prototype Raft integration in separate branch
Phase 5: Deploy single-region cluster (3 nodes, 1 range)
Phase 6: Multi-region with range splits (shard by subject hash)

6. Concrete Recommendations

Immediate Actions (Week 1-2)

1. Replace sled with redb

Effort: 2-3 days
Impact: Stable foundation, better durability
Files:
- Create crates/stemedb-storage/src/redb_backend.rs
- Update Cargo.toml
- Run full test suite

2. Add CRC32C checksums to WAL

Effort: 1 day
Impact: Torn write detection
Files:
- Update crates/stemedb-wal/src/format.rs
- Add crc32c = "0.6" to dependencies
- Update Record::new() and Record::verify()

3. Implement full recovery scan

Effort: 2 days
Impact: Production-grade crash recovery
Files:
- Update crates/stemedb-wal/src/journal.rs::recover()
- Add fault injection tests

Short Term (Month 1)

4. Implement group commit

Effort: 3-4 days
Impact: 50%+ latency reduction
Files:
- Create crates/stemedb-wal/src/group_commit.rs
- Update Journal::append() to use batching
- Benchmark before/after

5. Add recovery metrics

Effort: 1 day
Impact: Observability
Metrics:
- wal_recovery_time_ms
- wal_recovery_valid_records
- wal_recovery_invalid_records
- wal_recovery_truncated_bytes

6. Log rotation

Effort: 2 days
Impact: Bounded disk usage
Files:
- Update Journal::append() to rotate at 1 GB
- Implement log file cleanup after KV flush

Medium Term (Months 2-3)

7. Prototype Raft integration

Effort: 2-3 weeks
Impact: Foundation for distributed deployment
Deliverables:
- raft-rs integration
- RaftLogStore implementation (log-structured)
- EpistemeStateMachine applying committed entries
- 3-node local cluster test

8. Snapshot-based compaction

Effort: 1 week
Impact: Bounded Raft log growth
Files:
- crates/stemedb-snapshot/src/lib.rs
- Serialize Materialized Views + Indexes
- Truncate Raft log after snapshot

Long Term (Months 4-6)

9. Multi-region deployment

Effort: 1-2 months
Impact: Geographic distribution
Components:
- Range splits (shard by subject hash)
- Cross-region replication
- Latency-aware routing

10. Index persistence

Effort: 2-3 weeks
Impact: Fast restart without re-indexing
Scope:
- Vector index (HNSW graph)
- Visual index (BK-tree)
- Metadata indexes

7. Benchmarks and Validation

Performance Targets

Metric	Current	Target (Group Commit)	Target (Distributed)
Write Latency (p50)	~1ms	<0.5ms	<5ms (cross-AZ)
Write Latency (p99)	~10ms	<2ms	<20ms
Write Throughput	1K writes/sec	5K writes/sec	50K writes/sec (multi-node)
Recovery Time	Unknown	<5 seconds	<30 seconds per range
fsync/sec	~1000	~100 (10x batching)	~1000 (distributed)

Benchmark Suite

// benches/wal_throughput.rs
#[bench]
fn bench_wal_append_immediate(b: &mut Bencher) {
    let dir = tempdir().unwrap();
    let mut wal = Journal::open(&dir)
        .unwrap()
        .with_durability(DurabilityLevel::Immediate);

    b.iter(|| {
        wal.append(vec![0u8; 1024]).unwrap();
    });
}

#[bench]
fn bench_wal_append_group_commit(b: &mut Bencher) {
    let dir = tempdir().unwrap();
    let mut wal = Journal::open(&dir)
        .unwrap()
        .with_durability(DurabilityLevel::GroupCommit { max_batch: 100 });

    b.iter(|| {
        wal.append(vec![0u8; 1024]).unwrap();
    });
}

Crash Recovery Tests

// tests/crash_recovery.rs
#[test]
fn test_recovery_after_partial_fsync() {
    let dir = tempdir().unwrap();

    // Write records
    {
        let mut wal = Journal::open(&dir).unwrap();
        wal.append(vec![1, 2, 3]).unwrap();
        wal.append(vec![4, 5, 6]).unwrap();

        // Simulate crash: truncate file mid-record
        let path = dir.join("00000000.wal");
        let file = OpenOptions::new().write(true).open(&path).unwrap();
        file.set_len(HEADER_SIZE + RECORD_1_SIZE + 10).unwrap();  // Partial record 2
    }

    // Recovery should truncate and resume
    let mut wal = Journal::open(&dir).unwrap();
    assert_eq!(wal.current_offset, HEADER_SIZE + RECORD_1_SIZE);

    // Can append new records
    wal.append(vec![7, 8, 9]).unwrap();
}

8. References

Production Systems

Research Papers

Performance & Optimization

Rust Ecosystem

Recovery & Checkpointing

Distributed Patterns

Appendix A: Decision Matrix

Decision	Option A	Option B	Recommendation	Rationale
Embedded KV	sled (current)	redb	redb	Stable 1.0 API, pure Rust, actively maintained
Checksum	BLAKE3 only	CRC32C + BLAKE3	Both	CRC32C for torn writes (fast), BLAKE3 for identity
Group Commit	Immediate fsync	Batched fsync	Batched	50%+ latency reduction, 40%+ throughput increase
Recovery	Stub (current)	Full scan + truncate	Full scan	Production-grade crash recovery
io_uring	Use for WAL	Use traditional fsync	Traditional fsync	io_uring not production-ready for WAL durability
Distributed WAL	Separate app WAL	Raft log IS WAL	Raft log IS WAL	Eliminates write amplification (TiKV pattern)
Range Partitioning	WAL per range	Shared WAL	WAL per range	Independent compaction, simpler reasoning
Log Rotation	Fixed size	Fixed time	Fixed size (1 GB)	Predictable disk usage
Snapshot Frequency	10K entries	100 MB	Both (whichever first)	Balance recovery time vs compaction overhead

Appendix B: Implementation Checklist

Phase 1: Foundation (Weeks 1-2)

Replace sled with redb in stemedb-storage
Add CRC32C checksums to Record struct
Implement full recovery scan with truncation
Add recovery metrics (time, valid/invalid records)
Write crash recovery tests with fault injection

Phase 2: Performance (Weeks 3-4)

Implement group commit batching
Add configurable batch size and flush interval
Benchmark: measure latency reduction
Implement log rotation at 1 GB threshold
Add WAL file cleanup after KV flush

Phase 3: Distributed Foundation (Weeks 5-8)

Integrate raft-rs consensus module
Implement RaftLogStore (log-structured storage)
Implement EpistemeStateMachine (apply committed entries)
Deploy 3-node local cluster for testing
Measure cross-node replication latency

Phase 4: Distributed Production (Weeks 9-12)

Implement snapshot creation (serialize MV + indexes)
Implement snapshot-based log truncation
Add new replica sync (snapshot + log tail)
Implement range splits (shard by subject hash)
Deploy multi-region cluster (3+ regions)

Phase 5: Optimization (Weeks 13-16)

Persist vector index (HNSW graph)
Persist visual index (BK-tree)
Implement fuzzy checkpoints for non-blocking compaction
Optimize cross-region latency (latency-aware routing)
Production readiness: monitoring, alerting, runbooks

End of Research Document

This research represents production-proven patterns from CockroachDB, TiKV, FoundationDB, and SQLite. Every recommendation is backed by real-world deployments handling billions of writes per day. Episteme's append-only model is simpler than general-purpose databases - use that to your advantage with streamlined recovery and simpler Raft integration.

Next steps: Review with team, prioritize short-term improvements, begin prototyping distributed architecture in parallel branch.

46 KiB Raw Permalink Blame History Unescape Escape

WAL + Crash Recovery Research: Production Best Practices for Distributed Rust Databases

Executive Summary

Key Findings

Recommendations for Episteme

1. WAL Design Patterns in Production Databases

1.1 CockroachDB Pebble Storage Engine

1.2 TiKV Raft Engine

1.3 FoundationDB WAL

1.4 SQLite WAL Mode

1.5 Group Commit / Batched fsync Patterns

2. Crash Recovery State-of-the-Art

2.1 ARIES Recovery Algorithm

2.2 Torn Write Protection

2.3 Checkpoint Strategies

2.4 Recovery Time Objectives (RTO)

3. Modern Rust Ecosystem

3.1 Embedded KV Stores

3.2 Rust WAL Crates

3.3 raft-rs from TiKV

3.4 io_uring for Async WAL Writes

4. Distributed WAL Considerations

4.1 WAL per Range vs WAL per Node

4.2 Log Truncation After Raft Snapshot

4.3 WAL as Raft Log (Shared vs Separate)

4.4 Raft Log Format for Episteme

5. What Episteme Should Change

5.1 Short-Term Improvements (Pre-Distributed)

5.2 Long-Term Architecture (Distributed)

6. Concrete Recommendations

Immediate Actions (Week 1-2)

Short Term (Month 1)

Medium Term (Months 2-3)

Long Term (Months 4-6)

7. Benchmarks and Validation

Performance Targets

Benchmark Suite

Crash Recovery Tests

8. References

Production Systems

Research Papers

Performance & Optimization

Rust Ecosystem

Recovery & Checkpointing

Distributed Patterns

Appendix A: Decision Matrix

Appendix B: Implementation Checklist

Phase 1: Foundation (Weeks 1-2)

Phase 2: Performance (Weeks 3-4)

Phase 3: Distributed Foundation (Weeks 5-8)

Phase 4: Distributed Production (Weeks 9-12)

Phase 5: Optimization (Weeks 13-16)

46 KiB

Raw Permalink Blame History