stemedb/ai-lookup/services/ingestor.md

# Ingestor Service
> **Crate:** `stemedb-ingest`
> **Status:** Implemented (Phase 1)
## Purpose
The Ingestor is the background worker that bridges the Write-Ahead Log (WAL) to the KV storage engine. It continuously tails the WAL and persists records to the HybridStore (fjall + redb) using content-addressed keys.
## Architecture
```
[WAL Journal] ---> [IngestWorker] ---> [KVStore (HybridStore)]
                                                  |
                                                  v
                                           [Subject Index]
```
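The tail-and-persist flow in the diagram can be sketched with in-memory stand-ins. This is a minimal illustration, not the crate's implementation: `MemJournal`, `MemStore`, `ingest_pending`, the `cursor` argument, and the `key_of` callback are all hypothetical names, and the real worker runs continuously rather than draining once.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the WAL journal: an append-only list of records.
pub struct MemJournal {
    pub records: Vec<Vec<u8>>,
}

/// Hypothetical stand-in for the KV store.
pub type MemStore = HashMap<Vec<u8>, Vec<u8>>;

/// Copies every record at or after `cursor` into the store and returns the
/// new cursor. `key_of` derives the storage key from the record bytes.
pub fn ingest_pending<F>(
    journal: &MemJournal,
    store: &mut MemStore,
    cursor: usize,
    key_of: F,
) -> usize
where
    F: Fn(&[u8]) -> Vec<u8>,
{
    let mut next = cursor;
    while let Some(record) = journal.records.get(next) {
        let key = key_of(record.as_slice());
        store.insert(key, record.clone());
        next += 1;
    }
    next
}
```

Calling `ingest_pending` again with the returned cursor is a no-op until new records are appended, which is the property a tailing worker relies on.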
## Key Components
### RecordType
Discriminator for WAL payloads (8-byte aligned header):
- `Assertion = 0` - Knowledge claims
- `Vote = 1` - Consensus votes
- `Epoch = 2` - Paradigm definitions
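As a rough sketch, the discriminator values above map onto a `#[repr(u8)]` enum. The `TryFrom<u8>` conversion and its error type are assumptions made for illustration; the crate's actual definition may differ.

```rust
use std::convert::TryFrom;

/// Discriminator carried in the WAL payload header.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u8)]
pub enum RecordType {
    Assertion = 0,
    Vote = 1,
    Epoch = 2,
}

impl TryFrom<u8> for RecordType {
    /// Assumed error type: the unrecognized tag byte is handed back.
    type Error = u8;

    fn try_from(tag: u8) -> Result<Self, Self::Error> {
        match tag {
            0 => Ok(RecordType::Assertion),
            1 => Ok(RecordType::Vote),
            2 => Ok(RecordType::Epoch),
            other => Err(other),
        }
    }
}
```

Rejecting unknown tags explicitly lets a reader skip or report a record it does not understand instead of misinterpreting its payload.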
### Storage Layout
| Key Pattern | Value | Description |
|-------------|-------|-------------|
| `H:{blake3_hash}` | Serialized Assertion | Content-addressed assertion store |
| `V:{assertion_hash}:{vote_hash}` | Serialized Vote | Votes on assertions |
| `E:{epoch_id_hex}` | Serialized Epoch | Epoch definitions |
| `S:{subject}` | BLAKE3 hash bytes | Subject adjacency index |
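The key patterns in the table could be built along these lines. This sketch assumes hashes are rendered as lowercase hex inside the key, which the table only states for epoch IDs; the function names are hypothetical and the crate's key codec may encode keys differently.

```rust
/// Lowercase hex encoding (an assumption about how hashes appear in keys).
fn to_hex(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{:02x}", b)).collect()
}

/// `H:{blake3_hash}`: content-addressed assertion.
pub fn assertion_key(hash: &[u8; 32]) -> Vec<u8> {
    format!("H:{}", to_hex(hash)).into_bytes()
}

/// `V:{assertion_hash}:` prefix shared by all votes on one assertion.
pub fn vote_prefix(assertion_hash: &[u8; 32]) -> Vec<u8> {
    format!("V:{}:", to_hex(assertion_hash)).into_bytes()
}

/// `V:{assertion_hash}:{vote_hash}`: a single vote.
pub fn vote_key(assertion_hash: &[u8; 32], vote_hash: &[u8; 32]) -> Vec<u8> {
    let mut key = vote_prefix(assertion_hash);
    key.extend_from_slice(to_hex(vote_hash).as_bytes());
    key
}

/// `E:{epoch_id_hex}`: epoch definition.
pub fn epoch_key(epoch_id: &[u8]) -> Vec<u8> {
    format!("E:{}", to_hex(epoch_id)).into_bytes()
}

/// `S:{subject}`: subject adjacency index entry.
pub fn subject_key(subject: &str) -> Vec<u8> {
    format!("S:{}", subject).into_bytes()
}
```

Because every vote key starts with its assertion's prefix, an ordered store can list the votes on one assertion with a prefix scan.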
## Usage
```rust
use std::sync::Arc;

use tokio::sync::Mutex; // assumed: an async mutex, since `lock()` is awaited below

use stemedb_ingest::{serialize_assertion, Ingestor};
use stemedb_storage::HybridStore;
use stemedb_wal::Journal;
// `Assertion` must also be in scope; its import path is not shown here.

// Open the WAL and the KV store; both are shared via `Arc`.
let journal = Arc::new(Mutex::new(Journal::open("./wal")?));
let store = Arc::new(HybridStore::open("./db")?);

// Create the ingestor and spawn its background task.
let mut ingestor = Ingestor::new(journal.clone(), store);
ingestor.start();

// Append to the WAL; the background task ingests the record automatically.
let assertion = Assertion { ... };
let payload = serialize_assertion(&assertion)?;
journal.lock().await.append(payload)?;
```
## Serialization
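Records are serialized with an 8-byte header so that the rkyv payload starts at an 8-byte-aligned offset. The first byte is the `RecordType` discriminator and the remaining 7 bytes are padding: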
```
[type: u8][padding: 7 bytes][rkyv payload...]
```
Helper functions:
- `serialize_assertion(&Assertion) -> Result<Vec<u8>>`
- `serialize_vote(&Vote) -> Result<Vec<u8>>`
- `serialize_epoch(&Epoch) -> Result<Vec<u8>>`
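The header layout can be illustrated with a pair of framing helpers that work on an already-serialized body. This is a sketch only: `HEADER_LEN`, `frame`, and `unframe` are hypothetical names, the padding bytes are assumed to be zero, and the rkyv serialization step itself is left out.

```rust
/// Header size: 1 type byte followed by 7 padding bytes.
pub const HEADER_LEN: usize = 8;

/// Prepends the 8-byte header to an already-serialized body.
pub fn frame(tag: u8, body: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(HEADER_LEN + body.len());
    out.push(tag);
    // Padding is assumed to be zero-filled.
    out.extend_from_slice(&[0u8; HEADER_LEN - 1]);
    out.extend_from_slice(body);
    out
}

/// Splits a framed record into (type byte, body).
/// Returns `None` if the record is shorter than the header.
pub fn unframe(record: &[u8]) -> Option<(u8, &[u8])> {
    if record.len() < HEADER_LEN {
        return None;
    }
    Some((record[0], &record[HEADER_LEN..]))
}
```

Since the body starts at offset 8, it is 8-byte aligned whenever the record buffer itself is.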
## Testing
The ingestor has integration tests covering:
- Single assertion ingestion
- Vote ingestion
- Epoch ingestion
- Multiple record processing
- Subject index creation
## Related
- [Storage Service](./storage.md) - KVStore trait and HybridStore (fjall + redb)
- [Content Addressing](../patterns/content-addressing.md) - BLAKE3 hashing