tidaldb/docs/planning/milestone-1/phase-3/task-01-storage-engine-trait-and-key-encoding.md
jordan 29400d48db feat: implement Milestone 1 phases 1-3 — schema, WAL, and storage layer
Implements the foundation of tidalDB's data pipeline:

**Phase 1 – Schema primitives**
- EntityId newtype (u64, big-endian ordering)
- SignalTypeDefinition with pre-computed decay λ, deduped/sorted windows
- SchemaBuilder with full constraint validation (duplicates, identifiers,
  half-life, windows, velocity)
- LumenError wrapping all subsystems with required From impls

**Phase 2 – Write-Ahead Log**
- Length-prefixed, BLAKE3-protected entry format
- Group-commit writer (batch up to 100 events / 10 ms)
- Double-buffered content-hash deduplication
- Checkpoint, truncation, and crash-recovery with full replay
- Integration, property, and UAT tests (incl. 5,500-event deterministic UAT)
- Proptest coverage scaled to 10 000 events/run (was ≤500) to meet
  acceptance criterion; cases reduced 100→10 to keep runtime comparable

**Phase 3 – Storage engine**
- StorageEngine trait (get/put/delete/scan/batch/flush)
- Key encoding: [EntityId][0x00][Tag][suffix] with ordering/prefix helpers
- InMemoryBackend (BTreeMap + RwLock)
- FjallStorage with three isolated keyspaces and atomic batch helper
- Property tests for key ordering and round-trip correctness

Also adds planning docs for phases 4-5, research docs, architecture
overview, and roadmap updates.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-20 16:43:24 -07:00


# Task 01: StorageEngine Trait and Key Encoding
## Context
**Milestone:** 1 -- Signal Engine
**Phase:** m1p3 -- Storage Engine Trait and fjall Backend
**Status:** COMPLETE
**Depends On:** m1p1 (`EntityId`, `EntityKind`)
**Blocks:** Task 02 (FjallBackend), Task 03 (InMemoryBackend)
**Complexity:** M
## Objective
Define the `StorageEngine` trait that abstracts all persistent entity state access, the key encoding scheme that colocates entity data for efficient prefix scans, and the supporting types (`WriteBatch`, `BatchOp`, `PrefixIterator`, `StorageError`).
This is the boundary that keeps the rest of tidalDB storage-engine-agnostic. The WAL (m1p2) is the signal event source of truth; the storage engine is where derived entity state (metadata, signal checkpoints, indexes) lives. Every higher module — signal ledger, entity API, query engine — talks to a `StorageEngine`, never to fjall directly.
## Requirements
- `StorageEngine` is a `Send + Sync` object-safe trait
- Operations: `get(&[u8]) -> Result<Option<Vec<u8>>>`, `put(&[u8], &[u8]) -> Result<()>`, `delete(&[u8]) -> Result<()>`, `scan_prefix(&[u8]) -> PrefixIterator<'_>`, `write_batch(WriteBatch) -> Result<()>`, `flush() -> Result<()>`
- Key encoding: `[entity_id: 8 bytes BE][0x00][Tag: 1 byte][suffix: variable]`
- 8-byte big-endian entity ID: byte-lexicographic order matches numeric order
- `0x00` NUL separator between entity ID and tag
- 1-byte `Tag` discriminant for data category within the keyspace
- `Tag` enum: `Evt`=0x01 (raw events), `Sig`=0x02 (signal state), `Meta`=0x03 (entity metadata), `Rel`=0x04 (relationships), `Mv`=0x05 (materialized views), `Idx`=0x06 (inverted index)
- `entity_prefix(entity_id)` returns 9 bytes: `[entity_id: 8 BE][0x00]` — scans all tags for one entity
- `entity_tag_prefix(entity_id, tag)` returns 10 bytes: `[entity_id: 8 BE][0x00][tag: 1]` — scans one tag for one entity
- `encode_key(entity_id, tag, suffix)` and `parse_key(key)` roundtrip correctly for all inputs
- `WriteBatch` collects `Put` and `Delete` operations; `write_batch()` applies them atomically
- `PrefixIterator<'_>` is a type alias for `Box<dyn Iterator<Item = Result<(Vec<u8>, Vec<u8>), StorageError>> + '_>`
- `StorageError` integrates with `LumenError::Storage`
## Technical Design
### Key Encoding
```
[entity_id: u64 BE, 8 bytes][NUL: 0x00, 1 byte][Tag: u8, 1 byte][suffix: 0..N bytes]
Total prefix for entity scan: 9 bytes
Total prefix for tag scan: 10 bytes
```
**Why big-endian for entity IDs?** Byte-lexicographic order of the 8-byte encoding must match numeric order of the u64 value. Big-endian achieves this: `EntityId(1)` → `[0,0,0,0,0,0,0,1]`, `EntityId(256)` → `[0,0,0,0,0,0,1,0]`. Little-endian would break the ordering: 256 would encode as `[0,1,0,0,0,0,0,0]` and sort before 1 (`[1,0,0,0,0,0,0,0]`).
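The ordering argument can be checked directly with the standard library. The following is a minimal sketch; the helper names `be`, `le`, and `be_order_matches` are hypothetical and exist only for this illustration.

```rust
/// Encode a u64 the way the key scheme does: 8 bytes, big-endian.
fn be(id: u64) -> [u8; 8] {
    id.to_be_bytes()
}

/// Little-endian encoding, shown only for contrast.
fn le(id: u64) -> [u8; 8] {
    id.to_le_bytes()
}

/// True when byte-lexicographic order of the big-endian encodings
/// agrees with numeric order of the two IDs.
fn be_order_matches(a: u64, b: u64) -> bool {
    be(a).cmp(&be(b)) == a.cmp(&b)
}
```

Fixed-size byte arrays compare lexicographically in Rust, so `be(1) < be(256)` holds while `le(1) < le(256)` does not.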
**Why NUL separator?** Prevents a variable-length entity ID prefix from colliding with suffixes. With fixed 8-byte IDs the separator is redundant but is kept for consistency with the subject-prefix pattern from `thoughts.md` and for future extensibility.
### Public API
```rust
// === keys.rs ===

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
#[repr(u8)]
pub enum Tag {
    Evt = 0x01,  // raw event records (signal WAL overflow/cold tier)
    Sig = 0x02,  // signal state checkpoints
    Meta = 0x03, // entity metadata (title, category, created_at, ...)
    Rel = 0x04,  // relationship edges (follows, blocks, interaction weights)
    Mv = 0x05,   // materialized views (pre-computed aggregates)
    Idx = 0x06,  // inverted index entries
}

/// Build a full key: [entity_id: 8 BE][0x00][tag: 1][suffix]
pub fn encode_key(entity_id: EntityId, tag: Tag, suffix: &[u8]) -> Vec<u8>;

/// Parse a key back into (entity_id, tag, suffix).
/// Returns Err on keys too short to contain entity_id + separator + tag.
pub fn parse_key(key: &[u8]) -> Result<(EntityId, Tag, &[u8]), StorageError>;

/// Prefix for all keys belonging to one entity: [entity_id: 8 BE][0x00]
pub fn entity_prefix(entity_id: EntityId) -> [u8; 9];

/// Prefix for one tag of one entity: [entity_id: 8 BE][0x00][tag: 1]
pub fn entity_tag_prefix(entity_id: EntityId, tag: Tag) -> [u8; 10];
```
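The signatures above leave the bodies open. One possible implementation is sketched below, under stated assumptions: `EntityId` is a stand-in for the m1p1 newtype (assumed to wrap a `u64`), `KeyParseError` is a simplified stand-in for `StorageError::KeyParse`, and rejecting a non-NUL separator or an unknown tag byte is a choice made by this sketch, not something the task specifies.

```rust
/// Stand-in for the m1p1 `EntityId` newtype (assumed to wrap a u64).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct EntityId(u64);

impl EntityId {
    pub fn new(raw: u64) -> Self {
        EntityId(raw)
    }
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u8)]
pub enum Tag {
    Evt = 0x01,
    Sig = 0x02,
    Meta = 0x03,
    Rel = 0x04,
    Mv = 0x05,
    Idx = 0x06,
}

/// Simplified stand-in for `StorageError::KeyParse`.
#[derive(Debug, PartialEq, Eq)]
pub struct KeyParseError(pub String);

/// 8-byte entity ID + NUL separator + 1-byte tag.
const HEADER_LEN: usize = 10;

pub fn entity_prefix(entity_id: EntityId) -> [u8; 9] {
    let mut out = [0u8; 9];
    out[..8].copy_from_slice(&entity_id.0.to_be_bytes());
    // out[8] is already 0x00, which is the separator.
    out
}

pub fn entity_tag_prefix(entity_id: EntityId, tag: Tag) -> [u8; 10] {
    let mut out = [0u8; 10];
    out[..9].copy_from_slice(&entity_prefix(entity_id));
    out[9] = tag as u8;
    out
}

pub fn encode_key(entity_id: EntityId, tag: Tag, suffix: &[u8]) -> Vec<u8> {
    let mut key = Vec::with_capacity(HEADER_LEN + suffix.len());
    key.extend_from_slice(&entity_tag_prefix(entity_id, tag));
    key.extend_from_slice(suffix);
    key
}

pub fn parse_key(key: &[u8]) -> Result<(EntityId, Tag, &[u8]), KeyParseError> {
    if key.len() < HEADER_LEN {
        return Err(KeyParseError(format!("key too short: {} bytes", key.len())));
    }
    if key[8] != 0x00 {
        return Err(KeyParseError("missing NUL separator".to_string()));
    }
    let mut id = [0u8; 8];
    id.copy_from_slice(&key[..8]);
    let tag = match key[9] {
        0x01 => Tag::Evt,
        0x02 => Tag::Sig,
        0x03 => Tag::Meta,
        0x04 => Tag::Rel,
        0x05 => Tag::Mv,
        0x06 => Tag::Idx,
        other => return Err(KeyParseError(format!("unknown tag byte 0x{:02x}", other))),
    };
    // The suffix borrows from the input key; nothing is copied.
    Ok((EntityId(u64::from_be_bytes(id)), tag, &key[HEADER_LEN..]))
}
```

Building `encode_key` on top of `entity_tag_prefix` keeps the three functions in agreement by construction, so every encoded key starts with both of its prefixes.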
```rust
// === batch.rs ===

#[derive(Debug, Clone)]
pub enum BatchOp {
    Put { key: Vec<u8>, value: Vec<u8> },
    Delete { key: Vec<u8> },
}

#[derive(Debug, Default, Clone)]
pub struct WriteBatch {
    ops: Vec<BatchOp>,
}

impl WriteBatch {
    pub fn new() -> Self;
    pub fn put(&mut self, key: Vec<u8>, value: Vec<u8>) -> &mut Self;
    pub fn delete(&mut self, key: Vec<u8>) -> &mut Self;
    pub fn ops(&self) -> &[BatchOp];
    pub fn is_empty(&self) -> bool;
    pub fn len(&self) -> usize;
}
```
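A straightforward way to fill in the `WriteBatch` methods is to append to the `ops` vector and return `&mut Self` for chaining. This is a minimal sketch, not necessarily the shipped implementation; `rename_batch` is a hypothetical helper added only to show chained calls.

```rust
#[derive(Debug, Clone)]
pub enum BatchOp {
    Put { key: Vec<u8>, value: Vec<u8> },
    Delete { key: Vec<u8> },
}

#[derive(Debug, Default, Clone)]
pub struct WriteBatch {
    ops: Vec<BatchOp>,
}

impl WriteBatch {
    pub fn new() -> Self {
        Self::default()
    }

    /// Queue a put; operations are kept in insertion order.
    pub fn put(&mut self, key: Vec<u8>, value: Vec<u8>) -> &mut Self {
        self.ops.push(BatchOp::Put { key, value });
        self
    }

    /// Queue a delete; operations are kept in insertion order.
    pub fn delete(&mut self, key: Vec<u8>) -> &mut Self {
        self.ops.push(BatchOp::Delete { key });
        self
    }

    pub fn ops(&self) -> &[BatchOp] {
        &self.ops
    }

    pub fn is_empty(&self) -> bool {
        self.ops.is_empty()
    }

    pub fn len(&self) -> usize {
        self.ops.len()
    }
}

/// Hypothetical helper: a batch that moves a value from one key to another.
pub fn rename_batch(old_key: &[u8], new_key: &[u8], value: &[u8]) -> WriteBatch {
    let mut batch = WriteBatch::new();
    batch
        .delete(old_key.to_vec())
        .put(new_key.to_vec(), value.to_vec());
    batch
}
```

Because the batch is a plain ordered `Vec`, a later operation on the same key overrides an earlier one when a backend replays the operations in order.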
```rust
// === iterator.rs ===
/// Boxed prefix scan iterator yielding (key, value) pairs.
pub type PrefixIterator<'a> = Box<dyn Iterator<Item = Result<(Vec<u8>, Vec<u8>), StorageError>> + 'a>;
```
```rust
// === error.rs ===

#[derive(Debug, thiserror::Error)]
pub enum StorageError {
    #[error("I/O error: {0}")]
    Io(#[from] std::io::Error),

    #[error("storage backend error: {0}")]
    Backend(String),

    #[error("key parse error: {0}")]
    KeyParse(String),

    #[error("engine closed")]
    Closed,
}
```
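The trait itself is listed under Requirements but not shown in the blocks above. The sketch below is one way it could look, under several assumptions: every method takes `&self` (the requirements omit the receiver), `StorageError` and `WriteBatch` are reduced to simplified stand-ins, and `ToyBackend` and `open_toy` are hypothetical names used only to show that `dyn StorageEngine` compiles. It is not the Task 03 `InMemoryBackend`.

```rust
use std::collections::BTreeMap;
use std::sync::RwLock;

// Simplified stand-ins for the types specified above (assumptions of this sketch).
#[derive(Debug)]
pub enum StorageError {
    Backend(String),
}
pub type StorageResult<T> = Result<T, StorageError>;
pub type PrefixIterator<'a> = Box<dyn Iterator<Item = StorageResult<(Vec<u8>, Vec<u8>)>> + 'a>;

pub enum BatchOp {
    Put { key: Vec<u8>, value: Vec<u8> },
    Delete { key: Vec<u8> },
}
pub struct WriteBatch {
    pub ops: Vec<BatchOp>,
}

/// Object-safe: no generic methods, and every method takes `&self`.
pub trait StorageEngine: Send + Sync {
    fn get(&self, key: &[u8]) -> StorageResult<Option<Vec<u8>>>;
    fn put(&self, key: &[u8], value: &[u8]) -> StorageResult<()>;
    fn delete(&self, key: &[u8]) -> StorageResult<()>;
    fn scan_prefix(&self, prefix: &[u8]) -> PrefixIterator<'_>;
    fn write_batch(&self, batch: WriteBatch) -> StorageResult<()>;
    fn flush(&self) -> StorageResult<()>;
}

/// Toy backend for illustration only.
#[derive(Default)]
pub struct ToyBackend {
    map: RwLock<BTreeMap<Vec<u8>, Vec<u8>>>,
}

fn poisoned<E>(_: E) -> StorageError {
    StorageError::Backend("lock poisoned".to_string())
}

impl StorageEngine for ToyBackend {
    fn get(&self, key: &[u8]) -> StorageResult<Option<Vec<u8>>> {
        Ok(self.map.read().map_err(poisoned)?.get(key).cloned())
    }
    fn put(&self, key: &[u8], value: &[u8]) -> StorageResult<()> {
        self.map.write().map_err(poisoned)?.insert(key.to_vec(), value.to_vec());
        Ok(())
    }
    fn delete(&self, key: &[u8]) -> StorageResult<()> {
        self.map.write().map_err(poisoned)?.remove(key);
        Ok(())
    }
    fn scan_prefix(&self, prefix: &[u8]) -> PrefixIterator<'_> {
        let guard = match self.map.read() {
            Ok(guard) => guard,
            Err(e) => {
                let err: StorageResult<(Vec<u8>, Vec<u8>)> = Err(poisoned(e));
                return Box::new(std::iter::once(err));
            }
        };
        // Snapshot the matching entries so the lock is released before iteration.
        let hits: Vec<StorageResult<(Vec<u8>, Vec<u8>)>> = guard
            .range(prefix.to_vec()..)
            .take_while(|(k, _)| k.starts_with(prefix))
            .map(|(k, v)| Ok((k.clone(), v.clone())))
            .collect();
        Box::new(hits.into_iter())
    }
    fn write_batch(&self, batch: WriteBatch) -> StorageResult<()> {
        // One write lock for the whole batch, so readers never see a partial batch.
        let mut map = self.map.write().map_err(poisoned)?;
        for op in batch.ops {
            match op {
                BatchOp::Put { key, value } => {
                    map.insert(key, value);
                }
                BatchOp::Delete { key } => {
                    map.remove(&key);
                }
            }
        }
        Ok(())
    }
    fn flush(&self) -> StorageResult<()> {
        Ok(()) // nothing to persist in memory
    }
}

/// Compiles only if `StorageEngine` is object-safe.
pub fn open_toy() -> Box<dyn StorageEngine> {
    Box::new(ToyBackend::default())
}
```

Copying scan results into a `Vec` trades memory for simplicity; a real backend would more likely stream entries while holding a snapshot or read guard.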
## Test Strategy
### Property Tests (proptest)
```rust
use proptest::prelude::*;

// encode_key / parse_key roundtrip for all tags and suffixes
proptest! {
    #[test]
    fn key_roundtrip(
        id: u64,
        tag in prop_oneof![
            Just(Tag::Evt), Just(Tag::Sig), Just(Tag::Meta),
            Just(Tag::Rel), Just(Tag::Mv), Just(Tag::Idx),
        ],
        suffix in prop::collection::vec(any::<u8>(), 0..64),
    ) {
        let entity_id = EntityId::new(id);
        let key = encode_key(entity_id, tag, &suffix);
        let (parsed_id, parsed_tag, parsed_suffix) = parse_key(&key).unwrap();
        prop_assert_eq!(parsed_id, entity_id);
        prop_assert_eq!(parsed_tag, tag);
        prop_assert_eq!(parsed_suffix, suffix.as_slice());
    }
}

// Byte-lexicographic order of encoded keys matches numeric order of entity IDs
proptest! {
    #[test]
    fn key_ordering_matches_entity_id_ordering(a: u64, b: u64) {
        let key_a = encode_key(EntityId::new(a), Tag::Meta, b"");
        let key_b = encode_key(EntityId::new(b), Tag::Meta, b"");
        prop_assert_eq!(
            key_a.cmp(&key_b),
            a.cmp(&b),
            "key ordering must match entity ID ordering"
        );
    }
}

// entity_prefix is a prefix of every key for that entity (all six tags)
proptest! {
    #[test]
    fn entity_prefix_is_prefix_of_all_entity_keys(id: u64) {
        let entity_id = EntityId::new(id);
        let prefix = entity_prefix(entity_id);
        for tag in [Tag::Evt, Tag::Sig, Tag::Meta, Tag::Rel, Tag::Mv, Tag::Idx] {
            let key = encode_key(entity_id, tag, b"suffix");
            prop_assert!(key.starts_with(&prefix));
        }
    }
}

// entity_tag_prefix is a prefix of every key for that entity and tag
proptest! {
    #[test]
    fn entity_tag_prefix_is_precise(id: u64, suffix in prop::collection::vec(any::<u8>(), 0..32)) {
        let entity_id = EntityId::new(id);
        let prefix = entity_tag_prefix(entity_id, Tag::Sig);
        let key = encode_key(entity_id, Tag::Sig, &suffix);
        prop_assert!(key.starts_with(&prefix));
        // Tag::Meta key does NOT start with Tag::Sig prefix
        let other_key = encode_key(entity_id, Tag::Meta, &suffix);
        prop_assert!(!other_key.starts_with(&prefix));
    }
}
```
### Unit Tests
```rust
#[test]
fn tag_byte_values() {
    assert_eq!(Tag::Evt as u8, 0x01);
    assert_eq!(Tag::Sig as u8, 0x02);
    assert_eq!(Tag::Meta as u8, 0x03);
    assert_eq!(Tag::Rel as u8, 0x04);
    assert_eq!(Tag::Mv as u8, 0x05);
    assert_eq!(Tag::Idx as u8, 0x06);
}

#[test]
fn entity_prefix_length() {
    let prefix = entity_prefix(EntityId::new(1));
    assert_eq!(prefix.len(), 9);
}

#[test]
fn entity_tag_prefix_length() {
    let prefix = entity_tag_prefix(EntityId::new(1), Tag::Meta);
    assert_eq!(prefix.len(), 10);
}

#[test]
fn parse_key_rejects_short_input() {
    assert!(parse_key(b"").is_err());
    assert!(parse_key(&[0u8; 8]).is_err()); // missing NUL + tag
    assert!(parse_key(&[0u8; 9]).is_err()); // missing tag
}

#[test]
fn write_batch_ops_order_preserved() {
    let mut batch = WriteBatch::new();
    batch.put(b"k1".to_vec(), b"v1".to_vec());
    batch.delete(b"k2".to_vec());
    batch.put(b"k3".to_vec(), b"v3".to_vec());
    assert_eq!(batch.len(), 3);
    assert!(matches!(batch.ops()[0], BatchOp::Put { .. }));
    assert!(matches!(batch.ops()[1], BatchOp::Delete { .. }));
    assert!(matches!(batch.ops()[2], BatchOp::Put { .. }));
}
```
## Acceptance Criteria
- [x] `encode_key` / `parse_key` roundtrip correctly for all 6 `Tag` variants and arbitrary suffixes (property tested)
- [x] Byte-lexicographic ordering of encoded keys matches numeric ordering of `EntityId` (property tested)
- [x] `entity_prefix` is 9 bytes and is a prefix of every key for that entity (property tested)
- [x] `entity_tag_prefix` is 10 bytes and is a prefix of only keys with the matching entity+tag (property tested)
- [x] `parse_key` returns `StorageError::KeyParse` for inputs shorter than 10 bytes
- [x] `WriteBatch` preserves insertion order of operations
- [x] `StorageEngine` trait is object-safe (`dyn StorageEngine` compiles)
- [x] `StorageEngine: Send + Sync` — enforced by the trait bound
- [x] `cargo clippy -- -D warnings` passes
## Research References
- [thoughts.md](../../../../thoughts.md) — Part V.12 (subject-prefix keys: `[entity_id][NUL][TAG][suffix]`, rationale for co-location, entity-scoped prefix scans)
- [CODING_GUIDELINES.md](../../../../CODING_GUIDELINES.md) — Section 2 (key encoding: big-endian for byte-lexicographic ordering, NUL separator convention)
## Implementation Notes
- `Tag` uses `#[repr(u8)]` for direct byte encoding. Decoding goes through a `TryFrom<u8>` impl (`From<u8>` cannot fail) whose catch-all arm returns `StorageError::KeyParse`, so unknown future tag values are rejected with an error rather than misinterpreted.
- `PrefixIterator<'_>` is a type alias (not a newtype), so callers get a plain `Box<dyn Iterator>` with no wrapper to unwrap. The iterator is boxed because a trait method returning `impl Iterator` would make `StorageEngine` non-object-safe. The `'_` lifetime ties the iterator to the borrow of the backend.
- `StorageError` uses `thiserror` (already in `Cargo.toml`) for `Display` and `Error` implementations.
- Do NOT add `serde` to the storage error types. Error propagation uses `From` impls, not serialization.
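The tag-decoding note above describes a catch-all that maps unknown bytes to a key-parse error. Because that conversion can fail, the standard `TryFrom<u8>` trait is a natural fit. The sketch below assumes that shape; `KeyParseError` is a simplified stand-in for `StorageError::KeyParse`, and the error message text is illustrative.

```rust
use std::convert::TryFrom;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
#[repr(u8)]
pub enum Tag {
    Evt = 0x01,
    Sig = 0x02,
    Meta = 0x03,
    Rel = 0x04,
    Mv = 0x05,
    Idx = 0x06,
}

/// Simplified stand-in for `StorageError::KeyParse`.
#[derive(Debug, PartialEq, Eq)]
pub struct KeyParseError(pub String);

impl TryFrom<u8> for Tag {
    type Error = KeyParseError;

    fn try_from(byte: u8) -> Result<Self, Self::Error> {
        match byte {
            0x01 => Ok(Tag::Evt),
            0x02 => Ok(Tag::Sig),
            0x03 => Ok(Tag::Meta),
            0x04 => Ok(Tag::Rel),
            0x05 => Ok(Tag::Mv),
            0x06 => Ok(Tag::Idx),
            // Catch-all: bytes not assigned to a tag (yet) become an error.
            other => Err(KeyParseError(format!("unknown tag byte 0x{:02x}", other))),
        }
    }
}
```

Encoding stays a plain `tag as u8` cast thanks to `#[repr(u8)]`, so only the decoding direction needs a fallible conversion.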