stemedb/ai-lookup/services/storage.md

Storage

Last Updated: 2026-01-31
Confidence: High

Summary

Episteme uses a log-structured, content-addressed storage model. Writes append to the WAL and are indexed asynchronously. Reads query the indexes and apply Lenses.

Key Facts:

  • Append-only (never mutate)
  • WAL for durability (fsync on write)
  • KV store: HybridStore (fjall for writes, redb for reads)
  • Content-addressed by BLAKE3 hash
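The "fsync on write" rule can be illustrated with a minimal, std-only sketch. The function names and the record framing (a u32 little-endian length prefix) are assumptions made for illustration; they are not taken from the stemedb WAL implementation.

```rust
use std::fs::{File, OpenOptions};
use std::io::{self, Read, Write};
use std::path::Path;

/// Append one record to the log and force it to disk before returning.
/// Framing (u32 LE length prefix + payload) is an assumption of this sketch.
pub fn wal_append(path: &Path, record: &[u8]) -> io::Result<()> {
    if record.len() > u32::MAX as usize {
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "record too large"));
    }
    // Build the whole frame first so it is handed to the OS in one write.
    let mut frame = Vec::with_capacity(4 + record.len());
    frame.extend_from_slice(&(record.len() as u32).to_le_bytes());
    frame.extend_from_slice(record);

    // Append-only: existing bytes are never rewritten.
    let mut file = OpenOptions::new().create(true).append(true).open(path)?;
    file.write_all(&frame)?;
    // sync_all asks the OS to flush file data and metadata to the device.
    file.sync_all()
}

/// Read every complete record back in append order.
pub fn wal_read_all(path: &Path) -> io::Result<Vec<Vec<u8>>> {
    let mut bytes = Vec::new();
    File::open(path)?.read_to_end(&mut bytes)?;

    let mut records = Vec::new();
    let mut pos = 0;
    while pos + 4 <= bytes.len() {
        let mut len_buf = [0u8; 4];
        len_buf.copy_from_slice(&bytes[pos..pos + 4]);
        let len = u32::from_le_bytes(len_buf) as usize;
        pos += 4;
        if pos + len > bytes.len() {
            break; // torn tail: a partially written final record is ignored
        }
        records.push(bytes[pos..pos + len].to_vec());
        pos += len;
    }
    Ok(records)
}
```

Only after `wal_append` returns `Ok` would a caller acknowledge the write. A production log would typically also checksum each frame, which this sketch omits.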

File Pointers:

  • crates/stemedb-storage/src/traits.rs - KVStore trait
  • crates/stemedb-storage/src/key_codec.rs - Centralized key encoding (40+ builders, subject validation, extraction)
  • crates/stemedb-storage/src/hybrid_backend.rs - HybridStore (routes to fjall or redb)
  • crates/stemedb-storage/src/fjall_backend.rs - FjallStore (write-heavy keys)
  • crates/stemedb-storage/src/redb_backend.rs - RedbStore (read-heavy keys)
  • crates/stemedb-storage/src/serde_helpers.rs - Storage-layer serialize/deserialize helpers
  • crates/stemedb-storage/src/vote_store.rs - VoteStore (Ballot Box)
  • crates/stemedb-storage/src/index_store.rs - IndexStore (S: and SP: indexes)
  • crates/stemedb-storage/src/trust_rank_store.rs - TrustRankStore (TR:)

KV Layout

All keys are built by the centralized key_codec module (crates/stemedb-storage/src/key_codec.rs). Subject-scoped keys use a {subject}\x00 prefix so that all keys for one subject are co-located; global keys use a bare \x00 prefix so that they sort first.
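To make the prefix scheme concrete, here is a small sketch of key builders over an ordered map. The names `subject_key`, `global_key`, and `keys_for_subject` are hypothetical and do not reflect the actual key_codec API. The validation rule (no empty subject, no embedded NUL byte) is also an assumption, chosen because the ordering guarantee depends on it.

```rust
use std::collections::BTreeMap;

/// Separator between the subject and the rest of the key.
const SEP: u8 = 0x00;

/// Hypothetical builder for `{subject}\x00{tag}:{suffix}`.
/// Returns None for subjects that would break the ordering guarantees.
pub fn subject_key(subject: &str, tag: &str, suffix: &str) -> Option<Vec<u8>> {
    // Assumed validation: an empty subject would collide with global keys,
    // and an embedded NUL would make the prefix ambiguous.
    if subject.is_empty() || subject.as_bytes().contains(&SEP) {
        return None;
    }
    let mut key = Vec::with_capacity(subject.len() + tag.len() + suffix.len() + 2);
    key.extend_from_slice(subject.as_bytes());
    key.push(SEP);
    key.extend_from_slice(tag.as_bytes());
    key.push(b':');
    key.extend_from_slice(suffix.as_bytes());
    Some(key)
}

/// Hypothetical builder for `\x00{tag}:{suffix}`.
pub fn global_key(tag: &str, suffix: &str) -> Vec<u8> {
    let mut key = vec![SEP];
    key.extend_from_slice(tag.as_bytes());
    key.push(b':');
    key.extend_from_slice(suffix.as_bytes());
    key
}

/// All keys for one subject, found with a single ordered prefix scan.
pub fn keys_for_subject<'a>(
    store: &'a BTreeMap<Vec<u8>, Vec<u8>>,
    subject: &str,
) -> Vec<&'a [u8]> {
    let mut prefix = subject.as_bytes().to_vec();
    prefix.push(SEP);
    store
        .range(prefix.clone()..)
        .take_while(|(k, _)| k.starts_with(&prefix))
        .map(|(k, _)| k.as_slice())
        .collect()
}
```

Because the separator is part of the scan prefix, a scan for subject `earth` does not pick up keys that belong to `earthling`.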

Subject-Prefixed Keys (co-located per subject)

| Key Pattern | Value | Purpose |
|---|---|---|
| {subject}\x00H:{hash} | Assertion (serialized) | Main content store |
| {subject}\x00S:{hash_list} | Vec<Hash> (rkyv) | Subject index (IndexStore) |
| {subject}\x00SP:{predicate} | Vec<Hash> (rkyv) | Compound index (IndexStore) |
| {subject}\x00MV:{predicate} | MaterializedView (rkyv) | Pre-computed winner (Materializer) |
| {subject}\x00V:{hash}:{vh} | Vote (serialized) | Ballot Box votes |
| {subject}\x00VC:{hash} | u64 (LE bytes) | Vote count cache |
| {subject}\x00VW:{hash} | f32 (LE bytes) | Aggregate weight cache |
| {subject}\x00GS:{predicate} | GoldStandard (rkyv) | Gold standard entries |
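The VC: and VW: rows store raw little-endian bytes rather than rkyv archives. A minimal sketch of what encoding and decoding those values could look like follows. The function names are hypothetical, and the read-modify-write helper is an illustration, not the VoteStore implementation.

```rust
/// Encode a vote count as 8 little-endian bytes (the VC: value format).
pub fn encode_vote_count(count: u64) -> [u8; 8] {
    count.to_le_bytes()
}

/// Decode a vote count; None if the stored value is not exactly 8 bytes.
pub fn decode_vote_count(bytes: &[u8]) -> Option<u64> {
    if bytes.len() != 8 {
        return None;
    }
    let mut buf = [0u8; 8];
    buf.copy_from_slice(bytes);
    Some(u64::from_le_bytes(buf))
}

/// Encode an aggregate weight as 4 little-endian bytes (the VW: value format).
pub fn encode_weight(weight: f32) -> [u8; 4] {
    weight.to_le_bytes()
}

/// Decode an aggregate weight; None if the stored value is not exactly 4 bytes.
pub fn decode_weight(bytes: &[u8]) -> Option<f32> {
    if bytes.len() != 4 {
        return None;
    }
    let mut buf = [0u8; 4];
    buf.copy_from_slice(bytes);
    Some(f32::from_le_bytes(buf))
}

/// Hypothetical read-modify-write: a missing entry counts as zero.
/// Returns None on a corrupt entry or on overflow.
pub fn bump_vote_count(current: Option<&[u8]>) -> Option<[u8; 8]> {
    let count = match current {
        None => 0,
        Some(bytes) => decode_vote_count(bytes)?,
    };
    Some(encode_vote_count(count.checked_add(1)?))
}
```

Fixed-width values make a length check a cheap corruption guard, which is why the decoders reject anything that is not exactly the expected size.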

Global Keys (sort first via \x00 prefix)

| Key Pattern | Value | Purpose |
|---|---|---|
| \x00TRUST:{agent_id} | TrustRank (rkyv) | Agent reputation (TrustRankStore) |
| \x00QUOTA:{agent_id}:{window} | Quota record | Per-agent per-window quota |
| \x00QLIMIT:{agent_id} | Quota limit | Per-agent quota limit |
| \x00E:{epoch_id} | Epoch (serialized) | Paradigm definitions |
| \x00SUPERSEDED:{epoch_id} | Supersession marker | O(1) epoch supersession lookup |
| \x00SUP:{hash} | Supersession record | Supersession data |
| \x00AUD:{query_id} | QueryAudit (rkyv) | Query audit trail |
| \x00ESC:{ts}:{id} | EscalationEvent (rkyv) | Escalation events |
| \x00TP:{pack_id} | TrustPack (rkyv) | Trust packs |
| \x00META:{key} | Varies | System metadata (e.g., cursor) |
| \x00HASH_SUBJECT:{hash} | Subject string | Reverse lookup: hash → subject |
| \x00SUBJECTS:{subject} | Marker | Known subjects index |
| \x00GS_LIST:{subj}:{pred} | Listing data | Gold standard listing |
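Because assertions live under subject-prefixed keys, a caller that holds only a hash needs the HASH_SUBJECT entry to find them. The sketch below shows that two-step lookup over an in-memory ordered map. The `Store` alias, the function names, and the textual hash encoding in keys are assumptions for illustration only.

```rust
use std::collections::BTreeMap;

/// In-memory stand-in for the KV store.
pub type Store = BTreeMap<Vec<u8>, Vec<u8>>;

/// `\x00HASH_SUBJECT:{hash}` (hash rendered as text; an assumption here).
fn hash_subject_key(hash: &str) -> Vec<u8> {
    let mut key = b"\x00HASH_SUBJECT:".to_vec();
    key.extend_from_slice(hash.as_bytes());
    key
}

/// `{subject}\x00H:{hash}`
fn assertion_key(subject: &str, hash: &str) -> Vec<u8> {
    let mut key = subject.as_bytes().to_vec();
    key.extend_from_slice(b"\x00H:");
    key.extend_from_slice(hash.as_bytes());
    key
}

/// Write the assertion and its reverse-lookup entry together.
pub fn put_assertion(store: &mut Store, subject: &str, hash: &str, body: &[u8]) {
    store.insert(assertion_key(subject, hash), body.to_vec());
    store.insert(hash_subject_key(hash), subject.as_bytes().to_vec());
}

/// Resolve a bare hash to its assertion bytes in two point reads:
/// hash -> subject, then (subject, hash) -> assertion.
pub fn get_by_hash<'a>(store: &'a Store, hash: &str) -> Option<&'a [u8]> {
    let subject_bytes = store.get(&hash_subject_key(hash))?;
    let subject = std::str::from_utf8(subject_bytes).ok()?;
    store
        .get(&assertion_key(subject, hash))
        .map(|body| body.as_slice())
}
```

The reverse entry costs one extra write per assertion and avoids scanning every subject prefix when only the hash is known.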

Serialization

stemedb-core (shared types)

For core types, use the canonical module:

use stemedb_core::serde::{serialize, deserialize};

let bytes = serialize(&my_value)?;
let value: MyType = deserialize(&bytes)?;

File: crates/stemedb-core/src/serde.rs

Raw AllocSerializer usage is prohibited in production code (enforced via CLAUDE.md).

stemedb-storage (store implementations)

In storage modules, use the storage-layer helpers that map to StorageError:

use crate::serde_helpers::{serialize, deserialize};

let bytes = serialize(&my_value)?;  // Returns Result<Vec<u8>, StorageError>
let value: MyType = deserialize(&bytes)?;

File: crates/stemedb-storage/src/serde_helpers.rs

This provides unified error handling across all store implementations (VoteStore, IndexStore, TrustRankStore, AuditStore, TrustPackStore, QuotaStore).
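The value of mapping every codec failure to one error type can be shown with a small std-only sketch. Everything here is hypothetical: the `StorageError` variants, the `Codec` trait, and `CodecError` are stand-ins, since the real helpers wrap rkyv and their internals are not shown in this document.

```rust
use std::fmt;

/// Hypothetical error type; the real StorageError variants may differ.
#[derive(Debug, PartialEq)]
pub enum StorageError {
    Serialize(String),
    Deserialize(String),
}

impl fmt::Display for StorageError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            StorageError::Serialize(msg) => write!(f, "serialize failed: {}", msg),
            StorageError::Deserialize(msg) => write!(f, "deserialize failed: {}", msg),
        }
    }
}

impl std::error::Error for StorageError {}

/// Stand-in for the underlying codec's own error type.
#[derive(Debug)]
pub struct CodecError(pub &'static str);

/// Stand-in for the serialization backend (rkyv in the real crate).
pub trait Codec: Sized {
    fn encode(&self) -> Result<Vec<u8>, CodecError>;
    fn decode(bytes: &[u8]) -> Result<Self, CodecError>;
}

impl Codec for String {
    fn encode(&self) -> Result<Vec<u8>, CodecError> {
        Ok(self.as_bytes().to_vec())
    }
    fn decode(bytes: &[u8]) -> Result<Self, CodecError> {
        String::from_utf8(bytes.to_vec()).map_err(|_| CodecError("invalid UTF-8"))
    }
}

/// Helper that converts codec failures into StorageError at the boundary.
pub fn serialize<T: Codec>(value: &T) -> Result<Vec<u8>, StorageError> {
    value
        .encode()
        .map_err(|e| StorageError::Serialize(e.0.to_string()))
}

/// Counterpart of `serialize`; the target type is chosen by the caller.
pub fn deserialize<T: Codec>(bytes: &[u8]) -> Result<T, StorageError> {
    T::decode(bytes).map_err(|e| StorageError::Deserialize(e.0.to_string()))
}

/// A store method can use `?` throughout because both helpers share one error type.
pub fn roundtrip_subject(subject: &str) -> Result<String, StorageError> {
    let bytes = serialize(&subject.to_string())?;
    deserialize(&bytes)
}
```

With this shape, individual stores never name the codec's error type, so swapping the serialization backend would only touch the helper module.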

Write Path

1. Agent submits signed Assertion
2. Validate signature
3. Append to WAL (fsync)
4. Return 202 Accepted with Hash
5. Background: tail WAL -> update indexes
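The five steps above can be sketched in memory as follows. This is an illustration under several assumptions. The WAL is a plain `Vec` rather than an fsynced file. The content hash uses the standard library's `DefaultHasher` as a stand-in for BLAKE3. Signature checking is delegated to a caller-supplied closure. The names `Node`, `submit`, and `index_pending` are all hypothetical.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeMap;
use std::hash::{Hash, Hasher};

#[derive(Clone, Debug)]
pub struct Assertion {
    pub subject: String,
    pub predicate: String,
    pub body: Vec<u8>,
}

/// Stand-in content hash. The real system uses BLAKE3; DefaultHasher is
/// used here only to keep the sketch free of external crates.
pub fn content_hash(a: &Assertion) -> u64 {
    let mut hasher = DefaultHasher::new();
    a.subject.hash(&mut hasher);
    a.predicate.hash(&mut hasher);
    a.body.hash(&mut hasher);
    hasher.finish()
}

#[derive(Debug, PartialEq)]
pub enum SubmitError {
    BadSignature,
}

#[derive(Default)]
pub struct Node {
    /// Stand-in for the durable, append-only log.
    wal: Vec<(u64, Assertion)>,
    /// Position of the next WAL entry the indexer has not yet processed.
    cursor: usize,
    /// (subject, predicate) -> hashes, mirroring the SP: index.
    pub sp_index: BTreeMap<(String, String), Vec<u64>>,
}

impl Node {
    /// Steps 1-4: validate, append to the log, acknowledge with the hash.
    /// Indexes are NOT updated here; that happens asynchronously.
    pub fn submit<F>(&mut self, assertion: Assertion, verify: F) -> Result<u64, SubmitError>
    where
        F: Fn(&Assertion) -> bool,
    {
        if !verify(&assertion) {
            return Err(SubmitError::BadSignature);
        }
        let hash = content_hash(&assertion);
        self.wal.push((hash, assertion));
        Ok(hash)
    }

    /// Step 5: tail the log from the saved cursor and update the index.
    /// Returns how many entries were indexed in this pass.
    pub fn index_pending(&mut self) -> usize {
        let start = self.cursor;
        for (hash, a) in &self.wal[start..] {
            self.sp_index
                .entry((a.subject.clone(), a.predicate.clone()))
                .or_default()
                .push(*hash);
        }
        self.cursor = self.wal.len();
        self.cursor - start
    }
}
```

Between `submit` returning and `index_pending` running, the assertion is durable but not yet queryable. That gap is consistent with the write path answering 202 Accepted rather than 200 OK.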

Read Path

1. Query: GET(Subject, Predicate, Lens)
2. Lookup: {subject}\x00SP:{predicate} -> [Hash...]
3. Hydrate: Load assertions from {subject}\x00H:{hash}
4. Resolve: Apply Lens
5. Return: Deterministic answer
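A minimal sketch of the lookup, hydrate, and resolve steps is shown below. The `Lens` trait shape and the `LatestWins` lens are hypothetical, since this document does not define how Lenses are implemented. Typed tuple keys stand in for the encoded byte keys to keep the example short.

```rust
use std::collections::BTreeMap;

#[derive(Clone, Debug, PartialEq)]
pub struct Assertion {
    pub hash: u64,
    pub value: String,
    pub timestamp: u64,
}

/// Hypothetical Lens interface: choose one winner among the candidates.
pub trait Lens {
    fn resolve<'a>(&self, candidates: &'a [Assertion]) -> Option<&'a Assertion>;
}

/// Hypothetical lens: the newest timestamp wins, with the hash as a
/// tie-breaker so the same inputs always produce the same answer.
pub struct LatestWins;

impl Lens for LatestWins {
    fn resolve<'a>(&self, candidates: &'a [Assertion]) -> Option<&'a Assertion> {
        candidates.iter().max_by_key(|a| (a.timestamp, a.hash))
    }
}

pub struct Store {
    /// Stand-in for `{subject}\x00SP:{predicate}` -> [Hash...]
    pub sp_index: BTreeMap<(String, String), Vec<u64>>,
    /// Stand-in for `{subject}\x00H:{hash}` -> Assertion
    pub assertions: BTreeMap<(String, u64), Assertion>,
}

/// GET(Subject, Predicate, Lens): lookup, hydrate, resolve.
pub fn get<L: Lens>(
    store: &Store,
    subject: &str,
    predicate: &str,
    lens: &L,
) -> Option<Assertion> {
    // Lookup: the compound index yields the candidate hashes.
    let hashes = store
        .sp_index
        .get(&(subject.to_string(), predicate.to_string()))?;
    // Hydrate: load each assertion; hashes not yet present are skipped.
    let candidates: Vec<Assertion> = hashes
        .iter()
        .filter_map(|h| store.assertions.get(&(subject.to_string(), *h)).cloned())
        .collect();
    // Resolve: the lens picks the answer.
    lens.resolve(&candidates).cloned()
}
```

Determinism in this sketch comes from the lens using a total order over candidates, so the result does not depend on the order in which hashes appear in the index.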