# Storage
**Last Updated:** 2026-02-19
**Confidence:** High
## Summary
Episteme uses a log-structured, content-addressed storage model. Writes are appended to the write-ahead log (WAL) and indexed asynchronously. Reads query the indexes and apply Lenses.
**Key Facts:**
- Append-only (never mutate)
- WAL for durability (fsync on write)
- KV store: HybridStore (fjall for writes, redb for reads)
- Content-addressed by BLAKE3 hash
**File Pointers:**
- `crates/stemedb-storage/src/traits.rs` - KVStore trait
- `crates/stemedb-storage/src/key_codec.rs` - Centralized key encoding (40+ builders, subject validation, extraction)
- `crates/stemedb-storage/src/hybrid_backend.rs` - HybridStore (routes to fjall or redb)
- `crates/stemedb-storage/src/fjall_backend.rs` - FjallStore (write-heavy keys)
- `crates/stemedb-storage/src/redb_backend.rs` - RedbStore (read-heavy keys)
- `crates/stemedb-storage/src/serde_helpers.rs` - Storage-layer serialize/deserialize helpers
- `crates/stemedb-storage/src/vote_store.rs` - VoteStore (Ballot Box)
- `crates/stemedb-storage/src/index_store.rs` - IndexStore (S: and SP: indexes)
- `crates/stemedb-storage/src/trust_rank_store.rs` - TrustRankStore (`\x00TRUST:` keys)
## KV Layout
All keys are built by the centralized `key_codec` module (`crates/stemedb-storage/src/key_codec.rs`). Subject-scoped keys use a `{subject}\x00` prefix so that all data for a subject is co-located; global keys use a bare `\x00` prefix so that they sort before every subject-scoped key.
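To make the prefix scheme concrete, here is a minimal sketch that uses a `BTreeMap` as a stand-in for the ordered KV store. The functions `subject_key`, `global_key`, and `keys_for_subject` are hypothetical names for illustration only; they are not the actual `key_codec` builders, and the sketch assumes subjects never contain a `\x00` byte.

```rust
use std::collections::BTreeMap;

/// Hypothetical builder for a subject-scoped key: `{subject}\x00{tag}:{suffix}`.
fn subject_key(subject: &str, tag: &str, suffix: &str) -> Vec<u8> {
    let mut key = Vec::with_capacity(subject.len() + tag.len() + suffix.len() + 2);
    key.extend_from_slice(subject.as_bytes());
    key.push(0x00);
    key.extend_from_slice(tag.as_bytes());
    key.push(b':');
    key.extend_from_slice(suffix.as_bytes());
    key
}

/// Hypothetical builder for a global key: `\x00{tag}:{suffix}`.
fn global_key(tag: &str, suffix: &str) -> Vec<u8> {
    let mut key = vec![0x00];
    key.extend_from_slice(tag.as_bytes());
    key.push(b':');
    key.extend_from_slice(suffix.as_bytes());
    key
}

/// Collects every key under one subject by scanning its `{subject}\x00` prefix.
/// Because keys are ordered bytewise, the scan touches one contiguous range.
fn keys_for_subject(store: &BTreeMap<Vec<u8>, Vec<u8>>, subject: &str) -> Vec<Vec<u8>> {
    let mut prefix = subject.as_bytes().to_vec();
    prefix.push(0x00);
    store
        .range(prefix.clone()..)
        .take_while(|(k, _)| k.starts_with(&prefix))
        .map(|(k, _)| k.clone())
        .collect()
}
```

The trailing `\x00` in the scan prefix keeps a subject such as `earth` from matching keys that belong to `earthling`.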
### Subject-Prefixed Keys (co-located per subject)
| Key Pattern | Value | Purpose |
|-------------|-------|---------|
| `{subject}\x00H:{hash}` | `Assertion` (serialized) | Main content store |
| `{subject}\x00S:{hash_list}` | `Vec<Hash>` (rkyv) | Subject index (IndexStore) |
| `{subject}\x00SP:{predicate}` | `Vec<Hash>` (rkyv) | Compound index (IndexStore) |
| `{subject}\x00MV:{predicate}` | `MaterializedView` (rkyv) | Pre-computed winner (Materializer) |
| `{subject}\x00V:{hash}:{vh}` | `Vote` (serialized) | Ballot Box votes |
| `{subject}\x00VC:{hash}` | `u64` (LE bytes) | Vote count cache |
| `{subject}\x00VW:{hash}` | `f32` (LE bytes) | Aggregate weight cache |
| `{subject}\x00GS:{predicate}` | `GoldStandard` (rkyv) | Gold standard entries |
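The `VC:` and `VW:` rows store fixed-width little-endian values. As a rough sketch of what that encoding looks like with the standard library (the function names here are hypothetical, not the VoteStore API):

```rust
use std::convert::TryInto;

/// Encodes a vote count as 8 little-endian bytes.
fn encode_vote_count(count: u64) -> [u8; 8] {
    count.to_le_bytes()
}

/// Decodes a vote count; returns `None` if the value is not exactly 8 bytes.
fn decode_vote_count(bytes: &[u8]) -> Option<u64> {
    let arr: [u8; 8] = bytes.try_into().ok()?;
    Some(u64::from_le_bytes(arr))
}

/// Encodes an aggregate weight as 4 little-endian bytes.
fn encode_weight(weight: f32) -> [u8; 4] {
    weight.to_le_bytes()
}

/// Decodes an aggregate weight; returns `None` if the value is not exactly 4 bytes.
fn decode_weight(bytes: &[u8]) -> Option<f32> {
    let arr: [u8; 4] = bytes.try_into().ok()?;
    Some(f32::from_le_bytes(arr))
}
```

Rejecting values of the wrong length, rather than padding or truncating them, surfaces a corrupted cache entry instead of silently returning a wrong count.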
### Global Keys (sort first via `\x00` prefix)
| Key Pattern | Value | Purpose |
|-------------|-------|---------|
| `\x00TRUST:{agent_id}` | `TrustRank` (rkyv) | Agent reputation (TrustRankStore) |
| `\x00QUOTA:{agent_id}:{window}` | Quota record | Per-agent per-window quota |
| `\x00QLIMIT:{agent_id}` | Quota limit | Per-agent quota limit |
| `\x00E:{epoch_id}` | `Epoch` (serialized) | Paradigm definitions |
| `\x00SUPERSEDED:{epoch_id}` | Supersession marker | O(1) epoch supersession lookup |
| `\x00SUP:{hash}` | Supersession record | Supersession data |
| `\x00AUD:{query_id}` | `QueryAudit` (rkyv) | Query audit trail |
| `\x00ESC:{ts}:{id}` | `EscalationEvent` (rkyv) | Escalation events |
| `\x00TP:{pack_id}` | `TrustPack` (rkyv) | Trust packs |
| `\x00META:{key}` | Varies | System metadata (e.g., cursor) |
| `\x00HASH_SUBJECT:{hash}` | Subject string | Reverse lookup: hash → subject |
| `\x00SUBJECTS:{subject}` | Marker | Known subjects index |
| `\x00GS_LIST:{subj}:{pred}` | Listing data | Gold standard listing |
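Because assertions live under subject-prefixed keys, loading one by hash alone takes two steps: resolve the subject through `\x00HASH_SUBJECT:{hash}`, then read `{subject}\x00H:{hash}`. The sketch below illustrates this with a `BTreeMap` standing in for the KV store. The function names are hypothetical, and it assumes the hash appears in keys as a hex string, which the tables above do not specify.

```rust
use std::collections::BTreeMap;

/// In-memory stand-in for the ordered KV store.
type Store = BTreeMap<Vec<u8>, Vec<u8>>;

/// Builds the global reverse-lookup key `\x00HASH_SUBJECT:{hash}`.
fn hash_subject_key(hash_hex: &str) -> Vec<u8> {
    let mut key = b"\x00HASH_SUBJECT:".to_vec();
    key.extend_from_slice(hash_hex.as_bytes());
    key
}

/// Builds the subject-scoped content key `{subject}\x00H:{hash}`.
fn assertion_key(subject: &str, hash_hex: &str) -> Vec<u8> {
    let mut key = subject.as_bytes().to_vec();
    key.extend_from_slice(b"\x00H:");
    key.extend_from_slice(hash_hex.as_bytes());
    key
}

/// Resolves hash -> subject, then loads the serialized assertion bytes.
fn load_by_hash<'a>(store: &'a Store, hash_hex: &str) -> Option<&'a [u8]> {
    let subject_bytes = store.get(&hash_subject_key(hash_hex))?;
    let subject = std::str::from_utf8(subject_bytes).ok()?;
    store
        .get(&assertion_key(subject, hash_hex))
        .map(|value| value.as_slice())
}
```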
## Serialization
### stemedb-core (shared types)
For core types, use the canonical module:
```rust
use stemedb_core::serde::{serialize, deserialize};
let bytes = serialize(&my_value)?;
let value: MyType = deserialize(&bytes)?;
```
**File:** `crates/stemedb-core/src/serde.rs`
Raw `AllocSerializer` usage is prohibited in production code (enforced via CLAUDE.md).
### stemedb-storage (store implementations)
In storage modules, use the storage-layer helpers that map to `StorageError`:
```rust
use crate::serde_helpers::{serialize, deserialize};
let bytes = serialize(&my_value)?; // Returns Result<Vec<u8>, StorageError>
let value: MyType = deserialize(&bytes)?;
```
**File:** `crates/stemedb-storage/src/serde_helpers.rs`
These helpers provide unified error handling across all store implementations (VoteStore, IndexStore, TrustRankStore, AuditStore, TrustPackStore, QuotaStore).
For types with schema evolution (rkyv compat), use the dedicated compat functions:
```rust
use crate::serde_helpers::deserialize_source_record_compat;
let record: SourceRecord = deserialize_source_record_compat(&bytes)?;
```
Available compat deserializers: `deserialize_source_record_compat` (SourceRecord). For assertions, use `stemedb_core::serde::deserialize_assertion_compat` directly.
## Write Path
```
1. Agent submits signed Assertion
2. Validate signature
3. Append to WAL (fsync)
4. Return 202 Accepted with Hash
5. Background: tail WAL -> update indexes
```
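The steps above can be sketched in memory as follows. This is an illustration under loud assumptions, not the actual implementation: the `Assertion` fields, `Wal`, `Indexer`, and `submit` are hypothetical, signature validation is reduced to a boolean flag, a `Vec` stands in for the fsynced WAL file, and `DefaultHasher` stands in for BLAKE3.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeMap;
use std::hash::{Hash, Hasher};

/// Hypothetical, simplified assertion; the real type has its own fields.
#[derive(Debug, Clone, Hash)]
struct Assertion {
    subject: String,
    predicate: String,
    object: String,
    signature_valid: bool, // stand-in for real signature verification
}

#[derive(Debug, PartialEq)]
enum WriteError {
    InvalidSignature,
}

/// In-memory stand-in for the WAL; a real WAL appends to a file and fsyncs.
#[derive(Default)]
struct Wal {
    entries: Vec<(u64, Assertion)>,
}

/// Background indexer that tails the WAL from a cursor.
#[derive(Default)]
struct Indexer {
    cursor: usize,
    sp_index: BTreeMap<(String, String), Vec<u64>>,
}

/// Stand-in content hash (the document specifies BLAKE3; this is NOT BLAKE3).
fn content_hash(assertion: &Assertion) -> u64 {
    let mut hasher = DefaultHasher::new();
    assertion.hash(&mut hasher);
    hasher.finish()
}

/// Steps 1-4: validate, append to the WAL, and return the hash to the caller.
fn submit(wal: &mut Wal, assertion: Assertion) -> Result<u64, WriteError> {
    if !assertion.signature_valid {
        return Err(WriteError::InvalidSignature);
    }
    let hash = content_hash(&assertion);
    wal.entries.push((hash, assertion));
    Ok(hash)
}

impl Indexer {
    /// Step 5: index every WAL entry past the cursor; returns how many were indexed.
    fn catch_up(&mut self, wal: &Wal) -> usize {
        let pending = &wal.entries[self.cursor..];
        for (hash, assertion) in pending {
            self.sp_index
                .entry((assertion.subject.clone(), assertion.predicate.clone()))
                .or_default()
                .push(*hash);
        }
        self.cursor = wal.entries.len();
        pending.len()
    }
}
```

Since `submit` returns before `catch_up` runs, a freshly accepted assertion is durable but not yet visible to index lookups, which is consistent with the write path returning 202 Accepted rather than 200.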
## Read Path
```
1. Query: GET(Subject, Predicate, Lens)
2. Lookup: {subject}\x00SP:{predicate} -> [Hash...]
3. Hydrate: Load assertions from {subject}\x00H:{hash}
4. Resolve: Apply Lens
5. Return: Deterministic answer
```
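A minimal in-memory sketch of the lookup, hydrate, and resolve steps follows. The `Store` layout, the `Assertion` fields, the `Lens` signature, and the `latest_wins` lens are all hypothetical and chosen only for illustration; actual Lenses may resolve conflicts quite differently.

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified assertion for this sketch.
#[derive(Debug, Clone, PartialEq)]
struct Assertion {
    hash: u64,
    object: String,
    timestamp: u64,
}

/// A Lens picks one winner from the candidate assertions (assumed signature).
type Lens = fn(&[Assertion]) -> Option<&Assertion>;

/// Example lens: newest timestamp wins, with ties broken by hash so the
/// answer stays deterministic regardless of index order.
fn latest_wins(candidates: &[Assertion]) -> Option<&Assertion> {
    candidates.iter().max_by_key(|a| (a.timestamp, a.hash))
}

/// In-memory stand-ins for the SP index and the H content store.
#[derive(Default)]
struct Store {
    sp_index: BTreeMap<(String, String), Vec<u64>>,
    assertions: BTreeMap<u64, Assertion>,
}

/// Lookup -> hydrate -> resolve, mirroring steps 2-5 of the read path.
fn get(store: &Store, subject: &str, predicate: &str, lens: Lens) -> Option<Assertion> {
    // Lookup: (subject, predicate) -> list of hashes.
    let hashes = store
        .sp_index
        .get(&(subject.to_string(), predicate.to_string()))?;
    // Hydrate: load each assertion, skipping hashes that are not stored.
    let candidates: Vec<Assertion> = hashes
        .iter()
        .filter_map(|hash| store.assertions.get(hash).cloned())
        .collect();
    // Resolve: let the lens choose the winner.
    let winner = lens(&candidates).cloned();
    winner
}
```

Passing the lens as a parameter keeps storage independent of resolution policy: the same index lookup can serve different lenses without being rewritten.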
## Related Topics
- [Assertion](./assertion.md)
- [Ballot Box](./ballot-box.md) - High-velocity vote storage
- [Architecture](../../../architecture.md)