Phase 1 delivers the complete durability and storage layer:
- WAL with crash recovery: Append-only journal with BLAKE3 checksums,
fsync guarantees, and proper seek-to-EOF on reopen
- Storage engine: sled-backed KVStore with scan_prefix for range queries
- Content-addressed storage: H:{hash}, V:{hash}, E:{hash} key patterns
- Ingestor: Background worker tailing WAL, writing to KV with 8-byte
aligned record headers for rkyv zero-copy deserialization
- Comprehensive tests: 31 tests covering crash recovery, round-trips,
and multi-cycle durability
New crates: stemedb-wal, stemedb-storage, stemedb-ingest
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Episteme (StemeDB) Architecture
Design Philosophy: Immutable History, Probabilistic Resolution, Materialized Speed.
Status: Draft Spec v1.0
1. System Overview
Episteme is a Log-Structured, Content-Addressed Knowledge Graph. Unlike traditional databases that mutate state in place, Episteme appends Assertions to an immutable ledger (Merkle DAG). State resolution happens via Lenses.
To solve the O(N) read latency of conflict resolution, Episteme employs a Materialized View layer that pre-calculates the "Current Truth" for standard lenses.
High-Level Data Flow
[Writer Agent]                  [Reader Agent]
      │                               ▲
      │ (1) Sign &                    │ (6) Sub-millisecond Answer
      │     Propose                   │     (Pre-computed)
      ▼                               │
┌────────────┐                  ┌────────────┐
│ Ingestion  │                  │ Resolution │
│  Gateway   │                  │   Engine   │
└─────┬──────┘                  └─────┬──────┘
      │ (2) Append                    │ (5) Read Materialized View
      │     to Ballot                 │     OR Apply Custom Lens
      ▼                               │
┌────────────┐                  ┌────────────┐
│ Quarantine │                  │  Indexing  │
│  Journal   │─────────────────►│  Service   │
└─────┬──────┘       (3)        └─────┬──────┘
      │                               │ (4) Compaction & Materialization
      ▼                               ▼
┌────────────┐                  ┌──────────────┐
│ Job Manager│                  │ Materialized │
└────────────┘                  │    Views     │
 (TAN Meter)                    └──────────────┘
2. Core Data Structures
2.1. The Atomic Unit: Assertion (The Candidate)
Assertions are proposals of truth. They are immutable.
struct Assertion {
    pub subject: EntityId,
    pub predicate: RelationId,
    pub object: Value,
    pub epoch: Option<EpochId>,
    pub agent_id: PublicKey, // The Proposer
    pub timestamp: u64,
    // ... lineage and vector fields ...
}
2.2. The Ballot Box: Vote (The High-Velocity Stream)
To prevent lock contention on Assertions, Agents write Votes to a separate high-velocity log.
struct Vote {
    pub assertion_hash: Hash, // What are we voting on?
    pub agent_id: PublicKey,  // Who is voting?
    pub weight: f32,          // 0.0 - 1.0 (Confidence)
    pub signature: Signature, // Cryptographic proof
    pub timestamp: u64,
}
2.3. The Storage Layout (LSM Tree)
| Key | Value | Purpose |
|---|---|---|
| `H:{Hash}` | `Assertion` | Immutable Content Store |
| `V:{Hash}` | `List<Vote>` | The Ballot Box (Append-only) |
| `MV:{Subject}:{Predicate}` | `Assertion` | Materialized View (The "Winner") |
| `S:{Subject}` | `List<Hash>` | Adjacency Index |
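The key patterns in the table above can be built with a few small helpers. This is a minimal sketch, not the project's actual code: it assumes a `Hash` is a 32-byte digest rendered as lowercase hex and that subjects and predicates are plain strings. The function names (`assertion_key`, `votes_key`, `materialized_key`, `adjacency_key`, `to_hex`) are hypothetical.

```rust
use std::fmt::Write;

/// Assumed digest size (32 bytes); the spec only names the type `Hash`.
type Hash = [u8; 32];

/// Render a digest as lowercase hex (hypothetical helper).
fn to_hex(hash: &Hash) -> String {
    let mut out = String::with_capacity(hash.len() * 2);
    for byte in hash {
        // Writing into a String cannot fail.
        write!(out, "{:02x}", byte).unwrap();
    }
    out
}

/// `H:{Hash}`: key of an assertion in the immutable content store.
fn assertion_key(hash: &Hash) -> String {
    format!("H:{}", to_hex(hash))
}

/// `V:{Hash}`: key of the vote list for one assertion.
fn votes_key(hash: &Hash) -> String {
    format!("V:{}", to_hex(hash))
}

/// `MV:{Subject}:{Predicate}`: key of the materialized "winner".
fn materialized_key(subject: &str, predicate: &str) -> String {
    format!("MV:{}:{}", subject, predicate)
}

/// `S:{Subject}`: key of the adjacency index for one subject.
fn adjacency_key(subject: &str) -> String {
    format!("S:{}", subject)
}
```

Because every key family shares a fixed prefix, a prefix scan over `H:`, `V:`, `MV:`, or `S:` would enumerate one family without touching the others.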
3. The Write Path (The Ballot Box)
- Ingest: Agents submit `Assertions` or `Votes`.
- Journal: Written to `episteme-wal`.
- Ballot Box: Votes are appended to the `V:{Hash}` stream.
- Compactor (Async): A background worker aggregates Votes + TrustRank to update the `MV:{Subject}:{Predicate}` key.
  - This ensures that Read queries (`GET /query`) are O(1) lookups on the Materialized View, not O(N) calculations.
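The compaction step can be sketched as follows. The spec does not define how Votes and TrustRank are combined, so this example assumes the simplest rule: each vote contributes `weight * trust`, and the candidate with the highest total wins. The simplified `Vote`, the `u64` hash, the `trust` map, and the functions `pick_winner` and `compact` are all illustrative assumptions. A `HashMap` stands in for the KV store.

```rust
use std::collections::HashMap;

// Simplified stand-ins for the spec's types (assumptions, not the real definitions).
type Hash = u64;
type AgentId = u32;

struct Vote {
    assertion_hash: Hash,
    agent_id: AgentId,
    weight: f32, // 0.0 - 1.0 confidence
}

/// Sum trust-weighted votes per candidate and return the winning assertion hash.
/// `trust` maps an agent to a hypothetical TrustRank multiplier; unknown agents count as 0.
/// Ties are broken deterministically in favor of the lowest hash.
fn pick_winner(votes: &[Vote], trust: &HashMap<AgentId, f32>) -> Option<Hash> {
    let mut scores: HashMap<Hash, f32> = HashMap::new();
    for v in votes {
        let rank = trust.get(&v.agent_id).copied().unwrap_or(0.0);
        *scores.entry(v.assertion_hash).or_insert(0.0) += v.weight.clamp(0.0, 1.0) * rank;
    }
    scores
        .into_iter()
        .max_by(|a, b| a.1.total_cmp(&b.1).then(b.0.cmp(&a.0)))
        .map(|(hash, _)| hash)
}

/// One compaction pass for a (subject, predicate) pair: recompute the winner and
/// store it under the `MV:{Subject}:{Predicate}` key.
fn compact(
    mv: &mut HashMap<String, Hash>,
    subject: &str,
    predicate: &str,
    votes: &[Vote],
    trust: &HashMap<AgentId, f32>,
) {
    if let Some(winner) = pick_winner(votes, trust) {
        mv.insert(format!("MV:{}:{}", subject, predicate), winner);
    }
}
```

Doing the aggregation here, off the read path, is what lets a reader fetch the winner with a single keyed lookup.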
4. The Read Path (The Cortex)
Fast Path (Standard Lenses):
- Query: `GET /query?lens=Consensus`
- Action: `GET MV:{Subject}:{Predicate}`
- Cost: O(1). Low latency.
Slow Path (Custom/Skeptic Lenses):
- Query: `GET /query?lens=Skeptic`
- Action: Gather all candidates + votes, compute variance on the fly.
- Cost: O(N). High latency, used for analysis/debugging.
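The cost difference between the two paths can be illustrated with a short sketch. The spec does not say what the variance is taken over, so this example assumes the population variance of the vote weights. The names `query_consensus` and `skeptic_variance` are hypothetical, and a `HashMap` stands in for the KV store.

```rust
use std::collections::HashMap;

// Simplified stand-in for the spec's `Hash` type (assumption).
type Hash = u64;

/// Fast path: a single keyed lookup on the materialized view.
fn query_consensus(mv: &HashMap<String, Hash>, subject: &str, predicate: &str) -> Option<Hash> {
    mv.get(&format!("MV:{}:{}", subject, predicate)).copied()
}

/// Slow path: visit every vote weight and compute the population variance.
/// Returns None when there are no votes.
fn skeptic_variance(weights: &[f32]) -> Option<f32> {
    if weights.is_empty() {
        return None;
    }
    let n = weights.len() as f32;
    let mean = weights.iter().sum::<f32>() / n;
    let var = weights.iter().map(|w| (w - mean).powi(2)).sum::<f32>() / n;
    Some(var)
}
```

The fast path touches one key regardless of how many votes exist, while the slow path does work proportional to the number of votes it gathers.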
Standard Lenses
- Consensus: Highest cluster density.
- Authority: Filter by Reputation.
- Recency: Last Writer Wins.
- Skeptic: Returns variance/conflict metrics.
- EpochAware: Validates against current paradigm.
- Constraints: Returns all `must_use`/`forbidden` assertions for a context. Acts as a "Pre-Flight Check" to solve the Optimization Conflict.
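Two of the lenses listed above are simple enough to sketch directly. This is an illustration under assumptions, not the spec's API: `Candidate` is a reduced stand-in for `Assertion`, `reputation` is a hypothetical per-agent score, and the Authority lens is assumed to return the most reputable candidate that passes the filter (the spec says only "Filter by Reputation").

```rust
/// Reduced candidate (assumption): only the fields these two lenses need.
#[derive(Debug, Clone, PartialEq)]
struct Candidate {
    value: String,
    timestamp: u64,
    reputation: f32, // hypothetical reputation score of the proposing agent
}

/// Two of the standard lenses; the others need data this sketch does not model.
enum Lens {
    Recency,
    Authority { min_reputation: f32 },
}

impl Lens {
    /// Pick one candidate according to the lens, or None if nothing qualifies.
    fn resolve<'a>(&self, candidates: &'a [Candidate]) -> Option<&'a Candidate> {
        match self {
            // Last Writer Wins: the candidate with the greatest timestamp.
            Lens::Recency => candidates.iter().max_by_key(|c| c.timestamp),
            // Drop candidates below the threshold, then take the most reputable one.
            Lens::Authority { min_reputation } => candidates
                .iter()
                .filter(|c| c.reputation >= *min_reputation)
                .max_by(|a, b| a.reputation.total_cmp(&b.reputation)),
        }
    }
}
```

Modeling lenses as an enum keeps the set of standard lenses closed and lets the compiler check that each one is handled.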
5. The Meter (Economic Safety)
To prevent infinite loops, the Job Manager enforces Temporal Advantage Normalization (TAN).
- Budgeting: Every Job must declare a `max_cost`.
- Throttling: Forking Reality or Deep Recursion is rejected if `current_cost + projected_cost > max_cost`.
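The throttling rule above translates almost directly into code. In this minimal sketch the `Budget` and `BudgetExceeded` types and the `try_charge` method are hypothetical, and the cost unit is assumed to be a `u64` because the spec leaves it unspecified.

```rust
/// Cost accounting for one Job. The unit is unspecified in the spec; u64 is an assumption.
struct Budget {
    max_cost: u64,
    current_cost: u64,
}

/// Error returned when an operation would push the Job past its declared budget.
#[derive(Debug, PartialEq)]
struct BudgetExceeded;

impl Budget {
    fn new(max_cost: u64) -> Self {
        Budget { max_cost, current_cost: 0 }
    }

    /// Reject the operation if `current_cost + projected_cost > max_cost`;
    /// otherwise record the charge. Arithmetic overflow also counts as exceeding the budget.
    fn try_charge(&mut self, projected_cost: u64) -> Result<(), BudgetExceeded> {
        match self.current_cost.checked_add(projected_cost) {
            Some(total) if total <= self.max_cost => {
                self.current_cost = total;
                Ok(())
            }
            _ => Err(BudgetExceeded),
        }
    }
}
```

Using `checked_add` keeps a very large `projected_cost` from wrapping around and slipping under the limit.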
6. The Simulator (Mid-Training Pipeline)
The system continuously exports data to train the next generation of Agents.
- Negative Samples: High-confidence assertions that were later superseded (Failures).
- Golden Paths: Branches that successfully merged to Main (Successes).
- Format: Exported as HuggingFace-compatible datasets for LoRA fine-tuning.
7. Implementation Roadmap
Phase 1: The Spine (Foundation)
- Reuse `quarantine-journal` pattern for WAL.
- Implement `Assertion`, `Epoch`, and `Vote` structs.
- Basic `sled` storage backend.
Phase 2: The Lattice (Connectivity)
- The Ballot Box: Implement separate Vote storage stream.
- Materializer: Implement background worker to maintain `MV` keys.
- The Meter: Implement Budget/TAN middleware in Job Manager.
- Agent Wallet: Sidecar for key management/signing.
Phase 3: The Cortex (Reasoning)
- SMT Backend & Branching.
- Vector Search.
- Lens: Constraints: Implement the pre-flight check logic.
Phase 4: The Hive (Learning)
- The Simulator: Log exporter pipeline.
- TrustRank Learning Loop.