Phase 1 delivers the complete durability and storage layer:
- WAL with crash recovery: Append-only journal with BLAKE3 checksums,
fsync guarantees, and proper seek-to-EOF on reopen
- Storage engine: sled-backed KVStore with scan_prefix for range queries
- Content-addressed storage: H:{hash}, V:{hash}, E:{hash} key patterns
- Ingestor: Background worker tailing WAL, writing to KV with 8-byte
aligned record headers for rkyv zero-copy deserialization
- Comprehensive tests: 31 tests covering crash recovery, round-trips,
and multi-cycle durability
New crates: stemedb-wal, stemedb-storage, stemedb-ingest
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Episteme: The Probabilistic Knowledge Lattice
Internal Codename: StemeDB | Category: Infrastructure / Database | Role: The Cortex (Reasoning & Truth)
1. The Manifesto: "A Marketplace of Truth"
We are building the shared, long-term memory for autonomous research agents.
Current databases (Postgres, Neo4j, Vector DBs) suffer from The Tower of Babel problem: they store Data, not Evidence. They are deterministic, stateless, and brittle.
Episteme rejects the idea of a single, static "database state." Instead, it models knowledge as a Probabilistic Marketplace.
- Democracy: Truth is established via high-velocity consensus (Voting), not just overwrite privileges.
- Economics: Reasoning has a cost. The system enforces efficiency via "The Meter."
- Evolution: The database doesn't just store data; it exports training sets ("The Simulator") to make agents smarter.
2. The Core Data Model: The Hyper-Edge
The atomic unit of Episteme is not a Row, Document, or Embedding. It is the Signed Assertion.
    struct Assertion {
        // The Proposition (The "What")
        subject: EntityId,          // e.g., "Tesla_Inc"
        predicate: RelationId,      // e.g., "has_annual_revenue"
        object: Value,              // e.g., "$96.7B"

        // The Meta-Cognition (The "Why")
        confidence: f32,            // 0.0 to 1.0 (Agent's subjective certainty)
        source_hash: Hash,          // Content-addressed link to source (PDF, URL, Log)
        visual_hash: Option<Hash>,  // pHash for visual anchoring against web drift
        agent_id: PublicKey,        // Who made this claim? (Cryptographic multi-sig)
        timestamp: u64,             // When?

        // The Semantic Vector (The "Meaning")
        vector: Vec<f32>,           // Embedding for semantic navigation

        // The Paradigm (The "Context")
        epoch: Option<EpochId>,     // "covid-guidelines-2020", "gaap-2024"
    }
3. The Query Engine: "Truth Lenses"
Reading is a compute-heavy operation: every query must apply a Lens to collapse the probabilistic field into a concrete answer. To keep reads fast (the target is sub-millisecond latency), Episteme uses Materialized Views to pre-calculate the results of standard lenses.
Standard Lenses
- Lens::Consensus: Returns the value with the highest cluster density (Weighted by Multi-Sig). Materialized for speed.
- Lens::Authority: Filters by Agent Reputation (TrustRank).
- Lens::Recency: Returns the latest assertion, ignoring history.
- Lens::EpochAware: Validates assertions against the current paradigm, filtering superseded epochs.
- Lens::Skeptic: Returns the variance between claims (identifies high-conflict/unstable truth).
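To make the lens idea concrete, here is a minimal sketch of three of the lenses applied to the competing claims for one subject/predicate pair. Everything in it is an assumption for illustration: `Claim` is a cut-down stand-in for the Assertion struct, `resolve` and `Resolved` are hypothetical names, "cluster density" is approximated by summed confidence per distinct value, and the Skeptic score is a simple disagreement share rather than a true variance.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified stand-in for `Assertion`: one claimed value
/// for a single (subject, predicate) pair.
#[derive(Debug, Clone)]
pub struct Claim {
    pub value: String,
    pub confidence: f32, // 0.0 to 1.0
    pub timestamp: u64,
}

/// Three of the standard lenses. Authority and EpochAware are omitted
/// because they need reputation and epoch data not modeled here.
pub enum Lens {
    Consensus,
    Recency,
    Skeptic,
}

#[derive(Debug, PartialEq)]
pub enum Resolved {
    Value(String),
    /// Skeptic output: 1.0 minus the leading value's share of total weight.
    Conflict(f32),
    Empty,
}

/// Sum confidence per distinct value (a crude proxy for cluster density).
fn weights(claims: &[Claim]) -> HashMap<&str, f32> {
    let mut w = HashMap::new();
    for c in claims {
        *w.entry(c.value.as_str()).or_insert(0.0) += c.confidence;
    }
    w
}

pub fn resolve(lens: &Lens, claims: &[Claim]) -> Resolved {
    if claims.is_empty() {
        return Resolved::Empty;
    }
    match lens {
        // Latest assertion wins, history is ignored.
        Lens::Recency => {
            let latest = claims.iter().max_by_key(|c| c.timestamp).unwrap();
            Resolved::Value(latest.value.clone())
        }
        // Value with the largest summed confidence wins.
        Lens::Consensus => {
            let w = weights(claims);
            let (best, _) = w.iter().max_by(|a, b| a.1.total_cmp(b.1)).unwrap();
            Resolved::Value(best.to_string())
        }
        // Report how contested the leading value is.
        Lens::Skeptic => {
            let w = weights(claims);
            let total: f32 = w.values().sum();
            let top = w.values().cloned().fold(0.0_f32, f32::max);
            if total == 0.0 {
                Resolved::Conflict(0.0)
            } else {
                Resolved::Conflict(1.0 - top / total)
            }
        }
    }
}
```

In this sketch every call recomputes the answer from the raw claims; a materialized view would cache the result of `resolve` per lens and refresh it in the background.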
4. Features for the Agentive Team
4.1. "Forking Reality" (Branching)
Agents need to simulate futures without polluting the main branch. Episteme supports Copy-on-Write Branching via Sparse Merkle Trees.
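The copy-on-write behavior can be illustrated with a much simpler structure than a Sparse Merkle Tree. The sketch below is an assumption-laden stand-in: a `Branch` (hypothetical name) is an overlay of local writes on top of a shared, frozen parent, so forking copies nothing and writes on the fork never reach the parent. It shows the isolation property only, not the hashing or proofs a Merkle structure would add.

```rust
use std::collections::HashMap;
use std::rc::Rc;

/// Hypothetical overlay branch: local writes layered over a frozen parent.
pub struct Branch {
    parent: Option<Rc<Branch>>,
    writes: HashMap<String, String>,
}

impl Branch {
    /// An empty branch with no parent.
    pub fn root() -> Self {
        Branch { parent: None, writes: HashMap::new() }
    }

    /// Start a child branch on top of `base`. O(1): the parent is shared
    /// through `Rc`, and no entries are copied.
    pub fn fork(base: &Rc<Branch>) -> Self {
        Branch { parent: Some(Rc::clone(base)), writes: HashMap::new() }
    }

    /// Record a write in this branch only.
    pub fn put(&mut self, key: &str, value: &str) {
        self.writes.insert(key.to_string(), value.to_string());
    }

    /// Look in local writes first, then walk up the parent chain.
    pub fn get(&self, key: &str) -> Option<&str> {
        match self.writes.get(key) {
            Some(v) => Some(v.as_str()),
            None => self.parent.as_ref().and_then(|p| p.get(key)),
        }
    }
}
```

Wrapping the main branch in `Rc` makes it immutable from the fork's point of view, which is what keeps simulated futures from polluting it.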
4.2. The Ballot Box: High-Velocity Consensus
To avoid write contention, Episteme separates the "Candidate" (Assertion) from the "Votes" (Signatures).
- Agents write Votes to a high-speed append-only log ("The Ballot Box").
- A background process aggregates these votes to update the Materialized View.
- This allows thousands of agents to "vote" on a fact simultaneously without locking.
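The write path above can be sketched with a channel standing in for the append-only log. This is an in-memory illustration under stated assumptions, not the real Ballot Box: `Vote`, `View`, and `ballot_box` are hypothetical names, the queue is not durable, and vote weights are plain floats instead of verified signatures. The point is that writers only enqueue, while a single aggregator owns the tally, so no writer ever takes a lock on the view.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Receiver, Sender};

/// One vote: an agent endorses a candidate assertion with some weight.
pub struct Vote {
    pub candidate: String, // hypothetical: id or hash of the assertion
    pub weight: f32,
}

/// Materialized tally per candidate, owned by the aggregator only.
#[derive(Default)]
pub struct View {
    pub tally: HashMap<String, f32>,
}

impl View {
    /// Fold every vote currently queued into the tally and return how
    /// many were consumed. Never blocks.
    pub fn drain(&mut self, ballot_box: &Receiver<Vote>) -> usize {
        let mut n = 0;
        for v in ballot_box.try_iter() {
            *self.tally.entry(v.candidate).or_insert(0.0) += v.weight;
            n += 1;
        }
        n
    }

    /// Candidate with the highest accumulated weight, if any.
    pub fn leader(&self) -> Option<&str> {
        self.tally
            .iter()
            .max_by(|a, b| a.1.total_cmp(b.1))
            .map(|(k, _)| k.as_str())
    }
}

/// Create the ballot box: many cloneable writers, one aggregating reader.
pub fn ballot_box() -> (Sender<Vote>, Receiver<Vote>) {
    channel()
}
```

Each agent thread would hold its own clone of the `Sender`; the background process calls `drain` periodically and readers consult `leader` on the view.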
4.3. The Hive: Learning & Trust
Episteme uses Recursive TrustRank Optimization to advance the team's collective intelligence.
- Closed-Loop Learning: When an Agent's prediction is met by a reality assertion, the delta is back-propagated to the Agent's Reputation score.
- The Simulator (Mid-Training): A pipeline that converts high-confidence failure logs into Synthetic Trajectories, allowing agents to be fine-tuned on their own history (creating a "Memory Adapter" LoRA).
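The closed-loop step can be sketched as a single reputation update. The formula here is an assumption chosen for illustration, not the TrustRank algorithm itself: it scores a settled prediction with a Brier-style squared error and moves the reputation toward that score with an exponential moving average. The function name, the `rate` parameter, and the 0.0 to 1.0 reputation range are all hypothetical.

```rust
/// Move an agent's reputation toward the outcome of one settled prediction.
///
/// * `reputation`: current score in 0.0..=1.0
/// * `predicted_confidence`: the agent's stated confidence in its claim
/// * `came_true`: whether a later reality assertion confirmed the claim
/// * `rate`: hypothetical learning rate in 0.0..=1.0
pub fn update_reputation(
    reputation: f32,
    predicted_confidence: f32,
    came_true: bool,
    rate: f32,
) -> f32 {
    let outcome = if came_true { 1.0 } else { 0.0 };
    // Squared error: 0.0 for a confident hit, 1.0 for a confident miss.
    let error = (predicted_confidence - outcome).powi(2);
    let score = 1.0 - error;
    // Exponential moving average toward the score of this prediction.
    (reputation + rate * (score - reputation)).clamp(0.0, 1.0)
}
```

Under this rule a confident claim that is later confirmed raises reputation, and the same confidence on a refuted claim lowers it by a comparable amount.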
4.4. The Meter: Economics of Reasoning
Deep Research is computationally expensive. Episteme enforces Temporal Advantage Normalization (TAN).
- Budgeting: Every Job carries a budget (tokens/dollars).
- Throttling: The system rejects "Fork Reality" requests if the projected cost exceeds the Value of Information.
- Efficiency Rewards: Agents receive positive reinforcement signals for solving problems under budget.
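The budgeting and throttling rules can be sketched as an admission check. This is a minimal sketch under assumptions: costs and value of information are expressed in the same token unit, and `Job`, `Decision`, and `admit_fork` are hypothetical names. It does not attempt to model Temporal Advantage Normalization, which the text does not define.

```rust
/// Outcome of asking the Meter whether a fork may proceed.
#[derive(Debug, PartialEq)]
pub enum Decision {
    Approved { remaining: u64 },
    RejectedLowValue,
    RejectedOverBudget,
}

/// Hypothetical job record: a token budget and what has been spent so far.
pub struct Job {
    pub budget_tokens: u64,
    pub spent_tokens: u64,
}

/// Admit a "Fork Reality" request only if it is worth more than it costs
/// and the job can still afford it.
pub fn admit_fork(job: &Job, projected_cost: u64, value_of_information: u64) -> Decision {
    // Throttling: the projected cost must not exceed the expected value.
    if projected_cost > value_of_information {
        return Decision::RejectedLowValue;
    }
    // Budgeting: the job must have enough budget left.
    let remaining = job.budget_tokens.saturating_sub(job.spent_tokens);
    if projected_cost > remaining {
        return Decision::RejectedOverBudget;
    }
    Decision::Approved { remaining: remaining - projected_cost }
}
```

The `remaining` value returned on approval is what an efficiency reward could be computed from once the job finishes.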
5. Architecture: The Rust Stack
Episteme follows the "Defensive by Default" best practices.
Tier 1: The Spine (Durability)
- Component: `episteme-wal` (Quarantine Pattern)
- Role: Raw, serialized append-only log. Ensures we never lose a claim.
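As an illustration of what "never lose a claim" implies at the file level, here is a minimal append-only log using only the Rust standard library. It is a sketch under assumptions, not the `episteme-wal` implementation: the record layout (`[len: u32][checksum: u64][payload]`), the `append` and `replay` names, and the FNV-1a checksum are all hypothetical, with FNV-1a standing in for a cryptographic hash such as BLAKE3 so the example needs no external crates.

```rust
use std::fs::{File, OpenOptions};
use std::io::{self, Read, Write};
use std::path::Path;

/// Stand-in checksum (64-bit FNV-1a). A real WAL would use a stronger hash.
fn checksum(data: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for b in data {
        h ^= *b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

/// Append one record as [len: u32 LE][checksum: u64 LE][payload], then fsync.
pub fn append(path: &Path, payload: &[u8]) -> io::Result<()> {
    // Append mode places every write at end-of-file, including after reopen.
    let mut f = OpenOptions::new().create(true).append(true).open(path)?;
    let mut rec = Vec::with_capacity(12 + payload.len());
    rec.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    rec.extend_from_slice(&checksum(payload).to_le_bytes());
    rec.extend_from_slice(payload);
    f.write_all(&rec)?;
    // Only report success once the bytes have been flushed to disk.
    f.sync_all()
}

/// Replay all intact records, stopping at the first torn or corrupt one.
pub fn replay(path: &Path) -> io::Result<Vec<Vec<u8>>> {
    let mut buf = Vec::new();
    File::open(path)?.read_to_end(&mut buf)?;
    let mut out = Vec::new();
    let mut pos = 0;
    while buf.len() - pos >= 12 {
        let mut len_bytes = [0u8; 4];
        len_bytes.copy_from_slice(&buf[pos..pos + 4]);
        let mut sum_bytes = [0u8; 8];
        sum_bytes.copy_from_slice(&buf[pos + 4..pos + 12]);
        let len = u32::from_le_bytes(len_bytes) as usize;
        let start = pos + 12;
        if buf.len() - start < len {
            break; // torn tail: the payload was cut short by a crash
        }
        let payload = &buf[start..start + len];
        if checksum(payload) != u64::from_le_bytes(sum_bytes) {
            break; // corrupt record
        }
        out.push(payload.to_vec());
        pos = start + len;
    }
    Ok(out)
}
```

A production log would also truncate the file back to the last intact record during recovery, so that later appends do not land behind a torn tail.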
Tier 2: The Lattice (Graph/Index)
- Component: `episteme-core` (Hot/Warm memory)
- Warm Tier: `sled` (embedded log-structured store) for the Merkle DAG + `hnsw` for vector search.
- Ballot Box: High-velocity stream for vote ingestion.
Tier 3: The Cortex (Compute)
- Component: `episteme-lens`
- Role: The WASM runtime for executing Lenses and resolving probabilistic state.
- Materializer: Background worker maintaining O(1) read views.
6. The Ecosystem Triad
| System | Biological Analogy | Function | Question Answered |
|---|---|---|---|
| LogDB | The Spine | Immutable Event Log | "What happened?" |
| AssociativeDB | The Hippocampus | Associative Memory | "What is this like?" |
| Episteme | The Cortex | Structured Reasoning | "Is this true?" |