stemedb/vision.md
jordan 3cfaa1e1d3 feat: Complete Phase 1 (The Spine) - storage foundation
Phase 1 delivers the complete durability and storage layer:

- WAL with crash recovery: Append-only journal with BLAKE3 checksums,
  fsync guarantees, and proper seek-to-EOF on reopen
- Storage engine: sled-backed KVStore with scan_prefix for range queries
- Content-addressed storage: H:{hash}, V:{hash}, E:{hash} key patterns
- Ingestor: Background worker tailing WAL, writing to KV with 8-byte
  aligned record headers for rkyv zero-copy deserialization
- Comprehensive tests: 31 tests covering crash recovery, round-trips,
  and multi-cycle durability

New crates: stemedb-wal, stemedb-storage, stemedb-ingest

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-31 14:15:34 -07:00

# Episteme: The Probabilistic Knowledge Lattice
> **Internal Codename:** StemeDB
> **Category:** Infrastructure / Database
> **Role:** The Cortex (Reasoning & Truth)
## 1. The Manifesto: "A Marketplace of Truth"
We are building the shared, long-term memory for autonomous research agents.
Current databases (Postgres, Neo4j, Vector DBs) suffer from **The Tower of Babel** problem: they store *Data*, not *Evidence*. They are deterministic, stateless, and brittle.
**Episteme** rejects the idea of a single, static "database state." Instead, it models knowledge as a **Probabilistic Marketplace**.
* **Democracy:** Truth is established via high-velocity consensus (Voting), not just overwrite privileges.
* **Economics:** Reasoning has a cost. The system enforces efficiency via "The Meter."
* **Evolution:** The database doesn't just store data; it exports training sets ("The Simulator") to make agents smarter.
## 2. The Core Data Model: The Hyper-Edge
The atomic unit of Episteme is not a Row, Document, or Embedding. It is the **Signed Assertion**.
```rust
struct Assertion {
    // The Proposition (The "What")
    subject: EntityId,         // e.g., "Tesla_Inc"
    predicate: RelationId,     // e.g., "has_annual_revenue"
    object: Value,             // e.g., "$96.7B"

    // The Meta-Cognition (The "Why")
    confidence: f32,           // 0.0 to 1.0 (Agent's subjective certainty)
    source_hash: Hash,         // Content-addressed link to source (PDF, URL, Log)
    visual_hash: Option<Hash>, // pHash for visual anchoring against web drift
    agent_id: PublicKey,       // Who made this claim? (Cryptographic multi-sig)
    timestamp: u64,            // When?

    // The Semantic Vector (The "Meaning")
    vector: Vec<f32>,          // Embedding for semantic navigation

    // The Paradigm (The "Context")
    epoch: Option<EpochId>,    // "covid-guidelines-2020", "gaap-2024"
}
```
## 3. The Query Engine: "Truth Lenses"
Reading is a compute-heavy operation. You must apply a **Lens** to collapse the probabilistic field into a concrete answer. To ensure sub-millisecond latency, Episteme uses **Materialized Views** to pre-calculate the results of standard lenses.
### Standard Lenses
1. **Lens::Consensus:** Returns the value with the highest cluster density (Weighted by Multi-Sig). *Materialized for speed.*
2. **Lens::Authority:** Filters by Agent Reputation (TrustRank).
3. **Lens::Recency:** Returns the latest assertion, ignoring history.
4. **Lens::EpochAware:** Validates assertions against the *current* paradigm, filtering superseded epochs.
5. **Lens::Skeptic:** Returns the *variance* between claims (identifies high-conflict/unstable truth).
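
To make the lens idea concrete, here is a minimal sketch of three of the lenses above applied to competing claims about one subject/predicate pair. It is an illustration under stated assumptions, not the `episteme-lens` API: the `Claim` struct, the `reputation` field, the `apply` function, and the `conflict` score are all hypothetical. Consensus is approximated by summing confidence per distinct value, and `conflict` is a simple stand-in for the variance that `Lens::Skeptic` would report.

```rust
use std::collections::HashMap;

/// Simplified stand-in for an assertion about one (subject, predicate) pair.
/// Field names are illustrative, not the actual schema.
#[derive(Debug, Clone, PartialEq)]
struct Claim {
    value: String,   // the asserted object, e.g. "$96.7B"
    confidence: f32, // agent's subjective certainty, 0.0 to 1.0
    reputation: f32, // hypothetical TrustRank score of the asserting agent
    timestamp: u64,  // when the claim was made
}

/// Three of the standard lenses, modeled as plain enum variants.
enum Lens {
    Consensus,
    Authority { min_reputation: f32 },
    Recency,
}

/// Sum confidence per distinct value (assumed proxy for "cluster density").
fn weights(claims: &[Claim]) -> HashMap<&str, f32> {
    let mut weight: HashMap<&str, f32> = HashMap::new();
    for c in claims {
        *weight.entry(c.value.as_str()).or_insert(0.0) += c.confidence;
    }
    weight
}

/// Collapse competing claims into one value, or None if nothing qualifies.
fn apply(lens: &Lens, claims: &[Claim]) -> Option<String> {
    match lens {
        Lens::Consensus => weights(claims)
            .into_iter()
            .max_by(|a, b| a.1.total_cmp(&b.1))
            .map(|(v, _)| v.to_string()),
        Lens::Authority { min_reputation } => {
            // Keep only claims from sufficiently trusted agents, then take consensus.
            let trusted: Vec<Claim> = claims
                .iter()
                .filter(|c| c.reputation >= *min_reputation)
                .cloned()
                .collect();
            apply(&Lens::Consensus, &trusted)
        }
        Lens::Recency => claims
            .iter()
            .max_by_key(|c| c.timestamp)
            .map(|c| c.value.clone()),
    }
}

/// Skeptic-style disagreement score: 1 minus the share of total confidence
/// held by the leading value. 0.0 means full agreement.
fn conflict(claims: &[Claim]) -> f32 {
    let total: f32 = claims.iter().map(|c| c.confidence).sum();
    if total == 0.0 {
        return 0.0;
    }
    let top = weights(claims).values().cloned().fold(0.0_f32, f32::max);
    1.0 - top / total
}
```

The same set of claims can yield different answers under different lenses, which is the point: the stored assertions stay untouched and the lens decides how they collapse.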
## 4. Features for the Agentive Team
### 4.1. "Forking Reality" (Branching)
Agents need to simulate futures without polluting the main branch. Episteme supports **Copy-on-Write Branching** via Sparse Merkle Trees.
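
The branching semantics can be sketched with a much simpler structure than a Sparse Merkle Tree. The example below uses a plain overlay map: a fork records only its own writes and falls through to its parent on reads. The `Branch` type and its methods are hypothetical, and the sketch omits hashing and proofs entirely; it only shows why a fork is cheap and why writes on a fork never touch the parent.

```rust
use std::collections::HashMap;
use std::rc::Rc;

/// A branch stores only its own writes; everything else is read from the parent.
struct Branch {
    parent: Option<Rc<Branch>>,
    local: HashMap<String, String>,
}

impl Branch {
    /// The main branch has no parent.
    fn root() -> Self {
        Branch { parent: None, local: HashMap::new() }
    }

    /// Forking copies nothing: the child just holds a shared pointer to the parent.
    fn fork(parent: &Rc<Branch>) -> Self {
        Branch { parent: Some(Rc::clone(parent)), local: HashMap::new() }
    }

    /// Writes land in the branch-local map only.
    fn put(&mut self, key: &str, value: &str) {
        self.local.insert(key.to_string(), value.to_string());
    }

    /// Reads check local writes first, then walk up the parent chain.
    fn get(&self, key: &str) -> Option<&str> {
        match self.local.get(key) {
            Some(v) => Some(v.as_str()),
            None => self.parent.as_ref().and_then(|p| p.get(key)),
        }
    }
}
```

An agent would freeze the main branch behind an `Rc`, fork it, and write simulated assertions into the fork. Dropping the fork discards the simulation with no cleanup on the main branch.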
### 4.2. The Ballot Box: High-Velocity Consensus
To avoid write contention, Episteme separates the "Candidate" (Assertion) from the "Votes" (Signatures).
* Agents write **Votes** to a high-speed append-only log ("The Ballot Box").
* A background process aggregates these votes to update the Materialized View.
* This allows thousands of agents to "vote" on a fact simultaneously without locking.
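
The split between vote ingestion and aggregation can be sketched as follows. This is a single-threaded illustration with hypothetical names (`Vote`, `BallotBox`, `Materializer`, `catch_up`); a real deployment would put a concurrent queue or log behind `cast`, which is not shown here. The point is that writers only append, while the aggregator keeps a cursor and folds in whatever arrived since its last run.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone)]
struct Vote {
    assertion_id: u64, // hypothetical id of the candidate assertion
    agent_id: u32,     // stand-in for the voter's public key
    weight: f32,       // hypothetical vote weight, e.g. derived from reputation
}

/// Append-only vote log: entries are never updated in place.
#[derive(Default)]
struct BallotBox {
    log: Vec<Vote>,
}

impl BallotBox {
    fn cast(&mut self, vote: Vote) {
        self.log.push(vote);
    }
}

/// Background aggregator that maintains the materialized tally.
#[derive(Default)]
struct Materializer {
    cursor: usize,            // index of the next unprocessed vote
    tally: HashMap<u64, f32>, // assertion id -> accumulated weight
}

impl Materializer {
    /// Fold in votes appended since the last run; returns how many were processed.
    fn catch_up(&mut self, ballots: &BallotBox) -> usize {
        let new = &ballots.log[self.cursor..];
        for v in new {
            *self.tally.entry(v.assertion_id).or_insert(0.0) += v.weight;
        }
        self.cursor = ballots.log.len();
        new.len()
    }

    /// The assertion currently holding the most accumulated weight.
    fn leader(&self) -> Option<u64> {
        self.tally
            .iter()
            .max_by(|a, b| a.1.total_cmp(b.1))
            .map(|(id, _)| *id)
    }
}
```

Because the tally is derived purely from the log, it can be rebuilt from scratch by resetting the cursor to zero and replaying.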
### 4.3. The Hive: Learning & Trust
Episteme uses **Recursive TrustRank Optimization** to advance the team's collective intelligence.
* **Closed-Loop Learning:** When an Agent's prediction is later resolved by an assertion recording the actual outcome, the delta between prediction and outcome is back-propagated to the Agent's Reputation score.
* **The Simulator (Mid-Training):** A pipeline that converts high-confidence failure logs into **Synthetic Trajectories**, allowing agents to be fine-tuned on their own history (creating a "Memory Adapter" LoRA).
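
As a rough illustration of the closed loop, the function below nudges a reputation score toward an agent's observed accuracy each time a prediction is resolved. The update rule (an exponential moving average), the function name, and the `learning_rate` parameter are assumptions made for this sketch; the document does not specify how Recursive TrustRank Optimization computes the update.

```rust
/// Hypothetical reputation update, assuming:
/// - `predicted` is the agent's stated probability (0.0 to 1.0),
/// - `observed` is the resolved outcome (0.0 or 1.0),
/// - reputation moves a fraction `learning_rate` toward the accuracy
///   of this one prediction.
fn update_reputation(reputation: f32, predicted: f32, observed: f32, learning_rate: f32) -> f32 {
    // Delta between the prediction and what actually happened.
    let error = (predicted - observed).abs().clamp(0.0, 1.0);
    let accuracy = 1.0 - error;
    // Exponential moving average toward the latest accuracy.
    let updated = reputation + learning_rate * (accuracy - reputation);
    updated.clamp(0.0, 1.0)
}
```

With a reputation of 0.5 and a learning rate of 0.1, a confident correct prediction (0.9 against an outcome of 1.0) raises the score to about 0.54, while the same prediction against an outcome of 0.0 lowers it to about 0.46.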
### 4.4. The Meter: Economics of Reasoning
Deep Research is computationally expensive. Episteme enforces **Temporal Advantage Normalization (TAN)**.
* **Budgeting:** Every Job carries a budget (tokens/dollars).
* **Throttling:** The system rejects "Forking Reality" requests if the projected cost exceeds the Value of Information.
* **Efficiency Rewards:** Agents receive positive reinforcement signals for solving problems under budget.
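
The budgeting and throttling rules above can be sketched as a simple admission check. The `Job` struct, the `Decision` enum, and both functions are hypothetical, costs are in arbitrary units, and the sketch does not attempt to model Temporal Advantage Normalization itself. The reward shown is one possible signal (the unspent fraction of the budget), chosen only for illustration.

```rust
#[derive(Debug, PartialEq)]
enum Decision {
    Admit,
    RejectLowValue,   // projected cost exceeds the Value of Information
    RejectOverBudget, // request would push the job past its budget
}

/// A job and its budget, in arbitrary cost units (tokens or dollars).
struct Job {
    budget: f64,
    spent: f64,
}

/// Decide whether a branching request may proceed.
fn admit_fork(job: &Job, projected_cost: f64, value_of_information: f64) -> Decision {
    if projected_cost > value_of_information {
        Decision::RejectLowValue
    } else if job.spent + projected_cost > job.budget {
        Decision::RejectOverBudget
    } else {
        Decision::Admit
    }
}

/// Assumed reward signal: the fraction of budget left unspent, floored at zero.
fn efficiency_reward(job: &Job) -> f64 {
    if job.budget <= 0.0 {
        return 0.0;
    }
    ((job.budget - job.spent) / job.budget).max(0.0)
}
```

Checking value before budget means a request that is not worth its cost is rejected even when the job could afford it.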
## 5. Architecture: The Rust Stack
Episteme follows **"Defensive by Default"** best practices.
### Tier 1: The Spine (Durability)
* **Component:** `episteme-wal` (Quarantine Pattern)
* **Role:** Raw, serialized append-only log. Ensures we never lose a claim.
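
A minimal sketch of how an append-only log can survive a torn write is shown below. The record layout (`[len: u32][checksum: u64][payload]`), the function names, and the in-memory `Vec<u8>` standing in for a file are all assumptions for illustration. The checksum is FNV-1a, picked only because it needs no external crate; the commit notes above mention BLAKE3 for the actual WAL.

```rust
/// FNV-1a 64-bit hash, used here only as a std-only stand-in checksum.
fn checksum(data: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for b in data {
        h ^= *b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

/// Append one record as [len: u32 LE][checksum: u64 LE][payload].
fn append(log: &mut Vec<u8>, payload: &[u8]) {
    log.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    log.extend_from_slice(&checksum(payload).to_le_bytes());
    log.extend_from_slice(payload);
}

/// Replay all intact records, stopping at the first truncated or corrupt one.
fn replay(log: &[u8]) -> Vec<Vec<u8>> {
    const HEADER: usize = 12; // 4-byte length + 8-byte checksum
    let mut records = Vec::new();
    let mut pos = 0;
    while pos + HEADER <= log.len() {
        let mut len_bytes = [0u8; 4];
        len_bytes.copy_from_slice(&log[pos..pos + 4]);
        let mut sum_bytes = [0u8; 8];
        sum_bytes.copy_from_slice(&log[pos + 4..pos + HEADER]);

        let start = pos + HEADER;
        let end = start + u32::from_le_bytes(len_bytes) as usize;
        if end > log.len() {
            break; // torn write: payload was cut short
        }
        let payload = &log[start..end];
        if checksum(payload) != u64::from_le_bytes(sum_bytes) {
            break; // corrupt record: stop rather than trust later bytes
        }
        records.push(payload.to_vec());
        pos = end;
    }
    records
}
```

On recovery, everything before the first bad record is kept and the tail is discarded, so a crash mid-append loses at most the record being written.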
### Tier 2: The Lattice (Graph/Index)
* **Component:** `episteme-core` (Hot/Warm memory)
* **Warm Tier:** `sled` (embedded log-structured key-value store) for the Merkle DAG + `hnsw` for vector search.
* **Ballot Box:** High-velocity stream for vote ingestion.
### Tier 3: The Cortex (Compute)
* **Component:** `episteme-lens`
* **Role:** The WASM runtime for executing Lenses and resolving probabilistic state.
* **Materializer:** Background worker maintaining O(1) read views.
## 6. The Ecosystem Triad
| System | Biological Analogy | Function | Question Answered |
| :--- | :--- | :--- | :--- |
| **LogDB** | **The Spine** | Immutable Event Log | "What happened?" |
| **AssociativeDB** | **The Hippocampus** | Associative Memory | "What is this like?" |
| **Episteme** | **The Cortex** | Structured Reasoning | "Is this true?" |