stemedb/vision.md
jordan 3cfaa1e1d3 feat: Complete Phase 1 (The Spine) - storage foundation
2026-01-31 14:15:34 -07:00

Episteme: The Probabilistic Knowledge Lattice

Internal Codename: StemeDB
Category: Infrastructure / Database
Role: The Cortex (Reasoning & Truth)

1. The Manifesto: "A Marketplace of Truth"

We are building the shared, long-term memory for autonomous research agents.

Current databases (Postgres, Neo4j, Vector DBs) suffer from The Tower of Babel problem: they store Data, not Evidence. They are deterministic, stateless, and brittle.

Episteme rejects the idea of a single, static "database state." Instead, it models knowledge as a Probabilistic Marketplace.

  • Democracy: Truth is established via high-velocity consensus (Voting), not just overwrite privileges.
  • Economics: Reasoning has a cost. The system enforces efficiency via "The Meter."
  • Evolution: The database doesn't just store data; it exports training sets ("The Simulator") to make agents smarter.

2. The Core Data Model: The Hyper-Edge

The atomic unit of Episteme is not a Row, Document, or Embedding. It is the Signed Assertion.

struct Assertion {
    // The Proposition (The "What")
    subject: EntityId,       // e.g., "Tesla_Inc"
    predicate: RelationId,   // e.g., "has_annual_revenue"
    object: Value,           // e.g., "$96.7B"

    // The Meta-Cognition (The "Why")
    confidence: f32,         // 0.0 to 1.0 (Agent's subjective certainty)
    source_hash: Hash,       // Content-addressed link to source (PDF, URL, Log)
    visual_hash: Option<Hash>, // pHash for visual anchoring against web drift
    agent_id: PublicKey,     // Who made this claim? (Cryptographic multi-sig)
    timestamp: u64,          // When?
    
    // The Semantic Vector (The "Meaning")
    vector: Vec<f32>,        // Embedding for semantic navigation

    // The Paradigm (The "Context")
    epoch: Option<EpochId>,  // "covid-guidelines-2020", "gaap-2024"
}
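The vector field is what makes semantic navigation possible. A minimal sketch of how two embeddings might be compared, assuming cosine similarity and a brute-force scan (the function names are hypothetical, and a real deployment would likely use an approximate index instead of a linear search):

```rust
/// Cosine similarity between two embeddings.
/// Returns None when the lengths differ, the inputs are empty,
/// or either vector has zero magnitude.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}

/// Index of the candidate most similar to `query` (hypothetical helper).
/// Candidates that cannot be compared are skipped.
pub fn nearest(query: &[f32], candidates: &[Vec<f32>]) -> Option<usize> {
    candidates
        .iter()
        .enumerate()
        .filter_map(|(i, c)| cosine_similarity(query, c).map(|s| (i, s)))
        .max_by(|a, b| a.1.total_cmp(&b.1))
        .map(|(i, _)| i)
}
```

Calling `nearest` with the embedding of a new assertion would return the position of the stored assertion closest in meaning, which is one plausible entry point for "semantic navigation."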

3. The Query Engine: "Truth Lenses"

Every read must apply a Lens to collapse the probabilistic field into a concrete answer, which makes reading a compute-heavy operation. To keep latency sub-millisecond for the standard lenses, Episteme pre-calculates their results as Materialized Views.

Standard Lenses

  1. Lens::Consensus: Returns the value with the highest cluster density (Weighted by Multi-Sig). Materialized for speed.
  2. Lens::Authority: Filters by Agent Reputation (TrustRank).
  3. Lens::Recency: Returns the latest assertion, ignoring history.
  4. Lens::EpochAware: Validates assertions against the current paradigm, filtering superseded epochs.
  5. Lens::Skeptic: Returns the variance between claims (identifies high-conflict/unstable truth).
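As a rough illustration of how three of these lenses could collapse competing claims, here is a minimal sketch. It is a simplification under stated assumptions: `Claim`, `Answer`, and `apply` are hypothetical names, values are plain integers, Consensus groups exact matches and weights them by confidence (rather than cluster density weighted by multi-sig), and Skeptic uses population variance.

```rust
use std::collections::HashMap;

/// Simplified stand-in for an Assertion about one (subject, predicate) pair.
#[derive(Debug, Clone, PartialEq)]
pub struct Claim {
    pub value: i64,      // claimed object, e.g. revenue in millions of dollars
    pub confidence: f32, // 0.0 to 1.0
    pub timestamp: u64,  // when the claim was made
}

#[derive(Debug, Clone, Copy)]
pub enum Lens {
    Consensus,
    Recency,
    Skeptic,
}

#[derive(Debug, Clone, PartialEq)]
pub enum Answer {
    Value(i64),
    Variance(f64),
}

/// Collapse a set of claims into one answer. Returns None for an empty set.
pub fn apply(lens: Lens, claims: &[Claim]) -> Option<Answer> {
    if claims.is_empty() {
        return None;
    }
    match lens {
        // Sum confidence per distinct value and pick the heaviest value.
        // Ties go to the smaller value so the result is deterministic.
        Lens::Consensus => {
            let mut weights: HashMap<i64, f32> = HashMap::new();
            for c in claims {
                *weights.entry(c.value).or_insert(0.0) += c.confidence;
            }
            weights
                .into_iter()
                .max_by(|a, b| a.1.total_cmp(&b.1).then(b.0.cmp(&a.0)))
                .map(|(value, _)| Answer::Value(value))
        }
        // Latest claim wins, regardless of confidence.
        Lens::Recency => claims
            .iter()
            .max_by_key(|c| c.timestamp)
            .map(|c| Answer::Value(c.value)),
        // Population variance of the claimed values.
        Lens::Skeptic => {
            let n = claims.len() as f64;
            let mean = claims.iter().map(|c| c.value as f64).sum::<f64>() / n;
            let variance = claims
                .iter()
                .map(|c| (c.value as f64 - mean).powi(2))
                .sum::<f64>()
                / n;
            Some(Answer::Variance(variance))
        }
    }
}
```

The same claim set can yield different answers under different lenses, which is the point: two agents agreeing on 96,700 outweigh one newer claim of 97,000 under Consensus, while Recency returns the newer claim.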

4. Features for the Agentive Team

4.1. "Forking Reality" (Branching)

Agents need to simulate futures without polluting the main branch. Episteme supports Copy-on-Write Branching via Sparse Merkle Trees.
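The copy-on-write idea can be sketched without the Merkle machinery. The example below is not a Sparse Merkle Tree; it swaps in a plain overlay map purely to show the semantics: a fork copies nothing, writes stay local to the branch, and reads fall through to the parent. `Branch` and its methods are hypothetical names.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// A branch holds only its own writes and defers to its parent for the rest.
pub struct Branch {
    parent: Option<Arc<Branch>>,
    local: HashMap<String, String>,
}

impl Branch {
    /// An empty root branch with no parent.
    pub fn root() -> Self {
        Branch { parent: None, local: HashMap::new() }
    }

    /// Fork a child branch. No data is copied; the child shares the parent.
    pub fn fork(parent: &Arc<Branch>) -> Self {
        Branch { parent: Some(Arc::clone(parent)), local: HashMap::new() }
    }

    /// Write to this branch only. The parent is never modified.
    pub fn put(&mut self, key: &str, value: &str) {
        self.local.insert(key.to_string(), value.to_string());
    }

    /// Read the local value if present, otherwise walk up the parent chain.
    pub fn get(&self, key: &str) -> Option<&str> {
        match self.local.get(key) {
            Some(v) => Some(v.as_str()),
            None => self.parent.as_ref().and_then(|p| p.get(key)),
        }
    }
}
```

An agent simulating a future would fork the main branch, write its speculative assertions into the fork, and discard the fork afterwards; the main branch never sees those writes.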

4.2. The Ballot Box: High-Velocity Consensus

To avoid write contention, Episteme separates the "Candidate" (Assertion) from the "Votes" (Signatures).

  • Agents write Votes to a high-speed append-only log ("The Ballot Box").
  • A background process aggregates these votes to update the Materialized View.
  • This allows thousands of agents to "vote" on a fact simultaneously without locking.
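The split between vote ingestion and aggregation might look roughly like the sketch below. All names (`Vote`, `BallotBox`, `Tally`) are hypothetical, the log lives in memory, and a `Mutex` held only for the duration of an append stands in for whatever lock-free or partitioned log a real implementation would use.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

#[derive(Debug, Clone)]
pub struct Vote {
    pub assertion_id: u64, // hypothetical id of the candidate assertion
    pub agent_id: u32,     // stand-in for the signer's public key
    pub approve: bool,
}

/// Append-only vote log. Writers only push; they never touch the view.
#[derive(Default)]
pub struct BallotBox {
    log: Mutex<Vec<Vote>>,
}

impl BallotBox {
    /// Append one vote to the end of the log.
    pub fn cast(&self, vote: Vote) {
        self.log.lock().unwrap().push(vote);
    }

    /// Copy every vote at or after `cursor` for the aggregator to process.
    pub fn read_from(&self, cursor: usize) -> Vec<Vote> {
        let log = self.log.lock().unwrap();
        log[cursor.min(log.len())..].to_vec()
    }
}

/// Materialized view: net approvals per assertion, plus the log position
/// up to which votes have already been counted.
#[derive(Default)]
pub struct Tally {
    cursor: usize,
    pub net: HashMap<u64, i64>,
}

impl Tally {
    /// Fold any new votes into the view. Safe to call repeatedly:
    /// votes before the cursor are never counted twice.
    pub fn catch_up(&mut self, ballots: &BallotBox) {
        let batch = ballots.read_from(self.cursor);
        self.cursor += batch.len();
        for vote in batch {
            let delta = if vote.approve { 1 } else { -1 };
            *self.net.entry(vote.assertion_id).or_insert(0) += delta;
        }
    }
}
```

Agents call `cast` from any thread, and a background worker calls `catch_up` on a timer. Readers consult `Tally::net` and never scan the raw log.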

4.3. The Hive: Learning & Trust

Episteme uses Recursive TrustRank Optimization to advance the team's collective intelligence.

  • Closed-Loop Learning: When an Agent's prediction is later resolved by an assertion recording what actually happened, the delta between the two is back-propagated to the Agent's Reputation score.
  • The Simulator (Mid-Training): A pipeline that converts high-confidence failure logs into Synthetic Trajectories, allowing agents to be fine-tuned on their own history (creating a "Memory Adapter" LoRA).
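The closed-loop update can be illustrated with a small sketch. The rule below is an assumption made for illustration, not the TrustRank algorithm itself: it scores each resolved prediction by relative error and moves the reputation toward that score with an exponential moving average. `update_reputation` and `learning_rate` are hypothetical names.

```rust
/// Move an agent's reputation (0.0 to 1.0) toward the accuracy of its
/// latest resolved prediction.
///
/// accuracy = 1 - min(1, |predicted - observed| / |observed|)
/// reputation' = reputation + learning_rate * (accuracy - reputation)
pub fn update_reputation(
    reputation: f64,
    predicted: f64,
    observed: f64,
    learning_rate: f64,
) -> f64 {
    // Guard against division by zero when the observed value is 0.
    let scale = observed.abs().max(f64::EPSILON);
    let relative_error = ((predicted - observed).abs() / scale).min(1.0);
    let accuracy = 1.0 - relative_error;
    (reputation + learning_rate * (accuracy - reputation)).clamp(0.0, 1.0)
}
```

With a learning rate of 0.1, a perfect prediction moves a reputation of 0.5 up to 0.55, and a prediction that is off by 100% or more moves it down to 0.45.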

4.4. The Meter: Economics of Reasoning

Deep Research is computationally expensive. Episteme enforces Temporal Advantage Normalization (TAN).

  • Budgeting: Every Job carries a budget (tokens/dollars).
  • Throttling: The system rejects "Forking Reality" (branch) requests if the projected cost exceeds the Value of Information.
  • Efficiency Rewards: Agents receive positive reinforcement signals for solving problems under budget.
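A minimal sketch of the admission check and the reward signal, under stated assumptions: costs and values are in dollars, the budget check is an added assumption (the text only names the Value of Information test), and the reward is simply the unspent fraction of the budget. `Job`, `Admission`, `admit_fork`, and `efficiency_reward` are hypothetical names.

```rust
#[derive(Debug, PartialEq)]
pub enum Admission {
    Admit,
    RejectOverBudget,
    RejectLowValue,
}

pub struct Job {
    pub budget_remaining: f64, // dollars left on this job
}

/// Decide whether a branch request may proceed.
pub fn admit_fork(job: &Job, projected_cost: f64, value_of_information: f64) -> Admission {
    if projected_cost > job.budget_remaining {
        // Assumption: a job can never spend more than it carries.
        Admission::RejectOverBudget
    } else if projected_cost > value_of_information {
        // The rule from the text: cost must not exceed expected value.
        Admission::RejectLowValue
    } else {
        Admission::Admit
    }
}

/// Reward in [0.0, 1.0]: the fraction of the budget left unspent.
/// Overspending or a non-positive budget earns no reward.
pub fn efficiency_reward(budget: f64, spent: f64) -> f64 {
    if budget <= 0.0 {
        return 0.0;
    }
    ((budget - spent) / budget).clamp(0.0, 1.0)
}
```

How the Value of Information is estimated is left open here; the sketch only shows where that estimate would plug in.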

5. Architecture: The Rust Stack

Episteme follows "Defensive by Default" best practices.

Tier 1: The Spine (Durability)

  • Component: episteme-wal (Quarantine Pattern)
  • Role: Raw, serialized append-only log. Ensures we never lose a claim.
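One way an append-only log can avoid losing or misreading a claim is to frame each record with its length and a checksum, then stop replay at the first record that fails verification. The sketch below assumes an in-memory buffer instead of a file, a hypothetical 12-byte header, and FNV-1a as a dependency-free stand-in for a production-grade checksum.

```rust
/// FNV-1a 64-bit hash, used here only as a stand-in checksum.
fn fnv1a(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325;
    for b in bytes {
        hash ^= *b as u64;
        hash = hash.wrapping_mul(0x100000001b3);
    }
    hash
}

/// Header layout (hypothetical): 4-byte length + 8-byte checksum, little-endian.
const HEADER_LEN: usize = 12;

/// Append one record: header first, then the payload bytes.
pub fn append(log: &mut Vec<u8>, payload: &[u8]) {
    log.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    log.extend_from_slice(&fnv1a(payload).to_le_bytes());
    log.extend_from_slice(payload);
}

/// Replay records from the start of the log. Stops at the first truncated
/// or corrupted record, so a torn write at the tail is ignored rather
/// than misread.
pub fn replay(log: &[u8]) -> Vec<Vec<u8>> {
    let mut records = Vec::new();
    let mut pos = 0;
    while log.len() - pos >= HEADER_LEN {
        let mut len_bytes = [0u8; 4];
        len_bytes.copy_from_slice(&log[pos..pos + 4]);
        let mut sum_bytes = [0u8; 8];
        sum_bytes.copy_from_slice(&log[pos + 4..pos + 12]);
        let len = u32::from_le_bytes(len_bytes) as usize;
        let expected = u64::from_le_bytes(sum_bytes);

        let start = pos + HEADER_LEN;
        if log.len() - start < len {
            break; // payload cut short: torn write
        }
        let payload = &log[start..start + len];
        if fnv1a(payload) != expected {
            break; // checksum mismatch: corrupted record
        }
        records.push(payload.to_vec());
        pos = start + len;
    }
    records
}
```

A file-backed version would also need to flush to disk before acknowledging a write; that step is omitted here to keep the sketch self-contained.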

Tier 2: The Lattice (Graph/Index)

  • Component: episteme-core (Hot/Warm memory)
  • Warm Tier: sled (embedded log-structured store with a B+ tree-like index, not a classic LSM tree) for the Merkle DAG + hnsw for vector search.
  • Ballot Box: High-velocity stream for vote ingestion.

Tier 3: The Cortex (Compute)

  • Component: episteme-lens
  • Role: The WASM runtime for executing Lenses and resolving probabilistic state.
  • Materializer: Background worker maintaining O(1) read views.

6. The Ecosystem Triad

| System | Biological Analogy | Function | Question Answered |
| --- | --- | --- | --- |
| LogDB | The Spine | Immutable Event Log | "What happened?" |
| AssociativeDB | The Hippocampus | Associative Memory | "What is this like?" |
| Episteme | The Cortex | Structured Reasoning | "Is this true?" |